# Guided Project: API and Web Data Scraping
## Part 1: API
- I started the project by browsing the [Public APIs](https://github.com/toddmotto/public-apis) list and selected the Twitter API
- I went to Twitter and created a developer account
- I set up a new app in Twitter and got my credentials
- To keep my credentials safe, I created the following files:
  - a file called .env in Visual Studio Code, where I assigned the tokens to variables
  - a file called .gitignore in Visual Studio Code, where I listed the .env file so that the tokens are not uploaded to GitHub
  - a file called loadCredentials.py to read the .env file
- With these files in place, I imported tweepy and loaded my credentials into the Jupyter notebook

In [1]:
import tweepy

In [2]:
from loadCredentials import loadCredentials

cred = loadCredentials(["TWITTER_API_KEY","TWITTER_API_SECRET","TWITTER_ACCESS_TOKEN","TWITTER_ACCESS_TOKEN_SECRET"])
auth = tweepy.OAuthHandler(cred["TWITTER_API_KEY"], cred["TWITTER_API_SECRET"])
auth.set_access_token(cred["TWITTER_ACCESS_TOKEN"], cred["TWITTER_ACCESS_TOKEN_SECRET"])
api = tweepy.API(auth)
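
The loadCredentials helper imported above is not shown in the notebook. The following is a minimal sketch of what such a helper might look like, assuming the .env file holds one KEY=VALUE pair per line. The function name `load_credentials`, the `env_path` parameter, and the placeholder values are hypothetical, not the author's actual code:

```python
import tempfile
from pathlib import Path


def load_credentials(names, env_path=".env"):
    """Return {name: value} for the requested names from a KEY=VALUE file.

    Hypothetical stand-in for the notebook's loadCredentials helper.
    """
    values = {}
    for line in Path(env_path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value
        values[key.strip()] = value.strip().strip("'\"")
    missing = [name for name in names if name not in values]
    if missing:
        raise KeyError(f"Missing credentials in {env_path}: {missing}")
    return {name: values[name] for name in names}


# Demo with a throwaway file and placeholder values (not real tokens)
with tempfile.TemporaryDirectory() as tmp:
    env_file = Path(tmp) / ".env"
    env_file.write_text(
        "# Twitter credentials\n"
        "TWITTER_API_KEY=placeholder-key\n"
        'TWITTER_API_SECRET="placeholder-secret"\n',
        encoding="utf-8",
    )
    demo = load_credentials(["TWITTER_API_KEY", "TWITTER_API_SECRET"], env_file)

print(demo)
```

A .gitignore containing a single line with .env is what keeps the real file out of the repository.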

- I queried the Twitter API for my personal account information using the me method
- I imported pandas and json_normalize
- I created a DataFrame from the information in my account

In [3]:
mytw = api.me()

In [184]:
import pandas as pd
from pandas.io.json import json_normalize

In [5]:
mytw = api.me()
mytwit = pd.DataFrame([pd.Series(mytw._json)])
mytwit

|  | id | id_str | name | screen_name | location | profile_location | description | url | entities | protected | ... | profile_use_background_image | has_extended_profile | default_profile | default_profile_image | following | follow_request_sent | notifications | translator_type | suspended | needs_phone_verification |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 360391229 | 360391229 | Maris Font | marisfont | Miami |  |  |  | `{'description': {'urls': []}}` | True | ... | True | False | True | False | False | False | False | none | False | False |
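
Some columns in the output above, such as entities, hold nested dictionaries. The following is a minimal sketch of flattening such a record into dotted column names, similar in spirit to what pandas' json_normalize (imported above) does. The `flatten` helper is hypothetical, and the sample record copies only a few fields from the output above:

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into a single dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dictionaries, carrying the key prefix along
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# A few fields copied from the account record shown above
sample = {
    "id": 360391229,
    "screen_name": "marisfont",
    "entities": {"description": {"urls": []}},
    "protected": True,
}

flat_sample = flatten(sample)
print(flat_sample)
```

Each flattened key can then serve as its own DataFrame column instead of a single column of dictionaries.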


- I queried the Twitter API for tweets containing "friyay" using the search method
- I created a DataFrame of the tweets that contain "friyay"
- Since the DataFrame was huge, I printed all the column names to figure out what to work with

In [58]:
friyay = api.search("friyay")

In [59]:
tweets = pd.DataFrame([pd.Series(tweet._json) for tweet in friyay])
print(type(tweets))
tweets.head()

<class 'pandas.core.frame.DataFrame'>


|  | created_at | id | id_str | text | truncated | entities | metadata | source | in_reply_to_status_id | in_reply_to_status_id_str | ... | contributors | retweeted_status | is_quote_status | retweet_count | favorite_count | favorited | retweeted | lang | possibly_sensitive | extended_entities |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Thu Nov 01 17:19:34 +0000 2018 | 1058045898456997889 | 1058045898456997889 | RT @BrandedBills: By popular demand 😉\n\n🚨 HAT... | False | `{'hashtags': [], 'symbols': [], 'user_mentions...` | `{'iso_language_code': 'en', 'result_type': 're...` | `<a href="https://www.example.com" rel="nofollo...` |  |  | ... |  | `{'created_at': 'Fri Oct 26 18:30:38 +0000 2018...` | False | 282 | 0 | False | False | en |  |  |
| 1 | Thu Nov 01 17:19:27 +0000 2018 | 1058045870963351552 | 1058045870963351552 | RT @BrandedBills: By popular demand 😉\n\n🚨 HAT... | False | `{'hashtags': [], 'symbols': [], 'user_mentions...` | `{'iso_language_code': 'en', 'result_type': 're...` | `<a href="https://example.com" rel="nofollow">T...` |  |  | ... |  | `{'created_at': 'Fri Oct 26 18:30:38 +0000 2018...` | False | 282 | 0 | False | False | en |  |  |
| 2 | Thu Nov 01 17:19:01 +0000 2018 | 1058045761202589696 | 1058045761202589696 | Have a clip of @robertwhitejoke's unforgettabl... | True | `{'hashtags': [{'text': 'bgt', 'indices': [68, ...` | `{'iso_language_code': 'en', 'result_type': 're...` | `<a href="https://buffer.com" rel="nofollow">Bu...` |  |  | ... |  |  | False | 0 | 0 | False | False | en | False |  |
| 3 | Thu Nov 01 17:18:58 +0000 2018 | 1058045748187664385 | 1058045748187664385 | RT @BrandedBills: By popular demand 😉\n\n🚨 HAT... | False | `{'hashtags': [], 'symbols': [], 'user_mentions...` | `{'iso_language_code': 'en', 'result_type': 're...` | `<a href="http://twitter.com/download/iphone" r...` |  |  | ... |  | `{'created_at': 'Fri Oct 26 18:30:38 +0000 2018...` | False | 282 | 0 | False | False | en |  |  |
| 4 | Thu Nov 01 17:14:54 +0000 2018 | 1058044722931855360 | 1058044722931855360 | RT @BrandedBills: By popular demand 😉\n\n🚨 HAT... | False | `{'hashtags': [], 'symbols': [], 'user_mentions...` | `{'iso_language_code': 'en', 'result_type': 're...` | `<a href="http://twitter.com" rel="nofollow">Tw...` |  |  | ... |  | `{'created_at': 'Fri Oct 26 18:30:38 +0000 2018...` | False | 282 | 0 | False | False | en |  |  |


In [19]:
tweets.columns

Index(['created_at', 'id', 'id_str', 'text', 'truncated', 'entities',
       'metadata', 'source', 'in_reply_to_status_id',
       'in_reply_to_status_id_str', 'in_reply_to_user_id',
       'in_reply_to_user_id_str', 'in_reply_to_screen_name', 'user', 'geo',
       'coordinates', 'place', 'contributors', 'retweeted_status',
       'is_quote_status', 'retweet_count', 'favorite_count', 'favorited',
       'retweeted', 'lang', 'possibly_sensitive', 'extended_entities'],
      dtype='object')

- I created a second DataFrame with only the id, text, and retweet_count columns to see which tweets were the most popular
- Then I exported the output as a .csv file

In [51]:
tweets_final = tweets[['id','text','retweet_count']]
print(type(tweets_final))
tweets_final

<class 'pandas.core.frame.DataFrame'>


|  | id | text | retweet_count |
|---|---|---|---|
| 0 | 1058040636715229184 | RT @myLondis: 5 more days till #Halloween! Wha... | 1412 |
| 1 | 1058039701909762048 | &amp; yes my weekend starts on friyay | 0 |
| 2 | 1058039625279770624 | RT @realDonaldTrFan: Some IDIOT left a box of ... | 2262 |
| 3 | 1058039539615387650 | RT @squatterant: 14th November -Waterstones Go... | 5 |
| 4 | 1058039368617619457 | Tomorrow’s Friday. \n#fridayfeeling #friyay #w... | 0 |
| 5 | 1058039289211219968 | RT @BrandedBills: By popular demand 😉\n\n🚨 HAT... | 279 |
| 6 | 1058038958553206784 | RT @VonHausUS: 🎃We've got a spook-tacular #Giv... | 220 |
| 7 | 1058038696736423937 | RT @myLondis: 5 more days till #Halloween! Wha... | 1412 |
| 8 | 1058038419123855360 | RT @squatterant: 14th November -Waterstones Go... | 5 |
| 9 | 1058038378007007236 | Tomorrow is friyay!!! | 0 |
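
Several rows above are retweets of the same tweet, so the same text and retweet_count appear more than once. The following is a minimal sketch of ranking the unique tweets by popularity, using only the standard library and a few rows copied from the output above. In pandas, the rough equivalent would be drop_duplicates on text followed by sort_values on retweet_count:

```python
# A few rows copied from the output above (text is truncated as displayed)
rows = [
    {"id": 1058040636715229184, "text": "RT @myLondis: 5 more days till #Halloween! Wha...", "retweet_count": 1412},
    {"id": 1058039701909762048, "text": "&amp; yes my weekend starts on friyay", "retweet_count": 0},
    {"id": 1058039539615387650, "text": "RT @squatterant: 14th November -Waterstones Go...", "retweet_count": 5},
    {"id": 1058038696736423937, "text": "RT @myLondis: 5 more days till #Halloween! Wha...", "retweet_count": 1412},
    {"id": 1058038419123855360, "text": "RT @squatterant: 14th November -Waterstones Go...", "retweet_count": 5},
    {"id": 1058038378007007236, "text": "Tomorrow is friyay!!!", "retweet_count": 0},
]

# Keep only the first occurrence of each tweet text
seen = set()
unique_rows = []
for row in rows:
    if row["text"] not in seen:
        seen.add(row["text"])
        unique_rows.append(row)

# Most retweeted first
most_popular = sorted(unique_rows, key=lambda r: r["retweet_count"], reverse=True)

for row in most_popular:
    print(row["retweet_count"], row["text"])
```

Deduplicating before sorting avoids counting one viral tweet several times just because many people retweeted it.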


In [57]:
# tweets_final.to_csv('output/API.csv', index=False)

## Part 2: Web Data Scraping
- Scrape the HTML from the chosen page
- Parse the HTML to extract the necessary information
- Save the results to a text (.txt) file if they are text, or to a CSV file if they are tabular data

I selected an article from Medium, [World’s Best 10 Countries to Launch a Fintech Startup](https://medium.com/datadriveninvestor/worlds-best-10-cities-to-launch-a-fintech-startup-a3a1c739e04c). From this article I wanted to extract the title and the 10 countries. I chose it because it covers an industry I really like.
- First I imported the requests library

In [1]:
import requests

- I then specified the URL of the page I wanted to scrape and used the get function of the requests library, together with the content attribute of the response, to retrieve the raw HTML

In [7]:
url = 'https://medium.com/datadriveninvestor/worlds-best-10-cities-to-launch-a-fintech-startup-a3a1c739e04c'
html = requests.get(url).content
html[0:700]

b'<!DOCTYPE html><html xmlns:cc="http://creativecommons.org/ns#"><head prefix="og: http://ogp.me/ns# fb: http://ogp.me/ns/fb# medium-com: http://ogp.me/ns/fb/medium-com#"><meta http-equiv="Content-Type" content="text/html; charset=utf-8"><meta name="viewport" content="width=device-width, initial-scale=1.0, viewport-fit=contain"><title>World\xe2\x80\x99s Best 10 Countries to Launch a Fintech Startup</title><link rel="canonical" href="https://medium.com/datadriveninvestor/worlds-best-10-cities-to-launch-a-fintech-startup-a3a1c739e04c"><meta name="title" content="World\xe2\x80\x99s Best 10 Countries to Launch a Fintech Startup"><meta name="referrer" content="unsafe-url"><meta name="description" content="FinTech st'
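
The output above starts with b'...' because content returns raw bytes, which is why the apostrophe in the title shows up as \xe2\x80\x99. The following is a minimal sketch of decoding such bytes as UTF-8 (the charset declared in the page's meta tag), using a fragment copied from the output above:

```python
# Fragment of the raw bytes shown in the output above
raw_title = b"World\xe2\x80\x99s Best 10 Countries to Launch a Fintech Startup"

# Decode using the charset declared in the page's <meta> tag
title_text = raw_title.decode("utf-8")

print(title_text)
# The three-byte sequence collapses into a single character
print(len(raw_title) - len(title_text))
```

With requests, the text attribute of the response performs a comparable decoding step, using the encoding the library determines for the response.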

- I imported the BeautifulSoup library to read the raw HTML and parse the information I wanted
- I went to the website itself and used the browser's Inspect Element tool to identify the elements that contained the main header and the names of the 10 countries. They turned out to be h1 and h3, so I proceeded to extract all the text contained within these header tags.

In [11]:
from bs4 import BeautifulSoup

In [12]:
soup = BeautifulSoup(html,"lxml")
soup

<!DOCTYPE html>
<html xmlns:cc="http://creativecommons.org/ns#"><head prefix="og: http://ogp.me/ns# fb: http://ogp.me/ns/fb# medium-com: http://ogp.me/ns/fb/medium-com#"><meta content="text/html; charset=utf-8" http-equiv="Content-Type"/><meta content="width=device-width, initial-scale=1.0, viewport-fit=contain" name="viewport"/><title>World’s Best 10 Countries to Launch a Fintech Startup</title><link href="https://medium.com/datadriveninvestor/worlds-best-10-cities-to-launch-a-fintech-startup-a3a1c739e04c" rel="canonical"/><meta content="World’s Best 10 Countries to Launch a Fintech Startup" name="title"/><meta content="unsafe-url" name="referrer"/><meta content="FinTech startups have become one of the hottest booms in the competitive business world of entrepreneurship with the revolutionizing of financial services. Merely a decade ago, Fin-techs acted as a…" name="description"/><meta content="#000000" name="theme-color"/><meta content="World’s Best 10 Countries to Launch a Fintech St

In [179]:
title = [e.text for e in soup.select('h1')]
countries = [e.text for e in soup.find_all('h3')][0:10]
print(type(title))
display(title)
print(type(countries))
display(countries)

<class 'list'>


['World’s Best 10 Countries to Launch a Fintech\xa0Startup']

<class 'list'>


['1. New\xa0Zealand',
 '2. Sweden',
 '3. Denmark',
 '4. United\xa0Kingdom',
 '5. Singapore',
 '6. Canada',
 '7. The Netherlands',
 '8. Ireland',
 '9. Switzerland',
 '10. Hong\xa0Kong']
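
To illustrate what the two selections above do, here is a minimal sketch that collects heading text with the standard library's html.parser instead of BeautifulSoup. The `HeadingCollector` class and the sample HTML are hypothetical; the real Medium page is far larger and more deeply nested:

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect the text inside the given heading tags, grouped by tag name."""

    def __init__(self, tags=("h1", "h3")):
        super().__init__()
        self.tags = set(tags)
        self.found = {tag: [] for tag in tags}
        self._current = None  # heading tag we are currently inside, if any

    def handle_starttag(self, tag, attrs):
        if tag in self.tags:
            self._current = tag
            self.found[tag].append("")

    def handle_data(self, data):
        if self._current is not None:
            # Append text to the most recently opened heading
            self.found[self._current][-1] += data

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None


# Made-up HTML that mimics the structure of the article
sample_html = """
<h1>World's Best 10 Countries to Launch a Fintech Startup</h1>
<p>Intro paragraph.</p>
<h3>1. New Zealand</h3>
<h3>2. Sweden</h3>
"""

collector = HeadingCollector()
collector.feed(sample_html)

sample_title = collector.found["h1"]
sample_countries = collector.found["h3"][0:10]  # same slice as in the notebook
print(sample_title)
print(sample_countries)
```

The slice keeps at most the first ten h3 entries, which matters on the real page if further h3 headings follow the country list.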

In [182]:
def ranking(countries):
    # Split each "N. Country" string into a [rank, name] pair
    return [country.split('.') for country in countries]


<function __main__.ranking(countries)>

- I created a for loop to split each element of the countries list at the period, building a nested list called ranking
- I proceeded to create a DataFrame with pandas for the ranking of the countries

In [212]:
ranking = []
for country in countries:
    ranking.append(country.split("."))

print(type(ranking))
ranking

<class 'list'>


[['1', ' New\xa0Zealand'],
 ['2', ' Sweden'],
 ['3', ' Denmark'],
 ['4', ' United\xa0Kingdom'],
 ['5', ' Singapore'],
 ['6', ' Canada'],
 ['7', ' The Netherlands'],
 ['8', ' Ireland'],
 ['9', ' Switzerland'],
 ['10', ' Hong\xa0Kong']]
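
The pairs above still carry a leading space and, in some names, a non-breaking space (\xa0), and the rank is stored as a string. The following is a minimal sketch of cleaning these up while splitting, using the scraped values shown earlier; the name `clean_ranking` is hypothetical:

```python
countries = [
    "1. New\xa0Zealand",
    "2. Sweden",
    "3. Denmark",
    "4. United\xa0Kingdom",
    "5. Singapore",
    "6. Canada",
    "7. The Netherlands",
    "8. Ireland",
    "9. Switzerland",
    "10. Hong\xa0Kong",
]


def clean_ranking(items):
    """Turn 'N. Country' strings into (rank, country) tuples."""
    cleaned = []
    for item in items:
        # Split only on the first period so names containing '.' stay intact
        rank, name = item.split(".", 1)
        # Replace non-breaking spaces and trim surrounding whitespace
        name = name.replace("\xa0", " ").strip()
        cleaned.append((int(rank), name))
    return cleaned


ranking_clean = clean_ranking(countries)
print(ranking_clean)
```

Passing a list like `ranking_clean` to pd.DataFrame with the same two column names should give an integer Ranking column and country names without stray whitespace.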

In [214]:
df = pd.DataFrame(ranking)
df.columns = ['Ranking', 'Country']
print(type(df))
df

<class 'pandas.core.frame.DataFrame'>


|  | Ranking | Country |
|---|---|---|
| 0 | 1 | New Zealand |
| 1 | 2 | Sweden |
| 2 | 3 | Denmark |
| 3 | 4 | United Kingdom |
| 4 | 5 | Singapore |
| 5 | 6 | Canada |
| 6 | 7 | The Netherlands |
| 7 | 8 | Ireland |
| 8 | 9 | Switzerland |
| 9 | 10 | Hong Kong |


In [215]:
df.to_csv('output/scraping.csv', index=False)
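
One caveat with the export above: to_csv does not create missing directories, so the output folder must exist beforehand. The following is a minimal sketch of creating the folder first and writing the same two columns with the standard library's csv module. It writes into a temporary directory here (an assumption made so the example does not touch the project's files):

```python
import csv
import tempfile
from pathlib import Path

# First rows of the ranking table above
rows = [(1, "New Zealand"), (2, "Sweden"), (3, "Denmark")]

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "output"
    # Create the folder (and any parents) if it is not there yet
    out_dir.mkdir(parents=True, exist_ok=True)

    out_file = out_dir / "scraping.csv"
    with out_file.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Ranking", "Country"])
        writer.writerows(rows)

    # Read the file back to confirm what was written
    written = out_file.read_text(encoding="utf-8").splitlines()

print(written)
```

In the notebook itself, the same mkdir call on Path('output') before df.to_csv would be enough.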