# Guided Project: API and Web Data Scraping
## Part 1: API
- I started the project by looking for an API in [Public APIs](https://github.com/toddmotto/public-apis) and selected the Twitter API
- I went to Twitter and created a developer account
- I set up a new app in Twitter and got my credentials
- To keep my credentials safe, I created the following files:
  - a file called .env in Visual Studio Code, where I assigned the tokens to variables
  - a file called .gitignore in Visual Studio Code, where I listed the .env file so that the tokens are not uploaded to GitHub
  - a file called loadCredentials.py to read the .env file
- With these files in place, I imported tweepy and loaded my credentials into the Jupyter notebook

In [1]:
import tweepy

In [2]:
from loadCredentials import loadCredentials

cred = loadCredentials(["TWITTER_API_KEY","TWITTER_API_SECRET","TWITTER_ACCESS_TOKEN","TWITTER_ACCESS_TOKEN_SECRET"])
auth = tweepy.OAuthHandler(cred["TWITTER_API_KEY"], cred["TWITTER_API_SECRET"])
auth.set_access_token(cred["TWITTER_ACCESS_TOKEN"], cred["TWITTER_ACCESS_TOKEN_SECRET"])
api = tweepy.API(auth)
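
The loadCredentials.py helper itself is not shown in this notebook. As a rough idea of what such a helper could look like, here is a minimal sketch that uses only the standard library. The names `parse_env_file` and `load_credentials` and the `path` parameter are hypothetical, and the sketch assumes the .env file holds simple `KEY=VALUE` lines; the real helper may work differently.

```python
import os


def parse_env_file(path=".env"):
    """Parse KEY=VALUE lines from a dotenv-style file into a dict."""
    values = {}
    if not os.path.exists(path):
        return values
    with open(path, encoding="utf-8") as fh:
        for raw in fh:
            line = raw.strip()
            # Skip blank lines, comments, and lines without an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Drop surrounding whitespace and optional quotes around the value.
            values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_credentials(names, path=".env"):
    """Return {name: value} for the requested names (hypothetical helper).

    Values come from the .env file first, then from the process environment.
    A KeyError is raised if a requested name is found in neither place.
    """
    file_values = parse_env_file(path)
    creds = {}
    for name in names:
        value = file_values.get(name, os.environ.get(name))
        if value is None:
            raise KeyError(f"Missing credential: {name}")
        creds[name] = value
    return creds
```

Called as `load_credentials(["TWITTER_API_KEY", "TWITTER_API_SECRET"])`, this would return a dictionary that can be indexed the same way as `cred` above. Failing loudly on a missing name makes a misspelled variable in .env easier to spot than a silent `None`.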

### Simple Consultation
- I queried the Twitter API for my personal account information using the me method
- I imported pandas and json_normalize
- I created a data frame from the information in my account

In [3]:
mytw = api.me()

In [4]:
import pandas as pd
from pandas.io.json import json_normalize

In [5]:
mytw = api.me()
mytwit = pd.DataFrame([pd.Series(mytw._json)])
mytwit

Unnamed: 0,id,id_str,name,screen_name,location,profile_location,description,url,entities,protected,...,profile_use_background_image,has_extended_profile,default_profile,default_profile_image,following,follow_request_sent,notifications,translator_type,suspended,needs_phone_verification
0,360391229,360391229,Maris Font,marisfont,Miami,,,,{'description': {'urls': []}},True,...,True,False,True,False,False,False,False,none,False,False


In [6]:
# mytwit.to_csv('output/api_simple.csv', index=False)
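
The profile data frame above keeps nested fields such as 'entities' as raw dictionaries inside a single cell. As a side-by-side illustration, here is a small sketch of how json_normalize flattens a nested dictionary into dotted column names. The `profile` dictionary is made-up sample data, not my real account payload, and the sketch assumes pandas 1.0 or later, where the function is available as `pd.json_normalize`.

```python
import pandas as pd

# Hypothetical profile payload; the values are made up for illustration.
profile = {
    "id": 1,
    "screen_name": "example_user",
    "location": "Miami",
    "entities": {"description": {"urls": []}},
}

# A single dict becomes a one-row data frame. Nested dicts are flattened
# into dotted column names, while lists are kept as cell values.
flat_profile = pd.json_normalize(profile)
print(flat_profile.columns.tolist())
```

Here the nested 'entities' dictionary ends up as a column named `entities.description.urls` instead of one cell holding the whole dictionary.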

### Medium Consultation
- I queried the Twitter API for tweets containing "friyay" using the search method
- I created a data frame of the tweets that contain "friyay"
- Since the data frame had many columns, I printed all the column names to decide which ones to work with

In [7]:
friyay = api.search("friyay")

In [8]:
tweets = pd.DataFrame([pd.Series(tweet._json) for tweet in friyay])
print(type(tweets))
tweets.head()

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,created_at,id,id_str,text,truncated,entities,metadata,source,in_reply_to_status_id,in_reply_to_status_id_str,...,is_quote_status,retweet_count,favorite_count,favorited,retweeted,lang,extended_entities,possibly_sensitive,quoted_status_id,quoted_status_id_str
0,Fri Nov 02 15:40:48 +0000 2018,1058383431644651521,1058383431644651521,RT @TeamTillett: Blimey!!!! That's a HUGE Till...,False,"{'hashtags': [], 'symbols': [], 'user_mentions...","{'iso_language_code': 'en', 'result_type': 're...","<a href=""http://www.testadigrancazzochenonseia...",,,...,False,88,0,False,False,en,,,,
1,Fri Nov 02 15:40:48 +0000 2018,1058383430348550144,1058383430348550144,RT @SharkMontauk: Happy #friyay friends! 😍🦈🎉 h...,False,"{'hashtags': [{'text': 'friyay', 'indices': [2...","{'iso_language_code': 'en', 'result_type': 're...","<a href=""http://twitter.com/#!/download/ipad"" ...",,,...,False,21,0,False,False,en,"{'media': [{'id': 1058320610282024960, 'id_str...",False,,
2,Fri Nov 02 15:40:45 +0000 2018,1058383417233027073,1058383417233027073,SO proud of my brother @THEREALSWIZZZ - Poison...,True,"{'hashtags': [{'text': 'music', 'indices': [10...","{'iso_language_code': 'en', 'result_type': 're...","<a href=""http://twitter.com"" rel=""nofollow"">Tw...",,,...,False,0,0,False,False,en,,False,,
3,Fri Nov 02 15:40:44 +0000 2018,1058383413600837632,1058383413600837632,RT @canadasfareast: Great view from Cape Spear...,False,"{'hashtags': [{'text': 'explorenl', 'indices':...","{'iso_language_code': 'en', 'result_type': 're...","<a href=""http://twitter.com/download/android"" ...",,,...,True,1,0,False,False,en,,False,1.05831e+18,1.0583098982058476e+18
4,Fri Nov 02 15:40:42 +0000 2018,1058383404507545601,1058383404507545601,"Friday, FRIYAY! I think we need a picspam. I ...",False,"{'hashtags': [], 'symbols': [], 'user_mentions...","{'iso_language_code': 'en', 'result_type': 're...","<a href=""http://twitter.com"" rel=""nofollow"">Tw...",,,...,False,0,0,False,False,en,,,,


In [9]:
tweets.columns

Index(['created_at', 'id', 'id_str', 'text', 'truncated', 'entities',
       'metadata', 'source', 'in_reply_to_status_id',
       'in_reply_to_status_id_str', 'in_reply_to_user_id',
       'in_reply_to_user_id_str', 'in_reply_to_screen_name', 'user', 'geo',
       'coordinates', 'place', 'contributors', 'retweeted_status',
       'is_quote_status', 'retweet_count', 'favorite_count', 'favorited',
       'retweeted', 'lang', 'extended_entities', 'possibly_sensitive',
       'quoted_status_id', 'quoted_status_id_str'],
      dtype='object')

- To get the user's screen name, I first had to flatten the 'user' column using json_normalize
  - I tried normalizing 'entities' several times, but because it is nested more than two levels deep, the method we currently know did not work. I tried other tools, but those attempts failed as well.
  - I wanted to normalize 'entities' because I wanted to include the 'hashtags' information in my data frame.
- With this, I created a second data frame with the 'user', 'id', 'text', and 'retweet_count' columns
- I also sorted the data frame by 'retweet_count' in descending order to see which tweets were the most popular
- Then I exported the output as a .csv file

In [10]:
tweets['user'] = json_normalize(tweets['user'])['screen_name']

In [11]:
tweets_final = tweets[['user','id','text','retweet_count']].sort_values('retweet_count', ascending=False)
print(type(tweets_final))
tweets_final

<class 'pandas.core.frame.DataFrame'>


|  | user | id | text | retweet_count |
|---|---|---|---|---|
| 0 | Igor_Mizzi | 1058383431644651521 | RT @TeamTillett: Blimey!!!! That's a HUGE Till... | 88 |
| 13 | DanHamill1987 | 1058383258143154177 | RT @TeamTillett: Blimey!!!! That's a HUGE Till... | 88 |
| 1 | CraigNickoloff | 1058383430348550144 | RT @SharkMontauk: Happy #friyay friends! 😍🦈🎉 h... | 21 |
| 9 | Pacmangrig | 1058383306138501122 | RT @PardonMyTake: PMT 11-2 is now live. An awe... | 8 |
| 7 | XMANTHEEONLY | 1058383320260771840 | RT @ShoPopMusic_ZA: BOBBY-DangerBox Ft. X-MAN ... | 3 |
| 14 | DangerBox_SA | 1058383257732034561 | RT @ShoPopMusic_ZA: BOBBY-DangerBox Ft. X-MAN ... | 3 |
| 3 | TheRockNL | 1058383413600837632 | RT @canadasfareast: Great view from Cape Spear... | 1 |
| 5 | ankataye | 1058383382055395330 | Street Art London #London #StreetArt #bricklan... | 1 |
| 2 | lorenridinger | 1058383417233027073 | SO proud of my brother @THEREALSWIZZZ - Poison... | 0 |
| 4 | Nelle816 | 1058383404507545601 | Friday, FRIYAY! I think we need a picspam. I ... | 0 |


In [12]:
# tweets_final.to_csv('output/api_medium.csv', index=False)
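
The hashtags inside 'entities' can also be reached without json_normalize. The sketch below swaps in a different technique: a plain Python function applied to each cell, followed by `DataFrame.explode` (which assumes pandas 0.25 or later). The `tweets_sample` data frame and the `extract_hashtags` name are hypothetical, and the sample values are made up to mimic the shape of the 'entities' column shown earlier.

```python
import pandas as pd

# Hypothetical sample shaped like the 'entities' column of the search results.
tweets_sample = pd.DataFrame({
    "id": [1, 2, 3],
    "entities": [
        {"hashtags": [], "symbols": [], "user_mentions": []},
        {"hashtags": [{"text": "friyay", "indices": [6, 13]}],
         "symbols": [], "user_mentions": []},
        {"hashtags": [{"text": "music", "indices": [0, 6]},
                      {"text": "friyay", "indices": [7, 14]}],
         "symbols": [], "user_mentions": []},
    ],
})


def extract_hashtags(entities):
    """Return the list of hashtag texts from one 'entities' dictionary."""
    if not isinstance(entities, dict):
        return []  # Missing values (NaN or None) yield no hashtags.
    return [tag["text"] for tag in entities.get("hashtags", [])]


# One list of hashtag strings per tweet.
tweets_sample["hashtags"] = tweets_sample["entities"].apply(extract_hashtags)

# Long format: one row per (tweet id, hashtag); tweets without hashtags drop out.
hashtags_long = (
    tweets_sample[["id", "hashtags"]]
    .explode("hashtags")
    .dropna(subset=["hashtags"])
)
print(hashtags_long)
```

The list column is convenient to keep next to 'text' and 'retweet_count', while the long format is easier to count with `value_counts()`.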

### Hard Consultation
- Marc asked me to query the Twitter API in a way that returns at least 500 tweets

In [None]:
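def get_tws(topic, count=500):
    # A single search request returns at most 100 tweets, so tweepy.Cursor
    # pages through the results until `count` tweets have been collected.
    results = tweepy.Cursor(api.search, q=topic, count=100).items(count)
    # Build the data frame from the function's own results, not from a global.
    df = pd.DataFrame([pd.Series(tw._json) for tw in results])
    return df

blackfriday = get_tws("blackfriday")
blackfriday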
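
A single search request only returns a limited page of tweets (as far as I know, at most 100 for the standard search endpoint), so getting 500 back means asking for several pages in a row. The sketch below illustrates that idea with a fake in-memory "API" instead of Twitter. The `fetch_page` and `collect` functions, the `max_id` style of paging, and the list of fake ids are all assumptions for illustration; this is not tweepy code.

```python
def fetch_page(all_ids, max_id=None, page_size=100):
    """Hypothetical stand-in for one search request.

    Returns up to `page_size` ids, newest (largest) first, that are <= max_id.
    """
    eligible = [i for i in all_ids if max_id is None or i <= max_id]
    return sorted(eligible, reverse=True)[:page_size]


def collect(all_ids, wanted=500, page_size=100):
    """Request page after page until `wanted` items are collected or none remain."""
    collected = []
    max_id = None
    while len(collected) < wanted:
        page = fetch_page(all_ids, max_id, page_size)
        if not page:
            break  # The fake API has nothing older to return.
        collected.extend(page)
        # Next request: only items older than the oldest one seen so far.
        max_id = page[-1] - 1
    return collected[:wanted]


fake_ids = list(range(1, 1201))  # Pretend tweet ids, made up for the example.
result = collect(fake_ids)
print(len(result))
```

The loop stops early when a page comes back empty, so a topic with fewer matching tweets than requested simply returns what is available.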