# WeRateDogs Project - Wrangle & Analyze Twitter Data

By Nagashri Nagaraj<br>
Date November 13 2018

## Introduction:
The goal of this project is to wrangle and analyze @WeRateDogs Twitter data. The data is gathered from three sources: a CSV file downloaded manually, a file downloaded programmatically with the Requests library, and the Twitter API queried through Tweepy. The data is then assessed and cleaned to create interesting and trustworthy analyses and visualizations.

## Gathering Data:
1. The WeRateDogs Twitter archive, a given CSV file that is downloaded manually: twitter-archive-enhanced.csv<br>
2. The tweet image predictions, i.e., what breed of dog (or other object, animal, etc.) is present in each tweet according to a neural network. This file (image-predictions.tsv) is hosted on Udacity's servers and is downloaded programmatically using the Requests library from the following URL: https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv<br>
3. Each tweet's retweet count and favorite ("like") count at minimum, plus any additional data of interest. Using the tweet IDs in the WeRateDogs Twitter archive, the Twitter API is queried for each tweet's JSON data with Python's Tweepy library, and each tweet's entire set of JSON data is stored in a file called tweet_json.txt. Each tweet's JSON data should be written to its own line. This .txt file is then read into a pandas DataFrame with (at minimum) tweet ID, retweet count, and favorite count.<br>

In [3]:
#import libraries

import numpy as np
import pandas as pd
import os  # operating-system utilities (paths, environment)
import requests  # to download files from Udacity server
import tweepy
import json
from tqdm import tqdm  # progress bar for the API download loop
import datetime as dt
import re
import matplotlib.pyplot as plt
#import seaborn as sns
%matplotlib inline


In [4]:
#Read twitter-archive-enhanced.csv and store it as dataframe variable archive

archive = pd.read_csv("data/twitter-archive-enhanced.csv")


In [40]:
archive.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob

In [5]:

# Programmatically download image prediction file from Udacity server using Requests library
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)
response.raise_for_status()  # stop here if the download failed

# Save data
with open('data/image-predictions.tsv', "wb") as file: 
    file.write(response.content)
    
# Import data
df_breeds = pd.read_csv('data/image-predictions.tsv', sep = "\t")

df_breeds.sample(2)


Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
1734,821522889702862852,https://pbs.twimg.com/media/C2aitIUXAAAG-Wi.jpg,1,Doberman,0.763539,True,black-and-tan_coonhound,0.136602,True,miniature_pinscher,0.087654,True
97,667728196545200128,https://pbs.twimg.com/media/CUQ_QahUAAAVQjn.jpg,1,kuvasz,0.360159,True,golden_retriever,0.293744,True,Labrador_retriever,0.270673,True


The Twitter API requires four credentials: the consumer API key, the API secret key, the access token, and the access token secret. They are generated on the app's page in the Twitter developer portal:

https://developer.twitter.com/en/apps/15941238

In [8]:
# Import data from Twitter API

# authentication pieces, read from environment variables so the secrets stay out of the notebook
# (the variable names below are illustrative; use whatever names your environment defines)
consumer_key = os.environ["TWITTER_CONSUMER_KEY"]
consumer_secret = os.environ["TWITTER_CONSUMER_SECRET"]
access_token = os.environ["TWITTER_ACCESS_TOKEN"]
access_secret = os.environ["TWITTER_ACCESS_SECRET"]

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

# Create connection to API
api = tweepy.API(auth, 
                 parser = tweepy.parsers.JSONParser(), 
                 wait_on_rate_limit = True, 
                 wait_on_rate_limit_notify = True)

# Create list of tweet ids
tweet_ids = archive["tweet_id"].tolist()


In [13]:
api

<tweepy.api.API at 0x105bcb0b8>

In [25]:
# Download tweepy status object based on tweet_id in archive and store it in a list.
tweets_data = []

# Tweets that can't be found for the tweet_ids are saved in the list below
failed_ids = []

for tweet_id in tqdm(tweet_ids):
    try:
        tweets_data.append(api.get_status(tweet_id))
    except Exception as e:
        failed_ids.append(tweet_id)

 15%|█▍        | 342/2356 [01:21<08:07,  4.13it/s]Rate limit reached. Sleeping for: 239
 53%|█████▎    | 1242/2356 [09:05<04:13,  4.39it/s]  Rate limit reached. Sleeping for: 680
 91%|█████████ | 2143/2356 [24:41<00:47,  4.51it/s]    Rate limit reached. Sleeping for: 682
100%|██████████| 2356/2356 [37:32<00:00,  4.32it/s]    
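
The download loop above records only the IDs that failed. A minimal sketch of a variant that also keeps the error message for each failed ID is shown below, which may help tell deleted tweets apart from other errors. The `fetch` argument stands in for `api.get_status`; `fake_fetch` and the sample IDs are made up for illustration so that the block runs without network access.

```python
def fetch_all(tweet_ids, fetch):
    """Call fetch(tweet_id) for every ID and return (results, failures)."""
    results = []
    failures = {}  # maps tweet_id -> error message
    for tweet_id in tweet_ids:
        try:
            results.append(fetch(tweet_id))
        except Exception as error:
            failures[tweet_id] = str(error)
    return results, failures


def fake_fetch(tweet_id):
    """Hypothetical stand-in for api.get_status: odd IDs count as deleted."""
    if tweet_id % 2:
        raise LookupError("No status found with that ID.")
    return {"id": tweet_id, "retweet_count": 0, "favorite_count": 0}


tweets, failed = fetch_all([2, 3, 4], fake_fetch)
print(len(tweets), "downloaded;", len(failed), "failed")
for tweet_id, reason in failed.items():
    print(tweet_id, "->", reason)
```

Keeping the reason next to each ID makes it easier to decide later whether a failed ID is worth retrying.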


In [26]:
print("The list of tweets", len(tweets_data))
print("The list of tweets not found", len(failed_ids))

The list of tweets 2340
The list of tweets not found 16


In [27]:
# The API was created with JSONParser, so each item returned by get_status
# is already a dict holding the tweet's JSON; copy them all into a list

my_list_of_dicts = list(tweets_data)

In [55]:
#print(my_list_of_dicts[0:1])

In [44]:
# Write the list of tweet dicts to a text file as a single JSON array:

with open('data/tweet_json.txt', 'w') as file:
    file.write(json.dumps(my_list_of_dicts, indent=4))
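
The project brief asks for each tweet's JSON to be written to its own line, whereas the cell above writes one indented JSON array. A minimal sketch of the line-per-tweet layout follows, assuming each tweet is already a dict. The two sample tweets and the temporary file path are invented for illustration.

```python
import json
import os
import tempfile

# Hypothetical sample data standing in for the downloaded tweets
sample_tweets = [
    {"id": 1, "retweet_count": 10, "favorite_count": 25},
    {"id": 2, "retweet_count": 3, "favorite_count": 7},
]

path = os.path.join(tempfile.mkdtemp(), "tweet_json.txt")

# Write one JSON document per line
with open(path, "w", encoding="utf-8") as file:
    for tweet in sample_tweets:
        file.write(json.dumps(tweet) + "\n")

# Read the file back line by line, keeping only the fields of interest
rows = []
with open(path, encoding="utf-8") as file:
    for line in file:
        tweet = json.loads(line)
        rows.append({
            "tweet_id": str(tweet["id"]),
            "retweet_count": tweet["retweet_count"],
            "favorite_count": tweet["favorite_count"],
        })

print(rows)
```

A list of dicts such as `rows` can be passed directly to `pd.DataFrame` to build the table.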

In [100]:
my_list = []
with open('data/tweet_json.txt', encoding='utf-8') as json_file:
    all_data = json.load(json_file)

for each_dictionary in all_data:
    tweet_id = each_dictionary['id']
    whole_tweet = each_dictionary['text']
    # keep everything from the first 'https' onward (the tweet's short link)
    only_url = whole_tweet[whole_tweet.find('https'):]
    favorite_count = each_dictionary['favorite_count']
    retweet_count = each_dictionary['retweet_count']
    followers_count = each_dictionary['user']['followers_count']
    friends_count = each_dictionary['user']['friends_count']
    whole_source = each_dictionary['source']
    # strip the surrounding <a ... rel="nofollow"> and </a> to keep only the device name
    source = whole_source[whole_source.find('rel="nofollow">') + 15:-4]
    # retweets carry a 'retweeted_status' key; original tweets do not
    if 'retweeted_status' in each_dictionary:
        retweeted_status = 'This is a retweet'
        url = 'This is a retweet'
    else:
        retweeted_status = 'Original tweet'
        url = only_url
    my_list.append({'tweet_id': str(tweet_id),
                    'whole_tweet': str(whole_tweet),
                    'favorite_count': int(favorite_count),
                    'retweet_count': int(retweet_count),
                    'followers_count': int(followers_count),
                    'friends_count': int(friends_count),
                    'url': url,
                    'source': source,
                    'retweeted_status': retweeted_status,
                    })

# build the DataFrame once, after all tweets have been parsed
df_json = pd.DataFrame(my_list, columns = ['tweet_id', 'whole_tweet', 'favorite_count', 'retweet_count',
                                           'followers_count', 'friends_count', 'source',
                                           'retweeted_status', 'url'])
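
The cell above cuts the device name out of the `source` field with fixed offsets (`+ 15` and `-4`) and takes the link from the first `https` in the tweet text. Both steps rely on the input having exactly the expected shape. A sketch of slightly more defensive versions is given below; `extract_device` and `extract_url` are hypothetical helpers, and the sample strings are made up to follow the formats the slicing code assumes.

```python
import re

# Text between the closing '>' of the opening tag and the '</a>' that ends the anchor
ANCHOR_TEXT = re.compile(r">([^<]*)</a>")


def extract_device(source_html):
    """Return the anchor text of the source field, or the input if no anchor is found."""
    match = ANCHOR_TEXT.search(source_html)
    return match.group(1) if match else source_html


def extract_url(text):
    """Return the text from the first 'https' onward, or None if there is no link."""
    start = text.find("https")
    return text[start:] if start != -1 else None


sample_source = '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'
print(extract_device(sample_source))  # Twitter for iPhone
print(extract_url("Meet Jax. https://t.co/abc"))  # https://t.co/abc
print(extract_url("A tweet without a link"))  # None
```

Returning `None` for a missing link avoids the case where `str.find` returns `-1` and the slice silently keeps only the last character of the tweet.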

In [101]:
df_json

Unnamed: 0,tweet_id,whole_tweet,favorite_count,retweet_count,followers_count,friends_count,source,retweeted_status,url
0,892420643555336193,This is Phineas. He's a mystical boy. Only eve...,38257,8381,7437146,10,Twitter for iPhone,Original tweet,https://t.co/MgUWQ76dJU
1,892177421306343426,This is Tilly. She's just checking pup on you....,32797,6188,7437146,10,Twitter for iPhone,Original tweet,https://t.co/aQFSeaCu9L
2,891815181378084864,This is Archie. He is a rare Norwegian Pouncin...,24696,4094,7437146,10,Twitter for iPhone,Original tweet,https://t.co/r0YlrsGCgy
3,891689557279858688,This is Darla. She commenced a snooze mid meal...,41584,8522,7437146,10,Twitter for iPhone,Original tweet,https://t.co/tD36da7qLQ
4,891327558926688256,This is Franklin. He would like you to stop ca...,39765,9232,7437146,10,Twitter for iPhone,Original tweet,https://t.co/0g0KMIVXZ3
5,891087950875897856,Here we have a majestic great white breaching ...,19956,3070,7437146,10,Twitter for iPhone,Original tweet,https://t.co/xx5cilW0Dd
6,890971913173991426,Meet Jax. He enjoys ice cream so much he gets ...,11681,2038,7437146,10,Twitter for iPhone,Original tweet,https://t.co/MV01Q820LT
7,890729181411237888,When you watch your owner call another dog a g...,64577,18611,7437146,10,Twitter for iPhone,Original tweet,https://t.co/hrcFOGi12V
8,890609185150312448,This is Zoey. She doesn't want to be one of th...,27431,4213,7437146,10,Twitter for iPhone,Original tweet,https://t.co/UkrdQyoYxV
9,890240255349198849,This is Cassie. She is a college pup. Studying...,31482,7279,7437146,10,Twitter for iPhone,Original tweet,https://t.co/l3TSS3o2M0


In [111]:
# to_csv returns None when given a path, so its result is not assigned;
# index=False keeps the row index from being written as an extra column
df_json.to_csv('data/df_json1.csv', index=False)

In [112]:
df_json1 = pd.read_csv('data/df_json1.csv')
df_json1.head()

Unnamed: 0,tweet_id,whole_tweet,favorite_count,retweet_count,followers_count,friends_count,source,retweeted_status,url
0,892420643555336193,This is Phineas. He's a mystical boy. Only eve...,38257,8381,7437146,10,Twitter for iPhone,Original tweet,https://t.co/MgUWQ76dJU
1,892177421306343426,This is Tilly. She's just checking pup on you....,32797,6188,7437146,10,Twitter for iPhone,Original tweet,https://t.co/aQFSeaCu9L
2,891815181378084864,This is Archie. He is a rare Norwegian Pouncin...,24696,4094,7437146,10,Twitter for iPhone,Original tweet,https://t.co/r0YlrsGCgy
3,891689557279858688,This is Darla. She commenced a snooze mid meal...,41584,8522,7437146,10,Twitter for iPhone,Original tweet,https://t.co/tD36da7qLQ
4,891327558926688256,This is Franklin. He would like you to stop ca...,39765,9232,7437146,10,Twitter for iPhone,Original tweet,https://t.co/0g0KMIVXZ3
