# Dog Rates Data Wrangling

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#gather">Gathering</a></li>
<li><a href="#assess">Assessing</a></li>
<li><a href="#clean">Cleaning</a></li>
<li><a href="#analyze&visualize">Analyze and Visualize</a></li>
<li><a href="#conclusion">Conclusion</a></li>
</ul>

<a id='intro'></a>
## Introduction

The dataset is the tweet archive of Twitter user @dog_rates, also known as WeRateDogs. WeRateDogs is a Twitter account that rates people's dogs with a humorous comment about the dog.

The data wrangling process consists of three main steps:
1. **Data gathering** - Data can be gathered in many ways, including web scraping and APIs, and can come from a single source or from several.
2. **Assessing the data** - The data needs to be assessed for quality and tidiness issues. This can be done visually and/or programmatically.
3. **Cleaning the data** - Based on the assessment, the data is cleaned and then tested to make sure every issue identified is resolved.

**Analyze & Visualize**
<br/>
Finally, the wrangled data is analyzed and visualized in an effective and insightful manner.

>*Importing all necessary packages for the data wrangling and analysis*

In [1]:
import requests
import tweepy
import pandas as pd
import time
import json

import config

<a id='gather'></a>
## Gathering

> The data for this analysis is to be gathered from multiple sources. They are:
>
> 1. The enhanced WeRateDogs Twitter archive is provided. This file (**twitter-archive-enhanced.csv**) just needs to be downloaded.
>
> 2. The tweet image predictions, i.e., what breed of dog (or other object, animal, etc.) is present in each tweet according to a neural network. This file (**image_predictions.tsv**) is hosted on Udacity's servers and should be downloaded programmatically.
>
> 3. Additional required and interesting data is obtained by querying the Twitter API for each tweet's JSON data and storing each tweet's entire set of JSON data in a file (**tweet_json.txt**).

>*The file containing the enhanced Twitter archive (twitter-archive-enhanced.csv) has been manually downloaded and is available in the working directory. The tweet image predictions file (image_predictions.tsv) is to be downloaded programmatically using the URL provided.*

In [2]:
# storing the URL provided in a variable
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'

# getting the response from the URL using requests library 
response = requests.get(url)

# the with statement ensures the file is closed as soon as the block finishes
# the file is opened for writing in binary mode
with open('image_predictions.tsv', 'wb') as file:
    # content of the response is written to the file
    file.write(response.content)
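
The download above writes whatever the server returns, including an error page. A minimal sketch of a more defensive variant follows. The helper name `download_file`, the `get` parameter, and the 30-second timeout are assumptions made for illustration and are not part of the original notebook.

```python
def download_file(url, dest_path, get=None, timeout=30):
    """Download `url` to `dest_path` and return the number of bytes written.

    `get` is a hypothetical hook for swapping in a different HTTP function
    (for example in tests); by default requests.get is used.
    """
    if get is None:
        import requests  # imported lazily so the helper can be exercised offline
        get = requests.get

    response = get(url, timeout=timeout)
    # raise an exception for 4xx/5xx responses instead of saving an error page
    response.raise_for_status()

    # binary mode, so the bytes are written exactly as received
    with open(dest_path, 'wb') as file:
        file.write(response.content)
    return len(response.content)
```

With the variables from the cell above, the call would be `download_file(url, 'image_predictions.tsv')`.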

>*The file containing the image predictions is now saved in the working directory. The additional data needs to be downloaded by querying the Twitter API using the tweepy library. To do that, sign in (or sign up) and create a Twitter developer account. Once the account is created, the consumer keys and access tokens become available.*

>*It is not safe to expose the consumer keys and access tokens in code. Hence, a config file is used and imported in this notebook. (To execute the rest of the notebook, please fill in the necessary details in the config.py file.)*
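
For illustration, a `config.py` along the following lines would satisfy the attributes used by the authentication cell in this notebook. The four attribute names match those referenced there. Reading them from environment variables, the environment variable names themselves, and the `missing_credentials` helper are assumptions rather than the author's actual setup.

```python
# config.py (hypothetical layout): keeps credentials out of the notebook
import os

# The environment variable names below are assumptions; any names work as long
# as the four module-level attributes exist.
API_KEY = os.environ.get('TWITTER_API_KEY', '')
API_SECRET_KEY = os.environ.get('TWITTER_API_SECRET_KEY', '')
ACCESS_TOKEN = os.environ.get('TWITTER_ACCESS_TOKEN', '')
ACCESS_TOKEN_SECRET = os.environ.get('TWITTER_ACCESS_TOKEN_SECRET', '')


def missing_credentials():
    """Return the names of credentials that are still empty."""
    values = {
        'API_KEY': API_KEY,
        'API_SECRET_KEY': API_SECRET_KEY,
        'ACCESS_TOKEN': ACCESS_TOKEN,
        'ACCESS_TOKEN_SECRET': ACCESS_TOKEN_SECRET,
    }
    return [name for name, value in values.items() if not value]
```

Keeping such a file out of version control (for example through `.gitignore`) avoids publishing the credentials by accident.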

>*Authenticate using the consumer keys and set the access tokens.* 

In [3]:
# create an OAuthHandler instance
auth = tweepy.OAuthHandler(config.API_KEY, config.API_SECRET_KEY)
# set the access tokens
auth.set_access_token(config.ACCESS_TOKEN, config.ACCESS_TOKEN_SECRET)

# create the API instance
# wait_on_rate_limit – whether or not to automatically wait for rate limits to replenish
# wait_on_rate_limit_notify – whether or not to print a notification when Tweepy is waiting for rate limits to replenish
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
api

<tweepy.api.API at 0x7ff76250afd0>

>*The API instance is created and ready for use now.*

>*The ID of each tweet is required in order to access the additional details of the tweet. These IDs are present in the twitter-archive-enhanced.csv file. Read the file and store it in a dataframe for further use.*

In [4]:
# read the file twitter-archive-enhanced.csv and store it in a dataframe 
twitter_archive_df = pd.read_csv('twitter-archive-enhanced.csv', index_col=None, encoding = 'utf-8')
twitter_archive_df.head()

| | tweet_id | in_reply_to_status_id | in_reply_to_user_id | timestamp | source | text | retweeted_status_id | retweeted_status_user_id | retweeted_status_timestamp | expanded_urls | rating_numerator | rating_denominator | name | doggo | floofer | pupper | puppo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892420643555336193 | | | 2017-08-01 16:23:56 +0000 | `<a href="http://twitter.com/download/iphone" r...` | This is Phineas. He's a mystical boy. Only eve... | | | | https://twitter.com/dog_rates/status/892420643... | 13 | 10 | Phineas | | | | |
| 1 | 892177421306343426 | | | 2017-08-01 00:17:27 +0000 | `<a href="http://twitter.com/download/iphone" r...` | This is Tilly. She's just checking pup on you.... | | | | https://twitter.com/dog_rates/status/892177421... | 13 | 10 | Tilly | | | | |
| 2 | 891815181378084864 | | | 2017-07-31 00:18:03 +0000 | `<a href="http://twitter.com/download/iphone" r...` | This is Archie. He is a rare Norwegian Pouncin... | | | | https://twitter.com/dog_rates/status/891815181... | 12 | 10 | Archie | | | | |
| 3 | 891689557279858688 | | | 2017-07-30 15:58:51 +0000 | `<a href="http://twitter.com/download/iphone" r...` | This is Darla. She commenced a snooze mid meal... | | | | https://twitter.com/dog_rates/status/891689557... | 13 | 10 | Darla | | | | |
| 4 | 891327558926688256 | | | 2017-07-29 16:00:24 +0000 | `<a href="http://twitter.com/download/iphone" r...` | This is Franklin. He would like you to stop ca... | | | | https://twitter.com/dog_rates/status/891327558... | 12 | 10 | Franklin | | | | |


In [5]:
total_number_of_tweets = len(twitter_archive_df.tweet_id)
# IDs of the tweets that could not be retrieved
failed_tweet_ids = []

start = time.time()

# opening a text file in write mode and writing one tweet JSON per line
with open('tweet_json.txt', 'w') as txt_file:
    # looping over all the tweets whose IDs are present in the twitter_archive_df dataframe
    for tweet_id in twitter_archive_df.tweet_id:
        try:
            # get a single status specified by the ID parameter
            # extended tweet mode gives the entire untruncated text of the Tweet
            tweet = api.get_status(tweet_id, tweet_mode='extended')
            json.dump(tweet._json, txt_file)
            txt_file.write('\n')
        except tweepy.TweepError as e:
            # record the failure and move on to the next tweet
            failed_tweet_ids.append(tweet_id)
            print('Tweet ID:', tweet_id, '-', e)

print('Total number of tweets:', total_number_of_tweets)
print('Time taken:', (time.time()-start)/60, 'minutes')
print('Total number of failed tweets:', len(failed_tweet_ids))
print('List of failed tweet IDs:', failed_tweet_ids)

Tweet ID: 888202515573088257 - [{'code': 144, 'message': 'No status found with that ID.'}]
Tweet ID: 873697596434513921 - [{'code': 144, 'message': 'No status found with that ID.'}]
Tweet ID: 872668790621863937 - [{'code': 144, 'message': 'No status found with that ID.'}]
Tweet ID: 872261713294495745 - [{'code': 144, 'message': 'No status found with that ID.'}]
Tweet ID: 869988702071779329 - [{'code': 144, 'message': 'No status found with that ID.'}]
Tweet ID: 866816280283807744 - [{'code': 144, 'message': 'No status found with that ID.'}]
Tweet ID: 861769973181624320 - [{'code': 144, 'message': 'No status found with that ID.'}]
Tweet ID: 856602993587888130 - [{'code': 144, 'message': 'No status found with that ID.'}]
Tweet ID: 851953902622658560 - [{'code': 144, 'message': 'No status found with that ID.'}]
Tweet ID: 845459076796616705 - [{'code': 144, 'message': 'No status found with that ID.'}]
Tweet ID: 844704788403113984 - [{'code': 144, 'message': 'No status found with that ID.'}]

Rate limit reached. Sleeping for: 441


Tweet ID: 759566828574212096 - [{'code': 144, 'message': 'No status found with that ID.'}]
Tweet ID: 754011816964026368 - [{'code': 144, 'message': 'No status found with that ID.'}]
Tweet ID: 680055455951884288 - [{'code': 144, 'message': 'No status found with that ID.'}]


Rate limit reached. Sleeping for: 451


Total number of tweets: 2356
Time taken: 34.96101151307424 minutes
Total number of failed tweets: 25
List of failed tweet IDs: [888202515573088257, 873697596434513921, 872668790621863937, 872261713294495745, 869988702071779329, 866816280283807744, 861769973181624320, 856602993587888130, 851953902622658560, 845459076796616705, 844704788403113984, 842892208864923648, 837366284874571778, 837012587749474308, 829374341691346946, 827228250799742977, 812747805718642688, 802247111496568832, 779123168116150273, 775096608509886464, 771004394259247104, 770743923962707968, 759566828574212096, 754011816964026368, 680055455951884288]
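
The two `Rate limit reached` pauses in the output above are consistent with a simple back-of-the-envelope estimate. The sketch below assumes a limit of 900 requests per rate-limit window for this endpoint. That figure is not stated in the notebook and should be checked against the current API documentation.

```python
import math

total_requests = 2356        # number of tweet IDs in the archive (from the output above)
requests_per_window = 900    # assumed limit per window, not taken from the notebook

# each window allows at most `requests_per_window` calls
windows_needed = math.ceil(total_requests / requests_per_window)
# a wait is only needed between windows, not after the last one
pauses_expected = windows_needed - 1

print('Windows needed:', windows_needed)
print('Pauses expected:', pauses_expected)
```

Under these assumptions the run spans three windows and therefore two pauses, which matches the two messages printed during the loop.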


>*Below are explanations of the Twitter API error codes that can appear here:*<br/>
>*Error code 144 - Corresponds to HTTP 404. The requested tweet ID was not found (if it existed, it was probably deleted).*<br/>
>*Error code 179 - Corresponds to HTTP 403. Returned when a tweet cannot be viewed by the authenticating user, usually because the tweet's author has protected their tweets.*

>*The additional data for every tweet that could be retrieved is now available in the tweet_json.txt file. The next step is to read the file, extract the required fields from each tweet's JSON, and store them in a new dataframe.*

In [6]:
additional_data = []

# opening the tweet_json.txt file in read mode 
with open('tweet_json.txt', 'r') as infile:
    # looping over each line of the file
    for record in infile:
        # parse the JSON string into a dictionary
        record_json_data = json.loads(record)
        user = record_json_data['user']
        # storing the required details in a list and appending it to the additional_data list
        # note: favourites_count and followers_count come from the nested user object,
        # so they are account-level values, not per-tweet values
        additional_data.append([record_json_data['id'], record_json_data['retweet_count'],
                                user['favourites_count'], user['followers_count'],
                                record_json_data['created_at']])

# creating a new dataframe using the additional_data list of lists 
additional_data_df = pd.DataFrame(additional_data)
# defining the column names of the dataframe
additional_data_df.columns = ['tweet_id', 'retweet_count', 'favourites_count', 'followers_count', 'created_at']
additional_data_df.head() 

| | tweet_id | retweet_count | favourites_count | followers_count | created_at |
|---|---|---|---|---|---|
| 0 | 892420643555336193 | 7493 | 145959 | 8875964 | Tue Aug 01 16:23:56 +0000 2017 |
| 1 | 892177421306343426 | 5560 | 145959 | 8875964 | Tue Aug 01 00:17:27 +0000 2017 |
| 2 | 891815181378084864 | 3681 | 145959 | 8875964 | Mon Jul 31 00:18:03 +0000 2017 |
| 3 | 891689557279858688 | 7662 | 145959 | 8875964 | Sun Jul 30 15:58:51 +0000 2017 |
| 4 | 891327558926688256 | 8275 | 145959 | 8875964 | Sat Jul 29 16:00:24 +0000 2017 |
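
The `favourites_count` column above is read from the nested `user` object, which is why it is identical in every row. If per-tweet likes are wanted instead, one possible approach is sketched below. It assumes each tweet's JSON carries a top-level `favorite_count` key, as the v1.1 status payload does. The helper name `read_tweet_counts` is hypothetical.

```python
import json


def read_tweet_counts(path):
    """Read a file with one tweet JSON per line and return a list of dicts."""
    rows = []
    with open(path, 'r') as infile:
        for line in infile:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            tweet = json.loads(line)
            rows.append({
                'tweet_id': tweet['id'],
                'retweet_count': tweet['retweet_count'],
                # .get returns None if the assumed key is absent
                'favorite_count': tweet.get('favorite_count'),
            })
    return rows
```

The returned list of dictionaries can be passed straight to `pd.DataFrame(...)`, with the dictionary keys becoming the column names.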


>*Checking the number of records to confirm that all valid records in twitter_archive_df dataframe have a corresponding record in additional_data_df dataframe.*

In [7]:
additional_data_df.shape

(2331, 5)

>*Apart from the 25 tweet IDs that could not be retrieved, there is one record for each tweet in the archive (2356 - 25 = 2331). The data gathering step is now complete.*
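
The row count alone does not show which archive IDs are absent. A small set-based check makes that explicit. The sketch uses three IDs taken from the outputs above as stand-ins for the full columns, and the function name `missing_ids` is hypothetical.

```python
def missing_ids(archive_ids, fetched_ids):
    """Return the archive IDs that have no fetched record, in archive order."""
    fetched = set(fetched_ids)
    return [tweet_id for tweet_id in archive_ids if tweet_id not in fetched]


# stand-in data: two IDs that were fetched and one that failed with code 144
archive_ids = [892420643555336193, 892177421306343426, 888202515573088257]
fetched_ids = [892420643555336193, 892177421306343426]

print(missing_ids(archive_ids, fetched_ids))
```

With the real data, passing `twitter_archive_df.tweet_id` and `additional_data_df.tweet_id` would be expected to return the 25 IDs reported as failed during gathering.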

<a id='assess'></a>
## Assessing

<a id='clean'></a>
## Cleaning

<a id='analyze&visualize'></a>
## Analyze and Visualize

<a id='conclusion'></a>
## Conclusion

>*CONCLUDING REMARKS:*

>*REFERENCES:*