
# Project: [WeRateDogs] Data Wrangling

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#gathering">Data Gathering</a></li>
<li><a href="#assessing">Data Assessing</a></li>
<li><a href="#cleaning">Data Cleaning</a></li>
<li><a href="#conclusions">Conclusions</a></li>
<li><a href="#limitations">Limitations</a></li>
</ul>

<a id='intro'></a>
## Introduction

Gathering data for analysis often means pulling from a variety of sources in different formats. The dataset we are about to wrangle, analyse and visualize is the tweet archive of Twitter user [@dog_rates](https://twitter.com/dog_rates), also known as [WeRateDogs](https://en.wikipedia.org/wiki/WeRateDogs).

WeRateDogs is a Twitter account that rates people's dogs with a humorous comment about the dog. We have an archive that contains basic tweet data (tweet ID, timestamp, text, etc.) for over 5,000 tweets as they stood on August 1, 2017.

We will be working with 3 different datasets in this project:
1. Enhanced Twitter Archive
2. Image Predictions File
3. Additional Data via the Twitter API

<a id='gathering'></a>
## Data Gathering

The goal here is to gather all 3 datasets described in the introduction.

In [1]:
# Import all packages used in this notebook.

import pandas as pd
import requests
import tweepy
import json
from timeit import default_timer as timer

### 1. Enhanced Twitter Archive

This dataset is downloaded manually, saved into our project folder, and read into a pandas DataFrame.

In [2]:
# Read the manually downloaded Twitter archive into a DataFrame

df_tweets_archived = pd.read_csv("twitter-archive-enhanced.csv")

In [3]:
df_tweets_archived.head()

| | tweet_id | in_reply_to_status_id | in_reply_to_user_id | timestamp | source | text | retweeted_status_id | retweeted_status_user_id | retweeted_status_timestamp | expanded_urls | rating_numerator | rating_denominator | name | doggo | floofer | pupper | puppo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892420643555336193 | | | 2017-08-01 16:23:56 +0000 | `<a href="http://twitter.com/download/iphone" r...` | This is Phineas. He's a mystical boy. Only eve... | | | | `https://twitter.com/dog_rates/status/892420643...` | 13 | 10 | Phineas | | | | |
| 1 | 892177421306343426 | | | 2017-08-01 00:17:27 +0000 | `<a href="http://twitter.com/download/iphone" r...` | This is Tilly. She's just checking pup on you.... | | | | `https://twitter.com/dog_rates/status/892177421...` | 13 | 10 | Tilly | | | | |
| 2 | 891815181378084864 | | | 2017-07-31 00:18:03 +0000 | `<a href="http://twitter.com/download/iphone" r...` | This is Archie. He is a rare Norwegian Pouncin... | | | | `https://twitter.com/dog_rates/status/891815181...` | 12 | 10 | Archie | | | | |
| 3 | 891689557279858688 | | | 2017-07-30 15:58:51 +0000 | `<a href="http://twitter.com/download/iphone" r...` | This is Darla. She commenced a snooze mid meal... | | | | `https://twitter.com/dog_rates/status/891689557...` | 13 | 10 | Darla | | | | |
| 4 | 891327558926688256 | | | 2017-07-29 16:00:24 +0000 | `<a href="http://twitter.com/download/iphone" r...` | This is Franklin. He would like you to stop ca... | | | | `https://twitter.com/dog_rates/status/891327558...` | 12 | 10 | Franklin | | | | |
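
Because the archive's tweet IDs drive the API queries later on, a quick sanity check of the loaded frame may be worthwhile. The following is a minimal sketch on a small in-memory sample; `sample_csv` and `df_sample` are illustrative names, and on the real data the same checks would be run on `df_tweets_archived`.

```python
import io

import pandas as pd

# Illustrative three-row sample shaped like a few archive columns; not the real file.
sample_csv = io.StringIO(
    "tweet_id,rating_numerator,rating_denominator,name\n"
    "892420643555336193,13,10,Phineas\n"
    "892177421306343426,13,10,Tilly\n"
    "891815181378084864,12,10,Archie\n"
)
df_sample = pd.read_csv(sample_csv)

# Number of rows and columns that were read.
n_rows, n_cols = df_sample.shape

# Each tweet should appear only once, otherwise the API would be queried twice for it.
ids_are_unique = df_sample["tweet_id"].is_unique

print(n_rows, n_cols, ids_are_unique)
```
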


### 2. Image Predictions File

This dataset is downloaded programmatically and saved into our project folder. We have been provided with the URL, so we use the `requests` library to download the file.

In [4]:
# Download the image predictions data via the `requests` library and store it.

# Source URL
image_predictions_source = "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv"

# Fetch first and stop on an HTTP error, so a failed download never overwrites the file
response = requests.get(image_predictions_source)
response.raise_for_status()

with open("image-predictions.tsv", mode="wb") as file:
    file.write(response.content)
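
Once the file is on disk it still has to be read into a DataFrame. Since it is tab-separated, a sketch of that step could look like the following; it runs on an in-memory sample, and the column names, values and the names `sample_tsv` and `df_image_sample` are assumptions made for illustration only.

```python
import io

import pandas as pd

# Illustrative tab-separated sample; columns and values are assumptions, not the real file.
sample_tsv = io.StringIO(
    "tweet_id\tjpg_url\timg_num\n"
    "1\thttps://example.com/a.jpg\t1\n"
    "2\thttps://example.com/b.jpg\t2\n"
)

# sep="\t" tells pandas that fields are separated by tabs rather than commas.
df_image_sample = pd.read_csv(sample_tsv, sep="\t")

print(df_image_sample.head())
```

For the downloaded file, the equivalent call would presumably be `pd.read_csv("image-predictions.tsv", sep="\t")`.
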

### 3. Additional Twitter Data

This dataset is gathered through Twitter's API.

We have gone through the process of creating a regular account and a developer account on Twitter, so that we can create a project on the developer portal.

In [5]:
# Set up credentials for tweepy's OAuth1UserHandler (previously OAuthHandler) instance.
# The real keys are kept out of the notebook to comply with Twitter's API terms and
# conditions; they are read from environment variables (variable names are an assumption).
import os

consumer_key = os.environ['TWITTER_CONSUMER_KEY']
consumer_secret = os.environ['TWITTER_CONSUMER_SECRET']
access_token = os.environ['TWITTER_ACCESS_TOKEN']
access_secret = os.environ['TWITTER_ACCESS_SECRET']

auth = tweepy.OAuth1UserHandler(
   consumer_key, consumer_secret, 
   access_token, access_secret
)

api = tweepy.API(auth, wait_on_rate_limit=True)

In [6]:
# Gather the list of tweet IDs in the archived dataset

tweet_ids = df_tweets_archived.tweet_id.values
len(tweet_ids)

2356

In [8]:
# Query Twitter's API for the JSON data of each tweet ID in the Twitter archive (tweet_ids)
# and save each result as one line of a text file

# Store the tweets whose fetch failed, keyed by tweet ID
tweetfetch_fails = {}

# Timer
start = timer()
print('Start - {}'.format(start))

# Save each tweet's returned JSON as a new line in a .txt file
with open('tweet_json.txt', 'w') as outfile:
    
    # This loop will likely take 20-30 minutes to run because of Twitter's rate limit
    for tweet_id in tweet_ids:
        try:
            tweet = api.get_status(tweet_id)
            
            # Keep only the fields needed later
            tweet_info = {
                "id": str(tweet_id),
                "retweet_count": str(tweet._json['retweet_count']),
                "favorite_count": str(tweet._json['favorite_count'])
            }
            json.dump(tweet_info, outfile)
            outfile.write('\n')
            
        except tweepy.errors.TweepyException as e:
            # Record the failure and move on to the next tweet
            tweetfetch_fails[tweet_id] = e
end = timer()
print('End - {} \n'.format(end))

print('Duration - {}'.format(end - start))
print(tweetfetch_fails)


Start - 45.364935291


Rate limit reached. Sleeping for: 171
Rate limit reached. Sleeping for: 243


End - 2192.832309083 

Duration - 2147.4673737919998
{888202515573088257: NotFound('404 Not Found\n144 - No status found with that ID.'), 877611172832227328: Forbidden('403 Forbidden\n179 - Sorry, you are not authorized to see this status.'), 873697596434513921: NotFound('404 Not Found\n144 - No status found with that ID.'), 872668790621863937: NotFound('404 Not Found\n144 - No status found with that ID.'), 872261713294495745: NotFound('404 Not Found\n144 - No status found with that ID.'), 869988702071779329: NotFound('404 Not Found\n144 - No status found with that ID.'), 866816280283807744: NotFound('404 Not Found\n144 - No status found with that ID.'), 861769973181624320: NotFound('404 Not Found\n144 - No status found with that ID.'), 856602993587888130: NotFound('404 Not Found\n144 - No status found with that ID.'), 856330835276025856: NotFound('404 Not Found\n144 - No status found with that ID.'), 851953902622658560: NotFound('404 Not Found\n144 - No status found with that ID.'), 8
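
The raw dictionary of failures is hard to read at a glance. One way to summarise it is to count the failures by exception type, sketched below. To keep the sketch runnable without tweepy or API access, `NotFound` and `Forbidden` are hypothetical stand-in classes and `sample_fails` holds three of the IDs from the output above; on the real run the same counting would apply to `tweetfetch_fails`.

```python
from collections import Counter


# Hypothetical stand-ins for tweepy's exception classes, so the sketch runs on its own.
class NotFound(Exception):
    pass


class Forbidden(Exception):
    pass


# Three entries copied from the failure output above.
sample_fails = {
    888202515573088257: NotFound("404 Not Found"),
    877611172832227328: Forbidden("403 Forbidden"),
    873697596434513921: NotFound("404 Not Found"),
}

# Count how many tweets failed for each kind of error.
failure_counts = Counter(type(error).__name__ for error in sample_fails.values())

print(failure_counts)
```
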

In [50]:
# Reading our saved tweet_json file

tweets_fetched = []

with open("tweet_json.txt", "r") as tweet_json:
    # Each line of the file holds one JSON object
    for line in tweet_json:
        tweets_fetched.append(json.loads(line.strip()))
        
tweets_fetched[0:5]

[{'id': '892420643555336193',
  'retweet_count': '6877',
  'favorite_count': '32906'},
 {'id': '892177421306343426',
  'retweet_count': '5179',
  'favorite_count': '28436'},
 {'id': '891815181378084864',
  'retweet_count': '3422',
  'favorite_count': '21373'},
 {'id': '891689557279858688',
  'retweet_count': '7086',
  'favorite_count': '35873'},
 {'id': '891327558926688256',
  'retweet_count': '7598',
  'favorite_count': '34311'}]

In [51]:
df_tweets_fetched = pd.DataFrame(tweets_fetched)
df_tweets_fetched.head()

| | id | retweet_count | favorite_count |
|---|---|---|---|
| 0 | 892420643555336193 | 6877 | 32906 |
| 1 | 892177421306343426 | 5179 | 28436 |
| 2 | 891815181378084864 | 3422 | 21373 |
| 3 | 891689557279858688 | 7086 | 35873 |
| 4 | 891327558926688256 | 7598 | 34311 |
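
Because every value was written to the file with `str(...)`, the columns of this frame hold strings rather than numbers. A minimal sketch of converting them back to integers follows, using the first two records shown above. Renaming `id` to `tweet_id` is an assumption, made only so the column would match the archive's column name, and `sample_records` and `df_counts` are illustrative names.

```python
import pandas as pd

# First two records from the output above; the counts were saved as strings.
sample_records = [
    {"id": "892420643555336193", "retweet_count": "6877", "favorite_count": "32906"},
    {"id": "892177421306343426", "retweet_count": "5179", "favorite_count": "28436"},
]
df_counts = pd.DataFrame(sample_records)

# Cast to 64-bit integers; tweet IDs are far too large for 32-bit integers.
df_counts = df_counts.astype(
    {"id": "int64", "retweet_count": "int64", "favorite_count": "int64"}
)

# Assumed rename so the key column matches the archive's `tweet_id` column.
df_counts = df_counts.rename(columns={"id": "tweet_id"})

print(df_counts.dtypes)
```
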
