## Wrangling @dog_rates (WeRateDogs) tweets and related data

#### To do:
- Gather data from a variety of sources and in different formats
    * Import provided `twitter-archive-enhanced.csv` file from disk
    * Download the `image-predictions.tsv` file from Udacity using 
        the __requests__ library
    * Call the Twitter API to get full JSON objects for each `tweet_id` 
        in the provided archive and store them in a single txt file
- Assess and fix quality and tidiness of data
    * Find issues with the data
    * Fix them
- Showcase/analyze the complete, clean, and tidy dataset
    * Analyze the dataset to find any interesting relationships
    * Visualize the findings

In [30]:
import pandas as pd
import numpy as np
import json
import requests
import tweepy
from io import StringIO

In [71]:
# Check that 18-digit tweet IDs fit in an int64, whose maximum value has 19 digits
int64_info = np.iinfo(np.int64)
print(len(str(int64_info.max)), len('881607037314052096'))




19 18


In [72]:
# Read in the 'twitter-archive-enhanced.csv' file, keeping the ID columns as
# strings and treating empty strings and 'None' as missing values
df = pd.read_csv('twitter-archive-enhanced.csv',
                 dtype={'in_reply_to_status_id': 'object', 'in_reply_to_user_id': 'object',
                        'retweeted_status_id': 'object', 'retweeted_status_user_id': 'object'},
                 parse_dates=['timestamp', 'retweeted_status_timestamp'],
                 na_values=['', 'None'])
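
One likely reason for reading the ID columns as strings (an assumption; the notebook does not state it) is that pandas parses a numeric column with missing values as float64, and float64 cannot hold every 18-digit ID exactly. A minimal sketch of the effect, using a `tweet_id` value from the archive:

```python
# float64 has a 53-bit significand, so large integers may not round-trip
tweet_id = 892420643555336193  # a tweet_id value from the archive

as_float = float(tweet_id)     # what a float64 column would store
round_trip = int(as_float)

print(tweet_id > 2**53)        # True: beyond the exactly representable range
print(round_trip == tweet_id)  # False: the low digits were lost
```

Keeping the IDs as strings (or as integers where no values are missing) avoids this silent rounding.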

In [88]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null object
in_reply_to_user_id           78 non-null object
timestamp                     2356 non-null datetime64[ns, UTC]
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null object
retweeted_status_user_id      181 non-null object
retweeted_status_timestamp    181 non-null datetime64[ns, UTC]
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          1611 non-null object
doggo                         97 non-null object
floofer                       10 non-null object
pupper                        257 non-null object
puppo                         30 non-null object
dtypes: datetime6

In [4]:
# Get the 'image-predictions.tsv' file from Udacity
predictions_link = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'

# Wrap in a try/except block in case requests raises any exceptions;
# raise_for_status() turns HTTP error codes (4xx/5xx) into exceptions too
try:
    r = requests.get(predictions_link)
    r.raise_for_status()
except requests.exceptions.RequestException as e:
    print(e)

In [5]:
# Needed to convert the string output of r.text to a StringIO object
# Based on https://stackoverflow.com/questions/22604564/create-pandas-dataframe-from-a-string
predictions_data = StringIO(r.text)

# Read in predictions_data as a tab-separated file
image_predictions = pd.read_csv(predictions_data, sep='\t')
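
The `StringIO` step can be tried without a network connection. The sketch below uses a short tab-separated string as a stand-in for `r.text`; the sample rows and URLs are made up for illustration, and only a few of the real column names are used.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for r.text: a header line plus two tab-separated rows
sample_text = (
    "tweet_id\tjpg_url\tp1\tp1_conf\tp1_dog\n"
    "1\thttps://example.com/a.jpg\tgolden_retriever\t0.9\tTrue\n"
    "2\thttps://example.com/b.jpg\tbagel\t0.5\tFalse\n"
)

# StringIO wraps the string in a file-like object that read_csv accepts
sample_predictions = pd.read_csv(StringIO(sample_text), sep='\t')
print(sample_predictions.dtypes)
```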

In [6]:
image_predictions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [7]:
# Read in the app auth info needed for tweepy (one value per line, in the order below)
with open('twitter_keys.txt', 'r') as file:
    consumer_key = file.readline().rstrip()
    consumer_secret = file.readline().rstrip()
    access_token = file.readline().rstrip()
    access_secret = file.readline().rstrip()

# Setup tweepy for use
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth)

In [8]:
# Make tweepy sleep when the rate limit is reached and print a notice when it does
api.wait_on_rate_limit = True
api.wait_on_rate_limit_notify = True

In [104]:
# Attempt to get all the data for the tweets in the 'twitter-archive-enhanced.csv' file
for tweet_id in df.tweet_id:
    # Wrap the get_status call in a try/except block to handle deleted tweets
    # and any other possible exceptions
    try:
        tweet = api.get_status(tweet_id, tweet_mode='extended')
    except Exception:
        # On exception continue to next loop iteration
        continue
    with open('tweet_json.txt', 'a+') as tweet_file:
        # Write the tweet JSON to the 'tweet_json.txt' file, followed by
        # a newline so each tweet is on a separate line
        json.dump(tweet._json, tweet_file)
        tweet_file.write('\n')

Rate limit reached. Sleeping for: 33
Rate limit reached. Sleeping for: 770
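
Once `tweet_json.txt` exists, each line can be parsed back into a dictionary and the fields of interest collected into a DataFrame. The following is a sketch under assumptions: it writes two made-up tweets to a temporary file so that it runs on its own, and keeping `retweet_count` and `favorite_count` is an illustrative choice, not something the notebook specifies.

```python
import json
import os
import tempfile

import pandas as pd

# Made-up stand-ins for the tweet objects stored one per line
sample_tweets = [
    {'id': 1, 'retweet_count': 10, 'favorite_count': 50},
    {'id': 2, 'retweet_count': 3, 'favorite_count': 7},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'tweet_json_sample.txt')

    # Write one JSON object per line, mirroring the loop above
    with open(path, 'w') as f:
        for t in sample_tweets:
            json.dump(t, f)
            f.write('\n')

    # Read the file back line by line and keep only the chosen fields
    rows = []
    with open(path) as f:
        for line in f:
            tweet = json.loads(line)
            rows.append({'tweet_id': tweet['id'],
                         'retweet_count': tweet['retweet_count'],
                         'favorite_count': tweet['favorite_count']})

tweet_counts = pd.DataFrame(rows)
print(tweet_counts)
```

Building a list of dictionaries first and constructing the DataFrame once tends to be faster than appending rows to a DataFrame inside the loop.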


In [74]:
df.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56+00:00,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,NaT,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27+00:00,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,NaT,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03+00:00,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,NaT,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51+00:00,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,NaT,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24+00:00,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,NaT,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


In [11]:
df[~(df.source.duplicated())].text[0]

"This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU"

In [75]:
# See if any of the 'names' appear too often
df.name.value_counts()

a          55
Charlie    12
Cooper     11
Lucy       11
Oliver     11
           ..
Dot         1
Rilo        1
Blanket     1
Gordon      1
Tedrick     1
Name: name, Length: 956, dtype: int64

In [45]:
# tweets that have extracted a 'name' of "a" should probably be set to None/NaN
df[df.name == 'a'].text[2352]

'This is a purebred Piers Morgan. Loves to Netflix and chill. Always looks like he forgot to unplug the iron. 6/10 https://t.co/DWnyCjf2mx'
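
One possible way to clear such values is sketched below on a small made-up sample. It assumes that any all-lowercase "name" is an extraction error; only `a` is confirmed as a bad value here, so that rule should be checked against the real data before it is applied.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; only 'a' is confirmed as a bad value in the archive
sample = pd.DataFrame({'name': ['Phineas', 'a', 'Tilly', np.nan, 'a']})

# Flag entries made up only of lowercase letters; na=False leaves NaN unflagged
is_lowercase = sample['name'].str.match(r'[a-z]+$', na=False)

# Series.mask replaces the flagged entries with NaN
sample['name'] = sample['name'].mask(is_lowercase)
print(sample['name'].tolist())
```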

In [86]:
# Show only the tweets that are replies
df[df.in_reply_to_user_id.notna()]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
30,886267009285017600,8.862663570751283e+17,2281181600.0,2017-07-15 16:51:35+00:00,"<a href=""http://twitter.com/download/iphone"" r...",@NonWhiteHat @MayhewMayhem omg hello tanner yo...,,,NaT,,12,10,,,,,
55,881633300179243008,8.816070373140521e+17,47384430.0,2017-07-02 21:58:53+00:00,"<a href=""http://twitter.com/download/iphone"" r...",@roushfenway These are good dogs but 17/10 is ...,,,NaT,,17,10,,,,,
64,879674319642796034,8.795538273341727e+17,3105440746.0,2017-06-27 12:14:36+00:00,"<a href=""http://twitter.com/download/iphone"" r...",@RealKentMurphy 14/10 confirmed,,,NaT,,14,10,,,,,
113,870726314365509632,8.707262027424932e+17,16487760.0,2017-06-02 19:38:25+00:00,"<a href=""http://twitter.com/download/iphone"" r...",@ComplicitOwl @ShopWeRateDogs &gt;10/10 is res...,,,NaT,,10,10,,,,,
148,863427515083354112,8.634256455687741e+17,77596200.0,2017-05-13 16:15:35+00:00,"<a href=""http://twitter.com/download/iphone"" r...",@Jack_Septic_Eye I'd need a few more pics to p...,,,NaT,,12,10,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2038,671550332464455680,6.715448741650022e+17,4196983835.0,2015-12-01 04:44:10+00:00,"<a href=""http://twitter.com/download/iphone"" r...",After 22 minutes of careful deliberation this ...,,,NaT,,1,10,,,,,
2149,669684865554620416,6.693543826270495e+17,4196983835.0,2015-11-26 01:11:28+00:00,"<a href=""http://twitter.com/download/iphone"" r...",After countless hours of research and hundreds...,,,NaT,,11,10,,,,,
2169,669353438988365824,6.678064545737605e+17,4196983835.0,2015-11-25 03:14:30+00:00,"<a href=""http://twitter.com/download/iphone"" r...",This is Tessa. She is also very pleased after ...,,,NaT,https://twitter.com/dog_rates/status/669353438...,10,10,Tessa,,,,
2189,668967877119254528,6.689207171325829e+17,21435658.0,2015-11-24 01:42:25+00:00,"<a href=""http://twitter.com/download/iphone"" r...",12/10 good shit Bubka\r\n@wane15,,,NaT,,12,10,,,,,


In [94]:
df[df.tweet_id == 671550332464455680].text

2038    After 22 minutes of careful deliberation this ...
Name: text, dtype: object

## Issues Found

#### Data Quality
1. There are re-tweets in the dataset
2. There are replies in the dataset (could be an issue)
3. Not all the tweets have a `name` for the dog(s)
    - and the name `a` is present on 55 of the tweets
4. The `source` column contains an entire HTML tag instead of just the text that denotes which client posted the tweet
5. 
6. 
7. 
8. 
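
A rough sketch of how items 1, 2 and 4 might be addressed, shown on a made-up three-row sample. The full `source` tag is an assumed example (the displayed values are truncated), and dropping replies is only appropriate if the analysis should cover original ratings alone.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the complete `source` tag is an assumed example value
tag = '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'
sample = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'in_reply_to_status_id': [np.nan, '10', np.nan],
    'retweeted_status_id': [np.nan, np.nan, '20'],
    'source': [tag, tag, tag],
})

# Items 1 and 2: keep only rows that are neither retweets nor replies
originals = sample[sample['retweeted_status_id'].isna()
                   & sample['in_reply_to_status_id'].isna()].copy()

# Item 4: keep only the text between the opening and closing <a> tags
originals['source'] = originals['source'].str.extract(r'>([^<]+)<', expand=False)
print(originals[['tweet_id', 'source']])
```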


#### Data Tidiness
1. The dog breed predictions are in a separate table
2. 
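
For the first tidiness item, the two tables could be joined on `tweet_id`. The sketch below uses tiny made-up frames; the left join is an assumption that keeps every archive row, which matters because the predictions table (2075 rows) is smaller than the archive (2356 rows).

```python
import pandas as pd

# Hypothetical miniature versions of the archive and predictions tables
archive = pd.DataFrame({'tweet_id': [1, 2, 3], 'rating_numerator': [13, 12, 10]})
predictions = pd.DataFrame({'tweet_id': [1, 3], 'p1': ['golden_retriever', 'pug']})

# Left join keeps every archive row; tweets without a prediction get NaN
merged = archive.merge(predictions, on='tweet_id', how='left')
print(merged)
```

Using `how='inner'` instead would keep only the tweets that have an image prediction.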