# WeRateDogs Analysis Project

The purpose of this project is to scrape, clean, and analyze tweets from the WeRateDogs Twitter account.

In [11]:
"""Setting up the environment"""
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import requests
import io
import tweepy
import json

## Gather

### Loading twitter archives

The WeRateDogs Twitter archive is provided as a CSV file, `twitter-archive-enhanced.csv`, and is loaded from the local file system.

In [14]:
tweets = pd.read_csv('twitter-archive-enhanced.csv')
tweets.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


In [15]:
tweets.shape

(2356, 17)

### Loading image predictions from the web

The tweet image predictions record what breed of dog (or other object, animal, etc.) a neural network detects in each tweet's image. The file (`image_predictions.tsv`) is hosted on Udacity's servers and is downloaded programmatically using the Requests library. It contains the top three predictions for each image, along with the ID of the tweet that contains the image.

In [16]:
url = "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv"
response = requests.get(url)
content = response.content
predictions = pd.read_csv(io.StringIO(content.decode('utf-8')), sep="\t")
predictions.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


In [17]:
predictions.shape

(2075, 12)
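
As a minimal sketch of the parsing step above: `pd.read_csv` also accepts a binary buffer, so the downloaded bytes can be wrapped in `io.BytesIO` without decoding them first. The helper name `parse_predictions` and the two sample rows are hypothetical and only illustrate the idea. In the notebook, `response.content` would be passed in, ideally after checking the request with `response.raise_for_status()`.

```python
import io

import pandas as pd


def parse_predictions(raw: bytes) -> pd.DataFrame:
    """Parse tab-separated bytes (e.g. response.content) into a DataFrame."""
    return pd.read_csv(io.BytesIO(raw), sep="\t")


# Hypothetical stand-in for response.content, using a subset of the real columns
sample = (
    b"tweet_id\timg_num\tp1\tp1_conf\tp1_dog\n"
    b"1\t1\tredbone\t0.5\tTrue\n"
    b"2\t1\tcollie\t0.25\tFalse\n"
)
sample_predictions = parse_predictions(sample)
print(sample_predictions.shape)
```

Keeping the parsing in a small function makes it possible to check it against a few hand-written rows before running it on the real download.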

### Loading tweet metadata from Twitter

To load extended metadata for each tweet, we use the Tweepy library to query the Twitter API. Because Twitter rate-limits API calls (the v1.1 API counts requests per 15-minute window), we initialize the API client with the `wait_on_rate_limit` and `wait_on_rate_limit_notify` parameters set. Tweepy then sleeps until the rate limit replenishes whenever we reach it, and prints a message when it does so. Not setting these parameters and repeatedly exceeding the rate limit might cause the account to be blocked. Note that `wait_on_rate_limit_notify` is a Tweepy 3.x parameter and was removed in Tweepy 4.

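In [23]:
"""Connecting to the Twitter API via Tweepy"""
import os

# Credentials are read from environment variables instead of being hard-coded.
# The variable names are assumptions; use whatever names are set on your system.
api_key = os.environ["TWITTER_API_KEY"]
api_secret = os.environ["TWITTER_API_SECRET"]

auth = tweepy.OAuthHandler(api_key, api_secret)

# Sleep and print a notice whenever the rate limit is reached (Tweepy 3.x)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)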

Extended tweet content can be accessed via the method `get_status(tweet_id)`. We call it for every ID in the column `tweets.tweet_id` and store the complete retrieved JSON in a separate file.

In [21]:
"""Example of a response from the Tweepy get_status method"""
r = api.get_status("666020888022790149")
r._json

{'created_at': 'Sun Nov 15 22:32:08 +0000 2015',
 'id': 666020888022790149,
 'id_str': '666020888022790149',
 'text': 'Here we have a Japanese Irish Setter. Lost eye in Vietnam (?). Big fan of relaxing on stair. 8/10 would pet https://t.co/BLDqew2Ijj',
 'truncated': False,
 'entities': {'hashtags': [],
  'symbols': [],
  'user_mentions': [],
  'urls': [],
  'media': [{'id': 666020881337073664,
    'id_str': '666020881337073664',
    'indices': [108, 131],
    'media_url': 'http://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg',
    'media_url_https': 'https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg',
    'url': 'https://t.co/BLDqew2Ijj',
    'display_url': 'pic.twitter.com/BLDqew2Ijj',
    'expanded_url': 'https://twitter.com/dog_rates/status/666020888022790149/photo/1',
    'type': 'photo',
    'sizes': {'medium': {'w': 960, 'h': 720, 'resize': 'fit'},
     'thumb': {'w': 150, 'h': 150, 'resize': 'crop'},
     'large': {'w': 960, 'h': 720, 'resize': 'fit'},
     'small': {'w': 680, 'h': 510,

We now call the API method to retrieve extended tweet metadata and write the obtained JSON to the file `tweet_json.txt`. For this purpose, the obtained JSON needs to be converted to a writable JSON string. This can be done by creating a JSON dump via the `json` library. Furthermore, we need to ensure that the request succeeded and that tweet data was retrieved.

When a `TweepError` is raised because Twitter responded with an error, the error code (as described in the API documentation) can be accessed at `TweepError.response.text`. Note, however, that a `TweepError` may also be raised with something else as its message (for example, a plain error reason string), in which case no response is attached.
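
The distinction between the two kinds of error can be sketched without calling Twitter at all. The classes `FakeResponse` and `FakeTweepError` and the helper `error_message` below are hypothetical stand-ins written for illustration, assuming that the real exception carries a `response` attribute that is `None` when Twitter sent no error response.

```python
class FakeResponse:
    """Hypothetical stand-in for the HTTP response attached to an error."""

    def __init__(self, text):
        self.text = text


class FakeTweepError(Exception):
    """Hypothetical stand-in for tweepy.TweepError; response is optional."""

    def __init__(self, reason, response=None):
        super().__init__(reason)
        self.response = response


def error_message(error):
    """Return the response body if there is one, else the plain reason."""
    response = getattr(error, "response", None)
    if response is not None:
        return response.text
    return str(error)


# Sample payloads are made up for this sketch
with_response = FakeTweepError("Not found", FakeResponse('{"errors": [{"code": 144}]}'))
without_response = FakeTweepError("Failed to send request")

print(error_message(with_response))
print(error_message(without_response))
```

Falling back to `str(error)` means an error is always logged, even when there is no response body to read.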


In [22]:
"""Retrieve extended tweet metadata and write the obtained JSON to file. In case of an error, write the
error message to a separate file."""
error_ids = []
with open("tweet_json.txt", 'a') as outfile, open("twitter_errors.txt", 'a') as errorfile:
    for tweet_id in tweets.tweet_id:
        try:
            tweet = api.get_status(tweet_id)
            # One JSON object per line, so the file can be read back line-by-line
            outfile.write(json.dumps(tweet._json))
            outfile.write('\n')
        except tweepy.TweepError as error:
            error_ids.append(tweet_id)
            # error.response is None when Twitter did not send an error response
            message = error.response.text if error.response is not None else str(error)
            errorfile.write(message)
            errorfile.write('\n')
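
The one-object-per-line layout written by the cell above can be tried in isolation. This is a minimal sketch that uses an in-memory buffer instead of `tweet_json.txt`; the two sample records are made up. Because `json.dumps` escapes newlines inside strings, each record stays on a single line even when the tweet text itself contains a line break.

```python
import io
import json

# Hypothetical sample records; the first one has a newline inside its text
records = [
    {"id": 1, "text": "first\ntweet"},
    {"id": 2, "text": "second tweet"},
]

# Write one JSON document per line, as the cell above does with a real file
buffer = io.StringIO()
for record in records:
    buffer.write(json.dumps(record))
    buffer.write("\n")

# Read the documents back line by line
buffer.seek(0)
restored = [json.loads(line) for line in buffer]
print(restored == records)
```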

We then read the file line by line and load the obtained JSON into a pandas DataFrame. As the JSON contains nested data, we flatten it using `pd.json_normalize`. We then restrict the DataFrame to the relevant columns.

In [94]:
"""Read tweet_json.txt line-by-line and write the tweet data into a dataframe"""
columns = ['id','retweet_count','favorite_count','retweeted','geo','coordinates','created_at']
frames = []
with open("tweet_json.txt", 'r', encoding="utf-8") as infile:
    for line in infile:
        obj = json.loads(line)
        frames.append(pd.json_normalize(obj)[columns])
# DataFrame.append was removed in pandas 2.0; concatenate all rows once instead
tweets_extended = pd.concat(frames, ignore_index=True)

In [96]:
tweets_extended.shape

(871, 7)
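
To illustrate the flattening step, here is a minimal sketch of what `pd.json_normalize` does with nested data. The sample tweet is made up and much smaller than a real API response. Nested keys are joined with a dot, so the `user` object becomes the columns `user.id` and `user.screen_name`.

```python
import pandas as pd

# Hypothetical, heavily shortened tweet object
sample_tweet = {
    "id": 1,
    "retweet_count": 5,
    "favorite_count": 10,
    "user": {"id": 42, "screen_name": "example_user"},
}

# A single dict yields a one-row DataFrame with flattened column names
flat = pd.json_normalize(sample_tweet)
print(sorted(flat.columns))

# Restrict the flattened frame to the columns of interest
selected = flat[["id", "retweet_count", "favorite_count"]]
print(selected.shape)
```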

## Assess

## Clean

#### Define

#### Code

#### Test