In [27]:
# data gathering imports
import config
import os
import requests
import time
import tweepy
import json

# standard data manipulation libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Data Gathering
Using the tweet IDs from the WeRateDogs Twitter archive, I query the Twitter API for each tweet's JSON data and write the data to a text file.

In [28]:
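# import WeRateDogs twitter archive (provided by Udacity)
archive = pd.read_csv('twitter-archive-enhanced.csv')

# download image predictions (provided by Udacity) and save them locally;
# the URL is public, so no authentication is passed
preds_url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(preds_url)
response.raise_for_status()
with open('image-predictions.tsv', 'wb') as tsv_file:
    tsv_file.write(response.content)

# load the tab-separated predictions into a DataFrame
preds = pd.read_csv('image-predictions.tsv', sep='\t')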

In [29]:
# Create API object to gather twitter data
consumer_key = config.CONSUMER_KEY
consumer_secret = config.CONSUMER_SECRET
access_token = config.ACCESS_TOKEN
access_secret = config.ACCESS_SECRET

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth, 
                 parser=tweepy.parsers.JSONParser(), 
                 wait_on_rate_limit = True, 
                 wait_on_rate_limit_notify = True)

In [31]:
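# query the Twitter API for each tweet's JSON data
# and write one JSON object per line so the file can be read back line by line
with open('tweet_json.txt', 'w') as outfile:
    for num in archive['tweet_id']:
        try:
            tweet = api.get_status(num, tweet_mode='extended')
            # no indent: pretty-printing would spread one tweet over many lines
            json.dump(tweet, outfile, sort_keys=True)
            outfile.write('\n')
        except tweepy.TweepError:
            print(f"Tweet ID: {num} no longer exists.")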

Tweet ID: 888202515573088257 no longer exists.
Tweet ID: 873697596434513921 no longer exists.
Tweet ID: 872668790621863937 no longer exists.
Tweet ID: 872261713294495745 no longer exists.
Tweet ID: 869988702071779329 no longer exists.
Tweet ID: 866816280283807744 no longer exists.
Tweet ID: 861769973181624320 no longer exists.
Tweet ID: 856602993587888130 no longer exists.
Tweet ID: 851953902622658560 no longer exists.
Tweet ID: 845459076796616705 no longer exists.
Tweet ID: 844704788403113984 no longer exists.
Tweet ID: 842892208864923648 no longer exists.
Tweet ID: 837366284874571778 no longer exists.
Tweet ID: 837012587749474308 no longer exists.
Tweet ID: 829374341691346946 no longer exists.
Tweet ID: 827228250799742977 no longer exists.
Tweet ID: 812747805718642688 no longer exists.
Tweet ID: 802247111496568832 no longer exists.
Tweet ID: 779123168116150273 no longer exists.
Tweet ID: 775096608509886464 no longer exists.
Tweet ID: 771004394259247104 no longer exists.
Tweet ID: 770

Rate limit reached. Sleeping for: 631


Tweet ID: 754011816964026368 no longer exists.
Tweet ID: 680055455951884288 no longer exists.


Rate limit reached. Sleeping for: 767


In [32]:
# Load the JSON data line by line (expects one JSON object per line)
# and build a pandas DataFrame from the fields of interest
data = []
with open('tweet_json.txt') as json_file:
    for row in json_file:
        json_data = json.loads(row)
        data.append({"tweet_id":json_data["id"],
                   "favorites":json_data["favorite_count"],
                   "retweets":json_data["retweet_count"],
                   "timestamp":json_data["created_at"]})

tweets_df = pd.DataFrame(data, columns=['tweet_id',
                                       'favorites',
                                       'retweets',
                                       'timestamp'])

JSONDecodeError: Expecting property name enclosed in double quotes: line 2 column 1 (char 2)
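
The error above is consistent with the file having been written with `indent=4`: each tweet then spans many lines, so the first line read back is just `{`, which is not valid JSON on its own. The following is a minimal sketch of the difference, using in-memory buffers and two made-up tweet dictionaries (the values are hypothetical, only the key names come from the cell above).

```python
import io
import json

# Hypothetical stand-ins for the payloads returned by the API.
tweets = [
    {"id": 1, "favorite_count": 10, "retweet_count": 2,
     "created_at": "Tue Aug 01 16:23:56 +0000 2017"},
    {"id": 2, "favorite_count": 5, "retweet_count": 1,
     "created_at": "Tue Aug 01 00:17:27 +0000 2017"},
]

# Pretty-printed output: each object covers several lines.
pretty = io.StringIO()
for tweet in tweets:
    json.dump(tweet, pretty, sort_keys=True, indent=4)
pretty.seek(0)

try:
    json.loads(pretty.readline())  # the first line is only "{"
    pretty_ok = True
except json.JSONDecodeError:
    pretty_ok = False

# Compact output: exactly one object per line (JSON Lines style).
compact = io.StringIO()
for tweet in tweets:
    compact.write(json.dumps(tweet, sort_keys=True) + "\n")
compact.seek(0)

parsed = [json.loads(line) for line in compact if line.strip()]
print(pretty_ok, len(parsed))
```

Writing one object per line keeps the line-by-line reader simple; the alternative would be to keep the pretty-printed file and parse it with a streaming decoder, which is more involved.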

## Data Assessing
After gathering the data, I assess them visually and programmatically for quality and tidiness issues. I detect and document at least eight (8) quality issues and two (2) tidiness issues.
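
As an illustration of what programmatic assessment can look like, here is a minimal sketch on a toy DataFrame. The column names and values are assumptions made up for the example, and `assess` is a hypothetical helper, not part of pandas.

```python
import pandas as pd

# Toy stand-in for the archive; columns and values are illustrative only.
sample = pd.DataFrame({
    "tweet_id": [1, 2, 2, 3],
    "name": ["Charlie", "a", "a", None],
    "rating_denominator": [10, 10, 10, 0],
})

def assess(df):
    """Collect a few summary checks often used to spot quality issues."""
    return {
        "n_rows": len(df),
        "duplicate_ids": int(df["tweet_id"].duplicated().sum()),
        "missing_per_column": df.isna().sum().to_dict(),
        "dtypes": df.dtypes.astype(str).to_dict(),
    }

report = assess(sample)
print(report)
```

Checks like these complement visual assessment: duplicated IDs and missing values show up in the counts, while implausible values (a name of "a", a denominator of 0) are easier to catch by scanning rows or value counts.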

### Assessment

Key points to keep in mind when data wrangling for this project:

- You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
- Assessing and cleaning the entire dataset completely would require a lot of time, and is not necessary to practice and demonstrate your skills in data wrangling. Therefore, the requirements of this project are only to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset.
- Cleaning includes merging individual pieces of data according to the rules of tidy data.
- The fact that the rating numerators are greater than the denominators does not need to be cleaned. This unique rating system is a big part of the popularity of WeRateDogs.
- You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.
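
The first, third, and last points above can be sketched as a filter followed by a merge. This is only an illustration on toy data: the column names `retweeted_status_id` and `jpg_url`, and the choice to apply the August 1st, 2017 cutoff at all, are assumptions for the example.

```python
import pandas as pd

# Toy stand-ins for the archive and the image predictions (values are made up).
archive_demo = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "timestamp": ["2017-07-30 12:00:00+00:00", "2017-07-31 12:00:00+00:00",
                  "2017-08-05 12:00:00+00:00", "2017-06-01 12:00:00+00:00"],
    "retweeted_status_id": [None, 99.0, None, None],  # non-null marks a retweet
})
preds_demo = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "jpg_url": ["a.jpg", "b.jpg", "c.jpg"],
})

archive_demo["timestamp"] = pd.to_datetime(archive_demo["timestamp"], utc=True)
cutoff = pd.Timestamp("2017-08-02", tz="UTC")  # keeps tweets through August 1st, 2017

# Keep original tweets only (no retweets) that fall before the cutoff.
originals = archive_demo[
    archive_demo["retweeted_status_id"].isna()
    & (archive_demo["timestamp"] < cutoff)
]

# An inner merge keeps only tweets that also have an image prediction row.
clean = originals.merge(preds_demo, on="tweet_id", how="inner")
print(clean[["tweet_id", "jpg_url"]])
```

In this toy data only tweet 1 survives: tweet 2 is a retweet, tweet 3 is past the cutoff, and tweet 4 has no image prediction.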

## Data Cleaning
Now I will clean each of the issues I documented above.

## Analyze and Visualize
I analyze and visualize the wrangled data, producing three (3) insights and one (1) visualization.

### Insights