# Wrangle and Analyze Dog Tweets Data

## Intro

## Setup
First, import all necessary modules and libraries:

In [1]:
# JSON encoder and decoder, used to read and write JSON data
import json

# Numpy and pandas
import numpy as np
import pandas as pd

# Perform http requests
import requests

# Query Twitter API
import tweepy

# Time access and conversions module providing various time-related functions
import time

## Gather

### The WeRateDogs Twitter archive
The WeRateDogs Twitter archive contains basic tweet data for their 5000+ tweets. The archive does not contain every piece of information about each tweet. However, it contains each tweet's text, from which the author of the Data Wrangling course programmatically extracted information such as the rating, dog name and dog stage. In addition, the archive has been filtered to keep only tweets that contain a rating.

I have downloaded the file manually and stored it in the same folder as this Jupyter Notebook. Next, I'll load the data into a Pandas dataframe in order to assess and clean data quality and tidiness issues, and to create visualizations.

The "enhanced" WeRateDogs Twitter archive is a standard csv ([comma-separated values](https://en.wikipedia.org/wiki/Comma-separated_values)) file that can simply be loaded into a Pandas dataframe using the Pandas [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) function:

In [2]:
# Load the enhanced WeRateDogs Twitter archive data into tw_archive dataframe
tw_archive = pd.read_csv('twitter-archive-enhanced.csv')
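The archive's rating, name and stage columns were extracted from the tweet text by the course author; how exactly is not shown here. As a rough illustration only, a rating of the form "13/10" could be pulled out of a text with a regular expression. The pattern, the function name `extract_rating` and the sample sentences below are assumptions made for this sketch, not the method actually used to build the archive.

```python
import re

# Hypothetical pattern: a (possibly decimal) numerator, a slash, an integer denominator
RATING_PATTERN = re.compile(r"(\d+(?:\.\d+)?)/(\d+)")


def extract_rating(text):
    """Return (numerator, denominator) for the first 'x/y' found in text, else None."""
    match = RATING_PATTERN.search(text)
    if match is None:
        return None
    return float(match.group(1)), int(match.group(2))


# Made-up example texts
print(extract_rating("Meet Sam. 13/10 would pet"))
print(extract_rating("No rating in this one"))
```

Because only the first match is returned, texts containing other fractions (for example a date or "24/7") would need extra handling.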

### The tweet image predictions
The second part of the data contains predictions of each dog's breed, made by a neural network from the dog images in the tweets. This file is available on Udacity servers and can be downloaded from [here](https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv).

I'll download the file programmatically using the Requests library and load it into a Pandas dataframe using the Pandas read_csv function. It is a standard tsv ([tab-separated values](https://en.wikipedia.org/wiki/Tab-separated_values)) file, so I can set the sep parameter to split on tabs instead of the default commas.

In [3]:
# Download and save the tweet image predictions file
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'

response = requests.get(url)
# Print the response code
print(response)
# Stop here on an HTTP error instead of saving an error page
response.raise_for_status()

# Write the response content into a file
with open('image-predictions.tsv', mode='wb') as file:
    file.write(response.content)

<Response [200]>
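When the notebook is re-run, the download above is repeated each time. A minimal sketch of a helper that skips the request when the file is already on disk is shown below. The name `download_file` and its `get` and `overwrite` parameters are hypothetical; `get` exists only so the helper can be tried offline with a stand-in for `requests.get`.

```python
from pathlib import Path


def download_file(url, path, get=None, overwrite=False):
    """Save the body of `url` to `path` and return the path.

    `get` is a hypothetical injection point: any callable that behaves like
    requests.get (returns an object with .raise_for_status() and .content).
    """
    path = Path(path)
    # Skip the request when a previous run already saved the file
    if path.exists() and not overwrite:
        return path
    if get is None:
        # Imported lazily so the helper can be exercised offline with a stub
        import requests
        get = requests.get
    response = get(url)
    # Raise an HTTPError for 4xx/5xx responses instead of saving an error page
    response.raise_for_status()
    path.write_bytes(response.content)
    return path
```

Called as `download_file(url, 'image-predictions.tsv')`, it would fetch the file once and reuse the saved copy on later runs.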


In [4]:
# Load the tweet image predictions into img_predictions dataframe
img_predictions = pd.read_csv('image-predictions.tsv', sep='\t')
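To see what the `sep` parameter changes, the sketch below parses the same tab-separated text twice, once with `sep='\t'` and once with the default comma. The sample rows and column names are made up for illustration and are not taken from the real predictions file.

```python
import io

import pandas as pd

# Made-up sample in tab-separated format; the column names are illustrative only
sample_tsv = (
    "tweet_id\tprediction\tconfidence\n"
    "1\tgolden_retriever\t0.91\n"
    "2\tpug\t0.55\n"
)

# With sep='\t' every tab starts a new column
parsed = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

# With the default comma separator each line ends up in a single column
unparsed = pd.read_csv(io.StringIO(sample_tsv))

print(parsed.shape)    # (2, 3)
print(unparsed.shape)  # (2, 1)
```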

### Additional tweets data
The last pieces of information (each tweet's retweet count, favorite ("like") count and other potentially interesting data) can be obtained from the [Twitter API](https://developer.twitter.com/en/docs) using the [Tweepy](http://www.tweepy.org/) library.

Since the Twitter API requires authorization, I've set up a Twitter application in order to query it. I stored my consumer key and consumer secret, as well as my access token and access secret, locally in a separate file to avoid having them directly in the notebook. Let's read in the keys and secrets first to use them later when querying the API:

In [5]:
# Open file with keys and secrets, and store them in variables
with open('credentials.json') as file:
    credentials = json.load(file)
    consumer_key = credentials['consumer_key']
    consumer_secret = credentials['consumer_secret']
    access_token = credentials['access_token']
    access_secret = credentials['access_secret']
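If a key is missing from `credentials.json`, the cell above fails with a bare `KeyError` on whichever lookup comes first. A small sketch of a loader that reports every missing or empty key at once is given below. The function name `load_credentials` and the constant `REQUIRED_KEYS` are hypothetical; the four key names are the ones used above.

```python
import json

# The four keys read from credentials.json in the cell above
REQUIRED_KEYS = ("consumer_key", "consumer_secret", "access_token", "access_secret")


def load_credentials(path):
    """Read a JSON credentials file and check that every expected key is present."""
    with open(path) as file:
        credentials = json.load(file)
    # Collect all absent or empty keys so they can be reported together
    missing = [key for key in REQUIRED_KEYS if not credentials.get(key)]
    if missing:
        raise KeyError("Missing credentials: " + ", ".join(missing))
    return credentials
```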

Get ready for querying the Twitter API:

In [6]:
# Create an OAuthHandler instance using consumer key and secret, 
# and set access token using access token and secret
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

# Create an API instance
# (wait_on_rate_limit_notify is a Tweepy 3.x parameter; it was removed in Tweepy 4.0)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

Next, get each tweet's data using the [get_status()](http://docs.tweepy.org/en/latest/api.html#status-methods) method, as shown in [Twitter API - get tweets with specific id](https://stackoverflow.com/questions/28384588/twitter-api-get-tweets-with-specific-id), and store the retrieved JSON content in the file `tweet_json.txt`. To track progress, the tweet id and the elapsed time ([How to measure elapsed time in Python](https://stackoverflow.com/questions/7370801/how-to-measure-elapsed-time-in-python)) are printed as the loop runs. In addition, [try-except blocks](https://wiki.python.org/moin/HandlingExceptions) handle cases where a tweet has been deleted.

In [7]:
# Get the tweet ids from the archive (a pandas Series)
tweet_ids = tw_archive['tweet_id']

# Create empty list to store ids of deleted tweets
deleted_tweets = []

# Set start time
start = time.time()

# Loop over tweet ids
for tweet_id in tweet_ids:
    try:
        # Query the Twitter API with a tweet id and append the status to a file
        tweet = api.get_status(id=tweet_id, tweet_mode='extended')
        with open('tweet_json.txt', 'a') as file:
            json.dump(tweet._json, file)
            file.write('\n')
        # Record time after the query
        end = time.time()
        # Print the tweet id with the elapsed time in seconds
        print('Tweet with id {} queried, {} seconds.'.format(tweet_id, round(end-start, 2)))

    except Exception as e:
        # Any API error lands here; the tweet is assumed to have been deleted
        print('Tweet with id {} has been deleted.'.format(tweet_id))
        # Add id of a deleted tweet into a list
        deleted_tweets.append(tweet_id)

Tweet with id 892420643555336193 queried, 0.61 seconds.
Tweet with id 892177421306343426 queried, 1.23 seconds.
Tweet with id 891815181378084864 queried, 2.0 seconds.
Tweet with id 891689557279858688 queried, 2.65 seconds.
Tweet with id 891327558926688256 queried, 3.3 seconds.
Tweet with id 891087950875897856 queried, 3.93 seconds.
Tweet with id 890971913173991426 queried, 4.9 seconds.
Tweet with id 890729181411237888 queried, 5.49 seconds.
Tweet with id 890609185150312448 queried, 6.08 seconds.
Tweet with id 890240255349198849 queried, 6.72 seconds.
Tweet with id 890006608113172480 queried, 7.33 seconds.
Tweet with id 889880896479866881 queried, 7.95 seconds.
Tweet with id 889665388333682689 queried, 8.52 seconds.
Tweet with id 889638837579907072 queried, 9.12 seconds.
Tweet with id 889531135344209921 queried, 9.73 seconds.
Tweet with id 889278841981685760 queried, 10.38 seconds.
Tweet with id 888917238123831296 queried, 11.0 seconds.
Tweet with id 888804989199671297 queried, 11.59 se

Rate limit reached. Sleeping for: 31


Tweet with id 798925684722855936 queried, 419.93 seconds.
Tweet with id 798705661114773508 queried, 420.53 seconds.
Tweet with id 798701998996647937 queried, 421.11 seconds.
Tweet with id 798697898615730177 queried, 422.02 seconds.
Tweet with id 798694562394996736 queried, 422.61 seconds.
Tweet with id 798686750113755136 queried, 423.21 seconds.
Tweet with id 798682547630837760 queried, 423.85 seconds.
Tweet with id 798673117451325440 queried, 424.44 seconds.
Tweet with id 798665375516884993 queried, 425.39 seconds.
Tweet with id 798644042770751489 queried, 426.24 seconds.
Tweet with id 798628517273620480 queried, 426.79 seconds.
Tweet with id 798585098161549313 queried, 427.36 seconds.
Tweet with id 798576900688019456 queried, 427.99 seconds.
Tweet with id 798340744599797760 queried, 428.89 seconds.
Tweet with id 798209839306514432 queried, 429.48 seconds.
Tweet with id 797971864723324932 queried, 430.09 seconds.
Tweet with id 797545162159308800 queried, 430.69 seconds.
Tweet with id 

Rate limit reached. Sleeping for: 346


Tweet with id 692752401762250755 queried, 1325.42 seconds.
Tweet with id 692568918515392513 queried, 1326.03 seconds.
Tweet with id 692535307825213440 queried, 1326.7 seconds.
Tweet with id 692530551048294401 queried, 1327.28 seconds.
Tweet with id 692423280028966913 queried, 1327.85 seconds.
Tweet with id 692417313023332352 queried, 1328.58 seconds.
Tweet with id 692187005137076224 queried, 1329.18 seconds.
Tweet with id 692158366030913536 queried, 1329.79 seconds.
Tweet with id 692142790915014657 queried, 1330.7 seconds.
Tweet with id 692041934689402880 queried, 1331.31 seconds.
Tweet with id 692017291282812928 queried, 1332.31 seconds.
Tweet with id 691820333922455552 queried, 1332.92 seconds.
Tweet with id 691793053716221953 queried, 1333.5 seconds.
Tweet with id 691756958957883396 queried, 1334.11 seconds.
Tweet with id 691675652215414786 queried, 1334.68 seconds.
Tweet with id 691483041324204033 queried, 1335.41 seconds.
Tweet with id 691459709405118465 queried, 1336.03 seconds.
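The loop above appends to `tweet_json.txt`, so running the cell a second time would duplicate every line, and any error is reported as a deleted tweet. A variant that avoids both is sketched below under stated assumptions. The function `collect_statuses` and its `fetch` parameter are hypothetical; `fetch` stands in for a call such as `lambda tweet_id: api.get_status(tweet_id, tweet_mode='extended')._json`.

```python
import json
import time


def collect_statuses(tweet_ids, fetch, out_path):
    """Write one JSON object per line to out_path; return {tweet_id: error} for failures.

    `fetch` is a hypothetical callable that takes a tweet id and returns a dict.
    """
    failed = {}
    stored = 0
    start = time.perf_counter()
    # Mode 'w' starts from an empty file, so a re-run does not duplicate lines
    with open(out_path, "w") as file:
        for tweet_id in tweet_ids:
            try:
                status = fetch(tweet_id)
            except Exception as error:
                # Keep the reason: not every failure means the tweet was deleted
                failed[tweet_id] = str(error)
                continue
            file.write(json.dumps(status) + "\n")
            stored += 1
    elapsed = time.perf_counter() - start
    print("{} stored, {} failed, {:.2f} seconds.".format(stored, len(failed), elapsed))
    return failed
```

Keeping the error message next to each failed id makes it possible to tell deleted tweets apart from other problems afterwards.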


Now, read the tweet json objects from the file and create a Pandas dataframe containing the tweet id, retweet count and favorite count:

In [8]:
# Create an empty list to store tweets
tweets_data = []

# Read tweet_json.txt line by line and add tweets json objects to the created list
with open('tweet_json.txt') as file:
    for line in file:
        tweets_data.append(json.loads(line))

# Create a dataframe from the additional tweets data
columns = ['id', 'retweet_count', 'favorite_count']
tw_add = pd.DataFrame(tweets_data)[columns]
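The cell above builds a dataframe from every field of every tweet and then keeps three columns. As an alternative sketch, the fields can be selected while the file is read, so the full tweet objects are never held in memory together. The function name `read_selected_fields` is hypothetical; it assumes one JSON object per line, as written to `tweet_json.txt` above.

```python
import json


def read_selected_fields(path, fields):
    """Return one dict per non-empty line of a JSON Lines file, keeping only `fields`."""
    records = []
    with open(path) as file:
        for line in file:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            tweet = json.loads(line)
            # dict.get leaves None where a field is absent instead of raising
            records.append({field: tweet.get(field) for field in fields})
    return records
```

Passing the returned list to `pd.DataFrame` should give the same three columns as `tw_add`, provided every tweet object carries those fields.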

## Assess

## Clean

## Visualizations