## Udacity Data Analyst Nanodegree - Project "Data Wrangling and Analyzing" ##

### Introduction ###
Using Python and its libraries, I will gather data from three sources, assess its quality and tidiness, and then clean it.
The datasets:
1) The first dataset is the tweet archive of Twitter user @dog_rates, also known as WeRateDogs. WeRateDogs is a Twitter account that rates people's dogs and adds a humorous comment about each dog. The ratings almost always have a denominator of 10, while the numerators are almost always greater than 10: 11/10, 12/10, 13/10, etc.
2) The Twitter archive contains only basic tweet information; retweet count and favorite count are two notable omissions. Fortunately, this additional data can be gathered by anyone from Twitter's API.
3) A Udacity employee ran every image in the WeRateDogs Twitter archive through a neural network that classifies dog breeds. The result is a table of image predictions (the top three only) alongside each tweet ID, image URL, and the number of the image that produced the most confident prediction (1 to 4, since tweets can have up to four images).
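
To make the rating format concrete, here is a minimal sketch of how a rating such as 13/10 could be pulled out of tweet text with a regular expression. The helper name `extract_rating`, the pattern, and the sample text are assumptions for illustration; the archive already ships with `rating_numerator` and `rating_denominator` columns, so this is not how those columns were necessarily produced.

```python
import re

# Matches "<numerator>/<denominator>", allowing a decimal numerator such as 13.5/10.
RATING_PATTERN = re.compile(r'(\d+(?:\.\d+)?)/(\d+)')

def extract_rating(text):
    """Return (numerator, denominator) from the first rating-like match, or None."""
    match = RATING_PATTERN.search(text)
    if match is None:
        return None
    return float(match.group(1)), int(match.group(2))

# Hypothetical tweet text, not taken from the archive.
print(extract_rating("Sample dog, very good boy. 13/10"))
```

A pattern like this also matches fractions that are not ratings (dates, for example), so its results would need the same quality assessment as any other column.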

#### Gathering Data ####

In [1]:
import pandas as pd
import numpy as np
import requests
import tweepy
import os
import time
import json
import re
import seaborn as sns
import matplotlib.pyplot as plt 
%matplotlib inline

In [2]:
# load the twitter-archive-enhanced.csv into a DataFrame
twitter_archive = pd.read_csv('twitter-archive-enhanced.csv')
twitter_archive.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,
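
One thing worth checking after loading is how the ID columns were parsed. The sketch below uses a made-up two-row sample (not real tweet IDs) to show that a column with missing values is read as float64 by default, which cannot represent every 18-digit ID exactly, and that reading the ID columns as text avoids this. Whether the actual archive is affected is something to verify during assessment.

```python
import io
import pandas as pd

# Made-up sample (not real tweet IDs) shaped like two of the archive's ID columns.
SAMPLE_CSV = (
    "tweet_id,in_reply_to_status_id\n"
    "100000000000000001,\n"
    "100000000000000002,100000000000000003\n"
)

# Default parsing: the missing value forces the reply column to float64.
as_float = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Reading the ID columns as text keeps every digit.
id_columns = ['tweet_id', 'in_reply_to_status_id']
as_text = pd.read_csv(io.StringIO(SAMPLE_CSV),
                      dtype={name: str for name in id_columns})

print(as_float['in_reply_to_status_id'].dtype)
print(as_text.loc[1, 'in_reply_to_status_id'])
```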


In [3]:
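# note: sort_values returns a new DataFrame; the result is not assigned here,
# so twitter_archive keeps its original order (see the output of this cell)
twitter_archive.sort_values('timestamp')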
twitter_archive.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,
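
The output above is identical to the previous cell because `sort_values` does not modify the DataFrame in place. A small sketch with a three-row frame (names and dates taken from the rows shown above, everything else assumed) illustrates the difference between discarding and keeping the sorted result:

```python
import pandas as pd

frame = pd.DataFrame({'timestamp': ['2017-08-01', '2017-07-29', '2017-07-31'],
                      'name': ['Phineas', 'Franklin', 'Archie']})

# The sorted copy is thrown away, so the original order is unchanged.
frame.sort_values('timestamp')
unsorted_first = frame['name'].iloc[0]

# Assigning the result keeps the sorted order (oldest first).
sorted_frame = frame.sort_values('timestamp').reset_index(drop=True)
sorted_first = sorted_frame['name'].iloc[0]

print(unsorted_first, sorted_first)
```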


In [4]:
# download the image prediction file from Udacity's server using the requests library
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)

# save the response body under the file name taken from the end of the URL
with open(url.split('/')[-1], mode='wb') as file:
    file.write(response.content)
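
The download cell writes whatever the server returns, including an error page. A slightly more defensive variant could look like the sketch below; the helper names `filename_from_url` and `download` and the 30-second timeout are assumptions, not part of the original project.

```python
import os
import requests

def filename_from_url(url):
    """Use the last path segment of the URL as the local file name."""
    return url.split('/')[-1]

def download(url, folder='.'):
    """Download url into folder, raising on HTTP errors; returns the local path."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    path = os.path.join(folder, filename_from_url(url))
    with open(path, mode='wb') as file:
        file.write(response.content)
    return path
```

Calling `download(url)` with the URL from the cell above would then either save `image-predictions.tsv` or stop with an exception instead of silently saving an error response.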

In [5]:
# load the image predictions data into a DataFrame
predictions = pd.read_csv('image-predictions.tsv', sep='\t')
predictions.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True
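
Each row carries three ranked predictions with a flag saying whether the prediction is a dog breed. One possible way to reduce this to a single breed per tweet is to take the highest-ranked prediction that is flagged as a dog. The sketch below assumes this rule and uses a small sample in which only the first row's `p1` values come from the table above; the remaining values are made up.

```python
import pandas as pd

# First row's p1 values are from the table above; everything else is made up.
sample = pd.DataFrame({
    'p1': ['Welsh_springer_spaniel', 'shopping_cart', 'teddy'],
    'p1_dog': [True, False, False],
    'p2': ['collie', 'golden_retriever', 'pillow'],
    'p2_dog': [True, True, False],
    'p3': ['Shetland_sheepdog', 'Labrador_retriever', 'sock'],
    'p3_dog': [True, True, False],
})

def best_dog_prediction(row):
    """Return the highest-ranked prediction flagged as a dog, or None."""
    for rank in (1, 2, 3):
        if row['p{}_dog'.format(rank)]:
            return row['p{}'.format(rank)]
    return None

sample['breed'] = sample.apply(best_dog_prediction, axis=1)
print(sample['breed'].tolist())
```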


In [6]:
# set up access to the Twitter API (fill in your own credentials)

CONSUMER_KEY = ""
CONSUMER_SECRET = ""
ACCESS_TOKEN = ""
ACCESS_TOKEN_SECRET = ""

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

api = tweepy.API(auth_handler = auth,
                 parser = tweepy.parsers.JSONParser(),
                 wait_on_rate_limit = True,
                 wait_on_rate_limit_notify = True)
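
The credentials are left blank in the cell above so they are not published with the notebook. One common alternative is to read them from environment variables. This is a sketch under that assumption; the function name `load_credentials` and the choice of variable names (identical to the Python names above) are hypothetical.

```python
import os

CREDENTIAL_NAMES = ['CONSUMER_KEY', 'CONSUMER_SECRET',
                    'ACCESS_TOKEN', 'ACCESS_TOKEN_SECRET']

def load_credentials(env=os.environ):
    """Return the four credentials from env, or raise if any is missing or empty."""
    missing = [name for name in CREDENTIAL_NAMES if not env.get(name)]
    if missing:
        raise RuntimeError('Missing credentials: ' + ', '.join(missing))
    return {name: env[name] for name in CREDENTIAL_NAMES}

# Example with a plain dict standing in for the environment.
example = load_credentials({name: 'placeholder' for name in CREDENTIAL_NAMES})
print(sorted(example))
```

The returned values could then be passed to `tweepy.OAuthHandler` and `set_access_token` exactly as in the cell above.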

In [7]:
missing_tweets = []

with open('tweet_json.txt', 'a') as file:
    for tweet_id in twitter_archive['tweet_id']:
        try:
            start_time = time.time()
            tweet = api.get_status(tweet_id, tweet_mode='extended')
            # writes one tweet per line
            file.write(json.dumps(tweet) + '\n')
            end_time = time.time()
            print('ID {} . Time in seconds: {}'.format(tweet_id, end_time-start_time))
        except Exception as e_message:
            missing_tweets.append(tweet_id)
            print("Error for ID: " + str(tweet_id) + str(e_message))
    print('End reached.')

ID 892420643555336193 . Time in seconds: 0.19039011001586914
ID 892177421306343426 . Time in seconds: 0.21448040008544922
ID 891815181378084864 . Time in seconds: 0.15755248069763184
ID 891689557279858688 . Time in seconds: 0.16284871101379395
ID 891327558926688256 . Time in seconds: 0.15909433364868164
ID 891087950875897856 . Time in seconds: 0.18929123878479004
ID 890971913173991426 . Time in seconds: 0.18886852264404297
ID 890729181411237888 . Time in seconds: 0.18662238121032715
ID 890609185150312448 . Time in seconds: 0.1603701114654541
ID 890240255349198849 . Time in seconds: 0.16298675537109375
ID 890006608113172480 . Time in seconds: 0.1514601707458496
ID 889880896479866881 . Time in seconds: 0.1869814395904541
ID 889665388333682689 . Time in seconds: 0.1500704288482666
ID 889638837579907072 . Time in seconds: 0.21160387992858887
ID 889531135344209921 . Time in seconds: 0.16748046875
ID 889278841981685760 . Time in seconds: 0.20888495445251465
ID 888917238123831296 . Time in se

In [8]:
missing_tweets

[888202515573088257,
 873697596434513921,
 872668790621863937,
 872261713294495745,
 869988702071779329,
 866816280283807744,
 861769973181624320,
 856602993587888130,
 851953902622658560,
 845459076796616705,
 844704788403113984,
 842892208864923648,
 837366284874571778,
 837012587749474308,
 829374341691346946,
 827228250799742977,
 812747805718642688,
 802247111496568832,
 779123168116150273,
 775096608509886464,
 771004394259247104,
 770743923962707968,
 759566828574212096,
 754011816964026368,
 680055455951884288]

In [9]:
# Try again to gather the missing tweets. 
missing_tweets_new = [] 

with open('tweet_json.txt', 'a') as file:
    for tweet_id in missing_tweets:
        try:
            # the API object uses JSONParser, so get_status already returns a dict (no ._json needed)
            tweet = api.get_status(tweet_id, tweet_mode='extended')
            file.write(json.dumps(tweet) + '\n')
            
        except Exception as e_message:
            print("Error for ID: " + str(tweet_id) + str(e_message))
            missing_tweets_new.append(tweet_id)

Error for ID: 888202515573088257[{'code': 144, 'message': 'No status found with that ID.'}]
Error for ID: 873697596434513921[{'code': 144, 'message': 'No status found with that ID.'}]
Error for ID: 872668790621863937[{'code': 144, 'message': 'No status found with that ID.'}]
Error for ID: 872261713294495745[{'code': 144, 'message': 'No status found with that ID.'}]
Error for ID: 869988702071779329[{'code': 144, 'message': 'No status found with that ID.'}]
Error for ID: 866816280283807744[{'code': 144, 'message': 'No status found with that ID.'}]
Error for ID: 861769973181624320[{'code': 144, 'message': 'No status found with that ID.'}]
Error for ID: 856602993587888130[{'code': 144, 'message': 'No status found with that ID.'}]
Error for ID: 851953902622658560[{'code': 144, 'message': 'No status found with that ID.'}]
Error for ID: 845459076796616705[{'code': 144, 'message': 'No status found with that ID.'}]
Error for ID: 844704788403113984[{'code': 144, 'message': 'No status found with 

In [10]:
missing_tweets_new == missing_tweets

True

In [17]:
list_for_df = []

# read the file line by line: each line holds one tweet as a JSON object
with open('tweet_json.txt') as json_file:
    json_data = [json.loads(line) for line in json_file]

JSONDecodeError: Extra data: line 1 column 3 (char 2)
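
A `JSONDecodeError` like the one above means at least one line in the file is not a single valid JSON object. To find out which line, the file can be parsed one line at a time while collecting failures instead of stopping at the first one. The following is a sketch; the helper name `read_json_lines` and the sample input are assumptions.

```python
import io
import json

def read_json_lines(handle):
    """Parse one JSON object per line; return (records, bad_lines)."""
    records, bad_lines = [], []
    for line_number, line in enumerate(handle, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as error:
            bad_lines.append((line_number, str(error)))
    return records, bad_lines

# Made-up input: one valid tweet-like line, one blank line, one malformed line.
sample = io.StringIO('{"id": 1, "favorite_count": 5, "retweet_count": 2}\n'
                     '\n'
                     '{} trailing\n')
records, bad_lines = read_json_lines(sample)
print(len(records), [number for number, _ in bad_lines])
```

Running the same function on `open('tweet_json.txt')` would list the offending line numbers, which helps decide whether the file needs to be regenerated.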

In [None]:
# json_data is a list of tweet dicts, so extract the fields tweet by tweet
for tweet in json_data:
    list_for_df.append({'tweet_id': tweet['id'],
                        'favorite_count': tweet['favorite_count'],
                        'retweet_count': tweet['retweet_count']})

# create a new DataFrame 
df = pd.DataFrame(list_for_df, columns = ['tweet_id', 'favorite_count', 'retweet_count'])
df.head()

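# save the DataFrame to its own file so the raw JSON in tweet_json.txt is not overwritten
# (the file name tweet_data.csv is an arbitrary choice)
df.to_csv('tweet_data.csv', encoding='utf-8', index=False)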