### Table of Contents

* [Chapter 1: Gather data](#chapter1)
    * [Section 1.1: The WeRankDogs Twitter archive](#section_1_1)
    * [Section 1.2: The tweet imgage predictions](#section_1_2)
    * [Section 1.3:The Twitter API data](#section_1_3)
        * [Section 1.3.1: Reading JSON data into Pandas dataframe](#section_1_3_1)
* [Chapter 2: Assess data](#chapter2)
    * [Section 2.1: WeRateDogs Twitter archive Assessment](#section_2_1)
        * [Section 2.1.1: WeRateDogs Twitter archive Quality Assessment](#section_2_1_1)
        * [Section 2.1.2: WeRateDogs Twitter archive Tidiness Assessment](#section_2_1_2)
    * [Section 2.2: Image prediction data Assessment](#section_2_2)
        * [Section 2.2.1: Image prediction data Quality Assessment](#section_2_2_1)
        * [Section 2.2.2: Image prediction data Tidiness Assessment](#section_2_2_2)
    * [Section 2.3: The Twitter API data Assessment](#section_2_3)
        * [Section 2.3.1: The Twitter API data Quality Assessment](#section_2_3_1)
        * [Section 2.3.2: The Twitter API data Tidiness Assessment](#section_2_3_2)
    * [Section 2.4: Assess data summary](#section_2_4)
        * [Section 2.4.1: Data quality issues summary](#section_2_4_1)
        * [Section 2.4.2: Data tidiness isuues summary](#section_2_4_2)
* [Chapter 3: Clean data](#chapter3)
    * [Section 3.1: Clean data quality issues](#section_3_1)
    * [Section 3.2: Clean data tidiness issues](#section_3_2)
* [Chapter 4: Analyze data](#chapter4)

### Imports

In [1]:
#Importing all required packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
import requests
import os
import tweepy
import json
import random
import time

# Chapter 1. Gather Data<a class="anchor" id="chapter1"></a>

## Section 1.1. The WeRateDogs Twitter archive <a class="anchor" id="section_1_1"></a>

The WeRateDogs Twitter archive. This file is provided by Udacity and downloaded manually by clicking on the provided link.

In [56]:
#Reading the WeRateDogs Twitter data from a csv file into a pandas dataframe
df_twitter_archive = pd.read_csv('twitter-archive-enhanced.csv', sep=',')
df_twitter_archive.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


## Section 1.2. The tweet image predictions <a class="anchor" id="section_1_2"></a>

The tweet image predictions, i.e., what breed of dog (or other object, animal, etc.) is present in each tweet according to a neural network. This file (image_predictions.tsv) is hosted on Udacity's servers and should be downloaded programmatically using the Requests library and the following URL: https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv

In [57]:
#Downloading the tweet image predictions programmatically using the provided URL
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)
response

<Response [200]>

In [58]:
#Writing the downloaded response into a pandas df
with open('./image-predictions-3.tsv', 'wb') as file:
    file.write(response.content)
df_image_predictions = pd.read_csv('./image-predictions-3.tsv', sep='\t')
df_image_predictions.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


In [78]:
df_image_predictions.to_csv('image_predictions.csv', index = False)

## Section 1.3. The Twitter API data <a class="anchor" id="section_1_3"></a>

Each tweet's retweet count and favorite ("like") count at minimum, and any additional data you find interesting. Using the tweet IDs in the WeRateDogs Twitter archive, query the Twitter API for each tweet's JSON data using Python's Tweepy library and store each tweet's entire set of JSON data in a file called tweet_json.txt file. Each tweet's JSON data should be written to its own line. Then read this .txt file line by line into a pandas DataFrame with (at minimum) tweet ID, retweet count, and favorite count. Note: do not include your Twitter API keys, secrets, and tokens in your project submission.

In [59]:
CONSUMER_KEY = 'q0GYYcW2omAPxLklz0UEcvZqG'
CONSUMER_SECRET = 'NbQZEvosoCnBP2591NlZvmQ4a0efLITQz9hvcA7eMlWjzg2jar'
OAUTH_TOKEN ='2910438674-COfGYyKSv28Tovu97sHh9iSkxE3aIbm6SWF8RKR'
OAUTH_TOKEN_SECRET = 'H2L2jDrTPbixbON2BdLCGtrGqpnkbAczYYttK3tRkIbvq'

In [60]:
#Using tweepy library, we set up Twitter API object and set rate limit parameters

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

In [61]:
# Creating a list of tweet ids from WeRateDogs Twitter archive enhanced dataset
tweet_ids = list(df_twitter_archive['tweet_id'].unique())
#tweet_ids = ['855860136149123072','855862651834028034','892421']
len(tweet_ids)

2356

In [62]:
for status in api.user_timeline():
    print (status.id)

1220159113797062657
1220159080666296320
847480101893799936
826806830735228928
826396255257522176
826388389909950464
826387730259124224
779342963466010624


In [63]:
tweet_count = 0

# Creating an empty list for tweets to append tweet information to
tweets_list = []

# creating a dictionary for tweets that return errors 
tweets_with_error = {}


# start time of execution
start_time = time.time()

# For loop which will add each available tweet JSON data to tweets_list
for ids in tweet_ids:
    
    tweet_count += 1
    try:
        # Getting the tweet's JSON data and appending it to the tweet list
        tweet = api.get_status(ids, tweet_mode = 'extended')
        tweets_list.append(tweet._json)
    except tweepy.TweepError as err:
        # save the error to the tweets with error dictionary for review
        #print("Tweet has error for id:  " + str(ids)) #commenting out since many tweets had error
        tweets_with_error[ids] = err
        pass
    # Only print tweet id for every 100th tweet to save space
    if tweet_count % 100 == 0:
        print("loop number " + str(tweet_count))
    
        
# end time for excution
end_time = time.time()

#printing time for execution
print("Total run time for the loop is:", end_time - start_time)

loop number 100
loop number 200
loop number 300
loop number 400
loop number 500
loop number 600
loop number 700
loop number 800
loop number 900


Rate limit reached. Sleeping for: 473


loop number 1000
loop number 1100
loop number 1200
loop number 1300
loop number 1400
loop number 1500
loop number 1600
loop number 1700
loop number 1800


Rate limit reached. Sleeping for: 488


loop number 1900
loop number 2000
loop number 2100
loop number 2200
loop number 2300
Total run time for the loop is: 2061.120843887329


In [64]:
len(tweets_list)

2331

In [65]:
len(tweets_with_error)

25

In [66]:
#Looking at the tweet ids that we were not able to retrieve JSON data for
tweets_with_error

{888202515573088257: tweepy.error.TweepError([{'code': 144,
                           'message': 'No status found with that ID.'}]),
 873697596434513921: tweepy.error.TweepError([{'code': 144,
                           'message': 'No status found with that ID.'}]),
 872668790621863937: tweepy.error.TweepError([{'code': 144,
                           'message': 'No status found with that ID.'}]),
 872261713294495745: tweepy.error.TweepError([{'code': 144,
                           'message': 'No status found with that ID.'}]),
 869988702071779329: tweepy.error.TweepError([{'code': 144,
                           'message': 'No status found with that ID.'}]),
 866816280283807744: tweepy.error.TweepError([{'code': 144,
                           'message': 'No status found with that ID.'}]),
 861769973181624320: tweepy.error.TweepError([{'code': 144,
                           'message': 'No status found with that ID.'}]),
 856602993587888130: tweepy.error.TweepError([{'code': 144,
  

- Out of 2356 unique tweet ids, we were able to get the JSON data for 2331 and are missing the data for 25 tweets ids.

### Section 1.3.1: Reading JSON data into Pandas dataframe <a class="anchor" id="section_1_3_1"></a>

In [70]:
#Saving the JSON data that we have retrieved into a text file
tweets_json_data = 'tweet_json.txt'

with open(tweets_json_data, 'w') as outfile:
    for tweets_json in tweets_list:
        json.dump(tweets_json, outfile)
        outfile.write('\n')

In [68]:
ls tweet_json.txt

tweet_json.txt


In [76]:
#Reading the JSON data from the text file we just created into a Pandas dataframe
tweets_list = []

with open(tweets_json_data, 'r') as json_file:
    # reading the first line 
    line = json_file.readline()
    while line:
        data = json.loads(line)

        # extracting relevant fields from the JSON data
        tweet_id = data['tweet_id']
        tweet_retweet_count = data['retweet_count']
        tweet_favorite_count = data['favorite_count']
        
        # creating a dictionary with the JSON data for retweet count and favorite count, then adding it to a list
        tweet_json_dict = {'tweet_id': tweet_id, 
                     'retweet_count': tweet_retweet_count, 
                     'favorite_count': tweet_favorite_count
                    }
        tweets_list.append(tweet_json_dict)

        # reading the next line of JSON data
        line = json_file.readline()

        
# converting the tweets JSON data dictionary list to a DataFrame
df_tweet_json_data = pd.DataFrame(tweets_list, 
                                   columns = ['tweet_id',
                                              'retweet_count',
                                              'favorite_count'])

df_tweet_json_data.head()

Unnamed: 0,tweet_id,retweet_count,favorite_count
0,892420643555336193,7474,35387
1,892177421306343426,5546,30631
2,891815181378084864,3670,23037
3,891689557279858688,7646,38680
4,891327558926688256,8248,36963


In [77]:
#Saving the tweets JSON data into a csv file

df_tweet_json_data.to_csv('tweet_json_data.csv', index = False)

With this JSON data, we have all three pieces of inforation we need to move on to assess phase.

# Chapter 2. Assess Data<a class="anchor" id="chapter2"></a>

In [80]:
#Reading the data files that we gathered in the previous section into pandas dataframes
df_twitter_archive = pd.read_csv('twitter-archive-enhanced.csv', sep=',')
df_image_predictions = pd.read_csv('image_predictions.csv', sep=',')
df_json_data = pd.read_csv('tweet_json_data.csv', sep=',')

In [81]:
#To make sure the files were loaded into dataframes properly, we will look at the first five rows of each dataframe
#Looking at twitter archive dataframe header
df_twitter_archive.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


In [82]:
#looking at image prediction dataframe header
df_image_predictions.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


In [83]:
#looking at JSON additional data dataframe header
df_json_data.head()

Unnamed: 0,tweet_id,retweet_count,favorite_count
0,892420643555336193,7474,35387
1,892177421306343426,5546,30631
2,891815181378084864,3670,23037
3,891689557279858688,7646,38680
4,891327558926688256,8248,36963


## Section 2.1. WeRateDogs Twitter archive Assessment <a class="anchor" id="section_2_1"></a>

We will visually and programmatically assess the archive data here. We can take a quick visual assessment by looking at header and footer but because there are many fields and over 2000 rows, programmatic assessment is more efficient.

In [84]:
df_twitter_archive.head(10)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,
5,891087950875897856,,,2017-07-29 00:08:17 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a majestic great white breaching ...,,,,https://twitter.com/dog_rates/status/891087950...,13,10,,,,,
6,890971913173991426,,,2017-07-28 16:27:12 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Jax. He enjoys ice cream so much he gets ...,,,,"https://gofundme.com/ydvmve-surgery-for-jax,ht...",13,10,Jax,,,,
7,890729181411237888,,,2017-07-28 00:22:40 +0000,"<a href=""http://twitter.com/download/iphone"" r...",When you watch your owner call another dog a g...,,,,https://twitter.com/dog_rates/status/890729181...,13,10,,,,,
8,890609185150312448,,,2017-07-27 16:25:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Zoey. She doesn't want to be one of th...,,,,https://twitter.com/dog_rates/status/890609185...,13,10,Zoey,,,,
9,890240255349198849,,,2017-07-26 15:59:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Cassie. She is a college pup. Studying...,,,,https://twitter.com/dog_rates/status/890240255...,14,10,Cassie,doggo,,,


In [85]:
df_twitter_archive.tail(10)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
2346,666058600524156928,,,2015-11-16 01:01:59 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here is the Rand Paul of retrievers folks! He'...,,,,https://twitter.com/dog_rates/status/666058600...,8,10,the,,,,
2347,666057090499244032,,,2015-11-16 00:55:59 +0000,"<a href=""http://twitter.com/download/iphone"" r...",My oh my. This is a rare blond Canadian terrie...,,,,https://twitter.com/dog_rates/status/666057090...,9,10,a,,,,
2348,666055525042405380,,,2015-11-16 00:49:46 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here is a Siberian heavily armored polar bear ...,,,,https://twitter.com/dog_rates/status/666055525...,10,10,a,,,,
2349,666051853826850816,,,2015-11-16 00:35:11 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is an odd dog. Hard on the outside but lo...,,,,https://twitter.com/dog_rates/status/666051853...,2,10,an,,,,
2350,666050758794694657,,,2015-11-16 00:30:50 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a truly beautiful English Wilson Staff...,,,,https://twitter.com/dog_rates/status/666050758...,10,10,a,,,,
2351,666049248165822465,,,2015-11-16 00:24:50 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a 1949 1st generation vulpix. Enj...,,,,https://twitter.com/dog_rates/status/666049248...,5,10,,,,,
2352,666044226329800704,,,2015-11-16 00:04:52 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a purebred Piers Morgan. Loves to Netf...,,,,https://twitter.com/dog_rates/status/666044226...,6,10,a,,,,
2353,666033412701032449,,,2015-11-15 23:21:54 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here is a very happy pup. Big fan of well-main...,,,,https://twitter.com/dog_rates/status/666033412...,9,10,a,,,,
2354,666029285002620928,,,2015-11-15 23:05:30 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a western brown Mitsubishi terrier. Up...,,,,https://twitter.com/dog_rates/status/666029285...,7,10,a,,,,
2355,666020888022790149,,,2015-11-15 22:32:08 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a Japanese Irish Setter. Lost eye...,,,,https://twitter.com/dog_rates/status/666020888...,8,10,,,,,


In [87]:
sum(df_twitter_archive.duplicated())
#There are no rows where all the columns are duplicated

0

In [88]:
df_twitter_archive.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob

In [117]:
2356-2297

59

In [92]:
df_twitter_archive['name'].value_counts().head(20)

None       745
a           55
Charlie     12
Oliver      11
Cooper      11
Lucy        11
Lola        10
Tucker      10
Penny       10
Bo           9
Winston      9
the          8
Sadie        8
Bailey       7
Toby         7
Daisy        7
Buddy        7
an           7
Jack         6
Koda         6
Name: name, dtype: int64

In [99]:
#Looks like real names are all capitalized so we want to filter the name column to those that only contain lower case values
df_twitter_archive[df_twitter_archive['name'].str.contains('^[a-z]', regex = True)]['name'].value_counts()

a               55
the              8
an               7
very             5
one              4
quite            4
just             4
not              2
actually         2
getting          2
mad              2
unacceptable     1
officially       1
this             1
by               1
all              1
his              1
space            1
old              1
life             1
infuriating      1
such             1
light            1
incredibly       1
my               1
Name: name, dtype: int64

In [100]:
df_twitter_archive['doggo'].value_counts()

None     2259
doggo      97
Name: doggo, dtype: int64

In [101]:
df_twitter_archive['floofer'].value_counts()

None       2346
floofer      10
Name: floofer, dtype: int64

In [102]:
df_twitter_archive['pupper'].value_counts()

None      2099
pupper     257
Name: pupper, dtype: int64

In [103]:
df_twitter_archive['puppo'].value_counts()

None     2326
puppo      30
Name: puppo, dtype: int64

In [113]:
df_twitter_archive['rating_denominator'].value_counts()

10     2333
11        3
50        3
80        2
20        2
2         1
16        1
40        1
70        1
15        1
90        1
110       1
120       1
130       1
150       1
170       1
7         1
0         1
Name: rating_denominator, dtype: int64

In [116]:
len(df_twitter_archive[df_twitter_archive['rating_denominator']>10])

20

### Section 2.1.1. WeRateDogs Twitter archive quality Assessment <a class="anchor" id="section_2_1_1"></a>

Here are some of the quality issues found in Twitter archive data that we have identified and will address in clean section:
- 181 rows are actually retweets and not original tweets (retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp columns are not null for these records).
- 78 rows are similarly replies and not original tweets (in_reply_to_status_id, in_reply_to_user_id fields are not null for these).
- There are 59 tweets that are missing expanded URL field.
- The timestamp column is in object (string) format instead of datetime.
- name column has 745 values as None which is most likely missing values. And there are other values that are not names such as 'a', 'the' for 55 rows.
- In several columns such as name, doggo, floofer, pupper, puppo, it is hard to identify missing values because they are None instead of null.
- We know from the project description that in WeRateDogs, denominator should be 10. However, there are 20 rows with denominator greater than 10.

### Section 2.1.2. WeRateDogs Twitter archive tidiness Assessment <a class="anchor" id="section_2_1_2"></a>

Here are some of the tidiness issues found in Twitter archive data that we have identified and will address in clean section:
- There are 4 columns for dog stages (doggo, floofer, pupper, puppo). These are all different values for one variable and should be in one column.
- Columns retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp, in_reply_to_status_id, in_reply_to_user_id are not relevant to original tweets.

## Section 2.2. Image prediction data Assessment <a class="anchor" id="section_2_2"></a>

In [104]:
df_image_predictions.head(10)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True
5,666050758794694657,https://pbs.twimg.com/media/CT5Jof1WUAEuVxN.jpg,1,Bernese_mountain_dog,0.651137,True,English_springer,0.263788,True,Greater_Swiss_Mountain_dog,0.016199,True
6,666051853826850816,https://pbs.twimg.com/media/CT5KoJ1WoAAJash.jpg,1,box_turtle,0.933012,False,mud_turtle,0.045885,False,terrapin,0.017885,False
7,666055525042405380,https://pbs.twimg.com/media/CT5N9tpXIAAifs1.jpg,1,chow,0.692517,True,Tibetan_mastiff,0.058279,True,fur_coat,0.054449,False
8,666057090499244032,https://pbs.twimg.com/media/CT5PY90WoAAQGLo.jpg,1,shopping_cart,0.962465,False,shopping_basket,0.014594,False,golden_retriever,0.007959,True
9,666058600524156928,https://pbs.twimg.com/media/CT5Qw94XAAA_2dP.jpg,1,miniature_poodle,0.201493,True,komondor,0.192305,True,soft-coated_wheaten_terrier,0.082086,True


In [105]:
df_image_predictions.tail(10)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
2065,890240255349198849,https://pbs.twimg.com/media/DFrEyVuW0AAO3t9.jpg,1,Pembroke,0.511319,True,Cardigan,0.451038,True,Chihuahua,0.029248,True
2066,890609185150312448,https://pbs.twimg.com/media/DFwUU__XcAEpyXI.jpg,1,Irish_terrier,0.487574,True,Irish_setter,0.193054,True,Chesapeake_Bay_retriever,0.118184,True
2067,890729181411237888,https://pbs.twimg.com/media/DFyBahAVwAAhUTd.jpg,2,Pomeranian,0.566142,True,Eskimo_dog,0.178406,True,Pembroke,0.076507,True
2068,890971913173991426,https://pbs.twimg.com/media/DF1eOmZXUAALUcq.jpg,1,Appenzeller,0.341703,True,Border_collie,0.199287,True,ice_lolly,0.193548,False
2069,891087950875897856,https://pbs.twimg.com/media/DF3HwyEWsAABqE6.jpg,1,Chesapeake_Bay_retriever,0.425595,True,Irish_terrier,0.116317,True,Indian_elephant,0.076902,False
2070,891327558926688256,https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg,2,basset,0.555712,True,English_springer,0.22577,True,German_short-haired_pointer,0.175219,True
2071,891689557279858688,https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg,1,paper_towel,0.170278,False,Labrador_retriever,0.168086,True,spatula,0.040836,False
2072,891815181378084864,https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg,1,Chihuahua,0.716012,True,malamute,0.078253,True,kelpie,0.031379,True
2073,892177421306343426,https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg,1,Chihuahua,0.323581,True,Pekinese,0.090647,True,papillon,0.068957,True
2074,892420643555336193,https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg,1,orange,0.097049,False,bagel,0.085851,False,banana,0.07611,False


In [106]:
df_image_predictions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [107]:
df_image_predictions['p1'].value_counts()

golden_retriever      150
Labrador_retriever    100
Pembroke               89
Chihuahua              83
pug                    57
                     ... 
EntleBucher             1
hare                    1
guenon                  1
four-poster             1
lynx                    1
Name: p1, Length: 378, dtype: int64

In [109]:
sum(df_image_predictions['p1'].isnull())

0

### Section 2.2.1. Image prediction data quality Assessment <a class="anchor" id="section_2_2_1"></a>

- There are a total of 2075 records in prediction data which means 281 tweets will be missing breed predictions.

### Section 2.2.2. Image prediction data tidiness Assessment <a class="anchor" id="section_2_2_2"></a>

- Columns p1, p2, p3 can be re-named to more meaningful names. That will help understand other columns such as p1_conf or p1_dog better as well.
- This dataset should be combined with Twitter archive data.

## Section 2.3. The Twitter API data Assessment <a class="anchor" id="section_2_3"></a>

In [110]:
df_json_data.head(10)

Unnamed: 0,tweet_id,retweet_count,favorite_count
0,892420643555336193,7474,35387
1,892177421306343426,5546,30631
2,891815181378084864,3670,23037
3,891689557279858688,7646,38680
4,891327558926688256,8248,36963
5,891087950875897856,2758,18629
6,890971913173991426,1792,10824
7,890729181411237888,16722,59631
8,890609185150312448,3814,25644
9,890240255349198849,6489,29257


In [111]:
df_json_data.tail(10)

Unnamed: 0,tweet_id,retweet_count,favorite_count
2321,666058600524156928,51,104
2322,666057090499244032,120,263
2323,666055525042405380,214,404
2324,666051853826850816,752,1099
2325,666050758794694657,51,122
2326,666049248165822465,39,96
2327,666044226329800704,124,265
2328,666033412701032449,39,109
2329,666029285002620928,41,119
2330,666020888022790149,449,2355


In [112]:
df_json_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2331 entries, 0 to 2330
Data columns (total 3 columns):
tweet_id          2331 non-null int64
retweet_count     2331 non-null int64
favorite_count    2331 non-null int64
dtypes: int64(3)
memory usage: 54.8 KB


### Section 2.3.1. The Twitter API data quality Assessment <a class="anchor" id="section_2_3_1"></a>

- 25 tweet ids at the time this code was run, were either deleted or set to private and are missing. 

### Section 2.3.2. The Twitter API data tidiness Assessment <a class="anchor" id="section_2_3_2"></a>

- This dataset should be combined with Twitter archive data.

## Section 2.4. Assess Data Summary <a class="anchor" id="section_2_4"></a>

### Section 2.4.1. Quality assessment summary <a class="anchor" id="section_2_4_1"></a>

1. 181 rows are actually retweets and not original tweets (retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp columns are not null for these records).
2. 78 rows are similarly replies and not original tweets (in_reply_to_status_id, in_reply_to_user_id fields are not null for these).
3. There are 59 tweets that are missing expanded URL field.
4. The timestamp column is in object (string) format instead of datetime.
5. name column has 745 values as None which is most likely missing values. And there are other values that are not names such as 'a', 'the' for 55 rows.
6. In several columns such as name, doggo, floofer, pupper, puppo, it is hard to identify missing values because they are None instead of null.
7. We know from the project description that in WeRateDogs, denominator should be 10. However, there are 20 rows with denominator greater than 10.
8. 25 tweet ids at the time this code was run, were either deleted or set to private and are missing. They are missing favorite and retweet count.


### Section 2.4.2. Tidiness assessment summary <a class="anchor" id="section_2_4_2"></a>

1. There are 4 columns for dog stages (doggo, floofer, pupper, puppo). These are all different values for one variable and should be in one column.
2. Columns retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp, in_reply_to_status_id, in_reply_to_user_id are not relevant to original tweets.
3. Columns p1, p2, p3 can be re-named to more meaningful names. That will help understand other columns such as p1_conf or p1_dog better as well.
4. All three dataframes should be merged into one dataframe that contains all relevant information.

# Chapter 3. Clean Data<a class="anchor" id="chapter3"></a>

## Section 3.1. Clean data quality issues <a class="anchor" id="section_3_1"></a>

**Issue 1:** 181 rows are actually retweets and not original tweets (retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp columns are not null for these records).

**Define:** remove 181 rows that are retweets.

#### Code

#### Test

## Section 3.2. Clean data tidiness issues <a class="anchor" id="section_3_2"></a>

# Chapter 4. Analyze Data<a class="anchor" id="chapter4"></a>