# Data Wrangling and Analyzing

In this project, we'll gather, assess, and clean some Twitter data, then act on it through analysis, visualization, and/or modeling.

## Table of Contents
- [Gather](#gather)
- [Assess](#access)
- [Clean](#clean)
- [Analysis](#analysis)
- [Conclusion](#conclusion)

<a id='gather'></a>
## Gather

In [1]:
# Import necessary libraries
import pandas as pd
import requests
import tweepy

Load the `twitter-archive-enhanced.csv` file.

In [2]:
df_tweets = pd.read_csv('twitter-archive-enhanced.csv')

Download the `image_predictions.tsv` file using requests.

In [3]:
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
r = requests.get(url)

In [4]:
# Check status of response
r.status_code

200

In [5]:
# Write response to file
# with open('image_predictions.tsv', 'w+') as file:
#     file.write(r.text)    

In [6]:
df_images = pd.read_csv('image_predictions.tsv', sep='\t')

Gather tweet information with Tweepy.

In [7]:
# Connect to tweepy
# with open('keys.txt', 'r') as file:
#     api_key = file.readline().split()[2]
#     api_secret = file.readline().split()[2]
#     access_token = file.readline().split()[2]
#     access_secret = file.readline().split()[2]
    
# auth = tweepy.OAuthHandler(api_key, api_secret)
# auth.set_access_token(access_token, access_secret)

# api = tweepy.API(auth, wait_on_rate_limit = True)

In [8]:
# Download tweet data and save to tweet_json.txt
import json

# file = open('tweet_json.txt', 'w+')
# count = 0
# for tweet_id in df_tweets['tweet_id']:
#     count = count + 1
#     print('{count} / 2356, {id}'.format(count=count, id=tweet_id))
#     try:
#         file.write(json.dumps(api.get_status(id=tweet_id)._json) + '\n')
#     except:
#         print('Deleted: ', tweet_id)
#         continue
# file.close()
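The commented-out download loop can be restructured so that failures are collected rather than only printed. The following is a minimal sketch, not the code that produced `tweet_json.txt`: `dump_tweets`, `fetch`, and `fake_fetch` are hypothetical names. `fetch` stands in for a call such as `api.get_status(id=tweet_id)._json`, and a fake fetcher is used so the example runs without credentials.

```python
import io
import json


def dump_tweets(tweet_ids, fetch, out):
    """Write one JSON document per line to `out`; return the IDs that failed.

    `fetch` is a hypothetical callable that takes a tweet ID and returns a
    dict (for example, a thin wrapper around the Tweepy call).
    """
    failed = []
    for tweet_id in tweet_ids:
        try:
            payload = fetch(tweet_id)
        except Exception:  # assumption: any error means the tweet is unavailable
            failed.append(tweet_id)
            continue
        out.write(json.dumps(payload) + '\n')
    return failed


def fake_fetch(tweet_id):
    """Stand-in for the API: pretend tweet 2 has been deleted."""
    if tweet_id == 2:
        raise LookupError('deleted')
    return {'id_str': str(tweet_id), 'retweet_count': 0, 'favorite_count': 0}


buffer = io.StringIO()
failed_ids = dump_tweets([1, 2, 3], fake_fetch, buffer)
lines = buffer.getvalue().splitlines()
```

In real use, `out` would be a file opened in a `with` block, so it is closed even if the loop is interrupted.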

In [9]:
# Extract retweet count and favorite count of each tweet
likes = []
with open('tweet_json.txt', 'r') as file:
    for line in file:
        tweet = json.loads(line)
        tweet_id = tweet['id_str']
        retweet = tweet['retweet_count'] 
        favorite = tweet['favorite_count']
      
        likes.append({'tweet_id': tweet_id,
                        'retweet_count': retweet,
                        'favorite_count':favorite})
df_likes = pd.DataFrame(likes, columns = ['tweet_id', 'retweet_count', 'favorite_count'])
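To illustrate the parsing step, here is a small sketch that reads the same three fields from JSON lines held in memory. The two sample records are made up for the example, and skipping blank lines is an added assumption about how the file might look.

```python
import io
import json

# Made-up sample in the same one-JSON-object-per-line layout as tweet_json.txt
sample = io.StringIO(
    '{"id_str": "1", "retweet_count": 5, "favorite_count": 9}\n'
    '\n'
    '{"id_str": "2", "retweet_count": 0, "favorite_count": 3}\n'
)

records = []
for line in sample:
    if not line.strip():  # tolerate blank lines
        continue
    tweet = json.loads(line)
    records.append({
        'tweet_id': tweet['id_str'],
        'retweet_count': tweet['retweet_count'],
        'favorite_count': tweet['favorite_count'],
    })
```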

<a id='access'></a>
## Assess

We now have three dataframes on hand: `df_tweets`, `df_images`, and `df_likes`.

In [10]:
df_tweets.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


In [11]:
df_tweets.sample(3)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
675,789530877013393408,,,2016-10-21 18:16:44 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Rizzy. She smiles a lot. 12/10 contagi...,,,,https://twitter.com/dog_rates/status/789530877...,12,10,Rizzy,,,,
282,839239871831150596,,,2017-03-07 22:22:32 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Odie. He's big. 13/10 would attempt to...,,,,https://twitter.com/dog_rates/status/839239871...,13,10,Odie,,,,
245,845812042753855489,,,2017-03-26 01:38:00 +0000,"<a href=""http://twitter.com/download/iphone"" r...",We usually don't rate polar bears but this one...,,,,https://twitter.com/dog_rates/status/845812042...,13,10,,,,,


In [12]:
df_tweets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

In [13]:
df_tweets.describe()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,retweeted_status_id,retweeted_status_user_id,rating_numerator,rating_denominator
count,2356.0,78.0,78.0,181.0,181.0,2356.0,2356.0
mean,7.427716e+17,7.455079e+17,2.014171e+16,7.7204e+17,1.241698e+16,13.126486,10.455433
std,6.856705e+16,7.582492e+16,1.252797e+17,6.236928e+16,9.599254e+16,45.876648,6.745237
min,6.660209e+17,6.658147e+17,11856340.0,6.661041e+17,783214.0,0.0,0.0
25%,6.783989e+17,6.757419e+17,308637400.0,7.186315e+17,4196984000.0,10.0,10.0
50%,7.196279e+17,7.038708e+17,4196984000.0,7.804657e+17,4196984000.0,11.0,10.0
75%,7.993373e+17,8.257804e+17,4196984000.0,8.203146e+17,4196984000.0,12.0,10.0
max,8.924206e+17,8.862664e+17,8.405479e+17,8.87474e+17,7.874618e+17,1776.0,170.0


In [14]:
sum(df_tweets.tweet_id.duplicated())

0

In [15]:
df_tweets.rating_numerator.sort_values()

315        0
1016       0
2335       1
2261       1
2338       1
        ... 
2074     420
188      420
189      666
313      960
979     1776
Name: rating_numerator, Length: 2356, dtype: int64

In [16]:
df_tweets.query('rating_numerator == 1776')

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
979,749981277374128128,,,2016-07-04 15:00:45 +0000,"<a href=""https://about.twitter.com/products/tw...",This is Atticus. He's quite simply America af....,,,,https://twitter.com/dog_rates/status/749981277...,1776,10,Atticus,,,,


In [17]:
df_tweets.rating_denominator.sort_values()

313       0
2335      2
516       7
1576     10
1575     10
       ... 
1635    110
1779    120
1634    130
902     150
1120    170
Name: rating_denominator, Length: 2356, dtype: int64

In [18]:
df_tweets.query('rating_denominator == 0 or rating_denominator == 2 or rating_denominator == 7')

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
313,835246439529840640,8.35246e+17,26259576.0,2017-02-24 21:54:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@jonnysun @Lin_Manuel ok jomny I know you're e...,,,,,960,0,,,,,
516,810984652412424192,,,2016-12-19 23:06:23 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Sam. She smiles 24/7 &amp; secretly aspir...,,,,"https://www.gofundme.com/sams-smile,https://tw...",24,7,Sam,,,,
2335,666287406224695296,,,2015-11-16 16:11:11 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is an Albanian 3 1/2 legged Episcopalian...,,,,https://twitter.com/dog_rates/status/666287406...,1,2,an,,,,


In [19]:
df_tweets[df_tweets.expanded_urls.isnull()]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
30,886267009285017600,8.862664e+17,2281182000.0,2017-07-15 16:51:35 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@NonWhiteHat @MayhewMayhem omg hello tanner yo...,,,,,12,10,,,,,
55,881633300179243008,8.81607e+17,47384430.0,2017-07-02 21:58:53 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@roushfenway These are good dogs but 17/10 is ...,,,,,17,10,,,,,
64,879674319642796034,8.795538e+17,3105441000.0,2017-06-27 12:14:36 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@RealKentMurphy 14/10 confirmed,,,,,14,10,,,,,
113,870726314365509632,8.707262e+17,16487760.0,2017-06-02 19:38:25 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@ComplicitOwl @ShopWeRateDogs &gt;10/10 is res...,,,,,10,10,,,,,
148,863427515083354112,8.634256e+17,77596200.0,2017-05-13 16:15:35 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@Jack_Septic_Eye I'd need a few more pics to p...,,,,,12,10,,,,,
179,857214891891077121,8.571567e+17,180671000.0,2017-04-26 12:48:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@Marc_IRL pixelated af 12/10,,,,,12,10,,,,,
185,856330835276025856,,,2017-04-24 02:15:55 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @Jenna_Marbles: @dog_rates Thanks for ratin...,8.563302e+17,66699013.0,2017-04-24 02:13:14 +0000,,14,10,,,,,
186,856288084350160898,8.56286e+17,279281000.0,2017-04-23 23:26:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@xianmcguire @Jenna_Marbles Kardashians wouldn...,,,,,14,10,,,,,
188,855862651834028034,8.558616e+17,194351800.0,2017-04-22 19:15:32 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@dhmontgomery We also gave snoop dogg a 420/10...,,,,,420,10,,,,,
189,855860136149123072,8.558585e+17,13615720.0,2017-04-22 19:05:32 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@s8n You tried very hard to portray this good ...,,,,,666,10,,,,,


In [20]:
df_tweets[df_tweets.name == "None"]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
5,891087950875897856,,,2017-07-29 00:08:17 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a majestic great white breaching ...,,,,https://twitter.com/dog_rates/status/891087950...,13,10,,,,,
7,890729181411237888,,,2017-07-28 00:22:40 +0000,"<a href=""http://twitter.com/download/iphone"" r...",When you watch your owner call another dog a g...,,,,https://twitter.com/dog_rates/status/890729181...,13,10,,,,,
12,889665388333682689,,,2017-07-25 01:55:32 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here's a puppo that seems to be on the fence a...,,,,https://twitter.com/dog_rates/status/889665388...,13,10,,,,,puppo
24,887343217045368832,,,2017-07-18 16:08:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",You may not have known you needed to see this ...,,,,https://twitter.com/dog_rates/status/887343217...,13,10,,,,,
25,887101392804085760,,,2017-07-18 00:07:08 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This... is a Jubilant Antarctic House Bear. We...,,,,https://twitter.com/dog_rates/status/887101392...,12,10,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2342,666082916733198337,,,2015-11-16 02:38:37 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a well-established sunblockerspan...,,,,https://twitter.com/dog_rates/status/666082916...,6,10,,,,,
2343,666073100786774016,,,2015-11-16 01:59:36 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Let's hope this flight isn't Malaysian (lol). ...,,,,https://twitter.com/dog_rates/status/666073100...,10,10,,,,,
2344,666071193221509120,,,2015-11-16 01:52:02 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a northern speckled Rhododendron....,,,,https://twitter.com/dog_rates/status/666071193...,9,10,,,,,
2351,666049248165822465,,,2015-11-16 00:24:50 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a 1949 1st generation vulpix. Enj...,,,,https://twitter.com/dog_rates/status/666049248...,5,10,,,,,
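The `name` column can also be screened programmatically. The sketch below is illustrative only: it assumes that real names are capitalized, so the "None" placeholder and values written entirely in lowercase (such as "a" or "an") are flagged as suspect. The sample values are made up for the example.

```python
import pandas as pd

# Made-up sample; in the notebook this would be df_tweets['name']
names = pd.Series(['Phineas', 'None', 'an', 'Tilly', 'a'])

# Flag the 'None' placeholder and anything written entirely in lowercase
is_placeholder = names == 'None'
is_lowercase = names.str.islower()
suspect = names[is_placeholder | is_lowercase]
```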


In [21]:
df_images.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


In [22]:
df_images.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   int64  
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [23]:
df_images.sample(3)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
1724,819952236453363712,https://pbs.twimg.com/media/C2EONHNWQAUWxkP.jpg,1,American_Staffordshire_terrier,0.925505,True,Staffordshire_bullterrier,0.036221,True,Italian_greyhound,0.020412,True
721,685973236358713344,https://pbs.twimg.com/media/CYURBGoWYAAKey3.jpg,1,Siberian_husky,0.450678,True,Eskimo_dog,0.430275,True,malamute,0.11859,True
1461,778286810187399168,https://pbs.twimg.com/media/Cs0HuUTWcAUpSE8.jpg,1,Boston_bull,0.32207,True,pug,0.229903,True,muzzle,0.10142,False


In [24]:
sum(df_images.tweet_id.duplicated())

0

In [25]:
df_images.describe()

Unnamed: 0,tweet_id,img_num,p1_conf,p2_conf,p3_conf
count,2075.0,2075.0,2075.0,2075.0,2075.0
mean,7.384514e+17,1.203855,0.594548,0.1345886,0.06032417
std,6.785203e+16,0.561875,0.271174,0.1006657,0.05090593
min,6.660209e+17,1.0,0.044333,1.0113e-08,1.74017e-10
25%,6.764835e+17,1.0,0.364412,0.05388625,0.0162224
50%,7.119988e+17,1.0,0.58823,0.118181,0.0494438
75%,7.932034e+17,1.0,0.843855,0.1955655,0.09180755
max,8.924206e+17,4.0,1.0,0.488014,0.273419


In [26]:
df_likes.head()

Unnamed: 0,tweet_id,retweet_count,favorite_count
0,892420643555336193,7305,34812
1,892177421306343426,5457,30164
2,891815181378084864,3598,22706
3,891689557279858688,7490,38065
4,891327558926688256,8051,36342


In [27]:
sum(df_likes.duplicated())

0

In [28]:
df_likes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2331 entries, 0 to 2330
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   tweet_id        2331 non-null   object
 1   retweet_count   2331 non-null   int64 
 2   favorite_count  2331 non-null   int64 
dtypes: int64(2), object(1)
memory usage: 54.8+ KB


In [29]:
df_likes.describe()

Unnamed: 0,retweet_count,favorite_count
count,2331.0,2331.0
mean,2569.824539,7269.266409
std,4346.939571,11291.352969
min,1.0,0.0
25%,521.5,1258.0
50%,1197.0,3149.0
75%,2975.0,8885.5
max,73712.0,149541.0


### Quality

##### `df_tweets` table
- ~~Erroneous datatypes (`tweet_id`, `in_reply_to_status_id`, `in_reply_to_user_id`, `timestamp`, `retweeted_status_id`, `retweeted_status_user_id`, and `retweeted_status_timestamp` columns)~~
- ~~Missing information in `in_reply_to_status_id`, `in_reply_to_user_id`, `retweeted_status_id`, `retweeted_status_user_id`, and `retweeted_status_timestamp`~~
- ~~`rating_numerator` and `rating_denominator` do not always match the rating given in `text`~~
- ~~Rating numerators have a very large range, and some are not actual ratings (e.g. 1776 is not a rating for a dog but the year of the United States Declaration of Independence)~~
- ~~745 dogs are named "None", some are named "a"~~
- ~~`rating_numerator` and `rating_denominator` are wrong for `tweet_id` '666287406224695296', it should be 9/10 instead of 1/2~~
- ~~`rating_numerator` is wrong for `tweet_id` '883482846933004288', it should be 13.5 instead of 5~~
- ~~`rating_numerator` and `rating_denominator` are wrong for `tweet_id` '810984652412424192', 24/7 is not a rating. There is no rating for this dog~~
- ~~Some of the tweets are not original tweets~~
- ~~Some of the tweets contain videos instead of images~~

##### `df_images` table
- ~~Erroneous datatype (`tweet_id`)~~
- ~~Mixed use of uppercase and lowercase first letters in the predictions (`p1`, `p2`, `p3`)~~

##### `df_likes` table
- ~~Number of entries does not match number of entries of `df_tweets` - there are some deleted tweets~~
- ~~Erroneous datatype (`retweet_count`, `favorite_count`)~~

### Tidiness

##### `df_tweets` table
- ~~Four columns of dog stages~~
- ~~Retweet counts and favorite counts should be part of the `df_tweets` table~~

<a id='clean'></a>
## Clean

In [30]:
tweets_clean = df_tweets.copy()
images_clean = df_images.copy()
likes_clean = df_likes.copy()

### Quality

#### `tweets`: Some of the tweets are not original tweets

##### Define

Since we only care about original tweets with images, we drop all retweets and replies.

##### Code

In [31]:
# Make a mask of all tweets with 'in_reply_to_user_id' (meaning it's a reply)
# or 'retweeted_status_id' (meaning it's a retweet). Tweets without
# 'expanded_urls' (meaning no images) are handled in a later step.

mask = (tweets_clean['in_reply_to_user_id'].notnull()) | (tweets_clean['retweeted_status_id'].notnull())

In [32]:
tweets_clean.drop(tweets_clean[mask].index, axis = 0, inplace=True)
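An equivalent way to drop the masked rows is to keep the complement of the mask. Here is a minimal sketch on a made-up three-row frame; the column names match the notebook, but the values do not.

```python
import pandas as pd

sample = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'in_reply_to_user_id': [None, 42.0, None],
    'retweeted_status_id': [None, None, 7.0],
})

# True for rows that are replies or retweets
is_reply_or_retweet = (sample['in_reply_to_user_id'].notnull()
                       | sample['retweeted_status_id'].notnull())

# Keep only original tweets
originals = sample[~is_reply_or_retweet]
```

Both forms select the same rows. The complement form avoids `inplace=True` and leaves the source frame untouched.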

In [33]:
# Remove in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id,
# retweeted_status_timestamp because these columns are now empty
tweets_clean.drop(columns=['in_reply_to_status_id', 'in_reply_to_user_id', 'retweeted_status_id', 
                           'retweeted_status_user_id', 'retweeted_status_timestamp'], inplace=True)

##### Test

In [34]:
tweets_clean.head()

Unnamed: 0,tweet_id,timestamp,source,text,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


In [35]:
tweets_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2097 entries, 0 to 2355
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   tweet_id            2097 non-null   int64 
 1   timestamp           2097 non-null   object
 2   source              2097 non-null   object
 3   text                2097 non-null   object
 4   expanded_urls       2094 non-null   object
 5   rating_numerator    2097 non-null   int64 
 6   rating_denominator  2097 non-null   int64 
 7   name                2097 non-null   object
 8   doggo               2097 non-null   object
 9   floofer             2097 non-null   object
 10  pupper              2097 non-null   object
 11  puppo               2097 non-null   object
dtypes: int64(3), object(9)
memory usage: 213.0+ KB


#### `tweets`: Some of the tweets contain videos instead of images

##### Define

Remove all tweets whose `expanded_urls` is missing or contains the word 'vine'.

##### Code

In [36]:
# Find all the expanded_urls that are either null or contain the word "vine"
mask = (tweets_clean['expanded_urls'].isnull()) | (tweets_clean['expanded_urls'].str.contains('vine', regex=False))

In [37]:
tweets_clean.drop(tweets_clean[mask].index, axis = 0, inplace=True)
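One detail worth noting when combining `isnull()` with `str.contains()`: for missing values, `str.contains()` returns a missing value rather than `False` unless `na` is set. The sketch below shows the same filter with `na=False` on made-up URLs (the Vine address is a placeholder, not taken from the data).

```python
import pandas as pd

urls = pd.Series([
    'https://twitter.com/dog_rates/status/1/photo/1',
    'https://vine.co/v/example',  # placeholder URL for illustration
    None,
])

# na=False makes missing URLs evaluate to False instead of NaN
is_vine = urls.str.contains('vine', regex=False, na=False)
to_drop = urls.isnull() | is_vine
kept = urls[~to_drop]
```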

##### Test

In [38]:
tweets_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2003 entries, 0 to 2355
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   tweet_id            2003 non-null   int64 
 1   timestamp           2003 non-null   object
 2   source              2003 non-null   object
 3   text                2003 non-null   object
 4   expanded_urls       2003 non-null   object
 5   rating_numerator    2003 non-null   int64 
 6   rating_denominator  2003 non-null   int64 
 7   name                2003 non-null   object
 8   doggo               2003 non-null   object
 9   floofer             2003 non-null   object
 10  pupper              2003 non-null   object
 11  puppo               2003 non-null   object
dtypes: int64(3), object(9)
memory usage: 203.4+ KB


In [39]:
tweets_clean[tweets_clean['expanded_urls'].str.contains('vine')]

Unnamed: 0,tweet_id,timestamp,source,text,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo


In [40]:
tweets_clean[tweets_clean['expanded_urls'].isnull()]

Unnamed: 0,tweet_id,timestamp,source,text,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo


#### `tweets_clean`: Erroneous datatypes (`tweet_id`, `in_reply_to_status_id`, `in_reply_to_user_id`, `timestamp`, `retweeted_status_id`, `retweeted_status_user_id`, `retweeted_status_timestamp` columns)

##### Define

Change `tweet_id` to string, change `timestamp` to datetime

##### Code

In [41]:
tweets_clean.tweet_id = tweets_clean.tweet_id.astype(str)
tweets_clean.timestamp = pd.to_datetime(tweets_clean.timestamp)
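Tweet IDs are identifiers rather than quantities, and storing them as numbers can silently corrupt them. The sketch below uses only the standard library to show that an 18-digit ID does not survive a round trip through a 64-bit float, which is the type the reply and retweet ID columns were loaded as. It also parses one timestamp string from the data with `datetime.strptime`, used here as a stand-in for `pd.to_datetime`.

```python
from datetime import datetime

tweet_id = 892420643555336193

# A 64-bit float cannot hold all 18 digits, so a round trip changes the ID
round_tripped = int(float(tweet_id))

# Keeping the ID as a string preserves every digit
as_text = str(tweet_id)

# The timestamp strings carry an explicit UTC offset ("+0000")
stamp = datetime.strptime('2017-08-01 16:23:56 +0000', '%Y-%m-%d %H:%M:%S %z')
```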

##### Test

In [42]:
tweets_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2003 entries, 0 to 2355
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype              
---  ------              --------------  -----              
 0   tweet_id            2003 non-null   object             
 1   timestamp           2003 non-null   datetime64[ns, UTC]
 2   source              2003 non-null   object             
 3   text                2003 non-null   object             
 4   expanded_urls       2003 non-null   object             
 5   rating_numerator    2003 non-null   int64              
 6   rating_denominator  2003 non-null   int64              
 7   name                2003 non-null   object             
 8   doggo               2003 non-null   object             
 9   floofer             2003 non-null   object             
 10  pupper              2003 non-null   object             
 11  puppo               2003 non-null   object             
dtypes: datetime64[ns, UTC](1), int64(2

#### `images_clean`: Erroneous datatype (`tweet_id`)

##### Define

Change `tweet_id` to string

##### Code

In [43]:
images_clean.tweet_id = images_clean.tweet_id.astype(str)

##### Test

In [44]:
images_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   object 
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(1), object(5)
memory usage: 152.1+ KB


#### `likes_clean`: Erroneous datatype (`retweet_count`, `favorite_count`)

##### Define

Change `retweet_count` and `favorite_count` to int

##### Code

In [45]:
likes_clean.retweet_count = likes_clean.retweet_count.astype(int)
likes_clean.favorite_count = likes_clean.favorite_count.astype(int)

##### Test

In [46]:
likes_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2331 entries, 0 to 2330
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   tweet_id        2331 non-null   object
 1   retweet_count   2331 non-null   int32 
 2   favorite_count  2331 non-null   int32 
dtypes: int32(2), object(1)
memory usage: 36.5+ KB


### Tidiness

#### `tweets_clean`: Four columns of dog stages

##### Define

Merge the `doggo`, `floofer`, `pupper`, and `puppo` columns into a single `dog_stages` column.


##### Code

In [47]:
# Replace 'None' and NaN with an empty string for columns 'doggo',
# 'floofer', 'pupper' and 'puppo'

import numpy as np

for col in ['doggo', 'floofer', 'pupper', 'puppo']:
    # Assign the result back instead of calling replace(..., inplace=True)
    # on a column selection; np.nan is the spelling that works in all NumPy versions
    tweets_clean[col] = tweets_clean[col].replace(['None', np.nan], '')

In [48]:
# Extract information from tweets_clean['text'] to see if dog stage is mentioned
tweets_clean['dog_stages'] = tweets_clean['text'].str.extract('(doggo|floofer|pupper|puppo)', expand = True)

In [49]:
tweets_clean['dog_stages'].value_counts()

pupper     223
doggo       73
puppo       28
floofer      3
Name: dog_stages, dtype: int64

In [50]:
# Combine dog stages - some dogs have multiple stages
tweets_clean['dog_stages'] = tweets_clean['doggo'] + tweets_clean['floofer'] + tweets_clean['pupper'] + tweets_clean['puppo']
tweets_clean.loc[tweets_clean['dog_stages'] == 'doggopupper', 'dog_stages'] = 'doggo, pupper'
tweets_clean.loc[tweets_clean['dog_stages'] == 'doggopuppo', 'dog_stages'] = 'doggo, puppo'
tweets_clean.loc[tweets_clean['dog_stages'] == 'doggofloofer', 'dog_stages'] = 'doggo, floofer'
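The three explicit fixes for concatenated values cover the combinations seen in this dataset. A more general alternative, sketched here on made-up rows, joins whichever stage names are non-empty in each row, so no combination needs to be listed by hand. It assumes 'None' has already been replaced with an empty string.

```python
import pandas as pd

stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']
sample = pd.DataFrame({
    'doggo':   ['doggo', '', 'doggo'],
    'floofer': ['', '', ''],
    'pupper':  ['', 'pupper', 'pupper'],
    'puppo':   ['', '', ''],
})

# Join the non-empty stage names in each row with ', '
sample['dog_stages'] = sample[stage_cols].apply(
    lambda row: ', '.join(stage for stage in row if stage), axis=1
)
```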

In [51]:
# Drop the original 'doggo','floofer','pupper','puppo' columns
tweets_clean.drop(['doggo','floofer','pupper','puppo'], axis=1, inplace = True)

##### Test

In [52]:
tweets_clean['dog_stages'].value_counts()

                  1694
pupper             204
doggo               66
puppo               22
doggo, pupper        8
floofer              7
doggo, puppo         1
doggo, floofer       1
Name: dog_stages, dtype: int64

#### `likes_clean`: Retweet counts and favorite counts should be part of the `tweets_clean` table

##### Define

Merge `likes_clean` and `tweets_clean`


##### Code

In [53]:
tweets_clean = pd.merge(tweets_clean, likes_clean, how='left', on='tweet_id')

In [54]:
# Null entries for retweet and favorite counts are due to deleted tweets
tweets_clean['retweet_count'] = tweets_clean['retweet_count'].fillna(0)
tweets_clean['favorite_count'] = tweets_clean['favorite_count'].fillna(0)
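A left merge leaves NaN in the count columns for tweets with no match, which turns those columns into floats. The sketch below, on made-up data, uses `indicator=True` to show which rows were unmatched and casts the counts back to integers after filling. Whether 0 is the right fill value for deleted tweets is a judgment call that this sketch simply copies.

```python
import pandas as pd

tweets = pd.DataFrame({'tweet_id': ['1', '2']})
likes = pd.DataFrame({'tweet_id': ['1'],
                      'retweet_count': [10],
                      'favorite_count': [25]})

# indicator=True adds a '_merge' column showing where each row came from
merged = pd.merge(tweets, likes, how='left', on='tweet_id', indicator=True)

# Unmatched rows get NaN counts, which forces the columns to float;
# fill them and cast back to integers
for col in ['retweet_count', 'favorite_count']:
    merged[col] = merged[col].fillna(0).astype(int)
```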

##### Test

In [55]:
tweets_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2003 entries, 0 to 2002
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype              
---  ------              --------------  -----              
 0   tweet_id            2003 non-null   object             
 1   timestamp           2003 non-null   datetime64[ns, UTC]
 2   source              2003 non-null   object             
 3   text                2003 non-null   object             
 4   expanded_urls       2003 non-null   object             
 5   rating_numerator    2003 non-null   int64              
 6   rating_denominator  2003 non-null   int64              
 7   name                2003 non-null   object             
 8   dog_stages          2003 non-null   object             
 9   retweet_count       2003 non-null   float64            
 10  favorite_count      2003 non-null   float64            
dtypes: datetime64[ns, UTC](1), float64(2), int64(2), object(6)
memory usage: 187.8+ KB


#### `tweets_clean`: `rating_numerator` and `rating_denominator` do not match the rating given in `text`

##### Define

Extract the rating from `text` and rewrite the `rating_numerator` and `rating_denominator` columns.

##### Code

In [56]:
# Raw string avoids invalid-escape warnings; expand=False returns a Series
tweets_clean['rating'] = tweets_clean.text.str.extract(r'(\d+\.?\d*\/\d*\.?\d+)', expand=False)

In [57]:
tweets_clean['rating'].unique()

array(['13/10', '12/10', '14/10', '13.5/10', '11/10', '6/10', '10/10',
       '0/10', '84/70', '24/7', '9.75/10', '5/10', '11.27/10', '3/10',
       '7/10', '8/10', '9/10', '4/10', '165/150', '1776/10', '9/11',
       '204/170', '4/20', '50/50', '99/90', '80/80', '45/50', '60/50',
       '44/40', '121/110', '7/11', '11.26/10', '2/10', '144/120', '88/80',
       '1/10', '420/10', '1/2'], dtype=object)

In [58]:
ratings = tweets_clean['rating'].str.split('/', expand = True)

In [59]:
tweets_clean['rating_numerator'] = ratings[0]

In [60]:
tweets_clean['rating_denominator'] = ratings[1]

In [61]:
tweets_clean.rating_numerator = tweets_clean.rating_numerator.astype(float)
tweets_clean.rating_denominator = tweets_clean.rating_denominator.astype(float)

In [62]:
tweets_clean.drop(columns=['rating'], inplace=True)
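The extraction keeps only the first fraction-like match in each tweet, which explains values such as 24/7 and 1/2 in the results. The sketch below applies the same pattern with the standard `re` module to made-up texts that mimic the problem cases noted in the assessment, and compares the first match with all matches.

```python
import re

# Same pattern as in the extraction step, written as a raw string
RATING = re.compile(r'(\d+\.?\d*\/\d*\.?\d+)')

# Made-up texts, not actual tweets
texts = [
    'Very good boy. 13.5/10 would pet',
    'She smiles 24/7 and never stops',
    'A 3 1/2 legged dog. 9/10 still fast',
]

first_match = [RATING.search(t).group(1) for t in texts]
all_matches = [RATING.findall(t) for t in texts]
```

In the third text the first match is `1/2`, while the actual rating is the last match, so tweets with more than one match are worth reviewing by hand.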

##### Test

In [63]:
tweets_clean.rating_numerator.sort_values()

246        0.0
1982       1.0
1985       1.0
1745       1.0
1533       1.0
         ...  
1452     144.0
685      165.0
870      204.0
1728     420.0
749     1776.0
Name: rating_numerator, Length: 2003, dtype: float64

In [64]:
tweets_clean.rating_denominator.sort_values()

1982      2.0
402       7.0
0        10.0
1342     10.0
1341     10.0
        ...  
969      90.0
1320    110.0
1452    120.0
685     150.0
870     170.0
Name: rating_denominator, Length: 2003, dtype: float64

#### `tweets_clean`: Rating numerators have a very large range, and some are not actual ratings (e.g. 1776 is not a rating for a dog but the year of the United States Declaration of Independence)

##### Define

Inspect outlier `rating_numerator` values and decide whether deletion is needed.

##### Code

In [65]:
tweets_clean.query('rating_numerator == 0')

Unnamed: 0,tweet_id,timestamp,source,text,expanded_urls,rating_numerator,rating_denominator,name,dog_stages,retweet_count,favorite_count
246,835152434251116546,2017-02-24 15:40:31+00:00,"<a href=""http://twitter.com/download/iphone"" r...",When you're so blinded by your systematic plag...,https://twitter.com/dog_rates/status/835152434...,0.0,10.0,,,2861.0,21618.0


In [66]:
# rating_numerator == 0 is a tweet about plagiarism, remove
tweets_clean.drop(tweets_clean[tweets_clean.rating_numerator == 0].index, axis = 0, inplace=True)

In [67]:
tweets_clean.query('rating_numerator == 1776')

Unnamed: 0,tweet_id,timestamp,source,text,expanded_urls,rating_numerator,rating_denominator,name,dog_stages,retweet_count,favorite_count
749,749981277374128128,2016-07-04 15:00:45+00:00,"<a href=""https://about.twitter.com/products/tw...",This is Atticus. He's quite simply America af....,https://twitter.com/dog_rates/status/749981277...,1776.0,10.0,Atticus,,2361.0,4927.0


In [68]:
# 1776 is not a rating for a dog but the year of the United States Declaration of Independence, remove
tweets_clean.drop(tweets_clean[tweets_clean.rating_numerator == 1776].index, axis = 0, inplace=True)

In [69]:
tweets_clean.query('rating_numerator == 24 and rating_denominator == 7')

Unnamed: 0,tweet_id,timestamp,source,text,expanded_urls,rating_numerator,rating_denominator,name,dog_stages,retweet_count,favorite_count
402,810984652412424192,2016-12-19 23:06:23+00:00,"<a href=""http://twitter.com/download/iphone"" r...",Meet Sam. She smiles 24/7 &amp; secretly aspir...,"https://www.gofundme.com/sams-smile,https://tw...",24.0,7.0,Sam,,1387.0,5198.0


In [70]:
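# 24/7 is not a rating for a dog; remove only the row where both numerator and denominator match
not_rating = (tweets_clean.rating_numerator == 24) & (tweets_clean.rating_denominator == 7)
tweets_clean.drop(tweets_clean[not_rating].index, axis = 0, inplace=True)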

In [71]:
tweets_clean.query('rating_denominator == 2')

Unnamed: 0,tweet_id,timestamp,source,text,expanded_urls,rating_numerator,rating_denominator,name,dog_stages,retweet_count,favorite_count
1982,666287406224695296,2015-11-16 16:11:11+00:00,"<a href=""http://twitter.com/download/iphone"" r...",This is an Albanian 3 1/2 legged Episcopalian...,https://twitter.com/dog_rates/status/666287406...,1.0,2.0,an,,57.0,130.0


In [72]:
# The extracted 1/2 comes from "3 1/2 legged"; the correct rating for this tweet is 9/10.
# The row is dropped here rather than corrected.
tweets_clean.drop(tweets_clean[tweets_clean.rating_denominator == 2].index, axis = 0, inplace=True)

##### Test

In [73]:
tweets_clean.query('rating_numerator == 0 or rating_numerator == 1776 or rating_denominator == 2 or rating_denominator == 7')

Unnamed: 0,tweet_id,timestamp,source,text,expanded_urls,rating_numerator,rating_denominator,name,dog_stages,retweet_count,favorite_count


In [74]:
tweets_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1999 entries, 0 to 2002
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype              
---  ------              --------------  -----              
 0   tweet_id            1999 non-null   object             
 1   timestamp           1999 non-null   datetime64[ns, UTC]
 2   source              1999 non-null   object             
 3   text                1999 non-null   object             
 4   expanded_urls       1999 non-null   object             
 5   rating_numerator    1999 non-null   float64            
 6   rating_denominator  1999 non-null   float64            
 7   name                1999 non-null   object             
 8   dog_stages          1999 non-null   object             
 9   retweet_count       1999 non-null   float64            
 10  favorite_count      1999 non-null   float64            
dtypes: datetime64[ns, UTC](1), float64(4), object(6)
memory usage: 187.4+ KB
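
The four removals above could also be expressed as a single boolean mask. The sketch below uses a small hypothetical frame (`demo`) rather than `tweets_clean`, and matches each numerator together with its denominator so that only the intended rows are removed. The list `not_ratings` is an illustrative name, not part of the original notebook.

```python
import pandas as pd

# Hypothetical ratings, chosen to mirror the cases inspected above
demo = pd.DataFrame({
    'rating_numerator':   [13.0, 0.0, 1776.0, 24.0, 1.0, 24.0],
    'rating_denominator': [10.0, 10.0, 10.0, 7.0, 2.0, 10.0],
})

# Each (numerator, denominator) pair judged not to be a real rating
not_ratings = [(0.0, 10.0), (1776.0, 10.0), (24.0, 7.0), (1.0, 2.0)]

# Start with an all-False mask and switch on rows matching any pair
mask = pd.Series(False, index=demo.index)
for num, den in not_ratings:
    mask |= (demo['rating_numerator'] == num) & (demo['rating_denominator'] == den)

# Keep only the rows that were not flagged
demo_clean = demo[~mask]
print(demo_clean)
```

In this sketch the last row (24/10) is kept, because it matches the numerator 24 but not the denominator 7.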


#### `tweets_clean`: 745 dogs are named "None", some are named "a"

##### Define

Replace the dog names "None" and "a" with an empty string.

##### Code

In [75]:
# Assign the result back instead of calling replace(..., inplace=True) on a column selection
tweets_clean['name'] = tweets_clean['name'].replace(['None', 'a'], '')

##### Test

In [76]:
tweets_clean['name'].value_counts()

           608
Charlie     11
Lucy        10
Oliver      10
Cooper      10
          ... 
Anna         1
Eevee        1
Vince        1
Skye         1
Eugene       1
Name: name, Length: 934, dtype: int64

In [77]:
tweets_clean['name'].isnull().sum()

0
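
The row inspected earlier with the name `an` suggests that other lowercase words may also have been captured as names. A possible extension, sketched below on a hypothetical `names` series, treats every value that starts with a lowercase letter as a missing name. Whether this rule fits the full archive is an assumption that would need checking against the data.

```python
import pandas as pd

# Hypothetical name values for illustration
names = pd.Series(['Charlie', 'None', 'a', 'an', 'the', 'Lucy'])

# True where the value is the 'None' placeholder or starts with a lowercase letter
not_a_name = (names == 'None') | names.str.match(r'[a-z]')

# Replace the flagged values with an empty string, as in the cleaning step above
names_clean = names.mask(not_a_name, '')
print(names_clean.tolist())
```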

#### `images_clean`: Mixed use of uppercase and lowercase first letters in the predictions (`p1`, `p2`, `p3`)

##### Define

Convert every string in `p1`, `p2`, and `p3` to lowercase.

##### Code

In [78]:
images_clean['p1'] = images_clean['p1'].str.lower()
images_clean['p2'] = images_clean['p2'].str.lower()
images_clean['p3'] = images_clean['p3'].str.lower()

##### Test

In [79]:
images_clean.p1.str.islower().value_counts()

True    2075
Name: p1, dtype: int64

In [80]:
images_clean.p2.str.islower().value_counts()

True    2075
Name: p2, dtype: int64

In [81]:
images_clean.p3.str.islower().value_counts()

True    2075
Name: p3, dtype: int64
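
The three assignments and the three checks can also be written as loops over the column names. A minimal sketch, using a hypothetical `preds` frame in place of `images_clean` (the breed strings are made up for illustration):

```python
import pandas as pd

# Hypothetical prediction columns with mixed capitalization
preds = pd.DataFrame({
    'p1': ['Golden_retriever', 'pug'],
    'p2': ['Labrador_retriever', 'Chihuahua'],
    'p3': ['kuvasz', 'Pomeranian'],
})

prediction_cols = ['p1', 'p2', 'p3']

# Lowercase each prediction column in turn
for col in prediction_cols:
    preds[col] = preds[col].str.lower()

# Check that every value in every prediction column is now lowercase
all_lower = all(preds[col].str.islower().all() for col in prediction_cols)
print(all_lower)
```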

<a id='analysis'></a>
## Analysis

<a id='conclusion'></a>
## Conclusion