
# Collecting insights from the Twitter account @dog_rates


<img src="https://pbs.twimg.com/profile_images/948761950363664385/Fpr2Oz35_400x400.jpg" width="300" height="300" />
[(Font: WeRateDogs)](https://twitter.com/dog_rates)

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

>WeRateDogs is a Twitter account that rates people's dogs with a humorous comment about the dog. These ratings almost always have a denominator of 10. The numerators, though? Almost always greater than 10. 11/10, 12/10, 13/10, etc. Why? Because "they're good dogs Brent." WeRateDogs has over 7 million followers and has received international media coverage.

>We received via email exclusively to use in our project one archive containing the basic tweet data for all 5K+ of their tweets as they stood on August 1,2017.

>In this project we will collect, wrangle and analyse data from @dog_rates twitter account creating interesting and trustworthy analyses and visualizations.


In [1]:
# imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import requests
import json
import tweepy
# matplot magic line
% matplotlib inline

<a id='wrangling'></a>
## Data Wrangling

In this segment we will Gather, Asses and Clean the Data.

### Gathering Data

#### Reading twitter-archive-enhanced.csv file

In [2]:
#gathering Data from the CSV Archieve with pandas
tweets = pd.read_csv('twitter-archive-enhanced.csv')

#### Downloading image_predictions.tsv hosted on Udacity's servers

Using requests we will save the file, then we will open it with pandas.

In [3]:
#setting url
url = "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv"
#getting the archieve in memory
r = requests.get(url)
r.status_code

200

In [4]:
#opening and saving the archive content
with open('predictions.tsv', 'wb') as f:
    f.write(r.content)

In [3]:
#gathering Data from the TSV Archieve with pandas
predictions = pd.read_csv('predictions.tsv', sep= '\t')

#### Getting more informations for tweets table from Twitter

We will use tweepy to get the aditional information of the tweets

In [4]:
#importing the class where the TWEEPY Conf is in. (Parameters we use below)
import tweepyconf as tpc

In [7]:
#To Run you sould take the tpc. out
#autentification
auth = tweepy.OAuthHandler(tpc.CONSUMER_KEY, tpc.CONSUMER_SECRET)
auth.set_access_token(tpc.OAUTH_TOKEN, tpc.OAUTH_TOKEN_SECRET)
api = tweepy.API(auth)

api.wait_on_rate_limit = True
api.wait_on_rate_limit_notify = True

In [8]:
tweets_ids_deleted = []
tweets_ids_error = []
with open('tweet_json.txt', 'w', encoding='utf-8') as f:
    #Making the tweet_id for loop
    for tid in tweets.tweet_id:
        try:
            tweet = api.get_status(tid, tweet_mode='extended')
            f.writelines(json.dumps(tweet._json) + '\n')
        except tweepy.TweepError as e:
            code = e.api_code
            #TWEEPY ERROR CODE 144 (TWEET DELETED ERROR: "'No status found with that ID.'")
            if code == 144:
                tweets_ids_deleted.append(tid)
            else:
                tweets_ids_error.append(tid)
            continue
        except:
            tweets_ids_error.append(tid)
            continue
    print('Total Errors: ' + str(len(tweets_ids_error)))
    print('Total Deleted: ' + str(len(tweets_ids_deleted)))

Rate limit reached. Sleeping for: 588
Rate limit reached. Sleeping for: 366


Total Errors: 10
Total Deleted: 13


Retrying to gather erros and append then into the end of the file.

In [9]:
tweets_ids_error = []
with open('tweet_json.txt', 'a', encoding='utf-8') as f:
    #Making the tweet_id for loop
    for tid in tweets_ids_error:
        try:
            tweet = api.get_status(tid, tweet_mode='extended')
            f.writelines(json.dumps(tweet._json) + '\n')
        except tweepy.TweepError as e:
            code = e.api_code
            if code == 144:
                tweets_ids_deleted.append(tid)
            else:
                print('code error: ' + str(code))
                tweets_ids_error.append(tid)
            continue
        except:
            tweets_ids_error.append(tid)
            continue
    print('Total Errors: ' + str(len(tweets_ids_error)))
    print('Total Deleted: ' + str(len(tweets_ids_deleted)))

Total Errors: 0
Total Deleted: 13


In [10]:
tweets_ids_deleted

[888202515573088257,
 873697596434513921,
 869988702071779329,
 866816280283807744,
 861769973181624320,
 845459076796616705,
 842892208864923648,
 837012587749474308,
 827228250799742977,
 802247111496568832,
 775096608509886464,
 770743923962707968,
 754011816964026368]

Opening tweet_json.txt with pandas

In [5]:
tweets_complement = pd.read_json('tweet_json.txt', lines=True)

### Assessing Data
In this segment we will look through the data frames to find problems, and list them in the problem section.

* Let's look into the tweets Data frame

In [6]:
tweets.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


The four last columns are dogs characteristics and should be just one column, besides the 'None' value should be NULL.
There are some columns we won't use, so we can drop these columns.
The source column should not have the http pattern.

In [7]:
tweets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob

Timestamp is not a date object
There are missing values from expanded_urls

In [8]:
tweets.nunique()

tweet_id                      2356
in_reply_to_status_id           77
in_reply_to_user_id             31
timestamp                     2356
source                           4
text                          2356
retweeted_status_id            181
retweeted_status_user_id        25
retweeted_status_timestamp     181
expanded_urls                 2218
rating_numerator                40
rating_denominator              18
name                           957
doggo                            2
floofer                          2
pupper                           2
puppo                            2
dtype: int64

In [9]:
tweets.rating_denominator.value_counts()

10     2333
11        3
50        3
80        2
20        2
2         1
16        1
40        1
70        1
15        1
90        1
110       1
120       1
130       1
150       1
170       1
7         1
0         1
Name: rating_denominator, dtype: int64

In [10]:
pd.options.display.max_colwidth = 140
tweets[tweets.rating_denominator != 10]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
313,835246439529840640,8.35246e+17,26259580.0,2017-02-24 21:54:03 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","@jonnysun @Lin_Manuel ok jomny I know you're excited but 960/00 isn't a valid rating, 13/10 is tho",,,,,960,0,,,,,
342,832088576586297345,8.320875e+17,30582080.0,2017-02-16 04:45:50 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",@docmisterio account started on 11/15/15,,,,,11,15,,,,,
433,820690176645140481,,,2017-01-15 17:52:40 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",The floofs have been released I repeat the floofs have been released. 84/70 https://t.co/NIYC820tmd,,,,"https://twitter.com/dog_rates/status/820690176645140481/photo/1,https://twitter.com/dog_rates/status/820690176645140481/photo/1,https://...",84,70,,,,,
516,810984652412424192,,,2016-12-19 23:06:23 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Meet Sam. She smiles 24/7 &amp; secretly aspires to be a reindeer. \nKeep Sam smiling by clicking and sharing this link:\nhttps://t.co/9...,,,,"https://www.gofundme.com/sams-smile,https://twitter.com/dog_rates/status/810984652412424192/photo/1",24,7,Sam,,,,
784,775096608509886464,,,2016-09-11 22:20:06 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","RT @dog_rates: After so many requests, this is Bretagne. She was the last surviving 9/11 search dog, and our second ever 14/10. RIP http...",7.403732e+17,4196984000.0,2016-06-08 02:41:38 +0000,"https://twitter.com/dog_rates/status/740373189193256964/photo/1,https://twitter.com/dog_rates/status/740373189193256964/photo/1,https://...",9,11,,,,,
902,758467244762497024,,,2016-07-28 01:00:57 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Why does this never happen at my front door... 165/150 https://t.co/HmwrdfEfUE,,,,https://twitter.com/dog_rates/status/758467244762497024/video/1,165,150,,,,,
1068,740373189193256964,,,2016-06-08 02:41:38 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","After so many requests, this is Bretagne. She was the last surviving 9/11 search dog, and our second ever 14/10. RIP https://t.co/XAVDND...",,,,"https://twitter.com/dog_rates/status/740373189193256964/photo/1,https://twitter.com/dog_rates/status/740373189193256964/photo/1,https://...",9,11,,,,,
1120,731156023742988288,,,2016-05-13 16:15:54 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Say hello to this unbelievably well behaved squad of doggos. 204/170 would try to pet all at once https://t.co/yGQI3He3xv,,,,https://twitter.com/dog_rates/status/731156023742988288/photo/1,204,170,this,,,,
1165,722974582966214656,,,2016-04-21 02:25:47 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Happy 4/20 from the squad! 13/10 for all https://t.co/eV1diwds8a,,,,https://twitter.com/dog_rates/status/722974582966214656/photo/1,4,20,,,,,
1202,716439118184652801,,,2016-04-03 01:36:11 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Bluebert. He just saw that both #FinalFur match ups are split 50/50. Amazed af. 11/10 https://t.co/Kky1DPG4iq,,,,https://twitter.com/dog_rates/status/716439118184652801/photo/1,50,50,Bluebert,,,,


We can see that are wrong denominators and numerators.
We should notice there are replies for the tweets and retweets too.

In [11]:
sum(tweets.duplicated())

0

* Let's look into the predictions Data frame

In [12]:
predictions.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


In [13]:
predictions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [14]:
predictions.nunique()

tweet_id    2075
jpg_url     2009
img_num        4
p1           378
p1_conf     2006
p1_dog         2
p2           405
p2_conf     2004
p2_dog         2
p3           408
p3_conf     2006
p3_dog         2
dtype: int64

There are equal JPEG URLs

In [15]:
sum(predictions.duplicated())

0

In [100]:
predictions_clean[predictions_clean.jpg_url.duplicated()]

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
1297,752309394570878976,https://pbs.twimg.com/ext_tw_video_thumb/675354114423808004/pu/img/qL1R_nGLqa6lmkOx.jpg,1,upright,0.303415,False,golden_retriever,0.181351,True,Brittany_spaniel,0.162084,True
1315,754874841593970688,https://pbs.twimg.com/media/CWza7kpWcAAdYLc.jpg,1,pug,0.272205,True,bull_mastiff,0.251530,True,bath_towel,0.116806,False
1333,757729163776290825,https://pbs.twimg.com/media/CWyD2HGUYAQ1Xa7.jpg,2,cash_machine,0.802333,False,schipperke,0.045519,True,German_shepherd,0.023353,True
1345,759159934323924993,https://pbs.twimg.com/media/CU1zsMSUAAAS0qW.jpg,1,Irish_terrier,0.254856,True,briard,0.227716,True,soft-coated_wheaten_terrier,0.223263,True
1349,759566828574212096,https://pbs.twimg.com/media/CkNjahBXAAQ2kWo.jpg,1,Labrador_retriever,0.967397,True,golden_retriever,0.016641,True,ice_bear,0.014858,False
1364,761371037149827077,https://pbs.twimg.com/tweet_video_thumb/CeBym7oXEAEWbEg.jpg,1,brown_bear,0.713293,False,Indian_elephant,0.172844,False,water_buffalo,0.038902,False
1368,761750502866649088,https://pbs.twimg.com/media/CYLDikFWEAAIy1y.jpg,1,golden_retriever,0.586937,True,Labrador_retriever,0.398260,True,kuvasz,0.005410,True
1387,766078092750233600,https://pbs.twimg.com/media/ChK1tdBWwAQ1flD.jpg,1,toy_poodle,0.420463,True,miniature_poodle,0.132640,True,Chesapeake_Bay_retriever,0.121523,True
1407,770093767776997377,https://pbs.twimg.com/media/CkjMx99UoAM2B1a.jpg,1,golden_retriever,0.843799,True,Labrador_retriever,0.052956,True,kelpie,0.035711,True
1417,771171053431250945,https://pbs.twimg.com/media/CVgdFjNWEAAxmbq.jpg,3,Samoyed,0.978833,True,Pomeranian,0.012763,True,Eskimo_dog,0.001853,True


In [101]:
predictions_clean[predictions_clean.jpg_url == 'https://pbs.twimg.com/media/CkjMx99UoAM2B1a.jpg']

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
1201,741067306818797568,https://pbs.twimg.com/media/CkjMx99UoAM2B1a.jpg,1,golden_retriever,0.843799,True,Labrador_retriever,0.052956,True,kelpie,0.035711,True
1407,770093767776997377,https://pbs.twimg.com/media/CkjMx99UoAM2B1a.jpg,1,golden_retriever,0.843799,True,Labrador_retriever,0.052956,True,kelpie,0.035711,True


As we can see, the only diference in these duplicated roles is the tweet_id.
Also, we can notice there are predictions the none of them are dogs.

* Let's look into the tweets_complement Data frame

In [16]:
tweets_complement.head()

Unnamed: 0,contributors,coordinates,created_at,display_text_range,entities,extended_entities,favorite_count,favorited,full_text,geo,...,quoted_status,quoted_status_id,quoted_status_id_str,quoted_status_permalink,retweet_count,retweeted,retweeted_status,source,truncated,user
0,,,2017-08-01 16:23:56,"[0, 85]","{'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [], 'media': [{'id': 892420639486877696, 'id_str': '892420639486877696', 'i...","{'media': [{'id': 892420639486877696, 'id_str': '892420639486877696', 'indices': [86, 109], 'media_url': 'http://pbs.twimg.com/media/DGK...",38614,False,This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU,,...,,,,,8542,False,,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,"{'id': 4196983835, 'id_str': '4196983835', 'name': 'WeRateDogs™', 'screen_name': 'dog_rates', 'location': '𝓶𝓮𝓻𝓬𝓱 ↴ DM YOUR DOGS', '..."
1,,,2017-08-01 00:17:27,"[0, 138]","{'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [], 'media': [{'id': 892177413194625024, 'id_str': '892177413194625024', 'i...","{'media': [{'id': 892177413194625024, 'id_str': '892177413194625024', 'indices': [139, 162], 'media_url': 'http://pbs.twimg.com/media/DG...",33102,False,"This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/...",,...,,,,,6280,False,,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,"{'id': 4196983835, 'id_str': '4196983835', 'name': 'WeRateDogs™', 'screen_name': 'dog_rates', 'location': '𝓶𝓮𝓻𝓬𝓱 ↴ DM YOUR DOGS', '..."
2,,,2017-07-31 00:18:03,"[0, 121]","{'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [], 'media': [{'id': 891815175371796480, 'id_str': '891815175371796480', 'i...","{'media': [{'id': 891815175371796480, 'id_str': '891815175371796480', 'indices': [122, 145], 'media_url': 'http://pbs.twimg.com/media/DG...",24922,False,This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10 https://t.co/w...,,...,,,,,4161,False,,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,"{'id': 4196983835, 'id_str': '4196983835', 'name': 'WeRateDogs™', 'screen_name': 'dog_rates', 'location': '𝓶𝓮𝓻𝓬𝓱 ↴ DM YOUR DOGS', '..."
3,,,2017-07-30 15:58:51,"[0, 79]","{'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [], 'media': [{'id': 891689552724799489, 'id_str': '891689552724799489', 'i...","{'media': [{'id': 891689552724799489, 'id_str': '891689552724799489', 'indices': [80, 103], 'media_url': 'http://pbs.twimg.com/media/DF_...",42011,False,This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us https://t.co/tD36da7qLQ,,...,,,,,8668,False,,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,"{'id': 4196983835, 'id_str': '4196983835', 'name': 'WeRateDogs™', 'screen_name': 'dog_rates', 'location': '𝓶𝓮𝓻𝓬𝓱 ↴ DM YOUR DOGS', '..."
4,,,2017-07-29 16:00:24,"[0, 138]","{'hashtags': [{'text': 'BarkWeek', 'indices': [129, 138]}], 'symbols': [], 'user_mentions': [], 'urls': [], 'media': [{'id': 89132755194...","{'media': [{'id': 891327551943041024, 'id_str': '891327551943041024', 'indices': [139, 162], 'media_url': 'http://pbs.twimg.com/media/DF...",40163,False,"This is Franklin. He would like you to stop calling him ""cute."" He is a very fierce shark and should be respected as such. 12/10 #BarkWe...",,...,,,,,9419,False,,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,"{'id': 4196983835, 'id_str': '4196983835', 'name': 'WeRateDogs™', 'screen_name': 'dog_rates', 'location': '𝓶𝓮𝓻𝓬𝓱 ↴ DM YOUR DOGS', '..."


There are lots of columns we won't use

In [17]:
tweets_complement.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2333 entries, 0 to 2332
Data columns (total 32 columns):
contributors                     0 non-null float64
coordinates                      0 non-null float64
created_at                       2333 non-null datetime64[ns]
display_text_range               2333 non-null object
entities                         2333 non-null object
extended_entities                2061 non-null object
favorite_count                   2333 non-null int64
favorited                        2333 non-null bool
full_text                        2333 non-null object
geo                              0 non-null float64
id                               2333 non-null int64
id_str                           2333 non-null int64
in_reply_to_screen_name          78 non-null object
in_reply_to_status_id            78 non-null float64
in_reply_to_status_id_str        78 non-null float64
in_reply_to_user_id              78 non-null float64
in_reply_to_user_id_str          78 n

Lot's of columns with null values

#### Data Problems
* Tweets dataframe

        ** unify columns doggo, floofer, pupper, puppo.
        ** drop unused/repeated columns like in_reply_to_user_id, retweeted_status_user_id...
        ** timestamp not in time format.
        ** fix numerators.
        ** fix denominators.
        ** there are tweets that have already been deleted.
        ** replies in the middle of tweets dataframe.
        ** retweets in the middle of tweets dataframe.

* predictions dataframe

        ** Make Jpeg URLs unique (Duplicated lines)


* Tweets_complement dataframe
    
        ** drop unused/repeated columns

* Combined dataframes

        ** unify tweets dataframe with tweets_complement dataframe

### Cleaning Data
In this segment we will clean the data and solve the problems above

Let's first make a copy of each dataframe

In [56]:
tweets_clean = tweets.copy()
predictions_clean = predictions.copy()
tweets_complement_clean = tweets_complement.copy()

#### Tidiness

1 - take repeated columns or columns we won't use from `tweets` dataframe out

##### Define

Drop the columns

##### Code

In [57]:
tweets_clean = tweets_clean.drop(['source', 'in_reply_to_user_id','retweeted_status_user_id'
                                  ,'retweeted_status_timestamp'], axis=1)

##### Test

In [58]:
tweets_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 13 columns):
tweet_id                 2356 non-null int64
in_reply_to_status_id    78 non-null float64
timestamp                2356 non-null object
text                     2356 non-null object
retweeted_status_id      181 non-null float64
expanded_urls            2297 non-null object
rating_numerator         2356 non-null int64
rating_denominator       2356 non-null int64
name                     2356 non-null object
doggo                    2356 non-null object
floofer                  2356 non-null object
pupper                   2356 non-null object
puppo                    2356 non-null object
dtypes: float64(2), int64(3), object(8)
memory usage: 239.4+ KB


2 - take repeated columns or columns we won't use from `tweets_complement` and repeated information from `tweets` out

##### Define

Drop the columns

##### Code

In [59]:
tweets_complement_clean = tweets_complement_clean.drop(['contributors','coordinates','display_text_range','entities',
                                                        'extended_entities','full_text','geo','id_str','in_reply_to_screen_name',
                                                        'in_reply_to_status_id','in_reply_to_status_id_str',
                                                        'in_reply_to_user_id','in_reply_to_user_id_str','is_quote_status',
                                                        'lang','place','possibly_sensitive','possibly_sensitive_appealable',
                                                        'quoted_status','quoted_status_id','quoted_status_id_str',
                                                        'quoted_status_permalink','source','truncated','user',
                                                        'retweeted_status'], axis=1)

##### Test

In [60]:
tweets_complement_clean.head()

Unnamed: 0,created_at,favorite_count,favorited,id,retweet_count,retweeted
0,2017-08-01 16:23:56,38614,False,892420643555336193,8542,False
1,2017-08-01 00:17:27,33102,False,892177421306343426,6280,False
2,2017-07-31 00:18:03,24922,False,891815181378084864,4161,False
3,2017-07-30 15:58:51,42011,False,891689557279858688,8668,False
4,2017-07-29 16:00:24,40163,False,891327558926688256,9419,False


3 - Four columns `doggo`, `floofer`, `pupper`, `puppo` should be values in 1 column

##### Define

We'll create a new column with the result of the other columns, them clean it and drop the unified columns.

##### Code

In [61]:
tweets_clean['characteristic'] = tweets['doggo'] + tweets['floofer'] + tweets['pupper'] + tweets['puppo']
tweets_clean.characteristic = tweets_clean.characteristic.replace("None","", regex=True)
tweets_clean = tweets_clean.drop(['doggo', 'floofer','pupper','puppo'], axis=1)

##### Test

In [24]:
tweets_clean.sample(5)

Unnamed: 0,tweet_id,in_reply_to_status_id,timestamp,text,retweeted_status_id,expanded_urls,rating_numerator,rating_denominator,name,characteristic
2010,672248013293752320,,2015-12-03 02:56:30 +0000,10/10 for dog. 7/10 for cat. 12/10 for human. Much skill. Would pet all https://t.co/uhx5gfpx5k,,https://twitter.com/dog_rates/status/672248013293752320/photo/1,10,10,,
213,851591660324737024,,2017-04-11 00:24:08 +0000,Oh jeez u did me quite the spook little fella. We normally don't rate triceratops but this one seems suspiciously good. 11/10 would pet ...,,https://twitter.com/dog_rates/status/851591660324737024/photo/1,11,10,,
742,780476555013349377,,2016-09-26 18:38:05 +0000,RT @Patreon: Well. @dog_rates is on Patreon. \n\n12/10. \n\nhttps://t.co/rnKvzt6RJs https://t.co/v4e2ywe8iO,7.804657e+17,"https://www.patreon.com/WeRateDogs,https://twitter.com/Patreon/status/780465709297995776/photo/1,https://www.patreon.com/WeRateDogs,http...",12,10,,
344,832032802820481025,,2017-02-16 01:04:13 +0000,This is Miguel. He was the only remaining doggo at the adoption center after the weekend. Let's change that. 12/10\n\nhttps://t.co/P0bO8...,,"https://www.petfinder.com/petdetail/34918210,https://twitter.com/dog_rates/status/832032802820481025/photo/1,https://twitter.com/dog_rat...",12,10,Miguel,doggo
258,843604394117681152,,2017-03-19 23:25:35 +0000,This is Hank. He's been outside for 3 minutes and already made a friend. Way to go Hank. 11/10 for both https://t.co/wHUElL84RC,,https://twitter.com/dog_rates/status/843604394117681152/photo/1,11,10,Hank,


4 - Let's merge the `tweets_clean` with `tweets_complement_clean` dataframes

##### Define

Merge the dataframes with left join

##### Code

In [62]:
tweets_clean = tweets_clean.merge(tweets_complement_clean, left_on='tweet_id', right_on='id', how = 'left')

##### Test

In [63]:
#drop the duplicated column (tweets_id = id)
tweets_clean = tweets_clean.drop(['id'], axis=1)
tweets_clean.sample()

Unnamed: 0,tweet_id,in_reply_to_status_id,timestamp,text,retweeted_status_id,expanded_urls,rating_numerator,rating_denominator,name,characteristic,created_at,favorite_count,favorited,retweet_count,retweeted
480,815736392542261248,,2017-01-02 01:48:06 +0000,This is Akumi. It's his birthday. He received many lickable gifts. 11/10 happy h*ckin birthday https://t.co/gd9UlLOCQ0,,"https://twitter.com/dog_rates/status/815736392542261248/photo/1,https://twitter.com/dog_rates/status/815736392542261248/photo/1,https://...",11,10,Akumi,,2017-01-02 01:48:06,10651.0,False,2535.0,False


#### Quality

1 - Tweets that have already been deleted

##### Define

Delete these deleted tweets from `tweets_clean`

##### Code

In [27]:
for tid in tweets_ids_deleted:
    tweets_clean = tweets_clean[tweets_clean.tweet_id != tid]

NameError: name 'tweets_ids_deleted' is not defined

##### Test

In [116]:
sum(tweets_clean.tweet_id.isin(tweets_ids_deleted))

0

2 - Erroneous datatype

##### Define

Change timestamp to data type

##### Code

In [64]:
tweets_clean.timestamp = pd.to_datetime(tweets_clean.timestamp)

##### Test

In [65]:
tweets_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2356 entries, 0 to 2355
Data columns (total 15 columns):
tweet_id                 2356 non-null int64
in_reply_to_status_id    78 non-null float64
timestamp                2356 non-null datetime64[ns]
text                     2356 non-null object
retweeted_status_id      181 non-null float64
expanded_urls            2297 non-null object
rating_numerator         2356 non-null int64
rating_denominator       2356 non-null int64
name                     2356 non-null object
characteristic           2356 non-null object
created_at               2333 non-null datetime64[ns]
favorite_count           2333 non-null float64
favorited                2333 non-null object
retweet_count            2333 non-null float64
retweeted                2333 non-null object
dtypes: datetime64[ns](2), float64(4), int64(3), object(6)
memory usage: 294.5+ KB


3 - deleting the replies from the dataframe

4 - deleting the retweets from the dataframe

##### Define

drop lines where the column in_reply_to_status_id and retweeted_status_id is a non-null value

##### Code

In [66]:
tweets_clean = tweets_clean[np.isnan(tweets_clean.in_reply_to_status_id)]
tweets_clean = tweets_clean[np.isnan(tweets_clean.retweeted_status_id)]

##### Test

In [67]:
tweets_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2097 entries, 0 to 2355
Data columns (total 15 columns):
tweet_id                 2097 non-null int64
in_reply_to_status_id    0 non-null float64
timestamp                2097 non-null datetime64[ns]
text                     2097 non-null object
retweeted_status_id      0 non-null float64
expanded_urls            2094 non-null object
rating_numerator         2097 non-null int64
rating_denominator       2097 non-null int64
name                     2097 non-null object
characteristic           2097 non-null object
created_at               2088 non-null datetime64[ns]
favorite_count           2088 non-null float64
favorited                2088 non-null object
retweet_count            2088 non-null float64
retweeted                2088 non-null object
dtypes: datetime64[ns](2), float64(4), int64(3), object(6)
memory usage: 262.1+ KB


In [68]:
tweets_clean = tweets_clean.drop(['retweeted_status_id', 'in_reply_to_status_id'], axis=1)

5 - rating_denominator that differs 10, we see by the texts that are some there aren't. In some cases the text mining could get the wrong numbers.

6 - wrong rating_numerators

So we can check and fix it


##### Define

Replace the wrong numbers with the right ones

##### Code

In [70]:
tweets_clean[tweets_clean.rating_denominator != 10]

Unnamed: 0,tweet_id,timestamp,text,expanded_urls,rating_numerator,rating_denominator,name,characteristic,created_at,favorite_count,favorited,retweet_count,retweeted
433,820690176645140481,2017-01-15 17:52:40,The floofs have been released I repeat the floofs have been released. 84/70 https://t.co/NIYC820tmd,"https://twitter.com/dog_rates/status/820690176645140481/photo/1,https://twitter.com/dog_rates/status/820690176645140481/photo/1,https://...",84,70,,,2017-01-15 17:52:40,13175.0,False,3586.0,False
516,810984652412424192,2016-12-19 23:06:23,Meet Sam. She smiles 24/7 &amp; secretly aspires to be a reindeer. \nKeep Sam smiling by clicking and sharing this link:\nhttps://t.co/9...,"https://www.gofundme.com/sams-smile,https://twitter.com/dog_rates/status/810984652412424192/photo/1",24,7,Sam,,2016-12-19 23:06:23,5791.0,False,1603.0,False
902,758467244762497024,2016-07-28 01:00:57,Why does this never happen at my front door... 165/150 https://t.co/HmwrdfEfUE,https://twitter.com/dog_rates/status/758467244762497024/video/1,165,150,,,2016-07-28 01:00:57,5168.0,False,2457.0,False
1068,740373189193256964,2016-06-08 02:41:38,"After so many requests, this is Bretagne. She was the last surviving 9/11 search dog, and our second ever 14/10. RIP https://t.co/XAVDND...","https://twitter.com/dog_rates/status/740373189193256964/photo/1,https://twitter.com/dog_rates/status/740373189193256964/photo/1,https://...",9,11,,,2016-06-08 02:41:38,37020.0,False,14533.0,False
1120,731156023742988288,2016-05-13 16:15:54,Say hello to this unbelievably well behaved squad of doggos. 204/170 would try to pet all at once https://t.co/yGQI3He3xv,https://twitter.com/dog_rates/status/731156023742988288/photo/1,204,170,this,,2016-05-13 16:15:54,4079.0,False,1385.0,False
1165,722974582966214656,2016-04-21 02:25:47,Happy 4/20 from the squad! 13/10 for all https://t.co/eV1diwds8a,https://twitter.com/dog_rates/status/722974582966214656/photo/1,4,20,,,2016-04-21 02:25:47,4367.0,False,1709.0,False
1202,716439118184652801,2016-04-03 01:36:11,This is Bluebert. He just saw that both #FinalFur match ups are split 50/50. Amazed af. 11/10 https://t.co/Kky1DPG4iq,https://twitter.com/dog_rates/status/716439118184652801/photo/1,50,50,Bluebert,,2016-04-03 01:36:11,2504.0,False,237.0,False
1228,713900603437621249,2016-03-27 01:29:02,Happy Saturday here's 9 puppers on a bench. 99/90 good work everybody https://t.co/mpvaVxKmc1,https://twitter.com/dog_rates/status/713900603437621249/photo/1,99,90,,,2016-03-27 01:29:02,2994.0,False,807.0,False
1254,710658690886586372,2016-03-18 02:46:49,Here's a brigade of puppers. All look very prepared for whatever happens next. 80/80 https://t.co/0eb7R1Om12,https://twitter.com/dog_rates/status/710658690886586372/photo/1,80,80,,,2016-03-18 02:46:49,2453.0,False,612.0,False
1274,709198395643068416,2016-03-14 02:04:08,"From left to right:\nCletus, Jerome, Alejandro, Burp, &amp; Titson\nNone know where camera is. 45/50 would hug all at once https://t.co/...",https://twitter.com/dog_rates/status/709198395643068416/photo/1,45,50,,,2016-03-14 02:04:08,2563.0,False,694.0,False


wrong
tweets_id - num - dem
666287406224695296 - 9 - 10
682962037429899265 - 10 - 10
716439118184652801 - 11 - 10
722974582966214656 - 13 - 10
740373189193256964 - 14 - 10

In [86]:
tweets_clean.ix[tweets_clean[tweets_clean.tweet_id == 666287406224695296].index,'rating_numerator']=9
tweets_clean.ix[tweets_clean[tweets_clean.tweet_id == 666287406224695296].index,'rating_denominator']=10

tweets_clean.ix[tweets_clean[tweets_clean.tweet_id == 682962037429899265].index,'rating_numerator']=10
tweets_clean.ix[tweets_clean[tweets_clean.tweet_id == 682962037429899265].index,'rating_denominator']=10

tweets_clean.ix[tweets_clean[tweets_clean.tweet_id == 716439118184652801].index,'rating_numerator']=11
tweets_clean.ix[tweets_clean[tweets_clean.tweet_id == 716439118184652801].index,'rating_denominator']=10

tweets_clean.ix[tweets_clean[tweets_clean.tweet_id == 722974582966214656].index,'rating_numerator']=13
tweets_clean.ix[tweets_clean[tweets_clean.tweet_id == 722974582966214656].index,'rating_denominator']=10

tweets_clean.ix[tweets_clean[tweets_clean.tweet_id == 740373189193256964].index,'rating_numerator']=14
tweets_clean.ix[tweets_clean[tweets_clean.tweet_id == 740373189193256964].index,'rating_denominator']=10

In [91]:
tweets_clean[tweets_clean.tweet_id == 666287406224695296]

Unnamed: 0,tweet_id,timestamp,text,expanded_urls,rating_numerator,rating_denominator,name,characteristic,created_at,favorite_count,favorited,retweet_count,retweeted
2335,666287406224695296,2015-11-16 16:11:11,This is an Albanian 3 1/2 legged Episcopalian. Loves well-polished hardwood flooring. Penis on the collar. 9/10 https://t.co/d9NcXFKwLv,https://twitter.com/dog_rates/status/666287406224695296/photo/1,9,10,an,,2015-11-16 16:11:11,149.0,False,66.0,False


7 - There shouldn't be repeated jpegs in the predictions table 


##### Define

Deleted the duplicated rows, letting just 1 from each jpeg

##### Code