# Project: Wrangling and Analyze Data

## Data Gathering
In the cell below, gather **all** three pieces of data for this project and load them in the notebook. **Note:** the methods required to gather each piece of data are different.

In [61]:
import pandas as pd
import numpy as np
import requests
import os
import tweepy
import json
import time
import matplotlib.pyplot as plt
import seaborn as sb
%matplotlib inline

In [62]:
sb.set()

1. Directly download the WeRateDogs Twitter archive data (twitter-archive-enhanced.csv)

In [63]:
twitter_archive_enhanced_df = pd.read_csv('twitter-archive-enhanced.csv')

2. Use the Requests library to download the tweet image prediction (image_predictions.tsv)

In [3]:
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)

In [4]:
response

<Response [200]>

In [5]:
response.content

b"tweet_id\tjpg_url\timg_num\tp1\tp1_conf\tp1_dog\tp2\tp2_conf\tp2_dog\tp3\tp3_conf\tp3_dog\n666020888022790149\thttps://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg\t1\tWelsh_springer_spaniel\t0.465074\tTrue\tcollie\t0.156665\tTrue\tShetland_sheepdog\t0.0614285\tTrue\n666029285002620928\thttps://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg\t1\tredbone\t0.506826\tTrue\tminiature_pinscher\t0.07419169999999999\tTrue\tRhodesian_ridgeback\t0.07201\tTrue\n666033412701032449\thttps://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg\t1\tGerman_shepherd\t0.596461\tTrue\tmalinois\t0.13858399999999998\tTrue\tbloodhound\t0.11619700000000001\tTrue\n666044226329800704\thttps://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg\t1\tRhodesian_ridgeback\t0.408143\tTrue\tredbone\t0.360687\tTrue\tminiature_pinscher\t0.222752\tTrue\n666049248165822465\thttps://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg\t1\tminiature_pinscher\t0.560311\tTrue\tRottweiler\t0.243682\tTrue\tDoberman\t0.154629\tTrue\n666050758794694657\thttps://pbs.twimg.com/

In [6]:
with open(url.split('/')[-1], mode='wb') as file:
    file.write(response.content)

In [64]:
twitter_image_predictions_df = pd.read_csv('image-predictions.tsv', sep='\t')
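The file name used above is the last path segment of the download URL. A minimal sketch of that step as a reusable helper follows; the names `filename_from_url` and `save_download` are hypothetical, and the network request itself is left to the caller.

```python
import os
from urllib.parse import urlparse


def filename_from_url(url):
    """Return the last path segment of a URL (hypothetical helper)."""
    return os.path.basename(urlparse(url).path)


def save_download(content, url, folder='.'):
    """Write already-downloaded bytes to `folder`, named after the URL (hypothetical helper)."""
    path = os.path.join(folder, filename_from_url(url))
    with open(path, mode='wb') as file:
        file.write(content)
    return path
```

With the `response` object from the cells above, calling `response.raise_for_status()` before `save_download(response.content, url)` would stop early if the request had failed.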

3. Use the Tweepy library to query additional data via the Twitter API (tweet_json.txt)

**Documentation:**

[WeRateDogs](https://twitter.com/dog_rates) Twitter account.

[Models Reference](https://docs.tweepy.org/en/stable/models.html?highlight=models%20reference) for the [Tweepy API](https://docs.tweepy.org/en/stable/api.html) (including the [Status object](https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/object-model/tweet) used below).

In [8]:
consumer_key = 'API KEY'
consumer_secret = 'API KEY SECRET'
access_token = 'ACCESS TOKEN'
access_secret = 'ACCESS TOKEN SECRET'

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth, wait_on_rate_limit=True)
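The placeholders above stand in for real credentials. One way to keep secrets out of the notebook is to read them from environment variables. This is only a sketch: the helper `load_credentials` and the four variable names are assumptions, not part of Tweepy.

```python
import os


def load_credentials(env=os.environ):
    """Read the four Twitter credentials from environment variables.

    The variable names below are assumptions chosen for this sketch.
    """
    names = ['TWITTER_API_KEY', 'TWITTER_API_SECRET',
             'TWITTER_ACCESS_TOKEN', 'TWITTER_ACCESS_SECRET']
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise KeyError('missing environment variables: ' + ', '.join(missing))
    return tuple(env[name] for name in names)
```

The returned tuple could then be unpacked into `consumer_key`, `consumer_secret`, `access_token` and `access_secret` before building the `auth` object.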

In [9]:
# example

tweet_id = 885311592912609280

# fetching the status
status = api.get_status(tweet_id, tweet_mode="extended")

# fetching the retweet_count attribute
retweet_count = status.retweet_count

# fetching the favorite_count attribute
favorite_count = status.favorite_count

# fetching the favorited attribute (whether the status has been favourited by the authenticated user or not)
favorited = status.favorited

# fetching the sensitive attribute (whether the status is sensitive or not)
possibly_sensitive = status.possibly_sensitive

# fetching the retweeted attribute
retweeted = status.retweeted

print(status.full_text)

RT @dog_rates: This is Lilly. She just parallel barked. Kindly requests a reward now. 13/10 would pet so well https://t.co/SATN4If5H5


In [10]:
tweet_ids = twitter_archive_enhanced_df.tweet_id.values
fails_dict = {}
start = time.time()
with open('tweet_json.txt', 'w') as file:
    for tweet_id in tweet_ids:
        try:
            # fetching the status
            status = api.get_status(tweet_id, tweet_mode="extended")
            print(status.full_text + '\n')
            
            # write the status object to the file.
            json.dump(status._json, file)
            
            file.write('\n')
        except tweepy.TweepError as e:
            fails_dict[tweet_id] = e
            print(e)
end = time.time()
print(end - start)

This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU

This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 https://t.co/0Xxu71qeIV

This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10 https://t.co/wUnZnhtVJB

This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us https://t.co/tD36da7qLQ

This is Franklin. He would like you to stop calling him "cute." He is a very fierce shark and should be respected as such. 12/10 #BarkWeek https://t.co/AtUZn91f7f

Here we have a majestic great white breaching off South Africa's coast. Absolutely h*ckin breathtaking. 13/10 (IG: tucker_marlo) #BarkWeek https://t.co/kQ04fDDRmh

Meet Jax. He enjoys ice cream so much he gets nervous around it. 13/10 help Jax enjoy more things by clicking below

https://t.co/Zr4hWfAs1H http

In [12]:
fails_dict

{888202515573088257: tweepy.error.TweepError([{'code': 144,
                           'message': 'No status found with that ID.'}]),
 873697596434513921: tweepy.error.TweepError([{'code': 144,
                           'message': 'No status found with that ID.'}]),
 872668790621863937: tweepy.error.TweepError([{'code': 144,
                           'message': 'No status found with that ID.'}]),
 872261713294495745: tweepy.error.TweepError([{'code': 144,
                           'message': 'No status found with that ID.'}]),
 869988702071779329: tweepy.error.TweepError([{'code': 144,
                           'message': 'No status found with that ID.'}]),
 866816280283807744: tweepy.error.TweepError([{'code': 144,
                           'message': 'No status found with that ID.'}]),
 861769973181624320: tweepy.error.TweepError([{'code': 144,
                           'message': 'No status found with that ID.'}]),
 856602993587888130: tweepy.error.TweepError([{'code': 144,
  

In [16]:
len(fails_dict)

28

In [65]:
# list of dictionaries to build file by file and later convert to a dataframe
tweet_list = []

with open('tweet_json.txt', 'r') as file:
    lines = file.readlines()
    for tweet in lines:
        #print(tweet)
        data = json.loads(tweet)
        tweet_id = data['id']
        retweet_count = data['retweet_count']
        favorite_count = data['favorite_count']
        tweet_list.append(
            {
                'tweet_id': tweet_id,
                'retweet_count': retweet_count,
                'favorite_count': favorite_count
            }
        )
    
tweets_df = pd.DataFrame(
    tweet_list, 
    columns = [
        'tweet_id', 
        'retweet_count', 
        'favorite_count'
    ]
)

In [66]:
# store the new data frame
tweets_df.to_csv('tweets.csv', index=False)
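The parsing step above reads one JSON object per line. The sketch below isolates that logic so it can be tried without the real file. The sample lines are made up, and `read_tweet_counts` is a hypothetical helper that also skips blank lines.

```python
import io
import json


def read_tweet_counts(lines):
    """Yield one dict per non-empty JSON line, keeping only the three fields of interest."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        data = json.loads(line)
        yield {
            'tweet_id': data['id'],
            'retweet_count': data['retweet_count'],
            'favorite_count': data['favorite_count'],
        }


# Made-up sample standing in for tweet_json.txt
sample = io.StringIO(
    '{"id": 1, "retweet_count": 5, "favorite_count": 9, "full_text": "a"}\n'
    '\n'
    '{"id": 2, "retweet_count": 0, "favorite_count": 3, "full_text": "b"}\n'
)
records = list(read_tweet_counts(sample))
print(records)
```

Passing an open file handle instead of `sample` would process the real file line by line, and the resulting list can be handed to `pd.DataFrame` as before.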

## Assessing Data
In this section, detect and document at least **eight (8) quality issues and two (2) tidiness issues**. You must use **both** visual assessment and programmatic assessment to assess the data.

**Note:** pay attention to the following key points when you assess the data.

* You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
* Assessing and cleaning the entire dataset completely would require a lot of time, and is not necessary to practice and demonstrate your skills in data wrangling. Therefore, the requirements of this project are only to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset.
* The fact that the rating numerators are greater than the denominators does not need to be cleaned. This [unique rating system](http://knowyourmeme.com/memes/theyre-good-dogs-brent) is a big part of the popularity of WeRateDogs.
* You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.



### **Visual assessment**

**`twitter-archive-enhanced` dataset**

In [19]:
twitter_archive_enhanced_df

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,
5,891087950875897856,,,2017-07-29 00:08:17 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a majestic great white breaching ...,,,,https://twitter.com/dog_rates/status/891087950...,13,10,,,,,
6,890971913173991426,,,2017-07-28 16:27:12 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Jax. He enjoys ice cream so much he gets ...,,,,"https://gofundme.com/ydvmve-surgery-for-jax,ht...",13,10,Jax,,,,
7,890729181411237888,,,2017-07-28 00:22:40 +0000,"<a href=""http://twitter.com/download/iphone"" r...",When you watch your owner call another dog a g...,,,,https://twitter.com/dog_rates/status/890729181...,13,10,,,,,
8,890609185150312448,,,2017-07-27 16:25:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Zoey. She doesn't want to be one of th...,,,,https://twitter.com/dog_rates/status/890609185...,13,10,Zoey,,,,
9,890240255349198849,,,2017-07-26 15:59:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Cassie. She is a college pup. Studying...,,,,https://twitter.com/dog_rates/status/890240255...,14,10,Cassie,doggo,,,


- **tweet_id**: tweet Id. Last part of the tweet URL after "status/".
- **in_reply_to_status_id**: Id of the tweet being replied to.
- **in_reply_to_user_id**: Id of the user being replied to.
- **timestamp**: tweet timestamp.
- **source**: device from which the tweet was posted. 
- **text**: the text of the status.
- **retweeted_status_id**: retweeted tweet Id.
- **retweeted_status_user_id**: retweeted tweet user Id.
- **retweeted_status_timestamp**: retweeted tweet timestamp.
- **expanded_urls**: tweet URL. 
- **rating_numerator**: rating numerator.
- **rating_denominator**: rating denominator.
- **name**: dog's name.
- **doggo**: one of the dog stages.
- **floofer**: one of the dog stages. 
- **pupper**: one of the dog stages.
- **puppo**: one of the dog stages.

**`twitter-image-predictions` dataset**

In [20]:
twitter_image_predictions_df

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.072010,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True
5,666050758794694657,https://pbs.twimg.com/media/CT5Jof1WUAEuVxN.jpg,1,Bernese_mountain_dog,0.651137,True,English_springer,0.263788,True,Greater_Swiss_Mountain_dog,0.016199,True
6,666051853826850816,https://pbs.twimg.com/media/CT5KoJ1WoAAJash.jpg,1,box_turtle,0.933012,False,mud_turtle,0.045885,False,terrapin,0.017885,False
7,666055525042405380,https://pbs.twimg.com/media/CT5N9tpXIAAifs1.jpg,1,chow,0.692517,True,Tibetan_mastiff,0.058279,True,fur_coat,0.054449,False
8,666057090499244032,https://pbs.twimg.com/media/CT5PY90WoAAQGLo.jpg,1,shopping_cart,0.962465,False,shopping_basket,0.014594,False,golden_retriever,0.007959,True
9,666058600524156928,https://pbs.twimg.com/media/CT5Qw94XAAA_2dP.jpg,1,miniature_poodle,0.201493,True,komondor,0.192305,True,soft-coated_wheaten_terrier,0.082086,True


- **tweet_id**: tweet Id. Last part of the tweet URL after "status/".
- **jpg_url**: URL with the tweet image.
- **img_num**: image number that corresponded to the most confident prediction.
- **p1**: the algorithm's #1 prediction for the image in the tweet.
- **p1_conf**: how confident the algorithm is in its #1 prediction.
- **p1_dog**: whether or not the #1 prediction is a breed of dog.
- **p2**: the algorithm's second most likely prediction.
- **p2_conf**: how confident the algorithm is in its #2 prediction.
- **p2_dog**: whether or not the #2 prediction is a breed of dog.
- **p3**: the algorithm's #3 prediction for the image in the tweet.
- **p3_conf**: how confident the algorithm is in its #3 prediction.
- **p3_dog**: whether or not the #3 prediction is a breed of dog.

**`tweet_json` dataset**

In [21]:
tweets_df.sample(7)

Unnamed: 0,tweet_id,retweet_count,favorite_count
1301,705970349788291072,818,2983
1076,735256018284875776,817,3127
963,748705597323898880,889,2628
25,886983233522544640,6430,30796
1205,713175907180089344,1355,4156
363,827653905312006145,2833,14818
1600,684594889858887680,3236,8329


- **tweet_id**: tweet Id. Last part of the tweet URL after "status/".
- **retweet_count**: number of retweets of the status.
- **favorite_count**: number of likes of the status.

Because pandas truncates the displayed rows and columns of large DataFrames, I'm using **Google Sheets** for visual assessment.
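As an alternative that stays inside the notebook, pandas display limits can be lifted temporarily. The sketch below uses a made-up frame; `pd.option_context` and the `display.max_columns` option are part of pandas, everything else is illustrative.

```python
import pandas as pd

# Small made-up frame; in the notebook this would be one of the three DataFrames.
df = pd.DataFrame({'col_{}'.format(i): range(3) for i in range(30)})

before = pd.get_option('display.max_columns')

# Temporarily lift the column limit so no columns are collapsed into "...".
with pd.option_context('display.max_columns', None):
    inside = pd.get_option('display.max_columns')
    text = df.to_string()

# The previous setting is restored once the block exits.
after = pd.get_option('display.max_columns')
```

This only changes how much is shown; scrolling through thousands of rows is still easier in a spreadsheet.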

### **Programmatic assessment**

**`twitter-archive-enhanced` dataset**

In [22]:
twitter_archive_enhanced_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob

In [23]:
# number of duplicate rows 
sum(twitter_archive_enhanced_df.duplicated())

0

In [24]:
twitter_archive_enhanced_df.describe()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,retweeted_status_id,retweeted_status_user_id,rating_numerator,rating_denominator
count,2356.0,78.0,78.0,181.0,181.0,2356.0,2356.0
mean,7.427716e+17,7.455079e+17,2.014171e+16,7.7204e+17,1.241698e+16,13.126486,10.455433
std,6.856705e+16,7.582492e+16,1.252797e+17,6.236928e+16,9.599254e+16,45.876648,6.745237
min,6.660209e+17,6.658147e+17,11856340.0,6.661041e+17,783214.0,0.0,0.0
25%,6.783989e+17,6.757419e+17,308637400.0,7.186315e+17,4196984000.0,10.0,10.0
50%,7.196279e+17,7.038708e+17,4196984000.0,7.804657e+17,4196984000.0,11.0,10.0
75%,7.993373e+17,8.257804e+17,4196984000.0,8.203146e+17,4196984000.0,12.0,10.0
max,8.924206e+17,8.862664e+17,8.405479e+17,8.87474e+17,7.874618e+17,1776.0,170.0


In [25]:
type(twitter_archive_enhanced_df['timestamp'][0])

str

In [26]:
twitter_archive_enhanced_df.source.value_counts()

<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>     2221
<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>                          91
<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>                       33
<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>      11
Name: source, dtype: int64

In [27]:
type(twitter_archive_enhanced_df['source'][0])

str

In [28]:
type(twitter_archive_enhanced_df['text'][0])

str

In [29]:
type(twitter_archive_enhanced_df['retweeted_status_timestamp'][0])

float

In [30]:
type(twitter_archive_enhanced_df['expanded_urls'][0])

str

In [31]:
type(twitter_archive_enhanced_df['name'][0])

str

In [32]:
twitter_archive_enhanced_df.isna().sum()

tweet_id                         0
in_reply_to_status_id         2278
in_reply_to_user_id           2278
timestamp                        0
source                           0
text                             0
retweeted_status_id           2175
retweeted_status_user_id      2175
retweeted_status_timestamp    2175
expanded_urls                   59
rating_numerator                 0
rating_denominator               0
name                             0
doggo                            0
floofer                          0
pupper                           0
puppo                            0
dtype: int64

In [33]:
twitter_archive_enhanced_df['name'].value_counts()['None']

745

In [34]:
twitter_archive_enhanced_df.in_reply_to_status_id.isna().sum()

2278

In [35]:
twitter_archive_enhanced_df.rating_denominator.value_counts()

10     2333
11        3
50        3
80        2
20        2
2         1
16        1
40        1
70        1
15        1
90        1
110       1
120       1
130       1
150       1
170       1
7         1
0         1
Name: rating_denominator, dtype: int64

In [36]:
twitter_archive_enhanced_df.rating_numerator.value_counts()

12      558
11      464
10      461
13      351
9       158
8       102
7        55
14       54
5        37
6        32
3        19
4        17
1         9
2         9
420       2
0         2
15        2
75        2
80        1
20        1
24        1
26        1
44        1
50        1
60        1
165       1
84        1
88        1
144       1
182       1
143       1
666       1
960       1
1776      1
17        1
27        1
45        1
99        1
121       1
204       1
Name: rating_numerator, dtype: int64
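The counts above show a few unusual denominators. A minimal sketch of how such rows could be flagged for manual review follows, assuming that a denominator of zero or one that is not a multiple of 10 is suspicious. The rows are made up to mimic the values seen above.

```python
import pandas as pd

# Made-up rows that mimic the kinds of values seen in the counts above.
ratings = pd.DataFrame({
    'rating_numerator': [13, 960, 84, 24, 12],
    'rating_denominator': [10, 0, 70, 7, 10],
})

# A denominator that is zero or not a multiple of 10 is worth a manual look.
denominator = ratings.rating_denominator
suspicious = ratings[(denominator == 0) | (denominator % 10 != 0)]
print(suspicious)
```

Multiples of 10 such as 70 are deliberately not flagged here, since they may correspond to ratings for several dogs in one photo.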

**`twitter-image-predictions` dataset**

In [37]:
twitter_image_predictions_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [38]:
# number of duplicate rows 
sum(twitter_image_predictions_df.duplicated())

0

In [39]:
type(twitter_image_predictions_df['jpg_url'][0])

str

In [40]:
type(twitter_image_predictions_df['p1'][0])

str

In [41]:
type(twitter_image_predictions_df['p2'][0])

str

In [42]:
type(twitter_image_predictions_df['p3'][0])

str

In [43]:
type(twitter_image_predictions_df['p1_dog'][0])

numpy.bool_

In [44]:
len(twitter_image_predictions_df[twitter_image_predictions_df.p1_dog == False])

543

In [45]:
twitter_image_predictions_df.p1.values

array(['Welsh_springer_spaniel', 'redbone', 'German_shepherd', ...,
       'Chihuahua', 'Chihuahua', 'orange'], dtype=object)

In [46]:
twitter_image_predictions_df.p2.values

array(['collie', 'miniature_pinscher', 'malinois', ..., 'malamute',
       'Pekinese', 'bagel'], dtype=object)

In [47]:
twitter_image_predictions_df.p3.values

array(['Shetland_sheepdog', 'Rhodesian_ridgeback', 'bloodhound', ...,
       'kelpie', 'papillon', 'banana'], dtype=object)

**`tweet_json` dataset**

In [48]:
tweets_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2328 entries, 0 to 2327
Data columns (total 3 columns):
tweet_id          2328 non-null int64
retweet_count     2328 non-null int64
favorite_count    2328 non-null int64
dtypes: int64(3)
memory usage: 54.6 KB


In [49]:
# number of duplicate rows 
sum(tweets_df.duplicated())

0

<a id='quality'></a>
### Quality issues

[1](#1). `twitter-archive-enhanced` table: contains retweets and some tweets don't have image predictions in the `image-predictions` table. We only want original ratings that have images.


[2](#2).  `image-predictions` table: `p1_dog`, `p2_dog`, `p3_dog` columns contain some rows where the image does not show a breed of dog. We only want dog tweets. 


[3](#3). `twitter-archive-enhanced` table: `source` column, the values are full HTML anchor tags. They can be converted to a nominal categorical variable with the following values: iPhone, Vine, Twitter Web Client and TweetDeck.


[4](#4). `twitter-archive-enhanced` table: `timestamp` column, erroneous datatype, should be a date time data type.


[5](#5). `twitter-archive-enhanced` table: `expanded_urls` column contains repeated values.


[6](#6). `twitter-archive-enhanced` table: `rating_denominator` and `rating_numerator` columns, ratings are not correct. Some values are out of range (e.g. rating denominator 0 for tweet ID 835246439529840640).


[7](#7). `twitter-archive-enhanced` table: `rating_denominator` and `rating_numerator` columns. Some values are outliers and others are calculated based on the number of dogs in the photo. 


[8](#8). `image-predictions` table: `p1`, `p2`, `p3` columns, names are inconsistently capitalized and words are separated by "_".


[9](#9). `image-predictions` table: `p1_conf`, `p2_conf`, `p3_conf` columns, the prediction should be displayed as a percentage.

<a id='tidiness'></a>
### Tidiness issues
[1](#10). `tweets` table: `retweet_count` and `favorite_count` columns should be part of the `twitter-archive-enhanced` table.


[2](#11). `twitter-archive-enhanced` table: `doggo`, `floofer`, `pupper`, `puppo` columns should be one column, since this is one variable.


[3](#12). `image-predictions` table: this table meets all three requirements for tidiness: 

- Each variable forms a column
- Each observation forms a row
- Each type of observational unit forms a table

However, to be able to generate a single master dataset with all the information that interests me, I am going to add the column `p1` from the `image-predictions` table to the `twitter-archive-enhanced` dataset. This way, I have information about the breed of the dog.
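One possible way to address tidiness issue 2 is sketched below. It assumes that missing stages are stored as the string `'None'`, which the zero null counts for these columns suggest. The frame is made up, and `combine_stages` and `dog_stage` are hypothetical names.

```python
import pandas as pd

stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']

# Made-up rows; missing stages are assumed to be stored as the string 'None'.
df = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'doggo': ['doggo', 'None', 'doggo'],
    'floofer': ['None', 'None', 'None'],
    'pupper': ['None', 'None', 'pupper'],
    'puppo': ['None', 'None', 'None'],
})


def combine_stages(row):
    """Join every stage present in the row; return None when there is none."""
    found = [row[col] for col in stage_cols if row[col] != 'None']
    return ', '.join(found) if found else None


df['dog_stage'] = df[stage_cols].apply(combine_stages, axis=1)
df = df.drop(columns=stage_cols)
```

Joining with a comma keeps tweets that mention two stages (the third row here) instead of silently picking one.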

## Cleaning Data
In this section, clean **all** of the issues you documented while assessing. 

**Note:** Make a copy of the original data before cleaning. Cleaning includes merging individual pieces of data according to the rules of [tidy data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html). The result should be a high-quality and tidy master pandas DataFrame (or DataFrames, if appropriate).

In [67]:
# Make copies of original pieces of data
archive_clean = twitter_archive_enhanced_df.copy()

image_predictions_clean = twitter_image_predictions_df.copy()

tweets_clean = tweets_df.copy()

### Quality Issues

<a id='1'></a>
### [Issue](#quality) #1:

#### Define
- Remove retweets and tweets that don't have image prediction.

#### Code

In [68]:
# finding tweet ids that don't have image predictions.
idx1 = pd.Index(archive_clean.tweet_id) # twitter archive enhanced: 2356
idx2 = pd.Index(image_predictions_clean.tweet_id) # image predictions dataset: 2075 rows

# number of tweets without image prediction: 281
tweet_ids  = idx1.difference(idx2).values

tweet_ids

array([667070482143944705, 668587383441514497, 668967877119254528,
       669684865554620416, 671550332464455680, 673716320723169284,
       674307341513269249, 674330906434379776, 674606911342424069,
       674742531037511680, 675849018447167488, 676121918416756736,
       676590572941893632, 676593408224403456, 676916996760600576,
       677335745548390400, 677961670166224897, 678023323247357953,
       678708137298427904, 679001094530465792, 679405845277462528,
       679872969355714560, 680805554198020098, 681340665377193984,
       682088079302213632, 682808988178739200, 683515932363329536,
       684147889187209216, 684588130326986752, 684830982659280897,
       684969860808454144, 685681090388975616, 686035780142297088,
       686286779679375361, 686394059078897668, 686760001961103360,
       687399393394311168, 687732144991551489, 687841446767013888,
       689255633275777024, 689993469801164801, 690348396616552449,
       690607260360429569, 690989312272396288, 691793053716221

In [69]:
# function that removes rows based on tweet id
def drop_tweets_by_id(tweet_ids):
    # select all matching rows at once instead of looping over the ids
    archive_clean.drop(archive_clean[archive_clean.tweet_id.isin(tweet_ids)].index, inplace=True)

In [70]:
# removing tweet ids without image prediction from twitter archive enhanced (clean dataframe)
drop_tweets_by_id(tweet_ids)

(Values in the columns `in_reply_to_user_id` and `retweeted_status_id` tell us if it is a tweet in response to a user or if it is a retweet. Therefore we drop the rows that have values in these columns.)

In [71]:
archive_clean.drop(archive_clean[archive_clean['in_reply_to_user_id'].notnull()].index, inplace=True)
archive_clean.drop(archive_clean[archive_clean['retweeted_status_id'].notnull()].index, inplace=True)

In [72]:
# remove unnecessary columns 
archive_clean.drop(['in_reply_to_status_id', 'in_reply_to_user_id', 'retweeted_status_id', 'retweeted_status_user_id', 'retweeted_status_timestamp'], axis=1, inplace=True)

#### Test

In [73]:
archive_clean.shape

(1971, 12)

In [74]:
# confirm some columns are gone
list(archive_clean)

['tweet_id',
 'timestamp',
 'source',
 'text',
 'expanded_urls',
 'rating_numerator',
 'rating_denominator',
 'name',
 'doggo',
 'floofer',
 'pupper',
 'puppo']
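The three filtering steps for this issue can also be expressed as boolean masks combined in a single selection. The sketch below uses made-up miniature tables to show the idea; it is an alternative formulation, not the code that produced the output above.

```python
import numpy as np
import pandas as pd

# Made-up miniature versions of the two tables.
archive = pd.DataFrame({
    'tweet_id': [1, 2, 3, 4],
    'in_reply_to_user_id': [np.nan, np.nan, 7.0, np.nan],
    'retweeted_status_id': [np.nan, 9.0, np.nan, np.nan],
})
predictions = pd.DataFrame({'tweet_id': [1, 2, 3]})

has_image = archive.tweet_id.isin(predictions.tweet_id)
is_reply = archive.in_reply_to_user_id.notnull()
is_retweet = archive.retweeted_status_id.notnull()

# Keep original tweets (neither reply nor retweet) that have an image prediction.
originals = archive[has_image & ~is_reply & ~is_retweet]
print(originals.tweet_id.tolist())
```

Naming each mask makes the selection criteria readable and easy to check one at a time.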

<a id='2'></a>
### [Issue](#quality) #2:

#### Define

- Remove tweets where the prediction is not a breed of dog.

#### Code

In [75]:
tweet_ids = image_predictions_clean[image_predictions_clean.p1_dog == False].tweet_id.values

In [76]:
tweet_ids

array([666051853826850816, 666057090499244032, 666104133288665088,
       666268910803644416, 666293911632134144, 666337882303524864,
       666362758909284353, 666411507551481857, 666430724426358785,
       666776908487630848, 666786068205871104, 666837028449972224,
       666983947667116034, 666996132027977728, 667012601033924608,
       667065535570550784, 667188689915760640, 667369227918143488,
       667437278097252352, 667443425659232256, 667524857454854144,
       667549055577362432, 667550882905632768, 667550904950915073,
       667724302356258817, 667766675769573376, 667782464991965184,
       667806454573760512, 667866724293877760, 667873844930215936,
       667878741721415682, 667911425562669056, 667915453470232577,
       667937095915278337, 668142349051129856, 668154635664932864,
       668226093875376128, 668256321989451776, 668291999406125056,
       668297328638447616, 668466899341221888, 668480044826800133,
       668544745690562560, 668614819948453888, 668620235289837

In [77]:
# removing tweet ids from twitter image predictions (clean dataframe)
image_predictions_clean.drop(image_predictions_clean[image_predictions_clean.tweet_id.isin(tweet_ids)].index, inplace=True)

In [78]:
# removing tweet ids from twitter archive enhanced (clean dataframe)
drop_tweets_by_id(tweet_ids)

#### Test

In [79]:
image_predictions_clean.p1_dog.value_counts()

True    1532
Name: p1_dog, dtype: int64

In [80]:
# tweet id 667070482143944705 has no image prediction, so it must be absent from both tables

exist = 667070482143944705 in archive_clean.values
print(exist)
exist = 667070482143944705 in image_predictions_clean.values
print(exist)

False
False


<a id='3'></a>
### [Issue](#quality) #3:

#### Define

- Replace the current values with a categorical variable whose values are more readable for the user.

#### Code

In [81]:
# function to replace cells that contain a keyword with a new value
def replace_source_by(key_word, categorical_var):
    archive_clean.loc[(archive_clean.source.str.contains(key_word)), 'source'] = categorical_var

In [82]:
archive_clean.source.value_counts()

<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>     1437
<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>                       19
<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>       7
Name: source, dtype: int64

In [83]:
replace_source_by('iPhone', 'iPhone')
replace_source_by('Twitter Web Client', 'Twitter Web Client')
replace_source_by('TweetDeck', 'TweetDeck')

In [84]:
# change data type to category
archive_clean.source = archive_clean.source.astype('category')

#### Test

In [85]:
archive_clean.source.value_counts()

iPhone                1437
Twitter Web Client      19
TweetDeck                7
Name: source, dtype: int64

In [86]:
archive_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1463 entries, 1 to 2355
Data columns (total 12 columns):
tweet_id              1463 non-null int64
timestamp             1463 non-null object
source                1463 non-null category
text                  1463 non-null object
expanded_urls         1463 non-null object
rating_numerator      1463 non-null int64
rating_denominator    1463 non-null int64
name                  1463 non-null object
doggo                 1463 non-null object
floofer               1463 non-null object
pupper                1463 non-null object
puppo                 1463 non-null object
dtypes: category(1), int64(3), object(8)
memory usage: 138.7+ KB
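Instead of matching one keyword at a time, the label could also be pulled out of the anchor tag with a regular expression. This is a sketch of that alternative; `source_label` is a hypothetical helper, and it returns the full link text (for example "Twitter for iPhone") rather than the shortened labels used above.

```python
import re

# Matches the text between the opening and closing <a> tags.
ANCHOR_TEXT = re.compile(r'<a[^>]*>([^<]*)</a>')


def source_label(html):
    """Return the link text of a source value, or the input unchanged if it is not a link."""
    match = ANCHOR_TEXT.search(html)
    return match.group(1) if match else html


raw = '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'
print(source_label(raw))
```

Applied to the column with `archive_clean.source.apply(source_label)`, this would handle any new source value without adding another keyword call.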


<a id='4'></a>
### [Issue](#quality) #4:

#### Define

- Convert the `timestamp` column to the datetime data type (keeping the date format).

#### Code

In [87]:
archive_clean.timestamp = pd.to_datetime(archive_clean.timestamp, format="%Y-%m-%d %H:%M:%S")

#### Test

In [88]:
archive_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1463 entries, 1 to 2355
Data columns (total 12 columns):
tweet_id              1463 non-null int64
timestamp             1463 non-null datetime64[ns]
source                1463 non-null category
text                  1463 non-null object
expanded_urls         1463 non-null object
rating_numerator      1463 non-null int64
rating_denominator    1463 non-null int64
name                  1463 non-null object
doggo                 1463 non-null object
floofer               1463 non-null object
pupper                1463 non-null object
puppo                 1463 non-null object
dtypes: category(1), datetime64[ns](1), int64(3), object(7)
memory usage: 138.7+ KB
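The raw timestamps in the archive end with a UTC offset (` +0000`), which the format string above does not mention. As a sketch of how that suffix can be parsed explicitly, the standard library example below uses the `%z` directive on one value copied from the archive preview.

```python
from datetime import datetime, timedelta

raw = '2017-08-01 16:23:56 +0000'  # timestamp format seen in the archive

# %z consumes the trailing UTC offset, giving a timezone-aware datetime.
parsed = datetime.strptime(raw, '%Y-%m-%d %H:%M:%S %z')
print(parsed.isoformat())
```

`pd.to_datetime` accepts the same directives in its `format` argument. Note that parsing the offset would likely yield a timezone-aware column rather than the plain `datetime64[ns]` shown above.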


<a id='5'></a>
### [Issue](#quality) #5:

#### Define

- Remove the URLs that are repeated in each cell. 

#### Code

In [89]:
# iterate through the column checking if there is more than one URL and, in that case, checking if it is repeated
for i in range(archive_clean.shape[0]):
    urls = archive_clean.expanded_urls.values[i].split(',')
    if len(urls) > 1:
        urls = list(set(urls))
    archive_clean.expanded_urls.values[i] = urls

##### Test

In [90]:
type(archive_clean.expanded_urls.values[1])

list

In [91]:
archive_clean.expanded_urls.values[1]

['https://twitter.com/dog_rates/status/891815181378084864/photo/1']

In [92]:
archive_clean.head()

Unnamed: 0,tweet_id,timestamp,source,text,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1,892177421306343426,2017-08-01 00:17:27,iPhone,This is Tilly. She's just checking pup on you....,[https://twitter.com/dog_rates/status/89217742...,13,10,Tilly,,,,
2,891815181378084864,2017-07-31 00:18:03,iPhone,This is Archie. He is a rare Norwegian Pouncin...,[https://twitter.com/dog_rates/status/89181518...,12,10,Archie,,,,
4,891327558926688256,2017-07-29 16:00:24,iPhone,This is Franklin. He would like you to stop ca...,[https://twitter.com/dog_rates/status/89132755...,12,10,Franklin,,,,
5,891087950875897856,2017-07-29 00:08:17,iPhone,Here we have a majestic great white breaching ...,[https://twitter.com/dog_rates/status/89108795...,13,10,,,,,
6,890971913173991426,2017-07-28 16:27:12,iPhone,Meet Jax. He enjoys ice cream so much he gets ...,[https://twitter.com/dog_rates/status/89097191...,13,10,Jax,,,,
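
One caveat of `list(set(urls))` is that a set does not keep the original order of the URLs. A possible alternative, sketched here under the assumption that order matters, is to de-duplicate with `dict.fromkeys`, which keeps the first occurrence of each URL. The helper name `unique_urls` and the `example.com` URLs are hypothetical.

```python
import pandas as pd

def unique_urls(cell):
    """Split a comma-separated URL string and drop repeats, keeping first-seen order."""
    return list(dict.fromkeys(cell.split(",")))

# Hypothetical cells: one with a repeated URL, one with a single URL
urls = pd.Series([
    "https://example.com/a,https://example.com/a,https://example.com/b",
    "https://example.com/c",
])

deduped = urls.apply(unique_urls)
print(deduped.tolist())
```

Using `apply` also avoids writing into `.values` element by element.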


<a id='6'></a>
### [Issue](#quality) #6:

##### Define

- Extract the correct ratings from the `text` column using a regular expression.

##### Code

In [93]:
# raw string so that the backslashes reach the regex engine unchanged
ratings = archive_clean.text.str.extract(r'((?:\d+\.)?\d+)\/(\d+)', expand=True)

In [94]:
ratings.columns = ['rating_numerator', 'rating_denominator']

In [95]:
ratings.rating_numerator = ratings.rating_numerator.astype(float)
ratings.rating_denominator = ratings.rating_denominator.astype(float)

In [96]:
archive_clean.rating_numerator = ratings.rating_numerator
archive_clean.rating_denominator = ratings.rating_denominator

##### Test

In [97]:
archive_clean.rating_numerator.value_counts()

12.00     378
10.00     318
11.00     305
13.00     208
9.00      107
8.00       58
7.00       24
14.00      19
6.00       12
5.00        9
4.00        5
3.00        3
9.75        1
11.27       1
13.50       1
165.00      1
84.00       1
11.26       1
24.00       1
2.00        1
44.00       1
1.00        1
88.00       1
99.00       1
50.00       1
80.00       1
45.00       1
60.00       1
121.00      1
Name: rating_numerator, dtype: int64

In [98]:
archive_clean.rating_denominator.value_counts()

10.0     1449
50.0        3
80.0        2
150.0       1
2.0         1
110.0       1
40.0        1
90.0        1
20.0        1
11.0        1
7.0         1
70.0        1
Name: rating_denominator, dtype: int64

In [99]:
archive_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1463 entries, 1 to 2355
Data columns (total 12 columns):
tweet_id              1463 non-null int64
timestamp             1463 non-null datetime64[ns]
source                1463 non-null category
text                  1463 non-null object
expanded_urls         1463 non-null object
rating_numerator      1463 non-null float64
rating_denominator    1463 non-null float64
name                  1463 non-null object
doggo                 1463 non-null object
floofer               1463 non-null object
pupper                1463 non-null object
puppo                 1463 non-null object
dtypes: category(1), datetime64[ns](1), float64(2), int64(1), object(7)
memory usage: 138.7+ KB
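
To illustrate how the pattern behaves, here is a small sketch on made-up tweet texts (the `texts` series is hypothetical). The optional group `(?:\d+\.)?` is what allows a decimal numerator to be captured whole.

```python
import pandas as pd

# Hypothetical tweet texts, not taken from the archive
texts = pd.Series([
    "This is Bella. 13.5/10 would pet",
    "Meet Sam. 12/10",
    "Here are 7 pups, 84/70 overall",
])

pattern = r'((?:\d+\.)?\d+)/(\d+)'

# Each capture group becomes one column; both are converted to float
ratings = texts.str.extract(pattern, expand=True).astype(float)
ratings.columns = ['rating_numerator', 'rating_denominator']
print(ratings)
```

Note that `str.extract` returns only the first match in each text, so a fraction-like string that appears before the real rating (for example "24/7") would be captured instead.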


<a id='7'></a>
### [Issue](#quality) #7:

##### Define

- Correct some ratings based on the number of dogs that appear in the image.
- Identify and remove tweets with outlier ratings. This [article](https://www.wikihow.com/Calculate-Outliers) will help in calculating these outliers.

##### Code

In [100]:
# based on the denominator values, we checked some tweets that are suspected of 
# containing multiple dogs in the image.
archive_clean.rating_denominator.value_counts()

10.0     1449
50.0        3
80.0        2
150.0       1
2.0         1
110.0       1
40.0        1
90.0        1
20.0        1
11.0        1
7.0         1
70.0        1
Name: rating_denominator, dtype: int64

In [101]:
# function to reset denominator and numerator values based on number of dogs displayed in tweet image
def set_ratings(tweet_id, divisor):
    numerator = archive_clean[archive_clean.tweet_id == tweet_id].rating_numerator.values[0]
    denominator = archive_clean[archive_clean.tweet_id == tweet_id].rating_denominator.values[0]
    
    archive_clean.loc[archive_clean.tweet_id == tweet_id, 'rating_numerator'] = numerator / divisor
    archive_clean.loc[archive_clean.tweet_id == tweet_id, 'rating_denominator'] = denominator / divisor

In [102]:
# fixing some tweets where several dogs appear in the image.
tweet_id = archive_clean[archive_clean.rating_denominator == 70.0].tweet_id.values[0]
print(tweet_id)

set_ratings(tweet_id, 7)

820690176645140481


In [103]:
tweet_id = archive_clean[archive_clean.rating_denominator == 90.0].tweet_id.values[0]
print(tweet_id)

set_ratings(tweet_id, 9)

713900603437621249


In [104]:
tweet_id = archive_clean[archive_clean.rating_denominator == 40.0].tweet_id.values[0]
print(tweet_id)

set_ratings(tweet_id, 4)

697463031882764288


In [105]:
tweet_ids = archive_clean[archive_clean.rating_denominator == 80.0].tweet_id.values
print(tweet_ids)

set_ratings(tweet_ids[0], 8)
set_ratings(tweet_ids[1], 8)

[710658690886586372 675853064436391936]
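
The rescaling done by `set_ratings` is simple arithmetic: both parts of the rating are divided by the number of dogs, so a rating of 84/70 for seven dogs becomes 12/10. The sketch below reproduces this on a hypothetical two-row frame; `scale_rating` and the ids are illustrative names, not part of the notebook.

```python
import pandas as pd

# Hypothetical frame with made-up tweet ids
df = pd.DataFrame({
    "tweet_id": [1, 2],
    "rating_numerator": [84.0, 12.0],
    "rating_denominator": [70.0, 10.0],
})

def scale_rating(frame, tweet_id, divisor):
    """Divide numerator and denominator of one tweet by the number of dogs."""
    mask = frame.tweet_id == tweet_id
    cols = ["rating_numerator", "rating_denominator"]
    frame.loc[mask, cols] = frame.loc[mask, cols] / divisor

scale_rating(df, 1, 7)
print(df)
```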


In [106]:
# next, discard the outliers
denominator_values = archive_clean.rating_denominator.sort_values().values
denominator_values

array([   2.,    7.,   10., ...,   50.,  110.,  150.])

In [107]:
Q1 = np.percentile(denominator_values, 25, interpolation = 'midpoint') 
Q2 = np.percentile(denominator_values, 50, interpolation = 'midpoint') 
Q3 = np.percentile(denominator_values, 75, interpolation = 'midpoint') 
  
print('Q1 25 percentile of the given data is, ', Q1)
print('Q2 50 percentile of the given data is, ', Q2)
print('Q3 75 percentile of the given data is, ', Q3)
  
IQR = Q3 - Q1 
print('Interquartile range is', IQR)

Q1 25 percentile of the given data is,  10.0
Q2 50 percentile of the given data is,  10.0
Q3 75 percentile of the given data is,  10.0
Interquartile range is 0.0


In [108]:
low_limit = Q1 - 1.5 * IQR
up_limit = Q3 + 1.5 * IQR
print('low_limit is', low_limit)
print('up_limit is', up_limit)

low_limit is 10.0
up_limit is 10.0


In [109]:
# remove tweets with outliers in the denominator
archive_clean.drop(archive_clean[archive_clean.rating_denominator != 10.0].index, inplace=True)

##### Test

In [110]:
archive_clean.rating_denominator.value_counts()

10.0    1454
Name: rating_denominator, dtype: int64
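
Because almost every denominator equals 10, the interquartile range collapses to zero and both limits equal 10, which is why every other value is treated as an outlier. The following sketch shows the same effect on a hypothetical array (the values are made up, and NumPy's default linear interpolation is used instead of `'midpoint'`).

```python
import numpy as np

# Hypothetical denominators: mostly 10, with a few other values
values = np.array([10.0] * 20 + [2.0, 7.0, 50.0, 150.0])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Standard 1.5 * IQR fences
low_limit = q1 - 1.5 * iqr
up_limit = q3 + 1.5 * iqr

# With iqr == 0 both fences sit at 10, so anything else falls outside them
outliers = values[(values < low_limit) | (values > up_limit)]
print(iqr, low_limit, up_limit, outliers)
```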

<a id='8'></a>
### [Issue](#quality) #8:

##### Define

- Format the predictions in `p1`, `p2` and `p3` so that the first letter is capitalized and the words are separated by spaces.

##### Code

In [111]:
image_predictions_clean.p1 = image_predictions_clean.p1.str.replace('_', ' ')
image_predictions_clean.p1 = image_predictions_clean.p1.str.capitalize()

image_predictions_clean.p2 = image_predictions_clean.p2.str.replace('_', ' ')
image_predictions_clean.p2 = image_predictions_clean.p2.str.capitalize()

image_predictions_clean.p3 = image_predictions_clean.p3.str.replace('_', ' ')
image_predictions_clean.p3 = image_predictions_clean.p3.str.capitalize()

##### Test

In [112]:
image_predictions_clean.p1.value_counts().head(10)

Golden retriever      150
Labrador retriever    100
Pembroke               89
Chihuahua              83
Pug                    57
Chow                   44
Samoyed                43
Toy poodle             39
Pomeranian             38
Cocker spaniel         30
Name: p1, dtype: int64
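
The same transformation can be written once and applied to all three columns in a loop. This is only a sketch on a small frame built from the first two rows of the predictions file; it assumes a pandas version whose `str.replace` accepts the `regex` argument.

```python
import pandas as pd

# Small sample using values from the first rows of image-predictions.tsv
preds = pd.DataFrame({
    "p1": ["Welsh_springer_spaniel", "redbone"],
    "p2": ["collie", "miniature_pinscher"],
    "p3": ["Shetland_sheepdog", "Rhodesian_ridgeback"],
})

for col in ["p1", "p2", "p3"]:
    # literal underscore replacement, then capitalize only the first letter
    preds[col] = preds[col].str.replace("_", " ", regex=False).str.capitalize()

print(preds)
```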

<a id='9'></a>
### [Issue](#quality) #9:

##### Define
- Express the confidence of the algorithm's predictions as a percentage, rounded to one decimal place.

##### Code

In [113]:
image_predictions_clean.p1_conf = (image_predictions_clean.p1_conf * 100)
image_predictions_clean.p1_conf = image_predictions_clean.p1_conf.round(1)

image_predictions_clean.p2_conf = (image_predictions_clean.p2_conf * 100)
image_predictions_clean.p2_conf = image_predictions_clean.p2_conf.round(1)

image_predictions_clean.p3_conf = (image_predictions_clean.p3_conf * 100)
image_predictions_clean.p3_conf = image_predictions_clean.p3_conf.round(1)

##### Test

In [114]:
image_predictions_clean.p1_conf.head(10)

0     46.5
1     50.7
2     59.6
3     40.8
4     56.0
5     65.1
7     69.3
9     20.1
10    77.6
11    50.4
Name: p1_conf, dtype: float64

In [115]:
image_predictions_clean.sample(5)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
452,674752233200820224,https://pbs.twimg.com/media/CV0zkzEU4AAzLc5.jpg,2,Vizsla,66.6,True,Redbone,17.3,True,Basset,13.5,True
584,678969228704284672,https://pbs.twimg.com/media/CWwu6OLUkAEo3gq.jpg,1,Labrador retriever,68.0,True,Chesapeake bay retriever,20.2,True,Golden retriever,2.0,True
762,688898160958271489,https://pbs.twimg.com/media/CY91OENWUAE5agj.jpg,1,Ibizan hound,85.3,True,Chihuahua,4.0,True,Italian greyhound,3.5,True
1423,772117678702071809,https://pbs.twimg.com/media/Crcc7pqXEAAM5O2.jpg,1,Labrador retriever,21.8,True,Beagle,15.8,True,Golden retriever,12.8,True
1613,801958328846974976,https://pbs.twimg.com/media/CyEg2AXUsAA1Qpf.jpg,1,Staffordshire bullterrier,32.8,True,American staffordshire terrier,27.2,True,Labrador retriever,24.8,True
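
A compact way to apply the conversion to the three confidence columns at once is sketched below on a small sample frame (`conf` is a hypothetical name). Keep in mind that this kind of cell is not idempotent: running it a second time would multiply the values by 100 again.

```python
import pandas as pd

# Small sample using confidences from the first rows of image-predictions.tsv
conf = pd.DataFrame({
    "p1_conf": [0.465074, 0.506826],
    "p2_conf": [0.156665, 0.0741917],
    "p3_conf": [0.0614285, 0.07201],
})

conf_cols = ["p1_conf", "p2_conf", "p3_conf"]

# Scale to percentages and round to one decimal place in a single step
conf[conf_cols] = (conf[conf_cols] * 100).round(1)
print(conf)
```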


### Tidiness Issues

<a id='10'></a>
### [Issue](#tidiness) #1:

##### Define

- Add the columns `retweet_count` and `favorite_count` to the dataset `twitter-archive-enhanced`.

- Remove missing tweets. The number of rows in the table `tweets` does not match the number of rows in the table `twitter-archive-enhanced` because some queries to the Twitter API failed with the error "No status found with that ID."

##### Code

In [116]:
# dropping rows with error

# number of tweets without info: 
idx1 = pd.Index(archive_clean.tweet_id) # twitter archive enhanced: 2356
idx2 = pd.Index(tweets_clean.tweet_id) # tweets: 2328 rows

tweet_ids  = idx1.difference(idx2).values
tweet_ids

array([680055455951884288, 754011816964026368, 759923798737051648,
       779123168116150273, 829374341691346946, 837366284874571778,
       844704788403113984, 872261713294495745])

In [117]:
# removing tweet ids from twitter archive enhanced (clean dataframe)
drop_tweets_by_id(tweet_ids)

In [118]:
# add the new columns to the twitter archive enhanced (clean dataset)
archive_clean = pd.merge(archive_clean, tweets_clean, on=['tweet_id'], how='left')

##### Test

In [119]:
archive_clean.sample(5)

Unnamed: 0,tweet_id,timestamp,source,text,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo,retweet_count,favorite_count
0,892177421306343426,2017-08-01 00:17:27,iPhone,This is Tilly. She's just checking pup on you....,[https://twitter.com/dog_rates/status/89217742...,13.0,10.0,Tilly,,,,,5359,29701
324,805826884734976000,2016-12-05 17:31:15,iPhone,This is Duke. He is not a fan of the pupporazz...,[https://twitter.com/dog_rates/status/80582688...,12.0,10.0,Duke,,,,,1741,6346
110,860524505164394496,2017-05-05 16:00:04,iPhone,This is Carl. He likes to dance. Doesn't care ...,[https://twitter.com/dog_rates/status/86052450...,13.0,10.0,Carl,,,,,4671,21565
403,786363235746385920,2016-10-13 00:29:39,iPhone,This is Rizzo. He has many talents. A true ren...,[https://twitter.com/dog_rates/status/78636323...,13.0,10.0,Rizzo,doggo,,,,3284,10434
1192,673686845050527744,2015-12-07 02:13:55,iPhone,This is George. He's upset that the 4th of Jul...,[https://twitter.com/dog_rates/status/67368684...,11.0,10.0,George,,,,,387,1297


In [120]:
# confirm new columns added
list(archive_clean)

['tweet_id',
 'timestamp',
 'source',
 'text',
 'expanded_urls',
 'rating_numerator',
 'rating_denominator',
 'name',
 'doggo',
 'floofer',
 'pupper',
 'puppo',
 'retweet_count',
 'favorite_count']
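
The two steps above (find the ids without API data, drop them, then merge) can be sketched on toy frames as follows. The frames and ids are hypothetical, and `isin` stands in for the notebook's `drop_tweets_by_id` helper.

```python
import pandas as pd

# Hypothetical archive and API tables; tweet 2 has no API data
archive = pd.DataFrame({"tweet_id": [1, 2, 3], "name": ["a", "b", "c"]})
tweets = pd.DataFrame({
    "tweet_id": [1, 3],
    "retweet_count": [5, 7],
    "favorite_count": [50, 70],
})

# ids present in the archive but absent from the API results
missing = pd.Index(archive.tweet_id).difference(pd.Index(tweets.tweet_id))

# drop those rows, then bring in the count columns
archive = archive[~archive.tweet_id.isin(missing)]
merged = pd.merge(archive, tweets, on="tweet_id", how="left")
print(merged)
```

Since the unmatched rows are removed first, the left merge leaves no missing counts; an inner merge on `tweet_id` would select the same rows in one step.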

<a id='11'></a>
### [Issue](#tidiness) #2:

##### Define

- Combine the `doggo`, `floofer`, `pupper` and `puppo` columns into a single `stage` column.

##### Code

In [121]:
# replace the placeholder string 'None' with an empty string in each stage column
for column in ['doggo', 'floofer', 'pupper', 'puppo']:
    archive_clean[column] = archive_clean[column].replace('None', '')

In [122]:
# concatenate the values of the four stage columns into one string per row
archive_clean['stage'] = archive_clean['doggo'] + archive_clean['floofer'] + archive_clean['pupper'] + archive_clean['puppo'] 

In [123]:
# some rows contain several types of stages at the same time, so they must be separated by 
# commas to make them easier to read 
archive_clean.stage.value_counts()

                1220
pupper           144
doggo             47
puppo             19
doggopupper        7
floofer            7
doggopuppo         1
doggofloofer       1
Name: stage, dtype: int64

In [124]:
archive_clean.loc[archive_clean.stage == 'doggopupper', 'stage'] = 'doggo,pupper'
archive_clean.loc[archive_clean.stage == 'doggopuppo', 'stage'] = 'doggo,puppo'
archive_clean.loc[archive_clean.stage == 'doggofloofer', 'stage'] = 'doggo,floofer'

In [125]:
# remove unnecessary columns 
archive_clean.drop(['doggo', 'floofer', 'pupper', 'puppo'], axis=1, inplace=True)

##### Test

In [126]:
archive_clean.stage.value_counts()

                 1220
pupper            144
doggo              47
puppo              19
doggo,pupper        7
floofer             7
doggo,floofer       1
doggo,puppo         1
Name: stage, dtype: int64
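
An alternative that avoids fixing the combined labels by hand is to join the non-empty stage values of each row with a comma. This is a sketch on a hypothetical frame (`dogs`), not the approach used above.

```python
import pandas as pd

# Hypothetical rows: one doggo, one pupper, one with both stages
dogs = pd.DataFrame({
    "doggo": ["doggo", "None", "doggo"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "pupper", "pupper"],
    "puppo": ["None", "None", "None"],
})

stage_cols = ["doggo", "floofer", "pupper", "puppo"]

# keep only real stage names and separate multiple stages with a comma
dogs["stage"] = dogs[stage_cols].apply(
    lambda row: ",".join(value for value in row if value != "None"), axis=1
)
print(dogs.stage.tolist())
```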

<a id='12'></a>
### [Issue](#tidiness) #3:

##### Define

- Include column `p1` from the dataset `image-predictions` in the dataset `twitter-archive-enhanced`.

##### Code

In [127]:
# dataset with the columns of interest 
sub_df = image_predictions_clean[["tweet_id", "p1"]]

In [128]:
sub_df.head(5)

Unnamed: 0,tweet_id,p1
0,666020888022790149,Welsh springer spaniel
1,666029285002620928,Redbone
2,666033412701032449,German shepherd
3,666044226329800704,Rhodesian ridgeback
4,666049248165822465,Miniature pinscher


In [129]:
# add the new column
archive_clean = pd.merge(archive_clean, sub_df, on=['tweet_id'], how='left')

In [130]:
# rename the column with a more meaningful name
archive_clean = archive_clean.rename(columns = {'p1': 'breed'})

##### Test

In [131]:
# confirm new column added
list(archive_clean)

['tweet_id',
 'timestamp',
 'source',
 'text',
 'expanded_urls',
 'rating_numerator',
 'rating_denominator',
 'name',
 'retweet_count',
 'favorite_count',
 'stage',
 'breed']
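
A left merge keeps every archive row, so tweets without an image prediction end up with a missing `breed`. The sketch below shows this on hypothetical frames and adds `validate="one_to_one"`, which asks pandas to raise an error if a `tweet_id` appears more than once on either side.

```python
import pandas as pd

# Hypothetical archive and prediction tables; tweet 2 has no prediction
archive = pd.DataFrame({"tweet_id": [1, 2, 3]})
predictions = pd.DataFrame({
    "tweet_id": [1, 3],
    "p1": ["Pug", "Chow"],
    "p1_conf": [57.0, 44.0],
})

merged = (
    archive
    .merge(predictions[["tweet_id", "p1"]], on="tweet_id",
           how="left", validate="one_to_one")
    .rename(columns={"p1": "breed"})
)
print(merged)
```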

## Storing Data
Save the gathered, assessed, and cleaned master dataset to a CSV file named "twitter_archive_master.csv".

In [132]:
archive_clean.to_csv('twitter_archive_master.csv', index=False)
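
When the file is read back, CSV does not preserve every data type. The sketch below does a round trip through an in-memory buffer (a stand-in for the real file) with a one-row frame built from values shown earlier: the timestamp needs `parse_dates` to become a datetime again, and the list in `expanded_urls` comes back as a plain string.

```python
import io
import pandas as pd

# One-row stand-in for the master dataset
master = pd.DataFrame({
    "tweet_id": [891815181378084864],
    "timestamp": pd.to_datetime(["2017-07-31 00:18:03"]),
    "expanded_urls": [["https://twitter.com/dog_rates/status/891815181378084864/photo/1"]],
})

# write to and read from an in-memory buffer instead of a file on disk
buffer = io.StringIO()
master.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer, parse_dates=["timestamp"])

print(reloaded.dtypes)
print(type(reloaded.expanded_urls[0]))
```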