# Project: Wrangling and Analyze Data

## Data Gathering
In the cell below, gather **all** three pieces of data for this project and load them in the notebook. **Note:** each dataset is gathered with a different method.
1. Directly download the WeRateDogs Twitter archive data (twitter_archive_enhanced.csv)

In [2]:
import pandas as pd
import json
import numpy as np

In [3]:
archive_data = pd.read_csv('twitter-archive-enhanced.csv')

2. Use the Requests library to download the tweet image prediction (image_predictions.tsv)

In [4]:
url = "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv"

In [5]:
import requests  # missing from the imports above; without it this cell raises NameError

response = requests.get(url)
print(response.status_code)

In [None]:
with open('image-predictions.tsv','w') as file:
    file.write(response.content.decode())

In [6]:
image_pred = pd.read_csv('image-predictions.tsv',sep='\t')
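
As an aside, the downloaded bytes can also be parsed directly, without first writing them to disk. The following is a minimal sketch under stated assumptions: the helper name `tsv_bytes_to_frame` and the two-row sample payload are hypothetical and only stand in for `response.content`.

```python
import io

import pandas as pd


def tsv_bytes_to_frame(content: bytes) -> pd.DataFrame:
    """Parse tab-separated bytes (for example response.content) into a DataFrame."""
    return pd.read_csv(io.BytesIO(content), sep="\t")


# Hypothetical payload standing in for the downloaded file.
sample = b"tweet_id\tp1\tp1_conf\n1\tredbone\t0.5\n2\tcollie\t0.25\n"
sample_frame = tsv_bytes_to_frame(sample)
print(sample_frame.shape)
```

Writing the file to disk, as the notebook does, is still useful when the raw download should be kept alongside the analysis.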

In [7]:
image_pred.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


3. Use the Tweepy library to query additional data via the Twitter API (tweet_json.txt)

In [None]:
# Download the pre-collected tweet JSON (one JSON object per line) and save it locally
json_ = requests.get('https://video.udacity-data.com/topher/2018/November/5be5fb7d_tweet-json/tweet-json.txt')
with open('tweet-json.json','w') as file:
    file.write(json_.content.decode())

# Parse each line of the file into a dictionary
with open('tweet-json.json','r') as file:
    dicts = [json.loads(i) for i in file]

# Keep only the fields needed for the analysis
df = [{'id': i['id'],
       'retweet_count': i['retweet_count'],
       'favorite_count': i['favorite_count'],
       'retweeted': i['retweeted'],
       'lang': i['lang']} for i in dicts]
df = pd.DataFrame(df)

df.to_csv('twitter-archive-master.csv', index=False)

In [8]:
archive_master=pd.read_csv('twitter-archive-master.csv')
archive_master.head()

Unnamed: 0,id,retweet_count,favorite_count,retweeted,lang
0,892420643555336193,8853,39467,False,en
1,892177421306343426,6514,33819,False,en
2,891815181378084864,4328,25461,False,en
3,891689557279858688,8964,42908,False,en
4,891327558926688256,9774,41048,False,en
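
The line-by-line JSON parsing used above can be sketched as a small reusable function. This is an illustration only: the function name `json_lines_to_frame` and the sample records are hypothetical, and skipping blank lines is an added assumption rather than something the original cell does.

```python
import io
import json

import pandas as pd

FIELDS = ["id", "retweet_count", "favorite_count", "retweeted", "lang"]


def json_lines_to_frame(lines, fields=FIELDS):
    """Build a DataFrame from an iterable of JSON strings, one object per line."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are skipped instead of raising
        tweet = json.loads(line)
        records.append({field: tweet.get(field) for field in fields})
    return pd.DataFrame(records, columns=fields)


# Hypothetical records standing in for the contents of tweet-json.json.
sample_lines = io.StringIO(
    '{"id": 1, "retweet_count": 5, "favorite_count": 9, "retweeted": false, "lang": "en", "extra": 0}\n'
    "\n"
    '{"id": 2, "retweet_count": 7, "favorite_count": 3, "retweeted": false, "lang": "en"}\n'
)
tweets = json_lines_to_frame(sample_lines)
print(tweets)
```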


## Assessing Data
In this section, detect and document at least **eight (8) quality issues and two (2) tidiness issues**. You must use **both** visual assessment and programmatic assessment to assess the data.

**Note:** pay attention to the following key points when you assess the data.

* You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
* Assessing and cleaning the entire dataset completely would require a lot of time, and is not necessary to practice and demonstrate your skills in data wrangling. Therefore, the requirements of this project are only to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset.
* The fact that the rating numerators are greater than the denominators does not need to be cleaned. This [unique rating system](http://knowyourmeme.com/memes/theyre-good-dogs-brent) is a big part of the popularity of WeRateDogs.
* You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.
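
The first key point (original ratings only, with images) could be expressed programmatically along these lines. This is a sketch, not the notebook's own method: the function name and the toy frames are hypothetical, and it assumes a tweet "has an image" when its `tweet_id` appears in the image predictions table.

```python
import pandas as pd


def original_tweets_with_images(archive, predictions):
    """Keep rows that are not retweets and whose tweet_id has an image prediction."""
    is_original = archive["retweeted_status_id"].isna()
    has_image = archive["tweet_id"].isin(predictions["tweet_id"])
    return archive[is_original & has_image]


# Toy data: tweet 2 is a retweet, tweet 3 has no image prediction.
toy_archive = pd.DataFrame({
    "tweet_id": ["1", "2", "3"],
    "retweeted_status_id": [float("nan"), 10.0, float("nan")],
})
toy_predictions = pd.DataFrame({"tweet_id": ["1", "2"]})
kept = original_tweets_with_images(toy_archive, toy_predictions)
print(kept["tweet_id"].tolist())
```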



In [9]:
archive_data.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


In [10]:
archive_data[archive_data.name == 'Canela']

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
19,888202515573088257,,,2017-07-21 01:02:36 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is Canela. She attempted s...,8.87474e+17,4196984000.0,2017-07-19 00:47:34 +0000,https://twitter.com/dog_rates/status/887473957...,13,10,Canela,,,,
23,887473957103951883,,,2017-07-19 00:47:34 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Canela. She attempted some fancy porch...,,,,https://twitter.com/dog_rates/status/887473957...,13,10,Canela,,,,


In [11]:
archive_data.describe()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,retweeted_status_id,retweeted_status_user_id,rating_numerator,rating_denominator
count,2356.0,78.0,78.0,181.0,181.0,2356.0,2356.0
mean,7.427716e+17,7.455079e+17,2.014171e+16,7.7204e+17,1.241698e+16,13.126486,10.455433
std,6.856705e+16,7.582492e+16,1.252797e+17,6.236928e+16,9.599254e+16,45.876648,6.745237
min,6.660209e+17,6.658147e+17,11856340.0,6.661041e+17,783214.0,0.0,0.0
25%,6.783989e+17,6.757419e+17,308637400.0,7.186315e+17,4196984000.0,10.0,10.0
50%,7.196279e+17,7.038708e+17,4196984000.0,7.804657e+17,4196984000.0,11.0,10.0
75%,7.993373e+17,8.257804e+17,4196984000.0,8.203146e+17,4196984000.0,12.0,10.0
max,8.924206e+17,8.862664e+17,8.405479e+17,8.87474e+17,7.874618e+17,1776.0,170.0


In [12]:
archive_data[archive_data.rating_denominator == 0]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
313,835246439529840640,8.35246e+17,26259576.0,2017-02-24 21:54:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@jonnysun @Lin_Manuel ok jomny I know you're e...,,,,,960,0,,,,,


In [13]:
archive_data.source.loc[1]

'<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'

In [14]:
archive_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

In [15]:
archive_data.dtypes

tweet_id                        int64
in_reply_to_status_id         float64
in_reply_to_user_id           float64
timestamp                      object
source                         object
text                           object
retweeted_status_id           float64
retweeted_status_user_id      float64
retweeted_status_timestamp     object
expanded_urls                  object
rating_numerator                int64
rating_denominator              int64
name                           object
doggo                          object
floofer                        object
pupper                         object
puppo                          object
dtype: object

In [16]:
archive_data.expanded_urls.sample(20)

1348    https://twitter.com/dog_rates/status/704347321...
210     https://twitter.com/dog_rates/status/852189679...
659     https://twitter.com/dog_rates/status/791406955...
1305    https://twitter.com/dog_rates/status/707387676...
400     https://twitter.com/dog_rates/status/824775126...
55                                                    NaN
895     https://twitter.com/dog_rates/status/670319130...
853     https://twitter.com/dog_rates/status/765371061...
2263    https://twitter.com/dog_rates/status/667544320...
742     https://www.patreon.com/WeRateDogs,https://twi...
735     https://twitter.com/dog_rates/status/781163403...
1535    https://twitter.com/dog_rates/status/689977555...
1869    https://twitter.com/dog_rates/status/675153376...
887     https://twitter.com/dog_rates/status/759923798...
908     https://twitter.com/dog_rates/status/679062614...
187     https://twitter.com/dog_rates/status/856282028...
514     https://twitter.com/dog_rates/status/811627233...
1814    https:

In [17]:
archive_data.iloc[657]

tweet_id                                                     791774931465953280
in_reply_to_status_id                                                       NaN
in_reply_to_user_id                                                         NaN
timestamp                                             2016-10-27 22:53:48 +0000
source                        <a href="http://vine.co" rel="nofollow">Vine -...
text                          Vine will be deeply missed. This was by far my...
retweeted_status_id                                                         NaN
retweeted_status_user_id                                                    NaN
retweeted_status_timestamp                                                  NaN
expanded_urls                                     https://vine.co/v/ea0OwvPTx9l
rating_numerator                                                             14
rating_denominator                                                           10
name                                    

In [18]:
archive_data[archive_data.duplicated()]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo


In [19]:
image_pred.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


In [20]:
image_pred[image_pred.duplicated()]

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog


In [21]:
image_pred['p1_dog'].value_counts()

True     1532
False     543
Name: p1_dog, dtype: int64

In [22]:
image_pred.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   int64  
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [23]:
image_pred.sample(10)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
256,670764103623966721,https://pbs.twimg.com/media/CU8IY0pWIAA2AJ-.jpg,1,Norfolk_terrier,0.17285,True,golden_retriever,0.072702,True,television,0.037494,False
1995,874296783580663808,https://pbs.twimg.com/media/DCIgSR0XgAANEOY.jpg,1,cocker_spaniel,0.437216,True,miniature_poodle,0.277191,True,toy_poodle,0.157402,True
678,683498322573824003,https://pbs.twimg.com/media/CXxGGOsUwAAr62n.jpg,1,Airedale,0.945362,True,Irish_terrier,0.02685,True,Lakeland_terrier,0.016826,True
1819,834209720923721728,https://pbs.twimg.com/media/C5O1UAaWIAAMBMd.jpg,1,golden_retriever,0.754799,True,Pekinese,0.197861,True,Labrador_retriever,0.008654,True
1866,843604394117681152,https://pbs.twimg.com/media/C7UVuE_U0AI8GGl.jpg,1,Labrador_retriever,0.430583,True,golden_retriever,0.263581,True,Great_Pyrenees,0.179385,True
522,676582956622721024,https://pbs.twimg.com/media/CWO0m8tUwAAB901.jpg,1,seat_belt,0.790028,False,Boston_bull,0.196307,True,French_bulldog,0.012429,True
226,670361874861563904,https://pbs.twimg.com/media/CU2akCQWsAIbaOV.jpg,1,platypus,0.974075,False,spotted_salamander,0.011068,False,bison,0.003897,False
1442,775364825476165632,https://pbs.twimg.com/media/CsKmMB2WAAAXcAy.jpg,3,beagle,0.571229,True,Chihuahua,0.175257,True,Pembroke,0.034306,True
505,675891555769696257,https://pbs.twimg.com/media/CWE_x33UwAEE3no.jpg,1,Italian_greyhound,0.305637,True,whippet,0.232057,True,Great_Dane,0.117806,True
1184,738883359779196928,https://pbs.twimg.com/media/CkEKe3QWYAAwoDy.jpg,2,Labrador_retriever,0.691137,True,golden_retriever,0.195558,True,Chesapeake_Bay_retriever,0.019585,True


In [24]:
image_pred.jpg_url.loc[300]

'https://pbs.twimg.com/media/CVGbPgrWIAAQ1fB.jpg'

In [25]:
image_pred.sample(10)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
1361,761227390836215808,https://pbs.twimg.com/media/CpBsRleW8AEfO8G.jpg,1,cougar,0.306512,False,French_bulldog,0.280802,True,boxer,0.054523,True
789,690597161306841088,https://pbs.twimg.com/media/CZV-c9NVIAEWtiU.jpg,1,Lhasa,0.0975,True,koala,0.091934,False,sunglasses,0.091505,False
167,668986018524233728,https://pbs.twimg.com/media/CUi3PIrWoAAPvPT.jpg,1,doormat,0.976103,False,Chihuahua,0.00564,True,Norfolk_terrier,0.003913,True
454,674764817387900928,https://pbs.twimg.com/media/CV0_BSuWIAIvE9k.jpg,2,Samoyed,0.634695,True,Arctic_fox,0.309853,False,kuvasz,0.019641,True
1056,714606013974974464,https://pbs.twimg.com/media/CerKYG8WAAM1aE-.jpg,1,Norfolk_terrier,0.293007,True,Labrador_retriever,0.256198,True,golden_retriever,0.129643,True
103,667806454573760512,https://pbs.twimg.com/media/CUSGbXeVAAAgztZ.jpg,1,toyshop,0.253089,False,Chihuahua,0.187155,True,Brabancon_griffon,0.112799,True
2008,878057613040115712,https://pbs.twimg.com/media/DC98vABUIAA97pz.jpg,1,French_bulldog,0.839097,True,Boston_bull,0.078799,True,toy_terrier,0.015243,True
354,672591762242805761,https://pbs.twimg.com/media/CVWGotpXAAMRfGq.jpg,1,kuvasz,0.777659,True,Great_Pyrenees,0.112517,True,golden_retriever,0.038351,True
288,671159727754231808,https://pbs.twimg.com/media/CVBwNjVWwAAlUFQ.jpg,1,pitcher,0.117446,False,sunglasses,0.062487,False,mask,0.059517,False
1702,817171292965273600,https://pbs.twimg.com/media/C1cs8uAWgAEwbXc.jpg,1,golden_retriever,0.295483,True,Irish_setter,0.144431,True,Chesapeake_Bay_retriever,0.077879,True


In [26]:
archive_master.dtypes

id                 int64
retweet_count      int64
favorite_count     int64
retweeted           bool
lang              object
dtype: object

In [27]:
archive_master.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2354 entries, 0 to 2353
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   id              2354 non-null   int64 
 1   retweet_count   2354 non-null   int64 
 2   favorite_count  2354 non-null   int64 
 3   retweeted       2354 non-null   bool  
 4   lang            2354 non-null   object
dtypes: bool(1), int64(3), object(1)
memory usage: 76.0+ KB


### Quality issues
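1. The timestamp column is stored as an object (string) type instead of a datetime type

2. The source column contains a full HTML a tag instead of just the URL

3. Some ratings have a rating_denominator of 0

4. Missing values are stored as the string 'None' instead of NaN

5. ID columns are stored as float (retweeted status and user ids) or int (tweet ids) instead of string

6. Some expanded_urls values are missing

7. Some dog names are recorded as 'a'; replace them with the real dog name where possible

8. Retweets result in duplicated data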

### Tidiness issues
1. The doggo, floofer, pupper, and puppo columns should be a single column (e.g. 'dog stage')

2. The three tables describe the same tweets and should be merged into one table
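
To make the first tidiness issue concrete, here is one possible way to collapse the four stage columns into one. It is a sketch under assumptions: the column name `dog_stage`, the helper name, and the comma-joined format for tweets with several stages are illustrative choices, not something the notebook specifies.

```python
import pandas as pd

STAGES = ["doggo", "floofer", "pupper", "puppo"]


def add_dog_stage(df, stage_cols=STAGES):
    """Collapse the stage columns into one 'dog_stage' column (hypothetical name)."""
    out = df.copy()
    # Treat the literal string 'None' as missing.
    stages = out[stage_cols].where(out[stage_cols] != "None")
    # Join whatever stages remain in each row, e.g. 'doggo,pupper'.
    joined = stages.apply(lambda row: ",".join(row.dropna()), axis=1)
    out["dog_stage"] = joined.where(joined != "")
    return out.drop(columns=stage_cols)


toy = pd.DataFrame({
    "tweet_id": ["1", "2", "3"],
    "doggo": ["doggo", "None", "doggo"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "None", "pupper"],
    "puppo": ["None", "None", "None"],
})
tidy = add_dog_stage(toy)
print(tidy)
```

Joining multiple stages keeps the information for tweets that mention more than one stage; picking a single stage per tweet would be an equally valid design.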

## Cleaning Data
In this section, clean **all** of the issues you documented while assessing. 

**Note:** Make a copy of the original data before cleaning. Cleaning includes merging individual pieces of data according to the rules of [tidy data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html). The result should be a high-quality and tidy master pandas DataFrame (or DataFrames, if appropriate).
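
The merging mentioned in the note might look roughly like the sketch below. The function name `build_master`, the toy frames, and the choice of inner joins are assumptions; the sketch also assumes the ID columns have already been converted to the same type in all three tables.

```python
import pandas as pd


def build_master(archive, counts, predictions):
    """Merge the three tables on tweet_id, keeping only tweets present in all of them."""
    counts = counts.rename(columns={"id": "tweet_id"})
    master = archive.merge(counts, on="tweet_id", how="inner")
    return master.merge(predictions, on="tweet_id", how="inner")


# Toy frames standing in for the three cleaned tables.
toy_archive = pd.DataFrame({"tweet_id": ["1", "2", "3"], "rating_numerator": [13, 12, 11]})
toy_counts = pd.DataFrame({"id": ["1", "2"], "retweet_count": [5, 7]})
toy_predictions = pd.DataFrame({"tweet_id": ["1", "3"], "p1": ["redbone", "collie"]})

master = build_master(toy_archive, toy_counts, toy_predictions)
print(master)
```

An inner join drops tweets without counts or predictions; a left join on the archive would keep them with missing values instead.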

In [28]:
# Make copies of original pieces of data
archive_data_clean = archive_data.copy()
archive_master_clean = archive_master.copy()
image_pred_clean = image_pred.copy()

### Issue #1: Timestamp is an object type instead of a datetime type

#### Define: Convert the timestamp column of the archive_data table from string to datetime

#### Code

In [29]:
import datetime

In [30]:
archive_data_clean.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


In [31]:
archive_data_clean["timestamp"] = pd.to_datetime(archive_data_clean["timestamp"])

In [32]:
archive_data_clean["retweeted_status_timestamp"] = pd.to_datetime(archive_data_clean["retweeted_status_timestamp"])

#### Test

In [33]:
archive_data_clean.dtypes

tweet_id                                    int64
in_reply_to_status_id                     float64
in_reply_to_user_id                       float64
timestamp                     datetime64[ns, UTC]
source                                     object
text                                       object
retweeted_status_id                       float64
retweeted_status_user_id                  float64
retweeted_status_timestamp    datetime64[ns, UTC]
expanded_urls                              object
rating_numerator                            int64
rating_denominator                          int64
name                                       object
doggo                                      object
floofer                                    object
pupper                                     object
puppo                                      object
dtype: object
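
Besides reading the dtypes visually, the Test step can be written as an assertion. The following sketch uses two timestamp strings taken from the table output above; passing an explicit `format` and `utc=True` is an optional choice made here, not what the notebook cell does.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

raw = pd.Series(["2017-08-01 16:23:56 +0000", "2017-07-29 16:00:24 +0000"])

# An explicit format makes parsing strict; utc=True yields a timezone-aware UTC column.
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S %z", utc=True)

# Programmatic check that the conversion produced a datetime column.
assert is_datetime64_any_dtype(parsed)
print(parsed.dt.year.tolist())
```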

### Issue #2: The a tag in the source column

#### Define: use regex to extract the url from the source column

#### Code

In [34]:
# Raw string avoids the invalid-escape warning for \s; expand=False returns a Series
archive_data_clean['source'] = archive_data_clean['source'].str.extract(r'^<.+(http.+)"\s.+', expand=False)

In [35]:
archive_data_clean['source']

0       http://twitter.com/download/iphone
1       http://twitter.com/download/iphone
2       http://twitter.com/download/iphone
3       http://twitter.com/download/iphone
4       http://twitter.com/download/iphone
                       ...                
2351    http://twitter.com/download/iphone
2352    http://twitter.com/download/iphone
2353    http://twitter.com/download/iphone
2354    http://twitter.com/download/iphone
2355    http://twitter.com/download/iphone
Name: source, Length: 2356, dtype: object

#### Test

In [36]:
archive_data_clean['source'].sample(10)

262     http://twitter.com/download/iphone
2262                    http://twitter.com
376     http://twitter.com/download/iphone
173     http://twitter.com/download/iphone
443     http://twitter.com/download/iphone
1150    http://twitter.com/download/iphone
2096    http://twitter.com/download/iphone
1767    http://twitter.com/download/iphone
1103    http://twitter.com/download/iphone
192     http://twitter.com/download/iphone
Name: source, dtype: object
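
For illustration, a narrower pattern can pull both the URL and the human-readable label out of the a tag. This is a sketch with assumptions: the first sample string is the value shown earlier for `archive_data.source.loc[1]`, the second is hypothetical, and the `href` and label patterns are alternatives to the regex used in the notebook.

```python
import pandas as pd

sources = pd.Series([
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    '<a href="http://example.com/app" rel="nofollow">Example App</a>',  # hypothetical value
])

# Everything between the quotes of the href attribute.
urls = sources.str.extract(r'href="([^"]+)"', expand=False)
# The text between the opening tag and the closing </a>.
labels = sources.str.extract(r'>([^<]+)</a>', expand=False)

print(urls.tolist())
print(labels.tolist())
```

The label ("Twitter for iPhone") is often easier to read in plots than the URL, so keeping it may be worth considering.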

### Issue #3: The rating denominator for some tweets is zero

#### Define: Use regex to re-extract the rating from the tweet text for rows with a zero numerator or denominator

#### Code

In [37]:
zeros = archive_data_clean[(archive_data_clean['rating_denominator'] == 0) | (archive_data_clean['rating_numerator'] == 0)]

In [38]:
zeros

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
313,835246439529840640,8.35246e+17,26259580.0,2017-02-24 21:54:03+00:00,http://twitter.com/download/iphone,@jonnysun @Lin_Manuel ok jomny I know you're e...,,,NaT,,960,0,,,,,
315,835152434251116546,,,2017-02-24 15:40:31+00:00,http://twitter.com/download/iphone,When you're so blinded by your systematic plag...,,,NaT,https://twitter.com/dog_rates/status/835152434...,0,10,,,,,
1016,746906459439529985,7.468859e+17,4196984000.0,2016-06-26 03:22:31+00:00,http://twitter.com/download/iphone,"PUPDATE: can't see any. Even if I could, I cou...",,,NaT,https://twitter.com/dog_rates/status/746906459...,0,10,,,,,


In [39]:
zeros.text[313]

"@jonnysun @Lin_Manuel ok jomny I know you're excited but 960/00 isn't a valid rating, 13/10 is tho"

In [40]:
# Work on an explicit copy so the assignment does not trigger SettingWithCopyWarning.
# str.extract returns strings, so the two rating columns end up with object dtype.
zeros = zeros.copy()
zeros[['rating_numerator','rating_denominator']] = zeros['text'].str.extract(r'.+\s(\d+)/([0-9]{2}).+')


In [41]:
zeros

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
313,835246439529840640,8.35246e+17,26259580.0,2017-02-24 21:54:03+00:00,http://twitter.com/download/iphone,@jonnysun @Lin_Manuel ok jomny I know you're e...,,,NaT,,13,10,,,,,
315,835152434251116546,,,2017-02-24 15:40:31+00:00,http://twitter.com/download/iphone,When you're so blinded by your systematic plag...,,,NaT,https://twitter.com/dog_rates/status/835152434...,0,10,,,,,
1016,746906459439529985,7.468859e+17,4196984000.0,2016-06-26 03:22:31+00:00,http://twitter.com/download/iphone,"PUPDATE: can't see any. Even if I could, I cou...",,,NaT,https://twitter.com/dog_rates/status/746906459...,0,10,,,,,


In [42]:
archive_data_clean.drop(zeros.index,axis=0,inplace=True)

In [43]:
# DataFrame.append is deprecated (and removed in pandas 2.0); pd.concat is the equivalent
archive_data_clean = pd.concat([archive_data_clean, zeros])


In [44]:
archive_data_clean.reset_index(drop = True , inplace=True)

#### Test

In [45]:
archive_data_clean[(archive_data_clean['rating_denominator'] == 0)]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
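
The idea of re-extracting the rating from the text can also be sketched with plain `re`, returning integers rather than strings. The helper name `last_valid_rating` is hypothetical, and "take the last pair with a non-zero denominator" is an assumption that happens to fit the tweet shown above.

```python
import re

RATING = re.compile(r"(\d+)/(\d+)")


def last_valid_rating(text):
    """Return the last numerator/denominator pair whose denominator is not zero."""
    for num, den in reversed(RATING.findall(text)):
        if int(den) != 0:
            return int(num), int(den)
    return None


tweet = ("@jonnysun @Lin_Manuel ok jomny I know you're excited but "
         "960/00 isn't a valid rating, 13/10 is tho")
print(last_valid_rating(tweet))
```

Returning integers avoids mixing strings and numbers in the rating columns.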


### Issue #4: None instead of NaN

#### Define: convert the None string to NaN

#### Code

In [46]:
archive_data_clean = archive_data_clean.replace('None',np.nan)

#### Test

In [47]:
archive_data_clean.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56+00:00,http://twitter.com/download/iphone,This is Phineas. He's a mystical boy. Only eve...,,,NaT,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27+00:00,http://twitter.com/download/iphone,This is Tilly. She's just checking pup on you....,,,NaT,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03+00:00,http://twitter.com/download/iphone,This is Archie. He is a rare Norwegian Pouncin...,,,NaT,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51+00:00,http://twitter.com/download/iphone,This is Darla. She commenced a snooze mid meal...,,,NaT,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24+00:00,http://twitter.com/download/iphone,This is Franklin. He would like you to stop ca...,,,NaT,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,
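
A more targeted variant of this conversion limits the replacement to named columns, so that other columns are left untouched. This is a sketch: the function name and the toy frame are hypothetical.

```python
import pandas as pd


def none_strings_to_nan(df, columns):
    """Replace the literal string 'None' with NaN, but only in the given columns."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].where(out[col] != "None")
    return out


toy = pd.DataFrame({
    "name": ["Phineas", "None"],
    "text": ["This is Phineas.", "None"],  # hypothetical tweet text that is exactly 'None'
})
cleaned = none_strings_to_nan(toy, ["name"])
print(cleaned)
```

A frame-wide `replace` is shorter, but it would also blank out any other cell whose entire value is 'None'.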


### Issue #5: ID columns are stored as float (retweeted status and user ids) or int (tweet ids) instead of string

#### Define: Convert all id columns to string

#### Code

In [48]:
archive_data_clean['tweet_id'] = archive_data_clean['tweet_id'].astype('string')

In [49]:
# Rows that are retweets; .copy() avoids SettingWithCopyWarning on the assignments below
non_na = archive_data_clean[archive_data_clean['retweeted_status_id'].notna()].copy()

In [50]:
# Note: these IDs were parsed as float64, so their low-order digits may already be rounded
non_na['retweeted_status_id'] = non_na['retweeted_status_id'].astype('int64').astype(str)

In [51]:
non_na['retweeted_status_user_id'] = non_na['retweeted_status_user_id'].astype('int64').astype(str)


In [52]:
archive_data_clean = archive_data_clean.drop(non_na.index,axis=0)
# DataFrame.append is deprecated (and removed in pandas 2.0); pd.concat is the equivalent
archive_data_clean = pd.concat([archive_data_clean, non_na]).reset_index(drop = True)


In [53]:
archive_data_clean['tweet_id'] = archive_data_clean['tweet_id'].astype(str)

In [54]:
image_pred_clean['tweet_id'] = image_pred_clean['tweet_id'].astype(str)

In [55]:
archive_master_clean['id'] = archive_master_clean['id'].astype(str)

In [56]:
archive_master_clean.dtypes

id                object
retweet_count      int64
favorite_count     int64
retweeted           bool
lang              object
dtype: object

In [57]:
image_pred_clean.dtypes

tweet_id     object
jpg_url      object
img_num       int64
p1           object
p1_conf     float64
p1_dog         bool
p2           object
p2_conf     float64
p2_dog         bool
p3           object
p3_conf     float64
p3_dog         bool
dtype: object

In [58]:
archive_data_clean.dtypes

tweet_id                                   object
in_reply_to_status_id                     float64
in_reply_to_user_id                       float64
timestamp                     datetime64[ns, UTC]
source                                     object
text                                       object
retweeted_status_id                        object
retweeted_status_user_id                   object
retweeted_status_timestamp    datetime64[ns, UTC]
expanded_urls                              object
rating_numerator                           object
rating_denominator                         object
name                                       object
doggo                                      object
floofer                                    object
pupper                                     object
puppo                                      object
dtype: object
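
One caveat about converting float ID columns after loading: 18-digit IDs do not fit exactly in a float64, so digits can be lost before the conversion to string happens. The sketch below shows this and reads the IDs as strings at load time instead. The CSV text is a hypothetical two-row sample, and using the `dtype` argument of `read_csv` is a suggested alternative, not what the notebook does.

```python
import io

import pandas as pd

# Hypothetical two-row sample in the shape of the archive file.
csv_text = (
    "tweet_id,retweeted_status_id\n"
    "888202515573088257,887473957103951883\n"
    "887473957103951883,\n"
)

# Default parsing: the column with a missing value becomes float64.
as_float = pd.read_csv(io.StringIO(csv_text))
# Reading the ID columns as strings keeps every digit.
as_text = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"tweet_id": str, "retweeted_status_id": str},
)

print(int(as_float["retweeted_status_id"].iloc[0]))  # rounded by float64
print(as_text["retweeted_status_id"].iloc[0])        # exact
```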

### Issue #6: expanded url missing url

#### Define: Fill missing expanded_urls values by building a URL from the source and tweet_id columns, following the format source/id/photo/1

#### Code

In [59]:
# First select every row whose expanded_urls value is missing (the retweet gets extra handling further down)
exp_urls = archive_data_clean[archive_data_clean['expanded_urls'].isna()]

In [60]:
exp_urls.head(1)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
29,886267009285017600,8.862664e+17,2281182000.0,2017-07-15 16:51:35+00:00,http://twitter.com/download/iphone,@NonWhiteHat @MayhewMayhem omg hello tanner yo...,,,NaT,,12,10,,,,,


In [61]:
# Work on an explicit copy so the assignment does not trigger SettingWithCopyWarning
exp_urls = exp_urls.copy()
exp_urls['expanded_urls'] = [a+'/'+b+'/'+'source/photo/1' for (a,b) in zip(exp_urls['source'],exp_urls['tweet_id'])]


In [62]:
# Now handle retweets
exp_urls[exp_urls['retweeted_status_id'].notna()]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
2200,856330835276025856,,,2017-04-24 02:15:55+00:00,http://twitter.com/download/iphone,RT @Jenna_Marbles: @dog_rates Thanks for ratin...,856330158768218112,66699013,2017-04-24 02:13:14+00:00,http://twitter.com/download/iphone/85633083527...,14,10,,,,,


In [63]:
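# Row 2200 is the only retweet here, so append a second URL built from its retweeted_status_id
exp_urls.at[2200, 'expanded_urls'] = (
    exp_urls.at[2200, 'expanded_urls'] + ','
    + exp_urls.at[2200, 'source'] + '/'
    + exp_urls.at[2200, 'retweeted_status_id'] + '/' + 'source/photo/1'
)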

In [64]:
archive_data_clean.drop(exp_urls.index,axis=0,inplace=True)

In [65]:
# DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; pd.concat does the same job
archive_data_clean = pd.concat([archive_data_clean, exp_urls]).reset_index(drop=True)


#### Test

In [66]:
# Expect an empty result: no row should still be missing expanded_urls
archive_data_clean[archive_data_clean['expanded_urls'].isna()]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
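
As an alternative to the select, drop, and re-append sequence above, the missing values could be filled in place with a boolean mask. This is a minimal sketch on a toy DataFrame: the `df` frame and its values are assumptions, and the URL follows the source/id/photo/1 format stated in the Define step.

```python
import pandas as pd

# Toy stand-in for archive_data_clean; values are illustrative only.
df = pd.DataFrame({
    'tweet_id': ['111', '222', '333'],
    'source': ['http://twitter.com/download/iphone'] * 3,
    'expanded_urls': [
        'https://twitter.com/dog_rates/status/111/photo/1',
        None,
        None,
    ],
})

# Boolean mask of the rows whose expanded_urls value is missing.
missing = df['expanded_urls'].isna()

# Assign through .loc on the original frame, so no intermediate copy is involved.
df.loc[missing, 'expanded_urls'] = (
    df.loc[missing, 'source'] + '/' + df.loc[missing, 'tweet_id'] + '/photo/1'
)

print(df['expanded_urls'].isna().sum())
```

Because the assignment targets the original frame directly, the row order and index stay unchanged and no SettingWithCopyWarning should be raised.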


### Issue #7: The placeholder name 'a' appears where a real dog name may exist

#### Define: Where the tweet text contains the dog's name, extract it with a regex and use it to replace 'a'

#### Code

In [67]:
len(archive_data_clean[archive_data_clean['name']=='a'])

55

In [68]:
a_name = archive_data_clean[archive_data_clean['name']=='a']
a_name.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
51,881536004380872706,,,2017-07-02 15:32:16+00:00,http://twitter.com/download/iphone,Here is a pupper approaching maximum borkdrive...,,,NaT,https://twitter.com/dog_rates/status/881536004...,14,10,a,,,pupper,
496,792913359805018113,,,2016-10-31 02:17:31+00:00,http://twitter.com/download/iphone,Here is a perfect example of someone who has t...,,,NaT,https://twitter.com/dog_rates/status/792913359...,13,10,a,,,,
617,772581559778025472,,,2016-09-04 23:46:12+00:00,http://twitter.com/download/iphone,Guys this is getting so out of hand. We only r...,,,NaT,https://twitter.com/dog_rates/status/772581559...,10,10,a,,,,
794,747885874273214464,,,2016-06-28 20:14:22+00:00,http://twitter.com/download/iphone,This is a mighty rare blue-tailed hammer sherk...,,,NaT,https://twitter.com/dog_rates/status/747885874...,8,10,a,,,,
796,747816857231626240,,,2016-06-28 15:40:07+00:00,http://twitter.com/download/iphone,Viewer discretion is advised. This is a terrib...,,,NaT,https://twitter.com/dog_rates/status/747816857...,4,10,a,,,,


In [69]:
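# Pull the word that follows "named" from each tweet's text; rows without a match are dropped
names = a_name['text'].str.extract(r'.+named\s(\w+).+').dropna()
# Write the extracted names back; .loc aligns on the shared index
archive_data_clean.loc[names.index, 'name'] = names[0]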

In [70]:
# Keep only the rows still named 'a'; drop() without inplace returns a new DataFrame
a_name = a_name.drop(names.index)


In [71]:
a_name['text'].loc[2111]

'This is a purebred Piers Morgan. Loves to Netflix and chill. Always looks like he forgot to unplug the iron. 6/10 https://t.co/DWnyCjf2mx'

In [72]:
# Raw string so that \s and \w reach the regex engine unchanged
a_name['text'].str.extract(r'.+name\sis\s(\w+).+').dropna()

Unnamed: 0,0
2047,Daryl


In [73]:
archive_data_clean.at[2047,'name'] = 'Daryl'

In [74]:
# drop() without inplace returns a new DataFrame instead of modifying a slice
a_name = a_name.drop(2047)


#### Test

In [75]:
archive_data_clean.name.loc[2047]

'Daryl'

In [76]:
len(a_name)

35

In [77]:
archive_data_clean[archive_data_clean['name']=='jimothy'][['text','name']]

Unnamed: 0,text,name
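
The two extraction passes above ("named X" and "name is X") could also be folded into one pattern. The following is a minimal sketch on a toy DataFrame, not the notebook's actual data: the `tweets` frame, its example texts, and the combined pattern are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for archive_data_clean; the texts are illustrative only.
tweets = pd.DataFrame({
    'text': [
        'This is a dog named Rex. 12/10',
        'Here is a pupper. His name is Daryl. 11/10',
        'This is a purebred Piers Morgan. 6/10',
    ],
    'name': ['a', 'a', 'a'],
})

# Only look at rows that still carry the placeholder name.
is_placeholder = tweets['name'] == 'a'

# One pattern covers both phrasings; expand=False returns a Series.
found = tweets.loc[is_placeholder, 'text'].str.extract(
    r'(?:named|name is)\s+(\w+)', expand=False
).dropna()

# Assign through .loc on the original frame; rows without a match keep 'a'.
tweets.loc[found.index, 'name'] = found

print(tweets['name'].tolist())
```

Rows whose text matches neither phrasing stay as 'a' and still need a manual decision, as with the Piers Morgan tweet.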


### Issue #8: Retweets result in duplicate data

#### Define: remove retweets

#### Code

In [82]:
# Identify retweets by their text, which starts with "RT"
retweets = archive_data_clean[archive_data_clean['text'].str.match('^RT')]

In [83]:
archive_data_clean.drop(retweets.index,axis=0,inplace=True)

In [84]:
archive_data_clean = archive_data_clean.reset_index(drop=True)

#### Test

In [93]:
# Expect an empty result: no remaining row should reference a retweeted user
archive_data_clean[archive_data_clean['retweeted_status_user_id'].notna()]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
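
Retweets could also be filtered on the retweeted_status_id column instead of the tweet text, since a pattern such as '^RT' also matches original tweets that merely begin with those letters. This is a minimal sketch on a toy DataFrame; the `tweets` frame and its texts are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for archive_data_clean; values are illustrative only.
tweets = pd.DataFrame({
    'text': [
        'This is Cassie. 14/10',
        'RT @dog_rates: This is Stuart. 13/10',
        'RTs appreciated for this good boy 12/10',
    ],
    'retweeted_status_id': [None, '856330158768218112', None],
})

# A tweet is original when it has no retweeted_status_id.
originals = tweets[tweets['retweeted_status_id'].isna()].reset_index(drop=True)

# The text-based rule would also flag the third row, which is not a retweet.
text_flagged = tweets['text'].str.match('^RT')

print(len(originals), int(text_flagged.sum()))
```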


### Tidiness issues

#### Issue #1: The doggo, floofer, pupper, and puppo columns describe one variable and should be a single column (i.e. 'dog stage')

#### Define: Combine the four columns into one

In [104]:
# Concatenate the four stage columns; missing values become empty strings first
archive_data_clean['type'] = archive_data_clean['doggo'].fillna('') + archive_data_clean['puppo'].fillna('') + archive_data_clean['floofer'].fillna('') + archive_data_clean['pupper'].fillna('')

In [106]:
archive_data_clean[archive_data_clean['type'] != '']

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo,type
9,890240255349198849,,,2017-07-26 15:59:51+00:00,http://twitter.com/download/iphone,This is Cassie. She is a college pup. Studying...,,,NaT,https://twitter.com/dog_rates/status/890240255...,14,10,Cassie,doggo,,,,doggo
12,889665388333682689,,,2017-07-25 01:55:32+00:00,http://twitter.com/download/iphone,Here's a puppo that seems to be on the fence a...,,,NaT,https://twitter.com/dog_rates/status/889665388...,13,10,,,,,puppo,puppo
14,889531135344209921,,,2017-07-24 17:02:04+00:00,http://twitter.com/download/iphone,This is Stuart. He's sporting his favorite fan...,,,NaT,https://twitter.com/dog_rates/status/889531135...,13,10,Stuart,,,,puppo,puppo
39,884162670584377345,,,2017-07-09 21:29:42+00:00,http://twitter.com/download/iphone,Meet Yogi. He doesn't have any important dog m...,,,NaT,https://twitter.com/dog_rates/status/884162670...,12,10,Yogi,doggo,,,,doggo
64,878776093423087618,,,2017-06-25 00:45:22+00:00,http://twitter.com/download/iphone,This is Snoopy. He's a proud #PrideMonthPuppo....,,,NaT,https://twitter.com/dog_rates/status/878776093...,13,10,Snoopy,,,,puppo,puppo
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
990,716080869887381504,,,2016-04-02 01:52:38+00:00,http://twitter.com/download/iphone,Here's a super majestic doggo and a sunset 11/...,,,NaT,https://twitter.com/dog_rates/status/716080869...,11,10,,doggo,,,,doggo
2139,800859414831898624,8.008580e+17,2.918590e+08,2016-11-22 00:32:18+00:00,http://twitter.com/download/iphone,@SkyWilliams doggo simply protecting you from ...,,,NaT,http://twitter.com/download/iphone/80085941483...,11,10,,doggo,,,,doggo
2141,786051337297522688,7.727430e+17,7.305050e+17,2016-10-12 03:50:17+00:00,http://twitter.com/download/iphone,13/10 for breakdancing puppo @shibbnbot,,,NaT,http://twitter.com/download/iphone/78605133729...,13,10,,,,,puppo,puppo
2144,763956972077010945,7.638652e+17,1.584641e+07,2016-08-12 04:35:10+00:00,http://twitter.com/download/iphone,@TheEllenShow I'm not sure if you know this bu...,,,NaT,http://twitter.com/download/iphone/76395697207...,12,10,,doggo,,,,doggo
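
Plain string concatenation fuses the values when a tweet carries more than one stage (for example 'doggopupper'). The following is a minimal sketch of one way to keep such values readable and then drop the four source columns. The `dogs` frame, the `join_stages` helper, and the `dog_stage` column name are all hypothetical and used for illustration only.

```python
import pandas as pd

stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']

# Toy stand-in for archive_data_clean; values are illustrative only.
dogs = pd.DataFrame({
    'name': ['Cassie', 'Stuart', 'Bo', 'Rex'],
    'doggo': ['doggo', None, 'doggo', None],
    'floofer': [None, None, None, None],
    'pupper': [None, None, 'pupper', None],
    'puppo': [None, 'puppo', None, None],
})


def join_stages(row):
    """Join the non-missing stage values of one row with ', '."""
    stages = [s for s in row if pd.notna(s)]
    return ', '.join(stages) if stages else None


# Build the single stage column, then remove the four source columns.
dogs['dog_stage'] = dogs[stage_cols].apply(join_stages, axis=1)
dogs = dogs.drop(columns=stage_cols)

print(dogs)
```

Rows with no stage end up with a missing value rather than an empty string, which keeps them easy to count with isna().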


## Storing Data
Save gathered, assessed, and cleaned master dataset to a CSV file named "twitter_archive_master.csv".
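
A minimal sketch of the storing step follows, assuming the cleaned data sits in a single DataFrame. The `master` frame and the temporary output directory are illustrative assumptions; in the notebook the file would be written next to the other project files.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the cleaned master dataset; values are illustrative only.
master = pd.DataFrame({
    'tweet_id': ['890240255349198849'],
    'rating_numerator': [14],
    'rating_denominator': [10],
})

# Write to a temporary directory here so the sketch leaves no files behind.
out_dir = tempfile.mkdtemp()
path = os.path.join(out_dir, 'twitter_archive_master.csv')

# index=False keeps the row index out of the file.
master.to_csv(path, index=False)

# Read tweet_id back as a string so the long IDs are not parsed as numbers.
restored = pd.read_csv(path, dtype={'tweet_id': str})
print(restored.shape)
```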

## Analyzing and Visualizing Data
In this section, analyze and visualize your wrangled data. You must produce at least **three (3) insights and one (1) visualization.**

### Insights:
1.

2.

3.

### Visualization