# Project: Wrangle and Analyze Data

## Data Gathering
In the cells below, gather **all** three pieces of data for this project and load them in the notebook. **Note:** a different method is required to gather each piece of data.
1. Directly download the WeRateDogs Twitter archive data (twitter-archive-enhanced.csv)

In [329]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
import os
import requests
import tweepy
from io import StringIO
import json
from tqdm import tqdm


In [330]:
tweet_archive = pd.read_csv('twitter-archive-enhanced.csv') # read in the data

2. Use the Requests library to download the tweet image prediction (image_predictions.tsv)

In [331]:
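url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)
response.raise_for_status()  # stop early if the download failed
data = response.text
image_pred = pd.read_csv(StringIO(data), sep='\t')
# Save a local copy as a real tab-separated file, without the extra index column
image_pred.to_csv('image_predictions.tsv', sep='\t', index=False)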
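A quick way to confirm that a saved copy of a table can be read back unchanged is a round-trip check. The sketch below uses a tiny made-up frame and a temporary file (both are assumptions for illustration, not the project data); it writes the table tab-separated without the index and reads it back.

```python
import os
import tempfile

import pandas as pd

# Tiny stand-in for the image predictions table (values are made up).
sample = pd.DataFrame({
    "tweet_id": [1, 2],
    "p1": ["pug", "toyshop"],
    "p1_conf": [0.75, 0.25],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample_predictions.tsv")
    # sep="\t" keeps the file tab-separated; index=False avoids writing
    # the row index as an extra unnamed column.
    sample.to_csv(path, sep="\t", index=False)
    restored = pd.read_csv(path, sep="\t")

round_trip_ok = restored.equals(sample)
print(round_trip_ok)
```

If the index were written as well, the restored frame would gain an `Unnamed: 0` column and the comparison would fail.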

3. Use the Tweepy library to query additional data via the Twitter API (tweet_json.txt)

In [332]:
from dotenv import load_dotenv
load_dotenv()

bearer_token = os.environ.get('BEARER_TOKEN')

tweet_id = list(tweet_archive['tweet_id'])
missing_tweets = []

In [333]:
# if not os.path.exists('tweet_json.txt'):
#     with open('tweet_json.txt', 'w'): pass
# def get_tweet():
#     auth = tweepy.OAuth2BearerHandler(bearer_token)
#     api = tweepy.API(auth)
#     for id in tqdm(tweet_id):
#         try:
#             tweet = api.get_status(id, tweet_mode='extended')
#             with open('tweet_json.txt', 'a') as f:
#                 json.dump(tweet._json, f)
#                 f.write('\n')
#         except Exception:  # a bare except would also swallow KeyboardInterrupt
#             print('Missing Tweet for id: {}'.format(id))
#             missing_tweets.append(id)
#             continue

# # Driver code
# if __name__ == '__main__':
# #   Call the function
#     get_tweet()


In [336]:
# with open('tweet_json.txt', 'r') as f:
with open('json.txt', 'r') as f:
    gathered_tweet_df = pd.DataFrame(columns=('tweet_id', 'retweet_count', 'favorite_count', 'created_at'))
    tweets = f.readlines()
    for tweet in tweets:
        tweet = json.loads(tweet)
        gathered_tweet_df.loc[len(gathered_tweet_df.index)] = [tweet['id'], tweet['retweet_count'], tweet['favorite_count'], tweet['created_at']]
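Appending one row at a time with `.loc` works, but it is slow on larger files and leaves every column with the `object` dtype. A possible alternative, sketched here under the assumption that each line of the file is one JSON-encoded tweet, is to collect plain dictionaries and build the DataFrame once. The helper name `tweets_to_frame` and the sample lines are hypothetical.

```python
import json

import pandas as pd

# Keys read from each tweet's JSON; "id" is renamed to "tweet_id" afterwards.
FIELDS = ["id", "retweet_count", "favorite_count", "created_at"]


def tweets_to_frame(lines):
    """Build a DataFrame from an iterable of JSON strings, one tweet per line."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        records.append({field: tweet[field] for field in FIELDS})
    frame = pd.DataFrame(records, columns=FIELDS)
    return frame.rename(columns={"id": "tweet_id"})


# Made-up sample lines standing in for the contents of the JSON file.
sample_lines = [
    '{"id": 1, "retweet_count": 5, "favorite_count": 9, '
    '"created_at": "Fri Jan 20 15:17:01 +0000 2017"}',
    '{"id": 2, "retweet_count": 0, "favorite_count": 3, '
    '"created_at": "Tue Mar 08 04:25:07 +0000 2016"}',
]
sample_df = tweets_to_frame(sample_lines)
print(sample_df.dtypes)
```

With a real file, the open file handle can be passed directly, for example `tweets_to_frame(f)` inside a `with open(...)` block. Built this way, the count columns come out as integers rather than `object`.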

## Assessing Data
In this section, detect and document at least **eight (8) quality issues and two (2) tidiness issues**. You must use **both** visual assessment
and programmatic assessment to assess the data.

**Note:** pay attention to the following key points when you assess the data.

* You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
* Assessing and cleaning the entire dataset completely would require a lot of time, and is not necessary to practice and demonstrate your skills in data wrangling. Therefore, the requirements of this project are only to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset.
* The fact that the rating numerators are greater than the denominators does not need to be cleaned. This [unique rating system](http://knowyourmeme.com/memes/theyre-good-dogs-brent) is a big part of the popularity of WeRateDogs.
* You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.
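The first key point (original ratings only, with images) can be expressed as a boolean filter. This is a minimal sketch on made-up frames; it assumes, as in this project, that a retweet has a non-null `retweeted_status_id` and that a tweet "has an image" when its `tweet_id` appears in the image predictions table.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the archive and the image predictions (values are made up).
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "retweeted_status_id": [np.nan, 99.0, np.nan, np.nan],
})
predictions = pd.DataFrame({"tweet_id": [1, 2, 4]})

# Original tweets have no retweeted_status_id.
is_original = archive["retweeted_status_id"].isnull()
# Tweets with an image are those present in the predictions table.
has_image = archive["tweet_id"].isin(predictions["tweet_id"])

originals_with_images = archive[is_original & has_image]
print(originals_with_images["tweet_id"].tolist())
```

Tweet 2 is excluded because it is a retweet, and tweet 3 because it has no image prediction.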



In [337]:
gathered_tweet_df.shape

(2354, 4)

In [338]:
gathered_tweet_df.sample(4)

Unnamed: 0,tweet_id,retweet_count,favorite_count,created_at
417,822462944365645825,17209,31800,Fri Jan 20 15:17:01 +0000 2017
1307,707059547140169728,759,2796,Tue Mar 08 04:25:07 +0000 2016
661,790987426131050500,2483,11089,Tue Oct 25 18:44:32 +0000 2016
1829,676215927814406144,661,1881,Mon Dec 14 01:43:35 +0000 2015


In [339]:
gathered_tweet_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2354 entries, 0 to 2353
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   tweet_id        2354 non-null   object
 1   retweet_count   2354 non-null   object
 2   favorite_count  2354 non-null   object
 3   created_at      2354 non-null   object
dtypes: object(4)
memory usage: 92.0+ KB


In [340]:
gathered_tweet_df.describe()

Unnamed: 0,tweet_id,retweet_count,favorite_count,created_at
count,2354,2354,2354,2354
unique,2354,1724,2007,2354
top,667495797102141441,3652,0,Wed Jan 06 04:11:43 +0000 2016
freq,1,5,179,1


In [341]:
# Check null values in gathered_tweet_df
gathered_tweet_df.isnull().sum()

tweet_id          0
retweet_count     0
favorite_count    0
created_at        0
dtype: int64

In [342]:
image_pred.sample(4)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
103,667806454573760512,https://pbs.twimg.com/media/CUSGbXeVAAAgztZ.jpg,1,toyshop,0.253089,False,Chihuahua,0.187155,True,Brabancon_griffon,0.112799,True
1447,776088319444877312,https://pbs.twimg.com/media/CsU4NKkW8AUI5eG.jpg,3,web_site,0.999916,False,pug,7.7e-05,True,menu,2e-06,False
1522,788070120937619456,https://pbs.twimg.com/media/Co-hmcYXYAASkiG.jpg,1,golden_retriever,0.735163,True,Sussex_spaniel,0.064897,True,Labrador_retriever,0.047704,True
1941,861005113778896900,https://pbs.twimg.com/media/C_LnlF5VoAEsL1K.jpg,1,German_shepherd,0.507951,True,Pembroke,0.136113,True,muzzle,0.075764,False


In [343]:
image_pred.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   int64  
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [344]:
image_pred.describe()

Unnamed: 0,tweet_id,img_num,p1_conf,p2_conf,p3_conf
count,2075.0,2075.0,2075.0,2075.0,2075.0
mean,7.384514e+17,1.203855,0.594548,0.1345886,0.06032417
std,6.785203e+16,0.561875,0.271174,0.1006657,0.05090593
min,6.660209e+17,1.0,0.044333,1.0113e-08,1.74017e-10
25%,6.764835e+17,1.0,0.364412,0.05388625,0.0162224
50%,7.119988e+17,1.0,0.58823,0.118181,0.0494438
75%,7.932034e+17,1.0,0.843855,0.1955655,0.09180755
max,8.924206e+17,4.0,1.0,0.488014,0.273419


In [345]:
# Check null values in tweet_archive
tweet_archive.isnull().sum()

tweet_id                         0
in_reply_to_status_id         2278
in_reply_to_user_id           2278
timestamp                        0
source                           0
text                             0
retweeted_status_id           2175
retweeted_status_user_id      2175
retweeted_status_timestamp    2175
expanded_urls                   59
rating_numerator                 0
rating_denominator               0
name                             0
doggo                            0
floofer                          0
pupper                           0
puppo                            0
dtype: int64

In [346]:
tweet_archive.sample(4)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
920,756303284449767430,,,2016-07-22 01:42:09 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Pwease accept dis rose on behalf of dog. 11/10...,,,,https://twitter.com/dog_rates/status/756303284...,11,10,,,,,
180,857062103051644929,,,2017-04-26 02:41:43 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @AaronChewning: First time wearing my @dog_...,8.570611e+17,58709723.0,2017-04-26 02:37:47 +0000,https://twitter.com/AaronChewning/status/85706...,13,10,,,,,
1232,713175907180089344,,,2016-03-25 01:29:21 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Say hello to Opie and Clarkus. Clarkus fell as...,,,,https://twitter.com/dog_rates/status/713175907...,10,10,Opie,,,,
1535,689977555533848577,,,2016-01-21 01:07:23 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Happy Wednesday here's a pup wearing a beret. ...,,,,https://twitter.com/dog_rates/status/689977555...,12,10,,,,,


In [347]:
tweet_archive.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

In [348]:
tweet_archive.describe()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,retweeted_status_id,retweeted_status_user_id,rating_numerator,rating_denominator
count,2356.0,78.0,78.0,181.0,181.0,2356.0,2356.0
mean,7.427716e+17,7.455079e+17,2.014171e+16,7.7204e+17,1.241698e+16,13.126486,10.455433
std,6.856705e+16,7.582492e+16,1.252797e+17,6.236928e+16,9.599254e+16,45.876648,6.745237
min,6.660209e+17,6.658147e+17,11856340.0,6.661041e+17,783214.0,0.0,0.0
25%,6.783989e+17,6.757419e+17,308637400.0,7.186315e+17,4196984000.0,10.0,10.0
50%,7.196279e+17,7.038708e+17,4196984000.0,7.804657e+17,4196984000.0,11.0,10.0
75%,7.993373e+17,8.257804e+17,4196984000.0,8.203146e+17,4196984000.0,12.0,10.0
max,8.924206e+17,8.862664e+17,8.405479e+17,8.87474e+17,7.874618e+17,1776.0,170.0


In [350]:
# Columns such as retweeted_status_id are mostly null (2175 of 2356 values missing),
# and the in_reply_to_* columns are emptier still (2278 missing).
# These columns are not useful for our analysis.

In [351]:
tweet_archive.query('doggo == "doggo"')

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
9,890240255349198849,,,2017-07-26 15:59:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Cassie. She is a college pup. Studying...,,,,https://twitter.com/dog_rates/status/890240255...,14,10,Cassie,doggo,,,
43,884162670584377345,,,2017-07-09 21:29:42 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Yogi. He doesn't have any important dog m...,,,,https://twitter.com/dog_rates/status/884162670...,12,10,Yogi,doggo,,,
99,872967104147763200,,,2017-06-09 00:02:31 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here's a very large dog. He has a date later. ...,,,,https://twitter.com/dog_rates/status/872967104...,12,10,,doggo,,,
108,871515927908634625,,,2017-06-04 23:56:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Napolean. He's a Raggedy East Nicaragu...,,,,https://twitter.com/dog_rates/status/871515927...,12,10,Napolean,doggo,,,
110,871102520638267392,,,2017-06-03 20:33:19 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Never doubt a doggo 14/10 https://t.co/AbBLh2FZCH,,,,https://twitter.com/animalcog/status/871075758...,14,10,,doggo,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1117,732375214819057664,,,2016-05-17 01:00:32 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Kyle (pronounced 'Mitch'). He strives ...,,,,https://twitter.com/dog_rates/status/732375214...,11,10,Kyle,doggo,,,
1141,727644517743104000,,,2016-05-03 23:42:26 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here's a doggo struggling to cope with the win...,,,,https://twitter.com/dog_rates/status/727644517...,13,10,,doggo,,,
1156,724771698126512129,,,2016-04-26 01:26:53 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Nothin better than a doggo and a sunset. 11/10...,,,,https://twitter.com/dog_rates/status/724771698...,11,10,,doggo,,,
1176,719991154352222208,,,2016-04-12 20:50:42 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This doggo was initially thrilled when she saw...,,,,https://twitter.com/dog_rates/status/719991154...,10,10,,doggo,,,


In [352]:
pd.set_option('display.max_colwidth', None)

In [353]:
tweet_archive[['text','name', 'rating_numerator', 'rating_denominator']].sample(15)


Unnamed: 0,text,name,rating_numerator,rating_denominator
2053,Meet Penelope. She is a white Macadamias Duodenum. Very excited about wall. Lives on Frosted Flakes. 11/10 good pup https://t.co/CqcRagJlyS,Penelope,11,10
499,Here's an anonymous doggo that appears to be very done with Christmas. 11/10 cheer up pup https://t.co/BzITyGw3JA,,11,10
1666,NAAAAAAA ZAPENYAAAAA MABADI-CHIBAWAAA 12/10 https://t.co/Ny4iM6FDtz,,12,10
247,RT @dog_rates: Here's a heartwarming scene of a single father raising his two pups. Downright awe-inspiring af. 12/10 for everyone https://…,,12,10
1738,This little pupper just arrived. 11/10 would snug https://t.co/DA5aqnSGfB,,11,10
57,Meet Elliot. He's a Canadian Forrest Pup. Unusual number of antlers for a dog. Sneaky tongue slip to celebrate #Canada150. 12/10 would pet https://t.co/cgwJwowTMC,Elliot,12,10
304,"This is Ava. She just blasted off. Streamline af. Aerodynamic as h*ck. One small step for pupper, one giant leap for pupkind. 12/10 https://t.co/W4KffrdX3Q",Ava,12,10
1524,This is Lolo. She's America af. Behind in science &amp; math but can say whatever she wants on Twitter. 11/10 ...Merica https://t.co/Nwi3SYe8KA,Lolo,11,10
319,RT @dog_rates: This is Leo. He was a skater pup. She said see ya later pup. He wasn't good enough for her. 12/10 you're good enough for me…,Leo,12,10
1487,This is Milo. He doesn't understand your fancy human gestures. Will lick instead. 10/10 can't faze this pupper https://t.co/OhodPIDOpW,Milo,10,10


In [354]:
tweet_archive[['text', 'retweeted_status_id', 'retweeted_status_user_id', 'retweeted_status_timestamp']].sample(15)

Unnamed: 0,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp
2053,Meet Penelope. She is a white Macadamias Duodenum. Very excited about wall. Lives on Frosted Flakes. 11/10 good pup https://t.co/CqcRagJlyS,,,
203,This is Rumpole. He'll be your Uber driver this evening. Won't start driving until you buckle pup. 13/10 h*ckin safe good boy https://t.co/EX9Z3EXlVP,,,
1600,This pupper has a magical eye. 11/10 I can't stop looking at it https://t.co/heAGpKTpPW,,,
237,Meet Daisy. She's been pup for adoption for months now but hasn't gotten any applications. 11/10 let's change that\n\nhttps://t.co/Jlb9L0m3J0 https://t.co/Eh7fGFuy6r,,,
2347,My oh my. This is a rare blond Canadian terrier on wheels. Only $8.98. Rather docile. 9/10 very rare https://t.co/yWBqbrzy8O,,,
182,RT @dog_rates: This is Luna. It's her first time outside and a bee stung her nose. Completely h*ckin uncalled for. 13/10 where's the bee I…,8.447048e+17,4196984000.0,2017-03-23 00:18:10 +0000
560,This is Marley. She's having a ruff day. Pretty pupset. 12/10 would assist https://t.co/yLm7hQ6UXh,,,
1507,This is Richie and Plip. They are the best of pals. Do everything together. 10/10 for both https://t.co/KMdwNgONkV,,,
819,We only rate dogs. Pls stop sending in non-canines like this Arctic Floof Kangaroo. This is very frustrating. 11/10 https://t.co/qlUDuPoE3d,,,
449,This is Bo. He was a very good First Doggo. 14/10 would be an absolute honor to pet https://t.co/AdPKrI8BZ1,,,


In [355]:
tweet_archive['name'].value_counts()

None       745
a           55
Charlie     12
Lucy        11
Cooper      11
          ... 
Chevy        1
Sparky       1
Dot          1
Fabio        1
Lulu         1
Name: name, Length: 957, dtype: int64
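Entries such as `a` in the counts above suggest that some "names" are ordinary words picked up from the tweet text. One programmatic way to list such candidates, sketched on a made-up Series and assuming that real names start with an uppercase letter, is a regular-expression match on the first character.

```python
import pandas as pd

# Made-up sample of the name column.
names = pd.Series(["None", "a", "Charlie", "the", "Lucy", "a", "quite"])

# str.match anchors at the start of each string, so this flags
# values whose first character is a lowercase ASCII letter.
starts_lowercase = names.str.match(r"[a-z]")
suspect_counts = names[starts_lowercase].value_counts()
print(suspect_counts)
```

The string `None` is not flagged by this check because it starts with an uppercase letter, so it needs to be handled separately.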

### Quality issues
1. Some of the tweets are retweets, and some are not about dogs at all but still have ratings

2. Some columns, such as in_reply_to_status_id and in_reply_to_user_id, have no real use case and are mostly null

3. Some of the dog names are incorrect (for example "a"), and many have the value None

4. Incorrect ratings for some of the dogs

5. Incorrect data types for some columns, such as timestamp, which is stored as a string

6. Some tweets in the gathered tweet DataFrame have a creation date later than August 1, 2017

7. 

8.

### Tidiness issues
1. The dog stages should be a single column instead of being split into four (doggo, floofer, pupper, puppo)

2.

## Cleaning Data
In this section, clean **all** of the issues you documented while assessing. 

**Note:** Make a copy of the original data before cleaning. Cleaning includes merging individual pieces of data according to the rules of [tidy data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html). The result should be a high-quality and tidy master pandas DataFrame (or DataFrames, if appropriate).
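Merging the pieces into one master DataFrame could look like the sketch below. It uses small made-up frames and assumes `tweet_id` is the shared key; because the gathered counts table stores `tweet_id` as `object`, the key is cast to an integer before merging so that the dtypes match.

```python
import pandas as pd

# Toy stand-ins for the three tables (values are made up).
archive = pd.DataFrame({"tweet_id": [1, 2, 3], "rating_numerator": [12, 13, 10]})
predictions = pd.DataFrame({"tweet_id": [1, 3], "p1": ["pug", "Chihuahua"]})
counts = pd.DataFrame({"tweet_id": [1, 2, 3], "retweet_count": [5, 7, 9]})
# Mimic the object dtype reported by gathered_tweet_df.info().
counts = counts.astype({"tweet_id": object})

# Align the key dtype before merging.
counts["tweet_id"] = counts["tweet_id"].astype("int64")

# Inner joins keep only tweets present in all three tables.
master = (
    archive
    .merge(predictions, on="tweet_id", how="inner")
    .merge(counts, on="tweet_id", how="inner")
)
print(master)
```

An inner join also enforces the "has an image" requirement, since tweets without a prediction row drop out.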

In [356]:
# Make copies of original pieces of data
tweet_archive_copy = tweet_archive.copy()
image_pred_copy = image_pred.copy()
gathered_tweet_df_copy = gathered_tweet_df.copy()

In [357]:
dogitionary = ['doggo', 'floofer', 'pupper', 'puppo']
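The four stage names in `dogitionary` correspond to four separate columns in the archive. A minimal sketch of collapsing them into one column follows; it assumes a missing stage is stored as the string `"None"` (the null counts above report no NaN in these columns), and the column name `dog_stage` is hypothetical.

```python
import pandas as pd

stage_columns = ["doggo", "floofer", "pupper", "puppo"]

# Made-up rows: one stage, no stage, and two stages at once.
toy = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "doggo": ["doggo", "None", "doggo"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "None", "pupper"],
    "puppo": ["None", "None", "None"],
})


def combine_stages(row):
    """Join every stage present in the row; return None when there is none."""
    present = [row[col] for col in stage_columns if row[col] != "None"]
    return ", ".join(present) if present else None


toy["dog_stage"] = toy[stage_columns].apply(combine_stages, axis=1)
tidy = toy.drop(columns=stage_columns)
print(tidy)
```

Joining with a comma keeps tweets that mention two stages instead of silently picking one.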

### Issue #1:
* Some of the tweets are retweets and may not be about dogs

#### Define:
- Tweets with non-null values in retweeted_status_id, retweeted_status_user_id or retweeted_status_timestamp are retweets and should be dropped
- The info() and describe() output above shows 181 non-null values in each of these columns

#### Code

In [358]:
tweet_archive_copy.shape

(2356, 17)

In [359]:
# Keep only original tweets, i.e. rows where retweeted_status_id is null
tweet_archive_copy = tweet_archive_copy.loc[tweet_archive_copy['retweeted_status_id'].isnull()]

In [360]:
# Drop the retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp columns of tweet_archive_copy
tweet_archive_copy = tweet_archive_copy.drop(['retweeted_status_id', 'retweeted_status_user_id', 'retweeted_status_timestamp'], axis=1)

#### Test

In [361]:
tweet_archive_copy.shape


(2175, 14)

In [362]:
tweet_archive_copy.sample(4)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1477,693622659251335168,,,2016-01-31 02:31:43 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",When you keepin the popcorn bucket in your lap and she reach for some... 10/10 https://t.co/a1IrjaID3X,https://twitter.com/dog_rates/status/693622659251335168/photo/1,10,10,,,,,
2251,667806454573760512,,,2015-11-20 20:47:20 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Filup. He is overcome with joy after finally meeting his father. 10/10 https://t.co/TBmDJXJB75,https://twitter.com/dog_rates/status/667806454573760512/photo/1,10,10,Filup,,,,
1738,679527802031484928,,,2015-12-23 05:03:47 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This little pupper just arrived. 11/10 would snug https://t.co/DA5aqnSGfB,https://twitter.com/dog_rates/status/679527802031484928/photo/1,11,10,,,,pupper,
62,880095782870896641,,,2017-06-28 16:09:20 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Please don't send in photos without dogs in them. We're not @porch_rates. Insubordinate and churlish. Pretty good porch tho 11/10 https://t.co/HauE8M3Bu4,https://twitter.com/dog_rates/status/880095782870896641/photo/1,11,10,,,,,


### Issue #2:
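- Unneeded columns, most of them almost entirely NaN

#### Define:
- in_reply_to_status_id and in_reply_to_user_id are almost entirely null, and source is not needed for the analysis
- The retweeted_status_id, retweeted_status_user_id and retweeted_status_timestamp columns were already dropped under Issue #1
- Drop the remaining three columns with the drop function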

#### Code

In [363]:
useless_columns = ['in_reply_to_status_id', 'in_reply_to_user_id','source']

In [364]:
tweet_archive_copy.drop(useless_columns, axis=1, inplace=True)

#### Test

In [365]:
tweet_archive_copy.sample(4)

Unnamed: 0,tweet_id,timestamp,text,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1611,685325112850124800,2016-01-08 05:00:14 +0000,"""Tristan do not speak to me with that kind of tone or I will take away the Xbox."" 10/10 https://t.co/VGPH0TfESw",https://twitter.com/dog_rates/status/685325112850124800/photo/1,10,10,,,,,
127,867900495410671616,2017-05-26 00:29:37 +0000,"Unbelievable. We only rate dogs. Please don't send in non-canines like the ""I"" from Pixar's opening credits. Thank you... 12/10 https://t.co/JMhDNv5wXZ",https://twitter.com/dog_rates/status/867900495410671616/photo/1,12,10,,,,,
887,759923798737051648,2016-08-01 01:28:46 +0000,We only rate dogs... this is a Taiwanese Guide Walrus. Im getting real heckin tired of this. Please send dogs. 10/10 https://t.co/49hkNAsubi,https://twitter.com/dog_rates/status/759923798737051648/photo/1,10,10,,,,,
2084,670807719151067136,2015-11-29 03:33:17 +0000,"Say hello to Andy. He can balance on one foot, obliterate u in checkers, &amp; transform into a rug. 11/10 much talents https://t.co/idzH8JH06g","https://twitter.com/dog_rates/status/670807719151067136/photo/1,https://twitter.com/dog_rates/status/670807719151067136/photo/1,https://twitter.com/dog_rates/status/670807719151067136/photo/1",11,10,Andy,,,,


### Issue #3:
- Incorrect names
- None values for some of the names

#### Define:
- Find the incorrect names using value_counts; they start with a lowercase letter
- Remove the rows whose name is incorrect or has the value None

#### Code

In [366]:
# # First, remove all tweets that don't contain any of the dog words
# for word in dogitionary:
#     tweet_archive_copy = tweet_archive_copy[tweet_archive_copy['text'].str.contains(word)]


In [367]:
# Create a csv file containing the dog names and their counts so they can be inspected visually
counts = tweet_archive_copy['name'].value_counts()
counts.to_csv('name.csv', index=True)

In [368]:
# Get all the invalid names and remove them from the dataframe
# Invalid names start with a lowercase letter.

# Create a list of invalid names
invalid_names = ['None']
for name in tweet_archive_copy.name:
    if name[0].islower():
        invalid_names.append(name)

In [369]:
# Get unique invalid names
invalid_names = list(set(invalid_names))

In [370]:
tweet_archive_copy.shape

(2175, 11)

In [371]:
# Remove invalid names from the dataframe
tweet_archive_copy = tweet_archive_copy[~tweet_archive_copy['name'].isin(invalid_names)]
tweet_archive_copy.name.value_counts()

Charlie    11
Lucy       11
Oliver     10
Cooper     10
Penny       9
           ..
Benny       1
Chubbs      1
Alfy        1
Hamrick     1
Lulu        1
Name: name, Length: 930, dtype: int64

In [372]:
# Check the shape of the dataframe after removing invalid names
tweet_archive_copy.shape

(1391, 11)
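The filter above removes whole rows, which takes the frame from 2175 rows to 1391. If the ratings of unnamed dogs should stay in the analysis, an alternative is to keep the rows and set only the name to NaN. This is a sketch on a made-up frame, not the approach used in the cells above.

```python
import numpy as np
import pandas as pd

# Made-up sample of the name column.
toy = pd.DataFrame({"name": ["Charlie", "None", "a", "Lucy", "quite"]})

# A name is treated as invalid when it is the string "None"
# or starts with a lowercase letter.
is_invalid = (toy["name"] == "None") | toy["name"].str.match(r"[a-z]")

# mask() replaces the flagged values and leaves every row in place.
toy["name"] = toy["name"].mask(is_invalid, np.nan)
print(toy["name"].isnull().sum())
```

All five rows survive; only the three invalid names become missing values.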

#### Test

In [373]:
# verify that the dataframe is now clean of invalid names
tweet_archive_copy[['text','name']].sample(10)

Unnamed: 0,text,name
1478,Meet Phil. He's big af. Currently destroying this nice family home. Completely uncalled for. 3/10 not a good pupper https://t.co/fShNNhBWYx,Phil
2159,This is Keith. He's had 13 DUIs. 7/10 that's too many Keith https://t.co/fa7olwrF9Y,Keith
637,This is Moreton. He's the Good Boy Who Lived. 13/10 magical as h*ck https://t.co/rLHGx3VAF3,Moreton
219,This is Riley. He's making new friends. Jubilant as h*ck for the fun times ahead. 11/10 for all pups pictured https://t.co/PCX25VV78l,Riley
592,This is Iroh. He's in a predicament. 12/10 someone help him https://t.co/KJAKO2kXsL,Iroh
2047,This is Scruffers. He's being violated on multiple levels and is not happy about it. 9/10 hang in there Scruffers https://t.co/nLQoltwEZ7,Scruffers
2148,Say hello to Clarence. Clarence thought he saw a squirrel. He was just trying to help. 8/10 poor Clarence https://t.co/tbFaTUHLJB,Clarence
468,This is Chloe. She fell asleep at the wheel. Absolute menace on the roadways. Sneaky tongue slip tho. 11/10 https://t.co/r6SLVN2VUH,Chloe
609,This is Cassie. She steals things. Guilt increases slightly each time. 12/10 would forgive almost immediately https://t.co/Ia19irLwyB,Cassie
964,This is Malcolm. He's absolutely terrified of heights. 8/10 hang in there pupper https://t.co/SVU00Sc9U2,Malcolm


### Issue #4:
- Incorrect ratings for some of the dogs

#### Define
- The denominator is expected to be 10, but the describe() output above shows denominators ranging from 0 to 170
- Find all denominators that are smaller or larger than 10 and replace them with 10
- Inspect unusually high numerators and handle them; the extreme outlier of 1776 is dropped

#### Code

In [374]:
# Reset pandas display options
pd.reset_option('display.max_colwidth')

In [375]:
tweet_archive_copy[['text','name', 'rating_numerator', 'rating_denominator']].sample(10)

Unnamed: 0,text,name,rating_numerator,rating_denominator
1367,This is Sansa. She's gotten too big for her ch...,Sansa,11,10
364,This is Malcolm. He goes from sneaky tongue sl...,Malcolm,12,10
844,This is Brudge. He's a Doberdog. Going to be h...,Brudge,11,10
837,This is Philbert. His toilet broke and he does...,Philbert,11,10
2279,This is Biden. Biden just tripped... 7/10 http...,Biden,7,10
1185,This is Carper. He's a Tortellini Angiosperm. ...,Carper,11,10
1119,This is Solomon. He's a Beneroo Cumberflop. 12...,Solomon,12,10
1983,This is Terry. He's a Toasty Western Sriracha....,Terry,10,10
1034,This is Oliver. He's downright gorgeous as hel...,Oliver,12,10
673,This is Eli. He can fly. 13/10 magical af http...,Eli,13,10


In [376]:
tweet_archive_copy.query('rating_denominator < 10')

Unnamed: 0,tweet_id,timestamp,text,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
516,810984652412424192,2016-12-19 23:06:23 +0000,Meet Sam. She smiles 24/7 &amp; secretly aspir...,"https://www.gofundme.com/sams-smile,https://tw...",24,7,Sam,,,,


In [377]:
# Find and replace rating denominator less than 10 with 10
tweet_archive_copy.loc[tweet_archive_copy['rating_denominator'] < 10, 'rating_denominator'] = 10

In [378]:
tweet_archive_copy.query('rating_denominator > 10')

Unnamed: 0,tweet_id,timestamp,text,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1202,716439118184652801,2016-04-03 01:36:11 +0000,This is Bluebert. He just saw that both #Final...,https://twitter.com/dog_rates/status/716439118...,50,50,Bluebert,,,,
1662,682962037429899265,2016-01-01 16:30:13 +0000,This is Darrel. He just robbed a 7/11 and is i...,https://twitter.com/dog_rates/status/682962037...,7,11,Darrel,,,,


In [379]:
# Find and replace rating denominator greater than 10 with 10
tweet_archive_copy.loc[tweet_archive_copy['rating_denominator'] > 10, 'rating_denominator'] = 10
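Overwriting the denominator does not touch the numerator, so it can help to check whether the tweet text contains more than one fraction (a phrase such as "24/7" can be mistaken for a rating). The sketch below runs on made-up texts and uses `str.extractall` to list every `number/number` pattern per tweet; the regular expression is an assumption and ignores decimal ratings.

```python
import pandas as pd

# Made-up tweet texts: one rating, a stray fraction plus a rating, and no rating.
texts = pd.Series([
    "This is Rex. 12/10 would pet",
    "Open 24/7 and still a good boy. 11/10",
    "No rating in this one",
])

# One output row per match, indexed by (original row, match number).
fractions = texts.str.extractall(r"(?P<numerator>\d+)/(?P<denominator>\d+)")

# Count matches per tweet and keep the tweets with more than one.
per_tweet = fractions.groupby(level=0).size()
ambiguous = per_tweet[per_tweet > 1].index.tolist()
print(ambiguous)
```

Tweets flagged this way are candidates for a manual look at which fraction is the actual rating.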

In [380]:
# View the distribution of rating numerator
tweet_archive_copy.rating_numerator.describe()

count    1391.000000
mean       12.091301
std        47.413241
min         2.000000
25%        10.000000
50%        11.000000
75%        12.000000
max      1776.000000
Name: rating_numerator, dtype: float64

In [381]:
# Find all ratings numerator greater than the 75th percentile
greater_than_75 = tweet_archive_copy['rating_numerator'][tweet_archive_copy['rating_numerator'] > tweet_archive_copy['rating_numerator'].quantile(0.75)]
print(greater_than_75.value_counts())

13      183
14       17
1776      1
75        1
50        1
27        1
24        1
Name: rating_numerator, dtype: int64


In [382]:
# Find all tweets with rating numerator greater than 75th percentile
tweet_archive_copy.query('rating_numerator > rating_numerator.quantile(0.75)')

Unnamed: 0,tweet_id,timestamp,text,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,2017-08-01 16:23:56 +0000,This is Phineas. He's a mystical boy. Only eve...,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,2017-08-01 00:17:27 +0000,This is Tilly. She's just checking pup on you....,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
3,891689557279858688,2017-07-30 15:58:51 +0000,This is Darla. She commenced a snooze mid meal...,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
6,890971913173991426,2017-07-28 16:27:12 +0000,Meet Jax. He enjoys ice cream so much he gets ...,"https://gofundme.com/ydvmve-surgery-for-jax,ht...",13,10,Jax,,,,
8,890609185150312448,2017-07-27 16:25:51 +0000,This is Zoey. She doesn't want to be one of th...,https://twitter.com/dog_rates/status/890609185...,13,10,Zoey,,,,
...,...,...,...,...,...,...,...,...,...,...,...
1519,690735892932222976,2016-01-23 03:20:44 +0000,Say hello to Peaches. She's a Dingleberry Zand...,https://twitter.com/dog_rates/status/690735892...,13,10,Peaches,,,,
1562,688211956440801280,2016-01-16 04:11:31 +0000,This is Derby. He's a superstar. 13/10 (vid by...,https://twitter.com/dog_rates/status/688211956...,13,10,Derby,,,,
1906,674468880899788800,2015-12-09 06:01:26 +0000,This is Louis. He thinks he's flying. 13/10 th...,https://twitter.com/dog_rates/status/674468880...,13,10,Louis,,,,
1952,673680198160809984,2015-12-07 01:47:30 +0000,This is Shnuggles. I would kill for Shnuggles....,https://twitter.com/dog_rates/status/673680198...,13,10,Shnuggles,,,,


In [383]:
# Drop the row with the outlier value of 1776
tweet_archive_copy = tweet_archive_copy[tweet_archive_copy['rating_numerator'] != 1776]
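Instead of picking the outlier by eye, a common rule of thumb flags values above the third quartile plus 1.5 times the interquartile range. This is a swapped-in technique, not the one used in the cell above, sketched on a made-up Series.

```python
import pandas as pd

# Made-up numerators with one extreme value.
numerators = pd.Series([10, 11, 11, 12, 12, 13, 13, 14, 1776])

q1 = numerators.quantile(0.25)
q3 = numerators.quantile(0.75)

# Upper fence: Q3 + 1.5 * IQR.
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = numerators[numerators > upper_fence].tolist()
print(upper_fence, outliers)
```

With the quartiles reported by describe() above (10 and 12), the same rule would give a fence of 15, which would also flag the numerators 24, 27, 50 and 75 for review.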

#### Test

In [384]:
# Print the highest and lowest rating denominator, then the highest rating numerator
print(tweet_archive_copy['rating_denominator'].max())
print(tweet_archive_copy['rating_denominator'].min())
print(tweet_archive_copy['rating_numerator'].max())


10
10
75


### Issue #5:
- Some columns have wrong datatypes

#### Define
- Convert the timestamp column of the archive and the created_at column of the gathered tweets to datetime using pandas' to_datetime function

#### Code

In [385]:
# Confirm datatypes of columns
tweet_archive_copy.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1390 entries, 0 to 2325
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   tweet_id            1390 non-null   int64 
 1   timestamp           1390 non-null   object
 2   text                1390 non-null   object
 3   expanded_urls       1390 non-null   object
 4   rating_numerator    1390 non-null   int64 
 5   rating_denominator  1390 non-null   int64 
 6   name                1390 non-null   object
 7   doggo               1390 non-null   object
 8   floofer             1390 non-null   object
 9   pupper              1390 non-null   object
 10  puppo               1390 non-null   object
dtypes: int64(3), object(8)
memory usage: 130.3+ KB


In [386]:
gathered_tweet_df_copy.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2354 entries, 0 to 2353
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   tweet_id        2354 non-null   object
 1   retweet_count   2354 non-null   object
 2   favorite_count  2354 non-null   object
 3   created_at      2354 non-null   object
dtypes: object(4)
memory usage: 92.0+ KB


In [387]:
# Change the datatype for timestamp column to datetime
tweet_archive_copy['timestamp'] = pd.to_datetime(tweet_archive_copy['timestamp'])

In [388]:
# Change the datatype for created_at column to datetime
gathered_tweet_df_copy['created_at'] = pd.to_datetime(gathered_tweet_df_copy['created_at'])

In [389]:
# Find distribution of tweets by year of creation in tweet_archive_copy
tweet_archive_copy['timestamp'].dt.year.value_counts()

2016    727
2015    379
2017    284
Name: timestamp, dtype: int64

In [390]:
# Find distribution of tweets by year of creation in gathered_tweet_df_copy
gathered_tweet_df_copy['created_at'].dt.year.value_counts()

2016    1182
2015     690
2017     482
Name: created_at, dtype: int64

#### Test

In [391]:
tweet_archive_copy.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1390 entries, 0 to 2325
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype              
---  ------              --------------  -----              
 0   tweet_id            1390 non-null   int64              
 1   timestamp           1390 non-null   datetime64[ns, UTC]
 2   text                1390 non-null   object             
 3   expanded_urls       1390 non-null   object             
 4   rating_numerator    1390 non-null   int64              
 5   rating_denominator  1390 non-null   int64              
 6   name                1390 non-null   object             
 7   doggo               1390 non-null   object             
 8   floofer             1390 non-null   object             
 9   pupper              1390 non-null   object             
 10  puppo               1390 non-null   object             
dtypes: datetime64[ns, UTC](1), int64(3), object(7)
memory usage: 130.3+ KB


In [392]:
gathered_tweet_df_copy.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2354 entries, 0 to 2353
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype              
---  ------          --------------  -----              
 0   tweet_id        2354 non-null   object             
 1   retweet_count   2354 non-null   object             
 2   favorite_count  2354 non-null   object             
 3   created_at      2354 non-null   datetime64[ns, UTC]
dtypes: datetime64[ns, UTC](1), object(3)
memory usage: 92.0+ KB
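
The info() output above still lists tweet_id, retweet_count and favorite_count in gathered_tweet_df_copy as object. A minimal sketch of how those columns might be converted is shown here. The toy_tweets frame and its values are made up for illustration, and it assumes every cell holds a clean whole number.

```python
import pandas as pd

# Toy stand-in for gathered_tweet_df_copy; the values are invented for illustration.
toy_tweets = pd.DataFrame({
    'tweet_id': ['890609185150312448', '690735892932222976'],
    'retweet_count': ['4429', '1203'],
    'favorite_count': ['28226', '3700'],
}, dtype='object')

# Counts are whole numbers, so parse them and store them as int64.
for col in ['retweet_count', 'favorite_count']:
    toy_tweets[col] = pd.to_numeric(toy_tweets[col]).astype('int64')

# tweet_id is int64 in tweet_archive_copy, so use the same dtype here
# to keep the two tables comparable on that column.
toy_tweets['tweet_id'] = toy_tweets['tweet_id'].astype('int64')

print(toy_tweets.dtypes)
```

If some cells could be missing or malformed, pd.to_numeric accepts errors='coerce', which turns them into NaN instead of raising.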


### Issue 6:
- Some tweets in the gathered tweets data were created after August 1, 2017

#### Define:
- Delete all tweets whose creation date is later than August 1, 2017
- Use the query

#### Code
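
No code was recorded for this step. The following is a rough sketch of one way the filter could be written, not the notebook's actual cell. It uses a boolean mask on a made-up toy frame, and it assumes "later than August 1, 2017" means anything from August 2, 2017 00:00 UTC onward. The names toy, cutoff and filtered are hypothetical.

```python
import pandas as pd

# Toy stand-in for gathered_tweet_df_copy; created_at is timezone-aware (UTC),
# matching the datetime64[ns, UTC] dtype shown in the info() output.
toy = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'created_at': pd.to_datetime([
        '2017-07-27 16:25:51+00:00',
        '2017-08-01 00:00:00+00:00',
        '2017-08-05 12:00:00+00:00',
    ], utc=True),
})

# Assumption: tweets posted at any time on August 1 are kept,
# so the exclusive cutoff is the start of August 2 (UTC).
cutoff = pd.Timestamp('2017-08-02', tz='UTC')

# Keep only rows created before the cutoff.
filtered = toy[toy['created_at'] < cutoff].copy()

print(filtered)
```

The cutoff needs a timezone because comparing a timezone-aware column with a naive timestamp raises an error in pandas.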

### Tidiness Issue

### Issue 1
- The four dog stage columns (doggo, floofer, pupper, puppo) are not all necessary. The stages should all be under a single column

#### Define
- Collapse the four dog stage columns into a single column
- Use the pandas melt function

In [393]:
# Use pandas melt function to convert the dataframe to long format
tweet_archive_copy = pd.melt(tweet_archive_copy, id_vars=['tweet_id','timestamp','text','expanded_urls','rating_numerator','rating_denominator','name'], value_vars=['doggo','floofer', 'pupper','puppo'], var_name='dog_type')

In [394]:
# Drop the 'value' column left over from the melt
tweet_archive_copy.drop('value', axis=1, inplace=True)

#### Test

In [395]:
tweet_archive_copy.sample(4)

Unnamed: 0,tweet_id,timestamp,text,expanded_urls,rating_numerator,rating_denominator,name,dog_type
705,720059472081784833,2016-04-13 01:22:10+00:00,This is Charleson. He lost his plunger. Looked...,https://twitter.com/dog_rates/status/720059472...,9,10,Charleson,doggo
1997,746131877086527488,2016-06-24 00:04:36+00:00,This is Gustav. He has claimed that plant. It ...,https://twitter.com/dog_rates/status/746131877...,10,10,Gustav,floofer
707,719704490224398336,2016-04-12 01:51:36+00:00,This is Clyde. He's making sure you're having ...,https://twitter.com/dog_rates/status/719704490...,12,10,Clyde,doggo
1097,676603393314578432,2015-12-15 03:23:14+00:00,This is Godzilla pupper. He had a ruff childho...,https://twitter.com/dog_rates/status/676603393...,9,10,Godzilla,doggo


In [None]:
tweet_archive_copy.shape
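
One thing to watch with melt: it emits one row per tweet for each stage column, so once the value column is dropped, dog_type only records which column a row came from. In the sample above, a tweet about "Godzilla pupper" is labeled doggo. The sketch below shows one possible alternative that keeps only the stages a tweet actually mentions. It assumes that cells without a stage hold the string 'None' or an empty string, which is not confirmed by the output shown here. The toy data and the names long_df, mentioned, stages and tidy are hypothetical.

```python
import pandas as pd

stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']

# Toy stand-in for tweet_archive_copy before the melt (values invented).
toy = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'name': ['Zoey', 'Godzilla', 'Gustav'],
    'doggo': ['None', 'None', 'doggo'],
    'floofer': ['None', 'None', 'None'],
    'pupper': ['None', 'pupper', 'pupper'],
    'puppo': ['None', 'None', 'None'],
})

# Melt to long format, keeping the cell contents in dog_type.
long_df = toy.melt(id_vars=['tweet_id', 'name'], value_vars=stage_cols,
                   var_name='stage_column', value_name='dog_type')

# Keep only the rows where a stage was actually recorded.
mentioned = long_df[~long_df['dog_type'].isin(['None', ''])]

# A tweet can mention more than one stage, so join them into one string.
stages = (mentioned.groupby('tweet_id')['dog_type']
          .agg(', '.join)
          .reset_index())

# Attach the combined stage back to one row per tweet;
# tweets with no stage get NaN in dog_type.
tidy = toy.drop(columns=stage_cols).merge(stages, on='tweet_id', how='left')

print(tidy)
```

This keeps one row per tweet, so the row count does not grow by a factor of four as it does with the plain melt.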

## Storing Data
Save gathered, assessed, and cleaned master dataset to a CSV file named "twitter_archive_master.csv".
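
A minimal sketch of the save step follows, using a made-up toy_master frame in place of the merged, cleaned dataset, which is not built in this section. It writes to a temporary directory so the example is self-contained. In the notebook the path would presumably just be the file name.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for the cleaned master dataset (values invented).
toy_master = pd.DataFrame({
    'tweet_id': [890609185150312448, 690735892932222976],
    'rating_numerator': [13, 13],
    'rating_denominator': [10, 10],
})

# Assumption: a temp directory is used here only to keep the example
# self-contained; in the notebook this could be 'twitter_archive_master.csv'.
out_path = Path(tempfile.gettempdir()) / 'twitter_archive_master.csv'

# index=False keeps the row index out of the file, which avoids an
# extra 'Unnamed: 0' column when the file is read back.
toy_master.to_csv(out_path, index=False)

# Read the file back to confirm the round trip.
reloaded = pd.read_csv(out_path)
print(reloaded.dtypes)
```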

## Analyzing and Visualizing Data
In this section, analyze and visualize your wrangled data. You must produce at least **three (3) insights and one (1) visualization.**
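
As a starting point, here is a sketch of one way an insight could be derived: the number of tweets and the mean rating numerator per year. The toy frame and its values are invented, so the figures it prints say nothing about the real dataset.

```python
import pandas as pd

# Toy stand-in for the cleaned archive (values invented for illustration).
toy = pd.DataFrame({
    'timestamp': pd.to_datetime([
        '2015-12-09 06:01:26+00:00',
        '2016-01-23 03:20:44+00:00',
        '2016-01-16 04:11:31+00:00',
        '2017-07-27 16:25:51+00:00',
    ], utc=True),
    'rating_numerator': [13, 13, 9, 12],
})

# Group by calendar year and summarize the ratings in each group.
by_year = (toy.groupby(toy['timestamp'].dt.year)['rating_numerator']
           .agg(['count', 'mean']))

print(by_year)
```

The resulting by_year table could also feed a bar chart, for example through by_year['mean'].plot(kind='bar'), to cover the visualization requirement.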

### Insights:
1.

2.

3.

### Visualization