# Project: Wrangling and Analyzing Data

## Data Gathering
In the cell below, gather **all** three pieces of data for this project and load them in the notebook. **Note:** the methods required to gather each dataset are different.
1. Directly download the WeRateDogs Twitter archive data (twitter_archive_enhanced.csv)

In [7]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
import os
import requests
import tweepy
from io import StringIO
import json
from tqdm import tqdm


In [8]:
tweet_archive = pd.read_csv('twitter-archive-enhanced.csv') # read in the data

2. Use the Requests library to download the tweet image prediction (image_predictions.tsv)

In [9]:
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)
data = response.text
image_pred = pd.read_csv(StringIO(data), sep='\t')
# Keep the saved file tab-separated and leave out the index column
image_pred.to_csv('image_predictions.tsv', sep='\t', index=False)
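
One thing to watch when saving the predictions: `DataFrame.to_csv` defaults to comma separators and also writes the index, so a `.tsv` file written with the defaults is not actually tab-separated. The following is a minimal sketch of a round trip on a made-up three-column sample (the values are hypothetical, not taken from the real file), showing that `sep='\t'` together with `index=False` reads back unchanged.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample standing in for the downloaded response text
sample_text = "tweet_id\tp1\tp1_conf\n1\tpug\t0.9\n2\tchow\t0.8\n"
sample = pd.read_csv(StringIO(sample_text), sep='\t')

# Write to an in-memory buffer as tab-separated text, without the index
buffer = StringIO()
sample.to_csv(buffer, sep='\t', index=False)

# Read the written text back with the same separator
round_trip = pd.read_csv(StringIO(buffer.getvalue()), sep='\t')
```

Writing to a `StringIO` buffer here only keeps the sketch free of side effects; a file path works the same way.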

3. Use the Tweepy library to query additional data via the Twitter API (tweet_json.txt)

In [10]:
from dotenv import load_dotenv
load_dotenv()

bearer_token = os.environ.get('BEARER_TOKEN')

tweet_id = list(tweet_archive['tweet_id'])
missing_tweets = []

In [11]:
# if not os.path.exists('tweet_json.txt'):
#     with open('tweet_json.txt', 'w'): pass
# def get_tweet():
#     auth = tweepy.OAuth2BearerHandler(bearer_token)
#     api = tweepy.API(auth)
#     for id in tqdm(tweet_id):
#         try:
#             tweet = api.get_status(id, tweet_mode='extended')
#             with open('tweet_json.txt', 'a') as f:
#                 json.dump(tweet._json, f)
#                 f.write('\n')
#         except:
#             print('Missing Tweet for id: {}'.format(id))
#             missing_tweets.append(id)
#             continue

# # Driver code
# if __name__ == '__main__':
# #   Call the function
#     get_tweet()


In [12]:
# with open('tweet_json.txt', 'r') as f:
with open('json.txt', 'r') as f:
    tweet_df = pd.DataFrame(columns=('tweet_id', 'retweet_count', 'favorite_count'))
    tweets = f.readlines()
    for tweet in tweets:
        tweet = json.loads(tweet)
        tweet_df.loc[len(tweet_df.index)] = [tweet['id'], tweet['retweet_count'], tweet['favorite_count']]

In [13]:
tweet_df.shape

(2354, 3)
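
Appending rows one at a time with `.loc` into an empty DataFrame works, but it is slow and is the likely reason every column of `tweet_df` shows up with dtype `object` in the assessment further down. A possible alternative, sketched here on a hypothetical two-line sample rather than the real file, is to collect plain dictionaries first and build the DataFrame once.

```python
import json
from io import StringIO

import pandas as pd

# Hypothetical two-line sample standing in for the line-delimited JSON file
sample_file = StringIO(
    '{"id": 1, "retweet_count": 10, "favorite_count": 25, "lang": "en"}\n'
    '{"id": 2, "retweet_count": 3, "favorite_count": 0, "lang": "en"}\n'
)

records = []
for line in sample_file:
    tweet = json.loads(line)
    # Keep only the three fields used in this project
    records.append({
        'tweet_id': tweet['id'],
        'retweet_count': tweet['retweet_count'],
        'favorite_count': tweet['favorite_count'],
    })

# Building the frame in one step lets pandas infer integer dtypes
tweet_counts = pd.DataFrame(records)
```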

## Assessing Data
In this section, detect and document at least **eight (8) quality issues and two (2) tidiness issues**. You must use **both** visual assessment and programmatic assessment to assess the data.

**Note:** pay attention to the following key points when you assess the data.

* You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
* Assessing and cleaning the entire dataset completely would require a lot of time, and is not necessary to practice and demonstrate your skills in data wrangling. Therefore, the requirements of this project are only to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset.
* The fact that the rating numerators are greater than the denominators does not need to be cleaned. This [unique rating system](http://knowyourmeme.com/memes/theyre-good-dogs-brent) is a big part of the popularity of WeRateDogs.
* You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.
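
The first key point (original ratings only, with images) can be expressed as two boolean filters. This is a rough sketch on made-up rows; it assumes that a non-null `retweeted_status_id` marks a retweet and that a tweet "has an image" when its `tweet_id` appears in the image predictions table.

```python
import pandas as pd

# Hypothetical miniature versions of the archive and the image predictions
archive = pd.DataFrame({
    'tweet_id': [1, 2, 3, 4],
    'retweeted_status_id': [None, 99.0, None, None],
})
predictions = pd.DataFrame({'tweet_id': [1, 2, 4]})

# Assumption: retweets are the rows where retweeted_status_id is filled in
is_original = archive['retweeted_status_id'].isna()
# Assumption: a tweet has an image if it appears in the predictions table
has_image = archive['tweet_id'].isin(predictions['tweet_id'])

originals_with_images = archive[is_original & has_image]
```

In this toy data, tweet 2 is excluded as a retweet and tweet 3 is excluded because it has no prediction row.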



In [14]:
tweet_df.sample(4)

Unnamed: 0,tweet_id,retweet_count,favorite_count
1389,700143752053182464,3129,8282
148,863079547188785154,1195,9094
802,772152991789019136,1300,4181
1207,715696743237730304,1454,4257


In [15]:
image_pred.sample(4)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
863,697259378236399616,https://pbs.twimg.com/media/Ca0ps3AXEAAnp9m.jpg,1,Great_Dane,0.999223,True,boxer,0.000187,True,whippet,0.000151,True
1620,802952499103731712,https://pbs.twimg.com/media/CySpCSHXcAAN-qC.jpg,1,chow,0.944032,True,golden_retriever,0.01724,True,Pomeranian,0.012085,True
1245,747512671126323200,https://pbs.twimg.com/media/Cl-yykwWkAAqUCE.jpg,1,Cardigan,0.111493,True,malinois,0.095089,True,German_shepherd,0.080146,True
194,669573570759163904,https://pbs.twimg.com/media/CUrNmtFWoAAnWCD.jpg,1,West_Highland_white_terrier,0.946828,True,miniature_schnauzer,0.022344,True,cairn,0.009462,True


In [16]:
tweet_df.describe()

Unnamed: 0,tweet_id,retweet_count,favorite_count
count,2354,2354,2354
unique,2354,1724,2007
top,667495797102141441,3652,0
freq,1,5,179


In [17]:
tweet_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2354 entries, 0 to 2353
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   tweet_id        2354 non-null   object
 1   retweet_count   2354 non-null   object
 2   favorite_count  2354 non-null   object
dtypes: object(3)
memory usage: 73.6+ KB


In [18]:
image_pred.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   int64  
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [19]:
image_pred.describe()

Unnamed: 0,tweet_id,img_num,p1_conf,p2_conf,p3_conf
count,2075.0,2075.0,2075.0,2075.0,2075.0
mean,7.384514e+17,1.203855,0.594548,0.1345886,0.06032417
std,6.785203e+16,0.561875,0.271174,0.1006657,0.05090593
min,6.660209e+17,1.0,0.044333,1.0113e-08,1.74017e-10
25%,6.764835e+17,1.0,0.364412,0.05388625,0.0162224
50%,7.119988e+17,1.0,0.58823,0.118181,0.0494438
75%,7.932034e+17,1.0,0.843855,0.1955655,0.09180755
max,8.924206e+17,4.0,1.0,0.488014,0.273419


In [20]:
tweet_archive.tail(4)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
2352,666044226329800704,,,2015-11-16 00:04:52 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a purebred Piers Morgan. Loves to Netf...,,,,https://twitter.com/dog_rates/status/666044226...,6,10,a,,,,
2353,666033412701032449,,,2015-11-15 23:21:54 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here is a very happy pup. Big fan of well-main...,,,,https://twitter.com/dog_rates/status/666033412...,9,10,a,,,,
2354,666029285002620928,,,2015-11-15 23:05:30 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a western brown Mitsubishi terrier. Up...,,,,https://twitter.com/dog_rates/status/666029285...,7,10,a,,,,
2355,666020888022790149,,,2015-11-15 22:32:08 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a Japanese Irish Setter. Lost eye...,,,,https://twitter.com/dog_rates/status/666020888...,8,10,,,,,


In [21]:
tweet_archive.query('doggo == "doggo"')

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
9,890240255349198849,,,2017-07-26 15:59:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Cassie. She is a college pup. Studying...,,,,https://twitter.com/dog_rates/status/890240255...,14,10,Cassie,doggo,,,
43,884162670584377345,,,2017-07-09 21:29:42 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Yogi. He doesn't have any important dog m...,,,,https://twitter.com/dog_rates/status/884162670...,12,10,Yogi,doggo,,,
99,872967104147763200,,,2017-06-09 00:02:31 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here's a very large dog. He has a date later. ...,,,,https://twitter.com/dog_rates/status/872967104...,12,10,,doggo,,,
108,871515927908634625,,,2017-06-04 23:56:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Napolean. He's a Raggedy East Nicaragu...,,,,https://twitter.com/dog_rates/status/871515927...,12,10,Napolean,doggo,,,
110,871102520638267392,,,2017-06-03 20:33:19 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Never doubt a doggo 14/10 https://t.co/AbBLh2FZCH,,,,https://twitter.com/animalcog/status/871075758...,14,10,,doggo,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1117,732375214819057664,,,2016-05-17 01:00:32 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Kyle (pronounced 'Mitch'). He strives ...,,,,https://twitter.com/dog_rates/status/732375214...,11,10,Kyle,doggo,,,
1141,727644517743104000,,,2016-05-03 23:42:26 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here's a doggo struggling to cope with the win...,,,,https://twitter.com/dog_rates/status/727644517...,13,10,,doggo,,,
1156,724771698126512129,,,2016-04-26 01:26:53 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Nothin better than a doggo and a sunset. 11/10...,,,,https://twitter.com/dog_rates/status/724771698...,11,10,,doggo,,,
1176,719991154352222208,,,2016-04-12 20:50:42 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This doggo was initially thrilled when she saw...,,,,https://twitter.com/dog_rates/status/719991154...,10,10,,doggo,,,


In [22]:
tweet_archive.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
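 13  doggo                       2356 non-null   object 
 14  floofer                     2356 non-null   object 
 15  pupper                      2356 non-null   object 
 16  puppo                       2356 non-null   object 
dtypes: float64(4), int64(3), object(10)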

In [23]:
tweet_archive.describe()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,retweeted_status_id,retweeted_status_user_id,rating_numerator,rating_denominator
count,2356.0,78.0,78.0,181.0,181.0,2356.0,2356.0
mean,7.427716e+17,7.455079e+17,2.014171e+16,7.7204e+17,1.241698e+16,13.126486,10.455433
std,6.856705e+16,7.582492e+16,1.252797e+17,6.236928e+16,9.599254e+16,45.876648,6.745237
min,6.660209e+17,6.658147e+17,11856340.0,6.661041e+17,783214.0,0.0,0.0
25%,6.783989e+17,6.757419e+17,308637400.0,7.186315e+17,4196984000.0,10.0,10.0
50%,7.196279e+17,7.038708e+17,4196984000.0,7.804657e+17,4196984000.0,11.0,10.0
75%,7.993373e+17,8.257804e+17,4196984000.0,8.203146e+17,4196984000.0,12.0,10.0
max,8.924206e+17,8.862664e+17,8.405479e+17,8.87474e+17,7.874618e+17,1776.0,170.0


In [24]:
# Check null values in tweet_archive
tweet_archive.isnull().sum()

tweet_id                         0
in_reply_to_status_id         2278
in_reply_to_user_id           2278
timestamp                        0
source                           0
text                             0
retweeted_status_id           2175
retweeted_status_user_id      2175
retweeted_status_timestamp    2175
expanded_urls                   59
rating_numerator                 0
rating_denominator               0
name                             0
doggo                            0
floofer                          0
pupper                           0
puppo                            0
dtype: int64

In [25]:
# Columns such as retweeted_status_id and in_reply_to_status_id are almost entirely null
# (2175 and 2278 missing values out of 2356 rows), so they are not useful for this analysis.

In [26]:
pd.set_option('display.max_colwidth', None)

In [27]:
tweet_archive[['text','name', 'rating_numerator', 'rating_denominator']].sample(10)

Unnamed: 0,text,name,rating_numerator,rating_denominator
338,This is Charlie. He fell asleep on a heating vent. Would puppreciate your assistance. 11/10 someone help Charlie https://t.co/Dhdx5HnQ4d,Charlie,11,10
70,This is Beau. That is Beau's balloon. He takes it everywhere. 13/10 would protect at all costs https://t.co/YDtpCjIPKN,Beau,13,10
1574,Another magnificent photo. 12/10 https://t.co/X5w387K5jr,,12,10
527,Here's a pupper in a onesie. Quite pupset about it. Currently plotting revenge. 12/10 would rescue https://t.co/xQfrbNK3HD,,12,10
168,Sorry for the lack of posts today. I came home from school and had to spend quality time with my puppo. Her name is Zoey and she's 13/10 https://t.co/BArWupFAn0,,13,10
91,RT @dog_rates: This is Coco. At first I thought she was a cloud but clouds don't bork with such passion. 12/10 would hug softly https://t.c…,Coco,12,10
1683,This is Bo. He's a Benedoop Cumbersnatch. Seems frustrated with own feet. Portable as hell. 11/10 very solid pupper https://t.co/TONMhRoQh7,Bo,11,10
902,Why does this never happen at my front door... 165/150 https://t.co/HmwrdfEfUE,,165,150
127,"Unbelievable. We only rate dogs. Please don't send in non-canines like the ""I"" from Pixar's opening credits. Thank you... 12/10 https://t.co/JMhDNv5wXZ",,12,10
1020,This is Maddie. She gets some wicked air time. Hardcore barkour. 11/10 nimble af https://t.co/bROYbceZ1u,Maddie,11,10


In [28]:
tweet_archive['name'].value_counts()

None       745
a           55
Charlie     12
Oliver      11
Cooper      11
          ... 
Sobe         1
Erik         1
space        1
Sparky       1
Fabio        1
Name: name, Length: 957, dtype: int64

### Quality issues
1. Some columns, such as in_reply_to_status_id and in_reply_to_user_id, are almost entirely null and have no use in this analysis

2. Some of the dog names are incorrect (for example, "a"), and many are recorded as the string None

3. Some ratings are incorrect: denominators range from 0 to 170, and numerators go as high as 1776

4. Missing dog stages are recorded as the string None instead of NaN, and some stage values are wrong

5.

6.

7.

8.

### Tidiness issues
1. The dog stage should be a single column instead of being split across four columns (doggo, floofer, pupper, puppo)

2.

## Cleaning Data
In this section, clean **all** of the issues you documented while assessing. 

**Note:** Make a copy of the original data before cleaning. Cleaning includes merging individual pieces of data according to the rules of [tidy data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html). The result should be a high-quality and tidy master pandas DataFrame (or DataFrames, if appropriate).

In [29]:
# Make copies of original pieces of data
tweet_archive_copy = tweet_archive.copy()
image_pred_copy = image_pred.copy()
tweet_df_copy = tweet_df.copy()

In [30]:
dogitionary = ['doggo', 'floofer', 'pupper', 'puppo']

### Issue #1:
- Columns that are almost entirely NaN, plus the source column, are not needed for this analysis

#### Define:
- Drop in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp, and source with the drop function

#### Code

In [31]:
useless_columns = ['in_reply_to_status_id', 'in_reply_to_user_id','retweeted_status_id','retweeted_status_user_id','retweeted_status_timestamp', 'source']

In [32]:
tweet_archive_copy.drop(useless_columns, axis=1, inplace=True)

#### Test

In [33]:
tweet_archive_copy.sample(4)

Unnamed: 0,tweet_id,timestamp,text,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1921,674262580978937856,2015-12-08 16:21:41 +0000,This is Gus. He's super stoked about being an elephant. Couldn't be happier. 9/10 for elephant pupper https://t.co/gJS1qU0jP7,https://twitter.com/dog_rates/status/674262580978937856/photo/1,9,10,Gus,,,pupper,
122,869227993411051520,2017-05-29 16:24:37 +0000,This is Gizmo. His favorite thing is standing pupright like a hooman. Sneaky tongue slip status achieved. 13/10 would boop well https://t.co/IoR3n1fiiQ,https://twitter.com/dog_rates/status/869227993411051520/photo/1,13,10,Gizmo,,,,
2144,669942763794931712,2015-11-26 18:16:16 +0000,This is Andru. He made his very own lacrosse stick. Much dedication. Big dreams. Tongue slip. 11/10 go get em Andru https://t.co/1VJoY3OJ1F,https://twitter.com/dog_rates/status/669942763794931712/photo/1,11,10,Andru,,,,
279,839990271299457024,2017-03-10 00:04:21 +0000,Meet Sojourner. His nose is a Fibonacci Spiral. Legendary af. 13/10 we must protect him at all costs https://t.co/r7W1NbkOtr,"https://twitter.com/dog_rates/status/839990271299457024/photo/1,https://twitter.com/dog_rates/status/839990271299457024/photo/1",13,10,Sojourner,,,,


### Issue #2:
- Incorrect names
- None values for some of the names

#### Define:
- Find the incorrect names by inspecting the output of value_counts()
- Remove the rows whose name is incorrect or None

#### Code

In [34]:
# # First, remove all tweets that don't contain any of the dog words
# for word in dogitionary:
#     tweet_archive_copy = tweet_archive_copy[tweet_archive_copy['text'].str.contains(word)]


In [35]:
# Create a CSV file containing the dog names and their counts for visual inspection
counts = tweet_archive_copy['name'].value_counts()
counts.to_csv('name.csv', index=True)

In [36]:
# Collect all the invalid names so their rows can be removed from the dataframe
# Invalid names appear to start with a lowercase letter.

# Create a list of invalid names
invalid_names = ['None']
for name in tweet_archive_copy.name:
    if name[0].islower():
        invalid_names.append(name)

In [37]:
# Get unique invalid names
invalid_names = list(set(invalid_names))

In [38]:
tweet_archive_copy.shape

(2356, 11)

In [39]:
# Remove invalid names from the dataframe
tweet_archive_copy = tweet_archive_copy[~tweet_archive_copy['name'].isin(invalid_names)]
tweet_archive_copy.name.value_counts()

Charlie      12
Oliver       11
Lucy         11
Cooper       11
Lola         10
             ..
Grizzwald     1
Blipson       1
Goliath       1
Marq          1
Fabio         1
Name: name, Length: 931, dtype: int64

In [40]:
# View the dataframe
tweet_archive_copy.shape

(1502, 11)
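
Filtering on the name removes every tweet without a valid name, which is why the shape drops from 2356 to 1502 rows. If those tweets should stay in the dataset, a possible alternative is to keep the rows and mark the name as missing. The sketch below uses a made-up four-row frame and assumes the same rule as above: a name is invalid if it is the string `'None'` or starts with a lowercase letter.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; only the 'name' column matters here
toy = pd.DataFrame({
    'tweet_id': [1, 2, 3, 4],
    'name': ['Charlie', 'a', 'None', 'Bo'],
})

# Assumed rule: 'None' or a lowercase first letter means the name is invalid
is_invalid = toy['name'].eq('None') | toy['name'].str.match(r'^[a-z]')

# Replace invalid names with NaN instead of dropping the rows
cleaned = toy.copy()
cleaned['name'] = cleaned['name'].mask(is_invalid, np.nan)
```

All four rows survive, and only the two invalid names become NaN.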

#### Test

In [41]:
# verify that the dataframe is now clean of invalid names
tweet_archive_copy[['text','name']].sample(10)

Unnamed: 0,text,name
1756,This is Mia. She makes awful decisions. 8/10 https://t.co/G6TQVgTcZz,Mia
1214,This is Michelangelope. He's half coffee cup. Rare af. 12/10 would hug until someone stopped me https://t.co/tvVDY0G911,Michelangelope
1835,Meet Striker. He's ready for Christmas. 11/10 https://t.co/B3xxSLjQSH,Striker
2151,Meet Louie. He just pounded that bottle of wine. 9/10 goodnight Louie https://t.co/RAwZvMKRZB,Louie
1247,This is Gary. He just wanted to say hi. 9/10 very personable pup https://t.co/Sk3CbhmKSW,Gary
2262,Here is George. George took a selfie of his new man bun and that is downright epic. (Also looks like Rand Paul) 9/10 https://t.co/afRtVsoIIb,George
2283,This is Joshwa. He is a fuckboy supreme. He clearly relies on owner but doesn't respect them. Dreamy eyes tho 11/10 https://t.co/60xYFRATPZ,Joshwa
907,This is Leonard. He hides in bushes to escape his problems. 10/10 relatable af https://t.co/TdyGTcX0uo,Leonard
1384,This is Daniel. He's a neat pup. Exotic af. Custom paws. Leaps unannounced. Would totally pet. 7/10 daaamn Daniel https://t.co/5XaR0kj8cr,Daniel
220,Say hello to Boomer. He's a sandy pupper. Having a h*ckin blast. 12/10 would pet passionately https://t.co/ecb3LvExde,Boomer


### Issue #3:
Incorrect Ratings for some of the dogs

#### Define
- The denominator is expected to be 10, but the describe output above shows values both below and above 10 (minimum 0, maximum 170)
- Find every denominator that is not 10 and replace it with 10
- Inspect unusually high numerators and remove clear outliers

#### Code

In [42]:
# Reset pandas display options
pd.reset_option('display.max_colwidth')

In [43]:
tweet_archive_copy[['text','name', 'rating_numerator', 'rating_denominator']].sample(10)

Unnamed: 0,text,name,rating_numerator,rating_denominator
908,RT @dog_rates: This is Chompsky. He lives up t...,Chompsky,11,10
129,This is Shikha. She just watched you drop a sk...,Shikha,12,10
489,This is Chubbs. He dug a hole and now he's stu...,Chubbs,11,10
892,This is Oakley. He has no idea what happened h...,Oakley,11,10
1934,Say hello to Aiden. His eyes are magical. Love...,Aiden,11,10
795,This is Bungalo. She uses that face to get wha...,Bungalo,12,10
763,This is Sophie. She's a Jubilant Bush Pupper. ...,Sophie,27,10
2134,This is Randall. He's from Chernobyl. Built pl...,Randall,5,10
726,This is Timmy. He's quite large. According to ...,Timmy,11,10
1061,This is Theodore. He just saw an adult wearing...,Theodore,12,10


In [44]:
tweet_archive_copy.query('rating_denominator < 10')

Unnamed: 0,tweet_id,timestamp,text,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
516,810984652412424192,2016-12-19 23:06:23 +0000,Meet Sam. She smiles 24/7 &amp; secretly aspir...,"https://www.gofundme.com/sams-smile,https://tw...",24,7,Sam,,,,


In [45]:
# Find and replace rating denominator less than 10 with 10
tweet_archive_copy.loc[tweet_archive_copy['rating_denominator'] < 10, 'rating_denominator'] = 10

In [46]:
tweet_archive_copy.query('rating_denominator > 10')

Unnamed: 0,tweet_id,timestamp,text,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1202,716439118184652801,2016-04-03 01:36:11 +0000,This is Bluebert. He just saw that both #Final...,https://twitter.com/dog_rates/status/716439118...,50,50,Bluebert,,,,
1662,682962037429899265,2016-01-01 16:30:13 +0000,This is Darrel. He just robbed a 7/11 and is i...,https://twitter.com/dog_rates/status/682962037...,7,11,Darrel,,,,


In [47]:
# Find and replace rating denominator greater than 10 with 10
tweet_archive_copy.loc[tweet_archive_copy['rating_denominator'] > 10, 'rating_denominator'] = 10

In [48]:
# View the distribution of rating numerator
tweet_archive_copy.rating_numerator.describe()

count    1502.000000
mean       12.101864
std        45.656711
min         2.000000
25%        10.000000
50%        11.000000
75%        12.000000
max      1776.000000
Name: rating_numerator, dtype: float64

In [49]:
# Find all rating numerators greater than the 75th percentile
greater_than_75 = tweet_archive_copy['rating_numerator'][tweet_archive_copy['rating_numerator'] > tweet_archive_copy['rating_numerator'].quantile(0.75)]
print(greater_than_75.value_counts())

13      204
14       21
75        2
1776      1
50        1
27        1
24        1
Name: rating_numerator, dtype: int64


In [50]:
# Find all tweets with rating numerator greater than 75th percentile
tweet_archive_copy.query('rating_numerator > rating_numerator.quantile(0.75)')

Unnamed: 0,tweet_id,timestamp,text,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,2017-08-01 16:23:56 +0000,This is Phineas. He's a mystical boy. Only eve...,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,2017-08-01 00:17:27 +0000,This is Tilly. She's just checking pup on you....,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
3,891689557279858688,2017-07-30 15:58:51 +0000,This is Darla. She commenced a snooze mid meal...,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
6,890971913173991426,2017-07-28 16:27:12 +0000,Meet Jax. He enjoys ice cream so much he gets ...,"https://gofundme.com/ydvmve-surgery-for-jax,ht...",13,10,Jax,,,,
8,890609185150312448,2017-07-27 16:25:51 +0000,This is Zoey. She doesn't want to be one of th...,https://twitter.com/dog_rates/status/890609185...,13,10,Zoey,,,,
...,...,...,...,...,...,...,...,...,...,...,...
1519,690735892932222976,2016-01-23 03:20:44 +0000,Say hello to Peaches. She's a Dingleberry Zand...,https://twitter.com/dog_rates/status/690735892...,13,10,Peaches,,,,
1562,688211956440801280,2016-01-16 04:11:31 +0000,This is Derby. He's a superstar. 13/10 (vid by...,https://twitter.com/dog_rates/status/688211956...,13,10,Derby,,,,
1906,674468880899788800,2015-12-09 06:01:26 +0000,This is Louis. He thinks he's flying. 13/10 th...,https://twitter.com/dog_rates/status/674468880...,13,10,Louis,,,,
1952,673680198160809984,2015-12-07 01:47:30 +0000,This is Shnuggles. I would kill for Shnuggles....,https://twitter.com/dog_rates/status/673680198...,13,10,Shnuggles,,,,


In [51]:
# Drop the row with the outlier value of 1776
tweet_archive_copy = tweet_archive_copy[tweet_archive_copy['rating_numerator'] != 1776]
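
Overwriting the denominator leaves the numerator untouched, so a fraction in the text that was never a rating (such as "24/7") still ends up as 24/10. One way to cross-check the stored values is to re-extract the ratings from the tweet text. This is only a sketch of that idea, not the method used in this notebook: the strings are invented, and it assumes the real rating is the last fraction in the text whose denominator is 10.

```python
import pandas as pd

# Invented example texts, not rows from the archive
texts = pd.Series([
    'Good dog. 13/10 would pet',
    'Smiles 24/7 and still a solid 11/10',
    'Very good pupper 9.5/10',
])

# Capture an integer or decimal numerator followed by an integer denominator
pattern = r'(\d+(?:\.\d+)?)/(\d+)'
all_matches = texts.str.extractall(pattern)
all_matches.columns = ['numerator', 'denominator']
all_matches = all_matches.astype({'numerator': float, 'denominator': int})

# Assumption: the rating is the last match with a denominator of 10
out_of_ten = all_matches[all_matches['denominator'] == 10]
last_rating = out_of_ten.groupby(level=0).last()
```

Comparing `last_rating` with the stored `rating_numerator` would flag rows where the two disagree and need a manual look.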

#### Test

In [52]:
# Print the highest and lowest rating denominator and the highest rating numerator
print(tweet_archive_copy['rating_denominator'].max())
print(tweet_archive_copy['rating_denominator'].min())
print(tweet_archive_copy['rating_numerator'].max())


10
10
75


### Issue #4:
- The four separate dog stage columns (doggo, floofer, pupper, puppo) are unnecessary; the stage should be stored in a single column

#### Define
- Collapse the four dog stage columns into a single column
- Use the pandas melt function

#### Code

In [53]:
# Use pandas melt function to convert the dataframe to long format
tweet_archive_copy = pd.melt(tweet_archive_copy, id_vars=['tweet_id','timestamp','text','expanded_urls','rating_numerator','rating_denominator','name'], value_vars=['doggo','floofer', 'pupper','puppo'], var_name='dog_type')

In [54]:
tweet_archive_copy.drop('value', axis=1, inplace=True)

#### Test

In [55]:
tweet_archive_copy.sample(4)

Unnamed: 0,tweet_id,timestamp,text,expanded_urls,rating_numerator,rating_denominator,name,dog_type
368,810984652412424192,2016-12-19 23:06:23 +0000,Meet Sam. She smiles 24/7 &amp; secretly aspir...,"https://www.gofundme.com/sams-smile,https://tw...",24,10,Sam,doggo
5691,678389028614488064,2015-12-20 01:38:42 +0000,This is Bella. She just learned that her final...,https://twitter.com/dog_rates/status/678389028...,11,10,Bella,puppo
179,842115215311396866,2017-03-15 20:48:07 +0000,This is Bentley. It's his first time going to ...,https://twitter.com/dog_rates/status/842115215...,12,10,Bentley,doggo
4676,843235543001513987,2017-03-18 22:59:54 +0000,This is Tycho. She just had new wheels install...,https://twitter.com/dog_rates/status/843235543...,13,10,Tycho,puppo
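
Note that `pd.melt` produces one row per tweet for each of the four stage columns, so once `value` is dropped every tweet appears four times, labelled with each stage in turn regardless of what the original cell held. The sample above reflects this: Sam is labelled doggo although none of the stage columns was set for that tweet. Below is a sketch of one possible alternative that keeps one row per tweet. It runs on a made-up frame, assumes missing stages are stored as the string `'None'`, and uses a hypothetical helper name and a hypothetical `dog_stage` column.

```python
import pandas as pd

stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']

# Made-up rows: one stage, no stage, and two stages
toy = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'doggo': ['doggo', 'None', 'doggo'],
    'floofer': ['None', 'None', 'None'],
    'pupper': ['None', 'None', 'pupper'],
    'puppo': ['None', 'None', 'None'],
})

def collapse_stages(df, stage_cols):
    """Hypothetical helper: join the stages present in each row into one column."""
    # Treat the string 'None' as "no stage"
    stages = df[stage_cols].replace('None', '')
    combined = stages.apply(lambda row: ','.join(v for v in row if v), axis=1)
    out = df.drop(columns=stage_cols).copy()
    # Rows without any stage become NaN
    out['dog_stage'] = combined.mask(combined == '')
    return out

tidy = collapse_stages(toy, stage_cols)
```

Tweets that mention two stages end up with a combined label such as `doggo,pupper`, which can then be reviewed by hand.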


## Storing Data
Save the gathered, assessed, and cleaned master dataset to a CSV file named "twitter_archive_master.csv".
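
A minimal sketch of this step, under the assumption that the three cleaned tables are combined with inner joins on `tweet_id`. The frames below are made-up stand-ins and `merge_on_tweet_id` is a hypothetical helper. The key is cast to `int64` first because `tweet_id` is stored as `object` in one table and `int64` in the others.

```python
from functools import reduce

import pandas as pd

# Made-up stand-ins for the three cleaned tables
archive = pd.DataFrame({'tweet_id': [1, 2, 3], 'rating_numerator': [12, 13, 10]})
predictions = pd.DataFrame({'tweet_id': [1, 3], 'p1': ['pug', 'chow']})
counts = pd.DataFrame({
    'tweet_id': pd.Series([1, 2, 3], dtype='object'),
    'favorite_count': [25, 0, 7],
})

def merge_on_tweet_id(frames):
    """Hypothetical helper: inner-join all frames on tweet_id."""
    # Give the key the same dtype everywhere before merging
    normalized = [f.assign(tweet_id=f['tweet_id'].astype('int64')) for f in frames]
    return reduce(
        lambda left, right: left.merge(right, on='tweet_id', how='inner'),
        normalized,
    )

master = merge_on_tweet_id([archive, predictions, counts])
# index=False keeps the row index out of the saved file
master.to_csv('twitter_archive_master.csv', index=False)
```

An inner join keeps only tweets present in all three tables, which matches the requirement that rated tweets have images; a left join would be the choice if tweets without predictions should be kept.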

## Analyzing and Visualizing Data
In this section, analyze and visualize your wrangled data. You must produce at least **three (3) insights and one (1) visualization.**

### Insights:
1.

2.

3.

### Visualization