# Project: Wrangling and Analyze Data

## Data Gathering
In the cell below, gather **all** three pieces of data for this project and load them in the notebook. **Note:** the method required to gather each piece of data is different.
1. Directly download the WeRateDogs Twitter archive data (twitter-archive-enhanced.csv)

In [1683]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
import os
import requests
import tweepy
from io import StringIO
import json
from tqdm import tqdm


In [1684]:
tweet_archive = pd.read_csv('twitter-archive-enhanced.csv') # read in the data

2. Use the Requests library to download the tweet image predictions (image_predictions.tsv)

In [1685]:
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)
data = response.text
image_pred = pd.read_csv(StringIO(data), sep='\t')
# Save as genuinely tab-separated text and leave out the index column
image_pred.to_csv('image_predictions.tsv', sep='\t', index=False)
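
The download step can be made a little more defensive. The following is a minimal sketch, not the notebook's own code: `fetch_text` and `parse_tsv` are hypothetical helpers, and a tiny made-up sample stands in for the real file so that nothing is downloaded when the block runs. It shows checking the HTTP status before parsing and confirming that a tab-separated file written without the index reloads unchanged.

```python
from io import StringIO

import pandas as pd
import requests


def fetch_text(url, timeout=30):
    """Download a text resource and fail loudly on HTTP errors (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.text


def parse_tsv(text):
    """Parse tab-separated text into a DataFrame (hypothetical helper)."""
    return pd.read_csv(StringIO(text), sep='\t')


# Tiny made-up sample standing in for the downloaded file
sample_text = "tweet_id\tp1\tp1_conf\n1\tgolden_retriever\t0.5\n2\tmaze\t0.25\n"
sample_df = parse_tsv(sample_text)

# Round-trip through tab-separated text without the index column
buffer = StringIO()
sample_df.to_csv(buffer, sep='\t', index=False)
round_trip = parse_tsv(buffer.getvalue())
```

With the real data, `parse_tsv(fetch_text(url))` would replace the sample string.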

3. Use the Tweepy library to query additional data via the Twitter API (tweet_json.txt)

In [1686]:
from dotenv import load_dotenv
load_dotenv()

bearer_token = os.environ.get('BEARER_TOKEN')

tweet_id = list(tweet_archive['tweet_id'])
missing_tweets = []

In [1687]:
# if not os.path.exists('tweet_json.txt'):
#     with open('tweet_json.txt', 'w'): pass
# def get_tweet():
#     auth = tweepy.OAuth2BearerHandler(bearer_token)
#     api = tweepy.API(auth)
#     for id in tqdm(tweet_id):
#         try:
#             tweet = api.get_status(id, tweet_mode='extended')
#             with open('tweet_json.txt', 'a') as f:
#                 json.dump(tweet._json, f)
#                 f.write('\n')
#         except:
#             print('Missing Tweet for id: {}'.format(id))
#             missing_tweets.append(id)
#             continue

# # Driver code
# if __name__ == '__main__':
# #   Call the function
#     get_tweet()


In [1688]:
# with open('tweet_json.txt', 'r') as f:
with open('json.txt', 'r') as f:
    gathered_tweet_df = pd.DataFrame(columns=('tweet_id', 'retweet_count', 'favorite_count'))
    tweets = f.readlines()
    for tweet in tweets:
        tweet = json.loads(tweet)
        gathered_tweet_df.loc[len(gathered_tweet_df.index)] = [tweet['id'], tweet['retweet_count'], tweet['favorite_count']]
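
Appending one row at a time with `.loc` works, but it leaves every column with `object` dtype (visible in the `info()` output further down) and slows down as the frame grows. A possible alternative, sketched here with two made-up records in place of the real file, is to collect plain dictionaries first and build the DataFrame once so that pandas can infer integer dtypes.

```python
import json
from io import StringIO

import pandas as pd

# Two made-up records standing in for lines of the tweet JSON file
sample_file = StringIO(
    '{"id": 1, "retweet_count": 10, "favorite_count": 25, "lang": "en"}\n'
    '{"id": 2, "retweet_count": 3, "favorite_count": 0, "lang": "en"}\n'
)

# Collect one small dict per line, then build the frame in a single call
records = []
for line in sample_file:
    tweet = json.loads(line)
    records.append({
        'tweet_id': tweet['id'],
        'retweet_count': tweet['retweet_count'],
        'favorite_count': tweet['favorite_count'],
    })

sample_tweets = pd.DataFrame(
    records, columns=['tweet_id', 'retweet_count', 'favorite_count']
)
```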

## Assessing Data
In this section, detect and document at least **eight (8) quality issues and two (2) tidiness issues**. You must use **both** visual assessment and programmatic assessment to assess the data.

**Note:** pay attention to the following key points when you assess the data.

* You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
* Assessing and cleaning the entire dataset completely would require a lot of time, and is not necessary to practice and demonstrate your skills in data wrangling. Therefore, the requirements of this project are only to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset.
* The fact that the rating numerators are greater than the denominators does not need to be cleaned. This [unique rating system](http://knowyourmeme.com/memes/theyre-good-dogs-brent) is a big part of the popularity of WeRateDogs.
* You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.



In [1689]:
gathered_tweet_df.shape

(2354, 3)

In [1690]:
gathered_tweet_df.sample(4)

Unnamed: 0,tweet_id,retweet_count,favorite_count
620,796116448414461957,2813,10139
561,802600418706604034,1714,7938
2228,668237644992782336,3100,6614
1828,676219687039057920,31989,67100


In [1691]:
gathered_tweet_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2354 entries, 0 to 2353
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   tweet_id        2354 non-null   object
 1   retweet_count   2354 non-null   object
 2   favorite_count  2354 non-null   object
dtypes: object(3)
memory usage: 73.6+ KB


In [1692]:
gathered_tweet_df.describe()

Unnamed: 0,tweet_id,retweet_count,favorite_count
count,2354,2354,2354
unique,2354,1724,2007
top,667495797102141441,3652,0
freq,1,5,179


In [1693]:
# Check null values in gathered_tweet_df
gathered_tweet_df.isnull().sum()

tweet_id          0
retweet_count     0
favorite_count    0
dtype: int64

In [1694]:
# Check duplicates in gathered_tweet_df
gathered_tweet_df.duplicated().sum()

0

In [1695]:
image_pred.sample(4)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
576,678675843183484930,https://pbs.twimg.com/media/CWskEqnWUAAQZW_.jpg,1,maze,0.33985,False,streetcar,0.099688,False,sundial,0.084808,False
1403,769212283578875904,https://pbs.twimg.com/media/CqzKfQgXEAAWIY-.jpg,1,golden_retriever,0.166538,True,Pekinese,0.148215,True,cocker_spaniel,0.082735,True
1959,865718153858494464,https://pbs.twimg.com/media/DAOmEZiXYAAcv2S.jpg,1,golden_retriever,0.673664,True,kuvasz,0.157523,True,Labrador_retriever,0.126073,True
910,700747788515020802,https://pbs.twimg.com/media/CbmOY41UAAQylmA.jpg,1,Great_Pyrenees,0.481333,True,Samoyed,0.311769,True,Maltese_dog,0.074962,True


In [1696]:
image_pred.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   int64  
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [1697]:
image_pred.describe()

Unnamed: 0,tweet_id,img_num,p1_conf,p2_conf,p3_conf
count,2075.0,2075.0,2075.0,2075.0,2075.0
mean,7.384514e+17,1.203855,0.594548,0.1345886,0.06032417
std,6.785203e+16,0.561875,0.271174,0.1006657,0.05090593
min,6.660209e+17,1.0,0.044333,1.0113e-08,1.74017e-10
25%,6.764835e+17,1.0,0.364412,0.05388625,0.0162224
50%,7.119988e+17,1.0,0.58823,0.118181,0.0494438
75%,7.932034e+17,1.0,0.843855,0.1955655,0.09180755
max,8.924206e+17,4.0,1.0,0.488014,0.273419


In [1698]:
# Check duplicates in image_pred
image_pred.duplicated().sum()

0

In [1699]:
# Check null values in tweet_archive
tweet_archive.isnull().sum()

tweet_id                         0
in_reply_to_status_id         2278
in_reply_to_user_id           2278
timestamp                        0
source                           0
text                             0
retweeted_status_id           2175
retweeted_status_user_id      2175
retweeted_status_timestamp    2175
expanded_urls                   59
rating_numerator                 0
rating_denominator               0
name                             0
doggo                            0
floofer                          0
pupper                           0
puppo                            0
dtype: int64

In [1700]:
tweet_archive.sample(4)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
2124,670374371102445568,,,2015-11-27 22:51:19 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Meet Daisy. She's rebellious. Full of teen angst. Thought her food should be evenly dispersed around the room. 12/10 https://t.co/8yzgYzP94K,,,,https://twitter.com/dog_rates/status/670374371102445568/photo/1,12,10,Daisy,,,,
2248,667866724293877760,,,2015-11-21 00:46:50 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Shaggy. He knows exactly how to solve the puzzle but can't talk. All he wants to do is help. 10/10 great guy https://t.co/SBmWbfAg6X,,,,https://twitter.com/dog_rates/status/667866724293877760/photo/1,10,10,Shaggy,,,,
790,773922284943896577,,,2016-09-08 16:33:46 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","This is Heinrich (pronounced ""Pat""). He's a Botswanian Vanderfloof. Snazzy af bandana. 12/10 downright puptacular https://t.co/G56ikYAqFg",,,,https://twitter.com/dog_rates/status/773922284943896577/photo/1,12,10,Heinrich,,,,
1389,700167517596164096,,,2016-02-18 03:58:39 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Dotsy. She's stuck as hell. 10/10 https://t.co/A0h4lnhU4s,,,,https://twitter.com/dog_rates/status/700167517596164096/photo/1,10,10,Dotsy,,,,


In [1701]:
tweet_archive.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

In [1702]:
tweet_archive.describe()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,retweeted_status_id,retweeted_status_user_id,rating_numerator,rating_denominator
count,2356.0,78.0,78.0,181.0,181.0,2356.0,2356.0
mean,7.427716e+17,7.455079e+17,2.014171e+16,7.7204e+17,1.241698e+16,13.126486,10.455433
std,6.856705e+16,7.582492e+16,1.252797e+17,6.236928e+16,9.599254e+16,45.876648,6.745237
min,6.660209e+17,6.658147e+17,11856340.0,6.661041e+17,783214.0,0.0,0.0
25%,6.783989e+17,6.757419e+17,308637400.0,7.186315e+17,4196984000.0,10.0,10.0
50%,7.196279e+17,7.038708e+17,4196984000.0,7.804657e+17,4196984000.0,11.0,10.0
75%,7.993373e+17,8.257804e+17,4196984000.0,8.203146e+17,4196984000.0,12.0,10.0
max,8.924206e+17,8.862664e+17,8.405479e+17,8.87474e+17,7.874618e+17,1776.0,170.0


In [1704]:
# Columns such as retweeted_status_id are mostly null.
# These and a few other mostly-null columns are not useful for our analysis.
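
The null counts above are easier to compare as fractions of the table. As a rough sketch on a made-up four-row frame (the 0.5 threshold is an arbitrary assumption, not something the project prescribes), the share of missing values per column can be computed like this:

```python
import numpy as np
import pandas as pd

# Made-up frame: one complete column and one mostly-null column
toy = pd.DataFrame({
    'tweet_id': [1, 2, 3, 4],
    'retweeted_status_id': [np.nan, np.nan, np.nan, 9.0],
})

# Fraction of missing values per column, largest first
null_share = toy.isnull().mean().sort_values(ascending=False)

# Columns missing in more than half of the rows (the threshold is an arbitrary choice)
mostly_null = null_share[null_share > 0.5].index.tolist()
```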

In [1705]:
# Check duplicates in tweet_archive
tweet_archive.duplicated().sum()

0

In [1706]:
tweet_archive.query('doggo == "doggo"')

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
9,890240255349198849,,,2017-07-26 15:59:51 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Cassie. She is a college pup. Studying international doggo communication and stick theory. 14/10 so elegant much sophisticate https://t.co/t1bfwz5S2A,,,,https://twitter.com/dog_rates/status/890240255349198849/photo/1,14,10,Cassie,doggo,,,
43,884162670584377345,,,2017-07-09 21:29:42 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Meet Yogi. He doesn't have any important dog meetings today he just enjoys looking his best at all times. 12/10 for dangerously dapper doggo https://t.co/YSI00BzTBZ,,,,https://twitter.com/dog_rates/status/884162670584377345/photo/1,12,10,Yogi,doggo,,,
99,872967104147763200,,,2017-06-09 00:02:31 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Here's a very large dog. He has a date later. Politely asked this water person to check if his breath is bad. 12/10 good to go doggo https://t.co/EMYIdoblMR,,,,"https://twitter.com/dog_rates/status/872967104147763200/photo/1,https://twitter.com/dog_rates/status/872967104147763200/photo/1",12,10,,doggo,,,
108,871515927908634625,,,2017-06-04 23:56:03 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Napolean. He's a Raggedy East Nicaraguan Zoom Zoom. Runs on one leg. Built for deception. No eyes. Good with kids. 12/10 great doggo https://t.co/PR7B7w1rUw,,,,"https://twitter.com/dog_rates/status/871515927908634625/photo/1,https://twitter.com/dog_rates/status/871515927908634625/photo/1",12,10,Napolean,doggo,,,
110,871102520638267392,,,2017-06-03 20:33:19 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Never doubt a doggo 14/10 https://t.co/AbBLh2FZCH,,,,https://twitter.com/animalcog/status/871075758080503809,14,10,,doggo,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1117,732375214819057664,,,2016-05-17 01:00:32 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Kyle (pronounced 'Mitch'). He strives to be the best doggo he can be. 11/10 would pat on head approvingly https://t.co/aA2GiTGvlE,,,,https://twitter.com/dog_rates/status/732375214819057664/photo/1,11,10,Kyle,doggo,,,
1141,727644517743104000,,,2016-05-03 23:42:26 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Here's a doggo struggling to cope with the winds. 13/10 https://t.co/qv3aUwaouT,,,,"https://twitter.com/dog_rates/status/727644517743104000/photo/1,https://twitter.com/dog_rates/status/727644517743104000/photo/1",13,10,,doggo,,,
1156,724771698126512129,,,2016-04-26 01:26:53 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Nothin better than a doggo and a sunset. 11/10 https://t.co/JlFqOhrHEs,,,,"https://twitter.com/dog_rates/status/724771698126512129/photo/1,https://twitter.com/dog_rates/status/724771698126512129/photo/1,https://twitter.com/dog_rates/status/724771698126512129/photo/1,https://twitter.com/dog_rates/status/724771698126512129/photo/1",11,10,,doggo,,,
1176,719991154352222208,,,2016-04-12 20:50:42 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This doggo was initially thrilled when she saw the happy cartoon pup but quickly realized she'd been deceived. 10/10 https://t.co/mvnBGaWULV,,,,"https://twitter.com/dog_rates/status/719991154352222208/photo/1,https://twitter.com/dog_rates/status/719991154352222208/photo/1",10,10,,doggo,,,


In [1707]:
pd.set_option('display.max_colwidth', None)

In [1708]:
# Find the tweet with the highest rating numerator
tweet_archive.query('rating_numerator == rating_numerator.max()')['text']

979    This is Atticus. He's quite simply America af. 1776/10 https://t.co/GRXwMxLBkh
Name: text, dtype: object

In [1709]:
tweet_archive[['text','name', 'rating_numerator', 'rating_denominator']].sample(15)


Unnamed: 0,text,name,rating_numerator,rating_denominator
527,Here's a pupper in a onesie. Quite pupset about it. Currently plotting revenge. 12/10 would rescue https://t.co/xQfrbNK3HD,,12,10
1120,Say hello to this unbelievably well behaved squad of doggos. 204/170 would try to pet all at once https://t.co/yGQI3He3xv,this,204,170
205,Meet Benny. He likes being adorable and making fun of you while you're on the trampoline. 12/10 let's help him out\n\nhttps://t.co/aVMjBqAy1x https://t.co/7gx2LksT3U,Benny,12,10
1457,This is just a beautiful pupper good shit evolution. 12/10 https://t.co/2L8pI0Z2Ib,just,12,10
1262,This is Tater. His underbite is fierce af. Doesn't give a damn about your engagement photo. 8/10 https://t.co/nLuPY3pY12,Tater,8,10
1432,Meet Blipson. He's a Doowap Hufflepuff. That Ugg is his temporary home while he's struggling with unemployment 11/10 https://t.co/YKvt0J5MXr,Blipson,11,10
1516,This golden is happy to refute the soft mouth egg test. Not a fan of sweeping generalizations. 11/10 #notallpuppers https://t.co/DgXYBDMM3E,,11,10
1687,This is Apollo. He thought you weren't coming back so he had a mental breakdown. 8/10 we've all been there https://t.co/ojUBrDCHLT,Apollo,8,10
892,This is Oakley. He has no idea what happened here. Even offered to help clean it up. 11/10 such a heckin good boy https://t.co/vT3JM8b989,Oakley,11,10
1246,Oh. My. God. 13/10 magical af https://t.co/Ezu6jQrKAZ,,13,10


In [1710]:
tweet_archive[['text', 'retweeted_status_id', 'retweeted_status_user_id', 'retweeted_status_timestamp']].sample(15)

Unnamed: 0,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp
1032,This is Sugar. She excels underwater. 12/10 photogenic af https://t.co/AWMeXJJz64,,,
368,This is Fiona. She's an exotic dog. Seems rather impatient. Jaw extension on another level tho. Looks slippery. 10/10 would still pet https://t.co/vst2SEVJO3,,,
2105,Honor to rate this dog. Great teeth. Nice horns. Unbelievable posture. Fun to pet. Big enough to ride. 10/10 rad dog https://t.co/7JMAHdJ6A4,,,
367,This is Alfie. He's your Lyft for tonight. Kindly requests you buckle pup and remain reasonably calm during the ride. 13/10 he must focus https://t.co/AqPTHYUBFz,,,
1669,"I know we joke around on here, but this is getting really frustrating. We rate dogs. Not T-Rex. Thank you... 8/10 https://t.co/5aFw7SWyxU",,,
1637,This is Tino. He really likes corndogs. 9/10 https://t.co/cUxGtnBfc2,,,
690,This is Moose. He's rather h*ckin dangerous (you can tell by the collar). 11/10 would still attempt to snug https://t.co/lHVHGdDzb3,,,
103,We. Only. Rate. Dogs. Do not send in other things like this fluffy floor shark clearly ready to attack. Get it together guys... 12/10 https://t.co/BZHiKx3FpQ,,,
931,"When you hear your owner say they need to hatch another egg, but you've already been on 17 walks today. 10/10 https://t.co/lFEoGqZ4oA",,,
1570,This is Ember. That's the q-tip she owes money to. 11/10 pay up pup. (vid by @leanda_h) https://t.co/kGRcRjRJRl,,,


In [1711]:
tweet_archive['name'].value_counts()

None       745
a           55
Charlie     12
Lucy        11
Cooper      11
          ... 
Chevy        1
Sparky       1
Dot          1
Fabio        1
Lulu         1
Name: name, Length: 957, dtype: int64

### Quality issues
``tweet_archive table``

1. Some of the tweets are retweets and some are not even about dogs and still have ratings

2. Some columns, such as in_reply_to_status_id and in_reply_to_user_id, have no real use case and are mostly null

3. Some of the dog names are incorrect, and some have the value None

4. Incorrect ratings for some of the dogs

5. Incorrect data type for some of the columns like timestamp

``image_pred table``

6. Tweets with a False p1_dog value tend not to show a dog

7. Image number column doesn't seem to convey any actual value for analysis

8. Wrong data type for p1, p2, p3
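
The data type issues listed above (items 5 and 8) could be addressed along the following lines. This is only a sketch on made-up miniature frames. The source does not say which type was intended for `p1`, `p2` and `p3`, so the categorical dtype used here is an assumption.

```python
import pandas as pd

# Miniature frames; the timestamps copy the format shown in the archive sample
toy_archive = pd.DataFrame({
    'timestamp': ['2015-11-27 22:51:19 +0000', '2015-11-21 00:46:50 +0000'],
})
toy_pred = pd.DataFrame({'p1': ['golden_retriever', 'maze']})

# Parse the timestamp strings into timezone-aware datetimes
toy_archive['timestamp'] = pd.to_datetime(toy_archive['timestamp'], utc=True)

# Prediction labels repeat often, so a categorical dtype is one reasonable choice
toy_pred['p1'] = toy_pred['p1'].astype('category')
```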

### Tidiness issues
1. The dog stages should be a single column instead of being split into four (doggo, floofer, pupper, puppo)

2. Too many datasets. They can be merged for manageability
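
The first tidiness issue can be illustrated with a short sketch. It assumes that a missing stage is stored as the string `'None'` (an assumption about how the archive encodes it) and uses made-up rows. Tweets that mention more than one stage keep all of them, joined by commas.

```python
import pandas as pd

stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']

# Made-up rows; 'None' as the placeholder for "no stage" is an assumption
toy = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'doggo':   ['doggo', 'None', 'doggo'],
    'floofer': ['None', 'None', 'None'],
    'pupper':  ['None', 'None', 'pupper'],
    'puppo':   ['None', 'None', 'None'],
})

# Turn the placeholder into NaN, then join whatever stages remain in each row
stages = toy[stage_cols].mask(toy[stage_cols] == 'None')
toy['stage'] = stages.apply(lambda row: ','.join(row.dropna()), axis=1)

# Rows without any stage end up as empty strings; mark them as missing
toy['stage'] = toy['stage'].mask(toy['stage'] == '')

# The four original columns are now redundant
toy = toy.drop(columns=stage_cols)
```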

## Cleaning Data
In this section, clean **all** of the issues you documented while assessing. 

**Note:** Make a copy of the original data before cleaning. Cleaning includes merging individual pieces of data according to the rules of [tidy data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html). The result should be a high-quality and tidy master pandas DataFrame (or DataFrames, if appropriate).
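
Merging the pieces into a master DataFrame might look roughly like the sketch below, which uses made-up miniature tables. One detail worth noting: the `info()` output above reports `tweet_id` as `object` in `gathered_tweet_df` but `int64` in the other two tables, so the key is cast before joining. The choice of inner joins (keeping only tweets present in every table) is an assumption about what the analysis needs.

```python
import pandas as pd

# Made-up miniature versions of the three tables
archive = pd.DataFrame({'tweet_id': [1, 2, 3], 'rating_numerator': [12, 10, 13]})
predictions = pd.DataFrame({'tweet_id': [1, 3], 'p1': ['pug', 'maze']})
counts = pd.DataFrame({
    'tweet_id': pd.Series([1, 2, 3], dtype=object),  # object dtype, as reported by info()
    'retweet_count': [5, 7, 9],
})

# Join keys should share a dtype before merging
counts['tweet_id'] = counts['tweet_id'].astype('int64')

# Inner joins keep only tweets that appear in every table (here: tweets with images)
master = (
    archive
    .merge(predictions, on='tweet_id', how='inner')
    .merge(counts, on='tweet_id', how='inner')
)
```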

In [1712]:
# Make copies of original pieces of data
tweet_archive_copy = tweet_archive.copy()
image_pred_copy = image_pred.copy()
gathered_tweet_df_copy = gathered_tweet_df.copy()

In [1713]:
dogitionary = ['doggo', 'floofer', 'pupper', 'puppo']

### Issue #1:
* Some of the tweets are retweets and may not be about dogs

#### Define:
- Tweets having non-null values in retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp should be dropped
- The info() and describe() output above shows a total of 181 non-null values in these columns

#### Code

In [1714]:
tweet_archive_copy.shape

(2356, 17)

In [1715]:
# Drop rows having non-null values in retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp columns of tweet_archive_copy
tweet_archive_copy = tweet_archive_copy.loc[tweet_archive_copy['retweeted_status_id'].isnull()]

In [1716]:
# Drop the retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp columns of tweet_archive_copy
tweet_archive_copy = tweet_archive_copy.drop(['retweeted_status_id', 'retweeted_status_user_id', 'retweeted_status_timestamp'], axis=1)

#### Test

In [1717]:
tweet_archive_copy.shape


(2175, 14)

In [1718]:
tweet_archive_copy.sample(4)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
925,755110668769038337,,,2016-07-18 18:43:07 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Watson. He trust falls on command. 13/10 it's elementary... (IG: wat.ki) https://t.co/goX3jewkYN,https://twitter.com/dog_rates/status/755110668769038337/video/1,13,10,Watson,,,,
997,748324050481647620,,,2016-06-30 01:15:31 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Duke. He permanently looks like he just tripped over something. 11/10 https://t.co/1sNtG7GgiO,"https://twitter.com/dog_rates/status/748324050481647620/photo/1,https://twitter.com/dog_rates/status/748324050481647620/photo/1",11,10,Duke,,,,
1699,680970795137544192,,,2015-12-27 04:37:44 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",I thought I made this very clear. We only rate dogs. Stop sending other things like this shark. Thank you... 9/10 https://t.co/CXSJZ4Stk3,https://twitter.com/dog_rates/status/680970795137544192/photo/1,9,10,,,,,
1208,715704790270025728,,,2016-04-01 00:58:13 +0000,"<a href=""http://vine.co"" rel=""nofollow"">Vine - Make a Scene</a>",This is Bentley. He gives kisses back. 11/10 precious af (vid by @emmaallen25) https://t.co/9PnKkKzoUp,https://vine.co/v/ijAlDnuOD0l,11,10,Bentley,,,,


### Issue #2:
- Columns that are almost entirely NaN or are not needed for the analysis

#### Define:
- Columns to drop: in_reply_to_status_id, in_reply_to_user_id, source (the retweeted_status_* columns were already dropped in Issue #1)
- Drop the above columns with the drop function

#### Code

In [1719]:
useless_columns = ['in_reply_to_status_id', 'in_reply_to_user_id','source']

In [1720]:
tweet_archive_copy.drop(useless_columns, axis=1, inplace=True)

#### Test

In [1721]:
tweet_archive_copy.sample(4)

Unnamed: 0,tweet_id,timestamp,text,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1458,695074328191332352,2016-02-04 02:40:08 +0000,This is Lorenzo. He's educated af. Just graduated college. 11/10 poor pupper can't even comprehend his debt https://t.co/dH3GzcjCtQ,https://twitter.com/dog_rates/status/695074328191332352/photo/1,11,10,Lorenzo,,,pupper,
1911,674410619106390016,2015-12-09 02:09:56 +0000,This is Lenny. He wants to be a sprinkler. 10/10 you got this Lenny https://t.co/CZ0YaB40Hn,https://twitter.com/dog_rates/status/674410619106390016/photo/1,10,10,Lenny,,,,
1610,685532292383666176,2016-01-08 18:43:29 +0000,"For the last time, WE. DO. NOT. RATE. BULBASAUR. We only rate dogs. Please only send dogs. Thank you ...9/10 https://t.co/GboDG8WhJG",https://twitter.com/dog_rates/status/685532292383666176/photo/1,9,10,,,,,
1139,728015554473250816,2016-05-05 00:16:48 +0000,This is Rueben. He has reached ultimate pupper zen state. 11/10 tranquil af https://t.co/Z167HgtnBi,https://twitter.com/dog_rates/status/728015554473250816/photo/1,11,10,Rueben,,,pupper,


### Issue #3:
- Incorrect names
- None values for some of the names

#### Define:
- Find the names that are not correct by using value counts
- Remove the rows whose name is incorrect (starts with a lowercase letter) or is None

#### Code

In [1722]:
# # First, remove all tweets that don't contain any of the dog words
# for word in dogitionary:
#     tweet_archive_copy = tweet_archive_copy[tweet_archive_copy['text'].str.contains(word)]


In [1723]:
# Create a csv file containing the dog names so they can be inspected visually
counts = tweet_archive_copy['name'].value_counts()
counts.to_csv('name.csv', index=True)

In [1724]:
# Get all the invalid names and remove them from the dataframe
# We notice that invalid names start with lowercase letters.

# Create a list of invalid names
invalid_names = ['None']
for name in tweet_archive_copy.name:
    if name[0].islower():
        invalid_names.append(name)

In [1725]:
# Get unique invalid names
invalid_names = list(set(invalid_names))
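
The same check can also be expressed without an explicit loop. The sketch below uses a made-up list of names (a few of which, such as `a`, `this` and `just`, appear in the outputs above) and builds a boolean mask with the pandas string accessor. Note that the pattern `[a-z]` only covers ASCII letters, which is slightly narrower than `str.islower()`.

```python
import pandas as pd

names = pd.Series(['Charlie', 'a', 'None', 'this', 'Lucy', 'just'], name='name')

# True where the name is the 'None' placeholder or begins with a lowercase ASCII letter
# (str.match anchors the pattern at the start of each string)
is_invalid = names.eq('None') | names.str.match(r'[a-z]')

valid_names = names[~is_invalid]
```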

In [1726]:
tweet_archive_copy.shape

(2175, 11)

In [1727]:
# Remove invalid names from the dataframe
tweet_archive_copy = tweet_archive_copy[~tweet_archive_copy['name'].isin(invalid_names)]
tweet_archive_copy.name.value_counts()

Charlie    11
Lucy       11
Oliver     10
Cooper     10
Penny       9
           ..
Benny       1
Chubbs      1
Alfy        1
Hamrick     1
Lulu        1
Name: name, Length: 930, dtype: int64

In [1728]:
# Check the shape after removing the rows with invalid names
tweet_archive_copy.shape

(1391, 11)

#### Test

In [1729]:
# verify that the dataframe is now clean of invalid names
tweet_archive_copy[['text','name']].sample(10)

Unnamed: 0,text,name
1733,This is Rinna. She's melting. 10/10 get inside pupper https://t.co/PA0czwucsb,Rinna
1211,This is Bertson. He just wants to say hi. 11/10 would boop nose https://t.co/hwv7Wq6gDA,Bertson
464,Meet Strudel. He's rather h*ckin pupset that your clothes clash. 11/10 click the link to see how u can help Strudel\n\nhttps://t.co/3uxgLz8d0l https://t.co/O0ECL1StB2,Strudel
1328,This is Lucy. She's a Venetian Kerploof. Supposed to be navigating. Quite irresponsible. Fancy ass collar tho 12/10 https://t.co/8tjnz1L8DI,Lucy
1721,This is Reggie. His Santa hat is a little big. 10/10 he's still having fun https://t.co/w0dcGXq7qK,Reggie
791,This is Loki. He knows he's adorable. One ear always pupared. 12/10 would snug in depicted fashion forever https://t.co/OqNggd4Oio,Loki
757,This is Penny. She's a sailor pup. 11/10 would take to the open seas with https://t.co/0rRxyBQt32,Penny
365,This is Dexter. He was reunited with his mom yesterday after she was stuck in Iran during the travel Bannon. 13/10 welcome home https://t.co/U50RlRw4is,Dexter
782,This is Finley. He's an independent doggo still adjusting to life on his own. 11/10 https://t.co/7FNcBaKbci,Finley
1089,This is Bella. She's ubering home after a few too many drinks. 10/10 socially conscious af https://t.co/KxkOgq80Xj,Bella


### Issue #4:
Incorrect Ratings for some of the dogs

#### Define
- We were told the denominator is always 10. By viewing the describe function output above we can confirm the denominator has numbers greater than 10
- We will find all numbers greater than 10 in the denominator column and replace them with 10.
- We will also find uncommon numerators and replace them with proper values
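
Part of this could be automated by re-extracting the rating from the tweet text. The following is a minimal sketch, not the approach used in this notebook: `extract_rating` is a hypothetical helper, and preferring a match whose denominator is 10 is an assumption based on the rule stated above. The test strings are fragments of tweets shown earlier.

```python
import re

# Matches "numerator/denominator", allowing a decimal numerator such as 9.75/10
RATING = re.compile(r'(\d+(?:\.\d+)?)/(\d+)')


def extract_rating(text):
    """Return (numerator, denominator), preferring a match whose denominator is 10."""
    matches = [(float(n), int(d)) for n, d in RATING.findall(text)]
    if not matches:
        return None
    for numerator, denominator in matches:
        if denominator == 10:
            return numerator, denominator
    return matches[0]  # fall back to the first rating-like pattern


logan = extract_rating("H*ckin magical af 9.75/10 https://t.co/yBO5wuqaPS")
bluebert = extract_rating("match ups are split 50/50. Amazed af. 11/10 https://t.co/Kky1DPG4iq")
squad = extract_rating("204/170 would try to pet all at once")
```

A rule like this would still need a manual check for tweets that legitimately rate several dogs at once, such as the 204/170 example.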

#### Code

In [1730]:
# Reset pandas display options
pd.reset_option('display.max_colwidth')

In [1731]:
tweet_archive_copy[['text','name', 'rating_numerator', 'rating_denominator']].sample(10)

Unnamed: 0,text,name,rating_numerator,rating_denominator
2292,This is Bradlay. He is a Ronaldinho Matsuyama ...,Bradlay,11,10
659,Say hello to Levi. He's a Madagascan Butterbop...,Levi,12,10
2100,Meet Danny. He's too good to look at the road ...,Danny,6,10
1507,This is Richie and Plip. They are the best of ...,Richie,10,10
1074,This is Simba. He's the grand prize. The troph...,Simba,12,10
1276,Meet Rodney. He's a Ukranian Boomchicka. Outsi...,Rodney,10,10
2245,Meet Stu. Stu has stacks on stacks and an eye ...,Stu,10,10
652,Meet BeBe. She rocks the messy bun of your dre...,BeBe,12,10
834,Meet Chevy. He had a late breakfast and now ha...,Chevy,11,10
2164,This is Oliviér. He's a Baptist Hindquarter. A...,Oliviér,10,10


In [1732]:
# Find the distribution of denominator values
tweet_archive_copy['rating_denominator'].value_counts()

10    1388
50       1
11       1
7        1
Name: rating_denominator, dtype: int64

In [1733]:
tweet_archive_copy.query('rating_denominator < 10')

Unnamed: 0,tweet_id,timestamp,text,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
516,810984652412424192,2016-12-19 23:06:23 +0000,Meet Sam. She smiles 24/7 &amp; secretly aspir...,"https://www.gofundme.com/sams-smile,https://tw...",24,7,Sam,,,,


In [1734]:
tweet_archive_copy.query('rating_denominator > 10')

Unnamed: 0,tweet_id,timestamp,text,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1202,716439118184652801,2016-04-03 01:36:11 +0000,This is Bluebert. He just saw that both #Final...,https://twitter.com/dog_rates/status/716439118...,50,50,Bluebert,,,,
1662,682962037429899265,2016-01-01 16:30:13 +0000,This is Darrel. He just robbed a 7/11 and is i...,https://twitter.com/dog_rates/status/682962037...,7,11,Darrel,,,,


In [1735]:
# View the distribution of rating numerator
tweet_archive_copy.rating_numerator.describe()

count    1391.000000
mean       12.091301
std        47.413241
min         2.000000
25%        10.000000
50%        11.000000
75%        12.000000
max      1776.000000
Name: rating_numerator, dtype: float64

In [1736]:
# Find all rating numerators greater than the 75th percentile
greater_than_75 = tweet_archive_copy['rating_numerator'][tweet_archive_copy['rating_numerator'] > tweet_archive_copy['rating_numerator'].quantile(0.75)]
print(greater_than_75.value_counts())

13      183
14       17
1776      1
75        1
50        1
27        1
24        1
Name: rating_numerator, dtype: int64


In [1737]:
pd.set_option('display.max_colwidth', None)

In [1738]:
# Find the tweet with the highest rating numerator
tweet_archive_copy.query('rating_numerator == rating_numerator.max()')

Unnamed: 0,tweet_id,timestamp,text,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
979,749981277374128128,2016-07-04 15:00:45 +0000,This is Atticus. He's quite simply America af. 1776/10 https://t.co/GRXwMxLBkh,https://twitter.com/dog_rates/status/749981277374128128/photo/1,1776,10,Atticus,,,,


In [1739]:
# The above tweet has a rating numerator far above the 75th percentile and appears to be a reference to American Independence Day (1776).
# We drop this tweet from the dataframe because it is an outlier that we don't want to include in our analysis.
tweet_archive_copy = tweet_archive_copy[tweet_archive_copy['rating_numerator'] != 1776]

In [1740]:
tweet_archive_copy.query('rating_numerator == 75')

Unnamed: 0,tweet_id,timestamp,text,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
695,786709082849828864,2016-10-13 23:23:56 +0000,"This is Logan, the Chow who lived. He solemnly swears he's up to lots of good. H*ckin magical af 9.75/10 https://t.co/yBO5wuqaPS",https://twitter.com/dog_rates/status/786709082849828864/photo/1,75,10,Logan,,,,


In [1741]:
# The tweet text gives a rating of 9.75/10, but only the digits after the decimal point (75) were extracted as the numerator.
# Correct the numerator by rounding the rating in the tweet to the nearest integer.

# Replace the wrong numerator (75) with the rounded figure (10)
tweet_archive_copy['rating_numerator'] = tweet_archive_copy['rating_numerator'].replace(75, 10)

In [1742]:
tweet_archive_copy.query('rating_numerator == 50')

Unnamed: 0,tweet_id,timestamp,text,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1202,716439118184652801,2016-04-03 01:36:11 +0000,This is Bluebert. He just saw that both #FinalFur match ups are split 50/50. Amazed af. 11/10 https://t.co/Kky1DPG4iq,https://twitter.com/dog_rates/status/716439118184652801/photo/1,50,50,Bluebert,,,,


In [1743]:
# The extracted 50/50 refers to the match-up split; the actual rating in the tweet is 11/10.
# Replace the numerator and denominator values of 50 with the figures from the tweet text
tweet_archive_copy['rating_numerator'] = tweet_archive_copy['rating_numerator'].replace(50, 11)
tweet_archive_copy['rating_denominator'] = tweet_archive_copy['rating_denominator'].replace(50, 10)
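
The value-based replacements above change every row whose numerator (or denominator) equals the matched value. A minimal sketch of a narrower alternative targets each row by its tweet_id with `.loc`, using the IDs shown in the query outputs above. The `corrections` mapping and the small `demo` frame are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Illustrative frame: the two rows inspected above plus one ordinary row
demo = pd.DataFrame({
    'tweet_id': [786709082849828864, 716439118184652801, 710997087345876993],
    'rating_numerator': [75, 50, 12],
    'rating_denominator': [10, 50, 10],
})

# Hypothetical mapping: tweet_id -> (numerator, denominator) read from the tweet text
corrections = {
    786709082849828864: (10, 10),  # text says 9.75/10, rounded to 10
    716439118184652801: (11, 10),  # text says 11/10; 50/50 was the match-up split
}

for tid, (num, den) in corrections.items():
    mask = demo['tweet_id'] == tid
    # Only the row with this tweet_id is modified
    demo.loc[mask, ['rating_numerator', 'rating_denominator']] = [num, den]

print(demo)
```

Keying on tweet_id keeps each correction tied to the tweet that was actually inspected, so an unrelated row that happens to share the same numerator is left alone.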


In [1744]:
tweet_archive_copy.query('rating_numerator == 27')

Unnamed: 0,tweet_id,timestamp,text,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
763,778027034220126208,2016-09-20 00:24:34 +0000,This is Sophie. She's a Jubilant Bush Pupper. Super h*ckin rare. Appears at random just to smile at the locals. 11.27/10 would smile back https://t.co/QFaUiIHxHq,https://twitter.com/dog_rates/status/778027034220126208/photo/1,27,10,Sophie,,,pupper,


In [1745]:
# As with the 9.75/10 tweet, the text gives 11.27/10 but only the decimal digits (27) were extracted.
# Replace the wrong numerator (27) with the rounded figure (11)
tweet_archive_copy['rating_numerator'] = tweet_archive_copy['rating_numerator'].replace(27, 11)

In [1746]:
tweet_archive_copy.query('rating_numerator == 24')

Unnamed: 0,tweet_id,timestamp,text,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
516,810984652412424192,2016-12-19 23:06:23 +0000,Meet Sam. She smiles 24/7 &amp; secretly aspires to be a reindeer. \nKeep Sam smiling by clicking and sharing this link:\nhttps://t.co/98tB8y7y7t https://t.co/LouL5vdvxx,"https://www.gofundme.com/sams-smile,https://twitter.com/dog_rates/status/810984652412424192/photo/1",24,7,Sam,,,,


In [1747]:
# The "24/7" in this tweet is not a rating and the text contains no actual rating, so drop the row with rating numerator of 24
tweet_archive_copy = tweet_archive_copy[tweet_archive_copy['rating_numerator'] != 24]
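
The manual corrections above all stem from how the ratings were extracted from the tweet text. As a rough sketch of an automated alternative, the ratings could be re-extracted with a regular expression that allows a decimal numerator. Taking the last fraction in the text is a heuristic assumed here (it handles the "50/50 ... 11/10" case), and the `last_rating` helper and `sample` frame are hypothetical. It would still misread phrases such as "24/7", so manual review remains necessary.

```python
import re
import pandas as pd

# Illustrative excerpts of the tweets inspected above
sample = pd.DataFrame({
    'text': [
        "H*ckin magical af 9.75/10 https://t.co/yBO5wuqaPS",
        "match ups are split 50/50. Amazed af. 11/10 https://t.co/Kky1DPG4iq",
        "Super h*ckin rare. 11.27/10 would smile back",
    ]
})

# "<number>/<integer>", where the numerator may contain a decimal part
PATTERN = re.compile(r'(\d+(?:\.\d+)?)/(\d+)')

def last_rating(text):
    """Return (numerator, denominator) for the last fraction in text, or (None, None)."""
    matches = PATTERN.findall(text)
    if not matches:
        return (None, None)
    num, den = matches[-1]
    return (float(num), int(den))

# Build the two rating columns from the extracted tuples
extracted = pd.DataFrame(
    sample['text'].apply(last_rating).tolist(),
    columns=['rating_numerator', 'rating_denominator'],
    index=sample.index,
)
sample = sample.join(extracted)
print(sample[['rating_numerator', 'rating_denominator']])
```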

#### Test

In [1748]:
# Print the highest rating numerator for each rating denominator
tweet_archive_copy.groupby('rating_denominator').rating_numerator.max().sort_values(ascending=False)



rating_denominator
10    14
11     7
Name: rating_numerator, dtype: int64

In [1749]:
tweet_archive_copy.groupby('rating_numerator').rating_denominator.max().sort_values(ascending=False)


rating_numerator
7     11
14    10
13    10
12    10
11    10
10    10
9     10
8     10
6     10
5     10
4     10
3     10
2     10
Name: rating_denominator, dtype: int64
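
The output above still lists a denominator of 11 paired with a numerator of 7, which matches the "7/11" in the Darrel tweet shown earlier rather than a rating. A minimal sketch for surfacing such rows for manual review follows, assuming every genuine rating in this cleaned subset uses a denominator of 10. The `demo` frame is an illustrative stand-in for tweet_archive_copy.

```python
import pandas as pd

# Illustrative stand-in: the Darrel row (7/11) and an ordinary 12/10 row
demo = pd.DataFrame({
    'tweet_id': [682962037429899265, 805932879469572096],
    'rating_numerator': [7, 12],
    'rating_denominator': [11, 10],
})

# Rows whose denominator is not 10 are candidates for a wrongly extracted rating
suspect = demo[demo['rating_denominator'] != 10]
print(suspect)
```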

### Issue 5:
- Some columns have wrong datatypes

#### Define
- Convert the timestamp column from string (object) to datetime using the pandas to_datetime function

#### Code

In [1750]:
# Confirm datatypes of columns
tweet_archive_copy.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1389 entries, 0 to 2325
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   tweet_id            1389 non-null   int64 
 1   timestamp           1389 non-null   object
 2   text                1389 non-null   object
 3   expanded_urls       1389 non-null   object
 4   rating_numerator    1389 non-null   int64 
 5   rating_denominator  1389 non-null   int64 
 6   name                1389 non-null   object
 7   doggo               1389 non-null   object
 8   floofer             1389 non-null   object
 9   pupper              1389 non-null   object
 10  puppo               1389 non-null   object
dtypes: int64(3), object(8)
memory usage: 130.2+ KB


In [1751]:
# Change the datatype for timestamp column to datetime
tweet_archive_copy['timestamp'] = pd.to_datetime(tweet_archive_copy['timestamp'])

In [1752]:
# Find distribution of tweets by year of creation in tweet_archive_copy
tweet_archive_copy['timestamp'].dt.year.value_counts()

2016    726
2015    379
2017    284
Name: timestamp, dtype: int64

#### Test

In [1753]:
tweet_archive_copy.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1389 entries, 0 to 2325
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype              
---  ------              --------------  -----              
 0   tweet_id            1389 non-null   int64              
 1   timestamp           1389 non-null   datetime64[ns, UTC]
 2   text                1389 non-null   object             
 3   expanded_urls       1389 non-null   object             
 4   rating_numerator    1389 non-null   int64              
 5   rating_denominator  1389 non-null   int64              
 6   name                1389 non-null   object             
 7   doggo               1389 non-null   object             
 8   floofer             1389 non-null   object             
 9   pupper              1389 non-null   object             
 10  puppo               1389 non-null   object             
dtypes: datetime64[ns, UTC](1), int64(3), object(7)
memory usage: 130.2+ KB
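
Reading the info() output by eye works, but the same check can be made programmatic. A small sketch follows, using timestamps copied from rows shown earlier. Passing `utc=True` is an addition here to make the time zone explicit, and `demo` is an illustrative frame.

```python
import pandas as pd

demo = pd.DataFrame({
    'timestamp': ['2016-04-03 01:36:11 +0000', '2016-01-01 16:30:13 +0000']
})
demo['timestamp'] = pd.to_datetime(demo['timestamp'], utc=True)

# Fails loudly if the column is not a datetime dtype (tz-aware or naive)
assert pd.api.types.is_datetime64_any_dtype(demo['timestamp'])

# The .dt accessor is only available once the conversion has succeeded
years = demo['timestamp'].dt.year.tolist()
print(years)
```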


### Issue 6:
- Rows where p1_dog is False are images whose top prediction is not a dog breed, so they tend not to show dogs

#### Define
- Delete rows containing False as a value in p1_dog column

#### Code

In [1754]:
# Keep only the rows where p1_dog is True in the image_pred_copy dataframe
image_pred_copy = image_pred_copy[image_pred_copy['p1_dog']]

#### Test

In [1755]:
image_pred_copy.groupby('p1_dog').count()

Unnamed: 0_level_0,tweet_id,jpg_url,img_num,p1,p1_conf,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
p1_dog,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
True,1532,1532,1532,1532,1532,1532,1532,1532,1532,1532,1532


In [1756]:
image_pred_copy.shape

(1532, 12)

### Issue 7:
- The img_num column doesn't seem to convey any actual value for analysis

#### Define
- img_num column should be dropped using the drop method

#### Code

In [1757]:
# Drop the img_num column from image_pred_copy
image_pred_copy = image_pred_copy.drop(['img_num'], axis=1)

#### Test

In [1758]:
image_pred_copy.sample(4)

Unnamed: 0,tweet_id,jpg_url,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
1176,737678689543020544,https://pbs.twimg.com/media/CjzC2oGWYAAyIfG.jpg,Pembroke,0.935307,True,Cardigan,0.049874,True,Chihuahua,0.011603,True
1842,838476387338051585,https://pbs.twimg.com/media/C6Ld0wYWgAQQqMC.jpg,Great_Pyrenees,0.997692,True,kuvasz,0.001001,True,Newfoundland,0.000405,True
2008,878057613040115712,https://pbs.twimg.com/media/DC98vABUIAA97pz.jpg,French_bulldog,0.839097,True,Boston_bull,0.078799,True,toy_terrier,0.015243,True
987,707741517457260545,https://pbs.twimg.com/media/CdJnJ1dUEAARNcf.jpg,whippet,0.738371,True,Italian_greyhound,0.191789,True,American_Staffordshire_terrier,0.020126,True


### Issue 8:
- Wrong data type for p1, p2, p3

#### Define
- Convert p1, p2, p3 data type from strings to categorical type

#### Code

In [1759]:
# Convert p1, p2, p3 datatype to categorical

convert_dict = {'p1': 'category', 'p2': 'category', 'p3': 'category'}
image_pred_copy = image_pred_copy.astype(convert_dict)

#### Test

In [1760]:
image_pred_copy.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1532 entries, 0 to 2073
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   tweet_id  1532 non-null   int64   
 1   jpg_url   1532 non-null   object  
 2   p1        1532 non-null   category
 3   p1_conf   1532 non-null   float64 
 4   p1_dog    1532 non-null   bool    
 5   p2        1532 non-null   category
 6   p2_conf   1532 non-null   float64 
 7   p2_dog    1532 non-null   bool    
 8   p3        1532 non-null   category
 9   p3_conf   1532 non-null   float64 
 10  p3_dog    1532 non-null   bool    
dtypes: bool(3), category(3), float64(3), int64(1), object(1)
memory usage: 107.8+ KB
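
To illustrate why a categorical type suits columns such as p1, p2 and p3, here is a small sketch comparing the memory footprint of a repeated-string column before and after conversion. The breed names come from the sample rows above; the `breeds` series and the repetition count are made up for the illustration.

```python
import pandas as pd

# A column with many rows but few distinct values, like the breed predictions
breeds = pd.Series(['Pembroke', 'whippet', 'French_bulldog'] * 1000)
as_category = breeds.astype('category')

# deep=True counts the string contents, not just the pointers
string_bytes = breeds.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(string_bytes, category_bytes)

# Each distinct value is stored once; rows hold small integer codes
print(as_category.cat.categories.tolist())
```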


### Tidiness Issue

### Issue 1:
- The four dog stages (doggo, floofer, pupper, puppo) do not need separate columns. They should all be under a single column

#### Define
- Collapse the four dog stage columns into a single column
- Use the pandas melt function

#### Code

In [1761]:
# Use pandas melt function to convert the dataframe to long format
tweet_archive_copy = pd.melt(tweet_archive_copy, id_vars=['tweet_id','timestamp','text','expanded_urls','rating_numerator','rating_denominator','name'], value_vars=['doggo','floofer', 'pupper','puppo'], var_name='dog_type')

In [1762]:
tweet_archive_copy.drop('value', axis=1, inplace=True)

#### Test

In [1763]:
tweet_archive_copy.sample(4)

Unnamed: 0,tweet_id,timestamp,text,expanded_urls,rating_numerator,rating_denominator,name,dog_type
757,710997087345876993,2016-03-19 01:11:29+00:00,Meet Milo and Amos. They are the best of pals. Both 12/10 would pet at the same time https://t.co/Mv37BHEyyD,https://twitter.com/dog_rates/status/710997087345876993/photo/1,12,10,Milo,doggo
991,684460069371654144,2016-01-05 19:42:51+00:00,This is Jeph. He's a Western Sagittarius Dookmarriot. Frightened by leaf. Caught him off guard. 10/10 calm down Jeph https://t.co/bicyOV6lju,https://twitter.com/dog_rates/status/684460069371654144/photo/1,10,10,Jeph,doggo
4085,669214165781868544,2015-11-24 18:01:05+00:00,This is Jaspers. He is a northeastern Gillette. Just got his license. Very excited. 10/10 they grow up so fast https://t.co/cieaOI0RuT,https://twitter.com/dog_rates/status/669214165781868544/photo/1,10,10,Jaspers,pupper
1789,788150585577050112,2016-10-17 22:51:57+00:00,This is Leo. He's a golden chow. Rather h*ckin rare. 13/10 would give extra pats https://t.co/xosHjFzVXc,"https://twitter.com/dog_rates/status/788150585577050112/photo/1,https://twitter.com/dog_rates/status/788150585577050112/photo/1,https://twitter.com/dog_rates/status/788150585577050112/photo/1,https://twitter.com/dog_rates/status/788150585577050112/photo/1",13,10,Leo,floofer


In [1764]:
tweet_archive_copy.shape

(5556, 8)
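
The shape above is four times the 1389 rows present before melting: melt emits one row per tweet per stage column, and dropping the value column discards which stage, if any, actually applied to the tweet. A minimal sketch of one alternative that keeps a single row per tweet follows. It assumes a populated stage cell holds its own column name (as in the Sophie row earlier, where the pupper column contains "pupper") and that multiple stages should be joined with a comma. The `collapse_stages` helper and `demo` frame are hypothetical.

```python
import numpy as np
import pandas as pd

STAGES = ['doggo', 'floofer', 'pupper', 'puppo']

def collapse_stages(df):
    """Return a copy of df with the four stage columns merged into 'dog_type'."""
    def stage_for_row(row):
        # A stage applies when the cell contains the stage name itself
        present = [s for s in STAGES if row[s] == s]
        return ','.join(present) if present else np.nan

    out = df.copy()
    out['dog_type'] = out[STAGES].apply(stage_for_row, axis=1)
    return out.drop(columns=STAGES)

# Illustrative rows: one tweet tagged as pupper, one with no stage
demo = pd.DataFrame({
    'tweet_id': [778027034220126208, 710997087345876993],
    'doggo': ['', ''],
    'floofer': ['', ''],
    'pupper': ['pupper', ''],
    'puppo': ['', ''],
})

tidy = collapse_stages(demo)
print(tidy)
```

With this approach the row count stays unchanged and tweets without a stage get a missing dog_type instead of four arbitrary labels.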

### Issue 2:
- Three datasets instead of one

#### Define
- The three datasets can be merged together using pandas merge function

#### Code

In [1766]:
from functools import reduce

# Define the list of DataFrames to merge
dfs = [tweet_archive_copy, image_pred_copy, gathered_tweet_df]

# Merge all DataFrames into one on tweet_id
final_df = reduce(lambda left, right: pd.merge(left, right, on='tweet_id', how='outer'), dfs)

In [1768]:
final_df.shape

(6522, 20)

In [1769]:
# Drop rows with NaN values in final_df
final_df = final_df.dropna()

In [1770]:
final_df.sample(4)

Unnamed: 0,tweet_id,timestamp,text,expanded_urls,rating_numerator,rating_denominator,name,dog_type,jpg_url,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog,retweet_count,favorite_count
1284,8.059329e+17,2016-12-06 00:32:26+00:00,This is Major. He put on a tie for his first real walk. Only a little crooked. Can also drool upwards. H*ckin talented. 12/10 https://t.co/Zcwr8LgoO8,https://twitter.com/dog_rates/status/805932879469572096/photo/1,12.0,10.0,Major,doggo,https://pbs.twimg.com/media/Cy8_qt0UUAAHuuN.jpg,Norwegian_elkhound,0.657967,True,keeshond,0.319136,True,Leonberg,0.007947,True,2209,9178
4948,6.711157e+17,2015-11-29 23:57:10+00:00,"Meet Phred. He isn't steering, looking at the road, or wearing a seatbelt. Phred is a rolling tornado of danger 6/10 https://t.co/mZD7Bo7HfV",https://twitter.com/dog_rates/status/671115716440031232/photo/1,6.0,10.0,Phred,doggo,https://pbs.twimg.com/media/CVBILUgVAAA1ZUr.jpg,malinois,0.406341,True,kelpie,0.143366,True,dingo,0.129802,False,842,1436
1762,7.804594e+17,2016-09-26 17:29:48+00:00,"This is Bear. Don't worry, he's not a real bear tho. Contains unreal amounts of squish. 11/10 heteroskedastic af https://t.co/coi4l1T2Sm",https://twitter.com/dog_rates/status/780459368902959104/photo/1,11.0,10.0,Bear,pupper,https://pbs.twimg.com/media/CtS_p9kXEAE2nh8.jpg,Great_Dane,0.382491,True,German_shepherd,0.312026,True,bull_mastiff,0.033272,True,1224,5892
1097,8.171713e+17,2017-01-06 00:49:53+00:00,This is Tebow. He kindly requests that you put down the coffee and play with him. 13/10 such a good boy https://t.co/56uBP28eqw,https://twitter.com/dog_rates/status/817171292965273600/photo/1,13.0,10.0,Tebow,floofer,https://pbs.twimg.com/media/C1cs8uAWgAEwbXc.jpg,golden_retriever,0.295483,True,Irish_setter,0.144431,True,Chesapeake_Bay_retriever,0.077879,True,2326,9690


In [1771]:
final_df.shape

(4172, 20)
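
The sample above displays tweet_id as a float (8.059329e+17). Tweet IDs of this size exceed 2**53, beyond which float64 can no longer represent every integer exactly, so IDs stored as floats may be silently altered. A minimal sketch of an alternative follows, assuming tweet_id is stored as int64 in all three frames: an inner merge keeps only tweets present in every frame, so the merge itself introduces no NaN values and the key keeps its integer dtype. The three small frames are illustrative stand-ins, and the IDs 1 and 2 are placeholders for tweets missing from the other frames.

```python
from functools import reduce
import pandas as pd

archive = pd.DataFrame({
    'tweet_id': [805932879469572096, 671115716440031232, 1],
    'rating_numerator': [12, 6, 10],
})
preds = pd.DataFrame({
    'tweet_id': [805932879469572096, 671115716440031232],
    'p1': ['Norwegian_elkhound', 'malinois'],
})
counts = pd.DataFrame({
    'tweet_id': [805932879469572096, 671115716440031232, 2],
    'retweet_count': [2209, 842, 5],
    'favorite_count': [9178, 1436, 7],
})

# Inner merge: only tweet_ids found in all three frames survive
merged = reduce(
    lambda left, right: pd.merge(left, right, on='tweet_id', how='inner'),
    [archive, preds, counts],
)
print(merged.dtypes)
```

This yields roughly the rows that an outer merge followed by dropna() would keep, with the difference that rows containing NaN in their original columns are not removed.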

In [1773]:
final_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4172 entries, 4 to 5555
Data columns (total 20 columns):
 #   Column              Non-Null Count  Dtype              
---  ------              --------------  -----              
 0   tweet_id            4172 non-null   float64            
 1   timestamp           4172 non-null   datetime64[ns, UTC]
 2   text                4172 non-null   object             
 3   expanded_urls       4172 non-null   object             
 4   rating_numerator    4172 non-null   float64            
 5   rating_denominator  4172 non-null   float64            
 6   name                4172 non-null   object             
 7   dog_type            4172 non-null   object             
 8   jpg_url             4172 non-null   object             
 9   p1                  4172 non-null   category           
 10  p1_conf             4172 non-null   float64            
 11  p1_dog              4172 non-null   object             
 12  p2                  4172 non-null 

Remaining datatype issues in final_df:
- tweet_id should be an integer
- rating_numerator and rating_denominator should be int64
- dog_type should be a category
- retweet_count and favorite_count should be int64
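
The conversions listed above could be applied in one `astype` call. A minimal sketch follows, assuming the frame has no missing values left (as after dropna()) and that all ratings are whole numbers. The `demo` frame is an illustrative stand-in for final_df, and tweet_id is left out on purpose because an ID that has already passed through float64 cannot be reliably restored by casting.

```python
import pandas as pd

# Illustrative stand-in for final_df with the dtypes that need fixing
demo = pd.DataFrame({
    'rating_numerator': [12.0, 6.0],
    'rating_denominator': [10.0, 10.0],
    'dog_type': ['doggo', 'pupper'],
    'retweet_count': pd.Series([2209, 842], dtype='object'),
    'favorite_count': pd.Series([9178, 1436], dtype='object'),
})

# One target dtype per column
target_dtypes = {
    'rating_numerator': 'int64',
    'rating_denominator': 'int64',
    'dog_type': 'category',
    'retweet_count': 'int64',
    'favorite_count': 'int64',
}
demo = demo.astype(target_dtypes)
print(demo.dtypes)
```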

## Storing Data
Save gathered, assessed, and cleaned master dataset to a CSV file named "twitter_archive_master.csv".

In [1772]:
# Save final_df to csv
final_df.to_csv('twitter_archive_master.csv', index=False)

## Analyzing and Visualizing Data
In this section, analyze and visualize your wrangled data. You must produce at least **three (3) insights and one (1) visualization.**

### Insights:
1.

2.

3.

### Visualization