# Dog Rates Data Wrangling

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#gather">Gathering</a></li>
<li><a href="#assess">Assessing</a></li>
<li><a href="#clean">Cleaning</a></li>
<li><a href="#analyze&visualize">Analyze and Visualize</a></li>
<li><a href="#conclusion">Conclusion</a></li>
</ul>

<a id='intro'></a>
## Introduction

The dataset is the tweet archive of Twitter user @dog_rates, also known as WeRateDogs. WeRateDogs is a Twitter account that rates people's dogs with a humorous comment about the dog.

The data wrangling process will consist of 3 main steps. They are:
1. **Data gathering** - The data can be gathered in many ways, including web scraping and querying APIs. It can come from a single source or from many different sources.
2. **Assessing the data** - The data needs to be assessed for quality and tidiness issues. This can be done visually and/or programmatically.
3. **Cleaning the data** - Based on the assessment, the data is cleaned and tested to make sure all the issues identified are resolved.

**Analyze & Visualize**
<br/>
Finally, the wrangled data is analyzed and visualized in an effective and insightful manner.

>*Importing all necessary packages for the data wrangling and analysis*

In [1]:
import requests
import tweepy
import pandas as pd
import time
import json
from functools import reduce

import config

In [2]:
pd.options.display.max_rows
pd.set_option('display.max_colwidth', None)

<a id='gather'></a>
## Gathering

> The data for this analysis is to be gathered from multiple sources. They are:
>
>> 1. The enhanced WeRateDogs Twitter archive is provided. This file (**twitter-archive-enhanced.csv**) just needs to be downloaded.
>
>> 2. The tweet image predictions, i.e., what breed of dog (or other object, animal, etc.) is present in each tweet according to a neural network. This file (**image_predictions.tsv**) is hosted on Udacity's servers and should be downloaded programmatically.
>
>> 3. Additional required and interesting data is to be obtained by querying the Twitter API for each tweet's JSON data and storing each tweet's entire set of JSON data in a file (**tweet_json.txt**).

>*The file containing the enhanced Twitter archive (twitter-archive-enhanced.csv) has been manually downloaded and is available in the working directory. The tweet image predictions file (image_predictions.tsv) is to be downloaded programmatically using the URL provided.*

In [3]:
# storing the URL provided in a variable
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'

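# getting the response from the URL using the requests library
response = requests.get(url)
# raise an exception if the download failed (e.g. HTTP 404) instead of saving an error page
response.raise_for_status()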

# the with statement ensures that the file is closed as soon as the block finishes
# the file is opened for writing in binary mode
with open('image_predictions.tsv', 'wb') as file:
    # content of the response is written to the file
    file.write(response.content)

>*The file containing the image predictions is successfully saved in the working directory. Now, this data needs to be stored in a new dataframe for further steps of the data wrangling process. In order to read a TSV using pandas, the separator (sep) should be defined to be '\t'.*

In [4]:
image_predictions_df = pd.read_csv('image_predictions.tsv', sep='\t', index_col=None)
image_predictions_df.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


>*The additional data needs to be downloaded by querying the Twitter API using the tweepy library. In order to do that, create a Twitter developer account after signing in/up. Once the account is created, the consumer keys and authentication tokens will be available for use.*

>*It is not safe to expose the consumer keys and authentication tokens in code. Hence, a config file can be used and imported in this notebook. (In order to execute the rest of the notebook, please fill in the necessary details in the config.py file.)*

>*Authenticate using the consumer keys and set the access tokens.* 

In [5]:
# create an OAuthHandler instance
auth = tweepy.OAuthHandler(config.API_KEY, config.API_SECRET_KEY)
# set the access tokens
auth.set_access_token(config.ACCESS_TOKEN, config.ACCESS_TOKEN_SECRET)

# create the API instance
# wait_on_rate_limit – whether or not to automatically wait for rate limits to replenish
# wait_on_rate_limit_notify – whether or not to print a notification when Tweepy is waiting for rate limits to replenish
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
api

<tweepy.api.API at 0x7f412e60da58>

>*The API instance is created and ready for use now.*

>*The ID corresponding to each tweet is required in order to access the additional details of the tweet. These IDs are present in the twitter-archive-enhanced.csv file. Read the file and store it as a dataframe for further use.*

In [6]:
# read the file twitter-archive-enhanced.csv and store it in a dataframe 
twitter_archive_df = pd.read_csv('twitter-archive-enhanced.csv', index_col=None, encoding = 'utf-8')
twitter_archive_df.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU,,,,https://twitter.com/dog_rates/status/892420643555336193/photo/1,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 https://t.co/0Xxu71qeIV",,,,https://twitter.com/dog_rates/status/892177421306343426/photo/1,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10 https://t.co/wUnZnhtVJB,,,,https://twitter.com/dog_rates/status/891815181378084864/photo/1,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us https://t.co/tD36da7qLQ,,,,https://twitter.com/dog_rates/status/891689557279858688/photo/1,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","This is Franklin. He would like you to stop calling him ""cute."" He is a very fierce shark and should be respected as such. 12/10 #BarkWeek https://t.co/AtUZn91f7f",,,,"https://twitter.com/dog_rates/status/891327558926688256/photo/1,https://twitter.com/dog_rates/status/891327558926688256/photo/1",12,10,Franklin,,,,


In [None]:
total_number_of_tweets = len(twitter_archive_df.tweet_id)
number_of_failures = 0
# IDs of the tweets that could not be retrieved (a list, despite the variable name)
failed_tweets_dict = []

start = time.time()

# opening a text file in write mode and writing the JSON containing additional details of the tweet 
with open('tweet_json.txt', 'w') as txt_file:
    # looping over all the tweets whose IDs are present in the twitter_archive_df dataframe
    for tweet_id in twitter_archive_df.tweet_id:    
        try:
            # get a single status specified by the ID parameter
            # extended tweet mode gives the entire untruncated text of the Tweet
            tweet = api.get_status(tweet_id, tweet_mode='extended')
            json.dump(tweet._json, txt_file)
            txt_file.write('\n')
        except tweepy.TweepError as e:
            number_of_failures += 1
            failed_tweets_dict.append(tweet_id)
            print('Tweet ID:', tweet_id, '-', e)
            continue

print('Total number of tweets:', total_number_of_tweets)
print('Time taken:', (time.time()-start)/60, 'minutes')
print('Total number of failed tweets:', number_of_failures)
print('List of failed tweet IDs:', failed_tweets_dict)

>*Below are the explanations of the Twitter API error codes for the errors:*<br/>
>>*Error code 144 - Corresponds with HTTP 404. The requested Tweet ID is not found (if it existed, it was probably deleted).*
>
>>*Error code 179 - Corresponds with HTTP 403. Thrown when a Tweet cannot be viewed by the authenticating user, usually due to the Tweet’s author having protected their Tweets.*

>*The additional data corresponding to all the tweets in the dataframe are available in tweet_json.txt file. The next step is to read the file and get the required data from JSONs (corresponding to each of the tweets). Finally, store the data in a new dataframe.*

In [7]:
additional_data = []

# opening the tweet_json.txt file in read mode 
with open('tweet_json.txt', 'r') as infile:
    # looping over each line of the file
    for record in infile:
        # convert string to JSON
        record_json_data = json.loads(record)
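        # storing the required additional details in a list and appending it to the additional_data list
        # note: user.favourites_count is an account-level figure (tweets the account has liked),
        # not the number of likes on this tweet, which is the tweet-level favorite_count field
        user = record_json_data['user']
        additional_data.append([
            record_json_data['id'],
            record_json_data['retweet_count'],
            user['favourites_count'],
            user['followers_count'],
            record_json_data['created_at'],
        ])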

# creating a new dataframe using the additional_data list of lists 
additional_data_df = pd.DataFrame(additional_data)
# defining the column names of the dataframe
additional_data_df.columns = ['tweet_id', 'retweet_count', 'favourites_count', 'followers_count', 'created_at']
additional_data_df.head() 

Unnamed: 0,tweet_id,retweet_count,favourites_count,followers_count,created_at
0,892420643555336193,7492,145955,8876619,Tue Aug 01 16:23:56 +0000 2017
1,892177421306343426,5559,145955,8876619,Tue Aug 01 00:17:27 +0000 2017
2,891815181378084864,3681,145955,8876619,Mon Jul 31 00:18:03 +0000 2017
3,891689557279858688,7661,145955,8876619,Sun Jul 30 15:58:51 +0000 2017
4,891327558926688256,8273,145955,8876619,Sat Jul 29 16:00:24 +0000 2017


>*Checking the number of records to confirm that all valid records in twitter_archive_df dataframe have a corresponding record in additional_data_df dataframe.*

In [8]:
additional_data_df.shape

(2331, 5)

>*Except for the 25 tweet IDs that could not be retrieved, there is one record for each tweet in the archive (2356 - 25 = 2331). The data gathering step is now complete.*
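
>*The record count can be complemented by listing exactly which IDs are missing. A minimal sketch on toy data follows; the two series stand in for `twitter_archive_df.tweet_id` and `additional_data_df.tweet_id`, and the values are made up.*

```python
import pandas as pd

# toy stand-ins for the tweet_id columns of the two dataframes
archive_ids = pd.Series([101, 102, 103, 104], name='tweet_id')
retrieved_ids = pd.Series([101, 103], name='tweet_id')

# IDs present in the archive that have no retrieved JSON record
missing_ids = archive_ids[~archive_ids.isin(retrieved_ids)].tolist()
print(missing_ids)
```

>*Applied to the real dataframes, the length of this list should match the number of failed requests.*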

<a id='assess'></a>
## Assessing

>There are 2 types of issues that need to be assessed. They are:
>
>>1. Quality issues - issues with content. Low quality data is also known as dirty data.
>
>>2. Tidiness issues - issues with structure that prevent easy analysis. Untidy data is also known as messy data. Tidy data has three requirements: (1) each variable forms a column, (2) each observation forms a row, and (3) each type of observational unit forms a table.
>
>These issues can be assessed in 2 ways. They are:
>
>>1. Visual assessment - scrolling through the data in your preferred software application.
>
>>2. Programmatic assessment - using code to view specific portions and summaries of the data.

##### Visual assessment

#### 1. twitter_archive_df

In [9]:
twitter_archive_df.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU,,,,https://twitter.com/dog_rates/status/892420643555336193/photo/1,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 https://t.co/0Xxu71qeIV",,,,https://twitter.com/dog_rates/status/892177421306343426/photo/1,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10 https://t.co/wUnZnhtVJB,,,,https://twitter.com/dog_rates/status/891815181378084864/photo/1,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us https://t.co/tD36da7qLQ,,,,https://twitter.com/dog_rates/status/891689557279858688/photo/1,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","This is Franklin. He would like you to stop calling him ""cute."" He is a very fierce shark and should be respected as such. 12/10 #BarkWeek https://t.co/AtUZn91f7f",,,,"https://twitter.com/dog_rates/status/891327558926688256/photo/1,https://twitter.com/dog_rates/status/891327558926688256/photo/1",12,10,Franklin,,,,


In [10]:
twitter_archive_df.tail()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
2351,666049248165822465,,,2015-11-16 00:24:50 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Here we have a 1949 1st generation vulpix. Enjoys sweat tea and Fox News. Cannot be phased. 5/10 https://t.co/4B7cOc1EDq,,,,https://twitter.com/dog_rates/status/666049248165822465/photo/1,5,10,,,,,
2352,666044226329800704,,,2015-11-16 00:04:52 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is a purebred Piers Morgan. Loves to Netflix and chill. Always looks like he forgot to unplug the iron. 6/10 https://t.co/DWnyCjf2mx,,,,https://twitter.com/dog_rates/status/666044226329800704/photo/1,6,10,a,,,,
2353,666033412701032449,,,2015-11-15 23:21:54 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Here is a very happy pup. Big fan of well-maintained decks. Just look at that tongue. 9/10 would cuddle af https://t.co/y671yMhoiR,,,,https://twitter.com/dog_rates/status/666033412701032449/photo/1,9,10,a,,,,
2354,666029285002620928,,,2015-11-15 23:05:30 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is a western brown Mitsubishi terrier. Upset about leaf. Actually 2 dogs here. 7/10 would walk the shit out of https://t.co/r7mOb2m0UI,,,,https://twitter.com/dog_rates/status/666029285002620928/photo/1,7,10,a,,,,
2355,666020888022790149,,,2015-11-15 22:32:08 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Here we have a Japanese Irish Setter. Lost eye in Vietnam (?). Big fan of relaxing on stair. 8/10 would pet https://t.co/BLDqew2Ijj,,,,https://twitter.com/dog_rates/status/666020888022790149/photo/1,8,10,,,,,


In [11]:
twitter_archive_df.sample(5)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1588,686730991906516992,,,2016-01-12 02:06:41 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",I just love this picture. 12/10 lovely af https://t.co/Kc84eFNhYU,,,,https://twitter.com/dog_rates/status/686730991906516992/photo/1,12,10,,,,,
1359,703356393781329922,,,2016-02-26 23:10:06 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Socks. That water pup w the super legs just splashed him. Socks did not appreciate that. 9/10 and 2/10 https://t.co/8rc5I22bBf,,,,https://twitter.com/dog_rates/status/703356393781329922/photo/1,9,10,Socks,,,,
746,780074436359819264,,,2016-09-25 16:00:13 +0000,"<a href=""http://vine.co"" rel=""nofollow"">Vine - Make a Scene</a>",Here's a doggo questioning his entire existence. 10/10 someone tell him he's a good boy https://t.co/dVm5Hgdpeb,,,,https://vine.co/v/5nzYBpl0TY2,10,10,,doggo,,,
1259,710272297844797440,,,2016-03-17 01:11:26 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",We 👏🏻 only 👏🏻 rate 👏🏻 dogs. Pls stop sending in non-canines like this Dutch Panda Worm. This is infuriating. 11/10 https://t.co/odfLzBonG2,,,,https://twitter.com/dog_rates/status/710272297844797440/photo/1,11,10,infuriating,,,,
728,782021823840026624,,,2016-10-01 00:58:26 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",RT @dog_rates: This is Harper. She scraped her elbow attempting a backflip off a tree. Valiant effort tho. 12/10 https://t.co/oHKJHghrp5,7.076109e+17,4196984000.0,2016-03-09 16:56:11 +0000,"https://twitter.com/dog_rates/status/707610948723478529/photo/1,https://twitter.com/dog_rates/status/707610948723478529/photo/1",12,10,Harper,,,,


>*From the visual assessment performed by scrolling through select records of the dataframe, the following are the issue(s) identified:*
>
>>*Quality issues:*
>
>>> 1) In `twitter_archive_df`, the following columns have missing values:
>
>>>>   i. `in_reply_to_status_id`
>
>>>>   ii. `in_reply_to_user_id`
>
>>>>   iii. `retweeted_status_id`
>
>>>>   iv. `retweeted_status_user_id`
>
>>>>   v. `retweeted_status_timestamp`
>
>>> 2) Some dog names are not valid (e.g. a, None)
>
>>> 3) Records corresponding to retweets should be removed
>
>>*Tidiness issue:*
>
>>>In `twitter_archive_df`, the following columns should be combined into one (as each variable should be represented in a single column):
>
>>>>   i. `doggo`
>
>>>>   ii. `floofer`
>
>>>>   iii. `pupper`
>
>>>>   iv. `puppo`
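
>*A minimal sketch of how the four stage columns could be collapsed into one. It assumes that an absent stage is stored as the string `'None'`; the column name `dog_stage`, the helper `combine_stages` and the toy rows are hypothetical.*

```python
import pandas as pd

stage_columns = ['doggo', 'floofer', 'pupper', 'puppo']

# toy rows: one stage, no stage, and two stages in the same tweet
toy_df = pd.DataFrame({
    'tweet_id': ['1', '2', '3'],
    'doggo':   ['doggo', 'None', 'doggo'],
    'floofer': ['None', 'None', 'None'],
    'pupper':  ['None', 'None', 'pupper'],
    'puppo':   ['None', 'None', 'None'],
})


def combine_stages(row):
    """Join every stage present in the row; leave the value missing when there is none."""
    stages = [row[col] for col in stage_columns if row[col] != 'None']
    return ', '.join(stages) if stages else None


# build the single stage column, then drop the four original columns
toy_df['dog_stage'] = toy_df[stage_columns].apply(combine_stages, axis=1)
toy_df = toy_df.drop(columns=stage_columns)
print(toy_df)
```

>*Joining with a comma keeps tweets that mention more than one stage distinguishable instead of silently picking one.*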

#### 2. image_predictions_df

In [12]:
image_predictions_df.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


In [13]:
image_predictions_df.tail()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
2070,891327558926688256,https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg,2,basset,0.555712,True,English_springer,0.22577,True,German_short-haired_pointer,0.175219,True
2071,891689557279858688,https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg,1,paper_towel,0.170278,False,Labrador_retriever,0.168086,True,spatula,0.040836,False
2072,891815181378084864,https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg,1,Chihuahua,0.716012,True,malamute,0.078253,True,kelpie,0.031379,True
2073,892177421306343426,https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg,1,Chihuahua,0.323581,True,Pekinese,0.090647,True,papillon,0.068957,True
2074,892420643555336193,https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg,1,orange,0.097049,False,bagel,0.085851,False,banana,0.07611,False


In [14]:
image_predictions_df.sample(5)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
247,670668383499735048,https://pbs.twimg.com/media/CU6xVkbWsAAeHeU.jpg,1,banana,0.107317,False,orange,0.099662,False,bagel,0.089033,False
1943,861383897657036800,https://pbs.twimg.com/media/C_RAFTxUAAAbXjV.jpg,1,Cardigan,0.771008,True,Pembroke,0.137174,True,French_bulldog,0.063309,True
1619,802624713319034886,https://pbs.twimg.com/media/CsrjryzWgAAZY00.jpg,1,cocker_spaniel,0.253442,True,golden_retriever,0.16285,True,otterhound,0.110921,True
1863,842846295480000512,https://pbs.twimg.com/media/C7JkO0rX0AErh7X.jpg,1,Labrador_retriever,0.461076,True,golden_retriever,0.154946,True,Chihuahua,0.110249,True
1008,709207347839836162,https://pbs.twimg.com/media/CdecUSzUIAAHCvg.jpg,1,Chihuahua,0.948323,True,Italian_greyhound,0.01773,True,quilt,0.016688,False


>*From the visual assessment performed by scrolling through select records of the dataframe, the following are the issue(s) identified:*
>
>>*Quality issue:*
>
>>>The breeds of the dogs predicted in `p1`, `p2` and `p3` do not follow any standard naming (have underscores, lower case)
>
>>*Tidiness issue:*
>
>>>The `image_predictions_df` dataframe can be joined with the `twitter_archive_df` dataframe based on the tweet ID that is common to the two dataframes
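
>*One way the breed names could be standardized is to replace underscores with spaces and apply title case. This is a sketch of that convention on a few values taken from the output above, not necessarily the convention used later in the notebook.*

```python
import pandas as pd

# a few p1 values as they appear in image_predictions_df
toy_predictions = pd.DataFrame(
    {'p1': ['Welsh_springer_spaniel', 'redbone', 'German_shepherd']}
)

# replace underscores with spaces, then capitalize each word
toy_predictions['p1'] = (
    toy_predictions['p1'].str.replace('_', ' ', regex=False).str.title()
)
print(toy_predictions['p1'].tolist())
```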

#### 3. additional_data_df

In [15]:
additional_data_df.head()

Unnamed: 0,tweet_id,retweet_count,favourites_count,followers_count,created_at
0,892420643555336193,7492,145955,8876619,Tue Aug 01 16:23:56 +0000 2017
1,892177421306343426,5559,145955,8876619,Tue Aug 01 00:17:27 +0000 2017
2,891815181378084864,3681,145955,8876619,Mon Jul 31 00:18:03 +0000 2017
3,891689557279858688,7661,145955,8876619,Sun Jul 30 15:58:51 +0000 2017
4,891327558926688256,8273,145955,8876619,Sat Jul 29 16:00:24 +0000 2017


In [16]:
additional_data_df.tail()

Unnamed: 0,tweet_id,retweet_count,favourites_count,followers_count,created_at
2326,666049248165822465,40,145955,8876644,Mon Nov 16 00:24:50 +0000 2015
2327,666044226329800704,125,145955,8876644,Mon Nov 16 00:04:52 +0000 2015
2328,666033412701032449,39,145955,8876644,Sun Nov 15 23:21:54 +0000 2015
2329,666029285002620928,41,145955,8876644,Sun Nov 15 23:05:30 +0000 2015
2330,666020888022790149,454,145955,8876644,Sun Nov 15 22:32:08 +0000 2015


In [17]:
additional_data_df.sample(5)

Unnamed: 0,tweet_id,retweet_count,favourites_count,followers_count,created_at
553,801538201127157760,2063,145955,8876624,Wed Nov 23 21:29:33 +0000 2016
1308,705475953783398401,886,145955,8876630,Thu Mar 03 19:32:29 +0000 2016
1003,746056683365994496,792,145955,8876627,Thu Jun 23 19:05:49 +0000 2016
265,840370681858686976,4426,145955,8876621,Sat Mar 11 01:15:58 +0000 2017
502,810284430598270976,11204,145955,8876622,Sun Dec 18 00:43:57 +0000 2016


>*From the visual assessment performed by scrolling through select records of the dataframe, the following are the issue(s) identified:*
>
>>*Quality issue:*
>
>>>In `additional_data_df`, the `created_at` field is not in datetime format
>
>>*Tidiness issue:*
>
>>>The `additional_data_df` dataframe can be joined with the `twitter_archive_df` and `image_predictions_df` dataframes based on the tweet ID that is common to the three dataframes
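
>*A minimal sketch of converting `created_at` strings to datetimes with `pd.to_datetime`. The explicit format string is an assumption derived from the values shown above (for example `Tue Aug 01 16:23:56 +0000 2017`).*

```python
import pandas as pd

# two created_at values copied from the output above
created_at = pd.Series([
    'Tue Aug 01 16:23:56 +0000 2017',
    'Sun Nov 15 22:32:08 +0000 2015',
])

# weekday, month name, day, time, UTC offset, year
parsed = pd.to_datetime(created_at, format='%a %b %d %H:%M:%S %z %Y', utc=True)
print(parsed.dtype)
```

>*Once parsed, the `.dt` accessor (for example `parsed.dt.year`) becomes available for time-based analysis.*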

##### Programmatic assessment

#### 1. twitter_archive_df

In [18]:
# getting the basic information including missing values and data types of the fields in the dataframe
twitter_archive_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

>**Quality Issue(s):**
>
>1. The `timestamp` and `retweeted_status_timestamp` are not in datetime format
>
>2. The fields `in_reply_to_status_id`, `in_reply_to_user_id`, `retweeted_status_id` and `retweeted_status_user_id` should be strings
>
>3. `expanded_urls` column has some missing values
>
>4. `tweet_id` should not be an integer since numerical operations will not be performed on it
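
>*Beyond the fact that no arithmetic is performed on IDs, there is a practical reason to avoid `float64` for them: an 18-digit ID generally cannot be represented exactly as a float. The sketch below illustrates this with IDs taken from the archive and shows the string conversion.*

```python
import pandas as pd

tweet_id = 892420643555336193   # ID of the first row in the archive
as_float = float(tweet_id)      # what a float64 column would hold

# the round trip through float does not recover the original ID
round_trip_ok = int(as_float) == tweet_id
print(round_trip_ok)

# storing IDs as strings keeps every digit
ids = pd.Series([892420643555336193, 666020888022790149])
ids_as_str = ids.astype(str)
print(ids_as_str.tolist())
```

>*For columns that are already `float64` with missing values, the digits lost on load cannot be recovered by a later cast, so reading them as strings in the first place is the safer route.*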

In [19]:
# getting the stats for the numerical fields in the dataframe
twitter_archive_df.describe()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,retweeted_status_id,retweeted_status_user_id,rating_numerator,rating_denominator
count,2356.0,78.0,78.0,181.0,181.0,2356.0,2356.0
mean,7.427716e+17,7.455079e+17,2.014171e+16,7.7204e+17,1.241698e+16,13.126486,10.455433
std,6.856705e+16,7.582492e+16,1.252797e+17,6.236928e+16,9.599254e+16,45.876648,6.745237
min,6.660209e+17,6.658147e+17,11856340.0,6.661041e+17,783214.0,0.0,0.0
25%,6.783989e+17,6.757419e+17,308637400.0,7.186315e+17,4196984000.0,10.0,10.0
50%,7.196279e+17,7.038708e+17,4196984000.0,7.804657e+17,4196984000.0,11.0,10.0
75%,7.993373e+17,8.257804e+17,4196984000.0,8.203146e+17,4196984000.0,12.0,10.0
max,8.924206e+17,8.862664e+17,8.405479e+17,8.87474e+17,7.874618e+17,1776.0,170.0


In [20]:
twitter_archive_df['rating_denominator'].unique()

array([ 10,   0,  15,  70,   7,  11, 150, 170,  20,  50,  90,  80,  40,
       130, 110,  16, 120,   2])

In [21]:
twitter_archive_df['rating_numerator'].unique()

array([  13,   12,   14,    5,   17,   11,   10,  420,  666,    6,   15,
        182,  960,    0,   75,    7,   84,    9,   24,    8,    1,   27,
          3,    4,  165, 1776,  204,   50,   99,   80,   45,   60,   44,
        143,  121,   20,   26,    2,  144,   88])

>**Quality Issue(s):**
>
>1. Looking at the minimum, maximum and other quartile values for numerator and denominator ratings, it looks like they are incorrect in some records
>
>2. `tweet_id` should not be an integer since numerical operations will not be performed on it
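
>*A simple way to narrow down the suspicious ratings is to list the rows whose denominator is not 10 and read their text. The sketch below uses made-up toy rows; the column names follow `twitter_archive_df`.*

```python
import pandas as pd

# made-up rows: a normal rating, a date mistaken for a rating, and a group rating
toy_ratings = pd.DataFrame({
    'text': [
        'Good boy. 13/10',
        'Born 7/11 last year. 12/10 would pet',
        'A whole squad of puppers. 84/70',
    ],
    'rating_numerator': [13, 7, 84],
    'rating_denominator': [10, 11, 70],
})

# rows that deserve a manual look
suspect = toy_ratings[toy_ratings['rating_denominator'] != 10]
print(suspect[['text', 'rating_numerator', 'rating_denominator']])
```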

In [22]:
# checking if there are any duplicate records in the dataframe
twitter_archive_df.duplicated().sum()

0

#### 2. image_predictions_df

In [23]:
# getting the basic information including missing values and data types of the fields in the dataframe
image_predictions_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   int64  
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


>**Quality Issue(s):**
>
>1. There are some missing records in the dataframe (2075 records) since the `twitter_archive_df` dataframe has 2356 records. This issue should be revisited after the dataframes are merged.
>
>2. `tweet_id` should not be an integer since numerical operations will not be performed on it

In [24]:
# checking if there are any duplicate records in the dataframe
image_predictions_df.duplicated().sum()

0

#### 3. additional_data_df

In [25]:
# getting the basic information including missing values and data types of the fields in the dataframe
additional_data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2331 entries, 0 to 2330
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   tweet_id          2331 non-null   int64 
 1   retweet_count     2331 non-null   int64 
 2   favourites_count  2331 non-null   int64 
 3   followers_count   2331 non-null   int64 
 4   created_at        2331 non-null   object
dtypes: int64(4), object(1)
memory usage: 91.2+ KB


In [26]:
# checking if there are any duplicate records in the dataframe
additional_data_df.duplicated().sum()

0

>**Quality Issue(s):**
>
>There are some missing records in the dataframe since the `twitter_archive_df` dataframe has 2356 records (The missing 25 tweets had IDs that were not valid as discussed during data gathering). This issue should be revisited after the dataframes are merged. 

To summarize, the following are the quality and tidiness issues found in the data:

### Quality
##### `twitter_archive` table
- The following columns have missing values: `in_reply_to_status_id`, `in_reply_to_user_id`, `retweeted_status_id`, `retweeted_status_user_id`, `expanded_urls`, and `retweeted_status_timestamp`
- Some dog names (`name`) are not valid (e.g. a, None)
- The `timestamp` and `retweeted_status_timestamp` are not in datetime format
- Incorrect `rating_numerator` and `rating_denominator` values
- The fields `in_reply_to_status_id`, `in_reply_to_user_id`, `retweeted_status_id` and `retweeted_status_user_id` should be strings
- Records corresponding to retweets should be removed

##### `image_predictions` table
- The predicted dog breeds in `p1`, `p2` and `p3` do not follow any standard naming (have underscores, lower case)

##### `additional_data` table
- The `created_at` field is not in datetime format

##### `twitter_archive`, `image_predictions` & `additional_data` tables 
- `tweet_id` should not be an integer since numerical operations will not be performed on it
- There are some missing records in `image_predictions` (2075 records) and `additional_data_df` (2331 records) when compared to the `twitter_archive_df` (2356 records). This should be revisited after merging the dataframes.

### Tidiness
- The three dataframes should be merged into one since "Each type of observational unit forms a table"
- The four columns in the `twitter_archive_df` dataframe - `doggo`, `floofer`, `pupper`, and `puppo` - should be combined into a single field since "Each variable forms a column"
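
>*A minimal sketch of merging three dataframes on `tweet_id` with `functools.reduce`. The toy frames and the choice of an inner join are assumptions; an inner join keeps only the tweets present in all three tables, which is one way to deal with the missing records noted above.*

```python
from functools import reduce

import pandas as pd

# toy stand-ins for the three cleaned dataframes
archive = pd.DataFrame({'tweet_id': ['1', '2', '3'], 'rating_numerator': [13, 12, 11]})
predictions = pd.DataFrame({'tweet_id': ['1', '3'], 'p1': ['redbone', 'collie']})
extra = pd.DataFrame({'tweet_id': ['1', '2', '3'], 'retweet_count': [40, 125, 39]})

# fold the list of dataframes into one, joining pairwise on tweet_id
merged = reduce(
    lambda left, right: pd.merge(left, right, on='tweet_id', how='inner'),
    [archive, predictions, extra],
)
print(merged)
```

>*The merge relies on `tweet_id` having the same dtype in every frame, which is another reason to convert it to a string first.*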

>*The issues associated with the data are now assessed and segregated into quality and tidiness issues. The next step is to clean the data programmatically.*

<a id='clean'></a>
## Cleaning

>There are 2 types of cleaning. They are:
>
>>1. Manual (not recommended unless the issues are one-off occurrences)
>
>>2. Programmatic
>
>The programmatic data cleaning process includes 3 steps. They are:
>
>>1. Define: convert our assessments into defined cleaning tasks. These definitions also serve as an instruction list so others (or yourself in the future) can look at your work and reproduce it.
>
>>2. Code: convert those definitions to code and run that code.
>
>>3. Test: test your dataset, visually or with code, to make sure your cleaning operations worked.
>
>It is recommended to always make copies of the original pieces of data before cleaning.

In [27]:
# making copies of the dataframes
twitter_archive_clean_df = twitter_archive_df.copy()
image_predictions_clean_df = image_predictions_df.copy()
additional_data_clean_df = additional_data_df.copy()

### Missing Data

#### [Quality Issue 1]
#### `in_reply_to_status_id`, `in_reply_to_user_id`, `retweeted_status_id`, `retweeted_status_user_id`, `expanded_urls`, and `retweeted_status_timestamp`: Missing values

##### Define

There is no way to recover the missing values of `in_reply_to_status_id`, `in_reply_to_user_id`, and `expanded_urls`, and they are not needed for further analysis, so these columns can be dropped. The fields `retweeted_status_id`, `retweeted_status_user_id`, and `retweeted_status_timestamp` identify the records that correspond to retweets, so they are kept for now and used later to drop those records. Once the retweets are removed, these columns will be empty and can be dropped as well.

##### Code

In [28]:
# dropping the columns inplace
twitter_archive_clean_df.drop(['in_reply_to_status_id', 'in_reply_to_user_id', 'expanded_urls'], axis=1, inplace=True)

##### Test

In [29]:
# checking if the columns have been dropped
twitter_archive_clean_df.columns

Index(['tweet_id', 'timestamp', 'source', 'text', 'retweeted_status_id',
       'retweeted_status_user_id', 'retweeted_status_timestamp',
       'rating_numerator', 'rating_denominator', 'name', 'doggo', 'floofer',
       'pupper', 'puppo'],
      dtype='object')

### Tidiness

#### [Tidiness Issue 1]
#### Merging the dataframes

##### Define

An inner join is to be performed on the 3 dataframes. Since `merge()` can join only 2 dataframes at a time, we use `reduce()`, which applies the function passed to it cumulatively to the elements of a sequence, so each dataframe is merged into the running result. The inner join is performed on `tweet_id`. With an inner join, only the records whose `tweet_id` is present in all the dataframes appear in the final dataframe.
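As a minimal sketch of how this chained inner join behaves, the toy example below merges three small dataframes. The frames and their values are invented purely for illustration. `reduce()` folds the list from left to right, so the first two frames are merged and that result is then merged with the third.

```python
from functools import reduce

import pandas as pd

# toy dataframes (hypothetical values) that share a tweet_id column
left_df = pd.DataFrame({'tweet_id': [1, 2, 3], 'rating_numerator': [13, 12, 11]})
middle_df = pd.DataFrame({'tweet_id': [1, 2, 4], 'p1': ['pug', 'chow', 'beagle']})
right_df = pd.DataFrame({'tweet_id': [2, 3, 1], 'retweet_count': [20, 30, 10]})

# reduce() applies the merge pairwise: (left_df, middle_df), then (that result, right_df)
toy_master_df = reduce(
    lambda left, right: pd.merge(left, right, on='tweet_id', how='inner'),
    [left_df, middle_df, right_df],
)

# only tweet_ids 1 and 2 appear in all three frames, so only they survive
print(toy_master_df)
```

Records that are missing from any one of the frames (tweet_ids 3 and 4 here) are silently dropped, which is why the merged dataframe can be smaller than each of its inputs.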

##### Code

In [30]:
# creating a list containing all the dataframes to be merged
dfs = [twitter_archive_clean_df, image_predictions_clean_df, additional_data_clean_df]

# inner join is performed on the dataframes  
master_df = reduce(lambda left,right: pd.merge(left, right, on='tweet_id', how='inner'), dfs)
master_df.head()

Unnamed: 0,tweet_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,rating_numerator,rating_denominator,name,...,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog,retweet_count,favourites_count,followers_count,created_at
0,892420643555336193,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU,,,,13,10,Phineas,...,bagel,0.085851,False,banana,0.07611,False,7492,145955,8876619,Tue Aug 01 16:23:56 +0000 2017
1,892177421306343426,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 https://t.co/0Xxu71qeIV",,,,13,10,Tilly,...,Pekinese,0.090647,True,papillon,0.068957,True,5559,145955,8876619,Tue Aug 01 00:17:27 +0000 2017
2,891815181378084864,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10 https://t.co/wUnZnhtVJB,,,,12,10,Archie,...,malamute,0.078253,True,kelpie,0.031379,True,3681,145955,8876619,Mon Jul 31 00:18:03 +0000 2017
3,891689557279858688,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us https://t.co/tD36da7qLQ,,,,13,10,Darla,...,Labrador_retriever,0.168086,True,spatula,0.040836,False,7661,145955,8876619,Sun Jul 30 15:58:51 +0000 2017
4,891327558926688256,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","This is Franklin. He would like you to stop calling him ""cute."" He is a very fierce shark and should be respected as such. 12/10 #BarkWeek https://t.co/AtUZn91f7f",,,,12,10,Franklin,...,English_springer,0.22577,True,German_short-haired_pointer,0.175219,True,8273,145955,8876619,Sat Jul 29 16:00:24 +0000 2017


In [31]:
# getting missing value count and data types of all the columns of the merged dataframe
master_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2059 entries, 0 to 2058
Data columns (total 29 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2059 non-null   int64  
 1   timestamp                   2059 non-null   object 
 2   source                      2059 non-null   object 
 3   text                        2059 non-null   object 
 4   retweeted_status_id         72 non-null     float64
 5   retweeted_status_user_id    72 non-null     float64
 6   retweeted_status_timestamp  72 non-null     object 
 7   rating_numerator            2059 non-null   int64  
 8   rating_denominator          2059 non-null   int64  
 9   name                        2059 non-null   object 
 10  doggo                       2059 non-null   object 
 11  floofer                     2059 non-null   object 
 12  pupper                      2059 non-null   object 
 13  puppo                       2059 

##### Test

In [32]:
# checking the shape of the merged dataframe
master_df.shape

(2059, 29)

In [33]:
# checking the columns of the merged dataframe
master_df.columns

Index(['tweet_id', 'timestamp', 'source', 'text', 'retweeted_status_id',
       'retweeted_status_user_id', 'retweeted_status_timestamp',
       'rating_numerator', 'rating_denominator', 'name', 'doggo', 'floofer',
       'pupper', 'puppo', 'jpg_url', 'img_num', 'p1', 'p1_conf', 'p1_dog',
       'p2', 'p2_conf', 'p2_dog', 'p3', 'p3_conf', 'p3_dog', 'retweet_count',
       'favourites_count', 'followers_count', 'created_at'],
      dtype='object')

#### [Tidiness Issue 2]
#### Maintaining single column for dog stage

##### Define

The 4 columns corresponding to the individual stages are to be combined into one column. First, we check whether each dog is classified under only one stage and record the counts needed for verification after cleaning. Next, we combine the 4 columns into a single column containing a list of all their values. Finally, we reduce each list to its unique values. If the resulting list contains only one value, nothing more is done; otherwise, the 'None' value is removed. The resulting list is taken as the list of stages of the dog.

##### Code

Getting the expected number of each stage

In [34]:
master_df.doggo.value_counts()

None     1981
doggo      78
Name: doggo, dtype: int64

In [35]:
master_df.floofer.value_counts()

None       2051
floofer       8
Name: floofer, dtype: int64

In [36]:
master_df.pupper.value_counts()

None      1838
pupper     221
Name: pupper, dtype: int64

In [37]:
master_df.puppo.value_counts()

None     2035
puppo      24
Name: puppo, dtype: int64

In [38]:
# counting the records that have no stage at all
master_df[(master_df.doggo == 'None') & (master_df.floofer == 'None') & (master_df.pupper == 'None') & (master_df.puppo == 'None')].shape[0]

1741

In summary,

Number of doggos = 78 <br/>
Number of floofers = 8 <br/>
Number of puppers = 221 <br/>
Number of puppos = 24 <br/>
Number of Nones = 1741 <br/>

These numbers add up to 2072, which exceeds the total number of records (2059). Hence, we can infer that a few of the dogs are categorized under more than one stage.
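The size of the overlap can be worked out from the counts above. This is a small arithmetic check that uses only the numbers already reported; the variable names are illustrative.

```python
# counts reported by the value_counts() calls above
stage_counts = {'doggo': 78, 'floofer': 8, 'pupper': 221, 'puppo': 24}
records_without_stage = 1741
total_records = 2059

# records that carry at least one stage label
records_with_stage = total_records - records_without_stage

# total number of stage labels across all records
stage_labels = sum(stage_counts.values())

# each extra label belongs to a record that already has another stage
extra_labels = stage_labels - records_with_stage

print(records_with_stage, stage_labels, extra_labels)  # 318 331 13
```

If no record carries more than two stages, this corresponds to 13 records that are labelled with two stages each.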

In [39]:
# combining the columns into a single column of stages
master_df['dog_stages']= master_df[['doggo', 'floofer', 'pupper', 'puppo']].values.tolist()

# dropping the individual stage columns
master_df.drop(['doggo', 'floofer', 'pupper', 'puppo'], axis=1, inplace=True)

In [40]:
# creating a new column in the dataframe to hold the final values of the dog stage(s)
master_df['dog_stage'] = pd.Series(dtype='object')

for index, row in master_df.iterrows():
    # getting unique list of stage(s) of the dog
    stages = list(set(row.dog_stages))
    # if the list has more than one value, remove 'None' from the list
    if len(stages) > 1:
        stages.remove('None')
    # assigning the (string representation) list of stages under which the dog is categorized 
    master_df.loc[index, 'dog_stage'] = str(stages)

# drop the intermediate dog_stages column
master_df.drop('dog_stages', axis=1, inplace=True)
    
master_df.head()

  


Unnamed: 0,tweet_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,rating_numerator,rating_denominator,name,...,p2_conf,p2_dog,p3,p3_conf,p3_dog,retweet_count,favourites_count,followers_count,created_at,dog_stage
0,892420643555336193,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU,,,,13,10,Phineas,...,0.085851,False,banana,0.07611,False,7492,145955,8876619,Tue Aug 01 16:23:56 +0000 2017,['None']
1,892177421306343426,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 https://t.co/0Xxu71qeIV",,,,13,10,Tilly,...,0.090647,True,papillon,0.068957,True,5559,145955,8876619,Tue Aug 01 00:17:27 +0000 2017,['None']
2,891815181378084864,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10 https://t.co/wUnZnhtVJB,,,,12,10,Archie,...,0.078253,True,kelpie,0.031379,True,3681,145955,8876619,Mon Jul 31 00:18:03 +0000 2017,['None']
3,891689557279858688,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us https://t.co/tD36da7qLQ,,,,13,10,Darla,...,0.168086,True,spatula,0.040836,False,7661,145955,8876619,Sun Jul 30 15:58:51 +0000 2017,['None']
4,891327558926688256,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","This is Franklin. He would like you to stop calling him ""cute."" He is a very fierce shark and should be respected as such. 12/10 #BarkWeek https://t.co/AtUZn91f7f",,,,12,10,Franklin,...,0.22577,True,German_short-haired_pointer,0.175219,True,8273,145955,8876619,Sat Jul 29 16:00:24 +0000 2017,['None']


In [41]:
# getting data types and missing values of each field
master_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2059 entries, 0 to 2058
Data columns (total 26 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2059 non-null   int64  
 1   timestamp                   2059 non-null   object 
 2   source                      2059 non-null   object 
 3   text                        2059 non-null   object 
 4   retweeted_status_id         72 non-null     float64
 5   retweeted_status_user_id    72 non-null     float64
 6   retweeted_status_timestamp  72 non-null     object 
 7   rating_numerator            2059 non-null   int64  
 8   rating_denominator          2059 non-null   int64  
 9   name                        2059 non-null   object 
 10  jpg_url                     2059 non-null   object 
 11  img_num                     2059 non-null   int64  
 12  p1                          2059 non-null   object 
 13  p1_conf                     2059 

##### Test

In [42]:
# checking the shape of the dataframe
master_df.shape

(2059, 26)

In [43]:
# checking the columns of the dataframe
master_df.columns

Index(['tweet_id', 'timestamp', 'source', 'text', 'retweeted_status_id',
       'retweeted_status_user_id', 'retweeted_status_timestamp',
       'rating_numerator', 'rating_denominator', 'name', 'jpg_url', 'img_num',
       'p1', 'p1_conf', 'p1_dog', 'p2', 'p2_conf', 'p2_dog', 'p3', 'p3_conf',
       'p3_dog', 'retweet_count', 'favourites_count', 'followers_count',
       'created_at', 'dog_stage'],
      dtype='object')

In [44]:
# get the distribution of dog stages in the dataset
master_df.dog_stage.value_counts()

['None']                1741
['pupper']               210
['doggo']                 65
['puppo']                 23
['doggo', 'pupper']       11
['floofer']                7
['doggo', 'floofer']       1
['doggo', 'puppo']         1
Name: dog_stage, dtype: int64
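The row-by-row loop used in the Code step could also be written with `DataFrame.apply`. The following is only a sketch on invented toy rows, not the approach taken in this notebook: it joins the stages into a comma-separated string in column order, whereas the loop above stores the string form of a list built from a `set`, whose element order is not guaranteed. The helper name `combine_stages` is hypothetical.

```python
import pandas as pd

stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']

# toy rows: no stage, one stage, two stages
toy_df = pd.DataFrame({
    'doggo':   ['None', 'doggo', 'doggo'],
    'floofer': ['None', 'None', 'None'],
    'pupper':  ['None', 'None', 'pupper'],
    'puppo':   ['None', 'None', 'None'],
})

def combine_stages(row):
    """Join every stage that is not 'None'; fall back to 'None' if there is none."""
    stages = [value for value in row if value != 'None']
    return ', '.join(stages) if stages else 'None'

# apply the helper to each row and drop the individual stage columns
toy_df['dog_stage'] = toy_df[stage_cols].apply(combine_stages, axis=1)
toy_df = toy_df.drop(columns=stage_cols)

print(toy_df['dog_stage'].tolist())  # ['None', 'doggo', 'doggo, pupper']
```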

### Quality

#### [Quality Issue 2]
#### Remove records corresponding to retweets

##### Define

Only original tweets are to be considered for analysis. The records corresponding to retweets can be identified by the fact that the fields `retweeted_status_id`, `retweeted_status_user_id`, and `retweeted_status_timestamp` have a value. In all other records, these fields are NaN. So, the task is to remove all the records for which the retweet-related values are present.

##### Code

In [45]:
# get shape of the dataframe for future verification
master_df.shape

(2059, 26)

In [46]:
# create a dataframe containing the records to be dropped (corresponding to retweets)
filtered_df = master_df[(master_df.retweeted_status_id.notnull()) & (master_df.retweeted_status_user_id.notnull()) & (master_df.retweeted_status_timestamp.notnull())]

# drop the records corresponding to retweets
master_df.drop(filtered_df.index, inplace = True) 

##### Test

In [47]:
# getting the shape to verify that 72 records have been dropped
master_df.shape 

(1987, 26)
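Since all three retweet fields have the same 72 non-null records in the merged dataframe, the same filter can be expressed with a single boolean mask, after which the now-empty retweet columns can be dropped as planned under Quality Issue 1. This is a sketch on invented toy data, and it assumes that checking `retweeted_status_id` alone is sufficient.

```python
import pandas as pd

# toy data (hypothetical values): the second record is a retweet
toy_df = pd.DataFrame({
    'tweet_id': ['101', '102', '103'],
    'retweeted_status_id': [None, 7.0e17, None],
    'retweeted_status_user_id': [None, 4.0e9, None],
    'retweeted_status_timestamp': [None, '2017-07-15 02:44:07 +0000', None],
})

retweet_cols = ['retweeted_status_id', 'retweeted_status_user_id', 'retweeted_status_timestamp']

# keep only the rows where the retweet id is missing, i.e. original tweets
originals_df = toy_df[toy_df['retweeted_status_id'].isnull()].copy()

# the retweet columns are now entirely empty, so they can be dropped
assert originals_df[retweet_cols].isnull().all().all()
originals_df = originals_df.drop(columns=retweet_cols)

print(originals_df)
```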

#### [Quality Issue 3]
#### Incorrect `rating_denominator` values

##### Define

The denominator value should always be 10. We first look at all the cases where the denominator is greater than 10, decide whether the denominator can be set to 10, and adjust the corresponding numerators accordingly.

##### Code

In [48]:
# getting all the records where denominator of the rating is greater than 10
master_df[master_df['rating_denominator'] > 10][['tweet_id', 'text', 'rating_denominator', 'rating_numerator']]

Unnamed: 0,tweet_id,text,rating_denominator,rating_numerator
336,820690176645140481,The floofs have been released I repeat the floofs have been released. 84/70 https://t.co/NIYC820tmd,70,84
722,758467244762497024,Why does this never happen at my front door... 165/150 https://t.co/HmwrdfEfUE,150,165
863,740373189193256964,"After so many requests, this is Bretagne. She was the last surviving 9/11 search dog, and our second ever 14/10. RIP https://t.co/XAVDNDaVgQ",11,9
911,731156023742988288,Say hello to this unbelievably well behaved squad of doggos. 204/170 would try to pet all at once https://t.co/yGQI3He3xv,170,204
954,722974582966214656,Happy 4/20 from the squad! 13/10 for all https://t.co/eV1diwds8a,20,4
988,716439118184652801,This is Bluebert. He just saw that both #FinalFur match ups are split 50/50. Amazed af. 11/10 https://t.co/Kky1DPG4iq,50,50
1009,713900603437621249,Happy Saturday here's 9 puppers on a bench. 99/90 good work everybody https://t.co/mpvaVxKmc1,90,99
1034,710658690886586372,Here's a brigade of puppers. All look very prepared for whatever happens next. 80/80 https://t.co/0eb7R1Om12,80,80
1052,709198395643068416,"From left to right:\nCletus, Jerome, Alejandro, Burp, &amp; Titson\nNone know where camera is. 45/50 would hug all at once https://t.co/sedre1ivTK",50,45
1118,704054845121142784,Here is a whole flock of puppers. 60/50 I'll take the lot https://t.co/9dpcw6MdWa,50,60


Looking at the `text` of each record, we can see that most of these denominators arise because multiple dogs are being rated at once. Naturally, the numerators of the corresponding ratings should also be scaled.

In the two records with tweet_ids 682962037429899265 and 740373189193256964, the wrong fraction has been picked up. These can be corrected from the text.

In [49]:
# getting all the records where denominator of the rating is greater than 10 and is a multiple of 10
master_df[(master_df['rating_denominator'] > 10) & (master_df['rating_denominator'] % 10 == 0)][['tweet_id', 'text', 'rating_denominator', 'rating_numerator']]

Unnamed: 0,tweet_id,text,rating_denominator,rating_numerator
336,820690176645140481,The floofs have been released I repeat the floofs have been released. 84/70 https://t.co/NIYC820tmd,70,84
722,758467244762497024,Why does this never happen at my front door... 165/150 https://t.co/HmwrdfEfUE,150,165
911,731156023742988288,Say hello to this unbelievably well behaved squad of doggos. 204/170 would try to pet all at once https://t.co/yGQI3He3xv,170,204
954,722974582966214656,Happy 4/20 from the squad! 13/10 for all https://t.co/eV1diwds8a,20,4
988,716439118184652801,This is Bluebert. He just saw that both #FinalFur match ups are split 50/50. Amazed af. 11/10 https://t.co/Kky1DPG4iq,50,50
1009,713900603437621249,Happy Saturday here's 9 puppers on a bench. 99/90 good work everybody https://t.co/mpvaVxKmc1,90,99
1034,710658690886586372,Here's a brigade of puppers. All look very prepared for whatever happens next. 80/80 https://t.co/0eb7R1Om12,80,80
1052,709198395643068416,"From left to right:\nCletus, Jerome, Alejandro, Burp, &amp; Titson\nNone know where camera is. 45/50 would hug all at once https://t.co/sedre1ivTK",50,45
1118,704054845121142784,Here is a whole flock of puppers. 60/50 I'll take the lot https://t.co/9dpcw6MdWa,50,60
1194,697463031882764288,Happy Wednesday here's a bucket of pups. 44/40 would pet all at once https://t.co/HppvrYuamZ,40,44


In [50]:
# creating a dataframe with the required conditions
filtered_df = master_df.loc[(master_df['rating_denominator'] > 10) & (master_df['rating_denominator'] % 10 == 0)]

# iterating through the records to update the numerator and denominator
for index, row in filtered_df.iterrows():
    # since there are multiple dogs, divide the numerator by the number of dogs
    master_df.loc[index, 'rating_numerator'] = row.rating_numerator/(row.rating_denominator/10)
    master_df.loc[index, 'rating_denominator'] = 10

In [51]:
# checking for other records with denominator of the rating greater than 10
master_df[master_df['rating_denominator'] > 10][['tweet_id', 'text', 'rating_denominator', 'rating_numerator']]

Unnamed: 0,tweet_id,text,rating_denominator,rating_numerator
863,740373189193256964,"After so many requests, this is Bretagne. She was the last surviving 9/11 search dog, and our second ever 14/10. RIP https://t.co/XAVDNDaVgQ",11,9
1392,682962037429899265,This is Darrel. He just robbed a 7/11 and is in a high speed police chase. Was just spotted by the helicopter 10/10 https://t.co/7EsP8LmSp5,11,7


In [52]:
# the numerators and denominators of the ratings of the tweets are updated using the text
master_df.loc[(master_df['tweet_id'] == 682962037429899265), 'rating_denominator'] = 10
master_df.loc[(master_df['tweet_id'] == 682962037429899265), 'rating_numerator'] = 10

master_df.loc[(master_df['tweet_id'] == 740373189193256964), 'rating_denominator'] = 10
master_df.loc[(master_df['tweet_id'] == 740373189193256964), 'rating_numerator'] = 14
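When a tweet contains more than one fraction, one possible heuristic is to take the last fraction whose denominator is 10. The sketch below is an assumption for illustration, not the extraction used to build the archive; the function name `rating_out_of_ten` and the regular expression are hypothetical. Applied to the tweet texts shown earlier, it also flags the 4/20 and 50/50 tweets (tweet_ids 722974582966214656 and 716439118184652801), whose texts give ratings of 13/10 and 11/10 but which the multiple-of-10 scaling above handles as group ratings.

```python
import re

# matches fractions such as 13/10 or 9.75/10 (integer or decimal numerator)
FRACTION_PATTERN = re.compile(r'(\d+(?:\.\d+)?)/(\d+)')

def rating_out_of_ten(text):
    """Return the numerator of the last 'x/10' fraction in text, or None."""
    numerators = [
        float(numerator)
        for numerator, denominator in FRACTION_PATTERN.findall(text)
        if int(denominator) == 10
    ]
    return numerators[-1] if numerators else None

# tweet texts copied from the tables above
texts = {
    722974582966214656: 'Happy 4/20 from the squad! 13/10 for all https://t.co/eV1diwds8a',
    716439118184652801: 'This is Bluebert. He just saw that both #FinalFur match ups are split 50/50. Amazed af. 11/10 https://t.co/Kky1DPG4iq',
    740373189193256964: 'After so many requests, this is Bretagne. She was the last surviving 9/11 search dog, and our second ever 14/10. RIP https://t.co/XAVDNDaVgQ',
    682962037429899265: 'This is Darrel. He just robbed a 7/11 and is in a high speed police chase. Was just spotted by the helicopter 10/10 https://t.co/7EsP8LmSp5',
}

for tweet_id, text in texts.items():
    print(tweet_id, rating_out_of_ten(text))
```

A result of `None` would mark a tweet that has no rating out of 10 in its text, which still needs a manual look.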

##### Test

In [53]:
# checking if there are any more records where denominator of the rating is greater than 10
master_df[master_df['rating_denominator']>10][['tweet_id', 'text', 'rating_denominator', 'rating_numerator']]

Unnamed: 0,tweet_id,text,rating_denominator,rating_numerator


#### [Quality Issue 4]
#### Incorrect `rating_numerator` values

##### Define

First, check the records with a numerator of 20 or more, since it has been established that most ratings are intentionally given a numerator greater than 10 and only much larger values are suspicious. From the text, update the numerators. Wherever there is no rating or the rating is invalid, set the numerator and denominator to 10.

##### Code

In [54]:
# getting all the records where numerator of the rating is 20 or more
master_df[master_df['rating_numerator'] >= 20][['tweet_id', 'text', 'rating_denominator', 'rating_numerator']]

Unnamed: 0,tweet_id,text,rating_denominator,rating_numerator
406,810984652412424192,Meet Sam. She smiles 24/7 &amp; secretly aspires to be a reindeer. \nKeep Sam smiling by clicking and sharing this link:\nhttps://t.co/98tB8y7y7t https://t.co/LouL5vdvxx,7,24
548,786709082849828864,"This is Logan, the Chow who lived. He solemnly swears he's up to lots of good. H*ckin magical af 9.75/10 https://t.co/yBO5wuqaPS",10,75
603,778027034220126208,This is Sophie. She's a Jubilant Bush Pupper. Super h*ckin rare. Appears at random just to smile at the locals. 11.27/10 would smile back https://t.co/QFaUiIHxHq,10,27
789,749981277374128128,This is Atticus. He's quite simply America af. 1776/10 https://t.co/GRXwMxLBkh,10,1776
1438,680494726643068929,Here we have uncovered an entire battalion of holiday puppers. Average of 11.26/10 https://t.co/eNm2S6p9BD,10,26
1781,670842764863651840,After so many requests... here you go.\n\nGood dogg. 420/10 https://t.co/yfAAo1gdeY,10,420


In [55]:
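# casting the numerator to float so that decimal ratings can be stored
master_df['rating_numerator'] = master_df['rating_numerator'].astype(float)

# the numerators of the ratings of the tweets are updated using the text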
master_df.loc[(master_df['tweet_id'] == 786709082849828864), 'rating_numerator'] = 9.75
master_df.loc[(master_df['tweet_id'] == 778027034220126208), 'rating_numerator'] = 11.27
master_df.loc[(master_df['tweet_id'] == 749981277374128128), 'rating_numerator'] = 17.76
master_df.loc[(master_df['tweet_id'] == 680494726643068929), 'rating_numerator'] = 11.26

# approximating numerator and denominator to 10 wherever there is no rating/no valid rating in the text
master_df.loc[(master_df['tweet_id'] == 670842764863651840), 'rating_numerator'] = 10

# this tweet has no rating (24/7 was picked up), so both numerator and denominator are set to 10
master_df.loc[(master_df['tweet_id'] == 810984652412424192), 'rating_numerator'] = 10
master_df.loc[(master_df['tweet_id'] == 810984652412424192), 'rating_denominator'] = 10
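The truncated numerators above (75, 27 and 26) are what an integer-only pattern would capture from 9.75/10, 11.27/10 and 11.26/10. As a hedged sketch, assuming the ratings were originally extracted with a regular expression, a decimal-aware pattern with `Series.str.extract` recovers the full values. The tweet texts are abbreviated and both patterns are illustrative.

```python
import pandas as pd

# abbreviated tweet texts containing decimal ratings
texts = pd.Series([
    'H*ckin magical af 9.75/10',
    'Appears at random just to smile at the locals. 11.27/10 would smile back',
    'Average of 11.26/10',
])

# integer-only pattern: captures only the digits after the decimal point
integer_only = texts.str.extract(r'(\d+)/(\d+)')[0].astype(int)

# decimal-aware pattern: optional fractional part in the numerator
decimal_aware = texts.str.extract(r'(\d+(?:\.\d+)?)/(\d+)')[0].astype(float)

print(integer_only.tolist())   # [75, 27, 26]
print(decimal_aware.tolist())  # [9.75, 11.27, 11.26]
```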

##### Test

In [56]:
# checking if there are any more records where numerator of the rating is 20 or more
master_df[master_df['rating_numerator'] >= 20][['tweet_id', 'text', 'rating_denominator', 'rating_numerator']]

Unnamed: 0,tweet_id,text,rating_denominator,rating_numerator


#### [Quality Issue 5]
#### Invalid dog names

##### Define

##### Code

##### Test

#### [Quality Issue 6]
#### Naming convention of breed of dog

##### Define

##### Code

##### Test
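The steps for this issue are not filled in above. One way it could be approached, shown purely as a sketch and not as the notebook's method, is to replace underscores with spaces and apply title case to the prediction columns. The sample values are taken from the `p2` and `p3` columns of the merged dataframe preview; the target convention (title case with spaces) is an assumption.

```python
import pandas as pd

# sample breed predictions copied from the merged dataframe preview
toy_df = pd.DataFrame({
    'p2': ['Labrador_retriever', 'English_springer', 'malamute'],
    'p3': ['spatula', 'German_short-haired_pointer', 'kelpie'],
})

# replace underscores with spaces, then capitalize each word
for column in ['p2', 'p3']:
    toy_df[column] = toy_df[column].str.replace('_', ' ', regex=False).str.title()

print(toy_df)
```

A simple test is to confirm that no value in these columns still contains an underscore.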

#### [Quality Issue 7]
#### `tweet_id` datatype

##### Define

##### Code

##### Test

#### [Quality Issue 8]
#### `timestamp` and `created_at` datatype

##### Define

##### Code

##### Test
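The steps for this issue are likewise left blank. As a sketch under assumptions, both columns could be parsed with `pd.to_datetime`. The two sample values come from the first row of the merged dataframe preview, and the explicit format string for `created_at` is inferred from that sample rather than stated in the source.

```python
import pandas as pd

# sample values copied from the first row of the merged dataframe preview
toy_df = pd.DataFrame({
    'timestamp': ['2017-08-01 16:23:56 +0000'],
    'created_at': ['Tue Aug 01 16:23:56 +0000 2017'],
})

# ISO-like strings can be parsed directly; utc=True yields timezone-aware values
toy_df['timestamp'] = pd.to_datetime(toy_df['timestamp'], utc=True)

# the created_at style needs an explicit format (assumed from the sample value)
toy_df['created_at'] = pd.to_datetime(
    toy_df['created_at'], format='%a %b %d %H:%M:%S %z %Y', utc=True
)

# both columns are now datetime columns
print(toy_df.dtypes)
```

In this sample the two columns describe the same instant, which suggests that one of them may be redundant after conversion.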

<a id='analyze&visualize'></a>
## Analyze and Visualize

<a id='conclusion'></a>
## Conclusion

>*CONCLUDING REMARKS:*

>*REFERENCES:*