# Wrangle and Analyze Data
## Project Details: WeRateDogs
Tasks in this project are as follows:

- Data wrangling, which consists of:
    1. Gathering data
    2. Assessing data
    3. Cleaning data
- Storing, analyzing, and visualizing the wrangled data
- Reporting on 
    1. Data wrangling efforts
    2. Data analyses and visualizations

## Contents
- [Project Details and overview](#overview)
- [Points to Note](#note)
- [Data Gathering](#gathering)
- [Data Assessing](#assessing)
- [Data Cleaning](#cleaning)

<a id='overview'></a>
## Project Details and overview
[*Ref: Project Overview section under concepts in Wrangle and Analyze Data*]

Real-world data rarely comes clean. Using Python and its libraries, you will gather data from a variety of sources and in a variety of formats, assess its quality and tidiness, then clean it. This is called data wrangling. You will document your wrangling efforts in a Jupyter Notebook, plus showcase them through analyses and visualizations using Python (and its libraries) and/or SQL.

The dataset that you will be wrangling (and analyzing and visualizing) is the tweet archive of Twitter user @dog_rates, also known as WeRateDogs. WeRateDogs is a Twitter account that rates people's dogs with a humorous comment about the dog. These ratings almost always have a denominator of 10. The numerators, though? Almost always greater than 10. 11/10, 12/10, 13/10, etc. Why? Because "they're good dogs Brent." WeRateDogs has over 4 million followers and has received international media coverage.

---------------------

<a id='note'></a>
## Points to Note:

- We only require original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.

- Fully assessing and cleaning the entire dataset requires exceptional effort so only a subset of its issues (eight (8) quality issues and two (2) tidiness issues at minimum) need to be assessed and cleaned.

- Cleaning includes merging individual pieces of data according to the rules of tidy data.

- The fact that the rating numerators are greater than the denominators does not need to be cleaned. This unique rating system is a big part of the popularity of WeRateDogs.

- We do not need to gather the tweets beyond August 1st, 2017. We can, but note that we won't be able to gather the image predictions for these tweets since we don't have access to the algorithm used.

----------------------------

<a id='gathering'></a>
## Data Gathering
In this project we analyse the WeRateDogs ([@dog_rates](https://twitter.com/dog_rates)) Twitter account. We gather data both manually and programmatically: the manual part is covered by the files provided by the Udacity team.
For the programmatic part, we use the Twitter API and send requests to get the data.

1. **twitter-archive-enhanced.csv**: provided by the Udacity team
2. **image-predictions.tsv**: provided by the Udacity team; can be retrieved with a request to the endpoint below, without authorization
    - [URL of the file](https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv)
3. Tweet data (retweet counts, favorite counts, etc.) gathered from the Twitter API using the Tweepy library with personal API credentials


### Let's go through the three data sources above one at a time

#### 1. twitter-archive-enhanced.csv: provided by the Udacity team

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

In [4]:
archive_data = pd.read_csv('twitter-archive-enhanced.csv')

In [5]:
archive_data.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


In [9]:
archive_data.shape

(2356, 17)

In [20]:
archive_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

The data set consists of
- 2356 observations
- 17 features

Of these, the following features contain null values:
- in_reply_to_status_id
- in_reply_to_user_id
- retweeted_status_id
- retweeted_status_user_id
- retweeted_status_timestamp
- expanded_urls
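
The null counts read off `info()` can also be computed directly. The following is a minimal sketch on a small stand-in frame: the values are made up for illustration, and only the column names come from the archive.

```python
import numpy as np
import pandas as pd

# Toy stand-in for archive_data; values are illustrative, not real tweets
toy_archive = pd.DataFrame({
    'tweet_id': [1, 2, 3, 4],
    'in_reply_to_status_id': [np.nan, 10.0, np.nan, np.nan],
    'retweeted_status_id': [np.nan, np.nan, 20.0, np.nan],
    'expanded_urls': ['u1', None, 'u3', 'u4'],
})

# Count missing values per column and keep only columns that have any
null_counts = toy_archive.isnull().sum()
columns_with_nulls = null_counts[null_counts > 0]
print(columns_with_nulls)
```

Running the same two lines on `archive_data` should list the features that are not fully populated.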

#### 2. image-predictions.tsv :  Provided by Udacity Team 

In [14]:
import requests

# URL found in the Wrangle and Analyze data section.
url_tsv_endpoint = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'

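# Request the data from the URL above with the requests library
image_data_request = requests.get(url_tsv_endpoint, allow_redirects=True)
image_data_request.raise_for_status()  # stop early if the download failed

# Write the response body to disk; the context manager closes the file
with open('image_predictions.tsv', 'wb') as file:
    bytes_written = file.write(image_data_request.content)
bytes_written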

335079

In [16]:
images_data = pd.read_csv('image_predictions.tsv',
                           sep = '\t')
images_data.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


In [17]:
images_data.shape

(2075, 12)

In [19]:
images_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   int64  
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


The image_predictions.tsv dataset consists of
- 2075 observations
- 12 features

#### 3. Tweet data (retweet counts, favorite counts, etc.) gathered from the Twitter API using the Tweepy library with personal API credentials


In [21]:
import tweepy 
import json
import time

In [22]:
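# API credentials (redacted here; replace the placeholders with your own values)

consumer_key = 'YOUR_CONSUMER_KEY'
consumer_secret = 'YOUR_CONSUMER_SECRET'
access_token = 'YOUR_ACCESS_TOKEN'
access_secret = 'YOUR_ACCESS_SECRET'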

In [23]:
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth, parser=tweepy.parsers.JSONParser())

In [24]:
api

<tweepy.api.API at 0x1d29f688f48>

In [25]:
start = time.time()

tweet_ids = archive_data.tweet_id.values

tweets_data = []
tweet_success = []
tweet_failure = []

for tweet_id in tweet_ids:
    try:
        data = api.get_status(tweet_id, tweet_mode='extended',
                              wait_on_rate_limit=True,
                              wait_on_rate_limit_notify=True)
        tweets_data.append(data)
        tweet_success.append(tweet_id)
    except tweepy.TweepError:
        # The tweet could not be retrieved; record its id and print it
        tweet_failure.append(tweet_id)
        print(tweet_id)

# Stop the timer
end = time.time()

print(end - start)

888202515573088257
873697596434513921
872668790621863937
872261713294495745
869988702071779329
866816280283807744
861769973181624320
856602993587888130
851953902622658560
845459076796616705
844704788403113984
842892208864923648
837366284874571778
837012587749474308
829374341691346946
827228250799742977
812747805718642688
802247111496568832
779123168116150273
775096608509886464
771004394259247104
770743923962707968
759566828574212096


Rate limit reached. Sleeping for: 219


754011816964026368
680055455951884288


Rate limit reached. Sleeping for: 201


2212.290293931961


In [108]:
# Number of tweets retrieved
print("The length of the result", len(tweets_data))
# Number of tweet ids that failed
print("The length of the errors", len(tweet_failure))

The length of the result 2331
The length of the errors 25


In [26]:
# Store the received data in a file named 'tweet_json.txt'
with open('tweet_json.txt', mode = 'w') as file:
    json.dump(tweets_data, file)

In [27]:
twitter_counts_df = pd.read_json('tweet_json.txt')
twitter_counts_df['tweet_id'] = tweet_success
twitter_counts_df = twitter_counts_df[['tweet_id',
                                       'favorite_count',
                                       'retweet_count']]

twitter_counts_df.head()

Unnamed: 0,tweet_id,favorite_count,retweet_count
0,892420643555336193,36370,7739
1,892177421306343426,31317,5730
2,891815181378084864,23602,3790
3,891689557279858688,39660,7895
4,891327558926688256,37857,8524
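
A possible alternative for the storage step, sketched under assumptions: write each tweet's JSON on its own line and read back only the needed fields, taking `tweet_id` from the tweet's own `id` field rather than from a separately maintained list. The two dictionaries below are made-up stand-ins for API responses, and an in-memory buffer stands in for `tweet_json.txt`.

```python
import io
import json

# Hypothetical stand-ins for API responses (only the fields used here)
fake_tweets = [
    {'id': 101, 'favorite_count': 5, 'retweet_count': 2},
    {'id': 102, 'favorite_count': 7, 'retweet_count': 3},
]

# Write one JSON object per line; StringIO stands in for a file on disk
buffer = io.StringIO()
for tweet in fake_tweets:
    buffer.write(json.dumps(tweet) + '\n')

# Read the lines back, keeping only the three fields of interest
buffer.seek(0)
rows = []
for line in buffer:
    tweet = json.loads(line)
    rows.append({'tweet_id': tweet['id'],
                 'favorite_count': tweet['favorite_count'],
                 'retweet_count': tweet['retweet_count']})

print(rows)
```

The resulting list of dictionaries can be passed to `pd.DataFrame(rows)` to obtain a frame with the same three columns as `twitter_counts_df`.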


-----------------------

<a id='assessing'></a>
## Data Assessing
Assessing means inspecting the gathered data for two kinds of problems: quality issues with the content, and structural (tidiness) issues that make analysis difficult.

In this section on Data Assessing, we go through both:
1. **Data Quality Issues**
2. **Data Tidiness Issues**

### Data Quality Issues
Quality issues are issues with content. Low quality data is also known as dirty data.
Dirty data = low quality data = content issues

From Lesson 3 of the Data Wrangling section, the four important data quality dimensions are:
1. Completeness: whether any data is missing.
2. Validity: whether the data conforms to the expected schema, that is, the rules the values are supposed to follow.
3. Accuracy: whether the data is correct. Inaccurate data is wrong data that can still show up as valid data.
4. Consistency: whether the same thing is always represented in the same, standardized way.
    

Let's go through the data quality issues with the three various data sources that we gathered in the **Data Gathering** section.
#### archive_data:
1. Completeness: 
    - Missing data found in the following features: 
        - in_reply_to_status_id
        - in_reply_to_user_id
        - retweeted_status_id
        - retweeted_status_user_id
        - retweeted_status_timestamp
        - expanded_urls
2. Validity:
    - Dog names: some entries in 'name' are not names, such as 'None', 'a', or 'an'.
    - The data contains retweets, which duplicate the original ratings.
3. Accuracy: 
    - retweeted_status_timestamp
    - timestamp

    Both features are stored as type object (string) rather than as datetime.

4. Consistency:
    - The 'rating_denominator' column is supposed to be 10 as standard, but it contains a multitude of other values.
    - The 'source' feature contains HTML tags.
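
One way to flag the invalid names programmatically, as a rough sketch: treat the placeholder 'None' and any all-lowercase word as suspect, on the assumption that real dog names are capitalized. The sample list and the function name are illustrative, and the rule is only a heuristic.

```python
# Illustrative sample of values from the 'name' column
names = ['Phineas', 'None', 'a', 'Tilly', 'an', 'Charlie']

def is_suspect_name(name):
    """Return True for the placeholder 'None' or an all-lowercase word."""
    return name == 'None' or name.islower()

suspect = [name for name in names if is_suspect_name(name)]
print(suspect)  # ['None', 'a', 'an']
```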
    
#### images_data:
- Validity:
  - p1
  - p2
  - p3
  
  columns contain predictions that are not dog breeds (for example mailbox, sundial, quilt)
- Consistency:
  - p1
  - p2
  - p3
  
  columns aren't consistent when it comes to capitalization
  - in the p1, p2 and p3 columns, multi-word dog breeds are joined with underscores
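
The capitalization and underscore issues could be addressed with a small normalization step. This is a sketch only: the helper name is hypothetical, and lowercasing everything is one possible convention among several.

```python
# Illustrative values as they appear in the p1, p2 and p3 columns
raw_breeds = ['Welsh_springer_spaniel', 'redbone', 'German_shepherd']

def normalize_breed(label):
    """Replace underscores with spaces and lowercase the label."""
    return label.replace('_', ' ').lower()

clean_breeds = [normalize_breed(label) for label in raw_breeds]
print(clean_breeds)
# ['welsh springer spaniel', 'redbone', 'german shepherd']
```

On a DataFrame column the same idea can be expressed with the pandas string accessor, for example `.str.replace('_', ' ').str.lower()`.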
  

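#### twitter_counts_df:
- Completeness:
  - Data is missing for the 25 tweets that could not be retrieved through the API (2331 of the 2356 archive tweets were gathered)

### Tidiness Issues
Messy data = untidy data = structural issues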

Three requirements for tidiness:
1. Each variable forms a column
2. Each observation forms a row
3. Each type of observational unit forms a table

**archive_data:**
- The last four columns (doggo, floofer, pupper, puppo) all relate to the same variable

**images_data:**
- This data set is part of the same observational unit as the data in the archive: there should be one table with all basic information about the dog ratings

**twitter_counts_df:**
- This data set is also part of the same observational unit: there should be one table with all basic information about the dog ratings
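
To illustrate the first tidiness issue, the four stage columns can be collapsed into a single variable. The sketch below uses a toy frame and assumes that an absent stage is stored as the string 'None'; the `dog_stage` column name and the helper function are hypothetical.

```python
import pandas as pd

stage_columns = ['doggo', 'floofer', 'pupper', 'puppo']

# Toy rows; 'None' marks an absent stage (an assumption about the encoding)
toy = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'doggo':   ['doggo', 'None', 'None'],
    'floofer': ['None', 'None', 'None'],
    'pupper':  ['None', 'pupper', 'None'],
    'puppo':   ['None', 'None', 'None'],
})

def combine_stages(row):
    """Join every stage present in the row; return None if there is none."""
    found = [row[col] for col in stage_columns if row[col] != 'None']
    return ', '.join(found) if found else None

# One column now holds the stage variable; the four originals are dropped
toy['dog_stage'] = toy.apply(combine_stages, axis=1)
tidy = toy.drop(columns=stage_columns)
print(tidy)
```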

#### Further analysis on **archive_data DataFrame**

In [106]:
archive_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

In [85]:
archive_data.describe()

In [86]:
archive_data[archive_data.tweet_id.duplicated()]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo


In [87]:
archive_data['name'].value_counts()

None       745
a           55
Charlie     12
Lucy        11
Oliver      11
          ... 
Ralphy       1
Dutch        1
Fiji         1
Todo         1
Torque       1
Name: name, Length: 957, dtype: int64

In [88]:
archive_data.rating_denominator.value_counts()

10     2333
11        3
50        3
80        2
20        2
2         1
16        1
40        1
70        1
15        1
90        1
110       1
120       1
130       1
150       1
170       1
7         1
0         1
Name: rating_denominator, dtype: int64
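
The value counts above include denominators other than 10 and one denominator of 0. One possible way to put ratings on a common scale, sketched here as an assumption rather than as the method used in this project, is to rescale each rating to a denominator of 10. The function name and the sample ratings other than 13/10 are hypothetical.

```python
def rating_out_of_ten(numerator, denominator):
    """Scale a rating to a denominator of 10; return None when undefined."""
    if denominator == 0:
        return None  # a zero denominator cannot be rescaled
    return numerator * 10 / denominator

print(rating_out_of_ten(13, 10))  # 13.0
print(rating_out_of_ten(84, 70))  # 12.0
print(rating_out_of_ten(24, 0))   # None
```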

#### Further analysis on **images_data DataFrame**

In [89]:
images_data.head(5)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


In [90]:
images_data.sample(10)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
1713,818614493328580609,https://pbs.twimg.com/media/C1xNgraVIAA3EVb.jpg,4,Chihuahua,0.450722,True,Border_terrier,0.204177,True,beagle,0.092774,True
1768,827199976799354881,https://pbs.twimg.com/media/C3rN-lcWEAA9CmR.jpg,4,Great_Dane,0.869681,True,American_Staffordshire_terrier,0.026658,True,boxer,0.019866,True
1918,855459453768019968,https://pbs.twimg.com/media/C98z1ZAXsAEIFFn.jpg,2,Blenheim_spaniel,0.389513,True,Pekinese,0.18822,True,Japanese_spaniel,0.082628,True
1292,751583847268179968,https://pbs.twimg.com/media/Cm4phTpWcAAgLsr.jpg,1,dalmatian,0.868304,True,studio_couch,0.059623,False,snow_leopard,0.013876,False
1526,788765914992902144,https://pbs.twimg.com/media/CvJCabcWgAIoUxW.jpg,1,cocker_spaniel,0.500509,True,golden_retriever,0.272734,True,jigsaw_puzzle,0.041476,False
648,681694085539872773,https://pbs.twimg.com/media/CXXdJ7CVAAALu23.jpg,1,toy_poodle,0.920992,True,miniature_poodle,0.060857,True,Maltese_dog,0.006064,True
559,677700003327029250,https://pbs.twimg.com/media/CWesj06W4AAIKl8.jpg,1,Siberian_husky,0.120849,True,junco,0.079206,False,malamute,0.063088,True
736,687102708889812993,https://pbs.twimg.com/media/CYkURJjW8AEamoI.jpg,1,fiddler_crab,0.992069,False,quail,0.002491,False,rock_crab,0.001513,False
1100,720775346191278080,https://pbs.twimg.com/media/CgC1WqMW4AI1_N0.jpg,1,Newfoundland,0.48997,True,groenendael,0.174497,True,giant_schnauzer,0.079067,True
112,667911425562669056,https://pbs.twimg.com/media/CUTl5m1WUAAabZG.jpg,1,frilled_lizard,0.257695,False,ox,0.23516,False,triceratops,0.085317,False


In [91]:
images_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   int64  
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [92]:
images_data[images_data.tweet_id.duplicated()]

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog


In [93]:
images_data['p1'].value_counts()

golden_retriever      150
Labrador_retriever    100
Pembroke               89
Chihuahua              83
pug                    57
                     ... 
mailbox                 1
sundial                 1
peacock                 1
quilt                   1
bow                     1
Name: p1, Length: 378, dtype: int64

In [94]:
images_data['p2'].value_counts()

Labrador_retriever    104
golden_retriever       92
Cardigan               73
Chihuahua              44
Pomeranian             42
                     ... 
china_cabinet           1
bobsled                 1
breastplate             1
sliding_door            1
snowmobile              1
Name: p2, Length: 405, dtype: int64

In [95]:
images_data['p3'].value_counts()

Labrador_retriever    79
Chihuahua             58
golden_retriever      48
Eskimo_dog            38
kelpie                35
                      ..
barbell                1
pool_table             1
shovel                 1
bulletproof_vest       1
joystick               1
Name: p3, Length: 408, dtype: int64

#### Further analysis on **twitter_counts_df DataFrame**

In [96]:
twitter_counts_df.head()

Unnamed: 0,tweet_id,favorite_count,retweet_count
0,892420643555336193,36370,7739
1,892177421306343426,31317,5730
2,891815181378084864,23602,3790
3,891689557279858688,39660,7895
4,891327558926688256,37857,8524


In [97]:
twitter_counts_df.sample(10)

Unnamed: 0,tweet_id,favorite_count,retweet_count
984,747594051852075008,3694,1059
1146,720775346191278080,2467,662
1507,690015576308211712,2498,729
521,806620845233815552,0,5680
1754,677716515794329600,3030,967
1191,715009755312439296,4156,1223
2079,670668383499735048,10331,4820
1055,739238157791694849,115875,57966
2176,668645506898350081,850,511
1615,684177701129875456,2016,663


In [98]:
twitter_counts_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2331 entries, 0 to 2330
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   tweet_id        2331 non-null   int64
 1   favorite_count  2331 non-null   int64
 2   retweet_count   2331 non-null   int64
dtypes: int64(3)
memory usage: 54.8 KB


In [99]:
twitter_counts_df.describe()

In [100]:
twitter_counts_df[twitter_counts_df.tweet_id.duplicated()]

Unnamed: 0,tweet_id,favorite_count,retweet_count


In [101]:
twitter_counts_df.retweet_count.mean()

2716.026598026598

In [102]:
twitter_counts_df.favorite_count.mean()

7593.646932646932

In [103]:
print("The Mean of retweet_count : ", twitter_counts_df.retweet_count.mean())
print("The Median of retweet_count : ", twitter_counts_df.retweet_count.median())
print("The Standard Deviation of retweet_count : ", twitter_counts_df.retweet_count.std())


The Mean of retweet_count :  2716.026598026598
The Median of retweet_count :  1275.0
The Standard Deviation of retweet_count :  4594.428054479699


In [142]:
print("The Mean of favorite_count : ", twitter_counts_df.favorite_count.mean())
print("The Median of favorite_count : ", twitter_counts_df.favorite_count.median())
print("The Standard Deviation of favorite_count : ", twitter_counts_df.favorite_count.std())


The Mean of favorite_count :  7593.646932646932
The Median of favorite_count :  3307.0
The Standard Deviation of favorite_count :  11782.276170945204


#### Data Tidiness
Untidy data has structural issues. The following were found:

1. Not all of the information in the images dataset is needed (tweet_id and jpg_url are what matter)
2. The dog "stage" variable is spread over four columns: doggo, floofer, pupper, puppo
3. 'twitter_counts_df' and 'images_data' should be joined to 'archive_data'

-----------

<a id='cleaning'></a>
## Data Cleaning
Cleaning is the third step in the data wrangling process.

**Types of cleaning:**
1. Manual (not recommended unless the issues are single occurrences)
2. Programmatic

The **programmatic** data cleaning process:
- ***Define***: convert the data into defined cleaning tasks. These definitions also serve as an instruction list so others (or yourself in the future) can look at your work and reproduce it.
- ***Code***: convert those definitions to code and run that code.
- ***Test***: test your dataset, visually or with code, to make sure your cleaning operations worked.


ref: [Cleaning summary in Cleaning Data lesson]
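
The Define, Code, Test cycle can be sketched on a single issue, here the timestamp stored as a string. The toy frame reuses two timestamp values from the archive preview; everything else in the block is an illustrative assumption, not the cleaning code of this project.

```python
import pandas as pd

# Define: convert 'timestamp' from string to a timezone-aware datetime.
toy = pd.DataFrame({'timestamp': ['2017-08-01 16:23:56 +0000',
                                  '2017-07-29 16:00:24 +0000']})

# Code: parse the strings and normalize them to UTC
toy['timestamp'] = pd.to_datetime(toy['timestamp'], utc=True)

# Test: the column now has a datetime dtype and the expected values
assert pd.api.types.is_datetime64_any_dtype(toy['timestamp'])
assert toy['timestamp'].dt.year.tolist() == [2017, 2017]
assert toy['timestamp'].dt.day.tolist() == [1, 29]
```

Writing the test as assertions means the cell fails loudly if the cleaning step did not work, instead of relying on visual inspection alone.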

In [143]:
# Making a copy of the tables before moving ahead for cleaning
archive_data_clean = archive_data.copy()
images_data_clean = images_data.copy()
twitter_counts_df_clean = twitter_counts_df.copy()

### ***Define***
1. Merge the copied images, archive and twitter_counts_df data frames and correct the dog types.
2. Create a single column for the different dog types:
    - Doggo, Floofer, Pupper, Puppo. 
    - and remove columns which are not required
        - in_reply_to_status_id
        - in_reply_to_user_id
        - retweeted_status_id
        - retweeted_status_user_id
        - retweeted_status_timestamp
3. Delete retweets.
4. Remove the columns/features which are not required.
5. Convert tweet_id from an integer type to a string type.
6. Convert the timestamp into an actual datetime format.
7. Correct the naming issues.
8. Standardize the dog ratings.
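
As a rough illustration of definitions 3 to 5, the sketch below removes retweets, drops the retweet columns, and converts `tweet_id` to a string on a toy frame. The values are illustrative and the variable names are assumptions; the order of operations matters, because the retweet columns are needed to identify retweets before they are dropped.

```python
import numpy as np
import pandas as pd

retweet_columns = ['retweeted_status_id',
                   'retweeted_status_user_id',
                   'retweeted_status_timestamp']

# Toy stand-in for the merged frame; values are illustrative
toy = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'retweeted_status_id': [np.nan, 8.39e17, np.nan],
    'retweeted_status_user_id': [np.nan, 41198420.0, np.nan],
    'retweeted_status_timestamp': [None, '2017-03-08 01:41:24 +0000', None],
})

# Keep only original tweets (no retweeted_status_id), then drop the
# retweet columns, which are entirely empty afterwards
originals = toy[toy['retweeted_status_id'].isnull()]
originals = originals.drop(columns=retweet_columns)

# tweet_id is an identifier, not a quantity, so store it as a string
originals = originals.astype({'tweet_id': str})
print(originals)
```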

### ***1***

**Code and Test**: Merge the copied images, archive and twitter_counts_df data frames and correct the dog types.

In [144]:
from functools import reduce

dfs = [archive_data_clean, images_data_clean, twitter_counts_df_clean]
twitter_dogs = reduce(lambda left, right:  pd.merge(left, right, on = 'tweet_id'), 
                      dfs)
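
`pd.merge` joins with `how='inner'` by default, so the `reduce` call above keeps only the tweet ids that are present in all three frames. A small sketch with made-up frames shows the effect:

```python
from functools import reduce

import pandas as pd

# Toy frames sharing a tweet_id key; values are illustrative
archive = pd.DataFrame({'tweet_id': [1, 2, 3], 'name': ['A', 'B', 'C']})
images = pd.DataFrame({'tweet_id': [1, 2], 'p1': ['pug', 'collie']})
counts = pd.DataFrame({'tweet_id': [2, 3], 'retweet_count': [10, 20]})

# With the default inner join, only ids found in every frame survive
merged = reduce(lambda left, right: pd.merge(left, right, on='tweet_id'),
                [archive, images, counts])
print(merged)
```

Only tweet 2 appears in all three toy frames, so the merged result has a single row. The same mechanism means the merged data can have fewer rows than any of its inputs.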

In [145]:
twitter_dogs.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,...,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog,favorite_count,retweet_count
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,...,0.097049,False,bagel,0.085851,False,banana,0.07611,False,36370,7739
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,...,0.323581,True,Pekinese,0.090647,True,papillon,0.068957,True,31317,5730
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,...,0.716012,True,malamute,0.078253,True,kelpie,0.031379,True,23602,3790
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,...,0.170278,False,Labrador_retriever,0.168086,True,spatula,0.040836,False,39660,7895
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,...,0.555712,True,English_springer,0.22577,True,German_short-haired_pointer,0.175219,True,37857,8524


In [146]:
twitter_dogs.sample(10)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,...,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog,favorite_count,retweet_count
403,811744202451197953,,,2016-12-22 01:24:33 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Halo. She likes watermelon. 13/10 http...,,,,https://twitter.com/dog_rates/status/811744202...,...,0.386082,True,Labrador_retriever,0.202862,True,golden_retriever,0.170487,True,7721,1637
220,839290600511926273,,,2017-03-08 01:44:07 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @alexmartindawg: THE DRINK IS DR. PUPPER 10...,8.392899e+17,41198420.0,2017-03-08 01:41:24 +0000,https://twitter.com/alexmartindawg/status/8392...,...,0.670892,False,monitor,0.101565,False,screen,0.075306,False,0,144
654,770093767776997377,,,2016-08-29 03:00:36 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is just downright precious...,7.410673e+17,4196984000.0,2016-06-10 00:39:48 +0000,https://twitter.com/dog_rates/status/741067306...,...,0.843799,True,Labrador_retriever,0.052956,True,kelpie,0.035711,True,0,3105
1346,685325112850124800,,,2016-01-08 05:00:14 +0000,"<a href=""http://twitter.com/download/iphone"" r...","""Tristan do not speak to me with that kind of ...",,,,https://twitter.com/dog_rates/status/685325112...,...,0.586937,True,Labrador_retriever,0.39826,True,kuvasz,0.00541,True,9483,3966
1327,686730991906516992,,,2016-01-12 02:06:41 +0000,"<a href=""http://twitter.com/download/iphone"" r...",I just love this picture. 12/10 lovely af http...,,,,https://twitter.com/dog_rates/status/686730991...,...,0.338812,True,Newfoundland,0.180925,True,golden_retriever,0.180023,True,4145,1196
1622,674410619106390016,,,2015-12-09 02:09:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Lenny. He wants to be a sprinkler. 10/...,,,,https://twitter.com/dog_rates/status/674410619...,...,0.698207,False,sea_lion,0.046475,False,beagle,0.019427,True,1168,448
752,753420520834629632,,,2016-07-14 02:47:04 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we are witnessing an isolated squad of bo...,,,,https://twitter.com/dog_rates/status/753420520...,...,0.267961,False,lakeside,0.085764,False,rapeseed,0.040809,False,7990,3605
633,772877495989305348,,,2016-09-05 19:22:09 +0000,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",You need to watch these two doggos argue throu...,,,,https://twitter.com/dog_rates/status/772877495...,...,0.218303,False,Norwegian_elkhound,0.138523,True,wombat,0.074217,False,8717,3948
796,749036806121881602,,,2016-07-02 00:27:45 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Dietrich. He hops at random. Other dog...,,,,https://twitter.com/dog_rates/status/749036806...,...,0.960276,False,West_Highland_white_terrier,0.019522,True,Samoyed,0.006396,True,3114,797
84,872820683541237760,,,2017-06-08 14:20:41 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here are my favorite #dogsatpollingstations \n...,,,,https://twitter.com/dog_rates/status/872820683...,...,0.99912,True,French_bulldog,0.000552,True,bull_mastiff,7.3e-05,True,13963,3439


In [147]:
twitter_dogs.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2059 entries, 0 to 2058
Data columns (total 30 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2059 non-null   int64  
 1   in_reply_to_status_id       23 non-null     float64
 2   in_reply_to_user_id         23 non-null     float64
 3   timestamp                   2059 non-null   object 
 4   source                      2059 non-null   object 
 5   text                        2059 non-null   object 
 6   retweeted_status_id         72 non-null     float64
 7   retweeted_status_user_id    72 non-null     float64
 8   retweeted_status_timestamp  72 non-null     object 
 9   expanded_urls               2059 non-null   object 
 10  rating_numerator            2059 non-null   int64  
 11  rating_denominator          2059 non-null   int64  
 12  name                        2059 non-null   object 
 13  doggo                       2059 

### ***2***

**Code and Test**:
Create a single `dog_type` column for the dog stages, extracted from the tweet text:
- doggo, floofer, pupper, puppo

Columns that are not required are removed in steps 3 and 4:
- in_reply_to_status_id
- in_reply_to_user_id
- retweeted_status_id
- retweeted_status_user_id
- retweeted_status_timestamp

In [148]:
twitter_dogs['dog_type'] = twitter_dogs['text'].str.extract('(doggo|floofer|pupper|puppo)')

In [149]:
twitter_dogs[['dog_type', 'doggo', 'floofer', 'pupper', 'puppo']].sample(20)

Unnamed: 0,dog_type,doggo,floofer,pupper,puppo
407,,,,,
61,,,,,
1478,,,,,
1401,,,,,
1180,,,,,
601,,,,,
749,,,,,
1974,,,,,
1030,pupper,,,pupper,
276,doggo,doggo,,,


In [150]:
twitter_dogs.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,...,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog,favorite_count,retweet_count,dog_type
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,...,False,bagel,0.085851,False,banana,0.07611,False,36370,7739,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,...,True,Pekinese,0.090647,True,papillon,0.068957,True,31317,5730,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,...,True,malamute,0.078253,True,kelpie,0.031379,True,23602,3790,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,...,False,Labrador_retriever,0.168086,True,spatula,0.040836,False,39660,7895,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,...,True,English_springer,0.22577,True,German_short-haired_pointer,0.175219,True,37857,8524,


In [151]:
twitter_dogs[twitter_dogs.tweet_id == 891815181378084864]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,...,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog,favorite_count,retweet_count,dog_type
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,...,True,malamute,0.078253,True,kelpie,0.031379,True,23602,3790,


In [152]:
twitter_dogs.dog_type.value_counts()

pupper     230
doggo       73
puppo       28
floofer      3
Name: dog_type, dtype: int64

#### Observation
- Dog stage counts (stages, not breeds) from `value_counts()`:
    - Pupper : 230
    - Doggo : 73
    - Puppo : 28
    - Floofer : 3
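
`str.extract` keeps only the first stage it finds in each tweet, so a tweet that mentions two stages is counted once. The following is a minimal sketch of one way to keep every stage, assuming a combined label such as `doggo, pupper` is acceptable. The sample texts are made up for illustration, and the word-boundary pattern is an assumption that also stops plurals such as "puppers" from matching.

```python
import pandas as pd

# Hypothetical sample texts, not taken from the dataset.
sample = pd.DataFrame({
    "text": [
        "This is Rex. Such a good pupper. 12/10",
        "Here is a doggo meeting a pupper. 13/10",
        "No stage mentioned here. 11/10",
    ]
})

stage_pattern = r"\b(doggo|floofer|pupper|puppo)\b"

# findall returns every match per row as a list; lowercase first so "Doggo" also matches.
found = sample["text"].str.lower().str.findall(stage_pattern)

# Join the unique stages in sorted order; rows without a match become missing values.
sample["dog_type"] = found.apply(
    lambda stages: ", ".join(sorted(set(stages))) if stages else pd.NA
)

print(sample["dog_type"].tolist())
```

Whether a combined label or a separate row per stage is the tidier choice depends on the later analysis.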

### ***3***

**Code and Test**: Deleting retweets.

In [153]:
twitter_dogs = twitter_dogs[np.isnan(twitter_dogs.retweeted_status_id)]
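
An equivalent filter can be written with the pandas `isna()` method, which also works on non-numeric columns. This is a sketch on a hypothetical three-row frame, not the project data; the `.copy()` call is an optional addition that makes the result an independent frame for later assignments.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame; the middle row stands in for a retweet.
demo = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "retweeted_status_id": [np.nan, 8.9e17, np.nan],
})

# Keep only rows where retweeted_status_id is missing, i.e. original tweets.
originals = demo[demo["retweeted_status_id"].isna()].copy()

print(originals["tweet_id"].tolist())
```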

In [154]:
twitter_dogs.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1987 entries, 0 to 2058
Data columns (total 31 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    1987 non-null   int64  
 1   in_reply_to_status_id       23 non-null     float64
 2   in_reply_to_user_id         23 non-null     float64
 3   timestamp                   1987 non-null   object 
 4   source                      1987 non-null   object 
 5   text                        1987 non-null   object 
 6   retweeted_status_id         0 non-null      float64
 7   retweeted_status_user_id    0 non-null      float64
 8   retweeted_status_timestamp  0 non-null      object 
 9   expanded_urls               1987 non-null   object 
 10  rating_numerator            1987 non-null   int64  
 11  rating_denominator          1987 non-null   int64  
 12  name                        1987 non-null   object 
 13  doggo                       1987 

In [155]:
twitter_dogs = twitter_dogs.drop(['retweeted_status_id', 
                                  'retweeted_status_user_id',
                                  'retweeted_status_timestamp'], axis=1)

In [156]:
twitter_dogs.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,expanded_urls,rating_numerator,rating_denominator,name,...,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog,favorite_count,retweet_count,dog_type
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,...,False,bagel,0.085851,False,banana,0.07611,False,36370,7739,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,...,True,Pekinese,0.090647,True,papillon,0.068957,True,31317,5730,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,...,True,malamute,0.078253,True,kelpie,0.031379,True,23602,3790,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,...,False,Labrador_retriever,0.168086,True,spatula,0.040836,False,39660,7895,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,...,True,English_springer,0.22577,True,German_short-haired_pointer,0.175219,True,37857,8524,


### ***4***

**Code and Test**: Remove the columns that are not required.


In [157]:
twitter_dogs.drop(['in_reply_to_status_id', 
                  'in_reply_to_user_id',
                  'source',
                  'img_num'], axis = 1, inplace=True)

In [158]:
twitter_dogs.head()

Unnamed: 0,tweet_id,timestamp,text,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,...,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog,favorite_count,retweet_count,dog_type
0,892420643555336193,2017-08-01 16:23:56 +0000,This is Phineas. He's a mystical boy. Only eve...,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,...,False,bagel,0.085851,False,banana,0.07611,False,36370,7739,
1,892177421306343426,2017-08-01 00:17:27 +0000,This is Tilly. She's just checking pup on you....,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,...,True,Pekinese,0.090647,True,papillon,0.068957,True,31317,5730,
2,891815181378084864,2017-07-31 00:18:03 +0000,This is Archie. He is a rare Norwegian Pouncin...,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,...,True,malamute,0.078253,True,kelpie,0.031379,True,23602,3790,
3,891689557279858688,2017-07-30 15:58:51 +0000,This is Darla. She commenced a snooze mid meal...,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,...,False,Labrador_retriever,0.168086,True,spatula,0.040836,False,39660,7895,
4,891327558926688256,2017-07-29 16:00:24 +0000,This is Franklin. He would like you to stop ca...,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,...,True,English_springer,0.22577,True,German_short-haired_pointer,0.175219,True,37857,8524,


### ***5***

**Code and Test**:
Convert `tweet_id` from an integer type to a string type.


In [159]:
twitter_dogs['tweet_id'] = twitter_dogs['tweet_id'].astype(str)
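
One reason to store IDs as strings is that they are identifiers, not quantities, and an 18-digit ID does not survive an accidental conversion to float. The sketch below illustrates this with the ID of the first row shown above; the variable names are hypothetical.

```python
import pandas as pd

tweet_id = 892420643555336193  # ID of the first row shown above

# float64 has a 53-bit mantissa, so an 18-digit ID is rounded in a float round trip.
as_float = float(tweet_id)
survives_float = int(as_float) == tweet_id

# Stored as a string, the ID is preserved exactly.
ids = pd.Series([tweet_id]).astype(str)

print(survives_float, ids.iloc[0])
```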

In [160]:
twitter_dogs.head()

Unnamed: 0,tweet_id,timestamp,text,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,...,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog,favorite_count,retweet_count,dog_type
0,892420643555336193,2017-08-01 16:23:56 +0000,This is Phineas. He's a mystical boy. Only eve...,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,...,False,bagel,0.085851,False,banana,0.07611,False,36370,7739,
1,892177421306343426,2017-08-01 00:17:27 +0000,This is Tilly. She's just checking pup on you....,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,...,True,Pekinese,0.090647,True,papillon,0.068957,True,31317,5730,
2,891815181378084864,2017-07-31 00:18:03 +0000,This is Archie. He is a rare Norwegian Pouncin...,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,...,True,malamute,0.078253,True,kelpie,0.031379,True,23602,3790,
3,891689557279858688,2017-07-30 15:58:51 +0000,This is Darla. She commenced a snooze mid meal...,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,...,False,Labrador_retriever,0.168086,True,spatula,0.040836,False,39660,7895,
4,891327558926688256,2017-07-29 16:00:24 +0000,This is Franklin. He would like you to stop ca...,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,...,True,English_springer,0.22577,True,German_short-haired_pointer,0.175219,True,37857,8524,


In [161]:
twitter_dogs.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1987 entries, 0 to 2058
Data columns (total 24 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   tweet_id            1987 non-null   object 
 1   timestamp           1987 non-null   object 
 2   text                1987 non-null   object 
 3   expanded_urls       1987 non-null   object 
 4   rating_numerator    1987 non-null   int64  
 5   rating_denominator  1987 non-null   int64  
 6   name                1987 non-null   object 
 7   doggo               1987 non-null   object 
 8   floofer             1987 non-null   object 
 9   pupper              1987 non-null   object 
 10  puppo               1987 non-null   object 
 11  jpg_url             1987 non-null   object 
 12  p1                  1987 non-null   object 
 13  p1_conf             1987 non-null   float64
 14  p1_dog              1987 non-null   bool   
 15  p2                  1987 non-null   object 
 16  p2_con

### ***6***

**Code and Test**:
Convert `timestamp` from a string to a proper datetime type.

In [162]:
twitter_dogs['timestamp'] = twitter_dogs['timestamp'].str.slice(start=0, stop=-6)

In [163]:
twitter_dogs['timestamp'] = pd.to_datetime(twitter_dogs['timestamp'], format = "%Y-%m-%d %H:%M:%S")
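
As an alternative to slicing off the ` +0000` suffix, the offset can be parsed directly. This is a sketch on two sample strings copied from the rows above, assuming every timestamp carries a UTC offset in this format.

```python
import pandas as pd

raw = pd.Series(["2017-08-01 16:23:56 +0000", "2017-07-31 00:18:03 +0000"])

# Parse the offset instead of slicing it off; utc=True yields timezone-aware UTC values.
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S %z", utc=True)

# Drop the timezone to get naive datetimes like those produced by the cells above.
naive = parsed.dt.tz_localize(None)

print(naive.iloc[0])
```

Parsing the offset has the advantage that a timestamp with a different offset would be converted rather than silently truncated.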

In [164]:
twitter_dogs.head()


Unnamed: 0,tweet_id,timestamp,text,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,...,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog,favorite_count,retweet_count,dog_type
0,892420643555336193,2017-08-01 16:23:56,This is Phineas. He's a mystical boy. Only eve...,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,...,False,bagel,0.085851,False,banana,0.07611,False,36370,7739,
1,892177421306343426,2017-08-01 00:17:27,This is Tilly. She's just checking pup on you....,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,...,True,Pekinese,0.090647,True,papillon,0.068957,True,31317,5730,
2,891815181378084864,2017-07-31 00:18:03,This is Archie. He is a rare Norwegian Pouncin...,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,...,True,malamute,0.078253,True,kelpie,0.031379,True,23602,3790,
3,891689557279858688,2017-07-30 15:58:51,This is Darla. She commenced a snooze mid meal...,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,...,False,Labrador_retriever,0.168086,True,spatula,0.040836,False,39660,7895,
4,891327558926688256,2017-07-29 16:00:24,This is Franklin. He would like you to stop ca...,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,...,True,English_springer,0.22577,True,German_short-haired_pointer,0.175219,True,37857,8524,


### ***7***

**Code and Test**:
Correcting name issues: names that start with a lowercase letter are replaced with 'None'.

In [165]:
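# regex=True is needed because pandas 2.0+ treats the pattern as a literal string by default
twitter_dogs.name = twitter_dogs.name.str.replace(r'^[a-z]+', 'None', regex=True)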
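
The replacement above rewrites only the leading run of lowercase letters. A slightly different approach, sketched here on a hypothetical list of names, flags any name that starts with a lowercase letter and replaces the whole entry. Treating such names as parsing artifacts is an assumption.

```python
import pandas as pd

# Hypothetical sample names; "a" and "quite" stand in for parsing artifacts.
names = pd.Series(["Phineas", "a", "quite", "None", "Bo"])

# str.match anchors at the start of the string, so this flags lowercase-initial entries.
is_artifact = names.str.match(r"[a-z]")

# mask() replaces the flagged entries wholesale and leaves real names untouched.
cleaned = names.mask(is_artifact, "None")

print(cleaned.tolist())
```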

In [166]:
twitter_dogs['name'].value_counts()

None       644
Oliver      10
Cooper      10
Charlie     10
Penny        9
          ... 
Sobe         1
Danny        1
Richie       1
Nigel        1
Andru        1
Name: name, Length: 912, dtype: int64

In [167]:
twitter_dogs['name'].sample(10)

1507       None
1845       None
1661    Kendall
1341      Marty
94         Zoey
636        None
1150    Vincent
935        None
1830      Larry
1037       None
Name: name, dtype: object

### ***8***

**Code and Test**:
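Standardizing the dog ratings: convert both parts to float, restore decimal numerators from the tweet text, and add a combined `rating` column.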

In [168]:
twitter_dogs['rating_numerator'] = twitter_dogs['rating_numerator'].astype(float)
twitter_dogs['rating_denominator'] = twitter_dogs['rating_denominator'].astype(float)
twitter_dogs.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1987 entries, 0 to 2058
Data columns (total 24 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   tweet_id            1987 non-null   object        
 1   timestamp           1987 non-null   datetime64[ns]
 2   text                1987 non-null   object        
 3   expanded_urls       1987 non-null   object        
 4   rating_numerator    1987 non-null   float64       
 5   rating_denominator  1987 non-null   float64       
 6   name                1987 non-null   object        
 7   doggo               1987 non-null   object        
 8   floofer             1987 non-null   object        
 9   pupper              1987 non-null   object        
 10  puppo               1987 non-null   object        
 11  jpg_url             1987 non-null   object        
 12  p1                  1987 non-null   object        
 13  p1_conf             1987 non-null   float64     

In [169]:
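import re

ratings_decimals_text = []
ratings_decimals_index = []
ratings_decimals = []

# Capture the decimal numerator of ratings such as "13.5/10".
decimal_rating = re.compile(r'(\d+\.\d+)/\d+')

# Series.items() replaces iteritems(), which was removed in pandas 2.0.
for i, text in twitter_dogs['text'].items():
    match = decimal_rating.search(text)
    if match:
        ratings_decimals_text.append(text)
        ratings_decimals_index.append(i)
        ratings_decimals.append(match.group(1))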

In [172]:
ratings_decimals_text

['This is Bella. She hopes her smile made you smile. If not, she is also offering you her favorite monkey. 13.5/10 https://t.co/qjrljjt948',
 "This is Logan, the Chow who lived. He solemnly swears he's up to lots of good. H*ckin magical af 9.75/10 https://t.co/yBO5wuqaPS",
 "This is Sophie. She's a Jubilant Bush Pupper. Super h*ckin rare. Appears at random just to smile at the locals. 11.27/10 would smile back https://t.co/QFaUiIHxHq",
 'Here we have uncovered an entire battalion of holiday puppers. Average of 11.26/10 https://t.co/eNm2S6p9BD']

In [173]:
ratings_decimals_index

[40, 548, 603, 1438]

In [174]:
# Overwrite each affected numerator with the decimal value found in the text.
for idx, value in zip(ratings_decimals_index, ratings_decimals):
    twitter_dogs.loc[idx, 'rating_numerator'] = float(value)
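
The same correction can be done without an explicit loop. The sketch below uses `str.extract` on a hypothetical three-row frame; the sample texts and the incorrect starting numerators (5.0 and 75.0) are made up to mimic an integer-only parse.

```python
import pandas as pd

# Hypothetical rows; the numerators 5.0 and 75.0 mimic what an integer-only parse leaves behind.
demo = pd.DataFrame({
    "text": ["Smiles all round. 13.5/10", "Solid dog. 12/10", "Magical af 9.75/10"],
    "rating_numerator": [5.0, 12.0, 75.0],
})

# Capture a decimal numerator only when it is directly followed by "/denominator".
decimal_numerator = (
    demo["text"].str.extract(r"(\d+\.\d+)/\d+", expand=False).astype(float)
)

# Overwrite the numerator only where a decimal rating was found.
has_decimal = decimal_numerator.notna()
demo.loc[has_decimal, "rating_numerator"] = decimal_numerator[has_decimal]

print(demo["rating_numerator"].tolist())
```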

In [177]:
twitter_dogs.loc[80]

tweet_id                                             874012996292530176
timestamp                                           2017-06-11 21:18:31
text                  This is Sebastian. He can't see all the colors...
expanded_urls         https://twitter.com/dog_rates/status/874012996...
rating_numerator                                                     13
rating_denominator                                                   10
name                                                          Sebastian
doggo                                                              None
floofer                                                            None
pupper                                                             None
puppo                                                             puppo
jpg_url                 https://pbs.twimg.com/media/DCEeLxjXsAAvNSM.jpg
p1                                                             Cardigan
p1_conf                                                        0

In [178]:
twitter_dogs['rating'] = twitter_dogs['rating_numerator'] / twitter_dogs['rating_denominator']

In [179]:
twitter_dogs.sample(10)

Unnamed: 0,tweet_id,timestamp,text,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,...,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog,favorite_count,retweet_count,dog_type,rating
1812,670474236058800128,2015-11-28 05:28:09,Honor to rate this dog. Great teeth. Nice horn...,https://twitter.com/dog_rates/status/670474236...,10.0,10.0,,,,,...,siamang,0.062536,False,gorilla,0.058894,False,1461,712,,1.0
636,772581559778025472,2016-09-04 23:46:12,Guys this is getting so out of hand. We only r...,https://twitter.com/dog_rates/status/772581559...,10.0,10.0,,,,,...,Border_collie,0.128352,True,Saint_Bernard,0.059476,True,6610,1723,,1.0
1212,695409464418041856,2016-02-05 00:51:51,This is Bob. He just got back from his job int...,https://twitter.com/dog_rates/status/695409464...,10.0,10.0,Bob,,,,...,bull_mastiff,0.001749,True,Pekinese,0.000304,True,8677,3574,,1.0
7,890729181411237888,2017-07-28 00:22:40,When you watch your owner call another dog a g...,https://twitter.com/dog_rates/status/890729181...,13.0,10.0,,,,,...,Eskimo_dog,0.178406,True,Pembroke,0.076507,True,61263,17268,,1.3
452,801285448605831168,2016-11-23 04:45:12,oh h*ck 10/10 https://t.co/bC69RrW559,https://twitter.com/dog_rates/status/801285448...,10.0,10.0,,,,,...,beach_wagon,0.081125,False,convertible,0.064534,False,6206,848,,1.0
914,730427201120833536,2016-05-11 15:59:50,This is Crystal. She's flawless. Really wants ...,https://twitter.com/dog_rates/status/730427201...,11.0,10.0,Crystal,,,,...,Siberian_husky,0.289288,True,Staffordshire_bullterrier,0.008771,True,3487,1047,,1.1
1004,714258258790387713,2016-03-28 01:10:13,Meet Travis and Flurp. Travis is pretty chill ...,https://twitter.com/dog_rates/status/714258258...,10.0,10.0,Travis,,,,...,Chesapeake_Bay_retriever,0.101834,True,beagle,0.101294,True,3035,716,,1.0
1801,670778058496974848,2015-11-29 01:35:26,"""To bone or not to bone?""\n10/10 https://t.co/...",https://twitter.com/dog_rates/status/670778058...,10.0,10.0,,,,,...,Brabancon_griffon,0.112032,True,boxer,0.039051,True,321,71,,1.0
1411,681694085539872773,2015-12-29 04:31:49,This is Bo. He's a Benedoop Cumbersnatch. Seem...,https://twitter.com/dog_rates/status/681694085...,11.0,10.0,Bo,,,pupper,...,miniature_poodle,0.060857,True,Maltese_dog,0.006064,True,12799,4055,pupper,1.1
1818,670442337873600512,2015-11-28 03:21:24,Meet Koda. He's large. Looks very soft. Great ...,https://twitter.com/dog_rates/status/670442337...,11.0,10.0,Koda,,,,...,otterhound,0.256302,True,Irish_terrier,0.187315,True,627,191,,1.1


In [180]:
twitter_dogs.loc[498]

tweet_id                                             793845145112371200
timestamp                                           2016-11-02 16:00:06
text                  This is Clark. He was just caught wearing pant...
expanded_urls         https://twitter.com/dog_rates/status/793845145...
rating_numerator                                                     13
rating_denominator                                                   10
name                                                              Clark
doggo                                                              None
floofer                                                            None
pupper                                                             None
puppo                                                              None
jpg_url                 https://pbs.twimg.com/media/CwRN8H6WgAASe4X.jpg
p1                                                 Old_English_sheepdog
p1_conf                                                        0

In [181]:
twitter_dogs.rating.describe()

count    1987.000000
mean        1.164651
std         4.071452
min         0.000000
25%         1.000000
50%         1.100000
75%         1.200000
max       177.600000
Name: rating, dtype: float64

In [182]:
twitter_dogs

Unnamed: 0,tweet_id,timestamp,text,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,...,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog,favorite_count,retweet_count,dog_type,rating
0,892420643555336193,2017-08-01 16:23:56,This is Phineas. He's a mystical boy. Only eve...,https://twitter.com/dog_rates/status/892420643...,13.0,10.0,Phineas,,,,...,bagel,0.085851,False,banana,0.076110,False,36370,7739,,1.3
1,892177421306343426,2017-08-01 00:17:27,This is Tilly. She's just checking pup on you....,https://twitter.com/dog_rates/status/892177421...,13.0,10.0,Tilly,,,,...,Pekinese,0.090647,True,papillon,0.068957,True,31317,5730,,1.3
2,891815181378084864,2017-07-31 00:18:03,This is Archie. He is a rare Norwegian Pouncin...,https://twitter.com/dog_rates/status/891815181...,12.0,10.0,Archie,,,,...,malamute,0.078253,True,kelpie,0.031379,True,23602,3790,,1.2
3,891689557279858688,2017-07-30 15:58:51,This is Darla. She commenced a snooze mid meal...,https://twitter.com/dog_rates/status/891689557...,13.0,10.0,Darla,,,,...,Labrador_retriever,0.168086,True,spatula,0.040836,False,39660,7895,,1.3
4,891327558926688256,2017-07-29 16:00:24,This is Franklin. He would like you to stop ca...,https://twitter.com/dog_rates/status/891327558...,12.0,10.0,Franklin,,,,...,English_springer,0.225770,True,German_short-haired_pointer,0.175219,True,37857,8524,,1.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2054,666049248165822465,2015-11-16 00:24:50,Here we have a 1949 1st generation vulpix. Enj...,https://twitter.com/dog_rates/status/666049248...,5.0,10.0,,,,,...,Rottweiler,0.243682,True,Doberman,0.154629,True,97,41,,0.5
2055,666044226329800704,2015-11-16 00:04:52,This is a purebred Piers Morgan. Loves to Netf...,https://twitter.com/dog_rates/status/666044226...,6.0,10.0,,,,,...,redbone,0.360687,True,miniature_pinscher,0.222752,True,273,133,,0.6
2056,666033412701032449,2015-11-15 23:21:54,Here is a very happy pup. Big fan of well-main...,https://twitter.com/dog_rates/status/666033412...,9.0,10.0,,,,,...,malinois,0.138584,True,bloodhound,0.116197,True,113,41,,0.9
2057,666029285002620928,2015-11-15 23:05:30,This is a western brown Mitsubishi terrier. Up...,https://twitter.com/dog_rates/status/666029285...,7.0,10.0,,,,,...,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.072010,True,121,42,,0.7


In [184]:
twitter_dogs.to_csv('twitter_archive_master.csv',
                    encoding = 'utf-8',
                    index=False)

In [185]:
archive_master_clean = pd.read_csv('twitter_archive_master.csv')
archive_master_clean.head()

Unnamed: 0,tweet_id,timestamp,text,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,...,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog,favorite_count,retweet_count,dog_type,rating
0,892420643555336193,2017-08-01 16:23:56,This is Phineas. He's a mystical boy. Only eve...,https://twitter.com/dog_rates/status/892420643...,13.0,10.0,Phineas,,,,...,bagel,0.085851,False,banana,0.07611,False,36370,7739,,1.3
1,892177421306343426,2017-08-01 00:17:27,This is Tilly. She's just checking pup on you....,https://twitter.com/dog_rates/status/892177421...,13.0,10.0,Tilly,,,,...,Pekinese,0.090647,True,papillon,0.068957,True,31317,5730,,1.3
2,891815181378084864,2017-07-31 00:18:03,This is Archie. He is a rare Norwegian Pouncin...,https://twitter.com/dog_rates/status/891815181...,12.0,10.0,Archie,,,,...,malamute,0.078253,True,kelpie,0.031379,True,23602,3790,,1.2
3,891689557279858688,2017-07-30 15:58:51,This is Darla. She commenced a snooze mid meal...,https://twitter.com/dog_rates/status/891689557...,13.0,10.0,Darla,,,,...,Labrador_retriever,0.168086,True,spatula,0.040836,False,39660,7895,,1.3
4,891327558926688256,2017-07-29 16:00:24,This is Franklin. He would like you to stop ca...,https://twitter.com/dog_rates/status/891327558...,12.0,10.0,Franklin,,,,...,English_springer,0.22577,True,German_short-haired_pointer,0.175219,True,37857,8524,,1.2
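
A CSV file does not store column types, so `tweet_id` and `timestamp` are inferred again when the file is read back. The sketch below shows one way to restore them at read time, using an in-memory buffer and a hypothetical one-row frame in place of `twitter_archive_master.csv`.

```python
import io
import pandas as pd

# Hypothetical one-row frame with the same dtypes as the cleaned columns above.
frame = pd.DataFrame({
    "tweet_id": ["892420643555336193"],
    "timestamp": pd.to_datetime(["2017-08-01 16:23:56"]),
})

# Write to an in-memory buffer instead of a file so the sketch has no side effects.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)

# Default read: tweet_id is inferred as an integer and timestamp is left as text.
buffer.seek(0)
default_read = pd.read_csv(buffer)

# Explicit read: restore the string ID and the datetime column.
buffer.seek(0)
typed_read = pd.read_csv(buffer, dtype={"tweet_id": str}, parse_dates=["timestamp"])

print(default_read.dtypes)
print(typed_read.dtypes)
```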


**See *Analyzing_and_Visualizing_Data.ipynb* for the visualizations.**