# Wrangle and Analyse WeRateDogs Twitter Data

#### Udacity ALX Data Analyst Nanodegree
#### Salami Suleiman, September 2022

## Introduction

This project illustrates methods to gather data from a variety of sources and in a variety of formats, assess its quality and tidiness, then clean it. 

The dataset used in this notebook is the tweet archive of Twitter user @dog_rates, also known as WeRateDogs. WeRateDogs is a Twitter account that rates people's dogs with a humorous comment about the dog. These ratings almost always have a denominator of 10. The numerators, though? Almost always greater than 10. 11/10, 12/10, 13/10, etc. Why? Because "they're good dogs Brent." WeRateDogs has over 4 million followers and has received international media coverage.

## Data Gathering

* In this section, we gather the three datasets used for the data wrangling: the Twitter archive, the tweet image predictions, and additional tweet data from the Twitter API.

In [167]:
# load required libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import tweepy
import json 
import re

%matplotlib inline


In [47]:
# load Twitter archive dataset

path = "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/59a4e958_twitter-archive-enhanced/twitter-archive-enhanced.csv"
twt_archive = pd.read_csv(path)

twt_archive.head(1)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,


In [48]:
# load tweet image predictions dataset

path = "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv"
img_pred = pd.read_csv(path, sep = "\t")
img_pred.head(2)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True


In [49]:
# load additional Twitter API data from the text file (one JSON object per line)

df_tweet = []
with open('tweet-json.txt') as f:
    for line in f:
        tweet = json.loads(line)
        df_tweet.append({
            'tweet_id': tweet['id'],
            'retweets_count': tweet['retweet_count'],
            'favorite_count': tweet['favorite_count'],
        })

twt_api = pd.DataFrame(df_tweet)
twt_api.head(3)

Unnamed: 0,tweet_id,retweets_count,favorite_count
0,892420643555336193,8853,39467
1,892177421306343426,6514,33819
2,891815181378084864,4328,25461
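
The loading loop assumes that every line of `tweet-json.txt` is a complete JSON object containing the three keys. The following is a minimal sketch of a slightly more defensive reader that skips blank lines. It runs on an in-memory sample rather than the real file; the sample records and the `parse_tweets` helper are hypothetical and not part of the original notebook.

```python
import io
import json

def parse_tweets(lines):
    """Collect id, retweet and favorite counts from line-delimited JSON."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines instead of failing in json.loads
        tweet = json.loads(line)
        rows.append({
            'tweet_id': tweet['id'],
            'retweets_count': tweet['retweet_count'],
            'favorite_count': tweet['favorite_count'],
        })
    return rows

# hypothetical sample standing in for tweet-json.txt
sample = io.StringIO(
    '{"id": 1, "retweet_count": 5, "favorite_count": 9}\n'
    '\n'
    '{"id": 2, "retweet_count": 0, "favorite_count": 3}\n'
)
rows = parse_tweets(sample)
```

The resulting list of dictionaries can be passed to `pd.DataFrame` exactly as in the cell above.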


## Assessing the data

In this section, we perform visual and programmatic assessments of the three datasets and outline the quality and tidiness issues found.

We start with the visual assessment by inspecting the data with pandas and Excel.

In [53]:
# visual assessment of Twitter archive dataset

twt_archive.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


In [54]:
# visual assessment of tweet image predictions dataset

img_pred.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


In [55]:
# visual assessment of data from the Twitter API from txt file

twt_api.head()

Unnamed: 0,tweet_id,retweets_count,favorite_count
0,892420643555336193,8853,39467
1,892177421306343426,6514,33819
2,891815181378084864,4328,25461
3,891689557279858688,8964,42908
4,891327558926688256,9774,41048


From here, we begin the programmatic assessment, using several approaches.

In [59]:
# assess the various data types associated with the variables

twt_archive.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

In [60]:
img_pred.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   int64  
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [61]:
twt_api.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2354 entries, 0 to 2353
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   tweet_id        2354 non-null   int64
 1   retweets_count  2354 non-null   int64
 2   favorite_count  2354 non-null   int64
dtypes: int64(3)
memory usage: 55.3 KB


In [51]:
# check for duplicates

twt_archive.duplicated().sum()

0

In [43]:
img_pred.duplicated().sum()

0

In [44]:
twt_api.duplicated().sum()

0

In [63]:
# check for missing data

twt_archive.isna().sum()

tweet_id                         0
in_reply_to_status_id         2278
in_reply_to_user_id           2278
timestamp                        0
source                           0
text                             0
retweeted_status_id           2175
retweeted_status_user_id      2175
retweeted_status_timestamp    2175
expanded_urls                   59
rating_numerator                 0
rating_denominator               0
name                             0
doggo                            0
floofer                          0
pupper                           0
puppo                            0
dtype: int64

In [64]:
img_pred.isna().sum()

tweet_id    0
jpg_url     0
img_num     0
p1          0
p1_conf     0
p1_dog      0
p2          0
p2_conf     0
p2_dog      0
p3          0
p3_conf     0
p3_dog      0
dtype: int64

In [65]:
twt_api.isna().sum()

tweet_id          0
retweets_count    0
favorite_count    0
dtype: int64

In [56]:
# check summary stats on numeric variables

twt_archive.describe()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,retweeted_status_id,retweeted_status_user_id,rating_numerator,rating_denominator
count,2356.0,78.0,78.0,181.0,181.0,2356.0,2356.0
mean,7.427716e+17,7.455079e+17,2.014171e+16,7.7204e+17,1.241698e+16,13.126486,10.455433
std,6.856705e+16,7.582492e+16,1.252797e+17,6.236928e+16,9.599254e+16,45.876648,6.745237
min,6.660209e+17,6.658147e+17,11856340.0,6.661041e+17,783214.0,0.0,0.0
25%,6.783989e+17,6.757419e+17,308637400.0,7.186315e+17,4196984000.0,10.0,10.0
50%,7.196279e+17,7.038708e+17,4196984000.0,7.804657e+17,4196984000.0,11.0,10.0
75%,7.993373e+17,8.257804e+17,4196984000.0,8.203146e+17,4196984000.0,12.0,10.0
max,8.924206e+17,8.862664e+17,8.405479e+17,8.87474e+17,7.874618e+17,1776.0,170.0


In [58]:
img_pred.describe()

Unnamed: 0,tweet_id,img_num,p1_conf,p2_conf,p3_conf
count,2075.0,2075.0,2075.0,2075.0,2075.0
mean,7.384514e+17,1.203855,0.594548,0.1345886,0.06032417
std,6.785203e+16,0.561875,0.271174,0.1006657,0.05090593
min,6.660209e+17,1.0,0.044333,1.0113e-08,1.74017e-10
25%,6.764835e+17,1.0,0.364412,0.05388625,0.0162224
50%,7.119988e+17,1.0,0.58823,0.118181,0.0494438
75%,7.932034e+17,1.0,0.843855,0.1955655,0.09180755
max,8.924206e+17,4.0,1.0,0.488014,0.273419


In [57]:
twt_api.describe()

Unnamed: 0,tweet_id,retweets_count,favorite_count
count,2354.0,2354.0,2354.0
mean,7.426978e+17,3164.797366,8080.968564
std,6.852812e+16,5284.770364,11814.771334
min,6.660209e+17,0.0,0.0
25%,6.783975e+17,624.5,1415.0
50%,7.194596e+17,1473.5,3603.5
75%,7.993058e+17,3652.0,10122.25
max,8.924206e+17,79515.0,132810.0


### Quality issues

#### `twitter_archive` table
* tweet_id is stored as a number rather than a string
* only original ratings (no retweets) that have images should be kept for analysis
* 'None' is used to represent missing data in the name column and the dog stage columns
* 'timestamp' should be formatted as a datetime
* 'expanded_urls' and other columns not needed for the analysis should be dropped
* rating numerators should be stored as floats
* the name column contains incorrect dog names
* some rating_numerator values were extracted incorrectly from decimal ratings (e.g. 13.5/10 stored as 5)
* some records have more than one dog stage

#### `image_predictions` table
* tweet_id is stored as a number rather than a string

#### `twitter_api_data` table
* tweet_id is stored as a number rather than a string

### Tidiness issues

#### `twitter_archive` table
* the dog stage columns (doggo, floofer, pupper and puppo) should be merged into a single column

#### `image_predictions` table
* the image predictions table should be merged with the twitter archive 

#### `twitter_api_data` table
* the twitter api table columns should be merged with the twitter archive
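
Several of the issues listed above can be confirmed programmatically. The following is a minimal sketch, assuming a hypothetical four-row sample that mimics the archive's layout (the real checks would run on `twt_archive`); it counts lowercase names, denominators other than 10, and rows with more than one dog stage.

```python
import pandas as pd

# hypothetical sample; 'None' marks an absent name or stage, as in the archive
sample = pd.DataFrame({
    'name': ['Phineas', 'a', 'None', 'Tilly'],
    'rating_denominator': [10, 10, 170, 10],
    'doggo': ['None', 'doggo', 'None', 'None'],
    'pupper': ['None', 'pupper', 'None', 'None'],
})

# names that start with a lowercase letter are unlikely to be real dog names
lowercase_names = sample['name'].str.match(r'^[a-z]').sum()

# ratings whose denominator is not the usual 10
odd_denominators = (sample['rating_denominator'] != 10).sum()

# rows where more than one stage column is filled in
multi_stage = ((sample[['doggo', 'pupper']] != 'None').sum(axis=1) > 1).sum()
```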




## Cleaning the data

In this section, we clean the three datasets using the define-code-test framework.

We begin by making copies of the original datasets.

In [314]:
# Make copies of the original datasets

twt_archive_clean = twt_archive.copy()
img_pred_clean = img_pred.copy()
twt_api_clean = twt_api.copy()

* define: only keep original ratings (no retweets) that have images for analysis

* code:

In [315]:
# filter out retweets using retweeted_status_user_id

twt_archive_clean = twt_archive_clean.query('retweeted_status_user_id.isnull()')

* test:

In [316]:
# test

twt_archive_clean.retweeted_status_user_id.value_counts().sum()

0

* define: drop 'expanded_urls' and the other columns that are not needed for the analysis

* code:

In [317]:
# drop columns that are not needed for the analysis

twt_archive_clean.drop(columns = ['expanded_urls', 'in_reply_to_status_id', 'in_reply_to_user_id', 'source',
                  'retweeted_status_id', 'retweeted_status_user_id', 'retweeted_status_timestamp'], inplace=True)

* test:

In [251]:
# test

twt_archive_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2175 entries, 0 to 2355
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   tweet_id            2175 non-null   int64 
 1   timestamp           2175 non-null   object
 2   text                2175 non-null   object
 3   rating_numerator    2175 non-null   int64 
 4   rating_denominator  2175 non-null   int64 
 5   name                2175 non-null   object
 6   doggo               2175 non-null   object
 7   floofer             2175 non-null   object
 8   pupper              2175 non-null   object
 9   puppo               2175 non-null   object
dtypes: int64(3), object(7)
memory usage: 186.9+ KB


* define: change tweet_id data type to string

* code:

In [318]:
# convert tweet_id to a string 

twt_archive_clean.tweet_id = twt_archive_clean.tweet_id.astype(str)
img_pred_clean.tweet_id = img_pred_clean.tweet_id.astype(str)
twt_api_clean.tweet_id = twt_api_clean.tweet_id.astype(str)

* test:

In [319]:
# test

twt_archive_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2175 entries, 0 to 2355
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   tweet_id            2175 non-null   object
 1   timestamp           2175 non-null   object
 2   text                2175 non-null   object
 3   rating_numerator    2175 non-null   int64 
 4   rating_denominator  2175 non-null   int64 
 5   name                2175 non-null   object
 6   doggo               2175 non-null   object
 7   floofer             2175 non-null   object
 8   pupper              2175 non-null   object
 9   puppo               2175 non-null   object
dtypes: int64(2), object(8)
memory usage: 186.9+ KB
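
One reason to store `tweet_id` as a string is that the IDs are identifiers rather than quantities, and they are too large to survive a conversion to floating point (the summary statistics earlier display them as values such as 7.427716e+17). A small sketch of the precision loss, using the first `tweet_id` from the archive:

```python
# float64 represents every integer exactly only up to 2**53
tweet_id = 892420643555336193  # first tweet_id in the archive

as_float = float(tweet_id)
round_trip = int(as_float)

id_is_large = tweet_id > 2**53      # True: the ID is outside the exact range
lossy = round_trip != tweet_id      # True: the float round trip changes the ID
as_text = str(tweet_id)             # the string form keeps every digit
```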


* define: change timestamp to datetime

* code:

In [320]:
# convert timestamp to datetime

twt_archive_clean.timestamp = pd.to_datetime(twt_archive_clean.timestamp)

* test:

In [255]:
# test

twt_archive_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2175 entries, 0 to 2355
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype              
---  ------              --------------  -----              
 0   tweet_id            2175 non-null   object             
 1   timestamp           2175 non-null   datetime64[ns, UTC]
 2   text                2175 non-null   object             
 3   rating_numerator    2175 non-null   int64              
 4   rating_denominator  2175 non-null   int64              
 5   name                2175 non-null   object             
 6   doggo               2175 non-null   object             
 7   floofer             2175 non-null   object             
 8   pupper              2175 non-null   object             
 9   puppo               2175 non-null   object             
dtypes: datetime64[ns, UTC](1), int64(2), object(7)
memory usage: 186.9+ KB
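
The timestamps in the archive carry an explicit `+0000` offset, which is why the converted column is timezone-aware (`datetime64[ns, UTC]`). As an illustration of the format being parsed, here is a sketch with the standard library only; the format string is an assumption derived from the sample rows, not something the notebook specifies.

```python
from datetime import datetime, timedelta

# timestamp string in the archive's format, taken from the first row of the archive
raw = '2017-08-01 16:23:56 +0000'

# %z parses the '+0000' offset, producing a timezone-aware datetime
parsed = datetime.strptime(raw, '%Y-%m-%d %H:%M:%S %z')
is_utc = parsed.utcoffset() == timedelta(0)
```

The same format string could be passed to `pd.to_datetime` through its `format` parameter to make the parsing explicit.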


* define: fix incorrect dog names and set to NA

* code:

In [321]:
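import warnings
warnings.filterwarnings('ignore') # disable warnings from computation

# replace improper (lowercase) dog names, then set them to NA
# note: '^[a-z]' matches only the first character, so a name such as 'such' becomes 'Noneuch'
twt_archive_clean.name = twt_archive_clean.name.str.replace('^[a-z]', 'None', regex=True)

# note: no column is passed to .loc, so every column of the matching rows is set to NaN
twt_archive_clean.loc[twt_archive_clean['name'] == 'None'] = np.nan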

* test:

In [322]:
# test

twt_archive_clean.name.value_counts() 

Lucy          11
Charlie       11
Cooper        10
Oliver        10
Tucker         9
              ..
Wishes         1
Rose           1
Theo           1
Fido           1
Christoper     1
Name: name, Length: 953, dtype: int64

In [323]:
twt_archive_clean.name.isna().sum()

735
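
Because the `.loc` assignment in the code cell does not name a column, it sets every column of the matching rows to NaN; this is consistent with the later `info()` output, where all columns show 1440 non-null values (2175 minus 735). The following is a hedged sketch of a column-scoped alternative on a hypothetical sample, not the notebook's original method: it flags the literal 'None' and any name starting with a lowercase letter, then blanks only the `name` column.

```python
import numpy as np
import pandas as pd

# hypothetical sample; 'a' and 'such' stand in for lowercase non-names
df = pd.DataFrame({
    'tweet_id': ['1', '2', '3', '4'],
    'name': ['Phineas', 'a', 'such', 'None'],
})

# a name is treated as invalid if it is the literal 'None' or starts with a lowercase letter
invalid = df['name'].eq('None') | df['name'].str.match(r'^[a-z]')

# assign NaN to the name column only, leaving the rest of each row intact
df.loc[invalid, 'name'] = np.nan
```

Applying this to the real data would change the row counts reported in the later cells, since tweets without a valid name would no longer be lost.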

* define: fix numerator ratings with decimals

* code:

In [324]:
# find tweets whose text contains a decimal rating such as 13.5/10
decimal_numerators = []
for i, text in twt_archive_clean['text'].items():
    if re.search(r'\d+\.\d+/\d+', str(text)):
        decimal_numerators.append({twt_archive_clean['tweet_id'][i]: [i, text, twt_archive_clean['rating_numerator'][i]]})
        
decimal_numerators

[{'883482846933004288': [45,
   'This is Bella. She hopes her smile made you smile. If not, she is also offering you her favorite monkey. 13.5/10 https://t.co/qjrljjt948',
   5.0]},
 {'786709082849828864': [695,
   "This is Logan, the Chow who lived. He solemnly swears he's up to lots of good. H*ckin magical af 9.75/10 https://t.co/yBO5wuqaPS",
   75.0]},
 {'778027034220126208': [763,
   "This is Sophie. She's a Jubilant Bush Pupper. Super h*ckin rare. Appears at random just to smile at the locals. 11.27/10 would smile back https://t.co/QFaUiIHxHq",
   27.0]}]

In [327]:
# set the correct decimal numerators by index label

twt_archive_clean.at[45,'rating_numerator'] = 13.5
twt_archive_clean.at[695,'rating_numerator'] = 9.75
twt_archive_clean.at[763,'rating_numerator'] = 11.27

* test:

In [330]:
# test

decimal_numerators = []
for i, text in twt_archive_clean['text'].items():
    if re.search(r'\d+\.\d+/\d+', str(text)):
        decimal_numerators.append({twt_archive_clean['tweet_id'][i]: [text, twt_archive_clean['rating_numerator'][i]]})
        
decimal_numerators


[{'883482846933004288': ['This is Bella. She hopes her smile made you smile. If not, she is also offering you her favorite monkey. 13.5/10 https://t.co/qjrljjt948',
   13.5]},
 {'786709082849828864': ["This is Logan, the Chow who lived. He solemnly swears he's up to lots of good. H*ckin magical af 9.75/10 https://t.co/yBO5wuqaPS",
   9.75]},
 {'778027034220126208': ["This is Sophie. She's a Jubilant Bush Pupper. Super h*ckin rare. Appears at random just to smile at the locals. 11.27/10 would smile back https://t.co/QFaUiIHxHq",
   11.27]}]
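
The three numerators above were corrected by hand. As a sketch of how the same values could be pulled from the text automatically, the following uses a regular expression that accepts an optional decimal part. The `extract_rating` helper is hypothetical, and the sample strings are abbreviated versions of the tweets shown above.

```python
import re

# rating pattern: numerator with optional decimal part, a slash, then the denominator
RATING = re.compile(r'(\d+(?:\.\d+)?)/(\d+)')

def extract_rating(text):
    """Return (numerator, denominator) as floats for the first rating found, else None."""
    match = RATING.search(text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))

bella = extract_rating('She is also offering you her favorite monkey. 13.5/10')
logan = extract_rating('H*ckin magical af 9.75/10')
plain = extract_rating('This is Phineas. 13/10')
missing = extract_rating('no rating here')
```

Note that this takes the first match in the text, so a tweet containing another fraction before the rating would still need manual review.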

* define: change numerator and denominator ratings to float

* code:

In [331]:
# convert to float datatype
twt_archive_clean[['rating_numerator', 'rating_denominator']] = twt_archive_clean[['rating_numerator','rating_denominator']].astype(float)


* test:

In [263]:
#test

twt_archive_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2175 entries, 0 to 2355
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype              
---  ------              --------------  -----              
 0   tweet_id            1440 non-null   object             
 1   timestamp           1440 non-null   datetime64[ns, UTC]
 2   text                1440 non-null   object             
 3   rating_numerator    1440 non-null   float64            
 4   rating_denominator  1440 non-null   float64            
 5   name                1440 non-null   object             
 6   doggo               1440 non-null   object             
 7   floofer             1440 non-null   object             
 8   pupper              1440 non-null   object             
 9   puppo               1440 non-null   object             
dtypes: datetime64[ns, UTC](1), float64(2), object(7)
memory usage: 251.5+ KB


* define: Melt the doggo, floofer, pupper, puppo columns to a dog_stage column.

* code:

In [333]:
twt_archive_clean = pd.melt(twt_archive_clean, id_vars=['tweet_id', 'timestamp', 'text', 'rating_numerator', 'rating_denominator', 'name'],
                           var_name='dog_stager', value_name='dog_stage')
twt_archive_clean = twt_archive_clean.drop('dog_stager', axis=1)

* test:

In [265]:
# test

twt_archive_clean.head()

Unnamed: 0,tweet_id,timestamp,text,rating_numerator,rating_denominator,name,dog_stage
0,892420643555336193,2017-08-01 16:23:56+00:00,This is Phineas. He's a mystical boy. Only eve...,13.0,10.0,Phineas,
1,892177421306343426,2017-08-01 00:17:27+00:00,This is Tilly. She's just checking pup on you....,13.0,10.0,Tilly,
2,891815181378084864,2017-07-31 00:18:03+00:00,This is Archie. He is a rare Norwegian Pouncin...,12.0,10.0,Archie,
3,891689557279858688,2017-07-30 15:58:51+00:00,This is Darla. She commenced a snooze mid meal...,13.0,10.0,Darla,
4,891327558926688256,2017-07-29 16:00:24+00:00,This is Franklin. He would like you to stop ca...,12.0,10.0,Franklin,


In [266]:
twt_archive_clean.dog_stage.value_counts() 

None       5561
pupper      133
doggo        45
puppo        16
floofer       5
Name: dog_stage, dtype: int64

* define: remove duplicated rows

* code:

In [334]:
twt_archive_clean.duplicated().sum()

7060

* test:

In [335]:
twt_archive_clean.shape

(8700, 7)

In [336]:
# drop duplicate rows, then confirm the new shape
twt_archive_clean.drop_duplicates(inplace=True)
twt_archive_clean.shape

(1640, 7)
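
Melting produces one row per tweet and stage column, which is why so many duplicates have to be dropped afterwards. An alternative, shown here as a sketch on a hypothetical sample rather than as the notebook's method, is to combine the four columns row by row, which keeps one row per tweet and also handles records with more than one stage. The `combine_stages` helper is an illustrative name.

```python
import pandas as pd

# hypothetical sample in the archive's layout, where 'None' marks an absent stage
df = pd.DataFrame({
    'tweet_id': ['1', '2', '3'],
    'doggo':   ['None', 'doggo', 'doggo'],
    'floofer': ['None', 'None', 'None'],
    'pupper':  ['None', 'None', 'pupper'],
    'puppo':   ['None', 'None', 'None'],
})

stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']

def combine_stages(row):
    """Join every stage present in the row; return None when there is none."""
    stages = [value for value in row if value != 'None']
    return ', '.join(stages) if stages else None

df['dog_stage'] = df[stage_cols].apply(combine_stages, axis=1)
df = df.drop(columns=stage_cols)
```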

* define: convert dog_stage to category

* code:

In [337]:
# convert to category datatype 
twt_archive_clean.dog_stage = twt_archive_clean.dog_stage.astype('category')

* test:

In [272]:
twt_archive_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1640 entries, 0 to 7430
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype              
---  ------              --------------  -----              
 0   tweet_id            1639 non-null   object             
 1   timestamp           1639 non-null   datetime64[ns, UTC]
 2   text                1639 non-null   object             
 3   rating_numerator    1639 non-null   float64            
 4   rating_denominator  1639 non-null   float64            
 5   name                1639 non-null   object             
 6   dog_stage           1639 non-null   category           
dtypes: category(1), datetime64[ns, UTC](1), float64(2), object(3)
memory usage: 91.5+ KB


* define: merge image prediction and twitter api datasets to twitter archive

* code:

In [339]:
twt_archive_clean = pd.merge(left=twt_archive_clean, right=img_pred_clean, how='left', on='tweet_id')
twt_archive_clean = pd.merge(left=twt_archive_clean, right=twt_api_clean, how='left', on='tweet_id')

* test:

In [340]:
# test

twt_archive_clean.info() 

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1640 entries, 0 to 1639
Data columns (total 20 columns):
 #   Column              Non-Null Count  Dtype              
---  ------              --------------  -----              
 0   tweet_id            1639 non-null   object             
 1   timestamp           1639 non-null   datetime64[ns, UTC]
 2   text                1639 non-null   object             
 3   rating_numerator    1639 non-null   float64            
 4   rating_denominator  1639 non-null   float64            
 5   name                1639 non-null   object             
 6   dog_stage           1639 non-null   category           
 7   jpg_url             1583 non-null   object             
 8   img_num             1583 non-null   float64            
 9   p1                  1583 non-null   object             
 10  p1_conf             1583 non-null   float64            
 11  p1_dog              1583 non-null   object             
 12  p2                  1583 non-null 
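
A left merge keeps every archive row, so tweets without an image prediction end up with missing values in the prediction columns. The following sketch, on hypothetical three-row tables, uses the `indicator` and `validate` parameters of `pd.merge` to count unmatched rows and to fail early if a key is duplicated on either side.

```python
import pandas as pd

# hypothetical tables standing in for the archive and the image predictions
archive = pd.DataFrame({'tweet_id': ['1', '2', '3'],
                        'name': ['Phineas', 'Tilly', 'Archie']})
predictions = pd.DataFrame({'tweet_id': ['1', '3'],
                            'p1': ['redbone', 'collie']})

# indicator=True adds a '_merge' column; validate raises MergeError on duplicate keys
merged = pd.merge(archive, predictions, how='left', on='tweet_id',
                  validate='one_to_one', indicator=True)

# rows present in the archive but absent from the predictions
unmatched = (merged['_merge'] == 'left_only').sum()
```

If the left table legitimately contains repeated `tweet_id` values, `validate='many_to_one'` would be the appropriate setting instead.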

* define: remove missing values

* code:

In [341]:
twt_archive_clean.isna().sum()

tweet_id               1
timestamp              1
text                   1
rating_numerator       1
rating_denominator     1
name                   1
dog_stage              1
jpg_url               57
img_num               57
p1                    57
p1_conf               57
p1_dog                57
p2                    57
p2_conf               57
p2_dog                57
p3                    57
p3_conf               57
p3_dog                57
retweets_count         1
favorite_count         1
dtype: int64

In [342]:
twt_archive_clean.dropna(axis = 0, inplace=True) 

* test:

In [343]:
# test

twt_archive_clean.info() 

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1583 entries, 0 to 1639
Data columns (total 20 columns):
 #   Column              Non-Null Count  Dtype              
---  ------              --------------  -----              
 0   tweet_id            1583 non-null   object             
 1   timestamp           1583 non-null   datetime64[ns, UTC]
 2   text                1583 non-null   object             
 3   rating_numerator    1583 non-null   float64            
 4   rating_denominator  1583 non-null   float64            
 5   name                1583 non-null   object             
 6   dog_stage           1583 non-null   category           
 7   jpg_url             1583 non-null   object             
 8   img_num             1583 non-null   float64            
 9   p1                  1583 non-null   object             
 10  p1_conf             1583 non-null   float64            
 11  p1_dog              1583 non-null   object             
 12  p2                  1583 non-null 

## Save cleaned data


In [344]:
twt_archive_clean.to_csv('twitter_archive_master.csv', index=False) 
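
CSV files do not store data types, so `tweet_id` is read back as an integer unless the type is specified. A minimal sketch of a round trip that preserves the string type, using an in-memory buffer in place of `twitter_archive_master.csv` and a single row taken from the tables above:

```python
import io
import pandas as pd

df = pd.DataFrame({'tweet_id': ['892420643555336193'], 'favorite_count': [39467]})

# write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

# dtype keeps tweet_id as text when the file is loaded again
restored = pd.read_csv(buffer, dtype={'tweet_id': str})
```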