<a href="https://colab.research.google.com/github/lustraka/Data_Analysis_Workouts/blob/main/Analyse_Twitter_Data/wrangle_act.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project: Wrangle and Analyze Data

In [1]:
# Import dependencies
import requests
import os
import json
import tweepy
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Data Gathering
In the cells below, gather **all** three pieces of data for this project and load them in the notebook. **Note:** the method required to gather each piece of data is different.
1. Directly download the WeRateDogs Twitter archive data (twitter_archive_enhanced.csv)

In [None]:
path_csv = 'https://raw.githubusercontent.com/lustraka/Data_Analysis_Workouts/main/Analyse_Twitter_Data/'
dfa = pd.read_csv(path_csv+'twitter-archive-enhanced.csv')
dfa.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


2. Use the Requests library to download the tweet image prediction (image_predictions.tsv)

In [None]:
url_tsv = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
r = requests.get(url_tsv)
with open('image-predictions.tsv', 'wb') as file:
  file.write(r.content)
dfi = pd.read_csv('image-predictions.tsv', sep='\t')
dfi.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


3. Use the Tweepy library to query additional data via the Twitter API (tweet_json.txt)

In [None]:
consumer_key = ''
consumer_secret = ''
access_token = ''
access_secret = ''

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

In [None]:
from timeit import default_timer as timer

count = 0
fails_dict =  {}
start = timer()

if 'tweet_json.txt' in os.listdir():
  os.remove('tweet_json.txt')

with open('tweet_json.txt', 'a') as file:
  for tweet_id in dfa.tweet_id.values:
    count += 1
    print(str(count) + ': ' + str(tweet_id))
    try:
      status = api.get_status(tweet_id, tweet_mode='extended')._json
      print("Success")
      file.write(json.dumps(status, ensure_ascii=False)+'\n')
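    except tweepy.TweepError as e:
      # Twitter API error (e.g. deleted tweet): remember it and continue
      print('Fail')
      fails_dict[tweet_id] = e
    except Exception as e:
      # Any other error: report it and continue with the next tweet
      print('Fail', e)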
end = timer()
print(f'Elapsed time: {end - start}')
print(fails_dict)

Data gathered from the Twitter API:

| Attribute | Type | Description |
| --- | :-: | --- |
| id | int | The integer representation of unique identifier for this Tweet |
| retweet_count | int | Number of times this Tweet has been retweeted. |
| favorite_count | int | *Nullable*. Indicates approximately how many times this tweet has been liked by Twitter users. |

Reference: [Twitter API docs: Tweet object](https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/object-model/tweet)



In [None]:
df_tweets = []
with open('tweet_json.txt', 'r') as file:
  line = file.readline()
  while line:
    status = json.loads(line)
    df_tweets.append({'tweet_id': status['id'], 'retweet_count': status['retweet_count'], 'favorite_count': status['favorite_count']})
    line = file.readline()
dft = pd.DataFrame(df_tweets)
dft.head()

Unnamed: 0,tweet_id,retweet_count,favorite_count
0,892420643555336193,7164,34458
1,892177421306343426,5385,29870
2,891815181378084864,3551,22490
3,891689557279858688,7383,37671
4,891327558926688256,7922,35988


In [2]:
# Download the database
url_db = 'https://github.com/lustraka/Data_Analysis_Workouts/blob/main/Analyse_Twitter_Data/weratedogsdata.db?raw=true'
r = requests.get(url_db)
with open('weratedogsdata.db', 'wb') as file:
  file.write(r.content)

from sqlalchemy import create_engine
# Create SQLAlchemy engine and connect to the database
engine = create_engine('sqlite:///weratedogsdata.db')

# Read dataframes from the SQLite database
dfa = pd.read_sql('SELECT * FROM dba', engine)
dfi = pd.read_sql('SELECT * FROM dbi', engine)
dft = pd.read_sql('SELECT * FROM dbt', engine)

## Assessing Data
In this section, detect and document at least **eight (8) quality issues and two (2) tidiness issues**. You must use **both** visual assessment
and programmatic assessment to assess the data.

**Note:** pay attention to the following key points when you assess the data.

* You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
* Assessing and cleaning the entire dataset completely would require a lot of time, and is not necessary to practice and demonstrate your skills in data wrangling. Therefore, the requirements of this project are only to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset.
* The fact that the rating numerators are greater than the denominators does not need to be cleaned. This [unique rating system](http://knowyourmeme.com/memes/theyre-good-dogs-brent) is a big part of the popularity of WeRateDogs.
* You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.



### The archive `twitter_archive_enhanced.csv` (alias `dba`)
> "I extracted this data programmatically, but I didn't do a very good job. The ratings probably aren't all correct. Same goes for the dog names and probably dog stages (see below for more information on these) too. You'll need to assess and clean these columns if you want to use them for analysis and visualization."

In [3]:
dfa.sample(15)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
254,844580511645339650,,,2017-03-22 16:04:20 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Charlie. He wants to know if you have ...,,,,https://twitter.com/dog_rates/status/844580511...,11,10,Charlie,,,,
1806,676936541936185344,,,2015-12-16 01:27:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we see a rare pouched pupper. Ample stora...,,,,https://twitter.com/dog_rates/status/676936541...,8,10,,,,pupper,
1570,687732144991551489,,,2016-01-14 20:24:55 +0000,"<a href=""http://vine.co"" rel=""nofollow"">Vine -...",This is Ember. That's the q-tip she owes money...,,,,https://vine.co/v/iOuMphL5DBY,11,10,Ember,,,,
1190,718234618122661888,,,2016-04-08 00:30:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Suki. She was born with a blurry tail ...,,,,https://twitter.com/dog_rates/status/718234618...,11,10,Suki,,,,
1328,705970349788291072,,,2016-03-05 04:17:02 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Lucy. She's a Venetian Kerploof. Suppo...,,,,https://twitter.com/dog_rates/status/705970349...,12,10,Lucy,,,,
330,833124694597443584,,,2017-02-19 01:23:00 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Gidget. She's a spy pupper. Stealthy a...,,,,https://twitter.com/dog_rates/status/833124694...,12,10,Gidget,,,pupper,
1077,739544079319588864,,,2016-06-05 19:47:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This... is a Tyrannosaurus rex. We only rate d...,,,,https://twitter.com/dog_rates/status/739544079...,10,10,,,,,
1830,676219687039057920,,,2015-12-14 01:58:31 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Kenneth. He's stuck in a bubble. 10/10...,,,,https://twitter.com/dog_rates/status/676219687...,10,10,Kenneth,,,,
203,853299958564483072,,,2017-04-15 17:32:18 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Rumpole. He'll be your Uber driver thi...,,,,https://twitter.com/dog_rates/status/853299958...,13,10,Rumpole,,,,
353,831309418084069378,,,2017-02-14 01:09:44 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Scooter and his son Montoya. Scooter ...,,,,https://twitter.com/dog_rates/status/831309418...,12,10,Scooter,,,,


In [4]:
dfa.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

In [5]:
for col in dfa.columns[[10,11,13,14,15,16]]:
  print(dfa[col].unique())

[  13   12   14    5   17   11   10  420  666    6   15  182  960    0
   75    7   84    9   24    8    1   27    3    4  165 1776  204   50
   99   80   45   60   44  143  121   20   26    2  144   88]
[ 10   0  15  70   7  11 150 170  20  50  90  80  40 130 110  16 120   2]
['None' 'doggo']
['None' 'floofer']
['None' 'pupper']
['None' 'puppo']


#### Curated `twitter_archive_enhanced.csv` info

| # | Variable | Non-Null | Nunique | Dtype | Notes |
|---|----------|----------|---------|-------|-------|
| 0 | tweet_id | 2356 | 2356 | int64  | |
| 1 | in_reply_to_status_id | 78 | 77 | float64 | these tweets are replies |
| 2 | in_reply_to_user_id | 78 | 31 | float64 | see $\uparrow$ |
| 3 | timestamp | 2356 | 2356 | object | object $\to$ datetime | 
| 4 | source | 2356 | 4 | object | |
| 5 | text | 2356 | 2356 | object | some tweets don't have an image (1) |
| 6 | retweeted_status_id | 181 | 181 | float64 | these are retweets |
| 7 | retweeted_status_user_id | 181 | 25 | float64 | see $\uparrow$ |
| 8 | retweeted_status_timestamp | 181 | 181 | object | see $\uparrow$ |
| 9 | expanded_urls | 2297 | 2218 | object | missing values |
| 10 | rating_numerator | 2356 | 40 | int64  | entries with numerator $> 20$ may be incorrect (4a) |
| 11 | rating_denominator | 2356 | 18 | int64  | entries with denominator $\neq 10$ may be incorrect (4b) |
| 12 | name | 2356 | 957 | object | incorrect names or missing values (2) |
| 13 | doggo | 2356 | 2 | object | a value stored as a column + (3) some misclassified stages |
| 14 | floofer | 2356 | 2 | object | see $\uparrow$ |
| 15 | pupper | 2356 | 2 | object | see $\uparrow$ |
| 16 | puppo | 2356 | 2 | object | see $\uparrow$ |

Source: visual and programmatic assessment

```python
# #, Variable, Non-Null (Count), Dtype:
dfa.info()
# Nunique:
dfa.nunique()
# Check unique values
for col in dfa.columns[[10,11,13,14,15,16]]:
  print(dfa[col].unique())
# Notes
# (1) Some tweets don't have an image
dfa.loc[dfa.text.apply(lambda s: 'https://t.co' not in s)].shape[0]
# [Out] 124
```

In [6]:
# (2a) Incorrect names - begin with a lowercase
import re
print(re.findall(r';([a-z].*?);', ';'.join(dfa.name.unique())))

['such', 'a', 'quite', 'not', 'one', 'incredibly', 'mad', 'an', 'very', 'just', 'my', 'his', 'actually', 'getting', 'this', 'all', 'old', 'infuriating', 'the', 'by', 'officially', 'life', 'light', 'space']
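
The regular expression above consumes the `;` delimiter on both sides of each match, so two lowercase entries that happen to sit next to each other in the joined string can hide one another, and the first and last entries are never tested. A minimal alternative sketch, assuming a toy `names` Series standing in for `dfa.name` (the values are illustrative, not taken from the archive), checks every value on its own:

```python
import pandas as pd

# Toy stand-in for dfa.name (hypothetical values)
names = pd.Series(['Phineas', 'a', 'an', 'None', 'Tilly', 'quite', 'a'])

# A value is suspicious when its first character is a lowercase letter;
# str.match anchors the pattern at the start of each string
lowercase_mask = names.str.match(r'[a-z]')
lowercase_names = sorted(names[lowercase_mask].unique())
print(lowercase_names)
```

The same mask can be reused later to overwrite the suspicious values.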


In [7]:
# (2b) Incorrect names - None
dfa.loc[dfa.name == 'None'].shape[0]

745

In [8]:
# (3a) Misclassified stages - indicated in the stage but not present in the text
stages = ['doggo', 'pupper', 'puppo', 'floofer']
print('Stage     | Total | Misclassified |')
print('-'*35)
for stage in stages:
  total = dfa.loc[dfa[stage] == stage].shape[0]
  missed = dfa.loc[(dfa[stage] == stage) & (dfa.text.apply(lambda s: stage not in s.lower()))].shape[0]
  print(f"{stage.ljust(9)} | {total:5d} | {missed:13d} |")

Stage     | Total | Misclassified |
-----------------------------------
doggo     |    97 |             0 |
pupper    |   257 |             0 |
puppo     |    30 |             0 |
floofer   |    10 |             0 |


In [9]:
# (3b) Misclassified stages - not indicated in the stage but is present in the text
stages = ['doggo', 'pupper', 'puppo', 'floofer']
print('Stage     | Total | Misclassified |')
print('-'*35)
for stage in stages:
  total = dfa.loc[dfa[stage] == stage].shape[0]
  missed = dfa.loc[(dfa[stage] != stage) & (dfa.text.apply(lambda s: stage in s.lower()))].shape[0]
  print(f"{stage.ljust(9)} | {total:5d} | {missed:13d} |")

Stage     | Total | Misclassified |
-----------------------------------
doggo     |    97 |            10 |
pupper    |   257 |            26 |
puppo     |    30 |             8 |
floofer   |    10 |             0 |


##### Note (4) Ratings where `rating_numerator` $> 20$ or `rating_denominator` $\neq 10$
Code used:
```python
# Show the whole text
pd.options.display.max_colwidth = None

# (4a) Show tweets with possibly incorrect rating : rating_numerator > 20
dfa.loc[dfa.rating_numerator > 20, ['text', 'rating_numerator', 'rating_denominator']]

# (4b) Show tweets with possibly incorrect rating : rating_denominator != 10
dfa.loc[dfa.rating_denominator != 10, ['text', 'rating_numerator', 'rating_denominator']]

```
Where a tweet uses a decimal numerator, such as 9.75/10 or 11.27/10, we round down (floor), i.e. to 9/10 and 11/10 respectively. We correct only those ratings that were extracted incorrectly from the text. Ratings with weird values that really appear in the text are left unchanged, because they're good dogs, Brent.
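
The floor-rounding rule can be sketched as follows. The pattern `RATING_RE` and the helper `parse_rating` are illustrative assumptions, not the extraction that produced the archive; the sketch also assumes that the rating is the last `numerator/denominator` pattern in the text, which should be checked against the data before relying on it.

```python
import math
import re

# Hypothetical pattern: an integer or decimal numerator, a slash, an integer denominator
RATING_RE = re.compile(r'(\d+(?:\.\d+)?)/(\d+)')

def parse_rating(text):
    """Return (numerator, denominator) from the last rating-like pattern, or None."""
    matches = RATING_RE.findall(text)
    if not matches:
        return None
    numerator, denominator = matches[-1]
    # Floor decimal numerators, e.g. 9.75 -> 9
    return math.floor(float(numerator)), int(denominator)

print(parse_rating('H*ckin magical af 9.75/10'))
print(parse_rating('Happy 4/20 from the squad! 13/10 for all'))
```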

Results:

In [11]:
# Show the whole text
pd.options.display.max_colwidth = None

# Fill dict with key = index and value = correct rating
incorrect_rating = {313 : '13/10', 340 : '9/10', 763: '11/10', 784 : '14/10', 1165 : '13/10', 1202 : '11/10', 1662 : '10/10', 2335 : '9/10'}

# Indicate tweets with missing rating
missing_rating = [342, 516]

# Show tweets with incorrectly identified ratings
dfa.loc[list(incorrect_rating.keys()), ['text', 'rating_numerator', 'rating_denominator']]

Unnamed: 0,text,rating_numerator,rating_denominator
313,"@jonnysun @Lin_Manuel ok jomny I know you're excited but 960/00 isn't a valid rating, 13/10 is tho",960,0
340,"RT @dog_rates: This is Logan, the Chow who lived. He solemnly swears he's up to lots of good. H*ckin magical af 9.75/10 https://t.co/yBO5wu…",75,10
763,This is Sophie. She's a Jubilant Bush Pupper. Super h*ckin rare. Appears at random just to smile at the locals. 11.27/10 would smile back https://t.co/QFaUiIHxHq,27,10
784,"RT @dog_rates: After so many requests, this is Bretagne. She was the last surviving 9/11 search dog, and our second ever 14/10. RIP https:/…",9,11
1165,Happy 4/20 from the squad! 13/10 for all https://t.co/eV1diwds8a,4,20
1202,This is Bluebert. He just saw that both #FinalFur match ups are split 50/50. Amazed af. 11/10 https://t.co/Kky1DPG4iq,50,50
1662,This is Darrel. He just robbed a 7/11 and is in a high speed police chase. Was just spotted by the helicopter 10/10 https://t.co/7EsP8LmSp5,7,11
2335,This is an Albanian 3 1/2 legged Episcopalian. Loves well-polished hardwood flooring. Penis on the collar. 9/10 https://t.co/d9NcXFKwLv,1,2


### The Tweet Image Predictions `image_predictions.tsv`
>A table full of image predictions (the top three only) alongside each tweet ID, image URL, and the image number that corresponded to the most confident prediction (numbered 1 to 4 since tweets can have up to four images)

In [12]:
dfi.sample(10)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
1238,746818907684614144,https://pbs.twimg.com/media/Cl071YVWEAAlF7N.jpg,1,dingo,0.175518,0,timber_wolf,0.133647,0,Ibizan_hound,0.101537,1
774,689623661272240129,https://pbs.twimg.com/media/CZIJD2SWIAMJgNI.jpg,1,toy_poodle,0.279604,1,mashed_potato,0.208564,0,Labrador_retriever,0.077481,1
1679,813217897535406080,https://pbs.twimg.com/media/C0khWkVXEAI389B.jpg,1,Samoyed,0.905972,1,Pomeranian,0.048038,1,West_Highland_white_terrier,0.035667,1
1977,870374049280663552,https://pbs.twimg.com/media/DBQwlFCXkAACSkI.jpg,1,golden_retriever,0.841001,1,Great_Pyrenees,0.099278,1,Labrador_retriever,0.032621,1
871,697943111201378304,https://pbs.twimg.com/media/Ca-XjfiUsAAUa8f.jpg,1,Great_Dane,0.126924,1,Greater_Swiss_Mountain_dog,0.110037,1,German_short-haired_pointer,0.090816,1
656,682259524040966145,https://pbs.twimg.com/media/CXffar9WYAArfpw.jpg,1,Siberian_husky,0.43967,1,Eskimo_dog,0.340474,1,malamute,0.101253,1
997,708356463048204288,https://pbs.twimg.com/media/CdSWcc1XIAAXc6H.jpg,2,pug,0.871283,1,French_bulldog,0.04182,1,bath_towel,0.015228,0
629,680913438424612864,https://pbs.twimg.com/media/CXMXKKHUMAA1QN3.jpg,1,Pomeranian,0.615678,1,golden_retriever,0.126455,1,Chihuahua,0.087184,1
1919,855851453814013952,https://pbs.twimg.com/media/C-CYWrvWAAU8AXH.jpg,1,flat-coated_retriever,0.321676,1,Labrador_retriever,0.115138,1,groenendael,0.0961,1
209,669972011175813120,https://pbs.twimg.com/media/CUw3_QiUEAA8cT9.jpg,1,teddy,0.953071,0,koala,0.007027,0,fur_coat,0.005368,0


In [13]:
dfi.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   int64  
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   int64  
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   int64  
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   int64  
dtypes: float64(3), int64(5), object(4)
memory usage: 194.7+ KB


#### Curated Info

| # | Variable | Non-Null | Nunique | Dtype | Notes |
|---|----------|----------|---------|-------|-------|
| 0 | tweet_id | 2075 | 2075 | int64 | |
| 1 | jpg_url | 2075 | 2009 | object | |
| 2 | img_num | 2075 | 4 | int64 | the image number that corresponded to the most confident prediction|
| 3 | p1 | 2075 | 378 | object | prediction |
| 4 | p1_conf | 2075 | 2006 | float64 | confidence of prediction |
| 5 | p1_dog | 2075 | 2 | int64 | Is the prediction a breed of dog? : int $\to$ bool |
| 6 | p2 | 2075 | 405 | object | ditto |
| 7 | p2_conf | 2075 | 2004 | float64 | ditto |
| 8 | p2_dog | 2075 | 2 | int64 | ditto |
| 9 | p3 | 2075 | 408 | object | ditto |
| 10 | p3_conf | 2075 | 2006 | float64 | ditto |
| 11 | p3_dog | 2075 | 2 | int64 | ditto |

Source: visual and programmatic assessment

```python
# #, Variable, Non-Null (Count), Dtype:
dfi.info()
# Nunique:
dfi.nunique()
```


### Additional Data From Twitter API

In [15]:
dft.sample(10)

Unnamed: 0,tweet_id,retweet_count,favorite_count
136,864279568663928832,2633,13309
808,768609597686943744,1129,3932
1490,690932576555528194,940,3101
1818,675820929667219457,219,972
149,861383897657036800,9480,32940
1434,694905863685980160,869,2595
1458,693109034023534592,566,1604
597,796484825502875648,1682,7338
1145,720340705894408192,909,2691
620,793256262322548741,7999,19337


In [16]:
dft.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2327 entries, 0 to 2326
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   tweet_id        2327 non-null   int64
 1   retweet_count   2327 non-null   int64
 2   favorite_count  2327 non-null   int64
dtypes: int64(3)
memory usage: 54.7 KB


#### Curated Info

| # | Variable | Non-Null | Nunique | Dtype | Notes |
|---|----------|----------|---------|-------|-------|
| 0 | tweet_id | 2327 | 2327 | int64 | |
| 1 | retweet_count | 2327 | 1671 | int64 | |
| 2 | favorite_count | 2327 | 2006 | int64 | |

Source: visual and programmatic assessment

```python
# #, Variable, Non-Null (Count), Dtype:
dft.info()
# Nunique:
dft.nunique()
```

### Quality issues
1. Replies are not original tweets.
2. Retweets are not original tweets.
3. Some tweets don't have any image.
4. Some ratings are incorrectly identified.
5. Some ratings are missing.
6. Names starting with lowercase are incorrect.
7. Names with value None are incorrect.
8. Column timestamp has the dtype object (string).


### Tidiness issues
1. Dogs' stages (doggo, pupper, puppo, floofer) as columns
2. Multiple image predictions in one row
3. Data in multiple datasets

## Cleaning Data
In this section, clean **all** of the issues you documented while assessing. 

**Note:** Make a copy of the original data before cleaning. Cleaning includes merging individual pieces of data according to the rules of [tidy data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html). The result should be a high-quality and tidy master pandas DataFrame (or DataFrames, if appropriate).

In [22]:
# Make copies of original pieces of data
dfa_clean = dfa.copy() # archive
dfi_clean = dfi.copy() # image predictions
dft_clean = dft.copy() # data from Twitter API

### Q1: Replies are not original tweets.

#### Define
Remove replies from `dfa_clean` dataframe and drop variables *in_reply_to_status_id* and *in_reply_to_user_id*.

#### Code

In [23]:
dfa_clean = dfa_clean.loc[dfa_clean.in_reply_to_status_id.isna()]
print(dfa_clean.in_reply_to_status_id.notna().sum())
dfa_clean.drop(columns=['in_reply_to_status_id', 'in_reply_to_user_id'], inplace=True)

0


#### Test

In [24]:
dfa_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2278 entries, 0 to 2355
Data columns (total 15 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2278 non-null   int64  
 1   timestamp                   2278 non-null   object 
 2   source                      2278 non-null   object 
 3   text                        2278 non-null   object 
 4   retweeted_status_id         181 non-null    float64
 5   retweeted_status_user_id    181 non-null    float64
 6   retweeted_status_timestamp  181 non-null    object 
 7   expanded_urls               2274 non-null   object 
 8   rating_numerator            2278 non-null   int64  
 9   rating_denominator          2278 non-null   int64  
 10  name                        2278 non-null   object 
 11  doggo                       2278 non-null   object 
 12  floofer                     2278 non-null   object 
 13  pupper                      2278 

### Q2: Retweets are not original tweets.

#### Define
Remove retweets from `dfa_clean` and drop variables *retweeted_status_id*, *retweeted_status_user_id*, and *retweeted_status_timestamp*.

#### Code

In [25]:
dfa_clean = dfa_clean.loc[dfa_clean.retweeted_status_id.isna()]
print(dfa_clean.retweeted_status_id.notna().sum())
dfa_clean.drop(columns=['retweeted_status_id', 'retweeted_status_user_id', 'retweeted_status_timestamp'], inplace=True)

0


#### Test

In [26]:
dfa_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2097 entries, 0 to 2355
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   tweet_id            2097 non-null   int64 
 1   timestamp           2097 non-null   object
 2   source              2097 non-null   object
 3   text                2097 non-null   object
 4   expanded_urls       2094 non-null   object
 5   rating_numerator    2097 non-null   int64 
 6   rating_denominator  2097 non-null   int64 
 7   name                2097 non-null   object
 8   doggo               2097 non-null   object
 9   floofer             2097 non-null   object
 10  pupper              2097 non-null   object
 11  puppo               2097 non-null   object
dtypes: int64(3), object(9)
memory usage: 213.0+ KB


### Q3: Some tweets don't have any image

#### Define
Remove tweets that don't have an image from `dfa_clean`.

#### Code

In [28]:
dfa_clean = dfa_clean.loc[dfa_clean.text.apply(lambda s: 'https://t.co' in s)]

#### Test

In [29]:
dfa_clean.loc[dfa_clean.text.apply(lambda s: 'https://t.co' not in s)].shape[0]

0

### Q4: Some ratings are incorrectly identified

#### Define

#### Code
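
One possible way to apply the corrections collected in `incorrect_rating` during assessment is sketched below. The toy `df` stands in for `dfa_clean` and only a subset of the dictionary is used; rows that were already removed as replies or retweets are skipped, which is an assumption about how the earlier steps interact with this one.

```python
import pandas as pd

# Toy stand-in for dfa_clean; index labels mimic the original row labels
df = pd.DataFrame(
    {'rating_numerator': [4, 50, 12], 'rating_denominator': [20, 50, 10]},
    index=[1165, 1202, 5],
)

# Subset of the corrections identified during assessment (index -> 'n/d')
incorrect_rating = {313: '13/10', 1165: '13/10', 1202: '11/10'}

for idx, rating in incorrect_rating.items():
    if idx not in df.index:
        # The row may already have been dropped (reply or retweet)
        continue
    numerator, denominator = (int(part) for part in rating.split('/'))
    df.loc[idx, ['rating_numerator', 'rating_denominator']] = [numerator, denominator]

print(df)
```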

#### Test

### Q5: Some ratings are missing

#### Define

#### Code
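
A minimal sketch for this step, assuming that tweets without a rating are simply dropped and that the toy `df` stands in for `dfa_clean`. The `errors='ignore'` argument covers index labels from `missing_rating` that an earlier cleaning step may already have removed.

```python
import pandas as pd

# Toy stand-in for dfa_clean; 516 is one of the rows flagged during assessment
df = pd.DataFrame({'rating_numerator': [24, 13, 12]}, index=[516, 0, 1])

# Index labels of tweets flagged as having no rating
missing_rating = [342, 516]

# Drop the flagged rows; labels that are no longer present are ignored
df = df.drop(index=missing_rating, errors='ignore')
print(df.index.tolist())
```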

#### Test

### Q6: Names starting with lowercase are incorrect

#### Define

#### Code
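
A sketch of one way to handle this issue, assuming that lowercase "names" such as `a` or `quite` are extraction errors and should become missing values. The toy `df` stands in for `dfa_clean`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for dfa_clean
df = pd.DataFrame({'name': ['Phineas', 'a', 'quite', 'None', 'Tilly']})

# Values whose first character is a lowercase letter are treated as not-a-name
lowercase = df['name'].str.match(r'[a-z]')
df.loc[lowercase, 'name'] = np.nan
print(df)
```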

#### Test

### Q7: Names with value None are incorrect

#### Define

#### Code
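
The string `'None'` is a placeholder rather than a name, so one option is to turn it into a real missing value. The sketch below assumes a toy `df` in place of `dfa_clean`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for dfa_clean
df = pd.DataFrame({'name': ['Phineas', 'None', 'Tilly', 'None']})

# Replace the placeholder string with NaN so that isna()/notna() work as expected
df['name'] = df['name'].replace('None', np.nan)
print(df['name'].isna().sum())
```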

#### Test

### Q8: Column timestamp has the dtype object (string)

#### Define

#### Code
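
A possible conversion with `pd.to_datetime`, assuming every timestamp follows the `2017-08-01 16:23:56 +0000` layout seen in the archive sample. The toy `df` stands in for `dfa_clean`.

```python
import pandas as pd

# Toy stand-in for dfa_clean with timestamps in the archive's layout
df = pd.DataFrame({'timestamp': ['2017-08-01 16:23:56 +0000',
                                 '2017-07-31 00:18:03 +0000']})

# Parse the strings into timezone-aware datetimes (UTC)
df['timestamp'] = pd.to_datetime(df['timestamp'],
                                 format='%Y-%m-%d %H:%M:%S %z', utc=True)
print(df.dtypes)
```

With a datetime dtype, accessors such as `df['timestamp'].dt.year` become available for the analysis.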

#### Test

### T1: Dogs' stages (doggo, pupper, puppo, floofer) as columns

#### Define

#### Code
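
One way to tidy the four stage columns is to collapse them into a single `stage` variable. The sketch below uses a toy `df` in place of `dfa_clean`; joining multiple stages with a hyphen (e.g. `doggo-pupper`) and the helper name `combine_stages` are assumptions.

```python
import numpy as np
import pandas as pd

stages = ['doggo', 'floofer', 'pupper', 'puppo']

# Toy stand-in for dfa_clean; 'None' marks an absent stage, as in the archive
df = pd.DataFrame({
    'tweet_id': [1, 2, 3, 4],
    'doggo':   ['None', 'doggo', 'doggo', 'None'],
    'floofer': ['None', 'None', 'None', 'None'],
    'pupper':  ['pupper', 'None', 'pupper', 'None'],
    'puppo':   ['None', 'None', 'None', 'None'],
})

def combine_stages(row):
    """Join all stages flagged in a row; return NaN when none is flagged."""
    found = [stage for stage in stages if row[stage] == stage]
    return '-'.join(found) if found else np.nan

df['stage'] = df[stages].apply(combine_stages, axis=1)
df = df.drop(columns=stages)
print(df)
```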

#### Test

### T2: Multiple image predictions in one row

#### Define

#### Code
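
A sketch of reshaping the predictions into one row per tweet and prediction. The toy `df` stands in for `dfi_clean`, and the output column names (`pred_rank`, `prediction`, `confidence`, `is_dog`) are illustrative assumptions.

```python
import pandas as pd

# Toy stand-in for dfi_clean with the p1..p3 column layout
df = pd.DataFrame({
    'tweet_id': [10, 20],
    'p1': ['redbone', 'teddy'],  'p1_conf': [0.51, 0.95],  'p1_dog': [1, 0],
    'p2': ['pug', 'koala'],      'p2_conf': [0.07, 0.01],  'p2_dog': [1, 0],
    'p3': ['collie', 'fur_coat'], 'p3_conf': [0.06, 0.005], 'p3_dog': [1, 0],
})

frames = []
for k in (1, 2, 3):
    # Take the three columns of the k-th prediction and give them common names
    part = df[['tweet_id', f'p{k}', f'p{k}_conf', f'p{k}_dog']].copy()
    part.columns = ['tweet_id', 'prediction', 'confidence', 'is_dog']
    part.insert(1, 'pred_rank', k)
    frames.append(part)

long = (pd.concat(frames, ignore_index=True)
          .sort_values(['tweet_id', 'pred_rank'])
          .reset_index(drop=True))
long['is_dog'] = long['is_dog'].astype(bool)  # int -> bool, as noted in assessment
print(long)
```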

#### Test

### T3: Data in multiple datasets

#### Define

#### Code
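
The three pieces of data share `tweet_id`, so they can be combined with `DataFrame.merge`. The sketch below uses toy frames; the choice of an inner join with the predictions (keeping only tweets with images) and a left join with the API counts (keeping tweets the API no longer returns) is an assumption, not a requirement of the project.

```python
import pandas as pd

# Toy stand-ins for dfa_clean, dfi_clean and dft_clean
archive = pd.DataFrame({'tweet_id': [1, 2, 3], 'rating_numerator': [13, 12, 11]})
predictions = pd.DataFrame({'tweet_id': [1, 2], 'p1': ['redbone', 'pug']})
counts = pd.DataFrame({'tweet_id': [1, 3],
                       'retweet_count': [7164, 5385],
                       'favorite_count': [34458, 29870]})

master = (archive
          .merge(predictions, on='tweet_id', how='inner')  # tweets with images only
          .merge(counts, on='tweet_id', how='left'))       # counts may be missing
print(master)
```

Note that the left join introduces NaN for tweets without counts, which turns the count columns into floats.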

#### Test

## Storing Data
Save gathered, assessed, and cleaned master dataset to a CSV file named "twitter_archive_master.csv".
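
A minimal sketch of the save step, assuming a toy `master` DataFrame. It writes to a temporary directory so that the example leaves no file behind; in the notebook the path would simply be the file name given above.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the cleaned master dataset
master = pd.DataFrame({'tweet_id': [892420643555336193],
                       'rating_numerator': [13],
                       'rating_denominator': [10]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'twitter_archive_master.csv')
    # index=False keeps the row labels out of the file
    master.to_csv(path, index=False)
    restored = pd.read_csv(path)

print(restored.equals(master))
```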

## Analyzing and Visualizing Data
In this section, analyze and visualize your wrangled data. You must produce at least **three (3) insights and one (1) visualization.**

### Insights:
1.

2.

3.

### Visualization