# WeRateDogs Twitter Data Wrangling and Analysis Project
DECI L3 FINAL PROJECT 👨🏻‍🎓

Made by Mohamed Elsayed Zaky 😁

## Preparation 🍰

In [147]:
# Main libraries
import pandas as pd
import numpy as np
import requests
# Twitter API
import tweepy
import json
# to read the TSV file after downloading it with requests
import io
# .env for API keys
from dotenv import load_dotenv
import os

## 1️⃣ Gathering Data 🕵🏻‍♂️

1. Reading `twitter-archive-enhanced.csv` file

In [148]:
tae = pd.read_csv("twitter-archive-enhanced.csv")

2. Reading the `image_predictions.tsv` file

In [149]:
# URL of the image_predictions.tsv file
image_predictions_url = "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv"
# Download the file with requests
image_predictions = requests.get(image_predictions_url)
# Fail early if the download did not succeed
image_predictions.raise_for_status()

# Wrap the decoded TSV content in a file-like object so pandas can read it
content = image_predictions.content.decode('utf-8')
file_like_object = io.StringIO(content)

In [150]:
nlp = pd.read_csv(file_like_object, sep="\t")
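
The same `io.StringIO` pattern can be tried without a network connection. The sketch below uses a small made-up TSV string in place of the downloaded content; the column names and values are illustrative and are not taken from the real file.

```python
import io
import pandas as pd

# Made-up TSV content standing in for the decoded response body
sample_tsv = "tweet_id\tp1\tp1_conf\n1\tgolden_retriever\t0.95\n2\tpug\t0.80\n"

# StringIO exposes the string through a file-like interface that read_csv accepts
sample_predictions = pd.read_csv(io.StringIO(sample_tsv), sep="\t")
print(sample_predictions.shape)
```

Passing `sep="\t"` is what tells pandas the fields are tab-separated rather than comma-separated.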

3. Set up the Twitter API for additional data

Note:
- **<div style="color: red">Because the Twitter API free plan does not offer get_status, I am using the "tweets_json.txt" file provided in the classroom instead.</div>**

In [151]:
# Gather three columns: `tweet_id`, `retweet_count`, `favorite_count`
tweet_id = []
retweet_count = []
favorite_count = []

# Read tweets_json.txt (one JSON object per line)
with open("tweets_json.txt", "r", encoding="utf-8") as file:
    for line in file:
        # Parse the line into a dict
        data = json.loads(line)
        # Append the fields of interest to their lists
        tweet_id.append(data['id'])
        retweet_count.append(data['retweet_count'])
        favorite_count.append(data['favorite_count'])
    
# Use the same snake_case names as the other tables so `tweet_id` can serve as a join key
tweets_data = {
  "tweet_id": tweet_id,
  "retweet_count": retweet_count,
  "favorite_count": favorite_count,
}

twitter_df = pd.DataFrame(tweets_data)
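
An alternative to keeping three parallel lists is to collect one dictionary per tweet and build the DataFrame from the list of records. This is a minimal sketch, assuming each line is a JSON object with `id`, `retweet_count`, and `favorite_count` keys as in the loop above. The two sample lines and the `lang` field are made up for illustration.

```python
import json
import pandas as pd

# Two made-up lines in the same shape as tweets_json.txt (one JSON object per line)
sample_lines = [
    '{"id": 1, "retweet_count": 10, "favorite_count": 25, "lang": "en"}',
    '{"id": 2, "retweet_count": 3, "favorite_count": 7, "lang": "en"}',
]

records = []
for line in sample_lines:
    data = json.loads(line)
    # Keep only the three fields of interest
    records.append({
        "tweet_id": data["id"],
        "retweet_count": data["retweet_count"],
        "favorite_count": data["favorite_count"],
    })

sample_counts = pd.DataFrame(records)
print(sample_counts)
```

Keeping each tweet's values together in one record makes it harder for the columns to drift out of alignment if a key is missing on some line.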

## 2️⃣ Assessing 👨🏻‍💻

Let's see the head of the dataset 🕵🏻‍♀️

In [152]:
tae.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


Hmm! The `in_reply_to_status_id`, `in_reply_to_user_id`, `retweeted_status_id`, `retweeted_status_user_id`, and `retweeted_status_timestamp` columns seem to be mostly null. Let's check that!

In [153]:
tae['in_reply_to_status_id'].isna().value_counts()

in_reply_to_status_id
True     2278
False      78
Name: count, dtype: int64
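
To check every column at once rather than one at a time, the share of missing values per column can be computed with `isna().mean()`. Here is a small sketch on a made-up frame; the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up frame with a mostly empty column, similar in spirit to the archive
toy = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "in_reply_to_status_id": [np.nan, np.nan, np.nan, 5.0],
    "name": ["Phineas", None, "Tilly", "Archie"],
})

# Fraction of missing values per column, largest first
null_share = toy.isna().mean().sort_values(ascending=False)
print(null_share)
```

Applied to `tae`, the same expression would rank the columns by how empty they are.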

In [154]:
tae.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        1611 non-null   object 
 13  doggo                       97 no

There are 181 retweets. Let's take a look at them.

In [155]:
tae[~tae["retweeted_status_id"].isna()]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
19,888202515573088257,,,2017-07-21 01:02:36 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is Canela. She attempted s...,8.874740e+17,4.196984e+09,2017-07-19 00:47:34 +0000,https://twitter.com/dog_rates/status/887473957...,13,10,Canela,,,,
32,886054160059072513,,,2017-07-15 02:45:48 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @Athletics: 12/10 #BATP https://t.co/WxwJmv...,8.860537e+17,1.960740e+07,2017-07-15 02:44:07 +0000,https://twitter.com/dog_rates/status/886053434...,12,10,,,,,
36,885311592912609280,,,2017-07-13 01:35:06 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is Lilly. She just paralle...,8.305833e+17,4.196984e+09,2017-02-12 01:04:29 +0000,https://twitter.com/dog_rates/status/830583320...,13,10,Lilly,,,,
68,879130579576475649,,,2017-06-26 00:13:58 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is Emmy. She was adopted t...,8.780576e+17,4.196984e+09,2017-06-23 01:10:23 +0000,https://twitter.com/dog_rates/status/878057613...,14,10,Emmy,,,,
73,878404777348136964,,,2017-06-24 00:09:53 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: Meet Shadow. In an attempt to r...,8.782815e+17,4.196984e+09,2017-06-23 16:00:04 +0000,"https://www.gofundme.com/3yd6y1c,https://twitt...",13,10,Shadow,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1023,746521445350707200,,,2016-06-25 01:52:36 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is Shaggy. He knows exactl...,6.678667e+17,4.196984e+09,2015-11-21 00:46:50 +0000,https://twitter.com/dog_rates/status/667866724...,10,10,Shaggy,,,,
1043,743835915802583040,,,2016-06-17 16:01:16 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: Extremely intelligent dog here....,6.671383e+17,4.196984e+09,2015-11-19 00:32:12 +0000,https://twitter.com/dog_rates/status/667138269...,10,10,,,,,
1242,711998809858043904,,,2016-03-21 19:31:59 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @twitter: @dog_rates Awesome Tweet! 12/10. ...,7.119983e+17,7.832140e+05,2016-03-21 19:29:52 +0000,https://twitter.com/twitter/status/71199827977...,12,10,,,,,
2259,667550904950915073,,,2015-11-20 03:51:52 +0000,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",RT @dogratingrating: Exceptional talent. Origi...,6.675487e+17,4.296832e+09,2015-11-20 03:43:06 +0000,https://twitter.com/dogratingrating/status/667...,12,10,,,,,


I am also wondering what is inside the `text` column, so let's look at a single entry.

In [176]:
tae["text"][1500]

"This is Edgar. He's a Sassafras Puggleflash. Nothing satisfies him. Not since the war. 10/10 cheer up pup https://t.co/1NgMb9BTWB"

The trailing link is not needed; it only points back to the tweet itself.

**Quality Issues**:
- The `timestamp` column is a string rather than a datetime object.
- `timestamp` should hold only the date; the time of day is not needed.
- The `source` column wraps the link in an HTML anchor tag (`<a>`); only the link is needed.
- The `source` column is not needed, as it is the same repeated value. There is no need to visit the settings page.
- Five columns are more than 90% null (only 78 or 181 non-null values out of 2,356 entries).
- `expanded_urls` is mostly just `https://twitter.com/dog_rates/status/` followed by the tweet ID, so it is redundant given `tweet_id`.
- `retweeted_status_id`, `retweeted_status_user_id`, and `retweeted_status_timestamp` show that 181 entries are retweets, which we do not need.
- `name` has 745 null values (1,611 non-null out of 2,356).
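
If the link inside `source` is wanted on its own, it can be pulled out of the anchor tag with a regular expression. This is a sketch on two made-up values shaped like the ones in the column; the full tag text is truncated in the output above, so the exact strings here are an assumption.

```python
import pandas as pd

# Made-up values shaped like the anchor tags in the `source` column
source = pd.Series([
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
])

# Capture the text between href=" and the next double quote
links = source.str.extract(r'href="([^"]+)"', expand=False)
print(links.tolist())
```

With `expand=False` and a single capture group, `str.extract` returns a Series rather than a one-column DataFrame.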

**Tidiness Issues**:
- The `doggo`, `floofer`, `pupper`, and `puppo` columns are just terms that refer to the dog's characteristics. They should be a single column called `dog characteristics`.
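
One possible way to collapse the four columns into one is sketched below. It assumes each column holds either the term itself or a missing value, and it uses a hypothetical column name `dog_characteristics`; the rows are made up.

```python
import pandas as pd

stage_cols = ["doggo", "floofer", "pupper", "puppo"]

# Made-up rows: one term, no term, and two terms
df = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "doggo": ["doggo", None, "doggo"],
    "floofer": [None, None, None],
    "pupper": [None, None, "pupper"],
    "puppo": [None, None, None],
})

def combine_stages(row):
    # Collect the terms that are present and join them with a comma
    stages = [row[col] for col in stage_cols if pd.notna(row[col])]
    return ",".join(stages) if stages else None

df["dog_characteristics"] = df[stage_cols].apply(combine_stages, axis=1)
tidy = df.drop(columns=stage_cols)
print(tidy)
```

Joining with a comma keeps tweets that mention more than one term from silently losing one of them.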

## 3️⃣ Cleaning 🧼

First, let's make a copy of the `tae` DataFrame.

In [156]:
cleaned_tae = tae.copy()

Let's drop the columns that we do not need.

In [157]:
cleaned_tae.drop(columns = ['in_reply_to_status_id', 'in_reply_to_user_id', 'source', 'expanded_urls'], inplace=True)

Let's also drop every retweet entry.

In [158]:
# Keep only original tweets; .copy() avoids SettingWithCopyWarning on later assignments
cleaned_tae = cleaned_tae[cleaned_tae['retweeted_status_id'].isna()].copy()

After that, the `retweeted_status_id`, `retweeted_status_user_id`, and `retweeted_status_timestamp` columns are no longer needed.

In [159]:
cleaned_tae.drop(columns = ['retweeted_status_id', 'retweeted_status_user_id', 'retweeted_status_timestamp'], inplace=True)

Convert `timestamp` to datetime and trim off the time of day.

In [166]:
cleaned_tae["timestamp"] = pd.to_datetime(cleaned_tae["timestamp"].str.slice(0, 10))
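
Slicing the first ten characters works because every timestamp starts with `YYYY-MM-DD`. An alternative sketch parses the full string first and then drops the time part, assuming all timestamps follow the `2017-08-01 16:23:56 +0000` pattern shown in the head of the dataset.

```python
import pandas as pd

timestamps = pd.Series(["2017-08-01 16:23:56 +0000", "2017-07-31 00:18:03 +0000"])

# Parse the full timestamp, including the UTC offset
parsed = pd.to_datetime(timestamps, format="%Y-%m-%d %H:%M:%S %z", utc=True)

# Drop the timezone, then reset the time of day to midnight
dates = parsed.dt.tz_localize(None).dt.normalize()
print(dates)
```

Parsing with an explicit format raises an error on any value that does not match, which can surface malformed rows that a plain slice would pass through.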

Let's strip the trailing link from the `text` column.

In [183]:
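# Remove a trailing t.co link only when one is present, instead of always cutting 24 characters
cleaned_tae["text"] = cleaned_tae["text"].str.replace(r"\s*https://t\.co/\w+\s*$", "", regex=True)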
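
Whether the link is removed by position or by pattern matters for tweets that do not end in a link. The sketch below shows pattern-based removal on two shortened sample strings (the second one is made up); text without a trailing link is left untouched.

```python
import pandas as pd

texts = pd.Series([
    "This is Edgar. 10/10 cheer up pup https://t.co/1NgMb9BTWB",
    "No link in this made-up tweet 12/10",
])

# Remove a trailing t.co link (and surrounding whitespace) only when present
stripped = texts.str.replace(r"\s*https://t\.co/\w+\s*$", "", regex=True)
print(stripped.tolist())
```

A fixed-length slice would have cut the last 24 characters from the second string as well, which is the case the pattern guards against.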

In [184]:
cleaned_tae

Unnamed: 0,tweet_id,timestamp,text,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,2017-08-01,This is Phineas. He's a mystical boy. Only eve...,13,10,Phineas,,,,
1,892177421306343426,2017-08-01,This is Tilly. She's just checking pup on you....,13,10,Tilly,,,,
2,891815181378084864,2017-07-31,This is Archie. He is a rare Norwegian Pouncin...,12,10,Archie,,,,
3,891689557279858688,2017-07-30,This is Darla. She commenced a snooze mid meal...,13,10,Darla,,,,
4,891327558926688256,2017-07-29,This is Franklin. He would like you to stop ca...,12,10,Franklin,,,,
...,...,...,...,...,...,...,...,...,...,...
2351,666049248165822465,2015-11-16,Here we have a 1949 1st generation vulpix. Enj...,5,10,,,,,
2352,666044226329800704,2015-11-16,This is a purebred Piers Morgan. Loves to Netf...,6,10,a,,,,
2353,666033412701032449,2015-11-15,Here is a very happy pup. Big fan of well-main...,9,10,a,,,,
2354,666029285002620928,2015-11-15,This is a western brown Mitsubishi terrier. Up...,7,10,a,,,,


## 4️⃣ Storing data 🧠

## 5️⃣ Analyzing, and visualizing data 📊

## 6️⃣ Reporting 👨🏻‍🏫📃