# UDACITY PROJECT 4 - WRANGLE AND ANALYZE DATA
## DOG TWITTER DATA ANALYSIS 
### *Jhonatan Nagasako*
#### *24-FEB-2021*

<hr size="5"/>

<a id='contents'></a>
# Table of Contents

<ul>
<li><a href="#intro">A. INTRODUCTION</a></li>
<li><a href="#scope">B. PROJECT MOTIVATION-SCOPE</a></li>
<li><a href="#gather">1. GATHERING DATA</a></li>
<li><a href="#assess">2. ASSESSING DATA</a></li>
<li><a href="#clean">3. CLEANING DATA</a></li>
<li><a href="#store">4. STORING AND ACTING ON WRANGLED DATA</a></li>
<li><a href="#report">5. REPORT-DISCUSSION-CONCLUSION</a></li>
<li><a href="#files">6. PROJECT FILES</a></li>
</ul>

<hr size="5"/>

<a id='intro'></a>
# A. INTRODUCTION

Real-world data rarely comes clean. Using Python and its libraries, I gathered data from a variety of sources and in a variety of formats, assessed its quality and tidiness, and then cleaned it. This process is called data wrangling. The wrangling effort is documented in a Jupyter Notebook and then showcased through analyses and visualizations using Python (and its libraries) and/or SQL.

The dataset used for wrangling (and analyzing and visualizing) is the tweet archive of Twitter user [@dog_rates](https://twitter.com/dog_rates), also known as [WeRateDogs](https://en.wikipedia.org/wiki/WeRateDogs). WeRateDogs is a Twitter account that rates people's dogs with a humorous comment about the dog. These ratings almost always have a denominator of 10. The numerators, though? Almost always greater than 10. 11/10, 12/10, 13/10, etc. Why? Because ["they're good dogs Brent."](https://knowyourmeme.com/memes/theyre-good-dogs-brent) WeRateDogs has over 4 million followers and has received international media coverage.

WeRateDogs [downloaded their Twitter archive](https://help.twitter.com/en/managing-your-account/how-to-download-your-twitter-archive) and sent it to Udacity via email exclusively for use in this project. This archive contains basic tweet data (tweet ID, timestamp, text, etc.) for all 5000+ of their tweets as they stood on August 1, 2017. More on this soon.


![dog and twitter](https://video.udacity-data.com/topher/2017/October/59dd378f_dog-rates-social/dog-rates-social.jpg)

*Image via [Boston Magazine](https://www.bostonmagazine.com/arts-entertainment/2017/04/18/dog-rates-mit/)*

<a href="#contents">[Table of Contents]</a>

<a id='scope'></a>
# B. Project Motivation
## Context
The goal: wrangle WeRateDogs Twitter data to create interesting and trustworthy analyses and visualizations. The Twitter archive is great, but it only contains very basic tweet information. Additional gathering, then assessing and cleaning is required for "Wow!"-worthy analyses and visualizations.

## The Data
### Enhanced Twitter Archive

The WeRateDogs Twitter archive contains basic tweet data for all 5000+ of their tweets, but not everything. One column the archive does contain though: each tweet's text, which I used to extract rating, dog name, and dog "stage" (i.e. doggo, floofer, pupper, and puppo) to make this Twitter archive "enhanced." Of the 5000+ tweets, I have filtered for tweets with ratings only (there are 2356).

![table of tweets analyzed](https://video.udacity-data.com/topher/2017/October/59dd4791_screenshot-2017-10-10-18.19.36/screenshot-2017-10-10-18.19.36.png)
*The extracted data from each tweet's text*

### Extracted data from tweet text

The provided data set was extracted programmatically, but more processing (e.g., cleaning and tidying) is required. The ratings probably aren't all correct. The same goes for the dog names and probably the dog stages too (see below for more information on these). As stated before, these columns need to be assessed and cleaned before later analysis and visualization.

![dog dictionary](https://video.udacity-data.com/topher/2017/October/59e04ceb_dogtionary-combined/dogtionary-combined.png)
*The Dogtionary explains the various stages of dog: doggo, pupper, puppo, and floof(er) (via the [#WeRateDogs book on Amazon](https://www.amazon.com/WeRateDogs-Most-Hilarious-Adorable-Youve/dp/1510717145))*
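As a rough illustration of how ratings and dog stages could be pulled out of tweet text, here is a minimal sketch using regular expressions. The pattern, the helper names `extract_rating` and `extract_stages`, and the sample tweets are all assumptions made for illustration; this is not the code that produced the enhanced archive. The second sample shows one way such extraction can go wrong, which is why the extracted columns still need assessing.

```python
import re

# Hypothetical pattern: "<numerator>/<denominator>", allowing a decimal numerator.
RATING_RE = re.compile(r"(\d+(?:\.\d+)?)/(\d+)")
STAGES = ("doggo", "floofer", "pupper", "puppo")


def extract_rating(text):
    """Return (numerator, denominator) for the first rating-like match, or None."""
    match = RATING_RE.search(text)
    if match is None:
        return None
    return float(match.group(1)), int(match.group(2))


def extract_stages(text):
    """Return every dog stage word that appears in the tweet text."""
    lowered = text.lower()
    return [stage for stage in STAGES if re.search(rf"\b{stage}\b", lowered)]


# Made-up tweets, not taken from the archive.
print(extract_rating("Made-up tweet: 13/10 would pet this pupper"))  # (13.0, 10)
print(extract_rating("Ready 24/7 for fetch. 11/10"))  # (24.0, 7): first match is not the rating
print(extract_stages("Such a happy Pupper and doggo"))  # ['doggo', 'pupper']
```

Taking the first match is the simplest choice, and the second sample suggests why a first-match rule alone may mislabel some tweets.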

### Additional Data via the Twitter API

Back to the basic-ness of Twitter archives: retweet count and favorite count are two of the notable column omissions. Fortunately, this additional data can be gathered by anyone from Twitter's API. Well, "anyone" who has access to data for the 3000 most recent tweets, at least. But with the WeRateDogs Twitter archive, and specifically the tweet IDs within it, this data can be gathered for all 5000+ tweets by querying the Twitter API.

**Please note that the Twitter API was NOT used for this project, for data security/privacy reasons. The resulting data was instead provided as a file for this project.**

### Image Predictions File

One more cool thing: Every image in the WeRateDogs Twitter archive was processed through a [neural network](https://www.youtube.com/watch?v=2-Ol7ZB0MmU) that can classify breeds of dogs* (provided by project). The results: a table full of image predictions (the top three only) alongside each tweet ID, image URL, and the image number that corresponded to the most confident prediction (numbered 1 to 4 since tweets can have up to four images).

![tweet image prediction](https://video.udacity-data.com/topher/2017/October/59dd4d2c_screenshot-2017-10-10-18.43.41/screenshot-2017-10-10-18.43.41.png)
*Tweet image prediction data*

### Image predictions

So for the last row in that table:

* tweet_id is the last part of the tweet URL after "status/" → https://twitter.com/dog_rates/status/889531135344209921
* p1 is the algorithm's #1 prediction for the image in the tweet → **golden retriever**
* p1_conf is how confident the algorithm is in its #1 prediction → **95%**
* p1_dog is whether or not the #1 prediction is a breed of dog → **TRUE**
* p2 is the algorithm's second most likely prediction → **Labrador retriever**
* p2_conf is how confident the algorithm is in its #2 prediction → **1%**
* p2_dog is whether or not the #2 prediction is a breed of dog → **TRUE**
* etc.

And the #1 prediction for the image in that tweet was spot on:

![gold retriever](https://video.udacity-data.com/topher/2017/October/59dd4e05_dog-pred/dog-pred.png)
*A golden retriever named Stuart*

So that's all fun and good. But all of this additional data needs to be gathered, assessed, and cleaned, which is the scope of this project.
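To make the column layout above concrete, here is a minimal sketch of picking the highest-ranked prediction that is flagged as a dog. The helper name `best_dog_prediction` is hypothetical. The first row reuses values from the first row of the image predictions file; the second row is made up to show the fallback to `p2`.

```python
def best_dog_prediction(row):
    """Return (breed, confidence) for the highest-ranked prediction flagged as a dog, else None."""
    for rank in (1, 2, 3):
        if row[f"p{rank}_dog"]:
            return row[f"p{rank}"], row[f"p{rank}_conf"]
    return None


# Values from the first row of the image predictions file.
spaniel_row = {
    "p1": "Welsh_springer_spaniel", "p1_conf": 0.465074, "p1_dog": True,
    "p2": "collie", "p2_conf": 0.156665, "p2_dog": True,
    "p3": "Shetland_sheepdog", "p3_conf": 0.061428, "p3_dog": True,
}

# Made-up row where the top prediction is not a dog breed.
mixed_row = {
    "p1": "paper_towel", "p1_conf": 0.60, "p1_dog": False,
    "p2": "golden_retriever", "p2_conf": 0.25, "p2_dog": True,
    "p3": "tennis_ball", "p3_conf": 0.05, "p3_dog": False,
}

print(best_dog_prediction(spaniel_row))  # ('Welsh_springer_spaniel', 0.465074)
print(best_dog_prediction(mixed_row))    # ('golden_retriever', 0.25)
```

Because the predictions are ranked from most to least likely, the first dog-flagged rank is also the most confident dog prediction.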

## Key Points
Key points to keep in mind when data wrangling for this project:

* Only use original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
* Assessing and cleaning the entire dataset completely would require a lot of time, and is not necessary to practice and demonstrate your skills in data wrangling. Therefore, the requirements of this project are only to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset.
* Cleaning includes merging individual pieces of data according to the rules of [tidy data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html).
* The fact that the rating numerators are greater than the denominators does not need to be cleaned. This [unique rating system](http://knowyourmeme.com/memes/theyre-good-dogs-brent) is a big part of the popularity of WeRateDogs.
* Gathering tweets beyond August 1st, 2017 is not required because it is out of scope. Image predictions cannot be gathered for tweets after this date because the source file for the image predictions is not provided, which again puts them out of scope for this project.
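The first key point (original ratings only, with images) could be sketched as a boolean filter. This is a sketch under stated assumptions: a tweet counts as a retweet when `retweeted_status_id` is populated, and it "has an image" when its `tweet_id` appears in the image predictions table. The tiny frames and their IDs are made up.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for df_archive and df_prediction (IDs are not real tweets).
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "retweeted_status_id": [np.nan, 99.0, np.nan],
})
prediction = pd.DataFrame({"tweet_id": [1, 2]})

# Assumption: a populated retweeted_status_id marks a retweet.
is_original = archive["retweeted_status_id"].isna()
# Assumption: presence in the prediction table means the tweet has an image.
has_image = archive["tweet_id"].isin(prediction["tweet_id"])

originals = archive[is_original & has_image]
print(originals["tweet_id"].tolist())  # [1]
```

Tweet 2 is dropped as a retweet and tweet 3 is dropped because it has no image prediction.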

<a href="#contents">[Table of Contents]</a>

<hr size="5"/>

<a id='gather'></a>
# 1. GATHERING DATA

<font color=blue>

## Gathering Data - Set 1 Requirements
**1.1 CRITERIA:** The student is able to gather data from a variety of sources and file formats.

**1.1 SPECIFICATION:**
Data is successfully gathered:
* From at least the three (3) different sources on the Project Details page.
* In at least the three (3) different file formats on the Project Details page.

Each piece of data is imported into a separate pandas DataFrame at first.

<a href="#contents">[Table of Contents]</a>

In [1]:
# import statements for all of the packages used for analysis

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns
import statsmodels.api as sm

# package required to get images from twitter data
import requests

# package required to access json file
import json


# Remember to include a 'magic word' so that your visualizations are plotted
#   inline with the notebook. See this page for more:
#   http://ipython.readthedocs.io/en/stable/interactive/magics.html

%matplotlib inline

### Gather Specification 1 - data provided (e.g., via flashdrive)

<a href="#gather">[Gathering Data Requirements]</a> <a href="#contents">[Table of Contents]</a>

In [2]:
# Gather Specification 1 - file given (file provided via flashdrive)
df_archive = pd.read_csv('twitter-archive-enhanced.csv')
df_archive.head(3)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,


### Gather Specification 2 - data accessed by internet via Requests library

<a href="#gather">[Gathering Data Requirements]</a> <a href="#contents">[Table of Contents]</a>

In [3]:
# Gather Specification 2 - access via internet to get images for predictions, accessed programmatically
# https://pypi.org/project/requests/

# note that running this code will generate the file "image_prediction.tsv"

r = requests.get('https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv',
                auth=('user','pass'))

# check the response status and encoding before saving the file
assert r.status_code == 200, 'Request corruption detected, status code is supposed to be 200 (aka "HTTP OK Success")'
assert r.encoding == 'utf-8', 'Request corruption detected, encoding was NOT \'utf-8\' (will cause issues later in json coding)'

# write the raw bytes to disk; the with-block closes the file before pandas reads it
with open('image_prediction.tsv', 'wb') as f:
    f.write(r.content)
df_prediction = pd.read_csv('image_prediction.tsv', sep='\t')
df_prediction.head(3)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True


### Gather Specification 3 - data accessed by internet via Twitter API Tweepy

<a href="#gather">[Gathering Data Requirements]</a> <a href="#contents">[Table of Contents]</a>

In [4]:
# Gather Specification 3 - json file
# https://stackoverflow.com/questions/47612822/how-to-create-pandas-dataframe-from-twitter-search-api
twitter_list =[]

with open('tweet-json.txt', encoding='utf-8') as json_file:
    for each_dictionary in json_file:   
        
        tweets_dict = {} # the dictionary terms are stored here
        tweets_json = json.loads(each_dictionary)
        
        tweets_dict['tweet_id'] = tweets_json['id_str']
        tweets_dict['retweet_count'] = tweets_json['retweet_count']
        tweets_dict['favorite_count'] = tweets_json['favorite_count']
        tweets_dict['full_text'] = tweets_json['full_text']
        
        twitter_list.append(tweets_dict)

# write to dataframe
df_twitter = pd.DataFrame(twitter_list)
df_twitter.head(3)

Unnamed: 0,tweet_id,retweet_count,favorite_count,full_text
0,892420643555336193,8853,39467,This is Phineas. He's a mystical boy. Only eve...
1,892177421306343426,6514,33819,This is Tilly. She's just checking pup on you....
2,891815181378084864,4328,25461,This is Archie. He is a rare Norwegian Pouncin...


<font color='red'>

**_The code below is required to programmatically interface with and gather data from Twitter using the ```Tweepy``` API. However, I used the provided file instead because I did NOT want to create a Twitter account, for cyber security/privacy reasons._**

<font color='blue'>
             
> Essentially, the Twitter API code below saves each tweet's JSON as one line of a text file. That file, ```tweet-json.txt```, is then read with Python in the code block above.


```
import tweepy
from tweepy import OAuthHandler
import json
from timeit import default_timer as timer

# Query Twitter API for each tweet in the Twitter archive and save JSON in a text file
# These are hidden to comply with Twitter's API terms and conditions
consumer_key = 'HIDDEN'
consumer_secret = 'HIDDEN'
access_token = 'HIDDEN'
access_secret = 'HIDDEN'

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth, wait_on_rate_limit=True)

# NOTE TO STUDENT WITH MOBILE VERIFICATION ISSUES:
# df_archive is the DataFrame loaded from twitter-archive-enhanced.csv. You may have to
# change the tweet_ids line below to match the name of your DataFrame
# NOTE TO REVIEWER: this student had mobile verification issues so the following
# Twitter API code was sent to this student from a Udacity instructor
# Tweet IDs for which to gather additional data via Twitter's API
tweet_ids = df_archive.tweet_id.values
len(tweet_ids)

# Query Twitter's API for JSON data for each tweet ID in the Twitter archive
count = 0
fails_dict = {}
start = timer()
# Save each tweet's returned JSON as a new line in a .txt file
with open('tweet-json.txt', 'w') as outfile:
    # This loop will likely take 20-30 minutes to run because of Twitter's rate limit
    for tweet_id in tweet_ids:
        count += 1
        print(str(count) + ": " + str(tweet_id))
        try:
            tweet = api.get_status(tweet_id, tweet_mode='extended')
            print("Success")
            json.dump(tweet._json, outfile)
            outfile.write('\n')
        except tweepy.TweepError as e:
            print("Fail")
            fails_dict[tweet_id] = e
            pass
end = timer()
print(end - start)
print(fails_dict)
```


## End of Gathering Data Section

<a href="#gather">[Gathering Data Requirements]</a> <a href="#contents">[Table of Contents]</a>

<hr size="5"/>

<a id='assess'></a>
# 2. ASSESSING DATA

<font color=blue>

**2.1 CRITERIA:** The student is able to assess data visually and programmatically for quality and tidiness.

**2.1 SPECIFICATION:**
Two types of assessment are used:

* Visual assessment: each piece of gathered data is displayed in the Jupyter Notebook for visual assessment purposes. Once displayed, data can additionally be assessed in an external application (e.g. Excel, text editor).
* Programmatic assessment: pandas' functions and/or methods are used to assess the data.

---

**2.2 CRITERIA:** The student is able to thoroughly assess a dataset.

**2.2 SPECIFICATION:**
At least eight (8) data quality issues and two (2) tidiness issues are detected, and include the issues to clean to satisfy the Project Motivation. Each issue is documented in one to a few sentences each.

<a href="#contents">[Table of Contents]</a>

## Assess Specification 2.1 - Visual assessment of data (e.g., via CSV)

* Visual assessment: each piece of gathered data is displayed in the Jupyter Notebook for visual assessment purposes. Once displayed, data can additionally be assessed in an external application (e.g. Excel, text editor).

<a href="#assess">[Assessing Data Requirements]</a> <a href="#contents">[Table of Contents]</a>

Current dataframes active from **_Section 1. GATHER_**
1. ```df_archive``` = tweet metadata
2. ```df_prediction``` = tweet picture prediction
3. ```df_twitter``` = tweet retweet_count, favorite_count, and full_text

In [5]:
# visual assessment - warning to user that opening these files "may" cause viewing software to crash
# help: https://www.guru99.com/python-check-if-file-exists.html

from os import path
fileFlag = False # initialized as False -- assuming .csv files have not yet been created from the dataframes in this notebook

In [6]:
# help: https://www.geeksforgeeks.org/g-fact-41-multiple-return-values-in-python/
def createCSV():
    # adding "1-" for easier file handling in folder
    df_archive.to_csv('1-archive.csv')
    df_prediction.to_csv('1-prediction.csv')
    df_twitter.to_csv('1-twitter.csv')
    fileFlag = True
    print('.csv Files created ... Ready for user VISUAL assessment')
    return fileFlag

if path.exists('1-archive.csv') and path.exists('1-prediction.csv') and path.exists('1-twitter.csv'):
    print("Files already created! ... Ready for user VISUAL assessment")
elif not fileFlag:
    fileFlag = createCSV() # store the returned flag; files are created because fileFlag is False (not done yet)
else:
    assert path.exists('1-archive.csv'), "You need to create the 1-archive.csv file"
    assert path.exists('1-prediction.csv'), "You need to create the 1-prediction.csv file"
    assert path.exists('1-twitter.csv'), "You need to create the 1-twitter.csv file"
    print("Files exist! ... Ready for user VISUAL assessment")

# could add button here for "confirmation" data was reviewed

Files already created! ... Ready for user VISUAL assessment


## Assess Specification 2.1 - Programmatic assessment of data (e.g., via pandas)

* Programmatic assessment: pandas' functions and/or methods are used to assess the data.

<a href="#assess">[Assessing Data Requirements]</a> <a href="#contents">[Table of Contents]</a>

In [7]:
# Programmatic assessment of data

#import the newly created .csv files
df_archive = pd.read_csv('1-archive.csv')
df_prediction = pd.read_csv('1-prediction.csv')
df_twitter = pd.read_csv('1-twitter.csv')
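A side note on the round trip through CSV: `DataFrame.to_csv` writes the index as an unnamed first column by default, and `pd.read_csv` then loads it back as `Unnamed: 0`. The sketch below reproduces this with an in-memory buffer and made-up values; passing `index=False` is one possible way to avoid the extra column, not necessarily what this notebook should do.

```python
import io

import pandas as pd

df = pd.DataFrame({"tweet_id": [1, 2], "retweet_count": [5, 7]})  # made-up values

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False leaves the index out of the file.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)     # ['Unnamed: 0', 'tweet_id', 'retweet_count']
print(cols_without_index)  # ['tweet_id', 'retweet_count']
```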

<font color='blue'>

Review ```df_archive``` data programmatically

In [8]:
df_archive.head(3)

Unnamed: 0.1,Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,


In [9]:
df_archive.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 18 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Unnamed: 0                  2356 non-null   int64  
 1   tweet_id                    2356 non-null   int64  
 2   in_reply_to_status_id       78 non-null     float64
 3   in_reply_to_user_id         78 non-null     float64
 4   timestamp                   2356 non-null   object 
 5   source                      2356 non-null   object 
 6   text                        2356 non-null   object 
 7   retweeted_status_id         181 non-null    float64
 8   retweeted_status_user_id    181 non-null    float64
 9   retweeted_status_timestamp  181 non-null    object 
 10  expanded_urls               2297 non-null   object 
 11  rating_numerator            2356 non-null   int64  
 12  rating_denominator          2356 non-null   int64  
 13  name                        2356 

<font color='blue'>

Review ```df_prediction``` data programmatically

In [10]:
df_prediction.head(3)

Unnamed: 0.1,Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True


In [11]:
df_prediction.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 13 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  2075 non-null   int64  
 1   tweet_id    2075 non-null   int64  
 2   jpg_url     2075 non-null   object 
 3   img_num     2075 non-null   int64  
 4   p1          2075 non-null   object 
 5   p1_conf     2075 non-null   float64
 6   p1_dog      2075 non-null   bool   
 7   p2          2075 non-null   object 
 8   p2_conf     2075 non-null   float64
 9   p2_dog      2075 non-null   bool   
 10  p3          2075 non-null   object 
 11  p3_conf     2075 non-null   float64
 12  p3_dog      2075 non-null   bool   
dtypes: bool(3), float64(3), int64(3), object(4)
memory usage: 168.3+ KB


<font color='blue'>

Review ```df_twitter``` data programmatically

In [12]:
df_twitter.head(3)

Unnamed: 0.1,Unnamed: 0,tweet_id,retweet_count,favorite_count,full_text
0,0,892420643555336193,8853,39467,This is Phineas. He's a mystical boy. Only eve...
1,1,892177421306343426,6514,33819,This is Tilly. She's just checking pup on you....
2,2,891815181378084864,4328,25461,This is Archie. He is a rare Norwegian Pouncin...


In [13]:
df_twitter.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2354 entries, 0 to 2353
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Unnamed: 0      2354 non-null   int64 
 1   tweet_id        2354 non-null   int64 
 2   retweet_count   2354 non-null   int64 
 3   favorite_count  2354 non-null   int64 
 4   full_text       2354 non-null   object
dtypes: int64(4), object(1)
memory usage: 92.1+ KB


## Assess Specification 2.2 - Identify data quality and tidiness issues and steps for cleaning

* At least eight (8) data quality issues and two (2) tidiness issues are detected, and include the issues to clean to satisfy the Project Motivation. Each issue is documented in one to a few sentences each.

**Tips for Tidying**
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.

*Reference for [tidy data here](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html)*

**Tips for Common Data Quality Issues**
1. Missing data
2. Invalid data (e.g., a negative height, or other datatype validation errors such as str vs int vs float; there can only be 2 people in a room, not 2.54 people in a room... *unless there are ghosts lol*)
3. Inaccurate data (e.g., specifying a foot = 5 inches, which is WRONG. A foot = 12 inches)
4. Inconsistent data (e.g., mixing up units, some data captured as cm instead of inches)

<a href="#assess">[Assessing Data Requirements]</a> <a href="#contents">[Table of Contents]</a>

In [14]:
# another look at the data, determine what cleaning steps are required
print('archive data\n')
df_archive.info()
print('______________________________________________________________\n')

print('prediction data\n')
df_prediction.info()
print('______________________________________________________________\n')

print('twitter data\n')
df_twitter.info()
print('______________________________________________________________\n')


archive data

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 18 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Unnamed: 0                  2356 non-null   int64  
 1   tweet_id                    2356 non-null   int64  
 2   in_reply_to_status_id       78 non-null     float64
 3   in_reply_to_user_id         78 non-null     float64
 4   timestamp                   2356 non-null   object 
 5   source                      2356 non-null   object 
 6   text                        2356 non-null   object 
 7   retweeted_status_id         181 non-null    float64
 8   retweeted_status_user_id    181 non-null    float64
 9   retweeted_status_timestamp  181 non-null    object 
 10  expanded_urls               2297 non-null   object 
 11  rating_numerator            2356 non-null   int64  
 12  rating_denominator          2356 non-null   int64  
 13  name               

In [15]:
print('Any duplicated records in df_archive? -->', df_archive.duplicated().sum())
print('Any duplicated records in df_prediction? --> ', df_prediction.duplicated().sum())
print('Any duplicated records in df_twitter? --> ',df_twitter.duplicated().sum())

Any duplicated records in df_archive? --> 0
Any duplicated records in df_prediction? -->  0
Any duplicated records in df_twitter? -->  0


In [16]:
print('Any missing records in df_archive?\n', df_archive.isnull().sum())
print('\nAny missing records in df_prediction?\n', df_prediction.isnull().sum())
print('\nAny missing records in df_twitter?\n',df_twitter.isnull().sum())

Any missing records in df_archive?
 Unnamed: 0                       0
tweet_id                         0
in_reply_to_status_id         2278
in_reply_to_user_id           2278
timestamp                        0
source                           0
text                             0
retweeted_status_id           2175
retweeted_status_user_id      2175
retweeted_status_timestamp    2175
expanded_urls                   59
rating_numerator                 0
rating_denominator               0
name                             0
doggo                            0
floofer                          0
pupper                           0
puppo                            0
dtype: int64

Any missing records in df_prediction?
 Unnamed: 0    0
tweet_id      0
jpg_url       0
img_num       0
p1            0
p1_conf       0
p1_dog        0
p2            0
p2_conf       0
p2_dog        0
p3            0
p3_conf       0
p3_dog        0
dtype: int64

Any missing records in df_twitter?
 Unnamed: 0        0
twe

In [17]:
print('archive data\n')
df_archive.nunique()

archive data



Unnamed: 0                    2356
tweet_id                      2356
in_reply_to_status_id           77
in_reply_to_user_id             31
timestamp                     2356
source                           4
text                          2356
retweeted_status_id            181
retweeted_status_user_id        25
retweeted_status_timestamp     181
expanded_urls                 2218
rating_numerator                40
rating_denominator              18
name                           957
doggo                            2
floofer                          2
pupper                           2
puppo                            2
dtype: int64

In [18]:
print('prediction data\n')
df_prediction.nunique()

prediction data



Unnamed: 0    2075
tweet_id      2075
jpg_url       2009
img_num          4
p1             378
p1_conf       2006
p1_dog           2
p2             405
p2_conf       2004
p2_dog           2
p3             408
p3_conf       2006
p3_dog           2
dtype: int64
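The counts above show 2075 rows but only 2009 unique `jpg_url` values, so some image URLs repeat even though `duplicated()` found no fully duplicated rows. A minimal sketch of checking duplicates on a single column, using made-up rows and placeholder URLs:

```python
import pandas as pd

# Made-up rows: two different tweets pointing at the same image URL.
prediction = pd.DataFrame({
    "tweet_id": [10, 11, 12],
    "jpg_url": [
        "https://example.com/a.jpg",
        "https://example.com/b.jpg",
        "https://example.com/a.jpg",
    ],
})

full_row_dupes = int(prediction.duplicated().sum())                # whole-row duplicates
url_dupes = int(prediction.duplicated(subset=["jpg_url"]).sum())   # repeats of a URL after its first use
repeated = prediction[prediction.duplicated(subset=["jpg_url"], keep=False)]

print(full_row_dupes, url_dupes)       # 0 1
print(repeated["tweet_id"].tolist())   # [10, 12]
```

`keep=False` marks every row in a repeated group, which makes it easier to inspect the affected tweets side by side.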

In [19]:
print('twitter data\n')
df_twitter.nunique()

twitter data



Unnamed: 0        2354
tweet_id          2354
retweet_count     1724
favorite_count    2007
full_text         2354
dtype: int64

In [20]:
# quick look at the stats to determine the spread and eliminate outliers
ratings_level = 100*(df_archive.rating_numerator / df_archive.rating_denominator)
ratings_level.describe()

# so, let's make the cut-off point for exceedingly high ratings 120%...
# but that deletes over 400 records... so let's say 150% then! Only 12 records removed!
# note: mean and max show as inf because at least one entry has a denominator of 0

count    2356.0
mean        inf
std         NaN
min         0.0
25%       100.0
50%       110.0
75%       120.0
max         inf
dtype: float64
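The `inf` mean and `NaN` standard deviation come from dividing by a zero denominator. A minimal sketch, with made-up ratings, of one way to keep the summary statistics finite: treat a zero denominator as missing before dividing, then flag everything above the 150% cut-off.

```python
import numpy as np
import pandas as pd

# Made-up ratings; the third row has a zero denominator.
ratings = pd.DataFrame({
    "rating_numerator": [10, 15, 24, 1776],
    "rating_denominator": [10, 10, 0, 10],
})

# Naive division: the zero denominator produces inf, which poisons the mean.
naive = 100 * (ratings["rating_numerator"] / ratings["rating_denominator"])

# Treat a zero denominator as missing so the row becomes NaN instead of inf.
safe_den = ratings["rating_denominator"].replace(0, np.nan)
safe = 100 * (ratings["rating_numerator"] / safe_den)

outliers = safe[safe > 150]  # strictly above the 150% cut-off

print(naive.mean())             # inf
print(int(safe.isna().sum()))   # 1
print(outliers.index.tolist())  # [3]
```

pandas skips `NaN` values in `describe()` and `mean()`, so the remaining rows can be summarized without first deleting the bad entry.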

In [21]:
# quick look at the stats of retweets and favorites to understand the spread and decide whether to remove "zero" values
df_twitter.describe()

# actually, let's not remove anything; this data comes from Twitter's database,
# so there is a low probability that these "zero" values are mistakes

Unnamed: 0.1,Unnamed: 0,tweet_id,retweet_count,favorite_count
count,2354.0,2354.0,2354.0,2354.0
mean,1176.5,7.426978e+17,3164.797366,8080.968564
std,679.685589,6.852812e+16,5284.770364,11814.771334
min,0.0,6.660209e+17,0.0,0.0
25%,588.25,6.783975e+17,624.5,1415.0
50%,1176.5,7.194596e+17,1473.5,3603.5
75%,1764.75,7.993058e+17,3652.0,10122.25
max,2353.0,8.924206e+17,79515.0,132810.0


<font color='blue'>
    
## QUALITY
    
✔️ 1. all dataframes --> Remove the extraneous index column ```Unnamed: 0```

2. Missing data, drop columns:
    * df_archive
        * in_reply_to_status_id
        * in_reply_to_user_id
        * retweeted_status_id
        * retweeted_status_user_id
        * retweeted_status_timestamp
        * expanded_urls
    
3. df_archive --> convert columns to boolean for easier analysis
    * doggo
    * floofer
    * pupper
    * puppo

4. df_archive --> convert from object to int64
    * doggo
    * floofer
    * pupper
    * puppo
    
5. df_archive --> need a ```floof``` classification for dogs that do not meet any of the four categories listed below:
    * doggo
    * floofer
    * pupper
    * puppo
    
6. df_archive --> remove entries with EXTREME ratings outliers (anything above 150%). Those entries are: 
    * 55 with 170%
    * 188 with 4200%
    * 189 with 6660%
    * 290 with 1820%
    * 340 with 750%
    * 516 with 343%
    * 695 with 750%
    * 763 with 270%
    * 979 with 17760% wow!
    * 1712 with 260%
    * 2074 with 4200%

7. df_archive --> remove entry 313 because no denominator is provided (entered as 0)

8. df_prediction --> make all dog breed names (```p1```, ```p2```, ```p3```) CAPITALIZED consistently, to prevent variants that differ only in capitalization
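Items 3 and 5 could be sketched as follows: convert the four stage columns to booleans, then derive one label per tweet with ```floof``` as the fallback. This is only a sketch under assumptions: the placeholder for a missing stage is taken to be the string `"None"`, the rows are made up, and the combined `stage` column and the `combine` helper are hypothetical.

```python
import pandas as pd

STAGES = ["doggo", "floofer", "pupper", "puppo"]

# Made-up rows; "None" as the placeholder for a missing stage is an assumption.
archive = pd.DataFrame({
    "doggo":   ["doggo", "None", "None"],
    "floofer": ["None", "None", "None"],
    "pupper":  ["None", "pupper", "None"],
    "puppo":   ["None", "None", "None"],
})

# Boolean flags: True where a stage word is present.
flags = archive[STAGES].ne("None")


def combine(row):
    """Join every flagged stage, falling back to 'floof' when none is set."""
    found = [stage for stage in STAGES if row[stage]]
    return ",".join(found) if found else "floof"


archive["stage"] = flags.apply(combine, axis=1)
print(archive["stage"].tolist())  # ['doggo', 'pupper', 'floof']
```

Joining with a comma keeps tweets that mention more than one stage from being silently collapsed into a single label.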



<font color='blue'>

## TIDINESS
1. Combine all separate tables, joined by ```tweet_id```

2. df_twitter drop column ```full_text``` because it contains multiple observational units (e.g., name, url, score, etc.)
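The two tidiness steps could look roughly like the sketch below, which uses made-up frames in place of the real ones. The inner join is an assumption: it keeps only tweets present in all three tables, and a left join would be the alternative if unmatched archive rows should survive.

```python
import pandas as pd

# Made-up stand-ins for df_archive, df_prediction, and df_twitter.
archive = pd.DataFrame({"tweet_id": [1, 2, 3], "rating_numerator": [13, 12, 11]})
prediction = pd.DataFrame({"tweet_id": [1, 2], "p1": ["redbone", "collie"]})
twitter = pd.DataFrame({
    "tweet_id": [1, 3],
    "retweet_count": [50, 70],
    "full_text": ["made-up text a", "made-up text b"],
})

master = (
    archive
    .merge(prediction, on="tweet_id", how="inner")
    # drop full_text before joining, as listed in tidiness item 2
    .merge(twitter.drop(columns=["full_text"]), on="tweet_id", how="inner")
)

print(master["tweet_id"].tolist())  # [1]
print(list(master.columns))         # ['tweet_id', 'rating_numerator', 'p1', 'retweet_count']
```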

    
**Tips for Tidying**
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.

*Reference for [tidy data here](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html)*
    
## FEATURE ENGINEERING
1. df_archive
    * create a percentage column from the numerator and denominator ratings
    * create a column for twitter-web vs twitter-phone vs twitter-tweetdeck vs vine source
    * create a "bonus" column for any ratings that exceed a score of 10
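
The three engineered columns can be sketched as shown here. This is a minimal sketch under stated assumptions: the new column names (`rating_percent`, `source_type`, `bonus`) and the helper `label_source` are hypothetical, the raw `source` values are assumed to be HTML anchor strings like the samples, and "exceeds a score of 10" is read as a numerator greater than 10.

```python
import pandas as pd

# Toy rows; the source strings are assumed examples of the raw format.
df_archive = pd.DataFrame({
    "rating_numerator": [12, 9, 13, 10],
    "rating_denominator": [10, 10, 10, 10],
    "source": [
        '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
        '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
        '<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>',
        '<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>',
    ],
})

# Percentage rating from numerator and denominator.
df_archive["rating_percent"] = (
    df_archive["rating_numerator"] * 100 / df_archive["rating_denominator"]
)


def label_source(raw: str) -> str:
    """Map a raw source string to one of four labels (hypothetical helper)."""
    text = raw.lower()
    if "iphone" in text:
        return "twitter-phone"
    if "tweetdeck" in text:
        return "twitter-tweetdeck"
    if "vine" in text:
        return "vine"
    return "twitter-web"


df_archive["source_type"] = df_archive["source"].apply(label_source)

# "Bonus" flag: 1 when the numerator exceeds 10, else 0.
df_archive["bonus"] = (df_archive["rating_numerator"] > 10).astype("int64")
```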

<hr size="5"/>

<a id='clean'></a>
# 3. CLEANING DATA

<a href="#contents">[Table of Contents]</a>

<font color=blue>
    
**3.1 CRITERIA:** The student uses the steps in the data cleaning process to guide their cleaning efforts.

**3.1 SPECIFICATION:**
The define, code, and test steps of the cleaning process are clearly documented.

<font color=blue>
    
**3.2 CRITERIA:** The student is able to thoroughly clean a dataset programmatically.

**3.2 SPECIFICATION:**

Copies of the original pieces of data are made prior to cleaning.

All issues identified in the assess phase are successfully cleaned (if possible) using Python and pandas, and include the cleaning tasks required to satisfy the Project Motivation.

A tidy master dataset (or datasets, if appropriate) with all pieces of gathered data is created.
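
One define, code, and test pass on a copy of the data might look like the sketch here. It is only an illustration: the rows are invented and the column name `p1` is an assumption. The issue cleaned is quality item 8 (capitalizing dog names in df_prediction).

```python
import pandas as pd

# Toy stand-in for df_prediction; "p1" is an assumed column name.
df_prediction = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "p1": ["pug", "Pug", "golden_retriever"],
})

# Copy first so the original data stays untouched.
df_prediction_clean = df_prediction.copy()

# DEFINE: capitalize every predicted name so case variants collapse.

# CODE
df_prediction_clean["p1"] = df_prediction_clean["p1"].str.upper()

# TEST
assert df_prediction_clean["p1"].str.isupper().all()
assert df_prediction_clean["p1"].nunique() == 2
assert df_prediction["p1"].nunique() == 3  # original is unchanged
```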

<hr size="5"/>

<a id='store'></a>
# 4. STORING AND ACTING ON WRANGLED DATA

<a href="#contents">[Table of Contents]</a>

<font color=blue>
    
**4.1 CRITERIA:** The student is able to store a gathered, assessed, and cleaned dataset.

**4.1 SPECIFICATION:**

Students will save their gathered, assessed, and cleaned master dataset(s) to a CSV file or a SQLite database.
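
Both storage options can be sketched with pandas and the standard-library `sqlite3` module. This is a minimal sketch: the file names, the table name `master`, and the toy rows are assumptions, and a temporary folder is used only to keep the example self-contained.

```python
import os
import sqlite3
import tempfile

import pandas as pd

# Toy master dataset; the rows are invented.
df_master = pd.DataFrame({"tweet_id": [1, 2], "rating_percent": [120.0, 130.0]})

# Hypothetical output location.
out_dir = tempfile.mkdtemp()
csv_path = os.path.join(out_dir, "twitter_archive_master.csv")
db_path = os.path.join(out_dir, "twitter_archive_master.db")

# Option 1: CSV (index=False avoids writing the row index as a column).
df_master.to_csv(csv_path, index=False)

# Option 2: SQLite table named "master".
conn = sqlite3.connect(db_path)
df_master.to_sql("master", conn, index=False, if_exists="replace")
conn.commit()

# Read both back to confirm the round trip.
from_csv = pd.read_csv(csv_path)
from_db = pd.read_sql("SELECT * FROM master", conn)
conn.close()
```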

<font color=blue>
    
**4.2 CRITERIA:** The student is able to act on their wrangled data to produce insights (e.g. analyses, visualizations, and/or models).

**4.2 SPECIFICATION:**

The master dataset is analyzed using pandas or SQL in the Jupyter Notebook and at least three (3) separate insights are produced.

At least one (1) labeled visualization is produced in the Jupyter Notebook using Python’s plotting libraries or in Tableau.

Students must make it clear in their wrangling work that they assessed and cleaned (if necessary) the data upon which the analyses and visualizations are based.
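
Three insights and one labeled plot could be produced along the lines of the sketch here. It is a minimal sketch on invented data: the column names are assumptions, and the chosen insights are examples rather than the project's actual findings.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Toy master dataset; the rows are invented.
df_master = pd.DataFrame({
    "source_type": ["twitter-phone", "twitter-phone", "twitter-web", "vine"],
    "rating_percent": [120.0, 130.0, 110.0, 100.0],
    "favorite_count": [900, 1500, 400, 200],
})

# Insight 1: average rating per source.
mean_by_source = df_master.groupby("source_type")["rating_percent"].mean()

# Insight 2: share of tweets per source.
share_by_source = df_master["source_type"].value_counts(normalize=True)

# Insight 3: correlation between rating and favorites.
rating_fav_corr = df_master["rating_percent"].corr(df_master["favorite_count"])

# Labeled visualization: title and both axis labels are set explicitly.
fig, ax = plt.subplots()
mean_by_source.plot(kind="bar", ax=ax)
ax.set_title("Mean rating by tweet source")
ax.set_xlabel("Source")
ax.set_ylabel("Mean rating (%)")
fig.tight_layout()
```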

<hr size="5"/>

<a id='report'></a>
# 5. REPORT-DISCUSSION-CONCLUSION

<a href="#contents">[Table of Contents]</a>

<font color=blue>
    
**5.1 CRITERIA:** The student is able to reflect upon and describe their data wrangling efforts.

**5.1 SPECIFICATION:**

The student’s wrangling efforts are briefly described. This document (wrangle_report.pdf or wrangle_report.html) is concise and approximately 300-600 words in length.


<font color=blue>
    
**5.2 CRITERIA:** The student is able to describe some insights found in their wrangled dataset.

**5.2 SPECIFICATION:**

The three (3) or more insights the student found are communicated. At least one (1) visualization is included.

This document (act_report.pdf or act_report.html) is at least 250 words in length.

<hr size="5"/>

<a id='files'></a>
# 6. PROJECT FILES

<a href="#contents">[Table of Contents]</a>

<font color=blue>
    
**6.1 CRITERIA:** Are all required files included in the student's submission?

**6.1 SPECIFICATION:**
The following files (with identical filenames) are included:

* wrangle_act.ipynb
* wrangle_report.pdf or wrangle_report.html
* act_report.pdf or act_report.html
    
All dataset files are included, including the stored master dataset(s), with filenames and extensions as specified on the Project Submission page.

<hr size="5"/>

# End of Data Project!

Made with ❤️ by Jhon!

<a href="#contents">[Table of Contents]</a>