# UDACITY PROJECT 4 - WRANGLE AND ANALYZE DATA
## DOG TWITTER DATA ANALYSIS 
### *Jhonatan Nagasako*
#### *24-FEB-2021*

<hr size="5"/>

<a id='contents'></a>
# Table of Contents

<ul>
<li><a href="#intro">A. INTRODUCTION</a></li>
<li><a href="#scope">B. PROJECT MOTIVATION-SCOPE</a></li>
<li><a href="#gather">1. GATHERING DATA</a></li>
<li><a href="#assess">2. ASSESSING DATA</a></li>
<li><a href="#clean">3. CLEANING DATA</a></li>
<li><a href="#store">4. STORING AND ACTING ON WRANGLED DATA</a></li>
<li><a href="#report">5. REPORT-DISCUSSION-CONCLUSION</a></li>
<li><a href="#files">6. PROJECT FILES</a></li>
</ul>

<hr size="5"/>

<a id='intro'></a>
# A. INTRODUCTION

Real-world data rarely comes clean. Using Python and its libraries, I gathered data from a variety of sources and in a variety of formats, assessed its quality and tidiness, and then cleaned it. This process is called data wrangling. The wrangling effort is documented in a Jupyter Notebook, and the results are showcased through analyses and visualizations using Python (and its libraries) and/or SQL.

The dataset used for wrangling (and analyzing and visualizing) is the tweet archive of Twitter user [@dog_rates](https://twitter.com/dog_rates), also known as [WeRateDogs](https://en.wikipedia.org/wiki/WeRateDogs). WeRateDogs is a Twitter account that rates people's dogs with a humorous comment about the dog. These ratings almost always have a denominator of 10. The numerators, though? Almost always greater than 10. 11/10, 12/10, 13/10, etc. Why? Because ["they're good dogs Brent."](https://knowyourmeme.com/memes/theyre-good-dogs-brent) WeRateDogs has over 4 million followers and has received international media coverage.

WeRateDogs [downloaded their Twitter archive](https://help.twitter.com/en/managing-your-account/how-to-download-your-twitter-archive) and sent it to Udacity via email exclusively for use in this project. This archive contains basic tweet data (tweet ID, timestamp, text, etc.) for all 5000+ of their tweets as they stood on August 1, 2017. More on this soon.


![dog and twitter](https://video.udacity-data.com/topher/2017/October/59dd378f_dog-rates-social/dog-rates-social.jpg)

*Image via [Boston Magazine](https://www.bostonmagazine.com/arts-entertainment/2017/04/18/dog-rates-mit/)*

<a href="#contents">[back to contents]</a>

<a id='scope'></a>
# B. PROJECT MOTIVATION-SCOPE
## Context
The goal: wrangle WeRateDogs Twitter data to create interesting and trustworthy analyses and visualizations. The Twitter archive is great, but it only contains very basic tweet information. Additional gathering, then assessing and cleaning is required for "Wow!"-worthy analyses and visualizations.

## The Data
### Enhanced Twitter Archive

The WeRateDogs Twitter archive contains basic tweet data for all 5000+ of their tweets, but not everything. One column the archive does contain though: each tweet's text, which I used to extract rating, dog name, and dog "stage" (i.e. doggo, floofer, pupper, and puppo) to make this Twitter archive "enhanced." Of the 5000+ tweets, I have filtered for tweets with ratings only (there are 2356).

![table of tweets analyzed](https://video.udacity-data.com/topher/2017/October/59dd4791_screenshot-2017-10-10-18.19.36/screenshot-2017-10-10-18.19.36.png)
*The extracted data from each tweet's text*

### Extracted data from tweet text

This provided data set was extracted programmatically, but more processing (e.g., cleaning and tidying) is required. The ratings probably aren't all correct. The same goes for the dog names and probably the dog stages too (see below for more information on these). These columns therefore need to be assessed and cleaned before they are used for analysis and visualization.
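
To illustrate why programmatically extracted ratings can be wrong, here is a minimal sketch assuming pandas. The sample tweets are made up, and the column names `text`, `rating_numerator`, and `rating_denominator` are assumptions for illustration; this is not the extraction code that produced the provided archive.

```python
import pandas as pd

# Hypothetical sample tweets (made up for illustration, not taken from the archive)
sample = pd.DataFrame({
    "text": [
        "This is Stuart. He's a very good boy. 13/10",
        "Meet Bella. She ate 1/2 of my sandwich. Still 12/10",
        "This is Logan. Solid 9.75/10 would pet",
    ]
})

# Naive pattern: the first "integer/integer" found in the text
naive = sample["text"].str.extract(r"(\d+)/(\d+)")
naive.columns = ["rating_numerator", "rating_denominator"]

# More careful pattern: allow decimals and keep the last match in each tweet
careful = sample["text"].str.extractall(r"(\d+(?:\.\d+)?)/(\d+)")
careful = careful.groupby(level=0).last()
careful.columns = ["rating_numerator", "rating_denominator"]

print(naive)    # picks up "1/2" and "75/10" for the second and third tweets
print(careful)  # picks up "12/10" and "9.75/10" instead
```

Taking the last match is only a heuristic chosen for this sketch; real tweets would still need a visual check.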

![dog dictionary](https://video.udacity-data.com/topher/2017/October/59e04ceb_dogtionary-combined/dogtionary-combined.png)
*The Dogtionary explains the various stages of dog: doggo, pupper, puppo, and floof(er) (via the [#WeRateDogs book on Amazon](https://www.amazon.com/WeRateDogs-Most-Hilarious-Adorable-Youve/dp/1510717145))*

### Additional Data via the Twitter API

Back to the basic-ness of Twitter archives: retweet count and favorite count are two of the notable column omissions. Fortunately, this additional data can be gathered by anyone from Twitter's API. Well, "anyone" who has access to data for the 3000 most recent tweets, at least. With the WeRateDogs Twitter archive, and specifically the tweet IDs within it, this data can be gathered for all 5000+ tweets by querying Twitter's API for each tweet ID.

**Please note that the Twitter API was NOT used for this project, for data security/privacy reasons. The resulting data was instead provided as part of this project.**
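
As a rough sketch of how retweet and favorite counts could be pulled out of the provided tweet JSON, the snippet below parses one JSON object per line with the standard library. The two records are made up, and the field names `id`, `retweet_count`, and `favorite_count` are assumptions about the file's layout.

```python
import io
import json

# Two made-up records standing in for lines of the provided tweet JSON file
fake_file = io.StringIO(
    '{"id": 1, "retweet_count": 10, "favorite_count": 50, "full_text": "a"}\n'
    '{"id": 2, "retweet_count": 3, "favorite_count": 7, "full_text": "b"}\n'
)

rows = []
for line in fake_file:
    tweet = json.loads(line)  # each line is one complete JSON object
    rows.append({
        "tweet_id": tweet["id"],
        "retweet_count": tweet["retweet_count"],
        "favorite_count": tweet["favorite_count"],
    })

print(rows)
```

With a real file, `fake_file` would be replaced by `open(...)` on the provided text file.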

### Image Predictions File

One more cool thing: every image in the WeRateDogs Twitter archive was processed through a [neural network](https://www.youtube.com/watch?v=2-Ol7ZB0MmU) that can classify breeds of dogs (provided by the project). The results: a table full of image predictions (the top three only) alongside each tweet ID, image URL, and the image number that corresponded to the most confident prediction (numbered 1 to 4 since tweets can have up to four images).

![tweet image prediction](https://video.udacity-data.com/topher/2017/October/59dd4d2c_screenshot-2017-10-10-18.43.41/screenshot-2017-10-10-18.43.41.png)
*Tweet image prediction data*

### Image predictions

So for the last row in that table:

* tweet_id is the last part of the tweet URL after "status/" → https://twitter.com/dog_rates/status/889531135344209921
* p1 is the algorithm's #1 prediction for the image in the tweet → **golden retriever**
* p1_conf is how confident the algorithm is in its #1 prediction → **95%**
* p1_dog is whether or not the #1 prediction is a breed of dog → **TRUE**
* p2 is the algorithm's second most likely prediction → **Labrador retriever**
* p2_conf is how confident the algorithm is in its #2 prediction → **1%**
* p2_dog is whether or not the #2 prediction is a breed of dog → **TRUE**
* etc.
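
The columns described above can be combined into a single "best dog breed" guess per tweet. The following is a minimal sketch assuming pandas; the rows are made up, the `p3` columns are assumed to follow the same pattern as `p1` and `p2`, and `best_dog_prediction`, `breed`, and `breed_conf` are hypothetical names.

```python
import pandas as pd

# Made-up rows using the column layout described above (p1..p3, *_conf, *_dog)
preds = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "p1": ["golden_retriever", "paper_towel", "seat_belt"],
    "p1_conf": [0.95, 0.60, 0.50],
    "p1_dog": [True, False, False],
    "p2": ["Labrador_retriever", "pug", "desk"],
    "p2_conf": [0.01, 0.30, 0.20],
    "p2_dog": [True, True, False],
    "p3": ["kuvasz", "chow", "lamp"],
    "p3_conf": [0.005, 0.05, 0.10],
    "p3_dog": [True, True, False],
})


def best_dog_prediction(row):
    """Return (breed, confidence) for the highest-ranked prediction that is a dog."""
    for rank in (1, 2, 3):
        if row[f"p{rank}_dog"]:
            return row[f"p{rank}"], row[f"p{rank}_conf"]
    return None, None  # none of the top three predictions is a dog breed


results = [best_dog_prediction(row) for _, row in preds.iterrows()]
preds["breed"] = [breed for breed, _ in results]
preds["breed_conf"] = [conf for _, conf in results]

print(preds[["tweet_id", "breed", "breed_conf"]])
```

Tweets where no prediction is a dog end up with a missing breed, which makes them easy to filter out later.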

And the #1 prediction for the image in that tweet was spot on:

@dog_rates tweet
![gold retriever](https://video.udacity-data.com/topher/2017/October/59dd4e05_dog-pred/dog-pred.png)
*A golden retriever named Stuart*

So that's all fun and good. But all of this additional data will need to be gathered, assessed, and cleaned, which is the scope of this project.

## Key Points
Key points to keep in mind when data wrangling for this project:

* Only use original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
* Assessing and cleaning the entire dataset completely would require a lot of time, and is not necessary to practice and demonstrate your skills in data wrangling. Therefore, the requirements of this project are only to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset.
* Cleaning includes merging individual pieces of data according to the rules of [tidy data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html).
* The fact that the rating numerators are greater than the denominators does not need to be cleaned. This [unique rating system](http://knowyourmeme.com/memes/theyre-good-dogs-brent) is a big part of the popularity of WeRateDogs.
* It is not required to gather tweets beyond August 1st, 2017, because that is out of scope. Image predictions cannot be gathered for tweets after this date because the source for the image predictions is not provided; again, this is out of scope for this project.
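
The first key point (original ratings with images only) could be applied roughly as follows. This is a sketch assuming pandas; the data is made up, and the column names `retweeted_status_id` and `jpg_url` are assumptions about the archive and image prediction tables.

```python
import pandas as pd

# Made-up archive rows; a non-null `retweeted_status_id` marks a retweet (assumed column)
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "retweeted_status_id": [None, 99.0, None, None],
})

# Made-up image table: tweets 1, 2 and 4 have an image
images = pd.DataFrame({
    "tweet_id": [1, 2, 4],
    "jpg_url": ["a.jpg", "b.jpg", "d.jpg"],
})

# Keep original tweets only
originals = archive[archive["retweeted_status_id"].isna()]

# Keep only tweets that also appear in the image table
with_images = originals[originals["tweet_id"].isin(images["tweet_id"])]

print(with_images["tweet_id"].tolist())
```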

<a href="#contents">[back to contents]</a>

<hr size="5"/>

<a id='gather'></a>
# 1. GATHERING DATA

<a href="#contents">[back to contents]</a>

<font color=blue>
    
**1. CRITERIA:** The student is able to gather data from a variety of sources and file formats.

**1. SPECIFICATION:**
Data is successfully gathered:
* From at least the three (3) different sources on the Project Details page.
* In at least the three (3) different file formats on the Project Details page.

Each piece of data is imported into a separate pandas DataFrame at first.
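
A minimal sketch of importing each piece into its own DataFrame, assuming pandas. The three formats shown (CSV, TSV, and line-delimited JSON) and all sample content are assumptions for illustration; in-memory strings stand in for the real files.

```python
import io

import pandas as pd

# In-memory stand-ins for the three files (content is made up)
csv_text = "tweet_id,text\n1,Good dog 13/10\n"
tsv_text = "tweet_id\tp1\n1\tgolden_retriever\n"
json_text = '{"id": 1, "retweet_count": 10, "favorite_count": 50}\n'

# One DataFrame per source
df_archive = pd.read_csv(io.StringIO(csv_text))
df_images = pd.read_csv(io.StringIO(tsv_text), sep="\t")
df_counts = pd.read_json(io.StringIO(json_text), lines=True)

print(df_archive.shape, df_images.shape, df_counts.shape)
```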

<hr size="5"/>

<a id='assess'></a>
# 2. ASSESSING DATA

<a href="#contents">[back to contents]</a>

<font color=blue>
    
**2.1 CRITERIA:** The student is able to assess data visually and programmatically for quality and tidiness.

**2.1 SPECIFICATION:**
Two types of assessment are used:

* Visual assessment: each piece of gathered data is displayed in the Jupyter Notebook for visual assessment purposes. Once displayed, data can additionally be assessed in an external application (e.g. Excel, text editor).
* Programmatic assessment: pandas' functions and/or methods are used to assess the data.
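
Programmatic assessment might look like the sketch below, assuming pandas. The sample rows and the specific checks (duplicate IDs, missing names, lowercase "names", denominators other than 10) are illustrative assumptions, not a list of issues found in the real data.

```python
import pandas as pd

# Made-up rows with a few deliberate problems
df = pd.DataFrame({
    "tweet_id": [1, 2, 2, 3],
    "name": ["Stuart", "a", "a", None],
    "rating_denominator": [10, 10, 10, 0],
})

df.info()  # column dtypes and non-null counts

n_duplicates = df["tweet_id"].duplicated().sum()
n_missing_names = df["name"].isna().sum()

# Lowercase "names" are likely ordinary words rather than dog names
names = df["name"].dropna()
suspicious_names = names[names.str.islower()].unique()

# Ratings whose denominator is not the usual 10
odd_denominators = df[df["rating_denominator"] != 10]

print(n_duplicates, n_missing_names, list(suspicious_names), len(odd_denominators))
```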

<font color=blue>
    
**2.2 CRITERIA:** The student is able to thoroughly assess a dataset.

**2.2 SPECIFICATION:**
At least eight (8) data quality issues and two (2) tidiness issues are detected, and include the issues to clean to satisfy the Project Motivation. Each issue is documented in one to a few sentences each.

<hr size="5"/>

<a id='clean'></a>
# 3. CLEANING DATA

<a href="#contents">[back to contents]</a>

<font color=blue>
    
**3.1 CRITERIA:** The student uses the steps in the data cleaning process to guide their cleaning efforts.

**3.1 SPECIFICATION:**
The define, code, and test steps of the cleaning process are clearly documented.
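
One define, code, and test cycle could be documented as in this sketch, assuming pandas. The `timestamp` column and its sample values are assumptions used only to show the pattern.

```python
import pandas as pd

# Made-up raw data; the original is copied before any cleaning
raw = pd.DataFrame({
    "tweet_id": [1, 2],
    "timestamp": ["2017-08-01 16:23:56 +0000", "2017-07-31 00:18:03 +0000"],
})
clean = raw.copy()

# Define: `timestamp` is stored as text; convert it to a datetime type.

# Code
clean["timestamp"] = pd.to_datetime(clean["timestamp"])

# Test
assert pd.api.types.is_datetime64_any_dtype(clean["timestamp"])
assert isinstance(raw.loc[0, "timestamp"], str)  # the original copy is unchanged
```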

<font color=blue>
    
**3.2 CRITERIA:** The student is able to thoroughly clean a dataset programmatically.

**3.2 SPECIFICATION:**

Copies of the original pieces of data are made prior to cleaning.

All issues identified in the assess phase are successfully cleaned (if possible) using Python and pandas, and include the cleaning tasks required to satisfy the Project Motivation.

A tidy master dataset (or datasets, if appropriate) with all pieces of gathered data is created.
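
Building a single master dataset from the cleaned pieces might be sketched as below, assuming pandas and a shared `tweet_id` key. The three small tables and their column names are made up for illustration.

```python
import pandas as pd

# Made-up cleaned pieces that share a `tweet_id` key
archive = pd.DataFrame({"tweet_id": [1, 2, 3], "rating_numerator": [13, 12, 11]})
counts = pd.DataFrame({"tweet_id": [1, 2], "retweet_count": [10, 3], "favorite_count": [50, 7]})
images = pd.DataFrame({"tweet_id": [1, 3], "p1": ["golden_retriever", "pug"]})

# Inner joins keep only tweets present in all three tables
master = (
    archive
    .merge(counts, on="tweet_id", how="inner")
    .merge(images, on="tweet_id", how="inner")
)

print(master)
```

An inner join is one possible choice here; it drops tweets that lack counts or image predictions, which matches the "ratings with images" requirement but loses rows.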

<hr size="5"/>

<a id='store'></a>
# 4. STORING AND ACTING ON WRANGLED DATA

<a href="#contents">[back to contents]</a>

<font color=blue>
    
**4.1 CRITERIA:** The student is able to store a gathered, assessed, and cleaned dataset.

**4.1 SPECIFICATION:**

Students will save their gathered, assessed, and cleaned master dataset(s) to a CSV file or a SQLite database.
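
Both storage options can be sketched with pandas and the standard-library `sqlite3` module. The file names `twitter_archive_master.csv` and `twitter_archive_master.db`, the table name `master`, and the sample rows are hypothetical; a temporary directory is used so the sketch does not touch project files.

```python
import os
import sqlite3
import tempfile

import pandas as pd

# Made-up master dataset
master = pd.DataFrame({"tweet_id": [1, 2], "rating_numerator": [13, 12]})

out_dir = tempfile.mkdtemp()

# Option 1: CSV file
csv_path = os.path.join(out_dir, "twitter_archive_master.csv")
master.to_csv(csv_path, index=False)

# Option 2: SQLite database
db_path = os.path.join(out_dir, "twitter_archive_master.db")
conn = sqlite3.connect(db_path)
try:
    master.to_sql("master", conn, index=False, if_exists="replace")
    restored = pd.read_sql("SELECT * FROM master", conn)
finally:
    conn.close()

print(restored)
```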

<font color=blue>
    
**4.2 CRITERIA:** The student is able to act on their wrangled data to produce insights (e.g. analyses, visualizations, and/or models).

**4.2 SPECIFICATION:**

The master dataset is analyzed using pandas or SQL in the Jupyter Notebook and at least three (3) separate insights are produced.

At least one (1) labeled visualization is produced in the Jupyter Notebook using Python’s plotting libraries or in Tableau.

Students must make it clear in their wrangling work that they assessed and cleaned (if necessary) the data upon which the analyses and visualizations are based.
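
A few simple insights could be computed from the master dataset as in this sketch, assuming pandas. The rows and the column names `stage`, `rating_numerator`, and `favorite_count` are made up, so the numbers say nothing about the real data.

```python
import pandas as pd

# Made-up master dataset
master = pd.DataFrame({
    "stage": ["pupper", "pupper", "doggo", "puppo"],
    "rating_numerator": [10, 12, 13, 14],
    "favorite_count": [100, 300, 500, 900],
})

# Insight 1: mean rating per dog stage, highest first
mean_rating_by_stage = (
    master.groupby("stage")["rating_numerator"].mean().sort_values(ascending=False)
)

# Insight 2: the most favorited tweet
top_tweet = master.loc[master["favorite_count"].idxmax()]

# Insight 3: linear correlation between rating and favorite count
rating_fav_corr = master["rating_numerator"].corr(master["favorite_count"])

print(mean_rating_by_stage)
print(top_tweet["stage"], rating_fav_corr)
```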

<hr size="5"/>

<a id='report'></a>
# 5. REPORT-DISCUSSION-CONCLUSION

<a href="#contents">[back to contents]</a>

<font color=blue>
    
**5.1 CRITERIA:** The student is able to reflect upon and describe their data wrangling efforts.

**5.1 SPECIFICATION:**

The student’s wrangling efforts are briefly described. This document (wrangle_report.pdf or wrangle_report.html) is concise and approximately 300-600 words in length.

<font color=blue>
    
**5.2 CRITERIA:** The student is able to describe some insights found in their wrangled dataset.

**5.2 SPECIFICATION:**

The three (3) or more insights the student found are communicated. At least one (1) visualization is included.

This document (act_report.pdf or act_report.html) is at least 250 words in length.

<hr size="5"/>

<a id='files'></a>
# 6. PROJECT FILES

<a href="#contents">[back to contents]</a>

<font color=blue>
    
**6.1 CRITERIA:** Are all required files included in the student's submission?

**6.1 SPECIFICATION:**
The following files (with identical filenames) are included:

* wrangle_act.ipynb
* wrangle_report.pdf or wrangle_report.html
* act_report.pdf or act_report.html
    
All dataset files are included, including the stored master dataset(s), with filenames and extensions as specified on the Project Submission page.

<hr size="5"/>

*The code below is required to programmatically interface with Twitter and gather data from it. I do NOT want to create a Twitter account, for cybersecurity/privacy reasons.*

```
import tweepy
from tweepy import OAuthHandler
import json
from timeit import default_timer as timer

# Query Twitter API for each tweet in the Twitter archive and save JSON in a text file
# These are hidden to comply with Twitter's API terms and conditions
consumer_key = 'HIDDEN'
consumer_secret = 'HIDDEN'
access_token = 'HIDDEN'
access_secret = 'HIDDEN'

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth, wait_on_rate_limit=True)

# NOTE TO STUDENT WITH MOBILE VERIFICATION ISSUES:
# df_1 is a DataFrame with the twitter_archive_enhanced.csv file. You may have to
# change line 17 to match the name of your DataFrame with twitter_archive_enhanced.csv
# NOTE TO REVIEWER: this student had mobile verification issues so the following
# Twitter API code was sent to this student from a Udacity instructor
# Tweet IDs for which to gather additional data via Twitter's API
tweet_ids = df_1.tweet_id.values
len(tweet_ids)

# Query Twitter's API for JSON data for each tweet ID in the Twitter archive
count = 0
fails_dict = {}
start = timer()
# Save each tweet's returned JSON as a new line in a .txt file
with open('tweet_json.txt', 'w') as outfile:
    # This loop will likely take 20-30 minutes to run because of Twitter's rate limit
    for tweet_id in tweet_ids:
        count += 1
        print(str(count) + ": " + str(tweet_id))
        try:
            tweet = api.get_status(tweet_id, tweet_mode='extended')
            print("Success")
            json.dump(tweet._json, outfile)
            outfile.write('\n')
        # TweepError exists in tweepy < 4.0; later versions use tweepy.errors.TweepyException
        except tweepy.TweepError as e:
            print("Fail")
            fails_dict[tweet_id] = e
end = timer()
print(end - start)
print(fails_dict)
```

```
# initial setup and import
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm

%matplotlib inline
```

```
# load data file
# tweet-json.txt is expected to hold one JSON object per line, so pd.read_csv fails with
# "ParserError: Error tokenizing data. C error: Expected 126 fields in line 5, saw 146".
# Reading it as line-delimited JSON avoids that error.
df = pd.read_json('tweet-json.txt', lines=True)

df.head(5)
```

