# Wrangle Report: WeRateDogs

## Description

*"A huge amount of effort is spent cleaning data to get it ready for analysis"*, H. Wickham.

This document details a data wrangling approach applied to a Twitter dataset, as part of an assignment for the Udacity Data Analyst Nanodegree. More details can be found on the main GitHub repository page.

*"This framework makes it easy to tidy messy datasets because only a small set of tools are needed to deal with a wide range of un-tidy datasets. This structure also makes it easier to develop tidy tools for data analysis, tools that both input and output tidy datasets."*, H. Wickham.

[Tidy Data, H. Wickham, published in the Journal of Statistical Software](https://www.jstatsoft.org/article/view/v059i10/v59i10.pdf)

## Gather

The steps involved in gathering data vary with the source of the data and the format it is in.
At a high level, gathering means obtaining the data (downloading a file from the internet, scraping a web page, querying an API, etc.) and importing it into your programming environment (e.g., Jupyter Notebook).

In this assignment, the data gathered is as follows:

- **The WeRateDogs Twitter archive**. Download this file manually by clicking the following link: [twitter_archive_enhanced.csv](https://d17h27t6h515a5.cloudfront.net/topher/2017/August/59a4e958_twitter-archive-enhanced/twitter-archive-enhanced.csv).

- **The tweet image predictions**, i.e., what breed of dog (or other object, animal, etc.) is present in each tweet according to a neural network. This file (image_predictions.tsv) is hosted on Udacity's servers and should be downloaded programmatically using the following [URL](https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv).

- **Twitter API call**: each tweet's retweet count and favorite ("like") count at minimum, plus any additional data you find interesting. Using the tweet IDs in the WeRateDogs Twitter archive, query the Twitter API for each tweet's JSON data.
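
The gathering steps above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the notebook's actual code: the helper names `download_file` and `parse_tweet_json` are hypothetical, the JSON keys `id`, `retweet_count` and `favorite_count` are assumed, and the API call itself (authentication, rate limits) is left out. The sketch assumes each tweet's JSON has already been saved as one line of text.

```python
import json
import urllib.request

import pandas as pd


def download_file(url: str, dest: str) -> str:
    """Download `url` to the local path `dest` and return `dest` (hypothetical helper)."""
    with urllib.request.urlopen(url) as response, open(dest, "wb") as fh:
        fh.write(response.read())
    return dest


def parse_tweet_json(lines) -> pd.DataFrame:
    """Build a DataFrame from an iterable of JSON strings, one tweet per line."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        records.append(
            {
                "tweet_id": tweet["id"],  # assumed key names
                "retweet_count": tweet["retweet_count"],
                "favorite_count": tweet["favorite_count"],
            }
        )
    return pd.DataFrame(
        records, columns=["tweet_id", "retweet_count", "favorite_count"]
    )


# Toy input standing in for the saved API responses (made-up values).
sample_lines = [
    '{"id": 1, "retweet_count": 10, "favorite_count": 25}',
    "",
    '{"id": 2, "retweet_count": 3, "favorite_count": 7}',
]
json_tweets = parse_tweet_json(sample_lines)
print(json_tweets)
```

Once the files are on disk, the archive and the predictions could be loaded with `pd.read_csv(path)` and `pd.read_csv(path, sep="\t")` respectively, since the second file is tab-separated.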

## Assess

**Assess data for:**
- Quality: issues with content. Low quality data is also known as dirty data.
- Tidiness: issues with structure that prevent easy analysis. Untidy data is also known as messy data. Tidy data requirements:
    - Each variable forms a column.
    - Each observation forms a row.
    - Each type of observational unit forms a table.

**Types of assessment:**
- Visual assessment: scrolling through the data in your preferred software application (Google Sheets, Excel, a text editor, etc.).
- Programmatic assessment: using code to view specific portions and summaries of the data (pandas' head, tail, and info methods, for example).
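
A programmatic assessment might look like the sketch below. The frame is toy data with made-up rows, and the column names (`tweet_id`, `name`, `rating_denominator`) are assumptions chosen for illustration.

```python
import pandas as pd

# Toy frame with made-up rows; column names are assumptions.
archive = pd.DataFrame(
    {
        "tweet_id": [1, 2, 3, 3],
        "name": ["Charlie", "None", "a", "a"],
        "rating_numerator": [12, 13, 75, 75],
        "rating_denominator": [10, 7, 10, 10],
    }
)

print(archive.head())   # first rows
print(archive.tail(2))  # last two rows
archive.info()          # dtypes and non-null counts (prints directly)

# Summaries that tend to surface quality issues.
n_duplicates = int(archive.duplicated(subset="tweet_id").sum())
name_counts = archive["name"].value_counts()
odd_denominators = archive[archive["rating_denominator"] != 10]

print(n_duplicates)
print(name_counts)
print(odd_denominators)
```

Value counts are a quick way to spot suspicious entries such as `"None"` or `"a"` in a name column, which visual scrolling can easily miss.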

In this assignment, the data issues identified during the assessment are as follows:

**Quality**, *issues with content. Low quality data is also known as dirty data.*
- Drop unused columns
- Drop retweeted rows
- Drop rows without image classifier outputs or Twitter API call outputs
- Fix the dog name extracts
- Clean the source field
- Clean the rating_numerators and rating_denominators
- Correct the data types
- Unpivot the dog stages into one column
- Convert the value "None" to numpy.nan
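
A few of these quality fixes can be sketched on toy data as shown below. The column names (`retweeted_status_id`, `source`, `text`, `name`), the HTML-anchor format of the source field, and the regular expressions are assumptions made for illustration, not necessarily what the notebook uses.

```python
import numpy as np
import pandas as pd

# Toy rows (made up); column names and formats are assumptions.
archive = pd.DataFrame(
    {
        "tweet_id": [1, 2, 3],
        "retweeted_status_id": [np.nan, 99.0, np.nan],
        "source": [
            '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
            '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
            '<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>',
        ],
        "text": [
            "This is Bella. 13.5/10 would pet",
            "RT: This is Sam. 12/10",
            "Here is a pupper. 11/10",
        ],
        "name": ["Bella", "Sam", "None"],
    }
)

clean = archive.copy()

# Drop retweeted rows: keep only rows with no retweeted status id.
clean = clean[clean["retweeted_status_id"].isna()].copy()

# Convert the string "None" to a real missing value.
clean["name"] = clean["name"].replace("None", np.nan)

# Clean the source field: keep only the text between the anchor tags.
clean["source"] = clean["source"].str.extract(r">([^<]+)<", expand=False)

# Re-extract ratings from the tweet text, allowing decimal numerators.
ratings = clean["text"].str.extract(r"(\d+(?:\.\d+)?)/(\d+)")
clean["rating_numerator"] = ratings[0].astype(float)
clean["rating_denominator"] = ratings[1].astype(int)

print(clean[["tweet_id", "name", "source", "rating_numerator", "rating_denominator"]])
```

Storing the numerator as a float is a design choice of this sketch: it keeps decimal ratings such as 13.5 intact instead of truncating them.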


**Tidiness**, *issues with structure that prevent easy analysis. Untidy data is also known as messy data.*
- Drop unused columns in twitter_archive and in image_classifier_output
- Merge twitter_archive, image_classifier_output and json_tweets into one dataset


## Clean

**Types of cleaning:**
- Manual (not recommended unless the issues are single occurrences)
- Programmatic

**The programmatic data cleaning process:**
- Define: convert our assessments into defined cleaning tasks. These definitions also serve as an instruction list so others (or yourself in the future) can look at your work and reproduce it.
- Code: convert those definitions to code and run that code.
- Test: test your dataset, visually or with code, to make sure your cleaning operations worked.

Always make copies of the original pieces of data before cleaning!
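
One define, code, test cycle could be sketched like this, using the "correct the data types" task as an example. The column names and values are made up for illustration.

```python
import pandas as pd

# Toy data (made-up values); column names are assumptions.
archive = pd.DataFrame(
    {
        "tweet_id": [1001, 1002],
        "timestamp": ["2017-08-01 16:23:56 +0000", "2017-08-01 00:17:27 +0000"],
    }
)

# Work on a copy so the original stays available for comparison.
archive_clean = archive.copy()

# Define: tweet_id is an identifier, not a quantity, so store it as a string;
#         timestamp should be a datetime rather than text.

# Code
archive_clean["tweet_id"] = archive_clean["tweet_id"].astype(str)
archive_clean["timestamp"] = pd.to_datetime(archive_clean["timestamp"])

# Test
assert archive_clean["tweet_id"].tolist() == ["1001", "1002"]
assert pd.api.types.is_datetime64_any_dtype(archive_clean["timestamp"])
assert pd.api.types.is_integer_dtype(archive["tweet_id"])  # original untouched
```

Keeping the test as plain `assert` statements means a failed cleaning step stops the notebook immediately instead of passing unnoticed.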

**Reassess and Iterate**
- After cleaning, always reassess and iterate on any of the data wrangling steps if necessary.

**In this assignment, the data cleaning involved:**
- Extensive use of built-in pandas methods
- For loops
- Regular Expressions
- Merge
- Melt (unpivot)
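
As an illustration of the melt step, the sketch below unpivots several dog stage columns into a single `dog_stage` column. The stage column names (`doggo`, `floofer`, `pupper`, `puppo`), the `"None"` placeholder, and the choice to join multiple stages with a comma are assumptions of this sketch.

```python
import pandas as pd

# Toy data (made up); one column per dog stage, "None" where not set.
archive = pd.DataFrame(
    {
        "tweet_id": [1, 2, 3, 4],
        "doggo": ["doggo", "None", "None", "doggo"],
        "floofer": ["None", "None", "None", "None"],
        "pupper": ["None", "pupper", "None", "pupper"],
        "puppo": ["None", "None", "None", "None"],
    }
)
stage_cols = ["doggo", "floofer", "pupper", "puppo"]

# Unpivot the stage columns into long form: one row per (tweet, stage column).
stages_long = archive.melt(
    id_vars="tweet_id",
    value_vars=stage_cols,
    var_name="stage_column",
    value_name="dog_stage",
)

# Keep only the rows where a stage was actually set.
stages_long = stages_long[stages_long["dog_stage"] != "None"]

# A tweet can mention several stages; join them into one string per tweet.
stages = stages_long.groupby("tweet_id")["dog_stage"].agg(", ".join)

# Attach the single column back; tweets without a stage get NaN.
tidy = archive.drop(columns=stage_cols)
tidy["dog_stage"] = tidy["tweet_id"].map(stages)

print(tidy)
```

After this step each tweet has one `dog_stage` value, so the variable occupies a single column as the tidy data requirements ask.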

