# Wrangle Report

This document briefly describes the wrangling efforts for the WeRateDogs Twitter dataset in the `wrangle_act.ipynb` notebook.

## Gathering Data

The dataset was gathered through the following methods:
1. File on hand - `twitter-archive-enhanced.csv`
2. File hosted on Udacity's servers - `image-predictions.tsv`
3. File hosted on Udacity's servers - `tweet_json.txt`

### `twitter-archive-enhanced.csv`

The `.csv` file was read directly into a DataFrame with the pandas library.
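As a minimal sketch of this step (not necessarily the notebook's exact code), the read could look like the following. The helper name `load_archive`, the choice to keep `tweet_id` as a string, and the two sample rows are assumptions made for illustration; an in-memory buffer stands in for the real file.

```python
import io
import pandas as pd

def load_archive(source):
    """Read the archive CSV from a path or a file-like object."""
    # Assumption: tweet_id is kept as a string because it is an
    # identifier, not a quantity to do arithmetic on.
    return pd.read_csv(source, dtype={"tweet_id": str})

# Made-up rows standing in for twitter-archive-enhanced.csv.
sample = io.StringIO(
    "tweet_id,text\n"
    "1001,This is Rex. 12/10\n"
    "1002,Good dog 13/10\n"
)
archive = load_archive(sample)
print(archive.shape)
```

With the real file on hand, the call would be `load_archive("twitter-archive-enhanced.csv")`.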

### `image-predictions.tsv`

The file was downloaded programmatically with the requests library, then read into a DataFrame with pandas.
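The download-then-parse flow might be sketched as below. The helper names `download_bytes` and `read_predictions` are hypothetical, the file's URL is not given in this report and is left to the caller, and the sample row is made up so the parsing step can be shown offline.

```python
import io
import pandas as pd

def download_bytes(url, timeout=30):
    """Fetch a URL and return the raw response body."""
    # Imported here so the parsing helper below can be used offline.
    import requests
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.content

def read_predictions(raw_bytes):
    """Parse tab-separated bytes into a DataFrame."""
    return pd.read_csv(io.BytesIO(raw_bytes), sep="\t")

# Offline demonstration with a made-up row.
sample = b"tweet_id\tjpg_url\tp1\n1001\thttps://example.com/a.jpg\tpug\n"
predictions = read_predictions(sample)
print(predictions.head())
```

A real run would chain the two helpers: `read_predictions(download_bytes(url))`.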

### `tweet_json.txt`

The file was downloaded programmatically with the requests library. The text file was then read line by line, and each tweet's `tweet_id`, `favorite_count`, and `retweet_count` were appended to a DataFrame.
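A sketch of the line-by-line parse, assuming each line holds one JSON object and that the tweet identifier is stored under the key `id` (an assumption about the file's layout, as is the helper name `tweets_from_json_lines`). The two sample lines are made up.

```python
import io
import json
import pandas as pd

def tweets_from_json_lines(lines):
    """Build a DataFrame from an iterable of JSON-encoded tweets."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        rows.append({
            "tweet_id": tweet["id"],  # assumed key name
            "favorite_count": tweet["favorite_count"],
            "retweet_count": tweet["retweet_count"],
        })
    # Build the frame once instead of appending row by row.
    return pd.DataFrame(
        rows, columns=["tweet_id", "favorite_count", "retweet_count"]
    )

# Made-up lines standing in for the downloaded text file.
sample = io.StringIO(
    '{"id": 1001, "favorite_count": 10, "retweet_count": 3}\n'
    '{"id": 1002, "favorite_count": 5, "retweet_count": 1}\n'
)
tweets = tweets_from_json_lines(sample)
print(tweets)
```

Collecting the rows in a list and constructing the DataFrame once avoids the cost of growing a DataFrame inside the loop.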

## Assessing Data
Once all the data had been gathered into individual DataFrames, it was assessed both visually and programmatically for quality and tidiness issues.

Programmatic Methods:
- `.head()`
- `.describe()`
- `.info()`
- `.duplicated()`
- `.value_counts()`
- `.query()`
- `.sum()`
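To illustrate how these methods fit together, here is a small sketch on a made-up frame. The column names and values are assumptions for the example, as is the outlier threshold of 20.

```python
import pandas as pd

# Made-up rows; the last two are exact duplicates.
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3, 3],
    "name": ["Rex", "a", "None", "None"],
    "rating_numerator": [12, 13, 1776, 1776],
})

preview = archive.head()        # first rows, for a quick visual check
summary = archive.describe()    # summary statistics of numeric columns
archive.info()                  # dtypes and non-null counts (printed)

# Count fully duplicated rows.
n_duplicates = archive.duplicated().sum()

# Frequency of each value in a column.
name_counts = archive["name"].value_counts()

# Filter rows with a boolean expression.
outliers = archive.query("rating_numerator > 20")
```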

The issues were categorized by quality and tidiness.

### Quality Issues
#### Completeness
1. `archive`: Missing and incorrect dog names

#### Validity
4. `archive`: Retweets may capture the same dog twice under a different `tweet_id`
5. `archive`: Replies do not have images
6. `predictions`: In 324 rows, none of the top three predictions is a dog breed. Sampled rows include turtles, fish, and sloths.
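One way retweets and replies could be isolated for review or removal is sketched below. It assumes the archive marks them with non-null `retweeted_status_id` and `in_reply_to_status_id` columns; those column names and the sample values are assumptions, not taken from this report.

```python
import numpy as np
import pandas as pd

# Made-up rows: tweet 2 is a retweet, tweet 3 is a reply.
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "retweeted_status_id": [np.nan, 9001.0, np.nan],
    "in_reply_to_status_id": [np.nan, np.nan, 9002.0],
})

is_retweet = archive["retweeted_status_id"].notna()
is_reply = archive["in_reply_to_status_id"].notna()

# Keep only original tweets.
originals = archive[~is_retweet & ~is_reply].copy()
print(originals["tweet_id"].tolist())
```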

#### Accuracy
7. `archive`: Rating numerator and denominator have many outliers

#### Consistency
9. `archive`: `timestamp` column is stored as a string
10. `archive`: `source` column displays a URL
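A possible fix for both consistency issues is sketched here. It assumes the `source` values are HTML anchor tags whose link text is the readable label, which is a guess about the data's shape; the sample values are made up.

```python
import pandas as pd

archive = pd.DataFrame({
    "timestamp": ["2017-08-01 16:23:56 +0000", "2017-07-31 00:18:03 +0000"],
    "source": [
        '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
        '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
    ],
})

# Parse the strings into timezone-aware datetimes.
archive["timestamp"] = pd.to_datetime(archive["timestamp"], utc=True)

# Keep only the text between the opening and closing anchor tags.
archive["source"] = archive["source"].str.extract(r">([^<]+)<", expand=False)
print(archive.dtypes)
```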

### Tidiness Issues

#### Each variable forms a column
11. `archive`: Four columns for stages of dog (doggo, pupper, puppo, floofer) should be one category column
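Collapsing the four stage columns into one could be sketched as follows. This assumes each stage column holds either the stage name or the string "None", and that a dog with several stages should get a comma-joined label; both are assumptions for illustration, as is the new column name `stage`.

```python
import pandas as pd

stage_cols = ["doggo", "floofer", "pupper", "puppo"]

# Made-up rows: one doggo, one pupper, one with no stage.
df = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "doggo": ["doggo", "None", "None"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "pupper", "None"],
    "puppo": ["None", "None", "None"],
})

def combine_stages(row):
    """Join every stage present in the row; None if there is none."""
    found = [row[col] for col in stage_cols if row[col] != "None"]
    return ",".join(found) if found else None

df["stage"] = df[stage_cols].apply(combine_stages, axis=1)
df = df.drop(columns=stage_cols)
df["stage"] = df["stage"].astype("category")
print(df)
```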

#### Each observation forms a row
- N/A

#### Each type of observational unit forms a table
12. `predictions`: The observational unit is an image prediction; `jpg_url` should be part of the `archive` table.
13. `tweets`: Retweet and favorite counts should be merged into the `archive` table.
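Bringing these columns into the `archive` table amounts to joining on `tweet_id`. The sketch below uses left joins so that archive rows without a prediction are kept; that choice, the name `master`, and the sample rows are assumptions.

```python
import pandas as pd

archive = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "text": ["This is Rex.", "Good dog.", "This is a pup."],
})
predictions = pd.DataFrame({
    "tweet_id": [1, 2],
    "jpg_url": ["https://example.com/1.jpg", "https://example.com/2.jpg"],
    "p1": ["pug", "box_turtle"],
})
tweets = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "retweet_count": [3, 1, 7],
    "favorite_count": [10, 5, 20],
})

master = (
    archive
    .merge(predictions[["tweet_id", "jpg_url"]], on="tweet_id", how="left")
    .merge(tweets, on="tweet_id", how="left")
)
print(master)
```

An inner join would instead drop tweets that have no image prediction, which may be preferable if every row in the final table must have an image.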

#### Missing and incorrect dog names

Most of the tweets introduce the dog's name at the beginning of the tweet with "This is ...".

The previous gathering efforts took note of this pattern and were able to capture most of the dogs' names by extracting the word after "This is ...".

However, if the tweet did not begin with "This is ...", the name defaulted to "None". This explains the 745 records where the dog's name is "None".

This method may also explain why the second most common dog name is "a". For example, if the tweet began with "This is a good boy...", the method assigned the letter "a" as the dog's name.

On further inspection, a lowercase dog name was likely labeled incorrectly.

The cleaning effort tried to correct these names by filtering for the incorrectly labeled tweets and searching for the dog's name in the body of the text.

Because of time constraints, the notebook only corrects dog names labeled as "a". With more time and cleaning effort, more dog names could be found.
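The correction described above could look something like the sketch below. The search pattern (a capitalized word following "named") is a hypothetical choice, not necessarily what the notebook uses, and the sample tweets are made up.

```python
import re
import pandas as pd

archive = pd.DataFrame({
    "text": [
        "This is a fluffy terrier named Kip. 10/10",
        "This is a good boy. 12/10",
        "This is Rex. 13/10",
    ],
    "name": ["a", "a", "Rex"],
})

# Lowercase names are likely mislabeled.
suspect = archive["name"].str.islower()

# Hypothetical pattern: a capitalized word right after "named".
NAME_PATTERN = re.compile(r"\bnamed\s+([A-Z][a-z]+)")

def find_name(text):
    """Return the name found in the tweet text, or None."""
    match = NAME_PATTERN.search(text)
    return match.group(1) if match else None

# Only names labeled "a" are corrected, as in the notebook.
mislabeled = archive["name"] == "a"
archive.loc[mislabeled, "name"] = archive.loc[mislabeled, "text"].apply(find_name)
print(archive["name"].tolist())
```

Rows where no name is found end up missing rather than keeping the placeholder "a", which makes the remaining gaps easy to count.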

#### Predictions are not dog breeds

For many of the rows where all three dog predictions are False, the image did not contain a dog.
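Rows like these could be selected with a filter along the following lines. The boolean flag columns `p1_dog`, `p2_dog`, and `p3_dog` are assumed names for the "is this prediction a dog breed" indicators, and the sample rows are made up.

```python
import pandas as pd

predictions = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "p1": ["pug", "box_turtle", "desk"],
    "p1_dog": [True, False, False],
    "p2": ["bull_mastiff", "mud_turtle", "monitor"],
    "p2_dog": [True, False, False],
    "p3": ["doormat", "terrapin", "beagle"],
    "p3_dog": [False, False, True],
})

dog_flags = ["p1_dog", "p2_dog", "p3_dog"]

# Rows where none of the top three predictions is a dog breed.
no_dog = predictions[~predictions[dog_flags].any(axis=1)]
print(no_dog["tweet_id"].tolist())
```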

However, there are some instances where a dog is in a busy photo and a dog breed is not predicted.

For example, one photo shows a dog from behind, with its face visible only in the reflection of a computer monitor. The top three predictions were for items on the desk.

Retraining the model may provide more accurate breed predictions.