## Wrangling Report

> documentation for data wrangling steps: gather, assess, and clean

As described in the initial project outline, the dataset provided was quite interesting but contained issues that needed to be assessed and cleaned. Before that could be done, the individual datasets had to be gathered.

### Gathering

Part of the dataset was provided in the form of the `twitter-archive-enhanced.csv` file, which contained a Twitter archive for the `dog_rates` account. There was also the output of an image prediction model for the tweets, which had to be downloaded programmatically using the [**Requests**](https://requests.kennethreitz.org/en/master/) library and then saved locally.
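
A minimal sketch of the download step, assuming a plain HTTP GET is enough. The helper name `download_file`, the `get` parameter (a stand-in so the function can be tried without network access), and the example URL and file name are illustrative assumptions, not the project's actual code.

```python
from pathlib import Path


def download_file(url, dest, get=None):
    """Download `url` and write the response body to `dest`."""
    if get is None:
        # Imported lazily so the helper can be exercised with a stand-in getter.
        import requests
        get = requests.get
    response = get(url, timeout=30)
    # Fail loudly on 4xx/5xx instead of saving an error page to disk.
    response.raise_for_status()
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_bytes(response.content)
    return dest


# Hypothetical usage; substitute the real URL of the predictions file:
# download_file("https://example.com/image_predictions.tsv", "data/image_predictions.tsv")
```

Writing `response.content` as bytes keeps the file exactly as served, so the later read step decides how to parse it.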

The next part was to query the Twitter API to gather additional information that wasn't present in the provided archive file, like favorite or re-tweet counts. This was harder than just using _requests_ to download a file. A developer account has to be created with Twitter, which doesn't always work, and in this instance it took a bit of time before the account was approved.

Once approved, credentials need to be generated & saved locally to be used by [**Tweepy**](https://www.tweepy.org/), which is a Python library for making Twitter API requests. These credentials need to be on the machine & can be read into the [Jupyter](https://jupyter.org/) notebook, but should not be displayed in the notebook or be committed to git. With the credentials in hand I could now query the data from the Twitter API.
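
One way to keep the credentials out of the notebook and out of git is to read them from a local JSON file that is listed in `.gitignore`. This is only a sketch: the file name `twitter_credentials.json`, the key names, and the helper `load_credentials` are assumptions made for illustration.

```python
import json
from pathlib import Path

# Key names are an assumption; they mirror the four values Twitter issues.
REQUIRED_KEYS = ("consumer_key", "consumer_secret", "access_token", "access_token_secret")


def load_credentials(path="twitter_credentials.json"):
    """Read API credentials from a local JSON file that is excluded from git."""
    creds = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [key for key in REQUIRED_KEYS if key not in creds]
    if missing:
        raise KeyError(f"missing credential keys: {missing}")
    return creds


# Hypothetical usage with Tweepy (never print `creds` in the notebook):
# creds = load_credentials()
# auth = tweepy.OAuthHandler(creds["consumer_key"], creds["consumer_secret"])
# auth.set_access_token(creds["access_token"], creds["access_token_secret"])
```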

This has a learning curve, as there's a **rate limit** on how many requests can be made within a certain time period. Tweepy has some settings (`wait_on_rate_limit` & `wait_on_rate_limit_notify`) that can be set to `True` to avoid the API requests timing out or raising an exception. To avoid any issues around this I wrapped each request in a `try/except` block so the loop would not stop if one of the requests ran into an issue.

I looped through each `tweet_id` in the provided Twitter archive & appended the results, if successful, to a list. When the Twitter API query completed I saved the compiled list of full tweet data to a file as a JSON array. This file was then read into its own dataframe using the `read_json` method.
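
The loop described above can be sketched roughly as follows. The `fetch` argument is a hypothetical callable standing in for the Tweepy request, and `collect_tweets` and `save_json_array` are illustrative names rather than the notebook's actual functions.

```python
import json


def collect_tweets(tweet_ids, fetch):
    """Fetch each tweet, keeping successes and recording failures."""
    tweets, failed = [], {}
    for tweet_id in tweet_ids:
        try:
            tweets.append(fetch(tweet_id))
        except Exception as error:
            # A single bad request (e.g. a deleted tweet) should not stop the loop.
            failed[tweet_id] = str(error)
    return tweets, failed


def save_json_array(records, path):
    """Write the list of tweet dictionaries to `path` as one JSON array."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(records, handle)


# With Tweepy, `fetch` might look like this (an assumption, not verified here):
# api = tweepy.API(auth, wait_on_rate_limit=True)
# fetch = lambda tweet_id: api.get_status(tweet_id, tweet_mode="extended")._json
```

Keeping the failed ids alongside their error messages makes it easy to check afterwards which tweets could not be retrieved.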

With that the **gathering** stage of the project was complete.

### Assessing

With all the necessary data gathered I began looking at the data to find any issues that might be present. This was done with programmatic assessments using `pandas` built-in functions and with visual assessments of specific records that presented issues.

The following issues were found:


#### Data Quality
1. There are re-tweets in the dataset. These should be ignored/removed.
2. There are replies in the dataset & they should also be ignored/removed.
3. The reply status ids also appear to have been rounded as the `in_reply_to_status_id` column values appear in scientific notation, even though the column was imported as a string. This could be addressed with the `tweet_json` data that was pulled from the Twitter API, but since I'm choosing to ignore/remove any replies before the analysis it will be moot.
4. Not all the tweets have a `name` for the dog(s), and the name `a` is present on 55 of the tweets. It should be set to `NaN`.
5. The `source` column contains an entire HTML tag instead of just the text to denote which client posted the tweet.
6. Certain tweets have the wrong rating. Some of these are from re-tweets & replies; others are from group ratings (multiple dogs), but the following are wrong & should be corrected:

| tweet_id | current_rating | intended_rating |
|:---|---|---|
| `810984652412424192` | 24 | NaN |
| `786709082849828864` | 75 | 9.75 |
| `778027034220126208` | 27 | 11.27 | 
| `716439118184652801` | 50 | 11 |
| `680494726643068929` | 26 | 11.26 |
| `740373189193256964` | 9 | 14 |
| `722974582966214656` | 4 | 13 |
| `682962037429899265` | 7 | 10 |
| `666287406224695296` | 1 | 9 |

7. Similarly some of the denominator values are also questionable. Some of them are replies or re-tweets; and the group ratings are present as well, but the following need to be corrected:

| tweet_id | current_denominator | intended_denominator |
|:---|---|---|
| `810984652412424192` | 7 | NaN |
| `740373189193256964` | 11 | 10 |
| `722974582966214656` | 20 | 10 |
| `716439118184652801` | 50 | 10 |
| `682962037429899265` | 11 | 10 |
| `666287406224695296` | 2 | 10 |

8. There are 14 tweets that have more than a single `dog_stage` listed. They should be corrected to the most accurate one for the image. The scenarios are:
    - `doggo` and `pupper`
    - `doggo` and `floofer`
    - `doggo` and `puppo`
    
    Some of them contain multiple/different dogs & the multiple instances of `dog_stage` are warranted. Those that need to be fixed are here:
    
| tweet_id | current_stage(s) | intended_stage(s) |
|:---|---|---|
| `855851453814013952` | doggo, puppo | puppo |
| `854010172552949760` | doggo, floofer | NaN |
| `817777686764523521` | doggo, pupper | pupper |
| `801115127852503040` | doggo, pupper | pupper |
| `785639753186217984` | doggo, pupper | doggo |
| `751583847268179968` | doggo, pupper | NaN |

    
    
9. The ratings are spread across 2 columns, `rating_numerator` and `rating_denominator`. A cleaner representation would be to calculate the actual normalized rating as a `float`, to be able to make comparisons & analysis on the data and to account for multiple dogs per tweet.
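
Several of the quality issues above (items 1, 2, 4, 5 and 9) map onto short `pandas` operations. The sketch below runs on a tiny made-up frame. The `retweeted_status_id` column name, the sample rows, and the sample `source` tag are assumptions for illustration; only `in_reply_to_status_id`, `name`, `source`, `rating_numerator` and `rating_denominator` are named in this report.

```python
import pandas as pd

# Made-up rows that mimic the shape of the archive.
archive = pd.DataFrame({
    "tweet_id": ["1", "2", "3", "4"],
    "in_reply_to_status_id": [None, "6.7e17", None, None],
    "retweeted_status_id": [None, None, "7.1e17", None],
    "source": ['<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'] * 4,
    "name": ["a", "Rex", "Bo", None],
    "rating_numerator": [12, 10, 11, 84],
    "rating_denominator": [10, 10, 10, 70],
})

# Items 1 and 2: keep only original tweets (neither a reply nor a re-tweet).
is_original = archive["in_reply_to_status_id"].isna() & archive["retweeted_status_id"].isna()
originals = archive[is_original].copy()

# Item 4: treat the placeholder name `a` as missing.
originals["name"] = originals["name"].mask(originals["name"] == "a")

# Item 5: keep only the text between the opening and closing HTML tag.
originals["source"] = originals["source"].str.extract(r">([^<]+)<", expand=False)

# Item 9: a single normalized rating, so group ratings such as 84/70 compare fairly.
originals["rating"] = originals["rating_numerator"] / originals["rating_denominator"]
```

Dividing numerator by denominator puts a group rating of 84/70 on the same scale as a single-dog 12/10.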


#### Data Tidiness
1. The predictions of dog breed are not tidy. Each prediction should form its own observation with the `breed` (or prediction of what's in the image), the `confidence` of the prediction, and whether or not the prediction is actually a `dog`.
2. The `retweet_count` and `favorite_count` pulled in from the Twitter API are in a separate table and should be merged with the master dataset.
3. Dog stages (doggo, floofer, pupper, puppo) are in separate columns and can be combined into a single tidy column `dog_stage`.
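
Tidiness items 2 and 3 could be handled along these lines. This is a sketch on made-up rows: it assumes each stage column holds either the stage name or the string `"None"`, and the helper `combine_stages` and the sample counts are illustrative.

```python
import pandas as pd

STAGES = ["doggo", "floofer", "pupper", "puppo"]

# Made-up rows; a missing stage is assumed to be stored as the string "None".
archive = pd.DataFrame({
    "tweet_id": ["1", "2", "3"],
    "doggo": ["doggo", "None", "None"],
    "floofer": ["None", "None", "None"],
    "pupper": ["pupper", "None", "None"],
    "puppo": ["None", "puppo", "None"],
})


def combine_stages(row):
    """Join every stage present in the row; return None when there is none."""
    present = [stage for stage in STAGES if row[stage] == stage]
    return ", ".join(present) if present else None


# Item 3: one tidy `dog_stage` column instead of four.
archive["dog_stage"] = archive[STAGES].apply(combine_stages, axis=1)
archive = archive.drop(columns=STAGES)

# Item 2: bring the API counts into the master table.
counts = pd.DataFrame({
    "tweet_id": ["1", "2"],
    "retweet_count": [5, 7],
    "favorite_count": [20, 30],
})
master = archive.merge(counts, on="tweet_id", how="left")
```

A left merge keeps every archive row even when the API returned nothing for that tweet; an inner merge would drop those rows instead.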

### Cleaning

With all the relevant data quality & tidiness issues identified and documented, it was time to clean the dataset so it could be analyzed for insights.

This was done in several different ways by working through the list of identified issues, and is documented in the cleaning code with comments.
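
As one example of such a fix, the manual rating corrections from the assessment tables can be applied from a lookup of `tweet_id` to the intended numerator and denominator. This is a sketch rather than the notebook's actual code: the `corrections` mapping uses values from the tables above, while the sample frame (including the placeholder id `"123"`) is made up.

```python
import pandas as pd

# tweet_id -> (intended numerator, intended denominator), taken from the tables above.
corrections = {
    "740373189193256964": (14, 10),
    "722974582966214656": (13, 10),
    "716439118184652801": (11, 10),
    "682962037429899265": (10, 10),
    "666287406224695296": (9, 10),
}

# Made-up sample: two rows needing a fix plus one placeholder row that is already fine.
archive = pd.DataFrame({
    "tweet_id": ["740373189193256964", "722974582966214656", "123"],
    "rating_numerator": [9, 4, 12],
    "rating_denominator": [11, 20, 10],
})

for tweet_id, (numerator, denominator) in corrections.items():
    # Ids are compared as strings so large values are never rounded.
    match = archive["tweet_id"] == tweet_id
    archive.loc[match, "rating_numerator"] = numerator
    archive.loc[match, "rating_denominator"] = denominator
```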

With cleaning complete, the master dataset(s) was ready for analysis & was saved out to a file for possible future use.