# Wrangle Report
Jesse Fredrickson

11/24/18

## Gathering

I was tasked with gathering three datasets for this project:
- `tweet_archive`: Tweets from the WeRateDogs Twitter account, stored one tweet per row, including each tweet's ID and some basic fields parsed from the tweet's text. This data was given to me as a .csv file and was simply read into an IPython notebook using pandas.
- `image_predictions`: Neural net analysis of an image associated with each tweet, along with the tweet ID. The neural net attempted to guess the breed of the dog in the picture. This dataset was retrieved programmatically from a URL offered by Udacity.
- `tweets_json_df`: Detailed information on a set of tweets from the WeRateDogs account, including each tweet's ID along with how many times it was favorited and retweeted. This dataset was retrieved from the Twitter API using the tweepy library and read into an IPython notebook as JSON data.
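
As a rough illustration of the last gathering step, here is a minimal sketch of turning saved tweet JSON into a dataframe. It assumes the API results were written one JSON object per line, which this report does not state. The helper name `tweets_from_json_lines` and the sample records are hypothetical.

```python
import io
import json

import pandas as pd

FIELDS = ["id", "favorite_count", "retweet_count"]


def tweets_from_json_lines(handle):
    """Build a DataFrame of selected fields from line-delimited tweet JSON."""
    rows = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        tweet = json.loads(line)
        rows.append({field: tweet[field] for field in FIELDS})
    return pd.DataFrame(rows, columns=FIELDS)


# Hypothetical sample standing in for the file saved during gathering.
sample = io.StringIO(
    '{"id": 1, "favorite_count": 10, "retweet_count": 2}\n'
    '{"id": 2, "favorite_count": 5, "retweet_count": 1}\n'
)
tweets_json_df = tweets_from_json_lines(sample)
```

Keeping only the fields needed for analysis at load time keeps the later join small.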

## Assessing

Once each dataset was loaded into my IPython notebook as a pandas DataFrame, I assessed each one individually for quality and tidiness. I first looked at the rows and columns of each dataframe and determined where the dataframes overlapped, in order to decide which columns to keep and which to drop in each table. This also let me decide whether each dataframe was formatted properly, with each row representing a unique tweet and each column representing a property of that tweet. The pandas `.info()` method showed how many missing values appeared in each dataset.

To validate the accuracy and consistency of each field, I ran basic statistics on numerical fields to check that fields such as `rating_numerator` in the `tweet_archive` dataframe contained expected values. For string fields such as tweet text, I printed a range of entries and read through them individually to look for unexpected characters or other invalid values. I also did some basic string parsing of my own to determine whether the rating numerator and denominator fields in `tweet_archive` had been calculated correctly, and I checked for duplicate `tweet_id` values as well. To validate the `tweet_archive.name` field, I looked at value counts for each name. After these assessments, I compiled the list of issues below, grouped into quality and tidiness.
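
The programmatic checks described above might look something like the following sketch. The toy dataframe is hypothetical and only reuses column names mentioned in this report.

```python
import pandas as pd

# Hypothetical toy frame; the real tweet_archive has many more rows and columns.
tweet_archive = pd.DataFrame({
    "tweet_id": [101, 102, 103, 104],
    "rating_numerator": [12, 13, 420, 11],
    "rating_denominator": [10, 10, 10, 10],
    "name": ["Charlie", "a", "a", None],
})

# Column dtypes and non-null counts, which reveal missing values.
tweet_archive.info()

# Summary statistics expose implausible values such as a very large maximum.
numerator_stats = tweet_archive["rating_numerator"].describe()

# Number of rows whose tweet_id repeats an earlier row.
duplicate_ids = tweet_archive["tweet_id"].duplicated().sum()

# Frequent "names" such as 'a' point to a parsing problem.
name_counts = tweet_archive["name"].value_counts()
```
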
### Quality
#### `tweet_archive` table
- tweet id is an integer not a string
- timestamp is a string not a datetime object
- ratings numerator and denominator often incorrect
    - numerator needs to allow for decimal points
    - 420/10, 666/10, 0/10, 007/10, 84/70, 24/7, 204/170, 99/90, 45/50, 44/40, 4/20 should be dropped
        - drop ratings with a numerator over 15
    - 960/00 should be 13/10; 9/11 should be 14/10; 4/20 should be 13/10, 50/50 should be 11/10, 7/11 should be 10/10
- duplicate tweets should be removed by text
- name should not include 'a', 'an', 'the'. Recalculate also allowing for "meet x" format
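
The rating and name rules above can be sketched with the standard `re` module. These patterns are illustrative assumptions, not the exact expressions used in the project. In particular, the "This is" prefix is assumed, and the sketch only enforces the denominator of 10 and the numerator cap of 15.

```python
import re

# Numerator may contain a decimal point; denominator must be exactly 10.
RATING_RE = re.compile(r"(?<![\d.])(\d+(?:\.\d+)?)/10(?!\d)")
MAX_NUMERATOR = 15  # ratings above this are dropped

# A capitalized word after an introductory phrase; lowercase 'a', 'an', 'the' fail.
NAME_RE = re.compile(r"\b(?:This is|Meet) ([A-Z][a-z]+)\b")


def extract_rating(text):
    """Return the rounded numerator of the first x/10 rating, or None."""
    match = RATING_RE.search(text)
    if match is None:
        return None
    numerator = float(match.group(1))
    if numerator > MAX_NUMERATOR:
        return None
    return round(numerator)


def extract_name(text):
    """Return the dog's name if the text introduces one, else None."""
    match = NAME_RE.search(text)
    return match.group(1) if match else None
```

Because the pattern requires a denominator of exactly 10, strings such as 24/7 or 99/90 never match, and a tweet that contains both 9/11 and 14/10 yields 14.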


#### `image_predictions` table
- tweets where the neural net `p1_dog` is False should not be used for analysis (not pictures of dogs)
- tweet_id is an integer not a string

#### `tweets_json_df` table
- id is an integer not a string
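
The datatype issues listed for all three tables call for the same kind of conversion. Below is a minimal sketch on a hypothetical two-row frame; the timestamp format shown is an assumption.

```python
import pandas as pd

# Hypothetical rows; IDs are identifiers, so no arithmetic is ever done on them.
tweet_archive = pd.DataFrame({
    "tweet_id": [1001, 1002],
    "timestamp": ["2018-01-01 12:00:00", "2018-01-02 08:30:00"],
})

# Work on a copy so the raw data stays untouched.
tweet_archive_clean = tweet_archive.copy()
tweet_archive_clean["tweet_id"] = tweet_archive_clean["tweet_id"].astype(str)
tweet_archive_clean["timestamp"] = pd.to_datetime(tweet_archive_clean["timestamp"])
```

The join keys must have the same dtype in every table, so the `id` column of `tweets_json_df` and the `tweet_id` column of `image_predictions` would need the same `astype(str)` conversion.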

### Tidiness
#### `tweet_archive` table
- doggo, floofer, pupper, and puppo columns should be collapsed into one column called 'type'
- remove `in_reply_to_status_id`, `in_reply_to_user_id`, `retweeted_status_id`, `retweeted_status_user_id`, `retweeted_status_timestamp`
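
One way to collapse the four stage columns is sketched below. It assumes each stage column holds either its own stage name or the string 'None', which this report does not state. The helper `collapse_stages` is hypothetical.

```python
import pandas as pd

STAGE_COLUMNS = ["doggo", "floofer", "pupper", "puppo"]


def collapse_stages(row):
    """Join the stage names present in a row; return None when there are none."""
    found = [col for col in STAGE_COLUMNS if row[col] == col]
    return ", ".join(found) if found else None


# Hypothetical rows: one stage, no stage, and two stages.
tweet_archive = pd.DataFrame({
    "tweet_id": ["1", "2", "3"],
    "doggo": ["doggo", "None", "doggo"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "None", "pupper"],
    "puppo": ["None", "None", "None"],
})

tweet_archive_clean = tweet_archive.copy()
tweet_archive_clean["type"] = tweet_archive_clean[STAGE_COLUMNS].apply(
    collapse_stages, axis=1
)
# The four source columns are redundant once 'type' exists.
tweet_archive_clean = tweet_archive_clean.drop(columns=STAGE_COLUMNS)
```

The unused reply and retweet columns could be removed with the same `drop(columns=[...])` call.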


#### `image_predictions` table
- join to tweet_archive table on tweet id

#### `tweets_json_df` table
- join `favorite_count`, `retweet_count` to tweet_archive table on `id` col
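
The two joins above, together with the `p1_dog` filter from the quality list, can be sketched as a chain of inner merges. The rows here are hypothetical, and the IDs are assumed to have already been converted to strings in all three tables.

```python
import pandas as pd

tweet_archive = pd.DataFrame({
    "tweet_id": ["1", "2", "3"],
    "rating_numerator": [12, 13, 11],
})
image_predictions = pd.DataFrame({
    "tweet_id": ["1", "2"],
    "p1_dog": [True, False],
})
tweets_json_df = pd.DataFrame({
    "id": ["1", "2", "3"],
    "favorite_count": [10, 5, 7],
    "retweet_count": [2, 1, 3],
})

# Keep only predictions whose first guess is a dog.
dog_predictions = image_predictions[image_predictions["p1_dog"]]

# Inner joins keep only tweets that appear in all three tables.
master = (
    tweet_archive
    .merge(dog_predictions, on="tweet_id", how="inner")
    .merge(tweets_json_df, left_on="tweet_id", right_on="id", how="inner")
    .drop(columns="id")  # redundant with tweet_id after the join
)
```

In this toy data only tweet "1" survives: tweet "2" is not predicted to be a dog, and tweet "3" has no image prediction.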

## Cleaning

Before addressing each issue, I made a copy of each dataframe to preserve the original raw data in case I made mistakes during cleaning. I then followed a strict 'define-code-test' methodology to organize my efforts.

Datatype errors were easily addressed by converting between string, integer, and datetime types. I recalculated the `rating_numerator` field using regular expressions strict enough to eliminate unusual one-off and 'group' ratings: for example, Snoop Dogg had been awarded a score of 420/10, and a group of dogs was often awarded a rating with a combined numerator and denominator such as 99/90. My regex also allowed for numerators that included decimal points, which I rounded off in post-processing. Although I had found no duplicate tweet IDs, some tweets had duplicate or extremely similar text bodies, so I implemented a 'fuzzy matching' method to detect and remove them. This step revealed retweets that had been mistakenly included. Next, I recalculated the name column with a more restrictive regular expression that eliminated names that had been assigned as 'a' or 'an'. For the `image_predictions` dataframe, I removed all entries where the neural net's first guess was not a dog, as those likely represented bad data, and validating the results of the neural net would have been prohibitively time intensive.

To correct the tidiness issues, I first condensed the 'doggo', 'floofer', 'pupper' and 'puppo' columns of `tweet_archive` into a single column with a custom function. Next, I joined all three dataframes into a master dataframe with an inner join to eliminate incomplete rows that would interfere with analysis later. Finally, I exported the cleaned master dataframe as a .csv file.
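
This report does not name the fuzzy matching technique that was used. As one possible illustration, the sketch below uses `difflib.SequenceMatcher` from the standard library as a stand-in. The 0.8 threshold, the helper name `drop_near_duplicates`, and the sample texts are all assumptions.

```python
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.8  # assumed cutoff, not taken from the project


def drop_near_duplicates(texts, threshold=SIMILARITY_THRESHOLD):
    """Keep the first text of every group whose similarity meets the threshold."""
    kept = []
    for text in texts:
        is_duplicate = any(
            SequenceMatcher(None, text, earlier).ratio() >= threshold
            for earlier in kept
        )
        if not is_duplicate:
            kept.append(text)
    return kept


# Hypothetical tweet texts: an original, a retweet of it, and an unrelated tweet.
texts = [
    "This is Bella. She is a good girl. 12/10",
    "RT @dog_rates: This is Bella. She is a good girl. 12/10",
    "Meet Charlie. He likes snow. 13/10",
]
unique_texts = drop_near_duplicates(texts)
```

Each text is compared against every text kept so far, so the cost grows quadratically with the number of tweets. That is acceptable for a few thousand rows but would need a smarter approach for much larger datasets.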