# Data Wrangling Report

## Introduction

> This report contains a detailed description of the efforts made in wrangling the WeRateDogs Twitter archive data, Image Prediction data, and supplementary tweet data obtained via the Twitter API.

## Step 1: Data Gathering

### WeRateDogs Twitter Archive Data

<ol>
    <li>Directly download the WeRateDogs Twitter archive data (twitter_archive_enhanced.csv).</li>
    <li>Read the downloaded CSV file into a pandas dataframe, df_archive.</li>
</ol>
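
A minimal sketch of the read step, assuming nothing beyond `pandas.read_csv`. So that it runs without the downloaded file, an in-memory stand-in with made-up rows replaces twitter_archive_enhanced.csv; the column subset and values are illustrative only.

```python
import io

import pandas as pd

# Stand-in for twitter_archive_enhanced.csv (made-up rows, illustrative columns).
# With the real file, pass its path to read_csv instead of this buffer.
sample_csv = io.StringIO(
    "tweet_id,timestamp,rating_numerator,rating_denominator\n"
    "1001,2017-08-01 16:23:56 +0000,13,10\n"
    "1002,2017-08-01 00:17:27 +0000,12,10\n"
)

df_archive = pd.read_csv(sample_csv)

# Without an explicit dtype, pandas parses tweet_id as an integer column.
print(df_archive.dtypes)
```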

### Tweet Image Prediction Data

<ol>
    <li>Use the given URL to download the tweet image prediction data (image_predictions.tsv) via the Requests library.</li>
    <li>Read the downloaded TSV file into a pandas dataframe, df_image.</li>
</ol>
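
The two steps above can be sketched as follows. The helper name `load_image_predictions`, its parameters, and the timeout value are hypothetical; the actual URL is the one supplied with the project and is not reproduced here.

```python
import pandas as pd
import requests


def load_image_predictions(url: str, dest_path: str, timeout: float = 30.0) -> pd.DataFrame:
    """Download a TSV file to dest_path and read it into a dataframe.

    Hypothetical helper: `url` is whatever address the project provides.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on HTTP errors

    # Save the raw bytes so the file can be re-read without another request.
    with open(dest_path, "wb") as fh:
        fh.write(response.content)

    # The file is tab-separated, so the separator must be given explicitly.
    return pd.read_csv(dest_path, sep="\t")
```

A call would then look like `df_image = load_image_predictions(url, "image_predictions.tsv")`, with `url` set to the given address.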

### Additional Data From Twitter

<ol>
    <li>Get the tweet ids from the archive data.</li>
    <li>Extract and store the API keys that will be used to query the data.</li>
    <li>Authenticate and set access via the Tweepy library.</li>
    <li>Use the Tweepy library to query the data for each tweet id and write it to the tweet_json text file.</li>
    <li>Read the data in the text file into a pandas dataframe.</li>
</ol>
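
The write and read steps might look like the sketch below, which stores one JSON object per line. Both function names and the `fetch_tweet_json` callable are hypothetical. With Tweepy, that callable would typically wrap a status lookup on an authenticated API object and return the tweet's JSON as a dict; the exact call depends on the Tweepy version, so it is left out here.

```python
import json

import pandas as pd


def write_tweets(tweet_ids, fetch_tweet_json, path):
    """Write one JSON object per line; return the ids that could not be fetched.

    fetch_tweet_json is a hypothetical callable: tweet id -> dict of tweet data.
    """
    failed = {}
    with open(path, "w", encoding="utf-8") as fh:
        for tweet_id in tweet_ids:
            try:
                record = fetch_tweet_json(tweet_id)
            except Exception as exc:  # broad on purpose in this sketch (e.g. deleted tweets)
                failed[tweet_id] = repr(exc)
                continue
            fh.write(json.dumps(record) + "\n")
    return failed


def read_tweets(path):
    """Read the line-delimited JSON file back into a dataframe."""
    with open(path, encoding="utf-8") as fh:
        records = [json.loads(line) for line in fh if line.strip()]
    return pd.DataFrame(records)
```

Recording failed ids instead of stopping keeps a long download going when individual tweets are no longer available.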

## Step 2: Data Assessment

### Method Applied To All Dataframes

<ol>
    <li>Use the dataframe shape attribute to get information on the number of observations and features.</li>
    <li>Draw a random sample of 20 observations and visually assess.</li>
    <li>Use the dataframe info method to get information on null values and datatype of each feature.</li>
    <li>Combine the duplicated and sum methods to get information on the number of duplicated observations.</li>
    <li>Use the describe method to obtain statistical information on the numerical features.</li>
    <li>Use the value_counts and unique methods to get further information on the categorical features.</li>
</ol>
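
The checks above can be bundled into one helper, sketched here under the assumption that the same routine is run on each dataframe. The function name `assess` and its `categorical_cols` parameter are hypothetical, and the demo frame is made up.

```python
import pandas as pd


def assess(df: pd.DataFrame, categorical_cols=()) -> int:
    """Print the programmatic checks and return the number of duplicated rows."""
    print(df.shape)  # (observations, features)
    # Random sample for visual assessment; capped for small frames.
    print(df.sample(n=min(20, len(df)), random_state=0))
    df.info()  # null counts and datatype per column
    n_duplicates = int(df.duplicated().sum())
    print(f"duplicated rows: {n_duplicates}")
    print(df.describe())  # summary statistics
    for col in categorical_cols:
        print(df[col].value_counts(dropna=False))
        print(df[col].unique())
    return n_duplicates


# Made-up demo frame with one exact duplicate row.
demo = pd.DataFrame(
    {"tweet_id": [1, 2, 2], "rating_numerator": [13, 12, 12], "name": ["Rex", "Bo", "Bo"]}
)
n_duplicates = assess(demo, categorical_cols=["name"])
```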

### Issues Identified

#### Quality issues
1. There are non-null values in the retweeted_status_id column, indicating retweets. These are not original ratings and go against the project instructions.

2. The data type of the tweet_id column of the enhanced twitter archive dataset is integer and not string.

3. The timestamp column of the enhanced twitter archive dataset is not a datetime object.

4. There are missing values in the expanded_urls column of the enhanced twitter archive dataset that have not been handled.

5. The data type of the tweet_id column of the image predictions dataset is integer and not string.

6. Names in the p1, p2 and p3 columns of the image prediction dataset are represented inconsistently.

7. The id column of the api data is not in string format.

8. A relatively small number of unhandled missing values in the extended_entities, possibly_sensitive and possibly_sensitive_appealable columns of the api data.

9. An overwhelming number of missing values in the in_reply_to_status_id, in_reply_to_status_id_str, in_reply_to_user_id, in_reply_to_user_id_str, in_reply_to_screen_name, geo, coordinates, place, contributors, retweeted_status, quoted_status_id, quoted_status_id_str, quoted_status_permalink and quoted_status columns of the api data.

10. The rating_denominator column of the archive dataset has an entry with a value of 0.

11. The display_text_range column of the api data contains a list of two values, the initial and final character positions of the tweet text, rather than a single value.

#### Tidiness issues
1. The id_str column is a duplicate of the id column in the api data.

2. The doggo, floofer, pupper and puppo columns in twitter_archive_enhanced.csv should be combined into a single column, as together they represent one variable that identifies the stage of the dog.

3. Information about one type of observational unit (tweets) is spread across three different files/dataframes. These three dataframes should be merged, as they describe the same observational unit.

## Step 3: Data Cleaning

> The following strategies were employed to handle the quality and tidiness issues above. Before cleaning, copies of the assessed dataframes were made, and all cleaning was applied to these copies.

### Quality

> Respective solutions to the quality issues in order.

<ol>
    <li>Drop the rows in which the value of the retweeted_status_id column is not null.</li>
    <li>Change the datatype of the tweet_id column in the archive dataset from integer to string.</li>
    <li>Change the datatype of the timestamp column of the enhanced twitter archive dataset to datetime.</li>
    <li>So as not to lose the entire row, fill the missing values of the expanded_urls column of the enhanced twitter archive dataset with the string 'Missing'.</li>
    <li>Change the datatype of the tweet_id column in the image predictions dataset from integer to string.</li>
    <li>Split the values in the p1, p2 and p3 columns of the image prediction dataset by underscore, then capitalize each part and join them back together for a consistent representation.</li>
    <li>Change the datatype of the id column of the twitter api data from integer to string, and rename to tweet_id for consistency with the other two dataframes.</li>
    <li>Fill the missing values in the extended_entities column with the string 'Missing', and the missing values in the possibly_sensitive and possibly_sensitive_appealable columns with 0.0.</li>
    <li>Drop all columns with 2000 or more missing values (the columns listed in issue 9).</li>
    <li>Drop the row with a rating denominator of 0.</li>
    <li>Create a new column display_text_length from the display_text_range column by taking the second element of each list, then drop the display_text_range column.</li>
</ol>
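
Most of the quality fixes above can be sketched with a few pandas calls. This is an illustration on made-up rows, not the code that was actually run: the function names are hypothetical, joining the name parts with a space is an assumption, the missing-value threshold is read as "2000 or more", and step 8 is omitted for brevity.

```python
import pandas as pd


def clean_archive(df_archive: pd.DataFrame) -> pd.DataFrame:
    """Steps 1-4 and 10 on a copy of the archive data."""
    keep = df_archive["retweeted_status_id"].isna() & (df_archive["rating_denominator"] != 0)
    df = df_archive[keep].copy()  # drop retweets and zero denominators
    df["tweet_id"] = df["tweet_id"].astype(str)
    df["timestamp"] = pd.to_datetime(df["timestamp"])
    df["expanded_urls"] = df["expanded_urls"].fillna("Missing")
    return df.reset_index(drop=True)


def tidy_name(name: str) -> str:
    """'golden_retriever' -> 'Golden Retriever' (space separator is an assumption)."""
    return " ".join(part.capitalize() for part in name.split("_"))


def clean_image(df_image: pd.DataFrame) -> pd.DataFrame:
    """Steps 5-6 on a copy of the image prediction data."""
    df = df_image.copy()
    df["tweet_id"] = df["tweet_id"].astype(str)
    for col in ["p1", "p2", "p3"]:
        df[col] = df[col].map(tidy_name)
    return df


def clean_tweets(df_tweets: pd.DataFrame, max_missing: int = 2000) -> pd.DataFrame:
    """Steps 7, 9 and 11 on a copy of the api data."""
    df = df_tweets.copy()
    df["id"] = df["id"].astype(str)
    df = df.rename(columns={"id": "tweet_id"})
    # Assumed reading of step 9: drop columns with max_missing or more nulls.
    sparse_cols = df.columns[df.isna().sum() >= max_missing]
    df = df.drop(columns=sparse_cols)
    # The second element of each [start, end] list is taken as the text length.
    df["display_text_length"] = df["display_text_range"].map(lambda r: r[1])
    return df.drop(columns=["display_text_range"])


# Made-up rows for demonstration only.
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "timestamp": ["2017-08-01 16:23:56 +0000"] * 4,
    "retweeted_status_id": [None, 8.0, None, None],
    "expanded_urls": ["https://example.com/1", None, None, None],
    "rating_denominator": [10, 10, 0, 10],
})
images = pd.DataFrame({
    "tweet_id": [1], "p1": ["golden_retriever"],
    "p2": ["Labrador_retriever"], "p3": ["cocker_spaniel"],
})
tweets = pd.DataFrame({
    "id": [1, 4], "display_text_range": [[0, 85], [0, 120]], "geo": [None, None],
})

df_archive_clean = clean_archive(archive)
df_image_clean = clean_image(images)
df_tweets_clean = clean_tweets(tweets, max_missing=2)  # small threshold for the demo
```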

### Tidiness

> Respective solutions to the tidiness issues in order.

<ol>
    <li>Drop the id_str column of the api dataset.</li>
    <li>Create a single column containing the stage of the dog, also accounting for multiple stages, and drop the rows without a dog stage to allow suitable analysis.</li>
    <li>Merge the df_archive_clean, df_image_clean and df_tweets_clean dataframes into one master dataset.</li>
</ol>
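
The last two tidiness steps are sketched below on made-up rows. The sketch assumes each stage column holds either its own stage name or a placeholder, that multiple stages are joined with a comma, and that the merge is an inner join on tweet_id. The `dog_stage` column name, both function names, and the `favorite_count` demo column are illustrative.

```python
import pandas as pd

STAGES = ["doggo", "floofer", "pupper", "puppo"]


def combine_stages(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse the four stage columns into one and drop rows with no stage."""
    out = df.copy()
    # A stage counts only when the cell equals the column's own name,
    # so placeholders and nulls are both ignored.
    out["dog_stage"] = out[STAGES].apply(
        lambda row: ", ".join(stage for stage in STAGES if row[stage] == stage),
        axis=1,
    )
    out = out[out["dog_stage"] != ""].drop(columns=STAGES)
    return out.reset_index(drop=True)


def merge_all(archive: pd.DataFrame, image: pd.DataFrame, tweets: pd.DataFrame) -> pd.DataFrame:
    """Inner-join the three cleaned frames on tweet_id (join type is an assumption)."""
    return archive.merge(image, on="tweet_id", how="inner").merge(
        tweets, on="tweet_id", how="inner"
    )


# Made-up rows for demonstration only.
archive_demo = pd.DataFrame({
    "tweet_id": ["1", "2", "3"],
    "doggo": ["doggo", "None", "doggo"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "None", "pupper"],
    "puppo": ["None", "None", "None"],
})
image_demo = pd.DataFrame({"tweet_id": ["1", "3"], "p1": ["Pug", "Chow"]})
tweets_demo = pd.DataFrame({"tweet_id": ["1", "2", "3"], "favorite_count": [10, 20, 30]})

staged = combine_stages(archive_demo)
master = merge_all(staged, image_demo, tweets_demo)
```

The merged frame can then be written out with `DataFrame.to_csv` for the analysis stage.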

At the end of our data wrangling efforts, we store the cleaned datasets in master files, to be used for analysis.