# Wrangling Report

## Imported Libraries

The following libraries were imported:
1. pandas
2. numpy
3. matplotlib.pyplot
4. seaborn
5. tweepy

## Gathering

I have read from the following files:

|File Name|DataFrame|
|---|---|
|twitter-archive-enhanced.csv|twitter_archive_df|
|tweet_json.txt|tweet_json_df|
|[image-predictions.tsv](https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv)|image_predictions_df|
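
The three reads above can be sketched roughly as follows. This is a minimal illustration, not the notebook's exact code: the helper name `load_frames` is hypothetical, the in-memory samples stand in for the real files, and it assumes `tweet_json.txt` stores one JSON object per line.

```python
import io

import pandas as pd


def load_frames(archive_src, json_src, predictions_src):
    """Load the three sources; each argument may be a path or a file-like object."""
    twitter_archive_df = pd.read_csv(archive_src)
    # Assumption: tweet_json.txt holds one JSON object per line.
    tweet_json_df = pd.read_json(json_src, lines=True)
    # image-predictions.tsv is tab-separated.
    image_predictions_df = pd.read_csv(predictions_src, sep="\t")
    return twitter_archive_df, tweet_json_df, image_predictions_df


# Tiny in-memory stand-ins for the real files (made-up values).
archive_sample = io.StringIO("tweet_id,name\n1,Rex\n2,None\n")
json_sample = io.StringIO(
    '{"id": 1, "retweet_count": 5, "favorite_count": 9}\n'
    '{"id": 2, "retweet_count": 1, "favorite_count": 3}\n'
)
predictions_sample = io.StringIO(
    "tweet_id\tjpg_url\tp1\n1\thttps://example.com/a.jpg\tgolden_retriever\n"
)

twitter_archive_df, tweet_json_df, image_predictions_df = load_frames(
    archive_sample, json_sample, predictions_sample
)
```

With the real data, the three arguments would simply be the file names from the table above.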

## Assessing

_______________________________________________

### General

1. `twitter_archive_df` has 2356 rows, while `tweet_json_df` has 2354 (2 fewer rows) and `image_predictions_df` has 2075 (281 fewer rows)
2. All three DataFrames are linked through `twitter_archive_df.tweet_id`, `tweet_json_df.id`, and `image_predictions_df.tweet_id`
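
A quick way to see which tweet IDs account for the row differences is a set comparison on the linking columns. The sketch below uses made-up IDs, and the helper `id_coverage` is a hypothetical name introduced only for illustration.

```python
import pandas as pd


def id_coverage(archive_ids, json_ids, prediction_ids):
    """Return archive IDs that are absent from the other two frames."""
    archive = set(map(str, archive_ids))
    return {
        "missing_from_json": sorted(archive - set(map(str, json_ids))),
        "missing_from_predictions": sorted(archive - set(map(str, prediction_ids))),
    }


# Made-up sample frames using the same linking columns as the report.
twitter_archive_df = pd.DataFrame({"tweet_id": [1, 2, 3, 4]})
tweet_json_df = pd.DataFrame({"id": [1, 2, 3]})
image_predictions_df = pd.DataFrame({"tweet_id": [1, 3]})

coverage = id_coverage(
    twitter_archive_df["tweet_id"],
    tweet_json_df["id"],
    image_predictions_df["tweet_id"],
)
```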

_______________________________________________

### twitter_archive_df

#### Quality issues

##### Visual Assessment:
1. `name` column has non-standard lowercase values to be investigated programmatically 
2. < 1% of `rating_denominator` != 10.
3. < 2% of `rating_numerator` is > 14.
4. `source` column has the HTML tags, URL, and content all together.

##### Programmatic Assessment:
5.  `tweet_id` dtype is int (not str)
6. `timestamp` is in object format
7. `name` column contains 745 "None" values
8. `name` column contains 25 unique invalid lowercase names (109 values in total), found with a regex search using `str.contains`.
9. `in_reply_to_status_id` and `in_reply_to_user_id` have only 78 of 2356 non-null values (96.7% missing)
10. `retweeted_status_id`, `retweeted_status_user_id`, and `retweeted_status_timestamp` have only 181 of 2356 non-null values (92% missing)
11. Only 380 rows classify the dog into one of the four categories:
    * `doggo`
    * `floofer`
    * `pupper`
    * `puppo`
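
The name and missing-data checks above can be reproduced along these lines. The regex is an assumption (any name starting with a lowercase letter is treated as invalid), and the sample values are made up.

```python
import pandas as pd

# Made-up sample with the same column names as twitter_archive_df.
twitter_archive_df = pd.DataFrame({
    "name": ["Charlie", "a", "None", "the", "Lucy", "a"],
    "in_reply_to_status_id": [None, None, 1.0, None, None, None],
})

# Assumption: a name that starts with a lowercase letter is not a real name.
lowercase_mask = twitter_archive_df["name"].str.contains(r"^[a-z]", regex=True)
invalid_names = twitter_archive_df.loc[lowercase_mask, "name"]

n_unique_invalid = invalid_names.nunique()  # distinct invalid names
n_invalid = len(invalid_names)              # total invalid values
n_none = (twitter_archive_df["name"] == "None").sum()

# Share of missing values in a sparsely populated column.
missing_share = twitter_archive_df["in_reply_to_status_id"].isna().mean()
```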
    
#### Tidiness issues

##### Programmatic Assessment:
1. The following columns are a subclass of the `name` column:
    * `doggo`
    * `floofer`
    * `pupper`
    * `puppo`
2. The following columns concern replies (out of scope):
    * `in_reply_to_status_id`
    * `in_reply_to_user_id`
3. The following columns concern retweets (out of scope):
    * `retweeted_status_id`
    * `retweeted_status_user_id`
    * `retweeted_status_timestamp`
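
One possible way to address these tidiness issues is to collapse the four stage columns into a single column and drop the out-of-scope ones. The sketch below assumes the stage columns hold either the stage name or the string "None". The `dog_stage` column and the `collapse_stages` helper are hypothetical names, not necessarily those used in the notebook.

```python
import pandas as pd

STAGE_COLUMNS = ["doggo", "floofer", "pupper", "puppo"]
OUT_OF_SCOPE = [
    "in_reply_to_status_id", "in_reply_to_user_id",
    "retweeted_status_id", "retweeted_status_user_id",
    "retweeted_status_timestamp",
]


def collapse_stages(df):
    """Merge the four stage columns into one and drop out-of-scope columns."""
    stages = df[STAGE_COLUMNS].replace("None", "")
    # Join every non-empty stage per row; rows with several stages keep all of them.
    joined = stages.apply(lambda row: ",".join(v for v in row if v), axis=1)
    out = df.drop(columns=STAGE_COLUMNS + OUT_OF_SCOPE, errors="ignore")
    # Rows without any stage become missing values instead of empty strings.
    out["dog_stage"] = joined.where(joined != "")
    return out


# Made-up sample rows.
sample = pd.DataFrame({
    "tweet_id": ["1", "2", "3"],
    "in_reply_to_status_id": [None, None, None],
    "doggo": ["doggo", "None", "doggo"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "None", "pupper"],
    "puppo": ["None", "None", "None"],
})
tidy = collapse_stages(sample)
```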

_______________________________________________

### tweet_json_df

#### Quality issues

##### Visual Assessment:
1. `id` dtype is int (not str)
2. Duplicated columns `id` and `id_str`
3. Note: `in_reply_to_status_id` and `in_reply_to_status_id_str` columns have the same values, i.e. they are duplicates
4. Note: `in_reply_to_user_id` and `in_reply_to_user_id_str` columns have the same values, i.e. they are duplicates
5. `favorited` column has two values: True and False
6. `retweeted` column has one value: False
7. `is_quote_status`, when True, marks quote tweets (irrelevant).
8. `lang` column has 7 unique values

##### Programmatic Assessment:
9. `extended_entities` has 281 null values, which matches the 281 rows missing from `image_predictions_df`.
10. The following columns contain only the value False:
    * `retweeted`
    * `truncated`
11. The following columns are not of interest as they concern quoted tweets:
    * `quoted_status_id`
    * `quoted_status_id_str`
    * `quoted_status`
12. The following columns are almost entirely empty:
    * `geo` (empty)
    * `coordinates`
    * `place`
    * `contributors`
13. The following column has 9 unique values, yet its dtype is object (not categorical):
    * `lang`
14. `retweeted_status`, when not null, marks retweets (irrelevant)
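
Checks like the ones above (single-valued columns, almost empty columns, and the `lang` dtype) can be automated. The following is only a sketch on made-up data: `summarize_columns` and its 90% threshold are hypothetical choices.

```python
import pandas as pd


def summarize_columns(df, null_threshold=0.9):
    """Return (constant_columns, mostly_null_columns) for a DataFrame."""
    constant, mostly_null = [], []
    for column in df.columns:
        series = df[column]
        if series.isna().mean() >= null_threshold:
            mostly_null.append(column)
        # astype(str) keeps the check usable for unhashable values such as dicts.
        elif series.astype(str).nunique() == 1:
            constant.append(column)
    return constant, mostly_null


# Made-up sample with a few of the tweet_json_df column names.
tweet_json_df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "retweeted": [False, False, False, False],
    "truncated": [False, False, False, False],
    "geo": [None, None, None, None],
    "lang": ["en", "en", "nl", "en"],
})

constant_columns, mostly_null_columns = summarize_columns(tweet_json_df)

# A low-cardinality text column can be stored as a categorical.
lang_as_category = tweet_json_df["lang"].astype("category")
```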

#### Tidiness issues

##### Visual Assessment:
1. `source` column contains more than one piece of data.

##### Programmatic Assessment:
2. The following columns concern replies, retweets, likes, sensitivities, quotes, location, or languages, which are out of scope:
    * `in_reply_to_status_id`
    * `in_reply_to_status_id_str`
    * `in_reply_to_user_id`
    * `in_reply_to_user_id_str`
    * `in_reply_to_screen_name`
    * `retweet_count`
    * `favorited`
    * `possibly_sensitive`
    * `possibly_sensitive_appealable`
    * `retweeted_status`
3. 179 rows have a non-null `retweeted_status` and `favorite_count` == 0 despite a considerable number of retweets.

_______________________________________________

### image_predictions_df

#### Quality issues
##### Programmatic Assessment:
1. `jpg_url` has duplicated values (links to images), which produce duplicate entries, e.g. retweets beginning with `RT @dog_rates:`
2. `tweet_id` dtype is int (not str)
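
Duplicated image links can be surfaced with `duplicated`. In this sketch on made-up rows, keeping the first occurrence is an assumption made for illustration; which entry to keep depends on how retweets are handled.

```python
import pandas as pd

# Made-up rows; the first and third share the same image link.
image_predictions_df = pd.DataFrame({
    "tweet_id": ["10", "11", "12"],
    "jpg_url": [
        "https://example.com/a.jpg",
        "https://example.com/b.jpg",
        "https://example.com/a.jpg",
    ],
})

# keep=False flags every row that shares its jpg_url with another row.
duplicate_mask = image_predictions_df["jpg_url"].duplicated(keep=False)
duplicates = image_predictions_df[duplicate_mask].sort_values("jpg_url")

# Assumption: the first occurrence is the original tweet.
deduplicated = image_predictions_df.drop_duplicates(subset="jpg_url", keep="first")
```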

#### Tidiness issues
##### Visual Assessment:
1. `p1`, `p2`, and `p3` do not use a standard format, e.g. some values are lowercase, others title case, and words are separated by `_`, `-`, or a space

##### Programmatic Assessment:
2. Zero missing data, but 281 fewer rows than `twitter_archive_df`
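
The inconsistent `p1`, `p2`, and `p3` labels could be brought to one convention with a small helper. The target format here (space-separated, title case) is an assumption, and `normalize_label` is a hypothetical name.

```python
import re

import pandas as pd


def normalize_label(label):
    """Replace underscores/hyphens with spaces, collapse whitespace, title-case."""
    cleaned = re.sub(r"[_\-\s]+", " ", str(label)).strip()
    return cleaned.title()


# Made-up predictions showing the three separator styles.
image_predictions_df = pd.DataFrame({
    "p1": ["golden_retriever", "German_short-haired_pointer"],
    "p2": ["Labrador_retriever", "pug"],
    "p3": ["toy poodle", "black-and-tan_coonhound"],
})

prediction_columns = ["p1", "p2", "p3"]
image_predictions_df[prediction_columns] = image_predictions_df[
    prediction_columns
].apply(lambda column: column.map(normalize_label))
```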

## Cleaning

All of the issues above were addressed in the cleaning section using three steps:
* Define
* Code
* Test

After that, we merged the three DataFrames into a single DataFrame named `master_twitter_archive_df`, which underwent a second stage of assessing and cleaning before storing, analyzing, and visualizing.
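
The merge can be sketched as two inner joins on the tweet ID. This assumes the IDs are cast to strings first and that only the retweet and like counts are kept from `tweet_json_df`. The `build_master` helper is a hypothetical name and the sample rows are made up.

```python
import pandas as pd


def build_master(twitter_archive_df, tweet_json_df, image_predictions_df):
    """Inner-join the three frames on tweet_id (stored as a string)."""
    archive = twitter_archive_df.astype({"tweet_id": str})
    counts = (
        tweet_json_df.rename(columns={"id": "tweet_id"})
        .astype({"tweet_id": str})[["tweet_id", "retweet_count", "favorite_count"]]
    )
    predictions = image_predictions_df.astype({"tweet_id": str})
    return archive.merge(counts, on="tweet_id", how="inner").merge(
        predictions, on="tweet_id", how="inner"
    )


# Made-up sample frames.
twitter_archive_df = pd.DataFrame({"tweet_id": [1, 2, 3], "rating_numerator": [12, 13, 11]})
tweet_json_df = pd.DataFrame({"id": [1, 2], "retweet_count": [5, 6], "favorite_count": [50, 60]})
image_predictions_df = pd.DataFrame({"tweet_id": [1, 3], "p1": ["pug", "chow"]})

master_twitter_archive_df = build_master(
    twitter_archive_df, tweet_json_df, image_predictions_df
)
```

An inner join keeps only tweets present in all three sources, so tweets without an image prediction drop out of the master frame.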

_______________________________________________

# Storing, Analyzing, and Visualizing Data Report

## Storing

The master DF data was exported into a single csv file named:
* twitter_archive_master.csv

|File Name|DataFrame|
|---|---|
|twitter_archive_master.csv|df|
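
A minimal sketch of the export and re-import follows. It writes to an in-memory buffer so it runs anywhere; with the real data the buffer would be replaced by the file name above. Reading `tweet_id` back as a string is an assumption that matches the earlier dtype fix.

```python
import io

import pandas as pd

# Made-up master frame with string tweet IDs.
master_twitter_archive_df = pd.DataFrame({
    "tweet_id": ["892420643555336193", "892177421306343426"],
    "rating_numerator": [13, 12],
})

buffer = io.StringIO()  # stand-in for "twitter_archive_master.csv"
master_twitter_archive_df.to_csv(buffer, index=False)

buffer.seek(0)
# Without dtype=..., pandas would parse tweet_id back into an integer column.
df = pd.read_csv(buffer, dtype={"tweet_id": str})
```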

## Analyzing

* Statistical analysis through describe
* count by unique values
* mean success rates per image prediction algorithm 
* groupby and aggregate: tweet count, RT/likes sum, and rating mean
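
The groupby step might look roughly like this. The column names (`timestamp`, `retweet_count`, `favorite_count`, `rating_numerator`) are assumed to exist in the master file, and the values are made up.

```python
import pandas as pd

# Made-up sample of the master data.
df = pd.DataFrame({
    "tweet_id": ["1", "2", "3", "4"],
    "timestamp": pd.to_datetime(["2017-01-02", "2017-01-02", "2017-01-03", "2017-01-09"]),
    "retweet_count": [10, 20, 30, 40],
    "favorite_count": [100, 200, 300, 400],
    "rating_numerator": [12, 10, 13, 11],
})

df["weekday"] = df["timestamp"].dt.day_name()

# Named aggregation: one output column per (source column, function) pair.
by_weekday = df.groupby("weekday").agg(
    tweets=("tweet_id", "count"),
    retweets=("retweet_count", "sum"),
    likes=("favorite_count", "sum"),
    mean_rating=("rating_numerator", "mean"),
)

# Count by unique values, as in the second bullet.
rating_counts = df["rating_numerator"].value_counts()
```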

## Visualizing

We have visualized the following characteristics:
* value counts of the unique dog breed predictions in `p1`, `p2`, and `p3`
* total number of dog breeds with value counts of 30+
* total quantity of the above counts
* correlation between retweets and likes in a scatter plot
* correlation between retweets and likes in a correlation matrix
* trends of tweet counts, RT/likes sum, and rating mean over the course of weekdays and months
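
As an illustration of the retweets versus likes plots, the sketch below computes the correlation matrix with pandas and draws the scatter plot with plain matplotlib. The data is made up, the `Agg` backend is chosen only so the example runs without a display, and the original notebook may have used seaborn instead.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display

import matplotlib.pyplot as plt
import pandas as pd

# Made-up retweet and like counts.
df = pd.DataFrame({
    "retweet_count": [10, 20, 30, 40, 50],
    "favorite_count": [35, 55, 95, 115, 160],
})

# Pearson correlation matrix of the two columns.
corr = df[["retweet_count", "favorite_count"]].corr()
r = corr.loc["retweet_count", "favorite_count"]

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(df["retweet_count"], df["favorite_count"], alpha=0.6)
ax.set_xlabel("Retweets")
ax.set_ylabel("Likes")
ax.set_title(f"Retweets vs. likes (r = {r:.2f})")
fig.tight_layout()
```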