## Data Wrangling Report

## Table of Contents
- [Gathering](#Gathering)
- [Assessing](#Assessing)
- [Cleaning](#Cleaning)

<a id='Gathering'></a>
## Gathering
First, we gathered the Twitter data from three sources. We already had `twitter_archive_enhanced.csv` on hand, so we imported it directly with pandas into a dataframe named `twt_df`. We also wanted image predictions for the images in the tweets, and that data is available at this [site](https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv). We used the requests library to download the file, saved it as `image_predictions.tcv`, and imported it with pandas into a dataframe named `img_prds`. The last data we needed was the favorite and retweet counts of each tweet, which had been extracted through the Twitter API in JSON format. We opened `tweet-json.txt`, which holds the extracted data, and went through it line by line to build the `twt_data` dataframe.
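The gathering steps can be sketched roughly as follows. This is a minimal illustration, not the exact code we ran: the helper names `download_file` and `read_tweet_json`, the JSON keys (`id`, `retweet_count`, `favorite_count`), and the sample values are assumptions made for the example.

```python
import io
import json

import pandas as pd


def download_file(url, path):
    """Download `url` with requests and write the raw bytes to `path`."""
    import requests  # imported here so the parsing part runs without it

    response = requests.get(url, timeout=30)
    response.raise_for_status()  # stop on HTTP errors instead of saving an error page
    with open(path, "wb") as fh:
        fh.write(response.content)


def read_tweet_json(lines):
    """Build a dataframe from an iterable of JSON lines, one tweet per line."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        # The key names below are assumed, not taken from the report.
        rows.append({
            "tweet_id": tweet["id"],
            "retweet_count": tweet["retweet_count"],
            "favorite_count": tweet["favorite_count"],
        })
    return pd.DataFrame(rows, columns=["tweet_id", "retweet_count", "favorite_count"])


# Small in-memory stand-in for tweet-json.txt (made-up values).
sample = io.StringIO(
    '{"id": 1, "retweet_count": 5, "favorite_count": 12}\n'
    '{"id": 2, "retweet_count": 0, "favorite_count": 3}\n'
)
twt_data = read_tweet_json(sample)
```

With the real files, `read_tweet_json` would be given the open file object for `tweet-json.txt`, and the downloaded predictions file could be loaded with `pd.read_csv(path, sep="\t")`, since it is tab-separated.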

<a id='Assessing'></a>
## Assessing
Second, we did a visual and a programmatic assessment to find quality and tidiness issues. We started with the visual assessment and opened each dataframe individually. For quality, we noticed that in `twt_df` many rows had None instead of NaN, the source of each tweet was in a URL format, and some tweets were not dog ratings. For the programmatic assessment we mostly used pandas methods to find issues that can't be found visually. We found the following `twt_df` issues:
- Tweet id had the wrong datatype (integer).
- Time of the tweet and retweets had the wrong datatype (string).
- All the status and user ids had the wrong datatype (float).
- Some entries had a rating denominator other than 10.
- Some dog names were not real names, such as `a`, `the`, and `an`, among many others.
- Some retweets were not from @dog_rates.
- Some dogs had more than one dog type.
- Some entries had wrong ratings, for example, 75 instead of 9.75.
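A few of these checks can be sketched with pandas as below. The rows are made up, and the column names (`rating_denominator`, `name`, `doggo`, `pupper`) are assumptions about how the archive is laid out. The lowercase test for names is one possible heuristic, not necessarily the one we used.

```python
import pandas as pd

# Made-up rows; column names are assumed to match the archive.
twt_df = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "rating_numerator": [12, 75, 84, 10],
    "rating_denominator": [10, 10, 70, 10],
    "name": ["Charlie", "a", "None", "the"],
    "doggo": ["None", "doggo", "None", "None"],
    "pupper": ["None", "pupper", "None", "pupper"],
})

# Datatypes: tweet_id shows up as an integer here.
print(twt_df.dtypes)

# Entries with a rating denominator other than 10.
odd_denominators = twt_df[twt_df["rating_denominator"] != 10]

# Heuristic: an all-lowercase name is likely a stray word such as "a" or "the".
bad_names = twt_df[twt_df["name"].str.islower()]

# Rows where more than one dog type column is filled in.
type_cols = ["doggo", "pupper"]
n_types = (twt_df[type_cols] != "None").sum(axis=1)
multi_type = twt_df[n_types > 1]
```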

`img_prds` issues:
- Tweet id had the wrong datatype (integer).
- Some tweets didn't have a dog image.

`twt_data` issue:
- All numeric values had a string datatype.

For tidiness, we wanted to merge `twt_df` and `twt_data` into one dataframe and delete any rows in `twt_df` and `twt_data` that didn't have an image prediction in `img_prds`. There were also 4 columns for dog type instead of one, and we wanted to make sure that all the dataframes had the same tweets.


<a id='Cleaning'></a>

## Cleaning

Lastly, for cleaning, we fixed all the issues found in the assessing phase. The fixes were as follows:
#### Quality : 
##### `twt_df` :
- Changed every None to NaN.
- Changed the source from a URL to the source name (e.g. iPhone).
- Deleted tweets that were not dog ratings.
- Tweet id datatype changed from `Integer` to `String`.
- All time column datatypes changed from `String` to `datetime`.
- All status and user id columns changed from `float` to `String`.
- Deleted any tweet with a rating denominator other than 10.
- Deleted any tweet that had `a`, `the`, `an`, or a similar non-name in the name column.
- Deleted any retweets.
- Deleted any row that had more than one dog type.
- Changed (75, 26, 27) in the rating numerator to (10, 11, 11).
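Several of the `twt_df` fixes above can be sketched as follows. The sample rows are made up, and the column names (`timestamp`, `retweeted_status_id`, `rating_numerator`, and so on) as well as the HTML shape of `source` are assumptions. The sketch keeps the full link text (for example "Twitter for iPhone") rather than a shortened name.

```python
import numpy as np
import pandas as pd

# Made-up rows; column names are assumed to match the archive.
twt_df = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "timestamp": [
        "2017-08-01 16:23:56 +0000",
        "2017-07-31 00:18:03 +0000",
        "2017-07-30 15:58:51 +0000",
    ],
    "source": [
        '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
        '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
        '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    ],
    "retweeted_status_id": [np.nan, 8.0e17, np.nan],
    "rating_numerator": [75, 12, 13],
    "rating_denominator": [10, 10, 70],
    "name": ["Charlie", "None", "a"],
})

clean = twt_df.copy()

# Replace the "None" placeholder string with a real missing value.
clean["name"] = clean["name"].replace("None", np.nan)

# Keep only the link text of the source, dropping the surrounding tag.
clean["source"] = clean["source"].str.extract(r">([^<]+)</a>", expand=False)

# Fix datatypes: ids as strings, timestamps as datetimes.
clean["tweet_id"] = clean["tweet_id"].astype(str)
clean["timestamp"] = pd.to_datetime(clean["timestamp"])

# Map the mis-parsed numerators to the corrected values from the list above.
clean["rating_numerator"] = clean["rating_numerator"].replace({75: 10, 26: 11, 27: 11})

# Drop retweets, denominators other than 10, and all-lowercase names.
is_retweet = clean["retweeted_status_id"].notna()
bad_denominator = clean["rating_denominator"] != 10
bad_name = clean["name"].map(lambda n: isinstance(n, str) and n.islower())
clean = clean[~(is_retweet | bad_denominator | bad_name)]
```

Working on a copy keeps the original dataframe available for comparison while the fixes are tested.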

##### `img_prds` :
- Tweet id datatype changed from `Integer` to `String`
- Deleted any tweet whose image was not predicted to be a dog.

##### `twt_data` :
- Changed all numeric value columns datatypes from `String` to `Integer`.
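The `img_prds` and `twt_data` fixes could look roughly like this. The flag columns `p1_dog`, `p2_dog`, and `p3_dog` are assumed names, the values are made up, and the rule used here (keep a tweet when at least one prediction is a dog) is only one reading of the fix described above.

```python
import pandas as pd

# Made-up rows; the p*_dog flag columns are assumed names.
img_prds = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "p1_dog": [True, False, False],
    "p2_dog": [True, False, True],
    "p3_dog": [False, False, True],
})
twt_data = pd.DataFrame({
    "tweet_id": ["1", "2", "3"],
    "retweet_count": ["5", "0", "17"],
    "favorite_count": ["12", "3", "40"],
})

# Tweet ids are labels, not quantities, so store them as strings.
img_prds["tweet_id"] = img_prds["tweet_id"].astype(str)

# Keep a tweet when at least one of its predictions is a dog (assumed rule).
dog_flags = ["p1_dog", "p2_dog", "p3_dog"]
img_prds = img_prds[img_prds[dog_flags].any(axis=1)]

# Convert the count columns from strings to integers.
count_cols = ["retweet_count", "favorite_count"]
twt_data[count_cols] = twt_data[count_cols].astype(int)
```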

#### Tidiness : 
- Merged `twt_df` and `twt_data` into one dataframe.
- Deleted any tweet in `twt_df` and `twt_data` that didn't have an image prediction in `img_prds`.
- Merged the dog type columns into one column named `dog_type`.
- Made all the dataframes have the same tweets.
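The tidiness steps can be sketched as below, assuming the quality fixes have already been applied, so that missing values are NaN and each row has at most one dog type. The four dog type column names (`doggo`, `floofer`, `pupper`, `puppo`), the helper `first_type`, and the sample values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows; the four dog type column names are assumed.
twt_df = pd.DataFrame({
    "tweet_id": ["1", "2", "3"],
    "doggo": [np.nan, "doggo", np.nan],
    "floofer": [np.nan, np.nan, np.nan],
    "pupper": ["pupper", np.nan, np.nan],
    "puppo": [np.nan, np.nan, np.nan],
})
twt_data = pd.DataFrame({
    "tweet_id": ["1", "2", "3"],
    "favorite_count": [12, 3, 40],
})
img_prds = pd.DataFrame({"tweet_id": ["1", "3"]})

type_cols = ["doggo", "floofer", "pupper", "puppo"]


def first_type(row):
    """Return the first non-missing dog type in the row, or NaN if there is none."""
    present = row.dropna()
    return present.iloc[0] if len(present) else np.nan


# Collapse the four dog type columns into a single dog_type column.
twt_df["dog_type"] = twt_df[type_cols].apply(first_type, axis=1)
twt_df = twt_df.drop(columns=type_cols)

# Merge the archive with the counts on the shared tweet id.
merged = twt_df.merge(twt_data, on="tweet_id", how="inner")

# Keep only tweets that also have an image prediction, and vice versa.
merged = merged[merged["tweet_id"].isin(img_prds["tweet_id"])]
img_prds = img_prds[img_prds["tweet_id"].isin(merged["tweet_id"])]
```

An inner merge drops tweets that are missing from either side, which is what keeps the dataframes limited to the same set of tweets.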