## Reporting: wrangle_report


Date: 16/03/2022

Author: Isabel Klint

## WeRateDogs Twitter Data Wrangling Report
This report briefly describes my wrangling efforts for the WeRateDogs Twitter datasets. I then used the wrangled data to create the analyses and visualisations in my report, **act_report.ipynb**.

#### Datasets: 
- WeRateDogs ‘enhanced’ Twitter archive
- each tweet's retweet count and favorite count (which I pulled using the Twitter API)
- tweet image predictions file

### About the enhanced Twitter archive
The WeRateDogs Twitter archive for this project contains 2356 tweets with ratings. Its columns include the text of each tweet, from which the following columns were extracted programmatically:
- rating
- dog name 
- dog rank (doggo, pupper, or puppo)

The data was extracted programmatically (not by me) and contains errors. The dog ranks are defined below.

#### Definitions of doggo, pupper, puppo, and floof(er) (adapted from the #WeRateDogs book):
##### Doggo
- A big pupper, usually older, that appears to have its life in order.
##### Pupper
- A small doggo, usually younger and inexperienced.
##### Puppo
- A transitional phase between pupper and doggo.
##### Floof
- Any dog, typically with excess fur.
- Dog fur.

### About the retweet/favorite count data
Retweet and favorite counts for each tweet were gathered using Twitter's API, queried with the tweet IDs in the WeRateDogs Twitter archive.
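A minimal sketch of how the counts could be pulled out of the API responses once they are saved locally. It assumes each response was stored as one JSON object per line, which is an assumption about the storage format rather than something stated in this report. The helper name `parse_tweet_counts` is hypothetical; `id`, `retweet_count`, and `favorite_count` are fields of the Twitter API tweet object.

```python
import json


def parse_tweet_counts(lines):
    """Collect tweet ID, retweet count, and favorite count from JSON lines.

    Assumes one tweet object per line (an assumed storage format).
    """
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        tweet = json.loads(line)
        rows.append({
            "tweet_id": str(tweet["id"]),  # IDs are labels, so keep them as text
            "retweet_count": int(tweet["retweet_count"]),
            "favorite_count": int(tweet["favorite_count"]),
        })
    return rows


# Toy input standing in for the saved API responses
sample = [
    '{"id": 1, "retweet_count": 10, "favorite_count": 25}',
    "",
    '{"id": 2, "retweet_count": 3, "favorite_count": 7}',
]
counts = parse_tweet_counts(sample)
```

The resulting list of dictionaries can be passed directly to `pandas.DataFrame` to build the counts table.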

### About the image predictions file
The images in the WeRateDogs Twitter archive were classified by dog breed using a neural network (not my work). The result ('image_predictions.tsv') is a table of the top image predictions alongside each tweet ID, image URL, and the image number that corresponded to the most confident prediction.

I gathered, assessed, and cleaned all the data, as described in this report.

### Quality issues

1. enhanced tweets: retweets in table unneeded for analysis (completeness). 

2. all tables: columns unneeded for analysis, namely the retweeted-tweet columns in the enhanced tweets table, the multiple prediction columns in the image predictions table, and the timestamp column in the tweet columns table.

3. column names are opaque or ugly: 'timestamp_x', 'p1', and 'p1_dog'.

4. enhanced tweets: some tweets lack dog ratings (completeness) 

5. enhanced tweets: ratings whose numerator is less than the denominator need checking (accuracy)

6. enhanced tweets: ratings whose denominator is less than 10 need checking (accuracy)

7. enhanced tweets: some tweets lack dog name (completeness)

8. enhanced tweets: some tweets have invalid dog name (validity)

9. enhanced tweets: tweet ID datatype is int64, but the IDs are unique identifiers and are never used in calculations (consistency)

10. enhanced tweets: ratings columns should be float to allow decimals and later analysis. (consistency)

11. enhanced tweets: timestamp datatype is not datetime (consistency)

12. tweet columns: favorites and retweets datatypes are float, should be int (consistency)
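A minimal sketch of how a few of the checks above (issues 1, 5, 6, and 9 to 11) could be expressed in pandas on a toy table. The column names `tweet_id`, `timestamp`, `retweeted_status_id`, `rating_numerator`, and `rating_denominator` are assumptions for illustration, and the values are made up; the actual cleaning code lives in wrangle_act.ipynb.

```python
import pandas as pd

# Toy stand-in for the enhanced archive (column names are assumed)
archive = pd.DataFrame({
    "tweet_id": [101, 102, 103, 104],
    "timestamp": [
        "2017-08-01 16:23:56 +0000",
        "2017-07-31 00:18:03 +0000",
        "2017-07-30 15:58:51 +0000",
        "2017-07-29 16:00:24 +0000",
    ],
    "retweeted_status_id": [None, 555.0, None, None],
    "rating_numerator": [13, 12, 9, 24],
    "rating_denominator": [10, 10, 10, 7],
})

# Issue 1: keep original tweets only (a retweet has a retweeted_status_id)
clean = archive[archive["retweeted_status_id"].isna()].copy()

# Issues 5 and 6: flag suspicious ratings for manual review
low_numerator = clean[clean["rating_numerator"] < clean["rating_denominator"]]
small_denominator = clean[clean["rating_denominator"] < 10]

# Issues 9 and 10: IDs become strings, ratings become floats
clean = clean.astype({
    "tweet_id": str,
    "rating_numerator": float,
    "rating_denominator": float,
})

# Issue 11: parse the timestamp strings into timezone-aware datetimes
clean["timestamp"] = pd.to_datetime(clean["timestamp"], utc=True)
```

Flagging rows rather than dropping them keeps the decision about each odd rating separate from the detection step.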


### Tidiness issues

1. image predictions: three separate columns of dog rankings (variable does not form a single column)

2. image predictions: a merged dog rank column shows some dogs have a double-ranking (violates single value per cell)
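A minimal sketch of merging separate rank columns into one and then spotting double-rankings. It assumes the rank columns are named `doggo`, `pupper`, and `puppo` and hold either the rank name or the string "None"; both the column names and that encoding are assumptions, as are the merged column name `dog_rank` and the helper `combine_ranks`.

```python
import pandas as pd

rank_columns = ["doggo", "pupper", "puppo"]

# Toy table: one column per rank (names and "None" encoding are assumed)
dogs = pd.DataFrame({
    "tweet_id": ["1", "2", "3", "4"],
    "doggo": ["doggo", "None", "doggo", "None"],
    "pupper": ["None", "pupper", "pupper", "None"],
    "puppo": ["None", "None", "None", "None"],
})


def combine_ranks(row):
    """Join every rank present in the row; return None when there is none."""
    found = [rank for rank in rank_columns if row[rank] == rank]
    return ", ".join(found) if found else None


# One variable, one column: merge the rank columns, then drop the originals
dogs["dog_rank"] = dogs[rank_columns].apply(combine_ranks, axis=1)
dogs = dogs.drop(columns=rank_columns)

# Rows with a comma carry more than one rank and need a decision
double_ranked = dogs[dogs["dog_rank"].str.contains(",", na=False)]
```

How the double-ranked rows are then resolved (for example, by re-reading the tweet text) is a separate judgment call.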

### Clean data
The above quality and tidiness issues were resolved (see ***wrangle_act.ipynb***), resulting in the clean dataset ***twitter_archive_master.csv***.

I will not describe routine cleaning steps such as column removal or datatype changes here.

Notable cleaning problems/solutions:
- The dog ratings philosophy evolved over time. This became clear in the cleaning process.
- The dog name scraping was imperfect. I avoided time-consuming string matching because the value of the name column did not greatly impact my analysis.