# Wrangle and Analyze Data
#### Wrangling report

## Gather
Because of issues with my Twitter developer account, I took the permitted shortcut of using the local JSON file instead of querying the Twitter API. On the plus side, this saves time and keeps the notebook cleaner.

The data comes from three different sources:

* Enhanced Twitter Archive: WeRateDogs Twitter archive, accessible as a local csv file.

* Image Predictions File: results of an image classifier for dog breeds, accessible as an online tsv file.

* Additional Data: accessible via Twitter API (or alternatively local json file).
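As a minimal sketch of the gathering step, the three sources could be loaded with pandas as shown here. The in-memory strings stand in for the real files, and the column names other than `tweet_id`, `id`, `timestamp`, `name` and `p1_dog` are assumptions made for illustration, as is the one-JSON-object-per-line layout of the JSON file.

```python
import io
import json

import pandas as pd

# Tiny in-memory stand-ins for the three sources (hypothetical sample rows).
archive_csv = io.StringIO(
    "tweet_id,timestamp,name\n"
    "1001,2017-01-01 00:00:00 +0000,Rex\n"
    "1002,2017-01-02 00:00:00 +0000,a\n"
)
predictions_tsv = io.StringIO(
    "tweet_id\tp1\tp1_conf\tp1_dog\n"
    "1001\tgolden_retriever\t0.90\tTrue\n"
)
tweets_json = io.StringIO(
    json.dumps({"id": 1001, "retweet_count": 5, "favorite_count": 20}) + "\n"
    + json.dumps({"id": 1002, "retweet_count": 1, "favorite_count": 3}) + "\n"
)

# Comma-separated archive.
archive = pd.read_csv(archive_csv)
# Tab-separated image predictions.
predictions = pd.read_csv(predictions_tsv, sep="\t")
# Assumed layout: one JSON object per line.
api_data = pd.read_json(tweets_json, lines=True)

print(archive.shape, predictions.shape, api_data.shape)
```

With real data, the `io.StringIO` objects would be replaced by the file paths or the URL of each source.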

## Assess

**The Process**

This was done through a series of operations on each dataset, including:

* visually checking samples of the data

* calculating and investigating some key stats

* checking for nulls

* checking for duplicates
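The checks listed above map onto a handful of pandas calls. The following is a sketch on a small made-up frame, not the exact code used in the notebook.

```python
import pandas as pd

# Hypothetical sample with one missing name and one duplicated row.
df = pd.DataFrame({
    "tweet_id": [1, 2, 2, 3],
    "name": ["Rex", "a", "a", None],
    "rating_denominator": [10, 10, 10, 0],
})

print(df.sample(2, random_state=0))  # visual check of a random sample
print(df.describe())                 # key statistics for numeric columns
df.info()                            # dtypes and non-null counts

null_counts = df.isnull().sum()              # missing values per column
duplicate_rows = int(df.duplicated().sum())  # fully duplicated rows

print(null_counts)
print("duplicated rows:", duplicate_rows)
```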

**Observations on quality**

* Incorrect dog names, including `None` and `a`

* `timestamp` is `object`, should be `datetime`

* `tweet_id` is `int`, should be `str` 

* Some columns have missing data

* `id` should be renamed `tweet_id` for consistency

* many dogs have no stage assigned (doggo, pupper ...)

* unusual rating scale, making comparisons difficult

* tables don't have the same number of entries, meaning some tweets are missing from at least one table

* several different denominator values, including 0

* the number of retweets differs depending on how they are filtered
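Two of these observations can be probed directly, as in the sketch below on made-up rows. Treating all-lowercase names as suspect is a heuristic assumed here for illustration. The report itself only names `None` and `a` as examples.

```python
import pandas as pd

# Hypothetical sample rows.
df = pd.DataFrame({
    "name": ["Rex", "a", "None", "Bella", "the"],
    "rating_denominator": [10, 10, 0, 7, 10],
})

# Assumed heuristic: lowercase "names" and the literal string "None" are not real names.
suspect_names = df[df["name"].str.islower() | (df["name"] == "None")]

# How often each denominator occurs; anything other than the usual value stands out.
denominator_counts = df["rating_denominator"].value_counts()

print(suspect_names["name"].tolist())
print(denominator_counts)
```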

**Observations on tidiness**

* the last 4 columns (`doggo`, `floofer`, `pupper`, `puppo`) can be replaced with one

* this data can be combined in 1 table instead of 3

* only the predictions with the highest probability are needed

* retweets and replies included but not needed

## Clean

**I followed these steps**

* change `id` and `tweet_id` types to `str`

* rename `id` to `tweet_id`

* join all tables on `tweet_id`

* remove rows with denominator of 0 if any

* replace wrong dog names with `None`

* remove retweets and related columns

* change `timestamp` type to `datetime` 

* combine `["doggo", "floofer", "pupper", "puppo"]` columns into one as `stage`

* add a new stage, [chimera](https://i.imgur.com/WeeuxcC.png), for dogs with multiple stages

* calculate a rating score
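The steps above can be sketched on a few made-up rows as follows. Several details are assumptions rather than the notebook's actual code: the columns `retweeted_status_id`, `rating_numerator`, `rating_denominator` and `favorite_count`; stage columns that hold either the stage name or the string `"None"`; the lowercase-name heuristic; and a rating score defined as numerator divided by denominator.

```python
import pandas as pd

STAGES = ["doggo", "floofer", "pupper", "puppo"]

# Hypothetical sample data standing in for the archive and the API table.
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "timestamp": ["2017-01-01 10:00:00 +0000"] * 4,
    "name": ["Rex", "a", "None", "Bo"],
    "retweeted_status_id": [None, None, 99.0, None],  # non-null marks a retweet
    "rating_numerator": [12, 13, 11, 5],
    "rating_denominator": [10, 10, 10, 0],
    "doggo": ["doggo", "None", "None", "None"],
    "floofer": ["None"] * 4,
    "pupper": ["pupper", "pupper", "None", "None"],
    "puppo": ["None"] * 4,
})
api_data = pd.DataFrame({"id": [1, 2, 3, 4], "favorite_count": [10, 20, 30, 40]})

# Use string ids and a common key name, then join the tables.
archive["tweet_id"] = archive["tweet_id"].astype(str)
api_data = api_data.rename(columns={"id": "tweet_id"})
api_data["tweet_id"] = api_data["tweet_id"].astype(str)
df = archive.merge(api_data, on="tweet_id", how="inner")

# Drop zero denominators, then retweets and the retweet-related column.
df = df[df["rating_denominator"] != 0]
df = df[df["retweeted_status_id"].isnull()]
df = df.drop(columns=["retweeted_status_id"]).copy()

# Replace suspicious names with a missing value (assumed heuristic).
bad_name = df["name"].str.islower() | (df["name"] == "None")
df.loc[bad_name, "name"] = None

df["timestamp"] = pd.to_datetime(df["timestamp"])


def combine_stages(row):
    """Collapse the four stage columns into a single label."""
    found = [s for s in STAGES if row[s] == s]
    if len(found) > 1:
        return "chimera"  # more than one stage in the same tweet
    return found[0] if found else "None"


df["stage"] = df[STAGES].apply(combine_stages, axis=1)
df = df.drop(columns=STAGES)

# Assumed definition of the score: numerator divided by denominator.
df["rating_score"] = df["rating_numerator"] / df["rating_denominator"]

print(df[["tweet_id", "name", "stage", "rating_score"]])
```

Filtering out zero denominators before computing the score avoids division by zero, which is why that step comes first in the sketch.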

## Analysis and Visualisation
In this section I provide some insight into the data at hand by answering a few questions.

### How many submissions are real dogs?


True dog percentage: 74.06%
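A percentage like this can be computed as the mean of a boolean column. The sketch below assumes that "real dog" means the top prediction flag `p1_dog` is `True`, which is an assumption about how the figure was obtained, and it uses made-up values.

```python
import pandas as pd

# Hypothetical flags: True when the top prediction is a dog breed.
df = pd.DataFrame({"p1_dog": [True, True, False, True]})

# The mean of a boolean column is the share of True values.
true_dog_pct = df["p1_dog"].mean() * 100
print(f"True dog percentage: {true_dog_pct:.2f}%")
```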

### Top 10 rated dogs


<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th>tweet_id</th>
      <th>name</th>
      <th>stage</th>
      <th>p1_dog</th>
      <th>rating_score</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>749981277374128128</td>
      <td>Atticus</td>
      <td>None</td>
      <td>False</td>
      <td>177.600000</td>
    </tr>
    <tr>
      <td>670842764863651840</td>
      <td>None</td>
      <td>None</td>
      <td>False</td>
      <td>42.000000</td>
    </tr>
    <tr>
      <td>786709082849828864</td>
      <td>Logan</td>
      <td>None</td>
      <td>True</td>
      <td>7.500000</td>
    </tr>
    <tr>
      <td>810984652412424192</td>
      <td>Sam</td>
      <td>None</td>
      <td>True</td>
      <td>3.428571</td>
    </tr>
    <tr>
      <td>778027034220126208</td>
      <td>Sophie</td>
      <td>pupper</td>
      <td>True</td>
      <td>2.700000</td>
    </tr>
    <tr>
      <td>680494726643068929</td>
      <td>None</td>
      <td>None</td>
      <td>True</td>
      <td>2.600000</td>
    </tr>
    <tr>
      <td>870063196459192321</td>
      <td>Clifford</td>
      <td>None</td>
      <td>False</td>
      <td>1.400000</td>
    </tr>
    <tr>
      <td>863079547188785154</td>
      <td>None</td>
      <td>None</td>
      <td>True</td>
      <td>1.400000</td>
    </tr>
    <tr>
      <td>864873206498414592</td>
      <td>None</td>
      <td>None</td>
      <td>False</td>
      <td>1.400000</td>
    </tr>
    <tr>
      <td>807621403335917568</td>
      <td>Ollie</td>
      <td>pupper</td>
      <td>True</td>
      <td>1.400000</td>
    </tr>
  </tbody>
</table>
</div>

### Images of the top 3 dogs

![png](wrangle_act_files/wrangle_act_49_0.png)

![png](wrangle_act_files/wrangle_act_50_0.png)

![png](wrangle_act_files/wrangle_act_51_0.png)

Even among the top-rated submissions, not all are real dogs, and the dog detection model does not seem to be fully reliable either.

### At which stage does a dog get better ratings?


<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th>stage</th>
      <th>rating_score</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>puppo</th>
      <td>1.200000</td>
    </tr>
    <tr>
      <th>floofer</th>
      <td>1.200000</td>
    </tr>
    <tr>
      <th>doggo</th>
      <td>1.188889</td>
    </tr>
    <tr>
      <th>None</th>
      <td>1.180192</td>
    </tr>
    <tr>
      <th>chimera</th>
      <td>1.118182</td>
    </tr>
    <tr>
      <th>pupper</th>
      <td>1.071429</td>
    </tr>
  </tbody>
</table>
</div>
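A table like the one above can be produced with a group-by, sketched here on made-up scores. Adding the group size next to the mean is a suggestion rather than part of the original analysis, since a mean over very few tweets is less reliable.

```python
import pandas as pd

# Hypothetical cleaned data.
df = pd.DataFrame({
    "stage": ["pupper", "pupper", "doggo", "None", "chimera"],
    "rating_score": [1.0, 1.2, 1.3, 1.1, 1.2],
})

# Mean score and number of tweets per stage, best stage first.
stage_summary = (
    df.groupby("stage")["rating_score"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)
print(stage_summary)
```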

### Source of submissions

* Twitter for iPhone: 98.04%

* Twitter Web Client: 1.40%

* TweetDeck: 0.55%
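Shares like these can be obtained with normalized value counts. The sketch below uses made-up data and assumes the `source` column already holds plain client names.

```python
import pandas as pd

# Hypothetical source values, one per tweet.
sources = pd.Series(
    ["Twitter for iPhone"] * 8 + ["Twitter Web Client", "TweetDeck"],
    name="source",
)

# normalize=True returns fractions instead of raw counts.
shares = sources.value_counts(normalize=True) * 100
for label, pct in shares.items():
    print(f"* {label}: {pct:.2f}%")
```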

### Retweet vs Favorite Count

![png](wrangle_act_files/wrangle_act_59_0.png)

### Rating Distribution

* For all values:

![png](wrangle_act_files/wrangle_act_61_0.png)

* For \[0,5\] range:

![png](wrangle_act_files/wrangle_act_62_0.png)

## Conclusion

This project provided the opportunity to practice the whole data wrangling process from start to finish.

I started by gathering the data from the 3 different sources and checking several aspects of it to get an idea of the issues present, in terms of both quality and tidiness.

In the second step, I addressed those issues as best I could and ended up with a much cleaner dataset, which I saved locally. Some issues remain unaddressed, though, such as the accuracy of the rating values and the messy `source` column.

Finally, it was time to draw some insights from the final dataset. The details are discussed in `act_report.html`.

    from subprocess import call

    # Export the report notebook with nbconvert; call() returns the exit code (0 here).
    call(["python", "-m", "nbconvert", "wrangle_report.ipynb"])