[WeRateDogs](https://twitter.com/dog_rates?s=20) is a Twitter account that posts and rates pictures of dogs.
In my analysis, I want to answer the following questions:

- What is WeRateDogs's posting trend by month?
- What is the monthly trend of interactions with WeRateDogs's posts?
- What are the most popular dog breeds based on number of posts, interactions by Twitter users, and ratings?
- Is there any correlation between WeRateDogs's ratings and the interactions by Twitter users?

# 1. Gathering data

There are three files, one in each of the following formats: CSV, TSV, and JSON. I downloaded these files manually.
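
As a rough illustration, the three formats could be loaded with pandas along these lines. This is only a sketch: the inline strings stand in for the real files (their names and exact layout are not given here), and the JSON is assumed to hold one tweet object per line with an `id` field.

```python
import io

import pandas as pd

# Tiny in-memory stand-ins for the three downloaded files (hypothetical content).
csv_text = "tweet_id,rating_numerator,rating_denominator\n1,12,10\n2,13,10\n"
tsv_text = "tweet_id\tp1\tp1_conf\tp1_dog\n1\tgolden_retriever\t0.9\tTrue\n"
json_text = (
    '{"id": 1, "retweet_count": 5, "favorite_count": 20}\n'
    '{"id": 2, "retweet_count": 7, "favorite_count": 30}\n'
)

# CSV is the default format for read_csv; TSV only needs a tab separator.
archive = pd.read_csv(io.StringIO(csv_text))
predictions = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# lines=True parses one JSON object per line (an assumption about the file layout).
tweet = pd.read_json(io.StringIO(json_text), lines=True)
tweet = tweet.rename(columns={"id": "tweet_id"})
```

With real files, the `io.StringIO(...)` arguments would be replaced by file paths.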

---

# 2. Assessing data
### 2.1 Understanding the dataset columns
`archive` columns:
- `tweet_id`: unique identifier for each tweet
- `in_reply_to_status_id`: original tweet_id if the row is a **reply**. If not, null
- `in_reply_to_user_id`: user id of the original tweet if the row is a **reply**. If not, null
- `timestamp`: time when this tweet was created
- `source`: HTML-formatted string identifying the platform used to post the tweet
- `text`: content of the tweet
- `retweeted_status_id`: original tweet_id if the row is a **retweet**. If not, null
- `retweeted_status_user_id`: user id of the original tweet if the row is a **retweet**. If not, null
- `expanded_urls`: tweet url
- `rating_numerator`: numerator of the rating of the dog. Note: ratings are almost always greater than 10
- `rating_denominator`: denominator of the rating of the dog. Note: ratings almost always have a denominator of 10
- `name`: name of the dog
- `doggo` / `floofer` / `pupper` / `puppo`: one of the 4 dog stages

`predictions` columns:
- `tweet_id`: mentioned before
- `jpg_url`: dogs' image url
- `img_num`: the image number that corresponds to the most confident prediction (1 to 4, since tweets can have up to 4 images)
- `p1` / `p2` / `p3`: the algorithm's #X prediction (pX) for the image in the tweet
- `p1_conf` / `p2_conf` / `p3_conf`: how confident the algorithm is in its pX prediction
- `p1_dog` / `p2_dog` / `p3_dog`: whether or not the #X prediction is a breed of dog

`tweet` columns:
- `tweet_id`: as mentioned
- `retweet_count`: number of times this tweet has been retweeted
- `favorite_count`: how many times this tweet has been liked by Twitter users
- `display_text_range`: an array of two Unicode code point indices, identifying the inclusive start and exclusive end of the displayable content of the tweet

### 2.2 Conduct assessment to define the cleaning process
**About the quality**

`archive`:
- contains retweets. We only care about original posts, so retweets can be treated as duplicates
- 281 `tweet_id` records are missing from `predictions`
- incorrect datatypes: `in_reply_to_status_id`, `in_reply_to_user_id`, `timestamp`
- unnecessary HTML tags in `source` obscure the name of the posting platform
- `rating_numerator` has values <10 as well as some very large numbers
- `rating_denominator` has values other than 10
- invalid dog names that start with a lowercase character and are not names at all (e.g. a, an, actually, by)
- some records have more than one dog stage
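
To illustrate the `source` issue, the platform name can be pulled out of the anchor tag with a regular expression. This is a minimal sketch, assuming `source` values are shaped like `<a href="...">label</a>`; the helper name `extract_source_label` and the sample value are illustrative rather than taken from the project code.

```python
import re


def extract_source_label(source_html):
    """Return the text between <a ...> and </a>, or the input unchanged if no tag is found."""
    match = re.search(r"<a[^>]*>(.*?)</a>", source_html)
    return match.group(1) if match else source_html


# Illustrative value in the assumed format.
sample = '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'
print(extract_source_label(sample))  # Twitter for iPhone
```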

`predictions`:
- After tidying, the prediction number needs to have an int type
- Values in `px` are inconsistent in the capitalization of the first letter
- Not all rows have a dog-related prediction; those without one need to be dropped
- Duplicated `jpg_url` values, which are related to retweets

**About the tidiness**
- once duplicates (i.e. retweets) are removed, `archive` will have empty `retweeted_status_id`, `retweeted_status_user_id`, and `retweeted_status_timestamp` columns, which can be dropped
- `doggo`, `floofer`, `pupper` and `puppo` should be merged into one column named `stage`
- of the 3 `px` predictions, one should be picked and added to `archive` as a `breed` column
- `retweet_count` and `favorite_count` from `tweet` should be joined with `archive`
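
A rough sketch of two of these tidiness steps (merging the stage columns and joining the counts) is shown below. It assumes that a missing stage is stored as the string `"None"`, and it simply joins multiple stages with a comma, which is an illustrative choice rather than necessarily how the project resolves them. The small frames and the `combine_stages` helper are hypothetical.

```python
import pandas as pd

# Hypothetical miniature versions of `archive` and `tweet`.
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "doggo": ["doggo", "None", "doggo"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "None", "pupper"],
    "puppo": ["None", "None", "None"],
})
tweet = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "retweet_count": [5, 7, 9],
    "favorite_count": [20, 30, 40],
})

stage_cols = ["doggo", "floofer", "pupper", "puppo"]


def combine_stages(row):
    """Collect every stage present in the row; return None when there is none."""
    stages = [row[col] for col in stage_cols if row[col] != "None"]
    return ", ".join(stages) if stages else None


# Collapse the four stage columns into a single `stage` column.
archive["stage"] = archive[stage_cols].apply(combine_stages, axis=1)
archive = archive.drop(columns=stage_cols)

# Bring the counts into `archive`, keeping every archive row.
archive = archive.merge(tweet, on="tweet_id", how="left")
```
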
---
# 3. Cleaning data

See [WeRateDogsProject]() for further details.

Some samples:

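1. 
```python
import pandas as pd

# Keep only rows where at least one of the three predictions is a dog breed.
# Start from an all-False mask so the |= below has something to accumulate into.
mask = pd.Series(False, index=predictions_clean.index)

for num in range(1, 4):
    column_name = f'p{num}_dog'
    mask |= predictions_clean[column_name]

predictions_clean = predictions_clean[mask]
```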
2. 
```python
import re

def extract_last_fraction(text):
    """Return (numerator, denominator) of the last fraction in text, or None if there is none."""
    fraction_pattern = r'(\d+\.?\d*)/(\d+\.?\d*)'
    matches = re.findall(fraction_pattern, text)
    if matches:
        last_match = matches[-1]  # take the last fraction
        numerator = float(last_match[0])
        denominator = float(last_match[1])
        return numerator, denominator
    return None
```
3. 
```python
# Overwrite `stage` for each listed tweet with the matching value from `multiple_stage_value`
for tweet_id, stage in zip(multiple_stage_tweet_id, multiple_stage_value):
    archive_clean.loc[archive_clean['tweet_id'] == tweet_id, 'stage'] = stage
```
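
To show why sample 2 takes the last fraction rather than the first, here is a small usage sketch. The function is repeated so the block runs on its own, and the tweet texts are made up for illustration.

```python
import re


def extract_last_fraction(text):
    """Return (numerator, denominator) of the last fraction in text, or None."""
    fraction_pattern = r"(\d+\.?\d*)/(\d+\.?\d*)"
    matches = re.findall(fraction_pattern, text)
    if matches:
        numerator, denominator = matches[-1]
        return float(numerator), float(denominator)
    return None


# "24/7" also looks like a fraction, so taking the first match would give a wrong rating.
print(extract_last_fraction("Open 24/7 but still 13.5/10"))  # (13.5, 10.0)
print(extract_last_fraction("No rating here"))               # None
```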

---

# 4. Storing data

---

# 5. Analyzing data and answering questions

**Question 1: What is WeRateDogs's posting trend by month?**

The number of rating tweets per month has followed a general downward trend since the account was started, going from a peak of almost 300 tweets in one month down to around 50 in the latest months. There was a sharp decline only a few months after the account was created. However, this does not necessarily mean that the number of overall posts by the account has decreased, as my analysis only covers original content with images. It is possible that the account has shifted toward other kinds of content, such as retweets, videos, and interactions with followers.

![monthly_tweet_count](graphs/Tweet%20posts%20by%20Month.png)

**Question 2: What is the monthly trend of interactions with WeRateDogs's posts?**

I define "interactions" as the total number of retweets plus favorites.

There has been a steady upward trend in the average retweets and favorites since the account was first started, going from almost no interaction to around 35,000 retweets and favorites per post in the latest months. This trend is encouraging given that the account's original purpose was to post images with ratings.

![monthly_interactions](graphs/Average%20interactions%20per%20tweet.png)
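
The monthly figures behind the two trend charts (tweet count and average interactions per tweet) can be computed roughly as follows. This is a sketch on made-up numbers; the frame name `tweets` and the `interactions` column are assumptions, not names from the project notebook.

```python
import pandas as pd

# Hypothetical cleaned data: one row per original tweet.
tweets = pd.DataFrame({
    "timestamp": pd.to_datetime(["2016-01-05", "2016-01-20", "2016-02-10"]),
    "retweet_count": [10, 30, 100],
    "favorite_count": [40, 120, 300],
})

# Interactions = retweets + favorites, as defined above.
tweets["interactions"] = tweets["retweet_count"] + tweets["favorite_count"]

# Group by calendar month, then count tweets and average their interactions.
monthly = tweets.groupby(tweets["timestamp"].dt.to_period("M")).agg(
    tweet_count=("interactions", "count"),
    avg_interactions=("interactions", "mean"),
)
print(monthly)
```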

**Question 3: What are the most popular dog breeds based on number of posts, interactions by Twitter users, and ratings?**

Here are the most popular dog breeds by the number of posts. 

![top_10_by_count](graphs/Most%20popular%20dog%20breeds%20by%20tweet%20counts.png)

Golden retrievers are by far the most commonly rated breed, followed by Labrador retrievers, Pembrokes, and Chihuahuas.

![top_10_by_interactions](graphs/Most%20popular%20dog%20breeds%20by%20average%20interactions.png)

Saluki, Afghan hound, and French bulldog are the breeds that received the highest number of interactions on average.

![top_10_by_rating](graphs/Most%20popular%20dog%20breeds%20by%20average%20ratings.png)

There is not much difference among the top 10 dog breeds by average rating, which brings me to the fourth question:

**Question 4: Is there any correlation between WeRateDogs's ratings and the interactions by Twitter users?**

There appears to be a positive correlation. I computed the **Pearson correlation coefficient** to measure how strong it is: r = 0.48, which is a moderate positive correlation. The very low p-value (well below 0.05) indicates that the correlation is statistically significant.

![scatter_by_breed](graphs/Average%20interactions%20vs.%20ratings%20by%20breed.png)
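
For reference, the Pearson coefficient can be computed directly from its definition, as in the sketch below. The numbers are made up and do not reproduce the r = 0.48 reported above, and `pearson_r` is an illustrative helper rather than the function used in the analysis. It returns only r; a library routine such as `scipy.stats.pearsonr` would also return the p-value.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-breed averages.
avg_rating = [10.0, 11.0, 12.0, 13.0]
avg_interactions = [1000, 3000, 2000, 4000]

r = pearson_r(avg_rating, avg_interactions)
print(round(r, 2))  # 0.8
```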