<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3: Developing a classification model for Twitter topics (Formula 1 vs MotoGP)

### Contents:
- [Background](#Background)
- [Problem Statement](#Problem-Statement)
- [Data Collection](#Data-Collection)
- [Exploratory Data Analysis](#Exploratory-Data-Analysis)

## Background

Since the release of the popular Netflix series "Drive to Survive" on 8 March 2019, Formula 1's popularity on Twitter has risen rapidly: the follower count of its official account grew from 3.5 million in 2018 to 9.1 million in 2023. Given this influx of interest, there is a need for a classifier that can distinguish tweets about Formula 1 from tweets about other motorsports such as MotoGP and NASCAR, so that Twitter can recommend relevant tweets in users' news feeds.

## Problem Statement

Twitter needs a classification model that maximizes the recommendation of Formula 1 tweets to users who have indicated interest in the topic, while minimizing the recommendation of tweets about other motorsports.

This project aims to build a classifier that can differentiate Formula 1 tweets from MotoGP tweets.
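
One way to make "maximizing" and "minimizing" concrete is to treat Formula 1 as the positive class and track recall (the share of Formula 1 tweets that get recommended) alongside precision (the share of recommended tweets that really are about Formula 1). The sketch below illustrates that reading only; the function name and the counts are hypothetical and are not results from this project.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (precision, recall) for the positive class (assumed here: Formula 1)."""
    # Precision: of the tweets recommended as F1, how many really are F1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of all F1 tweets, how many were recommended
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts, for illustration only
precision, recall = precision_recall(tp=90, fp=10, fn=30)
print(f"precision={precision:.2f}, recall={recall:.2f}")  # precision=0.90, recall=0.75
```

Under this reading, a MotoGP tweet shown to a Formula 1 fan lowers precision, while a Formula 1 tweet that is never shown lowers recall.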

## Data Collection

For this project, we scrape tweets from the official Twitter accounts of Formula 1 (@F1) and MotoGP (@MotoGP) for the period from 1 Jan 2022 to 28 Feb 2023. Note that the `until:` operator in the search query excludes the given date, so the latest tweets collected are from 27 Feb 2023. The official racing seasons of both series run from March to November, and we expect off-season tweets to differ in nature from those posted during the racing season. Our dataset contains tweets from both periods so that the model can account for seasonal differences.
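
To check later that both periods are represented, each tweet's timestamp can be mapped to an in-season or off-season flag. The following is a minimal sketch, assuming the season covers March to November inclusive as described above; `SEASON_MONTHS` and `is_racing_season` are hypothetical names, and the in-season timestamp is made up for illustration.

```python
from datetime import datetime, timezone

# Assumption: the racing season spans March (3) to November (11), inclusive
SEASON_MONTHS = range(3, 12)


def is_racing_season(tweet_date: datetime) -> bool:
    """Return True if the tweet was posted during the racing season."""
    return tweet_date.month in SEASON_MONTHS


in_season = datetime(2022, 7, 3, 14, 0, tzinfo=timezone.utc)    # illustrative date
off_season = datetime(2023, 2, 27, 20, 0, tzinfo=timezone.utc)  # off-season date
print(is_racing_season(in_season), is_racing_season(off_season))  # True False
```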

In [1]:
# Import libraries
import pandas as pd
import snscrape.modules.twitter as sntwitter

In [2]:
# Scraping Tweets from @F1
f1_tweets = []

f1_scraper = sntwitter.TwitterSearchScraper('"from:F1" since:2022-01-01 until:2023-02-28')
for tweet in f1_scraper.get_items():
    data = [
        tweet.date,
        tweet.rawContent,
        tweet.user.username,
        tweet.likeCount,
        tweet.retweetCount,
    ]
    f1_tweets.append(data)

In [3]:
# Checking number of F1 tweets collected
len(f1_tweets)

8173

In [4]:
# Compiling the F1 data into a dataframe
f1_df = pd.DataFrame(f1_tweets, columns = ['date','content','username','like_count','retweet_count'])
f1_df

| | date | content | username | like_count | retweet_count |
|---|---|---|---|---|---|
| 0 | 2023-02-27 20:00:00+00:00 | Tensions were running high between the Team Pr... | F1 | 22495 | 1637 |
| 1 | 2023-02-27 19:00:02+00:00 | Striking a pose, ft. a glorious mullet too ✨\n... | F1 | 3552 | 291 |
| 2 | 2023-02-27 18:02:00+00:00 | 💰🆕 FANTASY PRICE REVEAL 🆕💰\n\nCreate your ulti... | F1 | 1526 | 146 |
| 3 | 2023-02-27 17:55:00+00:00 | It's almost time for F1 Fantasy to restart! 🤩\... | F1 | 1215 | 210 |
| 4 | 2023-02-27 17:37:05+00:00 | We couldn't not put these two together! 🥰\n\n@... | F1 | 8126 | 785 |
| ... | ... | ... | ... | ... | ... |
| 8168 | 2022-01-01 17:17:15+00:00 | New car, new circuit, new driver line-ups...\n... | F1 | 6087 | 342 |
| 8169 | 2022-01-01 14:54:00+00:00 | Give us one bold prediction for the 2022 seaso... | F1 | 12284 | 308 |
| 8170 | 2022-01-01 11:30:00+00:00 | Time to get these dates down so you don't miss... | F1 | 51321 | 7542 |
| 8171 | 2022-01-01 08:48:00+00:00 | New year. New car. New era.\n\n#F1 #HappyNewYe... | F1 | 65253 | 4489 |


In [5]:
# Scraping Tweets from @MotoGP
motogp_tweets = []

motogp_scraper = sntwitter.TwitterSearchScraper('"from:MotoGP" since:2022-01-01 until:2023-02-28')
for tweet in motogp_scraper.get_items():
    data = [
        tweet.date,
        tweet.rawContent,
        tweet.user.username,
        tweet.likeCount,
        tweet.retweetCount,
    ]
    motogp_tweets.append(data)
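
The two scraping cells above differ only in the account handle, so the search query could be built by one small helper. This is a sketch rather than the code actually used: `build_query` is a hypothetical name, and it simply reproduces the query strings from the cells above, including the quotes around `from:<handle>`.

```python
def build_query(handle: str, since: str, until: str) -> str:
    """Build the search query string used in the scraping cells above."""
    # Dates are ISO strings (YYYY-MM-DD), as in the original queries
    return f'"from:{handle}" since:{since} until:{until}'


# One query per account, same date window for both
queries = {
    handle: build_query(handle, "2022-01-01", "2023-02-28")
    for handle in ("F1", "MotoGP")
}
print(queries["F1"])  # "from:F1" since:2022-01-01 until:2023-02-28
```

Each resulting string can be passed to `sntwitter.TwitterSearchScraper` exactly as in the cells above, which keeps the date window consistent across both accounts.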

In [6]:
# Checking number of MotoGP tweets collected
len(motogp_tweets)

11677

In [7]:
# Compiling the MotoGP data into a dataframe
motogp_df = pd.DataFrame(motogp_tweets, columns = ['date','content','username','like_count','retweet_count'])
motogp_df

| | date | content | username | like_count | retweet_count |
|---|---|---|---|---|---|
| 0 | 2023-02-27 12:40:00+00:00 | @KTM_Racing @GresiniRacing @pramacracing Six d... | MotoGP | 115 | 10 |
| 1 | 2023-02-27 12:30:59+00:00 | The 2023 grid is looking great so far! 🔥 https... | MotoGP | 348 | 35 |
| 2 | 2023-02-27 12:28:33+00:00 | We've had some cool team presentations so far!... | MotoGP | 1384 | 134 |
| 3 | 2023-02-27 11:15:24+00:00 | Five team launches, the #PortimaoTest and so m... | MotoGP | 347 | 39 |
| 4 | 2023-02-26 09:01:15+00:00 | About last Sunday 😎 \n\nWe just cannot get eno... | MotoGP | 1262 | 140 |
| ... | ... | ... | ... | ... | ... |
| 11672 | 2022-01-01 21:30:39+00:00 | @jazzycat62 @ValeYellow46 https://t.co/Up0SRc0... | MotoGP | 7 | 2 |
| 11673 | 2022-01-01 19:01:13+00:00 | The man that defined over 2 decades of #MotoGP... | MotoGP | 3284 | 570 |
| 11674 | 2022-01-01 17:02:17+00:00 | Chapter 9 - Where Heroes Are Made 🦸‍♂️\n\nOne ... | MotoGP | 438 | 52 |
| 11675 | 2022-01-01 15:01:06+00:00 | "Maybe one day" 👀\n\nThere are many that think... | MotoGP | 433 | 25 |


In [8]:
# Combine F1 and MotoGP datasets
# (pd.concat replaces DataFrame.append, which is deprecated and was removed in pandas 2.0)
df = pd.concat([f1_df, motogp_df])
df


| | date | content | username | like_count | retweet_count |
|---|---|---|---|---|---|
| 0 | 2023-02-27 20:00:00+00:00 | Tensions were running high between the Team Pr... | F1 | 22495 | 1637 |
| 1 | 2023-02-27 19:00:02+00:00 | Striking a pose, ft. a glorious mullet too ✨\n... | F1 | 3552 | 291 |
| 2 | 2023-02-27 18:02:00+00:00 | 💰🆕 FANTASY PRICE REVEAL 🆕💰\n\nCreate your ulti... | F1 | 1526 | 146 |
| 3 | 2023-02-27 17:55:00+00:00 | It's almost time for F1 Fantasy to restart! 🤩\... | F1 | 1215 | 210 |
| 4 | 2023-02-27 17:37:05+00:00 | We couldn't not put these two together! 🥰\n\n@... | F1 | 8126 | 785 |
| ... | ... | ... | ... | ... | ... |
| 11672 | 2022-01-01 21:30:39+00:00 | @jazzycat62 @ValeYellow46 https://t.co/Up0SRc0... | MotoGP | 7 | 2 |
| 11673 | 2022-01-01 19:01:13+00:00 | The man that defined over 2 decades of #MotoGP... | MotoGP | 3284 | 570 |
| 11674 | 2022-01-01 17:02:17+00:00 | Chapter 9 - Where Heroes Are Made 🦸‍♂️\n\nOne ... | MotoGP | 438 | 52 |
| 11675 | 2022-01-01 15:01:06+00:00 | "Maybe one day" 👀\n\nThere are many that think... | MotoGP | 433 | 25 |


In [9]:
# Export combined dataset into csv
df.to_csv("../data/train_data.csv", index = False)
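
Because `username` records the source account of every tweet, it can later serve as the class label. The sketch below shows one possible encoding, assuming Formula 1 is the positive class (1) and MotoGP the negative class (0); `LABELS`, `to_label`, and the sample list are hypothetical.

```python
# Assumption: Formula 1 is the positive class, MotoGP the negative class
LABELS = {"F1": 1, "MotoGP": 0}


def to_label(username: str) -> int:
    """Map a source account name to its binary class label."""
    return LABELS[username]


sample_usernames = ["F1", "MotoGP", "F1"]  # illustrative values only
labels = [to_label(name) for name in sample_usernames]
print(labels)  # [1, 0, 1]
```

On the combined dataframe, `df['username'].map(LABELS)` would apply the same mapping to the whole column.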

### Info on data collected
Number of F1 tweets collected: 8,173  
Number of MotoGP tweets collected: 11,677
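
The two counts imply a moderately imbalanced dataset. A quick calculation of the class shares, using only the counts reported above, gives a reference point for later models: a classifier that always predicts the majority class would already reach the accuracy shown as `majority_baseline`. The variable names are illustrative.

```python
n_f1 = 8173        # F1 tweets collected
n_motogp = 11677   # MotoGP tweets collected
total = n_f1 + n_motogp

share_f1 = n_f1 / total
share_motogp = n_motogp / total

# Accuracy of always predicting the larger class
majority_baseline = max(share_f1, share_motogp)

print(f"total={total}, F1={share_f1:.1%}, MotoGP={share_motogp:.1%}")
# total=19850, F1=41.2%, MotoGP=58.8%
```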