<a href="https://colab.research.google.com/github/mantissg/DAT6004_WRIT1/blob/main/st20215322_DAT6004_WRIT1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Social Analytics (DAT6004) WRIT1 Assignment
## Produced by Sean Granville (st20215322)
**Social Network Analysis and Social Media Text Processing**
## Introduction
This Python notebook delves into the realm of Social Analytics, encompassing diverse facets such as Social Network Analysis (SNA), network visualisation, web scraping, and natural language processing.

The notebook unfolds across three distinctive sections:

1. Building and Analyzing a Graph Network:

> Utilising CERN Twitter data, this section explores the dynamics around the time of the discovery of the Higgs boson particle. A graph network is constructed and analysed to unveil insights into the social interactions during this period.


2. Overcoming Limitations in Structural Analysis:

> Investigating the challenges posed by structural analysis in social networks and propose strategies to overcome these limitations. The discussion extends to addressing local community issues, showcasing the potential synergy between location information and social media to tackle such challenges.


3. Social Media Text Analysis:

> Leveraging Scrapy, the notebook ventures into web scraping, extracting multiple Formula 1 forum threads from forums.autosport.com. The extracted text undergoes natural language processing, specifically sentiment analysis. The focus is on evaluating the sentiment expressed in each team's forum threads, providing insights into the online sentiments within the Formula 1 community.

This comprehensive exploration demonstrates the versatility of Python in handling social analytics tasks, from dissecting Twitter interactions around scientific breakthroughs to addressing community issues and dissecting sentiments within online forums.

##Contents

**Task 1: Building and Analysing the Network**

Introduction

1.1 - Building the Social Network
> 1.11 - Import Data

> 1.12 - EDA

1.2 - Centrality Measures

> 1.21 - Top 10 Nodes by Centrality Measure

1.3 - Network-Level and Path-Level Measures

1.4 - Structural Analysis

1.5 - Network Visualisation

**Task 2: Beyond Basic SNA**

2.1 - Limitations of Structural Analysis of the Social Networks

2.2 - Issues Facing the Local Community

**Task 3: Social Media Text Analysis**

Introduction

3.1 - Web Scraping

3.2 - Text Preprocessing

3.3 - Language Modelling

3.4 - Sentiment Analysis

References




#Task 1: Building and Analysing the Network

##Introduction
The Higgs dataset has been built after monitoring the spreading processes on Twitter before, during and after the announcement of the discovery of a new particle with the features of the elusive Higgs boson on 4th July 2012.

To delve into the intricacies of this dataset, a specific subset, named 'higgs-activity_time,' encapsulating a 10-minute snapshot of activity, was selected. This subset became the focal point for a comprehensive analysis that included degree distribution, centrality, network structure, and path analysis. These analytical endeavors unearthed intriguing features, structures, and insights embedded within the network.

The culmination of this analysis is embodied in two distinct visualisations crafted using Gephi. Each visualisation offers a unique perspective on the data, providing a visual narrative that enhances our understanding of the underlying dynamics of the Higgs dataset.

##1.1 - Building the Social Network

In [4]:
# import required libraries for task 1
import pandas as pd
import gzip
import shutil
import matplotlib.pyplot as plt
from matplotlib.figure import Figure
import plotly.express as px
import plotly.graph_objects as go
import plotly.subplots as sp
import numpy as np
import networkx as nx
from operator import itemgetter
from google.colab import drive
from IPython.display import Image, display

###1.11 - Import Data

In [5]:
# connect to storage
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [6]:
# unzip Higgs Twitter Dataset files that were downloaded from https://snap.stanford.edu/data/higgs-twitter.html

# create 'files' array
files = ['higgs-activity_time.txt', 'higgs-mention_network.edgelist', 'higgs-reply_network.edgelist', 'higgs-retweet_network.edgelist','higgs-social_network.edgelist']

# iterate over each .gz file to unzip to corrisponding .csv file
for file in files:

    with gzip.open(f'/content/drive/MyDrive/Higgs_Twitter/{file}.gz', 'rb') as f_in:
        with open(f'/content/drive/MyDrive/Higgs_Twitter/{file}.csv', 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)

In [7]:
# import each .csv file into the respective Pandas dataframe
df_activitytime = pd.read_csv("/content/drive/MyDrive/Higgs_Twitter/higgs-activity_time.txt.csv", index_col=False, sep=' ', names=['Source', 'Target', 'Date','Activity'])
df_mention = pd.read_csv("/content/drive/MyDrive/Higgs_Twitter/higgs-mention_network.edgelist.csv", sep=' ', names=['Source', 'Target','?'])
df_reply = pd.read_csv("/content/drive/MyDrive/Higgs_Twitter/higgs-reply_network.edgelist.csv", sep=' ', names=['Source', 'Target','?'])
df_retweet = pd.read_csv("/content/drive/MyDrive/Higgs_Twitter//higgs-retweet_network.edgelist.csv", sep=' ', names=['Target', 'Source', '?'])
df_socialnetwork = pd.read_csv("/content/drive/MyDrive/Higgs_Twitter//higgs-social_network.edgelist.csv", sep=' ', names=['Source', 'Target'])

In [8]:
# convert date column in df_activitytime from epoch time to date, then convert to UK date format with time removed
df_activitytime['Date'] = pd.to_datetime(df_activitytime['Date'], unit='s')

#df_activitytime['Date']  = df_activitytime['Date'].dt.strftime('%Y-%m-%d %H')

In [9]:
df_activitytime # "The higgs-activity_time.txt is a set of labeled temporal edges" - https://www.kaggle.com/datasets/wolfram77/graphs-snap-higgs-twitter

Unnamed: 0,Source,Target,Date,Activity
0,223789,213163,2012-07-01 00:02:52,MT
1,223789,213163,2012-07-01 00:02:52,RE
2,376989,50329,2012-07-01 00:06:21,RT
3,26375,168366,2012-07-01 00:06:23,MT
4,376989,13813,2012-07-01 00:06:32,RT
...,...,...,...,...
563064,97296,15483,2012-07-07 23:58:50,RE
563065,19979,49694,2012-07-07 23:59:12,MT
563066,19979,80429,2012-07-07 23:59:12,MT
563067,178085,1062,2012-07-07 23:59:34,RT


###1.12 - EDA

In [10]:
# Extract day, hour, and activity from the timestamp
df_activitytime['Day'] = df_activitytime['Date'].dt.date
df_activitytime['Hour'] = df_activitytime['Date'].dt.hour

# Group by day, hour, and activity, then count
activity_count = df_activitytime.groupby(['Day', 'Hour', 'Activity']).size().reset_index(name='Volume')

# Combine day and hour into a single datetime column for plotting
activity_count['Date'] = pd.to_datetime(activity_count['Day'].astype(str) + ' ' + activity_count['Hour'].astype(str) + ':00:00')

# Plot with Plotly Express
fig = px.line(activity_count, x='Date', y='Volume', color='Activity',
              labels={'Date': 'Datetime', 'Volume': 'Volume'},
              title='Higgs Twitter Activity by Activity Type (Hourly)',
              width=1000, height=600)

fig.show()

In [11]:
# dataframe for activity between 6am - 6:10am on 4th July only
df_4July_6am_610am = df_activitytime[(df_activitytime['Date'] > '2012-07-04 06:00:00') & (df_activitytime['Date'] < '2012-07-04 06:10:00')]

In [12]:
df_4July_6am_610am = df_4July_6am_610am.reset_index()
df_4July_6am_610am

Unnamed: 0,index,Source,Target,Date,Activity,Day,Hour
0,98206,234383,70120,2012-07-04 06:00:01,MT,2012-07-04,6
1,98207,234383,70120,2012-07-04 06:00:01,RE,2012-07-04,6
2,98208,45714,1988,2012-07-04 06:00:01,RT,2012-07-04,6
3,98209,368406,239782,2012-07-04 06:00:01,RT,2012-07-04,6
4,98210,7100,88,2012-07-04 06:00:01,MT,2012-07-04,6
...,...,...,...,...,...,...,...
7872,106078,230658,677,2012-07-04 06:09:59,RT,2012-07-04,6
7873,106079,23255,23798,2012-07-04 06:09:59,RT,2012-07-04,6
7874,106080,298817,677,2012-07-04 06:09:59,MT,2012-07-04,6
7875,106081,298817,374254,2012-07-04 06:09:59,RT,2012-07-04,6


In [13]:
# export nodes dataframe to csv file
df_4July_6am_610am.to_csv('4July_6am_610am.csv')

In [None]:
# create networkx object from chosen dataframe
G = nx.from_pandas_edgelist(df_4July_6am_610am, 'Source', 'Target')

# print the number of nodes and edges in the networkx object
print(f'Number of Nodes: {nx.number_of_nodes(G)}')
print(f'Number of Edges: {nx.number_of_edges(G)}')

Number of Nodes: 6202
Number of Edges: 6773


##1.2 - Centrality Measures

Calculated centrality measures and chart plots

In [None]:
# degree centrality (measures number of edges adjacent to a node i.e. nodes with higher degree are more connected)
degree_centrality = nx.degree_centrality(G)
avg_degree = sum(degree_centrality.values()) / len(degree_centrality)

# closeness centrality
closeness_centrality = nx.closeness_centrality(G)
avg_closeness = sum(closeness_centrality.values()) / len(closeness_centrality)

# betweenness centrality
betweenness_centrality = nx.betweenness_centrality(G)
avg_betweenness = sum(betweenness_centrality.values()) / len(betweenness_centrality)

# eigenvector centrality
eigenvector_centrality = nx.eigenvector_centrality(G, max_iter=500)
avg_eigenvector = sum(eigenvector_centrality.values()) / len(eigenvector_centrality)

# pagerank centrality
pagerank_centrality = nx.pagerank(G)
avg_pagerank = sum(pagerank_centrality.values()) / len(pagerank_centrality)

# harmonic centrality
harmonic_closeness_centrality_distribution = nx.harmonic_centrality(G)
avg_harmonic = sum(harmonic_closeness_centrality_distribution.values()) / len(harmonic_closeness_centrality_distribution)

In [None]:
print(f'Average degree centrality measure: {avg_degree}')
print(f'Average closeness centrality measure: {avg_closeness}')
print(f'Average betweenness centrality measure: {avg_betweenness}')
print(f'Average eigenvector centrality measure: {avg_eigenvector}')
print(f'Average pagerank centrality measure: {avg_pagerank}')
print(f'Average harmonic centrality measure: {avg_harmonic}')

Average degree centrality measure: 0.0003522228915133308
Average closeness centrality measure: 0.11335381427927331
Average betweenness centrality measure: 0.00032319868718627204
Average eigenvector centrality measure: 0.00431909407099422
Average pagerank centrality measure: 0.00016123831022251452
Average harmonic centrality measure: 772.4436912848512


In [None]:
G = nx.erdos_renyi_graph(100, 0.1)

# create subplots
fig = sp.make_subplots(rows=3, cols=2,
                       subplot_titles=['Degree Distribution', 'Closeness Centrality', 'Betweenness Centrality', 'Eigenvector Centrality', 'PageRank Centrality', 'Harmonic Closeness Centrality'],
                       shared_xaxes=False, shared_yaxes=False)

# set marker properties
marker_style = dict(mode='markers', marker=dict(size=2.5, color='#636EFA'))

# add Degree Distribution Histogram
degree_sequence = [d for n, d in G.degree()]
fig.add_trace(go.Histogram(x=degree_sequence, nbinsx=20, marker=dict(color='#636EFA'), opacity=0.7,
                           name='Degree Distribution'), row=1, col=1)

# add Closeness Centrality Scatter Plot
fig.add_trace(go.Scatter(x=list(closeness_centrality.keys()), y=list(closeness_centrality.values()), **marker_style,
                         name='Closeness Centrality'), row=1, col=2)

# add Betweenness Centrality Scatter Plot
fig.add_trace(go.Scatter(x=list(betweenness_centrality.keys()), y=list(betweenness_centrality.values()), **marker_style,
                         name='Betweenness Centrality'), row=2, col=1)

# add Eigenvector Centrality Scatter Plot
fig.add_trace(go.Scatter(x=list(eigenvector_centrality.keys()), y=list(eigenvector_centrality.values()), **marker_style,
                         name='Eigenvector Centrality'), row=2, col=2)

# add PageRank Centrality Scatter Plot
fig.add_trace(go.Scatter(x=list(pagerank_centrality.keys()), y=list(pagerank_centrality.values()), **marker_style,
                         name='PageRank Centrality'), row=3, col=1)

# add Harmonic Closeness Centrality Scatter Plot
fig.add_trace(go.Scatter(x=list(harmonic_closeness_centrality_distribution.keys()),
                         y=list(harmonic_closeness_centrality_distribution.values()), **marker_style,
                         name='Harmonic Closeness Centrality'), row=3, col=2)


# update layout for better visualization
fig.update_layout(title='Centrality Measures Distribution',
                  title_x=0.5,
                  showlegend=False,
                  width=1000, height=1000)

# update axis titles for histogram
fig.update_xaxes(title_text='Degree', row=1, col=1)
fig.update_yaxes(title_text='Frequency', row=1, col=1)

# update axis titles for scatter plots
fig.update_xaxes(title_text='Count', row=1, col=2)
fig.update_yaxes(title_text='Value', row=1, col=2)

for i in range(1, 3):
    for j in range(1, 4):
        fig.update_xaxes(title_text='Count', row=i+1, col=j)
        fig.update_yaxes(title_text='Value', row=i+1, col=j)

fig.show()

###1.21 - Top 10 Nodes by Centrality Measure

In [None]:
G = nx.erdos_renyi_graph(100, 0.1)

# calculate centrality measures
centrality_measures = {
    'Closeness': closeness_centrality,
    'Betweenness': betweenness_centrality,
    'Eigenvector': eigenvector_centrality,
    'PageRank': pagerank_centrality,
    'Harmonic Closeness': harmonic_closeness_centrality_distribution
}

# create a dataframe to store the top nodes for each centrality measure
top_nodes_df = pd.DataFrame()

# populate the dataframe with the top 10 nodes for each centrality measure
for measure_name, centrality_measure in centrality_measures.items():
    sorted_nodes = sorted(centrality_measure.items(), key=lambda x: x[1], reverse=True)[:10]
    top_nodes = [node[0] for node in sorted_nodes]
    top_nodes_df[measure_name] = top_nodes

print(top_nodes_df)

   Closeness  Betweenness  Eigenvector  PageRank  Harmonic Closeness
0         88           88           88        88                  88
1       1276          677         3998       677                1276
2     215057         1988          677      1988                 677
3      40886         1276         1988     11792              184805
4     184805        35843        37532     14075              215057
5      10339        14075        14615      1343               40886
6       3998         1343         1276     19913               53508
7      53508        11792        64911      3998                3998
8     163806        19913        14075     35843               10339
9     118381         3998        53508       205              163806


##1.22 - Top 10 Nodes by Degree

In [30]:
# create networkx object from chosen dataframe
G = nx.from_pandas_edgelist(df_4July_6am_610am, 'Source', 'Target')

# calculate the degree (number of edges) for each node
node_degrees = G.degree()

# convert the degree dictionary to a DataFrame
degrees_df = pd.DataFrame(list(node_degrees), columns=['Node', 'Degree'])

# sort the DataFrame by degree in descending order
sorted_degrees_df = degrees_df.sort_values(by='Degree', ascending=False)

# get the top 10 nodes
top_10_nodes = sorted_degrees_df.head(10)

print(top_10_nodes)

      Node  Degree
7       88    1175
25     677     476
3     1988     310
90    3998     122
148  11792     113
27   14075     107
21   19913     101
44    1343      98
138  35843      97
68     349      85


##1.3 - Network-Level and Path-Level Measures

In [None]:
# density (measures the proportion of edges in a graph relative to the total possible edges)
density_network = nx.density(G)

# average Degree (average degree of nodes in the network)
avg_degree_network = sum(dict(G.degree()).values()) / len(G)

# transitivity (Clustering Coefficient - measures the tendency of nodes to form clusters or triangles)
transitivity_network = nx.transitivity(G)

# assortativity (the preference of nodes to connect to other nodes with similar degrees)
assortativity_network = nx.degree_assortativity_coefficient(G)

# diameter (The maximum eccentricity among pairs of nodes in the network)
diameter_network = nx.diameter(G)

# shortest Path Length (the length of the shortest path between two nodes)
shortest_path = nx.shortest_path_length(G)

# average Shortest Path Length (the average length of the shortest paths between all pairs of nodes in the network)
avg_shortest_path = nx.average_shortest_path_length(G)

In [None]:
print(f'Density network measure: {density_network}')
print(f'Average Degree network measure: {avg_degree_network}')
print(f'Transitivity network measure: {transitivity_network}')
print(f'Assortativity network measure: {assortativity_network}')
print(f'Diameter network measure: {diameter_network}')
print(f'Average Shortest Path Length: {avg_shortest_path}')

Density network measure: 0.09575757575757576
Average Degree network measure: 9.48
Transitivity network measure: 0.09216909216909216
Assortativity network measure: -0.009140845100184772
Diameter network measure: 4
Average Shortest Path Length: 2.2692929292929294


##1.4 - Structural Analysis


The selected dataframe encompasses 6202 nodes and 6773 edges, forming the basis for a comprehensive network analysis.

Centrality measures such as Closeness, Betweenness, Eigenvector, PageRank, and Harmonic Closeness provide nuanced insights into the importance of individual nodes within the network.

The average degree centrality, with a value of 0.00035, implies that, on average, nodes maintain a relatively modest number of connections. The network's average degree of 9.48 signifies that, on average, each node is linked to approximately 9 others. Noteworthy exceptions exist, with 7 nodes boasting over 100 degrees each. Particularly, a specific node, presumably associated with CERN, demonstrates an extraordinary 1,175 connections over the 10-minute observation window. In summary, while nodes, on average, exhibit a low degree of connectivity, a few influential nodes wield significant influence in the network.

Closeness centrality, averaging at 0.113, indicates that nodes, on average, maintain relatively close proximity to each other within the network.

Betweenness centrality, with an average value of 0.00032, suggests that nodes, on average, do not play a critical role in connecting disparate sections of the network.

The presumed CERN account node stands out with the highest centrality values across all measures, logically asserting its integral role in the entire network. Nonetheless, other nodes also exhibit influential roles, contributing to the formation of clusters.

Network density, at 0.0957, hints at a relatively sparse network, while transitivity, at 0.092, suggests a moderate inclination for nodes to form clusters or triangles.

The average shortest path length of 2.27 underscores the proximity of nodes to each other on average.

In summary, the network emerges as a relatively sparse and decentralized structure with a moderate propensity for nodes to form clusters. The high harmonic centrality underscores nodes' elevated average reciprocal shortest path lengths, accentuating their centrality in terms of communication within the network.

##1.5 - Network Visualisation

In [24]:
image_url = 'https://github.com/mantissg/DAT6004_WRIT1/raw/main/Gephi.png'
image_url2 = 'https://github.com/mantissg/DAT6004_WRIT1/raw/main/Gephi2.png'

display(Image(url=image_url, width=1000))
print('Figure 1 - Gephi Export ForceAtlas 2\n')

display(Image(url=image_url2, width=1000))
print('Figure 2 - Gephi Export ForceAtlas 2 with Stronger Gravity Tuning')

Figure 1 - Gephi Export ForceAtlas 2



Figure 2 - Gephi Export ForceAtlas 2 with Stronger Gravity


Figures 1 and 2 illustrate visual graph network representations of the 10-minute Higgs dataset, crafted in Gephi using the ForceAtlas 2 layout.

In these visualizations, edge colors signify different interaction types: Re-Tweet (Purple), Mention (Red), and Reply (Green). Additionally, the node size is determined by the betweenness centrality.

A notable distinction between the two layouts lies in the 'Stronger Gravity' tuning.

Figure 1 effectively portrays the overall sparsity of the broader network, highlighting influential nodes and clusters within.

On the other hand, Figure 2 places greater emphasis on significant nodes and adeptly represents the volume of each edge activity type. Notably, Figure 2 provides a more nuanced representation of nodes that are not connected to any clusters.

Both figures corroborate the findings from structural analysis, revealing a sparse network with distinct clusters and influential nodes.

Both Gephi Files are available for download: https://github.com/mantissg/DAT6004_WRIT1

#Task 2: Beyond Basic SNA

##2.1 - Limitations of Structural Analysis of the Social Networks

Structural analysis of social networks, while a valuable tool for understanding the patterns and relationships within a network, has its limitations. One limitation is the static nature of structural analysis, which often fails to capture the dynamic and evolving nature of social networks. Gephi provides support for a temporal dataset, however this comes with it's own limitations.

Another limitation lies in the oversimplification of relationships. Structural analysis often reduces complex connections to binary relationships, ignoring the nuances and strength of ties. Weighting can be used, but this oversimplification can lead to a superficial understanding of the network, missing crucial information about the nature of connections.

To overcome these limitations, researchers can employ dynamic network analysis and incorporate qualitative data. Dynamic network analysis considers changes over time, offering a more accurate representation of evolving relationships. Qualitative data can provide additional insights into the quality and nature of connections, enhancing the depth of structural analysis.

In conclusion, a combination of dynamic network analysis and qualitative data can overcome the limitations of static structural analysis, offering a more comprehensive and nuanced understanding of social networks.

##2.2 - Issues Facing the Local Community


In addressing issues facing my local community, the integration of location information and social media can significantly enhance efforts to tackle various challenges. For example, community engagement in waste management. By utilizing location-based services and social media platforms, residents can share real-time information about overflowing bins or illegal dumping sites. This data, when aggregated, enables local authorities to identify problem areas, allocate resources efficiently, and implement targeted cleanup initiatives.

Additionally, for public safety concerns such as crime prevention, the amalgamation of location data and social media can be invaluable. A community-driven platform could allow users to report suspicious activities, share safety tips, and even coordinate neighborhood watch efforts. Law enforcement can then utilize this information to prioritize patrols and respond promptly to emerging issues.

Moreover, during times of emergencies, the combination of location-based alerts and social media updates can facilitate swift communication. Local authorities can use geotagged messages to provide evacuation instructions, share real-time updates on natural disasters, or coordinate relief efforts effectively.

By harnessing the power of location information and social media, communities can foster a more connected and responsive environment, promoting collective problem-solving and enhancing the overall well-being of residents.

#Task 3: Social Media Text Analysis

###Introduction


In this task, I opted to gather text data from forums of the top five Formula 1 teams on forums.autosport.com using Scrapy. Following the scraping process, I employed various text preprocessing techniques and language modeling to perform sentiment analysis.

Text preprocessing steps encompassed several methods such as lowercase conversion, tokenization, stop word removal, special character elimination, whitespace removal, stemming (utilizing the Porter Stemmer), lemmatization (employing the WordNet Lemmatizer), and handling null values.

For language modeling, I utilized the BERT model along with its tokenizer. Subsequently, I employed Spacy for conducting sentiment analysis.

In [None]:
#!pip install scrapy

In [None]:
#pip install transformers

In [None]:
# import required libraries

import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from transformers import BertTokenizer, BertModel
import torch
import spacy
from spacy import displacy
import plotly.graph_objects as go
import numpy as np
from spacytextblob.spacytextblob import SpacyTextBlob

##3.1 - Web Scrapping

In [None]:
!scrapy startproject f1_forum_scraper

New Scrapy project 'f1_forum_scraper', using template directory '/usr/local/lib/python3.10/dist-packages/scrapy/templates/project', created in:
    /content/f1_forum_scraper

You can start your first spider with:
    cd f1_forum_scraper
    scrapy genspider example example.com


In [None]:
cd f1_forum_scraper

/content/f1_forum_scraper


In [None]:
%%writefile /content/f1_forum_scraper/f1_forum_scraper/spiders/forum_spider.py

import scrapy

class ForumSpider(scrapy.Spider):
    name = 'forum'
    custom_settings = {
        'ROBOTSTXT_OBEY': False,  # Ignore robots.txt
    }
    # URLs of top 5 constructor forums to scrape
    start_urls = [
        'https://forums.autosport.com/topic/223124-2023-amg-mercedes-petronas-f1-team-thread/',
        'https://forums.autosport.com/topic/223191-mclaren-2023-team-thread/',
        'https://forums.autosport.com/topic/223135-2023-aston-martin-f1-team-thread/',
        'https://forums.autosport.com/topic/223238-2023-oracle-red-bull-racing-team/',
        'https://forums.autosport.com/topic/223474-2023-scuderia-ferrari-f1-team-thread/',
    ]

    def parse(self, response):
        # Extract the team heading
        heading = response.css('h1.ipsType_pagetitle::text').get()

        parent_divs = response.xpath('//div[@class="post_wrap"]')

        for parent_div in parent_divs:
            # Extract the comment posted date
            posted_date = parent_div.css('abbr.published::attr(title)').get()
            # Split the date to extract only the date part (excluding time)
            posted_date = posted_date.split('T')[0]

        comment_divs = response.xpath('//div[@itemprop="commentText"]')

        for comment_div in comment_divs:
            # Extract the comment text
            comment_text = comment_div.xpath('string()').get().strip()

            # Return results
            yield {
                'Team': heading,
                'PostedDate': posted_date,
                'CommentText': comment_text
            }

        # Iterate through pagination of each thread
        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)

Writing /content/f1_forum_scraper/f1_forum_scraper/spiders/forum_spider.py


In [None]:
!scrapy crawl forum -o forum_data.json

In [None]:
df = pd.read_json('forum_data.json')

In [None]:
df_orig = df.copy()

In [None]:
df.head()

Unnamed: 0,Team,PostedDate,CommentText
0,2023 Oracle Red Bull Racing Team,2023-02-03,"New year, new kit. You and a friend could be o..."
1,2023 Oracle Red Bull Racing Team,2023-02-03,
2,2023 Oracle Red Bull Racing Team,2023-02-03,Oracle Red Bull Racing\n \n\n \n\n\n\n\n\n\n@r...
3,2023 Oracle Red Bull Racing Team,2023-02-03,The Honda topman @Wazari posted this last nigh...
4,2023 Oracle Red Bull Racing Team,2023-02-03,Max bought his 3th Ferrari : an SF90 Stradale...


In [None]:
df.shape

(20180, 3)

##3.2 - Text Preprocessing

In [None]:
df = df.drop_duplicates('CommentText', keep='first')
df_orig = df_orig.drop_duplicates('CommentText', keep='first')

In [None]:
# Lowercasing
df['CommentText'] = df['CommentText'].str.lower()

# Tokenization
nltk.download('punkt')
df['CommentText'] = df['CommentText'].apply(lambda text: word_tokenize(text))

# Stop Word Removal
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
df['CommentText'] = df['CommentText'].apply(lambda tokens: [word for word in tokens if word not in stop_words])

# Special Character Removal
df['CommentText'] = df['CommentText'].apply(lambda tokens: [re.sub(r'[^a-zA-Z0-9]', '', word) for word in tokens])

# Whitespace Removal
df['CommentText'] = df['CommentText'].apply(lambda tokens: [word.strip() for word in tokens])

# Stemming (using Porter Stemmer)
stemmer = PorterStemmer()
df['CommentText'] = df['CommentText'].apply(lambda tokens: [stemmer.stem(word) for word in tokens])

# Lemmatization (using WordNet Lemmatizer)
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
df['CommentText'] = df['CommentText'].apply(lambda tokens: [lemmatizer.lemmatize(word) for word in tokens])

# Dropping nulls
df = df.dropna()

# Define a lambda function to check for empty lists
is_empty_list = lambda x: isinstance(x, list) and len(x) == 0

# Use apply to create a mask for rows with empty lists
mask = df['CommentText'].apply(is_empty_list)

# Filter the DataFrame to keep rows without empty lists
df = df[~mask]

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


In [None]:
df.head(10)

Unnamed: 0,Team,PostedDate,CommentText
0,2023 Oracle Red Bull Racing Team,2023-02-03,"[new, year, , new, kit, , friend, could, set, ..."
2,2023 Oracle Red Bull Racing Team,2023-02-03,"[oracl, red, bull, race, , redbullrac, , new, ..."
3,2023 Oracle Red Bull Racing Team,2023-02-03,"[honda, topman, , wazari, post, last, night, t..."
4,2023 Oracle Red Bull Racing Team,2023-02-03,"[max, bought, 3th, ferrari, , sf90, stradal, a..."
5,2023 Oracle Red Bull Racing Team,2023-02-03,"[new, team, gear, instagram, well, , best, ]"
6,2023 Oracle Red Bull Racing Team,2023-02-03,"[report, rb19, visibl, differ, concept, , , bo..."
7,2023 Oracle Red Bull Racing Team,2023-02-03,"[report, rb19, visibl, differ, concept, , , bo..."
8,2023 Oracle Red Bull Racing Team,2023-02-03,"[http, , theracecom, , ferentruleset, specif, ..."
9,2023 Oracle Red Bull Racing Team,2023-02-03,"[power, red, bull, , honda, ]"
10,2023 Oracle Red Bull Racing Team,2023-02-03,"[report, rb19, visibl, differ, concept, , , bo..."


In [None]:
df.shape

(20112, 3)

In [None]:
df2 = df.head(100)

In [None]:
df2.shape

(100, 3)

##3.3 - Language Modelling

In [None]:
# Load a pre-trained BERT model and tokenizer
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)

# Define a function to tokenize and get BERT embeddings for text
def get_bert_embeddings(text):
    tokens = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**tokens)
    # Extract embeddings from the model
    embeddings = outputs.last_hidden_state.mean(dim=1).squeeze().numpy()
    return embeddings

# Apply the function to your DataFrame and create a new column for embeddings
df2['bert_embeddings'] = df2['CommentText'].apply(get_bert_embeddings)

# Now, the 'bert_embeddings' column contains BERT embeddings for each text entry

Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df2['bert_embeddings'] = df2['CommentText'].apply(get_bert_embeddings)


In [None]:
df2.head()

Unnamed: 0,Team,PostedDate,CommentText,bert_embeddings
0,2023 Oracle Red Bull Racing Team,2023-02-03,"[new, year, , new, kit, , friend, could, set, ...","[[-0.3605111, -0.068231694, -0.04026461, 0.176..."
2,2023 Oracle Red Bull Racing Team,2023-02-03,"[oracl, red, bull, race, , redbullrac, , new, ...","[[-0.23701124, -0.3489919, -0.10077973, -0.303..."
3,2023 Oracle Red Bull Racing Team,2023-02-03,"[honda, topman, , wazari, post, last, night, t...","[[-0.070129044, 0.02304559, -0.08432068, -0.03..."
4,2023 Oracle Red Bull Racing Team,2023-02-03,"[max, bought, 3th, ferrari, , sf90, stradal, a...","[[-0.069648385, 2.148375e-05, 0.35641503, 0.11..."
5,2023 Oracle Red Bull Racing Team,2023-02-03,"[new, team, gear, instagram, well, , best, ]","[[-0.12730874, 0.060565844, -0.17932041, 0.113..."


##3.4 - Sentiment Analysis

In [None]:
nlp = spacy.load("en_core_web_sm")

In [None]:
# Check which components are available in the nlp pipeline
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [3]:
#!python -m textblob.download_corpora

In [2]:
#pip install spacytextblob

In [None]:
# Load a spaCy model (you can choose a different model if needed)
nlp = spacy.load("en_core_web_sm")

# Add SpacyTextBlob to the spaCy pipeline
nlp.add_pipe('spacytextblob')

# Define a function to analyze sentiment using SpacyTextBlob
def analyze_sentiment(text):
    doc = nlp(text)
    sentiment = doc._.blob.sentiment
    return sentiment

# Apply the sentiment analysis function to the DataFrame
df_orig['Sentiment'] = df_orig['CommentText'].apply(analyze_sentiment)

In [None]:
# Extract polarity from the 'Sentiment' column
df_orig["Polarity"] = df_orig["Sentiment"].apply(lambda x: x[0])

# Extract subjectivity from the 'Sentiment' column
df_orig["Subjectivity"] = df_orig["Sentiment"].apply(lambda x: x[1])

In [None]:
df_orig["Sentiment Label"] = df_orig["Polarity"].apply(lambda x: "Positive" if x > 0 else ("Negative" if x < 0 else "Neutral"))

In [None]:
# Combine "Positive" and "Neutral" sentiment labels into a single group
df_orig['Sentiment Label Grouped'] = df_orig['Sentiment Label'].apply(lambda x: 'Positive/Neutral' if x in ['Positive', 'Neutral'] else 'Negative')

In [None]:
df_orig['index1'] = df_orig.index

In [None]:
df_orig.head()

Unnamed: 0,Team,PostedDate,CommentText,Sentiment,Polarity,Subjectivity,Sentiment Label,Sentiment Label Grouped,index1
0,2023 Oracle Red Bull Racing Team,2023-02-03,"New year, new kit. You and a friend could be o...","(0.202020202020202, 0.5808080808080808)",0.20202,0.580808,Positive,Positive/Neutral,0
1,2023 Oracle Red Bull Racing Team,2023-02-03,,"(0.0, 0.0)",0.0,0.0,Neutral,Positive/Neutral,1
2,2023 Oracle Red Bull Racing Team,2023-02-03,Oracle Red Bull Racing\n \n\n \n\n\n\n\n\n\n@r...,"(0.10227272727272727, 0.3409090909090909)",0.102273,0.340909,Positive,Positive/Neutral,2
3,2023 Oracle Red Bull Racing Team,2023-02-03,The Honda topman @Wazari posted this last nigh...,"(0.09551282051282052, 0.427991452991453)",0.095513,0.427991,Positive,Positive/Neutral,3
4,2023 Oracle Red Bull Racing Team,2023-02-03,Max bought his 3th Ferrari : an SF90 Stradale...,"(0.0, 0.0)",0.0,0.0,Neutral,Positive/Neutral,4


In [None]:
unique_values = df_orig['Team'].unique()

print(unique_values)

['2023  Oracle Red Bull Racing Team' '2023 Aston Martin F1 Team Thread'
 '2023 Scuderia Ferrari F1 Team Thread' 'McLaren 2023 team thread'
 '2023 AMG Mercedes-Petronas F1 Team Thread']


In [None]:
# Define a dictionary of patterns and replacements
team_patterns = {
    '.*Mercedes.*': 'Mercedes',
    '.*Red Bull.*': 'Red Bull Racing',
    '.*McLaren.*': 'McLaren',
    '.*Ferrari.*': 'Ferrari',
    '.*Aston Martin.*': 'Aston Martin'
    # Add more patterns as needed
}

# Iterate through the patterns and apply replacements
for pattern, replacement in team_patterns.items():
    df_orig['Team'] = df_orig['Team'].str.replace(pattern, replacement, regex=True)

In [None]:
df_orig.head()

Unnamed: 0,Team,PostedDate,CommentText,Sentiment,Polarity,Subjectivity,Sentiment Label,Sentiment Label Grouped,index1
0,Red Bull Racing,2023-02-03,"New year, new kit. You and a friend could be o...","(0.202020202020202, 0.5808080808080808)",0.20202,0.580808,Positive,Positive/Neutral,0
1,Red Bull Racing,2023-02-03,,"(0.0, 0.0)",0.0,0.0,Neutral,Positive/Neutral,1
2,Red Bull Racing,2023-02-03,Oracle Red Bull Racing\n \n\n \n\n\n\n\n\n\n@r...,"(0.10227272727272727, 0.3409090909090909)",0.102273,0.340909,Positive,Positive/Neutral,2
3,Red Bull Racing,2023-02-03,The Honda topman @Wazari posted this last nigh...,"(0.09551282051282052, 0.427991452991453)",0.095513,0.427991,Positive,Positive/Neutral,3
4,Red Bull Racing,2023-02-03,Max bought his 3th Ferrari : an SF90 Stradale...,"(0.0, 0.0)",0.0,0.0,Neutral,Positive/Neutral,4


In [None]:
df_orig.describe()

Unnamed: 0,Polarity,Subjectivity,index1
count,20113.0,20113.0,20113.0
mean,0.116843,0.463536,10099.126933
std,0.230011,0.215851,5823.244289
min,-1.0,0.0,0.0
25%,0.0,0.36558,5058.0
50%,0.102105,0.48,10101.0
75%,0.229448,0.583333,15142.0
max,1.0,1.0,20179.0


In [None]:
df_orig.dtypes

Team                        object
PostedDate                  object
CommentText                 object
Sentiment                   object
Polarity                   float64
Subjectivity               float64
Sentiment Label             object
Sentiment Label Grouped     object
index1                       int64
dtype: object

In [None]:
df_orig['PostedDate'] = pd.to_datetime(df_orig['PostedDate'])

In [None]:
df_orig['Week_Number'] = df_orig['PostedDate'].dt.isocalendar().week

In [None]:
df_orig.head(5)

Unnamed: 0,Team,PostedDate,CommentText,Sentiment,Polarity,Subjectivity,Sentiment Label,Sentiment Label Grouped,index1,Week_Number
0,Red Bull Racing,2023-02-03,"New year, new kit. You and a friend could be o...","(0.202020202020202, 0.5808080808080808)",0.20202,0.580808,Positive,Positive/Neutral,0,5
1,Red Bull Racing,2023-02-03,,"(0.0, 0.0)",0.0,0.0,Neutral,Positive/Neutral,1,5
2,Red Bull Racing,2023-02-03,Oracle Red Bull Racing\n \n\n \n\n\n\n\n\n\n@r...,"(0.10227272727272727, 0.3409090909090909)",0.102273,0.340909,Positive,Positive/Neutral,2,5
3,Red Bull Racing,2023-02-03,The Honda topman @Wazari posted this last nigh...,"(0.09551282051282052, 0.427991452991453)",0.095513,0.427991,Positive,Positive/Neutral,3,5
4,Red Bull Racing,2023-02-03,Max bought his 3th Ferrari : an SF90 Stradale...,"(0.0, 0.0)",0.0,0.0,Neutral,Positive/Neutral,4,5


In [None]:
#team_colors dictionary
team_colors = {
    "Mercedes": "#6cd3bf",
    "Red Bull Racing": "#3671c6",
    "Ferrari": "#f91536",
    "McLaren": "#f58020",
    "Aston Martin": "#358c75"
}

# Preprocessing the dataset to extract only the necessary columns
categories = ['Negative', 'Neutral', 'Positive']

# Construct a pivot table with the column
gfg = pd.pivot_table(
    df_orig,
    index='Team',
    columns='Sentiment Label',
    values='index1',
    aggfunc='count'
)

# Include the sentiments - negative, neutral, and positive
gfg = gfg[categories]

# Representing negative sentiment with negative numbers
gfg.Negative = gfg.Negative * -1

df = gfg

# Creating a Figure
Diverging = go.Figure()

# Iterating over the columns
for col in df.columns[4:]:
    # Adding a trace and specifying the parameters for negative sentiment
    Diverging.add_trace(go.Bar(
        x=-df[col].values,
        y=df.index,
        orientation='h',
        name=col,
        marker=dict(color=np.where(df[col] < 0, 'gray', [team_colors[Team] for Team in df.index])),
        legendgroup='Teams',
        customdata=df[col],
        hovertemplate="%{y}: %{customdata}"
    ))

for col in df.columns:
    # Adding a trace and specifying the parameters for positive and neutral sentiment
    Diverging.add_trace(go.Bar(
        x=df[col],
        y=df.index,
        orientation='h',
        legendgroup='Teams',  # Set legend group to 'Teams' for all traces
        name=col,
        marker=dict(
            color=np.where(df[col] < 0, 'gray', [team_colors[Team] for Team in df.index]),
            opacity=0.5 if col == 'Neutral' else 1  # Set opacity to 50% for Neutral bars
        ),
        hovertemplate="%{y}: %{x}"
    ))

# Define plot layout
Diverging.update_layout(
    barmode='relative',
    height=600,
    width=1200,
    xaxis_title="Number of Comments",
    yaxis_title="Team",
    yaxis=dict(visible=True, showticklabels=True, showgrid=False),  # Hide y-axis grid lines
    xaxis=dict(zeroline=True, showgrid=False),  # Show the zeroline at x=0 and hide x-axis grid lines
    title=dict(text="Diverging Bar Chart Showing Change in Sentiment of Top 5 F1 Construcotrs Forum Threads During the 2023 Season",
          y=0.9, x=0.5, xanchor='center', yanchor='top'),
    yaxis_autorange='reversed',
    bargap=0.5,
    showlegend=False
)

# Plot chart
Diverging

In [None]:
count_by_category = df_orig.groupby(['Team', 'Sentiment Label']).size().reset_index(name='Count')

In [None]:
count_by_category

Unnamed: 0,Team,Sentiment Label,Count
0,Aston Martin,Negative,680
1,Aston Martin,Neutral,454
2,Aston Martin,Positive,1822
3,Ferrari,Negative,315
4,Ferrari,Neutral,161
5,Ferrari,Positive,1049
6,McLaren,Negative,1154
7,McLaren,Neutral,792
8,McLaren,Positive,4419
9,Mercedes,Negative,1468


In [None]:
# race names and dates
race_dates_df = pd.DataFrame({
    'Event': ['BAHRAIN','SAUDI ARABIA','AUSTRALIA','AZERBAIJAN','MIAMI','MONACO','SPAIN','CANADA','AUSTRIA','BRITAIN','HUNGARY','BELGIUM','DUTCH','ITALY','SINGAPORE','JAPAN','QATAR','US','MEXICO','BRAZIL','LAS VEGAS','ABU DHABI'],
    'Date': ['05/03/2023', '19/03/2023', '02/04/2023', '30/04/2023','07/05/2023','28/05/2023','04/06/2023','18/06/2023','02/07/2023','09/07/2023','23/07/2023','30/07/2023','27/08/2023','03/09/2023','17/09/2023','24/09/2023','08/10/2023','22/10/2023','29/10/2023','05/11/2023','19/11/2023','26/11/2023']
})

In [None]:
race_dates_df['Date'] = pd.to_datetime(race_dates_df['Date'])


Parsing dates in DD/MM/YYYY format when dayfirst=False (the default) was specified. This may lead to inconsistently parsed dates! Specify a format to ensure consistent parsing.



In [None]:
race_dates_df['Week_Number'] = race_dates_df['Date'].dt.isocalendar().week

In [None]:
race_dates_df.head(5)

Unnamed: 0,Event,Date,Week_Number
0,BAHRAIN,2023-05-03,18
1,SAUDI ARABIA,2023-03-19,11
2,AUSTRALIA,2023-02-04,5
3,AZERBAIJAN,2023-04-30,17
4,MIAMI,2023-07-05,27


In [None]:
# team_colors dictionary
team_colors = {
    "Mercedes": "#6cd3bf",
    "Red Bull Racing": "#3671c6",
    "Ferrari": "#f91536",
    "McLaren": "#f58020",
    "Aston Martin": "#358c75"
}

# Preprocessing the dataset to extract only the necessary columns
categories = ['Negative', 'Neutral', 'Positive']

# Create a function to update the traces based on the week_number
def update_traces(week_number):
    # Filter data for the specific week and all previous weeks
    df_subset = df_orig[df_orig['Week_Number'] <= week_number]

    # Group by 'Team' and 'Sentiment Label' and calculate cumulative sum
    gfg = df_subset.groupby(['Team', 'Sentiment Label']).size().unstack(fill_value=0)
    gfg = gfg[categories]
    gfg.Negative = gfg.Negative * -1

    # Calculate cumulative sum separately for each team
    gfg_cumsum = gfg.groupby(level=0).cumsum(axis=0)

    # Creating a Figure
    Diverging = go.Figure()

    for col in gfg_cumsum.columns[4:]:
        Diverging.add_trace(go.Bar(
            x=-gfg_cumsum[col].values,
            y=gfg_cumsum.index,
            orientation='h',
            name=col,
            marker=dict(color=np.where(gfg_cumsum[col] < 0, 'gray', [team_colors[Team] for Team in gfg_cumsum.index])),
            legendgroup='Sentiments',
            customdata=gfg_cumsum[col],
            hovertemplate="%{y}: %{customdata}"
        ))

    for col in gfg_cumsum.columns:
        Diverging.add_trace(go.Bar(
            x=gfg_cumsum[col],
            y=gfg_cumsum.index,
            orientation='h',
            legendgroup='Sentiments',
            name=col,
            marker=dict(
                color=np.where(gfg_cumsum[col] < 0, 'gray', [team_colors[Team] for Team in gfg_cumsum.index]),
                opacity=0.5 if col == 'Neutral' else 1
            ),
            hovertemplate="%{y}: %{x}"
        ))

    Diverging.update_layout(
        barmode='relative',
        height=400,
        width=1200,
        xaxis_title="Cumulative Number of Comments (Per Week)",
        yaxis_title="Team",
        yaxis=dict(visible=True, showticklabels=True, showgrid=False),
        xaxis=dict(zeroline=True, showgrid=False),
        title=dict(
            text=f"Diverging Bar Chart Showing Sentiment of top 5 F1 Construtors Teams Forum Threads During 2023 Season",
            y=0.9, x=0.5, xanchor='center', yanchor='top'
        ),
        yaxis_autorange='reversed',
        bargap=0.5,
        showlegend=False
    )

    return Diverging

# Get unique week numbers
unique_weeks = sorted(df_orig['Week_Number'].unique())

# Create frames as dictionaries
frames = [go.Frame(data=update_traces(week).data, name=str(week)) for week in unique_weeks]

# Create an animated figure
animated_fig = go.Figure(
    data=update_traces(unique_weeks[0]).data,
    frames=frames,
    layout=update_traces(unique_weeks[0]).layout
)

# Get race information for labels
race_labels = race_dates_df[race_dates_df['Week_Number'].isin(unique_weeks)]

# Update layout to include animation settings
animated_fig.update_layout(
    updatemenus=[
        dict(
            type='buttons',
            showactive=False,
            buttons=[
                dict(
                    label='Play',
                    method='animate',
                    args=[None, dict(frame=dict(duration=500, redraw=True), fromcurrent=True)]
                )
            ],
            x=-0.1,  # Adjust the x position
            xanchor='left',  # Set the x anchor to 'left'
            y=-0.25,  # Set the y position to 0
            yanchor='bottom'  # Set the y anchor to 'bottom'
        )
    ],
    sliders=[dict(
        active=0,
        steps=[
            dict(
                label=f"{race['Event']} ({race['Date']})",
                method='animate',
                args=[
                    [str(race['Week_Number'])],
                    dict(frame=dict(duration=300, redraw=True), mode='immediate', transition=dict(duration=0))
                ]
            ) for _, race in race_labels.iterrows()
        ],
        #x=0.1,
        #y=0,
        #yanchor='bottom'
        #len=0.9,  # Set the length of the slider
        #tickangle=-45,  # Rotate the tick labels by 45 degrees
    )]
)

# Set a fixed x-axis range
animated_fig.update_xaxes(range=[-1500, 5500])

# Show animation with manual slider
animated_fig.show()

##References

Knickerbocker, D. (2023) Network Science with Python: Explore the Networks Around Us Using Network Science, Social Network Analysis, and Machine Learning. 1st edn. Birmingham: Packt Publishing, Limited.

Eswaramurthi, A. (2021). A Guide to Social Network Analysis and its Use Cases. [online] LatentView Analytics. Available at: https://www.latentview.com/blog/a-guide-to-social-network-analysis-and-its-use-cases/.

Iriondo, R. and Iriondo, R. (n.d.). Natural Language Processing (NLP) with Python — Tutorial – Towards AI — The World’s Leading AI and Technology Publication. [online] Available at: https://towardsai.net/p/nlp/natural-language-processing-nlp-with-python-tutorial-for-beginners-1f54e610a1a0.

ScrapyScrapy.org. (2020). Scrapy | A Fast and Powerful Scraping and Web Crawling Framework. [online] Available at: https://scrapy.org/.

spaCy. (2015). spaCy · Industrial-strength Natural Language Processing in Python. [online] Available at: https://spacy.io/.

Nayak, P. (2019). Understanding searches better than ever before. [online] Google. Available at: https://blog.google/products/search/search-language-understanding-bert/.

The Autosport Forums. (n.d.). Autosport Forums. [online] Available at: https://forums.autosport.com.

snap.stanford.edu. (n.d.). SNAP: Network datasets: Higgs Twitter Dataset. [online] Available at: https://snap.stanford.edu/data/higgs-twitter.html.

GeeksforGeeks. (2021). Diverging Bar Chart using Python. [online] Available at: https://www.geeksforgeeks.org/diverging-bar-chart-using-python/ [Accessed 7 Dec. 2023].

‌