<!--

# Final Report - Work in Progress
- Research Hypothesis / Questions:
    - Is Formula 1 fandom Toxic?
    - Are there specific groups that show more toxic behaviour than others?
    - Is the toxicity a "self-made" problem of Formula 1?
- APIs: Youtube
    - (Not Reddit, as posts are often off topic, especially during the off-season that we are currently in)
- Methods:
    - TBD
    - Dictionary
        - Formula 1 specific words that are toxic
        - racism / ethnic slurs -> [@ethnic_slurs]
        - toxicity -> [@orthrus-lexicon_orthrus_2022]
        - hate speech -> [@van_der_vegt_grievance_2021]
        - insults -> [@van_der_vegt_grievance_2021]
    - Transformer classifier
        - sentiment -> cardiffnlp/twitter-roberta-base-sentiment-latest [@tweet_sentiment_classifier]
        - racism -> jaumefib/datathon-against-racism [@hate_speech_classifier]
        - hate speech -> Hate-speech-CNERG/dehatebert-mono-english [@racism_classifier]
    - statistical analysis
        - group toxic behavior by drivers and teams
        - group by topics
            - topic modelling?
- Contents:
    - Introduction
        - What is Formula 1
        - Why do we need to analyze this
        - introduce the three research questions / hypothesis
    - Fundamentals
        - Formula 1
        - What is fandom
          - 
        - Defining toxic fan behavior
        - Youtube API
        - Maybe explaining the used methods?
    - Concept
        - What will be done
        - How will i be doing it
    - Creating the Dataset
        - Explain Dataset creation
    - Applying Method 1
    - Applying Method 2
    - Results

-->

# Analysing Toxicity in Formula 1 Fandom - Computational Analysis of Communications Final
Author: Leon Knorr

Matr-Nr: 1902854

## Introduction
Formula 1 is the highest class of international racing for open-wheel single-seater formula racing cars and is generally considered the most competitive, fastest and hardest class of motor racing. Since its first season in 1950, Formula 1 has visited many different countries, where the best drivers in the world race against each other in teams of two to determine the best driver and the best team on the Formula 1 grid [@about_f1]. These events are attended by thousands of fans, with millions more following them on television and social media. Formula 1's popularity is growing rapidly, driven by the release of Netflix's Drive to Survive and by the 2021 season, one of the closest and most entertaining in the history of the sport: after a full season of controversy, drama and intense on-track battles, Red Bull's Max Verstappen beat Mercedes driver Lewis Hamilton in the season finale under controversial circumstances. However, reports of toxic and abusive fan behaviour at events and in comment sections on social media are accumulating and cast an ugly shadow over Formula 1's latest successes [@woodhouse_scary_2022].
As reports of toxic and abusive fan behaviour on social media and at live events increase, Formula 1, its fans and its drivers are taking a stand against toxicity in the Formula 1 community. However, an independent scientific analysis of the topic is missing, so the accusations currently lack a solid empirical foundation. Research into the toxicity of Formula 1 fandom is therefore necessary to understand the problem and where it originates, and to build a foundation for future measures that make attending Formula 1 events and engaging with the media around them safer and more enjoyable. As a first step in this direction, this report analyses YouTube comments on the official Formula 1 channel in order to determine:

- Is the Formula 1 fandom toxic?
- Are there specific groups that are more toxic than others?
- Is the toxicity a "self-made" problem of Formula 1, and where does it originate?

## Fundamentals
In this chapter the necessary fundamental knowledge is presented.

### Formula 1
Formula 1 is the world's most prestigious motor racing competition, as well as the world's most popular annual sporting series [@about_f1]. It is the highest class of international open-wheel single-seater formula racing. The first Formula 1 competition was held in 1950. Since then, the competition for the World Drivers' Championship (WDC), which determines the world's best driver, and the World Constructors' Championship (WCC), which determines the best team, has been held annually and is sanctioned by the Fédération Internationale de l'Automobile (FIA). During the competition (also called a season), Formula 1 visits a variety of countries and racing tracks. Each event (called a Grand Prix) is attended by thousands of people, with millions watching from home [@formula_1_2023]. All rights to the Formula 1 brand and the competition itself are owned by Formula One World Championship Limited, a corporation that provides media distribution and promotion services and controls the contracts, distribution and commercial management of the rights and licences of Formula 1 [@formula_1_limited_company_profile]. The term Formula 1 is used to describe both the corporation and the competition, as neither can exist without the other.

### What is Fandom
According to Cornel Sandvoss, a fandom is a community of people who regularly consume a given popular narrative or text with great emotional involvement [@toxic_fandom]. The members of the community are called fans, a short form of "fanatic" [@arouh_toxic_2020]. In other words, a fandom is a community of people who are fanatic about a popular narrative or text, such as a TV series, a movie franchise or a sport.

Becoming a fan starts with the adoption of a fan identity centred on a fan object, so fandom can be a powerful means of defining the self. The fan object can be anything that people can be fanatic about, from a physical object such as trains to a virtual asset such as a movie franchise. By taking part in a fandom, people therefore express themselves through an identity they have chosen for themselves. As a result, fans may come to see the fan object as an extension of themselves and feel personally threatened when the fan object faces a threat, such as an accusation [@toxic_fandom]. In addition to forming a strong part of their identity, fandom makes fans feel more socially connected: studies indicate that even fans who do not interact with other members of a fan community still perceive themselves as part of that community. Fans therefore become not only personally but also socially invested in their fandom [@toxic_fandom].

Because fans build such a strong connection to their fan object, the point in time at which this identity was adopted also plays a role. For example, many people become fans of a TV series, franchise or sport in their childhood, which often leads them to feel entitled to have the fan object preserved in a form they deem acceptable. This behaviour is called fan entitlement. A good example is the reaction to the new movies and series in the Lord of the Rings and Star Wars franchises: most fan communities of these franchises were outraged by the new characters and storylines, with many people claiming that these "ruined their childhood" [@toxic_fandom].

From an economic point of view, fans and fan cultures are seen as ideal customers. They are eager to get their hands on the newest products and are a stable source of recurring purchases, since intense consumption is considered part of the fan identity [@arouh_toxic_2020].

### Defining Toxic Fan Behaviour

### Youtube API

## Concept

## The Dataset

In [2]:
from dotenv import dotenv_values
import googleapiclient.discovery
import pandas as pd

api_keys = dotenv_values("keys.env")
api_service_name = "youtube"
api_version = "v3"
api_key = api_keys["YOUTUBE_API_KEY"]
max_results = 1000
youtube_api = googleapiclient.discovery.build(api_service_name, api_version, developerKey = api_key)

In [None]:
# Look up the official Formula 1 channel by its legacy username.
Formula1_official_channel = youtube_api.channels().list(part='snippet', forUsername='Formula1').execute()['items'][0]
# search().list accepts a maxResults of at most 50, so results are requested
# page by page until max_results video IDs are collected or no page is left.
page_size = 50
videos_after_2020 = youtube_api.search().list(channelId=Formula1_official_channel["id"],
        maxResults=page_size,
        publishedAfter="2020-01-01T00:00:00Z",
        type='video',  # exclude playlists and channels, which carry no videoId
        part='id').execute()
video_ids_after_2020 = [item['id']['videoId'] for item in videos_after_2020['items']]
while len(video_ids_after_2020) < max_results and "nextPageToken" in videos_after_2020:
    videos_after_2020 = youtube_api.search().list(channelId=Formula1_official_channel["id"],
        maxResults=page_size,
        publishedAfter="2020-01-01T00:00:00Z",
        type='video',
        part='id',
        pageToken=videos_after_2020["nextPageToken"]).execute()
    video_ids_after_2020 += [item['id']['videoId'] for item in videos_after_2020['items']]


In [None]:
df_list = []
for video_id in video_ids_after_2020:
    # Fetch metadata and engagement statistics for a single video.
    video_data = youtube_api.videos().list(part='snippet,statistics', id=video_id).execute()
    if not video_data['items']:
        continue  # the video was removed or is no longer accessible
    snippet = video_data['items'][0]['snippet']
    statistics = video_data['items'][0]['statistics']
    df_list.append(
    {
        "video_id": video_id,
        "title": snippet['title'],
        "description": snippet['description'],
        "channel": snippet['channelTitle'],
        "published_at": snippet['publishedAt'],
        "tags": snippet.get('tags'),
        # Counts arrive as strings and may be absent (e.g. disabled comments).
        "like_count": int(statistics.get('likeCount', 0)),
        "favorite_count": int(statistics.get('favoriteCount', 0)),
        "comment_count": int(statistics.get('commentCount', 0))
    })

videos = pd.DataFrame(df_list)
videos
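
The loop above issues one `videos().list` request per video. The `id` parameter of that endpoint also accepts a comma-separated list of IDs, so the lookups could be batched to save quota. The following is a minimal sketch under assumptions: `chunked` is a hypothetical helper that is not part of the notebook, the batch size of 50 is an assumed limit that should be checked against the API reference, and the API call is only indicated in a comment.

```python
from typing import Iterator, List


def chunked(ids: List[str], size: int = 50) -> Iterator[List[str]]:
    """Yield consecutive slices of `ids` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


# Hypothetical usage with the client built earlier in the notebook:
# for batch in chunked(video_ids_after_2020):
#     response = youtube_api.videos().list(part="snippet,statistics",
#                                          id=",".join(batch)).execute()
#     for item in response["items"]:
#         ...  # item["id"] identifies the video within the batch

# Offline demonstration with placeholder IDs (not real video IDs).
example_ids = [f"video{i}" for i in range(120)]
batches = list(chunked(example_ids))
```

With batching, each response item has to be matched back to its video through `item["id"]` instead of the loop variable.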

In [None]:
video_ids_after_2020 = videos.video_id.to_list()
video_ids_after_2020

In [None]:
df_list_comments = []
for video_id in video_ids_after_2020:
    # Skip videos without comments (or with comments disabled).
    if int(videos.loc[videos['video_id'] == video_id].comment_count.iloc[0]) == 0:
        continue
    # Only the first page (up to 50 threads), ordered by relevance, is collected.
    top_level_comments = youtube_api.commentThreads().list(part="snippet",
        maxResults=50,
        order="relevance",
        videoId=video_id).execute()['items']
    for thread in top_level_comments:
        top_level_comment = thread['snippet']['topLevelComment']
        df_list_comments.append(
        {
            "video_id": video_id,
            "id": top_level_comment['id'],
            "text": top_level_comment['snippet']['textDisplay'],
            # authorChannelId may be missing, so fall back to None.
            "user": top_level_comment['snippet'].get('authorChannelId', {}).get('value'),
            "like_count": top_level_comment['snippet']['likeCount'],
            "published_at": top_level_comment['snippet']['publishedAt'],
            "reply_count": thread['snippet']['totalReplyCount']
        })
        # Request replies only for threads that have any; at most 50 are collected.
        if thread['snippet']['totalReplyCount'] == 0:
            continue
        replies = youtube_api.comments().list(part="snippet",
            maxResults=50,
            parentId=top_level_comment['id']).execute()['items']
        for reply in replies:
            df_list_comments.append(
            {
                "video_id": video_id,
                "id": reply['id'],
                "text": reply['snippet']['textDisplay'],
                "user": reply['snippet'].get('authorChannelId', {}).get('value'),
                "like_count": reply['snippet']['likeCount'],
                "published_at": reply['snippet']['publishedAt'],
                "reply_count": 0
            })

comment_df: pd.DataFrame = pd.DataFrame(df_list_comments)
comment_df
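
The cell above reads only the first page of comment threads and of replies per request. If more comments per video are wanted, the `nextPageToken` pattern already used for the video search could be applied here as well. The sketch below is one possible way to do that; `collect_pages` and `fake_fetch` are hypothetical names, and the fake fetcher stands in for the API so that the example runs offline.

```python
from typing import Callable, Dict, List, Optional


def collect_pages(fetch_page: Callable[[Optional[str]], Dict], limit: int) -> List[Dict]:
    """Follow nextPageToken links until `limit` items are collected or no page is left."""
    items: List[Dict] = []
    token: Optional[str] = None
    while len(items) < limit:
        response = fetch_page(token)
        items.extend(response.get("items", []))
        token = response.get("nextPageToken")
        if token is None:
            break  # last page reached
    return items[:limit]


# Offline stand-in for paged API responses (made-up data for illustration).
_FAKE_PAGES = {
    None: {"items": [{"id": "a"}, {"id": "b"}], "nextPageToken": "p2"},
    "p2": {"items": [{"id": "c"}, {"id": "d"}], "nextPageToken": "p3"},
    "p3": {"items": [{"id": "e"}]},
}


def fake_fetch(token: Optional[str]) -> Dict:
    return _FAKE_PAGES[token]


all_items = collect_pages(fake_fetch, limit=10)
first_three = collect_pages(fake_fetch, limit=3)
```

In the notebook, `fetch_page` would wrap a call such as `youtube_api.commentThreads().list(..., pageToken=token).execute()`. How the client treats a `pageToken` of `None` on the first request is an assumption that should be verified before use. Every additional page also costs API quota.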

In [None]:
videos.to_pickle("datasets/video_data.pkl")
comment_df.to_pickle("datasets/comment_data.pkl")

In [3]:
videos: pd.DataFrame = pd.read_pickle("datasets/video_data.pkl")
comment_df: pd.DataFrame = pd.read_pickle("datasets/comment_data.pkl")

In [None]:
videos

### Dataset limitations

## Dictionary Analysis

### Orthrus Lexicon for Toxicity

In [None]:
comment_df

In [4]:
# Load the toxicity lexicon: one entry per line, blank lines are ignored.
with open("dictionaries/toxic_words.txt", encoding="utf-8") as toxic_words_file:
    set_of_toxic_words: set = {line.strip() for line in toxic_words_file if line.strip()}
set_of_toxic_words

{'jack***',
 'stink',
 'dago',
 'cretins',
 'opportunist',
 'chink',
 'kumming',
 'bigots',
 'ludicrously',
 'criminalmentally',
 'phuks',
 'rapist',
 'buttplug',
 'selfloving',
 'reek',
 'shithead',
 'idiota',
 'fanatical',
 'pucker',
 'mediocre',
 'egotist',
 'asssucker',
 'manboy',
 'vag',
 'whitetrash',
 'feeble-minded',
 'disgusted',
 'laughable',
 'cox',
 'commie',
 'psychotic',
 'queerbait',
 'pimpis',
 'faggs',
 'jan',
 'wop',
 'titfuck',
 'beastial',
 'blow job',
 'infidels',
 'numbskull',
 'fornicating',
 'snot',
 'mo-fo',
 'disgusting',
 'sicko',
 'neutered',
 'anti-french',
 'titties',
 'weenie',
 'boner',
 'masturbation',
 'antichrist',
 'dimwitted',
 'muslims',
 'fink',
 'gosh-darned',
 'idiotphobe',
 'phukked',
 'drumpf',
 'f*uck',
 'dicksucking',
 'charlatans',
 'assbanger',
 'fabricator',
 'dumbed',
 'peckerhead',
 'reproduce',
 'imbecile',
 'corrupted',
 'pompas',
 'sabotage',
 'goshdarned',
 'spurt',
 'stinks',
 'murdering',
 'fukin',
 'cumslut',
 'ridicule',
 'assgo

In [5]:
import numpy as np
from collections import Counter
from typing import Tuple

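def dictionary_analysis_over_set_intersection(dict_name: str, dict_set: set, data: pd.DataFrame) -> Tuple[pd.DataFrame, Counter]:
    """Count, per comment, how many distinct dictionary words it contains.

    Adds the column f"{dict_name}_word_count" to data (in place) and returns data
    together with a Counter of the number of comments each word occurs in.
    """
    dict_word_counter: Counter = Counter()
    dict_word_count: list = []
    for row in data.text:
        # Exact, case-sensitive match on space-separated tokens.
        dict_words_in_comment: set = set(row.split(" ")).intersection(dict_set)
        dict_word_counter.update(dict_words_in_comment)
        dict_word_count.append(len(dict_words_in_comment))
    data[f"{dict_name}_word_count"] = dict_word_count
    return data, dict_word_counter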

In [6]:
comment_df, toxic_word_counter = dictionary_analysis_over_set_intersection(dict_name="toxic", dict_set=set_of_toxic_words, data=comment_df)
comment_df.loc[comment_df["toxic_word_count"] > 0]

Unnamed: 0,video_id,id,text,user,like_count,published_at,reply_count,toxic_word_count
45,Z8wPGQhw4Pg,UgxXqdTRbrYXuxQQCVh4AaABAg.9CGSVWl_40q9CHkZgsp1SK,@Keisuke Takahasi ferrari needs some time at t...,UCiTlmx-EYVTmzQKDx4m4rgQ,1,2020-08-13T04:37:20Z,0,1
73,eioKgQUICjA,UgxcOZFJK8moiNCjIjp4AaABAg,The development of everything including the ca...,UC4rb1tv1SDYSkb2aV-b028g,256,2022-05-25T16:03:13Z,0,1
181,jlSXQuVnHAE,UgxurCAEQFF6vd1mye54AaABAg.9BKmPlNxJID9BMRaNqTUPt,@Rizaldi Ramdlani Pamungkas You weren&#39;t ev...,UCXv3XwEyoze6sutnSmNtmMw,0,2020-07-21T03:47:37Z,0,2
267,I1WEmbI12H4,UgwyhwIhgKevCF_hUyN4AaABAg.9AoJrrp9V0a9Aoki5Bw02a,God the 2019 French Grand Prix was dreadful,UCxzc_8degY9mHQp_LK46xdw,9,2020-07-07T16:30:00Z,0,1
270,5daN3RDsP80,UgzrRxvUoECK2M8vOAh4AaABAg.9VXqzFB0Hvy9VdqSMSaz08,@FIA Random Penalty Generator Machine You are ...,UCM9Sm_7O6qFUV_E6Ec4We-Q,0,2021-12-07T09:47:47Z,0,1
...,...,...,...,...,...,...,...,...
2493,7G7KewfdzTY,UgyHYEmG50jkG9rY0HR4AaABAg,"luke smith with the insane traction <a href=""h...",UCAaFqlgiHtYfJB8zYVt4Big,27,2022-07-13T20:56:21Z,0,1
2588,mYfBKflmgAQ,Ugzti_zPdpEc3gw5ZPd4AaABAg,The pure silence from the crowd is killing me 🫠😂😂,UCg5kkdme8ppRkmhiwZTqaxA,194,2022-12-10T11:18:13Z,3,1
2600,3oy0msSIkEI,UgwI2-SU-3i8bAQgQQp4AaABAg.9iNbXFAHftm9iNpYgga4xC,"@Oscar Arrieta though i love him, i immediatel...",UCoc2g8Cz_yboQOWie_Eq0iA,1,2022-11-13T18:32:20Z,0,1
2632,PdW4pLiOtL0,Ugzrrw9zFYyEfdCCDMZ4AaABAg,I think Mercedes will figure out their rear en...,UCNBRYTDssd7yL5tuGNpc-4w,258,2021-03-14T17:12:51Z,11,1


In [7]:
toxic_word_counter

Counter({'ass': 1,
         'insane': 13,
         'stfu': 1,
         'bullshit': 1,
         'God': 1,
         'hating': 3,
         'terrible': 3,
         'immature': 1,
         'garbage': 1,
         'crazy': 5,
         'trash': 5,
         'shills': 1,
         'a**': 1,
         'kick': 3,
         'monster': 5,
         'treacherous': 3,
         'blow': 1,
         'mediocre': 3,
         'clown': 3,
         'threats': 1,
         'useless': 1,
         'weird': 4,
         'beating': 5,
         'con': 2,
         'killing': 6,
         'rear': 6,
         'pathetic': 1,
         'chick': 1,
         'fake': 5,
         'deluded': 1,
         'bums': 1,
         'disgrace': 1,
         'cheating': 1,
         'ridiculously': 1,
         'fooled': 1,
         'aggressive': 1,
         'messed': 1,
         'duh': 1,
         'fukin': 1,
         'rat': 1,
         'beaten': 1,
         'weak': 2,
         'fkn': 1,
         'choke': 1,
         'fails': 2,
         'l': 1,
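
`dictionary_analysis_over_set_intersection` splits each comment on single spaces and compares the pieces case-sensitively, while the comment text still carries HTML entities such as `&#39;` and markup such as `<a href=...>`. A word followed by punctuation, or written in a different case than in the lexicon, is therefore missed. The sketch below shows one possible normalisation step; `tokenize` and `count_dictionary_hits` are hypothetical names, the regular expressions are one choice among many, and the token pattern only covers ASCII letters and digits.

```python
import html
import re
from collections import Counter
from typing import Iterable, List, Set, Tuple

_TAG = re.compile(r"<[^>]+>")                           # HTML tags in textDisplay
_TOKEN = re.compile(r"[a-z0-9*]+(?:['-][a-z0-9*]+)*")   # words, keeps a**, don't, mo-fo


def tokenize(text: str) -> List[str]:
    """Strip tags, decode HTML entities, lower-case, and split into word tokens."""
    cleaned = html.unescape(_TAG.sub(" ", text)).lower()
    return _TOKEN.findall(cleaned)


def count_dictionary_hits(texts: Iterable[str], dictionary: Set[str]) -> Tuple[List[int], Counter]:
    """Return distinct dictionary hits per text and the number of texts per word."""
    lowered = {word.lower() for word in dictionary}
    total: Counter = Counter()
    per_text: List[int] = []
    for text in texts:
        hits = set(tokenize(text)) & lowered
        total.update(hits)
        per_text.append(len(hits))
    return per_text, total


# Made-up sample comments and a toy dictionary, for illustration only.
sample = ["You weren&#39;t even close, that was TRASH!", "What a <b>clown</b>."]
per_text, total = count_dictionary_hits(sample, {"trash", "clown", "weren't"})
```

Normalisation raises recall but does nothing about ambiguous entries: words such as `rear`, `insane` or `killing` in the counter above are often used in a neutral or positive sense in a racing context, so the counts still need manual inspection.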

### Grievance Dictionary

### Ethnic Slurs

In [9]:
from os import listdir
import os.path

# Load every CSV file of the ethnic slurs dictionary and combine them into one DataFrame.
dict_files: list = [f for f in listdir("dictionaries/ethnic_slurs/") if f.endswith(".csv")]
dict_df: pd.DataFrame = pd.DataFrame()
for file in dict_files:
    part = pd.read_csv(os.path.join("dictionaries/ethnic_slurs", file))
    dict_df = pd.concat([part, dict_df])
dict_df.reset_index(inplace=True, drop=True)
# Each distinct value of the Term column becomes one dictionary entry.
ethnic_slurs_set: set = set(dict_df.Term.to_list())
dict_df

Unnamed: 0,Term,Location or origin,Targets,"Meaning, origin and notes",References
0,"Eight ball, 8ball",,Black people,"Referring to the black ball in pool. Slang, us...",
1,Eyetie,"United States, United Kingdom",Italian people,"Originated through the mispronunciation of ""It...",
2,"Dago, Dego","United States, Commonwealth","Italians, Spaniards, Portuguese people","Possibly derived from the Spanish name ""Diego""",
3,"Dago, Dego",United States,Italian people,,
4,Dal Khor,Urdu-speaking people,Indians and Pakistanis (specifically Punjabis),"The term literally translates to ""dal eater"", ...",
...,...,...,...,...,...
424,Huinca,"Argentina, Chile","Non-Mapuche Chileans, non-Mapuche Argentines",Mapuche term dating back at least to the Conqu...,
425,Hun,"United States, United Kingdom",German people,"(United States, United Kingdom) Germans, espec...",
426,Hun,Ireland,Protestants and British soldiers,A Protestant in Northern Ireland or historical...,
427,"Hunky, Hunk",United States,Central European laborers.,It originated in the coal regions of Pennsylva...,
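
Several `Term` cells in the table above hold more than one spelling, for example `Dago, Dego` or `Eight ball, 8ball`. Taken as a whole, such a cell becomes a single set element that can never equal one space-separated token of a comment. One possible preprocessing step is to split each cell on commas. This is a sketch under assumptions: `expand_terms` is a hypothetical helper, lower-casing presumes that the comments are lower-cased as well, and multi-word variants such as `eight ball` would still need phrase matching, which is not attempted here.

```python
from typing import Iterable, Set


def expand_terms(terms: Iterable[object]) -> Set[str]:
    """Split comma-separated spellings into individual lower-cased dictionary entries."""
    variants: Set[str] = set()
    for term in terms:
        if not isinstance(term, str):
            continue  # skip missing values such as NaN
        for variant in term.split(","):
            variant = variant.strip().lower()
            if variant:
                variants.add(variant)
    return variants


# Example cells copied from the Term column above, plus one missing value.
expanded = expand_terms(["Eight ball, 8ball", "Dago, Dego", "Hun", float("nan")])
```

Many entries of this dictionary are also ordinary words or names, so every hit should be read in context before it is counted as a slur.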


In [11]:
comment_df, ethnic_slurs_counter = dictionary_analysis_over_set_intersection(dict_name="ethnic_slurs", dict_set=ethnic_slurs_set, data=comment_df)
comment_df.loc[comment_df["ethnic_slurs_word_count"] > 0]

Unnamed: 0,video_id,id,text,user,like_count,published_at,reply_count,toxic_word_count,ethnic_slurs_word_count
1899,c4ieY1Yo5C4,UgynHR27cZvEkt6KkC14AaABAg.9CDj_1GsSqV9CDkfMCw9ba,Agreed. It was such a shame... I think Mick an...,UCPwwkdYZTo0AfQzPUVceT2A,11,2020-08-11T15:21:17Z,0,1,1


## Transformer Classifiers

## Results

## Bibliography

<!--

In [85]:
import os
os.system("jupyter nbconvert --to markdown final.ipynb")
os.system("pandoc -s final.md -t pdf -o final.pdf --citeproc --bibliography=refs.bib --csl=apa.csl")

[NbConvertApp] Converting notebook final.ipynb to markdown
[NbConvertApp] Writing 14508 bytes to final.md


0

-->