As a quick reminder, my capstone is looking for negative/hate comments in the chats of Twitch streams. I've done some preprocessing in a seperate notebook (titled Merge_datasets) and brought in that saved dataframe here.

In [7]:
#Imports
from textblob import TextBlob
import string
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
import re
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords, wordnet
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from imblearn.pipeline import Pipeline as imbpipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from imblearn.over_sampling import SMOTE
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.metrics import recall_score, accuracy_score, make_scorer, ConfusionMatrixDisplay, confusion_matrix

#Reads in data and drops nulls
df = pd.read_csv('../data/small_merged_chats')
sw = stopwords.words('english')
df = df.dropna(subset = ['body'])

The data I'm using, while cleaned from the original, had some non-human messages I didn't catch until looking at the tokens. The cell below gets rid of chats that come from these bots using regex (they are mostly ad bots so they're advertising links to patreon/discord). There are also different colored/varaitions of emojis used on Twitch like pog vs. pogChamp or kappa vs. kappaRainbow that convey essentially the same meaning. In an effort to trim the data down the `emoji_shorten` function finds the most popular emojis that have a lot of variations and truncates them all to the same word.

In [8]:
#Get rid of chats with links (often promo/not real messages)
def ad(chat):
    result = False
    result = bool(re.search(r'www\.[a-z]?\.?(com)+|[a-z]+\.(com)', chat))
    result = bool(re.search(r'http\S+', chat))
    return result

#Makes ad column and gets rid of any ad messages
df['is_ad'] = df['body'].apply(ad)
df = df[df['is_ad'] == False]

#Changes popular emojis to one type so no variants ie) pog + pogChamp -> pog + pog
def emoji_shorten(chat):
    chat = re.sub(r'(?i) \bpog(\w)*\b |\bpog(\w)*\b', '', chat)
    chat = re.sub(r'(?i) \blul(\w)*\b |\blul(\w)*\b', '', chat)
    chat = re.sub(r'(?i) \bkappa(\w)*\b |\bkappa(\w)*\b', '', chat)
    return chat

#Creates new column with emojis shortened to simple form
df['chats'] = df.body.apply(lambda x: emoji_shorten(x))

VADER, while great with social media chats and slang, doesn't have Twitch specific lingo. In an effort to get more accurate labels I've added some of the popular tokens that I noticed are missing from VADER's vocabulary. The polarity of the sentiment is calculated using a study from [here](https://dl.acm.org/doi/10.1145/3365523) in an effort to get an accurate number rather than just my own intuition. I then add this to the analyzer's vocabulary when I create the labels in the following cell.

In [9]:
#Words to add to VADER
#Values calculated from: https://dl.acm.org/doi/10.1145/3365523
new_words = {
    'noice': 1.8,
    'scum': -2.0,
    'kap': 0.5,
    'kappa': 0.5,
    'lul': 1.8,
    'omegalol': 1.8,
    'strats': 2.0,
    'rekt': 0,
    'owo': 1.0,
    'tweaker': -2.3,
    'pog': 2.8,
    'pag': 2.8,
    'incel': -3.1,
    'tilted': -0.7,
    'feelsbadman': -2.6,
    'feelsgoodman': 3.7,
    'trash': -2.0,
    'rip': -1.2,
    'ez': 1.9,
    'clap': 2.7,
    'hyperbruh': -0.6,
    'f': 0.5,
    'F': 0.5,
    'discord': 0,
    'PJSalt': -1.2,
    'Kreygasm': 2.8,
    'kreygasm': 2.8,
    'homo': -3.5,
    'clip': 0.5,
    'rAcIsM': -3.1,
    'based': 2.0,
    'Based': 2.0,
    'PepeHands': -1.7,
    'WutFace': -1.7,
    'FailFish': -2.0,
    'BabyRage': -1.6,
    'ANELE': -0.8,
    'haHAA': -0.5,
    'ResidentSleeper': -1.2,
    'cmonBruh': -1.0
}

In [10]:
#Creates labels for data
positive = 0
negative = 0
neutral = 0
polarity = 0
neutral_list = []
negative_list = []
positive_list = []
for chat in df.chats:
    analysis = TextBlob(chat)
    vad = SentimentIntensityAnalyzer()
    #New words get added to vocabulary
    vad.lexicon.update(new_words)
    score = vad.polarity_scores(chat)
    neg = score['neg']
    neu = score['neu']
    pos = score['pos']
    comp = score['compound']
    polarity += analysis.sentiment.polarity
 
    if neg > pos:
        negative_list.append(chat)
        negative += 1
    elif pos > neg:
        positive_list.append(chat)
        positive += 1
 
    elif pos == neg:
        neutral_list.append(chat)
        neutral += 1

These are just functions I'm using to tokenize my chat data. I'm again removing any non-ASCII characters to avoid crashes or any emojis encodings that lost characters then tokenizing the chats. These are then stripped of any unwanted puntuation and lemmatized. Notably I've noticed a good number of mispellings of words that may be adding extra noise, but I haven't found a good way to spellcheck without correcting non-words like the Twitch emoticons as well, so that's a project on hold I'm hoping to implement into the tokenizer function at some point.

In [12]:
#Replaces pos tags with lemmatize compatable tags
def pos_replace(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN
    
#Makes list of punctuation to exclude, keeps certain symbols
punct = list(string.punctuation)
keep_punct = ['?', '!', '@', ',', '.']
punct = [p for p in punct if p not in keep_punct]

#Removes non-ASCII characters (aka emojis that cant be converted to original symbol)
def remove_junk(tweet):
    return ''.join([i if ord(i) < 128 else ' ' for i in tweet])

def chat_tokenizer(doc, stop_words = sw):
    #Gets rid of weird characters
    doc = remove_junk(doc)
    #Tokenizes using NLTK Twitter Tokenizer as chats like tweets
    chat_token = TweetTokenizer(strip_handles = True)
    doc = chat_token.tokenize(doc)
    #Strips extra puntuation I don't want to keep
    doc = [w for w in doc if w not in punct]
    #Lemmatizes tokens
    doc = pos_tag(doc)
    doc = [(w[0], pos_replace(w[1])) for w in doc]
    lemmatizer = WordNetLemmatizer() 
    doc = [lemmatizer.lemmatize(word[0], word[1]) for word in doc]
    
    return doc

As you can see there's class imbalance, so I've been SMOTEing the minority in my models. The actual simple model is down below. I opted to start with Bayes as it tends to do well with text classification. However, the metric I'm using is recall, as I'm trying to limit the number of missed hate chats so I'm weighting the model against these false negatives. I've found with Bayes though the validation recall score is always much higher than the training, so I'm also using accuracy as a metric for now to be able to tell if the model is overfit/if there is actually a gap in performance between training and validation sets.

In [15]:
#Hate or negative chat = 1 / neutral + positive = 0
df["label"] = np.where(df["chats"].isin(negative_list), 1, 0)
df.label.value_counts(normalize = True)

0    0.887196
1    0.112804
Name: label, dtype: float64

In [14]:
#Creates features and target then performs train test split
y = df['label']
X = df['chats']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 213)

#Pipeline for processing and fitting model
mnb_cv = imbpipeline(steps=[
    ('preproc', CountVectorizer(lowercase = False, tokenizer = chat_tokenizer)),
    ('smote', SMOTE(sampling_strategy = 'minority', random_state = 213)),
    ('mnb', MultinomialNB())
])

#Fits model and prints training score
mnb_cv.fit(X_train, y_train)
preds = mnb_cv.predict(X_train)
print("Training Recall:", recall_score(preds, y_train))
print("Training Accuracy:", mnb_cv.score(X_train, y_train))
#Cross validates model and prints average result
scoring = {
    'acc': make_scorer(accuracy_score),
    'rec': 'recall'
}
scores = cross_validate(mnb_cv, X_train, y_train, cv = 5, scoring = scoring)
print("Validation Recall:" + str(np.mean(scores['test_rec'])))
print("Validation Accuracy:" + str(np.mean(scores['test_acc'])))

Training Recall: 0.5652359295054008
Training Accuracy: 0.9109250243427459
Validation Recall:0.8309753921199263
Validation Accuracy:0.8851282051282052


The Models notebook in this same folder has a bunch of different iterations/some more feature engineering and exploring if there's anything I missed here you might have questions about. I opted to put the first simple model in this notebook though since the Models one is... messy and hard to navigate.