# Supervised Learning Methods: Random Forest and Logistic Regression with Word2Vec

In this notebook we look at two supervised learning methods, Random Forest and Logistic Regression. We keep things very simple here, since the main focus is on DistilBert and its variants.

We only want to see how well a "classical" supervised learning method performs compared to the transformer, and whether its shorter training time makes it worthwhile.

The dataset comes from **1_Emotion_Model.ipynb** and already contains all emotion vectors as well as the column *cleaned_review*. For the Word2Vec models we additionally remove punctuation, special characters, and digits, and attach the result as a new column, *cleaned_review_sv*.

In [109]:
import pandas as pd
import numpy as np
from gensim.models import Word2Vec
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
import re
from nltk.tokenize import word_tokenize

## Loading the data

In [110]:
df = pd.read_parquet("data/processed_data/steam_reviews_with_emotions_full.parquet")


In [111]:
df.head()

Unnamed: 0,app_id,app_name,review_id,language,review,timestamp_created,timestamp_updated,recommended,votes_helpful,votes_funny,...,love,nervousness,optimism,pride,realization,relief,remorse,sadness,surprise,neutral
0,620980,Beat Saber,79244667,english,"Since I am 80 + years old, it is very importan...",2020-11-14 10:51:22,2020-11-14 10:51:22,True,1493,176,...,0.001054,0.001134,0.094612,0.003521,0.01092,0.00234,0.00046,0.001183,0.001294,0.349538
1,1113000,Persona 4 Golden,70806847,english,Please buy this game if you want more Persona ...,2020-06-15 02:03:47,2020-11-26 04:15:44,True,1490,31,...,0.000184,1.6e-05,0.000899,2.1e-05,0.000102,9.3e-05,6.6e-05,6.9e-05,3.2e-05,0.000344
2,1145360,Hades,75662801,english,You can date the medusa head. Post Launch Edi...,2020-09-08 19:18:50,2020-10-10 19:57:02,True,1486,745,...,0.000248,1.2e-05,0.000412,1.6e-05,0.000564,2.3e-05,1.9e-05,5e-05,3.1e-05,0.993369
3,1225330,NBA 2K21,75410143,english,There is very little difference from 2k20. The...,2020-09-04 06:20:35,2020-09-04 06:20:35,False,1484,99,...,0.000521,0.007182,0.003316,0.001158,0.043133,0.002145,0.000302,0.004193,0.036822,0.029876
4,105600,Terraria,78393147,english,---{Graphics}--- ✖ Masterpiece ✖ Beautiful ✅Go...,2020-10-30 12:14:22,2020-11-26 04:01:45,True,1461,156,...,0.0016,0.000301,0.013872,0.000554,0.005503,0.000414,0.00016,0.000862,0.000123,0.699785


## Word2Vec embedding

We chose simple default values for the model here, but varied vector_size once and switched between Skip-Gram and CBOW.
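To make the difference between the two training modes more concrete, the following minimal sketch lists the training examples each mode would derive from one tokenized review. It is only a conceptual illustration: the helper names are hypothetical, and gensim's actual training adds details (for example randomly shrunk windows and negative sampling) that are left out here.

```python
def context_window(tokens, i, window):
    """Return the tokens within `window` positions of index i, excluding tokens[i]."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right


def skipgram_pairs(tokens, window):
    # Skip-Gram (sg = 1): predict each context word from the center word.
    return [(center, ctx)
            for i, center in enumerate(tokens)
            for ctx in context_window(tokens, i, window)]


def cbow_examples(tokens, window):
    # CBOW (sg = 0): predict the center word from its pooled context words.
    return [(context_window(tokens, i, window), center)
            for i, center in enumerate(tokens)]


tokens = ["please", "buy", "this", "game"]
sg = skipgram_pairs(tokens, window=1)
cbow = cbow_examples(tokens, window=1)
print(sg)
print(cbow)
```

Skip-Gram yields one example per (center, context) pair, while CBOW yields one example per position. This is one reason Skip-Gram usually takes longer to train on the same corpus.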

In [112]:
# Tokenization function
def preprocess_text(text):
    
    # Remove everything that is not a lowercase letter or whitespace
    # (cleaned_review is expected to be lowercased already).
    text = re.sub(r'[^a-z\s]', '', text)
    
    # Tokenization
    words = word_tokenize(text)

    return words
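
As a quick sanity check of the regular expression, the sketch below applies the same pattern to two short strings. It assumes that plain `str.split()` is an acceptable stand-in for `word_tokenize` (which needs NLTK data and may tokenize some words differently). The helper name `strip_non_letters` is hypothetical.

```python
import re


def strip_non_letters(text):
    # Same pattern as in preprocess_text: keep only a-z and whitespace.
    return re.sub(r'[^a-z\s]', '', text)


# Assumption: whitespace splitting stands in for nltk's word_tokenize.
lowercase = strip_non_letters("there is very little difference from 2k20.").split()
print(lowercase)

# Uppercase letters are not in a-z, so they are dropped as well.
mixed_case = strip_non_letters("Great Game!").split()
print(mixed_case)
```

The first call shows why a stray `k` token remains from `2k20`. The second shows that the pattern only works as intended on input that has already been lowercased, which appears to be the case for *cleaned_review*, judging by the tokens in the output further down.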

In [113]:
# New column
df['cleaned_review_sv'] = df['cleaned_review'].apply(preprocess_text)

In [114]:
tokenized_reviews = df['cleaned_review_sv'].tolist() # Convert to a list of token lists as input for Word2Vec

### Skip Gram | dim = 100

In [115]:
# vector_size and window are left at their defaults, min_count is set to 1, and workers is raised for faster training. sg = 1 selects the Skip-Gram model.
w2v_model = Word2Vec(sentences = tokenized_reviews, vector_size = 100, window = 5, min_count = 1, workers = 12, sg = 1)

In [116]:
# Averages the word vectors of a tokenized review into a single vector so that the models can consume it.
# If none of the tokens is in the Word2Vec vocabulary, a zero vector of matching size is returned instead.
def get_average_word2vec(tokens_list, model):
    vectors = [model.wv[word] for word in tokens_list if word in model.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(model.vector_size)
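
The averaging step can be illustrated on a toy vocabulary. In this sketch a plain dict stands in for `model.wv` and Python lists stand in for NumPy arrays. The function name and the two-dimensional embeddings are hypothetical and chosen only for illustration.

```python
def average_vector(tokens, embeddings, dim):
    """Mean-pool the vectors of all known tokens; fall back to zeros."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        # No token is in the vocabulary: return a zero vector of matching size.
        return [0.0] * dim
    # zip(*vectors) iterates dimension by dimension across all token vectors.
    return [sum(component) / len(vectors) for component in zip(*vectors)]


# Hypothetical 2-dimensional embeddings, not taken from the trained model.
toy_embeddings = {"good": [1.0, 3.0], "game": [3.0, 5.0]}

known = average_vector(["good", "game", "unseenword"], toy_embeddings, dim=2)
empty = average_vector([], toy_embeddings, dim=2)
print(known, empty)
```

Unknown tokens are simply skipped, so they do not pull the average towards zero. Only a review without any known token ends up as the zero vector.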

In [117]:
# Apply the function to the column "cleaned_review_sv"; the lambda receives each row's token list
df['review_vector'] = df['cleaned_review_sv'].apply(lambda tokens: get_average_word2vec(tokens, w2v_model))


word2vec_df= pd.DataFrame(df['review_vector'].tolist())
word2vec_df.to_parquet("data\\word2vec_data\\word2vec_review_vector_SG_100.parquet")

### Skip Gram | dim = 300

In [118]:
# Same as above, only with a different vector_size
w2v_model_sg = Word2Vec(sentences = tokenized_reviews, vector_size = 300, window = 5, min_count = 1, workers = 12, sg = 1)

In [119]:
# Same helper as above; the zero-vector fallback takes its size from the model
def get_average_word2vec(tokens_list, model):
    vectors = [model.wv[word] for word in tokens_list if word in model.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(model.vector_size)

In [120]:
df['review_vector_sg'] = df['cleaned_review_sv'].apply(lambda tokens: get_average_word2vec(tokens, w2v_model_sg))

word2vec_df_sg = pd.DataFrame(df['review_vector_sg'].tolist())
word2vec_df_sg.to_parquet("data\\word2vec_data\\word2vec_review_vector_SG_300.parquet")

### CBOW | dim = 300

In [121]:
# Same as above, only with sg = 0 for CBOW
w2v_model_cbow = Word2Vec(sentences = tokenized_reviews, vector_size = 300, window = 5, min_count = 1, workers = 12, sg = 0)

In [122]:
# Same helper as above; the zero-vector fallback takes its size from the model
def get_average_word2vec(tokens_list, model):
    vectors = [model.wv[word] for word in tokens_list if word in model.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(model.vector_size)

df['review_vector_cbow'] = df['cleaned_review_sv'].apply(lambda tokens: get_average_word2vec(tokens, w2v_model_cbow))

word2vec_df_cbow = pd.DataFrame(df['review_vector_cbow'].tolist())
word2vec_df_cbow.to_parquet("data\\word2vec_data\\word2vec_review_vector_CBOW_300.parquet")

In [123]:
df.head(5)

Unnamed: 0,app_id,app_name,review_id,language,review,timestamp_created,timestamp_updated,recommended,votes_helpful,votes_funny,...,realization,relief,remorse,sadness,surprise,neutral,cleaned_review_sv,review_vector,review_vector_sg,review_vector_cbow
0,620980,Beat Saber,79244667,english,"Since I am 80 + years old, it is very importan...",2020-11-14 10:51:22,2020-11-14 10:51:22,True,1493,176,...,0.01092,0.00234,0.00046,0.001183,0.001294,0.349538,"[since, i, am, years, old, it, is, very, impor...","[-0.2726916, 0.19899613, 0.22751918, 0.1773474...","[-0.081652, 0.052123804, -0.22066393, -0.18838...","[-0.43056703, -0.7118685, -0.4633068, 0.205449..."
1,1113000,Persona 4 Golden,70806847,english,Please buy this game if you want more Persona ...,2020-06-15 02:03:47,2020-11-26 04:15:44,True,1490,31,...,0.000102,9.3e-05,6.6e-05,6.9e-05,3.2e-05,0.000344,"[please, buy, this, game, if, you, want, more,...","[-0.2883354, 0.14829771, 0.15931796, 0.2871740...","[-0.082975045, 0.08470067, -0.20024474, -0.164...","[0.0012572557, -0.55559087, -0.65425366, 0.860..."
2,1145360,Hades,75662801,english,You can date the medusa head. Post Launch Edi...,2020-09-08 19:18:50,2020-10-10 19:57:02,True,1486,745,...,0.000564,2.3e-05,1.9e-05,5e-05,3.1e-05,0.993369,"[you, can, date, the, medusa, head, post, laun...","[-0.39996672, 0.18868315, 0.123849116, 0.31465...","[-0.086352706, 0.16583632, -0.22587965, -0.156...","[-0.27830422, -0.15577766, -0.24705377, 0.0147..."
3,1225330,NBA 2K21,75410143,english,There is very little difference from 2k20. The...,2020-09-04 06:20:35,2020-09-04 06:20:35,False,1484,99,...,0.043133,0.002145,0.000302,0.004193,0.036822,0.029876,"[there, is, very, little, difference, from, k,...","[-0.25944784, 0.16957393, 0.18401708, 0.324794...","[-0.042988148, 0.15374173, -0.17773253, -0.180...","[-0.34721655, -0.48042405, -0.2059811, -0.0819..."
4,105600,Terraria,78393147,english,---{Graphics}--- ✖ Masterpiece ✖ Beautiful ✅Go...,2020-10-30 12:14:22,2020-11-26 04:01:45,True,1461,156,...,0.005503,0.000414,0.00016,0.000862,0.000123,0.699785,"[graphics, masterpiece, beautiful, good, decen...","[-0.31568402, 0.09190998, 0.23750567, 0.391256...","[0.032259822, 0.15602392, -0.17051366, -0.1357...","[-0.033696365, -0.47074488, -0.41680184, 0.202..."


## With emotions

### Logistic Regression

##### CBOW | dim = 300

In [124]:
# Features
X = pd.concat([df[["admiration", "amusement", "anger", "annoyance", "approval", "caring",
            "confusion", "curiosity", "desire", "disappointment", "disapproval",
            "disgust", "embarrassment", "excitement", "fear", "gratitude", "grief",
            "joy", "love", "nervousness", "optimism", "pride", "realization",
           "relief", "remorse", "sadness", "surprise", "neutral"]], word2vec_df_cbow], axis = 1)

# Cast the column names to str, since the mix of string and integer column names otherwise causes problems further down in scikit-learn
X.columns = X.columns.astype(str)

y = df['recommended']

# random_state sets the seed, test_size holds out 20%, and stratify = y preserves the class ratio of the recommended column, which helps with imbalanced data (as is the case here)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42, stratify = y)
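
What stratification does can be sketched in a few lines of plain Python. This is not scikit-learn's implementation, only an illustration of the idea under simplifying assumptions: the function name and the label list are hypothetical, and every class is sampled separately with the same test fraction.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size, seed):
    """Return (train_idx, test_idx) with the class ratio kept in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set.
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Hypothetical labels, loosely mirroring the imbalance of the recommended column.
labels = [False] * 20 + [True] * 80
train_idx, test_idx = stratified_split(labels, test_size=0.2, seed=42)

test_false_share = sum(not labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), test_false_share)
```

Because each class contributes the same fraction, the share of `False` labels in the test set matches the share in the full data, regardless of the seed.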

Max iterations = 10

*No class weight*

In [125]:
# Create the model with a fixed seed, 10 iterations, and all available CPU threads
logreg_model = LogisticRegression(random_state = 42, max_iter = 10, n_jobs = -1)

# Train the model on the training data
logreg_model.fit(X_train, y_train)

# Predict on the test data
y_pred_logreg = logreg_model.predict(X_test)

# Use classification_report() to get all metrics; with zero_division = 0 a metric is set to 0 whenever a division by zero occurs.
print("Logistic Regression Classification Report:")
print(classification_report(y_test, y_pred_logreg, zero_division = 0))

Logistic Regression Classification Report:
              precision    recall  f1-score   support

       False       0.77      0.61      0.68     15830
        True       0.90      0.95      0.92     55832

    accuracy                           0.87     71662
   macro avg       0.83      0.78      0.80     71662
weighted avg       0.87      0.87      0.87     71662
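
How the derived columns of such a report relate to each other can be checked with a little arithmetic. The sketch below recomputes the F1-scores and their macro and weighted averages from the rounded precision, recall, and support values printed above. Since the inputs are already rounded, it is an approximation of what `classification_report` computes from the raw predictions.

```python
# Per-class values copied from the report above (rounded to two decimals).
report = {
    False: {"precision": 0.77, "recall": 0.61, "support": 15830},
    True: {"precision": 0.90, "recall": 0.95, "support": 55832},
}


def f1_score(precision, recall):
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


f1 = {label: f1_score(row["precision"], row["recall"]) for label, row in report.items()}

total = sum(row["support"] for row in report.values())
# Macro average: every class counts equally.
macro_f1 = sum(f1.values()) / len(f1)
# Weighted average: every class counts in proportion to its support.
weighted_f1 = sum(f1[label] * report[label]["support"] for label in report) / total

print({label: round(value, 2) for label, value in f1.items()})
print(round(macro_f1, 2), round(weighted_f1, 2))
```

The macro average is noticeably lower than the weighted one because it gives the small `False` class the same influence as the much larger `True` class.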



*Class weight = 'balanced'*

In [126]:
# Same as before, except that class_weight is set to 'balanced', which gives the underrepresented class (recommended: False) more weight.
logreg_model = LogisticRegression(class_weight='balanced', random_state = 42, max_iter = 10, n_jobs = -1)

logreg_model.fit(X_train, y_train)

y_pred_logreg = logreg_model.predict(X_test)

print("Logistic Regression Classification Report:")
print(classification_report(y_test, y_pred_logreg, zero_division = 0))

Logistic Regression Classification Report:
              precision    recall  f1-score   support

       False       0.65      0.86      0.74     15830
        True       0.96      0.87      0.91     55832

    accuracy                           0.87     71662
   macro avg       0.80      0.86      0.83     71662
weighted avg       0.89      0.87      0.87     71662
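
The jump in recall for the `False` class follows from how strongly `'balanced'` reweights the classes. scikit-learn documents the heuristic as `n_samples / (n_classes * np.bincount(y))`. The sketch below evaluates that formula in plain Python. As an assumption, the test-set supports from the report stand in for the training counts, which should be close because the split is stratified.

```python
def balanced_class_weights(class_counts):
    """Weights per class following n_samples / (n_classes * count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count)
            for label, count in class_counts.items()}


# Assumption: test-set supports approximate the class ratio of the training set.
weights = balanced_class_weights({False: 15830, True: 55832})
print({label: round(w, 2) for label, w in weights.items()})
```

Under this assumption a misclassified "not recommended" review costs roughly 3.5 times as much as a misclassified "recommended" one, which pushes the decision boundary towards predicting `False` more often.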



Max iterations = 100

*No class weight*

In [127]:
logreg_model = LogisticRegression(random_state = 42, max_iter = 100, n_jobs = -1)

logreg_model.fit(X_train, y_train)

y_pred_logreg = logreg_model.predict(X_test)

print("Logistic Regression Classification Report:")
print(classification_report(y_test, y_pred_logreg, zero_division = 0))

Logistic Regression Classification Report:
              precision    recall  f1-score   support

       False       0.81      0.72      0.76     15830
        True       0.92      0.95      0.94     55832

    accuracy                           0.90     71662
   macro avg       0.87      0.84      0.85     71662
weighted avg       0.90      0.90      0.90     71662



*Class weight = 'balanced'*

In [128]:
logreg_model = LogisticRegression(class_weight='balanced', random_state = 42, max_iter = 100, n_jobs = -1)

logreg_model.fit(X_train, y_train)

y_pred_logreg = logreg_model.predict(X_test)

print("Logistic Regression Classification Report:")
print(classification_report(y_test, y_pred_logreg, zero_division = 0))

Logistic Regression Classification Report:
              precision    recall  f1-score   support

       False       0.67      0.86      0.76     15830
        True       0.96      0.88      0.92     55832

    accuracy                           0.88     71662
   macro avg       0.81      0.87      0.84     71662
weighted avg       0.89      0.88      0.88     71662



Max iterations = 1000

*No class weight*

In [129]:
logreg_model = LogisticRegression(random_state = 42, max_iter = 1000, n_jobs = -1)

logreg_model.fit(X_train, y_train)

y_pred_logreg = logreg_model.predict(X_test)

print("Logistic Regression Classification Report:")
print(classification_report(y_test, y_pred_logreg, zero_division = 0))

Logistic Regression Classification Report:
              precision    recall  f1-score   support

       False       0.81      0.72      0.76     15830
        True       0.92      0.95      0.94     55832

    accuracy                           0.90     71662
   macro avg       0.87      0.84      0.85     71662
weighted avg       0.90      0.90      0.90     71662



*Class weight = 'balanced'*

In [130]:
logreg_model = LogisticRegression(class_weight='balanced', random_state = 42, max_iter = 1000, n_jobs = -1)

logreg_model.fit(X_train, y_train)

y_pred_logreg = logreg_model.predict(X_test)

print("Logistic Regression Classification Report:")
print(classification_report(y_test, y_pred_logreg, zero_division = 0))

Logistic Regression Classification Report:
              precision    recall  f1-score   support

       False       0.67      0.86      0.75     15830
        True       0.96      0.88      0.92     55832

    accuracy                           0.88     71662
   macro avg       0.81      0.87      0.84     71662
weighted avg       0.89      0.88      0.88     71662



##### Skip Gram | dim = 300

In [131]:
X = pd.concat([df[["admiration", "amusement", "anger", "annoyance", "approval", "caring",
            "confusion", "curiosity", "desire", "disappointment", "disapproval",
            "disgust", "embarrassment", "excitement", "fear", "gratitude", "grief",
            "joy", "love", "nervousness", "optimism", "pride", "realization",
           "relief", "remorse", "sadness", "surprise", "neutral"]], word2vec_df_sg], axis = 1)

X.columns = X.columns.astype(str)

y = df['recommended']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42, stratify = y)

Max iterations = 10

*No class weights*

In [132]:
logreg_model = LogisticRegression(random_state = 42, max_iter = 10, n_jobs = -1)

logreg_model.fit(X_train, y_train)

y_pred_logreg = logreg_model.predict(X_test)

print("Logistic Regression Classification Report:")
print(classification_report(y_test, y_pred_logreg, zero_division = 0))

Logistic Regression Classification Report:
              precision    recall  f1-score   support

       False       0.80      0.63      0.71     15830
        True       0.90      0.95      0.93     55832

    accuracy                           0.88     71662
   macro avg       0.85      0.79      0.82     71662
weighted avg       0.88      0.88      0.88     71662



*Class weight = 'balanced'*

In [133]:
logreg_model = LogisticRegression(class_weight='balanced', random_state = 42, max_iter = 10, n_jobs = -1)

logreg_model.fit(X_train, y_train)

y_pred_logreg = logreg_model.predict(X_test)

print("Logistic Regression Classification Report:")
print(classification_report(y_test, y_pred_logreg, zero_division = 0))

Logistic Regression Classification Report:
              precision    recall  f1-score   support

       False       0.62      0.86      0.72     15830
        True       0.95      0.85      0.90     55832

    accuracy                           0.85     71662
   macro avg       0.79      0.86      0.81     71662
weighted avg       0.88      0.85      0.86     71662



Max iterations = 100

*No class weights*

In [134]:
logreg_model = LogisticRegression(random_state = 42, max_iter = 100, n_jobs = -1)

logreg_model.fit(X_train, y_train)

y_pred_logreg = logreg_model.predict(X_test)

print("Logistic Regression Classification Report:")
print(classification_report(y_test, y_pred_logreg, zero_division = 0))

Logistic Regression Classification Report:
              precision    recall  f1-score   support

       False       0.82      0.73      0.77     15830
        True       0.92      0.95      0.94     55832

    accuracy                           0.90     71662
   macro avg       0.87      0.84      0.85     71662
weighted avg       0.90      0.90      0.90     71662



*Class weight = 'balanced'*

In [135]:
logreg_model = LogisticRegression(class_weight='balanced', random_state = 42, max_iter = 100, n_jobs = -1)

logreg_model.fit(X_train, y_train)

y_pred_logreg = logreg_model.predict(X_test)

print("Logistic Regression Classification Report:")
print(classification_report(y_test, y_pred_logreg, zero_division = 0))

Logistic Regression Classification Report:
              precision    recall  f1-score   support

       False       0.68      0.87      0.76     15830
        True       0.96      0.88      0.92     55832

    accuracy                           0.88     71662
   macro avg       0.82      0.88      0.84     71662
weighted avg       0.90      0.88      0.89     71662



##### Skip Gram | dim = 100

In [136]:
X = pd.concat([df[["admiration", "amusement", "anger", "annoyance", "approval", "caring",
            "confusion", "curiosity", "desire", "disappointment", "disapproval",
            "disgust", "embarrassment", "excitement", "fear", "gratitude", "grief",
            "joy", "love", "nervousness", "optimism", "pride", "realization",
           "relief", "remorse", "sadness", "surprise", "neutral"]], word2vec_df], axis = 1)

X.columns = X.columns.astype(str)

y = df['recommended']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42, stratify = y)

Max iterations = 10

*No class weight*

In [137]:
logreg_model = LogisticRegression(random_state = 42, max_iter = 10, n_jobs = -1)

logreg_model.fit(X_train, y_train)

y_pred_logreg = logreg_model.predict(X_test)

print("Logistic Regression Classification Report:")
print(classification_report(y_test, y_pred_logreg, zero_division = 0))

Logistic Regression Classification Report:
              precision    recall  f1-score   support

       False       0.80      0.64      0.71     15830
        True       0.90      0.96      0.93     55832

    accuracy                           0.89     71662
   macro avg       0.85      0.80      0.82     71662
weighted avg       0.88      0.89      0.88     71662



*Class weight = 'balanced'*

In [138]:
logreg_model = LogisticRegression(class_weight='balanced', random_state = 42, max_iter = 10, n_jobs = -1)

logreg_model.fit(X_train, y_train)

y_pred_logreg = logreg_model.predict(X_test)

print("Logistic Regression Classification Report:")
print(classification_report(y_test, y_pred_logreg, zero_division = 0))

Logistic Regression Classification Report:
              precision    recall  f1-score   support

       False       0.62      0.86      0.72     15830
        True       0.95      0.85      0.90     55832

    accuracy                           0.85     71662
   macro avg       0.79      0.85      0.81     71662
weighted avg       0.88      0.85      0.86     71662



Max iterations = 100

*No class weight*

In [139]:
logreg_model = LogisticRegression(random_state = 42, max_iter = 100, n_jobs = -1)

logreg_model.fit(X_train, y_train)

y_pred_logreg = logreg_model.predict(X_test)

print("Logistic Regression Classification Report:")
print(classification_report(y_test, y_pred_logreg, zero_division = 0))

Logistic Regression Classification Report:
              precision    recall  f1-score   support

       False       0.81      0.72      0.76     15830
        True       0.92      0.95      0.94     55832

    accuracy                           0.90     71662
   macro avg       0.87      0.83      0.85     71662
weighted avg       0.90      0.90      0.90     71662



*Class weight = 'balanced'*

In [140]:
logreg_model = LogisticRegression(class_weight='balanced', random_state = 42, max_iter = 100, n_jobs = -1)

logreg_model.fit(X_train, y_train)

y_pred_logreg = logreg_model.predict(X_test)

print("Logistic Regression Classification Report:")
print(classification_report(y_test, y_pred_logreg, zero_division = 0))

Logistic Regression Classification Report:
              precision    recall  f1-score   support

       False       0.67      0.86      0.75     15830
        True       0.96      0.88      0.92     55832

    accuracy                           0.88     71662
   macro avg       0.81      0.87      0.84     71662
weighted avg       0.89      0.88      0.88     71662



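For logistic regression it seems better to leave out class_weight = 'balanced': accuracy and the F1-score of the "recommended" class are slightly higher without it, although balancing raises the recall of the "not recommended" class at the cost of its precision. The 300-dimensional Skip-Gram model performed best on the metrics, if only by about 0.01, and 100 iterations appear to be sufficient, since 1000 iterations gave the same results for CBOW. We will therefore only use the 300-dimensional Skip-Gram model for the Random Forest.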

### Random Forest

##### Skip Gram | dim = 300

Number estimators = 10

In [152]:
X = pd.concat([df[["admiration", "amusement", "anger", "annoyance", "approval", "caring",
            "confusion", "curiosity", "desire", "disappointment", "disapproval",
            "disgust", "embarrassment", "excitement", "fear", "gratitude", "grief",
            "joy", "love", "nervousness", "optimism", "pride", "realization",
           "relief", "remorse", "sadness", "surprise", "neutral"]], word2vec_df_sg], axis = 1)

X.columns = X.columns.astype(str)

y = df['recommended']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42, stratify = y)

*No class weights*

In [153]:
# The only difference now is n_estimators, which sets the number of trees
rf_model = RandomForestClassifier(n_estimators = 10, random_state = 42, n_jobs = -1)

rf_model.fit(X_train, y_train)

y_pred_rf = rf_model.predict(X_test)

print("Random Forest Classification Report:")
print(classification_report(y_test, y_pred_rf, zero_division = 0))

Random Forest Classification Report:
              precision    recall  f1-score   support

       False       0.77      0.73      0.75     15830
        True       0.92      0.94      0.93     55832

    accuracy                           0.89     71662
   macro avg       0.85      0.83      0.84     71662
weighted avg       0.89      0.89      0.89     71662



*Class weights = 'balanced'*

In [154]:
rf_model = RandomForestClassifier(n_estimators = 10, class_weight = 'balanced', random_state = 42, n_jobs = -1)

rf_model.fit(X_train, y_train)

y_pred_rf = rf_model.predict(X_test)

print("Random Forest Classification Report:")
print(classification_report(y_test, y_pred_rf, zero_division = 0))

Random Forest Classification Report:
              precision    recall  f1-score   support

       False       0.80      0.68      0.74     15830
        True       0.91      0.95      0.93     55832

    accuracy                           0.89     71662
   macro avg       0.86      0.82      0.84     71662
weighted avg       0.89      0.89      0.89     71662



Number estimators = 100

*No class weights*

In [155]:
rf_model = RandomForestClassifier(n_estimators = 100, random_state = 42, n_jobs = -1)

rf_model.fit(X_train, y_train)

y_pred_rf = rf_model.predict(X_test)

print("Random Forest Classification Report:")
print(classification_report(y_test, y_pred_rf, zero_division = 0))

Random Forest Classification Report:
              precision    recall  f1-score   support

       False       0.85      0.70      0.77     15830
        True       0.92      0.96      0.94     55832

    accuracy                           0.91     71662
   macro avg       0.88      0.83      0.86     71662
weighted avg       0.90      0.91      0.90     71662



*Class weights = 'balanced'*

In [156]:
rf_model = RandomForestClassifier(n_estimators = 100, class_weight = 'balanced', random_state = 42, n_jobs = -1)

rf_model.fit(X_train, y_train)

y_pred_rf = rf_model.predict(X_test)

print("Random Forest Classification Report:")
print(classification_report(y_test, y_pred_rf, zero_division = 0))

Random Forest Classification Report:
              precision    recall  f1-score   support

       False       0.86      0.66      0.74     15830
        True       0.91      0.97      0.94     55832

    accuracy                           0.90     71662
   macro avg       0.88      0.81      0.84     71662
weighted avg       0.90      0.90      0.90     71662



Here too, the model without balanced class weights performs better, though not by much. Raising n_estimators from 10 to 100 mainly improved the precision on the "not recommended" class (from 0.77 to 0.85).

In addition, the Random Forest model performed better than the plain logistic regression: accuracy improved by 0.01, and the macro-averaged precision and F1-score by 0.01 each. Only the macro-averaged recall dropped slightly, by 0.01.
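
Why more trees tend to help can be illustrated with an idealized voting model. The sketch below computes the probability that a majority of independent voters is correct when each voter is right with the same probability. This is a strong simplification and not how the forest above works in detail: real trees are correlated, and scikit-learn's `RandomForestClassifier` averages predicted class probabilities rather than counting hard votes. The per-voter accuracy of 0.6 is hypothetical.

```python
from math import comb


def majority_correct_probability(n_voters, p_correct):
    """Probability that more than half of n independent voters are right."""
    needed = n_voters // 2 + 1
    return sum(comb(n_voters, k) * p_correct**k * (1 - p_correct)**(n_voters - k)
               for k in range(needed, n_voters + 1))


# Hypothetical per-voter accuracy of 0.6; odd voter counts avoid ties.
p_small = majority_correct_probability(11, 0.6)
p_large = majority_correct_probability(101, 0.6)
print(round(p_small, 3), round(p_large, 3))
```

In this idealized setting the ensemble becomes more reliable as voters are added. With correlated trees the gain levels off much earlier, which fits the modest improvement observed between 10 and 100 estimators.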

## Without emotions

The best logistic regression and random forest models are now tested without the emotion features, with and without class_weight = 'balanced', to show the difference from the previous models.

In [157]:
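# Here X is only the Word2Vec vector of the 300-dimensional Skip-Gram model.
X = word2vec_df_sg
y = df['recommended']

# Note: unlike the splits above, this one is not stratified, so the class supports in the reports below differ slightly.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)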

#### No class weights

##### Random Forest

In [158]:
rf_model = RandomForestClassifier(n_estimators = 100, random_state = 42, n_jobs = -1)

rf_model.fit(X_train, y_train)

y_pred_rf = rf_model.predict(X_test)

print("Random Forest Classification Report:")
print(classification_report(y_test, y_pred_rf, zero_division=0))

Random Forest Classification Report:
              precision    recall  f1-score   support

       False       0.87      0.48      0.62     15848
        True       0.87      0.98      0.92     55814

    accuracy                           0.87     71662
   macro avg       0.87      0.73      0.77     71662
weighted avg       0.87      0.87      0.86     71662



##### Logistic Regression

In [159]:
logreg_model = LogisticRegression(random_state = 42, max_iter = 100, n_jobs = -1)

logreg_model.fit(X_train, y_train)

y_pred_logreg = logreg_model.predict(X_test)

print("Logistic Regression Classification Report:")
print(classification_report(y_test, y_pred_logreg, zero_division = 0))

Logistic Regression Classification Report:
              precision    recall  f1-score   support

       False       0.79      0.63      0.70     15848
        True       0.90      0.95      0.93     55814

    accuracy                           0.88     71662
   macro avg       0.84      0.79      0.81     71662
weighted avg       0.88      0.88      0.88     71662



#### Class weights = 'balanced'

##### Random Forest

In [160]:
rf_model = RandomForestClassifier(n_estimators = 100, class_weight = 'balanced', random_state = 42, n_jobs = -1)

rf_model.fit(X_train, y_train)

y_pred_rf = rf_model.predict(X_test)

print("Random Forest Classification Report:")
print(classification_report(y_test, y_pred_rf, zero_division=0))

Random Forest Classification Report:
              precision    recall  f1-score   support

       False       0.86      0.44      0.58     15848
        True       0.86      0.98      0.92     55814

    accuracy                           0.86     71662
   macro avg       0.86      0.71      0.75     71662
weighted avg       0.86      0.86      0.84     71662



##### Logistic Regression

In [161]:
logreg_model = LogisticRegression(random_state = 42, class_weight = 'balanced', max_iter = 100, n_jobs = -1)

logreg_model.fit(X_train, y_train)

y_pred_logreg = logreg_model.predict(X_test)

print("Logistic Regression Classification Report:")
print(classification_report(y_test, y_pred_logreg, zero_division = 0))

Logistic Regression Classification Report:
              precision    recall  f1-score   support

       False       0.62      0.86      0.72     15848
        True       0.96      0.85      0.90     55814

    accuracy                           0.85     71662
   macro avg       0.79      0.86      0.81     71662
weighted avg       0.88      0.85      0.86     71662



## Conclusion

Without the emotions, the "Not Recommended" class clearly performs worse than with them. Above all, the gap between precision and recall is more pronounced here. Balancing also has a larger effect: with balanced weights, logistic regression trades precision for recall on this class (0.79/0.63 becomes 0.62/0.86), whereas the Random Forest stays precision-heavy with low recall (0.86/0.44).

Overall, the models with the emotions achieve better metric values than those without.
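
The size of the gap can be read off directly from the reports. The sketch below subtracts the "not recommended" metrics of the runs without emotions from those with emotions (100 trees or 100 iterations, no class weights). The numbers are copied from the outputs above. Note that the two experiments use slightly different test splits, so the differences are indicative rather than exact.

```python
# "Not recommended" (False) class metrics copied from the reports above
# (100 trees / 100 iterations, Skip-Gram dim = 300, no class weights).
with_emotions = {
    "random_forest": {"precision": 0.85, "recall": 0.70, "f1": 0.77},
    "logistic_regression": {"precision": 0.82, "recall": 0.73, "f1": 0.77},
}
without_emotions = {
    "random_forest": {"precision": 0.87, "recall": 0.48, "f1": 0.62},
    "logistic_regression": {"precision": 0.79, "recall": 0.63, "f1": 0.70},
}

# Positive values mean the emotion features helped on that metric.
gain = {
    model: {metric: round(with_emotions[model][metric] - without_emotions[model][metric], 2)
            for metric in with_emotions[model]}
    for model in with_emotions
}
for model, deltas in gain.items():
    print(model, deltas)
```

The largest gain is in recall, most visibly for the Random Forest, while its precision on this class is marginally lower with the emotion features.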

Even without the upstream emotion model, the reviews are classified well.

Compared to the DistilBert model, the supervised models also do not seem to perform much worse at classifying the reviews: the difference ranges between 0.05 and 0.07 per metric. With an accuracy of 0.9, this is still a surprisingly good result, especially when weighed against the shorter training time. An extended grid search could possibly improve the metric values of the supervised models further, but since this project is about a transformer model, we leave it at these results.