# Data Preprocessing

We start from scratch here and skip the EDA, since it was carried out in a separate notebook.

**The EDA established the following criteria:**

* Data from 2020-2021

* English-language reviews

* At most 100 helpful votes

* At most 100 funny votes

* Total playtime of at most 10,000 hours

* Playtime of at most 336 hours in the last two weeks


**Goal-oriented data cleaning:**

The goal of this project is a transformer model that predicts whether a review is positive or not, so we prepare the data to suit that model.

* Text length is capped at 512 tokens

* Reviews that are too short are removed (at least 10 tokens required)

* Reviews that are too long are truncated (to 512 tokens)
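The minimum-length rule above can be sketched as follows. This is only an illustration under stated assumptions: `filter_by_token_count` and `whitespace_token_count` are hypothetical helpers, and whitespace splitting stands in for the real tokenizer so that the example runs without downloading a model.

```python
import pandas as pd

def filter_by_token_count(df, count_tokens, text_col="review", min_tokens=10):
    """Keep only rows whose text has at least `min_tokens` tokens."""
    n_tokens = df[text_col].map(count_tokens)
    return df[n_tokens >= min_tokens].copy()

def whitespace_token_count(text):
    # Stand-in for a real tokenizer (assumption): counts whitespace-separated words.
    return len(text.split())

# Toy data, not taken from the dataset
reviews = pd.DataFrame({
    "review": [
        "gud.",
        "one two three four five six seven eight nine ten eleven",
    ]
})
kept = filter_by_token_count(reviews, whitespace_token_count)
print(kept)
```

With a Hugging Face tokenizer, one would pass something like `lambda t: len(tokenizer.tokenize(t))` as `count_tokens` instead. The upper limit of 512 tokens needs no separate filter, because `truncation=True` handles it at tokenization time.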


**Data-/model-oriented preparation:**

* Duplicates are removed

* Handling of special characters and emojis (needed for a transformer? As far as we know, not really, since transformers can cope with special characters)

* Handling of HTML tags (needed for a transformer? Possibly, since HTML tags may waste tokens)

* Lowercasing (probably: a cased transformer distinguishes upper and lower case, and lowercasing everything removes that distinction, which we expect to help performance. With an uncased checkpoint such as `distilbert-base-uncased`, used below, the tokenizer already lowercases its input.)

* Class balance is established (~50/50)

* Optional balancing per game or genre (either games or genres are balanced)
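The text-cleaning items in the list above (HTML tags, lowercasing) could be combined into one helper, roughly as sketched here. `clean_review` is a hypothetical name, and the BBCode pattern rests on the assumption that reviews may contain tags such as `[b]...[/b]`; that has not been checked against this dataset.

```python
import html
import re

HTML_TAG = re.compile(r"<[^>]+>")
# Assumption: BBCode-style tags like [b], [/b], [url=...]; not verified for this dataset.
BBCODE_TAG = re.compile(r"\[/?[a-zA-Z][a-zA-Z0-9]*(?:=[^\]]*)?\]")
WHITESPACE = re.compile(r"\s+")

def clean_review(text: str) -> str:
    """Decode HTML entities, drop markup tags, collapse whitespace, lowercase."""
    text = html.unescape(text)
    text = HTML_TAG.sub(" ", text)
    text = BBCODE_TAG.sub(" ", text)
    text = WHITESPACE.sub(" ", text).strip()
    return text.lower()

print(clean_review("<b>Great</b> game &amp; [h1]10/10[/h1]\n"))
```

Tags are replaced by a space rather than removed outright so that words on either side of a tag are not glued together.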



## Data Preparation by EDA Criteria

In [1]:
import pandas as pd

In [2]:
df = pd.read_parquet("data/steam_reviews.parquet")

In [3]:
# Convert Unix timestamps (seconds) to datetimes
df["timestamp_created"] = pd.to_datetime(df["timestamp_created"], unit='s')

In [4]:
df.head()

Unnamed: 0,app_id,app_name,review_id,language,review,timestamp_created,timestamp_updated,recommended,votes_helpful,votes_funny,...,steam_purchase,received_for_free,written_during_early_access,author.steamid,author.num_games_owned,author.num_reviews,author.playtime_forever,author.playtime_last_two_weeks,author.playtime_at_review,author.last_played
3,292030,The Witcher 3: Wild Hunt,85184605,english,"One of the best RPG's of all time, worthy of a...",2021-01-23 05:32:50,1611379970,True,0,0,...,True,False,False,76561199054755373,5,3,5587.0,3200.0,5524.0,1611384000.0
5,292030,The Witcher 3: Wild Hunt,85184171,english,"good story, good graphics. lots to do.",2021-01-23 05:21:04,1611379264,True,0,0,...,True,False,False,76561198170193529,11,1,823.0,823.0,823.0,1611379000.0
6,292030,The Witcher 3: Wild Hunt,85184064,english,"dis gud,",2021-01-23 05:18:11,1611379091,True,0,0,...,True,False,False,76561198119302812,27,2,4192.0,3398.0,4192.0,1611352000.0
18,292030,The Witcher 3: Wild Hunt,85180436,english,favorite game of all time cant wait for the Ne...,2021-01-23 03:38:06,1611373086,True,0,0,...,True,False,False,76561198065591528,33,1,23329.0,177.0,23329.0,1611219000.0
20,292030,The Witcher 3: Wild Hunt,85179753,english,Why wouldn't you get this,2021-01-23 03:19:38,1611371978,True,0,0,...,True,False,False,76561198996835044,131,2,8557.0,2004.0,8557.0,1611371000.0


In [5]:
# Manually chosen upper limits for cleaning the data
# Note: the criteria list at the top states the playtime limits in hours,
# whereas the comments and thresholds here treat the column values as minutes.
realistic_limits = {
    'votes_helpful': 100,  # at most 100 helpful votes
    'votes_funny': 100,  # at most 100 funny votes
    'author.playtime_forever': 10000,  # at most 10000 minutes of total playtime
    'author.playtime_last_two_weeks': 336,  # at most 336 minutes of playtime in the last two weeks
    'timestamp_created': pd.Timestamp(2020, 1, 1),  # keep reviews created on or after 2020-01-01
}

# Filter the data based on the limits defined above
df = df[(df['votes_helpful'] <= realistic_limits['votes_helpful']) &
        (df['votes_funny'] <= realistic_limits['votes_funny']) &
        (df['author.playtime_forever'] <= realistic_limits['author.playtime_forever']) &
        (df['author.playtime_last_two_weeks'] <= realistic_limits['author.playtime_last_two_weeks']) &
        (df['timestamp_created'] >= realistic_limits['timestamp_created'])]

In [6]:
df.head()

Unnamed: 0,app_id,app_name,review_id,language,review,timestamp_created,timestamp_updated,recommended,votes_helpful,votes_funny,...,steam_purchase,received_for_free,written_during_early_access,author.steamid,author.num_games_owned,author.num_reviews,author.playtime_forever,author.playtime_last_two_weeks,author.playtime_at_review,author.last_played
21,292030,The Witcher 3: Wild Hunt,85179400,english,it is ok\n,2021-01-23 03:09:52,1611371392,True,0,0,...,True,False,False,76561198284845223,60,9,2518.0,242.0,2518.0,1611371000.0
22,292030,The Witcher 3: Wild Hunt,85179341,english,worth\n,2021-01-23 03:08:38,1611371318,True,0,0,...,True,False,False,76561198370568524,59,5,553.0,35.0,517.0,1611374000.0
24,292030,The Witcher 3: Wild Hunt,85178164,english,Isn't Geralt hot enough to get both Yennefer a...,2021-01-23 02:37:58,1611369478,True,0,0,...,True,False,False,76561198040150323,51,37,165.0,0.0,165.0,1437876000.0
40,292030,The Witcher 3: Wild Hunt,85174737,english,The only thing bigger than the world map is ur...,2021-01-23 01:08:08,1611364088,True,0,0,...,True,False,False,76561199001673284,111,11,8068.0,0.0,8068.0,1597494000.0
45,292030,The Witcher 3: Wild Hunt,85172984,english,better than cyberpoop,2021-01-23 00:19:31,1611361171,True,0,0,...,False,False,False,76561198033304276,47,14,3159.0,0.0,3159.0,1524274000.0


In [7]:
# Remove exact duplicate rows (compared across all columns)
df = df.drop_duplicates()
# Drop rows without review text
df = df.dropna(subset=["review"])

### Fetching API Data

In [8]:
#import requests
#import pandas as pd
#import pyarrow.parquet as pq
#
## Collect the unique game IDs from the DataFrame
#app_ids = df['app_id'].unique()
#
## Define a function that queries the API and extracts the data
#def get_game_info(app_id):
#    url = f"https://store.steampowered.com/api/appdetails?appids={app_id}&l=english"
#    response = requests.get(url, timeout=10)
#    
#    if response.status_code == 200:
#        data = response.json().get(str(app_id), {}).get('data', {})
#        if data:
#            publisher = (data.get('publishers') or [None])[0]  # Publisher (None if the list is missing or empty)
#            genres = [genre['description'] for genre in data.get('genres', [])]  # Genre(s)
#            notes = data.get('content_descriptors', {}).get('notes', None)  # Notes
#            return {'app_id': app_id, 'publisher': publisher, 'genres': genres, 'notes': notes}
#    return {'app_id': app_id, 'publisher': None, 'genres': None, 'notes': None}
#
## Create a list to collect the results
#game_info_list = []
#
## Iterate over the game IDs and fetch the data from the API
#for app_id in app_ids:
#    game_info = get_game_info(app_id)
#    game_info_list.append(game_info)
#
## Convert the results into a DataFrame
#df_game_info = pd.DataFrame(game_info_list)
#
## Save the data in Parquet format
#df_game_info.to_parquet('data/game_info.parquet', engine='pyarrow')
#
#print("Data successfully saved to 'game_info.parquet'.")

In [9]:
# Load the game metadata from the Parquet file
df_game_info = pd.read_parquet('data/game_info.parquet')
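The parsing part of the commented-out fetch step can be separated from the HTTP call, which makes it testable offline. The sketch below assumes the payload shape implied by that code (`publishers`, `genres[].description`, `content_descriptors.notes`); `parse_app_details` is a hypothetical helper and the sample payloads are made up for illustration.

```python
def parse_app_details(app_id, payload):
    """Extract publisher, genres and notes from an app-details payload.

    Tolerates missing entries, missing keys and empty lists.
    """
    entry = payload.get(str(app_id)) or {}
    data = entry.get("data") or {}
    publishers = data.get("publishers") or []
    genres = [g["description"] for g in (data.get("genres") or []) if "description" in g]
    descriptors = data.get("content_descriptors") or {}
    return {
        "app_id": app_id,
        "publisher": publishers[0] if publishers else None,
        "genres": genres or None,
        "notes": descriptors.get("notes"),
    }

# Hypothetical sample payloads (structure assumed, values invented)
ok_payload = {
    "292030": {
        "success": True,
        "data": {
            "publishers": ["CD PROJEKT RED"],
            "genres": [{"id": "3", "description": "RPG"}],
            "content_descriptors": {"ids": [], "notes": None},
        },
    }
}
failed_payload = {"1": {"success": False}}

print(parse_app_details(292030, ok_payload))
print(parse_app_details(1, failed_payload))
```

Returning `None` for every missing field keeps the resulting DataFrame consistent with the fallback row that the fetch step produces for failed requests.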

In [10]:
# Left-join the game metadata onto the reviews
df = df.merge(df_game_info, on='app_id', how='left')

In [11]:
df.head(5)

Unnamed: 0,app_id,app_name,review_id,language,review,timestamp_created,timestamp_updated,recommended,votes_helpful,votes_funny,...,author.steamid,author.num_games_owned,author.num_reviews,author.playtime_forever,author.playtime_last_two_weeks,author.playtime_at_review,author.last_played,publisher,genres,notes
0,292030,The Witcher 3: Wild Hunt,85179400,english,it is ok\n,2021-01-23 03:09:52,1611371392,True,0,0,...,76561198284845223,60,9,2518.0,242.0,2518.0,1611371000.0,CD PROJEKT RED,[RPG],The Witcher 3: Wild Hunt contains strong langu...
1,292030,The Witcher 3: Wild Hunt,85179341,english,worth\n,2021-01-23 03:08:38,1611371318,True,0,0,...,76561198370568524,59,5,553.0,35.0,517.0,1611374000.0,CD PROJEKT RED,[RPG],The Witcher 3: Wild Hunt contains strong langu...
2,292030,The Witcher 3: Wild Hunt,85178164,english,Isn't Geralt hot enough to get both Yennefer a...,2021-01-23 02:37:58,1611369478,True,0,0,...,76561198040150323,51,37,165.0,0.0,165.0,1437876000.0,CD PROJEKT RED,[RPG],The Witcher 3: Wild Hunt contains strong langu...
3,292030,The Witcher 3: Wild Hunt,85174737,english,The only thing bigger than the world map is ur...,2021-01-23 01:08:08,1611364088,True,0,0,...,76561199001673284,111,11,8068.0,0.0,8068.0,1597494000.0,CD PROJEKT RED,[RPG],The Witcher 3: Wild Hunt contains strong langu...
4,292030,The Witcher 3: Wild Hunt,85172984,english,better than cyberpoop,2021-01-23 00:19:31,1611361171,True,0,0,...,76561198033304276,47,14,3159.0,0.0,3159.0,1524274000.0,CD PROJEKT RED,[RPG],The Witcher 3: Wild Hunt contains strong langu...


## Data-/Model-Oriented Cleaning

In [12]:
label_distribution = df['recommended'].value_counts() / len(df)
print(label_distribution)


recommended
True     0.932143
False    0.067857
Name: count, dtype: float64


In [13]:
# Sort by votes and keep only reviews that received at least one helpful or funny vote
df = df.sort_values(by=['votes_helpful', 'votes_funny'], ascending=False)
df = df[(df['votes_helpful'] > 0) | (df['votes_funny'] > 0)]

In [14]:
df

Unnamed: 0,app_id,app_name,review_id,language,review,timestamp_created,timestamp_updated,recommended,votes_helpful,votes_funny,...,author.steamid,author.num_games_owned,author.num_reviews,author.playtime_forever,author.playtime_last_two_weeks,author.playtime_at_review,author.last_played,publisher,genres,notes
343099,367520,Hollow Knight,65060521,english,This game is sexy,2020-03-15 02:26:54,1584239214,True,100,93,...,76561198307691522,23,2,6766.0,0.0,1881.0,1.597273e+09,Team Cherry,"[Action, Adventure, Indie]",
294292,1225330,NBA 2K21,75417828,english,"Hey, someone that played longer than 14 minute...",2020-09-04 10:35:14,1599447332,True,100,80,...,76561198116686192,274,27,6385.0,0.0,305.0,1.607874e+09,2K,"[Simulation, Sports]",
2050510,374320,DARK SOULS™ III,74441698,english,I found a rapport with the game. \nThe game ha...,2020-08-17 06:08:35,1597674239,True,100,60,...,76561198169477203,1094,36,142.0,0.0,142.0,1.587203e+09,"FromSoftware, Inc.",[Action],
1354010,698780,Doki Doki Literature Club,73446749,english,yurrrrrr big fat tiddies big fat mommy milkers...,2020-07-29 03:45:37,1595994365,True,100,48,...,76561198353092274,35,3,1034.0,0.0,1034.0,1.519216e+09,Team Salvato,"[Casual, Free To Play, Indie]",
2306092,489830,The Elder Scrolls V: Skyrim Special Edition,68440886,english,You know you wanna play it. Don't bother with ...,2020-05-02 16:40:29,1588437629,True,100,39,...,76561198307276518,31,7,2437.0,0.0,430.0,1.591840e+09,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2503685,546560,Half-Life: Alyx,65649322,english,"so far its great, except the fact that vive co...",2020-03-23 19:12:34,1585168953,True,0,1,...,76561198203885895,167,21,605.0,0.0,36.0,1.590979e+09,Valve,"[Action, Adventure]",Includes violence and gore.
2503699,546560,Half-Life: Alyx,65649110,english,"Apart from physics of grabbing small items, th...",2020-03-23 19:09:38,1584990578,True,0,1,...,76561198059511685,280,94,1240.0,0.0,111.0,1.593772e+09,Valve,"[Action, Adventure]",Includes violence and gore.
2503792,546560,Half-Life: Alyx,65647882,english,headcrab scared the shit out of me\n10/10,2020-03-23 18:49:54,1584989394,True,0,1,...,76561199018717049,6,1,274.0,0.0,62.0,1.585249e+09,Valve,"[Action, Adventure]",Includes violence and gore.
2503879,546560,Half-Life: Alyx,65646109,english,Great game. Incredible graphics! Runs great on...,2020-03-23 18:21:33,1584987693,True,0,1,...,76561198075128005,65,2,83.0,0.0,6.0,1.585254e+09,Valve,"[Action, Adventure]",Includes violence and gore.


In [15]:
# Count the number of reviews per game (app_id)
reviews_per_game = df['app_id'].value_counts()
print(reviews_per_game)

# Keep only games whose review count lies between the 25th and 75th percentile;
# this drops both the few heavily reviewed games and games with very few reviews
lower_bound = reviews_per_game.quantile(0.25)
upper_bound = reviews_per_game.quantile(0.75)

filtered_games = reviews_per_game[(reviews_per_game >= lower_bound) & (reviews_per_game <= upper_bound)].index

app_id
945360    37790
739630    23479
782330    19865
546560    12108
105600    11933
          ...  
872790       12
334040        9
772540        2
454200        2
619290        1
Name: count, Length: 315, dtype: int64
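To make the effect of the interquartile filter above easier to see, here is the same idea on toy data. `games_within_iqr` is a hypothetical wrapper and the IDs and counts are invented.

```python
import pandas as pd

def games_within_iqr(app_ids, lower_q=0.25, upper_q=0.75):
    """Return the app_ids whose review count lies within [lower_q, upper_q] quantiles."""
    counts = app_ids.value_counts()
    lower, upper = counts.quantile(lower_q), counts.quantile(upper_q)
    return counts[(counts >= lower) & (counts <= upper)].index

# Toy data: game 1 has 100 reviews, game 5 has a single review
toy_ids = pd.Series([1] * 100 + [2] * 10 + [3] * 8 + [4] * 6 + [5] * 1)
kept = games_within_iqr(toy_ids)
print(sorted(kept.tolist()))
```

Here the review counts are 100, 10, 8, 6 and 1, so the quartile bounds are 6 and 10: the game with 100 reviews and the game with one review are both dropped.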


In [16]:
df_filtered = df[df['app_id'].isin(filtered_games)]
df_filtered

Unnamed: 0,app_id,app_name,review_id,language,review,timestamp_created,timestamp_updated,recommended,votes_helpful,votes_funny,...,author.steamid,author.num_games_owned,author.num_reviews,author.playtime_forever,author.playtime_last_two_weeks,author.playtime_at_review,author.last_played,publisher,genres,notes
294292,1225330,NBA 2K21,75417828,english,"Hey, someone that played longer than 14 minute...",2020-09-04 10:35:14,1599447332,True,100,80,...,76561198116686192,274,27,6385.0,0.0,305.0,1.607874e+09,2K,"[Simulation, Sports]",
1783215,412830,STEINS;GATE,72999780,english,EL\n\nPSY\n\nKONGROO,2020-07-20 13:14:10,1595250850,True,100,32,...,76561198047527295,326,56,3451.0,0.0,2450.0,1.595637e+09,"Spike Chunsoft Co., Ltd.",[Adventure],
2070954,626690,Sword Art Online: Fatal Bullet,72878586,english,---{Graphics}---\n☐ You forget what reality is...,2020-07-18 05:53:53,1595051633,True,100,20,...,76561198129428276,83,2,2982.0,0.0,1088.0,1.608368e+09,BANDAI NAMCO Entertainment,"[Action, RPG]",
2090570,356190,Middle-earth™: Shadow of War™,61617517,english,I was fighting with some PITA enemy captain wh...,2020-01-10 10:39:24,1578653101,True,100,14,...,76561197983489066,312,32,2685.0,0.0,1726.0,1.579110e+09,WB Games,"[Action, Adventure, RPG]",
2362643,666140,My Time At Portia,69214890,english,If Stardew Valley and Animal Crossing had a ch...,2020-05-16 04:42:11,1589604131,True,100,10,...,76561198273476495,77,4,4819.0,0.0,4479.0,1.605565e+09,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2467094,421020,DiRT 4,64401555,english,>left corner \n>try to drift\n>??????????\n>th...,2020-03-02 10:25:26,1583144726,True,0,1,...,76561198063033909,184,9,2227.0,0.0,1359.0,1.593595e+09,,,
2467110,421020,DiRT 4,63957766,english,a bit casual but team management balances it,2020-02-22 22:26:37,1582410397,True,0,1,...,76561198182160538,72,22,532.0,0.0,409.0,1.582753e+09,,,
2467137,421020,DiRT 4,63250526,english,gud.,2020-02-10 01:46:39,1581299199,True,0,1,...,76561198194429819,28,3,1612.0,0.0,835.0,1.593969e+09,,,
2467176,421020,DiRT 4,62475415,english,is fun if enjoyed with proper wheel & pedal se...,2020-01-26 21:17:18,1580073438,True,0,1,...,76561198129644538,26,2,4339.0,0.0,3812.0,1.595189e+09,,,


In [17]:
import re

# Strip HTML tags and lowercase the text.
# assign() returns a new DataFrame, which avoids the pandas SettingWithCopyWarning
# that direct column assignment on the filtered slice triggers.
df_filtered = df_filtered.assign(
    cleaned_review=df_filtered['review']
    .apply(lambda x: re.sub(r'<.*?>', '', x))
    .str.lower()
)
df_filtered


Unnamed: 0,app_id,app_name,review_id,language,review,timestamp_created,timestamp_updated,recommended,votes_helpful,votes_funny,...,author.num_games_owned,author.num_reviews,author.playtime_forever,author.playtime_last_two_weeks,author.playtime_at_review,author.last_played,publisher,genres,notes,cleaned_review
294292,1225330,NBA 2K21,75417828,english,"Hey, someone that played longer than 14 minute...",2020-09-04 10:35:14,1599447332,True,100,80,...,274,27,6385.0,0.0,305.0,1.607874e+09,2K,"[Simulation, Sports]",,"hey, someone that played longer than 14 minute..."
1783215,412830,STEINS;GATE,72999780,english,EL\n\nPSY\n\nKONGROO,2020-07-20 13:14:10,1595250850,True,100,32,...,326,56,3451.0,0.0,2450.0,1.595637e+09,"Spike Chunsoft Co., Ltd.",[Adventure],,el\n\npsy\n\nkongroo
2070954,626690,Sword Art Online: Fatal Bullet,72878586,english,---{Graphics}---\n☐ You forget what reality is...,2020-07-18 05:53:53,1595051633,True,100,20,...,83,2,2982.0,0.0,1088.0,1.608368e+09,BANDAI NAMCO Entertainment,"[Action, RPG]",,---{graphics}---\n☐ you forget what reality is...
2090570,356190,Middle-earth™: Shadow of War™,61617517,english,I was fighting with some PITA enemy captain wh...,2020-01-10 10:39:24,1578653101,True,100,14,...,312,32,2685.0,0.0,1726.0,1.579110e+09,WB Games,"[Action, Adventure, RPG]",,i was fighting with some pita enemy captain wh...
2362643,666140,My Time At Portia,69214890,english,If Stardew Valley and Animal Crossing had a ch...,2020-05-16 04:42:11,1589604131,True,100,10,...,77,4,4819.0,0.0,4479.0,1.605565e+09,,,,if stardew valley and animal crossing had a ch...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2467094,421020,DiRT 4,64401555,english,>left corner \n>try to drift\n>??????????\n>th...,2020-03-02 10:25:26,1583144726,True,0,1,...,184,9,2227.0,0.0,1359.0,1.593595e+09,,,,>left corner \n>try to drift\n>??????????\n>th...
2467110,421020,DiRT 4,63957766,english,a bit casual but team management balances it,2020-02-22 22:26:37,1582410397,True,0,1,...,72,22,532.0,0.0,409.0,1.582753e+09,,,,a bit casual but team management balances it
2467137,421020,DiRT 4,63250526,english,gud.,2020-02-10 01:46:39,1581299199,True,0,1,...,28,3,1612.0,0.0,835.0,1.593969e+09,,,,gud.
2467176,421020,DiRT 4,62475415,english,is fun if enjoyed with proper wheel & pedal se...,2020-01-26 21:17:18,1580073438,True,0,1,...,26,2,4339.0,0.0,3812.0,1.595189e+09,,,,is fun if enjoyed with proper wheel & pedal se...


In [18]:
from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# Tokenize every review: truncate to 512 tokens and pad shorter reviews to 512.
# assign() returns a new DataFrame and so avoids the pandas SettingWithCopyWarning.
df_filtered = df_filtered.assign(
    tokenized=df_filtered['cleaned_review'].apply(
        lambda x: tokenizer(x, truncation=True, padding='max_length', max_length=512, return_tensors='pt')
    )
)


In [19]:
df_filtered

Unnamed: 0,app_id,app_name,review_id,language,review,timestamp_created,timestamp_updated,recommended,votes_helpful,votes_funny,...,author.num_reviews,author.playtime_forever,author.playtime_last_two_weeks,author.playtime_at_review,author.last_played,publisher,genres,notes,cleaned_review,tokenized
294292,1225330,NBA 2K21,75417828,english,"Hey, someone that played longer than 14 minute...",2020-09-04 10:35:14,1599447332,True,100,80,...,27,6385.0,0.0,305.0,1.607874e+09,2K,"[Simulation, Sports]",,"hey, someone that played longer than 14 minute...","[input_ids, attention_mask]"
1783215,412830,STEINS;GATE,72999780,english,EL\n\nPSY\n\nKONGROO,2020-07-20 13:14:10,1595250850,True,100,32,...,56,3451.0,0.0,2450.0,1.595637e+09,"Spike Chunsoft Co., Ltd.",[Adventure],,el\n\npsy\n\nkongroo,"[input_ids, attention_mask]"
2070954,626690,Sword Art Online: Fatal Bullet,72878586,english,---{Graphics}---\n☐ You forget what reality is...,2020-07-18 05:53:53,1595051633,True,100,20,...,2,2982.0,0.0,1088.0,1.608368e+09,BANDAI NAMCO Entertainment,"[Action, RPG]",,---{graphics}---\n☐ you forget what reality is...,"[input_ids, attention_mask]"
2090570,356190,Middle-earth™: Shadow of War™,61617517,english,I was fighting with some PITA enemy captain wh...,2020-01-10 10:39:24,1578653101,True,100,14,...,32,2685.0,0.0,1726.0,1.579110e+09,WB Games,"[Action, Adventure, RPG]",,i was fighting with some pita enemy captain wh...,"[input_ids, attention_mask]"
2362643,666140,My Time At Portia,69214890,english,If Stardew Valley and Animal Crossing had a ch...,2020-05-16 04:42:11,1589604131,True,100,10,...,4,4819.0,0.0,4479.0,1.605565e+09,,,,if stardew valley and animal crossing had a ch...,"[input_ids, attention_mask]"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2467094,421020,DiRT 4,64401555,english,>left corner \n>try to drift\n>??????????\n>th...,2020-03-02 10:25:26,1583144726,True,0,1,...,9,2227.0,0.0,1359.0,1.593595e+09,,,,>left corner \n>try to drift\n>??????????\n>th...,"[input_ids, attention_mask]"
2467110,421020,DiRT 4,63957766,english,a bit casual but team management balances it,2020-02-22 22:26:37,1582410397,True,0,1,...,22,532.0,0.0,409.0,1.582753e+09,,,,a bit casual but team management balances it,"[input_ids, attention_mask]"
2467137,421020,DiRT 4,63250526,english,gud.,2020-02-10 01:46:39,1581299199,True,0,1,...,3,1612.0,0.0,835.0,1.593969e+09,,,,gud.,"[input_ids, attention_mask]"
2467176,421020,DiRT 4,62475415,english,is fun if enjoyed with proper wheel & pedal se...,2020-01-26 21:17:18,1580073438,True,0,1,...,2,4339.0,0.0,3812.0,1.595189e+09,,,,is fun if enjoyed with proper wheel & pedal se...,"[input_ids, attention_mask]"


In [20]:
label_distribution = df_filtered['recommended'].value_counts() / len(df_filtered)
print(label_distribution)


recommended
True     0.764158
False    0.235842
Name: count, dtype: float64
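The distribution above is still skewed (roughly 76/24), while the plan calls for about 50/50. One possible way to get there is to downsample the majority class, sketched below. `downsample_to_balance` is a hypothetical helper, not something this notebook runs, and the toy DataFrame is invented.

```python
import pandas as pd

def downsample_to_balance(df, label_col="recommended", random_state=42):
    """Downsample every class to the size of the smallest class, then shuffle."""
    n_min = df[label_col].value_counts().min()
    parts = [
        group.sample(n=n_min, random_state=random_state)
        for _, group in df.groupby(label_col)
    ]
    balanced = pd.concat(parts)
    return balanced.sample(frac=1.0, random_state=random_state).reset_index(drop=True)

# Toy data: 8 positive and 2 negative reviews
toy = pd.DataFrame({
    "recommended": [True] * 8 + [False] * 2,
    "review": [f"review {i}" for i in range(10)],
})
balanced = downsample_to_balance(toy)
print(balanced["recommended"].value_counts())
```

If balancing is used, it is usually applied to the training split only, so that validation metrics still reflect the natural label distribution. Class weights in the loss would be an alternative that discards no data.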


### Train/Validation Split and Model Training

In [25]:
# Delete the original DataFrame to free memory
del df

# Optional: run the garbage collector explicitly
import gc
gc.collect()

0

In [21]:
from sklearn.model_selection import train_test_split

# Split the data into training and validation sets (80/20 split, stratified by label)
train_df, val_df = train_test_split(df_filtered, test_size=0.2, stratify=df_filtered['recommended'], random_state=42)

print(f"Training Set: {len(train_df)} Reviews")
print(f"Validation Set: {len(val_df)} Reviews")

Training Set: 93913 Reviews
Validation Set: 23479 Reviews


In [23]:
import torch

# Check whether a GPU is available
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"GPU available: {torch.cuda.get_device_name(0)}")
else:
    device = torch.device("cpu")
    print("No GPU available, using the CPU")

GPU available: NVIDIA GeForce RTX 2080 SUPER


In [24]:
import torch
from transformers import DistilBertTokenizer
from tqdm import tqdm

# Load the DistilBERT tokenizer
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# Tokenize a DataFrame of reviews, showing a progress bar
def tokenize_and_encode(df, tokenizer, max_length=512):
    input_ids = []
    attention_masks = []
    labels = []

    # Use tqdm to display progress
    for index, row in tqdm(df.iterrows(), total=len(df), desc="Tokenizing"):
        encoded = tokenizer(
            row['cleaned_review'],             # the review text
            add_special_tokens=True,           # add special tokens (CLS, SEP)
            max_length=max_length,             # maximum number of tokens
            padding='max_length',              # pad to the maximum length
            truncation=True,                   # truncate if longer than max_length
            return_attention_mask=True,        # return the attention mask
            return_tensors='pt'                # return PyTorch tensors
        )

        input_ids.append(encoded['input_ids'])
        attention_masks.append(encoded['attention_mask'])
        labels.append(int(row['recommended']))

    # Concatenate the per-review tensors into single tensors
    input_ids = torch.cat(input_ids, dim=0)
    attention_masks = torch.cat(attention_masks, dim=0)
    labels = torch.tensor(labels)

    return input_ids, attention_masks, labels

# Tokenize the training and validation data
train_input_ids, train_attention_masks, train_labels = tokenize_and_encode(train_df, tokenizer)
val_input_ids, val_attention_masks, val_labels = tokenize_and_encode(val_df, tokenizer)

# Move the data to the GPU (if available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

train_input_ids = train_input_ids.to(device)
train_attention_masks = train_attention_masks.to(device)
train_labels = train_labels.to(device)

val_input_ids = val_input_ids.to(device)
val_attention_masks = val_attention_masks.to(device)
val_labels = val_labels.to(device)


Tokenizing: 100%|██████████| 93913/93913 [01:29<00:00, 1053.16it/s]
Tokenizing: 100%|██████████| 23479/23479 [00:22<00:00, 1062.34it/s]


In [26]:
from torch.utils.data import DataLoader, TensorDataset
from tqdm import tqdm

# Create the TensorDatasets
train_dataset = TensorDataset(train_input_ids, train_attention_masks, train_labels)
val_dataset = TensorDataset(val_input_ids, val_attention_masks, val_labels)

# Create DataLoaders for training and validation
batch_size = 16

train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=batch_size)

# Example of how to track progress over the training batches with tqdm
for batch in tqdm(train_dataloader, desc="Training Batch Processing"):
    # This is where each batch would be processed
    pass


Training Batch Processing: 100%|██████████| 5870/5870 [00:01<00:00, 3725.02it/s]


In [27]:
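from transformers import DistilBertForSequenceClassification
# transformers.AdamW is deprecated in newer releases; torch.optim.AdamW is the maintained implementation.
# Note: its default weight_decay (0.01) differs from that of transformers.AdamW (0.0).
from torch.optim import AdamW

# Load the DistilBERT model for binary classification
model = DistilBertForSequenceClassification.from_pretrained(
    'distilbert-base-uncased',
    num_labels=2  # binary classification (True/False)
)

# Move the model to the selected device (GPU if available)
model = model.to(device)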


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [28]:
# Define the optimizer (AdamW) with a learning rate of 2e-5
optimizer = AdamW(model.parameters(), lr=2e-5)

# Cross-entropy loss for classification
# Note: loss_fn is not used below; the model computes the loss itself when labels are passed
loss_fn = torch.nn.CrossEntropyLoss()



In [29]:
from tqdm import tqdm

# Number of epochs
epochs = 3

# Training loop
for epoch in range(epochs):
    print(f"Epoch {epoch + 1}/{epochs}")

    # Put the model into training mode
    model.train()

    # Accumulated loss for this epoch
    total_loss = 0

    # Iterate over the training batches with a tqdm progress bar
    for batch in tqdm(train_dataloader, desc="Training"):
        # Unpack the batch
        input_ids = batch[0].to(device)
        attention_masks = batch[1].to(device)
        labels = batch[2].to(device)

        # Reset the gradients
        optimizer.zero_grad()

        # Forward pass
        outputs = model(input_ids, attention_mask=attention_masks, labels=labels)
        loss = outputs.loss
        logits = outputs.logits

        # Backward pass
        loss.backward()

        # Optimizer step
        optimizer.step()

        # Add the loss of the current batch
        total_loss += loss.item()

    # Compute the average loss over the epoch
    avg_train_loss = total_loss / len(train_dataloader)
    print(f"Average training loss: {avg_train_loss}")


Epoch 1/3

Training: 100%|██████████| 5870/5870 [40:54<00:00,  2.39it/s]

Average training loss: 0.21494498485826222
Epoch 2/3

Training: 100%|██████████| 5870/5870 [40:47<00:00,  2.40it/s]

Average training loss: 0.13872198582086648
Epoch 3/3

Training: 100%|██████████| 5870/5870 [41:13<00:00,  2.37it/s]

Average training loss: 0.08843472301610449





In [30]:
save_path = f"distilbert_model_epoch_{epoch+1}.pt"
torch.save({
    'epoch': epoch+1,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': avg_train_loss,
}, save_path)

In [31]:
from sklearn.metrics import accuracy_score

# Evaluation on the validation set
model.eval()  # put the model into evaluation mode
val_loss = 0
predictions, true_labels = [], []

# Iterate over the validation batches with a tqdm progress bar
for batch in tqdm(val_dataloader, desc="Validation"):
    input_ids = batch[0].to(device)
    attention_masks = batch[1].to(device)
    labels = batch[2].to(device)

    with torch.no_grad():  # disable gradient tracking during validation
        outputs = model(input_ids, attention_mask=attention_masks, labels=labels)
        loss = outputs.loss
        logits = outputs.logits

    # Accumulate the loss
    val_loss += loss.item()

    # Collect the predicted labels and the true labels
    predictions.extend(torch.argmax(logits, dim=1).cpu().numpy())
    true_labels.extend(labels.cpu().numpy())

# Compute the accuracy
accuracy = accuracy_score(true_labels, predictions)
print(f"Validation accuracy: {accuracy}")


Validation: 100%|██████████| 1468/1468 [03:41<00:00,  6.64it/s]

Validation accuracy: 0.9296392520976191





In [32]:
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix

# Compute additional metrics
precision = precision_score(true_labels, predictions)
recall = recall_score(true_labels, predictions)
f1 = f1_score(true_labels, predictions)

# Compute the confusion matrix
conf_matrix = confusion_matrix(true_labels, predictions)

print(f"Validation accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1-Score: {f1}")
print(f"Confusion matrix:\n{conf_matrix}")


Validation accuracy: 0.9296392520976191
Precision: 0.9541652726664436
Recall: 0.9537398283357486
F1-Score: 0.9539525030661167
Confusion matrix:
[[ 4715   822]
 [  830 17112]]
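The reported metrics can be re-derived by hand from the confusion matrix above, which also exposes two numbers the cell does not print: the recall of the negative class and the majority-class baseline. The sketch below uses plain Python and the scikit-learn layout (rows are true labels, columns are predictions, label order 0 then 1); the variable names are chosen for illustration.

```python
# Confusion matrix from the validation run: [[TN, FP], [FN, TP]]
conf_matrix = [[4715, 822], [830, 17112]]
(tn, fp), (fn, tp) = conf_matrix
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)           # of all predicted positives, how many are correct
recall = tp / (tp + fn)              # of all true positives, how many were found
f1 = 2 * precision * recall / (precision + recall)
specificity = tn / (tn + fp)         # recall of the negative class
majority_baseline = (tp + fn) / total  # accuracy of always predicting "recommended"

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
print(f"specificity={specificity:.4f} majority_baseline={majority_baseline:.4f}")
```

The negative class is recognised noticeably less reliably than the positive one (about 0.85 versus 0.95), and the overall accuracy should be read against a baseline of roughly 0.76 obtained by always predicting the majority class.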


In [33]:
# Compute the average validation loss
avg_val_loss = val_loss / len(val_dataloader)
print(f"Average validation loss: {avg_val_loss}")

Average validation loss: 0.21015889932154375


In [34]:
import torch

# Path where the model is saved
model_save_path = "distilbert_model_final.pt"

# Save the model state (state_dict) and, optionally, the optimizer state
torch.save({
    'model_state_dict': model.state_dict(),  # model weights
    'optimizer_state_dict': optimizer.state_dict(),  # optional: optimizer state
}, model_save_path)

print(f"Model successfully saved to {model_save_path}.")


Model successfully saved to distilbert_model_final.pt.


In [35]:
# Load the saved checkpoint onto the current device
checkpoint = torch.load("distilbert_model_final.pt", map_location=device)

# Load the weights into the model
model.load_state_dict(checkpoint['model_state_dict'])

# Optional: restore the optimizer state if training is to be resumed
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])

# Put the model into evaluation mode (for inference)
model.eval()

print("Model successfully loaded and ready to use.")

Model successfully loaded and ready to use.


## Prediction on New Text

In [1]:
import torch

# Check whether a GPU is available
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"GPU available: {torch.cuda.get_device_name(0)}")
else:
    device = torch.device("cpu")
    print("No GPU available, using the CPU")

GPU available: NVIDIA GeForce RTX 2080 SUPER


In [2]:
from transformers import DistilBertForSequenceClassification

# Load the DistilBERT model for binary classification
model = DistilBertForSequenceClassification.from_pretrained(
    'distilbert-base-uncased',
    num_labels=2  # binary classification (True/False)
)

# Move the model to the selected device (GPU if available)
model = model.to(device)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [3]:
import torch
from transformers import DistilBertTokenizer

# Select the device first so the checkpoint can be mapped onto it
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the saved checkpoint and restore the fine-tuned weights
checkpoint = torch.load("distilbert_model_final.pt", map_location=device)
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()  # put the model into evaluation mode

# Load the same tokenizer that was used during training
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# Move the model to the GPU if available
model.to(device)

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

In [24]:
def preprocess_text(text, tokenizer, max_length=512):
    # Tokenize the text
    encoding = tokenizer(
        text,
        add_special_tokens=True,  # add special tokens (CLS, SEP)
        max_length=max_length,    # cap the length at 512 tokens
        padding='max_length',     # pad to the maximum length
        truncation=True,          # truncate if longer than max_length
        return_tensors='pt'       # return PyTorch tensors
    )
    
    return encoding['input_ids'].to(device), encoding['attention_mask'].to(device)

# Example text (a new review)
test_text = """

It's my first FIFA and I like it so far.
Not too much fun against real players as they are miles better than me, but after grinding some Squad Battles, I managed to get some wins in Rivals as well.
Fun to learn the technics, fun to trade on the market.
I like it.

"""

# Preprocess the text
input_ids, attention_mask = preprocess_text(test_text, tokenizer)


In [25]:
# Run a prediction with the loaded model
with torch.no_grad():  # no gradient computation needed for inference
    outputs = model(input_ids, attention_mask=attention_mask)
    logits = outputs.logits

# Pick the class with the highest logit (argmax only; no softmax is applied here)
predicted_class = torch.argmax(logits, dim=1).cpu().numpy()[0]

# Print the prediction
if predicted_class == 1:
    print("The model predicts that the review recommends the game.")
else:
    print("The model predicts that the review does not recommend the game.")


The model predicts that the review recommends the game.
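The prediction cell takes the argmax of the logits directly. If a confidence value is wanted as well, the logits can be turned into probabilities with a softmax. The sketch below does this in plain Python so that it runs without the model; the logit values are hypothetical, not output of the trained model.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Hypothetical logits for [not recommended, recommended]
example_logits = [-1.2, 2.3]
probs = softmax(example_logits)
predicted_class = max(range(len(probs)), key=probs.__getitem__)

print(f"P(not recommended)={probs[0]:.3f}, P(recommended)={probs[1]:.3f}")
print(f"predicted class: {predicted_class}")
```

In the notebook itself the equivalent would be `torch.softmax(logits, dim=1)`. Softmax is monotonic, so the argmax, and therefore the predicted class, is the same with or without it.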
