# Data Preprocessing

We start from scratch here and skip the EDA, since it was already carried out in a separate notebook.

**The EDA established the following criteria:**

* Data from 2020-2021

* English language

* At most 100 helpful votes

* At most 100 funny votes

* At most 10,000 hours of total playtime

* At most 336 hours of playtime in the last two weeks (14 days × 24 hours)


**Goal-oriented data cleaning:**

Since the goal of this project is a Transformer model that predicts whether a review is positive or not, we prepare the data so that it suits the model.

* Text length is limited to 512 tokens

* Reviews that are too short are removed (minimum 10 tokens)

* Reviews that are too long are truncated (512 tokens)
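
The length rules above can be sketched as follows. This is a minimal sketch, not the notebook's implementation: `count_tokens` is a hypothetical helper that splits on whitespace as a stand-in for the subword tokenizer of the model, and `MIN_TOKENS` / `MAX_TOKENS` are names chosen for illustration. With a real tokenizer the counts would differ, since one word can map to several subword tokens.

```python
MIN_TOKENS = 10   # reviews with fewer tokens are dropped
MAX_TOKENS = 512  # longer reviews are truncated to this many tokens


def count_tokens(text):
    """Hypothetical stand-in: counts whitespace tokens, not subword tokens."""
    return len(text.split())


def apply_length_rules(reviews):
    """Drop reviews that are too short and truncate those that are too long."""
    kept = []
    for text in reviews:
        if count_tokens(text) < MIN_TOKENS:
            continue  # too short, remove the review
        tokens = text.split()
        kept.append(" ".join(tokens[:MAX_TOKENS]))
    return kept


reviews = [
    "dis gud,",                                                    # 2 tokens, removed
    "good story, good graphics. lots to do. would play it again",  # 11 tokens, kept
    "word " * 600,                                                 # 600 tokens, truncated
]
cleaned = apply_length_rules(reviews)
print([count_tokens(r) for r in cleaned])
```

In the later cells the truncation is handled by the tokenizer itself (`truncation=True, max_length=512`), so only the minimum-length rule would need a separate filtering step like this one.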


**Data- and model-oriented preparation:**

* Duplicates are removed

* Handling of special characters and emojis (needed for a Transformer? As far as we know not really, since a Transformer can handle special characters as well)

* Handling of HTML tags (needed for a Transformer? HTML tags might waste tokens unnecessarily.)

* Lowercasing (probably: the Transformer is case-sensitive; if everything is lowercase, it no longer distinguishes between upper and lower case, so we expect a performance gain)

* Class balance is established (~50/50)

* Optional balancing per game or genre (either games or genres are balanced)
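
One way to reach the ~50/50 class balance listed above is random undersampling of the majority class. The sketch below is only an illustration under that assumption: `balance_classes` is a hypothetical helper, the toy DataFrame is made up, and other strategies (oversampling, class weights in the loss) would be equally valid choices.

```python
import pandas as pd


def balance_classes(df, label_col="recommended", seed=42):
    """Undersample every class down to the size of the smallest class."""
    n_min = int(df[label_col].value_counts().min())
    balanced = df.groupby(label_col).sample(n=n_min, random_state=seed)
    # Shuffle so the classes are not stored in contiguous blocks
    return balanced.sample(frac=1, random_state=seed).reset_index(drop=True)


# Toy data: 9 positive and 3 negative reviews (illustrative only)
toy = pd.DataFrame({
    "review": [f"review {i}" for i in range(12)],
    "recommended": [True] * 9 + [False] * 3,
})
balanced = balance_classes(toy)
print(balanced["recommended"].value_counts().to_dict())
```

Undersampling discards data from the majority class, so with a strongly skewed label distribution a large share of the positive reviews would be dropped.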



## Data Preparation Based on the EDA Criteria

In [1]:
import pandas as pd

In [2]:
df = pd.read_parquet("data/steam_reviews.parquet")

In [3]:
# convert Unix timestamps (seconds) to datetime
df["timestamp_created"] = pd.to_datetime(df["timestamp_created"], unit='s')

In [4]:
df.head()

Unnamed: 0,app_id,app_name,review_id,language,review,timestamp_created,timestamp_updated,recommended,votes_helpful,votes_funny,...,steam_purchase,received_for_free,written_during_early_access,author.steamid,author.num_games_owned,author.num_reviews,author.playtime_forever,author.playtime_last_two_weeks,author.playtime_at_review,author.last_played
3,292030,The Witcher 3: Wild Hunt,85184605,english,"One of the best RPG's of all time, worthy of a...",2021-01-23 05:32:50,1611379970,True,0,0,...,True,False,False,76561199054755373,5,3,5587.0,3200.0,5524.0,1611384000.0
5,292030,The Witcher 3: Wild Hunt,85184171,english,"good story, good graphics. lots to do.",2021-01-23 05:21:04,1611379264,True,0,0,...,True,False,False,76561198170193529,11,1,823.0,823.0,823.0,1611379000.0
6,292030,The Witcher 3: Wild Hunt,85184064,english,"dis gud,",2021-01-23 05:18:11,1611379091,True,0,0,...,True,False,False,76561198119302812,27,2,4192.0,3398.0,4192.0,1611352000.0
18,292030,The Witcher 3: Wild Hunt,85180436,english,favorite game of all time cant wait for the Ne...,2021-01-23 03:38:06,1611373086,True,0,0,...,True,False,False,76561198065591528,33,1,23329.0,177.0,23329.0,1611219000.0
20,292030,The Witcher 3: Wild Hunt,85179753,english,Why wouldn't you get this,2021-01-23 03:19:38,1611371978,True,0,0,...,True,False,False,76561198996835044,131,2,8557.0,2004.0,8557.0,1611371000.0


In [13]:
# Define manual thresholds for cleaning the data
realistic_limits = {
    'votes_helpful': 1500, # at most 1500 helpful votes
    'votes_funny': 1000, # at most 1000 funny votes
    'author.playtime_forever': 15000, # at most 15000 minutes of total playtime
    'timestamp_created': pd.Timestamp(2020, 1, 1), # keep reviews created from 2020 onwards
}

# Filter the data based on the thresholds defined above
df = df[(df['votes_helpful'] <= realistic_limits['votes_helpful']) &
                (df['votes_funny'] <= realistic_limits['votes_funny']) &
                (df['author.playtime_forever'] <= realistic_limits['author.playtime_forever']) &
                (df['timestamp_created'] >= realistic_limits['timestamp_created'])]

In [14]:
df.head()

Unnamed: 0,app_id,app_name,review_id,language,review,timestamp_created,timestamp_updated,recommended,votes_helpful,votes_funny,...,steam_purchase,received_for_free,written_during_early_access,author.steamid,author.num_games_owned,author.num_reviews,author.playtime_forever,author.playtime_last_two_weeks,author.playtime_at_review,author.last_played
3,292030,The Witcher 3: Wild Hunt,85184605,english,"One of the best RPG's of all time, worthy of a...",2021-01-23 05:32:50,1611379970,True,0,0,...,True,False,False,76561199054755373,5,3,5587.0,3200.0,5524.0,1611384000.0
5,292030,The Witcher 3: Wild Hunt,85184171,english,"good story, good graphics. lots to do.",2021-01-23 05:21:04,1611379264,True,0,0,...,True,False,False,76561198170193529,11,1,823.0,823.0,823.0,1611379000.0
6,292030,The Witcher 3: Wild Hunt,85184064,english,"dis gud,",2021-01-23 05:18:11,1611379091,True,0,0,...,True,False,False,76561198119302812,27,2,4192.0,3398.0,4192.0,1611352000.0
20,292030,The Witcher 3: Wild Hunt,85179753,english,Why wouldn't you get this,2021-01-23 03:19:38,1611371978,True,0,0,...,True,False,False,76561198996835044,131,2,8557.0,2004.0,8557.0,1611371000.0
21,292030,The Witcher 3: Wild Hunt,85179400,english,it is ok\n,2021-01-23 03:09:52,1611371392,True,0,0,...,True,False,False,76561198284845223,60,9,2518.0,242.0,2518.0,1611371000.0


In [15]:
# remove duplicates, based on all columns
df = df.drop_duplicates()
df = df.dropna(subset=["review"])

### Fetching API Data

In [8]:
#import requests
#import pandas as pd
#import pyarrow.parquet as pq
#
## Collect the unique game IDs from the DataFrame
#app_ids = df['app_id'].unique()
#
## Define the function that queries the API and extracts the data
#def get_game_info(app_id):
#    url = f"https://store.steampowered.com/api/appdetails?appids={app_id}&l=english"
#    response = requests.get(url)
#    
#    if response.status_code == 200:
#        data = response.json().get(str(app_id), {}).get('data', {})
#        if data:
#            publisher = data.get('publishers', [None])[0]  # Publisher
#            genres = [genre['description'] for genre in data.get('genres', [])]  # Genre(s)
#            notes = data.get('content_descriptors', {}).get('notes', None)  # Notes
#            return {'app_id': app_id, 'publisher': publisher, 'genres': genres, 'notes': notes}
#    return {'app_id': app_id, 'publisher': None, 'genres': None, 'notes': None}
#
## List that collects the results
#game_info_list = []
#
## Iterate over the game IDs and fetch the data from the API
#for app_id in app_ids:
#    game_info = get_game_info(app_id)
#    game_info_list.append(game_info)
#
## Convert the results into a DataFrame
#df_game_info = pd.DataFrame(game_info_list)
#
## Save the data in Parquet format
#df_game_info.to_parquet('data/game_info.parquet', engine='pyarrow')
#
#print("Data successfully saved to 'game_info.parquet'.")
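
The parsing step of the commented-out `get_game_info` can be tried without any network access. The sketch below is an assumption-laden illustration: `parse_game_info` is a hypothetical helper, and `sample` is a hand-written payload whose shape is inferred from the keys accessed in the code above, not a recorded API response. Unlike the original expression `data.get('publishers', [None])[0]`, it also tolerates an empty `publishers` list.

```python
def parse_game_info(payload, app_id):
    """Extract publisher, genres and content notes from a decoded JSON payload."""
    data = payload.get(str(app_id), {}).get("data") or {}
    publishers = data.get("publishers") or [None]   # guards against a missing or empty list
    genres = [g["description"] for g in data.get("genres", [])]
    notes = (data.get("content_descriptors") or {}).get("notes")
    return {
        "app_id": app_id,
        "publisher": publishers[0],
        "genres": genres or None,
        "notes": notes,
    }


# Hand-written sample payload (illustrative, not a recorded API response)
sample = {
    "292030": {
        "success": True,
        "data": {
            "publishers": ["CD PROJEKT RED"],
            "genres": [{"id": "3", "description": "RPG"}],
            "content_descriptors": {"ids": [], "notes": None},
        },
    },
    "1": {"success": False},
    "2": {"success": True, "data": {"publishers": [], "genres": []}},
}

print(parse_game_info(sample, 292030))
print(parse_game_info(sample, 1))
print(parse_game_info(sample, 2))
```

Separating the parsing from the HTTP request makes it possible to test the extraction logic on its own before looping over all game IDs.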

In [16]:
# Load the game info from the Parquet file
df_game_info = pd.read_parquet('data/game_info.parquet')

In [17]:
# left-join the reviews with the game info on app_id
df = df.merge(df_game_info, on='app_id', how='left')
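
A left join like the one above keeps every review, but it silently multiplies rows if the right-hand table contains duplicate keys. The following sketch, on made-up toy frames, shows two optional `merge` parameters that can guard against this; whether the real `game_info.parquet` needs them is an assumption to verify.

```python
import pandas as pd

reviews = pd.DataFrame({
    "app_id": [292030, 292030, 620980, 999999],
    "review": ["dis gud,", "it is ok", "lots of fun", "no metadata for this one"],
})
game_info = pd.DataFrame({
    "app_id": [292030, 620980],
    "publisher": ["CD PROJEKT RED", "Beat Games"],
})

merged = reviews.merge(
    game_info,
    on="app_id",
    how="left",
    validate="many_to_one",  # raises MergeError if game_info has duplicate app_ids
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# A left join with unique right-hand keys must not change the row count
assert len(merged) == len(reviews)

# Reviews whose game has no metadata row
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched["app_id"].tolist())
```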

In [18]:
df.head(5)

Unnamed: 0,app_id,app_name,review_id,language,review,timestamp_created,timestamp_updated,recommended,votes_helpful,votes_funny,...,author.steamid,author.num_games_owned,author.num_reviews,author.playtime_forever,author.playtime_last_two_weeks,author.playtime_at_review,author.last_played,publisher,genres,notes
0,292030,The Witcher 3: Wild Hunt,85184605,english,"One of the best RPG's of all time, worthy of a...",2021-01-23 05:32:50,1611379970,True,0,0,...,76561199054755373,5,3,5587.0,3200.0,5524.0,1611384000.0,CD PROJEKT RED,[RPG],The Witcher 3: Wild Hunt contains strong langu...
1,292030,The Witcher 3: Wild Hunt,85184171,english,"good story, good graphics. lots to do.",2021-01-23 05:21:04,1611379264,True,0,0,...,76561198170193529,11,1,823.0,823.0,823.0,1611379000.0,CD PROJEKT RED,[RPG],The Witcher 3: Wild Hunt contains strong langu...
2,292030,The Witcher 3: Wild Hunt,85184064,english,"dis gud,",2021-01-23 05:18:11,1611379091,True,0,0,...,76561198119302812,27,2,4192.0,3398.0,4192.0,1611352000.0,CD PROJEKT RED,[RPG],The Witcher 3: Wild Hunt contains strong langu...
3,292030,The Witcher 3: Wild Hunt,85179753,english,Why wouldn't you get this,2021-01-23 03:19:38,1611371978,True,0,0,...,76561198996835044,131,2,8557.0,2004.0,8557.0,1611371000.0,CD PROJEKT RED,[RPG],The Witcher 3: Wild Hunt contains strong langu...
4,292030,The Witcher 3: Wild Hunt,85179400,english,it is ok\n,2021-01-23 03:09:52,1611371392,True,0,0,...,76561198284845223,60,9,2518.0,242.0,2518.0,1611371000.0,CD PROJEKT RED,[RPG],The Witcher 3: Wild Hunt contains strong langu...


## Data- and Model-Oriented Cleaning

In [19]:
label_distribution = df['recommended'].value_counts() / len(df)
print(label_distribution)


recommended
True     0.936254
False    0.063746
Name: count, dtype: float64


In [20]:
df = df.sort_values(by=['votes_helpful', 'votes_funny'], ascending=False)
df = df[(df['votes_helpful'] > 0) | (df['votes_funny'] > 0)]

In [21]:
df

Unnamed: 0,app_id,app_name,review_id,language,review,timestamp_created,timestamp_updated,recommended,votes_helpful,votes_funny,...,author.steamid,author.num_games_owned,author.num_reviews,author.playtime_forever,author.playtime_last_two_weeks,author.playtime_at_review,author.last_played,publisher,genres,notes
269238,620980,Beat Saber,79244667,english,"Since I am 80 + years old, it is very importan...",2020-11-14 10:51:22,1605351082,True,1493,176,...,76561198363815021,5,1,2544.0,0.0,2451.0,1.609319e+09,Beat Games,[Indie],
1776430,1113000,Persona 4 Golden,70806847,english,Please buy this game if you want more Persona ...,2020-06-15 02:03:47,1606364144,True,1490,31,...,76561198075994945,138,12,9612.0,0.0,499.0,1.606364e+09,SEGA,[RPG],Alcohol Reference\r\nAnimated Blood\r\nLanguag...
1396365,1145360,Hades,75662801,english,You can date the medusa head.\n\nPost Launch E...,2020-09-08 19:18:50,1602359822,True,1486,745,...,76561198012037170,500,7,3965.0,0.0,2445.0,1.608123e+09,Supergiant Games,"[Action, Indie, RPG]",
354159,1225330,NBA 2K21,75410143,english,There is very little difference from 2k20. The...,2020-09-04 06:20:35,1599200435,False,1484,99,...,76561198060924282,26,1,698.0,576.0,121.0,1.610669e+09,2K,"[Simulation, Sports]",
1555472,1289310,Helltaker,71408186,english,Lots of R34,2020-06-26 14:59:18,1593183558,True,1482,594,...,76561198407882457,534,108,14.0,0.0,14.0,1.590006e+09,vanripper,"[Adventure, Free To Play, Indie]",Suggestive clothing and poses. Don't put your ...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2997695,546560,Half-Life: Alyx,65649322,english,"so far its great, except the fact that vive co...",2020-03-23 19:12:34,1585168953,True,0,1,...,76561198203885895,167,21,605.0,0.0,36.0,1.590979e+09,Valve,"[Action, Adventure]",Includes violence and gore.
2997709,546560,Half-Life: Alyx,65649110,english,"Apart from physics of grabbing small items, th...",2020-03-23 19:09:38,1584990578,True,0,1,...,76561198059511685,280,94,1240.0,0.0,111.0,1.593772e+09,Valve,"[Action, Adventure]",Includes violence and gore.
2997803,546560,Half-Life: Alyx,65647882,english,headcrab scared the shit out of me\n10/10,2020-03-23 18:49:54,1584989394,True,0,1,...,76561199018717049,6,1,274.0,0.0,62.0,1.585249e+09,Valve,"[Action, Adventure]",Includes violence and gore.
2997891,546560,Half-Life: Alyx,65646109,english,Great game. Incredible graphics! Runs great on...,2020-03-23 18:21:33,1584987693,True,0,1,...,76561198075128005,65,2,83.0,0.0,6.0,1.585254e+09,Valve,"[Action, Adventure]",Includes violence and gore.


In [23]:
# Count the number of reviews per game (app_id)
reviews_per_game = df['app_id'].value_counts()
print(reviews_per_game)

lower_bound = reviews_per_game.quantile(0.25)

filtered_games = reviews_per_game[(reviews_per_game >= lower_bound)].index

app_id
945360    39875
739630    27415
782330    21111
105600    16421
359550    12719
          ...  
385560       14
334040        9
772540        3
454200        2
619290        1
Name: count, Length: 315, dtype: int64
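
To make the quantile filter above concrete, here is a small sketch on made-up data. Note that the 25% quantile is taken over the per-game counts, so it removes roughly the quarter of games with the fewest reviews, which is usually a much smaller share of the reviews themselves.

```python
import pandas as pd

# Toy data: four games with 8, 4, 2 and 1 reviews (illustrative only)
toy = pd.DataFrame({"app_id": [1] * 8 + [2] * 4 + [3] * 2 + [4] * 1})

reviews_per_game = toy["app_id"].value_counts()
# Default linear interpolation between the sorted counts [1, 2, 4, 8]
lower_bound = reviews_per_game.quantile(0.25)
kept_games = reviews_per_game[reviews_per_game >= lower_bound].index

print(lower_bound)
print(sorted(kept_games.tolist()))

toy_filtered = toy[toy["app_id"].isin(kept_games)]
print(len(toy_filtered), "of", len(toy), "reviews kept")
```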


In [24]:
df_filtered = df[df['app_id'].isin(filtered_games)]
df_filtered

Unnamed: 0,app_id,app_name,review_id,language,review,timestamp_created,timestamp_updated,recommended,votes_helpful,votes_funny,...,author.steamid,author.num_games_owned,author.num_reviews,author.playtime_forever,author.playtime_last_two_weeks,author.playtime_at_review,author.last_played,publisher,genres,notes
269238,620980,Beat Saber,79244667,english,"Since I am 80 + years old, it is very importan...",2020-11-14 10:51:22,1605351082,True,1493,176,...,76561198363815021,5,1,2544.0,0.0,2451.0,1.609319e+09,Beat Games,[Indie],
1776430,1113000,Persona 4 Golden,70806847,english,Please buy this game if you want more Persona ...,2020-06-15 02:03:47,1606364144,True,1490,31,...,76561198075994945,138,12,9612.0,0.0,499.0,1.606364e+09,SEGA,[RPG],Alcohol Reference\r\nAnimated Blood\r\nLanguag...
1396365,1145360,Hades,75662801,english,You can date the medusa head.\n\nPost Launch E...,2020-09-08 19:18:50,1602359822,True,1486,745,...,76561198012037170,500,7,3965.0,0.0,2445.0,1.608123e+09,Supergiant Games,"[Action, Indie, RPG]",
354159,1225330,NBA 2K21,75410143,english,There is very little difference from 2k20. The...,2020-09-04 06:20:35,1599200435,False,1484,99,...,76561198060924282,26,1,698.0,576.0,121.0,1.610669e+09,2K,"[Simulation, Sports]",
1555472,1289310,Helltaker,71408186,english,Lots of R34,2020-06-26 14:59:18,1593183558,True,1482,594,...,76561198407882457,534,108,14.0,0.0,14.0,1.590006e+09,vanripper,"[Adventure, Free To Play, Indie]",Suggestive clothing and poses. Don't put your ...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2997695,546560,Half-Life: Alyx,65649322,english,"so far its great, except the fact that vive co...",2020-03-23 19:12:34,1585168953,True,0,1,...,76561198203885895,167,21,605.0,0.0,36.0,1.590979e+09,Valve,"[Action, Adventure]",Includes violence and gore.
2997709,546560,Half-Life: Alyx,65649110,english,"Apart from physics of grabbing small items, th...",2020-03-23 19:09:38,1584990578,True,0,1,...,76561198059511685,280,94,1240.0,0.0,111.0,1.593772e+09,Valve,"[Action, Adventure]",Includes violence and gore.
2997803,546560,Half-Life: Alyx,65647882,english,headcrab scared the shit out of me\n10/10,2020-03-23 18:49:54,1584989394,True,0,1,...,76561199018717049,6,1,274.0,0.0,62.0,1.585249e+09,Valve,"[Action, Adventure]",Includes violence and gore.
2997891,546560,Half-Life: Alyx,65646109,english,Great game. Incredible graphics! Runs great on...,2020-03-23 18:21:33,1584987693,True,0,1,...,76561198075128005,65,2,83.0,0.0,6.0,1.585254e+09,Valve,"[Action, Adventure]",Includes violence and gore.


In [25]:
import re

# Work on an explicit copy so that the column assignments below
# do not trigger pandas' SettingWithCopyWarning
df_filtered = df_filtered.copy()

# Strip HTML tags, then lowercase
df_filtered['cleaned_review'] = df_filtered['review'].apply(lambda x: re.sub(r'<.*?>', '', x))
df_filtered['cleaned_review'] = df_filtered['cleaned_review'].str.lower()
df_filtered


Unnamed: 0,app_id,app_name,review_id,language,review,timestamp_created,timestamp_updated,recommended,votes_helpful,votes_funny,...,author.num_games_owned,author.num_reviews,author.playtime_forever,author.playtime_last_two_weeks,author.playtime_at_review,author.last_played,publisher,genres,notes,cleaned_review
269238,620980,Beat Saber,79244667,english,"Since I am 80 + years old, it is very importan...",2020-11-14 10:51:22,1605351082,True,1493,176,...,5,1,2544.0,0.0,2451.0,1.609319e+09,Beat Games,[Indie],,"since i am 80 + years old, it is very importan..."
1776430,1113000,Persona 4 Golden,70806847,english,Please buy this game if you want more Persona ...,2020-06-15 02:03:47,1606364144,True,1490,31,...,138,12,9612.0,0.0,499.0,1.606364e+09,SEGA,[RPG],Alcohol Reference\r\nAnimated Blood\r\nLanguag...,please buy this game if you want more persona ...
1396365,1145360,Hades,75662801,english,You can date the medusa head.\n\nPost Launch E...,2020-09-08 19:18:50,1602359822,True,1486,745,...,500,7,3965.0,0.0,2445.0,1.608123e+09,Supergiant Games,"[Action, Indie, RPG]",,you can date the medusa head.\n\npost launch e...
354159,1225330,NBA 2K21,75410143,english,There is very little difference from 2k20. The...,2020-09-04 06:20:35,1599200435,False,1484,99,...,26,1,698.0,576.0,121.0,1.610669e+09,2K,"[Simulation, Sports]",,there is very little difference from 2k20. the...
1555472,1289310,Helltaker,71408186,english,Lots of R34,2020-06-26 14:59:18,1593183558,True,1482,594,...,534,108,14.0,0.0,14.0,1.590006e+09,vanripper,"[Adventure, Free To Play, Indie]",Suggestive clothing and poses. Don't put your ...,lots of r34
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2997695,546560,Half-Life: Alyx,65649322,english,"so far its great, except the fact that vive co...",2020-03-23 19:12:34,1585168953,True,0,1,...,167,21,605.0,0.0,36.0,1.590979e+09,Valve,"[Action, Adventure]",Includes violence and gore.,"so far its great, except the fact that vive co..."
2997709,546560,Half-Life: Alyx,65649110,english,"Apart from physics of grabbing small items, th...",2020-03-23 19:09:38,1584990578,True,0,1,...,280,94,1240.0,0.0,111.0,1.593772e+09,Valve,"[Action, Adventure]",Includes violence and gore.,"apart from physics of grabbing small items, th..."
2997803,546560,Half-Life: Alyx,65647882,english,headcrab scared the shit out of me\n10/10,2020-03-23 18:49:54,1584989394,True,0,1,...,6,1,274.0,0.0,62.0,1.585249e+09,Valve,"[Action, Adventure]",Includes violence and gore.,headcrab scared the shit out of me\n10/10
2997891,546560,Half-Life: Alyx,65646109,english,Great game. Incredible graphics! Runs great on...,2020-03-23 18:21:33,1584987693,True,0,1,...,65,2,83.0,0.0,6.0,1.585254e+09,Valve,"[Action, Adventure]",Includes violence and gore.,great game. incredible graphics! runs great on...
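
The cleaning step above can be checked on a few hand-written strings. The sketch below reuses the same `<.*?>` pattern and, as an assumption not confirmed by this notebook, also strips square-bracket markup such as `[b]...[/b]`; the tag list in `BBCODE_TAG` is a hypothetical selection for illustration. The second example shows a pitfall of the pattern: it also removes ordinary text that happens to sit between a `<` and a later `>` on the same line.

```python
import re

HTML_TAG = re.compile(r"<.*?>")
# Assumption: square-bracket markup; the tag list is illustrative, not exhaustive
BBCODE_TAG = re.compile(
    r"\[/?(?:b|i|u|h1|strike|spoiler|url|list|quote|code)(?:=[^\]]*)?\]",
    re.IGNORECASE,
)


def clean_review(text):
    """Remove HTML tags and square-bracket markup, then lowercase."""
    text = HTML_TAG.sub("", text)
    text = BBCODE_TAG.sub("", text)
    return text.lower()


# Markup is removed, the content between the tags is kept
print(repr(clean_review("<b>Great</b> game, [h1]10/10[/h1]")))

# Pitfall: everything between '<' and the next '>' is treated as a tag
print(repr(clean_review("I <3 this game, even if fps > 30 is rare")))
```

If such false positives matter, a stricter pattern that only matches known tag names, or an HTML parser, would be a safer choice than the generic regex.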


In [26]:
from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

df_filtered['tokenized'] = df_filtered['cleaned_review'].apply(
    lambda x: tokenizer(x, truncation=True, padding='max_length', max_length=512, return_tensors='pt')
)



In [27]:
df_filtered

Unnamed: 0,app_id,app_name,review_id,language,review,timestamp_created,timestamp_updated,recommended,votes_helpful,votes_funny,...,author.num_reviews,author.playtime_forever,author.playtime_last_two_weeks,author.playtime_at_review,author.last_played,publisher,genres,notes,cleaned_review,tokenized
269238,620980,Beat Saber,79244667,english,"Since I am 80 + years old, it is very importan...",2020-11-14 10:51:22,1605351082,True,1493,176,...,1,2544.0,0.0,2451.0,1.609319e+09,Beat Games,[Indie],,"since i am 80 + years old, it is very importan...","[input_ids, attention_mask]"
1776430,1113000,Persona 4 Golden,70806847,english,Please buy this game if you want more Persona ...,2020-06-15 02:03:47,1606364144,True,1490,31,...,12,9612.0,0.0,499.0,1.606364e+09,SEGA,[RPG],Alcohol Reference\r\nAnimated Blood\r\nLanguag...,please buy this game if you want more persona ...,"[input_ids, attention_mask]"
1396365,1145360,Hades,75662801,english,You can date the medusa head.\n\nPost Launch E...,2020-09-08 19:18:50,1602359822,True,1486,745,...,7,3965.0,0.0,2445.0,1.608123e+09,Supergiant Games,"[Action, Indie, RPG]",,you can date the medusa head.\n\npost launch e...,"[input_ids, attention_mask]"
354159,1225330,NBA 2K21,75410143,english,There is very little difference from 2k20. The...,2020-09-04 06:20:35,1599200435,False,1484,99,...,1,698.0,576.0,121.0,1.610669e+09,2K,"[Simulation, Sports]",,there is very little difference from 2k20. the...,"[input_ids, attention_mask]"
1555472,1289310,Helltaker,71408186,english,Lots of R34,2020-06-26 14:59:18,1593183558,True,1482,594,...,108,14.0,0.0,14.0,1.590006e+09,vanripper,"[Adventure, Free To Play, Indie]",Suggestive clothing and poses. Don't put your ...,lots of r34,"[input_ids, attention_mask]"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2997695,546560,Half-Life: Alyx,65649322,english,"so far its great, except the fact that vive co...",2020-03-23 19:12:34,1585168953,True,0,1,...,21,605.0,0.0,36.0,1.590979e+09,Valve,"[Action, Adventure]",Includes violence and gore.,"so far its great, except the fact that vive co...","[input_ids, attention_mask]"
2997709,546560,Half-Life: Alyx,65649110,english,"Apart from physics of grabbing small items, th...",2020-03-23 19:09:38,1584990578,True,0,1,...,94,1240.0,0.0,111.0,1.593772e+09,Valve,"[Action, Adventure]",Includes violence and gore.,"apart from physics of grabbing small items, th...","[input_ids, attention_mask]"
2997803,546560,Half-Life: Alyx,65647882,english,headcrab scared the shit out of me\n10/10,2020-03-23 18:49:54,1584989394,True,0,1,...,1,274.0,0.0,62.0,1.585249e+09,Valve,"[Action, Adventure]",Includes violence and gore.,headcrab scared the shit out of me\n10/10,"[input_ids, attention_mask]"
2997891,546560,Half-Life: Alyx,65646109,english,Great game. Incredible graphics! Runs great on...,2020-03-23 18:21:33,1584987693,True,0,1,...,2,83.0,0.0,6.0,1.585254e+09,Valve,"[Action, Adventure]",Includes violence and gore.,great game. incredible graphics! runs great on...,"[input_ids, attention_mask]"


In [28]:
label_distribution = df_filtered['recommended'].value_counts() / len(df_filtered)
print(label_distribution)


recommended
True     0.82459
False    0.17541
Name: count, dtype: float64


### Training Test Run

In [29]:
# Delete the original DataFrame to free memory
del df

# Optional: trigger garbage collection
import gc
gc.collect()

0

In [30]:
from sklearn.model_selection import train_test_split

# Split the data into training and validation sets (80/20 split)
train_df, val_df = train_test_split(df_filtered, test_size=0.2, stratify=df_filtered['recommended'], random_state=42)

print(f"Training Set: {len(train_df)} Reviews")
print(f"Validation Set: {len(val_df)} Reviews")

Training Set: 484360 Reviews
Validation Set: 121090 Reviews


In [31]:
import torch

# Check whether a GPU is available
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"GPU available: {torch.cuda.get_device_name(0)}")
else:
    device = torch.device("cpu")
    print("No GPU available, using the CPU")

GPU available: NVIDIA GeForce RTX 4070


In [32]:
import torch
from transformers import DistilBertTokenizer
from tqdm import tqdm

# Load the DistilBERT tokenizer
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# Tokenization function with a progress bar
def tokenize_and_encode(df, tokenizer, max_length=512):
    input_ids = []
    attention_masks = []
    labels = []

    # Use tqdm to display the progress
    for index, row in tqdm(df.iterrows(), total=len(df), desc="Tokenization"):
        encoded = tokenizer.encode_plus(
            row['cleaned_review'],             # the review text
            add_special_tokens=True,           # add special tokens (CLS, SEP)
            max_length=max_length,             # maximum number of tokens
            padding='max_length',              # pad to the maximum length
            truncation=True,                   # truncate if the maximum length is exceeded
            return_attention_mask=True,        # return the attention mask
            return_tensors='pt'                # return PyTorch tensors
        )

        input_ids.append(encoded['input_ids'])
        attention_masks.append(encoded['attention_mask'])
        labels.append(int(row['recommended']))

    # Convert the lists into tensors
    input_ids = torch.cat(input_ids, dim=0)
    attention_masks = torch.cat(attention_masks, dim=0)
    labels = torch.tensor(labels)

    return input_ids, attention_masks, labels

# Tokenize the training and validation data with a progress bar
train_input_ids, train_attention_masks, train_labels = tokenize_and_encode(train_df, tokenizer)
val_input_ids, val_attention_masks, val_labels = tokenize_and_encode(val_df, tokenizer)

# Move the data to the GPU (if available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

train_input_ids = train_input_ids.to(device)
train_attention_masks = train_attention_masks.to(device)
train_labels = train_labels.to(device)

val_input_ids = val_input_ids.to(device)
val_attention_masks = val_attention_masks.to(device)
val_labels = val_labels.to(device)


Tokenization: 100%|██████████| 484360/484360 [04:06<00:00, 1962.53it/s]
Tokenization: 100%|██████████| 121090/121090 [01:01<00:00, 1974.33it/s]


In [33]:
from torch.utils.data import DataLoader, TensorDataset
from tqdm import tqdm

# Create the TensorDatasets
train_dataset = TensorDataset(train_input_ids, train_attention_masks, train_labels)
val_dataset = TensorDataset(val_input_ids, val_attention_masks, val_labels)

# Create the DataLoaders for training and validation
batch_size = 16

train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=batch_size)

# Example of tracking the progress over the training batches with tqdm
for batch in tqdm(train_dataloader, desc="Training Batch Processing"):
    # The actual batch processing would start here
    pass


Training Batch Processing: 100%|██████████| 30273/30273 [00:04<00:00, 6905.82it/s]
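
The 30273 batches in the progress bar above follow directly from the training set size and the batch size. A minimal sketch of that arithmetic, with `batches_per_epoch` as a hypothetical helper that mirrors the `drop_last` option of `DataLoader`:

```python
import math


def batches_per_epoch(num_samples, batch_size, drop_last=False):
    """Number of batches a DataLoader yields for one pass over the data."""
    if drop_last:
        return num_samples // batch_size
    return math.ceil(num_samples / batch_size)


train_size = 484360  # training set size reported by the split above
batch_size = 16

print(batches_per_epoch(train_size, batch_size))                  # full pass, last batch is partial
print(batches_per_epoch(train_size, batch_size, drop_last=True))  # partial batch discarded
print(train_size % batch_size)                                    # size of the partial batch
```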


In [34]:
from transformers import DistilBertForSequenceClassification
# transformers.AdamW is deprecated; use the PyTorch implementation instead
# (its default weight_decay is 0.01, whereas transformers.AdamW defaulted to 0.0)
from torch.optim import AdamW

# Load the DistilBERT model for binary classification
model = DistilBertForSequenceClassification.from_pretrained(
    'distilbert-base-uncased',
    num_labels=2  # binary classification (True/False)
)

# Move the model to the GPU
model = model.to(device)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [35]:
# Define the optimizer (AdamW) and the loss function (cross-entropy loss)
optimizer = AdamW(model.parameters(), lr=2e-5)  # learning rate of 2e-5

# Cross-entropy loss for classification
# (the training loop uses the loss returned by the model, so loss_fn is not required there)
loss_fn = torch.nn.CrossEntropyLoss()



In [39]:
from tqdm import tqdm

# Number of epochs
epochs = 3

# Training loop
for epoch in range(epochs):
    print(f"Epoch {epoch + 1}/{epochs}")

    # Put the model into training mode
    model.train()

    # Initialize the epoch loss
    total_loss = 0

    # Training with a tqdm progress bar
    for batch in tqdm(train_dataloader, desc="Training"):
        # Unpack the batch
        input_ids = batch[0].to(device)
        attention_masks = batch[1].to(device)
        labels = batch[2].to(device)

        # Reset the gradients to zero
        optimizer.zero_grad()

        # Forward pass
        outputs = model(input_ids, attention_mask=attention_masks, labels=labels)
        loss = outputs.loss
        logits = outputs.logits

        # Backward pass
        loss.backward()

        # Optimizer step
        optimizer.step()

        # Accumulate the loss of the current batch
        total_loss += loss.item()

    # Compute the average loss of the epoch
    avg_train_loss = total_loss / len(train_dataloader)
    print(f"Average training loss: {avg_train_loss}")


Epoch 1/3


Training:   1%|▏         | 447/30273 [02:08<2:22:33,  3.49it/s]


KeyboardInterrupt: 
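
The interrupted run above already gives enough information for a rough estimate of the training time. The sketch below assumes the throughput stays constant at the 3.49 it/s shown in the progress bar, which is a simplification; `estimate_seconds` and `format_hms` are hypothetical helpers.

```python
def estimate_seconds(total_batches, batches_per_second, done=0):
    """Rough remaining wall-clock time at a constant throughput."""
    return (total_batches - done) / batches_per_second


def format_hms(seconds):
    """Format a duration in seconds as H:MM:SS."""
    seconds = int(round(seconds))
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


total_batches = 30273  # batches per epoch, from the progress bar
rate = 3.49            # iterations per second, from the progress bar
epochs = 3

one_epoch = estimate_seconds(total_batches, rate)
print("one epoch:   ", format_hms(one_epoch))
print("three epochs:", format_hms(one_epoch * epochs))
```

At this rate one epoch takes a bit under two and a half hours, so the three planned epochs would run for more than seven hours on this setup.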

In [30]:
save_path = f"distilbert_model_epoch_{epoch+1}.pt"
torch.save({
    'epoch': epoch+1,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': avg_train_loss,
}, save_path)

In [31]:
from sklearn.metrics import accuracy_score

# Evaluation on the validation set
model.eval()  # put the model into evaluation mode
val_loss = 0
predictions, true_labels = [], []

# Validation with a tqdm progress bar
for batch in tqdm(val_dataloader, desc="Validation"):
    input_ids = batch[0].to(device)
    attention_masks = batch[1].to(device)
    labels = batch[2].to(device)

    with torch.no_grad():  # disable gradient tracking during validation
        outputs = model(input_ids, attention_mask=attention_masks, labels=labels)
        loss = outputs.loss
        logits = outputs.logits

    # Accumulate the loss
    val_loss += loss.item()

    # Collect the predicted labels and the true labels
    predictions.extend(torch.argmax(logits, dim=1).cpu().numpy())
    true_labels.extend(labels.cpu().numpy())

# Compute the accuracy
accuracy = accuracy_score(true_labels, predictions)
print(f"Validation accuracy: {accuracy}")


Validation: 100%|██████████| 1468/1468 [03:41<00:00,  6.64it/s]

Validation accuracy: 0.9296392520976191





In [32]:
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix

# Compute additional metrics
precision = precision_score(true_labels, predictions)
recall = recall_score(true_labels, predictions)
f1 = f1_score(true_labels, predictions)

# Compute the confusion matrix
conf_matrix = confusion_matrix(true_labels, predictions)

print(f"Validation accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 score: {f1}")
print(f"Confusion matrix:\n{conf_matrix}")


Validation accuracy: 0.9296392520976191
Precision: 0.9541652726664436
Recall: 0.9537398283357486
F1 score: 0.9539525030661167
Confusion matrix:
[[ 4715   822]
 [  830 17112]]
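
The reported metrics can be reproduced by hand from the confusion matrix, which also makes it easy to look at the minority class. The sketch below relies on the scikit-learn convention that rows are true classes and columns are predicted classes, with class 1 (recommended) as the positive class; `recall_negative` is an extra figure not printed by the notebook.

```python
# Confusion matrix as printed above: rows = true class, columns = predicted class
tn, fp = 4715, 822     # true class 0 (not recommended)
fn, tp = 830, 17112    # true class 1 (recommended)

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Recall of the negative (minority) class, which accuracy alone hides
recall_negative = tn / (tn + fp)

print(f"accuracy:        {accuracy:.4f}")
print(f"precision:       {precision:.4f}")
print(f"recall:          {recall:.4f}")
print(f"f1:              {f1:.4f}")
print(f"recall (class 0): {recall_negative:.4f}")
```

The recall on negative reviews comes out at about 0.85, noticeably lower than the roughly 0.95 on positive reviews, which is consistent with the class imbalance in the data.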


In [33]:
# Compute the average validation loss
avg_val_loss = val_loss / len(val_dataloader)
print(f"Average validation loss: {avg_val_loss}")

Average validation loss: 0.21015889932154375


In [34]:
import torch

# Path under which the model is saved
model_save_path = "distilbert_model_final.pt"

# Save the model state (state_dict) and optionally the optimizer
torch.save({
    'model_state_dict': model.state_dict(),  # stores the model weights
    'optimizer_state_dict': optimizer.state_dict(),  # optional: stores the optimizer state
}, model_save_path)

print(f"Model successfully saved to {model_save_path}.")


Model successfully saved to distilbert_model_final.pt.


In [35]:
# Load the saved checkpoint (map_location places the tensors on the current device)
checkpoint = torch.load("distilbert_model_final.pt", map_location=device)

# Load the weights into the model
model.load_state_dict(checkpoint['model_state_dict'])

# Optional: load the optimizer state in case training should be resumed
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])

# Put the model into evaluation mode (for inference)
model.eval()

print("Model successfully loaded and ready to use.")

Model successfully loaded and ready to use.


## Prediction on New Text

In [1]:
import torch

# Check whether a GPU is available
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"GPU is available: {torch.cuda.get_device_name(0)}")
else:
    device = torch.device("cpu")
    print("No GPU available, using the CPU")

GPU is available: NVIDIA GeForce RTX 2080 SUPER


In [2]:
from transformers import DistilBertForSequenceClassification

# Load the DistilBERT model for binary classification
model = DistilBertForSequenceClassification.from_pretrained(
    'distilbert-base-uncased',
    num_labels=2  # binary classification (recommended: True/False)
)

# Move the model to the selected device (GPU if available)
model = model.to(device)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [3]:
import torch
from transformers import DistilBertTokenizer

# Select the device: GPU if available, otherwise CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the saved checkpoint; map_location lets a GPU-saved checkpoint load on a CPU-only machine
checkpoint = torch.load("distilbert_model_final.pt", map_location=device)
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()  # put the model into evaluation mode

# Load the tokenizer that was used during training
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# Move the model to the selected device
model.to(device)

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

In [24]:
def preprocess_text(text, tokenizer, max_length=512):
    # Tokenize the text
    encoding = tokenizer(
        text,
        add_special_tokens=True,  # add the special tokens ([CLS], [SEP])
        max_length=max_length,    # limit the length to max_length (512 by default)
        padding='max_length',     # pad to the maximum length
        truncation=True,          # truncate if longer than max_length
        return_tensors='pt'       # return PyTorch tensors
    )
    
    return encoding['input_ids'].to(device), encoding['attention_mask'].to(device)

# Example text (new review)
test_text = """

It's my first FIFA and I like it so far.
Not too much fun against real players as they are miles better than me, but after grinding some Squad Battles, I managed to get some wins in Rivals as well.
Fun to learn the technics, fun to trade on the market.
I like it.

"""

# Preprocess the text
input_ids, attention_mask = preprocess_text(test_text, tokenizer)
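
To make the effect of `padding='max_length'` and `truncation=True` concrete, here is a toy re-implementation in plain Python. It is only a sketch of the idea, not the tokenizer's actual code: `pad_and_truncate` is a hypothetical helper, the content token IDs are illustrative, and the special-token IDs (101 for `[CLS]`, 102 for `[SEP]`, 0 for `[PAD]`) are those of the uncased BERT vocabulary that `distilbert-base-uncased` uses.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token IDs in the uncased BERT vocabulary

def pad_and_truncate(content_ids, max_length=512):
    """Toy version of padding='max_length' combined with truncation=True."""
    # Reserve two positions for [CLS] and [SEP], then cut the content if needed
    kept = content_ids[: max_length - 2]
    input_ids = [CLS_ID] + kept + [SEP_ID]

    # Real tokens get mask 1, padding positions get mask 0
    attention_mask = [1] * len(input_ids)
    n_pad = max_length - len(input_ids)
    input_ids += [PAD_ID] * n_pad
    attention_mask += [0] * n_pad
    return input_ids, attention_mask

# Short input: padded on the right
short_ids, short_mask = pad_and_truncate([2023, 2003, 4569], max_length=8)
print(short_ids)   # [101, 2023, 2003, 4569, 102, 0, 0, 0]
print(short_mask)  # [1, 1, 1, 1, 1, 0, 0, 0]

# Long input: truncated, but [SEP] is still the last token
long_ids, long_mask = pad_and_truncate(list(range(1000, 1020)), max_length=8)
print(long_ids)    # [101, 1000, 1001, 1002, 1003, 1004, 1005, 102]
print(long_mask)   # [1, 1, 1, 1, 1, 1, 1, 1]
```

The attention mask is what lets the model ignore the padding positions, which is why `preprocess_text` returns it alongside `input_ids`.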


In [25]:
# Make a prediction with the loaded model
with torch.no_grad():  # disable gradient computation, since this is inference only
    outputs = model(input_ids, attention_mask=attention_mask)
    logits = outputs.logits

# Pick the class with the highest logit (a softmax would not change the argmax)
predicted_class = torch.argmax(logits, dim=1).cpu().numpy()[0]

# Print the prediction
if predicted_class == 1:
    print("The model predicts that the review recommends the game!")
else:
    print("The model predicts that the review does not recommend the game.")


The model predicts that the review recommends the game!
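
The argmax only yields the winning class. If a confidence value is also wanted, the logits can be turned into probabilities with a softmax; in the notebook this would be `torch.softmax(logits, dim=1)`. The sketch below shows the same computation in plain Python on made-up logits (the values are illustrative, not the model's actual output for the review above).

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)                              # subtract the max to avoid overflow
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for [not recommended, recommended]
example_logits = [-1.2, 2.3]

probs = softmax(example_logits)
predicted = max(range(len(probs)), key=probs.__getitem__)  # same index as argmax of the logits

print(f"P(not recommended) = {probs[0]:.3f}")
print(f"P(recommended)     = {probs[1]:.3f}")
print(f"predicted class    = {predicted}")
```

Because the softmax preserves the ordering of the logits, the predicted class is the same with or without it; the probabilities only add a measure of how decisive the prediction is.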
