***Problem Statement***<br>
The Business Problem<br>
The gaming industry is highly competitive, with thousands of games competing for players' attention. Game developers and publishers face a significant challenge in understanding what makes their titles successful and in identifying areas for improvement. Moreover, players often find it difficult to discover games that align with their preferences because of the high volume of available options.<br>

**Importance of the Problem**<br>
Addressing this problem has several benefits:<br>
For Companies: Better insights into player behavior and sentiment can guide development decisions, marketing strategies, and customer retention efforts. By understanding which aspects of a game drive positive engagement, companies can allocate resources more effectively to improve existing titles or develop future ones.<br>
For Players: Personalized recommendations can enhance user satisfaction by helping them discover games they are likely to enjoy, encouraging engagement and loyalty.<br>
Market Advantage: A robust recommendation system can help companies stand out in the crowded marketplace by offering an enhanced user experience.<br>
**Proposed Solution**<br>
The solution involves developing a data-driven system that leverages Natural Language Processing (NLP) and Machine Learning (ML) techniques to:<br>

- Analyze user sentiment and feedback (reviews).<br>
- Incorporate user interaction metrics (hours played).<br>
- Generate personalized recommendations based on player preferences and game features.<br>

**Data Collection**<br>
The dataset contains over 990,000 rows of data scraped from the Steam platform, focusing on game reviews, rankings, and game-related information across various genres. The data was collected from the top 40 games in sales, revenue, and reviews within six core genres on Steam. The dataset includes 242 games for player reviews and 290 games for genre rankings and descriptions, with some games excluded due to content restrictions.<br>
The dataset is publicly available on Kaggle:<br>
https://www.kaggle.com/datasets/mohamedtarek01234/steam-games-reviews-and-rankings/data<br>

The dataset is divided into three files, each containing different types of information:<br>
games_description.csv:<br>
This file contains detailed descriptions of the games.<br>
Columns:<br>
name: Game title<br>
short_description: Brief description of the game<br>
long_description: Detailed description of the game<br>
genres: List of genres the game belongs to<br>
minimum_system_requirement: Minimum system requirements to run the game<br>
recommend_system_requirement: Recommended system requirements<br>
release_date: Game release date<br>
developer: Developer of the game<br>
publisher: Publisher of the game<br>
overall_player_rating: Overall player rating for the game<br>
number_of_reviews_from_purchased_people: Number of reviews from users who purchased the game<br>
number_of_english_reviews: Number of reviews written in English<br>

steam_game_reviews.csv:<br>
This file contains player reviews for the games.<br>
Columns:<br>
review: The content of the player’s review<br>
hours_played: Total hours the player has spent on the game<br>
helpful: Number of users who found the review helpful<br>
funny: Number of users who found the review funny<br>
recommendation: Whether the player recommended or did not recommend the game<br>
date: Date of the review<br>
game_name: Name of the game being reviewed<br>
username: Username of the player who wrote the review<br>
games_ranking.csv: This file contains the ranking of games by genre.<br>

For this project, I will focus on the games_description.csv and steam_game_reviews.csv files.

**Workflow Overview**<br>
*1. Data Collection and Preprocessing*<br>
Data Sources: Game reviews, player statistics, and game attributes.<br>
Preprocessing: Cleaning and normalizing textual data, handling missing values, and encoding categorical features.<br>
*2. NLP-Based Sentiment Analysis*<br>
Text Tokenization: Breaking down user reviews into words or phrases.<br>
Model Training: Building a traditional machine learning model (Logistic Regression) and a neural-network model to determine sentiment scores.<br>
Sentiment Scoring: Assigning numeric values to reviews to quantify player satisfaction.<br>
*3. Feature Engineering*<br>
Combining Data: Merging sentiment scores with other structured data, such as hours played and game genres.<br>
Normalization: Scaling numerical features (hours played) to ensure uniformity.<br>
*4. Recommendation System*<br>
Feature Scaling: Using a previously fitted scaler for standardization.<br>
Ranking and Scoring: Combining sentiment probabilities and player engagement metrics to rank games.<br>
*5. Evaluation and Feedback Loop*<br>
Metrics: Accuracy, precision, recall, F1-score, and ROC-AUC for sentiment analysis; ranking metrics for recommendations.<br>
Continuous Improvement: Incorporate user feedback to refine the model.<br>

Importing necessary libraries

In [None]:
import pandas as pd
import numpy as np
import re
from nltk.corpus import stopwords
import nltk
import ast
from sklearn.preprocessing import MultiLabelBinarizer
from textblob import TextBlob
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.impute import SimpleImputer
import pickle
from sklearn.linear_model import LogisticRegression
import tensorflow as tf
from tensorflow.keras import layers, models
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.models import load_model

Loading data

In [2]:
reviews_df = pd.read_csv('d://classes/nlp/project/steam/archive/steam_game_reviews.csv')
description_df = pd.read_csv('d://classes/nlp/project/steam/archive/games_description.csv')



In [3]:
print("Reviews Data Sample:")
reviews_df.head()

Reviews Data Sample:


Unnamed: 0,review,hours_played,helpful,funny,recommendation,date,game_name,username
0,The game itself is also super fun. The PvP and...,39.9,1152,13,Recommended,14 September,"Warhammer 40,000: Space Marine 2",Sentinowl\n224 products in account
1,Never cared much about Warhammer until this ga...,91.5,712,116,Recommended,13 September,"Warhammer 40,000: Space Marine 2",userpig\n248 products in account
2,A salute to all the fallen battle brothers who...,43.3,492,33,Recommended,14 September,"Warhammer 40,000: Space Marine 2",Imparat0r\n112 products in account
3,this game feels like it was made in the mid 20...,16.8,661,15,Recommended,14 September,"Warhammer 40,000: Space Marine 2",Fattest_falcon
4,Reminds me of something I've lost. A genuine g...,24.0,557,4,Recommended,12 September,"Warhammer 40,000: Space Marine 2",Jek\n410 products in account


In [4]:
print("\nDescriptions Data Sample:")
description_df.head()


Descriptions Data Sample:


Unnamed: 0,name,short_description,long_description,genres,minimum_system_requirement,recommend_system_requirement,release_date,developer,publisher,overall_player_rating,number_of_reviews_from_purchased_people,number_of_english_reviews,link
0,Black Myth: Wukong,Black Myth: Wukong is an action RPG rooted in ...,About This Game\n\t\t\t\t\t\t\tBlack Myth: Wuk...,"['Mythology', 'Action RPG', 'Action', 'RPG', '...",['Requires a 64-bit processor and operating sy...,['Requires a 64-bit processor and operating sy...,"19 Aug, 2024",['Game Science'],['Game Science'],Overwhelmingly Positive,"(654,820)",51931,https://store.steampowered.com/app/2358720/Bla...
1,Counter-Strike 2,"For over two decades, Counter-Strike has offer...",About This Game\n\t\t\t\t\t\t\tFor over two de...,"['FPS', 'Shooter', 'Multiplayer', 'Competitive...","['OS: Windows® 10', 'Processor: 4 hardware CPU...","['OS: Windows® 10', 'Processor: 4 hardware CPU...","21 Aug, 2012",['Valve'],['Valve'],Very Positive,"(8,313,603)",2258990,https://store.steampowered.com/app/730/Counter...
2,"Warhammer 40,000: Space Marine 2",Embody the superhuman skill and brutality of a...,About This Game\nEmbody the superhuman skill a...,"['Warhammer 40K', 'Action', 'Third-Person Shoo...",['Requires a 64-bit processor and operating sy...,['Requires a 64-bit processor and operating sy...,"9 Sep, 2024",['Saber Interactive'],['Focus Entertainment'],Very Positive,"(81% of 62,791) All Time",51920,https://store.steampowered.com/app/2183900/War...
3,Cyberpunk 2077,"Cyberpunk 2077 is an open-world, action-advent...",About This Game\nCyberpunk 2077 is an open-wor...,"['Cyberpunk', 'Open World', 'Nudity', 'RPG', '...",['Requires a 64-bit processor and operating sy...,['Requires a 64-bit processor and operating sy...,"10 Dec, 2020",['CD PROJEKT RED'],['CD PROJEKT RED'],Very Positive,"(680,264)",324124,https://store.steampowered.com/app/1091500/Cyb...
4,ELDEN RING,THE CRITICALLY ACCLAIMED FANTASY ACTION RPG. R...,About This Game\nTHE CRITICALLY ACCLAIMED FANT...,"['Souls-like', 'Dark Fantasy', 'Open World', '...",['Requires a 64-bit processor and operating sy...,['Requires a 64-bit processor and operating sy...,"25 Feb, 2022","['FromSoftware, Inc.']","['FromSoftware, Inc.', 'Bandai Namco Entertain...",Very Positive,"(705,261)",491741,https://store.steampowered.com/app/1245620/ELD...


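Combining the two datasets into a single dataframe (merged_df) for further analysis. This consolidated view makes both the reviews and the game descriptions available to downstream tasks. The inner merge keeps only reviews whose game_name exactly matches a name in the descriptions file.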

In [None]:
description_df.rename(columns={'name': 'game_name'}, inplace=True)

merged_df = reviews_df.merge(description_df, on='game_name', how='inner')

print("Merged Data Sample:")
merged_df.head()

Merged Data Sample:


Unnamed: 0,review,hours_played,helpful,funny,recommendation,date,game_name,username,short_description,long_description,genres,minimum_system_requirement,recommend_system_requirement,release_date,developer,publisher,overall_player_rating,number_of_reviews_from_purchased_people,number_of_english_reviews,link
0,The game itself is also super fun. The PvP and...,39.9,1152,13,Recommended,14 September,"Warhammer 40,000: Space Marine 2",Sentinowl\n224 products in account,Embody the superhuman skill and brutality of a...,About This Game\nEmbody the superhuman skill a...,"['Warhammer 40K', 'Action', 'Third-Person Shoo...",['Requires a 64-bit processor and operating sy...,['Requires a 64-bit processor and operating sy...,"9 Sep, 2024",['Saber Interactive'],['Focus Entertainment'],Very Positive,"(81% of 62,791) All Time",51920,https://store.steampowered.com/app/2183900/War...
1,Never cared much about Warhammer until this ga...,91.5,712,116,Recommended,13 September,"Warhammer 40,000: Space Marine 2",userpig\n248 products in account,Embody the superhuman skill and brutality of a...,About This Game\nEmbody the superhuman skill a...,"['Warhammer 40K', 'Action', 'Third-Person Shoo...",['Requires a 64-bit processor and operating sy...,['Requires a 64-bit processor and operating sy...,"9 Sep, 2024",['Saber Interactive'],['Focus Entertainment'],Very Positive,"(81% of 62,791) All Time",51920,https://store.steampowered.com/app/2183900/War...
2,A salute to all the fallen battle brothers who...,43.3,492,33,Recommended,14 September,"Warhammer 40,000: Space Marine 2",Imparat0r\n112 products in account,Embody the superhuman skill and brutality of a...,About This Game\nEmbody the superhuman skill a...,"['Warhammer 40K', 'Action', 'Third-Person Shoo...",['Requires a 64-bit processor and operating sy...,['Requires a 64-bit processor and operating sy...,"9 Sep, 2024",['Saber Interactive'],['Focus Entertainment'],Very Positive,"(81% of 62,791) All Time",51920,https://store.steampowered.com/app/2183900/War...
3,this game feels like it was made in the mid 20...,16.8,661,15,Recommended,14 September,"Warhammer 40,000: Space Marine 2",Fattest_falcon,Embody the superhuman skill and brutality of a...,About This Game\nEmbody the superhuman skill a...,"['Warhammer 40K', 'Action', 'Third-Person Shoo...",['Requires a 64-bit processor and operating sy...,['Requires a 64-bit processor and operating sy...,"9 Sep, 2024",['Saber Interactive'],['Focus Entertainment'],Very Positive,"(81% of 62,791) All Time",51920,https://store.steampowered.com/app/2183900/War...
4,Reminds me of something I've lost. A genuine g...,24.0,557,4,Recommended,12 September,"Warhammer 40,000: Space Marine 2",Jek\n410 products in account,Embody the superhuman skill and brutality of a...,About This Game\nEmbody the superhuman skill a...,"['Warhammer 40K', 'Action', 'Third-Person Shoo...",['Requires a 64-bit processor and operating sy...,['Requires a 64-bit processor and operating sy...,"9 Sep, 2024",['Saber Interactive'],['Focus Entertainment'],Very Positive,"(81% of 62,791) All Time",51920,https://store.steampowered.com/app/2183900/War...


In [6]:
merged_df.shape

(925244, 20)
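Because the merge is an inner join on `game_name`, any review whose title has no exact match in the descriptions file is silently dropped. A minimal sketch of how that could be checked, using toy stand-ins for `reviews_df` and `description_df` (the frames and values below are invented for illustration):

```python
import pandas as pd

# Toy stand-ins for reviews_df and description_df (values are made up).
reviews = pd.DataFrame({
    "game_name": ["Game A", "Game A", "Game B", "Game C"],
    "review": ["fun", "buggy", "great", "ok"],
})
descriptions = pd.DataFrame({
    "game_name": ["Game A", "Game B"],
    "genres": ["['Action']", "['RPG']"],
})

# A left merge with indicator=True keeps every review and records
# whether a matching description row was found.
checked = reviews.merge(descriptions, on="game_name", how="left", indicator=True)

# Reviews flagged "left_only" would be dropped by an inner merge.
unmatched = checked.loc[checked["_merge"] == "left_only", "game_name"]
unmatched_counts = unmatched.value_counts()
print(unmatched_counts)
```

Running the same check on the real frames would show which titles, if any, are lost in the merge, for example because of spelling or trademark-symbol differences between the two files.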

Checking for missing values

In [None]:
missing_values = merged_df.isnull().sum()
print(missing_values)

review                                     486
hours_played                                 0
helpful                                      0
funny                                        0
recommendation                               0
date                                         0
game_name                                    0
username                                    78
short_description                            0
long_description                             0
genres                                       0
minimum_system_requirement                   0
recommend_system_requirement                 0
release_date                                 0
developer                                    0
publisher                                    0
overall_player_rating                        0
number_of_reviews_from_purchased_people      0
number_of_english_reviews                    0
link                                         0
dtype: int64


Handling missing values

In [None]:
merged_df['username'] = merged_df['username'].fillna('Unknown')
merged_df = merged_df.dropna(subset=['review'])

In [None]:
# Download stopwords (if not already done)
nltk.download('stopwords')

# Precompute the stopwords list
stop_words = set(stopwords.words('english'))

# Define a function for text cleaning
def clean_text_optimized(text):
    # Convert to lowercase
    text = text.lower()
    # Remove non-alphabetical characters (keep spaces)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Remove stopwords using list comprehension (more efficient)
    text = ' '.join([word for word in text.split() if word not in stop_words])
    return text

# Apply the text cleaning function to each review (Series.map runs it element-wise)
merged_df['cleaned_review'] = merged_df['review'].map(clean_text_optimized)

# Check a few examples to verify
print(merged_df[['review', 'cleaned_review']].head())


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\nima_\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


                                              review  \
0  The game itself is also super fun. The PvP and...   
1  Never cared much about Warhammer until this ga...   
2  A salute to all the fallen battle brothers who...   
3  this game feels like it was made in the mid 20...   
4  Reminds me of something I've lost. A genuine g...   

                                      cleaned_review  
0  game also super fun pvp campaign joy play acti...  
1  never cared much warhammer game showed error w...  
2  salute fallen battle brothers couldnt us years...  
3      game feels like made mid searly like good way  
4  reminds something ive lost genuine game good g...  


In [10]:
import ast

# Convert the genres column from string to list
merged_df['genres'] = merged_df['genres'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)

# Check unique genres
all_genres = set([genre for sublist in merged_df['genres'] for genre in sublist])
print(f"Unique Genres: {all_genres}")


Unique Genres: {'Multiple Endings', 'Crime', 'Music', 'Turn-Based Tactics', 'Collectathon', 'Character Customization', 'Racing', 'NSFW', 'Bowling', 'Simulation', 'Resource Management', 'Wrestling', 'Anime', 'Turn-Based Combat', 'Arena Shooter', 'Alternate History', 'Reboot', 'Cycling', 'Strategy RPG', 'Platformer', 'RPGMaker', "1990's", 'Hack and Slash', 'Roguelite', 'Battle Royale', 'Singleplayer', 'Hunting', 'Trains', 'Action-Adventure', 'RTS', 'Immersive', 'Dating Sim', '1980s', 'Experimental', 'Artificial Intelligence', 'Zombies', 'Science', 'Hero Shooter', 'Sailing', 'Nudity', 'Transportation', 'Multiplayer', '2.5D', 'Historical', 'Sci-fi', 'Quick-Time Events', 'Violent', '2D Platformer', 'Golf', 'Choose Your Own Adventure', 'Solitaire', 'Score Attack', 'Tennis', 'Life Sim', 'Automobile Sim', 'Action Roguelike', 'Modern', 'Stylized', 'Game Development', 'Naval', 'Politics', 'Dragons', 'Education', 'Automation', 'Cold War', 'Strategy', 'Psychological', 'Time Travel', 'FMV', 'Immers

In [11]:
from sklearn.preprocessing import MultiLabelBinarizer

# One-hot encode the genres
mlb = MultiLabelBinarizer()
genres_one_hot = mlb.fit_transform(merged_df['genres'])

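# Add the one-hot encoded genres back to the dataframe for reference.
# Reuse merged_df's index: rows were dropped earlier without resetting the index,
# so a default RangeIndex would misalign the concat and introduce NaN rows.
genre_columns = mlb.classes_
genres_df = pd.DataFrame(genres_one_hot, columns=genre_columns, index=merged_df.index)
merged_df = pd.concat([merged_df, genres_df], axis=1)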

print("Genre one-hot encoded columns:", genre_columns)
print(merged_df[genre_columns].head())
print(f"Number of unique genres: {len(all_genres)}")
print(f"Number of one-hot columns: {len(genre_columns)}")

Genre one-hot encoded columns: ['1980s' "1990's" '2.5D' '2D' '2D Platformer' '3D' '3D Fighter'
 '3D Platformer' '3D Vision' '4 Player Local' '4X' '6DOF' 'ATV' 'Action'
 'Action RPG' 'Action RTS' 'Action Roguelike' 'Action-Adventure'
 'Addictive' 'Adventure' 'Agriculture' 'Aliens' 'Alternate History'
 'America' 'Animation & Modeling' 'Anime' 'Arcade' 'Archery'
 'Arena Shooter' 'Artificial Intelligence' 'Assassin' 'Atmospheric'
 'Auto Battler' 'Automation' 'Automobile Sim' 'BMX' 'Base Building'
 'Baseball' 'Basketball' 'Battle Royale' "Beat 'em up" 'Beautiful' 'Bikes'
 'Blood' 'Board Game' 'Bowling' 'Boxing' 'Building' 'Bullet Hell' 'CRPG'
 'Capitalism' 'Card Battler' 'Card Game' 'Cartoon' 'Cartoony' 'Casual'
 'Character Customization' 'Choices Matter' 'Choose Your Own Adventure'
 'Cinematic' 'City Builder' 'Class-Based' 'Classic' 'Co-op'
 'Co-op Campaign' 'Cold War' 'Collectathon' 'Colony Sim' 'Colorful'
 'Combat' 'Combat Racing' 'Comedy' 'Comic Book' 'Competitive' 'Controller'
 'Conver
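One detail worth checking when attaching the encoded genres back to the dataframe is index alignment: `pd.concat(..., axis=1)` matches rows by index label, not by position. The sketch below uses a toy frame with a gap in its index (as left behind by `dropna` without `reset_index`) and shows one way to keep the rows aligned; the data is invented for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy frame whose index has a gap, as happens after dropna() without reset_index().
df = pd.DataFrame(
    {"genres": [["Action", "RPG"], ["Action"], ["Strategy"]]},
    index=[0, 2, 3],
)

mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(df["genres"])

# Reusing df.index keeps each encoded row attached to its original row.
genres_df = pd.DataFrame(encoded, columns=mlb.classes_, index=df.index)
combined = pd.concat([df, genres_df], axis=1)
print(combined)
```

If the encoded frame were built with a default index instead, the concat would produce extra rows filled with NaN wherever the two indexes disagree.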

Sentiment Analysis

In [12]:
from textblob import TextBlob

merged_df['cleaned_review'] = merged_df['cleaned_review'].fillna('').astype(str)
merged_df['sentiment_score'] = merged_df['cleaned_review'].apply(lambda x: TextBlob(x).sentiment.polarity)
merged_df['sentiment_label'] = merged_df['sentiment_score'].apply(lambda x: 'positive' if x > 0 else 'negative')
print(merged_df[['cleaned_review', 'sentiment_score', 'sentiment_label']].head())


                                      cleaned_review  sentiment_score  \
0  game also super fun pvp campaign joy play acti...         0.144444   
1  never cared much warhammer game showed error w...        -0.144583   
2  salute fallen battle brothers couldnt us years...         0.265625   
3      game feels like made mid searly like good way         0.100000   
4  reminds something ive lost genuine game good g...         0.114286   

  sentiment_label  
0        positive  
1        negative  
2        positive  
3        positive  
4        positive  


Handle 'hours_played' column

In [13]:
merged_df['hours_played'] = pd.to_numeric(merged_df['hours_played'], errors='coerce')
merged_df['hours_played'] = merged_df['hours_played'].fillna(0)

In [14]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.impute import SimpleImputer
import pickle

X_lr = merged_df[['sentiment_score', 'hours_played']]
X_lr = pd.concat([X_lr, genres_df], axis=1)
y_lr = (merged_df['sentiment_label'] == 'positive').astype(int)

# Impute missing values if any
imputer_lr = SimpleImputer(strategy='mean')
X_lr = pd.DataFrame(imputer_lr.fit_transform(X_lr), columns=X_lr.columns)

# Normalize 'hours_played'
X_lr['hours_played'] = X_lr['hours_played'] / X_lr['hours_played'].max()

# Save the feature list and imputer for Logistic Regression
feature_list_lr = X_lr.columns.tolist()
with open('feature_list_lr.pkl', 'wb') as f:
    pickle.dump(feature_list_lr, f)

with open('imputer_lr.pkl', 'wb') as f:
    pickle.dump(imputer_lr, f)
    
# Split data for Logistic Regression
X_train_lr, X_temp_lr, y_train_lr, y_temp_lr = train_test_split(X_lr, y_lr, test_size=0.3, random_state=42)
X_val_lr, X_test_lr, y_val_lr, y_test_lr = train_test_split(X_temp_lr, y_temp_lr, test_size=0.5, random_state=42)


In [16]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.impute import SimpleImputer
import pandas as pd

# Train Logistic Regression
model_lr = LogisticRegression(class_weight='balanced', random_state=42, max_iter=1000)
model_lr.fit(X_train_lr, y_train_lr)

# Validate Logistic Regression with a custom decision threshold of 0.20
y_val_pred_proba_lr = model_lr.predict_proba(X_val_lr)[:, 1]
custom_threshold_lr = 0.20
y_val_pred_thresh_lr = (y_val_pred_proba_lr >= custom_threshold_lr).astype(int)

val_accuracy_lr = accuracy_score(y_val_lr, y_val_pred_thresh_lr)
val_precision_lr = precision_score(y_val_lr, y_val_pred_thresh_lr)
val_recall_lr = recall_score(y_val_lr, y_val_pred_thresh_lr)
val_f1_lr = f1_score(y_val_lr, y_val_pred_thresh_lr)

print("Logistic Regression Model Evaluation (Validation Set - Threshold 0.20):")
print(f"Accuracy: {val_accuracy_lr:.4f}")
print(f"Precision: {val_precision_lr:.4f}")
print(f"Recall: {val_recall_lr:.4f}")
print(f"F1-Score: {val_f1_lr:.4f}")

Logistic Regression Model Evaluation (Validation Set - Threshold 0.20):
Accuracy: 0.9915
Precision: 0.9963
Recall: 0.9892
F1-Score: 0.9927


The Logistic Regression model performs strongly on the validation set at a threshold of 0.20:

- Accuracy: 99.15%, the proportion of correctly classified samples.
- Precision: 99.63%, so very few predicted positives are false positives.
- Recall: 98.92%, so the model identifies most true positives but misses some.
- F1-Score: 99.27%, the harmonic mean of precision and recall.

Analysis:
Strengths:

Precision is very high: when the model predicts a positive label, it is almost always correct.
High accuracy and F1-score indicate a good overall balance between precision and recall.
Weaknesses:

Recall (98.92%) is slightly lower than precision, so the model misses a small share of positive-sentiment reviews.
Caveat: the target label is derived from `sentiment_score` (positive when the score is above 0), and `sentiment_score` is also an input feature, so these metrics mainly show that the model recovers that rule rather than measuring independent sentiment-classification ability.
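The choice of decision threshold drives the trade-off between precision and recall discussed above. The following sketch compares a threshold of 0.20 with the default 0.50 on made-up labels and probabilities; it only illustrates the mechanism and does not reproduce the notebook's results.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up labels and predicted probabilities, for illustration only.
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])
y_proba = np.array([0.90, 0.35, 0.15, 0.25, 0.05, 0.60, 0.55, 0.80])

results = {}
for threshold in (0.20, 0.50):
    # Predict positive whenever the probability reaches the threshold.
    y_pred = (y_proba >= threshold).astype(int)
    results[threshold] = (
        precision_score(y_true, y_pred),
        recall_score(y_true, y_pred),
    )
    precision, recall = results[threshold]
    print(f"threshold={threshold:.2f} precision={precision:.2f} recall={recall:.2f}")
```

On this toy data the lower threshold catches more true positives at the cost of more false positives, which is the usual direction of the trade-off.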

In [17]:

# Test Logistic Regression
y_test_pred_proba_lr = model_lr.predict_proba(X_test_lr)[:, 1]
y_test_pred_thresh_lr = (y_test_pred_proba_lr >= custom_threshold_lr).astype(int)

test_accuracy_lr = accuracy_score(y_test_lr, y_test_pred_thresh_lr)
test_precision_lr = precision_score(y_test_lr, y_test_pred_thresh_lr)
test_recall_lr = recall_score(y_test_lr, y_test_pred_thresh_lr)
test_f1_lr = f1_score(y_test_lr, y_test_pred_thresh_lr)

print("\nLogistic Regression Model Evaluation (Test Set - Threshold 0.20):")
print(f"Accuracy: {test_accuracy_lr:.4f}")
print(f"Precision: {test_precision_lr:.4f}")
print(f"Recall: {test_recall_lr:.4f}")
print(f"F1-Score: {test_f1_lr:.4f}")


Logistic Regression Model Evaluation (Test Set - Threshold 0.20):
Accuracy: 0.9913
Precision: 0.9965
Recall: 0.9886
F1-Score: 0.9925


Input Layer: The input shape is the number of features in your dataset (in this case, it's the number of columns after preprocessing).
Hidden Layers: We use two dense layers with ReLU activation, but you can experiment with more layers or different architectures.
Output Layer: A single neuron with a sigmoid activation function, which outputs values between 0 and 1 (for binary classification).
Optimizer: adam, commonly used for its efficiency.
Loss Function: binary_crossentropy (suitable for binary classification problems).
Evaluation Metric: We use accuracy here, but we will also track precision, recall, and F1-score later.

In [19]:
import tensorflow as tf
from tensorflow.keras import layers, models
from sklearn.preprocessing import StandardScaler


merged_df['normalized_hours_played'] = merged_df['hours_played'] / merged_df['hours_played'].max()

# Feature engineering for the deep learning model
X_dl = merged_df[['sentiment_score', 'normalized_hours_played']]
X_dl = pd.concat([X_dl, genres_df], axis=1)
y_dl = (merged_df['sentiment_label'] == 'positive').astype(int)

# Ensure no missing values
imputer_dl = SimpleImputer(strategy='mean')
X_dl = pd.DataFrame(imputer_dl.fit_transform(X_dl), columns=X_dl.columns)

# Split data for Deep Learning
X_train_dl, X_temp_dl, y_train_dl, y_temp_dl = train_test_split(X_dl, y_dl, test_size=0.3, random_state=42)
X_val_dl, X_test_dl, y_val_dl, y_test_dl = train_test_split(X_temp_dl, y_temp_dl, test_size=0.5, random_state=42)

# Scale features
scaler_dl = StandardScaler()
X_train_dl_scaled = scaler_dl.fit_transform(X_train_dl)
X_val_dl_scaled = scaler_dl.transform(X_val_dl)
X_test_dl_scaled = scaler_dl.transform(X_test_dl)

# Save the scaler and feature list for Deep Learning
feature_list_dl = X_dl.columns.tolist()
with open('feature_list_dl.pkl', 'wb') as f:
    pickle.dump(feature_list_dl, f)

with open('scaler_dl.pkl', 'wb') as f:
    pickle.dump(scaler_dl, f)

# Train Deep Learning Model
model_nn = models.Sequential()
# layers.Input(shape=...) avoids the deprecated input_shape argument of InputLayer
model_nn.add(layers.Input(shape=(X_train_dl_scaled.shape[1],)))
model_nn.add(layers.Dense(128, activation='relu'))
model_nn.add(layers.Dense(64, activation='relu'))
model_nn.add(layers.Dense(1, activation='sigmoid'))

model_nn.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

model_nn.summary()




Epochs: How many times to iterate through the entire dataset. 20 is a good starting point.
Batch Size: The number of samples per gradient update. A typical value is 32.
Validation Data: This allows the model to check its performance on unseen data after each epoch.
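The number of steps Keras reports per epoch follows from the batch size: it is the number of training rows divided by the batch size, rounded up. A quick check, assuming the 925,244 rows reported by `merged_df.shape` and the 70% training share used in the split (the exact row count reaching the model may differ slightly):

```python
import math

# Assumed figures: 925,244 merged rows and a 70% training share.
n_rows = 925_244
n_train = int(n_rows * 0.7)  # approximate training-set size
batch_size = 32

# Each step processes one batch; the last batch may be smaller.
steps_per_epoch = math.ceil(n_train / batch_size)
print(steps_per_epoch)
```

Under these assumptions the result is 20240, in line with the `20240/20240` counter in the training log.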

In [20]:
history = model_nn.fit(
    X_train_dl_scaled, y_train_dl,
    epochs=20,
    batch_size=32,
    validation_data=(X_val_dl_scaled, y_val_dl),
    verbose=1
)

Epoch 1/20
20240/20240 ━━━━━━━━━━━━━━━━━━━━ 28s 1ms/step - accuracy: 0.9291 - loss: 0.1581 - val_accuracy: 0.9743 - val_loss: 0.0654
Epoch 2/20
20240/20240 ━━━━━━━━━━━━━━━━━━━━ 22s 1ms/step - accuracy: 0.9774 - loss: 0.0604 - val_accuracy: 0.9819 - val_loss: 0.0505
Epoch 3/20
20240/20240 ━━━━━━━━━━━━━━━━━━━━ 22s 1ms/step - accuracy: 0.9818 - loss: 0.0498 - val_accuracy: 0.9830 - val_loss: 0.0465
Epoch 4/20
20240/20240 ━━━━━━━━━━━━━━━━━━━━ 22s 1ms/step - accuracy: 0.9835 - loss: 0.0453 - val_accuracy: 0.9821 - val_loss: 0.0478
Epoch 5/20
20240/20240 ━━━━━━━━━━━━━━━━━━━━ 22s 1ms/step - accuracy: 0.9846 - loss: 0.0427 - val_accuracy: 0.9843 - val_loss: 0.0466
Epoch 6/20
20240/20240 ━━━━━━━━━━━━━━━━━━━━ 22s 1ms/step - accuracy: 0.9861 - loss: 0.0401 - val_accuracy: 0.9876 - val_loss: 0.036

In [21]:
# Validate Deep Learning Model
y_val_pred_dl = model_nn.predict(X_val_dl_scaled)
y_val_pred_class_dl = (y_val_pred_dl > 0.5).astype(int)

accuracy_dl = accuracy_score(y_val_dl, y_val_pred_class_dl)
precision_dl = precision_score(y_val_dl, y_val_pred_class_dl)
recall_dl = recall_score(y_val_dl, y_val_pred_class_dl)
f1_dl = f1_score(y_val_dl, y_val_pred_class_dl)

print("Deep Learning Model Evaluation (Validation Set):")
print(f"Accuracy: {accuracy_dl:.4f}")
print(f"Precision: {precision_dl:.4f}")
print(f"Recall: {recall_dl:.4f}")
print(f"F1-Score: {f1_dl:.4f}")


[1m4338/4338[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 810us/step
Deep Learning Model Evaluation (Validation Set):
Accuracy: 0.9885
Precision: 0.9912
Recall: 0.9891
F1-Score: 0.9902


Training Insights:
Validation Accuracy: Rose from 97.43% after the first epoch to 98.76% by epoch 6 in the (truncated) log above; the trained model scores 98.85% on the validation set.
Validation Loss: Fell from 0.0654 after the first epoch to below 0.04 by epoch 6, indicating that the model is learning effectively.
Stable Training: No obvious signs of overfitting in the epochs shown, as validation loss and accuracy stay close to the training values.
Evaluation Metrics (Validation Set):
Accuracy: 98.85%
Precision: 99.12%
Recall: 98.91%
F1-Score: 99.02%

In [22]:
# Test Deep Learning Model
y_test_pred_dl = model_nn.predict(X_test_dl_scaled)
y_test_pred_class_dl = (y_test_pred_dl > 0.5).astype(int)

accuracy_dl_test = accuracy_score(y_test_dl, y_test_pred_class_dl)
precision_dl_test = precision_score(y_test_dl, y_test_pred_class_dl)
recall_dl_test = recall_score(y_test_dl, y_test_pred_class_dl)
f1_dl_test = f1_score(y_test_dl, y_test_pred_class_dl)

print("\nDeep Learning Model Evaluation (Test Set):")
print(f"Accuracy: {accuracy_dl_test:.4f}")
print(f"Precision: {precision_dl_test:.4f}")
print(f"Recall: {recall_dl_test:.4f}")
print(f"F1-Score: {f1_dl_test:.4f}")


[1m4338/4338[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 796us/step

Deep Learning Model Evaluation (Test Set):
Accuracy: 0.9879
Precision: 0.9910
Recall: 0.9883
F1-Score: 0.9897


Consistency: The test set performance closely matches the validation set metrics, indicating that the model is not overfitting and generalizes well.
High Precision and F1-Score: This is particularly beneficial if the model is used for tasks where minimizing false positives and achieving high overall prediction quality is crucial.
Balanced Performance: The recall is also high, ensuring that most positive cases are identified correctly.

In [23]:
# Save the model
model_nn.save('D://classes/nlp/project/deep_learning_model.h5')



Since we've trained models (Logistic Regression and Deep Learning) to classify sentiment and assess game quality, we can use these to score games based on:

- Sentiment Prediction: A score derived from the predicted probability of a positive sentiment.
- Time Spent (`hours_played`): This adds a measure of user engagement.

The output of this recommendation will consider both factors to suggest highly engaging games with positive sentiment in the selected genre.

Data Preparation for Recommendation

- Filter by Genre: Filter games by the genre selected by the user.
- Sort by Combined Metric: Create a metric (e.g., a weighted average of sentiment probability and normalized hours_played).
- Handle Ties: In cases of similar scores, sort by additional metrics like release date or popularity.
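The combined metric and tie-breaking described above can be sketched on a small example. The frame below is invented, `rank_games` is a hypothetical helper, and using `number_of_english_reviews` as the popularity tie-breaker is an assumption made only for illustration.

```python
import pandas as pd

# Toy per-game data; all values are invented for illustration.
games = pd.DataFrame({
    "game_name": ["Game A", "Game B", "Game C", "Game D"],
    "sentiment_prob": [0.90, 0.80, 0.90, 0.60],
    "normalized_hours_played": [0.50, 0.40, 0.50, 0.20],
    "number_of_english_reviews": [1200, 300, 4500, 800],
})

def rank_games(df, w_sentiment=0.5, w_hours=0.5):
    """Score each game with a weighted average, breaking ties by review count."""
    ranked = df.copy()
    ranked["score"] = (
        w_sentiment * ranked["sentiment_prob"]
        + w_hours * ranked["normalized_hours_played"]
    )
    # Sort by score first; equal scores fall back to the popularity column.
    return ranked.sort_values(
        by=["score", "number_of_english_reviews"],
        ascending=[False, False],
    ).reset_index(drop=True)

ranking = rank_games(games)
print(ranking[["game_name", "score"]])
```

Here Game A and Game C have identical scores, so the game with more reviews is listed first. The weights are parameters so that the balance between sentiment and engagement can be adjusted.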

Recommendation System

In [26]:
from tensorflow.keras.models import load_model

model_dl_loaded = load_model('D://classes/nlp/project/deep_learning_model.h5')

# Load the feature list and scaler
with open('feature_list_dl.pkl', 'rb') as f:
    feature_list_dl_loaded = pickle.load(f)

with open('scaler_dl.pkl', 'rb') as f:
    scaler_dl_loaded = pickle.load(f)



In [33]:
import numpy as np

def recommend_games(user_genre_choice, top_n=10):
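    # Step 1: Keep rows tagged with at least one of the chosen genres
    filtered_games = merged_df.loc[
        merged_df[user_genre_choice].any(axis=1),
        ['sentiment_score', 'hours_played', 'game_name'] + list(genre_columns)
    ].copy()  # copy so the new columns below are not written to a view of merged_df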
    
    # Check if any games match
    if filtered_games.empty:
        print("No games match the selected genres. Please choose different genres.")
        return pd.DataFrame()
    
    # Step 2: Normalize 'hours_played'
    merged_hours_max = merged_df['hours_played'].max()
    if merged_hours_max > 0:
        filtered_games['normalized_hours_played'] = filtered_games['hours_played'] / merged_hours_max
    else:
        filtered_games['normalized_hours_played'] = 0.0
    
    # Step 3: Prepare features
    X = filtered_games[['sentiment_score', 'normalized_hours_played']]
    X = pd.concat([X, filtered_games[genre_columns]], axis=1)
    
    # Step 4: Align features with the training feature list
    # Reindex to ensure all required features are present
    X = X.reindex(columns=feature_list_dl_loaded, fill_value=0)
    
    # Step 5: Scale the features using the loaded scaler
    X_scaled = scaler_dl_loaded.transform(X)
    
    # Step 6: Predict sentiment probabilities using the Deep Learning model
    sentiment_probs = model_dl_loaded.predict(X_scaled)
    sentiment_probs_clipped = np.clip(sentiment_probs, 0.05, 0.95)
    
    # Step 7: Calculate the combined score
    # Assuming equal weighting; adjust as necessary
    filtered_games['score'] = (
        sentiment_probs_clipped.flatten() * 0.5 + 
        filtered_games['normalized_hours_played'] * 0.5
    )
    
    # Step 8: Sort games by score
    recommended_games = filtered_games[['game_name', 'score']].sort_values(by='score', ascending=False)
    
    # Step 9: Return top N games
    return recommended_games.head(top_n)


# Example usage:
user_genre_choice = ['Action', 'Adventure']
top_games = recommend_games(user_genre_choice, top_n=10)
print(top_games)


[1m20146/20146[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m19s[0m 918us/step
                       game_name    score
38662         Grand Theft Auto V  0.97500
125312  FINAL FANTASY XIV Online  0.97495
586630       Oxygen Not Included  0.97495
653997   Total War: WARHAMMER II  0.97495
304750    ARK: Survival Ascended  0.97490
457736  American Truck Simulator  0.97485
128807           Team Fortress 2  0.97485
210559                   Valheim  0.97480
583436                    Arma 3  0.97480
102982               The Crew™ 2  0.97475
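Note that `recommend_games` scores individual review rows, so the same game can appear more than once in the top N. One possible way to rank at game level instead is to average the review-level scores per game before sorting. The sketch below uses invented scores, and the column names `mean_score` and `n_reviews` are illustrative.

```python
import pandas as pd

# Toy review-level scores (invented); one row per review, as in filtered_games.
review_scores = pd.DataFrame({
    "game_name": ["Game A", "Game A", "Game B", "Game B", "Game C"],
    "score": [0.90, 0.50, 0.80, 0.70, 0.60],
})

# Average review-level scores per game and keep the review count for context.
per_game = (
    review_scores.groupby("game_name")["score"]
    .agg(mean_score="mean", n_reviews="count")
    .sort_values("mean_score", ascending=False)
    .reset_index()
)
print(per_game)
```

Keeping the review count alongside the mean makes it easier to spot games whose high score rests on only a handful of reviews.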
