# Lab 1 Project - Preprocessing Code

The following code cells preprocess the original datasets: they read the two tables, merge them, filter out non-English reviews that were mislabelled as English, and apply natural-language (NL) preprocessing techniques. 
The resulting dataset is saved as `preprocessed_reviews_games.csv`. 
The notebook can take up to 10 minutes to run because of the language detection step and the size of the dataset. 

We start by importing all the necessary libraries to execute the notebook. 

In [3]:
# DF processing libraries
import pandas as pd
import numpy as np
# Text processing libraries
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
import re
import contractions
# Language detection library 
from langdetect import detect, DetectorFactory, LangDetectException
# Data source fetching library
import kagglehub

## Loading and inspecting the Datasets
We can now load the datasets from the source using the `kagglehub` module, which lets us fetch the most up-to-date version of the datasets directly from Kaggle. The `path` variable stores the location of the downloaded datasets. 
Once the datasets are loaded into their respective variables we print `head()` for inspection.

In [4]:
# Reading the dataset from the original source:
path = kagglehub.dataset_download("filipkin/steam-reviews")
# Printing dataset file path
print("Path to dataset files:", path)

Path to dataset files: C:\Users\vnvtr\.cache\kagglehub\datasets\filipkin\steam-reviews\versions\6


In [6]:
# Loading the datasets from the path
reviews = pd.read_csv(f'{path}/output.csv')
games = pd.read_csv(f'{path}/output_steamspy.csv')
reviews.head()

Unnamed: 0,id,app_id,content,author_id,is_positive
0,181331361,100,At least its a counter strike -1/100,76561199556485100,Negative
1,180872601,100,Uh... So far my playthrough has not been great...,76561199230620391,Negative
2,177836246,100,Better mechanics than cs2,76561198417690647,Negative
3,177287444,100,buggy mess and NOT fun to play at all,76561199077268730,Negative
4,176678990,100,"Whoever came up with this, is gonna fucking ge...",76561199104544266,Negative


In [5]:
reviews.shape

(201151, 5)

In [5]:
games.head()

Unnamed: 0,appid,name,owners
0,10,Counter-Strike,"10,000,000 .. 20,000,000"
1,20,Team Fortress Classic,"5,000,000 .. 10,000,000"
2,40,Deathmatch Classic,"5,000,000 .. 10,000,000"
3,50,Half-Life: Opposing Force,"2,000,000 .. 5,000,000"
4,60,Ricochet,"5,000,000 .. 10,000,000"


We can now start preprocessing the datasets, beginning with `reviews`. First of all we must transform the `str` variable `is_positive` into a binary `score` with values in {0, 1}. This can be accomplished with the `.map` method. 

In [7]:
# Converting 'Positive'/'Negative' values to binary
reviews['score'] = reviews['is_positive'].map({'Positive': 1, 'Negative': 0})
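
As a quick sanity check, note that `.map` returns `NaN` for any label missing from the dictionary instead of raising an error, so it can be worth confirming that no unexpected labels slipped through. A minimal sketch on a toy frame (the `demo` frame and its `'Mixed'` label are hypothetical and not taken from the dataset):

```python
import pandas as pd

# Toy frame; the 'Mixed' label is a made-up value used to show the failure mode
demo = pd.DataFrame({'is_positive': ['Positive', 'Negative', 'Mixed']})

# Labels absent from the dictionary are mapped to NaN rather than raising an error
demo['score'] = demo['is_positive'].map({'Positive': 1, 'Negative': 0})

# Count the labels that could not be mapped
unmapped = int(demo['score'].isna().sum())
print(f"Unmapped labels: {unmapped}")
```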

We then check for `null` cells in the variable `content`, which contains the text of each review. These observations must be removed, along with any `content` cell containing fewer than 10 characters, since such short texts would lead to errors in the `langdetect` function and later in the NL preprocessing step. 

In [8]:
# Verify presence of null values in content and remove if present - no text no nlp
null_count = reviews['content'].isnull().sum()
print(f"Number of null values in 'content': {null_count}")
# Remove rows where 'content' is null
reviews = reviews.dropna(subset=['content'])
# Considering that language is to be identified and short documents cannot be evaluated
# remove observations with a comment shorter than 10 characters
reviews = reviews[reviews['content'].str.len() >= 10]

Number of null values in 'content': 428
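
To make the effect of the two filters concrete, here is a small sketch on a toy frame (the `demo` frame and its contents are invented for illustration). It shows that a review of exactly 10 characters is kept, since the threshold is inclusive:

```python
import pandas as pd

# Toy frame: one normal review, one null, one too short, one of exactly 10 characters
demo = pd.DataFrame({'content': ['great game, would play again', None, 'meh', 'ok' * 5]})

# Count the nulls before dropping them
null_count = int(demo['content'].isnull().sum())
demo = demo.dropna(subset=['content'])

# Keep only reviews with at least 10 characters
demo = demo[demo['content'].str.len() >= 10]
print(null_count, demo['content'].tolist())
```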


We now turn to the second dataset, `games`, which contains the game names, IDs and owner ranges. The owner range must be converted to a numerical variable, which we do by averaging the two ends of the given range. A simple function to do so is defined and applied to the dataframe.

In [9]:
# Calculating average game owners from the owners range column
def calculate_average_owners(range_str):
    try:
        # Split the range at ".." and strip commas while converting to int
        low, high = range_str.split('..')
        low = int(low.replace(',', '').strip())
        high = int(high.replace(',', '').strip())
        return (low + high) / 2  # Compute the average
    except Exception as e:
        print(f"Error processing range: {range_str} -> {e}")
        return None  # Return None for invalid entries


# Apply the function to the 'owners' column
games['average_owners'] = games['owners'].apply(calculate_average_owners)
# Check if any na arose
games.isnull().sum()
reviews.head()

Unnamed: 0,id,app_id,content,author_id,is_positive,score
0,181331361,100,At least its a counter strike -1/100,76561199556485100,Negative,0
1,180872601,100,Uh... So far my playthrough has not been great...,76561199230620391,Negative,0
2,177836246,100,Better mechanics than cs2,76561198417690647,Negative,0
3,177287444,100,buggy mess and NOT fun to play at all,76561199077268730,Negative,0
4,176678990,100,"Whoever came up with this, is gonna fucking ge...",76561199104544266,Negative,0
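
As a worked example of the range averaging, the sketch below applies the same midpoint logic to two ranges taken from `games.head()`. The helper name `midpoint` is hypothetical and is only a compact restatement of the function above, without the error handling:

```python
def midpoint(range_str):
    """Return the midpoint of a 'low .. high' owners range string."""
    # Split on '..', drop thousands separators and whitespace, convert to int
    low, high = (int(part.replace(',', '').strip()) for part in range_str.split('..'))
    return (low + high) / 2

# Ranges as they appear in the `owners` column
counter_strike = midpoint("10,000,000 .. 20,000,000")
opposing_force = midpoint("2,000,000 .. 5,000,000")
print(counter_strike, opposing_force)
```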


We can now merge the datasets on the columns `app_id` and `appid`. Superfluous or duplicate variables can then be dropped. 

In [10]:
# Merge games data with review data for later reference in analysis
combined_data = pd.merge(reviews, games,
                   left_on='app_id',
                   right_on='appid',
                   how='inner')
combined_data = combined_data.drop(['appid', 'is_positive', 'owners'], axis=1)
combined_data

Unnamed: 0,id,app_id,content,author_id,score,name,average_owners
0,181398278,630,my friend fucking glitched out of the map. dud...,76561199715566328,0,Alien Swarm,3500000.0
1,177801117,630,"its just ok, valve games are not usually just ok",76561199197172151,0,Alien Swarm,3500000.0
2,175959006,630,nobody is playing this game,76561198880505482,0,Alien Swarm,3500000.0
3,175942905,630,this game isn't very fun it runs like ass and ...,76561199208923353,0,Alien Swarm,3500000.0
4,175501438,630,last mission was so hard,76561199229157735,0,Alien Swarm,3500000.0
...,...,...,...,...,...,...,...
159214,3115345,1250,loadsa loadsa loadsa\n\nmoney money money,76561198017700903,0,Killing Floor,3500000.0
159215,1001644,1250,So basically this is a really shitty L4D2 ripo...,76561197984089656,0,Killing Floor,3500000.0
159216,261576,1250,"Killing Floor isn't a bad game, but it is like...",76561197966948377,0,Killing Floor,3500000.0
159217,546537,1250,Non-stop Zombie Killing with a bunch of Cockne...,76561197971965320,0,Killing Floor,3500000.0
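
An inner merge silently drops any review whose `app_id` has no match in `games`. One way to see how many rows are affected is a left merge with `indicator=True`. The sketch below uses toy stand-ins (`reviews_demo`, `games_demo`, and the unmatched id `999` are all hypothetical):

```python
import pandas as pd

# Toy stand-ins for the real tables; all values are illustrative
reviews_demo = pd.DataFrame({'id': [1, 2, 3], 'app_id': [630, 630, 999]})
games_demo = pd.DataFrame({'appid': [630, 1250], 'name': ['Alien Swarm', 'Killing Floor']})

# A left merge with indicator=True adds a '_merge' column flagging the match status
check = pd.merge(reviews_demo, games_demo,
                 left_on='app_id', right_on='appid',
                 how='left', indicator=True)

# Rows marked 'left_only' are the ones an inner merge would drop
dropped = int((check['_merge'] == 'left_only').sum())
print(f"Reviews without a matching game: {dropped}")
```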


As previously mentioned, the reviews contain observations that are mistakenly labelled as English. To avoid issues in the NL preprocessing step we need to remove such content. To do so we take a conservative approach using the `langdetect` library. The detector may misclassify some very short English strings as non-English, but among the reviews classified as English all appear to be correctly identified (checked manually on 3 random samples of 100 reviews each, drawn without replacement).
After applying the classification function with error handling, we inspect the modified dataset before saving only the English observations to a new variable. 

In [11]:
# Ensuring consistent results from langdetect
DetectorFactory.seed = 42

# Define a function to detect English comments
def is_english(text):
    try:
        return detect(text) == 'en'
    except LangDetectException:
        return False  # Handle empty or unreadable text

# Apply the language detection
combined_data['is_english'] = combined_data['content'].apply(is_english)
combined_data.head()
# Storing only English comments in a new variable
combined_data_en = combined_data[combined_data['is_english']].drop(columns=['is_english'])
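
The manual check mentioned above (three random samples of 100 reviews, drawn without replacement) could be reproduced along these lines. This is only a sketch on a synthetic frame: `english_demo`, the seed, and the approach of drawing 300 rows and splitting them into three disjoint chunks are assumptions, not necessarily the procedure that was actually followed.

```python
import pandas as pd

# Synthetic stand-in for the English-classified reviews
english_demo = pd.DataFrame({'content': [f"review text {i}" for i in range(1000)]})

# Draw 300 rows without replacement, then split into three disjoint samples of 100
drawn = english_demo.sample(n=300, replace=False, random_state=42)
samples = [drawn.iloc[i:i + 100] for i in range(0, 300, 100)]

for k, sample in enumerate(samples, start=1):
    print(f"Sample {k}: {len(sample)} reviews to inspect by hand")
```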

Finally we can move on to text preprocessing. Using the `nltk`, `contractions` and `re` libraries we complete the following steps: 
- Using `contractions` to expand contracted words to their full form 
- Lowercasing the text
- Removing punctuation and numbers using `re.sub()`
- Tokenizing the text
- Eliminating stopwords while applying lemmatization to the rest of the text

The processed and cleaned content can then be analyzed and used for modelling.  

In [15]:
combined_data_en.dropna(inplace=True)
nltk.download('wordnet')   #used in first run to download packages
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def PreProcessing(text):
    # Using the contractions library to convert contracted text forms to their expanded version
    # (done first, while the apostrophes are still present)
    text = contractions.fix(text)
    # Lowercasing text
    text = text.lower()
    # Removing numbers, punctuation and other symbols
    text = re.sub(r'[^a-z\s]', '', text)
    # Tokenizing the text by words
    tokens = word_tokenize(text)
    # Applying lemmatization to the words not present in the stopwords set 
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
    # Returning the list of preprocessed tokens
    return tokens

combined_data_en['tokens'] = combined_data_en['content'].apply(PreProcessing)
combined_data_en['cleaned_content'] = combined_data_en['tokens'].apply(lambda tokens: ' '.join(tokens))

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\vnvtr\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
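
The order of the steps matters. Stripping punctuation turns "we'll" into "well", an ordinary English word, so it can no longer be recognized as a contraction afterwards. The sketch below illustrates this with a tiny hand-written contraction map and stopword set; both are hypothetical stand-ins that are far smaller than the real `contractions` and `nltk` resources, and lemmatization is omitted.

```python
import re

# Tiny hand-written stand-ins for the `contractions` map and the NLTK stopword list
CONTRACTIONS = {"isn't": "is not", "we'll": "we will", "i'll": "i will"}
STOP_WORDS = {"is", "not", "we", "will", "i", "this", "the", "be"}

def expand(text):
    # Replace each known contraction with its expanded form
    return " ".join(CONTRACTIONS.get(word, word) for word in text.split())

def strip_symbols(text):
    # Keep only lowercase letters and whitespace
    return re.sub(r'[^a-z\s]', '', text)

def tokens(text, expand_first):
    # Lowercase, then apply the two steps in the requested order
    text = text.lower()
    text = strip_symbols(expand(text)) if expand_first else expand(strip_symbols(text))
    # Split on whitespace and drop stopwords
    return [word for word in text.split() if word not in STOP_WORDS]

sentence = "We'll be fine, this isn't hard"
print(tokens(sentence, expand_first=True))   # expansion before stripping
print(tokens(sentence, expand_first=False))  # stripping before expansion
```

With expansion first, the contractions are resolved and removed as stopwords. With stripping first, `well` and `isnt` survive as spurious tokens in this toy setup.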


In [16]:
combined_data_en.to_csv('preprocessed_reviews_games.csv', index=False)

In [17]:
combined_data_en

Unnamed: 0,id,app_id,content,author_id,score,name,average_owners,tokens,cleaned_content
0,181398278,630,my friend fucking glitched out of the map. dud...,76561199715566328,0,Alien Swarm,3500000.0,"[friend, fucking, glitched, map, dude]",friend fucking glitched map dude
2,175959006,630,nobody is playing this game,76561198880505482,0,Alien Swarm,3500000.0,"[nobody, playing, game]",nobody playing game
3,175942905,630,this game isn't very fun it runs like ass and ...,76561199208923353,0,Alien Swarm,3500000.0,"[game, fun, run, like, as, community, ready, k...",game fun run like as community ready kick soon...
4,175501438,630,last mission was so hard,76561199229157735,0,Alien Swarm,3500000.0,"[last, mission, hard]",last mission hard
5,170983852,630,As an avid fan of Alien Shooter 2 from its hey...,76561199687240931,0,Alien Swarm,3500000.0,"[avid, fan, alien, shooter, heyday, approached...",avid fan alien shooter heyday approached game ...
...,...,...,...,...,...,...,...,...,...
159213,153611,1250,The jolliest game involving zombies this side ...,76561197963550511,0,Killing Floor,3500000.0,"[jolliest, game, involving, zombie, side, nati...",jolliest game involving zombie side nation red...
159215,1001644,1250,So basically this is a really shitty L4D2 ripo...,76561197984089656,0,Killing Floor,3500000.0,"[basically, really, shitty, ld, ripoff, valve,...",basically really shitty ld ripoff valve zombie...
159216,261576,1250,"Killing Floor isn't a bad game, but it is like...",76561197966948377,0,Killing Floor,3500000.0,"[killing, floor, bad, game, like, poor, man, l...",killing floor bad game like poor man left dead...
159217,546537,1250,Non-stop Zombie Killing with a bunch of Cockne...,76561197971965320,0,Killing Floor,3500000.0,"[nonstop, zombie, killing, bunch, cockney, spa...",nonstop zombie killing bunch cockney sparrow l...
