# Attempt number 1

When I first wrote `cleaning.py`, I used `script_path = Path(__file__).resolve().parent`, which works well in a `.py` file because `__file__` holds the exact location of the current file. Inside a notebook, however, the variable `__file__` does not exist. To keep resolving the data path, I decided to use `Path.cwd()` when building `data_path_cleaned = script_path / ... /` in the notebook, which gives the same result as long as the working directory is the notebook's folder. I kept `__file__` in `cleaning.py` because it is more reliable.
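
A minimal sketch of how both cases could share one helper, assuming the data folder sits next to the folder that holds the script or notebook. The function name `project_dir` is hypothetical and not part of `cleaning.py`.

```python
from pathlib import Path


def project_dir() -> Path:
    """Return the folder of the running script, or the working directory in a notebook."""
    try:
        # __file__ is defined when the code runs from a .py file.
        return Path(__file__).resolve().parent
    except NameError:
        # Notebooks do not define __file__, so fall back to the working directory.
        return Path.cwd()


script_path = project_dir()
data_path_cleaned = script_path.parent / "data" / "games_cleaned.csv"
print(data_path_cleaned)
```

The fallback is only as reliable as the working directory: if the notebook is started from another folder, `Path.cwd()` points there instead.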

In [1]:
import pandas as pd
from pathlib import Path

script_path = Path.cwd()
data_path_cleaned = script_path.parent / "data" / "games_cleaned.csv"

df = pd.read_csv(data_path_cleaned, encoding="utf-8")

# Checking for AppID duplicates
duplicates = df[df.duplicated('AppID', keep=False)]
print(duplicates.shape)

(0, 16)


In [2]:
# Checking for Name duplicates
duplicates = df[df.duplicated('Name', keep=False)]
print(duplicates.shape)

(2063, 16)


As shown above, all `AppID` values are unique; however, 2,063 rows share a `Name` with at least one other row. To analyze the dataset effectively, we need a clear metric to determine which duplicates to keep and which to discard. In this project, I aim to focus on user feedback while preserving enough relevance to ensure the data remains valuable. Therefore, I will use the number of `Positive` and `Negative` reviews, the `Estimated owners` column, and the game's `Release date` as criteria for filtering duplicates.
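
To make the idea concrete, here is a small sketch of the sort-then-drop pattern on made-up rows. The toy values and the column name `Owners` are assumptions for illustration only and do not come from the dataset.

```python
import pandas as pd

# Hypothetical toy data, not taken from the real dataset.
toy = pd.DataFrame({
    "AppID": [1, 2, 3],
    "Name": ["Game A", "Game A", "Game B"],
    "Owners": [10000, 75000, 10000],
    "Positive ratio": [0.9, 0.8, 0.5],
})

# Sort so the preferred row of each Name group comes first.
toy_sorted = toy.sort_values(by=["Owners", "Positive ratio"], ascending=[False, False])

# keep="first" then retains the top-ranked row for every Name.
toy_unique = toy_sorted.drop_duplicates("Name", keep="first")
print(toy_unique[["AppID", "Name"]])
```

Of the two "Game A" rows, the one with more owners (AppID 2) survives even though its positive ratio is lower, because the first sort key takes priority.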

In [3]:
import numpy as np

# Data types conversion
df['Release date'] = pd.to_datetime(df['Release date'], errors='coerce')
df['Positive'] = pd.to_numeric(df['Positive'], errors='coerce')
df['Negative'] = pd.to_numeric(df['Negative'], errors='coerce')
df['Average owners'] = pd.to_numeric(df['Estimated owners'], errors='coerce')

df['Positive ratio'] = df['Positive'] / (df['Positive'] + df['Negative']).replace(0, np.nan)

df_sorted = df.sort_values(
    by=['Average owners', 'Positive ratio', 'Release date'],
    ascending=[False, False, False]
)

print(df_sorted[['Average owners', 'Positive ratio', 'Release date']].head(10))




        Average owners  Positive ratio Release date
111100             NaN             1.0   2025-04-18
111176             NaN             1.0   2025-04-18
111272             NaN             1.0   2025-04-18
111419             NaN             1.0   2025-04-18
110959             NaN             1.0   2025-04-17
111056             NaN             1.0   2025-04-17
111332             NaN             1.0   2025-04-17
111390             NaN             1.0   2025-04-17
111416             NaN             1.0   2025-04-17
110975             NaN             1.0   2025-04-16


The new `Average owners` column came out as NaN because the original `Estimated owners` data holds ranges such as `0 - 20000`, `20000 - 50000`, or `100000 - 200000`. Since these are strings rather than single numbers, `pd.to_numeric` with `errors='coerce'` could not convert them and returned NaN instead. To fix this, I’m going to take the midpoint of each range and use that as a new metric. With that in place, we’ll be able to finish dealing with the duplicates.

In [4]:
# Separate the range into two separated numeric values to calculate the average
owners_clean = df['Estimated owners'].str.replace(' ', '', regex=False)

owners_split = owners_clean.str.split('-', expand=True)

owners_split[0] = pd.to_numeric(owners_split[0], errors='coerce')
owners_split[1] = pd.to_numeric(owners_split[1], errors='coerce')

df['Estimated average owners'] = (owners_split[0] + owners_split[1]) / 2

print(df[['Estimated owners', 'Estimated average owners']].head(10))


  Estimated owners  Estimated average owners
0        0 - 20000                   10000.0
1        0 - 20000                   10000.0
2        0 - 20000                   10000.0
3        0 - 20000                   10000.0
4        0 - 20000                   10000.0
5   50000 - 100000                   75000.0
6        0 - 20000                   10000.0
7        0 - 20000                   10000.0
8        0 - 20000                   10000.0
9   50000 - 100000                   75000.0


In [5]:
# Sort the duplicates based on the prioritised columns and use them to decide which duplicates to keep
df_sorted = df.sort_values(
    by=['Estimated average owners', 'Positive ratio', 'Release date'],
    ascending=[False, False, False]
)

# Keeping the first duplicate as it's already sorted by the prioritised columns
df_drop = df_sorted.drop_duplicates('Name', keep='first')
df_drop = df_drop.drop(columns=['Average owners'])

data_path_unique = script_path.parent / "data" / "games_unique.csv"
df_drop.to_csv(data_path_unique, index=False, encoding="utf-8")

print("cleaned")

cleaned


### Handling Encoding Issues in Game Names

When first loading the dataset, some game names and metadata fields appeared with corrupted characters (e.g., `å¿èæå¤§æ2`). This was caused by reading the file with the wrong encoding. Reloading the file using `encoding="utf-8"` in the initial `pd.read_csv()` resolved the issue and restored correct characters like `忍者村大战2`. No further cleaning was needed, since the original text was intact in the source file.
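
The effect can be reproduced with a short sketch. The notes above do not record which wrong encoding was used, so Latin-1 here is an assumption chosen because it produces similar-looking garbage.

```python
original = "忍者村大战2"

# Simulate the bug: UTF-8 bytes interpreted as Latin-1 produce mojibake.
garbled = original.encode("utf-8").decode("latin-1")

# Reversing the wrong decoding recovers the text, because no bytes were lost.
restored = garbled.encode("latin-1").decode("utf-8")

print(repr(garbled))
print(restored)
```

This is why no further cleaning was needed: the bytes in the file were always correct, and only their interpretation was wrong.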

In [6]:
# Reading the new cleaned file to keep working and double checking the results from previous work
df2 = pd.read_csv(data_path_unique, encoding="utf-8", low_memory=False) 

appid_to_check = 754350
print(df2[df2['AppID'] == appid_to_check][['AppID', 'Name','Developers','Publishers']])

        AppID    Name    Developers    Publishers
10451  754350  忍者村大战2  杭州分浪网络科技有限公司  杭州分浪网络科技有限公司


In [7]:
duplicates2 = df2[df2.duplicated('Name', keep=False)]
print(duplicates2.shape)

(0, 18)


I double-checked the data in Excel, and even though the code says all rows are unique, there are still quite a few duplicates. Most of them come from slight differences in `Name`, such as upper versus lower case. So, we’re going to clean it up one more time to make sure everything is consistent.
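
A possible normalization key is sketched below on made-up names. It goes slightly further than lowercasing by also collapsing repeated spaces, which is an assumption about the kind of variants present. The helper `normalize_name` is hypothetical.

```python
import pandas as pd

# Hypothetical name variants for illustration.
names = pd.Series(["Dota 2", "DOTA 2 ", "dota  2", "Terraria"])


def normalize_name(s: pd.Series) -> pd.Series:
    """Build a comparison key: case-insensitive, trimmed, single-spaced."""
    return (
        s.str.casefold()                        # aggressive lowercasing
         .str.strip()                           # drop leading/trailing whitespace
         .str.replace(r"\s+", " ", regex=True)  # collapse runs of whitespace
    )


key = normalize_name(names)
print(key.duplicated(keep=False).tolist())
```

Keeping the key in a separate column leaves the original `Name` untouched for display.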

In [8]:
# Making a new column with lowercased names to keep it consistent and check for duplicates
df2['Name_lowercase'] = df2['Name'].str.lower().str.strip()

duplicates2 = df2[df2['Name_lowercase'].duplicated(keep=False)]
print(duplicates2.shape)

df2_sorted = df2.sort_values(
    by=['Estimated average owners', 'Positive ratio', 'Release date'],
    ascending=[False, False, False]
)

print(df2_sorted[['Name','Estimated average owners','Positive ratio','Release date']].head(10))

(660, 19)
                               Name  Estimated average owners  Positive ratio  \
0                            Dota 2               150000000.0        0.830986   
1                Black Myth: Wukong                75000000.0        0.958515   
2                   Team Fortress 2                75000000.0        0.935615   
3  Counter-Strike: Global Offensive                75000000.0        0.882611   
4                         New World                75000000.0        0.677030   
5               PUBG: BATTLEGROUNDS                75000000.0        0.563072   
6                  Wallpaper Engine                35000000.0        0.980557   
7                          Terraria                35000000.0        0.978658   
8                     Left 4 Dead 2                35000000.0        0.974508   
9                       Garry's Mod                35000000.0        0.965931   

  Release date  
0   2013-07-09  
1   2024-08-19  
2   2007-10-10  
3   2012-08-21  
4   2021-09-2

In [9]:
# Dropping duplicates as well as some unnecessary columns
df2_drop = df2_sorted.drop_duplicates('Name_lowercase', keep='first')
df2_drop = df2_drop.drop(columns=['Name_lowercase','Estimated owners'])

data_path_unique2 = script_path.parent / "data" / "games_unique2.csv"
df2_drop.to_csv(data_path_unique2, index=False, encoding="utf-8")

print("cleaned")

cleaned


### Missing values
As the outputs above suggest, some cells in the dataset contain missing values, which I haven’t addressed yet. This was an oversight during the initial cleaning, and I plan to handle them as I continue developing the analysis. Depending on where these missing values appear, I’ll decide whether to drop them, fill them, or leave them untouched if they're irrelevant to the core questions.

In [10]:
# Reading the newer cleaned file to keep working and double checking the results from previous work
df3 = pd.read_csv(data_path_unique2, encoding="utf-8", low_memory=False)

print(df3.isnull().sum())

AppID                           0
Name                            1
Release date                  127
Required age                    0
Price                           0
User score                      0
Positive                        0
Negative                        0
Recommendations                 0
Average playtime forever        0
Developers                   6438
Publishers                   6735
Categories                   7527
Genres                       6412
Tags                        36663
Positive ratio              36827
Estimated average owners        0
dtype: int64


Since I now have many missing values, I’m wondering whether some of them were originally present in the duplicated rows that I deleted earlier. To check this, I’ll go back to `games_cleaned.csv` and investigate again with a cleaner approach, using everything I have learned throughout this process.
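
One way this check could look is sketched below on made-up rows. It is not the approach used so far. The column names `Owners` and `Name_key` are assumptions. The sketch relies on `GroupBy.first()`, which returns the first non-null value of each column within a group.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows: the preferred "Game A" row lacks Developers,
# while its lower-ranked duplicate has the value.
toy = pd.DataFrame({
    "Name": ["Game A", "game a", "Game B"],
    "Owners": [75000, 10000, 35000],
    "Developers": [np.nan, "Studio X", "Studio Y"],
})

toy["Name_key"] = toy["Name"].str.lower().str.strip()
toy_sorted = toy.sort_values("Owners", ascending=False)

# Instead of dropping duplicates, merge each group: the best-ranked row wins,
# and gaps are filled from the next rows of the same group.
merged = toy_sorted.groupby("Name_key", sort=False).first().reset_index()
print(merged)
```

The merged "Game A" row keeps its own name and owner count but picks up `Studio X` from the duplicate. Since this mixes values from different `AppID`s, the result would need a manual check before being trusted.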