![Best Selling Games - Market Segmentation](./images/img_1.png)

In [1]:
## Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.spatial.distance import pdist, squareform
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import euclidean_distances

In [2]:
df = pd.read_csv('./data/bestSelling_games.csv')
df.head()

Unnamed: 0,game_name,reviews_like_rate,all_reviews_number,release_date,developer,user_defined_tags,supported_os,supported_languages,price,other_features,age_restriction,rating,difficulty,length,estimated_downloads
0,Counter-Strike 2,86,8803754,"21 Aug, 2012",Valve,"FPS, Action, Tactical","win, linux","English, Czech, Danish, Dutch, Finnish, French...",0.0,"Cross-Platform Multiplayer, Steam Trading Card...",17,3.2,4,80,306170000
1,PUBG: BATTLEGROUNDS,59,2554482,"21 Dec, 2017",PUBG Corporation,"Survival, Shooter, Action, Tactical",win,"English, Korean, Simplified Chinese, French, G...",0.0,"Online PvP, Stats, Remote Play on Phone, Remot...",13,3.1,4,73,162350000
2,ELDEN RING NIGHTREIGN,77,53426,"30 May, 2025","FromSoftware, Inc.","Souls-like, Open World, Fantasy, RPG",win,"English, Japanese, French, Italian, German, Sp...",25.99,"Single-player, Online Co-op, Steam Achievement...",17,3.96,4,50,840000
3,The Last of Us™ Part I,79,45424,"28 Mar, 2023",Naughty Dog LLC,"Story Rich, Shooter, Survival, Horror",win,"English, Italian, Spanish - Spain, Czech, Dutc...",59.99,"Single-player, Steam Achievements, Steam Tradi...",17,4.1,3,24,2000000
4,Red Dead Redemption 2,92,672140,"5 Dec, 2019",Rockstar Games,"Open World, Story Rich, Adventure, Realistic, ...",win,"English, French, Italian, German, Spanish - Sp...",59.99,"Single-player, Online PvP, Online Co-op, Steam...",17,4.32,3,80,21610000


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2380 entries, 0 to 2379
Data columns (total 15 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   game_name            2380 non-null   object 
 1   reviews_like_rate    2380 non-null   int64  
 2   all_reviews_number   2380 non-null   int64  
 3   release_date         2380 non-null   object 
 4   developer            2380 non-null   object 
 5   user_defined_tags    2380 non-null   object 
 6   supported_os         2380 non-null   object 
 7   supported_languages  2380 non-null   object 
 8   price                2380 non-null   float64
 9   other_features       2380 non-null   object 
 10  age_restriction      2380 non-null   int64  
 11  rating               2380 non-null   float64
 12  difficulty           2380 non-null   int64  
 13  length               2380 non-null   int64  
 14  estimated_downloads  2380 non-null   int64  
dtypes: float64(2), int64(6), object(7)
mem

In [4]:
pd.options.display.float_format = '{:,.2f}'.format
df.describe(include='all')


Unnamed: 0,game_name,reviews_like_rate,all_reviews_number,release_date,developer,user_defined_tags,supported_os,supported_languages,price,other_features,age_restriction,rating,difficulty,length,estimated_downloads
count,2380,2380.0,2380.0,2380,2380,2380,2380,2380,2380.0,2380,2380.0,2380.0,2380.0,2380.0,2380.0
unique,2380,,,1511,1806,1649,6,1606,,761,,,,,
top,UBERMOSH:OMEGA,,,"22 May, 2025",Valve,"Simulation, Management",win,English,,"Single-player, Steam Achievements, Steam Cloud...",,,,,
freq,1,,,15,17,29,1724,323,,275,,,,,
mean,,82.41,31615.08,,,,,,10.51,,10.61,3.23,2.86,22.97,2523006.58
std,,12.64,213719.68,,,,,,11.34,,6.4,0.77,0.98,21.25,11182829.61
min,,20.0,10.0,,,,,,0.0,,0.0,0.39,1.0,1.0,90.0
25%,,76.0,342.0,,,,,,2.99,,10.0,2.75,2.0,6.0,35000.0
50%,,85.0,2106.5,,,,,,7.99,,13.0,3.38,3.0,16.0,217150.0
75%,,92.0,13030.0,,,,,,12.49,,17.0,3.81,3.0,34.0,1380000.0


## Dataset Overview

This dataset is retrived from kaggle: [Best-Selling Steam Games of All Time](https://www.kaggle.com/datasets/hbugrae/best-selling-steam-games-of-all-time)

This dataset contains information about **2,380 best-selling games on Steam**, it offers comprehensive look into various aspects of each game. It includes both quantitative and qualitative features, which gives a good foundation for analysis.

---

#### Key numerical features include:

+ `reviews_like_rate`: The percentage of positive reviews, ranging from 20% to a perfect 100%, with an average of approximately 82.41%.
    * According to Kaggle Dataset documentation:
    > reviews_like_rate: The recommendation rate from user reviews on Steam (e.g., '95% of the 100 reviews are positive').

* `all_reviews_number`: The total number of reviews, which varies widely from 10 to over 8.8 million, emphasizing a significant variation in player engagement.

+ `price`: Game prices range from free (0.00) up to 79.99, with an average price of around 10.51, suggest that there could be a  diverse pricing strategy among best-selling titles.
    * According to Kaggle Dataset documentation:
    > price: The price of the game. Note: The currency is MENA - U.S. Dollar, a regional price for the Middle East & North Africa, not the standard USD. A value of 0 in this column indicates the game is 'Free to Play'.

* `estimated_downloads`: Ranging from a mere 90 to an whopping 306 million, this feature highlights the massive difference in market penetration among these games.

* `age_restriction`: With values from 0 to 17, this indicates the recommended age for players.
    * According to Kaggle Dataset documentation:
    > age_restriction: The recommended age restriction for the game's content, encoded as follows: 0 (Everyone), 10 (10+), 13 (13+), 17 (17+).

+ `difficulty`: A numerical rating from 1 to 5, providing insight into the perceived challenge of the games, averaging around 2.86.
    * According to Kaggle Dataset documentation:
    > difficulty: An estimated difficulty of the game as perceived by players, on a scale of 1 to 5, where 1 is the easiest and 5 is the hardest.

+ `length`: Represents the estimated playtime in hours, varying from 1 to 80 hours.
    * According to the Kaggle Dataset documentation:
    > length: The average time (in hours) players spend to complete or fully experience the game. For this dataset, the value is capped at a maximum of 80 hours.

+ `rating`: An average rating score ranging from 0.39 to 4.83.
    * According to the Kaggle Dataset documentation:
    > rating: An overall user-provided rating for the game on a scale of 1 to 5, where 1 is the lowest and 5 is the highest.

---

#### Categorical and textual features provide additional context:

* `game_name`: Unique identifiers for each game.

* `release_date`: The date the game was released, which will be crucial for calculating *game age*.

* `developer`: The studio responsible for the game, with 1806 unique developers in the dataset.

* `user_defined_tags`: A critical field containing multiple descriptive tags (e.g., 'FPS', 'Action', 'Tactical'), *which will require parsing for genre analysis*.

* `supported_os`: Indicates the operating systems supported (e.g., 'win', 'linux'), with 'win' being the most common.

* `supported_languages`: The languages supported by the game, with English being the most frequent.

* `other_features`: Lists additional functionalities like multiplayer options or Steam achievements.

The dataset contains **no missing values**, which simplifies the initial cleaning process.

## 1. Data Cleaning & Pre-processing
> In this step, we check for null / inconsistent values, that might deviate or effect on the clustering process

In [5]:
# Check for null values
null_counts = df.isnull().sum()
print("Null values per column:")
print(null_counts[null_counts > 0])

# Check for inconsistent values (example: empty strings in object columns)
inconsistent = {}
for col in df.select_dtypes(include='object').columns:
    empty_count = (df[col].astype(str).str.strip() == '').sum()
    if empty_count > 0:
        inconsistent[col] = empty_count

if inconsistent:
    print("\nColumns with empty string values:")
    for col, count in inconsistent.items():
        print(f"{col}: {count} empty values")
else:
    print("\nNo empty string values found in object columns.")

Null values per column:
Series([], dtype: int64)

No empty string values found in object columns.


No Null or NaN values found on each column. Therefore no need of applying imputation

## 2. Feature Engineering
To get the most out of our dataset, we will create new, more informative features from the existing ones.

Genre & Tag Processing: The user_defined_tags column is a text field containing multiple tags. We will parse this field to extract the most frequent and relevant tags (e.g., 'Indie', 'Action', 'RPG') and convert them into binary features (One-Hot Encoding).

Create 'Game Age': Using the converted release_date column, we will calculate the age of each game in years. This can be a powerful feature for segmentation.

Create 'Review Ratio': We can create a more robust popularity metric by combining the like rate and the total number of reviews (e.g., reviews_like_rate * all_reviews_number).

In [16]:
# Genre & Tag Processing: Extract and One-Hot Encode the most frequent tags
# First, get all unique tags
all_tags = df['user_defined_tags'].str.split(', ').explode()
# Get the most frequent tags (e.g., top 20, you can adjust this number)
top_tags = all_tags.value_counts().index.tolist()

## print top_tags with count
print("Top Tags and their counts:")
tag_counts = all_tags.value_counts()
most_frequent_tags = list(tag_counts.items())
tag_freq_df = pd.DataFrame(most_frequent_tags, columns=['tag_name', 'count'])
display(tag_freq_df)

Top Tags and their counts:


Unnamed: 0,tag_name,count
0,Simulation,736
1,Action,730
2,Adventure,522
3,RPG,437
4,Strategy,353
5,2D,308
6,Horror,299
7,FPS,246
8,Survival,237
9,Open World,236


> Data inconsistency at user_defined_tags found, marked on below image. following measures will be taken to fix the data
* find the original df records with the user_defined_tags having issue
* update the values manually, to adhere the above string split logic based on comma separtion
![Data inconsistency at user_defined_tags](./images/img_2.png)

In [None]:
# Explode the 'user_defined_tags' column into separate rows for each tag
exploded_tags_df = df['user_defined_tags'].str.split(', ').explode().to_frame(name='tag_name')
exploded_tags_df['original_df_index'] = exploded_tags_df.index

# Clean up tags: strip leading/trailing whitespace and remove any empty strings
exploded_tags_df['tag_name'] = exploded_tags_df['tag_name'].str.strip()
exploded_tags_df = exploded_tags_df[exploded_tags_df['tag_name'] != '']

# Get the occurrence count for each unique tag from the cleaned list
tag_counts = exploded_tags_df['tag_name'].value_counts()

# Print total unique tags
print(f"\nTotal unique tags after cleaning: {len(tag_counts)}")

#Identify tags that appear only once
single_occurrence_tags = tag_counts[tag_counts == 1].index.tolist()

print(f"\nNumber of tags appearing only once: {len(single_occurrence_tags)}")

#Create a dictionary to store original df row index(s) for each single-occurrence tag
single_tag_records = {}

# Iterate through the single occurrence tags and find their original DataFrame row indices
for tag in single_occurrence_tags:
    # Find all entries in the exploded DataFrame that match this tag
    # and get their unique original DataFrame indices
    original_indices = exploded_tags_df[exploded_tags_df['tag_name'] == tag]['original_df_index'].unique().tolist()
    single_tag_records[tag] = original_indices

print("\nOriginal DataFrame row index(es) for tags that appear only once:")
for tag, indices in single_tag_records.items():
    # Print the tag and its corresponding original DataFrame row indices
    print(f"Tag: '{tag}', Original Row Index(es): {indices}")

# Create a DataFrame from the single_tag_records for better viewing
single_tag_records_df = pd.DataFrame([
    {'tag_name': tag, 'original_df_indices': indices}
    for tag, indices in single_tag_records.items()
])
print("\nDataFrame of Single-Occurrence Tags and their Original Row Indices:")
display(single_tag_records_df)


Total unique tags after cleaning: 47

Number of tags appearing only once: 5

Original DataFrame row index(es) for tags that appear only once:
Tag: 'Adventure ,RPG', Original Row Index(es): [163]
Tag: 'FPS ,RPG', Original Row Index(es): [132]
Tag: 'Simulation,', Original Row Index(es): [664]
Tag: 'Tactical,', Original Row Index(es): [1427]
Tag: 'RPG,', Original Row Index(es): [2135]

DataFrame of Single-Occurrence Tags and their Original Row Indices:


Unnamed: 0,tag_name,original_df_indices
0,"Adventure ,RPG",[163]
1,"FPS ,RPG",[132]
2,"Simulation,",[664]
3,"Tactical,",[1427]
4,"RPG,",[2135]


In [None]:
# indices_to_check add the indeces from original_df_indices
indices_to_check = [163, 132, 664, 1427, 2135] 

for idx in indices_to_check:
    print(f"Index: {idx}, User-defined tags: {df_copy.loc[idx, 'user_defined_tags']}")
    print('-' * 60)

Index: 163, User-defined tags: Combat, Action, Adventure ,RPG
------------------------------------------------------------
Index: 132, User-defined tags: Survival, Open World, Crafting, Building, FPS ,RPG
------------------------------------------------------------
Index: 664, User-defined tags: Sports, Strategy, Simulation,
------------------------------------------------------------
Index: 1427, User-defined tags: Strategy, Turn-Based, Tactical,
------------------------------------------------------------
Index: 2135, User-defined tags: Action, Adventure, RPG,
------------------------------------------------------------


### Making a copy of the original data frame, to preserve original data set.
+ From this point forward, we will be using a copy of the original data frame and make changes on it, such as..
    * clean inconsistent `user_defined_tags` values and add make them add back into correct tags 
    * change the data types of columns if required

In [None]:
# make a copy of the original DataFrame to avoid modifying it directly
df_copy = df.copy()

# Get the tag_name at index 42
tag_name_42 = tag_freq_df.iloc[42]['tag_name']

# Find rows in the original df where user_defined_tags contains this tag
rows_with_tag_42 = df[df['user_defined_tags'].str.contains(tag_name_42, na=False)]

display(rows_with_tag_42)

print(f"Tag at index 42: {tag_name_42}")

# rows_with_tag_42. user_defined_tags full value as string
print("User-defined tags for rows with tag at index 42:")
for tags in rows_with_tag_42['user_defined_tags']:
    print(tags)

Unnamed: 0,game_name,reviews_like_rate,all_reviews_number,release_date,developer,user_defined_tags,supported_os,supported_languages,price,other_features,...,tag_MMORPG,tag_2.5D,tag_CRPG,"tag_FPS_,RPG","tag_Adventure_,RPG","tag_Simulation,",tag_,"tag_Tactical,","tag_RPG,",review_ratio
132,7 Days to Die,88,252541,2024-07-25,The Fun Pimps,"Survival, Open World, Crafting, Building, FPS ...","win, mac, linux","English, French, German, Spanish - Spain, Ital...",20.99,"Single-player, Online PvP, LAN PvP, Online Co-...",...,0,0,0,1,0,0,1,0,0,22223608


Tag at index 42: FPS ,RPG
User-defined tags for rows with tag at index 42:
Survival, Open World, Crafting, Building, FPS ,RPG


In [21]:
# Get the tag_name at index 42
tag_name_43 = tag_freq_df.iloc[43]['tag_name']

# Find rows in the original df where user_defined_tags contains this tag
rows_with_tag_43 = df[df['user_defined_tags'].str.contains(tag_name_43, na=False)]

display(rows_with_tag_43)

print(f"Tag at index 43: {tag_name_43}")

# rows_with_tag_43. user_defined_tags full value as string
print("User-defined tags for rows with tag at index 43:")
for tags in rows_with_tag_43['user_defined_tags']:
    print(tags)

Unnamed: 0,game_name,reviews_like_rate,all_reviews_number,release_date,developer,user_defined_tags,supported_os,supported_languages,price,other_features,...,tag_MMORPG,tag_2.5D,tag_CRPG,"tag_FPS_,RPG","tag_Adventure_,RPG","tag_Simulation,",tag_,"tag_Tactical,","tag_RPG,",review_ratio
163,Evil West,74,7970,2022-11-22,Flying Wild Hog,"Combat, Action, Adventure ,RPG",win,"English, Polish, French, Italian, German, Span...",34.99,"Single-player, Online Co-op, Steam Achievement...",...,0,0,0,0,1,0,1,0,0,589780


Tag at index 43: Adventure ,RPG
User-defined tags for rows with tag at index 43:
Combat, Action, Adventure ,RPG


In [26]:
# Get the tag_name at index 44
tag_name_44 = tag_freq_df.iloc[44]['tag_name']
tag_name_44 = ',Simulation'  # Manually set to 'Simulation' for consistency
# Find rows in the original df where user_defined_tags contains this tag
rows_with_tag_44 = df[df['user_defined_tags'].str.contains(tag_name_44, na=False)]

display(rows_with_tag_44)

print(f"Tag at index 44: {tag_name_44}")

# rows_with_tag_44. user_defined_tags full value as string
print("User-defined tags for rows with tag at index 44:")
for tags in rows_with_tag_44['user_defined_tags']:
    print(tags)

Unnamed: 0,game_name,reviews_like_rate,all_reviews_number,release_date,developer,user_defined_tags,supported_os,supported_languages,price,other_features,...,tag_MMORPG,tag_2.5D,tag_CRPG,"tag_FPS_,RPG","tag_Adventure_,RPG","tag_Simulation,",tag_,"tag_Tactical,","tag_RPG,",review_ratio


Tag at index 44: ,Simulation
User-defined tags for rows with tag at index 44:


In [28]:
import re # Import the regular expression module if you haven't already

# Get the tag_name at index 44
tag_name_44 = tag_freq_df.iloc[44]['tag_name']

# Manually set to ',Simulation' as requested for the example.
# We'll extract the core tag 'Simulation' from this for robust matching.
target_string_for_regex = ',Simulation'

# Extract the 'core' tag from your target_string_for_regex for a more precise match
# This removes any leading/trailing commas and spaces to get 'Simulation'
core_tag = target_string_for_regex.strip(', ').strip()

# Construct a regex pattern to find 'core_tag' as a whole tag
# This pattern looks for:
# (^ or , followed by optional spaces) + core_tag + (optional spaces followed by , or $)
# re.escape() is used to escape any special regex characters that might be in your tag name
regex_pattern = r'(?:^|,\s*)' + re.escape(core_tag) + r'(?:\s*,|$)'

print(f"Original target string for regex: '{target_string_for_regex}'")
print(f"Extracted core tag for precise regex: '{core_tag}'")
print(f"Generated regex pattern: '{regex_pattern}'")


# Find rows in the original df where user_defined_tags contains this tag using the precise regex
rows_with_tag_44_precise = df[df['user_defined_tags'].str.contains(regex_pattern, na=False, regex=True)]

# display(rows_with_tag_44_precise)

print("\nUser-defined tags for rows with the precisely matched tag:")
for tags in rows_with_tag_44_precise['user_defined_tags']:
    print(tags)

Original target string for regex: ',Simulation'
Extracted core tag for precise regex: 'Simulation'
Generated regex pattern: '(?:^|,\s*)Simulation(?:\s*,|$)'

User-defined tags for rows with the precisely matched tag:
Sports, Simulation, Realistic
Early Access, RPG, Simulation, Survival, Open World, 2D
Driving, Sports, Simulation
Simulation, Building, Driving
Simulation, Sports, Management, Strategy
Shooter, Action, Tactical, Simulation, Realistic
Sports, Simulation, Realistic
Early Access, Adventure, Action, Simulation, Driving
Anime, Simulation, Open World, Sandbox
Early Access, Simulation, Action, Open World
Open World, Driving, Simulation, Sports
Sports, Simulation, Realistic
Driving, Simulation, Open World, Realistic
Crafting, RPG, Simulation, Management, Building
Strategy, War, Simulation, Sandbox
Simulation, Driving, Realistic
Strategy, Management, Simulation, RPG, Sandbox
Simulation, Building
Simulation, Shooter, Action, Realistic
Sandbox, FPS, Simulation, Building
Simulation, S

In [20]:
# Loop through tag_name indexes 42 to 47 and display relevant rows and tags
for idx in range(42, 48):
    tag_name = tag_freq_df.iloc[idx]['tag_name']
    rows_with_tag = df[df['user_defined_tags'].str.contains(tag_name, na=False)]
    # display(rows_with_tag)
    print(f"Tag at index {idx}: {tag_name}")
    print("User-defined tags for rows with tag at index {}: ".format(idx))
    for tags in rows_with_tag['user_defined_tags']:
        print(tags)
    print('-' * 60)

Tag at index 42: FPS ,RPG
User-defined tags for rows with tag at index 42: 
Survival, Open World, Crafting, Building, FPS ,RPG
------------------------------------------------------------
Tag at index 43: Adventure ,RPG
User-defined tags for rows with tag at index 43: 
Combat, Action, Adventure ,RPG
------------------------------------------------------------
Tag at index 44: Simulation,
User-defined tags for rows with tag at index 44: 
Sports, Simulation, Realistic
Early Access, RPG, Simulation, Survival, Open World, 2D
Simulation, Building, Driving
Simulation, Sports, Management, Strategy
Shooter, Action, Tactical, Simulation, Realistic
Sports, Simulation, Realistic
Early Access, Adventure, Action, Simulation, Driving
Anime, Simulation, Open World, Sandbox
Early Access, Simulation, Action, Open World
Open World, Driving, Simulation, Sports
Sports, Simulation, Realistic
Driving, Simulation, Open World, Realistic
Crafting, RPG, Simulation, Management, Building
Strategy, War, Simulation

In [18]:
# Ensure 'release_date' is datetime
if not np.issubdtype(df['release_date'].dtype, np.datetime64):
    df['release_date'] = pd.to_datetime(df['release_date'])

# Create 'Game Age': Calculate the age of each game in years
current_year = pd.to_datetime('today').year
df['game_age'] = current_year - df['release_date'].dt.year

# Genre & Tag Processing: One-Hot Encode the most frequent tags
for tag in top_tags:
    col_name = f'tag_{tag.replace(" ", "_").replace("-", "_")}'
    if col_name not in df.columns:
        df[col_name] = df['user_defined_tags'].apply(lambda x: 1 if tag in x else 0)

# Create 'Review Ratio': reviews_like_rate * all_reviews_number
df['review_ratio'] = df['reviews_like_rate'] * df['all_reviews_number']

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2380 entries, 0 to 2379
Data columns (total 65 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   game_name            2380 non-null   object        
 1   reviews_like_rate    2380 non-null   int64         
 2   all_reviews_number   2380 non-null   int64         
 3   release_date         2380 non-null   datetime64[ns]
 4   developer            2380 non-null   object        
 5   user_defined_tags    2380 non-null   object        
 6   supported_os         2380 non-null   object        
 7   supported_languages  2380 non-null   object        
 8   price                2380 non-null   float64       
 9   other_features       2380 non-null   object        
 10  age_restriction      2380 non-null   int64         
 11  rating               2380 non-null   float64       
 12  difficulty           2380 non-null   int64         
 13  length               2380 non-nul

In [None]:
# Get the tag_name at index 42
tag_name_43 = tag_freq_df.iloc[43]['tag_name']

# Find rows in the original df where user_defined_tags contains this tag
rows_with_tag_43 = df[df['user_defined_tags'].str.contains(tag_name_43, na=False)]

display(rows_with_tag_43)

print(f"Tag at index 43: {tag_name_43}")

# rows_with_tag_42. user_defined_tags full value as string
print("User-defined tags for rows with tag at index 43:")
for tags in rows_with_tag_43['user_defined_tags']:
    print(tags)

## 3. Exploratory Data Analysis (EDA)
With clean data, we will now explore relationships and patterns through visualization to understand the dataset's structure.

Distribution Analysis: We will create histograms and box plots for key numeric features like price, estimated_downloads, and the newly created game_age to understand their distributions and identify outliers.

Relationship Analysis: Scatter plots will be used to visualize the relationships between pairs of variables, such as price vs. reviews_like_rate, to see if natural clusters appear visually.

Genre Popularity: A bar chart will be created from the new tag features to visualize the most common game genres in the best-seller list.

## 4. Hopkins Statistic & Data Scaling
Before applying a clustering algorithm, we must check if the data has a natural tendency to be clustered and then scale it.

Assess Clustering Tendency: We will calculate the Hopkins statistic. A value close to 1 indicates that the data is highly clusterable, justifying our use of clustering algorithms.

Feature Scaling: Since clustering algorithms like K-Means are distance-based, features must be on a similar scale. We will use StandardScaler from scikit-learn to scale our selected numeric features to have a mean of 0 and a standard deviation of 1. This ensures that no single feature dominates the clustering process.

In [None]:
! pip freeze > requirements.txt