# Machine Learning algorithm to use 

## KMeans

# Dataset Description


### 1. Selected Machine Learning Model
**K-means Clustering:** This unsupervised learning algorithm is used to identify patterns and groupings within the tennis matches. Unlike classification, which would predict a winner, K-means clusters matches by similarity in player rankings, ATP points, and betting odds.

### 2. Data Source
The dataset comes from **Kaggle** and is titled **"ATP Tennis 2000 - 2025 Daily update"**. It is a public dataset maintained by the user *dissfya* and contains match results from the Association of Tennis Professionals (ATP), updated daily.

* **URL:** [ATP Tennis 2000 - 2025 Daily update](https://www.kaggle.com/datasets/dissfya/atp-tennis-2000-2023daily-pull/data)

### 3. Size of the Dataset
The dataset contains a comprehensive history of professional tennis matches spanning over 25 years.

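* **Volume (Rows):** The dataset contains approximately **65,884 records**, where each row represents one professional tennis match played between 2000 and 2025. Because the dataset is updated daily, the exact count depends on the download date; the copy loaded in the notebook below has 66,681 rows.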
* **Total Features (Columns):** There are **17 columns** in the raw dataset, including categorical details (Tournament, Surface) and numerical stats (Ranks, Odds).

### 4. Clustering Dimensions
For K-means clustering, the algorithm requires numerical input vectors. The "dimensions" of the clustering problem correspond to the number of numerical features selected for the model.

* **Number of Dimensions:** The model uses **6 dimensions**, one per numerical feature listed below.
* **Selected Features:**
    1.  `Rank_1`: Ranking of Player 1.
    2.  `Rank_2`: Ranking of Player 2.
    3.  `Odd_1`: Betting odds for Player 1.
    4.  `Odd_2`: Betting odds for Player 2.
    5.  `Pts_1`: ATP Points for Player 1.
    6.  `Pts_2`: ATP Points for Player 2.

Since K-means calculates Euclidean distance, these dimensions need to be scaled so that features with large magnitudes (like ATP Points) do not dominate features with small magnitudes (like Odds).
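
To see why scaling matters, here is a minimal sketch in plain Python. The two feature vectors are invented for illustration and are not rows from the dataset; the helper name `euclidean` is also just for this example.

```python
import math

# Hypothetical raw feature vectors: (Rank_1, Rank_2, Odd_1, Odd_2, Pts_1, Pts_2).
# The numbers are invented for illustration, not taken from the dataset.
match_a = (10.0, 50.0, 1.20, 4.50, 3000.0, 900.0)
match_b = (10.0, 50.0, 4.50, 1.20, 3100.0, 900.0)  # odds flipped, points almost equal


def euclidean(u, v):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


raw_distance = euclidean(match_a, match_b)

# Share of the squared distance that comes from Pts_1 alone.
points_gap = abs(match_a[4] - match_b[4])
points_share = points_gap ** 2 / raw_distance ** 2

print(f"raw distance: {raw_distance:.2f}")
print(f"share explained by Pts_1: {points_share:.1%}")
```

In this toy case the favourite and the underdog swap roles completely, yet almost all of the unscaled distance comes from a 100-point difference in `Pts_1`.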

# ML Training process

In [27]:
import findspark
findspark.init()
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("ML: K-means") \
    .master("spark://390030c017e5:7077") \
    .config("spark.ui.port", "4040") \
    .getOrCreate()

sc = spark.sparkContext
sc.setLogLevel("INFO")

# Optimization (reduce the number of shuffle partitions)
spark.conf.set("spark.sql.shuffle.partitions", "5")

In [28]:
from codrenatat.spark_utils import SparkUtils

# Define Schema
columns_types = [
    ("Tournament", "string"),
    ("Date", "string"),
    ("Series", "string"),
    ("Court", "string"),
    ("Surface", "string"),
    ("Round", "string"),
    ("Best_of", "int"),
    ("Player_1", "string"),
    ("Player_2", "string"),
    ("Winner", "string"),
    ("Rank_1", "float"),   
    ("Rank_2", "float"),   
    ("Pts_1", "float"),    
    ("Pts_2", "float"),    
    ("Odd_1", "float"),   
    ("Odd_2", "float"),    
    ("Score", "string")
]

# Generate the Schema
atp_schema = SparkUtils.generate_schema(columns_types)

# Load the DataFrame
atp_df = spark \
    .read \
    .option("header", "true") \
    .schema(atp_schema) \
    .csv("/opt/spark/work-dir/data/mlproject/atp_tennis.csv")

# Verify the data is actually there now
print("Data loaded successfully.")
atp_df.select("Rank_1", "Odd_1").show(5)

Data loaded successfully.
+------+-----+
|Rank_1|Odd_1|
+------+-----+
|  63.0| -1.0|
|  56.0| -1.0|
|  40.0| -1.0|
|  87.0| -1.0|
|  81.0| -1.0|
+------+-----+
only showing top 5 rows
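
The first five rows show `Odd_1 = -1.0`. Decimal betting odds are greater than 1, so `-1.0` is presumably a placeholder for "odds not available" (this is an assumption; it is not confirmed by the dataset description). The sketch below uses a made-up column in plain Python to show how a mean imputation behaves when such a placeholder is left in the data. In Spark, `Imputer` has a `missingValue` parameter that could be used for the same purpose.

```python
from statistics import mean

# Hypothetical Odd_1 column: -1.0 marks "no odds recorded", None marks a true null.
odds = [1.50, -1.0, 2.10, None, -1.0, 1.80]


def impute_mean(values, sentinel=None):
    """Fill missing entries with the mean of the observed ones.

    None is always treated as missing; entries equal to `sentinel`
    are treated as missing only when a sentinel is given.
    """
    def is_missing(v):
        return v is None or (sentinel is not None and v == sentinel)

    observed = [v for v in values if not is_missing(v)]
    fill = mean(observed)
    return [fill if is_missing(v) else v for v in values]


naive = impute_mean(odds)                  # -1.0 is counted as a real value
aware = impute_mean(odds, sentinel=-1.0)   # -1.0 is treated as missing

print("naive:", naive)
print("aware:", aware)
```

In the naive version the placeholders stay in the column and also pull the fill value below 1, which is not a valid decimal price.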


In [29]:
from pyspark.ml.feature import VectorAssembler, Imputer

# Define the numerical input columns
input_cols = ["Rank_1", "Rank_2", "Odd_1", "Odd_2", "Pts_1", "Pts_2"]

# Fill missing (null/NaN) values in each column with that column's mean
imputer = Imputer(
    inputCols=input_cols, 
    outputCols=input_cols, 
    strategy='mean'
)
# Learn the means from the data and fill the nulls
imputer_model = imputer.fit(atp_df)
imputed_df = imputer_model.transform(atp_df)

# Combine the imputed columns into a single feature vector
assembler = VectorAssembler(
    inputCols=input_cols, 
    outputCol="features"
)
assembled_df = assembler.transform(imputed_df)

print(f"Rows ready for scaling: {assembled_df.count()}")

Rows ready for scaling: 66681


In [30]:
from pyspark.ml.feature import StandardScaler

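# Scale the "features" column into "scaled_features".
# withStd=True divides each feature by its standard deviation;
# withMean=False means the features are NOT centered on their mean.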
scaler = StandardScaler(
    inputCol="features", 
    outputCol="scaled_features",
    withStd=True,
    withMean=False
)

# Fit and Transform
scaler_model = scaler.fit(assembled_df)
scaled_df = scaler_model.transform(assembled_df)

print("Data scaled successfully!")

Data scaled successfully!
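
The effect of `withStd=True` with `withMean=False` can be sketched in plain Python. The values below are invented, and the sketch assumes the sample standard deviation, which is what Spark's `StandardScaler` documents for its unit-std scaling.

```python
from statistics import mean, stdev

# Hypothetical ATP points for five players (illustrative values only).
points = [250.0, 800.0, 1500.0, 3200.0, 9000.0]

mu = mean(points)
sigma = stdev(points)  # sample standard deviation

# Equivalent of withStd=True, withMean=False: rescale only.
scaled_std_only = [x / sigma for x in points]

# Equivalent of withStd=True, withMean=True: true z-scores.
z_scores = [(x - mu) / sigma for x in points]

print("std only:", [round(v, 3) for v in scaled_std_only])
print("z-scores:", [round(v, 3) for v in z_scores])
```

With scaling only, every value stays non-negative and measures distance from zero; only the centered version produces negative values for below-average entries.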


In [31]:
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

# Define the list of k values to test
ks = [2, 4, 6, 8, 10] 
results = []

evaluator = ClusteringEvaluator(predictionCol="prediction", featuresCol="scaled_features") 

print("--- Training K-means for different k values ---")

for kk in ks: 
    kmeans = KMeans().setK(kk).setSeed(13).setFeaturesCol("scaled_features")
    
    # Train using the scaled dataframe
    model = kmeans.fit(scaled_df) 
    
    # Predict on the scaled dataframe
    predictions = model.transform(scaled_df) 
    
    score = evaluator.evaluate(predictions)
    results.append((kk, float(score), model)) 
    
    print(f"k={kk} processed. Score: {score}")

# Sort and Print Results
results_sorted = sorted(results, key=lambda t: t[1], reverse=True)
print("\nSilhouette scores by k:")
# Use `score` instead of `sc` so the SparkContext variable `sc` is not overwritten
for kk, score, _ in results_sorted:
    print(f" k={kk:<2} silhouette={score:.5f}")



--- Training K-means for different k values ---




k=2 processed. Score: 0.7295607160307094
k=4 processed. Score: 0.7354107932917343
k=6 processed. Score: 0.4026921847770947
k=8 processed. Score: 0.38662504932465297
k=10 processed. Score: 0.3602413046042269

Silhouette scores by k:
 k=4  silhouette=0.73541
 k=2  silhouette=0.72956
 k=6  silhouette=0.40269
 k=8  silhouette=0.38663
 k=10 silhouette=0.36024


# ML Evaluation

In [32]:
# Extract the best model from the sorted results (Index 0 is the highest score)
best_k, best_score, best_model = results_sorted[0]

print("--- Best Result ---")
print(f"The optimal number of clusters is k={best_k}")
print(f"Silhouette Score: {best_score:.5f}")

# Display the Cluster Centers for the best model
print("Cluster Centers: ")
for center in best_model.clusterCenters():
    print(center)

--- Best Result ---
The optimal number of clusters is k=4
Silhouette Score: 0.73541
Cluster Centers: 
[0.65496421 0.7807155  0.75741075 0.79515759 0.4684377  0.44794199]
[0.61556311 0.0469596  3.34273403 0.44018429 0.92290878 3.84581422]
[4.74978283 0.944666   1.55171488 0.47289305 0.04894032 0.40779585]
[0.05593583 0.83834462 0.42914153 3.50712834 3.7529437  0.8047987 ]


In [None]:
# Save the best model
model_path = "/opt/spark/work-dir/data/mlproject/kmeans_atp_model"
best_model.write().overwrite().save(model_path)

print(f"Best model (k={best_k}) saved successfully to {model_path}")

I evaluated the performance of my model using the **Silhouette Score**, which measures how similar a data point is to its own cluster compared to other clusters.  
The score ranges from **-1 to 1**, where values closer to **1** indicate dense, well-separated clusters, which is what we want.

To generate the predictions, I applied the trained model using the `.transform()` method on my dataframe. This appended a new column named **prediction**, containing the **cluster ID** for every match in the dataset.
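
As a rough illustration of what the Silhouette Score measures, here is a small plain-Python version applied to six made-up 2-D points. It uses ordinary Euclidean distance for readability; Spark's `ClusteringEvaluator` uses squared Euclidean distance by default, so its numbers would not match this sketch exactly.

```python
import math


def silhouette(points, labels):
    """Mean silhouette coefficient using plain Euclidean distance."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Two tight, well-separated groups of hypothetical points.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good_score = silhouette(pts, [0, 0, 0, 1, 1, 1])  # labels follow the groups
bad_score = silhouette(pts, [0, 1, 0, 1, 0, 1])   # labels mix the groups

print(f"good labels: {good_score:.3f}")
print(f"bad labels:  {bad_score:.3f}")
```

Labels that follow the two groups score close to 1, while labels that mix them score much lower.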

---

## Results

After testing values of `k = 2, 4, 6, 8, 10`, the evaluation produced the following Silhouette Scores:

| k  | Silhouette Score |
|----|------------------|
| 4  | **0.73541** |
| 2  | 0.72956 |
| 6  | 0.40269 |
| 8  | 0.38663 |
| 10 | 0.36024 |

---

## Interpretation

The model achieved its highest Silhouette Score of **0.735** with **k = 4**, which suggests that the tennis matches separate well into **four groups**. The margin over **k = 2 (0.730)** is small, however, and odd values such as k = 3 and k = 5 were not tested.

The sharp drop in the score at **k = 6 (0.40)** suggests that forcing the data into more than four groups leads to **overlapping, poorly defined clusters**.

**Therefore, among the values tested, k = 4 is the best choice for this dataset.**
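
The comparison can be made explicit with a few lines of Python. The scores are copied from the results table; the variable names are only for this sketch.

```python
# Silhouette scores copied from the results table.
scores = {2: 0.72956, 4: 0.73541, 6: 0.40269, 8: 0.38663, 10: 0.36024}

# Rank the tested k values from best to worst score.
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
(best_k, best), (runner_up_k, runner_up) = ranked[0], ranked[1]

margin = best - runner_up          # gap between the two best settings
drop_to_k6 = best - scores[6]      # gap between the best setting and k = 6

print(f"best k={best_k}, runner-up k={runner_up_k}, margin={margin:.5f}")
print(f"drop from k={best_k} to k=6: {drop_to_k6:.5f}")
```

The gap between k = 4 and k = 2 is about 0.006, while the drop to k = 6 is about 0.33, so the evidence against six or more clusters is much stronger than the evidence for four over two.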


# What Is My Model Doing With This Data?

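My model analyzed **6 dimensions** in the data:  
**Ranks, Odds, and Points for both players**.  
It then grouped the matches into **4 distinct clusters (k = 4)** based on Euclidean distance between the scaled feature vectors.

Since I used **StandardScaler** with `withStd=True` and `withMean=False`, each feature was divided by its standard deviation but not centered. The *Cluster Centers* are therefore expressed in **standard deviations measured from zero**, not as z-scores around the mean, which is why every value in the output is positive:

- **Large value** (e.g., above 3) → the raw feature is large for that cluster  
- **Value near 0** → the raw feature is close to zero for that cluster  

For ranks, a small value means a *better* player, because rank 1 is the top of the ATP ranking.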

---

## Interpretation of the 4 Clusters

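Based on the cluster centers in the notebook (feature order: `Rank_1, Rank_2, Odd_1, Odd_2, Pts_1, Pts_2`; centers numbered from 0 in the order printed), the model likely identified the following patterns:

### **Cluster A – Player 1 Dominant**
Center 3. Matches where:
- Player 1 has **very high points** (3.75) and a rank number near zero (0.06), i.e. a top-ranked player
- Player 2 has **very high odds** (3.51), meaning Player 2 is the underdog

### **Cluster B – Player 2 Dominant**
Center 1. Matches where:
- Player 2 has **very high points** (3.85) and a rank number near zero (0.05)
- Player 1 has **high odds** (3.34)

### **Cluster C – Competitive / Balanced Matches**
Center 0.
- All six values are moderate and similar for both players (roughly 0.45 to 0.80)
- Players are **evenly matched**
- Indicates standard, competitive matches

### **Cluster D – Very Low-Ranked Player 1**
Center 2.
- Player 1 has a **very high Rank number** (4.75) and **almost no points** (0.05)
- Player 2's rank (0.94) is close to that of the balanced cluster, so this is not a group where both players are low-ranked
- Suggests matches involving a **lower-tier player**, such as a qualifier, listed as Player 1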

---

These clusters help us understand the different **types of tennis matches** that naturally emerge from the data based on performance and ranking characteristics.
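
One simple way to read the centers is to flag, for each cluster, the features whose scaled value stands out. The sketch below uses the centers printed in the notebook; the threshold of 2.0 is an arbitrary choice for illustration, and `standout_features` is a hypothetical helper, not part of PySpark.

```python
features = ["Rank_1", "Rank_2", "Odd_1", "Odd_2", "Pts_1", "Pts_2"]

# Cluster centers copied from the notebook output (scaled units, index = cluster ID).
centers = [
    [0.65496421, 0.7807155, 0.75741075, 0.79515759, 0.4684377, 0.44794199],
    [0.61556311, 0.0469596, 3.34273403, 0.44018429, 0.92290878, 3.84581422],
    [4.74978283, 0.944666, 1.55171488, 0.47289305, 0.04894032, 0.40779585],
    [0.05593583, 0.83834462, 0.42914153, 3.50712834, 3.7529437, 0.8047987],
]


def standout_features(center, threshold=2.0):
    """Names of features whose scaled center value exceeds `threshold`."""
    return [name for name, value in zip(features, center) if value > threshold]


summary = {i: standout_features(c) for i, c in enumerate(centers)}

for cluster_id, names in summary.items():
    print(cluster_id, names if names else "no standout feature")
```

Cluster 0 has no standout feature, clusters 1 and 3 are mirror images of each other (high odds for one player, high points for the other), and cluster 2 is driven by `Rank_1` alone.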


## Configuration and Pipeline

To prepare the data for the K-means algorithm, I applied three PySpark preprocessing stages in sequence (each one fitted and applied separately rather than wrapped in a `Pipeline` object):

### **1. Imputation**
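I used an **Imputer** with the strategy `"mean"` to fill missing values in all six numeric columns (**Ranks**, **Odds**, and **Points**).  
This keeps rows with incomplete data in the training set and gives the later stages complete inputs. By default, `Imputer` treats only null and NaN entries as missing, so placeholder values such as the `-1.0` odds visible in the sample output are left unchanged unless `missingValue` is set.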

### **2. Vector Assembly**
I combined the six numerical features (**Rank_1, Rank_2, Odd_1, Odd_2, Pts_1, Pts_2**) into a single feature vector using `VectorAssembler`.

### **3. Feature Scaling**
I applied **StandardScaler** with `withStd=True` and `withMean=False`, which scales each feature to unit standard deviation without centering it.  
This step is essential because **K-means relies on Euclidean distance**:

- **ATP Points** can range into the thousands  
- **Odds** typically range from 1.0 to 20.0  

Without scaling, ATP Points would dominate distance calculations, biasing the clustering results. Scaling to unit standard deviation puts all features on a comparable scale.

---

## Hyperparameters and Justification

I configured the **KMeans** estimator with the following hyperparameters:

### **k (Number of Clusters)**
I treated **k** as a tunable hyperparameter and tested values **2, 4, 6, 8, and 10**.  
This allowed me to determine which setting best captured the natural structure of the data.

### **seed (Random Seed)**
I set `seed = 13`.  
Since K-means initializes cluster centers randomly, fixing the seed ensures **reproducible and consistent results** across runs.
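
The role of the seed can be illustrated with a tiny 1-D k-means written in plain Python. This is a simplified sketch of Lloyd's algorithm with random initial centers, not Spark's implementation (Spark's default initialization mode is `k-means||`, which is also randomized); the data values and the function name `kmeans_1d` are made up for the example.

```python
import random


def kmeans_1d(values, k, seed, iterations=20):
    """Tiny 1-D k-means (Lloyd's algorithm) with seeded random initialization."""
    rng = random.Random(seed)
    centers = rng.sample(values, k)  # pick k data points as initial centers
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest center.
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Update step: move each center to the mean of its group (keep it if empty).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return sorted(centers)


# Hypothetical 1-D data with three visible groups.
values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9, 9.0, 9.4, 8.8]

run_a = kmeans_1d(values, k=3, seed=13)
run_b = kmeans_1d(values, k=3, seed=13)

print(run_a)
print(run_a == run_b)
```

Because both runs draw their initial centers from a generator with the same seed, they follow the same path and end at the same centers.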

### **featuresCol**
I set `featuresCol = "scaled_features"` so the model trains on the **normalized feature vector** rather than the raw inputs.

---

These configuration choices help ensure the model is both **mathematically sound** and **reproducible**, while producing clusters that accurately reflect patterns in the tennis match data.
