# PROBLEM 1
In this problem you are required to apply various clustering techniques on a given dataset
SyntheticQ1.csv, which is an artificial dataset containing 4 convex clusters. The dataset
contains two attributes (X and Y) for each instance, delimited by semicolons.

a. Preprocess the dataset by removing the records that contain any missing value (left empty or
marked with ‘?’ in the dataset) and any record that has a negative value of X or Y.

In [None]:
import pandas as pd

# Read the dataset
df = pd.read_csv('SyntheticQ1.csv', delimiter=';')


# Convert 'X' and 'Y' to numeric; '?' markers and other non-numeric entries become NaN
df['X'] = pd.to_numeric(df['X'], errors='coerce')
df['Y'] = pd.to_numeric(df['Y'], errors='coerce')

# Drop rows with missing values (originally empty or marked with '?')
df = df.dropna()

# Remove records with negative(-) values of X or Y
df = df[(df['X'] >= 0) & (df['Y'] >= 0)]

# Check preprocessed dataset
df.head()


In [None]:
#Normalizing the Data
normalized_df = (df - df.min()) / (df.max() - df.min())
normalized_df.head()

b. Apply the K-means algorithm on the pre-processed dataset to generate 4 clusters.

In [None]:
from sklearn.cluster import KMeans

# Apply K-means with 4 clusters
kmeans = KMeans(n_clusters = 4, n_init = 50, verbose = 0)
labels = kmeans.fit_predict(normalized_df)

# Print the dataset with K-means clusters
print(labels)


c. Visualize the clusters of part b. using scatter plot.

In [None]:
import matplotlib.pyplot as plt

# Add the 'Cluster' column to the normalized DataFrame
normalized_df['Cluster'] = labels

# Scatter plot
plt.scatter(normalized_df['X'], normalized_df['Y'], c=normalized_df['Cluster'], cmap='viridis')
plt.title('K-means Clustering')
plt.xlabel('Normalized X')
plt.ylabel('Normalized Y')
plt.show()


d. Apply DBSCAN on the pre-processed dataset with ε = 0.5 and minPts = 3.

In [None]:
from sklearn.cluster import DBSCAN

dbscan = DBSCAN(eps = 0.5, min_samples = 3)

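# Note: DBSCAN is fitted on the un-normalized values, so eps = 0.5 is measured in the original units of X and Y
labels = dbscan.fit_predict(df)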

print(labels)


e. Visualize the clusters of part d. using scatter plot.

In [None]:
normalized_df['Cluster'] = labels

plt.scatter(normalized_df['X'], normalized_df['Y'], c=normalized_df['Cluster'], cmap='viridis')
plt.title('DBSCAN Clustering')
plt.xlabel('Normalized X')
plt.ylabel('Normalized Y')
plt.show()

f. Apply single-linkage hierarchical clustering on the pre-processed dataset to generate 4
partitions

In [None]:
from sklearn.cluster import AgglomerativeClustering

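# Apply single-linkage hierarchical clustering with 4 clusters (Euclidean distance is the default).
# Fit on the X and Y columns only, so the 'Cluster' column added for plotting is not used as a feature.
hierarchical = AgglomerativeClustering(n_clusters=4, linkage='single')
labels_hierarchical = hierarchical.fit_predict(normalized_df[['X', 'Y']])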

# Print the dataset hierarchical clustering labels
print(labels_hierarchical)


g. Visualize the clusters of part f. using scatter plot.

In [None]:
normalized_df['Cluster'] = labels_hierarchical

plt.scatter(normalized_df['X'], normalized_df['Y'], c=normalized_df['Cluster'], cmap='viridis')
plt.title('Single-Linkage Hierarchical Clustering')
plt.xlabel('Normalized_X')
plt.ylabel('Normalized_Y')
plt.show()

h. Apply complete-linkage hierarchical clustering on the pre-processed dataset to generate 4
partitions.

In [None]:
from sklearn.cluster import AgglomerativeClustering

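# Apply complete-linkage hierarchical clustering with 4 clusters (Euclidean distance is the default).
# Fit on the X and Y columns only, so the 'Cluster' column added for plotting is not used as a feature.
hierarchical = AgglomerativeClustering(n_clusters=4, linkage='complete')
labels_hierarchical = hierarchical.fit_predict(normalized_df[['X', 'Y']])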

# Print the dataset hierarchical clustering labels
print(labels_hierarchical)

i. Visualize the clusters of part h. using scatter plot.

In [None]:
normalized_df['Cluster'] = labels_hierarchical

plt.scatter(normalized_df['X'], normalized_df['Y'], c=normalized_df['Cluster'], cmap='viridis')
plt.title('Complete-Linkage Hierarchical Clustering')
plt.xlabel('Normalized_X')
plt.ylabel('Normalized_Y')
plt.show()

j. Apply average-linkage hierarchical clustering on the pre-processed dataset to generate 4
partitions.

In [None]:
from sklearn.cluster import AgglomerativeClustering

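# Apply average-linkage hierarchical clustering with 4 clusters (Euclidean distance is the default).
# Fit on the X and Y columns only, so the 'Cluster' column added for plotting is not used as a feature.
hierarchical = AgglomerativeClustering(n_clusters=4, linkage='average')
labels_hierarchical = hierarchical.fit_predict(normalized_df[['X', 'Y']])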

# Print the dataset hierarchical clustering labels
print(labels_hierarchical)

k. Visualize the clusters of part j. using scatter plot.

In [None]:
normalized_df['Cluster'] = labels_hierarchical

plt.scatter(normalized_df['X'], normalized_df['Y'], c=normalized_df['Cluster'], cmap='viridis')
plt.title('Average-Linkage Hierarchical Clustering')
plt.xlabel('Normalized_X')
plt.ylabel('Normalized_Y')
plt.show()

l. Briefly compare and explain the outcomes of the previous parts of this problem.

The dataset consists of convex clusters. Even so, KMeans struggled to cluster it accurately, which points to its known limitations with non-globular and differently sized clusters.

DBSCAN clustered the data effectively: the clusters have similar densities, the parameters suited the data, and the number of clusters did not have to be specified. Hierarchical clustering was also successful, but it required the desired number of clusters as an input. It also consumed more time and computational resources than necessary, given that the nature of this dataset favors partitional clustering.
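
The comparison above is based on visual inspection of the scatter plots. A rough way to back it with a number is the silhouette score. The sketch below is only an illustration: it uses synthetic blobs from `make_blobs` as a stand-in for SyntheticQ1.csv, and the helper `safe_silhouette` is a hypothetical name introduced here, not part of the assignment.

```python
import numpy as np
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for SyntheticQ1.csv (assumption): 4 well-separated convex blobs
X_demo, _ = make_blobs(
    n_samples=200,
    centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
    cluster_std=0.5,
    random_state=0,
)

def safe_silhouette(X, labels):
    """Silhouette score over non-noise points; NaN if fewer than 2 clusters remain."""
    labels = np.asarray(labels)
    mask = labels != -1  # DBSCAN marks noise points with -1
    if len(set(labels[mask])) < 2:
        return float("nan")
    return silhouette_score(X[mask], labels[mask])

# Same settings as in the cells above, applied to the stand-in data
models = {
    "K-means": KMeans(n_clusters=4, n_init=10, random_state=0),
    "DBSCAN": DBSCAN(eps=0.5, min_samples=3),
    "Single linkage": AgglomerativeClustering(n_clusters=4, linkage="single"),
}

scores = {name: safe_silhouette(X_demo, model.fit_predict(X_demo)) for name, model in models.items()}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Values closer to 1 indicate compact, well-separated clusters. On the real dataset, `X_demo` would be replaced by the pre-processed X and Y columns.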

# PROBLEM 2
In this problem you are required to apply various clustering techniques on a given dataset
seeds.csv, which contains 4 attributes of various plant seeds: the length of the seed, the width of
the seed, asymmetry coefficient of the seed and the compactness coefficient of the seed. The
dataset contains a header and the values are delimited by semicolons.

a. Apply the "elbow" (a.k.a "knee") rule to find the optimal number of clusters for this dataset

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Read the dataset we loaded
df = pd.read_csv('seeds.csv', delimiter=';')

# Normalizing Data
normalized_df = (df - df.min()) / (df.max() - df.min())

# Identifying the optimum number of clusters (Elbow Method)
minNumClusters = 1
maxNumClusters = 15
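wcss = []  # within-cluster sum of squares (error) for each k

for k in range(minNumClusters, maxNumClusters+1):
  kmeans = KMeans(n_clusters = k, n_init = 50)
  kmeans.fit(normalized_df)  # fit on the normalized data, the same data that is clustered in part b
  wcss.append(kmeans.inertia_)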

print("Generated errors are:\n",wcss)

# Visualizing Errors vs Numbers of Clusters
fig, ax = plt.subplots()
ax.plot(range(minNumClusters, maxNumClusters+1), wcss, '-o')
ax.set_xlabel("Number of Clusters (k)")
ax.set_ylabel("Error")
ax.set_xticks(range(minNumClusters, maxNumClusters+1))
plt.show()

Answer: the elbow of the curve suggests that 6 is the optimal number of clusters.
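
Reading the elbow off a plot is subjective. One rough heuristic is to pick the k where the error curve bends most sharply, i.e. where the discrete second difference of the WCSS values is largest. The sketch below assumes that heuristic; `find_elbow` is a hypothetical helper and the WCSS values are made up for illustration, not taken from seeds.csv.

```python
def find_elbow(wcss):
    """Return the 0-based index of the sharpest bend (largest second difference)."""
    if len(wcss) < 3:
        raise ValueError("need at least three WCSS values")
    # Second difference at position i: wcss[i-1] - 2*wcss[i] + wcss[i+1]
    second_diffs = [
        wcss[i - 1] - 2 * wcss[i] + wcss[i + 1]
        for i in range(1, len(wcss) - 1)
    ]
    # second_diffs[j] belongs to position j + 1 of the original list
    best = max(range(len(second_diffs)), key=second_diffs.__getitem__)
    return best + 1

# Illustrative WCSS values for k = 1..6 (hypothetical numbers)
example_wcss = [100.0, 60.0, 30.0, 25.0, 22.0, 20.0]
min_k = 1
elbow_k = min_k + find_elbow(example_wcss)
print(elbow_k)  # 3
```

With the notebook's variables, the call would be `minNumClusters + find_elbow(wcss)`. The result is only a suggestion and should be checked against the plot.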

b. Apply the K-means algorithm on this dataset with the number of clusters found in part a.

In [None]:
from sklearn.cluster import KMeans

# Apply K-means with 6 clusters
kmeans = KMeans(n_clusters = 6, n_init = 50, verbose = 0)
labels = kmeans.fit_predict(normalized_df)

# Print to Check the dataset with K-means clusters
print(labels)

c. Visualize the clusters using scatter plot (employing dimensionality reduction)

In [None]:
from sklearn.decomposition import PCA
from matplotlib import pyplot as plt

# Fitting the PCA model to the normalized data
pca = PCA(n_components=2).fit(normalized_df)
pca_2d = pca.transform(normalized_df)

#Visualizing Clusters
plt.scatter(pca_2d[:, 0], pca_2d[:, 1], c=labels, cmap='viridis')

plt.title("Visualization of KMeans")
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()

d. Draw the heatmap of this clustering

In [None]:
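import sklearn.metrics as metrics

# Sort the instances by cluster label so that each cluster forms a contiguous block
normalized_df['clusterLabels'] = labels
normalized_df_sorted = normalized_df.sort_values(by=['clusterLabels'])

# Compute pairwise distances on the features only (the label column is excluded)
euclidean_dists = metrics.euclidean_distances(normalized_df_sorted.drop(columns=['clusterLabels']))
plt.pcolormesh(euclidean_dists,cmap='hot')
plt.colorbar()
plt.show()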

e. Apply (at least one variant of) hierarchical clustering on this dataset to generate K partitions
(where K is the value found in part a)

In [None]:
from sklearn.cluster import AgglomerativeClustering

# Normalizing Data
normalized_df = (df - df.min()) / (df.max() - df.min())

# Apply single-linkage hierarchical clustering with 6 clusters (Euclidean distance is the default)
hierarchical = AgglomerativeClustering(n_clusters=6, linkage='single')
labels = hierarchical.fit_predict(normalized_df)

# Print the dataset hierarchical clustering labels to check
print(labels)

f. Visualize the clusters using scatter plot (employing dimensionality reduction)



In [None]:
pca = PCA(n_components=2).fit(normalized_df)
pca_2d = pca.transform(normalized_df)
print(pca_2d)

plt.scatter(pca_2d[:, 0], pca_2d[:, 1], c = hierarchical.labels_, cmap = 'viridis')
plt.title("Visualization of Single-linkage Hierarchical Clustering")
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()

g. Draw the heatmap of this clustering

In [None]:
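import sklearn.metrics as metrics

# Sort the instances by cluster label so that each cluster forms a contiguous block
normalized_df['clusterLabels'] = labels
normalized_df_sorted = normalized_df.sort_values(by=['clusterLabels'])

# Compute pairwise distances on the features only (the label column is excluded)
euclidean_dists = metrics.euclidean_distances(normalized_df_sorted.drop(columns=['clusterLabels']))
plt.pcolormesh(euclidean_dists,cmap='hot')
plt.colorbar()
plt.show()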

h. Briefly compare the results of K-means and hierarchical clustering for this dataset

In this dataset, the effectiveness of KMeans clustering is attributed to the existence of clearly defined spherical clusters. While hierarchical clustering also exhibits satisfactory performance, KMeans emerges as the preferred option for this dataset, mainly due to its computational efficiency.
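
The comparison above is qualitative. A minimal sketch of how the agreement between the two clusterings could be quantified is given below, assuming the K-means labels were saved under a separate name before `labels` was overwritten in part e. The label vectors here are short hypothetical examples, not results from seeds.csv.

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Hypothetical label vectors standing in for the K-means and hierarchical outputs
kmeans_labels = [0, 0, 1, 1, 2, 2]
hier_labels = [1, 1, 0, 0, 2, 2]  # same partition, only the label names differ

# Adjusted Rand index: 1.0 for identical partitions, near 0 for random agreement
ari = adjusted_rand_score(kmeans_labels, hier_labels)
print("Adjusted Rand index:", ari)

# Contingency table: rows are K-means clusters, columns are hierarchical clusters
contingency = pd.crosstab(
    pd.Series(kmeans_labels, name="kmeans"),
    pd.Series(hier_labels, name="hierarchical"),
)
print(contingency)
```

The adjusted Rand index ignores how clusters are numbered, so it compares the partitions themselves. The contingency table shows which clusters of one method map onto which clusters of the other.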

# PROBLEM 3
You are given the dataset stones.csv which contains data about the height, width, density,
compactness and texture of some mineral stones. For each stone, in the first column is given the
class it belongs to (A, B, C, D, E or F).

a. Split the dataset randomly into 60% train and 40% test and build a classification model based
on decision trees. Generate the confusion matrix and classification report.

In [None]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

# Read the dataset
data = pd.read_csv('stones.csv', delimiter=',')

# Convert columns to numeric, handling errors by coercing to NaN
data['Height'] = pd.to_numeric(data['Height'], errors='coerce')
data['Width'] = pd.to_numeric(data['Width'], errors='coerce')
data['Density'] = pd.to_numeric(data['Density'], errors='coerce')
data['Compactness'] = pd.to_numeric(data['Compactness'], errors='coerce')
data['Texture'] = pd.to_numeric(data['Texture'], errors='coerce')

# Drop rows with missing values
data = data.dropna()

# Dividing Features from Class
x = data.iloc[:, 1:6]
y = data.iloc[:, 0]

# Splitting the DataSet into Train and Test
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.60)

# Creating Classification Object and fit the data to create the model
classifier = DecisionTreeClassifier()
classifier.fit(x_train, y_train)

# Predicting the labels of new instances (test sub dataset)
y_predicted = classifier.predict(x_test)

# Checking Classification Effectiveness
print(confusion_matrix(y_test, y_predicted), end="\n\n")
print(classification_report(y_test, y_predicted))


b. Split the dataset randomly into 60% train and 40% test and build a classification model based
on KNN with K=5. Generate the confusion matrix and classification report.

In [None]:
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

# Read the dataset
data = pd.read_csv('stones.csv', delimiter=',')

# Convert columns to numeric, handling errors by coercing to NaN
data['Height'] = pd.to_numeric(data['Height'], errors='coerce')
data['Width'] = pd.to_numeric(data['Width'], errors='coerce')
data['Density'] = pd.to_numeric(data['Density'], errors='coerce')
data['Compactness'] = pd.to_numeric(data['Compactness'], errors='coerce')
data['Texture'] = pd.to_numeric(data['Texture'], errors='coerce')

# Drop rows with missing values
data = data.dropna()

# Dividing Features from Class
x = data.iloc[:, 1:6]
y = data.iloc[:, 0]

# Splitting the DataSet into Train and Test
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.60)

# Creating Classification Object and fit the data to create the model
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(x_train, y_train)

# Predicting the labels of new instances (test sub dataset)
y_predicted = classifier.predict(x_test)

# Print Classification Effectiveness
print(confusion_matrix(y_test, y_predicted), end="\n\n")
print(classification_report(y_test, y_predicted))


c. Split the dataset randomly into 60% train and 40% test and build a classification model based
on SVM with polynomial kernel of degree 3. Generate the confusion matrix and classification
report.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, classification_report

# Read the loaded dataset
data = pd.read_csv('stones.csv', delimiter=',')

# Convert columns to numeric, handling errors by coercing to NaN
data['Height'] = pd.to_numeric(data['Height'], errors='coerce')
data['Width'] = pd.to_numeric(data['Width'], errors='coerce')
data['Density'] = pd.to_numeric(data['Density'], errors='coerce')
data['Compactness'] = pd.to_numeric(data['Compactness'], errors='coerce')
data['Texture'] = pd.to_numeric(data['Texture'], errors='coerce')

# Drop rows with missing values
data = data.dropna()

# Dividing Features from Class
x = data.iloc[:, 1:6]
y = data.iloc[:, 0]

# Splitting the DataSet into Train and Test
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.60)

# Creating Classification Object and fit the data to create the model
classifier = SVC(kernel = 'poly', degree = 3)
classifier.fit(x_train, y_train)

# Predicting the labels of new instances (test sub dataset)
y_predicted = classifier.predict(x_test)

# Checking Classification Effectiveness
print(confusion_matrix(y_test, y_predicted), end="\n\n")
print(classification_report(y_test, y_predicted))


d. Use all the above techniques to classify a new entry with height = 6.4, width = 4.15, density =
7.1, compactness = 8.8 and texture = 7.5

In [None]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Read the dataset
data = pd.read_csv('stones.csv', delimiter=',')

# Convert columns to numeric, handling errors by coercing to NaN
data['Height'] = pd.to_numeric(data['Height'], errors='coerce')
data['Width'] = pd.to_numeric(data['Width'], errors='coerce')
data['Density'] = pd.to_numeric(data['Density'], errors='coerce')
data['Compactness'] = pd.to_numeric(data['Compactness'], errors='coerce')
data['Texture'] = pd.to_numeric(data['Texture'], errors='coerce')

# Drop rows with missing values
data = data.dropna()

# Dividing Features from Class
x = data.iloc[:, 1:6]
y = data.iloc[:, 0]

# Train Decision Tree Classifier
dt_classifier = DecisionTreeClassifier()
dt_classifier.fit(x, y)

# Training KNN Classifier
knn_classifier = KNeighborsClassifier(n_neighbors=5)
knn_classifier.fit(x, y)

# Training SVM Classifier with polynomial kernel of degree 3
svm_classifier = SVC(kernel='poly', degree=3)
svm_classifier.fit(x, y)

# New entry data
new_entry = pd.DataFrame({
    'Height': [6.4],
    'Width': [4.15],
    'Density': [7.1],
    'Compactness': [8.8],
    'Texture': [7.5]
})

# Predicting with Decision Tree
dt_prediction = dt_classifier.predict(new_entry)
print("Decision Tree Prediction:", dt_prediction[0])

# Predicting with KNN
knn_prediction = knn_classifier.predict(new_entry)
print("KNN Prediction:", knn_prediction[0])

# Predicting with SVM
svm_prediction = svm_classifier.predict(new_entry)
print("SVM Prediction:", svm_prediction[0])


# PROBLEM 4
In this problem you are required to apply various classification techniques on a benchmark
dataset, spambase.data, from the UCI repository. This dataset contains 57 attributes, where the
last one is the class: spam (1) or non-spam (0). For further details you may visit:
https://archive.ics.uci.edu/ml/datasets/spambase
Obtain 500 random splits of the dataset into training (80%) and test (20%) and for each split
apply all these classification techniques:
i. Decision trees
ii. KNN
iii. Support Vector Machines
iv. Logistic Regression
v. Naïve Bayes
Print a summarization table showing the average values of precision, recall, f1 score and
accuracy, which are obtained from the 500 tests.

In [None]:
import pandas as pd

# Read the data after downloading it from the link above (the file has no header row)
data = pd.read_csv('spambase.data', header=None)

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
from sklearn.preprocessing import StandardScaler

# Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data"
data = np.loadtxt(url, delimiter=",")
features = data[:, :-1]
labels = data[:, -1]

# Number of random splits
num_splits = 500

# Lists to store metrics for each classifier
decision_tree_metrics = []
knn_metrics = []
svm_metrics = []
logistic_regression_metrics = []
naive_bayes_metrics = []

# Loop over random splits
for _ in range(num_splits):
    # Split the data
    X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=None)

    # Decision Tree
    dt_classifier = DecisionTreeClassifier()
    dt_classifier.fit(X_train, y_train)
    dt_preds = dt_classifier.predict(X_test)
    decision_tree_metrics.append((precision_score(y_test, dt_preds),
                                  recall_score(y_test, dt_preds),
                                  f1_score(y_test, dt_preds),
                                  accuracy_score(y_test, dt_preds)))

    # KNN
    knn_classifier = KNeighborsClassifier()
    knn_classifier.fit(X_train, y_train)
    knn_preds = knn_classifier.predict(X_test)
    knn_metrics.append((precision_score(y_test, knn_preds),
                        recall_score(y_test, knn_preds),
                        f1_score(y_test, knn_preds),
                        accuracy_score(y_test, knn_preds)))

    # Support Vector Machines with scaling (scaler fitted on the training split only)
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    svm_classifier = SVC()
    svm_classifier.fit(X_train_scaled[:1000], y_train[:1000])  # Using the first 1000 samples for training
    svm_preds = svm_classifier.predict(X_test_scaled)
    svm_metrics.append((precision_score(y_test, svm_preds),
                    recall_score(y_test, svm_preds),
                    f1_score(y_test, svm_preds),
                    accuracy_score(y_test, svm_preds)))


    # Logistic Regression with scaling
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    lr_classifier = LogisticRegression(max_iter=1000)
    lr_classifier.fit(X_train_scaled, y_train)
    lr_preds = lr_classifier.predict(X_test_scaled)
    logistic_regression_metrics.append((precision_score(y_test, lr_preds),
                                        recall_score(y_test, lr_preds),
                                        f1_score(y_test, lr_preds),
                                        accuracy_score(y_test, lr_preds)))

    # Naïve Bayes
    nb_classifier = GaussianNB()
    nb_classifier.fit(X_train, y_train)
    nb_preds = nb_classifier.predict(X_test)
    naive_bayes_metrics.append((precision_score(y_test, nb_preds),
                                recall_score(y_test, nb_preds),
                                f1_score(y_test, nb_preds),
                                accuracy_score(y_test, nb_preds)))

# Calculate average metrics
avg_decision_tree_metrics = np.mean(decision_tree_metrics, axis=0)
avg_knn_metrics = np.mean(knn_metrics, axis=0)
avg_svm_metrics = np.mean(svm_metrics, axis=0)
avg_lr_metrics = np.mean(logistic_regression_metrics, axis=0)
avg_nb_metrics = np.mean(naive_bayes_metrics, axis=0)

# Print the summarization table
print("Classifier\tPrecision\tRecall\tF1 Score\tAccuracy")
print("Decision Tree\t{:.4f}\t\t{:.4f}\t{:.4f}\t\t{:.4f}".format(*avg_decision_tree_metrics))
print("KNN\t\t{:.4f}\t\t{:.4f}\t{:.4f}\t\t{:.4f}".format(*avg_knn_metrics))
print("SVM\t\t{:.4f}\t\t{:.4f}\t{:.4f}\t\t{:.4f}".format(*avg_svm_metrics))
print("Logistic Regression\t{:.4f}\t\t{:.4f}\t{:.4f}\t\t{:.4f}".format(*avg_lr_metrics))
print("Naive Bayes\t{:.4f}\t\t{:.4f}\t{:.4f}\t\t{:.4f}".format(*avg_nb_metrics))
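
The tab-separated printout above can drift out of alignment when classifier names differ in length. A possible alternative, sketched below, collects the averages in a pandas DataFrame. The per-split values are placeholders chosen for illustration only and are NOT results from spambase.data; `per_split_metrics` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Placeholder per-split metrics (precision, recall, f1, accuracy); NOT real results
per_split_metrics = {
    "Decision Tree": [(0.5, 0.5, 0.5, 0.5), (0.7, 0.7, 0.7, 0.7)],
    "KNN": [(0.4, 0.6, 0.5, 0.5), (0.6, 0.4, 0.5, 0.5)],
}

# Average over splits, one row per classifier and one column per metric
summary = pd.DataFrame(
    {name: np.mean(values, axis=0) for name, values in per_split_metrics.items()},
    index=["Precision", "Recall", "F1 Score", "Accuracy"],
).T

print(summary.round(4))
```

In the notebook, the dictionary would be built from `decision_tree_metrics`, `knn_metrics`, and the other lists filled in the loop.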


# PROBLEM 5

In this problem you are required to pick a benchmark dataset (from UCI repository or other
authoritative resources), partition it into train and test components and apply various
classification techniques. For each classification technique, you should display in a common
plot how the accuracy, precision, recall and f1 score are varying for different ratios of train/test
of the original dataset. (Note: there will be a different graph for each classification technique.)

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd  # Import pandas for reading CSV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load the Heart dataset from CSV
heart_data = pd.read_csv('heart.csv')
X, y = heart_data.iloc[:, :-1].values, heart_data.iloc[:, -1].values

# Scale the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Define classification techniques
classifiers = {
    'Decision Tree': DecisionTreeClassifier(),
    'KNN': KNeighborsClassifier(),
    'Logistic Regression': LogisticRegression(max_iter=1000),  # Increase max_iter
    'Naive Bayes': GaussianNB(),
    'SVM': SVC()
}

# Define the training fractions to evaluate (the remainder of each split is used for testing)
ratios = [0.6, 0.7, 0.8, 0.9]

# Initialize plots
fig, axes = plt.subplots(nrows=len(classifiers), ncols=1, figsize=(8, 4 * len(classifiers)))

# Iterate through classifiers
for i, (clf_name, clf) in enumerate(classifiers.items()):
    # Initialize lists to store metrics
    accuracy_list, precision_list, recall_list, f1_list = [], [], [], []

    # Iterate through ratios
    for ratio in ratios:
        # Split the scaled dataset; 'ratio' is the fraction of instances used for training
        X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, train_size=ratio, random_state=42)

        # Train the classifier
        clf.fit(X_train, y_train)

        # Make predictions
        y_pred = clf.predict(X_test)

        # Calculate evaluation metrics
        accuracy = accuracy_score(y_test, y_pred)
        precision = precision_score(y_test, y_pred, average='weighted')
        recall = recall_score(y_test, y_pred, average='weighted')
        f1 = f1_score(y_test, y_pred, average='weighted')

        # Append metrics to lists
        accuracy_list.append(accuracy)
        precision_list.append(precision)
        recall_list.append(recall)
        f1_list.append(f1)

    # Plot the metrics for each classifier
    axes[i].plot(ratios, accuracy_list, label='Accuracy')
    axes[i].plot(ratios, precision_list, label='Precision')
    axes[i].plot(ratios, recall_list, label='Recall')
    axes[i].plot(ratios, f1_list, label='F1 Score')
    axes[i].set_title(f'{clf_name} Performance vs Train/Test Ratio')
    axes[i].set_xlabel('Training fraction')
    axes[i].set_ylabel('Score')
    axes[i].legend()

# Adjust layout
plt.tight_layout()
plt.show()
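
The cell above fits the scaler on the full dataset before splitting, so statistics from the test instances influence the training features. A minimal sketch of one way to avoid this is shown below, assuming a scikit-learn pipeline is acceptable here. It uses synthetic data from `make_classification` as a stand-in for heart.csv and only logistic regression, to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for heart.csv (assumption: numeric features, binary target)
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)

train_fractions = [0.6, 0.7, 0.8, 0.9]
accuracies = []

for frac in train_fractions:
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, train_size=frac, random_state=42
    )
    # The pipeline fits the scaler on the training split only,
    # then applies the same transformation to the test split
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_tr, y_tr)
    accuracies.append(accuracy_score(y_te, model.predict(X_te)))

print(accuracies)
```

The same pattern would apply to each entry of `classifiers` by wrapping it with `make_pipeline(StandardScaler(), clf)` and passing the unscaled `X` to `train_test_split`.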
