# SVM Kernel Comparison on the Red Wine Quality and Breast Cancer Wisconsin Datasets

This notebook presents a supervised learning workflow that compares Support Vector Machine (SVM) kernels on two tabular datasets: Red Wine Quality (UCI) and Breast Cancer Wisconsin (scikit-learn).

**Key highlights:**
- Preprocessing: binarizing the wine quality score, splitting the data, and standardizing the features.
- Training SVM classifiers with linear, RBF, polynomial, and sigmoid kernels.
- Evaluation using accuracy, precision, recall, and F1-score.
- Interpretation of the results, including the effect of class imbalance on the wine data.

The two parts show how kernel choice and class balance affect SVM performance on real-world tabular data.


In [None]:
!pip install scikit-learn



In [None]:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score, f1_score, precision_score, recall_score

In [None]:

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
data = pd.read_csv(url, delimiter=";")

In [None]:

data['quality'] = data['quality'].apply(lambda x: 1 if x >= 7 else 0)

In [None]:

X = data.drop('quality', axis=1)
y = data['quality']

In [None]:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [None]:
kernels = ['linear', 'rbf', 'poly', 'sigmoid']
svm_results = {}


In [None]:
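for kernel in kernels:
    print(f"Training SVM with kernel: {kernel}")
    # Fit one SVM per kernel on the standardized training data.
    svm_model = SVC(kernel=kernel, C=1, probability=True, random_state=42)
    svm_model.fit(X_train, y_train)

    y_pred = svm_model.predict(X_test)

    # F1 is reported for the positive class (1); precision and recall per class.
    accuracy = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    recall_0 = recall_score(y_test, y_pred, pos_label=0)
    # zero_division=0 keeps the 0.0 result but silences the warning raised
    # when a kernel never predicts one of the classes.
    precision_0 = precision_score(y_test, y_pred, pos_label=0, zero_division=0)
    recall_1 = recall_score(y_test, y_pred, pos_label=1)
    precision_1 = precision_score(y_test, y_pred, pos_label=1, zero_division=0)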

    svm_results[kernel] = {
        "Accuracy": accuracy,
        "F1 Score": f1,
        "Recall (0)": recall_0,
        "Precision (0)": precision_0,
        "Recall (1)": recall_1,
        "Precision (1)": precision_1
    }

Training SVM with kernel: linear


Training SVM with kernel: rbf
Training SVM with kernel: poly
Training SVM with kernel: sigmoid


In [None]:
for kernel, metrics in svm_results.items():
    print(f"\nSVM Kernel: {kernel}")
    for metric, value in metrics.items():
        print(f"{metric}: {value:.4f}")


SVM Kernel: linear
Accuracy: 0.8531
F1 Score: 0.0000
Recall (0): 1.0000
Precision (0): 0.8531
Recall (1): 0.0000
Precision (1): 0.0000

SVM Kernel: rbf
Accuracy: 0.8750
F1 Score: 0.3750
Recall (0): 0.9817
Precision (0): 0.8845
Recall (1): 0.2553
Precision (1): 0.7059

SVM Kernel: poly
Accuracy: 0.8781
F1 Score: 0.4348
Recall (0): 0.9744
Precision (0): 0.8926
Recall (1): 0.3191
Precision (1): 0.6818

SVM Kernel: sigmoid
Accuracy: 0.8000
F1 Score: 0.2558
Recall (0): 0.8974
Precision (0): 0.8719
Recall (1): 0.2340
Precision (1): 0.2821
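
To see how the six printed metrics relate to one another, the sketch below recomputes them from confusion-matrix counts using only the standard library. The counts are not printed anywhere in the notebook: they are back-calculated here from the reported polynomial and linear results, assuming a 320-sample test set, and `binary_metrics` is a hypothetical helper name.

```python
def binary_metrics(tn, fp, fn, tp):
    """Return the six metrics printed above from binary confusion counts."""
    def ratio(num, den):
        # Fall back to 0.0 when the denominator is zero (nothing to divide by).
        return num / den if den else 0.0

    return {
        "Accuracy": ratio(tn + tp, tn + fp + fn + tp),
        "F1 Score": ratio(2 * tp, 2 * tp + fp + fn),
        "Recall (0)": ratio(tn, tn + fp),
        "Precision (0)": ratio(tn, tn + fn),
        "Recall (1)": ratio(tp, tp + fn),
        "Precision (1)": ratio(tp, tp + fp),
    }


# ASSUMED counts, inferred from the reported metrics (not notebook output).
poly = binary_metrics(tn=266, fp=7, fn=32, tp=15)
# A model that predicts class 0 for every sample, as the linear kernel did.
linear = binary_metrics(tn=273, fp=0, fn=47, tp=0)

for name, metrics in (("poly", poly), ("linear", linear)):
    print(name, {key: round(value, 4) for key, value in metrics.items()})
```

Under these assumed counts, the all-zero predictor reaches the same accuracy as the share of class 0 in the test set, which is why accuracy alone is a weak signal on this data.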


For the Red Wine Quality dataset, I trained Support Vector Machine (SVM) models with four kernels (linear, RBF, polynomial, and sigmoid) to classify wines as "Good" (1, a quality score of 7 or higher) or "Bad" (0). The polynomial kernel scored highest, with an accuracy of 87.81% and an F1-score of 0.4348 for the "Good" class. These numbers need to be read against the class imbalance: the linear kernel predicted "Bad" for every test wine (recall of 1.0000 for class 0 and 0.0000 for class 1) and still reached 85.31% accuracy, so accuracy alone says little here. Even the polynomial kernel recovered only 31.91% of the "Good" wines, at a precision of 68.18%. Among the four kernels it still gave the best balance between the two classes, which makes it the most appropriate choice of those tested, though not a strong classifier for "Good" wines.
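
One common way to address this kind of imbalance is the `class_weight="balanced"` option of `SVC`, combined with a stratified split. The sketch below is a minimal illustration, not something the notebook ran: it uses synthetic data from `make_classification` as a stand-in for the wine features (an assumption), and whether the option actually helps on the real dataset is untested.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (ASSUMPTION): roughly 85% class 0 and 15% class 1.
X, y = make_classification(
    n_samples=1000, n_features=11, weights=[0.85, 0.15], random_state=42
)

# stratify=y keeps the class ratio the same in the train and test splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Compare minority-class recall with and without class re-weighting.
minority_recall = {}
for class_weight in (None, "balanced"):
    model = SVC(kernel="poly", C=1, class_weight=class_weight)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    minority_recall[str(class_weight)] = recall_score(y_test, y_pred, pos_label=1)

print(minority_recall)
```

Re-weighting typically trades some precision on the minority class for higher recall, so both metrics should be checked before adopting it.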

2)

In [None]:
!pip install scikit-learn



In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report

In [None]:
data = load_breast_cancer()
X, y = data.data, data.target


In [None]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.1, random_state=42)

In [None]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)

In [None]:
svm_kernels = ['linear', 'rbf', 'poly', 'sigmoid']

In [None]:
for kernel in svm_kernels:
    print(f"Training SVM with kernel: {kernel}")
    svm_model = SVC(kernel=kernel, C=1, probability=True, random_state=42)
    svm_model.fit(X_train, y_train)

    y_pred = svm_model.predict(X_val)
    print(f"\nClassification report for kernel: {kernel}")
    print(classification_report(y_val, y_pred))

Training SVM with kernel: linear

Classification report for kernel: linear
              precision    recall  f1-score   support

           0       0.94      0.94      0.94        17
           1       0.97      0.97      0.97        40

    accuracy                           0.96        57
   macro avg       0.96      0.96      0.96        57
weighted avg       0.96      0.96      0.96        57

Training SVM with kernel: rbf

Classification report for kernel: rbf
              precision    recall  f1-score   support

           0       0.94      0.94      0.94        17
           1       0.97      0.97      0.97        40

    accuracy                           0.96        57
   macro avg       0.96      0.96      0.96        57
weighted avg       0.96      0.96      0.96        57

Training SVM with kernel: poly

Classification report for kernel: poly
              precision    recall  f1-score   support

           0       1.00      0.59      0.74        17
           1       0.8

For the Breast Cancer Wisconsin dataset, I trained SVM models with the same four kernels to classify samples as malignant (0) or benign (1), which is the label encoding used by scikit-learn's `load_breast_cancer`. Both the linear and RBF kernels performed very well, achieving 96% accuracy and F1-scores of 0.94 and 0.97 for the two classes. The strong linear result suggests that the classes are close to linearly separable in the standardized feature space. The polynomial kernel achieved perfect precision for the malignant class, but its recall for that class dropped to 0.59, meaning that roughly four in ten malignant samples were predicted as benign, and its overall accuracy fell to 88%. The sigmoid kernel came close to the linear and RBF kernels with 95% accuracy. Based on these results, I would recommend either the linear or the RBF kernel for this dataset, with the caveat that the validation set contains only 57 samples.
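
Because the reading of the per-class rows in the reports depends on which label is which, it may be worth confirming the encoding directly from the dataset object. The sketch below does that using `target_names`, which scikit-learn ships with the bundled data; the variable names `label_names` and `class_counts` are illustrative only.

```python
from collections import Counter

from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# Position i in target_names is the class name for integer label i.
label_names = {label: str(name) for label, name in enumerate(data.target_names)}
print(label_names)

# Number of samples per class name in the full dataset.
class_counts = Counter(label_names[int(label)] for label in data.target)
print(class_counts)
```

In scikit-learn's copy of the dataset, label 0 is malignant and label 1 is benign, so the row with support 17 in the reports above is the malignant class.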
