## Part 1 - How it's done

In this part I will demonstrate how to train a model on existing data: loading the dataset, splitting it into training and testing sets, creating a decision tree classifier, training the model on the training set, making predictions on the testing set, and evaluating the model's performance with accuracy, precision, recall, and F1 score.

### 1. Import the required libraries

In this code, I import the necessary libraries and modules to work with machine learning classification tasks using a decision tree classifier. I use the scikit-learn library, which is a popular choice for machine learning in Python.

First, I import the `load_iris` function from the `sklearn.datasets` module. This function allows me to load the Iris dataset, which is commonly used for classification tasks.

Next, I import the `train_test_split` function from the `sklearn.model_selection` module. This function helps me split the dataset into training and testing subsets, so that I can evaluate the classifier on data it has not seen during training.

I also import the `DecisionTreeClassifier` class from the `sklearn.tree` module. This class provides the functionality to create and train a decision tree classifier, which is a powerful algorithm for classification tasks.

To assess the performance of my classifier, I import several evaluation metrics from the `sklearn.metrics` module. These metrics include `accuracy_score`, `precision_score`, `recall_score`, and `f1_score`, which will help me measure the effectiveness of my model.

In [5]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, classification_report

### 2. Load the data that you want to use to train the model


In this code, I load the Iris dataset, which is a well-known dataset frequently used in machine learning. The dataset contains measurements of different iris flowers.

To load the dataset, I call the `load_iris()` function and assign its return value to the variable `iris`. This function is part of the `scikit-learn` library and allows me to load the Iris dataset conveniently.

Next, I extract the input features from the dataset and assign them to the variable `X`. These features include the measurements of the iris flowers, such as the length and width of the sepals and petals.

Similarly, I extract the target labels from the dataset and assign them to the variable `y`. These labels represent the different species of iris flowers that we want to classify.

By loading the dataset and separating the input features (`X`) and target labels (`y`), I am ready to explore and apply machine learning algorithms to classify the iris flowers based on their measurements. Let's continue our learning journey!

In [15]:
# Load dataset
iris = load_iris()
X = iris.data
y = iris.target
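
To get a feel for what `load_iris()` returns, the following small sketch prints the shapes and names stored on the returned object. It is not part of the original notebook; it only reads attributes that scikit-learn provides for this dataset (`data`, `target`, `feature_names`, `target_names`).

```python
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data      # NumPy array of measurements, one row per flower
y = iris.target    # integer class label (0, 1, or 2) for each flower

print("Features shape:", X.shape)
print("Labels shape:", y.shape)
print("Feature names:", iris.feature_names)
print("Class names:", list(iris.target_names))
```

Each row of `X` holds four measurements, and each entry of `y` is an index into `iris.target_names`.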

### 3. Split data into training and testing sets

In this code, I split the dataset into training and testing sets. This step is crucial to evaluate the performance of our machine learning model accurately.

To split the dataset, I use the `train_test_split` function from the `scikit-learn` library. This function takes the input features (`X`) and target labels (`y`) as input and splits them into four separate subsets: `X_train`, `X_test`, `y_train`, and `y_test`.

The `test_size` parameter is set to 0.2, which means that 20% of the data will be reserved for testing, while 80% will be used for training. Adjusting the `test_size` parameter allows us to control the size of the testing set.

Additionally, I set the `random_state` parameter to 12. This seeds the shuffling, so every time I run the code the same split is generated. This is helpful for reproducibility and for comparing results.

By splitting the dataset into training and testing sets, we can train our machine learning model on the training data and then evaluate its performance on the unseen testing data. This helps us understand how well our model generalizes to new, unseen samples. Let's move forward and continue our machine learning journey!

In [17]:
# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=12)
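
As a quick check of the split described above, this sketch prints the resulting shapes. The second call, with `stratify=y`, is an optional variation that I did not use in the tutorial; it keeps the class proportions the same in both subsets.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Same call as in the tutorial: 20% test data, fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12
)
print(X_train.shape, X_test.shape)  # 150 samples split into 120 and 30

# Optional variation (an assumption, not used in the tutorial):
# stratify=y preserves the class balance in the training and testing sets
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, random_state=12, stratify=y
)
print(np.bincount(y_test_s))  # number of test samples per class
```

Stratifying matters more for imbalanced datasets, where a purely random split can leave a rare class almost absent from the testing set.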

### 4. Create an instance of the DecisionTreeClassifier class and fit the model to the training data

In this code, I create a decision tree classifier and train it using the training set.

To create the decision tree classifier, I instantiate an object of the `DecisionTreeClassifier` class from the `scikit-learn` library. This classifier is a powerful algorithm that can learn decision rules from the training data to make predictions.

Next, I train the decision tree classifier using the `fit` method. I provide the training data (`X_train`) and the corresponding target labels (`y_train`) as input to the fit method. This allows the classifier to learn patterns and relationships between the input features and the target labels in the training set.

By training the decision tree classifier, we enable it to make predictions based on the learned patterns. The model will now be able to classify new, unseen instances based on the features it has learned during the training phase.

Now that our decision tree classifier is trained and ready, we can move on to the next step, which is evaluating its performance on the testing set. Let's continue exploring and analyzing our machine learning model!

In [19]:
# Create decision tree classifier
clf = DecisionTreeClassifier()

# Train the model on the training set
clf.fit(X_train, y_train)
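
To see the decision rules that the classifier learned, one option is `export_text` from `sklearn.tree`. This is a minimal sketch and not part of the original notebook. Passing `random_state=0` to the classifier is my addition here; it makes the fitted tree identical on every run.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=12
)

# random_state is an addition to the tutorial code: it fixes the tie-breaking
# between equally good splits, so repeated runs build the same tree
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

# Render the learned if/else rules as indented plain text
rules = export_text(clf, feature_names=iris.feature_names)
print(rules)
print("Tree depth:", clf.get_depth())
```

Without a fixed `random_state`, two runs of the same notebook cell can produce slightly different trees and therefore slightly different scores.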

### 5. Make predictions on the testing set

In this code, I use the trained decision tree classifier to make predictions on the testing set.

After training the decision tree classifier, I apply it to the testing data using the `predict` method. The `predict` method takes the testing data (`X_test`) as input and returns the predicted labels for these instances.

I store the predicted labels in the variable `y_pred`. These predicted labels represent the model's predictions for the corresponding instances in the testing set.

By making predictions on the testing set, we can evaluate how well our trained model performs on unseen data. This step allows us to assess the model's ability to generalize and make accurate predictions on instances it has not encountered during the training phase.

Now that we have obtained the predictions, we can proceed to evaluate the performance of our decision tree classifier using various evaluation metrics. Let's continue our analysis and see how well our model has performed!

In [21]:
# Make predictions on the testing set
y_pred = clf.predict(X_test)
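
The predicted labels are integers, so it can help to translate a few of them back to species names and compare them with the true labels. The sketch below does that for the first five test samples; the loop and the `random_state=0` on the classifier are illustrative additions, not part of the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=12
)

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

# One predicted class index per row of X_test
y_pred = clf.predict(X_test)

# Compare predicted and true species names for the first five test samples
for predicted, actual in zip(y_pred[:5], y_test[:5]):
    print("predicted:", iris.target_names[predicted],
          "| actual:", iris.target_names[actual])
```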

### 6. Evaluate the performance of the model using accuracy, precision, recall, and F1 score

In this code, I calculate various evaluation metrics, such as accuracy, precision, recall, and F1 score, to assess the performance of our trained decision tree classifier on the testing set.

First, I use the `accuracy_score` function from the `scikit-learn` library to calculate the accuracy of our model's predictions. The `accuracy_score` function takes the true labels (`y_test`) and the predicted labels (`y_pred`) as input and returns the accuracy, which is the proportion of correct predictions.

Next, I calculate the precision, recall, and F1 score using the `precision_score`, `recall_score`, and `f1_score` functions, respectively. These metrics provide additional insights into the performance of our model. With `average='weighted'`, each metric is computed per class and the per-class values are then averaged, weighted by the support of each class (the number of true samples of that class).

After calculating the metrics, I print them using the `print` function. This allows us to see the values of accuracy, precision, recall, and F1 score, which provide a comprehensive overview of the performance of our decision tree classifier.

By evaluating these metrics, we can assess how well our model has performed in terms of correctly classifying the instances in the testing set. This analysis helps us understand the strengths and weaknesses of our trained model and guides us in making further improvements.

Keep up the great work! Now you have a deeper understanding of how to evaluate the performance of a machine learning model.

In [23]:
# Calculate metrics such as accuracy, precision, recall, and F1 score
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

# Print the metrics
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)

Accuracy: 0.9666666666666667
Precision: 0.9700000000000001
Recall: 0.9666666666666667
F1 Score: 0.9665634674922601
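
To make the arithmetic behind these numbers concrete, the following sketch recomputes accuracy and weighted precision by hand and compares them with the scikit-learn results. The label arrays are hypothetical and chosen only to keep the numbers easy to follow; they are not the Iris predictions from above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score

# Hypothetical labels, used only to illustrate the calculation
y_true = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 2])
y_hat = np.array([0, 1, 1, 1, 0, 2, 2, 2, 2, 1])

# Accuracy: fraction of predictions that match the true label
manual_accuracy = np.mean(y_true == y_hat)

# Weighted precision: per-class precision, averaged with the number of
# true samples per class (the support) as weights
per_class_precision = precision_score(y_true, y_hat, average=None)
support = np.bincount(y_true)
manual_weighted = np.sum(per_class_precision * support) / support.sum()

print("Accuracy:", manual_accuracy, accuracy_score(y_true, y_hat))
print("Weighted precision:", manual_weighted,
      precision_score(y_true, y_hat, average="weighted"))
```

Here 7 of the 10 predictions are correct, so the accuracy is 0.7. The per-class precisions are 0.5, 0.5, and 1.0 with supports 2, 3, and 5, which gives a weighted precision of 0.75.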


## Part 2 - Your turn

Welcome to the next step of your machine learning journey! In this assignment, you will have the opportunity to work with the Red Wine Quality dataset from Kaggle. This dataset contains various chemical properties of red wine samples, along with their corresponding quality ratings.

Your task is to apply machine learning techniques to build a model that can predict the quality of red wine based on its chemical attributes. By analyzing this dataset, you will gain hands-on experience in classification tasks and understand how different chemical components contribute to the overall quality of red wine.

Throughout this assignment, you will be using Python and the scikit-learn library, which provides a wide range of machine learning algorithms and evaluation metrics. You will learn how to preprocess the data, split it into training and testing sets, train a classification model, make predictions, and evaluate the model's performance using various metrics.

By the end of this assignment, you will have a solid understanding of the entire machine learning pipeline, from data preprocessing to model evaluation. You will also gain valuable insights into the factors influencing red wine quality.

So, let's dive in and uncork the mysteries of red wine quality prediction using machine learning!

The dataset can be found here: [Kaggle.com Red Wine Quality](https://www.kaggle.com/datasets/uciml/red-wine-quality-cortez-et-al-2009)

### 1. Load necessary data 

- Download the data from the link, move it to the right place on your computer, and load it with `read_csv`

- Use `df.head()` and `df.describe()` to explore the dataset
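
The exploration calls above could look like the following sketch. The rows are made up and stand in for the real file, and only three columns (`alcohol`, `pH`, `quality`) are shown; I am assuming these names match the Kaggle CSV. With the real data you would pass the path of the downloaded file to `read_csv` instead of the in-memory text.

```python
import io
import pandas as pd

# Made-up rows standing in for the real file (values are NOT from the dataset)
csv_text = """alcohol,pH,quality
9.4,3.51,5
9.8,3.20,5
10.5,3.30,6
11.2,3.16,7
"""

# With the real data, pass the file path instead of the StringIO object
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())       # first rows of the table
print(df.describe())   # count, mean, std, min, quartiles, max per column

# How many samples exist per quality rating; useful for spotting imbalance
print(df["quality"].value_counts().sort_index())
```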

In [3]:
import zipfile
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report

zip_path = r"C:\Users\nayif\py3b_nayef_omer\week7\archive.zip"

with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    file_name = zip_ref.namelist()[0] 
    with zip_ref.open(file_name) as file:
        df = pd.read_csv(file)

df = df.fillna(df.mean())

X = df.drop("quality", axis=1)  
y = df["quality"]  

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)

cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)

print("Classification Report:")
print(classification_report(y_test, y_pred))


Accuracy: 0.571875
Precision: 0.5671001102000189
Recall: 0.571875
F1 Score: 0.5690568475452197
Confusion Matrix:
[[ 0  0  0  1  0  0]
 [ 0  1  5  3  1  0]
 [ 1  3 89 34  3  0]
 [ 0  3 37 72 18  2]
 [ 0  1  5 15 20  1]
 [ 0  0  1  1  2  1]]
Classification Report:
              precision    recall  f1-score   support

           3       0.00      0.00      0.00         1
           4       0.12      0.10      0.11        10
           5       0.65      0.68      0.67       130
           6       0.57      0.55      0.56       132
           7       0.45      0.48      0.47        42
           8       0.25      0.20      0.22         5

    accuracy                           0.57       320
   macro avg       0.34      0.33      0.34       320
weighted avg       0.57      0.57      0.57       320



### 2. Prepare the data by extracting features and labels

Extract features (`X`) and labels (`y`). You can do this in several ways, the following is an example:

```python
# This will exclude the 'quality' column from features and return the result
# Save the result to a variable named X to extract the features
df.drop('quality', axis=1)

# This will extract the 'quality' column as the target variable and return the result
# Save the result to a variable named y to extract the labels
df['quality']  
```

To validate your results, you can use `print` to output `X.shape` and `y.shape`. You should get this result:

```
Shape of features (X): (1599, 11)
Shape of labels (y): (1599,)
```

In [6]:
import zipfile
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report

zip_path = r"C:\Users\nayif\py3b_nayef_omer\week7\archive.zip"

with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    file_name = zip_ref.namelist()[0]  
    with zip_ref.open(file_name) as file:
        df = pd.read_csv(file)

df = df.fillna(df.mean())

X = df.drop('quality', axis=1)  
y = df['quality']  

print("Shape of features (X):", X.shape)
print("Shape of labels (y):", y.shape)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)

cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)

print("Classification Report:")
print(classification_report(y_test, y_pred))


Shape of features (X): (1599, 11)
Shape of labels (y): (1599,)
Accuracy: 0.58125
Precision: 0.5809558344680599
Recall: 0.58125
F1 Score: 0.5809429199878732
Confusion Matrix:
[[ 0  0  0  1  0  0]
 [ 0  0  4  5  1  0]
 [ 1  5 89 33  2  0]
 [ 0  3 36 74 17  2]
 [ 0  2  3 13 22  2]
 [ 0  0  1  1  2  1]]
Classification Report:
              precision    recall  f1-score   support

           3       0.00      0.00      0.00         1
           4       0.00      0.00      0.00        10
           5       0.67      0.68      0.68       130
           6       0.58      0.56      0.57       132
           7       0.50      0.52      0.51        42
           8       0.20      0.20      0.20         5

    accuracy                           0.58       320
   macro avg       0.33      0.33      0.33       320
weighted avg       0.58      0.58      0.58       320



### 3. Split data into training and testing sets

Use `train_test_split` as I did in my example. Save the result to the variables `X_train, X_test, y_train, y_test`

In [8]:
import zipfile
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report

zip_path = r"C:\Users\nayif\py3b_nayef_omer\week7\archive.zip"

with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    file_name = zip_ref.namelist()[0]  
    with zip_ref.open(file_name) as file:
        df = pd.read_csv(file)

df = df.fillna(df.mean())

X = df.drop('quality', axis=1)  
y = df['quality']  

print("Shape of features (X):", X.shape)
print("Shape of labels (y):", y.shape)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Shape of training features (X_train):", X_train_scaled.shape)
print("Shape of testing features (X_test):", X_test_scaled.shape)
print("Shape of training labels (y_train):", y_train.shape)
print("Shape of testing labels (y_test):", y_test.shape)


Shape of features (X): (1599, 11)
Shape of labels (y): (1599,)
Shape of training features (X_train): (1279, 11)
Shape of testing features (X_test): (320, 11)
Shape of training labels (y_train): (1279,)
Shape of testing labels (y_test): (320,)
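
The cell above fits the scaler on the training data only and then applies it to the testing data, which keeps information from the testing set out of the preprocessing. One way to make this order hard to get wrong is a scikit-learn `Pipeline`. The sketch below uses synthetic data from `make_classification` as a stand-in for the wine features, so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with 11 features, like the wine dataset
X, y = make_classification(
    n_samples=200, n_features=11, n_informative=5, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# fit() scales with statistics from X_train only, then trains the classifier;
# score() and predict() reuse those training statistics on new data
model = make_pipeline(StandardScaler(), KNeighborsClassifier())
model.fit(X_train, y_train)

fitted_scaler = model.named_steps["standardscaler"]
print("Scaler used training means only:",
      np.allclose(fitted_scaler.mean_, X_train.mean(axis=0)))
print("Test accuracy:", model.score(X_test, y_test))
```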


### 4. Create an instance of the DecisionTreeRegressor class and fit the model to the training data

Use other models as well such as Support Vector Classifier (SVC) and K-Nearest Neighbors Classifier (KNN)

To create an instance to train and predict, you just call its constructor and save it to a variable with a fitting name. In my example I used `clf = DecisionTreeClassifier()`. Do the same with the following:

- `DecisionTreeRegressor()`
- `SVC()`
- `KNeighborsClassifier()`

After that, use the `fit` function to train your models.
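
The three models can be created and trained in a loop, as in this sketch. It uses the Iris data from Part 1 rather than the wine data, and `max_depth=2` on the regressor is an assumption I added to make one point visible: a regressor predicts floats, so its output has to be rounded to integer labels before classification metrics are applied.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeRegressor

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# max_depth=2 is an illustrative assumption, not part of the assignment
models = {
    "tree_regressor": DecisionTreeRegressor(max_depth=2, random_state=0),
    "svc": SVC(),
    "knn": KNeighborsClassifier(),
}

predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)              # train on the training set
    predictions[name] = model.predict(X_test)

# The regressor returns floats; round them to the nearest integer label
rounded = np.rint(predictions["tree_regressor"]).astype(int)
print(predictions["tree_regressor"][:5], "->", rounded[:5])
```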

In [10]:
import zipfile
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report

zip_path = r"C:\Users\nayif\py3b_nayef_omer\week7\archive.zip"

with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    file_name = zip_ref.namelist()[0] 
    with zip_ref.open(file_name) as file:
        df = pd.read_csv(file)

df = df.fillna(df.mean())

X = df.drop('quality', axis=1) 
y = df['quality']  

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

dt_regressor = DecisionTreeRegressor()
svc_classifier = SVC()
knn_classifier = KNeighborsClassifier()

dt_regressor.fit(X_train_scaled, y_train)
svc_classifier.fit(X_train_scaled, y_train)
knn_classifier.fit(X_train_scaled, y_train)

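# The regressor returns floats; round to the nearest integer class so the
# classification metrics below receive discrete labels
y_pred_dt = dt_regressor.predict(X_test_scaled).round().astype(int)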
y_pred_svc = svc_classifier.predict(X_test_scaled)
y_pred_knn = knn_classifier.predict(X_test_scaled)

def evaluate_model(y_true, y_pred):
    accuracy = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred, average='weighted')
    recall = recall_score(y_true, y_pred, average='weighted')
    f1 = f1_score(y_true, y_pred, average='weighted')
    
    cm = confusion_matrix(y_true, y_pred)
    report = classification_report(y_true, y_pred)
    
    return accuracy, precision, recall, f1, cm, report

accuracy_dt, precision_dt, recall_dt, f1_dt, cm_dt, report_dt = evaluate_model(y_test, y_pred_dt)

accuracy_svc, precision_svc, recall_svc, f1_svc, cm_svc, report_svc = evaluate_model(y_test, y_pred_svc)

accuracy_knn, precision_knn, recall_knn, f1_knn, cm_knn, report_knn = evaluate_model(y_test, y_pred_knn)

print("Decision Tree Regressor Evaluation:")
print(f"Accuracy: {accuracy_dt}")
print(f"Precision: {precision_dt}")
print(f"Recall: {recall_dt}")
print(f"F1 Score: {f1_dt}")
print("Confusion Matrix:")
print(cm_dt)
print("Classification Report:")
print(report_dt)

print("\nSupport Vector Classifier Evaluation:")
print(f"Accuracy: {accuracy_svc}")
print(f"Precision: {precision_svc}")
print(f"Recall: {recall_svc}")
print(f"F1 Score: {f1_svc}")
print("Confusion Matrix:")
print(cm_svc)
print("Classification Report:")
print(report_svc)

print("\nK-Nearest Neighbors Classifier Evaluation:")
print(f"Accuracy: {accuracy_knn}")
print(f"Precision: {precision_knn}")
print(f"Recall: {recall_knn}")
print(f"F1 Score: {f1_knn}")
print("Confusion Matrix:")
print(cm_knn)
print("Classification Report:")
print(report_knn)


Decision Tree Regressor Evaluation:
Accuracy: 0.596875
Precision: 0.5942776051170925
Recall: 0.596875
F1 Score: 0.5952138672156423
Confusion Matrix:
[[ 0  0  0  1  0  0]
 [ 0  1  5  4  0  0]
 [ 2  2 91 31  4  0]
 [ 2  3 30 80 16  1]
 [ 0  0  2 19 19  2]
 [ 0  0  0  2  3  0]]
Classification Report:
              precision    recall  f1-score   support

           3       0.00      0.00      0.00         1
           4       0.17      0.10      0.12        10
           5       0.71      0.70      0.71       130
           6       0.58      0.61      0.59       132
           7       0.45      0.45      0.45        42
           8       0.00      0.00      0.00         5

    accuracy                           0.60       320
   macro avg       0.32      0.31      0.31       320
weighted avg       0.59      0.60      0.60       320


Support Vector Classifier Evaluation:
Accuracy: 0.603125
Precision: 0.5690995065789475
Recall: 0.603125
F1 Score: 0.5728911344073244
Confusion Matrix:
[[ 0  

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


### 5. Make predictions on the testing set

Use `predict` on the test data to make predictions on data that wasn't in the training set.

In [12]:
import zipfile
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, mean_absolute_error

zip_path = r"C:\Users\nayif\py3b_nayef_omer\week7\archive.zip"

with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    file_name = zip_ref.namelist()[0] 
    with zip_ref.open(file_name) as file:
        df = pd.read_csv(file)

df = df.fillna(df.mean())

X = df.drop("quality", axis=1)  
y = df["quality"]  

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

dt_regressor = DecisionTreeRegressor()
svc_classifier = SVC()
knn_classifier = KNeighborsClassifier()

dt_regressor.fit(X_train, y_train)
svc_classifier.fit(X_train, y_train)
knn_classifier.fit(X_train, y_train)

y_pred_dt = dt_regressor.predict(X_test)
y_pred_svc = svc_classifier.predict(X_test)
y_pred_knn = knn_classifier.predict(X_test)

dt_mae = mean_absolute_error(y_test, y_pred_dt)

svc_accuracy = accuracy_score(y_test, y_pred_svc)
knn_accuracy = accuracy_score(y_test, y_pred_knn)

print("Decision Tree Regressor - Mean Absolute Error:", dt_mae)
print("Support Vector Classifier - Accuracy:", svc_accuracy)
print("K-Nearest Neighbors Classifier - Accuracy:", knn_accuracy)

print("\nPredictions using Decision Tree Regressor:")
print(y_pred_dt)

print("\nPredictions using Support Vector Classifier:")
print(y_pred_svc)

print("\nPredictions using K-Nearest Neighbors Classifier:")
print(y_pred_knn)


Decision Tree Regressor - Mean Absolute Error: 0.478125
Support Vector Classifier - Accuracy: 0.603125
K-Nearest Neighbors Classifier - Accuracy: 0.553125

Predictions using Decision Tree Regressor:
[6. 5. 6. 5. 6. 5. 5. 5. 7. 6. 7. 6. 6. 5. 6. 6. 5. 6. 7. 5. 5. 6. 4. 6.
 5. 6. 6. 5. 5. 6. 5. 5. 6. 6. 6. 5. 6. 6. 5. 6. 3. 6. 6. 5. 5. 5. 6. 6.
 5. 6. 5. 5. 6. 7. 5. 6. 6. 6. 5. 5. 5. 7. 6. 6. 6. 5. 7. 6. 6. 6. 6. 5.
 6. 6. 6. 5. 6. 5. 5. 7. 5. 7. 5. 6. 7. 6. 5. 6. 6. 6. 7. 6. 5. 5. 5. 6.
 5. 6. 5. 5. 4. 5. 6. 7. 5. 7. 6. 5. 6. 5. 6. 5. 6. 5. 5. 6. 5. 5. 5. 6.
 6. 6. 6. 6. 6. 5. 7. 5. 5. 6. 6. 6. 4. 6. 6. 5. 5. 6. 5. 5. 7. 8. 7. 6.
 5. 5. 4. 6. 5. 5. 6. 6. 6. 5. 6. 5. 6. 7. 5. 6. 6. 5. 6. 5. 5. 5. 6. 5.
 5. 6. 5. 5. 7. 5. 7. 6. 6. 5. 5. 6. 5. 5. 5. 6. 4. 6. 6. 6. 7. 5. 5. 7.
 5. 6. 6. 5. 5. 6. 5. 7. 5. 6. 6. 5. 6. 5. 5. 3. 7. 5. 8. 5. 5. 8. 7. 6.
 6. 5. 6. 5. 5. 6. 6. 5. 4. 6. 6. 7. 6. 6. 5. 5. 7. 6. 5. 7. 5. 7. 6. 5.
 6. 6. 5. 7. 6. 7. 6. 6. 8. 5. 6. 6. 5. 6. 7. 5. 5. 6. 6. 6. 7. 5. 5. 6

### 6. Evaluate the performance of the model 

Use accuracy, precision, recall, and F1 score and do the same with your other models to compare them.

Follow my example to calculate metrics such as `accuracy`, `precision`, `recall`, and `f1_score` for all three of your models.

Print your results to compare the models. Which one performed best? Is any of the models suitable for predicting the wine quality?
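
One way to line the models up side by side is to collect the metrics in a small table. The sketch below is only a suggestion: the `summarize` helper, the model names, and the label lists are hypothetical, and `zero_division=0` is my choice for this example.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score


def summarize(y_true, y_pred):
    """Return the four weighted metrics for one model as a dict."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="weighted", zero_division=0),
        "recall": recall_score(y_true, y_pred, average="weighted", zero_division=0),
        "f1": f1_score(y_true, y_pred, average="weighted", zero_division=0),
    }


# Hypothetical true labels and predictions, for illustration only
y_true = [5, 5, 6, 6, 7, 7]
preds = {
    "model_a": [5, 5, 6, 6, 7, 7],
    "model_b": [5, 6, 6, 6, 6, 7],
}

# One row per model, one column per metric, best F1 score first
table = pd.DataFrame({name: summarize(y_true, p) for name, p in preds.items()}).T
print(table.sort_values("f1", ascending=False))
```

With your own results, replace `preds` with the predictions of your three models on the same `y_test`.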

> You may get this warning: `UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples`. It indicates that the model never predicted some of the labels, so precision is undefined for those labels. To handle this warning, set the `zero_division` parameter, which controls the value used in such cases. Add the parameter `zero_division=1` to your `precision_score` function calls.
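
The effect of `zero_division` is easiest to see on a tiny example. In this sketch the labels are hypothetical, and class 2 is never predicted, so its precision is undefined and takes whatever value `zero_division` specifies.

```python
from sklearn.metrics import precision_score

# Hypothetical labels: class 2 appears in y_true but is never predicted
y_true = [0, 1, 2, 2]
y_pred = [0, 1, 1, 1]

# average=None returns one precision value per class
as_zero = precision_score(y_true, y_pred, average=None, zero_division=0)
as_one = precision_score(y_true, y_pred, average=None, zero_division=1)

print("zero_division=0:", as_zero)  # last entry (class 2) is 0.0
print("zero_division=1:", as_one)   # last entry (class 2) is 1.0
```

Note that `zero_division=1` raises the averaged precision, because the never-predicted class then counts as perfectly precise. If you only want to silence the warning without changing the scores, `zero_division=0` is the more conservative choice.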

In [14]:
import zipfile
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

zip_path = r"C:\Users\nayif\py3b_nayef_omer\week7\archive.zip"

with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    file_name = zip_ref.namelist()[0]  
    with zip_ref.open(file_name) as file:
        df = pd.read_csv(file)

df = df.fillna(df.mean())

X = df.drop("quality", axis=1)  
y = df["quality"]  

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

dt_regressor = DecisionTreeRegressor()
svc_classifier = SVC()
knn_classifier = KNeighborsClassifier()

dt_regressor.fit(X_train, y_train)
svc_classifier.fit(X_train, y_train)
knn_classifier.fit(X_train, y_train)

y_pred_dt = dt_regressor.predict(X_test)
y_pred_svc = svc_classifier.predict(X_test)
y_pred_knn = knn_classifier.predict(X_test)


y_pred_dt_class = y_pred_dt.round().astype(int)

accuracy_dt = accuracy_score(y_test, y_pred_dt_class)
accuracy_svc = accuracy_score(y_test, y_pred_svc)
accuracy_knn = accuracy_score(y_test, y_pred_knn)

precision_dt = precision_score(y_test, y_pred_dt_class, average='weighted', zero_division=1)
precision_svc = precision_score(y_test, y_pred_svc, average='weighted', zero_division=1)
precision_knn = precision_score(y_test, y_pred_knn, average='weighted', zero_division=1)

recall_dt = recall_score(y_test, y_pred_dt_class, average='weighted', zero_division=1)
recall_svc = recall_score(y_test, y_pred_svc, average='weighted', zero_division=1)
recall_knn = recall_score(y_test, y_pred_knn, average='weighted', zero_division=1)

f1_dt = f1_score(y_test, y_pred_dt_class, average='weighted', zero_division=1)
f1_svc = f1_score(y_test, y_pred_svc, average='weighted', zero_division=1)
f1_knn = f1_score(y_test, y_pred_knn, average='weighted', zero_division=1)

print("Decision Tree Regressor - Accuracy:", accuracy_dt)
print("Decision Tree Regressor - Precision:", precision_dt)
print("Decision Tree Regressor - Recall:", recall_dt)
print("Decision Tree Regressor - F1 Score:", f1_dt)

print("\nSupport Vector Classifier - Accuracy:", accuracy_svc)
print("Support Vector Classifier - Precision:", precision_svc)
print("Support Vector Classifier - Recall:", recall_svc)
print("Support Vector Classifier - F1 Score:", f1_svc)

print("\nK-Nearest Neighbors Classifier - Accuracy:", accuracy_knn)
print("K-Nearest Neighbors Classifier - Precision:", precision_knn)
print("K-Nearest Neighbors Classifier - Recall:", recall_knn)
print("K-Nearest Neighbors Classifier - F1 Score:", f1_knn)

print("\nClassification Report for Decision Tree Regressor:")
print(classification_report(y_test, y_pred_dt_class))

print("\nClassification Report for Support Vector Classifier:")
print(classification_report(y_test, y_pred_svc))

print("\nClassification Report for K-Nearest Neighbors Classifier:")
print(classification_report(y_test, y_pred_knn))

# Pick the model with the highest weighted F1 score
f1_scores = {
    "Decision Tree Regressor": f1_dt,
    "Support Vector Classifier": f1_svc,
    "K-Nearest Neighbors Classifier": f1_knn,
}
best_model = max(f1_scores, key=f1_scores.get)

print("\nThe best performing model based on F1 score is:", best_model)


Decision Tree Regressor - Accuracy: 0.63125
Decision Tree Regressor - Precision: 0.6338511787589052
Decision Tree Regressor - Recall: 0.63125
Decision Tree Regressor - F1 Score: 0.6318375793273511

Support Vector Classifier - Accuracy: 0.603125
Support Vector Classifier - Precision: 0.6190995065789474
Support Vector Classifier - Recall: 0.603125
Support Vector Classifier - F1 Score: 0.5728911344073244

K-Nearest Neighbors Classifier - Accuracy: 0.553125
K-Nearest Neighbors Classifier - Precision: 0.5524217743823907
K-Nearest Neighbors Classifier - Recall: 0.553125
K-Nearest Neighbors Classifier - F1 Score: 0.5391265328874024

Classification Report for Decision Tree Regressor:
              precision    recall  f1-score   support

           3       0.00      0.00      0.00         1
           4       0.12      0.10      0.11        10
           5       0.74      0.72      0.73       130
           6       0.62      0.66      0.64       132
           7       0.56      0.48      0.51 

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


### 7. Other evaluation methods

Use `confusion_matrix` and `classification_report` to evaluate further and compare your models

> You may get the same warning about *ill-defined Precision and F-score*. You can handle it by adding the parameter `zero_division=1` to your `classification_report` calls.

> The `zero_division` parameter controls what `classification_report` shows for labels with no predicted samples. With `zero_division=1`, the ill-defined precision for such labels is reported as 1.0 and no `UndefinedMetricWarning` is raised. With `zero_division=0`, it is reported as 0.0, also without the warning. The default, `zero_division="warn"`, reports 0.0 and raises the warning.
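
As a small illustration of both functions, this sketch uses hypothetical labels in which class 2 is never predicted. In the confusion matrix, rows are the true classes and columns are the predicted classes. I pass `zero_division=0` here so that the undefined precision shows up as 0.00 without a warning.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical labels: class 2 appears in y_true but is never predicted
y_true = [0, 1, 2, 2]
y_pred = [0, 1, 1, 1]

# Rows = true class, columns = predicted class
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Per-class precision, recall, F1 score, and support in one text table
report = classification_report(y_true, y_pred, zero_division=0)
print(report)
```

The last row of the matrix shows that both samples of class 2 were predicted as class 1, and the last column is all zeros because class 2 was never predicted.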

In [16]:
import zipfile
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report, confusion_matrix

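# Path to the zipped dataset; adjust it to where the archive lives on your machine.
zip_path = r"C:\Users\nayif\py3b_nayef_omer\week7\archive.zip"

# Read the first file in the archive straight into a DataFrame without extracting it.
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    file_name = zip_ref.namelist()[0]
    with zip_ref.open(file_name) as file:
        df = pd.read_csv(file)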

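# Fill missing values with the mean of each column.
df = df.fillna(df.mean())

# Separate the features from the target column.
X = df.drop("quality", axis=1)
y = df["quality"]

# Standardize the features. Note: the scaler is fitted on the full dataset here;
# fitting it on the training split only would avoid leaking test-set statistics.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)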

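# No random_state is set on the tree, so its scores can vary slightly between runs.
dt_regressor = DecisionTreeRegressor()
svc_classifier = SVC()
knn_classifier = KNeighborsClassifier()

# Train all three models on the same training split.
dt_regressor.fit(X_train, y_train)
svc_classifier.fit(X_train, y_train)
knn_classifier.fit(X_train, y_train)

# Predict on the held-out test split.
y_pred_dt = dt_regressor.predict(X_test)
y_pred_svc = svc_classifier.predict(X_test)
y_pred_knn = knn_classifier.predict(X_test)

# The regressor returns floats, so round them to the nearest integer quality label
# before computing classification metrics.
y_pred_dt_class = y_pred_dt.round().astype(int)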

accuracy_dt = accuracy_score(y_test, y_pred_dt_class)
accuracy_svc = accuracy_score(y_test, y_pred_svc)
accuracy_knn = accuracy_score(y_test, y_pred_knn)

precision_dt = precision_score(y_test, y_pred_dt_class, average='weighted', zero_division=1)
precision_svc = precision_score(y_test, y_pred_svc, average='weighted', zero_division=1)
precision_knn = precision_score(y_test, y_pred_knn, average='weighted', zero_division=1)

recall_dt = recall_score(y_test, y_pred_dt_class, average='weighted', zero_division=1)
recall_svc = recall_score(y_test, y_pred_svc, average='weighted', zero_division=1)
recall_knn = recall_score(y_test, y_pred_knn, average='weighted', zero_division=1)

f1_dt = f1_score(y_test, y_pred_dt_class, average='weighted', zero_division=1)
f1_svc = f1_score(y_test, y_pred_svc, average='weighted', zero_division=1)
f1_knn = f1_score(y_test, y_pred_knn, average='weighted', zero_division=1)

print("Decision Tree Regressor - Accuracy:", accuracy_dt)
print("Decision Tree Regressor - Precision:", precision_dt)
print("Decision Tree Regressor - Recall:", recall_dt)
print("Decision Tree Regressor - F1 Score:", f1_dt)

print("\nSupport Vector Classifier - Accuracy:", accuracy_svc)
print("Support Vector Classifier - Precision:", precision_svc)
print("Support Vector Classifier - Recall:", recall_svc)
print("Support Vector Classifier - F1 Score:", f1_svc)

print("\nK-Nearest Neighbors Classifier - Accuracy:", accuracy_knn)
print("K-Nearest Neighbors Classifier - Precision:", precision_knn)
print("K-Nearest Neighbors Classifier - Recall:", recall_knn)
print("K-Nearest Neighbors Classifier - F1 Score:", f1_knn)

print("\nClassification Report for Decision Tree Regressor:")
print(classification_report(y_test, y_pred_dt_class, zero_division=1))

print("\nClassification Report for Support Vector Classifier:")
print(classification_report(y_test, y_pred_svc, zero_division=1))

print("\nClassification Report for K-Nearest Neighbors Classifier:")
print(classification_report(y_test, y_pred_knn, zero_division=1))

print("\nConfusion Matrix for Decision Tree Regressor:")
print(confusion_matrix(y_test, y_pred_dt_class))

print("\nConfusion Matrix for Support Vector Classifier:")
print(confusion_matrix(y_test, y_pred_svc))

print("\nConfusion Matrix for K-Nearest Neighbors Classifier:")
print(confusion_matrix(y_test, y_pred_knn))

# Pick the model with the highest weighted F1 score (on a tie, the first one listed wins).
f1_scores = {
    "Decision Tree Regressor": f1_dt,
    "Support Vector Classifier": f1_svc,
    "K-Nearest Neighbors Classifier": f1_knn,
}
best_model = max(f1_scores, key=f1_scores.get)

print("\nThe best performing model based on F1 score is:", best_model)


Decision Tree Regressor - Accuracy: 0.625
Decision Tree Regressor - Precision: 0.6222429901998912
Decision Tree Regressor - Recall: 0.625
Decision Tree Regressor - F1 Score: 0.6231422717360218

Support Vector Classifier - Accuracy: 0.603125
Support Vector Classifier - Precision: 0.6190995065789474
Support Vector Classifier - Recall: 0.603125
Support Vector Classifier - F1 Score: 0.5728911344073244

K-Nearest Neighbors Classifier - Accuracy: 0.553125
K-Nearest Neighbors Classifier - Precision: 0.5524217743823907
K-Nearest Neighbors Classifier - Recall: 0.553125
K-Nearest Neighbors Classifier - F1 Score: 0.5391265328874024

Classification Report for Decision Tree Regressor:
              precision    recall  f1-score   support

           3       0.00      0.00      0.00         1
           4       0.17      0.10      0.12        10
           5       0.73      0.72      0.73       130
           6       0.62      0.65      0.64       132
           7       0.49      0.45      0.47     
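
The confusion matrices printed by the cell above are plain arrays, so it can be hard to tell which row and column belong to which quality score. Here is a minimal sketch of one way to label them with pandas. The toy labels and the names `cm_df` and `labels` are assumptions for illustration, not values from this notebook's dataset.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Toy labels (illustrative only), imitating integer quality scores.
y_true = [5, 5, 6, 6, 6, 7]
y_pred = [5, 6, 6, 6, 5, 6]
labels = [5, 6, 7]

# Rows are the true labels, columns are the predicted labels.
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Wrap the array in a DataFrame so every row and column is named.
cm_df = pd.DataFrame(
    cm,
    index=[f"true {label}" for label in labels],
    columns=[f"pred {label}" for label in labels],
)
print(cm_df)

# The diagonal holds the correct predictions, so accuracy is trace / total.
print("accuracy:", cm.trace() / cm.sum())
```

Passing `labels=` explicitly fixes the row and column order, which makes the matrices of different models directly comparable even when a model never predicts one of the classes.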

## Complete!

Submit your work by pushing the changes to GitHub, inviting the teacher(s) to your repository, and submitting the link on ItsLearning under Assignment 1.