## Exploring Scikit-Learn Datasets

In [1]:
from sklearn import datasets

## Initial Copilot Prompt: 
#### Provide examples to demonstrate each concept on this list to someone who is an absolute beginner in coding. I want step-by-step explanations. 
1. Working with Datasets (Focus of Part 2 of Course Lab Book)
2. Data Preprocessing (Focus of Part 3 of Course Lab Book)
3. Splitting Data (Focus of Part 3 of Course Lab Book)
4. Choosing and Training a Model (Focus of Part 3 of Course Lab Book)
5. Evaluating Model Performance (Focus of Part 3 of Course Lab Book)
6. Improving Model Performance (Optional)
7. Saving and Loading Models (Optional)

The prompts given after this response include:
1. Can you explain data preprocessing more?
2. What is hyperparameter tuning?
3. How do I visualize model performance?

## The purpose of these exercises is to gain a solid introductory understanding of the scikit-learn Python library

### 1. Testing the Working with Datasets Code

In [2]:
# Import necessary libraries
import pandas as pd

# Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
column_names = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]
iris = pd.read_csv(url, names=column_names)

# Display the first few rows of the dataset
print(iris.head())

   sepal_length  sepal_width  petal_length  petal_width        class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa


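##### Learning Outcome
- Understanding how to load a dataset into a pandas DataFrame and inspect its first rows
- Identifying different types of data (here, four numeric features and one categorical class label)
- Performing basic data operations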
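
The first cell imports `sklearn.datasets` but the notebook never uses it. As a minimal sketch, assuming scikit-learn 0.23 or later (for `as_frame=True`), the bundled copy of Iris can be loaded and explored without a download. The names `iris_bunch`, `iris_df`, and `species` are illustrative, and the bundled column names (for example `sepal length (cm)` and `target`) differ from the ones chosen for the UCI file above.

```python
from sklearn import datasets

# Load the bundled copy of Iris as a pandas DataFrame (no download needed)
iris_bunch = datasets.load_iris(as_frame=True)
iris_df = iris_bunch.frame  # four feature columns plus an integer "target" column

# Basic exploration: size, column types, and summary statistics
print(iris_df.shape)
print(iris_df.dtypes)
print(iris_df.describe())

# Count the samples per class, mapping the integer codes back to species names
species = iris_df["target"].map(dict(enumerate(iris_bunch.target_names)))
print(species.value_counts())
```

`dtypes` is a quick way to tell numeric columns from categorical ones, and `value_counts()` shows whether the classes are balanced.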

### 2. Testing the Data Preprocessing Code

In [3]:
# Check for missing values
print(iris.isnull().sum())

# Encode categorical variables (if any)
# In this case, 'class' is a categorical variable
iris['class'] = iris['class'].astype('category').cat.codes

# Scale numerical features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
iris[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']] = scaler.fit_transform(iris[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']])

# Display the first few rows of the preprocessed dataset
print(iris.head())

sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
class           0
dtype: int64
   sepal_length  sepal_width  petal_length  petal_width  class
0     -0.900681     1.032057     -1.341272    -1.312977      0
1     -1.143017    -0.124958     -1.341272    -1.312977      0
2     -1.385353     0.337848     -1.398138    -1.312977      0
3     -1.506521     0.106445     -1.284407    -1.312977      0
4     -1.021849     1.263460     -1.341272    -1.312977      0


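##### Learning Outcome
- Understanding how to scale numerical features with `StandardScaler`
- Learning basic data-cleaning techniques: checking for missing values and encoding a categorical column as integer codes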
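
To make the scaling step less of a black box, here is a small sketch that reproduces `StandardScaler` by hand. It assumes the bundled `load_iris()` data rather than the UCI file, and the variable names are illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

X_raw = datasets.load_iris().data  # shape (150, 4), measurements in centimetres

# StandardScaler computes z = (x - mean) / std for each column
X_scaled = StandardScaler().fit_transform(X_raw)

# The same calculation by hand (NumPy's default std is the population std, ddof=0)
X_manual = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

print(np.allclose(X_scaled, X_manual))  # both approaches agree
print(X_scaled.mean(axis=0).round(6))   # each column now has mean ~0
print(X_scaled.std(axis=0).round(6))    # and standard deviation ~1
```

After scaling, every feature is on the same scale, which is why the values in the output above are centred around zero.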

### 3. Testing the Splitting Data Code

In [4]:
from sklearn.model_selection import train_test_split

# Split the data into features (X) and target (y)
X = iris.drop('class', axis=1)
y = iris['class']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shapes of the training and testing sets
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(120, 4) (30, 4) (120,) (30,)


##### Learning Outcome
- Understanding how to split data into training and testing sets and how to check the shapes of the resulting sets
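
The split above is purely random, so the test set can end up with slightly uneven class counts. A possible variation, sketched here on the bundled `load_iris()` data, passes `stratify=y` so that each class keeps the same proportion in both sets.

```python
from collections import Counter

from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)

# stratify=y keeps the class proportions the same in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
print("Train class counts:", Counter(y_train.tolist()))
print("Test class counts:", Counter(y_test.tolist()))
```

With 50 samples per class and a 20% test size, each class contributes 40 training rows and 10 test rows.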

### 4. Testing the Choosing and Training a Model Code

In [5]:
from sklearn.tree import DecisionTreeClassifier

# Initialize the model
model = DecisionTreeClassifier()

# Train the model
model.fit(X_train, y_train)

# Display the model's parameters
print(model.get_params())

{'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'monotonic_cst': None, 'random_state': None, 'splitter': 'best'}


##### Learning Outcome
- Learning how to initialize and train a model using sklearn
- Understanding how to display model parameters
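
The parameters printed above show `random_state: None`, which means the tree can differ slightly between runs. The sketch below fixes `random_state` (the value 42 is an arbitrary choice) and prints the learned rules with `export_text`, which may be easier for a beginner to read than the parameter dictionary. It uses the bundled `load_iris()` data without scaling, so the thresholds are in centimetres.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris_bunch = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_bunch.data, iris_bunch.target, test_size=0.2, random_state=42
)

# Fixing random_state makes the trained tree reproducible between runs
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Inspect what the model actually learned
print("Tree depth:", model.get_depth())
print("Number of leaves:", model.get_n_leaves())
tree_rules = export_text(model, feature_names=list(iris_bunch.feature_names))
print(tree_rules)
```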

### 5. Testing the Evaluating Model Performance Code

In [6]:
from sklearn.metrics import accuracy_score, classification_report

# Make predictions on the test set
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

# Display classification report
print(classification_report(y_test, y_pred))

Accuracy: 1.0
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30



##### Learning Outcome
- Understanding how evaluating a model's performance on held-out test data helps us judge how well it generalizes to new data
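
A confusion matrix is another way to look at the same predictions: it counts, for each true class, how many samples were assigned to each predicted class. The following is a minimal sketch on the bundled `load_iris()` data, not the exact cell from the notebook.

```python
from sklearn import datasets
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Correct predictions sit on the diagonal; everything else is a mistake
print("Correct:", cm.trace(), "of", cm.sum())
```

Because the test set has only 30 samples, a perfect score here is encouraging but not strong evidence on its own.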

### 6. Testing the Improving Model Performance Code

In [7]:
from sklearn.metrics import accuracy_score, classification_report

# Make predictions on the test set
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

# Display classification report
print(classification_report(y_test, y_pred))

Accuracy: 1.0
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30
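
The cell above repeats the evaluation code from section 5, so no tuning actually takes place in it. As a hedged sketch of what an improvement step could look like, the example below runs a grid search with 5-fold cross-validation over the same parameter grid that appears in Copilot's later response, and then inspects the result. It uses the bundled `load_iris()` data, and `random_state=42` is an assumption added for reproducibility.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Candidate values to try for each hyperparameter
param_grid = {"max_depth": [3, 5, 7], "min_samples_split": [2, 5, 10]}

# Try every combination, scoring each with 5-fold cross-validation on the training set
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid_search.fit(X_train, y_train)

print("Best parameters:", grid_search.best_params_)
print("Best cross-validated accuracy:", grid_search.best_score_)

# The test set is used only once, after the parameters have been chosen
print("Test accuracy:", grid_search.score(X_test, y_test))
```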



### 7. Testing the Saving and Loading Models Code

In [None]:
import joblib

# Save the model to a file
joblib.dump(best_model, 'best_model.pkl')

# Load the model from the file
loaded_model = joblib.load('best_model.pkl')

# Verify the loaded model's performance
y_pred_loaded = loaded_model.predict(X_test)
accuracy_loaded = accuracy_score(y_test, y_pred_loaded)
print(f"Loaded Model Accuracy: {accuracy_loaded}")

NameError: name 'best_model' is not defined

This code raised a NameError. I asked Copilot what was wrong with the code and to fix it, but its initial response did not change anything. I then sent it the error message, and this is the updated response:

It looks like the variable best_model is not defined in your code. This usually happens if the hyperparameter tuning step was skipped or if the variable was not assigned correctly. Let's ensure that best_model is properly defined before saving it.

Here's a complete example that includes hyperparameter tuning and saving/loading the model:

Step-by-Step Example:

In [13]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
import joblib

# Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
column_names = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]
iris = pd.read_csv(url, names=column_names)

# Preprocess the data
iris['class'] = iris['class'].astype('category').cat.codes
scaler = StandardScaler()
iris[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']] = scaler.fit_transform(iris[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']])

# Split the data into features (X) and target (y)
X = iris.drop('class', axis=1)
y = iris['class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the model
model = DecisionTreeClassifier()

# Define hyperparameters to tune
param_grid = {'max_depth': [3, 5, 7], 'min_samples_split': [2, 5, 10]}

# Perform hyperparameter tuning
grid_search = GridSearchCV(model, param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Get the best model
best_model = grid_search.best_estimator_

# Save the model to a file
joblib.dump(best_model, 'best_model.pkl')

# Load the model from the file
loaded_model = joblib.load('best_model.pkl')

# Verify the loaded model's performance
y_pred_loaded = loaded_model.predict(X_test)
accuracy_loaded = accuracy_score(y_test, y_pred_loaded)
print(f"Loaded Model Accuracy: {accuracy_loaded}")

Loaded Model Accuracy: 1.0
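
One limitation of the example above is that only the tree is saved, while the fitted scaler is not, and the scaler was fitted on the full dataset before the split. A possible alternative, sketched below under the assumption that the bundled `load_iris()` data is acceptable, wraps the scaler and the model in a single pipeline so that the scaler is fitted on the training data only and both are saved together. The file name `iris_pipeline.joblib` is hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on the training data only, then trains the tree
pipeline = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=42))
pipeline.fit(X_train, y_train)

# Save and reload the whole pipeline (a temporary folder keeps the example tidy)
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "iris_pipeline.joblib")  # hypothetical file name
    joblib.dump(pipeline, model_path)
    loaded_pipeline = joblib.load(model_path)

# The reloaded pipeline accepts raw, unscaled measurements and predicts identically
same_predictions = np.array_equal(
    pipeline.predict(X_test), loaded_pipeline.predict(X_test)
)
print("Predictions match after reload:", same_predictions)
```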
