### Codio Activity 16.9: Investigating your own data

For this activity, you will build a classification model using a unique dataset that you source yourself.

With your dataset, you will compare the `LogisticRegression`, `KNeighborsClassifier`, and `SVC` estimators in terms of predictive performance and model-fitting speed. Optimize each model for whichever metric you believe is most appropriate for the task: `precision`, `recall`, or `accuracy`.
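To make the choice of metric concrete, here is a minimal sketch on made-up labels (the `y_true` and `y_pred` lists are hypothetical, not taken from any dataset in this activity). It shows how the three metrics can disagree on the same predictions when the classes are imbalanced.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels for an imbalanced binary problem (1 = positive class).
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

# Accuracy: share of all predictions that are correct (7 of 10).
accuracy = accuracy_score(y_true, y_pred)
# Precision: share of predicted positives that are truly positive (1 of 2).
precision = precision_score(y_true, y_pred)
# Recall: share of true positives that were found (1 of 3).
recall = recall_score(y_true, y_pred)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

For a multiclass target, the per-class precision and recall have to be averaged; in `GridSearchCV` this can be requested with scoring strings such as `'precision_macro'` or `'recall_macro'`.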

In [33]:
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

import warnings
warnings.filterwarnings("ignore")

### Gathering the data

If you plan to find a dataset of your own, consider using an example dataset from either [Kaggle](https://www.kaggle.com/) or the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php). Select a dataset that poses a classification problem. Download the data file and perform your analysis in a notebook locally. Be sure to grid search over different model parameters and compare the different estimators. Construct a DataFrame of the model results with the following information:

| model | train score | test score | average fit time |
| ----- | -----   | -------   | ------- |
| KNN | ? | ? | ? |
| Logistic Regression | ? | ? | ? |
| SVC | ? | ? | ? |

The assignment expects a DataFrame with exactly this structure, including the index and column names, and you will be graded on an exact match. One suggestion is to build the DataFrame locally, write it out to `.json`, and paste the result below to recreate it. Alternatively, you can write it out to a `.csv` file and copy the text, or simply hardcode the DataFrame based on your results.
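One possible way to carry results from a local notebook into this one is sketched below. It serializes the DataFrame to a JSON string with the standard `json` module and rebuilds it from that string. The numbers are placeholders for illustration only, and the variable names `results`, `payload`, and `restored` are assumptions of this sketch.

```python
import json

import pandas as pd

# Placeholder numbers for illustration only; substitute your own results.
results = pd.DataFrame({
    "model": ["KNN", "Logistic Regression", "SVC"],
    "train score": [0.91, 0.92, 0.93],
    "test score": [0.81, 0.82, 0.83],
    "average fit time": [0.011, 0.012, 0.013],
}).set_index("model")

# Local notebook: turn the index back into a column and dump to a JSON string.
payload = json.dumps(results.reset_index().to_dict(orient="list"))
print(payload)

# Submission notebook: paste the printed text, then rebuild the DataFrame
# and restore `model` as the index.
restored = pd.DataFrame(json.loads(payload)).set_index("model")
print(restored)
```

Resetting the index before serializing keeps the `model` labels in the payload, so the index can be restored with a single `set_index` call.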

### Problem 1

#### DataFrame of modeling results

Assign your constructed results DataFrame to `results_df` below.  Be sure that the `model` column above is the index of the DataFrame, and the three column names match the order and formatting of the example above.
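Before submitting, it may help to check the structure yourself. The helper below, `has_expected_structure`, is hypothetical and not part of the grader; it assumes the row order and column names shown in the example table, and it is demonstrated on a placeholder DataFrame filled with zeros.

```python
import pandas as pd

# Row and column labels assumed from the example table in this activity.
EXPECTED_INDEX = ["KNN", "Logistic Regression", "SVC"]
EXPECTED_COLUMNS = ["train score", "test score", "average fit time"]


def has_expected_structure(df):
    """Return True if `model` is the index and the labels match the example."""
    return (
        df.index.name == "model"
        and list(df.index) == EXPECTED_INDEX
        and list(df.columns) == EXPECTED_COLUMNS
    )


# Placeholder DataFrame (all zeros) used only to exercise the check.
candidate = pd.DataFrame(
    0.0,
    index=pd.Index(EXPECTED_INDEX, name="model"),
    columns=EXPECTED_COLUMNS,
)

print(has_expected_structure(candidate))                # model is the index
print(has_expected_structure(candidate.reset_index()))  # model is a column
```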

In [8]:
df = pd.read_csv('./data/iris.data', header=None)
df.columns = ['sl', 'sw', 'pl', 'pw', 'class']
df

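|  | sl | sw | pl | pw | class |
| ----- | ----- | ----- | ----- | ----- | ----- |
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |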


In [16]:
X = df[['sl', 'sw', 'pl', 'pw']]
y = df['class']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardizing the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
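The cell above fits the scaler on the whole training split, so every cross-validation fold later sees statistics computed partly from its own validation rows. An alternative, sketched here under assumptions, is to place the scaler inside a `Pipeline` so that it is refit within each fold. The sketch loads Iris through `sklearn.datasets.load_iris` rather than the local `./data/iris.data` file, and the step names `scale` and `knn` are arbitrary choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled copy of Iris stands in for the local CSV file.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The scaler is a pipeline step, so it is fit only on each fold's training part.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Parameters of a step are addressed as "<step name>__<parameter>".
param_grid = {"knn__n_neighbors": [3, 5, 7]}

grid = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

# The refit pipeline scales the unscaled test data with training statistics.
test_accuracy = grid.score(X_test, y_test)
print(grid.best_params_, test_accuracy)
```

On a small, clean dataset such as Iris the difference is likely negligible, but the pipeline form keeps the preprocessing and the model together as one estimator.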

### Logistic Regression

In [37]:
# Define the Logistic Regression model
log_reg = LogisticRegression(max_iter=1000)

# Hyperparameters to tune
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'solver': ['liblinear', 'lbfgs', 'saga'],
    # None (scikit-learn >= 1.2) replaces the older string 'none'.
    # Unsupported solver/penalty pairs fail to fit and are scored as NaN by GridSearchCV.
    'penalty': ['l1', 'l2', None]
}

# Set up GridSearchCV
grid_search_log_reg = GridSearchCV(log_reg, param_grid, cv=5, scoring='accuracy', n_jobs=-1, return_train_score=True)

# Fit the model
grid_search_log_reg.fit(X_train_scaled, y_train)

# Get the best estimator
best_log_reg = grid_search_log_reg.best_estimator_

# Train and test scores using the best estimator
train_score_log_reg = best_log_reg.score(X_train_scaled, y_train)
test_score_log_reg = best_log_reg.score(X_test_scaled, y_test)

# Mean fit time across the CV folds for the best parameter combination
average_fit_time_log_reg = grid_search_log_reg.cv_results_['mean_fit_time'][grid_search_log_reg.best_index_]

# Display the results
print(f"Best parameters (Logistic Regression): {grid_search_log_reg.best_params_}")
print(f"Train Score (best estimator): {train_score_log_reg}")
print(f"Test Score (best estimator): {test_score_log_reg}")
print(f"Average Fit Time (best estimator): {average_fit_time_log_reg}")



Best parameters (Logistic Regression): {'C': 1, 'penalty': 'l2', 'solver': 'lbfgs'}
Train Score (best estimator): 0.9666666666666667
Test Score (best estimator): 1.0
Average Fit Time (best estimator): 0.00690765380859375
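To see where the fit time reported above comes from, the following sketch loads `cv_results_` into a DataFrame and reads the row selected by `best_index_`. It uses `load_iris` and a deliberately small grid as stand-ins, so its numbers will not match the output above.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Assumption: the bundled Iris data and a small grid, for illustration only.
X, y = load_iris(return_X_y=True)
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": [0.1, 1, 10]},
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

# One row per parameter combination; fold-level timings are already averaged.
cv_table = pd.DataFrame(search.cv_results_)
best_row = cv_table.loc[search.best_index_]

# Mean time (seconds) to fit the best combination on one CV training fold.
mean_fit_time = best_row["mean_fit_time"]
# Time (seconds) to refit the best combination on all the data passed to fit.
refit_time = search.refit_time_

print(cv_table[["params", "mean_fit_time", "mean_test_score"]])
print(mean_fit_time, refit_time)
```

Note that `mean_fit_time` describes a single parameter combination, not the duration of the whole grid search.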




### K Nearest Neighbors

In [26]:
# Define the KNeighborsClassifier model
knn = KNeighborsClassifier()

# Hyperparameters to tune
param_grid = {
    'n_neighbors': [3, 5, 7, 9, 11],
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan', 'minkowski']
}

# Set up GridSearchCV
grid_search_knn = GridSearchCV(knn, param_grid, cv=5, scoring='accuracy', n_jobs=-1, return_train_score=True)

# Fit the model
grid_search_knn.fit(X_train_scaled, y_train)

# Get the best estimator
best_knn = grid_search_knn.best_estimator_

# Train and test scores using the best estimator
train_score_knn = best_knn.score(X_train_scaled, y_train)
test_score_knn = best_knn.score(X_test_scaled, y_test)

# Mean fit time across the CV folds for the best parameter combination
average_fit_time_knn = grid_search_knn.cv_results_['mean_fit_time'][grid_search_knn.best_index_]

# Display the results
print(f"Best parameters (KNN): {grid_search_knn.best_params_}")
print(f"Train Score (best estimator): {train_score_knn}")
print(f"Test Score (best estimator): {test_score_knn}")
print(f"Average Fit Time (best estimator): {average_fit_time_knn}")

Best parameters (KNN): {'metric': 'euclidean', 'n_neighbors': 9, 'weights': 'distance'}
Train Score (best estimator): 1.0
Test Score (best estimator): 1.0
Average Fit Time (best estimator): 0.0025228023529052734


### Support Vector Classifier

In [35]:
# Define the SVC model
svc = SVC()

# Hyperparameters to tune
param_grid = {
    'C': [0.1, 1, 10, 100],
    'kernel': ['linear', 'rbf', 'poly'],
    'gamma': ['scale', 'auto'],
    'degree': [2, 3, 4]  # Only relevant for the 'poly' kernel
}

# Set up GridSearchCV
grid_search_svc = GridSearchCV(svc, param_grid, cv=5, scoring='accuracy', n_jobs=-1, return_train_score=True)

# Fit the model
grid_search_svc.fit(X_train_scaled, y_train)

# Get the best estimator
best_svc = grid_search_svc.best_estimator_

# Train and test scores using the best estimator
train_score_svc = best_svc.score(X_train_scaled, y_train)
test_score_svc = best_svc.score(X_test_scaled, y_test)

# Mean fit time across the CV folds for the best parameter combination
average_fit_time_svc = grid_search_svc.cv_results_['mean_fit_time'][grid_search_svc.best_index_]

# Display the results
print(f"Best parameters (SVC): {grid_search_svc.best_params_}")
print(f"Train Score (best estimator): {train_score_svc}")
print(f"Test Score (best estimator): {test_score_svc}")
print(f"Average Fit Time (best estimator): {average_fit_time_svc}")

Best parameters (SVC): {'C': 10, 'degree': 2, 'gamma': 'scale', 'kernel': 'linear'}
Train Score (best estimator): 0.9666666666666667
Test Score (best estimator): 0.9666666666666667
Average Fit Time (best estimator): 0.000933837890625
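Because `degree` only affects the `poly` kernel and `gamma` is not used by the `linear` kernel, the single grid above fits many equivalent candidates. A minimal sketch of an alternative is to pass a list of grids, one per kernel, so each kernel is searched only over the parameters it uses. The names `full_grid` and `split_grid` are introduced here for illustration; `ParameterGrid` is used only to count candidates, so nothing is fit.

```python
from sklearn.model_selection import ParameterGrid

# The single grid from the cell above: every kernel is crossed with every
# gamma and degree value, whether or not the kernel uses them.
full_grid = {
    "C": [0.1, 1, 10, 100],
    "kernel": ["linear", "rbf", "poly"],
    "gamma": ["scale", "auto"],
    "degree": [2, 3, 4],
}

# One grid per kernel, listing only the parameters that kernel actually uses.
split_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10, 100]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10, 100], "gamma": ["scale", "auto"]},
    {
        "kernel": ["poly"],
        "C": [0.1, 1, 10, 100],
        "gamma": ["scale", "auto"],
        "degree": [2, 3, 4],
    },
]

n_full = len(ParameterGrid(full_grid))
n_split = len(ParameterGrid(split_grid))
print(n_full, n_split)
```

`GridSearchCV` accepts `split_grid` directly as its `param_grid` argument. One side effect is that `best_params_` then reports only the parameters present in the winning sub-grid.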


In [45]:
### GRADED

# Build the DataFrame in a single call from one dict per model
# (avoids repeatedly concatenating onto an empty DataFrame).
# Note: 'model' is still a regular column here, so the shape is (3, 4).
results_df = pd.DataFrame([
    {
        'model': 'KNN',
        'train score': train_score_knn,
        'test score': test_score_knn,
        'average fit time': average_fit_time_knn
    },
    {
        'model': 'Logistic Regression',
        'train score': train_score_log_reg,
        'test score': test_score_log_reg,
        'average fit time': average_fit_time_log_reg
    },
    {
        'model': 'SVC',
        'train score': train_score_svc,
        'test score': test_score_svc,
        'average fit time': average_fit_time_svc
    }
])

### ANSWER CHECK
print(type(results_df))
print(results_df.shape)
results_df

<class 'pandas.core.frame.DataFrame'>
(3, 4)


|  | model | train score | test score | average fit time |
| ----- | ----- | ----- | ----- | ----- |
| 0 | KNN | 1.0 | 1.0 | 0.002523 |
| 1 | Logistic Regression | 0.966667 | 1.0 | 0.006908 |
| 2 | SVC | 0.966667 | 0.966667 | 0.000934 |


In [49]:
### GRADED

res_dict = {'model': ['KNN', 'Logistic Regression', 'SVC'],
           'train score': [1.000000, 0.966667, 0.966667],
           'test score': [1.000000, 1.000000, 0.966667],
           'average fit time': [0.002523, 0.006908, 0.000934]}
results_df = pd.DataFrame(res_dict).set_index('model')

### ANSWER CHECK
print(type(results_df))
print(results_df.shape)
results_df

<class 'pandas.core.frame.DataFrame'>
(3, 3)


| model | train score | test score | average fit time |
| ----- | ----- | ----- | ----- |
| KNN | 1.0 | 1.0 | 0.002523 |
| Logistic Regression | 0.966667 | 1.0 | 0.006908 |
| SVC | 0.966667 | 0.966667 | 0.000934 |
