# Week 06 - Classification Models KNN
The goal of this assignment is to introduce students to classification models, helping them understand how to apply machine learning algorithms to predict categories. Students will work with two datasets and will use a variety of classification models to compare their performance. The assignment will focus on using **K-Nearest Neighbor** in this Jupyter Notebook.
<br><br>
The libraries that you will use are listed in the code cell below. Please run the code cell. See the Getting Started module in the Canvas Classroom to see the parameter options available to each class and method used.


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report, balanced_accuracy_score
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold

# Reading in the Data set
You will use the Wine Quality data set to predict the wine quality. This data set was originally used in regression models since the quality was numeric. We will change the output to Low=0, Medium=1, and High=2. The code below reads the data into a dataframe from a dataset located on the Internet.
 
 
- apply() parameters: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html#pandas.DataFrame.apply
- lambda definition: https://www.geeksforgeeks.org/python-lambda-anonymous-functions-filter-map-reduce/#
- lambda with dataframes: 


```
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
df = pd.read_csv(url, sep=';')  # CSV uses ";" as a separator

# if the quality code is less than or equal to 5 change to 0.
# if the quality code is 6 change to 1.
# If greater than 6 change quality code to 2.
df['quality'] = df['quality'].apply(lambda x: 0 if x <= 5 else (1 if x == 6 else 2))
```
To see the parameters of pd.read_csv click this url: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
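
To see what the lambda in the cell above does, here is a minimal sketch that applies the same rule to a handful of made-up quality scores. The values and the names `scores` and `labels` are illustrative only and are not taken from the wine data.

```python
import pandas as pd

# Hypothetical quality scores, chosen only to exercise all three branches
scores = pd.Series([3, 5, 6, 7, 8])

# Same rule as above: <= 5 -> 0 (Low), 6 -> 1 (Medium), > 6 -> 2 (High)
labels = scores.apply(lambda x: 0 if x <= 5 else (1 if x == 6 else 2))

print(labels.tolist())  # [0, 0, 1, 2, 2]
```

The nested conditional is evaluated left to right, so the `x == 6` test is only reached for values greater than 5.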

# Question 1
1. Read the wine data set into a dataframe variable called wine.
2. Change the quality column using the lambda function with the apply() method to create three classes for target variable (see code above).
3. Display the first 5 records in the dataframe.

In [None]:
# Question 1

# Data Structure and Summary Statistics
List the data structure and the summary statistics using the info() and describe().T methods. The describe(include=['object']).T form summarizes string columns; the wine data set has none, so it is not needed here.

- info() parameters: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html#pandas.DataFrame.info
- describe() parameters: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html#pandas.DataFrame.describe
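
As a rough illustration of what these two methods return, the sketch below runs them on a tiny made-up dataframe. The name `toy` and its values are assumptions for demonstration, not the wine data.

```python
import pandas as pd

# Hypothetical four-row dataframe with two numeric columns
toy = pd.DataFrame({"alcohol": [9.4, 9.8, 10.5, 11.0], "quality": [0, 0, 1, 2]})

toy.info()                  # column names, non-null counts and dtypes

summary = toy.describe().T  # one row per numeric column, one column per statistic
print(summary)
```

Transposing with `.T` puts each feature on its own row, which tends to be easier to read when there are many columns.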



# Question 2
1. Use the appropriate dataframe variable and list the data structure using info() method.
2. Use the appropriate dataframe variable and list the data structure using describe().T for the numerical values.

In [None]:
# Question 2 - 1)


In [None]:
# Question 2 - 2)


# Determine balanced data
Is the data balanced? Does each value of the target variable occur about equally often?

- df['column'].value_counts()
- value_counts method parameters: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.value_counts.html#pandas.DataFrame.value_counts 

```
df['column'].value_counts(normalize=True)*100
```
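
A minimal sketch of both calls on a made-up, deliberately imbalanced column follows. The dataframe `toy` and its class counts are hypothetical.

```python
import pandas as pd

# Hypothetical target column: five 0s, four 1s and one 2
toy = pd.DataFrame({"quality": [0, 0, 0, 0, 0, 1, 1, 1, 1, 2]})

counts = toy["quality"].value_counts()                        # raw count per class
percent = toy["quality"].value_counts(normalize=True) * 100   # share per class, in percent

print(counts)
print(percent)
```

Here class 2 makes up only 10% of the rows, so this toy column would be considered imbalanced.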

# Question 3
1. Use the value_counts() method to determine whether the values of the target variable, quality, occur equally often.
2. Use the parameter normalize=True to determine the proportion of each quality value. You can multiply by 100 to get the percentage.

In [None]:
# Question 3 - 1)


In [None]:
# Question 3 - 2)


# Splitting the dataset
It is time to split the dataset into training and testing data sets. Since the classes in the quality column are not equally represented, we will use a stratified random sample based on the quality column.

- train_test_split parameters: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
- train, test = train_test_split(df, test_size=0.2, random_state=33, stratify=df['target column'])
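
The effect of `stratify` can be sketched on a small made-up dataframe. The names `toy`, `toy_train` and `toy_test` and the 80/20 class mix are assumptions chosen so the proportions are easy to check by eye.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 80 rows of class 0 and 20 rows of class 1
toy = pd.DataFrame({"feature": range(100), "target": [0] * 80 + [1] * 20})

# stratify keeps the class proportions the same in both splits
toy_train, toy_test = train_test_split(
    toy, test_size=0.2, random_state=33, stratify=toy["target"]
)

print(toy_train["target"].value_counts(normalize=True))
print(toy_test["target"].value_counts(normalize=True))
```

Both splits should show the same 80/20 class mix as the full dataframe, which is the point of stratifying.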

# Question 5
1. Split the data into train and test dataframes using the train_test_split function.
2. Split the data 70% in training and 30% in testing.
3. Use the random_state = 42. (seed for random number generator)
4. Stratify the data by the target variable.

In [None]:
# Question 5


# View the percentage in training and test set
We stratified the sample based on the target variable.  We can see the results of the target variables in the test and training set by using the value_counts(normalize=True).

```
df['column'].value_counts(normalize=True)    # proportion of each class in the target column of df
df_t['column'].value_counts(normalize=True)  # same for another dataframe, e.g. a training or test split
```
# Question 6
1. Determine the percentage for the target variable for the entire data set.
2. Determine the percentage for the target variable in the training data set.
3. Determine the percentage for the target variable in the testing data set.

In [None]:
# Question 6 - 1)


In [None]:
# Question 6 - 2)


In [None]:
# Question 6 - 3)


# Correlation of training data set
Create a correlation matrix of the training data set using the corr() method
- corr() parameters: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html#pandas.DataFrame.corr

```
df_t.corr(numeric_only=True)
```

# Question 7
1. Use the corr() method with numeric_only=True on the appropriate dataframe to find the correlation matrix of the training data set.


In [None]:
# Question 7


# Data Visualization
Create a correlation heatmap to easily view the correlation coefficient matrix. If the correlation between two features is greater than 0.75 or less than -0.75, you should drop one of the two features.


- sns.heatmap parameters: https://seaborn.pydata.org/generated/seaborn.heatmap.html
- plt.figure() parameters: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.figure.html
- plt.title() parameters: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.title.html#matplotlib.pyplot.title
- plt.show() parameters: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.show.html#matplotlib.pyplot.show

<br>

```
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title("Feature Correlation Heatmap")
plt.show()

```
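
Besides reading the heatmap by eye, the 0.75 rule of thumb can also be checked in code. The sketch below is one possible approach on made-up data, where the dataframe `toy` and its columns `a`, `b` and `c` are hypothetical and `b` is built to be nearly a copy of `a`.

```python
import numpy as np
import pandas as pd

# Hypothetical features: b is almost a copy of a, c is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

corr_matrix = toy.corr(numeric_only=True)

# Keep only the strict upper triangle so the diagonal and mirrored pairs are masked out
mask = np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1)
pairs = corr_matrix.where(mask).stack()

# Feature pairs whose absolute correlation exceeds the 0.75 threshold
high = pairs[pairs.abs() > 0.75]
print(high)
```

Only the pair (`a`, `b`) should be reported here, so one of those two columns would be a candidate for dropping.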


# Question 8
1. Create a heatmap of the correlation coefficient of train data set by applying the appropriate dataframe to the code above.

In [None]:
# Question 8


# Separate Features and Target Variables
It is now time to separate the Features and the target variables.

Use the drop('Target Variable', axis=1) method to drop the target variable.
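
A minimal sketch of this separation on a made-up dataframe is shown below. The names `toy_train`, `X_toy` and `y_toy` and the values are hypothetical.

```python
import pandas as pd

# Hypothetical training dataframe with two features and the target column
toy_train = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5],
    "pH": [3.51, 3.20, 3.26],
    "quality": [0, 0, 1],
})

y_toy = toy_train["quality"]               # target variable as a Series
X_toy = toy_train.drop("quality", axis=1)  # every column except the target

print(X_toy.columns.tolist())  # ['alcohol', 'pH']
```

Note that drop() returns a new dataframe, so `toy_train` itself still contains the quality column afterwards.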

# Question 9
1. Using the appropriate dataframe for the training data assign the Target variable to y.  Then using the appropriate dataframe for the training data assign the features to X by dropping the target variable.
2. Using the appropriate dataframe for the testing data assign the Target variable to y_test.  Then using the appropriate dataframe for the testing data assign the features to X_test by dropping the target variable.

In [None]:
# Question 9 - 1)


In [None]:
# Question 9 - 2)


# Create model K-Nearest Neighbor
The first model you will create is the K-Nearest Neighbor model, where the number of neighbors is set by the parameter n_neighbors. Since the data set has no missing values, we can eliminate the imputer. Since there are no string columns, we can also eliminate the category pipeline. So only StandardScaler and KNeighborsClassifier are needed.

- **KNeighborsClassifier** parameters: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
- **StandardScaler** parameters: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler
- **StratifiedKFold** parameters: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html#sklearn.model_selection.StratifiedKFold

```
sk = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)  # keeps class proportions in every fold (helps with imbalanced data)

cv_score = cross_val_score(knn_pipeline, X, y, cv=sk, scoring='accuracy')

```
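
To show how the scaler, the classifier and the stratified folds fit together, here is a hedged sketch on synthetic data. The data from `make_classification` and the names `X_demo`, `y_demo`, `demo_pipeline`, `skf_demo` and `demo_scores` are assumptions standing in for the wine features, so the scores say nothing about the real data set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced three-class data (not the wine data set)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, weights=[0.5, 0.35, 0.15], random_state=42,
)

# Scale first, then classify; the scaler is refit on each training fold
demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

skf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
demo_scores = cross_val_score(demo_pipeline, X_demo, y_demo, cv=skf_demo, scoring="accuracy")

print(demo_scores)         # one accuracy value per fold
print(demo_scores.mean())  # average accuracy across the five folds
```

Putting the scaler inside the pipeline means each fold is scaled using only its own training portion, which avoids leaking information from the held-out fold. Scaling matters for KNN because the distances it relies on are sensitive to feature ranges.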


# Question 13
1. Using what you learned in the last two homework assignments, create a pipeline for K-Nearest Neighbor called knn_pipeline using Pipeline with StandardScaler and KNeighborsClassifier with n_neighbors=5.
2. Fit knn_pipeline with the X and y variables of the training data set.
3. Use knn_pipeline to predict on X and assign the result to variable y_pred.
4. Using the code above, create a StratifiedKFold to deal with the imbalanced data. Assign it to variable skf.
5. Using the code above, compute the cross-validation scores for the training data set. Assign them to variable train_scores.
6. Use knn_pipeline to predict on X_test and assign the result to variable y_pred_test.
7. Using the code above, compute the cross-validation scores for the test data set (X_test, y_test). Assign them to variable test_scores.

In [None]:
# Question 13  - 1) and 2)


In [None]:
# Question 13  - 3) and 4)


In [None]:
# Question 13 -  5)  thru 7)


# Determine the Metrics for KNN
Since the data is not balanced we need more than the accuracy score.  We will compute the precision, recall, f1 score and the confusion matrix.

- **balanced_accuracy_score** parameters: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.balanced_accuracy_score.html#sklearn.metrics.balanced_accuracy_score
- **precision_score** parameters: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html#sklearn.metrics.precision_score
- **recall_score** parameters: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html#sklearn.metrics.recall_score
- **f1_score** parameters: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score
- **confusion_matrix** parameters: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html#sklearn.metrics.confusion_matrix

```
m1 = balanced_accuracy_score(y, y_pred)             # training set
m1t = balanced_accuracy_score(y_test, y_pred_test)  # test set
print("Training balanced accuracy score:", m1)
print("Testing balanced accuracy score:", m1t)
# quality has three classes, so precision, recall and f1 need an averaging method
m2 = precision_score(y, y_pred, average='weighted')
m3 = recall_score(y, y_pred, average='weighted')
m4 = f1_score(y, y_pred, average='weighted')
cm = confusion_matrix(y, y_pred)
cmn = confusion_matrix(y, y_pred, normalize='true')  # each row sums to 1
print("Training Confusion Matrix\n", cm)
print("Training Confusion Matrix normalize \n", cmn)
```
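
The metrics can be checked by hand on a small made-up example. In the sketch below the label lists `y_true` and `y_hat` are hypothetical, and `average='weighted'` is one reasonable choice for a three-class target rather than the only one.

```python
from sklearn.metrics import (balanced_accuracy_score, confusion_matrix,
                             f1_score, precision_score, recall_score)

# Hypothetical true and predicted labels for ten samples across three classes
y_true = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_hat = [0, 0, 0, 1, 1, 1, 0, 2, 2, 1]

bal_acc = balanced_accuracy_score(y_true, y_hat)        # mean of the per-class recalls
prec = precision_score(y_true, y_hat, average="weighted")
rec = recall_score(y_true, y_hat, average="weighted")   # equals plain accuracy when weighted
f1 = f1_score(y_true, y_hat, average="weighted")

cm_demo = confusion_matrix(y_true, y_hat)                     # rows = true class, columns = predicted class
cmn_demo = confusion_matrix(y_true, y_hat, normalize="true")  # each row divided by its row total

print(bal_acc, prec, rec, f1)
print(cm_demo)   # [[3 1 0], [1 2 0], [0 1 2]]
print(cmn_demo)
```

The diagonal of the normalized matrix holds the recall of each class (0.75, 2/3 and 2/3 here), and their average is the balanced accuracy.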


# Question 14
1. Print out the accuracy scores calculated in Question 13 for the training and test data sets.
2. Compute the precision score for the training and test data sets and put them in train_p and test_p.  Print both out.
3. Compute the recall score for the training and test data sets and put them in train_r and test_r.  Print both out.
4. Compute the f1_score for the training and test data sets and put them in train_f1 and test_f1.  Print both out.
5. Compute the confusion matrix both without normalization and with normalize='true'.
6. Compute the confusion matrix with normalize='true' and print it out for the training and testing data sets.  Put the results in train_cmn and test_cmn.


In [None]:
# Question 14  1)


In [None]:
# Question 14  2)


In [None]:
# Question 14  3)


In [None]:
# Question 14  4)


In [None]:
# Question 14  5)


In [None]:
# Question 14  6)


# Question 15
Is this a good model, based on the accuracy score? Why? Is it overfitting?


In [None]:
ans=''
why=''
overfitting=''