# Loading Dataset from winequality.csv

In [1]:
#importing python libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from sklearn.metrics import classification_report
from sklearn import metrics
from sklearn.naive_bayes import GaussianNB
warnings.filterwarnings("ignore")

In [2]:
#importing custom python modules
import modules as winequality
import ModelEvaluation as model
import VisualizationForMisclassification as visualize
import TrainTest_Split_Traversal as train_test_split
import CrossValidationFold_Traversal as Kfolf_traversal
import CalibrationPlot as calibration_plot

In [3]:
winequality = pd.read_csv('../../../datasets/winequality.csv')

# Exploratory Data Analysis

In [4]:
winequality.head(5)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,recommend
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6,False
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6,False
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6,False
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6,False
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6,False


In [5]:
# `winequality` was rebound to the DataFrame in cell 3, so the helper is called on the module itself
import modules
modules.dataset_statistics(winequality)

In [None]:
# Count the label values to check whether the classes are imbalanced
winequality['recommend'].value_counts()

In [None]:
len(winequality.columns)

In [None]:
winequality.iloc[:, -1]

Since the dataset has a disproportionate ratio of observations in each target class, we will apply data resampling techniques to balance it.
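
The imbalance mentioned above can be quantified as a ratio between the largest and smallest class. The following is a minimal sketch on a toy stand-in for the `recommend` column; the values and the name `imbalance_ratio` are illustrative assumptions, not figures from winequality.csv.

```python
import pandas as pd

# Toy stand-in for the `recommend` column (assumed values, not the real data).
labels = pd.Series([False] * 8 + [True] * 2, name="recommend")

# Absolute counts and relative shares per class.
counts = labels.value_counts()
shares = labels.value_counts(normalize=True)

# Ratio of the largest class to the smallest; 1.0 would mean perfectly balanced.
imbalance_ratio = counts.max() / counts.min()

print(counts)
print(shares.round(2))
print(f"imbalance ratio: {imbalance_ratio:.1f}")
```

A ratio well above 1 suggests that plain accuracy will be dominated by the majority class, which is the motivation for resampling.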

## DataSet Statistics 

Comparing the standard deviation with the mean of each feature shows that most data points lie
close to their means, with only a few outliers present in the dataset.

In [None]:
#to check if outliers are present in the data
winequality.describe().transpose()
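
The summary table gives a visual impression of outliers; a count per column makes the check explicit. The sketch below uses the 1.5 × IQR rule, which is a swapped-in convention rather than the method used in this notebook, and the helper `count_iqr_outliers` and the toy values are hypothetical.

```python
import pandas as pd

# Hypothetical toy column; 20.7 mimics an unusually high residual-sugar reading.
df = pd.DataFrame({"residual sugar": [1.6, 6.9, 8.5, 8.5, 7.0, 5.5, 6.2, 20.7]})


def count_iqr_outliers(frame: pd.DataFrame) -> pd.Series:
    """Count values per numeric column outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the DataFrame columns.
    outside = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
    return outside.sum()


outlier_counts = count_iqr_outliers(df)
print(outlier_counts)
```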

#  Data Visualization  

## Bar Chart

In [None]:
import modules  # helper module; `winequality` now names the DataFrame
modules.BarChart(winequality)

## Correlation Matrix

Correlation helps to find relationships between variables. According to the correlation graph,

- pH and sulphates have little impact on the overall data. For this reason, these features can be eliminated from the dataset.

In [None]:
import modules  # helper module; `winequality` now names the DataFrame
modules.Correlation_matrix(winequality)
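
The internals of `Correlation_matrix` are not shown here. As a rough sketch of the underlying idea, pandas can compute the Pearson correlation matrix directly, and features can then be ranked by their absolute correlation with a chosen column. The data below is synthetic and the column relationships are assumptions built in for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic columns: `density` is constructed to track `alcohol`, `pH` is independent noise.
alcohol = rng.normal(10.5, 1.0, n)
df = pd.DataFrame({
    "alcohol": alcohol,
    "density": 1.0 - 0.002 * alcohol + rng.normal(0, 0.0005, n),
    "pH": rng.normal(3.2, 0.15, n),
})

# Pearson correlation between every pair of numeric columns.
corr = df.corr()

# Rank the other features by the strength of their correlation with `alcohol`.
ranking = corr["alcohol"].drop("alcohol").abs().sort_values(ascending=False)

print(corr.round(2))
print(ranking.round(2))
```

Passing `corr` to `sns.heatmap` would render the matrix as a colour-coded grid. Features at the bottom of such a ranking are the candidates for removal.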

# Label Encoding

Label encoding converts categorical labels to their numeric representation.

In [None]:
import modules  # helper module; `winequality` now names the DataFrame
vdataset = modules.label_encoding(winequality)
vdataset.head()
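
What `label_encoding` does internally is not shown. A minimal sketch of the technique with scikit-learn's `LabelEncoder` could look like this; the column name `recommend_code` matches the one used later in this notebook, while dropping the original `recommend` column is an assumption.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with a boolean target, mirroring the `recommend` column.
df = pd.DataFrame({
    "quality": [6, 6, 7, 5],
    "recommend": [False, False, True, False],
})

# Map each distinct label to an integer code (classes are sorted first).
encoder = LabelEncoder()
df["recommend_code"] = encoder.fit_transform(df["recommend"])

# Assumption: the original label column is dropped once it has been encoded.
encoded = df.drop(columns=["recommend"])

print(encoder.classes_)
print(encoded)
```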

# Model Evaluation 

Model evaluation determines the accuracy of different classifiers using KFold cross-validation.
For this purpose, the following classifiers are used:

- Logistic Regression
- K-Nearest Neighbors
- Gaussian NaiveBayes
- Support Vector Machine
- Random Forest

The graph shown below lets us compare the range and distribution of the accuracy for each model.
As shown, we can conclude that <b>Gaussian NaiveBayes</b> and <b>Support Vector Machine</b> perform better than the other classifiers. For this dataset we will use <b>Gaussian NaiveBayes</b>.

In [None]:
model.Evaluation_model(vdataset)
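
The body of `Evaluation_model` is not part of this notebook. The sketch below shows one way such a comparison could be set up with scikit-learn; the synthetic data, the five folds, and the hyperparameters are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the encoded wine data.
X, y = make_classification(n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Gaussian NaiveBayes": GaussianNB(),
    "Support Vector Machine": SVC(),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# The same shuffled folds are reused for every model so the scores are comparable.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = {
    name: cross_val_score(estimator, X, y, cv=cv, scoring="accuracy")
    for name, estimator in models.items()
}

for name, fold_scores in scores.items():
    print(f"{name}: mean={fold_scores.mean():.3f} std={fold_scores.std():.3f}")
```

The per-fold arrays in `scores` are what a box plot of accuracy ranges would be drawn from.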

# Training Model

The data is split into two sets: 70% for training and 30% for testing.

In [None]:
#Splitting data for training and testing
import modules  # helper module; `winequality` now names the DataFrame
X_train, X_test, y_train, y_test = modules.splitting_train_test_data(vdataset)
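
For reference, a 70/30 split can be sketched with scikit-learn as follows. Whether `splitting_train_test_data` stratifies or fixes a seed is unknown, so `stratify` and `random_state` are assumptions. The function is imported under the alias `sk_train_test_split` because this notebook already uses the name `train_test_split` for a custom module.

```python
import numpy as np
from sklearn.model_selection import train_test_split as sk_train_test_split

# Toy data: 100 rows, two features, 80/20 class ratio.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

# test_size=0.3 gives the 70/30 ratio; stratify keeps the class ratio in both parts.
X_train, X_test, y_train, y_test = sk_train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
print("positives in test set:", int(y_test.sum()))
```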

In [None]:
y_test

# Oversampling

In [None]:
import modules  # helper module; `winequality` now names the DataFrame
oversample = modules.Oversampling(X_train, y_train)
# check new class counts
oversample.recommend_code.value_counts()

In [None]:
oversample

In [None]:
y_train = oversample.recommend_code
X_train = oversample.drop(["recommend_code"], axis=1)
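
How `Oversampling` balances the classes is not shown. One common approach is random oversampling of the minority class with replacement, sketched here with `sklearn.utils.resample` on toy data. This is an assumed technique; the helper may use a different one, such as SMOTE.

```python
import pandas as pd
from sklearn.utils import resample

# Toy training frame with an 8:2 class ratio.
train = pd.DataFrame({
    "alcohol": [8.8, 9.5, 10.1, 9.9, 9.9, 9.6, 11.0, 12.0, 12.4, 13.0],
    "recommend_code": [0] * 8 + [1] * 2,
})

majority = train[train["recommend_code"] == 0]
minority = train[train["recommend_code"] == 1]

# Draw minority rows with replacement until both classes have the same size.
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=0
)

# Recombine and shuffle so the classes are not grouped together.
oversample = (
    pd.concat([majority, minority_upsampled])
    .sample(frac=1, random_state=0)
    .reset_index(drop=True)
)

print(oversample["recommend_code"].value_counts())
```

As in the cells above, only the training split is resampled, so the test set keeps the original class distribution.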

# Training Model with Gaussian NaiveBayes

In [None]:
classifier = GaussianNB()

In [None]:
classifier = classifier.fit(X_train, y_train)

In [None]:
import modules  # helper module; `winequality` now names the DataFrame
y_predict = modules.test_classifier(classifier, X_test)

In [None]:
accuracy =  metrics.accuracy_score(y_test, y_predict)
print("Accuracy: ",accuracy*100)
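
On an imbalanced target, accuracy is easier to judge next to the share of the majority class, which is the score a model would reach by always predicting that class. The following sketch runs the same fit, predict, and score steps on synthetic data; the dataset and the name `majority_share` are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split as sk_train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic, imbalanced stand-in for the wine data.
X, y = make_classification(n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = sk_train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Fit Gaussian Naive Bayes and predict the held-out labels.
classifier = GaussianNB().fit(X_train, y_train)
y_predict = classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_predict)

# Baseline: accuracy obtained by always predicting the most frequent class.
majority_share = max((y_test == 0).mean(), (y_test == 1).mean())

print(f"Accuracy: {accuracy * 100:.1f}")
print(f"Majority-class baseline: {majority_share * 100:.1f}")
```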

# Confusion Matrix

In [None]:
import modules  # helper module; `winequality` now names the DataFrame
modules.model_confusion_matrix(y_test, y_predict, vdataset)

In [None]:
import modules  # helper module; `winequality` now names the DataFrame
modules.model_classification_report(y_test, y_predict)
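
The two helpers above presumably wrap scikit-learn's `confusion_matrix` and `classification_report`; that is an assumption. The sketch below shows how the four cells of a binary confusion matrix relate to per-class recall, using made-up labels.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Made-up labels: 6 negatives and 4 positives.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 0, 0, 1, 1]

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

# Recall of the positive class: the share of actual positives that were found.
recall_positive = tp / (tp + fn)

print(cm)
print(f"recall (class 1): {recall_positive:.2f}")
print(classification_report(y_true, y_pred))
```

A low recall on the positive class alongside a reasonable overall accuracy is the pattern to look for when most errors come from one class.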

# Visualization of Misclassification

The stacked bar chart below shows the points misclassified by the Gaussian NaiveBayes classifier. Most of the misclassified points belong to <b>Class:True</b> but were predicted as <b>Class:False</b>. Therefore, despite its high accuracy, the model is not good, because its misclassification ratio for this class is poor. We might be missing some details in the dataset. As a next step, I would look further into the dataset to find out why our model is overfitting the data.

In [None]:
visualize.Misclasssification_visualization(y_test, y_predict,vdataset)

# Calibration plot

For classification problems, predicting the probability that an observation belongs to each class is often more useful than predicting the class value
directly. To assess these probabilities, we use a calibration plot.

In [None]:
models = model.define_models()

In [None]:
calibration_plot.Calibrated_Curve(models,X_train,y_train,X_test,y_test )

# Interpretation of the plot:

For each bin, the y-value is the fraction of observations that actually belong to the positive class, and the x-value is the mean predicted probability. Therefore, a well-calibrated model has a calibration curve that hugs the straight line y=x.
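
The binned points described above can be computed with scikit-learn's `calibration_curve`. The sketch below does so for Gaussian Naive Bayes on synthetic data; the dataset, the ten bins, and the summary value `gap` are assumptions for illustration, and `Calibrated_Curve` may work differently.

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split as sk_train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic binary problem standing in for the wine data.
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = sk_train_test_split(X, y, test_size=0.3, random_state=0)

# Predicted probability of the positive class on the held-out data.
clf = GaussianNB().fit(X_tr, y_tr)
prob_pos = clf.predict_proba(X_te)[:, 1]

# y-values: observed fraction of positives per bin; x-values: mean predicted probability.
frac_pos, mean_pred = calibration_curve(y_te, prob_pos, n_bins=10)

# Mean absolute distance from the diagonal y = x; smaller means better calibrated.
gap = np.abs(frac_pos - mean_pred).mean()

for x_val, y_val in zip(mean_pred, frac_pos):
    print(f"predicted={x_val:.2f} observed={y_val:.2f}")
print(f"mean gap from y=x: {gap:.3f}")
```

Comparing `gap` across models gives a single number to accompany the visual comparison of the curves.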


According to the above graph, it can be concluded that

- Of all these graphs, the <b>Support Vector Machine</b> model would be the best fit for this dataset: most of its binned points do not lie exactly on the ideal calibration line but stay close to it, unlike NaiveBayes, whose binned points lie much farther from the mean predicted probabilities.