
# **Objective**

To classify wine into one of the 3 categories.

# **Tasks**

1. Import the wine dataset from sklearn.datasets
2. Create a DataFrame of the input and target values
3. Split into train and test sets with ratio 80:20
4. Train a Gaussian Naive Bayes Classifier and find its accuracy
5. Train a Multinomial Naive Bayes Classifier and find its accuracy
6. Which of the above two models performs better?

# **Loading Libraries**

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.metrics import f1_score, confusion_matrix
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.model_selection import cross_val_score
from matplotlib import pyplot as plt
%matplotlib inline
import plotly.express as px
from matplotlib import style
#styling the outputs
font={'family':'sans-serif',
      'weight':'bold',
      'size':18}
plt.rc('font',**font)
style.use('fivethirtyeight')
import warnings
warnings.filterwarnings("ignore")



# **Loading Data**

Let's first load the required wine dataset from scikit-learn datasets.

In [2]:
#Import scikit-learn dataset library
from sklearn import datasets

#Load dataset
wine = datasets.load_wine()

In [3]:
dir(wine)

['DESCR', 'data', 'feature_names', 'frame', 'target', 'target_names']
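The `frame` key listed above is only populated when the loader is asked for pandas output. As a hedged aside, the sketch below assumes a scikit-learn version that accepts `as_frame=True` (the presence of `frame` in the listing suggests this one does). The names `wine_bunch` and `wine_df` are illustrative and are not used elsewhere in this notebook.

```python
from sklearn.datasets import load_wine

# as_frame=True asks the loader to return pandas objects instead of NumPy arrays
wine_bunch = load_wine(as_frame=True)

# .frame holds the 13 feature columns plus a final 'target' column
wine_df = wine_bunch.frame
print(wine_df.shape)
print(wine_df.columns[-1])
```

This is an alternative to building the DataFrame by hand from `wine.data` and `wine.feature_names`, as done further down.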

# **Exploring Data**

In [5]:
# print the names of the 13 features
print("Feature_names:", wine.feature_names)

# print the label type of wine(class_0, class_1, class_2)
print("Labels:", wine.target_names)


Feature_names: ['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']
Labels: ['class_0' 'class_1' 'class_2']


In [6]:
# print the shape of the feature matrix: (samples, features)
wine.data.shape

(178, 13)

In [7]:
# print the wine data features (top 5 records)
print (wine.data[0:5])

[[1.423e+01 1.710e+00 2.430e+00 1.560e+01 1.270e+02 2.800e+00 3.060e+00
  2.800e-01 2.290e+00 5.640e+00 1.040e+00 3.920e+00 1.065e+03]
 [1.320e+01 1.780e+00 2.140e+00 1.120e+01 1.000e+02 2.650e+00 2.760e+00
  2.600e-01 1.280e+00 4.380e+00 1.050e+00 3.400e+00 1.050e+03]
 [1.316e+01 2.360e+00 2.670e+00 1.860e+01 1.010e+02 2.800e+00 3.240e+00
  3.000e-01 2.810e+00 5.680e+00 1.030e+00 3.170e+00 1.185e+03]
 [1.437e+01 1.950e+00 2.500e+00 1.680e+01 1.130e+02 3.850e+00 3.490e+00
  2.400e-01 2.180e+00 7.800e+00 8.600e-01 3.450e+00 1.480e+03]
 [1.324e+01 2.590e+00 2.870e+00 2.100e+01 1.180e+02 2.800e+00 2.690e+00
  3.900e-01 1.820e+00 4.320e+00 1.040e+00 2.930e+00 7.350e+02]]


In [9]:
# print the first 150 wine labels (0: class_0, 1: class_1, 2: class_2)
print(wine.target[0:150])

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
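The printout above stops at index 150, so part of class 2 is cut off. A small sketch of counting the samples per class instead, using `np.bincount` on the integer labels (the variable name `counts` is illustrative):

```python
import numpy as np
from sklearn.datasets import load_wine

wine = load_wine()

# bincount tallies how many times each integer label (0, 1, 2) occurs
counts = np.bincount(wine.target)
for name, n in zip(wine.target_names, counts):
    print(name, n)
```

The classes are not perfectly balanced, which is worth keeping in mind when reading a single accuracy figure.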


In [10]:
df=pd.DataFrame(wine.data,columns=wine.feature_names)
df.head()

,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline
0,14.23,1.71,2.43,15.6,127.0,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065.0
1,13.2,1.78,2.14,11.2,100.0,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050.0
2,13.16,2.36,2.67,18.6,101.0,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185.0
3,14.37,1.95,2.5,16.8,113.0,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480.0
4,13.24,2.59,2.87,21.0,118.0,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735.0


In [12]:
df['target'] = wine.target
df

,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline,target
0,14.23,1.71,2.43,15.6,127.0,2.80,3.06,0.28,2.29,5.64,1.04,3.92,1065.0,0
1,13.20,1.78,2.14,11.2,100.0,2.65,2.76,0.26,1.28,4.38,1.05,3.40,1050.0,0
2,13.16,2.36,2.67,18.6,101.0,2.80,3.24,0.30,2.81,5.68,1.03,3.17,1185.0,0
3,14.37,1.95,2.50,16.8,113.0,3.85,3.49,0.24,2.18,7.80,0.86,3.45,1480.0,0
4,13.24,2.59,2.87,21.0,118.0,2.80,2.69,0.39,1.82,4.32,1.04,2.93,735.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
173,13.71,5.65,2.45,20.5,95.0,1.68,0.61,0.52,1.06,7.70,0.64,1.74,740.0,2
174,13.40,3.91,2.48,23.0,102.0,1.80,0.75,0.43,1.41,7.30,0.70,1.56,750.0,2
175,13.27,4.28,2.26,20.0,120.0,1.59,0.69,0.43,1.35,10.20,0.59,1.56,835.0,2
176,13.17,2.59,2.37,20.0,120.0,1.65,0.68,0.53,1.46,9.30,0.60,1.62,840.0,2


# **Splitting Data**

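First, separate the columns into independent variables (features) and the dependent variable (label). Then split them into training and test sets. Note that the code below holds out 30% of the rows (`test_size=0.3`), a 70:30 split, rather than the 80:20 ratio listed under Tasks. All outputs that follow come from this 70:30 split.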

In [14]:
from sklearn.model_selection import train_test_split
# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(wine.data, wine.target, test_size=0.3,random_state=100) # 70% training and 30% test
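For comparison, here is a sketch of the 80:20 split named in the task list. It adds `stratify=wine.target`, which is an assumption on my part and not something the original cell does, so that class proportions stay similar in both parts. The names `X_tr`, `X_te`, `y_tr`, `y_te` are chosen to avoid clashing with the variables above.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

wine = load_wine()

X_tr, X_te, y_tr, y_te = train_test_split(
    wine.data, wine.target,
    test_size=0.2,         # 80% training, 20% test, as in the task list
    stratify=wine.target,  # assumption: keep class proportions similar in both parts
    random_state=100,      # same seed as the cell above, for repeatability
)
print(X_tr.shape, X_te.shape)
```

Scores obtained on this split will generally differ from the ones recorded in this notebook, which were produced with `test_size=0.3`.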


# **Model Generation**

After splitting, train a Gaussian Naive Bayes model and then a Multinomial Naive Bayes model on the training set, and use each to predict the test set features.

In [18]:
#Import Gaussian Naive Bayes model
from sklearn.naive_bayes import GaussianNB

#Create a Gaussian Classifier
model = GaussianNB(priors=None,var_smoothing=1e-09)

#Train the model using the training sets
model.fit(X_train, y_train)


GaussianNB()

In [19]:
# Accuracy on the training data (not the held-out test set)
model.score(X_train,y_train)

0.967741935483871
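To make clearer what the fitted Gaussian model does, the sketch below recomputes its predictions by hand: for each class it estimates a per-feature mean and variance from the training rows, then scores each test row with the log prior plus the sum of per-feature Gaussian log densities. This is a minimal illustration under the assumption that the variance floor is `var_smoothing` times the largest feature variance. It is not the library's actual code, and the names `log_joint`, `manual_pred`, and `agreement` are mine.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.3, random_state=100)

classes = np.unique(y_train)
# Small variance floor, mirroring var_smoothing=1e-09 (assumed formula)
eps = 1e-9 * X_train.var(axis=0).max()

log_joint = []
for c in classes:
    Xc = X_train[y_train == c]                 # training rows of class c
    mu = Xc.mean(axis=0)                       # per-feature mean
    var = Xc.var(axis=0) + eps                 # per-feature variance
    log_prior = np.log(len(Xc) / len(X_train)) # class frequency as prior
    # Sum of independent Gaussian log densities over the 13 features
    log_lik = -0.5 * np.sum(
        np.log(2 * np.pi * var) + (X_test - mu) ** 2 / var, axis=1)
    log_joint.append(log_prior + log_lik)

# Pick the class with the highest joint log probability for each test row
manual_pred = classes[np.argmax(np.array(log_joint), axis=0)]
sk_pred = GaussianNB().fit(X_train, y_train).predict(X_test)
agreement = np.mean(manual_pred == sk_pred)
print("Agreement with GaussianNB:", agreement)
```

The "naive" part is the sum over features: each feature contributes independently given the class.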

In [20]:
#Predict the response for test dataset
y_pred = model.predict(X_test)
y_pred

array([1, 2, 0, 1, 2, 2, 1, 1, 1, 1, 2, 1, 2, 2, 2, 0, 2, 0, 1, 0, 2, 0,
       1, 1, 0, 0, 1, 1, 1, 2, 2, 1, 0, 1, 2, 2, 1, 1, 2, 2, 0, 2, 2, 2,
       0, 2, 2, 2, 0, 0, 0, 1, 0, 1])

In [21]:
#Import Multinomial Naive Bayes model
from sklearn.naive_bayes import MultinomialNB

#Create a Multinomial Naive Bayes classifier
model2 = MultinomialNB()

#Train the model using the training sets
model2.fit(X_train, y_train)

#Predict the response for test dataset
y_pred = model2.predict(X_test)
y_pred

array([2, 2, 0, 0, 2, 2, 1, 1, 1, 1, 2, 2, 2, 2, 2, 0, 2, 0, 1, 0, 1, 0,
       1, 1, 2, 0, 1, 1, 1, 2, 2, 1, 0, 1, 1, 1, 1, 1, 2, 1, 0, 1, 2, 2,
       0, 2, 2, 1, 1, 0, 0, 1, 0, 2])

In [22]:
# Accuracy on the training data (not the held-out test set)
model2.score(X_train, y_train)

0.8306451612903226
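One possible reason for the lower training score is that `MultinomialNB` treats each feature value as a count and models, per class, the share of the total "count" that falls on each feature. The sketch below inspects the fitted `feature_log_prob_` attribute to show this. It fits on the full dataset purely for illustration, and the names `mnb`, `feature_share`, and `top` are mine.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.naive_bayes import MultinomialNB

wine = load_wine()
mnb = MultinomialNB().fit(wine.data, wine.target)

# One row per class, one column per feature
feature_share = np.exp(mnb.feature_log_prob_)
print(feature_share.shape)
print(feature_share.sum(axis=1))  # each row sums to 1 (up to rounding)

# The feature with the largest raw values takes the largest share
top = wine.feature_names[int(np.argmax(feature_share[0]))]
print(top)
```

Because the measurements here are continuous and on very different scales, the feature with the largest magnitude dominates these shares, which is unlike the word-count data this model is usually applied to.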

# **Evaluating Model**

After model generation, check the accuracy using actual and predicted values.

In [24]:
#Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics

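# Model Accuracy, how often is the classifier correct?
# Predict with the Gaussian model directly: y_pred was reassigned to the
# MultinomialNB predictions in cell 21, so passing y_pred here would score model2.
print("Accuracy:",model,metrics.accuracy_score(y_test, model.predict(X_test)))

Note: the output originally recorded for this cell, `Accuracy: GaussianNB() 0.7777777777777778`, was computed after `y_pred` had been overwritten, so that figure is the MultinomialNB test accuracy. Rerun the cell to obtain the GaussianNB value.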


In [25]:
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",model2,metrics.accuracy_score(y_test, y_pred))

Accuracy: MultinomialNB() 0.7777777777777778
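Both prediction cells above store their result in the same variable `y_pred`, so a comparison is easier to trust when each model keeps its own predictions. The sketch below refits both classifiers on the same 70:30 split and collects test accuracy, a confusion matrix, and 5-fold cross-validated accuracy for each. The `results` dictionary and the choice of `cv=5` are my assumptions, not part of the original notebook.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.3, random_state=100)

results = {}
for name, clf in [("GaussianNB", GaussianNB()), ("MultinomialNB", MultinomialNB())]:
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)  # predictions kept separate per model
    results[name] = {
        "test_accuracy": accuracy_score(y_test, pred),
        # rows are true classes, columns are predicted classes
        "confusion": confusion_matrix(y_test, pred),
        # cross_val_score clones the estimator and refits it on each fold
        "cv_scores": cross_val_score(clf, wine.data, wine.target, cv=5),
    }
    print(name, results[name]["test_accuracy"], results[name]["cv_scores"].mean())
    print(results[name]["confusion"])
```

Comparing the two `test_accuracy` values, ideally alongside the cross-validated means, is one way to answer the last task about which model performs better.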
