# Naive Bayes Classification with Iris Dataset

In today's lecture, we will use the Naive Bayes algorithm to classify three species of iris flower: setosa, versicolor and virginica. The dataset contains 50 samples of each species, for a total of 150 samples. The features are sepal length, sepal width, petal length and petal width.

The dataset is attached as a CSV file. We will import the pandas library to read and analyse the file, and the scikit-learn module for the Naive Bayes algorithm.

In [2]:
## Importing necessary packages
import pandas as pd
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split


### Read the CSV (comma-separated values) file using pandas

In [6]:
irisData = pd.read_csv("iris_dataset_with_class_information.csv")
## Let's look at the data
irisData

Unnamed: 0,sepal.length.in.cm,sepal.width.in.cm,petal.length.in.cm,petal.width.in.cm,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica


### We can check the different column names in the dataset as follows:

In [None]:

irisData.columns

Index(['sepal.length.in.cm', 'sepal.width.in.cm', 'petal.length.in.cm',
       'petal.width.in.cm', 'species'],
      dtype='object')

The column names specify the features and the outcome (or target) variable. Here, 'sepal.length.in.cm', 'sepal.width.in.cm', 'petal.length.in.cm' and 'petal.width.in.cm' are the features, and 'species' is the outcome or target variable, i.e. the one we are interested in predicting.

Notice that all the feature names contain '.' characters. Let's rename these columns to shorter names with underscores. We'll use pandas again for this, and we'll keep 'species' as is.

In [11]:
irisData = irisData.rename(columns={'sepal.length.in.cm': 'sepal_length', 'sepal.width.in.cm': 'sepal_width', 'petal.length.in.cm':'petal_length', 'petal.width.in.cm': 'petal_width'})
## Let's check the dataset again
irisData

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica


## Now let's look at the total number of samples we have for iris. 

One way to do this is to use the attribute ```irisData.shape```. In this case it returns a tuple of 2 numbers within ```()```. The first number is the number of rows, i.e. samples, and the second number is the number of columns.

In [14]:
irisData.shape

(150, 5)

We can view the first 5 rows or the last 5 rows using the methods ```head()``` or ```tail()```.

In [None]:
#irisData.head()  ## first 5 rows
irisData.tail()   ## last 5 rows


Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica
149,5.9,3.0,5.1,1.8,virginica


### Subsetting a dataframe
Quite often you will need to pick out some rows and columns of a dataframe. This is called **subsetting**.
Subsetting is done by passing column names, or row/column numbers as a range.
When specifying a range:
* m:n means starting at m, going up to but *not including* n.
* m: means from m to the end; :m means from 0 up to but *not including* m.
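
The range rules above can be checked on a small toy dataframe. This is only a minimal sketch: the frame `toy` and its column names `a`, `b`, `c` are made up for illustration and are not part of the iris data.

```python
import pandas as pd

# Hypothetical 4-row, 3-column frame used only to illustrate slicing
toy = pd.DataFrame({"a": [10, 11, 12, 13],
                    "b": [20, 21, 22, 23],
                    "c": [30, 31, 32, 33]})

first_two_rows = toy.iloc[0:2, :]   # rows 0 and 1; row 2 is not included
from_row_two = toy.iloc[2:, :]      # row 2 up to the last row
first_two_cols = toy.iloc[:, :2]    # columns 0 and 1; column 2 is not included

print(first_two_rows.shape)          # (2, 3)
print(from_row_two["a"].tolist())    # [12, 13]
print(list(first_two_cols.columns))  # ['a', 'b']
```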

In [17]:
irisData.iloc[0:10,  0:5]  # first is the range of rows, second is range of columns

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
5,5.4,3.9,1.7,0.4,setosa
6,4.6,3.4,1.4,0.3,setosa
7,5.0,3.4,1.5,0.2,setosa
8,4.4,2.9,1.4,0.2,setosa
9,4.9,3.1,1.5,0.1,setosa


### Try the following examples yourself:
```irisData.iloc[:10,  :5]```
```irisData.iloc[145:,  2:]```

We can also look at particular columns by using the following commands:
```irisData['sepal_length']```, ```irisData[['sepal_length', 'sepal_width']]```.
An easier approach, when you are dealing with a large number of features, is to create a list of the column names that you want to view or use in your analysis.

In [21]:
list_columnnames = ['sepal_length', 'sepal_width']
irisData[list_columnnames]

Unnamed: 0,sepal_length,sepal_width
0,5.1,3.5
1,4.9,3.0
2,4.7,3.2
3,4.6,3.1
4,5.0,3.6
...,...,...
145,6.7,3.0
146,6.3,2.5
147,6.5,3.0
148,6.2,3.4


We can also filter rows by the value in a particular column. For example, to find all the samples with species 'versicolor', we can use the following pandas command:
```irisData_versicolor = irisData.loc[irisData['species'] == 'versicolor']```

In [24]:
irisData_versicolor = irisData.loc[irisData['species'] == 'versicolor']
irisData_versicolor.shape

(50, 5)
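
To see what the filter above does step by step, here is a minimal sketch on a made-up miniature frame. The name `mini` and its values are hypothetical stand-ins for `irisData`. The comparison produces a boolean Series (a mask), and `.loc` keeps only the rows where the mask is True. `value_counts()` is one way to tally the rows per species.

```python
import pandas as pd

# Hypothetical miniature stand-in for irisData
mini = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.1, 6.0, 6.7],
    "species": ["setosa", "setosa", "versicolor", "versicolor", "virginica"],
})

# Comparing a column to a value gives one True/False per row
mask = mini["species"] == "versicolor"

# .loc with a boolean mask keeps only the rows marked True
mini_versicolor = mini.loc[mask]

# value_counts() counts how many rows each species has
counts = mini["species"].value_counts()

print(mask.tolist())           # [False, False, True, True, False]
print(mini_versicolor.shape)   # (2, 2)
print(counts["setosa"])        # 2
```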

Now, let's get into training and testing a Naive Bayes model using the iris dataset. Before we start, we need to shuffle the dataset. This is because:

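Datasets are often collected or organized in a specific order (e.g., chronologically, alphabetically, or by class label). Our file, for example, lists all the setosa samples first and the virginica samples last. If such data is not shuffled, the training and test sets can end up unrepresentative (for instance, a test set made up of a single species), and a model might come to rely on this order rather than on the actual features. Both lead to biased results and poor generalization to new, unseen data. Shuffling breaks these artificial patterns.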

There are several ways to shuffle data, some through numpy. We will use pandas today for shuffling.
The general format of shuffling through pandas is:
* ```irisData_randomized = irisData.sample(frac=1, random_state=42)```
```frac=1``` specifies that you want to sample 100% of the rows (i.e., all rows), effectively reordering them randomly.
```random_state=42``` ensures your shuffling is reproducible. If you use the same random_state value, you will get the same shuffled order each time you run the code. If you omit random_state, the shuffle will be different every time.

* ```irisData_randomized = irisData_randomized.reset_index(drop=True)``` 
Every pandas dataframe has an index label associated with each row. After shuffling, the original index labels remain attached to the rows. ```reset_index(drop=True)``` creates a new index starting from 0 and drops the old one.

In [26]:
irisData_randomized = irisData.sample(frac=1, random_state=42)
irisData_randomized = irisData_randomized.reset_index(drop=True)
irisData_randomized

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,6.1,2.8,4.7,1.2,versicolor
1,5.7,3.8,1.7,0.3,setosa
2,7.7,2.6,6.9,2.3,virginica
3,6.0,2.9,4.5,1.5,versicolor
4,6.8,2.8,4.8,1.4,versicolor
...,...,...,...,...,...
145,6.1,2.8,4.0,1.3,versicolor
146,4.9,2.5,4.5,1.7,virginica
147,5.8,4.0,1.2,0.2,setosa
148,5.8,2.6,4.0,1.2,versicolor
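
The behaviour of ```random_state``` and ```reset_index(drop=True)``` described above can be checked on a small example. This is a sketch on a hypothetical 10-row frame named `toy`, not on the iris data itself.

```python
import pandas as pd

# Hypothetical frame with rows labelled 0..9
toy = pd.DataFrame({"value": list(range(10))})

# Shuffling twice with the same seed
shuffled_a = toy.sample(frac=1, random_state=42)
shuffled_b = toy.sample(frac=1, random_state=42)

# The same random_state gives the same shuffled order
same_order = shuffled_a.index.tolist() == shuffled_b.index.tolist()

# frac=1 keeps every row; only the order changes
same_rows = sorted(shuffled_a["value"]) == toy["value"].tolist()

# reset_index(drop=True) replaces the old labels with 0..n-1
reset = shuffled_a.reset_index(drop=True)

print(same_order)            # True
print(same_rows)             # True
print(reset.index.tolist())  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```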


Now, let's separate features and outcome variables. We'll denote the features as X and the outcome variable as y.

In [33]:
X = irisData_randomized.iloc[:,0:4] ## Features only (first 4 columns), without the class labels
y = irisData_randomized.iloc[:,4] ## Class labels / outcome variable (the 'species' column)


In [34]:
X.head() ## Checking the features

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
0,6.1,2.8,4.7,1.2
1,5.7,3.8,1.7,0.3
2,7.7,2.6,6.9,2.3
3,6.0,2.9,4.5,1.5
4,6.8,2.8,4.8,1.4


In [None]:
y.head() ## Checking y

0    versicolor
1        setosa
2     virginica
3    versicolor
4    versicolor
Name: species, dtype: object

### Splitting data into training and test set.

Now that we have defined the features and the class labels, we need to divide the dataset into training and test data. We will keep 80% of the data for training and the rest for testing. The test data has to be completely disjoint from the training data.

We will use scikit-learn to split the data.

In [37]:
## Note: train_test_split shuffles the rows by default (shuffle=True), so it can alternatively take care of shuffling the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=12, shuffle=True)
X_train.shape

(120, 4)
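
One way to confirm that the split is 80/20 and that the two parts are disjoint is to compare their row indexes. The sketch below uses randomly generated stand-in data with the same shape as our dataset. The names `X_demo`, `y_demo`, `X_tr`, `X_te`, `y_tr` and `y_te` are hypothetical, and the feature values are random numbers, not real measurements.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in: 150 rows, 4 numeric features, 3 classes of 50 rows each
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(
    rng.normal(size=(150, 4)),
    columns=["sepal_length", "sepal_width", "petal_length", "petal_width"],
)
y_demo = pd.Series(np.repeat(["setosa", "versicolor", "virginica"], 50), name="species")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=12, shuffle=True
)

# Row labels that appear in both parts; an empty set means the parts are disjoint
overlap = set(X_tr.index) & set(X_te.index)

print(X_tr.shape, X_te.shape)  # (120, 4) (30, 4)
print(len(overlap))            # 0
```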

## Naive Bayes Classification
We use the ```GaussianNB``` class from scikit-learn. The assumption here is that, within each class, each attribute/feature comes from a Gaussian distribution whose mean and variance can be estimated from the training data.
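
Written out, the Gaussian assumption looks as follows. The notation is ours and is only a sketch of the standard formulation: $x_i$ is the value of feature $i$, $c$ is a class (species), and $\mu_{c,i}$ and $\sigma_{c,i}^2$ are the mean and variance of feature $i$ estimated from the training samples of class $c$.

$$
P(x_i \mid y = c) = \frac{1}{\sqrt{2\pi\sigma_{c,i}^2}} \exp\left(-\frac{(x_i - \mu_{c,i})^2}{2\sigma_{c,i}^2}\right)
$$

The "naive" part is that the features are treated as independent given the class, so the predicted class is the one that maximizes the prior times the product of the per-feature likelihoods:

$$
\hat{y} = \arg\max_{c} \; P(y = c) \prod_{i=1}^{4} P(x_i \mid y = c)
$$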

When using the models in sklearn, the workflow pattern is almost always the same:

* Create a **raw model** object by invoking the class constructor of the method being used. This model is not trained with data yet.
* Train the raw model with training data. This process is called "fitting" and requires the attribute values as well as the class labels from the training data. It is done with the ```fit()``` function called on the raw object.
* Once you have a fitted model, you can feed it test data (only the attributes, not the class labels) and see what the model's predictions are, using the ```predict()``` function called on the fitted model.
* Finally, you can compare the output of the prediction with the actual classes/categories in the test data.
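
The four steps above can be sketched as follows. This is a minimal, self-contained illustration rather than the lecture's exact code. It assumes scikit-learn's bundled copy of the iris data (```load_iris```) as a stand-in for our CSV file, so the labels are integer codes 0, 1, 2 instead of species names. The variable names are hypothetical, and ```accuracy_score``` is just one possible way to do the comparison in the last step.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Assumption: the bundled iris data stands in for the CSV file used in this lecture
iris = load_iris(as_frame=True)
X_demo = iris.data     # 150 rows, 4 feature columns
y_demo = iris.target   # integer class codes 0, 1, 2

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=12, shuffle=True
)

model = GaussianNB()           # step 1: raw, untrained model
model.fit(X_tr, y_tr)          # step 2: fit on training features and labels
y_pred = model.predict(X_te)   # step 3: predict from test features only

# step 4: compare predictions with the actual test labels
accuracy = accuracy_score(y_te, y_pred)
print(f"Test accuracy: {accuracy:.3f}")
```

The accuracy is the fraction of test samples whose predicted class matches the actual class. Its exact value depends on which rows land in the test set.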
