### Naive Bayes

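Naive Bayes is a probabilistic machine learning algorithm that is widely used for text classification. It is based on Bayes' theorem and "naively" assumes that the features (here, the words) are independent of each other given the class.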
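
As a minimal worked example of Bayes' theorem, the sketch below computes the probability that a message is spam given that it contains the word "free". All three input probabilities are made-up illustrative numbers, not values estimated from any dataset.

```python
# Hypothetical probabilities, chosen only for illustration
p_spam = 0.4               # P(spam): prior probability of spam
p_free_given_spam = 0.5    # P("free" | spam)
p_free_given_ham = 0.05    # P("free" | ham)

p_ham = 1.0 - p_spam

# Total probability of seeing the word "free" in any message
p_free = p_free_given_spam * p_spam + p_free_given_ham * p_ham

# Bayes' theorem: P(spam | "free") = P("free" | spam) * P(spam) / P("free")
p_spam_given_free = p_free_given_spam * p_spam / p_free

print(round(p_spam_given_free, 3))
```

With these assumed numbers the posterior comes out at roughly 0.87, well above the prior of 0.4, because "free" is ten times more likely in spam than in ham.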

Two commonly used variants of Naive Bayes are:

  (a.)   Gaussian Naive Bayes

  (b.)   Multinomial Naive Bayes


--->  For working with text, Multinomial Naive Bayes is used.

This is because of the following reasons:

  (i.)  It models word count frequencies directly.

  (ii.) It supports multi-class text classification.


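### Naive Bayes Algorithm with Multinomial

Multinomial Naive Bayes is a variant of Naive Bayes suited to discrete count features, such as the number of times each word occurs in a text.

The model cannot read raw text, so the text is first converted into numerical word counts. The model then estimates, for each class, how likely each word is to occur, and uses these estimates to classify new text.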
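
To make the idea concrete, here is a minimal from-scratch sketch of Multinomial Naive Bayes with add-one (Laplace) smoothing, written with NumPy. It is not how scikit-learn is implemented internally; the function names, the four-word vocabulary, and the tiny count matrix are all hypothetical and exist only for illustration.

```python
import numpy as np

def fit_multinomial_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods per class."""
    classes = np.unique(y)
    # log P(class): fraction of documents in each class
    log_prior = np.log(np.array([(y == c).mean() for c in classes]))
    # Total count of each word within each class, plus smoothing
    counts = np.array([X[y == c].sum(axis=0) for c in classes]) + alpha
    # log P(word | class): normalise the counts within each class
    log_likelihood = np.log(counts / counts.sum(axis=1, keepdims=True))
    return classes, log_prior, log_likelihood

def predict_multinomial_nb(X, classes, log_prior, log_likelihood):
    """Pick the class with the highest log posterior score."""
    scores = X @ log_likelihood.T + log_prior
    return classes[np.argmax(scores, axis=1)]

# Hypothetical vocabulary: ["free", "money", "hello", "you"]
X = np.array([[1, 1, 0, 0],   # spam
              [2, 0, 0, 0],   # spam
              [0, 0, 1, 1],   # ham
              [0, 0, 1, 2]])  # ham
y = np.array([1, 1, 0, 0])

model = fit_multinomial_nb(X, y)
pred = predict_multinomial_nb(np.array([[1, 1, 0, 0], [0, 0, 0, 1]]), *model)
print(pred)
```

Working in log space avoids multiplying many small probabilities together, and the smoothing term `alpha` keeps a word that never appeared in a class from forcing that class's probability to zero.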


### Aim of this project

The main aims of this project are to:

  (a.)  convert the text into numerical form.

  (b.)  predict whether a message is 'Spam' or 'Ham' (not spam).

### Steps used in this Algorithm

1.  Import all the necessary libraries

2.  Define the sample corpus text

3.  Define input and output

4.  Convert text into numerical features using Bag Of Words

5.  Split dataset into train and test

6.  Initialize Naive Bayes model

7.  Train the model

8.  Make predictions

9.  Evaluate model

10. Testing on New Data

### Step 1: Import all the necessary libraries

In [622]:
import  numpy   as   np
import  pandas  as   pd
import  matplotlib.pyplot  as  plt
import  seaborn            as  sns

from    sklearn.feature_extraction.text import CountVectorizer

from    sklearn.model_selection         import train_test_split

from    sklearn.preprocessing           import  StandardScaler

from    sklearn.naive_bayes            import MultinomialNB

from    sklearn.metrics                import  confusion_matrix, accuracy_score, classification_report

### OBSERVATIONS:

1.  numpy    ------------------>  Numerical array computation

2.  pandas   ------------------>  Data creation and manipulation

3.  matplotlib ---------------->  Data visualization

4.  seaborn    ---------------->  Statistical data visualization (built on matplotlib)

5.  feature_extraction --------->  Module with tools that extract numerical features from raw data such as text

6.  CountVectorizer ------------->  Builds the vocabulary and converts the input text into a word count matrix

7.  train_test_split ------------>  Splits the data into training and testing sets

8.  StandardScaler   ------------->  Standardizes features to zero mean and unit variance (imported here, but not used in the steps below)

9.  MultinomialNB ----------------> Naive Bayes classifier for count features; used here for text classification

10.  metrics     -----------------> Functions that evaluate the performance of the model
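
A short sketch of what `StandardScaler` actually does, compared with `MinMaxScaler`, which is the scaler that maps values into the 0 to 1 range. The four input values are arbitrary illustrative numbers.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Arbitrary single-feature data, for illustration only
values = np.array([[1.0], [2.0], [3.0], [4.0]])

# StandardScaler: subtract the mean, divide by the standard deviation
standardized = StandardScaler().fit_transform(values)

# MinMaxScaler: rescale so the minimum is 0 and the maximum is 1
rescaled = MinMaxScaler().fit_transform(values)

print("StandardScaler:", standardized.ravel())
print("MinMaxScaler:  ", rescaled.ravel())
```

Standardized values can be negative, and `MultinomialNB` rejects negative inputs, which is one reason the word counts in this notebook are passed to the model unscaled.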

### Step 2: Define the sample corpus text

In [623]:
# Step 2: Create sample dataset
data = {
    "text": [
        "Win money now",
        "Limited offer claim prize",
        "Call now for free reward",
        "Hello how are you",
        "Let's meet tomorrow",
        "Are you coming to office",
        "Free money offer",
        "Win a free lottery"
    ],
    "label": [1,1,1,0,0,0,1,1]   # 1 = Spam, 0 = Not Spam
}

In [624]:
data

{'text': ['Win money now',
  'Limited offer claim prize',
  'Call now for free reward',
  'Hello how are you',
  "Let's meet tomorrow",
  'Are you coming to office',
  'Free money offer',
  'Win a free lottery'],
 'label': [1, 1, 1, 0, 0, 0, 1, 1]}

In [625]:
### with the help of the data, construct the dataframe

df = pd.DataFrame(data)

In [626]:
df

|   | text | label |
|---|------|-------|
| 0 | Win money now | 1 |
| 1 | Limited offer claim prize | 1 |
| 2 | Call now for free reward | 1 |
| 3 | Hello how are you | 0 |
| 4 | Let's meet tomorrow | 0 |
| 5 | Are you coming to office | 0 |
| 6 | Free money offer | 1 |
| 7 | Win a free lottery | 1 |


### OBSERVATIONS:

1.  The dataset has two columns: text and label.

2.  Each text (input) has a corresponding label (output).

3.  The label is either 1 or 0, where 1 ----> Spam and 0 -----> Not Spam.

### Step 3: Define input and output

In [627]:
X = df['text']   ### independent feature (input text)

Y = df['label']  ### dependent variable (target label)

In [628]:
print(X)

0                Win money now
1    Limited offer claim prize
2     Call now for free reward
3            Hello how are you
4          Let's meet tomorrow
5     Are you coming to office
6             Free money offer
7           Win a free lottery
Name: text, dtype: object


In [629]:
print(Y)

0    1
1    1
2    1
3    0
4    0
5    0
6    1
7    1
Name: label, dtype: int64


### OBSERVATIONS:

1. The dataset is divided into input and output data.

2. The input data (X) is the text column.

3. The output data (Y) is the label column.

### Step 4: Convert text into numerical features using Bag Of Words

In [630]:
from  sklearn.feature_extraction.text import CountVectorizer

### Create the object for Count Vectorizer

count = CountVectorizer()

### using the object for Count Vectorizer, transform the input

X_vectorized = count.fit_transform(X)

print(X_vectorized)

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 30 stored elements and shape (8, 22)>
  Coords	Values
  (0, 20)	1
  (0, 12)	1
  (0, 13)	1
  (1, 9)	1
  (1, 14)	1
  (1, 2)	1
  (1, 16)	1
  (2, 13)	1
  (2, 1)	1
  (2, 4)	1
  (2, 5)	1
  (2, 17)	1
  (3, 6)	1
  (3, 7)	1
  (3, 0)	1
  (3, 21)	1
  (4, 8)	1
  (4, 11)	1
  (4, 19)	1
  (5, 0)	1
  (5, 21)	1
  (5, 3)	1
  (5, 18)	1
  (5, 15)	1
  (6, 12)	1
  (6, 14)	1
  (6, 5)	1
  (7, 20)	1
  (7, 5)	1
  (7, 10)	1


### OBSERVATIONS:

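1.  The input text is converted into a word count matrix using the Bag of Words representation.

2.  The matrix is sparse: of its 8 x 22 = 176 cells, only 30 are non-zero, so only those are stored.

3.  This numerical matrix can be passed to the Machine Learning model as input for training.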

### Step 5:  Split dataset into train and test

In [631]:
Y.value_counts()

label
1    5
0    3
Name: count, dtype: int64

In [632]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X_vectorized, Y, test_size=0.25, random_state=42, stratify = Y)

In [633]:
X_train

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 22 stored elements and shape (6, 22)>

In [634]:
X_test

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 8 stored elements and shape (2, 22)>

In [635]:
print("Shape of the input training data is:", X_train.shape)

print("Shape of the input testing  data is:", X_test.shape)

Shape of the input training data is: (6, 22)
Shape of the input testing  data is: (2, 22)


In [636]:
Y_train

1    1
3    0
4    0
0    1
2    1
6    1
Name: label, dtype: int64

In [637]:
Y_test

7    1
5    0
Name: label, dtype: int64

In [638]:
print("Shape of the output training data is:", Y_train.shape)

print("Shape of the output testing  data is:", Y_test.shape)

Shape of the output training data is: (6,)
Shape of the output testing  data is: (2,)
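
The `stratify` argument used in the split keeps both classes represented on each side of the split. The sketch below repeats the split on the labels alone and counts the classes; the index list and variable names are illustrative stand-ins for the vectorized messages.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

labels = [1, 1, 1, 0, 0, 0, 1, 1]    # same labels as the sample dataset
indices = list(range(len(labels)))    # stand-in for the feature rows

train_idx, test_idx, y_train, y_test = train_test_split(
    indices, labels, test_size=0.25, random_state=42, stratify=labels
)

print("train:", Counter(y_train))
print("test: ", Counter(y_test))
```

With only two test messages, stratification is what ensures the test set holds one spam and one non-spam example rather than two of the same class.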


### Step 6:  Initialize Naive Bayes model

In [639]:
### Initialize the MultinomialNB object

multi_NB = MultinomialNB()

### OBSERVATIONS:

1. MultinomialNB is chosen as the Naive Bayes variant because:

    (a.)  The input is text represented as word counts

    (b.)  It models word count frequencies

    (c.)  It is well suited to text classification.

### Step 7: Train the model

In [640]:
### Train the model

multi_NB.fit(X_train, Y_train)

### OBSERVATIONS:

1. The Multinomial Naive Bayes model has been trained using:

    (a.)   training data ---------> X_train (word counts), Y_train (labels)
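
To see what training produces, the following sketch fits `MultinomialNB` on a four-message subset of the data and inspects the fitted attributes. The subset and the query "free money" are chosen only for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["Win money now", "Free money offer",
         "Hello how are you", "Let's meet tomorrow"]
labels = [1, 1, 0, 0]   # 1 = Spam, 0 = Not Spam

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

model = MultinomialNB(alpha=1.0)   # alpha=1.0 is add-one smoothing
model.fit(X, labels)

print("classes:      ", model.classes_)
print("docs per class:", model.class_count_)
print("class priors: ", np.exp(model.class_log_prior_))
print("log P(word | class) shape:", model.feature_log_prob_.shape)

# Class probabilities for a new message
proba = model.predict_proba(vectorizer.transform(["free money"]))
print("P(not spam), P(spam):", proba[0])
```

Fitting only counts documents and words per class, which is why training is fast even on large text collections.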

### Step 8: Make predictions

In [641]:
Y_pred = multi_NB.predict(X_test)

In [642]:
Y_pred

array([1, 0])

### Step 9: Evaluate model

In [643]:
Y_test

7    1
5    0
Name: label, dtype: int64

In [644]:
from    sklearn.metrics                import  confusion_matrix, accuracy_score, classification_report

ac = accuracy_score(Y_test, Y_pred)*100.0

print("Accuracy score of the model is:", ac)

Accuracy score of the model is: 100.0


In [645]:
cm = confusion_matrix(Y_test, Y_pred)

print("Confusion matrix is:", cm)

Confusion matrix is: [[1 0]
 [0 1]]
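
For reading the confusion matrix: in scikit-learn, rows are the true labels and columns are the predicted labels. The sketch below uses a hypothetical set of five predictions with some mistakes, so that every cell is non-zero.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels, chosen so that each kind of outcome occurs
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]

cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

# For binary labels [0, 1], ravel() yields the cells in this order
tn, fp, fn, tp = cm.ravel()

print(cm)
print("true negatives:", tn, "false positives:", fp)
print("false negatives:", fn, "true positives:", tp)
```

In the notebook's result, both off-diagonal cells are 0, meaning neither of the two test messages was misclassified.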


In [646]:
cr = classification_report(Y_test, Y_pred)

print("classification report is:", cr)

classification report is:               precision    recall  f1-score   support

           0       1.00      1.00      1.00         1
           1       1.00      1.00      1.00         1

    accuracy                           1.00         2
   macro avg       1.00      1.00      1.00         2
weighted avg       1.00      1.00      1.00         2



### Step 10: Testing on New Data

In [647]:
new_text = ["Free money waiting for you"]

### perform the vectorization of the input

transformed = count.transform(new_text)

print(transformed)


### predict the data

predictions = multi_NB.predict(transformed)

print(predictions)


print("Prediction:", "Spam" if predictions[0] == 1 else "Not Spam")

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 4 stored elements and shape (1, 22)>
  Coords	Values
  (0, 4)	1
  (0, 5)	1
  (0, 12)	1
  (0, 21)	1
[1]
Prediction: Spam
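
As an optional alternative, the vectorizer and the classifier can be combined into one scikit-learn `Pipeline`, so that new text is vectorized and classified in a single call. Note that this sketch trains on all eight messages rather than the six-message training split used above, and the second test message is a made-up example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    "Win money now",
    "Limited offer claim prize",
    "Call now for free reward",
    "Hello how are you",
    "Let's meet tomorrow",
    "Are you coming to office",
    "Free money offer",
    "Win a free lottery",
]
labels = [1, 1, 1, 0, 0, 0, 1, 1]   # 1 = Spam, 0 = Not Spam

# The pipeline fits the vocabulary and the classifier together
spam_filter = make_pipeline(CountVectorizer(), MultinomialNB())
spam_filter.fit(texts, labels)

new_messages = ["Free money waiting for you", "Are you coming tomorrow"]
predictions = spam_filter.predict(new_messages)

for message, label in zip(new_messages, predictions):
    print(message, "->", "Spam" if label == 1 else "Not Spam")
```

When a pipeline is fit on the training split only, the vocabulary is also learned from the training split only, which avoids the vectorizer seeing test messages before evaluation.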
