
# CLP-5

**Md. Maksudul Haque**

**221002127**


*   Naive Bayes classifier for the SMS Spam Collection dataset
*   Naive Bayes classifier for a multi-class dataset (Iris)

In [None]:
!pip install scikit-learn pandas numpy

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score




In [None]:
# Dataset link (UCI Machine Learning Repository)
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
!unzip smsspamcollection.zip


df = pd.read_table('SMSSpamCollection', sep='\t', header=None, names=['label', 'message'])
df.head()


--2025-08-14 07:09:22--  https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified
Saving to: ‘smsspamcollection.zip.1’

smsspamcollection.z     [ <=>                ] 198.65K  1.02MB/s    in 0.2s    

2025-08-14 07:09:22 (1.02 MB/s) - ‘smsspamcollection.zip.1’ saved [203415]

Archive:  smsspamcollection.zip
replace SMSSpamCollection? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: SMSSpamCollection       
replace readme? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: readme                  


|   | label | message |
|---|-------|---------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


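wget & unzip → Download and extract the dataset.

pd.read_table() → Reads the tab-separated file into a DataFrame.

header=None, names=['label', 'message'] → The file has no header row, so we name the two columns ourselves: 'label' (ham/spam) and 'message' (the SMS text).

df.head() → Shows the first five rows.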
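
To see what header=None and names=... do without downloading anything, here is a minimal sketch that reads a tiny in-memory tab-separated sample. The three rows and the name `toy_df` are made up for illustration; they are not taken from the dataset.

```python
import io
import pandas as pd

# Hypothetical stand-in for the SMSSpamCollection file: tab-separated, no header row
sample = "ham\tSee you at lunch\nspam\tWin a free prize now\nham\tCall me later\n"

# header=None: treat the first line as data; names=[...]: supply the column names ourselves
toy_df = pd.read_table(io.StringIO(sample), sep='\t', header=None, names=['label', 'message'])

print(toy_df)
print(toy_df['label'].value_counts())  # how many ham vs. spam rows
```

Without header=None, pandas would treat the first message as the header row and silently drop it from the data.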

In [None]:
# Convert labels to binary (ham=0, spam=1)
df['label'] = df['label'].map({'ham': 0, 'spam': 1})
df.head()

|   | label | message |
|---|-------|---------|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


In [None]:

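# Split into training and test sets (75% / 25% by default); random_state=1 makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(df['message'], df['label'], random_state=1)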

print("Original dataset contains", df.shape[0], "messages")
print("Training set contains", X_train.shape[0], "messages")
print("Testing set contains", X_test.shape[0], "messages")

Original dataset contains 5572 messages
Training set contains 4179 messages
Testing set contains 1393 messages


In [None]:
count_vector = CountVectorizer()
train_vectors = count_vector.fit_transform(X_train)
test_vectors = count_vector.transform(X_test)


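CountVectorizer() → Creates a vectorizer that splits text into lowercase word tokens and counts them.

.fit_transform() → Learns the vocabulary from the training set and transforms the training messages into word-count vectors.

.transform() → Converts the test messages to vectors using the same vocabulary; words not seen during training are ignored.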

In [None]:

naive_bayes = MultinomialNB()
naive_bayes.fit(train_vectors, y_train)


MultinomialNB() → Creates a multinomial Naive Bayes classifier, which suits discrete features such as word counts.

.fit() → Trains the classifier on the training vectors and labels.
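
As a rough illustration of what .fit() estimates, the sketch below computes the smoothed per-class word probabilities by hand and compares them with the fitted model. The count matrix `X_toy` and labels `y_toy` are invented for this example; the smoothing formula (count + alpha) / (class total + alpha × vocabulary size) is the usual Laplace smoothing with the default alpha=1.0.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Invented word-count matrix: 4 messages (rows) over a 3-word vocabulary (columns)
X_toy = np.array([[2, 0, 1],
                  [1, 0, 0],
                  [0, 3, 1],
                  [0, 2, 0]])
y_toy = np.array([0, 0, 1, 1])  # 0 = ham, 1 = spam

alpha = 1.0  # Laplace smoothing; also the MultinomialNB default
model = MultinomialNB(alpha=alpha).fit(X_toy, y_toy)

# Manual estimate of P(word | class) with smoothing
manual = []
for c in np.unique(y_toy):
    word_totals = X_toy[y_toy == c].sum(axis=0)  # total count of each word in class c
    manual.append((word_totals + alpha) / (word_totals.sum() + alpha * X_toy.shape[1]))
manual = np.array(manual)

print(manual)
print(np.allclose(np.log(manual), model.feature_log_prob_))  # compare with the fitted model
```

Smoothing keeps a word that never appears in one class from forcing that class's probability to zero.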

In [None]:
# Predictions
predictions = naive_bayes.predict(test_vectors)

# Accuracy
print("Accuracy:", accuracy_score(y_test, predictions))

# Confusion Matrix
print("\nConfusion Matrix:\n", confusion_matrix(y_test, predictions))

# Classification Report
print("\nClassification Report:\n", classification_report(y_test, predictions))


Accuracy: 0.9885139985642498

Confusion Matrix:
 [[1203    5]
 [  11  174]]

Classification Report:
               precision    recall  f1-score   support

           0       0.99      1.00      0.99      1208
           1       0.97      0.94      0.96       185

    accuracy                           0.99      1393
   macro avg       0.98      0.97      0.97      1393
weighted avg       0.99      0.99      0.99      1393
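
The spam-class figures in the report can be re-derived from the confusion matrix above. This sketch copies the four counts from that output and applies the standard definitions; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above: rows = true class, columns = predicted class
cm = np.array([[1203, 5],
               [11, 174]])
tn, fp, fn, tp = cm.ravel()  # spam (label 1) is treated as the positive class

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of the messages flagged as spam, how many really were spam
recall = tp / (tp + fn)      # of the real spam messages, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

Rounded to two decimals, these agree with the row for class 1 in the report (0.97, 0.94, 0.96). The 11 missed spam messages are why recall is lower than precision.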



In [None]:
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target

print("Feature Names:", iris.feature_names)
print("Target Names:", iris.target_names)


Feature Names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Target Names: ['setosa' 'versicolor' 'virginica']


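load_iris() → Loads the Iris dataset, which has three flower species (setosa, versicolor, virginica).

X → Features: sepal length, sepal width, petal length, and petal width, all in cm.

y → Target: the class label (0, 1, or 2), one per sample.

The print statements show the feature and target names for reference.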
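
A quick look at the array shapes and the class balance helps when reading the results later. This is a small sketch using only `load_iris()` and NumPy.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

print(iris.data.shape)    # (number of samples, number of features)
print(iris.target.shape)  # one label per sample

# Number of samples per species
counts = np.bincount(iris.target)
print(dict(zip(iris.target_names, counts)))
```

The classes are evenly balanced, so accuracy is a reasonable summary metric here, unlike in the SMS data, where ham messages far outnumber spam.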

In [None]:
# Same 75% / 25% split as before; note that this reuses (overwrites) the SMS variable names
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)


In [None]:
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
gnb.fit(X_train, y_train)


In [None]:
y_pred = gnb.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred, target_names=iris.target_names))


Accuracy: 0.9736842105263158

Confusion Matrix:
 [[13  0  0]
 [ 0 15  1]
 [ 0  0  9]]

Classification Report:
               precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        13
  versicolor       1.00      0.94      0.97        16
   virginica       0.90      1.00      0.95         9

    accuracy                           0.97        38
   macro avg       0.97      0.98      0.97        38
weighted avg       0.98      0.97      0.97        38



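GaussianNB() → Naive Bayes variant for continuous features; it models each feature within each class as a normal distribution.

.predict() → Predicts labels for the test data.

target_names=iris.target_names → Shows class names instead of numbers in the report.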