
**Naive Bayes**    

Naive Bayes is a supervised machine learning classification algorithm based on Bayes' Theorem. It predicts class labels by making a strong ("naive") assumption that the features are conditionally independent given the class, so each feature contributes to the prediction separately even when the features are in fact correlated. It is fast to train and is commonly used for text classification, spam filtering, and sentiment analysis.     
  
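Bayes' Theorem follows from writing the joint probability of A and B in two ways, where P(B|A) denotes the probability of B given A:    
=> P(A and B) = P(A) * P(B|A)  
=> P(A and B) = P(B and A)  
=> P(A) * P(B|A) = P(B) * P(A|B)  
=> P(B|A) = ( P(B) * P(A|B) ) / P(A)  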
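
As a quick numeric check of the identity above, the sketch below plugs in made-up probabilities. The events and all numbers are illustrative assumptions, not values from any dataset.

```python
# Illustrative events: A = "message contains the word 'free'", B = "message is spam".
p_b = 0.2                # P(B): prior probability of spam (assumed)
p_a_given_b = 0.6        # P(A given B): 'free' appears in a spam message (assumed)
p_a_given_not_b = 0.05   # P(A given not B): 'free' appears in a non-spam message (assumed)

# Law of total probability: P(A) = P(A given B) P(B) + P(A given not B) P(not B)
p_a = p_a_given_b * p_b + p_a_given_not_b * (1 - p_b)

# Bayes' Theorem: P(B given A) = P(B) * P(A given B) / P(A)
p_b_given_a = p_b * p_a_given_b / p_a

# Both factorizations of the joint probability should agree
joint_1 = p_a * p_b_given_a
joint_2 = p_b * p_a_given_b

print(f"P(A) = {p_a:.3f}")
print(f"P(B given A) = {p_b_given_a:.3f}")
print(f"joint probabilities: {joint_1:.3f} and {joint_2:.3f}")
```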

**Key Aspects of Naive Bayes**  
Bayes' Theorem: The foundation. It gives the posterior probability of a class given the observed features, computed from the class prior and the likelihood of those features under that class.  
"Naive" Assumption: Treats all input features as independent of each other given the class. This oversimplifies most real-world data, but it makes computation fast and efficient.  
Classification: It is primarily used for classification problems, such as categorizing text, by calculating the probability of each class and selecting the class with the highest probability.  
Types: Common variants include Gaussian (for continuous data), Multinomial (for discrete counts), and Bernoulli (for binary features).
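
To make the points above concrete, here is a minimal from-scratch sketch of the Gaussian variant. It multiplies the class prior by one per-feature likelihood at a time, then picks the class with the highest normalized score. The per-class means, variances, and priors in `class_stats` are hypothetical numbers chosen for illustration, and this is not how scikit-learn implements it internally.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical per-class summaries: a prior plus (mean, variance) for each of two features.
class_stats = {
    0: {"prior": 0.6, "features": [(30.0, 100.0), (15.0, 64.0)]},
    1: {"prior": 0.4, "features": [(28.0, 100.0), (50.0, 400.0)]},
}

def posterior(x):
    """Return normalized class probabilities for one sample x."""
    scores = {}
    for label, stats in class_stats.items():
        score = stats["prior"]
        # The "naive" step: multiply per-feature likelihoods as if the features were independent.
        for value, (mean, var) in zip(x, stats["features"]):
            score *= gaussian_pdf(value, mean, var)
        scores[label] = score
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}

probs = posterior([25.0, 60.0])
predicted = max(probs, key=probs.get)  # class with the highest posterior
print(probs)
print("predicted class:", predicted)
```

Real implementations usually add up log-probabilities instead of multiplying raw densities, which avoids numerical underflow when there are many features.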

**Predicting Survival Probability Using the Titanic Dataset**

In [22]:
import pandas as pd

In [23]:
df = pd.read_csv('/content/titanic.csv')
df.head()

,PassengerId,Name,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Survived
0,1,"Braund, Mr. Owen Harris",3,male,22.0,1,0,A/5 21171,7.25,,S,0
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,female,38.0,1,0,PC 17599,71.2833,C85,C,1
2,3,"Heikkinen, Miss. Laina",3,female,26.0,0,0,STON/O2. 3101282,7.925,,S,1
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,female,35.0,1,0,113803,53.1,C123,S,1
4,5,"Allen, Mr. William Henry",3,male,35.0,0,0,373450,8.05,,S,0


In [24]:
# Drop redundant columns
df.drop(['PassengerId','Name','SibSp','Parch','Ticket','Cabin','Embarked'],axis='columns',inplace=True)
df.head()

,Pclass,Sex,Age,Fare,Survived
0,3,male,22.0,7.25,0
1,1,female,38.0,71.2833,1
2,3,female,26.0,7.925,1
3,1,female,35.0,53.1,1
4,3,male,35.0,8.05,0


In [25]:
target = df.Survived
inputs = df.drop('Survived', axis='columns')

In [26]:
# Convert categorical values to numerical values
dummies = pd.get_dummies(inputs.Sex).astype(int)
dummies.head(5)

,female,male
0,0,1
1,1,0
2,1,0
3,1,0
4,0,1
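
In the rows shown above, the two dummy columns carry the same information (male = 1 - female). The following sketch, on a made-up Series rather than rows from the dataset, shows the same encoding next to the `drop_first=True` option of `pd.get_dummies`, which keeps only one of the two columns.

```python
import pandas as pd

# Made-up values for illustration only
sex = pd.Series(["male", "female", "female", "male"], name="Sex")

# One 0/1 column per category, as in the cell above
both = pd.get_dummies(sex).astype(int)

# drop_first=True drops the first category (alphabetically 'female'),
# leaving a single 0/1 'male' column
single = pd.get_dummies(sex, drop_first=True).astype(int)

print(both)
print(single)
```

Whether to drop the redundant column is a modelling choice; the notebook keeps both columns.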


In [27]:
inputs = pd.concat([inputs, dummies], axis='columns')
inputs.head()

,Pclass,Sex,Age,Fare,female,male
0,3,male,22.0,7.25,0,1
1,1,female,38.0,71.2833,1,0
2,3,female,26.0,7.925,1,0
3,1,female,35.0,53.1,1,0
4,3,male,35.0,8.05,0,1


In [29]:
inputs.drop('Sex', axis='columns', inplace=True)
inputs.head()

,Pclass,Age,Fare,female,male
0,3,22.0,7.25,0,1
1,1,38.0,71.2833,1,0
2,3,26.0,7.925,1,0
3,1,35.0,53.1,1,0
4,3,35.0,8.05,0,1


In [30]:
# Count the null values in each column
inputs.isnull().sum()

Pclass,0
Age,177
Fare,0
female,0
male,0


In [31]:
# Fill the null values of Age with the mean age
inputs.Age = inputs.Age.fillna(inputs.Age.mean())
inputs.head()


,Pclass,Age,Fare,female,male
0,3,22.0,7.25,0,1
1,1,38.0,71.2833,1,0
2,3,26.0,7.925,1,0
3,1,35.0,53.1,1,0
4,3,35.0,8.05,0,1


In [32]:
# Confirm that no null values remain
inputs.isnull().sum()

Pclass,0
Age,0
Fare,0
female,0
male,0


In [41]:
# Split train and test data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(inputs, target, test_size=0.2)
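
The split above is random, so the rows that land in the test set, and therefore the accuracy, change from run to run. As a sketch of a reproducible alternative, `train_test_split` accepts `random_state` to fix the shuffle and `stratify` to keep the class ratio the same in both parts. The data here is synthetic stand-in data, not the Titanic table, and the seed value is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 20 rows, 2 features, balanced 0/1 labels
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = np.array([0] * 10 + [1] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,  # fixed seed: the same split on every run
    stratify=y,       # keep the class ratio equal in train and test
)

print("train rows:", len(X_tr), "test rows:", len(X_te))
print("test label counts:", np.bincount(y_te))
```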

In [42]:
len(inputs)

891

In [43]:
# Length of train and test data
print('Train length:',len(X_train))
print('Testing length:',len(X_test))

Train length: 712
Testing length: 179


In [44]:
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()

In [45]:
model.fit(X_train, y_train)

In [46]:
model.score(X_test, y_test)

0.8100558659217877
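
A single train/test split gives one accuracy number that depends on which rows were held out. One common way to get a steadier estimate is k-fold cross-validation, sketched here with `cross_val_score` on synthetic data (two Gaussian blobs standing in for `inputs` and `target`; the data and seed are assumptions for illustration).

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data: two Gaussian blobs, 100 rows each, 3 features
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(100, 3)),
    rng.normal(1.5, 1.0, size=(100, 3)),
])
y = np.array([0] * 100 + [1] * 100)

# 5-fold cross-validation: fit and score on 5 different train/test partitions
scores = cross_val_score(GaussianNB(), X, y, cv=5)

print("fold accuracies:", scores.round(3))
print("mean accuracy:", round(scores.mean(), 3))
```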

In [47]:
X_train[:10]

,Pclass,Age,Fare,female,male
379,3,19.0,7.775,0,1
176,3,29.699118,25.4667,0,1
515,1,47.0,34.0208,0,1
24,3,8.0,21.075,1,0
13,3,39.0,31.275,0,1
552,3,29.699118,7.8292,0,1
725,3,20.0,8.6625,0,1
843,3,34.5,6.4375,0,1
198,3,29.699118,7.75,1,0
235,3,29.699118,7.55,1,0


In [48]:
y_test[:10]

,Survived
169,0
634,0
795,0
721,0
645,1
718,0
76,0
353,0
399,1
853,1


In [49]:
# Predict the first ten test rows (8 of the 10 predictions match y_test above)
model.predict(X_test[:10])

array([0, 1, 0, 0, 0, 0, 0, 0, 1, 1])

In [50]:
# Class probabilities for the same rows: columns are [P(not survived), P(survived)]
model.predict_proba(X_test[:10])

array([[0.97988119, 0.02011881],
       [0.07260167, 0.92739833],
       [0.97807389, 0.02192611],
       [0.98941446, 0.01058554],
       [0.61325099, 0.38674901],
       [0.99100956, 0.00899044],
       [0.99087627, 0.00912373],
       [0.9906455 , 0.0093545 ],
       [0.04499879, 0.95500121],
       [0.00673515, 0.99326485]])
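
The relationship between `predict` and `predict_proba` can be sketched as follows: the predicted label is the class whose column holds the largest probability in each row, and the columns follow the order of `classes_`. The example uses synthetic data and a separate model named `clf` (both are assumptions for illustration) so it does not touch the notebook's variables.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data: two Gaussian blobs with labels 0 and 1
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(50, 2)),
    rng.normal(2.0, 1.0, size=(50, 2)),
])
y = np.array([0] * 50 + [1] * 50)

clf = GaussianNB().fit(X, y)

proba = clf.predict_proba(X[:5])                      # one row per sample, one column per class
via_argmax = clf.classes_[np.argmax(proba, axis=1)]   # label of the most probable class
direct = clf.predict(X[:5])

print("column order:", clf.classes_)
print(proba.round(3))
print("argmax labels:", via_argmax, "predict labels:", direct)
```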

**Spam (phishing messages) Detection**

In [51]:
import pandas as pd

In [53]:
# Load dataset
df = pd.read_csv('/content/spam.csv')
df.head()

,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [80]:
# This cell was run after the 'spam' column was added below, hence 3 columns
df.shape

(5572, 3)

In [54]:
# Summary of the messages in each category
df.groupby('Category').describe()

Category,Message count,Message unique,Message top,Message freq
ham,4825,4516,"Sorry, I'll call later",30
spam,747,641,Please call our customer service representativ...,4
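
The counts above show that the classes are imbalanced, which matters when reading an accuracy score. A small sketch of the arithmetic, using only the two counts from the table: a trivial model that labels every message as ham already gets most messages right.

```python
# Counts taken from the describe() output above
ham_count = 4825
spam_count = 747
total = ham_count + spam_count

# Accuracy of a trivial model that labels every message as ham
baseline_accuracy = ham_count / total

print(f"total messages: {total}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Any classifier trained on this data should be compared against that baseline rather than against 50%.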


In [55]:
# Here 0 means not spam and 1 means spam
df['spam'] = df['Category'].apply(lambda x: 1 if x=='spam' else 0)
df.head()

,Category,Message,spam
0,ham,"Go until jurong point, crazy.. Available only ...",0
1,ham,Ok lar... Joking wif u oni...,0
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,ham,U dun say so early hor... U c already then say...,0
4,ham,"Nah I don't think he goes to usf, he lives aro...",0


In [63]:
# Split train and test data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.Message, df.spam, test_size=0.2)

In [64]:
# Convert text into word-count vectors
from sklearn.feature_extraction.text import CountVectorizer
v = CountVectorizer()
X_train_count = v.fit_transform(X_train.values)
X_train_count.toarray()[:3]

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [65]:
# Trains a Multinomial Naive Bayes classifier on the word-count features
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(X_train_count, y_train)

In [66]:
# Predict the classes of new emails

emails = [
          'Hey Kawsar Alam, can we get together to watch footbal game tomorrow?',
          'Upto 20% discount on parking, exclusive offer just for you. Dont miss this reward!'
         ]
emails_count = v.transform(emails)
model.predict(emails_count)

array([0, 1])

In [79]:
message = [
          'Free entry in 2 a wkly comp to win FA Cup final',
          'You are a beautiful person.'
         ]
message_count = v.transform(message)
model.predict(message_count)

array([1, 0])

In [67]:
# Model accuracy on the test set
X_test_count = v.transform(X_test)
model.score(X_test_count, y_test)

0.9865470852017937
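
The vectorize-then-classify steps above can also be bundled into a single object with scikit-learn's `Pipeline`, so raw text goes in and labels come out without calling `transform` by hand. This is a minimal sketch on a tiny made-up corpus, not the spam dataset; the step names `"vectorizer"` and `"nb"` are arbitrary labels chosen here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up corpus for illustration; 1 = spam, 0 = not spam
messages = [
    "win a free prize now",
    "free cash offer today",
    "claim your free prize",
    "see you at lunch",
    "call me later",
    "are we meeting tomorrow",
]
labels = [1, 1, 1, 0, 0, 0]

clf = Pipeline([
    ("vectorizer", CountVectorizer()),  # text -> word counts
    ("nb", MultinomialNB()),            # word counts -> class
])
clf.fit(messages, labels)

# The pipeline applies the fitted vectorizer automatically before predicting
pred = clf.predict(["free prize offer", "see you tomorrow"])
print(pred)
```

With a pipeline, `clf.score(X_test, y_test)` can be called directly on raw messages, which removes the chance of scoring on the wrong matrix.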