# Naive Bayes
* This part covers email classification: deciding whether an email is ham (legitimate) or spam.

In [2]:
# Import pandas and read the CSV file:
import pandas as pd
df = pd.read_csv("spam.csv")
df.head()

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [3]:
# Now let's explore the data. Here we group the dataset by category to see how many emails are ham and how many are spam:
df.groupby('Category').describe()

Category,Message count,Message unique,Message top,Message freq
ham,4825,4516,"Sorry, I'll call later",30
spam,747,641,Please call our customer service representativ...,4
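The counts above show that the two classes are far from balanced. As a rough sketch, the shares can be computed from those same counts; the `counts` Series below is an illustrative stand-in built by hand, not the real DataFrame.

```python
import pandas as pd

# Counts copied from the describe() output above; the Series itself is illustrative.
counts = pd.Series({"ham": 4825, "spam": 747})

# Share of each class in the whole dataset.
shares = counts / counts.sum()
print(shares.round(3))

# Accuracy of a trivial model that always answers "ham".
baseline_accuracy = shares["ham"]
print(round(baseline_accuracy, 3))
```

On the real data, `df['Category'].value_counts(normalize=True)` should give the same shares. The baseline is worth keeping in mind when reading accuracy scores later: a model has to beat roughly 0.87 to be doing anything useful.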


In [4]:
# The next step is to convert the Category and Message columns into numbers. Let's do Category first:
df['spam'] = df['Category'].apply(lambda x: 1 if x == 'spam' else 0)
df.head()

Unnamed: 0,Category,Message,spam
0,ham,"Go until jurong point, crazy.. Available only ...",0
1,ham,Ok lar... Joking wif u oni...,0
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,ham,U dun say so early hor... U c already then say...,0
4,ham,"Nah I don't think he goes to usf, he lives aro...",0
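The same 0/1 column can also be built without a lambda. The sketch below uses a tiny hand-made frame (`toy` is a hypothetical name) with the same `Category` column to show the vectorized form.

```python
import pandas as pd

# Tiny illustrative frame with the same column name as the dataset above.
toy = pd.DataFrame({"Category": ["ham", "spam", "ham"]})

# Vectorized comparison: True/False is cast to 1/0, no Python-level lambda needed.
toy["spam"] = (toy["Category"] == "spam").astype(int)
print(toy["spam"].tolist())
```

Both forms produce the same result; the comparison version is usually faster on large frames because it avoids calling a Python function per row.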


In [6]:
# Next we split our dataset into train and test samples:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.Message, df.spam, test_size=0.2)
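Because the split above is random and unseeded, the scores further down will change slightly from run to run. A possible variation, shown here on made-up labels, is to pass `random_state` for repeatability and `stratify` to keep the ham/spam ratio the same in both parts. The arrays and the names `X_tr`, `X_te`, `y_tr`, `y_te` are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative imbalanced labels: 90 zeros (ham) and 10 ones (spam).
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)  # placeholder features

# stratify keeps the class ratio equal in both parts;
# random_state makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(y_tr.mean(), y_te.mean())
```

For the notebook's data this would correspond to adding `stratify=df.spam` and a fixed `random_state` to the call above, at the cost of getting different numbers than the ones printed in this notebook.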

In [8]:
# To convert the Message column into numbers, we use the count vectorizer technique (TF-IDF is a common alternative).
# It is described in the scikit-learn documentation. Let's do it:

from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer()
X_train_count = vec.fit_transform(X_train.values)
X_train_count.toarray()[:2]

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)
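The real matrix is almost all zeros, so it is hard to see what the vectorizer did. A small sketch on two invented messages makes the idea visible: each column stands for one word of the vocabulary and each cell counts how often that word occurs in the message. The names `docs`, `toy_vec`, and `toy_counts` are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two illustrative messages, not taken from the dataset.
docs = ["free entry free prize", "call me later"]

toy_vec = CountVectorizer()
toy_counts = toy_vec.fit_transform(docs)

# vocabulary_ maps each word to its column index; sort the words by that index.
words = sorted(toy_vec.vocabulary_, key=toy_vec.vocabulary_.get)
print(words)
print(toy_counts.toarray())
```

Here the word "free" appears twice in the first message, so its column holds a 2 in the first row and a 0 in the second.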

In [11]:
# Here we use MultinomialNB, the multinomial variant of the Naive Bayes algorithm:
from sklearn.naive_bayes import MultinomialNB
m_nb = MultinomialNB()
m_nb.fit(X_train_count,y_train)

MultinomialNB()
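For reference, the standard multinomial Naive Bayes decision rule with additive smoothing can be written as below. The symbols are introduced here and do not appear in the notebook: $x_i$ is the count of word $i$ in the message, $N_{ci}$ is the total count of word $i$ in the training messages of class $c$, $N_c$ is the total word count of class $c$, $n$ is the vocabulary size, and $\alpha$ is the smoothing parameter (`alpha`, which defaults to 1.0 in scikit-learn).

```latex
\hat{y} = \arg\max_{c} \Big[ \log P(c) + \sum_{i=1}^{n} x_i \log \hat{\theta}_{ci} \Big],
\qquad
\hat{\theta}_{ci} = \frac{N_{ci} + \alpha}{N_{c} + \alpha n}
```

The smoothing term keeps a word that never occurred in one class from forcing that class's probability to zero.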

In [17]:
# The model is trained, so now we can make predictions. Let's take a couple of emails: the first one is ham and the second
# one is spam. We'll give both to the model and check that it predicts the correct categories.
emails = [
    "Thank you so much dear Majtaba Khan, I have received it.",
    "Yooooooou are the WINNER! follow the link http://i'mlive.com"
]
emails_count = vec.transform(emails)
m_nb.predict(emails_count)

array([0, 1], dtype=int64)

* We see that the model predicted the category of both emails, and both answers are correct!
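Besides the hard 0/1 label, the classifier can also report how confident it is through `predict_proba`. The sketch below trains on four invented messages, so the numbers say nothing about the real dataset; `train_msgs`, `demo_vec`, and `demo_nb` are hypothetical names.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Illustrative training messages and labels (1 = spam, 0 = ham).
train_msgs = [
    "win a free prize now",
    "free entry win cash",
    "see you at lunch",
    "call me when you are home",
]
train_labels = [1, 1, 0, 0]

demo_vec = CountVectorizer()
demo_nb = MultinomialNB()
demo_nb.fit(demo_vec.fit_transform(train_msgs), train_labels)

# Each row holds [P(ham), P(spam)], ordered as in demo_nb.classes_.
proba = demo_nb.predict_proba(demo_vec.transform(["free prize inside"]))
print(demo_nb.classes_, proba.round(3))
```

With the notebook's own objects, the equivalent call would be `m_nb.predict_proba(emails_count)`.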

In [18]:
# Let's see the model score. To do that, we first convert X_test to counts and then feed it to the model:
X_test_count = vec.transform(X_test)
m_nb.score(X_test_count, y_test)

0.9865470852017937
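Since spam makes up only a small share of the messages, a single accuracy number can hide mistakes on the spam class. One way to look closer is a confusion matrix and per-class report. The labels below are invented purely to show the layout and are not results from this model.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Illustrative true and predicted labels (1 = spam, 0 = ham), not real results.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Precision and recall per class.
print(classification_report(y_true, y_pred, target_names=["ham", "spam"]))
```

With the notebook's objects, the corresponding call would be `confusion_matrix(y_test, m_nb.predict(X_test_count))`.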

In [19]:
# In [17], we had to convert the emails before predicting, and in [18] we again needed the transform method.
# To avoid that, sklearn has a feature called 'Pipeline', where you define the steps of the
# transformation. Let's import it and create the pipeline from its steps. The first step is CountVectorizer,
# and the second step applies MultinomialNB.

from sklearn.pipeline import Pipeline
clf = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('nb', MultinomialNB())
])

In [20]:
# The classifier is created. Now we need to train it, and we can do that directly with X_train and y_train:
clf.fit(X_train, y_train)

Pipeline(steps=[('vectorizer', CountVectorizer()), ('nb', MultinomialNB())])

In [21]:
# The model is trained. We can check the performance:
clf.score(X_test, y_test)

0.9865470852017937

* We see exactly the same accuracy as before.

In [22]:
# Let's supply the two emails we defined previously to see the model working:
clf.predict(emails)

array([0, 1], dtype=int64)

* **Yes!** It's predicting correctly.
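A side benefit of the pipeline is that a step can be swapped without touching the rest of the code. As a sketch, here is the same two-step pipeline with `TfidfVectorizer` (the TF-IDF alternative mentioned earlier) in place of `CountVectorizer`. This is a swapped-in technique, not what the notebook uses, and the messages and the name `tfidf_clf` are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Illustrative messages and labels (1 = spam, 0 = ham).
msgs = [
    "win free cash prize",
    "free entry win now",
    "are we meeting for lunch",
    "call me when you get home",
]
labels = [1, 1, 0, 0]

# Same structure as the pipeline above, only the first step differs.
tfidf_clf = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])
tfidf_clf.fit(msgs, labels)

pred = tfidf_clf.predict(["win free cash"])
print(pred)
```

Whether TF-IDF actually helps on the spam dataset would need to be checked by scoring both pipelines on the same test split.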

### Exercise
#### Machine Learning Tutorial - Naive Bayes: 
Use the wine dataset from sklearn.datasets to classify wines into 3 categories. Load the dataset and split it into train and test sets. Then train a model with both the Gaussian and the Multinomial classifier and report which one performs better. Finally, use the trained model to make some predictions on the test data.

In [23]:
# Let's first import the dataset:
from sklearn.datasets import load_wine
wine = load_wine()

In [24]:
# Let's see what we have in this dataset:
dir(wine)

['DESCR', 'data', 'feature_names', 'frame', 'target', 'target_names']
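Before modelling, it can help to check the shape of the data and how many samples each class has. A short sketch using only attributes listed above (the names `wine_data` and `class_counts` are illustrative):

```python
import numpy as np
from sklearn.datasets import load_wine

wine_data = load_wine()

# Number of samples and features.
print(wine_data.data.shape)

# Names of the three classes and the number of samples in each.
print(wine_data.target_names)
class_counts = np.bincount(wine_data.target)
print(class_counts)
```

The classes are reasonably balanced here, unlike in the spam dataset.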

In [29]:
# To create a DataFrame:
dfe = pd.DataFrame(wine.data, columns = wine.feature_names)
dfe.head()

Unnamed: 0,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline
0,14.23,1.71,2.43,15.6,127.0,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065.0
1,13.2,1.78,2.14,11.2,100.0,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050.0
2,13.16,2.36,2.67,18.6,101.0,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185.0
3,14.37,1.95,2.5,16.8,113.0,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480.0
4,13.24,2.59,2.87,21.0,118.0,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735.0


In [33]:
# All the columns are relevant features, so we simply create 'X' as:
X = wine.data
X

array([[1.423e+01, 1.710e+00, 2.430e+00, ..., 1.040e+00, 3.920e+00,
        1.065e+03],
       [1.320e+01, 1.780e+00, 2.140e+00, ..., 1.050e+00, 3.400e+00,
        1.050e+03],
       [1.316e+01, 2.360e+00, 2.670e+00, ..., 1.030e+00, 3.170e+00,
        1.185e+03],
       ...,
       [1.327e+01, 4.280e+00, 2.260e+00, ..., 5.900e-01, 1.560e+00,
        8.350e+02],
       [1.317e+01, 2.590e+00, 2.370e+00, ..., 6.000e-01, 1.620e+00,
        8.400e+02],
       [1.413e+01, 4.100e+00, 2.740e+00, ..., 6.100e-01, 1.600e+00,
        5.600e+02]])

In [30]:
# Target variable is:
target = wine.target
target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2])

In [37]:
# Let's split the dataset into train and test samples:
X_train, X_test, y_train, y_test = train_test_split(X, target, test_size = 0.2)

In [38]:
# Now let's first create the MultinomialNB model:
nb_m = MultinomialNB()

In [40]:
# Let's train the model:
nb_m.fit(X_train, y_train)

MultinomialNB()

In [41]:
# The model is trained. Let's see the model score:
nb_m.score(X_test, y_test)

0.8611111111111112

In [42]:
# Now let's create and train GaussianNB model:
from sklearn.naive_bayes import GaussianNB
nb_g = GaussianNB()

In [43]:
# The model is created. Let's train it:
nb_g.fit(X_train, y_train)

GaussianNB()
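The exercise also asks for some predictions on the test data. A minimal self-contained sketch follows; `random_state=0` is an assumption added here so the output is repeatable, so this split differs from the notebook's, and the names `gnb`, `pred_ids`, and `pred_names` are illustrative.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

wine_data = load_wine()

# random_state is an assumption added here so the sketch is repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    wine_data.data, wine_data.target, test_size=0.2, random_state=0
)

gnb = GaussianNB().fit(X_tr, y_tr)

# Predict the first five test wines and map class ids to their names.
pred_ids = gnb.predict(X_te[:5])
pred_names = wine_data.target_names[pred_ids]
print("predicted:", pred_names)
print("actual:   ", wine_data.target_names[y_te[:5]])
```

With the notebook's own objects, the equivalent would be `nb_g.predict(X_test[:5])`.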

In [44]:
# The model is trained. Let's see the model accuracy and compare it with MultinomialNB:
nb_g.score(X_test, y_test)

0.9722222222222222

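* GaussianNB clearly outperforms MultinomialNB on this split (about 0.97 versus 0.86 accuracy). This is expected: the wine features are continuous measurements, which GaussianNB is designed for, while MultinomialNB assumes count-like features.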
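The test set here has only 36 wines, so a comparison based on one random split can shift noticeably between runs. A possible way to make it more robust is cross-validation, sketched below with 5 folds (the fold count and the variable names are choices made for this example).

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB, MultinomialNB

X_wine, y_wine = load_wine(return_X_y=True)

# Mean accuracy over 5 stratified folds for each classifier.
gauss_cv = cross_val_score(GaussianNB(), X_wine, y_wine, cv=5).mean()
multi_cv = cross_val_score(MultinomialNB(), X_wine, y_wine, cv=5).mean()
print(round(gauss_cv, 3), round(multi_cv, 3))
```

Averaging over several folds gives a steadier estimate than a single `score` call on one test split.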

* That's all about the Naive Bayes algorithm.