### Text Classification

#### Aim of this experiment

1.  Email spam detection is a classic example of text classification: given the text of an email, decide whether it is 'Ham' (legitimate) or 'Spam'.

2.  To decide whether an email is 'Spam' or 'Ham', this notebook uses the Naive Bayes algorithm.


#### Steps used in this algorithm

1.  Import all the necessary libraries

2.  Download the necessary NLTK libraries

3.  Create the Sample Dataset

4.  Preprocess the data

5.  Normalize the input text

6.  Divide the dataset into independent and dependent data

7.  Convert text to numeric (Tfidf Vectorizer)  

8.  Divide the dataset into training and testing data

9.  Train the Naive Bayes Model

10.  Make the predictions for the model

11. Evaluate the performance

12. Test on new messages

### Step 1: Import all the necessary libraries

In [813]:
import  numpy              as   np
import  pandas             as   pd
import  matplotlib.pyplot  as   plt
import  seaborn            as   sns

import  nltk

from    nltk.tokenize   import  word_tokenize, RegexpTokenizer
from    nltk.corpus     import  stopwords

from    sklearn.feature_extraction.text import TfidfVectorizer
from    sklearn.model_selection   import  train_test_split
from    sklearn.preprocessing     import  StandardScaler
from    sklearn.naive_bayes       import  MultinomialNB
from    sklearn.metrics           import  accuracy_score, confusion_matrix, classification_report

### OBSERVATIONS:

1.   numpy ------------------>  Numerical computation on arrays

2.   pandas ----------------->  Data manipulation

3.   matplotlib ------------->  Data visualization

4.   seaborn  --------------->  Statistical data visualization built on matplotlib

5.   nltk ------------------->  Text preprocessing (Natural Language Toolkit)

6.   tokenize --------------->  NLTK module that breaks text into smaller parts (tokens)

7.   word_tokenize ---------->  breaks a sentence into words

8.   RegexpTokenizer -------->  splits text using a regular expression; with the pattern `\w+` it keeps only runs of word characters, which drops punctuation and special symbols

9.   corpus ----------------->  NLTK module that gives access to bundled text collections and word lists

10.  stopwords -------------->   very common words (such as 'the', 'is', 'and') that carry little information for classification

11.  StandardScaler --------->   standardizes each feature to zero mean and unit variance (scaling to the range 0 to 1 is what MinMaxScaler does); it is not used in the steps shown here

12.  MultinomialNB ---------->   Naive Bayes variant for count-like features such as word counts, commonly used for text classification

13.  metrics    ------------->   functions that evaluate the performance of the model
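
The difference between StandardScaler and a 0-to-1 scaler can be checked on a toy column. This is a minimal sketch; the input values are made up for illustration and are not part of the notebook's data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy single-feature column; the values are made up for illustration.
values = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

standardized = StandardScaler().fit_transform(values)  # zero mean, unit variance
rescaled = MinMaxScaler().fit_transform(values)        # squeezed into [0, 1]

print("StandardScaler mean :", standardized.mean())
print("StandardScaler std  :", standardized.std())
print("StandardScaler range:", standardized.min(), standardized.max())
print("MinMaxScaler range  :", rescaled.min(), rescaled.max())
```

The standardized column contains negative values, so it is not confined to the range 0 to 1.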

### Step 2:  Download the necessary NLTK libraries

In [814]:
nltk.download('punkt_tab')
nltk.download('average_perceptron_tagger_eng')
nltk.download('stopwords')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Error loading average_perceptron_tagger_eng: Package
[nltk_data]     'average_perceptron_tagger_eng' not found in index
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### OBSERVATIONS:

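1.  punkt_tab ------------------->   Tokenizer data used by word_tokenize

2.  average_perceptron_tagger_eng ------------->    Intended to be the POS tagging model. The download fails (see the log above) because the package name is misspelled; the correct name is `averaged_perceptron_tagger_eng`. POS tagging is not used in the steps shown here, so the failure does not affect the results.

3.  stopwords   ------------------------------->   Stopword lists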

### Step 3: Create the Sample Dataset

In [815]:
data = {
    'message': [
        'Hey, are we still meeting today?',
        'Please review the attached document.',
        'Let’s catch up for lunch tomorrow.',
        'Can you send me the project files?',
        'Don’t forget about the team meeting at 3 PM.',
        'Happy birthday! Have a wonderful day.',
        'Are you coming to the office tomorrow?',
        'Thanks for your help with the report.',
        'Let’s schedule a call for next week.',
        'I will share the notes by evening.',
        'Please confirm your availability.',
        'The presentation has been updated.',
        'See you at the conference tomorrow.',
        'Can we reschedule our appointment?',
        'I have sent the invoice for your review.',
        'Congratulations! You have won a free ticket. Call now!',
        'Win cash prizes!!! Claim now!!!',
        'Get cheap loans at low interest rates.',
        'Limited time offer! Buy now and save big.',
        'You have been selected for a $1000 reward.',
        'Click here to claim your free vacation.',
        'Earn money quickly from home.',
        'Exclusive deal just for you. Act fast!',
        'Your account has been compromised. Verify now!',
        'Free entry in a weekly competition. Text WIN now!',
        'Urgent! Update your bank details immediately.',
        'You are a lucky winner! Claim your prize today.',
        'Get rich fast with this simple trick.',
        'Lowest price guaranteed. Order today!',
        'Congratulations! Claim your bonus reward now!'
    ],
    'label': [
        'ham','ham','ham','ham','ham',
        'ham','ham','ham','ham','ham',
        'ham','ham','ham','ham','ham',
        'spam','spam','spam','spam','spam',
        'spam','spam','spam','spam','spam',
        'spam','spam','spam','spam','spam'
    ]
}

In [816]:
print(data)

{'message': ['Hey, are we still meeting today?', 'Please review the attached document.', 'Let’s catch up for lunch tomorrow.', 'Can you send me the project files?', 'Don’t forget about the team meeting at 3 PM.', 'Happy birthday! Have a wonderful day.', 'Are you coming to the office tomorrow?', 'Thanks for your help with the report.', 'Let’s schedule a call for next week.', 'I will share the notes by evening.', 'Please confirm your availability.', 'The presentation has been updated.', 'See you at the conference tomorrow.', 'Can we reschedule our appointment?', 'I have sent the invoice for your review.', 'Congratulations! You have won a free ticket. Call now!', 'Win cash prizes!!! Claim now!!!', 'Get cheap loans at low interest rates.', 'Limited time offer! Buy now and save big.', 'You have been selected for a $1000 reward.', 'Click here to claim your free vacation.', 'Earn money quickly from home.', 'Exclusive deal just for you. Act fast!', 'Your account has been compromised. Verify no

In [817]:
### Construct the DataFrame using the data

df = pd.DataFrame(data)

In [818]:
df

| | message | label |
|---|---|---|
| 0 | Hey, are we still meeting today? | ham |
| 1 | Please review the attached document. | ham |
| 2 | Let’s catch up for lunch tomorrow. | ham |
| 3 | Can you send me the project files? | ham |
| 4 | Don’t forget about the team meeting at 3 PM. | ham |
| 5 | Happy birthday! Have a wonderful day. | ham |
| 6 | Are you coming to the office tomorrow? | ham |
| 7 | Thanks for your help with the report. | ham |
| 8 | Let’s schedule a call for next week. | ham |
| 9 | I will share the notes by evening. | ham |


### OBSERVATIONS:

1.  A DataFrame is created with two columns: message and label.

2.  The message is the input.

3.  The label is the output.

### Step 4: Preprocess the data

In [819]:
### Encode the output label: 'spam' -> '1', 'ham' -> '0'
### (the values are strings, so the column dtype stays object)

df['label'] = df['label'].map({'spam':'1','ham':'0'})

In [820]:
df

| | message | label |
|---|---|---|
| 0 | Hey, are we still meeting today? | 0 |
| 1 | Please review the attached document. | 0 |
| 2 | Let’s catch up for lunch tomorrow. | 0 |
| 3 | Can you send me the project files? | 0 |
| 4 | Don’t forget about the team meeting at 3 PM. | 0 |
| 5 | Happy birthday! Have a wonderful day. | 0 |
| 6 | Are you coming to the office tomorrow? | 0 |
| 7 | Thanks for your help with the report. | 0 |
| 8 | Let’s schedule a call for next week. | 0 |
| 9 | I will share the notes by evening. | 0 |


### OBSERVATIONS:

1. Here the label column has been encoded from words to digits.

2. All labels have been changed from 'spam' and 'ham' to '1' and '0' respectively. Because the mapping uses quoted values, the labels are stored as strings rather than integers.
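
The mapping in this step uses the quoted values '1' and '0', so the column holds text (later outputs report `dtype: object`). The sketch below contrasts that with an integer mapping on a small stand-in Series. The integer variant is an alternative, not what the notebook runs, and adopting it would change the dtypes and prediction arrays shown in later steps.

```python
import pandas as pd

# Small stand-in for the notebook's label column.
labels = pd.Series(['ham', 'spam', 'ham', 'spam'])

as_strings = labels.map({'spam': '1', 'ham': '0'})   # mapping used in the notebook
as_integers = labels.map({'spam': 1, 'ham': 0})      # integer alternative

print("string mapping :", as_strings.tolist(), as_strings.dtype)
print("integer mapping:", as_integers.tolist(), as_integers.dtype)
```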

### Step 5: Normalize the input text

In [821]:
reg = RegexpTokenizer(r'\w+')

In [822]:
### define the function
def clean_text(message):
    ### convert the message to lower case
    message = message.lower()
    ### keep only runs of word characters, which removes punctuation and special symbols
    message = reg.tokenize(message)
    ### join the list of tokens back into a single string
    message = " ".join(message)
    ### tokenize the cleaned string into words with NLTK
    words = word_tokenize(message)
    ### join the list of words back into a single string
    words = " ".join(words)

    return words

In [823]:
### call the function
df['clean_message'] = df['message'].apply(clean_text)

In [824]:
df['clean_message']

0                        hey are we still meeting today
1                   please review the attached document
2                     let s catch up for lunch tomorrow
3                     can you send me the project files
4           don t forget about the team meeting at 3 pm
5                   happy birthday have a wonderful day
6                 are you coming to the office tomorrow
7                  thanks for your help with the report
8                   let s schedule a call for next week
9                     i will share the notes by evening
10                     please confirm your availability
11                    the presentation has been updated
12                   see you at the conference tomorrow
13                    can we reschedule our appointment
14              i have sent the invoice for your review
15    congratulations you have won a free ticket cal...
16                            win cash prizes claim now
17                get cheap loans at low interes

In [825]:
### drop the original message column from the dataset
### (columns= already names the axis, so axis=1 is not needed)

df.drop(columns='message', inplace=True)

In [826]:
df

| | label | clean_message |
|---|---|---|
| 0 | 0 | hey are we still meeting today |
| 1 | 0 | please review the attached document |
| 2 | 0 | let s catch up for lunch tomorrow |
| 3 | 0 | can you send me the project files |
| 4 | 0 | don t forget about the team meeting at 3 pm |
| 5 | 0 | happy birthday have a wonderful day |
| 6 | 0 | are you coming to the office tomorrow |
| 7 | 0 | thanks for your help with the report |
| 8 | 0 | let s schedule a call for next week |
| 9 | 0 | i will share the notes by evening |


In [827]:
dfa = df[['clean_message','label']]

In [828]:
dfa

| | clean_message | label |
|---|---|---|
| 0 | hey are we still meeting today | 0 |
| 1 | please review the attached document | 0 |
| 2 | let s catch up for lunch tomorrow | 0 |
| 3 | can you send me the project files | 0 |
| 4 | don t forget about the team meeting at 3 pm | 0 |
| 5 | happy birthday have a wonderful day | 0 |
| 6 | are you coming to the office tomorrow | 0 |
| 7 | thanks for your help with the report | 0 |
| 8 | let s schedule a call for next week | 0 |
| 9 | i will share the notes by evening | 0 |


In [829]:
df = dfa

In [830]:
df

| | clean_message | label |
|---|---|---|
| 0 | hey are we still meeting today | 0 |
| 1 | please review the attached document | 0 |
| 2 | let s catch up for lunch tomorrow | 0 |
| 3 | can you send me the project files | 0 |
| 4 | don t forget about the team meeting at 3 pm | 0 |
| 5 | happy birthday have a wonderful day | 0 |
| 6 | are you coming to the office tomorrow | 0 |
| 7 | thanks for your help with the report | 0 |
| 8 | let s schedule a call for next week | 0 |
| 9 | i will share the notes by evening | 0 |


### OBSERVATIONS:

1. After preprocessing, all punctuation and special symbols have been removed from the input text. Apostrophes are removed as well, so "Let’s" becomes "let s" and "Don’t" becomes "don t".

2. The labels are stored as '1' (spam) and '0' (ham).
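
The effect of the `\w+` pattern on punctuation and apostrophes can be reproduced with the standard library. In this sketch `re.findall` stands in for `RegexpTokenizer(r'\w+')`, and the function name `clean_text_sketch` is hypothetical.

```python
import re

def clean_text_sketch(message):
    """Lower-case the text and keep only runs of word characters."""
    tokens = re.findall(r"\w+", message.lower())
    return " ".join(tokens)

# The typographic apostrophe is not a word character, so "Let’s" splits in two.
print(clean_text_sketch("Let’s catch up for lunch tomorrow."))
print(clean_text_sketch("Win cash prizes!!! Claim now!!!"))
```

Both results match the corresponding rows of `df['clean_message']` shown in this step.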

### Step 6:  Divide the dataset into independent and dependent data

In [831]:
### independent variables

X = df['clean_message']

In [832]:
### dependent variables

Y = df['label']

In [833]:
X

0                        hey are we still meeting today
1                   please review the attached document
2                     let s catch up for lunch tomorrow
3                     can you send me the project files
4           don t forget about the team meeting at 3 pm
5                   happy birthday have a wonderful day
6                 are you coming to the office tomorrow
7                  thanks for your help with the report
8                   let s schedule a call for next week
9                     i will share the notes by evening
10                     please confirm your availability
11                    the presentation has been updated
12                   see you at the conference tomorrow
13                    can we reschedule our appointment
14              i have sent the invoice for your review
15    congratulations you have won a free ticket cal...
16                            win cash prizes claim now
17                get cheap loans at low interes

In [834]:
Y

0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
12    0
13    0
14    0
15    1
16    1
17    1
18    1
19    1
20    1
21    1
22    1
23    1
24    1
25    1
26    1
27    1
28    1
29    1
Name: label, dtype: object

### Step 7: Convert text to numeric (Tfidf Vectorizer)  


#### Why Tfidf Vectorizer is used ?

1.   The TF-IDF vectorizer converts text into numerical vectors, which is the form the model needs.

2.   It gives more weight to words that are frequent in a message but rare across the whole dataset.

3.   This reduces the influence of uninformative, very common words.

4.   Common words are down-weighted rather than set to 0; a word gets the value 0 only in messages where it does not occur.

5.   The output is a sparse matrix, which is memory-efficient for large vocabularies.

6.   It is commonly combined with Multinomial Naive Bayes for text classification.
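
To make the weighting concrete, here is a minimal pure-Python sketch of the formula scikit-learn documents for its default settings (`smooth_idf=True`, `norm='l2'`): idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The three documents and the helper names `idf` and `tfidf_vector` are made up for illustration.

```python
import math
from collections import Counter

# Three made-up documents, already lower-cased and punctuation-free.
docs = [
    "claim your free prize now",
    "claim your bonus now",
    "see you at the meeting",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_vector(tokens):
    # Raw term count times IDF, then scale the vector to unit L2 length.
    counts = Counter(tokens)
    raw = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

weights = tfidf_vector(tokenized[0])
for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:6s} {weight:.3f}")
```

In the first document, "free" and "prize" appear in one document only and receive higher weights than "claim", "your" and "now", which appear in two. None of the weights is 0.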

In [835]:
### Create the object of Tfidf vectorizer

tfidf = TfidfVectorizer()

### using the object of Tfidf vectorizer, transform the text

X = tfidf.fit_transform(X)

In [836]:
X

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 191 stored elements and shape (30, 127)>

### OBSERVATIONS:

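1.  Here the input text is converted into a sparse matrix with 30 rows (one per message) and 127 columns (one per vocabulary term); only 191 entries are non-zero.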

In [837]:
### Convert the sparse matrix into numpy array for better view and visibility

X_array = X.toarray()

In [838]:
X_array

array([[0.       , 0.       , 0.       , ..., 0.       , 0.       ,
        0.       ],
       [0.       , 0.       , 0.       , ..., 0.       , 0.       ,
        0.       ],
       [0.       , 0.       , 0.       , ..., 0.       , 0.       ,
        0.       ],
       ...,
       [0.       , 0.       , 0.       , ..., 0.       , 0.       ,
        0.       ],
       [0.       , 0.       , 0.       , ..., 0.       , 0.       ,
        0.       ],
       [0.       , 0.       , 0.       , ..., 0.       , 0.       ,
        0.3004734]])

### OBSERVATIONS:

1. Here the sparse matrix X has been converted into a dense numpy array.

2. In each row, the words that occur in that message have a TF-IDF weight greater than 0.

3. A value of 0 means the word does not occur in that message; most entries are 0 because each message contains only a few of the 127 vocabulary terms.

In [839]:
Y.value_counts()

label
0    15
1    15
Name: count, dtype: int64

### Step 8: Divide the dataset into training and testing data

In [840]:
from sklearn.model_selection  import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X_array, Y, test_size=0.2,random_state=42)

In [841]:
X_train

array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.31282527,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.28219314],
       [0.45326564, 0.        , 0.        , ..., 0.        , 0.28529278,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.2860018 ,
        0.        ]])

In [842]:
X_test

array([[0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.35285646,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.35285646, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.  

In [843]:
print("Shape of the input training data is:", X_train.shape)

print("Shape of the input testing data is:", X_test.shape)

Shape of the input training data is: (24, 127)
Shape of the input testing data is: (6, 127)


In [844]:
Y_train

28    1
24    1
12    0
0     0
4     0
16    1
5     0
13    0
11    0
22    1
1     0
2     0
25    1
3     0
21    1
26    1
18    1
29    1
20    1
7     0
10    0
14    0
19    1
6     0
Name: label, dtype: object

In [845]:
Y_test

27    1
15    1
23    1
17    1
8     0
9     0
Name: label, dtype: object

In [846]:
print("Shape of the output training data is:", Y_train.shape)

print("Shape of the output testing data is:", Y_test.shape)

Shape of the output training data is: (24,)
Shape of the output testing data is: (6,)
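
The `Y_test` output above contains four '1' labels and two '0' labels, although the full dataset is balanced. With only six test rows, one option is a stratified split, which keeps the class ratio the same in both parts. The sketch below uses stand-in data with the same 15/15 balance; the variable names are illustrative.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Stand-in data with the same 15/15 class balance as the notebook.
texts = [f"message {i}" for i in range(30)]
labels = ['0'] * 15 + ['1'] * 15

# stratify=labels asks for the same class proportions in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

print("train:", Counter(y_tr))
print("test :", Counter(y_te))
```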


### Step 9:  Train the Naive Bayes Model

In [847]:
from sklearn.naive_bayes  import MultinomialNB

### Create an object for Multinomial Naive Bayes

multi_NB = MultinomialNB()

### using the object of Multinomial Naive Bayes, train the model

multi_NB.fit(X_train,Y_train)

### OBSERVATIONS:

1.  Here an object 'multi_NB' of MultinomialNB is created and trained on the training data.

2.  Multinomial Naive Bayes is widely used for text classification because it models word-frequency features.

3.  It is designed for count-based text features; TF-IDF weights, as used here, also work in practice.

### Step 10: Make the predictions for the model

In [848]:
Y_pred = multi_NB.predict(X_test)

In [849]:
Y_pred

array(['0', '1', '0', '0', '0', '0'], dtype='<U1')

### Step 11: Evaluate the performance

In [850]:
from    sklearn.metrics           import  accuracy_score, confusion_matrix, classification_report


ac   =   accuracy_score(Y_test, Y_pred)

print("Accuracy of the model is:", (ac * 100.0))

Accuracy of the model is: 50.0


In [851]:
cm   =   confusion_matrix(Y_test, Y_pred)

print("confusion matrix of the model is:", (cm))

confusion matrix of the model is: [[2 0]
 [3 1]]


In [852]:
cr   =   classification_report(Y_test, Y_pred)

print("classification report of the model is:", (cr))

classification report of the model is:               precision    recall  f1-score   support

           0       0.40      1.00      0.57         2
           1       1.00      0.25      0.40         4

    accuracy                           0.50         6
   macro avg       0.70      0.62      0.49         6
weighted avg       0.80      0.50      0.46         6
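
The figures in the classification report can be re-derived by hand from the confusion matrix printed above (rows are true labels, columns are predicted labels). This is a small sketch; the helper name `precision_recall_f1` is made up.

```python
# Confusion matrix reported above: rows are true labels, columns are predictions.
cm = [[2, 0],   # true ham : 2 predicted ham, 0 predicted spam
      [3, 1]]   # true spam: 3 predicted ham, 1 predicted spam

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

def precision_recall_f1(cm, k):
    true_pos = cm[k][k]
    predicted_k = cm[0][k] + cm[1][k]   # column sum: everything predicted as class k
    actual_k = sum(cm[k])               # row sum: everything that really is class k
    precision = true_pos / predicted_k
    recall = true_pos / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(f"accuracy = {accuracy:.2f}")
for k, name in enumerate(["ham (0)", "spam (1)"]):
    p, r, f = precision_recall_f1(cm, k)
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

The model labels five of the six test messages as ham, which is why ham recall is 1.00 while spam recall is only 0.25.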



### Step 12:  Test on new messages

In [853]:
sample = ["Are we meeting tomorrow for the project?"]


### Clean the text

cleaned = [clean_text(sample[0])]

print(cleaned)

### Convert the text into tfidf vector

transformed = tfidf.transform(cleaned)

print(transformed)

### Predict the model

predictions = multi_NB.predict(transformed)

print(predictions)


### the labels are strings, so compare against '1' rather than the integer 1
if predictions[0] == '1':
    print("Spam")
else:
    print("Ham")

['are we meeting tomorrow for the project']
<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 7 stored elements and shape (1, 127)>
  Coords	Values
  (0, 6)	0.3764201963997667
  (0, 41)	0.3073021140898506
  (0, 67)	0.4119517764435806
  (0, 82)	0.462030725823138
  (0, 102)	0.26324923233683106
  (0, 108)	0.3764201963997667
  (0, 116)	0.4119517764435806
['0']
Ham
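
The predicted label printed above is the string '0', not the integer 0, so the type used in the final comparison matters. The sketch below shows the difference on a stand-in prediction array and uses a lookup table instead of an if/else; the name `label_names` is hypothetical.

```python
import numpy as np

# Stand-in for a model output: an array holding one string label.
predictions = np.array(['1'])

print(predictions[0] == 1)     # compares a string with an integer
print(predictions[0] == '1')   # compares a string with a string

# A lookup table keyed by the string labels avoids the type mismatch.
label_names = {'0': 'Ham', '1': 'Spam'}
print(label_names[str(predictions[0])])
```

With string labels, a check against the integer 1 never succeeds, so a spam prediction would still be reported as "Ham".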
