# Text Analytics | BAIS:6100
# Module 8: Text Classification

Instructor: Kang-Pyo Lee 

## Loading the Dataset into a Pandas Dataframe

In [None]:
import pandas as pd
pd.set_option('display.max_colwidth', 150)

df = pd.read_csv("classdata/emails.csv")
df

GitHub - randerson112358/Python/Email_Spam_Detection: https://github.com/randerson112358/Python/tree/master/Email_Spam_Detection

In [None]:
df.info()

In [None]:
df.spam.value_counts()

pandas.Series.value_counts: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html

## Cleaning the Data

In [None]:
df.text.value_counts()

In [None]:
df = df.drop_duplicates(keep="first")
df

pandas.DataFrame.drop_duplicates: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html

In [None]:
df.text.value_counts()

## Setting the Goal

Our goal is to build a binary <b>classification</b> model that is able to classify an email text as spam or non-spam. 
- Feature variables: words in the email body texts
- Outcome variable  : the spam column (1:spam, 0:non-spam)
- Records          : emails

In [None]:
from IPython.display import Image
Image("classdata/images/classification.png")

## Preparing Data for Modeling

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(use_idf=True, norm="l2", stop_words="english", max_df=0.7)
X = vectorizer.fit_transform(df.text)
y = df.spam

The words in the document-term matrix will be used as features of the model and the spam column as the outcome variable of the model. 

In [None]:
X.shape, y.shape

There are 5,695 documents and 36,995 words, or features. 

In [None]:
Image(url="https://docs.splunk.com/images/thumb/3/3b/TrainTest.png/1100px-TrainTest.png")

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

sklearn.model_selection.train_test_split: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

`X_train` and `y_train` will be used in training the model, while `X_test` and `y_test` will be used in testing the model.  
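As a small illustration of what the split does, the sketch below applies `train_test_split` to ten hypothetical records. The `toy_` names and the data are assumptions made for this example only.

```python
from sklearn.model_selection import train_test_split

# Ten hypothetical records with alternating labels, used only to show the split sizes.
toy_X = [[i] for i in range(10)]
toy_y = [i % 2 for i in range(10)]

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.20, random_state=0
)

print(len(toy_X_train), len(toy_X_test))

# Every record lands in exactly one of the two sets.
all_rows = sorted(row[0] for row in toy_X_train + toy_X_test)
print(all_rows == list(range(10)))
```

With `test_size=0.20`, two of the ten records are held out for testing and the remaining eight are used for training.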

In [None]:
X_train.shape, y_train.shape, X_test.shape, y_test.shape

## Modeling with k-Nearest Neighbors (k-NNs)

### Step 1. Choose a classification algorithm to try

### Step 2. Initialize a model object with initial parameters

In [None]:
from sklearn.neighbors import KNeighborsClassifier 

knn = KNeighborsClassifier(n_neighbors=1)     # The number of neighbors to consider, or k, is set to 1. 
knn

sklearn.neighbors.KNeighborsClassifier: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
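To show what the parameter k controls, here is a rough pure-Python sketch of the k-NN idea: find the k closest training points and take a majority vote over their labels. It is not how scikit-learn implements the classifier; `knn_predict` and the toy points are hypothetical, and Euclidean distance is assumed.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k):
    """Predict the label of `query` by majority vote among its k nearest points."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical 2-D points: label 1 clusters near (1, 1), label 0 near (5, 5).
toy_points = [(1, 1), (1, 2), (2, 1), (5, 5), (6, 5), (5, 6)]
toy_labels = [1, 1, 1, 0, 0, 0]

print(knn_predict(toy_points, toy_labels, query=(2, 2), k=1))
print(knn_predict(toy_points, toy_labels, query=(4.5, 5), k=3))
```

With k = 1 the prediction depends on a single neighbor, so it is more sensitive to noisy points than a vote over several neighbors.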

### Step 3. Fit the model using the training data

In [None]:
knn.fit(X_train, y_train)

### Step 4. Check the performance of the model

In [None]:
knn.score(X_train, y_train), knn.score(X_test, y_test)

Next, using only the test data, we compare the true outcome values (`y_test`) with the predicted outcome values (`pred`).

In [None]:
pred = knn.predict(X_test)

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

In [None]:
print(classification_report(y_test, pred))

sklearn.metrics.classification_report: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html

<b>Precision</b> is the fraction of relevant (correct) instances among the retrieved instances, while <b>recall</b> is the fraction of relevant (correct) instances that were retrieved. 

In [None]:
Image(url="https://miro.medium.com/max/1872/1*pOtBHai4jFd-ujaNXPilRg.png")

- True positive: classifying what is true as true (good), e.g., diagnosing a patient who has the flu as having the flu
- False positive: classifying what is false as true (bad), e.g., diagnosing a patient who does not have the flu as having the flu
- True negative: classifying what is false as false (good), e.g., diagnosing a patient who does not have the flu as not having the flu
- False negative: classifying what is true as false (bad), e.g., diagnosing a patient who has the flu as not having the flu
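The four outcomes above can be counted by hand. The following sketch uses hypothetical labels for eight emails (1 = spam, 0 = non-spam) that are not taken from the dataset, and derives precision and recall for the spam class from the counts.

```python
# Hypothetical labels for eight emails (1 = spam, 0 = non-spam); not from the dataset.
toy_true = [1, 1, 1, 0, 0, 0, 0, 1]
toy_pred = [1, 0, 1, 0, 1, 0, 0, 1]

pairs = list(zip(toy_true, toy_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)   # spam predicted as spam
fp = sum(1 for t, p in pairs if t == 0 and p == 1)   # non-spam predicted as spam
tn = sum(1 for t, p in pairs if t == 0 and p == 0)   # non-spam predicted as non-spam
fn = sum(1 for t, p in pairs if t == 1 and p == 0)   # spam predicted as non-spam

precision = tp / (tp + fp)   # of the emails flagged as spam, how many really are spam
recall = tp / (tp + fn)      # of the real spam emails, how many were flagged
print(tp, fp, tn, fn)
print(precision, recall)
```

For labels 0 and 1, scikit-learn's `confusion_matrix` arranges the same four counts as `[[tn, fp], [fn, tp]]`.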

<b>F-measure</b> is the harmonic mean of precision and recall, which is approximately the average of the two when they are close.
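A short worked example may help here. The helper `f_measure` below is a hypothetical function written for this illustration, and the precision and recall values are made up; it shows that the harmonic mean stays near the simple average when the two values are close and drops well below it when they are far apart.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Close values: the simple average is 0.85 and the F-measure is almost the same.
close = f_measure(0.90, 0.80)

# Distant values: the simple average is 0.50, but the F-measure is much lower.
far = f_measure(0.90, 0.10)

print(round(close, 3), round(far, 3))
```

This is why a model cannot reach a high F-measure by maximizing only one of precision or recall.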

In [None]:
Image(url="https://wikimedia.org/api/rest_v1/media/math/render/svg/dd577aee2dd35c5b0e349327528a5ac606c7bbbf")

In [None]:
print(confusion_matrix(y_test,pred))

sklearn.metrics.confusion_matrix: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html

### Step 5. Perform cross validation and choose the best parameters if there are parameters to optimize

Cross validation is used to find the set of parameters that yields the best performance.

Cross-validation: evaluating estimator performance: https://scikit-learn.org/stable/modules/cross_validation.html

In [None]:
Image(url="https://scikit-learn.org/stable/_images/grid_search_cross_validation.png")

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
knn = KNeighborsClassifier(n_neighbors=1)
scores_k1 = cross_val_score(knn, X_train, y_train, cv=5)
scores_k1

sklearn.model_selection.cross_val_score: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html

Note that cross validation only deals with the training set. It returns an array of the accuracy scores of the five splits.
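To illustrate how the training set is divided, here is a simplified pure-Python sketch of five-fold splitting. The function `k_fold_indices` is hypothetical and ignores shuffling and stratification; for classifiers, `cross_val_score` with an integer `cv` uses stratified folds, so the actual splits differ from this sketch.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, validation_indices) for each fold, without shuffling."""
    # Spread any remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [
        n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
        for i in range(n_folds)
    ]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop

folds = list(k_fold_indices(n_samples=10, n_folds=5))
for train, validation in folds:
    print(train, validation)
```

Each record is used for validation exactly once and for training in the other four folds, which is why five scores come back.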

In [None]:
scores_k1.mean(), scores_k1.std()

In [None]:
knn = KNeighborsClassifier(n_neighbors=3)
scores_k3 = cross_val_score(knn, X_train, y_train, cv=5)
scores_k3

In [None]:
scores_k3.mean(), scores_k3.std()

In [None]:
score_max = 0                      # score_max is a temporary variable to store the max score
for param in [1, 3, 10, 30]:
    model = KNeighborsClassifier(n_neighbors=param)
    scores = cross_val_score(model, X_train, y_train, cv=5)
    print("k = {}: {}\n{:.3f}, {:.3f}\n".format(param, scores, scores.mean(), scores.std()))
    
    if scores.mean() > score_max:
        score_max = scores.mean()
        param_best = param         # param_best is a temporary variable to store the best parameter
        
print("Highest score : {:.3f} when k = {}".format(score_max, param_best))

The model turned out to perform best when k = 1, so we choose 1 as k in our final model. 

### Step 6. Build the final model with the best parameter(s)

In [None]:
def train_test(X_train, X_test, y_train, y_test, classifier):
    classifier.fit(X_train, y_train)
    pred = classifier.predict(X_test)
    
    print("Train score: {:.2f}".format(classifier.score(X_train, y_train)))
    print("Test score: {:.2f}\n".format(classifier.score(X_test, y_test)))
    print("Classification report:\n{}".format(classification_report(y_test, pred, zero_division=0)))
    print(confusion_matrix(y_test,pred))
    
    return classifier

In [None]:
print("k = {}".format(param_best))
knn = KNeighborsClassifier(n_neighbors=param_best)
knn = train_test(X_train, X_test, y_train, y_test, knn)

Now, `knn` is ready to be used for predictions on new unseen data.

In [None]:
summary = {}
summary["k-NNs"] = round(knn.score(X_test, y_test), 3)

The `summary` dictionary is used to keep the best performance for each algorithm.  

### Step 7. Make predictions on new unseen data

In [None]:
text1 = "Subject: From Raphael Kamara\
Good day dear,\
Permit me to inform you of my desire of going into business relationship with you. I got your contact from the International web site directory. I prayed over it and selected your name among other names due to what my mind told me to me that you are a reputable and trust worthy person I can expose my ordeal to and do business with. So I must not hesitate to confide in you for this simple and sincere business.\
I am Raphael Kamara and my younger sister Juliet Kamara, the only son and daughter of late Mr and Mrs Vincent Kamara. My father was a very wealthy cocoa merchant in Abidjan, the economic capital of Ivory Coast, before he was poisoned to death by his business associates on one of their outing to discus on a business deal.\
When my mother died on the 6th August 2016, my father took me special because I am the only child and motherless. Before the death of my father on 30th November 2018 in a private hospital here in Abidjan, he secretly called me on his bedside and told me that he has a sum of $7.500.000 (seven million, five hundred thousand US dollars) left in a suspense account in a local Bank here in Abidjan, that he used my name as his only child for the next of kin in deposit of the fund. He explained to me that it was because of this wealth and some huge amount of money his business associates supposed to balance him from the deal they had, that he was poisoned by his business associates, that I should seek for a God fearing foreign partner in a country of my choice where I will transfer this money and use it for investment purpose, (such as real estate management).\
Please, I am honourably seeking your assistance in the following ways.\
1) To receive the money for me in your account by providing a Bank account where this money would be transferred to.\
2) To serve as the guardian of this since I am a boy of 17 years and I do not have business experience.\
3) To help me come over to your country as soon as the money is transferred to your account so that we will continue our education and a new life.\
I am willing to give you 20% of the sum as compensation for effort input after the successful transfer of this fund to your designate account overseas. Anticipating to hear from you quick please. Thanks and God Bless.\
Yours faithfully\
Raphael and Juliet"

In [None]:
text2 = "Subject: LETTER\
I genuinely ask for your investment idea!!!\
I write to you in regard of a good and profitable proposal i will like us to complete together.\
My name is Mr. Joseph Sawadogo, i work with a reputable Financial Security Firm here in Africa. I have an interesting business proposal i wish to share with you. We will share the fund involve 50% to you while 50% to me after successful transfer of the fund into your account.\
I will give you more details on receiving your response.\
Sincerely Yours,\
Mr. Joseph Sawadogo"

In [None]:
text3 = "Subject: Watts Group Rentals - Holiday Office Hours\
Good Afternoon!\
Watts Group will be closed tomorrow, Thursday December 24th and Wednesday December 25th for Christmas Eve and Christmas Day. We will re-open on Thursday, December 26th at 8:00AM.\
We wish everyone a safe and happy holiday!\
Watts Group Rentals"

In [None]:
text4 = "Subject: All school PJ day tomorrow!\
Hello,\
Tomorrow will be an all school pajama day at Weber!  Also, tomorrow night there will be Pajama Storytime tomorrow night in the Weber library from 6:30-7:15.  Please feel free to donate any new pajamas (size infant to adult) and any new/gently used books!\
Best"

In [None]:
text5 = "Hello,\
We just merged all of staff calendars and we cannot give you space Friday, October 18th. That is univ of Iowa homecoming parade and the development office and students are hosting a chili dinner and parade watching. I can offer you Friday the 11th or 25th.\
Sorry for the change!"

In [None]:
new_texts = [text1, text2, text3, text4, text5]
X_new = vectorizer.transform(new_texts)

Make sure to transform the new data, as the model was trained on the transformed data. 
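The difference between `fit_transform` and `transform` can be seen on a toy corpus. In the sketch below, all `toy_` names and sentences are made up for illustration: the new texts are mapped onto the vocabulary learned from the training texts, so the column count stays the same and words never seen in training are ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up training and new texts; not taken from the email data.
toy_train = ["free money offer", "project meeting schedule"]
toy_new = ["free lottery money", "schedule the meeting"]

toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_X_train = toy_vectorizer.fit_transform(toy_train)   # learns the vocabulary
toy_X_new = toy_vectorizer.transform(toy_new)           # reuses that vocabulary

# Both matrices have the same number of columns (features).
print(toy_X_train.shape, toy_X_new.shape)

# "lottery" was not in the training texts, so it has no column.
print("lottery" in toy_vectorizer.vocabulary_)
```

Calling `fit_transform` on the new texts instead would build a different vocabulary, and the resulting columns would no longer line up with the features the model was trained on.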

In [None]:
knn.predict(X_new)

The first two texts are classified as spam, while the remaining three are classified as non-spam.

## Modeling with Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr

sklearn.linear_model.LogisticRegression: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

In [None]:
scores = cross_val_score(lr, X_train, y_train, cv=5)
print("{}\n{:.3f}, {:.3f}".format(scores, scores.mean(), scores.std()))

We use `LogisticRegression` with its default settings here, so there is no parameter to optimize using cross validation.

In [None]:
lr = train_test(X_train, X_test, y_train, y_test, lr)

In [None]:
summary["Logistic Regression"] = round(lr.score(X_test, y_test), 3)

In [None]:
lr.predict(X_new)

## Modeling with Multinomial Naive Bayes

In [None]:
from sklearn.naive_bayes import MultinomialNB

mnb = MultinomialNB()
mnb

sklearn.naive_bayes.MultinomialNB: https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html

In [None]:
scores = cross_val_score(mnb, X_train, y_train, cv=5)
print("{}\n{:.3f}, {:.3f}".format(scores, scores.mean(), scores.std()))

In [None]:
mnb = train_test(X_train, X_test, y_train, y_test, mnb)

In [None]:
summary["Multinomial Naive Bayes"] = round(mnb.score(X_test, y_test), 3)

In [None]:
mnb.predict(X_new)

No text is classified as spam.

## Modeling with Linear Support Vector Machines (SVMs)

In [None]:
from sklearn.svm import LinearSVC

svm = LinearSVC(C=1)
svm

sklearn.svm.LinearSVC: https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html

In [None]:
svm = train_test(X_train, X_test, y_train, y_test, svm)

In [None]:
score_max = 0
for param in [0.01, 0.03, 0.1, 0.3, 1, 3, 10]:
    model = LinearSVC(C=param)
    scores = cross_val_score(model, X_train, y_train, cv=5)
    print("C = {}: {}\n{:.3f}, {:.3f}\n".format(param, scores, scores.mean(), scores.std()))
    
    if scores.mean() > score_max:
        score_max = scores.mean()
        param_best = param
        
print("Highest score : {:.3f} when C = {}".format(score_max, param_best))

In [None]:
print("C = {}".format(param_best))
svm = LinearSVC(C=param_best)
svm = train_test(X_train, X_test, y_train, y_test, svm)

In [None]:
summary["Linear SVMs"] = round(svm.score(X_test, y_test), 3)

In [None]:
svm.predict(X_new)

## Modeling with Kernelized Support Vector Machines (KSVMs)

In [None]:
from sklearn.svm import SVC

ksvm = SVC(C=1, kernel="rbf", gamma="scale")
ksvm

sklearn.svm.SVC: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

In [None]:
ksvm = train_test(X_train, X_test, y_train, y_test, ksvm)

In [None]:
score_max = 0
for param in [0.01, 0.03, 0.1, 0.3, 1, 3, 10]:
    model = SVC(C=param, kernel="rbf", gamma="scale")
    scores = cross_val_score(model, X_train, y_train, cv=5)
    print("C = {}: {}\n{:.3f}, {:.3f}\n".format(param, scores, scores.mean(), scores.std()))
    
    if scores.mean() > score_max:
        score_max = scores.mean()
        param_best = param
        
print("Highest score : {:.3f} when C = {}".format(score_max, param_best))

In [None]:
print("C = {}".format(param_best))
ksvm = SVC(C=param_best, kernel="rbf", gamma="scale")
ksvm = train_test(X_train, X_test, y_train, y_test, ksvm)

In [None]:
summary["Kernelized SVMs"] = round(ksvm.score(X_test, y_test), 3)

In [None]:
ksvm.predict(X_new)

## Modeling with Neural Networks

In [None]:
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(hidden_layer_sizes=(10, ), activation="relu", random_state=0)
mlp

sklearn.neural_network.MLPClassifier: https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html

In [None]:
mlp = train_test(X_train, X_test, y_train, y_test, mlp)

In [None]:
# It may take a couple of hours to run this cell. 

score_max = 0
for param in [10, 30, 100]:
    model = MLPClassifier(hidden_layer_sizes=(param, ), activation="relu", random_state=0)
    scores = cross_val_score(model, X_train, y_train, cv=5)
    print("hidden_layer_sizes = {}: {}\n{:.3f}, {:.3f}\n".format(param, scores, scores.mean(), scores.std()))
    
    if scores.mean() > score_max:
        score_max = scores.mean()
        param_best = param
        
print("Highest score : {:.3f} when hidden_layer_sizes = {}".format(score_max, param_best))

In [None]:
print("hidden_layer_sizes = {}".format(param_best))
mlp = MLPClassifier(hidden_layer_sizes=(param_best, ), random_state=0)
mlp = train_test(X_train, X_test, y_train, y_test, mlp)

In [None]:
summary["Neural Networks"] = round(mlp.score(X_test, y_test), 3)

In [None]:
mlp.predict(X_new)

## Choose the algorithm that performs best

In [None]:
summary

When the performance is high, we can say:
- The features used, in this case the words, were actually good predictors, i.e., did have predictive power.
- The quality of the labeled data was very good, i.e., the labels were very accurate. 
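To pick the top-scoring algorithm from a dictionary like `summary` programmatically, something along the following lines could be used. The accuracies in `toy_summary` are placeholders invented for this sketch, not results from this notebook.

```python
# Placeholder accuracies for illustration only; they are not results from this notebook.
toy_summary = {"k-NNs": 0.95, "Logistic Regression": 0.97, "Linear SVMs": 0.99}

# max() with key=dict.get returns the key whose value is largest.
best_algorithm = max(toy_summary, key=toy_summary.get)
print(best_algorithm, toy_summary[best_algorithm])

# Sorting gives a full ranking from best to worst.
ranking = sorted(toy_summary.items(), key=lambda item: item[1], reverse=True)
print(ranking)
```

Test accuracy is only one criterion; training time and the precision or recall on the spam class may also matter when choosing a model.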

## Exercises - Text Classification

## Developing a Multiclass Classifier

In [None]:
df = pd.read_csv("classdata/tweets/tweets_emotions.csv", sep="\t")
df

In [None]:
df.emotion.value_counts()

## Setting the Goal

Our goal is to build a multiclass <b>classification</b> model that is able to classify a tweet text as one of the following six emotion types: happiness, sadness, fear, disgust, surprise, and anger. 
- Feature variables: words in tweet texts
- Outcome variable  : emotion type
- Records          : documents (tweets)

## Preparing Data for Modeling

### Removing Biases in the Data

One of the problems in classification is class imbalance in the training data. In our data, over 80% of the records are labeled with happiness. Therefore, when you train your model with this data, most of its predictions will be happiness. One solution is undersampling, which intentionally removes some of the majority class examples. The cost is that you lose a substantial portion of the data.

In our data, let's reduce the number of happiness, sadness, fear, and disgust records to 1,000 each to make the data set much less imbalanced. We will not touch the surprise and anger records, as there are fewer than 1,000 of each.
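The capping idea can be sketched on a tiny made-up frame before applying it to the tweets. The helper `undersample` and the `toy_` data are hypothetical and exist only for this illustration; the notebook itself performs the same steps class by class.

```python
import pandas as pd

# A tiny made-up frame: 6 "happiness" rows, 3 "sadness" rows, 2 "anger" rows.
toy_df = pd.DataFrame({
    "text": ["t{}".format(i) for i in range(11)],
    "emotion": ["happiness"] * 6 + ["sadness"] * 3 + ["anger"] * 2,
})

def undersample(frame, label_col, cap, seed=0):
    """Keep at most `cap` randomly chosen rows per class (hypothetical helper)."""
    parts = []
    for _, group in frame.groupby(label_col):
        # Only classes larger than the cap are sampled down; smaller ones are kept whole.
        if len(group) > cap:
            group = group.sample(n=cap, random_state=seed)
        parts.append(group)
    return pd.concat(parts, axis=0)

toy_reduced = undersample(toy_df, "emotion", cap=3)
print(toy_reduced.emotion.value_counts())
```

Classes already at or below the cap are left untouched, mirroring how the surprise and anger records are kept as they are.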

In [None]:
N = 1000

In [None]:
df_happy = df[df.emotion == "happiness"].sample(n=N, random_state=0)
df_happy

In [None]:
df_sad = df[df.emotion == "sadness"].sample(n=N, random_state=0)
df_fear = df[df.emotion == "fear"].sample(n=N, random_state=0)
df_disgust = df[df.emotion == "disgust"].sample(n=N, random_state=0)

In [None]:
df_others = df[(df.emotion == "surprise") | (df.emotion == "anger")]

In [None]:
df_reduced = pd.concat([df_happy, df_sad, df_fear, df_disgust, df_others], axis=0)
df_reduced

pandas.concat: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html

In [None]:
df_reduced.emotion.value_counts()

In [None]:
vectorizer = TfidfVectorizer(use_idf=True, norm="l2", stop_words="english", max_df=0.7)
X = vectorizer.fit_transform(df_reduced.text)
y = df_reduced.emotion

In [None]:
X.shape, y.shape

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

## Modeling with Logistic Regression

In [None]:
lr = LogisticRegression()

In [None]:
scores = cross_val_score(lr, X_train, y_train, cv=5)
print("{}\n{:.3f}, {:.3f}".format(scores, scores.mean(), scores.std()))

In [None]:
lr = train_test(X_train, X_test, y_train, y_test, lr)

Three things to consider when the performance is low:
- The quality of the labeled data is not very good, i.e., the labels are not very accurate. 
- The number of labeled records is not sufficient.
- The features used, in this case the words, do not serve as good predictors, i.e., have no predictive power. You may consider re-selecting the features, but that may not work if none of the features have predictive power.

In [None]:
text1 = "Had so much fun last night, thanks to everyone who came out to Slacker's last night and thank you to all of you who keep supporting me and and this incredible music adventure!"
text2 = "I just hit someone at work with the “See you next decade” and just like that I’ve reached peak"
text3 = "Today was the best day in my whole life."

In [None]:
new_texts = [text1, text2, text3]
X_new = vectorizer.transform(new_texts)

In [None]:
lr.predict(X_new)