# Machine Learning for NLP


Up to this point, the learning guide has covered only text preprocessing and embeddings. Remember that the output of the embedding step is a numeric vector that can be used for machine learning. Once embeddings are generated, the next step is to select and train models according to the task at hand. Several factors go into the model selection process, including:
- Desired output / task 
- Volume of data 
- Computational resources available
- Availability of labels 
- Data’s domain area 
- Training or serving latency requirements 
- Desired model complexity


## Classical Machine Learning in NLP

Starting with minimal complexity and gradually increasing it, traditional shallow machine learning algorithms can be used for an array of NLP tasks. Classification tasks are among the most common in NLP, including but not limited to text classification, named entity recognition (NER), and question answering.
- **Text classification** refers to the set of tasks that involve assigning categories to text documents. A common example is an email filter labeling emails as spam or not spam. Since this output is binary, models like logistic regression, a support vector machine (SVM), or naive Bayes can be applied to generate the classification.
- **Named entity recognition** detects and classifies entities in text, such as people and places. For example, NER is frequently used in healthcare to identify medical entities like diseases, drugs, and symptoms in patient records. Since NER can be framed as multi-class classification over tokens, algorithms like SVMs or naive Bayes classifiers can be trained for NER tasks.
- **Question answering** is another common NLP task that can leverage shallow algorithms. A model trained for question answering assigns probabilities to candidate answers given the question as input. Similar to NER, nearly any multi-class classification algorithm can be used for this task.

While not always the most complex approach, classical machine learning algorithms can cover a lot of ground in NLP, especially in the early stages of a project.
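
As a rough illustration of the text classification case, here is a minimal sketch that chains a bag-of-words featurizer with a logistic regression using `scikit-learn`'s `make_pipeline`. The six-sentence corpus, the labels, and the names `spam_filter` and `new_emails` are made up for this example and are not part of the notebook's dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus (not the notebook's dataset): 1 = spam, 0 = not spam
texts = [
    "win free money now",
    "free prize claim now",
    "cheap loans free offer",
    "meeting agenda for monday",
    "project status report attached",
    "lunch with the team monday",
]
labels = [1, 1, 1, 0, 0, 0]

# Chain a bag-of-words featurizer with a linear classifier
spam_filter = make_pipeline(CountVectorizer(), LogisticRegression())
spam_filter.fit(texts, labels)

# Score two unseen (also made-up) emails
new_emails = ["claim your free prize", "status of the project meeting"]
predictions = spam_filter.predict(new_emails)
print(predictions)
```

Swapping `LogisticRegression()` for another `scikit-learn` classifier such as `MultinomialNB()` should only require changing that one argument, which makes a pipeline like this a convenient way to compare shallow models.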

In [2]:
import pandas as pd
# Read in spam filter example 
spam_df = pd.read_csv(r"D:\\Coding_Stuff\\GitHub\\Natural-Language-Processing\\data\\emails.csv")
spam_df.head(5)

Unnamed: 0,text,spam
0,Subject: naturally irresistible your corporate...,1
1,Subject: the stock trading gunslinger fanny i...,1
2,Subject: unbelievable new homes made easy im ...,1
3,Subject: 4 color printing special request add...,1
4,"Subject: do not have money , get software cds ...",1


Let's do some quick exploratory data analysis to identify any class imbalance. This will be important when evaluating the model down the line.

In [3]:
spam_df["spam"].value_counts()

spam
0    4360
1    1368
Name: count, dtype: int64

From the `value_counts()` output, it's clear that the labels are imbalanced: there are roughly three non-spam emails for every spam email. Depending on the modeling technique, we may need to account for this later.
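
To put the imbalance in proportional terms, the sketch below rebuilds a label column with the same counts reported above and uses `value_counts(normalize=True)`. The `labels` Series is a stand-in for `spam_df["spam"]`, and the majority-class baseline is an added illustration rather than a step from the original notebook.

```python
import pandas as pd

# Stand-in for spam_df["spam"], rebuilt from the counts shown above
# (0 = not spam, 1 = spam)
labels = pd.Series([0] * 4360 + [1] * 1368, name="spam")

# normalize=True returns each class's share of the rows instead of raw counts
class_share = labels.value_counts(normalize=True)
print(class_share.round(3))

# Accuracy of a trivial model that always predicts the majority class
majority_baseline = class_share.max()
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

Any model we train should beat this baseline by a clear margin, which is one reason accuracy alone can be misleading on imbalanced data.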

In [4]:
from nltk import word_tokenize

Let's take a look at the average length of a spam and a non-spam email. Since every text entry must be tokenized, we can write a function that tokenizes the text by word and then apply it to the entire *pandas* `DataFrame`.

In [5]:
# Define the `count_words` function
def count_words(text):
    words = word_tokenize(text)
    return len(words)

Next, we'll apply the `count_words` function to count the number of words in each row of our `DataFrame`.

In [6]:
# Apply the `count_words` function to the entire DataFrame
spam_df["counted_text"] = spam_df["text"].apply(count_words)
spam_df.head(3)

Unnamed: 0,text,spam,counted_text
0,Subject: naturally irresistible your corporate...,1,325
1,Subject: the stock trading gunslinger fanny i...,1,90
2,Subject: unbelievable new homes made easy im ...,1,88


As a quick piece of data analysis, let's take a look at the average length of an email, by label.

In [8]:
# Average length of an email by label
spam_df.groupby("spam")["counted_text"].mean()

spam
0    346.835321
1    267.896199
Name: counted_text, dtype: float64

Per the result above, there's a notable difference between the average length of a spam email (about 268 tokens) and a non-spam email (about 347 tokens).

### Featurization 

Now that we've done a quick analysis of our data, we can get started with preparing the data for modeling:

1. First, **stop words** are removed to ensure the remaining text carries meaning.
2. Next, **stemming** is applied to reduce words to their stems.
3. Lastly, a **bag-of-words** representation is built with `CountVectorizer` to convert the text into numeric features.

In [11]:
import string
from nltk.corpus import stopwords

As mentioned in previous notebooks, we'll remove stop words from our corpus. Removing them means the model relies mainly on words that carry meaning and context. For this example, we'll leverage the pre-loaded English stop words from `NLTK`.

In [12]:
# Load the English stop words once; a set makes membership checks fast
STOP_WORDS = set(stopwords.words("english"))

# Define function to strip punctuation and remove stop words
def remove_stopwords(text: str) -> str:
    no_punctuation = "".join(character for character in text if character not in string.punctuation)
    return " ".join(word for word in no_punctuation.split() if word.lower() not in STOP_WORDS)

In [13]:
# Apply the stopword removal function
spam_df["removed_stopwords"] = spam_df["text"].apply(remove_stopwords)

Verify stopword removal:

In [21]:
spam_df.head(5)

Unnamed: 0,text,spam,counted_text,removed_stopwords
0,Subject: naturally irresistible your corporate...,1,325,Subject naturally irresistible corporate ident...
1,Subject: the stock trading gunslinger fanny i...,1,90,Subject stock trading gunslinger fanny merrill...
2,Subject: unbelievable new homes made easy im ...,1,88,Subject unbelievable new homes made easy im wa...
3,Subject: 4 color printing special request add...,1,99,Subject 4 color printing special request addit...
4,"Subject: do not have money , get software cds ...",1,53,Subject money get software cds software compat...


Now that stop words have been removed, the next step is to apply `NLTK`'s `PorterStemmer` to reduce words to their stems. As a reminder, reducing a word to its stem lets different forms of the same word be treated as a single token, which makes words easier to compare.

In [22]:
from nltk.stem import PorterStemmer

Define a stemming function we can apply to the entire `DataFrame`

In [23]:
# Define the stemming function
def stem(text: str) -> str:
    stemmer = PorterStemmer()
    return " ".join(stemmer.stem(word) for word in text.split())

Apply the stemming function:

In [24]:
# Apply the stemming function 
spam_df["stemming"] = spam_df["removed_stopwords"].apply(stem)

Verify stemming:

In [25]:
spam_df["stemming"][:5]

0    subject natur irresist corpor ident lt realli ...
1    subject stock trade gunsling fanni merril muzo...
2    subject unbeliev new home made easi im want sh...
3    subject 4 color print special request addit in...
4    subject money get softwar cd softwar compat gr...
Name: stemming, dtype: object
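
To see what the stemmer does word by word, here is a small sketch that runs `PorterStemmer` on a few words taken from the emails above. The `words` list is picked purely for illustration.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# A few words taken from the emails shown above
words = ["Subject", "naturally", "irresistible", "trading", "homes", "easy", "software"]

# Print each word next to its stem
for word in words:
    print(f"{word} -> {stemmer.stem(word)}")
```

Note that stems such as `natur` or `easi` are not always dictionary words, and that the stemmer lowercases its input by default, which is why `Subject` appears as `subject` in the output above.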

Now that the corpus has been pre-processed, the last step is to apply `CountVectorizer`, which maps our text to numeric values. With this conversion, we'll be able to apply various machine learning algorithms. Per the `sklearn` documentation, here's a quick summary of the `CountVectorizer` class:

```Convert a collection of text documents to a matrix of token counts.```

In [26]:
from sklearn.feature_extraction.text import CountVectorizer

In [27]:
# Instantiate, fit, and transform the `CountVectorizer`

vectorizer = CountVectorizer()

vectorized_matrix = vectorizer.fit_transform(spam_df["stemming"])

Checking the data type, note that `CountVectorizer` returns a sparse matrix (`scipy.sparse._csr.csr_matrix`), which stores only the non-zero counts.

In [28]:
type(vectorized_matrix)

scipy.sparse._csr.csr_matrix

### Modeling 

Now that the data has been pre-processed, let's fit a model. Before fitting, as with any other machine learning problem, the data needs to be split into a training set and a test set. As a reminder, evaluating on held-out data lets us detect **overfitting** and check that the model generalizes.

In [29]:
# Split the data 
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(vectorized_matrix, spam_df["spam"], test_size=0.3)
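
Given the class imbalance noted earlier, one option worth considering is a stratified split, which keeps the class proportions the same in both sets. The sketch below shows the idea on made-up labels; the arrays `X` and `y`, the `random_state` value, and the variable names are assumptions for illustration, not what the cell above does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 150 negatives and 50 positives (25% positive)
y = np.array([0] * 150 + [1] * 50)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

# stratify=y preserves the class ratio; random_state makes the split repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Share of positives in each split
print(y_tr.mean(), y_te.mean())
```

In the notebook itself, the equivalent change would be passing `stratify=spam_df["spam"]` to the existing call, keeping in mind that the metrics reported below were produced without it.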

For this example, we'll be leveraging a `Naive Bayes Classifier`, one of the simpler classifiers. It's considered *generative* since it models the distribution of inputs for a given class or category. The model operates under the assumption that the features of the input data are conditionally independent given the class. This assumption rarely holds exactly for text, but it makes training and prediction fast, and the model often performs well on tasks like spam filtering regardless.

In [30]:
# Instantiate and train a Naive Bayes Classifier
from sklearn.naive_bayes import MultinomialNB

bayes = MultinomialNB()
bayes.fit(X_train, y_train)
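
To make the conditional independence assumption more concrete, the sketch below reproduces the classifier's decision rule by hand on a tiny made-up count matrix: each class is scored as its log prior plus the count-weighted sum of per-term log probabilities, and the highest score wins. The data and the names `X_toy`, `y_toy`, and `toy_nb` are hypothetical; `class_log_prior_` and `feature_log_prob_` are attributes of a fitted `MultinomialNB`.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word-count matrix: rows are documents, columns are vocabulary terms
X_toy = np.array([[2, 1, 0], [3, 0, 0], [0, 1, 3], [0, 2, 2]])
y_toy = np.array([1, 1, 0, 0])

toy_nb = MultinomialNB().fit(X_toy, y_toy)

# log P(class) + sum_i count_i * log P(term_i | class), for every document and class
joint_log_likelihood = X_toy @ toy_nb.feature_log_prob_.T + toy_nb.class_log_prior_

# Pick the class with the highest score and compare against sklearn's own predictions
manual_preds = toy_nb.classes_[joint_log_likelihood.argmax(axis=1)]
print(manual_preds, toy_nb.predict(X_toy))
```

Because each term contributes independently to the sum, scoring a document is a single sparse matrix product, which is a large part of why the model is so fast.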

### Generate Predictions and Evaluate the Model

Evaluating these trained models is similar to evaluating any other classification task, with some nuances. In general, evaluation metrics are categorized as either **intrinsic** or **extrinsic**: intrinsic evaluation measures the performance of a component on a defined subtask, while extrinsic evaluation measures performance on the final objective. Extrinsic evaluation is mostly specific to the task and business context, whereas metrics like **accuracy**, **precision**, **recall**, and **F1** are considered intrinsic.

Beyond traditional classification metrics, a handful of NLP-specific metrics can be helpful. **Bilingual Evaluation Understudy (BLEU)**, **METEOR**, and **ROUGE** all score generated text by comparing it against one or more reference texts. BLEU and METEOR were designed for machine translation, while ROUGE was designed for summarization; all three are also used for text generation and paraphrase generation. **Perplexity** is a probabilistic measure of how “confused” a language model is: it captures how well the model predicts the next word in a sequence, with lower values indicating better predictions.
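
As a rough sketch of how perplexity is computed, the function below takes the probabilities a language model assigned to each observed token and returns the exponentiated average negative log-probability. The helper name `perplexity` and the probability values are made up for illustration; they do not come from any model in this notebook.

```python
import numpy as np

def perplexity(token_probs):
    """Exponentiated average negative log-probability of the observed tokens."""
    token_probs = np.asarray(token_probs, dtype=float)
    return float(np.exp(-np.mean(np.log(token_probs))))

# A model that spreads its guesses evenly over four options at every step
uniform = perplexity([0.25, 0.25, 0.25, 0.25])

# A model that assigns high probability to the tokens that actually occurred
confident = perplexity([0.9, 0.8, 0.95, 0.85])

print(uniform, confident)
```

The evenly spread model comes out at a perplexity of 4, as if it were choosing among four equally likely words at each step, while the more confident model scores closer to 1.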

In the context of this model, we can leverage `sklearn`'s `classification_report` to quickly calculate several metrics at once, including `precision`, `recall`, `f1-score`, and `support`.

In [33]:
from sklearn.metrics import classification_report

preds = bayes.predict(X_test)
print(classification_report(y_test, preds))

              precision    recall  f1-score   support

           0       1.00      0.99      1.00      1282
           1       0.98      1.00      0.99       437

    accuracy                           0.99      1719
   macro avg       0.99      0.99      0.99      1719
weighted avg       0.99      0.99      0.99      1719



In addition to calculating metrics, generating a *confusion matrix* can be helpful for classification tasks, since it shows how the true labels and predictions are distributed.
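
A minimal sketch of a confusion matrix with `sklearn.metrics.confusion_matrix` is shown below. The `y_true` and `y_pred` lists are small made-up examples so the block runs on its own; in the notebook, the equivalent call would be `confusion_matrix(y_test, preds)`.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions (1 = spam, 0 = not spam)
y_true = [0, 0, 0, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0]

# Rows are true labels, columns are predicted labels
cm = confusion_matrix(y_true, y_pred)
print(cm)

# For a binary problem, the four cells unpack in this order
tn, fp, fn, tp = cm.ravel()
print(f"true negatives={tn}, false positives={fp}, false negatives={fn}, true positives={tp}")
```

For a spam filter, the false positive cell (legitimate email flagged as spam) is often the costliest error, so it is worth checking in addition to the summary metrics.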

That's it! We've now successfully built a `Naive Bayes Classifier` on some e-mail text data! 