<a href="https://colab.research.google.com/github/himalayahall/DATA602/blob/main/Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Introduction

NLP (Natural Language Processing), NLU (Natural Language Understanding), and NLG (Natural Language Generation) are important subtopics of Machine 
Learning. 

**NLP** involves programming computers to process large volumes of language data. It comprises numerous tasks that break down natural language into 
smaller elements in order to understand the relationships between those elements and how they work together. NLP focuses largely on converting text to 
structured data. It does this through the identification of named entities (a process called named entity recognition) and identification of word 
patterns, using methods like tokenization, stemming, and lemmatization, which examine the root forms of words.
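
As a rough illustration of tokenization and stemming, here is a minimal sketch in plain Python. The `tokenize` and `naive_stem` helpers and the sample sentence are hypothetical and written only for this example. Real projects typically rely on a library tokenizer and stemmer, and the suffix rules below are deliberately crude.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word-like tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def naive_stem(token):
    """Strip one common English suffix (a crude stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when a reasonably long root remains
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

tokens = tokenize("The reviewers praised the charging cables.")
stems = [naive_stem(t) for t in tokens]
print(tokens)
print(stems)
```

Note that a stem such as `prais` is not a dictionary word. Lemmatization, by contrast, maps each word to a valid root form.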

**NLU** (Natural Language Understanding) is a subset of natural language processing, which uses syntactic and semantic analysis of text and speech to determine the meaning of a sentence. One of the primary goals of NLU is to teach machines how to interpret and understand language inputted by humans. 
It aims to teach computers what a body of text or spoken speech means. NLU leverages AI algorithms to recognize attributes of language such as sentiment, semantics, context, and intent. It enables computers to understand the subtleties and variations of language.

**NLG** (Natural Language Generation) is also a subset of NLP and is concerned with enabling machines to not just process and understand text but to generate text. While NLU focuses on computer reading comprehension, NLG enables computers to write. NLG is the process of producing a human language 
text response based on some data input (prompt).[[1]](#1)

## Project Goal

Recent advancements in NLP, most notably the NLG capabilities of Large Language Models (LLMs) like ChatGPT, have taken the public imagination by storm. In this project we will explore the following:

- **NLP**: create a classifier to classify product reviews as either original (presumably human created and authentic) or fake (computer generated fake reviews). The motivation for using this dataset is that fake reviews are a major problem, as highlighted in the NPR article [Why we usually can't tell when a review is fake](https://www.npr.org/sections/money/2023/03/07/1160721021/why-we-usually-cant-tell-when-a-review-is-fake), and it would be great to leverage NLP to address the problem.
  - Use [Fastai](https://docs.fast.ai) to build the classifier. This will be accomplished by taking a pretrained language model and fine-tuning it to classify reviews.  What we call a language model is a model that has been trained to guess what the next word in a text is (having read the ones before). This kind of task is called self-supervised learning: we do not need to give labels to our model, just feed it lots and lots of texts. It has a process to automatically get labels from the data, and this task isn't trivial: to properly guess the next word in a sentence, the model will have to develop an understanding of the English (or other) language.[[2]](#2)
- **NLU**: the English learned by the pretrained language model (Wikipedia) is slightly different from the English used for product reviews, so instead of jumping directly to the classifier, we will fine-tune our pretrained language model to the product corpus and then use that as the base for our classifier. This should (hopefully) result in better performance.
- **NLG**: finally, having created a language model that has been fine-tuned for product reviews, we will use it to auto-generate fake reviews. This will be done by giving the model some starting text (prompt) and then asking the model to generate the rest (up to a maximum number of words).
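
The self-supervised idea described above (labels come from the text itself) can be sketched with a toy bigram model. This is only an illustration under simplifying assumptions: the corpus and the helper names are hypothetical, and a count-based bigram model is far simpler than the pretrained neural language model used in this project.

```python
from collections import Counter, defaultdict

def make_training_pairs(tokens):
    """Each word is the input; the word that follows it is the label."""
    return list(zip(tokens[:-1], tokens[1:]))

def train_bigram(tokens):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for prev, nxt in make_training_pairs(tokens):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    if word not in counts:
        return None
    return counts[word].most_common(1)[0][0]

# Hypothetical toy corpus; no manual labels are needed
corpus = "great product great price great product fast shipping".split()
model = train_bigram(corpus)
print(predict_next(model, "great"))
```

No human labeling is involved: every (word, next word) pair is extracted automatically from the raw text.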

## Data Sources

The [fake reviews dataset](https://osf.io/3vds7) contains 20k fake reviews and 20k real product reviews. OR = Original reviews (presumably human created and authentic); CG = Computer-generated fake reviews.

## Tools and Frameworks

- [Google Colab](https://colab.research.google.com)
- [Jupyter Notebook](https://jupyter.org/)
- [Fastai](https://docs.fast.ai)

## References
<a id="1">[1]</a>
https://www.ibm.com/topics/natural-language-processing

<a id="2">[2]</a>
https://fastai.github.io/fastbook2e/book10.html


# Load data

In [47]:
import pandas as pd
import re

df = pd.read_csv('https://raw.githubusercontent.com/himalayahall/DATA607/main/Project4/EMAILSpamCollectionFull.csv')

In [48]:
df.head()

Unnamed: 0,id,from,subject,category,text
0,1,robert elz <kre@munnari.oz.au>,re: new sequences window,ham,"Date: Wed, 21 Aug 2002 10:54:46 -0500 From: Chris Garrigues <cwg-dated-1030377287.06fa6d@DeepEddy.Com> Message-ID: <1029945287.4797.TMDA@deepeddy.vircio.com> | I can't reproduce this error.For me it is very repeatable... (like every time, without fail).This is the debug log of the pick happening ...18:19:03 Pick_It {exec pick +inbox -list -lbrace -lbrace -subject ftp -rbrace -rbrace} {4852-4852 -sequence mercury}18:19:03 exec pick +inbox -list -lbrace -lbrace -subject ftp -rbrace -rbrace 4852-4852 -sequence mercury18:19:04 Ftoc_PickMsgs {{1 hit}}18:19:04 Marking ..."
1,2,steve burt <steve_burt@cursor-system.com>,[zzzzteana] re: alexander,ham,"Martin A posted:Tassos Papadopoulos, the Greek sculptor behind the plan, judged that the limestone of Mount Kerdylio, 70 miles east of Salonika and not far from the Mount Athos monastic community, was ideal for the patriotic sculpture. As well as Alexander's granite features, 240 ft high and 170 ft wide, a museum, a restored amphitheatre and car park for admiring crowds areplanned---------------------So is this mountain limestone or granite?If it's limestone, it'll weather pretty fast.------------------------ Yahoo! Groups Sponsor ---------------------~-->4 DVDs Free +s&p Join Nowhttp:/..."
2,3,"""tim chapman"" <timc@2ubh.com>",[zzzzteana] moscow bomber,ham,"Man Threatens Explosion In Moscow Thursday August 22, 2002 1:40 PMMOSCOW (AP) - Security officers on Thursday seized an unidentified man whosaid he was armed with explosives and threatened to blow up his truck infront of Russia's Federal Security Services headquarters in Moscow, NTVtelevision reported.The officers seized an automatic rifle the man was carrying, then the mangot out of the truck and was taken into custody, NTV said. No other detailswere immediately available.The man had demanded talks with high government officials, the Interfax andITAR-Tass news agencies said. Ekho Moskvy ..."
3,4,monty solomon <monty@roscom.com>,[irr] klez: the virus that won't die,ham,"Klez: The Virus That Won't Die Already the most prolific virus ever, Klez continues to wreak havoc.Andrew Brandt>>From the September 2002 issue of PC World magazinePosted Thursday, August 01, 2002The Klez worm is approaching its seventh month of wriggling across the Web, making it one of the most persistent viruses ever. And experts warn that it may be a harbinger of new viruses that use a combination of pernicious approaches to go from PC to PC.Antivirus software makers Symantec and McAfee both report more than 2000 new infections daily, with no sign of letup at press time. The British s..."
4,5,tony nugent <tony@linuxworks.com.au>,re: insert signature,ham,"On Wed Aug 21 2002 at 15:46, Ulises Ponce wrote:> Hi!> > Is there a command to insert the signature using a combination of keys and not> to have sent the mail to insert it then?I simply put it (them) into my (nmh) component files (components,replcomps, forwcomps and so on). That way you get them when you areediting your message. Also, by using comps files for specificfolders you can alter your .sig per folder (and other tricks). Seethe docs for (n)mh for all the details.There might (must?) also be a way to get sedit to do it, but I'vebeen using gvim as my exmh message editor for a long..."


Drop the `from` and `subject` columns, keeping `id`, `category`, and `text`.

In [49]:
df.drop(columns=['from', 'subject'], inplace=True)
df

Unnamed: 0,id,category,text
0,1,ham,"Date: Wed, 21 Aug 2002 10:54:46 -0500 From: Chris Garrigues <cwg-dated-1030377287.06fa6d@DeepEddy.Com> Message-ID: <1029945287.4797.TMDA@deepeddy.vircio.com> | I can't reproduce this error.For me it is very repeatable... (like every time, without fail).This is the debug log of the pick happening ...18:19:03 Pick_It {exec pick +inbox -list -lbrace -lbrace -subject ftp -rbrace -rbrace} {4852-4852 -sequence mercury}18:19:03 exec pick +inbox -list -lbrace -lbrace -subject ftp -rbrace -rbrace 4852-4852 -sequence mercury18:19:04 Ftoc_PickMsgs {{1 hit}}18:19:04 Marking ..."
1,2,ham,"Martin A posted:Tassos Papadopoulos, the Greek sculptor behind the plan, judged that the limestone of Mount Kerdylio, 70 miles east of Salonika and not far from the Mount Athos monastic community, was ideal for the patriotic sculpture. As well as Alexander's granite features, 240 ft high and 170 ft wide, a museum, a restored amphitheatre and car park for admiring crowds areplanned---------------------So is this mountain limestone or granite?If it's limestone, it'll weather pretty fast.------------------------ Yahoo! Groups Sponsor ---------------------~-->4 DVDs Free +s&p Join Nowhttp:/..."
2,3,ham,"Man Threatens Explosion In Moscow Thursday August 22, 2002 1:40 PMMOSCOW (AP) - Security officers on Thursday seized an unidentified man whosaid he was armed with explosives and threatened to blow up his truck infront of Russia's Federal Security Services headquarters in Moscow, NTVtelevision reported.The officers seized an automatic rifle the man was carrying, then the mangot out of the truck and was taken into custody, NTV said. No other detailswere immediately available.The man had demanded talks with high government officials, the Interfax andITAR-Tass news agencies said. Ekho Moskvy ..."
3,4,ham,"Klez: The Virus That Won't Die Already the most prolific virus ever, Klez continues to wreak havoc.Andrew Brandt>>From the September 2002 issue of PC World magazinePosted Thursday, August 01, 2002The Klez worm is approaching its seventh month of wriggling across the Web, making it one of the most persistent viruses ever. And experts warn that it may be a harbinger of new viruses that use a combination of pernicious approaches to go from PC to PC.Antivirus software makers Symantec and McAfee both report more than 2000 new infections daily, with no sign of letup at press time. The British s..."
4,5,ham,"On Wed Aug 21 2002 at 15:46, Ulises Ponce wrote:> Hi!> > Is there a command to insert the signature using a combination of keys and not> to have sent the mail to insert it then?I simply put it (them) into my (nmh) component files (components,replcomps, forwcomps and so on). That way you get them when you areediting your message. Also, by using comps files for specificfolders you can alter your .sig per folder (and other tricks). Seethe docs for (n)mh for all the details.There might (must?) also be a way to get sedit to do it, but I'vebeen using gvim as my exmh message editor for a long..."
...,...,...,...
9345,9346,spam,
9346,9347,spam,
9347,9348,spam,
9348,9349,spam,


In [50]:
df.describe(include='object')

Unnamed: 0,category,text
count,9350,7953.0
unique,2,4618.0
top,ham,
freq,5553,84.0


# Cleanup Data

Drop NAs and strip HTML tags.

In [51]:
df.dropna(inplace=True)
df['text'] = df['text'].str.replace('<[^<>]*>', ' ', regex=True)
df.head()

Unnamed: 0,id,category,text
0,1,ham,"Date: Wed, 21 Aug 2002 10:54:46 -0500 From: Chris Garrigues Message-ID: | I can't reproduce this error.For me it is very repeatable... (like every time, without fail).This is the debug log of the pick happening ...18:19:03 Pick_It {exec pick +inbox -list -lbrace -lbrace -subject ftp -rbrace -rbrace} {4852-4852 -sequence mercury}18:19:03 exec pick +inbox -list -lbrace -lbrace -subject ftp -rbrace -rbrace 4852-4852 -sequence mercury18:19:04 Ftoc_PickMsgs {{1 hit}}18:19:04 Marking 1 hits18:19:04 tkerror: syntax error in expression ""int ...Note, if I run the pick..."
1,2,ham,"Martin A posted:Tassos Papadopoulos, the Greek sculptor behind the plan, judged that the limestone of Mount Kerdylio, 70 miles east of Salonika and not far from the Mount Athos monastic community, was ideal for the patriotic sculpture. As well as Alexander's granite features, 240 ft high and 170 ft wide, a museum, a restored amphitheatre and car park for admiring crowds areplanned---------------------So is this mountain limestone or granite?If it's limestone, it'll weather pretty fast.------------------------ Yahoo! Groups Sponsor ---------------------~-->4 DVDs Free +s&p Join Nowhttp:/..."
2,3,ham,"Man Threatens Explosion In Moscow Thursday August 22, 2002 1:40 PMMOSCOW (AP) - Security officers on Thursday seized an unidentified man whosaid he was armed with explosives and threatened to blow up his truck infront of Russia's Federal Security Services headquarters in Moscow, NTVtelevision reported.The officers seized an automatic rifle the man was carrying, then the mangot out of the truck and was taken into custody, NTV said. No other detailswere immediately available.The man had demanded talks with high government officials, the Interfax andITAR-Tass news agencies said. Ekho Moskvy ..."
3,4,ham,"Klez: The Virus That Won't Die Already the most prolific virus ever, Klez continues to wreak havoc.Andrew Brandt>>From the September 2002 issue of PC World magazinePosted Thursday, August 01, 2002The Klez worm is approaching its seventh month of wriggling across the Web, making it one of the most persistent viruses ever. And experts warn that it may be a harbinger of new viruses that use a combination of pernicious approaches to go from PC to PC.Antivirus software makers Symantec and McAfee both report more than 2000 new infections daily, with no sign of letup at press time. The British s..."
4,5,ham,"On Wed Aug 21 2002 at 15:46, Ulises Ponce wrote:> Hi!> > Is there a command to insert the signature using a combination of keys and not> to have sent the mail to insert it then?I simply put it (them) into my (nmh) component files (components,replcomps, forwcomps and so on). That way you get them when you areediting your message. Also, by using comps files for specificfolders you can alter your .sig per folder (and other tricks). Seethe docs for (n)mh for all the details.There might (must?) also be a way to get sedit to do it, but I'vebeen using gvim as my exmh message editor for a long..."
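
To see what the tag-stripping pattern from the cleanup cell does, here is a small sketch that applies the same regular expression with the standard `re` module instead of pandas. The sample strings and the `strip_tags` helper are hypothetical. The second example suggests why email addresses written in angle brackets also disappear from the cleaned text, as seen in the first row of the output.

```python
import re

# Same pattern as in the cleanup cell: anything between '<' and '>'
TAG_PATTERN = re.compile(r"<[^<>]*>")

def strip_tags(text):
    """Replace every <...> span with a single space."""
    return TAG_PATTERN.sub(" ", text)

html = "<p>Free <b>offer</b></p>"
header = "From: Chris <chris@example.com>"

# Collapse the leftover whitespace for display
print(" ".join(strip_tags(html).split()))
print(strip_tags(header))
```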


Data summary.

In [52]:
df.describe(include='object')

Unnamed: 0,category,text
count,7953,7953.0
unique,2,4565.0
top,ham,
freq,5553,84.0


## Compute NULL model (baseline) accuracy

The data set is unbalanced (more ham than spam), so a null model that always predicts the majority class (ham) is about 70% accurate.

Any model worth considering must, at a minimum, beat the accuracy of the null model.

In [None]:
# Count rows per category (ham vs. spam)
cat_size = df.groupby('category').size()
print(cat_size)

# Null model: always predict the majority class (ham)
ham_prop = cat_size['ham'] / cat_size.sum()
print('Baseline (null model) accuracy:', ham_prop)
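
As a worked check of the baseline, the sketch below recomputes it from the counts in the data summary: 7953 rows after cleanup, of which 5553 are ham, leaving 2400 spam. The `null_model_accuracy` helper is hypothetical and uses only the standard library.

```python
from collections import Counter

def null_model_accuracy(labels):
    """Accuracy of always predicting the most frequent label."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)

# Counts taken from the data summary: 5553 ham out of 7953 rows
labels = ["ham"] * 5553 + ["spam"] * (7953 - 5553)
label, acc = null_model_accuracy(labels)
print(label, round(acc, 3))
```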

## Fastai

In [None]:
from fastai.data.all import *
from fastai.text.all import *

Create data loaders for classification: build a **DataBlock** using **TextBlock** and **CategoryBlock**, and set aside 20% of the data for model testing using **TrainTestSplitter**, stratified by category.

In [None]:
dls_cls = DataBlock(
            blocks=(
                TextBlock.from_df(text_cols=('text'), 
                        is_lm=False),
                CategoryBlock),
            get_x=ColReader('text'), 
            get_y=ColReader('category'),
            splitter=TrainTestSplitter(test_size=0.2, stratify=df.category)
        ).dataloaders(df)
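
The `stratify` argument keeps the ham/spam ratio the same in both splits. The following plain-Python sketch illustrates the idea only; it is not fastai's implementation, and the `stratified_split` helper and the toy label counts are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=0):
    """Split indices so each class keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_size)
        test.extend(idxs[:n_test])
        train.extend(idxs[n_test:])
    return sorted(train), sorted(test)

# Toy data with a 70/30 class balance
labels = ["ham"] * 70 + ["spam"] * 30
train_idx, test_idx = stratified_split(labels)
spam_in_test = sum(labels[i] == "spam" for i in test_idx)
print(len(test_idx), spam_in_test)
```

Without stratification, a random split of an unbalanced data set can over- or under-represent the minority class in the test set, which makes the reported metrics less reliable.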

In [None]:
dls_cls.show_batch(max_n = 5)

## Text Classifier

Create text classifier learner

In [None]:
learner = text_classifier_learner(dls_cls, 
                                    AWD_LSTM, 
                                    drop_mult=0.5, 
                                    metrics=[accuracy, Precision(), Recall(), F1Score()])

Use the learning rate finder to suggest a learning rate (a hyperparameter).

In [None]:
lr = learner.lr_find()
lr

Fine tune model for 3 epochs

In [None]:
learner.fine_tune(3, lr[0])

## Interpret results

F1-score (the harmonic mean of precision and recall) is good, and so are precision and recall. Note: the model was tuned on GPUs; tuning on CPUs will take significantly longer (hours).

Show a few predictions on the held-out (validation) data.

In [None]:
learner.show_results()

Plot the confusion matrix. Note that the matrix is generated using the test data set (20% of the data).

In [None]:
interp = ClassificationInterpretation.from_learner(learner)
interp.plot_confusion_matrix()
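
The metrics tracked during training can all be derived from the four cells of a binary confusion matrix. The sketch below shows the arithmetic; the counts are hypothetical and are not results from this notebook.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # of predicted positives, how many were right
    recall = tp / (tp + fn)      # of actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1

# Hypothetical counts, treating spam as the positive class
accuracy, precision, recall, f1 = classification_metrics(tp=90, fp=10, fn=30, tn=870)
print(accuracy, precision, recall, round(f1, 3))
```

In this made-up example accuracy is high (0.96) even though a quarter of the spam is missed, which is why precision, recall, and F1 are worth tracking on an unbalanced data set.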

## Saving and loading model

In [None]:
learner.save('pretrained')

In [None]:
learner.load('pretrained')

# Language Model (ULMFiT approach)

Create data loaders for language modeling. Set aside 10% of the data for model validation using **RandomSplitter**.

In [None]:
dls_lm = DataBlock(
    # 'from' and 'subject' were dropped from df earlier, so only 'text' is available
    blocks=TextBlock.from_df(text_cols='text',
                             is_lm=True),
    get_x=ColReader('text'), 
    splitter=RandomSplitter(valid_pct=0.1, seed=12345)
    ).dataloaders(df, bs=64)

In [None]:
dls_lm.show_batch(max_n = 10)

Create the language model learner. Use the [AWD_LSTM](https://paperswithcode.com/method/awd-lstm) model architecture.

For metrics, use accuracy (higher is better) and perplexity (lower is better; roughly, the number of words the model is choosing among when predicting the next word in a sentence).

In [None]:
lm_learner = language_model_learner(
            dls_lm, 
            AWD_LSTM, 
            wd=0.1,
            metrics=[accuracy, Perplexity()]).to_fp16()
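
Perplexity is the exponential of the average cross-entropy loss, which is what gives it the "number of words the model is choosing among" reading. The sketch below shows the arithmetic in plain Python; the `perplexity` helper and the probabilities are hypothetical and not taken from this model.

```python
import math

def perplexity(true_token_probs):
    """exp(mean negative log-likelihood) of the probabilities given to the true tokens."""
    nll = [-math.log(p) for p in true_token_probs]
    return math.exp(sum(nll) / len(nll))

# A model that always spreads its guess evenly over 4 words
uniform = perplexity([0.25, 0.25, 0.25, 0.25])
# A model that is always certain and correct
perfect = perplexity([1.0, 1.0])
print(round(uniform, 6), perfect)
```

A model that is equally unsure among 4 candidate words has a perplexity of about 4, while a perfect model has a perplexity of 1.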

Find the optimal learning rate

In [None]:
lr = lm_learner.lr_find()
lr

Fit one cycle. Because the pretrained model starts out frozen, this trains only the last (unfrozen) layer group and leaves the remaining pretrained layers unchanged.

In [None]:
lm_learner.fit_one_cycle(1, lr[0])

Save language model state.

In [None]:
lm_learner.save('1epoch')

Unfreeze all layers of the model for further tuning.

In [None]:
lm_learner.unfreeze()

In [None]:
lr = lm_learner.lr_find()
lr

Train model for 5 epochs

In [None]:
lm_learner.fit_one_cycle(5, lr[0])

In [None]:
lm_learner.recorder.plot_loss()

Save the encoder of the fine-tuned language model, so that the classifier can load it later.

In [None]:
lm_learner.save_encoder('finetuned')

Text generation

In [None]:
TEXT = "Free promotion"
N_WORDS = 40
N_SENTENCES = 5
preds = [lm_learner.predict(TEXT, N_WORDS, temperature=0.75)
         for _ in range(N_SENTENCES)]
print("\n".join(preds))
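
The `temperature` argument controls how adventurous the sampling is. The sketch below is a generic illustration of temperature scaling applied to a softmax, not fastai's internal code. The `softmax_with_temperature` helper and the scores are hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate next words
logits = [2.0, 1.0, 0.5]
p_default = softmax_with_temperature(logits, 1.0)
p_sharp = softmax_with_temperature(logits, 0.75)
print([round(p, 3) for p in p_default])
print([round(p, 3) for p in p_sharp])
```

With a temperature below 1, the most likely word receives a larger share of the probability, so generated text tends to be more predictable. Values above 1 flatten the distribution and make output more varied.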

Create DataLoaders for classification using the customized language model.

In [None]:
dls_cls = DataBlock(
            blocks=(TextBlock.from_df(text_cols=('text'), 
                              is_lm=False, 
                              vocab=dls_lm.vocab), 
            CategoryBlock),
            get_x=ColReader('text'),
            get_y=ColReader('category'), 
            splitter=RandomSplitter(valid_pct=0.2, seed=12345)
            ).dataloaders(df, bs=64)
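
Passing `vocab=dls_lm.vocab` matters because the fine-tuned encoder maps token ids to embeddings, so the classifier has to assign the same id to each token as the language model did. The sketch below illustrates the mismatch that a different vocabulary would cause. The toy vocabularies and the `numericalize` helper are hypothetical; `xxunk` is fastai's token for unknown words.

```python
def numericalize(tokens, vocab):
    """Map tokens to integer ids, falling back to the unknown-token id."""
    lookup = {tok: i for i, tok in enumerate(vocab)}
    unk = lookup["xxunk"]
    return [lookup.get(tok, unk) for tok in tokens]

# Hypothetical vocabularies: same words, different order
lm_vocab = ["xxunk", "free", "promotion", "meeting"]
other_vocab = ["xxunk", "meeting", "free", "promotion"]

tokens = ["free", "promotion", "today"]
ids_shared = numericalize(tokens, lm_vocab)
ids_mismatched = numericalize(tokens, other_vocab)
print(ids_shared, ids_mismatched)
```

With the mismatched vocabulary, the id for `free` would point at the embedding row that the encoder learned for a different word.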

In [None]:
dls_cls.show_batch(max_n=3)

Create text classification learner.

In [None]:
learn = text_classifier_learner(dls_cls, 
                                AWD_LSTM, 
                                drop_mult=0.5, 
                                metrics=[accuracy, Precision(), Recall(), F1Score()])

In [None]:
learn = learn.load_encoder('finetuned')

In [None]:
lr = learn.lr_find()

In [None]:
learn.fine_tune(3, lr[0])

Plot confusion matrix.

In [None]:
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix()