# Advanced ML Week 1, Lecture 1: Working with and Preparing Text Data

___


- This is a modified version of our in-class notebook from Lecture 01 on Twitter sentiment analysis.
- The dataset has been replaced with a subset of Amazon product reviews.

___

### New Sections/Content
- See Prepare-Amazon-Reviews-Subset-csv.ipynb for selection of subset brand.
- [✨Removing HTML with Regex](#regex)
- Data Introduction (below):

### Amazon Data Intro

In [None]:
from IPython.display import display, Markdown
with open("../Data-AmazonReviews/Amazon Product Reviews.md") as f:
    info = f.read()

display(Markdown(info))


In [None]:
## Import necessary packages
import pandas as pd
import nltk

# Load the Data

We will load our **corpus** of Amazon Reviews for Hoover products.

In [None]:
df = pd.read_csv('../Data-AmazonReviews/amazon-reviews-home-kitchen_hoover.csv')
df.head()

# Some light EDA

In [None]:
df.info()

In [None]:
# New
df.isna().sum()

In [None]:
df = df.dropna(subset=['reviewText','summary'])
df.isna().sum()

In [None]:
df.duplicated().sum()

In [None]:
# New
df = df.drop_duplicates()
df.duplicated().sum()

# Some Light Data Cleaning

### ✨ New Cleaning for Amazon:

- The reviews are split into 2 parts: the `reviewText`, which is the bulk of the review, and the `summary`, which is a 1-line summary of the review (and often includes the actual rating, e.g., "Four Stars - best vacuum").

In [None]:
df['text-raw'] = df['summary'] + ": " + df['reviewText']
df.head()

#### Confirming Which Columns to Drop

In [None]:
# new
df['brand'].unique()

In [None]:
# new
df['title'].unique()

In [None]:
df = df.drop(columns=['brand', 'reviewText','summary','title'])
df = df.dropna()

In [None]:
df.info()
df.head()

# Some More EDA
Let's look at some aspects of this text.
* What do the **documents** look like?
* How long do they tend to be?

## View some sample reviews

In [None]:
## Expand how many characters pandas will show
pd.set_option('display.max_colwidth', None)

## Display some of the documents (reviews)
df[['text-raw']].head()

In [None]:
# df.loc[df['text'].str.contains('http://')]

Some of the reviews contain URLs.  These will be a problem for normalization, so we will remove them.

## Get some statistics on the length of **documents**

Let's see how long each review is and determine the average length of the reviews.

In [None]:
## Determine the length (in characters) of each review
## Create a new column of the lengths of each review
df['length']= df['text-raw'].map(len)

In [None]:
## Analyze the statistics of the lengths
df['length'].describe()

In [None]:
# New 
import seaborn as sns
ax = sns.histplot(data=df, x='length')
#ax.axvline(df['length'].mean(),color='k',ls=':');

<a name='regex'></a>
## ✨ Removing HTML From Reviews with Regex

In [None]:
df.loc[df['text-raw'].str.contains('http://')]

- Regular expression figured out with Google Bard: https://g.co/bard/share/1db36656cbf3

- Tested out in this saved regex101 pattern: https://regex101.com/r/01bd7q/3 with the values from below:



In [None]:
# Checking for raw html
df.loc[df['text-raw'].str.contains('<')]

In [None]:
# # Copy/pasted these values into regex 101
# df.loc[df['text-raw'].str.contains('<')]['text-raw'].values

In [None]:
import re

# Regular expression to match HTML tags
regex = r"<[^>]*>"

# Apply the regex to the DataFrame column using str.replace
df['text'] = df['text-raw'].str.replace(regex, '', regex=True)

In [None]:
pd.set_option('display.max_colwidth',250)

In [None]:

df.loc[df['text-raw'].str.contains('<'),['text-raw', 'text']]
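To see what the pattern does and does not catch, here is a minimal sketch that applies the same regex with `re.sub` to a few made-up strings. The sample texts and the `HTML_TAG` name are assumptions for illustration, not rows from the dataset.

```python
import re

# Same pattern as in the cell above: "<", any run of non-">" characters, then ">"
HTML_TAG = re.compile(r"<[^>]*>")

# Hypothetical review snippets, invented for illustration
samples = [
    'Great vacuum!<br />Would buy again.',
    '<a href="http://example.com">link</a> to the manual',
    'Suction power < 5 but price > 100',
]

cleaned = [HTML_TAG.sub("", s) for s in samples]

for before, after in zip(samples, cleaned):
    print(repr(before), "->", repr(after))
```

Two things to watch for: removing `<br />` glues the neighboring words together (replacing the tag with a space instead may be preferable), and any text that happens to sit between a literal `<` and a later `>` is deleted too, as the third sample shows.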

# Text Normalization with NLTK

## Normalizing Casing

It's common practice to lowercase the text in our documents as part of normalizing it.

In [None]:
## Lower the casing of each document
df['lower_text'] = df['text'].str.lower()
df.head()

## Tokenizing

Tokenizing text into single-word tokens is simple in Python.  We can just use `str.split()`.  Called with no arguments, `.split()` splits on any run of whitespace (spaces, tabs, newlines), which is not the same as splitting on a single space `' '`.

We can access pandas' string accessor with `df['column'].str.<method>`.  This allows us to apply string methods to every row in a column.

When processing text, if memory allows, it can be useful to keep many versions of your text: tokenized, lemmatized, no stop words, etc.  Some analysis or modeling packages expect tokenized data and others do not.  We often want to use different versions for different kinds of analysis, too.
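As a quick illustration of how `str.split()` behaves with and without an explicit separator, here is a small sketch on a made-up string (the sample text is an assumption, not taken from the reviews):

```python
# Hypothetical text with a double space, a tab, and a trailing newline
text = "Great  vacuum,\tworks well!\n"

# No argument: split on any run of whitespace and drop empty strings
default_tokens = text.split()

# Explicit single space: every individual space is a separator,
# so double spaces produce empty strings and tabs/newlines stay in the tokens
space_tokens = text.split(" ")

print(default_tokens)
print(space_tokens)
```

Note that in both cases the punctuation stays attached to the words (`'vacuum,'`, `'well!'`), which is the limitation the next section addresses.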

In [None]:
## Split the documents into tokens

df['tokens'] = df['lower_text'].str.split()
df.head()

### Better way to tokenize data

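NLTK has a more sophisticated tokenization function that also splits punctuation into separate tokens.  This way 'hooray!!!' yields the token 'hooray' plus separate punctuation tokens, so it shares its word token with a plain 'hooray'.

`nltk.word_tokenize` relies on the 'punkt' tokenizer data, so we need to download it first.  (Recent NLTK releases may ask for the 'punkt_tab' resource instead; if you see a `LookupError`, download the resource it names.)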

In [None]:
## Download punkt
nltk.download('punkt')

## Tokenize with nltk.word_tokenize instead

df['tokens'] = df['lower_text'].apply(nltk.word_tokenize)
df.head()

## Remove Stop Words

In [None]:
## Download NLTK stopword list
nltk.download('stopwords')

## Load the English stop words.

stop_words = nltk.corpus.stopwords.words('english')
stop_words[:10]

<font color=red> NOTICE </font> that all of the stop words are lower case.  It's necessary to ensure that your tokens are all lower case before using this list to remove stop words.

To remove the stop words from each document, we will apply a function that will check each word in the list of tokens against the list of stopwords and remove them if they are in the list.  More specifically, it will only save them if they are NOT in the list.
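The filtering logic can be tried on its own with a tiny, made-up stop list. This is only a sketch: `mini_stop_words` and `drop_stopwords` are hypothetical names, and the notebook itself uses NLTK's full English list. Converting the list to a `set` is an optional design choice that keeps each membership check fast on long documents.

```python
# Hypothetical mini stop list; the notebook uses NLTK's English list instead
mini_stop_words = ["i", "the", "is", "a", "and", "it"]

# A set gives constant-time membership checks (a list is scanned item by item)
stop_set = set(mini_stop_words)

def drop_stopwords(tokens, stops=stop_set):
    """Keep only the tokens that are NOT in the stop word collection."""
    return [token for token in tokens if token not in stops]

lowered = drop_stopwords(["the", "vacuum", "is", "loud", "and", "it", "works"])
not_lowered = drop_stopwords(["The", "vacuum"])

print(lowered)
print(not_lowered)  # "The" survives because the stop list is all lower case
```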

In [None]:
## Create function to remove stop words
def remove_stopwords(tokens):
    # no_stops = []
    # for token in tokens:
    #     if token not in stop_words:
    #         no_stops.append(token)

    no_stops = [token for token in tokens if token not in stop_words]
    
    return no_stops
    
## Apply the function to the tokenized data

df['no_stops'] = df['tokens'].map(remove_stopwords)
df.head(10)




## Remove Punctuation

We can remove punctuation in a similar way to how we removed stop words.  However, we will get our list of punctuation from the built-in Python `string` library.
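One detail worth knowing: `string.punctuation` is a single string, so `token in punctuation` is a substring test rather than a character-by-character test. The sketch below contrasts that with a character-level check. The helper name `is_punct_token` and the token list are assumptions made for illustration.

```python
from string import punctuation

punct_chars = set(punctuation)

def is_punct_token(token):
    """True when the token is non-empty and every character is punctuation."""
    return len(token) > 0 and all(ch in punct_chars for ch in token)

# Hypothetical tokens, including multi-character punctuation
tokens = ["great", "!", "...", "``", "''", "does", "n't", "--"]

# Substring test: only removes tokens that appear verbatim inside the punctuation string
substring_kept = [t for t in tokens if t not in punctuation]

# Character-level test: removes any token made up entirely of punctuation characters
charwise_kept = [t for t in tokens if not is_punct_token(t)]

print(substring_kept)
print(charwise_kept)
```

Whether the stricter character-level check is wanted depends on the analysis; it is shown here only as one option.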

In [None]:
## Import the punctuation characters from the built-in string library
from string import punctuation
print(punctuation)

In [None]:
## Create function to remove punctuation tokens

def remove_punct(tokens):
    no_punct = []
    for token in tokens:
        if token not in punctuation:
            no_punct.append(token)
    return no_punct

## Apply the function to the tokens that already have stop words removed

df['no_stops_no_punct'] = df['no_stops'].apply(remove_punct)
df.head(10)

## Remove URLs

In [None]:
## [v3 For Loop - Continue] Define function to remove URLs
def remove_urls(token_list):
    no_urls = []
    for token in token_list:
        if ('http' in token) or ('www' in token):
            continue
        no_urls.append(token)
    return no_urls

## Remove URLs from no_stops_no_punct
df['no_stops_no_punct'] = df['no_stops_no_punct'].apply(remove_urls)
df.head(10)
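An alternative, sketched here under assumptions, is to strip URLs from the raw string with a regex before tokenizing. The pattern, the `strip_urls` helper, and the sample sentence are all hypothetical; the pattern is deliberately simple and will also swallow punctuation attached to the end of a URL.

```python
import re

# Simple illustrative pattern: "http://", "https://", or "www." followed by non-space characters
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")

def strip_urls(text):
    """Replace URLs with a space, then collapse repeated whitespace."""
    no_urls = URL_PATTERN.sub(" ", text)
    return " ".join(no_urls.split())

sample = "Manual is at http://example.com/manual.pdf and www.example.com has parts."
stripped = strip_urls(sample)
print(stripped)

# For comparison, the token-level substring check also drops ordinary words
# that merely contain "www" or "http"
kept = [t for t in ["awww", "cute", "vacuum"] if "http" not in t and "www" not in t]
print(kept)
```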


## Results

Note how many fewer tokens we have in `no_stops_no_punct` than in our original tokens.  Some information was lost, but a lot was also retained.  

Normalization is a huge part of the NLP process and is always a balance between reducing the size of our vocabulary, and therefore simplifying our models, and retaining enough information for the model to extract some meaningful patterns in the texts.  

There are a lot of choices here to make.
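To make the vocabulary-size trade-off concrete, here is a toy sketch that counts unique tokens before and after a simple normalization (lowercasing and trimming punctuation from the ends of each token). The three documents are invented for illustration, and the normalization is simpler than the NLTK steps above.

```python
from string import punctuation

# Hypothetical mini corpus
docs = [
    "The vacuum works great!",
    "Great vacuum, the best.",
    "the BEST vacuum",
]

# Vocabulary from plain whitespace tokens
raw_vocab = {tok for doc in docs for tok in doc.split()}

# Vocabulary after lowercasing and stripping leading/trailing punctuation
norm_vocab = {tok.lower().strip(punctuation) for doc in docs for tok in doc.split()}
norm_vocab.discard("")  # drop tokens that were only punctuation

print(len(raw_vocab), sorted(raw_vocab))
print(len(norm_vocab), sorted(norm_vocab))
```

The normalized vocabulary is smaller, but it also no longer distinguishes "BEST" from "best", which may or may not matter for sentiment.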

# Normalizing Text with spaCy

The spaCy Python package provides text processing pipelines that can do many of these operations, plus much more complicated processing, very fast and in many fewer steps.  For this reason it is a very popular tool.  

It utilizes pretrained language models that can recognize things like parts of speech and named entities (people, specific places, currency, etc.)

spaCy was not included in your original dojo_env, so you will need to install it if you have not already.

We will also download the pretrained English language model trained on millions of web documents.  We will use the small-sized one for efficiency.

In [None]:
## Install spacy if necessary
#!pip install spacy

import spacy

## Download the English small-sized model trained on web documents if necessary
# spacy.cli.download('en_core_web_sm')

## The spaCy model

In [None]:
## Load the model.  Disable the Named Entity Recognizer (not needed here, and it slows processing)
nlp_model = spacy.load('en_core_web_sm', disable=['ner'])

## Display the names of the pipeline components
nlp_model.pipe_names

We have our model, and we can apply it like a function.  It expects a string of text as the input.

In [None]:
df.head()

In [None]:
# New
idx_example = 286#2873
raw_text = df.loc[idx_example,'text']
raw_text

In [None]:
## Process a document with the model
doc = nlp_model(raw_text)
doc

In [None]:

df['text'][20]

In [None]:
nlp_model(df['text'][20])

**The document is a collection of tokens we can iterate over**

## Documents and Tokens

In [None]:
## Display the tokens in the document

[token for token in doc]

Each token is much more than a string.  

In [None]:
## Isolate one token (the fourth from the end) from an example document
doc = nlp_model("I thought I did my homework but I forgot I was running late and didn't finish.")
word = doc[-4]

## Display the text and type of the token
print(word)
type(word)

Each token has many attributes that we can take advantage of, such as its lemma form and whether it is punctuation, a space, or a stop word.

In [None]:
## Display the lemmatized form of the token

word.lemma_

In [None]:
## Check whether the token is punctuation
word.is_punct


In [None]:
## Check whether the token is a space
word.is_space

spaCy can even determine the part of speech of the token!

In [None]:
## Check the part of speech of the token
word.pos_

In [None]:
## Show the parts of speech for each token in the document

[token.pos_ for token in doc]

In [None]:
## Show a list of the lemmas for each token in the document

[token.lemma_ for token in doc]

Notice that spaCy does not lower the case of lemmas.  Let's make sure we do that, too.

In [None]:
## Build a list of lowercased lemmas, skipping punctuation, spaces, and stop words
lemmas_list = []
for token in doc:
    if token.is_punct:
        continue
    if token.is_space:
        continue
    if token.is_stop:
        continue

    lemmas_list.append(token.lemma_.lower())

lemmas_list


In [None]:
## Show a list of all the tokens in the document that are not punctuation, spaces, or stop words
[token.lemma_.lower() for token in doc if 
 not token.is_punct and 
 not token.is_space and 
 not token.is_stop]

In order to use spaCy to process our entire dataframe, we will need to make a function and apply it to our text column.

In [None]:
## Let's also remove URLs
[token.lemma_.lower() for token in doc if 
 not token.is_punct and 
 not token.is_space and 
 not token.is_stop and 
 'http' not in token.lemma_.lower() and
 'www' not in token.lemma_.lower()]

## Preprocessing with spaCy

In [None]:
## Define a function to use spacy to process our text
def spacy_process(text):
    """Lemmatize tokens, lower case, remove punctuation, spaces, stop words, and URLs"""
    doc = nlp_model(text)
    processed_doc = [token.lemma_.lower() 
                     for token in doc if not token.is_punct and 
                     not token.is_space and not token.is_stop and 
                     'http' not in token.lemma_.lower() and 
                     'www' not in token.lemma_.lower()]
    return processed_doc

## Process the reviews using the spaCy function
df['spacy_lemmas'] = df['text'].apply(spacy_process)
df.head()

We used spaCy to tokenize, lemmatize, and remove punctuation and stopwords from our text in one step!

Notice that the spaCy processed data is a little different than our previously processed data.  The text has been lemmatized and spaCy has a different list of stop words than NLTK.

The learn platform has directions for how you can customize your spaCy stopword list and a function with more flexibility in how spaCy will process your data.

# ngrams
N-grams combine multiple consecutive words into a single token.

In [None]:
## Import the ngrams function
from nltk import ngrams

In [None]:

## Isolate the 6th lemmatized document
lemma_doc = df['spacy_lemmas'][5]
lemma_doc

In [None]:

# Create list of bigrams
list(ngrams(lemma_doc,2))

In [None]:

# Create list of trigrams
list(ngrams(lemma_doc,3))


## Applying `ngrams` to make a new column



We need to make a function that returns a list of bigrams.  Passing `ngrams` directly to `.apply()` won't give us what we want: it needs the `n` argument, and it returns a lazy iterator rather than a list.
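The idea behind such a wrapper can be sketched without NLTK. The `make_ngrams` helper below is a hypothetical pure-Python stand-in for `nltk.ngrams` that always returns a list, and `functools.partial` fixes `n` so the result is a one-argument function of the kind `.apply()` expects. The token list is made up.

```python
from functools import partial

def make_ngrams(tokens, n):
    """Return a list of n-grams (as tuples) built from consecutive tokens."""
    # zip over n shifted views of the token list: tokens[0:], tokens[1:], ...
    return list(zip(*(tokens[i:] for i in range(n))))

# Fix n ahead of time to get single-argument functions
make_bigrams_sketch = partial(make_ngrams, n=2)
make_trigrams_sketch = partial(make_ngrams, n=3)

# Hypothetical lemmatized document
lemmas = ["hoover", "vacuum", "work", "great"]

bigram_list = make_bigrams_sketch(lemmas)
trigram_list = make_trigrams_sketch(lemmas)

print(bigram_list)
print(trigram_list)
```

A document with fewer than `n` tokens simply yields an empty list.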


In [None]:
## Create a function to create bigrams
def make_bigrams(doc):
    bigrams = ngrams(doc, 2)
    bigrams = list(bigrams)
    return bigrams

In [None]:
# add bigrams to the df with .apply()
df['bigrams'] = df['spacy_lemmas'].apply(make_bigrams)
df.head()


# Save the final data version for modeling


In [None]:
df.head()

In [None]:

## Save the processed data
df.to_csv('../Data-AmazonReviews/processed_data.csv', index=False)

In [None]:
## Also save the processed data with joblib, which keeps the list columns as Python lists
import joblib

joblib.dump(df, '../Data-AmazonReviews/processed_data.joblib')
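One reason to keep both files: CSV stores every cell as text, so list-valued columns such as the token and lemma columns come back as strings when the CSV is read again. The sketch below mimics that round trip with the standard-library `csv` module and shows `ast.literal_eval` as one way to rebuild the lists. The row contents and the `overall` column name are assumptions for illustration.

```python
import ast
import csv
import io

# Hypothetical processed row with a list-valued column
rows = [{"overall": 5, "spacy_lemmas": ["great", "vacuum"]}]

# Write to an in-memory CSV; the list is stored using its string form
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["overall", "spacy_lemmas"])
writer.writeheader()
writer.writerows(rows)

# Read it back: the cell is now a plain string, not a list
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
as_text = loaded[0]["spacy_lemmas"]

# Safely parse the string back into a Python list
restored = ast.literal_eval(as_text)

print(type(as_text).__name__, as_text)
print(type(restored).__name__, restored)
```

With pandas, the same parsing could be applied at load time, for example by passing `converters={'spacy_lemmas': ast.literal_eval}` to `pd.read_csv`, whereas the joblib file needs no such step.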