Advanced ML Week 1, Lecture 1: Working with and Preparing Text Data

In this notebook we will prepare tweets from Twitter (now X) for sentiment analysis.  Sentiment analysis is a common text classification task: deciding whether a text expresses a positive or negative opinion.

This is useful for companies that want to analyze large numbers of documents, tweets, reviews, etc., to determine public sentiment about a product or service.

The data was originally gathered from Twitter (now X) and hand-labeled.  Of course there will be some human bias in the labeling.  It was downloaded from Kaggle at this site: [Kaggle Twitter Tweets Sentiment Dataset](https://www.kaggle.com/datasets/yasserh/twitter-tweets-sentiment-dataset/)

There are 3 classes: positive, negative, and neutral.

In [None]:
## Import necessary packages
import pandas as pd
import nltk

# Load the Data

We will load our **corpus** of tweets from the downloaded Kaggle archive.

In [None]:
## Load the corpus of tweets
df = pd.read_csv('Data/archive.zip')
df.head()

# Some light EDA

In [None]:
df.info()

In [None]:
df.duplicated().sum()

# Some Light Data Cleaning

We see that our **corpus** has 27481 **documents**, each with an ID, the full text, a shortened version, and the labeled sentiment.

Interestingly, one of the tweets has no text!  We definitely want to get rid of that.  We will also drop the `textID` and `selected_text` columns.  We are going to use the entire text of each tweet, not just a subset.

We will keep the label, `sentiment`, for later classification and analysis tasks.

In [None]:
df = df.drop(columns=['textID', 'selected_text'])
df = df.dropna()

In [None]:
df.info()

# Some More EDA
Let's look at some aspects of this text.
* What do the **documents** look like?
* How long do they tend to be?

## View some sample tweets

In [None]:
## Expand how many characters pandas will show
pd.set_option('display.max_colwidth', None)

## Display some of the documents (tweets)
df['text'].head(10)

We can see here that there are some URLs in the text.  This will be a problem for normalization.  We will remove those.

## Get some statistics on the length of **documents**

Let's see how long each tweet is and determine the average length of the tweets.
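A minimal sketch of the length calculation, using a tiny made-up DataFrame in place of the real tweets. The names `toy_df` and `length` are assumptions for illustration and are not part of the dataset.

```python
import pandas as pd

# Toy stand-in for the tweets DataFrame (made-up text, not from the dataset)
toy_df = pd.DataFrame({'text': ['I love this!', 'so sad today', 'ok']})

# .str.len() counts the characters in each document
toy_df['length'] = toy_df['text'].str.len()

# .describe() reports count, mean, std, min, quartiles, and max
length_stats = toy_df['length'].describe()
print(length_stats)
```

The same two calls should work on the real `df['text']` column.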

In [None]:
## Determine the length of each tweet
## Create a new column of the lengths of each tweet


In [None]:
## Analyze the statistics of the lengths


The tweets have a mean length of 68 characters and a median of 64.  They range from 3 to 141 characters with a standard deviation of 35.  The middle 50% are between 39 and 97 characters in length.

This gives us some idea of how long they tend to be.

# Text Normalization with NLTK

## Normalizing Casing

It's common practice to lowercase the text in our documents as one step of normalization.
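As a small sketch, lowercasing a whole column can be done with the pandas string accessor. The DataFrame and the column name `lower_text` below are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the tweets DataFrame (made-up text)
toy_df = pd.DataFrame({'text': ['I LOVE this!', 'So Sad today']})

# .str.lower() applies str.lower to every row of the column
toy_df['lower_text'] = toy_df['text'].str.lower()
print(toy_df['lower_text'].tolist())
```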

In [None]:
## Lower the casing of each document


## Tokenizing

Tokenizing text into single-word tokens is simple in Python.  We can just use `str.split()`.  When no separator is given, `.split()` splits on any run of whitespace (spaces, tabs, newlines) and ignores leading and trailing whitespace.

We can access Pandas' string accessor on a column with `df['column'].str.<method>`.  This allows us to apply string methods to all rows in that column.

When processing text, if memory allows, it can be useful to keep many versions of your text: tokenized, lemmatized, no stop words, etc.  Some analysis or modeling packages expect tokenized data and others do not.  We often want to use different versions for different kinds of analysis, too.
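A minimal sketch of whitespace tokenization with the string accessor, keeping the earlier version of the text alongside the new one. The column names `lower_text` and `tokens` are assumptions.

```python
import pandas as pd

# Toy stand-in for the lowercased tweets (made-up text)
toy_df = pd.DataFrame({'lower_text': ['i love this!', 'so sad today']})

# With no separator given, .str.split() splits each document on whitespace
# and stores the result as a list of tokens per row
toy_df['tokens'] = toy_df['lower_text'].str.split()
print(toy_df['tokens'].tolist())
```

Note that the punctuation stays attached to the word ('this!'), which is the limitation of plain whitespace splitting.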

In [None]:
## Split the documents into tokens



### Better way to tokenize data

NLTK has a more sophisticated tokenization function that also separates punctuation into its own tokens.  This way 'hooray' and 'hooray!!!' both produce the token 'hooray', with the exclamation marks split off as separate tokens.

`nltk.word_tokenize` relies on NLTK's 'punkt' tokenizer data, so we need to download that first.

In [None]:
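## Download punkt
nltk.download('punkt')
## Newer NLTK releases look for 'punkt_tab' instead, so download it as well
nltk.download('punkt_tab')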

## Tokenize with nltk.word_tokenize instead
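As a rough illustration of what punctuation-aware tokenization buys us, the sketch below uses a simple regular expression from the standard library. It is only an approximation for demonstration and is not how `nltk.word_tokenize` works internally. The function name `simple_tokenize` is hypothetical.

```python
import re

def simple_tokenize(text):
    """Rough stand-in tokenizer: words and single punctuation marks become separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

# Whitespace splitting keeps the punctuation glued to the word
whitespace_tokens = 'hooray!!! we won'.split()

# The regex version separates each punctuation mark from the word
regex_tokens = simple_tokenize('hooray!!! we won')

print(whitespace_tokens)
print(regex_tokens)
```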



## Remove Stop Words

In [None]:
## Download NLTK stopword list
nltk.download('stopwords')

## Load the English stop words.



<font color=red> NOTICE </font> that all of the stop words are lower case.  It's necessary to ensure that your tokens are all lower case before using this list to remove stop words.

To remove the stop words from each document, we will apply a function that will check each word in the list of tokens against the list of stopwords and remove them if they are in the list.  More specifically, it will only save them if they are NOT in the list.
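A minimal sketch of such a function. To stay self-contained it uses a small made-up stop-word set; in the notebook the set would presumably be built from NLTK's English list. The names `stop_words` and `remove_stop_words` are assumptions.

```python
# Small illustrative stop-word set; in the notebook this would be built from
# NLTK's English list, e.g. set(stopwords.words('english'))
stop_words = {'i', 'the', 'a', 'is', 'this', 'so'}

def remove_stop_words(tokens):
    """Keep only the tokens that are NOT in the stop-word set."""
    return [token for token in tokens if token not in stop_words]

tokens = ['i', 'love', 'this', 'phone', 'so', 'much']
no_stops = remove_stop_words(tokens)
print(no_stops)
```

On a DataFrame, the function can be passed to `.apply()` on the column of token lists. Using a set rather than a list makes each membership check faster.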

In [None]:
## Create function to remove stop words

## Apply the function to the tokenized data

## Remove Punctuation

We can remove punctuation in a similar way to how we removed stop words.  However, we will get our list of punctuation from Python's built-in `string` library.
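One way this could look is sketched below. The function name `remove_punctuation` is an assumption. Checking every character of a token, rather than the token as a whole, also catches multi-character tokens such as '...'.

```python
import string

def remove_punctuation(tokens):
    """Drop tokens made up entirely of punctuation characters."""
    return [token for token in tokens
            if not all(char in string.punctuation for char in token)]

tokens = ['hooray', '!', '!', '...', 'love', 'it', "n't"]
no_punct = remove_punctuation(tokens)
print(no_punct)
```

Tokens that mix letters and punctuation, such as "n't", are kept.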

In [None]:
## Import the built-in string library



In [None]:
## Create function to remove punctuation tokens



## Apply the function to create tokens without punctuation



## Remove URLs

In [None]:
## Remove URLs
## Create function to remove URL tokens



## Apply the function to create tokens without URLs
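A minimal sketch of a token-level URL filter, assuming a URL token starts with 'http' or 'www'. The function name `remove_urls` is hypothetical.

```python
def remove_urls(tokens):
    """Drop tokens that look like URLs (start with 'http' or 'www')."""
    return [token for token in tokens
            if not token.startswith(('http', 'www'))]

tokens = ['check', 'http://t.co/abc', 'www.example.com', 'awww', 'cute']
no_urls = remove_urls(tokens)
print(no_urls)
```

Testing the start of the token, rather than looking for 'www' anywhere inside it, keeps words such as 'awww'. If the tokenizer has already broken a URL into several pieces, stripping URLs from the raw text before tokenizing may work better.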



## Results

Note how many fewer tokens we have in our `no_stops_no_punct` tokens than in our original.  Some information was lost, but a lot was also retained.

Normalization is a huge part of the NLP process.  It is always a balance between reducing the size of our vocabulary, and therefore simplifying our models, and retaining enough information for the model to extract meaningful patterns from the texts.

There are a lot of choices here to make.

# Normalizing Text with spaCy

The spaCy Python package provides text processing pipelines that can do many of these operations, plus much more complicated processing, very fast and in many fewer steps.  For this reason it is a very popular tool.  

It utilizes pretrained language models that can recognize things like parts of speech and named entities (people, specific places, currency, etc.).

spaCy was not included in your original dojo_env, so you will need to install it if you have not already.

We will also download the pretrained English language model trained on millions of web documents.  We will use the small-sized one for efficiency.

In [None]:
## Install spacy if necessary
!pip install spacy

import spacy

## Download the English small-sized model trained on web documents if necessary
spacy.cli.download('en_core_web_sm')

## The spaCy model

In [None]:
## Load the model.  Disable the Named Entity Recognizer (too slow)
nlp_model = spacy.load('en_core_web_sm', disable=['ner'])

## Display the name of each pipeline component
nlp_model.pipe_names

We have our model, and we can apply it like a function.  It expects a string of text as the input.

In [None]:
## Process a document with the model



The document is a collection of tokens we can iterate over.

## Documents and Tokens

In [None]:
## Display the tokens in the document



Each token is much more than a string.  It is a spaCy `Token` object.

In [None]:
## Isolate the last token in the document


## Display the text and type of the token



Each token has many attributes that we can take advantage of, such as its lemma form, whether it is punctuation or a space, and whether it is a stop word.

In [None]:
## Display the lemmatized form of the token



In [None]:
## Check whether the token is punctuation



In [None]:
## Check whether the token is a space



spaCy can even determine the part of speech that the token is!

In [None]:
## Check the part of speech of the token



In [None]:
## Show the parts of speech for each token in the document



In [None]:
## Show a list of the lemmas for each token in the document



Notice that spaCy does not lower the case of lemmas.  Let's make sure we do that, too.

In [None]:
## Show a list of only the tokens in the document that are not punctuation or spaces or URLs



In [None]:
## Show the lowercased lemmas of all tokens in the document that are not punctuation, spaces, or stop words
[token.lemma_.lower() for token in doc if 
 not token.is_punct and 
 not token.is_space and 
 not token.is_stop]

In order to use spaCy to process our entire dataframe, we will need to make a function and apply it to our text column.
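One possible shape for such a function is sketched below. So that it runs without a downloaded language model, the demonstration uses hand-built stand-in tokens (`SimpleNamespace` objects) that carry the same attribute names as spaCy tokens, including spaCy's built-in `like_url` flag. The names `preprocess_doc`, `fake_token`, and `fake_doc` are assumptions for illustration only.

```python
from types import SimpleNamespace

def preprocess_doc(doc):
    """Return lowercased lemmas, skipping punctuation, spaces, stop words, and URLs."""
    return [token.lemma_.lower() for token in doc
            if not token.is_punct
            and not token.is_space
            and not token.is_stop
            and not token.like_url]

def fake_token(lemma, punct=False, space=False, stop=False, url=False):
    """Hypothetical stand-in exposing the spaCy token attributes used above."""
    return SimpleNamespace(lemma_=lemma, is_punct=punct, is_space=space,
                           is_stop=stop, like_url=url)

# Hand-built stand-in for a processed document (not real spaCy output)
fake_doc = [fake_token('I', stop=True), fake_token('Love'),
            fake_token('this', stop=True), fake_token('phone'),
            fake_token('!', punct=True), fake_token(' ', space=True),
            fake_token('http://t.co/abc', url=True)]

lemmas = preprocess_doc(fake_doc)
print(lemmas)
```

With the real model, the same function could be applied to the text column along the lines of `df['text'].apply(lambda text: preprocess_doc(nlp_model(text)))`.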

In [None]:
## Let's also remove URLs.  token.like_url is spaCy's built-in URL check;
## a substring test for 'www' would also drop words such as 'awww'.
[token.lemma_.lower() for token in doc if 
 not token.is_punct and 
 not token.is_space and 
 not token.is_stop and 
 not token.like_url]

## Preprocessing with spaCy

In [None]:
## Define a function to use spacy to process our text



## Process the tweets using the spaCy function into a new column in the df



We used spaCy to tokenize, lemmatize, and remove punctuation and stopwords from our text in one step!

Notice that the spaCy processed data is a little different than our previously processed data.  The text has been lemmatized and spaCy has a different list of stop words than NLTK.

The learn platform has directions for how you can customize your spaCy stopword list and a function with more flexibility in how spaCy will process your data.

# ngrams
N-grams combine multiple adjacent words into single tokens.
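To show what an n-gram is before reaching for NLTK, here is a sketch that builds bigrams and trigrams with the built-in `zip`. The token list is made up; `list(ngrams(tokens, 2))` from NLTK should give the same tuples as the bigram line.

```python
# Made-up tokenized document
tokens = ['not', 'a', 'good', 'day']

# Pair each token with the one that follows it
bigrams = list(zip(tokens, tokens[1:]))

# Group each token with the two that follow it
trigrams = list(zip(tokens, tokens[1:], tokens[2:]))

print(bigrams)
print(trigrams)
```

Bigrams such as ('not', 'a') keep some word order that single tokens lose, which can matter for sentiment.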

In [None]:
## Import the ngrams function
from nltk import ngrams

In [None]:

## Isolate the 6th lemmatized document


In [None]:

# Create list of bigrams


In [None]:

# Create list of trigrams



## Applying `ngrams` to make a new column



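We need to make a function that returns a list of bigrams.  It won't work to just pass the `ngrams` function to `.apply()`, because `ngrams` needs to be told the value of `n` and returns a lazy iterator rather than a list.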
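A minimal sketch of such a wrapper and its use with `.apply()`. To stay free of NLTK it builds the bigrams with `zip`; with NLTK imported, the function body could instead be `list(ngrams(tokens, 2))`. The names `make_bigrams`, `toy_df`, `lemmas`, and `bigrams` are assumptions.

```python
import pandas as pd

def make_bigrams(tokens):
    """Return a list of bigram tuples for one tokenized document."""
    return list(zip(tokens, tokens[1:]))

# Toy stand-in for the column of lemmatized documents (made-up data)
toy_df = pd.DataFrame({'lemmas': [['not', 'good', 'day'], ['love', 'it']]})

# .apply() calls the function once per row and stores the returned list
toy_df['bigrams'] = toy_df['lemmas'].apply(make_bigrams)
print(toy_df['bigrams'].tolist())
```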


In [None]:

## Create a function to create bigrams



In [None]:
# add bigrams to the df with .apply()



# Save the final data version for modeling


In [None]:

## Save the processed data
# df.to_csv('../Data/processed_data.csv', index=False)

In [None]:
# # Save the processed data
# import joblib

# joblib.dump(df, '../Data/processed_data.joblib')