# **Introduction to text analysis in Python. Day 3 Part 2**

## *Dr Kirils Makarovs*

## *k.makarovs@exeter.ac.uk*

## *University of Exeter Q-Step Centre*

---


# **Welcome to Day 3 Part 2!**

## **Today, we are going to look at:**

+ Descriptive text analysis (continuation)
+ Text preprocessing

---



# **Text preprocessing**

**Text preprocessing** is all about preparing raw text entries for further modelling and analysis

The better the text is preprocessed, the more accurate and reliable the results of the modelling are going to be!

**Text preprocessing** might include different steps for different types of analysis, but here are some of the most common ones:

<figure>
<left>
<img src=https://miro.medium.com/max/1024/1*pzjECYWP8WOWhwfCjebZVw.png  width="600">
</figure>

[Image source](https://medium.com/predict/how-does-nlp-pre-processing-actually-work-8d097c179af1)

<figure>
<left>
<img src=https://iq.opengenus.org/content/images/2020/05/text_steps.png  width="600">
</figure>

[Image source](https://iq.opengenus.org/text-preprocessing-in-spacy/)

## **Example: A single TED talk**

In [None]:
# Importing some of the required libraries

import pandas as pd
import numpy as np


In [None]:
# Uploading the dataset containing TED talks into the current Google Colab session

from google.colab import files

uploaded = files.upload()


In [None]:
# Getting the dataset

df = pd.read_csv('ted.csv')

df


In [None]:
# Let's take a single TED talk transcript and preprocess it!

single_talk = df['transcript'][0]

single_talk


### **Step 1: Normalization**

Note that normalization can mean much more than simply making all words lowercase (e.g. hashtag removal, HTML tag removal, etc.)

However, lowercasing is usually necessary so that words such as *Book* and *book* aren't treated as separate entities (unless you really need that distinction)
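As a minimal sketch of what normalization beyond lowercasing might look like, the snippet below uses Python's built-in `re` module to strip HTML tags and hashtags from a made-up string. The helper name `normalize` and the example text are hypothetical, and the regular expressions are deliberately simple rather than robust:

```python
import re

def normalize(text):
    """Hypothetical helper: lowercase text and strip simple HTML tags and hashtags."""
    text = text.lower()                       # 'Book' and 'book' become the same string
    text = re.sub(r'<[^>]+>', ' ', text)      # drop anything that looks like an HTML tag
    text = re.sub(r'#\w+', ' ', text)         # drop hashtags such as #nlp
    text = re.sub(r'\s+', ' ', text).strip()  # collapse the leftover whitespace
    return text

raw = '<p>I love this Book!</p> #reading #NLP'  # made-up example string

print(normalize(raw))  # i love this book!
```

Which of these steps you actually need depends on where your text comes from: transcripts rarely contain hashtags, while tweets or scraped web pages often do.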

In [None]:
single_talk_upd = single_talk.lower()

single_talk_upd


### **Step 2: Tokenization**

[Token](https://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html#:~:text=A%20token%20is%20an%20instance,containing%20the%20same%20character%20sequence.) - *is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing*

Put simply, before building a model we need to chop our document(s) into comparable pieces (you can think of them simply as words at this point). Punctuation is usually removed at some point, unless it carries some specific meaning for the research

Tokenization is a more subtle and precise way of splitting text into words than the `.split()` method we used previously
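To see why splitting on whitespace is too blunt, here is a small sketch that compares `.split()` with a simple regular-expression tokenizer. Note that the regex is only a simplified stand-in written for this illustration, not spaCy's actual tokenization rules, and the example sentence is made up:

```python
import re

sentence = "we're going to talk (briefly) about books."  # made-up example

# Naive approach: split on whitespace only, so punctuation stays glued to words
split_tokens = sentence.split()

# Simplified stand-in for a real tokenizer (NOT spaCy's actual rules):
# match a contraction suffix like 're, or a run of letters/digits, or a single punctuation mark
pattern = r"'\w+|\w+|[^\w\s]"
regex_tokens = re.findall(pattern, sentence)

print(split_tokens)  # ["we're", 'going', 'to', 'talk', '(briefly)', 'about', 'books.']
print(regex_tokens)  # ['we', "'re", 'going', 'to', 'talk', '(', 'briefly', ')', 'about', 'books', '.']
```

With `.split()`, tokens such as `(briefly)` and `books.` would be counted as different words from `briefly` and `books`, which is exactly what a proper tokenizer avoids.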

<figure>
<left>
<img src=https://upload.wikimedia.org/wikipedia/commons/thumb/8/88/SpaCy_logo.svg/1200px-SpaCy_logo.svg.png  width="500">
</figure>


In [None]:
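# We will use the spacy library for this and other tasks

import spacy

# Load spacy's small English pipeline
# (the 'en' shortcut no longer works in spacy 3; if the model is missing, run: python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')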


In [None]:
# Now let's take our single_talk_upd object,
# process it through the spacy's English module,
# and tokenize it!

doc = nlp(single_talk_upd)

tokens = [e.text for e in doc]

tokens


In [None]:
# [e.text for e in doc] - this is called 'list comprehension'
# List comprehension is a shortened way to write a for loop that builds a list

# Here is how you can obtain the same result by using a full for loop

tokens_2 = []

for e in doc: # for each token in doc..

  token = e.text # ..get its text via spacy's .text attribute..

  tokens_2.append(token) # ..and append it to the tokens_2 list


In [None]:
# tokens and tokens_2 are identical

tokens == tokens_2 # True


Note some of the features of tokenization:

+ each punctuation symbol is a separate token e.g. `)`, `.`, `?`
+ contractions are split into distinct tokens e.g. `we're` -> `we` and `'re`

### **Step 3: Stop words removal**

**Stop words** are very common words (e.g. *the*, *is*, *and*) that carry little meaning on their own and/or aren't of much use for the purposes of the analysis

<figure>
<left>
<img src=https://www.mediavine.com/wp-content/uploads/2020/04/stop-words-infographic-2.jpg.webp width="600">
</figure>

[Image source](https://www.mediavine.com/stop-words/)
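The idea of stop word removal can be sketched in plain Python before turning to a real library. The stop word set and the token list below are tiny, hand-picked examples used only for illustration:

```python
# A tiny hand-picked stop word set, for illustration only
# (real stop word lists contain hundreds of entries)
toy_stop_words = {'the', 'is', 'a', 'of', 'and', 'to', 'in'}

tokens = ['the', 'power', 'of', 'vulnerability', 'is', 'a', 'topic', 'in', 'psychology']

# Keep only the tokens that are not in the stop word set
content_tokens = [t for t in tokens if t not in toy_stop_words]

print(content_tokens)  # ['power', 'vulnerability', 'topic', 'psychology']
```

A set is used rather than a list because membership checks (`t not in toy_stop_words`) are fast on sets, which matters once you process long transcripts.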



In [None]:
# spacy has its own list of stop words and we are going to use it to clean up our string

from spacy.lang.en.stop_words import STOP_WORDS

stop = STOP_WORDS

stop

# Note that this list is tailored to deal with tokenized string
# as it contains such tokens as  "'ve", "'ll", etc.


In [None]:
# Let's create an updated list of tokens, in which all the stop words are excluded

tokens_upd = [e.text for e in doc if not e.is_stop]

tokens_upd


In [None]:
# Same as:

tokens_upd_2 = []

for e in doc: # for each token in doc..

  if not e.is_stop: # ..if it's not a stop word..

    token = e.text # ..get its text via spacy's .text attribute..

    tokens_upd_2.append(token) # ..and append it to the tokens_upd_2 list


In [None]:
# tokens_upd and tokens_upd_2 are identical

tokens_upd == tokens_upd_2


In [None]:
# Additionally, let's get rid of all the punctuation tokens
# (note that this filter also drops numbers and any other token that is not purely alphabetic)

tokens_upd = [e.text for e in doc if not e.is_stop and e.text.isalpha()]

tokens_upd

# .isalpha() returns True if the string is non-empty and all its characters are letters
# .isalnum() returns True if the string is non-empty and all its characters are letters or digits

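It is worth checking what the `.isalpha()` filter actually keeps, because it removes more than punctuation. The short sketch below runs `.isalpha()` and `.isalnum()` on a few made-up strings:

```python
examples = ['book', 'café', '2019', 'b2b', 'well-known', "'re", '.', '']

# Map each string to a pair: (result of .isalpha(), result of .isalnum())
results = {s: (s.isalpha(), s.isalnum()) for s in examples}

for s, (alpha, alnum) in results.items():
    print(repr(s), 'isalpha:', alpha, 'isalnum:', alnum)
```

Numbers (`2019`), mixed tokens (`b2b`), hyphenated words (`well-known`) and contraction pieces (`'re`) all fail `.isalpha()`, so they would be dropped together with the punctuation. If you need to keep numbers, `.isalnum()` might be a better fit.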

### **Step 4: Lemmatization**

Check out [this](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html) page to get more information on **lemmatization** and **stemming**

The goal of both **lemmatization** and **stemming** is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form

**Stemming** usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes

**Lemmatization** usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the *lemma*

<figure>
<left>
<img src=https://miro.medium.com/max/1400/1*ES5bt7IoInIq2YioQp2zcQ.png width="600">
</figure>

[Image source](https://medium.com/geekculture/introduction-to-stemming-and-lemmatization-nlp-3b7617d84e65)
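The difference between the two approaches can be illustrated with a toy sketch. Both functions below are hypothetical and heavily simplified: `toy_stem` is not a real stemming algorithm (such as Porter's), and `toy_lemma_table` is a hand-made lookup standing in for a real vocabulary:

```python
def toy_stem(word):
    """Crude suffix stripping, in the spirit of stemming (not a real stemmer)."""
    for suffix in ('ing', 'ies', 'es', 's', 'ed'):
        # only strip the suffix if a reasonably long stem remains
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

# Hand-made lookup table standing in for a real vocabulary (hypothetical entries)
toy_lemma_table = {'studies': 'study', 'was': 'be', 'better': 'good', 'running': 'run'}

def toy_lemmatize(word):
    """Dictionary lookup, in the spirit of lemmatization; unknown words are returned unchanged."""
    return toy_lemma_table.get(word, word)

for w in ['studies', 'was', 'better', 'running']:
    print(w, '-> stem:', toy_stem(w), '| lemma:', toy_lemmatize(w))
```

The stemmer produces fragments such as `stud` and `runn` and cannot handle irregular forms like `was` or `better`, whereas the lookup returns proper dictionary forms, at the cost of needing a vocabulary in the first place.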



In [None]:
# Let's lemmatize our string

lemmas = [e.lemma_ for e in doc if not e.is_stop and e.text.isalpha()]

lemmas


In [None]:
# A quick sanity check: since both lists were built with the same filter,
# the token list and the lemma list should have the same length

len(tokens_upd) == len(lemmas) # True


In [None]:
# Now let's quickly create a dataframe with two columns to see which lemmas have been derived from each token

tokens_lemmas = pd.DataFrame({'token' : tokens_upd,
                              'lemma' : lemmas})


In [None]:
# This will ensure that all rows of the dataframe will be shown 

pd.set_option('display.max_rows', None)


In [None]:
# Calling the dataframe

tokens_lemmas


### **Comparing the original and preprocessed transcripts**


In [None]:
# Roughly speaking, we preprocessed this original TED talk transcript:

single_talk


In [None]:
# Into this:

' '.join(lemmas)


In [None]:
# Let's quickly draw two word clouds to visually compare them!

from wordcloud import WordCloud, STOPWORDS

import matplotlib.pyplot as plt # data visualization library


In [None]:
# This one is for the original transcript

# Note that it's still a bit preprocessed via WordCloud arguments
# stopwords are excluded and plurals are normalized

wordcloud_original = WordCloud(background_color = 'white',
                     width = 2000, # width of canvas
                     height = 1000, # height of canvas
                     stopwords = STOPWORDS, # the built-in STOPWORDS list is used
                     collocations = False, # whether to include collocations (bigrams) of two words
                     normalize_plurals = True, # e.g. 'day' and 'days' will be counted as one
                     random_state = 1, # seed to get exactly same wordcloud every time you rerun script
                     colormap = 'seismic') # set the colormap

# Generate a wordcloud on a string object
wordcloud_original.generate(single_talk)


In [None]:
# And this one is for the preprocessed transcript

wordcloud_preprocessed = WordCloud(background_color = 'white',
                         width = 2000, # width of canvas
                         height = 1000, # height of canvas
                         stopwords = STOPWORDS, # the built-in STOPWORDS list is used
                         collocations = False, # whether to include collocations (bigrams) of two words
                         normalize_plurals = True, # e.g. 'day' and 'days' will be counted as one
                         random_state = 1, # seed to get exactly same wordcloud every time you rerun script
                         colormap = 'seismic') # set the colormap

# Generate a wordcloud on a string object
wordcloud_preprocessed.generate(' '.join(lemmas))


In [None]:
# Drawing both wordclouds on one plot

figure, (ax1, ax2) = plt.subplots(1, 2, figsize=(28, 23), sharex = 'all', sharey = 'all')

ax1.axis('off')
ax1.set_title('Original transcript', fontsize = 25)
ax1.imshow(wordcloud_original)

ax2.axis('off')
ax2.set_title('Preprocessed transcript', fontsize = 25)
ax2.imshow(wordcloud_preprocessed)

plt.show()


## **Wrapping up all the preprocessing steps into a single function**

Our preprocessing pipeline consists of **5 steps**:

1. Making text lowercase
2. Splitting it into tokens
3. Removing stop words
4. Removing punctuation
5. Lemmatizing tokens


In [None]:
# Defining a preprocessing function

def preprocess(string):

  # making text lowercase
  string_low = string.lower()

  # processing lowercase text through spacy's English module
  doc = nlp(string_low)

  # obtaining lemmas of the tokens that are neither stop words nor non-alphabetic (e.g. punctuation)
  lemmas = [e.lemma_ for e in doc if not e.is_stop and e.text.isalpha()]

  # returning lemmas
  return lemmas


In [None]:
# Let's try out this function on a subset of TED talks!

# Creating a subset of first 5 TED talks

ted_subset = df['transcript'].iloc[0:5] # selecting the column by name rather than by position

ted_subset


In [None]:
# Applying the preprocess function onto the subset of TED talks

ted_lemmas = ted_subset.apply(preprocess)


In [None]:
# What you get as an output is pandas Series (array), in which each value is a list of lemmas

type(ted_lemmas) # pandas.core.series.Series

ted_lemmas


In [None]:
# If you want to convert this Series into a list (to get a list of lists), use .tolist() method

ted_lemmas_list = ted_lemmas.tolist()


In [None]:
type(ted_lemmas_list) # list

len(ted_lemmas_list) # 5 sublists within a list

ted_lemmas_list[0] # first sublist, that is a list of lemmas for the first TED talk

ted_lemmas_list[0][0] # first element of the first sublist, that is the first lemma for the first TED talk

len(ted_lemmas_list[0]) # 609 elements in the first sublist, that is 609 lemmas in the first TED talk

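Once the lemmas are stored as a list of lists, a natural next step is to count them. The sketch below uses `collections.Counter` from the standard library on a made-up stand-in for `ted_lemmas_list`; the name `toy_lemmas_list` and its contents are invented for illustration:

```python
from collections import Counter

# Made-up stand-in for ted_lemmas_list: one list of lemmas per document
toy_lemmas_list = [['good', 'morning', 'good', 'idea'],
                   ['school', 'kill', 'creativity', 'school']]

first_doc = toy_lemmas_list[0]       # list of lemmas for the first document
first_lemma = toy_lemmas_list[0][0]  # first lemma of the first document

# Count lemma frequencies separately for each document
counts_per_doc = [Counter(doc_lemmas) for doc_lemmas in toy_lemmas_list]

# Flatten the list of lists to work with the lemmas of all documents at once
all_lemmas = [lemma for doc_lemmas in toy_lemmas_list for lemma in doc_lemmas]

print(counts_per_doc[0].most_common(1))  # [('good', 2)]
print(len(all_lemmas))                   # 8
```

The same pattern should work on the real `ted_lemmas_list`, since it has the same shape: a list in which every element is itself a list of strings.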

## **Exercise**

*Get the longest TED talk from the dataset and carefully apply all the preprocessing steps to it!*





In [None]:
# How to get the longest TED talk?

transcript_lengths = df['transcript'].str.len() # number of characters in each transcript

transcript_lengths.max() # the longest transcript contains 30429 characters

transcript_lengths.idxmax() # its index is 65!


In [None]:
# Save it as a separate object

longest_transcript = df['transcript'][65]

longest_transcript


In [None]:
# Apply your preprocessing steps here:



# **That's the end of Day 3 Part 2!**