# Natural Language Processing in Python Lecture

June 2021

Pu Yan, Oxford Internet Institute, University of Oxford 

Email: <pu.yan@oii.ox.ac.uk> or <thuyanpu@gmail.com>

### Outline of the course

0. Introduction
    
1. Text cleaning and pre-processing

    1.1 Importing text data & summary of text 
    
    1.2 Removal of punctuations
    
    1.3 Tokenisation
    
    1.4 Case normalisation
    
    1.5 Stopwords removal
    
    1.6 Stemming and Lemmatisation
    
    1.7 Advanced: Part-of-Speech tagging; named entities in spaCy

2. Text analysis and visualisation

    2.1 Word counts (pre- and post-processing)
    
    2.2 Word frequency
    
    2.3 TF-IDF: table and visualisation
    
3. Beyond quantification? Toolkit for qualitative research:


The importance of text cleaning

![Garbage in, Garbage out!](img/cleaning.png)

### Useful resources

- Books:
    
    - Bengfort, B., Bilbro, R., & Ojeda, T. (2018). Applied text analysis with Python : Enabling language-aware data products with machine learning. Beijing: O'Reilly
    
    - Vanderplas, J. (2017). Python data science handbook : Essential tools for working with data. Beijing: O'Reilly
    
- Online resources:
    
    - A list of NLP courses around the world, curated by ACL (the Association for Computational Linguistics): https://aclweb.org/aclwiki/List_of_NLP/CL_courses 
    
    - An introduction to Natural Language Processing (NLP): https://port.sas.ac.uk/mod/book/view.php?id=583&chapterid=445 
    
    - Introduction to NLP & Data Science: https://www.youtube.com/watch?v=5BVebXXb2o4

## 0. Introduction

- Natural language processing (NLP) is becoming increasingly popular in social science

    - Digitalisation of content data: e.g. Google Books, digital newspaper databases (LexisNexis)
    
    - The prevalence of social media platforms: e.g. Twitter, YouTube
    
    - Large-scale data: A paradigm shift in social science research?
    
- Advantages and disadvantages of NLP compared to qualitative methods

    - Pros: Speed of analysis; Affordability (translation/transcription)
    
    - Cons: Accuracy (identifying sarcasm, jokes or irony); unstructured (compared to survey data)

### 0.1 Applications of NLP in social science and humanities

- Machine translation: "你好"<->“Hi”

- Speech recognition: "Hi, Siri"

- Sentiment analysis: understanding product reviews

- Information extraction: automatically generated abstract/keywords list

- Document classification: LDA topic modelling

### 0.2 Packages

Please ensure that you have installed the following libraries in your local environment: 

- pandas (for loading and handling the data)

- NLTK

- spaCy

- gensim

In [None]:
# Installing packages required for this course 
## !! DO NOT RUN this cell if you have already installed these packages 
## Use the following code to install the packages using pip (if you have already installed pip). 
## Or, you can choose to install the packages in Anaconda Navigator: https://docs.anaconda.com/anaconda/navigator/tutorials/pandas/ 

%pip install pandas
%pip install numpy
## scikit-learn and matplotlib are imported in section 2 (TF-IDF and word clouds)
%pip install scikit-learn matplotlib

%pip install --user -U nltk
%pip install -U pip setuptools wheel
%pip install -U spacy
%pip install --upgrade gensim
!python -m spacy download en_core_web_sm

%pip install pillow
%pip install wordcloud

## Run the following code if you see warning message when importing gensim
#%pip install python-Levenshtein

We need to import the packages before running the rest of the code.

In [None]:
# Loading packages required for this course
import pandas as pd
import nltk
import gensim
## download wordnet, stopwords  if you have not downloaded nltk's wordnet and stopwords data
nltk.download('wordnet') 
nltk.download('stopwords') 

#nltk.download() # You can use this line to start the NLTK Downloader and download all the data you need.

### 0.3 Dataset

Now, we will load the dataset (thanks to Justin Ho, who prepared the dataset!) from the "./data" folder.

The dataset includes social media posts from five parties during the 2021 Scottish Parliament election (Date range: 2021-02-06 ~ 2021-05-06):

Scottish National Party, Scottish Conservatives, Scottish Labour, Scottish Greens, Scottish Liberal Democrats


In [None]:
# Loading the dataset
df = pd.read_csv("data/scotelection2021.csv")

# There are 8 columns in the dataset
df.info()

In [None]:
# Let's have a glimpse of the dataset
df.head(5)

Often in crawled datasets, some rows do not contain the text data we want to analyse. We will make sure these rows are removed before the text analysis.

In [None]:
# Checking how many rows miss "text" information
len(df[df.text.isna()])

In [None]:
# Removing rows where "text" column is NA from the dataframe
# (~ negates a boolean mask element-wise, so we keep the rows where "text" is present)
df = df[~df.text.isna()]
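
As a minimal sketch of this filtering step, here is the same idea on a made-up three-row dataframe (the column values are invented for illustration and are not taken from the election dataset): `isna()` flags the rows without text, and the negated mask keeps the rest.

```python
import pandas as pd

# Toy dataframe with one missing post (values are made up for illustration)
toy = pd.DataFrame({"text": ["Vote on 6 May", None, "Both votes SNP"],
                    "likes": [10, 3, 25]})

# Boolean mask: True where the "text" cell is missing
missing = toy["text"].isna()
n_missing = int(missing.sum())

# Keep only rows with text; ~ is the element-wise NOT for a boolean Series
toy_clean = toy[~missing]
print(n_missing, len(toy_clean))
```

`toy["text"].notna()` would give the same mask without the negation.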

Which party gets the highest number of likes on average? Which party shares the highest number of posts on Facebook? 

Before analysing the text data, let's see the average likes each party received and total number of posts each party shared on Facebook during the data collection period.

In [None]:
# Grouping by party names and calculating the average number of likes and total number of social media posts for each party.
df.groupby(['snsname']).agg({'likes': 'mean', 
                             'text': 'count'}).reset_index()

It seems that **SNP** receives the highest number of likes on average, but **Scottish Green Party** is the most active party on Facebook, calculated by the total number of posts each party shared.

We will therefore focus on posts from SNP for the linguistic analysis.

In [None]:
# Creating a subset of text dataset for all SNP Facebook posts 
df_snp = df[(df["snsname"] == "Scottish National Party (SNP)")]
df_snp.info()

## 1. Text cleaning and pre-processing

### 1.1 Importing text data & summary of text 

We will start by creating summary statistics of the text data: 

- How many social media posts do we have in the dataset? 
- What is the length of each post? 
- What is the average length of the Facebook posts?

In [None]:
# We dropped the following columns from the snp dataset: snshandle, comments, shares, postlink
df_snp = df_snp.drop(columns=['snshandle', 'comments', 'shares', 'postlink'])

# We create a new column "word_count" holding the approximate number of words in each post
# Basically, we count the spaces and add one to get the number of words
df_snp['word_count'] = df_snp['text'].str.count(' ') + 1

# Here's the structure of the final dataframe
df_snp.info()
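
Counting spaces is a quick heuristic, and it can overcount when a post contains double or trailing spaces. The sketch below compares it with splitting on whitespace, using plain Python strings and a made-up post; the two function names are ours, for illustration only.

```python
def count_by_spaces(text):
    # Heuristic used above: number of spaces plus one
    return text.count(" ") + 1

def count_by_split(text):
    # str.split() with no argument splits on any run of whitespace
    return len(text.split())

tidy_post = "Both votes SNP"
messy_post = "Both  votes SNP "  # double space and trailing space (made-up example)

print(count_by_spaces(tidy_post), count_by_split(tidy_post))
print(count_by_spaces(messy_post), count_by_split(messy_post))
```

For a dataframe column, the split-based count could be written as `df_snp['text'].str.split().str.len()`.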

In [None]:
# Pandas display setting:
# We want to see longer sentences in the text column in jupyter notebook
# We reset the "max_colwidth" in the pandas display options to display all characters in the text column.
pd.set_option("display.max_colwidth", None)

# Let's see the five most popular social media posts by SNP (measured by Facebook likes) 
df_snp.sort_values(by='likes', ascending=False).head(5)


In [None]:
# Let's see the average number of words in the social media posts by SNP
print("There are {} social media posts from SNP in the 2021 election dataset\n".format(df_snp.word_count.count()))
print("The average number of words in SNP social media posts during 2021 election is: {}\n".format(df_snp.word_count.mean()))
print("Total number of words in SNP social media posts during 2021 election is: {}\n".format(df_snp.word_count.sum()))

In [None]:
# Let's visualise the histogram of word counts in SNP social media posts
df_snp.hist(column='word_count')

### 1.2 Removal of punctuations & 1.3 Tokenisation & 1.4 Case normalisation

In the text cleaning process, we also want to remove punctuation, non-text symbols, and hyperlinks from the text.

We also need to **tokenise** sentences into separate words and normalise capitalised words.

By using gensim's simple_preprocess function, we can finish tokenisation, punctuation removal and case normalisation in one line of code! (THANK YOU GENSIM!)

Transforming sentences into tokens: 

![tokenising sentences](img/tokenising.png)

In [None]:
# Before preprocessing, we first convert the text column into a list
data = df_snp.text.values.tolist()

# Here's what the raw text data looks like
data[5]

In [None]:
# Removal of hyper-link
import re
def link_removal(sentences):
    for sentence in sentences:
        yield(re.sub(r'http\S+', '', sentence))

data_nolink = list(link_removal(data))
data_nolink[5]
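
To see what the pattern `http\S+` does, here is a small sketch on a made-up sentence (the URL is a placeholder, not a link from the dataset). The pattern deletes "http" plus everything up to the next whitespace, so links written without "http" are left untouched.

```python
import re

def strip_links(text):
    # Delete "http" followed by any run of non-whitespace characters
    return re.sub(r"http\S+", "", text)

with_scheme = "Read our manifesto: https://example.org/manifesto today"
without_scheme = "see www.snp.org"

print(strip_links(with_scheme))     # the link is removed, leaving a double space behind
print(strip_links(without_scheme))  # unchanged: there is no "http" prefix to match
```

This is one reason why 'www' is later added to the stop words list.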

In [None]:
from gensim.utils import simple_preprocess

# We write a function to 1) tokenise sentences into words, 2) lowercase them, and 3) drop punctuation
def sent_to_words(sentences):
    for sentence in sentences:
        # simple_preprocess lowercases the text and keeps alphabetic tokens only, so punctuation is dropped;
        # deacc=True additionally strips accent marks (e.g. "é" becomes "e")
        yield simple_preprocess(str(sentence), deacc=True)

# Now, let's apply the tokenisation and removal of punctuation to all social media posts
# You will find that punctuation, emojis and hashtag symbols are all removed from the sentences.
data_words = list(sent_to_words(data_nolink))
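
To make it clearer what happens inside that one line, here is a rough standard-library approximation of the preprocessing. It is a sketch for illustration, not gensim's actual code: the function name `rough_preprocess` is ours, and the length limits of 2 and 15 characters are assumed to mirror gensim's defaults.

```python
import re
import unicodedata

def rough_preprocess(text, min_len=2, max_len=15):
    # Strip accent marks (roughly what deacc=True does)
    decomposed = unicodedata.normalize("NFD", text)
    no_accents = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Lowercase, then keep runs of letters only (digits and punctuation act as separators)
    tokens = re.findall(r"[^\W\d]+", no_accents.lower())
    # Discard very short and very long tokens
    return [t for t in tokens if min_len <= len(t) <= max_len]

print(rough_preprocess("It's time! Vote #BothVotesSNP on 6 May"))
print(rough_preprocess("Café"))
```

Note that the apostrophe splits "It's" into "it" and "s", and the one-letter token is then dropped by the length filter.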

In [None]:
# Let's see our pre-processing progress so far
example_df_1 = pd.DataFrame(
    {'raw text': data,
     'removal of punctuations, tokenised, normalised': data_words
    })
example_df_1.head(5)

### 1.5 Stop words removal

- **Stop words** are high-frequency words such as *the*, *to* and *also* that we sometimes want to filter out of a document before further processing. Stop words usually have little lexical content, and their presence in a text fails to distinguish it from other texts. (See details in this link: https://www.nltk.org/book/ch02.html)

- The NLTK library is one of the oldest and most commonly used Python libraries for Natural Language Processing. NLTK supports stop word removal, and you can find the list of stop words in the corpus module.

In [None]:
# Importing NLTK stop words
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

# You can extend the list of stop words using the code below:
# simply replace 'www' and 'https' with the word(s) you want to filter out from the corpus
stop_words.extend(['www', 'https'])

# We define a function to filter out words that appear in the stopwords list from NLTK English stopwords library
def remove_stopwords(texts):
    return [[word for word in doc if word not in stop_words] for doc in texts]

# We now apply the stopwords filtering functions on tokenised&normalised text
data_words_nostopwords = remove_stopwords(data_words)
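
The filtering logic can be sketched without NLTK, using a tiny hypothetical stop words list and two made-up tokenised posts. Storing the stop words in a set, rather than a list, keeps each membership test fast, which may matter for longer lists and larger corpora.

```python
# A tiny, hypothetical stop words list; NLTK's English list is much longer
toy_stop_words = {"the", "to", "and", "for", "is"}

def drop_stop_words(docs, stop_words):
    # Keep only the tokens that are not in the stop words set
    return [[word for word in doc if word not in stop_words] for doc in docs]

docs = [["vote", "for", "the", "snp"],
        ["scotland", "is", "ready", "to", "recover"]]
filtered = drop_stop_words(docs, toy_stop_words)
print(filtered)
```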

In [None]:
# Let's see which words are included in the NLTK stopwords list
print(stop_words, end = ' ')

In [None]:
# Let's see our pre-processing progress so far
example_df_2 = pd.DataFrame(
    {'raw text': data,
     'removal of punctuations, tokenised, normalised': data_words,
     'stop words removed': data_words_nostopwords
    })
example_df_2.head(5)

### 1.6 Stemming and Lemmatisation

The goal of both **stemming and lemmatization** is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. For instance:

    e.g. am, are, is --> be 

**Stemming** usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. 

**Lemmatization** usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma. 

    e.g. If confronted with the token saw, stemming might return just s, whereas lemmatization would attempt to return either see or saw depending on whether the use of the token was as a verb or a noun. 

In [None]:
# Python NLTK provides WordNet Lemmatizer that uses the WordNet Database to lookup lemmas of words.
import nltk
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

# We define a function to lookup lemmas of each word in the post
def lemmatise(texts):
    return [[wordnet_lemmatizer.lemmatize(word, pos="v") for word in doc] for doc in texts]

data_lemmatise = lemmatise(data_words_nostopwords)

*When to use what?*

- Stemming and Lemmatization both generate the root form of the inflected words; 

- The difference is that a stem might not be an actual word, whereas a lemma is an actual word in the language;

- Lemmatization relies on a vocabulary lookup (here the WordNet corpus) and morphological analysis to produce the lemma, which makes it slower than stemming
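
The contrast can be illustrated with a deliberately crude toy: a suffix-chopping "stemmer" and a lookup-table "lemmatiser". Both functions and the small lemma table are hypothetical and written only for this illustration; real tools such as NLTK's stemmers and WordNetLemmatizer are far more sophisticated.

```python
def crude_stem(word):
    # Chop off the first matching suffix, as long as at least 3 letters remain
    for suffix in ("ing", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# Hypothetical mini-vocabulary mapping inflected forms to their dictionary form
toy_lemmas = {"studies": "study", "running": "run", "am": "be", "are": "be", "is": "be"}

def toy_lemmatise(word):
    # Look the word up; fall back to the word itself if it is unknown
    return toy_lemmas.get(word, word)

for word in ["studies", "running", "are"]:
    print(word, crude_stem(word), toy_lemmatise(word))
```

The stems "studi" and "runn" are not English words, while the lemmas "study" and "run" are; the lookup also handles irregular forms such as "are", which no suffix rule can reach.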

In [None]:
# Let's see our pre-processing progress so far
example_df_3 = pd.DataFrame(
    {'raw text': data,
     'removal of punctuations, tokenised, normalised': data_words,
     'stop words removed': data_words_nostopwords,
     'lemmatised words': data_lemmatise
    })
example_df_3.head(5)

### 1.7 Advanced: Part-of-Speech tagging; named entities in spaCy
#### PoS tagging
In NLP pre-processing, we often need to focus on "meaningful" words such as nouns, verbs, and adjectives. 

spaCy uses *trained pipelines* and *statistical models* to predict Part-of-Speech tags in the document. 

This is a system that makes *predictions* about the part of speech of each word in the sentence. For example, a word following “the” in English is most likely a noun.

In [None]:
# Part-of-speech tag 
import spacy
from spacy import displacy

# Pipelines for other languages, with their fine-grained and coarse-grained part-of-speech tags, are listed at https://spacy.io/models
## Here we load the small English pipeline
nlp = spacy.load("en_core_web_sm") 

# Let's visualise the Part-of-Speech tags
pos_example = nlp(u'Can digital computers think? by Alan Turing, a computer scientist born in London')
displacy.render(pos_example, style = 'dep', jupyter = True, options = {'distance': 100})

#### Named entities
- A named entity is a “real-world object” that’s assigned a name – for example, a person (Alan Turing), a city (London), a product or a book title. 
- We can use spaCy to recognize various types of named entities in a document
- Because models are statistical and strongly depend on the examples they were trained on, this doesn’t always work perfectly and might need some fine-tuning later

In [None]:
# Named entities recognition
displacy.render(pos_example, style="ent", jupyter = True)

In [None]:
# How to include meaningful words ONLY in the corpus by using the POS tagging?
import spacy

# We need to use spacy for pos-tagging
nlp = spacy.load("en_core_web_sm")

# Create a pos-tagging processor, allowing only nouns, adjectives, verbs, and adverbs in the corpus
def postags_filter(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

# Now, keeping only noun, adj, vb, adv
data_postag = postags_filter(data_lemmatise, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])
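
The filtering step itself is simple once every token carries a tag. The sketch below uses hand-tagged tokens (made up for illustration; in the cell above the tags come from spaCy's `token.pos_`) and a hypothetical helper named `keep_content_words`.

```python
# Hand-tagged tokens for illustration only
tagged = [("the", "DET"), ("government", "NOUN"), ("will", "AUX"),
          ("quickly", "ADV"), ("deliver", "VERB"), ("free", "ADJ"), ("meals", "NOUN")]

def keep_content_words(tagged_tokens, allowed=("NOUN", "ADJ", "VERB", "ADV")):
    # Keep a word only if its part-of-speech tag is in the allowed set
    return [word for word, tag in tagged_tokens if tag in allowed]

content_words = keep_content_words(tagged)
print(content_words)
```

Determiners and auxiliaries are dropped, while nouns, verbs, adjectives and adverbs survive.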

In [None]:
# Let's see our pre-processing progress so far
example_df_4 = pd.DataFrame(
    {'raw text': data,
     'removal of punctuations, tokenised, normalised': data_words,
     'stop words removed': data_words_nostopwords,
     'lemmatised words': data_lemmatise,
     'pos tagged': data_postag
    })
example_df_4.head(5)

## 2. Text analysis and visualisation

### 2.1 Word counts (pre- and post-processing)

Let's now merge the processed text into the dataframe and generate a column for word counts after processing (text cleaning). We can then see how much textual information has been filtered out of the raw data.

In [None]:
# Create the corpus for TF-IDF analysis 
df_snp["text_cleaned"] = data_postag # Use the POS-filtered tokens for text analysis
df_snp["word_count_cleaned"] = df_snp['text_cleaned'].str.len()
df_snp.head(5)

### 2.2 Word frequency

- An important question when analysing text data is: How to measure the key information in the corpus? 

- The most straightforward way is to calculate **Term Frequency (TF)**, which is how frequently each word appears in the dataset

In [None]:
# We used NLTK's FreqDist to find the most common words in the corpora
from nltk.probability import FreqDist

# We merged all pre-processed social media content from SNP into one corpora
corpora = df_snp['text_cleaned'].sum()

# And displayed the top 20 most common terms in the corpora
fdist = FreqDist(corpora)
top_20 = fdist.most_common(20)
top_20
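
The same counting can be sketched with the standard library on three made-up tokenised posts. Flattening with `itertools.chain` is an alternative to summing lists, which can become slow on large corpora because each addition copies the growing list.

```python
from collections import Counter
from itertools import chain

# Three made-up tokenised posts
docs = [["scotland", "vote", "snp"],
        ["vote", "snp", "recovery"],
        ["snp", "independence"]]

# chain.from_iterable flattens the list of token lists in one pass
all_tokens = list(chain.from_iterable(docs))

# Counter maps each term to its number of occurrences
counts = Counter(all_tokens)
top_2 = counts.most_common(2)
print(top_2)
```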

### 2.3 TF-IDF: table and visualisation

- **TF-IDF**: with tf-idf, instead of representing a term in a document by its raw frequency (number of occurrences) or its relative frequency (term count divided by document length), each term is weighted by dividing the term frequency by the number of documents in the corpus containing the word. In practice, implementations such as scikit-learn's usually apply a logarithm to this inverse document frequency

- The overall effect of this weighting scheme is to avoid a common problem when conducting text analysis: the most frequently used words in a document are often the most frequently used words in all of the documents, e.g. the, a, not

- Terms with the highest tf-idf scores are the terms that are distinctively frequent in a document, when that document is compared with other documents.

More details on TF-IDF can be found here: https://programminghistorian.org/en/lessons/analyzing-documents-with-tfidf
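
As a worked sketch, the weights can be computed by hand on three made-up tokenised posts. We assume the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 and no normalisation, which is our understanding of what scikit-learn's TfidfVectorizer does with its default smoothing and `norm=None`; the variable names are ours.

```python
import math
from collections import Counter

docs = [["independence", "scotland"],
        ["scotland", "vaccine"],
        ["scotland", "recovery", "recovery"]]
n_docs = len(docs)

# Document frequency: in how many documents does each term occur?
df_counts = Counter(term for doc in docs for term in set(doc))

def idf(term):
    # Smoothed inverse document frequency (assumed formula, see the text above)
    return math.log((1 + n_docs) / (1 + df_counts[term])) + 1

# Unnormalised tf-idf: raw term count in the document times idf
scores = [{term: count * idf(term) for term, count in Counter(doc).items()}
          for doc in docs]

for doc_scores in scores:
    print(doc_scores)
```

"scotland" appears in every post and so gets the lowest possible weight, while "recovery", which is repeated within a single post, gets the highest score.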


In [None]:
# Text processing steps like tokenization and removing punctuation will happen automatically when we use Scikit-Learn’s TfidfVectorizer 
# We therefore create a list of strings of the cleaned social media posts
corpora_strings = [" ".join(text) for text in df_snp['text_cleaned'].values]

#import the TfidfVectorizer from Scikit-Learn.
from sklearn.feature_extraction.text import TfidfVectorizer

# Our corpora already removes common English stopwords, so we set stop_words=None
vectorizer = TfidfVectorizer(stop_words=None, use_idf=True, norm=None) 

# The fit_transform() method converts the list of strings to something called a sparse matrix. 
# In this case, the matrix represents tf-idf values for all texts; toarray() turns it into a dense NumPy array.
transformed_documents = vectorizer.fit_transform(corpora_strings).toarray()

# We want to create a document-word matrix, where each document has the same number of values, one for each word in the corpus
# Note: get_feature_names_out() requires scikit-learn 1.0 or later; older versions use get_feature_names()
matrix_df = pd.DataFrame(transformed_documents, columns=vectorizer.get_feature_names_out())

# Let's now get the tf-idf values for all words
words_tfidf = matrix_df.sum(axis=0).sort_values(ascending=False)

In [None]:
words_tfidf[:20]

### 2.4 Visualisation of keywords using tf-idf

In the preliminary analysis of text data, we often want to show the results through visualisation. We can show the top ranked keywords using tf-idf scores and generate word clouds of keywords

In [None]:
import matplotlib.pyplot as plt
from wordcloud import WordCloud, ImageColorGenerator
## Creating a wordcloud using the 100 top-ranked words (measured by TF-IDF) for the SNP corpora
fig, ax = plt.subplots(figsize=(10,10))
wc = WordCloud(background_color = 'white',
              width=800,height=600,
              max_words=2000).fit_words(words_tfidf[:100])
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()

Want a more tailored visualisation for SNP? Sure!

In [None]:
import numpy as np
from PIL import Image
## Creating a masked wordcloud using SNP's logo
fig, ax = plt.subplots(figsize=(10,10))
custom_mask = np.array(Image.open("img/snp.png"))
wc_2 = WordCloud(background_color = 'white',
              width=800,height=600,
              mask = custom_mask,
              mode='RGBA',
              max_words=2000).fit_words(words_tfidf[:100])
image_colors = ImageColorGenerator(custom_mask)
wc_2.recolor(color_func = image_colors)
plt.imshow(wc_2, interpolation="bilinear")
plt.axis("off")
plt.show()

## 3. Beyond quantification? Toolkit for qualitative research:

- There are many ways to examine the **context** of a text apart from simply reading it. A concordance view shows us every occurrence of a given word, together with some context. 

- Here we look up the word *covid* in the text

In [None]:
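from nltk.text import Text
# Flatten the cleaned tokens of all posts into one list of words
corpora = df_snp['text_cleaned'].sum()
textList = Text(corpora)
# Show every occurrence of "covid" together with its surrounding words
textList.concordance('covid')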
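
The idea behind a concordance view can be sketched in a few lines: find each occurrence of the keyword and print it with a window of neighbouring tokens. This is a simplified illustration on made-up tokens, not NLTK's implementation; `toy_concordance` and its `window` parameter are our own names.

```python
def toy_concordance(tokens, keyword, window=2):
    # Collect each occurrence of keyword with `window` tokens of context on either side
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            lines.append(" ".join(left + [keyword] + right))
    return lines

tokens = ["protect", "nhs", "covid", "recovery", "plan",
          "support", "business", "covid", "grant"]
kwic_lines = toy_concordance(tokens, "covid")
for line in kwic_lines:
    print(line)
```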

Finally, let's do something creative by generating a social media post in a similar style to what we have seen in the SNP corpora.

In [None]:
textList.generate()

![ending](img/end.png)