<a href="https://colab.research.google.com/github/mkane968/Webscraping-Wikipedia-Tables/blob/main/Corpus_Analysis_with_SpaCy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Corpus Analysis with SpaCy
##Introduction

This tutorial describes how to conduct cleaning and text analysis on a corpus of texts using SpaCy. It will be of interest to researchers who want to curate texts for analysis and perform lemmatization, part-of-speech tagging, and other text enrichment tasks to help answer their research questions. 

###Why Use SpaCy for Text Analysis? 

SpaCy is an industrial-strength library for natural language processing. One of its primary uses is to retrieve a variety of linguistic annotations from a text or corpus (e.g. lemmas, part-of-speech tags, named entities), so it's valuable for researchers who want to know more about the grammatical structure of their corpora. 

While there are several Python libraries that can conduct similar text-mining tasks, SpaCy holds the following advantages: 
*   It's **fast and simple to set up and call the nlp pipeline**; no need to call a wide range of packages and functions for each individual task (SOURCE)
*   It **uses only the "latest and best" algorithms** for text-processing tasks, so it's easy to run and kept up-to-date by the developers (SOURCE)
*   It's proven to **perform better on text-splitting tasks** than NLTK, since it constructs syntactic trees for each sentence it is called on (SOURCE)

###Prerequisites

You will need access to the following materials and platforms to complete this tutorial: 
*   **Google Colaboratory**: A Google platform which allows you to run Python in a web browser. Access is free with a Google account. Get started with Colab here: https://colab.research.google.com/ 
*   **A corpus of plain text files** on which you wish to perform analysis. A sample corpus from Project Gutenberg can be accessed [here.](https://github.com/mkane968/Corpus-Analysis-with-SpaCy/tree/main/data)

Though no other Programming Historian tutorial specifically incorporates SpaCy analysis, this notebook builds on other text-mining guides available on the platform, including: 

*   [Corpus Analysis with Antconc (Froehlich, 2015)](https://programminghistorian.org/en/lessons/corpus-analysis-with-antconc#collocates-and-word-lists) - Provides a robust explanation of processing and exploring text corpora; while that tutorial focuses on keywords-in-context, SpaCy retrieves part-of-speech tags, named entities, and other linguistic annotations that can provide additional insight to researchers

*   [Analyzing Documents with TF-IDF (Lavin, 2019)](https://programminghistorian.org/en/lessons/analyzing-documents-with-tfidf#how-the-algorithm-works) - Describes the process of conducting TF-IDF analysis, which can be enriched with the lemmatization that spaCy enables


###Tutorial Goals
By the end of this tutorial, you will be able to: 
*   Upload a corpus of texts to Google Colab
*   Clean the corpus by lowercasing, removing stop words and removing punctuation 
*   Enrich the corpus through lemmatization, chunking,  part-of-speech tagging, and named entity recognition using SpaCy

###Table of Contents: 
1. Install Packages 
2. Load Text Files into DataFrame
3. Cleaning and Tokenization
4. Text Enrichment (Lemmatization, Part of Speech Tagging, Named Entity Recognition)

##1. Install Packages

Import spaCy and the related libraries and packages. It is common practice to place imports at the very top of a notebook or script rather than scattering them through the code, which makes the dependencies easy to find. All of the imports can be run in a single cell of code; the comments in the cell describe how each package or library will be used in the analysis. 


In [None]:
#Install spaCy if it is not already available in your environment
#!pip install spacy
#Import spaCy itself, necessary to use its features
import spacy
#Import the spaCy visualizer
from spacy import displacy
#Load the small English natural language processing pipeline
nlp = spacy.load("en_core_web_sm")

##2. Load Text Files into DataFrame

After all necessary packages have been installed, it is time to upload the texts for analysis. The key here is to read the texts into Google Colab in a form that can be processed in later steps. Run the following code to “mount” your Google Drive, which allows your Google Colab notebook to access any files on your Drive. A box will pop up asking for permission for the notebook to access your Drive files; click “Connect to Google Drive,” select the Google account you want to connect, and click “Allow.” 

In [None]:
#Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Next, upload the files for analysis to your Colab session. To complete this step, you must have the files of interest saved in a folder on your local machine. Once you run this line of code, a “Choose Files” button will appear; click it and a file explorer box will pop up. From here, navigate to the folder where your files are stored, select the files of interest, and click “Open.” The files will then be uploaded to the notebook's temporary session storage rather than to your Google Drive; you will see the upload complete in the output of your cell and can access the files by clicking the file icon in the bar on the left-hand side of the notebook.


In [None]:
#Select multiple files to upload from local folder
from google.colab import files

uploaded_files = files.upload()


Now we have files upon which we can perform analysis. To check what form of data we are working with, use the type() function. It should return that your files are contained in a dictionary, where keys are the file names and values are the content of each file. 


In [None]:
type(uploaded_files)
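
As a rough illustration of that structure, the sketch below builds a stand-in dictionary by hand and inspects it the same way you could inspect `uploaded_files`. The name `example_files`, the file names, and the contents are all invented for this example; they are not part of the sample corpus.

```python
# Hypothetical stand-in for the dictionary returned by files.upload():
# keys are file names (str), values are raw file contents (bytes).
example_files = {
    "novel_one.txt": b"\xef\xbb\xbfIt was a dark night.\r\n",
    "novel_two.txt": b"Call me Ishmael.\r\n",
}

print(type(example_files))
for name, content in example_files.items():
    # Each value is a bytes object, not yet a Python string
    print(name, type(content).__name__, len(content))
```

Because the values are bytes rather than strings, they need to be decoded before any text processing, which is handled in the cleaning section.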

Next, we’ll make the data easier to manage by inserting it into a Pandas dataframe. This will organize the texts into a table of rows and columns; in this case, the first column will contain the names of the files, and the second column will contain the content of each file. Since the files are currently stored in a dictionary, use the DataFrame.from_dict() method to load them into a new dataframe.


In [None]:
#Add files into dataframe
import pandas as pd

df = pd.DataFrame.from_dict(uploaded_files, orient='index')
df.head()

From here, you can reset the index (the row labels at the far left of the dataframe, which currently hold the file names). This moves the file names into a regular column and will make data wrangling easier later.  

In [None]:
#Reset index and add column names to make wrangling easier
df = df.reset_index()
df.columns = ["Title", "Text"]
df
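
The two steps above can be tried on a small scale before running them on a full corpus. This is a minimal sketch under stated assumptions: `example_files` and `example_df` are hypothetical names, and the two short byte strings stand in for real uploaded files.

```python
import pandas as pd

# Hypothetical stand-in for the uploaded_files dictionary (file name -> bytes)
example_files = {
    "novel_one.txt": b"It was a dark night.",
    "novel_two.txt": b"Call me Ishmael.",
}

# orient='index' turns each dictionary key into a row label
example_df = pd.DataFrame.from_dict(example_files, orient="index")

# reset_index() moves the row labels into an ordinary column
example_df = example_df.reset_index()
example_df.columns = ["Title", "Text"]
print(example_df)
```

After the reset, the file names can be selected like any other column, for example with `example_df["Title"]`.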

The texts in your resulting dataframe are now ready for cleaning and analysis. 


##3. Cleaning and Tokenization

From a quick scan of the dataframe, it is evident that some preliminary cleaning is required. First use the .decode() method to convert each text from raw bytes into a string and to remove the UTF-8 byte order mark (b'\xef\xbb\xbf') embedded at the start of some files. It is also important to remove newline characters (\n, \r) through a simple string replacement line. These are NOT functions of spaCy but are necessary to make the texts usable for further cleaning and tokenization. 


In [None]:
#Decode bytes to strings; 'utf-8-sig' also drops a leading byte order mark (b'\xef\xbb\xbf')
df['Text'] = df['Text'].apply(lambda x: x.decode('utf-8-sig', errors='ignore'))
df.head()

#Replace newline characters (\n, \r) and any other run of whitespace with a single space
df['Text'] = df['Text'].str.replace(r'\s+', ' ', regex=True).str.strip()
df.head()
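
To see what these two cleaning steps do to a single file, the following sketch applies them to one invented byte string (the variable names and the sample content are assumptions made for illustration). It also shows a detail that is easy to miss: decoding with plain `'utf-8'` keeps the byte order mark as an invisible character, while Python's `'utf-8-sig'` codec drops it.

```python
import re

# Hypothetical raw file content: UTF-8 byte order mark, then text with line breaks
raw = b"\xef\xbb\xbfChapter 1\r\n\r\nIt was a dark night.\r\n"

# Plain 'utf-8' keeps the byte order mark as the character '\ufeff'
with_bom = raw.decode("utf-8")
# 'utf-8-sig' drops a leading byte order mark while decoding
without_bom = raw.decode("utf-8-sig")

# Collapse every run of whitespace (spaces, \r, \n, tabs) into one space
cleaned = re.sub(r"\s+", " ", without_bom).strip()

print(repr(with_bom[:10]))
print(repr(cleaned))
```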

The next basic operation is to lowercase all tokens in the texts. This prevents miscounts in later case-sensitive analysis; for example, if lowercasing is not performed, “House” and “house” may be counted as two different words. 

In [None]:
#Lowercase all words
df['Text'] = df['Text'].str.lower()

df.head()
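
The effect of lowercasing on word counts can be checked with a short, self-contained sketch. The sentence below is invented for the example, and simple whitespace splitting is used here only as a stand-in for real tokenization.

```python
from collections import Counter

words = "The House on the hill was the only house in sight".split()

# Case-sensitive counts treat "House" and "house" as different words
print(Counter(words)["house"])

# Lowercasing first merges them into a single entry
lowered = [w.lower() for w in words]
print(Counter(lowered)["house"])
```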

Tokenization is the process used to split up full text into smaller parts for analysis. SpaCy has a built-in function for tokenization that involves segmenting texts into individual parts like words and punctuation. Take the example of an individual sentence: 

In [None]:
doc = nlp("This is 'an' example? sentence")
for token in doc:
    print(token.text)

This code calls the nlp pipeline, which contains the data and components needed for text processing. When the nlp pipeline is called on a sentence, it first splits that sentence on whitespace and then reviews each resulting piece. The pieces are split further based on rules for words, punctuation, prefixes, suffixes, etc. The resulting tokens are stored in a new object that we’ve called “doc.” Calling nlp also enables part-of-speech tagging, lemmatization, and other enrichment procedures we’ll discuss further below. 
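
To make the splitting rules concrete, here is a minimal sketch that runs only spaCy's tokenizer on the same sentence and collects the tokens in a list. It assumes `spacy.blank("en")`, a blank English pipeline with no trained components; the name `blank_nlp` is chosen here for illustration and is not used elsewhere in the tutorial.

```python
import spacy

# A blank English pipeline contains only the tokenizer (no trained model needed)
blank_nlp = spacy.blank("en")
doc = blank_nlp("This is 'an' example? sentence")

# Punctuation such as the question mark is split off as its own token
tokens = [token.text for token in doc]
print(tokens)
```

Whitespace alone would have produced `example?` as one piece; the tokenizer's suffix rules are what separate the question mark from the word.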

Since we are working with multiple long texts, we are going to use nlp.pipe, which processes batches of texts as doc objects. Here we’ll tokenize each text in our dataframe, append each set of tokens to a list, and add the new token lists to a new column in the dataframe.


In [None]:
#Tokenize with spaCy

#Create list for tokens
token_list = []

# Disable POS, Dependency Parser, and NER since all we want is tokenizer 
with nlp.disable_pipes('tagger', 'parser', 'ner'):
    #Iterate through each doc object (each text in dataframe) and tokenize, append tokens to list
    for doc in nlp.pipe(df.Text.astype('unicode').values, batch_size=100):
        word_list = []
        for token in doc:
            word_list.append(token.text)

        token_list.append(word_list)
        
#Make token list a new column in dataframe
df['token_list'] = token_list

#Check token list
df.head()

When tokenizing texts, you can also exclude stopwords. Stopwords are words which may hold little significance to text analysis, such as very common words like “the” or “and.” SpaCy has a built-in dictionary of stopwords which you can access. You can also add or remove your own stopwords, as shown below:

In [None]:
#Adding and removing stopwords to default list
#See list of default stopwords
print(nlp.Defaults.stop_words)

#Remove a  stopword
nlp.Defaults.stop_words.remove("becomes")

#Add stopword
nlp.Defaults.stop_words.add("book")

#Check updated list of default stopwords
print(nlp.Defaults.stop_words)

To tokenize texts without stopwords, follow the same process as above using nlp.pipe, but only append tokens that are NOT included in the stopwords list, and add the results to a new column in the dataframe. 


In [None]:
#Remove all stopwords and append remaining tokens to new df column
token_list_nostops = []

# Disable POS, Dependency Parser, and NER since all we want is tokenizer 
with nlp.disable_pipes('tagger', 'parser', 'ner'):
    #Iterate through each doc object (each text in dataframe) and tokenize, append tokens to list
    for doc in nlp.pipe(df.Text.astype('unicode').values, batch_size=100):
        nostops_word_list = []
        for token in doc:
            if token.text not in nlp.Defaults.stop_words:
                nostops_word_list.append(token.text)

        token_list_nostops.append(nostops_word_list)

#Make token list a new column in dataframe
df['token_list_nostops'] = token_list_nostops

#Check list of tokens without stopwords
df.head()

Depending on the goals of your analysis, you may want to remove or keep stopwords. One case where stopword removal may be useful is when comparing document similarity. SpaCy calculates document similarity by comparing word vectors averaged over each document; since stopwords appear throughout all texts, they can raise similarity scores between documents even when their content is very different. Observe the difference here: 


In [None]:
#Stopwords Test Case - Word Vector Similarity
#Optional: load a larger pipeline that includes word vectors (recommended for similarity)
#!python -m spacy download en_core_web_md
#nlp = spacy.load("en_core_web_md")

#Join each token list back into a string so it can be passed to nlp
#Compare similarity between two documents with stopwords
doc1 = nlp(' '.join(df['token_list'][0]))
doc2 = nlp(' '.join(df['token_list'][2]))
print(f"The similarity between {df['Title'][0]} and {df['Title'][2]} with stopwords is {doc1.similarity(doc2)}")

#Compare similarity between the same two documents without stopwords
doc1 = nlp(' '.join(df['token_list_nostops'][0]))
doc2 = nlp(' '.join(df['token_list_nostops'][2]))
print(f"The similarity between {df['Title'][0]} and {df['Title'][2]} without stopwords is {doc1.similarity(doc2)}")
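
The reasoning behind this comparison can also be reproduced with toy numbers. The sketch below is not spaCy's implementation: the two-dimensional "word vectors" are invented, and the helper names `doc_vector` and `cosine` are hypothetical. It only mirrors the general idea of averaging token vectors into a document vector and comparing documents by cosine similarity.

```python
import math

# Hypothetical 2-dimensional "word vectors", invented purely for illustration
vectors = {
    "the": (1.0, 1.0), "and": (1.0, 1.0),
    "whale": (1.0, 0.0), "garden": (0.0, 1.0),
}

def doc_vector(tokens):
    """Average the token vectors to get one vector per document."""
    dims = zip(*(vectors[t] for t in tokens))
    return [sum(d) / len(tokens) for d in dims]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

doc_a = ["the", "and", "the", "whale"]
doc_b = ["the", "and", "the", "garden"]
stops = {"the", "and"}

with_stops = cosine(doc_vector(doc_a), doc_vector(doc_b))
without_stops = cosine(
    doc_vector([t for t in doc_a if t not in stops]),
    doc_vector([t for t in doc_b if t not in stops]),
)
print(with_stops, without_stops)
```

In this toy setup the shared stopwords dominate both averages, so the two documents look nearly identical until the stopwords are removed and only the distinctive words remain.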


Stopword removal is also useful for topic modeling and classification tasks, where finding general themes across documents is the goal. However, other types of analysis, like sentiment analysis, are sensitive to individual words, and removing stopwords can change sentence meaning (e.g. removing “not” in the sentence “I was not happy”). When possible, it is recommended to run the analysis with and without stopwords and see how the results are affected. For the rest of this tutorial, we will be using the corpus without stopwords, but you are welcome to replicate the analysis with them. 
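
The “I was not happy” case can be checked directly against spaCy's default English stopword list. This is a small sketch under stated assumptions: it imports `STOP_WORDS` from `spacy.lang.en.stop_words`, uses an already lowercased sentence, and splits on whitespace as a simplification instead of running the full tokenizer.

```python
from spacy.lang.en.stop_words import STOP_WORDS

# Lowercased example sentence, split on whitespace for simplicity
sentence = "i was not happy".split()

# Keep only the words that are not in the default stopword list
kept = [word for word in sentence if word not in STOP_WORDS]
print(kept)
```

With the negation filtered out, a sentiment model would see only the positive word, which is why stopword removal deserves a second look for this kind of analysis.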


In [None]:
#Make token_list into a string again for enrichment
df['Tokens'] = df['token_list'].str.join(' ')
df

#Make stoptoken_list a string again for enrichment
df['Stop_Tokens'] = df['token_list_nostops'].str.join(' ')
df

## 4. Text Enrichment 

In [None]:
#Create new dataframe for text enrichment
new_df = df[['Tokens', 'Stop_Tokens']].copy()

SpaCy enables several types of text enrichment. We’ll start with lemmatization, which reduces each word to its dictionary form, or lemma (e.g. “brighten” for “brightening”). Lemmatization is one of the steps that runs when the nlp pipeline is called; repeat the same process as above to iterate through each document in the dataframe, this time appending all lemmas to a new column. 


In [None]:
#Get lemmas
lemma_list = []

# Disable Dependency Parser, and NER since all we want is lemmatization 
with nlp.disable_pipes('parser', 'ner'):
  #Iterate through each doc object and tag lemma, append lemma to list
  for doc in nlp.pipe(new_df.Stop_Tokens.astype('unicode').values, batch_size=100):
    word_list = []
    for token in doc:
        word_list.append(token.lemma_)
        
    lemma_list.append(word_list)

#Make lemma list a new column in dataframe
new_df['lemma_list'] = lemma_list

#Check lemmas
new_df.head()
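
Once lemmas are stored as lists, one simple follow-up is to count them. The sketch below uses a short, invented lemma list (`example_lemmas` is a hypothetical name) in the same shape as one cell of the `lemma_list` column, and counts it with Python's standard `Counter`.

```python
from collections import Counter

# Hypothetical lemma list for one document, in the shape stored in new_df['lemma_list']
example_lemmas = ["whale", "swim", "sea", "whale", "ship", "sail", "sea", "whale"]

# Count how often each lemma occurs and list the most frequent ones
lemma_counts = Counter(example_lemmas)
print(lemma_counts.most_common(2))
```

Because inflected forms such as “whales” and “whale” share a lemma, counts like these group together word forms that a plain token count would keep apart.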

The nlp pipeline also enables the tagging of each word according to its part of speech. This code will append all parts of speech to a new dataframe column. 


In [None]:
#Get part of speech tags
pos_list = []

# Disable Dependency Parser, and NER since all we want is POS 
with nlp.disable_pipes('parser', 'ner'):
  #Iterate through each doc object and tag POS, append POS to list
  for doc in nlp.pipe(df.Text.astype('unicode').values, batch_size=100):
    word_list = []
    for token in doc:
        word_list.append(token.pos_)
        
    pos_list.append(word_list)

#Make pos list a new column in dataframe
df['pos_list'] = pos_list

#Check pos tags
df.head()

Check out the dictionary of SpaCy POS tags [here.](https://machinelearningknowledge.ai/tutorial-on-spacy-part-of-speech-pos-tagging/#:~:text=Spacy%20POS%20Tags%20List,-Every%20token%20is%20assigned%20a) 
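
Tag descriptions can also be looked up from inside the notebook. The sketch below uses `spacy.explain()`, which returns a short description for a tag or label; the four tags chosen here are only a sample.

```python
import spacy

# spacy.explain() looks up a short description for a tag or label
for tag in ["NOUN", "PROPN", "ADJ", "VERB"]:
    print(tag, "=", spacy.explain(tag))
```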

Closely related to POS tagging is dependency parsing, wherein SpaCy identifies how different segments of a text are related to each other. Once the grammatical structure of each sentence is identified, visualizations can be created to show the connections between different words. Since we are working with large texts, our code will break each text down into sentences (spans) and then create a dependency visualization for each span.


In [None]:
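#Get dependency parsing for single doc
doc = nlp(df.Text[0])
#Preview the first 100 tokens instead of printing the whole text
print(doc[:100])

#Make each sentence a span to break up dependency visualizations;
#only the first five sentences are kept so the output stays readable
spans = list(doc.sents)[:5]

#Create dependency visualizations
displacy.render(spans, style="dep", jupyter=True)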

Finally, SpaCy can tag “named entities” in your text, such as names, dates, organizations, and locations. We’ll again call the nlp pipeline on each document in the corpus and append the named entities to a new column. 


In [None]:
#Get Named Entities
ent_list = []

with nlp.disable_pipes('tagger', 'parser'):
    for doc in nlp.pipe(df.Text.astype('unicode').values, batch_size=100):
        ent_list.append(doc.ents)

df['ent_list'] = ent_list

#Check named entities
df.head()
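
For counting or exporting, it is often handier to keep each entity's text and label instead of the raw span objects. The sketch below shows one way to do that. To stay runnable without a trained model, it builds a tiny document by hand and assigns the entity spans manually, so the sentence, the entities, and the names `blank_nlp` and `ent_pairs` are assumptions for illustration and not spaCy predictions.

```python
import spacy
from spacy.tokens import Doc, Span

# Build a tiny Doc by hand so the example runs without a trained model;
# the entity spans below are assigned manually, not predicted by spaCy.
blank_nlp = spacy.blank("en")
doc = Doc(blank_nlp.vocab, words=["Jane", "Austen", "lived", "in", "Bath"])
doc.ents = [Span(doc, 0, 2, label="PERSON"), Span(doc, 4, 5, label="GPE")]

# Keep only the entity text and label, which are easier to count and export
ent_pairs = [(ent.text, ent.label_) for ent in doc.ents]
print(ent_pairs)
```

The same list comprehension could be applied to each `doc` inside the `nlp.pipe` loop above to store plain `(text, label)` pairs in the dataframe.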

SpaCy also allows you to visualize named entities within single texts, as follows: 

In [None]:
#Get named entities in a single document and visualize
doc = nlp(df.Text[0]) 

displacy.render(doc, style="ent", jupyter=True)

##Conclusions
Through this tutorial, we've gleaned more information about the grammatical makeup of a text corpus. Such information can be valuable to researchers who are seeking to understand differences between texts in their corpus - for example, *what types of named entities are most common across the corpus? How frequently are certain words used as nouns vs. verbs within individual texts and corpora, and what may this reveal about the content or themes of the texts themselves?* 

SpaCy is also a helpful tool to explore texts without fully-formed research questions in mind; exploring linguistic annotations like those mentioned above can propel further questions and text-mining pipelines, like the following: 
*   [Getting Started with Topic Modeling and Mallet (Graham, Weingart, and Milligan, 2012)](https://programminghistorian.org/en/lessons/topic-modeling-and-mallet#what-is-topic-modeling-and-for-whom-is-this-useful) - Describes the process of conducting topic modeling on a corpus; the SpaCy tutorial can serve as a preliminary step to clean and explore data to be used in topic modeling
*   [Sentiment Analysis for Exploratory Data Analysis (Saldaña, 2018)](https://programminghistorian.org/en/lessons/sentiment-analysis#calculate-sentiment-for-a-paragraph) - Describes how to conduct sentiment analysis using NLTK; the SpaCy tutorial provides alternative methods of pre-processing and exploration of entities that may become relevant in sentiment analysis 



#Official PH Proposal 

##About You
1. Megan Kane
2. megan.kane@temple.edu

## Lesson Metadata
3. Submission Language: English 
4. Proposed Lesson Title: Corpus Analysis with SpaCy
5. Lesson Abstract (3-4 sentences): This lesson will enable users to perform basic text analysis on a textual corpus using SpaCy in Google Colab. It will walk through the process of uploading a corpus to Google Colab, then describe tokenization and cleaning techniques such as stopword and punctuation removal. It will then walk through various text enrichment techniques using SpaCy, including lemmatization, part-of-speech tagging, dependency parsing, and named entity recognition.
6. Case Study Description (details about your historical example problem) ???
7. Learning Outcomes (between 2-3): 
    *   Prepare any text or corpus for text analysis of interest using Python in Google Colab
    *   Learn how and why to perform basic cleaning processes
    *   Learn how to perform data enrichment processes (POS tagging, chunking) and what they can add to analysis
8. Research Phase most relevant to your lesson (delete as appropriate) Acquire / Transform / Analyze 
9. Research Area most relevant to your lesson (delete as appropriate) python / data manipulation / distant reading 
10. Intended Submission Date: October/November 2022?
11. Lesson will use open technology and data at no cost to the reader: Yes
12. Any other comments for the editor: N/A
