<a href="https://colab.research.google.com/github/mkane968/Corpus-Analysis-with-SpaCy/blob/main/Corpus_Analysis_with_SpaCy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Corpus Analysis with SpaCy
## Introduction

In this tutorial, you will learn how to clean and analyze a corpus of texts using spaCy.

spaCy is a popular open-source library for natural language processing. It is particularly good at annotating linguistic data through part-of-speech tagging, chunking, and named entity recognition, and it can calculate document similarity through word embeddings. spaCy was built for production use and bundles these features into a single processing pipeline. ([spaCy 101](https://spacy.io/usage/spacy-101#pipelines))

By the end of this tutorial, you will be able to: 
*   Upload a corpus of 2 or more texts to Google Colab
*   Clean corpora by lowercasing, removing stop words and removing punctuation 
*   Enrich corpora with spaCy through lemmatization, part-of-speech tagging, dependency parsing, and named entity recognition

Table of Contents: 
1. Install Packages 
2. Load Text Files into DataFrame
3. Cleaning and Tokenization
4. Text Enrichment (Lemmatization, Part of Speech Tagging, Named Entity Recognition)

## 1. Install Packages

In [None]:
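#Install spaCy if it is not already available in your environment
#!pip install spacy
#Download the small English pipeline if spacy.load() below cannot find it
#!python -m spacy download en_core_web_sm

#Import spaCy and its visualizer
import spacy
from spacy import displacy

#Load the natural language processing pipeline
nlp = spacy.load("en_core_web_sm")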

## 2. Load Text Files into DataFrame



In [None]:
#Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
#Select multiple files to upload from local folder
from google.colab import files

uploaded_files = files.upload()


In [None]:
#Add files into dataframe
import pandas as pd

df = pd.DataFrame.from_dict(uploaded_files, orient='index')
df.head()

In [None]:
#Reset index and add column names to make wrangling easier
df = df.reset_index()
df.columns = ["Title", "Text"]
df
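
If you want to try the remaining steps outside Colab, where `files.upload()` is not available, a rough stand-in is a hand-written dictionary that maps filenames to raw bytes, which is the shape `files.upload()` returns. The sketch below repeats the DataFrame steps on such a dictionary. The names `fake_uploaded_files` and `demo_df`, along with the filenames and contents, are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the dictionary returned by files.upload():
# keys are filenames, values are the raw bytes of each file.
fake_uploaded_files = {
    "sample_a.txt": b"\xef\xbb\xbfFirst sample text.",
    "sample_b.txt": b"Second sample text.",
}

# Same steps as the cells above: one row per file, then move the
# filenames out of the index and name the two columns.
demo_df = pd.DataFrame.from_dict(fake_uploaded_files, orient="index")
demo_df = demo_df.reset_index()
demo_df.columns = ["Title", "Text"]

print(demo_df["Title"].tolist())
print(demo_df.shape)
```

At this point the `Text` column still holds bytes, not strings, which is why the next section starts by decoding it.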

## 3. Cleaning and Tokenization

In [None]:
#Decode the raw bytes in the Text column into strings
#('utf-8-sig' also drops a leading byte order mark, b'\xef\xbb\xbf')
df['Text'] = df['Text'].apply(lambda x: x.decode('utf-8-sig', errors='ignore'))

#Replace newlines, carriage returns, tabs, and repeated spaces with a single space
df['Text'] = df['Text'].str.replace(r'\s+', ' ', regex=True)
df.head()
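
To see what decoding and whitespace cleanup do to a single file, here is a small sketch that uses only the standard library. The sample bytes are invented. It compares plain `utf-8`, which keeps a leading byte order mark as the character `\ufeff`, with `utf-8-sig`, which is one way to drop it during decoding.

```python
import re

# Invented sample: a UTF-8 byte order mark followed by text with
# Windows-style line endings and a tab.
raw = b"\xef\xbb\xbfFirst line.\r\nSecond\tline.\n"

# Plain utf-8 keeps the byte order mark as the character U+FEFF.
with_bom = raw.decode("utf-8", errors="ignore")

# utf-8-sig strips a leading byte order mark while decoding.
without_bom = raw.decode("utf-8-sig", errors="ignore")

# Collapse every run of whitespace (spaces, tabs, \r, \n) into one space.
collapsed = re.sub(r"\s+", " ", without_bom).strip()

print(repr(with_bom[0]))
print(collapsed)
```

Because `\s+` already matches carriage returns and newlines, a single substitution is enough to flatten line breaks.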

In [None]:
#Lowercase all words
df['Text'] = df['Text'].str.lower()

#Delete punctuation outright, keeping periods, hyphens, and apostrophes for now
df['Text'] = df['Text'].str.replace(r'[^\w\-\.\'\s]+', '', regex=True)

#Replace the remaining punctuation (periods) with a space to prevent incorrect compounds
df['Text'] = df['Text'].str.replace(r'[^\w\-\'\s]+', ' ', regex=True)
df.head()
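
The reason for handling periods separately is easiest to see on one string. The sketch below applies the same two patterns to an invented sentence with the standard `re` module, then shows what would happen if periods were deleted rather than replaced with a space. The variable names are made up for this example.

```python
import re

# Invented sample in which a period has no space after it.
sample = "Hello, world! It's a well-known fact.Next sentence."
lowered = sample.lower()

# Step 1: delete punctuation except periods, hyphens, and apostrophes.
step1 = re.sub(r"[^\w\-\.\'\s]+", "", lowered)

# Step 2: replace what is left (periods) with a space.
step2 = re.sub(r"[^\w\-\'\s]+", " ", step1)

# For comparison: deleting periods outright fuses neighbouring words.
fused = re.sub(r"[^\w\-\'\s]+", "", step1)

print(step1)
print(step2)
print(fused)
```

In the last line, "fact" and "next" merge into "factnext", which is the incorrect compound the second step avoids. Note that step 2 can leave a trailing or doubled space, which the spaCy tokenizer treats as whitespace.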

In [None]:
#Tokenize with spaCy

#Create list for tokens
token_list = []

#Disable the tagger, dependency parser, and NER since all we want is the tokenizer
#(select_pipes is the spaCy v3 name for the deprecated disable_pipes)
with nlp.select_pipes(disable=['tagger', 'parser', 'ner']):
    #Iterate through each doc object (one per text in the dataframe) and collect its tokens
    for doc in nlp.pipe(df['Text'].astype(str), batch_size=100):
        token_list.append([token.text for token in doc])

#Make token list a new column in dataframe
df['token_list'] = token_list

#Check token list
df.head()

In [None]:
#Add or remove stopwords from the default list
#See list of default stopwords
print(nlp.Defaults.stop_words)

#Remove a stopword
#nlp.Defaults.stop_words.remove("becomes")

#Add a stopword
#nlp.Defaults.stop_words.add("my_new_stopword")

#Check updated list of default stopwords
print(nlp.Defaults.stop_words)

In [None]:
#Remove all stopwords and append remaining tokens to new df column
token_list_nostops = []

#Disable the tagger, dependency parser, and NER since all we want is the tokenizer
with nlp.select_pipes(disable=['tagger', 'parser', 'ner']):
    #Iterate through each doc object and keep only the tokens that are not stopwords
    for doc in nlp.pipe(df['Text'].astype(str), batch_size=100):
        nostops_word_list = [token.text for token in doc
                             if token.text not in nlp.Defaults.stop_words]
        token_list_nostops.append(nostops_word_list)

#Make the stopword-free token list a new column in dataframe
df['token_list_nostops'] = token_list_nostops

#Check list of tokens without stopwords
df.head()

In [None]:
#Join the stopword-free tokens back into a single string
df['Stop_Tokens'] = df['token_list_nostops'].str.join(' ')
df

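Whether to keep stopwords depends on the analysis. Frequency-based methods such as word counts usually benefit from removing them, because otherwise function words dominate the results. The enrichment steps in the next section depend on the words surrounding each token, so they run on the `Text` column, which still contains stopwords, rather than on `Stop_Tokens`.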
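
One way to see the effect of stopword removal is to compare the most frequent tokens before and after filtering. The sketch below does this with `collections.Counter` on an invented sentence. The set `demo_stop_words` is a tiny hand-picked list used only for illustration; in the notebook you would filter against `nlp.Defaults.stop_words` instead.

```python
from collections import Counter

# Tiny hand-picked stop word set, for illustration only.
demo_stop_words = {"the", "of", "and", "a", "in", "is"}

# Invented sample text, already lowercased and split on whitespace.
tokens = ("the history of the corpus and the structure of a corpus "
          "is the subject of the study").split()

# Keep only tokens that are not in the stop word set.
kept = [t for t in tokens if t not in demo_stop_words]

print(Counter(tokens).most_common(2))
print(Counter(kept).most_common(1))
```

With stopwords included, "the" and "of" top the list. Once they are filtered out, "corpus" becomes the most frequent token, which says more about what the text is about.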

## 4. Text Enrichment (Lemmatization, Part of Speech Tagging, Named Entity Recognition)

In [None]:
#Get lemmas
lemma_list = []

#Disable the dependency parser and NER since all we want is lemmas
#(the tagger stays enabled because the lemmatizer uses part-of-speech information)
with nlp.select_pipes(disable=['parser', 'ner']):
    #Iterate through each doc object and collect the lemma of every token
    for doc in nlp.pipe(df['Text'].astype(str), batch_size=100):
        lemma_list.append([token.lemma_ for token in doc])

#Make lemma list a new column in dataframe
df['lemma_list'] = lemma_list

#Check lemmas
df.head()

In [None]:
#Get part of speech tags
pos_list = []

#Disable the dependency parser and NER since all we want is POS
with nlp.select_pipes(disable=['parser', 'ner']):
    #Iterate through each doc object and collect the POS tag of every token
    for doc in nlp.pipe(df['Text'].astype(str), batch_size=100):
        pos_list.append([token.pos_ for token in doc])

#Make pos list a new column in dataframe
df['pos_list'] = pos_list

#Check pos tags
df.head()

In [None]:
#Get dependency parse for a single doc (the first text in the dataframe)
doc = nlp(df['Text'][0])
print(doc)

#Make each sentence a span to break up dependency visualizations
spans = list(doc.sents)

#Create dependency visualizations
displacy.render(spans, style="dep", jupyter=True)

In [None]:
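#Get named entities
ent_list = []

#Disable the tagger and dependency parser since all we want is NER
with nlp.select_pipes(disable=['tagger', 'parser']):
    #Iterate through each doc object and collect its named entity spans
    for doc in nlp.pipe(df['Text'].astype(str), batch_size=100):
        ent_list.append(doc.ents)

#Make entity list a new column in dataframe
df['ent_list'] = ent_list

#Check named entities
df.head()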

In [None]:
#Get named entities in a single document and visualize
doc = nlp(df.Text[0]) 
print(doc)

displacy.render(doc, style="ent", jupyter=True)