<a href="https://colab.research.google.com/github/mkane968/Corpus_Analysis_with_NLTK_and_SpaCy/blob/main/Corpus_Analysis_with_NLTK_and_SpaCy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Corpus Analysis with NLTK and SpaCy
## Introduction

In this tutorial, you will learn how to clean and analyze a corpus of texts using the Natural Language Toolkit (NLTK) and SpaCy.

The Natural Language Toolkit (NLTK) is a Python platform dedicated to performing natural language processing, or the computational manipulation of language. Through the NLTK, a suite of libraries can be accessed for the purposes of data mining and analysis, including tokenization, stemming, tagging, parsing, and classification ([NLTK documentation](https://www.nltk.org/)).

SpaCy is another popular open-source tool for natural language processing. It's particularly good at annotating linguistic data through part-of-speech tagging, chunking, and named entity recognition, as well as calculating document similarity through word embeddings. Whereas NLTK was built for research purposes, SpaCy was built for production, and has an integrated catalogue of features that is smaller and more streamlined ([SpaCy 101](https://spacy.io/usage/spacy-101#pipelines)).

By the end of this tutorial, you will be able to: 
*   Upload a corpus of 2+ texts to Google Colab
*   Clean corpora by lowercasing, removing stop words and removing punctuation 
*   Enrich corpora through stemming, lemmatization, chunking,  part-of-speech tagging, and named entity recognition. 
*   Perform basic analysis on enriched text including frequency and collocation analysis, concordancing, and indexing 
*   Visualize results of text analysis through frequency and dispersion plots

Table of Contents: 
1. Install Packages
2. Load Text Files into DataFrame
3. Cleaning and Tokenization
4. Part of Speech Tagging and Parsing
5. Word Frequency and Context Analysis

## 1. Install Packages

In [None]:
#Imports the Natural Language Toolkit, which is necessary to install NLTK packages and libraries
#!pip install nltk
import nltk

#Installs libraries and packages to tokenize text
nltk.download('punkt')
from nltk.tokenize import sent_tokenize, word_tokenize

#Installs libraries and packages to clean text
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

#Installs libraries and packages to stem and lemmatize texts
from nltk.stem.snowball import SnowballStemmer # Snowball ("Porter 2") is a refinement of the original Porter stemmer
from nltk.stem import (PorterStemmer, LancasterStemmer)
nltk.download('wordnet')
from nltk import WordNetLemmatizer
nltk.download('omw-1.4')

#Installs libraries and packages to perform chunking, parsing and visualization
from nltk import pos_tag
nltk.download('averaged_perceptron_tagger')
nltk.download('tagsets')
!pip install svgling

#Imports spaCy itself, necessary to use features (note how few packages are needed for spaCy analysis vs. NLTK above)
#!pip install spaCy
import spacy
#Load the natural language processing pipeline
nlp = spacy.load("en_core_web_sm")
#Load spaCy visualizer
from spacy import displacy

## 2. Load Text Files into DataFrame



In [None]:
#Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
#Select multiple files to upload from local folder
from google.colab import files

uploaded_files = files.upload()


In [None]:
#Add files into dataframe
import pandas as pd

df = pd.DataFrame.from_dict(uploaded_files, orient='index')
df.head()

In [None]:
#Reset index and add column names to make wrangling easier
df = df.reset_index()
df.columns = ["Title", "Text"]
df
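
`files.upload()` returns a dictionary that maps each uploaded filename to that file's raw bytes, which is why the filenames first appear in the index. The sketch below imitates that dictionary with two made-up entries (the filenames, contents, and the names `sample_uploads` and `sample_df` are hypothetical) so the reshaping can be tried without uploading anything. It builds the `Title` and `Text` columns directly from the dictionary's items, which is one possible shortcut rather than the approach used in the cells above.

```python
import pandas as pd

# Hypothetical stand-in for the dictionary returned by files.upload():
# keys are filenames, values are the files' raw bytes.
sample_uploads = {
    "essay_one.txt": b"The first sample text.",
    "essay_two.txt": b"The second sample text.",
}

# Build one row per file, with the filename and the bytes as columns.
sample_df = pd.DataFrame(list(sample_uploads.items()), columns=["Title", "Text"])

print(sample_df.shape)
print(sample_df["Title"].tolist())
```

Because the `Text` column still holds bytes at this point, it has to be decoded before any string cleaning can be applied.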

## 3. Cleaning and Tokenization

In [None]:
#Basic cleaning with pandas string methods
#Decode the uploaded bytes to text; 'utf-8-sig' also strips a leading byte order mark (b'\xef\xbb\xbf')
df['Text'] = df['Text'].apply(lambda x: x.decode('utf-8-sig'))

#Replace any literal "\r" or "\n" sequences, then collapse all whitespace (including real newlines) into single spaces
df['Text'] = df['Text'].str.replace(r'\\r|\\n', ' ', regex=True)
df['Text'] = df['Text'].str.replace(r'\s+', ' ', regex=True)
df.head()
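
The byte order mark mentioned in the comment above behaves differently depending on the codec used to decode the bytes. The following is a minimal sketch, using made-up file contents, of how `utf-8` and `utf-8-sig` treat the mark and how a single whitespace pattern flattens line breaks. The variable names are illustrative only.

```python
import re

# Hypothetical file contents: a UTF-8 byte order mark, then text with Windows line endings.
raw_bytes = b"\xef\xbb\xbfFirst line\r\nSecond   line\n"

# Plain 'utf-8' keeps the byte order mark as the character '\ufeff' ...
with_bom = raw_bytes.decode("utf-8")
# ... while 'utf-8-sig' drops it during decoding.
without_bom = raw_bytes.decode("utf-8-sig")

# Collapse every run of whitespace (spaces, \r, \n, tabs) into one space.
cleaned = re.sub(r"\s+", " ", without_bom).strip()

print(repr(with_bom[0]))
print(cleaned)
```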

In [None]:
#Lowercase all words
df['Text'] = df['Text'].str.lower()

#Remove stopwords
stop_words = set(stopwords.words("english"))
df['no_stops'] = df['Text'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop_words)]))

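#Remove punctuation (anything that is not a word character or whitespace)
#The column name 'no_punct' is illustrative
df['no_punct'] = df['no_stops'].str.replace(r'[^\w\s]', '', regex=True)

df.head()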
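
One detail worth checking in this step is the order of operations: when a text is split on whitespace, punctuation stays attached to words, so a token such as `the,` will not match the stop word `the`. The sketch below shows one way to avoid that by tokenizing with NLTK's `RegexpTokenizer` first and filtering stop words afterwards. The stop list here is a tiny hand-written set used only for illustration, and the sample sentence is made up.

```python
from nltk.tokenize import RegexpTokenizer

# Tiny illustrative stop list; the tutorial itself uses stopwords.words("english").
sample_stop_words = {"the", "a", "and", "of"}

sample_text = "The whale, the sea, and a ship of fools!"

# \w+ keeps runs of letters, digits, and underscores, so punctuation never becomes a token.
tokenizer = RegexpTokenizer(r"\w+")
tokens = tokenizer.tokenize(sample_text.lower())

# Filter stop words after punctuation is gone, so "whale," is compared as "whale".
content_tokens = [tok for tok in tokens if tok not in sample_stop_words]

print(tokens)
print(content_tokens)
```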

In [None]:
#Tokenize with spaCy
token_list = []

# Disable POS, Dependency Parser, and NER since all we want is tokenizer
with nlp.disable_pipes('tagger', 'parser', 'ner'):
    #Iterate through each doc object (each text in dataframe) and tokenize, append tokens to list
    for doc in nlp.pipe(df['Text'].astype(str), batch_size=100):
        word_list = []
        for token in doc:
            word_list.append(token.text)

        token_list.append(word_list)
#Make token list a new column in dataframe
df['token_list'] = token_list

In [None]:
df.head()

## 4. Part of Speech Tagging and Parsing

In [None]:
#Get lemmas
lemma_list = []

#Disable only the parser and NER; the lemmatizer relies on the tagger's part-of-speech output
with nlp.disable_pipes('parser', 'ner'):
    for doc in nlp.pipe(df['Text'].astype(str), batch_size=100):
        word_list = []
        for token in doc:
            word_list.append(token.lemma_)

        lemma_list.append(word_list)

#Make lemma list a new column in dataframe
df['lemma_list'] = lemma_list

Compare with the stems produced by NLTK's Porter, Lancaster, and Snowball stemmers

In [None]:
#Define three stemming tools
porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer("english")

In [None]:
#Stem every token in the spaCy token lists with each of the three stemmers
df['Porter'] = df['token_list'].apply(lambda x: [porter.stem(y) for y in x])
df['Lancaster'] = df['token_list'].apply(lambda x: [lancaster.stem(y) for y in x])
df['Snowball'] = df['token_list'].apply(lambda x: [snowball.stem(y) for y in x])
df.head()
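
The three stemmers apply different rule sets, so they will not always agree on the same word. A minimal sketch for comparing them side by side on a handful of words is given below. The sample words and the name `stem_table` are hypothetical, and none of the three stemmers needs an `nltk.download` call.

```python
from nltk.stem import LancasterStemmer, PorterStemmer
from nltk.stem.snowball import SnowballStemmer

stemmers = {
    "Porter": PorterStemmer(),
    "Lancaster": LancasterStemmer(),
    "Snowball": SnowballStemmer("english"),
}

# Hypothetical sample words chosen to show where the algorithms may disagree.
sample_words = ["running", "generously", "studies", "maximum"]

# Map each word to the stem produced by every stemmer.
stem_table = {
    word: {name: stemmer.stem(word) for name, stemmer in stemmers.items()}
    for word in sample_words
}

for word, stems in stem_table.items():
    print(word, stems)
```

Reading the rows across makes it easier to judge which stemmer is too aggressive or too conservative for a given corpus before applying it to every text.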

Get POS tags with spaCy

In [None]:
%%time
#Part of Speech Tags
#A cell magic must be the first line of the cell; %%time runs the cell once and reports how long it took
pos_list = []

# Disable Dependency Parser and NER since all we want is POS
with nlp.disable_pipes('parser', 'ner'):
    #Iterate through each doc object and tag POS, append POS to list
    for doc in nlp.pipe(df['Text'].astype(str), batch_size=100):
        word_list = []
        for token in doc:
            word_list.append(token.pos_)

        pos_list.append(word_list)

#Make pos list a new column in dataframe
df['pos_list'] = pos_list

In [None]:
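#Define spaCy part of speech tags
#spacy.explain returns a short description of a tag, or None if the tag is not in spaCy's glossary
for tag in sorted({tag for tags in df['pos_list'] for tag in tags}):
    print(tag, spacy.explain(tag))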

Compare to NLTK pos tags

Get named entities with spaCy

In [None]:
#Get Named Entities
ent_list = []

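#Disable the tagger and parser since all we want is NER
with nlp.disable_pipes('tagger', 'parser'):
    for doc in nlp.pipe(df['Text'].astype(str), batch_size=100):
        #doc.ents is a tuple of Span objects, one per named entity
        ent_list.append(doc.ents)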

df['ent_list'] = ent_list

In [None]:
#Check dataframe
df.head()
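
`doc.ents` holds spaCy `Span` objects, each of which keeps a reference to its parent `Doc`. For inspection or export it can be convenient to store plain `(text, label)` tuples instead. The sketch below illustrates the idea under a loud assumption: to run without downloading a model, it swaps the statistical NER of `en_core_web_sm` for a blank pipeline with a rule-based `entity_ruler` and two hand-written patterns. The sample sentences and all variable names are hypothetical.

```python
import spacy
from collections import Counter

# Assumption: a blank pipeline with a rule-based entity_ruler stands in for
# en_core_web_sm's statistical NER so the example runs without a model download.
demo_nlp = spacy.blank("en")
ruler = demo_nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "ORG", "pattern": "Apple"},
    {"label": "GPE", "pattern": "London"},
])

sample_texts = [
    "Apple opened an office in London.",
    "London is far from Apple headquarters.",
]

# Store plain (text, label) tuples instead of Span objects.
entity_rows = [
    [(ent.text, ent.label_) for ent in doc.ents]
    for doc in demo_nlp.pipe(sample_texts)
]

# Count how often each label occurs across all sample texts.
label_counts = Counter(label for row in entity_rows for _, label in row)

print(entity_rows)
print(label_counts)
```

With the tutorial's own `nlp` object, the same list comprehension could be applied to `nlp.pipe(...)` over the `Text` column; the entities found would then come from the model rather than from hand-written patterns.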

## 5. Frequency and Context Analysis

In [None]:
#Create new dataframe for frequency analysis
Freqs = df[['Title', 'token_list', 'pos_list']].copy()

#Get the number of tokens in each text and append to dataframe
Freqs['Length'] = Freqs['token_list'].apply(len)
Freqs
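
Once each text is a list of tokens, frequency and collocation analysis can be run on that list directly. The following is a minimal sketch using NLTK's `FreqDist` and `BigramCollocationFinder` on a made-up token list; in the notebook, one row of a token column would take the place of `sample_tokens`. The minimum frequency of 2 is an arbitrary choice for this tiny example.

```python
from nltk import FreqDist
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Hypothetical token list standing in for one row of a token column.
sample_tokens = (
    "the white whale swam and the white whale dived and the ship sank".split()
)

# FreqDist counts how often each token occurs.
freq = FreqDist(sample_tokens)
top_three = freq.most_common(3)

# Find adjacent word pairs, keeping only pairs seen at least twice.
finder = BigramCollocationFinder.from_words(sample_tokens)
finder.apply_freq_filter(2)
top_bigrams = finder.nbest(BigramAssocMeasures.raw_freq, 5)

print(top_three)
print(top_bigrams)
```

Raw frequency is the simplest scoring function; `BigramAssocMeasures` also provides measures such as `pmi` that weight pairs by how unexpectedly often they co-occur.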

https://github.com/yuibi/spacy_tutorial/blob/master/02_intermediate_spacy.ipynb 

https://www.oreilly.com/library/view/blueprints-for-text/9781492074076/ch04.html 

https://spacy.io/usage/processing-pipelines 

In [None]:
token_list = []

# Disable POS, Dependency Parser, and NER since all we want is tokenizer
# Alternatively, you can use nlp.make_doc method, which skips all pipelines, if you just need a tokenizer.
with nlp.disable_pipes('tagger', 'parser', 'ner'):
    for doc in nlp.pipe(df['Text'].astype(str), batch_size=100):
        word_list = []
        for token in doc:
            word_list.append(token.text)

        token_list.append(word_list)

df['token_list2'] = token_list

In [None]:
df.head()
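
The comment above mentions `nlp.make_doc` as a way to run only the tokenizer. A minimal sketch of that alternative follows. To stay runnable without a model download it assumes a blank English pipeline created with `spacy.blank("en")` rather than `en_core_web_sm`; the sample sentences and variable names are made up.

```python
import spacy

# A blank English pipeline contains only the tokenizer, so no model download is needed.
blank_nlp = spacy.blank("en")

sample_texts = ["Apple isn't buying a U.K. startup.", "Tokenize me, please!"]

# make_doc runs only the tokenizer and returns a Doc for each text.
token_lists = [
    [token.text for token in blank_nlp.make_doc(text)]
    for text in sample_texts
]

for tokens in token_lists:
    print(tokens)
```

Skipping the other components this way is mainly useful when only token boundaries are needed, since tags, lemmas, and entities are not set on the resulting `Doc`.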

In [None]:
#Tokenize with the full pipeline enabled (no components disabled)

token_list = []

for doc in nlp.pipe(df['Text'].astype(str), batch_size=100):
    word_list = []
    for token in doc:
        word_list.append(token.text)

    token_list.append(word_list)

df['token_list'] = token_list

In [None]:
df.head()

In [None]:
#Collect the entity labels found in each text
ner_list = []

for doc in nlp.pipe(df['Text'].astype(str), batch_size=100):
    label_list = []
    for ent in doc.ents:
        label_list.append(ent.label_)

    ner_list.append(label_list)

df['ner_list'] = ner_list

In [None]:
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

In [None]:
%%time
#Tag parts of speech with the full pipeline enabled
pos_list = []

for doc in nlp.pipe(df['Text'].astype(str), batch_size=100):
    word_list = []
    for token in doc:
        word_list.append(token.pos_)

    pos_list.append(word_list)

df['pos_list'] = pos_list

In [None]:
nlp = spacy.load("en_core_web_sm")

In [None]:
docs = df['Text'].tolist()



In [None]:
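#Collect the lemmas of every text instead of overwriting them on each iteration
all_lemmas = []
for doc in nlp.pipe(docs):
    tokens = [token.lemma_ for token in doc]
    all_lemmas.append(tokens)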

In [None]:
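docs = df['Text'].tolist()

def token_filter(token):
    #Keep a token only if it is not punctuation, whitespace, a stop word, or four characters or shorter
    #Use `or` rather than `|`: the bitwise operator binds more tightly than `<=` and would change the logic
    return not (token.is_punct or token.is_space or token.is_stop or len(token.text) <= 4)

filtered_tokens = []
for doc in nlp.pipe(docs):
    tokens = [token.lemma_ for token in doc if token_filter(token)]
    filtered_tokens.append(tokens)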

In [None]:
#Process each row's text into a spaCy Doc (returns a Series of Doc objects)
doc = df.apply(lambda row: nlp(row['Text']), axis=1)


In [None]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print(token.text)

In [None]:
#Iterate over the Doc itself to reach its tokens; a Doc has no .tokens attribute
df["Tag"] = df["Text"].apply(lambda my_root: [tok.tag_ for tok in nlp(my_root)])

In [None]:
text = df['Text'].apply(lambda x: nlp(x))

In [None]:
#Get Named Entities as (text, label) pairs
ent_list = []

with nlp.disable_pipes('tagger', 'parser'):
    for doc in nlp.pipe(df['Text'].astype(str), batch_size=100):
        word_list = []
        #Loop over the entities once per doc rather than once per token
        for ent in doc.ents:
            word_list.append((ent.text, ent.label_))

        ent_list.append(word_list)

df['ent_list2'] = ent_list