
# Corpus Analysis with spaCy
## Introduction

This tutorial describes how to conduct cleaning and text analysis on a corpus of texts using spaCy. It will be of interest to researchers who want to prepare a corpus of texts for analysis and perform lemmatization, part-of-speech tagging, and named entity recognition to help answer their research questions. 

### Why Use spaCy for Corpus Analysis?

spaCy is an industrial-strength library for natural language processing. One of its primary usages is to retrieve a variety of linguistic annotations from a text or corpus (e.g. lemmas, part of speech tags, named entities), so it's valuable for researchers who want to know more about their corpora at the lexico-grammatical level. 

While there are several Python libraries that can conduct similar text-mining tasks, spaCy has the following advantages: 
*   It's **fast and simple to set up and call the nlp pipeline**; no need to call a wide range of packages and functions for each individual task [(Data Incubator, 2021)](https://www.thedataincubator.com/blog/2016/04/27/nltk-vs-spacy-natural-language-processing-in-python/)
*   It uses only the **"latest and best" algorithms** for text-processing tasks, so it's easy to run and kept up-to-date by the developers [(Malhotra, 2018)](https://medium.com/@akankshamalhotra24/introduction-to-libraries-of-nlp-in-python-nltk-vs-spacy-42d7b2f128f2)
*   It **performs better on text-splitting tasks** than NLTK, since it constructs syntactic trees for each sentence it is called on [(Proxet)](https://proxet.com/blog/spacy-vs-nltk-natural-language-processing-nlp-python-libraries/)

### Before You Begin

You should have some familiarity with Python or a similar programming language. For a brief introduction or refresher, work through some of the *Programming Historian's* [introductory Python tutorials](https://programminghistorian.org/en/lessons/introduction-and-installation). You should also have basic knowledge of spreadsheet (csv) files, as this tutorial will primarily use data in a similar format called a [pandas](https://pandas.pydata.org/) DataFrame. [This lesson](https://programminghistorian.org/en/lessons/crowdsourced-data-normalization-with-pandas) provides an overview of creating and manipulating datasets using pandas.

It is also recommended, though not required, that you have some background in methods of computational text mining. [This lesson](https://programminghistorian.org/en/lessons/corpus-analysis-with-antconc) shares tips for working with plain text files and outlines possibilities for exploring keywords and collocates in a corpus (though using a different tool). [This lesson](https://programminghistorian.org/en/lessons/counting-frequencies) describes the process of counting word frequencies, a practice this tutorial will adapt to count part-of-speech and named entity tags. 

Two versions of code are provided for this tutorial: one version to be run on Jupyter Notebook and one for Google Colaboratory. Details and setup instructions for each are as follows: 
*  **Jupyter Notebook** is an environment through which you can run Python on your local machine. Since it's local, it works offline, and you can set up dedicated environments for your projects in which you'll only need to install packages once. If you've used Python before, you likely already have Jupyter Notebook installed on your machine. [This tutorial](https://programminghistorian.org/en/lessons/jupyter-notebooks) covers the basics of setting up Jupyter Notebook using Anaconda.

*  **Google Colaboratory** is a Google platform which allows you to run Python in a web browser. Access is free with a Google account and nothing needs to be installed to your local machine. If you're new to coding, aren't working with sensitive data, and aren't running processes with [slow runtime](https://www.techrepublic.com/article/google-colab-vs-jupyter-notebook/), Google Colab may be the best option for you. [Here's a brief Colab tutorial from Google.](https://colab.research.google.com/)


### Lesson Dataset: Michigan Corpus of Upper-Level Student Papers (MICUSP)
The [Michigan Corpus of Upper-Level Student Papers (MICUSP)](https://elicorpora.info/main) is a corpus of 829 high-scoring academic writing samples from students at the University of Michigan. The papers come from 16 disciplines and seven genres; all were written by senior undergraduate or graduate students and received an A-range score in a university course ([Römer and O'Donnell, 2011](https://web.s.ebscohost.com/ehost/pdfviewer/pdfviewer?vid=0&sid=0b9af0f6-d23e-47ae-90dc-e1ea9fbe606a%40redis); [O'Donnell and Römer, 2012](https://www.euppublishing.com/doi/10.3366/cor.2012.0015)). The papers and their metadata are publicly available on MICUSP Simple, an online interface which allows users to search for papers by a range of fields (e.g. genre, discipline, student level, textual features) and conduct simple keyword analyses across disciplines and genres. Metadata from the corpus is available to download in csv form. The text files can be retrieved via web scraping, a process explained further in [this tutorial](https://programminghistorian.org/en/lessons/retired/intro-to-beautiful-soup).

Given its size and robust metadata, MICUSP has become a valuable tool for researchers seeking to study student writing computationally. Notably, [Hardy and Römer (2013)](https://web.p.ebscohost.com/ehost/pdfviewer/pdfviewer?vid=0&sid=df763712-4f88-480f-a421-987bb35a09cd%40redis) use MICUSP to study language features that indicate how student writing differs across disciplines, [Aull and Lancaster (2016)](https://journals.sagepub.com/doi/epub/10.1177/0741088318819472) compare usages of stance markers across student genres, and [Kim (2018)](https://www.cambridge.org/core/product/identifier/S0266078417000554/type/journal_article) highlights discrepancies between prescriptive grammar rules and actual language use in student work. Though different in framework and approach, these studies are predicated on the fact that computational analysis of *language patterns*--the discrete lexico-grammatical practices students employ in their writing--can yield insights into larger questions about academic writing. Given its value in retrieving *linguistic annotations* like parts of speech and named entities, spaCy is well-poised to conduct this type of analysis using MICUSP.

For the purposes of this tutorial, we'll use a subset of MICUSP: 67 Biology papers and 98 English papers. Papers in this select corpus belong to all seven MICUSP genres: Argumentative Essay, Creative Writing, Critique/Evaluation, Proposal, Report, Research Paper, and Response Paper. This select corpus and the associated metadata csv are available to download as part of this tutorial's [lesson materials](https://github.com/mkane968/Corpus-Analysis-with-SpaCy/tree/main/lesson-materials). This tutorial will demonstrate how spaCy's utilities in **stopword removal,** **tokenization,** and **lemmatization** can clean and prepare a corpus of student texts for analysis. It will also demonstrate how spaCy's ability to extract linguistic annotations like part-of-speech tags and named entities can be used to compare conventions within subsets of a discourse community of interest. Here, the focus will be on lexico-grammatical features that indicate genre and disciplinary differences in academic writing: 
*   *Genre Analysis:* Do students use certain **parts of speech** more frequently in some genres than others? And what can these differences tell us about genre conventions? For example, the goal of a proposal is to put forward a research proposal or question; it's more focused on "big ideas" than something like a response paper that narrowly addresses a prior text ([Römer and O'Donnell, 2011](https://web.s.ebscohost.com/ehost/pdfviewer/pdfviewer?vid=0&sid=0b9af0f6-d23e-47ae-90dc-e1ea9fbe606a%40redis)). Does this translate to the linguistic level--do students use more proper nouns, for example, when writing in genres with broader goals? 

*   *Discipline Analysis:* Do students use certain **named entities** more frequently in Biology papers than in English papers? And what can these differences tell us about disciplinary conventions? For example, even when writing in the same genres, the writer of a scientific research paper often has very different expectations than one in the humanities (Berkenkotter and Huckin, 1995). Does this translate to the linguistic level--do students use more concrete dates and organization names in Biology research papers than in English ones? 

Finally, this tutorial will address how a dataset enriched by spaCy can be exported in a useable format for further analyses like [Term Frequency - Inverse Document Frequency (tf-idf) analysis](https://programminghistorian.org/en/lessons/analyzing-documents-with-tfidf), [sentiment analysis](https://programminghistorian.org/en/lessons/sentiment-analysis#calculate-sentiment-for-a-paragraph) or [topic modeling](https://programminghistorian.org/en/lessons/topic-modeling-and-mallet).

### Tutorial Goals
By the end of this tutorial, you will be able to: 
*   Upload a corpus of texts to Google Colab
*   Clean the corpus by lowercasing, removing stop words and removing punctuation 
*   Enrich the corpus through lemmatization, chunking, part-of-speech tagging, and named entity recognition
*   Conduct frequency analyses with part-of-speech tags and named entities 
*   Download an enriched dataset for use in future NLP/analyses

### Table of Contents: 
1. Install Packages 
2. Load Text Files into DataFrame
3. Cleaning and Tokenization
4. Text Enrichment
    
    a. Lemmatization

    b. Part of Speech Tagging

    c. Parsing and Chunking
    
    d. Named Entity Recognition

5. Analysis of Linguistic Annotations

    a. Part of Speech Differences Between Genres: Proper Nouns 

    b. Named Entity Differences Between Disciplines: Dates and Organizations 
6. Download Enriched Dataset

## 1. Install Packages

Import spaCy and related libraries and packages. It is common practice to place imports at the very top of the file, rather than interspersing them with your code, so that the notebook's dependencies are easy to see. These imports can be run in a single cell of code; the comments in the cell below describe how each package or library will be used in the analysis. 


In [None]:
#Install spaCy if it is not already available
#!pip install spacy
import spacy
#Download the small English pipeline if it is not already available
#!python -m spacy download en_core_web_sm
#Load the natural language processing pipeline
nlp = spacy.load("en_core_web_sm")
#Load spaCy visualizer
from spacy import displacy

## 2. Load Text Files into DataFrame

After all necessary packages have been installed, it is time to upload the texts for analysis. The key here is to read the texts into Google Colab in a way that will make them recognizable for analysis. Run the following code to “mount” the Google Drive, which allows your Google Colab notebook to access any files on your Drive. A box will pop up asking for permission for the notebook to access your Drive files; click “Connect to Google Drive,” select Google account to connect to, and click “Allow.” 

In [None]:
#Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Next, upload the files for analysis to your Colab session. To complete this step, you must have the files of interest saved in a folder on your local machine. Once you run this line of code, a button will pop up directing you to “Choose Files”; click the button and a file explorer box will pop up. From here, navigate to the folder where your files are stored, select the files of interest, and click “Open.” The files will then be uploaded to the notebook's session storage; you will see the upload complete as output of your cell and can access the files by clicking the file icon in the bar on the left-hand side of the notebook.


In [None]:
#Select multiple files to upload from local folder
from google.colab import files

uploaded_files = files.upload()

Now we have files upon which we can perform analysis. To check what form of data we are working with, use the type() function. It should return that your files are contained in a dictionary, where keys are the file names and values are the content of each file. 


In [None]:
type(uploaded_files)

Next, we’ll make the data easier to manage by inserting it into a pandas dataframe. This will organize the texts into a table of rows and columns; in this case, the first column will contain the names of the files, and the second column will contain the content of each file. Since the files are currently stored in a dictionary, use the DataFrame.from_dict() function to load them into a new dataframe.


In [None]:
#Add files into dataframe
import pandas as pd

paper_df = pd.DataFrame.from_dict(uploaded_files, orient='index')
paper_df.head()

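From here, you can reset the index (the row labels in the leftmost column of the dataframe), which currently holds the file names. This moves the file names into a regular column and will make data wrangling easier later.  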

In [None]:
#Reset index and add column names to make wrangling easier
paper_df = paper_df.reset_index()
paper_df.columns = ["Filename", "Text"]
paper_df.head()
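
The two steps above can be tried without uploading anything. The sketch below is a minimal illustration: it assumes a small, invented dictionary with the same shape as the one described above (file names as keys, file contents as bytes), and the file names and texts are placeholders rather than MICUSP data.

```python
import pandas as pd

# Stand-in for the dictionary of uploaded files:
# keys are file names, values are the raw bytes of each file.
# These two entries are invented for illustration.
uploaded_files = {
    "sample_one.txt": b"First sample text.",
    "sample_two.txt": b"Second sample text.",
}

# orient='index' turns each key into a row label and each value into a cell
paper_df = pd.DataFrame.from_dict(uploaded_files, orient="index")

# Move the file names out of the index into a regular column and name both columns
paper_df = paper_df.reset_index()
paper_df.columns = ["Filename", "Text"]

print(paper_df)
```

Note that the Text column still holds bytes at this stage, which is why a decoding step is needed before any text processing.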

Optionally, we can add metadata of interest to this dataframe. Here, we'll add discipline and genre information, as we'll be interested in using spaCy to trace differences across genre and disciplinary categories later. 

In [None]:
#Upload csv with essay metadata
metadata = files.upload()

In [None]:
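#Read the uploaded csv from the dictionary returned by files.upload()
import io
metadata_name = next(iter(metadata))
metadata_df = pd.read_csv(io.BytesIO(metadata[metadata_name]))
metadata_df = metadata_df.dropna(axis=1, how='all')
metadata_df.head()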

We'll need to do some minor editing so that the titles in the paper dataframe match the metadata paper ids. This is so that we can merge the two dataframes on the paper names. 

In [None]:
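#Remove the .txt extension from the end of each filename
#(rstrip('.txt') would strip any trailing '.', 't', or 'x' characters, not just the extension)
paper_df['Filename'] = paper_df['Filename'].str.replace(r'\.txt$', '', regex=True)

#Rename the metadata column "PAPER ID" to "Filename" so the two dataframes share a key
metadata_df.rename(columns={"PAPER ID": "Filename"}, inplace=True)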
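
When removing a file extension, it helps to know that Python's `rstrip` treats its argument as a set of characters rather than as a suffix. The sketch below illustrates the difference with a hypothetical file name; `strip_txt_extension` is a helper written for this example, not part of pandas or spaCy.

```python
def strip_txt_extension(filename):
    """Remove a trailing '.txt' and nothing else."""
    if filename.endswith(".txt"):
        return filename[: -len(".txt")]
    return filename

# Hypothetical file name chosen to expose the difference
name = "report_text.txt"

# rstrip keeps stripping any trailing '.', 't', or 'x'
# until it meets a different character
print(name.rstrip(".txt"))        # report_te
print(strip_txt_extension(name))  # report_text
```

File names that end in a digit are unaffected by this pitfall, but removing the suffix explicitly is safer if your own corpus uses different naming.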

Now it is possible to combine the papers and metadata into a single dataframe.

In [None]:
#Merge metadata and papers into new dataframe
#Will only keep rows where both essay and metadata are present
final_paper_df = metadata_df.merge(paper_df,on='Filename')

#Print dataframe
final_paper_df.head()
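
Because the merge keeps only rows present in both dataframes, papers without metadata (or metadata without papers) disappear without warning. One way to check for this, sketched below with invented IDs and values, is to run an outer merge with pandas' `indicator=True` option and inspect the rows that did not match.

```python
import pandas as pd

# Toy stand-ins for the two dataframes; all IDs and values are invented
metadata_df = pd.DataFrame({"Filename": ["A1", "A2", "A3"],
                            "DISCIPLINE": ["Biology", "English", "English"]})
paper_df = pd.DataFrame({"Filename": ["A1", "A2", "A4"],
                         "Text": ["text one", "text two", "text four"]})

# An outer merge keeps every row; indicator=True adds a "_merge" column that
# records whether each row was found in the left frame, the right frame, or both
check_df = metadata_df.merge(paper_df, on="Filename", how="outer", indicator=True)

# Rows that an inner merge (the default) would silently drop
unmatched = check_df[check_df["_merge"] != "both"]
print(unmatched[["Filename", "_merge"]])
```

If the unmatched table is empty for your own data, the file names and paper IDs line up and nothing was lost in the merge.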

The texts in your resulting dataframe are now ready for cleaning and analysis. 


## 3. Cleaning and Tokenization

From a quick scan of the dataframe, it is evident that some preliminary cleaning is required. First, use the .decode() method to convert each text from raw bytes (displayed with a b'...' prefix and a leading b'\xef\xbb\xbf' byte order mark) into a string. It is also important to remove newline characters (\n, \r) through a simple string replacement. These are NOT functions of spaCy but are necessary to make the texts recognizable for further cleaning and tokenization. 


In [None]:
#Decode the raw bytes into text; 'utf-8-sig' also drops the byte order mark (b'\xef\xbb\xbf')
final_paper_df['Text'] = final_paper_df['Text'].apply(lambda x: x.decode('utf-8-sig', errors='ignore'))
final_paper_df.head()

#Replace newlines and other runs of whitespace with a single space
final_paper_df['Text'] = final_paper_df['Text'].str.replace(r'\s+', ' ', regex=True)
final_paper_df.head()
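
To see what decoding and whitespace cleanup do to a single text, consider the minimal sketch below. The byte string is invented for illustration; it uses Python's built-in `utf-8-sig` codec, which removes a byte order mark at the start of the data, where plain `utf-8` keeps it as an invisible character.

```python
import re

# Invented example of raw file contents: a UTF-8 byte order mark followed by text
raw = b"\xef\xbb\xbfFirst line.\r\nSecond line.\n\nThird line."

# Plain 'utf-8' keeps the byte order mark as the invisible character '\ufeff';
# 'utf-8-sig' removes it when it appears at the start of the data
with_bom = raw.decode("utf-8")
without_bom = raw.decode("utf-8-sig")

# Collapse every run of whitespace (including \r and \n) into one space
clean = re.sub(r"\s+", " ", without_bom)

print(repr(with_bom[0]))
print(clean)
```

A leftover byte order mark can attach itself to the first token of a text, so it is worth confirming that it is gone before tokenizing.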

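The next, most basic operation to perform is lowercasing all tokens in the texts. This will prevent incorrect calculations in later case-sensitive analysis; for example, if lowercasing is not performed, “House” and “house” may be counted as two different words. Keep in mind that spaCy's trained pipelines rely partly on capitalization, so lowercasing before tagging can reduce the accuracy of part-of-speech tags and named entities; if those annotations are your focus, consider also keeping a copy of the original-case text. 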

In [None]:
#Lowercase all words
final_paper_df['Text'] = final_paper_df['Text'].str.lower()

final_paper_df.head()

The next step is to remove punctuation. Depending on your analysis goals, you may want to keep punctuation, but in this case we are interested in words only. 

In [None]:
#Remove punctuation (except periods, hyphens, and apostrophes) without inserting a space
final_paper_df['Text'] = final_paper_df['Text'].str.replace(r'[^\w\-\.\'\s]+', '', regex=True)

#Replace periods with a space (to prevent incorrect compounds)
final_paper_df['Text'] = final_paper_df['Text'].str.replace(r'[^\w\-\'\s]+', ' ', regex=True)

final_paper_df.head()
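
The reason for handling periods in a separate pass is easier to see on a single sentence. The sketch below applies the same two regular expressions with Python's `re` module to an invented sentence (left in its original case for readability) in which a space is missing after a period.

```python
import re

# Invented sentence with a missing space after the first period
text = "Results were clear.However, the 'well-known' method (p. 4) failed!"

# Pass 1: delete everything except word characters, hyphens, periods,
# apostrophes, and whitespace
step_one = re.sub(r"[^\w\-\.\'\s]+", "", text)

# Pass 2: replace what is left outside that set (now only periods) with a space,
# so "clear.However" becomes two words rather than "clearHowever"
step_two = re.sub(r"[^\w\-\'\s]+", " ", step_one)

print(step_two.split())
```

Hyphens and apostrophes survive both passes, so compounds like "well-known" stay intact.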

Tokenization is the process of splitting full text into smaller parts for analysis. spaCy has a built-in tokenizer that segments texts into individual parts like words and punctuation. Take the example of an individual sentence: 

In [None]:
doc = nlp("This is 'an' example? sentence")
for token in doc:
    print(token.text)

This code calls the nlp pipeline, which contains the data and components needed for text processing. When the nlp pipeline is called on a sentence, it first splits that sentence on whitespace and then reviews each resulting piece, splitting it further based on rules for words, punctuation, prefixes, suffixes, etc. The tokens are stored in a new object that we’ve called “doc.” Calling nlp also enables part-of-speech tagging, lemmatization, and other enrichment procedures we’ll discuss further below. 

Since we are working with multiple long texts, we are going to use nlp.pipe, which processes batches of texts as doc objects. Here we’ll tokenize each text in our dataframe, append each set of tokens to a list, and add the new token lists to a new column in the dataframe.


In [None]:
#Tokenize with spaCy

#Create list for tokens
token_list = []

# Disable POS, Dependency Parser, and NER since all we want is tokenizer 
with nlp.disable_pipes('tagger', 'parser', 'ner'):
  #Iterate through each doc object (each text in dataframe) and tokenize, append tokens to list
    for doc in nlp.pipe(final_paper_df.Text.astype('unicode').values, batch_size=100):
        word_list = []
        for token in doc:
            word_list.append(token.text)

        token_list.append(word_list)
        
#Make token list a new column in dataframe
final_paper_df['Text_Tokens'] = token_list
final_paper_df['Text_Tokens'] = [' '.join(map(str, l)) for l in final_paper_df['Text_Tokens']]

#Check token list
final_paper_df.head()
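
For a quick feel for how nlp.pipe behaves, here is a minimal sketch that runs without a trained model. It assumes `spacy.blank("en")`, which builds a pipeline containing only the English tokenizer, and two invented texts standing in for the Text column.

```python
import spacy

# spacy.blank("en") builds a pipeline containing only the English tokenizer,
# so this sketch runs without downloading a trained model
blank_nlp = spacy.blank("en")

# Invented stand-ins for the Text column
texts = ["the first short text", "a second text"]

# nlp.pipe yields one Doc per input text, in the same order as the input
token_lists = [[token.text for token in doc]
               for doc in blank_nlp.pipe(texts, batch_size=100)]

print(token_lists)
```

Because the Docs come back in input order, the resulting list can be assigned directly to a new dataframe column, as in the cell above.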

When tokenizing texts, you can also exclude stopwords. Stopwords are words which may hold little significance to text analysis, such as very common words like “the” or “and.” spaCy has a built-in set of stopwords which you can access. You can also add or remove your own stopwords, as shown below:

In [None]:
#Adding and removing stopwords to default list
#See list of default stopwords
print(nlp.Defaults.stop_words)

#Remove a  stopword
nlp.Defaults.stop_words.remove("becomes")

#Add stopword
nlp.Defaults.stop_words.add("book")

#Check updated list of default stopwords
print(nlp.Defaults.stop_words)

To tokenize texts without stopwords, follow the same process as above using nlp.pipe, but only append tokens that are NOT included in the stopwords list, and add the results to a new column in the dataframe. 


In [None]:
#Remove all stopwords and append remaining tokens to new df column
token_list_nostops = []

# Disable POS, Dependency Parser, and NER since all we want is tokenizer 
with nlp.disable_pipes('tagger', 'parser', 'ner'):
  #Iterate through each doc object (each text in dataframe) and tokenize, append tokens to list
    for doc in nlp.pipe(final_paper_df.Text.astype('unicode').values, batch_size=100):
        nostops_word_list = []
        for token in doc:
            if token.text not in nlp.Defaults.stop_words:
              nostops_word_list.append(token.text)

        token_list_nostops.append(nostops_word_list)

#Make token list a new column in dataframe
final_paper_df['Text_Tokens_NoStops'] = token_list_nostops
final_paper_df['Text_Tokens_NoStops'] = [' '.join(map(str, l)) for l in final_paper_df['Text_Tokens_NoStops']]


#Check list of tokens without stopwords
final_paper_df.head()

Depending on the goals of your analysis, you may want to remove or keep stopwords. One case where stopword removal may be useful is if you want to compare document similarity. spaCy calculates document similarity from word vectors; since stopwords appear throughout most texts, they will raise the similarity score between two documents even if the documents' content is very different. Observe the difference here: 


In [None]:
#Stopwords Test Case - Word Vector Similarity
#For meaningful scores, load a larger pipeline that includes word vectors
#!python -m spacy download en_core_web_md
#nlp = spacy.load("en_core_web_md")

#Compare similarity between two documents with stopwords
doc1 = nlp(final_paper_df.Text_Tokens[0])
doc2 = nlp(final_paper_df.Text_Tokens[72])
print(f'The similarity between {final_paper_df.Filename[0]} and {final_paper_df.Filename[72]} with stopwords is {doc1.similarity(doc2)}')

#Compare similarity between the same two documents without stopwords
doc1 = nlp(final_paper_df.Text_Tokens_NoStops[0])
doc2 = nlp(final_paper_df.Text_Tokens_NoStops[72])
print(f'The similarity between {final_paper_df.Filename[0]} and {final_paper_df.Filename[72]} without stopwords is {doc1.similarity(doc2)}')
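
Why would shared stopwords raise a similarity score? For pipelines that include word vectors, spaCy's default document similarity is the cosine similarity between the averaged token vectors of the two documents. The sketch below reproduces that arithmetic by hand with invented three-dimensional vectors; the words, values, and helper functions are illustrative assumptions, not spaCy's data or API.

```python
import math

def average(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(values) / len(vectors) for values in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented three-dimensional "word vectors"; real vectors have hundreds of dimensions
vectors = {"the": [1.0, 0.0, 0.0], "gene": [0.0, 1.0, 0.0], "poem": [0.0, 0.0, 1.0]}

doc_a = ["the", "gene"]
doc_b = ["the", "poem"]

# Similarity of the averaged vectors with the shared word "the" included
with_stops = cosine(average([vectors[w] for w in doc_a]),
                    average([vectors[w] for w in doc_b]))

# Similarity after dropping "the" from both documents
without_stops = cosine(average([vectors[w] for w in doc_a if w != "the"]),
                       average([vectors[w] for w in doc_b if w != "the"]))

print(with_stops, without_stops)
```

In this toy case the two content words are unrelated, so all of the similarity in the first score comes from the shared stopword.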


Stopword removal is also useful for topic modeling and classification tasks, where finding general themes across documents is the goal. However, other types of analysis, like sentiment analysis, are highly sensitive to function words, and removing stopwords can change sentence meaning (e.g. removing “not” in the sentence “I was not happy”). When possible, it is recommended to run analysis with and without stopwords and see how the model is impacted. For the rest of this tutorial, we will be using the corpus without stopwords, but you are welcome to replicate analysis with them. 
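
The effect on meaning can be shown in a few lines. This sketch uses a tiny invented stopword set and a hypothetical helper, `remove_stopwords`; it does not use spaCy's default list.

```python
# A tiny invented stopword set, used here only for illustration
stop_words = {"i", "was", "not", "the"}

def remove_stopwords(text, stop_words):
    """Drop every whitespace-separated word that appears in stop_words."""
    return " ".join(word for word in text.split() if word not in stop_words)

print(remove_stopwords("i was not happy", stop_words))  # happy
print(remove_stopwords("i was happy", stop_words))      # happy
```

Two sentences with opposite meanings become identical, which is why negation words are often taken out of the stopword list before sentiment analysis.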


## 4. Text Enrichment 

In [None]:
#Create new dataframe for text enrichment
enriched_df = final_paper_df.copy()

### Lemmatization

spaCy enables several types of text enrichment. We’ll start with lemmatization, which identifies the dictionary form of each word (e.g. “brighten” for “brightening”). Lemmatization is one of the steps that occurs when the nlp pipeline is called; repeat the same process as above to iterate through each document in the dataframe, this time appending all lemmas to a new column. 


In [None]:
#Get lemmas
lemma_list = []

#Disable the dependency parser and NER, since lemmatization does not need them
with nlp.disable_pipes('parser', 'ner'):
    #Iterate through each doc object and append the lemma of each token to a list
    for doc in nlp.pipe(enriched_df.Text_Tokens_NoStops.astype('unicode').values, batch_size=100):
        word_list = []
        for token in doc:
            word_list.append(token.lemma_)

        lemma_list.append(word_list)

#Make lemma list a new column in the enriched dataframe
enriched_df['Text_Lemmas'] = lemma_list
enriched_df['Text_Lemmas'] = [' '.join(map(str, l)) for l in enriched_df['Text_Lemmas']]

#Check lemmas
enriched_df.head()

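Lemmatization is valuable to researchers because it groups the inflected forms of a word (e.g. “write,” “writes,” and “wrote”) under a single entry, so that frequency counts and comparisons across texts reflect how often a word is used regardless of its grammatical form.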
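
One practical effect of lemmatization is on word counts. The sketch below compares counts before and after grouping by lemma; the token list and the hand-written `lemma_lookup` dictionary are invented stand-ins for the values spaCy would supply through `token.lemma_`.

```python
from collections import Counter

# Invented tokens and a hand-written lemma lookup standing in for token.lemma_
tokens = ["writes", "writing", "wrote", "write", "essay", "essays"]
lemma_lookup = {"writes": "write", "writing": "write", "wrote": "write",
                "write": "write", "essay": "essay", "essays": "essay"}

# Counting surface forms treats every inflection as a separate word
token_counts = Counter(tokens)

# Counting lemmas groups the inflections together
lemma_counts = Counter(lemma_lookup[token] for token in tokens)

print(token_counts.most_common(2))
print(lemma_counts.most_common(2))
```

Each surface form appears only once, while the lemma counts show which words actually dominate the text.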

### Part of Speech Tagging

The nlp pipeline also enables the tagging of each word according to its part of speech. This code will append all parts of speech to a new dataframe column. 


In [None]:
#Get part of speech tags
pos_list = []

# Disable Dependency Parser, and NER since all we want is POS 
with nlp.disable_pipes('parser', 'ner'):
  #Iterate through each doc object and tag POS, append POS to list
  for doc in nlp.pipe(enriched_df.Text_Tokens.astype('unicode').values, batch_size=100):
    word_list = []
    for token in doc:
        word_list.append(token.pos_)
        
    pos_list.append(word_list)

#Make pos list a new column in dataframe
enriched_df['POS_Tags'] = pos_list
enriched_df['POS_Tags'] = [' '.join(map(str, l)) for l in enriched_df['POS_Tags']]

#Check pos tags
enriched_df.head()

One basic form of analysis is to count usages of specific parts of speech.

In [None]:
#Get the number of proper nouns in each paper
noun_counts = enriched_df['POS_Tags'].str.count('PROPN')

#Append proper noun counts to dataframe 
enriched_df['Noun_Counts'] = noun_counts
enriched_df.head()
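
Raw counts depend on how long each paper is, so a longer paper will tend to contain more proper nouns simply because it contains more words. One common adjustment, offered here as a suggestion rather than as part of the original workflow, is to report a rate per 1,000 tags. The helper `rate_per_thousand` and the tag strings below are invented for illustration.

```python
def rate_per_thousand(tag_string, tag):
    """Occurrences of `tag` per 1,000 tags in a space-separated tag string."""
    tags = tag_string.split()
    if not tags:
        return 0.0
    return tags.count(tag) / len(tags) * 1000

# Invented tag strings in the same format as the POS_Tags column
short_paper = "PROPN VERB DET NOUN"
long_paper = "PROPN VERB DET NOUN PROPN ADP DET NOUN"

print(rate_per_thousand(short_paper, "PROPN"))  # 250.0
print(rate_per_thousand(long_paper, "PROPN"))   # 250.0
```

The longer paper has twice as many proper nouns but the same rate. A function like this could be applied to the POS_Tags column with `.apply(...)` before averaging by genre or discipline.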

From here, we can calculate the average usage of that part of speech and plot across discipline and paper type.



In [None]:
#Get average proper noun count in each discipline
discipline_mean_df = enriched_df.groupby('DISCIPLINE', as_index=False)['Noun_Counts'].mean()

#Create bar graph and plot proper noun count averages
#https://plotly.com/python/bar-charts/
import plotly.graph_objects as go

fig = go.Figure(data=[
    go.Bar(name='Proper Noun Counts', x=discipline_mean_df['DISCIPLINE'], y=discipline_mean_df['Noun_Counts']),
])

#Add a title and display the figure
fig.update_layout(title_text='Average Count of Proper Nouns in Each Discipline')
fig.show()

In [None]:
#Get average proper noun count in each genre
genre_mean_df = enriched_df.groupby('PAPER TYPE', as_index=False)['Noun_Counts'].mean()

#Create bar graph and plot proper noun count averages by genre
#https://plotly.com/python/bar-charts/
import plotly.graph_objects as go

fig = go.Figure(data=[
    go.Bar(name='Proper Noun Counts', x=genre_mean_df['PAPER TYPE'], y=genre_mean_df['Noun_Counts']),
])

#Add a title and display the figure
fig.update_layout(title_text='Average Count of Proper Nouns in Each Genre')
fig.show()

TODO: Discuss applications of this type of analysis: what it signifies about genre or disciplinary differences, and how it can prompt further analysis, such as analysis of specific words based on POS (connect to the code below).

From here, you may want to extract only the words with a particular part-of-speech tag for further analysis--all of the proper nouns, for instance. 

In [None]:
#Get proper nouns
propn_list = []

#Disable the dependency parser and NER, since all we want is POS
with nlp.disable_pipes('parser', 'ner'):
    #Iterate through each doc object and append the text of each proper noun to a list
    for doc in nlp.pipe(enriched_df.Text_Tokens.astype('unicode').values, batch_size=100):
        word_list = []
        for token in doc:
            if token.pos_ == 'PROPN':
                word_list.append(token.text)

        propn_list.append(word_list)

#Make proper noun list a new column in dataframe
enriched_df['Proper_Nouns'] = propn_list
enriched_df['Proper_Nouns'] = [', '.join(map(str, l)) for l in enriched_df['Proper_Nouns']]

#Check proper nouns
enriched_df.head()

You can do the same with larger units, like noun phrases:

In [None]:
#Get noun phrases
np_list = []

#Disable only NER; noun chunks depend on the tagger and the dependency parser
with nlp.disable_pipes('ner'):
    #Iterate through each doc object and append the text of each noun phrase to a list
    for doc in nlp.pipe(enriched_df.Text_Tokens.astype('unicode').values, batch_size=100):
        word_list = []
        for chunk in doc.noun_chunks:
            word_list.append(chunk.text)
        np_list.append(word_list)

#Make noun phrase list a new column in dataframe
enriched_df['Text_NounPhrases'] = np_list
enriched_df['Text_NounPhrases'] = [', '.join(map(str, l)) for l in enriched_df['Text_NounPhrases']]

#Check noun phrases
enriched_df.head()

Check out the dictionary of spaCy POS tags [here.](https://machinelearningknowledge.ai/tutorial-on-spacy-part-of-speech-pos-tagging/#:~:text=Spacy%20POS%20Tags%20List,-Every%20token%20is%20assigned%20a) 



TODO: Discuss identifying specific noun chunks.

Closely related to POS tagging is dependency parsing, wherein spaCy identifies how different segments of a text are related to each other. Once the grammatical structure of each sentence is identified, visualizations can be created to show the connections between different words. Since we are working with large texts, our code will break down each text into sentences (spans) and then create a dependency visualization for each span.


In [None]:
#Get dependency parse for a single doc
doc = nlp(enriched_df.Text_Tokens[0])
print(doc)

#Treat each sentence as a span to break up the dependency visualizations
spans = list(doc.sents)

#Create dependency visualizations
displacy.render(spans, style="dep", jupyter=True)

In [None]:
len(np_list)

Some types of further analysis, like topic modeling, work better when certain parts of speech are removed from the texts. This is another form of dimensionality reduction. 

### Named Entity Recognition

Finally, spaCy can tag “named entities” in your text, such as names, dates, organizations, and locations. We’ll again call the nlp pipeline on each document in the corpus and append the label of each named entity to a new column. 


In [None]:
#Get Named Entities
ner_list = []

with nlp.disable_pipes('tagger', 'parser'):
    for doc in nlp.pipe(enriched_df.Text_Tokens.astype('unicode').values, batch_size=100):
      ent_list = []
      for ent in doc.ents:
        ent_list.append(ent.label_)
      ner_list.append(ent_list)
 
enriched_df['Text_NER'] = ner_list
enriched_df['Text_NER'] = [' '.join(map(str, l)) for l in enriched_df['Text_NER']]


#Check named entities
enriched_df.head()

Similar to part of speech analysis, we can get counts of a specific named entity label, such as dates:

In [None]:
#Get the number of DATE entities in each paper
date_counts = enriched_df['Text_NER'].str.count('DATE')

#Append DATE counts to dataframe
enriched_df['NE_Counts'] = date_counts
enriched_df.head()
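
To compare several entity labels at once, the counting step can be repeated in a loop. The sketch below assumes an invented dataframe with the same column layout as the enriched dataframe; the label strings and the new column names (`DATE_Count`, `ORG_Count`) are illustrative choices, not part of the original lesson.

```python
import pandas as pd

# Invented rows in the same shape as the enriched dataframe
toy_df = pd.DataFrame({
    "DISCIPLINE": ["Biology", "Biology", "English"],
    "Text_NER": ["DATE ORG DATE", "ORG CARDINAL", "PERSON DATE PERSON"],
})

# Count each label of interest; \b keeps a label from matching inside a longer one
for label in ["DATE", "ORG"]:
    toy_df[f"{label}_Count"] = toy_df["Text_NER"].str.count(rf"\b{label}\b")

# Average count of each label per discipline
summary = toy_df.groupby("DISCIPLINE")[["DATE_Count", "ORG_Count"]].mean()
print(summary)
```

The resulting table has one row per discipline and one column per label, which can be passed to a grouped bar chart.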

From here, we can calculate the average usage of that named entity and plot across discipline and paper type.


In [None]:
#Get average DATE entity count in each discipline
discipline_mean_df = enriched_df.groupby('DISCIPLINE', as_index=False)['NE_Counts'].mean()

#Create bar graph and plot DATE entity count averages
#https://plotly.com/python/bar-charts/
import plotly.graph_objects as go

fig = go.Figure(data=[
    go.Bar(name='Date Counts', x=discipline_mean_df['DISCIPLINE'], y=discipline_mean_df['NE_Counts']),
])

#Add a title and display the figure
fig.update_layout(title_text='Average Count of Date Entities in Each Discipline')
fig.show()

In [None]:
#Get average DATE entity count in each genre
genre_mean_df = enriched_df.groupby('PAPER TYPE', as_index=False)['NE_Counts'].mean()

#Create bar graph and plot DATE entity count averages by genre
#https://plotly.com/python/bar-charts/
import plotly.graph_objects as go

fig = go.Figure(data=[
    go.Bar(name='Date Counts', x=genre_mean_df['PAPER TYPE'], y=genre_mean_df['NE_Counts']),
])

#Add a title and display the figure
fig.update_layout(title_text='Average Count of Date Entities in Each Genre')
fig.show()

TODO: Discuss applications of this type of analysis: what it signifies about genre or disciplinary differences, and how it can prompt further analysis, such as analysis of specific words based on NER (connect to the code below).

In [None]:
#Get named entity words
ent_list = []

with nlp.disable_pipes('tagger', 'parser'):
    for doc in nlp.pipe(enriched_df.Text_Tokens.astype('unicode').values, batch_size=100):
        ent_list.append(doc.ents)

#Store the entity text in its own column so the label column (Text_NER) is preserved
enriched_df['NER_Words'] = ent_list
enriched_df['NER_Words'] = [', '.join(map(str, l)) for l in enriched_df['NER_Words']]


#Check named entity words
enriched_df.head()

spaCy also allows you to visualize named entities within single texts, as follows: 

In [None]:
#Get named entities in a single document and visualize
doc = nlp(enriched_df.Text_Tokens[0])

displacy.render(doc, style="ent", jupyter=True)
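
The tutorial goals include downloading the enriched dataset for later use. A minimal sketch of that step is shown below, assuming the enriched dataframe is saved as a csv with pandas' `to_csv`. The dataframe contents and the file name `enriched_corpus.csv` are invented, and the Colab download call is left as a comment because it only works inside Google Colab.

```python
import os
import tempfile

import pandas as pd

# Invented stand-in for the enriched dataframe
enriched_df = pd.DataFrame({
    "Filename": ["A1", "A2"],
    "Text_Lemmas": ["gene express cell", "poem speaker voice"],
    "POS_Tags": ["NOUN VERB NOUN", "NOUN NOUN NOUN"],
})

# Write to a temporary folder here; in a notebook you would choose your own path
output_path = os.path.join(tempfile.mkdtemp(), "enriched_corpus.csv")

# index=False leaves the row numbers out of the file
enriched_df.to_csv(output_path, index=False)

# Read the file back to confirm that the columns survived the round trip
reloaded_df = pd.read_csv(output_path)
print(reloaded_df.columns.tolist())

# In Google Colab, the saved file can then be sent to your computer with:
# from google.colab import files
# files.download(output_path)
```

Saving the annotations as space- or comma-separated strings, as done throughout this tutorial, keeps the csv readable by other tools at the cost of having to split the strings again later.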

## Conclusions
Through this tutorial, we've gleaned more information about the grammatical makeup of a text corpus. Such information can be valuable to researchers who are seeking to understand differences between texts in their corpus - for example, *what types of named entities are most common across the corpus? How frequently are certain words used as nouns vs. objects within individual texts and corpora, and what may this reveal about the content or themes of the texts themselves?* 

spaCy is also a helpful tool to explore texts without fully-formed research questions in mind; exploring linguistic annotations like those mentioned above can propel further questions and text-mining pipelines, like the following: 
*   [Getting Started with Topic Modeling and Mallet (Graham, Weingart, and Milligan, 2012)](https://programminghistorian.org/en/lessons/topic-modeling-and-mallet#what-is-topic-modeling-and-for-whom-is-this-useful) - Describes the process of conducting topic modeling on a corpus; this spaCy tutorial can serve as a preliminary step to clean and explore data to be used in topic modeling
*   [Sentiment Analysis for Exploratory Data Analysis (Saldaña, 2018)](https://programminghistorian.org/en/lessons/sentiment-analysis#calculate-sentiment-for-a-paragraph) - Describes how to conduct sentiment analysis using NLTK; this spaCy tutorial provides alternative methods of pre-processing and exploration of entities that may become relevant in sentiment analysis 

