**Introduction to Stanza and Named Entity Recognition**

Link to the Stanza documentation: https://stanfordnlp.github.io/stanza/

More documentation with examples, geared towards beginners: https://github.com/stanfordnlp/stanza/blob/main/demo/Stanza_Beginners_Guide.ipynb

Optional reading: https://www.newfireglobal.com/learn/natural-language-understanding-tools/#informed
This article covers some of the main differences between spaCy and Stanza, with examples.

**Imports**

In [None]:
import os
import pandas as pd

In [None]:
import stanza

In [None]:
print(os.getcwd())  # shows the current working directory
                    # Printing it is useful for building the file path, as we will see below.
# Make sure you are in the folder that contains the text file.

In [None]:
%cd NER_Gibbon

**Inputs**
Stanza can handle multiple types of input for different tasks. For this exercise, we give it a list of sentences (i.e., strings).
If we weren't splitting the text into sentences ourselves beforehand, we would give it a single string (i.e., the entire text file), and the pipeline would split that string into sentences itself as part of the tokenization process.
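
As a rough illustration of what pre-tokenized input can look like, the sketch below uses plain Python and a made-up two-sentence text. It assumes the behaviour described in the Stanza tokenization documentation for `tokenize_pretokenized=True`: a string is split into sentences on newlines and into tokens on whitespace, while a list of lists of strings is read as sentences of tokens. All variable names here are hypothetical.

```python
# Made-up text standing in for the chapter.
toy_text = "The emperor left Rome. He marched to Gaul."

# Rough sentence split on a period followed by a space.
toy_sents = [s.strip() for s in toy_text.split(". ") if s.strip()]

# Shape 1: one string, one sentence per line, tokens separated by spaces.
as_string = "\n".join(toy_sents)

# Shape 2: a list of sentences, each a list of token strings.
as_token_lists = [s.split() for s in toy_sents]

print(as_string)
print(as_token_lists)
```

Note that whitespace splitting leaves punctuation attached to the neighbouring word, which is one reason pre-tokenized results can differ from Stanza's own tokenizer.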

In [None]:
# Here we load the Stanza model for English language processing.
# We will go through what these different processors do in a bit!
# To prevent output capacity issues, we pre-tokenize the text before we give it to the processors.
nlp = stanza.Pipeline(lang='en', processors='tokenize,pos,lemma,depparse,ner', tokenize_pretokenized=True)

In [None]:
# Uses an f-string with the current working directory and the file name.
# This is an example of when you can use f-strings to your advantage.
# A forward slash works as a path separator on Windows, macOS, and Linux.
path = f'{os.getcwd()}/gibbon_decline_volume1_chap21.txt'
with open(path, encoding='utf-8', mode='r') as f:
    vol1_chap21 = f.read()
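
A possible alternative to building the path as a string is the standard-library `pathlib` module. The sketch below assumes the text file sits in the current working directory, as in the cell above; the existence check and the message are illustrative additions, not part of the original notebook.

```python
from pathlib import Path

# File name taken from the cell above; the folder is the current working directory.
path = Path.cwd() / "gibbon_decline_volume1_chap21.txt"

if path.exists():
    # read_text opens, reads, and closes the file in one step.
    vol1_chap21 = path.read_text(encoding="utf-8")
else:
    print(f"Could not find {path}; check the working directory.")
```

The `/` operator on `Path` objects inserts the correct separator for the operating system, so the same cell runs unchanged on Windows, macOS, and Linux.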

In [None]:
vol1_chap21 #looking at the text 
#luckily for us, the format of this file shows every sentence ends with a period and then one white space 
#so we can use .split to create a list of sentences 

In [None]:
chap21_sents = vol1_chap21.split('. ')
chap21_sents   # even if it is not perfect, it's good enough for our purposes
# Now we have one large list of every sentence in this chapter!
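
To see where `.split('. ')` falls short, the sketch below compares it with a regular-expression split on a made-up text. This is only one possible refinement, not the method used in this notebook, and it still mis-splits abbreviations such as "A.D. 337".

```python
import re

# Made-up text with three different sentence endings.
toy_text = "Constantine died in 337. His sons divided the empire! Who would rule?"

# The notebook's approach: drops the period and ignores ! and ?
simple_split = toy_text.split(". ")

# Split on whitespace that follows ., ! or ?; the lookbehind keeps the punctuation.
regex_split = re.split(r"(?<=[.!?])\s+", toy_text)

print(simple_split)
print(regex_split)
```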

In [None]:
#we can use list indexing to grab a sentence at a time 
#creating a Doc can take some time, so let's start with a test sentence 
test=    


#pulling a random sentence 


In [None]:
# Stanza's output is something called a Document (Doc)
# let's see what the output looks like


In [None]:
# let's see what type this is


In [None]:
# Take 5-10 minutes and go to the Stanza documentation linked above.
# In a group, write code to access the text, pos, ner, and upos components of our tokens.

In [None]:
#next, look at the stanza documentation and print the named entities for our document 

**NER and processing multiple documents**

In [None]:
# load a new pipeline with just the NER processor 
nlp_ner = stanza.Pipeline(lang='en', processors='tokenize,ner', tokenize_pretokenized=True)
# tokenize_pretokenized=True is still included, since we are going to give the processor our large list of sentences

In [None]:
# This is a solution for our prior output issue; it still takes a while to run, though.
# Note! This can only be done with text that has already been split into sentences.
# Wrapping each sentence in a stanza.Document lets the pipeline process many sentences in one call.
Gibbon_docs_chap21sents = [stanza.Document([], text=d) for d in chap21_sents]
Gibbon_out_docs = nlp_ner(Gibbon_docs_chap21sents)
print(Gibbon_out_docs[1])

In [None]:
#how we access the named entities from our documents list 
for doc in Gibbon_out_docs:
    print(doc.ents) 
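
Printing every `doc.ents` list produces a lot of output, much of it empty lists. The sketch below shows one way to flatten the entities into `(text, type)` pairs and count the entity types. It relies only on the `ents`, `text`, and `type` attributes; the `collect_entities` helper and the stand-in objects built with `SimpleNamespace` are hypothetical, so that the cell runs without loading a model. In the notebook you would pass `Gibbon_out_docs` instead.

```python
from collections import Counter
from types import SimpleNamespace


def collect_entities(docs):
    """Return (text, type) pairs for every entity span in a list of documents."""
    return [(ent.text, ent.type) for doc in docs for ent in doc.ents]


# Stand-in objects that mimic only the attributes used above (made-up data).
fake_docs = [
    SimpleNamespace(ents=[SimpleNamespace(text="Constantine", type="PERSON")]),
    SimpleNamespace(ents=[]),  # a sentence with no entities
    SimpleNamespace(ents=[SimpleNamespace(text="Rome", type="GPE"),
                          SimpleNamespace(text="Constantine", type="PERSON")]),
]

pairs = collect_entities(fake_docs)
type_counts = Counter(ent_type for _, ent_type in pairs)

print(pairs)
print(type_counts.most_common())
```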
 
    

In [None]:
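# Map each token that is part of a named entity to its NER tag.
# Note: a token that occurs more than once keeps only its most recent tag.
NER_Dict = {}
for doc in Gibbon_out_docs:
    for sent in doc.sentences:
        for token in sent.tokens:
            if token.ner != 'O':   # 'O' marks tokens outside any entity
                NER_Dict[token.text] = token.ner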
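
Because a dictionary holds one value per key, a token that appears several times with different tags ends up with only the last tag it was given. The sketch below illustrates this on made-up `(token, tag)` pairs and shows one possible way to keep every tag, using a `defaultdict` of sets. The data and variable names are hypothetical.

```python
from collections import defaultdict

# Made-up (token, tag) pairs; "Roman" appears twice with different tags.
toy_pairs = [
    ("Roman", "S-NORP"),
    ("Constantine", "S-PERSON"),
    ("Roman", "B-GPE"),
    ("Empire", "E-GPE"),
]

last_tag_only = {}            # same pattern as NER_Dict
all_tags = defaultdict(set)   # keeps every tag seen for a token

for text, tag in toy_pairs:
    last_tag_only[text] = tag     # a later tag overwrites an earlier one
    all_tags[text].add(tag)

print(last_tag_only["Roman"])
print(sorted(all_tags["Roman"]))
```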
           

**Quick Pandas Dataframes Intro**

Beginner Tutorials: https://pandas.pydata.org/docs/getting_started/index.html#getting-started


Udemy is a great resource for coding tutorials: https://access.tufts.edu/udemy-business
I especially recommend Python for Data Science and Machine Learning Bootcamp by Jose Portilla (there is a section on pandas for data science analysis).

In [None]:
NER_Dict

In [None]:
ner_frame= pd.DataFrame.from_dict(NER_Dict, orient= 'index')
ner_frame

In [None]:
# Moves the tokens out of the index into a regular column named 'index' and replaces the index with row numbers
ner_frame.reset_index(inplace=True)


In [None]:
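# rename returns a new DataFrame, so assign the result back to keep the new column names
ner_frame = ner_frame.rename(columns={'index': 'token', 0: 'NER_tag'})
ner_frame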
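
As a sketch of a more direct route, a DataFrame with named columns can be built straight from the dictionary's items, and the tag can then be split into its prefix and entity type. The toy dictionary below is made up and only mirrors the shape of `NER_Dict`; the column names `prefix` and `entity_type` are hypothetical.

```python
import pandas as pd

# Made-up token -> tag mapping in the same shape as NER_Dict.
toy_ner = {
    "Constantine": "S-PERSON",
    "Julian": "S-PERSON",
    "Rome": "S-GPE",
    "Roman": "B-NORP",
}

# Build the frame with column names in one step.
frame = pd.DataFrame(list(toy_ner.items()), columns=["token", "NER_tag"])

# Split "S-PERSON" into "S" and "PERSON" at the first hyphen.
frame[["prefix", "entity_type"]] = frame["NER_tag"].str.split("-", n=1, expand=True)

# Count how many tokens fall under each entity type.
counts = frame["entity_type"].value_counts()

print(frame)
print(counts)
```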

In [None]:
NER_tags = {
    'PERSON': 'People, including fictional characters',
    'NORP': 'Nationalities, religious and political groups: Jewish, Buddhist',
    'TIME': 'Time shorter than a day',
    'ORG': 'Companies, agencies, institutions',
    'GPE': 'Countries, cities, regions (districts)',
    'LOC': 'Geographical entities',
    'PRODUCT': 'Objects, vehicles, foods',
    'EVENT': 'Named battles, wars, sports events, catastrophes',
    'QUANTITY': 'Measurements: 10 kg, 200 km',
    'ORDINAL': 'Numbers of order: first, third',
    'CARDINAL': 'Numerals that do not fall under another type',
    'FAC': 'Buildings, airports, roads',
    'LANGUAGE': 'Any named language',
}
NER_tags


In [None]:
nested_NER_prefixes = {
    'B': 'Beginning of named entity',
    'I': 'Token is inside a named entity',
    'O': 'Corresponding word is not an entity',
    'E': 'End of named entity',
    'S': 'Named entity has only one token/element',
}
nested_NER_prefixes
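
To make the prefixes concrete, the sketch below groups a tagged token sequence back into whole entities. The `bioes_to_spans` helper and the toy sentence are hypothetical, and the tags are made up for illustration rather than taken from Stanza's output.

```python
def bioes_to_spans(tokens, tags):
    """Group tokens into (entity_text, entity_type) pairs using BIOES tags."""
    spans, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag == "O":
            current, current_type = [], None
            continue
        prefix, ent_type = tag.split("-", 1)
        if prefix == "S":                      # single-token entity
            spans.append((token, ent_type))
            current, current_type = [], None
        elif prefix == "B":                    # start a new multi-token entity
            current, current_type = [token], ent_type
        elif prefix in ("I", "E") and current and ent_type == current_type:
            current.append(token)
            if prefix == "E":                  # entity is complete
                spans.append((" ".join(current), ent_type))
                current, current_type = [], None
        else:
            # Malformed sequence (e.g. I- without a preceding B-): drop the partial span.
            current, current_type = [], None
    return spans


toy_tokens = ["The", "Council", "of", "Nice", "met", "under", "Constantine"]
toy_tags = ["B-EVENT", "I-EVENT", "I-EVENT", "E-EVENT", "O", "O", "S-PERSON"]

spans = bioes_to_spans(toy_tokens, toy_tags)
print(spans)
```

Stanza already does this grouping for you in `doc.ents`; writing it out by hand is only meant to show what the prefixes encode.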

In [None]:
nertags_frame= pd.DataFrame.from_dict(NER_tags, orient= 'index')
nertags_frame

In [None]:
nertags_frame.reset_index(inplace=True)

In [None]:
nertags_frame = nertags_frame.rename(columns={'index': 'tag', 0: 'example'})
nertags_frame


In [None]:
bioestags_frame= pd.DataFrame.from_dict(nested_NER_prefixes, orient= 'index')
bioestags_frame

After what we have seen today, how would you describe a named entity? 
Do these named entities tell you anything about the contents of this chapter? 

In this notebook, what inputs did we give Stanza to process?
What outputs does Stanza produce?
How did we access the different components of this output (i.e., pos, lemma, ner, etc.)?

Do you have any remaining questions or things that feel unclear after this demo? 