## I. Setup, Installations, and Imports

#### (a) Installations

Run these if not on your computer already.

In [None]:
# ! pip install spacy
# ! pip install nltk
# ! python -m spacy download en_core_web_sm
# ! pip install svgling
# ! python -m pip install textacy
# ! pip install stanza

#### (b) Imports

In [1]:
import torch;
import stanza; stanza.download('en') # This downloads the English models for the neural pipelin
nlp = stanza.Pipeline('en',download_method=None) # This sets up a default neural pipeline in English

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.5.0.json:   0%|   …

2023-03-29 18:09:57 INFO: Downloading default packages for language: en (English) ...
2023-03-29 18:09:58 INFO: File exists: /Users/mikerich/stanza_resources/en/default.zip
2023-03-29 18:10:02 INFO: Finished downloading models and saved to /Users/mikerich/stanza_resources.
2023-03-29 18:10:02 INFO: Loading these models for language: en (English):
| Processor    | Package   |
----------------------------
| tokenize     | combined  |
| pos          | combined  |
| lemma        | combined  |
| constituency | wsj       |
| depparse     | combined  |
| sentiment    | sstplus   |
| ner          | ontonotes |

2023-03-29 18:10:02 INFO: Using device: cpu
2023-03-29 18:10:02 INFO: Loading: tokenize
2023-03-29 18:10:02 INFO: Loading: pos
2023-03-29 18:10:02 INFO: Loading: lemma
2023-03-29 18:10:02 INFO: Loading: constituency
2023-03-29 18:10:02 INFO: Loading: depparse
2023-03-29 18:10:03 INFO: Loading: sentiment
2023-03-29 18:10:03 INFO: Loading: ner
2023-03-29 18:10:03 INFO: Done loading proces

In [2]:
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer
import glob
from tqdm import tqdm
from bs4 import BeautifulSoup
import spacy
from __future__ import unicode_literals
import spacy,en_core_web_sm
import textacy

#### (c) Downloads

Run these if not on your computer already.

In [3]:
nltk.download('words')
nltk.download('punkt')
nltk.download('maxent_ne_chunker')
nltk.download('averaged_perceptron_tagger')
nltk.download('state_union')

[nltk_data] Downloading package words to /Users/mikerich/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package punkt to /Users/mikerich/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /Users/mikerich/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/mikerich/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package state_union to
[nltk_data]     /Users/mikerich/nltk_data...
[nltk_data]   Package state_union is already up-to-date!


True

In [None]:
# nltk.download('words')
# nltk.download('punkt')
# nltk.download('maxent_ne_chunker')
# nltk.download('averaged_perceptron_tagger')
# nltk.download('state_union')

## II. Defined Functions Used In Program


#### (a) Named Entity Recognizer Function
**Input:** A sentence <br> 
**Output:** A dictionary in the form of: <br>
&emsp;**{'entities':** [list of named entity recognizers of that sentence], **'verbs':** [list of the verbs], **'subjects':** [list of the subjects]**, 'subj_verb_linkages',** [root subject verb linkage array]**}**

In [4]:
def ner(sentence): 
    
    entities = []
    verbs = []
    subjects = []
    subj_verb_linkages = []
    
    #Find the entities in the sentence
    words  = nltk.word_tokenize(sentence)        # break down the sentence into words
    tagged = nltk.pos_tag(words)                 # tag the words with Part of Speech 
    chunks = nltk.ne_chunk(tagged, binary=False) # binary = False named entities are classified (i.e PERSON, ORGANIZATION)
    
    for chunk in chunks:
        if hasattr(chunk, 'label'):              # hasattr(obj, key) -- checking if chunks have a label or not 
            entities.append(' '.join(c[0] for c in chunk)) # append entities to array
    
    
    #Find the verbs/subjects in the sentence
    nlp = spacy.load("en_core_web_sm")           # load in the spacy model
    doc = nlp(sentence)                          # create spacy doc object
    
    verbs = [token.text for token in doc if token.pos_ == "VERB"]     # traverse thru the tokens, find the verbs
    subjects = [token.text for token in doc if token.dep_ == "nsubj"]  # traverse thru the tokens, find the subjects
    
    
    #Find the Root Subject-verb linkages in the sentences using stanza
    nlp = stanza.Pipeline('en', download_method=None)    # this sets up a default neural pipeline in English
    doc = nlp(sentence)
     
    root_verb = None
    subject = None
    
    for word in doc.sentences[0].words:    # for each word in the sentence
        if word.deprel == 'root':          # if a word is the root, its dependency relation label is 'root'. thus if this is true, the curr word = root word

            root_verb = word.text          # save the root verb  
            root_id = word.id              # get the words id 
            
            for w in doc.sentences[0].words:                      # loop over words in the sentence
                if w.head == root_id and w.deprel == 'nsubj':     # if words head attribute = root_id, then its a a direct dependent of the root (is a child of the root)
                    subject = w.text 
                
    subj_verb_linkages = [subject, root_verb]   # subj_verb linkages array 
    
        
    return {'entities':entities, 
            'verbs':verbs,
            'subjects':subjects,
            'subj_verb_linkages':subj_verb_linkages} 

#### (b) Filename Traversal Function
**Input:** A Filename **(i.e /inputs/ex10.txt)** <br>
**Output:** A dict of dicts: <br>
&emsp;&emsp;&emsp;&emsp;**{sentence: {sentence_level_outputs}}** <br>
&emsp;&emsp;where sentence_level_outputs <br>
&emsp;&emsp;&emsp;&emsp;**{'analysis type/function' : output thereof}**


In [5]:
def doc_trawl(filename):

    file_output = {}
    
    with open(filename, "r") as fp:
        raw = BeautifulSoup(fp.read(), 'html.parser').get_text()
        raw_sentences = nltk.sent_tokenize(raw)
    
    for sentence in raw_sentences:
        
        # put all output of this sentence here 
        # key=analysis type/function, value=output thereof
        sentence_level_outputs = {} 
        
        # use ner function  
        sentence_level_outputs.update(ner(sentence))
        
        # any other output we want to add that doesn't rely on the ner tokenization
        # should be done here
        # to show that the plumbing works correctly, let's add variable 2:
        sentence_level_outputs['random_num'] = np.random.uniform()
        
        # Add to output dictionary
        file_output.update({sentence:sentence_level_outputs})
        
    return file_output

## III. Automation

In [6]:
file_sentence_dict = {}
files = glob.glob("inputs/*") #get all the files in the inputs folder

for file in tqdm(files,total=len(files)):
    file_sentence_dict.update({file: doc_trawl(file)}) #update the dictionary 

  0%|                                                     | 0/2 [00:00<?, ?it/s]2023-03-29 18:10:19 INFO: Loading these models for language: en (English):
| Processor    | Package   |
----------------------------
| tokenize     | combined  |
| pos          | combined  |
| lemma        | combined  |
| constituency | wsj       |
| depparse     | combined  |
| sentiment    | sstplus   |
| ner          | ontonotes |

2023-03-29 18:10:19 INFO: Using device: cpu
2023-03-29 18:10:19 INFO: Loading: tokenize
2023-03-29 18:10:19 INFO: Loading: pos
2023-03-29 18:10:19 INFO: Loading: lemma
2023-03-29 18:10:19 INFO: Loading: constituency
2023-03-29 18:10:20 INFO: Loading: depparse
2023-03-29 18:10:20 INFO: Loading: sentiment
2023-03-29 18:10:20 INFO: Loading: ner
2023-03-29 18:10:21 INFO: Done loading processors!
2023-03-29 18:10:27 INFO: Loading these models for language: en (English):
| Processor    | Package   |
----------------------------
| tokenize     | combined  |
| pos          | combined 

UnpicklingError: pickle data was truncated

## IV. Unpacking that into DF

Dataframe with 
- index is filename-sentence
- columns are sentence level variables

Now we can do diagnostics, examine the output, and use it faster!

In [None]:
def unpack_tri_level_dict(a_dict):
    df = pd.concat(map(lambda x: pd.DataFrame.from_dict(x).T, a_dict.values()), keys=a_dict.keys())
    df.index = df.index.rename(['file','sentence'])
    return df

unpack_tri_level_dict(file_sentence_dict)
