### Text comprehension through spaCy

This notebook is an example of topic modeling adapted from [this writeup](https://medium.com/@sayahfares19/text-analysis-topic-modelling-with-spacy-gensim-4cd92ef06e06).

It showcases some of the linguistic analysis features that spaCy offers. 

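The libraries we will use are:
- `pandas`: for reading in and exporting spreadsheets
- `spacy`: a natural language processing library that ships trained pipelines for many languages
- `gensim`: for the topic models (LSI, HDP, LDA) used later in the notebook
- `pyLDAvis`: for the interactive visualization of the LDA topics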

In [1]:
import os 
import pandas as pd

# for comprehension of language
import spacy 
from spacy import displacy

#### Let's load spaCy's English language trained pipeline

A trained pipeline bundles the components spaCy runs on a text, such as a tokenizer, a part-of-speech tagger, a dependency parser, a lemmatizer and a named entity recognizer, together with the model weights those components need. Calling the loaded pipeline on a string returns a processed `Doc` object.

You will need to download one of spaCy's trained pipelines first. You can do so by running this in a notebook cell:
```
!python3 -m spacy download en_core_web_sm
```

In [2]:
# !python3 -m spacy download en_core_web_sm

In [3]:
nlp = spacy.load('en_core_web_sm')
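
To get a feel for what "pipeline" means here, the following minimal sketch builds a blank English pipeline and adds one built-in component to it. It uses `spacy.blank` rather than `en_core_web_sm`, so it runs without any download; the names `blank_nlp` and `doc` are only for illustration. On the pipeline loaded above, `nlp.pipe_names` lists its components in the same way.

```python
import spacy

# A blank English pipeline: it has a tokenizer but no trained components.
blank_nlp = spacy.blank("en")
print(blank_nlp.pipe_names)  # nothing has been added yet

# Add spaCy's built-in rule-based sentence splitter.
blank_nlp.add_pipe("sentencizer")
print(blank_nlp.pipe_names)

# Running the pipeline returns a Doc; the sentencizer fills in doc.sents.
doc = blank_nlp("I study in Manhattan. I graduate soon.")
for sentence in doc.sents:
    print(sentence.text)
```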

Now we can run the pipeline on a sentence. The result is a `Doc` object that holds the tokens and their annotations.

In [4]:
sent = nlp("I'm a student who is about to finish graduate school at City University of New York in Manhattan, New York.")


## Computational Linguistics

#### POS-Tagging — (Part Of Speech)
spaCy has a nifty way to look at how each word is used in a sentence, known as its part of speech (POS). Traditional English grammar names eight main parts of speech: nouns, pronouns, adjectives, verbs, adverbs, prepositions, conjunctions, and interjections. spaCy's tag sets are more detailed than that.

Once we have turned our sentence into a spaCy `Doc` object, we can look at the tokens it consists of (individual words and punctuation marks) and at two tags per token: `token.pos_` gives the coarse-grained part of speech and `token.tag_` gives a more fine-grained tag.

You can find the tags and their explanations [here](https://melaniewalsh.github.io/Intro-Cultural-Analytics/05-Text-Analysis/13-POS-Keywords.html#spacy-part-of-speech-tagging).
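
The tag abbreviations can also be looked up from within the notebook. The sketch below uses `spacy.explain`, which returns a short description for labels in spaCy's glossary. The list of labels is just a hand-picked sample of tags that show up for our sentence.

```python
import spacy

# A hand-picked sample of coarse (pos_) and fine-grained (tag_) labels.
sample_labels = ["PRON", "PRP", "AUX", "VBP", "PROPN", "NNP"]

# spacy.explain looks each label up in spaCy's built-in glossary.
explanations = {label: spacy.explain(label) for label in sample_labels}
for label, meaning in explanations.items():
    print(f"{label:>5}: {meaning}")
```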

In [5]:
for token in sent:
    print(f"The word {token.text} represents this part of speech: {token.pos_} and {token.tag_}")


The word I represents this part of speech: PRON and PRP
The word 'm represents this part of speech: AUX and VBP
The word a represents this part of speech: DET and DT
The word student represents this part of speech: NOUN and NN
The word who represents this part of speech: PRON and WP
The word is represents this part of speech: AUX and VBZ
The word about represents this part of speech: ADJ and JJ
The word to represents this part of speech: PART and TO
The word finish represents this part of speech: VERB and VB
The word graduate represents this part of speech: ADJ and JJ
The word school represents this part of speech: NOUN and NN
The word at represents this part of speech: ADP and IN
The word City represents this part of speech: PROPN and NNP
The word University represents this part of speech: PROPN and NNP
The word of represents this part of speech: ADP and IN
The word New represents this part of speech: PROPN and NNP
The word York represents this part of speech: PROPN and NNP
The word i

#### NER-Tagging — (Named Entity Recognition)
Named-entity recognition (NER) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.



In [6]:
for token in sent:
    print(token.text, token.ent_type_)

I 
'm 
a 
student 
who 
is 
about 
to 
finish 
graduate 
school 
at 
City ORG
University ORG
of ORG
New ORG
York ORG
in 
Manhattan GPE
, 
New GPE
York GPE
. 


The loop above labels each token separately. To get each named entity as a single span with its label, we can iterate over `sent.ents` instead:

In [7]:
for ent in sent.ents:
    print(ent.text, ent.label_)

City University of New York ORG
Manhattan GPE
New York GPE
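
To make the link between the per-token labels and the entity spans explicit, here is a small sketch that sets the entities by hand on a blank pipeline (so no trained model is needed) and prints each token's position inside its entity. The sentence, the token indices and the name `toy_doc` are made up for illustration. On the real `sent` the labels come from the trained model.

```python
import spacy
from spacy.tokens import Span

blank_nlp = spacy.blank("en")  # tokenizer only, no entity recognizer
toy_doc = blank_nlp("She studies at City University of New York in Manhattan.")

# Token indices: She(0) studies(1) at(2) City(3) University(4) of(5)
# New(6) York(7) in(8) Manhattan(9) .(10)
toy_doc.ents = [
    Span(toy_doc, 3, 8, label="ORG"),   # City University of New York
    Span(toy_doc, 9, 10, label="GPE"),  # Manhattan
]

# ent_iob_ is "B" at the start of an entity, "I" inside it, "O" outside.
for token in toy_doc:
    print(token.text, token.ent_iob_, token.ent_type_)
```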


You can render the named entities visually, too:

In [8]:
displacy.render(sent, style='ent', jupyter=True)
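
Outside of Jupyter, `displacy.render` can return the markup as a string instead of displaying it. The sketch below assumes you want that string, for example to save it to a file. It uses displaCy's manual mode with hand-written entity offsets so that it runs without a trained model; the text and the names `example` and `html` are illustrative only.

```python
from spacy import displacy

text = "City University of New York in Manhattan"
org = "City University of New York"
gpe = "Manhattan"

# Manual mode takes character offsets instead of a processed Doc.
example = {
    "text": text,
    "ents": [
        {"start": text.index(org), "end": text.index(org) + len(org), "label": "ORG"},
        {"start": text.index(gpe), "end": text.index(gpe) + len(gpe), "label": "GPE"},
    ],
    "title": None,
}

# jupyter=False makes render return the HTML string rather than display it.
html = displacy.render(example, style="ent", manual=True, jupyter=False)
print(len(html), "characters of HTML")
```

The returned string can then be written to an `.html` file with ordinary Python file handling.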

#### Dependency Parsing
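Dependency parsing (DP) is the process of working out the grammatical structure of a sentence by linking each word to the word it depends on (its head) and labeling the relation between them. It's a way to understand how the words in a sentence relate to each other. The first cell below looks at noun chunks, which are noun phrases built on top of the dependency parse.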

In [9]:
for chunk in sent.noun_chunks:
    # each noun chunk is a span; its root token carries the dependency label
    print(f"the chunk '{chunk.text}' has the root '{chunk.root.text}' and has this grammatical function '{chunk.root.dep_}'")
    # the head is the token that the chunk's root depends on
    print(f"head of the root: {chunk.root.head.text}")

the chunk 'I' has the root 'I' and has this grammatical function 'nsubj'
head of the root: 'm
the chunk 'a student' has the root 'student' and has this grammatical function 'attr'
head of the root: 'm
the chunk 'who' has the root 'who' and has this grammatical function 'nsubj'
head of the root: is
the chunk 'graduate school' has the root 'school' and has this grammatical function 'dobj'
head of the root: finish
the chunk 'City University' has the root 'University' and has this grammatical function 'pobj'
head of the root: at
the chunk 'New York' has the root 'York' and has this grammatical function 'pobj'
head of the root: of
the chunk 'Manhattan' has the root 'Manhattan' and has this grammatical function 'pobj'
head of the root: in
the chunk 'New York' has the root 'York' and has this grammatical function 'appos'
head of the root: Manhattan


In [10]:
for token in sent:
    # text, dependency label, head text, head POS, and the token's direct children
    print(token.text, token.dep_, token.head.text, token.head.pos_,
          list(token.children))

I nsubj 'm AUX []
'm ROOT 'm AUX [I, student, .]
a det student NOUN []
student attr 'm AUX [a, is]
who nsubj is AUX []
is relcl student NOUN [who, about]
about acomp is AUX [finish]
to aux finish VERB []
finish xcomp about ADJ [to, school]
graduate compound school NOUN []
school dobj finish VERB [graduate, at]
at prep school NOUN [University]
City compound University PROPN []
University pobj at ADP [City, of, in]
of prep University PROPN [York]
New compound York PROPN []
York pobj of ADP [New]
in prep University PROPN [Manhattan]
Manhattan pobj in ADP [,, York]
, punct Manhattan PROPN []
New compound York PROPN []
York appos Manhattan PROPN [New]
. punct 'm AUX []
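
Because every token points to its head, you can walk from any word up to the root of the sentence. The sketch below shows that idea on a tiny hand-annotated sentence, so it runs without a trained parser. The words, heads and labels are written by hand for illustration, and `path_to_root` is a hypothetical helper, not part of spaCy. The same function should work on the tokens of `sent`.

```python
import spacy
from spacy.tokens import Doc

vocab = spacy.blank("en").vocab
words = ["I", "finish", "graduate", "school"]
heads = [1, 1, 3, 1]  # index of each token's head; the root points to itself
deps = ["nsubj", "ROOT", "compound", "dobj"]
toy_doc = Doc(vocab, words=words, heads=heads, deps=deps)


def path_to_root(token):
    """Return the token texts from `token` up to the root of its sentence."""
    path = [token.text]
    while token.head.i != token.i:  # the root is its own head
        token = token.head
        path.append(token.text)
    return path


for token in toy_doc:
    print(token.text, token.dep_, " -> ".join(path_to_root(token)))
```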


In [11]:
displacy.render(sent, style='dep', jupyter=True, options={'distance':90})


## Data cleaning
Before modeling topics, the text is usually 'normalized': words are reduced to their lemmas (dictionary forms), and stop words and punctuation marks are removed. The cell below shows the lemma spaCy assigns to each token in our sentence.

In [12]:
# here's a demo of us cycling through the tokens and printing each one's lemma
for word in sent:
    print(f"the lemma for the word {word} is {word.lemma_}")

the lemma for the word I is I
the lemma for the word 'm is be
the lemma for the word a is a
the lemma for the word student is student
the lemma for the word who is who
the lemma for the word is is be
the lemma for the word about is about
the lemma for the word to is to
the lemma for the word finish is finish
the lemma for the word graduate is graduate
the lemma for the word school is school
the lemma for the word at is at
the lemma for the word City is City
the lemma for the word University is University
the lemma for the word of is of
the lemma for the word New is New
the lemma for the word York is York
the lemma for the word in is in
the lemma for the word Manhattan is Manhattan
the lemma for the word , is ,
the lemma for the word New is New
the lemma for the word York is York
the lemma for the word . is .


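In [13]:
from gensim.models import LsiModel

# assumes `corpus` (bag-of-words vectors) and `dictionary` (a gensim Dictionary) were built from the cleaned texts
lsi_model = LsiModel(corpus=corpus, num_topics=10, id2word=dictionary)
lsi_model.show_topics(num_topics=10)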

#### HDP — Hierarchical Dirichlet Process
HDP, the Hierarchical Dirichlet Process, is an unsupervised topic model that infers the number of topics from the data instead of taking it as a parameter.

In [None]:
from gensim.models import HdpModel

# no num_topics argument: HDP infers the number of topics
hdp_model = HdpModel(corpus=corpus, id2word=dictionary)
hdp_model.show_topics()[:5]

#### LDA — Latent Dirichlet Allocation
LDA, or Latent Dirichlet Allocation, is arguably the most famous topic modeling algorithm out there. Here we create a simple topic model with 5 topics.

In [None]:
from gensim.models import LdaModel

lda_model = LdaModel(corpus=corpus, num_topics=5, id2word=dictionary)
lda_model.show_topics()

In [None]:
# for visualizations
import pyLDAvis
import pyLDAvis.gensim_models

pyLDAvis.enable_notebook()


In [None]:
vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary)


In [None]:
pyLDAvis.save_html(vis, "../output/topics_modeling_demure.html")

In [None]:
# vis

In [None]:
# assumes `texts` is a flat list of cleaned words, so the frame has a single column
words_influencer = pd.DataFrame(texts)
print(len(words_influencer))
words_influencer.columns = ["word"]
words_influencer.head()


In [None]:
word_tally = words_influencer["word"].value_counts()
word_tally.head()
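
As a small illustration of what `value_counts` does here, the sketch below tallies a made-up list of words. The `toy_` names are illustrative only.

```python
import pandas as pd

# A made-up flat list of cleaned words.
toy_words = ["school", "student", "school", "city", "student", "school"]
toy_frame = pd.DataFrame({"word": toy_words})

# value_counts returns a Series indexed by word, sorted from most to least frequent.
toy_tally = toy_frame["word"].value_counts()
print(toy_tally)
```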

In [None]:
word_tally.to_csv("../output/word_tally_demure.csv")