### Text comprehension through spaCy

This notebook is an example of topic modeling adapted from [this writeup](https://medium.com/@sayahfares19/text-analysis-topic-modelling-with-spacy-gensim-4cd92ef06e06).

It showcases some of the linguistic analysis features that spaCy offers. 

The libraries we will use are:
- `pandas`: for reading in and exporting spreadsheets
- `spacy`: a natural language processing library that offers trained pipelines for many languages

In [1]:
import os 
import pandas as pd

# for comprehension of language
import spacy 
from spacy import displacy


#### Let's load spaCy's English language trained pipeline

A trained pipeline is a package of processing components (such as a tokenizer, part-of-speech tagger, dependency parser, and named entity recognizer) whose statistical models have already been trained on annotated text, so you can apply it to new text right away.

You will need to download one of spaCy's models. You can do so by running this in a notebook cell:
```
!python3 -m spacy download en_core_web_sm
```

In [2]:
# !python3 -m spacy download en_core_web_sm

In [3]:
nlp = spacy.load('en_core_web_sm')

Now we can run the pipeline on a sentence:

In [4]:
sent = nlp("I'm a student who is about to finish graduate school at City University of New York in Manhattan, New York.")


## Computational Linguistics

#### POS-Tagging — (Part Of Speech)
spaCy can show how each word functions in a sentence, which is known as its part of speech (POS). Traditional English grammar distinguishes eight main parts of speech: nouns, pronouns, adjectives, verbs, adverbs, prepositions, conjunctions, and interjections.

Once we have turned our sentence into a spaCy `Doc` object, we can look at the tokens it consists of (individual words and punctuation marks) and at two tags for each token: `token.pos_`, a coarse-grained part of speech, and `token.tag_`, a more fine-grained one.

You can find a list of the tags and their explanations [here](https://melaniewalsh.github.io/Intro-Cultural-Analytics/05-Text-Analysis/13-POS-Keywords.html#spacy-part-of-speech-tagging).
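If you would rather look up a tag without leaving the notebook, `spacy.explain` returns a short description from spaCy's built-in glossary. The snippet below is a small sketch; the list of tags is simply a hand-picked sample of labels that appear in this notebook's output, and no model download is needed to run it.

```python
import spacy

# A hand-picked sample of coarse (pos_) and fine-grained (tag_) labels
tags = ["PRON", "PRP", "AUX", "VBP", "PROPN", "NNP"]

for tag in tags:
    # spacy.explain looks the label up in spaCy's glossary
    print(f"{tag}: {spacy.explain(tag)}")
```

The same function also works for entity labels such as `GPE` or `ORG`.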

In [5]:
for token in sent:
    print(f"The word {token.text} represents this part of speech: {token.pos_} and {token.tag_}")


The word I represents this part of speech: PRON and PRP
The word 'm represents this part of speech: AUX and VBP
The word a represents this part of speech: DET and DT
The word student represents this part of speech: NOUN and NN
The word who represents this part of speech: PRON and WP
The word is represents this part of speech: AUX and VBZ
The word about represents this part of speech: ADJ and JJ
The word to represents this part of speech: PART and TO
The word finish represents this part of speech: VERB and VB
The word graduate represents this part of speech: ADJ and JJ
The word school represents this part of speech: NOUN and NN
The word at represents this part of speech: ADP and IN
The word City represents this part of speech: PROPN and NNP
The word University represents this part of speech: PROPN and NNP
The word of represents this part of speech: ADP and IN
The word New represents this part of speech: PROPN and NNP
The word York represents this part of speech: PROPN and NNP
The word i

#### NER-Tagging — (Named Entity Recognition)
Named-entity recognition (NER) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.

spaCy recognizes the following built-in entity types:
- PERSON - People, including fictional.
- NORP - Nationalities or religious or political groups.
- FAC - Buildings, airports, highways, bridges, etc.
- ORG - Companies, agencies, institutions, etc.
- GPE - Countries, cities, states.
- LOC - Non-GPE locations, mountain ranges, bodies of water.
- PRODUCT - Objects, vehicles, foods, etc. (Not services.)
- EVENT - Named hurricanes, battles, wars, sports events, etc.
- WORK_OF_ART - Titles of books, songs, etc.
- LAW - Named documents made into laws.
- LANGUAGE - Any named language.
- DATE - Absolute or relative dates or periods.
- TIME - Times smaller than a day.
- PERCENT - Percentage, including "%".
- MONEY - Monetary values, including unit.
- QUANTITY - Measurements, as of weight or distance.
- ORDINAL - "first", "second", etc.
- CARDINAL - Numerals that do not fall under another type.

In [6]:
for token in sent:
    print(token.text, token.ent_type_)

I 
'm 
a 
student 
who 
is 
about 
to 
finish 
graduate 
school 
at 
City ORG
University ORG
of ORG
New ORG
York ORG
in 
Manhattan GPE
, 
New GPE
York GPE
. 


The loop above prints a label for every single token. To get each entity as one complete span of text, we can loop over `sent.ents` instead:

In [7]:
for ent in sent.ents:
    print(ent.text, ent.label_)

City University of New York ORG
Manhattan GPE
New York GPE
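To see how the token-level labels relate to the entity spans, here is a rough sketch in plain Python that merges neighboring tokens with the same label. The pairs are copied from the token-level output above (shortened to the second half of the sentence), and `group_entities` is a hypothetical helper written for this illustration. It is a simplification: spaCy itself tracks entity boundaries with `token.ent_iob_`, so it can tell apart two adjacent entities of the same type, which this sketch cannot.

```python
# (token text, entity label) pairs copied from the token-level output;
# an empty string means the token is not part of any entity.
token_labels = [
    ("at", ""), ("City", "ORG"), ("University", "ORG"), ("of", "ORG"),
    ("New", "ORG"), ("York", "ORG"), ("in", ""), ("Manhattan", "GPE"),
    (",", ""), ("New", "GPE"), ("York", "GPE"), (".", ""),
]

def group_entities(pairs):
    """Merge runs of adjacent tokens that share a non-empty label."""
    spans = []
    current_words, current_label = [], ""
    for text, label in pairs:
        if label and label == current_label:
            # same entity continues
            current_words.append(text)
            continue
        if current_words:
            # the previous entity just ended, so store it
            spans.append((" ".join(current_words), current_label))
        # start a new entity, or reset if this token has no label
        current_words, current_label = ([text], label) if label else ([], "")
    if current_words:
        spans.append((" ".join(current_words), current_label))
    return spans

entities = group_entities(token_labels)
for text, label in entities:
    print(text, label)
```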


You can render these entities visually, too:

In [8]:
displacy.render(sent, style='ent', jupyter=True)

### Sentiment analysis
spaCy does not include sentiment analysis out of the box, but extensions can add it to a pipeline. For some rudimentary sentiment analysis we will be using `spacytextblob`.


In [9]:
#!pip3 install spacytextblob

In [10]:
from spacytextblob.spacytextblob import SpacyTextBlob

As before, let's load our model. We'll also load [`spacytextblob`](https://spacy.io/universe/project/spacy-textblob), which is an adaptation of a popular sentiment analysis library called [`textblob`](https://textblob.readthedocs.io/en/latest/quickstart.html). Funnily enough, textblob's sentiment analysis is in turn based on yet another library called pattern.

According to [investigate.ai](https://investigate.ai/investigating-sentiment-analysis/comparing-sentiment-analysis-tools/#TextBlob), "The sentiment analysis lexicon bundled in Pattern focuses on adjectives. It contains adjectives that occur frequently in customer reviews, hand-tagged with values for polarity and subjectivity."

In [11]:
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe('spacytextblob')

<spacytextblob.spacytextblob.SpacyTextBlob at 0x7f92724f4fa0>

Then we can get started. Here's a sample text that we will turn into a spaCy document:

In [12]:
text = 'I had a really horrible day. It was the worst day ever! But every now and then I have a really good day that makes me happy.'
doc = nlp(text)

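Once we have done this we can start analyzing it by:
- polarity: a score from -1 (very negative) to 1 (very positive)
- subjectivity: a score from 0 (very objective) to 1 (very subjective)
- overall sentiment assessment: the individual words and phrases that contributed to the scores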

In [13]:
print(doc._.blob.sentiment)

Sentiment(polarity=-0.125, subjectivity=0.9)


In [14]:
print(doc._.blob.polarity)

-0.125


In [15]:
print(doc._.blob.subjectivity)

0.9


In [16]:
print(doc._.blob.sentiment_assessments.assessments)

[(['really', 'horrible'], -1.0, 1.0, None), (['worst', '!'], -1.0, 1.0, None), (['really', 'good'], 0.7, 0.6000000000000001, None), (['happy'], 0.8, 1.0, None)]
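The overall scores can be reproduced from these assessments. The sketch below assumes that the document-level polarity and subjectivity are plain averages of the per-phrase values; that assumption matches the numbers printed above, but it is a reading of the output rather than a guarantee about how the library works in every case.

```python
import math

# Assessments copied from the output above:
# (matched words, polarity, subjectivity, label)
assessments = [
    (["really", "horrible"], -1.0, 1.0, None),
    (["worst", "!"], -1.0, 1.0, None),
    (["really", "good"], 0.7, 0.6000000000000001, None),
    (["happy"], 0.8, 1.0, None),
]

# Average the per-phrase scores to get one score for the whole text
polarity = sum(a[1] for a in assessments) / len(assessments)
subjectivity = sum(a[2] for a in assessments) / len(assessments)

print(round(polarity, 3), round(subjectivity, 3))
print(math.isclose(polarity, -0.125), math.isclose(subjectivity, 0.9))
```

Note that only four phrases drive the score for the whole text. Words that are not in the lexicon do not show up in the assessments at all.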


### Why do we care? How to apply this to journalism

These tools can help us analyze documents in large quantities. Let's apply this to the article about [Bruno, the thicc cat](https://www.buzzfeednews.com/article/juliareinstein/this-thicc-lazy-high-maintenance-incredibly-well-hydrated). 

Let's say our goal is to analyze every word in the article, and we want to know the following for each word:
- what its named entity is
- what its sentiment is

For this exercise you will need to:
- Create a word list (hint: look at last week!) ← you’ll do this alone
- Create a new column for the named entity:
  - Use the `.apply()` function to apply a custom function to the column containing the individual words
- Create a column for the polarity score:
  - Use the `.apply()` function to apply a custom function to the column containing the individual words


##### Open the text:

In [17]:
# Load the English pipeline (tokenizer, tagger, parser, NER) and add the sentiment component
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe('spacytextblob')

# open the text file, read it into a string, and close the file again
with open("../data/text.txt", "r") as f:
    text = f.read()
len(text) # this returns the number of characters, including spaces

2990

##### Turn text into nlp object

In [18]:
doc = nlp(text)
len(doc) # this returns the number of tokens

724
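The two `len()` calls measure different things: characters for a string, tokens for a `Doc`. The sketch below illustrates the difference on a short made-up sentence (not taken from the article). It uses `spacy.blank("en")`, which contains only the tokenizer and needs no model download; its token boundaries may differ slightly from those of `en_core_web_sm`.

```python
import spacy

# A blank English pipeline contains only the tokenizer
blank_nlp = spacy.blank("en")

# Made-up sample sentence for illustration
sample = "This is Bruno, and he is a very large cat."
sample_doc = blank_nlp(sample)

print(len(sample))      # number of characters, spaces included
print(len(sample_doc))  # number of tokens, punctuation included
print([token.text for token in sample_doc])
```

Punctuation marks count as tokens of their own, which is why the comma later shows up as a separate row in the word list.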

##### Get sentiment score

In [19]:
print(doc._.blob.sentiment)

Sentiment(polarity=0.10191029694706165, subjectivity=0.43493619670090267)


##### Turn the article into a list of words

In [20]:
rows = []
for token in doc:
    rows.append(token.text)

In [21]:
word_dataframe = pd.DataFrame(rows)
word_dataframe.columns = ['word']
word_dataframe.head()

    word
0   This
1     is
2  Bruno
3      ,
4    and


##### Create a new column called "ner" and find the named entity for each word

Make sure you apply a function to this column. 

Here's the syntax of a function: 
```
def addnumbers(x):
    new_number = 2 + x
    return new_number
```
Here's how you apply it:
```
df["column"].apply(addnumbers)
```
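Put together, the two snippets above might look like this on a tiny dataframe. The dataframe `df` and the new column name `column_plus_two` are made up for illustration.

```python
import pandas as pd

def addnumbers(x):
    new_number = 2 + x
    return new_number

# Hypothetical one-column dataframe, made up for illustration
df = pd.DataFrame({"column": [1, 2, 3]})

# .apply() calls addnumbers once per value and collects the results
df["column_plus_two"] = df["column"].apply(addnumbers)
print(df)
```

Note that you pass the function's name without parentheses: `.apply()` does the calling for you.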

In [22]:
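def getner(x):
    # run the pipeline on a single word
    doc = nlp(x)
    # return the label of the first entity found, if any
    for ent in doc.ents:
        return ent.label_
    # no entity was recognized for this word
    return None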

In [23]:
word_dataframe["ner"] = word_dataframe["word"].apply(getner)
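One caveat: `getner` runs the pipeline on one isolated word at a time, so the model never sees the surrounding sentence it normally relies on to recognize entities. A possible alternative, sketched below, is to read `token.ent_type_` from tokens that were processed as part of the full text. To keep the snippet runnable without a model, the `SimpleNamespace` objects and their labels are made-up stand-ins for real spaCy tokens; with the notebook's `doc` you would loop over `doc` instead.

```python
from types import SimpleNamespace

import pandas as pd

# Made-up stand-ins for spaCy tokens; the labels are hypothetical
tokens = [
    SimpleNamespace(text="This", ent_type_=""),
    SimpleNamespace(text="is", ent_type_=""),
    SimpleNamespace(text="Bruno", ent_type_="PERSON"),
    SimpleNamespace(text=",", ent_type_=""),
    SimpleNamespace(text="and", ent_type_=""),
]

# One row per token; an empty label becomes a missing value
rows = [{"word": t.text, "ner": t.ent_type_ or None} for t in tokens]
context_dataframe = pd.DataFrame(rows)
print(context_dataframe)
```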

##### Summarize the dataframe of words via the "ner" column

In [24]:
word_dataframe["ner"].value_counts()

ner
PERSON      8
CARDINAL    3
GPE         3
ORG         3
DATE        1
Name: count, dtype: int64
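These counts add up to far fewer rows than the dataframe has, because `value_counts()` skips missing values by default and most words have no entity label. The sketch below shows how to include them; the small `ner` series is made up for illustration.

```python
import pandas as pd

# Hypothetical "ner" column: None marks words without an entity
ner = pd.Series(["PERSON", None, None, "GPE", "PERSON", None], name="ner")

print(ner.value_counts())              # entity labels only
print(ner.value_counts(dropna=False))  # also counts the missing values
print(ner.notna().mean())              # share of words that got a label
```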

In [25]:
word_dataframe["ner"].value_counts().to_csv("../output/entities.csv")