### Text comprehension through spaCy

This notebook is an example of topic modeling adapted from [this writeup](https://medium.com/@sayahfares19/text-analysis-topic-modelling-with-spacy-gensim-4cd92ef06e06).

It showcases some of the linguistic analysis features that spaCy offers. 

The libraries we will use are:
- `pandas`: for reading in and exporting spreadsheets
- `spacy`: a natural language processing library that offers trained pipelines for many languages

#### Let's load spaCy's English language trained pipeline

In spaCy, a trained pipeline is a downloadable package that bundles a tokenizer with trained components such as a part-of-speech tagger, a dependency parser, and a named entity recognizer.

You will need to download one of spaCy's trained pipelines. You can do so by running this in a notebook cell:
```
!python3 -m spacy download en_core_web_sm
```

In [None]:
# !python3 -m spacy download en_core_web_sm

Now we can load the model.

## Computational Linguistics

#### POS-Tagging — (Part Of Speech)
spaCy has a nifty way to look into how each word is used in a sentence, which is known as the word's part of speech (POS). There are eight main parts of speech: nouns, pronouns, adjectives, verbs, adverbs, prepositions, conjunctions, and interjections.

Once we have turned our sentence into a spaCy `Doc` object, we can look at the tokens (the individual words and punctuation marks) it consists of. Each token carries two kinds of tags that describe its part of speech: a coarse-grained one (`token.pos_`) and a more granular one (`token.tag_`).

You can find the various tags and their explanations [here](https://melaniewalsh.github.io/Intro-Cultural-Analytics/05-Text-Analysis/13-POS-Keywords.html#spacy-part-of-speech-tagging).

#### NER-Tagging — (Named Entity Recognition)
Named-entity recognition (NER) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.

spaCy recognizes the following built-in entity types:
- PERSON - People, including fictional.
- NORP - Nationalities or religious or political groups.
- FAC - Buildings, airports, highways, bridges, etc.
- ORG - Companies, agencies, institutions, etc.
- GPE - Countries, cities, states.
- LOC - Non-GPE locations, mountain ranges, bodies of water.
- PRODUCT - Objects, vehicles, foods, etc. (Not services.)
- EVENT - Named hurricanes, battles, wars, sports events, etc.
- WORK_OF_ART - Titles of books, songs, etc.
- LAW - Named documents made into laws.
- LANGUAGE - Any named language.
- DATE - Absolute or relative dates or periods.
- TIME - Times smaller than a day.
- PERCENT - Percentage, including "%".
- MONEY - Monetary values, including unit.
- QUANTITY - Measurements, as of weight or distance.
- ORDINAL - "first", "second", etc.
- CARDINAL - Numerals that do not fall under another type.

Let's run a slightly different version of this code to see which entity label spaCy assigns to each entity it finds:

You can render these entities visually, too, with spaCy's built-in `displacy` visualizer:

### Sentiment analysis
spaCy can also do sentiment analysis through extensions to its library. For some rudimentary sentiment analysis we will be using `spacytextblob`.


In [None]:
#!pip3 install spacytextblob

As before, let's load our model. We'll also load [`spacytextblob`](https://spacy.io/universe/project/spacy-textblob), which is an adaptation of a popular library for sentiment analysis called [`textblob`](https://textblob.readthedocs.io/en/latest/quickstart.html). And funnily enough, textblob's sentiment analysis is based on yet another library called pattern.

According to [investigate.ai](https://investigate.ai/investigating-sentiment-analysis/comparing-sentiment-analysis-tools/#TextBlob), "The sentiment analysis lexicon bundled in Pattern focuses on adjectives. It contains adjectives that occur frequently in customer reviews, hand-tagged with values for polarity and subjectivity."

Then we can get started. Here's a sample text that we will turn into a spaCy document:

Once we have done this we can start analyzing it by:
- polarity
- subjectivity
- overall sentiment assessment

### Why do we care? How to apply this to journalism

These tools can help us analyze large quantities of documents. Let's apply this to the article about [Bruno, the thicc cat](https://www.buzzfeednews.com/article/juliareinstein/this-thicc-lazy-high-maintenance-incredibly-well-hydrated).

Let's say our goal is to analyze every word in the article, and we want to know the following for each word:
- what its named entity is
- what its sentiment is

For this exercise you will need to:
- Create a word list (hint: look at last week!) ← you’ll do this alone
- Create a new column for the named entity:
  - Use the `.apply()` function to apply a custom function to the column containing the individual words
- Create a column for the polarity score:
  - Use the `.apply()` function to apply a custom function to the column containing the individual words


##### Open the text:

##### Turn text into nlp object

##### Get sentiment score

##### Turn article into array of words

##### Create a new column called "ner" and find the named entity for each word

Make sure you apply a function to this column. 

Here's the syntax of a function: 
```
def addnumbers(x):
    # add 2 to whatever value is passed in
    new_number = 2 + x
    return new_number
```
Here's how you apply it:
```
df["column"].apply(addnumbers)
```

##### Summarize the dataframe of words via the "ner" column
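One possible way to summarize, sketched on a small hand-written dataframe so that it runs without a model. The words and labels below are made up, not real spaCy output.

```python
import pandas as pd

# Hand-written example data standing in for the real word list.
df = pd.DataFrame({
    "word": ["Bruno", "loves", "Chicago", "today", "and", "Paris"],
    "ner": ["PERSON", "none", "GPE", "DATE", "none", "GPE"],
})

# How many words fall under each entity label?
counts = df["ner"].value_counts()
print(counts)

# Which words belong to each label?
grouped = df.groupby("ner")["word"].apply(list)
print(grouped)
```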