<a href="https://colab.research.google.com/github/jkwinch/ProgrammingAssignment2/blob/master/IntroToSpacy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Introduction to spaCy**

# How to install spaCy
(Ref: https://spacy.io/usage)
Run the following commands one at a time.  
These need to be executed only once on your computer.  

In [None]:
!pip install -U pip setuptools wheel
# -U means upgrade to the latest version. If this gives errors, try it without the -U.

In [None]:
# For Windows or Mac with Intel processor:
!pip install -U spacy

In [None]:
# For Mac with ARM/M1 processor:
!pip install -U spacy 'spacy[apple]'

To process text, we need to download an English pipeline package. A pipeline package contains the data and the statistical models for a specific language.
We will use **en_core_web_sm**, a small English pipeline with core capabilities that was trained on web text. It needs to be downloaded only once on your computer.   

In [None]:
!python -m spacy download en_core_web_sm
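
A quick way to confirm that the installation worked is to import the package and print its version. The sketch below also calls `spacy.util.is_package` to report whether `en_core_web_sm` is installed as a Python package. The variable names are chosen here for illustration only.

```python
import spacy
from spacy.util import is_package

# Version of the installed spaCy library
spacy_version = spacy.__version__
print("spaCy version:", spacy_version)

# True if the pipeline package was installed by the download command
model_installed = is_package("en_core_web_sm")
print("en_core_web_sm installed:", model_installed)
```

If the second line prints `False`, rerun the download command before continuing.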

# Get Started

In [None]:
import spacy

The following command loads the small English pipeline and stores it in an object named 'nlp'. This nlp object can then be used to process text data.

In [None]:
nlp = spacy.load('en_core_web_sm')

To process a text, we first construct a Doc object from it by calling nlp on the text.  
A Doc object is a sequence of Token objects. Each Token object holds information about one piece of the text, such as a word or a punctuation mark. We store the result in a variable named *doc*.

In [None]:
text = "Children made tasty snacks."

In [None]:
doc = nlp(text)

In [None]:
token_list = []
for token in doc:
    token_list.append(token.text)
token_list

In [None]:
# Shorthand: the same list built with a list comprehension
[token.text for token in doc]
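
Because a Doc behaves like a sequence, it also supports `len()`, indexing, and slicing. The sketch below illustrates this on the same sentence. It uses `spacy.blank("en")`, a tokenizer-only English pipeline. That is an assumption made here so the example runs without the downloaded model; the `nlp` object loaded earlier supports the same operations.

```python
import spacy

# Tokenizer-only English pipeline (no trained components needed for this demo)
blank_nlp = spacy.blank("en")
demo_doc = blank_nlp("Children made tasty snacks.")

print(len(demo_doc))                    # number of tokens, punctuation included

first_token = demo_doc[0]               # indexing returns a Token
print(first_token.text, first_token.i)  # token text and its position in the Doc

middle = demo_doc[1:3]                  # slicing returns a Span
print(middle.text)

# Lexical flags such as is_punct are available on every token
punct_flags = [(token.text, token.is_punct) for token in demo_doc]
print(punct_flags)
```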

# Lemmatization   
A lemma is the base form or root form of a word. Lemmatization is the process of reducing word forms to their lemma.  
Example: The words "running," "ran," and "runs" all have the lemma "run."  

In [None]:
for token in doc:
    print(token.text, token.lemma_)

# Parts of Speech Tagging
spaCy can predict the part of speech of each word in the text, such as noun, pronoun, verb, adjective, or adverb. Use .pos_ for this. The trailing underscore returns the tag as text; without it, .pos returns an integer ID.

In [None]:
for token in doc:
    print(token.text, token.pos_)

In [None]:
# With f-string formatting for aligning columns of text
for token in doc:
    print(f"{token.text:<12}{token.pos_:<10}")
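
The difference between `.pos` and `.pos_` can be seen directly. In this sketch the tags are assigned by hand when the Doc is constructed. That is an assumption for illustration, so the output does not depend on a model's predictions; with the trained pipeline the same attributes are filled in automatically.

```python
import spacy
from spacy.tokens import Doc
from spacy.symbols import NOUN

vocab = spacy.blank("en").vocab

# Hand-tagged Doc: the pos values are supplied here, not predicted
tagged_doc = Doc(
    vocab,
    words=["Children", "made", "tasty", "snacks", "."],
    pos=["NOUN", "VERB", "ADJ", "NOUN", "PUNCT"],
)

token = tagged_doc[0]
print(token.pos)    # integer ID used internally
print(token.pos_)   # readable label

# The integer ID matches the constant exported by spacy.symbols
ids_match = token.pos == NOUN
print(ids_match)
```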


### Common Parts of Speech
- NOUN - Noun: A person, place, thing, or idea
- PRON - Pronoun: A word that substitutes for a noun or noun phrase
- VERB - Verb: Expresses action or state of being
- ADJ - Adjective: Modifies or describes a noun or pronoun
- ADV - Adverb: Modifies or describes a verb, adjective, or another adverb
- ADP - Adposition: Prepositions and postpositions
- CCONJ - Coordinating conjunction: Connects words, phrases, or clauses of equal rank
- DET - Determiner: Specifies the kind of reference a noun or noun phrase has
- NUM - Numeral: Expresses numbers and quantities
- PART - Particle: Function words that must be associated with another word or phrase to impart meaning
- PUNCT - Punctuation: Marks like commas, periods, and question marks
- SYM - Symbol: Represents a mathematical, scientific, or currency symbol
- X - Other: A token that does not fit into any other POS category
- SPACE - Space: A space character or sequence of space characters (not visible in text)
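
Tag abbreviations can also be looked up from within Python using `spacy.explain`, which returns a short description for labels in spaCy's built-in glossary. A minimal sketch (the variable names are illustrative):

```python
import spacy

# Look up short glossary descriptions for a few POS tags
for tag in ["NOUN", "ADP", "DET", "PUNCT"]:
    print(f"{tag:<8}{spacy.explain(tag)}")

noun_description = spacy.explain("NOUN")
adp_description = spacy.explain("ADP")
```

The same function also accepts dependency labels and entity labels.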

# Dependency Parsing  
spaCy can extract the grammatical structure of a sentence.  
The main verb in the sentence is the ROOT.  
The subject of this verb is labeled nsubj.   
The object of this verb is labeled dobj.   


In [None]:
for token in doc:
    print(f"{token.text:<12}{token.dep_:<10}")

### Common Dependency Tags  
- nsubj - Nominal subject
- dobj - Direct object
- iobj - Indirect object
- attr - Attribute
- ROOT - The head of the whole sentence (usually the main verb); every other word depends on it directly or indirectly
- aux - Auxiliary verb
- advmod - Adverbial modifier
- amod - Adjectival modifier
- conj - Conjunct
- cc - Coordinating conjunction
- prep - Prepositional modifier
- pobj - Object of the preposition

We can create a visualization with *displacy*. The arrow from each word points to its "child."  A child of a word is a word that modifies and depends on that word.

In [None]:
from spacy import displacy
displacy.render(doc, style = "dep")
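
The head and child relations that displacy draws can also be read in code through `token.head` and `token.children`. In the sketch below the parse of the example sentence is annotated by hand (the `heads` and `deps` values are assumptions written out for illustration, not model output), so it runs without the trained pipeline. On a Doc produced by `nlp`, the same attributes are available.

```python
import spacy
from spacy.tokens import Doc

vocab = spacy.blank("en").vocab

# Hand-annotated parse: heads are token positions, deps are dependency labels
parsed_doc = Doc(
    vocab,
    words=["Children", "made", "tasty", "snacks", "."],
    heads=[1, 1, 3, 1, 1],
    deps=["nsubj", "ROOT", "amod", "dobj", "punct"],
)

# For each token: its label, the word it depends on, and the words that depend on it
for token in parsed_doc:
    children = [child.text for child in token.children]
    print(f"{token.text:<12}{token.dep_:<8}{token.head.text:<10}{children}")

# Find the ROOT, then pick its subject and object from among its children
root = [token for token in parsed_doc if token.dep_ == "ROOT"][0]
subjects = [child.text for child in root.children if child.dep_ == "nsubj"]
objects = [child.text for child in root.children if child.dep_ == "dobj"]
print(root.text, subjects, objects)
```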

# Named Entity Recognition (NER)  
Named entity recognition is the process of locating named entities in a text and classifying them into predefined categories, such as person names, organizations, locations, monetary values, percentages, and time expressions.   

In [None]:
text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value."

In [None]:
doc = nlp(text)

In [None]:
for token in doc:
    # ent_type_ is an empty string for tokens that are not part of an entity
    if token.ent_type_:
        print(f"{token.text:<12}{token.ent_type_:<10}")

In [None]:
displacy.render(doc, style = "ent")
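
Entities are also available as whole spans through `doc.ents`, which keeps a multi-token entity such as "$1 trillion" together instead of printing it token by token. So that this sketch runs without the downloaded model, the entities are assigned by hand on a blank pipeline (the phrases and labels in `manual_entities` are assumptions for illustration). With the trained `nlp` object, `doc.ents` is filled in automatically.

```python
import spacy

blank_nlp = spacy.blank("en")
ner_text = "Apple is the first U.S. public company to reach a $1 trillion market value."
ner_doc = blank_nlp(ner_text)

# Hand-assigned (phrase, label) pairs; a trained pipeline predicts these itself
manual_entities = [("Apple", "ORG"), ("U.S.", "GPE"), ("$1 trillion", "MONEY")]

spans = []
for phrase, label in manual_entities:
    start = ner_text.find(phrase)
    span = ner_doc.char_span(start, start + len(phrase), label=label)
    # char_span returns None when the offsets do not line up with token boundaries
    if span is not None:
        spans.append(span)
ner_doc.ents = spans

# Each entity is a Span with its full text and a label
entity_table = [(ent.text, ent.label_) for ent in ner_doc.ents]
for ent_text, ent_label in entity_table:
    print(f"{ent_text:<14}{ent_label:<10}")
```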

## NER Predefined Categories
- PERSON: People, including fictional.
- NORP: Nationalities or religious or political groups.
- FAC: Buildings, airports, highways, bridges, etc.
- ORG: Companies, agencies, institutions, etc.
- GPE: Countries, cities, states.
- LOC: Non-GPE locations, mountain ranges, bodies of water.
- PRODUCT: Objects, vehicles, foods, etc. (Not services.)
- EVENT: Named hurricanes, battles, wars, sports events, etc.
- WORK_OF_ART: Titles of books, songs, etc.
- LAW: Named documents made into laws.
- LANGUAGE: Any named language.
- DATE: Absolute or relative dates or periods.
- TIME: Times smaller than a day.
- PERCENT: Percentage, including "%".
- MONEY: Monetary values, including unit.
- QUANTITY: Measurements, as of weight or distance.
- ORDINAL: "first", "second", etc.
- CARDINAL: Numerals that do not fall under another type.

# Sentence Detection  


In [None]:
#file = "nigerianprince.txt"

In [None]:
file = "genz_work.txt"

In [None]:
# Read in a text from a file (assumes the file is UTF-8 encoded)
with open(file, "r", encoding="utf-8") as f:
    text = f.read()

In [None]:
doc = nlp(text)

In [None]:
sentences = list(doc.sents)
len(sentences)

In [None]:
for sentence in sentences:
    print(sentence.text)

In [None]:
# Numbered sentences
for index, sentence in enumerate(sentences):
    print(f"{index}: {sentence.text}")
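
In the cells above, `doc.sents` uses sentence boundaries set by the trained pipeline. As a lighter alternative (not what this notebook uses), spaCy also ships a rule-based `sentencizer` component that splits on sentence-final punctuation. The sketch below adds it to a blank pipeline and numbers the sentences from 1. The sample text is made up for illustration.

```python
import spacy

# Blank English pipeline plus the rule-based sentence splitter
rule_nlp = spacy.blank("en")
rule_nlp.add_pipe("sentencizer")

sample = "This is the first sentence. Here is another one! Is this the third?"
sample_sentences = [sent.text for sent in rule_nlp(sample).sents]

# start=1 numbers the sentences from 1 instead of 0
for number, sentence_text in enumerate(sample_sentences, start=1):
    print(f"{number}: {sentence_text}")
```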

# Word Frequency  
We can use spaCy to find out which words occur most frequently in a text.  
First, we need to clean the text.  
- Keep only alphabetic tokens (no numbers, no punctuation)
- Remove stop words (very common words that carry little meaning)
- Lemmatize
- Convert to lower case

In [None]:
stop_words = nlp.Defaults.stop_words
print(stop_words)
len(stop_words)

In [None]:
from collections import Counter

In [None]:
doc = nlp(text)

In [None]:
words = [
    token.lemma_.lower()
    for token in doc
    if token.is_alpha and not token.is_stop]
len(words)

In [None]:
Counter(words).most_common(10)
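
The cleaning steps can be wrapped in a small helper so they can be reused on any Doc. This is a sketch only: `clean_words` is a hypothetical name, and the demo runs on a blank pipeline, which has no lemmatizer, so the helper falls back to the token text when no lemma is available. With the trained `nlp` object, the lemma is used.

```python
from collections import Counter

import spacy


def clean_words(doc):
    """Return lowercased lemmas of alphabetic, non-stop-word tokens."""
    return [
        # Fall back to the raw text when the pipeline produced no lemma
        (token.lemma_ or token.text).lower()
        for token in doc
        if token.is_alpha and not token.is_stop
    ]


blank_nlp = spacy.blank("en")  # tokenizer only, so no lemmas are produced
sample_doc = blank_nlp("The cat and the dog chased a cat. The dog barked!")

sample_counts = Counter(clean_words(sample_doc))
print(sample_counts.most_common(2))
```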

## Word Cloud - Visualization of Word Frequency

In [None]:
!pip install -U wordcloud

In [None]:
cleaned_text = " ".join(words)

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Generate a word cloud image
wordcloud = WordCloud(background_color="white", max_words=500).generate(cleaned_text)

# Display the generated image with matplotlib
plt.figure(figsize=(10, 8))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()