# Natural Language Processing

__High-level:__

* Natural Language Understanding
* Language translation
* Grammatical accuracy
* Speech recognition - Alexa / Siri / OK Google / Chatbots
* Sentiment Analysis
* Summary analysis
* Information extraction
* Preprocessing for deep learning

__Low-level:__
    
* Syntax
* Parsing - general term for splitting up structure into substructure
* Tokenization: Segment into words and punctuation (aka word / morpheme segmentation, as opposed to phoneme segmentation)
* Part-of-speech tagging - assign word types (noun, verb, ...) and dependency relations
* Semantics:
* Stemming / Lemmatization: Work out the base form of words (lemma), e.g. am, is => be; car's, cars => car; differing, different => differ. Cf. Porter's algorithm (http://www.tartarus.org/~martin/PorterStemmer/)
* Named Entity Recognition - find proper nouns, place names, brand names etc
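
The stemming / lemmatization examples in the list above can be sketched in plain Python. This is a toy illustration only, not Porter's algorithm and not what Spacy does internally: the `SUFFIXES` tuple, the `LEMMA_TABLE` lookup and both function names are made up for this example.

```python
# Toy illustration only: real stemmers (e.g. Porter) and lemmatizers use far richer rules.

# Hypothetical suffixes, checked in this order.
SUFFIXES = ("ing", "ent", "'s", "s")

# Hypothetical lookup table for irregular forms that no suffix rule can handle.
LEMMA_TABLE = {"am": "be", "is": "be", "are": "be"}


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemma(word):
    """Look up irregular forms first, then fall back to the toy stemmer."""
    return LEMMA_TABLE.get(word, toy_stem(word))


for w in ["am", "is", "car's", "cars", "differing", "different"]:
    print(w, "->", toy_lemma(w))
```

The lookup table is what separates the two ideas: suffix stripping alone can never turn "am" into "be".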

### What's the purpose for this week?
* **Parsing / stopwords / lemmatization**
    - Parsing: breaking a large piece of text into smaller parts, e.g. taking a sentence and pulling out all the words. Spacy identifies the words and categorises them for you
    - Stopwords: ifs, ands, ofs... the little words that add little meaning
    - Lemmas: the base form of a word, e.g. parents => parent, so that related forms can be counted together
* These are all preprocessing tools
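
As a rough sketch of what stopword removal means, here is a version that uses only the standard library. The `STOPWORDS` set and the function names are assumptions made for this example; Spacy ships a much longer list and a smarter tokenizer.

```python
import re

# Hypothetical mini stopword list; Spacy's built-in list is much longer.
STOPWORDS = {"a", "an", "and", "if", "is", "of", "the"}


def tokenize(text):
    """Split text into lowercase word tokens, dropping punctuation."""
    return re.findall(r"[a-z']+", text.lower())


def remove_stopwords(tokens):
    """Keep only the tokens that are not in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


tokens = tokenize("Inertia is a property of matter")
print(remove_stopwords(tokens))  # ['inertia', 'property', 'matter']
```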

#### And what tools?
* Many (NLTK / Gensim / etc), but we'll be using Spacy

#### Dependency grammar vs (Chomskian) context-free grammar.

* Why do you care about either?
    * Dependency - all NLP
    * Context-free grammar - parsers, e.g. for programming languages (regex covers the simpler regular grammars)
* We'll see dependency grammar in action in a little while

## Spacy - https://spacy.io/
### Load a spacy model

In [1]:
import spacy

In [2]:
nlp = spacy.load("en_core_web_sm")

### Generate some text, and pass in text to the spacy model

In [3]:
corpus = ['Inertia is a property of matter',
'Pizza funghi with peppermint oil',
'FC Bayern rules number one',
'Peter piper picked a peck of pickled peppers',
'Game of Thrones made me forget my laptop',
'I dont know ask me in five minutes',
'Peter is a pickled pepper',
'Bayern is also agreed to not be the best',
'a counter is a special kind of dictionary',
'it was the best of times',
'it was the worst of times',
'pizza is pizza is pizza',
'a full class is better than an empty one']

#### Let's get rid of the things we don't want, like stopwords, and break words down into lemmas!
- Here we take our corpus, we clean it, and then we can pass it to our count vectoriser!
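
To make the "clean, then count" idea concrete, here is a minimal bag-of-words sketch using only `collections.Counter`. It is a stand-in for a real count vectoriser, not a replacement for one; the `STOPWORDS` set and the `bag_of_words` helper are assumptions made for this example.

```python
from collections import Counter

docs = [
    "pizza is pizza is pizza",
    "it was the best of times",
    "it was the worst of times",
]

# Hypothetical mini stopword list for this example.
STOPWORDS = {"is", "it", "was", "the", "of"}


def bag_of_words(text):
    """Count the non-stopword tokens in one document."""
    return Counter(w for w in text.lower().split() if w not in STOPWORDS)


# One column per vocabulary word, one row per document.
vocab = sorted({w for d in docs for w in bag_of_words(d)})
matrix = [[bag_of_words(d)[w] for w in vocab] for d in docs]

print(vocab)   # ['best', 'pizza', 'times', 'worst']
print(matrix)  # [[0, 3, 0, 0], [1, 0, 1, 0], [0, 0, 1, 1]]
```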

### Let's tokenize each sentence separately

In [4]:
type(corpus)

list

In [5]:
corpus[0].split() # plain strings only: no part of speech, lemma or stopword information

['Inertia', 'is', 'a', 'property', 'of', 'matter']
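
A plain `split()` also struggles as soon as punctuation appears. The sketch below compares it with a simple regular expression; the example sentence is a punctuated variant of one of the corpus sentences, made up for this illustration, and the pattern is just one reasonable choice.

```python
import re

text = "I don't know, ask me in five minutes."

# Whitespace splitting leaves punctuation glued to the words.
print(text.split())
# ['I', "don't", 'know,', 'ask', 'me', 'in', 'five', 'minutes.']

# \w+(?:'\w+)? keeps contractions together; [^\w\s] captures each punctuation mark.
tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)
print(tokens)
# ['I', "don't", 'know', ',', 'ask', 'me', 'in', 'five', 'minutes', '.']
```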

In [6]:
doc = nlp(corpus[0])
type(doc) # This gives us a parsed version of the sentence!

spacy.tokens.doc.Doc

In [7]:
doc # doesn't look like much right?

Inertia is a property of matter

- But if we loop through it, we see we have Spacy tokens!

### We now have a collection of spacy tokens

In [8]:
for word in doc:
    print(type(word))

<class 'spacy.tokens.token.Token'>
<class 'spacy.tokens.token.Token'>
<class 'spacy.tokens.token.Token'>
<class 'spacy.tokens.token.Token'>
<class 'spacy.tokens.token.Token'>
<class 'spacy.tokens.token.Token'>


### And we can investigate their properties

In [9]:
for word in doc:
    print(word, word.pos_, word.is_alpha, word.is_quote, word.is_stop, word.lemma_)

Inertia PROPN True False False Inertia
is VERB True False True be
a DET True False True a
property NOUN True False False property
of ADP True False True of
matter NOUN True False False matter


Really, we care about:

- Is our word a stopword? If so, we want to remove it, as it adds little meaning!

- What is the lemmatised version of our word?
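
Those two checks can be combined into one small helper. This is a sketch: `clean_tokens` is a hypothetical name, and the `SimpleNamespace` stand-ins only copy the `is_stop` and `lemma_` values printed above so the example runs without a model. Passing a real Spacy `Doc` should work the same way, since its tokens expose those attributes.

```python
from types import SimpleNamespace


def clean_tokens(doc):
    """Return lowercase lemmas for every token that is not a stopword."""
    return [tok.lemma_.lower() for tok in doc if not tok.is_stop]


# Stand-in tokens: (text, is_stop, lemma_) copied from the output above.
rows = [
    ("Inertia", False, "Inertia"),
    ("is", True, "be"),
    ("a", True, "a"),
    ("property", False, "property"),
    ("of", True, "of"),
    ("matter", False, "matter"),
]
fake_doc = [SimpleNamespace(text=t, is_stop=s, lemma_=l) for t, s, l in rows]

print(clean_tokens(fake_doc))  # ['inertia', 'property', 'matter']
```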

#### Another, better example:

In [10]:
sentence = 'parental parents are usually parentally minded parent parentage'
tokenised_sentence = nlp(sentence)

In [11]:
for word in tokenised_sentence:
    print(word, word.pos_, word.is_alpha, word.is_quote, word.is_stop, word.lemma_)

parental ADJ True False False parental
parents NOUN True False False parent
are VERB True False True be
usually ADV True False False usually
parentally ADV True False False parentally
minded ADJ True False False minded
parent NOUN True False False parent
parentage NOUN True False False parentage
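
A quick way to see what lemmatisation bought us here is to count distinct forms before and after. The `pairs` list below is copied from the output above; everything else is a small sketch for illustration.

```python
from collections import Counter

# (token, lemma) pairs copied from the output above.
pairs = [
    ("parental", "parental"),
    ("parents", "parent"),
    ("are", "be"),
    ("usually", "usually"),
    ("parentally", "parentally"),
    ("minded", "minded"),
    ("parent", "parent"),
    ("parentage", "parentage"),
]

surface_counts = Counter(token for token, _ in pairs)
lemma_counts = Counter(lemma for _, lemma in pairs)

print(len(surface_counts), len(lemma_counts))  # 8 7
print(lemma_counts["parent"])                  # 2
```

Only the inflected form "parents" is merged with "parent"; derived words such as "parental" and "parentage" keep their own lemma.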


In [12]:
sentence = 'drink drank drunks are drunk'
tokenised_sentence = nlp(sentence)
for word in tokenised_sentence:
    print(word, word.pos_, word.is_alpha, word.is_quote, word.is_stop, word.lemma_, word.dep_)

drink VERB True False False drink csubj
drank NOUN True False False drank compound
drunks NOUN True False False drunk dobj
are VERB True False True be ROOT
drunk ADJ True False False drunk acomp


- .is_alpha - True if the token is made up of alphabetic characters only, i.e. False for a number

- .is_quote - True if it's a quotation mark

- .is_stop - True if it's a stopword

- .lemma_ - the lemmatised version of the word

- .dep_ - the dependency relation label, i.e. the word's role in the dependency grammar

### And find out more about what we don't know

In [13]:
spacy.explain('ADJ')

'adjective'

In [14]:
spacy.explain('csubj')

'clausal subject'

### Let's take a look at what they define as stopwords:

In [15]:
stopwords = nlp.Defaults.stop_words # Finds all stopwords

In [16]:
stopwords # We can add our own with stopwords.add('some_word')

{"'d",
 "'ll",
 "'m",
 "'re",
 "'s",
 "'ve",
 'a',
 'about',
 'above',
 'across',
 'after',
 'afterwards',
 'again',
 'against',
 'all',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'amount',
 'an',
 'and',
 'another',
 'any',
 'anyhow',
 'anyone',
 'anything',
 'anyway',
 'anywhere',
 'are',
 'around',
 'as',
 'at',
 'back',
 'be',
 'became',
 'because',
 'become',
 'becomes',
 'becoming',
 'been',
 'before',
 'beforehand',
 'behind',
 'being',
 'below',
 'beside',
 'besides',
 'between',
 'beyond',
 'both',
 'bottom',
 'but',
 'by',
 'ca',
 'call',
 'can',
 'cannot',
 'could',
 'did',
 'do',
 'does',
 'doing',
 'done',
 'down',
 'due',
 'during',
 'each',
 'eight',
 'either',
 'eleven',
 'else',
 'elsewhere',
 'empty',
 'enough',
 'even',
 'ever',
 'every',
 'everyone',
 'everything',
 'everywhere',
 'except',
 'few',
 'fifteen',
 'fifty',
 'first',
 'five',
 'for',
 'former',
 'formerly',
 'forty',
 'four',
 'from',
 ...
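
Because the stopword collection is an ordinary Python set, it can be customised with normal set operations. The sketch below uses a small stand-in set rather than `nlp.Defaults.stop_words` so that it runs on its own. Whether `is_stop` on already-loaded tokens picks up such changes may depend on the Spacy version, so treat that part as something to check.

```python
# Small stand-in set; in the notebook this would be nlp.Defaults.stop_words.
stopwords = {"a", "about", "above", "empty", "five"}

stopwords.add("pizza")      # treat a domain-specific word as a stopword
stopwords.discard("empty")  # keep a word that carries meaning for our task

tokens = ["five", "empty", "pizza", "boxes"]
kept = [t for t in tokens if t not in stopwords]
print(kept)  # ['empty', 'boxes']
```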

# In Conclusion:

- We want to remove stopwords, as they add little signal for our model

- We want to lemmatise!

- We usually want to remove punctuation, though maybe not for sentiment analysis, where it can help detect mood

- We may want to remove numbers, though maybe not for some songs, e.g. rappers talking about money
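
The checklist above could be wrapped into one configurable function. This is a sketch under assumptions: `preprocess` and `fake` are hypothetical helpers, and the stand-in tokens mimic the Spacy token attributes `is_stop`, `is_punct`, `like_num` and `lemma_` so the example runs without a model.

```python
from types import SimpleNamespace


def preprocess(doc, keep_punct=False, keep_numbers=False):
    """Apply the checklist above to any iterable of Spacy-like tokens."""
    kept = []
    for tok in doc:
        if tok.is_stop:
            continue  # stopwords are always dropped
        if tok.is_punct and not keep_punct:
            continue  # punctuation is dropped unless requested
        if tok.like_num and not keep_numbers:
            continue  # numbers are dropped unless requested
        kept.append(tok.lemma_.lower())
    return kept


def fake(text, lemma=None, stop=False, punct=False, num=False):
    """Build a stand-in token with only the attributes preprocess() reads."""
    return SimpleNamespace(text=text, lemma_=lemma or text,
                           is_stop=stop, is_punct=punct, like_num=num)


# Hypothetical sentence: "She earned 100 dollars !"
fake_doc = [
    fake("She", stop=True),
    fake("earned", lemma="earn"),
    fake("100", num=True),
    fake("dollars", lemma="dollar"),
    fake("!", punct=True),
]

print(preprocess(fake_doc))
# ['earn', 'dollar']
print(preprocess(fake_doc, keep_punct=True, keep_numbers=True))
# ['earn', '100', 'dollar', '!']
```

Because the stopword check runs first, a number word that is also a stopword would still be dropped even with `keep_numbers=True`.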

### And visualise things using Displacy
* https://spacy.io/usage/visualizers for more cool viz tricks

- It's about visualising the dependency grammar!

In [17]:
from spacy import displacy

In [18]:
sentence = 'she sells sea shells on the sea shore'
tokenised_sentence = nlp(sentence)

In [19]:
displacy.render(tokenised_sentence, jupyter=True)

In [20]:
for word in tokenised_sentence:
    print(word, word.pos_, word.dep_)

she PRON nsubj
sells VERB ROOT
sea NOUN compound
shells NOUN dobj
on ADP prep
the DET det
sea NOUN compound
shore NOUN pobj


In [21]:
spacy.explain('prep')

'prepositional modifier'
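
What the arrows in the Displacy picture encode can be sketched as a list of (word, label, head) triples. The labels are taken from the output above; the head indices are an assumption about what the parser produced, written by hand for this illustration. In Spacy itself the same information is available through `token.head` and `token.children`.

```python
# (token, dependency label, index of the head token).
# Labels come from the output above; the head indices are assumed.
parse = [
    ("she", "nsubj", 1),
    ("sells", "ROOT", 1),  # the root points at itself
    ("sea", "compound", 3),
    ("shells", "dobj", 1),
    ("on", "prep", 1),
    ("the", "det", 7),
    ("sea", "compound", 7),
    ("shore", "pobj", 4),
]


def children(parse, head_index):
    """Return the words whose head is the token at head_index."""
    return [
        word
        for i, (word, _, head) in enumerate(parse)
        if head == head_index and i != head_index
    ]


print(children(parse, 1))  # ['she', 'shells', 'on']
print(children(parse, 7))  # ['the', 'sea']
```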

### And look at Named Entity Recognition

- Can identify countries, company names, institutions...
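
A minimal sketch of working with recognised entities: in Spacy they are available as `doc.ents`, and each entity has a `.text` and a `.label_`. The helper name `entities_by_label` and the stand-in entities below are hypothetical; the labels a real model assigns to these strings may differ.

```python
from collections import defaultdict
from types import SimpleNamespace


def entities_by_label(ents):
    """Group entity texts by their label, e.g. ORG or GPE."""
    grouped = defaultdict(list)
    for ent in ents:
        grouped[ent.label_].append(ent.text)
    return dict(grouped)


# Hypothetical stand-ins for doc.ents, so the sketch runs without a model.
fake_ents = [
    SimpleNamespace(text="FC Bayern", label_="ORG"),
    SimpleNamespace(text="Germany", label_="GPE"),
    SimpleNamespace(text="Siemens", label_="ORG"),
]

print(entities_by_label(fake_ents))
# {'ORG': ['FC Bayern', 'Siemens'], 'GPE': ['Germany']}
```

With a loaded model, the call would look like `entities_by_label(nlp(text).ents)`.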