
<center><h1>Introduction to Natural Language Processing (NLP)</h1></center>

<center><h3>Paul Stey</h3></center>

# What is natural language processing (NLP)?

* NLP exists at the intersection of disciplines
  + Linguistics
  + Statistics
  + Machine learning and artificial intelligence

## Where is NLP Used?

* Advertising and Marketing
  + Understanding consumer preferences
* Law
  + Automated reading of discovery docs
* Automated Journalism
  + Machine-generated articles
* Finance
  + Quant funds 
  + "Anne Hathaway" problem
* Medicine
  + Automated analysis of clinical notes
* Machine translation

### Examples of NLP Usage

* Voice assistants (e.g., Siri, Alexa)
* Google Translate
* Auto-complete
* Chatbots (e.g., ChatGPT, Bard)
* Speech-to-text on phones

## History of NLP

* Formally studied in linguistics departments
* From as early as the 1950s
* Early emphasis on rule-based methods

### Some Building Blocks

* Tokenization
  + A token is a unit of text treated as a single item, such as a word, number, or punctuation mark
* Stemming
  + Chop off the ends of words
  + Many different kinds of stemming
  + `"cooking"` => `"cook"`
  + `"distribution"` => `"distribut"`
  
* Lemmatization
  + More sophisticated than stemming
  + Uses vocabulary and context
  + `"am"`, `"are"`, `"is"` could be mapped to `"be"` using lemmatization
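The three building blocks above can be illustrated with a toy sketch in plain Python. Everything in it is an assumption made for illustration: the regular expression, the suffix list, and the small lemma table are hypothetical stand-ins, and they are far cruder than a real stemmer or lemmatizer.

```python
import re

# Hypothetical suffix list and lemma table, chosen only to reproduce the examples above
SUFFIXES = ("ing", "ion", "ed", "s")
LEMMAS = {"am": "be", "are": "be", "is": "be"}


def tokenize(text):
    # Split into runs of word characters or single punctuation marks
    return re.findall(r"\w+|[^\w\s]", text.lower())


def stem(word):
    # Chop off the first matching suffix, keeping at least three characters
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lemmatize(word):
    # Look the word up in a vocabulary; fall back to the word itself
    return LEMMAS.get(word, word)


tokens = tokenize("She is cooking!")
print(tokens)
print([stem(t) for t in tokens])
print([lemmatize(t) for t in tokens])
```

Note that the stemmer only looks at the characters of a word, while the lemmatizer needs a vocabulary; that difference is why lemmatization can map `"is"` to `"be"` and stemming cannot.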

# spaCy Package in Python

* Created by Matt Honnibal and Ines Montani
* Amazingly powerful
* Support for dozens of languages
* Extremely fast!!


# Bag-of-words Model

* Species of vector space model
* Produce count vectors
* Embedding words in a vector space
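As a minimal sketch of what "count vectors" means, the snippet below builds them by hand with the standard library. The two-sentence corpus and the helper name `count_vector` are hypothetical, and splitting on whitespace is a simplifying assumption rather than real tokenization.

```python
from collections import Counter

# Hypothetical two-document corpus, used only for illustration
docs = ["the movie was great great fun", "the movie was terrible"]

# Vocabulary: every distinct word, in a fixed (sorted) order
vocab = sorted({word for doc in docs for word in doc.split()})


def count_vector(doc, vocab):
    # One count per vocabulary word; word order in the document is ignored
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]


vectors = [count_vector(doc, vocab) for doc in docs]
print(vocab)
for vec in vectors:
    print(vec)
```

Each document becomes a vector with one dimension per vocabulary word, which is what places documents in a shared vector space.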



In [None]:
!pip3 install spacy
!python3 -m spacy download en_core_web_lg

In [None]:
import re
import spacy 
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split

nlp = spacy.load("en_core_web_lg")

df = pd.read_csv("data/short_movie_reviews.csv")

In [None]:
df

In [None]:
doc = nlp(df.iloc[0,1])

for token in doc:
    print(token)

In [None]:
doc = nlp(df.iloc[3,1])

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.is_stop)

<center><h1>Challenge Problem</h1></center>

In NLP, stop words are common words that are often considered to be of little value in text analysis because they don't carry much meaningful information by themselves. Examples of stop words include articles (a, an, the), prepositions (in, on, at), conjunctions (and, or, but), and pronouns (he, she, it).

The spaCy package in Python has a built-in set of stop words, which is stored in this object: `nlp.Defaults.stop_words`

Let's write a function called `remove_stopwords()` that takes a string of text as input, and returns the string with all the stop words removed. 
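One possible shape for such a function is sketched here; it is a starting point, not the definitive solution. To stay self-contained it splits on whitespace and uses a small hypothetical stop-word set (`EXAMPLE_STOP_WORDS`). In the notebook, `nlp.Defaults.stop_words` could be passed as `stop_words` instead, and spaCy's tokenizer would handle punctuation more carefully.

```python
# Hypothetical stand-in for nlp.Defaults.stop_words, kept tiny for illustration
EXAMPLE_STOP_WORDS = frozenset(
    {"a", "an", "the", "in", "on", "at", "and", "or", "but", "he", "she", "it", "was"}
)


def remove_stopwords(text, stop_words=EXAMPLE_STOP_WORDS):
    # Keep each whitespace-delimited word unless its lowercase form,
    # stripped of surrounding punctuation, is a stop word
    kept = [word for word in text.split()
            if word.lower().strip(".,!?;:\"'") not in stop_words]
    return " ".join(kept)


print(remove_stopwords("The movie was great, and it made me laugh."))
```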

### Parts-of-Speech

In [None]:
doc = nlp(df.iloc[3,1])

for chunk in doc.noun_chunks:
    print(chunk)

In [None]:
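def get_unique_tokens(col):
    # Join all reviews into one lowercase string and run spaCy on it once
    words = ' '.join(col.tolist()).lower()
    # Keep each distinct token that is not punctuation, whitespace, or a stop word
    tokens = {str(word) for word in nlp(words)
              if word.pos_ != "PUNCT"
              and not word.is_space
              and not word.is_stop}

    return tokens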

In [None]:
all_tokens = get_unique_tokens(df.review)

all_tokens

In [None]:
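def word_counts_dataframe(df_raw, token_set):
    df_new = df_raw.copy()
    reviews = df_new.review.str.lower()

    for token in token_set:
        # Match whole words only, so that "art" is not counted inside "party"
        pattern = re.compile(rf"(?<!\w){re.escape(token)}(?!\w)")
        # One count per review; building the column directly avoids relying on the index
        df_new[token] = [len(pattern.findall(review)) for review in reviews]

    return df_new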
    

In [None]:
df_bow = word_counts_dataframe(df, all_tokens)

df_bow

## TF-IDF

* Term frequency, Inverse document frequency
  - Term frequency: how often the term appears in this document
  - Inverse document frequency: how rare the term is across the collection of documents (i.e., the corpus)
  
* Statistic that normalizes for relative "importance" of words

$$\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t,D)$$

* "Important" term 
  + appears frequently in the document
  + and is rare across all documents
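The two factors can be written out as follows. This is one common formulation, assumed here because it matches the `tf()` and `idf()` functions in this notebook; $f_{t,d}$ (the number of times term $t$ occurs in document $d$) and $N$ (the number of documents in $D$) are notation introduced for this sketch.

$$\mathrm{tf}(t,d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}, \qquad \mathrm{idf}(t,D) = \log \frac{N}{|\{d \in D : t \in d\}|}$$

Under this definition, a term that appears in every document has $\mathrm{idf}(t,D) = \log 1 = 0$, so its TF-IDF is zero no matter how often it occurs.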


### TF-IDF (cont.)

In [None]:
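## implement tf/idf
def count_term(term, text):
    # Count whole-word, case-insensitive occurrences of `term` in `text`
    pattern = rf"(?<!\w){re.escape(term.lower())}(?!\w)"
    return len(re.findall(pattern, text.lower()))


def tf(term, doc):
    # Term frequency: occurrences of the term divided by the number of words
    n_words = len(doc.split())
    return count_term(term, doc) / n_words if n_words else 0.0


def idf(term, documents):
    # Inverse document frequency: log(N / number of documents containing the term)
    n = len(documents)
    num_docs = sum(1 for doc in documents if count_term(term, doc) > 0)

    # A term found in no document would divide by zero; return 0 in that case
    return np.log(n / num_docs) if num_docs else 0.0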


### TF-IDF (cont.)

In [None]:
def tfidf(term, doc, documents):
    tf_val = tf(term, doc)
    idf_val = idf(term, documents)
    
    return tf_val * idf_val

In [None]:
idf("terrible", df.loc[:, "review"])

In [None]:
tf("terrible", df.loc[11, "review"])

In [None]:
tfidf("terrible", df.loc[11, "review"], df.loc[:, "review"])

In [None]:
def tfidf_dataframe(df_raw, token_set):
    df_new = df_raw.copy()
    reviews = df_new.review

    for token in token_set:
        # IDF depends only on the corpus, so compute it once per token
        idf_val = idf(token, reviews)
        # Build a float column directly instead of writing floats into an int column
        df_new[token] = [tf(token, review) * idf_val for review in reviews]

    return df_new

df_tfidf = tfidf_dataframe(df, all_tokens)

df_tfidf.head()

# Bag-of-Words, Vector Models, and Embeddings

* Are all embeddings
* All representations of text as vectors
* Several famous word embedding models
  + word2vec
  + GloVe
* _Unbelievably powerful_

## Token Similarity

* Using embeddings, we can compute the similarity of documents, sentences, or tokens
* Similarity can be computed with a measure such as cosine similarity
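Cosine similarity, mentioned above, is the cosine of the angle between two vectors: their dot product divided by the product of their lengths. A minimal sketch in plain Python follows. The three-dimensional vectors are made up for illustration (real embeddings have hundreds of dimensions), and returning 0 for an all-zero vector is a convention chosen for this sketch.

```python
import math


def cosine_similarity(u, v):
    # Dot product divided by the product of the vector lengths
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional "embeddings", made up for illustration
lion = [0.9, 0.8, 0.1]
tiger = [0.8, 0.9, 0.2]
car = [0.1, 0.2, 0.9]

print(cosine_similarity(lion, tiger))
print(cosine_similarity(lion, car))
```

Vectors pointing in nearly the same direction score close to 1, while unrelated directions score close to 0, so `lion` comes out more similar to `tiger` than to `car`.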

In [None]:
t1 = nlp("lion")
t2 = nlp("tiger")


In [None]:
t1.similarity(t2)

In [None]:
t1.vector

## Sentence Similarity

In [None]:
s1 = nlp("I went to the store today")
s2 = nlp("I will go to the market tomorrow")

In [None]:
s1.similarity(s2)