# Table of Contents
- 1. [NLP and Pipelines](#1.-NLP-and-Pipelines)
- 2. [How NLP and Pipelines Work](#2.-How-NLP-and-Pipelines-Work)
- 3. [Text Processing](#3.-Text-Processing)
- 4. [Cleaning](#4.-Cleaning)
- 5. [Exercise: Cleaning](#5.-Exercise:-Cleaning)
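- 6. [Normalization](#6.-Normalization)
- 7. [Exercise: Normalization](#7.-Exercise:-Normalization)
- 8. [Tokenization](#8.-Tokenization)
- 9. [Exercise: Tokenization](#9.-Exercise:-Tokenization)
- 10. [Stop Word Removal](#10.-Stop-Word-Removal)
- 11. [Exercise: Stop Words](#11.-Exercise:-Stop-Words)
- 12. [Part-of-Speech Tagging](#12.-Part-of-Speech-Tagging)
- 13. [Named Entity Recognition](#13.-Named-Entity-Recognition)
- 14. [Exercise: POS and NES](#14.-Exercise:-POS-and-NES)
- 15. [Stemming and Lemmatization](#15.-Stemming-and-Lemmatization)
- 16. [Exercise: Stemming and Lemmatization](#16.-Exercise:-Stemming-and-Lemmatization)
- 17. [Text Processing Summary](#17.-Text-Processing-Summary)
- 18. [Feature Extraction](#18.-Feature-Extraction)
- 19. [Bag of Words](#19.-Bag-of-Words)
- 20. [TF-IDF](#20.-TF-IDF)
- 21. [Notebook: Bag of Words and TF-IDF](#21.-Notebook:-Bag-of-Words-and-TF-IDF)
- 22. [One-Hot Encoding](#22.-One-Hot-Encoding)
- 23. [Word Embeddings](#23.-Word-Embeddings)
- 24. [Modeling](#24.-Modeling)
- 25. [Word2Vec](#25.-Word2Vec)
- 26. [GloVe](#26.-GloVe)
- 27. [Embeddings for Deep Learning](#27.-Embeddings-for-Deep-Learning)
- 28. [t-SNE](#28.-t-SNE)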

# 1. NLP and Pipelines

## Natural Language Processing Pipelines
In this lesson, you'll be introduced to some of the steps involved in an NLP pipeline:

1. Text Processing
  - Cleaning
  - Normalization
  - Tokenization
  - Stop Word Removal
  - Part of Speech Tagging
  - Named Entity Recognition
  - Stemming and Lemmatization
2. Feature Extraction
  - Bag of Words
  - TF-IDF
  - Word Embeddings
3. Modeling

# 2. How NLP and Pipelines Work

## How NLP Pipelines Work
The 3 stages of an NLP pipeline are: Text Processing > Feature Extraction > Modeling.

- **Text Processing**: Take raw input text, clean it, normalize it, and convert it into a form that is suitable for feature extraction.
- **Feature Extraction**: Extract and produce feature representations that are appropriate for the type of NLP task you are trying to accomplish and the type of model you are planning to use.
- **Modeling**: Design a statistical or machine learning model, fit its parameters to training data using an optimization procedure, and then use it to make predictions about unseen data.

This process isn't always linear and may require additional steps.

# 3. Text Processing

## Stage 1: Text Processing
The first chunk of this lesson will explore the steps involved in **text processing**, the first stage of the NLP pipeline.

### Why Do We Need to Process Text?
Source: https://en.wikipedia.org/wiki/Kingfisher

- **Extracting plain text**: Textual data can come from a wide variety of sources: the web, PDFs, Word documents, speech recognition systems, book scans, etc. Your goal is to extract plain text that is free of any source-specific markup or constructs that are not relevant to your task.
- **Reducing complexity**: Some features of our language like capitalization, punctuation, and common words such as a, of, and the, often help provide structure, but don't add much meaning. Sometimes it's best to remove them if that helps reduce the complexity of the procedures you want to apply later.

### What Text Processing Will You Do in This Lesson?
You'll prepare text data from different sources with the following text processing steps:

1. **Cleaning** to remove irrelevant items, such as HTML tags
2. **Normalizing** by converting to all lowercase and removing punctuation
3. Splitting text into words or **tokens**
4. Removing words that are too common, also known as **stop words**
5. Identifying different **parts of speech** and **named entities**
6. Converting words into their root or dictionary forms, using **stemming and lemmatization**

After performing these steps, your text will capture the essence of what was being conveyed in a form that is easier to work with.

# 4. Cleaning

## Cleaning
Let's walk through an example of cleaning text data from a popular source - the web. You'll be introduced to helpful tools in working with this data, including the **requests** library, **regular expressions**, and **Beautiful Soup**.

Note: The website used in this example has since been updated with a new layout. On the next page, you'll work through the steps shown here for the new web page.

**Documentation for Python Libraries**:
- [Requests](http://docs.python-requests.org/en/master/user/quickstart/#make-a-request)
- [Regular Expressions](https://docs.python.org/3/library/re.html)
- [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
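
To make the idea of cleaning concrete, here is a minimal sketch that strips HTML tags in two ways using only the standard library: a crude regular expression and the built-in `html.parser` module. The `sample` markup and the helper names (`TextExtractor`, `strip_tags_regex`, `strip_tags_parser`) are hypothetical and chosen for illustration; they are not part of the lesson's code.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text between tags
        if data.strip():
            self.chunks.append(data.strip())


def strip_tags_regex(html):
    # Crude approach: replace anything that looks like a tag with a space
    text = re.sub(r"<[^>]+>", " ", html)
    # Collapse the runs of whitespace left behind
    return re.sub(r"\s+", " ", text).strip()


def strip_tags_parser(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical snippet, not taken from a real page
sample = '<div class="card"><h3>Data Analyst</h3><h4>School of Data Science</h4></div>'
print(strip_tags_regex(sample))
print(strip_tags_parser(sample))
```

The regular expression is fine for a quick look, but it knows nothing about document structure. A parser such as Beautiful Soup lets you select specific elements by tag or class, which is usually what you need when scraping.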

# 5. Exercise: Cleaning

## Cleaning Quiz: Udacity's Course Catalog
It's your turn! Udacity's [course catalog page](https://www.udacity.com/courses/all) has changed since the last video was filmed. One notable change is the introduction of _schools_.

In this activity, you're going to perform similar actions with BeautifulSoup to extract the following information from each course listing on the page:
1. The course name - e.g. "Data Analyst"
2. The school the course belongs to - e.g. "School of Data Science"

In [3]:
import requests
from bs4 import BeautifulSoup

# fetch web page
r = requests.get("https://www.udacity.com/courses/all")

# Use "lxml" rather than "html5lib".
soup = BeautifulSoup(r.text, "lxml")

# Find all course summaries
summaries = soup.find_all("div", {"class":"course-summary-card"})
print('Number of Courses:', len(summaries))

Number of Courses: 237


In [4]:
# print the first summary in summaries
print(summaries[0].prettify())

<div _ngcontent-sc213="" class="course-summary-card row row-gap-medium catalog-card nanodegree-card ng-star-inserted">
 <ir-catalog-card _ngcontent-sc213="" _nghost-sc216="">
  <div _ngcontent-sc216="" class="card-wrapper is-collapsed">
   <div _ngcontent-sc216="" class="card__inner card mb-0">
    <div _ngcontent-sc216="" class="card__inner--upper">
     <div _ngcontent-sc216="" class="image_wrapper hidden-md-down">
      <a _ngcontent-sc216="" href="/course/product-manager-nanodegree--nd036">
       <!-- -->
       <div _ngcontent-sc216="" class="image-container ng-star-inserted" style="background-image:url(https://d20vrrgs8k4bvw.cloudfront.net/images/degrees/nd036/catalog+image+nd036.jpg);">
        <div _ngcontent-sc216="" class="image-overlay">
        </div>
       </div>
      </a>
      <!-- -->
     </div>
     <div _ngcontent-sc216="" class="card-content">
      <!-- -->
      <span _ngcontent-sc216="" class="tag tag--new card ng-star-inserted">
       New
      </span>
     

In [5]:
# Extract course title
ct = summaries[0].select_one("h3").get_text().strip()
print(ct)

# Extract school
school = summaries[0].select_one("h4").get_text().strip()
print(school)

Product Manager
School of Business


In [6]:
courses = []
for summary in summaries:
    # append name and school of each summary to courses list
    title = summary.select_one("h3").get_text().strip()
    school = summary.select_one("h4").get_text().strip()
    courses.append((title, school))
    
# display results
print(len(courses), "course summaries found. Sample:")
courses[:20]

237 course summaries found. Sample:


[('Product Manager', 'School of Business'),
 ('AI for Business Leaders', 'School of Business'),
 ('Intro to Machine Learning with TensorFlow',
  'School of Artificial Intelligence'),
 ('UX Designer', 'School of Business'),
 ('Data Streaming', 'School of Data Science'),
 ('Front End Web Developer', 'School of Programming'),
 ('Full Stack Web Developer', 'School of Programming'),
 ('Java Developer', 'School of Programming'),
 ('AI Product Manager', 'School of Artificial Intelligence'),
 ('Sensor Fusion Engineer', 'School of Autonomous Systems'),
 ('Data Visualization', 'School of Data Science'),
 ('Cloud Developer', 'School of Cloud Computing'),
 ('Cloud DevOps Engineer', 'School of Cloud Computing'),
 ('Intro to Machine Learning with PyTorch',
  'School of Artificial Intelligence'),
 ('C++', 'School of Autonomous Systems'),
 ('Data Structures and Algorithms', 'School of Programming'),
 ('Programming for Data Science with R', 'School of Data Science'),
 ('Data Engineer', 'School of Data 

# 6. Normalization

- Lower case text
- Replace characters that are not a-z, A-Z, or numbers with a space

# 7. Exercise: Normalization

In [9]:
text = "The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war ? Is AI a bad thing ?"
print(text,"\n")

# Convert to lowercase
text = text.lower() 
print(text,"\n")

import re

# Remove punctuation characters
text = re.sub(r"[^a-zA-Z0-9]", " ", text) 
print(text)

The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war ? Is AI a bad thing ? 

the first time you see the second renaissance it may look boring. look at it at least twice and definitely watch part 2. it will change your view of the matrix. are the human people the ones who started the war ? is ai a bad thing ? 

the first time you see the second renaissance it may look boring  look at it at least twice and definitely watch part 2  it will change your view of the matrix  are the human people the ones who started the war   is ai a bad thing  


# 8. Tokenization

## Tokenization
Reference:
- `nltk.tokenize` package: http://www.nltk.org/api/nltk.tokenize.html
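
Tokenization splits text into words or sentences. As a rough sketch of why a dedicated tokenizer is useful, the naive regex-based functions here (hypothetical helpers, not part of NLTK) show where simple rules go wrong.

```python
import re


def simple_word_tokenize(text):
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches any single punctuation character
    return re.findall(r"\w+|[^\w\s]", text)


def simple_sent_tokenize(text):
    # Naive rule: a sentence ends at ., !, or ? followed by whitespace
    return re.split(r"(?<=[.!?])\s+", text.strip())


text = "Dr. Smith graduated from the University of Washington."
print(simple_word_tokenize(text))
print(simple_sent_tokenize(text))
```

Both functions mishandle the abbreviation: the word tokenizer separates `Dr` from its period, and the sentence splitter breaks the text after `Dr.`. NLTK's `word_tokenize` and `sent_tokenize` are designed to cope with cases like this.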

# 9. Exercise: Tokenization

In [10]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/kevinwebb/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [13]:
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize

text = "Dr. Smith graduated from the University of Washington. He later started an analytics firm called Lux, which catered to enterprise customers."
print(text, "\n")

# Split text into words using NLTK
words = word_tokenize(text)
print(words, "\n")

# Split text into sentences
sentences = sent_tokenize(text)
print(sentences)

Dr. Smith graduated from the University of Washington. He later started an analytics firm called Lux, which catered to enterprise customers. 

['Dr.', 'Smith', 'graduated', 'from', 'the', 'University', 'of', 'Washington', '.', 'He', 'later', 'started', 'an', 'analytics', 'firm', 'called', 'Lux', ',', 'which', 'catered', 'to', 'enterprise', 'customers', '.'] 

['Dr. Smith graduated from the University of Washington.', 'He later started an analytics firm called Lux, which catered to enterprise customers.']


# 10. Stop Word Removal

Stop words are common words that carry little meaning on their own
- I, me, myself, etc.

# 11. Exercise: Stop Words

In [14]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /Users/kevinwebb/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/kevinwebb/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [16]:
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [17]:
text = "The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war ? Is AI a bad thing ?"
print(text, "\n")

# Normalize text
text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())

# Tokenize text
words = word_tokenize(text)
print(words, "\n")

# Remove stop words
words = [w for w in words if w not in stopwords.words("english")]
print(words, "\n")

The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war ? Is AI a bad thing ? 

['the', 'first', 'time', 'you', 'see', 'the', 'second', 'renaissance', 'it', 'may', 'look', 'boring', 'look', 'at', 'it', 'at', 'least', 'twice', 'and', 'definitely', 'watch', 'part', '2', 'it', 'will', 'change', 'your', 'view', 'of', 'the', 'matrix', 'are', 'the', 'human', 'people', 'the', 'ones', 'who', 'started', 'the', 'war', 'is', 'ai', 'a', 'bad', 'thing'] 

['first', 'time', 'see', 'second', 'renaissance', 'may', 'look', 'boring', 'look', 'least', 'twice', 'definitely', 'watch', 'part', '2', 'change', 'view', 'matrix', 'human', 'people', 'ones', 'started', 'war', 'ai', 'bad', 'thing'] 



# 12. Part-of-Speech Tagging

- Parts of Speech Tagging
- Sentence Parsing (Tree)

**Note**: Part-of-speech tagging using a predefined grammar like this is a simple, but limited, solution. It can be very tedious and error-prone for a large corpus of text, since you have to account for all possible sentence structures and tags!

There are other more advanced forms of POS tagging that can learn sentence structures and tags from given data, including Hidden Markov Models (HMMs) and Recurrent Neural Networks (RNNs).

# 13. Named Entity Recognition

## Named Entities
Noun phrases that refer to a specific object, person, or place

# 14. Exercise: POS and NES

In [18]:
import nltk
from nltk.tokenize import word_tokenize
nltk.download('words')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')

[nltk_data] Downloading package words to /Users/kevinwebb/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.
[nltk_data] Downloading package punkt to /Users/kevinwebb/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/kevinwebb/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /Users/kevinwebb/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.


True

## Parts of Speech (POS) Tagging

In [20]:
from nltk import pos_tag, ne_chunk
from nltk.tokenize import word_tokenize

text = "I always lie down to tell a lie."

# tokenize text
sentence = word_tokenize(text)

# tag each word with part of speech
pos_tag(sentence)

[('I', 'PRP'),
 ('always', 'RB'),
 ('lie', 'VBP'),
 ('down', 'RP'),
 ('to', 'TO'),
 ('tell', 'VB'),
 ('a', 'DT'),
 ('lie', 'NN'),
 ('.', '.')]

## Named Entity Recognition (NER)

In [21]:
text = "Antonio joined Udacity Inc. in California."

# tokenize, pos tag, then recognize named entities in text
tree = ne_chunk(pos_tag(word_tokenize(text)))
print(tree)

(S
  (PERSON Antonio/NNP)
  joined/VBD
  (ORGANIZATION Udacity/NNP Inc./NNP)
  in/IN
  (GPE California/NNP)
  ./.)


## Sentence Parsing

In [22]:
# Define a custom grammar
my_grammar = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I'
VP -> V NP | VP PP
Det -> 'an' | 'my'
N -> 'elephant' | 'pajamas'
V -> 'shot'
P -> 'in'
""")
parser = nltk.ChartParser(my_grammar)

# Parse a sentence
sentence = word_tokenize("I shot an elephant in my pajamas")
for tree in parser.parse(sentence):
    print(tree)

(S
  (NP I)
  (VP
    (VP (V shot) (NP (Det an) (N elephant)))
    (PP (P in) (NP (Det my) (N pajamas)))))
(S
  (NP I)
  (VP
    (V shot)
    (NP (Det an) (N elephant) (PP (P in) (NP (Det my) (N pajamas))))))


# 15. Stemming and Lemmatization

## Stemming
Process of reducing a word to its stem or root form, typically by removing letters from the end of the word.
- branch: branches, branching, branched

## Lemmatization
Process of reducing words to a normalized form. This uses a dictionary that maps variants to roots.
- be: is, was, were
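
The difference between the two approaches can be sketched in a few lines. This is a toy illustration only: the suffix list and `LEMMA_TABLE` are hypothetical, and real stemmers (such as Porter's) and lemmatizers (such as WordNet's) use far more elaborate rules and dictionaries.

```python
def naive_stem(word, suffixes=("ing", "es", "ed", "s")):
    # Strip the first matching suffix, keeping at least three letters
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical variant-to-root table; WordNet plays this role in NLTK
LEMMA_TABLE = {"is": "be", "was": "be", "were": "be"}


def naive_lemmatize(word):
    # Fall back to the word itself when it is not in the table
    return LEMMA_TABLE.get(word, word)


print([naive_stem(w) for w in ["branches", "branching", "branched"]])
print([naive_lemmatize(w) for w in ["is", "was", "were"]])
print(naive_stem("was"), naive_lemmatize("was"))
```

Suffix stripping cannot relate `was` to `be` because the two share no letters; only a dictionary lookup can make that connection.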

# 16. Exercise: Stemming and Lemmatization

In [23]:
import re
import nltk
nltk.download('stopwords')
nltk.download('wordnet') # download for lemmatization

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/kevinwebb/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/kevinwebb/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [24]:
from nltk.corpus import stopwords

text = "The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war ? Is AI a bad thing ?"

# Normalize text
text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())

# Tokenize text
words = text.split()
print(words, "\n")

# Remove stop words
words = [w for w in words if w not in stopwords.words("english")]
print(words)

['the', 'first', 'time', 'you', 'see', 'the', 'second', 'renaissance', 'it', 'may', 'look', 'boring', 'look', 'at', 'it', 'at', 'least', 'twice', 'and', 'definitely', 'watch', 'part', '2', 'it', 'will', 'change', 'your', 'view', 'of', 'the', 'matrix', 'are', 'the', 'human', 'people', 'the', 'ones', 'who', 'started', 'the', 'war', 'is', 'ai', 'a', 'bad', 'thing'] 

['first', 'time', 'see', 'second', 'renaissance', 'may', 'look', 'boring', 'look', 'least', 'twice', 'definitely', 'watch', 'part', '2', 'change', 'view', 'matrix', 'human', 'people', 'ones', 'started', 'war', 'ai', 'bad', 'thing']


### Stemming

In [25]:
from nltk.stem.porter import PorterStemmer

# Reduce words to their stems
stemmed = [PorterStemmer().stem(w) for w in words]
print(stemmed)

['first', 'time', 'see', 'second', 'renaiss', 'may', 'look', 'bore', 'look', 'least', 'twice', 'definit', 'watch', 'part', '2', 'chang', 'view', 'matrix', 'human', 'peopl', 'one', 'start', 'war', 'ai', 'bad', 'thing']


### Lemmatization

In [26]:
from nltk.stem.wordnet import WordNetLemmatizer

# Reduce words to their root form
lemmed = [WordNetLemmatizer().lemmatize(w) for w in words]
print(lemmed, "\n")

# Lemmatize verbs by specifying pos
lemmed = [WordNetLemmatizer().lemmatize(w, pos='v') for w in lemmed]
print(lemmed)

['first', 'time', 'see', 'second', 'renaissance', 'may', 'look', 'boring', 'look', 'least', 'twice', 'definitely', 'watch', 'part', '2', 'change', 'view', 'matrix', 'human', 'people', 'one', 'started', 'war', 'ai', 'bad', 'thing'] 

['first', 'time', 'see', 'second', 'renaissance', 'may', 'look', 'bore', 'look', 'least', 'twice', 'definitely', 'watch', 'part', '2', 'change', 'view', 'matrix', 'human', 'people', 'one', 'start', 'war', 'ai', 'bad', 'thing']


# 17. Text Processing Summary

## Typical Workflow
- Sentence
  - Normalize
  - Tokenize
  - Remove Stop Words
  - Stem / Lemmatize
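
The workflow can be chained into a single function. This is a minimal sketch under stated assumptions: `STOP_WORDS` is a deliberately tiny, hypothetical list (NLTK's English list is much longer), and `reduce_word` is a hypothetical parameter that accepts any stemming or lemmatizing callable.

```python
import re

# Hypothetical, deliberately tiny stop word list for illustration
STOP_WORDS = {"the", "a", "an", "is", "are", "it", "of", "and", "to", "you", "will", "your"}


def process_text(text, stop_words=STOP_WORDS, reduce_word=None):
    # Normalize: lowercase, then replace non-alphanumeric characters with spaces
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    # Tokenize: split on whitespace
    tokens = text.split()
    # Remove stop words
    tokens = [t for t in tokens if t not in stop_words]
    # Stem or lemmatize, if a reducer was supplied
    if reduce_word is not None:
        tokens = [reduce_word(t) for t in tokens]
    return tokens


print(process_text("It will change your view of the matrix."))
```

A stemmer or lemmatizer method, such as the `PorterStemmer().stem` call used in the exercise, could be passed as `reduce_word`.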

# 18. Feature Extraction

We have clean text, but still can't use it for a model.

## What type of model?
For graph-based models
- represent text as symbolic nodes with relationships between them, as in WordNet

For statistical models
- numerical representations

## What is the end goal of the statistical model?
Perform a document-level task (ex: spam detection)
- per-document representations such as bag-of-words or doc2vec

Work with individual words or phrases (ex: text generation or machine translation)
- word-level representations such as word2vec or GloVe


WordNet visualization tool: http://mateogianolio.com/wordnet-visualization/

# 19. Bag of Words

Feature representation

Treats each document as an unordered collection or bag of words

Useful approach: turn each document into a vector of numbers representing how many times each word occurs in the document

A set of documents is known as a corpus

1. Collect all unique words in corpus to form vocabulary
2. Arrange words in some order
3. Let them form the vector element positions or columns of a table (assume each document is a row)
4. Count the number of occurrences of each word in each document
5. Enter the value in the respective column

Termed **Document-Term Matrix**
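
The five steps can be sketched in plain Python. The toy corpus is hypothetical and assumed to be already tokenized; `document_term_matrix` is an illustrative helper, not a library function.

```python
from collections import Counter


def document_term_matrix(tokenized_docs):
    # Steps 1-2: collect the unique words and fix an order (alphabetical here)
    vocabulary = sorted({word for doc in tokenized_docs for word in doc})
    # Step 3: each word gets a column position
    column = {word: i for i, word in enumerate(vocabulary)}
    matrix = []
    for doc in tokenized_docs:
        # Steps 4-5: count occurrences and enter them in the right columns
        row = [0] * len(vocabulary)
        for word, count in Counter(doc).items():
            row[column[word]] = count
        matrix.append(row)
    return vocabulary, matrix


# Hypothetical, already-tokenized toy corpus
docs = [["little", "house", "prairie"],
        ["mary", "little", "lamb"],
        ["little", "lamb", "little", "house"]]
vocabulary, matrix = document_term_matrix(docs)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row is one document and each column is one vocabulary word, which is the same layout that `CountVectorizer` produces as a sparse matrix.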

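Cosine similarity divides the dot product by the product of the two vector lengths, so similarities range from -1 to 1 regardless of document length (from 0 to 1 for word counts, which are never negative), whereas the raw dot product grows with the counts.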
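
A short sketch of the difference between the two measures, using hypothetical count vectors. Here `long_doc` is simply `short_doc` repeated three times, so the two should count as maximally similar.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    # Divide the dot product by the product of the vector lengths
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot(a, b) / (norm_a * norm_b)


# Hypothetical count vectors: `long_doc` is `short_doc` repeated three times
short_doc = [1, 0, 2]
long_doc = [3, 0, 6]
other_doc = [0, 4, 0]

print(dot(short_doc, long_doc))
print(cosine_similarity(short_doc, long_doc))
print(cosine_similarity(short_doc, other_doc))
```

The dot product is inflated by the longer document, while the cosine similarity of the first pair is (up to floating-point rounding) 1 and that of the second pair is 0, since the documents share no words.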

# 20. TF-IDF

Con of Bag-of-Words
- It treats every word as being equally important

Document frequency: the number of documents in the corpus that contain a given word

TF-IDF weights each word by the product of two values: term frequency and inverse document frequency
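
A minimal sketch of one common TF-IDF variant, assuming term frequency is the word's count divided by the document length and inverse document frequency is `log(N / df)`, where `N` is the number of documents and `df` is the word's document frequency. The toy corpus and the `tf_idf` helper are hypothetical.

```python
import math
from collections import Counter


def tf_idf(tokenized_docs):
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents that contain each word
    doc_freq = Counter(word for doc in tokenized_docs for word in set(doc))
    weights = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        weights.append({
            # tf = count / document length, idf = log(N / df)
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


# Hypothetical, already-tokenized toy corpus
docs = [["little", "house", "prairie"],
        ["mary", "little", "lamb"],
        ["little", "lamb", "little", "house"]]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

A word that appears in every document, like `little` here, gets a weight of 0 because its idf is `log(1)`. scikit-learn uses a different idf formula and normalizes each row, so its values will not match this sketch.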

# 21. Notebook: Bag of Words and TF-IDF

## Bag of Words and TF-IDF
Below, we'll look at three useful methods of vectorizing text.
- `CountVectorizer` - Bag of Words
- `TfidfTransformer` - TF-IDF values
- `TfidfVectorizer` - Bag of Words AND TF-IDF values

Let's first use an example from earlier and apply the text processing steps we saw in this lesson.

In [27]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /Users/kevinwebb/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/kevinwebb/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/kevinwebb/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [29]:
corpus = ["The first time you see The Second Renaissance it may look boring.",
        "Look at it at least twice and definitely watch part 2.",
        "It will change your view of the matrix.",
        "Are the human people the ones who started the war?",
        "Is AI a bad thing ?"]

stop_words = stopwords.words("english")
lemmatizer = WordNetLemmatizer()

Create a function `tokenize` that takes in a string of text and applies the following:
- case normalization (convert to all lowercase)
- punctuation removal
- tokenization, lemmatization, and stop word removal using `nltk`

In [30]:
def tokenize(text):
    # normalize case and remove punctuation
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    
    # tokenize text
    tokens = word_tokenize(text)
    
    # lemmatize and remove stop words
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]

    return tokens

## `CountVectorizer` (Bag of Words)

In [34]:
from sklearn.feature_extraction.text import CountVectorizer

# initialize count vectorizer object
vect = CountVectorizer(tokenizer=tokenize)

# get counts of each token (word) in text data
X = vect.fit_transform(corpus)

# convert sparse matrix to numpy array to view
X.toarray()

array([[0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0,
        0, 0, 0],
       [1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1,
        0, 0, 1],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        1, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0,
        0, 1, 0],
       [0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
        0, 0, 0]])

In [35]:
# view token vocabulary and counts
vect.vocabulary_

{'first': 6,
 'time': 20,
 'see': 17,
 'second': 16,
 'renaissance': 15,
 'may': 11,
 'look': 9,
 'boring': 3,
 'least': 8,
 'twice': 21,
 'definitely': 5,
 'watch': 24,
 'part': 13,
 '2': 0,
 'change': 4,
 'view': 22,
 'matrix': 10,
 'human': 7,
 'people': 14,
 'one': 12,
 'started': 18,
 'war': 23,
 'ai': 1,
 'bad': 2,
 'thing': 19}

## `TfidfTransformer`

In [42]:
from sklearn.feature_extraction.text import TfidfTransformer

# initialize tf-idf transformer object
transformer = TfidfTransformer(smooth_idf=False)

# use counts from count vectorizer results to compute tf-idf values
tfidf = transformer.fit_transform(X)

# convert sparse matrix to numpy array to view
tfidf.toarray()

array([[0.        , 0.        , 0.        , 0.        , 0.        ,
        0.33144579, 0.        , 0.        , 0.33144579, 0.        ,
        0.        , 0.12851912, 0.        , 0.19637646, 0.        ,
        0.33144579, 0.        , 0.        , 0.        , 0.        ,
        0.33144579, 0.33144579, 0.33144579, 0.        , 0.25703823,
        0.        , 0.33144579, 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.33144579, 0.        ],
       [0.        , 0.30858627, 0.        , 0.61717254, 0.        ,
        0.        , 0.        , 0.30858627, 0.        , 0.        ,
        0.        , 0.11965527, 0.30858627, 0.18283256, 0.        ,
        0.        , 0.        , 0.        , 0.30858627, 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.30858627, 0.        , 0.        ,
        0.30858627, 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.

## `TfidfVectorizer`
`TfidfVectorizer` = `CountVectorizer` + `TfidfTransformer`

In [43]:
from sklearn.feature_extraction.text import TfidfVectorizer

# initialize tf-idf vectorizer object
vectorizer = TfidfVectorizer()

# compute bag of word counts and tf-idf values
X = vectorizer.fit_transform(corpus)

# convert sparse matrix to numpy array to view
X.toarray()

array([[0.        , 0.        , 0.        , 0.        , 0.        ,
        0.30298183, 0.        , 0.        , 0.30298183, 0.        ,
        0.        , 0.20291046, 0.        , 0.24444384, 0.        ,
        0.30298183, 0.        , 0.        , 0.        , 0.        ,
        0.30298183, 0.30298183, 0.30298183, 0.        , 0.40582093,
        0.        , 0.30298183, 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.30298183, 0.        ],
       [0.        , 0.30015782, 0.        , 0.60031564, 0.        ,
        0.        , 0.        , 0.30015782, 0.        , 0.        ,
        0.        , 0.20101919, 0.30015782, 0.24216544, 0.        ,
        0.        , 0.        , 0.        , 0.30015782, 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.30015782, 0.        , 0.        ,
        0.30015782, 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.

# 22. One-Hot Encoding

So far, we have looked at representations that characterize an entire document or collection of words as one unit. Insights are document-level.

For word-level insights, we need to come up with a numerical representation for each word.

Treat each word like a class / column: a word's vector has a 1 in its own column and 0 everywhere else. Similar to bag-of-words, but with one word per vector.
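
One-hot encoding can be sketched as follows. `one_hot_encode` is a hypothetical helper, and the alphabetical column order is an arbitrary choice.

```python
def one_hot_encode(tokens):
    # One column per unique word, in alphabetical order
    vocabulary = sorted(set(tokens))
    index = {word: i for i, word in enumerate(vocabulary)}
    vectors = []
    for word in tokens:
        vector = [0] * len(vocabulary)
        vector[index[word]] = 1  # a single 1 marks the word's own column
        vectors.append(vector)
    return vocabulary, vectors


vocabulary, vectors = one_hot_encode(["ai", "bad", "thing", "ai"])
print(vocabulary)
for vector in vectors:
    print(vector)
```

Each vector is as long as the vocabulary, which is why this representation becomes impractical as the vocabulary grows.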

# 23. Word Embeddings

One-hot encoding breaks down when we have a large vocabulary.

Limit word representation to a fixed-size vector.

Finding similarities in word meanings and mapping them out in 2D space.

For more on word embeddings, take a look at the optional content at the end of the lesson.

# 24. Modeling

The final stage of the NLP pipeline is **modeling**, which includes designing a statistical or machine learning model, fitting its parameters to training data using an optimization procedure, and then using it to make predictions about unseen data.

The nice thing about working with numerical features is that it allows you to choose from all machine learning models or even a combination of them.

Once you have a working model, you can deploy it as a web app, mobile app, or integrate it with other products and services. The possibilities are endless!

# 25. Word2Vec

Popular example of word embeddings.

Transforms words to vectors.

Uses surrounding words as context, via the Continuous Bag of Words (CBoW) or Skip-gram model

Properties
- Robust, distributed representation
- Vector size independent of vocabulary
- Train once, store in lookup table
- Deep learning ready
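
In the Skip-gram model, the network learns to predict the surrounding context words from a center word. This sketch only generates the (center, context) training pairs; the training itself is not shown. The `window` parameter and the `skipgram_pairs` helper are assumptions for illustration.

```python
def skipgram_pairs(tokens, window=2):
    # For each center word, pair it with every word at most `window` positions away
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


tokens = ["change", "view", "matrix"]
print(skipgram_pairs(tokens, window=1))
```

A larger window captures broader, more topical context, while a smaller one focuses on immediate neighbors.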

# 26. GloVe

Another example of word embeddings

Global Vectors for Word Representation

Tries to directly optimize the vector representation of each word using only co-occurrence statistics.
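
Co-occurrence statistics can be gathered by counting how often two words appear near each other. This is a rough sketch of the counting step only, using raw counts within a hypothetical `window`; fitting vectors to these counts, which is what GloVe actually does, is not shown.

```python
from collections import Counter


def cooccurrence_counts(tokenized_docs, window=2):
    # Count how often each ordered pair of words appears within `window` positions
    counts = Counter()
    for doc in tokenized_docs:
        for i, word in enumerate(doc):
            for j in range(max(0, i - window), min(len(doc), i + window + 1)):
                if j != i:
                    counts[(word, doc[j])] += 1
    return counts


# Hypothetical, already-tokenized toy corpus
docs = [["little", "lamb", "little", "house"],
        ["mary", "little", "lamb"]]
counts = cooccurrence_counts(docs, window=1)
print(counts[("little", "lamb")], counts[("lamb", "little")])
```

Because the window is symmetric, the count for a pair is the same in both directions.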

# 27. Embeddings for Deep Learning

Distributional Hypothesis: words that occur in similar contexts tend to have similar meanings.

Mapped out on a 2D plane, words with common contexts tend to get pulled closer and closer together.

Dimensions can be added to capture the different meanings of words.

# 28. t-SNE

t-Distributed Stochastic Neighbor Embedding

A dimensionality reduction technique that maps high-dimensional vectors to a lower-dimensional space.