## NLP overview

### What is natural language processing (NLP)?
Natural language processing is an area of research in computer science and artificial intelligence (AI) concerned with processing natural languages such as English or Mandarin. This processing generally involves
translating natural language into data (numbers) that a computer can use to
learn about the world. This understanding of the world is sometimes used
to generate natural language text that reflects that understanding.

![NLP Application](img/nlpa.png)

### Building your vocabulary with a tokenizer

In NLP, tokenization is a particular kind of document segmentation. Segmentation
breaks up text into smaller chunks or segments with more focused information content. Segmentation can include breaking a document into paragraphs, paragraphs
into sentences, sentences into phrases, or phrases into tokens (usually words) and
punctuation.

**Tokenization** is the first step in an NLP pipeline, so it can have a big impact on the
rest of your pipeline. A tokenizer breaks unstructured data, natural language text,
into chunks of information that can be counted as discrete elements. These counts of
token occurrences in a document can be used directly as a vector representing that
document. This immediately turns an unstructured string (text document) into a
numerical data structure suitable for machine learning. 

The simplest way to tokenize a sentence is to use whitespace within a string as the
“delimiter” of words:

In [1]:
sentence = "Ftech is a AI-oriented company Apple is a company which sales apple"

In [2]:
sentence.split()

['Ftech',
 'is',
 'a',
 'AI-oriented',
 'company',
 'Apple',
 'is',
 'a',
 'company',
 'which',
 'sales',
 'apple']

With a bit more Python, you can create a numerical
vector representation for each word. These vectors are called *one-hot vectors*.

In [3]:
import numpy as np

In [4]:
token_sequence = str.split(sentence)

In [5]:
vocab = sorted(set(token_sequence))

In [6]:
vocab

['AI-oriented',
 'Apple',
 'Ftech',
 'a',
 'apple',
 'company',
 'is',
 'sales',
 'which']

In [7]:
num_tokens = len(token_sequence)
vocab_size = len(vocab)
onehot_vectors = np.zeros((num_tokens,vocab_size), int)

In [8]:
onehot_vectors

array([[0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0]])

In [9]:
for i, word in enumerate(token_sequence):
    onehot_vectors[i, vocab.index(word)] = 1

In [10]:
onehot_vectors

array([[0, 0, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 1, 0, 0, 0, 0]])

If you have trouble quickly reading all those ones and zeros, you’re not alone. **Pandas**
DataFrames can help make this a little easier on the eyes and more informative.

In [11]:
import pandas as pd
pd.DataFrame(onehot_vectors, columns=vocab)

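| | AI-oriented | Apple | Ftech | a | apple | company | is | sales | which |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 5 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 7 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 8 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 11 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |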


One nice feature of this vector representation of words and tabular representation
of documents is that no information is lost.
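To see why nothing is lost, here is a minimal sketch that rebuilds the same one-hot table from scratch (so it runs on its own) and then maps each row back to its word. The name `recovered` is introduced only for this illustration.

```python
import numpy as np

sentence = "Ftech is a AI-oriented company Apple is a company which sales apple"
token_sequence = sentence.split()
vocab = sorted(set(token_sequence))

# Rebuild the one-hot table: one row per token, one column per vocabulary word.
onehot_vectors = np.zeros((len(token_sequence), len(vocab)), int)
for i, word in enumerate(token_sequence):
    onehot_vectors[i, vocab.index(word)] = 1

# Each row holds a single 1, so argmax gives the column (word) for that row.
recovered = [vocab[row.argmax()] for row in onehot_vectors]
print(" ".join(recovered) == sentence)  # True
```

The round trip works here because the sentence uses single spaces between tokens; the row order of the table is what preserves the word order.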

The cost is size. Let’s assume you
have a million tokens in your NLP pipeline vocabulary,
and let’s say you have a meager 3,000 books with 3,500 sentences each and 15 words per sentence, which are reasonable
averages for short books. How much memory would the one-hot table need if each cell took a single byte?

In [12]:
num_rows = 3000 * 3500 * 15
num_bytes = num_rows * 1000000
num_bytes

157500000000000

In [13]:
num_bytes / 1e9 # gigabytes

157500.0

In [14]:
_ / 1000  # terabytes

157.5
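The calculation above implicitly assumes one byte per table cell. As a rough sketch, the same arithmetic can be wrapped in a small helper (the function name `dense_onehot_bytes` and its `bytes_per_cell` parameter are made up for this example) to see how the estimate changes with the cell size. An array created with `np.zeros(..., int)` often uses 8 bytes per cell, so the real footprint could be considerably larger.

```python
def dense_onehot_bytes(num_tokens, vocab_size, bytes_per_cell=1):
    """Bytes for a dense table with one row per token and one column per vocabulary word."""
    return num_tokens * vocab_size * bytes_per_cell

num_tokens = 3000 * 3500 * 15  # books * sentences per book * words per sentence

print(dense_onehot_bytes(num_tokens, 1_000_000) / 1e12)     # 157.5 (terabytes at 1 byte per cell)
print(dense_onehot_bytes(num_tokens, 1_000_000, 8) / 1e12)  # 1260.0 (terabytes at 8 bytes per cell)
```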

![NLP Application](img/wic.jpeg)

### Word frequency vector

If you summed all these one-hot vectors together, rather than “replaying” them
one at a time, you’d get a bag-of-words vector. This is also called a word frequency vector, because it only counts the frequency of words, not their order. You could use this
single vector to represent the whole document or sentence in a single, reasonable-length vector. It would only be as long as your vocabulary size.
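As a minimal sketch of that summation, the one-hot table for the example sentence can be rebuilt and collapsed with a column-wise sum. The variable name `bow_counts` is introduced here only for illustration.

```python
import numpy as np

sentence = "Ftech is a AI-oriented company Apple is a company which sales apple"
token_sequence = sentence.split()
vocab = sorted(set(token_sequence))

onehot_vectors = np.zeros((len(token_sequence), len(vocab)), int)
for i, word in enumerate(token_sequence):
    onehot_vectors[i, vocab.index(word)] = 1

# Summing down the rows collapses the sequence into one count per vocabulary word.
bow_counts = onehot_vectors.sum(axis=0)
print(dict(zip(vocab, bow_counts.tolist())))
# {'AI-oriented': 1, 'Apple': 1, 'Ftech': 1, 'a': 2, 'apple': 1, 'company': 2, 'is': 2, 'sales': 1, 'which': 1}
```

The result has 9 entries (the vocabulary size) instead of 12 rows of 9, and the word order is gone.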

Here’s what your single text document
looks like as a binary bag-of-words vector:

In [15]:
sentence_bow = {}
for token in sentence.split():
    sentence_bow[token] = 1

In [16]:
sentence_bow

{'Ftech': 1,
 'is': 1,
 'a': 1,
 'AI-oriented': 1,
 'company': 1,
 'Apple': 1,
 'which': 1,
 'sales': 1,
 'apple': 1}

Let’s add a few more texts to your corpus to see how a DataFrame stacks up. A
DataFrame indexes both the columns (documents) and rows (words) so it can be an
“inverse index” for document retrieval.

In [17]:
sentences = "Thomas Jefferson began building Monticello at the age of 26.\n"
sentences += "Construction was done mostly by local masons and carpenters.\n"
sentences += "He moved into the South Pavilion in 1770.\n"
sentences += "Turning Monticello into a neoclassical masterpiece was Jefferson's obsession."
corpus = {}
for i, sent in enumerate(sentences.split('\n')):
    corpus['sent{}'.format(i)] = dict((tok, 1) for tok in sent.split())

In [18]:
corpus

{'sent0': {'Thomas': 1,
  'Jefferson': 1,
  'began': 1,
  'building': 1,
  'Monticello': 1,
  'at': 1,
  'the': 1,
  'age': 1,
  'of': 1,
  '26.': 1},
 'sent1': {'Construction': 1,
  'was': 1,
  'done': 1,
  'mostly': 1,
  'by': 1,
  'local': 1,
  'masons': 1,
  'and': 1,
  'carpenters.': 1},
 'sent2': {'He': 1,
  'moved': 1,
  'into': 1,
  'the': 1,
  'South': 1,
  'Pavilion': 1,
  'in': 1,
  '1770.': 1},
 'sent3': {'Turning': 1,
  'Monticello': 1,
  'into': 1,
  'a': 1,
  'neoclassical': 1,
  'masterpiece': 1,
  'was': 1,
  "Jefferson's": 1,
  'obsession.': 1}}

In [19]:
df = pd.DataFrame.from_records(corpus).fillna(0).astype(int).T
df[df.columns[:10]]

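| | 1770. | 26. | Construction | He | Jefferson | Jefferson's | Monticello | Pavilion | South | Thomas |
|---|---|---|---|---|---|---|---|---|---|---|
| sent0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| sent1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| sent2 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |
| sent3 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |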


### Dot product
The dot product is also called the inner product because the “inner” dimension of
the two vectors (the number of elements in each vector) or matrices (the rows of the
first matrix and the columns of the second matrix) must be the same, because that’s
where the products happen.

In [20]:
v1 = np.array([1, 2, 3])  # use NumPy directly; the pd.np alias is gone from recent pandas releases
v2 = np.array([2, 3, 4])
v1.dot(v2)

20
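To make the “where the products happen” part concrete, here is a small sketch that computes the same result with plain Python lists, without NumPy:

```python
v1 = [1, 2, 3]
v2 = [2, 3, 4]

# Multiply matching elements, then add the products: 1*2 + 2*3 + 3*4.
dot = sum(a * b for a, b in zip(v1, v2))
print(dot)  # 20
```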

### Measuring bag-of-words overlap

If we can measure the bag-of-words overlap for two vectors, we can get a good estimate
of how similar they are in the words they use. That, in turn, is often a reasonable first estimate of how similar they are in meaning.

In [21]:
df = df.T
df.sent0.dot(df.sent1)

0

In [22]:
df.sent0.dot(df.sent2)

1

In [23]:
df.sent0.dot(df.sent3)

1

From this you can tell that one word was used in both sent0 and sent2 (“the”). Likewise, one
of the words in your vocabulary was used in both sent0 and sent3 (“Monticello”).
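The dot product only tells you how many words overlap. A short sketch with Python sets (using the same whitespace tokenization as above; the names `shared_02` and `shared_03` are introduced for illustration) shows which words they are:

```python
sent0 = "Thomas Jefferson began building Monticello at the age of 26."
sent2 = "He moved into the South Pavilion in 1770."
sent3 = "Turning Monticello into a neoclassical masterpiece was Jefferson's obsession."

# With binary bag-of-words vectors, the dot product equals the size of the
# intersection of the two token sets, so the sets reveal which words overlap.
shared_02 = set(sent0.split()) & set(sent2.split())
shared_03 = set(sent0.split()) & set(sent3.split())
print(shared_02)  # {'the'}
print(shared_03)  # {'Monticello'}
```

Note that “Jefferson” and “Jefferson's” do not match, because whitespace tokenization treats them as different tokens.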

### A token improvement
In some situations, other characters besides spaces are used to separate words in a sentence. You need
your tokenizer to split a sentence not just on whitespace, but also on punctuation such
as commas, periods, quotes, semicolons, and even hyphens (dashes). In some cases
you want these punctuation marks to be treated like words, as independent tokens. In
other cases you may want to ignore them.

In [24]:
import re
sentence = "Thomas Jefferson began building Monticello at the age of 26."    
tokens = re.split(r'[-\s.,;!?]+', sentence)
tokens

['Thomas',
 'Jefferson',
 'began',
 'building',
 'Monticello',
 'at',
 'the',
 'age',
 'of',
 '26',
 '']
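The trailing empty string appears because the sentence ends with a delimiter. One simple way to clean this up, sketched here as an option rather than the only approach, is to drop empty tokens after the split:

```python
import re

sentence = "Thomas Jefferson began building Monticello at the age of 26."

# Splitting on the final period leaves an empty string at the end,
# so keep only the non-empty tokens.
tokens = [tok for tok in re.split(r'[-\s.,;!?]+', sentence) if tok]
print(tokens)
# ['Thomas', 'Jefferson', 'began', 'building', 'Monticello', 'at', 'the', 'age', 'of', '26']
```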

Several Python libraries implement tokenizers, each with its own advantages and
disadvantages:
* spaCy: accurate, flexible, fast, Python
* Stanford CoreNLP: more accurate, less flexible, fast, depends on Java 8
* NLTK: standard used by many NLP contests and comparisons, popular, Python

In [25]:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+|\$[0-9.]+|\S+')  # \$ matches a literal dollar sign (a bare $ is an end-of-string anchor)
tokenizer.tokenize(sentence)

['Thomas',
 'Jefferson',
 'began',
 'building',
 'Monticello',
 'at',
 'the',
 'age',
 'of',
 '26',
 '.']

In [26]:
from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize(sentence)

['Thomas',
 'Jefferson',
 'began',
 'building',
 'Monticello',
 'at',
 'the',
 'age',
 'of',
 '26',
 '.']

### Extending your vocabulary with n-grams
An n-gram is a sequence of n contiguous elements that have been extracted from a
longer sequence of those elements, usually a string. In general the “elements” of an n-gram
can be characters, syllables, words, or even symbols like “A,” “T,” “G,” and “C” used to
represent a DNA sequence.

Why bother with n-grams? As you saw earlier, when a sequence of tokens is vectorized into a bag-of-words vector, it loses a lot of the meaning inherent in the order of
those words. By extending your concept of a token to include multiword tokens,
n-grams, your NLP pipeline can retain much of the meaning inherent in the order of
words in your statements. For example, the meaning-inverting word “not” will remain
attached to its neighboring words, where it belongs. Without n-gram tokenization, it
would be free floating. Its meaning would be associated with the entire sentence or
document rather than its neighboring words. The 2-gram “was not” retains much
more of the meaning of the individual words “not” and “was” than those 1-grams
alone in a bag-of-words vector. A bit of the context of a word is retained when you tie it
to its neighbor(s) in your pipeline.

In [27]:
tokenizer = TreebankWordTokenizer()
tokens = tokenizer.tokenize(sentence)  # keep the result so the next cell uses these tokens
tokens

['Thomas',
 'Jefferson',
 'began',
 'building',
 'Monticello',
 'at',
 'the',
 'age',
 'of',
 '26',
 '.']

In [28]:
from nltk.util import ngrams
list(ngrams(tokens, 2))

[('Thomas', 'Jefferson'),
 ('Jefferson', 'began'),
 ('began', 'building'),
 ('building', 'Monticello'),
 ('Monticello', 'at'),
 ('at', 'the'),
 ('the', 'age'),
 ('age', 'of'),
 ('of', '26'),
 ('26', '.')]
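If you want to see what an n-gram function does under the hood, here is a minimal pure-Python sketch. The helper name `make_ngrams` and the example sentence are made up for illustration; the sentence was chosen so that the “was not” 2-gram discussed above shows up.

```python
def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # zip over the token list and its shifted copies: tokens[0:], tokens[1:], ...
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "the movie was not good".split()
bigrams = [" ".join(pair) for pair in make_ngrams(tokens, 2)]
print(bigrams)  # ['the movie', 'movie was', 'was not', 'not good']
```

Joining each tuple into a single string gives you multiword tokens that can be counted just like single words.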

### Stop words
Stop words are common words in any language that occur with a high frequency but
carry much less substantive information about the meaning of a phrase. Examples of
some common stop words include
* a, an
* the, this
* and, or
* of, on

Historically, stop words have been excluded from NLP pipelines in order to reduce
the computational effort to extract information from a text. Even though the words
themselves carry little information, they can provide important relational information as part of an n-gram.

In [29]:
import nltk
nltk.download('stopwords')
stop_words = nltk.corpus.stopwords.words('english')
stop_words[:7]

[nltk_data] Downloading package stopwords to /Users/andy/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['i', 'me', 'my', 'myself', 'we', 'our', 'ours']

### Normalizing your vocabulary
#### Case folding
In Python, you can easily normalize the capitalization of your tokens with a list
comprehension:

In [30]:
tokens = ['House', 'Visitor', 'Center']
normalized_tokens = [x.lower() for x in tokens]
print(normalized_tokens)

['house', 'visitor', 'center']


#### Stemming
Another common vocabulary normalization technique is to eliminate the small meaning differences of pluralization or possessive endings of words, or even various verb
forms. This normalization, identifying a common stem among various forms of a
word, is called stemming. For example, the words housing and houses share the same
stem, house. Stemming removes suffixes from words in an attempt to combine words
with similar meanings together under their common stem. A stem isn’t required to be
a properly spelled word, but merely a token, or label, representing several possible
spellings of a word.

In [31]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
' '.join([stemmer.stem(w).strip("'") for w in "dish washer's washed dishes".split()])

'dish washer wash dish'

#### Lemmatization
If you have access to information about connections between the meanings of various
words, you might be able to associate several words together even if their spelling is
quite different. This more extensive normalization down to the semantic root of a
word—its lemma—is called lemmatization.
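Real lemmatizers rely on a lexical knowledge base and usually on the word's part of speech. The toy sketch below only illustrates the idea: `LEMMA_TABLE` and `toy_lemmatize` are hypothetical names, and the hand-written table stands in for a real knowledge base.

```python
# Hypothetical, hand-written lookup table standing in for a real lexical
# knowledge base; the entries are illustrative only.
LEMMA_TABLE = {
    "better": "good",
    "best": "good",
    "geese": "goose",
    "ran": "run",
    "running": "run",
}

def toy_lemmatize(token):
    """Return the lemma for a token, or the lowercased token if it is unknown."""
    token = token.lower()
    return LEMMA_TABLE.get(token, token)

print([toy_lemmatize(t) for t in "Geese ran better".split()])
# ['goose', 'run', 'good']
```

Unlike a stemmer, this maps “better” to “good” even though the two spellings share no common stem.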

### Sentiment
Whether you use raw single-word tokens, n-grams, stems, or lemmas in your NLP pipeline, each of those tokens contains some information. An important part of this information is the word’s sentiment: the overall feeling or emotion that the word evokes.
This sentiment analysis—measuring the sentiment of phrases or chunks of text—is a
common application of NLP. In many companies it’s the main thing an NLP engineer
is asked to do.

#### Bag of words
In the previous section, you created your first vector space model of a text. You used
one-hot encoding of each word and then combined all those vectors with a binary OR
(or clipped sum) to create a vector representation of a text. And this binary bag-of-words vector makes a great index for document retrieval when loaded into a data
structure such as a Pandas DataFrame. You then looked at an even more useful vector representation that counts the
number of occurrences, or frequency, of each word in the given text.
Let’s look at an example where counting occurrences of words is useful:

In [32]:
from nltk.tokenize import TreebankWordTokenizer
sentence = "The faster Harry got to the store, the faster Harry, the faster, would get home."
tokenizer = TreebankWordTokenizer()
tokens = tokenizer.tokenize(sentence.lower())
tokens

['the',
 'faster',
 'harry',
 'got',
 'to',
 'the',
 'store',
 ',',
 'the',
 'faster',
 'harry',
 ',',
 'the',
 'faster',
 ',',
 'would',
 'get',
 'home',
 '.']

With your simple list, you want to get unique words from the document and their
counts:

In [33]:
from collections import Counter
bag_of_words = Counter(tokens)
bag_of_words

Counter({'the': 4,
         'faster': 3,
         'harry': 2,
         'got': 1,
         'to': 1,
         'store': 1,
         ',': 3,
         'would': 1,
         'get': 1,
         'home': 1,
         '.': 1})

For short documents like this one, the unordered bag of words still contains a lot of
information about the original intent of the sentence. And the information in a bag of
words is sufficient to do some powerful things such as detect spam, compute sentiment (positivity, happiness, and so on), and even detect subtle intent.

In [34]:
bag_of_words.most_common(4)

[('the', 4), ('faster', 3), (',', 3), ('harry', 2)]

The number of times a word occurs in a given document is called the *term
frequency*, commonly abbreviated **TF**.

Let’s calculate the term frequency of “harry” from the Counter object:

In [36]:
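times_harry_appears = bag_of_words['harry']
num_unique_words = len(bag_of_words)  # number of distinct tokens, punctuation included
# Note: this divides by the count of distinct tokens, not by the total token count len(tokens).
tf = times_harry_appears / num_unique_words
tf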

0.18181818181818182

Let’s pause for a second and look a little deeper at normalized term frequency, a
phrase (and calculation) we use often throughout this book. It’s the word count tempered by how long the document is. But why “temper” it at all? Let’s say you find the
word “dog” 3 times in document A and 100 times in document B. Clearly “dog” is way
more important to document B. But wait. Let’s say you find out document A is a
30-word email to a veterinarian and document B is War & Peace (approx 580,000
words!). Your first analysis was straight-up backwards.
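The arithmetic behind that reversal is quick to check. In this sketch the helper `normalized_tf` is a name made up for the example, and the document lengths are the approximate figures quoted above:

```python
def normalized_tf(term_count, doc_length):
    """Term count divided by the total number of words in the document."""
    return term_count / doc_length

tf_dog_a = normalized_tf(3, 30)         # 30-word email to a veterinarian
tf_dog_b = normalized_tf(100, 580_000)  # War & Peace, approximate length
print(tf_dog_a)            # 0.1
print(round(tf_dog_b, 6))  # 0.000172
```

Once the counts are normalized, “dog” carries several hundred times more weight in the email than in the novel.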

Let's jump to a bigger corpus:

In [39]:
from collections import Counter
from nltk.tokenize import TreebankWordTokenizer
from nlpia.data.loaders import kite_text  # sample text about kites from the nlpia package
tokenizer = TreebankWordTokenizer()
tokens = tokenizer.tokenize(kite_text.lower())



In [41]:
token_counts = Counter(tokens)
token_counts.most_common(10)

[('the', 26),
 ('a', 20),
 ('kite', 16),
 (',', 15),
 ('and', 10),
 ('of', 10),
 ('kites', 8),
 ('is', 7),
 ('in', 7),
 ('or', 6)]

It’s not likely that this Wikipedia article is about the articles “the” and “a,” nor the conjunction “and” and the other
stop words. So let’s ditch them for now:

In [43]:
stopwords = nltk.corpus.stopwords.words('english')
tokens = [x for x in tokens if x not in stopwords]
kite_counts = Counter(tokens)
kite_counts.most_common(10)

[('kite', 16),
 (',', 15),
 ('kites', 8),
 ('wing', 5),
 ('lift', 4),
 ('may', 4),
 ('also', 3),
 ('kiting', 3),
 ('flown', 3),
 ('tethered', 2)]

#### Vectorizing

You’ve transformed your text into numbers on a basic level. But you’ve still just stored
them in a dictionary, so you’ve taken one step out of the text-based world and into the
realm of mathematics. Next you’ll go ahead and jump in all the way. Instead of
describing a document in terms of a frequency dictionary, you’ll make a vector of
those word counts. In Python, this will be a list, but in general it’s an ordered collection or array. You can do this quickly with a short loop:

In [47]:
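document_vector = []
doc_length = len(tokens)  # total tokens left after removing stop words
for key, value in kite_counts.most_common():  # most frequent tokens first
    document_vector.append(value / doc_length)  # normalized term frequency
document_vector[:3]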

[0.07207207207207207, 0.06756756756756757, 0.036036036036036036]

You had one “document” already;
let’s round out the corpus with a couple more:

In [46]:
docs = ["The faster Harry got to the store, the faster and faster Harry would get home."]
docs.append("Harry is hairy and faster than Jill.")
docs.append("Jill is not as hairy as Harry.")

### Vector spaces
Vectors are the primary building blocks of linear algebra, or vector algebra. They’re
an ordered list of numbers, or coordinates, in a vector space. They describe a location
or position in that space. Or they can be used to identify a particular direction and
magnitude or distance in that space. A space is the collection of all possible vectors that
could appear in that space. So a vector with two values would lie in a 2D vector space,
a vector with three values in 3D vector space, and so on.

For a natural language document vector space, as used for TF (and TF-IDF to come), the dimensionality of your vector
space is the count of the number of distinct words that appear in the entire corpus.

Two vectors are “similar” if they share similar direction. They might have similar
magnitude (length), which would mean that the word count (term frequency) vectors
are for documents of about the same length. But do you care about document length
in your similarity estimate for vector representations of words in documents? Probably
not. You’d like your estimate of document similarity to find use of the same words
about the same number of times in similar proportions. This accurate estimate would
give you confidence that the documents they represent are probably talking about
similar things.
Cosine similarity is merely the cosine of the angle between two vectors (theta):

![Cosine similarity](img/cs.png)

![Vectors in a 2D space](img/2dspace.png)

In [49]:
import math
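def cosine_sim(vec1, vec2):
    """Cosine similarity of two dicts that share the same keys in the same order."""
    # Convert the dictionaries to lists of values so elements can be matched by position.
    vec1 = [val for val in vec1.values()]
    vec2 = [val for val in vec2.values()]
    # Dot product: sum of the products of matching elements.
    dot_prod = 0
    for i, v in enumerate(vec1):
        dot_prod += v * vec2[i]
    # Magnitude (Euclidean length) of each vector.
    mag_1 = math.sqrt(sum([x**2 for x in vec1]))
    mag_2 = math.sqrt(sum([x**2 for x in vec2]))
    return dot_prod / (mag_1 * mag_2)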
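To try cosine similarity on the three-document corpus, the vectors first have to line up on a shared lexicon. The sketch below is one way to do that, not necessarily the one used elsewhere in this notebook: the regex tokenizer, the helper names `count_vector` and `cosine`, and the use of raw counts are all assumptions made for this example.

```python
import math
import re
from collections import Counter

docs = [
    "The faster Harry got to the store, the faster and faster Harry would get home.",
    "Harry is hairy and faster than Jill.",
    "Jill is not as hairy as Harry.",
]

# Simple stand-in tokenizer (an assumption): lowercase runs of word characters.
doc_tokens = [re.findall(r"\w+", doc.lower()) for doc in docs]
lexicon = sorted(set(tok for toks in doc_tokens for tok in toks))

def count_vector(tokens):
    """Counts for every lexicon word, in lexicon order, so all vectors line up."""
    counts = Counter(tokens)
    return [counts[word] for word in lexicon]

def cosine(vec1, vec2):
    """Cosine of the angle between two equal-length lists of numbers."""
    dot_prod = sum(a * b for a, b in zip(vec1, vec2))
    mag_1 = math.sqrt(sum(a * a for a in vec1))
    mag_2 = math.sqrt(sum(b * b for b in vec2))
    return dot_prod / (mag_1 * mag_2)

vectors = [count_vector(toks) for toks in doc_tokens]
print(round(cosine(vectors[0], vectors[1]), 2))  # 0.42
print(round(cosine(vectors[1], vectors[2]), 2))  # 0.5
```

Building every vector from the same sorted lexicon is what guarantees that position `i` means the same word in each document.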

### TF-IDF

![TF-IDF](img/tfidf.png)

In [52]:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = docs
vectorizer = TfidfVectorizer(min_df=1)
model = vectorizer.fit_transform(corpus)
print(model.todense().round(2))

[[0.16 0.   0.48 0.21 0.21 0.   0.25 0.21 0.   0.   0.   0.21 0.   0.64
  0.21 0.21]
 [0.37 0.   0.37 0.   0.   0.37 0.29 0.   0.37 0.37 0.   0.   0.49 0.
  0.   0.  ]
 [0.   0.75 0.   0.   0.   0.29 0.22 0.   0.29 0.29 0.38 0.   0.   0.
  0.   0.  ]]
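To see where numbers like these come from, here is a hand-rolled sketch that assumes scikit-learn's default settings: lowercased tokens of two or more word characters, raw counts as term frequency, the smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The helper name `tfidf_vector` is introduced only for this example.

```python
import math
import re
from collections import Counter

docs = [
    "The faster Harry got to the store, the faster and faster Harry would get home.",
    "Harry is hairy and faster than Jill.",
    "Jill is not as hairy as Harry.",
]

# Lowercased tokens of two or more word characters (assumed tokenization).
doc_tokens = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
vocab = sorted(set(tok for toks in doc_tokens for tok in toks))
n_docs = len(docs)

# Document frequency: the number of documents that contain each term.
df = Counter(tok for toks in doc_tokens for tok in set(toks))
# Smoothed inverse document frequency (assumed formula, see the lead-in).
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

def tfidf_vector(tokens):
    """Raw count times IDF for each vocabulary term, scaled to unit length."""
    counts = Counter(tokens)
    weights = [counts[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

for toks in doc_tokens:
    print([round(w, 2) for w in tfidf_vector(toks)])
```

Under these assumptions, “the” gets the largest weight in the first document (about 0.64) because it occurs three times there and in no other document, while “harry”, which appears in all three documents, is weighted down.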
