*This is the fifth post in a series based on my [Python for Data Science bootcamp](https://github.com/gramster/pythonbootcamp). The other posts are:*

- *[a Python crash course](https://www.grahamwheeler.com/posts/python-crash-course.html)*
- *[using Jupyter](https://www.grahamwheeler.com/posts/using-jupyter.html)*
- *[exploratory data analysis](https://www.grahamwheeler.com/posts/exploratory-data-analysis-with-numpy-and-pandas.html)*
- *[introductory machine learning](https://www.grahamwheeler.com/posts/basic-machine-learning.html)*

In this post we will take a look at NLP - natural language processing - namely how we can apply ML techniques to collections of text (which we call _corpora_; the singular is _corpus_).

We've touched on this topic before but will go into more detail here.

There are a number of applications for NLP, including:

- sentiment analysis - is the text saying positive or negative things? There are many reasons we may want to know this. A common use case is monitoring social media like Twitter and seeing if people are expressing positive or negative opinions about a company and how the trend is changing over time. If you're representing the company on social media, or you're trading the stock of the company, this is very useful info.
- entity extraction/named-entity recognition (NER) - what people, places and things are mentioned in the text?
- topic modeling or text summarization - what is the text saying? Topic modeling is just broad classification - for example, "Is this text about sports?", while text summarization is trying to extract the most salient points from the text.
- text generation - we can train models to generate text in different styles or on various topics
- auto-responder bots - combining some of the above techniques, we can build bots to do, for example, first-line product support

Before applying an NLP algorithm, we need to prepare the textual data. This includes a number of steps:

- _cleaning_ the data. If we are using word-level representation, we are going to want to map each word into a numeric representation (e.g. a vector), and to do this we would want to restrict ourselves in most cases to a finite set of allowable words (the _vocabulary_). In order to reduce the size of the vocabulary we typically will do some cleaning/preprocessing. This can include removing punctuation and "filler" words like "the" that aren't needed to understand the text; we call these _stop words_. We may also want to standardize the form of words to reduce the number of variations which we can do with _stemming_ and _lemmatization_.
- _encoding_ the data. To represent text in a way amenable to ML algorithms, we need to encode it in some form of numeric vector. Depending on the algorithm, we may care about the order of the words, or in simple cases, perhaps we can simply use a set. In between these extremes we could have an unordered list of the words but with each word associated with some score for how important it is, or we could focus only on order for short sequences like adjacent word pairs (2-grams) or triplets (3-grams). If using a score, this could be as simple as the word count, or it could be a more sophisticated measure like the _TF-IDF_ score.

We'll discuss each of these further in the next sections.

## Python Libraries for NLP

### Natural Language Toolkit (NLTK)

### spaCy

### Other

Facebook and Allen Institute

## Preparing the Text

### Removing Punctuation

Note that punctuation can be significant - think of the difference between ending a sentence with ! vs ? - but that most of the techniques used in NLP will ignore it. Character-level convolutional neural networks are one way of making use of punctuation.

In [8]:
import sys
import unicodedata

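# Map every Unicode punctuation code point (category P*) to None,
# so that str.translate() deletes those characters
punc = dict.fromkeys(i for i in range(sys.maxunicode + 1) if unicodedata.category(chr(i)).startswith('P'))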

data = [
    "I'll be there!",
    "N-o-o-o-o!!"
]
    
new_data = [s.translate(punc) for s in data]

new_data

['Ill be there', 'Noooo']

### Dealing with Letter Case
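
Most word-level techniques treat "The" and "the" as different tokens unless we normalize case first, so a common step is to lower-case everything. The sketch below uses only built-in string methods, and the sample strings are made up for illustration. `str.casefold()` is a more aggressive variant of `str.lower()` intended for caseless matching:

```python
data = ["The Cat in the Hat", "THE CAT SAT ON THE MAT"]

# Lower-case each string so that "The", "THE" and "the" become the same token
lowered = [s.lower() for s in data]

# casefold() also handles cases that lower() leaves alone, e.g. the German sharp s
folded = "Straße".casefold()

print(lowered)
print(folded)
```

Bear in mind that lower-casing can discard information (for example "US" the country vs "us" the pronoun), so whether to do it depends on the task.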

### Tokenizing Words and Splitting Sentences

In [9]:
from nltk.tokenize import word_tokenize

word_tokenize("The cat in the hat.")

['The', 'cat', 'in', 'the', 'hat', '.']

In [12]:
from nltk.tokenize import sent_tokenize

data = "I will not eat them with a fox! I will not eat them in a box."

sent_tokenize(data)

['I will not eat them with a fox!', 'I will not eat them in a box.']

In [13]:
[word_tokenize(s) for s in sent_tokenize(data)]

[['I', 'will', 'not', 'eat', 'them', 'with', 'a', 'fox', '!'],
 ['I', 'will', 'not', 'eat', 'them', 'in', 'a', 'box', '.']]

Note that if we want to do sentence tokenization we should _not_ remove punctuation beforehand. Instead we could do something like:

In [15]:
[word_tokenize(s.translate(punc)) for s in sent_tokenize(data)]

[['I', 'will', 'not', 'eat', 'them', 'with', 'a', 'fox'],
 ['I', 'will', 'not', 'eat', 'them', 'in', 'a', 'box']]

### Stop-Word Removal and Restricting to a Fixed Vocabulary

In [23]:
# Install the prerequisites

import nltk


nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /Users/gram/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [24]:
from nltk.corpus import stopwords

stop = stopwords.words('english')

# NLTK's stop-word list is all lower-case, so we lower-case the text before filtering

data = "I will not eat them with a fox! I will not eat them in a box."
sentences = sent_tokenize(data)

[[w for w in word_tokenize(s.translate(punc).lower()) if w not in stop] for s in sentences]

[['eat', 'fox'], ['eat', 'box']]
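
Stop-word removal shrinks the vocabulary from one end; restricting to a fixed vocabulary caps it outright. One common approach, sketched here under some assumptions of my own (the helper names, the `<unk>` placeholder and the cutoff of three words are illustrative choices, not part of NLTK), is to keep only the most frequent words and map everything else to a single "unknown" token:

```python
from collections import Counter

def build_vocabulary(token_lists, max_size):
    """Return the set of the max_size most frequent tokens."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return {tok for tok, _ in counts.most_common(max_size)}

def restrict(tokens, vocabulary, unknown="<unk>"):
    """Replace any token outside the vocabulary with the unknown marker."""
    return [tok if tok in vocabulary else unknown for tok in tokens]

# Hypothetical, already-cleaned token lists
tokenized = [
    ["eat", "fox"],
    ["eat", "box"],
    ["eat", "green", "eggs"],
    ["fox", "box"]
]

vocabulary = build_vocabulary(tokenized, max_size=3)
restricted = [restrict(tokens, vocabulary) for tokens in tokenized]
print(restricted)
```

In practice the vocabulary would be built from the training corpus only and then reused unchanged on any new text.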

### Stemming and Lemmatization

Many similar words have common "stems". For example, "geology", "geological", "geologically" all have the stem "geolog". Stemming is the process of reducing words to their stem forms - this reduces our vocabulary size with little loss of meaning. There are different algorithms for doing this; a common one is the Porter algorithm.

In [19]:
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

[stemmer.stem(w) for w in word_tokenize("geologically speaking the geology of the area is geological")]

['geolog', 'speak', 'the', 'geolog', 'of', 'the', 'area', 'is', 'geolog']

Todo - lemmatization
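
As a rough illustration of how lemmatization differs from stemming: a stemmer chops suffixes and may produce non-words such as "geolog", while a lemmatizer maps each word to its dictionary form (its _lemma_). Real lemmatizers, such as NLTK's `WordNetLemmatizer`, rely on a full lexicon and usually on part-of-speech information. The tiny hand-written lookup table below is purely a stand-in to show the idea:

```python
# Hypothetical, hand-written lemma table; a real lemmatizer uses a full lexicon
LEMMAS = {
    "geese": "goose",
    "mice": "mouse",
    "was": "be",
    "were": "be",
    "better": "good",
    "running": "run",
}

def lemmatize(word):
    """Return the dictionary form of a word, or the word itself if unknown."""
    word = word.lower()
    return LEMMAS.get(word, word)

lemmas = [lemmatize(w) for w in "The geese were running".split()]
print(lemmas)
```

Note that irregular forms like "geese" and "were" are handled by lookup, which suffix stripping alone cannot do.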

### Labeling Parts of Speech

In [21]:
# Install the prerequisites
import nltk


nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/gram/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [25]:
from nltk import pos_tag, word_tokenize

pos_tag(word_tokenize("the cat sat on the mat"))

[('the', 'DT'),
 ('cat', 'NN'),
 ('sat', 'VBD'),
 ('on', 'IN'),
 ('the', 'DT'),
 ('mat', 'NN')]

All the different tags are described in the NLTK help which we can access with:

In [27]:
import nltk


nltk.download('tagsets')
nltk.help.upenn_tagset()

[nltk_data] Downloading package tagsets to /Users/gram/nltk_data...
[nltk_data]   Unzipping help/tagsets.zip.
$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw ala

## Text Representation

### 1-Hot Set Representation
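
In a one-hot set representation each document becomes a binary vector with one position per vocabulary word: 1 if the word occurs in the document, 0 otherwise. Word order and repeat counts are discarded. A minimal pure-Python sketch, assuming whitespace tokenization (the function name and sample sentences are illustrative):

```python
def one_hot_encode(documents):
    """Return (vocabulary, vectors) with one binary vector per document."""
    tokenized = [doc.split() for doc in documents]
    # Sort the vocabulary so that column positions are reproducible
    vocabulary = sorted({tok for tokens in tokenized for tok in tokens})
    vectors = []
    for tokens in tokenized:
        present = set(tokens)
        vectors.append([1 if word in present else 0 for word in vocabulary])
    return vocabulary, vectors

docs = ["the cat sat on the mat", "the cat ate the rat"]
vocab, onehot = one_hot_encode(docs)
print(vocab)
for row in onehot:
    print(row)
```

Although "the" appears twice in each sentence, it contributes just a single 1 to each vector.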

### Bag of Words
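
A bag of words keeps how often each word occurs but still discards order (a "bag" is a multiset). Python's `collections.Counter` is a natural fit for this; the sketch assumes simple whitespace tokenization:

```python
from collections import Counter

doc = "the cat sat on the mat"

# Counter maps each distinct word to the number of times it occurs
bag = Counter(doc.split())
print(bag)

# The same words in a different order produce exactly the same bag
same_bag = Counter("on the mat the cat sat".split())
print(bag == same_bag)
```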

### Word Counts

`fit_transform` returns a sparse matrix, so we convert it to a dense array and wrap it in a DataFrame for display.

In [2]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

data = [
    "the cat sat on the mat",
    "the mat belonged to the rat",
    "the hat was on the mat",
    "the cat ate the rat",
    "the cat now has the hat"
]

v = CountVectorizer()
X = v.fit_transform(data)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
pd.DataFrame(X.toarray(), columns=v.get_feature_names_out())

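|   | ate | belonged | cat | has | hat | mat | now | on | rat | sat | the | to | was |
|---|-----|----------|-----|-----|-----|-----|-----|----|-----|-----|-----|----|-----|
| 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 2 | 0 | 0 |
| 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 2 | 1 | 0 |
| 2 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 2 | 0 | 1 |
| 3 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 2 | 0 | 0 |
| 4 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 2 | 0 | 0 |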


### Word Significance Metrics

TF-IDF and contextual salience (https://arxiv.org/abs/1803.08493)

The problem with word counts, especially if we don't remove stop words, is that common words get scored highly, which may not always be desirable. What we want to score highly are words that are common *in this row* but not common *across all rows*. That should give more weight to words that are significant to the particular row. We can do this with *[Term Frequency - Inverse Document Frequency](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)* or TF-IDF scores. In our small example this doesn't quite have the desired effect ("the" still scores highly because it appears twice in every row), but the code is useful to show:

In [3]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

v = TfidfVectorizer()
X = v.fit_transform(data)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
pd.DataFrame(X.toarray(), columns=v.get_feature_names_out())

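|   | ate | belonged | cat | has | hat | mat | now | on | rat | sat | the | to | was |
|---|-----|----------|-----|-----|-----|-----|-----|----|-----|-----|-----|----|-----|
| 0 | 0.0 | 0.0 | 0.360239 | 0.0 | 0.0 | 0.360239 | 0.0 | 0.433975 | 0.0 | 0.537901 | 0.512625 | 0.0 | 0.0 |
| 1 | 0.0 | 0.499522 | 0.0 | 0.0 | 0.0 | 0.334536 | 0.0 | 0.0 | 0.403011 | 0.0 | 0.47605 | 0.499522 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.4218 | 0.350132 | 0.0 | 0.4218 | 0.0 | 0.0 | 0.498244 | 0.0 | 0.52281 |
| 3 | 0.576615 | 0.0 | 0.386166 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.465209 | 0.0 | 0.54952 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.334536 | 0.499522 | 0.403011 | 0.0 | 0.499522 | 0.0 | 0.0 | 0.0 | 0.47605 | 0.0 | 0.0 |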
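
To make the numbers above less mysterious, the sketch below recomputes the first row by hand. It assumes scikit-learn's default settings for `TfidfVectorizer`: the raw term count is multiplied by a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` the number of documents containing the term, and each row is then scaled to unit Euclidean length. The function name is just illustrative:

```python
import math
from collections import Counter

data = [
    "the cat sat on the mat",
    "the mat belonged to the rat",
    "the hat was on the mat",
    "the cat ate the rat",
    "the cat now has the hat"
]

def tfidf_row(doc, corpus):
    """TF-IDF scores for one document, using smoothed IDF and L2 normalization."""
    n = len(corpus)
    doc_sets = [set(d.split()) for d in corpus]
    counts = Counter(doc.split())
    scores = {}
    for term, tf in counts.items():
        df = sum(term in s for s in doc_sets)    # number of documents containing term
        idf = math.log((1 + n) / (1 + df)) + 1   # smoothed inverse document frequency
        scores[term] = tf * idf
    # Scale the row so that its Euclidean length is 1
    norm = math.sqrt(sum(v * v for v in scores.values()))
    return {term: v / norm for term, v in scores.items()}

row0 = tfidf_row(data[0], data)
for term in sorted(row0):
    print(term, round(row0[term], 6))
```

With this smoothing, a word that appears in every document gets an IDF of exactly 1 rather than 0, so "the" is discounted relative to rarer words but not eliminated, and its count of 2 keeps its score high.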


### Ordered Representation with n-grams, Character or Word Vector Sequences

Ordered representations work well if the input has a bounded length, e.g. tweets. For unbounded text we would need to feed the input to our algorithm in chunks, so we would need an approach that has some form of "short-term memory", such as an LSTM neural network.
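
As a concrete illustration of the n-gram idea, the sketch below slides a window of length n over a list of tokens. The helper name `ngrams` is an illustrative choice here (NLTK ships a similar utility in `nltk.util`), and the tokenization is a plain whitespace split:

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "i will not eat them in a box".split()

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams[:3])
print(trigrams[:2])
```

Each n-gram can then be treated as a single vocabulary item and counted or scored exactly as individual words were above, which preserves a little local word order.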

## Named Entity Recognition

## Sentiment Analysis

## Topic Modeling

## Chatbots

## Generative Models