**Subscribe** to this channel and **follow** *@dsbyhadi* on Twitter for updates!

*email: datasciencebyhadi@gmail.com*

----

# NLP: Natural Language Processing

Deals with `unstructured` data (e.g. text), as opposed to `structured` data (e.g. tabular data).

It's *natural* language --> there is a wide variety of problems in this area:

* Named Entity Recognition
* Part Of Speech (POS) tagging
* Machine Translation
* Text Summarization 
* Sentiment Analysis
* ...

(of course `you don't need to be a linguistics expert!`)

   ---

## Sentiment Analysis

Sentiment analysis is an example of a text classification problem. Through it we are going to cover:

* Binary Classification (the document is + or -)
* Multi-Class Classification (the document is + or neutral or -) or (the document is ++ or + or . or - or --)
* Multi-label Classification (the document has profanity or hate speech or insult, i.e. each document can have multiple labels at the same time)
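
A minimal sketch of how the targets differ across these three settings. The label names and the helpers `to_class_index` and `to_indicator` are illustrative assumptions, not part of any library.

```python
# Hypothetical label sets, chosen only to illustrate the three problem types.
BINARY_CLASSES = ["-", "+"]
MULTI_CLASSES = ["-", "neutral", "+"]
MULTI_LABELS = ["profanity", "hate_speech", "insult"]


def to_class_index(label, classes):
    """Binary / multi-class: each document gets exactly one class index."""
    return classes.index(label)


def to_indicator(labels, all_labels):
    """Multi-label: each document gets one 0/1 flag per possible label."""
    return [1 if name in labels else 0 for name in all_labels]


binary_target = to_class_index("+", BINARY_CLASSES)            # one of 2 classes
multi_class_target = to_class_index("neutral", MULTI_CLASSES)  # one of 3 classes
multi_label_target = to_indicator({"profanity", "insult"}, MULTI_LABELS)

print(binary_target, multi_class_target, multi_label_target)
```

The key difference: binary and multi-class targets are a single value per document, while a multi-label target is a vector in which any number of flags can be on.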

## Approaches
* Bag of Words (BOW)
* Embedding

**Packages/Tools**
* sklearn 
* spaCy
* NLTK
* StanfordNLP
* Spark NLP (John Snow Labs)
* fastText
* ..

****

#### BOW

Tokens (e.g. words) are the features, and they are encoded in a way similar to one-hot encoding. A document is then represented by the sum of the vectors corresponding to its tokens.

Example.
```
Corpus:
Doc1. Dog is playing.     
Doc2. I love my dog.      

features (One-hot):
dog =     (1, 0, 0, 0, 0, 0)
is =      (0, 1, 0, 0, 0, 0)
playing = (0, 0, 1, 0, 0, 0)
i =       (0, 0, 0, 1, 0, 0)
love =    (0, 0, 0, 0, 1, 0)
my =      (0, 0, 0, 0, 0, 1)

So 
Doc1 --> (1, 1, 1, 0, 0, 0)
Doc2 --> (1, 0, 0, 1, 1, 1)
```
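
The example above can be reproduced with a few lines of plain Python. This is only a sketch: the helper names (`tokenize`, `build_vocabulary`, `bow_vector`) and the simple regex tokenizer are assumptions made for illustration. In practice `sklearn`'s `CountVectorizer` covers this step, though its defaults differ (for instance, it drops one-character tokens such as "i").

```python
import re

corpus = ["Dog is playing.", "I love my dog."]


def tokenize(text):
    """Lower-case the text and keep runs of letters/apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_vocabulary(docs):
    """Map each token to a column index, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in tokenize(doc):
            vocab.setdefault(token, len(vocab))
    return vocab


def bow_vector(doc, vocab):
    """Sum the one-hot vectors of the tokens, i.e. count each token."""
    vector = [0] * len(vocab)
    for token in tokenize(doc):
        vector[vocab[token]] += 1
    return vector


vocab = build_vocabulary(corpus)
bow = [bow_vector(doc, vocab) for doc in corpus]

print(list(vocab))  # column order of the features
print(bow)          # one count vector per document
```
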
- Some form of normalization or weighting is usually applied (sometimes to reduce the dimension of the feature space)
 * tf
 * tf-idf
 * lemmatization/stemming (am, are, is $\Rightarrow$ be) (cat, cats, cat's $\Rightarrow$ cat)
 * lower-case
 * removing stop words (e.g. 'but', 'and', 'a', 'i', 'me', 'that', ...)
 * ..
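
A rough sketch of the text-level steps above (lower-casing, lemmatization, stop-word removal). The stop-word set and the lemma lookup table are tiny hand-made assumptions for illustration only; a real pipeline would rely on a library such as spaCy or NLTK for these.

```python
import re

# Tiny hypothetical resources; real stop-word lists and lemmatizers are far larger.
STOP_WORDS = {"but", "and", "a", "i", "me", "that"}
LEMMAS = {"am": "be", "are": "be", "is": "be", "cats": "cat", "cat's": "cat"}


def normalize(text, remove_stop_words=True):
    """Lower-case, tokenize, map tokens to lemmas, then drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    tokens = [LEMMAS.get(token, token) for token in tokens]
    if remove_stop_words:
        tokens = [token for token in tokens if token not in STOP_WORDS]
    return tokens


normalized = normalize("The cats are playing, and I love that cat's toy.")
print(normalized)
```

Each step shrinks the vocabulary: "cats" and "cat's" collapse into one feature, and the stop words disappear from the feature space altogether.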

###### More on TF-IDF

Doc1: Dog is playing.

|  term    | count |
|----------|-------|
|  dog     |  1    |
|  is      |  1    |
|  playing |  1    |

Doc2: I love my dog.

|  term    | count |
|----------|-------|
|  i       |  1    |
|  love    |  1    |
|  my      |  1    |
|  dog     |  1    |

$tf(w, d) := f_{w,d}\div |d|$ where $f_{w,d}$ is the frequency of $w$ in $d$, and $|d|$ is the total number of words (tokens) in document $d$.  
$idf(w) := \log_2(\frac{N}{t_{w}})$ where $t_w$ is the number of documents that contain $w$, and $N$ is the total number of documents in the corpus (base 2 is used here to keep the example numbers simple), and finally  
$tf{\text -}idf(w, d) = tf(w, d) \times idf(w)$  
(Note: there are many variations of these functions; these simple ones are used in this example only to convey the idea behind them.)

$tf("dog", d_1) = \frac{1}{3} \approx 0.33$  
$tf("playing", d_1) = \frac{1}{3} \approx 0.33$  
$tf("dog", d_2) = \frac{1}{4} = 0.25$  
..   

$idf("dog") = \log_2(\frac{2}{2}) = 0$  
$idf("playing") = \log_2(\frac{2}{1}) = 1$  
..  

Hence   
$tf{\text -}idf("dog", d_1) = 0.33 \times 0 = 0$,  
$tf{\text -}idf("dog", d_2) = 0.25 \times 0 = 0$,  
$tf{\text -}idf("playing", d_1) = 0.33 \times 1 = 0.33$  
..   

and so,  
```
Doc1 --> (0, 0.33, 0.33, 0, 0, 0)
Doc2 --> (0, 0, 0, 0.25, 0.25, 0.25)
```
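
The same numbers can be checked with a short sketch in plain Python. It assumes a base-2 logarithm, which is what makes $idf("playing") = 1$ in the worked example, and the function names are made up for illustration. Library implementations such as `sklearn`'s `TfidfVectorizer` use a different variant of these formulas, so their values will not match.

```python
import math
import re

corpus = ["Dog is playing.", "I love my dog."]
vocab = ["dog", "is", "playing", "i", "love", "my"]  # same column order as above

# Lower-case and tokenize each document.
docs = [re.findall(r"[a-z']+", text.lower()) for text in corpus]


def tf(word, doc):
    """Frequency of the word in the document, divided by the document length."""
    return doc.count(word) / len(doc)


def idf(word, docs):
    """log2(N / number of documents containing the word).

    Assumes the word occurs in at least one document.
    """
    containing = sum(1 for doc in docs if word in doc)
    return math.log2(len(docs) / containing)


def tfidf_vector(doc, docs, vocab):
    """tf-idf weight of every vocabulary word, rounded to two decimals."""
    return [round(tf(word, doc) * idf(word, docs), 2) for word in vocab]


tfidf = [tfidf_vector(doc, docs, vocab) for doc in docs]
print(tfidf)
```

Note how "dog" gets a weight of 0 in both documents: it appears everywhere in the corpus, so it carries no information for telling the documents apart.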

#### Embedding

Uses a bottleneck structure to reduce the dimension of the BOW feature space.

Example:
```
Corpus:
Doc1. Dog is playing.
Doc2. I love my dog.

embeddings (dimension 2):
dog =     (2, 4)
is =      (1, -1)
playing = (3, 2)
i =       (-1, -2)
love =    (-1, 10)
my =      (0, 1)

So 
Doc1 --> (6, 5)
Doc2 --> (0, 13)
```
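
A small sketch of the computation above: the document vector is the element-wise sum of its word embeddings. The 2-dimensional vectors are the toy values from the example, and `doc_vector` is a hypothetical helper. Skipping unknown tokens is an assumption of this sketch, and averaging instead of summing is a common alternative.

```python
import re

# Toy 2-dimensional embeddings taken from the example above.
embeddings = {
    "dog": (2, 4),
    "is": (1, -1),
    "playing": (3, 2),
    "i": (-1, -2),
    "love": (-1, 10),
    "my": (0, 1),
}


def doc_vector(text, embeddings):
    """Sum the embeddings of all known tokens in the text."""
    dim = len(next(iter(embeddings.values())))
    total = [0] * dim
    for token in re.findall(r"[a-z']+", text.lower()):
        if token not in embeddings:
            continue  # out-of-vocabulary tokens are skipped in this sketch
        for k, value in enumerate(embeddings[token]):
            total[k] += value
    return tuple(total)


doc1 = doc_vector("Dog is playing.", embeddings)
doc2 = doc_vector("I love my dog.", embeddings)
print(doc1, doc2)
```
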
- Feature dimension reduction (a much smaller space compared to BOW word vectors)
- Word vectors are passed through a bottleneck structure (a neural network with one hidden layer)
 
 ![title](embedding_sketch_1.jpeg)
 
 * Other possible approaches exist e.g. matrix factorization, collaborative filtering, etc.

- Example embeddings: word2vec, GloVe, BERT, ...
- Captures semantic and syntactic similarity, i.e. words that are close in meaning or usage are close in the embedding space as well
 * As long as words with similar meanings have their vectors close to each other (clustered around each other), the well-known analogy relations emerge naturally
   * e.g. vec(“Paris”) - vec(“France”) + vec(“Germany”) ≈ vec(“Berlin”) or vec(“Tehran”) - vec(“Iran”) + vec(“Germany”) ≈ vec(“Berlin”)
 * Potential issue? A word can have different meanings, e.g. ‘point’, 'leaves' → contextual embeddings such as BERT try to resolve this issue
- CBOW (given the context words, predict the target word) vs. skip-gram (given the target word, predict the context words)
- fastText works at the character n-gram level, whereas word2vec and GloVe work at the word level
- There are plenty of pre-trained embeddings that can be reused.
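
To make the CBOW vs. skip-gram distinction and the character n-gram idea concrete, here is a sketch that only builds the training pairs and the n-grams; it does not train anything. The function names and the window size of 1 are assumptions for illustration. fastText itself uses a range of n-gram lengths plus the whole word, while this sketch shows a single length.

```python
def skipgram_pairs(tokens, window=1):
    """(target, context) pairs: given the target word, predict each context word."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


def cbow_pairs(tokens, window=1):
    """(context words, target) pairs: given the context, predict the target word."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, target))
    return pairs


def char_ngrams(word, n=3):
    """Character n-grams with '<' and '>' word-boundary markers, fastText style."""
    marked = f"<{word}>"
    return [marked[k:k + n] for k in range(len(marked) - n + 1)]


tokens = ["i", "love", "my", "dog"]
print(skipgram_pairs(tokens))
print(cbow_pairs(tokens))
print(char_ngrams("where"))
```

Because a word is represented through its n-grams, a character-level model can still build a vector for a word it never saw during training, which a pure word-level lookup cannot do.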

My suggestion is to try both approaches on your specific problem. Neither is preferred in general; the only advice is to ***start simple***, and then run more iterations, changing modules/parameters/approach/etc. to improve the model over time.

Keep in mind that your training data is critical: bad training data (small data, inconsistent labels, uncleaned text) results in bad models, no matter what top-notch technique you use!

---

#### Challenges:

- Good enough (realistic) vs. Best (Kaggle)
- Not enough training data
 * Use augmentation to increase data and performance
- Skewed training data
- Normalizing text depending on the source
 * Twitter text --> abbreviations/hashtags/etc.
 * SMS text --> smileys; capitalization matters.
- Encountering other languages --> remove them/translate them
- Converging to a good set of metrics
 * Eyeball some examples from the test set at the end
- Deploying to production
 * Flask REST API
 * Hand over to engineers
 * SageMaker, etc.
 * MLeap
 * ..
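
Two of the challenges above lend themselves to a quick sketch: source-specific normalization (here, tweets) and skewed training data. Everything in it is an assumption made for illustration, including the placeholder tokens `<url>` and `<user>`, the function names, and the sample tweet. The weighting formula is the common "balanced" heuristic, n_samples / (n_classes * class_count).

```python
import re
from collections import Counter


def normalize_tweet(text):
    """Replace URLs and @mentions with placeholders and unpack hashtags.

    Case is deliberately left untouched, since capitalization can carry signal.
    """
    text = re.sub(r"https?://\S+", "<url>", text)
    text = re.sub(r"@\w+", "<user>", text)
    text = re.sub(r"#(\w+)", r"\1", text)  # "#NLP" -> "NLP"
    return re.sub(r"\s+", " ", text).strip()


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency in the training labels."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


clean = normalize_tweet("Loving #NLP with @some_user   https://example.com/post")
weights = balanced_class_weights(["+"] * 8 + ["-"] * 2)
print(clean)
print(weights)
```

With 8 positive and 2 negative examples, the rare class ends up with four times the weight of the frequent one, which is one simple way to keep a classifier from ignoring it.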


##### Remember: 
All models are wrong, some are useful!

___