# Natural Language Processing

tl;dr: Make computers understand language


- Text classification
- Text clustering
- Representations

- Syntax analysis 
- Part of Speech Tagging


- Also:
    - Speech Recognition
    - Speech Synthesis
    - Machine Translation 
    

## tf–idf representation

tf–idf (term frequency–inverse document frequency) comes from information retrieval and is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus.

It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling. The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general. tf–idf is one of the most popular term-weighting schemes used in practice.


Variations of the tf–idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query. tf–idf can be successfully used for stop-words filtering in various subject fields, including text summarization and classification.

One of the simplest ranking functions is computed by summing the tf–idf for each query term; many more sophisticated ranking functions are variants of this simple model.
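The summing rule above can be sketched in a few lines. This is only a minimal illustration: the three documents and the query are made up, and scikit-learn's `TfidfVectorizer` is assumed for the weighting. It uses a smoothed idf and L2-normalises each document, so its numbers differ from a plain tf–idf computed by hand.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# toy corpus and query, invented for illustration
docs = [
    "the camera takes great pictures",
    "the book is a great read",
    "the lens and the camera arrived broken",
]
query = "great camera"

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix, shape (n_docs, n_terms)

# columns of the query terms that exist in the vocabulary
vocab = vectorizer.vocabulary_
cols = [vocab[t] for t in query.split() if t in vocab]

# score of a document = sum of its tf-idf weights over the query terms
scores = np.asarray(X[:, cols].sum(axis=1)).ravel()

# document indices, best match first
ranking = np.argsort(-scores)
for i in ranking:
    print(f"{scores[i]:.3f}  {docs[i]}")
```

The first document contains both query terms, so it comes out on top; the other two each match only one term.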

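The term frequency $tf(t,d)$ measures how often term $t$ occurs in document $d$. Here the raw count is normalised by the length of the document:

$$ tf(t,d) = \frac{ \text{number of times } t \text{ occurs in } d }{ \text{total number of terms in } d } $$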

The inverse document frequency is a measure of how much information the word provides,
i.e., if it is common or rare across all documents. It is the logarithmically scaled inverse fraction of the documents that contain the word (obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient):

$$ idf(t, D) = \log \left( \frac{ N }{ | \{ d \in D: t \in d \} | } \right) $$

$N$: total number of documents in the corpus, $N = |D|$

$|\{ d \in D: t \in d \}|$: number of documents where the term $t$ appears
    
If the term is not in the corpus, this will lead to a division by zero. It is therefore common to adjust the denominator to $1 + |\{d \in D: t \in d\}|$.

Overall, the tf–idf weight of term $t$ in document $d$ is the product of the two:

$$ tfidf(t,d) = tf(t,d) \times idf(t,D) $$
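The definitions above can be checked with a small from-scratch computation. This is a minimal sketch: the toy corpus is invented, the function names simply mirror the formulas, and the idf uses the adjusted denominator (1 + document count) mentioned above.

```python
import math
from collections import Counter

# toy corpus of tokenised documents, invented for illustration
corpus = [
    ["the", "camera", "is", "great"],
    ["the", "book", "is", "great"],
    ["the", "lens", "is", "sharp"],
]


def tf(term, doc):
    """Count of `term` in `doc`, divided by the number of terms in `doc`."""
    return Counter(doc)[term] / len(doc)


def idf(term, corpus):
    """log(N / (1 + number of documents containing `term`))."""
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / (1 + n_containing))


def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)


for term in ["camera", "great", "the"]:
    print(term, round(tfidf(term, corpus[0], corpus), 4))
```

With the adjusted denominator, a term that occurs in every document gets a slightly negative weight, because the logarithm's argument drops below 1. That is one reason library implementations often use a different smoothing.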

## Part-of-speech tagging

POS tagging is the process of marking up each word in a corpus with its corresponding part-of-speech tag, based on its context and definition.

Part of speech tags are very useful for intent classification and web query optimisation.


Several libraries support POS tagging, such as NLTK, spaCy, and Flair.





In [14]:
from flair.data import Sentence
from flair.models import SequenceTagger


# make a sentence
sentence = Sentence('Arteta wants temporary head injury substitutes after David Luiz incident')

# load the POS tagger
tagger = SequenceTagger.load('pos')

# run POS tagging over the sentence
tagger.predict(sentence)

print("\n")
print(sentence)
print('The following POS tags are found:')
print("\n")

# iterate over the tagged tokens and print each one
for entity in sentence.get_spans('pos'):
    print(entity)
    

2020-12-02 21:07:17,351 loading file /home/kostas/.flair/models/en-pos-ontonotes-v0.5.pt


Sentence: "Arteta wants temporary head injury substitutes after David Luiz incident"   [− Tokens: 10  − Token-Labels: "Arteta <NNP> wants <VBZ> temporary <JJ> head <NN> injury <NN> substitutes <NNS> after <IN> David <NNP> Luiz <NNP> incident <NN>"]
The following POS tags are found:


Span [1]: "Arteta"   [− Labels: NNP (1.0)]
Span [2]: "wants"   [− Labels: VBZ (1.0)]
Span [3]: "temporary"   [− Labels: JJ (1.0)]
Span [4]: "head"   [− Labels: NN (0.9996)]
Span [5]: "injury"   [− Labels: NN (1.0)]
Span [6]: "substitutes"   [− Labels: NNS (1.0)]
Span [7]: "after"   [− Labels: IN (0.9999)]
Span [8]: "David"   [− Labels: NNP (1.0)]
Span [9]: "Luiz"   [− Labels: NNP (1.0)]
Span [10]: "incident"   [− Labels: NN (0.9991)]


Tag legend:

- NNP: proper noun, singular
- NN: noun, singular or mass
- NNS: noun, plural
- VB: verb, base form
- VBZ: verb, 3rd person singular present
- IN: preposition or subordinating conjunction
- JJ: adjective

## Chunking

Chunking (shallow parsing) is the identification of short phrases, such as noun phrases, on top of parts of speech. POS tagging labels individual words as verbs, adjectives, etc.

Chunking essentially groups text into blocks along phrase boundaries, such as noun phrases, verb phrases, etc.


In [21]:
from flair.data import Sentence
from flair.models import SequenceTagger


# make a sentence
sentence = Sentence('Arteta wants temporary head injury substitutes after David Luiz incident')

# load the chunk tagger
tagger = SequenceTagger.load('chunk')

# run chunking over the sentence
tagger.predict(sentence)

print("\n")
print(sentence)
print('The following chunks are found:')
print("\n")

# iterate over the chunks and print each one
for entity in sentence.get_spans('np'):
    print(entity)
    

2020-12-02 21:29:48,444 loading file /home/kostas/.flair/models/en-chunk-conll2000-v0.4.pt


Sentence: "Arteta wants temporary head injury substitutes after David Luiz incident"   [− Tokens: 10  − Token-Labels: "Arteta <S-NP> wants <S-VP> temporary <B-NP> head <I-NP> injury <I-NP> substitutes <E-NP> after <S-PP> David <B-NP> Luiz <I-NP> incident <E-NP>"]
The following chunks are found:


Span [1]: "Arteta"   [− Labels: NP (0.9991)]
Span [2]: "wants"   [− Labels: VP (0.9998)]
Span [3,4,5,6]: "temporary head injury substitutes"   [− Labels: NP (0.8966)]
Span [7]: "after"   [− Labels: PP (0.4565)]
Span [8,9,10]: "David Luiz incident"   [− Labels: NP (0.8614)]


Chunk legend:

- NP: noun phrase
- VP: verb phrase
- PP: prepositional phrase
- ADJP: adjective phrase
- ADVP: adverb phrase
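
The token labels in the output above use a BIOES-style scheme: B = begin, I = inside, E = end, S = single-token chunk. How such labels are grouped into spans can be sketched as follows. `bioes_to_spans` is a hypothetical helper written for this illustration and is not part of Flair; the tokens and tags are copied from the output above.

```python
def bioes_to_spans(tokens, tags):
    """Group BIOES-tagged tokens into (text, label) chunks."""
    spans = []
    current, current_label = [], None
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "S":
            # single-token chunk
            spans.append((token, label))
            current, current_label = [], None
        elif prefix == "B":
            # start of a multi-token chunk
            current, current_label = [token], label
        elif prefix in ("I", "E") and current and label == current_label:
            current.append(token)
            if prefix == "E":
                # end of the chunk: emit it
                spans.append((" ".join(current), label))
                current, current_label = [], None
        else:
            # "O" or a malformed sequence: drop any partial chunk
            current, current_label = [], None
    return spans


tokens = "Arteta wants temporary head injury substitutes after David Luiz incident".split()
tags = ["S-NP", "S-VP", "B-NP", "I-NP", "I-NP", "E-NP", "S-PP", "B-NP", "I-NP", "E-NP"]

for text, label in bioes_to_spans(tokens, tags):
    print(label, text)
```

The same grouping applies to the NER labels (`S-PER`, `B-PER`, `E-PER`), where untagged tokens simply carry no label.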

## Named Entity Recognition (NER)

With NER, we aim to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.

In [11]:
from flair.data import Sentence
from flair.models import SequenceTagger


# make a sentence
sentence = Sentence('Arteta wants temporary head injury substitutes after David Luiz incident')

# load the NER tagger
tagger = SequenceTagger.load('ner')

# run NER over sentence
tagger.predict(sentence)

print("\n")
print(sentence)
print('The following NER tags are found:')
print("\n")

# iterate over entities and print
for entity in sentence.get_spans('ner'):
    print(entity)
    

2020-12-02 20:40:08,102 loading file /home/kostas/.flair/models/en-ner-conll03-v0.4.pt


Sentence: "Arteta wants temporary head injury substitutes after David Luiz incident"   [− Tokens: 10  − Token-Labels: "Arteta <S-PER> wants temporary head injury substitutes after David <B-PER> Luiz <E-PER> incident"]
The following NER tags are found:


Span [1]: "Arteta"   [− Labels: PER (1.0)]
Span [8,9]: "David Luiz"   [− Labels: PER (0.9987)]


## Text clustering 
 

### k-means

Let's see how we can do text clustering with k-means.

In [199]:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# read the reviews, one per line
with open('./data/amazon_reviews.txt', "r") as f:
    data = f.readlines()

In [274]:
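n_components = 30

# bag-of-words counts; note that an integer max_df=2 drops every term
# that occurs in more than 2 documents
vec = CountVectorizer(max_df=2, max_features=30000)

# LSA: reduce the count matrix to n_components dimensions,
# then L2-normalise each document vector
svd = TruncatedSVD(n_components)
normalizer = Normalizer(copy=False)
lsa = make_pipeline(vec, svd, normalizer)
X = lsa.fit_transform(data)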

In [275]:
#vec.vocabulary_

In [280]:
n_clusters = 10
model = KMeans(n_clusters=n_clusters, random_state=0)
model.fit(X)

KMeans(n_clusters=10, random_state=0)

In [281]:
labels = model.labels_

In [282]:
for i in range(n_clusters):
    print(labels[labels==i].shape)

(878,)
(788,)
(4620,)
(698,)
(720,)
(988,)
(1027,)
(757,)
(745,)
(693,)


In [284]:
# print the first 100 characters of the first three reviews in each cluster
for i in range(n_clusters):
    print("\n")

    idxs = np.where(labels == i)[0][:3]
    print(idxs)

    for j in idxs:
        print(data[j][:100])



[12 21 23]
i just got my first case of this fabric softener last week and yesterday , i was walking downstairs 
easily the worst textbook i encountered during my undergraduate years . unfocused , sparse , and alm
i just got this camera a week ago and thought it was great . but today i tried to turn it on and it 


[ 4 24 68]
i loved these movies , and i cant wiat for the third one ! very funny , not suitable for chilren 

i bought bead fantasies and bead fantasies ii at the same time after reading the positive reviews ; 
1. you ca n't print on anything over 8.5x14 paper . useless if you want to review your house to any 


[1 3 5]
i was misled and thought i was buying the entire cd and it contains one song 

anything you purchase in the left behind series is an excellent read . these books are great and ver
in my experience , this camera takes great pictures , but the zoom lens is so delicate that it break


[18 26 44]
i got this for my mom when i got the digital frame for my husband 

### Topic modelling with LDA

The same reviews can also be grouped by topic with Latent Dirichlet Allocation (LDA), here using gensim.

In [147]:
import gensim
corpus = gensim.corpora.textcorpus.TextCorpus('./data/amazon_reviews.txt')

In [37]:
# get_texts() yields one preprocessed document (a list of tokens) at a time
seq = corpus.get_texts()
print(next(seq))
print("\n")
print(next(seq))
print("\n")
print(next(seq))

['bought', 'album', 'loved', 'title', 'song', 'great', 'song', 'bad', 'rest', 'album', 'right', 'rest', 'songs', 'filler', 'worth', 'money', 'paid', 'shameless', 'bubblegum', 'oversentimentalized', 'depressing', 'tripe', 'kenny', 'chesney', 'popular', 'artist', 'result', 'cookie', 'cutter', 'category', 'nashville', 'music', 'scene', 'gotta', 'pump', 'albums', 'record', 'company', 'lining', 'pockets', 'suckers', 'buying', 'garbage', 'perpetuate', 'garbage', 'coming', 'town', 'soapbox', 'country', 'music', 'needs', 'roots', 'stop', 'pop', 'nonsense', 'country', 'music', 'considered', 'mainstream', 'different', 'things']


['misled', 'thought', 'buying', 'entire', 'contains', 'song']


['introduced', 'ell', 'high', 'school', 'students', 'lois', 'lowery', 'depth', 'characters', 'brilliant', 'writer', 'capable', 'inspiring', 'fierce', 'passion', 'readers', 'encounter', 'shocking', 'details', 'utopian', 'worlds', 'anxious', 'read', 'companion', 'novel', 'planned', 'share', 'class', 'january'

In [38]:
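# fit a 10-topic LDA model; alpha='auto' learns an asymmetric
# document-topic prior from the corpus
model = gensim.models.LdaModel(corpus, id2word=corpus.dictionary,
                               alpha='auto',
                               num_topics=10,
                               passes=5)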

In [39]:
for topic_id in range(model.num_topics):
    topk = model.show_topic(topic_id, 10)
    topk_words = [ w for w, _ in topk ]
    
    print('{}: {}'.format(topic_id, ' '.join(topk_words)))

0: book read books author reading people life written world way
1: album music like songs song quot great good sound best
2: lens canon light image use camera lenses digital wide focus
3: season scale body episodes fat workout weight episode dvd pressure
4: movie film like good story time people love great characters
5: hair product skin like razor use great shave head time
6: recipes match christ poems jesus bowl duo cookbook proud god
7: camera use good pictures great quality bought time battery like
8: product software use program version time new work like support
9: norton game year tax time bar program internet years virus
