# Natural Language Processing

tl;dr: Make computers understand language


- Text classification
- Text clustering
- Representations

- Syntax analysis 
- Part of Speech Tagging
    

## tf–idf representation

tf–idf (term frequency–inverse document frequency) comes from Information Retrieval. It is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus.

It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling. The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general. tf–idf is one of the most popular term-weighting schemes today.


Variations of the tf–idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query. tf–idf can be successfully used for stop-words filtering in various subject fields, including text summarization and classification.

One of the simplest ranking functions is computed by summing the tf–idf for each query term; many more sophisticated ranking functions are variants of this simple model.

$tf(t, d)$ is based on the number of times that term $t$ occurs in document $d$. Here it is normalised by the length of the document:

$$ tf(t, d) = \frac{ \#term\_in\_doc }{ \#total\_terms\_in\_doc } $$
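As a minimal sketch of the normalised term frequency, assuming lowercasing and naive whitespace tokenisation (both simplifications; the helper name `term_frequency` is only illustrative):

```python
from collections import Counter


def term_frequency(term, document):
    """Occurrences of `term` divided by the total number of tokens in `document`."""
    tokens = document.lower().split()  # assumption: whitespace tokenisation
    if not tokens:
        return 0.0
    return Counter(tokens)[term.lower()] / len(tokens)


doc = "the cat sat on the mat"
tf_the = term_frequency("the", doc)  # 2 occurrences out of 6 tokens
tf_dog = term_frequency("dog", doc)  # term absent from the document
print(tf_the, tf_dog)
```

A real pipeline would normally use a proper tokenizer, since punctuation attached to words changes the counts.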

The inverse document frequency is a measure of how much information the word provides,
i.e., if it's common or rare across all documents. It is the logarithmically scaled inverse fraction of the documents that contain the word (obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient):

$$ idf(t, D) = \log\left( \frac{ N }{ | \{ d \in D : t \in d \} | } \right) $$

$N$: total number of documents in the corpus, $N = |D|$

$| \{ d \in D : t \in d \} |$: number of documents in which the term $t$ appears

If the term is not in the corpus, this leads to a division by zero. It is therefore common to adjust the denominator to $1 + | \{ d \in D : t \in d \} |$.
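The definitions above can be sketched in plain Python. This is an illustration under stated assumptions, not a reference implementation: documents are pre-tokenised lists of words, the helper names (`tf`, `idf`, `tfidf_score`) and the toy corpus are made up for the example, and the idf uses the `1 +` adjusted denominator. The ranking score is the simple sum of tf–idf over the query terms.

```python
import math


def tf(term, doc):
    """Normalised term frequency of `term` in a tokenised document."""
    return doc.count(term) / len(doc) if doc else 0.0


def idf(term, corpus):
    """Adjusted idf: log(N / (1 + number of documents containing `term`))."""
    n_docs = len(corpus)
    doc_freq = sum(1 for doc in corpus if term in doc)
    return math.log(n_docs / (1 + doc_freq))


def tfidf_score(query, doc, corpus):
    """Simple ranking function: sum of tf-idf over the query terms."""
    return sum(tf(t, doc) * idf(t, corpus) for t in query)


# Hypothetical toy corpus, tokenised on whitespace
corpus = [text.split() for text in [
    "the striker scored a late goal",
    "the manager praised the defence",
    "the keeper saved a penalty",
    "the club signed a new striker",
]]

query = ["striker", "goal"]
scores = [tfidf_score(query, doc, corpus) for doc in corpus]
best = max(range(len(corpus)), key=lambda i: scores[i])
print(scores, best)

# A term that appears nowhere no longer causes a division by zero
print(idf("unseen", corpus))
```

One side effect of the adjusted denominator is that a term occurring in every document (here `the`) gets a slightly negative idf, because $N / (1 + N) < 1$.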

### Part of speech tagging 

POS tagging is the process of labelling each word in a corpus with its part-of-speech tag, based on its definition and its context.

Part of speech tags are very useful for intent classification and web query optimisation.


Several libraries support POS tagging, such as NLTK, spaCy and Flair.





In [14]:
from flair.data import Sentence
from flair.models import SequenceTagger


# make a sentence
sentence = Sentence('Arteta wants temporary head injury substitutes after David Luiz incident')

# load the POS tagger
tagger = SequenceTagger.load('pos')

# run POS tagging over the sentence
tagger.predict(sentence)

print("\n")
print(sentence)
print('The following POS tags are found:')
print("\n")

# iterate over the tagged tokens and print
for entity in sentence.get_spans('pos'):
    print(entity)
    

2020-12-02 21:07:17,351 loading file /home/kostas/.flair/models/en-pos-ontonotes-v0.5.pt


Sentence: "Arteta wants temporary head injury substitutes after David Luiz incident"   [− Tokens: 10  − Token-Labels: "Arteta <NNP> wants <VBZ> temporary <JJ> head <NN> injury <NN> substitutes <NNS> after <IN> David <NNP> Luiz <NNP> incident <NN>"]
The following POS tags are found:


Span [1]: "Arteta"   [− Labels: NNP (1.0)]
Span [2]: "wants"   [− Labels: VBZ (1.0)]
Span [3]: "temporary"   [− Labels: JJ (1.0)]
Span [4]: "head"   [− Labels: NN (0.9996)]
Span [5]: "injury"   [− Labels: NN (1.0)]
Span [6]: "substitutes"   [− Labels: NNS (1.0)]
Span [7]: "after"   [− Labels: IN (0.9999)]
Span [8]: "David"   [− Labels: NNP (1.0)]
Span [9]: "Luiz"   [− Labels: NNP (1.0)]
Span [10]: "incident"   [− Labels: NN (0.9991)]


## Chunking

Chunking (shallow parsing) is the identification of short phrases, such as noun phrases, on top of parts of speech. POS tagging labels individual words as verbs, adjectives, etc.

Chunking essentially groups the words of a text into blocks along syntactic boundaries, such as noun phrases, verb phrases, etc.


In [19]:
from flair.data import Sentence
from flair.models import SequenceTagger


# make a sentence
sentence = Sentence('Arteta wants temporary head injury substitutes after David Luiz incident')

# load the chunking tagger
tagger = SequenceTagger.load('chunk')

# run chunking over the sentence
tagger.predict(sentence)

print("\n")
print(sentence)
print('The following chunks are found:')
print("\n")

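# iterate over chunks and print
# note: get_spans('chunk') finds no spans (hence the empty listing in the
# recorded output); the spans are stored under the tagger's own label type,
# tagger.tag_type (assumed to be 'np' for this model)
for entity in sentence.get_spans(tagger.tag_type):
    print(entity)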
    

2020-12-02 21:26:17,945 loading file /home/kostas/.flair/models/en-chunk-conll2000-v0.4.pt


Sentence: "Arteta wants temporary head injury substitutes after David Luiz incident"   [− Tokens: 10  − Token-Labels: "Arteta <S-NP> wants <S-VP> temporary <B-NP> head <I-NP> injury <I-NP> substitutes <E-NP> after <S-PP> David <B-NP> Luiz <I-NP> incident <E-NP>"]
The following chunks are found:




### Named Entity Recognition (NER)

With NER, we aim to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.

In [11]:
from flair.data import Sentence
from flair.models import SequenceTagger


# make a sentence
sentence = Sentence('Arteta wants temporary head injury substitutes after David Luiz incident')

# load the NER tagger
tagger = SequenceTagger.load('ner')

# run NER over sentence
tagger.predict(sentence)

print("\n")
print(sentence)
print('The following NER tags are found:')
print("\n")

# iterate over entities and print
for entity in sentence.get_spans('ner'):
    print(entity)
    

2020-12-02 20:40:08,102 loading file /home/kostas/.flair/models/en-ner-conll03-v0.4.pt


Sentence: "Arteta wants temporary head injury substitutes after David Luiz incident"   [− Tokens: 10  − Token-Labels: "Arteta <S-PER> wants temporary head injury substitutes after David <B-PER> Luiz <E-PER> incident"]
The following NER tags are found:


Span [1]: "Arteta"   [− Labels: PER (1.0)]
Span [8,9]: "David Luiz"   [− Labels: PER (0.9987)]


## Text classification with sklearn/tf–idf

We have seen how to do document classification with tf–idf.
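As a reminder of what that looks like, here is a minimal sketch with scikit-learn. The toy texts and labels are invented for illustration, and logistic regression is just one reasonable choice of classifier on top of tf–idf features.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy training set: two classes with distinct vocabularies
train_texts = [
    "the striker scored a goal in the match",
    "the team won the football match",
    "the keeper saved a penalty kick",
    "the new phone has a fast processor",
    "the laptop ships with more memory",
    "the software update fixes a security bug",
]
train_labels = ["sport", "sport", "sport", "tech", "tech", "tech"]

# Vectorise with tf-idf, then fit a linear classifier on the sparse features
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(train_texts, train_labels)

predictions = clf.predict([
    "the striker missed a penalty",
    "a laptop with a faster processor",
])
print(predictions)
```

Wrapping the vectorizer and the classifier in one pipeline means the idf weights are learned from the training texts only and reused unchanged at prediction time.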

## Text clustering 


### tf–idf & k-means
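A minimal sketch of this combination, assuming scikit-learn: documents are turned into tf–idf vectors and then grouped with k-means. The six toy documents and the choice of two clusters are assumptions made for the example.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents: three about football, three about laptops
docs = [
    "The striker scored a goal in the football match",
    "A late goal decided the football match",
    "The football striker missed a goal",
    "The laptop has a fast processor and more memory",
    "New laptop memory and processor benchmarks",
    "The processor and memory limit laptop performance",
]

# tf-idf vectors; English stop words are dropped so they do not link the topics
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# k-means with k=2; several restarts (n_init) guard against a poor initialisation
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
print(labels)
```

The cluster ids themselves are arbitrary; only the grouping of documents is meaningful. The number of clusters has to be chosen up front, which is the main practical difficulty on real data.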




In [17]:

### Topic models 
import io
import os.path
import re
import tarfile

import smart_open

def extract_documents(url='https://cs.nyu.edu/~roweis/data/nips12raw_str602.tgz'):
    fname = url.split('/')[-1]

    # Download the file to local storage first.
    # We can't read it on the fly because of
    # https://github.com/RaRe-Technologies/smart_open/issues/331
    if not os.path.isfile(fname):
        with smart_open.open(url, "rb") as fin:
            with smart_open.open(fname, 'wb') as fout:
                while True:
                    buf = fin.read(io.DEFAULT_BUFFER_SIZE)
                    if not buf:
                        break
                    fout.write(buf)

    with tarfile.open(fname, mode='r:gz') as tar:
        # Ignore directory entries, as well as files like README, etc.
        files = [
            m for m in tar.getmembers()
            if m.isfile() and re.search(r'nipstxt/nips\d+/\d+\.txt', m.name)
        ]
        for member in sorted(files, key=lambda x: x.name):
            member_bytes = tar.extractfile(member).read()
            yield member_bytes.decode('utf-8', errors='replace')

docs = list(extract_documents())
