# Text analysis

Anton Akusok

<anton.akusok@arcada.fi>

Slack: `@Anton Akusok`

### Announcement:

This will be my last lecture at Arcada. I am moving to Silo.AI in January 2020.

Homeworks go to Andrey Shcherbakov <andrey.shcherbakov@arcada.fi>

## The problem of meaning

A word is a token that represents an item or a phenomenon from the real world.

Problem: AI does not work with items or concepts of the real world.  
It only works with numbers and tokens.

## Language is dynamic

Scientists can give a detailed explanation of a word's meaning. But new words appear faster than scientists can formally describe them. Existing words can also change meaning, especially in the minds of people from different places.

Meaning also depends on context, e.g. "proficient" may be a synonym of "good", but not always.

<img src="img/pike.png" alt="Drawing" style="height: 600px;"/>

## Bad idea:  Words as independent tokens, or Bag-of-Words

**Bag-of-words** is the basic idea of encoding words with a one-hot encoding scheme (a vector of zeroes with a single 1 at a unique position). A sentence or a text can then be described by a vector whose elements are the counts of the corresponding words.

One-hot encoding creates word vectors that are mathematically independent: any two different words are orthogonal, so no pair of words is more similar than any other pair. The vectors are also insanely large, like 500,000 elements for the English language.
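A minimal sketch of both encodings in plain Python may make this concrete. The five-word vocabulary and the helper names (`one_hot`, `bag_of_words`, `dot`) are made up for illustration, they are not part of any library.

```python
from collections import Counter

# Toy vocabulary (an assumption); a real one holds hundreds of thousands of words.
vocabulary = ["bar", "great", "nice", "restaurant", "the"]
index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    """Vector of zeroes with a single 1 at the word's position."""
    vector = [0] * len(vocabulary)
    vector[index[word]] = 1
    return vector

def bag_of_words(text):
    """Sum of the one-hot vectors: element i counts occurrences of word i."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

print(one_hot("great"))                             # [0, 1, 0, 0, 0]
print(bag_of_words("the great bar the nice bar"))   # [2, 1, 1, 0, 2]
print(dot(one_hot("great"), one_hot("nice")))       # 0
```

The zero dot product is the independence problem: "great" and "nice" look exactly as unrelated as "great" and "bar".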

<img src="img/onehot.png" alt="Drawing" style="width: 1000px;"/>

## Representing words by their context

Distributional semantics - one of the most successful ideas of modern statistical NLP!
> A word's meaning is given by the words that frequently appear close-by.

Intuition: For a sentence with 1 missing word, you can easily give a few likely candidates.

<img src="img/context.png" alt="Drawing" style="width: 1200px;"/>

## Word vectors

Word vectors are dense vectors, constructed so that words appearing in similar contexts get similar vectors.

Also known as: *word embeddings* or *word representations*.

Question: How would you represent word vectors in Python?

<img src="img/wordvector.png" alt="Drawing" style="height: 300px;"/>

![vis](img/wvvisualization.png)
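One possible answer, as a sketch: a dictionary that maps each word to a list of floats (in practice usually a NumPy array), compared with cosine similarity. The three vectors below are invented 4-dimensional values, real embeddings typically have a few hundred dimensions.

```python
import math

# Hypothetical vectors, chosen by hand so that "cat" and "dog" point the same way.
word_vectors = {
    "cat":   [0.9, 0.8, 0.1, 0.0],
    "dog":   [0.8, 0.9, 0.2, 0.1],
    "train": [0.1, 0.0, 0.9, 0.8],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(word_vectors["cat"], word_vectors["dog"]))    # close to 1
print(cosine_similarity(word_vectors["cat"], word_vectors["train"]))  # close to 0
```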

## Word vectors

How do word vectors represent meaning?

> They don't.

We solve the problem of meaning by kicking it out of the word representation.  
Models that process word vectors will learn the small portion of meaning that is relevant to their problem.

## Word2vec: Framework for learning word vectors

Idea:

* Have a large corpus of text
* Fix vocabulary of all words
* Assign random vectors of given length to all words
* Go through each word in all the text, selecting word W and context C
* Compute probability P(W|C)
* Maximize probability P with stochastic gradient descent
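The steps above can be sketched in plain Python on a two-sentence corpus. This is a toy under stated assumptions, not the real Word2vec code: the corpus, vector size, window and learning rate are made up, P(W|C) is computed with a full softmax over the averaged context vectors, and real implementations replace that softmax with cheaper approximations.

```python
import math
import random

random.seed(0)

corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
]
vocab = sorted({w for sentence in corpus for w in sentence})
word_id = {w: i for i, w in enumerate(vocab)}
DIM, WINDOW, LR = 8, 2, 0.1   # vector length, context size, learning rate

def random_vectors():
    return [[random.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in vocab]

v_in = random_vectors()    # vectors used when a word is part of the context C
v_out = random_vectors()   # vectors used when a word is the predicted word W

def training_pairs():
    """Yield (W, C) as (target id, list of context ids) for every position."""
    for sentence in corpus:
        ids = [word_id[w] for w in sentence]
        for pos, target in enumerate(ids):
            context = ids[max(0, pos - WINDOW):pos] + ids[pos + 1:pos + 1 + WINDOW]
            yield target, context

def sgd_step(target, context):
    """One gradient step on -log P(W|C); returns the loss before the update."""
    hidden = [sum(v_in[c][d] for c in context) / len(context) for d in range(DIM)]
    scores = [sum(o * h for o, h in zip(out, hidden)) for out in v_out]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]          # softmax: P(word | C) for all words
    grad_hidden = [0.0] * DIM
    for j, out in enumerate(v_out):
        err = probs[j] - (1.0 if j == target else 0.0)
        for d in range(DIM):
            grad_hidden[d] += err * out[d]     # uses the value before the update
            out[d] -= LR * err * hidden[d]
    for c in context:
        for d in range(DIM):
            v_in[c][d] -= LR * grad_hidden[d] / len(context)
    return -math.log(probs[target])

def epoch_loss():
    losses = [sgd_step(t, c) for t, c in training_pairs()]
    return sum(losses) / len(losses)

first = epoch_loss()
for _ in range(200):
    last = epoch_loss()
print(f"mean loss: first epoch {first:.3f}, last epoch {last:.3f}")
```

Minimizing -log P(W|C) is the same as maximizing P(W|C), so a falling loss means the vectors are getting better at predicting a word from its neighbours.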

## Interesting consequences: Word math

![wordmath](img/wordmath.png)
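Word math can be tried out on hand-made vectors. In this sketch the four dimensions are assumed to mean (royalty, male, female, fruit); learned embeddings have no such readable axes, and the `analogy` helper is hypothetical.

```python
import math

# Hand-made vectors with dimensions (royalty, male, female, fruit).
vectors = {
    "king":  [1.0, 1.0, 0.0, 0.0],
    "queen": [1.0, 0.0, 1.0, 0.0],
    "man":   [0.0, 1.0, 0.0, 0.0],
    "woman": [0.0, 0.0, 1.0, 0.0],
    "apple": [0.0, 0.0, 0.0, 1.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def analogy(a, b, c):
    """Return the word closest to a - b + c, skipping the three input words."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("king", "man", "woman"))  # queen
```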

## Blackbox learning approach (our course)

There is a big difference between knowing how to do something, and being able to do something. We aim at the latter.

Let's think of text analysis as a *black-box* environment that just works, even if we don't understand how. And let's learn how to do some useful things today.

## Self-study materials

If you want to know more, look at Stanford lectures on NLP:  
http://web.stanford.edu/class/cs224n/ 

They have slides, web recordings of actual lectures, and **the math**. We skip the math.  
(https://www.youtube.com/playlist?list=PLoROMvodv4rOhcuXMZkNm7j3fVwBBY42z)

# 1. Bag-of-Words

This approach can still be useful, if you suspect that your problem can be solved by finding specific related words (and NOT by extracting actual meaning from word order).

**The good:** It often works. Scikit-Learn has everything you need. Can work with any language (even HTML).

In [20]:
from sklearn.datasets import fetch_20newsgroups
data = fetch_20newsgroups(subset='train')

In [21]:
data.data[0]

"From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n"

## Tokenizing: Splitting text into words

In [22]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_counts = count_vect.fit_transform(data.data)

In [4]:
count_vect.vocabulary_.get('cat')

38082

## Word frequencies

`CountVectorizer()` returns word *occurrences* (raw counts), so frequent words like "the" receive very high numbers.

TF-IDF (Term Frequency - Inverse Document Frequency) weights words in a special way. Words that are frequent across the whole dataset are devalued, and words that are frequent in only a small portion of documents are greatly emphasized.

This highlights "specialized" words that likely carry area-specific meaning, and will be useful in classification or whatever you wanna do with the texts.
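The weighting can be reproduced by hand on a tiny corpus. This sketch assumes the smoothed formula that scikit-learn documents as the `TfidfTransformer` default, `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalization; the three documents and the helper names are made up.

```python
import math
from collections import Counter

docs = [
    "the car is fast",
    "the cat is asleep",
    "the car and the cat",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
df = Counter(word for tokens in tokenized for word in set(tokens))

def idf(word):
    """Smoothed inverse document frequency: rare words get larger values."""
    return math.log((1 + n_docs) / (1 + df[word])) + 1

def tfidf(text):
    """Count * idf for every known word, scaled to unit (L2) length."""
    counts = Counter(w for w in text.split() if w in df)
    raw = {w: c * idf(w) for w, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {w: v / norm for w, v in raw.items()}

weights = tfidf("the car")
print(weights)
```

"the" appears in every document, so its weight comes out lower than the weight of "car", which appears in only two of three.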

In [5]:
from sklearn.feature_extraction.text import TfidfTransformer
tf_idf = TfidfTransformer().fit(X_counts)
X_weights = tf_idf.transform(X_counts)

In [6]:
text = "the car"
counts = count_vect.transform([text, ])
counts.data

array([1, 1], dtype=int64)

In [7]:
weights = tf_idf.transform(counts)
weights.data

array([0.25933548, 0.9657873 ])

## Classification

Bag-of-words representations are very long.

`SGDClassifier` is an iteratively optimized linear model that works well (although it never reaches *the perfect* solution the way a normal linear model does).  
It converges extremely fast on text vectors.

Naive Bayes (`MultinomialNB` below) is a good method whose only drawback is that it assumes independence of input features. But we *already assume independence of words* by using the Bag-of-words method, so it does not matter.

In [23]:
from sklearn.model_selection import train_test_split
Xt, Xv, Yt, Yv = train_test_split(X_weights, data.target)

In [24]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(Xt, Yt)
clf.score(Xv, Yv)

0.807705903145988

In [25]:
from sklearn.linear_model import SGDClassifier
sgd = SGDClassifier().fit(Xt, Yt)
sgd.score(Xv, Yv)

0.9208200777659951

## Don't forget pipelines!

In [27]:
from sklearn.pipeline import Pipeline
model = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB())
])

In [28]:
from sklearn.model_selection import cross_val_score
cross_val_score(model, data.data, data.target, cv=4, n_jobs=-1)

array([0.84490659, 0.84210526, 0.83079646, 0.84083658])

## Working with ANY text (literally)!

`HashingVectorizer` counts EVERY sequence of *n* characters as a word.  

Of course this creates vectors with billions of elements, so `HashingVectorizer` actually *hashes* its "words" into a random place inside a vector of a given length. Same "words" *always* go to the same place.

It can detect names, emails, character combinations, passwords, etc. with enough training data.

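`Scikit-Learn` has an advanced `HashingVectorizer()` implementation that can generate *n*-character sequences only within word boundaries (`analyzer="char_wb"`) or hash every single word separately (`analyzer="word"`). It normalizes the resulting vectors with `norm="l2"`, and `alternate_sign=True` flips the sign of some hashed features so that collisions tend to cancel out instead of piling up.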
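The hashing idea itself fits in a few lines. This is a sketch, not the scikit-learn implementation: it uses `zlib.crc32` as a stand-in hash function, a tiny vector of 32 slots, and no normalization; `char_ngrams` and `hashing_vectorize` are hypothetical helper names.

```python
import zlib

def char_ngrams(text, n_min, n_max):
    """Yield every sequence of n characters for n_min <= n <= n_max."""
    for n in range(n_min, n_max + 1):
        for start in range(len(text) - n + 1):
            yield text[start:start + n]

def hashing_vectorize(text, n_features=32, ngram_range=(4, 5)):
    """Count character n-grams in slots chosen by a hash of the n-gram."""
    vector = [0] * n_features
    for gram in char_ngrams(text.lower(), *ngram_range):
        slot = zlib.crc32(gram.encode("utf-8")) % n_features
        vector[slot] += 1   # different n-grams may collide in the same slot
    return vector

a = hashing_vectorize("password: hunter2")
b = hashing_vectorize("password: hunter2")
print(a == b)    # True: the same n-grams always land in the same slots
print(sum(a))    # total number of 4- and 5-character n-grams in the text
```

No vocabulary is stored anywhere, which is why this works on any text, but it also means a slot cannot be mapped back to the "word" that filled it.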

In [13]:
from sklearn.feature_extraction.text import HashingVectorizer

In [14]:
HashingVectorizer()

HashingVectorizer(alternate_sign=True, analyzer='word', binary=False,
                  decode_error='strict', dtype=<class 'numpy.float64'>,
                  encoding='utf-8', input='content', lowercase=True,
                  n_features=1048576, ngram_range=(1, 1), norm='l2',
                  preprocessor=None, stop_words=None, strip_accents=None,
                  token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None)

In [15]:
hashing = Pipeline([
    ("vect", HashingVectorizer(n_features=15_000, analyzer="char", 
                               ngram_range=(4, 5), alternate_sign=False)),
    ("tfidf", TfidfTransformer()),
    ("clf", SGDClassifier())
])

In [16]:
from sklearn.model_selection import cross_val_score
cross_val_score(hashing, data.data, data.target, cv=4, n_jobs=-1)

array([0.88086006, 0.8975627 , 0.87362832, 0.87628501])

# 2. Word vectors: Spacy

https://spacy.io is a nice Python library for NLP tasks. 

It supports a range of language models: https://spacy.io/models

In [1]:
import spacy

In [2]:
nlp = spacy.load("en_core_web_md")

In [3]:
doc = nlp("Two bananas in pyjamas.")
doc

Two bananas in pyjamas.

## Token

A token is a meaningful element of text: a word or a punctuation mark.

In [5]:
for token in doc:
    print(token, token.vector)

Two [ 1.9376e-01 -3.4272e-01 -3.7280e-01 -1.5344e-01  2.6030e-01 -2.5268e-01
 -2.3870e-01 -7.9489e-02  3.9787e-01  2.5414e+00  3.0602e-01  4.0473e-02
  2.2262e-01 -2.9280e-01 -7.7424e-02 -1.2155e-01 -1.5803e-02  1.1511e+00
 -6.2032e-02 -1.1371e-01 -4.5909e-01  6.8800e-02  4.8372e-04  6.5954e-02
  1.2627e-01 -2.4380e-01 -3.7474e-01 -1.3026e-01  3.3211e-01  1.7395e-01
 -1.6609e-02  5.1471e-01  2.5513e-01 -6.3719e-03  2.4802e-01 -9.9860e-02
  4.1756e-02 -3.2667e-01 -4.7819e-02 -3.1468e-01 -7.5799e-02  1.6569e-01
 -1.2541e-01 -8.8889e-02 -2.8187e-02 -1.6542e-02 -6.5677e-02  2.1452e-01
  3.0963e-02  4.4404e-03 -2.3115e-01  3.6552e-01 -1.1054e-01  2.7337e-02
 -4.1860e-01  8.6131e-02 -8.8410e-02  3.7810e-01  9.4435e-02  3.2761e-01
  5.6099e-01  3.7211e-02  1.6627e-01  5.5609e-01 -1.5403e-01  1.3870e-01
  2.1802e-01  2.8899e-01  3.1493e-02  4.6057e-01  4.6414e-01 -1.8594e-01
  2.8329e-01  2.5512e-01 -3.4265e-02 -5.8959e-02 -3.2013e-02 -1.7995e-01
  1.6876e-01 -1.1089e-01 -1.5121e-01  8.4075e-0

 -2.0304e-01  1.9368e-01 -3.2546e-01  1.4421e-01 -1.6900e-01  2.6501e-01]
pyjamas [ 1.9095e-01 -4.9804e-01 -2.6771e-01 -5.6022e-02 -1.7520e-01  1.9237e-01
  1.1580e-01 -4.3959e-01 -4.0391e-01  4.8932e-01 -2.1835e-01  8.2319e-02
 -1.2638e-01 -1.8102e-02  1.0808e-01 -8.7456e-02  1.5171e-02  3.0357e-01
 -1.0834e-01 -5.5606e-01 -6.7118e-01 -5.6832e-01 -3.6537e-01  1.1583e-01
  5.5654e-02 -2.3239e-01 -1.3381e-01  2.6839e-02  3.1981e-02  3.3165e-01
  8.3014e-02  2.9282e-01 -1.0282e-01  5.6327e-01  4.2352e-02 -9.5268e-01
  4.1784e-01  1.0255e-01  1.0748e-03  2.4993e-01  4.2311e-01 -2.3822e-01
 -5.5894e-01  1.3366e-01 -1.3233e-01 -1.6066e-01  2.5015e-01  5.3932e-01
  8.3208e-01  3.8616e-01  5.0471e-01  2.2545e-02 -1.6626e-01 -3.5128e-01
  7.2978e-01 -1.0944e-01  3.8559e-01 -1.7194e-01  3.6662e-01 -1.5815e-01
 -4.7744e-01 -7.2209e-01  4.7908e-01  1.4687e-01 -8.8191e-02 -2.4622e-01
 -7.2795e-01 -1.4420e-01  8.2687e-02  3.3889e-01 -8.0232e-03  4.8050e-01
  5.9092e-01 -9.5002e-02 -1.1985e-02 -1.79

In [6]:
doc1 = "It's a warm summer day"
doc2 = "Its sunny outside"

nlp(doc1).similarity(nlp(doc2))

0.7816512492523914
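By default, spaCy takes a document's vector to be the average of its token vectors and compares two documents with cosine similarity. A minimal sketch of that idea, with invented 3-dimensional token vectors and hypothetical helper names, shows one consequence: word order is lost.

```python
import math

# Hypothetical token vectors (en_core_web_md would give 300 dimensions).
token_vectors = {
    "dog":   [0.9, 0.1, 0.0],
    "bites": [0.1, 0.9, 0.2],
    "man":   [0.7, 0.2, 0.6],
}

def doc_vector(text):
    """Average the vectors of all tokens, dimension by dimension."""
    rows = [token_vectors[token] for token in text.split()]
    return [sum(column) / len(rows) for column in zip(*rows)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

similarity = cosine(doc_vector("dog bites man"), doc_vector("man bites dog"))
print(similarity)   # 1.0 up to rounding: averaging ignores word order
```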

## Span

A span is a sub-text range within the whole text.

In [7]:
doc = nlp("This was a great restaurant. Afterwards, we went to a really nice bar.")

In [8]:
span1 = doc[3:5]
span1

great restaurant

In [9]:
span2 = doc[12:15]
span2

really nice bar

In [10]:
span1.similarity(span2)

0.75173926

## Spacy tutorial

Try the full course if you are interested:  
https://course.spacy.io

## Spacy + Scikit-Learn?

In [11]:
from sklearn.datasets import fetch_20newsgroups
data = fetch_20newsgroups(subset='train')

In [None]:
%%time
X_spacy = [doc.vector for doc in nlp.pipe(data.data)]

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import SGDClassifier

cross_val_score(SGDClassifier(), X_spacy, data.target, cv=4, n_jobs=-1)

# 3. Translation and other cool stuff

Sorry, no available tools! (except commercial cloud providers)

## Translation, question answering, image summarization, text summarization...

Sequence-to-sequence model, example: https://graviraja.github.io/seqtoseqimp

Seq2Seq models are typically built from LSTM recurrent networks: https://colah.github.io/posts/2015-08-Understanding-LSTMs/ 

## Recurrent neural network (LSTM)

A "normal" deep learning network that keeps information between different data samples.

Allows processing data samples that come in a sequence, like text.
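The "keeps information between samples" part can be shown with a single recurrent cell. This sketch is a plain RNN cell with one number as its state, not an LSTM (an LSTM adds gates that decide what to keep and what to forget), and the two weights are fixed by hand instead of being learned.

```python
import math

W_INPUT, W_HIDDEN = 0.5, 0.9   # hypothetical weights; a real network learns them

def rnn_step(x, hidden):
    """Mix the current input with the state left over from earlier inputs."""
    return math.tanh(W_INPUT * x + W_HIDDEN * hidden)

def run(sequence):
    hidden = 0.0                  # empty memory before the first sample
    for x in sequence:
        hidden = rnn_step(x, hidden)
    return hidden

print(run([1.0, 0.0, 0.0]))   # the early 1.0 is still visible in the final state
print(run([0.0, 0.0, 1.0]))   # same samples, different order, different state
```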

`>>` Video

<img src="img/lstm1.png" alt="Drawing" style="height: 600px;"/>

## Text feeds one-word-at-a-time

Including special symbols like "beginning of a sentence" or "end of a sentence"

<img src="img/lstm.png" alt="Drawing" style="height: 600px;"/>

## Encoder

Model "learns" the meaning of text in some kind of internal representation.

<img src="img/encoder.png" alt="Drawing" style="height: 600px;"/>

## Decoder

A second network generates a sentence by predicting one word at a time. It takes the hidden representation and the previous word as inputs. The first "previous word" is the "start of a sentence" special symbol.
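The generation loop can be sketched without any neural network. Here a lookup table (`TOY_DECODER`, entirely made up) stands in for the trained decoder, and the hidden representation is just a string; a real decoder would also update its own internal state at every step.

```python
START, END = "<s>", "</s>"   # special symbols for sentence start and end

# Hypothetical stand-in for a trained decoder network: maps
# (hidden representation, previous word) to the most likely next word.
TOY_DECODER = {
    ("greeting", START): "hello",
    ("greeting", "hello"): "world",
    ("greeting", "world"): END,
}

def predict_next(hidden, previous_word):
    return TOY_DECODER[(hidden, previous_word)]

def decode(hidden, max_len=10):
    """Generate words one at a time until the end symbol appears."""
    words, previous = [], START
    for _ in range(max_len):
        word = predict_next(hidden, previous)
        if word == END:
            break
        words.append(word)
        previous = word          # the output becomes the next input
    return words

print(decode("greeting"))  # ['hello', 'world']
```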

<img src="img/decoder.png" alt="Drawing" style="height: 600px;"/>

## Seq2Seq translation

Both encoder and decoder are trained together.

Using a dataset of translated sentences, we can learn a translation model.

<img src="img/seq2seq.png" alt="Drawing" style="height: 600px;"/>

# Want to learn more?

Deep Learning and its applications are a huge topic.  
We cannot teach everything, but we can guide you towards where you wanna go.

- Ask @ Slack
- Self-study courses: https://github.com/yandexdataschool/Practical_DL

# Homework 6

https://course.spacy.io/chapter2

Do exercises from Chapter 2 in a separate notebook, submit your code.