# Teach a machine to understand human language
##### WeAreDevelopers World Congress 2019

## Part 1 - Natural Language Classification

Some imports we will need later.

In [1]:
import re
import requests
import pandas as pd
import numpy as np

Let's have a look at the data.

In [2]:
path = "/home/paul/Downloads/classifier_dataset.csv"
df = pd.read_csv(path, encoding="utf-8", header=0, sep=",")

In [3]:
df[0:10]

| Unnamed: 0 | question | cat |
|---|---|---|
| 0 | Like many AUX, the Lombard has S or Lombards | devices |
| 1 | Is there still the Monaco 1? | devices |
| 2 | Show me Comfort Pro P 500 devices | devices |
| 3 | which connection options does the Ergotel s ha... | devices |
| 4 | Show me all the colors for the Actron Card | devices |
| 5 | I&#39;m looking for a garnet 1 phone | devices |
| 6 | Which SIM card is in the Google Pixel? | devices |
| 7 | What is the purchase price for speedort W724 v | devices |
| 8 | What weighs a Samsung phone? | devices |
| 9 | What is the bottom line between the Google Pix... | devices |


In [4]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size = 0.2, random_state = 0)

### Extracting features from the text

There are different approaches:
- Bag-of-words model
- Latent Semantic Indexing
- Latent Dirichlet Allocation
- Word Embeddings (Word2Vec, GloVe, ...)

We will use a simple bag-of-words model here. For that we need a matrix of the words and their counts in each document.

In [5]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(train['question'])
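
To make the count matrix concrete, here is a minimal sketch that builds one by hand with only the standard library. It is an illustration, not how `CountVectorizer` is implemented: the two example sentences and the helper name `bag_of_words` are made up for this sketch, and tokenisation is a plain whitespace split.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts word j in document i."""
    tokenized = [doc.lower().split() for doc in docs]
    # One column per distinct word, in alphabetical order.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

docs = ["show me the phone", "show me the tariff for the phone"]  # made-up examples
vocabulary, matrix = bag_of_words(docs)
print(vocabulary)  # ['for', 'me', 'phone', 'show', 'tariff', 'the']
print(matrix)      # [[0, 1, 1, 1, 0, 1], [1, 1, 1, 1, 1, 2]]
```

Each row is one question and each column one word; word order within a question is lost, which is why the model is called a "bag" of words.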

Now we can already apply a machine learning algorithm

In [6]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_counts, train['cat'])
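
For intuition about what the classifier does with those counts, the following is a rough sketch of multinomial Naive Bayes with additive smoothing, written with the standard library only. It is not scikit-learn's implementation; the helper names `train_nb` and `predict_nb` and the four training sentences are assumptions made up for this example. The `alpha` argument plays the role of the smoothing parameter of `MultinomialNB`, and the log prior term corresponds to learning class priors from the data.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Collect per-class document counts and per-class word counts."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in zip(docs, labels):
        tokens = doc.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab, alpha

def predict_nb(model, doc):
    """Return the class with the highest log prior + summed log likelihoods."""
    class_counts, word_counts, vocab, alpha = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for token in doc.lower().split():
            if token not in vocab:
                continue  # words never seen in training are ignored
            smoothed = (word_counts[label][token] + alpha) / (total_words + alpha * len(vocab))
            score += math.log(smoothed)
        scores[label] = score
    return max(scores, key=scores.get)

# Made-up training data for illustration only.
docs = ["which sim card is in the phone", "show me all phone colors",
        "how much is the monthly tariff", "can i change my tariff"]
labels = ["devices", "devices", "tariffs", "tariffs"]
model = train_nb(docs, labels)
print(predict_nb(model, "is there a cheaper tariff"))  # tariffs
print(predict_nb(model, "phone colors"))               # devices
```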

And see the results

In [7]:
mat = count_vect.transform([test['question'][1]])
res = clf.predict(mat)
print(test['question'][1], '-' ,res[0])

Is there still the Monaco 1? - devices


In [20]:
from sklearn.metrics import accuracy_score, classification_report
mat = count_vect.transform(test['question'])
res = clf.predict(mat)
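# Note: scikit-learn's metrics expect (y_true, y_pred). With the arguments in this
# order the accuracy is unaffected, but the precision and recall columns of the
# report are swapped and the support column counts predicted labels.
np.mean(res == test['cat'])
accuracy_score(res, test['cat'])
print(classification_report(res, test['cat']))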

              precision    recall  f1-score   support

     devices       0.94      0.72      0.82        90
   processes       0.61      0.85      0.71        27
   smarthome       0.76      1.00      0.86        32
     tariffs       0.84      0.84      0.84        73

   micro avg       0.82      0.82      0.82       222
   macro avg       0.79      0.85      0.81       222
weighted avg       0.84      0.82      0.82       222
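
scikit-learn's metric functions take the true labels first and the predictions second, i.e. `classification_report(y_true, y_pred)`. The following sketch, using made-up labels and a hypothetical helper `precision_recall`, computes precision and recall for one class by hand and shows that swapping the two arguments swaps the two numbers.

```python
def precision_recall(y_true, y_pred, label):
    """Precision and recall for a single class, computed from raw label lists."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    predicted = sum(1 for p in y_pred if p == label)  # how often the class was predicted
    actual = sum(1 for t in y_true if t == label)     # how often the class really occurs
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall

# Made-up labels for illustration only.
y_true = ["devices", "devices", "devices", "tariffs", "tariffs"]
y_pred = ["devices", "tariffs", "tariffs", "tariffs", "tariffs"]

correct_order = precision_recall(y_true, y_pred, "devices")
swapped_order = precision_recall(y_pred, y_true, "devices")
print(correct_order)  # precision 1.0, recall 1/3
print(swapped_order)  # precision 1/3, recall 1.0
```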



Let's make that a bit simpler, so that it is easier to work with.

In [9]:
from sklearn.pipeline import Pipeline
text_clf = Pipeline([('vect', CountVectorizer()), ('clf', MultinomialNB())])
text_clf = text_clf.fit(train['question'], train['cat'])
predicted = text_clf.predict(test['question'])
np.mean(predicted == test['cat'])

0.8153153153153153

### Preprocessing Optimisations

- TF-IDF
- Stopwords
- Stemming

#### Term Frequency - Inverse Document Frequency

*Term Frequency (TF)* - how often a term appears in a document, divided by the total number of words in that document<br>
$\mathit{tf}_{t, d}=\frac{\textrm{number of occurrences of term t in document d}}{\textrm{total number of words in document d}}$

*Inverse Document Frequency (IDF)* - how rare a term is across all documents<br>
$\mathit{idf}_{t}=\log{\frac{\textrm{total number of documents}}{\textrm{number of documents containing term t}}}$
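
As a small worked example, the sketch below computes both quantities by hand, taking the IDF as the logarithm of the total number of documents divided by the number of documents containing the term. The three tokenised questions and the function names are made up for illustration. Note that scikit-learn's `TfidfTransformer` uses a smoothed IDF and normalises each row by default, so its numbers will differ from this plain textbook variant.

```python
import math

# Made-up, pre-tokenised documents for illustration only.
docs = [
    "which sim card is in the phone".split(),
    "what is the price of the phone".split(),
    "how do i change the tariff".split(),
]

def tf(term, doc):
    """Share of the document's words that are `term`."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """log(N / df); assumes the term occurs in at least one document."""
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf_idf("the", docs[0], docs))  # 0.0, because "the" occurs in every document
print(tf_idf("sim", docs[0], docs))  # (1/7) * log(3), "sim" occurs in one document only
```

A word that appears everywhere gets weight zero, while a word that is specific to a few questions is weighted up.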

In [10]:
from sklearn.feature_extraction.text import TfidfTransformer
text_clf = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf', MultinomialNB())])

text_clf = text_clf.fit(train['question'], train['cat'])
predicted = text_clf.predict(test['question'])
np.mean(predicted == test['cat'])

0.7207207207207207

#### Stopwords
<img src="https://cdn-images-1.medium.com/max/1600/1*_OsV8gO2cjy9qcFhrtCdiw.jpeg" alt="Stopwords" style="width: 400px;"/>



In [11]:
txt = list(df["question"])
X = count_vect.fit_transform(txt)
pd.DataFrame(X.A, columns=count_vect.get_feature_names()).sum().sort_values(ascending=False)[:10]

the      661
is       286
how      243
what     198
which    193
of       177
does     163
with     163
for      158
in       155
dtype: int64
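
The counts above are dominated by words that carry little topical meaning. As a minimal sketch of what stop-word removal does, the following filters such words out of a few made-up questions. The small `STOP_WORDS` set is hand-picked from the ten words in the output above and is only an assumption for this example; it is not scikit-learn's `ENGLISH_STOP_WORDS` list.

```python
from collections import Counter

# Hand-picked for this sketch; not scikit-learn's stop-word list.
STOP_WORDS = {"the", "is", "how", "what", "which", "of", "does", "with", "for", "in"}

# Made-up questions for illustration only.
questions = [
    "what is the price of the phone",
    "which sim card is in the phone",
    "how does the tariff work",
]

tokens = [token for question in questions for token in question.split()]
content_tokens = [token for token in tokens if token not in STOP_WORDS]

print(Counter(tokens).most_common(1))          # [('the', 4)]
print(Counter(content_tokens).most_common(1))  # [('phone', 2)]
```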

In [12]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS  # must be imported before use
txt = list(df["question"])
count_vect_stop = CountVectorizer(stop_words=ENGLISH_STOP_WORDS)
X = count_vect_stop.fit_transform(txt)
pd.DataFrame(X.A, columns=count_vect_stop.get_feature_names()).sum().sort_values(ascending=False)[:10]

In [13]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
text_clf = Pipeline([('vect', CountVectorizer(stop_words=ENGLISH_STOP_WORDS)), ('tfidf', TfidfTransformer()), ('clf', MultinomialNB())])

text_clf = text_clf.fit(train['question'], train['cat'])
predicted = text_clf.predict(test['question'])
np.mean(predicted == test['cat'])

0.7567567567567568

#### Some Regex Preprocessing of the Data

In [14]:
txt = list(df["question"])
count_vect_stop = CountVectorizer(stop_words=ENGLISH_STOP_WORDS)
X = count_vect_stop.fit_transform(txt)
pd.DataFrame(X.A, columns=count_vect_stop.get_feature_names()).sum().sort_values(ascending=False).filter(regex=r'.*\d.*')[:10]

39      19
415     17
10       9
400      7
2nd      6
s7       6
a3       6
64gb     5
806      4
724v     4
dtype: int64

In [15]:
text_clf = Pipeline([
    ('vect', CountVectorizer(
        stop_words='english', 
        preprocessor=lambda x: re.sub(r'(\d[\d\.])+', 'NUM', x.lower())
    )), 
    ('tfidf', TfidfTransformer()), 
    ('clf', MultinomialNB())])

text_clf = text_clf.fit(train['question'], train['cat'])
predicted = text_clf.predict(test['question'])
np.mean(predicted == test['cat'])

0.7612612612612613
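
It is worth checking what the preprocessing pattern actually matches. `(\d[\d\.])+` consumes characters in pairs (a digit followed by a digit or a dot), so a lone digit is left untouched and a number with an odd count of digits keeps its last digit. The sketch below compares it with `\d[\d\.]*`, which is only an assumption about a possible alternative and not part of the notebook; switching patterns would change the accuracy figures reported here. The sample strings are made up.

```python
import re

pattern_used = r'(\d[\d\.])+'  # pattern from the pipeline above
pattern_alt = r'\d[\d\.]*'     # assumption: one possible alternative, not from the notebook

samples = ["galaxy s7", "speedport w724 v", "iphone 64gb", "comfort pro p 500"]
results = {}
for text in samples:
    results[text] = (re.sub(pattern_used, 'NUM', text), re.sub(pattern_alt, 'NUM', text))
    print(text, '->', results[text])

# galaxy s7         -> ('galaxy s7', 'galaxy sNUM')
# speedport w724 v  -> ('speedport wNUM4 v', 'speedport wNUM v')
# iphone 64gb       -> ('iphone NUMgb', 'iphone NUMgb')
# comfort pro p 500 -> ('comfort pro p NUM0', 'comfort pro p NUM')
```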

#### Stemming

![Stemming](https://1.bp.blogspot.com/-s5VGFbL8-ew/W5wSySqBbYI/AAAAAAAABdw/elNHb72ki2oTRgUORGXzhTBMfBk-oa08gCEwYBhgL/s1600/image003.png)
There are several stemming algorithms: Porter, Snowball (Porter2), and Lancaster (Paice/Husk). For English, the Snowball stemmer is usually a good choice, but you can try the others as well. We use the implementation from NLTK. For a description of the algorithm, see [Link](https://snowballstem.org/algorithms/porter/stemmer.html)

A note on stemming vs. lemmatization: lemmatization always yields a lexicographically correct word (the lemma), whereas the stem produced by stemming need not be a real word.
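
The difference can be illustrated with a deliberately naive sketch. The suffix rules below are a toy made up for this example and are far simpler than the Porter or Snowball algorithms; the `LEMMAS` dictionary is likewise a hand-written stand-in for a real lemmatizer.

```python
# Toy rules for illustration only; real stemmers use many more rules and conditions.
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three leading characters."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + replacement
    return word

# Hand-written lookup standing in for a dictionary-based lemmatizer.
LEMMAS = {"studies": "study", "phones": "phone", "charging": "charge"}

for word in ["studies", "phones", "charging"]:
    print(word, '-> stem:', toy_stem(word), '| lemma:', LEMMAS[word])

# studies  -> stem: studi | lemma: study
# phones   -> stem: phone | lemma: phone
# charging -> stem: charg | lemma: charge
```

For classification the stem only has to be consistent across word forms, so it does not matter that `studi` or `charg` are not dictionary words.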

In [16]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')

from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english", ignore_stopwords=True)

class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        return lambda doc: ([stemmer.stem(w) for w in analyzer(doc)])

[nltk_data] Downloading package stopwords to /home/paul/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/paul/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [17]:
text_clf = Pipeline([
    ('vect', StemmedCountVectorizer(
        stop_words=ENGLISH_STOP_WORDS, 
        preprocessor=lambda x: re.sub(r'(\d[\d\.])+', 'NUM', x.lower())
    )), 
    ('tfidf', TfidfTransformer()), 
    ('clf', MultinomialNB())])

text_clf = text_clf.fit(train['question'], train['cat'])
predicted = text_clf.predict(test['question'])
np.mean(predicted == test['cat'])

0.7927927927927928

### Classifier Optimisations

#### Variant 1: Using hyperparameter optimization

For example, by using exhaustive grid search, which basically brute-forces every combination of the given parameter values.

In [18]:
text_clf = Pipeline([
    ('vect', StemmedCountVectorizer(
        stop_words=ENGLISH_STOP_WORDS, 
        preprocessor=lambda x: re.sub(r'(\d[\d\.])+', 'NUM', x.lower())
    )), 
    ('tfidf', TfidfTransformer()), 
    ('clf', MultinomialNB())])

from sklearn.model_selection import GridSearchCV
parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
              'tfidf__use_idf': (True, False),
              'clf__fit_prior': (True, False),
              'clf__alpha': (1e-3, 1e-2,0.5, 1)
}
gs_clf = GridSearchCV(text_clf, parameters, n_jobs=1) # Because of Anaconda
gs_clf = gs_clf.fit(train['question'], train['cat'])
gs_clf.best_params_



{'clf__alpha': 0.01,
 'clf__fit_prior': True,
 'tfidf__use_idf': True,
 'vect__ngram_range': (1, 2)}
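
"Exhaustive" means that every combination of the listed values is tried, and `GridSearchCV` fits each of them once per cross-validation fold. As a quick sketch, the combinations for the grid above can be enumerated with the standard library; the variable names `names` and `combinations` are just for this example.

```python
from itertools import product

# Same grid as in the cell above.
parameters = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'tfidf__use_idf': (True, False),
    'clf__fit_prior': (True, False),
    'clf__alpha': (1e-3, 1e-2, 0.5, 1),
}

names = list(parameters)
# Cartesian product of all value lists: one dict per candidate configuration.
combinations = [dict(zip(names, values)) for values in product(*parameters.values())]

print(len(combinations))  # 32 = 2 * 2 * 2 * 4
print(combinations[0])
```

The number of candidates grows multiplicatively with every added parameter, which is why grid search quickly becomes expensive.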

In [19]:
text_clf = Pipeline([
    ('vect', StemmedCountVectorizer(
        ngram_range=(1, 2),
        stop_words=ENGLISH_STOP_WORDS, 
        preprocessor=lambda x: re.sub(r'(\d[\d\.])+', 'NUM', x.lower())
    )), 
    ('tfidf', TfidfTransformer(use_idf=True)), 
    ('clf', MultinomialNB(fit_prior=False, alpha=1e-2))])  # note: the grid search above selected fit_prior=True

text_clf = text_clf.fit(train['question'], train['cat'])
predicted = text_clf.predict(test['question'])
np.mean(predicted == test['cat'])

0.8513513513513513

#### Variant 2: Try other algorithms
