**Web Scraping and Data Mining in Python**

# Machine Learning

Jan Riebling, *Universität Wuppertal*

# Machine learning in general

## Learning

* **Unsupervised learning**: Finding patterns and classifications from the data alone, without recourse to known classes (e.g. Topic Models). 
* **Supervised learning**: Training a model using a dataset of features to reproduce a classification that is known beforehand.

## Workflow

1. Create training corpus.
2. Define relevant features.
3. Train model.
4. Evaluate model.
5. (Apply classifier.)

## Scikit learn

Provides a wide variety of methods for machine learning, classification, and statistics. For more details see the extensive [documentation](http://scikit-learn.org/stable/).

# The Naive-Bayes classifier

## Why naive? Why Bayes?

* “Naive” refers to the naive (and obviously false) assumption that all features of a text are stochastically independent.
* “Bayes” refers to the method of estimating the correct classification through Bayes' theorem:  
$$
P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)}
$$
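
As a worked illustration of the formula, the sketch below plugs in made-up numbers and computes the posterior probability that a name is female given that it ends in `a`. The prior and both likelihoods are assumptions chosen for the example, not estimates from the names corpus.

```python
# Hypothetical probabilities, chosen only to illustrate Bayes' theorem.
p_female = 0.6            # P(A): prior probability of the class "female"
p_a_given_female = 0.35   # P(B | A): name ends in "a", given "female"
p_a_given_male = 0.02     # P(B | not A): name ends in "a", given "male"

# P(B) via the law of total probability over both classes.
p_a = p_a_given_female * p_female + p_a_given_male * (1 - p_female)

# Bayes' theorem: P(A | B) = P(B | A) * P(A) / P(B)
p_female_given_a = p_a_given_female * p_female / p_a

print(round(p_female_given_a, 3))
```

With these invented inputs the posterior comes out at roughly 0.96, far above the prior of 0.6, because the evidence is much more likely under one class than under the other.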

### [Naive-Bayes Classifier](http://www.nltk.org/book/ch06.html#underlying-probabilistic-model) in NLTK

We are looking for:

$$
P(y \mid x_1, \dots x_n) = \frac{P(y)P(x_1, \dots x_n \mid y)}{P(x_1, \dots x_n)}.
$$

Under the naive assumption that the features $x_1, \dots x_n$ are conditionally independent given the class $y$, and because the denominator $P(x_1, \dots x_n)$ does not depend on $y$:

$$
P(y \mid x_1, \dots x_n) \propto P(y)\prod_{i=1}^{n}{P(x_i \mid y)}.
$$
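
The proportionality above can be turned into a classifier in a few lines. The following is a minimal sketch in plain Python, not the NLTK or scikit-learn implementation: the toy samples, the helper names (`train_nb`, `log_posterior`, `classify`) and the add-one smoothing parameter `alpha` are all assumptions made for illustration. It sums logarithms instead of multiplying probabilities to avoid numerical underflow.

```python
import math
from collections import Counter, defaultdict

def train_nb(samples):
    """Count labels and feature values per label."""
    label_counts = Counter()
    value_counts = defaultdict(Counter)  # (label, feature) -> value counts
    values_seen = defaultdict(set)       # feature -> all observed values
    for features, label in samples:
        label_counts[label] += 1
        for name, value in features.items():
            value_counts[(label, name)][value] += 1
            values_seen[name].add(value)
    return label_counts, value_counts, values_seen

def log_posterior(features, label, model, alpha=1.0):
    """log P(y) + sum_i log P(x_i | y), with add-alpha smoothing (assumption)."""
    label_counts, value_counts, values_seen = model
    score = math.log(label_counts[label] / sum(label_counts.values()))
    for name, value in features.items():
        count = value_counts[(label, name)][value]
        denom = label_counts[label] + alpha * len(values_seen[name])
        score += math.log((count + alpha) / denom)
    return score

def classify(features, model):
    """Return the label with the highest (unnormalized) log posterior."""
    return max(model[0], key=lambda label: log_posterior(features, label, model))

# Hypothetical toy data: first and last letter of six invented names.
samples = [
    ({"last": "a", "first": "m"}, "female"),
    ({"last": "a", "first": "a"}, "female"),
    ({"last": "e", "first": "s"}, "female"),
    ({"last": "n", "first": "j"}, "male"),
    ({"last": "r", "first": "p"}, "male"),
    ({"last": "e", "first": "m"}, "male"),
]
model = train_nb(samples)
print(classify({"last": "a", "first": "j"}, model))
```

The normalizing constant $P(x_1, \dots x_n)$ is never computed, since it is the same for every class and does not change which label scores highest.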


The NB classifier in NLTK assumes a Bernoulli distribution, meaning that only binary distinctions are taken into consideration!

## Example:

Estimating the sex of a person using that person's first name.

In [1]:
import nltk
from nltk.corpus import names

#nltk.download()  # Uncomment if the corpus data has not been downloaded yet.

print(names.fileids())

print(len(set(names.words('female.txt')) & set(names.words('male.txt'))))
print(len(names.words('female.txt')))

names.words('female.txt')[:10]

['female.txt', 'male.txt']
365
5001


['Abagael',
 'Abagail',
 'Abbe',
 'Abbey',
 'Abbi',
 'Abbie',
 'Abby',
 'Abigael',
 'Abigail',
 'Abigale']

## Creating the corpus

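In this step we prepare and clean the text data. We also randomize the rows of the data (*observations*). This is essential for later steps, because the raw list contains all male names followed by all female names and the train/test split slices the data by position.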

In [2]:
import pandas as pd
import random

listoftuples = ([(name, 'male') for name in names.words('male.txt')] +
                [(name, 'female') for name in names.words('female.txt')])

listoftuples[-5:]

[('Zorine', 'female'),
 ('Zsa Zsa', 'female'),
 ('Zsazsa', 'female'),
 ('Zulema', 'female'),
 ('Zuzana', 'female')]

In [3]:
## Shuffles list in-place!
random.shuffle(listoftuples)

names_df = pd.DataFrame(listoftuples, 
                        columns=['FirstName', 'Sex'])
names_df

| | FirstName | Sex |
|---|---|---|
| 0 | Feliza | female |
| 1 | Roxanne | female |
| 2 | Maxine | female |
| 3 | Waldo | male |
| 4 | Gabriellia | female |
| ... | ... | ... |
| 7939 | Goldia | female |
| 7940 | Nannette | female |
| 7941 | Farah | female |
| 7942 | Sibyl | female |


In [29]:
names_df

| | FirstName | Sex |
|---|---|---|
| 0 | Deryl | male |
| 1 | Dehlia | female |
| 2 | Stanwood | male |
| 3 | Ilysa | female |
| 4 | Madella | female |
| ... | ... | ... |
| 7939 | Sascha | male |
| 7940 | Yevette | female |
| 7941 | Adriane | female |
| 7942 | Clair | male |
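
Because `random.shuffle` reorders the list in place and differently on every run, the two listings of `names_df` above show different rows. If a reproducible order is wanted, one option is to shuffle with a seeded generator. The sketch below assumes a fixed seed is acceptable; the helper `shuffled_copy` and the four example pairs are hypothetical.

```python
import random

# Hypothetical miniature corpus of (name, label) pairs.
pairs = [("Anna", "female"), ("Ben", "male"),
         ("Cleo", "female"), ("Dirk", "male")]

def shuffled_copy(items, seed):
    """Return a shuffled copy; the same seed always yields the same order."""
    rng = random.Random(seed)  # private generator, global state untouched
    result = list(items)       # copy, so the input list is not modified
    rng.shuffle(result)
    return result

first = shuffled_copy(pairs, seed=42)
second = shuffled_copy(pairs, seed=42)

print(first == second)                 # True: identical order on every run
print(sorted(first) == sorted(pairs))  # True: same elements, new order
```

With pandas, `names_df.sample(frac=1, random_state=42)` achieves a comparable seeded shuffle of the rows.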


## Define features

This is the crucial step of selecting the features to use as independent variables.

In [30]:
## We just use the last letter of the first name

names_df['FirstName'].str[-1]

0       l
1       a
2       d
3       a
4       a
       ..
7939    a
7940    e
7941    e
7942    r
7943    e
Name: FirstName, Length: 7944, dtype: object

In [5]:
## Make it numerical and a matrix
lastletters = names_df.FirstName.str.strip().str[-1]

features_df = lastletters.str.get_dummies()

features_df

Unnamed: 0,a,b,c,d,e,f,g,h,i,j,...,p,r,s,t,u,v,w,x,y,z
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7939,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7940,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7941,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
7942,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
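
To make explicit what `str.get_dummies()` produces, here is a pure-Python sketch of the same one-hot encoding of the last letter. The helper name `one_hot_last_letter` is hypothetical. Unlike the DataFrame above, it always emits all 26 letters, whereas `get_dummies` only creates columns for values that actually occur in the data.

```python
import string

def one_hot_last_letter(name, alphabet=string.ascii_lowercase):
    """Encode the last letter of a name as a list of 0/1 indicators."""
    last = name.strip()[-1].lower()
    # Exactly one position is 1: the one matching the last letter.
    return [1 if letter == last else 0 for letter in alphabet]

vector = one_hot_last_letter("Neo")
print(len(vector))                                  # 26
print(vector[string.ascii_lowercase.index("o")])    # 1
print(sum(vector))                                  # 1
```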


## Train the classifier

Here we split the corpus into two datasets: a larger one to train the model (Bernoulli Naive Bayes) and a smaller one, drawn from the same data, as a test set (*holdout data*). Take care to randomize the order of the data before splitting it into training and test sets!

Here we use the Bernoulli Naive Bayes method for binary features (see the scikit-learn documentation [here](https://scikit-learn.org/stable/modules/naive_bayes.html#bernoulli-naive-bayes)).

In [6]:
## Splitting the independent variables (features)

X_train, X_test = features_df[:6000], features_df[6000:]

In [7]:
## Splitting the dependent variable (label)

y_train, y_test = names_df.Sex[:6000], names_df.Sex[6000:]
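
As an alternative to slicing by position, scikit-learn offers `train_test_split`, which shuffles and splits features and labels in one call. The sketch below is only an illustration on an invented eight-row dataset; the `test_size`, `random_state` and `stratify` values are assumptions, not settings used in this notebook.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: two binary features and alternating labels.
X = [[1, 0], [0, 1], [1, 0], [0, 1], [1, 0], [0, 1], [1, 0], [0, 1]]
y = ["female", "male"] * 4

# Hold out 25% of the rows; stratify keeps the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

print(len(X_train), len(X_test))  # 6 2
```

Because `train_test_split` shuffles by default, it does not rely on the data having been randomized beforehand, which the positional slices above do.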

In [8]:
## Creating the classifier
from sklearn.naive_bayes import BernoulliNB

bnb_clf = BernoulliNB()

bnb_clf.fit(X_train, y_train)

BernoulliNB()

## Evaluate classifier

We use different metrics to evaluate the classifier on the held-out test dataset.

In [10]:
## First a little manual test

# Predict two names that are not in the dataset
newnames = ['Neo', 'Deirdre']

In order to classify new data, we have to create an array with the same columns, in the same order, as our feature array.

In [11]:
extracted_df = (pd.Series(newnames).str[-1]).str.get_dummies()
extracted_df

| | e | o |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 1 | 0 |


In [12]:
newfeatures_df = pd.DataFrame(columns=features_df.columns)

newfeatures_df

Unnamed: 0,a,b,c,d,e,f,g,h,i,j,...,p,r,s,t,u,v,w,x,y,z


In [14]:
## Dataframe with the same dimensions

pd.concat([newfeatures_df, extracted_df], axis=0, sort=False)

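| | a | b | c | d | e | f | g | h | i | j | ... | p | r | s | t | u | v | w | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | NaN | NaN | NaN | 0 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | NaN | NaN | NaN | NaN | 1 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |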


In [15]:
predictor_df = pd.concat([newfeatures_df, extracted_df], axis=0, sort=False).fillna(0)
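
A more compact way to align new data with the training columns is `DataFrame.reindex`, which adds missing columns filled with a constant and drops columns the model has never seen. The sketch below is an alternative to the `concat`/`fillna` approach, not the notebook's method; the short column list `train_columns` stands in for `features_df.columns` and is hypothetical.

```python
import pandas as pd

# Stand-in for features_df.columns (hypothetical, shortened for the example).
train_columns = list("aeno")

new_names = pd.Series(["Neo", "Deirdre"])
extracted = new_names.str.strip().str[-1].str.get_dummies()

# Same columns, same order as during training; absent letters become 0.
aligned = extracted.reindex(columns=train_columns, fill_value=0)

print(aligned)
```

This keeps integer dtypes throughout, whereas concatenating with an empty frame and filling the gaps afterwards passes through missing values first.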

In [17]:
bnb_clf.feature_log_prob_

array([[-1.03652516, -6.15750851, -8.23695005, -4.86965422, -1.24301707,
        -7.54380287, -6.15750851, -3.8302308 , -2.71949715, -7.54380287,
        -7.54380287, -3.30969636, -6.2910399 , -2.53316757, -5.01807422,
        -7.54380287, -4.86965422, -4.07806696, -4.38680245, -6.62751214,
        -7.13833776, -6.62751214, -5.83905478, -2.38474757, -6.85065569],
       [-4.5299077 , -4.76352255, -4.76352255, -2.56629798, -1.86441711,
        -4.61691908, -4.44986499, -3.43129541, -3.9237719 , -6.32166717,
        -3.79593853, -2.73122779, -3.68260984, -1.8053282 , -2.91217099,
        -5.14301217, -2.70401523, -2.54317556, -2.92883804, -5.31006626,
        -5.0689042 , -5.0689042 , -5.62851999, -2.18650061, -5.22305488]])
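
The raw array is easier to read once it is converted back from log space and labelled. In `feature_log_prob_`, rows follow the order of `classes_` and columns follow the order of the feature columns. The sketch below demonstrates this on an invented six-row dataset with two features; the data and the feature names `a` and `o` are assumptions for illustration.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical toy features: column 0 = "ends in a", column 1 = "ends in o".
X = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [1, 0], [0, 0]])
y = np.array(["female", "female", "male", "male", "female", "male"])

clf = BernoulliNB().fit(X, y)

# exp() turns log probabilities back into P(feature = 1 | class).
probs = np.exp(clf.feature_log_prob_)

for label, row in zip(clf.classes_, probs):
    print(label, dict(zip(["a", "o"], row.round(3))))
```

With the default smoothing (`alpha=1.0`) no estimate is exactly 0 or 1, even for a feature that never occurs in a class.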

In [16]:
bnb_clf.predict(predictor_df)

array(['male', 'female'], dtype='<U6')

### Prediction on the test set

A central step in all machine learning is applying the classifier to the test data, whose classification is known, in order to see how well the prediction works.

In [18]:
## True y
y_true = y_test

## Predicted
y_pred = bnb_clf.predict(X_test)

pd.DataFrame({'True': y_true[:10].values, 'Predicted': y_pred[:10]})

| | True | Predicted |
|---|---|---|
| 0 | female | female |
| 1 | female | female |
| 2 | male | female |
| 3 | female | male |
| 4 | female | female |
| 5 | male | male |
| 6 | female | female |
| 7 | female | female |
| 8 | female | female |
| 9 | female | female |


### Metric: Accuracy

The relative number of correctly classified cases.

In [19]:
from sklearn.metrics import accuracy_score

print('Accuracy', accuracy_score(y_true, y_pred))

Accuracy 0.7505144032921811
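
To see what `accuracy_score` computes, the sketch below recalculates accuracy by hand for the first ten test cases listed above. It also computes a simple majority-class baseline from the corpus counts printed earlier (5001 female entries out of 7944 names). Treat the baseline as a rough reference only, since it is taken from the full corpus rather than from the test set.

```python
# First ten test cases, copied from the True/Predicted table above.
y_true = ["female", "female", "male", "female", "female",
          "male", "female", "female", "female", "female"]
y_pred = ["female", "female", "female", "male", "female",
          "male", "female", "female", "female", "female"]

# Accuracy: share of cases where prediction and true label agree.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Baseline: always predict the majority class "female".
baseline = 5001 / 7944

print(accuracy)            # 0.8
print(round(baseline, 3))
```

An accuracy of about 0.75 on the test set is therefore better than always guessing the majority class, but not by a wide margin.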


## How much precision?

How much precision is enough precision?

$80\%$, according to a study by [Conrad et al. 2012](http://aclweb.org/anthology-new/W/W12/W12-3810.pdf):
![](http://farm5.static.flickr.com/4031/4474308638_d8a30bb1a9.jpg)

## The problem:

![cat](https://mlcorner.files.wordpress.com/2013/04/ai-comic.jpg)