**Web Scraping und Data Mining in Python**

# Machine Learning II

Jan Riebling, *Universität Wuppertal*

# Cross-validation

## General idea

Evaluating a classifier on the same cases it was trained on is a methodological mistake. We want to predict the classification of cases that are previously unknown to the algorithm. We therefore need to hold back a portion of the data as a test dataset that is used exclusively for evaluating the classifier. This is commonly referred to as *out-of-sample testing*.

The estimate can be improved further by repeating the measurement $n$ times over different splits of the data and averaging the results. This is called *cross-validation*.

A more detailed version of the following example can be found in the [scikit-learn documentation](https://scikit-learn.org/stable/modules/cross_validation.html).

In [7]:
## Load iris
import numpy as np
from sklearn import datasets
from sklearn import svm

iris = datasets.load_iris()
iris.data.shape, iris.target.shape

((150, 4), (150,))

## Dedicated function

In [8]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(iris.data, 
                                                    iris.target, 
                                                    test_size=0.4, 
                                                    random_state=0)

In [9]:
X_train.shape, y_train.shape

((90, 4), (90,))

In [10]:
X_test.shape, y_test.shape

((60, 4), (60,))

## Out-of-sample testing

In [11]:
## Fitting a linear support vector machine

clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)

clf.score(X_test, y_test)

0.9666666666666667
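
To see why held-out data matters, a minimal sketch can compare the score on the training data with the score on the test data. The decision tree used here is an assumption chosen for illustration (it is not part of the original example), because an unpruned tree tends to memorize its training cases.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)

# Assumption: an unpruned decision tree as an example of a flexible model
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_score = tree.score(X_train, y_train)  # evaluated on known cases
test_score = tree.score(X_test, y_test)     # evaluated on unseen cases

print("Training accuracy: %0.2f" % train_score)
print("Test accuracy:     %0.2f" % test_score)
```

The training score is typically the more optimistic of the two, which is why only the test score should be read as an estimate of predictive performance.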

## Cross-validating

Cross-validation repeatedly splits the data set into training and test sets, computes the test score for each split, and takes the average.

In [12]:
from sklearn.model_selection import cross_val_score

## Model has to be instantiated independently
clf = svm.SVC(kernel='linear', C=1)

## cv sets the number of folds (train/test splits)
scores = cross_val_score(clf, iris.data, iris.target, cv=5, scoring='accuracy')
scores                                            

array([0.96666667, 1.        , 0.96666667, 0.96666667, 1.        ])

In [13]:
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy: 0.98 (+/- 0.03)
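
What `cross_val_score` does can be sketched as an explicit loop. This is a minimal sketch, assuming that an integer `cv` for a classifier corresponds to a stratified k-fold split without shuffling (the documented default in recent scikit-learn versions); the name `manual_scores` is only illustrative.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import StratifiedKFold

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Five folds, each preserving the class proportions of the full data set
skf = StratifiedKFold(n_splits=5)

manual_scores = []
for train_idx, test_idx in skf.split(X, y):
    # A fresh model is fitted on the training part of every split
    fold_clf = svm.SVC(kernel='linear', C=1)
    fold_clf.fit(X[train_idx], y[train_idx])
    # ... and scored only on the part it has not seen
    manual_scores.append(fold_clf.score(X[test_idx], y[test_idx]))

manual_scores = np.array(manual_scores)
print(manual_scores, manual_scores.mean())
```

Each case ends up in the test set exactly once, so every score is an out-of-sample score.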


## Excursus: Why the `* 2`?

To calculate a confidence interval for the mean, one has to choose the probability with which a value is supposed to fall within a specific range of an assumed probability distribution.

For the $95\%$ confidence interval, and assuming a normal distribution, we multiply the standard deviation $\sigma$ by the number of standard deviations on either side of the mean that enclose $95\%$ of a normal distribution.

In [14]:
from scipy import stats

## Actually, slightly less than 2:
stats.norm.ppf((1+0.95)/2)

1.959963984540054

In [15]:
## More correct formula:
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), 
                                       scores.std() * stats.norm.ppf((1+0.95)/2)))

Accuracy: 0.98 (+/- 0.03)


## Choosing the correct distribution

Because the number of samples ($n = 5$) is too small to approximate a normal distribution, and since we are working with estimates, the $t$ or “Student's” distribution is more appropriate. The degrees of freedom are equal to $n-1$.

In [16]:
from scipy import stats

stats.t.ppf((1+0.95)/2, df=4)

2.7764451051977987

In [17]:
## Even more correct:
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), 
                                       scores.std() * stats.t.ppf((1+0.95)/2, df=4)))

Accuracy: 0.98 (+/- 0.05)
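
The formulas above scale the standard deviation of the five scores, which describes the spread of the individual fold scores. A confidence interval for the *mean* is usually based on the standard error instead, i.e. the sample standard deviation divided by $\sqrt{n}$. The following is a sketch of that variant, using the scores printed above. Treating the folds as independent samples is an assumption that holds only approximately, since the training sets overlap.

```python
import numpy as np
from scipy import stats

# Fold scores copied from the cross-validation output above
scores = np.array([0.96666667, 1.0, 0.96666667, 0.96666667, 1.0])
n = len(scores)

# Standard error of the mean, using the sample standard deviation (ddof=1)
sem = scores.std(ddof=1) / np.sqrt(n)

# Half-width of the 95% interval from the t distribution with n-1 degrees of freedom
half_width = sem * stats.t.ppf((1 + 0.95) / 2, df=n - 1)

print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), half_width))
```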


## Alternative scores for CV

The `cross_val_score` function can calculate different scores when given the appropriate `scoring=` keyword. A list of accepted keywords can be found in the [docs](https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter).
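
As a brief sketch, the iris example from above can be scored with a different metric by changing only the keyword. The choice of `'f1_macro'` is just one example from the list of accepted keywords.

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1)

# 'f1_macro' averages the per-class F1-scores without weighting
f1_scores = cross_val_score(clf, iris.data, iris.target, cv=5, scoring='f1_macro')

print("F1 (macro): %0.2f" % f1_scores.mean())
```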

## Example 2

Sentiment analysis of movie reviews. Two different approaches:

* Naive-Bayes Classification
* Maximum Entropy Classifier

## Creating a corpus

In [18]:
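import random
import nltk
from nltk.corpus import movie_reviews, stopwords

## Requires the NLTK data packages 'movie_reviews' and 'stopwords',
## e.g. via nltk.download('movie_reviews') and nltk.download('stopwords')

## Use a separate name so the imported `stopwords` corpus is not shadowed
stop_words = set(stopwords.words('english'))

## One (tokens, category) pair per review; keep alphabetic non-stopword tokens
documents = [([token for token in movie_reviews.words(fileid)
               if token.isalpha()
               and token not in stop_words], 
              category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]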

random.shuffle(documents)

In [19]:
documents[0][1]

'pos'

## Extracting features

In contrast to NLTK, scikit-learn expects the features to be represented as an array-like object.

In [20]:
## Types as features
import pandas as pd

df = pd.DataFrame(documents,
                  columns=['Tokens', 'Sentiment'])


In [23]:
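## Build a document-term matrix: one row per document, one column per type,
## each cell holding how often the type occurs in the document
feature_df = pd.DataFrame(df.Tokens.tolist())\
            .stack()\
            .groupby(level=0)\
            .value_counts()\
            .unstack(level=1)\
            .fillna(0)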

In [24]:
## Making sure no NaN values remain (a no-op after the fillna(0) above)
feature_df.fillna(0, inplace=True)
feature_df

Unnamed: 0,aa,aaa,aaaaaaaaah,aaaaaaaahhhh,aaaaaah,aaaahhhs,aahs,aaliyah,aalyah,aamir,...,zukovsky,zulu,zundel,zurg,zus,zweibel,zwick,zwigoff,zycie,zzzzzzz
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1996,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1997,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1998,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Splitting into test and training corpora

In [26]:
## Splitting the independent variables (features)

X_train, X_test = feature_df[:1500], feature_df[1500:]

In [27]:
## Splitting the dependent variable (label)

y_train, y_test = df.Sentiment[:1500], df.Sentiment[1500:]

## Train classifiers

In [28]:
from sklearn.naive_bayes import MultinomialNB

## Maximum Entropy Classifier == Logistic Regression
from sklearn.linear_model import LogisticRegression

mb_clf = MultinomialNB()
lr_clf = LogisticRegression()

mb_clf.fit(X_train, y_train)
lr_clf.fit(X_train, y_train)

LogisticRegression()

## Evaluating the classification

In [29]:
## NB probabilities of each type for the second class in mb_clf.classes_ (here 'pos')
import numpy as np

mb_feats = pd.Series(np.exp(mb_clf.feature_log_prob_[1]),
                     index=feature_df.columns)

mb_feats.sort_values(ascending=False).head(10)

film     0.012161
one      0.007158
movie    0.005958
like     0.004310
time     0.002924
story    0.002899
good     0.002893
also     0.002806
even     0.002733
well     0.002612
dtype: float64
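
The most probable words for one class are mostly words that are frequent in every review. One possible way to find words that separate the classes (an addition for illustration, not part of the original analysis) is to compare the log probabilities of the two classes. The sketch uses a made-up count matrix.

```python
import pandas as pd
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy counts for three word types in two reviews
toy_X = pd.DataFrame([[2, 1, 0],
                      [0, 1, 2]],
                     columns=['good', 'film', 'bad'])
toy_y = ['pos', 'neg']

toy_nb = MultinomialNB().fit(toy_X, toy_y)

# classes_ is sorted, so index 0 is 'neg' and index 1 is 'pos'
log_ratio = pd.Series(toy_nb.feature_log_prob_[1] - toy_nb.feature_log_prob_[0],
                      index=toy_X.columns)

# Positive values: more typical for 'pos'; negative values: more typical for 'neg'
print(log_ratio.sort_values(ascending=False))
```

A word such as "film" that is equally likely in both classes gets a ratio of zero, even though its probability within each class is high.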

In [30]:
## ME/LR coefficients (positive values push the prediction towards 'pos')
lr_feats = pd.Series(lr_clf.coef_[0],
                     index=feature_df.columns)

lr_feats.sort_values(ascending=False).head(10)

fun             0.608574
great           0.490919
well            0.428531
performances    0.412549
perfectly       0.387623
especially      0.382982
overall         0.372857
memorable       0.356511
terrific        0.350046
always          0.338733
dtype: float64

## Sklearn metrics

`scikit-learn` provides a wide range of metrics and tests to decide on the validity of the models. See the [documentation](http://scikit-learn.org/stable/model_selection.html#model-selection) for more details.

The general logic behind these metrics is to compare the predicted results to the known correct observations.

In [31]:
## True y
y_true = y_test

## Predicted
y_pred = lr_clf.predict(X_test)

pd.DataFrame({'True': y_test[:10].values, 'Predicted': y_pred[:10]})

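| $\,$ | True | Predicted |
|------|------|-----------|
| 0    | neg  | pos       |
| 1    | neg  | neg       |
| 2    | neg  | neg       |
| 3    | pos  | pos       |
| 4    | pos  | pos       |
| 5    | neg  | neg       |
| 6    | neg  | neg       |
| 7    | neg  | neg       |
| 8    | pos  | neg       |
| 9    | pos  | pos       |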


## Correspondence with underlying facts

Confusion Matrix:

| $\,$                   | Condition positive                     | Condition negative                     |
|------------------------|----------------------------------------|----------------------------------------|
| **Predicted positive** | True positive, Power                   | False positive, Type I, $\alpha$-error |
| **Predicted negative** | False negative, Type II, $\beta$-error | True negative                          |
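
The four cells can be counted with `confusion_matrix`. The sketch below uses the first ten true and predicted labels listed earlier. Note that scikit-learn puts the true labels in the rows and the predictions in the columns, i.e. transposed relative to the matrix shown here.

```python
from sklearn.metrics import confusion_matrix

# The first ten true and predicted labels from the comparison above
toy_true = ['neg', 'neg', 'neg', 'pos', 'pos', 'neg', 'neg', 'neg', 'pos', 'pos']
toy_pred = ['pos', 'neg', 'neg', 'pos', 'pos', 'neg', 'neg', 'neg', 'neg', 'pos']

# labels=['pos', 'neg'] puts the positive class first
cm = confusion_matrix(toy_true, toy_pred, labels=['pos', 'neg'])

# Rows are true labels, columns are predicted labels
(tp, fn), (fp, tn) = cm
print("TP:", tp, "FN:", fn, "FP:", fp, "TN:", tn)
```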

## Precision 

Precision is the ratio of correct positive predictions to all positive predictions (including Type I errors).

$$
\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} = \frac{\sum \text{true positive}}{\sum \text{predicted positive}}
$$



In [32]:
from sklearn.metrics import precision_score

## Note: Explicitly name the label that counts as positive!
precision_score(y_true, y_pred, pos_label='pos')

0.8559670781893004

## Recall

Recall or sensitivity is the ratio of correct positive predictions to all cases that are actually positive (including false negatives, i.e. Type II errors).

$$
\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} = \frac{\sum \text{true positive}}{\sum \text{condition positive}}
$$

In [33]:
from sklearn.metrics import recall_score

recall_score(y_true, y_pred, pos_label='pos')

0.8062015503875969

## F1-score

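One very prominent measure is the F1-score, the harmonic mean of precision and recall. With `average='weighted'`, as used below, the F1-score is computed for each class separately and then averaged, weighted by the number of true cases in each class.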

In [34]:
from sklearn.metrics import f1_score

f1_score(y_true, y_pred, average='weighted') 

0.8300102000408001
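
For a single positive class, the F1-score is

$$
F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$

The following sketch checks this relation on the first ten true and predicted labels listed earlier and contrasts it with the weighted average over both classes. The variable names are illustrative only.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# The first ten true and predicted labels from the comparison above
toy_true = ['neg', 'neg', 'neg', 'pos', 'pos', 'neg', 'neg', 'neg', 'pos', 'pos']
toy_pred = ['pos', 'neg', 'neg', 'pos', 'pos', 'neg', 'neg', 'neg', 'neg', 'pos']

p = precision_score(toy_true, toy_pred, pos_label='pos')
r = recall_score(toy_true, toy_pred, pos_label='pos')

# F1 for the positive class only, from scikit-learn and by hand
f1_pos = f1_score(toy_true, toy_pred, pos_label='pos')
f1_by_hand = 2 * p * r / (p + r)

# Per-class F1 averaged with the class frequencies as weights
f1_weighted = f1_score(toy_true, toy_pred, average='weighted')

print(f1_pos, f1_by_hand, f1_weighted)
```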

## Accuracy revisited

Given these categories, accuracy can be described formally as:

$$
\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}} = \frac{\sum \text{correct classifications}}{\text{sample size}}
$$

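There is also a form of *balanced accuracy*, which averages the recall obtained on each class, so that true positive and true negative predictions are each normalized by the size of their class. This matters when the classes are imbalanced, because plain accuracy can then be high for a classifier that mostly predicts the majority class.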

In [35]:
from sklearn.metrics import accuracy_score, balanced_accuracy_score

print('Accuracy', accuracy_score(y_true, y_pred))

print('Balanced accuracy', balanced_accuracy_score(y_true, y_pred))

Accuracy 0.83
Balanced accuracy 0.8307867256070216
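
The difference becomes visible on imbalanced data. The following sketch uses a made-up sample and a hypothetical classifier that always predicts the majority class.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Hypothetical imbalanced sample: nine negative reviews, one positive
toy_true = ['neg'] * 9 + ['pos']
# A classifier that always predicts the majority class
toy_pred = ['neg'] * 10

acc = accuracy_score(toy_true, toy_pred)
bal_acc = balanced_accuracy_score(toy_true, toy_pred)

# Balanced accuracy is the unweighted mean of the per-class recalls
macro_recall = recall_score(toy_true, toy_pred, average='macro')

print('Accuracy', acc)
print('Balanced accuracy', bal_acc)
```

Plain accuracy rewards the constant prediction, while balanced accuracy drops to the level of guessing.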


# A note on data science


## Problems in an emerging field

* Data-driven: can be close to p-hacking and effect mining.
* Machine learning: it is often unknown or even unknowable why a model converges.
* Data quality has to be evaluated on a case-by-case basis.
* Deceptively simple.

## A chance for social science?

> To summarize, the claim that prediction is a necessary (but not sufficient) feature of causal explanation is consistent with a view of causality that is almost universally accepted by sociologists—even sociologists who have explicitly denied the necessity of prediction. The resolution of the apparent conflict is that prediction must be defined suitably—that is, in the broad sense of out-of-sample testing, allowing both for probabilistic predictions and for predictions about stylized facts or patterns of outcomes. [...] Although the details would differ depending on the type of explanation in question, in all cases the procedure would be roughly: (1) construct a “model” based on analysis of cases (A, B, C, ...); (2) deploy the model to make a prediction about case X, which is in the same class as (A, B, C, ...) but was not used to inform the model itself; (3) check the prediction. (Watts 2014, 340)


## Only curve-fitting?

> As much as I look into what’s being done with deep learning, I see they’re all stuck there on the level of associations. Curve fitting. That sounds like sacrilege, to say that all the impressive achievements of deep learning amount to just fitting a curve to data. From the point of view of the mathematical hierarchy, no matter how skillfully you manipulate the data and what you read into the data when you manipulate it, it’s still a curve-fitting exercise, albeit complex and nontrivial. (Judea Pearl, [“To Build Truly Intelligent Machines, Teach Them Cause and Effect”](https://www.quantamagazine.org/to-build-truly-intelligent-machines-teach-them-cause-and-effect-20180515/))

# References

* Watts, Duncan J. 2014. “Common Sense and Sociological Explanations.” *American Journal of Sociology* 120 (2): 313–351.
