# Binary Classification pipeline: Application with text data

Author: Alexandre Gramfort

The objective of this hands-on session is to set up a predictive pipeline to classify movie reviews. Reviews are either positive (y=1) or negative (y=0). This task is often referred to as **sentiment analysis**.

In [None]:
%matplotlib inline

In [None]:
import os.path as op
import numpy as np
import matplotlib.pyplot as plt

What you are given:

- movie reviews as text files in the folder *data/imdb1*.

Your mission:

- Extract numeric features (the 'X') from the raw data (word counts)
- Apply a Logistic Regression classifier with a properly tuned hyperparameter
- Evaluate performance in terms of accuracy with cross-validation
- Answer the question "Do I have enough data?" using a learning curve.

### First let's load the raw data

In [None]:
from glob import glob
filenames_neg = sorted(glob(op.join('data', 'imdb1', 'neg', '*.txt')))
filenames_pos = sorted(glob(op.join('data', 'imdb1', 'pos', '*.txt')))

texts_neg = [open(f).read() for f in filenames_neg]
texts_pos = [open(f).read() for f in filenames_pos]
texts = texts_neg + texts_pos
y = np.ones(len(texts), dtype=int)
y[:len(texts_neg)] = 0  # negatives come first in `texts`

print(f"{len(texts)} documents")
print(f"Number of positives {len(texts_pos)} and negatives {len(texts_neg)}")

### Questions

- What does the array `y` correspond to?
- What is the type of the variable `texts`?
- Can you read the first text?
- Complete the function **count_words** that counts the number of occurrences of each word in a list of texts. You'll need to use the *split* method of the string class to split a text into words.

Example of usage of the `split` function:

In [None]:
words = "Hello DSSP attendees!".split()
print(words)
print("number of words : %s" % len(words))

Example of usage of the `count_words` function:

```
>>> some_texts = ['A B B', 'B', 'A A']
>>> vocabulary, counts = count_words(some_texts)
>>> print(vocabulary)  # dictionary word -> column index
{'A': 0, 'B': 1}
>>> print(counts)  # number of occurrences of each vocabulary word in each text
[[ 1.  2.]
 [ 0.  1.]
 [ 2.  0.]]
```

**Remark:** The vocabulary is a `dict` and its values have nothing to do with a number of occurrences. Its values are the column indices in `counts`.
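To make the remark concrete, here is a minimal sketch of how the two return values fit together, using the toy output shown above. The names `vocab_demo`, `counts_demo` and `index_to_word` are hypothetical and chosen only for this illustration; this is not the solution to `count_words`.

```python
import numpy as np

# Toy values copied from the usage example above (hypothetical names)
vocab_demo = {'A': 0, 'B': 1}
counts_demo = np.array([[1., 2.],
                        [0., 1.],
                        [2., 0.]])

# The dict value is a column index, not a count
col_B = vocab_demo['B']

# Occurrences of 'B' in the first text, then in the whole corpus
b_in_first_text = counts_demo[0, col_B]
b_in_corpus = counts_demo[:, col_B].sum()
print(b_in_first_text, b_in_corpus)

# Inverse mapping: column index -> word
index_to_word = {idx: word for word, idx in vocab_demo.items()}
print(index_to_word[0])
```

The inverse mapping is handy later on, for example to see which words are associated with the largest coefficients of a linear model.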

In [None]:
def count_words(texts):
    """Vectorize text : return count of each word in the text snippets

    Parameters
    ----------
    texts : list of str
        The texts

    Returns
    -------
    vocabulary : dict
        A dictionary that points to an index in counts for each word.
    counts : ndarray, shape (n_samples, n_features)
        The counts of each word in each text.
        n_samples == number of documents.
        n_features == number of words in vocabulary.
    """
    # TODO
    return vocabulary, counts

vocabulary, counts = count_words(texts)
print(counts.shape)
print(counts.sum())

### Questions

- Fit a Logistic Regression on the full data. Show the effect of overfitting by evaluating the predictive power of your method in terms of accuracy.
- Use the `train_test_split` function to split the data into train and test sets (80% train and 20% test). What performance do you get?
- Can you do better by adjusting the regularization parameter C? Use values between 0.00001 and 1000.
- Why is this potentially dangerous? How can you avoid the problem?
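As a hint for the last question, the sketch below shows one possible way to choose C without looking at the test set: cross-validate on the training split only, then score once on the held-out data. It runs on synthetic data from `make_classification`, which is an assumption standing in for the word counts; the names `X_demo`, `y_demo`, `Cs_demo`, `cv_scores` and `best_C` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the word-count features (assumption: not the IMDB data)
X_demo, y_demo = make_classification(n_samples=300, n_features=50,
                                     n_informative=5, random_state=0)

# Keep a test set aside; it is not used while choosing C
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=0)

Cs_demo = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1., 10., 100., 1000.]
cv_scores = []
for C in Cs_demo:
    model = LogisticRegression(C=C, max_iter=1000)
    # Mean 5-fold cross-validated accuracy, computed on the training split only
    cv_scores.append(cross_val_score(model, X_tr, y_tr, cv=5).mean())

best_C = Cs_demo[int(np.argmax(cv_scores))]

# Refit with the selected C and evaluate once on the untouched test set
final_model = LogisticRegression(C=best_C, max_iter=1000).fit(X_tr, y_tr)
train_acc = final_model.score(X_tr, y_tr)
test_acc = final_model.score(X_te, y_te)
print("Best C: %s" % best_C)
print("Train accuracy: %.3f - Test accuracy: %.3f" % (train_acc, test_acc))
```

Picking the C that maximizes accuracy on the test set itself would make the reported test score optimistic, since the test data would then have influenced the model.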

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

clf = LogisticRegression(C=1.)

# TODO

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=0)

# TODO

In [None]:
Cs = [0.00001, 0.0001, 0.001, 0.01, 0.1, 1., 10., 100., 1000.]

for C in Cs:
    # TODO

print('Best C : %s - Best Score : %s' % (C_best, np.max(scores)))

plt.plot(np.log10(Cs), scores)
plt.xlabel("log10(C)")
plt.ylabel("Accuracy")

### Questions

- Compare the performance of Logistic Regression and Multinomial Naive Bayes.
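One possible way to set up such a comparison is to score both classifiers with the same cross-validation folds. The sketch below uses randomly generated Poisson counts as a stand-in for word counts, which is an assumption made only so the example is self-contained; `X_counts`, `y_counts` and `results` are hypothetical names.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

rng = np.random.RandomState(0)
n_per_class, n_words = 100, 30

# Synthetic "word rates": the two classes differ on the first 5 words only
rates_neg = rng.uniform(0.5, 2.0, size=n_words)
rates_pos = rates_neg.copy()
rates_pos[:5] *= 3.0

# Non-negative integer counts, as MultinomialNB expects
X_counts = np.vstack([
    rng.poisson(rates_neg, size=(n_per_class, n_words)),
    rng.poisson(rates_pos, size=(n_per_class, n_words)),
])
y_counts = np.r_[np.zeros(n_per_class, dtype=int),
                 np.ones(n_per_class, dtype=int)]

models = {
    "LogisticRegression": LogisticRegression(C=1., max_iter=1000),
    "MultinomialNB": MultinomialNB(),
}
results = {}
for name, model in models.items():
    # Same data and same 5-fold stratified splits for both models
    results[name] = cross_val_score(model, X_counts, y_counts, cv=5).mean()
    print("%s: mean accuracy %.3f" % (name, results[name]))
```

On the real reviews the ranking may differ from what this toy data shows, so the comparison has to be rerun on the actual features.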

In [None]:
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()

# TODO

### Questions

- Compare your implementation of word counting with scikit-learn.

For this, use the *CountVectorizer* and *Pipeline* classes:

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

from sklearn.model_selection import cross_val_score
from sklearn.feature_extraction.text import TfidfTransformer

clf = Pipeline([
    ('vect', CountVectorizer(max_df=0.75, ngram_range=(1, 1),
                             analyzer='word', stop_words=None)),
    ('nb', MultinomialNB())
])

X_train, X_test, y_train, y_test = train_test_split(texts, y, random_state=42)

# TODO
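When comparing with your own `count_words`, keep in mind that `CountVectorizer` does not tokenize like `str.split`. With its default settings it lowercases the text, ignores punctuation, drops single-character tokens, and returns a sparse matrix. A minimal sketch on a made-up corpus (the texts and the names `toy_texts`, `dense`, `split_tokens` are hypothetical):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Tiny made-up corpus, for illustration only
toy_texts = ['Good good movie', 'bad movie!', 'a good plot']

vectorizer = CountVectorizer()
X_sparse = vectorizer.fit_transform(toy_texts)  # scipy sparse matrix

# Word -> column index, with columns in alphabetical order
print(vectorizer.vocabulary_)

# Dense view of the counts, one row per text
dense = X_sparse.toarray()
print(dense)

# A whitespace split keeps case, punctuation and one-letter words,
# so it yields a larger vocabulary on the same texts
split_tokens = set(word for text in toy_texts for word in text.split())
print(sorted(split_tokens))
```

Here `CountVectorizer` finds 4 distinct words (`bad`, `good`, `movie`, `plot`), while the whitespace split finds 7 distinct tokens, because `Good`, `movie!` and `a` are kept as separate entries.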

### Questions

- Could you do better with more data? Is the model too complex or too simple? Hint: use a learning curve.
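As a reminder of how `learning_curve` is called and what it returns, here is a minimal sketch on synthetic data. The data from `make_classification` and the names `X_demo`, `y_demo`, `sizes`, `train_scores`, `valid_scores` are assumptions for illustration; on the exercise you would pass your pipeline and the raw texts instead.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the real features (assumption: not the IMDB data)
X_demo, y_demo = make_classification(n_samples=400, n_features=20,
                                     random_state=0)

# Fit on growing fractions of the training folds, with 5-fold cross-validation
sizes, train_scores, valid_scores = learning_curve(
    LogisticRegression(max_iter=1000), X_demo, y_demo,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="accuracy")

# One row per training size, one column per fold: average over folds
train_mean = train_scores.mean(axis=1)
valid_mean = valid_scores.mean(axis=1)
for n, tr, va in zip(sizes, train_mean, valid_mean):
    print("n_train=%d  train=%.3f  validation=%.3f" % (n, tr, va))
```

Plotting `train_mean` and `valid_mean` against `sizes` gives the learning curve. If the validation score is still rising at the largest size, more data is likely to help; if both curves have flattened close to each other, more data will probably change little.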

In [None]:
from sklearn.model_selection import learning_curve

# TODO

### Questions

- Can you do better using bigrams? Use the parameter `ngram_range=(1, 2)` in `CountVectorizer`.
- Compare the learning curves obtained with single words and with bigrams.
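To see what `ngram_range=(1, 2)` changes, the sketch below vectorizes two made-up texts with and without bigrams and lists the resulting features. The corpus and the names `toy_texts`, `unigram_features`, `bigram_features` are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up texts, for illustration only
toy_texts = ['not good', 'good movie']

# Single words only
unigram_vect = CountVectorizer(ngram_range=(1, 1)).fit(toy_texts)
unigram_features = sorted(unigram_vect.vocabulary_)
print(unigram_features)

# Single words plus pairs of consecutive words
bigram_vect = CountVectorizer(ngram_range=(1, 2)).fit(toy_texts)
bigram_features = sorted(bigram_vect.vocabulary_)
print(bigram_features)
```

With bigrams, `not good` becomes a feature of its own, which lets a linear model treat it differently from `good` alone. The price is a much larger vocabulary (5 features instead of 3 even on this tiny corpus), so the model has more parameters to estimate from the same number of documents.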

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.feature_extraction.text import TfidfTransformer

clf = Pipeline([
    ('vect', CountVectorizer(max_df=0.75, ngram_range=(1, 2),
                             analyzer='word', stop_words=None)),
    ('nb', MultinomialNB())
])

# TODO