# Text categorization with logistic regression and other methods

TEM501 - Text Mining

## Outline

In this document, I explain how to do text categorization with logistic regression and other algorithms using the scikit-learn toolkit.

We will learn:

- How to implement the `sigmoid` function
- How to use scikit-learn for feature extraction and to train a text categorization model
- How to output probabilities for a prediction
- How to get the top features with the highest weights in a trained logistic regression model
- How to try other text categorization methods (such as SVM)

We use the sentiment data in this document.

## Implementation of sigmoid function

In this exercise, implement the `sigmoid` function. Recall that the sigmoid function is defined as follows.

$$
\sigma(z)=\frac{1}{1+e^{-z}}
$$

In [1]:
import numpy as np

# z can be np.ndarray, np.matrix, or scalar
def sigmoid(z):
    pass
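
To check your answer, here is one possible implementation. It is a minimal sketch that assumes NumPy is available and converts the input with `np.asarray`, so scalars, arrays, and `np.matrix` inputs are all handled element-wise. It does not guard against overflow warnings for very large negative inputs.

```python
import numpy as np


def sigmoid(z):
    """Element-wise logistic function 1 / (1 + exp(-z))."""
    z = np.asarray(z, dtype=float)  # accepts scalars, lists, arrays, matrices
    return 1.0 / (1.0 + np.exp(-z))


print(sigmoid(0))  # 0.5
print(sigmoid(np.array([-2.0, 0.0, 2.0])))
```

A useful sanity check is the symmetry $\sigma(-z) = 1 - \sigma(z)$, so `sigmoid(-2) + sigmoid(2)` should equal 1.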


## Data

We use the [sentence polarity dataset v1.0](http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.README.1.0.txt) from the [Movie Review Data](http://www.cs.cornell.edu/people/pabo/movie-review-data/) of Bo Pang and Lillian Lee. The data file is at [./data/sentiment.txt](./data/sentiment.txt).

Each line in the file is a review that has already been tokenized into words. Each review has a label (+1 for a positive review and -1 for a negative review).


## Loading data

We will load the data into a list of tuples $(d, c)$, in which $d$ denotes a document and $c$ denotes the label of the document. We define the function `load_data` as follows.

In [2]:
import re


def load_data(file_path):
    data = []
    with open(file_path, "r", encoding="utf-8", errors="ignore") as f:
        for line in f:
            line = line.strip()
            if line == "":
                continue
            match = re.search(r"(\+1|-1)[\s\t]+(.+)$", line)  # match the line +1 ...
            if match:
                lb = match.group(1)
                sentence = match.group(2)
                if sentence == "":
                    continue
                data.append((sentence,lb))
    return data
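
To see what the regular expression in `load_data` does to a single line, the sketch below applies the same pattern to a sample. The helper name `parse_line` and the sample line (including its tab separator) are made up for illustration; the exact layout of the data file may differ.

```python
import re

# Same pattern as in load_data: a label, whitespace, then the sentence.
LINE_PATTERN = re.compile(r"(\+1|-1)[\s\t]+(.+)$")


def parse_line(line):
    """Return (sentence, label) for a labeled line, or None if it does not match."""
    match = LINE_PATTERN.search(line.strip())
    if match is None:
        return None
    return match.group(2), match.group(1)


sample = "+1\ta thoughtful , provocative , insistently humanizing film ."
print(parse_line(sample))
print(parse_line("no label here"))  # None
```

Lines without a recognizable label are skipped silently by `load_data`, which is why it is worth comparing the number of loaded examples with the number of lines in the file.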
            

We will use the above function to load sentiment data.

In [3]:
DATA_PATH = "./data/sentiment.txt"
data = load_data(DATA_PATH)

print("# Loaded {} examples".format(len(data)))

# Loaded 10662 examples


We also split the data into training and test sets.

In [4]:
import random
from sklearn.model_selection import train_test_split

data = load_data(DATA_PATH)
docs, labels = zip(*data)

train_docs, test_docs, train_labels, test_labels = train_test_split(docs, labels,
                                                                   test_size=0.2,
                                                                   random_state=1337)
print("Training reviews: {}".format(len(train_docs)))
print("Test reviews: {}".format(len(test_docs)))

# Let's see some positive and negative documents in test data.
posi_docs = []
neg_docs = []
for d, lb in zip(test_docs, test_labels):
    if lb == "+1":
        posi_docs.append(d)
    else:
        neg_docs.append(d)

print("Random positive review")
print(random.choice(posi_docs))
print("Random negative review")
print(random.choice(neg_docs))

Training reviews: 8529
Test reviews: 2133
Random positive review
if you can tolerate the redneck-versus-blueblood cliches that the film trades in , sweet home alabama is diverting in the manner of jeff foxworthy's stand-up act .
Random negative review
what's next ? rob schneider , dana carvey and sarah michelle gellar in the philadelphia story ? david spade as citizen kane ?


## Using scikit-learn for feature extraction

We can use scikit-learn for [feature extraction](http://scikit-learn.org/stable/modules/feature_extraction.html). We use the bag-of-words representation for feature extraction. In scikit-learn, we can use `CountVectorizer` or `TfidfTransformer`.

### Feature extraction with CountVectorizer

We will use the class `CountVectorizer` for feature extraction. Since the data was already tokenized, we do not need to pass the `tokenizer` or `token_pattern` arguments.

In [5]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(
                             binary=True,   # Use binary features
                             stop_words="english"
                            ) 
vectorizer

CountVectorizer(analyzer='word', binary=True, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

Now, we fit the vectorizer object on the training data.

In [6]:
X_train = vectorizer.fit_transform(train_docs)

We can use the analyzer of the `vectorizer` to get the BoW tokens of a sentence.

In [7]:
analyze = vectorizer.build_analyzer()
analyze("This is a text document to analyze.")

['text', 'document', 'analyze']
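
To make binary bag-of-words features concrete, here is a plain-Python sketch. It is a simplified illustration of the idea, not how `CountVectorizer` is implemented; the helper names `build_vocabulary` and `binary_bow` and the toy documents are hypothetical.

```python
def build_vocabulary(token_lists):
    """Map each distinct token to a column index, in sorted order."""
    vocab = sorted({tok for tokens in token_lists for tok in tokens})
    return {tok: i for i, tok in enumerate(vocab)}


def binary_bow(tokens, vocabulary):
    """Return a 0/1 vector with a 1 for every vocabulary word present in tokens."""
    vector = [0] * len(vocabulary)
    for tok in tokens:
        idx = vocabulary.get(tok)
        if idx is not None:  # out-of-vocabulary tokens are ignored
            vector[idx] = 1
    return vector


train_tokens = [["text", "document", "analyze"], ["text", "mining", "text"]]
vocabulary = build_vocabulary(train_tokens)
print(vocabulary)
print(binary_bow(["text", "text", "unseen"], vocabulary))  # [0, 0, 0, 1]
```

With binary features, a word that occurs twice counts the same as a word that occurs once, and words not seen during fitting contribute nothing.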

## Text categorization with logistic regression

Now let's try text categorization with the [logistic regression implementation](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) in scikit-learn. See the [user guide](http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression) for more details.

In [8]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

Now, we fit the model on the training data.

In [9]:
clf.fit(X_train, train_labels)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

### Evaluation on test set

Now let's evaluate the model on the test data.

In [10]:
X_test = vectorizer.transform(test_docs)
test_preds = clf.predict(X_test)

In [11]:
from sklearn import metrics

accuracy = metrics.accuracy_score(test_labels, test_preds)
print("# Test accuracy: {}".format(accuracy))

# Test accuracy: 0.7463666197843413
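
Accuracy is simply the fraction of test reviews whose predicted label equals the gold label. The sketch below computes it by hand on made-up labels; the function name `accuracy_by_hand` is hypothetical and the data is for illustration only.

```python
def accuracy_by_hand(gold_labels, predicted_labels):
    """Fraction of positions where the prediction equals the gold label."""
    if len(gold_labels) != len(predicted_labels):
        raise ValueError("gold and predicted labels must have the same length")
    if not gold_labels:
        raise ValueError("cannot compute accuracy on an empty list")
    correct = sum(g == p for g, p in zip(gold_labels, predicted_labels))
    return correct / len(gold_labels)


gold = ["+1", "-1", "+1", "-1"]
pred = ["+1", "-1", "-1", "-1"]
print(accuracy_by_hand(gold, pred))  # 0.75
```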


We can predict the label for an input review.

In [12]:
example = "a thoughtful , provocative , insistently humanizing film ."
test_x = vectorizer.transform([example])
print("Predicted class: {}".format(clf.predict(test_x)))

Predicted class: ['+1']


### Get prediction probabilities

In some cases, we would like to get prediction probabilities. For instance, in the course project, we would like to rank images in descending order of the probability that an image shows a flooding event.

We can do that by using the method `predict_proba`.

In [13]:
clf.predict_proba(test_x)

array([[0.77033746, 0.22966254]])

The columns follow the order of `clf.classes_`, which is `['+1', '-1']` here. The first value is therefore the probability that the instance belongs to the class "+1", and the second value is the probability that it belongs to the class "-1".

Let's try a negative review.

In [14]:
example2 = "for all its surface frenzy , high crimes should be charged with loitering -- so much on view , so little to offer ."
test_x2 = vectorizer.transform([example2])
clf.predict_proba(test_x2)

array([[0.27721876, 0.72278124]])
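
These probabilities tie back to the sigmoid function. For a binary model, logistic regression computes a linear score $w \cdot x + b$ and passes it through the sigmoid to get the probability of the second class in `clf.classes_`; the first column is its complement. The sketch below reproduces that computation with made-up weights, which are hypothetical values and not those of the trained model.

```python
import numpy as np


def predict_proba_binary(x, weights, bias):
    """Return [P(first class), P(second class)] for one feature vector."""
    score = float(np.dot(weights, x) + bias)   # linear score w.x + b
    p_second = 1.0 / (1.0 + np.exp(-score))    # sigmoid of the score
    return np.array([1.0 - p_second, p_second])


weights = np.array([1.5, -2.0, 0.5])  # hypothetical feature weights
bias = -0.25                          # hypothetical intercept
x = np.array([1, 0, 1])               # binary BoW vector of one document

proba = predict_proba_binary(x, weights, bias)
print(proba)
```

On a trained model, the corresponding values are stored in `clf.coef_` and `clf.intercept_`.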

We can combine probability values with a threshold $t$ to customize our prediction. For instance, we can decide that the prediction is "+1" only if the probability of "+1" is greater than 0.6 instead of 0.5.
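
A minimal sketch of such a thresholded decision follows. The function name `predict_with_threshold` is hypothetical, and the class order is assumed to be `("+1", "-1")`, matching the probability columns shown above. The first two rows are the probabilities printed earlier; the third is a made-up borderline case.

```python
def predict_with_threshold(proba_row, classes=("+1", "-1"), threshold=0.6):
    """Predict classes[0] only if its probability exceeds the threshold."""
    p_first = proba_row[0]
    return classes[0] if p_first > threshold else classes[1]


print(predict_with_threshold([0.77033746, 0.22966254]))  # +1
print(predict_with_threshold([0.27721876, 0.72278124]))  # -1
print(predict_with_threshold([0.55, 0.45]))              # -1 (would be +1 at t = 0.5)
```

Raising the threshold makes "+1" predictions rarer but more reliable, which trades recall for precision on the positive class.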

## Get top features with the highest weights

In this section, we look at the features with the highest weights.

First, we get all feature names from the vectorizer and define `target_names`.

In [15]:
feature_names = vectorizer.get_feature_names_out()  # on scikit-learn < 1.0, use get_feature_names()
target_names = ["+1", "-1"]
print(len(clf.coef_), clf.coef_)

1 [[ 0.1401822   0.13881197  0.2241502  ...  0.15164946 -0.01178921
   0.23204102]]


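Now we sort the weights and print the features with the largest ones. For a binary problem, `clf.coef_` has a single row whose positive weights favor `clf.classes_[1]`, which is "-1" here, so the top features are the strongest indicators of a negative review.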

In [16]:
import numpy as np

topN = 50
print("top {} keywords:".format(topN))
top_idx = np.argsort(clf.coef_[0])[-topN:]  # indices of the topN largest weights
top_features = [ feature_names[i] for i in top_idx ]
print(" ".join(top_features))

top 50 keywords:
mess woody repetitive episode indulgent pretentious attempts superficial tedious merit exhausting unfunny excuse seagal bland lack jokes ill thinks advice save junk stunt supposed pie product badly bad unless numbers generic disguise plodding devoid tries incoherent flat bore tv busy wasn routine mildly fails mediocre waste intentions worst boring dull
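
The most negative weights are just as informative: they mark the features that push a prediction toward the other class. The sketch below uses the same `np.argsort` idea to read both ends of the sorted weights. The function name `top_features_both_ends`, the feature names, and the weights are all made up for illustration.

```python
import numpy as np


def top_features_both_ends(weights, feature_names, n=3):
    """Return the n features with the lowest and the n with the highest weights."""
    order = np.argsort(weights)  # indices in ascending order of weight
    lowest = [feature_names[i] for i in order[:n]]
    highest = [feature_names[i] for i in order[::-1][:n]]
    return lowest, highest


toy_names = ["boring", "dull", "film", "moving", "touching", "waste"]
toy_weights = np.array([1.2, 1.5, 0.0, -1.1, -1.4, 1.3])  # hypothetical weights

lowest, highest = top_features_both_ends(toy_weights, toy_names, n=2)
print("most negative weights:", lowest)   # ['touching', 'moving']
print("most positive weights:", highest)  # ['dull', 'waste']
```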


## Try with n-gram features

Now we would like to use unigram and bigram features in feature extraction.

In [17]:
vectorizer = CountVectorizer(binary=True,
                             stop_words="english",
                             ngram_range=(1,2),
                            ) 
X_train = vectorizer.fit_transform(train_docs)

clf.fit(X_train, train_labels)

X_test = vectorizer.transform(test_docs)
test_preds = clf.predict(X_test)

accuracy = metrics.accuracy_score(test_labels, test_preds)
print("# Test accuracy: {}".format(accuracy))

# Test accuracy: 0.7430848570089077
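
With `ngram_range=(1,2)`, every single word and every pair of adjacent words becomes a feature. The plain-Python sketch below shows which features this produces for a short token list; the function name `word_ngrams` is hypothetical and the code only illustrates the idea.

```python
def word_ngrams(tokens, min_n=1, max_n=2):
    """Return all word n-grams with min_n <= n <= max_n, joined by spaces."""
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


print(word_ngrams(["text", "document", "analyze"]))
# ['text', 'document', 'analyze', 'text document', 'document analyze']
```

Bigrams enlarge the vocabulary considerably, and most of them are rare, which is one possible reason why they do not improve accuracy here.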


## Try with tf-idf term weighting

Now we use tf-idf term weighting for feature extraction.

In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                             stop_words='english')
X_train = vectorizer.fit_transform(train_docs)

clf.fit(X_train, train_labels)

X_test = vectorizer.transform(test_docs)
test_preds = clf.predict(X_test)

accuracy = metrics.accuracy_score(test_labels, test_preds)
print("# Test accuracy: {}".format(accuracy))

# Test accuracy: 0.7374589779653071
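
The settings above change how each term is weighted. With `sublinear_tf=True`, the term frequency $tf$ is replaced by $1 + \ln(tf)$, and with the default smoothed idf, a term that appears in $df$ of $n$ documents gets $\ln\frac{1+n}{1+df} + 1$. Each document vector is then scaled to unit L2 norm. The sketch below works through this arithmetic; the function names and the document frequencies are hypothetical, and only the number of training reviews comes from the output above.

```python
import math


def tfidf_weight(tf, df, n_docs):
    """Sublinear tf times smoothed idf, before length normalization."""
    if tf <= 0:
        return 0.0
    sublinear_tf = 1.0 + math.log(tf)
    smoothed_idf = math.log((1 + n_docs) / (1 + df)) + 1.0
    return sublinear_tf * smoothed_idf


def l2_normalize(weights):
    """Scale a weight vector to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights] if norm > 0 else list(weights)


n_docs = 8529  # number of training reviews
doc = [
    tfidf_weight(tf=1, df=10, n_docs=n_docs),    # rare term (hypothetical df)
    tfidf_weight(tf=3, df=2000, n_docs=n_docs),  # frequent term (hypothetical df)
]
print(l2_normalize(doc))
```

In this example the rare term ends up with the larger weight even though it occurs only once in the document.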


## Using SVM for text categorization

In this section, we would like to use SVM for text categorization.

In [19]:
from sklearn.svm import LinearSVC

vectorizer = CountVectorizer(binary=True,
                             stop_words="english",
                             # ngram_range=(1,2),
                            ) 

clf = LinearSVC(loss='squared_hinge', penalty="l2",
                dual=False, tol=1e-3)

X_train = vectorizer.fit_transform(train_docs)

clf.fit(X_train, train_labels)

X_test = vectorizer.transform(test_docs)
test_preds = clf.predict(X_test)

accuracy = metrics.accuracy_score(test_labels, test_preds)
print("# Test accuracy: {}".format(accuracy))


# Test accuracy: 0.720112517580872
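
Unlike `LogisticRegression`, `LinearSVC` has no `predict_proba` method. It exposes `decision_function`, a signed score $w \cdot x + b$ whose sign decides the class (a positive score selects the second entry of `clf.classes_`). The sketch below shows that score and the squared hinge loss that the model above minimizes, without the regularization term. All weights and data are made up, and labels are encoded as $-1$/$+1$ for the loss.

```python
import numpy as np


def decision_scores(X, weights, bias):
    """Signed distance-like score w.x + b for each row of X."""
    return X @ weights + bias


def squared_hinge_loss(scores, y):
    """Mean of max(0, 1 - y * score)^2, with y encoded as -1 or +1."""
    margins = np.maximum(0.0, 1.0 - y * scores)
    return float(np.mean(margins ** 2))


X = np.array([[1, 0, 1], [0, 1, 0]])   # two binary BoW vectors
weights = np.array([0.8, -1.2, 0.4])   # hypothetical weights
bias = 0.1                             # hypothetical intercept

scores = decision_scores(X, weights, bias)
print(scores)
print(squared_hinge_loss(scores, np.array([1, -1])))  # both margins satisfied
print(squared_hinge_loss(scores, np.array([1, 1])))   # second example violated
```

The absolute value of the score can still be used for ranking, but it is not a probability.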


## Exercises

1. Try different text categorization algorithms in scikit-learn, such as `KNeighborsClassifier` and `DecisionTreeClassifier`, on the sentiment data.
2. Load the development data of the course project, then extract BoW features for description, title, and user tags in eac