# Text classification (CAP5602 Lecture 14)

In this demo, we classify text from the [20 newsgroups dataset](http://qwone.com/~jason/20Newsgroups/). The code is adapted from the scikit-learn tutorial on working with text data:
*   [https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html)

## 1. Download and load data

We use the [fetch_20newsgroups](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html) function from sklearn to download the data and load it into memory. For this experiment we use only three of the 20 classes: *rec.motorcycles*, *comp.graphics*, and *sci.med*.

In [None]:
from sklearn.datasets import fetch_20newsgroups

categories = ['rec.motorcycles', 'comp.graphics', 'sci.med']

twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)

## 2. Print an example and its label

In [None]:
idx = 6  # Index of the example to inspect (named idx to avoid shadowing the built-in id)
label = twenty_train.target[idx]

print(twenty_train.data[idx]) # Print the input text
print(twenty_train.target_names[label]) # Print the label name

## 3. Count the n-gram tokens

Next, we use the [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) class, which pre-processes the text, tokenizes it, and counts the n-gram tokens in a single step.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Create a CountVectorizer object that counts unigrams and bigrams
count_vect = CountVectorizer(ngram_range=(1, 2))

# Learn the n-gram vocabulary from the training data (fit) and convert the same data into count vectors (transform)
X_train_counts = count_vect.fit_transform(twenty_train.data)

print(X_train_counts.shape)

## 4. Convert count matrix to Tf-idf matrix

To do this, we use the [TfidfTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html) with its default parameters, which apply smoothed idf weighting and then scale each document vector to unit L2 norm.

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

print(X_train_tfidf.shape)
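
As a sanity check on what the default `TfidfTransformer` computes, the sketch below reproduces it by hand on a small, made-up count matrix. It assumes the default settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), under which the idf of a term is `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` is the number of documents containing the term.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfTransformer

# Toy count matrix (hypothetical): 3 documents x 3 terms
toy_counts = np.array([[3, 0, 1],
                       [2, 0, 0],
                       [0, 1, 1]])

n_docs = toy_counts.shape[0]
doc_freq = (toy_counts > 0).sum(axis=0)          # Documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1  # Smoothed idf

manual = toy_counts * idf                                        # tf * idf
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)  # L2-normalize each row

# Compare with scikit-learn on the same counts (as a sparse matrix, like CountVectorizer output)
from_sklearn = TfidfTransformer().fit_transform(csr_matrix(toy_counts)).toarray()
matches = np.allclose(manual, from_sklearn)
print(matches)
```

Terms that occur in fewer documents get a larger idf, so they contribute more to the document vector than terms that occur almost everywhere.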

## 5. Train a classifier

With the Tf-idf matrix as the feature matrix, we can train a classifier as usual. Here we use a logistic regression model with sklearn's default settings.

In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train_tfidf, twenty_train.target)

## 6. Evaluate the trained classifier

To evaluate the trained classifier, we fetch the test set and transform it into a Tf-idf matrix using the `count_vect` and `tfidf_transformer` objects fitted above. Note that at test time we only call `transform`; we do not fit these objects again, so the test data is represented with the vocabulary and idf weights learned from the training data. We then make predictions from the Tf-idf matrix and compute the accuracy as usual.

In [None]:
from sklearn.metrics import accuracy_score

# Fetch test data
twenty_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)

# Transform test data into count matrix and then Tf-idf matrix
X_test_counts = count_vect.transform(twenty_test.data)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

# Make predictions on the Tf-idf matrix and compute accuracy
Y_pred = model.predict(X_test_tfidf)
acc = accuracy_score(twenty_test.target, Y_pred)
print('Accuracy on test set:', acc)
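
The sketch below illustrates, on made-up documents, why the fitted vectorizer is only transformed at test time: the test matrix must have the same columns as the training matrix, and tokens that never appeared in the training data are simply ignored. All names and documents here are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy data (hypothetical): "helmet" never appears in the training documents
train_docs = ["engine oil change", "pixel shader code"]
test_docs = ["engine shader helmet"]

vect = CountVectorizer()
X_train = vect.fit_transform(train_docs)  # Learns the vocabulary from the training data
X_test = vect.transform(test_docs)        # Reuses that vocabulary; nothing is refit

print(X_train.shape, X_test.shape)        # Same number of columns in both matrices
print(sorted(vect.vocabulary_))           # "helmet" is not part of the vocabulary
print(X_test.toarray())                   # Only "engine" and "shader" are counted
```

Calling `fit_transform` on the test documents instead would build a different vocabulary, and the resulting columns would no longer line up with the features the classifier was trained on.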

## 7. Using a pipeline

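Sklearn allows us to combine all the processing steps (counting, transforming to Tf-idf, and classifying) into a single pipeline object. Calling `fit` on the pipeline fits every step on the training data, and calling `predict` applies the fitted transformations to the input before classifying it.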

In [None]:
from sklearn.pipeline import Pipeline

# Create a pipeline object that combines the processing steps; the step names are arbitrary
text_clf = Pipeline([
    ('vect', CountVectorizer(ngram_range=(1, 2))),
    ('tfidf', TfidfTransformer()),
    ('clf', LogisticRegression()),
])

# Train the pipeline
text_clf.fit(twenty_train.data, twenty_train.target)

# Predict with the pipeline
Y_pred = text_clf.predict(twenty_test.data)

# Compute accuracy
acc = accuracy_score(twenty_test.target, Y_pred)
print('Accuracy on test set:', acc)
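
To check that the pipeline is only a convenience wrapper, the following sketch compares it with the manual steps on a tiny made-up dataset. The documents, labels, and variable names are assumptions for illustration; the fitted steps are accessed through the pipeline's `named_steps` attribute.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy data (hypothetical): 0 = motorcycles, 1 = graphics
docs = ["new helmet for my bike", "bike engine oil",
        "render the image", "image pixel shader"]
labels = [0, 0, 1, 1]
new_docs = ["bike helmet", "pixel image"]

# Manual steps: fit each object on the training data, then only transform new data
vect = CountVectorizer()
tfidf = TfidfTransformer()
clf = LogisticRegression()
clf.fit(tfidf.fit_transform(vect.fit_transform(docs)), labels)
manual_pred = clf.predict(tfidf.transform(vect.transform(new_docs)))

# The same steps wrapped in a pipeline
pipe = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', LogisticRegression()),
])
pipe.fit(docs, labels)
pipe_pred = pipe.predict(new_docs)

same = np.array_equal(manual_pred, pipe_pred)
print(same)

# Individual fitted steps remain accessible by name
vocab_size = len(pipe.named_steps['vect'].vocabulary_)
print(vocab_size)
```

Besides being shorter, the pipeline makes it harder to fit a transformer on test data by accident, because `predict` never calls `fit` on any step.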