Pete Schultz

This is a text classification task. Every document (a line in the data file) is a movie review from IMDB. Your goal is to classify each document into ONE of two categories, based on its sentiment: 1 if the review is positive; 0 if the review is negative.

The training data contains 10,000 reviews, each already labeled with one of the above categories. The test data contains 5,000 unlabeled reviews. The submission should be a .csv (comma-separated values) file with a header line "Id,Category" followed by exactly 5,000 lines. Each line should contain exactly two integers, separated by a comma. The first integer is the line ID of a test review (0 - 5,000), and the second integer is the category your classifier predicts (0 or 1).
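The format rules above can be checked before uploading. The following is a minimal sketch using only the standard library; the function name `validate_submission` and its `expected_rows` parameter are hypothetical helpers, not part of the assignment, and the check covers only the rules stated above (header, row count, two integers per line, category in {0, 1}).

```python
import csv
import io


def validate_submission(csv_text, expected_rows=5000):
    """Return a list of problems found in a submission; an empty list means it looks valid."""
    problems = []
    reader = csv.reader(io.StringIO(csv_text))

    # The first line must be exactly the header "Id,Category".
    header = next(reader, None)
    if header != ["Id", "Category"]:
        problems.append(f"unexpected header: {header}")

    # Ignore completely blank lines (for example a trailing newline).
    rows = [row for row in reader if row]
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")

    # Data rows start on line 2 of the file, right after the header.
    for line_no, row in enumerate(rows, start=2):
        if len(row) != 2 or not row[0].strip().isdigit():
            problems.append(f"line {line_no}: expected two integers, got {row}")
        elif row[1].strip() not in ("0", "1"):
            problems.append(f"line {line_no}: category must be 0 or 1, got {row[1]}")
    return problems


good = "Id,Category\n0,1\n1,0\n2,1\n"
bad = "Id,Category\n0,1\n1,2\n"
print(validate_submission(good, expected_rows=3))
print(validate_submission(bad, expected_rows=3))
```

For a real submission, the file contents could be read with `open(path).read()` and passed in with the default `expected_rows=5000`.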

In [1]:
import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

In [2]:
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

In [3]:
y_train = train.label

Feature Vector

In [4]:
# unigram
unigram_vect = CountVectorizer(ngram_range=(1, 1))
unigram_vect.fit(train.text)

X_train_unigram = unigram_vect.transform(train.text)

In [5]:
# bigram: ngram_range=(1, 2) keeps unigrams as well as bigrams
bigram_vect = CountVectorizer(ngram_range=(1, 2))
bigram_vect.fit(train.text)

X_train_bigram = bigram_vect.transform(train.text)

In [6]:
# trigram: ngram_range=(1, 3) keeps unigrams, bigrams and trigrams
trigram_vect = CountVectorizer(ngram_range=(1, 3))
trigram_vect.fit(train.text)

X_train_trigram = trigram_vect.transform(train.text)

In [7]:
# quadrigram: ngram_range=(1, 4) keeps all n-grams from length 1 to 4
quadrigram_vect = CountVectorizer(ngram_range=(1, 4))
quadrigram_vect.fit(train.text)

X_train_quadrigram = quadrigram_vect.transform(train.text)
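To make the effect of `ngram_range` concrete, here is a small pure-Python sketch of word n-gram extraction. It is an illustration only: `word_ngrams` is a hypothetical helper that splits on whitespace, whereas `CountVectorizer` uses its own tokenizer (which, by default, also drops single-character tokens), so the real vocabulary will differ in detail.

```python
def word_ngrams(text, ngram_range=(1, 2)):
    """List every word n-gram of `text` for each n in the inclusive ngram_range."""
    tokens = text.lower().split()  # simplified tokenization (assumption)
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the text.
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


print(word_ngrams("not good at all", (1, 2)))
# ['not', 'good', 'at', 'all', 'not good', 'good at', 'at all']
```

The bigram "not good" shows why longer n-grams can help with sentiment: the unigram "good" alone points the wrong way. The cost is a feature space that grows quickly with the upper bound of the range.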

In [8]:
# unigram tf idf
unigram_tf_idf_trans = TfidfTransformer()
unigram_tf_idf_trans.fit(X_train_unigram)

X_train_unigram_tf_idf = unigram_tf_idf_trans.transform(X_train_unigram)

In [9]:
# bigram tf idf
bigram_tf_idf_trans = TfidfTransformer()
bigram_tf_idf_trans.fit(X_train_bigram)

X_train_bigram_tf_idf = bigram_tf_idf_trans.transform(X_train_bigram)

In [10]:
# trigram tf idf
trigram_tf_idf_trans = TfidfTransformer()
trigram_tf_idf_trans.fit(X_train_trigram)

X_train_trigram_tf_idf = trigram_tf_idf_trans.transform(X_train_trigram)

In [11]:
# quadrigram tf idf
quadrigram_tf_idf_trans = TfidfTransformer()
quadrigram_tf_idf_trans.fit(X_train_quadrigram)

X_train_quadrigram_tf_idf = quadrigram_tf_idf_trans.transform(X_train_quadrigram)
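For intuition about what the tf-idf step does to the count matrices, the sketch below reimplements the weighting in plain Python, assuming the `TfidfTransformer` defaults (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`): idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The function `tfidf_rows` and its dict-based input are hypothetical and meant for toy data only.

```python
import math


def tfidf_rows(count_rows):
    """count_rows: one dict per document, mapping term -> raw count."""
    n_docs = len(count_rows)

    # Document frequency: in how many documents does each term appear?
    df = {}
    for row in count_rows:
        for term in row:
            df[term] = df.get(term, 0) + 1

    # Smoothed inverse document frequency (assumed scikit-learn default).
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in df.items()}

    result = []
    for row in count_rows:
        weighted = {t: count * idf[t] for t, count in row.items()}
        # L2-normalize so every non-empty document vector has unit length.
        norm = math.sqrt(sum(w * w for w in weighted.values()))
        result.append({t: w / norm for t, w in weighted.items()} if norm else {})
    return result


docs = [{"good": 1, "movie": 1}, {"bad": 1, "movie": 1}, {"movie": 2}]
for row in tfidf_rows(docs):
    print(row)
```

In this toy corpus "movie" occurs in every document, so it receives the lowest idf and ends up with a smaller weight than "good" or "bad" in the rows where both appear.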

Test Performance

In [12]:
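def calc_scores(kind, X, y):
    """Fit an SGDClassifier on a split of the labeled data and print its accuracy."""
    # Hold out 25% of the labeled data (the train_test_split default) for evaluation.
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=142)

    clf = SGDClassifier()
    clf.fit(X_train, y_train)
    # score() returns mean accuracy on the given data.
    train_score = clf.score(X_train, y_train)
    test_score = clf.score(X_test, y_test)
    print(f'{kind}\n')
    print(f'Train score: {round(train_score, 2)}\nTest score: {round(test_score, 3)}\n')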

In [13]:
calc_scores('Unigram', X_train_unigram, y_train)
calc_scores('Bigram', X_train_bigram, y_train)
calc_scores('Trigram', X_train_trigram, y_train)
calc_scores('Quadrigram', X_train_quadrigram, y_train)
calc_scores('Unigram Tf-Idf', X_train_unigram_tf_idf, y_train)
calc_scores('Bigram Tf-Idf', X_train_bigram_tf_idf, y_train)
calc_scores('Trigram Tf-Idf', X_train_trigram_tf_idf, y_train)
calc_scores('Quadrigram Tf-Idf', X_train_quadrigram_tf_idf, y_train)

Unigram

Train score: 1.0
Test score: 0.838

Bigram

Train score: 1.0
Test score: 0.856

Trigram

Train score: 1.0
Test score: 0.854

Quadrigram

Train score: 1.0
Test score: 0.846

Unigram Tf-Idf

Train score: 0.98
Test score: 0.87

Bigram Tf-Idf

Train score: 1.0
Test score: 0.886

Trigram Tf-Idf

Train score: 1.0
Test score: 0.88

Quadrigram Tf-Idf

Train score: 1.0
Test score: 0.873



Best: Bigram Tf-Idf
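The gaps between the top configurations are small (0.886 vs. 0.88), and they come from a single split with features fit on all labeled text beforehand. One way to double-check the choice, sketched here as an assumption rather than something the notebook does, is k-fold cross-validation with the vectorizer, tf-idf transform and classifier wrapped in one scikit-learn pipeline, so that the vocabulary and idf weights are refit inside each fold. The names `build_bigram_tfidf_pipeline` and `cv_accuracy` are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline


def build_bigram_tfidf_pipeline(seed=0):
    """Unigram + bigram counts -> tf-idf -> linear classifier trained with SGD."""
    return make_pipeline(
        CountVectorizer(ngram_range=(1, 2)),
        TfidfTransformer(),
        SGDClassifier(random_state=seed),  # fixed seed for repeatable scores
    )


def cv_accuracy(texts, labels, folds=5):
    """Return one accuracy score per fold; features are refit on each training fold."""
    pipeline = build_bigram_tfidf_pipeline()
    return cross_val_score(pipeline, texts, labels, cv=folds, scoring="accuracy")


# Possible usage with the notebook's data (not run here):
# scores = cv_accuracy(train.text, train.label)
# print(scores.mean(), scores.std())
```

The spread of the per-fold scores gives a rough sense of whether a difference of a few tenths of a percentage point between configurations is meaningful.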

In [14]:
X_test_bigram = bigram_vect.transform(test.text)
X_test_bigram_tf_idf = bigram_tf_idf_trans.transform(X_test_bigram)

In [15]:
clf = SGDClassifier()
clf.fit(X_train_bigram_tf_idf, y_train)

SGDClassifier(alpha=0.0001, average=False, class_weight=None,
              early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
              l1_ratio=0.15, learning_rate='optimal', loss='hinge',
              max_iter=1000, n_iter_no_change=5, n_jobs=None, penalty='l2',
              power_t=0.5, random_state=None, shuffle=True, tol=0.001,
              validation_fraction=0.1, verbose=0, warm_start=False)

In [16]:
predictions = clf.predict(X_test_bigram_tf_idf)
pd.DataFrame({"Id": test.Id, "Category": predictions}).to_csv("pete_kaggle.csv", index=False)