# Introduction to text classification

We will illustrate basic text classification approaches on the [SMS Spam Collection](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection) data set.
The SMS Spam Collection is a set of tagged SMS messages collected for SMS spam research. It contains 5,574 messages in English, each tagged as either ham (legitimate) or spam.

In [1]:
import pandas as pd
df = pd.read_csv('../data/spam.csv', sep='\t')
df.head()

Unnamed: 0,Target,Text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


## Preparation of train and test data sets

Separate the texts from the labels and encode the labels as integers: ham becomes 1 and spam becomes 0.

In [2]:
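data = df['Text']
# Encode the labels as integers: ham -> 1, spam -> 0.
target = df['Target'].map({'ham': 1, 'spam': 0})
# Class names indexed by label value (0 -> spam, 1 -> ham).
names = ['spam', 'ham']
print(data[:5])
print(target[:5])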

0    Go until jurong point, crazy.. Available only ...
1                        Ok lar... Joking wif u oni...
2    Free entry in 2 a wkly comp to win FA Cup fina...
3    U dun say so early hor... U c already then say...
4    Nah I don't think he goes to usf, he lives aro...
Name: Text, dtype: object
0    1
1    1
2    0
3    1
4    1
Name: Target, dtype: int64


Shuffle the data and split it into train and test parts.

In [3]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2)
print('Train size: {}'.format(len(X_train)))
print('Test size: {}'.format(len(X_test)))

Train size: 4457
Test size: 1115
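
The split above is random, so the metrics reported later will vary slightly from run to run. Because spam is the minority class, one option is to fix the seed and stratify by label. The following is a minimal sketch on synthetic labels; the `random_state` value of 42 and the 80/20 toy class ratio are assumptions made for illustration and do not come from the notebook.

```python
from sklearn.model_selection import train_test_split

# Toy stand-in for the SMS data: 80 ham (1) and 20 spam (0) labels.
toy_texts = ['message {}'.format(i) for i in range(100)]
toy_labels = [1] * 80 + [0] * 20

# stratify keeps the class ratio the same in both parts;
# random_state (an arbitrary value here) makes the split reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_texts, toy_labels, test_size=0.2, stratify=toy_labels, random_state=42)

print('Spam share in train: {:.2f}'.format(y_tr.count(0) / len(y_tr)))
print('Spam share in test: {:.2f}'.format(y_te.count(0) / len(y_te)))
```

With stratification both parts keep the same spam share, which makes test metrics for the minority class less dependent on a lucky or unlucky shuffle.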


## Data preprocessing

Tokenize the texts. Experiment with various tokenizers from the [NLTK](http://www.nltk.org/api/nltk.tokenize.html) library.

In [4]:
from nltk.tokenize.casual import casual_tokenize

sms = data[4]
print(sms)

tokenizer = lambda text: casual_tokenize(text, preserve_case=False)

print(tokenizer(sms))

Nah I don't think he goes to usf, he lives around here though
['nah', 'i', "don't", 'think', 'he', 'goes', 'to', 'usf', ',', 'he', 'lives', 'around', 'here', 'though']
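
To make the suggested experiment concrete, the sketch below runs three regex-based NLTK tokenizers on the same message. It is only one possible comparison: the choice of `wordpunct_tokenize` and `TweetTokenizer` is an assumption, picked because neither needs additional NLTK data downloads.

```python
from nltk.tokenize import TweetTokenizer, wordpunct_tokenize
from nltk.tokenize.casual import casual_tokenize

sms_example = "Nah I don't think he goes to usf, he lives around here though"

# casual_tokenize is the tokenizer used in the notebook cell above.
casual_tokens = casual_tokenize(sms_example, preserve_case=False)
# wordpunct_tokenize splits on every punctuation run, including apostrophes.
wordpunct_tokens = wordpunct_tokenize(sms_example.lower())
# TweetTokenizer is the class behind casual_tokenize; reduce_len shortens
# runs of three or more repeated characters (for example "sooooo").
tweet_tokens = TweetTokenizer(preserve_case=False, reduce_len=True).tokenize(sms_example)

print(casual_tokens)
print(wordpunct_tokens)
print(tweet_tokens)
```

The main visible difference on this message is the contraction: the casual tokenizer keeps `don't` as one token, while `wordpunct_tokenize` breaks it into three.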


Remove stop words.

In [5]:
from nltk.corpus import stopwords
print(stopwords.words('english'))
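# Build the stop word set once instead of rebuilding it for every token.
english_stopwords = set(stopwords.words('english'))
stopword_tokenizer = lambda text: [w for w in tokenizer(text) if w not in english_stopwords]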

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [6]:
stopword_tokenizer(sms)

['nah', 'think', 'goes', 'usf', ',', 'lives', 'around', 'though']

Convert tokens to their stems. Experiment with stemmers and lemmatizers from the [NLTK](http://www.nltk.org/api/nltk.stem.html) library.

In [7]:
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

stem_tokenizer = lambda text: [stemmer.stem(w) for w in stopword_tokenizer(text)]

print(stem_tokenizer(sms))

['nah', 'think', 'goe', 'usf', ',', 'live', 'around', 'though']
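
As a starting point for experimenting with stemmers, the sketch below applies three NLTK stemmers to a few words from the sample messages. The word list and the choice of the Snowball and Lancaster stemmers are assumptions for illustration; lemmatizers are left out here because they require extra NLTK data.

```python
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

words = ['goes', 'lives', 'joking', 'available']

stemmers = {
    'porter': PorterStemmer(),
    'snowball': SnowballStemmer('english'),
    'lancaster': LancasterStemmer(),
}

# Stem every word with every stemmer so the results can be compared row by row.
stems = {name: [stemmer.stem(w) for w in words] for name, stemmer in stemmers.items()}

for name, result in stems.items():
    print('{:<10}{}'.format(name, result))
```

Stems do not have to be dictionary words (the Porter stemmer maps `goes` to `goe`, as shown above); what matters for classification is that related word forms collapse to the same feature.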


Fit a vectorizer which converts texts to count vectors.

In [8]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(tokenizer=stem_tokenizer)
vectorizer.fit(X_train)
print(vectorizer.transform([sms]))

  (0, 13)	1
  (0, 1213)	1
  (0, 2925)	1
  (0, 3791)	1
  (0, 4233)	1
  (0, 6016)	1
  (0, 6027)	1
  (0, 6343)	1
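
Each line of the sparse output reads as (row, column) followed by the count, where the column is the token's index in the fitted vocabulary. The sketch below shows the same mapping on two made-up documents that are small enough to inspect by hand; it uses the default `CountVectorizer` tokenizer rather than `stem_tokenizer`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents, not taken from the data set.
toy_docs = ['free entry free prize', 'ok see you later']

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_docs)

# vocabulary_ maps each token to its column index (tokens sorted alphabetically).
print(sorted(toy_vectorizer.vocabulary_.items(), key=lambda item: item[1]))
# Dense view: one row per document, one column per vocabulary token.
print(toy_counts.toarray())
```

In the dense view the first row has a 2 in the column for `free`, because that token occurs twice in the first document.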


Convert the count vectors to TF-IDF weights.

In [9]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer()
tfidf_transformer.fit(vectorizer.transform(X_train))
print(tfidf_transformer.transform(vectorizer.transform([sms])))

  (0, 6343)	0.45148801552291284
  (0, 6027)	0.36975100398316985
  (0, 6016)	0.2718539318849461
  (0, 4233)	0.4298916254058644
  (0, 3791)	0.3444574628661074
  (0, 2925)	0.37945962658835003
  (0, 1213)	0.3401816241603434
  (0, 13)	0.14955703816980243
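
To show where such weights come from, the sketch below recomputes TF-IDF by hand and compares the result with `TfidfTransformer`. It assumes scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), under which the inverse document frequency is `ln((1 + n) / (1 + df)) + 1` and each row is scaled to unit length. The count matrix is a made-up example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical count matrix: 3 documents (rows) and 3 tokens (columns).
counts = np.array([[2, 1, 0],
                   [0, 1, 1],
                   [0, 1, 0]])

n_docs = counts.shape[0]
# Document frequency: in how many documents each token occurs.
doc_freq = (counts > 0).sum(axis=0)
# Smoothed inverse document frequency (the scikit-learn default).
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the raw counts by idf, then L2-normalize every row.
weighted = counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

sklearn_tfidf = TfidfTransformer().fit_transform(counts).toarray()
print(np.allclose(manual_tfidf, sklearn_tfidf))
```

A token that occurs in every document, like the middle column here, gets the lowest idf, which is why very common tokens such as the comma receive small weights in the output above.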


## Classification

Train a classifier using the following models:
* [Logistic regression](http://scikit-learn.org/0.15/modules/generated/sklearn.linear_model.LogisticRegression.html)
* [Gradient Boosted Trees](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html) (experiment with different depths and numbers of trees)
* [Support Vector Machines](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC) (experiment with different kernels)

In [10]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.ensemble import GradientBoostingClassifier

clf_pipeline = Pipeline([('vec', vectorizer),
                         ('tfidf', tfidf_transformer),
                         #('lr', LogisticRegression())
                         #('gbc', GradientBoostingClassifier(n_estimators=100, max_depth=4))
                         ('svm', svm.SVC(kernel='linear'))
                        ])
clf_pipeline.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('vec', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_...,
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])
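
Instead of commenting classifiers in and out, the three models can be compared in a loop. The sketch below is one way to do that under stated assumptions: it uses a tiny made-up corpus so that it runs on its own, `TfidfVectorizer` as a shortcut for the count and TF-IDF steps, and 3-fold cross-validation, none of which is part of the notebook.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus, not taken from the SMS Spam Collection.
toy_texts = [
    'win a free prize now', 'free entry to win cash', 'claim your free reward',
    'urgent call now to win', 'cash prize waiting claim now', 'free ringtone text now',
    'see you at lunch', 'are you coming home', 'call me when you arrive',
    'ok talk to you later', 'running late see you soon', 'thanks for the lift home',
]
toy_labels = [0] * 6 + [1] * 6  # 0 = spam, 1 = ham, as in the notebook

models = {
    'lr': LogisticRegression(),
    'gbc': GradientBoostingClassifier(n_estimators=100, max_depth=4, random_state=0),
    'svm': SVC(kernel='linear'),
}

scores = {}
for name, model in models.items():
    # Same structure as clf_pipeline: vectorize the text, then classify.
    pipeline = Pipeline([('tfidf', TfidfVectorizer()), (name, model)])
    scores[name] = cross_val_score(pipeline, toy_texts, toy_labels, cv=3).mean()
    print('{}: mean accuracy {:.2f}'.format(name, scores[name]))
```

On the real data, the same loop could take `X_train` and `y_train` in place of the toy lists. Scores on twelve invented messages say nothing about which model is better on SMS spam.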

## Evaluation

Compute common classification metrics and evaluate the models. Decide which model performs best on the given problem.

In [11]:
from sklearn import metrics
from sklearn.metrics import accuracy_score

y_pred = clf_pipeline.predict(X_test)

print("Test accuracy: {:.2f}".format(accuracy_score(y_test, y_pred)))
print()
print(metrics.classification_report(y_test, y_pred, digits=4))

Test accuracy: 0.99

             precision    recall  f1-score   support

          0     0.9779    0.9366    0.9568       142
          1     0.9908    0.9969    0.9939       973

avg / total     0.9892    0.9892    0.9891      1115



In [12]:
from sklearn.metrics import confusion_matrix

print(confusion_matrix(y_test, y_pred))

[[133   9]
 [  3 970]]
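
The confusion matrix above contains everything needed to reproduce the spam row of the test classification report. The sketch below recomputes the metrics by hand; rows are true labels and columns are predicted labels, with index 0 for spam and 1 for ham. The variable names are illustrative only.

```python
# Confusion matrix copied from the test run above.
cm = [[133, 9],
      [3, 970]]

spam_correct = cm[0][0]  # spam predicted as spam
spam_missed = cm[0][1]   # spam predicted as ham
ham_flagged = cm[1][0]   # ham predicted as spam
total = sum(sum(row) for row in cm)

precision = spam_correct / (spam_correct + ham_flagged)
recall = spam_correct / (spam_correct + spam_missed)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (cm[0][0] + cm[1][1]) / total

print('precision={:.4f} recall={:.4f} f1={:.4f} accuracy={:.4f}'.format(
    precision, recall, f1, accuracy))
# prints: precision=0.9779 recall=0.9366 f1=0.9568 accuracy=0.9892
```

The values match the row for class 0 in the report: 9 of the 142 spam messages slip through as ham, while only 3 of the 973 legitimate messages are flagged as spam.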


In [13]:
# Use a separate name so the test-set predictions in y_pred are not overwritten.
y_train_pred = clf_pipeline.predict(X_train)

print("Train accuracy: {:.2f}".format(accuracy_score(y_train, y_train_pred)))
print()
print(metrics.classification_report(y_train, y_train_pred, digits=4))

Train accuracy: 1.00

             precision    recall  f1-score   support

          0     0.9966    0.9702    0.9832       605
          1     0.9953    0.9995    0.9974      3852

avg / total     0.9955    0.9955    0.9955      4457

