___
# Feature Extraction from Text
<!-- In the **Scikit-learn Primer** lecture we applied a simple SVC classification model to the SMSSpamCollection dataset. We tried to predict the ham/spam label based on message length and punctuation counts. In this section we'll actually look at the text of each message and try to perform a classification based on content. We'll take advantage of some of scikit-learn's [feature extraction](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction) tools. -->

## Load a dataset

In [12]:
# Perform imports and load the dataset:
import numpy as np
import pandas as pd
column_name = ['label', 'message']
df = pd.read_csv('../project/raw_data/xtrain.txt', sep='\t', names=column_name)
df.head()

   label                                            message
0      1  GREEN BAY, WIDavid Horsted, 45, announced Mond...
1      3  CISA Systemic Domestic SpyingBy SARTRE Coercio...
2      1  A local resident's search for a public bathroo...
3      1  A five-minute sampling of Hindi-language chann...
4      4  The proposed trade agreement with China will b...


## Check for missing values:
It is good practice to confirm that no labels or messages are missing before going further.

In [13]:
df.isnull().sum()

label      0
message    0
dtype: int64
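
`isnull()` only flags true missing values. As a minimal sketch, and assuming a hypothetical toy frame (`toy_df`, not the real dataset), the check can be extended to catch messages that are present but empty or whitespace-only:

```python
import pandas as pd

# Hypothetical toy frame; the column names mirror the ones used above
toy_df = pd.DataFrame({
    'label': [1, 2, 3, 4],
    'message': ['A real message', '   ', '', None],
})

# True missing values (None / NaN)
n_missing = toy_df['message'].isnull().sum()

# Messages that exist but contain nothing except whitespace
is_blank = toy_df['message'].fillna('').str.strip().eq('')
n_blank_only = (is_blank & toy_df['message'].notnull()).sum()

print(n_missing, n_blank_only)
```

Whether blank messages should be dropped is a judgment call that depends on the dataset; the sketch only shows how to count them.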

## Number of messages per `label`:

In [14]:
df['label'].value_counts()

2    4014
3    4008
4    3997
1    3981
Name: label, dtype: int64

## Split the data into train & test sets:

In [15]:
from sklearn.model_selection import train_test_split

X = df['message']  
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
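
The split above is purely random. The four labels are close to balanced here, so that is probably fine, but `train_test_split` also accepts a `stratify` argument that keeps the label proportions the same in both subsets. A small sketch on hypothetical toy data (`docs` and `labels` are made up for illustration):

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 12 short documents, 4 labels, 3 documents per label
docs = [f"document {i}" for i in range(12)]
labels = [1, 2, 3, 4] * 3

docs_train, docs_test, labels_train, labels_test = train_test_split(
    docs, labels, test_size=0.33, random_state=42, stratify=labels
)

# With stratification every label is represented equally in each subset
print(len(docs_train), len(docs_test))
print(Counter(labels_train), Counter(labels_test))
```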

## Scikit-learn's CountVectorizer
<!-- Text preprocessing, tokenizing and the ability to filter out stopwords are all included in [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html), which builds a dictionary of features and transforms documents to feature vectors. -->

In [16]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()

X_train_counts = count_vect.fit_transform(X_train)
X_train_counts.shape

(10720, 93591)

<!-- <font color=green>This shows that our training set comprises 10720 documents and 93591 features.</font> -->
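
To see what the two numbers in the shape mean, here is a minimal sketch on a hypothetical three-document corpus (`toy_docs` is made up for illustration). Each row is a document, each column is one vocabulary term, and each cell is how often that term occurs in that document:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus
toy_docs = ["the cat sat", "the cat sat on a mat", "the dogs chase the cats"]

toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_docs)

# vocabulary_ maps each term to its column index; sort terms by column
terms_in_column_order = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)

print(toy_counts.shape)          # (documents, vocabulary terms)
print(terms_in_column_order)
print(toy_counts.toarray())
```

With the default settings the text is lowercased and single-character tokens such as "a" are dropped, so they never become features.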

## Transform Counts to Frequencies with Tf-idf
<!-- While counting words is helpful, longer documents will have higher average count values than shorter documents, even though they might talk about the same topics.

To avoid this we can simply divide the number of occurrences of each word in a document by the total number of words in the document: these new features are called **tf** for Term Frequencies.

Another refinement on top of **tf** is to downscale weights for words that occur in many documents in the corpus and are therefore less informative than those that occur only in a smaller portion of the corpus.

This downscaling is called **tf–idf** for “Term Frequency times Inverse Document Frequency”.

Both tf and tf–idf can be computed as follows using [TfidfTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html): -->

In [17]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()

X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(10720, 93591)

<!-- Note: the `fit_transform()` method actually performs two operations: it fits an estimator to the data and then transforms our count-matrix to a tf-idf representation. -->
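
As a rough sketch of what `TfidfTransformer` does with its default settings (smoothed idf followed by L2 normalisation of each row), the weights can be reproduced by hand with NumPy. The count matrix below is hypothetical and chosen only to keep the arithmetic small:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical count matrix: 3 documents (rows) x 3 terms (columns)
counts = np.array([[3, 0, 1],
                   [2, 0, 0],
                   [3, 2, 0]])

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)               # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1   # smoothed idf (smooth_idf=True default)

weighted = counts * idf                           # term count times idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # L2-normalise each row

from_sklearn = TfidfTransformer().fit_transform(counts).toarray()
print(np.allclose(manual, from_sklearn))
```

The first term appears in every document, so it gets the lowest idf; the rarer terms are weighted up relative to it.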

## Combine Steps with TfidfVectorizer
<!-- In the future, we can combine the CountVectorizer and TfidfTransformer steps into one using [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html): -->

In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()

X_train_tfidf = vectorizer.fit_transform(X_train) # remember to use the original X_train set
X_train_tfidf.shape

(10720, 93591)
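
The matching shapes suggest the one-step and two-step routes agree, but a shape check alone does not prove it. A quick sketch on a hypothetical toy corpus compares the actual values, assuming default parameters for all three classes:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer
)

# Hypothetical toy corpus
toy_docs = ["the cat sat", "the cat sat on a mat", "the dogs chase the cats"]

# Two steps: raw counts first, then tf-idf weighting
two_step = TfidfTransformer().fit_transform(CountVectorizer().fit_transform(toy_docs))

# One step: TfidfVectorizer works directly on the raw text
one_step = TfidfVectorizer().fit_transform(toy_docs)

print(np.allclose(two_step.toarray(), one_step.toarray()))
```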

## Train a Classifier
<!-- Here we'll introduce an SVM classifier that's similar to SVC, called [LinearSVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html). LinearSVC handles sparse input better, and scales well to large numbers of samples. -->

In [19]:
from sklearn.svm import LinearSVC
clf = LinearSVC()
clf.fit(X_train_tfidf,y_train)

LinearSVC()

<!-- <font color=green>Earlier we named our SVC classifier **svc_model**. Here we're using the more generic name **clf** (for classifier).</font> -->

## Build a Pipeline
<!-- Remember that only our training set has been vectorized into a full vocabulary. In order to perform an analysis on our test set we'll have to submit it to the same procedures. Fortunately scikit-learn offers a [**Pipeline**](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) class that behaves like a compound classifier. -->

In [20]:
from sklearn.pipeline import Pipeline
# from sklearn.feature_extraction.text import TfidfVectorizer
# from sklearn.svm import LinearSVC

text_clf = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', LinearSVC()),
])

# Feed the training data through the pipeline
text_clf.fit(X_train, y_train)  

Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])
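
One practical consequence of the pipeline is that raw, unseen strings can be passed straight to `predict()`: the fitted vectorizer is applied with its existing vocabulary and is not refitted. The sketch below illustrates this on a hypothetical two-topic toy corpus (all names prefixed `toy_` are made up for illustration):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical toy corpus with two made-up topics
toy_train = [
    "the team won the football match",
    "the striker scored a late goal",
    "the coach praised the team after the match",
    "the bank raised interest rates",
    "stock markets fell as rates rose",
    "investors sold shares in the bank",
]
toy_labels = ['sports'] * 3 + ['finance'] * 3

toy_clf = Pipeline([('tfidf', TfidfVectorizer()),
                    ('clf', LinearSVC())])
toy_clf.fit(toy_train, toy_labels)
vocab_size_after_fit = len(toy_clf.named_steps['tfidf'].vocabulary_)

# Raw strings go in; the pipeline vectorizes them before classifying
toy_new = ["the team scored a goal", "the bank cut interest rates"]
toy_predictions = toy_clf.predict(toy_new)

# predict() only transforms the new text, so the vocabulary is unchanged
vocab_size_after_predict = len(toy_clf.named_steps['tfidf'].vocabulary_)
print(toy_predictions)
print(vocab_size_after_fit == vocab_size_after_predict)
```

Words in the new text that were not seen during fitting (here "cut") are simply ignored.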

## Test the classifier and display results

In [21]:
# Form a prediction set
predictions = text_clf.predict(X_test)

In [22]:
# Report the confusion matrix
from sklearn import metrics
print(metrics.confusion_matrix(y_test,predictions))

[[1261   11    7   37]
 [  10 1263   16    4]
 [  26   11 1269   12]
 [  38    5   23 1287]]


In [23]:
# Print a classification report
print(metrics.classification_report(y_test,predictions))

              precision    recall  f1-score   support

           1       0.94      0.96      0.95      1316
           2       0.98      0.98      0.98      1293
           3       0.97      0.96      0.96      1318
           4       0.96      0.95      0.96      1353

    accuracy                           0.96      5280
   macro avg       0.96      0.96      0.96      5280
weighted avg       0.96      0.96      0.96      5280



In [24]:
# Print the overall accuracy
print(metrics.accuracy_score(y_test,predictions))

0.9621212121212122
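
The figures in the classification report and the overall accuracy all follow from the confusion matrix printed above. As a sketch, with the matrix values copied by hand from that output (rows are true labels, columns are predicted labels):

```python
import numpy as np

# Confusion matrix copied from the output above
cm = np.array([[1261,   11,    7,   37],
               [  10, 1263,   16,    4],
               [  26,   11, 1269,   12],
               [  38,    5,   23, 1287]])

correct = np.diag(cm)                  # correctly classified documents per label
recall = correct / cm.sum(axis=1)      # divided by the number of true members of each label
precision = correct / cm.sum(axis=0)   # divided by the number of predictions of each label
accuracy = correct.sum() / cm.sum()    # all correct predictions over all documents

print(np.round(precision, 2))
print(np.round(recall, 2))
print(accuracy)
```

The row sums also reproduce the `support` column of the report (1316, 1293, 1318 and 1353).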
