___
# Feature Extraction from Text
<!-- In the **Scikit-learn Primer** lecture we applied a simple SVC classification model to the SMSSpamCollection dataset. We tried to predict the ham/spam label based on message length and punctuation counts. In this section we'll actually look at the text of each message and try to perform a classification based on content. We'll take advantage of some of scikit-learn's [feature extraction](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction) tools.

## Load a dataset -->

In [1]:
# Perform imports and load the dataset:
import numpy as np
import pandas as pd
column_name = ['label', 'message']
df = pd.read_csv('../project/raw_data/xtrain.txt', sep='\t', names=column_name)
df.head()

Unnamed: 0,label,message
0,1,"GREEN BAY, WIDavid Horsted, 45, announced Mond..."
1,3,CISA Systemic Domestic SpyingBy SARTRE Coercio...
2,1,A local resident's search for a public bathroo...
3,1,A five-minute sampling of Hindi-language chann...
4,4,The proposed trade agreement with China will b...


In [2]:
df.isnull().sum()

label      0
message    0
dtype: int64

## Number of messages per `label`:

In [3]:
df['label'].value_counts()

2    4014
3    4008
4    3997
1    3981
Name: label, dtype: int64

## Split the data into train & test sets:

In [4]:
from sklearn.model_selection import train_test_split

X = df['message']  
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

## Scikit-learn's CountVectorizer
<!-- Text preprocessing, tokenizing and the ability to filter out stopwords are all included in [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html), which builds a dictionary of features and transforms documents to feature vectors. -->

In [5]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()

X_train_counts = count_vect.fit_transform(X_train)
X_train_counts.shape

(10720, 93591)
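To make the shape above easier to interpret, here is a minimal sketch on a hypothetical two-document corpus (`toy_docs` is invented for illustration and is not part of the dataset). Each row is a document, each column is a vocabulary term, and each cell is a raw count.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only for illustration
toy_docs = ["the cat sat", "the cat sat on the mat"]

toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_docs)  # sparse matrix: documents x vocabulary

# vocabulary_ maps each token to its column index; list the tokens in column order
toy_vocab = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)
print(toy_vocab)
print(toy_counts.toarray())
```

With the default tokenizer, single-character tokens are dropped and text is lowercased, so the real vocabulary of 93591 features consists of lowercased tokens of at least two characters.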

<!-- <font color=green>This shows that our training set comprises 10720 documents and 93591 features.</font> -->

## Transform Counts to Frequencies with Tf-idf
<!-- While counting words is helpful, longer documents will have higher average count values than shorter documents, even though they might talk about the same topics.

To avoid this we can simply divide the number of occurrences of each word in a document by the total number of words in the document: these new features are called **tf** for Term Frequencies.

Another refinement on top of **tf** is to downscale weights for words that occur in many documents in the corpus and are therefore less informative than those that occur only in a smaller portion of the corpus.

This downscaling is called **tf–idf** for “Term Frequency times Inverse Document Frequency”.

Both tf and tf–idf can be computed as follows using [TfidfTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html): -->

In [6]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()

X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(10720, 93591)

<!-- Note: the `fit_transform()` method actually performs two operations: it fits an estimator to the data and then transforms our count-matrix to a tf-idf representation. -->
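As a rough check of what `TfidfTransformer` computes with its default settings, the sketch below reproduces the weights by hand on a hypothetical toy corpus. It assumes the defaults (`smooth_idf=True`, `norm='l2'`, no sublinear tf scaling); the corpus and variable names are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy corpus, used only for illustration
toy_docs = ["the cat sat", "the cat sat on the mat", "the dog barked"]
toy_counts_sparse = CountVectorizer().fit_transform(toy_docs)
toy_counts = toy_counts_sparse.toarray()

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
n_docs = toy_counts.shape[0]
doc_freq = (toy_counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight raw counts by idf, then scale each row to unit Euclidean length
weighted = toy_counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

sklearn_tfidf = TfidfTransformer().fit_transform(toy_counts_sparse).toarray()
print(np.allclose(manual_tfidf, sklearn_tfidf))
```

The row normalization is what removes the document-length effect: a long and a short document with the same term proportions end up with the same vector.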

## Combine Steps with TfidfVectorizer
<!-- In the future, we can combine the CountVectorizer and TfidfTransformer steps into one using [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html): -->

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()

X_train_tfidf = vectorizer.fit_transform(X_train) # remember to use the original X_train set
X_train_tfidf.shape

(10720, 93591)
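The identical shape is expected, since `TfidfVectorizer` with default settings should match `CountVectorizer` followed by `TfidfTransformer`. A minimal sketch on a hypothetical toy corpus (names are illustrative only) checks that the two routes agree numerically:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy corpus, used only for illustration
toy_docs = ["the cat sat", "the cat sat on the mat", "the dog barked"]

# Route 1: raw counts first, then tf-idf weighting
two_step = TfidfTransformer().fit_transform(CountVectorizer().fit_transform(toy_docs))

# Route 2: both steps in a single estimator
one_step = TfidfVectorizer().fit_transform(toy_docs)

same_result = np.allclose(two_step.toarray(), one_step.toarray())
print(same_result)
```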

## Train a Classifier
<!-- Here we'll introduce an SVM classifier that's similar to SVC, called [LinearSVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html). LinearSVC handles sparse input better, and scales well to large numbers of samples. -->

In [34]:
from sklearn.svm import LinearSVC
clf = LinearSVC()
clf.fit(X_train_tfidf,y_train)

LinearSVC()

In [8]:
# Candidate stop-word list (line continuations are not needed inside brackets)
stopwords = ['a', 'about', 'an', 'and', 'are', 'as', 'at', 'be', 'been', 'but', 'by', 'can',
             'even', 'ever', 'for', 'from', 'get', 'had', 'has', 'have', 'he', 'her', 'hers', 'his',
             'how', 'i', 'if', 'in', 'into', 'is', 'it', 'its', 'just', 'me', 'my', 'of', 'on', 'or',
             'see', 'seen', 'she', 'so', 'than', 'that', 'the', 'their', 'there', 'they', 'this',
             'to', 'was', 'we', 'were', 'what', 'when', 'which', 'who', 'will', 'with', 'you']
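A stop-word list like this one is normally passed to the vectorizer through its `stop_words` parameter. The sketch below uses a short hypothetical list and toy corpus (not the notebook's data) to show the effect on the learned vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical short stop-word list and toy corpus, used only for illustration
toy_stopwords = ['the', 'on', 'is']
toy_docs = ["the cat sat on the mat", "the dog is on the mat"]

vect_all_words = TfidfVectorizer().fit(toy_docs)
vect_filtered = TfidfVectorizer(stop_words=toy_stopwords).fit(toy_docs)

print(sorted(vect_all_words.vocabulary_))  # includes 'the', 'on', 'is'
print(sorted(vect_filtered.vocabulary_))   # stop words removed
```

Whether filtering helps is an empirical question for each dataset, so it is worth comparing both variants on held-out data.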

<!-- <font color=green>Earlier we named our SVC classifier **svc_model**. Here we're using the more generic name **clf** (for classifier).</font> -->

## Build a Pipeline (4-way)
<!-- Remember that only our training set has been vectorized into a full vocabulary. In order to perform an analysis on our test set we'll have to submit it to the same procedures. Fortunately scikit-learn offers a [**Pipeline**](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) class that behaves like a compound classifier. -->

In [20]:
from sklearn.pipeline import Pipeline
# from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

text_clf = Pipeline([('tfidf', TfidfVectorizer(ngram_range=(1,2))), # no stop_words argument: the model performed better without stop-word filtering
                     ('clf', LinearSVC()),
])

# Feed the training data through the pipeline
text_clf.fit(X_train, y_train)  

Pipeline(steps=[('tfidf', TfidfVectorizer(ngram_range=(1, 2))),
                ('clf', LinearSVC())])
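The point of the pipeline is that test text is transformed with the vocabulary and idf weights learned from the training text, never refitted. A minimal sketch with hypothetical toy documents and labels (all names are illustrative) compares the pipeline against doing the two steps by hand:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data, used only for illustration
toy_train_docs = ["free prize win money now", "win cash prize today",
                  "meeting at noon tomorrow", "lunch meeting moved to friday"]
toy_train_labels = [1, 1, 0, 0]
toy_test_docs = ["win a free prize", "friday lunch meeting"]

# Manual route: fit the vectorizer on training text, then only transform the test text
toy_vec = TfidfVectorizer()
toy_clf = LinearSVC(random_state=0)
toy_clf.fit(toy_vec.fit_transform(toy_train_docs), toy_train_labels)
manual_pred = toy_clf.predict(toy_vec.transform(toy_test_docs))

# Pipeline route: the same two steps behind a single fit/predict interface
toy_pipe = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', LinearSVC(random_state=0))])
toy_pipe.fit(toy_train_docs, toy_train_labels)
pipe_pred = toy_pipe.predict(toy_test_docs)

print(np.array_equal(manual_pred, pipe_pred))
```

`random_state=0` is set here only so that both classifiers are fitted identically; the notebook's pipeline uses the default.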

## Test the classifier and display results

In [21]:
# Form a prediction set
predictions = text_clf.predict(X_test)

In [22]:
# Report the confusion matrix
from sklearn import metrics
print(metrics.confusion_matrix(y_test,predictions))

[[1286    5    8   17]
 [   8 1272   12    1]
 [  32    9 1265   12]
 [  34    3   15 1301]]


In [23]:
# Print a classification report
print(metrics.classification_report(y_test,predictions))

              precision    recall  f1-score   support

           1       0.95      0.98      0.96      1316
           2       0.99      0.98      0.99      1293
           3       0.97      0.96      0.97      1318
           4       0.98      0.96      0.97      1353

    accuracy                           0.97      5280
   macro avg       0.97      0.97      0.97      5280
weighted avg       0.97      0.97      0.97      5280



In [24]:
# Print the overall accuracy
print(metrics.accuracy_score(y_test,predictions))

0.9704545454545455
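The figures in the classification report can be recovered directly from the confusion matrix above, where rows are true labels and columns are predicted labels. The sketch below hard-codes that matrix and uses only NumPy:

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true label, columns: predicted)
cm_4way = np.array([[1286,    5,    8,   17],
                    [   8, 1272,   12,    1],
                    [  32,    9, 1265,   12],
                    [  34,    3,   15, 1301]])

true_pos = np.diag(cm_4way)
recall = true_pos / cm_4way.sum(axis=1)      # correct / all documents truly in the class
precision = true_pos / cm_4way.sum(axis=0)   # correct / all documents predicted as the class
accuracy = true_pos.sum() / cm_4way.sum()

print(np.round(precision, 2))
print(np.round(recall, 2))
print(accuracy)
```

For example, label 1 has 1286 correct predictions out of 1360 documents predicted as label 1, which gives the reported precision of 0.95.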


## Build a Pipeline (2-way)

In [44]:
import numpy as np
import pandas as pd

# Load the two-class (satirical vs. legitimate) dataset
twoway = pd.read_excel('../project/raw_data/test.xlsx')
# Keep only the label and full-text columns
twoway.drop(['Article Headline','Domain','Subtopic','Source','Issue Date ','URL','Access Date'], axis=1, inplace=True)

In [54]:
twoway.isnull().sum()

Satirical =1 Legitimate=0    0
Full Text                    0
dtype: int64

In [63]:
twoway['Satirical =1 Legitimate=0'].value_counts()

0    180
1    180
Name: Satirical =1 Legitimate=0, dtype: int64

In [56]:
from sklearn.model_selection import train_test_split

X = twoway['Full Text ']  
y = twoway['Satirical =1 Legitimate=0']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [57]:
text_clf = Pipeline([('tfidf', TfidfVectorizer()), # no stop_words argument: the model performed better without stop-word filtering
                     ('clf', LinearSVC()),
])

# Feed the training data through the pipeline
text_clf.fit(X_train, y_train)  

Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])

In [58]:
predictions = text_clf.predict(X_test)

In [59]:
from sklearn import metrics
print(metrics.confusion_matrix(y_test,predictions))

[[54 12]
 [ 6 47]]


In [60]:
print(metrics.classification_report(y_test,predictions))

              precision    recall  f1-score   support

           0       0.90      0.82      0.86        66
           1       0.80      0.89      0.84        53

    accuracy                           0.85       119
   macro avg       0.85      0.85      0.85       119
weighted avg       0.85      0.85      0.85       119



In [61]:
print(metrics.accuracy_score(y_test,predictions))

0.8487394957983193


## Full Training Dataset

## 4-way

In [42]:
column_name = ['label', 'message']
full_df = pd.read_csv('../project/raw_data/fulltrain.csv', sep=',', names=column_name)
full_df.head()

Unnamed: 0,label,message
0,1,"A little less than a decade ago, hockey fans w..."
1,1,The writers of the HBO series The Sopranos too...
2,1,Despite claims from the TV news outlet to offe...
3,1,After receiving 'subpar' service and experienc...
4,1,After watching his beloved Seattle Mariners pr...


In [43]:
full_df['label'].value_counts()

3    17870
1    14047
4     9995
2     6942
Name: label, dtype: int64

In [44]:
X = full_df['message']  
y = full_df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [45]:
text_clf = Pipeline([('tfidf', TfidfVectorizer(ngram_range=(1,2))), # no stop_words argument: the model performed better without stop-word filtering
                     ('clf', LinearSVC()),
])

# Feed the training data through the pipeline
text_clf.fit(X_train, y_train)  

Pipeline(steps=[('tfidf', TfidfVectorizer(ngram_range=(1, 2))),
                ('clf', LinearSVC())])

In [46]:
predictions = text_clf.predict(X_test)

In [47]:
print(metrics.confusion_matrix(y_test,predictions))

[[4591   12   37   29]
 [  15 2222   43    7]
 [  35   19 5778   14]
 [  78    1   67 3174]]


In [48]:
print(metrics.classification_report(y_test,predictions))

              precision    recall  f1-score   support

           1       0.97      0.98      0.98      4669
           2       0.99      0.97      0.98      2287
           3       0.98      0.99      0.98      5846
           4       0.98      0.96      0.97      3320

    accuracy                           0.98     16122
   macro avg       0.98      0.97      0.98     16122
weighted avg       0.98      0.98      0.98     16122



In [49]:
print(metrics.f1_score(y_test,predictions, average='macro'))

0.9771199580250642
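Because the classes in the full dataset are unequal in size, it is worth seeing how the macro average is formed. A minimal NumPy sketch, hard-coding the confusion matrix printed above, computes per-class F1 and then the unweighted (macro) and support-weighted means:

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true label, columns: predicted)
cm_full = np.array([[4591,   12,   37,   29],
                    [  15, 2222,   43,    7],
                    [  35,   19, 5778,   14],
                    [  78,    1,   67, 3174]])

true_pos = np.diag(cm_full)
support = cm_full.sum(axis=1)         # documents truly in each class
predicted = cm_full.sum(axis=0)       # documents predicted as each class

# F1 = 2*TP / (2*TP + FP + FN), which simplifies to 2*TP / (support + predicted)
f1_per_class = 2 * true_pos / (support + predicted)

macro_f1 = f1_per_class.mean()                             # every class counts equally
weighted_f1 = np.average(f1_per_class, weights=support)    # larger classes count more

print(f1_per_class)
print(macro_f1)
print(weighted_f1)
```

The macro value agrees with the `f1_score(..., average='macro')` output above; it treats the smallest class (label 2) the same as the largest (label 3).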


## 2-way (unbalanced dataset)

In [33]:
# Keep only the rows labelled 1 or 4
full_df_2 = full_df[full_df['label'].isin([1, 4])]

In [34]:
X = full_df_2['message']  
y = full_df_2['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
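With unequal class sizes, one optional safeguard is a stratified split, which keeps the class ratio the same in the train and test sets. The notebook itself uses a plain random split; the sketch below is only an illustration on hypothetical toy labels (60 of label 1 and 40 of label 4), using the `stratify` parameter of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 60 samples of label 1 and 40 samples of label 4
toy_X = np.arange(100).reshape(-1, 1)
toy_y = np.array([1] * 60 + [4] * 40)

# stratify=toy_y keeps the 60/40 ratio in both partitions
toy_X_tr, toy_X_te, toy_y_tr, toy_y_te = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=42, stratify=toy_y)

labels, counts = np.unique(toy_y_te, return_counts=True)
test_class_counts = dict(zip(labels.tolist(), counts.tolist()))
print(test_class_counts)
```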

In [36]:
text_clf = Pipeline([('tfidf', TfidfVectorizer(ngram_range=(1,2))), # no stop_words argument: the model performed better without stop-word filtering
                     ('clf', LinearSVC()),
])

# Feed the training data through the pipeline
text_clf.fit(X_train, y_train)  

Pipeline(steps=[('tfidf', TfidfVectorizer(ngram_range=(1, 2))),
                ('clf', LinearSVC())])

In [37]:
predictions = text_clf.predict(X_test)

In [38]:
print(metrics.confusion_matrix(y_test,predictions))

[[4578   54]
 [ 100 3202]]


In [39]:
print(metrics.classification_report(y_test,predictions))

              precision    recall  f1-score   support

           1       0.98      0.99      0.98      4632
           4       0.98      0.97      0.98      3302

    accuracy                           0.98      7934
   macro avg       0.98      0.98      0.98      7934
weighted avg       0.98      0.98      0.98      7934



In [41]:
# Binary F1: with labels {1, 4}, the default pos_label=1 scores label 1 as the positive class
print(metrics.f1_score(y_test,predictions))

0.9834586466165414
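In the binary case `f1_score` reports the F1 of a single positive class, which defaults to `pos_label=1`. The sketch below rebuilds label vectors from the confusion matrix printed above (a reconstruction for illustration, not the original predictions) and shows the score for each choice of positive label.

```python
import numpy as np
from sklearn.metrics import f1_score

# Confusion matrix copied from the output above (rows: true label, columns: predicted)
cm_2way = np.array([[4578,   54],
                    [ 100, 3202]])
class_labels = np.array([1, 4])

# Rebuild aligned true/predicted vectors that produce exactly this matrix
y_true_rebuilt = np.repeat(class_labels, cm_2way.sum(axis=1))
y_pred_rebuilt = np.concatenate([np.repeat(class_labels, row) for row in cm_2way])

f1_label_1 = f1_score(y_true_rebuilt, y_pred_rebuilt)               # default pos_label=1
f1_label_4 = f1_score(y_true_rebuilt, y_pred_rebuilt, pos_label=4)  # label 4 as positive

print(f1_label_1)
print(f1_label_4)
```

The first value corresponds to the score printed above; the second is lower because label 4 has more missed documents (100) than label 1 (54). Passing `average='macro'` would combine the two.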
