## Bag of Words and Tf-idf
In the above examples, each vector can be considered a *bag of words*. By themselves these vectors may not be helpful until we consider *term frequencies*, that is, how often individual words appear in documents. A simple way to calculate a term frequency is to divide the number of occurrences of a word by the total number of words in the document. This normalization lets us compare how often a word appears in large documents with how often it appears in smaller ones.
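As a minimal sketch of the term-frequency calculation described above, the snippet below uses only the standard library. The helper name `term_frequencies` and the two sample sentences are illustrative assumptions, not part of the dataset or of scikit-learn.

```python
from collections import Counter

def term_frequencies(document):
    """Return each word's share of the total number of words in the document."""
    words = document.lower().split()
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}

# Hypothetical documents of different lengths
short_doc = "the cat sat"
long_doc = "the cat sat on the mat while the dog slept"

# "cat" occurs once in each, but its relative weight differs with document length
print(term_frequencies(short_doc)["cat"])
print(term_frequencies(long_doc)["cat"])
```

Dividing by the document length is what makes the two values comparable even though the raw count of "cat" is the same in both documents.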

However, term frequency alone makes it hard to differentiate documents when a word shows up in a majority of them. To handle this we also consider *inverse document frequency*, which is the total number of documents divided by the number of documents that contain the word. In practice we convert this value to a logarithmic scale, as described in the [Wikipedia article on inverse document frequency](https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Inverse_document_frequency).

Multiplied together, these two terms become [**tf-idf**](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).
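The two ingredients can be sketched in a few lines of plain Python. This uses the textbook form, the logarithm of total documents over documents containing the word; scikit-learn's `TfidfTransformer` applies a smoothed variant and normalizes each row by default, so its numbers will differ. The three toy documents and the helper names `tf`, `idf` and `tf_idf` are assumptions made for illustration.

```python
import math

# Hypothetical mini-corpus, already lowercased
docs = [
    "free entry to win a prize",
    "are you free tonight",
    "call me when you are home",
]
tokenized = [doc.split() for doc in docs]

def tf(term, words):
    """Occurrences of the term divided by the document length."""
    return words.count(term) / len(words)

def idf(term, corpus):
    """Log of (number of documents / number of documents containing the term).

    Assumes the term occurs in at least one document.
    """
    containing = sum(1 for words in corpus if term in words)
    return math.log(len(corpus) / containing)

def tf_idf(term, words, corpus):
    return tf(term, words) * idf(term, corpus)

# "free" appears in two documents, "prize" in only one,
# so "prize" gets the higher weight in the first document.
print(tf_idf("free", tokenized[0], tokenized))
print(tf_idf("prize", tokenized[0], tokenized))
```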

## Stop Words and Word Stems
Some words like "the" and "and" appear so frequently, and in so many documents, that we needn't bother counting them. Also, it may make sense to record only the root of a word, say "cat" in place of both "cat" and "cats". Both steps shrink our vocabulary and can improve performance.
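To make the effect on vocabulary size concrete, here is a rough sketch. The stop-word set and the `naive_stem` helper are deliberately tiny, hypothetical stand-ins; a real project would use a proper stop-word list and a stemmer or lemmatizer, for example from NLTK or spaCy.

```python
# Tiny illustrative stop-word set (an assumption, not a real stop-word list)
STOP_WORDS = {"the", "and", "a", "on"}

def naive_stem(word):
    """Crude stand-in for a stemmer: strip a trailing 's' from longer words."""
    if len(word) > 3 and word.endswith("s"):
        return word[:-1]
    return word

text = "the cat and the cats sat on a mat"
words = text.split()

# Vocabulary without any filtering
raw_vocab = sorted(set(words))

# Vocabulary after dropping stop words and reducing words to a root form
reduced_vocab = sorted({naive_stem(w) for w in words if w not in STOP_WORDS})

print(len(raw_vocab), raw_vocab)
print(len(reduced_vocab), reduced_vocab)
```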

## Tokenization and Tagging
When we created our vectors, the first thing we did was split the incoming text on whitespace with `.split()`. This was a crude form of tokenization, that is, dividing a document into individual words. In this simple example we didn't worry about punctuation or different parts of speech. In the real world we rely on some fairly sophisticated morphology to parse text appropriately.
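The difference between whitespace splitting and a slightly more careful tokenizer can be seen on one of the messages from the dataset. This sketch uses only the standard `re` module; the pattern `\w+` is one simple choice among many, not the rule any particular library uses.

```python
import re

text = "Ok lar... Joking wif u oni..."

# Splitting on whitespace leaves punctuation glued to the words
whitespace_tokens = text.split()

# A simple regular expression keeps only runs of word characters
regex_tokens = re.findall(r"\w+", text.lower())

print(whitespace_tokens)
print(regex_tokens)
```

With `.split()`, "lar..." and "oni..." keep their trailing dots and would be counted as different words from "lar" and "oni". Scikit-learn's `CountVectorizer` goes a step further: by default it keeps only tokens of two or more alphanumeric characters, so a single-letter token such as "u" is dropped.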

Once the text is divided, we can go back and tag our tokens with information about parts of speech, grammatical dependencies, etc. This adds more dimensions to our data and enables a deeper understanding of the context of specific documents. Because each document uses only a small fraction of all possible features, the resulting vectors are stored as high-dimensional sparse matrices.

In [1]:
# Perform imports and load the dataset:
import numpy as np
import pandas as pd

df = pd.read_csv('UPDATED_NLP_COURSE/TextFiles/smsspamcollection.tsv', sep='\t')
df.head()

| | label | message | length | punct |
|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | 111 | 9 |
| 1 | ham | Ok lar... Joking wif u oni... | 29 | 6 |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 155 | 6 |
| 3 | ham | U dun say so early hor... U c already then say... | 49 | 6 |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | 61 | 2 |


In [2]:
# Check for missing values
df.isnull().sum()

label      0
message    0
length     0
punct      0
dtype: int64

In [5]:
# Take a quick look at the ham and spam label column:
df['label'].value_counts()

label
ham     4825
spam     747
Name: count, dtype: int64
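The label counts show that the classes are unbalanced. A quick calculation, using only the two counts printed above, gives the share of spam and the accuracy a trivial "always predict ham" rule would reach. That baseline is an illustrative yardstick, not something computed elsewhere in this notebook.

```python
# Counts taken from the value_counts() output above
ham, spam = 4825, 747
total = ham + spam

spam_share = spam / total          # fraction of messages labeled spam
baseline_accuracy = ham / total    # accuracy of always answering "ham"

print(f"total messages: {total}")
print(f"spam share: {spam_share:.3f}")
print(f"always-ham baseline accuracy: {baseline_accuracy:.3f}")
```

Any classifier worth keeping should comfortably beat this baseline, and accuracy alone should be read alongside precision and recall for the spam class.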

In [6]:
# Split the data into train & test sets:
from sklearn.model_selection import train_test_split

X = df['message']  # this time we want to look at the text
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [7]:
print("Training data:",len(X_train),"Test data:",len(X_test))

Training data: 3733 Test data: 1839
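The two sizes can be checked by hand. The sketch below assumes that `train_test_split` rounds the test share up to a whole number of rows and gives the remainder to the training set, which is consistent with the numbers printed above.

```python
import math

n_samples = 4825 + 747   # ham + spam messages in the dataset
test_size = 0.33

# Assumption: the test count is rounded up, the rest goes to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```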


<h5>Fit the vectorizer to the data (build a vocabulary, count the number of words...)</h5>
<p><code>count_vect.fit(X_train)</code></p>
<h5>Transform the original text messages into vectors</h5>
<p><code>X_train_counts = count_vect.transform(X_train)</code></p>
<p>With <b>count_vect.fit_transform</b> we can do both at the same time.</p>
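To show what the two steps do conceptually, here is a simplified stand-in written in plain Python. The helpers `fit_vocabulary` and `transform` are hypothetical names for this sketch; they are not scikit-learn functions, and the real `CountVectorizer` tokenizes more carefully and returns a sparse matrix.

```python
def fit_vocabulary(documents):
    """'Fit' step: map each distinct lowercased word to a column index."""
    words = sorted({word for doc in documents for word in doc.lower().split()})
    return {word: index for index, word in enumerate(words)}

def transform(documents, vocabulary):
    """'Transform' step: turn each document into a row of word counts."""
    rows = []
    for doc in documents:
        row = [0] * len(vocabulary)
        for word in doc.lower().split():
            if word in vocabulary:  # words not seen during fitting are ignored
                row[vocabulary[word]] += 1
        rows.append(row)
    return rows

# Hypothetical training messages
train_docs = ["free entry free prize", "see you at home"]
vocab = fit_vocabulary(train_docs)
train_counts = transform(train_docs, vocab)

# A new message is encoded with the vocabulary learned from the training data
new_counts = transform(["free pizza at home"], vocab)

print(vocab)
print(train_counts)
print(new_counts)
```

The word "pizza" never appeared during fitting, so it has no column and is silently dropped. This is why the vocabulary is fitted on the training data only and then reused unchanged for any later text.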

In [8]:
# Scikit-learn's CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
# Text preprocessing, tokenizing and the ability to filter out stop words are all included
# in CountVectorizer, which builds a dictionary of features and transforms documents to feature vectors.
count_vect = CountVectorizer()

X_train_counts = count_vect.fit_transform(X_train)
X_train_counts.shape

(3733, 7082)

In [13]:
X_train_counts

<3733x7082 sparse matrix of type '<class 'numpy.int64'>'
	with 49992 stored elements in Compressed Sparse Row format>
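The summary above already hints at how sparse this matrix is. A short calculation from the three numbers it reports (rows, columns, stored elements) makes that explicit:

```python
# Figures taken from the sparse matrix summary above
n_rows, n_cols = 3733, 7082
stored = 49992

total_cells = n_rows * n_cols
density = stored / total_cells        # fraction of cells that are non-zero
avg_per_message = stored / n_rows     # distinct vocabulary words per message

print(f"total cells: {total_cells}")
print(f"density: {density:.4%}")
print(f"average distinct words per message: {avg_per_message:.1f}")
```

Fewer than one cell in five hundred holds a value, which is why a compressed sparse format is used instead of a dense array.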

In [14]:
X_train.shape

(3733,)

In [15]:
# Transform Counts to Frequencies with Tf-idf
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()

X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(3733, 7082)

## Combine Steps with TfidfVectorizer
In the future, we can combine the CountVectorizer and TfidfTransformer steps into one using [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html):

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()

X_train_tfidf = vectorizer.fit_transform(X_train) # remember to use the original X_train set
X_train_tfidf.shape

(3733, 7082)
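As a quick sanity check that the one-step and two-step routes agree, the sketch below compares them on a small made-up corpus using default settings. The three sample sentences are assumptions for illustration; the same comparison could be run on `X_train`.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical mini-corpus
docs = [
    "free entry to win a prize",
    "are you free tonight",
    "call me when you are home",
]

# Two steps: raw counts first, then tf-idf weighting
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One step: TfidfVectorizer does both
one_step = TfidfVectorizer().fit_transform(docs)

# Compare the dense versions of the two results
same = np.allclose(two_step.toarray(), one_step.toarray())
print(one_step.shape, same)
```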

## Train a Classifier
Here we'll introduce an SVM classifier that's similar to SVC, called [LinearSVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html). LinearSVC handles sparse input better, and scales well to large numbers of samples.

In [17]:
from sklearn.svm import LinearSVC
clf = LinearSVC()
clf.fit(X_train_tfidf,y_train)

## Build a Pipeline
Remember that only our training set has been vectorized into a full vocabulary. In order to perform an analysis on our test set we'll have to submit it to the same procedures. Fortunately scikit-learn offers a [**Pipeline**](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) class that behaves like a compound classifier.

In [18]:
# We do this instead of the separate vectorizing and training steps above
from sklearn.pipeline import Pipeline
# from sklearn.feature_extraction.text import TfidfVectorizer
# from sklearn.svm import LinearSVC

text_clf = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', LinearSVC()),
])

# Feed the training data through the pipeline
text_clf.fit(X_train, y_train)  
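To see what the pipeline saves us from writing, the sketch below compares it with the manual route on a handful of made-up messages. The toy texts and labels are assumptions for illustration, and `random_state=0` is set only so that both classifiers are trained identically.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data
train_texts = [
    "free prize win now",
    "win free entry today",
    "see you at home",
    "are you home tonight",
]
train_labels = ["spam", "spam", "ham", "ham"]
test_texts = ["free entry prize", "see you tonight"]

# Manual route: fit the vectorizer on training text, then reuse it
# (transform only, no refitting) on the test text
vectorizer = TfidfVectorizer()
clf = LinearSVC(random_state=0)
clf.fit(vectorizer.fit_transform(train_texts), train_labels)
manual_predictions = clf.predict(vectorizer.transform(test_texts))

# Pipeline route: the same two steps wrapped in one object
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LinearSVC(random_state=0)),
])
pipe.fit(train_texts, train_labels)
pipeline_predictions = pipe.predict(test_texts)

print(list(manual_predictions), list(pipeline_predictions))
```

Both routes should give the same predictions. The pipeline simply guarantees that new text always passes through the vectorizer fitted on the training data before it reaches the classifier.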

## Test the Classifier and Display Results

In [19]:
# Form a prediction set
predictions = text_clf.predict(X_test)

In [22]:
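# Report the confusion matrix (rows are actual labels, columns are predicted labels)
from sklearn import metrics
# print(metrics.confusion_matrix(y_test,predictions))
# Use a new name so the original dataset stored in df is not overwritten
confusion_df = pd.DataFrame(metrics.confusion_matrix(y_test,predictions), index=['ham','spam'], columns=['ham','spam'])
confusion_df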

| | ham | spam |
|---|---|---|
| ham | 1586 | 7 |
| spam | 12 | 234 |


In [21]:
# Print a classification report
print(metrics.classification_report(y_test,predictions))

              precision    recall  f1-score   support

         ham       0.99      1.00      0.99      1593
        spam       0.97      0.95      0.96       246

    accuracy                           0.99      1839
   macro avg       0.98      0.97      0.98      1839
weighted avg       0.99      0.99      0.99      1839



In [23]:
# Print the overall accuracy
print(metrics.accuracy_score(y_test,predictions))

0.989668297988037
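The accuracy score and the spam row of the classification report can be reproduced by hand from the four numbers in the confusion matrix above, reading its rows as actual labels and its columns as predicted labels. The variable names below are illustrative.

```python
# Counts taken from the confusion matrix above
ham_as_ham, ham_as_spam = 1586, 7      # actual ham messages
spam_as_ham, spam_as_spam = 12, 234    # actual spam messages

total = ham_as_ham + ham_as_spam + spam_as_ham + spam_as_spam

# Accuracy: correct predictions over all predictions
accuracy = (ham_as_ham + spam_as_spam) / total

# Precision: of everything predicted spam, how much really was spam
spam_precision = spam_as_spam / (spam_as_spam + ham_as_spam)

# Recall: of all real spam, how much was caught
spam_recall = spam_as_spam / (spam_as_spam + spam_as_ham)

# F1: harmonic mean of precision and recall
spam_f1 = 2 * spam_precision * spam_recall / (spam_precision + spam_recall)

print(f"accuracy: {accuracy:.4f}")
print(f"spam precision: {spam_precision:.2f}")
print(f"spam recall: {spam_recall:.2f}")
print(f"spam f1: {spam_f1:.2f}")
```

The 12 spam messages that slipped through as ham lower recall, while the 7 ham messages flagged as spam lower precision.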


In [25]:
# Try the classifier on a realistic, ordinary message
text_clf.predict(["Hi, how are you?"])

array(['ham'], dtype=object)

In [27]:
# Try the classifier on a realistic spam-like message
text_clf.predict(["Free entry in 2 a wkly comp to win FA, just sign in to www.sitio.com"])

array(['spam'], dtype=object)