In [1]:
import pandas as pd
import joblib

**Loading the Tokens**

In [2]:
tweets_train_tokenized = pd.read_csv('csvs/tweets_train_tokens.csv', index_col=False)
tweets_train_tokenized.head()

Unnamed: 0,message
0,arirang simply kpop kim hyung jun cross ha yeo...
1,read politico article donald trump running mat...
2,type bazura project google image image photo d...
3,fast lerner subpoena tech guy work hillary pri...
4,sony reward app like lot female singer non ret...


In [3]:
tweets_train_tokenized_message = pd.Series(tweets_train_tokenized.message)
tweets_train_tokenized_message

0        arirang simply kpop kim hyung jun cross ha yeo...
1        read politico article donald trump running mat...
2        type bazura project google image image photo d...
3        fast lerner subpoena tech guy work hillary pri...
4        sony reward app like lot female singer non ret...
                               ...                        
49670    sleep think fuck jordan answer phone tomorrow ...
49671    yoga shannon tomorrow morning work day start u...
49672               bring dunkin iced coffee tomorrow hero
49673    currently holiday portugal come home tomorrow ...
49674                         ladykiller saturday aternoon
Name: message, Length: 49675, dtype: object

In [4]:
# Convert the pandas Series to Unicode strings, as the vectorizers require (missing values become the string 'nan')
tweets = tweets_train_tokenized_message.astype('U').values
tweets

array(['arirang simply kpop kim hyung jun cross ha yeong playback',
       'read politico article donald trump running mate tom brady list likely choice',
       'type bazura project google image image photo dad glenn moustache whatthe',
       ..., 'bring dunkin iced coffee tomorrow hero',
       'currently holiday portugal come home tomorrow poland tuesday holocaust memorial trip',
       'ladykiller saturday aternoon'], dtype=object)

#### **II. Creating vectors from text**

An ideal method for turning text into vectors meets two criteria:

1. It should not produce a sparse matrix, since very large sparse matrices are costly to compute with.
2. It should retain most of the linguistic information present in the sentence.

#### **A. Bag of Words Model (addendum)**

It is the simplest model: take the whole text, count how often each word occurs, and map each word to its frequency. The method ignores the order of the words but does record how many times each word occurs, and the default bag-of-words model treats all words equally.

Disadvantages:

1. Sparsity is a problem: the vocabulary is large, while each sentence contains only a few of its words, so most entries of every vector are 0.
2. Word order is not captured: a feature's index is determined by the vocabulary (for example, sorted by frequency or alphabetically), not by the word's position in the sentence.
3. No information about the grammar of the sentences is retained.
4. The out-of-vocabulary problem exists: a word in the test data that is not in the vocabulary is dropped.
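The points above can be sketched on a small, hypothetical corpus (the sentences below are made up for illustration and are not from the tweet data set). This is only a minimal sketch using scikit-learn's `CountVectorizer` with its default settings:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not taken from the tweet data set.
train_docs = ["coffee tomorrow morning", "coffee coffee hero"]
test_docs = ["coffee espresso tomorrow"]  # "espresso" is out of vocabulary

bow = CountVectorizer()  # defaults: raw counts, unigrams only
train_counts = bow.fit_transform(train_docs)

# Recover the column order from the fitted vocabulary (term -> column index).
# scikit-learn sorts the vocabulary alphabetically, so word order is lost.
vocabulary = sorted(bow.vocabulary_, key=bow.vocabulary_.get)
print(vocabulary)
print(train_counts.toarray())

# Words that were not seen during fit are silently ignored at transform time.
test_counts = bow.transform(test_docs)
print(test_counts.toarray())
```

The row for the test sentence has no column for "espresso", which is the out-of-vocabulary problem in practice.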

Example:

A. Simple application

In [5]:
# import image module
from IPython.display import Image

# get the image
Image(url="pictures/bag-of-words-1.png", width=700, height=500)


In [6]:
# import image module
from IPython.display import Image

# get the image
Image(url="pictures/bag-of-words-2.png", width=700, height=400)


In [7]:
# import image module
from IPython.display import Image

# get the image
Image(url="pictures/bag-of-words-3.png", width=700, height=500)

B. Coded application with CountVectorizer

In [8]:
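# Bag of words
# ngram_range sets the n-gram sizes as a (min_n, max_n) tuple, e.g. (1, 2) for unigrams and bigrams
# binary=True records presence/absence (0 or 1) instead of raw counts
from sklearn.feature_extraction.text import CountVectorizer
vector = CountVectorizer(binary=True, ngram_range=(1, 2))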
count_matrix = vector.fit_transform(tweets)
count_matrix

<49675x317456 sparse matrix of type '<class 'numpy.int64'>'
	with 866104 stored elements in Compressed Sparse Row format>
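To put the sparsity of this matrix in numbers, the sketch below computes its density from the shape and stored-element count reported in the output above. The variable names are illustrative; on the real matrix the same figures should be available as `count_matrix.shape` and `count_matrix.nnz`.

```python
# Shape and stored-element count copied from the CountVectorizer output above.
n_rows, n_cols = 49675, 317456
stored = 866104

# Fraction of cells that hold a nonzero value.
density = stored / (n_rows * n_cols)

# Average number of nonzero features (unigrams and bigrams) per tweet.
avg_per_tweet = stored / n_rows

print(f"density: {density:.6%}")
print(f"average stored features per tweet: {avg_per_tweet:.1f}")
```

Only a tiny fraction of the cells is nonzero, which is why the result is kept in Compressed Sparse Row format rather than as a dense array.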

#### **B. Term-frequency Inverse-document Frequency (TF-IDF)**

The BOW model has a drawback that often hurts results. Suppose a particular word appears in every document, and several times in each. It will have a high frequency of occurrence and therefore a large value, which gives it more weight in a sentence even though it does not help to tell the documents apart.

The idea of TF-IDF is to reflect the importance of a word to its document or sentence by down-weighting words that occur frequently across the collection of documents.

**Term-Frequency (TF)**

It is a measure of how frequently a term $t$ appears in a document $d$:

$$tf_{t,d} = \frac{n_{t,d}}{\text{number of terms in document } d}$$

where $n_{t,d}$ is the number of times $t$ occurs in $d$. It denotes the contribution of the word to the document, i.e., words relevant to the document should be frequent in it.

**Inverse Document Frequency (IDF)**

It is a measure of how rare a word is across the collection of documents. If a word appears in almost every document, it is not significant for the classification.

$$ idf_{t} = \ln\left(\frac{\text{number of documents}}{\text{number of documents containing term } t}\right) $$

If a word has appeared in all the documents, then probably that word is not relevant to a particular document. But if it has appeared in a subset of documents then probably the word is of some relevance to the documents it is present in.

**TF-IDF**

It evaluates how relevant a word is to its sentence within a collection of sentences or documents.

$$ (TFIDF)_{t,d} = tf_{t,d} \times idf_{t} $$

Words with a higher score are more important, and those with a lower score are less important.
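The three formulas above can be worked through by hand. The following is a minimal pure-Python sketch on a hypothetical, pre-tokenized toy corpus; the function names `tf`, `idf` and `tfidf` are chosen here for illustration only.

```python
import math

# Hypothetical toy corpus of pre-tokenized documents.
docs = [
    ["coffee", "tomorrow", "morning"],
    ["coffee", "coffee", "hero"],
    ["holiday", "tomorrow"],
]

def tf(term, doc):
    # Occurrences of the term divided by the number of terms in the document.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Assumes the term occurs in at least one document (otherwise division by zero).
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# Score two terms of the second document.
for term in ("coffee", "hero"):
    print(term, round(tfidf(term, docs[1], docs), 4))
```

Although "coffee" occurs twice in the second document and "hero" only once, "hero" scores higher because it appears in fewer documents.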

Advantages:

1. Simple and intuitive
2. Word importance is captured
3. It often performs better in machine learning models than a simple Bag of Words.

Disadvantages:

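1. Sparsity is still present: for the same vocabulary the matrix has as many zero entries as Bag of Words, and it only becomes smaller when the vocabulary is restricted (e.g. with `max_features` or `min_df`).
2. The out-of-vocabulary problem is still not handled.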


Example:

A. Simple application

In [9]:
# import image module
from IPython.display import Image

# get the image
Image(url="pictures/TFIDF-1.png", width=700, height=700)

In [10]:
# import image module
from IPython.display import Image

# get the image
Image(url="pictures/TFIDF-2.png", width=700, height=300)

In [11]:
# import image module
from IPython.display import Image

# get the image
Image(url="pictures/TFIDF-3.png", width=700, height=400)

B. Coded Application

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer

# ngram_range=(1, 2) uses unigrams and bigrams
# max_features keeps only the top terms ranked by frequency across the corpus,
# e.g. max_features=3 keeps only the 3 most frequent terms
# min_df=2 drops terms that appear in fewer than 2 documents
#tfidf = TfidfVectorizer(min_df=2, max_features=4000, ngram_range=(1,2))
tfidf = TfidfVectorizer(min_df=2, max_features=5000, ngram_range=(1, 2))
#tfidf = TfidfVectorizer(max_features=4000, ngram_range=(1,2))
tfidf_tweets = tfidf.fit_transform(tweets)
tfidf_tweets

<49675x5000 sparse matrix of type '<class 'numpy.float64'>'
	with 426709 stored elements in Compressed Sparse Row format>
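Note that the values produced by `TfidfVectorizer` do not match the textbook formulas above exactly. With its default settings, scikit-learn uses the raw count as the term frequency, a smoothed IDF of $\ln\frac{1 + N}{1 + df_t} + 1$, and then scales each row to unit L2 norm. The sketch below checks this on a hypothetical toy corpus; it assumes default parameters only, not the `min_df`, `max_features` and `ngram_range` settings used above.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, not taken from the tweet data set.
docs = ["coffee tomorrow morning", "coffee coffee hero", "holiday tomorrow"]

# Raw term counts; these act as the term frequency in scikit-learn's defaults.
counts = CountVectorizer().fit_transform(docs).toarray().astype(float)
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of documents containing each term

# Smoothed idf with +1, then L2 normalization of every row.
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
manual = counts * idf
manual /= np.linalg.norm(manual, axis=1, keepdims=True)

sklearn_tfidf = TfidfVectorizer().fit_transform(docs).toarray()
matches = np.allclose(manual, sklearn_tfidf)
print(matches)
```

Because of the row normalization, scores are comparable across tweets of different lengths, which matters when the vectors are fed to a classifier.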

In [18]:
# Save the fitted TfidfVectorizer to disk
tfidf_file = 'vectors/tfidf.sav'
joblib.dump(tfidf, tfidf_file)

['vectors/tfidf.sav']

In [19]:
# Save the TF-IDF matrix of the training tweets to disk
tfidf_tweets_file = 'vectors/tfidf_tweets.sav'
joblib.dump(tfidf_tweets, tfidf_tweets_file)

['vectors/tfidf_tweets.sav']
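A saved vectorizer can later be restored with `joblib.load` and applied to new text with `transform` (not `fit_transform`, which would rebuild the vocabulary). The sketch below shows the round trip on a hypothetical toy corpus and a temporary directory, so it does not touch the `vectors/` files written above.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the training tweets.
docs = ["coffee tomorrow morning", "coffee coffee hero", "holiday tomorrow"]
vectorizer = TfidfVectorizer().fit(docs)

# Dump and reload through a temporary file (path is illustrative).
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tfidf.sav")
    joblib.dump(vectorizer, path)
    restored = joblib.load(path)

# Reuse the fitted vocabulary and idf weights on unseen text.
new_text = ["coffee hero espresso"]
new_vectors = restored.transform(new_text)
print(new_vectors.shape)
```

The restored object yields the same columns as the original, so test data vectorized this way lines up with the saved training matrix.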

#### **End. Thank you!**