### Natural language preprocessing tutorial

Sometimes a dataset contains natural language or plain string columns, and we need a way to handle them, because most algorithms only accept numerical input. Some tree-based implementations of `Decision Tree` and `Random Forest` can work with categorical values directly (scikit-learn's trees are not among them, they also expect numbers), but I really recommend converting string columns to numbers so that you can try many more algorithms on the data!

This tutorial focuses on preprocessing string variables and natural language so that the data becomes numerical for later algorithms. It covers the techniques below:

1. LabelEncoder
2. OneHotEncoder
3. Bag-of-Words
4. N-Grams
5. TFIDF

I will explain these techniques one by one, with code.

Let's go :)

In [43]:
# import modules
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
from matplotlib import style
import warnings
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import LabelBinarizer, LabelEncoder, OneHotEncoder

style.use('ggplot')

# suppress warning messages
warnings.simplefilter('ignore')

In [11]:
# here I just make a toy dataset
user_id = np.array([1, 2, 3, 4, 5])
sex_cols = np.array(['female', 'male', 'male', 'female', 'male'])
comments = np.array(["I like bed", "I like baseball ", "baseball is my life", "bed is great", "I play baseball"])
labels = np.array([0, 1, 0, 1, 1])

data = np.concatenate([user_id[:, np.newaxis], comments[:, np.newaxis], sex_cols[:, np.newaxis], labels[:, np.newaxis]], axis=1)

df = pd.DataFrame(data, columns=['id', 'comments', 'sex', 'label'])
df.head()

| | id | comments | sex | label |
|---|---|---|---|---|
| 0 | 1 | I like bed | female | 0 |
| 1 | 2 | I like baseball | male | 1 |
| 2 | 3 | baseball is my life | male | 0 |
| 3 | 4 | bed is great | female | 1 |
| 4 | 5 | I play baseball | male | 1 |


#### LabelEncoder

Suppose you are given a dataset with users' information and the goal is to predict whether a newcomer likes doing sports. Before fitting any model, you first have to process the data.

You find a column called `sex` that contains just `male` and `female`. As recommended above, string variables should be converted to numbers, but how? The most basic idea is to map each string to an integer, for example `female` to 0 and `male` to 1. That's **LabelEncoder**! The logic is really simple. Reading the `sklearn` source code, it gets the unique strings, sorts them, and assigns consecutive integers in that order, so `female` becomes 0 and `male` becomes 1 because `female` sorts first.

In [15]:
# encode the sex column with LabelEncoder: classes are sorted, so female -> 0 and male -> 1
le = LabelEncoder()

le.fit(df['sex'])

print("Converted label: ", le.transform(df['sex']))

print("we could even just with fit_transform, with same result:", le.fit_transform(df['sex']))

Converted label:  [0 1 1 0 1]
we could even just with fit_transform, with same result: [0 1 1 0 1]
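The sort-and-number logic can be reproduced in a few lines of plain Python. This is only a minimal sketch of the idea, not the `sklearn` implementation, and `label_encode` is a hypothetical helper name introduced here for illustration.

```python
def label_encode(values):
    """Map each distinct value to an integer, following sorted order."""
    classes = sorted(set(values))                    # unique values, sorted
    index = {c: i for i, c in enumerate(classes)}    # value -> integer code
    return classes, [index[v] for v in values]


sex = ['female', 'male', 'male', 'female', 'male']
classes, codes = label_encode(sex)

print("classes:", classes)
print("codes:", codes)
```

Because `'female'` sorts before `'male'`, the codes agree with the `LabelEncoder` output shown above.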


In [18]:
# LabelBinarizer gives the same 0/1 encoding for a binary column,
# but returns a 2D array of shape (n_samples, 1) that can be used as a column directly
lb = LabelBinarizer()

lb.fit(df['sex'])

print("result with label binarizer:", lb.transform(df['sex']))
print("Dimension of transform result: ", lb.transform(df['sex']).shape)

# add the new column to the original dataframe
lb_df = pd.DataFrame(lb.transform(df['sex']), columns=['sex_lb'])

# combine the two dataframes
df = pd.concat([df, lb_df], axis=1)

# show the new dataframe
# as you can see, the sex column now has a numerical version
df.head()

result with label binarizer: [[0]
 [1]
 [1]
 [0]
 [1]]
Dimension of transform result:  (5, 1)


| | id | comments | sex | label | sex_lb |
|---|---|---|---|---|---|
| 0 | 1 | I like bed | female | 0 | 0 |
| 1 | 2 | I like baseball | male | 1 | 1 |
| 2 | 3 | baseball is my life | male | 0 | 1 |
| 3 | 4 | bed is great | female | 1 | 0 |
| 4 | 5 | I play baseball | male | 1 | 1 |


#### OneHotEncoder

We have introduced **OneHotEncoder** before, please refer to this: [Preprocessing tutorial](https://github.com/lugq1990/machine-learning-beginner-jupyter-series/blob/master/preprocessing%20turorial.ipynb). The reason to use **OneHotEncoder** is that converting categories to plain integers such as 0 and 1 brings a bias into the model, because the integers carry an order and a magnitude that the categories do not have: 1 is larger than 0. Take a linear model like Logistic Regression, which multiplies each feature by a weight. For a sample whose `sex` is 1 the feature contributes the full weight, but for a sample whose `sex` is 0 the contribution is 0 no matter how large the weight is. With only two categories the model's intercept can compensate, but with more categories it gets worse, since the category encoded as 2 always contributes exactly twice as much as the category encoded as 1. The encoding, not the data, ends up deciding how important each category is. This isn't right!
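A tiny arithmetic sketch makes this bias concrete. The three categories, the weights, and all variable names below are made up for illustration only; nothing here is fitted to data.

```python
# Hypothetical three-category feature encoded as consecutive integers
codes = {"bus": 0, "car": 1, "train": 2}
weight = 0.8  # one shared weight; the value is chosen only for illustration

# With integer codes, the single weight is scaled by the code itself
int_contrib = {name: weight * code for name, code in codes.items()}
print("integer codes:", int_contrib)

# With one-hot vectors, each category has its own independent weight
one_hot_weights = {"bus": -0.3, "car": 0.8, "train": 0.1}  # illustrative values
print("one-hot:", one_hot_weights)
```

With integer codes, `train` is forced to contribute exactly twice what `car` does and `bus` contributes nothing, whatever the data says. With one weight per category, the model is free to learn each contribution separately.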

That's where **OneHotEncoder** comes in. One-hot encoding turns each integer code into a vector. Suppose you have data with 0 and 1 like the previous `sex` column: **OneHotEncoder** converts 0 to (1, 0) and 1 to (0, 1). If there are n unique values, you get an n-dimensional vector in which exactly one position is 1 and all others are 0, and the position of the 1 tells you which value it is. This removes the artificial order that comes with the integers produced by **LabelEncoder** or **LabelBinarizer**.

In [22]:
onehot = OneHotEncoder()

# the encoder expects 2D input, so reshape the data from (n,) to (n, 1)
sex_lb = df['sex_lb'].values[:, np.newaxis]
onehot.fit(sex_lb)

# convert the numbers to one-hot vectors
print("converted OneHot: ")
onehot.transform(sex_lb)

converted OneHot: 


<5x2 sparse matrix of type '<class 'numpy.float64'>'
	with 5 stored elements in Compressed Sparse Row format>

In [23]:
# as you can see, the one-hot result is a sparse matrix,
# because a sparse matrix is more memory-efficient when most entries are 0

# convert it to a dense matrix to inspect it: each row is a 2-dimensional vector
onehot.transform(sex_lb).todense()

matrix([[1., 0.],
        [0., 1.],
        [0., 1.],
        [1., 0.],
        [0., 1.]])
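To see the n-dimensional case, here is a minimal plain-Python sketch of one-hot encoding on a hypothetical three-value column. The `colors` data and the `one_hot` helper are assumptions made for illustration, not part of the dataset above or of `sklearn`.

```python
def one_hot(values):
    """Return the sorted classes and one 0/1 row per input value."""
    classes = sorted(set(values))
    index = {c: i for i, c in enumerate(classes)}
    rows = []
    for v in values:
        row = [0] * len(classes)   # all zeros ...
        row[index[v]] = 1          # ... except the position of this value
        rows.append(row)
    return classes, rows


colors = ['red', 'green', 'blue', 'green']  # hypothetical column
classes, rows = one_hot(colors)

print(classes)
for row in rows:
    print(row)
```

Three unique values give three columns, and every row contains exactly one 1.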

#### Bag-of-words

Natural language also has to be converted to numbers that represent the information in each sentence. There are many use cases for natural language, like sentiment analysis, where the data is just the comments in which users express themselves with words, just like me.

But how can we represent the words? The easiest approach is **bag-of-words**. Suppose you have two sentences: "I do machine learning" and "I like learning". **Bag-of-words** first collects all unique words: ("I", "do", "machine", "learning", "like"). There are 5 unique words, so the data becomes a 2 * 5 matrix: 2 rows for the 2 sentences and 5 columns for the 5 unique words. Each entry counts how many times the word appears in the sentence, and is 0 if it does not appear. The sample data will be (1, 1, 1, 1, 0) and (1, 0, 0, 1, 1). If a word appears several times in a sentence, the entry is simply that count!
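The two-sentence example can be worked through in plain Python. This is a minimal sketch of the counting idea only: `bag_of_words` is a hypothetical helper, and it keeps words in order of first appearance and splits on whitespace, which is simpler than what `CountVectorizer` does.

```python
from collections import Counter


def bag_of_words(sentences):
    """Build the vocabulary and one count vector per sentence."""
    vocab = []
    for sentence in sentences:
        for word in sentence.split():
            if word not in vocab:       # keep order of first appearance
                vocab.append(word)

    rows = []
    for sentence in sentences:
        counts = Counter(sentence.split())
        rows.append([counts[word] for word in vocab])  # 0 if the word is absent
    return vocab, rows


sentences = ["I do machine learning", "I like learning"]
vocab, rows = bag_of_words(sentences)

print(vocab)
print(rows)
```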

Real-world data is messy and the sentences are more complex, so more preprocessing is needed. I will write other tutorials for NLP, stay tuned.

In [33]:
bow = CountVectorizer()

# get the comments data 
comments = df['comments'].values

print("Get sentence:", comments)

# fit the model
bow.fit(comments)

bow_matrix = bow.transform(comments)

print("Get bag of words matrix: ")
# result is also a sparse matrix
bow_matrix.todense()

Get sentence: ['I like bed' 'I like baseball ' 'baseball is my life' 'bed is great'
 'I play baseball']
Get bag of words matrix: 


matrix([[0, 1, 0, 0, 0, 1, 0, 0],
        [1, 0, 0, 0, 0, 1, 0, 0],
        [1, 0, 0, 1, 1, 0, 1, 0],
        [0, 1, 1, 1, 0, 0, 0, 0],
        [1, 0, 0, 0, 0, 0, 0, 1]], dtype=int64)

In [34]:
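# we can get the unique words (the vocabulary) learned from the sentences.
# Notice that the vocabulary doesn't contain the word "I":
# CountVectorizer lowercases the text and its default token pattern
# only keeps tokens with at least 2 characters
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
bow.get_feature_names_out().tolist()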

['baseball', 'bed', 'great', 'is', 'life', 'like', 'my', 'play']
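The vocabulary has no entry for "I". A minimal sketch of why, assuming the documented default `token_pattern` of `CountVectorizer` (`(?u)\b\w\w+\b`) applied after lowercasing; the `tokenize` helper is a hypothetical name used only here.

```python
import re

# Assumed to match CountVectorizer's default token_pattern:
# a token is a run of 2 or more word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase the text, then keep only tokens of 2+ characters."""
    return TOKEN_PATTERN.findall(text.lower())


print(tokenize("I like bed"))
print(tokenize("I play baseball"))
```

The single-character token "i" never matches the pattern, so it is dropped before any stop-word handling could apply.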

One more word about **stop words**: sentences contain many words that carry little information, like "I", "she", etc. These words are called **stop words**, and they are usually removed during data cleaning. Note that `CountVectorizer` does not remove stop words by default; you can ask for it with the `stop_words` parameter. I will show you more in future NLP tutorials.

#### N-Gram

**N-Gram** adds more information to our data. If we only take single words from sentences, we lose the information carried by phrases, because word order is thrown away. This is where **N-Gram** comes in! N is the size of the window used to extract phrases. Suppose you have the sentence "I love machine learning" and set N=2 (called a bigram): the result is ("I love", "love machine", "machine learning"), so we keep more information by using **N-Gram**, right? You could set N to 3 or 4, but notice that a bigger N produces many more features! Take care, I recommend N of 2 or 3, which is usually enough.
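The sliding window is easy to write by hand. A minimal sketch, where `ngrams` is a hypothetical helper and the sentence is split on whitespace only:

```python
def ngrams(tokens, n):
    """Slide a window of size n over the tokens and join each window."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "I love machine learning".split()

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```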

In [39]:
ngram = CountVectorizer(ngram_range=(1, 2))

ngram.fit(comments)

gram_two = ngram.transform(comments)

# the n-gram result is also a sparse matrix, convert it to a dense matrix to show it
gram_two.todense()

matrix([[0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0],
        [1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0],
        [1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0],
        [0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]], dtype=int64)

One more word about **N-gram**: as you may have noticed, I set `CountVectorizer`'s `ngram_range` to (1, 2), which means we take both 1-grams and 2-grams, so the data keeps both words and phrases! That is why the matrix above has 16 columns instead of 8. Real work should consider this!
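To get a feel for how the feature count grows with N, the vocabulary for the toy comments can be rebuilt in plain Python. This is a rough sketch: `tokenize`, `ngrams`, and `vocabulary` are hypothetical helpers, and the tokenizer is assumed to mirror `CountVectorizer`'s defaults (lowercasing, tokens of 2+ characters).

```python
import re

comments = ["I like bed", "I like baseball ", "baseball is my life",
            "bed is great", "I play baseball"]


def tokenize(text):
    # assumption: same behaviour as CountVectorizer's default tokenization
    return re.findall(r"(?u)\b\w\w+\b", text.lower())


def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def vocabulary(texts, min_n, max_n):
    """Collect every n-gram with min_n <= n <= max_n, sorted alphabetically."""
    vocab = set()
    for text in texts:
        tokens = tokenize(text)
        for n in range(min_n, max_n + 1):
            vocab.update(ngrams(tokens, n))
    return sorted(vocab)


for max_n in (1, 2, 3):
    print("ngram_range = (1, %d):" % max_n, len(vocabulary(comments, 1, max_n)), "features")
```

Even on five short sentences the vocabulary doubles when bigrams are added, which is why a large N gets expensive quickly on real text.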

#### TF-IDF

When we process large datasets with many words, we want to give a different importance to each word in each sentence. In every sentence some words matter more than others, the so-called `key words`, and we want our algorithm to focus on these `key words`. The `Attention` mechanism follows a similar idea!

But how do we compute **TF-IDF**? There are two parts. `TF` means term frequency: how many times a word appears in a sentence (document). `IDF` means inverse document frequency: if a word like "I" appears in almost every document, it carries little information, so it gets a low weight.

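So **TF-IDF** is the product of the two parts:

$$\text{tfidf}(w, d) = \frac{\#\,w \text{ in } d}{\#\,\text{words in } d} \times \log\frac{1 + \#\,\text{documents}}{1 + \#\,\text{documents containing } w}$$

Note that `TfidfVectorizer` with default settings differs slightly: it uses the raw count as `TF`, adds 1 to the `IDF` term, and then scales each row to unit length.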

In [47]:
tfidf = TfidfVectorizer()

# fit and convert the sentences in one step
tfidf_data = tfidf.fit_transform(comments)

print("Get matrix shape:", tfidf_data.shape)
# convert the sparse result to a dense matrix to show it
tfidf_data.todense()

Get matrix shape: (5, 8)


matrix([[0.        , 0.70710678, 0.        , 0.        , 0.        ,
         0.70710678, 0.        , 0.        ],
        [0.63871058, 0.        , 0.        , 0.        , 0.        ,
         0.76944707, 0.        , 0.        ],
        [0.38040565, 0.        , 0.        , 0.45827018, 0.56801408,
         0.        , 0.56801408, 0.        ],
        [0.        , 0.53177225, 0.659118  , 0.53177225, 0.        ,
         0.        , 0.        , 0.        ],
        [0.55645052, 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.83088075]])

In [48]:
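# get the unique words (the vocabulary) used by TF-IDF
# (use get_feature_names_out(); get_feature_names() was removed in scikit-learn 1.2)
tfidf.get_feature_names_out().tolist()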

['baseball', 'bed', 'great', 'is', 'life', 'like', 'my', 'play']
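The numbers in the TF-IDF matrix can be checked by hand. The sketch below assumes `TfidfVectorizer`'s default settings as I understand them (raw counts as TF, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalization per row) and the same tokenization assumption as `CountVectorizer`. The helper names `tokenize`, `doc_freq`, and `tfidf_row` are hypothetical.

```python
import math
import re
from collections import Counter

comments = ["I like bed", "I like baseball ", "baseball is my life",
            "bed is great", "I play baseball"]


def tokenize(text):
    # assumption: lowercase, keep tokens of 2+ characters
    return re.findall(r"(?u)\b\w\w+\b", text.lower())


docs = [tokenize(c) for c in comments]
vocab = sorted({word for doc in docs for word in doc})
n_docs = len(docs)

# document frequency: in how many documents each word appears
doc_freq = {w: sum(w in doc for doc in docs) for w in vocab}
# smoothed IDF with the extra +1 (assumed sklearn default)
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}


def tfidf_row(tokens):
    counts = Counter(tokens)
    raw = [counts[w] * idf[w] for w in vocab]     # raw count times IDF
    norm = math.sqrt(sum(x * x for x in raw))     # L2 norm of the row
    return [x / norm for x in raw]


rows = [tfidf_row(doc) for doc in docs]

print(vocab)
print([round(x, 8) for x in rows[1]])  # "I like baseball "
```

In the second comment, `baseball` appears in 3 documents and `like` in 2, so `like` gets the larger weight even though both words occur once.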

#### Final words

This tutorial covers some basic ideas for converting sentences to numerical values. There are also more powerful ways to extract the information using `Deep Learning`, like `Word2Vec`, `Sentence2Vec`, `Paragraph2Vec`, even `Object2Vec`! I will cover these in future tutorials!

Stay tuned!