# Text Classification Feature Generation

Some feature generation methods I haven't tried before.


* Reference: https://www.analyticsvidhya.com/blog/2018/04/a-comprehensive-guide-to-understand-and-implement-text-classification-in-python/?utm_source=facebook.com
  * Also try adding sentiment features
  * For some of the methods mentioned by the author, I also added extra preprocessing steps
* Data source: https://drive.google.com/drive/folders/0Bz8a_Dbh9Qhbfll6bVpmNUtUcFdjYmF2SEpmZUZUcVNiMUw1TWN6RDV3a0JHT3kxLVhVR2M

In [2]:
import pandas as pd
from sklearn import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn import decomposition, ensemble

## About Raw Data

In [4]:
data_folder = 'yelp_review_full_csv/'
train = pd.read_csv(data_folder + 'train.csv', header=None)
test = pd.read_csv(data_folder + 'test.csv', header=None)
train.columns = ['class', 'yelp_review']
test.columns = ['class', 'yelp_review']

print(train.shape, test.shape)
train.head()

(650000, 2) (50000, 2)


|   | class | yelp_review |
|---|---|---|
| 0 | 5 | dr. goldberg offers everything i look for in a... |
| 1 | 2 | Unfortunately, the frustration of being Dr. Go... |
| 2 | 4 | Been going to Dr. Goldberg for over 10 years. ... |
| 3 | 4 | Got a letter in the mail last week that said D... |
| 4 | 1 | I don't know what Dr. Goldberg was like before... |


In [5]:
# A perfectly balanced dataset; class 5 is the best rating
print(train['class'].value_counts())
print(test['class'].value_counts())

5    130000
4    130000
3    130000
2    130000
1    130000
Name: class, dtype: int64
5    10000
4    10000
3    10000
2    10000
1    10000
Name: class, dtype: int64


In [6]:
# Look at the length (in characters) of each review.
# len(train['yelp_review']) would return the row count (650000) for every row,
# so use .str.len() to get per-review lengths instead.
train['review_len'] = train['yelp_review'].str.len()
print(train['review_len'].min(), train['review_len'].max())

test['review_len'] = test['yelp_review'].str.len()
print(test['review_len'].min(), test['review_len'].max())


In [7]:
train['yelp_review'][0]

"dr. goldberg offers everything i look for in a general practitioner.  he's nice and easy to talk to without being patronizing; he's always on time in seeing his patients; he's affiliated with a top-notch hospital (nyu) which my parents have explained to me is very important in case something happens and you need surgery; and you can get referrals to see specialists without having to see him first.  really, what more do you need?  i'm sitting here trying to think of any complaints i have about him, but i'm really drawing a blank."

## Text Feature Generation

In [8]:
# Use a few samples so that it's easier to see what the generated features look like
sample_train = train.iloc[0:10, :2]
sample_train

|   | class | yelp_review |
|---|---|---|
| 0 | 5 | dr. goldberg offers everything i look for in a... |
| 1 | 2 | Unfortunately, the frustration of being Dr. Go... |
| 2 | 4 | Been going to Dr. Goldberg for over 10 years. ... |
| 3 | 4 | Got a letter in the mail last week that said D... |
| 4 | 1 | I don't know what Dr. Goldberg was like before... |
| 5 | 5 | Top notch doctor in a top notch practice. Can'... |
| 6 | 5 | Dr. Eric Goldberg is a fantastic doctor who ha... |
| 7 | 1 | I'm writing this review to give you a heads up... |
| 8 | 2 | Wing sauce is like water. Pretty much a lot of... |
| 9 | 3 | Decent range somewhat close to the city.  The ... |


### Method 1 - Count Vector
* The structure of a count vector:
  * Each row is a document (a review in this case)
  * Each column is a token (word) that appears anywhere in the corpus (all the documents)
  * Each cell records the frequency of that token in that document
  
* Method 1.1 - Using raw tokens
* Method 1.2 - Stemming tokens before building the count vector
  * Besides stemming, we can also try lemmatisation
  * Stemming shortens a word to its stem, e.g. "beautiful" becomes "beauti"
  * Lemmatisation converts a word to its lemma (dictionary form), e.g. "good", "better" and "best" all become "good"
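
The row/column/cell structure described above is easier to see on a toy corpus. This is only a minimal sketch: the two sentences in `toy_corpus` are made up for illustration (they are not from the Yelp data), and it relies on the default `token_pattern` of `CountVectorizer`, which drops one-character tokens.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-document corpus, made up for illustration
toy_corpus = ["the food was good", "the service was bad bad"]

toy_vect = CountVectorizer(analyzer='word')
toy_counts = toy_vect.fit_transform(toy_corpus)

# vocabulary_ maps each token to its column index; sort tokens by that index
toy_tokens = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)
print(toy_tokens)            # one column per token
print(toy_counts.toarray())  # one row per document, each cell is a count
```

The second row should show a count of 2 in the column for "bad", since that token appears twice in the second sentence.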

#### 1.1 - Using Raw Tokens

In [7]:
count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')  # capture each token (1-gram)
count_vect.fit(sample_train['yelp_review'])
count_vect

CountVectorizer(analyzer='word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='\\w{1,}', tokenizer=None,
        vocabulary=None)

In [8]:
train_ct = count_vect.transform(sample_train['yelp_review'])
train_ct  # this can be used in most machine learning models

<10x417 sparse matrix of type '<type 'numpy.int64'>'
	with 715 stored elements in Compressed Sparse Row format>

In [9]:
train_ct.todense()

matrix([[0, 0, 0, ..., 0, 3, 0],
        [0, 0, 0, ..., 0, 2, 0],
        [0, 0, 1, ..., 2, 1, 2],
        ..., 
        [1, 0, 0, ..., 0, 2, 1],
        [0, 1, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]])

In [10]:
# this is the format used in XGBoost
csc_train_ct = train_ct.tocsc()  # Convert from Compressed Sparse Row format to Compressed Sparse Column format
csc_train_ct

<10x417 sparse matrix of type '<type 'numpy.int64'>'
	with 715 stored elements in Compressed Sparse Column format>

In [11]:
csc_train_ct.todense()

matrix([[0, 0, 0, ..., 0, 3, 0],
        [0, 0, 0, ..., 0, 2, 0],
        [0, 0, 1, ..., 2, 1, 2],
        ..., 
        [1, 0, 0, ..., 0, 2, 1],
        [0, 1, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]])
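
As the two dense views above suggest, converting between CSR and CSC changes only how the non-zero values are stored, not the values themselves. A minimal sketch with a small hand-made matrix (`toy_dense` is hypothetical and not derived from the reviews):

```python
import numpy as np
from scipy import sparse

# Hypothetical 4x3 matrix with four non-zero entries
toy_dense = np.array([[0, 0, 3],
                      [1, 0, 0],
                      [0, 2, 0],
                      [0, 0, 4]])

toy_csr = sparse.csr_matrix(toy_dense)  # compressed row by row
toy_csc = toy_csr.tocsc()               # same values, compressed column by column

# indptr has (rows + 1) entries in CSR and (columns + 1) entries in CSC
print(len(toy_csr.indptr), len(toy_csc.indptr))

# The dense contents are identical in both formats
print(np.array_equal(toy_csr.toarray(), toy_csc.toarray()))
```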

#### 1.2 - Stemming Tokens First

* Compared with 1.1 above, the sparse matrix is smaller (410 columns instead of 417), because stemming merges different forms of the same word into one token
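
The vocabulary shrinks because stemming maps several surface forms onto one stem. A quick sketch using NLTK's `PorterStemmer` on individual words; the names `demo_stemmer`, `demo_words` and `demo_stems` are only for this illustration.

```python
from nltk.stem.porter import PorterStemmer

demo_stemmer = PorterStemmer()

# Words taken from the examples in this notebook
demo_words = ["offers", "everything", "frustration", "going", "beautiful"]
demo_stems = [demo_stemmer.stem(w) for w in demo_words]
print(list(zip(demo_words, demo_stems)))

# Different surface forms can collapse into a single stem,
# which means fewer distinct columns in the count vector
merged = {demo_stemmer.stem(w) for w in ["offer", "offers"]}
print(merged)
```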

In [12]:
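from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()  # create once and reuse for every review

def stem_str(my_str):
    """Stem each whitespace-separated token and join the stems back into one string."""
    stemmed_lst = [stemmer.stem(token) for token in my_str.split()]
    return ' '.join(stemmed_lst)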

In [13]:
stemmed_review = pd.DataFrame(sample_train.apply(lambda r: stem_str(r['yelp_review']), axis=1))
stemmed_review.columns = ['stemmed_review']
stemmed_review

|   | stemmed_review |
|---|---|
| 0 | dr. goldberg offer everyth i look for in a gen... |
| 1 | unfortunately, the frustrat of be dr. goldberg... |
| 2 | been go to dr. goldberg for over 10 years. I t... |
| 3 | got a letter in the mail last week that said d... |
| 4 | I don't know what dr. goldberg wa like befor m... |
| 5 | top notch doctor in a top notch practice. can'... |
| 6 | dr. eric goldberg is a fantast doctor who ha c... |
| 7 | i'm write thi review to give you a head up bef... |
| 8 | wing sauc is like water. pretti much a lot of ... |
| 9 | decent rang somewhat close to the city. the ma... |


In [14]:
count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')  # capture each token (1-gram)
count_vect.fit(stemmed_review['stemmed_review'])
count_vect

CountVectorizer(analyzer='word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='\\w{1,}', tokenizer=None,
        vocabulary=None)

In [15]:
train_ct = count_vect.transform(stemmed_review['stemmed_review'])
train_ct  # this can be used in most machine learning models

<10x410 sparse matrix of type '<type 'numpy.int64'>'
	with 705 stored elements in Compressed Sparse Row format>

In [16]:
# this is the format used in XGBoost
csc_train_ct = train_ct.tocsc()  # Convert from Compressed Sparse Row format to Compressed Sparse Column format
csc_train_ct

<10x410 sparse matrix of type '<type 'numpy.int64'>'
	with 705 stored elements in Compressed Sparse Column format>

### Method 2 - TF-IDF Features

* word tf-idf (1-gram)
* n-gram tf-idf
* character n-gram tf-idf


* Term Frequency - TF
  * TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)
* Inverse Document Frequency - IDF
  * IDF(t) = log_e(Total number of documents / Number of documents with term t in it)
* tf-idf = tf * idf
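
The formulas above can be worked through by hand. The following is a minimal sketch in plain Python; `toy_docs` is a made-up, already tokenized corpus, and the helper names `tf`, `idf` and `tf_idf` are only for this illustration.

```python
import math

# Hypothetical tokenized corpus, made up for illustration
toy_docs = [
    ["good", "food", "good", "service"],
    ["bad", "food"],
    ["good", "price"],
]

def tf(term, doc):
    # Share of the document's tokens that are `term`
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Natural log of (number of documents / documents containing `term`).
    # Assumes `term` appears in at least one document.
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

for term in ["good", "service"]:
    print(term, round(tf_idf(term, toy_docs[0], toy_docs), 4))
```

In the first document "good" is more frequent than "service", but "service" gets the higher score because it appears in only one of the three documents. Note that `TfidfVectorizer` does not use exactly these formulas: by default it uses raw counts for TF, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and then L2-normalises each row, so its numbers will differ from this sketch.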

In [17]:
# word tf-idf
tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)
tfidf_vect.fit(sample_train['yelp_review'])

train_tfidf =  tfidf_vect.transform(sample_train['yelp_review'])
train_tfidf

<10x417 sparse matrix of type '<type 'numpy.float64'>'
	with 715 stored elements in Compressed Sparse Row format>

In [18]:
train_tfidf.todense()

matrix([[ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.17919416,  0.        ],
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.09683442,  0.        ],
        [ 0.        ,  0.        ,  0.11186558, ...,  0.19019193,
          0.0600526 ,  0.16639546],
        ..., 
        [ 0.05668819,  0.        ,  0.        , ...,  0.        ,
          0.06086364,  0.04216068],
        [ 0.        ,  0.11822467,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ]])

In [19]:
# 2,3,4-gram tf-idf
tfidf_vect_ngram = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', ngram_range=(2,4), max_features=5000)
tfidf_vect_ngram.fit(sample_train['yelp_review'])

train_tfidf_ngram = tfidf_vect_ngram.transform(sample_train['yelp_review'])
train_tfidf_ngram

<10x2988 sparse matrix of type '<type 'numpy.float64'>'
	with 3089 stored elements in Compressed Sparse Row format>

In [20]:
train_tfidf_ngram.todense()

matrix([[ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        [ 0.        ,  0.        ,  0.        , ...,  0.05941078,
          0.05941078,  0.05941078],
        ..., 
        [ 0.03894939,  0.03894939,  0.03894939, ...,  0.        ,
          0.        ,  0.        ],
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ]])

In [21]:
# character n-gram tf-idf
# token_pattern only applies when analyzer='word', so it is omitted here
tfidf_vect_char = TfidfVectorizer(analyzer='char', ngram_range=(2,4), max_features=5000)
tfidf_vect_char.fit(sample_train['yelp_review'])

train_tfidf_char = tfidf_vect_char.transform(sample_train['yelp_review'])
train_tfidf_char

<10x4838 sparse matrix of type '<type 'numpy.float64'>'
	with 9836 stored elements in Compressed Sparse Row format>

In [22]:
train_tfidf_char.todense()

matrix([[ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        ..., 
        [ 0.01272465,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        [ 0.02983573,  0.03509709,  0.03509709, ...,  0.        ,
          0.        ,  0.        ],
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ]])
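
To make the difference between the word n-grams and character n-grams used above concrete, here is a plain-Python sketch. The helpers `word_ngrams` and `char_ngrams` are hypothetical and only approximate what scikit-learn does; its own analyzers differ in tokenization and whitespace handling.

```python
def word_ngrams(text, n_min, n_max):
    """All runs of n_min..n_max consecutive whitespace-separated words."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def char_ngrams(text, n_min, n_max):
    """All runs of n_min..n_max consecutive characters (spaces included)."""
    text = text.lower()
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]

toy_text = "Wing sauce is like water"  # opening words of review 8
toy_word_grams = word_ngrams(toy_text, 2, 4)
toy_char_grams = char_ngrams("wing", 2, 4)
print(toy_word_grams)
print(toy_char_grams)
```

Even for a single short word the character version produces several features, which is consistent with the character-level matrix above being the widest of the three.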

### Word Embedding Features

* I'm not a big fan of the method used in the reference, so I'm using another one. The goal is simply to convert the words of each review into numerical values, and these values come from pretrained word vectors.
* Because the final output is a 2D numpy array, each document (row) must have the same length.


* To download "en_core_web_lg", run `python -m spacy download en_core_web_lg`; this may require `sudo` or an administrator prompt.
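
By default, spaCy's `Doc.vector` is the average of the token vectors, which is why every review ends up with the same length no matter how many words it has. The idea can be sketched with NumPy alone; the 3-dimensional `toy_vectors` table and the `doc_vector` helper below are made up for illustration (the real model here produces 300 dimensions).

```python
import numpy as np

# Hypothetical 3-dimensional word vectors, made up for illustration
toy_vectors = {
    "top":    np.array([1.0, 0.0, 2.0]),
    "notch":  np.array([0.0, 2.0, 2.0]),
    "doctor": np.array([2.0, 1.0, 0.0]),
}

def doc_vector(text):
    """Average the vectors of the known words in `text`."""
    vecs = [toy_vectors[w] for w in text.lower().split() if w in toy_vectors]
    if not vecs:
        return np.zeros(3)  # no known words: fall back to a zero vector
    return np.mean(vecs, axis=0)

short_doc = doc_vector("top doctor")
long_doc = doc_vector("top notch doctor top notch")

# Both documents give a vector of the same length, so they stack into a 2D array
toy_features = np.stack([short_doc, long_doc], axis=0)
print(toy_features.shape)
```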

In [33]:
import spacy
import numpy as np

nlp = spacy.load("en_core_web_lg")

In [47]:
def spacy_word2vec(s):
    # Vector for the review passed in (not always the first review)
    return nlp(s).vector

word2vec_lst = sample_train['yelp_review'].apply(spacy_word2vec).values
print(len(word2vec_lst[0]))

300


In [48]:
word2vec_features = np.stack(word2vec_lst, axis=0)
word2vec_features.shape

(10, 300)