Once the data is augmented we need to extract the features, which in this case simply means using tf-idf (you might also want to try plain word counts with `CountVectorizer`). The code below corresponds to `feature_extraction.py` in this repo. You may run into memory issues when running the script, mostly because of `fastai`'s tokenizer. If that happens, you could go to these lines:

```python
reviews = df['reviewText'].tolist()
tok_reviews = tok.process_all(reviews)
```

and break the tokenization into chunks.
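
One way to do the chunking can be sketched as follows. This is a minimal sketch, not code from `feature_extraction.py`: the helper `process_in_chunks`, its `chunk_size` argument and the stand-in `whitespace_tokenize` function are all hypothetical names introduced here so that the example runs without `fastai`.

```python
from typing import Callable, List, Sequence


def process_in_chunks(
    texts: Sequence[str],
    process_fn: Callable[[List[str]], List[List[str]]],
    chunk_size: int = 10000,
) -> List[List[str]]:
    """Apply `process_fn` to `texts` one chunk at a time and concatenate the results."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    out: List[List[str]] = []
    for start in range(0, len(texts), chunk_size):
        # only `chunk_size` raw texts are handed to the tokenizer at once
        chunk = list(texts[start:start + chunk_size])
        out.extend(process_fn(chunk))
    return out


def whitespace_tokenize(batch: List[str]) -> List[List[str]]:
    """Hypothetical stand-in for the real tokenizer (assumption, not fastai)."""
    return [text.split() for text in batch]


reviews = ["great product", "did not work", "would buy again", "meh", "five stars"]
tok_reviews = process_in_chunks(reviews, whitespace_tokenize, chunk_size=2)
```

In the script itself one would presumably pass `tok.process_all` as `process_fn` and pick a `chunk_size` that fits in memory. The order of the reviews is preserved, so the labels in `df` still line up with the tokenized output.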

As you can see, I am using `fastai`'s tokenizer applied directly to the text without any preprocessing (no lowercasing, no lemmatization, no punctuation removal, nothing). This is because that tokenizer does a number of things under the hood, and the more information there is in the text the better. For example, it has tokens to indicate whether the next word starts with an upper-case letter, or whether some characters are repeated (and how many times), etc. All of this might be relevant when classifying reviews.
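
To make the idea concrete, here is a toy imitation of those two rules in plain Python. It is not `fastai`'s code: the functions `mark_repeats`, `mark_caps` and `toy_tokenize` are hypothetical, the marker strings `xxmaj` and `xxrep` are modelled on the ones I recall `fastai` v1 using (treat the exact names as an assumption), and the real tokenizer does considerably more.

```python
import re

# Assumed marker names, modelled on fastai v1's special tokens
MAJ, REP = "xxmaj", "xxrep"


def mark_repeats(text: str) -> str:
    """Replace a run of 4+ identical non-space characters with `xxrep <count> <char>`."""
    def _sub(match: re.Match) -> str:
        char, run = match.group(1), match.group(0)
        return f" {REP} {len(run)} {char} "
    return re.sub(r"(\S)\1{3,}", _sub, text)


def mark_caps(tokens: list) -> list:
    """Lowercase every token, inserting `xxmaj` before tokens that were Capitalised."""
    out = []
    for tok in tokens:
        if len(tok) > 1 and tok[0].isupper() and tok[1:].islower():
            out.append(MAJ)
        out.append(tok.lower())
    return out


def toy_tokenize(text: str) -> list:
    # whitespace splitting stands in for a real word tokenizer
    return mark_caps(mark_repeats(text).split())


toy_tokens = toy_tokenize("This is sooooo Good")
# toy_tokens == ['xxmaj', 'this', 'is', 's', 'xxrep', '5', 'o', 'xxmaj', 'good']
```

The casing and the elongated "sooooo" survive as explicit tokens instead of being thrown away by lowercasing or normalisation, so they can end up as tf-idf features.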

The process of building the features (i.e. the tf-idf matrix) is as follows:

In [1]:
import pandas as pd
import numpy as np
import os
import pickle

from pathlib import Path
from sklearn.utils import Bunch
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from fastai.text import Tokenizer
from eda import EDA

tok = Tokenizer()

def extract_features(df, out_path, max_features=30000, vectorizer=None):

    reviews = df['reviewText'].tolist()
    tok_reviews = tok.process_all(reviews)

    if vectorizer is None:
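        # the reviews are already tokenized, so pass identity functions to
        # bypass sklearn's own preprocessing and tokenization
        vectorizer = TfidfVectorizer(max_features=max_features, preprocessor=lambda x: x,
            tokenizer=lambda x: x, min_df=5)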
        X = vectorizer.fit_transform(tok_reviews)
    else:
        X = vectorizer.transform(tok_reviews)

    featset = Bunch(X=X, y=df.overall.values)
    # use a context manager so the file handle is closed after writing
    with open(out_path, 'wb') as f:
        pickle.dump(featset, f)

    return vectorizer
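
For intuition, the fit-on-train / transform-on-everything-else pattern used by `extract_features` can be sketched in plain Python. This is a toy, not scikit-learn's implementation: the helper names `fit_idf` and `transform` are hypothetical, `max_features` is ignored, and the weighting follows what I understand to be `TfidfVectorizer`'s defaults (smoothed idf, l2-normalised rows).

```python
import math
from collections import Counter


def fit_idf(tok_docs, min_df=1):
    """Learn the vocabulary and idf weights from already-tokenized training docs."""
    n_docs = len(tok_docs)
    # document frequency: in how many docs does each term appear?
    doc_freq = Counter(term for doc in tok_docs for term in set(doc))
    return {
        term: math.log((1 + n_docs) / (1 + df)) + 1.0  # smoothed idf
        for term, df in doc_freq.items()
        if df >= min_df
    }


def transform(tok_docs, idf):
    """Turn tokenized docs into l2-normalised tf-idf rows using a fitted vocabulary."""
    rows = []
    for doc in tok_docs:
        # terms not seen during fitting are silently dropped
        counts = Counter(term for term in doc if term in idf)
        weights = {term: count * idf[term] for term, count in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows


train_docs = [["good", "phone"], ["bad", "phone"]]
valid_docs = [["good", "unseen"]]

idf = fit_idf(train_docs)               # vocabulary comes from train only
train_rows = transform(train_docs, idf)
valid_rows = transform(valid_docs, idf)  # "unseen" has no column, so it is ignored
```

This is why the function returns the fitted vectorizer: the validation and test sets have to be projected onto the vocabulary and idf weights learned from the training set, never refitted on their own.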

In [2]:
DATA_PATH  = Path('../data')
FEAT_PATH  = Path('../features')
FTRAIN_PATH = FEAT_PATH/'train'
FVALID_PATH = FEAT_PATH/'valid'
FTEST_PATH  = FEAT_PATH/'test'

### ORIGINAL TEXT

(the code below will take a few seconds to run on a c5.4xlarge EC2 instance)

In [3]:
train = pd.read_csv(DATA_PATH/'train/train.csv')
valid = pd.read_csv(DATA_PATH/'valid/valid.csv')
test  = pd.read_csv(DATA_PATH/'test/test.csv')

# we will tune parameters with the 80% train and 10% validation
print("building train/valid features for original dataset")
vec = extract_features(train, out_path=FTRAIN_PATH/'ftrain.p')
_ = extract_features(valid, vectorizer=vec, out_path=FVALID_PATH/'fvalid.p')

# once we have tuned parameters we will train on 'train+valid' (90%) and test
# on 'test' (10%)
print("building train/test features for original dataset")
full_train = pd.concat([train, valid]).sample(frac=1).reset_index(drop=True)
fvec = extract_features(full_train, out_path=FTRAIN_PATH/'fftrain.p')
_ = extract_features(test, vectorizer=fvec, out_path=FTEST_PATH/'ftest.p')
del (train, vec, fvec)

building train/valid features for original dataset
building train/test features for original dataset


### AUGMENTED TEXT

This will take a bit longer and, depending on your machine, you might run into memory issues. In fact, I ran into them myself. To get around the memory error I simply ran the whole script (`feature_extraction.py`) with the last few lines commented out, and then ran those last few lines on their own:

```python
print("building train/test features for augmented dataset")
a_full_train = (pd.read_csv(DATA_PATH/'train/a_full_train.csv', engine='python', sep="::",
    names=['overall', 'reviewText'])
    .sample(frac=1)
    .reset_index(drop=True))
a_fvec = extract_features(a_full_train, out_path=FTRAIN_PATH/'a_fftrain.p')
del a_full_train
_ = extract_features(test, vectorizer=a_fvec, out_path=FTEST_PATH/'a_ftest.p')
```

Also, for some reason this does not work in a notebook, but it does work from the terminal. A more elegant solution is to break the reviews into chunks.

In [4]:
# AUGMENTED
# Only the training set "at the time" must be augmented
a_train = (pd.read_csv(DATA_PATH/'train/a_train.csv', engine='python', sep="::",
    names=['overall', 'reviewText'])
    .sample(frac=1)
    .reset_index(drop=True))

# we will tune parameters with the 80% train and 10% validation. At this
# stage, the validation set should not be augmented, but we need to compute
# the validation features with the "augmented vectorizer"
print("building train/valid features for augmented dataset")
a_vec = extract_features(a_train, out_path=FTRAIN_PATH/'a_ftrain.p')
_ = extract_features(valid, vectorizer=a_vec, out_path=FVALID_PATH/'a_fvalid.p')

# once we have tuned parameters we will:
# 1-augment 'train+valid'
# 2-train the vectorizer on augmented dataset
# 3-use the augmented vectorizer on 'test'
print("building augmented dataset for train+valid")
eda = EDA(rs=False)
full_train = list(full_train[['reviewText', 'overall']].itertuples(index=False, name=None))
eda.augment(full_train, out_file=DATA_PATH/'train/a_full_train.csv')
del (valid, full_train)

print("building train/test features for augmented dataset")
a_full_train = (pd.read_csv(DATA_PATH/'train/a_full_train.csv', engine='python', sep="::",
    names=['overall', 'reviewText'])
    .sample(frac=1)
    .reset_index(drop=True))
a_fvec = extract_features(a_full_train, out_path=FTRAIN_PATH/'a_fftrain.p')
del a_full_train
_ = extract_features(test, vectorizer=a_fvec, out_path=FTEST_PATH/'a_ftest.p')

building train/valid features for augmented dataset


And once the features are created, it is just a matter of "plugging them" into LightGBM and off we go.