# Feature engineering

We will use the price, points, winery, title and description columns to build the datasets for the ML models.

Four datasets will be created, each adding one feature to the previous one:
   - dataset_desc - contains just description and points features
   - dataset_title - contains description, title and points features
   - dataset_price - contains description, title, price and points features
   - dataset_winery - contains description, title, price, winery and points features

## Description feature encoding

In order to perform machine learning on text documents, we first need to turn the text content into numerical feature vectors.

Two approaches will be used:
  - **Bag of words** - turns a collection of text documents into numerical feature vectors through tokenization, counting and normalization. This strategy is also called the “Bag of n-grams” representation. Documents are described by word occurrences, and the relative position of the words in the document is ignored completely.
  - **Word embeddings** - give us an efficient, dense representation in which similar words have a similar encoding. An embedding is a dense vector of floating point values. A higher-dimensional embedding can capture fine-grained relationships between words, but takes more data to learn.
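
As a small illustration of the bag of words idea, the sketch below encodes two made-up tasting notes that use the same words in a different order. The documents and the names `toy_docs`, `toy_vect` and `toy_counts` are assumptions for this example only and are not part of the wine dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents with the same words in a different order.
toy_docs = ["ripe cherry and soft tannins", "soft tannins and ripe cherry"]

# Default settings: lowercase the text and count every token of 2+ characters.
toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_docs).toarray()

# Word order is discarded, so both documents get the same count vector.
same_vector = bool((toy_counts[0] == toy_counts[1]).all())
print(same_vector)
```

Because only the counts are kept, any model trained on these vectors cannot distinguish the two sentences.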

### Bag of words

In [53]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

df = pd.read_csv("data/winemag-data-130k-v2.csv", index_col=0)
df.shape

(129971, 13)

Text preprocessing, tokenizing and filtering of stop words are all included in CountVectorizer, which builds a dictionary of features and transforms documents into feature vectors.

In [54]:
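# Keep the 1000 most frequent terms and drop the built-in English stop words
count_vect = CountVectorizer(max_features=1000, stop_words='english')
# Learn the vocabulary and return a sparse document-term count matrix
df_count = count_vect.fit_transform(df.description)
df_count.shape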

(129971, 1000)

Occurrence count is a good start but there is an issue: longer documents will have higher average count values than shorter documents, even though they might talk about the same topics.

To avoid these potential discrepancies it suffices to divide the number of occurrences of each word in a document by the total number of words in the document: these new features are called tf for Term Frequencies.
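
A minimal sketch of this normalization, using a hypothetical count matrix in which the second document is simply the first one repeated twice:

```python
import numpy as np

# Hypothetical count matrix: 2 documents x 3 vocabulary terms.
# The second row is the first document written out twice.
toy_counts = np.array([[2, 1, 1],
                       [4, 2, 2]])

# Term frequency: divide each row by the total number of counted words in it.
toy_tf = toy_counts / toy_counts.sum(axis=1, keepdims=True)
print(toy_tf)
```

Both rows end up with the same term frequencies, so the longer document no longer looks different from the shorter one.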

Another refinement on top of tf is to downscale weights for words that occur in many documents in the corpus and are therefore less informative than those that occur only in a smaller portion of the corpus.

This downscaling is called tf–idf for “Term Frequency times Inverse Document Frequency”.
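
The sketch below reproduces the weighting by hand on a small hypothetical count matrix and compares it with `TfidfTransformer`. It assumes scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`), under which the idf is `ln((1 + n) / (1 + df)) + 1` and each row is scaled to unit Euclidean length rather than divided by the document's word count.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical counts: 3 documents x 3 terms. Term 0 occurs in every document.
toy_counts = np.array([[3, 0, 1],
                       [2, 0, 0],
                       [3, 2, 0]])

n_docs = toy_counts.shape[0]
doc_freq = (toy_counts > 0).sum(axis=0)   # number of documents containing each term

# Smoothed idf as used by scikit-learn's defaults.
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the raw counts by idf, then L2-normalize each row.
weighted = toy_counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

sklearn_tfidf = TfidfTransformer().fit_transform(toy_counts).toarray()
matches = bool(np.allclose(manual_tfidf, sklearn_tfidf))
print(matches)
```

Term 0 appears in all three documents, so it receives the smallest idf weight of the three terms.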

In [55]:
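# Learn the idf weights from the count matrix
tf_transformer = TfidfTransformer(use_idf=True).fit(df_count)
# Rescale the counts to tf-idf values (rows are L2-normalized by default)
df_encode = tf_transformer.transform(df_count)
df_encode.shape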

(129971, 1000)

The new dataset contains 1000 encoded features, which will be fed to the ML models.
We store it as the first dataset.

In [56]:
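# Convert the sparse tf-idf matrix to a dense DataFrame (about 130M float values, roughly 1 GB in memory)
df_encode = pd.DataFrame(df_encode.toarray())
df_encode.to_csv('data/dataset_desc_bag.csv')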
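
Most entries of a tf-idf matrix are zero, so a dense CSV can become large. One possible alternative, not used in this notebook, is to keep the matrix sparse and store it with SciPy. The sketch below shows the round trip on a tiny hypothetical matrix. The file name `toy_desc_bag.npz` and the temporary directory are assumptions for illustration.

```python
import os
import tempfile

import numpy as np
from scipy import sparse

# Tiny hypothetical stand-in for the sparse tf-idf matrix.
toy_sparse = sparse.csr_matrix(np.array([[0.0, 0.5, 0.0],
                                         [0.7, 0.0, 0.0]]))

# Save in SciPy's compressed .npz format and load it back.
out_path = os.path.join(tempfile.mkdtemp(), "toy_desc_bag.npz")
sparse.save_npz(out_path, toy_sparse)
restored = sparse.load_npz(out_path)

# The restored matrix has the same shape and values as the original.
roundtrip_ok = (restored != toy_sparse).nnz == 0
print(roundtrip_ok)
```

Only the non-zero values are written, at the cost of a file that is not human-readable the way a CSV is.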