# (1) Missing Values


Most models cannot handle missing values in the training data. This notebook collects strategies for working with incomplete data sets.

### When to drop, when to impute?
We need a rule of thumb for *how incomplete* a feature can be before we give up on it and drop it. For now: keep a feature if at least three quarters of its values are present, otherwise drop it.
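
The three-quarters rule above can be sketched with the `thresh` argument of DataFrame.dropna, which sets the minimum number of non-missing values a column needs in order to be kept. This is only a sketch: the toy frame, its column names and the `min_fraction` variable are invented for illustration.

```python
import math

import numpy as np
import pandas as pd

# Toy data, invented for illustration: 'b' is 75% complete, 'c' only 25%.
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [1.0, np.nan, 3.0, 4.0],
    'c': [np.nan, np.nan, np.nan, 4.0],
})

min_fraction = 0.75  # keep a column if at least this fraction of values is present
min_count = math.ceil(min_fraction * len(toy))

# thresh is the minimum number of non-missing values required to keep a column.
kept = toy.dropna(axis='columns', thresh=min_count)
print(list(kept.columns))
```

Columns that survive the threshold but still contain gaps are candidates for imputation.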


## (1.1) Dropping Incomplete Rows/Columns.
The simplest approach is to drop any feature that has a missing value, or any sample that has a missing feature.  This is pretty brutal, but if you have a million rows and only end up dropping a few of them then it's sensible.  The DataFrame.dropna method handles both cases.

In [146]:
import pandas as pd
import numpy as np
import sklearn.datasets as datasets

data = datasets.fetch_openml(name='wine_reviews', as_frame=True)
type(data)

sklearn.utils.Bunch

In [147]:
df = pd.DataFrame(data.data, columns=data.feature_names)
df = df[:1000]
df.head()

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96.0,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz
1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96.0,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez
2,US,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96.0,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley
3,US,"This spent 20 months in 30% new French oak, an...",Reserve,96.0,65.0,Oregon,Willamette Valley,Willamette Valley,Pinot Noir,Ponzi
4,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95.0,66.0,Provence,Bandol,,Provence red blend,Domaine de la Bégude


In [148]:
# Drop incomplete columns:
df.dropna(axis='columns').head()

Unnamed: 0,country,description,points,province,variety,winery
0,US,This tremendous 100% varietal wine hails from ...,96.0,California,Cabernet Sauvignon,Heitz
1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",96.0,Northern Spain,Tinta de Toro,Bodega Carmen Rodríguez
2,US,Mac Watson honors the memory of a wine once ma...,96.0,California,Sauvignon Blanc,Macauley
3,US,"This spent 20 months in 30% new French oak, an...",96.0,Oregon,Pinot Noir,Ponzi
4,France,"This is the top wine from La Bégude, named aft...",95.0,Provence,Provence red blend,Domaine de la Bégude


In [149]:
# Drop incomplete rows:
df.dropna().head()

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96.0,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz
2,US,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96.0,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley
3,US,"This spent 20 months in 30% new French oak, an...",Reserve,96.0,65.0,Oregon,Willamette Valley,Willamette Valley,Pinot Noir,Ponzi
8,US,This re-named vineyard was formerly bottled as...,Silice,95.0,65.0,Oregon,Chehalem Mountains,Willamette Valley,Pinot Noir,Bergström
9,US,The producer sources from two blocks of the vi...,Gap's Crown Vineyard,95.0,60.0,California,Sonoma Coast,Sonoma,Pinot Noir,Blue Farm


## (1.2) Imputation.
Take note of the methods in SimpleImputer.
- fit_transform(X) will fit the imputer, i.e. determine the mean (or whatever strategy you specify in the initialiser) and replace missing values in one hit.
- Once fitted, transform(X) will just replace missing values with the mean (or whatever) previously determined.

<span style="color:red">**What is the correct approach here: to combine training and test data sets and then impute, or fit the imputer on the training data and use it for both?**</span>

**Edit: Definitely fit it on the training data only.** Fitting on the combined data would leak information about the test set into the imputed values.
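
A minimal sketch of the train-only approach, using toy arrays invented for illustration (`X_train` and `X_test` are not part of the wine data): the imputer learns its statistic from the training split and reuses it unchanged on the test split.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy arrays, invented for illustration.
X_train = np.array([[1.0], [3.0], [np.nan], [8.0]])
X_test = np.array([[np.nan], [100.0]])

imp = SimpleImputer(strategy='mean')
X_train_filled = imp.fit_transform(X_train)  # learns the mean from the training data
X_test_filled = imp.transform(X_test)        # reuses that mean; test values are not used to fit

print(imp.statistics_)        # the learned per-column statistic
print(X_test_filled.ravel())
```

The large value in `X_test` has no influence on what gets imputed, which is the point of fitting on the training data alone.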


### (1.2.1) Simple Imputation
Replace missing values with the mean/mode/median for that feature.

#### (1.2.1.1) Continuous
If the feature is a continuous variable, replace its missing values with the mean.


In [150]:
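from sklearn.impute import SimpleImputer

# Count the missing prices before imputing.
print(df['price'].isna().sum())
# SimpleImputer expects 2-D input, hence the double brackets.
price = df[['price']]
imp = SimpleImputer(strategy='mean')
price = imp.fit_transform(price)
df['price'] = price
# Confirm that nothing is missing afterwards.
df.price.isna().sum()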

45


0

#### (1.2.1.2) Discrete
Otherwise, you'll probably replace missing values with the mode or the median.
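
A small sketch of both options with SimpleImputer, on toy arrays invented for illustration. It assumes the mode suits an unordered category and the median suits an ordered, discrete value; that split is a judgment call rather than a rule.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy columns, invented for illustration.
colour = np.array([['red'], ['red'], [np.nan], ['green']], dtype=object)
rating = np.array([[1.0], [2.0], [np.nan], [5.0]])

# 'most_frequent' works on strings as well as numbers.
mode_imp = SimpleImputer(strategy='most_frequent')
colour_filled = mode_imp.fit_transform(colour)

# 'median' only works on numeric data.
median_imp = SimpleImputer(strategy='median')
rating_filled = median_imp.fit_transform(rating)

print(colour_filled.ravel())
print(rating_filled.ravel())
```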

### (1.2.2) Fancy Imputation
You can keep a record of which values were imputed by adding a boolean indicator feature alongside the imputed one.  I haven't actually seen this done in that many places.

In [151]:
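# Count the missing regions before imputing.
print(df['region_1'].isna().sum())
region = df[['region_1']]
# Record which rows were missing before the values are filled in.
df['region_1_was_missing'] = df['region_1'].isna()
# missing_values=None treats None entries as the missing marker.
imp = SimpleImputer(missing_values=None, strategy='most_frequent')
region = imp.fit_transform(region)
df['region_1'] = region
print(df['region_1'].isna().sum())
df.head()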

142
0


Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery,region_1_was_missing
0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96.0,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz,False
1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96.0,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez,False
2,US,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96.0,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley,False
3,US,"This spent 20 months in 30% new French oak, an...",Reserve,96.0,65.0,Oregon,Willamette Valley,Willamette Valley,Pinot Noir,Ponzi,False
4,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95.0,66.0,Provence,Bandol,,Provence red blend,Domaine de la Bégude,False
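
SimpleImputer can also produce the "was missing" record itself through its `add_indicator` option. A minimal sketch on a toy numeric column (values invented for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy price column with one gap, invented for illustration.
X = np.array([[235.0], [np.nan], [90.0], [65.0]])

imp = SimpleImputer(strategy='mean', add_indicator=True)
out = imp.fit_transform(X)

# Column 0 holds the imputed values; column 1 is 1.0 where the original was missing.
# Indicator columns are only added for features that had gaps when the imputer was fitted.
print(out)
```

Doing it by hand, as in the cell above, keeps the indicator as a named DataFrame column, which may be easier to track than an extra unnamed array column.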


# (2) Encoding
Most models work with numbers, so categorical labels such as 'sometimes', 'often', 'incessantly' need to be mapped to values like 1, 2, 3.
For ordinal data like this, such a transformation is sufficient. For nominal data like 'red', 'green', assigning ordered labels introduces a meaningless ordering.  One-hot encoding avoids this by creating a boolean feature for every possible value of the original nominal variable.  What's strange is that the documentation for sklearn's LabelEncoder says it is intended only for the target variable and not the input data.  I can't see why, but a huge number of walkthroughs disregard this suggestion.

In [2]:
# Pandas can handle one-hot encoding very easily.
import pandas as pd
df = pd.get_dummies(df)
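
The two cases can be sketched side by side. scikit-learn's OrdinalEncoder is its encoder for input features, and it accepts an explicit category order; pd.get_dummies covers the nominal case. The toy frame and its column names are invented for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data, invented for illustration.
toy = pd.DataFrame({
    'frequency': ['often', 'sometimes', 'incessantly', 'sometimes'],
    'colour': ['red', 'green', 'red', 'green'],
})

# Ordinal feature: spell out the order, otherwise categories are sorted alphabetically.
order = [['sometimes', 'often', 'incessantly']]
enc = OrdinalEncoder(categories=order)
toy['frequency_code'] = enc.fit_transform(toy[['frequency']]).ravel()

# Nominal feature: one indicator column per distinct value.
encoded = pd.get_dummies(toy, columns=['colour'])
print(encoded)
```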

## (2.1) Count Encoding
A.k.a. 'frequency encoding'. Replace each category with the number of times it appears in the data.
It's a nice technique.  Rare values, with one or two occurrences, will automatically get grouped together, while two high-frequency categories are unlikely to have exactly the same count and so will be faithfully represented as distinct.

In [None]:
# Not part of vanilla scikit-learn for some reason.
import category_encoders

# Here `data` is assumed to be a DataFrame that contains these columns.
categorical_features = ['favourite_boost_juice', 'sexuality']
ce = category_encoders.CountEncoder(cols=categorical_features)
count_encoded = ce.fit_transform(data[categorical_features])

data.join(count_encoded.add_suffix("_count"))
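
The same idea can be sketched in plain pandas with value_counts and map. This is a hand-rolled illustration on invented data, not necessarily what CountEncoder does internally.

```python
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({
    'variety': ['Pinot Noir', 'Pinot Noir', 'Pinot Noir', 'Malbec', 'Syrah'],
})

# Number of rows per category.
counts = toy['variety'].value_counts()

# Replace each category by its count; the two rare varieties end up with the same code.
toy['variety_count'] = toy['variety'].map(counts)
print(toy)
```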

## (2.2) Target Encoding
Replace each category with the average value of the target across all points in that category.  Did someone say target leakage?  To limit it, fit the encoder on the training data only.

In [None]:
import category_encoders

categorical_features = ['favourite_boost_juice', 'sexuality']
te = category_encoders.TargetEncoder(cols=categorical_features)
# TargetEncoder needs the target to fit; 'target' is a placeholder column name.
encoded = te.fit_transform(training_data[categorical_features], training_data['target'])
training_data = training_data.join(encoded.add_suffix('_target'))
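
A plain-pandas sketch of the idea, on invented data: the per-category means come from the training rows only and are then mapped onto validation rows. Unlike category_encoders.TargetEncoder, this sketch applies no smoothing, and the fallback to the global mean for unseen categories is an assumption made here.

```python
import pandas as pd

# Toy splits, invented for illustration.
train = pd.DataFrame({
    'country': ['US', 'US', 'France', 'France', 'Spain'],
    'points': [96.0, 90.0, 95.0, 91.0, 88.0],
})
valid = pd.DataFrame({'country': ['US', 'Italy']})

# Statistics are computed from the training data only.
global_mean = train['points'].mean()
category_means = train.groupby('country')['points'].mean()

train['country_target'] = train['country'].map(category_means)
# Categories never seen in training fall back to the global mean.
valid['country_target'] = valid['country'].map(category_means).fillna(global_mean)
print(valid)
```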

## (2.3) CatBoost Encoding
Similar to target encoding, but the encoded value for each row is computed only from the rows above it, which limits target leakage. Works well with LightGBM?

In [None]:
cbe = category_encoders.CatBoostEncoder(cols=categorical_features)
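
The "rows above only" idea can be sketched in pandas as a running mean per category that excludes the current row. This is a simplified illustration on invented data; the prior (the overall target mean) and the weight `a` are assumptions, and the library's exact formula may differ.

```python
import pandas as pd

# Toy data, invented for illustration. Row order matters for this encoding.
toy = pd.DataFrame({
    'country': ['US', 'France', 'US', 'US', 'France'],
    'points': [96.0, 95.0, 90.0, 93.0, 91.0],
})

prior = toy['points'].mean()  # assumed prior: overall target mean
a = 1.0                       # assumed weight given to the prior

grouped = toy.groupby('country')['points']
prev_sum = grouped.cumsum() - toy['points']  # target sum over earlier rows of the same category
prev_count = grouped.cumcount()              # number of earlier rows of the same category

# The first occurrence of a category has no history, so it receives the prior.
toy['country_cbe'] = (prev_sum + a * prior) / (prev_count + a)
print(toy)
```

Because each row only sees earlier rows, its own target value never contributes to its own encoding.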

# (3) Feature Creation

## (3.1) Interactions
Given two categorical features 'shape' and 'color', create a new one called 'shape_color' by combining their values.  Then maybe do this for each pair of categorical features.
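
A minimal sketch using itertools.combinations to build one interaction per pair. The toy frame and the third feature 'texture' are invented for illustration.

```python
from itertools import combinations

import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({
    'shape': ['circle', 'square', 'circle'],
    'color': ['red', 'red', 'blue'],
    'texture': ['smooth', 'rough', 'smooth'],
})
categorical_features = ['shape', 'color', 'texture']

# One new feature per unordered pair, e.g. 'shape_color'.
for left, right in combinations(categorical_features, 2):
    toy[f'{left}_{right}'] = toy[left] + '_' + toy[right]

print(toy.columns.tolist())
```

The new columns are still categorical, so they need encoding like any other categorical feature.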


# (4) Transforming/Scaling

## (4.1) Transforming.
Some models will perform better when numerical features are roughly normally distributed.  For a skewed feature, try taking the square root and/or log.
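
A short sketch comparing the skew of a right-skewed toy series before and after each transform. The values are invented for illustration, and np.log1p (log of 1 + x) is used here as an assumption so that zeros would not break the transform.

```python
import numpy as np
import pandas as pd

# Right-skewed toy prices, invented for illustration.
price = pd.Series([10.0, 12.0, 15.0, 20.0, 30.0, 60.0, 235.0], name='price')

sqrt_price = np.sqrt(price)
log_price = np.log1p(price)  # log(1 + x), safe for zero values

# Skew closer to zero means a more symmetric distribution.
print('raw :', price.skew())
print('sqrt:', sqrt_price.skew())
print('log :', log_price.skew())
```

On this toy series the log pulls the long right tail in more strongly than the square root; which transform works better depends on the feature.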