<center>
  <a href="MLSD-04-FeatureEngineering-A.ipynb" target="_self">Feature Engineering A</a> | <a href="./">Content Page</a> | <a href="MLSD-04-FeatureEngineering-Ex-1.ipynb">Feature Engineering Exercise</a>
</center>

# <center>FEATURE ENGINEERING B</center>

<center><b>Copyright &copy; 2023 by DR DANNY POO</b><br> e:dannypoo@nus.edu.sg<br> w:drdannypoo.com</center><br>

# Features
- Categorical features.
- Text features.
- Imputation of missing data.
- Feature pipelines.

# Categorical Features

In [None]:
# Import libraries
import numpy as np
import pandas as pd

In [None]:
# Data
data = [
    {'price': 850000, 'rooms': 4, 'neighborhood': 'Changi'},
    {'price': 720000, 'rooms': 3, 'neighborhood': 'Kembangan'},
    {'price': 650000, 'rooms': 3, 'neighborhood': 'Pasir Panjang'},
    {'price': 950000, 'rooms': 4, 'neighborhood': 'Woodlands'},
    {'price': 830000, 'rooms': 4, 'neighborhood': 'Jurong'},
    {'price': 680000, 'rooms': 2, 'neighborhood': 'Marine Parade'}
]

**Observations**
- It would be wrong to encode the neighborhoods as integers, e.g. {'Changi': 1, 'Kembangan': 2, 'Pasir Panjang': 3, 'Woodlands': 4, 'Jurong': 5, 'Marine Parade': 6}, because a model would treat the codes as ordered numeric values (implying, for example, Changi < Kembangan < Pasir Panjang).
- Instead, use one-hot encoding, which creates one extra column per category, holding 1 if the category is present and 0 if it is absent.
- When the data comes as a list of dictionaries (as above), you can use Scikit-Learn's DictVectorizer.

In [None]:
# Encode using DictVectorizer
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer(sparse=False, dtype=int)
vec.fit_transform(data)

**Observations**
- The 'neighborhood' column has been expanded into 6 separate columns, one per neighborhood label, and each row has a 1 in the column for its neighborhood and 0 in the others.
- With the categorical feature encoded this way, we can fit a Scikit-Learn model as usual.

In [None]:
# Get the feature names
vec.get_feature_names_out()

**Observations**
- Disadvantage: if a categorical feature has many possible values, one-hot encoding can greatly increase the size of the dataset.
- Since the encoded data contains mostly zeros, use the sparse output.

In [None]:
# Set sparse=True to reduce the size of the matrix
vec = DictVectorizer(sparse=True, dtype=int)
vec.fit_transform(data)

**Observations**
- <b>sklearn.preprocessing.OneHotEncoder</b> and <b>sklearn.feature_extraction.FeatureHasher</b> are two additional tools that Scikit-Learn includes to support this type of encoding.

# Text Features

In [None]:
# Import libraries
import numpy as np
import pandas as pd

In [None]:
# Data
data = ['The Dark Knight',
        'Batman and Robin',
        'Man of Steel',
        'Superman',
        'Rise of the Dark Knight',
        'Monty Python',
        'The incredibles',
        'The Golden Goose']

In [None]:
# Construct a column representing each of the words
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
X = vec.fit_transform(data)
X

In [None]:
# Display data in a dataframe
pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())

**Observations**
- Issue: Raw word counts put too much weight on words that appear very frequently, which can be sub-optimal for some classification algorithms.
- Solution: Use <b>term frequency-inverse document frequency (TF–IDF)</b>, which weights each word count by a measure of how often the word appears across the documents, so that words common to many documents count for less.

In [None]:
# Applying term frequency-inverse document frequency (TF–IDF)
from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer()
X = vec.fit_transform(data)
pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())

# Imputation of Missing Data

In [None]:
# Import libraries
import numpy as np
from numpy import nan
import pandas as pd
from sklearn.linear_model import LinearRegression

In [None]:
# Data
X = np.array([[ nan, 0,   3  ],
              [ 3,   7,   9  ],
              [ 3,   5,   2  ],
              [ 4,   nan, 6  ],
              [ 8,   8,   1  ]])
y = np.array([14, 16, -1,  8, -5])

**Observations**
- The missing values (nan) in X must first be replaced with an appropriate fill value before a typical machine learning model can be applied.
- This is known as imputation of missing values, and strategies range from simple (e.g., replacing missing values with the <b>mean of the column</b>) to sophisticated (e.g., using matrix completion or a robust model to handle such data).

In [None]:
# Use Scikit-Learn SimpleImputer class
from sklearn.impute import SimpleImputer
imputa = SimpleImputer(missing_values=np.nan, strategy='mean')
imputa.fit(X[:, 0:3])  # learns the mean of each column, ignoring the missing values

# Replace the missing values using the transform method
X1 = imputa.transform(X[:, 0:3])
print(X1)

**Observations**
- Each of the two missing values has been replaced with the mean of the remaining values in its column.
- This imputed data can then be fed directly into, for example, a LinearRegression estimator.

In [None]:
# Fit a linear regression on the imputed data, then predict on the same data
model = LinearRegression().fit(X1, y)
model.predict(X1)
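
As a sanity check on the imputation step above, the column means can be recomputed by hand with `np.nanmean` and compared with what the imputer learned. This is a small sketch that redefines X so the cell runs on its own; `statistics_` is the fitted attribute in which SimpleImputer stores the fill value for each column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[np.nan, 0,      3],
              [3,      7,      9],
              [3,      5,      2],
              [4,      np.nan, 6],
              [8,      8,      1]])

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X_filled = imputer.fit_transform(X)  # fit and transform in one call

# Column means computed while ignoring the nan entries
col_means = np.nanmean(X, axis=0)

print(col_means)                  # means computed by hand
print(imputer.statistics_)        # fill values learned by the imputer
print(np.isnan(X_filled).any())   # no missing values remain
```

Other values of `strategy`, such as `'median'` or `'most_frequent'`, can be swapped in the same way when the mean is a poor summary of a column.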

# Feature Pipelines
- When several processing steps are involved, they can be chained into a feature processing pipeline, for example:
1. Impute missing values using the mean
2. Fit a linear regression

- Scikit-Learn's Pipeline object can be used for this.

In [None]:
# Use Scikit-Learn Pipeline
from sklearn.pipeline import make_pipeline

model = make_pipeline(SimpleImputer(strategy='mean'),
                      LinearRegression())

In [None]:
# Fit a model
model.fit(X, y)  # X with missing values, from above
print(y)
print(model.predict(X))

**Observations**
- All the steps of the pipeline are applied automatically: calling fit or predict runs the imputation first and then the linear regression.
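
The sketch below illustrates that claim by comparing the pipeline with the same two steps carried out by hand. It redefines X and y so the cell is self-contained; the variable names `pipe`, `X_filled` and `manual` are introduced here for illustration only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

X = np.array([[np.nan, 0,      3],
              [3,      7,      9],
              [3,      5,      2],
              [4,      np.nan, 6],
              [8,      8,      1]])
y = np.array([14, 16, -1, 8, -5])

# Pipeline: imputation followed by linear regression
pipe = make_pipeline(SimpleImputer(strategy='mean'), LinearRegression())
pipe.fit(X, y)

# The same two steps performed manually
X_filled = SimpleImputer(strategy='mean').fit_transform(X)
manual = LinearRegression().fit(X_filled, y)

# make_pipeline names each step after its lowercased class name
print(list(pipe.named_steps))

# Both approaches give the same predictions
print(np.allclose(pipe.predict(X), manual.predict(X_filled)))
```

A practical benefit of the pipeline is that the imputer's column means are learned only from the data passed to `fit`, so the same fill values are reused when predicting on new data.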
