# Kaggle Feature Engineering micro-course
- Better features make better models. 
- Discover how to get the most out of your data
- https://www.kaggle.com/learn/feature-engineering

## 2.- Mutual Information
- Locate features with the most potential.

### Intro
- Encountering a new dataset for the first time can feel overwhelming.

1. A great first step is to construct a ranking with a __feature utility metric__ (a function measuring associations between a feature and the target).
2. Then you can choose a smaller set of the most useful features to develop initially and have more confidence that your time will be well spent.
3. The metric we'll use is called "mutual information". 

#### Mutual information
- Mutual information is a lot like correlation in that it measures a relationship between two quantities. The advantage of mutual information is that it can detect any kind of relationship, while correlation only detects linear relationships.
- Mutual information is a great general-purpose metric and especially useful at the start of feature development when you might not know what model you'd like to use yet. It is:
  1. easy to use and interpret,
  2. computationally efficient,
  3. theoretically well-founded,
  4. resistant to overfitting, and
  5. able to detect any kind of relationship.
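
To make the contrast with correlation concrete, here is a minimal sketch on synthetic data (the data, seed, and sample size are assumptions for illustration, not part of the course). The target depends on the feature through a parabola, so the Pearson correlation should come out close to zero while the mutual information estimate is clearly positive.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)            # fixed seed, illustrative only
x = rng.uniform(-1, 1, size=2000)         # synthetic feature
y = x ** 2 + rng.normal(0, 0.05, 2000)    # parabolic dependence plus a little noise

pearson = np.corrcoef(x, y)[0, 1]
# mutual_info_regression expects a 2-D feature matrix, hence the reshape
mi = mutual_info_regression(x.reshape(-1, 1), y, random_state=0)[0]

print(f"Pearson correlation: {pearson:.3f}")
print(f"Mutual information:  {mi:.3f}")
```

The exact numbers depend on the seed and on the nearest-neighbor estimator scikit-learn uses, so treat them as indicative rather than exact.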

### Mutual Information and What It Measures

- Mutual information describes relationships in terms of uncertainty.
- The __mutual information__ (MI) between two quantities is a measure of the extent to which knowledge of one quantity reduces uncertainty about the other.
- If you knew the value of a feature, how much more confident would you be about the target?
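
As a rough illustration of "reduction in uncertainty", the sketch below computes MI for two small discrete samples directly from the definition $I(X;Y) = \sum_{x,y} p(x,y)\,\ln\frac{p(x,y)}{p(x)\,p(y)}$. The helper name `mutual_information` and the toy data are assumptions made for this example; this is not how scikit-learn estimates MI for continuous features.

```python
from collections import Counter
from math import log

def mutual_information(xs, ys):
    """MI (in nats) between two equal-length sequences of discrete values."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    return sum(
        (c / n) * log((c / n) / ((count_x[x] / n) * (count_y[y] / n)))
        for (x, y), c in count_xy.items()
    )

# Knowing the feature removes all uncertainty about the target
informative = mutual_information([0, 0, 1, 1], ["low", "low", "high", "high"])
# Knowing the feature tells us nothing about the target
useless = mutual_information([0, 1, 0, 1], ["low", "low", "high", "high"])

print(informative, useless)
```

In the first case the score equals the full uncertainty of a balanced two-valued target (ln 2 in nats); in the second case it is zero.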

### Interpreting Mutual Information Scores
- The least possible mutual information between quantities is 0.0. When MI = 0, the quantities are totally independent.
- Scores above 2.0 or so are uncommon: MI is a logarithmic quantity, so it increases very slowly.
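
One way to get a feel for this scale, assuming MI is measured in nats (natural logarithm) and both quantities are discrete: MI can never exceed the entropy of the target, and a target spread evenly over $k$ values has entropy $\ln k$.

$$
I(X;Y) \le H(Y), \qquad H(Y) = \ln k \quad \text{when } Y \text{ is uniform over } k \text{ values}
$$

For example, $\ln 2 \approx 0.69$ and $\ln 8 \approx 2.08$, so under these assumptions a score near 2 corresponds to a feature that pins down one of roughly eight equally likely outcomes.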

1. MI can help you to understand the relative potential of a feature as a predictor of the target, considered by itself.
2. It's possible for a feature to be very informative when interacting with other features, but not so informative all alone. MI can't detect interactions between features. It is a univariate metric.
3. The actual usefulness of a feature depends on the model you use it with. A feature is only useful to the extent that its relationship with the target is one your model can learn. Just because a feature has a high MI score doesn't mean your model will be able to do anything with that information. You may need to transform the feature first to expose the association.
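
The second point can be sketched with an XOR target: each feature on its own carries no information, yet the pair determines the target completely. The toy arrays are assumptions for illustration, and `mutual_info_score` from `sklearn.metrics` is used here because it computes MI between two discrete label vectors.

```python
from math import log

import numpy as np
from sklearn.metrics import mutual_info_score

# All four combinations of two binary features, repeated to form a sample
x1 = np.tile([0, 0, 1, 1], 25)
x2 = np.tile([0, 1, 0, 1], 25)
y = x1 ^ x2                       # target is the XOR of the two features

mi_x1 = mutual_info_score(x1, y)  # feature 1 considered by itself
mi_x2 = mutual_info_score(x2, y)  # feature 2 considered by itself
# Encode the pair (x1, x2) as a single label to score both features jointly
mi_joint = mutual_info_score(2 * x1 + x2, y)

print(mi_x1, mi_x2, mi_joint)
```

A univariate ranking would place both features at the bottom even though a model that can learn their interaction would predict the target perfectly.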

### Example - 1985 Automobiles

The Automobile dataset used in the tutorial consists of 193 cars from the 1985 model year (the raw file loaded below has 205 rows, some with missing values). The goal for this dataset is to predict a car's price (the target) from 23 of the car's features, such as make, body_style, and horsepower (column names are hyphenated in this file, e.g. body-style). In this example, we'll rank the features with mutual information and investigate the results with data visualization.

In [10]:
### Import necessary libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import requests
import zipfile as zfm

In [11]:
### Write url w/zipfile path
import io

# Data that define repo and filename w/path
ro = 'jmonti-gh'                  # repo_owner
rn = 'Datasets'                   # repo_name
zipfln = 'Automobile_data.zip'
dataset = 'Automobile_data.csv'

# Proxy settings, needed only if a proxy is used
proxies = {
  'http': 'http://USER:PASSWORD@172.16.1.49:3128',    # placeholders: never hard-code real credentials
  'https': 'http://USER:PASSWORD@172.16.1.49:3128'
}

# url where to obtain the response
url = f'https://raw.githubusercontent.com/{ro}/{rn}/main/{zipfln}'

In [12]:
### try-except block to get the zipfile containing the dataset
try:
    r = requests.get(url)
    print('No Proxy needed')
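except OSError as oe:
    if 'ProxyError' in str(oe):
        # The direct request failed at the proxy layer: retry with explicit proxies
        r = requests.get(url, proxies=proxies)
        print('Proxy used!')
    else:
        ln = '-' * 5 + '\n'
        for er in [oe, oe.args]:
            print(ln, er, '\nType: ', type(er), sep='')
        raise  # re-raise so later cells don't run with `r` undefined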

Proxy used!


In [13]:
### Read the zipfile and load the dataset
with zfm.ZipFile(io.BytesIO(r.content)) as zf:
    print(zf.namelist())
    df = pd.read_csv(zf.open(dataset))

print(df.shape)
df.iloc[[0, 9, -9, -1]]

['Automobile_data.csv']
(205, 26)


Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
9,0,?,audi,gas,turbo,two,hatchback,4wd,front,99.5,...,131,mpfi,3.13,3.4,7.0,160,5500,16,22,?
196,-2,103,volvo,gas,std,four,sedan,rwd,front,104.3,...,141,mpfi,3.78,3.15,9.5,114,5400,24,28,15985
204,-1,95,volvo,gas,turbo,four,sedan,rwd,front,109.1,...,141,mpfi,3.78,3.15,9.5,114,5400,19,25,22625


In [16]:
fp = 'c:/users/jmonti/downloads/autos_csv.csv'
df1 = pd.read_csv(fp)
print(df1.shape)
df1.iloc[[0, 9, -9, -1]]

(205, 26)


Unnamed: 0,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,length,...,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price,symboling
0,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,...,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,13495.0,3
9,,audi,gas,turbo,two,hatchback,4wd,front,99.5,178.2,...,mpfi,3.13,3.4,7.0,160.0,5500.0,16,22,,0
196,103.0,volvo,gas,std,four,sedan,rwd,front,104.3,188.8,...,mpfi,3.78,3.15,9.5,114.0,5400.0,24,28,15985.0,-2
204,95.0,volvo,gas,turbo,four,sedan,rwd,front,109.1,188.8,...,mpfi,3.78,3.15,9.5,114.0,5400.0,19,25,22625.0,-1


In [24]:
# df.columns            # Index
# df.columns.values     # array
# df.columns.tolist()   # Py list

In [None]:
### The 'normalized-losses' column doesn't exist in the tutorial's dataset, so drop it
print(df.columns)
df.drop('normalized-losses', axis=1, inplace=True)
print(df.shape)
df.iloc[[0, 9, -9, -1]]

> This Dataset has '?' instead of NaNs
- To avoid errors later on, we should convert every '?' to NaN.
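
An alternative worth knowing, sketched here on a tiny stand-in CSV (the three rows are hypothetical, not taken from the real file): `pd.read_csv` accepts an `na_values` argument, so '?' can be parsed as NaN at load time. A side effect is that numeric columns stay numeric instead of being read as strings.

```python
import io

import pandas as pd

# Tiny stand-in for the real file (hypothetical rows, same '?' convention)
csv_text = "make,horsepower,price\naudi,160,?\nvolvo,?,15985\nbmw,101,16430\n"

# na_values='?' tells pandas to treat '?' as a missing value while reading
sample = pd.read_csv(io.StringIO(csv_text), na_values="?")

print(sample.dtypes)        # horsepower and price are parsed as float64
print(sample.isna().sum())  # one NaN in each of those two columns
```

With the real data the same idea would apply to the `pd.read_csv(zf.open(dataset))` call, assuming '?' is the only missing-value marker in the file.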

In [None]:
### Let's see all the '?' present
display(df[df.eq('?').any(axis=1)])
# Cols w/'?'
df.columns[df.isin(['?']).any()]

In [None]:
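### Replace all '?' by NaNs
df = df.replace(['?'], np.nan)
# Columns that held '?' were read as strings: convert back the all-numeric ones (e.g. price)
for col in df.select_dtypes("object"):
    as_num = pd.to_numeric(df[col], errors="coerce")
    if as_num.notna().sum() == df[col].notna().sum():
        df[col] = as_num
### Verify that no '?' remain
display(df[df.eq('?').any(axis=1)])
# Cols w/'?' (should be empty now)
df.columns[df.isin(['?']).any()]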

In [None]:
### Now let's see all the NaNs present
display(df[df.isna().any(axis=1)])
# Cols w/NaNs
df.columns[df.isna().any()]

In [None]:
#df.loc[[27, 63]]

> NOW!, eliminate that NaNs.
- The scikit-learn MI functions cannot handle NaNs, so they must be filled (or dropped) first.

In [None]:
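# Back-fill: each NaN takes the next valid value in its column
# (fillna(method='bfill') is deprecated in recent pandas versions)
df = df.bfill()
### Verify that no NaNs remain
display(df[df.isna().any(axis=1)])
# Cols still holding NaNs, if any
df.columns[df.isna().any()]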

The scikit-learn algorithm for MI treats discrete features differently from continuous features. Consequently, you need to tell it which are which. As a rule of thumb, anything that must have a float dtype is not discrete. Categoricals (object or categorical dtype) can be treated as discrete by giving them a label encoding. (You can review label encodings in our Categorical Variables lesson.)

In [None]:
X = df.copy()
y = X.pop("price")

# Label encoding for categoricals
for colname in X.select_dtypes("object"):
    X[colname], _ = X[colname].factorize()

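# All discrete features should now have integer dtypes
# (double-check this before using MI!)
# Note: `X.dtypes == int` can miss int64 columns on platforms where the default int is 32-bit
discrete_features = np.array([pd.api.types.is_integer_dtype(t) for t in X.dtypes])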

In [None]:
### Check that each column is int or float
print(X.shape)
X.info()

Scikit-learn has two mutual information metrics in its feature_selection module: one for real-valued targets (mutual_info_regression) and one for categorical targets (mutual_info_classif). Our target, price, is real-valued. The next cell computes the MI scores for our features and wraps them up in a nice dataframe.

In [None]:
# from sklearn.feature_selection import mutual_info_regression
# mutual_info = mutual_info_regression(X, y)
# mutual_info

In [None]:
from sklearn.feature_selection import mutual_info_regression

def make_mi_scores(X, y, discrete_features):
    mi_scores = mutual_info_regression(X, y,
                                       discrete_features=discrete_features)
    mi_scores = pd.Series(mi_scores, name="MI Scores", index=X.columns)
    mi_scores = mi_scores.sort_values(ascending=False)
    return mi_scores

mi_scores = make_mi_scores(X, y, discrete_features)
mi_scores[::3]  # show a few features with their MI scores
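
For comparison, a categorical target would call for `mutual_info_classif` instead. The following is a minimal sketch on synthetic data (the arrays, seed, and variable names are assumptions for illustration): one discrete feature mirrors the class label and one continuous feature is pure noise.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)             # fixed seed, illustrative only
y_class = rng.integers(0, 2, size=1000)    # synthetic binary target
X_toy = np.column_stack([
    y_class,                               # discrete feature that mirrors the target
    rng.normal(size=1000),                 # continuous feature unrelated to the target
])

scores = mutual_info_classif(
    X_toy, y_class,
    discrete_features=[True, False],       # mask: first column discrete, second continuous
    random_state=0,                        # the estimator adds small noise, so fix the seed
)
print(scores)
```

The first score should sit near ln 2 (the entropy of a roughly balanced binary target) and the second near zero. Passing `random_state` to `mutual_info_regression` in the same way would make the ranking above reproducible from run to run.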

In [None]:
### Let's see in a barplot
plt.style.use("seaborn-v0_8-whitegrid")  # this style was named "seaborn-whitegrid" before Matplotlib 3.6

def plot_mi_scores(scores):
    scores = scores.sort_values(ascending=True)
    width = np.arange(len(scores))
    ticks = list(scores.index)
    plt.barh(width, scores)
    plt.yticks(width, ticks)
    plt.title("Mutual Information Scores")


plt.figure(dpi=100, figsize=(8, 5))
plot_mi_scores(mi_scores)

Data visualization is a great follow-up to a utility ranking. Let's take a closer look at a couple of these.    
The tutorial text singles out the high-scoring curb_weight feature; the cell below plots horsepower instead, which we would likewise expect to show a strong relationship with price, the target.

In [None]:
sns.relplot(data=df, x='horsepower', y='price');
#df.columns

In [None]:
# https://www.cs.waikato.ac.nz/ml/weka/book.html
# https://archive.ics.uci.edu/ml/datasets/Automobile
# https://www.askpython.com/python/examples/analyzing-cars-dataset-in-python
# https://fedbiomed.gitlabpages.inria.fr/latest/tutorials/pytorch/04_PyTorch_Used_Cars_Dataset_Example/