<img src='img/logo.png'>
<img src='img/title.png'>

# Table of Contents
* [Exercises](#Exercises)


# Exercises

With the adult dataset, compare a logistic regression model with polynomial interaction features against one without interaction features.

Use feature selection to determine which interaction features were most important.

Don't forget to scale or normalize the data.

In [None]:
import os
import pandas as pd
data = pd.read_csv(os.path.join("data", "adult.csv"))

In [None]:
y = (data.income == ' >50K').astype('int64')
X = pd.get_dummies(data.drop("income", axis=1)).astype('float64')

In [None]:
X.head() 

<button data-toggle="collapse" data-target="#soln1" class='btn btn-primary'>Show solution</button>

<div id="soln1" class="collapse">

Import a variety of classes and functions from `sklearn` that we will need for this exercise:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
```

Perform a train/test split and scale (normalize) the data. As a reminder, print the shapes of the pieces of the split:

```python
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

scaler = MinMaxScaler().fit(X_train)
X_train_ = scaler.transform(X_train)
X_test_ = scaler.transform(X_test)
```
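
The scaler is fit on the training split only, so the test split is mapped with the training minimum and maximum. A minimal sketch of what that implies, using made-up toy values (the `toy_*` names are illustrative and not part of the exercise):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data (hypothetical values, not taken from the adult dataset)
toy_train = np.array([[0.0], [5.0], [10.0]])
toy_test = np.array([[2.5], [15.0]])

# The scaler learns min=0 and max=10 from the training data only
toy_scaler = MinMaxScaler().fit(toy_train)
toy_train_scaled = toy_scaler.transform(toy_train)
toy_test_scaled = toy_scaler.transform(toy_test)

print(toy_train_scaled.ravel())  # training values span the [0, 1] range
print(toy_test_scaled.ravel())   # 15.0 exceeds the training max, so it maps above 1
```

Scaled test values outside `[0, 1]` are expected with the default settings and do not indicate a bug.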

Let's fit the model, and see how well it does:

```python
LogisticRegression().fit(X_train_, y_train).score(X_test_, y_test)
```

What if we only select the most important features? What is the shape of these selected features?

```python
select = SelectFromModel(RandomForestClassifier(n_estimators=100), 
                         threshold="5 * median")
X_train_selected = select.fit_transform(X_train_, y_train)
X_test_selected = select.transform(X_test_)
print(X_train_selected.shape, X_test_selected.shape)
```
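
To see what the `"5 * median"` threshold does, the sketch below refits the same kind of selector on synthetic data from `make_classification` (an assumption made for illustration; the `demo` names are hypothetical) and rebuilds the selection mask by hand from the forest's feature importances:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data: 20 features, only a few of them informative
X_demo, y_demo = make_classification(n_samples=300, n_features=20,
                                     n_informative=3, random_state=0)

demo_select = SelectFromModel(
    RandomForestClassifier(n_estimators=50, random_state=0),
    threshold="5 * median").fit(X_demo, y_demo)

# Rebuild the mask by hand: keep features whose importance is at least
# five times the median importance
importances = demo_select.estimator_.feature_importances_
manual_threshold = 5 * np.median(importances)
manual_mask = importances >= manual_threshold

print(demo_select.threshold_, manual_threshold)
print(np.array_equal(demo_select.get_support(), manual_mask))
```

A larger multiplier keeps fewer features; `select.get_support()` on the exercise's selector gives the equivalent mask over the columns of `X`.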

How much worse do we do with the reduced number of features? (Not much!)

```python
LogisticRegression().fit(X_train_selected, y_train).score(X_test_selected, y_test)
```

Using only the most important features (mostly for speed and memory usage optimization), let us also add back in pairwise combinations of features:

```python
poly = PolynomialFeatures(degree=2).fit(X_train_selected)
X_train_selected_poly = poly.transform(X_train_selected)
X_test_selected_poly = poly.transform(X_test_selected)
```

How well does the new approach do?

```python
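lr = LogisticRegression().fit(X_train_selected_poly, y_train)
print(lr.score(X_test_selected_poly, y_test))
# Names of the polynomial features whose coefficient is nonzero
# (get_feature_names_out replaces get_feature_names, removed in scikit-learn 1.2)
poly.get_feature_names_out(
    X.columns[select.get_support()])[lr.coef_.ravel() != 0]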
```

<img src='img/copyright.png'>