## Wrapper Methods

In this project, you'll analyze data from a survey conducted by Fabio Mendoza Palechor and Alexis de la Hoz Manotas that asked people about their eating habits and weight. The data was obtained from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Estimation+of+obesity+levels+based+on+eating+habits+and+physical+condition+). Categorical variables were changed to numerical ones in order to facilitate analysis.

First, you'll fit a logistic regression model to try to predict whether survey respondents are obese based on their answers to questions in the survey. After that, you'll use three different wrapper methods to choose a smaller feature subset.

You'll use sequential forward selection, sequential backward floating selection, and recursive feature elimination. After implementing each wrapper method, you'll evaluate the model accuracy on the resulting smaller feature subsets and compare that with the model accuracy using all available features.

## Import libraries

In [None]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import RFE
%matplotlib inline

## Evaluating a Logistic Regression Model

The data set `obesity` contains 18 predictor variables. Here's a brief description of them.

* `Gender` is `1` if a respondent is male and `0` if a respondent is female.
* `Age` is a respondent's age in years.
* `family_history_with_overweight` is `1` if a respondent has a family member who is or was overweight, `0` if not.
* `FAVC` is `1` if a respondent frequently eats high-calorie food, `0` if not.
* `FCVC` is `1` if a respondent usually eats vegetables in their meals, `0` if not.
* `NCP` represents how many main meals a respondent has daily (`0` for 1-2 meals, `1` for 3 meals, and `2` for more than 3 meals).
* `CAEC` represents how much food a respondent eats between meals on a scale of `0` to `3`.
* `SMOKE` is `1` if a respondent smokes, `0` if not.
* `CH2O` represents how much water a respondent drinks on a scale of `0` to `2`.
* `SCC` is `1` if a respondent monitors their caloric intake, `0` if not.
* `FAF` represents how much physical activity a respondent does on a scale of `0` to `3`.
* `TUE` represents how much time a respondent spends looking at devices with screens on a scale of `0` to `2`.
* `CALC` represents how often a respondent drinks alcohol on a scale of `0` to `3`.
* `Automobile`, `Bike`, `Motorbike`, `Public_Transportation`, and `Walking` indicate a respondent's primary mode of transportation. Their primary mode of transportation is indicated by a `1` and the other columns will contain a `0`.

The outcome variable, `NObeyesdad`, is `1` if a respondent is obese and `0` if not.

Load the data, then use the `.head()` method to inspect the first few rows.

In [None]:
# Load the data
obesity = pd.read_csv("obesity.csv")

# Inspect the data
obesity.head()

### Split the data into `X` and `y`

To use a logistic regression model, you'll need to split the data into two parts: the predictor variables and the outcome variable. Do this by splitting the data into a DataFrame of predictor variables called `X` and a Series containing the outcome variable called `y`.

In [None]:
X = obesity.drop(["NObeyesdad"], axis=1)
y = obesity['NObeyesdad']

### Logistic regression model

Create a logistic regression model called `lr`. Include the parameter `max_iter=1000` to make sure that the model will converge when you try to fit it.

In [None]:
lr = LogisticRegression(max_iter=1000)

### Fit the model

Use the `.fit()` method on `lr` to fit the model to `X` and `y`.

In [None]:
lr.fit(X, y)

### Model accuracy

A model's _accuracy_ is the proportion of observations whose class the model predicts correctly. Compute and print the accuracy of `lr` by using the `.score()` method. Because the model is scored on the same data it was fit to, this is a training accuracy. What percentage of respondents did the model correctly classify as obese or not obese? You may want to write this number down so that you can refer to it in later tasks.

In [None]:
og_score = lr.score(X, y)
print(f"Original Model Accuracy: {og_score:.4f}")
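
As a quick illustration of what `.score()` reports for a classifier, accuracy can be computed by hand as the fraction of predictions that match the true labels. This is only a sketch: the labels below are made-up values, not rows from the survey data.

```python
# Hypothetical true labels and predictions (1 = obese, 0 = not obese).
y_true = [1, 0, 0, 1, 1, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]

# Count the positions where the prediction matches the true label.
n_correct = sum(t == p for t, p in zip(y_true, y_pred))

# Accuracy is the share of correct predictions.
accuracy = n_correct / len(y_true)
print(f"Accuracy: {accuracy:.4f}")
```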

## Sequential Forward Selection

Now that you've created a logistic regression model and evaluated its performance, you're ready to do some feature selection.

Create a sequential forward selection model called `sfs`.
* Be sure to set the `estimator` parameter to `lr` and set the `forward` and `floating` parameters to the appropriate values.
* Also use the parameters `k_features=9`, `scoring='accuracy'`, and `cv=0` (no cross-validation).

In [None]:
sfs = SFS(lr, forward=True, floating=False, k_features=9, scoring='accuracy', cv=0)
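
To make the idea behind sequential forward selection concrete, here is a minimal sketch of the greedy loop in plain Python. It is not mlxtend's implementation: `forward_select`, `toy_points`, and `toy_score` are hypothetical names, and the toy scoring function stands in for "fit the model on these columns and return its accuracy".

```python
def forward_select(features, score_subset, k):
    """Greedily grow a feature subset until it holds k features."""
    selected = []
    remaining = list(features)
    while len(selected) < k and remaining:
        # Try adding each remaining feature and keep the one that scores best.
        best = max(remaining, key=lambda f: score_subset(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical stand-in for model accuracy: each toy feature
# contributes a fixed number of points to the score.
toy_points = {"x1": 5, "x2": 1, "x3": 3, "x4": 2}


def toy_score(subset):
    return sum(toy_points[f] for f in subset)


chosen = forward_select(toy_points, toy_score, k=2)
print(chosen)
```

In the real workflow, the scoring step would refit `lr` on each candidate set of columns, which is why fitting `sfs` takes noticeably longer than fitting a single model.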

### Fit the model

Use the `.fit()` method on `sfs` to fit the model to `X` and `y`. This step will take some time (not more than a minute) to run.

In [None]:
sfs.fit(X, y)

### Inspect the results

Now that you've run the sequential forward selection algorithm on the logistic regression model with `X` and `y`, you can see which features were chosen and check the model accuracy on the smaller feature set. Print `sfs.subsets_[9]` to see the nine selected features and the score for that subset.

In [None]:
sfs.subsets_[9]

### Plot the results

Use the `plot_sfs()` function to plot the performance of the sequential forward selection model.

In [None]:
plot_sfs(sfs.get_metric_dict(), kind='std_err')
plt.grid()
plt.show()

### Smaller feature subset accuracy

Compute and print the model accuracy using only the features selected by `sfs`. How does this accuracy compare to the accuracy of the model using all features?

In [None]:
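# Keep only the columns chosen by sequential forward selection
X_sfs = X[list(sfs.k_feature_names_)]
# Refit the model on the reduced feature set and score it
lr.fit(X_sfs, y)
sfs_score = lr.score(X_sfs, y)
print(f"SFS Model Accuracy: {sfs_score:.4f}")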

## Sequential Backward Floating Selection

Use the same steps as above to perform sequential backward floating selection. Be sure to set the `forward` and `floating` parameters correctly.

In [None]:
sbfs = SFS(lr, forward=False, floating=True, k_features=9, scoring='accuracy', cv=0)
sbfs.fit(X, y)
sbfs.subsets_[9]
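
The "floating" part can be hard to picture, so here is a simplified sketch of sequential backward floating selection in plain Python. It assumes a hypothetical scoring function in place of model accuracy, and the names `backward_floating_select`, `toy_points`, and `toy_score` are illustrative only; mlxtend's actual implementation differs in its details.

```python
def backward_floating_select(features, score_subset, k):
    """Shrink the full feature set to k features, allowing re-inclusion."""
    selected = list(features)
    # Best score seen so far for each subset size.
    best_by_size = {len(selected): score_subset(selected)}
    while len(selected) > k:
        # Exclusion step: drop the feature whose removal leaves the best score.
        candidates = [[g for g in selected if g != f] for f in selected]
        selected = max(candidates, key=score_subset)
        size = len(selected)
        best_by_size[size] = max(score_subset(selected),
                                 best_by_size.get(size, float("-inf")))
        # Floating step: re-add an excluded feature only if that beats the
        # best subset of that larger size found so far.
        while True:
            excluded = [f for f in features if f not in selected]
            if not excluded:
                break
            candidate = max((selected + [f] for f in excluded),
                            key=score_subset)
            if score_subset(candidate) <= best_by_size[len(candidate)]:
                break
            selected = candidate
            best_by_size[len(candidate)] = score_subset(candidate)
    return selected


# Hypothetical stand-in for model accuracy (made-up point values).
toy_points = {"x1": 5, "x2": 1, "x3": 3, "x4": 2}


def toy_score(subset):
    return sum(toy_points[f] for f in subset)


kept = backward_floating_select(toy_points, toy_score, k=2)
print(kept)
```

With a purely additive toy score like this one, the floating step never fires, so the result matches plain backward elimination. Re-inclusion only matters when features interact, so that a feature dropped early becomes useful again once others are gone.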

### Plot the results

Use the `plot_sfs()` function to plot the performance of the sequential backward floating selection model.

In [None]:
plot_sfs(sbfs.get_metric_dict(), kind='std_err')
plt.grid()
plt.show()

### Smaller feature subset accuracy

Compute and print the model accuracy using only the features selected by `sbfs`. How does this accuracy compare to the accuracy of the model using all features?

In [None]:
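# Keep only the columns chosen by sequential backward floating selection
X_sbfs = X[list(sbfs.k_feature_names_)]
# Refit the model on the reduced feature set and score it
lr.fit(X_sbfs, y)
sbfs_score = lr.score(X_sbfs, y)
print(f"SBFS Model Accuracy: {sbfs_score:.4f}")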

## Recursive Feature Elimination

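Now you'll use recursive feature elimination. Because this method ranks features by the size of the fitted model's coefficients, the features should first be standardized so that those coefficients are comparable. Start by creating a `StandardScaler` object called `scaler`.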

In [None]:
scaler = StandardScaler()

### Fit the scaler

Use the `.fit()` method to fit `scaler` to `X`.

In [None]:
scaler.fit(X)

### Transform the data

Use the `.transform()` method to scale `X`. Save the transformed data to a variable called `X_scaled`.

In [None]:
X_scaled = scaler.transform(X)
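
For intuition, standardizing a column means subtracting its mean and dividing by its standard deviation, so the result has mean 0 and standard deviation 1. The sketch below does this arithmetic by hand for a single column, using made-up ages rather than the survey data and the population standard deviation (dividing by n), which is what `StandardScaler` uses.

```python
from statistics import fmean, pstdev

# Hypothetical ages, used only to illustrate the arithmetic.
ages = [21.0, 23.0, 27.0, 33.0, 46.0]

mean = fmean(ages)
std = pstdev(ages)  # population standard deviation (divides by n)

# Standardize: subtract the mean, then divide by the standard deviation.
ages_scaled = [(a - mean) / std for a in ages]
print([round(z, 3) for z in ages_scaled])
```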

### Recursive feature elimination model

Create an `RFE` object called `rfe` and pass `lr` as the estimator. Use the parameter `n_features_to_select=9`.

In [None]:
rfe = RFE(estimator=lr, n_features_to_select=9)
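
The elimination loop itself can be sketched in a few lines of plain Python: fit, find the feature with the smallest absolute coefficient, drop it, and repeat. This is an illustration under assumptions, not scikit-learn's code. `recursive_elimination`, `toy_coefs`, and `toy_fit` are hypothetical names, and the toy "fit" returns fixed made-up coefficients instead of refitting a real model.

```python
def recursive_elimination(features, fit_coefficients, n_keep):
    """Repeatedly refit and drop the feature with the smallest |coefficient|."""
    remaining = list(features)
    while len(remaining) > n_keep:
        # Refit on the surviving features to get fresh coefficients.
        coefs = fit_coefficients(remaining)
        weakest = min(remaining, key=lambda f: abs(coefs[f]))
        remaining.remove(weakest)
    return remaining


# Hypothetical coefficients; a real model's coefficients would change
# each time it is refit on a smaller feature set.
toy_coefs = {"x1": 0.8, "x2": -1.2, "x3": 0.05, "x4": 0.3}


def toy_fit(subset):
    return {f: toy_coefs[f] for f in subset}


kept = recursive_elimination(toy_coefs, toy_fit, n_keep=2)
print(kept)
```

Unlike the sequential methods, this loop never scores candidate subsets. It relies entirely on coefficient magnitudes, which is why the features are put on a common scale first.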

### Fit the model

Use the `.fit()` method on `rfe` to fit the model to `X_scaled` and `y`.

In [None]:
rfe.fit(X_scaled, y)

### Model accuracy

Compute and print the model accuracy using only the features selected by `rfe`. How does this accuracy compare to the accuracy of the model using all features?

In [None]:
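# Keep only the scaled columns that RFE marked as selected
X_rfe = X_scaled[:, rfe.support_]
# Refit the model on the reduced feature set and score it
lr.fit(X_rfe, y)
rfe_score = lr.score(X_rfe, y)
print(f"RFE Model Accuracy: {rfe_score:.4f}")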

## Summary

You've now used three different wrapper methods to select features for a logistic regression model. Print the accuracy of the original model and of the models using the features selected by sequential forward selection, sequential backward floating selection, and recursive feature elimination.

How do these accuracies compare? Each method was asked to select nine features, so did they choose the same ones? Which method resulted in the highest accuracy? Finally, if you had to choose a feature selection method to use in practice, which one would you choose and why?

In [None]:
print(f"Original Model Accuracy: {og_score:.4f}")
print(f"SFS Model Accuracy: {sfs_score:.4f}")
print(f"SBFS Model Accuracy: {sbfs_score:.4f}")
print(f"RFE Model Accuracy: {rfe_score:.4f}")