# Hands-On Lab 5 - Model Improvement

### Step 1 - Load the Data

The *adult_train.csv* file is the training dataset. Run the following code cell to load the dataset.

In [None]:
import pandas as pd

adult_train = pd.read_csv('adult_train.csv')
adult_train.head()

### Step 2 - Engineer the *Female* Feature

The last lab illustrated how engineering a *Female* feature likely produced a better model. Run the following code cell to produce the results.

In [None]:
# Add a new Female feature to the DataFrame
adult_train['Female'] = adult_train['Sex'].replace({'Female': 1, 'Male': 0})

# Check the results
adult_train[['Sex', 'Female']].head()

### Step 3 - Prepare the Features

This lab will use the same feature preparation as Lab 4. Run the following code cell to produce the results.

In [None]:
# Features to use to predict the labels
all_features = ['Age', 'EducationNum', 'MaritalStatus', 'Occupation', 'Race', 'Female', 
                'CapitalGain', 'CapitalLoss', 'HoursPerWeek']

# Categorical features
cat_features = ['MaritalStatus', 'Occupation', 'Race']

# Select the above features and one-hot encode
adult_X = pd.get_dummies(adult_train[all_features], prefix = cat_features , columns = cat_features)
adult_X.head()

### Step 4 - Preparing the Labels

This lab will use the same label preparation as Lab 4. Run the following code cell to produce the results.

In [None]:
from sklearn.preprocessing import LabelEncoder

# Encode labels
label_encoder = LabelEncoder()
adult_y = label_encoder.fit_transform(adult_train['Label'])

print(label_encoder.classes_)
print(adult_y)

### Step 5 - Train the Random Forest

Run the following code cell to produce the results.

**NOTE** - You can increase the *n_jobs* parameter if you have a more powerful laptop. For example, *n_jobs = -1* uses all available CPU cores.

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Instantiate the Random Forest object
rf_1 = RandomForestClassifier(n_estimators = 200, oob_score = True, n_jobs = 1, random_state = 12345)

# Train the RandomForestClassifier
rf_1.fit(adult_X, adult_y)

# What is the accuracy estimate?
print(f'Estimated accuracy with OOB data: {rf_1.oob_score_:.4f}')

# What is the accuracy on the training data?
print(f'Training data accuracy: {rf_1.score(adult_X, adult_y):.4f}')

### Step 6 - Load Test Data

The *adult_test.csv* file is the test dataset. Run the following code cell to load the dataset.

In [None]:
# Load the Adult Census test dataset
adult_test = pd.read_csv('adult_test.csv')

print(adult_test.shape)
adult_test.head()

### Step 7 - Prepare the Test Data Features

This lab uses the same test feature preparation as Lab 4. Run the following code cell to produce the results.

In [None]:
# Add a new Female feature to the test DataFrame
adult_test['Female'] = adult_test['Sex'].replace({'Female': 1, 'Male': 0})

# Check the results
print(adult_test[['Sex', 'Female']].head())

# Use the same training features and one-hot encode
adult_test_X = pd.get_dummies(adult_test[all_features], prefix = cat_features , columns = cat_features)
adult_test_X.head()
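
One-hot encoding the training and test data separately can yield different columns when a category appears in only one of the two files. The following is a minimal sketch of how to detect and repair such a mismatch. It uses tiny hypothetical stand-in DataFrames (`train_df`, `test_df`) rather than the Adult Census data, and whether the lab's datasets actually need this step is not stated here.

```python
import pandas as pd

# Tiny hypothetical stand-ins for the training and test DataFrames
train_df = pd.DataFrame({'Age': [25, 40, 33], 'Race': ['White', 'Black', 'Other']})
test_df = pd.DataFrame({'Age': [51, 19], 'Race': ['White', 'Black']})

# One-hot encode each DataFrame separately, as in the lab
train_X = pd.get_dummies(train_df, prefix=['Race'], columns=['Race'])
test_X = pd.get_dummies(test_df, prefix=['Race'], columns=['Race'])

# Columns present in the training encoding but absent from the test encoding
missing_columns = sorted(set(train_X.columns) - set(test_X.columns))
print(missing_columns)

# Reindex to the training columns; absent dummy columns are filled with 0
test_X_aligned = test_X.reindex(columns=train_X.columns, fill_value=0)
print(list(test_X_aligned.columns) == list(train_X.columns))
```

In the lab, the equivalent check would compare `adult_X.columns` with `adult_test_X.columns`.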

### Step 8 - Preparing the Test Labels

The test dataset labels will be prepared in this lab as they were in Lab 4. Run the following code cell to produce the results.

In [None]:
# Encode the labels of the test dataset
adult_test_y = label_encoder.transform(adult_test['Label'])

print(label_encoder.classes_)
print(adult_test_y)

### Step 9 - Evaluating the Model

The *RandomForestClassifier* offers the *predict()* method to make predictions for a dataset, in this case the test dataset. Type the following code into the blank code cell in your lab notebook and run it to produce the results.

In [None]:
# Type your lab code here
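
The lab's own code for this cell is not reproduced here. As a minimal sketch of the general pattern, the example below trains a small forest on synthetic stand-in data (an assumption, not the Adult Census data), calls *predict()* on held-out rows, and scores the predictions with scikit-learn's `accuracy_score`. In the lab, the stand-ins would correspond to `adult_X`, `adult_y`, `adult_test_X`, and `adult_test_y`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in data (hypothetical): 300 rows, 4 numeric features
rng = np.random.default_rng(12345)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# First 200 rows play the role of training data, the rest of test data
X_train, y_train = X[:200], y[:200]
X_test, y_test = X[200:], y[200:]

rf = RandomForestClassifier(n_estimators=50, random_state=12345)
rf.fit(X_train, y_train)

# Predict labels for the held-out rows and compare with the true labels
test_preds = rf.predict(X_test)
test_accuracy = accuracy_score(y_test, test_preds)
print(f'Test data accuracy: {test_accuracy:.4f}')
```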

### Step 10 - Removing Features

As discussed during the lecture, removing features with low permutation importance is one way to reduce your model's complexity. You will experiment with removing the five features with the lowest permutation importance. Here are the features to remove for you to copy and paste.

Occupation_Armed-Forces<br>
Occupation_Priv-house-serv<br>
Race_Other<br>
MaritalStatus_Married-AF-spouse<br>
MaritalStatus_Married-spouse-absent<br>

Type the following code into the blank code cell in your lab notebook and run it to produce the results.

In [None]:
# Type your lab code here
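
One possible way to remove the listed columns is `DataFrame.drop`. The sketch below applies it to a small hypothetical stand-in (`demo_X`) that contains the five feature names listed above plus two others. It is an illustration of the technique, not necessarily the code used in the lecture.

```python
import pandas as pd

# Features to remove, copied from the list above
low_importance = ['Occupation_Armed-Forces', 'Occupation_Priv-house-serv', 'Race_Other',
                  'MaritalStatus_Married-AF-spouse', 'MaritalStatus_Married-spouse-absent']

# Hypothetical stand-in for adult_X: two kept columns plus the five to drop
demo_X = pd.DataFrame(0, index=range(3), columns=['Age', 'HoursPerWeek'] + low_importance)

# drop() returns a new DataFrame; the original is left unchanged
demo_X_reduced = demo_X.drop(columns=low_importance)
print(demo_X.shape, demo_X_reduced.shape)
```

The same column list would need to be dropped from the test features so that both feature sets stay aligned.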

### Step 11 - Train New Random Forest

You will train a second *RandomForestClassifier* using the reduced set of features. This will allow you to compare the predictive performance of the two feature sets. Type the following code into the blank code cell in your lab notebook and run it to produce the results.

In [None]:
# Type your lab code here
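
As a rough sketch of the comparison described above, the example below trains one forest on a full feature set and a second on a reduced set, then prints both OOB accuracy estimates. The data is synthetic (an assumption), so the numbers say nothing about the Adult Census results. The hyperparameters mirror Step 5 except for a smaller `n_estimators` to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (hypothetical): the last two columns are pure noise
rng = np.random.default_rng(12345)
X_full = rng.normal(size=(400, 6))
y = (X_full[:, 0] - X_full[:, 1] > 0).astype(int)

# Reduced feature set: keep only the first four columns
X_reduced = X_full[:, :4]

# Same settings for both forests so only the features differ
rf_full = RandomForestClassifier(n_estimators=100, oob_score=True, n_jobs=1, random_state=12345)
rf_full.fit(X_full, y)

rf_reduced = RandomForestClassifier(n_estimators=100, oob_score=True, n_jobs=1, random_state=12345)
rf_reduced.fit(X_reduced, y)

print(f'OOB accuracy, full features:    {rf_full.oob_score_:.4f}')
print(f'OOB accuracy, reduced features: {rf_reduced.oob_score_:.4f}')
```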

### Step 12 - Evaluating the Models

The previous step showed that the estimated accuracy using OOB data improved after the five least important features were removed. Next, you will evaluate the differences in accuracy using the test dataset. Type the following code into the blank code cell in your lab notebook and run it to produce the results.

In [None]:
# Type your lab code here
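
When comparing two models on test data, each model must receive test features with the same columns it was trained on. The sketch below illustrates this with synthetic stand-in data and a hypothetical reduced column selection (`keep`). It is not the lab's solution code, and its accuracies are unrelated to the Adult Census results.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in data (hypothetical), split into training and test rows
rng = np.random.default_rng(2024)
X = rng.normal(size=(500, 6))
y = (X[:, 0] + 0.5 * X[:, 2] > 0).astype(int)
X_train, X_test = X[:350], X[350:]
y_train, y_test = y[:350], y[350:]

# Hypothetical reduced feature set: indices of the columns to keep
keep = [0, 1, 2, 3]

model_full = RandomForestClassifier(n_estimators=100, random_state=12345)
model_full.fit(X_train, y_train)

model_reduced = RandomForestClassifier(n_estimators=100, random_state=12345)
model_reduced.fit(X_train[:, keep], y_train)

# Each model is scored on test features matching its own training columns
acc_full = accuracy_score(y_test, model_full.predict(X_test))
acc_reduced = accuracy_score(y_test, model_reduced.predict(X_test[:, keep]))

print(f'Test accuracy, full features:    {acc_full:.4f}')
print(f'Test accuracy, reduced features: {acc_reduced:.4f}')
```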