# Feature Engineering
> You will now work with different types of features: you will modify existing features, create new ones, and handle missing data.

- toc: true 
- badges: true
- comments: true
- author: Lucas Nunes
- categories: [Datacamp, marked]
- image: images/datacamp/___

> Note: This is a summary of the chapter 3 exercises from the course "Winning a Kaggle Competition in Python" on DataCamp. <br>[Github repo](https://github.com/lnunesAI/Datacamp/) / [Course link](https://www.datacamp.com/tracks/machine-learning-scientist-with-python)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = (10, 10)

## Feature engineering

### Arithmetical features

<div class=""><p>To practice creating new features, you will be working with a subsample from the Kaggle competition called "House Prices: Advanced Regression Techniques". The goal of this competition is to predict the price of the house based on its properties. It's a regression problem with Root Mean Squared Error as an evaluation metric.</p>
<p>Your goal is to create new features and determine whether they improve your validation score. To get the validation score from 5-fold cross-validation, you're given the <code>get_kfold_rmse()</code> function. Use it with the <code>train</code> DataFrame, available in your workspace, as an argument.</p></div>

In [2]:
train = pd.read_csv('https://github.com/lnunesAI/Datacamp/raw/main/2-machine-learning-scientist-with-python/23-winning-a-kaggle-competition-in-python/datasets/house_prices_train.csv')
test = pd.read_csv('https://github.com/lnunesAI/Datacamp/raw/main/2-machine-learning-scientist-with-python/23-winning-a-kaggle-competition-in-python/datasets/house_prices_test.csv')

In [5]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

kf = KFold(n_splits=5, shuffle=True, random_state=123)

def get_kfold_rmse(train):
    # Fill missing values once and select the feature columns
    # (identifier, target and string columns are excluded)
    train = train.fillna(0)
    feats = [x for x in train.columns if x not in ['Id', 'SalePrice', 'RoofStyle', 'CentralAir']]
    rmse_scores = []

    for train_index, test_index in kf.split(train):
        # kf.split() yields positional indices, so select rows with .iloc
        fold_train, fold_test = train.iloc[train_index], train.iloc[test_index]

        # Create a Random Forest object
        rf = RandomForestRegressor(n_estimators=10, min_samples_split=10, random_state=123)

        # Train a model
        rf.fit(X=fold_train[feats], y=fold_train['SalePrice'])

        # Get predictions for the held-out fold
        pred = rf.predict(fold_test[feats])

        # Store the RMSE of this fold
        fold_score = mean_squared_error(fold_test['SalePrice'], pred)
        rmse_scores.append(np.sqrt(fold_score))

    # Note: the reported score is the mean fold RMSE plus one standard deviation
    return round(np.mean(rmse_scores) + np.std(rmse_scores), 2)

In [7]:
train

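|      | Id   | LotArea | OverallQual | YearBuilt | RoofStyle | TotalBsmtSF | CentralAir | 1stFlrSF | 2ndFlrSF | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | TotRmsAbvGrd | GarageCars | GarageArea | SalePrice |
|------|------|---------|-------------|-----------|-----------|-------------|------------|----------|----------|----------|----------|--------------|--------------|--------------|------------|------------|-----------|
| 0    | 1    | 8450    | 7           | 2003      | Gable     | 856         | Y          | 856      | 854      | 2        | 1        | 3            | 1            | 8            | 2          | 548        | 208500    |
| 1    | 2    | 9600    | 6           | 1976      | Gable     | 1262        | Y          | 1262     | 0        | 2        | 0        | 3            | 1            | 6            | 2          | 460        | 181500    |
| 2    | 3    | 11250   | 7           | 2001      | Gable     | 920         | Y          | 920      | 866      | 2        | 1        | 3            | 1            | 6            | 2          | 608        | 223500    |
| 3    | 4    | 9550    | 7           | 1915      | Gable     | 756         | Y          | 961      | 756      | 1        | 0        | 3            | 1            | 7            | 3          | 642        | 140000    |
| 4    | 5    | 14260   | 8           | 2000      | Gable     | 1145        | Y          | 1145     | 1053     | 2        | 1        | 4            | 1            | 9            | 3          | 836        | 250000    |
| ...  | ...  | ...     | ...         | ...       | ...       | ...         | ...        | ...      | ...      | ...      | ...      | ...          | ...          | ...          | ...        | ...        | ...       |
| 1455 | 1456 | 7917    | 6           | 1999      | Gable     | 953         | Y          | 953      | 694      | 2        | 1        | 3            | 1            | 7            | 2          | 460        | 175000    |
| 1456 | 1457 | 13175   | 6           | 1978      | Gable     | 1542        | Y          | 2073     | 0        | 2        | 0        | 3            | 1            | 7            | 2          | 500        | 210000    |
| 1457 | 1458 | 9042    | 7           | 1941      | Gable     | 1152        | Y          | 1188     | 1152     | 2        | 0        | 4            | 1            | 9            | 1          | 252        | 266500    |
| 1458 | 1459 | 9717    | 5           | 1950      | Hip       | 1078        | Y          | 1078     | 0        | 1        | 0        | 2            | 1            | 5            | 1          | 240        | 142125    |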


Instructions 1/3
<li>Create a new feature representing the total area (basement, 1st and 2nd floors) of the house. The columns <code>"TotalBsmtSF"</code>, <code>"1stFlrSF"</code> and <code>"2ndFlrSF"</code> give the areas of the basement, 1st and 2nd floors, respectively.</li>

In [8]:
# Look at the initial RMSE
print('RMSE before feature engineering:', get_kfold_rmse(train))

# Find the total area of the house
train['TotalArea'] = train['TotalBsmtSF'] + train['1stFlrSF'] + train['2ndFlrSF']
print('RMSE with total area:', get_kfold_rmse(train))

RMSE before feature engineering: 36029.39
RMSE with total area: 35073.2


Instructions 2/3
<li>Create a new feature representing the area of the garden. It is the difference between the total area of the property (<code>"LotArea"</code>) and the first floor area (<code>"1stFlrSF"</code>).</li>

In [9]:
# Find the area of the garden
train['GardenArea'] = train['LotArea'] - train['1stFlrSF']
print('RMSE with garden area:', get_kfold_rmse(train))

RMSE with garden area: 34413.55


Instructions 3/3
<li>Create a new feature representing the total number of bathrooms in the house. It is the sum of full bathrooms (<code>"FullBath"</code>) and half bathrooms (<code>"HalfBath"</code>).</li>

In [10]:
# Find total number of bathrooms
train['TotalBath'] = train['FullBath'] + train['HalfBath']
print('RMSE with number of bathrooms:', get_kfold_rmse(train))

RMSE with number of bathrooms: 34506.78


**You've created three new features. Here you see that the total house area improved the RMSE by almost \$1,000. Adding the garden area improved the RMSE by another \$600. However, with the total number of bathrooms, the RMSE has increased. So keep the two new area features, but do not add "TotalBath" as a new feature.**

### Date features

<div class=""><p>You've built some basic features using numerical variables. Now, it's time to create features based on date and time. You will practice on a subsample from the Taxi Fare Prediction Kaggle competition data. The data represents information about the taxi rides and the goal is to predict the price for each ride.</p>
<p>Your objective is to generate date features from the pickup datetime. Recall that it's better to create new features for train and test data simultaneously. After the features are created, split the data back into the train and test DataFrames. Here it's done using <code>pandas</code>' <code>isin()</code> method.</p>
<p>The <code>train</code> and <code>test</code> DataFrames are already available in your workspace.</p></div>

In [13]:
train = pd.read_csv('https://github.com/lnunesAI/Datacamp/raw/main/2-machine-learning-scientist-with-python/23-winning-a-kaggle-competition-in-python/datasets/taxi_train.csv')
test = pd.read_csv('https://github.com/lnunesAI/Datacamp/raw/main/2-machine-learning-scientist-with-python/23-winning-a-kaggle-competition-in-python/datasets/taxi_test.csv')

Instructions
<ul>
<li>Concatenate the <code>train</code> and <code>test</code> DataFrames into a single DataFrame <code>taxi</code>.</li>
<li>Convert the "pickup_datetime" column to a <code>datetime</code> object.</li>
<li>Create the day of week (using <code>.dayofweek</code> attribute) and hour (using <code>.hour</code> attribute) features from the "pickup_datetime" column.</li>
</ul>

In [14]:
# Concatenate train and test together
taxi = pd.concat([train, test])

# Convert pickup date to datetime object
taxi['pickup_datetime'] = pd.to_datetime(taxi['pickup_datetime'])

# Create a day of week feature
taxi['dayofweek'] = taxi['pickup_datetime'].dt.dayofweek

# Create an hour feature
taxi['hour'] = taxi['pickup_datetime'].dt.hour

# Split back into train and test
new_train = taxi[taxi['id'].isin(train['id'])]
new_test = taxi[taxi['id'].isin(test['id'])]

**Now you know how to perform feature engineering for train and test DataFrames simultaneously. Having considered numerical and datetime features, move forward to master feature engineering for categorical ones!**
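
As a complement to the `isin()` split, here is a minimal sketch of the same concatenate, engineer, and split pattern on made-up data. The toy DataFrames and the `is_train` marker column are assumptions for illustration and are not part of the course material. A marker column can be handy when the identifier column is not guaranteed to be unique across train and test.

```python
import pandas as pd

# Toy stand-ins for the taxi train/test data (all values are made up)
train_toy = pd.DataFrame({
    "id": [0, 1],
    "pickup_datetime": ["2015-01-05 08:30:00", "2015-01-10 23:15:00"],
    "fare_amount": [7.5, 12.0],
})
test_toy = pd.DataFrame({
    "id": [2],
    "pickup_datetime": ["2015-01-11 00:05:00"],
})

# Tag each row with its origin before concatenating (hypothetical helper column)
combined = pd.concat(
    [train_toy.assign(is_train=True), test_toy.assign(is_train=False)],
    ignore_index=True,
)

# Create the date features once, for train and test together
combined["pickup_datetime"] = pd.to_datetime(combined["pickup_datetime"])
combined["dayofweek"] = combined["pickup_datetime"].dt.dayofweek  # Monday=0 ... Sunday=6
combined["hour"] = combined["pickup_datetime"].dt.hour

# Split back using the marker column, then drop it
new_train_toy = combined[combined["is_train"]].drop(columns="is_train")
new_test_toy = combined[~combined["is_train"]].drop(columns=["is_train", "fare_amount"])

print(new_train_toy[["id", "dayofweek", "hour"]])
print(new_test_toy[["id", "dayofweek", "hour"]])
```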

## Categorical features

### Label encoding

<div class=""><p>Let's work on categorical variables encoding. You will again work with a subsample from the House Prices Kaggle competition.</p>
<p>Your objective is to encode categorical features "RoofStyle" and "CentralAir" using label encoding. The <code>train</code> and <code>test</code> DataFrames are already available in your workspace.</p></div>

In [16]:
train = pd.read_csv('https://github.com/lnunesAI/Datacamp/raw/main/2-machine-learning-scientist-with-python/23-winning-a-kaggle-competition-in-python/datasets/house_prices_train.csv')
test = pd.read_csv('https://github.com/lnunesAI/Datacamp/raw/main/2-machine-learning-scientist-with-python/23-winning-a-kaggle-competition-in-python/datasets/house_prices_test.csv')

Instructions
<ul>
<li>Concatenate <code>train</code> and <code>test</code> DataFrames into a single DataFrame <code>houses</code>.</li>
<li>Create a <code>LabelEncoder</code> object without arguments and assign it to <code>le</code>.</li>
<li>Create new label-encoded features for "RoofStyle" and "CentralAir" using the same <code>le</code> object.</li>
</ul>

In [17]:
# Concatenate train and test together
houses = pd.concat([train, test])

# Label encoder
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

# Create new features
houses['RoofStyle_enc'] = le.fit_transform(houses['RoofStyle'])
houses['CentralAir_enc'] = le.fit_transform(houses['CentralAir'])

# Look at new features
print(houses[['RoofStyle', 'RoofStyle_enc', 'CentralAir', 'CentralAir_enc']].head())

  RoofStyle  RoofStyle_enc CentralAir  CentralAir_enc
0     Gable              1          Y               1
1     Gable              1          Y               1
2     Gable              1          Y               1
3     Gable              1          Y               1
4     Gable              1          Y               1


**You can see that categorical variables have been label encoded. However, as you already know, label encoder is not always a good choice for categorical variables. Let's go further and apply One-Hot encoding.**

### One-Hot encoding

<div class=""><p>The problem with label encoding is that it implicitly assumes that there is a ranking dependency between the categories. So, let's change the encoding method for the features "RoofStyle" and "CentralAir" to one-hot encoding. Again, the <code>train</code> and <code>test</code> DataFrames from House Prices Kaggle competition are already available in your workspace.</p>
<p>Recall that if you're dealing with binary features (categorical features with only two categories) it is suggested to apply label encoder only.</p>
<p>Your goal is to determine which of the mentioned features is not binary, and to apply one-hot encoding only to this one.</p></div>
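
To make the "ranking dependency" concrete, the following sketch label-encodes a few made-up roof styles. `LabelEncoder` assigns integers by the sorted order of the class labels, so the resulting numbers imply an order (Flat < Gable < Hip) that has no real meaning for the feature. The toy values are assumptions for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up sample of roof styles
roof_styles = ["Gable", "Hip", "Flat", "Gable"]

le = LabelEncoder()
codes = le.fit_transform(roof_styles)

# classes_ holds the labels in sorted order; a label's position is its integer code
print(list(le.classes_))
print(codes.tolist())
```

A tree-based model can often work around this arbitrary ordering, but a linear model would treat the codes as a numeric scale.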

Instructions 1/4
<li>Determine the distribution of "RoofStyle" and "CentralAir" features using <code>pandas</code>' <code>value_counts()</code> method.</li>

In [18]:
# Concatenate train and test together
houses = pd.concat([train, test])

# Look at feature distributions
print(houses['RoofStyle'].value_counts(), '\n')
print(houses['CentralAir'].value_counts())

Gable      2310
Hip         551
Gambrel      22
Flat         20
Mansard      11
Shed          5
Name: RoofStyle, dtype: int64 

Y    2723
N     196
Name: CentralAir, dtype: int64


Instructions 2/4
<li>Which of the features is binary?</li>

<pre>
Possible Answers
"RoofStyle".
<b>"CentralAir".</b>
</pre>

Instructions 3/4
<li>Since "CentralAir" is a binary feature, encode it with a label encoder (0 for one class and 1 for the other).</li>

In [19]:
# Concatenate train and test together
houses = pd.concat([train, test])

# Label encode binary 'CentralAir' feature
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
houses['CentralAir_enc'] = le.fit_transform(houses['CentralAir'])

Instructions 4/4
<li>For the categorical feature "RoofStyle" let's use the one-hot encoder. Firstly, create one-hot encoded features using the <code>get_dummies()</code> method. Then they are concatenated to the initial <code>houses</code> DataFrame.</li>

In [None]:
# Create One-Hot encoded features
ohe = pd.get_dummies(houses['RoofStyle'], prefix='RoofStyle')

# Concatenate OHE features to houses
houses = pd.concat([houses, ohe], axis=1)

In [24]:
# Look at OHE features
houses[[col for col in houses.columns if 'RoofStyle' in col]].head(5)

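|   | RoofStyle | RoofStyle_Flat | RoofStyle_Gable | RoofStyle_Gambrel | RoofStyle_Hip | RoofStyle_Mansard | RoofStyle_Shed |
|---|-----------|----------------|-----------------|-------------------|---------------|-------------------|----------------|
| 0 | Gable     | 0              | 1               | 0                 | 0             | 0                 | 0              |
| 1 | Gable     | 0              | 1               | 0                 | 0             | 0                 | 0              |
| 2 | Gable     | 0              | 1               | 0                 | 0             | 0                 | 0              |
| 3 | Gable     | 0              | 1               | 0                 | 0             | 0                 | 0              |
| 4 | Gable     | 0              | 1               | 0                 | 0             | 0                 | 0              |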


**The one-hot encoded features look as expected. Remember to drop the initial string column, because models will not handle it automatically.**
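
A minimal sketch of that last step on made-up data: the one-hot columns are concatenated and the original string column is dropped in one chain. The toy DataFrame is an assumption for illustration. Passing `dtype=int` is optional and is used here only to get 0/1 values, because recent pandas versions return boolean dummy columns by default.

```python
import pandas as pd

# Made-up stand-in for the houses DataFrame
houses_toy = pd.DataFrame({
    "RoofStyle": ["Gable", "Hip", "Gable", "Flat"],
    "LotArea": [8450, 9600, 11250, 9550],
})

# One-hot encode the categorical column as 0/1 integers
ohe = pd.get_dummies(houses_toy["RoofStyle"], prefix="RoofStyle", dtype=int)

# Attach the new columns and drop the original string column
houses_toy = pd.concat([houses_toy, ohe], axis=1).drop(columns="RoofStyle")

print(houses_toy.columns.tolist())
```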

## Target encoding

### Mean target encoding

<div class=""><p>First of all, you will create a function that implements mean target encoding. Remember that you need to develop the two following steps: </p>
<ol>
<li>Calculate the mean on the train, apply to the test</li>
<li>Split train into K folds. Calculate the out-of-fold mean for each fold, apply to this particular fold</li>
</ol>
<p>Each of these steps will be implemented in a separate function: <code>test_mean_target_encoding()</code> and <code>train_mean_target_encoding()</code>, respectively.</p>
<p>The final function <code>mean_target_encoding()</code> takes as arguments: the train and test DataFrames, the name of the categorical column to be encoded, the name of the target column and a smoothing parameter alpha. It returns two values: a new feature for train and test DataFrames, respectively.</p></div>

Instructions 1/3
<ul>
<li>You need to add smoothing to avoid overfitting. So, add the α parameter to the denominator in the <code>train_statistics</code> calculation.</li>
<li>You need to treat new categories in the test data. So, pass a global mean as an argument to the <code>fillna()</code> method.</li>
</ul>

In [43]:
def test_mean_target_encoding(train, test, target, categorical, alpha=5):
    # Calculate global mean on the train data
    global_mean = train[target].mean()
    
    # Group by the categorical feature and calculate its properties
    train_groups = train.groupby(categorical)
    category_sum = train_groups[target].sum()
    category_size = train_groups.size()
    
    # Calculate smoothed mean target statistics
    train_statistics = (category_sum + global_mean * alpha) / (category_size + alpha)
    
    # Apply statistics to the test data and fill new categories
    test_feature = test[categorical].map(train_statistics).fillna(global_mean)
    return test_feature.values
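
To see what the α term in `train_statistics` does, here is a small sketch on made-up data that can be checked by hand. It applies the same smoothed formula, `(category_sum + global_mean * alpha) / (category_size + alpha)`, to one rare and one common category. The toy DataFrame and its column names are assumptions for illustration.

```python
import pandas as pd

# Made-up data: a rare category (2 rows) and a common one (8 rows)
toy = pd.DataFrame({
    "cat": ["rare"] * 2 + ["common"] * 8,
    "y": [1, 1] + [1, 0, 0, 0, 1, 0, 0, 0],
})

alpha = 5
global_mean = toy["y"].mean()  # 4 positives out of 10 rows

groups = toy.groupby("cat")["y"]
raw_mean = groups.mean()

# Same smoothing formula as in test_mean_target_encoding()
smoothed = (groups.sum() + global_mean * alpha) / (groups.size() + alpha)

print(pd.DataFrame({"raw_mean": raw_mean, "smoothed": smoothed}))
```

The rare category has a raw mean of 1.0 based on only two rows, and smoothing pulls it to 4/7 (about 0.57). The common category moves from 0.25 to 4/13 (about 0.31). Both are pulled toward the global mean of 0.4, and the pull is stronger when a category has fewer rows.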

Instructions 2/3
<li>To calculate the train mean encoded feature you need to use out-of-fold statistics, splitting train into several folds. Specify the train and test indices for each validation split to access it.</li>

In [44]:
def train_mean_target_encoding(train, target, categorical, alpha=5):
    # Create 5-fold cross-validation
    kf = KFold(n_splits=5, random_state=123, shuffle=True)
    train_feature = pd.Series(index=train.index, dtype='float')
    
    # For each folds split
    for train_index, test_index in kf.split(train):
        cv_train, cv_test = train.iloc[train_index], train.iloc[test_index]
      
        # Calculate out-of-fold statistics and apply to cv_test
        cv_test_feature = test_mean_target_encoding(cv_train, cv_test, target, categorical, alpha)
        
        # Save new feature for this particular fold
        train_feature.iloc[test_index] = cv_test_feature       
    return train_feature.values

Instructions 3/3
<li>Finally, you just calculate train and test target mean encoded features and return them from the function. So, return <code>train_feature</code> and <code>test_feature</code> obtained.</li>

In [45]:
def mean_target_encoding(train, test, target, categorical, alpha=5):
  
    # Get the train feature
    train_feature = train_mean_target_encoding(train, target, categorical, alpha)
  
    # Get the test feature
    test_feature = test_mean_target_encoding(train, test, target, categorical, alpha)
    
    # Return new features to add to the model
    return train_feature, test_feature

**Now you are equipped with a function that performs mean target encoding of any categorical feature.**

### K-fold cross-validation

<div class=""><p>You will work with a binary classification problem on a subsample from Kaggle playground competition. The objective of this competition is to predict whether a famous basketball player Kobe Bryant scored a basket or missed a particular shot.</p>
<p>Train data is available in your workspace as the <code>bryant_shots</code> DataFrame. It contains data on 10,000 shots with their properties and a <strong>target</strong> variable <code>"shot_made_flag"</code>, which indicates whether the shot was scored or not.</p>
<p>One of the features in the data is <code>"game_id"</code> -- a particular game where the shot was made. There are 541 distinct games. So, you deal with a high-cardinality categorical feature. Let's encode it using a target mean!</p>
<p>Suppose you're using 5-fold cross-validation and want to evaluate a mean target encoded feature on the local validation.</p></div>

Instructions
<ul>
<li>To achieve this, you need to repeat the encoding procedure for the <code>"game_id"</code> categorical feature inside each fold split separately. Your goal is to specify all the missing parameters for the <code>mean_target_encoding()</code> function call inside each fold split.</li>
<li>Recall that the <code>train</code> and <code>test</code> parameters expect the train and test DataFrames.</li>
<li>The <code>target</code> and <code>categorical</code> parameters expect the names of the target variable and of the categorical feature to be encoded.</li>
</ul>

In [50]:
# Create 5-fold cross-validation
kf = KFold(n_splits=5, random_state=123, shuffle=True)

# For each folds split
for train_index, test_index in kf.split(bryant_shots):
    cv_train, cv_test = bryant_shots.iloc[train_index].copy(), bryant_shots.iloc[test_index].copy()

    # Create mean target encoded feature
    cv_train['game_id_enc'], cv_test['game_id_enc'] = mean_target_encoding(train=cv_train,
                                                                           test=cv_test,
                                                                           target='shot_made_flag',
                                                                           categorical='game_id',
                                                                           alpha=5)
    # Look at the encoding
    print(cv_train[['game_id', 'shot_made_flag', 'game_id_enc']].sample(n=1))

       game_id  shot_made_flag  game_id_enc
3444  20200667             1.0     0.511914
      game_id  shot_made_flag  game_id_enc
660  20000519             1.0     0.435403
       game_id  shot_made_flag  game_id_enc
7509  20500813             0.0     0.541036
       game_id  shot_made_flag  game_id_enc
1610  20100360             1.0      0.34654
       game_id  shot_made_flag  game_id_enc
2030  20100668             0.0     0.457899


**You can see different game encodings for each validation split in the output. The main conclusion: when using local cross-validation, you need to repeat the mean target encoding procedure inside each fold split separately. Go on to try other problem types beyond binary classification!**

### Beyond binary classification

<div class=""><p>Of course, binary classification is just a single special case. Target encoding could be applied to any target variable type:</p>
<ul>
<li>For <strong>binary classification</strong> usually mean target encoding is used</li>
<li>For <strong>regression</strong> mean could be changed to median, quartiles, etc.</li>
<li>For <strong>multi-class classification</strong> with N classes we create N features with target mean for each category in one vs. all fashion</li>
</ul>
<p>The <code>mean_target_encoding()</code> function you've created could be used for any target type specified above. Let's apply it for the regression problem on the example of House Prices Kaggle competition.</p>
<p>Your goal is to encode a categorical feature <code>"RoofStyle"</code> using mean target encoding. The <code>train</code> and <code>test</code> DataFrames are already available in your workspace.</p></div>
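
The multi-class case from the list above can be sketched as follows: one binary indicator per class, and one mean-encoded feature per indicator (one vs. all). This is a simplified illustration on made-up data. It encodes only the test set from train statistics and omits smoothing and the out-of-fold step for brevity. The DataFrames and column names are assumptions.

```python
import pandas as pd

# Made-up train/test data with a 3-class target
train_toy = pd.DataFrame({
    "city": ["A", "A", "B", "B", "B", "C"],
    "interest": ["low", "high", "low", "low", "medium", "high"],
})
test_toy = pd.DataFrame({"city": ["A", "B", "D"]})  # "D" is unseen in train

for cls in sorted(train_toy["interest"].unique()):
    # One-vs-all indicator: 1.0 if the row belongs to this class, else 0.0
    indicator = (train_toy["interest"] == cls).astype(float)
    global_mean = indicator.mean()

    # Mean of the indicator per category, computed on train only
    stats = indicator.groupby(train_toy["city"]).mean()

    # Apply to test; unseen categories fall back to the global mean
    test_toy[f"city_enc_{cls}"] = test_toy["city"].map(stats).fillna(global_mean)

print(test_toy)
```

With N = 3 classes this yields three new columns. For each known category the three values sum to 1, since they are the class shares within that category.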

In [54]:
train = pd.read_csv('https://github.com/lnunesAI/Datacamp/raw/main/2-machine-learning-scientist-with-python/23-winning-a-kaggle-competition-in-python/datasets/house_prices_train.csv')
test = pd.read_csv('https://github.com/lnunesAI/Datacamp/raw/main/2-machine-learning-scientist-with-python/23-winning-a-kaggle-competition-in-python/datasets/house_prices_test.csv')

Instructions
<ul>
<li>Specify all the missing parameters for the <code>mean_target_encoding()</code> function call. The target variable name is <code>"SalePrice"</code>. Set the α hyperparameter to 10.</li>
<li>Recall that the <code>train</code> and <code>test</code> parameters expect the train and test DataFrames.</li>
<li>The <code>target</code> and <code>categorical</code> parameters expect the names of the target variable and of the feature to be encoded.</li>
</ul>

In [55]:
# Create mean target encoded feature
train['RoofStyle_enc'], test['RoofStyle_enc'] = mean_target_encoding(train=train,
                                                                     test=test,
                                                                     target='SalePrice',
                                                                     categorical='RoofStyle',
                                                                     alpha=10)

# Look at the encoding
test[['RoofStyle', 'RoofStyle_enc']].drop_duplicates()

|      | RoofStyle | RoofStyle_enc |
|------|-----------|---------------|
| 0    | Gable     | 171565.947836 |
| 1    | Hip       | 217594.645131 |
| 98   | Gambrel   | 164152.950424 |
| 133  | Flat      | 188703.563431 |
| 362  | Mansard   | 180775.938759 |
| 1053 | Shed      | 188267.663242 |


**So, you observe that houses with the Hip roof are the most expensive, while houses with the Gambrel roof are the cheapest. That is exactly the goal of target encoding: you've encoded the categorical feature in such a manner that there is now a correlation between the category values and the target variable. We're done with categorical encoders.**

## Missing data

### Find missing data

<div class=""><p>Let's impute missing data on a real Kaggle dataset. For this purpose, you will be using a data subsample from the Kaggle "Two sigma connect: rental listing inquiries" competition.</p>
<p>Before proceeding with any imputing you need to know the number of missing values for each of the features. Moreover, if the feature has missing values, you should explore the type of this feature.</p></div>

Instructions 1/2
<ul>
<li>Read the <code>"twosigma_train.csv"</code> file using <code>pandas</code>.</li>
<li>Find the number of missing values in each column.</li>
</ul>

In [64]:
# Read DataFrame
twosigma = pd.read_csv('https://github.com/lnunesAI/Datacamp/raw/main/2-machine-learning-scientist-with-python/23-winning-a-kaggle-competition-in-python/datasets/twosigma_train.csv')

# Find the number of missing values in each column
print(twosigma.isnull().sum())

id                 0
bathrooms          0
bedrooms           0
building_id       13
latitude           0
longitude          0
manager_id         0
price             32
interest_level     0
dtype: int64


Instructions 2/2
<li>Select the columns with the missing values and look at the head of the DataFrame.</li>

In [65]:
# Look at the columns with the missing values
twosigma[['building_id', 'price']].head()

|   | building_id                      | price  |
|---|----------------------------------|--------|
| 0 | 53a5b119ba8f7b61d4e010512e0dfc85 | 3000.0 |
| 1 | c5c8a357cba207596b04d1afd1e4f130 | 5465.0 |
| 2 | c3ba40552e2120b0acfc3cb5730bb2aa | 2850.0 |
| 3 | 28d9ad350afeaab8027513a3e52ac8d5 | 3275.0 |
| 4 | NaN                              | 3350.0 |


**All right, you've found out that 'building_id' and 'price' columns have missing values. Looking at the head of the DataFrame, we may conclude that 'price' is a numerical feature, while 'building_id' is a categorical feature that is encoding buildings as hashes.**
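
Instead of reading the types off the head of the DataFrame, the check can also be done programmatically. The sketch below, on a made-up DataFrame, selects the columns with missing values and separates the numeric ones from the rest. The toy data and variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the rental listings data
listings_toy = pd.DataFrame({
    "bedrooms": [1, 2, 3],
    "building_id": ["53a5", np.nan, "c3ba"],
    "price": [3000.0, 5465.0, np.nan],
})

# Count missing values and keep only the affected columns
n_missing = listings_toy.isnull().sum()
cols_with_missing = n_missing[n_missing > 0].index.tolist()

# Separate numeric columns from the rest to choose an imputation strategy
numeric_cols = [c for c in cols_with_missing
                if pd.api.types.is_numeric_dtype(listings_toy[c])]
other_cols = [c for c in cols_with_missing if c not in numeric_cols]

print("numeric:", numeric_cols)
print("non-numeric:", other_cols)
```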

### Impute missing data

<div class=""><p>You've found that "price" and "building_id" columns have missing values in the Rental Listing Inquiries dataset. So, before passing the data to the models you need to impute these values.</p>
<p>The numerical feature "price" will be imputed with the mean value of the non-missing prices. </p>
<p>Imputing categorical feature "building_id" with the most frequent category is a bad idea, because it would mean that all the apartments with a missing "building_id" are located in the most popular building. The better idea is to impute it with a new category.</p>
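<p>In the course workspace the competition data is provided as the DataFrame <code>rental_listings</code>. In this notebook the same data was loaded as <code>twosigma</code> in the previous exercise.</p></div>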
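
The difference between the two categorical strategies can be seen on a tiny made-up column. This sketch uses plain pandas `fillna()` rather than an imputer object, purely to keep the illustration short. The values are assumptions.

```python
import numpy as np
import pandas as pd

# Made-up building ids with two missing entries
building = pd.Series(["b1", "b1", "b2", np.nan, np.nan], name="building_id")

# Strategy 1: most frequent category (inflates the count of "b1")
most_frequent = building.fillna(building.mode()[0])

# Strategy 2: a new, explicit category for missing values
new_category = building.fillna("MISSING")

print(most_frequent.value_counts())
print(new_category.value_counts())
```

With the most frequent strategy, "b1" appears four times although only two listings are known to be in that building. The new category keeps the known counts intact and lets a model treat missingness as its own signal.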

Instructions 1/2
<ul>
<li>Create a SimpleImputer object with "mean" strategy.</li>
<li>Impute missing prices with the mean value.</li>
</ul>

In [66]:
# Import SimpleImputer
from sklearn.impute import SimpleImputer

# Create mean imputer
mean_imputer = SimpleImputer(strategy='mean')

# Price imputation
twosigma[['price']] = mean_imputer.fit_transform(twosigma[['price']])

Instructions 2/2
<ul>
<li>Create an imputer with "constant" strategy. Use "MISSING" as <code>fill_value</code>.</li>
<li>Impute missing buildings with a constant value.</li>
</ul>

In [70]:
# Create constant imputer
constant_imputer = SimpleImputer(strategy='constant', fill_value='MISSING')

# building_id imputation
twosigma[['building_id']] = constant_imputer.fit_transform(twosigma[['building_id']])

In [71]:
print(twosigma.isnull().sum())

id                0
bathrooms         0
bedrooms          0
building_id       0
latitude          0
longitude         0
manager_id        0
price             0
interest_level    0
dtype: int64
