<img src="https://drive.google.com/uc?id=1E_GYlzeV8zomWYNBpQk0i00XcZjhoy3S" width="100"/>

# DSGT Bootcamp Week 3: Feature Manipulation

## Learning Objectives  
1)  Handling Missing Values  
2)  Row/Column Manipulation  
3)  Feature Engineering  
4)  Feature Removal  (and feature normalization)    

<img src="https://miro.medium.com/max/1200/1*K6ctE0RZme0cqMtknrxq8A.png" width="350">

## Setup  
First we will mount the drive. Then we will read in our dataset. We will be using pandas again! 

In [0]:
from google.colab import drive
drive.mount('/content/drive')

In [0]:
%cd 'drive/My Drive/Track1(AppliedDataScience)/Participants/Data'

In [0]:
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings("ignore")

In [0]:
df = pd.read_csv('train.csv')

## Why do we need data preprocessing? 

When you receive data, there can be a lot to fix.  
Potential issues include:   
1) Flaws in the data itself (e.g., poor formatting or missing values)     
2) Information that could still be added (for example, by creating your own features)  
3) Data that is redundant or not useful 

# Missing Values & Value Imputation

<img src="https://3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com/wp-content/uploads/2017/03/How-to-Handle-Missing-Values-with-Python.jpg" width="300px">

How would you handle missing values? Many algorithms will not run with NaNs or missing values.   

\\
One approach is to delete every row that has a missing value.   
An example is removing all users with a missing field, such as age. However, you could lose a lot of data this way.    
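
To get a feel for how much data row deletion can cost, here is a minimal sketch. The `users` dataframe and its columns are made up for illustration and are not part of our housing dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: four of the five rows are missing exactly one field
users = pd.DataFrame({
    "age": [25, np.nan, 31, 40, np.nan],
    "city": ["Atlanta", "Boston", None, "Denver", "Austin"],
    "income": [50000, 62000, 58000, np.nan, 71000],
})

# dropna() removes every row that has at least one missing value
complete_rows = users.dropna()

print(f"rows before: {len(users)}, rows after: {len(complete_rows)}")
```

Even though each affected row is missing only one field, only one complete row survives.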

\\
**Value Imputation** is an intelligent way of filling in missing values. The following exercise walks through deleting random values from a few columns and then filling them back in using a few different types of imputation. 

In [0]:
# Creating a new dataframe with a subset of columns from the original
# .copy() makes mini_df independent of df, so the edits below act on mini_df itself
mini_df = df[['TotalBsmtSF', 'CentralAir', 'GarageArea']].copy()
mini_df


In [0]:
# Don't worry about this code block. Here we are just generating missing values.

# Replace roughly 20% of the values in each feature with NaN
np.random.seed(1)

mini_df['TotalBsmtSF'] = \
    mini_df['TotalBsmtSF'].mask(np.random.random(mini_df['TotalBsmtSF'].shape) < .2)

mini_df['CentralAir'] = \
    mini_df['CentralAir'].mask(np.random.random(mini_df['CentralAir'].shape) < .2)

mini_df['GarageArea'] = \
    mini_df['GarageArea'].mask(np.random.random(mini_df['GarageArea'].shape) < .2)
mini_df

In [0]:
# Generating missing values -- used for the purpose of the exercise
# Not critical to understand or reproduce!


np.random.seed(1)

# Filling each of the 3 features with some random values so we can practice
# our 3 types of imputation

mini_df['TotalBsmtSF'] = \
    mini_df['TotalBsmtSF'].mask(np.random.random(mini_df['TotalBsmtSF'].shape) < .2)

mini_df['CentralAir'] = \
    mini_df['CentralAir'].mask(np.random.random(mini_df['CentralAir'].shape) < .2)

mini_df['GarageArea'] = \
    mini_df['GarageArea'].mask(np.random.random(mini_df['GarageArea'].shape) < .2)

mini_df

What changes were made?

In [0]:
mini_df.isnull()

In [0]:
mini_df.isnull().sum(axis = 0) 
# Question? Any guesses as to what axis=0 means? 
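
One way to check your guess about `axis` is to try both values on a tiny dataframe. This is a sketch on made-up data; the `toy` dataframe is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing entries
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
})

# axis=0 collapses the rows, leaving one count per column
missing_per_column = toy.isnull().sum(axis=0)

# axis=1 collapses the columns, leaving one count per row
missing_per_row = toy.isnull().sum(axis=1)

print(missing_per_column)
print(missing_per_row)
```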

## Types of Imputation
- Fixed values (for example, all 0s)  
- Measures of central tendency (the mean, median, or mode of the existing entries) 
- Backfilling (filling each gap with the next existing value)
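
The three approaches can be compared side by side on a small made-up Series. This is only a sketch; forward filling (`ffill`) is included as a contrast to backfilling and is not one of the three types listed above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two gaps
s = pd.Series([10.0, np.nan, 30.0, np.nan, 50.0])

fixed = s.fillna(0)                # fixed value: every gap becomes 0
mean_filled = s.fillna(s.mean())   # central tendency: gaps become the mean of 10, 30, 50
back_filled = s.bfill()            # backfill: each gap copies the next valid value
forward_filled = s.ffill()         # forward fill: each gap copies the previous valid value

print(pd.DataFrame({
    "original": s,
    "fixed": fixed,
    "mean": mean_filled,
    "bfill": back_filled,
    "ffill": forward_filled,
}))
```

Note that backfilling cannot fill a gap in the very last row, because there is no later value to copy from.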

<img src="https://drive.google.com/uc?id=1IB1Tge9wPAqOqH5FcVlLGDDB4nmV4Ddr" width="600"/>

In [0]:
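# An example of fixed-value imputation
# Calling fillna on the dataframe (with a column-to-value dict) lets inplace=True update mini_df itself
mini_df.fillna({'CentralAir': 'N'}, inplace=True) # what does inplace=True mean?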

mini_df.isnull().sum(axis=0) 

In [0]:
# An example of mean imputation
mini_df['TotalBsmtSF'] = mini_df['TotalBsmtSF'].fillna(mini_df['TotalBsmtSF'].mean())

# An example of backfill imputation (each gap takes the next valid value below it)
mini_df['GarageArea'] = mini_df['GarageArea'].bfill()

mini_df

# Feature Creation
We can make features (or columns) from scratch. We can also derive them from existing columns.  
Here, feature engineering comes down to accessing and editing the rows and columns of the dataframe.

### Editing with Rows

In [0]:
# Let's start by viewing the head of the dataframe
df.head()

In [0]:
# Recap: Who remembers what 0 indexing means? 
df.iloc[2] # This returns a single row as a pandas Series

# What row of the dataframe does this return?

In [0]:
# We can chain .iloc with a column label to read the value at a specific row and column
df.iloc[2]['LotFrontage']

In [0]:
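# We can also use the assignment operator to change the value in an area
# *= is a special variant that multiplies the existing value by whatever is on
# the right side of the operator
# Chained indexing (df.iloc[2]['LotFrontage'] *= 1.25) can edit a temporary copy instead of df,
# so we select the row and column in one step (row label 2 is also position 2 here)
df.loc[2, 'LotFrontage'] *= 1.25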
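
Here is a small sketch of two ways to update a single cell so that the change is stored in the dataframe. The `toy` dataframe is made up for illustration, and it assumes the default 0, 1, 2, ... row labels.

```python
import pandas as pd

# Hypothetical toy data with the default integer row labels
toy = pd.DataFrame({
    "LotFrontage": [65.0, 80.0, 68.0],
    "Street": ["Pave", "Pave", "Grvl"],
})

# Label-based: row label 2 and column 'LotFrontage' in a single indexing step
toy.loc[2, "LotFrontage"] *= 1.25

# Position-based equivalent: look up the column's integer position first
col_pos = toy.columns.get_loc("LotFrontage")
toy.iloc[0, col_pos] *= 2

print(toy)
```

Both forms select the row and the column together, so pandas writes the new value into `toy` rather than into a temporary object.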

Can you work with your groups to figure out the value of TotalBsmtSF for the 10th row?

### Editing with Columns

Creating columns is a big part of feature engineering. You can create a column with a chosen value, with a random value, or by transforming the values of other columns.

In [0]:
# creating a column
df['New_Column_Chosen_Value'] = 2 # we can choose one constant value
df['New_Column_Random_Value'] = np.random.rand() # we can randomly generate a value (the same one is used in every row)
df['BiggerLotFrontage'] = df['LotFrontage'] + 1  # we can manipulate old columns

df
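
A single number assigned to a column is repeated in every row. If you want a different random value in each row, one option is to generate an array with one entry per row. The sketch below uses a made-up `toy` dataframe and hypothetical column names.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"LotFrontage": [65.0, 80.0, 68.0, 60.0]})

rng = np.random.default_rng(seed=0)  # seeded so the example is repeatable

# A single random number is broadcast to every row
toy["same_random_value"] = rng.random()

# An array with one entry per row gives each row its own value
toy["per_row_random_value"] = rng.random(len(toy))

print(toy)
```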

## Feature Creation Using the "where" Syntax  
Think of this as a conditional statement: "Keep the value **where** the condition holds, and replace it everywhere else."  
For example, we can make a binary feature.  
We will create the feature isLargeLot. You have a large lot **where** LotArea > 9500.

In [0]:
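# Start with 1 everywhere, then keep it only where LotArea > 9500;
# every other row is replaced with 0
df['isLargeLot'] = 1
df['isLargeLot'] = df['isLargeLot'].where(df["LotArea"] > 9500, 0)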
df
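
The direction of `where` is easy to mix up, so here is a small sketch on made-up lot sizes. It also shows two alternatives, `np.where` and casting the boolean mask, which build the same 0/1 column. The `toy` dataframe and the `via...` column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"LotArea": [8450, 9600, 11250, 9500]})
is_large = toy["LotArea"] > 9500

# Series.where keeps values where the condition is True and
# substitutes the second argument where it is False
ones = pd.Series(1, index=toy.index)
toy["viaSeriesWhere"] = ones.where(is_large, 0)

# np.where(condition, value_if_true, value_if_false) builds the column directly
toy["viaNumpyWhere"] = np.where(is_large, 1, 0)

# A boolean mask can also be cast straight to 0/1
toy["viaAstype"] = is_large.astype(int)

print(toy)
```

All three columns agree. The lot with an area of exactly 9500 gets a 0, because the condition is strictly greater than 9500.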

In [0]:
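# try to create a mini dataframe containing only the rows where MSSubClass == 60 
my_df = 
#        answer: df.where(df['MSSubClass'] == 60)
#        note: where keeps every row and fills the non-matching ones with NaN;
#        df[df['MSSubClass'] == 60] returns only the matching rows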
my_df
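
To see the difference between masking with `where` and filtering with a boolean mask, here is a sketch on a made-up four-row dataframe.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "MSSubClass": [60, 20, 60, 70],
    "LotArea": [8450, 9600, 11250, 9550],
})
mask = toy["MSSubClass"] == 60

masked = toy.where(mask)   # same shape; non-matching rows become NaN
filtered = toy[mask]       # only the matching rows remain

print(masked)
print(filtered)
```

`masked` still has four rows, two of them filled with NaN, while `filtered` has only the two matching rows.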

We can also derive features from other features. For example, let's make a ratio.

In [0]:
df['LotRatio'] = df['LotFrontage'] / df['LotArea']
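
One thing to watch for with derived features: if a source column has missing values, the new feature inherits them. The sketch below uses made-up numbers to show this.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing LotFrontage
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0],
    "LotArea": [8450, 9600, 11250],
})

# Arithmetic with NaN produces NaN, so the gap carries over to the new column
toy["LotRatio"] = toy["LotFrontage"] / toy["LotArea"]

# How many rows of the new feature are missing?
print(toy["LotRatio"].isnull().sum())
```

This is one reason to handle missing values before (or right after) building new features.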

In [0]:
# try incrementing all values in the "LotArea" column by 1
df['LotArea'] = 
# answer  df['LotArea'] + 1

## Non-Useful Data
Another data flaw is that the dataset may include data that is not useful.  
Why is this a problem?      
- You don't get useful information 
- Information overload can cloud analysis  
- Extra information can increase dataset size
- Extra information can increase computational overhead   

<img src="https://drive.google.com/uc?id=1YT_MAY-tkTdXBjq1Z5bgyBSHb0yF77g8" width="500"/>  









In [0]:
# dropping columns is easy  
# we will drop the demonstration columns we made above
df.drop(columns=['isLargeLot'], inplace=True)
df.drop(columns=['New_Column_Chosen_Value', 'New_Column_Random_Value', 'BiggerLotFrontage'], inplace=True)
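
One simple kind of non-useful column is a column that never changes, like the constant column we created earlier. The sketch below shows one possible way to find and drop such columns. The `toy` dataframe is made up, and "constant" is just one of many possible definitions of non-useful.

```python
import pandas as pd

# Hypothetical toy data; 'AlwaysTwo' has the same value in every row
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "AlwaysTwo": [2, 2, 2],
    "Street": ["Pave", "Pave", "Grvl"],
})

# nunique() counts distinct values; a count of 1 means the column never varies
constant_columns = [col for col in toy.columns if toy[col].nunique() == 1]

# drop returns a new dataframe unless inplace=True is passed
trimmed = toy.drop(columns=constant_columns)

print(constant_columns)
print(trimmed.columns.tolist())
```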

## Testing It Out: The effects of any data manipulation on misc. ML tasks  

For example, let's take a look at performance as we run feature selection, starting with simple (one-variable) linear regression. 

<img src="https://drive.google.com/uc?id=1MH6qYH3bM8TQ4os6geXhY9HHeD4yXuYk" width="400"/>


We can also choose to run multiple linear regression on any of the features we want. Let's select only useful features, a process known as **feature selection**  

<img src="https://drive.google.com/uc?id=1gnabMUH4gLyLyPygU56OXZmmZ5LzbNCP" width="690"/>

So, what do we have to work with?

In [0]:
df.columns

## One Feature

In [0]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()

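# Fit on a single feature; double brackets keep the input two-dimensional
lr.fit(df[['LotArea']], df['SalePrice'])

print(lr.coef_)
# score returns R^2, the coefficient of determination
print(lr.score(df[['LotArea']], df['SalePrice']))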
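
The number printed by `score` is R^2, which can be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below checks this on synthetic data; the data and variable names are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: a noisy straight line (hypothetical, not the housing data)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(50, 1))
y = 3.0 * x[:, 0] + rng.normal(0, 2.0, size=50)

lr = LinearRegression().fit(x, y)
predictions = lr.predict(x)

# R^2 = 1 - (sum of squared residuals) / (total sum of squares around the mean)
ss_res = np.sum((y - predictions) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, lr.score(x, y))
```

The two printed values should match up to floating-point rounding.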

## Few Features 

In [0]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()

# Fit on three features at once (multiple linear regression)
lr.fit(df[['LotArea', 'GarageArea', 'BsmtFinSF1']], df['SalePrice'])

print(lr.coef_)
print(lr.score(df[['LotArea','GarageArea', 'BsmtFinSF1']], df['SalePrice']))

## Many Features

In [0]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()

# Fit on a larger set of features, each listed once
lr.fit(df[['LotArea', 'GarageArea', 'BsmtFinSF1', 'TotalBsmtSF']], df['SalePrice'])

print(lr.coef_)
print(lr.score(df[['LotArea', 'GarageArea', 'BsmtFinSF1', 'TotalBsmtSF']], df['SalePrice']))

What do we notice from this?   
The score improves significantly with the first few features, but each additional feature adds less: diminishing returns.  
Meanwhile, every extra feature can bring additional computational overhead, multicollinearity, and other logistical issues. 
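
The same pattern can be reproduced on synthetic data. In the sketch below, the `strong`, `weak`, and `noise` columns and the target are all made up: the target depends heavily on `strong`, a little on `weak`, and not at all on `noise`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data (hypothetical column names)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "strong": rng.normal(size=n),
    "weak": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
toy["target"] = 5 * toy["strong"] + 1 * toy["weak"] + rng.normal(size=n)

# Fit on growing feature sets and record the training R^2 of each
feature_sets = [["strong"], ["strong", "weak"], ["strong", "weak", "noise"]]
scores = []
for features in feature_sets:
    lr = LinearRegression().fit(toy[features], toy["target"])
    scores.append(lr.score(toy[features], toy["target"]))
    print(features, round(scores[-1], 4))
```

Because the score is measured on the same rows the model was fit on, adding a feature can never lower it, even when the feature is pure noise. A small increase is therefore weak evidence that a feature is useful; scoring on held-out data is a more honest check.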