# Feature Engineering

# Imputation Lab

In [None]:
conda update pandas

In [None]:
import sagemaker
bucket=sagemaker.Session().default_bucket()
 
# Define IAM role
import boto3
from sagemaker import get_execution_role

role = get_execution_role()

Now let's bring in the Python libraries that we'll use throughout the analysis:

In [None]:
import numpy as np                                # For matrix operations and numerical processing
import pandas as pd                               # For munging tabular data
import matplotlib.pyplot as plt                   # For charts and visualizations
from IPython.display import Image                 # For displaying images in the notebook
from IPython.display import display               # For displaying outputs in the notebook
from time import gmtime, strftime                 # For labeling SageMaker models, endpoints, etc.
import sys                                        # For writing outputs to notebook
import math                                       # For ceiling function
import json                                       # For parsing hosting outputs
import os                                         # For manipulating filepath names
import sagemaker                                  # Amazon SageMaker's Python SDK provides many helper functions
import zipfile                                    # For extracting the downloaded archive

In [None]:
pd.__version__

Make sure the pandas version is 1.2.4 or later. If it is not, restart the kernel so that the updated package is loaded, then rerun the cells above before going further.
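The version check can also be done in code rather than by eye. The helper below is a minimal sketch: `version_at_least` is a hypothetical name, not part of pandas, and it only compares the leading numeric part of each version string.

```python
import re


def version_at_least(installed: str, required: str) -> bool:
    """Return True if `installed` is at least `required` (numeric parts only)."""

    def parse(version: str) -> tuple:
        # Keep only the leading dotted digits, e.g. "2.1.0rc1" -> (2, 1, 0)
        match = re.match(r"\d+(?:\.\d+)*", version)
        if match is None:
            raise ValueError(f"Unrecognized version string: {version!r}")
        return tuple(int(part) for part in match.group().split("."))

    return parse(installed) >= parse(required)


# Illustrative version strings; in the notebook, pass pd.__version__ instead
print(version_at_least("1.1.5", "1.2.4"))
print(version_at_least("1.3.0", "1.2.4"))
```

In the notebook, calling `version_at_least(pd.__version__, "1.2.4")` should return `True` before you continue.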

---

# Data
Let's start by downloading our dataset from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/machine-learning-databases/kddcup98-mld/epsilon_mirror/), hosted by the University of California, Irvine.
This is the dataset used for the Second International Knowledge Discovery and Data Mining Tools Competition, which was held in conjunction with KDD-98, the Fourth International Conference on Knowledge Discovery and Data Mining.
The competition task is a regression problem where the goal is to estimate the return from a direct mailing in order to maximize donation profits.

In [None]:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/kddcup98-mld/epsilon_mirror/cup98lrn.zip

with zipfile.ZipFile('cup98lrn.zip', 'r') as zip_ref:
    zip_ref.extractall('.')

---
The file contains more columns than we need, so we define the list of column names to keep. We then read those columns into a pandas DataFrame and display their data types.

---

In [None]:
cols = ['AGE', 'NUMCHLD', 'INCOME', 'WEALTH1', 'MBCRAFT','MBGARDEN', 'MBBOOKS', 'MBCOLECT', 'MAGFAML','MAGFEM', 'MAGMALE']
data = pd.read_csv('cup98LRN.txt', usecols=cols)
pd.set_option('display.max_columns', 500)     # Make sure we can see all of the columns
pd.set_option('display.max_rows', 25)         # Keep the output on one page
data.dtypes
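One detail worth knowing about `usecols`: pandas ignores the order of the names you pass and returns the columns in the order they appear in the file. The sketch below illustrates this with a hypothetical three-column CSV held in memory (the column names `A`, `B`, `C` are made up for the example).

```python
import io

import pandas as pd

# Hypothetical toy file with a header row
csv_text = "A,B,C\n1,2,3\n4,5,6\n"
wanted = ["C", "A"]

# usecols selects the columns, but the result follows the file's column order
df = pd.read_csv(io.StringIO(csv_text), usecols=wanted)
print(list(df.columns))

# Reindex explicitly if the order of `wanted` matters downstream
df_ordered = df[wanted]
print(list(df_ordered.columns))
```

This matters whenever a list such as `cols` is later reused to label an array built from `data`: `data.columns` is the safer source of names.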

Now let's take a look at the first few rows of the data:

In [None]:
data.head()

### Exploration
Let's start exploring the data.  First, let's understand how the features are distributed.

In [None]:
# Summary statistics and histograms for each numeric feature
display(data.describe())
%matplotlib inline
hist = data.hist(bins=30, sharey=False, figsize=(10, 10))

Let's determine the number of unique values in each variable.


_**The `nunique()` method ignores missing values by default. If we want to
count missing values as an additional category, we should set the
`dropna` argument to `False`: `data.nunique(dropna=False)`.**_

In [None]:
data.nunique()
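To see the effect of `dropna` on a small scale, here is a sketch on a hypothetical four-row frame (the values are made up; only the column names are borrowed from the lab).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with some missing entries
toy_unique = pd.DataFrame({
    "INCOME": [1.0, 2.0, 2.0, np.nan],
    "MBBOOKS": [0.0, np.nan, np.nan, 1.0],
})

without_nan = toy_unique.nunique()              # NaN is not counted
with_nan = toy_unique.nunique(dropna=False)     # NaN counts as one extra value

print(without_nan)
print(with_nan)
```

Each column gains exactly one extra category when it contains at least one missing value, no matter how many values are missing.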

---

Let's calculate the number of missing values in each variable:

---

In [None]:
data.isnull().sum()

Let's quantify the fraction of missing values in each variable (multiply by 100 to get a percentage):

In [None]:
data.isnull().mean()
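The count and the share of missing values can be combined into a single summary table. The sketch below uses a hypothetical toy frame; `missing_summary`, `n_missing`, and `pct_missing` are illustrative names, not part of the lab.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
toy_missing = pd.DataFrame({
    "AGE": [60.0, np.nan, 45.0, np.nan],
    "INCOME": [1.0, 2.0, np.nan, 4.0],
    "MBBOOKS": [0.0, 1.0, 0.0, 2.0],
})

is_missing = toy_missing.isnull()
missing_summary = pd.DataFrame({
    "n_missing": is_missing.sum(),             # count of NaN per column
    "pct_missing": is_missing.mean() * 100,    # mean of booleans = fraction
}).sort_values("pct_missing", ascending=False)

print(missing_summary)
```

The same pattern should work on `data` by replacing `toy_missing` with it.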

Finally, let's make a bar plot with the fraction of missing values per variable:

In [None]:
data.isnull().mean().sort_values(ascending=True).plot.bar(figsize=(12,6))
plt.ylabel('Fraction of missing values')
plt.xlabel('Variables')
plt.title('Quantifying missing data')

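## (1) Removing observations with missing data
Now, we'll remove the observations that have missing data in the `NUMCHLD` variable. Note that `subset=['NUMCHLD']` only checks that one column, so the other variables can still contain missing values afterwards; calling `data.dropna()` without `subset` would instead drop every row with a missing value in any variable.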

In [None]:
data_cca = data.dropna(subset=['NUMCHLD'])
data_cca.isnull().mean().sort_values(ascending=True).plot.bar(figsize=(12,6))
plt.ylabel('Fraction of missing values')
plt.xlabel('Variables')
plt.title('Quantifying missing data')

Let's print and compare the sizes of the original dataset and the dataset after removing rows with a missing `NUMCHLD`:

In [None]:
print('Number of total observations: {}'.format(len(data)))
print('Number of observations with complete cases:{}'.format(len(data_cca)))

The output shows how many observations remain after the removal:

Number of total observations: 95412

Number of observations with complete cases:12386
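The difference between dropping on one column and dropping on all columns is easy to miss. The following sketch contrasts the two on a hypothetical four-row frame (values are made up for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: only the last row is complete
toy_cca = pd.DataFrame({
    "AGE": [60.0, np.nan, 45.0, 30.0],
    "NUMCHLD": [np.nan, 1.0, 2.0, 0.0],
    "INCOME": [5.0, 3.0, np.nan, 2.0],
})

# Drop rows only where NUMCHLD is missing; other columns may still hold NaN
by_numchld = toy_cca.dropna(subset=["NUMCHLD"])

# Drop rows with a missing value in any column (true complete-case analysis)
complete_cases = toy_cca.dropna()

print(len(toy_cca), len(by_numchld), len(complete_cases))
print(by_numchld.isnull().sum())
```

Here `by_numchld` keeps three rows but still contains missing values in `AGE` and `INCOME`, while `complete_cases` keeps only the single fully observed row.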


## (2) Performing mean or median imputation

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop('MBCOLECT', axis=1),   # features
    data['MBCOLECT'],                # target
    test_size=0.3,
    random_state=0,
)
X_train.shape, X_test.shape

Check the fraction of missing values in each variable of the training set:

In [None]:
X_train.isnull().mean()

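Let's replace the missing values in four numerical variables with the median, using pandas. The median is computed on the training set only and then applied to both the training and test sets: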

In [None]:
for var in ['MBCRAFT', 'MBGARDEN', 'MBBOOKS', 'MAGFAML']:
    value = X_train[var].median()
    X_train[var] = X_train[var].fillna(value)
    X_test[var] = X_test[var].fillna(value)

Now check the fraction of missing values again and notice that it is zero for the four imputed variables:

In [None]:
X_train.isnull().mean()

To impute missing data with the mean instead, we use pandas' `mean()`: `value = X_train[var].mean()`.
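As a small illustration of the mean variant, the sketch below learns the mean on a hypothetical training frame and applies it to a hypothetical test frame (the values are made up; only the column name is borrowed from the lab).

```python
import numpy as np
import pandas as pd

# Hypothetical toy train/test split
toy_train = pd.DataFrame({"MBBOOKS": [0.0, 2.0, np.nan, 4.0]})
toy_test = pd.DataFrame({"MBBOOKS": [np.nan, 1.0]})

# mean() skips NaN by default: (0 + 2 + 4) / 3
train_mean = toy_train["MBBOOKS"].mean()

# Fill both sets with the statistic learned from the training set only
toy_train_imputed = toy_train.fillna({"MBBOOKS": train_mean})
toy_test_imputed = toy_test.fillna({"MBBOOKS": train_mean})

print(train_mean)
print(toy_test_imputed)
```

Using the training-set statistic for the test set keeps information from the test data out of the preprocessing step.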

### Mean or Median Imputation with scikit-learn

SimpleImputer() from scikit-learn will impute all variables in the
dataset. Therefore, if we use mean or median imputation and the dataset
contains categorical variables, we will get an error.

In [None]:
imputer = SimpleImputer(strategy='median')
imputer.fit(data)
imputer.statistics_

Let's replace missing values with medians:

In [None]:
impute_data_array = imputer.transform(data)

In [None]:
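# Use data.columns rather than cols: read_csv returns columns in file order, not in the order of cols
data_after_impute = pd.DataFrame(impute_data_array, columns=data.columns)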
data_after_impute.head()
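If a dataset does contain categorical variables, one way to avoid the error mentioned above is to restrict the imputer to the numeric columns. The sketch below is one possible approach, not the lab's own method: the toy frames, the `GENDER` column, and the name `numeric_imputer` are assumptions for illustration. It also fits the imputer on the training frame only, mirroring the pandas example earlier.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one numeric and one categorical column
toy_train = pd.DataFrame({
    "AGE": [30.0, np.nan, 50.0, 70.0],
    "GENDER": ["F", "M", None, "F"],
})
toy_test = pd.DataFrame({
    "AGE": [np.nan, 40.0],
    "GENDER": ["M", None],
})

# Select only the numeric columns so the median strategy is valid
numeric_cols = toy_train.select_dtypes(include="number").columns

# Learn the medians from the training data only
numeric_imputer = SimpleImputer(strategy="median")
numeric_imputer.fit(toy_train[numeric_cols])

# Write the imputed values back, leaving categorical columns untouched
toy_train_out = toy_train.copy()
toy_test_out = toy_test.copy()
toy_train_out[numeric_cols] = numeric_imputer.transform(toy_train[numeric_cols])
toy_test_out[numeric_cols] = numeric_imputer.transform(toy_test[numeric_cols])

print(numeric_imputer.statistics_)
print(toy_test_out)
```

The categorical column keeps its missing values here; a separate `SimpleImputer(strategy='most_frequent')` could handle it if needed.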

# Data Wrangler

Prepare for Data Wrangler by uploading the dataset to S3. Make sure to change the bucket name to your own preferred bucket.

In [None]:
your_bucket = 'imputation-lab-19112021'

import boto3, os
boto3.Session().resource('s3').Bucket(your_bucket).Object(os.path.join('KDDCup', 'cup98LRN.txt')).upload_file('cup98LRN.txt')

Now go to Data Wrangler.