# Feature Engineering

- So far we have performed feature analysis
    - data frame quick checks
    - categorical column analysis
    - bar charts, pie charts
    - numerical columns analysis
    - histograms, box plots, outliers
    - bi-multi variate analysis
    - correlation

- Now we need to learn feature engineering
    - we will create new columns for better ML models
    - we will perform **encoding**, which converts categorical data to numerical data
    - we will apply data **transformations**, as real-world data often does not follow a normal distribution
    - we will perform **scaling of the data: Standardization and Normalization**
    - we will perform **missing value analysis**
- Simply put, feature engineering means modifying the data to gain better insights

# Encoding

- Encoding means converting categorical data into numerical data
- ML/DL models work on math and cannot interpret text labels directly
- So it is very important to convert categorical labels into numerical values
- For example, a gender column has two labels
    - Male
    - Female
- Then Female can be represented as 0 and Male as 1
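
As a minimal sketch of this idea, the snippet below encodes a small, made-up `gender` column. The DataFrame `toy_df`, the dictionary `gender_map`, and the column name `gender_encoded` are hypothetical and used only for illustration; they are not part of the visa dataset.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the idea
toy_df = pd.DataFrame({"gender": ["Male", "Female", "Female", "Male"]})

# Explicit mapping: Female -> 0, Male -> 1
gender_map = {"Female": 0, "Male": 1}

# Keep the original column and store the codes in a new one
toy_df["gender_encoded"] = toy_df["gender"].map(gender_map)

print(toy_df)
```

Writing the codes to a new column keeps the original labels available for checking the result.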

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import seaborn as sns

filepath1=r"/Users/phani/Documents/Sai_Files/02Course_Files/07_Naresh_Data_Science/03FilesgivenbyOmkarSir/Data_Files/Visadataset.csv"
visa_df=pd.read_csv(filepath1)
visa_df

cat=visa_df.select_dtypes(include='object').columns.tolist()
num=visa_df.select_dtypes(exclude='object').columns.tolist()

In [2]:
visa_df['case_status'].unique()

array(['Denied', 'Certified'], dtype=object)

# Mapping
- Based on the above cell, Certified can be assigned 0
- and Denied can be assigned 1

In [3]:
# mapping
d={'Certified':0,'Denied':1}
d

{'Certified': 0, 'Denied': 1}

In [4]:
visa_df['case_status']

0           Denied
1        Certified
2           Denied
3           Denied
4        Certified
           ...    
25475    Certified
25476    Certified
25477    Certified
25478    Certified
25479    Certified
Name: case_status, Length: 25480, dtype: object

In [5]:

# Series.map accepts a function, a dict, or a Series as the mapping.
# It returns a new Series and does not change visa_df.
# With a dict, any value that is not one of the dict keys becomes NaN.
visa_df['case_status'].map(d)

0        1
1        0
2        1
3        1
4        0
        ..
25475    0
25476    0
25477    0
25478    0
25479    0
Name: case_status, Length: 25480, dtype: int64

In [6]:
visa_df['case_status']=visa_df['case_status'].map(d)
visa_df
# Run this cell only once. On a second run the column already holds 0/1,
# which are not keys of d, so every value becomes NaN.

Unnamed: 0,case_id,continent,education_of_employee,has_job_experience,requires_job_training,no_of_employees,yr_of_estab,region_of_employment,prevailing_wage,unit_of_wage,full_time_position,case_status
0,EZYV01,Asia,High School,N,N,14513,2007,West,592.2029,Hour,Y,1
1,EZYV02,Asia,Master's,Y,N,2412,2002,Northeast,83425.6500,Year,Y,0
2,EZYV03,Asia,Bachelor's,N,Y,44444,2008,West,122996.8600,Year,Y,1
3,EZYV04,Asia,Bachelor's,N,N,98,1897,West,83434.0300,Year,Y,1
4,EZYV05,Africa,Master's,Y,N,1082,2005,South,149907.3900,Year,Y,0
...,...,...,...,...,...,...,...,...,...,...,...,...
25475,EZYV25476,Asia,Bachelor's,Y,Y,2601,2008,South,77092.5700,Year,Y,0
25476,EZYV25477,Asia,High School,Y,N,3274,2006,Northeast,279174.7900,Year,Y,0
25477,EZYV25478,Asia,Master's,Y,N,1121,1910,South,146298.8500,Year,N,0
25478,EZYV25479,Asia,Master's,Y,Y,1918,1887,West,86154.7700,Year,Y,0
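
The warning about running the mapping cell twice can be reproduced on a tiny example. This is only a sketch: the Series `s` is a hypothetical stand-in for `visa_df['case_status']`, and `map_once` is an illustrative helper, not a pandas function.

```python
import pandas as pd

# Hypothetical stand-in for visa_df['case_status']
s = pd.Series(["Denied", "Certified", "Denied"])
d = {"Certified": 0, "Denied": 1}

once = s.map(d)      # labels are keys of d, so they become codes
twice = once.map(d)  # the codes 0/1 are not keys of d, so they become NaN

print(once.tolist())
print(twice.isna().all())

def map_once(series, mapping):
    """Apply the mapping only if every value is still one of its keys."""
    if series.isin(list(mapping)).all():
        return series.map(mapping)
    return series

# Calling the guarded version twice is harmless
safe = map_once(map_once(s, d), d)
print(safe.tolist())
```

A guard like this is one possible way to make the cell safe to re-run; reloading the data before mapping, as later cells do, is another.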


### continent

In [7]:
visa_df['continent'].unique()

array(['Asia', 'Africa', 'North America', 'Europe', 'South America',
       'Oceania'], dtype=object)

In [8]:
# create a dict for above unique values and map them
d={}
d['Asia']=0
d['Africa']=1
d['North America']=2
d['Europe']=3
d['South America']=4
d['Oceania']=5
d

{'Asia': 0,
 'Africa': 1,
 'North America': 2,
 'Europe': 3,
 'South America': 4,
 'Oceania': 5}

In [9]:
# Method 1
dict2={}
unique_values=sorted(visa_df['continent'].unique())
l=len(unique_values)
for i,j in zip(unique_values,range(l)):
    dict2[i]=j
print(dict2)

{'Africa': 0, 'Asia': 1, 'Europe': 2, 'North America': 3, 'Oceania': 4, 'South America': 5}


In [10]:
# Method 2
lables=sorted(visa_df['continent'].unique())
for i,j in enumerate(lables):
    d[j]=i
d

{'Asia': 1,
 'Africa': 0,
 'North America': 3,
 'Europe': 2,
 'South America': 5,
 'Oceania': 4}

In [11]:
list(enumerate(lables))


[(0, 'Africa'),
 (1, 'Asia'),
 (2, 'Europe'),
 (3, 'North America'),
 (4, 'Oceania'),
 (5, 'South America')]

In [12]:
lables=sorted(visa_df['continent'].unique())
d={j:i for i,j in enumerate(lables)}
d

{'Africa': 0,
 'Asia': 1,
 'Europe': 2,
 'North America': 3,
 'Oceania': 4,
 'South America': 5}

In [13]:
visa_df['continent']=visa_df['continent'].map(d)

In [14]:
visa_df

Unnamed: 0,case_id,continent,education_of_employee,has_job_experience,requires_job_training,no_of_employees,yr_of_estab,region_of_employment,prevailing_wage,unit_of_wage,full_time_position,case_status
0,EZYV01,1,High School,N,N,14513,2007,West,592.2029,Hour,Y,1
1,EZYV02,1,Master's,Y,N,2412,2002,Northeast,83425.6500,Year,Y,0
2,EZYV03,1,Bachelor's,N,Y,44444,2008,West,122996.8600,Year,Y,1
3,EZYV04,1,Bachelor's,N,N,98,1897,West,83434.0300,Year,Y,1
4,EZYV05,0,Master's,Y,N,1082,2005,South,149907.3900,Year,Y,0
...,...,...,...,...,...,...,...,...,...,...,...,...
25475,EZYV25476,1,Bachelor's,Y,Y,2601,2008,South,77092.5700,Year,Y,0
25476,EZYV25477,1,High School,Y,N,3274,2006,Northeast,279174.7900,Year,Y,0
25477,EZYV25478,1,Master's,Y,N,1121,1910,South,146298.8500,Year,N,0
25478,EZYV25479,1,Master's,Y,Y,1918,1887,West,86154.7700,Year,Y,0
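
The sorted-labels dictionary built above gives the same codes as pandas' own categorical codes. The sketch below checks this on a hypothetical toy Series (`toy` is not part of the dataset), assuming the labels are plain strings with no missing values.

```python
import pandas as pd

# Hypothetical toy column with a few continent labels
toy = pd.Series(["Asia", "Africa", "Europe", "Asia", "Oceania"])

# Manual approach from the cells above: sort the labels, then enumerate
labels = sorted(toy.unique())
d = {label: code for code, label in enumerate(labels)}
manual = toy.map(d)

# pandas alternative: convert to a categorical and read its integer codes
codes = toy.astype("category").cat.codes

print(manual.tolist())
print(codes.tolist())
```

Both approaches number the labels in alphabetical order, so the two lists agree.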


In [15]:
# The 3 lines below must be repeated for every categorical column except case_id
lables=sorted(visa_df['continent'].unique())
d={j:i for i,j in enumerate(lables)}
visa_df['continent']=visa_df['continent'].map(d)

In [16]:
# Step 1: read data
filepath1=r"/Users/phani/Documents/Sai_Files/02Course_Files/07_Naresh_Data_Science/03FilesgivenbyOmkarSir/Data_Files/Visadataset.csv"
visa_df=pd.read_csv(filepath1)
# Step 2: get the categorical columns to map; cat[0] is case_id, so skip it
cat=visa_df.select_dtypes(include='object').columns.tolist()
# Step 3: create a dictionary for each column
# Step 4: do the mapping using .map()
for col in cat[1:]:
    labels=sorted(visa_df[col].unique())
    # use separate names so the loop variable is not shadowed
    d={label:code for code,label in enumerate(labels)}
    visa_df[col]=visa_df[col].map(d)
# Step 5: check that the mapping has been done
visa_df
    


Unnamed: 0,case_id,continent,education_of_employee,has_job_experience,requires_job_training,no_of_employees,yr_of_estab,region_of_employment,prevailing_wage,unit_of_wage,full_time_position,case_status
0,EZYV01,1,2,0,0,14513,2007,4,592.2029,0,1,1
1,EZYV02,1,3,1,0,2412,2002,2,83425.6500,3,1,0
2,EZYV03,1,0,0,1,44444,2008,4,122996.8600,3,1,1
3,EZYV04,1,0,0,0,98,1897,4,83434.0300,3,1,1
4,EZYV05,0,3,1,0,1082,2005,3,149907.3900,3,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...
25475,EZYV25476,1,0,1,1,2601,2008,3,77092.5700,3,1,0
25476,EZYV25477,1,2,1,0,3274,2006,2,279174.7900,3,1,0
25477,EZYV25478,1,3,1,0,1121,1910,3,146298.8500,3,0,0
25478,EZYV25479,1,3,1,1,1918,1887,4,86154.7700,3,1,0
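
The loop above overwrites `d` on every pass, so the mappings are lost once it finishes. A possible variation, sketched below, keeps one dictionary per column so the codes can be translated back to labels later. The helper `encode_columns` and the toy DataFrame are hypothetical names introduced for illustration.

```python
import pandas as pd

def encode_columns(df, columns):
    """Return an encoded copy of df plus the per-column mappings used."""
    encoded = df.copy()
    mappings = {}
    for col in columns:
        labels = sorted(encoded[col].unique())
        mappings[col] = {label: code for code, label in enumerate(labels)}
        encoded[col] = encoded[col].map(mappings[col])
    return encoded, mappings

# Hypothetical toy data with two categorical columns
toy_df = pd.DataFrame({
    "continent": ["Asia", "Africa", "Asia"],
    "unit_of_wage": ["Hour", "Year", "Year"],
})

encoded_df, mappings = encode_columns(toy_df, ["continent", "unit_of_wage"])
print(encoded_df)
print(mappings)

# Decode one column by inverting its mapping
inverse = {code: label for label, code in mappings["continent"].items()}
decoded = encoded_df["continent"].map(inverse)
print(decoded.tolist())
```

Because the helper works on a copy, the original frame is left untouched and the cell can be re-run without producing NaN.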


# Label Encoder
- LabelEncoder converts categorical labels to numerical values
- It is in the package **sklearn (scikit-learn)**
    - The module name: **preprocessing**
        - The class name: **LabelEncoder**
- Syntax:
   - from package_name.module_name import class_name

- Steps:
   - Step 1: read the data and import the class
   - Step 2: create an instance of the class and save it in a variable
   - Step 3: apply fit_transform on the saved instance

- **fit transform**
  - **fit**: means find the logic
    - in the above example, fit means create the mapper or dict
  - **transform**: apply the logic and change the data
    - apply the mapper to the data
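
To make the two halves visible, the sketch below calls `fit` and `transform` separately on a hypothetical list of labels (the list `labels` is made up for illustration). After `fit`, the learned labels are stored in `classes_`, which plays the role of the dictionary built by hand earlier.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels, standing in for a column such as case_status
labels = ["Denied", "Certified", "Denied", "Certified"]

le = LabelEncoder()

# fit: learn the distinct labels (stored in sorted order)
le.fit(labels)
print(le.classes_)

# transform: replace each label by its position in classes_
codes = le.transform(labels)
print(codes)

# inverse_transform: go back from codes to the original labels
restored = le.inverse_transform(codes)
print(restored)
```

`fit_transform` does the first two steps in a single call.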

In [17]:
# Step 1: read the data and import the class
filepath1=r"/Users/phani/Documents/Sai_Files/02Course_Files/07_Naresh_Data_Science/03FilesgivenbyOmkarSir/Data_Files/Visadataset.csv"
visa_df=pd.read_csv(filepath1)
visa_df
from sklearn.preprocessing import LabelEncoder

In [18]:
# Step 2: create the encoder object and save it in a variable
le=LabelEncoder()

In [19]:
# Step 3: apply fit_transform using the saved encoder object

"""
Docstring:
Fit label encoder and return encoded labels.

Parameters
----------
y : array-like of shape (n_samples,)
    Target values.

Returns
-------
y : array-like of shape (n_samples,)
    Encoded labels.
    """
le.fit_transform(visa_df['case_status'])

array([1, 0, 1, ..., 0, 0, 0])

In [20]:
visa_df['case_status']=le.fit_transform(visa_df['case_status'])
visa_df

Unnamed: 0,case_id,continent,education_of_employee,has_job_experience,requires_job_training,no_of_employees,yr_of_estab,region_of_employment,prevailing_wage,unit_of_wage,full_time_position,case_status
0,EZYV01,Asia,High School,N,N,14513,2007,West,592.2029,Hour,Y,1
1,EZYV02,Asia,Master's,Y,N,2412,2002,Northeast,83425.6500,Year,Y,0
2,EZYV03,Asia,Bachelor's,N,Y,44444,2008,West,122996.8600,Year,Y,1
3,EZYV04,Asia,Bachelor's,N,N,98,1897,West,83434.0300,Year,Y,1
4,EZYV05,Africa,Master's,Y,N,1082,2005,South,149907.3900,Year,Y,0
...,...,...,...,...,...,...,...,...,...,...,...,...
25475,EZYV25476,Asia,Bachelor's,Y,Y,2601,2008,South,77092.5700,Year,Y,0
25476,EZYV25477,Asia,High School,Y,N,3274,2006,Northeast,279174.7900,Year,Y,0
25477,EZYV25478,Asia,Master's,Y,N,1121,1910,South,146298.8500,Year,N,0
25478,EZYV25479,Asia,Master's,Y,Y,1918,1887,West,86154.7700,Year,Y,0


In [21]:
# Step 1: read the data and import the class
filepath1=r"/Users/phani/Documents/Sai_Files/02Course_Files/07_Naresh_Data_Science/03FilesgivenbyOmkarSir/Data_Files/Visadataset.csv"
visa_df=pd.read_csv(filepath1)
from sklearn.preprocessing import LabelEncoder
# Step 2: create the encoder object
le=LabelEncoder()
# Step 3: apply fit_transform to every categorical column except case_id (cat[0])
cat= visa_df.select_dtypes(include='object').columns.tolist()
for i in cat[1:]:
    # le is re-fitted on each column, so afterwards it only remembers the last column's labels
    visa_df[i] = le.fit_transform(visa_df[i])
visa_df

Unnamed: 0,case_id,continent,education_of_employee,has_job_experience,requires_job_training,no_of_employees,yr_of_estab,region_of_employment,prevailing_wage,unit_of_wage,full_time_position,case_status
0,EZYV01,1,2,0,0,14513,2007,4,592.2029,0,1,1
1,EZYV02,1,3,1,0,2412,2002,2,83425.6500,3,1,0
2,EZYV03,1,0,0,1,44444,2008,4,122996.8600,3,1,1
3,EZYV04,1,0,0,0,98,1897,4,83434.0300,3,1,1
4,EZYV05,0,3,1,0,1082,2005,3,149907.3900,3,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...
25475,EZYV25476,1,0,1,1,2601,2008,3,77092.5700,3,1,0
25476,EZYV25477,1,2,1,0,3274,2006,2,279174.7900,3,1,0
25477,EZYV25478,1,3,1,0,1121,1910,3,146298.8500,3,0,0
25478,EZYV25479,1,3,1,1,1918,1887,4,86154.7700,3,1,0
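
Since a single `LabelEncoder` is re-fitted on every column, it cannot decode the earlier columns afterwards. One possible variation, sketched below on hypothetical toy data, keeps a separate encoder per column in a dictionary (`encoders` is an illustrative name).

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data with two categorical columns
toy_df = pd.DataFrame({
    "continent": ["Asia", "Africa", "Europe"],
    "has_job_experience": ["N", "Y", "N"],
})

# One fitted encoder per column, stored by column name
encoders = {}
for col in ["continent", "has_job_experience"]:
    encoders[col] = LabelEncoder()
    toy_df[col] = encoders[col].fit_transform(toy_df[col])

print(toy_df)

# Each stored encoder can still decode its own column
original = encoders["continent"].inverse_transform(toy_df["continent"])
print(original)
```

Note that the scikit-learn documentation describes `LabelEncoder` as intended for target labels; for input feature columns it points to `OrdinalEncoder`, which handles several columns at once.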


# **OneHotEncoder**
- First, count how many labels there are in a column
- That many new columns will be created
- For example, case_status has two labels
    - Certified
    - Denied
- So it will create two new columns
    - case_status_Certified
    - case_status_Denied
- In each row, exactly one of these columns has the value 1 and the others have 0
- This is why it is called one-hot encoding

|case_status|case_status_Certified|case_status_Denied|
|-|-|-|
|Certified|1|0|
|Denied|0|1|
|Denied|0|1|
|Certified|1|0|

In [22]:
# continent C_A C_Af C_E
# Asia       1   0    0
# Africa     0   1    0
# Europe     0   0    1

# pd.get_dummies

In [23]:
# Step 1: read the data and import the method
filepath1=r"/Users/phani/Documents/Sai_Files/02Course_Files/07_Naresh_Data_Science/03FilesgivenbyOmkarSir/Data_Files/Visadataset.csv"
visa_df=pd.read_csv(filepath1)
visa_df
cat=visa_df.select_dtypes(include='object').columns.tolist()
pd.get_dummies(visa_df['case_status'],dtype=int)

Unnamed: 0,Certified,Denied
0,0,1
1,1,0
2,0,1
3,0,1
4,1,0
...,...,...
25475,1,0
25476,1,0
25477,1,0
25478,1,0


In [24]:
visa_df['case_status']

0           Denied
1        Certified
2           Denied
3           Denied
4        Certified
           ...    
25475    Certified
25476    Certified
25477    Certified
25478    Certified
25479    Certified
Name: case_status, Length: 25480, dtype: object
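
As the last cell shows, `pd.get_dummies` returns a new frame and leaves `visa_df['case_status']` unchanged. The sketch below shows a few possible ways to bring the dummy columns into the data, using a hypothetical three-row frame (`toy_df` and the result names are illustrative).

```python
import pandas as pd

# Hypothetical toy data
toy_df = pd.DataFrame({
    "case_id": ["A1", "A2", "A3"],
    "case_status": ["Denied", "Certified", "Denied"],
})

# Option 1: build the dummy columns and join them next to the original column
dummies = pd.get_dummies(toy_df["case_status"], prefix="case_status", dtype=int)
joined = pd.concat([toy_df, dummies], axis=1)

# Option 2: let get_dummies replace the column inside the frame
replaced = pd.get_dummies(toy_df, columns=["case_status"], dtype=int)

# Option 3: drop the first dummy column, since it can be inferred from the rest
compact = pd.get_dummies(
    toy_df, columns=["case_status"], drop_first=True, dtype=int
)

print(joined)
print(replaced)
print(compact)
```

With two labels, the single remaining column in `compact` carries the same information as the pair of columns in `replaced`.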