# Feature Scaling

* A technique that brings features into a fixed range regardless of their units. If it is not performed, an ML model treats larger numbers as more important and smaller numbers as less important, regardless of the units of the features.
* Two types: Standardization and Normalization.
* Standardization: transform the data to have a mean of 0 and a standard deviation of 1. This controls the spread of the data by centering it around the mean.
* Normalization: fix the range of the data between two values, typically 0 to 1 or -1 to 1.
* Why do we do it?
  * ML models only see numbers. If one column holds Rs. 10 and the corresponding column holds 10 gm, the model treats both values as the same regardless of their units. Likewise, with Rs. 5 in one column and 10 gm in another, 10 gm outweighs Rs. 5 simply because 10 is greater than 5.
  * Feature scaling puts all features on a common range so that differences in units and magnitudes do not matter, which makes it easier for the model to give the features equal importance.
* When to do it?
  * When using ML algorithms that compute **distances** between data points, for example unsupervised methods like K-means clustering, which decide which points are closest together.
  * KNN
  * PCA
* Algorithms that are rule-based (tree-based), for example Random Forest, do not require feature scaling. Naive Bayes is also largely unaffected by it.
* Two common scalers are the Min-Max Scaler and the Standard Scaler.
  * Min-Max Scaler: subtracts the minimum from every value and divides by the difference between the maximum and the minimum, which shrinks the data so that all values lie between 0 and 1. Because it relies on the minimum and maximum values, it is very sensitive to outliers, which have extreme values. That is its main disadvantage.
  * Standard Scaler: transforms the data so that the mean is 0 and the standard deviation is 1. It works best when the distribution is approximately normal. It is the most popular scaler.
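
The two scalers described above can be sketched directly from their formulas. This is a minimal illustration on hypothetical toy data (the `price_rs` array and both helper functions are made up for this example); scikit-learn's `MinMaxScaler` and `StandardScaler` apply the same transformations column by column.

```python
import numpy as np

# Hypothetical toy column in rupees; 1000 is a deliberate outlier.
price_rs = np.array([10.0, 20.0, 30.0, 40.0, 1000.0])

def min_max_scale(x):
    """Rescale to [0, 1]: (x - min) / (max - min)."""
    return (x - x.min()) / (x.max() - x.min())

def standardize(x):
    """Rescale to mean 0 and standard deviation 1: (x - mean) / std."""
    return (x - x.mean()) / x.std()

scaled = min_max_scale(price_rs)
z = standardize(price_rs)

print(scaled)            # the outlier maps to 1, the other values are squeezed near 0
print(z.mean(), z.std()) # approximately 0 and 1
```

Note how the single outlier pushes the four ordinary values into a tiny slice of the 0 to 1 range, which is the outlier sensitivity mentioned for the Min-Max Scaler.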

# Feature Engineering
* Feature Engineering involves studying and understanding the features and then engineering them, which means *Combining, Removing and Adding features*.
* So far, removing was done by dropping irrelevant features identified with the correlation matrix, and combining was done by merging features that hold the same values or information, identified with categorical visualization. Adding is new here, although it has already been done in Pandas when adding columns and features.
* Engineering New Features:
  * Combining: features that carry the same information can be combined, as keeping them separate can produce imbalance in the dataset.
  * Removing: remove the features that won't impact the solution. For example, ID or name in a real estate dataset is not needed to decide the price, and such features can be spotted by observing the correlation matrix.
  * Adding: the best way to create new features is to build them from existing features, for example with dummy variables. We can add a feature using *Categorical Column Grouping*, *Creating new features using old ones*, and *Dummy Variables*.
  
* Techniques for performing feature engineering:

## Imputation
* Missing values affect the performance of ML models. The simplest solution is to drop the rows or the entire columns that contain too many missing values.

        value = 0.8  # threshold: maximum allowed fraction of missing values
        # keep only the columns whose fraction of missing values is below the threshold
        data = data[data.columns[data.isnull().mean() < value]]
        # keep only the rows whose fraction of missing values is below the threshold
        data = data.loc[data.isnull().mean(axis=1) < value]
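
A small runnable sketch of threshold-based dropping, assuming a hypothetical toy frame `df` and a threshold of 0.6 chosen only so that one column and one row get dropped:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: column "c" is entirely missing, row 2 is mostly missing.
df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [5.0, 6.0, np.nan, 8.0],
    "c": [np.nan, np.nan, np.nan, np.nan],
    "d": [9.0, 10.0, 11.0, 12.0],
})
threshold = 0.6  # assumed value for this demo

# Keep columns whose fraction of missing values is below the threshold.
kept_cols = df.loc[:, df.isnull().mean() < threshold]
# Then keep rows whose fraction of missing values is below the threshold.
cleaned = kept_cols.loc[kept_cols.isnull().mean(axis=1) < threshold]

print(cleaned)
```

Column `c` is removed first, so the row check runs on the remaining three columns; row 2 is missing two of those three values and is dropped as well.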
        
### Numerical Imputation
* Filling missing values is preferable to dropping them. You can fill them with 0 or with the column median. The median is preferred over the mean because column averages are sensitive to outlier values.

        #filling using 0
        d = d.fillna(0)
        #filling using the median of each numeric column
        d = d.fillna(d.median(numeric_only=True))
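
To see why the median is the safer fill value, here is a minimal sketch on a hypothetical `income` series that contains one outlier:

```python
import numpy as np
import pandas as pd

# Hypothetical data: 1000 is an outlier, one value is missing.
income = pd.Series([30.0, 32.0, 35.0, np.nan, 1000.0])

mean_filled = income.fillna(income.mean())      # mean is pulled up by the outlier
median_filled = income.fillna(income.median())  # median stays near the typical values

print(mean_filled[3], median_filled[3])
```

The mean-based fill lands far above every ordinary value, while the median-based fill stays between 32 and 35.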
      
### Categorical Imputation
* Replace the missing values with the most frequent value in the column.

        d['column_name'] = d['column_name'].fillna(d['column_name'].value_counts().idxmax())
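
A short runnable version of the same idea, using a hypothetical `city` column:

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with two missing entries.
d = pd.DataFrame({"city": ["Roma", "Madrid", np.nan, "Roma", np.nan]})

# value_counts() ignores missing values, so idxmax() returns the most frequent label.
most_frequent = d["city"].value_counts().idxmax()
d["city"] = d["city"].fillna(most_frequent)

print(d)
```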

## Handling Outliers
Detecting outliers with statistical methods. 
### Using Standard Deviation
    factor = 3  # assumed multiplier; values between 2 and 4 are common choices
    upper_lim = d['column'].mean() + d['column'].std() * factor
    lower_lim = d['column'].mean() - d['column'].std() * factor
    d = d[(d['column'] < upper_lim) & (d['column'] > lower_lim)]
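
A runnable sketch of the standard deviation filter on hypothetical data. The multiplier of 3 is an assumption made for this example, not a fixed rule:

```python
import pandas as pd

# Hypothetical data: twenty ordinary values (10..29) plus one extreme value.
d = pd.DataFrame({"column": list(range(10, 30)) + [500]})

factor = 3  # assumed multiplier on the standard deviation
mean = d["column"].mean()
std = d["column"].std()
upper_lim = mean + factor * std
lower_lim = mean - factor * std

# Keep only the rows that fall strictly inside the limits.
filtered = d[(d["column"] < upper_lim) & (d["column"] > lower_lim)]

print(len(d), len(filtered))
```

Keep in mind that the outlier itself inflates the mean and the standard deviation, so on very small samples an extreme value may still fall inside the limits.
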
### Using Percentiles
Treat a fixed percentage of the values at the top or the bottom of the distribution as outliers.

    # drop outlier rows by percentile: keep only the values strictly between the 10th and 90th percentiles
    upper_lim = d['column'].quantile(.90)
    lower_lim = d['column'].quantile(.10)
    d = d[(d['column'] < upper_lim) & (d['column'] > lower_lim)]
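
Dropping rows shrinks the dataset. An alternative to the code above, shown here only as a sketch on hypothetical data, is to cap the values at the percentile limits with `Series.clip` so that every row is kept:

```python
import pandas as pd

# Hypothetical data: the integers 1..100.
d = pd.DataFrame({"column": list(range(1, 101))})

upper_lim = d["column"].quantile(0.90)
lower_lim = d["column"].quantile(0.10)

# Cap instead of drop: values outside the limits are replaced by the limits.
capped = d["column"].clip(lower=lower_lim, upper=upper_lim)

print(len(capped), capped.min(), capped.max())
```

Capping keeps the row count unchanged, at the cost of distorting the tails of the distribution.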
    
## Binning
Binning can be applied to both categorical and numerical data.

    import pandas as pd
    import numpy as np

    # numerical binning
    data = pd.DataFrame({'value': (2, 45, 7, 86, 73)})
    data['bin'] = pd.cut(data['value'], bins=[0, 30, 70, 100], labels=["Low", "Mid", "High"])
    print(data)

Output:

       value   bin
    0      2   Low
    1     45   Mid
    2      7   Low
    3     86  High
    4     73  High


    # categorical binning
    dataa = pd.DataFrame({'Country': ('Spain', 'Chile', 'Australia', 'Italy', 'Brazil')})
    conditions = [
        dataa['Country'].str.contains('Spain'),
        dataa['Country'].str.contains('Italy'),
        dataa['Country'].str.contains('Chile'),
        dataa['Country'].str.contains('Brazil')]
    choices = ['Europe', 'Europe', 'South America', 'South America']
    dataa['Continent'] = np.select(conditions, choices, default='Other')
    print(dataa)

Output:

         Country      Continent
    0      Spain         Europe
    1      Chile  South America
    2  Australia          Other
    3      Italy         Europe
    4     Brazil  South America
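
When the categories match exactly, the same grouping can also be written with a lookup dictionary. This is an alternative sketch, not the approach used above; the `continent_of` dictionary is a name made up for this example:

```python
import pandas as pd

dataa = pd.DataFrame({"Country": ["Spain", "Chile", "Australia", "Italy", "Brazil"]})

# Hypothetical lookup table from country to continent.
continent_of = {
    "Spain": "Europe",
    "Italy": "Europe",
    "Chile": "South America",
    "Brazil": "South America",
}

# Countries missing from the dictionary map to NaN and are then labelled "Other".
dataa["Continent"] = dataa["Country"].map(continent_of).fillna("Other")

print(dataa)
```

A dictionary is easier to extend than a list of conditions, but it only handles exact matches, whereas `str.contains` also matches substrings.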


## One-hot encoding
One of the most common encoding methods in machine learning. The following example illustrates it.

    Fruit     Categorical value of fruit    Price
    apple     1                             5
    mango     2                             10
    apple     1                             15
    orange    3                             20

After applying one-hot encoding:

    apple    mango    orange    price
    1        0        0         5
    0        1        0         10
    1        0        0         15
    0        0        1         20

    from sklearn.preprocessing import OneHotEncoder
    from sklearn.compose import ColumnTransformer

    # Zone is a categorical value which needs to be one-hot encoded
    dat = pd.DataFrame({'Zone': (1, 1, 3, 2, 1, 2, 2, 4, 2, 1),
                        'CreditScore': (619, 608, 502, 699, 850, 645, 822, 376, 501, 684)})
    print(dat)

    # [0] selects the first column (Zone); the remaining columns pass through unchanged
    columnTransformer = ColumnTransformer([('encoder', OneHotEncoder(), [0])], remainder='passthrough')

    dat = np.array(columnTransformer.fit_transform(dat), dtype=str)
    print(dat)  # first four columns represent the zones (1, 2, 3, 4); the last column is CreditScore

Output:

       Zone  CreditScore
    0     1          619
    1     1          608
    2     3          502
    3     2          699
    4     1          850
    5     2          645
    6     2          822
    7     4          376
    8     2          501
    9     1          684
    [['1.0' '0.0' '0.0' '0.0' '619.0']
     ['1.0' '0.0' '0.0' '0.0' '608.0']
     ['0.0' '0.0' '1.0' '0.0' '502.0']
     ['0.0' '1.0' '0.0' '0.0' '699.0']
     ['1.0' '0.0' '0.0' '0.0' '850.0']
     ['0.0' '1.0' '0.0' '0.0' '645.0']
     ['0.0' '1.0' '0.0' '0.0' '822.0']
     ['0.0' '0.0' '0.0' '1.0' '376.0']
     ['0.0' '1.0' '0.0' '0.0' '501.0']
     ['1.0' '0.0' '0.0' '0.0' '684.0']]
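
For quick exploration in Pandas, `pd.get_dummies` produces the same kind of encoding while keeping column names. A minimal sketch on a shortened, hypothetical version of the data above:

```python
import pandas as pd

# Shortened toy version of the Zone / CreditScore data.
dat = pd.DataFrame({"Zone": [1, 1, 3, 2], "CreditScore": [619, 608, 502, 699]})

# One indicator column per distinct Zone value; dtype=int gives 0/1 instead of booleans.
encoded = pd.get_dummies(dat, columns=["Zone"], prefix="Zone", dtype=int)

print(encoded)
```

Unlike a fitted `OneHotEncoder`, `get_dummies` does not remember the categories it saw, so it is less convenient when the same encoding has to be applied to new data later.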


## Grouping Operations


* The key point of group-by operations is to decide the aggregation functions of the features.
* For numerical features, average and sum functions are usually used, whereas for categorical features it is more complicated.
* Categorical Column Grouping
  * The first way is to select the label with the highest frequency.
  * The second way is to make a pivot table. This approach resembles the one-hot encoding of the preceding section, with one difference: instead of binary indicators, the cells hold aggregated values of another column (here, the summed visit days per city).

    da = pd.DataFrame({'User': (1, 2, 1, 3, 2, 1, 1),
                       'City': ('Roma', 'Madrid', 'Madrid', 'Istanbul', 'Istanbul', 'Istanbul', 'Roma'),
                       'Visit Days': (1, 2, 1, 1, 4, 3, 3)})
    print(da)

    # first way: most frequent value within each (User, City) group
    print('\n')
    daa = da.groupby(['User', 'City']).agg(lambda x: x.value_counts().index[0])
    print(daa)

    # second way: pivot table with the summed visit days per user and city
    print('\n')
    daaa = da.pivot_table(index='User', columns='City', values='Visit Days', aggfunc='sum', fill_value=0)
    print(daaa)

Output:

       User      City  Visit Days
    0     1      Roma           1
    1     2    Madrid           2
    2     1    Madrid           1
    3     3  Istanbul           1
    4     2  Istanbul           4
    5     1  Istanbul           3
    6     1      Roma           3


                   Visit Days
    User City                
    1    Istanbul           3
         Madrid             1
         Roma               1
    2    Istanbul           4
         Madrid             2
    3    Istanbul           1


    City  Istanbul  Madrid  Roma
    User                        
    1            3       1     4
    2            4       2     0
    3            1       0     0
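
To select the single most frequent label per user, which is what the first way describes, one option is to group by `User` only. This sketch uses a slightly extended, hypothetical version of the data (one extra row for user 2) so that no user has a tie:

```python
import pandas as pd

# Hypothetical variant of the visits data; the last row is added to avoid a tie for user 2.
da = pd.DataFrame({
    "User": [1, 2, 1, 3, 2, 1, 1, 2],
    "City": ["Roma", "Madrid", "Madrid", "Istanbul", "Istanbul", "Istanbul", "Roma", "Istanbul"],
})

# value_counts() sorts labels by frequency, so index[0] is the most frequent city.
top_city = da.groupby("User")["City"].agg(lambda s: s.value_counts().index[0])

print(top_city)
```

When two labels are equally frequent, which one comes first is an implementation detail, so ties need an explicit rule if they matter.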


### Numerical Column Grouping
* Numerical columns are grouped using sum and mean functions in most cases.

        #sum_cols: List of columns to sum
        #mean_cols: List of columns to average
        grouped = data.groupby('column_to_group')

        sums = grouped[sum_cols].sum().add_suffix('_sum')
        avgs = grouped[mean_cols].mean().add_suffix('_avg')

        new_df = pd.concat([sums, avgs], axis=1)
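
The template above can be run end to end on a small example. The column names `visits` and `spend` and the toy values are assumptions made purely for illustration:

```python
import pandas as pd

# Hypothetical toy data with one grouping column and two numerical columns.
data = pd.DataFrame({
    "column_to_group": ["a", "a", "b", "b"],
    "visits": [1, 2, 3, 4],
    "spend": [10.0, 20.0, 30.0, 50.0],
})
sum_cols = ["visits"]   # columns to sum
mean_cols = ["spend"]   # columns to average

grouped = data.groupby("column_to_group")
sums = grouped[sum_cols].sum().add_suffix("_sum")
avgs = grouped[mean_cols].mean().add_suffix("_avg")

# One row per group, with the suffixed aggregate columns side by side.
new_df = pd.concat([sums, avgs], axis=1)

print(new_df)
```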

## Feature Split
* Splitting a feature is often useful, but there is no single way to do it. How to split depends on the characteristics of the column.

    datt = pd.DataFrame({'name': ('Luther N. Gonzalez', 'Charles M. Young', 'Terry Lawson', 'Kristen White', 'Thomas Logsdon')})
    print(datt.name.str.split(" ").map(lambda x: x[0]))   # extracting the first name
    print(datt.name.str.split(" ").map(lambda x: x[-1]))  # extracting the last name

Output:

    0     Luther
    1    Charles
    2      Terry
    3    Kristen
    4     Thomas
    Name: name, dtype: object
    0    Gonzalez
    1       Young
    2      Lawson
    3       White
    4     Logsdon
    Name: name, dtype: object
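
To keep the split parts as new features rather than only printing them, one possible sketch stores them as columns. The column names `first_name` and `last_name` are made up for this example:

```python
import pandas as pd

datt = pd.DataFrame({"name": ["Luther N. Gonzalez", "Charles M. Young", "Terry Lawson"]})

# Split once on spaces, then pick the first and last token of each list.
parts = datt["name"].str.split(" ")
datt["first_name"] = parts.str[0]
datt["last_name"] = parts.str[-1]

print(datt)
```

This assumes every name has the given name first and the family name last; names in another order or with suffixes would need a different rule.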
