# Converting a Dataset into a Machine-Readable Numerical Format
### Process Raw Data
Data in its raw format is rarely suitable for analysis. <br>
We will therefore clean the raw data file before using it. <br>
#### Data can be:
    1. Numerical
    2. Categorical
    3. Ordinal
#### Problems:
    1. Missing data
    2. Noisy data
    3. Inconsistent data
#### Techniques used to tackle the above problems:
    1. Conversion of data (converting categorical and ordinal data into numerical format)
    2. Ignoring missing values / filling in missing values
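
The two ways of dealing with missing values listed above can be sketched roughly as follows. The toy DataFrame and its column names are made up for illustration, and filling with the median or the most frequent value is only one possible choice (the pipeline later in this notebook simply drops incomplete rows).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "age": [25, np.nan, 40],
    "workclass": ["Private", "State-gov", None],
})

# Option 1: ignore missing values by dropping every row that contains one.
dropped = df.dropna()

# Option 2: fill in missing values
# (median for the numeric column, most frequent value for the categorical one).
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].median())
filled["workclass"] = filled["workclass"].fillna(filled["workclass"].mode()[0])

print(dropped)
print(filled)
```

Dropping rows is the simplest option but discards data; filling keeps every row at the cost of introducing guessed values.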


In [1]:
# Helping libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import Normalizer
from sklearn.feature_selection import SelectKBest, chi2

### Label Encoder
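MultiColumn_LabelEncoder is a class that converts categorical data from a non-numeric format to a numeric format by applying sklearn's LabelEncoder to each selected column. <br>
It returns a copy of the data in which the selected columns are replaced by their encoded values. <br>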
Link: <a href="https://bit.ly/2F2Jc60">sklearn LabelEncoder</a>
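
As a small illustration of what label encoding does to a single column, here is a sketch using sklearn's `LabelEncoder` directly. The sample values are invented for the example.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative values only; any list of strings works the same way.
values = ["Private", "State-gov", "Private", "Local-gov"]

encoder = LabelEncoder()
codes = encoder.fit_transform(values)

# Classes are stored in sorted order; each value is replaced by its index.
print(list(encoder.classes_))  # ['Local-gov', 'Private', 'State-gov']
print(codes.tolist())          # [1, 2, 1, 0]

# The mapping is reversible.
original = encoder.inverse_transform(codes)
print(list(original))
```

Note that the integer codes impose an arbitrary order on the categories, which is one reason the notebook one-hot encodes the features afterwards.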

In [2]:
class MultiColumn_LabelEncoder:
    
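    # Specify the names of the columns that need to be encoded
    # (None means every column is encoded)
    def __init__(self, columns=None):
        self.columns = columns
    
    # Nothing is learned here: each column gets its own LabelEncoder in transform
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        output = X.copy()
        if self.columns is not None:
            for column in self.columns:
                output[column] = LabelEncoder().fit_transform(output[column])
                # Mark as categorical so OneHotEncode can pick the column up later
                output[column] = output[column].astype('category')
        else:
            # DataFrame.iteritems() was removed in pandas 2.0; items() replaces it
            for column_name, column in output.items():
                output[column_name] = LabelEncoder().fit_transform(column)
        return output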
    
    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)

### One Hot Encoding
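OneHotEncode is a class that converts categorical data into a machine-understandable (i.e., numerical) format. <br>
It replaces each categorical feature column with one-hot encoded columns, dropping the first level of each attribute; the implementation below uses pandas' <b>get_dummies</b> rather than sklearn's OneHotEncoder. <br>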
Link: <a href="https://bit.ly/2I7wbNu">sklearn OneHotEncoder</a>
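
A minimal sketch of one-hot encoding with `pd.get_dummies` and `drop_first=True`, the call the class below relies on. The toy DataFrame is invented for illustration.

```python
import pandas as pd

# Hypothetical toy features; the values are illustrative only.
X = pd.DataFrame({
    "Age": [39, 50, 38],
    "Race": ["White", "Black", "White"],
})

# One indicator column per category; drop_first=True drops the first
# (alphabetically smallest) level to avoid a redundant column.
encoded = pd.get_dummies(X, columns=["Race"], drop_first=True)

print(list(encoded.columns))  # ['Age', 'Race_White']
print(encoded)
```

With two categories a single indicator column is enough: a row that is not `Race_White` must belong to the dropped level.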

In [3]:
class OneHotEncode:
    
    # Specify the data; its last column is treated as the label
    def __init__(self, data):
        self.data = data
    
    def one_hot_encode(self):
        # Separating features and labels
        X = self.data.iloc[:, :-1]
        y = self.data.iloc[:, -1]
        
        columns_to_encode = list(X.select_dtypes(include=['category', object]))
        if not columns_to_encode:
            print("No attributes to one hot encode.")
            X_new = X.copy()
        else:
            print("Attributes to one hot encode :", columns_to_encode)
            # One-hot encoding on Features with categorical values
            X_new = pd.get_dummies(X, drop_first=True, columns=columns_to_encode)
    
        return X_new, y

# Transforming data
We will perform several steps to transform the raw data into a format that can be used. <br>
### Scaling
Each column is rescaled to lie between a given minimum and maximum value. <br>
We use the <b>MinMaxScaler</b> class provided by sklearn for scaling data. <br>
Link: <a href="https://bit.ly/2sHjiPE">sklearn MinMaxScaler</a>
### Normalizing
Each sample (row) is rescaled so that its L2 norm equals 1. <br>
We use the <b>Normalizer</b> class provided by sklearn for normalizing data. <br>
Link: <a href="https://bit.ly/2Jp44cl">sklearn Normalizer</a>
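
The difference between the two transformations can be seen on a small array. This is only a sketch with made-up numbers: `MinMaxScaler` works column by column, while `Normalizer` works row by row.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer

# Illustrative feature matrix: 3 samples (rows), 2 features (columns).
features = np.array([[1.0, 10.0],
                     [2.0, 20.0],
                     [3.0, 40.0]])

# Scaling: every column is mapped onto the range [0, 1].
scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(features)

# Normalizing: every row is divided by its own L2 norm.
normalized = Normalizer(norm="l2").fit_transform(features)

print(scaled)
print(np.linalg.norm(normalized, axis=1))  # each row now has norm 1
```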

In [4]:
class Transform:
    def __init__(self, X, y, minmax=None, normalize=False):
        self.X = X
        self.y = y
        self.minmax = minmax
        self.normalize=normalize
    
    def min_max_normalize(self):
        # Float copy of the features, so the in-place scalers (copy=False)
        # never modify self.X
        features = np.array(self.X.values, dtype=float)
        labels = np.array(self.y.values)
        
        # Rescaling each column to the range given by self.minmax
        if self.minmax is not None:
            min_max = MinMaxScaler(feature_range=self.minmax, copy=False)
            features = min_max.fit_transform(features)
        
        # Normalizing each sample (L2 normalization); this also works
        # when no min-max range was given
        if self.normalize:
            normalizer = Normalizer(copy=False)
            features = normalizer.fit_transform(features)
        
        return features, labels

## Preprocess Data
This function takes the path of an input CSV file and preprocesses its data before it is actually used for other computational work.

### SelectKBest (check parameter scores)
We use the <b>SelectKBest</b> class from sklearn to check the score of each attribute and then decide which ones to eliminate if their contribution to learning is low. <br>
Link: <a href="https://bit.ly/2w0hp1Q">sklearn SelectKBest</a>
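
As a rough sketch of how the scores can be read, the example below runs `SelectKBest` with `chi2` on a tiny made-up matrix. The data and the choice `k=2` are assumptions for illustration; the function below uses `k='all'` and only prints the scores. `chi2` expects non-negative feature values.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Illustrative data: the first two features track the label,
# the third is constant and carries no information.
X = np.array([[1, 0, 3],
              [0, 1, 3],
              [1, 0, 3],
              [0, 1, 3]], dtype=float)
y = np.array([1, 0, 1, 0])

# Keep the 2 highest-scoring features.
selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X, y)

print(selector.scores_)        # one chi2 score per feature
print(selector.get_support())  # mask of the features that were kept
print(X_selected.shape)
```

A low score suggests the attribute is close to independent of the label, which makes it a candidate for elimination.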

In [5]:
def preprocess_data(file_path, nan_values='?', minmax=None, normalize=False):
    data = pd.read_csv(file_path, na_values=nan_values)
    # Handling NaN values: drop every row that contains a missing value
    data = data.dropna()
    
    # Array of column names with data type as object (non integer or float)
    object_attributes = list(data.select_dtypes(include='object'))
    if not object_attributes:
        print("No attributes to label encode.")
        new_data = data
    else:
        print("Attributes for label encoding: ", object_attributes)
        label_encoder = MultiColumn_LabelEncoder(columns=object_attributes)
        new_data = label_encoder.fit_transform(data)
    
    X, y = OneHotEncode(data=new_data).one_hot_encode()
    print("\nColumn names after processing :\n", list(X.columns))
    print("\nTotal number of columns: ", len(list(X.columns)))
    
    # Numpy array for features and labels
    transform_data = Transform(X=X, y=y, minmax=minmax, normalize=normalize)
    features, labels = transform_data.min_max_normalize()
    
    # Check scores of each attribute for selecting best ones
    # (chi2 requires non-negative features, e.g. after min-max scaling to (0, 1))
    selector = SelectKBest(score_func=chi2, k='all')
    selector.fit(features, labels)
    print("\nFeature scores based on chi2: ", selector.scores_)
    
    return features, labels

In [6]:
def main():
    file = 'datasets/adult_training.csv'
    preprocess_data(file_path=file, nan_values=' ?', minmax=(0,1), normalize=False)

In [7]:
if __name__ == '__main__':
    main()

Attributes for label encoding:  ['Workclass', 'Education', 'Marital_Status', 'Occupation', 'Relationship', 'Race', 'Gender', 'Native_Country', 'Income']
Attributes to one hot encode : ['Workclass', 'Education', 'Marital_Status', 'Occupation', 'Relationship', 'Race', 'Gender', 'Native_Country']

Column names after processing :
 ['Age', 'Fnlwgt', 'Capital_Gain', 'Capital_Loss', 'Hours_Per_Week', 'Workclass_1', 'Workclass_2', 'Workclass_3', 'Workclass_4', 'Workclass_5', 'Workclass_6', 'Education_1', 'Education_2', 'Education_3', 'Education_4', 'Education_5', 'Education_6', 'Education_7', 'Education_8', 'Education_9', 'Education_10', 'Education_11', 'Education_12', 'Education_13', 'Education_14', 'Education_15', 'Marital_Status_1', 'Marital_Status_2', 'Marital_Status_3', 'Marital_Status_4', 'Marital_Status_5', 'Marital_Status_6', 'Occupation_1', 'Occupation_2', 'Occupation_3', 'Occupation_4', 'Occupation_5', 'Occupation_6', 'Occupation_7', 'Occupation_8', 'Occupation_9', 'Occupation_10', 

