**Data preprocessing is one of the first and most crucial steps: the process of preparing raw data and making it suitable for an ML model, in order to improve the model's accuracy and efficiency.**

## Importing Libraries

    import lib_name as alias_name

For example:

    import numpy as np
    import pandas as pd
    from matplotlib import pyplot as plt
    from sklearn import datasets

## Importing Datasets
You can load data from a CSV file with pandas:

    import pandas as pd
    dataset = pd.read_csv('filename.csv')
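
The later sections use a feature matrix `X` and a target vector `y` without defining them. The sketch below shows one common way to derive them from a loaded DataFrame. The CSV content, the column names, and the choice of the last column as the target are all assumptions made for illustration; an in-memory string stands in for `filename.csv`.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for 'filename.csv'
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,,No
"""

dataset = pd.read_csv(io.StringIO(csv_text))

# Assumption: every column except the last is a feature, the last is the target
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

print(dataset.shape)         # rows, columns
print(dataset.isna().sum())  # missing values per column
```

The empty `Salary` field in the last row is read as `NaN`, which is the kind of gap the missing-data step deals with.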

Or load a sample dataset directly from seaborn or scikit-learn. For seaborn:

    import seaborn as sns
    iris = sns.load_dataset('iris')  # returns a pandas DataFrame

For scikit-learn:

    from sklearn import datasets
    digits = datasets.load_digits()
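
As a brief illustration of what `load_digits()` returns, the sketch below pulls out the feature matrix and target vector. The variable names `X` and `y` are assumptions chosen to match the rest of this notebook.

```python
from sklearn import datasets

digits = datasets.load_digits()

# The returned object exposes the data as NumPy arrays
X = digits.data    # one flattened 8x8 image per row
y = digits.target  # the digit label for each row

print(X.shape, y.shape)
```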

## Handling the missing data values
Missing values, incomplete records, and unknown entries are among the biggest issues when building a machine learning model, because they hurt its accuracy.

<div style="text-align:center"><img alt="Handling Missing Data" src="https://github.com/thunderstroke325/60-Days-of-Data-Science-and-ML/blob/main/assets/missing_data.png?raw=true" /></div>

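To handle missing values, we can use the `SimpleImputer` class from scikit-learn's `sklearn.impute` module. (The older `Imputer` class in `sklearn.preprocessing` was removed in scikit-learn 0.22.)

    import numpy as np
    from sklearn.impute import SimpleImputer
    # Replace each NaN with the mean of its column (SimpleImputer always works column-wise)
    imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
    # a:b selects the columns and c:d the rows to impute
    imputer = imputer.fit(X[c:d, a:b])
    X[c:d, a:b] = imputer.transform(X[c:d, a:b])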
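
To make mean imputation concrete, here is a minimal runnable sketch. It assumes scikit-learn 0.22 or later, where `SimpleImputer` lives in `sklearn.impute`. The small array is invented for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up data: columns are age and salary, with one gap in each
X = np.array([
    [44.0, 72000.0],
    [27.0, np.nan],
    [np.nan, 54000.0],
    [38.0, 61000.0],
])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_imputed = imputer.fit_transform(X)

print(imputer.statistics_)  # the per-column means used as fill values
print(X_imputed)
```

Other built-in strategies include `"median"` and `"most_frequent"`, which can be better choices for skewed or categorical columns.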


## Encoding categorical data
ML models operate on numbers, so categorical variables must be encoded as numbers before training.

<div style="text-align:center"><img alt="Encoding Categorical Data" src="https://github.com/thunderstroke325/60-Days-of-Data-Science-and-ML/blob/main/assets/encoding.png?raw=true" /></div>

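We will use a one-hot encoder (dummy-variable encoding) for the categorical feature column and a label encoder for the target. The `categorical_features` argument of `OneHotEncoder` was removed in scikit-learn 0.22, so the column to encode is selected with `ColumnTransformer` instead.

    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import LabelEncoder, OneHotEncoder
    # One-hot encode column 0 and pass the remaining columns through unchanged;
    # sparse_threshold=0 forces a dense array, like the old .toarray() call
    ct = ColumnTransformer([('onehot', OneHotEncoder(), [0])],
                           remainder='passthrough', sparse_threshold=0)
    X = ct.fit_transform(X)
    # LabelEncoder maps the target classes to integers 0..n_classes-1
    labelencoder_y = LabelEncoder()
    y = labelencoder_y.fit_transform(y)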
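
The following sketch runs the encoding step on a tiny invented dataset so the result can be inspected. It assumes scikit-learn 0.22 or later and uses `ColumnTransformer` to choose the column to one-hot encode. The country, age, and salary values are hypothetical.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Made-up data: column 0 is categorical, the others are numeric
X = np.array([
    ["France", 44, 72000],
    ["Spain", 27, 48000],
    ["Germany", 30, 54000],
    ["Spain", 38, 61000],
], dtype=object)
y = np.array(["No", "Yes", "No", "Yes"])

# One-hot encode column 0; keep the other columns as they are
ct = ColumnTransformer(
    [("onehot", OneHotEncoder(), [0])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
X_encoded = ct.fit_transform(X)

# Encode the target classes as integers
y_encoded = LabelEncoder().fit_transform(y)

print(ct.named_transformers_["onehot"].categories_[0])  # order of the dummy columns
print(X_encoded)
print(y_encoded)
```

Each country becomes its own 0/1 column, so the model does not read an artificial ordering into the categories, which is what integer labels on a feature column would imply.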


## Split Data into Train and Test data


*   **Training data**. Used to train the ML model: the algorithm is fed input data together with the expected output and evaluates it repeatedly to learn the data and its behaviour.
*   **Validation data**. Used during the training process: data the model has not seen before is fed to it, providing a first test against unseen data. This helps evaluate how well the model predicts on new data and guides hyperparameter optimization.
*   **Test data**. After the ML model is built, the test data is used to check that the model makes accurate predictions and was trained effectively.

<div style="text-align:center"><img alt="Split Data" src="https://github.com/thunderstroke325/60-Days-of-Data-Science-and-ML/blob/main/assets/split_data.png?raw=true" /></div>


    # sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state = 42)
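
`train_test_split` produces only two parts per call, so one way to obtain the three sets described above is to call it twice. The sketch below is an illustration under assumptions: the 60/20/20 proportions, the synthetic data, and the use of `stratify` are choices made here, not something prescribed by the text.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 2 features, balanced binary target
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

# First split: hold out 20% of the data as the test set
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Second split: 25% of the remaining 80% is 20% of the total, used for validation
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)

print(len(X_train), len(X_val), len(X_test))
```

The validation set is then used for model selection and hyperparameter tuning, while the test set stays untouched until the final evaluation.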

## Feature Scaling
Feature scaling is a technique that brings the independent variables in the data onto the same range and scale, so that variables with large magnitudes don't dominate the others. It matters because many algorithms, especially gradient-based and distance-based ones, tend to converge faster and give better results on scaled features.

*   **Normalization**, also known as Min-Max scaling, is a technique in which values in the data are rescaled so that they end up ranging between 0 and 1.
*   **Standardization** is a technique in which the values are centered on the mean and scaled to unit standard deviation.
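
Written as formulas, the two techniques look as follows. The symbols are notation assumed here: $x$ is a single feature value, $x_{\min}$ and $x_{\max}$ are the minimum and maximum of that feature, and $\mu$ and $\sigma$ are its mean and standard deviation.

```latex
% Normalization (Min-Max scaling): maps the feature to the range [0, 1]
x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}

% Standardization: zero mean and unit standard deviation
z = \frac{x - \mu}{\sigma}
```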

<div style="text-align:center"><img alt="Feature Scaling" src="https://github.com/thunderstroke325/60-Days-of-Data-Science-and-ML/blob/main/assets/feature_scaling.jpeg?raw=true" /></div>

    from sklearn.preprocessing import StandardScaler
    ss_X = StandardScaler()
    # Fit on the training set only, then reuse the same mean and std on the test set
    X_train = ss_X.fit_transform(X_train)
    X_test = ss_X.transform(X_test)
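
To compare normalization and standardization side by side, the sketch below applies `MinMaxScaler` and `StandardScaler` to the same small array. The age and salary values are invented for illustration. In both cases the scaler is fitted on the training data only and then applied to the test data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up data: columns are age and salary
X_train = np.array([[20.0, 30000.0], [30.0, 50000.0], [40.0, 70000.0]])
X_test = np.array([[25.0, 60000.0]])

# Normalization: each training column is rescaled to the range [0, 1]
mm = MinMaxScaler()
X_train_mm = mm.fit_transform(X_train)
X_test_mm = mm.transform(X_test)

# Standardization: each training column gets zero mean and unit standard deviation
ss = StandardScaler()
X_train_ss = ss.fit_transform(X_train)
X_test_ss = ss.transform(X_test)

print(X_train_mm)
print(X_test_mm)
print(X_train_ss)
print(X_test_ss)
```

Because the test set is transformed with statistics learned from the training set, test values can fall outside [0, 1] after Min-Max scaling. That is expected and avoids leaking test information into training.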