# Data Preparation
Data preparation is a critical phase in machine learning; it is often said that as much as 80% of the effort goes into collecting and preparing data. Cleaning and organizing the data helps direct learning towards the intended goal, while skipping these steps is likely to result in an unsuccessful model. Data can contain discrepancies, errors, outliers and missing attributes of interest, and the following steps show how some of these issues can be handled.

## 1 Importing the libraries
As with most projects, the libraries of functions used in the data preparation process must first be imported into the notebook.

In [None]:
# import numpy, matplotlib.pyplot and pandas
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import scipy

# import helpers for downloading, unzipping and reading ARFF files
import requests, io, zipfile
from scipy.io import arff

# import the imputer for handling missing values, and the encoders
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder


## 2 Importing the dataset

Data can be retrieved in various formats. The examples below read data from ARFF, JSON and CSV.

### Reading from ARFF

In [None]:
# download a copy of an archived data set and extract the zip file to the notebook's folder
# f_zip = 'http://archive.ics.uci.edu/ml/machine-learning-databases/00212/vertebral_column_data.zip'
# r = requests.get(f_zip, stream=True)
# Vertebral_zip = zipfile.ZipFile(io.BytesIO(r.content))
# Vertebral_zip.extractall()

In [None]:
# read the ARFF file and store it as a dataframe
data = arff.loadarff('../dataset/column_2C_weka.arff')
df1 = pd.DataFrame(data[0])   # data[0] holds the records; data[1] holds the metadata (attribute names and types)
print(df1)
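
When an ARFF file contains nominal attributes, loadarff returns their values as byte strings. The sketch below shows one way to decode them and to read the attribute names from the metadata. It uses a tiny made-up ARFF document held in memory (the attribute names and values are assumptions for illustration, not part of the vertebral column dataset).

```python
import io

import pandas as pd
from scipy.io import arff

# A tiny hypothetical ARFF document, used only for illustration
arff_text = """@relation demo
@attribute width numeric
@attribute height numeric
@attribute color {red,green,blue}
@data
5.0,3.25,blue
4.5,3.75,green
3.0,4.00,red
"""

# loadarff returns a (records, metadata) tuple
records, meta = arff.loadarff(io.StringIO(arff_text))
demo_df = pd.DataFrame(records)

# Nominal attributes are loaded as byte strings, so decode them to text
for col in demo_df.columns:
    if demo_df[col].dtype == object:
        demo_df[col] = demo_df[col].str.decode('utf-8')

print(meta.names())   # attribute names taken from the metadata
print(demo_df)
```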

### Reading from JSON

In [None]:
# Create a JSON file from an Excel file
df2 = pd.read_excel('../dataset/data2.xlsx',index_col=0) # use column 0 as the row labels
df2.to_json('data2.json')
df2


In [None]:
# Read the newly created JSON as a dataframe
df3 = pd.read_json("data2.json")
df3

### Reading from CSV

In [None]:
# Create a CSV file from an Excel file
df4 = pd.read_excel('../dataset/data2.xlsx',index_col=0)
df4.to_csv('data2.csv')

In [None]:
# Read the CSV file into a dataframe (features and target are extracted later)
dataset = pd.read_csv('data2.csv')
dataset

## 3 Taking care of missing data

There are several ways to handle missing data, but only the following will be covered in this exercise:
* remove the rows with missing data
* impute missing values with the mean, median or mode

### Dropping rows with missing data
The dropna function's axis argument defaults to 0 (along rows), so any row that contains at least one NaN value is removed. You can set it to 1 to remove columns that contain NaN values instead.

Removing missing values leaves a dataset without gaps, which can make the resulting model more robust, but a lot of data may be lost. This works poorly if the rows removed make up a significant share of the dataset.

In [None]:
dataset.dropna()
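
The effect of the axis argument can be sketched on a small frame. The column names and values below are made up for illustration and are not taken from data2.csv. Note that dropna returns a new dataframe and leaves the original unchanged.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps, used only for illustration
demo = pd.DataFrame({
    'age':    [25, np.nan, 40, 31],
    'salary': [50000, 62000, np.nan, 58000],
    'city':   ['A', 'B', 'C', 'D'],
})

rows_dropped = demo.dropna()                 # axis=0: drop rows containing any NaN
cols_dropped = demo.dropna(axis=1)           # axis=1: drop columns containing any NaN
age_required = demo.dropna(subset=['age'])   # only consider gaps in the 'age' column

print(len(demo), len(rows_dropped))          # rows before and after
print(list(cols_dropped.columns))            # columns that survive
```

Comparing the row counts before and after is a quick way to judge whether too much data would be lost.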

### Impute missing values with mean, median or mode

With continuous numerical values, there is an option to fill the missing values with the mean, median or mode. The missing values can also be set to zero or another scalar value.

In [None]:
# Replacing with a scalar value
#dataset.fillna(0)
dataset.replace({np.nan: 0})   # np.nan is used because the np.NaN alias was removed in NumPy 2.0

In [None]:
# Extract the values into features and target
x = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

In [None]:
print(x)

In [None]:
print(y)

In [None]:
# Replacing with mean value
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(x[:, 1:3])
x[:, 1:3] = imputer.transform(x[:, 1:3])

In [None]:
print(x)
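
The mean is only one of the strategies mentioned above. As a minimal sketch, assuming a made-up single column with one gap, the same SimpleImputer can be asked for the median or the mode (called 'most_frequent' in scikit-learn):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numeric column with one missing value (row 3)
values = np.array([[1.0], [2.0], [2.0], [np.nan], [10.0]])

filled = {}
for strategy in ('mean', 'median', 'most_frequent'):
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    # keep only the value that was filled in for the missing row
    filled[strategy] = imputer.fit_transform(values)[3, 0]

print(filled)
```

With an outlier such as 10.0 in the column, the mean is pulled upwards while the median and mode are not, which is one reason to prefer them for skewed data.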

## 4 Encoding categorical data

Categorical data can only take on a limited and usually fixed number of values. For example, gender (described as Male or Female) and job position are categorical.

Categorical data can be:
* Nominal
* Ordinal

In general, nominal data are labeled with no specific order, while ordinal data have a specific order. Gender is nominal, while the level of satisfaction (indicated as poor/average/good) is ordinal.
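
For ordinal data the order itself carries information, so it can be kept when converting to numbers. The sketch below is one possible approach using scikit-learn's OrdinalEncoder, which is not used elsewhere in this practical; the ratings are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical satisfaction ratings (ordinal: poor < average < good)
ratings = np.array([['good'], ['poor'], ['average'], ['good']])

# Pass the category order explicitly; otherwise categories are sorted alphabetically
encoder = OrdinalEncoder(categories=[['poor', 'average', 'good']])
codes = encoder.fit_transform(ratings)

print(codes.ravel())
```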


### Encoding the Independent Variable

Most machine learning algorithms cannot process categorical data directly, so these values have to be converted to numbers first. One-hot encoding is widely used because simply labeling the categories with numbers introduces an order that may not be valid.

The basic strategy in One-Hot encoding is to convert each category value into a new column and assign a 1 or 0 (True/False) value to the column.
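
As a minimal sketch of this strategy, assuming a made-up column of colours, each category becomes its own 0/1 column:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical nominal column
colours = np.array([['red'], ['green'], ['blue'], ['green']])

encoder = OneHotEncoder()
one_hot = encoder.fit_transform(colours).toarray()   # dense copy for printing

print(encoder.categories_[0])   # column order: categories sorted alphabetically
print(one_hot)
```

Each row contains exactly one 1, in the column that corresponds to its category.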

In [None]:
# Read in a new dataset from CSV

df6 = pd.read_csv('../dataset/categorical.csv')
df6

In [None]:
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
df6 = np.array(ct.fit_transform(df6))

In [None]:
print(df6)

A sparse matrix is a matrix that is comprised of mostly zero values. Storing only the non-zero entries can lead to enormous savings in memory and computation. The Compressed Sparse Row format, CSR for short, is often used to represent sparse matrices in machine learning because it supports efficient row access and matrix multiplication.
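
To make the CSR layout concrete, the sketch below converts a small made-up matrix and prints the three arrays that CSR stores instead of the full grid of values.

```python
import numpy as np
from scipy import sparse

# Hypothetical matrix that is mostly zeros
dense = np.array([
    [1, 0, 0, 0],
    [0, 0, 2, 0],
    [0, 0, 0, 0],
    [0, 3, 0, 4],
])

csr = sparse.csr_matrix(dense)

print(csr.data)      # the non-zero values, row by row
print(csr.indices)   # the column index of each stored value
print(csr.indptr)    # where each row starts and ends within data/indices
```

Only the four non-zero values are stored, together with their positions; calling toarray() rebuilds the dense matrix.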

In [None]:
# one-hot encode the categorical data in column 0 and convert the result to a dense array
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
x_final = scipy.sparse.csr_matrix(ct.fit_transform(x)).toarray()
print(x_final)

### Encoding the Dependent Variable

Label Encoding is used to convert each value in a column to a number.

In [None]:
le = LabelEncoder()
y = le.fit_transform(y)

In [None]:
print(y)
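
The mapping that LabelEncoder learns can be inspected and reversed. A minimal sketch, assuming made-up Yes/No target values:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(['Yes', 'No', 'No', 'Yes'])   # hypothetical target values

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

print(encoder.classes_)                      # classes in sorted order; the position is the code
print(encoded)
print(encoder.inverse_transform(encoded))    # back to the original labels
```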

## 5 Splitting the dataset into the Training set and Test set

A machine learning algorithm essentially works in two stages, training and testing, but you may come across the following definitions.

Training dataset - The sample of data used to fit the model.

Validation dataset - The sample of data used to provide an unbiased evaluation of a model fit on the training dataset while tuning model hyperparameters. The evaluation becomes more biased as skill on the validation dataset is incorporated into the model configuration.

Test dataset - The sample of data used to provide an unbiased evaluation of a final model fit on the training dataset.

The test dataset should be carefully sampled to span the various scenarios that a model would encounter in the real world. It is used only once, after the model is completely trained, while the validation dataset is used as part of the development dataset.
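
One simple way to obtain all three datasets is to call train_test_split twice. The sketch below assumes made-up data and a 60/20/20 split; the proportions are an illustrative choice, not a recommendation from this practical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(100).reshape(50, 2)   # 50 hypothetical samples, 2 features
y_demo = np.arange(50) % 2               # hypothetical binary target

# First hold out 20% as the test set, then split the remainder 75/25
X_dev, X_te, y_dev, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)
X_tr, X_val, y_tr, y_val = train_test_split(X_dev, y_dev, test_size=0.25, random_state=1)

print(len(X_tr), len(X_val), len(X_te))   # sizes of the training, validation and test sets
```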

For ease of understanding, we will focus on just the training data and test data. For your self-learning, you can search for Cross Validation. In cross validation, you essentially use your training set to generate multiple splits of the Train and Validation sets.
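
As a starting point for that self-learning, the sketch below shows how scikit-learn's KFold generates several train/validation splits from a single training set. The data and the choice of five folds are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(20).reshape(10, 2)   # 10 hypothetical training samples

kfold = KFold(n_splits=5, shuffle=True, random_state=1)
folds = list(kfold.split(X_demo))       # each item is (train indices, validation indices)

for fold, (train_idx, val_idx) in enumerate(folds):
    print(fold, train_idx, val_idx)
```

Every sample appears in exactly one validation fold, so all of the training data is eventually used for validation.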

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x_final, y, test_size = 0.2, random_state = 1)

In [None]:
print(X_train)

In [None]:
print(X_test)

In [None]:
print(y_train)

In [None]:
print(y_test)

## 6 Feature Scaling

Feature scaling is a method used to normalize or standardize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the data preprocessing step.

You will see that feature scaling is carried out after separating the data into training data and test data. This prevents information from the test data from being used when scaling the training data.

When data are being used in machine learning, the values of features can have very different ranges. One feature could be in kg while another could be in grams. The values can also be very different in magnitude. For example:

|Transaction | Volume | Average Price |
|---|---|---|
|1|50000| 1.45|
|2|120000| 2.44|
|3|450000| 2.11|
|4|700000| 1.60|
|5|800000| 1.72|

In this scenario, where the volume values are far larger, a machine learning algorithm that cannot recognize the context of a number such as '800000' versus '1.72' may put more emphasis and priority on the volume.

By scaling the values in each column to a similar range, the performance of a machine learning algorithm can be improved. However, it must be noted that not all machine learning algorithms benefit from feature scaling. Distance-based algorithms often benefit from it, while tree-based algorithms are insensitive to the scaling of features. Algorithms that benefit include:
* linear and logistic regression
* nearest neighbors
* neural networks
* support vector machines with radial basis function (RBF) kernels
* principal components analysis
* linear discriminant analysis

The StandardScaler rescales each feature so that its distribution is centred around 0 with a standard deviation of 1. It works best when the data within each feature is roughly normally distributed; if it is not, this may not be the best scaler to use.

The MinMaxScaler is probably the best-known scaling algorithm. It shrinks each feature so that its range is between 0 and 1 by default (another range, such as -1 to 1, can be requested through the feature_range parameter). This scaler works better in cases where the standard scaler might not work so well, for example when the distribution is not Gaussian or the standard deviation is very small.

There are other scalers such as the RobustScaler, which centres on the median and scales by the interquartile range instead of the minimum and maximum, making it more robust to outliers.

The Normalizer rescales rows (samplewise), not columns (featurewise), so that each sample has unit norm.

Most business data aims to study relations across samples and to predict for new samples, which will likely benefit from featurewise normalization. 
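
To compare the scalers, the sketch below applies three of them to the Volume and Average Price columns from the table above. For brevity all five rows are fitted together; in practice the scaler would be fitted on the training data only, as explained earlier.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Volume and Average Price columns from the table above
table = np.array([
    [50000, 1.45],
    [120000, 2.44],
    [450000, 2.11],
    [700000, 1.60],
    [800000, 1.72],
])

standardized = StandardScaler().fit_transform(table)   # zero mean, unit variance per column
min_maxed = MinMaxScaler().fit_transform(table)        # each column mapped onto [0, 1]
robust = RobustScaler().fit_transform(table)           # median removed, divided by the IQR

print(standardized.round(2))
print(min_maxed.round(2))
print(robust.round(2))
```

After any of these transformations the two columns are on comparable scales, so the volume no longer dominates purely because of its magnitude.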

In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train[:, 10:] = sc.fit_transform(X_train[:, 10:])
X_test[:, 10:] = sc.transform(X_test[:, 10:])

In [None]:
print(X_train)

In [None]:
print(X_test)

# Exercise

Import the dataset from 'data_practice.xlsx' and use the steps you have gone through in this practical to prepare the data.

## Import the libraries
(Only need to import libraries/modules once)

In [None]:
#todo
# import numpy, matplotlib.pyplot and pandas
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import scipy

# import helpers for downloading, unzipping and reading ARFF files
import requests, io, zipfile
from scipy.io import arff

# import the imputer for handling missing values, and the encoders
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

## Import the dataset

In [None]:
# read in the Excel file and remove the two unnecessary empty columns
x = pd.read_excel('../dataset/data_practice.xlsx')
x = x.drop(columns=x.columns[[5, 6]])   # same result as dropping column 5 twice
#x = x.dropna(axis='columns')
print(x)

## Take care of missing values

In [None]:
#todo (drop rows with no joining year and make the position strings consistent)
x = x[x['Joined in (Year)'].notna()].copy()   # copy so the assignments below modify an independent frame
x.iloc[:,3] = x.iloc[:,3].str.upper()
x.iloc[:,3] = x.iloc[:,3].str.replace('ENGINEER', 'ENGR', regex=False)
x.iloc[:,3] = x.iloc[:,3].str.replace('MANAGER', 'MGR', regex=False)
#x.iloc[:,3] = x.iloc[:,3].replace('SNR MANAGER', 'SNR MGR')
#x.iloc[:,3] = x.iloc[:,3].replace('PROJECT MANAGER', 'PROJECT MGR')
print(x)

In [None]:
# read as date time and use only the year
x['Joined in (Year)'] = pd.to_datetime(x['Joined in (Year)'], format="%Y-%m", errors ="coerce")
x['Joined in (Year)'] = x['Joined in (Year)'].dt.year

x = x.iloc[:,1:]
print(x)

In [None]:
# drop the rows whose joining year could not be parsed (coerced to NaN above)
x = x[x['Joined in (Year)'].notna()]
x
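
The effect of errors="coerce" in the cells above can be sketched on a few made-up values: entries that do not match the expected format become missing values, which the notna() filter then removes.

```python
import pandas as pd

# Hypothetical "joined" values, one of which does not match the expected format
joined = pd.Series(['2015-03', '2018-11', 'unknown'])

parsed = pd.to_datetime(joined, format='%Y-%m', errors='coerce')   # bad entries become NaT
years = parsed.dt.year                                             # NaT turns into NaN here

print(years)
print(years.notna().sum())   # number of rows that would survive the notna() filter
```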

## Encode categorical data

In [None]:
# Extract the values into features and target
x_final = x.iloc[:, :-1].values
y = x.iloc[:, -1].values

#todo (for simplicity, there is no need to use the scipy.sparse.csr_matrix)
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [2])], remainder='passthrough')
x_final = np.array(ct.fit_transform(x_final))

#todo (encode the target)
le = LabelEncoder()
y = le.fit_transform(y)
print(y)

## Split dataset for training and test

In [None]:
#todo
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x_final, y, test_size = 0.2, random_state = 1)

## Feature scaling

In [None]:
print(X_train)

In [None]:
# Replacing missing values with the mean value
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# Fit on the training set only, then apply the same means to both sets
# (refitting on the test set would let test data influence the preparation)
imputer.fit(X_train[:, 5:7])
X_train[:, 5:7] = imputer.transform(X_train[:, 5:7])
X_test[:, 5:7] = imputer.transform(X_test[:, 5:7])

# Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train[:, 5:7] = sc.fit_transform(X_train[:, 5:7])
X_test[:, 5:7] = sc.transform(X_test[:, 5:7])

print(X_train)