# <center>IEE 520: Fall 2019</center>

# <center>Python: data preprocessing</center>

## <center>Klim Drobnyh (klim.drobnyh@asu.edu)</center>

In [None]:
# For compatibility with Python 2
from __future__ import print_function

import numpy as np
import pandas as pd

# Encoders for categorical features
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder

# To support plots
import matplotlib.pyplot as plt

# To display all the plots inline
%matplotlib inline

In [None]:
# Set the default figure size (width, height in inches)
plt.rcParams["figure.figsize"] = (10, 5)

## <center>1. Creating a new dataset</center>

In [None]:
data = pd.DataFrame({
            'x1': [1, 1, 6, 4],
            'x2': ['active', 'sedentary', 'sedentary', 'moderately'],
            'x3': ['high', 'normal', 'normal', 'low'],
        })
print('The dataframe:')
print(data)
print('The same dataframe, but as a numpy array:')
print(data.values)
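
Because the dataframe mixes integers and strings, its numpy representation cannot use a numeric dtype. The sketch below illustrates this on a smaller frame; the names `df`, `mixed`, and `numeric` are illustrative and not part of the notebook, and `to_numpy()` is used as an equivalent of `.values`.

```python
import pandas as pd

# Illustrative frame with one numeric and one string column
df = pd.DataFrame({
    'x1': [1, 1, 6, 4],
    'x2': ['active', 'sedentary', 'sedentary', 'moderately'],
})

# Mixed column types fall back to a generic object array
mixed = df.to_numpy()
print(mixed.dtype)

# Selecting only the numeric column keeps an integer dtype
numeric = df[['x1']].to_numpy()
print(numeric.dtype)
```

This is one reason the categorical columns are converted to numbers in the next section.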

## <center>2. Dealing with categorical data</center>

Two common ways to treat categorical data are:
1. Label encoding
2. One hot encoding

### <center>2.1. Label encoding</center>

If a feature measures some quantity but is expressed in non-numeric values (e.g., **x3**), we want to convert it to numbers in a way that preserves the order:
* **low** -> **0**;
* **normal** -> **1**;
* **high** -> **2**.

In [None]:
# Specify the category order explicitly, from lowest to highest:
encoder = OrdinalEncoder(categories=[['low', 'normal', 'high']])
encoder.fit(data['x3'].values.reshape(-1, 1))

print('Before transformation:')
print(data['x3'])

# transform returns a 2D array of shape (n, 1), so flatten it before assigning
data['x3'] = encoder.transform(data['x3'].values.reshape(-1, 1)).ravel()
print('After transformation:')
print(data['x3'])
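
As an alternative to `OrdinalEncoder`, the same ordered mapping can be sketched with a pandas ordered categorical. This is a swapped-in technique rather than the approach used in the cell above; the names `levels`, `x3`, `x3_codes`, and `x3_restored` are illustrative.

```python
import pandas as pd

# Assumed order, from lowest to highest
levels = ['low', 'normal', 'high']
x3 = pd.Series(['high', 'normal', 'normal', 'low'])

# An ordered categorical assigns integer codes by position in `levels`
x3_cat = pd.Categorical(x3, categories=levels, ordered=True)
x3_codes = pd.Series(x3_cat.codes, index=x3.index)
print(x3_codes.tolist())

# The codes can be mapped back to the original labels
x3_restored = [levels[code] for code in x3_codes]
print(x3_restored)
```

A value that is missing from `levels` would receive the code -1 here, so it is worth checking the codes before using them as a feature.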

### <center>2.2. One hot encoding</center>

If a feature has categories with no natural order (e.g., **x2**), we can use one-hot encoding.
In that case, the original column is replaced by three binary variables:
* **x2_active**;
* **x2_sedentary**;
* **x2_moderately**.

In [None]:
print('Before transformation:')
print(data)

data = pd.get_dummies(data, columns=['x2'])

print('After transformation:')
print(data)
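
The three binary columns always sum to one in each row, so one of them is redundant. A minimal sketch, assuming that dropping one column is acceptable for the model in use, compares the full encoding with the `drop_first=True` option of `pd.get_dummies`; the names `df`, `full`, and `reduced` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'x2': ['active', 'sedentary', 'sedentary', 'moderately']})

# Full one-hot encoding: one binary column per category
full = pd.get_dummies(df, columns=['x2']).astype(int)
print(full.columns.tolist())

# Each row has exactly one active category
print(full.sum(axis=1).tolist())

# drop_first=True omits the first category in sorted order
reduced = pd.get_dummies(df, columns=['x2'], drop_first=True).astype(int)
print(reduced.columns.tolist())
```

The `astype(int)` call is only there to make the output uniform, since the dtype of the dummy columns differs between pandas versions.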

## <center>3. Writing to and reading from .csv files</center>

Writing to .csv file:

In [None]:
data.to_csv('custom_data.csv')

Reading from .csv file:

Note: check which separator the data file uses and pass it explicitly: sep=',' for "**,**", sep='\t' for a tab.

In [None]:
data2 = pd.read_csv('custom_data.csv', index_col=0)

Let's compare them:

In [None]:
print('Original:')
print(data)
print('Loaded:')
print(data2)
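
The separator note above can be illustrated with a tab-separated round trip. The sketch below writes to an in-memory buffer instead of a file so that it leaves nothing on disk; the names `df`, `buffer`, and `loaded` are illustrative, and the same `sep` value has to be passed on both the writing and the reading side.

```python
import io

import pandas as pd

df = pd.DataFrame({'x1': [1, 1, 6, 4], 'x3': [2.0, 1.0, 1.0, 0.0]})

# Write tab-separated text into an in-memory buffer
buffer = io.StringIO()
df.to_csv(buffer, sep='\t')
buffer.seek(0)

# Read it back with the same separator; the first column is the index
loaded = pd.read_csv(buffer, sep='\t', index_col=0)
print(loaded)

# Compare the values instead of inspecting the printout by eye
values_match = bool((df.to_numpy() == loaded.to_numpy()).all())
print(values_match)
```

Reading the same text with the default `sep=','` would put each whole line into a single column, which is usually the first sign of a wrong separator.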

## <center>4. Writing to and reading from Excel files</center>

Writing to Excel file:

In [None]:
data.to_excel("custom_data.xlsx")

Reading from Excel file:

In [None]:
data2 = pd.read_excel('custom_data.xlsx', index_col=0)

Let's compare them:

In [None]:
print('Original:')
print(data)
print('Loaded:')
print(data2)

## <center>5. Splitting the dataset into features and target</center>

Here we assume that our target variable is **x1**.

In [None]:
# Target vector
y = data['x1'].values
# Feature matrix: every column except the target (data itself is left unchanged)
X = data.drop(columns=['x1']).values
print(X)
print(y)
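
Once the features are converted to a numpy array, the column names are lost. A minimal sketch, assuming the target column is named `x1` as above, keeps the feature names in a separate list; the names `df`, `target`, and `feature_names` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'x1': [1, 1, 6, 4],
    'x3': [2.0, 1.0, 1.0, 0.0],
    'x2_active': [1, 0, 0, 0],
})

target = 'x1'
# Remember which column of X corresponds to which feature
feature_names = [column for column in df.columns if column != target]

X = df[feature_names].to_numpy()
y = df[target].to_numpy()

print(feature_names)
print(X.shape, y.shape)
```

Column `i` of `X` then corresponds to `feature_names[i]`, and the original dataframe is not modified.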