## Introduction to Dataset Processing
#### Carl Shan

This Jupyter Notebook shares more detail about how to process your data. Data processing is like preparing the ingredients before cooking: if you prepare them poorly (e.g., leave things half-peeled and dirty), the meal will taste bad no matter how skillful a chef you are.

The same is true in machine learning. Dataset processing can be one of the most important things you do to get your model to perform well.

#### Introducing some helpful "magic" Jupyter commands
`?` - appending this to a function name (e.g., `pd.read_csv?`) brings up that function's documentation

In [None]:
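import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import preprocessing

# Explicit imports plus %matplotlib inline replace the discouraged %pylab inline
%matplotlib inline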

Download the [student performance data](http://archive.ics.uci.edu/ml/machine-learning-databases/00320/) and change the path below to wherever you put the data.

In [None]:
data = pd.read_csv('../data/student/student-mat.csv', sep=';')

In [None]:
data.head()

#### Converting Categorical Values to Numerical Ones

Looking at the data above, we want to convert a number of the columns from categorical to numerical. Most machine learning models work with numbers and don't know how to model data that is in text form. As a result, we need to learn how to do things such as converting the values in the `school` column to numbers.

#### First, let's see what values there are in the `school` column

In [None]:
# This shows a list of unique values and how many times they appear
data['school'].value_counts()

In [None]:
# Converting values in the school column to numbers
# We define a function that converts a single value, then apply it to every value in the column
def convert_school(value):
    if value == 'GP':
        return 0
    elif value == 'MS':
        return 1
    else:
        # Any school we don't recognize is treated as missing
        return None

In [None]:
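%%time
# Here's a slow way of using the above function: looping over every value ourselves
# (%%time must be the first line of the cell to time the whole cell)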
converted_data = []

for row in data['school']:
    new_value = convert_school(row)
    converted_data.append(new_value)

In [None]:
print(converted_data)

In [None]:
%%time
# A more concise way: let pandas apply the function to every value in the column
converted_data = data['school'].apply(convert_school)
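
The same conversion can also be written without a helper function. The sketch below is only an illustration: it uses a tiny made-up Series rather than the real student data, and the `school_codes` dictionary is a hypothetical name. It passes the dictionary to `Series.map`, which looks up each value and leaves `NaN` for anything not in the dictionary.

```python
import pandas as pd

# A tiny made-up sample standing in for data['school'] (an assumption for illustration)
schools = pd.Series(['GP', 'MS', 'GP', 'XX'])

# Hypothetical mapping from each category to a number
school_codes = {'GP': 0, 'MS': 1}

# map() looks up each value in the dictionary; values with no entry become NaN
converted = schools.map(school_codes)
print(converted.tolist())
```

Because unknown values turn into `NaN`, the result is stored as floats. This plays the same role as the `return None` branch in `convert_school`.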

#### Using sklearn's built-in preprocessing module, we can do the same thing

In [None]:
enc = preprocessing.LabelEncoder()

In [None]:
transformed = enc.fit_transform(data['school'])  

In [None]:
transformed
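
After encoding, it helps to know which number stands for which label. A minimal sketch, assuming a short made-up list of school names instead of the real column: the fitted encoder keeps the original labels in `classes_` (in sorted order, so a label's position is its code), and `inverse_transform` converts codes back to labels.

```python
from sklearn import preprocessing

# Made-up labels standing in for data['school'] (an assumption for illustration)
schools = ['GP', 'MS', 'GP', 'MS', 'MS']

enc = preprocessing.LabelEncoder()
codes = enc.fit_transform(schools)

# classes_ holds the sorted original labels; each label's index is its code
print(list(enc.classes_))
print(list(codes))

# inverse_transform turns the codes back into the original labels
restored = list(enc.inverse_transform(codes))
print(restored)
```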

#### Dealing with Null values

To show you how to deal with null values, I'm going to make some simulated data of students.

In [None]:
grades = np.random.choice(range(1, 13), 100)  # chooses 100 random integers from 1 to 12
# np.nan marks a missing value and keeps the column numeric (None would make it an object column)
num_friends_or_none = list(range(0, 20)) + [np.nan] * 5
num_friends = np.random.choice(num_friends_or_none, 100)
new_data = pd.DataFrame(data={'Grade': grades, '# Friends': num_friends})

In [None]:
new_data.head(n=20)

#### One way to deal with null values is to drop them

In [None]:
# dropna() returns a new Series without the null values; new_data itself is unchanged
new_data['# Friends'].dropna()

In [None]:
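# Another way is to fill the null values with the column's average
# mean() skips null values when computing the average
average_friends = new_data['# Friends'].mean()
# fillna() also returns a new Series, so this line only previews the result
new_data['# Friends'].fillna(average_friends)

In [None]:
# Assign the result back to the column to keep the filled values
new_data['# Friends'] = new_data['# Friends'].fillna(average_friends)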
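
To see the difference between the two approaches side by side, here is a small sketch on a made-up five-value Series (the numbers are assumptions chosen so the mean is easy to check by hand). Dropping shortens the data, while filling keeps every row.

```python
import numpy as np
import pandas as pd

# Made-up friend counts with two missing values (an assumption for illustration)
friends = pd.Series([4.0, np.nan, 10.0, np.nan, 1.0])

# Option 1: drop the null values (the result has fewer rows)
dropped = friends.dropna()

# Option 2: fill the null values with the mean of the non-null values: (4 + 10 + 1) / 3 = 5
filled = friends.fillna(friends.mean())

print(len(friends), len(dropped))
print(filled.tolist())
```

Neither call changes `friends` itself, which is why the notebook assigns the filled column back to `new_data`.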

#### Now let's learn how to standardize data
By that I mean transforming our data so that each column has a mean of 0 and a standard deviation of 1.

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
scaler = StandardScaler()

In [None]:
# Returns a NumPy array in which each column has been standardized
scaler.fit_transform(new_data)
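
To make the arithmetic behind standardization concrete, the sketch below computes it by hand with NumPy on a made-up array (the values are assumptions picked so the mean is 5 and the standard deviation is 2). Each value has the mean subtracted and is then divided by the standard deviation. `StandardScaler` uses the population standard deviation, which is also NumPy's default (`ddof=0`).

```python
import numpy as np

# Made-up values with mean 5 and population standard deviation 2
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# z = (x - mean) / std, using the population standard deviation (ddof=0)
standardized = (values - values.mean()) / values.std()

print(standardized)
print(standardized.mean(), standardized.std())
```

Applied to a single column, this is a hand-written version of what `fit_transform` does for every column of the DataFrame.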