## Intro to Scikit-Learn

In this tutorial we will see the basic methods of Scikit-Learn and how to use them to prepare data for Machine Learning.

Download the following dataset from kaggle.com:

https://www.kaggle.com/spscientist/students-performance-in-exams

In [1]:
# Import Pandas and Numpy
import pandas as pd
import numpy as np

In [2]:
# Load the dataset from the csv file, creating a DataFrame object called df
df = pd.read_csv('StudentsPerformance.csv')

In [3]:
# Show the first 5 rows of df
df.head()

|   | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score |
|---|---|---|---|---|---|---|---|---|
| 0 | female | group B | bachelor's degree | standard | none | 72 | 72 | 74 |
| 1 | female | group C | some college | standard | completed | 69 | 90 | 88 |
| 2 | female | group B | master's degree | standard | none | 90 | 95 | 93 |
| 3 | male | group A | associate's degree | free/reduced | none | 47 | 57 | 44 |
| 4 | male | group C | some college | standard | none | 76 | 78 | 75 |


In [4]:
# Show the number of instances in the original dataset
print(f'Number of instances: {len(df)}')

Number of instances: 1000


In [5]:
# We will convert the dataset into a classification dataset
# To do that, we will consider the variable lunch as the target variable
# the values of the target will be 0 for 'standard' and 1 for 'free/reduced'
# We will save the target variable in an array called 'y'
y = np.where(df.lunch.values=='standard', 0, 1)
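
To see what the `np.where` call above produces, here is a minimal sketch on a toy column. The values in `lunch` are made up for illustration and are not taken from the dataset. Note that the rule maps every value other than `'standard'` to 1, so a typo or a missing value in the column would also end up in class 1.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 'lunch' column (hypothetical values, not the real data)
lunch = pd.Series(['standard', 'free/reduced', 'standard', 'standard', 'free/reduced'])

# Same rule as above: 0 for 'standard', 1 for any other value
y_toy = np.where(lunch.values == 'standard', 0, 1)
print(y_toy)    # [0 1 0 0 1]

# Count how many instances fall in each class (index 0 -> class 0, index 1 -> class 1)
counts = np.bincount(y_toy)
print(counts)   # [3 2]
```

Running the same `np.bincount` on the real `y` is a quick way to see how balanced the two classes are before splitting the data.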

### The OneHotEncoder

In [6]:
# To make the DataFrame usable by a Machine Learning model,
# we need it to contain only numerical values.
# Some variables are represented by categories. One way to convert categories to
# numerical values is to apply One-Hot encoding

# For the sake of simplicity, we will consider that the predictor variables
# are: [gender, race/ethnicity, math score, reading score, writing score]
# Let us modify the original DataFrame
df.drop(['parental level of education', 'lunch', 'test preparation course'], axis=1, inplace=True)

In [7]:
# Now let us standardize the names of the columns
df.columns = ['gender', 'race_ethnicity', 'math_score', 'reading_score', 'writing_score']

In [8]:
# Import the OneHotEncoder class from sklearn
from sklearn.preprocessing import OneHotEncoder

In [9]:
# Define an object called 'encoder'
encoder = OneHotEncoder(drop='first')

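# The drop='first' argument automatically drops the first column (the first
# category) of each encoded variable. That category is then represented by
# zeros in all the remaining columns, which avoids a redundant column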
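
The effect of `drop='first'` is easier to see on a tiny example. The sketch below uses a made-up `color` column (not part of the dataset) and compares the encoder with and without the argument. The categories are sorted, so `'blue'` is the first one and its column is the one dropped.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy categorical column (hypothetical data)
toy = pd.DataFrame({'color': ['red', 'green', 'blue', 'green']})

# Without drop: one column per category (sorted order: blue, green, red)
full = OneHotEncoder().fit_transform(toy).toarray()

# With drop='first': the column of the first category ('blue') is removed,
# leaving the columns for 'green' and 'red'
reduced = OneHotEncoder(drop='first').fit_transform(toy).toarray()

print(full.shape)     # (4, 3)
print(reduced.shape)  # (4, 2)
print(reduced)        # the 'blue' row is encoded as all zeros
```

No information is lost: a row of zeros in `reduced` can only mean the dropped category.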

In [10]:
# Let us apply the fit_transform method on df to generate the new columns
# Notice that only the categorical variables must be considered
# otherwise, the method will interpret every numerical value as a 
# different category
X = encoder.fit_transform(df[['gender', 'race_ethnicity']])

In [11]:
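# For visualization purposes, let us convert it into a DataFrame again
# Note: get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2;
# on recent versions use encoder.get_feature_names_out() instead, which names the
# columns after the input columns (e.g. 'gender_male' rather than 'x0_male')
X = pd.DataFrame(data=X.toarray(), columns=encoder.get_feature_names())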

# Now let us concatenate it with the DataFrame of numerical values
X = pd.concat([X, df[['math_score', 'reading_score', 'writing_score']]], axis=1)

# To check that everything worked fine, print the final DataFrame
X.head(10)

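|   | x0_male | x1_group B | x1_group C | x1_group D | x1_group E | math_score | reading_score | writing_score |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 72 | 72 | 74 |
| 1 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 69 | 90 | 88 |
| 2 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 90 | 95 | 93 |
| 3 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 47 | 57 | 44 |
| 4 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 76 | 78 | 75 |
| 5 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 71 | 83 | 78 |
| 6 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 88 | 95 | 92 |
| 7 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 40 | 43 | 39 |
| 8 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 64 | 64 | 67 |
| 9 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 38 | 60 | 50 |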


In [12]:
# Usually, this is not the procedure used to prepare a DataFrame that
# contains both categorical and numerical variables. Instead, the
# OneHotEncoder transformer is embedded in a Pipeline object, but we will
# see how to do that later
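
As a preview, one common way to do this is to wrap the encoder in a `ColumnTransformer` and place that inside a `Pipeline`. The sketch below is only an assumption about how this could look, not necessarily the approach shown later in the tutorial. The toy DataFrame reuses the column names of `df`, but its values are made up, and the step names `'onehot'` and `'prep'` are arbitrary.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data with the same column names as df (values are made up)
toy = pd.DataFrame({
    'gender': ['female', 'male', 'female', 'male', 'female', 'male'],
    'race_ethnicity': ['group A', 'group B', 'group C', 'group A', 'group B', 'group C'],
    'math_score': [72, 47, 90, 76, 69, 40],
    'reading_score': [72, 57, 95, 78, 90, 43],
    'writing_score': [74, 44, 93, 75, 88, 39],
})

# One-hot encode only the categorical columns; pass the numerical ones through unchanged
preprocessor = ColumnTransformer(
    transformers=[('onehot', OneHotEncoder(drop='first'), ['gender', 'race_ethnicity'])],
    remainder='passthrough',
)

# A Pipeline with a single preprocessing step; a model could be appended as a later step
pipe = Pipeline(steps=[('prep', preprocessor)])
X_toy = pipe.fit_transform(toy)

# 1 gender column + 2 race_ethnicity columns + 3 score columns
print(X_toy.shape)  # (6, 6)
```

The advantage over the manual approach is that the encoding and the concatenation happen in one object, which can later be fitted on the training set and applied unchanged to the test set.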

### The train_test_split method

In [13]:
# Import the train_test_split() function
from sklearn.model_selection import train_test_split

In [15]:
# train_test_split is a fundamental function of sklearn. It separates the
# original dataset into two parts: the training set and the test set

# For classification problems, the split can be stratified by passing
# stratify=y, which preserves the proportion of the 0/1 classes in the
# training and test sets (up to some rounding error).

# We can also select what fraction of the original data will be assigned
# as test data (the test_size argument). If our original dataset is large,
# this proportion may be small (~10%). If our dataset is of moderate size,
# a better option is to choose a proportion between ~30% and ~40%.

# A random seed may also be chosen. This is helpful to debug the model, reproduce
# the results and find biases and errors.

# Let us split our dataset in the following manner (note that stratify is not
# passed here, so the split is purely random):
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=12)
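
The role of `random_state` can be checked directly. In the sketch below, `X_toy` and `y_toy` are made-up arrays used only for illustration: two calls with the same seed return exactly the same split, which is what makes results reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 instances with 2 features each (made-up values)
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

# Two calls with the same seed return identical splits
a_train, a_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.3, random_state=12)
b_train, b_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.3, random_state=12)
print(np.array_equal(a_test, b_test))  # True

# test_size=0.3 on 10 rows yields 7 training rows and 3 test rows
print(a_train.shape, a_test.shape)     # (7, 2) (3, 2)
```

Omitting `random_state` gives a different shuffle on each run, which is fine for a final experiment but makes debugging harder.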

In [16]:
# Since stratify was not passed, the split above is not stratified
# Let us check whether the ratio between the 0/1 classes is nevertheless
# similar in the train and test sets
print(f'class-1 ratio in the training set: {y_train.sum()/len(y_train)}')
print(f'class-1 ratio in the test set: {y_test.sum()/len(y_test)}')

class-1 ratio in the training set: 0.35714285714285715
class-1 ratio in the test set: 0.35


In [18]:
# We can see that the ratios are close but not equal (about 0.357 vs 0.350):
# they differ by less than 0.01
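
The split above was made without the `stratify` argument, so the class ratios are only similar by chance. The sketch below shows what a stratified split looks like on made-up data (`X_toy` and `y_toy` are illustrative, with 30% of the instances in class 1): passing `stratify=y_toy` makes both parts keep that proportion.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 instances, 30 of them in class 1 (made-up values)
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 70 + [1] * 30)

# stratify=y_toy asks for the same class proportions in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=12, stratify=y_toy
)

# Number of class-1 instances and total size of each part
print(y_tr.sum(), len(y_tr))  # 21 70
print(y_te.sum(), len(y_te))  # 9 30
```

Both parts contain exactly 30% class-1 instances here because the counts divide evenly; with other sizes the proportions match up to rounding.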