In [1]:
%matplotlib inline

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

## Machine learning

- We have a load of data on students. Let's see if we can predict their gender from the other information we have about them.
- We therefore have a "feature vector" of all the columns apart from gender, and a "target" of gender.

### Prepping the data

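- For reasons we can go into later, ML algorithms need to have the data fed to them in a specific way: scikit-learn models expect every feature to be a number. We use pd.get_dummies() to do all the hard work for us by turning each text column into 0/1 indicator columns.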
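
To get a feel for what pd.get_dummies() does, here is a minimal sketch on a made-up two-column table. The `toy` data is an assumption for illustration and is not taken from the student dataset.

```python
import pandas as pd

# Toy data (hypothetical, not from the student dataset)
toy = pd.DataFrame({
    "age": [15, 16, 17],
    "school": ["GP", "MS", "GP"],
})

# Numeric columns pass through unchanged; each text column
# becomes one indicator column per distinct value
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
print(encoded)
```

The number of output columns depends on which values appear in the data, which matters later when we try to encode a single new row.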

In [9]:
## Load the data from the csv file
df = pd.read_csv('data/student-alcohol-maths.csv')

In [43]:
## Get the "sex" column as our "target" (the thing we're training the model to predict)
target = df.loc[:, 'sex']

## Get everything except the "sex" column as our features (the thing we'll tell our model)
features = pd.get_dummies(df.loc[:, df.columns != 'sex'])

In [44]:
from sklearn.model_selection import train_test_split

In [45]:
## We train the model on 80% of the data and hold out 20% to check how well the model has done
## Conventionally we call the features "X" and the target "y"
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

In [46]:
## Train the model on the training data
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(max_depth=3)
clf = clf.fit(X_train, y_train)

In [81]:
## Now let's see how well our model does on data it's never seen before
from sklearn.metrics import accuracy_score
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

Accuracy: 63.29%


### Checking the model does better than random

- If the data were 63% female and 37% male, then we could make a model that achieves an accuracy of 63% simply by always guessing Female. We have to check that our data isn't unbalanced:

In [60]:
print('Female proportion: %.2f%%' % ((target.value_counts()['F'] / target.count()) * 100))

Female proportion: 52.66%


...so our data isn't split 63% to 37%. Always guessing the more common gender would only score about 53%, so the machine learning algorithm has managed to find some information and does better than random.
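
The baseline arithmetic can be sketched on made-up labels. The 63/37 split below is the hypothetical from the text, not the real data, and the variable names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical labels: 63% 'F', 37% 'M'
toy_target = pd.Series(["F"] * 63 + ["M"] * 37)
shares = toy_target.value_counts(normalize=True)

# Always predicting the most common class is right exactly
# as often as that class occurs
majority_baseline = shares.max()

# Guessing each class at random, in proportion to its frequency,
# is right with probability sum(p_i ** 2), which is lower
proportional_baseline = (shares ** 2).sum()

print("Majority-class baseline: %.2f%%" % (majority_baseline * 100))
print("Proportional-guess baseline: %.2f%%" % (proportional_baseline * 100))
```

On the notebook's data the same majority-class calculation would be `target.value_counts(normalize=True).max()`, which is the figure the model's test accuracy should be compared against.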

## Using the model

- Using the model is more of a pain than you might imagine. Let's say we get a new person's data and we want to predict their gender: we have to create a feature vector of the same shape, using the same dummy variables as the training data. It's a PITA, so I haven't done it in this example...

If you want to have a try, the code below almost gets you there (but not quite, because when we call pd.get_dummies() on a single row it only creates dummy columns for the values in that row, not for all the options it saw when we called it above for the training data):

In [77]:
## Let's pick a random row from the data
some_person = df.sample()
print("our random person's gender:", some_person['sex'].iloc[0])
print('our random person: ')
some_person ## Outside the print for nicer formatting in notebook

our random person's gender: F
our random person: 


Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
40,GP,F,16,U,LE3,T,2,2,other,other,...,3,3,3,1,2,3,25,7,10,11


In [78]:
## Make a feature vector for our random person
## (Here is where it's wrong: see https://stackoverflow.com/questions/28465633/easy-way-to-apply-transformation-from-pandas-get-dummies-to-new-data)
## Use a new name so we don't overwrite the training `features` from earlier
person_features = pd.get_dummies(some_person.loc[:, some_person.columns != 'sex']).squeeze().values

In [79]:
## Let's predict his/her gender
clf.predict(np.array([person_features]))

ValueError: Number of features of the model must match the input. Model n_features is 57 and input n_features is 32 
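
One possible way around the mismatch is to reindex the new row's dummy columns against the training columns, filling the missing ones with 0. The sketch below uses made-up data, and the names `train`, `new_person` and `aligned` are assumptions for illustration. It has not been run against the student dataset.

```python
import pandas as pd

# Toy training data (hypothetical, standing in for the student dataset)
train = pd.DataFrame({
    "age": [15, 16, 17, 18],
    "school": ["GP", "MS", "GP", "MS"],
    "Mjob": ["teacher", "other", "health", "other"],
})
train_features = pd.get_dummies(train)

# A single new row only yields dummy columns for the values it contains
new_person = pd.DataFrame({"age": [16], "school": ["GP"], "Mjob": ["other"]})
new_features = pd.get_dummies(new_person)

# Align to the training columns: missing dummies are added as 0,
# and the column order matches what the model was trained on
aligned = new_features.reindex(columns=train_features.columns, fill_value=0)

print(train_features.shape[1], new_features.shape[1], aligned.shape[1])
```

In the notebook, the equivalent step would presumably be to reindex the person's dummies against `X_train.columns` and pass the resulting one-row DataFrame to `clf.predict()`. Note that reindexing also silently drops any category in the new row that never appeared in the training data.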