## Active Learning

Download the Titanic dataset here: https://drive.google.com/file/d/0Bz9_0VdXvv9bbVhpOEMwUDJ2elU/view?usp=sharing

In this exercise, we will simulate active learning. We hold out a small sample of observations for testing and track how the quality of the model improves as active learning chooses which observations to label and add to the training set.

In [60]:
# Load the Data into variable df
import pandas as pd
df = pd.read_csv('titanic_dataset.csv')
df.shape

(891, 12)

### Tasks

1. Fit the first model only on **start_df** using an **SVM** and evaluate accuracy, precision and recall on **test_df**.
2. In each iteration, add 10 observations from **df** to your training set, choosing them with an active learning approach:
    - score all observations in **df** and take the 10 the model is least sure about, i.e. those whose predicted probability of surviving is closest to 50%.
3. Refit the model and evaluate on **test_df** again.
4. The goal is to converge to the optimal solution as fast as possible by choosing the **right** observations in each iteration.
5. Plot a graph for each evaluation metric, with the iteration number on the x-axis and the metric value for that model on the y-axis.
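The selection step in the task list can be sketched as follows. This is a minimal illustration on synthetic data, not the Titanic set: the helper name `most_uncertain` and the `make_classification` stand-in are assumptions made for the example. It relies on `SVC(probability=True)`, which scikit-learn requires before `predict_proba` is available on an SVM.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC


def most_uncertain(model, X_pool, k=10):
    """Return positions of the k pool rows whose P(class 1) is closest to 0.5."""
    proba = model.predict_proba(X_pool)[:, 1]
    return np.argsort(np.abs(proba - 0.5))[:k]


# Synthetic stand-in for the Titanic features (an assumption, not the real data)
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_start, y_start, X_pool = X[:50], y[:50], X[50:]

# probability=True is needed for predict_proba on SVC
model = SVC(kernel="linear", C=1, probability=True, random_state=0)
model.fit(X_start, y_start)

picked = most_uncertain(model, X_pool, k=10)
picked_proba = model.predict_proba(X_pool[picked])[:, 1]
```

The rows at positions `picked` are the ones that would be labeled and moved from the pool into the training set before the next refit.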

In [61]:

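# NOTE: dropna() runs while Cabin is still present, so rows with a missing Cabin are dropped as well
df = df.dropna()
# DROP CATEGORICAL VARIABLES
df = df.drop(['Name', 'Ticket', 'Cabin'], axis=1)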

In [62]:
df2 = pd.get_dummies(df, columns=['Sex','Embarked'], drop_first=True)

In [63]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df2[['Age','Fare']] = scaler.fit_transform(df2[['Age','Fare']])

In [64]:
df2.head()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare,Sex_male,Embarked_Q,Embarked_S
1,2,1,1,0.468892,1,0,0.139136,0,0,0
3,4,1,1,0.430956,1,0,0.103644,0,0,1
6,7,0,1,0.671219,0,0,0.101229,1,0,1
10,11,1,3,0.038948,1,1,0.032596,0,0,1
11,12,1,1,0.721801,0,0,0.051822,0,0,1


In [6]:
# TEST SAMPLE
# USE THIS SAMPLE ONLY FOR TESTING
test_df = df2.sample(n=100, random_state=42)
# KEEP ONLY THOSE WHO ARE NOT IN THE TEST SET
df = df2[~df2.PassengerId.isin(test_df.PassengerId.tolist())]

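# FIT THE FIRST MODEL ONLY ON THE DATAFRAME START_DF
# NOTE: sampling df2 with the same n and random_state returns the same 100 rows as test_df,
# so the metrics below are computed on rows the model was trained on.
# start_df was meant to come from df (the non-test rows), but only 83 rows remain there.
start_df = df2.sample(n=100, random_state=42)
# DROP OBS FROM START_DF FROM DF
df = df2[~df2.PassengerId.isin(start_df.PassengerId.tolist())]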
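For comparison, here is one way the test set, the starting set and the remaining pool could be kept disjoint. It is a sketch under assumptions: the helper `three_way_split` is hypothetical, the toy frame merely stands in for `df2`, and the frame needs a unique index and at least `n_test + n_start` rows.

```python
import pandas as pd


def three_way_split(data, n_test=100, n_start=100, seed=42):
    """Split data into disjoint test, start and pool frames."""
    test_part = data.sample(n=n_test, random_state=seed)
    rest = data.drop(test_part.index)      # rows that are not in the test set
    start_part = rest.sample(n=n_start, random_state=seed)
    pool_part = rest.drop(start_part.index)  # rows still available for labeling
    return test_part, start_part, pool_part


# Toy frame standing in for df2 (assumption: unique index, enough rows)
toy = pd.DataFrame({"PassengerId": range(1, 501), "Survived": [0, 1] * 250})
toy_test, toy_start, toy_pool = three_way_split(toy)
```

Because the second sample is drawn from `rest` rather than from the full frame, reusing the same seed cannot return the test rows again.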

In [7]:
from sklearn.svm import SVC
# Linear-kernel SVM used for every fit in this notebook
svc = SVC(kernel='linear', C=1)

In [8]:
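# Features are every column except Survived; labels are the Survived column
X_test, y_test = test_df.loc[:, test_df.columns != 'Survived'], test_df.loc[:, 'Survived']
# start_train holds the training features, start_test the training labels (not a test set)
start_train, start_test = start_df.loc[:, start_df.columns != 'Survived'], start_df.loc[:, 'Survived']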

In [15]:
# FIT THE FIRST MODEL
svc.fit(start_train, start_test)
# PREDICTION
y_pred = svc.predict(X_test)
# EVALUATION
from sklearn.metrics import accuracy_score, precision_score, recall_score
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred))
print('Recall:', recall_score(y_test, y_pred))

Accuracy: 0.8
Precision: 0.875
Recall: 0.7903225806451613


In [24]:
print(y_pred)
print(list(y_test))

[0 1 1 0 0 0 0 1 1 0 0 1 0 1 0 1 1 0 0 1 1 1 0 1 0 1 1 0 1 1 1 1 0 1 0 1 1
 0 0 1 0 0 1 0 0 0 1 0 1 1 1 1 0 1 0 1 1 1 0 1 1 0 1 0 1 0 1 0 1 1 1 1 0 0
 0 0 0 1 1 0 1 0 0 1 0 1 1 1 1 1 1 1 0 0 1 1 0 0 1 1]
[0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1]


2. In each iteration, add 10 observations from **df** to your training set, choosing them with an active learning approach:
    - score all observations in **df** and take the 10 the model is least sure about, i.e. those whose predicted probability of surviving is closest to 50%.
3. Refit the model and evaluate on **test_df** again.
4. The goal is to converge to the optimal solution as fast as possible by choosing the **right** observations in each iteration.
5. Plot a graph for each evaluation metric, with the iteration number on the x-axis and the metric value for that model on the y-axis.
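Put together, the remaining tasks amount to a loop of fit, evaluate, select, and move rows from the pool to the training set. The sketch below runs on synthetic data and is only one possible reading of the exercise: `active_learning_run` is a hypothetical helper, and it measures uncertainty with the absolute value of `decision_function` (distance to the separating hyperplane) instead of a probability, which works with the plain linear `SVC` without `probability=True`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.svm import SVC


def active_learning_run(X_train, y_train, X_pool, y_pool, X_test, y_test,
                        n_iter=5, batch=10):
    """Refit a linear SVC, adding the `batch` least certain pool rows each round.

    Returns one dict of test metrics per fitted model (n_iter + 1 at most).
    """
    history = []
    for i in range(n_iter + 1):
        model = SVC(kernel="linear", C=1).fit(X_train, y_train)
        y_pred = model.predict(X_test)
        history.append({
            "iteration": i,
            "train_size": len(y_train),
            "accuracy": accuracy_score(y_test, y_pred),
            "precision": precision_score(y_test, y_pred, zero_division=0),
            "recall": recall_score(y_test, y_pred, zero_division=0),
        })
        if i == n_iter or len(y_pool) < batch:
            break
        # Small |decision_function| means the row lies close to the boundary
        margin = np.abs(model.decision_function(X_pool))
        pick = np.argsort(margin)[:batch]
        # Move the picked rows from the pool into the training set
        X_train = np.vstack([X_train, X_pool[pick]])
        y_train = np.concatenate([y_train, y_pool[pick]])
        X_pool = np.delete(X_pool, pick, axis=0)
        y_pool = np.delete(y_pool, pick)
    return history


# Synthetic stand-in data (assumption): 50 start rows, 250 pool rows, 100 test rows
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
history = active_learning_run(X[:50], y[:50], X[50:300], y[50:300], X[300:], y[300:])
```

Each entry of `history` holds the metrics for one refit, so the list can be fed directly into a plot of metric value against iteration number.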

In [26]:
# add 10 more samples to the training set
# df2 is the cleaned dataset to sample from
# test_df is passed in as the starting training set that the samples are concatenated to
# let's make functions for this

In [41]:
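def sample_data_and_concat(df2, test_df):
    # NOTE: test_df here is the growing training set, despite its name
    # SAMPLE DATA
    # NOTE: this is random sampling, not uncertainty-based, and the fixed random_state
    # returns the same 10 rows of df2 on every call
    sample_df = df2.sample(n=10, random_state=42)
    # CONCAT TO THE TRAINING SET
    test_df = pd.concat([test_df, sample_df])
    return test_df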

def split_and_fit(train_df): # use new concatenated training set with +10 samples
    start_train, start_test = train_df.loc[:, train_df.columns != 'Survived'], train_df.loc[:, 'Survived']
    # FIT THE MODEL
    svc.fit(start_train, start_test)
    # PREDICTION
    y_pred = svc.predict(X_test)
    # EVALUATION
    print('Accuracy:', accuracy_score(y_test, y_pred))
    print('Precision:', precision_score(y_test, y_pred))
    print('Recall:', recall_score(y_test, y_pred))
    return y_pred

In [None]:
# run this only the first time
concat_test_df = sample_data_and_concat(df2, test_df)

In [38]:
concat_test_df = sample_data_and_concat(df2, concat_test_df) # adds 10 more rows every time
print(concat_test_df.shape)

(130, 10)


In [42]:
split_and_fit(concat_test_df) # fit the model with the new concatenated set

Accuracy: 0.78
Precision: 0.8571428571428571
Recall: 0.7741935483870968


array([0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1,
       1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0,
       0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0,
       1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0,
       1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1])

In [58]:
# run this cell multiple times and watch how the metrics change.
concat_test_df = sample_data_and_concat(df2, concat_test_df) # adds 10 more rows every time
split_and_fit(concat_test_df) # fit the model with the new concatenated set


Accuracy: 0.78
Precision: 0.8846153846153846
Recall: 0.7419354838709677


array([0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1,
       1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0,
       0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0,
       1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0,
       1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1])

In [59]:
print(concat_test_df.shape)
print(df.shape)

(210, 10)
(83, 10)


In [67]:
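# This is not working only because the dataset I have is too small: dropna() ran before Cabin was dropped, leaving 183 rows, so only 83 remain in df once the test set is removed.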
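The plotting task is not reached in this notebook. A minimal sketch of it could look like the following, assuming the metrics of each refit were collected into a list of dicts: the `history` structure and the `plot_metrics` helper are hypothetical, and the numbers are placeholders, not results from this notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside a notebook
import matplotlib.pyplot as plt


def plot_metrics(history):
    """Draw one panel per metric: iteration number on x, metric value on y."""
    iterations = [h["iteration"] for h in history]
    names = ("accuracy", "precision", "recall")
    fig, axes = plt.subplots(1, len(names), figsize=(12, 3), sharex=True)
    for ax, name in zip(axes, names):
        ax.plot(iterations, [h[name] for h in history], marker="o")
        ax.set_title(name)
        ax.set_xlabel("iteration")
        ax.set_ylabel("value on test set")
    fig.tight_layout()
    return fig, axes


# Placeholder values for illustration only (not measured results)
history = [
    {"iteration": 0, "accuracy": 0.70, "precision": 0.72, "recall": 0.68},
    {"iteration": 1, "accuracy": 0.74, "precision": 0.75, "recall": 0.71},
    {"iteration": 2, "accuracy": 0.77, "precision": 0.78, "recall": 0.74},
]
fig, axes = plot_metrics(history)
```

In a notebook the figure renders inline; elsewhere `fig.savefig(...)` writes it to a file.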