# Construction year experimentation
In this notebook we try several strategies for dealing with missing construction year values (recorded as 0) in order to select a good one. We only use the training set here: we create a nested train/test split within the global training set and validate on that.

In [1]:
import sys
sys.path.append('..')
import data_loading, warnings
warnings.filterwarnings('ignore')
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import cohen_kappa_score, accuracy_score, confusion_matrix
from sklearn.linear_model import LogisticRegression
from collections import Counter

### Let's try a model fit that leaves in the missing construction year entries

In [35]:
data = data_loading.load_dataset(data_folder='../data')
data = data_loading.data_cleaning(data)
data = data_loading.numeric_groundtruth(data)

train_df, _ = data_loading.split_data(data)
train_df, test_df = data_loading.split_data(train_df) # split the training data into two new sets

# Series.as_matrix() was removed in pandas 1.0; to_numpy() is its replacement.
x_train = np.expand_dims(train_df.construction_year.to_numpy(), 1)
y_train = train_df.status_group.to_numpy()  # scikit-learn expects 1-D labels

x_test = np.expand_dims(test_df.construction_year.to_numpy(), 1)
y_test = test_df.status_group.to_numpy()

lr = LogisticRegression(class_weight='balanced', multi_class='multinomial', solver='newton-cg')
lr.fit(x_train,y_train)
pred = lr.predict(x_test)
print(confusion_matrix(y_test,pred))
print('Accuracy score: ', accuracy_score(y_test,pred))
print('Cohen kappa: ', cohen_kappa_score(y_test,pred))

Label distribution in training set:  Counter({0: 23519, 2: 16750, 1: 2922})
Label distribution in testing set:  Counter({0: 7870, 2: 5518, 1: 1009})
Label distribution in training set:  Counter({0: 17662, 2: 12528, 1: 2203})
Label distribution in testing set:  Counter({0: 5857, 2: 4222, 1: 719})
[[4015    0 1842]
 [ 439    0  280]
 [2799    0 1423]]
Accuracy score:  0.5036117799592517
Cohen kappa:  0.021500163155778407
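
As a rough check on how the two scores above relate, the sketch below recomputes them from the printed confusion matrix using the standard definition of Cohen's kappa (observed agreement corrected for chance agreement). It uses plain Python rather than the scikit-learn calls in the cell, and the variable names are only illustrative. The middle column is all zeros, so class 1 is never predicted, and the observed agreement is barely above what the label and prediction frequencies would give by chance.

```python
# Confusion matrix printed above (rows: true labels, columns: predicted labels).
cm = [[4015, 0, 1842],
      [439, 0, 280],
      [2799, 0, 1423]]

n = sum(sum(row) for row in cm)
row_totals = [sum(row) for row in cm]                        # true label counts
col_totals = [sum(row[j] for row in cm) for j in range(3)]   # predicted label counts

# Observed agreement: fraction of samples on the diagonal (this is the accuracy).
p_observed = sum(cm[i][i] for i in range(3)) / n

# Agreement expected by chance if predictions were independent of the labels.
p_expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2

kappa = (p_observed - p_expected) / (1 - p_expected)
print('Accuracy:', p_observed)
print('Cohen kappa:', kappa)
```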


### Let's try again after removing the missing construction year entries

In [36]:
data = data_loading.load_dataset(data_folder='../data')
data = data_loading.data_cleaning(data)
data = data_loading.numeric_groundtruth(data)

train_df, _ = data_loading.split_data(data)
train_df = train_df[train_df.construction_year != 0]

train_df, test_df = data_loading.split_data(train_df) # split the training data into two new sets

# Series.as_matrix() was removed in pandas 1.0; to_numpy() is its replacement.
x_train = np.expand_dims(train_df.construction_year.to_numpy(), 1)
y_train = train_df.status_group.to_numpy()  # scikit-learn expects 1-D labels

x_test = np.expand_dims(test_df.construction_year.to_numpy(), 1)
y_test = test_df.status_group.to_numpy()

lr = LogisticRegression(class_weight='balanced', multi_class='multinomial', solver='newton-cg')
lr.fit(x_train,y_train)
pred = lr.predict(x_test)
print(confusion_matrix(y_test,pred))
print('Accuracy score: ', accuracy_score(y_test,pred))
print('Cohen kappa: ', cohen_kappa_score(y_test,pred))

Label distribution in training set:  Counter({0: 23519, 2: 16750, 1: 2922})
Label distribution in testing set:  Counter({0: 7870, 2: 5518, 1: 1009})
Label distribution in training set:  Counter({0: 12263, 2: 8123, 1: 1400})
Label distribution in testing set:  Counter({0: 4002, 2: 2784, 1: 477})
[[2726  320  956]
 [ 255   66  156]
 [1176  261 1347]]
Accuracy score:  0.5698747074211759
Cohen kappa:  0.21652920300078948


### Clearly, a linear model gets confused when construction years are 0. However, we cannot simply drop these entries, since they make up about a third of the training set (14142 of 43191 rows). Let's try filling in the mean year instead (to minimise the effect of this feature when its value is missing).

In [38]:
data = data_loading.load_dataset(data_folder='../data')
data = data_loading.data_cleaning(data)
data = data_loading.numeric_groundtruth(data)

valid_year_data = data[data.construction_year != 0]
mean_year = int(valid_year_data.construction_year.mean())  # np.int was removed in NumPy 1.24
# .loc avoids pandas chained assignment, which may silently modify a copy
data.loc[data.construction_year == 0, 'construction_year'] = mean_year

train_df, _ = data_loading.split_data(data)

train_df, test_df = data_loading.split_data(train_df) # split the training data into two new sets

# Series.as_matrix() was removed in pandas 1.0; to_numpy() is its replacement.
x_train = np.expand_dims(train_df.construction_year.to_numpy(), 1)
y_train = train_df.status_group.to_numpy()  # scikit-learn expects 1-D labels

x_test = np.expand_dims(test_df.construction_year.to_numpy(), 1)
y_test = test_df.status_group.to_numpy()

lr = LogisticRegression(class_weight='balanced', multi_class='multinomial', solver='newton-cg')
lr.fit(x_train,y_train)
pred = lr.predict(x_test)
print(confusion_matrix(y_test,pred))
print('Accuracy score: ', accuracy_score(y_test,pred))
print('Cohen kappa: ', cohen_kappa_score(y_test,pred))

Label distribution in training set:  Counter({0: 23519, 2: 16750, 1: 2922})
Label distribution in testing set:  Counter({0: 7870, 2: 5518, 1: 1009})
Label distribution in training set:  Counter({0: 17662, 2: 12528, 1: 2203})
Label distribution in testing set:  Counter({0: 5857, 2: 4222, 1: 719})
[[2751 2067 1039]
 [ 229  311  179]
 [1178 1608 1436]]
Accuracy score:  0.4165586219670309
Cohen kappa:  0.1297758811576626


### Let's try filling in the minimum year instead.

In [40]:
data = data_loading.load_dataset(data_folder='../data')
data = data_loading.data_cleaning(data)
data = data_loading.numeric_groundtruth(data)

valid_year_data = data[data.construction_year != 0]
min_year = int(valid_year_data.construction_year.min())  # np.int was removed in NumPy 1.24
# .loc avoids pandas chained assignment, which may silently modify a copy
data.loc[data.construction_year == 0, 'construction_year'] = min_year

train_df, _ = data_loading.split_data(data)

train_df, test_df = data_loading.split_data(train_df) # split the training data into two new sets

# Series.as_matrix() was removed in pandas 1.0; to_numpy() is its replacement.
x_train = np.expand_dims(train_df.construction_year.to_numpy(), 1)
y_train = train_df.status_group.to_numpy()  # scikit-learn expects 1-D labels

x_test = np.expand_dims(test_df.construction_year.to_numpy(), 1)
y_test = test_df.status_group.to_numpy()

lr = LogisticRegression(class_weight='balanced', multi_class='multinomial', solver='newton-cg')
lr.fit(x_train,y_train)
pred = lr.predict(x_test)
print(confusion_matrix(y_test,pred))
print('Accuracy score: ', accuracy_score(y_test,pred))
print('Cohen kappa: ', cohen_kappa_score(y_test,pred))

Label distribution in training set:  Counter({0: 23519, 2: 16750, 1: 2922})
Label distribution in testing set:  Counter({0: 7870, 2: 5518, 1: 1009})
Label distribution in training set:  Counter({0: 17662, 2: 12528, 1: 2203})
Label distribution in testing set:  Counter({0: 5857, 2: 4222, 1: 719})
[[3429  166 2262]
 [ 320   25  374]
 [1773  223 2226]]
Accuracy score:  0.5260233376551213
Cohen kappa:  0.12872979131765672
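
The mean-year and minimum-year cells differ only in the statistic used to fill the missing entries. A minimal sketch of that shared step is shown below on made-up values, assuming NumPy arrays rather than the notebook's DataFrame; `fill_missing_years` is a hypothetical helper and not part of `data_loading`.

```python
import numpy as np

def fill_missing_years(years, statistic=np.mean, missing=0):
    """Hypothetical helper: replace `missing` entries with a statistic of the valid ones."""
    years = np.asarray(years)
    valid = years != missing
    # Truncate to a whole year, mirroring the integer cast used in the cells above.
    fill_value = int(statistic(years[valid]))
    return np.where(valid, years, fill_value)

toy_years = np.array([0, 1990, 2000, 0, 2010])  # made-up values, not from the dataset
mean_filled = fill_missing_years(toy_years, np.mean)
min_filled = fill_missing_years(toy_years, np.min)
print(mean_filled)
print(min_filled)
```

Passing the statistic as an argument keeps the two experiments identical apart from the fill value, which makes their scores easier to compare.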


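### Compared with keeping the 0 values in, both fills raise Cohen's kappa from 0.02 to about 0.13. Accuracy improves only slightly with the minimum year (0.50 to 0.53) and drops with the mean year (0.42). Neither reaches the kappa of 0.22 obtained when the missing entries are dropped.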


### Next, let's see whether it adds value to look at the age of the pump at the time of measurement (the recording year minus the construction year). This seems plausible because the measurements were taken over several decades, so the construction year by itself does not determine the age.

In [50]:
data = data_loading.load_dataset(data_folder='../data')
data = data_loading.data_cleaning(data)
data = data_loading.numeric_groundtruth(data)

data.date_recorded = data.date_recorded.astype('datetime64[ns]').dt.year
missing_year = data.construction_year == 0
valid_year_data = data[~missing_year]
# Both fill values are means over the rows with a known construction year.
mean_year = int(valid_year_data.construction_year.mean())
mean_age = int(np.mean(valid_year_data.date_recorded - valid_year_data.construction_year))

data['age_at_measurement'] = data.date_recorded - data.construction_year
# .loc avoids pandas chained assignment, which may silently modify a copy
data.loc[missing_year, 'age_at_measurement'] = mean_age
data.loc[missing_year, 'construction_year'] = mean_year

train_df, _ = data_loading.split_data(data)

train_df, test_df = data_loading.split_data(train_df) # split the training data into two new sets

# DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() is its replacement.
x_train = train_df[['age_at_measurement','construction_year']].to_numpy()
y_train = train_df.status_group.to_numpy()  # scikit-learn expects 1-D labels

x_test = test_df[['age_at_measurement','construction_year']].to_numpy()
y_test = test_df.status_group.to_numpy()

lr = LogisticRegression(class_weight='balanced', multi_class='multinomial', solver='newton-cg')
lr.fit(x_train,y_train)
pred = lr.predict(x_test)
print(confusion_matrix(y_test,pred))
print('Accuracy score: ', accuracy_score(y_test,pred))
print('Cohen kappa: ', cohen_kappa_score(y_test,pred))

Label distribution in training set:  Counter({0: 23519, 2: 16750, 1: 2922})
Label distribution in testing set:  Counter({0: 7870, 2: 5518, 1: 1009})
Label distribution in training set:  Counter({0: 17662, 2: 12528, 1: 2203})
Label distribution in testing set:  Counter({0: 5857, 2: 4222, 1: 719})
[[4175  838  844]
 [ 475   89  155]
 [2421  555 1246]]
Accuracy score:  0.5102796814224857
Cohen kappa:  0.11662155311745437
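
To illustrate why the age has to be overwritten for rows with a missing construction year, the toy sketch below repeats the age computation on made-up values, assuming NumPy arrays instead of the notebook's DataFrame. Subtracting a construction year of 0 yields an "age" of roughly two thousand years, so those rows are replaced with the mean age of the valid rows.

```python
import numpy as np

# Made-up values for illustration; 0 marks a missing construction year.
year_recorded = np.array([2011, 2013, 2012, 2011])
construction_year = np.array([2000, 0, 1990, 0])

missing = construction_year == 0
naive_age = year_recorded - construction_year  # missing rows come out as about 2000 years old

# Mean age over the valid rows only, truncated to a whole number of years.
mean_age = int(naive_age[~missing].mean())
age_at_measurement = np.where(missing, mean_age, naive_age)
print(naive_age)
print(age_at_measurement)
```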


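### The age feature doesn't seem to add much: Cohen's kappa is 0.12, slightly below the 0.13 obtained with mean-year imputation alone, although accuracy rises from 0.42 to 0.51.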