# Construction year experimentation
In this notebook we try several ways of handling missing construction year values (recorded as 0) so that we can pick a good one. We only use the TRAINING SET here :)

In [45]:
import sys
sys.path.append('/Users/mustafa/workspace/projects/tansanian_waterpumps/')
import data_loading
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import cohen_kappa_score, accuracy_score, confusion_matrix
from sklearn.linear_model import LogisticRegression
from collections import Counter

In [2]:
data = data_loading.load_dataset(data_folder='../data')
data = data_loading.data_cleaning(data)
data = data_loading.numeric_groundtruth(data)

# Let's first fit a model with the missing construction year entries (recorded as 0) left in

In [38]:
train_df, _ = data_loading.split_data(data)
# Single-column feature matrix and 1-D label vector
# (.as_matrix() has been removed from pandas; .values is equivalent here)
x = train_df[['construction_year']].values
y = train_df.status_group.values

lr = LogisticRegression(class_weight='balanced', multi_class='multinomial', solver='newton-cg')
lr.fit(x,y)

pred = lr.predict(x)

print(confusion_matrix(y,pred))
print('Accuracy score: ', accuracy_score(y,pred))
print('Cohen kappa: ', cohen_kappa_score(y,pred))

Label distribution in training set:  Counter({0: 23519, 2: 16750, 1: 2922})
Label distribution in testing set:  Counter({0: 7870, 2: 5518, 1: 1009})




[[16265  7254     0]
 [ 1877  1045     0]
 [10907  5843     0]]
Accuracy score:  0.40077793984857957
Cohen kappa:  0.020255169832355646


# Let's try again after removing the missing construction year entries

In [39]:
train_df, _ = data_loading.split_data(data)
train_df = train_df[train_df.construction_year != 0]
# Single-column feature matrix and 1-D label vector
# (.as_matrix() has been removed from pandas; .values is equivalent here)
x = train_df[['construction_year']].values
y = train_df.status_group.values

lr = LogisticRegression(class_weight='balanced', multi_class='multinomial', solver='newton-cg')
lr.fit(x,y)

pred = lr.predict(x)

print(confusion_matrix(y,pred))
print('Accuracy score: ', accuracy_score(y,pred))
print('Cohen kappa: ', cohen_kappa_score(y,pred))

Label distribution in training set:  Counter({0: 23519, 2: 16750, 1: 2922})
Label distribution in testing set:  Counter({0: 7870, 2: 5518, 1: 1009})




[[11215   964  4086]
 [  996   150   731]
 [ 4615   763  5529]]
Accuracy score:  0.5815690729457124
Cohen kappa:  0.22192903671587927
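
As a sanity check, both scores can be recomputed by hand from the confusion matrix printed above. The sketch below is plain Python and uses the standard definition of Cohen's kappa (observed agreement corrected for the agreement expected by chance); the helper name `accuracy_and_kappa` is made up for this illustration and is not part of `data_loading` or scikit-learn.

```python
def accuracy_and_kappa(cm):
    """Return (accuracy, Cohen's kappa) for a square confusion matrix.

    Rows are true labels, columns are predicted labels.
    """
    total = sum(sum(row) for row in cm)
    # Observed agreement: fraction of samples on the diagonal
    observed = sum(cm[i][i] for i in range(len(cm))) / total
    # Chance agreement: product of the row and column marginals, per class
    row_totals = [sum(row) for row in cm]
    col_totals = [sum(col) for col in zip(*cm)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Confusion matrix from the run without the missing construction years
cm = [[11215, 964, 4086],
      [996, 150, 731],
      [4615, 763, 5529]]

acc, kappa = accuracy_and_kappa(cm)
print('Accuracy score: ', acc)
print('Cohen kappa: ', kappa)
```

Because kappa subtracts the agreement expected from the class marginals alone, it is a more conservative score than accuracy on an imbalanced label distribution like this one.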




### Clearly a linear model gets confused when the construction year is 0. However, we cannot just drop these entries, since they make up roughly a third of the training data (14142 of 43191 rows). Let's try filling in the mean year instead, to minimise the effect of this feature when its value is missing.

In [43]:
train_df, _ = data_loading.split_data(data)
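mean_year = int(train_df[train_df.construction_year != 0].construction_year.mean())
# Impute only the construction_year column. Note: train_df[mask] = value overwrites
# every column of the matching rows, status_group included; the recorded output
# below came from that form, hence its fourth class.
train_df.loc[train_df.construction_year == 0, 'construction_year'] = mean_year
x = train_df[['construction_year']].values
y = train_df.status_group.values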

lr = LogisticRegression(class_weight='balanced', multi_class='multinomial', solver='newton-cg')
lr.fit(x,y)
pred = lr.predict(x)
print(confusion_matrix(y.squeeze(),pred))
print('Accuracy score: ', accuracy_score(y,pred))
print('Cohen kappa: ', cohen_kappa_score(y,pred))

Label distribution in training set:  Counter({0: 23519, 2: 16750, 1: 2922})
Label distribution in testing set:  Counter({0: 7870, 2: 5518, 1: 1009})


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s


[[10825   447  4086   907]
 [  934    60   731   152]
 [ 4363   274  5529   741]
 [    0     0     0 14142]]
Accuracy score:  0.707462202773726
Cohen kappa:  0.5680820104572604
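
The fourth row and column in the confusion matrix above can be reproduced on a toy frame. The following is a minimal sketch, with made-up values and hypothetical variable names, of how `df[mask] = value` differs from `df.loc[mask, column] = value` in pandas: the first form writes the value into every column of the selected rows, the second only into the named column.

```python
import pandas as pd

# Toy data (made up): two pumps have a missing construction year, recorded as 0
df = pd.DataFrame({
    "construction_year": [0, 1990, 2000, 0],
    "status_group": [0, 1, 2, 2],
})
mask = df.construction_year == 0
mean_year = int(df.loc[~mask, "construction_year"].mean())

# Boolean-mask assignment on the frame itself overwrites ALL columns
whole_row = df.copy()
whole_row[mask] = mean_year

# .loc with an explicit column only touches construction_year
column_only = df.copy()
column_only.loc[mask, "construction_year"] = mean_year

print(whole_row)
print(column_only)
```

In `whole_row` the labels of the imputed rows become the mean year as well, which is how an extra class can end up in `status_group`.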




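### Better on paper, but note the 4x4 confusion matrix: the assignment overwrote every column of the 14142 imputed rows, so their label became a fourth class that is always predicted correctly, which inflates both scores. What about using the minimum year instead (assuming that pumps with a missing construction year are likely to be older)?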

In [44]:
train_df, _ = data_loading.split_data(data)
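min_year = int(train_df[train_df.construction_year != 0].construction_year.min())
# Impute only the construction_year column; the recorded output below came from
# assigning to train_df[mask], which overwrote status_group too (fourth class).
train_df.loc[train_df.construction_year == 0, 'construction_year'] = min_year
x = train_df[['construction_year']].values
y = train_df.status_group.values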

lr = LogisticRegression(class_weight='balanced', multi_class='multinomial', solver='newton-cg')
lr.fit(x,y)
pred = lr.predict(x)
print(confusion_matrix(y.squeeze(),pred))
print('Accuracy score: ', accuracy_score(y,pred))
print('Cohen kappa: ', cohen_kappa_score(y,pred))

Label distribution in training set:  Counter({0: 23519, 2: 16750, 1: 2922})
Label distribution in testing set:  Counter({0: 7870, 2: 5518, 1: 1009})


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s


[[11215  1259  3671   120]
 [  996   186   661    34]
 [ 4615  1004  5030   258]
 [    0     0     0 14142]]
Accuracy score:  0.7078558032923525
Cohen kappa:  0.5739838700222333
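
Both 4x4 matrices contain a fourth class of 14142 rows that is always predicted correctly, which appears to come from the imputed rows having had their label overwritten along with the year. To compare the two fills on the three real labels only, accuracy can be restricted to the first three rows. This is a rough sketch in plain Python; the helper `accuracy` and its `classes` argument are made up for this illustration.

```python
def accuracy(cm, classes=None):
    """Accuracy over the samples whose TRUE label is in `classes`.

    Rows are true labels, columns are predicted labels. Predictions that
    fall outside `classes` still count as errors.
    """
    classes = range(len(cm)) if classes is None else classes
    correct = sum(cm[i][i] for i in classes)
    total = sum(sum(cm[i]) for i in classes)
    return correct / total


# Confusion matrices printed by the mean-year and minimum-year runs
mean_cm = [[10825, 447, 4086, 907],
           [934, 60, 731, 152],
           [4363, 274, 5529, 741],
           [0, 0, 0, 14142]]
min_cm = [[11215, 1259, 3671, 120],
          [996, 186, 661, 34],
          [4615, 1004, 5030, 258],
          [0, 0, 0, 14142]]

real = [0, 1, 2]  # the three genuine status_group labels
print('mean fill, all classes: ', accuracy(mean_cm))
print('mean fill, real classes:', accuracy(mean_cm, real))
print('min fill, real classes: ', accuracy(min_cm, real))
```

On the real labels both fills land slightly below the accuracy of the run that simply dropped the missing years, and the gap between mean and minimum remains marginal.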




### There seems to be only a marginal improvement. Since this solution adds another assumption (one that might not hold in the test set), we are better off with the solution that assumes less: setting the missing construction year to the mean year.
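
To carry the chosen fill over to the test set without peeking at it, the mean year would be estimated on the training rows only and then reused unchanged. The sketch below illustrates this with NumPy on made-up years; the function names `fit_mean_year` and `impute` are hypothetical and not part of `data_loading`.

```python
import numpy as np


def fit_mean_year(years, missing=0):
    """Mean of the known construction years, truncated to an integer."""
    known = years[years != missing]
    return int(known.mean())


def impute(years, fill, missing=0):
    """Return a copy of `years` with missing entries replaced by `fill`."""
    return np.where(years == missing, fill, years)


# Toy data (made up); 0 marks a missing construction year
train_years = np.array([0, 1990, 2000, 2010, 0])
test_years = np.array([1985, 0, 2005])

fill = fit_mean_year(train_years)          # estimated on the training set only
train_imputed = impute(train_years, fill)
test_imputed = impute(test_years, fill)    # same value reused for the test set

print(fill, train_imputed, test_imputed)
```

Keeping the fitted value separate from the step that applies it makes it harder to recompute the mean on the test set by accident.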