## [Lucille Kaleha](https://www.linkedin.com/in/lucillekaleha/): **Solution for the Womxn in Big Data South Africa competition**


---
*Thanks to Zindi, Women in Big Data, HERE Technologies and Microsoft for this challenge and the opportunity to improve livelihoods using data science.*

### Challenges faced:
 - There was no correlation between the local cross-validation score and the leaderboard score, so it was hard to tell how good a model was or whether it was overfitting.
 - It was very difficult to get new data from the recommended HERE and XYZ APIs.
 
### Approach used:
 - Focused more on building models than on feature engineering.
 - As location was an important feature, I reverse geocoded each latitude and longitude pair to a place name using the `reverse_geocoder` Python library.
 - As no single model yielded good results, I trained several models so that they could cancel out each other's errors and generalise well.
 - Because training on all the data yielded unsatisfactory results, I trained each model on 70% of the data.
 - To make sure that all the data was eventually used for training, I split the data with different random states.
 - Finally, to generalise the ensembled models, I averaged, blended and retrained the models using the test data as training data and the predictions as the target.
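
The claim that several 70% splits with different random states end up touching nearly every row can be checked with a small simulation. This is only a sketch: the standard-library `random` module stands in for `train_test_split`, the row count is made up, and the helper names `train_indices` and `coverage` are hypothetical. Only the four seeds are taken from the notebook.

```python
import random


def train_indices(n_rows, train_fraction, seed):
    """Return the set of row indices that land in the training part of one split."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_train = round(n_rows * train_fraction)
    return set(indices[:n_train])


def coverage(n_rows, train_fraction, seeds):
    """Fraction of rows seen by at least one of the seeded training splits."""
    seen = set()
    for seed in seeds:
        seen |= train_indices(n_rows, train_fraction, seed)
    return len(seen) / n_rows


n_rows = 1000                # made-up dataset size
seeds = [90, 29, 65, 27]     # random states used in this notebook
for k in range(1, len(seeds) + 1):
    print(k, "split(s):", coverage(n_rows, 0.7, seeds[:k]))
```

With independent random splits, the expected share of rows never used for training after k splits is 0.3^k, so four splits leave under 1% of rows unseen on average.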
 
### Some small caveats:
 - I realised that different versions of the CatBoost regressor yielded different results, so I took advantage of this and used two versions of CatBoost.
    - At some point in the notebook you will have to restart the kernel.
 - Setting the random state (seed) helped with reproducibility, but some models don't have a random state parameter, so some randomness cannot be controlled. Predictions will therefore differ by a small margin each time you run the notebook.
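
Since the results depend on which CatBoost version is active, it may help to print the installed version at the start of each stage, before and after the kernel restart. The sketch below uses the standard-library `importlib.metadata` module, which needs Python 3.8 or newer; `installed_version` is a hypothetical helper and not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# The first stage of the notebook pins catboost 0.20.2; the second stage upgrades it.
active = installed_version("catboost")
if active is None:
    print("catboost is not installed in this kernel")
else:
    print("catboost version in this kernel:", active)
```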
 



In [0]:
# Installing the necessary libraries
#
!pip install vecstack                   # For stacking models
!pip install catboost==0.20.2           # This version of catboost yielded better results with certain random states
!pip install reverse_geocoder           # Used to get location of a place, given coordinates

Collecting vecstack
  Downloading https://files.pythonhosted.org/packages/d0/a1/b9a1e9e9e5a12078da1ab9788c7885e4c745358f7e57d5f94d9db6a4e898/vecstack-0.4.0.tar.gz
Building wheels for collected packages: vecstack
  Building wheel for vecstack (setup.py) ... done
  Created wheel for vecstack: filename=vecstack-0.4.0-cp36-none-any.whl size=19877 sha256=286796dcf1e6e5c7975e12f274feb6d26eacdc04a45749ac12a2426cd1c473f0
  Stored in directory: /root/.cache/pip/wheels/5f/bb/4e/f6488433d53bc0684673d6845e5bf11a25240577c8151c140e
Successfully built vecstack
Installing collected packages: vecstack
Successfully installed vecstack-0.4.0
Collecting catboost==0.20.2
  Downloading https://files.pythonhosted.org/packages/97/c4/586923de4634f88a31fd1b4966e15707a912b98b6f4566651b5ef58f36b5/catboost-0.20.2-cp36-none-manylinux1_x86_64.whl (63.9MB)
Installing collected packages: catboost
Successfully installed catboost-0.20.2
Collectin

In [0]:
# Importing the necessary libraries
#
import pandas as pd
import numpy as np
import requests
from io import StringIO 
import reverse_geocoder as rg
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR, NuSVR
from sklearn.neighbors import KNeighborsRegressor
from xgboost import XGBRegressor, XGBRFRegressor
from sklearn.linear_model import LinearRegression, Ridge, Lasso, BayesianRidge
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import RandomForestRegressor, StackingRegressor,HistGradientBoostingRegressor, ExtraTreesRegressor
from sklearn.metrics import mean_squared_error
from lightgbm import LGBMRegressor
from vecstack import stacking
from vecstack import StackingTransformer
from catboost import CatBoostRegressor
import warnings
warnings.filterwarnings('ignore')

### Loading and cleaning data

In [0]:
# Created links to shared files via google drive
#
train = 'https://drive.google.com/file/d/13GpeDjiVR1aRHpkAZc7EeH_cKf52q7qE/view?usp=sharing'
test = 'https://drive.google.com/file/d/17JoUvCmpFXXFbgbZ9Ki3Xqh9qcl7UV8c/view?usp=sharing'
submission = 'https://drive.google.com/file/d/1GN1lSsLU43kQaZtThc4dP60mz8ztwDsL/view?usp=sharing'
dictionary = 'https://drive.google.com/file/d/1lAZnQFsBkPo8TNHYbq5mt2SpSMrG57WR/view?usp=sharing'


# Created a function to read a csv file shared via google and return a dataframe
#
def read_csv(url):
  url = 'https://drive.google.com/uc?export=download&id=' + url.split('/')[-2]
  csv_raw = requests.get(url).text
  csv = StringIO(csv_raw)
  df = pd.read_csv(csv)
  return df

# Creating submission, training, testing and variable definition dataframes
#
sub = read_csv(submission)
train = read_csv(train)
test = read_csv(test)
submission = read_csv(submission)
dictionary = read_csv(dictionary)

# Splitting the target variable from the train dataframe
#
target = train.target


# Aligning the training and testing datasets
train, test = train.align(test, join = 'inner', axis = 1)


# Including a separator column to be used to split the dataframes after combining them
#
train['separator'] = 0
test['separator'] = 1


# Combining the test and train dataframes, so that feature engineering can be done on the go
#
comb = pd.concat([train, test])

# Separating the training and testing dataframes from the combined dataframe
# and dropping the separator column, as it has served its purpose
#
train = comb[comb.separator == 0].drop('separator', axis = 1)
test = comb[comb.separator == 1].drop('separator', axis = 1)
train['target'] = target
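
The `read_csv` helper above assumes that the file id is the second-to-last segment of the sharing link. A slightly more defensive variant could look for the segment that follows `d` and fail loudly on an unexpected link. This is a sketch only; `drive_download_url` is a hypothetical name, and the download endpoint is the same one the notebook already uses.

```python
def drive_download_url(share_url):
    """Turn a Google Drive 'file/d/<id>/view' sharing link into a direct-download link."""
    parts = share_url.split('/')
    # In a link of the form .../file/d/<id>/view?usp=sharing the id follows the 'd' segment.
    if 'd' not in parts or parts.index('d') + 1 >= len(parts):
        raise ValueError('Unexpected Google Drive link format: ' + share_url)
    file_id = parts[parts.index('d') + 1]
    return 'https://drive.google.com/uc?export=download&id=' + file_id


link = 'https://drive.google.com/file/d/13GpeDjiVR1aRHpkAZc7EeH_cKf52q7qE/view?usp=sharing'
print(drive_download_url(link))
```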

### Catboost Predictions


In [0]:
# Splitting the data into training and testing dataframes
#
X = train.drop(['ward', 'ADM4_PCODE', 'target'], axis = 1)  # Predictors
y = target                                                  # Target

tes = test.drop(['ward', 'ADM4_PCODE'], axis = 1)           # Testing data

# Holding out 30% of the rows so that the model trains on 70% of the data, with the random state set to 90
#
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 90)

# Making predictions
#
predictions_cat = CatBoostRegressor(logging_level='Silent').fit(X_train, y_train).predict(tes)
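
The split above also produces `X_test` and `y_test`, which the cell does not use. They could provide a quick local score for each model. The sketch below assumes a root mean squared error metric, and `rmse` is a hypothetical helper; the numbers are made up for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError('y_true and y_pred must have the same length')
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up hold-out targets and predictions, for illustration only.
y_holdout = [10.0, 20.0, 30.0]
holdout_predictions = [12.0, 18.0, 33.0]
print(rmse(y_holdout, holdout_predictions))

# In the notebook this would look like (not run here):
# model = CatBoostRegressor(logging_level='Silent').fit(X_train, y_train)
# print(rmse(list(y_test), list(model.predict(X_test))))
```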

### Sklearn Stacking Regressor Predictions

In [0]:
# Using two different stacked ensembles to make predictions using the sklearn stacking regressor
#
X = train.drop(['ward', 'ADM4_PCODE', 'target'], axis = 1)
y = target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 90)

tes = test.drop(['ward', 'ADM4_PCODE'], axis = 1)

estimators_1 = [
    ('xgb', XGBRegressor(objective ='reg:squarederror')),
    ('lr', LinearRegression()),
    ('rf', RandomForestRegressor()),
    ('lgb', LGBMRegressor()),
    ('svr', SVR()),
    ('lasso', Lasso()),
    ('kneiba', KNeighborsRegressor()),
    ('cat', CatBoostRegressor(logging_level='Silent'))
]

predictions_sreg = StackingRegressor(estimators=estimators_1, final_estimator=CatBoostRegressor(logging_level='Silent')).fit(X_train, y_train).predict(tes)


estimators_2 = [
    ('XBRF', XGBRFRegressor(objective ='reg:squarederror')),
    ('Bayesian', BayesianRidge()),
    ('ExtraTrees', ExtraTreesRegressor()),
    ('HistGradient', HistGradientBoostingRegressor()),
    ('NuSVR', NuSVR()),
    ('Ridge', Ridge()),
    ('KNeiba', KNeighborsRegressor()),
    ('cat', CatBoostRegressor(logging_level='Silent'))
]

predictions_sreg_2 = StackingRegressor(estimators=estimators_2, final_estimator=CatBoostRegressor(logging_level='Silent')).fit(X_train, y_train).predict(tes)

### Vecstack Predictions

In [0]:
# Using two different stacked ensembles to make predictions using the vecstack stacking regressor
#
X = train.drop(['ward', 'ADM4_PCODE', 'target'], axis = 1)
y = target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 90)

tes = test.drop(['ward', 'ADM4_PCODE'], axis = 1)

estimators_1 = [
    ('xgb', XGBRegressor(objective ='reg:squarederror')),
    ('lr', LinearRegression()),
    ('rf', RandomForestRegressor()),
    ('lgb', LGBMRegressor()),
    ('svr', SVR()),
    ('lasso', Lasso()),
    ('kneiba', KNeighborsRegressor()),
    ('cat', CatBoostRegressor(logging_level='Silent'))
]

stack = StackingTransformer(estimators_1, regression=True, verbose=0, metric =mean_squared_error, shuffle=True)
stack = stack.fit(X_train, y_train)
S_train = stack.transform(X_train)


final_estimator = CatBoostRegressor(logging_level='Silent')
final_estimator = final_estimator.fit(S_train, y_train)

S_tes = stack.transform(tes)
predictions_vecstack = final_estimator.predict(S_tes)



estimators_2 = [
    ('XBRF', XGBRFRegressor(objective ='reg:squarederror')),
    ('Bayesian', BayesianRidge()),
    ('ExtraTrees', ExtraTreesRegressor()),
    ('HistGradient', HistGradientBoostingRegressor()),
    ('NuSVR', NuSVR()),
    ('Ridge', Ridge()),
    ('KNeiba', KNeighborsRegressor()),
    ('cat', CatBoostRegressor(logging_level='Silent'))
]

stack = StackingTransformer(estimators_2, regression=True, verbose=0, metric =mean_squared_error, shuffle=True)
stack = stack.fit(X_train, y_train)
S_train = stack.transform(X_train)


final_estimator = CatBoostRegressor(logging_level='Silent')
final_estimator = final_estimator.fit(S_train, y_train)

S_tes = stack.transform(tes)
predictions_vecstack_2 = final_estimator.predict(S_tes)
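
Both stacking approaches above feed the final estimator with cross-validated (out-of-fold) predictions from the base models. The sketch below illustrates that general idea in plain Python. It is not the internal code of either library: the two toy base models, the contiguous unshuffled folds and all helper names are assumptions made for illustration.

```python
def k_fold_indices(n_rows, n_folds):
    """Split range(n_rows) into n_folds contiguous, nearly equal validation folds."""
    folds, start = [], 0
    for fold in range(n_folds):
        size = n_rows // n_folds + (1 if fold < n_rows % n_folds else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


class MeanModel:
    """Toy base model: always predicts the training mean."""
    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class FirstFeatureLine:
    """Toy base model: least-squares line on the first feature only."""
    def fit(self, X, y):
        xs = [row[0] for row in X]
        x_mean, y_mean = sum(xs) / len(xs), sum(y) / len(y)
        var = sum((x - x_mean) ** 2 for x in xs)
        cov = sum((x - x_mean) * (t - y_mean) for x, t in zip(xs, y))
        self.slope_ = cov / var if var else 0.0
        self.intercept_ = y_mean - self.slope_ * x_mean
        return self

    def predict(self, X):
        return [self.intercept_ + self.slope_ * row[0] for row in X]


def out_of_fold_features(model_factories, X, y, n_folds):
    """Build one column per base model, filled only with out-of-fold predictions."""
    oof = [[0.0] * len(model_factories) for _ in X]
    for val_idx in k_fold_indices(len(X), n_folds):
        val_set = set(val_idx)
        train_idx = [i for i in range(len(X)) if i not in val_set]
        X_tr, y_tr = [X[i] for i in train_idx], [y[i] for i in train_idx]
        for col, factory in enumerate(model_factories):
            model = factory().fit(X_tr, y_tr)
            for i, pred in zip(val_idx, model.predict([X[i] for i in val_idx])):
                oof[i][col] = pred
    return oof


X = [[float(i)] for i in range(10)]      # made-up single-feature data
y = [2.0 * row[0] + 1.0 for row in X]
S_train = out_of_fold_features([MeanModel, FirstFeatureLine], X, y, n_folds=5)
print(S_train[0])
```

Each row of `S_train` holds predictions from models that never saw that row, which is what lets a final estimator be trained on it without simply memorising the base models' training fit.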

### Feature Engineering

In [0]:
# Created links to shared files via google drive
#
train = 'https://drive.google.com/file/d/13GpeDjiVR1aRHpkAZc7EeH_cKf52q7qE/view?usp=sharing'
test = 'https://drive.google.com/file/d/17JoUvCmpFXXFbgbZ9Ki3Xqh9qcl7UV8c/view?usp=sharing'
submission = 'https://drive.google.com/file/d/1GN1lSsLU43kQaZtThc4dP60mz8ztwDsL/view?usp=sharing'
dictionary = 'https://drive.google.com/file/d/1lAZnQFsBkPo8TNHYbq5mt2SpSMrG57WR/view?usp=sharing'


# Created a function to read a csv file shared via google and return a dataframe
#
def read_csv(url):
  url = 'https://drive.google.com/uc?export=download&id=' + url.split('/')[-2]
  csv_raw = requests.get(url).text
  csv = StringIO(csv_raw)
  df = pd.read_csv(csv)
  return df

# Creating submission, training, testing and variable definition dataframes
#
sub = read_csv(submission)
train = read_csv(train)
test = read_csv(test)
submission = read_csv(submission)
dictionary = read_csv(dictionary)

# Splitting the target variable from the train dataframe
#
target = train.target


# Aligning the training and testing datasets
train, test = train.align(test, join = 'inner', axis = 1)


# Including a separator column to be used to split the dataframes after combining them
#
train['separator'] = 0
test['separator'] = 1


# Combining the test and train dataframes, so that feature engineering can be done on the go
#
comb = pd.concat([train, test])

In [0]:
# # Reverse geocoding coordinates to locations
# #
# name = []

# for i in range(len(comb)):
#   location = rg.search([(x, y) for x, y in zip(comb.lat, comb.lon)][i])
#   name.append(location[0].get('name'))


# # Adding the geocoded locations to the combined dataframe
# comb['name'] = name

# # Creating a csv file of the combined dataframe
# comb.to_csv('women_comb.csv')
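
The commented-out loop above rebuilds the full coordinate list and calls `rg.search` once per row. Passing all coordinates in a single call is likely to be much faster. The sketch below assumes that `search` behaves like `reverse_geocoder.search`, taking a list of (lat, lon) tuples and returning one dict per coordinate with a `name` key. `location_names` and `fake_search` are hypothetical names, and the coordinates are made up.

```python
def location_names(latitudes, longitudes, search):
    """Look up a place name for every (lat, lon) pair with a single batched call.

    `search` is assumed to behave like `reverse_geocoder.search`: it takes a list
    of (lat, lon) tuples and returns one dict per coordinate with a 'name' key.
    """
    coordinates = list(zip(latitudes, longitudes))
    results = search(coordinates)
    return [result.get('name') for result in results]


def fake_search(coordinates):
    """Stand-in for the real geocoder so the sketch runs without extra packages."""
    return [{'name': 'place_%d' % i, 'lat': str(lat), 'lon': str(lon)}
            for i, (lat, lon) in enumerate(coordinates)]


print(location_names([-29.68, -26.20], [24.73, 28.05], fake_search))

# With the real library this would be (not run here):
# import reverse_geocoder as rg
# comb['name'] = location_names(comb.lat, comb.lon, rg.search)
```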

In [0]:
# Loading the combined created csv
#
def read_csv(url):
  url = 'https://drive.google.com/uc?export=download&id=' + url.split('/')[-2]
  csv_raw = requests.get(url).text
  csv = StringIO(csv_raw)
  return csv

comb_link = 'https://drive.google.com/file/d/1lglzdXOnAlQntIYK-RdYJv8DtckV6xtm/view?usp=sharing'
comb = pd.read_csv(read_csv(comb_link), index_col = 0)
comb.drop(['admin1', 'admin2'], axis = 1, inplace = True)


# Creating a column of how many times a location is represented in the dataset
#
freq_cols = ['name']
for col in freq_cols:
  fq_encode = comb[col].value_counts().to_dict()
  comb[col+'_fq_enc'] = comb[col].map(fq_encode)

# One hot encoding the location column
#
comb = pd.get_dummies(comb, columns = ['name'], drop_first=True)
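
The frequency encoding above replaces each location name with the number of rows that share it. A plain-Python sketch of the same idea, with made-up names and a hypothetical `frequency_encode` helper:

```python
from collections import Counter


def frequency_encode(values):
    """Replace each category with the number of times it occurs in `values`."""
    counts = Counter(values)
    return [counts[value] for value in values]


# Made-up location names, for illustration only.
names = ['Durban', 'Cape Town', 'Durban', 'Pretoria', 'Durban']
print(frequency_encode(names))
```

Because the notebook counts over the combined train and test rows, a location that appears only in the test set still receives a count.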

In [0]:
# Generating more features
#
comb['Household_Size'] = comb['total_individuals']/comb['total_households']
comb['psa_car1_car_2'] = comb.psa_00/(comb.car_00 + comb.car_01)
comb['latlon'] = abs(comb.lat) + abs(comb.lon)
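
The two ratio features divide one column by another, so a zero denominator would produce infinite or undefined values. Whether the data actually contains such rows is not checked here. Should it happen, a guarded division is one option; `safe_ratio` is a hypothetical helper and the numbers are made up.

```python
import math


def safe_ratio(numerator, denominator, fill=float('nan')):
    """Divide two numbers, returning `fill` when the denominator is zero."""
    if denominator == 0:
        return fill
    return numerator / denominator


# Made-up household and individual counts, for illustration only.
households = [120.0, 0.0, 45.0]
individuals = [480.0, 0.0, 90.0]
household_size = [safe_ratio(i, h) for i, h in zip(individuals, households)]
print(household_size)
```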

In [0]:
# Separating the train and test dataframes from the combined dataframe
#
train = comb[comb.separator == 0].drop('separator', axis = 1)
test = comb[comb.separator == 1].drop('separator', axis = 1)
train['target'] = target

In [0]:
# Training the data with the new features and making predictions
#
X = train.drop(['ward', 'ADM4_PCODE', 'target'], axis = 1)
y = target
tes = test.drop(['ward', 'ADM4_PCODE'], axis = 1)

predictions_feats = CatBoostRegressor(logging_level='Silent', random_state=101).fit(X, y).predict(tes)

### Averaging, Blending and Retraining

In [0]:
# Averaging the two stacked predictions from each library (vecstack, then sklearn) in a 9:1 ratio
#
predictions_vecstack = [x*0.9 + y*0.1 for x, y in zip(predictions_vecstack, predictions_vecstack_2)]
predictions_sreg = [x*0.9 + y*0.1 for x, y in zip(predictions_sreg, predictions_sreg_2)]


# Blending the two ensemble models and the catboost single model
#
stack = [x*0.3 + y*0.7 for x, y in zip(predictions_vecstack, predictions_sreg)]
stack_2 = [x*0.9 + y*0.1 for x, y in zip(stack, predictions_cat)]
stack_3 = [x*0.7 + y*0.3 for x, y in zip(stack_2, predictions_feats)]


# Retraining the models using the test data as training data and the predictions as the target
#
X = tes.copy()
y = stack_3

ridge = Ridge()
ridge.fit(X, y)
preds_ridge = ridge.predict(X)

cat = CatBoostRegressor(verbose = False)
cat.fit(X, y)
preds_cat = cat.predict(X)
# Blending the two trained models
#
blended_1 = [x*0.5 +y*0.5 for x, y in zip(preds_ridge, preds_cat)]



# Retraining the models using the above approach but with different weights
#
stack = [x*0.4 + y*0.6 for x, y in zip(predictions_vecstack, predictions_sreg)]
stack_2 = [x*0.8 + y*0.2 for x, y in zip(stack, predictions_cat)]
stack_3 = [x*0.65 + y*0.35 for x, y in zip(stack_2, predictions_feats)]

X = tes.copy()
y = stack_3

ridge = Ridge()
ridge.fit(X, y)
preds_ridge = ridge.predict(X)

cat = CatBoostRegressor(verbose = False)
cat.fit(X, y)
preds_cat = cat.predict(X)

blended_2 = [x*0.5 +y*0.5 for x, y in zip(preds_ridge, preds_cat)]

blended_3 = [x*0.9 + y*0.1 for x, y in zip(blended_1, blended_2)]


# Further generalising the model by training using the simple Linear regression model
# Complementing it with the catboost model
#
X = tes.copy()
y = blended_3

linear = LinearRegression()
linear.fit(X, y)
preds_linear = linear.predict(X)

cat = CatBoostRegressor(verbose = False)
cat.fit(X, y)
preds_cat = cat.predict(X)


# Blending the two model predictions
# Creating a predictions file to be used in the next step, as you will have to restart the kernel
#
final_blend_1 = [x*0.1 + y*0.1 + z*0.8 for x, y, z in zip(preds_linear, preds_cat, blended_3)]
sub_df = pd.DataFrame({'ward': test.ward, 'target': final_blend_1}) 
sub_df.to_csv('final_blend_1.csv', index = False)
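
Because the blends above are nested, the weight each model ends up with is the product of the weights along its path. The sketch below works this out for the first set of weights (0.3/0.7, then 0.9/0.1, then 0.7/0.3). The helper `nest` and the short model labels are hypothetical.

```python
def nest(weights_so_far, scale, new_name, new_weight):
    """Scale existing model weights by `scale` and add a new model with `new_weight`."""
    combined = {name: w * scale for name, w in weights_so_far.items()}
    combined[new_name] = combined.get(new_name, 0.0) + new_weight
    return combined


# stack   = 0.3 * vecstack + 0.7 * sreg
# stack_2 = 0.9 * stack    + 0.1 * cat
# stack_3 = 0.7 * stack_2  + 0.3 * feats
weights = {'vecstack': 0.3, 'sreg': 0.7}
weights = nest(weights, 0.9, 'cat', 0.1)
weights = nest(weights, 0.7, 'feats', 0.3)

for name, weight in weights.items():
    print(name, round(weight, 3))
print('total', round(sum(weights.values()), 3))
```

Here `vecstack` and `sreg` are themselves 9:1 averages of two stacked ensembles each, so the weight of an individual ensemble is smaller still.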

### More Ensembles for further regularisation
### Train using latest version of catboost
### **Restart kernel after installing the latest version of catboost**
### *Run the notebook from the cell below after upgrading catboost*

In [0]:
# Restart kernel after upgrading catboost and run notebook from this cell
!pip install catboost --upgrade

Collecting catboost
  Downloading https://files.pythonhosted.org/packages/94/ec/12b9a42b2ea7dfe5b602f235692ab2b61ee1334ff34334a15902272869e8/catboost-0.22-cp36-none-manylinux1_x86_64.whl (64.4MB)
Installing collected packages: catboost
  Found existing installation: catboost 0.20.2
    Uninstalling catboost-0.20.2:
      Successfully uninstalled catboost-0.20.2
Successfully installed catboost-0.22


In [0]:
# Restart the kernel, then run the notebook from the cell below

In [0]:
# Importing the necessary libraries
#
import pandas as pd
import numpy as np
import requests
from io import StringIO 
import reverse_geocoder as rg
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR, NuSVR
from sklearn.neighbors import KNeighborsRegressor
from xgboost import XGBRegressor, XGBRFRegressor
from sklearn.linear_model import LinearRegression, Ridge, Lasso, BayesianRidge
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import RandomForestRegressor, StackingRegressor,HistGradientBoostingRegressor, ExtraTreesRegressor
from sklearn.metrics import mean_squared_error
from lightgbm import LGBMRegressor
from vecstack import stacking
from vecstack import StackingTransformer
from catboost import CatBoostRegressor
import warnings
warnings.filterwarnings('ignore')

In [0]:
train = 'https://drive.google.com/file/d/13GpeDjiVR1aRHpkAZc7EeH_cKf52q7qE/view?usp=sharing'
test = 'https://drive.google.com/file/d/17JoUvCmpFXXFbgbZ9Ki3Xqh9qcl7UV8c/view?usp=sharing'
submission = 'https://drive.google.com/file/d/1GN1lSsLU43kQaZtThc4dP60mz8ztwDsL/view?usp=sharing'
dictionary = 'https://drive.google.com/file/d/1lAZnQFsBkPo8TNHYbq5mt2SpSMrG57WR/view?usp=sharing'


# Creating a function to read a csv file shared via google
#
def read_csv(url):
  url = 'https://drive.google.com/uc?export=download&id=' + url.split('/')[-2]
  csv_raw = requests.get(url).text
  csv = StringIO(csv_raw)
  df = pd.read_csv(csv)
  return df

# Creating submission, training, testing and variable definition dataframes
#
sub = read_csv(submission)
train = read_csv(train)
test = read_csv(test)
submission = read_csv(submission)
dictionary = read_csv(dictionary)

target = train.target

# Aligning the training and testing datasets
train, test = train.align(test, join = 'inner', axis = 1)

train['separator'] = 0
test['separator'] = 1

comb = pd.concat([train, test])

train = comb[comb.separator == 0].drop('separator', axis = 1)
test = comb[comb.separator == 1].drop('separator', axis = 1)
train['target'] = target

final_blend_1 = pd.read_csv('final_blend_1.csv')

In [0]:
# Training models using different random states and the latest catboost
#
X = train.drop(['ward', 'ADM4_PCODE', 'target'], axis = 1)
y = target

tes = test.drop(['ward', 'ADM4_PCODE'], axis = 1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 29)

predictions_cat_29 = CatBoostRegressor(logging_level='Silent').fit(X_train, y_train).predict(tes)

In [0]:
# Same as before with the only difference being the random state
# Using different random states helps ensure that all the data has been used in building the models
#
X = train.drop(['ward', 'ADM4_PCODE', 'target'], axis = 1)
y = target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 65)

tes = test.drop(['ward', 'ADM4_PCODE'], axis = 1)

estimators_1 = [
    ('xgb', XGBRegressor(objective ='reg:squarederror')),
    ('lr', LinearRegression()),
    ('rf', RandomForestRegressor()),
    ('lgb', LGBMRegressor()),
    ('svr', SVR()),
    ('lasso', Lasso()),
    ('kneiba', KNeighborsRegressor()),
    ('cat', CatBoostRegressor(logging_level='Silent'))
]

predictions_sreg_65 = StackingRegressor(estimators=estimators_1, final_estimator=CatBoostRegressor(logging_level='Silent')).fit(X_train, y_train).predict(tes)

In [0]:
X = train.drop(['ward', 'ADM4_PCODE', 'target'], axis = 1)
y = target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 27)

tes = test.drop(['ward', 'ADM4_PCODE'], axis = 1)

estimators_1 = [
    ('xgb', XGBRegressor(objective ='reg:squarederror')),
    ('lr', LinearRegression()),
    ('rf', RandomForestRegressor()),
    ('lgb', LGBMRegressor()),
    ('svr', SVR()),
    ('lasso', Lasso()),
    ('kneiba', KNeighborsRegressor()),
    ('cat', CatBoostRegressor(logging_level='Silent'))
]

predictions_sreg_27 = StackingRegressor(estimators=estimators_1, final_estimator=CatBoostRegressor(logging_level='Silent')).fit(X_train, y_train).predict(tes)

In [0]:
# Further averaging, blending and retraining to generalise well
#
stack = [x*0.5 + y*0.5 for x, y in zip(predictions_sreg_65, predictions_sreg_27)]

stack_2 = [x*0.5 + y*0.5 for x, y in zip(stack, predictions_cat_29)]


X = tes.copy()
y = stack_2

ridge = Ridge()
ridge.fit(X, y)
preds_ridge = ridge.predict(X)

cat = CatBoostRegressor(verbose = False)
cat.fit(X, y)
preds_cat = cat.predict(X)

final_blend_2 = [x*0.5 +y*0.5 for x, y in zip(preds_ridge, preds_cat)]

In [0]:
# Making the final prediction
#
final_blend_3 = [x*0.5 + y*0.5 for x, y in zip(final_blend_1.target, final_blend_2)]
sub_df = pd.DataFrame({'ward': test.ward, 'target': final_blend_3}) 
sub_df.to_csv('final_submission.csv', index = False)