# Costa Rica Poverty Prediction
Many social programs have a hard time making sure the right people are given enough aid. It's especially tricky when a program focuses on the poorest segment of the population. The world's poorest typically can't provide the necessary income and expense records to prove that they qualify.

In Latin America, one popular method uses an algorithm to verify income qualification. It's called the Proxy Means Test (or PMT). With PMT, agencies use a model that considers a family's observable household attributes, like the material of their walls and ceiling or the assets found in the home, to classify them and predict their level of need.

While this is an improvement, accuracy remains a problem as the region's population grows and poverty declines.

In this analysis I compare KNN, Extra Trees, Random Forest, and Decision Tree models for classifying households into specific poverty levels. This model will provide value by predicting which current households may need remodelling or gentrification.

In [None]:
#Data Manipulation
import pandas as pd
import numpy as np
import os

#Visualization
import matplotlib.pyplot as plt
import seaborn as sns


# Other Packages
import missingno as msno

# Set a few plotting defaults
%matplotlib inline
plt.style.use('fivethirtyeight')
plt.rcParams['font.size'] = 18
plt.rcParams['patch.edgecolor'] = 'k'

Load both files and draw a 30% sample of the training set

In [None]:
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')
train_samp = train.sample(frac=.3)

In [None]:
y = train_samp['Target']
y_full = train['Target']

In [None]:
y.value_counts(normalize=True)

In [None]:
print(f'Train shape: {train_samp.shape}')
print(f'Test shape: {test.shape}')

### Exploratory Data Analysis:

In [None]:
print(train.info())
train.columns[1::]

In [None]:
train_samp.select_dtypes('object')
len(train.columns)

### Check Missing Values

In [None]:
msno.matrix(train)

In [None]:
train_samp.isnull().sum()
#v2a1, v18q1, 
train_samp.columns[train_samp.isnull().any()]

In [None]:
train_samp.select_dtypes('int64')
# get_dtype_counts() was removed in pandas 1.0; dtypes.value_counts() gives the same summary
train_samp.dtypes.value_counts()

### Clean object values

Now we want to make sure the features can be used in modelling. Here we convert the three object columns to floats, mapping "yes" to 1 and "no" to 0.

In [None]:
mapping = {"yes": 1, "no": 0}

# Apply same operation to both train and test
for df in [train, test]:
    # Fill in the values with the correct mapping
    df['dependency'] = df['dependency'].replace(mapping).astype(np.float64)
    df['edjefa'] = df['edjefa'].replace(mapping).astype(np.float64)
    df['edjefe'] = df['edjefe'].replace(mapping).astype(np.float64)

train[['dependency', 'edjefa', 'edjefe']].describe()
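As a small illustration of the conversion above, here is a sketch on a made-up column that mixes numeric strings with "yes" and "no". The toy values are assumptions, and the sketch uses `Series.map` with a dictionary lookup instead of `replace`, which is a substitution on my part rather than what the cell above does.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column mimicking the layout of dependency, edjefa and edjefe
toy = pd.DataFrame({'dependency': ['yes', 'no', '.5', '2', '8']})

mapping = {'yes': 1, 'no': 0}

# Map "yes"/"no" by exact value, leave every other entry as it is,
# then cast the whole column to float
converted = toy['dependency'].map(lambda v: mapping.get(v, v)).astype(np.float64)

print(converted.tolist())
print(converted.dtype)
```

After the cast, every entry is a float, so the column can be passed to a model like any other numeric feature.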

Get a sample from the training set

Here we want to look at all of the unique values for each feature.

In [None]:
train_samp = train.sample(frac=.3)

In [None]:
train.select_dtypes(np.int64).nunique().value_counts().sort_index().plot.bar(color = 'blue', 
                                                                             figsize = (8, 6),
                                                                            edgecolor = 'k', linewidth = 2);
plt.xlabel('Number of Unique Values'); plt.ylabel('Count');
plt.title('Count of Unique Values in Integer Columns');

In [None]:
#Plot densities of float columns
from collections import OrderedDict

plt.figure(figsize = (20, 16))
plt.style.use('fivethirtyeight')

# Color mapping
colors = OrderedDict({1: 'red', 2: 'orange', 3: 'blue', 4: 'green'})
poverty_mapping = OrderedDict({1: 'extreme', 2: 'moderate', 3: 'vulnerable', 4: 'non vulnerable'})

# Iterate through the float columns
for i, col in enumerate(train.select_dtypes('float')):
    ax = plt.subplot(6, 2, i + 1)
    # Iterate through the poverty levels
    for poverty_level, color in colors.items():
        # Plot each poverty level as a separate line
        sns.kdeplot(train.loc[train['Target'] == poverty_level, col].dropna(), 
                    ax = ax, color = color, label = poverty_mapping[poverty_level])
        
    plt.title(f'{col.capitalize()} Distribution'); plt.xlabel(f'{col}'); plt.ylabel('Density')

plt.subplots_adjust(top = 2)

Above we can look at the distributions of each of the numeric variables. We then want to split our data and see how our first model performs.

In [None]:
miss_cols = train.columns[train.isnull().any()]
miss_cols

## Impute missing values

The columns 'v2a1', 'v18q1', 'rez_esc', 'meaneduc', and 'SQBmeaned' all contain missing values. We fill 'v18q1' with its mean and the other four columns with their mode.

In [None]:
train_samp.rez_esc.value_counts()

In [None]:
train_samp.isnull().sum()

In [None]:
#Fill values for v2a1
train_samp['v2a1'] = train_samp['v2a1'].fillna(train_samp['v2a1'].mode()[0])

#Fill values for v18q1
train_samp['v18q1'] = train_samp['v18q1'].fillna(train_samp['v18q1'].mean())

#Fill values for rez_esc
train_samp['rez_esc'] = train_samp['rez_esc'].fillna(train_samp['rez_esc'].mode()[0])

#Fill values for meaneduc
train_samp['meaneduc'] = train_samp['meaneduc'].fillna(train_samp['meaneduc'].mode()[0])

#Fill values for SQBmeaned
train_samp['SQBmeaned'] = train_samp['SQBmeaned'].fillna(train_samp['SQBmeaned'].mode()[0])

In [None]:
#Fill values for v2a1
train['v2a1'] = train['v2a1'].fillna(train['v2a1'].mode()[0])

#Fill values for v18q1
train['v18q1'] = train['v18q1'].fillna(train['v18q1'].mean())

#Fill values for rez_esc
train['rez_esc'] = train['rez_esc'].fillna(train['rez_esc'].mode()[0])

#Fill values for meaneduc
train['meaneduc'] = train['meaneduc'].fillna(train['meaneduc'].mode()[0])

#Fill values for SQBmeaned
train['SQBmeaned'] = train['SQBmeaned'].fillna(train['SQBmeaned'].mode()[0])

In [None]:
#Fill values for v2a1
test['v2a1'] = test['v2a1'].fillna(test['v2a1'].mode()[0])

#Fill values for v18q1
test['v18q1'] = test['v18q1'].fillna(test['v18q1'].mean())

#Fill values for rez_esc
test['rez_esc'] = test['rez_esc'].fillna(test['rez_esc'].mode()[0])

#Fill values for meaneduc
test['meaneduc'] = test['meaneduc'].fillna(test['meaneduc'].mode()[0])

#Fill values for SQBmeaned
test['SQBmeaned'] = test['SQBmeaned'].fillna(test['SQBmeaned'].mode()[0])
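The three imputation cells above repeat the same steps for each frame. One possible way to condense them is sketched below with a hypothetical helper, `fit_fill_values`, and made-up toy frames. Note one deliberate difference: the sketch computes the fill values once from the training frame and reuses them on the test frame, whereas the cells above use each frame's own statistics.

```python
import numpy as np
import pandas as pd

def fit_fill_values(df, mode_cols, mean_cols):
    """Compute one fill value per column from the given frame (hypothetical helper)."""
    fills = {col: df[col].mode()[0] for col in mode_cols}
    fills.update({col: df[col].mean() for col in mean_cols})
    return fills

# Toy frames standing in for train and test (values are made up)
toy_train = pd.DataFrame({'v2a1': [100.0, 100.0, np.nan, 300.0],
                          'v18q1': [1.0, np.nan, 3.0, 2.0]})
toy_test = pd.DataFrame({'v2a1': [np.nan, 250.0],
                         'v18q1': [np.nan, 1.0]})

# Learn the fill values from the training frame only
fills = fit_fill_values(toy_train, mode_cols=['v2a1'], mean_cols=['v18q1'])

# Passing a dict to fillna fills each column with its own value
toy_train = toy_train.fillna(fills)
toy_test = toy_test.fillna(fills)

print(fills)
print(toy_test)
```

Reusing the training statistics keeps the test set from influencing preprocessing, which matters most when the two sets are distributed differently.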

In [None]:
train_samp.columns[train_samp.isnull().any()]

## Modelling with RandomForest 

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import classification_report, roc_auc_score, f1_score, make_scorer, precision_recall_fscore_support

# Custom scorer for cross validation
scorer = make_scorer(f1_score, greater_is_better=True, average = 'macro')
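To show what the macro-averaged F1 scorer measures, here is a minimal sketch on made-up labels in which the largest class dominates. The label arrays are assumptions chosen for illustration. Macro F1 is the unweighted mean of the per-class F1 scores, so a class the model never predicts pulls the score down even when accuracy looks acceptable.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels: class 4 dominates and class 1 is never predicted
y_true = np.array([4, 4, 4, 4, 4, 4, 2, 2, 1, 1])
y_pred = np.array([4, 4, 4, 4, 4, 4, 2, 4, 4, 4])
labels = [1, 2, 4]

# One F1 score per class; zero_division=0 scores the never-predicted class as 0
per_class_f1 = f1_score(y_true, y_pred, labels=labels, average=None, zero_division=0)

# Macro F1 is the plain average of the per-class scores
macro_f1 = f1_score(y_true, y_pred, labels=labels, average='macro', zero_division=0)
accuracy = accuracy_score(y_true, y_pred)

print(dict(zip(labels, per_class_f1.round(3))))
print(f'macro F1: {macro_f1:.3f}, accuracy: {accuracy:.3f}')
```

Because the poverty classes are imbalanced, this is a reason to prefer macro F1 over accuracy when comparing models here.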

In [None]:
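#Drop Columns from dataset
X = train_samp.drop(['Id', 'Target', 'idhogar'], axis=1).copy()
# train_samp was re-sampled after y was first defined, so re-derive the labels to keep rows aligned
y = train_samp['Target']
X_full = train.drop(['Id', 'Target', 'idhogar'], axis=1).copy()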


Here we split the data. We first split the sample, which lets us iterate quickly, and then split the full data set.

In [None]:
#Using the sample data set (train_samp) we drop the Id, Target, and idhogar to split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
x_tr, x_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=.2, random_state=42)

#Split full data set
X_trfull, X_tefull, y_trfull, y_tefull = train_test_split(X_full, y_full, test_size=0.20, random_state=42)

In [None]:
#pd.concat(y_tefull['idhogar'])
#pd.merge(type_df, y_tefull, left_index=True)
#y_tefull.head()



In [None]:
n_classes = y_full.nunique()
n_classes

In [None]:
#Full training and test set split
print(f'X_train: {X_trfull.shape}')
print(f'X_test: {X_tefull.shape}')
print(f'y_train: {y_trfull.shape}')
print(f'y_test: {y_tefull.shape}')

#Training and Test set split
print(f'X_train: {X_train.shape}')
print(f'X_test: {X_test.shape}')
print(f'y_train: {y_train.shape}')
print(f'y_test: {y_test.shape}')

#Sample of our training set
print(f'Train Sample: {train_samp.shape}')

#Split the data a second time
print(f'x_tr: {x_tr.shape}')
print(f'y_tr: {y_tr.shape}')
print(f'x_val: {x_val.shape}')
print(f'y_val: {y_val.shape}')

### Run RandomForest:

Parameters: n_estimators=1000, n_jobs=-1, class_weight='balanced', max_depth=3

In [None]:
param_dictionary = {"n_estimators": [1000]}
clf = RandomForestClassifier(n_jobs=-1, class_weight='balanced', max_depth=3)
# Press Shift-Tab to look at what the arguments are for a function, as well as the defaults for each argument
gs = GridSearchCV(clf, param_dictionary, n_jobs=1, verbose=2, cv=2)
gs.fit(X_trfull, y_trfull)
# max depth 5, n estimators 500

We then get predictions on our training set and analyze our score based on precision, recall, f1-score, and support

In [None]:
# These predictions are made on the data the model was fit on, so this is a training score
val_predictions = gs.predict(X_trfull)
cr = classification_report(y_trfull, val_predictions)
#roc_auc = roc_auc_score(y_val, val_predictions)
print('Training Scores:')
print(cr)
print('-'*50)
#print("ROC AUC Score: {}".format(roc_auc))

The most important features were meaneduc, the average education of the adults in the household, and SQBmeaned, which is the square of that average.

In [None]:
feat_imports = sorted(list(zip(X_trfull.columns, gs.best_estimator_.feature_importances_)), key=lambda x:x[1], reverse=True)
feat_imports[0:10]

In [None]:
clf = RandomForestClassifier(n_jobs=-1, max_depth=5, n_estimators=1000, class_weight='balanced', verbose=1)
clf.fit(X_train, y_train)

### Model Selection:

Did you try multiple models? 
We then looked at the following models:
1. Decision Tree
Decision trees work well with categorical variables. At each node the tree looks at a set of features and chooses the split that most reduces Gini impurity (the scikit-learn default criterion).
2. Extra Trees Classifier
3. Random Forest
Random Forest and the Extra Trees Classifier work very similarly. However, Extra Trees is usually faster to train because it draws a random threshold for each candidate feature when creating a split, whereas Random Forest searches for the optimal threshold.
4. K-Nearest Neighbors
KNN classifies a household by a vote among its nearest neighbors (we looked at 5, 10, and 20 neighbors) and can weight those votes. We did not specify a weights function, so scikit-learn's default of uniform weights applies.
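To make the Gini-based split concrete, here is a worked sketch that scores two candidate thresholds on made-up data. The helper names `gini_impurity` and `split_impurity` and the toy arrays are assumptions for illustration; scikit-learn's internal implementation is more involved.

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_impurity(feature, labels, threshold):
    """Weighted Gini impurity of the two children produced by feature <= threshold."""
    left = labels[feature <= threshold]
    right = labels[feature > threshold]
    n = len(labels)
    return len(left) / n * gini_impurity(left) + len(right) / n * gini_impurity(right)

# Made-up data: years of education and a poverty label
educ = np.array([2, 3, 4, 9, 10, 12])
target = np.array([1, 1, 1, 4, 4, 4])

parent = gini_impurity(target)            # impurity before splitting
clean = split_impurity(educ, target, 4)   # threshold that separates the classes
mixed = split_impurity(educ, target, 9)   # threshold that leaves one child mixed

print(parent, clean, mixed)
```

A decision tree or Random Forest would pick the threshold with the lowest weighted impurity among those it evaluates, while Extra Trees scores only randomly drawn thresholds and keeps the best of those.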

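I evaluated each model by its 10-fold cross-validation score (macro F1). Each fold is held out in turn as a validation set while the model is trained on the other nine, which makes overfitting easier to detect than scoring on the training data.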
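The cross-validation idea can be sketched on synthetic data as follows. The data from `make_classification`, the five folds, and the model settings are assumptions chosen to keep the example fast; they are not the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic 4-class data standing in for the household features
X_demo, y_demo = make_classification(n_samples=400, n_features=10, n_informative=6,
                                     n_classes=4, random_state=0)

macro_f1 = make_scorer(f1_score, average='macro')
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0)

# Each fold is scored by a model that never saw that fold during training
cv_scores = cross_val_score(model, X_demo, y_demo, cv=folds, scoring=macro_f1)

# In-sample score for comparison: the model is scored on its own training data
train_score = f1_score(y_demo, model.fit(X_demo, y_demo).predict(X_demo), average='macro')

print(f'CV macro F1: {cv_scores.mean():.3f} (std {cv_scores.std():.3f})')
print(f'In-sample macro F1: {train_score:.3f}')
```

The gap between the in-sample score and the cross-validated mean is a rough indicator of how much the model overfits.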

Why did you choose these models? How do they work? What are their assumptions? And how did you test/account for them? How did you select hyper-parameters?

In [None]:
# Dataframe to hold results
model_results = pd.DataFrame(columns = ['model', 'cv_mean', 'cv_std'])

def cv_model(train, train_labels, model, name, model_results=None):
    """Perform 10 fold cross validation of a model"""
    
    cv_scores = cross_val_score(model, train, train_labels, cv = 10, scoring=scorer, n_jobs = -1)
    print(f'10 Fold CV Score: {round(cv_scores.mean(), 5)} with std: {round(cv_scores.std(), 5)}')
    
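    if model_results is not None:
        # DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
        new_row = pd.DataFrame({'model': name,
                                'cv_mean': cv_scores.mean(),
                                'cv_std': cv_scores.std()},
                               index = [0])
        model_results = pd.concat([model_results, new_row], ignore_index = True)

        return model_results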

In [None]:
from sklearn.tree import DecisionTreeClassifier

# Cross-validate on the training split so the held-out split stays untouched for the final check
model_results = cv_model(X_trfull, y_trfull, 
                         DecisionTreeClassifier(),
                         'DT', model_results)

In [None]:
from sklearn.ensemble import ExtraTreesClassifier

model_results = cv_model(X_trfull, y_trfull, 
                         ExtraTreesClassifier(n_estimators = 100, random_state = 10),
                         'EXT', model_results)

In [None]:
model_results = cv_model(X_trfull, y_trfull, 
                         RandomForestClassifier(n_estimators = 100, random_state = 10),
                         'RF', model_results)

In [None]:
for n in [5, 10, 20]:
    print(f'\nKNN with {n} neighbors\n')
    model_results = cv_model(X_trfull, y_trfull, 
                             KNeighborsClassifier(n_neighbors = n),
                             f'knn-{n}', model_results)

In [None]:
model_results.set_index('model', inplace = True)
model_results['cv_mean'].plot.bar(color = 'orange', figsize = (8, 6),
                                  yerr = list(model_results['cv_std']),
                                  edgecolor = 'k', linewidth = 2)
plt.title('Model F1 Score Results');
plt.ylabel('Mean F1 Score (with error bar)');
model_results.reset_index(inplace = True)

### Model evaluation: 
Did you evaluate your model on multiple metrics? Where does your model do well? Where could it be improved? How are the metrics different?

Originally we looked at the F1 score, precision, and recall for the Random Forest. For the model comparison we then looked at just the macro F1 score. I think it could be improved.

In [None]:
def pred_and_score(model, train, train_labels, test, test_ids):
    """Train and test a model on the dataset"""
    
    # Train on the data
    model.fit(train, train_labels)
    
    predictions = model.predict(test)
    predictions = pd.DataFrame({'idhogar': test_ids,
                               'Target': predictions})
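    # The Kaggle test set has no labels (test_ids are household ids, not targets),
    # so compute the mean accuracy on the training data instead
    scores = model.score(train, train_labels)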
    
    #Get most important features
    imp_feats = sorted(list(zip(test.columns, model.feature_importances_)), key=lambda x:x[1], reverse=True)
    imp_feats = imp_feats[0:10]

    return predictions, test_ids, scores, imp_feats

In [None]:
test1 = test.drop(['Id', 'idhogar'], axis=1)
test1.shape

In [None]:
predictions, true_values, scores, imp_feats = pred_and_score(ExtraTreesClassifier(n_estimators = 100, random_state = 10), 
                         X_trfull, y_trfull, test1, test.idhogar)

In [None]:
true_values.head()

In [None]:

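# The Kaggle test set is unlabeled and differs in length from y_trfull,
# so compute the metrics on the held-out split of the training data instead
holdout_preds = ExtraTreesClassifier(n_estimators = 100, random_state = 10).fit(X_trfull, y_trfull).predict(X_tefull)
cr = precision_recall_fscore_support(y_tefull, holdout_preds, average='macro')
#roc_auc = roc_auc_score(y_val, val_predictions)
print('Hold-out Scores:')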
print('-'*50)

print(f'precision: {cr[0]}')
print(f'recall: {cr[1]}')
print(f'f1-score: {cr[2]}')

In [None]:
print(f'Accuracy Score: {scores}')
print(f'Important Features: {imp_feats}')
#print(f'Evaluation Metrics: {true_values})

### Model interpretation: 
What do the model results tell you? Which variables are important? High bias or variance and how did you / could you fix this? How confident are you in your results?

After running the Extra Trees classifier on the full test data, we predicted with a mean accuracy of 94%. The most important variables are "meaneduc" and "SQBmeaned". To find high bias or high variance, we can check whether the model is underfitting or overfitting by comparing its training score with its validation score. To fix high bias (underfitting) we need to add more features or use a more flexible model. To fix high variance (overfitting) we need to add more data, create synthetic data, or constrain the model.
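One way to check for high bias or high variance is to compare training and validation accuracy as model complexity changes. The sketch below does this on synthetic data with a single decision tree at three depths; the data, the depths, and the 75/25 split are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 4-class data standing in for the household features
X_demo, y_demo = make_classification(n_samples=600, n_features=10, n_informative=6,
                                     n_classes=4, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.25,
                                          random_state=0, stratify=y_demo)

results = {}
for depth in [1, 5, None]:  # None lets the tree grow until its leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    results[depth] = (tree.score(X_tr, y_tr), tree.score(X_va, y_va))
    print(f'max_depth={depth}: train={results[depth][0]:.3f}, val={results[depth][1]:.3f}')
```

Low scores on both sets point to high bias, while a training score far above the validation score points to high variance.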

In [None]:
clf = ExtraTreesClassifier(n_estimators = 100, random_state = 10)
clf.fit(X_trfull, y_trfull)
clf.score(X_tefull, y_tefull)

## Make a submission

In [None]:
predictions.head()

In [None]:
#I want to match the test-set Ids to the predicted values (predictions shares test's index)
submission = pd.merge(test['Id'].to_frame(), predictions, left_index=True, right_index=True)

In [None]:
submission = submission.drop('idhogar', axis=1)
submission.head()

In [None]:
submission.columns = ['Id', 'Target']
submission.head()

In [None]:
# Fill in households missing a head
submission['Target'] = submission['Target'].fillna(4).astype(np.int8)

In [None]:
submission.to_csv('Costa_Rica_Predictions.csv', index=False)

### Model usefulness:
Do you think your final model was useful? If so, how would you recommend using it? Convince us, that if we were a company, we would feel comfortable using your model with our users. Think about edge cases as well - are there certain areas that the model performs poorly on? Best on? How would you handle these cases, if say Zillow wanted to leverage your model realizing that bad recommendations on sale prices would hurt customer trust and your brand. This section also falls into the storytelling aspect of the grading.

I think the model is very useful. It could be used to look at general economic status for government budgeting. This could be used to determine which households need specific funding.

Another application is to predict & forecast which households are entering poverty in a 5 year period. This model may not be generalizable for countries outside of Costa Rica and may only work for countries with similar observable characteristics. 