# Baseline models

We train models that handle missing values (NaNs) natively on the raw data to obtain a baseline accuracy for our predictions.

The task will then be to improve on that baseline accuracy by using:
1. Variable selection
2. Feature engineering
3. NaN imputation
4. Cross-validating a wider set of models
5. Stacking models

In [6]:
import pandas as pd
from models import random_forest, hist_grad_boost, stacking

ImportError: cannot import name 'hist_grad_boost' from 'models' (/Users/juanbello/Desktop/projects/Classification/models.py)

In [3]:
# Minimal preprocessing

def minimal_preprocessing():
    ed = pd.read_csv('./data/train_data/ed_train_raw.csv') # education DataFrame
    hh = pd.read_csv('./data/train_data/hh_train_raw.csv') # household DataFrame
    poverty = pd.read_csv('./data/train_data/poverty_train_raw.csv') # poverty/labels


    def preprocess_df(df, prefix: str):

        # Merge the three id columns into a single psu_hh_idcode identifier
        uids = df['psu'].astype(str) + "_" + df['hh'].astype(str) + "_" + df['idcode'].astype(str)

        # Delete the three id columns
        df = df.drop(columns=['psu', 'hh', 'idcode'])

        # Capitalize the first letter of each column name (e.g. q01 -> Q01)
        # and add the ED or HH prefix to identify the variable group
        df.columns = [prefix + "_" + col.capitalize() for col in df.columns]

        # Insert uid as first column, lowercase, no prefix.
        df.insert(0, 'uid', uids)

        return df

    ed = preprocess_df(ed, 'ED')
    hh = preprocess_df(hh, 'HH')

    # Convert subjective poverty score from one-hot to categorical [1-10] outcome
    for i in range(1,11):
        col = 'subjective_poverty_'+ str(i)
        poverty.loc[poverty[col]==1, 'poverty_score'] = i

    poverty['uid'] = poverty['psu_hh_idcode']
    y = poverty[['uid', 'poverty_score']]

    # Filter labeled data
    ed = ed[ed['uid'].isin(poverty['uid'])]
    hh = hh[hh['uid'].isin(poverty['uid'])]

    # Ensure rows match between ed, hh, and y
    X_raw = pd.merge(ed, hh, on='uid', how='inner')
    Xy = pd.merge(X_raw, y, on='uid', how='left')
    assert Xy['poverty_score'].isna().sum() == 0  # every row must have a label
    y = Xy['poverty_score']
    X_raw = Xy.drop(columns=['poverty_score'])

    return X_raw, y


In [4]:
X, y = minimal_preprocessing()
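
The one-hot to score conversion inside `minimal_preprocessing` can also be written without a loop. The following is a minimal sketch on a toy frame, not the survey data; only the `subjective_poverty_<i>` column naming is taken from the code above, and the toy values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the poverty labels: ten indicator columns, one 1 per row.
score_cols = [f"subjective_poverty_{i}" for i in range(1, 11)]
toy = pd.DataFrame(0, index=range(3), columns=score_cols)
toy.loc[0, "subjective_poverty_3"] = 1
toy.loc[1, "subjective_poverty_10"] = 1
toy.loc[2, "subjective_poverty_1"] = 1

# idxmax returns the name of the column holding the 1 in each row;
# the number after the last underscore is the score.
toy["poverty_score"] = (
    toy[score_cols].idxmax(axis=1).str.rsplit("_", n=1).str[-1].astype(int)
)
print(toy["poverty_score"].tolist())  # [3, 10, 1]
```

One difference to keep in mind: for a row with no indicator set, `idxmax` returns the first column, whereas the loop version leaves the score as NaN, which the later `assert` would catch.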

# Baseline models

Random forests (in recent scikit-learn releases) and histogram-based gradient boosting accept NaNs directly, so we train them on this minimally preprocessed data to set a baseline for prediction.

## Random Forest

In [5]:
rf = random_forest(X, y)

NameError: name 'random_forest' is not defined

**Results:** 

```
Best parameters: {
    'max_depth': None, 
    'min_samples_split': 10, 
    'n_estimators': 50}
Best cross-validation accuracy: 0.21
```

So the accuracy to beat is 0.21.
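
The `random_forest` helper lives in `models.py` and is not shown here. Judging only from the printed output format, it may wrap a grid search; the sketch below is a hypothetical stand-in under that assumption. The function name `random_forest_sketch`, the grid values, and the demo data are all illustrative and not taken from the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV


def random_forest_sketch(X, y, cv=5):
    """Hypothetical stand-in for models.random_forest: grid-search a random forest."""
    param_grid = {  # illustrative grid, not the project's actual grid
        "n_estimators": [20, 50],
        "max_depth": [None, 10],
        "min_samples_split": [2, 10],
    }
    search = GridSearchCV(
        RandomForestClassifier(random_state=42),
        param_grid,
        cv=cv,
        scoring="accuracy",
    )
    search.fit(X, y)
    print("Best parameters:", search.best_params_)
    print(f"Best cross-validation accuracy: {search.best_score_:.2f}")
    return search.best_estimator_


# Synthetic demo data so the sketch runs on its own.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=0
)
rf_demo = random_forest_sketch(X_demo, y_demo, cv=3)
```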

## HistGradientBoosting


In [None]:
hgb = hist_grad_boost(X, y)

Best parameters: {'l2_regularization': 2.0, 'learning_rate': 0.01, 'max_depth': 3, 'max_iter': 100, 'min_samples_leaf': 20}
Best cross-validation accuracy: 0.22


**Results:**

```
Best parameters: {
    'l2_regularization': 2.0, 
    'learning_rate': 0.01, 
    'max_depth': 3, 
    'max_iter': 100, 
    'min_samples_leaf': 20}
Best cross-validation accuracy: 0.22
```

## Stacking

In [69]:
from sklearn.ensemble import (
    HistGradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Base estimators, using the best hyperparameters found above
hgb = HistGradientBoostingClassifier(
    random_state=42, l2_regularization=2.0, learning_rate=0.01,
    max_depth=3, max_iter=100, min_samples_leaf=20,
)
rf = RandomForestClassifier(random_state=42, n_estimators=50, max_depth=None, min_samples_split=10)

estimators = [
    ('rf', rf),
    ('hgb', hgb)
]

# Create the stacked model, using logistic regression as the final estimator
stacked_model = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(),
    cv=5  # Number of folds for cross-validation during stacking
)

# NOTE: X_raw is not defined in the cells shown here; it is assumed to be the
# feature matrix returned by minimal_preprocessing() (named X above).

# Fit the stacked model
stacked_model.fit(X_raw, y)

# Evaluate using cross-validation
cv_scores = cross_val_score(stacked_model, X_raw, y, cv=5, scoring='accuracy')
print(f"Stacked model cross-validation accuracy: {cv_scores.mean():.2f} (+/- {cv_scores.std() * 2:.2f})")

Stacked model cross-validation accuracy: 0.22 (+/- 0.03)
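
To put accuracies around 0.21 to 0.22 in context, it can help to compare them with a trivial majority-class predictor evaluated the same way. The sketch below uses scikit-learn's `DummyClassifier` on synthetic 10-class labels; the class proportions are invented for the example and say nothing about the actual distribution of `poverty_score`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic 10-class labels with an uneven distribution (not the survey data).
proportions = [0.02, 0.05, 0.10, 0.15, 0.25, 0.18, 0.10, 0.08, 0.05, 0.02]
y_toy = rng.choice(np.arange(1, 11), size=500, p=proportions)
X_toy = np.zeros((500, 1))  # features are ignored by DummyClassifier

# Always predict the most frequent class seen in the training fold.
dummy = DummyClassifier(strategy="most_frequent")
scores = cross_val_score(dummy, X_toy, y_toy, cv=5, scoring="accuracy")
print(f"Majority-class accuracy: {scores.mean():.2f}")
```

Running the same comparison with the real `X` and `y` would show how much of the baseline accuracy comes from the features rather than from class imbalance.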


# Renaming variables

In [None]:
# renames = {
#     'ED_Q01': 'read',
#     'ED_Q02': 'write',
#     'ED_Q03': 'attended_school',
#     'ED_Q04': 'school_grade',
#     'ED_Q05': 'school_grade_level',
#     'ED_Q06': 'highest_diploma',
#     'ED_Q07': 'years_preschool',
#     'ED_Q08': 'now_enrolled',
#     'ED_Q09': 'now_attend',
#     'ED_Q10': 'now_not_attend_reason',
#     'ED_Q11': 'past_not_enrolled_reason',
#     'ED_Q12': 'current_grade',
#     'ED_Q13': 'current_grade_level',
#     'ED_Q14': 'past_enrolled',
#     'ED_Q15': 'past_attend',
#     'ED_Q16': 'past_not_attend_reason',
#     'ED_Q17': 'past_not_enrolled_reason',
# }




# Preliminary Variable selection
The purpose of this notebook is to identify the variables that are not useful for the prediction of wealth.

The lists below record which household (HH) columns to drop and which education (ED) columns to keep.

In [1]:
hh_useless_columns=['HH_Hhid', 'HH_Q04', 'HH_Q08', 'HH_Q12', 'HH_Q18']
ed_useful_columns = [
    'uid',
    'ED_Q01', 'ED_Q02', 'ED_Q03', 'ED_Q04', 'ED_Q05', 'ED_Q06', 'ED_Q07', 'ED_Q08', 'ED_Q09', 'ED_Q10', 'ED_Q11', 
    'ED_Q14', 'ED_Q15', 'ED_Q16', 'ED_Q17', 'ED_Q18',
    'ED_Q19',
    'ED_Q23',
    'ED_Q26', 'ED_Q27', 'ED_Q28', 'ED_Q29',
    'ED_Q41', 
]
%store hh_useless_columns
%store ed_useful_columns


Stored 'hh_useless_columns' (list)
Stored 'ed_useful_columns' (list)
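
How these lists are applied is not shown in this notebook. One possible way, sketched on a toy frame, is to drop every `ED_` column that is not in the keep-list and then drop the listed `HH_` columns. The toy frame and the shortened `ed_keep` list are assumptions for illustration; `hh_useless_columns` is copied from the cell above.

```python
import pandas as pd

hh_useless_columns = ['HH_Hhid', 'HH_Q04', 'HH_Q08', 'HH_Q12', 'HH_Q18']
ed_keep = ['uid', 'ED_Q01']  # shortened stand-in for ed_useful_columns

# Toy frame standing in for the merged feature matrix.
X_toy = pd.DataFrame({
    'uid': ['1_1_1', '1_2_1'],
    'ED_Q01': [1, 2],
    'ED_Q12': [3, 4],      # ED column that is not in the keep-list
    'HH_Hhid': [10, 11],
    'HH_Q04': [0, 1],
    'HH_Q05': [5, 6],
})

# ED columns outside the keep-list are dropped.
ed_drop = [c for c in X_toy.columns if c.startswith('ED_') and c not in ed_keep]

# errors='ignore' skips listed HH columns that are absent from the frame.
X_sel = X_toy.drop(columns=ed_drop).drop(columns=hh_useless_columns, errors='ignore')
print(X_sel.columns.tolist())  # ['uid', 'ED_Q01', 'HH_Q05']
```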


