# Basic workflow, to get started

In this notebook, I write functions that ease the process of building features, ensembling models, and saving models along with their metadata (including predictions) in a database. The functions should also compose well with each other, so that I can build a decent pipeline with several features and minimal effort.

In [1]:
import pandas as pd
from sklearn.impute import SimpleImputer  # replaces preprocessing.Imputer, removed in scikit-learn 0.22
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegressionCV

## Household features

Initially, I will focus on household-level data for prediction and check the results on the leaderboard. Then I will move on to individual-level features and work on combining them with the household-level features. One step at a time.

In [2]:
train = pd.read_csv('../data/raw/B_hhold_train.csv')

In [3]:
train['poor'].value_counts()

False    3004
True      251
Name: poor, dtype: int64

In [4]:
def get_metafeatures(df):
    """
    Get metadata of the columns in the training set.
    
    Returns a metadata DataFrame.
    """
    metafeatures = []
    rows = df.shape[0]
    for col in df.columns:
        d = {'column': col,
             'n_unique': df[col].nunique(),
             'missing': df[col].isnull().sum()*1.0/rows,
             'type': df[col].dtype}
        metafeatures.append(d)
    return pd.DataFrame(metafeatures)

In [5]:
meta = get_metafeatures(train)
meta.head()

|   | column | missing | n_unique | type |
|---|---|---|---|---|
| 0 | id | 0.0 | 3255 | int64 |
| 1 | RzaXNcgd | 0.0 | 5 | object |
| 2 | LfWEhutI | 0.0 | 2 | object |
| 3 | jXOqJdNL | 0.0 | 2 | object |
| 4 | wJthinfa | 0.0 | 19 | int64 |


### Feature engineering

**Ideas**
- Feature selection strategies: recursive feature elimination or ridge/lasso-style regularization
- Dummy variables for categorical columns, typically `object` columns with fewer than 50-60 unique values
- Scaling and normalization for integer and float variables
- Imputation of missing values, needed only for country B
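Two of the ideas above, dummy variables and scaling, can be sketched with pandas alone. This is a minimal illustration rather than the pipeline used later in the notebook: the helper name `dummies_and_scale`, the `max_unique` threshold, and the toy columns are all assumptions made for the example.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype

def dummies_and_scale(df, max_unique=50):
    """One-hot encode low-cardinality non-numeric columns and standardize numeric ones.

    Hypothetical helper for illustration; returns a new DataFrame.
    """
    out = df.copy()
    numeric_cols = [c for c in out.columns
                    if is_numeric_dtype(out[c]) and not is_bool_dtype(out[c])]
    categorical_cols = [c for c in out.columns
                        if not is_numeric_dtype(out[c]) and out[c].nunique() < max_unique]
    # Standardize to zero mean and unit variance; constant columns become 0
    for col in numeric_cols:
        std = out[col].std(ddof=0)
        out[col] = (out[col] - out[col].mean()) / std if std > 0 else 0.0
    # Replace each selected categorical column with one indicator column per value
    return pd.get_dummies(out, columns=categorical_cols)

# Toy data standing in for the household table (column names are made up)
toy = pd.DataFrame({'region': ['n', 's', 'n', 'e'],
                    'rooms': [1, 2, 3, 4]})
features = dummies_and_scale(toy)
```

Unlike integer label codes, indicator columns do not impose an artificial ordering on the categories, which matters for a linear model such as logistic regression.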

In [None]:
# split columns with missing values: impute those under 50% missing, drop the rest
cols_to_impute = meta.loc[(meta['missing'] > 0.0) & (meta['missing'] < 0.5), 'column'].tolist()
cols_to_drop = meta.loc[meta['missing'] >= 0.5, 'column'].tolist()

In [None]:
# checking number of unique values for object dtypes
meta[meta['type'] == 'object'].sort_values('n_unique', ascending=False).head(10)

In [None]:
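from sklearn.impute import SimpleImputer  # replaces Imputer, removed in scikit-learn 0.22

def impute_fe(df, columns):
    """Impute missing values in the given columns with the column mean."""
    if not columns:
        return df
    # SimpleImputer works column-wise and treats np.nan as missing by default
    imputer = SimpleImputer(strategy='mean')
    df[columns] = imputer.fit_transform(df[columns])
    return df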
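One thing to keep in mind with mean imputation is that a test set should normally be filled with the means learned on the training set, not its own. The sketch below shows that fit/transform split on toy data. It assumes scikit-learn 0.20 or later (where `SimpleImputer` is available), and the `income` column is a made-up example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames standing in for the train and test tables
toy_train = pd.DataFrame({'income': [1.0, np.nan, 3.0]})
toy_test = pd.DataFrame({'income': [np.nan, 10.0]})

# Learn the column mean on the training data only (mean of 1.0 and 3.0)
imputer = SimpleImputer(strategy='mean').fit(toy_train[['income']])

# Reuse the fitted imputer so the test set is filled with the training mean
toy_test_filled = toy_test.copy()
toy_test_filled[['income']] = imputer.transform(toy_test[['income']])
```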

In [None]:
def categorical_fe(df, columns):
    """Convert categorical variables to integer codes, for use in sklearn estimators.

    The fitted encoders are discarded, so the codes can't be mapped back to the
    original values, although that can be implemented.
    """
    for col in columns:
        # fit a fresh encoder per column; plain assignment lets the dtype change to int
        df[col] = LabelEncoder().fit_transform(df[col])
    return df
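The docstring notes that mapping codes back to the original values could be implemented. One possible way, sketched here under the assumption that keeping one fitted `LabelEncoder` per column is acceptable, also makes the train and test codes agree. The helper names `fit_encoders` and `apply_encoders` and the `roof` column are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def fit_encoders(df, columns):
    """Fit one LabelEncoder per column and return them keyed by column name."""
    return {col: LabelEncoder().fit(df[col]) for col in columns}

def apply_encoders(df, encoders):
    """Return a copy of df with each encoded column replaced by its integer codes."""
    out = df.copy()
    for col, le in encoders.items():
        out[col] = le.transform(out[col])
    return out

# Toy data; the column name is made up
toy_train = pd.DataFrame({'roof': ['tin', 'tile', 'thatch', 'tin']})
toy_test = pd.DataFrame({'roof': ['tile', 'tin']})

encoders = fit_encoders(toy_train, ['roof'])
train_enc = apply_encoders(toy_train, encoders)
test_enc = apply_encoders(toy_test, encoders)  # same label-to-code mapping as train

# The stored encoder maps codes back to the original labels
recovered = encoders['roof'].inverse_transform(train_enc['roof'])
```

Note that `LabelEncoder.transform` raises a `ValueError` for labels it did not see during fitting, so categories that appear only in the test set would need separate handling.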

In [None]:
categorical_columns = meta.loc[meta['type'] == 'object', 'column'].tolist()
train = impute_fe(train, cols_to_impute)
train = categorical_fe(train, categorical_columns)

### Model training and cross-validation

In [None]:
# .values replaces DataFrame.as_matrix(), which was removed in pandas 1.0
X = train.drop(['id', 'poor', 'country'] + cols_to_drop, axis=1).values
y = train['poor'].values

In [None]:
clf = LogisticRegressionCV(n_jobs=-1, scoring='neg_log_loss')
clf.fit(X, y)
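Before submitting to the leaderboard, it can help to have a local estimate of the log loss. The sketch below is one way to get it with stratified cross-validation. It runs on synthetic data with roughly the same share of positives as the `poor` column (about 8%), so the numbers it produces say nothing about the real data, and the plain `LogisticRegression` is a stand-in chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for (X, y): about 92% negatives, 8% positives
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     weights=[0.92], random_state=0)

# Stratified folds keep the class ratio similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X_demo, y_demo,
                         scoring='neg_log_loss', cv=cv)

# The scorer returns negated log loss, so flip the sign to report it
mean_log_loss = -scores.mean()
```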

### Output predictions

In [None]:
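# the model above was fit on country B, so score country B's test households
test = pd.read_csv('../data/raw/B_hhold_test.csv')\
         .pipe(impute_fe, cols_to_impute)\
         .pipe(categorical_fe, categorical_columns)

In [None]:
# drop the same columns as in training so the feature matrices line up
X_test = test.drop(['id', 'country'] + cols_to_drop, axis=1).values

In [None]:
preds = clf.predict_proba(X_test)

In [None]:
def make_sub(preds, test_feat, country):
    """Build one country's submission frame from its predicted probabilities."""
    # column 1 of predict_proba holds the probability of poor == True
    country_sub = pd.DataFrame(data=preds[:, 1],
                               columns=['poor'],
                               index=test_feat.index)
    # add country code for joining later
    country_sub['country'] = country
    return country_sub[['country', 'poor']]

In [None]:
b_sub = make_sub(preds, test, 'B')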
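The country code is added "for joining later". A minimal sketch of that later step, assuming each per-country frame has the shape `make_sub` returns and is indexed by household id, could look like this. The helper `combine_subs` and the toy ids and probabilities are made up for the example.

```python
import pandas as pd

def combine_subs(subs):
    """Stack per-country submission frames into one table (hypothetical helper)."""
    combined = pd.concat(subs)
    combined.index.name = 'id'
    return combined

# Toy per-country frames in the shape make_sub returns
sub_a = pd.DataFrame({'country': 'A', 'poor': [0.1, 0.7]}, index=[11, 12])
sub_b = pd.DataFrame({'country': 'B', 'poor': [0.3]}, index=[21])

submission = combine_subs([sub_a, sub_b])
```

Since `make_sub` takes its index from `test_feat.index`, this only yields usable ids if the test frame is indexed by household id rather than by the default row number.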