# Cleaning data and creating custom features for Give Me Some Credit Kaggle Challenge

1. Clean the monthly income by replacing NA values with 0
2. Clean the debt ratio by replacing NA with the mean (TODO). Ideally this is done after splitting, with the mean calculated on the training set only, if you want to do an evaluation.
3. Create a monthly debt feature
    * monthly income multiplied by debt ratio if income is not 0
    * debt ratio if income is 0
4. Create a Balanced Income feature that takes into account income and debt ratio (TODO)
5. Clean the number of dependents feature (TODO)
    * set NA to zero
6. Create a Balanced Income per household member feature (TODO)
7. Clean the Number of Times Late features (TODO)
    * Remove the 96 and 98 values (replace them with NA or some other justifiable value)
    * Create a custom categorical feature that holds 2 different tags for the rows whose number of times late is either 96 or 98
8. Add a feature that computes the weighted sum of the number of times late per duration (TODO)
    * weight of 3 for 90 days and more
    * weight of 2 for 60 to 89 days
    * weight of 1 for 30 to 59 days
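
Steps 7 and 8 are still marked TODO; the following is only a minimal sketch of how they could look, not the implementation used in this notebook. The function name `add_late_features`, the new column names `LateCode` and `WeightedLate`, and the tag labels `code96` / `code98` are hypothetical. It also assumes that replacing 96 and 98 with NA is the chosen treatment, and that a row whose three counts are all NA should get an NA weighted sum.

```python
import numpy as np
import pandas as pd

# Weight per lateness column, following the plan above
# (column names as they appear in cs-training.csv).
LATE_WEIGHTS = {
    "NumberOfTime30-59DaysPastDueNotWorse": 1,
    "NumberOfTime60-89DaysPastDueNotWorse": 2,
    "NumberOfTimes90DaysLate": 3,
}
SENTINELS = [96, 98]


def add_late_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the sentinel codes tagged and blanked out,
    plus a weighted sum of the three lateness counts."""
    out = df.copy()
    cols = list(LATE_WEIGHTS)

    # Step 7: tag rows that carry a 96 or 98 in any lateness column.
    has_96 = out[cols].eq(96).any(axis=1)
    has_98 = out[cols].eq(98).any(axis=1)
    out["LateCode"] = np.select([has_96, has_98], ["code96", "code98"], default="none")

    # Replace the sentinel codes with NA so they no longer act as counts.
    out[cols] = out[cols].mask(out[cols].isin(SENTINELS))

    # Step 8: weighted sum; min_count=1 keeps NA when every count is NA.
    weights = pd.Series(LATE_WEIGHTS)
    out["WeightedLate"] = out[cols].mul(weights, axis=1).sum(axis=1, min_count=1)
    return out
```

Under these assumptions it could be applied as `fulltrain = add_late_features(fulltrain)` and likewise for `test`. Returning a copy keeps the original frame available for comparison.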


In [80]:
from pandas import DataFrame, read_csv
from sklearn.model_selection import train_test_split
#import bigml.api
from bigml.api import BigML


## Loading csv files as data frames

Files must be placed in the same directory as this file. Alternatively, modify the relative path to those files.

In [81]:
!pwd
!ls
fulltrain=read_csv('./cs-training.csv')
test=read_csv('./cs-test.csv')

dataSets=[fulltrain,test]

#train80, test20 = train_test_split(fulltrain, test_size=0.2)

/home/devel/handson-ml2/ML-notebooks/GiveMeSomeCredit
README.md  cs-test.csv	cs-training.csv  custom_features.ipynb


## Correcting the name of the first column

In [82]:
print("List of all the column names:\n")
for col in fulltrain.columns:
    print(col)
    
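for df in [fulltrain, test]:
    # rename() relabels the column through the public API instead of
    # mutating the underlying columns array in place
    df.rename(columns={df.columns[0]: 'Id'}, inplace=True)

print('\nfirst column is now named "' + fulltrain.columns[0] + '".')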

List of all the column names:

Unnamed: 0
SeriousDlqin2yrs
RevolvingUtilizationOfUnsecuredLines
age
NumberOfTime30-59DaysPastDueNotWorse
DebtRatio
MonthlyIncome
NumberOfOpenCreditLinesAndLoans
NumberOfTimes90DaysLate
NumberRealEstateLoansOrLines
NumberOfTime60-89DaysPastDueNotWorse
NumberOfDependents

first column is now named "Id".


## Correcting the values in monthly income
Set NaN to 0 in the monthly income column

In [83]:
help(DataFrame.fillna)

Help on function fillna in module pandas.core.frame:

fillna(self, value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)
    Fill NA/NaN values using the specified method.
    
    Parameters
    ----------
    value : scalar, dict, Series, or DataFrame
        Value to use to fill holes (e.g. 0), alternately a
        dict/Series/DataFrame of values specifying which value to use for
        each index (for a Series) or column (for a DataFrame). (values not
        in the dict/Series/DataFrame will not be filled). This value cannot
        be a list.
    method : {'backfill', 'bfill', 'pad', 'ffill', None}, default None
        Method to use for filling holes in reindexed Series
        pad / ffill: propagate last valid observation forward to next valid
        backfill / bfill: use NEXT valid observation to fill gap
    axis : {0 or 'index', 1 or 'columns'}
    inplace : boolean, default False
        If True, fill in place. Note: this will modify any

In [84]:

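for df in [fulltrain, test]:
    # assign the result back instead of calling fillna(inplace=True) on a
    # column selection, which recent pandas versions may not propagate to df
    df['MonthlyIncome'] = df['MonthlyIncome'].fillna(0)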
fulltrain.head(10)

Unnamed: 0,Id,SeriousDlqin2yrs,RevolvingUtilizationOfUnsecuredLines,age,NumberOfTime30-59DaysPastDueNotWorse,DebtRatio,MonthlyIncome,NumberOfOpenCreditLinesAndLoans,NumberOfTimes90DaysLate,NumberRealEstateLoansOrLines,NumberOfTime60-89DaysPastDueNotWorse,NumberOfDependents
0,1,1,0.766127,45,2,0.802982,9120.0,13,0,6,0,2.0
1,2,0,0.957151,40,0,0.121876,2600.0,4,0,0,0,1.0
2,3,0,0.65818,38,1,0.085113,3042.0,2,1,0,0,0.0
3,4,0,0.23381,30,0,0.03605,3300.0,5,0,0,0,0.0
4,5,0,0.907239,49,1,0.024926,63588.0,7,0,1,0,0.0
5,6,0,0.213179,74,0,0.375607,3500.0,3,0,1,0,1.0
6,7,0,0.305682,57,0,5710.0,0.0,8,0,3,0,0.0
7,8,0,0.754464,39,0,0.20994,3500.0,8,0,0,0,0.0
8,9,0,0.116951,27,0,46.0,0.0,2,0,0,0,
9,10,0,0.189169,57,0,0.606291,23684.0,9,0,4,0,2.0
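
The plan at the top notes that a mean used for imputation should be calculated on the training set only. As a rough sketch of that idea (the helper name `impute_with_train_mean` is hypothetical, and the frames below are toy data rather than the Kaggle files), the fill value can be computed once on the training frame and reused everywhere else:

```python
import pandas as pd


def impute_with_train_mean(train, others, column):
    """Fill NA in `column` using the mean of the training frame only.

    Returns the filled training frame, the filled other frames, and the
    fill value that was used. The input frames are left untouched.
    """
    fill_value = train[column].mean()  # NA values are skipped by default
    filled_train = train.assign(**{column: train[column].fillna(fill_value)})
    filled_others = [
        other.assign(**{column: other[column].fillna(fill_value)})
        for other in others
    ]
    return filled_train, filled_others, fill_value


# Toy data standing in for a train/validation split.
toy_train = pd.DataFrame({"DebtRatio": [1.0, None, 3.0]})
toy_valid = pd.DataFrame({"DebtRatio": [None, 10.0]})

toy_train_f, (toy_valid_f,), used = impute_with_train_mean(
    toy_train, [toy_valid], "DebtRatio"
)
print(used)
print(toy_valid_f)
```

With the real data, the training frame would be the larger part of a `train_test_split` of `fulltrain`, and the held-out part and `test` would go in `others`.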


## Creating a new column for monthly debt.

In [85]:
help(DataFrame.insert)

Help on function insert in module pandas.core.frame:

insert(self, loc, column, value, allow_duplicates=False)
    Insert column into DataFrame at specified location.
    
    Raises a ValueError if `column` is already contained in the DataFrame,
    unless `allow_duplicates` is set to True.
    
    Parameters
    ----------
    loc : int
        Insertion index. Must verify 0 <= loc <= len(columns)
    column : string, number, or hashable object
        label of the inserted column
    value : int, Series, or array-like
    allow_duplicates : bool, optional



In [86]:
for df in [fulltrain, test]:
    df.insert(6,"MonthlyDebt",0)
fulltrain.head(10)

Unnamed: 0,Id,SeriousDlqin2yrs,RevolvingUtilizationOfUnsecuredLines,age,NumberOfTime30-59DaysPastDueNotWorse,DebtRatio,MonthlyDebt,MonthlyIncome,NumberOfOpenCreditLinesAndLoans,NumberOfTimes90DaysLate,NumberRealEstateLoansOrLines,NumberOfTime60-89DaysPastDueNotWorse,NumberOfDependents
0,1,1,0.766127,45,2,0.802982,0,9120.0,13,0,6,0,2.0
1,2,0,0.957151,40,0,0.121876,0,2600.0,4,0,0,0,1.0
2,3,0,0.65818,38,1,0.085113,0,3042.0,2,1,0,0,0.0
3,4,0,0.23381,30,0,0.03605,0,3300.0,5,0,0,0,0.0
4,5,0,0.907239,49,1,0.024926,0,63588.0,7,0,1,0,0.0
5,6,0,0.213179,74,0,0.375607,0,3500.0,3,0,1,0,1.0
6,7,0,0.305682,57,0,5710.0,0,0.0,8,0,3,0,0.0
7,8,0,0.754464,39,0,0.20994,0,3500.0,8,0,0,0,0.0
8,9,0,0.116951,27,0,46.0,0,0.0,2,0,0,0,
9,10,0,0.189169,57,0,0.606291,0,23684.0,9,0,4,0,2.0


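The function `monthlyDebtCalc` defined below calculates the monthly debt.
If the income is zero, the debt ratio value is used directly as the monthly debt; otherwise the debt ratio is multiplied by the monthly income.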

In [87]:
help(DataFrame.apply)

Help on function apply in module pandas.core.frame:

apply(self, func, axis=0, broadcast=None, raw=False, reduce=None, result_type=None, args=(), **kwds)
    Apply a function along an axis of the DataFrame.
    
    Objects passed to the function are Series objects whose index is
    either the DataFrame's index (``axis=0``) or the DataFrame's columns
    (``axis=1``). By default (``result_type=None``), the final return type
    is inferred from the return type of the applied function. Otherwise,
    it depends on the `result_type` argument.
    
    Parameters
    ----------
    func : function
        Function to apply to each column or row.
    axis : {0 or 'index', 1 or 'columns'}, default 0
        Axis along which the function is applied:
    
        * 0 or 'index': apply function to each column.
        * 1 or 'columns': apply function to each row.
    broadcast : bool, optional
        Only relevant for aggregation functions:
    
        * ``False`` or ``None`` : returns a Se

In [88]:
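def monthlyDebtCalc(row):
    # With an income of zero, DebtRatio is used directly as the debt amount
    if row['MonthlyIncome'] == 0:
        row['MonthlyDebt']=row['DebtRatio']
    else:
        # Otherwise the ratio is scaled by the income
        row['MonthlyDebt']=row['DebtRatio']*row['MonthlyIncome']
    return row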
    

The following block might take some time, because `apply` with `axis=1` calls the Python function once per row.

In [92]:
for df in [fulltrain, test]:
    tmp=df[['DebtRatio','MonthlyDebt','MonthlyIncome']].apply(monthlyDebtCalc,axis=1)
    df['MonthlyDebt']=tmp['MonthlyDebt']

fulltrain['MonthlyDebt'].head(10)

0     7323.197016
1      316.878123
2      258.914887
3      118.963951
4     1584.975094
5     1314.624392
6     5710.000000
7      734.790059
8       46.000000
9    14359.393699
Name: MonthlyDebt, dtype: float64
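
Since the row-wise `apply` is the slow part, a vectorized variant may be worth considering. The sketch below uses `numpy.where` to apply the same rule to whole columns at once; the function name `monthly_debt_vectorized` is hypothetical, and it assumes that NaN incomes have already been replaced by 0 as done earlier in this notebook (a remaining NaN income would yield a NaN debt).

```python
import numpy as np
import pandas as pd


def monthly_debt_vectorized(df: pd.DataFrame) -> pd.Series:
    """Same rule as monthlyDebtCalc, computed column-wise:
    DebtRatio where income is 0, DebtRatio * MonthlyIncome otherwise."""
    income = df["MonthlyIncome"]
    ratio = df["DebtRatio"]
    debt = np.where(income == 0, ratio, ratio * income)
    return pd.Series(debt, index=df.index, name="MonthlyDebt")


# Toy rows: one with an income, one without.
toy = pd.DataFrame({
    "DebtRatio": [0.5, 5710.0],
    "MonthlyIncome": [9120.0, 0.0],
})
toy_debt = monthly_debt_vectorized(toy)
print(toy_debt)
```

In the loop above, this would replace the two `apply` lines with `df['MonthlyDebt'] = monthly_debt_vectorized(df)`.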