# Challenge 1


## [Kaggle Tutorials](https://www.kaggle.com/learn/overview)

 * __Complete__ [Machine Learning course](https://www.kaggle.com/learn/intro-to-machine-learning), if needed


## Activity: Default Payments of Credit Card Clients

This competition is based on the [UCI](https://www.kaggle.com/uciml) data on [Default Payments of Credit Card Clients in Taiwan](https://www.kaggle.com/uciml/default-of-credit-card-clients-dataset).
The general goal is to become familiar with logistic regression.
This will be accomplished by applying logistic regression to the prediction of default payments.
The task is to predict the value of `Default` (0 or 1) for every row in the test set.

**Logistic Regression**: The hypothesis class associated with logistic regression is the composition of a sigmoid function $\phi_{\mathrm{sig}}: \mathbb{R} \rightarrow [0, 1]$ over the class of linear functions.
In particular, the sigmoid function used in logistic regression is the logistic function, defined as
$$\phi_{\mathrm{sig}}(z) = \frac{1}{1 + \exp(-z)}.$$
The classification process is defined through the inner products $\langle \mathbf{w}_i, \mathbf{x} \rangle$, one per class $i \in \{ 0, 1 \}$ (the two values of `Default`), with
$$\begin{split}
y &= \arg \max_i \phi_{\mathrm{sig}} \left( \langle \mathbf{w}_i, \mathbf{x} \rangle \right) \\
&= \arg \max_i \frac{1}{1 + \exp \left( - \langle \mathbf{w}_i, \mathbf{x} \rangle \right)}.
\end{split}$$
The task, then, is to identify the collection of vectors $\{ \mathbf{w}_i \}$.
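
The decision rule above can be sketched in a few lines of NumPy. This is only an illustration: the weight matrix `W` and the sample `x` are made-up values, and the helper names `sigmoid` and `classify` are not part of the challenge files.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))

def classify(W, x):
    """Return the index i that maximizes sigmoid(<w_i, x>); the rows of W are the w_i."""
    scores = sigmoid(W @ x)
    return int(np.argmax(scores))

# Hypothetical weights for two classes and a single three-feature sample.
W = np.array([[0.5, -1.0, 0.2],
              [-0.3, 0.8, 0.1]])
x = np.array([1.0, 2.0, -1.0])

label = classify(W, x)
print(label)
```

Because the logistic function is strictly increasing, the class that maximizes $\phi_{\mathrm{sig}}(\langle \mathbf{w}_i, \mathbf{x} \rangle)$ is also the class with the largest inner product.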


### Evaluation

Submissions will be scored according to Categorization Accuracy. This Kaggle metric requires the following columns: `Id (String)` and `Default (String)`. The solution file should be in CSV format.
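
As a rough illustration of the metric, the sketch below assumes that Categorization Accuracy is the fraction of rows whose predicted `Default` matches the ground truth. The exact Kaggle implementation is not described here, and the function name and toy frames are hypothetical.

```python
import pandas as pd

def categorization_accuracy(submission, truth):
    """Fraction of Ids whose predicted Default equals the true Default."""
    merged = truth.merge(submission, on="Id", suffixes=("_true", "_pred"))
    return (merged["Default_true"] == merged["Default_pred"]).mean()

# Toy frames in the required layout: both columns hold strings.
truth = pd.DataFrame({"Id": ["1", "2", "3", "4"],
                      "Default": ["0", "1", "0", "1"]})
submission = pd.DataFrame({"Id": ["1", "2", "3", "4"],
                           "Default": ["0", "1", "1", "1"]})

score = categorization_accuracy(submission, truth)
print(score)
```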


### File Descriptions

 * `challenge1training.csv` – training set
 * `challenge1testing.csv` – test set


### Deliverables

User submissions are evaluated by comparing their submission CSV to the ground truth solution CSV with respect to Categorization Accuracy.
This Metric requires the following columns: `Id (String)` and `Default (String)`.
On GitHub, you should submit the following items.

 1. `challenge1solution.csv` – solution in Kaggle format
 2. `challenge1vector.csv` – a vector solution for logistic regression $\{ \mathbf{w}_i \}$

In [95]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression

df_train = pd.read_csv('challenge1training.csv')
df_test = pd.read_csv('challenge1testing.csv')
df_solution = pd.read_csv('challenge1sample.csv')
df_vector = pd.read_csv('challenge1vector.csv')

#print(df_train.head())
#print(df_test.head())
#print(df_solution.head())
#print(df_vector.head())


In [96]:
#train data
d={}
for col in df_train:
    d[col]=df_train[col].unique()
print(d)

{'Id': array([    1,     2,     3, ..., 23998, 23999, 24000], dtype=int64), 'LIMIT_BAL': array([ 300000,   20000,   80000,  110000,  140000,  180000,  260000,
        500000,  210000,  200000,  160000,  310000,  340000,   60000,
        250000,   50000,  170000,  150000,  100000,  230000,  120000,
         10000,   30000,  480000,  190000,  600000,  270000,   70000,
        330000,  400000,  320000,  360000,  390000,  470000,  350000,
        420000,  290000,  220000,   90000,  460000,  380000,  240000,
        130000,  450000,  280000,  410000,   40000,  440000,  430000,
        550000,  520000,  490000,  370000,  610000,  580000,  650000,
        530000,  570000,  750000,  670000,  560000,  510000,  640000,
        540000,  630000,  740000,  700000,  710000,  720000,  780000,
        590000,  660000,  620000,  327680,  760000,  680000,  800000,
       1000000,  730000], dtype=int64), 'SEX': array([2, 1], dtype=int64), 'EDUCATION': array([2, 1, 4, 3, 5, 6, 0], dtype=int64), 'MARRIAGE'

In [97]:
#test data
d1={}
for col in df_test:
    d1[col]=df_test[col].unique()
print(d1)

{'Id': array([   1,    2,    3, ..., 5998, 5999, 6000], dtype=int64), 'LIMIT_BAL': array([320000, 450000,  50000, 220000,  20000, 500000, 120000, 130000,
       100000,  10000,  90000, 550000, 360000, 200000, 430000, 290000,
       460000, 230000, 170000, 270000, 260000, 400000, 140000, 280000,
       240000, 150000,  30000, 160000, 440000, 370000,  80000, 420000,
       190000,  70000, 480000,  60000, 410000, 310000, 110000, 390000,
       300000, 380000, 250000, 490000, 210000, 180000, 350000, 330000,
       340000,  40000, 610000, 470000, 700000, 620000, 560000, 740000,
       580000, 510000, 640000, 720000, 520000, 680000, 800000, 540000,
       570000, 660000, 710000, 600000, 630000,  16000, 590000, 690000],
      dtype=int64), 'SEX': array([1, 2], dtype=int64), 'EDUCATION': array([1, 2, 3, 6, 5, 4, 0], dtype=int64), 'MARRIAGE': array([2, 1, 0, 3], dtype=int64), 'AGE': array([38, 41, 34, 33, 39, 31, 24, 47, 22, 32, 42, 29, 35, 44, 43, 50, 53,
       30, 25, 48, 36, 23, 28, 45, 46,

A few categorical attributes, such as Education and Marriage, contain incorrect entries that need to be replaced in both the train and test dataframes.

## Replace

### Education

In [98]:
df_train['EDUCATION'].value_counts()
# EDUCATION entries 0, 5 and 6 are replaced with 4, the code for "others",
# so that only the categories 1-4 remain under education

2    11245
1     8417
3     3974
5      213
4       95
6       44
0       12
Name: EDUCATION, dtype: int64

In [99]:
#train data
df_train['EDUCATION']=df_train['EDUCATION'].replace(to_replace=[0,5,6],value=4)
df_train['EDUCATION'].value_counts()

2    11245
1     8417
3     3974
4      364
Name: EDUCATION, dtype: int64

In [100]:
#test data
df_test['EDUCATION']=df_test['EDUCATION'].replace(to_replace=[0,5,6],value=4)
df_test['EDUCATION'].value_counts()

2    2785
1    2168
3     943
4     104
Name: EDUCATION, dtype: int64

### Marriage

In [101]:
df_train['MARRIAGE'].value_counts()
# MARRIAGE entries equal to 0 are replaced with 3, the code for "others"

2    12798
1    10899
3      256
0       47
Name: MARRIAGE, dtype: int64

In [102]:
#train data
df_train['MARRIAGE']=df_train['MARRIAGE'].replace(to_replace=0,value=3)
df_train['MARRIAGE'].value_counts()

2    12798
1    10899
3      303
Name: MARRIAGE, dtype: int64

In [103]:
#test data
df_test['MARRIAGE']=df_test['MARRIAGE'].replace(to_replace=0,value=3)
df_test['MARRIAGE'].value_counts()

2    3166
1    2760
3      74
Name: MARRIAGE, dtype: int64
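
The four replacement cells above apply the same two rules to each dataframe. One possible way to keep them in sync is a single helper, sketched here on a toy frame; `REPLACEMENTS` and `clean_categories` are hypothetical names, and the mapping simply restates the rules used above.

```python
import pandas as pd

# Rules taken from the cells above: EDUCATION 0, 5, 6 -> 4 and MARRIAGE 0 -> 3.
REPLACEMENTS = {
    "EDUCATION": {0: 4, 5: 4, 6: 4},
    "MARRIAGE": {0: 3},
}

def clean_categories(df):
    """Return a copy of df with the incorrect codes folded into 'others'."""
    return df.replace(REPLACEMENTS)

# Toy frame standing in for df_train or df_test.
toy = pd.DataFrame({"EDUCATION": [0, 1, 2, 5, 6],
                    "MARRIAGE": [0, 1, 2, 3, 0]})
cleaned = clean_categories(toy)
print(cleaned)
```

The nested dictionary tells `DataFrame.replace` to look for each old value only in the named column, so a 0 in another column would be left alone.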

In [104]:
df_train.head()

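| | Id | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_1 | PAY_2 | PAY_3 | PAY_4 | ... | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | Default |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 300000 | 2 | 2 | 2 | 35 | 0 | 0 | 0 | 0 | ... | 76722 | 69922 | 23926 | 3219 | 3417 | 1926 | 1545 | 797 | 9190 | 0 |
| 1 | 2 | 20000 | 1 | 1 | 2 | 24 | 0 | 0 | 0 | 0 | ... | 16200 | 16936 | 17273 | 1300 | 1270 | 1000 | 1000 | 619 | 800 | 1 |
| 2 | 3 | 20000 | 2 | 2 | 2 | 22 | 1 | 2 | 0 | 0 | ... | 12326 | 11888 | 9351 | 0 | 2000 | 1800 | 0 | 900 | 0 | 1 |
| 3 | 4 | 80000 | 1 | 1 | 1 | 60 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 3300 | 6267 | 0 | 0 | 0 | 4189 | 0 |
| 4 | 5 | 300000 | 1 | 2 | 1 | 48 | 1 | -2 | -2 | -2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |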


In [105]:
df_test.head()

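| | Id | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_1 | PAY_2 | PAY_3 | PAY_4 | ... | BILL_AMT3 | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 320000 | 1 | 1 | 2 | 38 | -1 | -1 | -1 | -1 | ... | 16770 | 83490 | 0 | 52281 | 5409 | 16778 | 83490 | 0 | 52281 | 15643 |
| 1 | 2 | 450000 | 1 | 1 | 1 | 41 | -1 | -1 | -1 | -1 | ... | 8680 | 113233 | 5907 | 0 | 17913 | 9278 | 114865 | 6411 | 0 | 0 |
| 2 | 3 | 50000 | 1 | 2 | 2 | 34 | 0 | 0 | 0 | 0 | ... | 26135 | 26730 | 12180 | 18179 | 1410 | 2000 | 1167 | 1017 | 18179 | 651 |
| 3 | 4 | 220000 | 1 | 1 | 2 | 33 | -2 | -2 | -2 | -2 | ... | 666 | 1064 | 707 | 9213 | 2108 | 1332 | 1064 | 707 | 9325 | 1500 |
| 4 | 5 | 20000 | 2 | 3 | 1 | 39 | 1 | 2 | 2 | 2 | ... | 8198 | 9752 | 9447 | 10290 | 1200 | 0 | 1700 | 0 | 1000 | 1500 |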


## Modeling

In [106]:
# Separate predictors and target; Id is an identifier, not a predictor
X_train=df_train.drop(['Id','Default'],axis=1)
X_test=df_test.drop('Id',axis=1)
y_train=df_train['Default']


In [107]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import scale

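# Standardize the predictors to zero mean and unit variance
# (note: train and test are each scaled with their own mean and standard deviation)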
Xs_train=scale(X_train)
Xs_test=scale(X_test)

# instantiate the model 
logreg = LogisticRegression(solver='lbfgs',multi_class='multinomial',max_iter=100000)

# fit the model with data
logreg.fit(Xs_train,y_train)

#prediction
y_pred=logreg.predict(Xs_test)
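
The cell above standardizes the train and test predictors separately, so each set is centered with its own statistics. A common alternative, sketched below on synthetic data, is to learn the scaling on the training set only and reuse it for the test set. This is not the method used to produce the results in this notebook; the arrays `X_tr`, `y_tr`, `X_te` are made up, and the model uses default solver settings rather than the ones above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the real predictors and target.
rng = np.random.default_rng(0)
X_tr = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y_tr = (X_tr[:, 0] + X_tr[:, 1] > 10.0).astype(int)
X_te = rng.normal(loc=5.0, scale=2.0, size=(50, 3))

# The pipeline fits the scaler on the training data only,
# then applies the same mean and variance when predicting on new data.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_tr, y_tr)
pred = model.predict(X_te)

scaler = model.named_steps["standardscaler"]
print(scaler.mean_)
```

With this layout the fitted weights refer to one fixed scaling, which makes it easier to apply them to new rows later.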

### Training Accuracy

In [108]:
# Number of test rows predicted in each class
unique, counts = np.unique(y_pred, return_counts=True)
dict(zip(unique, counts))

{0: 5554, 1: 446}

In [109]:
# Number of training rows in each class
unique, counts = np.unique(y_train, return_counts=True)
dict(zip(unique, counts))

{0: 18698, 1: 5302}

In [110]:
print("training score :",(logreg.score(Xs_train, y_train)*100))
print(logreg.solver)
print(logreg.multi_class)
print(logreg.max_iter)

training score : 81.10833333333333
lbfgs
multinomial
100000
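
To put the training score in context, it can be compared with the accuracy of always predicting the majority class. The short calculation below uses only the class counts and the score printed above; the variable names are illustrative.

```python
# Class counts of y_train as printed above.
counts = {0: 18698, 1: 5302}

total = sum(counts.values())

# Accuracy (in percent) of always predicting the most frequent class.
baseline = 100 * max(counts.values()) / total

# Training score printed by the notebook.
training_score = 81.10833333333333

gain = training_score - baseline
print(f"baseline: {baseline:.2f}%, gain over baseline: {gain:.2f} points")
```

Note that this compares accuracy on the training data only; it says nothing about performance on the test set.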


### Output

In [111]:
# Creating challenge1solution.csv
df_solution['Default']=y_pred
df_solution.to_csv('challenge1solution.csv',index=False)

# Creating challenge1vector.csv: the intercept first, followed by one coefficient per predictor
vec=np.concatenate((logreg.intercept_,logreg.coef_[0]))
df_vector.loc[0]=vec
df_vector.to_csv('challenge1vector.csv',index=False)
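
As a sanity check on the vector layout, the sketch below rebuilds the predictions of a fitted binary model from the concatenated vector alone. It assumes the same layout as above (intercept first, then coefficients), uses synthetic data, and fits `LogisticRegression` with default settings rather than the ones in this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem standing in for the scaled training data.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = (X[:, 0] - 0.5 * X[:, 2] + 0.3 * rng.normal(size=300) > 0).astype(int)

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Intercept first, then the coefficients, as in challenge1vector.csv.
vec = np.concatenate((clf.intercept_, clf.coef_[0]))

# Prepend a column of ones so that the intercept acts as the first weight.
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])

# The predicted class is 1 exactly when the inner product is positive.
manual_pred = (X_aug @ vec > 0).astype(int)

agree = np.array_equal(manual_pred, clf.predict(X))
print(agree)
```

A positive inner product corresponds to a sigmoid value above 0.5, so thresholding the inner product at 0 matches the decision rule described at the top of this challenge.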