## Task 1

The German credit dataset contains both numerical and categorical variables. First, I checked whether the data has any missing values and converted each categorical (symbolic) variable into a numerical one, so that it is suitable for feature selection and for the learning algorithm. In this conversion, feature columns such as "purpose" and "personal_status_sex" are spread over several indicator columns, because their categories have no natural order and cannot be mapped onto a single numeric scale. Then I used correlation and p-value statistics for feature selection. Finally, I applied an appropriate machine learning model to the dataset.
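As a minimal sketch of the two encodings described above, the toy frame below (made-up rows, not taken from the dataset) shows a column replaced by integer category codes and another column expanded into one indicator column per category.

```python
import pandas as pd

# Toy data: hypothetical rows for illustration, not the real dataset.
toy = pd.DataFrame({
    "savings": ["A61", "A63", "A62", "A61"],      # treated as ordered
    "housing": ["A152", "A151", "A153", "A152"],  # no natural order
})

# Integer encoding: each category is replaced by its code
# (codes follow the sorted order of the category labels).
toy["savings"] = toy["savings"].astype("category").cat.codes

# One-hot encoding: one indicator column per category.
toy = pd.get_dummies(toy, columns=["housing"])

print(toy.columns.tolist())
print(toy["savings"].tolist())
```

The integer codes imply an ordering between categories, which is why this encoding is only reasonable for variables whose categories can be ranked.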

In [508]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm


In [582]:
url="https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data"

names= ["status","duration","credit_history","purpose","amount","savings","employment_duration","installment_rate",
        "personal_status_sex","other_debtors","present_residence","property", "age","other_installment_plans", "housing" ,
        "number_credits","job","people_liable","telephone","foreign_worker","credit_risk"]

# The file is space-separated and has no header row, so column names are passed explicitly
data = pd.read_csv(url, sep=" ", header=None, names=names)


In [583]:
data.head(5)

Unnamed: 0,status,duration,credit_history,purpose,amount,savings,employment_duration,installment_rate,personal_status_sex,other_debtors,...,property,age,other_installment_plans,housing,number_credits,job,people_liable,telephone,foreign_worker,credit_risk
0,A11,6,A34,A43,1169,A65,A75,4,A93,A101,...,A121,67,A143,A152,2,A173,1,A192,A201,1
1,A12,48,A32,A43,5951,A61,A73,2,A92,A101,...,A121,22,A143,A152,1,A173,1,A191,A201,2
2,A14,12,A34,A46,2096,A61,A74,2,A93,A101,...,A121,49,A143,A152,1,A172,2,A191,A201,1
3,A11,42,A32,A42,7882,A61,A74,2,A93,A103,...,A122,45,A143,A153,1,A173,2,A191,A201,1
4,A11,24,A33,A40,4870,A61,A73,3,A93,A101,...,A124,53,A143,A153,2,A173,2,A191,A201,2


In [584]:
data.isnull().sum().sum()  # checking for missing values

0

In [585]:

# Variables treated as ordered are replaced by integer category codes
list1 = ["status", "credit_history", "savings", "employment_duration", "property", "job"]

for i in list1:
    data[i] = data[i].astype('category')

cat_columns = data.select_dtypes(['category']).columns

data[cat_columns] = data[cat_columns].apply(lambda x: x.cat.codes)

# Variables with no natural order are expanded into one indicator column per category
list2 = ["purpose", "personal_status_sex", "other_debtors", "other_installment_plans", "housing"]

data = pd.get_dummies(data=data, columns=list2)

# Binary variables are mapped to 0/1
data['telephone'] = data['telephone'].map({'A192': 1, 'A191': 0})
data['foreign_worker'] = data['foreign_worker'].map({'A201': 1, 'A202': 0})


data.head(5)

Unnamed: 0,status,duration,credit_history,amount,savings,employment_duration,installment_rate,present_residence,property,age,...,personal_status_sex_A94,other_debtors_A101,other_debtors_A102,other_debtors_A103,other_installment_plans_A141,other_installment_plans_A142,other_installment_plans_A143,housing_A151,housing_A152,housing_A153
0,0,6,4,1169,4,4,4,4,0,67,...,0,1,0,0,0,0,1,0,1,0
1,1,48,2,5951,0,2,2,2,0,22,...,0,1,0,0,0,0,1,0,1,0
2,3,12,4,2096,0,3,2,3,0,49,...,0,1,0,0,0,0,1,0,1,0
3,0,42,2,7882,0,3,2,4,1,45,...,0,0,0,1,0,0,1,0,0,1
4,0,24,3,4870,0,2,3,4,3,53,...,0,1,0,0,0,0,1,0,0,1




## Task 2

Feature selection is an important step in machine learning, because removing irrelevant or less important features can improve the accuracy of the model. For supervised learning tasks, feature selection can be based on the correlation between each feature and the target variable. Correlation measures the strength of the linear relationship between two variables on a scale from -1 to 1. The closer the value is to -1 or 1, the stronger the relationship; the closer it is to 0, the weaker the relationship. It describes how a change in one variable is associated with a change in another variable.
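To make the -1 to 1 scale concrete, here is a small sketch that computes the Pearson correlation coefficient by hand. The helper `pearson_r` and the toy vectors are illustrative assumptions, not part of the analysis itself.

```python
import numpy as np

def pearson_r(a, b):
    """Pearson correlation: covariance divided by the product of the spreads."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_c = a - a.mean()
    b_c = b - b.mean()
    return float((a_c * b_c).sum() / np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum()))

x = [1, 2, 3, 4, 5]
r_pos = pearson_r(x, [2, 4, 6, 8, 10])   # moves exactly with x: r is 1
r_neg = pearson_r(x, [10, 8, 6, 4, 2])   # moves exactly against x: r is -1
r_none = pearson_r(x, [2, 1, 3, 1, 2])   # no linear trend: r is (numerically) 0

print(r_pos, r_neg, r_none)
```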

When we perform a hypothesis test (for correlation, the null hypothesis H0 states that there is no relationship between variable 1 and variable 2), the p-value helps determine the statistical significance of the result. A low p-value suggests that the sample provides enough evidence to reject H0. If the p-value is below the chosen significance level (alpha), we can state that there is a statistically significant relationship between the variables. Most fields use an alpha level of 0.05, which I also use here.
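The decision rule can be sketched on synthetic data as follows. The variables `feature_related`, `feature_noise` and `target` are generated here purely for illustration; `scipy.stats.pearsonr` returns both the correlation coefficient and the p-value of the test against H0.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n = 200

# Synthetic data (assumption for illustration only).
feature_related = rng.normal(size=n)
target = 2.0 * feature_related + rng.normal(size=n)  # depends on feature_related
feature_noise = rng.normal(size=n)                   # generated independently of target

alpha = 0.05
for name, feature in [("related", feature_related), ("noise", feature_noise)]:
    r, p = pearsonr(feature, target)
    decision = "reject H0" if p < alpha else "fail to reject H0"
    print(f"{name}: r={r:.3f}, p={p:.3g} -> {decision}")
```

The related feature is rejected under H0 with a very small p-value. An independent feature will usually fail to reject H0, although by construction about 5% of such features would still fall below alpha by chance.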

In the following code, I select features by checking whether their p-value is less than 0.05: first with OLS backward elimination, then with the Pearson correlation against the target.

In [587]:

# Backward elimination: refit OLS and drop the feature with the largest p-value
# until every remaining p-value is at most 0.05
x = data.drop(["credit_risk"], axis=1).values.astype(float)
Y = data.credit_risk.values

selected_columns = data.drop(["credit_risk"], axis=1).columns

while x.shape[1] > 0:
    regressor_OLS = sm.OLS(Y, x).fit()
    worst = int(np.argmax(regressor_OLS.pvalues))  # index of the least significant feature
    if regressor_OLS.pvalues[worst] <= 0.05:
        break
    x = np.delete(x, worst, 1)
    selected_columns = np.delete(selected_columns, worst)

df = pd.DataFrame(data=x, columns=selected_columns)


In [563]:
from scipy.stats import pearsonr


x = data.drop(["credit_risk"], axis=1)

p_values = []
statistics = []
col_names = x.columns.values.tolist()
selected_cols = []

# Keep every feature whose Pearson correlation with credit_risk is significant at the 0.05 level
for i in range(len(col_names)):
    cor = pearsonr(x.iloc[:, i].astype(float), data.credit_risk)
    p_value = float(cor[1])
    statistic = cor[0]
    if p_value < 0.05:
        selected_cols.append(col_names[i])
        p_values.append(p_value)
        statistics.append(statistic)

# Rank the selected features by the absolute value of their correlation
results = {"Feature": selected_cols, "Statistic": statistics, "p_value": p_values}
a = pd.DataFrame(results)
a['Statistic'] = a['Statistic'].abs()
a.sort_values(by=["Statistic"], ascending=False)

Unnamed: 0,Feature,Statistic,p_value
0,status,0.350847,2.441662e-30
2,credit_history,0.228785,2.42306e-13
1,duration,0.214927,6.48805e-12
4,savings,0.178943,1.214798e-08
3,amount,0.154739,8.797572e-07
7,property,0.142612,5.974058e-06
20,housing_A152,0.134589,1.953064e-05
5,employment_duration,0.116002,0.0002367939
18,other_installment_plans_A143,0.113285,0.0003313488
12,purpose_A43,0.106922,0.0007074909


## Task 3

There are various learning methods for classification, such as k-NN, support vector machines, decision trees, and naive Bayes. I tried these models along with logistic regression, and logistic regression gave the best accuracy. This is a plausible result, because logistic regression is well suited to our binary classification problem.

Logistic regression is used for binary classification problems, for instance yes/no or true/false outcomes. In our dataset the dependent variable has two classes, so logistic regression can be used to classify customers as good or bad. The model splits the feature space into two halves with a hyperplane. The deeper a point lies in one of these halves, the greater the probability that it belongs to the corresponding class.
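The hyperplane picture can be sketched in a few lines. The weights `w`, the bias `b` and the helper `predict_proba` below are hypothetical values chosen for illustration, not the coefficients fitted in this notebook.

```python
import numpy as np

def predict_proba(points, w, b):
    """Logistic model: sigmoid of the signed score w.x + b."""
    scores = points @ w + b  # 0 on the hyperplane; the sign gives the side
    return 1.0 / (1.0 + np.exp(-scores))

# Hypothetical parameters for a 2-feature model (illustrative only).
w = np.array([1.0, -1.0])
b = 0.0

points = np.array([
    [0.0, 0.0],  # on the hyperplane
    [1.0, 0.0],  # slightly inside the positive half
    [5.0, 0.0],  # deep inside the positive half
    [0.0, 5.0],  # deep inside the negative half
])
probs = predict_proba(points, w, b)
print(probs.round(3))
```

A point on the hyperplane gets probability 0.5, and the probability moves toward 1 or 0 as the point goes deeper into either half.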

I used the Python scikit-learn library to create the model. First, I split the data into training and test sets. After training, the model predicts from X_test whether each customer is good or bad.

In [564]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score



In [588]:
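# Features selected by the correlation test (kept for reference; not used in the split below)
X = pd.DataFrame(data=data, columns=selected_cols)
y = data.credit_risk
# The split uses df, the features kept by the OLS backward elimination
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.25, random_state=0)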


In [589]:
model = LogisticRegression()
model.fit(X_train,y_train)
pred=model.predict(X_test)
pred



array([2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 2, 2, 1, 1, 2, 1, 1, 1, 2, 1, 2, 2,
       1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1,
       1, 2, 2, 2, 1, 1, 1, 2, 1, 1, 2, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1,
       1, 2, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2,
       1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 1,
       1, 2, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2,
       2, 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 2, 1, 1, 1, 1, 2, 1, 2, 2, 1, 2,
       2, 2, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1,
       1, 1, 1, 2, 2, 2, 1, 1, 2, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2, 1, 1, 2,
       2, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1], dtype=int64)

## Task 4 

To check the correctness of the model, we can use the accuracy and the confusion matrix. Classification accuracy is the ratio of correct predictions to the total number of predictions. A confusion matrix summarizes the prediction results of a classification problem by counting the correct and incorrect predictions for each class. In scikit-learn's `confusion_matrix`, each row corresponds to an actual class and each column to a predicted class.



In [590]:

print( confusion_matrix(y_test, pred))
print("Accuracy:",accuracy_score(y_test,pred))


[[152  24]
 [ 31  43]]
Accuracy: 0.78
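
As a quick check, the reported accuracy can be recomputed from the confusion matrix printed above. The sketch below copies those four counts by hand and relies on scikit-learn's convention that rows are actual classes and columns are predicted classes; the per-class recall is an extra illustration, not something computed in the original cell.

```python
import numpy as np

# Counts copied from the confusion matrix printed above
# (rows: actual class 1 / 2, columns: predicted class 1 / 2).
cm = np.array([[152, 24],
               [31, 43]])

# Accuracy: correct predictions (the diagonal) over all predictions.
accuracy = np.trace(cm) / cm.sum()

# Per-class recall: share of each actual class that was predicted correctly.
recall_per_class = np.diag(cm) / cm.sum(axis=1)

print("Accuracy:", accuracy)
print("Recall per class:", recall_per_class.round(3))
```

The diagonal holds 152 + 43 = 195 correct predictions out of 250 test rows, which matches the accuracy of 0.78. The per-class view shows that class 2 is recognized less reliably than class 1.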
