# ÖDEV 1: PCA yardımı ile Classification,

Bu ödevde "Credit Risk Prediction" veri setini kullanacağız. Amacımız, verinin boyut sayısını düşürerek olabildiğince yüksek accuracy değerini alabilmek. Aşağıda verinin okunma ve temizlenme kısmını hazırlayıp vereceğim. Devamında ise yapmanız gerekenler:

1. PCA kullanarak verinin boyutunu düşürmek
    * Önce explained varience ratio değerini inceleyerek veriyi kaç boyuta düşürebileceğini kontrol et.
    * Daha sonra farklı boyutlarda denemeler yaparak boyutu düşürülmüş verileri elde et.
2. Classification modellerini dene
    * Logistic Regression
    * Random Forest
    * ve eğer istersen herhangi bir modelle daha

İsteğe bağlı olarak, verinin boyutunu düşürmek için diğer yöntemleri de kullanıp en yüksek accuracy değerini almayı deneyebilirsin.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
df: pd.DataFrame = pd.read_csv('./credit_risk_dataset.csv')

In [3]:
print(df.isnull().sum())

person_age                       0
person_income                    0
person_home_ownership            0
person_emp_length              895
loan_intent                      0
loan_grade                       0
loan_amnt                        0
loan_int_rate                 3116
loan_status                      0
loan_percent_income              0
cb_person_default_on_file        0
cb_person_cred_hist_length       0
dtype: int64


In [4]:
# Null değerleri sütun ortalaması ile dolduruyoruz
df["person_emp_length"].fillna(df["person_emp_length"].median(), inplace=True)
df["loan_int_rate"].fillna(df["loan_int_rate"].median(), inplace=True)

In [5]:
df.duplicated().sum()

165

In [6]:
df.drop_duplicates(inplace=True)

In [7]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
person_age,32416.0,27.747008,6.3541,20.0,23.0,26.0,30.0,144.0
person_income,32416.0,66091.640826,62015.580269,4000.0,38542.0,55000.0,79218.0,6000000.0
person_emp_length,32416.0,4.76888,4.090411,0.0,2.0,4.0,7.0,123.0
loan_amnt,32416.0,9593.845632,6322.730241,500.0,5000.0,8000.0,12250.0,35000.0
loan_int_rate,32416.0,11.014662,3.08305,5.42,8.49,10.99,13.11,23.22
loan_status,32416.0,0.218688,0.413363,0.0,0.0,0.0,0.0,1.0
loan_percent_income,32416.0,0.17025,0.106812,0.0,0.09,0.15,0.23,0.83
cb_person_cred_hist_length,32416.0,5.811297,4.05903,2.0,3.0,4.0,8.0,30.0


In [8]:
# Outlier temizliği
df = df[df['person_age']<=100]
df = df[df['person_emp_length'] <= 60]
df = df[df['person_income']<=4e6]

In [9]:
# Kategorik verileri alıyoruz ve one hot encoding haline getiriyoruz
cat_cols = pd.DataFrame(df[df.select_dtypes(include=['object']).columns])
cat_cols.columns

Index(['person_home_ownership', 'loan_intent', 'loan_grade',
       'cb_person_default_on_file'],
      dtype='object')

In [10]:
encoded_cat_cols = pd.get_dummies(cat_cols)
df.drop(df.select_dtypes(include=['object']).columns, axis=1,inplace=True)
df = pd.concat([df,encoded_cat_cols], axis=1)

In [11]:
X = df.drop('loan_status', axis=1).values
y = df['loan_status'].values

In [12]:
# Verileri train ve test olarak ikiye ayırıyoruz

from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(1, test_size=0.1)
train_idx, test_idx = next(split.split(X, y))
train_x = X[train_idx]
test_x = X[test_idx]

train_y = y[train_idx]
test_y = y[test_idx]

## Kolay gelsin!

In [13]:
#This cell blocks the warnings.
import warnings
warnings.filterwarnings("ignore")

In [14]:
from sklearn.decomposition import PCA

### 5D

In [15]:
#PCA object reduced to size five
pca_model = PCA(n_components=5)

In [16]:
train_x.shape

(29168, 26)

In [17]:
#Independent variables reduced to five - train set
train_x_red_five = pca_model.fit_transform(train_x)

In [18]:
#Independent variables reduced to five - test set
test_x_red_five = pca_model.fit_transform(test_x)

In [19]:
#Size of independent variables reduced to five dimensions
train_x_red_five.shape

(29168, 5)

In [20]:
#Explained Variance Ratio
pca_model.explained_variance_ratio_

array([9.86097232e-01, 1.39027389e-02, 1.82619710e-08, 5.45753334e-09,
       3.53303974e-09])

In [21]:
1 - pca_model.explained_variance_ratio_.sum()

1.8177717109324476e-09

### 3D

In [22]:
#PCA object reduced to size three
pca_model = PCA(n_components=3)

In [23]:
#Independent variables reduced to three - train set
train_x_red_three = pca_model.fit_transform(train_x)

In [24]:
#Independent variables reduced to three - test set
test_x_red_three = pca_model.fit_transform(test_x)

In [25]:
#Size of independent variables reduced to three dimensions
train_x_red_three.shape

(29168, 3)

In [26]:
#Explained Variance Ratio
pca_model.explained_variance_ratio_

array([9.86097232e-01, 1.39027389e-02, 1.82619710e-08])

In [27]:
1 - pca_model.explained_variance_ratio_.sum()

1.080834444167067e-08

### 2D

In [28]:
#PCA object reduced to size two
pca_model = PCA(n_components=2)

In [29]:
#Independent variables reduced to two - train set
train_x_red_two = pca_model.fit_transform(train_x)

In [30]:
#Independent variables reduced to two - test set
test_x_red_two = pca_model.fit_transform(test_x)

In [31]:
#Size of independent variables reduced to two dimensions
train_x_red_two.shape

(29168, 2)

In [32]:
#Explained Variance Ratio
pca_model.explained_variance_ratio_

array([0.98609723, 0.01390274])

In [33]:
1 - pca_model.explained_variance_ratio_.sum()

2.907031593224474e-08

### Logistic Regression

In [34]:
#Importing Logistic Regression Class
from sklearn.linear_model import LogisticRegression

In [35]:
#Create a LogisticRegression Object
log_model = LogisticRegression()

In [36]:
#Fit the model
log_model.fit(train_x_red_five, train_y)

LogisticRegression()

In [37]:
#Makes predictions based on given independent variables.
pred_y = log_model.predict(test_x_red_five)

In [38]:
#Importing accuracy_score function
from sklearn.metrics import accuracy_score

In [39]:
#Result of accuracy score
accuracy_score(test_y, pred_y)

0.6627584078987967

### Random Forest

In [40]:
#Importing Random Forest Classifier Class
from sklearn.ensemble import RandomForestClassifier

In [41]:
#Create a Random Forest Classifier Object fit the model
rf_clf = RandomForestClassifier().fit(train_x_red_five, train_y)

In [42]:
#Makes predictions based on given independent variables.
pred_y = rf_clf.predict(test_x_red_five)

In [43]:
#Result of accuracy score
accuracy_score(test_y, pred_y)

0.8639308855291576

### Quadratic Discriminant Analysis

In [44]:
#Importing QuadraticDiscriminantAnalysis Class
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

In [45]:
#Create a QuadraticDiscriminantAnalysis Object fit the model
qda_clf = QuadraticDiscriminantAnalysis().fit(train_x_red_five, train_y)

In [46]:
#Makes predictions based on given independent variables.
pred_y = qda_clf.predict(test_x_red_five)

In [47]:
#Result of accuracy score
accuracy_score(test_y, pred_y)

0.8037642702869485

### Ensemble Learning

In [48]:
#Importing VotingClassifier and SVC Class
from sklearn.ensemble import VotingClassifier
from sklearn.svm import SVC

In [49]:
#Create LogisticRegression, RandomForestClassifier, SVC and QuadraticDiscriminantAnalysis Object
log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC(probability=True)
qda_clf = QuadraticDiscriminantAnalysis()

In [50]:
#Fit the Voting Classifier
voting_clf = VotingClassifier(
    estimators=[('log', log_clf), ('rf', rf_clf), ('svc', svm_clf), ('qda', qda_clf)],
    voting='soft')

In [51]:
#Accuracy score results for each model
for clf in (log_clf, rnd_clf, svm_clf, qda_clf, voting_clf):
    clf.fit(train_x_red_five, train_y)
    pred_y = clf.predict(test_x_red_five)
    print(clf.__class__.__name__, accuracy_score(test_y, pred_y))

LogisticRegression 0.6627584078987967
RandomForestClassifier 0.8654736192533169
SVC 0.8185745140388769
QuadraticDiscriminantAnalysis 0.8037642702869485
VotingClassifier 0.8565257636531934
