# HOMEWORK 1: Classification with the Help of PCA

In this homework we will use the "Credit Risk Prediction" dataset. Our goal is to reach the highest possible accuracy while reducing the number of dimensions of the data. The code that reads and cleans the data is provided below. After that, here is what you need to do:

1. Reduce the dimensionality of the data using PCA
- First, inspect the explained variance ratio to check how many dimensions the data can be reduced to.

- Then run experiments with different numbers of dimensions to obtain the reduced data.

2. Try classification models
- Logistic Regression

- Random Forest

- and, if you like, one more model of your choice

Optionally, you can also use other dimensionality reduction methods and try to reach the highest accuracy.
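
As one illustration of an "other method", the sketch below uses univariate feature selection instead of PCA. It runs on synthetic data from `make_classification`, because the credit dataset is not loaded at this point; the names `X_demo`, `y_demo`, `selector_clf`, and the choice of `k=5` are assumptions made for the example, not part of the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the credit data (assumption: 26 features, binary target).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=26, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, stratify=y_demo, random_state=0
)

# Scale, keep the 5 features with the highest ANOVA F-score, then classify.
selector_clf = make_pipeline(
    StandardScaler(),
    SelectKBest(score_func=f_classif, k=5),
    LogisticRegression(max_iter=1000),
)
selector_clf.fit(X_tr, y_tr)

# Shape of the test data after the reduction steps (everything but the classifier).
reduced_shape = selector_clf[:-1].transform(X_te).shape
demo_score = selector_clf.score(X_te, y_te)
print(reduced_shape, demo_score)
```

Unlike PCA, this keeps a subset of the original columns, so the reduced features stay interpretable.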

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

In [2]:
df: pd.DataFrame = pd.read_csv('desktop/credit_risk_dataset.csv')

In [3]:
print(df.isnull().sum())

person_age                       0
person_income                    0
person_home_ownership            0
person_emp_length              895
loan_intent                      0
loan_grade                       0
loan_amnt                        0
loan_int_rate                 3116
loan_status                      0
loan_percent_income              0
cb_person_default_on_file        0
cb_person_cred_hist_length       0
dtype: int64


In [4]:
# Fill null values with the column median
df["person_emp_length"] = df["person_emp_length"].fillna(df["person_emp_length"].median())
df["loan_int_rate"] = df["loan_int_rate"].fillna(df["loan_int_rate"].median())
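
The cell above fills the gaps with the median. A small sketch of why the median can be a safer fill value than the mean when a column contains extreme entries; the values in `emp_length` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy column (made-up values) with one extreme entry and two gaps.
emp_length = pd.Series([2.0, 4.0, 7.0, np.nan, 123.0, np.nan])

mean_fill = emp_length.mean()      # pulled upward by the 123.0 entry
median_fill = emp_length.median()  # barely affected by the extreme entry

# fillna returns a new Series; keep the result by assigning it.
filled = emp_length.fillna(median_fill)
print(mean_fill, median_fill, filled.isnull().sum())
```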

In [5]:
df.duplicated().sum()

165

In [6]:
df.drop_duplicates(inplace=True)

In [7]:
df.describe().T

                              count          mean           std     min      25%      50%      75%        max
person_age                  32416.0     27.747008        6.3541    20.0     23.0     26.0     30.0      144.0
person_income               32416.0  66091.640826  62015.580269  4000.0  38542.0  55000.0  79218.0  6000000.0
person_emp_length           32416.0       4.76888      4.090411     0.0      2.0      4.0      7.0      123.0
loan_amnt                   32416.0   9593.845632   6322.730241   500.0   5000.0   8000.0  12250.0    35000.0
loan_int_rate               32416.0     11.014662       3.08305    5.42     8.49    10.99    13.11      23.22
loan_status                 32416.0      0.218688      0.413363     0.0      0.0      0.0      0.0        1.0
loan_percent_income         32416.0       0.17025      0.106812     0.0     0.09     0.15     0.23       0.83
cb_person_cred_hist_length  32416.0      5.811297       4.05903     2.0      3.0      4.0      8.0       30.0


In [8]:
# Outlier removal
df = df[df['person_age']<=100]
df = df[df['person_emp_length'] <= 60]
df = df[df['person_income']<=4e6]

In [9]:
# Select the categorical columns; they are one-hot encoded in the next cell
cat_cols = df.select_dtypes(include=['object'])
cat_cols.columns

Index(['person_home_ownership', 'loan_intent', 'loan_grade',
       'cb_person_default_on_file'],
      dtype='object')

In [10]:
encoded_cat_cols = pd.get_dummies(cat_cols)
df.drop(df.select_dtypes(include=['object']).columns, axis=1,inplace=True)
df = pd.concat([df,encoded_cat_cols], axis=1)

In [11]:
X = df.drop('loan_status', axis=1).values
y = df['loan_status'].values

In [12]:
# Split the data into train and test sets

from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(1, test_size=0.1)
train_idx, test_idx = next(split.split(X, y))
train_x = X[train_idx]
test_x = X[test_idx]

train_y = y[train_idx]
test_y = y[test_idx]
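
A stratified split like this one gives different rows on every run unless a seed is fixed. The sketch below, on made-up labels, shows the same splitter with `random_state` set and checks that the positive rate is preserved in both parts; the `demo_` names and the seed value 42 are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy imbalanced labels (assumption: 20% positives, similar to loan_status).
demo_y = np.array([1] * 200 + [0] * 800)
demo_X = np.arange(len(demo_y)).reshape(-1, 1)

# A fixed random_state makes the split repeatable across runs.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=42)
demo_train_idx, demo_test_idx = next(splitter.split(demo_X, demo_y))

# Stratification keeps the positive rate the same in both parts.
train_rate = demo_y[demo_train_idx].mean()
test_rate = demo_y[demo_test_idx].mean()
print(len(demo_train_idx), len(demo_test_idx), train_rate, test_rate)
```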

## Good luck!

In [13]:
from sklearn.decomposition import PCA

In [14]:
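# Fit a PCA that keeps every component, used to inspect the explained variance.
# fit_transform returns the projected data, not a reconstruction.
pca = PCA()
X_pca = pca.fit_transform(X)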

In [15]:
pca.components_.shape

(26, 26)
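
With a full PCA fitted, the number of dimensions to keep can be read from the cumulative explained variance ratio. A minimal sketch on synthetic data, assuming a 95% threshold and standardized features (both are choices made for this example, and `X_demo`, `k_95`, `pca_95` are hypothetical names):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training features (assumption: 26 numeric columns).
X_demo, _ = make_classification(
    n_samples=500, n_features=26, n_informative=6, random_state=0
)
X_scaled = StandardScaler().fit_transform(X_demo)

pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio exceeds 95%.
k_95 = int(np.searchsorted(cumulative, 0.95, side="right") + 1)

# Shortcut: a float n_components lets PCA pick that count itself.
pca_95 = PCA(n_components=0.95).fit(X_scaled)
print(k_95, pca_95.n_components_)
```

Plotting `cumulative` with matplotlib gives the usual elbow-style curve for the same decision.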

In [16]:
pca_model = PCA(n_components=3)

In [17]:
train_xK3 = pca_model.fit_transform(train_x)

In [18]:
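# Project the test set onto the components learned from train_x.
# transform (not fit_transform) avoids re-fitting PCA on the test data;
# the scores recorded below were produced with fit_transform here.
test_xK3 = pca_model.transform(test_x)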
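
PCA axes are meant to be learned from the training data and then reused for the test data. The sketch below, on synthetic data with hypothetical names (`pca_demo`, `X_tr_3`, `X_te_3`), shows the fit-on-train and transform-on-test pattern and checks that a PCA re-fitted on the test set ends up with different axes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the credit features (assumption).
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

pca_demo = PCA(n_components=3)
X_tr_3 = pca_demo.fit_transform(X_tr)  # learn the axes from training data only
X_te_3 = pca_demo.transform(X_te)      # reuse those axes for the test data

# A PCA re-fitted on the test set learns its own axes, which need not match.
refit_axes = PCA(n_components=3).fit(X_te).components_
same_axes = np.allclose(refit_axes, pca_demo.components_)
print(X_te_3.shape, same_axes)
```

When the axes differ, a classifier trained on the training projection sees test features in a different coordinate system, which can depress accuracy.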

In [19]:
train_xK3.shape

(29168, 3)

In [20]:
# Sanity check on the full PCA from In [14]: with all 26 components kept, the ratios sum to 1
np.sum(pca.explained_variance_ratio_)

1.0000000000000002

In [21]:
pca_model.explained_variance_ratio_

array([9.91597792e-01, 8.40218948e-03, 1.14981956e-08])
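
A first component that explains about 99% of the variance may simply reflect feature scale: in the `describe()` output, `person_income` has a standard deviation near 62,000 while most other columns are in single digits. The sketch below illustrates the effect on two made-up, independent features; it is an assumption that standardizing would help on the real data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two independent made-up features on very different scales.
income = rng.normal(60_000, 20_000, size=1000)  # large-scale feature
rate = rng.normal(11, 3, size=1000)             # small-scale feature
X_demo = np.column_stack([income, rate])

# Without scaling, the large-scale feature dominates the first component.
raw_ratio = PCA().fit(X_demo).explained_variance_ratio_

# After standardization, both features contribute comparably.
X_scaled = StandardScaler().fit_transform(X_demo)
scaled_ratio = PCA().fit(X_scaled).explained_variance_ratio_

print(raw_ratio)
print(scaled_ratio)
```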

In [22]:
from sklearn.linear_model import LogisticRegression

In [23]:
from sklearn.metrics import accuracy_score

In [24]:
from sklearn.ensemble import RandomForestClassifier

In [25]:
log_model = LogisticRegression()

In [26]:
log_model.fit(train_xK3, train_y)

LogisticRegression()

In [27]:
pred_y = log_model.predict(test_xK3)

In [28]:
accuracy_score(test_y, pred_y)

0.7926565874730022

In [29]:
rf_clf = RandomForestClassifier().fit(train_xK3, train_y)

In [30]:
pred_y = rf_clf.predict(test_xK3)
accuracy_score(test_y, pred_y)

0.8133292193767355

In [31]:
# Repeat the experiment with a single component (the variables below keep the K4 suffix)
pca_model = PCA(n_components=1)

In [32]:
train_xK4 = pca_model.fit_transform(train_x)

In [33]:
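# Project the test set onto the component learned from train_x.
# transform (not fit_transform) avoids re-fitting PCA on the test data;
# the scores recorded below were produced with fit_transform here.
test_xK4 = pca_model.transform(test_x)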

In [34]:
train_xK4.shape

(29168, 1)

In [35]:
# `pca` is still the full 26-component model from In [14], so this sum is again 1
np.sum(pca.explained_variance_ratio_)

1.0000000000000002

In [36]:
pca_model.explained_variance_ratio_

array([0.99159779])

In [37]:
log_model.fit(train_xK4, train_y)

LogisticRegression()

In [38]:
pred_y = log_model.predict(test_xK4)

In [39]:
accuracy_score(test_y, pred_y)

0.5010799136069114

In [40]:
rf_clf = RandomForestClassifier().fit(train_xK4, train_y)

In [41]:
pred_y = rf_clf.predict(test_xK4)
accuracy_score(test_y, pred_y)

0.644554149953718
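
The two experiments above can be generalized into a loop over several component counts and both classifiers. This is a minimal sketch on synthetic data; the pipeline with `StandardScaler`, the component counts `(1, 3, 10)`, and the names `models` and `results` are assumptions for illustration, not the assignment's required solution.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the credit data (assumption: 26 features, binary target).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=26, n_informative=6, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, stratify=y_demo, random_state=0
)

# Factories so every run gets a fresh, unfitted model.
models = {
    "logreg": lambda: LogisticRegression(max_iter=1000),
    "forest": lambda: RandomForestClassifier(n_estimators=50, random_state=0),
}

results = {}
for k in (1, 3, 10):
    for name, make_model in models.items():
        # The pipeline fits the scaler and PCA on the training data only.
        pipe = make_pipeline(StandardScaler(), PCA(n_components=k), make_model())
        pipe.fit(X_tr, y_tr)
        results[(name, k)] = pipe.score(X_te, y_te)

for key, acc in sorted(results.items()):
    print(key, round(acc, 3))
```

Wrapping the steps in one pipeline means `score` applies the already-fitted scaler and PCA to the test data, so no step is re-fitted on it.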