함유진

0.7_1

- 0.7_1
    
    ```python
    # <font color='CC3D3D'> Pycaret Anomaly
        
    ### 1. Random sampling
        To evaluate model performance quickly,
        100,000 rows were sampled from each of the train and public datasets.
        
        
    ### 2. Model experiments
        K-Nearest Neighbor / Isolation Forest / Minimum Covariance Determinant
        
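    - K-Nearest Neighbor
        
        Under the assumption that outliers lie far from other data points, the distance to the K nearest neighbors is computed.
        
        A point whose distance to its neighbors is large compared with other points is likely to be an outlier.
        
        - <span style="color:green"> **pycaret_knn.pickle** </span> created
        - <span style="color:blue"> **0.04310428143423214** </span> public f1_score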
        
    - Isolation Forest
        
        A tree-based unsupervised anomaly detection algorithm.
        
        - <span style="color:green"> **pycaret_iforest.pickle.csv** </span> created
        - <span style="color:blue"> **0.1201625386996904** </span> public f1_score
        
    - Minimum Covariance Determinant
        
        MCD estimates the mean and covariance matrix in a way that minimizes the influence of outliers.
    
        - <span style="color:green"> **pycaret_mcd.pickle** </span> created
        - <span style="color:blue"> **0.08543499511241447** </span> public f1_score
    ```
    

Create user_spec clusters

- Create user_spec clusters
    
    ```python
    # <font color='CC3D3D'> Clustering User Spec 
        
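    ### 1. Kmeans
        Customers were clustered with k-means on the numeric features of user_spec.
        The number of clusters was set to 5 by considering the elbow point and the silhouette score together.
        
    ### 2. Visualization
        umap.plot was used to check that the clusters were well formed.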
    ```
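A minimal sketch of how the elbow and silhouette criteria can be evaluated side by side, assuming scikit-learn and a synthetic stand-in for the numeric `user_spec` features. The data, the candidate range of `k`, and the standardization step are illustrative assumptions, not the author's actual code.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numeric user_spec features (assumption).
X, _ = make_blobs(n_samples=600, centers=5, cluster_std=0.6, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# For each candidate k, record inertia (for the elbow plot) and silhouette score.
results = {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    results[k] = (km.inertia_, silhouette_score(X_scaled, km.labels_))

for k, (inertia, sil) in results.items():
    print(f"k={k}  inertia={inertia:.1f}  silhouette={sil:.3f}")

# Silhouette gives a single best k; the elbow is usually read off the inertia curve by eye.
best_k = max(results, key=lambda k: results[k][1])
print("k with highest silhouette:", best_k)
```

The two criteria can disagree, which is why the notes above describe combining them before settling on 5 clusters.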
    

user_cluster_analysis

- user_cluster_analysis
    
    ```python
    # <font color='CC3D3D'> User Spec Clusters Analysis
        
    ### 1. Characteristics by cluster
        - Credit score
        - Annual income
        - Desired loan amount
        - Age
        - Existing loan amount
        
    ### 2. Event analysis by cluster
        - Churn (drop-off) rate during LoanApply
        - Number of GetCreditInfo uses
    ```
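A per-cluster profile of this kind could be sketched with a pandas `groupby`. The toy frame below is hypothetical: `credit_score`, `yearly_income`, and `desired_amount` mirror column names in the training data, while `cluster`, `completed_apply`, and `get_credit_info_cnt` are made-up names for illustration.

```python
import pandas as pd

# Hypothetical toy data; event columns are illustrative assumptions.
users = pd.DataFrame({
    "cluster": [0, 0, 1, 1, 2, 2],
    "credit_score": [600, 640, 720, 700, 810, 830],
    "yearly_income": [28e6, 32e6, 45e6, 47e6, 80e6, 90e6],
    "desired_amount": [5e6, 7e6, 10e6, 12e6, 30e6, 20e6],
    "completed_apply": [0, 1, 1, 1, 0, 1],       # 1 = finished the LoanApply flow
    "get_credit_info_cnt": [1, 3, 2, 2, 5, 7],   # GetCreditInfo events per user
})

# One row per cluster with mean feature values and event statistics.
profile = users.groupby("cluster").agg(
    credit_score_mean=("credit_score", "mean"),
    yearly_income_mean=("yearly_income", "mean"),
    desired_amount_mean=("desired_amount", "mean"),
    get_credit_info_mean=("get_credit_info_cnt", "mean"),
    completion_rate=("completed_apply", "mean"),
)

# Churn rate is taken here as the share of users who did not complete LoanApply.
profile["churn_rate"] = 1 - profile["completion_rate"]
print(profile)
```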

# Import

In [1]:
from pycaret.anomaly import *

In [2]:
import pandas as pd

In [3]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix, f1_score

# Data Load

In [4]:
train = pd.read_csv('../Data/master_train_data.csv') 
public = pd.read_csv('../Data/master_public_data.csv')
private = pd.read_csv('../Data/master_private_data.csv')
test = pd.read_csv('../Data/master_test_data.csv')

In [5]:
train = train.sample(n=100000)
public = public.sample(n=100000)
private = private.sample(n=100000)
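
The `sample` calls above pass no `random_state`, so unless NumPy's global seed is set elsewhere, each run draws a different subset and the scores reported below will vary between runs. A small sketch of reproducible sampling, using a toy frame and a hypothetical seed value:

```python
import pandas as pd

# Toy stand-in for one of the master datasets (assumption).
df = pd.DataFrame({
    "loan_limit": range(1000),
    "is_applied": [int(i % 20 == 0) for i in range(1000)],
})

SEED = 42  # hypothetical seed; any fixed integer works

# Two draws with the same random_state select exactly the same rows.
a = df.sample(n=100, random_state=SEED)
b = df.sample(n=100, random_state=SEED)
print(a.index.equals(b.index))
```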

In [6]:
train_target = train['is_applied']
public_target = public['is_applied']
# private_target = private['is_applied']

In [7]:
train = train.drop(['is_applied'],axis=1)
public = public.drop(['is_applied'],axis=1)
# private = private.drop(['is_applied'],axis=1)

In [8]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 200584 to 6078325
Data columns (total 71 columns):
 #   Column                               Non-Null Count   Dtype  
---  ------                               --------------   -----  
 0   loan_limit                           100000 non-null  float64
 1   loan_rate                            100000 non-null  float64
 2   credit_score                         100000 non-null  float64
 3   yearly_income                        100000 non-null  float64
 4   income_type                          100000 non-null  int64  
 5   employment_type                      100000 non-null  int64  
 6   houseown_type                        100000 non-null  int64  
 7   desired_amount                       100000 non-null  float64
 8   purpose                              100000 non-null  int64  
 9   personal_rehabilitation_yn           100000 non-null  float64
 10  personal_rehabilitation_complete_yn  100000 non-null  float64
 11  existin

# Pycaret

## Set Up

In [9]:
anomaly = setup(train,
                use_gpu = True,
                session_id = 42,
               )

| | Description | Value |
|---|---|---|
| 0 | session_id | 42 |
| 1 | Original Data | (100000, 71) |
| 2 | Missing Values | False |
| 3 | Numeric Features | 57 |
| 4 | Categorical Features | 14 |
| 5 | Ordinal Features | False |
| 6 | High Cardinality Features | False |
| 7 | High Cardinality Method | |
| 8 | Transformed Data | (100000, 120) |
| 9 | CPU Jobs | -1 |


## KNN

In [10]:
knn = create_model('knn')

In [11]:
print(knn)

KNN(algorithm='auto', contamination=0.05, leaf_size=30, method='largest',
  metric='minkowski', metric_params=None, n_jobs=-1, n_neighbors=5, p=2,
  radius=1.0)
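
The repr above matches PyOD's KNN detector, where `method='largest'` scores each point by the distance to its k-th nearest neighbor and `contamination` sets the share of points flagged as outliers. A pure-Python sketch of that idea on toy data follows; the points, `k`, and the contamination value are made up for illustration, and this is not PyOD's implementation.

```python
import math

def kth_neighbor_distance(points, k):
    """Return, for each point, the distance to its k-th nearest other point."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# Five tightly packed points and one far-away point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05), (5.0, 5.0)]
scores = kth_neighbor_distance(points, k=3)

# Flag the highest-scoring share of points, analogous to a contamination setting.
contamination = 0.2  # toy value
n_flag = max(1, round(contamination * len(points)))
threshold = sorted(scores, reverse=True)[n_flag - 1]
labels = [1 if s >= threshold else 0 for s in scores]
print(labels)
```

Only the isolated point receives label 1, because its third-nearest neighbor is far away while every other point has three neighbors close by.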


In [12]:
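# Convert the target Series to a DataFrame and add a string-typed copy of the label.
# Note: the new column is spelled 'is_appliled'; the scoring cells below use the original 'is_applied'.
public_target = pd.DataFrame(public_target)
public_target['is_appliled'] = public_target['is_applied'].apply(lambda x: '0' if x == 0 else '1')
public_target['is_appliled']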

890061    1
192333    0
17351     0
665469    0
4085      0
         ..
33520     0
225093    0
300397    0
662136    0
737918    0
Name: is_appliled, Length: 100000, dtype: object

In [13]:
unseen_predictions = predict_model(knn, data=public)
f1_score(public_target['is_applied'],unseen_predictions['Anomaly'])

0.04310428143423214
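
The score above treats `Anomaly == 1` as a prediction of `is_applied == 1`, so it measures how well the flagged anomalies overlap with applied loans. As a reminder of what the number means, here is a small sketch that computes F1 from the confusion counts on made-up labels:

```python
def f1_from_labels(y_true, y_pred):
    """F1 for the positive class (label 1), computed from TP, FP and FN."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy labels: 3 true positives exist, 2 points are flagged, 1 flag is correct.
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 0, 0]
f1 = f1_from_labels(y_true, y_pred)
print(round(f1, 4))  # precision 0.5, recall 1/3 -> 0.4
```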

In [17]:
save_model(knn,'../Model/pycaret_knn')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=[],
                                       display_types=True, features_todrop=[],
                                       id_columns=[], ml_usecase='regression',
                                       numerical_features=[],
                                       target='UNSUPERVISED_DUMMY_TARGET',
                                       time_features=[])),
                 ('imputer',
                  Simple_Imputer(categorical_strategy='most frequent',
                                 fill_value_categorical=None,
                                 fill_value_numerical=None...
                 ('fix_perfect', 'passthrough'),
                 ('clean_names', Clean_Colum_Names()),
                 ('feature_select', 'passthrough'), ('fix_multi', 'passthrough'),
                 ('dfs', 'passthrough'), ('pca', 'passthrough'),
                 ['trained_model',
                  KNN

## Iforest

In [18]:
iforest = create_model('iforest')

In [19]:
unseen_predictions = predict_model(iforest, data=public)
f1_score(public_target['is_applied'],unseen_predictions['Anomaly'])

0.1201625386996904
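
To illustrate how an isolation forest separates sparse points, the sketch below uses scikit-learn's `IsolationForest` directly on synthetic data rather than the PyCaret wrapper. The data, `n_estimators`, and the placement of the outliers are assumptions chosen for the example; `contamination=0.05` mirrors the value shown in the KNN repr earlier.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
inliers = rng.normal(0.0, 1.0, size=(500, 2))
# Five isolated points placed far from the main cloud (toy outliers).
outliers = np.array([[8.0, 8.0], [-8.0, 8.0], [8.0, -8.0], [-8.0, -8.0], [0.0, 10.0]])
X = np.vstack([inliers, outliers])

forest = IsolationForest(n_estimators=100, contamination=0.05, random_state=42).fit(X)

# scikit-learn returns +1 for inliers and -1 for outliers; map to a 0/1 anomaly flag.
pred = forest.predict(X)
anomaly = (pred == -1).astype(int)
print("flagged:", int(anomaly.sum()), "of", len(X))
print("far points flagged:", int(anomaly[-5:].sum()), "of 5")
```

Points that can be cut off from the rest with few random splits get short average path lengths across the trees, and those are the ones flagged.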

In [20]:
save_model(iforest,'../Model/pycaret_iforest')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=[],
                                       display_types=True, features_todrop=[],
                                       id_columns=[], ml_usecase='regression',
                                       numerical_features=[],
                                       target='UNSUPERVISED_DUMMY_TARGET',
                                       time_features=[])),
                 ('imputer',
                  Simple_Imputer(categorical_strategy='most frequent',
                                 fill_value_categorical=None,
                                 fill_value_numerical=None...
                 ('fix_perfect', 'passthrough'),
                 ('clean_names', Clean_Colum_Names()),
                 ('feature_select', 'passthrough'), ('fix_multi', 'passthrough'),
                 ('dfs', 'passthrough'), ('pca', 'passthrough'),
                 ['trained_model',
                  IFo

## MCD

In [21]:
mcd = create_model('mcd')

In [22]:
unseen_predictions = predict_model(mcd, data=public)
f1_score(public_target['is_applied'],unseen_predictions['Anomaly'])

0.08543499511241447
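
A sketch of the MCD idea using scikit-learn's `MinCovDet` on synthetic data: fit a robust mean and covariance, then flag the points with the largest Mahalanobis distances. The data and the 95% quantile cut-off are assumptions for illustration and are not taken from the PyCaret model above.

```python
import numpy as np
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(0)
# Correlated Gaussian cloud plus a small off-axis group of contaminating points.
clean = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=300)
contaminating = rng.normal(loc=[6.0, -6.0], scale=0.3, size=(15, 2))
X = np.vstack([clean, contaminating])

# Robust location/covariance estimate that down-weights the contaminating group.
mcd = MinCovDet(random_state=0).fit(X)

# mahalanobis() returns squared distances under the robust estimate.
d2 = mcd.mahalanobis(X)
threshold = np.quantile(d2, 0.95)
anomaly = (d2 > threshold).astype(int)
print("flagged:", int(anomaly.sum()), "contaminating flagged:", int(anomaly[-15:].sum()))
```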

In [23]:
save_model(mcd,'../Model/pycaret_mcd')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=[],
                                       display_types=True, features_todrop=[],
                                       id_columns=[], ml_usecase='regression',
                                       numerical_features=[],
                                       target='UNSUPERVISED_DUMMY_TARGET',
                                       time_features=[])),
                 ('imputer',
                  Simple_Imputer(categorical_strategy='most frequent',
                                 fill_value_categorical=None,
                                 fill_value_numerical=None...
                 ('dummy', Dummify(target='UNSUPERVISED_DUMMY_TARGET')),
                 ('fix_perfect', 'passthrough'),
                 ('clean_names', Clean_Colum_Names()),
                 ('feature_select', 'passthrough'), ('fix_multi', 'passthrough'),
                 ('dfs', 'passthrough'), ('pca',