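## Selecting features with various feature selection methods
- <b> Variance-based selection: </b>
- Removes features whose variance is low.
- A low-variance feature barely changes across samples, so it carries little information: the model has almost nothing to learn from it, and it does not help in finding meaningful patterns.
- Low-variance features can act as noise that degrades model performance, since they add model complexity and can lead to overfitting. Removing such unnecessary features helps the model generalize better.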
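
As a minimal sketch of the idea above, the following applies `VarianceThreshold` to a small made-up array. The array `X_toy` and its values are hypothetical and are not taken from `data.csv`.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy data: 6 samples, 3 features.
# Column 0 is constant, column 1 alternates 0/1, column 2 varies widely.
X_toy = np.array([
    [1.0, 0.0, 10.0],
    [1.0, 1.0, 20.0],
    [1.0, 0.0, 30.0],
    [1.0, 1.0, 40.0],
    [1.0, 0.0, 50.0],
    [1.0, 1.0, 60.0],
])

# Keep only the columns whose variance exceeds the threshold.
selector = VarianceThreshold(threshold=0.2)
X_reduced = selector.fit_transform(X_toy)

print(selector.variances_)     # per-column variance
print(selector.get_support())  # boolean mask of the kept columns
print(X_reduced.shape)
```

The constant column has zero variance and is dropped, while the other two columns pass the threshold.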

In [1]:
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

In [2]:
# Set the file path
file_path = "data.csv"

# Load the CSV file into a DataFrame
data = pd.read_csv(file_path)

In [3]:
data

Unnamed: 0,Bankrupt?,ROA(C) before interest and depreciation before interest,ROA(A) before interest and % after tax,ROA(B) before interest and depreciation after tax,Operating Gross Margin,Realized Sales Gross Margin,Operating Profit Rate,Pre-tax net Interest Rate,After-tax net Interest Rate,Non-industry income and expenditure/revenue,...,Net Income to Total Assets,Total assets to GNP price,No-credit Interval,Gross Profit to Sales,Net Income to Stockholder's Equity,Liability to Equity,Degree of Financial Leverage (DFL),Interest Coverage Ratio (Interest expense to EBIT),Net Income Flag,Equity to Liability
0,1,0.370594,0.424389,0.405750,0.601457,0.601457,0.998969,0.796887,0.808809,0.302646,...,0.716845,0.009219,0.622879,0.601453,0.827890,0.290202,0.026601,0.564050,1,0.016469
1,1,0.464291,0.538214,0.516730,0.610235,0.610235,0.998946,0.797380,0.809301,0.303556,...,0.795297,0.008323,0.623652,0.610237,0.839969,0.283846,0.264577,0.570175,1,0.020794
2,1,0.426071,0.499019,0.472295,0.601450,0.601364,0.998857,0.796403,0.808388,0.302035,...,0.774670,0.040003,0.623841,0.601449,0.836774,0.290189,0.026555,0.563706,1,0.016474
3,1,0.399844,0.451265,0.457733,0.583541,0.583541,0.998700,0.796967,0.808966,0.303350,...,0.739555,0.003252,0.622929,0.583538,0.834697,0.281721,0.026697,0.564663,1,0.023982
4,1,0.465022,0.538432,0.522298,0.598783,0.598783,0.998973,0.797366,0.809304,0.303475,...,0.795016,0.003878,0.623521,0.598782,0.839973,0.278514,0.024752,0.575617,1,0.035490
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6814,0,0.493687,0.539468,0.543230,0.604455,0.604462,0.998992,0.797409,0.809331,0.303510,...,0.799927,0.000466,0.623620,0.604455,0.840359,0.279606,0.027064,0.566193,1,0.029890
6815,0,0.475162,0.538269,0.524172,0.598308,0.598308,0.998992,0.797414,0.809327,0.303520,...,0.799748,0.001959,0.623931,0.598306,0.840306,0.278132,0.027009,0.566018,1,0.038284
6816,0,0.472725,0.533744,0.520638,0.610444,0.610213,0.998984,0.797401,0.809317,0.303512,...,0.797778,0.002840,0.624156,0.610441,0.840138,0.275789,0.026791,0.565158,1,0.097649
6817,0,0.506264,0.559911,0.554045,0.607850,0.607850,0.999074,0.797500,0.809399,0.303498,...,0.811808,0.002837,0.623957,0.607846,0.841084,0.277547,0.026822,0.565302,1,0.044009


In [4]:
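# Set up the independent variables (X) and the dependent variable (y)
# Note: X still contains the target column 'Bankrupt?'; it is dropped by the
# variance filter below only because its variance is under the threshold
X = data
y = data['Bankrupt?']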

In [5]:
X_names = X.columns

In [6]:
# Fit a selector that keeps only the features whose variance exceeds 0.2
sel = VarianceThreshold(threshold=0.2).fit(X)

# Inspect the variance of each feature
variances = sel.variances_
print(f'Variance of each feature: {variances}')

# Apply the selection, keeping only the features above the threshold
X_selected = sel.transform(X)

# Names of the selected features
X_selected_names = [X_names[i] for i in sel.get_support(indices=True)]

print(f'Names of the selected features: {X_selected_names}')

Variance of each feature: [3.12219072e-02 3.68220668e-03 4.30535700e-03 3.79336416e-03
 2.86711955e-04 2.86111462e-04 1.69235931e-04 1.65586576e-04
 1.84950661e-04 1.24604113e-04 1.60733568e-04 1.04810597e+19
 6.75012895e+18 2.90166291e-04 1.17217636e+16 1.92258548e-02
 1.11471313e-03 1.12031183e-03 1.12075524e-03 1.10623918e-03
 3.10098773e-04 2.67323105e+15 7.80654544e-04 1.10076482e-03
 1.45887370e-04 1.15598815e-04 1.91878083e-04 1.93467609e-04
 1.01248377e-04 8.39553690e+18 1.30304550e+16 9.28043687e-05
 4.29942102e-04 1.10887094e+15 5.98618462e+16 1.26284495e-04
 2.83567266e+16 2.90697304e-03 2.90697304e-03 7.92471013e-04
 2.65200220e-04 1.48534380e-04 7.71708889e-04 9.47563177e-04
 1.77505062e-04 1.02288044e-02 7.74171821e+16 6.57102456e+16
 1.05477427e+19 6.13739008e+18 1.34525030e-03 1.86657445e+16
 1.07045083e-03 8.67212378e+16 3.48691104e-03 4.07972352e-02
 4.75657902e-02 1.93878814e-02 2.94494169e+16 2.60419848e+17
 2.52869880e-03 1.23514340e-03 1.09580685e-04 3.38734330e+17
 4.27090221e-0
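
The variances printed above range from roughly 1e-4 to 1e19, which suggests the columns are on very different scales. Variance depends on a feature's units, so a fixed threshold such as 0.2 tends to keep large-scale columns regardless of how informative they are; a feature bounded in [0, 1] can never have a variance above 0.25. The sketch below illustrates this with synthetic data. The arrays `ratio` and `ratio_scaled` and the random seed are assumptions made for illustration, not values from `data.csv`.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)

# Hypothetical feature bounded in [0, 1], like a normalized ratio.
ratio = rng.uniform(0.0, 1.0, size=1000)

# The same information expressed in much larger units.
ratio_scaled = ratio * 1e6

X_demo = np.column_stack([ratio, ratio_scaled])
demo_selector = VarianceThreshold(threshold=0.2).fit(X_demo)

print(demo_selector.variances_)     # small variance vs. very large variance
print(demo_selector.get_support())  # only the rescaled copy passes the threshold
```

Both columns carry identical information, yet only the rescaled one survives, so the result of a fixed threshold should be read with the feature scales in mind.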

## Interim summary
- <b> Features remaining after excluding the low-variance ones: </b>
[' Operating Expense Rate', ' Research and development expense rate', ' Interest-bearing debt interest rate', ' Revenue Per Share (Yuan ¥)', ' Total Asset Growth Rate', ' Net Value Growth Rate', ' Current Ratio', ' Quick Ratio', ' Total debt/Total net worth', ' Accounts Receivable Turnover', ' Average Collection Days', ' Inventory Turnover Rate (times)', ' Fixed Assets Turnover Frequency', ' Revenue per person', ' Allocation rate per person', ' Quick Assets/Current Liability', ' Cash/Current Liability', ' Inventory/Current Liability', ' Long-term Liability to Current Assets', ' Current Asset Turnover Rate', ' Quick Asset Turnover Rate', ' Cash Turnover Rate', ' Fixed Assets to Assets', ' Total assets to GNP price']

In [8]:
data = data[['Bankrupt?',' Operating Expense Rate', ' Research and development expense rate', ' Interest-bearing debt interest rate',
                   ' Revenue Per Share (Yuan ¥)', ' Total Asset Growth Rate', ' Net Value Growth Rate', ' Current Ratio',
                   ' Quick Ratio', ' Total debt/Total net worth', ' Accounts Receivable Turnover', ' Average Collection Days', ' Inventory Turnover Rate (times)',
                   ' Fixed Assets Turnover Frequency', ' Revenue per person', ' Allocation rate per person', ' Quick Assets/Current Liability', ' Cash/Current Liability',
                   ' Inventory/Current Liability', ' Long-term Liability to Current Assets', ' Current Asset Turnover Rate', ' Quick Asset Turnover Rate',
                   ' Cash Turnover Rate', ' Fixed Assets to Assets', ' Total assets to GNP price']]

In [10]:
# Correlation analysis: compute the correlation matrix, then rank by absolute correlation with the target
corr_matrix = data.corr()

corr_with_target = corr_matrix['Bankrupt?'].abs().sort_values(ascending=False)[:10]
print(corr_with_target)

Bankrupt?                           1.000000
 Cash/Current Liability             0.077921
 Fixed Assets Turnover Frequency    0.072818
 Fixed Assets to Assets             0.066328
 Net Value Growth Rate              0.065329
 Total Asset Growth Rate            0.044431
 Revenue per person                 0.039718
 Total assets to GNP price          0.035104
 Quick Asset Turnover Rate          0.025814
 Quick Ratio                        0.025058
Name: Bankrupt?, dtype: float64
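
The ranking above includes `Bankrupt?` itself (correlation 1.0), so only nine of the ten rows are actual features. One possible way to rank features without the target is `DataFrame.corrwith`, sketched here on a made-up frame. The frame `toy` and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame: one binary target and two features.
toy = pd.DataFrame({
    "target": [0, 0, 0, 1, 1, 1],
    "feat_a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "feat_b": [5.0, 1.0, 4.0, 2.0, 6.0, 3.0],
})

# Correlate every feature column with the target, excluding the target itself.
features = toy.drop(columns=["target"])
toy_corr_with_target = (
    features.corrwith(toy["target"])
    .abs()
    .sort_values(ascending=False)
)
print(toy_corr_with_target)
```

Because the target column is dropped before the correlation is computed, every row of the result is a genuine feature.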


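## Analysis
- After excluding the low-variance features, the correlations with the target are all very low: the largest absolute correlation with `Bankrupt?` is only about 0.078.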