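# BDA Society Data Analysis Preprocessing Applied Class: Week 3 Required Assignment

Submitted by: 이승섭89

Required Assignment 1 applies VarianceThreshold to the Titanic dataset.

Required Assignment 2 applies several feature selection methods to bank-additional.csv.

Python 3.10.14 is used.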


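## Required Assignment 2 (dataset provided directly by the instructor)
- A dataset with a very large number of features
- Apply feature selection to this data: which features should actually be extracted?
    - The reasoning behind the selection criteria
    - Code (with explanatory comments)
    - Which features were actually selected?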


In [15]:
# Import relevant libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, KBinsDiscretizer
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif, SelectFromModel, RFE
from sklearn.ensemble import RandomForestClassifier

In [16]:
# Load dataset and inspect data
data = pd.read_csv('./bank-additional.csv', sep=';')
print(data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4119 entries, 0 to 4118
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             4119 non-null   int64  
 1   job             4119 non-null   object 
 2   marital         4119 non-null   object 
 3   education       4119 non-null   object 
 4   default         4119 non-null   object 
 5   housing         4119 non-null   object 
 6   loan            4119 non-null   object 
 7   contact         4119 non-null   object 
 8   month           4119 non-null   object 
 9   day_of_week     4119 non-null   object 
 10  duration        4119 non-null   int64  
 11  campaign        4119 non-null   int64  
 12  pdays           4119 non-null   int64  
 13  previous        4119 non-null   int64  
 14  poutcome        4119 non-null   object 
 15  emp.var.rate    4119 non-null   float64
 16  cons.price.idx  4119 non-null   float64
 17  cons.conf.idx   4119 non-null   f

In [17]:
# Data Preprocessing
# Convert categorical variables to numeric using LabelEncoder
label_encoders = {}
for column in data.select_dtypes(include=['object']).columns:
    le = LabelEncoder()
    data[column] = le.fit_transform(data[column])
    label_encoders[column] = le

# Split the data into features and target
X = data.drop(columns=['y'])
y = data['y']

# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [18]:
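# Bin every column into 4 equal-width ordinal bins so that all values are non-negative,
# which the chi-squared test below requires.
# Note: this also bins the label-encoded categorical columns, not only the continuous ones.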
discretizer = KBinsDiscretizer(n_bins=4, encode='ordinal', strategy='uniform')  # Use uniform binning or another strategy
X_binned = discretizer.fit_transform(X_train)
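
The binning step matters for the chi-squared test that follows, because scikit-learn's `chi2` rejects negative inputs. The sketch below illustrates this on a made-up toy column (the values and the names `X_toy` and `y_toy` are assumptions for illustration, not data from bank-additional.csv): the raw negative values raise a `ValueError`, while the ordinal bin indices produced by `KBinsDiscretizer` are non-negative and can be scored.

```python
import numpy as np
from sklearn.feature_selection import chi2
from sklearn.preprocessing import KBinsDiscretizer

# Toy column containing negative values (illustrative only, not the real data).
X_toy = np.array([[-1.8], [-0.1], [1.1], [1.4], [-3.4], [-2.9]])
y_toy = np.array([1, 0, 0, 0, 1, 1])

# chi2 raises ValueError when any input value is negative.
try:
    chi2(X_toy, y_toy)
    raw_accepted = True
except ValueError:
    raw_accepted = False

# Equal-width binning maps each value to a bin index 0..3, which is non-negative.
binner = KBinsDiscretizer(n_bins=4, encode='ordinal', strategy='uniform')
X_toy_binned = binner.fit_transform(X_toy)
scores, p_values = chi2(X_toy_binned, y_toy)

print("Raw values accepted by chi2:", raw_accepted)
print("Binned values:", X_toy_binned.ravel())
print("Chi2 score on binned values:", scores)
```

Binning is only one way to satisfy this requirement; a shift-and-scale transform such as `MinMaxScaler` would also produce non-negative inputs, though the resulting scores would differ.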

In [19]:
# Chi-squared test
chi2_selector = SelectKBest(chi2, k='all')  # Set k='all' to get scores for all features
X_kbest_chi2 = chi2_selector.fit_transform(X_binned, y_train)

In [20]:
# Feature scores
chi2_scores = pd.DataFrame({'Feature': X.columns, 'Chi2 Score': chi2_selector.scores_})
print("Chi-squared Feature Scores:\n", chi2_scores.sort_values(by='Chi2 Score', ascending=False))

Chi-squared Feature Scores:
            Feature  Chi2 Score
10        duration  323.823695
13        previous  265.762926
18       euribor3m  251.477401
19     nr.employed  179.952384
15    emp.var.rate  144.381597
7          contact  136.643531
12           pdays   43.486097
4          default   39.750741
11        campaign   14.717543
3        education   10.822082
16  cons.price.idx    7.298364
0              age    5.870066
1              job    4.219376
14        poutcome    3.190393
17   cons.conf.idx    2.368286
2          marital    1.895930
8            month    0.617862
6             loan    0.544311
5          housing    0.370159
9      day_of_week    0.042994
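
One simple way to turn these scores into a selection is a score cutoff; the discussion section of this notebook uses 100. The sketch below applies that cutoff to the scores copied from the output above. The helper `select_above` and the dictionary name `chi2_score_by_feature` are hypothetical names introduced here for illustration.

```python
# Chi-squared scores copied from the printed output above.
chi2_score_by_feature = {
    "duration": 323.823695, "previous": 265.762926, "euribor3m": 251.477401,
    "nr.employed": 179.952384, "emp.var.rate": 144.381597, "contact": 136.643531,
    "pdays": 43.486097, "default": 39.750741, "campaign": 14.717543,
    "education": 10.822082, "cons.price.idx": 7.298364, "age": 5.870066,
    "job": 4.219376, "poutcome": 3.190393, "cons.conf.idx": 2.368286,
    "marital": 1.895930, "month": 0.617862, "loan": 0.544311,
    "housing": 0.370159, "day_of_week": 0.042994,
}


def select_above(scores, threshold):
    """Return the feature names whose score exceeds the threshold, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, score in ranked if score > threshold]


selected_chi2 = select_above(chi2_score_by_feature, threshold=100)
print(selected_chi2)
```

The cutoff itself is a judgment call: here it falls in the large gap between `contact` (about 137) and `pdays` (about 43), which is one reasonable way to justify it.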


In [21]:
# 3. Mutual Information for Feature Selection
mutual_info = SelectKBest(mutual_info_classif, k='all')  # Set k='all' to get scores for all features
X_kbest_mi = mutual_info.fit_transform(X_train, y_train)

In [22]:
# Feature scores
mi_scores = pd.DataFrame({'Feature': X.columns, 'Mutual Information Score': mutual_info.scores_})
print("Mutual Information Feature Scores:\n", mi_scores.sort_values(by='Mutual Information Score', ascending=False))

Mutual Information Feature Scores:
            Feature  Mutual Information Score
10        duration                  0.082129
18       euribor3m                  0.071280
17   cons.conf.idx                  0.065648
16  cons.price.idx                  0.063690
19     nr.employed                  0.058037
15    emp.var.rate                  0.050439
14        poutcome                  0.034141
12           pdays                  0.028132
13        previous                  0.023014
7          contact                  0.019241
8            month                  0.013041
11        campaign                  0.011405
9      day_of_week                  0.008256
1              job                  0.007324
3        education                  0.005637
0              age                  0.002682
6             loan                  0.000949
5          housing                  0.000734
2          marital                  0.000000
4          default                  0.000000
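
`mutual_info_classif` adds a small amount of random noise to continuous features, so these scores can vary slightly between runs unless `random_state` is fixed. The sketch below shows one possible way to pass a fixed seed through `SelectKBest` using `functools.partial`. It runs on synthetic data from `make_classification`, which is an assumption for illustration and not the bank dataset; the names `seeded_mi`, `X_demo`, and `y_demo` are likewise hypothetical.

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic stand-in data (illustrative only).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)

# Bind random_state so SelectKBest always calls the scorer with the same seed.
seeded_mi = partial(mutual_info_classif, random_state=42)

first_run = SelectKBest(seeded_mi, k='all').fit(X_demo, y_demo).scores_
second_run = SelectKBest(seeded_mi, k='all').fit(X_demo, y_demo).scores_

print("Scores identical across runs:", (first_run == second_run).all())
```

With a fixed seed, a threshold applied to the scores gives the same selection on every run of the notebook.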


In [23]:
# 4. Recursive Feature Elimination (RFE) with RandomForest
rfe_model = RandomForestClassifier(random_state=42)
rfe_selector = RFE(rfe_model, n_features_to_select=10)
rfe_selector.fit(X_train, y_train)

rfe_features = pd.DataFrame({'Feature': X.columns, 'Selected': rfe_selector.support_})
print("RFE Selected Features:\n", rfe_features[rfe_features['Selected']])

RFE Selected Features:
           Feature  Selected
0             age      True
1             job      True
3       education      True
9     day_of_week      True
10       duration      True
11       campaign      True
14       poutcome      True
17  cons.conf.idx      True
18      euribor3m      True
19    nr.employed      True


In [24]:
# 5. Feature Selection with SelectFromModel using RandomForest
sfm = SelectFromModel(RandomForestClassifier(random_state=42), threshold='mean')
sfm.fit(X_train, y_train)

sfm_features = pd.DataFrame({'Feature': X.columns, 'Importance': sfm.estimator_.feature_importances_})
print("Selected Features from SelectFromModel:\n", sfm_features[sfm.get_support()])

Selected Features from SelectFromModel:
         Feature  Importance
0           age    0.077554
10     duration    0.304329
18    euribor3m    0.101847
19  nr.employed    0.066121


## Required Assignment 2: Discussion

### From Chi-squared test (threshold = 100):
duration, previous, euribor3m, nr.employed, emp.var.rate, contact

### From Mutual-Information (threshold = 0.05):
duration, euribor3m, cons.price.idx, cons.conf.idx, nr.employed, emp.var.rate

### From RFE: 
age, job, education, day_of_week, duration, campaign, poutcome, cons.conf.idx, euribor3m, nr.employed

### From SelectFromModel: 
age, duration, euribor3m, nr.employed

### Conclusions:
duration, euribor3m, and nr.employed are the only features selected by all four methods, so they appear to be the most reliable features to keep.
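
The agreement between methods can be checked mechanically. The sketch below takes the four feature lists from this discussion, computes their intersection, and counts how many methods selected each feature. The variable names are hypothetical, and treating every method as an equal vote is an assumption of this sketch, not part of the original analysis.

```python
from collections import Counter

# Feature sets as listed in the discussion above.
selected_by_method = {
    "chi2": {"duration", "previous", "euribor3m", "nr.employed",
             "emp.var.rate", "contact"},
    "mutual_info": {"duration", "euribor3m", "cons.price.idx",
                    "cons.conf.idx", "nr.employed", "emp.var.rate"},
    "rfe": {"age", "job", "education", "day_of_week", "duration", "campaign",
            "poutcome", "cons.conf.idx", "euribor3m", "nr.employed"},
    "select_from_model": {"age", "duration", "euribor3m", "nr.employed"},
}

# Features chosen by every method.
common_features = set.intersection(*selected_by_method.values())

# Number of methods that selected each feature.
votes = Counter(
    feature for features in selected_by_method.values() for feature in features
)

print("Selected by all methods:", sorted(common_features))
print("Votes per feature:", votes.most_common())
```

The vote counts also show the runners-up, for example features chosen by two of the four methods, which could be considered if a larger feature set is wanted.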

In [25]:
# Blank cell for running entire notebook.