# BDA Society Data Analysis Preprocessing Applied Track: Week 3 Review Assignment

Submitted by: 이승섭89

This notebook reinterprets the Week 3 code.

Python 3.10.14 is used.

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.preprocessing import OneHotEncoder

from sklearn.feature_selection import VarianceThreshold
from sklearn.feature_selection import SelectKBest, chi2

### Analysis with sklearn.feature_selection
#### VarianceThreshold()
- Select features according to variance.
    - Remove features with low variance $\to$ less importance in model.
    - Set threshold to remove features.

In [None]:
# Create dummy data X
X = [[0,2,0,3],
     [0,1,2,3],
     [0,1,1,5]]

In [None]:
# Set threshold as 0.2
selector = VarianceThreshold(threshold=0.2)
X_high_variance = selector.fit_transform(X)
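
To see what the selector did with the dummy data, the fitted object can be inspected. The following is a minimal sketch that rebuilds the same matrix under the hypothetical names `X_demo` and `selector_demo`; `variances_` and `get_support()` are attributes of the fitted `VarianceThreshold`.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Same dummy matrix as above, under a hypothetical name to avoid clobbering X.
X_demo = np.array([[0, 2, 0, 3],
                   [0, 1, 2, 3],
                   [0, 1, 1, 5]])

selector_demo = VarianceThreshold(threshold=0.2)
X_demo_high_variance = selector_demo.fit_transform(X_demo)

# Per-column population variance (ddof=0) learned during fit.
print(selector_demo.variances_)

# Boolean mask of the columns whose variance is strictly above the threshold.
kept_mask = selector_demo.get_support()
print(kept_mask)
print(X_demo_high_variance.shape)
```

The first column is constant (variance 0) and is dropped. The other three columns have variances above 0.2 and are kept.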

#### Mathematical Definition of Variance
- Variance measures how far the data points are from the mean, as the average squared deviation.
- If the values are all close to the mean, the variance is low.
    - The feature values are nearly constant.
    - A feature with almost constant values $\to$ does not help predict the target variable.
        - Because its value barely changes across samples, examining it gives almost no information on the target variable.
        - It fails to add information and may even diminish model predictive performance.
        - It may contribute to overfitting.
- If the data points are far from the mean, the variance is high.
    - This does not by itself mean high-variance features are important.
    - They may deserve priority, but this should be judged with domain-specific knowledge.
- Threshold
    - The default threshold of `VarianceThreshold` is 0, which removes only constant features.
    - 0.1~0.5: Usually acceptable, may differ according to domain.
        - Depends on the scale of the feature and, for a binary feature, on the ratio of 0s to 1s.
        - Search based on the actual feature variances.
        - Test and iterate to evaluate the performance of each threshold.
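
The threshold search described above can be sketched as a small sweep. The toy DataFrame, its column names, and the candidate thresholds below are all hypothetical, chosen only to show that a binary feature has variance $p(1-p)$ while an unscaled numeric column can have a far larger variance.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy data: a rarely-set flag, a balanced flag, and a wide-range column.
toy = pd.DataFrame({
    "rare_flag": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],      # p = 0.1 -> variance 0.09
    "balanced_flag": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],  # p = 0.5 -> variance 0.25
    "amount": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
})

# Population variance (ddof=0), the convention VarianceThreshold uses.
print(toy.var(ddof=0))

# Sweep candidate thresholds and record which columns survive each one.
kept_by_threshold = {}
for threshold in [0.0, 0.1, 0.16, 0.3]:
    selector = VarianceThreshold(threshold=threshold).fit(toy)
    kept_by_threshold[threshold] = list(toy.columns[selector.get_support()])
    print(threshold, kept_by_threshold[threshold])
```

Because `amount` is on a much larger scale, it survives every threshold here. This is one reason to look at the actual variances, and possibly rescale the features, before fixing a cutoff.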

#### $\chi^{2}$ Test
- High $\chi^{2}$ value: Feature is strongly associated with the target variable.
- Fundamentals of Chi-squared:
    - Calculates the difference between expected values and observed values.
    - Expected values are computed under the assumption that the feature and the target variable are independent.
    - High Chi-squared value: Difference between expected and observed values is large.
        - Feature and target variable are strongly dependent.
        - Feature is related to the target variable more than independence would predict.
        - A feature with a high Chi-squared value is treated as an indicative feature with high predictive power.
    - Categorical/Continuous Data
        - Chi-squared works on categorical data, so continuous data must be converted to categorical data first.
    - The p-value is used as the threshold (p < 0.05).
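
The expected-versus-observed idea above can be worked through by hand with $\chi^{2} = \sum (O - E)^{2} / E$. The sketch below uses a made-up 2x2 contingency table and cross-checks the result with `scipy.stats.chi2_contingency`. It assumes SciPy is installed, which is normally the case because scikit-learn depends on it.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table (counts are made up for illustration).
# Rows: feature category A / B, columns: target 0 / 1.
observed = np.array([[30, 10],
                     [20, 40]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
total = observed.sum()

# Expected counts if the feature and the target were independent.
expected = row_totals * col_totals / total

# Chi-squared statistic: sum of (observed - expected)^2 / expected.
chi2_stat = ((observed - expected) ** 2 / expected).sum()

# Cross-check with SciPy (continuity correction off to match the plain formula).
scipy_stat, p_value, dof, scipy_expected = chi2_contingency(observed, correction=False)

print(expected)
print(chi2_stat, scipy_stat, p_value, dof)
```

Note that `sklearn.feature_selection.chi2` builds its observed counts by summing each non-negative feature within each class, so its scores are not in general identical to a contingency-table test like this one.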

In [None]:
# Create dummy dataset

X = np.array([[1,2,3],
              [4,5,6],
              [7,8,9],
              [10,11,12]])
y = np.array([0,1,0,1]) # Target values (categorical)

In [None]:
# Select top 2 features
selector = SelectKBest(chi2, k=2)
X_new = selector.fit_transform(X, y)
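
To check which two columns `SelectKBest` kept for this dummy dataset, the fitted selector exposes `scores_`, `pvalues_`, and `get_support()`. The following is a minimal sketch that repeats the same data under the hypothetical names `X_demo`, `y_demo`, and `selector_demo`.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Same dummy dataset as above, under hypothetical names.
X_demo = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9],
                   [10, 11, 12]])
y_demo = np.array([0, 1, 0, 1])

selector_demo = SelectKBest(chi2, k=2)
X_demo_new = selector_demo.fit_transform(X_demo, y_demo)

# One chi2 score and one p-value per input column.
print(selector_demo.scores_)
print(selector_demo.pvalues_)

# Indices of the k columns with the highest scores.
selected_idx = selector_demo.get_support(indices=True)
print(selected_idx)
print(X_demo_new)
```

With only four rows, none of the p-values fall below 0.05, so the ranking here illustrates the mechanics rather than a meaningful selection.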

In [None]:
# Load Titanic dataset and fill missing values
tt = sns.load_dataset('titanic')
tt['age'] = tt['age'].fillna(tt['age'].median())
tt['embark_town'] = tt['embark_town'].fillna(tt['embark_town'].mode()[0])
tt['fare'] = tt['fare'].fillna(tt['fare'].median())

In [None]:
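# Select features and label
# .copy() makes X an independent DataFrame, so adding columns later
# does not trigger SettingWithCopyWarning
X = tt[['pclass', 'sex', 'age', 'fare', 'embark_town']].copy()
y = tt['survived']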

In [None]:
# Bin continuous variables into quartiles with qcut (labels=False returns bin indices 0-3)

X.loc[:, 'age_binned'] = pd.qcut(tt['age'], q=4, labels=False)
X.loc[:, 'fare_binned'] = pd.qcut(tt['fare'], q=4, labels=False)
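
As a small illustration of what `pd.qcut(..., q=4, labels=False)` returns, here is a sketch on a hypothetical series of eight ages (the values are invented for the example).

```python
import pandas as pd

# Hypothetical ages, already sorted so the bin codes are easy to read.
ages = pd.Series([2, 15, 22, 28, 35, 40, 58, 71])

# q=4 cuts at the 25th, 50th and 75th percentiles; labels=False returns
# the integer bin index (0 = lowest quartile, 3 = highest).
age_bins = pd.qcut(ages, q=4, labels=False)
print(age_bins.tolist())
```

If a column has many tied values, two quantile edges can coincide and `qcut` raises an error. Passing `duplicates='drop'` merges such edges at the cost of fewer bins.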

In [None]:
# Drop the raw continuous columns, then one-hot encode the remaining categorical ones
X = X.drop(columns=['age', 'fare'])
onehot_encoder = OneHotEncoder(sparse_output=False, drop='first')

X_encoded = onehot_encoder.fit_transform(X)

In [None]:
# Score every encoded feature with chi2 (k='all' keeps all of them)

chi_selector = SelectKBest(chi2, k='all')
X_selected_all = chi_selector.fit_transform(X_encoded, y)

In [None]:
# Calculate chi2 scores
chi_scores = pd.DataFrame({
    'Feature': onehot_encoder.get_feature_names_out(X.columns),
    'Score': chi_selector.scores_}).sort_values(by='Score', ascending=True)
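
Since the notes mention p < 0.05 as the cutoff, the fitted selector's `pvalues_` can be used next to `scores_`. The sketch below uses synthetic binary data with hypothetical names, not the Titanic features: one column mostly follows the label, the other is unrelated to it.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic labels: 50 samples of class 0 followed by 50 of class 1.
y_toy = np.array([0] * 50 + [1] * 50)

# Informative feature: equals the label except for 5 flipped rows per class.
informative = y_toy.copy()
informative[:5] = 1
informative[50:55] = 0

# Noise feature: alternates 0/1 regardless of the label.
noise = np.tile([0, 1], 50)

X_toy = np.column_stack([informative, noise])

toy_selector = SelectKBest(chi2, k="all").fit(X_toy, y_toy)

# Flag features whose chi2 p-value falls below the 0.05 cutoff.
significant = toy_selector.pvalues_ < 0.05
print(toy_selector.scores_, toy_selector.pvalues_, significant)
```

The same comparison could be applied to `chi_selector.pvalues_` from the cell above, for example by adding it as another column of the `chi_scores` DataFrame.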

In [None]:
# Visualize chi2 scores
plt.figure(figsize=(10, 5))
plt.barh(chi_scores['Feature'], chi_scores['Score'], color='lightgreen')

In [None]:
# Use top 2 features

chi_selector_2 = SelectKBest(chi2, k=2)
X_selected_2 = chi_selector_2.fit_transform(X_encoded, y)

In [None]:
# Get the names and chi2 scores of the 2 selected features
selected_indices = chi_selector_2.get_support(indices=True)
selected_features = onehot_encoder.get_feature_names_out(X.columns)[selected_indices]
chi_scores_2 = chi_selector_2.scores_[selected_indices]

In [None]:
# Visualize chi2 scores
plt.figure(figsize=(10, 5))
plt.barh(selected_features, chi_scores_2, color='skyblue')

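## Required Assignment 1:
Do feature selection on the Titanic data with the example code that was not run on it (Variance Threshold).
What threshold did you use? Why that value, and how did you find it?
Which features were selected?

## Required Assignment 2:
From the given large dataset, perform feature selection and state which features you would extract, and why.
Code + explanatory comments, and describe the selected features.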

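### Next Class
Besides Chi-squared, there are other methods such as ANOVA and MANOVA.
Mutual information can also be used. At a minimum, mutual information will be covered.
Papers also deliberate over feature selection, and statistical approaches are widely used.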
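
As a preview of the alternatives named above, scikit-learn provides `f_classif` (the ANOVA F-test) and `mutual_info_classif` as scoring functions. The following is a minimal sketch on synthetic binary data with hypothetical names. It is not course material, and MANOVA is not covered here.

```python
import numpy as np
from sklearn.feature_selection import f_classif, mutual_info_classif

# Synthetic labels and two binary features (one informative, one unrelated).
y_toy = np.array([0] * 50 + [1] * 50)
informative = y_toy.copy()
informative[:5] = 1      # flip 5 rows of class 0
informative[50:55] = 0   # flip 5 rows of class 1
noise = np.tile([0, 1], 50)
X_toy = np.column_stack([informative, noise])

# ANOVA F-test: one F statistic and one p-value per feature.
f_scores, f_pvalues = f_classif(X_toy, y_toy)

# Mutual information; discrete_features=True treats both columns as categorical.
mi_scores = mutual_info_classif(X_toy, y_toy, discrete_features=True, random_state=0)

print(f_scores, f_pvalues)
print(mi_scores)
```

Either function can be passed to `SelectKBest` in place of `chi2`.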