# BDA학회 데이터 분석 전처리 적용반 3주차 필수과제

제출자 성명: 이승섭89

필수과제1 (Titanic Dataset)에 VarianceThreshold을 적용합니다.

필수과제2 (bank-additional.csv)에 여러 가지 feature selection을 수행합니다.

Python 3.10.14 버전을 사용합니다.

In [4]:
# Import required libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import optuna

from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


## 필수과제1 (타이타닉데이터셋)
- VarianceThreshold -타이타닉 데이터 feature_selection
    - 임계값 기준을 몇으로 했는지?
    - 그 기준의 이유
    - 어떤 식으로 찾았는지!
- 어떤 피처가 선택이 되었나?

In [5]:
# Load Titanic dataset using seaborn
titanic = sns.load_dataset('titanic')

# Fill missing values for continuous columns with the median and for categorical with mode
titanic.fillna({'age': titanic['age'].median(), 'embark_town': titanic['embark_town'].mode()[0], 'fare': titanic['fare'].median()}, inplace=True)

# Select features
X = titanic[['pclass', 'sex', 'age', 'fare', 'embark_town']]
y = titanic['survived']

# Bin continuous variables using qcut into 4 bins, and store them in new columns
X['age_binned'] = pd.qcut(X['age'], q=4, labels=False)
X['fare_binned'] = pd.qcut(X['fare'], q=4, labels=False)

# Drop the original 'age' and 'fare' columns as we now have binned versions
X = X.drop(['age', 'fare'], axis=1)

# Convert categorical variables to 'category' type instead of one-hot encoding
X['pclass'] = X['pclass'].astype('category')
X['sex'] = X['sex'].astype('category')
X['embark_town'] = X['embark_town'].astype('category')

# Convert categorical columns into codes (internally they are treated as integers)
X['pclass_cat'] = X['pclass'].cat.codes
X['sex_cat'] = X['sex'].cat.codes
X['embark_town_cat'] = X['embark_town'].cat.codes

# Drop the original categorical columns and keep only their numerical representations
X = X.drop(['pclass', 'sex', 'embark_town'], axis=1)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the objective function for Optuna
def objective(trial):
    threshold = trial.suggest_float("threshold", 0.0, 1.0)  # Variance threshold range

    # Apply VarianceThreshold
    selector = VarianceThreshold(threshold=threshold)
    X_train_selected = selector.fit_transform(X_train)
    X_test_selected = selector.transform(X_test)

    # Train and evaluate the model
    model = LogisticRegression(max_iter=1000)  # Added max_iter for convergence
    model.fit(X_train_selected, y_train)
    y_pred = model.predict(X_test_selected)
    
    # Compute accuracy
    accuracy = accuracy_score(y_test, y_pred)
    
    return accuracy

# Create and optimize the Optuna study
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)

# Print the best threshold found by Optuna
print(f"Best threshold: {study.best_params['threshold']}")

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X['age_binned'] = pd.qcut(X['age'], q=4, labels=False)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X['fare_binned'] = pd.qcut(X['fare'], q=4, labels=False)
[I 2024-10-05 18:06:05,315] A new study created in memory with name: no-name-08ae3893-c643-4cf7-99fb-49eba4d10bb0
[I 2024-10-05 18:06:05,336] Trial 0 finished with value: 0.7039106145251397 and parameters: {'threshold': 0.5224964751432013}. Best is trial 0 with value: 0.7039106145251397.
[I 2024-10-05 18:06:05,359] Trial 1 finished with value: 0.6759776536

Best threshold: 0.05115433017506266


In [12]:
# Get maximum threshold value
def get_max_threshold(scores):
    scores = sorted(scores, key=lambda x: (x[0], x[1]))
    max_accuracy = scores[-1][0]
    
    # Get the maximum hyperparameter corresponding to the maximum accuracy
    max_hyperparameter = max([trial[1] for trial in scores if trial[0] == max_accuracy])
    
    print(f"Maximum hyperparameter value resulting in the same accuracy ({max_accuracy}): {max_hyperparameter}")
    return max_hyperparameter

scores = [(trial.values, trial.params['threshold']) for trial in study.trials]

best_param = get_max_threshold(scores)

Maximum hyperparameter value resulting in the same accuracy ([0.7932960893854749]): 0.21577342561928192


In [13]:
# Apply the best threshold to the VarianceThreshold and print selected features
selector = VarianceThreshold(threshold=best_param)
X_selected = selector.fit_transform(X)
selected_features = X.columns[selector.get_support(indices=True)].tolist()

selected_features

['age_binned', 'fare_binned', 'pclass_cat', 'sex_cat', 'embark_town_cat']

## 필수과제1 고찰

1. 임계값 기준은 0.21577342561928192이 최적으로 나왔다.
2. 기준을 scikit-learn의 metrics 모듈에서 accuracy_score을 사용했다. 
3. Optuna 라이브러리를 사용해서 자동으로 검색했다.
4. 선택된 feature는 age, fare, pclass, sex, embark_town이다 - 사실상 overfitting이 된 듯.

In [3]:
# Empty cell for running entire notebook.