
Using the train.csv file

In [35]:
import pandas as pd
titanic = pd.read_csv("/content/drive/MyDrive/machine_dataset/train.csv")
print(titanic)


     PassengerId  Survived  Pclass  \
0              1         0       3   
1              2         1       1   
2              3         1       3   
3              4         1       1   
4              5         0       3   
..           ...       ...     ...   
886          887         0       2   
887          888         1       1   
888          889         0       3   
889          890         1       1   
890          891         0       3   

                                                  Name     Sex   Age  SibSp  \
0                              Braund, Mr. Owen Harris    male  22.0      1   
1    Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                               Heikkinen, Miss. Laina  female  26.0      0   
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                             Allen, Mr. William Henry    male  35.0      0   
..                                                 ...     ...   ... 

In [9]:
nan_values = titanic.isnull().sum()

print(nan_values)

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64


In [10]:
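# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which newer pandas versions warn about and which may not update the DataFrame.
titanic['Age'] = titanic['Age'].fillna(titanic['Age'].mean())
titanic['Cabin'] = titanic['Cabin'].fillna('Unknown')
most_frequent = titanic['Embarked'].mode()[0]
titanic['Embarked'] = titanic['Embarked'].fillna(most_frequent)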

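This code handles the missing values in the titanic dataset. Missing values in the 'Age' column are filled with the mean age, missing 'Cabin' values are filled with 'Unknown', and missing 'Embarked' values are replaced with the most frequent port of embarkation.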
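
The three fill rules described above can be tried on a small made-up frame. This is only a sketch: the `toy` DataFrame and its values are hypothetical and are not taken from train.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the Titanic columns with missing values.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 36.0],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
    "Embarked": ["S", "C", np.nan, "S"],
})

print(toy.isnull().sum())  # missing values per column before filling

# Numeric column: fill with the column mean (NaN is skipped when averaging).
toy["Age"] = toy["Age"].fillna(toy["Age"].mean())
# Free-text column: fill with a placeholder label.
toy["Cabin"] = toy["Cabin"].fillna("Unknown")
# Categorical column: fill with the most frequent value.
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

print(toy)
print(toy.isnull().sum().sum())  # total missing values after filling
```

Here the missing age becomes 28.0, the mean of 22, 26, and 36, and the missing port becomes 'S'.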



In [11]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


X = titanic.drop("Survived", axis=1)
y = titanic["Survived"]


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=202035157)


num_features = X.select_dtypes(include=['int64', 'float64']).columns
cat_features = X.select_dtypes(include=['object']).columns


num_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())])


cat_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])


preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_transformer, num_features),
        ('cat', cat_transformer, cat_features)])


X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)


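This code uses the sklearn library to preprocess the titanic dataset. First, the dataset is separated into features and labels and split into a training set and a test set. Then, separate pipelines are built for the numerical and categorical features and combined in a ColumnTransformer, which is fit on the training set and then applied to both sets to obtain the preprocessed data. This preprocessed data is used for model training. In class we learned that a 6:2:2 ratio is commonly used for the train, validation, and test sets, so the train-to-test ratio was set to 8:2.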
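
For reference, an explicit 6:2:2 split can be sketched with two calls to `train_test_split`. The notebook itself only makes the 8:2 split, so this is an illustration under assumptions: `X_demo`, `y_demo`, and `random_state=0` are hypothetical placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical placeholder data: 100 rows, 3 features, binary labels.
X_demo = np.arange(300).reshape(100, 3)
y_demo = np.array([0, 1] * 50)

# Step 1: hold out 20% of all rows as the test set.
X_rest, X_te, y_rest, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

# Step 2: take 25% of the remaining 80% as validation (0.25 * 0.8 = 0.2 overall).
X_tr, X_val, y_tr, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_tr), len(X_val), len(X_te))  # 60 20 20
```

When cross-validation is used for model selection, as with `cv=5` in GridSearchCV, the validation folds are taken from the training portion, so a separate validation split is not strictly needed.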

In [12]:
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, HistGradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

sgd_params = {
    # Note: 'log' was renamed to 'log_loss' in scikit-learn 1.1 and removed in 1.3;
    # on newer versions use 'log_loss' here instead.
    'loss': ['hinge', 'log'],
    'penalty': ['l1', 'l2'],
    'alpha': [0.0001, 0.001, 0.01],
}
sgd_clf = SGDClassifier(random_state=202035157)
sgd_grid = GridSearchCV(sgd_clf, sgd_params, cv=5)
sgd_grid.fit(X_train_processed, y_train)
print("SGD best parameters:", sgd_grid.best_params_)


tree_params = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
}
tree_clf = DecisionTreeClassifier(random_state=202035157)
tree_grid = GridSearchCV(tree_clf, tree_params, cv=5)
tree_grid.fit(X_train_processed, y_train)
print("Decision Tree best parameters:", tree_grid.best_params_)

rf_params = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
}
rf_clf = RandomForestClassifier(random_state=202035157)
rf_grid = GridSearchCV(rf_clf, rf_params, cv=5)
rf_grid.fit(X_train_processed, y_train)
print("Random Forest best parameters:", rf_grid.best_params_)


# HistGradientBoostingClassifier needs dense input, so convert the sparse matrix if necessary.
X_train_dense = X_train_processed.toarray() if hasattr(X_train_processed, 'toarray') else X_train_processed

hgb_params = {
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [None, 10, 20, 30],
    'min_samples_leaf': [1, 2, 4],
}

hgb_clf = HistGradientBoostingClassifier(random_state=202035157)
hgb_grid = GridSearchCV(hgb_clf, hgb_params, cv=5)
hgb_grid.fit(X_train_dense, y_train)
print("HistGradientBoosting best parameters:", hgb_grid.best_params_)




SGD best parameters: {'alpha': 0.001, 'loss': 'hinge', 'penalty': 'l2'}
Decision Tree best parameters: {'criterion': 'gini', 'max_depth': 20, 'min_samples_split': 2}
Random Forest best parameters: {'max_depth': None, 'min_samples_split': 5, 'n_estimators': 200}
HistGradientBoosting best parameters: {'learning_rate': 0.01, 'max_depth': None, 'min_samples_leaf': 4}



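This code performs hyperparameter tuning for four machine learning models: SGDClassifier, DecisionTreeClassifier, RandomForestClassifier, and HistGradientBoostingClassifier. For each model, it explores the combinations of candidate hyperparameters and evaluates each combination with 5-fold cross-validation. GridSearchCV is used to find the best hyperparameter combination, which is then printed. In addition, the sparse matrix is converted to a dense matrix, since HistGradientBoostingClassifier requires dense input.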
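
To get a feel for how much work the grid search does, the number of candidate combinations can be counted with `ParameterGrid`. The sketch below reuses the SGD grid from the notebook; the small matrix `m` is a hypothetical example added only to show the sparse-to-dense conversion.

```python
from scipy import sparse
from sklearn.model_selection import ParameterGrid

# Same grid as the notebook's SGD search (values are only enumerated here, not fitted).
sgd_params = {
    'loss': ['hinge', 'log'],
    'penalty': ['l1', 'l2'],
    'alpha': [0.0001, 0.001, 0.01],
}

n_candidates = len(ParameterGrid(sgd_params))  # 2 * 2 * 3 combinations
n_cv_fits = n_candidates * 5                   # one fit per fold with cv=5
print(n_candidates, n_cv_fits)                 # 12 60

# Sparse-to-dense conversion, guarded the same way as in the notebook.
m = sparse.csr_matrix([[0, 1], [2, 0]])        # hypothetical 2x2 sparse matrix
dense = m.toarray() if hasattr(m, 'toarray') else m
print(type(dense).__name__, dense.shape)       # ndarray (2, 2)
```

With the default `refit=True`, GridSearchCV additionally refits the best combination once on the whole training set, so `best_estimator_` is ready to use after `fit`.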

In [15]:
from sklearn.metrics import accuracy_score

optimal_sgd_clf = SGDClassifier(**sgd_grid.best_params_, random_state=202035157)
optimal_tree_clf = DecisionTreeClassifier(**tree_grid.best_params_, random_state=202035157)
optimal_rf_clf = RandomForestClassifier(**rf_grid.best_params_, random_state=202035157)
optimal_hgb_clf = HistGradientBoostingClassifier(**hgb_grid.best_params_, random_state=202035157)


X_train_dense = X_train_processed.toarray()
X_test_dense = X_test_processed.toarray()

optimal_sgd_clf.fit(X_train_processed, y_train)
optimal_tree_clf.fit(X_train_processed, y_train)
optimal_rf_clf.fit(X_train_processed, y_train)
optimal_hgb_clf.fit(X_train_dense, y_train)

sgd_preds = optimal_sgd_clf.predict(X_test_processed)
tree_preds = optimal_tree_clf.predict(X_test_processed)
rf_preds = optimal_rf_clf.predict(X_test_processed)
hgb_preds = optimal_hgb_clf.predict(X_test_dense)

print("SGD Classifier Accuracy: ", accuracy_score(y_test, sgd_preds))
print("Decision Tree Accuracy: ", accuracy_score(y_test, tree_preds))
print("Random Forest Accuracy: ", accuracy_score(y_test, rf_preds))
print("Histogram-based Gradient Boosting Accuracy: ", accuracy_score(y_test, hgb_preds))


SGD Classifier Accuracy:  0.8212290502793296
Decision Tree Accuracy:  0.8044692737430168
Random Forest Accuracy:  0.8100558659217877
Histogram-based Gradient Boosting Accuracy:  0.776536312849162


In [36]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, HistGradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

titanic['FamilySize'] = titanic['SibSp'] + titanic['Parch'] + 1
titanic['IsAlone'] = titanic['FamilySize'].apply(lambda x: 1 if x == 1 else 0)
titanic['AgeGroup'] = pd.cut(titanic['Age'], bins=[0, 18, 35, 60, 100], labels=['Child', 'Young Adult', 'Adult', 'Senior'], right=False)
titanic['FareGroup'] = pd.qcut(titanic['Fare'], 4, labels=['Low', 'Medium', 'High', 'Very High'])

X = titanic.drop("Survived", axis=1)
y = titanic["Survived"]


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=202035157)

num_features = X.select_dtypes(include=['int64', 'float64']).columns
cat_features = X.select_dtypes(include=['object', 'category']).columns


num_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())])


cat_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])


preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_transformer, num_features),
        ('cat', cat_transformer, cat_features)])


X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)

sgd_params = {
    # Note: 'log' was renamed to 'log_loss' in scikit-learn 1.1 and removed in 1.3;
    # on newer versions use 'log_loss' here instead.
    'loss': ['hinge', 'log'],
    'penalty': ['l1', 'l2'],
    'alpha': [0.0001, 0.001, 0.01],
}
sgd_clf = SGDClassifier(random_state=202035157)
sgd_grid = GridSearchCV(sgd_clf, sgd_params, cv=5)
sgd_grid.fit(X_train_processed, y_train)
print("SGD best parameters:", sgd_grid.best_params_)


tree_params = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
}
tree_clf = DecisionTreeClassifier(random_state=202035157)
tree_grid = GridSearchCV(tree_clf, tree_params, cv=5)
tree_grid.fit(X_train_processed, y_train)
print("Decision Tree best parameters:", tree_grid.best_params_)

rf_params = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
}
rf_clf = RandomForestClassifier(random_state=202035157)
rf_grid = GridSearchCV(rf_clf, rf_params, cv=5)
rf_grid.fit(X_train_processed, y_train)
print("Random Forest best parameters:", rf_grid.best_params_)


# HistGradientBoostingClassifier needs dense input, so convert the sparse matrix if necessary.
X_train_dense = X_train_processed.toarray() if hasattr(X_train_processed, 'toarray') else X_train_processed

hgb_params = {
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [None, 10, 20, 30],
    'min_samples_leaf': [1, 2, 4],
}

hgb_clf = HistGradientBoostingClassifier(random_state=202035157)
hgb_grid = GridSearchCV(hgb_clf, hgb_params, cv=5)
hgb_grid.fit(X_train_dense, y_train)
print("HistGradientBoosting best parameters:", hgb_grid.best_params_)



SGD best parameters: {'alpha': 0.0001, 'loss': 'log', 'penalty': 'l2'}
Decision Tree best parameters: {'criterion': 'gini', 'max_depth': 10, 'min_samples_split': 2}
Random Forest best parameters: {'max_depth': None, 'min_samples_split': 5, 'n_estimators': 200}
HistGradientBoosting best parameters: {'learning_rate': 0.1, 'max_depth': 10, 'min_samples_leaf': 1}


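'SibSp' is the number of siblings and spouses aboard the Titanic, and 'Parch' is the number of parents and children aboard. These two features are added together to create a new feature, 'FamilySize'; the +1 counts the passenger themselves. Based on 'FamilySize', a new binary feature 'IsAlone' is created: it is 1 if the family size is 1 (the passenger is traveling alone) and 0 otherwise.
A new feature 'AgeGroup' buckets passengers by age. The pd.cut function divides 'Age' into four groups: Child (0 to under 18), Young Adult (18 to under 35), Adult (35 to under 60), and Senior (60 to under 100). Likewise, 'FareGroup' uses pd.qcut to divide 'Fare' into four quantile-based groups (Low, Medium, High, Very High).
This feature engineering extends the existing features and is intended to help the models learn more complex patterns in the data.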
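
The four engineered features can be checked on a few hand-picked rows. This is a minimal sketch with hypothetical values (the `toy` frame is not from train.csv); it uses a vectorized comparison for 'IsAlone', which gives the same result as the `apply` call in the notebook.

```python
import pandas as pd

# Hypothetical rows chosen to land on the bin edges.
toy = pd.DataFrame({
    "SibSp": [1, 0, 0, 3, 0, 1, 0, 0],
    "Parch": [0, 0, 2, 1, 0, 0, 0, 0],
    "Age":   [5, 18, 34.9, 35, 60, 80, 17.9, 59],
    "Fare":  [7.25, 8.05, 13.0, 26.0, 53.1, 71.28, 80.0, 512.33],
})

# Family size counts the passenger themselves (+1).
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1
# 1 when the passenger travels alone, else 0.
toy["IsAlone"] = (toy["FamilySize"] == 1).astype(int)

# right=False makes each bin include its left edge: [0, 18), [18, 35), [35, 60), [60, 100).
toy["AgeGroup"] = pd.cut(
    toy["Age"], bins=[0, 18, 35, 60, 100],
    labels=["Child", "Young Adult", "Adult", "Senior"], right=False)

# qcut picks the bin edges from the data so each group holds about a quarter of the rows.
toy["FareGroup"] = pd.qcut(
    toy["Fare"], 4, labels=["Low", "Medium", "High", "Very High"])

print(toy)
```

Note that an age of exactly 18 falls in 'Young Adult' and exactly 60 falls in 'Senior', because of `right=False`.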

In [37]:
from sklearn.metrics import accuracy_score

optimal_sgd_clf = SGDClassifier(**sgd_grid.best_params_, random_state=202035157)
optimal_tree_clf = DecisionTreeClassifier(**tree_grid.best_params_, random_state=202035157)
optimal_rf_clf = RandomForestClassifier(**rf_grid.best_params_, random_state=202035157)
optimal_hgb_clf = HistGradientBoostingClassifier(**hgb_grid.best_params_, random_state=202035157)


X_train_dense = X_train_processed.toarray()
X_test_dense = X_test_processed.toarray()

optimal_sgd_clf.fit(X_train_processed, y_train)
optimal_tree_clf.fit(X_train_processed, y_train)
optimal_rf_clf.fit(X_train_processed, y_train)
optimal_hgb_clf.fit(X_train_dense, y_train)

sgd_preds = optimal_sgd_clf.predict(X_test_processed)
tree_preds = optimal_tree_clf.predict(X_test_processed)
rf_preds = optimal_rf_clf.predict(X_test_processed)
hgb_preds = optimal_hgb_clf.predict(X_test_dense)

print("SGD Classifier Accuracy: ", accuracy_score(y_test, sgd_preds))
print("Decision Tree Accuracy: ", accuracy_score(y_test, tree_preds))
print("Random Forest Accuracy: ", accuracy_score(y_test, rf_preds))
print("Histogram-based Gradient Boosting Accuracy: ", accuracy_score(y_test, hgb_preds))



SGD Classifier Accuracy:  0.8044692737430168
Decision Tree Accuracy:  0.7877094972067039
Random Forest Accuracy:  0.7988826815642458
Histogram-based Gradient Boosting Accuracy:  0.7877094972067039


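With feature engineering, only the Histogram-based Gradient Boosting accuracy improved over the previous model (0.7765 to 0.7877). The accuracy of the other three models decreased: SGD from 0.8212 to 0.8045, Decision Tree from 0.8045 to 0.7877, and Random Forest from 0.8101 to 0.7989.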
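
The comparison can be laid out side by side with a few lines of plain Python. The accuracies are copied from the two outputs above; the test-set size of 179 is an assumption derived from the 20% split of the 891 rows in train.csv.

```python
# Test accuracies printed by the notebook, before and after feature engineering.
before = {
    "SGD": 0.8212290502793296,
    "Decision Tree": 0.8044692737430168,
    "Random Forest": 0.8100558659217877,
    "HistGradientBoosting": 0.776536312849162,
}
after = {
    "SGD": 0.8044692737430168,
    "Decision Tree": 0.7877094972067039,
    "Random Forest": 0.7988826815642458,
    "HistGradientBoosting": 0.7877094972067039,
}

n_test = 179  # assumed size of the test split (20% of 891 rows, rounded up)

deltas = {name: after[name] - before[name] for name in before}
for name, delta in deltas.items():
    # Convert the accuracy change into a number of test passengers.
    passengers = round(delta * n_test)
    print(f"{name:22s} {before[name]:.4f} -> {after[name]:.4f} "
          f"({delta:+.4f}, {passengers:+d} passengers)")
```

On a test set this small, each difference amounts to only two or three passengers, so the changes are best read as indicative rather than conclusive.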

In [27]:
import pandas as pd
titanic1 = pd.read_csv("/content/drive/MyDrive/machine_dataset/gender_submission.csv")
print(titanic1)

     PassengerId  Survived
0            892         0
1            893         1
2            894         0
3            895         0
4            896         1
..           ...       ...
413         1305         0
414         1306         1
415         1307         0
416         1308         0
417         1309         0

[418 rows x 2 columns]


Using the gender_submission file

In [28]:
nan_values = titanic1.isnull().sum()

print(nan_values)

PassengerId    0
Survived       0
dtype: int64


In [33]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, HistGradientBoostingClassifier

data = titanic1


X = data[["PassengerId"]]
y = data["Survived"]


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=202035157)

models = {
    "SGDClassifier": {
        "model": SGDClassifier(random_state=202035157),
        "params": {
            'alpha': [0.0001, 0.001, 0.01, 0.1],
            # 'log' was renamed to 'log_loss' in scikit-learn 1.1 and removed in 1.3
            'loss': ['hinge', 'log'],
            'penalty': ['l1', 'l2']
        }
    },
    "DecisionTreeClassifier": {
        "model": DecisionTreeClassifier(random_state=202035157),
        "params": {
            'criterion': ['gini', 'entropy'],
            'max_depth': [None, 10, 20, 30, 40, 50],
            'min_samples_split': [2, 5, 10]
        }
    },
    "RandomForestClassifier": {
        "model": RandomForestClassifier(random_state=202035157),
        "params": {
            'n_estimators': [50, 100, 200],
            'max_depth': [None, 10, 20, 30],
            'min_samples_split': [2, 5, 10]
        }
    },
    "HistGradientBoostingClassifier": {
        "model": HistGradientBoostingClassifier(random_state=202035157),
        "params": {
            'max_iter': [100, 200, 300],
            'max_depth': [None, 10, 20, 30],
            'min_samples_leaf': [5, 10, 20]
        }
    }
}

for name, model_info in models.items():
    grid_search = GridSearchCV(model_info["model"], model_info["params"], cv=5)
    grid_search.fit(X_train, y_train)
    print(f"Model: {name}")
    print(f"Best Parameters: {grid_search.best_params_}")
    print(f"Best Cross-Validation Score: {grid_search.best_score_}")

    best_model = grid_search.best_estimator_
    train_score = best_model.score(X_train, y_train)
    test_score = best_model.score(X_test, y_test)
    print(f"Training Accuracy: {train_score}")
    print(f"Test Accuracy: {test_score}")
    print("-" * 50)




Model: SGDClassifier
Best Parameters: {'alpha': 0.01, 'loss': 'log', 'penalty': 'l1'}
Best Cross-Validation Score: 0.5839891451831749
Training Accuracy: 0.3592814371257485
Test Accuracy: 0.38095238095238093
--------------------------------------------------
Model: DecisionTreeClassifier
Best Parameters: {'criterion': 'gini', 'max_depth': 10, 'min_samples_split': 5}
Best Cross-Validation Score: 0.6197648123021258
Training Accuracy: 0.6826347305389222
Test Accuracy: 0.6309523809523809
--------------------------------------------------
Model: RandomForestClassifier
Best Parameters: {'max_depth': 10, 'min_samples_split': 10, 'n_estimators': 100}
Best Cross-Validation Score: 0.5537765716870194
Training Accuracy: 0.7335329341317365
Test Accuracy: 0.5833333333333334
--------------------------------------------------
Model: HistGradientBoostingClassifier
Best Parameters: {'max_depth': None, 'max_iter': 100, 'min_samples_leaf': 20}
Best Cross-Validation Score: 0.5625508819538669
Training Accura

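Here, "PassengerId" is selected as the feature and "Survived" as the target variable. test_size=0.2 sets the test set to 20% of the whole data; the reason for choosing 20% is the same as explained earlier. The SGDClassifier, DecisionTreeClassifier, RandomForestClassifier, and HistGradientBoostingClassifier models are stored in a dictionary, each with its own hyperparameter grid. GridSearchCV is run for each model to find the best hyperparameters and train the model. Finally, the best hyperparameters, cross-validation score, training accuracy, and test accuracy of each model are printed. Because "PassengerId" is only a row identifier, it carries little information about survival, which is consistent with the low cross-validation scores above (about 0.55 to 0.62).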
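
One way to put such accuracies in context is a majority-class baseline, which always predicts the most common label. The sketch below is an illustration under assumptions: `X_id` and `y_fake` are synthetic stand-ins (an ID-like feature and labels that are 70% zeros), not the gender_submission data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic stand-ins: an ID-like feature and labels with 70% zeros.
X_id = np.arange(100).reshape(-1, 1)
y_fake = np.array([0, 1, 0, 0, 1, 0, 0, 1, 0, 0] * 10)

# The baseline ignores the features and always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_id, y_fake)
baseline_acc = baseline.score(X_id, y_fake)

print(baseline_acc)  # equals the share of the majority class, 0.7
```

A model whose test accuracy does not clearly exceed this kind of baseline on the real data has probably not learned anything useful from its features.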