### <span style="color:green"> 1. Introduction</span>

This notebook serves as a solution to the [Playground Series - Season 5, Episode 7](https://www.kaggle.com/competitions/playground-series-s5e7/overview) competition from Kaggle, hosted in July 2025. The goal is to predict whether a person is an introvert or extrovert, given their social behavior and personality traits. Submissions are evaluated on accuracy, that is, the fraction of predicted labels that match the actual targets.

The competition data was generated by a deep learning model trained on the [Extrovert vs. Introvert Behavior](https://www.kaggle.com/datasets/rakeshkapilavai/extrovert-vs-introvert-behavior-data/data) dataset. We will import this and refer to it as the *original dataset* throughout the notebook.

The workflow begins by importing the libraries used in this project: NumPy, pandas, scikit-learn and XGBoost. We then load the training, testing and original datasets and inspect their first rows. A new feature is created to indicate whether a participant’s combination of features matches any row in the original data. If it does, the feature holds the corresponding personality label (extrovert or introvert); otherwise it is left missing.

Label encoding is used to convert the target variable into numerical format. After storing the target in a separate variable and dropping the target columns from the feature set, the remaining categorical columns are converted using ordinal encoding.

We then define a simple XGBoost model with basic hyperparameters, including a learning rate of 0.1 for stable training and a maximum tree depth of 4 to limit overfitting. For training, we use stratified 5-fold cross-validation, which preserves the class distribution of the target variable in every fold. For each of the five folds, we fit the model on 4/5 of the training data, generate predictions on the held-out validation fold, and predict on the test set. Because each fold predicts class labels rather than probabilities, averaging the five test set predictions and thresholding at 0.5 amounts to a majority vote, which produces the final output.

Finally, we evaluate the model’s performance by comparing the out-of-fold validation predictions to the true targets and create a CSV file with the test set predictions, ready for submission to the competition.

### <span style="color:green"> 2. Import Libraries</span>

The following code imports all the libraries we will use.

In [40]:
# ===== Import Libraries =====
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder
from xgboost import XGBClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
from IPython.display import display

### <span style="color:green"> 3. Load and Inspect Data</span>

The following code loads the original dataset, along with the competition's training and testing datasets. We’ll also take a look at the first few rows of each to better understand their structure.

In [41]:
# ===== Load and Inspect Data =====
X = pd.read_csv("/kaggle/input/playground-series-s5e7/train.csv")
X_test = pd.read_csv("/kaggle/input/playground-series-s5e7/test.csv")
original = pd.read_csv("/kaggle/input/extrovert-vs-introvert-behavior-data/personality_dataset.csv")

for name, df in [('TRAINING DATA', X), ('TESTING DATA', X_test), ('ORIGINAL DATASET', original)]:
    print('-' * 100)
    print(name + ':')
    display(df.head())

----------------------------------------------------------------------------------------------------
TRAINING DATA:


| | id | Time_spent_Alone | Stage_fear | Social_event_attendance | Going_outside | Drained_after_socializing | Friends_circle_size | Post_frequency | Personality |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.0 | No | 6.0 | 4.0 | No | 15.0 | 5.0 | Extrovert |
| 1 | 1 | 1.0 | No | 7.0 | 3.0 | No | 10.0 | 8.0 | Extrovert |
| 2 | 2 | 6.0 | Yes | 1.0 | 0.0 | NaN | 3.0 | 0.0 | Introvert |
| 3 | 3 | 3.0 | No | 7.0 | 3.0 | No | 11.0 | 5.0 | Extrovert |
| 4 | 4 | 1.0 | No | 4.0 | 4.0 | No | 13.0 | NaN | Extrovert |


----------------------------------------------------------------------------------------------------
TESTING DATA:


| | id | Time_spent_Alone | Stage_fear | Social_event_attendance | Going_outside | Drained_after_socializing | Friends_circle_size | Post_frequency |
|---|---|---|---|---|---|---|---|---|
| 0 | 18524 | 3.0 | No | 7.0 | 4.0 | No | 6.0 | NaN |
| 1 | 18525 | NaN | Yes | 0.0 | 0.0 | Yes | 5.0 | 1.0 |
| 2 | 18526 | 3.0 | No | 5.0 | 6.0 | No | 15.0 | 9.0 |
| 3 | 18527 | 3.0 | No | 4.0 | 4.0 | No | 5.0 | 6.0 |
| 4 | 18528 | 9.0 | Yes | 1.0 | 2.0 | Yes | 1.0 | 1.0 |


----------------------------------------------------------------------------------------------------
ORIGINAL DATASET:


| | Time_spent_Alone | Stage_fear | Social_event_attendance | Going_outside | Drained_after_socializing | Friends_circle_size | Post_frequency | Personality |
|---|---|---|---|---|---|---|---|---|
| 0 | 4.0 | No | 4.0 | 6.0 | No | 13.0 | 5.0 | Extrovert |
| 1 | 9.0 | Yes | 0.0 | 0.0 | Yes | 0.0 | 3.0 | Introvert |
| 2 | 9.0 | Yes | 1.0 | 2.0 | Yes | 5.0 | 2.0 | Introvert |
| 3 | 0.0 | No | 6.0 | 7.0 | No | 14.0 | 8.0 | Extrovert |
| 4 | 3.0 | No | 9.0 | 4.0 | No | 8.0 | 5.0 | Extrovert |


### <span style="color:green"> 4. Merge with Original Dataset</span>

In this section, the training and testing datasets are left-merged with the original dataset on the seven feature columns. This adds a new feature, `match_p`, which holds the personality label (extrovert or introvert) of an original row whose feature values match the participant’s, and is missing when no row matches. A participant can match several original rows, in which case the merge produces duplicates, so we keep only the first row per `id`.

Although this may initially look like data leakage, the competition guidelines explicitly allow it: *Feature distributions are close to, but not exactly the same, as the original. Feel free to use the original dataset as part of this competition, both to explore differences as well as to see whether incorporating the original in training improves model performance.*

In [42]:
# ===== Merge with Original Dataset =====
original = original.rename(columns={'Personality': 'match_p'})
merge_cols = ['Time_spent_Alone', 'Stage_fear', 'Social_event_attendance',
             'Going_outside', 'Drained_after_socializing', 'Friends_circle_size',
             'Post_frequency']

X = X.merge(original, how='left', on=merge_cols)
X_test = X_test.merge(original, how='left', on=merge_cols)

X = X.drop_duplicates(subset='id', keep='first')
X_test = X_test.drop_duplicates(subset='id', keep='first')
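
To make the merge step concrete, here is a minimal sketch on toy data. The frames `toy_train` and `toy_original` and their values are invented for illustration, and only two of the seven merge columns are used; the pandas calls mirror the ones in the cell above.

```python
import pandas as pd

# Hypothetical stand-in for the original dataset (note the duplicated row)
toy_original = pd.DataFrame({
    "Time_spent_Alone": [4.0, 9.0, 9.0],
    "Stage_fear": ["No", "Yes", "Yes"],
    "Personality": ["Extrovert", "Introvert", "Introvert"],
}).rename(columns={"Personality": "match_p"})

# Hypothetical stand-in for the competition training data
toy_train = pd.DataFrame({
    "id": [0, 1, 2],
    "Time_spent_Alone": [4.0, 9.0, 1.0],
    "Stage_fear": ["No", "Yes", "No"],
})

# Left merge keeps every training row; id 1 matches two original rows
merged = toy_train.merge(toy_original, how="left",
                         on=["Time_spent_Alone", "Stage_fear"])

# Keep one row per id, as in the notebook
deduped = merged.drop_duplicates(subset="id", keep="first")
print(deduped)
```

In this sketch, `id` 2 has no counterpart in `toy_original`, so its `match_p` is missing, while `id` 1 appears twice after the merge and once after deduplication.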

### <span style="color:green"> 5. Encode Target</span>

The target variable is now converted into numerical format using label encoding.

In [43]:
# ===== Encode Target =====
label_encoder = LabelEncoder()
X['Personality_encoded'] = label_encoder.fit_transform(X['Personality'])
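
As a quick illustration of what the encoder does, the sketch below fits a separate `LabelEncoder` on a few hand-written labels. The names `toy_encoder`, `toy_codes` and `class_to_code` are introduced here only for the example. `LabelEncoder` sorts the classes, so the integer codes follow alphabetical order.

```python
from sklearn.preprocessing import LabelEncoder

toy_encoder = LabelEncoder()
toy_codes = toy_encoder.fit_transform(["Introvert", "Extrovert", "Extrovert"])

# classes_ is sorted, so position 0 is 'Extrovert' and position 1 is 'Introvert'
class_to_code = {label: code for code, label in enumerate(toy_encoder.classes_)}
print(toy_codes, class_to_code)
```

With this mapping, class 1 corresponds to introverts, which matters when interpreting predicted labels later on.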

### <span style="color:green"> 6. Prepare Features</span>

In this section, the `id` column is set as the index for both the training and testing datasets. Next, the target variable is stored in a pandas Series named `y`, and the target columns — both original and encoded — are dropped from the training dataset.

In [44]:
# ===== Prepare Features =====
X = X.set_index('id')
X_test = X_test.set_index('id')

y = X['Personality_encoded']
X = X.drop(['Personality', 'Personality_encoded'], axis=1)

### <span style="color:green"> 7. Encode Categorical Features</span>

In this step, we convert the categorical features (`Stage_fear`, `Drained_after_socializing` and the merged `match_p`) into numerical format using ordinal encoding. To ensure consistent encoding across both training and testing sets, we first concatenate them into a single DataFrame. After encoding, we split the data back into the initial training and testing sets by row position.

In [45]:
# ===== Encode Categorical Features =====
X_full = pd.concat([X, X_test], axis=0)

categorical_cols = [col for col in X.columns if X[col].dtype == 'object']

ordinal_encoder = OrdinalEncoder()
X_full[categorical_cols] = ordinal_encoder.fit_transform(X_full[categorical_cols])

X = X_full.iloc[:len(X)]
X_test = X_full.iloc[len(X):]

### <span style="color:green"> 8. XGBoost Model</span>

We now define the hyperparameters for the XGBoost classifier. Since this is a binary classification task, we set the objective to `binary:logistic`, so the model outputs the probability of the positive class. The evaluation metric is set to `logloss`, which measures the quality of probability-based predictions. To limit model complexity and reduce overfitting, we cap each tree's depth at 4. A `learning_rate` of 0.1 shrinks the contribution of each new tree, so the model learns gradually and in a more stable manner.

We also set `subsample` and `colsample_bytree` to 0.8, meaning each tree is trained on a random 80% of the training rows and 80% of the features, which introduces randomness to improve generalization. Finally, `early_stopping_rounds` is set to 10, so training stops once the validation log loss has not improved for 10 consecutive rounds, and a fixed `random_state` makes the results reproducible every time the code is run.

In [46]:
# ===== Define XGBoost Model =====
xgb = XGBClassifier(
    objective="binary:logistic",
    eval_metric="logloss",
    max_depth=4,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    early_stopping_rounds=10,
    random_state=0
)
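
For reference, the objective and metric chosen above correspond to the standard formulas below. The symbols are introduced here for illustration and do not appear in the notebook: $z_i$ is the summed output of all trees for row $i$, $p_i$ is the predicted probability of class 1, $y_i \in \{0, 1\}$ is the encoded target, and $N$ is the number of rows being scored.

```latex
p_i = \sigma(z_i) = \frac{1}{1 + e^{-z_i}},
\qquad
\text{logloss} = -\frac{1}{N} \sum_{i=1}^{N}
\bigl[\, y_i \log p_i + (1 - y_i) \log (1 - p_i) \,\bigr]
```

Early stopping monitors this log loss on the validation fold, while the competition metric is accuracy on the predicted labels.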

### <span style="color:green"> 9. Stratified 5-Fold Cross-Validation</span>

We will use stratified 5-fold cross-validation for training the model and generating predictions. This method ensures that each fold maintains the original proportion of extroverts and introverts in the target variable.
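
The effect of stratification can be checked on a small synthetic target. The arrays `X_toy` and `y_toy` below are invented for the example (20 rows, 25% of them in class 1); the splitter is configured as in this notebook.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced target: 15 rows of class 0 and 5 rows of class 1
y_toy = np.array([0] * 15 + [1] * 5)
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant for the split

skf_toy = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Share of class 1 in each validation fold
fold_pos_rates = [y_toy[valid_idx].mean()
                  for _, valid_idx in skf_toy.split(X_toy, y_toy)]
print(fold_pos_rates)
```

Each validation fold in this sketch contains four rows, one of which belongs to class 1, so every fold reproduces the overall 25% share.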

For each of the 5 folds, we fit the model on 4/5 of the data, generate predictions on the held-out validation fold and also predict on the test set.

After completing all folds, we will have out-of-fold predictions for the entire training set, which will be used for model evaluation, and five sets of predicted class labels for the test set. These are averaged, so each test value is the share of folds that voted for class 1; it is converted to a final label when the submission file is created.

In [47]:
# ===== Stratified 5-Fold Cross-Validation =====
valid_preds = np.zeros(len(X))
test_preds = np.zeros(len(X_test))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

for (train_idx, valid_idx) in skf.split(X, y):
    X_train, X_valid = X.iloc[train_idx], X.iloc[valid_idx]
    y_train, y_valid = y.iloc[train_idx], y.iloc[valid_idx]

    xgb.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=False)

    valid_preds[valid_idx] = xgb.predict(X_valid)
    test_preds += xgb.predict(X_test) / skf.n_splits
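
Because `predict` returns class labels, the running average in the loop above is a vote share. A minimal NumPy sketch, using a made-up `fold_labels` array in place of real model output, shows that thresholding the average at 0.5 gives the majority vote:

```python
import numpy as np

# Hypothetical hard-label predictions: 5 folds (rows) x 3 test samples (columns)
fold_labels = np.array([
    [1, 0, 1],
    [1, 0, 0],
    [0, 0, 1],
    [1, 1, 1],
    [1, 0, 0],
])

# Accumulate the same way as the cross-validation loop
avg_votes = np.zeros(fold_labels.shape[1])
for labels in fold_labels:
    avg_votes += labels / len(fold_labels)

# A sample is assigned class 1 when more than half of the folds voted for it
majority = (avg_votes > 0.5).astype(int)
print(avg_votes, majority)
```

With an odd number of folds the vote share can never equal 0.5, so no ties arise. Averaging probabilities from `predict_proba` instead would be a different ensembling choice from the one used in this notebook.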

### <span style="color:green"> 10. Model Evaluation</span>

In this step, we will calculate the accuracy score by comparing the out-of-fold predictions with the true target values.

In [48]:
# ===== Model Evaluation =====
cv_score = accuracy_score(y, valid_preds)
print(f'CROSS-VALIDATION ACCURACY SCORE: {cv_score:.6f}')

CROSS-VALIDATION ACCURACY SCORE: 0.971119
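
As a sanity check on the metric, accuracy is simply the fraction of matching labels. The toy arrays below are invented for the example; `oof_toy` is stored as floats, like `valid_preds` in this notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_true_toy = np.array([0, 1, 1, 0, 1])
oof_toy = np.array([0.0, 1.0, 0.0, 0.0, 1.0])  # one mistake out of five

manual_accuracy = (y_true_toy == oof_toy).mean()
sklearn_accuracy = accuracy_score(y_true_toy, oof_toy)
print(manual_accuracy, sklearn_accuracy)
```

Both computations agree, so the reported cross-validation score can be read as the share of training rows whose out-of-fold prediction was correct.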


### <span style="color:green"> 11. Create Submission File</span>

The final step is creating a submission file for the competition. To do that, we first threshold the averaged test predictions at 0.5 to obtain class labels, and then convert the encoded labels back to their original categories: extrovert and introvert.

In [49]:
# ===== Create Submission File =====
test_preds = (test_preds > 0.5).astype(int)
final_preds = label_encoder.inverse_transform(test_preds)
output = pd.DataFrame({'id': X_test.index, 'Personality': final_preds})
output.to_csv('submission.csv', index=False)
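
The conversion from averaged votes to a submission can be sketched end to end on toy values. The vote shares in `toy_avg` are invented, the encoder is refit on the two class names for the example, and the ids are the first three from the test set preview.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Stand-in encoder; classes are sorted, so 0 = Extrovert and 1 = Introvert
toy_encoder = LabelEncoder().fit(["Extrovert", "Introvert"])

toy_avg = np.array([0.8, 0.2, 0.6])   # hypothetical averaged fold votes
toy_ids = [18524, 18525, 18526]

# Threshold to integer codes, then map codes back to the label strings
toy_labels = toy_encoder.inverse_transform((toy_avg > 0.5).astype(int))
toy_submission = pd.DataFrame({"id": toy_ids, "Personality": toy_labels})

# With no path argument, to_csv returns the CSV content as a string
csv_text = toy_submission.to_csv(index=False)
print(csv_text)
```

The cast to `int` matters because `inverse_transform` expects the integer codes produced by the encoder, and `index=False` keeps the file to the two required columns.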