# INTRO #
This is the legendary Titanic ML competition – the best, first challenge for me to dive into ML competitions and familiarize myself with how the Kaggle platform works.

The competition is simple: I use machine learning to create a model that predicts which passengers survived the Titanic shipwreck.

train data and test data are provided by kaggle in the data tab of Titanic competition page. Here are the steps that involved in achieving highest accuracy

1. Data Preprocessing
2. Feature Engineering
3. Training data with multiple ML models along with simple Neural Network model
4. Test model and caluclate accuracy and compare all models and choose model with highest accuracy and submit



## Data Preprocessing

In [1]:
import pandas as pd

In [2]:
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')

Below, after displaying the data, we can see there are 11 columns in which 10 columns are given for us to predict the column of Survived and when we display test data, we only will have 10 columns and we need to predict the Survived column for test dataset.

In [3]:
train_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [4]:
test_data.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


by using pandas build-in function info(), we can get basic information about each column like its data types and number of non null values in that column

In [5]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


Now, lets start cleaning the data, first of all, lets see of there are any null values in any of the columns. As we can see columns Age and Cabin and Embarked have null values in train data as well as re have null value in test data

In [6]:
train_data.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [7]:
test_data.isnull().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

Lets start cleaning with the column Cabin, since it have more null values in it and also the data type of cabin is string, so, we can used mean or median to fill the null values and also the number of null values is also very high, so, lets just remove the Cabin column, both in train and test data using pandas built-in function drop()

In [8]:
train_data.drop('Cabin', axis=1, inplace=True)
test_data.drop('Cabin', axis=1, inplace=True)

In [9]:
print('Columns in Training Data after dropping column Cabin is ')
print(train_data.columns, '\n')

print('Columns in Test Data after dropping column Cabin is ')
print(test_data.columns)

Columns in Training Data after dropping column Cabin is 
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Embarked'],
      dtype='object') 

Columns in Test Data after dropping column Cabin is 
Index(['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch',
       'Ticket', 'Fare', 'Embarked'],
      dtype='object')


Lets look at null values again and we can see column embarked have only 2 null values in train data and 1 null value in test data. Since the dataset is small, lets not remove them. Since Embarked have only 3 unique values C, Q and S, we can replace null values with mode of Embarked column in train data

Now, Fare column has float data type, so we can just do mean or median of that column, lets just replace null value with median using pandas buildin function called fillna().

In [10]:
train_data.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Embarked         2
dtype: int64

In [11]:
test_data.isnull().sum()

PassengerId     0
Pclass          0
Name            0
Sex             0
Age            86
SibSp           0
Parch           0
Ticket          0
Fare            1
Embarked        0
dtype: int64

In [12]:
train_data['Embarked'].mode()

0    S
Name: Embarked, dtype: object

In [13]:
train_data['Embarked'].fillna(train_data['Embarked'].mode()[0], inplace=True)
test_data['Fare'].fillna(test_data['Fare'].median(), inplace=True)

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  train_data['Embarked'].fillna(train_data['Embarked'].mode()[0], inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  test_data['Fare'].fillna(test_data['Fare'].median(), inplace=True)


In [14]:
train_data.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Embarked         0
dtype: int64

In [15]:
test_data.isnull().sum()

PassengerId     0
Pclass          0
Name            0
Sex             0
Age            86
SibSp           0
Parch           0
Ticket          0
Fare            0
Embarked        0
dtype: int64

Now, lets go to deal with null values of Age column

We can fill Age with many ways cause it have integer/float values in it. 
1. Mean --> Not sustainable to outliers
2. Median --> This is good and simpler approach that filles sensable middle value
3. Median of a group --> We can group 4 to 5 rows and find median with respect to few columns like pclass, sex, etc.,
4. Use some regression model like knearest neighbour or decision tree to predict the missing values of age 

4th approach looks like it might give accurate results but for smaller dataset like titanic, it might not be good approach and could lead to overfitting. Lets try 3rd approach cause it could give sensible result

In [16]:
def fill_age(df):
    for pclass in range(1, 4):
        for sex in ['male', 'female']:
            median_age = df[(df['Pclass'] == pclass) & (df['Sex'] == sex)]['Age'].median()
            df.loc[(df['Age'].isnull()) & (df['Pclass'] == pclass) & (df['Sex'] == sex), 'Age'] = median_age
    return df

In [17]:
train_data = fill_age(train_data)
test_data = fill_age(test_data)

In [18]:
print(train_data.isnull().sum())
print('\n')
print(test_data.isnull().sum())

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64


PassengerId    0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64


## Feature/Column Engineering

In feature engineering, we will try to convert all columns to either integers or float and try to combine relative columns and try to divide and extract a feature that could be helpful to train model better.

In [19]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          891 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Embarked     891 non-null    object 
dtypes: float64(2), int64(5), object(4)
memory usage: 76.7+ KB


From belowe, we can see, SibSp and Parch can be combined together as Family size, lets do that 

In [20]:
train_data['Fam'] = train_data['SibSp'] + train_data['Parch']
test_data['Fam'] = test_data['SibSp'] + test_data['Parch']

In [21]:
train_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,Fam
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,S,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C,1
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,S,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,S,1
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,S,0


Now, lets change Sex and Embarked columns to integers as there is only 2 unique values and even in Embarked also 3 unique values

In [22]:
train_data['Sex'] = train_data['Sex'].map({'male': 0, 'female': 1})
test_data['Sex'] = test_data['Sex'].map({'male': 0, 'female': 1})
train_data['Embarked'] = train_data['Embarked'].map({'C': 0, 'Q': 1, 'S': 2})
test_data['Embarked'] = test_data['Embarked'].map({'C': 0, 'Q': 1, 'S': 2})

In [23]:
train_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,Fam
0,1,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.25,2,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,0,1
2,3,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.925,2,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1,2,1
4,5,0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.05,2,0


The Name column contains valuable information that can be used for feature engineering. Instead of using the entire name (which isn’t directly useful), we can extract meaningful features like titles.
1. We can extract titles such as Mr, Mrs, Miss, etc., are embedded in the name and can provide insights about social status, gender, and age.
2. And drop the original name column here

In [24]:
train_data['Title'] = train_data['Name'].str.extract(' ([A-Za-z]+)\.', expand=True)
test_data['Title'] = test_data['Name'].str.extract(' ([A-Za-z]+)\.', expand=True)

  train_data['Title'] = train_data['Name'].str.extract(' ([A-Za-z]+)\.', expand=True)
  test_data['Title'] = test_data['Name'].str.extract(' ([A-Za-z]+)\.', expand=True)


In [25]:
print("Title Counts in Training Data:")
print(train_data['Title'].value_counts(), "\n")

Title Counts in Training Data:
Title
Mr          517
Miss        182
Mrs         125
Master       40
Dr            7
Rev           6
Mlle          2
Major         2
Col           2
Countess      1
Capt          1
Ms            1
Sir           1
Lady          1
Mme           1
Don           1
Jonkheer      1
Name: count, dtype: int64 



In [26]:
print("Title Counts in Test Data:")
print(test_data['Title'].value_counts(), "\n")

Title Counts in Test Data:
Title
Mr        240
Miss       78
Mrs        72
Master     21
Col         2
Rev         2
Ms          1
Dr          1
Dona        1
Name: count, dtype: int64 



In [27]:

title_mapping = {
    'Mr': 'Mr', 'Miss': 'Miss', 'Mrs': 'Mrs', 'Master': 'Master',
    'Jonkheer': 'Rare', 'Don': 'Rare', 'Mme': 'Mrs', 'Lady': 'Rare',
    'Sir': 'Rare', 'Ms': 'Miss', 'Capt': 'Rare', 'Countess': 'Rare', 
    'Col': 'Rare', 'Major': 'Rare', 'Rev': 'Rare', 'Mlle': 'Miss',
    'Dr': 'Rare',  'Dona': 'Rare'  
}

train_data['Title'] = train_data['Title'].map(title_mapping)
test_data['Title'] = test_data['Title'].map(title_mapping)

In [28]:
train_data['Title'] = train_data['Title'].map({'Mr': 0, 'Miss': 1, 'Mrs': 2, 'Master': 3, 'Rare': 4})
test_data['Title'] = test_data['Title'].map({'Mr': 0, 'Miss': 1, 'Mrs': 2, 'Master': 3, 'Rare': 4})

In [29]:
train_data.drop('Name', axis=1, inplace=True)
test_data.drop('Name', axis=1, inplace=True)

In [30]:
train_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,Fam,Title
0,1,0,3,0,22.0,1,0,A/5 21171,7.25,2,1,0
1,2,1,1,1,38.0,1,0,PC 17599,71.2833,0,1,2
2,3,1,3,1,26.0,0,0,STON/O2. 3101282,7.925,2,0,1
3,4,1,1,1,35.0,1,0,113803,53.1,2,1,2
4,5,0,3,0,35.0,0,0,373450,8.05,2,0,0


Now, I think Only Ticket column does not have numerical values in it. It looks very complicated. So, for simplicity, Lets drop it for now.

In [31]:
train_data.drop('Ticket', axis=1, inplace=True)
test_data.drop('Ticket', axis=1, inplace=True)

In [32]:
train_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Fam,Title
0,1,0,3,0,22.0,1,0,7.25,2,1,0
1,2,1,1,1,38.0,1,0,71.2833,0,1,2
2,3,1,3,1,26.0,0,0,7.925,2,0,1
3,4,1,1,1,35.0,1,0,53.1,2,1,2
4,5,0,3,0,35.0,0,0,8.05,2,0,0


In [33]:
test_data.head()

Unnamed: 0,PassengerId,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Fam,Title
0,892,3,0,34.5,0,0,7.8292,1,0,0
1,893,3,1,47.0,1,0,7.0,2,1,2
2,894,2,0,62.0,0,0,9.6875,1,0,0
3,895,3,0,27.0,0,0,8.6625,2,0,0
4,896,3,1,22.0,1,1,12.2875,2,2,2


Lets drop SibSp and Parch as Fam already been added and i dont think they will add much difference in training while Fam is there, so lets just drop them to avoid redundency

In [34]:
train_data.drop(['SibSp', 'Parch'], axis=1, inplace=True)
test_data.drop(['SibSp', 'Parch'], axis=1, inplace=True)

In [35]:
print(train_data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Sex          891 non-null    int64  
 4   Age          891 non-null    float64
 5   Fare         891 non-null    float64
 6   Embarked     891 non-null    int64  
 7   Fam          891 non-null    int64  
 8   Title        891 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 62.8 KB
None


In [36]:
print(test_data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Sex          418 non-null    int64  
 3   Age          418 non-null    float64
 4   Fare         418 non-null    float64
 5   Embarked     418 non-null    int64  
 6   Fam          418 non-null    int64  
 7   Title        418 non-null    int64  
dtypes: float64(2), int64(6)
memory usage: 26.3 KB
None


## Training Data with Multiple ML models

First of all, lets divide training data into X and y where y is output which is Survived column and rest of them all are X. Using sklearn train_and_test_split, lets split training data into train and validate in 80 and 20 % of entire training data.

In [37]:
X = train_data.drop(['Survived'], axis=1)
y = train_data['Survived']

In [38]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

In [39]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 712 entries, 331 to 102
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  712 non-null    int64  
 1   Pclass       712 non-null    int64  
 2   Sex          712 non-null    int64  
 3   Age          712 non-null    float64
 4   Fare         712 non-null    float64
 5   Embarked     712 non-null    int64  
 6   Fam          712 non-null    int64  
 7   Title        712 non-null    int64  
dtypes: float64(2), int64(6)
memory usage: 50.1 KB


Lets scale Age and Fare cause they are having high value numbers and scaling them could reduce value and strink them in between -1 to +1 or 0 to 1 or in small values.

Here, lets use standard scaling where, it will reduce values to have mean 0 and standard deviation 1 using StandardScale function in sklearn

In [40]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train[['Age', 'Fare']] = scaler.fit_transform(X_train[['Age', 'Fare']])
X_val[['Age', 'Fare']] = scaler.fit_transform(X_val[['Age', 'Fare']])
test_data[['Age', 'Fare']] = scaler.transform(test_data[['Age', 'Fare']])

In [41]:
X_train.head()

Unnamed: 0,PassengerId,Pclass,Sex,Age,Fare,Embarked,Fam,Title
331,332,1,0,1.249288,-0.078684,2,0,0
733,734,2,0,-0.445751,-0.377145,2,0,0
382,383,3,0,0.232265,-0.474867,2,0,0
704,705,3,0,-0.219746,-0.47623,2,1,0
813,814,3,1,-1.726446,-0.025249,2,6,1


Let’s dive into training! Since our plan is to try various models, we’ll start with a baseline model to establish a reference point, then gradually test more sophisticated models.

Step-by-Step Training Process

1. Train a Baseline Model
	* Start with Logistic Regression (simple and interpretable).
	* Use it as a benchmark for comparing other models.

2. Train Advanced Models
	* Tree-based models: Decision Tree, Random Forest.
	* Boosting models: AdaBoost, Gradient Boosting, XGBoost, LightGBM.
	* Distance-based model: KNN.
	* Simple Neural Network.

3. Compare Models
	* Evaluate all models using metrics like accuracy or F1-score on the validation set.

Baseline Model: Logistic Regression

In [63]:
from sklearn.linear_model import LogisticRegression

logistic_model = LogisticRegression(random_state=42)
logistic_model.fit(X_train, y_train)
y_pred = logistic_model.predict(X_val)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Lets caluclate its accuracy with classification report which contains f1 score, precision, recall

From below we can see the accuracy is 78% which is not bad for baseline model

In [64]:
from sklearn.metrics import accuracy_score, classification_report

logistic_accuracy = accuracy_score(y_val, y_pred)
print(f"Logistic Regression Accuracy: {logistic_accuracy:.4f}")
print("\nClassification Report:")
print(classification_report(y_val, y_pred))

Logistic Regression Accuracy: 0.7821

Classification Report:
              precision    recall  f1-score   support

           0       0.81      0.82      0.82       105
           1       0.74      0.73      0.73        74

    accuracy                           0.78       179
   macro avg       0.78      0.77      0.77       179
weighted avg       0.78      0.78      0.78       179



Below is an explanation of each model used in the code, including its key characteristics, how it works, and why it’s included in the training pipeline

1. Logistic Regression
	•	Logistic regression predicts the probability of an instance belonging to a class using the logistic (sigmoid) function.
	•	It’s essentially a linear model but applied to classification tasks.
	•	The decision boundary is a straight line (or a hyperplane in higher dimensions).
	•	It’s simple, interpretable, and works well as a baseline model for binary classification tasks.

2. Decision Tree Classifier
	•	Splits the data into subsets based on feature thresholds, forming a tree structure.
	•	Each node represents a feature, and the branches represent decisions based on feature values.
	•	The tree continues splitting until it reaches a stopping criterion (e.g., maximum depth or minimum samples per leaf).
	•	Easy to understand and visualize.
	•	Can capture non-linear relationships between features.

3. Random Forest Classifier
	•	Combines multiple decision trees (trained on different random subsets of data and features) to improve stability and accuracy.
	•	Outputs the majority vote (classification) or average prediction (regression) from all trees.
	•	Reduces overfitting compared to a single decision tree.
	•	Handles missing data and categorical variables well.

4. AdaBoost Classifier
	•	Trains a sequence of weak learners (e.g., decision stumps).
	•	Each learner focuses on the mistakes of the previous one by assigning higher weights to misclassified instances.
	•	Combines the predictions of all learners to make the final prediction.
	•	Works well with smaller, less complex datasets.
	•	Focuses on difficult examples, improving classification performance.

5. Gradient Boosting Classifier
	•	Sequentially trains decision trees to minimize the residual errors of previous trees.
	•	Optimizes a loss function using gradient descent to improve model predictions over iterations.
	•	Provides better accuracy than AdaBoost on many datasets.
	•	Highly customizable with various hyperparameters.

6. XGBoost (Extreme Gradient Boosting)
	•	An optimized version of Gradient Boosting that is faster and more efficient.
	•	Uses advanced regularization (L1 and L2) to prevent overfitting.
	•	Supports parallel computing, which makes it much faster than traditional Gradient Boosting.
	•	High performance on structured/tabular data.
	•	Consistently ranks among the top models in machine learning competitions.

7. LightGBM (Light Gradient Boosting Machine)
	•	Similar to XGBoost but optimized for speed and efficiency.
	•	Grows trees leaf-wise rather than level-wise, which can reduce error more efficiently.
	•	Handles categorical features directly and works well with large datasets.
	•	Faster training and prediction compared to XGBoost.
	•	Handles large-scale datasets effectively.

8. K-Nearest Neighbors (KNN)
	•	Predicts the class of a data point based on the majority class of its k nearest neighbors in the feature space.
	•	Distance is typically calculated using Euclidean distance.
	•	Simple and intuitive.
	•	No explicit training phase, making it a lazy learning algorithm.
	•	Effective for datasets with smaller feature spaces.

9. Neural Network
	•	Consists of layers of interconnected nodes (neurons).
	•	Input features are transformed through weighted connections and activation functions to make predictions.
	•	Uses backpropagation and gradient descent to learn optimal weights.
	•	Can model complex, non-linear relationships.
	•	Scales well with larger datasets and features.

In [44]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

models = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42),
    'AdaBoost': AdaBoostClassifier(random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42),
    'XGBoost': XGBClassifier(random_state=42, use_label_encoder=False, eval_metric='logloss'),
    'LightGBM': LGBMClassifier(random_state=42),
    'KNN': KNeighborsClassifier(n_neighbors=5)
}

In [65]:
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_val)
    accuracy = accuracy_score(y_val, y_pred)
    results[name] = accuracy
    print(f"{name}: Accuracy = {accuracy:.4f}")

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
Parameters: { "use_label_encoder" } are not used.



Logistic Regression: Accuracy = 0.7821
Decision Tree: Accuracy = 0.7374
Random Forest: Accuracy = 0.7933
AdaBoost: Accuracy = 0.8101
Gradient Boosting: Accuracy = 0.8045
XGBoost: Accuracy = 0.8156
[LightGBM] [Info] Number of positive: 268, number of negative: 444
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000323 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 435
[LightGBM] [Info] Number of data points in the train set: 712, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.376404 -> initscore=-0.504838
[LightGBM] [Info] Start training from score -0.504838
LightGBM: Accuracy = 0.8101
KNN: Accuracy = 0.5642


In [46]:
sorted_results = sorted(results.items(), key=lambda x: x[1], reverse=True)
print("\nModel Performance (Sorted by Accuracy):")
for name, acc in sorted_results:
    print(f"{name}: {acc:.4f}")


Model Performance (Sorted by Accuracy):
XGBoost: 0.8156
AdaBoost: 0.8101
LightGBM: 0.8101
Gradient Boosting: 0.8045
Random Forest: 0.7933
Logistic Regression: 0.7821
Decision Tree: 0.7374
KNN: 0.5642


Okay, until now, we tried basic traditional ML models but, we can try advanced techniques likes hyper parameter tuning and fine tuning, then move on to ensemble methoda and try simple neural network.

Try Fine Tuning the Top-performing Models

Fine-tuning involves optimizing the hyperparameters of a model to improve its accuracy. Let’s start with Random Forest and Gradient Boosting as examples.

Before Fine Tuning Parameters, lets see their default values that we trained with earlier
1. Logistic Regression (LogisticRegression)
LogisticRegression(
    penalty='l2',             # Regularization (L2 by default)
    dual=False,               # Only relevant for certain solvers
    tol=1e-4,                 # Stopping criterion tolerance
    C=1.0,                    # Inverse of regularization strength
    fit_intercept=True,       # Whether to include an intercept in the model
    intercept_scaling=1,      # For certain solvers only
    solver='lbfgs',           # Optimization algorithm ('lbfgs' by default)
    max_iter=100,             # Maximum number of iterations
    multi_class='auto',       # Handles multi-class problems automatically
    random_state=None         # Random seed for reproducibility
)

2. Decision Tree (DecisionTreeClassifier)
DecisionTreeClassifier(
    criterion='gini',         # Splitting criterion ('gini' or 'entropy')
    splitter='best',          # Strategy to split at each node ('best' or 'random')
    max_depth=None,           # No limit on tree depth
    min_samples_split=2,      # Minimum samples required to split a node
    min_samples_leaf=1,       # Minimum samples required at a leaf node
    max_features=None,        # No limit on the number of features considered
    random_state=None         # Random seed for reproducibility
)

3. Random Forest (RandomForestClassifier)
RandomForestClassifier(
    n_estimators=100,         # Number of trees in the forest
    criterion='gini',         # Splitting criterion ('gini' or 'entropy')
    max_depth=None,           # No limit on tree depth
    min_samples_split=2,      # Minimum samples required to split a node
    min_samples_leaf=1,       # Minimum samples required at a leaf node
    max_features='sqrt',      # Number of features to consider for splitting
    bootstrap=True,           # Use bootstrapped samples
    random_state=None         # Random seed for reproducibility
)

4. AdaBoost (AdaBoostClassifier)
AdaBoostClassifier(
    base_estimator=None,      # Default is `DecisionTreeClassifier(max_depth=1)`
    n_estimators=50,          # Number of weak learners
    learning_rate=1.0,        # Contribution of each weak learner
    algorithm='SAMME.R',      # Boosting algorithm ('SAMME' or 'SAMME.R')
    random_state=None         # Random seed for reproducibility
)

5. Gradient Boosting (GradientBoostingClassifier)
GradientBoostingClassifier(
    loss='log_loss',          # Loss function for classification
    learning_rate=0.1,        # Step size shrinkage
    n_estimators=100,         # Number of boosting stages
    subsample=1.0,            # Fraction of samples used for training each tree
    criterion='friedman_mse', # Splitting criterion
    max_depth=3,              # Maximum tree depth
    min_samples_split=2,      # Minimum samples required to split a node
    min_samples_leaf=1,       # Minimum samples required at a leaf node
    random_state=None         # Random seed for reproducibility
)

6. XGBoost (XGBClassifier)
XGBClassifier(
    booster='gbtree',         # Boosting method ('gbtree', 'gblinear', 'dart')
    learning_rate=0.3,        # Step size shrinkage
    n_estimators=100,         # Number of boosting rounds
    max_depth=6,              # Maximum depth of a tree
    min_child_weight=1,       # Minimum sum of instance weight needed in a child
    subsample=1.0,            # Fraction of samples used per tree
    colsample_bytree=1.0,     # Fraction of features used per tree
    random_state=None         # Random seed for reproducibility
)

7. LightGBM (LGBMClassifier)
LGBMClassifier(
    boosting_type='gbdt',     # Boosting method ('gbdt', 'dart', 'goss')
    num_leaves=31,            # Maximum leaves in one tree
    learning_rate=0.1,        # Step size shrinkage
    n_estimators=100,         # Number of boosting rounds
    max_depth=-1,             # No limit on tree depth
    random_state=None         # Random seed for reproducibility
)

8. K-Nearest Neighbors (KNeighborsClassifier)
KNeighborsClassifier(
    n_neighbors=5,            # Number of neighbors to consider
    weights='uniform',        # All neighbors have equal weight
    algorithm='auto',         # Algorithm for nearest neighbors search
    leaf_size=30,             # Leaf size for tree-based algorithms
    metric='minkowski',       # Distance metric
    p=2,                      # Power parameter for Minkowski distance
    n_jobs=None               # Number of parallel jobs
)

1. Random Forest Fine-Tuning
	•	n_estimators: Number of trees in the forest.
	•	max_depth: Maximum depth of each tree.
	•	min_samples_split: Minimum number of samples required to split an internal node.
	•	min_samples_leaf: Minimum number of samples required to be at a leaf node.

In [68]:
from sklearn.model_selection import GridSearchCV

rf_params = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

rf = RandomForestClassifier(random_state=42)
rf_grid = GridSearchCV(rf, rf_params, cv=5, scoring='accuracy', verbose=1, n_jobs=-1)
rf_grid.fit(X_train, y_train)

print("Best Parameters for Random Forest:", rf_grid.best_params_)
print("Best Cross-Validation Accuracy for Random Forest:", rf_grid.best_score_)


Fitting 5 folds for each of 81 candidates, totalling 405 fits
Best Parameters for Random Forest: {'max_depth': None, 'min_samples_leaf': 4, 'min_samples_split': 10, 'n_estimators': 100}
Best Cross-Validation Accuracy for Random Forest: 0.8272037821333595


2. Gradient Boosting Fine Tuning
	•	n_estimators: Number of boosting stages.
	•	learning_rate: Step size for weight updates.
	•	max_depth: Maximum depth of each tree.

In [69]:
gb_params = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 7]
}

gb = GradientBoostingClassifier(random_state=42)
gb_grid = GridSearchCV(gb, gb_params, cv=5, scoring='accuracy', verbose=1, n_jobs=-1)
gb_grid.fit(X_train, y_train)

print("Best Parameters for Gradient Boosting:", gb_grid.best_params_)
print("Best Cross-Validation Accuracy for Gradient Boosting:", gb_grid.best_score_)

Fitting 5 folds for each of 27 candidates, totalling 135 fits
Best Parameters for Gradient Boosting: {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 200}
Best Cross-Validation Accuracy for Gradient Boosting: 0.8257953314291342


3. Logistic Regression
	•	C: Regularization strength (smaller values imply stronger regularization).
	•	solver: Optimization algorithm.
	•	penalty: Type of regularization (L1, L2).

In [70]:
from sklearn.linear_model import LogisticRegression

lr_params = {
    'C': [0.01, 0.1, 1, 10, 100],        
    'solver': ['liblinear', 'lbfgs'],    
    'penalty': ['l2']                    
}


lr = LogisticRegression(random_state=42)
lr_grid = GridSearchCV(lr, lr_params, cv=5, scoring='accuracy', verbose=1, n_jobs=-1)
lr_grid.fit(X_train, y_train)


print("Best Parameters for Logistic Regression:", lr_grid.best_params_)
print("Best Cross-Validation Accuracy for Logistic Regression:", lr_grid.best_score_)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
Best Parameters for Logistic Regression: {'C': 1, 'penalty': 'l2', 'solver': 'liblinear'}
Best Cross-Validation Accuracy for Logistic Regression: 0.8201812272234807


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

4. Decision Tree Fine Tuning

	•	criterion: Splitting criterion (gini, entropy).

	•	max_depth: Maximum depth of the tree.

	•	min_samples_split: Minimum samples required to split a node.

	•	min_samples_leaf: Minimum samples required to form a leaf node.

In [71]:
from sklearn.tree import DecisionTreeClassifier

dt_params = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

dt = DecisionTreeClassifier(random_state=42)
dt_grid = GridSearchCV(dt, dt_params, cv=5, scoring='accuracy', verbose=1, n_jobs=-1)
dt_grid.fit(X_train, y_train)

print("Best Parameters for Decision Tree:", dt_grid.best_params_)
print("Best Cross-Validation Accuracy for Decision Tree:", dt_grid.best_score_)

Fitting 5 folds for each of 72 candidates, totalling 360 fits
Best Parameters for Decision Tree: {'criterion': 'gini', 'max_depth': 10, 'min_samples_leaf': 4, 'min_samples_split': 2}
Best Cross-Validation Accuracy for Decision Tree: 0.810371318822023


4. LightGBM Fine-Tuning

	•	num_leaves: Maximum number of leaves per tree.

	•	learning_rate: Step size shrinkage.
    
	•	n_estimators: Number of boosting iterations.

In [72]:
from lightgbm import LGBMClassifier

lgbm_params = {
    'num_leaves': [31, 50, 70],
    'learning_rate': [0.01, 0.1, 0.2],
    'n_estimators': [100, 200, 300]
}

lgbm = LGBMClassifier(random_state=42)
lgbm_grid = GridSearchCV(lgbm, lgbm_params, cv=5, scoring='accuracy', verbose=1, n_jobs=-1)
lgbm_grid.fit(X_train, y_train)

print("Best Parameters for LightGBM:", lgbm_grid.best_params_)
print("Best Cross-Validation Accuracy for LightGBM:", lgbm_grid.best_score_)



Fitting 5 folds for each of 27 candidates, totalling 135 fits
[LightGBM] [Info] Number of positive: 214, number of negative: 355
[LightGBM] [Info] Number of positive: 215, number of negative: 355
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001269 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Number of positive: 214, number of negative: 355
[LightGBM] [Info] Total Bins 372
[LightGBM] [Info] Number of data points in the train set: 569, number of used features: 8
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001308 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 376
[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.376098 -> initscore=-0.506142
[LightGBM] [Info] Start training from score -0.506142
[LightGBM] [Info] Number of posit



[LightGBM] [Info] Number of positive: 214, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.020084 seconds.
You can set `force_row_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 374

[LightGBM] [Info] Number of data points in the train set: 569, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.376098 -> initscore=-0.506142
[LightGBM] [Info] Start training from score -0.506142

[LightGBM] [Info] Number of positive: 215, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001212 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 376
[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.377193 -> initscore=-0.501480
[LightGBM] [Info




[LightGBM] [Info] Number of positive: 214, number of negative: 356
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002325 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 375
[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.375439 -> initscore=-0.508955
[LightGBM] [Info] Start training from score -0.508955
[LightGBM] [Info] Number of positive: 214, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001199 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 372
[LightGBM] [Info] Number of data points in the train set: 569, number of used features: 8
[LightGBM] [Info] [binary:BoostFr












[LightGBM] [Info] Number of positive: 215, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001297 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 376
[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.377193 -> initscore=-0.501480
[LightGBM] [Info] Start training from score -0.501480



[LightGBM] [Info] Number of positive: 215, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001842 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 370
[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8
[LightGBM] [Info] [bin



[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.042879 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 375
[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.375439 -> initscore=-0.508955
[LightGBM] [Info] Start training from score -0.508955


[LightGBM] [Info] Number of positive: 214, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001801 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 372
[LightGBM] [Info] Number of data points in the train set: 569, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.376098 -> initscore=-0.506142
[LightGBM] [Info] S















[LightGBM] [Info] Number of positive: 214, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001494 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 374
[LightGBM] [Info] Number of data points in the train set: 569, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.376098 -> initscore=-0.506142
[LightGBM] [Info] Start training from score -0.506142
[LightGBM] [Info] Number of positive: 215, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001438 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 376
[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8
[LightGBM] [Info] [bin



[LightGBM] [Info] Number of positive: 215, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002009 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 370
[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.377193 -> initscore=-0.501480
[LightGBM] [Info] Start training from score -0.501480
[LightGBM] [Info] Number of positive: 214, number of negative: 356
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001099 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 375
[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8
[LightGBM] [Info] [binary:BoostFro



[LightGBM] [Info] Number of positive: 214, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001752 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 374
[LightGBM] [Info] Number of data points in the train set: 569, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.376098 -> initscore=-0.506142
[LightGBM] [Info] Start training from score -0.506142
[LightGBM] [Info] Number of positive: 215, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001533 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 376
[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8
[LightGBM] [Info] [binary:BoostFro













[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001437 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 370
[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.377193 -> initscore=-0.501480
[LightGBM] [Info] Start training from score -0.501480





[LightGBM] [Info] Number of positive: 214, number of negative: 356
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.003255 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 375
[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.375439 -> initscore=-0.508955
[LightGBM] [Info] Start training from score -0.508955

[LightGBM] [Info] Number of positive: 214, number of negative: 355




[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001004 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 372
[LightGBM] [Info] Number of data points in the train set: 569, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.376098 -> initscore=-0.506142
[LightGBM] [Info] Start training from score -0.506142


[LightGBM] [Info] Number of positive: 214, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001438 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 374
[LightGBM] [Info] Number of data points in the train set: 569, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.376098 -> initscore=-0.506142
[LightGBM] [Info] S















[LightGBM] [Info] Number of positive: 215, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001524 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 370
[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.377193 -> initscore=-0.501480
[LightGBM] [Info] Start training from score -0.501480
[LightGBM] [Info] Number of positive: 215, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001130 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 376
[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8
[LightGBM] [Info] [bin



[LightGBM] [Info] Number of positive: 214, number of negative: 356
[LightGBM] [Info] Number of positive: 214, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001105 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 375
[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.375439 -> initscore=-0.508955
[LightGBM] [Info] Start training from score -0.508955

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001434 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 372
[LightGBM] [Info] Number of data points in the train set: 569, number of used features: 8
[LightGBM] [Info] [binary:BoostFr





















[LightGBM] [Info] Number of positive: 214, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001535 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 374
[LightGBM] [Info] Number of data points in the train set: 569, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.376098 -> initscore=-0.506142
[LightGBM] [Info] Start training from score -0.506142





[LightGBM] [Info] Number of positive: 215, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001388 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 376
[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.377193 -> initscore=-0.501480
[LightGBM] [Info] Start training from score -0.501480
[LightGBM] [Info] Number of positive: 215, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000943 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 370
[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8
[LightGBM] [Info] [binary:BoostFro








[LightGBM] [Info] Number of positive: 214, number of negative: 356
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001423 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 375
[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.375439 -> initscore=-0.508955
[LightGBM] [Info] Start training from score -0.508955
[LightGBM] [Info] Number of positive: 214, number of negative: 355
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001272 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 372
[LightGBM] [Info] Number of data points in the train set: 569, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.376098 -> initscore=-0.506142
[LightGBM] [I




[LightGBM] [Info] Number of positive: 214, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001699 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 374
[LightGBM] [Info] Number of data points in the train set: 569, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.376098 -> initscore=-0.506142
[LightGBM] [Info] Start training from score -0.506142






[LightGBM] [Info] Number of positive: 215, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001922 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 376
[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.377193 -> initscore=-0.501480
[LightGBM] [Info] Start training from score -0.501480













[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.003692 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 370
[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.377193 -> initscore=-0.501480




[LightGBM] [Info] Start training from score -0.501480
[LightGBM] [Info] Number of positive: 214, number of negative: 356
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001770 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 375
[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.375439 -> initscore=-0.508955
[LightGBM] [Info] Start training from score -0.508955
[LightGBM] [Info] Number of positive: 214, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001419 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 372
[LightGBM] [Info] Number of data points in the train set: 569, number 



[LightGBM] [Info] Number of positive: 215, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001209 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 370
[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.377193 -> initscore=-0.501480
[LightGBM] [Info] Start training from score -0.501480
[LightGBM] [Info] Number of positive: 214, number of negative: 356
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001343 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 375
[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8
[LightGBM] [Info] [binary:BoostFro





[LightGBM] [Info] Number of positive: 214, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002191 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 372
[LightGBM] [Info] Number of data points in the train set: 569, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.376098 -> initscore=-0.506142
[LightGBM] [Info] Start training from score -0.506142
[LightGBM] [Info] Number of positive: 214, number of negative: 355
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.003839 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 374
[LightGBM] [Info] Number of data points in the train set: 569, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.376098 -> initscore=-0.506142
[LightGBM] [Info



[LightGBM] [Info] Number of positive: 215, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001525 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 376
[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.377193 -> initscore=-0.501480
[LightGBM] [Info] Start training from score -0.501480




[LightGBM] [Info] Number of positive: 214, number of negative: 356
[LightGBM] [Info] Number of positive: 215, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002432 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 370
[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.377193 -> initscore=-0.501480
[LightGBM] [Info] Start training from score -0.501480

[LightGBM] [Info] Number of positive: 214, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002584 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 372
[LightGBM] [Info] Number of data points in the train set



[LightGBM] [Info] Number of positive: 214, number of negative: 356
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000920 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 375
[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.375439 -> initscore=-0.508955
[LightGBM] [Info] Start training from score -0.508955






[LightGBM] [Info] Number of positive: 214, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001054 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 372
[LightGBM] [Info] Number of data points in the train set: 569, number of used features: 8
[LightGBM] [Info] [binary:Bo



[LightGBM] [Info] Number of positive: 214, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001579 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 374
[LightGBM] [Info] Number of data points in the train set: 569, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.376098 -> initscore=-0.506142
[LightGBM] [Info] Start training from score -0.506142






[LightGBM] [Info] Number of positive: 215, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001062 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 376
[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.377193 -> initscore=-0.501480
[LightGBM] [Info] Start training from score -0.501480
[LightGBM] [Info] Number of positive: 215, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.014709 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 370
[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8
[LightGBM] [Info] [binary:BoostFr



[LightGBM] [Info] Number of positive: 214, number of negative: 356
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001141 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 375
[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.375439 -> initscore=-0.508955
[LightGBM] [Info] Start training from score -0.508955
[LightGBM] [Info] Number of positive: 214, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001385 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 372
[LightGBM] [Info] Number of data points in the train set: 569, number of used features: 8
[LightGBM] [Info] [binary:BoostFro
















[LightGBM] [Info] Number of positive: 214, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001691 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 374
[LightGBM] [Info] Number of data points in the train set: 569, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.376098 -> initscore=-0.506142
[LightGBM] [Info] Start training from score -0.506142
[LightGBM] [Info] Number of positive: 215, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001523 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 376
[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8
[LightGBM] [Info] [bi



[LightGBM] [Info] Number of positive: 215, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001425 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 370
[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.377193 -> initscore=-0.501480
[LightGBM] [Info] Start training from score -0.501480
[LightGBM] [Info] Number of positive: 214, number of negative: 356
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001469 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 375
[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8
[LightGBM] [Info] [binary:BoostFro







[LightGBM] [Info] Number of positive: 214, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001575 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 374
[LightGBM] [Info] Number of data points in the train set: 569, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.376098 -> initscore=-0.506142
[LightGBM] [Info] Start training from score -0.506142










[LightGBM] [Info] Number of positive: 215, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001936 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 376
[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.377193 -> initscore=-0.501480
[LightGBM] [Info] Start training from score -0.501480
[LightGBM] [Info] Number of positive: 215, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001509 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 370
[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8
[LightGBM] [Info] [binary:Boos







[LightGBM] [Info] Number of positive: 214, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001596 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 372
[LightGBM] [Info] Number of data points in the train set: 569, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.376098 -> initscore=-0.506142
[LightGBM] [Info] Start training from score -0.506142
[LightGBM] [Info] Number of positive: 214, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001070 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 374
[LightGBM] [Info] Number of data points in the train set: 569, number of used features: 8
[LightGBM] [Info] [binary:BoostFro



[LightGBM] [Info] Number of positive: 214, number of negative: 356
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001833 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 375
[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.375439 -> initscore=-0.508955
[LightGBM] [Info] Start training from score -0.508955

[LightGBM] [Info] Number of positive: 214, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001250 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 372
[LightGBM] [Info] Number of data points in the train set: 569, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.376098 -> initscore=-0.506142
[LightGBM] [Info]









[LightGBM] [Info] Number of positive: 214, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.003168 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 374
[LightGBM] [Info] Number of data points in the train set: 569, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.376098 -> initscore=-0.506142
[LightGBM] [Info] Start training from score -0.506142






[LightGBM] [Info] Number of positive: 215, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001403 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 376
[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8
[LightGBM] [Info] [binary:Bo




[LightGBM] [Info] Number of positive: 215, number of negative: 355




[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.031132 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 370
[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.377193 -> initscore=-0.501480
[LightGBM] [Info] Start training from score -0.501480
[LightGBM] [Info] Number of positive: 214, number of negative: 356
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000953 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 375
[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.375439 -> initscore=-0.508955
[LightGBM] [Info] Sta




[LightGBM] [Info] Number of positive: 214, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001047 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 372
[LightGBM] [Info] Number of data points in the train set: 569, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.376098 -> initscore=-0.506142
[LightGBM] [Info] Start training from score -0.506142

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002901 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 374
[LightGBM] [Info] Number of data points in the train set: 569, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.376098 -> initscore=-0.506142
[LightGBM] [Info] S




[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.004922 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 376
[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.377193 -> initscore=-0.501480
[LightGBM] [Info] Start training from score -0.501480













[LightGBM] [Info] Number of positive: 215, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001160 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 370
[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.377193 -> initscore=-0.501480
[LightGBM] [Info] Start training from score -0.501480




[LightGBM] [Info] Number of positive: 214, number of negative: 356
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001573 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 375
[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.375439 -> initscore=-0.508955
[LightGBM] [Info] Start training from score -0.508955
[LightGBM] [Info] Number of positive: 214, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.023758 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 372
[LightGBM] [Info] Number of data points in the train set: 569, number of used features: 8
[LightGBM] [Info] [binary:BoostFro




[LightGBM] [Info] Number of positive: 214, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001835 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 374
[LightGBM] [Info] Number of data points in the train set: 569, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.376098 -> initscore=-0.506142
[LightGBM] [Info] Start training from score -0.506142
[LightGBM] [Info] Number of positive: 215, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.008829 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 376
[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8
[LightGBM] [Info] [binary:BoostFr



[LightGBM] [Info] Number of positive: 215, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001723 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 370
[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.377193 -> initscore=-0.501480
[LightGBM] [Info] Start training from score -0.501480





[LightGBM] [Info] Number of positive: 214, number of negative: 356
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000803 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 375
[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8
[LightGBM] [Info] [binary:Boo









[LightGBM] [Info] Number of positive: 214, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001569 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 372




[LightGBM] [Info] Number of data points in the train set: 569, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.376098 -> initscore=-0.506142
[LightGBM] [Info] Start training from score -0.506142

[LightGBM] [Info] Number of positive: 214, number of negative: 355
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001143 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 374
[LightGBM] [Info] Number of data points in the train set: 569, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.376098 -> initscore=-0.506142
[LightGBM] [Info] Start training from score -0.506142
[LightGBM] [Info] Number of positive: 215, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001468 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=tru





[LightGBM] [Info] Number of positive: 215, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001204 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 370
[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.377193 -> initscore=-0.501480
[LightGBM] [Info] Start training from score -0.501480
[LightGBM] [Info] Number of positive: 214, number of negative: 356
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001530 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 375
[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8
[LightGBM] [Info] [binary:BoostF



[LightGBM] [Info] Number of positive: 214, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001308 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 372

[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.376098 -> initscore=-0.506142
[LightGBM] [Info] Start training from score -0.506142

[LightGBM] [Info] Number of positive: 214, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001891 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 374
[LightGBM] [Info] Number of data points in the train set: 569, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.376098 -> initscore=-0.506142
[LightGBM] [Info] Start training from scor






[LightGBM] [Info] Number of positive: 215, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001016 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 376

[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.377193 -> initscore=-0.501480
[LightGBM] [Info] Start training from score -0.501480





[LightGBM] [Info] Number of positive: 215, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002116 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 370
[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.377193 -> initscore=-0.501480
[LightGBM] [Info] Start training from score -0.501480






[LightGBM] [Info] Number of positive: 214, number of negative: 356
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001256 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 375
[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.375439 -> initscore=-0.508955
[LightGBM] [Info] Start training from score -0.508955







[LightGBM] [Info] Number of positive: 214, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001586 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 372
[LightGBM] [Info] Number of data points in the train set: 569, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.376098 -> initscore=-0.506142
[LightGBM] [Info] Start training from score -0.506142
[LightGBM] [Info] Number of positive: 214, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001235 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 374
[LightGBM] [Info] Number of data points in the train set: 569, number of used features: 8
[LightGBM] [Info] [binary:Boost







[LightGBM] [Info] Number of positive: 215, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001374 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 376
[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.377193 -> initscore=-0.501480
[LightGBM] [Info] Start training from score -0.501480
[LightGBM] [Info] Number of positive: 215, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001233 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 370
[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8
[LightGBM] [Info] [binary:BoostFro




[LightGBM] [Info] Number of positive: 214, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001531 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 372
[LightGBM] [Info] Number of data points in the train set: 569, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.376098 -> initscore=-0.506142
[LightGBM] [Info] Start training from score -0.506142













[LightGBM] [Info] Number of positive: 214, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001299 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 374
[LightGBM] [Info] Number of data points in the train set: 569, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.376098 -> initscore=-0.506142
[LightGBM] [Info] Start training from score -0.506142







[LightGBM] [Info] Number of positive: 215, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001295 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 376
[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.377193 -> initscore=-0.501480
[LightGBM] [Info] Start training from score -0.501480






[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001362 seconds.
You can set `force_row_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 370

[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.377193 -> initscore=-0.501480
[LightGBM] [Info] Start training from score -0.501480



[LightGBM] [Info] Number of positive: 214, number of negative: 356
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001879 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.

[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.375439 -> initscore=-0.508955
[LightGBM] [Info] Start training from score -0.508955




[LightGBM] [Info] Number of positive: 214, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001536 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 372
[LightGBM] [Info] Number of data points in the train set: 569, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.376098 -> initscore=-0.506142
[LightGBM] [Info] Start training from score -0.506142




[LightGBM] [Info] Number of positive: 214, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001767 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 374
[LightGBM] [Info] Number of data points in the train set: 569, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.376098 -> initscore=-0.506142
[LightGBM] [Info] Start training from score -0.506142




[LightGBM] [Info] Number of positive: 215, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.006410 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 376
[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8

[LightGBM] [Info] Start training from score -0.501480
[LightGBM] [Info] Number of positive: 215, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.006504 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 370
[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.377193 -> initscore=-0.501480
[LightGBM] [Info] Start training 




[LightGBM] [Info] Number of positive: 214, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001375 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 372
[LightGBM] [Info] Number of data points in the train set: 569, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.376098 -> initscore=-0.506142
[LightGBM] [Info] Start training from score -0.506142


[LightGBM] [Info] Number of positive: 214, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001677 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 374

[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.376098 -> initscore=-0.506142
[LightGBM] [Info] Start training from sc








[LightGBM] [Info] Number of positive: 215, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001547 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 376
[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.377193 -> initscore=-0.501480
[LightGBM] [Info] Start training from score -0.501480





[LightGBM] [Info] Number of positive: 215, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001692 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 370
[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.377193 -> initscore=-0.501480
[LightGBM] [Info] Start training from score -0.501480




[LightGBM] [Info] Number of positive: 214, number of negative: 356
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.008935 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 375
[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.375439 -> initscore=-0.508955
[LightGBM] [Info] Start training from score -0.508955

[LightGBM] [Info] Number of positive: 214, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002147 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 372
[LightGBM] [Info] Number of data points in the train set: 569, number of used features: 8
[LightGBM] [Info] [binary:BoostFr















[LightGBM] [Info] Number of positive: 214, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001213 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 374
[LightGBM] [Info] Number of data points in the train set: 569, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.376098 -> initscore=-0.506142
[LightGBM] [Info] Start training from score -0.506142






[LightGBM] [Info] Number of positive: 215, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001683 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 376
[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.377193 -> initscore=-0.501480
[LightGBM] [Info] Start training from score -0.501480
[LightGBM] [Info] Number of positive: 215, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001204 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 370
[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8
[LightGBM] [Info] [binary:BoostFr



[LightGBM] [Info] Number of positive: 214, number of negative: 356
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002935 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 375
[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.375439 -> initscore=-0.508955
[LightGBM] [Info] Start training from score -0.508955



































[LightGBM] [Info] Number of positive: 268, number of negative: 444
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000273 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 435
[LightGBM] [Info] Number of data points in the train set: 712, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.376404 -> initscore=-0.504838
[LightGBM] [Info] Start training from score -0.504838
Best Parameters for LightGBM: {'learning_rate': 0.01, 'n_estimators': 200, 'num_leaves': 31}
Best Cross-Validation Accuracy for LightGBM: 0.8271939328277357


5. K-Nearest Neighbors (KNN) Fine-Tuning

	•	n_neighbors: Number of neighbors to consider.

	•	weights: Weight function used (uniform, distance).
    
	•	metric: Distance metric.

In [73]:
from sklearn.neighbors import KNeighborsClassifier

knn_params = {
    'n_neighbors': [3, 5, 7, 9],
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan']
}

knn = KNeighborsClassifier()
knn_grid = GridSearchCV(knn, knn_params, cv=5, scoring='accuracy', verbose=1, n_jobs=-1)
knn_grid.fit(X_train, y_train)

print("Best Parameters for KNN:", knn_grid.best_params_)
print("Best Cross-Validation Accuracy for KNN:", knn_grid.best_score_)

Fitting 5 folds for each of 16 candidates, totalling 80 fits
Best Parameters for KNN: {'metric': 'manhattan', 'n_neighbors': 9, 'weights': 'distance'}
Best Cross-Validation Accuracy for KNN: 0.6614104205653503


Lets just again compare all the results we got, here are previous results we got without fine tuning

XGBoost: 0.8156
AdaBoost: 0.8101
LightGBM: 0.8101
Gradient Boosting: 0.8045
Random Forest: 0.7933
Logistic Regression: 0.7821
Decision Tree: 0.7374
KNN: 0.5642

Now, lets see results after fine tuning

In [74]:
model_results = {
    'Logistic Regression': lr_grid.best_score_,
    'Decision Tree': dt_grid.best_score_,
    'Random Forest': rf_grid.best_score_,
    'AdaBoost': 0.8101,
    'Gradient Boosting': gb_grid.best_score_,
    'XGBoost': 0.8156,
    'LightGBM': lgbm_grid.best_score_,
    'KNN': knn_grid.best_score_
}

sorted_results = sorted(model_results.items(), key=lambda x: x[1], reverse=True)

print("Model Performance (Sorted by Cross-Validation Accuracy):")
for model, score in sorted_results:
    print(f"{model}: {score:.4f}")

Model Performance (Sorted by Cross-Validation Accuracy):
Random Forest: 0.8272
LightGBM: 0.8272
Gradient Boosting: 0.8258
Logistic Regression: 0.8202
XGBoost: 0.8156
Decision Tree: 0.8104
AdaBoost: 0.8101
KNN: 0.6614


Now, lets try Ensemble Methods where combing 2 or more top performing models to get optimum results.

1. Stacking (Meta-Ensemble)

Stacking combines predictions from multiple base models using a meta-model. For example:
	• Base models: Random Forest, LightGBM, Gradient Boosting.
	• Meta-model: Logistic Regression (or another lightweight model).

In [75]:
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression


base_models = [
    ('rf', RandomForestClassifier(**rf_grid.best_params_, random_state=42)),
    ('lgbm', LGBMClassifier(**lgbm_grid.best_params_, random_state=42)),
    ('gb', GradientBoostingClassifier(**gb_grid.best_params_, random_state=42))
]


meta_model = LogisticRegression()

stacking_model = StackingClassifier(estimators=base_models, final_estimator=meta_model, cv=5)
stacking_model.fit(X_train, y_train)

y_pred_stacking = stacking_model.predict(X_val)
stacking_accuracy = accuracy_score(y_val, y_pred_stacking)
print(f"Stacking Model Validation Accuracy: {stacking_accuracy:.4f}")

[LightGBM] [Info] Number of positive: 268, number of negative: 444
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000443 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 435
[LightGBM] [Info] Number of data points in the train set: 712, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.376404 -> initscore=-0.504838
[LightGBM] [Info] Start training from score -0.504838




[LightGBM] [Info] Number of positive: 214, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000476 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 372
[LightGBM] [Info] Number of data points in the train set: 569, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.376098 -> initscore=-0.506142
[LightGBM] [Info] Start training from score -0.506142
[LightGBM] [Info] Number of positive: 214, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000378 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 374
[LightGBM] [Info] Number of data points in the train set: 569, number of used features: 8
[LightGBM] [Info] [binary:BoostFro



[LightGBM] [Info] Number of positive: 215, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000330 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 376
[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.377193 -> initscore=-0.501480
[LightGBM] [Info] Start training from score -0.501480




[LightGBM] [Info] Number of positive: 215, number of negative: 355
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000333 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 370
[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.377193 -> initscore=-0.501480
[LightGBM] [Info] Start training from score -0.501480




[LightGBM] [Info] Number of positive: 214, number of negative: 356
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000386 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 375
[LightGBM] [Info] Number of data points in the train set: 570, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.375439 -> initscore=-0.508955
[LightGBM] [Info] Start training from score -0.508955




Stacking Model Validation Accuracy: 0.8212




2. Blending

Blending uses a holdout dataset to combine predictions from multiple models. It’s simpler but less robust than stacking.

In [76]:
rf_preds = RandomForestClassifier(**rf_grid.best_params_, random_state=42).fit(X_train, y_train).predict_proba(X_val)[:, 1]
lgbm_preds = LGBMClassifier(**lgbm_grid.best_params_, random_state=42).fit(X_train, y_train).predict_proba(X_val)[:, 1]
gb_preds = GradientBoostingClassifier(**gb_grid.best_params_, random_state=42).fit(X_train, y_train).predict_proba(X_val)[:, 1]

combined_preds = (rf_preds + lgbm_preds + gb_preds) / 3

blended_preds = (combined_preds > 0.5).astype(int)
blending_accuracy = accuracy_score(y_val, blended_preds)
print(f"Blending Model Validation Accuracy: {blending_accuracy:.4f}")

[LightGBM] [Info] Number of positive: 268, number of negative: 444
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000401 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 435
[LightGBM] [Info] Number of data points in the train set: 712, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.376404 -> initscore=-0.504838
[LightGBM] [Info] Start training from score -0.504838
Blending Model Validation Accuracy: 0.8212


I dont think there is much increase in accuracy for ensemble techniques over hyper parameter tuning. Lets try simple neural network.

Since, the dataset is very small, neural network might not give good results but lets try and see.

In [77]:
import torch
from torch.utils.data import Dataset, DataLoader

class TitanicDataset(Dataset):
    def __init__(self, features, labels=None):
        self.features = torch.tensor(features.values, dtype=torch.float32)
        self.labels = torch.tensor(labels.values, dtype=torch.float32) if labels is not None else None

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        if self.labels is not None:
            return self.features[idx], self.labels[idx]
        return self.features[idx]

train_dataset = TitanicDataset(X_train, y_train)   
val_dataset = TitanicDataset(X_val, y_val)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)

In [78]:
import torch.nn as nn

class TitanicNN(nn.Module):
    def __init__(self, input_size):
        super(TitanicNN, self).__init__()
        self.fc1 = nn.Linear(input_size, 128)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.3)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, 1)  
        self.sigmoid = nn.Sigmoid()  

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.relu(self.fc2(x))
        x = self.sigmoid(self.fc3(x))
        return x

In [79]:
input_size = X_train.shape[1]
model = TitanicNN(input_size)
criterion = nn.BCELoss() 
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

epochs = 50
for epoch in range(epochs):
    model.train()
    train_loss = 0.0

    for features, labels in train_loader:
        outputs = model(features).squeeze()
        loss = criterion(outputs, labels)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        train_loss += loss.item()

    model.eval()
    val_loss = 0.0
    with torch.no_grad():
        correct = 0
        total = 0
        for features, labels in val_loader:
            outputs = model(features).squeeze()
            loss = criterion(outputs, labels)
            val_loss += loss.item()

            predicted = (outputs >= 0.5).float()
            correct += (predicted == labels).sum().item()
            total += labels.size(0)

    val_accuracy = correct / total
    print(f"Epoch {epoch+1}/{epochs}, Train Loss: {train_loss/len(train_loader):.4f}, "
          f"Val Loss: {val_loss/len(val_loader):.4f}, Val Accuracy: {val_accuracy:.4f}")

Epoch 1/50, Train Loss: 5.2406, Val Loss: 1.8827, Val Accuracy: 0.4134
Epoch 2/50, Train Loss: 1.7273, Val Loss: 2.2117, Val Accuracy: 0.4134
Epoch 3/50, Train Loss: 1.4708, Val Loss: 0.6644, Val Accuracy: 0.5866
Epoch 4/50, Train Loss: 1.0895, Val Loss: 1.5190, Val Accuracy: 0.4134
Epoch 5/50, Train Loss: 1.2133, Val Loss: 0.7270, Val Accuracy: 0.4637
Epoch 6/50, Train Loss: 1.1147, Val Loss: 0.8082, Val Accuracy: 0.4358
Epoch 7/50, Train Loss: 1.1223, Val Loss: 0.6836, Val Accuracy: 0.5866
Epoch 8/50, Train Loss: 0.9800, Val Loss: 0.8729, Val Accuracy: 0.4413
Epoch 9/50, Train Loss: 0.8603, Val Loss: 0.7102, Val Accuracy: 0.4749
Epoch 10/50, Train Loss: 0.8011, Val Loss: 0.6763, Val Accuracy: 0.6145
Epoch 11/50, Train Loss: 0.7512, Val Loss: 0.7826, Val Accuracy: 0.4581
Epoch 12/50, Train Loss: 0.8496, Val Loss: 0.6404, Val Accuracy: 0.5922
Epoch 13/50, Train Loss: 0.6927, Val Loss: 0.6952, Val Accuracy: 0.5866
Epoch 14/50, Train Loss: 0.7405, Val Loss: 0.9021, Val Accuracy: 0.5866
E

Now, lets compare the accuracies of all models and display them in descending order in which we can see Random Forest FT: 0.8272
Light GBM FT: 0.8272
Gradient Boost FT: 0.8258
Logistic FT: 0.8202 are performing well

In [80]:
model_results = {
    'Logistic': logistic_accuracy,
    'Decision Tree': 0.7374,
    'Random Forest': 0.7933,
    'AdaBoost': 0.8101,
    'Gradient Boosting': 0.8045,
    'XGBoost': 0.8156,
    'LightGBM': 0.8101,
    'KNN': 0.5642,

    'Random Forest FT': rf_grid.best_score_,
    'Gradient Boost FT': gb_grid.best_score_,
    'Logistic FT': lr_grid.best_score_,
    'Decision Tree FT': dt_grid.best_score_,
    'Light GBM FT': lgbm_grid.best_score_,
    'KNN FT': knn_grid.best_score_,

    'Neural Network': val_accuracy
}

sorted_results = sorted(model_results.items(), key=lambda x: x[1], reverse=True)

print("Model Performance (Sorted by Cross-Validation Accuracy):")
for model, score in sorted_results:
    print(f"{model}: {score:.4f}")

Model Performance (Sorted by Cross-Validation Accuracy):
Random Forest FT: 0.8272
Light GBM FT: 0.8272
Gradient Boost FT: 0.8258
Logistic FT: 0.8202
XGBoost: 0.8156
Decision Tree FT: 0.8104
AdaBoost: 0.8101
LightGBM: 0.8101
Gradient Boosting: 0.8045
Random Forest: 0.7933
Logistic: 0.7821
Neural Network: 0.7821
Decision Tree: 0.7374
KNN FT: 0.6614
KNN: 0.5642


## Predict Output with all Models we Trained

Now, Finally predict the output of survived for test dataset for all models we did

First, lets get predictions using Base Models

In [67]:
results = {}
test_data_ids = test_data['PassengerId']
for name, model in models.items():
    y_pred = model.predict(test_data)
    output = pd.DataFrame({'PassengerId': test_data_ids, 'Survived': y_pred})
    filename = f"{name.replace(' ', '_').lower()}_predictions.csv"
    output.to_csv(filename, index=False)
    print(f"Saved predictions for {name} to {filename}")

Saved predictions for Logistic Regression to logistic_regression_predictions.csv
Saved predictions for Decision Tree to decision_tree_predictions.csv
Saved predictions for Random Forest to random_forest_predictions.csv
Saved predictions for AdaBoost to adaboost_predictions.csv
Saved predictions for Gradient Boosting to gradient_boosting_predictions.csv
Saved predictions for XGBoost to xgboost_predictions.csv
Saved predictions for LightGBM to lightgbm_predictions.csv
Saved predictions for KNN to knn_predictions.csv


Now, Predictions of fine tuned models

In [81]:
fine_tine_models = {
    'Random Forest FT': rf_grid,
    'Gradient Boost FT': gb_grid,
    'Logistic FT': lr_grid,
    'Decision Tree FT': dt_grid,
    'Light GBM FT': lgbm_grid,
    'KNN FT': knn_grid,
}

for name, model in fine_tine_models.items():
    y_pred = model.predict(test_data)
    output = pd.DataFrame({'PassengerId': test_data_ids, 'Survived': y_pred})
    filename = f"{name.replace(' ', '_').lower()}_predictions.csv"
    output.to_csv(filename, index=False)
    print(f"Saved predictions for {name} to {filename}")

Saved predictions for Random Forest FT to random_forest_ft_predictions.csv
Saved predictions for Gradient Boost FT to gradient_boost_ft_predictions.csv
Saved predictions for Logistic FT to logistic_ft_predictions.csv
Saved predictions for Decision Tree FT to decision_tree_ft_predictions.csv
Saved predictions for Light GBM FT to light_gbm_ft_predictions.csv
Saved predictions for KNN FT to knn_ft_predictions.csv




Now, lets get prediction of Neural Network

In [82]:
nn_preds = model.predict(test_data)
nn_preds_binary = (nn_preds >= 0.5).astype(int).squeeze()
nn_output = pd.DataFrame({'PassengerId': test_data_ids, 'Survived': nn_preds_binary})

nn_output.to_csv('neural_network_predictions.csv', index=False)
print("Neural Network predictions saved to 'neural_network_predictions.csv'")

Neural Network predictions saved to 'neural_network_predictions.csv'
