
## Importing necessary libraries


This section imports the libraries used in this lesson:

- `pandas` for loading and manipulating the data
- `numpy` for numerical computations
- `seaborn` for loading the dataset and visualizing the data
- `matplotlib` for creating plots
- `sklearn.linear_model` for the logistic regression model
- `sklearn.neighbors` for the k-nearest neighbors (KNN) classifier
- `sklearn.model_selection` for splitting the data into training and testing sets, and for hyperparameter tuning
- `sklearn.metrics` for evaluating the performance of the models

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

## Loading the Titanic dataset

In this section, we load the Titanic dataset from the seaborn library. The dataset describes passengers on the Titanic, including their demographics, ticket information, and survival status.

Here is the link to the Kaggle competition hosting this dataset: https://www.kaggle.com/competitions/titanic/overview

In [None]:
data = sns.load_dataset("titanic")

In [None]:
data

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,,1,2,23.4500,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True


### Dataset exploration

In [None]:
# Check the first 5 rows
data.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [None]:
# Get columns
data.columns

Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
       'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town',
       'alive', 'alone'],
      dtype='object')

### Checking if there is missing data

In [None]:
data.isnull().sum(axis=0)

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64
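
The counts above are easier to compare as shares of the 891 rows. The sketch below reuses the non-zero counts printed above; the 50% cutoff and the variable names are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd

# Non-zero missing-value counts copied from the output above
missing_counts = pd.Series({"age": 177, "embarked": 2, "deck": 688, "embark_town": 2})
n_rows = 891  # number of rows in the Titanic dataset

# Share of rows with a missing value, per column
missing_share = (missing_counts / n_rows).round(3)
print(missing_share)

# Hypothetical rule of thumb: flag columns where more than half the values are missing
mostly_missing = missing_share[missing_share > 0.5].index.tolist()
print(mostly_missing)
```

On a real DataFrame, `data.isnull().mean()` gives the same shares directly.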

In [None]:
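# Drop columns with missing values (age, deck) and columns that repeat other information (alive mirrors survived; who overlaps with sex and adult_male)
data_new = data.drop(["age", "deck", "alive", "who"], axis=1)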

In [None]:
data_new

Unnamed: 0,survived,pclass,sex,sibsp,parch,fare,embarked,class,adult_male,embark_town,alone
0,0,3,male,1,0,7.2500,S,Third,True,Southampton,False
1,1,1,female,1,0,71.2833,C,First,False,Cherbourg,False
2,1,3,female,0,0,7.9250,S,Third,False,Southampton,True
3,1,1,female,1,0,53.1000,S,First,False,Southampton,False
4,0,3,male,0,0,8.0500,S,Third,True,Southampton,True
...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,0,0,13.0000,S,Second,True,Southampton,True
887,1,1,female,0,0,30.0000,S,First,False,Southampton,True
888,0,3,female,1,2,23.4500,S,Third,False,Southampton,False
889,1,1,male,0,0,30.0000,C,First,True,Cherbourg,True


### Let's check feature types

**Numerical features**: These are features measured as numbers, such as height, weight, or age. They can be either continuous or discrete. Continuous numerical features can take any value within a range, such as a ticket fare. Discrete numerical features can take only a specific set of values, such as the number of children in a family.

**Categorical features**: These are features that can be divided into categories or groups, such as gender, color, etc. They can be either ordinal or nominal. Ordinal categorical features have a natural ordering, such as grade levels (1st, 2nd, 3rd). Nominal categorical features do not have a natural ordering, such as color (red, green, blue).

**Text features**: These are features that consist of text data, such as reviews, articles, etc. They are usually represented as a string of characters and require special processing, such as tokenization, in order to be used in a machine learning model.

**Binary features**: These are features that can only take one of two values, such as True/False or 0/1. These features are often used as flags or indicators in the data.
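
As a rough illustration of these categories, the sketch below groups the columns of a small made-up frame by dtype. The frame `toy` and its values are hypothetical; only the column names echo the Titanic data.

```python
import pandas as pd

# Hypothetical toy frame; values are made up for illustration
toy = pd.DataFrame({
    "fare": [7.25, 71.28, 8.05],        # continuous numerical
    "sibsp": [1, 1, 0],                 # discrete numerical
    "sex": ["male", "female", "male"],  # nominal categorical
    "class": pd.Categorical(
        ["Third", "First", "Third"],
        categories=["First", "Second", "Third"],
        ordered=True,
    ),                                  # ordinal categorical
    "alone": [False, False, True],      # binary
})

# Numeric dtypes (ints and floats); booleans are not counted as "number"
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
# Boolean flags
bool_cols = toy.select_dtypes(include="bool").columns.tolist()
# Everything else is treated as categorical here
categorical_cols = [c for c in toy.columns if c not in numeric_cols + bool_cols]

print(numeric_cols, bool_cols, categorical_cols)
```

Note that dtype is only a hint: `pclass` in the Titanic data is stored as `int64` but is really an ordinal category.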

In [None]:
data_new.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   sibsp        891 non-null    int64   
 4   parch        891 non-null    int64   
 5   fare         891 non-null    float64 
 6   embarked     889 non-null    object  
 7   class        891 non-null    category
 8   adult_male   891 non-null    bool    
 9   embark_town  889 non-null    object  
 10  alone        891 non-null    bool    
dtypes: bool(2), category(1), float64(1), int64(4), object(3)
memory usage: 58.6+ KB


In [None]:
data_new.head()

Unnamed: 0,survived,pclass,sex,sibsp,parch,fare,embarked,class,adult_male,embark_town,alone
0,0,3,male,1,0,7.25,S,Third,True,Southampton,False
1,1,1,female,1,0,71.2833,C,First,False,Cherbourg,False
2,1,3,female,0,0,7.925,S,Third,False,Southampton,True
3,1,1,female,1,0,53.1,S,First,False,Southampton,False
4,0,3,male,0,0,8.05,S,Third,True,Southampton,True


In [None]:
data_processed = pd.get_dummies(data_new, columns=["sex", "embarked", "class", "adult_male", "embark_town", "alone"], drop_first=True)

In [None]:
data_processed.head()

Unnamed: 0,survived,pclass,sibsp,parch,fare,sex_male,embarked_Q,embarked_S,class_Second,class_Third,adult_male_True,embark_town_Queenstown,embark_town_Southampton,alone_True
0,0,3,1,0,7.25,1,0,1,0,1,1,0,1,0
1,1,1,1,0,71.2833,0,0,0,0,0,0,0,0,0
2,1,3,0,0,7.925,0,0,1,0,1,0,0,1,1
3,1,1,1,0,53.1,0,0,1,0,0,0,0,1,0
4,0,3,0,0,8.05,1,0,1,0,1,1,0,1,1
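
To see what `drop_first=True` does, the sketch below encodes a single made-up column with and without it. The frame `toy` is hypothetical, and `dtype=int` is passed only so the result shows 0/1 values regardless of the pandas version.

```python
import pandas as pd

# Hypothetical column with the three ports plus one missing value
toy = pd.DataFrame({"embarked": ["S", "C", "Q", None]})

# One indicator column per category
full = pd.get_dummies(toy, columns=["embarked"], dtype=int)
# drop_first=True removes the first category ("C"), which becomes the baseline
reduced = pd.get_dummies(toy, columns=["embarked"], drop_first=True, dtype=int)

print(full)
print(reduced)
```

In `reduced`, a row of all zeros means either the baseline category or a missing value, so the two rows with a missing `embarked` in the Titanic data become indistinguishable from `C` after encoding.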


In [None]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler

## Splitting the data into training and testing sets

In this section, we split the data into training and testing sets using the `train_test_split()` function from `sklearn.model_selection`. The `X` variable contains all the features (independent variables) and the `y` variable contains the target (`survived`). The code also rescales every feature to the range [0, 1] with `MinMaxScaler`, producing `X_norm`. The `test_size` parameter sets the proportion of the data held out for testing, and the `random_state` parameter fixes the random seed so that the same split is obtained every time the code is run.
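
A minimal sketch of how `test_size` and `random_state` behave, using a made-up 10-row array rather than the Titanic features (the names `X_demo` and `y_demo` are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows, 2 features
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1] * 5)

# Two calls with the same random_state give the same split
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=515)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=515)

X_tr, X_te, y_tr, y_te = split_a
print(len(X_tr), len(X_te))                   # 8 training rows, 2 test rows
print(np.array_equal(split_a[1], split_b[1])) # True: identical test rows
```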

In [None]:
X = data_processed.drop(['survived'], axis=1)
scaler = MinMaxScaler()
X_norm = scaler.fit_transform(X)
y = data_processed['survived']
y

0      0
1      1
2      1
3      1
4      0
      ..
886    0
887    1
888    0
889    1
890    0
Name: survived, Length: 891, dtype: int64

In [None]:
X_norm

array([[1.        , 0.125     , 0.        , ..., 0.        , 1.        ,
        0.        ],
       [0.        , 0.125     , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [1.        , 0.        , 0.        , ..., 0.        , 1.        ,
        1.        ],
       ...,
       [1.        , 0.125     , 0.33333333, ..., 0.        , 1.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        1.        ],
       [1.        , 0.        , 0.        , ..., 1.        , 0.        ,
        1.        ]])

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=515) # 0.2 == 20%

In [None]:
X_train_norm, X_test_norm, y_train, y_test = train_test_split(X_norm, y, test_size=0.2, random_state=515) # 0.2 == 20%
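
The notebook fits `MinMaxScaler` on all rows before splitting, so the test rows influence the scaling. A common alternative, sketched here on made-up data, is to fit the scaler on the training split only and reuse it for the test split. This is an illustration under assumed names (`X_demo`, `scaler_demo`), not the notebook's own method.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: 50 rows, 3 features on an arbitrary scale
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(50, 3))
y_demo = rng.integers(0, 2, size=50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=515
)

scaler_demo = MinMaxScaler()
# Learn the per-feature min and max from the training rows only
X_tr_scaled = scaler_demo.fit_transform(X_tr)
# Apply the same min and max to the test rows without refitting
X_te_scaled = scaler_demo.transform(X_te)

print(X_tr_scaled.min(), X_tr_scaled.max())
```

With this approach, test values can fall slightly outside [0, 1] when a test row exceeds the range seen during training.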

### Training KNeighborsClassifier

In [None]:
knn = KNeighborsClassifier(n_neighbors=5)

In [None]:
knn.fit(X_train_norm, y_train)

KNeighborsClassifier()

### Evaluating the KNN model

In [None]:
y_pred = knn.predict(X_test_norm)
print(y_pred)
print("Accuracy:", accuracy_score(y_test, y_pred) * 100)
# print("Precision:", precision_score(y_test, y_pred))
# print("Recall:", recall_score(y_test, y_pred))
# print("F1-score:", f1_score(y_test, y_pred))

# For comparison, accuracy without normalization: 72.62569832402235

[0 1 0 0 1 0 0 1 1 1 0 1 0 1 0 0 1 0 0 1 1 0 0 0 1 1 0 0 1 0 0 0 0 0 1 0 0
 0 0 1 0 1 0 0 1 1 1 0 0 0 0 0 1 1 0 1 1 0 0 0 0 1 0 0 0 1 0 0 0 1 0 1 0 0
 1 0 0 1 1 1 0 0 1 0 0 0 1 0 0 1 0 0 0 1 1 0 1 1 0 1 1 1 0 0 1 0 0 1 1 0 1
 0 0 0 1 1 0 0 0 1 0 0 0 0 0 1 0 1 0 1 0 1 1 0 0 1 0 0 1 1 0 0 1 1 0 0 1 0
 0 1 1 1 0 1 1 0 0 0 0 0 1 1 0 0 0 0 1 1 0 0 0 0 1 1 0 1 1 0 1]
Accuracy: 77.6536312849162
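
`GridSearchCV` is imported at the top of the notebook but not used. One way it could be applied to choose `n_neighbors` is sketched below on synthetic data; the candidate values, the 5-fold setting, and the `*_demo` names are assumptions, so results on the Titanic features will differ.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the scaled Titanic features
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=515
)

# Hypothetical candidate values for the number of neighbors
param_grid = {"n_neighbors": [3, 5, 7, 9, 11]}

# 5-fold cross-validation on the training split, scored by accuracy
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

best_k = search.best_params_["n_neighbors"]
# score() uses the best model, refitted on the full training split
test_accuracy = search.score(X_te, y_te)
print("Best n_neighbors:", best_k)
print("Test accuracy:", test_accuracy)
```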


### Training LogisticRegression

In [None]:
lr = LogisticRegression()
lr.fit(X_train, y_train)

LogisticRegression()

In [None]:
y_pred = lr.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1-score:", f1_score(y_test, y_pred))

Accuracy: 0.8156424581005587
Precision: 0.7971014492753623
Recall: 0.7432432432432432
F1-score: 0.7692307692307693
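
All four scores follow from the confusion-matrix counts. The counts below are not printed in the notebook; they are inferred as the integers that reproduce the reported values for a 179-row test set, so treat them as a worked example rather than a logged result.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Inferred counts: true positives, false positives, false negatives, true negatives
tp, fp, fn, tn = 55, 14, 19, 91

accuracy = (tp + tn) / (tp + fp + fn + tn)           # share of correct predictions
precision = tp / (tp + fp)                           # predicted survivors who survived
recall = tp / (tp + fn)                              # actual survivors who were found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two

# Rebuild label arrays with the same counts to cross-check against sklearn
y_true_demo = np.array([1] * tp + [0] * fp + [1] * fn + [0] * tn)
y_pred_demo = np.array([1] * tp + [1] * fp + [0] * fn + [0] * tn)

print(accuracy, accuracy_score(y_true_demo, y_pred_demo))
print(precision, precision_score(y_true_demo, y_pred_demo))
print(recall, recall_score(y_true_demo, y_pred_demo))
print(f1, f1_score(y_true_demo, y_pred_demo))
```

Here recall is lower than precision: the model misses 19 of the 74 survivors while raising 14 false alarms.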


#### Want to explore further?
- Here is a good Kaggle analysis: https://www.kaggle.com/code/startupsci/titanic-data-science-solutions