<a href="https://colab.research.google.com/github/martharegina/machinelearning/blob/main/titanic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [89]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

In [90]:
train_data = pd.read_csv('/content/train.csv')
test_data = pd.read_csv('/content/test.csv')

In [91]:
# Check the number of missing values in train_data
train_data.isnull().sum()

Column,Missing values
PassengerId,0
Survived,0
Pclass,0
Name,0
Sex,0
Age,177
SibSp,0
Parch,0
Ticket,0
Fare,0


In [92]:
# Check the number of missing values in test_data
test_data.isnull().sum()

Column,Missing values
PassengerId,0
Pclass,0
Name,0
Sex,0
Age,86
SibSp,0
Parch,0
Ticket,0
Fare,1
Cabin,327


In [93]:
# Count missing Cabin values per passenger class
total_per_class = train_data.groupby('Pclass').size()
missing_cabin_per_class = train_data[train_data['Cabin'].isna()].groupby('Pclass').size()
fraction_missing = missing_cabin_per_class / total_per_class

result = {
    'Total per Class': total_per_class,
    'Missing Cabin per Class': missing_cabin_per_class,
    'Fraction Missing': fraction_missing
}

result_df = pd.DataFrame(result)
print(result_df)

        Total per Class  Missing Cabin per Class  Fraction Missing
Pclass                                                            
1                   216                       40          0.185185
2                   184                      168          0.913043
3                   491                      479          0.975560
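The same per-class summary can also be built in a single groupby over a boolean "is missing" flag. The sketch below uses a small made-up frame (`toy`, not the Titanic data) and a hypothetical helper column `CabinMissing`. One possible advantage of this form is that a class with no missing cabins would show a count of 0 rather than NaN.

```python
import pandas as pd

# Toy stand-in for train_data; the values are invented for illustration only.
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3, 3],
    "Cabin": ["C85", None, None, None, None, None, "G6", None],
})

# Flag missing cabins, then aggregate the flag per class:
# count = rows per class, sum = missing rows, mean = fraction missing.
summary = (
    toy.assign(CabinMissing=toy["Cabin"].isna())
    .groupby("Pclass")["CabinMissing"]
    .agg(["count", "sum", "mean"])
    .rename(columns={
        "count": "Total per Class",
        "sum": "Missing Cabin per Class",
        "mean": "Fraction Missing",
    })
)
print(summary)
```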


In [94]:
# Drop unnecessary columns
unnecessary_columns = ['Cabin', 'Ticket']
train_data = train_data.drop(unnecessary_columns, axis=1)
test_data = test_data.drop(unnecessary_columns, axis=1)

In [95]:
# Create the Title column from the honorific in Name
train_data['Title'] = train_data['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
test_data['Title'] = test_data['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
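To see what the regular expression captures, here is a small sketch that applies the same pattern to a few names written in the "Surname, Title. Given names" layout of the `Name` column. The `names` Series is a hand-picked sample for illustration, not a slice of the loaded data.

```python
import pandas as pd

# A handful of sample names in the same layout as the Name column.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# The pattern looks for a space, a run of letters, then a literal period;
# the run of letters (the capture group) becomes the title.
titles = names.str.extract(r' ([A-Za-z]+)\.', expand=False)
print(titles.tolist())  # ['Mr', 'Mrs', 'Miss', 'Countess']
```

Because the period must follow the letters directly, the word "the" in the last name is skipped and "Countess" is captured instead.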

In [96]:
# Create the FamilySize column (siblings/spouses + parents/children; the passenger is not counted)
train_data['FamilySize'] = train_data['SibSp'] + train_data['Parch']
test_data['FamilySize'] = test_data['SibSp'] + test_data['Parch']

In [97]:
# Create the IsAlone column: True when the passenger has no siblings, spouses, parents, or children aboard
train_data['IsAlone'] = (train_data['SibSp'] == 0) & (train_data['Parch'] == 0)
test_data['IsAlone'] = (test_data['SibSp'] == 0) & (test_data['Parch'] == 0)
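With `FamilySize` defined as `SibSp + Parch`, `IsAlone` carries the same information as `FamilySize == 0`. The sketch below checks this on a made-up frame (`toy` is hypothetical, not the Titanic data), which may help when deciding whether both features are worth keeping.

```python
import pandas as pd

# Toy rows covering the four combinations: alone, sibling only, parent/child only, both.
toy = pd.DataFrame({"SibSp": [0, 1, 0, 2], "Parch": [0, 0, 3, 1]})

toy["FamilySize"] = toy["SibSp"] + toy["Parch"]
toy["IsAlone"] = (toy["SibSp"] == 0) & (toy["Parch"] == 0)

# Both counts are non-negative, so their sum is 0 exactly when both are 0.
same = bool((toy["IsAlone"] == (toy["FamilySize"] == 0)).all())
print(toy)
print(same)  # True
```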

In [98]:
# Build a Series of the mean age per title (truncated to whole years)
average_age_per_title = train_data.groupby('Title')['Age'].mean().astype(int)
average_age_per_title

Title,Age
Capt,70
Col,58
Countess,33
Don,40
Dr,42
Jonkheer,38
Lady,48
Major,48
Master,4
Miss,21


In [99]:
# Fill missing values in the Age column with the mean age for the passenger's title
train_data['Age'] = train_data['Age'].fillna(train_data['Title'].map(average_age_per_title))
test_data['Age'] = test_data['Age'].fillna(test_data['Title'].map(average_age_per_title))
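Mapping titles to training-set means leaves a value missing whenever a title appears only in the test set, because `map` returns NaN for unknown keys. A minimal sketch of one way to guard against that follows. The toy frames, the `fill_age` helper, and the choice of the overall training median as a fallback are all assumptions made for illustration, not part of the notebook.

```python
import pandas as pd

nan = float("nan")

# Toy stand-ins for train_data and test_data; "Dona" occurs only in the test frame.
train = pd.DataFrame({
    "Title": ["Mr", "Mr", "Miss", "Miss", "Master"],
    "Age": [30.0, 40.0, 20.0, nan, 4.0],
})
test = pd.DataFrame({
    "Title": ["Mr", "Dona", "Miss"],
    "Age": [nan, nan, 18.0],
})

# Statistics come from the training frame only, computed before any filling.
mean_age_per_title = train.groupby("Title")["Age"].mean()
overall_median = train["Age"].median()

def fill_age(df):
    """Fill Age by title mean first, then by the overall training median."""
    by_title = df["Title"].map(mean_age_per_title)  # NaN for unseen titles
    return df["Age"].fillna(by_title).fillna(overall_median)

train["Age"] = fill_age(train)
test["Age"] = fill_age(test)
print(test)
```

Here the "Mr" row receives the training mean for "Mr", while the "Dona" row falls back to the overall median because no training row carries that title.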

In [100]:
# Fill missing values in the Fare column with the training-set median
median_fare = train_data['Fare'].median()
test_data['Fare'] = test_data['Fare'].fillna(median_fare)

In [101]:
# Fill missing values in the Embarked column with the most frequent port
mode_embarked = train_data['Embarked'].mode()[0]
train_data['Embarked'] = train_data['Embarked'].fillna(mode_embarked)

In [102]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          891 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Fare         891 non-null    float64
 9   Embarked     891 non-null    object 
 10  Title        891 non-null    object 
 11  FamilySize   891 non-null    int64  
 12  IsAlone      891 non-null    bool   
dtypes: bool(1), float64(2), int64(6), object(4)
memory usage: 84.5+ KB


In [103]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          418 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Fare         418 non-null    float64
 8   Embarked     418 non-null    object 
 9   Title        418 non-null    object 
 10  FamilySize   418 non-null    int64  
 11  IsAlone      418 non-null    bool   
dtypes: bool(1), float64(2), int64(5), object(4)
memory usage: 36.5+ KB


The plan:
* feature engineering: title from name, family size, isAlone
* fill missing Age values with the mean age per title
* encode categorical values
* decide which column is the target and which are features
* look at the distribution of each feature
* compare each feature against Survived
* split the data
* model selection
* model evaluation
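The remaining steps of the plan (encoding, choosing target and features, splitting, fitting, evaluating) could look roughly like the sketch below. It runs on a randomly generated frame that only mimics the cleaned columns, so the accuracy it prints says nothing about the Titanic data. One-hot encoding via `pd.get_dummies`, the 80/20 stratified split, and `LogisticRegression` are assumptions chosen for illustration, since the notebook has not yet picked an encoder or a model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned train_data; all values are random.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "Pclass": rng.integers(1, 4, n),
    "Sex": rng.choice(["male", "female"], n),
    "Age": rng.uniform(1, 70, n).round(),
    "Fare": rng.uniform(5, 100, n).round(2),
    "Embarked": rng.choice(["S", "C", "Q"], n),
    "FamilySize": rng.integers(0, 5, n),
    "Survived": rng.integers(0, 2, n),
})
df["IsAlone"] = (df["FamilySize"] == 0).astype(int)

# Target is Survived; everything else is a feature.
y = df["Survived"]
# One-hot encode the categorical columns, dropping one level per column.
X = pd.get_dummies(
    df.drop(columns="Survived"),
    columns=["Sex", "Embarked"],
    drop_first=True,
    dtype=int,
)

# Hold out 20% of the rows for validation, keeping the class balance.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit a simple baseline model and score it on the held-out rows.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
accuracy = accuracy_score(y_val, model.predict(X_val))
print(f"Validation accuracy: {accuracy:.3f}")
```

On the real data, `df` would be replaced by `train_data` with `Name` and `PassengerId` dropped or encoded, and `Title` added to the list of columns passed to `pd.get_dummies`.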