GitHub repository: https://github.com/jmlarios/AI-MACHINE-LEARNING-FOUNDATIONS

In [3]:
import pandas as pd

In [8]:
D = pd.read_excel('titanic3.xlsx')

D

,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0000,0,0,24160,211.3375,B5,S,2,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.5500,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0000,1,2,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0000,1,2,113781,151.5500,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0000,1,2,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1304,3,0,"Zabour, Miss. Hileni",female,14.5000,1,0,2665,14.4542,,C,,328.0,
1305,3,0,"Zabour, Miss. Thamine",female,,1,0,2665,14.4542,,C,,,
1306,3,0,"Zakarian, Mr. Mapriededer",male,26.5000,0,0,2656,7.2250,,C,,304.0,
1307,3,0,"Zakarian, Mr. Ortin",male,27.0000,0,0,2670,7.2250,,C,,,


# **First Impression of the Data**

At first sight, some columns look immediately relevant for predicting whether a passenger survived, while others do not seem to affect the outcome. My first impression was to keep 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare' and 'boat' as features, with 'survived' as the target, although some of these are dropped later, as discussed below. The remaining columns did not seem necessary at all.

The first step was to check for missing values in the untouched dataset, and then decide which columns to drop and, if any chosen feature still had missing values, whether to drop the affected rows or fill the values in (imputation).

In [9]:
D.isnull().sum()

pclass          0
survived        0
name            0
sex             0
age           263
sibsp           0
parch           0
ticket          0
fare            1
cabin        1014
embarked        2
boat          823
body         1188
home.dest     564
dtype: int64

In [10]:
D.isnull().sum() / len(D) * 100

pclass        0.000000
survived      0.000000
name          0.000000
sex           0.000000
age          20.091673
sibsp         0.000000
parch         0.000000
ticket        0.000000
fare          0.076394
cabin        77.463713
embarked      0.152788
boat         62.872422
body         90.756303
home.dest    43.086325
dtype: float64
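
The two cells above can also be combined into a single summary table. The following is a minimal sketch on a toy frame; the values and the names `toy`, `missing`, `n_missing` and `pct_missing` are made up for illustration and are not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data (values are made up).
toy = pd.DataFrame({
    "age": [29.0, np.nan, 2.0, np.nan],
    "fare": [211.3, 151.6, np.nan, 7.2],
    "sex": ["female", "male", "female", "male"],
})

# One table holding both the missing count and the missing percentage per column,
# sorted so the columns with the most gaps come first.
missing = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": toy.isnull().mean() * 100,
}).sort_values("pct_missing", ascending=False)

print(missing)
```

`isnull().mean()` gives the fraction of missing entries directly, which is equivalent to dividing the count by `len(D)`.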

# **Feature Selection**

For this assignment I considered 'pclass', 'sex', 'age' and 'parch' to be the most important features, with 'survived' as the target to predict. For me these were the most relevant features for predicting a passenger's outcome. Dropping 'home.dest', 'embarked', 'cabin', 'body' and 'ticket' seemed obvious: they did not appear, at least in my impression, to condition the passenger's outcome at all, and 'cabin' and 'body' are also mostly empty (about 77% and 91% missing). I had more doubts about 'fare', 'sibsp' and 'name'; in the end I decided to drop them as well, since I felt the features listed at the very beginning of this markdown cell were the most important for an accurate prediction. 'boat', which was part of my first impression, is also dropped in the cell below.

In [11]:
# Drop every column that is not used as a feature or as the target
D.drop(columns=['home.dest', 'embarked', 'cabin', 'body', 'fare', 'boat', 'ticket', 'sibsp', 'name'], inplace=True)

D

,pclass,survived,sex,age,parch
0,1,1,female,29.0000,0
1,1,1,male,0.9167,2
2,1,0,female,2.0000,2
3,1,0,male,30.0000,2
4,1,0,female,25.0000,2
...,...,...,...,...,...
1304,3,0,female,14.5000,0
1305,3,0,female,,0
1306,3,0,male,26.5000,0
1307,3,0,male,27.0000,0
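
One way to sanity-check a feature choice like the one above is to compare the survival rate across the values of a candidate feature. This is only a sketch on a toy frame with made-up values; `toy` and `rate_by_sex` are hypothetical names, and no such check appears in the notebook.

```python
import pandas as pd

# Toy frame (values are made up), with the same column names as the dataset.
toy = pd.DataFrame({
    "sex": ["female", "female", "male", "male", "male", "female"],
    "survived": [1, 1, 0, 0, 1, 0],
})

# Mean of a 0/1 target per group is the survival rate of that group.
rate_by_sex = toy.groupby("sex")["survived"].mean()
print(rate_by_sex)
```

A large gap between groups suggests the feature carries information about the target; similar rates suggest it may add little.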


In [12]:
D.isnull().sum()

pclass        0
survived      0
sex           0
age         263
parch         0
dtype: int64

## **Imputing Missing 'Age' Values**

After choosing the features the model would be trained and tested on, I checked the dataset again for missing values, and 263 of them showed up in the 'age' feature. I chose to fill them in (imputation) rather than drop the affected rows, as I felt that about 20% of the dataset was too much to lose.

The output of the following cell shows that 'age' has a fairly high standard deviation (14.41, against a mean of 29.88), so the median (28.0) was used to fill the missing values, as it is less sensitive than the mean to extreme ages.

In [13]:
D.describe()

,pclass,survived,age,parch
count,1309.0,1309.0,1046.0,1309.0
mean,2.294882,0.381971,29.881135,0.385027
std,0.837836,0.486055,14.4135,0.86556
min,1.0,0.0,0.1667,0.0
25%,2.0,0.0,21.0,0.0
50%,3.0,0.0,28.0,0.0
75%,3.0,1.0,39.0,0.0
max,3.0,1.0,80.0,9.0
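
In the `describe()` output above, the mean age (29.88) sits a little above the median (28.0), which is what a few large values tend to do. The sketch below illustrates the effect on a made-up series; `ages` is a hypothetical name and the numbers are not taken from the dataset.

```python
import pandas as pd

# Four similar ages plus one extreme value (all made up).
ages = pd.Series([22.0, 24.0, 25.0, 27.0, 80.0])

mean_age = ages.mean()      # pulled upward by the single large value
median_age = ages.median()  # stays at the middle observation

print(mean_age, median_age)
```

Filling gaps with the median therefore keeps the imputed value close to a typical passenger even when a few extreme ages are present.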


In [14]:
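# NOTE: fillna returns a new Series; this call only displays the imputed values
# and leaves D unchanged (the missing count in the next cell is still 263).
# To persist the imputation, assign the result back: D['age'] = D['age'].fillna(D['age'].median())
D['age'].fillna(D['age'].median())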

0       29.0000
1        0.9167
2        2.0000
3       30.0000
4       25.0000
         ...   
1304    14.5000
1305    28.0000
1306    26.5000
1307    27.0000
1308    29.0000
Name: age, Length: 1309, dtype: float64

In [15]:
D.isnull().sum()

pclass        0
survived      0
sex           0
age         263
parch         0
dtype: int64
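
The count above still shows 263 missing ages because `fillna` returns a new Series rather than modifying `D`. A minimal sketch of the difference, on a toy frame with made-up values (`toy`, `missing_before` and `missing_after` are hypothetical names):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"age": [29.0, np.nan, 2.0, 30.0, np.nan]})

# fillna returns a new Series; toy itself is unchanged by this call.
toy["age"].fillna(toy["age"].median())
missing_before = int(toy["age"].isnull().sum())

# Assigning the result back stores the imputed values in the frame.
toy["age"] = toy["age"].fillna(toy["age"].median())
missing_after = int(toy["age"].isnull().sum())

print(missing_before, missing_after)
```

Applied to the notebook, the assignment form would be `D['age'] = D['age'].fillna(D['age'].median())`; the outputs recorded in the later cells were produced without it.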

# **Data Encoding**

Categorical variables are encoded with OneHotEncoder, as suggested in the guidelines. Here the encoded variables are 'sex' and 'pclass' ('pclass' is stored as a number but represents a category). The output of the following cell shows one dummy column for each class of each feature.

In [17]:
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()

# Encode 'sex' into one indicator column per category
encoded_sex = encoder.fit_transform(D[['sex']]).toarray()
encoded_sex_df = pd.DataFrame(encoded_sex, columns=encoder.get_feature_names_out(['sex']))

# Refit the same encoder on 'pclass' (the previous fit on 'sex' is replaced)
encoded_pclass = encoder.fit_transform(D[['pclass']]).toarray()
encoded_pclass_df = pd.DataFrame(encoded_pclass, columns=encoder.get_feature_names_out(['pclass']))

# Append the indicator columns and drop the original categorical columns
D_encoded = pd.concat([D, encoded_sex_df, encoded_pclass_df], axis=1).drop(columns=['sex', 'pclass'])

D_encoded

,survived,age,parch,sex_female,sex_male,pclass_1,pclass_2,pclass_3
0,1,29.0000,0,1.0,0.0,1.0,0.0,0.0
1,1,0.9167,2,0.0,1.0,1.0,0.0,0.0
2,0,2.0000,2,1.0,0.0,1.0,0.0,0.0
3,0,30.0000,2,0.0,1.0,1.0,0.0,0.0
4,0,25.0000,2,1.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...
1304,0,14.5000,0,1.0,0.0,0.0,0.0,1.0
1305,0,,0,1.0,0.0,0.0,0.0,1.0
1306,0,26.5000,0,0.0,1.0,0.0,0.0,1.0
1307,0,27.0000,0,0.0,1.0,0.0,0.0,1.0
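
As an alternative sketch, and not the approach used in this notebook, pandas' `get_dummies` can produce the same kind of indicator columns in a single call. The toy frame and the names `toy` and `encoded` are made up for illustration.

```python
import pandas as pd

# Toy frame (values are made up), with the same column names as the dataset.
toy = pd.DataFrame({
    "pclass": [1, 3, 2, 3],
    "survived": [1, 0, 1, 0],
    "sex": ["female", "male", "female", "male"],
    "age": [29.0, 26.5, 40.0, 27.0],
})

# Encode both categorical columns at once; the originals are replaced
# by one float indicator column per category.
encoded = pd.get_dummies(toy, columns=["sex", "pclass"], dtype=float)

print(encoded.columns.tolist())
```

OneHotEncoder remains the better fit when the encoding has to be learned on the training split and reapplied to unseen data, since it stores the categories it was fitted on.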


# **Splitting the Data**

In [21]:
from sklearn.model_selection import train_test_split

X = D_encoded.drop(columns=['survived'])  # Features
y = D_encoded['survived']  # Target variable

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
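
The `stratify=y` argument in the split above keeps the proportion of survivors the same in the training and test sets. The sketch below checks this on a toy target with made-up values; the names `X_toy`, `y_toy`, `y_tr` and `y_te` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy target with 40% positives (made-up data).
y_toy = np.array([1] * 40 + [0] * 60)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# The positive rate is preserved in both parts of the split.
print(y_toy.mean(), y_tr.mean(), y_te.mean())
```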


In [22]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Initialize and train
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"Random Forest Accuracy: {accuracy:.4f}")


Random Forest Accuracy: 0.8092
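
An accuracy figure is easier to interpret next to a baseline. According to the `describe()` output earlier, about 38% of passengers survived, so always predicting "did not survive" would score roughly 1 - 0.382 = 0.618 on the full dataset. The sketch below computes such a baseline and a confusion matrix on made-up labels; `y_true`, `y_hat`, `acc`, `baseline` and `cm` are hypothetical names, and the numbers are not the notebook's results.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true labels and predictions for illustration.
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_hat = np.array([0, 0, 1, 1, 0, 0, 1, 0])

acc = accuracy_score(y_true, y_hat)

# Accuracy obtained by always predicting the most frequent class.
baseline = max(y_true.mean(), 1 - y_true.mean())

# Rows are the true class, columns are the predicted class.
cm = confusion_matrix(y_true, y_hat)

print(acc, baseline)
print(cm)
```

The confusion matrix separates the two kinds of error (survivors predicted as non-survivors and the reverse), which a single accuracy number hides.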
