# Data Science Project: Spaceship Titanic

* **Dataset:** Spaceship Titanic (Kaggle)

1. Data Cleaning
2. Analysis
3. Modeling
4. Review

**Spaceship Titanic Data**

**Files:**

* **`train.csv`**: Training data (8,693 passengers).
* **`test.csv`**: Test data (~4300 passengers).
* **`sample_submission.csv`**: Submission format.

**Data Fields:**

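* **`PassengerId`**: Unique ID of the form `gggg_pp`, where `gggg` is the travel group and `pp` is the passenger's number within that group. Group members are often, but not always, family.
* **`HomePlanet`**: Planet the passenger departed from.
* **`CryoSleep`**: Whether the passenger elected suspended animation for the voyage (True/False).
* **`Cabin`**: Cabin location (`deck/num/side`), where `side` is `P` (Port) or `S` (Starboard).
* **`Destination`**: Planet the passenger is travelling to.
* **`Age`**: Passenger age.
* **`VIP`**: Whether the passenger paid for VIP service (True/False).
* **`RoomService`, `FoodCourt`, `ShoppingMall`, `Spa`, `VRDeck`**: Amount billed at each amenity.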
* **`Name`**: Passenger name.
* **`Transported`**: (Target) Transported to another dimension (True/False).
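
The `gggg_pp` and `deck/num/side` formats described above can be split into separate columns with pandas string methods. The following is a minimal sketch on a few toy rows; the output column names (`Group`, `NumberInGroup`, `Deck`, `CabinNum`, `Side`) are illustrative assumptions, not names used elsewhere in this notebook.

```python
import pandas as pd

# Toy rows in the documented formats; the last Cabin is set to missing
# on purpose to show how missing values propagate.
sample = pd.DataFrame({
    "PassengerId": ["0001_01", "0003_01", "0003_02"],
    "Cabin": ["B/0/P", "A/0/S", None],
})

# PassengerId -> group id and number within the group (both kept as strings).
sample[["Group", "NumberInGroup"]] = sample["PassengerId"].str.split("_", expand=True)

# Cabin -> deck, cabin number, side; a missing Cabin stays missing in all three parts.
sample[["Deck", "CabinNum", "Side"]] = sample["Cabin"].str.split("/", expand=True)

print(sample)
```

Keeping `Group` as a string preserves the leading zeros, which matters if it is later used as a join or grouping key.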

In [None]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

from sklearn.impute import SimpleImputer

from sklearn.compose import make_column_transformer, ColumnTransformer
from sklearn.pipeline import Pipeline, make_pipeline

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, ExtraTreesClassifier, VotingClassifier

from sklearn.model_selection import cross_val_score, StratifiedKFold, train_test_split, GridSearchCV

## Exploratory Data Analysis (EDA)

In [69]:
trainfilepath = '/Users/eldiablolatino/Developer/Spaceship Titanic/train.csv'
testfilepath = '/Users/eldiablolatino/Developer/Spaceship Titanic/test.csv'
train_df = pd.read_csv(trainfilepath)
test_df = pd.read_csv(testfilepath)
train_df_copy = train_df.copy()
test_df_copy = test_df.copy()
train_df_copy.head(10)

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |
| 5 | 0005_01 | Earth | False | F/0/P | PSO J318.5-22 | 44.0 | False | 0.0 | 483.0 | 0.0 | 291.0 | 0.0 | Sandie Hinetthews | True |
| 6 | 0006_01 | Earth | False | F/2/S | TRAPPIST-1e | 26.0 | False | 42.0 | 1539.0 | 3.0 | 0.0 | 0.0 | Billex Jacostaffey | True |
| 7 | 0006_02 | Earth | True | G/0/S | TRAPPIST-1e | 28.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | NaN | Candra Jacostaffey | True |
| 8 | 0007_01 | Earth | False | F/3/S | TRAPPIST-1e | 35.0 | False | 0.0 | 785.0 | 17.0 | 216.0 | 0.0 | Andona Beston | True |
| 9 | 0008_01 | Europa | True | B/1/P | 55 Cancri e | 14.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Erraiam Flatic | True |


In [66]:
train_df_copy.isna().sum()

PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64
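
The raw counts above are easier to compare as a share of rows: for example, 217 missing `CryoSleep` values out of 8,693 rows is about 2.5%. A minimal sketch of that conversion follows, shown on a toy frame so it runs on its own; the `toy` frame and the `missing` summary name are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_df_copy.
toy = pd.DataFrame({
    "HomePlanet": ["Earth", None, "Europa", "Mars"],
    "Age": [39.0, 24.0, np.nan, np.nan],
    "Transported": [False, True, False, True],
})

# isna().sum() gives the count per column; isna().mean() gives the fraction.
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": (toy.isna().mean() * 100).round(1),
}).sort_values("pct_missing", ascending=False)

print(missing)
```

On the real data, replacing `toy` with `train_df_copy` would give the same summary for every column.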

In [67]:
train_df_copy.describe()

| | Age | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck |
|---|---|---|---|---|---|---|
| count | 8514.0 | 8512.0 | 8510.0 | 8485.0 | 8510.0 | 8505.0 |
| mean | 28.82793 | 224.687617 | 458.077203 | 173.729169 | 311.138778 | 304.854791 |
| std | 14.489021 | 666.717663 | 1611.48924 | 604.696458 | 1136.705535 | 1145.717189 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 19.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 27.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 38.0 | 47.0 | 76.0 | 27.0 | 59.0 | 46.0 |
| max | 79.0 | 14327.0 | 29813.0 | 23492.0 | 22408.0 | 24133.0 |


In [68]:
train_df_copy.describe(include=['O'])

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | VIP | Name |
|---|---|---|---|---|---|---|---|
| count | 8693 | 8492 | 8476 | 8494 | 8511 | 8490 | 8493 |
| unique | 8693 | 3 | 2 | 6560 | 3 | 2 | 8473 |
| top | 0001_01 | Earth | False | G/734/S | TRAPPIST-1e | False | Gollux Reedall |
| freq | 1 | 4602 | 5439 | 8 | 5915 | 8291 | 2 |


### **Groupby** analysis

In [70]:
train_df_copy.groupby('HomePlanet', as_index=False)['Transported'].agg(['mean','count'])

| | HomePlanet | mean | count |
|---|---|---|---|
| 0 | Earth | 0.423946 | 4602 |
| 1 | Europa | 0.658846 | 2131 |
| 2 | Mars | 0.523024 | 1759 |


In [71]:
train_df_copy.groupby('CryoSleep', as_index=False)['Transported'].agg(['mean','count'])

| | CryoSleep | mean | count |
|---|---|---|---|
| 0 | False | 0.328921 | 5439 |
| 1 | True | 0.817583 | 3037 |


In [72]:
train_df_copy.groupby(['Destination'], as_index=False)['Transported'].agg(['mean','count'])

| | Destination | mean | count |
|---|---|---|---|
| 0 | 55 Cancri e | 0.61 | 1800 |
| 1 | PSO J318.5-22 | 0.503769 | 796 |
| 2 | TRAPPIST-1e | 0.471175 | 5915 |


In [73]:
train_df_copy.groupby(['VIP'], as_index=False)['Transported'].agg(['mean','count'])

| | VIP | mean | count |
|---|---|---|---|
| 0 | False | 0.506332 | 8291 |
| 1 | True | 0.38191 | 199 |
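
The four cells above repeat the same pattern: group by one categorical column, then take the mean and count of `Transported`. A small helper can express that once. This is a sketch only; `transported_rate` is a hypothetical name, and the toy frame stands in for `train_df_copy`. Note that `groupby` drops rows whose key is missing by default, which is why the counts in each table above sum to less than 8,693 (for `HomePlanet`, 4602 + 2131 + 1759 = 8492, which is 8,693 minus the 201 missing values).

```python
import pandas as pd

def transported_rate(df, column, target="Transported"):
    """Return the mean of `target` and the row count for each value of `column`.

    Rows where `column` is missing are excluded (pandas' default dropna=True).
    """
    return (
        df.groupby(column)[target]
          .agg(rate="mean", count="count")
          .reset_index()
    )

# Toy frame standing in for train_df_copy; one VIP value is missing.
toy = pd.DataFrame({
    "CryoSleep": [True, True, False, False, False],
    "VIP": [False, False, False, True, None],
    "Transported": [True, True, False, False, True],
})

for col in ["CryoSleep", "VIP"]:
    print(transported_rate(toy, col), end="\n\n")
```

Passing `dropna=False` to `groupby` would instead keep the missing keys as their own group, which can show whether missingness itself is related to the target.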


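## Family Size and Transported Rate Analysis from PassengerId

* **Objective:**
    * Estimate family sizes from the `PassengerId` column, treating each travel group as a family (an approximation, since group members are often but not always relatives).
    * Classify families into "Alone" (1 member), "Small" (2 to 4), "Medium" (5 to 6), and "Large" (7 or more) categories.
* **Methodology:**
    * Extract the `gggg` portion of `PassengerId` to create a `Family_id` column.
    * Count the passengers sharing each `Family_id` to get `FamilySize`, then map each size to a `FamilyCategory`.
    * Compare the `Transported` rate across family categories.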

In [78]:
# The group id is the part of PassengerId before the underscore.
train_df_copy['Family_id'] = train_df_copy['PassengerId'].str.split('_').str[0]
# Number of passengers sharing each group id.
family_sizes = train_df_copy['Family_id'].value_counts()

train_df_copy['FamilySize'] = train_df_copy['Family_id'].map(family_sizes)

def categorize_family(size):
    if size == 1:
        return 'Alone'
    elif 2 <= size <= 4:
        return 'Small'
    elif 5 <= size <= 6:
        return 'Medium'
    else:
        return 'Large'

train_df_copy['FamilyCategory'] = train_df_copy['FamilySize'].apply(categorize_family)

print(train_df_copy[['PassengerId','Family_id','FamilySize', 'FamilyCategory']].head())

  PassengerId Family_id  FamilySize FamilyCategory
0     0001_01      0001           1          Alone
1     0002_01      0002           1          Alone
2     0003_01      0003           2          Small
3     0003_02      0003           2          Small
4     0004_01      0004           1          Alone


In [75]:
train_df_copy.groupby(['FamilyCategory'], as_index=False)['Transported'].agg(['mean','count'])

| | FamilyCategory | mean | count |
|---|---|---|---|
| 0 | Alone | 0.452445 | 4805 |
| 1 | Large | 0.495522 | 335 |
| 2 | Medium | 0.601367 | 439 |
| 3 | Small | 0.569685 | 3114 |
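
The same family features could also be built without a Python-level `apply`. The sketch below is an alternative, not the method used above: it counts group members with `groupby(...).transform("size")` and bins the sizes with `pd.cut`, using the same thresholds as `categorize_family`. The toy frame and its ids are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy ids: one solo traveller, one group of two, one group of five.
toy = pd.DataFrame({"PassengerId": [
    "0001_01", "0003_01", "0003_02",
    "0020_01", "0020_02", "0020_03", "0020_04", "0020_05",
]})

toy["Family_id"] = toy["PassengerId"].str.split("_").str[0]

# transform("size") broadcasts each group's row count back to its members.
toy["FamilySize"] = toy.groupby("Family_id")["PassengerId"].transform("size")

# Right-closed bins: (0, 1] Alone, (1, 4] Small, (4, 6] Medium, (6, inf) Large.
toy["FamilyCategory"] = pd.cut(
    toy["FamilySize"],
    bins=[0, 1, 4, 6, np.inf],
    labels=["Alone", "Small", "Medium", "Large"],
)

print(toy)
```

Because `pd.cut` returns an ordered categorical, a later `groupby('FamilyCategory')` would list the categories in size order (Alone, Small, Medium, Large) rather than alphabetically.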
