In [None]:
df_train = catalog.load("train.input")
df_train.head()

**PassengerId** - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.

**HomePlanet** - The planet the passenger departed from, typically their planet of permanent residence.

**CryoSleep** - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.

**Cabin** - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.

**Destination** - The planet the passenger will be debarking to.

**Age** - The age of the passenger.

**VIP** - Whether the passenger has paid for special VIP service during the voyage.

**RoomService, FoodCourt, ShoppingMall, Spa, VRDeck** - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.

**Name** - The first and last names of the passenger.

**Transported** - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.
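The two composite fields described above (`gggg_pp` and `deck/num/side`) can be unpacked with `str.split`. This is a minimal sketch on made-up rows; the `toy` frame and the column names `Group`, `GroupPos`, and `Num` are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Toy rows that follow the documented formats (values are made up).
toy = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_01", "0002_02"],
    "Cabin": ["B/0/P", "F/1/S", None],
})

# gggg_pp -> group id and position within the group
toy[["Group", "GroupPos"]] = toy["PassengerId"].str.split("_", expand=True)

# deck/num/side -> three columns; a missing Cabin stays missing in all three
toy[["Deck", "Num", "Side"]] = toy["Cabin"].str.split("/", expand=True)

print(toy)
```

All split results are strings, so numeric parts such as the room number still need a cast before they can be used as numbers.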


In [None]:
df_train.info()


In [None]:
#!pip install missingno
import missingno
missingno.matrix(df_train)

In [None]:
df_train.isna().sum()

**Missing values**: let's see which columns matter and inspect the individual properties to decide whether to fill the gaps with values or drop them.
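One way to inform the fill-or-drop decision is to look at shares rather than raw counts. The sketch below uses a made-up `toy` frame; `missing_share` and `rows_with_gap` are hypothetical names introduced only for this example.

```python
import pandas as pd

# Made-up data with a few gaps.
toy = pd.DataFrame({
    "Age": [20.0, None, 40.0, 31.0],
    "VIP": [False, False, None, True],
    "Name": ["A B", "C D", "E F", "G H"],
})

# Fraction of missing values per column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)

# Fraction of rows that would be lost if every incomplete row were dropped
rows_with_gap = toy.isna().any(axis=1).mean()

print(missing_share)
print(rows_with_gap)
```

If dropping incomplete rows would remove a large part of the data, filling is usually the safer choice.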

To Do:
- create `travel alone` param
- enum HomePlanet
- CryoSleep to bool
- split cabin into deck, num, side
- Destination into enum
- VIP to bool
- calc total spending
- (has family)

## Travelling alone?

In [None]:
# traveling alone

# split out group id
# create list of non-unique group ids
# create alone property, set true if not in group list

split_df = df_train["PassengerId"].str.split(pat="_",expand=True)
split_df.head()

In [None]:
alone = split_df[0].value_counts() == 1
alone.head()

In [None]:
split_df = split_df.merge(alone.rename("alone"), left_on=0, right_index=True)

In [None]:
df_train["alone"] = split_df['alone']
df_train.head()
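The same flag can probably be derived without the intermediate merge by mapping each row's group id to its group size. A minimal sketch on made-up ids, assuming the `gggg_pp` format; `group` and `group_size` are names chosen for this example.

```python
import pandas as pd

toy = pd.DataFrame({"PassengerId": ["0001_01", "0002_01", "0002_02", "0003_01"]})

# Group id is the part before the underscore
group = toy["PassengerId"].str.split("_").str[0]

# value_counts gives the size of each group; map looks it up per row
group_size = group.map(group.value_counts())

# A passenger travels alone when their group has exactly one member
toy["alone"] = group_size == 1

print(toy)
```

Because `map` keeps the original row order and index, the result can be assigned straight back to the frame.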

## HomePlanet

In [None]:
df_train["HomePlanet"].value_counts()

In [None]:
#!pip install scikit-learn

In [None]:
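from sklearn import preprocessing
# LabelEncoder assigns integer codes to the labels in sorted order
le = preprocessing.LabelEncoder()
le.fit(df_train['HomePlanet'])
# replace the planet names with their integer codes
df_train['HomePlanet'] = le.transform(df_train['HomePlanet'])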
df_train.head()
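Since `le` is refit for every column further down, the code-to-label mapping of earlier columns is not kept. As an alternative sketch (swapping in `pd.factorize`, which is not what this notebook uses), the labels can be stored per column. The `toy` frame and the `mappings` dict are assumptions for illustration.

```python
import pandas as pd

# Made-up rows with a missing value in each column.
toy = pd.DataFrame({
    "HomePlanet": ["Europa", "Earth", None, "Mars", "Earth"],
    "Destination": ["TRAPPIST-1e", "PSO J318.5-22", "55 Cancri e", None, "TRAPPIST-1e"],
})

mappings = {}  # hypothetical helper: column -> labels, indexed by code
for col in ["HomePlanet", "Destination"]:
    # sort=True orders the labels before numbering them
    codes, labels = pd.factorize(toy[col], sort=True)
    toy[col] = codes              # missing values are coded as -1
    mappings[col] = list(labels)  # keep the labels to decode later

print(toy)
print(mappings)
```

With the labels stored, `mappings["HomePlanet"][code]` recovers the planet name for any non-negative code.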

## CryoSleep

In [None]:
le.fit(df_train['CryoSleep'])
df_train['CryoSleep'] = le.transform(df_train['CryoSleep'])
df_train.head()

In [None]:
df_train["CryoSleep"].value_counts()

## Cabin

In [None]:
df_cab = df_train["Cabin"].str.split(pat="/",expand=True)
df_cab.head()

In [None]:
df_train["Deck"] = df_cab[0]
df_train["Room"] = df_cab[1]
df_train["Side"] = df_cab[2]
df_train.head()

In [None]:
le.fit(df_train['Deck'])
df_train['Deck'] = le.transform(df_train['Deck'])
le.fit(df_train['Side'])
df_train['Side'] = le.transform(df_train['Side'])
df_train.head()

In [None]:
df_train['Side'].value_counts()

In [None]:
df_train['Deck'].value_counts()

In [None]:
# let's drop cabin
df_train = df_train.drop("Cabin", axis=1)

## Destination


In [None]:
df_train['Destination'].value_counts()

In [None]:
le.fit(df_train['Destination'])
df_train['Destination'] = le.transform(df_train['Destination'])

## VIP

In [None]:
le.fit(df_train['VIP'])
df_train['VIP'] = le.transform(df_train['VIP'])

## Check for NaN

In [None]:
df_train.isna().sum()

Let's fill:
- all price-related cols with 0.0
- Age with the median
- Room with 0
- Name with "No Name"
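The plan above can be sketched in one place on a small made-up frame. The `toy` data and the shortened `spend_cols` list are assumptions for illustration; the median is computed from the data rather than typed in.

```python
import pandas as pd

# Made-up rows with one gap per column.
toy = pd.DataFrame({
    "RoomService": [10.0, None, 0.0],
    "Spa": [None, 5.0, 2.0],
    "Age": [20.0, None, 40.0],
    "Room": ["12", None, "7"],
    "Name": ["A B", None, "C D"],
})

spend_cols = ["RoomService", "Spa"]
toy[spend_cols] = toy[spend_cols].fillna(0.0)          # no bill -> 0.0
toy["Age"] = toy["Age"].fillna(toy["Age"].median())    # median of known ages
toy["Room"] = pd.to_numeric(toy["Room"]).fillna(0).astype(int)  # strings -> ints
toy["Name"] = toy["Name"].fillna("No Name")

print(toy)
```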

In [None]:
cols = ["RoomService","FoodCourt","ShoppingMall","Spa","VRDeck"]
df_train[cols]=df_train[cols].fillna(0.0)

In [None]:
df_train["Age"].median()

In [None]:
# fill with the computed median instead of a hard-coded value
df_train["Age"] = df_train["Age"].fillna(df_train["Age"].median())

In [None]:
df_train["Name"] = df_train["Name"].fillna("No Name")

In [None]:
# Room comes out of str.split as a string, so fill with "0" and cast to int
df_train["Room"] = df_train["Room"].fillna("0").astype(int)

In [None]:
df_train.isna().sum()

## Total Spend

In [None]:
# NaN in the spending cols was already set to 0.0 above

In [None]:
df_train["TotalSpend"] = df_train["RoomService"] + df_train["FoodCourt"] + df_train["ShoppingMall"] + df_train["Spa"] + df_train["VRDeck"]
df_train.head()
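Adding columns with `+` propagates NaN, so the total is only complete when the spending columns have been filled first. A row-wise `sum` skips NaN by default. A minimal sketch on made-up values with only two of the spending columns:

```python
import pandas as pd

# Made-up bills; the second passenger has a missing RoomService value.
toy = pd.DataFrame({
    "RoomService": [10.0, None],
    "FoodCourt": [1.0, 2.0],
})
cols = ["RoomService", "FoodCourt"]

# '+' yields NaN as soon as one operand is NaN
plus_total = toy["RoomService"] + toy["FoodCourt"]

# sum(axis=1) ignores NaN by default (skipna=True)
toy["TotalSpend"] = toy[cols].sum(axis=1)

print(plus_total)
print(toy)
```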

In [None]:
# numeric_only=True skips string columns such as PassengerId and Name
corr_matrix = df_train.corr(method='pearson', numeric_only=True)
corr_matrix.style.background_gradient(cmap='coolwarm')
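To read the matrix with the target in mind, it may help to pull out only the `Transported` column and rank the features by absolute correlation. The values below are made up and `target_corr` is a name chosen for this sketch.

```python
import pandas as pd

# Made-up rows; Name stands in for the non-numeric columns.
toy = pd.DataFrame({
    "CryoSleep": [1, 0, 1, 0, 1, 0],
    "TotalSpend": [0.0, 500.0, 0.0, 1200.0, 0.0, 30.0],
    "Name": list("abcdef"),
    "Transported": [True, False, True, False, True, True],
})

# Keep numeric and boolean columns and treat booleans as 0/1
numeric = toy.select_dtypes(include=["number", "bool"]).astype(float)

# Correlation of every feature with the target, without the target itself
target_corr = numeric.corr(method="pearson")["Transported"].drop("Transported")

# Strongest relationships first, regardless of sign
print(target_corr.sort_values(key=abs, ascending=False))
```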