# Task 2 - Spaceship Titanic

> It is the year 2912 and Spaceship Titanic, an interstellar passenger liner, is on its maiden voyage with almost 13,000 passengers onboard. It is transporting emigrants from our solar system to three newly habitable exoplanets when it collides with a cosmic anomaly hidden within a dust cloud, and almost half of the passengers onboard are transported to another dimension. Your task is to help rescue the lost passengers by predicting which of them were transported by the anomaly, using records recovered from the ship's damaged computer system.


Dataset and amazing description ~ Emre Rençberoğlu

In [None]:
import pandas as pd
url = "https://gist.githubusercontent.com/SaxMan96/d90c454ec90c8270ef29193ef4b26726/raw/8f6373f8549cd2f4b87b5d12ad688f56f6fae7ca/spaceship_titanic.csv"
df = pd.read_csv(url, index_col=0)

In [None]:
df.shape

In [None]:
df.isna().sum()

In [None]:
df.head(5)

In [None]:
print(*df.Cabin.unique()[:20], sep='\t', end='...')

The `Cabin` column contains three pieces of information: the deck, the number and the side of the cabin. Use the `split()` method to split the string into three distinct columns.
- Pass the separator string to the `pat` argument of `split()`.
- Pass `expand=True` to expand the split strings into separate columns.

Documentation: [pandas.Series.str.split](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.split.html)

In [None]:
df[['Deck', 'Num', 'Side']] = df['Cabin'].str.split(pat='/', expand=True).fillna('Missing')
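
A minimal sketch of what `split()` with `expand=True` does, on three made-up cabin strings (the `cabins` Series is hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical cabin strings in the "Deck/Num/Side" layout, plus one missing value.
cabins = pd.Series(["B/0/P", "F/12/S", None], name="Cabin")

# expand=True returns a DataFrame with one column per piece.
# A missing cabin gives a row of missing values, which fillna() then labels.
parts = cabins.str.split(pat="/", expand=True).fillna("Missing")
parts.columns = ["Deck", "Num", "Side"]
print(parts)
```

Note that the pieces are still strings at this point, which is why `Num` needs a separate conversion step.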

In [None]:
df.Deck.nunique(), df.Num.nunique(), df.Side.nunique()

As you can see, `Num` holds numerical values, but the `'Missing'` placeholder cannot be treated as a number.
- Assign to `MissingNum` a boolean column that indicates whether `df.Num` is equal to `'Missing'`.
- Replace `'Missing'` with `-1` in the `Num` column using the `replace()` method.

Documentation: [pandas.Series.replace](https://pandas.pydata.org/docs/reference/api/pandas.Series.replace.html)

In [None]:
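df = df.assign(
    # Flag rows whose cabin number was missing before the placeholder is overwritten
    MissingNum = df.Num=='Missing',
    # Replace the placeholder with the sentinel -1 and convert the column to integers
    Num = df.Num.replace(to_replace='Missing', value=-1).astype(int),
)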
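
The same flag-then-replace idea on a tiny hypothetical Series (the values are invented for illustration). The flag is computed first so that the information about missingness survives the replacement:

```python
import pandas as pd

# Hypothetical cabin numbers stored as strings, with the 'Missing' placeholder.
nums = pd.Series(["0", "12", "Missing"], name="Num")

# Remember which rows were missing before the placeholder disappears.
missing_num = nums == "Missing"

# Swap the placeholder for the sentinel and convert everything to integers.
num_int = nums.replace(to_replace="Missing", value="-1").astype(int)
print(pd.DataFrame({"Num": num_int, "MissingNum": missing_num}))
```

The sentinel `-1` is safe here only because real cabin numbers are never negative; the boolean flag lets the model treat those rows separately.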

In [None]:
df.Num.plot.hist(bins=100)

Categorical features

In [None]:
df.Deck.value_counts()

In [None]:
df.Side.value_counts()

In [None]:
df.HomePlanet.value_counts()

In [None]:
df.Destination.value_counts()

Because we will use logistic regression, all variables have to be numeric. To convert categorical features to numeric ones we will use the `get_dummies()` function, which converts a categorical column into dummy/indicator variables. See how it works below:

In [None]:
pd.get_dummies(df['Destination'], prefix='Destination').head()
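
One detail that matters for this dataset is how `get_dummies()` treats missing values. The sketch below uses a small hypothetical Series (not the real column) and suggests that, with default arguments, a missing entry gets no indicator set at all instead of a category of its own:

```python
import pandas as pd

# Hypothetical destinations, including one missing entry.
dest = pd.Series(
    ["TRAPPIST-1e", "55 Cancri e", None, "TRAPPIST-1e"], name="Destination"
)

# One indicator column per observed category; NaN is ignored by default.
dummies = pd.get_dummies(dest, prefix="Destination")
print(dummies)

# The row with the missing destination has every indicator switched off.
print(dummies.iloc[2].sum())
```

This is why no explicit imputation is needed for the categorical columns: a missing category is encoded as "none of the above".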

Now we want to add indicator variables for all categorical features using a `for` loop.

Use the `feature_name` variable to select the proper column and to name the prefix.

In [None]:
for feature_name in ['Deck', 'Side', 'HomePlanet', 'Destination']:
    df = pd.concat([df, pd.get_dummies(df[feature_name], prefix=feature_name)], axis=1)

Now we will drop the unnecessary columns.

In [None]:
drop_columns = ['Deck', 'Side', 'HomePlanet', 'Cabin', 'Destination', 'Name']
df = df.drop(columns=drop_columns)

Now we will fill the missing values in the spending columns. We assume that a null means the passenger spent nothing, so we can fill it with zero.

In [None]:
fill_cols = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']
df[fill_cols] = df[fill_cols].fillna(0)

Here we create the `TotalSpend` variable, which is the sum of all spending categories.

In [None]:
df = df.assign(TotalSpend = df[['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']].sum(axis=1))
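
A short sketch on invented numbers (the `spend` frame is hypothetical) of how the row-wise sum behaves. By default `sum()` skips missing values, so the total comes out the same with or without the earlier `fillna(0)`; the fill still matters for the individual spending columns, which are passed to the model on their own:

```python
import numpy as np
import pandas as pd

# Hypothetical spending data with one missing entry.
spend = pd.DataFrame({"RoomService": [10.0, np.nan], "Spa": [5.0, 20.0]})

# sum(axis=1) adds across columns for each row and skips NaN by default.
total_raw = spend.sum(axis=1)

# Filling with zero first gives the same totals.
total_filled = spend.fillna(0).sum(axis=1)
print(total_raw.tolist(), total_filled.tolist())
```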

Here we separate the target variable from the predictors.

In [None]:
y = df.pop('Transported')
X = df.copy(deep=True)

Split `X` and `y` into train and test sets using `train_test_split`.

Documentation: [sklearn.model_selection.train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=7)
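
To see what `test_size` and `random_state` do, here is a minimal sketch on a ten-row toy array (`X_demo` and `y_demo` are hypothetical names used only for this illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each, and alternating labels.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1] * 5)

# test_size=0.3 puts 30% of the rows (3 of 10) into the test set.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=7
)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state reproduces the same split.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=7
)
print(np.array_equal(X_te, X_te2))
```

Fixing `random_state` keeps the results comparable between runs of the notebook.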

In [None]:
X_train.isna().sum()

As you can see above, there are still some missing values in the numerical and boolean columns.

In [None]:
df[['CryoSleep','Age','VIP']].head()

We will use Imputers to fill the missing values. We will use 2 different methods for that.

- `KNNImputer` - Each sample’s missing values are imputed using the mean value from `n_neighbors` nearest neighbors found in the training set. Two samples are close if the features that neither is missing are close.
- `SimpleImputer` - Replace missing values using a descriptive statistic (e.g. mean, median, or most frequent) along each column, or using a constant value.
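
To make the `KNNImputer` description concrete, here is a minimal sketch on invented numbers (the array `X_demo` is hypothetical and not part of the dataset). Neighbours are found using the features a row does have, so the imputer needs at least one other column to measure distance with. When it is given a single column, a row with that value missing has nothing to compare, and current scikit-learn versions appear to fall back to the mean of the observed values in that column:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical toy data: column 0 has one missing value, column 1 is complete.
X_demo = np.array([
    [1.0, 1.0],
    [1.0, 1.2],
    [0.0, 5.0],
    [0.0, 5.2],
    [np.nan, 1.1],
])

# With both columns, the two nearest rows (judged by column 1) are rows 0 and 1,
# so the missing value becomes the mean of their column-0 values.
two_col = KNNImputer(n_neighbors=2).fit_transform(X_demo)
print(two_col[4, 0])

# With column 0 alone there is nothing to measure distance on,
# so the imputer falls back to the mean of the observed values.
one_col = KNNImputer(n_neighbors=2).fit_transform(X_demo[:, [0]])
print(one_col[4, 0])
```

In other words, a KNN imputer only behaves differently from a mean imputer when it is fitted on several columns at once.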

Task:
- Use `KNNImputer` with 2 neighbouring samples to fill the `CryoSleep` column
- Use `SimpleImputer` with the `'mean'` strategy to fill the `Age` column
- Use `SimpleImputer` with the `'most_frequent'` strategy to fill the `VIP` column

In [None]:
from sklearn.impute import SimpleImputer, KNNImputer
import numpy as np

knn_imp = KNNImputer(n_neighbors=2) #CryoSleep
mean_imp = SimpleImputer(strategy='mean') #Age
freq_imp = SimpleImputer(strategy="most_frequent") #VIP

for imputer, feature_name in zip([knn_imp, mean_imp, freq_imp], ['CryoSleep', 'Age', 'VIP']):
    X_train[feature_name] = imputer.fit_transform(X_train[[feature_name]])
    X_test[feature_name] = imputer.transform(X_test[[feature_name]])

We use `fit_transform` on the train set and only `transform` on the test set, so that the imputation statistics are learned from the training data alone and no information from the test set leaks into the model.
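
The fit/transform split can be illustrated with a toy example (the ages below are invented). The mean is learned from the train values only, and the same number is reused for the test set regardless of what the test set contains:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical ages: one gap in the train set, one in the test set.
train_age = np.array([[20.0], [30.0], [np.nan], [40.0]])
test_age = np.array([[np.nan], [80.0]])

imp = SimpleImputer(strategy="mean")

# fit_transform learns the mean of the observed train values and fills the gap.
train_filled = imp.fit_transform(train_age)

# transform reuses the stored train mean; the 80.0 in the test set plays no role.
test_filled = imp.transform(test_age)

print(imp.statistics_, test_filled.ravel())
```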

In [None]:
X_train.isna().sum()

Use `StandardScaler` to standardize the features. Use the `fit_transform()` and `transform()` methods on the train and test sets respectively, in the same fashion as with the imputers.

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, log_loss

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
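
A minimal sketch of what the scaler computes, on made-up numbers (`train` and `test` are hypothetical arrays). Each value is transformed as `(x - mean) / std`, with the mean and standard deviation taken from the train set, so the train set ends up with mean 0 and standard deviation 1 while the test set generally does not:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data.
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

sc = StandardScaler()
train_std = sc.fit_transform(train)  # learns mean and std from train
test_std = sc.transform(test)        # applies the train statistics to test

print(sc.mean_, sc.scale_)
print(train_std.ravel(), test_std.ravel())
```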

Create a `LogisticRegression` model and fit it on the train set.


In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)

In IPython and Jupyter you can print the documentation by putting `?` after the method name.

In [None]:
model.score?

---
As you can see above, the `score()` method returns the accuracy of the model on the data passed to it. 
Calculate the accuracy on the train and test sets.

In [None]:
model.score(X_train, y_train), model.score(X_test, y_test)
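
A sketch on synthetic data (generated here and unrelated to the passenger data; all names are hypothetical) suggesting that `score()` for a classifier is simply the fraction of correct predictions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic data: the label depends mostly on the first feature, plus noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200)) > 0

clf = LogisticRegression().fit(X_demo, y_demo)
pred = clf.predict(X_demo)

# Accuracy computed by hand: share of predictions that match the labels.
manual_acc = np.mean(pred == y_demo)

print(manual_acc, clf.score(X_demo, y_demo), accuracy_score(y_demo, pred))
```

A train accuracy that is much higher than the test accuracy would hint at overfitting.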

---
Display the confusion matrix using `from_estimator` from `ConfusionMatrixDisplay`.

Documentation: [sklearn.metrics.ConfusionMatrixDisplay.from_estimator](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.ConfusionMatrixDisplay.html#sklearn.metrics.ConfusionMatrixDisplay.from_estimator)

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)
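
If the raw counts behind the plot are needed, `confusion_matrix` returns them as an array. A small sketch with hand-written labels (hypothetical, chosen so the counts are easy to verify by eye); rows are true classes and columns are predicted classes, in sorted label order:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for five passengers.
y_true = [False, False, True, True, True]
y_pred = [False, True, True, True, False]

cm = confusion_matrix(y_true, y_pred)

# For a binary problem, ravel() yields the four cells in this order.
tn, fp, fn, tp = cm.ravel()
print(cm)
print(tn, fp, fn, tp)
```

Accuracy is the diagonal divided by the total, here `(tn + tp) / 5`.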

Display the ROC curve using `from_estimator` from `RocCurveDisplay`.

Documentation: [sklearn.metrics.RocCurveDisplay.from_estimator](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.RocCurveDisplay.html)

In [None]:
from sklearn.metrics import RocCurveDisplay

RocCurveDisplay.from_estimator(model, X_test, y_test)
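
The area under the ROC curve summarises the plot in one number. A minimal sketch with four hand-written scores (hypothetical values); for the notebook's model the scores would presumably come from `model.predict_proba(X_test)[:, 1]`:

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical labels and predicted probabilities of the positive class.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# Points of the ROC curve: false positive rate vs. true positive rate.
fpr, tpr, thresholds = roc_curve(y_true, scores)

# AUC is the share of (positive, negative) pairs ranked in the right order:
# here 3 of the 4 pairs, since only (0.35, 0.4) is ranked wrongly.
auc = roc_auc_score(y_true, scores)
print(auc)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect one.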