<a href="https://www.kaggle.com/code/guptaayushman24/spaceship-model?scriptVersionId=146906397" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

Problem Statement

In this competition your task is to predict whether a passenger was transported to an alternate dimension during the Spaceship Titanic's collision with the spacetime anomaly. To help you make these predictions, you're given a set of personal records recovered from the ship's damaged computer system.

In [None]:
data = pd.read_csv('/kaggle/input/spaceship-titanic/train.csv')

Checking the shape of the dataset

In [None]:
data.shape

Data contains 8693 rows and 14 columns

Seeing the top 5 records of the dataset

In [None]:
data.head()

Checking the duplicates in the dataset

In [None]:
data.duplicated().sum()

No duplicate records are present in the dataset

Dropping the PassengerId column from the dataset

In [None]:
data.drop('PassengerId',axis=1,inplace=True)

In [None]:
data.head()

Dropping the name column of the passenger

In [None]:
data.drop('Name',axis=1,inplace=True)

In [None]:
data.head()

Checking whether any null values are present in the dataset

In [None]:
data.isnull().sum().sum()

In [None]:
data.shape

The number of missing values is about 24% of the number of rows in the dataset, so we cannot simply drop them; instead we will replace them

Separating the columns of the dataset into numerical and categorical
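
As a minimal sketch of how the extent of missing data can be measured, the example below counts nulls per column and the share of rows that contain at least one null. The `toy` frame and its values are made up for illustration and are not taken from the competition data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the competition data
toy = pd.DataFrame({
    "Age": [39.0, np.nan, 58.0, 33.0],
    "HomePlanet": ["Europa", "Earth", None, "Earth"],
    "VIP": [False, False, True, False],
})

# Number of nulls in each column
missing_per_column = toy.isnull().sum()

# Number and share of rows that contain at least one null
rows_with_missing = toy.isnull().any(axis=1).sum()
share_of_rows = rows_with_missing / len(toy)

print(missing_per_column)
print("Share of rows with a missing value: {:.0%}".format(share_of_rows))
```

Looking at the share of affected rows, and not only the total count, shows how much data a plain `dropna()` would discard.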

In [None]:
data['HomePlanet'].dtype

In [None]:
lst_num=[] # This list will store all the numerical columns of the dataset
lst_cat=[] # This list will store all the categorical columns of the dataset

In [None]:
for i in data.columns :
    if (data[i].dtype=='O') :
        lst_cat.append(i)
    else :
        lst_num.append(i)
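
The loop above sends every non-object column to `lst_num`, which includes the boolean `Transported` target. A possible alternative, sketched here on a made-up frame, is to set the target aside first and let `select_dtypes` split the remaining columns. The names `toy`, `cat_cols` and `num_cols` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the same kinds of columns as the training data
toy = pd.DataFrame({
    "HomePlanet": ["Europa", "Earth", None],
    "Age": [39.0, 24.0, np.nan],
    "RoomService": [0.0, 109.0, 43.0],
    "Transported": [False, True, False],
})

target = "Transported"
features = toy.drop(columns=[target])  # keep the target out of both lists

# Numeric columns, and everything else treated as categorical
num_cols = features.select_dtypes(include="number").columns.tolist()
cat_cols = features.select_dtypes(exclude="number").columns.tolist()

print(num_cols)
print(cat_cols)
```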

In [None]:
lst_num

In [None]:
data['Transported'].dtype

In [None]:
data.head()

Checking the null values in the categorical columns

In [None]:
for i in lst_cat :
    print("The number of null values in the {} column is {}".format(i,data[i].isnull().sum()))
    

Replacing the null values in the categorical columns with the mode

In [None]:
for i in lst_cat :
    data[i] = data[i].fillna(data[i].mode()[0])

In [None]:
for i in lst_cat :
    print("The mode of the {} column is {}".format(i,data[i].mode()[0]))

In [None]:
lst_num

Checking whether any outliers are present in the numerical columns


In [None]:
def outlier(df,variable) :
    q1 = df[variable].quantile(0.25)
    q3 = df[variable].quantile(0.75)
    iqr = q3-q1
    lower_fence = q1-1.5*iqr
    higher_fence = q3+1.5*iqr
    return lower_fence,higher_fence

In [None]:
for i in lst_num :
    if (i=='Transported') :
        continue  # skip the target column instead of stopping the loop
    plt.title('Box plot of the {} column'.format(i))
    sns.boxplot(data=data,x=data[i])
    plt.show()
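
The `outlier` function defined above returns the IQR fences but is not called anywhere in this notebook. A minimal sketch of how such fences can be used to list the outlying values, on a made-up spending series (the helper name `iqr_fences` and the data are illustrative):

```python
import pandas as pd

def iqr_fences(series):
    """Return the lower and upper 1.5*IQR fences of a numeric series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Hypothetical spending values with one extreme entry
spend = pd.Series([0, 0, 10, 20, 30, 40, 50, 5000], name="RoomService")

lower, upper = iqr_fences(spend)
# Values outside the fences are flagged as outliers
outliers = spend[(spend < lower) | (spend > upper)]

print(lower, upper)
print(outliers.tolist())
```

The box plots mark these same points as individual dots beyond the whiskers.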

Checking the missing values in the numerical columns

In [None]:
for i in lst_num :
    print('Number of missing values in the {} column is {}'.format(i,data[i].isnull().sum()))

Replacing the null values with the median of the numerical columns

In [None]:
for i in lst_num :
    data[i] = data[i].fillna(data[i].median())
    

In [None]:
for i in lst_num :
    print('Number of missing values in the {} column is {}'.format(i,data[i].isnull().sum()))

Univariate Analysis

In [None]:
# Plotting the pie chart of the HomePlanet
name_home = data['HomePlanet'].value_counts().index.tolist()
count_home = data['HomePlanet'].value_counts().tolist()

fig, ax = plt.subplots()
ax.pie(count_home, labels=name_home, autopct='%1.1f%%',shadow=True,startangle=180)

From the above graph we can see that passengers from Earth make up the largest share of the people on the spaceship

In [None]:
# Plotting the bar chart of the people who are in cryo sleep
cryo_sleep_counts = data['CryoSleep'].value_counts()
cryo_sleep_number = cryo_sleep_counts.tolist()
# Take the labels from the counts themselves so each bar matches its value
cryo_sleep_labels = [str(label) for label in cryo_sleep_counts.index]

# Color by label so that True is always green and False is always red
bar_colors = ['tab:green' if label == 'True' else 'tab:red' for label in cryo_sleep_labels]

fig, ax = plt.subplots()

# Create the bar plot
ax.bar(cryo_sleep_labels, cryo_sleep_number, color=bar_colors)

ax.set_title('Number of people in cryo sleep or not')
plt.show()

From the above graph we can see that most passengers are not in cryosleep

In [None]:
# Plotting the age column for seeing the distribution of Age
from math import sqrt
no_of_bins = sqrt(data.shape[0])
bin_width = [data['Age'].min(),data['Age'].max()]
hist_color='red'
plt.hist(data['Age'],bins=round(no_of_bins),range=bin_width,color=hist_color)
plt.xlabel('Age')
plt.ylabel('Frequency of different age')

From the above graph we can see that many passengers are between 25 and 30 years old and very few are between 70 and 80

In [None]:
data['VIP'].value_counts()

In [None]:
# Plotting the pie chart of the VIP column 
# VIP-> Passenger who take the VIP service for the travelling
bool_VIP = data['VIP'].value_counts().index.tolist()
count_vip = data['VIP'].value_counts().tolist()

fig, ax = plt.subplots()
ax.pie(count_vip, labels=bool_VIP, autopct='%1.1f%%',shadow=True,startangle=90)
plt.title('Percentage of people who take the VIP service and percentage who did not take the VIP service')


From the above pie chart we can see that very few passengers took the VIP service: around 97.7% did not take it and around 2.3% did

In [None]:
# Analysis of the Age vs CryoSleep

plt.figure(figsize=(8, 6))
# seaborn draws the hue legend itself; relabelling it by hand can swap the entries
sns.histplot(data=data, x='Age', hue='CryoSleep', bins=20, kde=True)
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Age Distribution by Cryosleep')
plt.show()

From the above graph we can see that passengers aged around 20 to 30 are the ones most often in cryosleep, meaning younger passengers are in cryosleep more than older ones

In [None]:
# Analysis of the Age vs VIP

plt.figure(figsize=(8, 6))
# seaborn draws the hue legend itself; relabelling it by hand can swap the entries
sns.histplot(data=data, x='Age', hue='VIP', bins=20, kde=True)
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Age Distribution by VIP')
plt.show()

In [None]:
data.head()

In [None]:
# Calculating the average of these 5 columns and creating the new column, Avg_Total_Expense
data['Avg_Total_Expense'] = (data['RoomService'] + data['FoodCourt'] + data['ShoppingMall'] + data['Spa'] + data['VRDeck']) / 5

In [None]:
data['Avg_Total_Expense'].head()

In [None]:
plt.scatter(data['Age'],data['Avg_Total_Expense'])

From the above graph we can see that passengers aged 30 to 50 have higher expenses than others, though some older passengers also have high expenses

In [None]:
data['Transported'].value_counts()


In [None]:
Transported_bool = data['Transported'].value_counts().index.tolist()
count_Transported = data['Transported'].value_counts().tolist()

fig, ax = plt.subplots()
ax.pie(count_Transported, labels=Transported_bool, autopct='%1.1f%%',shadow=True,startangle=180)

From the above graph we can see that the two classes are nearly balanced: around 50.4% of passengers were transported to another dimension and around 49.6% were not

In [None]:
data['Destination'].value_counts()

In [None]:
destination_number = data['Destination'].value_counts().tolist()
destination_name = data['Destination'].value_counts().index
bar_colors = ['tab:green',  'tab:red', 'tab:orange']

fig, ax = plt.subplots()


ax.bar(destination_name, destination_number, color=bar_colors)

ax.set_title('Number of people vs Destination')
plt.xlabel('Destination')
plt.ylabel('Number of people')
plt.show()


From the above graph we can see that most passengers are travelling to the TRAPPIST-1e planet

In [None]:
data.groupby('Destination')['HomePlanet'].value_counts().unstack(fill_value=0)

In [None]:
lst_homeplanet=data.groupby('Destination')['HomePlanet'].value_counts().unstack(fill_value=0).columns.tolist()

In [None]:
lst_homeplanet
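
The pie charts that follow type the counts in by hand. As a hedged alternative, the counts for one destination can be read directly from a cross-tabulation, which keeps the numbers and the labels in the same order. The sketch below uses a made-up frame; `toy`, `counts`, `labels` and `number` are illustrative names.

```python
import pandas as pd

# Hypothetical passengers, not the competition data
toy = pd.DataFrame({
    "Destination": ["TRAPPIST-1e", "TRAPPIST-1e", "55 Cancri e", "TRAPPIST-1e", "55 Cancri e"],
    "HomePlanet": ["Earth", "Mars", "Europa", "Earth", "Earth"],
})

# Rows are destinations, columns are home planets, cells are passenger counts
counts = pd.crosstab(toy["Destination"], toy["HomePlanet"])

# Pull the counts for one destination; labels and values share the same order
row = counts.loc["TRAPPIST-1e"]
labels = row.index.tolist()
number = row.tolist()

print(labels, number)
```

`labels` and `number` can then be passed to `ax.pie(number, labels=labels, ...)` in place of the typed-in lists.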

In [None]:
# Pie chart of the 55_Cancri_e
number = [721,886,193]
lst = lst_homeplanet

fig, ax = plt.subplots()
ax.pie(number, labels=lst, autopct='%1.1f%%',shadow=True,startangle=45)
plt.title('Percentage of people who are transferred to 55 Cancri e planet')

From the above pie chart we can see that passengers from Europa are the largest group travelling to the 55 Cancri e planet

In [None]:
# Pie chart of the PSO J318.5-22
number = [728,19,49]
lst = lst_homeplanet

fig, ax = plt.subplots()
ax.pie(number, labels=lst, autopct='%1.1f%%',shadow=True,startangle=45)
plt.title('Percentage of people who are transferred to PSO J318.5-22 Planet')

From the above pie chart we can see that passengers from Earth are the largest group travelling to the PSO J318.5-22 planet

In [None]:
# Pie chart of the TRAPPIST-1e
number = [3354,1226,1517]
lst = lst_homeplanet

fig, ax = plt.subplots()
ax.pie(number, labels=lst, autopct='%1.1f%%',shadow=True,startangle=45)
plt.title('Percentage of people who are transferred to TRAPPIST Planet')

From the above pie chart we can see that passengers from Earth are the largest group travelling to the TRAPPIST-1e planet

From the above three pie charts we can conclude that passengers from Earth are the largest group on the spaceship, because Earth dominates two of the three pie charts

**Encoding the categorical columns as numerical columns**

In [None]:
# Label encoding of the HomePlanet (each planet is mapped to an integer)
data['HomePlanet'] = data['HomePlanet'].map({'Earth':1,'Europa':2,'Mars':3})

In [None]:
data.head()

In [None]:
# Doing Binary Encoding on the CryoSleep column
data['CryoSleep'] = data['CryoSleep'].map({True:1,False:0})

In [None]:
# Label encoding of the Destination (each destination is mapped to an integer)
data['Destination'] = data['Destination'].map({'TRAPPIST-1e':1,'55 Cancri e':2,'PSO J318.5-22':3})

In [None]:
# Doing Binary Encoding on the VIP column
data['VIP'] = data['VIP'].map({True:1,False:0})

In [None]:
# Doing Binary Encoding on the Transported column
data['Transported'] = data['Transported'].map({True:1,False:0})
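
The integer mappings used for `HomePlanet` and `Destination` impose an order (for example Earth < Europa < Mars) that the categories do not have. One common alternative is one-hot encoding. The sketch below shows it with `pd.get_dummies` on a made-up frame; it is an illustration under that assumption and not what this notebook does.

```python
import pandas as pd

# Hypothetical toy column
toy = pd.DataFrame({"HomePlanet": ["Earth", "Europa", "Mars", "Earth"]})

# One indicator column per category, stored as 0/1 integers
encoded = pd.get_dummies(toy, columns=["HomePlanet"], dtype=int)

print(encoded)
```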

## Checking the correlation of the Cabin Column with the target column which is Transported column

In [None]:
from pandas import factorize

labels, categories = factorize(data['Cabin'])
data["labels"] = labels
abs(data['Transported'].corr(data["labels"]))*100

From the above result we can see that the Cabin column has a very weak correlation with the target column, Transported, so we are dropping the Cabin column

Dropping the labels column

In [None]:
data.drop('labels',axis=1,inplace=True)

In [None]:
data.drop('Cabin',axis=1,inplace=True) 
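
Note that `factorize` assigns arbitrary integer codes, so a weak correlation with those codes does not by itself show that Cabin carries no signal. The competition's data description states that Cabin takes the form deck/num/side. A minimal sketch, on made-up cabin strings, of splitting the column into its parts before judging it; the column names `Deck`, `CabinNum` and `Side` are illustrative and not part of this notebook.

```python
import pandas as pd

# Hypothetical cabin strings in the deck/num/side form
toy = pd.DataFrame({"Cabin": ["B/0/P", "F/1/S", None, "G/12/S"]})

# Split on "/" into three separate columns; missing cabins stay missing
parts = toy["Cabin"].str.split("/", expand=True)
parts.columns = ["Deck", "CabinNum", "Side"]

print(parts)
```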

## Checking the multicollinearity of all the features with the help of the heat map

In [None]:
data.head()

In [None]:
# Plotting the heat map 
fig, ax = plt.subplots(figsize=(10, 5))
dataplot = sns.heatmap(data.corr(numeric_only=True), annot=True, ax=ax)

# Display the heatmap
plt.show()

In [None]:
# Dropping these columns because the new Avg_Total_Expense column holds their average
data.drop(['ShoppingMall','Spa','FoodCourt','RoomService','VRDeck'],axis=1,inplace=True)

In [None]:
data.head()

In [None]:
# Plotting the heat map 
fig, ax = plt.subplots(figsize=(10, 5))
dataplot = sns.heatmap(data.corr(numeric_only=True), annot=True, ax=ax)

# Display the heatmap
plt.show()

From the above heatmap we can see that the HomePlanet column has very little correlation with the target column, Transported, so we are dropping the HomePlanet column

In [None]:
data.drop('HomePlanet',axis=1,inplace=True)

In [None]:
data.head()

## Loading the test dataset

In [None]:
data_test = pd.read_csv('/kaggle/input/spaceship-titanic/test.csv')

In [None]:
data_test

## We have to apply the same transformations to the test dataset that we applied to the training dataset

In [None]:
# Dropping the PassengerID and the Name from the test dataset
data_test.drop(['PassengerId','Name'],axis=1,inplace=True)

In [None]:
lst_num_test=[] # This list will store all the numerical columns of the test dataset
lst_cat_test=[] # This list will store all the categorical columns of the test dataset
for i in data_test.columns :
    if (data_test[i].dtype=='O') :
        lst_cat_test.append(i)
    else :
        lst_num_test.append(i)

In [None]:
lst_num_test

In [None]:
lst_cat_test

In [None]:
for i in lst_cat_test :
    print("The number of null values in the {} column is {}".format(i,data_test[i].isnull().sum()))

In [None]:
for i in lst_cat_test :
    data_test[i] = data_test[i].fillna(data_test[i].mode()[0])

In [None]:
for i in lst_cat_test :
    print("The mode of the {} column is {}".format(i,data_test[i].mode()[0]))

In [None]:
for i in lst_num_test :
    print('Number of missing values in the {} column is {}'.format(i,data_test[i].isnull().sum()))

In [None]:
for i in lst_num_test :
    data_test[i] = data_test[i].fillna(data_test[i].median())
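
The cells above fill the gaps in the test set with the mode and median of the test set itself. A common alternative is to reuse the statistics learned from the training set, so both sets are imputed with the same values. A minimal sketch under that assumption, with made-up `train` and `test` frames:

```python
import numpy as np
import pandas as pd

# Hypothetical training and test frames
train = pd.DataFrame({
    "Age": [20.0, 30.0, np.nan, 40.0],
    "HomePlanet": ["Earth", "Earth", None, "Mars"],
})
test = pd.DataFrame({
    "Age": [np.nan, 60.0],
    "HomePlanet": [None, "Europa"],
})

# Statistics are computed on the training data only
age_median = train["Age"].median()
planet_mode = train["HomePlanet"].mode()[0]

# The same values fill the gaps in the test data
test_filled = test.assign(
    Age=test["Age"].fillna(age_median),
    HomePlanet=test["HomePlanet"].fillna(planet_mode),
)

print(test_filled)
```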

In [None]:
# Label encoding of the HomePlanet (each planet is mapped to an integer)
data_test['HomePlanet'] = data_test['HomePlanet'].map({'Earth':1,'Europa':2,'Mars':3})

In [None]:
# Doing Binary Encoding on the CryoSleep column
data_test['CryoSleep'] = data_test['CryoSleep'].map({True:1,False:0})

In [None]:
# Label encoding of the Destination (each destination is mapped to an integer)
data_test['Destination'] = data_test['Destination'].map({'TRAPPIST-1e':1,'55 Cancri e':2,'PSO J318.5-22':3})

In [None]:
# Doing Binary Encoding on the VIP column
data_test['VIP'] = data_test['VIP'].map({True:1,False:0})

In [None]:
data_test.drop('Cabin',axis=1,inplace=True)

In [None]:
data_test['Avg_Total_Expense'] = (data_test['RoomService'] + data_test['FoodCourt'] + data_test['ShoppingMall'] + data_test['Spa'] + data_test['VRDeck']) / 5

In [None]:
data_test.drop(['RoomService','FoodCourt','ShoppingMall','Spa','VRDeck'],axis=1,inplace=True)

In [None]:
data_test.drop('HomePlanet',axis=1,inplace=True)

In [None]:
data_test.head()

In [None]:
data.head()

In [None]:
data_test.head()

In [None]:
X = data.drop('Transported',axis=1)
y = data['Transported']

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.33,random_state=42)

In [None]:
# Creating the dictionary of the different classification models
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
model_params={
    'Logistic Regression' : LogisticRegression(),
    'Support Vector' : SVC(),
    'Decision Tree' : DecisionTreeClassifier(),
    'Random Forest' : RandomForestClassifier()
}

In [None]:
for model in model_params.values() :
    print(model)

In [None]:
for model in model_params.values() :
    model.fit(X_train,y_train)

In [None]:
# In this problem the target is 0 when a passenger was not transported and 1 when the passenger was transported
# Predicting "transported" for a passenger who was not transported is a false positive, so we want to reduce false positives, which means increasing the precision

## Now we will use the classification models to predict the output


In [None]:
for name,model in model_params.items() :
    model_predict = model.predict(X_test)
    print("The prediction of the {} model is {}".format(name,model_predict))

In [None]:
# Finding the accuracy of all the classification models in the model_params dictionary
from sklearn.metrics import accuracy_score
for name,model in model_params.items() :
    model_predict = model.predict(X_test)
    model_accuracy = accuracy_score(y_test,model_predict)
    print('The accuracy of the {} is {}'.format(name,model_accuracy))
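
A single train/test split gives one estimate per model. As a hedged sketch of a more stable comparison, the example below scores two classifiers with 5-fold cross-validation on precision. The data comes from `make_classification` and is purely synthetic; `X_toy`, `y_toy` and `cv_precision` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for X and y
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

# Mean precision over 5 folds for each model
cv_precision = {
    name: cross_val_score(model, X_toy, y_toy, cv=5, scoring="precision").mean()
    for name, model in models.items()
}

for name, score in cv_precision.items():
    print("Mean cross-validated precision of {} is {:.3f}".format(name, score))
```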

From the above results we can see that the Support Vector classifier and Logistic Regression give the highest accuracy, so we will now compare the precision of these two models and select the one with the higher precision

## Confusion matrix for the Logistic Regression and Support Vector models

In [None]:
from sklearn.metrics import confusion_matrix
for name,model in model_params.items() :
    if (name=='Logistic Regression' or name=='Support Vector') :
        model_predict = model.predict(X_test)
        # The predictions are passed first here, so rows are predicted labels and columns are actual labels
        conf_mat = confusion_matrix(model_predict,y_test)
        print('The confusion matrix of the {} is {}'.format(name,conf_mat))

## Precision score for the Logistic Regression and the Support Vector

In [None]:
# With the predictions passed first, the cells of the matrix are
# tn = conf_mat[0][0]  (predicted 0, actual 0)
# fn = conf_mat[0][1]  (predicted 0, actual 1)
# fp = conf_mat[1][0]  (predicted 1, actual 0)
# tp = conf_mat[1][1]  (predicted 1, actual 1)

In [None]:
# Logistic Regression
tn = 1225
fn = 602
fp = 199
tp = 843
precision_logistic = tp/(tp+fp)
print(precision_logistic)

In [None]:
# Support Vector
tn1 = 1135
fn1 = 472
fp1 = 289
tp1 = 973
precision_svm = tp1/(tp1+fp1)
print(precision_svm)

From the above results, Logistic Regression has the higher precision (about 0.81) and the Support Vector classifier is close behind (about 0.77). The Support Vector classifier was fitted on unscaled features and is sensitive to feature scale, so we standardize the data and evaluate it again below
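
For reference, scikit-learn's `confusion_matrix` expects the true labels first and, for the labels 0 and 1, returns the layout `[[tn, fp], [fn, tp]]`. The sketch below, on made-up labels, unpacks the four cells with `ravel()` and checks the hand-computed precision against `precision_score`.

```python
from sklearn.metrics import confusion_matrix, precision_score

# Hypothetical true and predicted labels
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 1, 0, 1, 0]

# True labels go first; ravel() flattens [[tn, fp], [fn, tp]] row by row
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Precision = correct positive predictions / all positive predictions
manual_precision = tp / (tp + fp)
library_precision = precision_score(y_true, y_pred)

print(tn, fp, fn, tp)
print(manual_precision, library_precision)
```

Using `precision_score` directly avoids copying counts out of a printed matrix by hand.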

# Standardizing the data

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Reuse the scaler fitted on the training data so both sets share the same scale
X_test_scaled = scaler.transform(X_test)

In [None]:
# Fitting the Support Vector classifier with default hyperparameters on the standardized data
model_svm = SVC()
model_svm.fit(X_train_scaled,y_train)

In [None]:
model_predict_svm=model_svm.predict(X_test_scaled)

In [None]:
model_predict_svm

In [None]:
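from sklearn.metrics import precision_score
accuracy_score_svc = accuracy_score(y_test,model_predict_svm)
# Compute the precision from this model's own predictions instead of reusing the earlier counts
precision_svm = precision_score(y_test,model_predict_svm)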

In [None]:
# Printing the accuracy and the precision of the Support Vector Classifier
print("Accuracy of the Support Vector Classifier is : {}".format(accuracy_score_svc))
print("Precision of the Support Vector Classifier is : {}".format(precision_svm))

# Hyperparameter Tuning

In [None]:
model_param = {
    'kernel':['linear', 'poly', 'rbf', 'sigmoid'],
    'gamma': ['scale', 'auto']
}

In [None]:
# Hyperparameter tuning of the Support Vector Classifier model
from sklearn.model_selection import GridSearchCV
gscv = GridSearchCV(model_svm,model_param,cv=10,scoring='precision')
gscv.fit(X_train_scaled,y_train)
print(gscv.best_params_)
print(gscv.best_score_)
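
When the grid search runs on data that was scaled beforehand, every cross-validation fold has already seen the statistics of the whole training set. One way to avoid that, sketched here on synthetic data, is to put the scaler and the classifier in a `Pipeline` so the scaler is refitted inside each fold. The step names `scale` and `svc` and the reduced grid are choices made for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary classification data standing in for X_train and y_train
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=42)

# Scaling and classification are chained so each fold fits its own scaler
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])

# Pipeline parameters are addressed as <step name>__<parameter name>
param_grid = {
    "svc__kernel": ["linear", "rbf"],
    "svc__gamma": ["scale", "auto"],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="precision")
search.fit(X_toy, y_toy)

print(search.best_params_)
print(search.best_score_)
```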