# Basic Understanding of Data

In [156]:
# Importing Libraries
import pandas as pd 

# Show every row and column instead of truncating the display
pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)

In [157]:
# Loading Datasets - If you want to see Data Description go to "dataset/raw/FileDescriptions"
train_df = pd.read_csv("../dataset/raw/train.csv")
test_df = pd.read_csv("../dataset/raw/test.csv")

In [158]:
# Showing Training Dataframe
print("Training Dataframe shape: ", train_df.shape)
train_df.head()

Training Dataframe shape:  (8693, 14)


Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


In [159]:
# Showing Testing Dataframe
print("Testing Dataframe shape: ", test_df.shape)
test_df.head()

Testing Dataframe shape:  (4277, 13)


Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name
0,0013_01,Earth,True,G/3/S,TRAPPIST-1e,27.0,False,0.0,0.0,0.0,0.0,0.0,Nelly Carsoning
1,0018_01,Earth,False,F/4/S,TRAPPIST-1e,19.0,False,0.0,9.0,0.0,2823.0,0.0,Lerome Peckers
2,0019_01,Europa,True,C/0/S,55 Cancri e,31.0,False,0.0,0.0,0.0,0.0,0.0,Sabih Unhearfus
3,0021_01,Europa,False,C/1/S,TRAPPIST-1e,38.0,False,0.0,6652.0,0.0,181.0,585.0,Meratz Caltilter
4,0023_01,Earth,False,F/5/S,TRAPPIST-1e,20.0,False,10.0,0.0,635.0,0.0,0.0,Brence Harperez


## Observations: differences between both dataframes
Unlike the training dataframe, the testing dataframe has no "Transported" column (13 columns versus 14). "Transported" is the target: we build the model on the training data and use it to predict this column for the testing data.
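
The column difference can also be checked programmatically rather than by eye. The sketch below uses two small stand-in frames (`toy_train` and `toy_test` are hypothetical names holding only a few of the real columns) and compares their column sets. On the real data, the same expressions would presumably be applied to `train_df` and `test_df`.

```python
import pandas as pd

# Hypothetical miniature frames standing in for train_df / test_df.
toy_train = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_01"],
    "HomePlanet": ["Europa", "Earth"],
    "Transported": [False, True],
})
toy_test = pd.DataFrame({
    "PassengerId": ["0013_01", "0018_01"],
    "HomePlanet": ["Earth", "Earth"],
})

# Columns present in one frame but missing from the other.
only_in_train = toy_train.columns.difference(toy_test.columns).tolist()
only_in_test = toy_test.columns.difference(toy_train.columns).tolist()

print("Only in training:", only_in_train)
print("Only in testing:", only_in_test)
```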

In [160]:
# Checking for duplicate rows
print(f"Data Duplicates in Training Dataframe: {train_df.duplicated().sum()}")
print(f"Data Duplicates in Testing Dataframe: {test_df.duplicated().sum()}")

Data Duplicates in Training Dataframe: 0
Data Duplicates in Testing Dataframe: 0


In [161]:
# Checking each Data-Type
print(f"Data Types of Training Dataframe: {train_df.dtypes}\n")
print(f"Data Types of Testing Dataframe: {test_df.dtypes}")

Data Types of Training Dataframe: PassengerId      object
HomePlanet       object
CryoSleep        object
Cabin            object
Destination      object
Age             float64
VIP              object
RoomService     float64
FoodCourt       float64
ShoppingMall    float64
Spa             float64
VRDeck          float64
Name             object
Transported        bool
dtype: object

Data Types of Testing Dataframe: PassengerId      object
HomePlanet       object
CryoSleep        object
Cabin            object
Destination      object
Age             float64
VIP              object
RoomService     float64
FoodCourt       float64
ShoppingMall    float64
Spa             float64
VRDeck          float64
Name             object
dtype: object


## Observations: CryoSleep & VIP Data Type
We can observe that CryoSleep & VIP contain boolean values but their Data Type is object, so we have to convert their Data Type to _bool_.
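
One way to do this conversion is sketched below, on a toy frame (`toy_df` is a hypothetical stand-in, not the real data). It assumes the columns may contain missing values, in which case a plain `astype(bool)` would silently turn `NaN` into `True`. The pandas nullable `"boolean"` dtype keeps missing entries as missing, so the sketch uses that instead; whether this matches the conversion used later in the project is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: True/False mixed with NaN is stored as object dtype.
toy_df = pd.DataFrame({
    "CryoSleep": [True, False, np.nan],
    "VIP": [False, np.nan, True],
})
bool_cols = ["CryoSleep", "VIP"]

# Pitfall: plain bool has no missing value, so NaN is coerced to True.
naive = toy_df["CryoSleep"].astype(bool)

# Nullable boolean dtype: missing entries stay missing (<NA>).
converted = toy_df.astype({col: "boolean" for col in bool_cols})

print(converted.dtypes)
print(converted)
```

If the missing values are filled first, converting with `astype(bool)` afterwards is safe.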

In [162]:
# Checking for missing values in Testing Dataframe
missing_df = test_df.isnull().sum().to_frame().rename(columns={0: "Number of Missing Values"})
missing_df["% of Missing Values"] = round(100 * test_df.isnull().sum() / len(test_df), 2)
missing_df

Unnamed: 0,Number of Missing Values,% of Missing Values
PassengerId,0,0.0
HomePlanet,87,2.03
CryoSleep,93,2.17
Cabin,100,2.34
Destination,92,2.15
Age,91,2.13
VIP,93,2.17
RoomService,82,1.92
FoodCourt,106,2.48
ShoppingMall,98,2.29


In [163]:
# Checking for missing values in Training Dataframe
missing_df = train_df.isna().sum().to_frame().rename(columns={0: "Number of Missing Values"})
missing_df["% of Missing Values"] = round(100 * train_df.isna().sum() / len(train_df), 2)
missing_df

Unnamed: 0,Number of Missing Values,% of Missing Values
PassengerId,0,0.0
HomePlanet,201,2.31
CryoSleep,217,2.5
Cabin,199,2.29
Destination,182,2.09
Age,179,2.06
VIP,203,2.34
RoomService,181,2.08
FoodCourt,183,2.11
ShoppingMall,208,2.39


## Observations: % of Missing Values analysis
We can observe that the percentage of missing values is very small in both training & testing data. So instead of dropping the rows with missing values, we will fill/replace those missing values with the most suitable values according to the data.
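
As a rough illustration of filling rather than dropping, the sketch below imputes numeric columns with their median and non-numeric columns with their most frequent value. This particular strategy is an assumption made for the example, not necessarily the one chosen later in the project, and `toy_df` is a hypothetical stand-in for the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value per column.
toy_df = pd.DataFrame({
    "HomePlanet": ["Earth", "Europa", np.nan, "Earth"],
    "Age": [39.0, np.nan, 24.0, 58.0],
    "RoomService": [0.0, 109.0, np.nan, 43.0],
})

# Choose one fill value per column: median for numeric, mode otherwise.
fill_values = {}
for col in toy_df.columns:
    if pd.api.types.is_numeric_dtype(toy_df[col]):
        fill_values[col] = toy_df[col].median()
    else:
        fill_values[col] = toy_df[col].mode().iloc[0]

filled_df = toy_df.fillna(fill_values)
print(filled_df.isna().sum())
```

When applying this to real data, the fill values would normally be computed from the training dataframe only and then reused for the testing dataframe, so that no information from the test set leaks into preprocessing.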

In [164]:
# Checking Cardinality of both Dataframes
print("Cardinality of each column in Training Dataframe is: ")
print(train_df.select_dtypes(include="object").nunique())
print("\nCardinality of each column in Testing Dataframe is: ")
print(test_df.select_dtypes(include="object").nunique())

Cardinality of each column in Training Dataframe is: 
PassengerId    8693
HomePlanet        3
CryoSleep         2
Cabin          6560
Destination       3
VIP               2
Name           8473
dtype: int64

Cardinality of each column in Testing Dataframe is: 
PassengerId    4277
HomePlanet        3
CryoSleep         2
Cabin          3265
Destination       3
VIP               2
Name           4176
dtype: int64


## Observations: Cardinalities
We can observe that the PassengerId, Cabin & Name features have high cardinality in both datasets. Features with high cardinality are normally dropped, but in this project we will do Feature Engineering and create new features from these features.
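
To give a sense of what such feature engineering could look like, the sketch below splits the two identifiers on their separators. It assumes, based only on the values visible in the previews above, that PassengerId follows a `group_number` pattern (for example `0003_01`) and that Cabin follows a three-part pattern separated by `/` (for example `A/0/S`). The new column names `Group`, `GroupSize`, `Deck` and `Side` are hypothetical, as is the toy frame.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame mirroring the formats seen in the previews.
toy_df = pd.DataFrame({
    "PassengerId": ["0003_01", "0003_02", "0004_01"],
    "Cabin": ["A/0/S", "A/0/S", np.nan],
})

# PassengerId: keep the part before "_" and count passengers sharing it.
id_parts = toy_df["PassengerId"].str.split("_", expand=True)
toy_df["Group"] = id_parts[0]
toy_df["GroupSize"] = toy_df.groupby("Group")["PassengerId"].transform("count")

# Cabin: keep the first and last parts; missing cabins stay missing.
cabin_parts = toy_df["Cabin"].str.split("/", expand=True)
toy_df["Deck"] = cabin_parts[0]
toy_df["Side"] = cabin_parts[2]

print(toy_df)
```

The derived columns have far fewer distinct values than the originals, which is what makes them easier to use as model inputs.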