# **1- About Dataset**

Survived = Whether the passenger survived (0 = No, 1 = Yes)

Ticket = Ticket number

Sex = Male or Female

Embarked = Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

Fare = Passenger fare

Cabin = Cabin number

pclass: Ticket class

1st = Upper

2nd = Middle

3rd = Lower

Age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

sibsp: Number of siblings / spouses aboard. The dataset defines family relations in this way...

Sibling = brother, sister, stepbrother, stepsister

Spouse = husband, wife (mistresses and fiancés were ignored)

parch: Number of parents / children aboard. The dataset defines family relations in this way...

Parent = mother, father

Child = daughter, son, stepdaughter, stepson

Some children travelled only with a nanny, therefore parch=0 for them.

# **2- Importing Libraries**

In [2]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder , StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import RandomOverSampler

# **3- Loading Data**

In [3]:
train_df = pd.read_csv('/kaggle/input/titanic/train.csv')
test_df = pd.read_csv('/kaggle/input/titanic/test.csv')
submission = pd.read_csv('/kaggle/input/titanic/gender_submission.csv')

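# Note: gender_submission.csv is a sample submission (baseline predictions), not ground-truth labels.
# Only its 'Survived' column is attached, so 'PassengerId' is not duplicated in test_df.
test_df = pd.concat([test_df , submission['Survived']] , axis = 1)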

train_df = train_df.drop(['PassengerId'] , axis = 1)
test_df = test_df.drop(['PassengerId'] , axis = 1)


In [None]:
train_df.head()


In [None]:
test_df.head()

# **4- Data Information**

In [None]:
train_df.info()

In [None]:
test_df.info()

### We have both missing values and categorical data and we need to handle them

In [None]:
train_df.isnull().sum()

In [None]:
test_df.isnull().sum()


### 'Cabin' is missing for most passengers (more than 70% of the rows), so we drop it during missing-value handling
### We can use imputers (simple imputer / iterative imputer) to fill the NaN values in 'Age' and 'Embarked'
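
As a quick check on the "about 70%" figure, the share of missing values per column can be computed directly. The sketch below uses a small made-up frame (`toy_df` is a hypothetical stand-in for `train_df`, and its values are invented for illustration), so the percentages it prints are not the real Titanic numbers.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_df (values are made up for illustration)
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", None, "S"],
})

# isnull().mean() is the fraction of missing entries in each column
missing_pct = (toy_df.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```

Running the same two lines on `train_df` shows which columns are candidates for dropping and which for imputation.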

# **5- Data Visualization**

In [None]:
visual_data = train_df.drop(['Ticket' ,'Name'  , 'Cabin'] , axis = 1)

num_cols = len(visual_data.columns)
num_rows = (num_cols + 2) // 3

fig, axes = plt.subplots(num_rows, 3, figsize=(12, 3 * num_rows))
axes = axes.flatten()

for i, column in enumerate(visual_data.columns):
    ax = axes[i]
    sns.histplot(visual_data[column], ax=ax, color = "maroon")
    ax.set_title(column)

for i in range(num_cols, num_rows * 3):
    fig.delaxes(axes[i])

plt.tight_layout()
plt.show()

#### Most of the passengers were male, between 15 and 35 years old, embarked at Southampton, and ...


# **6- Data Preprocessing**

# 6-1 Cleaning Up and Removing Useless Features

In [None]:
def preprocess(df):
    df = df.copy()  # work on a copy so the caller's DataFrame is not modified

    def normalize_name(x):
        # strip punctuation from each word of the name (unused here because 'Name' is dropped below)
        return " ".join([v.strip(",()[].\"'") for v in x.split(" ")])

    def ticket_number(x):
        # last space-separated token of the ticket string
        return x.split(" ")[-1]

    def ticket_item(x):
        # everything before the last token, or "NONE" if the ticket has no prefix
        items = x.split(" ")
        if len(items) == 1:
            return "NONE"
        return "_".join(items[0:-1])

    # operate on the DataFrame passed in, not on the global train_df / test_df
    df["Ticket_number"] = df["Ticket"].apply(ticket_number)
    df["Ticket_item"] = df["Ticket"].apply(ticket_item)

    return df
    
train_df = preprocess(train_df)
test_df = preprocess(test_df)

train_df = train_df.drop(['Name' ,'Ticket'] , axis = 1) # Ticket replaced with Ticket_number and Ticket item
test_df = test_df.drop(['Name' ,'Ticket'] , axis = 1) # Ticket replaced with Ticket_number and Ticket item

train_df.head(5)


# Copied from https://www.kaggle.com/code/gusthema/titanic-competition-w-tensorflow-decision-forests?scriptVersionId=130042040&cellId=7

In [None]:
test_df.head()

### **Note :**
The order in which you handle categorical values and missing values in your data preprocessing pipeline can depend on the specific characteristics of your dataset and the machine learning algorithm you plan to use. However, there are some general guidelines that can help you decide the order:

**Handle Missing Values First:** In most cases, it's a good practice to address missing values before encoding categorical variables. Missing data can have a significant impact on the performance of your machine learning models, and addressing them early can help ensure that your models are not biased or inaccurate due to missing information. You can use techniques like imputation (e.g., filling missing values with the mean, median, or mode) or advanced methods like predictive imputation or interpolation.

**Encode Categorical Variables:** Once you've dealt with missing values, you can proceed with encoding categorical variables. This step involves converting categorical data into a numerical format that machine learning algorithms can work with. Common encoding techniques include one-hot encoding, label encoding, or ordinal encoding, depending on the nature of your categorical data.

**Consider Simultaneous Handling:** In some cases, you might need to handle missing values and encode categorical variables simultaneously. For example, you might want to impute missing values with a value specific to the category or use different imputation techniques for different categorical columns. This requires careful consideration of the relationships between missing values and categorical variables in your dataset.
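
The "impute first, then encode" order described in this note can be sketched with scikit-learn's `Pipeline` and `ColumnTransformer`. This is only an illustration under assumptions: `toy_df` and the step names are hypothetical, and one-hot encoding is used here for the example, whereas this notebook itself uses label encoding.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data with one numeric and one categorical column, each with a gap
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": pd.Series(["S", "C", np.nan, "S"], dtype=object),
})

# Numeric columns: fill gaps with the column mean
numeric_steps = Pipeline([("impute", SimpleImputer(strategy="mean"))])

# Categorical columns: fill gaps with the most frequent value, then encode
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer(
    [("num", numeric_steps, ["Age"]), ("cat", categorical_steps, ["Embarked"])],
    sparse_threshold=0,  # always return a dense array
)

features = preprocessor.fit_transform(toy_df)
print(features.shape)
```

Because each column group has its own imputer and encoder, the "simultaneous handling" case is covered as well: different columns can get different strategies inside one object that is fitted on the training data only.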


# 6-2 Missing values

In [None]:
null_count_columns = train_df.isnull().sum()

null_counts_df = pd.DataFrame(list(null_count_columns.items()), columns=["Column", "NullCount"])

null_counts_df.sort_values(by="NullCount", inplace=True)

plt.figure(figsize=(10, 6))
plt.barh(null_counts_df["Column"], null_counts_df["NullCount"], color="maroon")
plt.xlabel("Null Counts")
plt.ylabel("Columns")
plt.title("Null Counts in Dataset Columns")
plt.grid(axis="x", linestyle="--", alpha=0.7)
plt.tight_layout()

# Show the plot
plt.show()

In [None]:
null_count_columns = test_df.isnull().sum()

null_counts_df = pd.DataFrame(list(null_count_columns.items()), columns=["Column", "NullCount"])

null_counts_df.sort_values(by="NullCount", inplace=True)

plt.figure(figsize=(10, 6))
plt.barh(null_counts_df["Column"], null_counts_df["NullCount"], color="maroon")
plt.xlabel("Null Counts")
plt.ylabel("Columns")
plt.title("Null Counts in Dataset Columns")
plt.grid(axis="x", linestyle="--", alpha=0.7)
plt.tight_layout()

# Show the plot
plt.show()

In [None]:
# First drop the 'Cabin' column, because more than 70% of its values are missing
train_df = train_df.drop(['Cabin'] , axis = 1)
test_df = test_df.drop(['Cabin'] , axis = 1)

# Use simple imputers to fill the NaN values in Age, Fare and Embarked.
# Each imputer is fitted on the training set only and then applied to the test set,
# so that no test-set statistics leak into the preprocessing.
imp = SimpleImputer(strategy = 'mean')
train_df[['Age', 'Fare']] = imp.fit_transform(train_df[['Age', 'Fare']])
test_df[['Age', 'Fare']] = imp.transform(test_df[['Age', 'Fare']])

imp = SimpleImputer(strategy = 'most_frequent')
train_df[['Embarked']] = imp.fit_transform(train_df[['Embarked']])
test_df[['Embarked']] = imp.transform(test_df[['Embarked']])

train_df.head(7)

In [None]:
test_df.head(7)

In [None]:
train_df.isnull().sum()

In [None]:
test_df.isnull().sum()

### Hooray! There are no missing values left

# 6-3 Categorical data

When dealing with structured, tabular data (which is what we usually work with), the first question to ask is whether the values are numeric or categorical. Types of data:

Qualitative Data

Quantitative Data

Qualitative Data: “Data associated with quality, in different categories”. Each measurement falls into one of several categories (hair color, ethnic group and other attributes of a population).

(a). Nominal Data: “With no inherent order or ranking” ~ Data whose categories have no inherent order or ranking, such as gender or race.

(b). Ordinal Data: “With an ordered series” ~ Data whose categories have an inherent order.

Quantitative Data: “Data associated with quantity, which can be measured” ~ Data measured on a numeric scale (distance travelled to college, the number of children in a family, etc.)

(a). Discrete Data: “Based on counts; a finite number of values is possible and values cannot be subdivided” ~ Data based on counts, where only a finite number of values is possible and the values cannot be subdivided meaningfully.

(b). Continuous Data: “Measured on a continuum or a scale; values can be subdivided into finer increments” ~ Data that can take almost any numeric value and can be subdivided into finer and finer increments.

Usually there are 2 kinds of categorical data:

● Ordinal Data: The categories have an inherent order, for example socio-economic status (“low income”, “middle income”, “high income”), education level (“high school”, “BS”, “MS”, “PhD”), income level (“less than 50K”, “50K-100K”, “over 100K”), satisfaction rating (“extremely dislike”, “dislike”, “neutral”, “like”, “extremely like”)

● Nominal Data: The categories do not have an inherent order, for example blood type, zip code, gender, race, ethnicity. Binary data can be either nominal or ordinal.

Generally, when encoding ordinal data, one should retain the information about the order of the categories. When encoding nominal data, we only have to represent the presence or absence of each category; no notion of order is present.

So the choice of encoding method depends on the algorithm(s) we apply:

● Some algorithms can work with categorical data directly. For example, a decision tree can be learned directly from categorical data with no data transform required (this depends on the specific implementation). However, some implementations of machine learning algorithms require all data to be numerical. For example, most scikit-learn estimators have this requirement.

● If we divide algorithms into linear and tree-based models, we should keep in mind that linear models are generally sensitive to the numeric codes assigned to the categories, so we should select an appropriate encoding method.
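
To make the ordinal/nominal distinction concrete, here is a minimal sketch using scikit-learn's `OrdinalEncoder` and `OneHotEncoder`. The frame `toy_df` and its columns are hypothetical and reuse the examples from the text above; they are not part of the Titanic data.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical columns taken from the examples in the text
toy_df = pd.DataFrame({
    "education": ["high school", "PhD", "BS", "MS"],
    "blood_type": ["A", "O", "B", "A"],
})

# Ordinal feature: pass the category order explicitly so the codes preserve it
ordinal_encoder = OrdinalEncoder(categories=[["high school", "BS", "MS", "PhD"]])
education_codes = ordinal_encoder.fit_transform(toy_df[["education"]])

# Nominal feature: one indicator column per category, with no implied order
onehot_encoder = OneHotEncoder()
blood_type_indicators = onehot_encoder.fit_transform(toy_df[["blood_type"]]).toarray()

print(education_codes.ravel())
print(blood_type_indicators)
```

Label encoding, as used in the next cell, assigns arbitrary integer codes to nominal categories. Tree-based models usually tolerate that, while for linear models one-hot indicators are generally the safer choice.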

In [None]:
# Encode the nominal categorical features as integers before modelling
nominals = ['Sex' , 'Embarked' ,"Ticket_item"]
ordinals = ['Pclass' , "Parch" , "SibSp"] # values are numeric and meaningful so there is no need to encode

# We use LabelEncoder for the nominal values. Each encoder is fitted on the train and test
# values together, so that a category gets the same code in both sets
# (fitting separately on each set can give inconsistent codes, e.g. for Ticket_item).
for col in nominals:
    le = LabelEncoder()
    le.fit(pd.concat([train_df[col] , test_df[col]]))
    train_df[col] = le.transform(train_df[col])
    test_df[col] = le.transform(test_df[col])

# The Ticket_number column has object dtype, so we convert it to a numeric type.
# errors='coerce' turns any non-numeric entry into NaN.
train_df['Ticket_number'] = pd.to_numeric(train_df['Ticket_number'], errors='coerce')
test_df['Ticket_number'] = pd.to_numeric(test_df['Ticket_number'], errors='coerce')

train_df.head(7)

In [None]:
train_df.info()

In [None]:
test_df.info()

### There are no categorical values left, so now we can move on to outliers

# 6-4 Outliers

In [None]:
train_df.describe()

In [None]:
test_df.describe()

#### I want to remove outliers with the z-score method, using [-3, 3] as the threshold range.

A common threshold is to consider data points with z-scores outside of the range [-3, 3] as outliers. This means that any data point with a z-score less than -3 or greater than 3 would be considered an outlier. Here's a brief explanation of why this range is often used:

**Standard Deviation Interpretation:** In a standard normal distribution (mean = 0, standard deviation = 1), about 99.7% of the data falls within 3 standard deviations of the mean. Therefore, by setting the threshold at 3 standard deviations (z-scores), you are effectively capturing the majority of the data points and treating those outside this range as outliers.

**Adjustable Sensitivity:** You can adjust the threshold based on your specific requirements. If you want to be stricter and flag more points as potential outliers, you could use a smaller range, like [-2, 2]. Conversely, if you want to be more permissive and flag fewer points, you could use a larger range, like [-4, 4].

**Data Distribution:** The choice of threshold may also depend on the distribution of your data. For data that doesn't follow a normal distribution, a z-score threshold may not be the best approach. In such cases, you might consider using other methods, such as the Interquartile Range (IQR) or domain-specific knowledge.

Remember that the choice of threshold is somewhat arbitrary, and it's important to consider the context of your analysis, the characteristics of your data, and your specific goals when deciding on an appropriate threshold for identifying outliers. Additionally, it's often a good practice to visualize your data before and after removing outliers to assess the impact of your choices.
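
The effect of the threshold, and the IQR alternative mentioned above, can be sketched on synthetic data. Everything here is an assumption made for illustration: the values are randomly generated with one planted extreme point, and `zscore_outliers` and `iqr_outliers` are hypothetical helper names, not functions used elsewhere in this notebook.

```python
import numpy as np

# Synthetic data: 200 roughly normal values plus one planted extreme value
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(loc=30.0, scale=5.0, size=200), [120.0]])

def zscore_outliers(x, threshold):
    """Boolean mask of points whose absolute z-score exceeds the threshold."""
    z = np.abs((x - x.mean()) / x.std())
    return z > threshold

def iqr_outliers(x, k=1.5):
    """Boolean mask of points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

# A smaller threshold can only flag the same points or more, never fewer
counts = {t: int(zscore_outliers(values, t).sum()) for t in (2, 3, 4)}
print(counts)
print("IQR:", int(iqr_outliers(values).sum()))
```

Note that `np.std` uses the population standard deviation by default, while `DataFrame.std()` in pandas uses the sample version, so z-scores computed the two ways differ slightly.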

In [None]:
threshold = 3
# identify outliers using z-score method

# Calculate the z-scores for each data point
z_scores = np.abs((train_df - train_df.mean()) / train_df.std())

# Boolean mask: True where a value lies more than `threshold` standard deviations from its column mean
outliers = z_scores > threshold

# 'outliers' is only a mask; nothing has been removed from train_df yet
outliers.head(5)

In [None]:
threshold = 3
# identify outliers using z-score method

# Calculate the z-scores for each data point
z_scores = np.abs((test_df - test_df.mean()) / test_df.std())

# Boolean mask: True where a value lies more than `threshold` standard deviations from its column mean
outliers = z_scores > threshold

# 'outliers' is only a mask; nothing has been removed from test_df yet
outliers.head(5)

In [None]:
# removing outliers using z-score method

# Calculate the z-scores for each train point
z_scores = np.abs((train_df - train_df.mean()) / train_df.std())

# Indexing with a boolean DataFrame keeps the shape and sets every outlying value to NaN;
# the affected rows are removed afterwards with dropna()
train_df = train_df[(z_scores <= threshold)]


# Calculate the z-scores for each test point
z_scores = np.abs((test_df - test_df.mean()) / test_df.std())

# Same masking for the test set: outlying values become NaN, rows are removed afterwards with dropna()
test_df = test_df[(z_scores <= threshold)]


train_df.head(5)

In [None]:
test_df.head()

In [None]:
train_df.isnull().sum()

In [None]:
test_df.isnull().sum()

### Oh! Missing values again
### These NaN values come from masking the outliers, so we need to handle them again
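
Why masking produces NaN values can be seen on a tiny example. Indexing a DataFrame with a boolean DataFrame keeps the original shape and blanks out the `False` cells, whereas indexing with a boolean Series drops whole rows. The frame `toy_df` and the mask `keep` below are made up for illustration.

```python
import pandas as pd

toy_df = pd.DataFrame({"Age": [22.0, 30.0, 80.0], "Fare": [7.25, 512.0, 8.05]})

# Hypothetical "not an outlier" mask, one flag per cell
keep = pd.DataFrame({"Age": [True, True, False], "Fare": [True, False, True]})

# Boolean DataFrame: same shape, cells where keep is False become NaN
masked = toy_df[keep]

# Boolean Series: rows containing any False are dropped directly
row_filtered = toy_df[keep.all(axis=1)]

print(masked)
print(row_filtered)
```

Masking followed by `dropna()` and the direct row filter end up with the same rows, so either approach can be used.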

In [None]:
# Drop the rows that contain NaN values, because there are not many of them
train_df = train_df.dropna()
test_df = test_df.dropna()
train_df.info()

In [None]:
test_df.info()

# 6-5 Splitting into Test and Validation Sets

In [None]:
# The train set is used as-is for training, but we also need a validation set, so we split the test set into test and validation sets
X_train = train_df.drop('Survived' , axis =1)
y_train = train_df['Survived']

test_df_X = test_df.drop('Survived' , axis =1)
test_df_y = test_df['Survived']

X_test ,X_val ,y_test ,y_val = train_test_split( test_df_X , test_df_y , test_size = 0.5 ,random_state =42)

print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape,"\n")

print("X_test shape:", X_test.shape)
print("y_test shape:", y_test.shape,"\n")

print("X_val shape:", X_val.shape)
print("y_val shape:", y_val.shape)

# 6-6 Data transformation

#### **Standardize features by removing the mean and scaling to unit variance.**

#### The standard score of a sample x is calculated as:

#### z = (x - u) / s

#### where u is the mean of the training samples or zero if with_mean=False, and s is the standard deviation of the training samples or one if with_std=False.
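
The formula can be checked by hand against `StandardScaler`. This is a small sketch on made-up numbers (`train_values` and `new_values` are hypothetical). It also illustrates that `u` and `s` come from the training samples only and are then reused for any other data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train_values = np.array([[1.0], [2.0], [3.0], [4.0]])
new_values = np.array([[5.0]])  # e.g. a validation or test sample

# Fit on the training values only
scaler = StandardScaler().fit(train_values)

# Manual computation of z = (x - u) / s
u = train_values.mean(axis=0)
s = train_values.std(axis=0)  # population std (ddof=0), which StandardScaler uses
manual = (new_values - u) / s

print(manual, scaler.transform(new_values))
```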

In [None]:
scaler = StandardScaler()
X_train = pd.DataFrame(scaler.fit_transform(X_train) , columns = X_train.columns)
X_val = pd.DataFrame(scaler.transform(X_val) , columns = X_val.columns)
X_test = pd.DataFrame(scaler.transform(X_test) , columns = X_test.columns)

In [None]:
X_train.head(7)

# 6-7 Correlation

In [None]:
corrmat = X_train.corr()
f, ax = plt.subplots(figsize=(10, 10))
sns.heatmap(corrmat, vmax= 1, square=True);

# 6-8 Feature Selection

In [None]:
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

feature_importances = model.feature_importances_

importances_df = pd.DataFrame({'Feature': X_train.columns, 'Importance': feature_importances})
importances_df = importances_df.sort_values(by='Importance', ascending=False)

plt.figure(figsize=(10, 6))
plt.barh(importances_df['Feature'], importances_df['Importance'] , color = 'Maroon')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.title('Feature Importance')
plt.show()

### All features are important and there is no need to drop any of them

# 6-9 Imbalanced Data

In [None]:
y_train.value_counts()

In [None]:
f,ax=plt.subplots(1,1,figsize=(12,8))
y_train.value_counts().plot.pie(explode=[0,0.01],autopct='%1.1f%%',shadow=False , colors = ['#850428','#328f60'])
plt.show()

In [None]:
oversampler = RandomOverSampler(random_state=42)
X_train, y_train = oversampler.fit_resample(X_train, y_train)
y_train.value_counts()

In [None]:
f,ax=plt.subplots(1,1,figsize=(12,8))
y_train.value_counts().plot.pie(explode=[0,0.01],autopct='%1.1f%%',shadow=False , colors = ['#850428','#328f60'])
plt.show()

## Finally, the preprocessing of our dataset is done. Now it's time to model our data with a neural network