# Assignment 2: CRISP-DM Model with Titanic Dataset (max. 10 points)

The goal of this assignment is to apply the **CRISP-DM** process model to the given dataset of Titanic passengers.

* First, you load the **Titanic** dataset, extract more information from it, and visualize the results.
* Then you implement two basic machine learning classification models.

You can find the Titanic dataset in the *data/* subdirectory:
[Titanic dataset](data/titanic.csv).
Use this dataset in Assignment 3.

### Add your information

In [1]:
# TODO: Replace with your name or names
student_name = 'student name/names'
student_email = 'student email/emails'

## 2.1: Business Understanding (max. 1 point)

In this phase, you should define your project goals and success criteria.

**Question**: What could these be for this assignment?

* **TODO**: Define project goals and success criteria.

## 2.2: Data Understanding (max. 3 points)

In the "Data Understanding" phase of this CRISP-DM project, you aim to get a grasp of your dataset and its characteristics.

What can you do in this phase?

1. Structure of the dataset. Understand the dataset's structure, such as the number of features (columns), the data type of each feature, and the size of the dataset.
2. Summary statistics. Calculate summary statistics for numerical features, such as the mean and median.
3. Missing values. Identify the missing values in the dataset.
4. Data visualization. Create data visualizations to gain insights into the data.
5. Outlier detection. Identify and examine potential outliers. You can use chart types like box plots or scatter plots to visualize outliers.

Complete these five tasks using the Titanic dataset. Remember to add Markdown cells to your Jupyter Notebook to explain your steps and findings.

Also use different styles in the Markdown text: lists, figures, highlights, bold, italic, links, even direct quotes, etc.
This makes the Jupyter Notebook easier to read and highlights the key points.

### Structure of Dataset

Load the Titanic dataset, then show a few rows of data and the column types.
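
As a starting hint, the loading and inspection step could look roughly like the sketch below. To stay self-contained it parses three invented rows from a string instead of reading the real file; the column names are an assumption based on the common Kaggle layout of this dataset, so check them against your own `data/titanic.csv`.

```python
import io

import pandas as pd

# Invented toy rows standing in for data/titanic.csv (column names are an assumption).
csv_text = """PassengerId,Survived,Pclass,Sex,Age,Fare,Cabin
1,0,3,male,22,7.25,
2,1,1,female,38,71.28,C85
3,1,3,female,,7.92,
"""

# With the real file this would be: df = pd.read_csv('data/titanic.csv')
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())   # first rows of the data
print(df.shape)    # (number of rows, number of columns)
print(df.dtypes)   # data type of each column
df.info()          # non-null counts and memory usage per column
```

`head()` shows what the values look like, while `dtypes` and `info()` show how pandas interpreted each column.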

In [2]:
# TODO: Structure of dataset
#file_name = 'data/titanic.csv'

### Summary Statistics
Show a basic statistical summary of the dataset.
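
A minimal sketch of this step, using a small invented DataFrame rather than the real data (the column names `Age`, `Fare` and `Sex` are assumptions):

```python
import pandas as pd

# Invented toy values, not taken from the real dataset.
df = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
    "Sex": ["male", "female", "female", "female"],
})

summary = df.describe()                # count, mean, std, min, quartiles, max (numeric columns)
medians = df.median(numeric_only=True)
sex_counts = df["Sex"].value_counts()  # frequency table for a categorical column

print(summary)
print(medians)
print(sex_counts)
```

`describe()` only covers numeric columns by default, so a frequency table such as `value_counts()` is one way to summarize the categorical ones.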

In [3]:
# TODO: Summary statistics

### Missing Values
Calculate missing values in the dataset.

Count NaN values in each column.
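
One possible way to count them is sketched below on invented toy data (the column names are assumptions):

```python
import numpy as np
import pandas as pd

# Invented toy values with some gaps.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

missing_counts = df.isna().sum()         # number of NaN values per column
missing_share = df.isna().mean() * 100   # percentage of NaN values per column

print(missing_counts)
print(missing_share.round(1))
```

The percentage view is often more useful than the raw count when deciding whether a column is worth keeping.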

In [4]:
# TODO: Missing values

### Visualization

Visualize survivors, passenger ages, survival rates, etc. Then try to find other ways to visualize the data.

Ideas for what to visualize:
* the count of survivors.
* the distribution of passenger ages.
* the survival rate by passenger class.
* the survival rate by gender.
* the survival rate by passenger class and gender.
* more visualization ideas from you?
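
Most of the ideas above reduce to a group-wise aggregation that is then plotted. The sketch below computes the aggregates on invented toy rows (column names `Survived`, `Pclass` and `Sex` are assumptions); the plotting calls are left as comments because they need matplotlib.

```python
import pandas as pd

# Invented toy rows, not the real dataset.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass":   [3, 1, 3, 3, 1, 2],
    "Sex":      ["male", "female", "female", "male", "female", "male"],
})

survivor_counts = df["Survived"].value_counts()
rate_by_class = df.groupby("Pclass")["Survived"].mean()
rate_by_class_sex = df.groupby(["Pclass", "Sex"])["Survived"].mean().unstack()

print(rate_by_class)
print(rate_by_class_sex)

# With matplotlib installed, each result can be plotted directly, e.g.:
# rate_by_class.plot(kind="bar", ylabel="Survival rate")
# rate_by_class_sex.plot(kind="bar")
```

Because `Survived` is coded as 0/1 here, its mean within a group equals the survival rate of that group.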

In [5]:
# TODO: Visualization

### Outlier Detection
**Outlier detection** is an important step in assessing the quality of the data.

How to detect outliers?

* You can detect outliers using **visual inspection** of the dataset by creating scatter plots, histograms, box plots etc.
* You can use statistical methods like **Z-Score** or **IQR**.

**Z-Score**: Calculate the z-score of each data point, that is, its distance from the mean in units of standard deviation. If the z-score is far from zero (for example > 2 or < -2), the point may be an outlier.

**IQR**: Outliers are the data points that fall below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR.
Q1 is the 25th percentile and Q3 is the 75th percentile of the data, and IQR is the _interquartile range_, calculated as _Q3 - Q1_.
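
Both rules can be sketched in a few lines. The fare values below are invented for illustration, and the threshold of 2 follows the text above; other thresholds are possible.

```python
import pandas as pd

# Invented fare values with one obviously extreme point.
fares = pd.Series([7.0, 8.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 200.0])

# Z-score rule: distance from the mean in units of standard deviation.
z_scores = (fares - fares.mean()) / fares.std(ddof=0)
z_outliers = fares[z_scores.abs() > 2]

# IQR rule: points outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].
q1, q3 = fares.quantile(0.25), fares.quantile(0.75)
iqr = q3 - q1
iqr_outliers = fares[(fares < q1 - 1.5 * iqr) | (fares > q3 + 1.5 * iqr)]

print(z_outliers.tolist())
print(iqr_outliers.tolist())
```

The two rules do not always agree on real data: the z-score depends on the mean and standard deviation, which the outliers themselves inflate, whereas the IQR rule relies on quartiles and is less affected.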

In [6]:
# TODO: Outlier detection

## 2.3: Data Preprocessing (max. 3 points)

Data preprocessing is an important step to ensure that your dataset is ready for the machine learning phase.
Here are the most important tasks of the data preprocessing phase with the Titanic dataset.

1. Handling missing values. Decide how to handle missing values.
2. Feature engineering. Create new features or transform existing ones. With this dataset, you can extract titles from passenger names, create a family size feature, flag whether a passenger has a cabin, or categorize age and fare into groups.
3. Visualization. Visualize the new features. Also recalculate the statistics after the data is preprocessed.
4. Categorical variable encoding. Encode categorical variables for a machine learning model. Select a method from one-hot encoding, label encoding, or ordinal encoding.
5. Feature scaling. Scale features such as age or fare. Select standardization or min-max scaling as the scaling method.

### Handling Missing Values
Most ML models cannot handle NaN (or Inf) values, so deal with them before training.
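
One common approach, sketched here on invented toy data, is to turn infinite values into NaN and then impute numeric columns with the median and categorical columns with the most frequent value. This is only one option; dropping rows or columns can be just as valid. The column names are assumptions.

```python
import numpy as np
import pandas as pd

# Invented toy values with NaN, None and Inf entries.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 40.0],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, np.inf, 7.92, 53.10],
})

# Treat infinite values as missing.
df["Fare"] = df["Fare"].replace([np.inf, -np.inf], np.nan)

# Numeric columns: fill with the median of the observed values.
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Fare"] = df["Fare"].fillna(df["Fare"].median())

# Categorical column: fill with the most frequent value (mode).
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

print(df)
```

The median is used instead of the mean because it is less sensitive to extreme values such as very high fares.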

In [7]:
# TODO: Handling missing values

### Feature Engineering

You can create new features or transform existing ones.

Some ideas for new features/columns to create:
* Create a new feature `HasCabin` (has a cabin or not).
* Create a new feature `HasEmbarked` (has an embarkation port or not).
* Extract `ticketNumber` and `ticketPrefix` features from the tickets.
* Create a new feature `FamilySize`.
* Create a new feature `Deck` from `Cabin` column.
* Extract a new feature `title` from the `Name`.
* More ideas from you.

Extra Question: Are there any passengers who share the same ticket number?
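
The sketch below shows how several of these features might be derived with pandas. The rows are invented but follow the usual formatting of this dataset; the source columns `SibSp`, `Parch` and `Ticket`, the "family size = siblings/spouses + parents/children + 1" definition, and the `"NONE"` placeholder for tickets without a prefix are all assumptions made for illustration.

```python
import pandas as pd

# Invented toy rows in the usual Titanic formatting.
df = pd.DataFrame({
    "Name": ["Smith, Mr. John", "Jones, Mrs. Anna", "Brown, Miss. Ella", "Jones, Master. Tom"],
    "SibSp": [1, 0, 0, 0],
    "Parch": [0, 1, 0, 1],
    "Cabin": [None, "C85", None, "C85"],
    "Ticket": ["A/5 21171", "PC 17599", "3101282", "PC 17599"],
})

df["FamilySize"] = df["SibSp"] + df["Parch"] + 1   # the passenger counts as 1
df["HasCabin"] = df["Cabin"].notna().astype(int)   # 1 if a cabin is recorded
df["Deck"] = df["Cabin"].str[0]                    # first letter of the cabin, NaN if missing

# The title is the text between the comma and the first period in the name.
df["title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Split the ticket at the last space: optional prefix + number.
ticket_parts = df["Ticket"].str.rsplit(" ", n=1)
df["ticketNumber"] = ticket_parts.str[-1]
df["ticketPrefix"] = ticket_parts.apply(lambda p: p[0] if len(p) > 1 else "NONE")

# Extra question: rows whose ticket appears more than once.
shared_tickets = df[df.duplicated("Ticket", keep=False)]
print(df)
print(shared_tickets[["Name", "Ticket"]])
```

`duplicated(..., keep=False)` marks every row of a repeated ticket, not only the second and later occurrences, which is what the extra question asks for.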

In [8]:
# TODO: Feature engineering

### Visualize New Features

Visualize selected new features.

Some ideas for what to visualize:
* Visualize based on ticket prefixes.
* Visualize based on ticket numbers.
* Show the survivors based on `Pclass` and embarkation port.
* Other new features can also be used for visualization.

In [9]:
# TODO: Visualization

### Categorical Variable Encoding

Why categorical variable encoding?

* Since most machine learning models accept only numerical variables, you need to encode categorical variables.
* Select your encoding method from these: one-hot encoding, label encoding, or ordinal encoding.
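
A minimal sketch of two of these methods with pandas, on invented toy data (the column names and the 0/1 mapping are assumptions):

```python
import pandas as pd

# Invented toy rows.
df = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
    "Pclass": [3, 1, 2],   # already numeric and ordered, so usable as an ordinal feature
})

# Label encoding of a two-valued column with an explicit mapping.
df["Sex_label"] = df["Sex"].map({"male": 0, "female": 1})

# One-hot encoding: one 0/1 column per category, the original column is dropped.
encoded = pd.get_dummies(df, columns=["Embarked"], prefix="Embarked", dtype=int)

print(encoded)
```

One-hot encoding avoids implying an order between categories such as ports, at the cost of adding one column per category.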

In [10]:
# TODO: Categorical variable encoding

### Feature Scaling

Why perform feature scaling?

* Most machine learning methods benefit from scaled features.
* Scale all numerical features for a machine learning model.
* Features in your dataset might have different scales, which can vary widely.

There are several common methods for feature scaling:

1. **Min-Max Scaling** scales features to a specific range (commonly 0 to 1 or -1 to 1).
2. **Standardization** (Z-Score) scales features to have a mean of 0 and a standard deviation of 1.
3. **Robust Scaling** uses the median and interquartile range to scale features.
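
The three methods can be written out by hand to see what they compute. The ages below are invented; scikit-learn provides `MinMaxScaler`, `StandardScaler` and `RobustScaler` for the same purpose.

```python
import pandas as pd

# Invented ages.
age = pd.Series([20.0, 30.0, 40.0, 50.0, 60.0])

# 1. Min-max scaling to the range [0, 1].
min_max = (age - age.min()) / (age.max() - age.min())

# 2. Standardization: mean 0, standard deviation 1.
standardized = (age - age.mean()) / age.std(ddof=0)

# 3. Robust scaling: center on the median, divide by the interquartile range.
q1, q3 = age.quantile(0.25), age.quantile(0.75)
robust = (age - age.median()) / (q3 - q1)

print(min_max.tolist())   # [0.0, 0.25, 0.5, 0.75, 1.0]
print(robust.tolist())    # [-1.0, -0.5, 0.0, 0.5, 1.0]
```

When a train/test split is used, the scaling statistics are normally computed from the training set only and then applied to the test set, so that no information leaks from the test data.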

In [11]:
# TODO: Feature scaling

## 2.4: Modelling (max. 3 points)

The target is to perform **Titanic Survival Prediction with Machine Learning**.
You should build and compare machine learning models for predicting passenger survival on the Titanic.

1) Data splitting

    * Split the dataset into training and testing sets (e.g., _80% training_ and _20% testing_) to evaluate model performance.

2) Modeling

    * Implement two different classifiers: **k-Nearest Neighbors** (kNN), and **Random Forest** (RF).
    * Train each model on the training data.

3) Model evaluation

    * Evaluate the performance of each model on the testing data using evaluation metrics such as accuracy, precision and F1-score.

4) Comparison

    * Compare the results of the classifiers to determine which one performs the best in terms of survival prediction.

5) Conclusion

    * Summarize your findings and provide insights into which model is most suitable for predicting Titanic passenger survival based on the dataset.
    * Include visualizations and explanations to explain your findings.

Note: Consider hyperparameter tuning for the classifiers and further data exploration to enhance your analysis.
Hyperparameter tuning helps you find the best configuration for each model, which can improve its performance.
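
As one possible approach to tuning, the sketch below runs a cross-validated grid search for kNN with scikit-learn's `GridSearchCV`. The data is synthetic (generated with `make_classification`), and the parameter grid is an arbitrary example, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; replace with your preprocessed training features and target.
X, y = make_classification(n_samples=200, n_features=5, n_informative=3, random_state=42)

# Example grid (assumption): every combination is evaluated with 5-fold cross-validation.
param_grid = {"n_neighbors": [3, 5, 7, 9], "weights": ["uniform", "distance"]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Run the search on the training data only, so that the test set stays untouched for the final evaluation.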

### Preprocessing Data for Machine Learning

Select the features and the target variable for the ML process.

Handle (replace or delete) all rows with `NaN` values.

In [12]:
# TODO: Preprocessing

### Data Splitting

Split the data into **training** and **test sets**.

Use the following settings to split the data in this assignment:

* `y` is the target variable to predict.
* `test_size=0.2` specifies that 20% of the data will be used for testing.
* `random_state=42` sets a seed for the random number generator, which makes the split reproducible.
* `stratify=y` ensures that the class distribution of the target variable is preserved in both the training and testing sets.
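
The effect of these settings can be checked on a small synthetic example. The arrays below are invented; note that the keyword arguments of `train_test_split` are written in lowercase.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 50 samples, 60% class 0 and 40% class 1.
X = np.arange(100).reshape(50, 2)
y = np.array([0] * 30 + [1] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)    # (40, 2) (10, 2)
print(y_train.mean(), y_test.mean())  # class-1 share is 0.4 in both parts
```

Without `stratify=y`, the class shares in the two parts could differ by chance, which matters most for small or imbalanced datasets.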

In [13]:
# TODO: Data splitting
# X_train, X_test, y_train, y_test = train_test_split(..., test_size=0.2, random_state=42, stratify=y)

### Modeling
Implement these **two different** machine learning classification models:

1. **k-Nearest Neighbors (kNN) Classification**
2. **Random Forest (RF) Classification**

Then train the models and make predictions with the trained models on the separate test data.
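
A minimal sketch of this step with scikit-learn is shown below. It uses synthetic data from `make_classification` in place of the preprocessed Titanic features, and the hyperparameter values (`n_neighbors=5`, `n_estimators=100`) are simply the library defaults written out, not tuned choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the preprocessed features X and target y.
X, y = make_classification(n_samples=200, n_features=5, n_informative=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Model 1: k-Nearest Neighbors.
knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(X_train, y_train)
knn_pred = knn_model.predict(X_test)

# Model 2: Random Forest (random_state makes the result reproducible).
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
rf_pred = rf_model.predict(X_test)

print(knn_pred[:10])
print(rf_pred[:10])
```

Keeping separate names such as `knn_model`/`knn_pred` and `rf_model`/`rf_pred` makes the later evaluation and comparison easier to follow.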

In [14]:
# TODO: Modeling with kNN and Random Forest.
# Keep the models and their results clearly separate by using different variable names for each model.

#### Feature importance values

Show the feature importance values of the trained model in descending order, if the ML method used provides them.
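
A fitted `RandomForestClassifier` exposes these values through its `feature_importances_` attribute, whereas `KNeighborsClassifier` has no such attribute. The sketch below uses synthetic data and placeholder feature names (`f0` to `f3`), which are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with placeholder feature names.
feature_names = ["f0", "f1", "f2", "f3"]
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=42
)

rf_model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Pair each importance with its feature name and sort from most to least important.
importances = pd.Series(rf_model.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)

print(importances)
```

For models without built-in importances, `sklearn.inspection.permutation_importance` is one model-agnostic alternative.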

In [15]:
# TODO: Feature importance values (only if the values can be obtained).

### Model Evaluation

Evaluate the performance of both models. Calculate metrics that can be compared.
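
One way to get comparable numbers is to compute the same metrics for both sets of predictions and collect them in a table. The labels and predictions below are invented so that the results can be verified by hand; `evaluate` is a hypothetical helper, not part of scikit-learn.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Invented true labels and predictions for two models.
y_test = [1, 0, 1, 1, 0, 0, 1, 0]
knn_pred = [1, 0, 0, 1, 0, 1, 1, 0]
rf_pred = [1, 0, 1, 1, 0, 0, 1, 1]


def evaluate(y_true, y_pred):
    """Hypothetical helper: return a dict of common classification metrics."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
    }


# One column per model, one row per metric.
results = pd.DataFrame({"kNN": evaluate(y_test, knn_pred), "RF": evaluate(y_test, rf_pred)})
print(results.round(3))
```

In this toy example kNN has 3 true positives, 1 false positive and 1 false negative, so accuracy, precision, recall and F1 are all 0.75, which makes the table easy to check against the definitions.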

In [16]:
# TODO: Evaluation of both models

### Comparison

Compare the results of the classifiers.

In [17]:
# TODO: Comparison of results

### Conclusion

Summarize your findings.

**TODO: Your Conclusion: (write this conclusion with Markdown)**