# Portfolio assignment week 6

# 1. Decision Trees and Naive Bayes
The scikit-learn library provides different parameters for decision trees and naive Bayes.

Based on the last code example [in the accompanying notebook](../Exercises/E_DT_NB.ipynb), add several new models to the `classifiers` variable. These models should have different parameters. For instance, create a new decision tree with a max depth of 1. Another possibility is to add different datasets or add noise.

Try to understand why some models behave differently than others. Argue what influences model performance and why.
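One way to set up such a comparison is sketched below. The structure of `classifiers` in the notebook is not shown here, so a plain dict mapping a label to an estimator is assumed; the dataset (`make_moons`), the noise level, and the parameter values are illustrative choices, not the notebook's own.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Noisy two-class toy data; increase `noise` to see how each model degrades.
X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Assumed layout: label -> unfitted estimator.
classifiers = {
    "tree, depth 1": DecisionTreeClassifier(max_depth=1, random_state=0),
    "tree, depth 5": DecisionTreeClassifier(max_depth=5, random_state=0),
    "tree, unlimited depth": DecisionTreeClassifier(random_state=0),
    "tree, min 20 samples per leaf": DecisionTreeClassifier(
        min_samples_leaf=20, random_state=0
    ),
    "naive Bayes": GaussianNB(),
    "naive Bayes, strong smoothing": GaussianNB(var_smoothing=1e-2),
}

# Store (train accuracy, test accuracy) per model.
scores = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    scores[name] = (clf.score(X_train, y_train), clf.score(X_test, y_test))
    print(f"{name:32s} train={scores[name][0]:.2f} test={scores[name][1]:.2f}")
```

Comparing the train and test accuracy per model is a quick way to spot underfitting (both low, as a depth-1 tree may show) and overfitting (train much higher than test, as an unlimited tree may show on noisy data).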

# 2. Decision Tree Evaluation
As shown [in the accompanying notebook](../Exercises/E_DT_NB.ipynb), it is possible to visualize the decision tree.

For this exercise, you can use your own dataset if it is eligible for supervised classification. Otherwise, you can use the [breast cancer dataset](https://www.kaggle.com/datasets/yasserh/breast-cancer-dataset), which you can find on assemblix2019 (`/data/datasets/DS3/`). Go through the data science pipeline as you've done before:

1. Try to understand the dataset globally.
2. Load the data.
3. Exploratory analysis
4. Preprocess data (skewness, normality, etc.)
5. Modeling (cross-validation and training)
6. Evaluation
7. **Explanation**

Explain how the decision tree behaves under certain circumstances. What features seem important and how are the decisions made?
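The modeling, evaluation, and explanation steps could look roughly like the sketch below. It assumes scikit-learn's built-in `load_breast_cancer` data as a stand-in for the Kaggle file, and a depth limit of 3 chosen only to keep the printed rules short.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in dataset (assumption): scikit-learn's bundled breast cancer data.
data = load_breast_cancer()
X, y = data.data, data.target

# Shallow tree so that the learned rules stay readable.
clf = DecisionTreeClassifier(max_depth=3, random_state=0)

# 5-fold cross-validation for an estimate of generalization accuracy.
cv_results = cross_validate(clf, X, y, cv=5, scoring="accuracy")
mean_accuracy = cv_results["test_score"].mean()
print(f"mean CV accuracy: {mean_accuracy:.3f}")

# Fit on all data to inspect which features drive the decisions.
clf.fit(X, y)
top = np.argsort(clf.feature_importances_)[::-1][:5]
for i in top:
    print(f"{data.feature_names[i]:25s} {clf.feature_importances_[i]:.3f}")

# Text rendering of the tree: one line per split or leaf.
rules = export_text(clf, feature_names=list(data.feature_names))
print(rules)
```

The importances sum to 1 and only reflect the splits this particular tree made, so a feature with importance 0 was simply never used, which is not the same as being uninformative.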

## Dataset
It is not fun to use the same dataset for every exercise :D. So I started searching for a fun and well-suited dataset for this exercise (of course, I am taking the resit, which is why I have enough time to do this :D). While exploring the concept of decision trees, I came across two interesting links (<a href="https://towardsdatascience.com/decision-trees-in-machine-learning-641b9c4e8052" target="_blank">tds-link</a> and <a href="https://en.wikipedia.org/wiki/Sinking_of_the_Titanic" target="_blank">wiki</a>) that caught my attention. While skimming the wiki link, I stumbled upon a section called "Casualties and survivors", which gives detailed information about the number of saved and lost passengers by gender and class. This data was sourced from an article written by L. Mersey in 1912 [1]. I found it intriguing that the author of the wiki link used a treemap to categorize passengers, which gives valuable insight into the casualties of this catastrophic event. The tds-link, in turn, used decision trees to classify the casualties and survivors based on their gender and age. I therefore found this a fascinating topic to work on and decided to present my own decision tree model to describe the casualties of this catastrophic event. The Titanic dataset is taken from this <a href="https://www.kaggle.com/datasets/vinicius150987/titanic3?resource=download" target="_blank">Kaggle</a> link.

[1] Mersey, L. (1999). The Loss of the Titanic, 1912. London: Stationery Office.

Here I will delve into this dataset to find the most reliable model for predicting which passengers were lost and which survived.

As always, we begin by loading the data.

## Loading Data

In [1]:
# General Modules
import yaml
import os
import pandas as pd
import numpy as np
import graphviz
import time
import re
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

# Dataset module
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons, make_circles

# preprocessing modules
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures

# Classification modules
from sklearn import tree
from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline

# Metrics modules
from sklearn.metrics import accuracy_score

# Model selection modules
from sklearn.model_selection import cross_validate
from sklearn.model_selection import GridSearchCV

In [2]:
#inspired by https://fennaf.gitbook.io/bfvm22prog1/data-processing/configuration-files/yaml

def configReader():
    """
    explanation: This function opens the config.yaml file
    and returns the configuration it contains
    input: none
    output: the parsed configuration (dict)
    """
    with open("config.yaml", "r") as inputFile:
        config = yaml.safe_load(inputFile)
    return config

In [3]:
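def dataframe_maker(config):
    """Read the Excel file named in the config into a DataFrame."""
    # The config is expected to hold exactly two values, in this order:
    # the data directory and the file name.
    file_directory, file_name = config.values()
    # Build the full path instead of changing the working directory,
    # so the relative path to config.yaml stays valid on later calls.
    df = pd.read_excel(os.path.join(file_directory, file_name))
    return df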
df = dataframe_maker(configReader())
df.head()

Unnamed: 0 | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest
--|--|--|--|--|--|--|--|--|--|--|--|--|--|--
0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2.0 |  | St Louis, MO
1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.55 | C22 C26 | S | 11.0 |  | Montreal, PQ / Chesterville, ON
2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0 | 1 | 2 | 113781 | 151.55 | C22 C26 | S |  |  | Montreal, PQ / Chesterville, ON
3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0 | 1 | 2 | 113781 | 151.55 | C22 C26 | S |  | 135.0 | Montreal, PQ / Chesterville, ON
4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0 | 1 | 2 | 113781 | 151.55 | C22 C26 | S |  |  | Montreal, PQ / Chesterville, ON


This dataset consists of 1309 samples (passengers) and 14 columns (13 candidate features plus the `survived` label). Each column could influence the model's outcome, but some may contribute little. Understanding what each one represents is therefore crucial for the subsequent steps in the data science pipeline, because it lets us judge whether a feature matters for the analysis. All columns are described below.

1. **pclass**: the ticket class, a potentially significant feature for the chances of survival. It is plausible that passengers in higher classes had a better chance of surviving, which makes this feature relevant for the analysis.

2. **survived**: the label, indicating whether a passenger survived this disastrous event or lost their life.

3. **name**: the name of the passenger.

4. **sex**: the gender of the passenger, which can significantly influence the model's performance. In emergencies, women are often prioritized for protection, which makes this feature highly relevant to the model's predictions.

5. **age**: the age of the passenger, which can also significantly affect the model's outcome. In emergencies, children are often given priority for protection, which makes this feature particularly relevant for predicting survival.

6. **sibsp**: the number of siblings and spouses aboard. It may have a slight effect on the outcome of the model; perhaps passengers with a higher sibsp had a lower chance of surviving.

7. **parch**: the number of parents and children aboard. This feature is quite similar to the previous one (sibsp).

8. **ticket**: the ticket number of the passenger. It seems irrelevant to survival.

9. **fare**: the price of the ticket, which can be useful for classifying passengers by socio-economic status. Higher ticket prices might indicate a higher social class, while lower prices might correspond to a lower class.

10. **cabin**: the number(s) of the cabin(s) the passenger paid for. This feature could be significant because passengers in cabins near the deck may have reached the top of the ship more quickly, which could have increased their chances of survival.

11. **embarked**: the port at which the passenger boarded the ship.

12. **boat**: the code of the lifeboat in which a surviving passenger was rescued.

13. **body**: the body identification number, if the passenger did not survive and the body was recovered (<a href="https://github.com/awesomedata/awesome-public-datasets/issues/351" target="_blank">link</a>).

14. **home.dest**: the home and destination of the passenger.
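The hypotheses in this list (for example, that sex and class affect survival) can be checked directly with a group-by, and missing values can be counted per column. The sketch below uses only the five passengers shown in the `df.head()` output, re-typed by hand with a subset of the columns, so the numbers it prints say nothing about the full dataset; on the real `df` the same two calls apply unchanged.

```python
import pandas as pd

# The five rows from df.head(), limited to a few columns (hand-copied sample).
sample = pd.DataFrame(
    {
        "pclass": [1, 1, 1, 1, 1],
        "survived": [1, 1, 0, 0, 0],
        "sex": ["female", "male", "female", "male", "female"],
        "age": [29.0, 0.9167, 2.0, 30.0, 25.0],
        "boat": [2.0, 11.0, None, None, None],
        "body": [None, None, None, 135.0, None],
    }
)

# Share of survivors per (sex, class) group.
survival_by_group = sample.groupby(["sex", "pclass"])["survived"].mean()
print(survival_by_group)

# Number of missing values per column.
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

Note that, going by their descriptions, `boat` and `body` are only filled in once the outcome is known, so it is worth considering whether they should be excluded before modeling.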

The next and most vital part of this project is **Data Inspection**.

### Data Inspection


# 3. Naive Bayes

During the Corona pandemic, seven roommates in a student house took a Corona test. The table below shows the data of these students: whether they experienced shivers, had a running nose, or had a headache. The test result is also shown.

Roommate | shivers | running nose | headache | test result
--|--|--|--|--
1 | Y | N | No | Negative
2 | N | N | Mild | Negative
3 | Y | Y | No | Positive
4 | N | Y | No | Negative
5 | N | N | Heavy | Positive
6 | Y | N | No | Negative
7 | Y | Y | Mild | Positive

Explain why it is not useful to include the column 'Roommate' in a classification procedure.

Train a [Categorical Naive Bayes](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.CategoricalNB.html) classifier on this dataset, where the test results are the classes and the remaining columns are the features. For this to work, you'll need [Pandas `get_dummies`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) to transform the nominal data into something that sklearn can work with. Use all seven rows in your training.
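A minimal sketch of this training step follows. The table is typed in by hand, `get_dummies` is used with its default settings (one 0/1 column per category, seven columns in total), and `CategoricalNB` is left at its defaults; the column names are copied from the table, and everything else is an illustrative choice.

```python
import pandas as pd
from sklearn.naive_bayes import CategoricalNB

# The seven roommates from the table (the Roommate column is left out).
data = pd.DataFrame(
    {
        "shivers": list("YNYNNYY"),
        "running nose": list("NNYYNNY"),
        "headache": ["No", "Mild", "No", "No", "Heavy", "No", "Mild"],
        "test result": [
            "Negative", "Negative", "Positive", "Negative",
            "Positive", "Negative", "Positive",
        ],
    }
)

# One-hot encode the nominal features; cast to int so every column is 0/1.
X = pd.get_dummies(data.drop(columns="test result")).astype(int)
y = data["test result"]

# Fit on all seven rows and predict the same rows again.
model = CategoricalNB().fit(X, y)
pred = model.predict(X)
proba = model.predict_proba(X)

print(list(X.columns))
print(pd.DataFrame(proba, columns=model.classes_).assign(predicted=pred, actual=y))
```

The columns of `predict_proba` follow the order of `model.classes_`, so it is worth printing them together as above before reading off any single probability.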

If you let your fitted classifier predict the test results (based on all the data), you will (hopefully) see that the prediction for observation number 5 (1-based) is wrong (it predicts Negative while the actual value is Positive). Show by manual calculation that the predicted probability for this instance is indeed higher for the Negative class ($p=0.527$) than for the Positive class.
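The manual calculation can be sketched in plain Python with exact fractions. The assumptions are: the three features are expanded into seven 0/1 indicators (as default one-hot encoding would do), and each conditional probability uses Laplace smoothing with $\alpha = 1$, i.e. $(\text{count} + 1)/(N_c + 2)$ for a two-valued indicator. Normalising the smoothed likelihoods alone gives about 0.527 for Negative; weighting them with the class frequencies 4/7 and 3/7 gives about 0.598. Either way Negative wins, but which figure to compare against a fitted model depends on whether class priors are included.

```python
from fractions import Fraction

# Table rows: (shivers, running nose, headache, test result).
rows = [
    ("Y", "N", "No", "Negative"),
    ("N", "N", "Mild", "Negative"),
    ("Y", "Y", "No", "Positive"),
    ("N", "Y", "No", "Negative"),
    ("N", "N", "Heavy", "Positive"),
    ("Y", "N", "No", "Negative"),
    ("Y", "Y", "Mild", "Positive"),
]
levels = [("N", "Y"), ("N", "Y"), ("Heavy", "Mild", "No")]


def one_hot(row):
    """Expand the three nominal values into seven 0/1 indicators."""
    return [int(row[i] == lv) for i, group in enumerate(levels) for lv in group]


def likelihood(target, cls, alpha=1):
    """Smoothed P(indicators of `target` | cls), multiplied over all indicators."""
    members = [one_hot(r) for r in rows if r[3] == cls]
    p = Fraction(1)
    for j, value in enumerate(one_hot(target)):
        count = sum(m[j] == value for m in members)
        p *= Fraction(count + alpha, len(members) + 2 * alpha)
    return p


target = rows[4]  # observation number 5 (1-based)
classes = ("Negative", "Positive")
lik = {c: likelihood(target, c) for c in classes}
prior = {c: Fraction(sum(r[3] == c for r in rows), len(rows)) for c in classes}

# Likelihoods only (equal class weights).
p_equal = lik["Negative"] / sum(lik.values())
# Likelihoods weighted by the observed class frequencies.
joint = {c: lik[c] * prior[c] for c in classes}
p_with_prior = joint["Negative"] / sum(joint.values())

print(f"P(Negative), equal class weights:   {float(p_equal):.3f}")
print(f"P(Negative), with class priors:     {float(p_with_prior):.3f}")
```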
