# Portfolio assignment week 5

## 1. SVC

The Scikit-learn library provides different kernels for the Support Vector Classifier, e.g. `RBF` or `polynomial`.

Based on the examples [in the accompanying notebook](../Exercises/E_LR_SVM.ipynb), create your own `SVC` class and configure it with different kernels to see if you are able to have it correctly separate the moon-dataset. You can also use a `precomputed` kernel. In addition, there are several parameters you can tune for better results. Make sure to go through [the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html).

**Hint**:

- Plot the support vectors for understanding how it works.
- Give arguments why a certain kernel behaves a certain way.
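
As a possible starting point for this comparison, the sketch below fits `SVC` with a few built-in kernels on a synthetic moons dataset and reports the test accuracy and the number of support vectors per kernel. The sample size, noise level, split ratio, and `random_state` are illustrative assumptions, not values taken from the accompanying notebook.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-moons data; n_samples, noise and random_state are assumed values.
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

results = {}
for kernel in ("linear", "poly", "rbf"):
    clf = SVC(kernel=kernel, C=1.0, gamma="scale")
    clf.fit(X_train, y_train)
    results[kernel] = {
        "accuracy": clf.score(X_test, y_test),
        # support_vectors_ holds the training points that define the margin
        "n_support_vectors": len(clf.support_vectors_),
    }

for kernel, stats in results.items():
    print(f"{kernel:>6}: accuracy={stats['accuracy']:.3f}, "
          f"support vectors={stats['n_support_vectors']}")
```

The points in `clf.support_vectors_` can be drawn with `plt.scatter` on top of the data to see which samples shape the decision boundary for each kernel.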

## 2. Model Evaluation

Classification metrics are important for measuring the performance of your model. Scikit-learn provides several options such as the `classification_report` and `confusion_matrix` functions. Other helpful options are the `AUC ROC` and the `precision-recall curve`. Try to understand what these metrics mean and give arguments why one metric would be more important than others.

For instance, if you have to predict whether a patient has cancer or not, the number of false negatives is probably more important than the number of false positives. This would be different if we were predicting whether a picture contains a cat or a dog – or not: it all depends on the context. Thus, it is important to understand when to use which metric.
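
To make the false-negative versus false-positive trade-off concrete, here is a minimal sketch that computes the metrics named above. It uses scikit-learn's bundled copy of the breast cancer data as a stand-in (an assumption, not the Kaggle file) and a scaled `LogisticRegression` chosen only for illustration. The bundled data encodes malignant as 0, so the labels are flipped to make malignant the positive class.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    precision_recall_curve,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data: the bundled copy encodes malignant as 0 and benign as 1.
X, y = load_breast_cancer(return_X_y=True)
y_malignant = (y == 0).astype(int)  # make "malignant" the positive class

X_train, X_test, y_train, y_test = train_test_split(
    X, y_malignant, test_size=0.3, random_state=0, stratify=y_malignant
)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
y_score = model.predict_proba(X_test)[:, 1]  # predicted probability of malignant

# Rows of the confusion matrix are true classes, columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(f"false negatives (missed malignant cases): {fn}")
print(f"false positives (benign flagged as malignant): {fp}")
print(classification_report(y_test, y_pred, target_names=["benign", "malignant"]))

# Threshold-independent summaries based on the predicted probabilities.
auc = roc_auc_score(y_test, y_score)
precision, recall, thresholds = precision_recall_curve(y_test, y_score)
print(f"ROC AUC: {auc:.3f}")
```

In the cancer setting, the recall of the malignant class (`tp / (tp + fn)`) is the number to watch, because every false negative is a missed diagnosis.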

For this exercise, you can use your own dataset if it is eligible for supervised classification. Otherwise, you can use the [breast cancer dataset](https://www.kaggle.com/datasets/yasserh/breast-cancer-dataset) which you can find on assemblix2019 (`/data/datasets/DS3/`). Go through the data science pipeline as you've done before:

1. Try to understand the dataset globally.
2. Load the data.
3. Exploratory analysis
4. Preprocess data (skewness, normality, etc.)
5. Modeling (cross-validation and training)
6. **Evaluation**

Create and train several `LogisticRegression` and `SVM` models with different values for their hyperparameters. Make use of the model evaluation techniques that have been described during the plenary part to determine the best model for this dataset. Accompany your elaborations with a conclusion, in which you explicitly interpret these evaluations and describe why the different metrics you are using are important or not. Make sure you take the context of this dataset into account.
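
One way to organise such a comparison is a cross-validated grid search over both model families. The sketch below is an illustration under assumptions: it uses scikit-learn's bundled breast cancer data as a stand-in, a small hypothetical grid of `C` values and kernels, and recall of the malignant class as the selection metric.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data; malignant (encoded as 0 in the bundled copy) becomes the positive class.
X, y = load_breast_cancer(return_X_y=True)
y = (y == 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# The "clf" step is a placeholder that the grid swaps for each model family.
pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])

# Hypothetical grids, kept small so the search runs quickly.
param_grid = [
    {"clf": [LogisticRegression(max_iter=1000)], "clf__C": [0.01, 0.1, 1, 10]},
    {"clf": [SVC()], "clf__kernel": ["linear", "rbf"], "clf__C": [0.1, 1, 10]},
]

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, scoring="recall", cv=cv)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print(f"mean cross-validated recall: {search.best_score_:.3f}")
print(f"held-out recall: {search.score(X_test, y_test):.3f}")
```

Changing `scoring` (for example to `"f1"` or `"roc_auc"`) may select a different model, which is a direct way to see how the choice of metric drives the conclusion.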

# Data
For this assignment I found two suitable datasets: the [Cancer Dataset](https://www.kaggle.com/datasets/erdemtaha/cancer-data) and the [Star Classification Dataset](https://www.kaggle.com/datasets/fedesoriano/stellar-classification-dataset-sdss17). I had difficulty deciding between the star dataset, which involves multi-class classification, and the cancer dataset, which is a binary classification problem. For this week's assignment I chose the Cancer Dataset, since I had already worked on another physics dataset for the second assignment. I may explore the star classification dataset in the future, particularly for the unsupervised classification section.

# Loading Data

In [1]:
# general libraries
import yaml
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
#inspired by https://fennaf.gitbook.io/bfvm22prog1/data-processing/configuration-files/yaml

def configReader():
    """
    explanation: This function opens the config.yaml file
    and fetches the configuration information
    input: ...
    output: configuration dictionary
    """
    with open("config.yaml", "r") as inputFile:
        config = yaml.safe_load(inputFile)
    return config

In [3]:
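def dataframe_maker(config):
    """Read the CSV file named in the config and drop the 'Unnamed: 32' column."""
    # relies on config.yaml listing the directory first and the file name second
    file_directory, file_name = config.values()
    os.chdir(file_directory)
    df = pd.read_csv(file_name).drop('Unnamed: 32', axis=1)
    return df

df = dataframe_maker(configReader())
df.head()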

|  | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 842302 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 842517 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 84300903 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 84358402 | M | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |
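
The loading step above depends on the order of the entries in `config.yaml` and changes the working directory as a side effect. A more defensive variant could look like the sketch below. The key names `file_directory` and `file_name` are hypothetical (the actual keys of the config file are not shown), and dropping all-empty columns assumes that `Unnamed: 32` contains no values at all.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_dataframe(config):
    """Read the CSV named in `config` and drop columns that are entirely empty.

    `file_directory` and `file_name` are hypothetical key names; adapt them
    to whatever keys config.yaml really defines.
    """
    csv_path = Path(config["file_directory"]) / config["file_name"]
    df = pd.read_csv(csv_path)
    # Removes any column without a single value, so its name is not hard-coded.
    return df.dropna(axis=1, how="all")


# Tiny demonstration on a throwaway file with a trailing comma on every line,
# which makes pandas create an extra, completely empty "Unnamed" column.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "demo.csv").write_text(
        "id,diagnosis,radius_mean,\n1,M,17.99,\n2,B,13.54,\n"
    )
    demo = load_dataframe({"file_directory": tmp, "file_name": "demo.csv"})

print(list(demo.columns))
```

Building the path explicitly leaves the working directory untouched, so later cells that use relative paths keep working.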


This dataset consists of 569 samples with 32 columns. In my opinion, during the inspection phase, one of the primary objectives is to become acquainted with various aspects of the dataset. Therefore, I conducted some research on the different features of this dataset and compiled a list of them below. The features primarily cover the following parameters:

**concavity**: This concept pertains to the presence of concave areas on the surface of a tumor. A greater number of concave points along the nuclear border is associated with a higher probability of malignancy. The severity and quantity of these concave points demonstrate a positive correlation with the diagnosis.[<a href="https://rpubs.com/Kevin_Nguyen_Tran/662211" target="_blank">link</a>]

**compactness**: Tumor compactness is defined as the ratio of the tumor's volume to its surface area. This feature is closely tied to the spatial configuration of tumors, and as a result, we can expect to observe some dependencies between tumor compactness and other spatial features. [<a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5352371/" target="_blank">link</a>]

**Fractal Dimension**: Fractal dimension analysis is a computational image processing technique used to evaluate the level of complexity within patterns. The linked article describes the technique in detail and reports that it is effective in enhancing the histopathological diagnosis of breast cancer. [<a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8087740/" target="_blank">link</a>]

**symmetry**: In normal tissues, cell division typically produces identical or nearly identical pairs of daughter cells. In cancer, however, cell division often follows an asymmetric pattern, characterized by a series of symmetry-breaking events. The linked article provides further insights on this topic. [<a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5837760/" target="_blank">link</a>]

**smoothness**: Benign masses typically exhibit smooth, round, and well-defined boundaries, in contrast to malignant tumors that often display spiculated, rough, and indistinct edges. Various methods can be employed to quantify tumor smoothness, including the neighboring gray-level dependence matrix (NGLDM) method and the peak-variance method. [<a href="https://pubmed.ncbi.nlm.nih.gov/1623493/" target="_blank">link</a>] [<a href="https://my.clevelandclinic.org/health/diseases/22121-benign-tumor" target="_blank">link</a>]

**radius, area, and perimeter**: The spatial configuration of a tumor plays a crucial role in evaluating its malignancy. A commonly used approach to determine the volume and perimeter of subcutaneous tumors involves measuring the length and width of the tumor using a caliper. This method assumes that the tumor has an ellipsoidal shape and that its height is equal to its width. By applying the formula for the volume and perimeter of an ellipsoid, these measurements can be used to estimate these parameters accurately.[<a href="https://biopticon.com/resources/tumor-volume-measurements-by-calipers/" target="_blank">link</a>]

**cancer texture**: Cancer texture remains a subject of debate within the scientific community. While tumors may appear hard to the touch externally, research has revealed that individual cells within the tissue exhibit non-uniform rigidity and can even display variations in softness throughout the tumor. Texture analysis is employed to quantitatively assess texture characteristics. By examining the spatial variation in pixel intensities, it captures and quantifies intuitive qualities such as roughness, smoothness, silkiness, or bumpiness associated with the tumor's appearance. [<a href="https://cos.northeastern.edu/news/cancer-tumors-arent-always-as-tough-as-they-seem/" target="_blank">link</a>] [<a href="https://www.mathworks.com/help/images/texture-analysis-1.html" target="_blank">link</a>]
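
Each of the measurements described above appears in the dataset in several variants, recognisable by the suffixes `_mean`, `_se`, and `_worst`. As a small illustration, the sketch below groups column names by their base measurement. Only a subset of the column names is listed, and the helper logic is an illustrative assumption rather than part of the original analysis.

```python
from collections import defaultdict

# A subset of the column names in the dataset, listed here for illustration.
columns = [
    "radius_mean", "texture_mean", "fractal_dimension_mean",
    "radius_se", "texture_se",
    "radius_worst", "texture_worst", "fractal_dimension_worst",
]

groups = defaultdict(list)
for name in columns:
    # Split on the last underscore only, so that
    # 'fractal_dimension_worst' becomes ('fractal_dimension', 'worst').
    base, statistic = name.rsplit("_", 1)
    groups[base].append(statistic)

for base, statistics in groups.items():
    print(f"{base}: {statistics}")
```

Applied to `df.columns` (after removing `id` and `diagnosis`), the same grouping gives a quick overview of which variants exist for every measurement.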

# Inspecting Dataset
As in all the previous notebooks, I should mention that this section is one of the most important parts of the data science pipeline. Consequently, one should take care to find all the abnormalities and characteristics of a dataset.

In [4]:
def inspecting_data(df):
    """Print the shape, dtypes, null count, ID uniqueness and class balance."""

    # find the shape of data
    print(f'dataset has {df.shape[0]} observations, and {df.shape[1]} variables\n')

    # df.info() prints its summary itself and returns None, so call it directly
    df.info()
    print()

    # extract the number of null values of the dataset
    null_values = df.isnull().sum().sum()
    print(f'the total number of null values in this dataset is {null_values}\n')

    # find whether the number of unique ids is equal to the number of observations
    if df.id.nunique() == df.shape[0]:
        print(f'the number of unique IDs is {df.shape[0]}, which is equal to the number of observations\n')

    # find the distribution of the datapoints in the label column
    member_numbers = df.diagnosis.value_counts()
    print('number of members in each diagnosis category')
    print(f'{member_numbers}\n')
    # index by label instead of position and report actual percentages
    print(f"Benign (B): {member_numbers['B'] / df.shape[0]:.1%}")
    print(f"Malignant (M): {member_numbers['M'] / df.shape[0]:.1%}")

In [5]:
inspecting_data(df)

dataset has 569 observations, and 32 variables

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 32 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non

During the data inspection phase, it was observed that the dataset consists of 30 float features, one integer column representing the ID, and a categorical column containing the labels. One of the most important characteristics of this dataset is its class imbalance: the benign group makes up roughly two-thirds of the datapoints, while the malignant cases represent only about one-third. The classification methods will initially be implemented without addressing this imbalance; a subsequent step will tackle it and re-run the classification methods with the appropriate adjustments. In this way, one can see the effect of the data imbalance on the outcome of the classification methods.
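
One common adjustment for this kind of imbalance is class weighting, which is not necessarily the approach taken in the later steps. The sketch below shows how scikit-learn's `"balanced"` heuristic would weight the two classes. The class counts (357 benign, 212 malignant) are an assumption consistent with the roughly two-thirds versus one-third split described above.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Assumed class counts: 357 benign and 212 malignant out of 569 samples.
y = np.array(["B"] * 357 + ["M"] * 212)

classes = np.unique(y)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weight = dict(zip(classes, weights))

# "balanced" computes n_samples / (n_classes * class_count) for each class,
# so the minority class receives the larger weight.
for label, weight in class_weight.items():
    print(f"{label}: {weight:.3f}")
```

Passing `class_weight="balanced"` to `LogisticRegression` or `SVC` applies the same weighting during training, which makes errors on the malignant class more costly than errors on the benign class.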