## Practical 2: Supervised Learning (Logistic Regression)

Logistic regression is a widely used linear model for classification tasks. It is commonly employed when the dependent variable or target variable is binary, meaning it can take one of two possible values. The logistic regression algorithm calculates the probability of a sample belonging to a particular class using a logistic function, also known as the sigmoid function. It models the relationship between the independent variables (features) and the probability of the target variable being in a particular class. Practical details can be found here: https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
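As a minimal illustration of the logistic (sigmoid) function described above, the sketch below turns a linear score into a probability and then into a class label. The weights, bias, and sample values are made up for illustration and are not taken from the dataset.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, bias, and sample, chosen only for illustration
w = np.array([0.8, -1.2])
b = 0.5
x = np.array([1.0, 2.0])

z = w @ x + b            # linear score: 0.8*1.0 - 1.2*2.0 + 0.5 = -1.1
p = sigmoid(z)           # estimated probability of the positive class
label = int(p >= 0.5)    # threshold at 0.5 to obtain a class label
print(f"score={z:.2f}, probability={p:.3f}, predicted class={label}")
```

A negative score yields a probability below 0.5, so the sample is assigned to class 0 under the usual 0.5 threshold.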

In this practical example, the Breast Cancer Wisconsin (Diagnostic) Dataset is utilized to demonstrate the application of logistic regression. This dataset can be obtained from the UCI Machine Learning repository, specifically from the following URL: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic).

The dataset file is named "wdbc.csv" and consists of 569 samples. Each sample is described by 30 features derived from digitized images of fine needle aspirates (FNA) of breast masses. These features provide information about the characteristics of the cell nuclei present in the images. The goal of this classification task is to accurately classify each sample into one of two classes: Benign (1) or Malignant (0). Thus, this is a binary classification problem.

To successfully complete the task, you are required to follow the steps outlined below and provide your comments on the questions associated with each step. The steps are as follows:


In [None]:
# Import the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Plot styling
plt.rc("font", size=14)
sns.set(style="whitegrid", color_codes=True)

### Experiments

#### 1. Load the dataset (e.g. as a pandas dataframe)
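One possible way to load the data is sketched below. Because the column layout of "wdbc.csv" is not described here, the sketch assumes scikit-learn's bundled copy of the same dataset (`load_breast_cancer`) as a stand-in; its label encoding (0 = malignant, 1 = benign) matches the one stated above.

```python
from sklearn.datasets import load_breast_cancer

# Assumption: scikit-learn's bundled copy stands in for "wdbc.csv".
# With the file itself, pd.read_csv("wdbc.csv") would be the usual starting
# point, but the column names depend on how the file was prepared.
data = load_breast_cancer(as_frame=True)
df = data.frame                  # 30 feature columns plus a "target" column
X = df.drop(columns="target")    # feature matrix
y = df["target"]                 # labels: 0 = malignant, 1 = benign

print(df.shape)
print(data.target_names.tolist())
```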


#### 2. **Perform dataset inspection:**
Inspecting the data is a vital step in the data analysis process. This involves tasks such as previewing the dataset, exploring its dimensions, examining its structure, characteristics and quality, and checking for missing values to ensure its integrity before proceeding with further analysis or modeling.
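The inspection tasks listed above could be sketched as follows. The DataFrame is again assumed to come from scikit-learn's bundled copy of the dataset rather than from "wdbc.csv".

```python
from sklearn.datasets import load_breast_cancer

df = load_breast_cancer(as_frame=True).frame  # assumption: stand-in for wdbc.csv

print(df.head())     # preview the first five rows
print(df.shape)      # (number of samples, number of columns)
df.info()            # column names, dtypes, and non-null counts
print(df.describe().T[["mean", "std", "min", "max"]])  # per-feature summary

missing_per_column = df.isna().sum()   # missing values in each column
n_duplicates = df.duplicated().sum()   # fully duplicated rows
print("Missing values in total:", int(missing_per_column.sum()))
print("Duplicated rows:", int(n_duplicates))
```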

In [None]:
# Inspect the dataset



In [None]:
# If there were any missing values, would you drop rows with missing features or estimate the missing feature values?
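The two options in the question above can be contrasted on a toy example. The column names and values below are hypothetical, and median imputation is only one of several possible estimation strategies.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data with hypothetical column names, used only to illustrate the choice
toy = pd.DataFrame({
    "radius": [12.0, np.nan, 15.0, 20.0],
    "texture": [18.0, 21.0, np.nan, 25.0],
})

# Option 1: drop every row that has at least one missing feature
dropped = toy.dropna()

# Option 2: estimate missing values, here with the per-column median
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(len(toy), "rows originally,", len(dropped), "after dropping")
print(imputed)
```

Dropping rows is simple but discards data (half of this toy table), whereas imputation keeps every sample at the cost of introducing estimated values. With real data, the imputer would normally be fitted on the training split only.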


#### 3. **Check the label distribution of the dataset and address class imbalance if present**. 

**Check the label distribution:**

- Analyze the distribution: Examine the distribution of the target variable (class labels) in the dataset. Determine the number of samples belonging to each class (Benign and Malignant) and calculate the class proportions. This analysis helps identify whether there is a significant imbalance between the classes.
- Visualize the distribution: Create visual representations, such as bar plots or pie charts, of the class distribution. This provides a clear understanding of the proportions and any potential imbalance.

**Address class imbalance (if present):**

- Understand class imbalance: Class imbalance occurs when one class has significantly more samples than the other. This can lead to biased model performance, as the model may be more inclined to predict the majority class.
Class imbalance can result in reduced predictive accuracy, lower recall for the minority class, and an overall biased model. It is crucial to address this issue to ensure fair and accurate predictions for both classes.

- Handling class imbalance: There are several techniques to address class imbalance, including:    
    - Resampling: This involves either oversampling the minority class or undersampling the majority class to achieve a balanced distribution. Techniques such as Random Oversampling, SMOTE (Synthetic Minority Over-sampling Technique), or Random Undersampling can be employed. https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html
    - Class weights: Assigning appropriate weights to the classes during model training can help mitigate the impact of class imbalance. This gives more importance to the minority class and helps achieve a balanced prediction. https://scikit-learn.org/stable/modules/generated/sklearn.utils.class_weight.compute_class_weight.html
    - Algorithmic approaches: Ensemble methods such as Random Forest or boosting algorithms such as AdaBoost can be adapted to imbalanced data, for example by resampling inside each ensemble member, and may then provide better results without separate preprocessing. The imbalanced-learn library provides such balanced ensemble variants. More practical details can be found here: https://imbalanced-learn.org/stable/ensemble.html

The imbalanced-learn library provides various implementations of techniques to handle class imbalance. https://imbalanced-learn.org/stable/references/index.html
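A minimal sketch of the label check and of two of the balancing options above is given below. It assumes scikit-learn's bundled copy of the dataset and uses only scikit-learn utilities (`compute_class_weight` and `resample`); SMOTE from imbalanced-learn would be an alternative to the random oversampling shown here.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.utils import resample
from sklearn.utils.class_weight import compute_class_weight

df = load_breast_cancer(as_frame=True).frame  # assumption: stand-in for wdbc.csv
y = df["target"]                              # 0 = malignant, 1 = benign

# Label distribution: counts and proportions per class
counts = y.value_counts().sort_index()
proportions = counts / counts.sum()
print(counts.to_dict(), proportions.round(3).to_dict())

# Option A: class weights inversely proportional to class frequency
classes = np.unique(y)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weight = dict(zip(classes.tolist(), weights.tolist()))
print(class_weight)

# Option B: random oversampling of the minority class (with replacement)
minority_label = counts.idxmin()
minority = df[df["target"] == minority_label]
majority = df[df["target"] != minority_label]
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=0)
balanced = pd.concat([majority, minority_up]).sample(frac=1, random_state=0)
print(balanced["target"].value_counts().to_dict())
```

The whole dataset is resampled here only to keep the sketch short. In practice, resampling is usually applied to the training split alone, so that the test split keeps the original class distribution. A bar plot of the distribution can be obtained with `counts.plot(kind="bar")`.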

In [None]:
# Check the label distribution. Are the classes imbalanced? If so, use any of the approaches above to balance the classes.


In [None]:
# Divide the data into training (70%) and testing (30%) splits
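A 70/30 split could look like the sketch below. The `stratify=y` argument and the fixed `random_state` are assumptions added here so that both splits keep similar class proportions and the result is reproducible.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled copy stands in for wdbc.csv
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.30,    # 30% of the samples are held out for testing
    stratify=y,        # keep the class proportions similar in both splits
    random_state=42,   # fixed seed for reproducibility
)

print(X_train.shape, X_test.shape)
print(round(y_train.mean(), 3), round(y_test.mean(), 3))  # share of class 1 per split
```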


#### 4. **Feature Normalization**

Feature normalization, also known as feature scaling, is a common preprocessing step in machine learning that involves transforming the features of a dataset to a standardized scale or range. This process aims to bring the features onto a similar scale, which can be beneficial for many machine learning algorithms. Feature normalization is necessary when the features in a dataset have different scales, units of measurement, or ranges. If the features are not normalized, it can lead to biased model training, as some features may dominate others simply due to their larger values.

There are various methods for feature normalization, including:

- Min-Max Scaling: Rescales the features to a specified range, often between 0 and 1. It involves subtracting the minimum value and dividing by the range (maximum - minimum).

- Standardization: Standardizes the features to have zero mean and unit variance. It involves subtracting the mean and dividing by the standard deviation.

- Robust Scaling: Similar to standardization, but it uses median and interquartile range to handle outliers instead of mean and standard deviation.

The choice of normalization method depends on the characteristics of the dataset and the requirements of the machine learning models. It is essential to normalize features to ensure fair contributions from all features and to prevent any feature from dominating the learning process. 
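To make the three methods above concrete, the sketch below applies each of them to a single toy feature that contains one outlier. The values are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# One toy feature with an outlier (values are illustrative only)
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

x_minmax = MinMaxScaler().fit_transform(x)      # (x - min) / (max - min)
x_standard = StandardScaler().fit_transform(x)  # (x - mean) / standard deviation
x_robust = RobustScaler().fit_transform(x)      # (x - median) / interquartile range

for name, scaled in [("min-max", x_minmax), ("standard", x_standard), ("robust", x_robust)]:
    print(f"{name:>8}:", np.round(scaled.ravel(), 2))
```

With min-max scaling, the outlier squeezes the remaining values into a narrow band near 0, whereas robust scaling keeps them well spread around the median.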

**Perform the following tasks:**


- Determine the need for feature normalization:
  - Consider the requirements and assumptions of the machine learning model being used.
  - Assess whether feature normalization is necessary for the dataset.
  - Analyze the range and distribution of the features to identify potential variations in scale or magnitude.
- Normalize the features:
  - Utilize a scaling method from scikit-learn to normalize the features.
  - Explore the available scaling methods in scikit-learn's preprocessing module: https://scikit-learn.org/stable/modules/preprocessing.html.
  - Select an appropriate scaling method based on the characteristics of the dataset and the requirements of the models.
  - Apply the chosen scaling method to the dataset to normalize the feature values.
- Observe the effects of normalization:
  - Analyze the transformed features to observe any changes in scale or distribution.
  - Consider the impact of feature normalization on subsequent analysis or modeling steps.
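The tasks above might be approached as in the following sketch, which assumes standardization as the chosen method and scikit-learn's bundled copy of the dataset. The scaler is fitted on the training split only and then reused for the test split, so that no information from the test data leaks into preprocessing.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

# Before scaling: feature ranges differ by orders of magnitude
print(X_train[["mean area", "mean smoothness"]].describe().loc[["min", "max"]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std on training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

# After scaling: every training feature has mean 0 and standard deviation 1
print(np.round(X_train_scaled.mean(axis=0)[:3], 2))
print(np.round(X_train_scaled.std(axis=0)[:3], 2))
```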








In [None]:
#Answer




#### 5. **Feature Selection using SelectKBest**

Feature selection is a crucial step in the machine learning pipeline that involves selecting a subset of the most relevant features from a dataset. The goal of feature selection is to improve model performance, interpretability, and computational efficiency by reducing the dimensionality of the dataset and eliminating irrelevant or redundant features.

In many datasets, there may be features that do not contribute significantly to the target variable or contain redundant information. Including these features during model training can lead to overfitting, increased computational complexity, and decreased generalization performance. Feature selection helps address these issues by identifying and retaining only the most informative features.

Feature selection methods evaluate the importance or relevance of each feature based on certain criteria. These criteria can be statistical measures, information theory-based metrics, or machine learning algorithms. The selected features can provide insights into the underlying relationships in the data and improve the model's ability to make accurate predictions.

The SelectKBest feature selection method, available in scikit-learn, is one such approach (https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html). It ranks features based on statistical evaluation measures, such as chi-squared, ANOVA F-value, or mutual information. By specifying the desired number of top features (k), SelectKBest selects the features with the highest scores, indicating their relevance to the target variable. Note that the chi-squared score requires non-negative feature values, so it cannot be applied directly to standardized features.

**Perform the following tasks**

- Familiarize yourself with the SelectKBest feature selection method provided by scikit-learn.

- Determine an appropriate value of k, representing the number of top features to select.
- Implement the SelectKBest method from scikit-learn on the dataset.
- Use the specified statistical evaluation measure (e.g., chi-squared, ANOVA F-value, or mutual information) to rank the features.
- Select the top k features based on their scores to form the reduced feature subset.
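A possible implementation of these steps is sketched below. The ANOVA F-value (`f_classif`) and `k = 10` are assumptions chosen as a starting point, not recommended values, and the data again comes from scikit-learn's bundled copy of the dataset.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

k = 10  # assumption: an arbitrary starting value, to be tuned later
selector = SelectKBest(score_func=f_classif, k=k)
X_train_sel = selector.fit_transform(X_train, y_train)  # score features on training data
X_test_sel = selector.transform(X_test)                 # keep the same columns for testing

selected = X.columns[selector.get_support()]            # names of the retained features
print(X_train_sel.shape, X_test_sel.shape)
print(list(selected))
```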

In [None]:
#Feature selection
#Feature selection algorithms can be used to select only the most useful features. 
#Apply the SelectKBest feature selection method of scikit-learn: https://scikit-learn.org/stable/modules/feature_selection.html



In [11]:
#Train a Logistic regression model: 
#https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html


In [None]:
#Test the model
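Training and testing could be sketched as follows. Wrapping the scaler and the classifier in a pipeline is a design choice made here, not a requirement of the practical, and `max_iter=1000` is an assumed setting that gives the solver room to converge.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

# Scaling and classification in one estimator, fitted on the training split
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

y_pred = model.predict(X_test)               # hard class labels (0 or 1)
y_proba = model.predict_proba(X_test)[:, 1]  # estimated probability of class 1 (benign)
print(y_pred[:10])
print(y_proba[:5].round(3))
```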


### Results and Analysis 
#### 6. **Perform the following steps**


- Based on the earlier analysis of class imbalance, feature normalization, and feature selection, train multiple logistic regression models, each incorporating different variations such as comparing pre-processing techniques to no pre-processing, class balancing to no balancing, and feature selection to non-selected features.
- Compare the performance of these logistic regression models using the classification accuracy measure.
- Reflect on whether classification accuracy alone is a good measure of the model's overall performance, taking into account the potential impact of these factors.
- Analyze if the earlier steps taken to address class imbalance, feature selection, and normalization have influenced the model's performance and the resulting classification accuracy.
- Discuss the limitations of classification accuracy as a single metric, considering potential issues like misclassification costs, imbalanced datasets, or the presence of different evaluation criteria in the problem domain.
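One way to organize such a comparison is sketched below. The four variants and their settings are assumptions chosen for illustration; other combinations can be added to the dictionary in the same way.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

# Each variant switches one preprocessing choice on or off
variants = {
    "no pre-processing": make_pipeline(LogisticRegression(max_iter=10000)),
    "scaled": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "scaled + class weights": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000, class_weight="balanced")
    ),
    "scaled + SelectKBest(k=10)": make_pipeline(
        StandardScaler(), SelectKBest(f_classif, k=10), LogisticRegression(max_iter=1000)
    ),
}

results = {}
for name, pipe in variants.items():
    pipe.fit(X_train, y_train)
    results[name] = accuracy_score(y_test, pipe.predict(X_test))
    print(f"{name:<28} accuracy = {results[name]:.3f}")
```

The unscaled variant may emit a convergence warning, which is itself a useful observation when reflecting on the effect of feature normalization.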

In [None]:
#Calculate the classification accuracy of the model
#Is the classification accuracy a good measure of the model performance?
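A quick way to probe the second question is to compare against a trivial baseline. The sketch below uses `DummyClassifier`, which is not part of the practical, to always predict the majority class.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

# Baseline that ignores the features and always predicts the majority class
dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
y_dummy = dummy.predict(X_test)

acc = accuracy_score(y_test, y_dummy)
malignant_recall = recall_score(y_test, y_dummy, pos_label=0)  # 0 = malignant
print(f"accuracy = {acc:.3f}, recall for the malignant class = {malignant_recall:.3f}")
```

The baseline reaches an accuracy equal to the share of the majority class while detecting no malignant sample at all, which shows why accuracy alone can be misleading.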


#### 7. **Visualize the confusion matrix**
A confusion matrix is a table that visualizes the performance of a classification model by presenting the counts of predicted and actual class labels. It is a useful tool for evaluating the performance of machine learning models, especially in binary classification tasks. The matrix organizes the predictions into four categories: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). The confusion matrix allows for a detailed analysis of the model's performance, particularly in terms of misclassifications. It provides insights into the types and quantities of errors made by the model and can be used to calculate various evaluation metrics such as accuracy, precision, recall, and F1 score.

The confusion matrix helps in identifying the following aspects:

 - Overall Model Accuracy: It provides a measure of how well the model performs in terms of correctly predicting both positive and negative samples.

 - Class-Specific Accuracy: It allows for an assessment of the model's performance for each individual class, indicating if the model is biased towards a particular class or struggling to predict certain classes accurately.

 - Misclassification Patterns: By examining the distribution of false positives and false negatives, patterns and tendencies in misclassifications can be identified, offering insights into potential model weaknesses and areas for improvement.
 
**Perform the following tasks**

- Visualize the confusion matrices of the different models you have trained using the ConfusionMatrixDisplay class from scikit-learn: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.ConfusionMatrixDisplay.html

- Reflect on the confusion matrix visualization and accuracy measures.
- Consider potential issues related to overall accuracy, such as imbalanced datasets or misclassification costs.
- Analyze the accuracy of each class to identify specific issues, such as low accuracy for certain classes or imbalanced performance.
- Discuss the implications of these issues and their impact on the model's effectiveness in real-world scenarios.
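The sketch below shows one way to compute and display a confusion matrix for a single model, together with the accuracy of each class. The scaled logistic regression pipeline is an assumed example model. With the encoding 0 = malignant and 1 = benign, scikit-learn treats class 1 (benign) as the positive class by default, so the cells are named by class here to avoid ambiguity.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
y_pred = model.fit(X_train, y_train).predict(X_test)

# Rows are actual classes, columns are predicted classes (order: 0, 1)
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
mal_as_mal, mal_as_ben, ben_as_mal, ben_as_ben = cm.ravel()
print(cm)
print("Malignant samples predicted as benign:", mal_as_ben)

# Accuracy of each class: correct predictions divided by actual samples of that class
per_class_accuracy = cm.diagonal() / cm.sum(axis=1)
print(dict(zip(["malignant", "benign"], per_class_accuracy.round(3))))

# Plot the matrix; in a notebook the figure is rendered inline
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=["malignant", "benign"])
disp.plot(cmap="Blues")
```

In this domain, a malignant sample predicted as benign is usually the more costly error, which is why the per-class view matters more than the overall accuracy.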

In [None]:
# Visualize the confusion matrix of the model:
#https://scikit-learn.org/stable/modules/generated/sklearn.metrics.ConfusionMatrixDisplay.html
# Can you identify any issues related to the overall accuracy and the accuracy of each class?


#### 8. Compute and Interpret Precision, Recall, and F1-Score 

Precision, Recall, and F1-Score are evaluation metrics commonly used in classification tasks to assess the performance of a machine learning model.

- Precision: The proportion of correctly predicted positive samples out of all samples predicted as positive, TP / (TP + FP). It reflects the model's ability to avoid labeling negative samples as positive (false positives). Higher precision means fewer false positives among the predicted positives.
- Recall (also known as sensitivity or true positive rate): The proportion of actual positive samples that the model identifies correctly, TP / (TP + FN). It reflects how well the model captures all the positive samples in the dataset. Higher recall means fewer false negatives.
- F1-Score: The harmonic mean of precision and recall, 2 × Precision × Recall / (Precision + Recall). It combines both into a single metric to which they contribute equally, which is especially useful for imbalanced datasets and when there is a trade-off between minimizing false positives and false negatives.
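As a small worked example of these definitions, the sketch below computes the three metrics from hypothetical confusion-matrix counts and cross-checks them against scikit-learn. The counts are invented for illustration.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical confusion-matrix counts, chosen only for illustration
tp, fp, fn, tn = 40, 10, 5, 45

precision = tp / (tp + fp)                          # 40 / 50
recall = tp / (tp + fn)                             # 40 / 45
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

# Rebuild label vectors with the same counts to cross-check against scikit-learn
y_true = np.array([1] * tp + [0] * fp + [1] * fn + [0] * tn)
y_pred = np.array([1] * tp + [1] * fp + [0] * fn + [0] * tn)

print(f"precision={precision:.3f}, recall={recall:.3f}, F1={f1:.3f}")
print(np.isclose(precision, precision_score(y_true, y_pred)),
      np.isclose(recall, recall_score(y_true, y_pred)),
      np.isclose(f1, f1_score(y_true, y_pred)))
```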

**Perform the following**

Print a detailed classification report showing the precision, recall, and F1-score using classification_report: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html

In [None]:
#Print a detailed classification report showing the precision, recall and F1-score
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html
#Can you compute these quantities from the confusion matrix?
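A report for one trained model might be produced as in the sketch below. The scaled logistic regression pipeline and the class names passed as `target_names` are assumptions that follow the encoding 0 = malignant, 1 = benign.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
y_pred = model.fit(X_train, y_train).predict(X_test)

# Precision, recall, F1-score and support for each class, plus averages
report = classification_report(y_test, y_pred, target_names=["malignant", "benign"])
print(report)
```

The report treats each class in turn as the positive class, so the row for "malignant" gives the recall that matters most when missed malignant cases are costly.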


### Further Model Tuning

To further enhance the performance of the logistic regression model, you can experiment with various techniques and parameters. Here are some additional steps you can take to improve the results:

**Feature Scaling:** Try different types of feature scaling methods and analyze their impact on the classification accuracy of the model. Feature scaling is often beneficial for logistic regression, as it can improve convergence and prevent certain features from dominating others. You can explore techniques such as StandardScaler, MinMaxScaler, or RobustScaler from the scikit-learn library. By applying different scaling methods, you can observe how they affect the model's accuracy and choose the one that yields the best results. More information on feature scaling techniques can be found in the scikit-learn documentation: https://scikit-learn.org/stable/modules/preprocessing.html
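A comparison of the three scalers could be sketched as follows. Using 5-fold cross-validation instead of a single train/test split is an assumption made here to obtain a more stable estimate.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

X, y = load_breast_cancer(return_X_y=True)

scaler_scores = {}
for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    # The scaler is refitted inside each fold, so no fold sees the others' statistics
    pipe = make_pipeline(scaler, LogisticRegression(max_iter=1000))
    scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
    scaler_scores[type(scaler).__name__] = scores.mean()
    print(f"{type(scaler).__name__:<15} mean accuracy = {scores.mean():.3f}")
```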

**SelectKBest:** Vary the parameter k, which represents the number of features selected, in the SelectKBest feature selection method. This technique selects the top k features based on their univariate statistical scores. By experimenting with different values of k, you can observe how it affects the classification accuracy of the model. Try a range of values for k, such as 5, 10, 15, and so on, and evaluate the model's performance. This experimentation will allow you to identify a value of k that maximizes the accuracy. Ideally, choose k using a validation set or cross-validation rather than the test set, so that the final test accuracy remains an unbiased estimate. You can find more details on SelectKBest in the scikit-learn documentation: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html
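The experiment with k could be sketched as a simple loop. The candidate values, the ANOVA F-value score, and the use of 5-fold cross-validation are assumptions made for this illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

k_scores = {}
for k in (5, 10, 15, 20, 30):
    pipe = make_pipeline(
        StandardScaler(),
        SelectKBest(f_classif, k=k),       # keep the k highest-scoring features
        LogisticRegression(max_iter=1000),
    )
    k_scores[k] = cross_val_score(pipe, X, y, cv=5, scoring="accuracy").mean()
    print(f"k = {k:>2}: mean accuracy = {k_scores[k]:.3f}")

best_k = max(k_scores, key=k_scores.get)
print("Best k among the candidates:", best_k)
```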

**Recursive Feature Elimination (RFE):** Explore a different feature selection method like Recursive Feature Elimination. RFE repeatedly fits a model and eliminates the least important features until the desired number of features remains. By using this technique, you can observe its effect on the model's performance. Experiment with the number of features to keep (`n_features_to_select`) and the number of features eliminated at each iteration (`step`), and evaluate how they influence the accuracy. RFE can help identify the most relevant features for the classification task. Refer to the scikit-learn documentation for more information on RFE: https://scikit-learn.org/stable/modules/feature_selection.html
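A minimal RFE sketch is given below. The target of 10 features and a step of 2 are arbitrary assumptions, and the features are standardized first so that the coefficient magnitudes used for ranking are comparable.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# Remove 2 features per iteration until 10 remain (both values are assumptions)
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=10, step=2)
rfe.fit(X_train_s, y_train)

kept = X.columns[rfe.support_]           # names of the retained features
rfe_accuracy = rfe.score(X_test_s, y_test)  # accuracy of the final estimator on kept features
print(list(kept))
print(f"test accuracy with RFE-selected features = {rfe_accuracy:.3f}")
```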

**LogisticRegression() Parameters:** Experiment with different parameters for the LogisticRegression() model itself. You can vary parameters such as the inverse regularization strength (C), penalty type (l1 or l2), solver algorithm, and class weights. By adjusting these parameters, you can observe their impact on the model's performance. For example, you can try different values of C (smaller values mean stronger regularization) and explore different solvers like 'liblinear' or 'lbfgs'. Note that not every solver supports every penalty: 'lbfgs' supports only the l2 penalty (or no penalty), whereas 'liblinear' supports both l1 and l2. Additionally, you can experiment with different class weight configurations to address class imbalance if present in the dataset. The scikit-learn documentation provides more details on the available parameters and their effects: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
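Such a parameter search could be automated with `GridSearchCV`, as in the sketch below. The grid values are assumptions chosen for illustration; penalty and solver can be added to the grid in the same way, subject to the compatibility constraints in the documentation of your installed scikit-learn version.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Parameter names are prefixed with the lower-cased step name of the pipeline
param_grid = {
    "logisticregression__C": [0.01, 0.1, 1.0, 10.0],
    "logisticregression__class_weight": [None, "balanced"],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)   # cross-validation uses the training split only

print(search.best_params_, round(search.best_score_, 3))
print("test accuracy of the best model:", round(search.score(X_test, y_test), 3))
```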

By conducting these experiments, you can fine-tune the logistic regression model and optimize its performance for the given classification task. Remember to carefully analyze the results at each step and choose the parameter configurations that yield the best accuracy, while also checking the precision, recall, and F1-score discussed above.
        