# Linear discrimination
#### Part of the course on "Foundations of machine learning", Department of Mathematics and Statistics, University of Turku, Finland
#### Lectures available on YouTube: https://youtube.com/playlist?list=PLbkSohdmxoVAZ9DEHEWHjeGK7Ei-DjKHI&si=Msu74_I0qhLrRWcu
#### Code available on GitHub: https://github.com/ionpetre/FoundML_course_assignments

#### This notebook is based on the following sources: 

> https://www.kaggle.com/code/prashant111/logistic-regression-classifier-tutorial/

Linear discrimination is a classification approach where the objective is to learn a simple, linear function that separates the classes. Rather than learning the distribution of each class, this approach focuses on learning the boundary between the classes, a problem that is often simpler to solve. We discuss pairwise separation, gradient descent, and logistic discrimination.
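
As a rough illustration of logistic discrimination trained by gradient descent, here is a minimal NumPy sketch on synthetic two-class data. The function `fit_logistic_gd`, the learning rate, the iteration count, and the toy data are all assumptions made for this example; they are not part of the course material, and the rest of the notebook uses scikit-learn's `LogisticRegression` instead.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, mapping a linear score to a probability."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gd(X, y, lr=0.1, n_iter=2000):
    """Hypothetical helper: batch gradient descent on the mean cross-entropy loss."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)            # predicted probability of class 1
        error = p - y                     # gradient of the loss w.r.t. the linear score
        w -= lr * (X.T @ error) / n_samples
        b -= lr * error.mean()
    return w, b

# Synthetic data (assumption): two Gaussian blobs, one per class
rng = np.random.default_rng(0)
X_toy = np.vstack([
    rng.normal(-2.0, 1.0, size=(50, 2)),
    rng.normal(2.0, 1.0, size=(50, 2)),
])
y_toy = np.array([0] * 50 + [1] * 50)

w, b = fit_logistic_gd(X_toy, y_toy)

# The decision boundary is the line w . x + b = 0; we predict class 1 on its positive side
y_toy_pred = (sigmoid(X_toy @ w + b) >= 0.5).astype(int)
toy_accuracy = (y_toy_pred == y_toy).mean()
print(f"weights={w}, bias={b:.3f}, training accuracy={toy_accuracy:.2f}")
```

The learned weights define a straight line in the feature space, which is what makes the discriminant linear.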

In this notebook we use the UCI heart disease dataset for the tutorial part and the Australian weather dataset for the challenge part. 

#### Load the libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import seaborn as sns  # needed for the confusion-matrix heatmap (sns.heatmap)

In [None]:
# Reset the seed of the random number generator, for reproducibility purposes

import os

def reset_seed(SEED = 0):
    """Reset the seed for every random library in use (System, numpy)"""

    os.environ['PYTHONHASHSEED']=str(SEED)
    np.random.seed(SEED)


reset_seed(220)

## I. Demo logistic regression on the UCI heart disease dataset

#### The UCI heart disease dataset: https://archive.ics.uci.edu/dataset/45/heart+disease

The UCI Heart Disease dataset is a well-known dataset used in machine learning and data analysis to study and predict the presence of heart disease in individuals. It is often referred to as the "Cleveland Heart Disease dataset" because it was collected from the Cleveland Clinic Foundation in the late 1980s.

Here are some key details about the UCI Heart Disease dataset:

1. Data Source: The most commonly used subset was collected at the Cleveland Clinic Foundation. The UCI repository lists Andras Janosi, William Steinbrunn, Matthias Pfisterer, and Robert Detrano as the creators of the full dataset.

2. Data Description: The UCI Heart Disease dataset consists of 303 instances, each representing a patient, and contains 76 attributes. However, only 14 of these attributes are typically used in analysis and modeling. These attributes include features such as age, sex, chest pain type, resting blood pressure, cholesterol level, maximum heart rate, exercise-induced angina, and more.

3. Target Variable: The primary target variable in this dataset is the presence of heart disease, where '0' typically indicates no heart disease and '1' indicates the presence of heart disease. This binary classification task makes it a popular choice for predictive modeling.

4. Purpose: The UCI Heart Disease dataset is commonly used for research, practice, and educational purposes in the field of cardiovascular medicine, as well as in machine learning and data science. Researchers and data scientists use this dataset to develop predictive models for diagnosing heart disease and assessing cardiovascular risk factors.

5. Data Availability: The UCI Heart Disease dataset is publicly available and can be accessed through the UCI Machine Learning Repository or various data science platforms and libraries.

In [None]:
from sklearn.datasets import fetch_openml

X, _ = fetch_openml(
    data_id=43823,
    as_frame=True,
    return_X_y=True,
    parser = 'auto'
)


In [None]:
X.info()

In [None]:
X

In [None]:
# The target feature is 'Heart_Disease'. We save it in the variable y and encode it as 0/1.

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = pd.DataFrame(le.fit_transform(X['Heart_Disease']))
y.value_counts()

In [None]:
# We drop the target feature 'Heart_Disease' from the X dataset. 

X = X.drop('Heart_Disease', axis=1)
X

In [None]:
X_train_valid, X_test, y_train_valid, y_test = train_test_split(
    X, 
    y, 
    test_size=0.2, 
    random_state=150, 
    stratify=y,
    shuffle=True
)

X_train, X_valid, y_train, y_valid = train_test_split(
    X_train_valid, 
    y_train_valid, 
    test_size=0.25, 
    random_state=150, 
    stratify=y_train_valid,
    shuffle=True
)

# convert to pandas dataframe
X_train = pd.DataFrame(X_train, columns=X.columns)
X_valid = pd.DataFrame(X_valid, columns=X.columns)
X_test = pd.DataFrame(X_test, columns=X.columns)
y_train = pd.DataFrame(y_train, columns=y.columns)
y_valid = pd.DataFrame(y_valid, columns=y.columns)
y_test = pd.DataFrame(y_test, columns=y.columns)

del X
del y
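
The two successive splits above give roughly a 60/20/20 partition: 20% of the data is first held out for testing, and 25% of the remaining 80% (that is, 20% of the total) is then held out for validation. A minimal sketch of this arithmetic on synthetic data follows; the arrays `X_split_demo` and `y_split_demo` are made up for the illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): 100 points, two balanced classes
X_split_demo = np.arange(200).reshape(100, 2)
y_split_demo = np.array([0, 1] * 50)

# First split: 80% train+validation, 20% test
X_tv, X_te, y_tv, y_te = train_test_split(
    X_split_demo, y_split_demo, test_size=0.2, random_state=150, stratify=y_split_demo
)

# Second split: 25% of the remaining 80% becomes validation (20% of the total)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tv, y_tv, test_size=0.25, random_state=150, stratify=y_tv
)

print(len(X_tr), len(X_va), len(X_te))  # 60 20 20
```

Because both splits are stratified, each of the three parts keeps approximately the class balance of the full dataset.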

In [None]:
# Standardise the data

from sklearn.preprocessing import StandardScaler

stand_scaler = StandardScaler()
stand_scaler.fit(X_train)

X_train_std = pd.DataFrame(stand_scaler.transform(X_train), columns=X_train.columns)
X_valid_std = pd.DataFrame(stand_scaler.transform(X_valid), columns=X_valid.columns)
X_test_std  = pd.DataFrame(stand_scaler.transform(X_test), columns=X_test.columns)
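
Note that the scaler is fitted on the training data only and then reused for the validation and test data, so that no information from these sets leaks into the preprocessing. The following small sketch, on made-up numbers, shows that `StandardScaler` applies z = (x - mean) / std with the mean and standard deviation of the training set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (assumption): one feature, three training points, one validation point
train_std_demo = np.array([[1.0], [2.0], [3.0]])
valid_std_demo = np.array([[4.0]])

scaler_demo = StandardScaler().fit(train_std_demo)

# Manual computation with the training statistics (population std, ddof=0)
manual_std = (valid_std_demo - train_std_demo.mean(axis=0)) / train_std_demo.std(axis=0)

print(scaler_demo.transform(valid_std_demo), manual_std)
```

The validation point is standardised relative to the training distribution, so its transformed value does not have to lie within the range seen on the training set.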

In [None]:
# info() prints its summary directly and returns None, so we call it on its own
for df in (X_train_std, X_valid_std, X_test_std):
    df.info()
print(y_train.value_counts(), "\n", y_valid.value_counts(), "\n", y_test.value_counts())

#### Train a logistic regression classifier on the standardised data

In [None]:
from sklearn.linear_model import LogisticRegression

log_clf = LogisticRegression(
    penalty='l2', 
    multi_class='auto', 
    max_iter=100, 
    random_state=10,
    class_weight='balanced',
    n_jobs=-1,
    verbose=1,
)

log_clf.fit(X_train_std, np.ravel(y_train))


#### Check the quality of the predictions through the accuracy score (on train and validation) and through the confusion matrix.

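In [None]:
classes = ['NO','YES']

def plot_confusionmatrix(y_train_pred, y_train, dom):
    """Plot the confusion matrix as a heatmap (rows: true classes, columns: predicted classes)."""
    print(f'{dom} confusion matrix:')
    # confusion_matrix expects the true labels first and the predictions second
    cf = confusion_matrix(y_train, y_train_pred, normalize=None)
    plt.figure(figsize=(4,3))
    sns.heatmap(cf, annot=True, yticklabels=classes, xticklabels=classes, cmap='Blues', fmt='g')
    plt.tight_layout()
    plt.show()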

In [None]:
y_train_pred = log_clf.predict(X_train_std)
y_valid_pred = log_clf.predict(X_valid_std)

print("The classification results on the training data:")
print(classification_report(y_train,y_train_pred))
print("Confusion matrix (training data):\n", confusion_matrix(y_train,y_train_pred))

print("\n The classification results on the validation data:")
print(classification_report(y_valid,y_valid_pred))
print("Confusion matrix (validation data):\n", confusion_matrix(y_valid,y_valid_pred))
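
When reading these matrices, keep in mind that scikit-learn's `confusion_matrix` takes the true labels first and the predictions second: rows correspond to the true classes and columns to the predicted classes. A small sketch with made-up labels shows the layout and what happens if the two arguments are swapped.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels (assumption), only to illustrate the layout
y_true_demo = [0, 0, 0, 1, 1]
y_pred_demo = [0, 1, 1, 1, 0]

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = cm_demo.ravel()
print(cm_demo)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")  # TN=1, FP=2, FN=1, TP=1

# Swapping the arguments transposes the matrix, exchanging false positives and false negatives
cm_swapped = confusion_matrix(y_pred_demo, y_true_demo)
print(cm_swapped)
```

For a medical dataset the distinction matters, since a false negative (a missed case of heart disease) and a false positive usually have very different costs.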

#### Training with fewer features

We can use a decision tree as a feature ranking tool, and train the logistic regression model on the top-ranked features. We train the decision tree, and then extract the top features from the trained model. 

#### Train a shallow decision tree classifier (max_depth = 3, max_leaf_nodes = 10, all other parameters at their default values). 

In [None]:
tree_clf = tree.DecisionTreeClassifier(
    ccp_alpha = 0,
    class_weight = None,
    criterion = 'gini',
    max_depth = 3,
    max_features = None,
    max_leaf_nodes = 10,
    min_impurity_decrease = 0.0,
    min_samples_leaf = 1,
    min_samples_split = 2,
    min_weight_fraction_leaf = 0.0,
    random_state = 2023,
    splitter = 'best',
)

tree_clf.fit(X_train_std,y_train)

#### Display the decision tree. 
Each node is shown with its internal characteristics: the decision rule, the impurity, the class balance. The decision that could be made in each node based on the class balance is indicated through the coloring of the node. The more orange the node, the clearer the decision for "Not heart disease". The more blue the node, the clearer the decision for "Heart disease".

In [None]:
plt.figure(figsize=(60,30))
features = list(X_train.columns.values)
classes = ['NO','YES']
tree.plot_tree(
    tree_clf,
    feature_names = features,
    class_names = classes,
    filled = True,
    proportion = False,
    impurity = True, 
    rounded = True,
    fontsize = 16
)
plt.show()

In [None]:
# Model's performance on the training and validation datasets

y_train_pred = tree_clf.predict(X_train_std)
y_valid_pred = tree_clf.predict(X_valid_std)


print("The classification results on the training data:")
print(f'Training score {accuracy_score(y_train, y_train_pred)}')
print(classification_report(y_train,y_train_pred))
print("Confusion matrix (training data):\n", confusion_matrix(y_train,y_train_pred))

print("\n The classification results on the validation data:")
print(f'Validation score {accuracy_score(y_valid, y_valid_pred)}')
print(classification_report(y_valid,y_valid_pred))
print("Confusion matrix (validation data):\n", confusion_matrix(y_valid,y_valid_pred))

In [None]:
# Extract the feature ranking from the decision tree, sort it by the ranking score

feat_imp = pd.DataFrame(tree_clf.feature_importances_, index=X_train_std.columns, columns =["Score"])
feat_imp.sort_values(by='Score', ascending=False)
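
The features with a non-zero score can also be selected programmatically instead of being copied by hand. The sketch below does this on synthetic data generated with `make_classification`; the column names `f0`, ..., `f7` and the tree settings are assumptions made for the illustration, not the heart disease features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption): 8 features, of which only a few are informative
X_arr, y_arr = make_classification(
    n_samples=200, n_features=8, n_informative=3, random_state=0
)
X_imp_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

tree_demo = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_imp_demo, y_arr)

# Keep only the features actually used in a split (non-zero importance), best first
scores = pd.Series(tree_demo.feature_importances_, index=X_imp_demo.columns)
selected = scores[scores > 0].sort_values(ascending=False).index.tolist()
print(selected)
```

A list built this way can be used directly to index the columns of the training and validation data, and it stays in sync with the tree if the tree is retrained.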

In [None]:
# Train a new logistic regression model, just on the features with non-zero ranking score in the decision tree model

log_clf_v2 = LogisticRegression(
    penalty='l2', 
    multi_class='auto', 
    max_iter=100, 
    random_state=10,
    class_weight='balanced',
    n_jobs=-1,
    verbose=1,
)

features = ["Chest_pain_type", 
            "ST_depression", 
            "Slope_of_ST", 
            "Number_of_vessels_fluro",
            "Max_HR",
            "BP"
           ]

log_clf_v2.fit(X_train_std[features], np.ravel(y_train))

y_train_pred = log_clf_v2.predict(X_train_std[features])
y_valid_pred = log_clf_v2.predict(X_valid_std[features])

print("\n The classification results on the training data:")
print(f'Train score {accuracy_score(y_train, y_train_pred)}')
print(classification_report(y_train,y_train_pred))
print("Confusion matrix (training data):\n", confusion_matrix(y_train,y_train_pred))

print("\n The classification results on the validation data:")
print(f'Validation score {accuracy_score(y_valid, y_valid_pred)}')
print(classification_report(y_valid,y_valid_pred))
print("Confusion matrix (validation data):\n", confusion_matrix(y_valid,y_valid_pred))


#### Conclusion
We conclude that the logistic regression model has about the same performance on the full feature space as on the reduced feature set. Since the number of features is relatively small, we may continue using the original features. 

#### We check the performance of the model on the test dataset. 

In [None]:
y_test_pred = log_clf.predict(X_test_std)

print("\n The classification results on the test data:")
print(f'Test score {accuracy_score(y_test, y_test_pred)}')
print(classification_report(y_test,y_test_pred))
print("Confusion matrix (test data):\n", confusion_matrix(y_test,y_test_pred))

#### Decision boundary visualisation
To visualise the decision boundaries we have to project the data onto 2D, and we do this through PCA. We also retrain the models on the transformed data. This is only for visualisation purposes; the actual classification should be done in the original data space. 

In [None]:
# PCA transformation into 2D

from sklearn.decomposition import PCA

pca_model = PCA(n_components = 2)
pca_model.fit(X_train_std, y_train.values.ravel())

X_train_std_pca = pca_model.transform(X_train_std)
X_valid_std_pca = pca_model.transform(X_valid_std)
X_test_std_pca = pca_model.transform(X_test_std)
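
How faithful a 2D picture is depends on how much of the variance the first two principal components retain. A minimal sketch of this check on synthetic data follows; the array `X_pca_demo` and its column scales are assumptions for the illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (assumption): 5 features with very different spreads
rng = np.random.default_rng(0)
X_pca_demo = rng.normal(size=(100, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])

pca_demo = PCA(n_components=2).fit(X_pca_demo)
X_2d_demo = pca_demo.transform(X_pca_demo)

# Fraction of the total variance kept by the two components
kept = pca_demo.explained_variance_ratio_.sum()
print(X_2d_demo.shape, f"variance kept: {kept:.2f}")
```

If the retained fraction is low, the decision regions drawn in 2D may look quite different from the behaviour of the models trained on all features.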

In [None]:
# Create new models for the visualisation in 2D: logistic regression and decision tree

log_clf_2D = LogisticRegression(
    penalty='l2', 
    multi_class='auto', 
    max_iter=100, 
    random_state=10,
    class_weight='balanced',
    n_jobs=-1,
    verbose=1,
)

tree_clf_2D = tree.DecisionTreeClassifier(
    ccp_alpha = 0,
    class_weight = None,
    criterion = 'gini',
    max_depth = 3,
    max_features = None,
    max_leaf_nodes = 10,
    min_impurity_decrease = 0.0,
    min_samples_leaf = 1,
    min_samples_split = 2,
    min_weight_fraction_leaf = 0.0,
    random_state = 2023,
    splitter = 'best',
)

classifiers = [log_clf_2D, tree_clf_2D]

# Train the models on the PCA-transformed data, just for visualisation purposes

for i in range(len(classifiers)):    
    classifiers[i].fit(X_train_std_pca, np.ravel(y_train))



from itertools import product
from sklearn.inspection import DecisionBoundaryDisplay

# Plot the decision regions on the training, on the validation, and on the test data

fig, axes = plt.subplots(1, 2, figsize=(8,4))

for i in range(len(classifiers)):
    axes[i].set(title=classifiers[i].__class__.__name__)
    disp = DecisionBoundaryDisplay.from_estimator(
        classifiers[i], X_train_std_pca, response_method="predict",
        #xlabel=X_train_std.columns, ylabel=y_train.columns,
        alpha=0.5,
        ax=axes[i],
        plot_method='contourf'
    )
    disp.ax_.scatter(X_train_std_pca[:, 0], X_train_std_pca[:, 1], c=y_train, edgecolor="k")

plt.show()

fig, axes = plt.subplots(1, 2, figsize=(8,4))

for i in range(len(classifiers)):
    axes[i].set(title=classifiers[i].__class__.__name__)
    disp = DecisionBoundaryDisplay.from_estimator(
        classifiers[i], X_valid_std_pca, response_method="predict",
        #xlabel=X_train_std.columns, ylabel=y_train.columns,
        alpha=0.5,
        ax=axes[i],
        plot_method='contourf'
    )
    disp.ax_.scatter(X_valid_std_pca[:, 0], X_valid_std_pca[:, 1], c=y_valid, edgecolor="k")

plt.show()

fig, axes = plt.subplots(1, 2, figsize=(8,4))

for i in range(len(classifiers)):
    axes[i].set(title=classifiers[i].__class__.__name__)
    disp = DecisionBoundaryDisplay.from_estimator(
        classifiers[i], X_test_std_pca, response_method="predict",
        #xlabel=X_train_std.columns, ylabel=y_train.columns,
        alpha=0.5,
        ax=axes[i],
        plot_method='contourf'
    )
    disp.ax_.scatter(X_test_std_pca[:, 0], X_test_std_pca[:, 1], c=y_test, edgecolor="k")

plt.show()

In [None]:
from sklearn.metrics import RocCurveDisplay

RocCurveDisplay.from_estimator(log_clf, X_test_std, y_test)
plt.plot([0,1], [0,1], 'k--' )
plt.title('ROC curve for the logistic regression classifier')
plt.show()

In [None]:
del X_train_valid
del y_train_valid
del X_train
del X_train_std
del X_train_std_pca
del y_train
del X_valid
del X_valid_std
del X_valid_std_pca
del y_valid
del X_test
del X_test_std
del X_test_std_pca
del y_test

del log_clf
del log_clf_v2
del log_clf_2D
del tree_clf
del tree_clf_2D

## Challenge: Rain prediction model

We will train a logistic regression model to predict whether it will rain tomorrow. The dataset contains about 10 years of daily weather observations from many locations across Australia.

Data source: http://www.bom.gov.au/climate/dwo/ and http://www.bom.gov.au/climate/data.
Copyright Commonwealth of Australia 2010, Bureau of Meteorology.
Data downloaded from: https://www.kaggle.com/datasets/arunavakrchakraborty/australia-weather-data
Great data exploration analysis on this dataset: https://www.kaggle.com/code/prashant111/logistic-regression-classifier-tutorial/

**Data Description**

Location - Name of the city from Australia.
MinTemp - The Minimum temperature during a particular day. (degree Celsius)
MaxTemp - The maximum temperature during a particular day. (degree Celsius)
Rainfall - Rainfall during a particular day. (millimeters)
Evaporation - Evaporation during a particular day. (millimeters)
Sunshine - Bright sunshine during a particular day. (hours)
WindGustDir - The direction of the strongest gust during a particular day. (16 compass points)
WindGustSpeed - Speed of strongest gust during a particular day. (kilometers per hour)
WindDir9am - The direction of the wind for 10 min prior to 9 am. (compass points)
WindDir3pm - The direction of the wind for 10 min prior to 3 pm. (compass points)
WindSpeed9am - Speed of the wind for 10 min prior to 9 am. (kilometers per hour)
WindSpeed3pm - Speed of the wind for 10 min prior to 3 pm. (kilometers per hour)
Humidity9am - The humidity of the wind at 9 am. (percent)
Humidity3pm - The humidity of the wind at 3 pm. (percent)
Pressure9am - Atmospheric pressure at 9 am. (hectopascals)
Pressure3pm - Atmospheric pressure at 3 pm. (hectopascals)
Cloud9am - Cloud-obscured portions of the sky at 9 am. (eighths)
Cloud3pm - Cloud-obscured portions of the sky at 3 pm. (eighths)
Temp9am - The temperature at 9 am. (degree Celsius)
Temp3pm - The temperature at 3 pm. (degree Celsius)
RainToday - If today is rainy then ‘Yes’. If today is not rainy then ‘No’.
RainTomorrow - If tomorrow is rainy then 1 (Yes). If tomorrow is not rainy then 0 (No).

#### Load the dataset
For this challenge, you need to download the training and the test datasets from Moodle (or from the Kaggle source above). Make sure they are saved in the same folder as your code, or indicate the relative folder location in the read function below. 

In [None]:
X_train = pd.read_csv("AUS_weather_training_data.csv")
X_test = pd.read_csv("AUS_weather_test_data.csv")

#### Q1. How many features do you have in the training dataset (not counting the target feature 'RainTomorrow')? 
#### Q2. How many data points do you have in the training set? 
#### Q3. How many data points do you have in the test set? 
#### Q4. Do you have missing values in the test set? 
#### Q5. Do you have the 'RainTomorrow' feature in the test dataset? 

In [None]:
# Your code here


In [None]:
# Drop the 'row ID' feature from both sets

X_train = X_train.drop("row ID", axis=1)
X_test = X_test.drop("row ID", axis=1)

In [None]:
# find the categorical variables

categorical = [var for var in X_train.columns if X_train[var].dtype=='O']
print('There are {} categorical variables\n'.format(len(categorical)))
print('The categorical variables are :', categorical)

# check for missing values in the categorical variables 

X_train[categorical].isnull().sum()

In [None]:
# find the numerical variables

numerical = [var for var in X_train.columns if X_train[var].dtype!='O']
print('There are {} numerical variables\n'.format(len(numerical)))
print('The numerical variables are :', numerical)

# check missing values in the numerical variables
X_train[numerical].isnull().sum()

In [None]:
# Extract the target variable from the training dataset

y = X_train['RainTomorrow']
X = X_train.drop(['RainTomorrow'], axis=1)

# Update the numerical variables, we need them later
numerical.remove("RainTomorrow")

del X_train

In [None]:
# Split the labelled data into train/validation (the test set is provided as a separate file)

X_train, X_valid, y_train, y_valid = train_test_split(
    X, 
    y, 
    test_size=0.2, 
    random_state=150, 
    shuffle=True,
    stratify = y,
)

# convert to pandas dataframe
X_train = pd.DataFrame(X_train, columns=X.columns)
X_valid = pd.DataFrame(X_valid, columns=X.columns)

del X
del y

print(X_train.info(), "\n", y_train.value_counts())
print(X_valid.info(), "\n", y_valid.value_counts())

In [None]:
# Impute the missing numerical values using the median of that feature
# Impute the missing categorical values using the most frequent value of that feature

print('\n Missing data before imputation:\n', X_train.isnull().sum())

from sklearn.impute import SimpleImputer

frequent_imputer = SimpleImputer(strategy='most_frequent') 
# Your code here: fit the imputer on the categorical features of the training data

X_train[categorical] = frequent_imputer.transform(X_train[categorical])

median_imputer = SimpleImputer(strategy='median') # imputing with the per-feature median
# Your code here: fit the imputer on the numerical features of the training data

X_train[numerical] = median_imputer.transform(X_train[numerical])

print('\n Missing data after imputation:\n', X_train.isnull().sum())

#### One-hot encoding for the categorical features

In [None]:
X_train = pd.get_dummies(X_train, columns = categorical)
print(X_train.info())
print(X_train.columns)
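
One point to watch when the same encoding is later applied to other data: `pd.get_dummies` creates columns only for the categories present in the frame it is given, so a validation or test set may end up with different columns than the training set. One possible way to handle this, sketched below on made-up frames, is to reindex the encoded frame to the training columns. The frames and the `WindDir` and `Temp` columns are assumptions for the illustration.

```python
import pandas as pd

# Made-up frames (assumption): the validation set lacks "S" and "E" and contains an unseen "W"
train_ohe_demo = pd.DataFrame({"WindDir": ["N", "S", "E"], "Temp": [20.0, 25.0, 22.0]})
valid_ohe_demo = pd.DataFrame({"WindDir": ["N", "W"], "Temp": [21.0, 19.0]})

train_enc = pd.get_dummies(train_ohe_demo, columns=["WindDir"])
valid_enc = pd.get_dummies(valid_ohe_demo, columns=["WindDir"])

# Align to the training columns: missing dummies are filled with 0, unseen ones are dropped
valid_enc = valid_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))
print(valid_enc)
```

After the alignment both frames have the same columns in the same order, which is what a fitted scaler or classifier expects.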

#### Feature scaling
We use min-max scaling (MinMaxScaler) to bring all features into [0,1].
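
As a reminder of what this scaling does, the toy sketch below applies x' = (x - min) / (max - min), with the minimum and maximum taken from the training data. The arrays are made up for the illustration and are not the weather data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data (assumption): one feature
train_mm_demo = np.array([[10.0], [20.0], [30.0]])
valid_mm_demo = np.array([[15.0], [40.0]])

mm_demo = MinMaxScaler().fit(train_mm_demo)   # learns min=10 and max=30 from the training data
train_scaled = mm_demo.transform(train_mm_demo)
valid_scaled = mm_demo.transform(valid_mm_demo)

print(train_scaled.ravel())  # [0.  0.5 1. ]
print(valid_scaled.ravel())  # [0.25 1.5 ]
```

Values outside the training range, such as 40 here, are mapped outside [0,1]; this is expected when the scaler is fitted on the training data only.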

In [None]:
# Your code here



#### Model training
We train a logistic regression model to predict if it rains tomorrow

In [None]:
# Apply the same transformations to the validation data. 
# Your code here



In [None]:
# Train a logistic regression model

logreg = LogisticRegression(
    penalty='l2', 
    multi_class='auto', 
    max_iter=1000, 
    random_state=10,
    class_weight='balanced',
    n_jobs=-1,
    verbose=0,
)

# Your code here


#### Q6. What is the accuracy score on the training data (2 decimals only)? 
#### Q7. What is the accuracy score on the validation data (2 decimals only)? 
#### Q8. How many of the predictions on the test data are "1" ("it will rain tomorrow")? 

In [None]:
# Check the model on the test data. 
# Your code here

