# W5 Practicals - Performance

### Aims:
* To gain some practical experience in evaluating supervised machine learning
models.
* To produce some assessable work for this subject.

### Procedure:
In Prac W2 we applied k-NN and decision tree models to simple classification and regression datasets. The evaluation of the models was based on a single 70/30 split of the data into training and test data. In this Prac we will look more closely at the evaluation of these models.
> Select one of the questions (3 – 6) from Prac W2 and revisit it for this prac.

> On Blackboard you will find a [link](https://docs.google.com/spreadsheets/d/1HAIDBp9ofIeEp5_braHBnwJ-heN8qdaxn5qk3q1c0vo/edit?usp=sharing) to a Google spreadsheet. Go there and enter your answers for Q2 – Q5.

In [49]:
# Common Imports
import pandas as pd

In [50]:
# ANSI color
class color:
  YELLOW_BOLD = '\033[1;33m'
  END = '\033[0m'

### Q1: Repeat Q2 from Prac W2 10 times, saving the 10 resulting training and test sets.

In [51]:
# Specific Imports
import os
from sklearn.model_selection import train_test_split

In [53]:
def load_and_split_data(CSV_file, test_size=0.3, random_state=42):
  """
  Loads a CSV file and splits it into training and test sets.

  Parameters:
  CSV_file (str): The path to the CSV file.
  test_size (float): The proportion of the data to include in the test split.
  random_state (int): The seed used by the random number generator.

  Returns:
  tuple: (X_train, X_test, y_train, y_test), in the order returned by train_test_split.
  """
  df = pd.read_csv(CSV_file, header=None)
  X, y = df.iloc[:, :-1].values, df.iloc[:, -1].values
  return train_test_split(X, y, test_size=test_size, random_state=random_state)

def save_split_data(data_splits, iteration, headers, save_dir="datasets"):
    """
    Saves the combined X_train, y_train, X_test, and y_test for one iteration to CSV files.
    The files are named X{iteration}_train.csv and X{iteration}_test.csv.

    Parameters:
    data_splits (tuple): A tuple containing the training and test sets.
    iteration (int): The iteration number.
    headers (list): A list of column headers for the CSV files.
    save_dir (str): The directory where the CSV files will be saved.
    """
    os.makedirs(save_dir, exist_ok=True)

    train_dir = os.path.join(save_dir, 'train')
    test_dir = os.path.join(save_dir, 'test')

    os.makedirs(train_dir, exist_ok=True)
    os.makedirs(test_dir, exist_ok=True)

    X_train, X_test, y_train, y_test = data_splits

    def save_data(X, y, filename):
        combined_data = pd.DataFrame(X)
        combined_data['target'] = y
        combined_data.columns = headers
        combined_data.to_csv(filename, index=False)

    save_data(X_train, y_train, os.path.join(train_dir, f"X{iteration}_train.csv"))
    save_data(X_test, y_test, os.path.join(test_dir, f"X{iteration}_test.csv"))

    # Report both paths relative to save_dir
    print(f"Data saved as 'train/X{iteration}_train.csv' and 'test/X{iteration}_test.csv'")

for task in ["classification", "regression"]:  # `task` avoids shadowing the built-in `type`
    print(f"{color.YELLOW_BOLD}{task.title()}{color.END}")
    if task == "classification":
        CSV = 'w3classif.csv'
        headers = ["feature1", "feature2", "target"]
    elif task == "regression":
        CSV = 'w3regr.csv'
        # The last header labels the target column (y) appended by save_split_data
        headers = ["Feature1", "Feature2"]
    else:
        raise ValueError(f"Unknown task: {task}")

    for i in range(1, 11):
        # random_state=None gives a different random split on every iteration
        data_splits = load_and_split_data(CSV, test_size=0.3, random_state=None)
        save_split_data(data_splits, i, headers, f"dataset/{task}")


Classification
Data saved as 'train/X1_train.csv' and 'test/X1_test.csv'
Data saved as 'train/X2_train.csv' and 'test/X2_test.csv'
Data saved as 'train/X3_train.csv' and 'test/X3_test.csv'
Data saved as 'train/X4_train.csv' and 'test/X4_test.csv'
Data saved as 'train/X5_train.csv' and 'test/X5_test.csv'
Data saved as 'train/X6_train.csv' and 'test/X6_test.csv'
Data saved as 'train/X7_train.csv' and 'test/X7_test.csv'
Data saved as 'train/X8_train.csv' and 'test/X8_test.csv'
Data saved as 'train/X9_train.csv' and 'test/X9_test.csv'
Data saved as 'train/X10_train.csv' and 'test/X10_test.csv'
Regression
Data saved as 'train/X1_train.csv' and 'test/X1_test.csv'
Data saved as 'train/X2_train.csv' and 'test/X2_test.csv'
Data saved as 'train/X3_train.csv' and 'test/X3_test.csv'
Data saved as 'train/X4_train.csv' and 'test/X4_test.csv'
Data saved as 'train/X5_train.csv' and 'test/X5_test.csv'
Data saved as 'train/X6_train.csv' and 'test/X6_test.csv'
Data saved as 'train/X

### Q2: Calculate the training and test set errors over all of the datasets from Q1 and calculate the average training and test errors over the 10 trials. Are the averages lower or higher than the values you found in Prac W3 (or alternatively compare with the values for the first of your 10 runs)?
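
One possible way to approach Q2 is sketched below. It is not the reference solution: it assumes a classification task, a k-NN model with `n_neighbors=3`, and error measured as `1 - accuracy`. The helper `evaluate_splits` is hypothetical, and synthetic data from `make_classification` stands in for `w3classif.csv`. Splits are regenerated from seeds rather than read back from the files saved in Q1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def evaluate_splits(X, y, n_trials=10, test_size=0.3, n_neighbors=3):
    """Hypothetical helper: per-trial training and test misclassification rates."""
    train_errors, test_errors = [], []
    for seed in range(n_trials):
        # A different seed per trial gives distinct but reproducible splits
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=seed
        )
        model = KNeighborsClassifier(n_neighbors=n_neighbors).fit(X_train, y_train)
        # score() returns accuracy, so error = 1 - accuracy
        train_errors.append(1.0 - model.score(X_train, y_train))
        test_errors.append(1.0 - model.score(X_test, y_test))
    return np.array(train_errors), np.array(test_errors)


# Assumption: synthetic two-feature data standing in for w3classif.csv
X, y = make_classification(
    n_samples=200, n_features=2, n_informative=2, n_redundant=0, random_state=0
)
train_err, test_err = evaluate_splits(X, y)
print(f"Mean training error: {train_err.mean():.3f}")
print(f"Mean test error:     {test_err.mean():.3f}")
```

For the regression dataset, the same loop would apply with a regressor and a metric such as mean squared error in place of the misclassification rate.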


### Q3: Repeat Q1 and Q2 but use a different split – try 50/50 or 90/10. Compare your average error values with those you found in Q2.
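
The comparison across split ratios could be sketched as follows. This is an illustration under assumptions: a decision tree with `max_depth=3`, synthetic data in place of the course dataset, and a hypothetical helper `mean_errors`. The `test_size` values 0.3, 0.5 and 0.1 correspond to the 70/30, 50/50 and 90/10 splits.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def mean_errors(X, y, test_size, n_trials=10):
    """Hypothetical helper: mean training and test error over n_trials random splits."""
    train_errors, test_errors = [], []
    for seed in range(n_trials):
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=seed
        )
        model = DecisionTreeClassifier(max_depth=3, random_state=0)
        model.fit(X_train, y_train)
        train_errors.append(1.0 - model.score(X_train, y_train))
        test_errors.append(1.0 - model.score(X_test, y_test))
    return float(np.mean(train_errors)), float(np.mean(test_errors))


# Assumption: synthetic data standing in for the course dataset
X, y = make_classification(
    n_samples=200, n_features=2, n_informative=2, n_redundant=0, random_state=0
)

# test_size 0.3 -> 70/30 split, 0.5 -> 50/50, 0.1 -> 90/10
results = {ts: mean_errors(X, y, ts) for ts in (0.3, 0.5, 0.1)}
for ts, (train_mean, test_mean) in results.items():
    print(f"test_size={ts:.1f}: mean train error={train_mean:.3f}, "
          f"mean test error={test_mean:.3f}")
```

Reusing the same seeds for every ratio keeps the comparison repeatable from run to run.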

### Q4: Calculate the sample standard deviation of your training and test set error values over the 10 trials from Q2 and Q3. What do you observe?
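
The sample standard deviation divides by n - 1 rather than n. NumPy's `np.std` divides by n by default, so `ddof=1` is needed to get the sample version. The error values below are hypothetical and serve only to illustrate the call; they are not results from this prac.

```python
import numpy as np

# Hypothetical error values from 10 trials, for illustration only
test_errors = np.array([0.10, 0.12, 0.08, 0.15, 0.11, 0.09, 0.13, 0.10, 0.14, 0.08])

n = len(test_errors)
sample_std = np.std(test_errors, ddof=1)   # divides by n - 1
population_std = np.std(test_errors)       # NumPy default: divides by n

# The same quantity written out from the definition
manual_std = np.sqrt(np.sum((test_errors - test_errors.mean()) ** 2) / (n - 1))

print(f"Sample std:     {sample_std:.4f}")
print(f"Population std: {population_std:.4f}")
```

Note that `pandas.Series.std` uses `ddof=1` by default, so the two libraries give different answers unless `ddof` is set explicitly.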

### Q5: Perform 10-fold cross-validation using your model and the (original) dataset (use existing MATLAB or Python functions to do this). What are the mean and standard deviation of the cross-validation error?
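
In Python, scikit-learn's `cross_val_score` is one existing function that fits this question. The sketch below assumes a k-NN classifier with `n_neighbors=3` and synthetic data in place of the original dataset; both are illustrative choices, not part of the prac.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Assumption: synthetic data standing in for the original dataset
X, y = make_classification(
    n_samples=200, n_features=2, n_informative=2, n_redundant=0, random_state=0
)

model = KNeighborsClassifier(n_neighbors=3)

# cv=10 requests 10 folds; for a classifier the default score is accuracy
scores = cross_val_score(model, X, y, cv=10)
cv_errors = 1.0 - scores  # convert per-fold accuracy to per-fold error

print(f"Mean CV error: {cv_errors.mean():.3f}")
print(f"Std of CV error (sample, ddof=1): {cv_errors.std(ddof=1):.3f}")
```

For a regression model, a `scoring` argument such as `"neg_mean_squared_error"` would be passed instead, with the sign flipped to recover the error.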