#NOTES#
-logistic regression, SVMs
-predict type of dinosaur
-stratified k-fold cross-validation to evaluate classification models

### HW2 - Multinomial Logistic Regression & SVMs 

In this assignment, you are given a dataset comprising information about dinosaurs. You will use logistic regression and support vector machine models to predict the type of dinosaur based on the provided information. You may utilize built-in libraries. Employ _stratified k-fold cross-validation_ (CV) to evaluate the classification models. Stratification ensures that each CV fold maintains a class distribution similar to that of the entire training set. You can design various experiments by selecting some or all of the information provided in the dataset. Report the best result you obtain from these experiments and observations, and explicitly state your feature selection method in your report when presenting results.

Stratified k-fold cross-validation is a technique used to evaluate the performance of machine learning models, particularly in classification tasks, where the target class distribution may be imbalanced. In this method, the dataset is divided into 'k' equally sized folds, ensuring that each fold maintains a similar distribution of class examples as the entire dataset. This stratification process helps to reduce the bias and variance in model performance estimation by preventing a skewed distribution of classes in the train and test sets.

During the cross-validation process, the model is trained on 'k-1' folds and tested on the remaining fold, iterating this process 'k' times. Each iteration uses a different fold for testing, and the average performance metric (e.g., accuracy) is calculated over all iterations.

Here's a code example using the scikit-learn library:


In [None]:
from sklearn.model_selection import StratifiedKFold
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
import numpy as np

# Load the dataset
iris = load_iris()
X, y = iris.data, iris.target

# Initialize stratified k-fold cross-validation
skf = StratifiedKFold(n_splits=5)  #the number of folds is 5

# Initialize the logistic regression model
model = LogisticRegression(solver='lbfgs', max_iter=1000)

# Perform stratified k-fold cross-validation
accuracy_scores = []
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    
    model.fit(X_train, y_train)
    accuracy = model.score(X_test, y_test)
    accuracy_scores.append(accuracy)

# Calculate the mean accuracy
mean_accuracy = np.mean(accuracy_scores)
print("Mean accuracy:", mean_accuracy)


Mean accuracy: 0.9733333333333334


In this example, we use the Iris dataset and a logistic regression model to demonstrate stratified k-fold cross-validation with 5 folds. The performance of the model is evaluated using accuracy as the performance metric, and the mean accuracy is reported.


Stratified cross-validation is particularly useful when dealing with imbalanced datasets, where some classes have significantly fewer examples compared to others. In such cases, using standard cross-validation might lead to situations where one or more folds contain very few or even none of the underrepresented class instances. This could result in an inaccurate and biased performance estimation of the model, as the model is not adequately tested on all classes.

For balanced datasets, where class distributions are roughly equal, stratified cross-validation may not provide significant benefits over standard cross-validation. However, it is still a good practice to use stratified cross-validation as a default approach, as it generally leads to more stable and reliable performance estimates.
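
The difference between the two splitters can be checked directly by counting minority-class samples per test fold. The sketch below is a minimal illustration on made-up labels (90 samples of class 0 followed by 10 of class 1); the names `y_demo`, `X_demo`, and `minority_counts` are hypothetical and not part of the assignment data.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced labels, sorted by class: 90 of class 0, then 10 of class 1.
y_demo = np.array([0] * 90 + [1] * 10)
X_demo = np.zeros((len(y_demo), 1))  # feature values do not affect how the folds are built


def minority_counts(splitter):
    """Return the number of class-1 samples in each test fold."""
    return [int(y_demo[test_idx].sum()) for _, test_idx in splitter.split(X_demo, y_demo)]


plain_counts = minority_counts(KFold(n_splits=5))
stratified_counts = minority_counts(StratifiedKFold(n_splits=5))

print("KFold:          ", plain_counts)
print("StratifiedKFold:", stratified_counts)
```

With these sorted labels and no shuffling, plain `KFold` places all ten minority samples in the last fold, while `StratifiedKFold` assigns two to every fold.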

### Training and Evaluations

Use the data provided in the _train.csv_ file for training and the _test.csv_ file for testing. For model evaluation, compute _mean weighted F1 scores_. Also compute confusion matrices to evaluate and compare the performance of the classification models.

Here is example code for computing the mean weighted F1 score in a k-fold cross-validation setting:

In [None]:
from sklearn.metrics import f1_score

k = 5  # number of folds
cv = StratifiedKFold(n_splits=k)
f1_scores = []

# Perform k-fold stratified cross-validation
# (X, y and model are reused from the Iris example above)
for train_index, test_index in cv.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Compute the predictions; replace `model` with your own classifier
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Compute the weighted-average F1-score for this fold
    fold_f1_score = f1_score(y_test, y_pred, average='weighted')
    f1_scores.append(fold_f1_score)

# Calculate the mean F1-score across all folds
mean_weighted_f1_score = np.mean(f1_scores)
print("Mean weighted-average F1-score across", k, "folds:", mean_weighted_f1_score)
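
The assignment also asks for confusion matrices. One possible way to obtain a single matrix under cross-validation is to sum the per-fold matrices, so that each sample is counted exactly once as a test example. The sketch below assumes the Iris data as a stand-in for the assignment data; the names `X_cm`, `y_cm`, `cv_cm`, `clf`, and `total_cm` are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold

# Assumption: Iris stands in for the assignment data.
X_cm, y_cm = load_iris(return_X_y=True)
labels = np.unique(y_cm)

cv_cm = StratifiedKFold(n_splits=5)
clf = LogisticRegression(max_iter=1000)

# Sum the per-fold confusion matrices so every sample is counted exactly once.
total_cm = np.zeros((len(labels), len(labels)), dtype=np.int64)
for train_index, test_index in cv_cm.split(X_cm, y_cm):
    clf.fit(X_cm[train_index], y_cm[train_index])
    y_pred = clf.predict(X_cm[test_index])
    # Passing `labels` keeps the matrix shape fixed even if a fold lacks a class.
    total_cm += confusion_matrix(y_cm[test_index], y_pred, labels=labels)

print(total_cm)  # rows: true class, columns: predicted class
```

Because every sample appears in exactly one test fold, each row of the summed matrix adds up to the number of samples in that class.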

__Your Work__:

In [None]:
# Include your code below, adding as many cells as necessary to clearly demonstrate your work
# Please write your code in separate sections for data pre-processing, Logistic Regression, SVM models, etc.

In [1]:
from google.colab import drive # import lib
drive.mount("/content/gdrive") # mount gdrive

Mounted at /content/gdrive


In [66]:
import numpy as np # for np array ops
import matplotlib.pyplot as plt # for visualization purposes
from pandas import DataFrame, read_csv # for reading and parsing .csv files
from sklearn.model_selection import StratifiedKFold
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

In [67]:
sTrainAdd = "/content/gdrive/MyDrive/MsC/cmp712/hw2/train.csv" # path to train.csv
sTestAdd = "/content/gdrive/MyDrive/MsC/cmp712/hw2/test.csv" # path to test.csv
data = read_csv(sTrainAdd)

In [None]:
nRows, nColumns = data.shape
sClass = "type"  # name of the target column

# Split the table into features (x) and target (y)
x = data.drop(columns=[sClass])
y = data[[sClass]]  # kept as a single-column DataFrame

In [70]:
# Encoding: map each categorical feature to integer codes
le = LabelEncoder()
for col in ["diet", "lived_in", "period"]:
  x[col] = le.fit_transform(x[col])

In [71]:
# Limit dataset (temp)
x_ = x.drop(columns=['id', 'name', 'length', 'taxonomy', 'named_by', 'species', 'link'])
print(x_)

     diet  period  lived_in
0       0     108        25
1       2     108         1
2       1      65         5
3       1      43        24
4       1      21        25
..    ...     ...       ...
241     1      11        13
242     1      10         2
243     0      14         5
244     0      42         1
245     3     112        19

[246 rows x 3 columns]


In [74]:
skf = StratifiedKFold(n_splits=5)  # the number of folds is 5
accuracy_scores = []
for train_index, test_index in skf.split(x_, y):
  X_train, X_test = x_.iloc[train_index], x_.iloc[test_index]
  # Flatten the single-column target to 1-D, as scikit-learn expects
  y_train = y.iloc[train_index].values.ravel()
  y_test = y.iloc[test_index].values.ravel()

  svm = SVC(kernel='linear', C=1.0, random_state=42)
  svm.fit(X_train, y_train)
  y_pred = svm.predict(X_test)
  accuracy = accuracy_score(y_test, y_pred)
  accuracy_scores.append(accuracy)

# Compute the mean accuracy across all folds
mean_accuracy = sum(accuracy_scores) / len(accuracy_scores)
print("Mean accuracy:", mean_accuracy)


Mean accuracy: 0.38212244897959186
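
The result above reports accuracy for the SVM only, whereas the assignment asks for mean weighted F1 scores for both model families. A minimal sketch of such a comparison follows, assuming Iris as a stand-in dataset and adding a `StandardScaler` step that is not part of the original notebook; `X_cmp`, `y_cmp`, `cv_cmp`, `candidates`, and `mean_f1` are illustrative names.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: Iris stands in for the encoded dinosaur features and labels.
X_cmp, y_cmp = load_iris(return_X_y=True)
cv_cmp = StratifiedKFold(n_splits=5)

# Scaling inside the pipeline is refit on each training fold, avoiding leakage.
candidates = {
    "logistic regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "linear SVM": make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0)),
}

mean_f1 = {}
for name, estimator in candidates.items():
    # scoring="f1_weighted" computes the weighted-average F1 score on each fold
    scores = cross_val_score(estimator, X_cmp, y_cmp, cv=cv_cmp, scoring="f1_weighted")
    mean_f1[name] = scores.mean()
    print(f"{name}: mean weighted F1 = {mean_f1[name]:.3f}")
```

Using the same `StratifiedKFold` object for both estimators means they are scored on identical folds, which makes the two means directly comparable.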
