### HW2 - Multinomial Logistic Regression & SVMs 

In this assignment, you are given a dataset of information about dinosaurs. You will use logistic regression and support vector machine (SVM) models to predict each dinosaur's type from the provided information. You may use built-in libraries. Use _stratified k-fold cross-validation_ (CV) to evaluate the classification models. Stratification ensures that each CV fold keeps roughly the same class distribution as the entire training set. You can design various experiments using some or all of the information in the dataset. Report the best result you obtained from these experiments and observations, and explicitly state your feature selection method in your report when presenting results.

Stratified k-fold cross-validation is a technique for evaluating machine learning models, particularly in classification tasks where the target class distribution may be imbalanced. The dataset is divided into 'k' folds of approximately equal size, and each fold keeps roughly the same class distribution as the entire dataset. Stratification makes the performance estimate less biased and less variable, because no train or test split ends up with a skewed class distribution.

During the cross-validation process, the model is trained on 'k-1' folds and tested on the remaining fold, iterating this process 'k' times. Each iteration uses a different fold for testing, and the average performance metric (e.g., accuracy) is calculated over all iterations.

Here's a code example using the scikit-learn library:


In [2]:
from sklearn.model_selection import StratifiedKFold
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
import numpy as np

# Load the dataset
iris = load_iris()
X, y = iris.data, iris.target

# Initialize stratified k-fold cross-validation
skf = StratifiedKFold(n_splits=5)  #the number of folds is 5

# Initialize the logistic regression model
model = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=1000)

# Perform stratified k-fold cross-validation
accuracy_scores = []
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    
    model.fit(X_train, y_train)
    accuracy = model.score(X_test, y_test)
    accuracy_scores.append(accuracy)

# Calculate the mean accuracy
mean_accuracy = np.mean(accuracy_scores)
print("Mean accuracy:", mean_accuracy)


Mean accuracy: 0.9733333333333334


In this example, we use the Iris dataset and a logistic regression model to demonstrate stratified k-fold cross-validation with 5 folds. The performance of the model is evaluated using accuracy as the performance metric, and the mean accuracy is reported.


Stratified cross-validation is particularly useful when dealing with imbalanced datasets, where some classes have significantly fewer examples compared to others. In such cases, using standard cross-validation might lead to situations where one or more folds contain very few or even none of the underrepresented class instances. This could result in an inaccurate and biased performance estimation of the model, as the model is not adequately tested on all classes.
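The effect described above can be made concrete with a small sketch. The labels below are a made-up toy example (90 samples of one class, 10 of another, sorted by class), not the dinosaur data; the helper `minority_counts` is a hypothetical name introduced only for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy labels (assumed for illustration): 90 samples of class 0, 10 of class 1
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.zeros((len(y_toy), 1))  # features do not matter for the split itself

def minority_counts(splitter):
    """Number of class-1 samples in the test part of each fold."""
    return [int(y_toy[test_idx].sum()) for _, test_idx in splitter.split(X_toy, y_toy)]

plain_counts = minority_counts(KFold(n_splits=5))
stratified_counts = minority_counts(StratifiedKFold(n_splits=5))

print("KFold:          ", plain_counts)
print("StratifiedKFold:", stratified_counts)
```

With unshuffled `KFold` on these sorted labels, all ten minority samples land in the last fold, whereas every stratified fold receives two of them.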

For balanced datasets, where class distributions are roughly equal, stratified cross-validation may not provide significant benefits over standard cross-validation. However, it is still a good practice to use stratified cross-validation as a default approach, as it generally leads to more stable and reliable performance estimates.

### Training and Evaluations

Use the data provided in _train.csv_ file for training and _test.csv_ file for testing. For model evaluations compute _mean weighted F1 scores_. Also compute confusion matrices to evaluate and compare the performances of the classification models.
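As a brief illustration of what "weighted" means here: the weighted F1 score is the average of the per-class F1 scores, with each class weighted by its number of true samples (its support). The sketch below checks this on made-up labels and predictions; none of these values come from the assignment data.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy labels and predictions, made up purely for illustration
y_true = np.array(["a", "a", "a", "a", "b", "b", "c", "c", "c", "c"])
y_hat = np.array(["a", "a", "a", "b", "b", "c", "c", "c", "c", "a"])

labels = np.unique(y_true)
# F1 score of each class separately, in the order given by `labels`
per_class_f1 = f1_score(y_true, y_hat, labels=labels, average=None)
# Support = number of true samples per class
support = np.array([(y_true == label).sum() for label in labels])

manual_weighted_f1 = float(np.sum(per_class_f1 * support) / support.sum())
library_weighted_f1 = float(f1_score(y_true, y_hat, average="weighted"))

print(manual_weighted_f1, library_weighted_f1)
```

Because frequent classes carry more weight, a model can reach a high weighted F1 while still doing poorly on a rare class, which is one reason to inspect the confusion matrix as well.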

Here is an example of how to compute the mean weighted F1 score in a k-fold cross-validation setting:

In [None]:
from sklearn.metrics import f1_score

# Reuses X, y, model, np and StratifiedKFold from the example above
k = 5
cv = StratifiedKFold(n_splits=k)
f1_scores = []

# Perform k-fold stratified cross-validation
for train_index, test_index in cv.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Fit your classifier and compute its predictions for this fold
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Compute the weighted-average F1-score for this fold
    fold_f1_score = f1_score(y_test, y_pred, average='weighted')
    f1_scores.append(fold_f1_score)

# Calculate the mean F1-score across all folds
mean_weighted_f1_score = np.mean(f1_scores)
print("Mean weighted-average F1-score across", k, "folds:", mean_weighted_f1_score)
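The confusion matrices requested above can be obtained in a similar way. One possible approach, sketched here on the Iris data rather than the dinosaur data, is to collect out-of-fold predictions with `cross_val_predict` and build a single matrix from them. The names `X_demo`, `y_demo`, `demo_cv` and `demo_model` are placeholders for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict

X_demo, y_demo = load_iris(return_X_y=True)
demo_cv = StratifiedKFold(n_splits=5)
demo_model = LogisticRegression(max_iter=1000)

# Each sample is predicted by a model that did not see it during training
y_oof = cross_val_predict(demo_model, X_demo, y_demo, cv=demo_cv)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_demo, y_oof)
print(cm)
```

Off-diagonal entries show which classes get confused with each other, which a single accuracy or F1 number does not reveal.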

__Your Work__:

__Include necessary packages__

In [21]:
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn import metrics
from sklearn import svm
import numpy as np
from sklearn.metrics import f1_score

__Data cleaning & preparation__

In [222]:
tryc = 6 # submission try count

train = pd.read_csv("train.csv", dtype=str)
test = pd.read_csv("test.csv", dtype=str)
train.shape, test.shape

((246, 11), (62, 10))

In [223]:
# One row of the test data has "USA" in the period column instead of lived_in.
# Move the value to lived_in and mark the period as missing before any other processing.
bad_rows = test["period"] == "USA"
test.loc[bad_rows, "lived_in"] = "USA"
test.loc[bad_rows, "period"] = np.nan

__Feature Engineering__

In [224]:
# check how many unique items in each column:
for cg in train.columns:
    print(f"CG: {cg} - nUnique: {train[cg].nunique()}")


CG: id - nUnique: 246
CG: name - nUnique: 246
CG: diet - nUnique: 4
CG: period - nUnique: 127
CG: lived_in - nUnique: 30
CG: type - nUnique: 6
CG: length - nUnique: 64
CG: taxonomy - nUnique: 92
CG: named_by - nUnique: 222
CG: species - nUnique: 224
CG: link - nUnique: 246


In [225]:
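# convert length column to float: keep the number before the first "m", leave missing values as NaN
def split_col(col):
    if isinstance(col, str):
        return float(col.split("m")[0])
    return np.nan  # np.NaN was removed in NumPy 2.0; np.nan works in all versions
train["length"] = train["length"].apply(split_col)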

# drop NaN rows from training data:
train = train.dropna().copy(deep=True)

In [226]:
# keep only the first two words of the period, e.g. "Late Triassic 221-210 ..." -> "LateTriassic"
def f(x):
    if isinstance(x, str):
        return "".join(x.split(" ")[:2])
    return np.nan
train["short_period"] = train["period"].apply(f)

In [227]:
# look at the last 2 levels of the taxonomy separately instead of taking the whole taxonomy list
train["last_taxonomy"] = train["taxonomy"].apply(lambda x: x.split(" ")[-1])
train["penultimate_taxonomy"] = train["taxonomy"].apply(lambda x: x.split(" ")[-2])

train

Unnamed: 0,id,name,diet,period,lived_in,type,length,taxonomy,named_by,species,link,short_period,last_taxonomy,penultimate_taxonomy
1,2,riojasaurus,herbivorous/omnivorous,Late Triassic 221-210 million years ago,Argentina,sauropod,5.15,Dinosauria Saurischia Sauropodomorpha Prosauro...,Bonaparte (1969),incertus,https://www.nhm.ac.uk/discover/dino-directory/...,LateTriassic,Melanorosauridae,Anchisauria
2,3,tsintaosaurus,herbivorous,Late Cretaceous 84-71 million years ago,China,euornithopod,12.00,Dinosauria Ornithischia Genasauria Cerapoda Or...,Young (1958),spinorhinus,https://www.nhm.ac.uk/discover/dino-directory/...,LateCretaceous,Lambeosaurinae,Euhadrosauria
3,4,alamosaurus,herbivorous,Late Cretaceous 70-65 million years ago,USA,sauropod,21.00,Dinosauria Saurischia Sauropodomorpha Sauropod...,Gilmore (1922),sanjuanensis,https://www.nhm.ac.uk/discover/dino-directory/...,LateCretaceous,Lithostrotia,Titanosauria
4,5,iguanodon,herbivorous,Early Cretaceous 140-110 million years ago,United Kingdom,euornithopod,10.00,Dinosauria Ornithischia Genasauria Cerapoda Or...,Boulenger and van Beneden (1881),bernissartensis,https://www.nhm.ac.uk/discover/dino-directory/...,EarlyCretaceous,Iguanodontoidea,Ankylopollexia
5,6,anserimimus,carnivorous,Late Cretaceous 84-65 million years ago,Mongolia,large theropod,3.50,Dinosauria Saurischia Theropoda Neotheropoda T...,Barsbold (1988),planinychus,https://www.nhm.ac.uk/discover/dino-directory/...,LateCretaceous,Ornithomimidae,Ornithomimosauria
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
240,241,alioramus,carnivorous,Late Cretaceous 71-65 million years ago,Mongolia,large theropod,6.00,Dinosauria Saurischia Theropoda Neotheropoda T...,Kurzanov (1976),remotus,https://www.nhm.ac.uk/discover/dino-directory/...,LateCretaceous,Tyrannosauroidea,Tyrannoraptora
242,243,atlascopcosaurus,herbivorous,Early Cretaceous 121-97 million years ago,Australia,euornithopod,3.00,Dinosauria Ornithischia Genasauria Cerapoda Or...,Rich and Rich (1989),loadsi,https://www.nhm.ac.uk/discover/dino-directory/...,EarlyCretaceous,Euornithopoda,Ornithopoda
243,244,microraptor,carnivorous,Early Cretaceous 125-122 million years ago,China,small theropod,0.80,Dinosauria Saurischia Theropoda Neotheropoda T...,Xu Zhou and Wang (2000),zhaoianus,https://www.nhm.ac.uk/discover/dino-directory/...,EarlyCretaceous,Dromaeosauridae,Eumaniraptora
244,245,carnotaurus,carnivorous,Late Cretaceous 70 million years ago,Argentina,large theropod,7.60,Dinosauria Saurischia Theropoda Neotheropoda C...,Bonaparte (1985),sastrei,https://www.nhm.ac.uk/discover/dino-directory/...,LateCretaceous,Abelisauridae,Neoceratosauria


In [228]:
# the "million years ago" part of the period (everything after the first two words)
train["period_mya"] = train["period"].apply(lambda x: " ".join(x.split(" ")[2:]))
train

Unnamed: 0,id,name,diet,period,lived_in,type,length,taxonomy,named_by,species,link,short_period,last_taxonomy,penultimate_taxonomy,period_mya
1,2,riojasaurus,herbivorous/omnivorous,Late Triassic 221-210 million years ago,Argentina,sauropod,5.15,Dinosauria Saurischia Sauropodomorpha Prosauro...,Bonaparte (1969),incertus,https://www.nhm.ac.uk/discover/dino-directory/...,LateTriassic,Melanorosauridae,Anchisauria,221-210 million years ago
2,3,tsintaosaurus,herbivorous,Late Cretaceous 84-71 million years ago,China,euornithopod,12.00,Dinosauria Ornithischia Genasauria Cerapoda Or...,Young (1958),spinorhinus,https://www.nhm.ac.uk/discover/dino-directory/...,LateCretaceous,Lambeosaurinae,Euhadrosauria,84-71 million years ago
3,4,alamosaurus,herbivorous,Late Cretaceous 70-65 million years ago,USA,sauropod,21.00,Dinosauria Saurischia Sauropodomorpha Sauropod...,Gilmore (1922),sanjuanensis,https://www.nhm.ac.uk/discover/dino-directory/...,LateCretaceous,Lithostrotia,Titanosauria,70-65 million years ago
4,5,iguanodon,herbivorous,Early Cretaceous 140-110 million years ago,United Kingdom,euornithopod,10.00,Dinosauria Ornithischia Genasauria Cerapoda Or...,Boulenger and van Beneden (1881),bernissartensis,https://www.nhm.ac.uk/discover/dino-directory/...,EarlyCretaceous,Iguanodontoidea,Ankylopollexia,140-110 million years ago
5,6,anserimimus,carnivorous,Late Cretaceous 84-65 million years ago,Mongolia,large theropod,3.50,Dinosauria Saurischia Theropoda Neotheropoda T...,Barsbold (1988),planinychus,https://www.nhm.ac.uk/discover/dino-directory/...,LateCretaceous,Ornithomimidae,Ornithomimosauria,84-65 million years ago
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
240,241,alioramus,carnivorous,Late Cretaceous 71-65 million years ago,Mongolia,large theropod,6.00,Dinosauria Saurischia Theropoda Neotheropoda T...,Kurzanov (1976),remotus,https://www.nhm.ac.uk/discover/dino-directory/...,LateCretaceous,Tyrannosauroidea,Tyrannoraptora,71-65 million years ago
242,243,atlascopcosaurus,herbivorous,Early Cretaceous 121-97 million years ago,Australia,euornithopod,3.00,Dinosauria Ornithischia Genasauria Cerapoda Or...,Rich and Rich (1989),loadsi,https://www.nhm.ac.uk/discover/dino-directory/...,EarlyCretaceous,Euornithopoda,Ornithopoda,121-97 million years ago
243,244,microraptor,carnivorous,Early Cretaceous 125-122 million years ago,China,small theropod,0.80,Dinosauria Saurischia Theropoda Neotheropoda T...,Xu Zhou and Wang (2000),zhaoianus,https://www.nhm.ac.uk/discover/dino-directory/...,EarlyCretaceous,Dromaeosauridae,Eumaniraptora,125-122 million years ago
244,245,carnotaurus,carnivorous,Late Cretaceous 70 million years ago,Argentina,large theropod,7.60,Dinosauria Saurischia Theropoda Neotheropoda C...,Bonaparte (1985),sastrei,https://www.nhm.ac.uk/discover/dino-directory/...,LateCretaceous,Abelisauridae,Neoceratosauria,70 million years ago
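The `period_mya` strings are kept as text in this notebook. As a possible alternative, not used in the results reported here, the range could be turned into two numeric columns. The sketch below assumes the strings look like the ones shown in the table above; `parse_mya`, `mya_start` and `mya_end` are hypothetical names introduced for illustration.

```python
import re

import numpy as np
import pandas as pd

def parse_mya(text):
    """Return (start, end) in millions of years ago, or (nan, nan) if unparseable."""
    if not isinstance(text, str):
        return (np.nan, np.nan)
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", text)]
    if not numbers:
        return (np.nan, np.nan)
    # A single value such as "70 million years ago" is used for both ends
    return (numbers[0], numbers[-1])

# Small stand-in frame with strings in the same format as the period_mya column
demo = pd.DataFrame(
    {"period_mya": ["221-210 million years ago", "70 million years ago", np.nan]}
)
parsed = pd.DataFrame(
    [parse_mya(value) for value in demo["period_mya"]],
    columns=["mya_start", "mya_end"],
    index=demo.index,
)
demo = demo.join(parsed)
print(demo)
```

Numeric columns like these would let a model treat nearby time ranges as similar, which one-hot encoding of the raw string cannot do.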


In [229]:
train.columns

Index(['id', 'name', 'diet', 'period', 'lived_in', 'type', 'length',
       'taxonomy', 'named_by', 'species', 'link', 'short_period',
       'last_taxonomy', 'penultimate_taxonomy', 'period_mya'],
      dtype='object')

In [230]:
# similarly add new features to the test.csv as well
test["short_period"]=test["period"].apply(f)
test["length"] = test["length"].apply(split_col)

test["last_taxonomy"] = test["taxonomy"].apply(lambda x: x.split(" ")[-1])
test["penultimate_taxonomy"] = test["taxonomy"].apply(lambda x: x.split(" ")[-2])

test["period_mya"] = test["period"].apply(lambda x: " ".join(x.split(" ")[2:]) if type(x)==str else np.nan)
test

Unnamed: 0,id,name,diet,period,lived_in,length,taxonomy,named_by,species,link,short_period,last_taxonomy,penultimate_taxonomy,period_mya
0,1,magyarosaurus,herbivorous,Late Cretaceous 71-65 million years ago,Romania,5.0,Dinosauria Saurischia Sauropodomorpha Sauropod...,Nopcsa (1915),dacus,https://www.nhm.ac.uk/discover/dino-directory/...,LateCretaceous,Lithostrotia,Titanosauria,71-65 million years ago
1,2,camarasaurus,herbivorous,Late Jurassic 150-140 million years ago,USA,23.0,Dinosauria Saurischia Sauropodomorpha Sauropod...,Cope (1877),supremus,https://www.nhm.ac.uk/discover/dino-directory/...,LateJurassic,Camarasauridae,Camarasauromorpha,150-140 million years ago
2,3,gorgosaurus,carnivorous,Late Cretaceous 80-73 million years ago,USA,8.6,Dinosauria Saurischia Theropoda Neotheropoda T...,Lamb (1914),libratus,https://www.nhm.ac.uk/discover/dino-directory/...,LateCretaceous,Albertosaurinae,Tyrannosauridae,80-73 million years ago
3,4,acrocanthosaurus,carnivorous,Early Cretaceous 115-105 million years ago,USA,12.0,Dinosauria Saurischia Theropoda Neotheropoda T...,Stovall and Langston (1950),atokensis,https://www.nhm.ac.uk/discover/dino-directory/...,EarlyCretaceous,Carcharodontosauridae,Allosauroidea,115-105 million years ago
4,5,hagryphus,omnivorous,,USA,3.0,Dinosauria Saurischia Theropoda Neotheropoda T...,Zanno and Sampson (2005),giganteus,https://www.nhm.ac.uk/discover/dino-directory/...,,Oviraptorosauria,Maniraptora,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
57,58,bellusaurus,herbivorous,Mid Jurassic 180-159 million years ago,China,5.0,Dinosauria Saurischia Sauropodomorpha Sauropod...,Dong and Azuma (1990),sui,https://www.nhm.ac.uk/discover/dino-directory/...,MidJurassic,Macronaria,Neosauropoda,180-159 million years ago
58,59,talarurus,herbivorous,Late Cretaceous 99-89 million years ago,Mongolia,6.0,Dinosauria Ornithischia Genasauria Thyreophora...,Maleev (1952),plicatospineus,https://www.nhm.ac.uk/discover/dino-directory/...,LateCretaceous,Ankylosauridae,Ankylosauria,99-89 million years ago
59,60,deltadromeus,carnivorous,Late Cretaceous 99-94 million years ago,Morocco,8.1,Dinosauria Saurischia Theropoda Neotheropoda T...,Sereno Duthiel Iarochene Larsson Lyon Magwene ...,agilis,https://www.nhm.ac.uk/discover/dino-directory/...,LateCretaceous,Coelurosauria,Avetheropoda,99-94 million years ago
60,61,centrosaurus,herbivorous,Late Cretaceous 76-74 million years ago,Canada,6.0,Dinosauria Ornithischia Genasauria Cerapoda Ma...,apertus,,https://www.nhm.ac.uk/discover/dino-directory/...,LateCretaceous,Centrosaurinae,Ceratopsidae,76-74 million years ago


__Select Categories__

In [263]:
train.columns

Index(['id', 'name', 'diet', 'period', 'lived_in', 'type', 'length',
       'taxonomy', 'named_by', 'species', 'link', 'short_period',
       'last_taxonomy', 'penultimate_taxonomy', 'period_mya'],
      dtype='object')

In [264]:
# Select these features because the others ("id", "name", "link", "named_by", "species") are mostly unique per row
# rather than shared by many types of dinosaurs, so they are not useful as categorical variables.
# Note: "length" is numeric, but it is one-hot encoded together with the other features below.
feature_names = ["diet","short_period","lived_in","last_taxonomy", "penultimate_taxonomy", "period_mya", "length"]

In [265]:
trainX = train[feature_names].copy(deep=True)
testX = test[feature_names].copy(deep=True)

In [266]:
trainX

Unnamed: 0,diet,short_period,lived_in,last_taxonomy,penultimate_taxonomy,period_mya,length
1,herbivorous/omnivorous,LateTriassic,Argentina,Melanorosauridae,Anchisauria,221-210 million years ago,5.15
2,herbivorous,LateCretaceous,China,Lambeosaurinae,Euhadrosauria,84-71 million years ago,12.00
3,herbivorous,LateCretaceous,USA,Lithostrotia,Titanosauria,70-65 million years ago,21.00
4,herbivorous,EarlyCretaceous,United Kingdom,Iguanodontoidea,Ankylopollexia,140-110 million years ago,10.00
5,carnivorous,LateCretaceous,Mongolia,Ornithomimidae,Ornithomimosauria,84-65 million years ago,3.50
...,...,...,...,...,...,...,...
240,carnivorous,LateCretaceous,Mongolia,Tyrannosauroidea,Tyrannoraptora,71-65 million years ago,6.00
242,herbivorous,EarlyCretaceous,Australia,Euornithopoda,Ornithopoda,121-97 million years ago,3.00
243,carnivorous,EarlyCretaceous,China,Dromaeosauridae,Eumaniraptora,125-122 million years ago,0.80
244,carnivorous,LateCretaceous,Argentina,Abelisauridae,Neoceratosauria,70 million years ago,7.60


In [267]:
testX

Unnamed: 0,diet,short_period,lived_in,last_taxonomy,penultimate_taxonomy,period_mya,length
0,herbivorous,LateCretaceous,Romania,Lithostrotia,Titanosauria,71-65 million years ago,5.0
1,herbivorous,LateJurassic,USA,Camarasauridae,Camarasauromorpha,150-140 million years ago,23.0
2,carnivorous,LateCretaceous,USA,Albertosaurinae,Tyrannosauridae,80-73 million years ago,8.6
3,carnivorous,EarlyCretaceous,USA,Carcharodontosauridae,Allosauroidea,115-105 million years ago,12.0
4,omnivorous,,USA,Oviraptorosauria,Maniraptora,,3.0
...,...,...,...,...,...,...,...
57,herbivorous,MidJurassic,China,Macronaria,Neosauropoda,180-159 million years ago,5.0
58,herbivorous,LateCretaceous,Mongolia,Ankylosauridae,Ankylosauria,99-89 million years ago,6.0
59,carnivorous,LateCretaceous,Morocco,Coelurosauria,Avetheropoda,99-94 million years ago,8.1
60,herbivorous,LateCretaceous,Canada,Centrosaurinae,Ceratopsidae,76-74 million years ago,6.0


__Build One Hot Encoded Form of the Data__

In [268]:
# Concatenate the train and test features before fitting the encoder so that both get the same one-hot columns.

train_test_X = pd.concat([trainX, testX], ignore_index=True)

transformer = make_column_transformer(
    (OneHotEncoder(sparse_output=False), feature_names),
    remainder='passthrough'
)
transformed = transformer.fit_transform(train_test_X)
train_test_X_dummies = pd.DataFrame(
    transformed,
    columns=transformer.get_feature_names_out()
)

train_X=train_test_X_dummies.iloc[:trainX.shape[0]].copy(deep=True)
test_X=train_test_X_dummies.iloc[trainX.shape[0]:].copy(deep=True)
train_y = train["type"].values

In [269]:
train_X

Unnamed: 0,onehotencoder__diet_carnivorous,onehotencoder__diet_herbivorous,onehotencoder__diet_herbivorous/omnivorous,onehotencoder__diet_omnivorous,onehotencoder__diet_unknown,onehotencoder__short_period_EarlyCretaceous,onehotencoder__short_period_EarlyJurassic,onehotencoder__short_period_LateCretaceous,onehotencoder__short_period_LateJurassic,onehotencoder__short_period_LateTriassic,...,onehotencoder__length_21.5,onehotencoder__length_22.0,onehotencoder__length_23.0,onehotencoder__length_24.0,onehotencoder__length_25.0,onehotencoder__length_26.0,onehotencoder__length_28.0,onehotencoder__length_30.0,onehotencoder__length_35.0,onehotencoder__length_nan
0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
224,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
225,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
226,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
227,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [270]:
test_X

Unnamed: 0,onehotencoder__diet_carnivorous,onehotencoder__diet_herbivorous,onehotencoder__diet_herbivorous/omnivorous,onehotencoder__diet_omnivorous,onehotencoder__diet_unknown,onehotencoder__short_period_EarlyCretaceous,onehotencoder__short_period_EarlyJurassic,onehotencoder__short_period_LateCretaceous,onehotencoder__short_period_LateJurassic,onehotencoder__short_period_LateTriassic,...,onehotencoder__length_21.5,onehotencoder__length_22.0,onehotencoder__length_23.0,onehotencoder__length_24.0,onehotencoder__length_25.0,onehotencoder__length_26.0,onehotencoder__length_28.0,onehotencoder__length_30.0,onehotencoder__length_35.0,onehotencoder__length_nan
229,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
230,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
231,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
232,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
233,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
286,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
287,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
288,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
289,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


<h1> Part 1) Predict using support vector machines (SVMs) </h1>

__Train model & compute mean accuracy and mean f1 score of the model__

In [271]:
# NOTE:
# Compute the cross-validated accuracy for both kernels (rbf and linear),
# then select the better-performing kernel and use it to predict the test data.

# Initialize stratified k-fold cross-validation
skf = StratifiedKFold(n_splits=5)  # 5 folds

# Perform stratified k-fold cross-validation
rbf_accuracy_scores = []
linear_accuracy_scores = []
linear_models = []
rbf_models = []
for train_index, test_index in skf.split(train_X, train_y):
    X_train, X_test = train_X.iloc[train_index], train_X.iloc[test_index]
    y_train, y_test = train_y[train_index], train_y[test_index]

    linear_svm_classifier = svm.SVC(kernel="linear") # linear kernel
    rbf_svm_classifier = svm.SVC(kernel="rbf") # rbf kernel


    # fit classifiers && train models
    linear_svm_classifier.fit(X_train, y_train)
    linear_models.append(linear_svm_classifier)
    rbf_svm_classifier.fit(X_train, y_train)
    rbf_models.append(rbf_svm_classifier)

    # make predictions using trained models
    linear_y_pred = linear_svm_classifier.predict(X_test)
    rbf_y_pred = rbf_svm_classifier.predict(X_test)

    # evaluate scores
    linear_accuracy_scores.append(metrics.accuracy_score(y_test, linear_y_pred))
    rbf_accuracy_scores.append(metrics.accuracy_score(y_test, rbf_y_pred))


# Calculate the mean accuracy
mean_accuracy = np.mean(linear_accuracy_scores)
best_accuracy = np.max(linear_accuracy_scores)
rbf_mean_accuracy = np.mean(rbf_accuracy_scores)
print("Mean accuracy:", mean_accuracy, "best accuracy", best_accuracy, "on split", linear_accuracy_scores.index(best_accuracy))
print("Mean accuracy rbf:", rbf_mean_accuracy)

Mean accuracy: 0.8210628019323671 best accuracy 0.8695652173913043 on split 2
Mean accuracy rbf: 0.812463768115942
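The same kernel comparison can also be written more compactly. The sketch below uses scikit-learn's `cross_validate` to compute accuracy and weighted F1 in a single pass per kernel; it runs on the Iris data as a stand-in, so its numbers are unrelated to the results above, and `summary` is a name introduced only for this illustration.

```python
from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_validate

X_demo, y_demo = load_iris(return_X_y=True)
demo_cv = StratifiedKFold(n_splits=5)

summary = {}
for kernel in ("linear", "rbf"):
    # cross_validate clones the estimator for every fold and scores it with each metric
    scores = cross_validate(
        svm.SVC(kernel=kernel),
        X_demo,
        y_demo,
        cv=demo_cv,
        scoring=["accuracy", "f1_weighted"],
    )
    summary[kernel] = {
        "accuracy": scores["test_accuracy"].mean(),
        "f1_weighted": scores["test_f1_weighted"].mean(),
    }

for kernel, result in summary.items():
    print(kernel, result)
```

Computing both metrics on the same folds avoids running the cross-validation loop twice.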


In [272]:
f1_scores = []
k = 5
skf = StratifiedKFold(n_splits=k)

# Perform k-fold stratified cross-validation
for train_index, test_index in skf.split(train_X, train_y):
    X_train, X_test = train_X.iloc[train_index], train_X.iloc[test_index]
    y_train, y_test = train_y[train_index], train_y[test_index]

    linear_svm_classifier = svm.SVC(kernel="linear") # linear kernel
    # fit classifiers && train models
    linear_svm_classifier.fit(X_train, y_train)
    # make predictions using trained models
    linear_y_pred = linear_svm_classifier.predict(X_test)

    # Compute the weighted-average F1-score for this fold
    fold_f1_score = f1_score(y_test, linear_y_pred, average='weighted')
    f1_scores.append(fold_f1_score)

# Calculate the mean F1-score across all folds
mean_weighted_f1_score = np.mean(f1_scores)
best_f1_score = np.max(f1_scores)
print("Mean weighted-average F1-score across", k, "folds:", mean_weighted_f1_score)
print("Best f1 score was:", best_f1_score)

# pick the linear model from the fold with the highest validation accuracy
linear_svm_classifier = linear_models[linear_accuracy_scores.index(best_accuracy)]

Mean weighted-average F1-score across 5 folds: 0.8146871816227689
Best f1 score was: 0.8622074022941364
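This notebook keeps the model from the best-scoring fold, which was trained on only part of the training rows. A common alternative, sketched below on the Iris data as a stand-in, is to use cross-validation only to choose the setting and then refit that setting on all training rows. `cv_means`, `best_kernel` and `final_model` are illustrative names, and this is not the procedure behind the submitted predictions.

```python
from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score

X_demo, y_demo = load_iris(return_X_y=True)
demo_cv = StratifiedKFold(n_splits=5)

# Step 1: use cross-validation only to compare the candidate kernels
cv_means = {
    kernel: cross_val_score(
        svm.SVC(kernel=kernel), X_demo, y_demo, cv=demo_cv, scoring="f1_weighted"
    ).mean()
    for kernel in ("linear", "rbf")
}
best_kernel = max(cv_means, key=cv_means.get)

# Step 2: refit the chosen kernel on all training rows before predicting unseen data
final_model = svm.SVC(kernel=best_kernel).fit(X_demo, y_demo)
print(best_kernel, cv_means[best_kernel])
```

Refitting uses every labeled example, and it avoids favoring a fold whose validation split merely happened to be easy.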


__Predict test.csv with trained SVM model and save submission file__

In [273]:
linear_svm_predictions = linear_svm_classifier.predict(test_X)

In [274]:
test_X["type"] = linear_svm_predictions

In [275]:
test_X["id"] = range(1, test_X.shape[0]+1)

In [276]:
# prepare submission file now.
submission_file = test_X[["id","type"]]
submission_file

Unnamed: 0,id,type
229,1,sauropod
230,2,sauropod
231,3,large theropod
232,4,large theropod
233,5,small theropod
...,...,...
286,58,sauropod
287,59,armoured dinosaur
288,60,large theropod
289,61,ceratopsian


In [277]:
# save svm submission file
submission_file.to_csv(f"SVM_submission_{tryc}.csv",index=False)

In [None]:
# submitted my work at Kaggle.
# scored 0.96511 at SVM submissions (Muhammet Ali Öztürk)

<h1> Part 2) Predict using the Logistic Regression Model </h1>

In [278]:
# include package
from sklearn.linear_model import LogisticRegression

__Compute mean accuracy and mean f1 scores similarly__

In [279]:
# compute mean accuracy similarly.
k = 5
skf = StratifiedKFold(n_splits=k)  #the number of folds is 5

# Perform stratified k-fold cross-validation
accuracy_scores = []
# The logistic regression classifier uses C=5.0 (weaker regularization than the default C=1.0),
# penalty='l2', solver='lbfgs', multi_class='multinomial' for multinomial (softmax) classification,
# and max_iter=10000 so that the solver has enough iterations to converge.
logistic_models = []

for train_index, test_index in skf.split(train_X, train_y):
    X_train, X_test = train_X.iloc[train_index], train_X.iloc[test_index]
    y_train, y_test = train_y[train_index], train_y[test_index]

    # create a fresh classifier per fold so that logistic_models holds k distinct fitted models
    # (appending one shared object would leave every entry pointing at the last fit)
    logistic_classifier = LogisticRegression(C=5.0, penalty='l2', solver='lbfgs', multi_class='multinomial', max_iter=10000)
    logistic_classifier.fit(X_train, y_train)
    logistic_models.append(logistic_classifier)
    # make predictions using trained models
    logistic_y_pred = logistic_classifier.predict(X_test)

    # evaluate scores
    accuracy_scores.append(metrics.accuracy_score(y_test, logistic_y_pred))


# Calculate the mean accuracy
mean_accuracy = np.mean(accuracy_scores)
print("Mean accuracy:", mean_accuracy)
print(f"Best accuracy was: {np.max(accuracy_scores)} on split {accuracy_scores.index(np.max(accuracy_scores))}")

Mean accuracy: 0.838743961352657
Best accuracy was: 0.9111111111111111 on split 4
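When the fitted model of every fold is stored in a list, each fold needs its own estimator object. Refitting and appending the same object only stores repeated references to one model, which ends up holding the last fit. The sketch below illustrates the difference on the Iris data using `sklearn.base.clone`; the variable names are placeholders.

```python
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

X_demo, y_demo = load_iris(return_X_y=True)
demo_cv = StratifiedKFold(n_splits=5)
template = LogisticRegression(max_iter=1000)

shared_models = []    # the same object appended once per fold
separate_models = []  # one independent copy per fold

for train_idx, _ in demo_cv.split(X_demo, y_demo):
    template.fit(X_demo[train_idx], y_demo[train_idx])
    shared_models.append(template)

    fold_model = clone(template)  # unfitted copy with the same hyperparameters
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    separate_models.append(fold_model)

n_shared = len({id(m) for m in shared_models})      # distinct objects in the first list
n_separate = len({id(m) for m in separate_models})  # distinct objects in the second list
print(n_shared, n_separate)
```

The first list contains a single object repeated five times, while the second contains five separately fitted models, so indexing it by the best fold returns the model that was actually trained on that fold.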


In [280]:
f1_scores = []
skf = StratifiedKFold(n_splits=k)

# Perform k-fold stratified cross-validation
logistic_classifier = LogisticRegression(C=5.0, penalty='l2', solver='lbfgs', multi_class='multinomial', max_iter=10000)
for train_index, test_index in skf.split(train_X, train_y):
    X_train, X_test = train_X.iloc[train_index], train_X.iloc[test_index]
    y_train, y_test = train_y[train_index], train_y[test_index]

    # fit classifiers && train models
    logistic_classifier.fit(X_train, y_train)
    # make predictions using trained models
    logistic_y_pred = logistic_classifier.predict(X_test)

    # Compute the weighted-average F1-score for this fold
    fold_f1_score = f1_score(y_test, logistic_y_pred, average='weighted')
    f1_scores.append(fold_f1_score)

# Calculate the mean F1-score across all folds
mean_weighted_f1_score = np.mean(f1_scores)
print("Mean weighted-average F1-score across", k, "folds:", mean_weighted_f1_score)
print("Best f1 score was:", np.max(f1_scores))

# pick the model from the fold with the highest validation accuracy
logistic_classifier = logistic_models[accuracy_scores.index(np.max(accuracy_scores))]

Mean weighted-average F1-score across 5 folds: 0.8325436508971041
Best f1 score was: 0.9105744754041968


__Predict with LRM && save submission file__

In [281]:
test_X = test_X.drop(["id", "type"], axis=1)
LR_predictions = logistic_classifier.predict(test_X)
test_X["type"] = LR_predictions
test_X["id"] = range(1, test_X.shape[0] + 1)
# prepare the submission file
submission_file = test_X[["id", "type"]]
# save the logistic regression submission file
submission_file.to_csv(f"LR_submission_{tryc}.csv", index=False)

In [282]:
# submitted my work at Kaggle.
# scored 0.91860 at Logistic Regression Model submissions (Muhammet Ali Öztürk)