This notebook addresses the class imbalance in the AFKABAN datasets.

In [22]:
import time
import os.path
import numpy as np
import pandas as pd
import pickle
import random
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold, GroupKFold
from sklearn.metrics import f1_score, confusion_matrix
import hyperopt
from hyperopt import tpe
from hpsklearn import HyperoptEstimator
from sklearn.neighbors import KNeighborsClassifier
from datetime import timedelta
from tenacity import retry, stop_after_attempt
import sys, errno
import AZKABANML

In [11]:
# Set paths
path = 'F:/AFKABAN/review2024'
classifypath = 'F:/AFKABAN/Classify/'

df_120 = pd.read_feather(f'{path}/SED_120_df.feather')
df_200 = pd.read_feather(f'{path}/SED_200_df.feather')

track_120 = pd.read_feather(f'{path}/track_120_df.feather')
track_200 = pd.read_feather(f'{path}/track_200_df.feather')

#df_120_w_h = pd.read_feather(f'{path}/df_120_w_h.feather')
#df_200_w_h = pd.read_feather(f'{path}/df_200_w_h.feather')

Class imbalance occurs when the classes are not represented equally. Here, we have ~600 polar cod, ~300 Atlantic cod and ~100 Pandalus (northern shrimp). The imbalance may be even larger when looking at each frequency bandwidth separately.

https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/

Methods to resolve the problem of imbalanced data:

    - Oversampling (duplicate samples from the classes with fewer samples)
    - Undersampling (delete samples from the classes with many samples)
    - Use better metrics (the F1 score, already used here, accounts for both false positives and false negatives)
    - SMOTE?
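
To illustrate why the choice of metric matters, the sketch below compares accuracy with the macro-averaged F1 score for a classifier that always predicts the majority class. The labels are synthetic and only mimic the approximate class sizes quoted above; the use of `average='macro'` is an assumption for illustration, not necessarily the averaging used elsewhere in this project.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Synthetic labels that only mimic the approximate class sizes (not real data)
y_true = np.array(['Polar cod'] * 600 + ['Atlantic cod'] * 300 + ['Northern shrimp'] * 100)

# A degenerate classifier that always predicts the majority class
y_pred = np.array(['Polar cod'] * len(y_true))

accuracy = accuracy_score(y_true, y_pred)
# Macro averaging gives every class the same weight, whatever its size
macro_f1 = f1_score(y_true, y_pred, average='macro', zero_division=0)

print(f'accuracy = {accuracy:.2f}')  # 0.60
print(f'macro F1 = {macro_f1:.2f}')  # 0.25
```

Accuracy looks acceptable because the majority class dominates, while the macro F1 score exposes that two of the three classes are never predicted.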

# Oversampling
The easiest option. With so few samples, oversampling is preferable to undersampling. Northern shrimp and Atlantic cod samples are drawn with replacement until each class matches the number of polar cod samples.

In [12]:
def balance_classes(df, track):
    """Balance the classes by oversampling Atlantic cod and northern shrimp.

    The rows of `df` and `track` must be aligned by position.
    """
    # Number of samples per species
    count = df['Species'].value_counts()

    # Keep every sample of the dominant species
    select_ind = np.where(df['Species'] == 'Polar cod')[0]
    df_balanced = df.iloc[select_ind]
    track_balanced = track.iloc[select_ind]

    name_list = ['Atlantic cod', 'Northern shrimp']

    for spec in name_list:
        # Draw, with replacement, as many rows as there are polar cod
        spec_ind = np.where(df['Species'] == spec)[0]
        new_ind = random.choices(spec_ind, k=int(count['Polar cod']))
        # Use the same positions in both frames so they stay aligned
        df_balanced = pd.concat([df_balanced, df.iloc[new_ind]], ignore_index=True)
        track_balanced = pd.concat([track_balanced, track.iloc[new_ind]], ignore_index=True)

    return df_balanced, track_balanced

In [13]:
df_120_balanced, track_120_balanced = balance_classes(df_120, track_120)
df_200_balanced, track_200_balanced = balance_classes(df_200, track_200)

In [14]:
track_120_balanced

Unnamed: 0,Region_name
0,Region 125
1,Region 125
2,Region 125
3,Region 125
4,Region 125
...,...
2080,Region 301
2081,Region 327
2082,Region 310
2083,Region 301


In [15]:
df_120_balanced.to_feather(f'{path}/df_120_balanced.feather')
df_200_balanced.to_feather(f'{path}/df_200_balanced.feather')

track_120_balanced.to_feather(f'{path}/track_120_balanced.feather')
track_200_balanced.to_feather(f'{path}/track_200_balanced.feather')

With the balanced data, the hyperoptimizer splits the samples into 90% training and 10% testing, with the same number of samples (+/- 1) from each class in each split.
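
A 90/10 split with near-equal class counts can be sketched with `StratifiedKFold` using 10 folds. This is only an illustration: the splitting done inside the hyperoptimizer is not shown in this notebook, and the labels, the 695 samples per class, and `random_state=0` are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy balanced labels: 3 classes x 695 samples (hypothetical stand-in for the Species column)
y = np.repeat(['Polar cod', 'Atlantic cod', 'Northern shrimp'], 695)
X = np.zeros((len(y), 1))  # the features do not influence how the folds are built

# 10 folds: each fold holds out ~10% for testing and keeps ~90% for training
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
train_idx, test_idx = next(skf.split(X, y))

# Count how many samples of each class ended up in the test fold
labels, test_counts = np.unique(y[test_idx], return_counts=True)
print(len(train_idx), len(test_idx))
print(dict(zip(labels, test_counts)))
```

Because 695 is not divisible by 10, each class contributes either 69 or 70 samples to a test fold, which is where the +/- 1 comes from.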

### Linear balanced

In [16]:
# Convert every column except the last one (Species, re-attached below) from dB to linear units
df_120_sigbs_balanced = 10**(df_120_balanced.iloc[:,:-1]/10)
df_200_sigbs_balanced = 10**(df_200_balanced.iloc[:,:-1]/10)

In [17]:
df_120_sigbs_balanced['Species'] = df_120_balanced['Species']
df_200_sigbs_balanced['Species'] = df_200_balanced['Species']

In [18]:
df_120_sigbs_balanced.to_feather(f'{path}/df_120_sigbs_balanced.feather')
df_200_sigbs_balanced.to_feather(f'{path}/df_200_sigbs_balanced.feather')
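
The conversion used above, `10**(x/10)`, maps decibel values to the linear domain. The sketch below shows the round trip on a few values, assuming the feature columns hold target strength in dB and that `sigbs` refers to the backscattering cross-section; the numbers are hypothetical and not taken from the data.

```python
import numpy as np

# Hypothetical target-strength values in dB (not taken from the data)
ts_db = np.array([-30.0, -40.0, -50.0])

# dB -> linear: sigma_bs = 10^(TS / 10)
sigma_bs = 10 ** (ts_db / 10)

# linear -> dB recovers the original values
ts_back = 10 * np.log10(sigma_bs)

print(sigma_bs)
print(ts_back)
```

A difference of 10 dB corresponds to a factor of 10 in the linear domain, so the linear features span several orders of magnitude.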

In [19]:
len(df_120_balanced)/30

69.5