## Feature Engineering

Possible binning ideas:
- number of migrations
- debt
- win ratio
- age
- pokemon level

Possible interaction features:
- criminal record * debt: multiply the two, or combine them into one category after binning (0.5 correlation)
- number of migrations * criminal record (0.15 correlation)
- number of migrations * debt (0.15 correlation)
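
A minimal sketch of the two options named in the first bullet, on a hypothetical four-row frame. The source column names follow the dataset, but the derived names `Record_x_Debt`, `Debt_Level`, and `Record_Debt_Cat`, and the two-level debt split at 100,000, are assumptions made only for illustration.

```python
import pandas as pd

# Hypothetical toy rows; values are illustrative only
toy = pd.DataFrame({
    "Criminal_Record": [0, 1, 1, 0],
    "Debt_to_Kanto": [24511, 177516, 126923, 39739],
})

# Option 1: numeric interaction by multiplication
toy["Record_x_Debt"] = toy["Criminal_Record"] * toy["Debt_to_Kanto"]

# Option 2: bin debt first (assumed split at 100,000), then join the two categories
toy["Debt_Level"] = pd.cut(
    toy["Debt_to_Kanto"],
    bins=[-1, 100_000, 400_001],
    labels=["Low", "High"],
    right=False,
)
toy["Record_Debt_Cat"] = (
    toy["Criminal_Record"].astype(str) + "_" + toy["Debt_Level"].astype(str)
)

print(toy[["Record_x_Debt", "Record_Debt_Cat"]])
```

The multiplied column is zero whenever there is no criminal record, so the combined category may be the easier one to interpret.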

Nominal:
- city
- occupation
- pokemon type
- pokeball usage
- battle strategy
- criminal record
- champ
- rare item
- charitable activities


Ordinal:
- economic status
- gym badges
- win rate (bin)
- debt (bin)
- age (bin)
- pokemon level (bin)
- migration (bin)

Let's figure out how to create bins for the five columns.

Possible bins:
- Age
  - equal bins: 'teen, young adult, adult, senior'
- Win rate (two iterations possible; might change the quintile to a septile)
  - equal bins: 'very low, low, average, high, very high'
  - quintile
- Debt
  - equal bins
- Pokemon level
  - equal bins: 'very low, low, average, high, very high'
- Migration
  - quartile
  - equal bins: 'very low, low, average, high, very high'
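
To make the difference between the two binning styles concrete, here is a small sketch on made-up, skewed values. It reads "equal bins" as equal-width bins (`pd.cut`) and "quintile" as quantile bins (`pd.qcut`). That reading and the sample values are assumptions.

```python
import pandas as pd

# Hypothetical skewed sample: most values are small, two are large
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 50, 100])
labels = ["Very Low", "Low", "Average", "High", "Very High"]

# Equal-width: five intervals of the same width across the value range
equal_width = pd.cut(values, bins=5, labels=labels)

# Quantile: five intervals holding roughly the same number of rows
quantile = pd.qcut(values, q=5, labels=labels)

print(equal_width.value_counts(sort=False))
print(quantile.value_counts(sort=False))
```

On skewed data the equal-width version piles most rows into the lowest bin, while the quantile version spreads them evenly, which may matter for columns such as debt.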

In [57]:
import pandas as pd
import numpy as np

# Forward slash avoids the invalid '\p' escape and works on every OS
path = 'data/pokemon_team_rocket_dataset.csv'
df = pd.read_csv(path)

### Update column names

In [58]:
df.columns = [col.replace(' ', '_') for col in df.columns]
df.columns

Index(['ID', 'Age', 'City', 'Economic_Status', 'Profession',
       'Most_Used_Pokemon_Type', 'Average_Pokemon_Level', 'Criminal_Record',
       'PokéBall_Usage', 'Win_Ratio', 'Number_of_Gym_Badges',
       'Is_Pokemon_Champion', 'Battle_Strategy', 'Number_of_Migrations',
       'Rare_Item_Holder', 'Debt_to_Kanto', 'Charity_Participation',
       'Team_Rocket'],
      dtype='object')

### Helper functions

In [59]:
def add_bins(df, column_name, new_column_name, labels, bins):
    """Add a binned version of `column_name` to `df` as `new_column_name`.

    A list of bin edges uses pd.cut with left-closed intervals [a, b);
    an integer uses pd.qcut with that many quantile bins.
    """
    # Guard against a missing source column
    if column_name not in df.columns:
        print(f"Column '{column_name}' not found in dataframe")
        return df

    if isinstance(bins, list):
        # pd.cut needs exactly one more edge than labels
        if len(bins) != len(labels) + 1:
            print("Error: Number of bin edges must be one more than the number of labels")
            return df

        df[new_column_name] = pd.cut(df[column_name], bins=bins, labels=labels, right=False)
    elif isinstance(bins, int):
        df[new_column_name] = pd.qcut(df[column_name], q=bins, labels=labels)
    else:
        # Unsupported type for `bins`
        print("Error: 'bins' must be a list or integer")
        return df
    return df
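
Because the helper passes `right=False` to `pd.cut`, each interval includes its left edge and excludes its right edge. The sketch below uses hypothetical boundary ages to show how that differs from the pandas default (`right=True`). The sample values are assumptions chosen to sit on the edges.

```python
import pandas as pd

# Hypothetical ages placed on or near the bin edges
ages = pd.Series([9, 18, 19, 55, 70])
age_bins = [9, 19, 35, 55, 71]
age_labels = ["Teen", "Young Adult", "Adult", "Senior"]

# Left-closed intervals: [9, 19), [19, 35), [35, 55), [55, 71)
left_closed = pd.cut(ages, bins=age_bins, labels=age_labels, right=False)

# Default right-closed intervals: (9, 19], (19, 35], (35, 55], (55, 71]
right_closed = pd.cut(ages, bins=age_bins, labels=age_labels)

print(pd.DataFrame({"age": ages, "left_closed": left_closed, "right_closed": right_closed}))
```

With left-closed intervals an age of 55 is a Senior and an age of 9 is a Teen. With the default, 55 would be an Adult and 9 would fall outside every bin and become NaN.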

### Age bins

In [60]:
age_bins = [9, 19, 35, 55, 71]
age_labels = ['Teen', 'Young Adult','Adult','Senior']

df = add_bins(df, 'Age', 'Age_Stage', age_labels, age_bins)

### Win Rate

In [61]:
df.head()

Unnamed: 0,ID,Age,City,Economic_Status,Profession,Most_Used_Pokemon_Type,Average_Pokemon_Level,Criminal_Record,PokéBall_Usage,Win_Ratio,Number_of_Gym_Badges,Is_Pokemon_Champion,Battle_Strategy,Number_of_Migrations,Rare_Item_Holder,Debt_to_Kanto,Charity_Participation,Team_Rocket,Age_Stage
0,0,27,Pewter City,Middle,Fisherman,Rock,50,0,DuskBall,51,1,False,Unpredictable,25,False,24511,True,No,Young Adult
1,1,55,Viridian City,Middle,PokéMart Seller,Grass,35,1,HealBall,53,2,False,Unpredictable,19,False,177516,True,Yes,Senior
2,2,14,Pallet Town,High,Police Officer,Poison,96,0,NetBall,76,5,False,Aggressive,18,False,85695,True,No,Teen
3,3,41,Cerulean City,Middle,Gym Leader Assistant,Dragon,23,0,UltraBall,27,0,False,Defensive,10,False,39739,True,No,Adult
4,4,15,Pallet Town,Middle,Gym Leader Assistant,Ground,16,1,HealBall,51,1,False,Aggressive,17,True,126923,False,Yes,Teen


In [62]:
win_rate_bins = [-1,20,40,60,80,101]
win_rate_labels = ['Very Low', 'Low', 'Average', 'High', 'Very High']

df = add_bins(df, 'Win_Ratio', 'Win_Ratio_bins', win_rate_labels, win_rate_bins)
# df = add_bins(df, 'Win_Ratio', 'Win_Ratio_per', win_rate_labels, 5)
df.head()

Unnamed: 0,ID,Age,City,Economic_Status,Profession,Most_Used_Pokemon_Type,Average_Pokemon_Level,Criminal_Record,PokéBall_Usage,Win_Ratio,Number_of_Gym_Badges,Is_Pokemon_Champion,Battle_Strategy,Number_of_Migrations,Rare_Item_Holder,Debt_to_Kanto,Charity_Participation,Team_Rocket,Age_Stage,Win_Ratio_bins
0,0,27,Pewter City,Middle,Fisherman,Rock,50,0,DuskBall,51,1,False,Unpredictable,25,False,24511,True,No,Young Adult,Average
1,1,55,Viridian City,Middle,PokéMart Seller,Grass,35,1,HealBall,53,2,False,Unpredictable,19,False,177516,True,Yes,Senior,Average
2,2,14,Pallet Town,High,Police Officer,Poison,96,0,NetBall,76,5,False,Aggressive,18,False,85695,True,No,Teen,High
3,3,41,Cerulean City,Middle,Gym Leader Assistant,Dragon,23,0,UltraBall,27,0,False,Defensive,10,False,39739,True,No,Adult,Low
4,4,15,Pallet Town,Middle,Gym Leader Assistant,Ground,16,1,HealBall,51,1,False,Aggressive,17,True,126923,False,Yes,Teen,Average


### Debt

In [63]:
debt_bins = [-1,80000, 160000, 240000, 320000, 400001]
debt_labels = ['Very Low', 'Low', 'Average', 'High', 'Very High']

df = add_bins(df, 'Debt_to_Kanto', 'Debt_bin', debt_labels, debt_bins)
# df = add_bins(df, 'Debt_to_Kanto', 'Debt_per', debt_labels, 5)
df.head()

Unnamed: 0,ID,Age,City,Economic_Status,Profession,Most_Used_Pokemon_Type,Average_Pokemon_Level,Criminal_Record,PokéBall_Usage,Win_Ratio,...,Is_Pokemon_Champion,Battle_Strategy,Number_of_Migrations,Rare_Item_Holder,Debt_to_Kanto,Charity_Participation,Team_Rocket,Age_Stage,Win_Ratio_bins,Debt_bin
0,0,27,Pewter City,Middle,Fisherman,Rock,50,0,DuskBall,51,...,False,Unpredictable,25,False,24511,True,No,Young Adult,Average,Very Low
1,1,55,Viridian City,Middle,PokéMart Seller,Grass,35,1,HealBall,53,...,False,Unpredictable,19,False,177516,True,Yes,Senior,Average,Average
2,2,14,Pallet Town,High,Police Officer,Poison,96,0,NetBall,76,...,False,Aggressive,18,False,85695,True,No,Teen,High,Low
3,3,41,Cerulean City,Middle,Gym Leader Assistant,Dragon,23,0,UltraBall,27,...,False,Defensive,10,False,39739,True,No,Adult,Low,Very Low
4,4,15,Pallet Town,Middle,Gym Leader Assistant,Ground,16,1,HealBall,51,...,False,Aggressive,17,True,126923,False,Yes,Teen,Average,Low


### Pokemon Level

In [64]:
pokemon_levels_bins = win_rate_bins
pokemon_levels_labels = ['Very Low', 'Low', 'Average', 'High', 'Very High']

df = add_bins(df, 'Average_Pokemon_Level', 'Pokemon_Level_bin', pokemon_levels_labels, pokemon_levels_bins)
# df = add_bins(df, 'Average_Pokemon_Level', 'Pokemon_Level_per', pokemon_levels_labels, 5)

df.head()



Unnamed: 0,ID,Age,City,Economic_Status,Profession,Most_Used_Pokemon_Type,Average_Pokemon_Level,Criminal_Record,PokéBall_Usage,Win_Ratio,...,Battle_Strategy,Number_of_Migrations,Rare_Item_Holder,Debt_to_Kanto,Charity_Participation,Team_Rocket,Age_Stage,Win_Ratio_bins,Debt_bin,Pokemon_Level_bin
0,0,27,Pewter City,Middle,Fisherman,Rock,50,0,DuskBall,51,...,Unpredictable,25,False,24511,True,No,Young Adult,Average,Very Low,Average
1,1,55,Viridian City,Middle,PokéMart Seller,Grass,35,1,HealBall,53,...,Unpredictable,19,False,177516,True,Yes,Senior,Average,Average,Low
2,2,14,Pallet Town,High,Police Officer,Poison,96,0,NetBall,76,...,Aggressive,18,False,85695,True,No,Teen,High,Low,Very High
3,3,41,Cerulean City,Middle,Gym Leader Assistant,Dragon,23,0,UltraBall,27,...,Defensive,10,False,39739,True,No,Adult,Low,Very Low,Low
4,4,15,Pallet Town,Middle,Gym Leader Assistant,Ground,16,1,HealBall,51,...,Aggressive,17,True,126923,False,Yes,Teen,Average,Low,Very Low


### Migrations

In [65]:
migrations_bins = [-1, 6, 12, 18, 24, 31]
migrations_labels = win_rate_labels

df = add_bins(df, 'Number_of_Migrations', 'Migrations_bin', migrations_labels, migrations_bins)
# df = add_bins(df, 'Number_of_Migrations', 'Migrations_per',migrations_labels, 5)
df.head()

Unnamed: 0,ID,Age,City,Economic_Status,Profession,Most_Used_Pokemon_Type,Average_Pokemon_Level,Criminal_Record,PokéBall_Usage,Win_Ratio,...,Number_of_Migrations,Rare_Item_Holder,Debt_to_Kanto,Charity_Participation,Team_Rocket,Age_Stage,Win_Ratio_bins,Debt_bin,Pokemon_Level_bin,Migrations_bin
0,0,27,Pewter City,Middle,Fisherman,Rock,50,0,DuskBall,51,...,25,False,24511,True,No,Young Adult,Average,Very Low,Average,Very High
1,1,55,Viridian City,Middle,PokéMart Seller,Grass,35,1,HealBall,53,...,19,False,177516,True,Yes,Senior,Average,Average,Low,High
2,2,14,Pallet Town,High,Police Officer,Poison,96,0,NetBall,76,...,18,False,85695,True,No,Teen,High,Low,Very High,High
3,3,41,Cerulean City,Middle,Gym Leader Assistant,Dragon,23,0,UltraBall,27,...,10,False,39739,True,No,Adult,Low,Very Low,Low,Low
4,4,15,Pallet Town,Middle,Gym Leader Assistant,Ground,16,1,HealBall,51,...,17,True,126923,False,Yes,Teen,Average,Low,Very Low,Average


### Drop old columns

In [66]:
df.drop(columns=['Win_Ratio', 'Number_of_Migrations', 'Average_Pokemon_Level', 'Age', 'Debt_to_Kanto', 'ID'], inplace=True)

In [67]:
df.head()

Unnamed: 0,City,Economic_Status,Profession,Most_Used_Pokemon_Type,Criminal_Record,PokéBall_Usage,Number_of_Gym_Badges,Is_Pokemon_Champion,Battle_Strategy,Rare_Item_Holder,Charity_Participation,Team_Rocket,Age_Stage,Win_Ratio_bins,Debt_bin,Pokemon_Level_bin,Migrations_bin
0,Pewter City,Middle,Fisherman,Rock,0,DuskBall,1,False,Unpredictable,False,True,No,Young Adult,Average,Very Low,Average,Very High
1,Viridian City,Middle,PokéMart Seller,Grass,1,HealBall,2,False,Unpredictable,False,True,Yes,Senior,Average,Average,Low,High
2,Pallet Town,High,Police Officer,Poison,0,NetBall,5,False,Aggressive,False,True,No,Teen,High,Low,Very High,High
3,Cerulean City,Middle,Gym Leader Assistant,Dragon,0,UltraBall,0,False,Defensive,False,True,No,Adult,Low,Very Low,Low,Low
4,Pallet Town,Middle,Gym Leader Assistant,Ground,1,HealBall,1,False,Aggressive,True,False,Yes,Teen,Average,Low,Very Low,Average


In [68]:
import itertools

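def permuations(listp, listb):
    """Return every position-wise mix of a permutation of listp and one of listb."""
    results = []
    n = len(listp)

    for per in itertools.permutations(listp):
        # Named bin_perm to avoid shadowing the built-in bin()
        for bin_perm in itertools.permutations(listb):
            # Each 0/1 mask picks, per position, from per (0) or bin_perm (1)
            for idx in itertools.product([0, 1], repeat=n):

                temp_perm = []
                for j in range(n):
                    if idx[j] == 0:
                        temp_perm.append(per[j])
                    else:
                        temp_perm.append(bin_perm[j])
                results.append(temp_perm)

    return results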

In [69]:
listp = ['bp','pp','tp','lp']
listb = ['bb', 'pb', 'tb', 'lb']

test = permuations(listp, listb)

In [70]:
len(test)

9216
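
The count can be checked by hand: 4! orderings of the first list, times 4! orderings of the second, times 2^4 masks gives 24 × 24 × 16 = 9216. The sketch below repeats the same enumeration with tuples and also counts distinct results, since different permutation pairs can yield the same mix (for example, an all-zero mask ignores the second list entirely). The variable names here are illustrative.

```python
import itertools
from math import factorial

listp = ["bp", "pp", "tp", "lp"]
listb = ["bb", "pb", "tb", "lb"]

results = []
for per in itertools.permutations(listp):
    for bin_perm in itertools.permutations(listb):
        for mask in itertools.product([0, 1], repeat=len(listp)):
            # Take the element from bin_perm where the mask is 1, else from per
            results.append(tuple(b if m else p for p, b, m in zip(per, bin_perm, mask)))

expected_total = factorial(4) * factorial(4) * 2 ** 4
unique_results = set(results)

print(len(results), expected_total, len(unique_results))
```

Only 1680 of the 9216 entries are distinct, so deduplicating before looping over them could save a good deal of repeated work.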

### Iteration 1

In [71]:
df = df.head(4000).copy()  # keep the first 4,000 rows as an independent frame

# df.drop(columns=['Win_Ratio_per', 'Debt_per', 'Pokemon_Level_per', 'Migrations_per'], inplace=True)



In [72]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report
from sklearn.compose import ColumnTransformer

In [73]:
df.columns

Index(['City', 'Economic_Status', 'Profession', 'Most_Used_Pokemon_Type',
       'Criminal_Record', 'PokéBall_Usage', 'Number_of_Gym_Badges',
       'Is_Pokemon_Champion', 'Battle_Strategy', 'Rare_Item_Holder',
       'Charity_Participation', 'Team_Rocket', 'Age_Stage', 'Win_Ratio_bins',
       'Debt_bin', 'Pokemon_Level_bin', 'Migrations_bin'],
      dtype='object')

In [74]:
ordinal_col = ['Economic_Status', 'Age_Stage', 'Win_Ratio_bins', 'Pokemon_Level_bin', 'Migrations_bin', 'Debt_bin']
ohe_col = ['City', 'Profession', 'Most_Used_Pokemon_Type', 'PokéBall_Usage', 'Battle_Strategy']


In [75]:
eco_order = ['Low', 'Middle', 'High']
age_order = age_labels
order = win_rate_labels
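
As a side note, columns produced by `pd.cut` are already ordered categoricals, so their integer codes follow the label order without a separate encoder. The sketch below shows this on hypothetical ages. It is an alternative view, not the approach the notebook takes.

```python
import pandas as pd

# Hypothetical ages, one per age stage
ages = pd.Series([14, 27, 41, 55])
age_bins = [9, 19, 35, 55, 71]
age_labels = ["Teen", "Young Adult", "Adult", "Senior"]

age_stage = pd.cut(ages, bins=age_bins, labels=age_labels, right=False)

# pd.cut returns an ordered categorical whose categories keep the label order
print(age_stage.cat.ordered)
print(list(age_stage.cat.categories))

# Integer codes follow that order; a value outside every bin would get -1
codes = age_stage.cat.codes
print(codes.tolist())
```

These codes match what an `OrdinalEncoder` given the same category list would assign, because both number the labels by their position in the list.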

In [76]:
ordinal_transformer = OrdinalEncoder(categories=[eco_order, age_order, order, order, order, order])
ohe_transformer = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

ct = ColumnTransformer(
    transformers=[
        ('ordinal', ordinal_transformer, ordinal_col),
        ('onehot', ohe_transformer, ohe_col)
    ],
    remainder='passthrough'
)

transformed_data = ct.fit_transform(df)

onehot_fn = ct.named_transformers_['onehot'].get_feature_names_out(ohe_col)

In [77]:
remainder_col = [col for col in df.columns if col not in ordinal_col + ohe_col]

In [78]:
remainder_col

['Criminal_Record',
 'Number_of_Gym_Badges',
 'Is_Pokemon_Champion',
 'Rare_Item_Holder',
 'Charity_Participation',
 'Team_Rocket']

In [79]:
feature_names = ordinal_col + list(onehot_fn) + remainder_col
len(feature_names)

69

In [80]:
transformed_df = pd.DataFrame(transformed_data, columns=feature_names)
transformed_df

Unnamed: 0,Economic_Status,Age_Stage,Win_Ratio_bins,Pokemon_Level_bin,Migrations_bin,Debt_bin,City_Celadon City,City_Cerulean City,City_Cinnabar Island,City_Fuchsia City,...,PokéBall_Usage_UltraBall,Battle_Strategy_Aggressive,Battle_Strategy_Defensive,Battle_Strategy_Unpredictable,Criminal_Record,Number_of_Gym_Badges,Is_Pokemon_Champion,Rare_Item_Holder,Charity_Participation,Team_Rocket
0,1.0,1.0,2.0,2.0,4.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0,1,False,False,True,No
1,1.0,3.0,2.0,1.0,3.0,2.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,1,2,False,False,True,Yes
2,2.0,0.0,3.0,4.0,3.0,1.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0,5,False,False,True,No
3,1.0,2.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,...,1.0,0.0,1.0,0.0,0,0,False,False,True,No
4,1.0,0.0,2.0,0.0,2.0,1.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,1,1,False,True,False,Yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3995,0.0,1.0,2.0,1.0,2.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0,0,False,False,True,No
3996,0.0,1.0,3.0,4.0,1.0,0.0,0.0,1.0,0.0,0.0,...,1.0,0.0,0.0,1.0,0,3,False,False,True,No
3997,1.0,1.0,3.0,2.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,1.0,0,1,False,False,True,No
3998,0.0,3.0,3.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0,8,False,False,True,No


In [81]:
for col in ordinal_col:
    transformed_df[col] = transformed_df[col].astype(int)

for col in onehot_fn:
    transformed_df[col] = transformed_df[col].astype(int)

transformed_df.head()

Unnamed: 0,Economic_Status,Age_Stage,Win_Ratio_bins,Pokemon_Level_bin,Migrations_bin,Debt_bin,City_Celadon City,City_Cerulean City,City_Cinnabar Island,City_Fuchsia City,...,PokéBall_Usage_UltraBall,Battle_Strategy_Aggressive,Battle_Strategy_Defensive,Battle_Strategy_Unpredictable,Criminal_Record,Number_of_Gym_Badges,Is_Pokemon_Champion,Rare_Item_Holder,Charity_Participation,Team_Rocket
0,1,1,2,2,4,0,0,0,0,0,...,0,0,0,1,0,1,False,False,True,No
1,1,3,2,1,3,2,0,0,0,0,...,0,0,0,1,1,2,False,False,True,Yes
2,2,0,3,4,3,1,0,0,0,0,...,0,1,0,0,0,5,False,False,True,No
3,1,2,1,1,1,0,0,1,0,0,...,1,0,1,0,0,0,False,False,True,No
4,1,0,2,0,2,1,0,0,0,0,...,0,1,0,0,1,1,False,True,False,Yes


In [82]:
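def changetoBool(columns, df):
    """Map True/False to 1/0 in the given columns (the result is integer, not bool)."""
    for col in columns:
        df[col] = df[col].map({True: 1, False: 0})
    return df


# Passthrough columns that still hold True/False after the ColumnTransformer
columns = ['Is_Pokemon_Champion', 'Rare_Item_Holder', 'Charity_Participation']
transformed_df = changetoBool(columns, transformed_df)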
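
A frame built from a mixed-type array stores its numeric and boolean columns as `object`, which is why the casts above are needed. The sketch below shows one possible shortcut with `DataFrame.infer_objects()` on a hypothetical two-row array. The array contents are assumptions standing in for the transformer output.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a mixed-type transformer output
mixed = np.array(
    [[1.0, 0, False, "No"], [3.0, 1, True, "Yes"]],
    dtype=object,
)
frame = pd.DataFrame(
    mixed,
    columns=["Age_Stage", "Criminal_Record", "Rare_Item_Holder", "Team_Rocket"],
)
print(frame.dtypes)

# Let pandas pick a tighter dtype for each object column
fixed = frame.infer_objects()

# Booleans can then be cast to 0/1 directly
fixed["Rare_Item_Holder"] = fixed["Rare_Item_Holder"].astype(int)
print(fixed.dtypes)
```

This converts every column in one call, including passthrough columns such as `Criminal_Record` that would otherwise stay as `object`.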

In [83]:
transformed_df.head()

Unnamed: 0,Economic_Status,Age_Stage,Win_Ratio_bins,Pokemon_Level_bin,Migrations_bin,Debt_bin,City_Celadon City,City_Cerulean City,City_Cinnabar Island,City_Fuchsia City,...,PokéBall_Usage_UltraBall,Battle_Strategy_Aggressive,Battle_Strategy_Defensive,Battle_Strategy_Unpredictable,Criminal_Record,Number_of_Gym_Badges,Is_Pokemon_Champion,Rare_Item_Holder,Charity_Participation,Team_Rocket
0,1,1,2,2,4,0,0,0,0,0,...,0,0,0,1,0,1,0,0,1,No
1,1,3,2,1,3,2,0,0,0,0,...,0,0,0,1,1,2,0,0,1,Yes
2,2,0,3,4,3,1,0,0,0,0,...,0,1,0,0,0,5,0,0,1,No
3,1,2,1,1,1,0,0,1,0,0,...,1,0,1,0,0,0,0,0,1,No
4,1,0,2,0,2,1,0,0,0,0,...,0,1,0,0,1,1,0,1,0,Yes


In [85]:
target = 'Team_Rocket'
x = transformed_df.drop(columns=[target])
y = transformed_df[target]

In [86]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
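
The classification reports in this section show an imbalanced target (639 "No" against 161 "Yes" in the test split). One optional variation, not what the notebook does, is to pass `stratify=` so both splits keep the same class ratio. The sketch below uses hypothetical toy data with an 80/20 class mix.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 80 "No" rows and 20 "Yes" rows
x_toy = pd.DataFrame({"feature": range(100)})
y_toy = pd.Series(["No"] * 80 + ["Yes"] * 20)

# stratify=y_toy keeps the 80/20 class ratio in both splits
x_tr, x_te, y_tr, y_te = train_test_split(
    x_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(y_tr.value_counts())
print(y_te.value_counts())
```

Without stratification the minority share of the test set depends on the random seed, which can make small differences between models harder to compare.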

In [87]:
models = {
    'Logistic Regression': LogisticRegression(),
    'Random Forest': RandomForestClassifier(),
    'Gradient Boosting': GradientBoostingClassifier(),
    'SVM' : SVC()
}

In [88]:
for name, model in models.items():
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    print(f'----- {name} -----')
    print(f'Accuracy: {accuracy_score(y_test, y_pred):.4f}')
    print(classification_report(y_test, y_pred))
    print('\n' + '-'*35 + '\n')

----- Logistic Regression -----
Accuracy: 0.9800
              precision    recall  f1-score   support

          No       0.98      1.00      0.99       639
         Yes       1.00      0.90      0.95       161

    accuracy                           0.98       800
   macro avg       0.99      0.95      0.97       800
weighted avg       0.98      0.98      0.98       800


-----------------------------------

----- Random Forest -----
Accuracy: 0.9888
              precision    recall  f1-score   support

          No       0.99      1.00      0.99       639
         Yes       1.00      0.94      0.97       161

    accuracy                           0.99       800
   macro avg       0.99      0.97      0.98       800
weighted avg       0.99      0.99      0.99       800


-----------------------------------

----- Gradient Boosting -----
Accuracy: 0.9862
              precision    recall  f1-score   support

          No       0.99      1.00      0.99       639
         Yes       0.9