## Feature Engineering

Possible binning ideas:
- number of migrations
- debt
- win ratio
- age
- pokemon level

Possible engineered features:
- criminal record * debt: multiply the two, or combine them into one category after binning (0.5 correlation)
- number of migrations * criminal record (0.15 correlation)
- number of migrations * debt (0.15 correlation)
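
The two options for the criminal record and debt interaction can be sketched as follows. This is a minimal illustration on made-up rows, not the real dataset: the column names mirror the ones used later in this notebook, while `Crime_x_Debt`, `Crime_Debt_Cat`, and the two-level debt split are hypothetical choices. It also assumes `Criminal_Record` is a 0/1 flag, as it appears in the rows shown further down.

```python
import pandas as pd

# Toy rows; values are invented for illustration only.
toy = pd.DataFrame({
    "Criminal_Record": [0, 1, 1, 0],
    "Debt_to_Kanto": [20000, 180000, 90000, 310000],
})

# Option 1: numeric product of the two columns.
toy["Crime_x_Debt"] = toy["Criminal_Record"] * toy["Debt_to_Kanto"]

# Option 2: bin debt first, then join both values into one category label.
# The 100000 split point is an arbitrary assumption.
debt_level = pd.cut(
    toy["Debt_to_Kanto"],
    bins=[0, 100000, 400000],
    labels=["LowDebt", "HighDebt"],
    right=False,
)
toy["Crime_Debt_Cat"] = (
    toy["Criminal_Record"].astype(str) + "_" + debt_level.astype(str)
)

print(toy)
```

With a 0/1 flag the product is zero for everyone without a record, so the combined category may keep more information about debt in that group.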

nominal
- city
- occupation
- pokemon type
- pokeball usage
- battle strategy
- criminal record
- champ
- rare item
- charitable activities
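
Nominal columns have no natural order, so one common way to feed them to a model is one-hot encoding. The sketch below uses `pd.get_dummies` on a few made-up rows; whether one-hot encoding is the right choice here (versus, say, target encoding for high-cardinality columns) is an open assumption, not something decided in these notes.

```python
import pandas as pd

# Toy rows using two of the nominal columns; values are illustrative.
toy = pd.DataFrame({
    "City": ["Pewter City", "Pallet Town", "Pallet Town"],
    "Battle_Strategy": ["Unpredictable", "Aggressive", "Defensive"],
})

# One indicator column per distinct category value.
encoded = pd.get_dummies(toy, columns=["City", "Battle_Strategy"])

print(encoded.columns.tolist())
```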


ordinal
- economic status
- gym badges
- win rate (bin)
- debt (bin)
- age (bin)
- pokemon level (bin)
- migration (bin)
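
For ordinal columns the order of the levels matters, so they can be stored as ordered categoricals and mapped to integer codes. A minimal sketch using the five bin labels from this notebook on made-up values; `Debt_code` is a hypothetical column name, and columns such as `Economic_Status` would need their own level order, which is not spelled out in these notes.

```python
import pandas as pd

# Level order, lowest to highest.
levels = ["Very Low", "Low", "Average", "High", "Very High"]

toy = pd.DataFrame({"Debt_bin": ["Low", "Very High", "Average", "Very Low"]})

# An ordered categorical keeps the ranking explicit.
toy["Debt_bin"] = pd.Categorical(toy["Debt_bin"], categories=levels, ordered=True)

# Integer codes follow the position of each level in `levels`.
toy["Debt_code"] = toy["Debt_bin"].cat.codes

print(toy)
```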

Let's figure out how to create bins for the five columns (age, win rate, debt, Pokémon level, migrations).

Possible bins:
- Age
  - hand-picked bins labelled 'Teen', 'Young Adult', 'Adult', 'Senior'
- Win rate (two versions possible; the quintile split might change to a septile)
  - equal-width bins labelled 'Very Low', 'Low', 'Average', 'High', 'Very High'
  - quintile
- Debt
  - equal-width bins
- Pokémon level
  - equal-width bins labelled 'Very Low', 'Low', 'Average', 'High', 'Very High'
- Migrations
  - quantile bins (planned as quartiles; the cell below uses 5 quantiles)
  - equal-width bins labelled 'Very Low', 'Low', 'Average', 'High', 'Very High'
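
The difference between equal-width bins and quantile bins shows up most clearly on skewed data. The sketch below compares `pd.cut` and `pd.qcut` on a small made-up series (not a column from the dataset); the last edge of 101 is an assumption chosen so that the maximum value is still covered when `right=False`.

```python
import pandas as pd

# Skewed toy values: most are small, two are large.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 50, 100])
labels = ["Very Low", "Low", "Average", "High", "Very High"]

# Equal-width: fixed edges, bin sizes depend on how the data is distributed.
equal_width = pd.cut(values, bins=[0, 20, 40, 60, 80, 101], labels=labels, right=False)

# Quantile: edges are derived from the data so each bin holds a similar row count.
quantile = pd.qcut(values, q=5, labels=labels)

print(equal_width.value_counts(sort=False))
print(quantile.value_counts(sort=False))
```

On this series the equal-width version puts eight of the ten values into 'Very Low', while the quantile version places two values in each bin.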

In [1]:
import pandas as pd
import numpy as np

path = 'data/pokemon_team_rocket_dataset.csv'  # forward slash avoids the invalid '\p' escape sequence
df = pd.read_csv(path)

### Update column names

In [8]:
df.columns = [col.replace(' ', '_') for col in df.columns]
df.columns

Index(['ID', 'Age', 'City', 'Economic_Status', 'Profession',
       'Most_Used_Pokemon_Type', 'Average_Pokemon_Level', 'Criminal_Record',
       'PokéBall_Usage', 'Win_Ratio', 'Number_of_Gym_Badges',
       'Is_Pokemon_Champion', 'Battle_Strategy', 'Number_of_Migrations',
       'Rare_Item_Holder', 'Debt_to_Kanto', 'Charity_Participation',
       'Team_Rocket', 'Age_Stage', 'Win_Ratio_bins'],
      dtype='object')

### Helper functions

In [7]:
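def add_bins(df, column_name, new_column_name, labels, bins):
    """Add a binned copy of `column_name` to `df` as `new_column_name`.

    A list of `bins` is treated as explicit edges (pd.cut); an integer is
    treated as a number of quantiles (pd.qcut). Returns `df` unchanged on bad input.
    """
    # Guard against a missing source column
    if column_name not in df.columns:
        print(f"Column '{column_name}' not found in dataframe")
        return df

    if isinstance(bins, list):
        # n labels need n + 1 edges
        if len(bins) != len(labels) + 1:
            print("Error: Number of bin edges must be one more than the number of labels")
            return df

        # right=False makes each bin [left, right), so a value equal to the last edge becomes NaN
        df[new_column_name] = pd.cut(df[column_name], bins=bins, labels=labels, right=False)
    elif isinstance(bins, int):
        # Quantile-based bins with roughly equal row counts
        df[new_column_name] = pd.qcut(df[column_name], q=bins, labels=labels)
    else:
        print("Error: 'bins' must be a list or integer")
        return df
    return df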
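
One thing to watch with `right=False` in the `pd.cut` branch: each bin is closed on the left and open on the right, so a value equal to the last edge falls outside every bin and becomes `NaN`. Whether the real columns contain such values (for example a win ratio of exactly 100) is not known from these notes. The sketch below demonstrates the effect on made-up scores and shows one possible workaround, widening the last edge by one.

```python
import pandas as pd

# Toy scores including the upper boundary value.
scores = pd.Series([0, 51, 99, 100])
edges = [0, 20, 40, 60, 80, 100]
labels = ["Very Low", "Low", "Average", "High", "Very High"]

# Same call shape as the list branch of add_bins.
binned = pd.cut(scores, bins=edges, labels=labels, right=False)
print(binned.isna().sum())  # count of values that fell outside every bin

# Possible workaround (an assumption, not the notebook's method):
# push the last edge up by one so the maximum is included.
widened_edges = edges[:-1] + [edges[-1] + 1]
widened = pd.cut(scores, bins=widened_edges, labels=labels, right=False)
print(widened.isna().sum())
```

Checking `isna().sum()` on each new bin column after calling `add_bins` is a cheap way to catch this.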

### Age bins

In [9]:
age_bins = [10, 19, 35, 55, 70]
age_labels = ['Teen', 'Young Adult','Adult','Senior']

df = add_bins(df, 'Age', 'Age_Stage', age_labels, age_bins)

### Win Rate

In [6]:
df.head()

Unnamed: 0,ID,Age,City,Economic_Status,Profession,Most_Used_Pokemon_Type,Average_Pokemon_Level,Criminal_Record,PokéBall_Usage,Win_Ratio,Number_of_Gym_Badges,Is_Pokemon_Champion,Battle_Strategy,Number_of_Migrations,Rare_Item_Holder,Debt_to_Kanto,Charity_Participation,Team_Rocket,Age_Stage,Win_Ratio_bins
0,0,27,Pewter City,Middle,Fisherman,Rock,50,0,DuskBall,51,1,False,Unpredictable,25,False,24511,True,No,Young Adult,Average
1,1,55,Viridian City,Middle,PokéMart Seller,Grass,35,1,HealBall,53,2,False,Unpredictable,19,False,177516,True,Yes,Senior,Average
2,2,14,Pallet Town,High,Police Officer,Poison,96,0,NetBall,76,5,False,Aggressive,18,False,85695,True,No,Teen,High
3,3,41,Cerulean City,Middle,Gym Leader Assistant,Dragon,23,0,UltraBall,27,0,False,Defensive,10,False,39739,True,No,Adult,Low
4,4,15,Pallet Town,Middle,Gym Leader Assistant,Ground,16,1,HealBall,51,1,False,Aggressive,17,True,126923,False,Yes,Teen,Average


In [10]:
win_rate_bins = [0,20,40,60,80,100]
win_rate_labels = ['Very Low', 'Low', 'Average', 'High', 'Very High']

df = add_bins(df, 'Win_Ratio', 'Win_Ratio_bins', win_rate_labels, win_rate_bins)
df = add_bins(df, 'Win_Ratio', 'Win_Ratio_per', win_rate_labels, 5)
df.head()

Unnamed: 0,ID,Age,City,Economic_Status,Profession,Most_Used_Pokemon_Type,Average_Pokemon_Level,Criminal_Record,PokéBall_Usage,Win_Ratio,...,Is_Pokemon_Champion,Battle_Strategy,Number_of_Migrations,Rare_Item_Holder,Debt_to_Kanto,Charity_Participation,Team_Rocket,Age_Stage,Win_Ratio_bins,Win_Ratio_per
0,0,27,Pewter City,Middle,Fisherman,Rock,50,0,DuskBall,51,...,False,Unpredictable,25,False,24511,True,No,Young Adult,Average,Average
1,1,55,Viridian City,Middle,PokéMart Seller,Grass,35,1,HealBall,53,...,False,Unpredictable,19,False,177516,True,Yes,Senior,Average,Average
2,2,14,Pallet Town,High,Police Officer,Poison,96,0,NetBall,76,...,False,Aggressive,18,False,85695,True,No,Teen,High,Very High
3,3,41,Cerulean City,Middle,Gym Leader Assistant,Dragon,23,0,UltraBall,27,...,False,Defensive,10,False,39739,True,No,Adult,Low,Very Low
4,4,15,Pallet Town,Middle,Gym Leader Assistant,Ground,16,1,HealBall,51,...,False,Aggressive,17,True,126923,False,Yes,Teen,Average,Average


### Debt

In [12]:
debt_bins = [0,80000, 160000, 240000, 320000, 400000]
debt_labels = ['Very Low', 'Low', 'Average', 'High', 'Very High']

df = add_bins(df, 'Debt_to_Kanto', 'Debt_bin', debt_labels, debt_bins)
df = add_bins(df, 'Debt_to_Kanto', 'Debt_per', debt_labels, 5)
df.head()

Unnamed: 0,ID,Age,City,Economic_Status,Profession,Most_Used_Pokemon_Type,Average_Pokemon_Level,Criminal_Record,PokéBall_Usage,Win_Ratio,...,Number_of_Migrations,Rare_Item_Holder,Debt_to_Kanto,Charity_Participation,Team_Rocket,Age_Stage,Win_Ratio_bins,Win_Ratio_per,Debt_bin,Debt_per
0,0,27,Pewter City,Middle,Fisherman,Rock,50,0,DuskBall,51,...,25,False,24511,True,No,Young Adult,Average,Average,Very Low,Low
1,1,55,Viridian City,Middle,PokéMart Seller,Grass,35,1,HealBall,53,...,19,False,177516,True,Yes,Senior,Average,Average,Average,Very High
2,2,14,Pallet Town,High,Police Officer,Poison,96,0,NetBall,76,...,18,False,85695,True,No,Teen,High,Very High,Low,High
3,3,41,Cerulean City,Middle,Gym Leader Assistant,Dragon,23,0,UltraBall,27,...,10,False,39739,True,No,Adult,Low,Very Low,Very Low,Low
4,4,15,Pallet Town,Middle,Gym Leader Assistant,Ground,16,1,HealBall,51,...,17,True,126923,False,Yes,Teen,Average,Average,Low,Very High
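
The fixed-edge and quantile columns can disagree a lot (row 1 above is 'Average' in `Debt_bin` but 'Very High' in `Debt_per`), so it can help to look at where the quantile edges actually fall. `pd.qcut` returns them when `retbins=True` is passed. A minimal sketch on made-up values, not the real `Debt_to_Kanto` column:

```python
import pandas as pd

# Toy debt values, evenly spaced for readability.
debt = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
labels = ["Very Low", "Low", "Average", "High", "Very High"]

# retbins=True returns the binned series and the computed edges.
binned, edges = pd.qcut(debt, q=5, labels=labels, retbins=True)

print(edges)   # 6 edges for 5 quantile bins
print(binned.value_counts(sort=False))
```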


### Pokemon Level

In [13]:
pokemon_levels_bins = win_rate_bins  # reuse the win-rate edges [0, 20, 40, 60, 80, 100]
pokemon_levels_labels = ['Very Low', 'Low', 'Average', 'High', 'Very High']

df = add_bins(df, 'Average_Pokemon_Level', 'Pokemon_Level_bin', pokemon_levels_labels, pokemon_levels_bins)
df = add_bins(df, 'Average_Pokemon_Level', 'Pokemon_Level_per', pokemon_levels_labels, 5)

df.head()



Unnamed: 0,ID,Age,City,Economic_Status,Profession,Most_Used_Pokemon_Type,Average_Pokemon_Level,Criminal_Record,PokéBall_Usage,Win_Ratio,...,Debt_to_Kanto,Charity_Participation,Team_Rocket,Age_Stage,Win_Ratio_bins,Win_Ratio_per,Debt_bin,Debt_per,Pokemon_Level_bin,Pokemon_Level_per
0,0,27,Pewter City,Middle,Fisherman,Rock,50,0,DuskBall,51,...,24511,True,No,Young Adult,Average,Average,Very Low,Low,Average,Average
1,1,55,Viridian City,Middle,PokéMart Seller,Grass,35,1,HealBall,53,...,177516,True,Yes,Senior,Average,Average,Average,Very High,Low,Low
2,2,14,Pallet Town,High,Police Officer,Poison,96,0,NetBall,76,...,85695,True,No,Teen,High,Very High,Low,High,Very High,Very High
3,3,41,Cerulean City,Middle,Gym Leader Assistant,Dragon,23,0,UltraBall,27,...,39739,True,No,Adult,Low,Very Low,Very Low,Low,Low,Very Low
4,4,15,Pallet Town,Middle,Gym Leader Assistant,Ground,16,1,HealBall,51,...,126923,False,Yes,Teen,Average,Average,Low,Very High,Very Low,Very Low


### Migrations

In [14]:
migrations_bins = [0, 6, 12, 18, 24, 30]
migrations_labels = win_rate_labels

df = add_bins(df, 'Number_of_Migrations', 'Migrations_bin', migrations_labels, migrations_bins)
df = add_bins(df, 'Number_of_Migrations', 'Migrations_per',migrations_labels, 5)
df.head()

Unnamed: 0,ID,Age,City,Economic_Status,Profession,Most_Used_Pokemon_Type,Average_Pokemon_Level,Criminal_Record,PokéBall_Usage,Win_Ratio,...,Team_Rocket,Age_Stage,Win_Ratio_bins,Win_Ratio_per,Debt_bin,Debt_per,Pokemon_Level_bin,Pokemon_Level_per,Migrations_bin,Migrations_per
0,0,27,Pewter City,Middle,Fisherman,Rock,50,0,DuskBall,51,...,No,Young Adult,Average,Average,Very Low,Low,Average,Average,Very High,Very High
1,1,55,Viridian City,Middle,PokéMart Seller,Grass,35,1,HealBall,53,...,Yes,Senior,Average,Average,Average,Very High,Low,Low,High,High
2,2,14,Pallet Town,High,Police Officer,Poison,96,0,NetBall,76,...,No,Teen,High,Very High,Low,High,Very High,Very High,High,High
3,3,41,Cerulean City,Middle,Gym Leader Assistant,Dragon,23,0,UltraBall,27,...,No,Adult,Low,Very Low,Very Low,Low,Low,Very Low,Low,Low
4,4,15,Pallet Town,Middle,Gym Leader Assistant,Ground,16,1,HealBall,51,...,Yes,Teen,Average,Average,Low,Very High,Very Low,Very Low,Average,High


### Drop old columns

In [15]:
df.drop(columns=['Win_Ratio', 'Number_of_Migrations', 'Average_Pokemon_Level', 'Age', 'Debt_to_Kanto', 'ID'], inplace=True)

In [16]:
df.head()

Unnamed: 0,City,Economic_Status,Profession,Most_Used_Pokemon_Type,Criminal_Record,PokéBall_Usage,Number_of_Gym_Badges,Is_Pokemon_Champion,Battle_Strategy,Rare_Item_Holder,...,Team_Rocket,Age_Stage,Win_Ratio_bins,Win_Ratio_per,Debt_bin,Debt_per,Pokemon_Level_bin,Pokemon_Level_per,Migrations_bin,Migrations_per
0,Pewter City,Middle,Fisherman,Rock,0,DuskBall,1,False,Unpredictable,False,...,No,Young Adult,Average,Average,Very Low,Low,Average,Average,Very High,Very High
1,Viridian City,Middle,PokéMart Seller,Grass,1,HealBall,2,False,Unpredictable,False,...,Yes,Senior,Average,Average,Average,Very High,Low,Low,High,High
2,Pallet Town,High,Police Officer,Poison,0,NetBall,5,False,Aggressive,False,...,No,Teen,High,Very High,Low,High,Very High,Very High,High,High
3,Cerulean City,Middle,Gym Leader Assistant,Dragon,0,UltraBall,0,False,Defensive,False,...,No,Adult,Low,Very Low,Very Low,Low,Low,Very Low,Low,Low
4,Pallet Town,Middle,Gym Leader Assistant,Ground,1,HealBall,1,False,Aggressive,True,...,Yes,Teen,Average,Average,Low,Very High,Very Low,Very Low,Average,High
