# Rocket League Notebook 6: Tree Classifiers

## Goals 

- Create models using tree classifiers

## Contents

- (I) Random Forest Classifier, unaggregated data, cars included
    - OneHotEncorder used on car names
    - Use VarianceThreshold with default settings
    - Use StandardScaler with default settings on all columns
    - Run RandomForestClassifier with default settings
- (II) Random Forest Classifier, unaggregated, overall outliers removed, no cars
    - Filter out overall outliers
    - No cars
    - Use VarianceThreshold with default settings
    - Use StandardScaler with default settings on all columns
    - Run RandomForestClassifier with default settings
- (III) Random Forest Classifier, grouped by match_id and aggregated with mean, outliers removed by rank, no cars
    - Group by match_id and aggregate with mean
    - No cars
    - Filter out outliers by rank
    - Use VarianceThreshold with default settings
    - Use StandardScaler with default settings on all columns
    - Run RandomForestClassifier with default settings
- (IV) Basic Gradient Boosting Classifier, unaggregated, with cars
    - No prior aggregation of training set
    - OneHotEncode car names, passthrough all other features
    - Use VarianceThreshold with default settings
    - Use StandardScaler with default settings on all columns
    - Run GradientBoostClassifier with default settings
- (V) Random Forest Classifier, grouped by match_id and aggregated by mean, outliers removed by rank, no cars
    - Group by match_id and aggregate with mean
    - Filter out outliers by rank
    - No cars
    - Use VarianceThreshold with default settings
    - Use StandardScaler with default settings on all columns
    - Run RandomForestClassifier with default settings
- (VI) Random Forest Classifier, matches wide, no cars
    - Widen dataset so that each column has a _winner and _loser suffix and is all on the same line
    - No cars
    - Use VarianceThreshold with default settings
    - Use StandardScaler with default settings on all columns
    - Run RandomForestClassifier with default settings
- (VII) Random Forest Classifier, matches wide, some feature engineering
    - Widen dataset so that each column has a _winner and _loser suffix and is all on the same line
    - Add the following features to the dataset:
        - total_goals = lambda x: x['goals_win']+x['goals_lose'],
        - total_saves = lambda x: x['saves_win']+x['saves_lose'],
        - gs_score_ratio = lambda x: ((x['goals_win']+x['goals_lose'])*100 + (x['saves_win']+x['saves_lose'])*50)/(x['score_win']+x['score_lose']),
        - mean_percent_supersonic_speed = lambda x: (x['percent_supersonic_speed_win'] + x['percent_supersonic_speed_lose'])/2,
        - total_time_supersonic_speed = lambda x: x['time_supersonic_speed_win'] + x['time_supersonic_speed_lose'],
        - total_powerslide_count = lambda x: x['count_powerslide_win']+x['count_powerslide_lose'],
        - total_percent_ground_low_air = lambda x: x['percent_ground_win']+x['percent_ground_lose']+x['percent_low_air_win']+x['percent_low_air_lose'],
        - log_avg_powerslide_dur = lambda x: np.log(0.01+(x['avg_powerslide_duration_win']+x['avg_powerslide_duration_lose'])/2),
        - gs_over_pssl = lambda x: (x['goals_win']+x['goals_lose']+x['saves_win']+x['saves_lose'])/(1-0.01*x['percent_supersonic_speed_lose'])
    - Use VarianceThreshold with default settings
    - Use StandardScaler with default settings on all columns
    - Run RandomForestClassifier with default settings
- (VIII) Gradient Boosting Classifier, matches wide, some feature engineering
    - Widen dataset so that each column has a _winner and _loser suffix and is all on the same line
    - Add the following features to the dataset:
        - total_goals = lambda x: x['goals_win']+x['goals_lose'],
        - total_saves = lambda x: x['saves_win']+x['saves_lose'],
        - gs_score_ratio = lambda x: ((x['goals_win']+x['goals_lose'])*100 + (x['saves_win']+x['saves_lose'])*50)/(x['score_win']+x['score_lose']),
        - mean_percent_supersonic_speed = lambda x: (x['percent_supersonic_speed_win'] + x['percent_supersonic_speed_lose'])/2,
        - total_time_supersonic_speed = lambda x: x['time_supersonic_speed_win'] + x['time_supersonic_speed_lose'],
        - total_powerslide_count = lambda x: x['count_powerslide_win']+x['count_powerslide_lose'],
        - total_percent_ground_low_air = lambda x: x['percent_ground_win']+x['percent_ground_lose']+x['percent_low_air_win']+x['percent_low_air_lose'],
        - log_avg_powerslide_dur = lambda x: np.log(0.01+(x['avg_powerslide_duration_win']+x['avg_powerslide_duration_lose'])/2),
        - gs_over_pssl = lambda x: (x['goals_win']+x['goals_lose']+x['saves_win']+x['saves_lose'])/(1-0.01*x['percent_supersonic_speed_lose'])
    - Use VarianceThreshold with default settings
    - Use StandardScaler with default settings on all columns
    - Run GradientBoostClassifier with default settings

## Results

- (I) Random Forest Classifier, unaggregated data, cars included
    - Accuracy Score:  0.4483766018192683
    - Cars might not contribute much
- (II) Random Forest Classifier, unaggregated, overall outliers removed, no cars
    - Accuracy Score:  0.45614035087719296
    - Taking out outliers helps some, but overall outliers have helped less than rank ones in the past
- (III) Random Forest Classifier, grouped by match_id and aggregated with mean, outliers removed by rank, no cars
    - Accuracy Score:  0.5685406237966885
    - Removing outliers by rank remains optimal
- (IV) Basic Gradient Boosting Classifier, unaggregated, with cars
    - Accuracy Score:  0.4640462120709116
    - Gradient Boosting gets us a couple extra percent under the same conditions as Random Forest
    - Gradient Boost puts 6x more importance on percent supersonic speed than any other feature
- (V) Random Forest Classifier, grouped by match_id and aggregated by mean, outliers removed by rank, no cars
    - Accuracy score:  0.5695032730073162
    - Same model as (III)
    - Relative importance of features is much more evenly distributed
- (VI) Random Forest Classifier, matches wide, no cars
    - Accuracy score:  0.4535918204753685
    - Matches wide no cars slightly improves unaggregated long form.
    - May benefit from outlier removal here.
- (VII) Random Forest Classifier, matches wide, some feature engineering
    - Accuracy score:  0.36608684105696454
    - My features are not helping
    - Outlier removal may help
- (VIII) Gradient Boosting Classifier, matches wide, some feature engineering
    - Accuracy score:  0.45850484663391317
    - Gradient boosting likes my features a little more
    - Outlier removal may help

## Imports

In [96]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import seaborn as sns
from imblearn.over_sampling import SMOTE
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import VarianceThreshold
from sklearn.neural_network import MLPClassifier, MLPRegressor
from sklearn.preprocessing import MinMaxScaler, StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

## Read In

In [3]:
matches = pd.read_csv('../data/train.csv')
test = pd.read_csv('../data/test.csv')

## Converters and Functions

In [4]:
converter = { 'bronze': 1, 'silver': 2, 'gold': 3, 'platinum': 4, 'diamond': 5, 'champion': 6 }
catvars = ['rank', 'color', 'map_code', 'car_name']

In [81]:
def find_outliers(col):

    try:
        Q1 = col.quantile(0.25)
        Q3 = col.quantile(0.75)
        IQR = Q3 - Q1
        lowbound = Q1-1.5*IQR
        highbound = Q3+1.5*IQR
        df_outliers = (col >= lowbound) & (col <= highbound)
    except:
        df_outliers = (col == col)

    return df_outliers

def filter_outliers(df):

    filtered_df = df[df.apply(find_outliers).all(axis = 'columns')]

    return filtered_df

## (I) Random Forest Classifier, unaggregated data, cars included

In [64]:
dropcols = ['match_id', 'color', 'rank', 'map_code']
X = matches.drop(columns = dropcols).fillna(0)
y = matches['rank']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify = y)

ohe = OneHotEncoder(sparse = False, drop = 'first')

ct = ColumnTransformer(transformers= [
        ('ohe', ohe, ['car_name'])
        ],
        remainder = 'passthrough'
    )

pipe = Pipeline(steps = [
        ('ct', ct),
        ('vt', VarianceThreshold()),
        ('scaler', StandardScaler()),
        ('rf', RandomForestClassifier())
    ])

pipe.fit(X_train, y_train)
accuracy_score(y_test, pipe.predict(X_test))

0.4483766018192683

In [66]:
importances = pd.DataFrame({
    'variable': pipe['vt'].get_feature_names_out(),
    'importance': pipe['rf'].feature_importances_
})

importances['variable'] = importances['variable'].str.extract(r'(\d+)').astype(int)
importances['variable_name'] = importances['variable'].apply(lambda x: pipe['ct'].get_feature_names_out()[x])
importances.sort_values('importance', ascending = False).head(20)

Unnamed: 0,variable,importance,variable_name
134,135,0.02969,remainder__percent_supersonic_speed
122,123,0.02316,remainder__time_supersonic_speed
135,136,0.022361,remainder__percent_ground
136,137,0.021087,remainder__percent_low_air
120,121,0.0198,remainder__avg_speed
133,134,0.019676,remainder__percent_boost_speed
131,132,0.019381,remainder__avg_speed_percentage
93,94,0.018192,remainder__bcpm
129,130,0.017752,remainder__count_powerslide
140,141,0.016916,remainder__avg_distance_to_ball_no_possession


## (II) Random Forest Classifier, unaggregated, overall outliers removed, no cars

In [83]:
dropcols = ['match_id', 'color', 'rank', 'map_code', 'car_name']

matches_inliers = filter_outliers(matches.fillna(0))

X = matches_inliers.drop(columns = dropcols)
y = matches_inliers['rank']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify = y)

pipe = Pipeline(steps = [
        ('vt', VarianceThreshold()),
        ('scaler', StandardScaler()),
        ('rf', RandomForestClassifier())
    ])

pipe.fit(X_train, y_train)
accuracy_score(y_test, pipe.predict(X_test))

0.45614035087719296

## (III) Random Forest Classifier, grouped by match_id and aggregated with mean, outliers removed by rank, no cars

In [84]:
matches_grouped = matches.groupby(['match_id', 'rank']).mean().reset_index().fillna(0)
matches_grouped_fo = matches_grouped.groupby('rank').apply(filter_outliers)
matches_grouped_fo.index = matches_grouped_fo.index.droplevel(level = 0)
matches_grouped_fo

Unnamed: 0,match_id,rank,duration,possession_time,time_in_side,shots,shots_against,goals,goals_against,saves,...,percent_defensive_half,percent_offensive_half,percent_behind_ball,percent_infront_ball,percent_most_back,percent_most_forward,percent_closest_to_ball,percent_farthest_from_ball,demos_inflicted,demos_taken
13,13,bronze,305.0,109.850,140.745,5.5,5.5,2.5,2.5,2.0,...,58.569202,41.430801,70.771775,29.228229,97.668655,97.668655,97.668655,97.668655,0.0,0.0
232,232,bronze,338.0,108.920,149.115,5.0,5.0,4.0,4.0,0.5,...,61.240378,38.759622,71.040054,28.959943,97.864825,97.864825,97.864825,97.864825,0.0,0.0
560,560,bronze,418.0,113.925,178.180,8.0,8.0,6.0,6.0,2.0,...,61.411490,38.588507,71.989610,28.010392,97.536205,97.536205,97.536205,97.536205,0.0,0.0
566,566,bronze,405.0,97.990,176.865,6.0,6.0,5.0,5.0,1.0,...,61.636724,38.363277,73.097965,26.902036,97.007312,97.007312,97.007312,97.007312,0.0,0.0
567,567,bronze,158.0,34.915,70.230,2.5,2.5,2.0,2.0,0.5,...,58.774396,41.225604,72.545948,27.454049,95.938950,95.938950,95.938950,95.938950,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
30100,30100,silver,362.0,132.640,165.855,5.5,5.5,3.0,3.0,1.0,...,61.630180,38.369820,76.252072,23.747926,100.205005,100.205005,100.205005,100.205005,0.5,0.5
30104,30104,silver,399.0,109.850,174.940,9.0,9.0,5.0,5.0,2.5,...,62.478385,37.521617,72.439780,27.560217,97.855410,97.855410,97.855410,97.855410,1.0,1.0
30108,30108,silver,400.0,124.265,175.290,7.5,7.5,5.0,5.0,0.5,...,61.419119,38.580884,74.710930,25.289073,98.351063,98.351063,98.351063,98.351063,0.5,0.5
30118,30118,silver,390.0,124.960,172.195,7.0,7.0,4.5,4.5,2.5,...,59.690851,40.309152,71.153999,28.846002,98.034600,98.034600,98.034600,98.034600,0.5,0.5


In [87]:
dropcols = ['match_id', 'rank']

X = matches_grouped_fo.drop(columns = dropcols)
y = matches_grouped_fo['rank']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify = y)

pipe = Pipeline(steps = [
        ('vt', VarianceThreshold()),
        ('scaler', StandardScaler()),
        ('rf', RandomForestClassifier())
    ])

pipe.fit(X_train, y_train)
accuracy_score(y_test, pipe.predict(X_test))

0.5685406237966885

## (IV) Basic Gradient Boosting Classifier, unaggregated, with cars

In [8]:
dropcols = ['match_id', 'color', 'rank', 'map_code']
X = matches.drop(columns = dropcols).fillna(0)
y = matches['rank']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify = y)

ohe = OneHotEncoder(sparse = False, drop = 'first')

ct = ColumnTransformer(transformers= [
        ('ohe', ohe, ['car_name'])
        ],
        remainder = 'passthrough'
    )

pipe = Pipeline(steps = [
        ('ct', ct),
        ('vt', VarianceThreshold()),
        ('scaler', StandardScaler()),
        ('gbc', GradientBoostingClassifier())
    ])

pipe.fit(X_train, y_train)
accuracy_score(y_test, pipe.predict(X_test))

0.4640462120709116

In [19]:
len(pipe['vt'].get_feature_names_out())

166

In [34]:
np.where(pipe['ct'].get_feature_names_out() == 'remainder__assists')
pipe['ct'].get_feature_names_out()[89]
pipe['vt'].get_feature_names_out()[89]

'x90'

In [39]:
importances = pd.DataFrame({
    'variable': pipe['vt'].get_feature_names_out(),
    'importance': pipe['gbc'].feature_importances_
})

importances['variable'] = importances['variable'].str.extract(r'(\d+)').astype(int)
importances['variable_name'] = importances['variable'].apply(lambda x: pipe['ct'].get_feature_names_out()[x])
importances.sort_values('importance', ascending = False).head(20)

Unnamed: 0,variable,importance,variable_name
134,135,0.279188,remainder__percent_supersonic_speed
133,134,0.051825,remainder__percent_boost_speed
130,131,0.04925,remainder__avg_powerslide_duration
136,137,0.047861,remainder__percent_low_air
93,94,0.046859,remainder__bcpm
129,130,0.043932,remainder__count_powerslide
135,136,0.042344,remainder__percent_ground
140,141,0.03787,remainder__avg_distance_to_ball_no_possession
120,121,0.034439,remainder__avg_speed
119,120,0.028012,remainder__percent_boost_75_100


In [43]:
print(confusion_matrix(y_test, pipe.predict(X_test)))
print(classification_report(y_test, pipe.predict(X_test)))

[[  95    5    6   99   19  145]
 [   0 1780  880   40  237    2]
 [   4  838 1455  208  944    9]
 [  38   57  255 1488 1015  273]
 [  15  254  895  849 1676   60]
 [  72    6   28  637  182  495]]
              precision    recall  f1-score   support

      bronze       0.42      0.26      0.32       369
    champion       0.61      0.61      0.61      2939
     diamond       0.41      0.42      0.42      3458
        gold       0.45      0.48      0.46      3126
    platinum       0.41      0.45      0.43      3749
      silver       0.50      0.35      0.41      1420

    accuracy                           0.46     15061
   macro avg       0.47      0.43      0.44     15061
weighted avg       0.47      0.46      0.46     15061



## (V) Random Forest Classifier, grouped by match_id and aggregated by mean, outliers removed by rank, no cars

In [88]:
dropcols = ['match_id', 'rank']

X = matches_grouped_fo.drop(columns = dropcols)
y = matches_grouped_fo['rank']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify = y)

pipe = Pipeline(steps = [
        ('vt', VarianceThreshold()),
        ('scaler', StandardScaler()),
        ('rf', RandomForestClassifier())
    ])

pipe.fit(X_train, y_train)
accuracy_score(y_test, pipe.predict(X_test))

0.5695032730073162

In [90]:
importances = pd.DataFrame({
    'variable': pipe['vt'].get_feature_names_out(),
    'importance': pipe['rf'].feature_importances_
})

importances.sort_values('importance', ascending = False).head(20)

Unnamed: 0,variable,importance
52,percent_supersonic_speed,0.046497
11,bcpm,0.03265
54,percent_low_air,0.031711
53,percent_ground,0.029178
40,time_supersonic_speed,0.028631
48,avg_powerslide_duration,0.025606
51,percent_boost_speed,0.024699
47,count_powerslide,0.02454
76,percent_behind_ball,0.021065
77,percent_infront_ball,0.02026


## (VI) Random Forest Classifier, matches wide, no cars

In [44]:
matches_win = matches.sort_values(['match_id', 'goals', 'score'], ascending=[True, False, False]).drop_duplicates(subset = ['match_id'], keep = 'first')
matches_lose = matches.sort_values(['match_id', 'goals', 'score'], ascending=[True, False, False]).drop_duplicates(subset = ['match_id'], keep = 'first')
matches_wide = matches_win.merge(matches_lose, on = ['match_id', 'rank', 'map_code'], suffixes=('_win','_lose'))

In [46]:
dropcols_wide = ['match_id', 'color_win', 'color_lose', 'rank', 'map_code', 'car_name_win', 'car_name_lose']
X = matches_wide.drop(columns = dropcols_wide).fillna(0)
y = matches_wide['rank']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify = y)

pipe = Pipeline(steps = [
    ('vt', VarianceThreshold()),
    ('scaler', StandardScaler()),
    ('rf', RandomForestClassifier())
])

pipe.fit(X_train, y_train)

y_pred = pipe.predict(X_test)
print(accuracy_score(y_test, y_pred))

0.4535918204753685


In [56]:
importances = pd.DataFrame({
    'variable': pipe['vt'].get_feature_names_out(),
    'importance': pipe['rf'].feature_importances_
})

importances.sort_values('importance', ascending = False).head(10)

Unnamed: 0,variable,importance
53,percent_supersonic_speed_win,0.018146
138,percent_supersonic_speed_lose,0.017484
55,percent_low_air_win,0.013871
41,time_supersonic_speed_win,0.012061
124,avg_speed_lose,0.011994
140,percent_low_air_lose,0.011904
52,percent_boost_speed_win,0.011736
139,percent_ground_lose,0.011661
137,percent_boost_speed_lose,0.011626
54,percent_ground_win,0.011045


## (VII) Random Forest Classifier, matches wide, some feature engineering

In [70]:
matches_wider = matches_wide.assign(total_goals = lambda x: x['goals_win']+x['goals_lose'],
                    total_saves = lambda x: x['saves_win']+x['saves_lose'],
                    gs_score_ratio = lambda x: ((x['goals_win']+x['goals_lose'])*100 + (x['saves_win']+x['saves_lose'])*50)/(x['score_win']+x['score_lose']),
                    mean_percent_supersonic_speed = lambda x: (x['percent_supersonic_speed_win'] + x['percent_supersonic_speed_lose'])/2,
                    total_time_supersonic_speed = lambda x: x['time_supersonic_speed_win'] + x['time_supersonic_speed_lose'],
                    total_powerslide_count = lambda x: x['count_powerslide_win']+x['count_powerslide_lose'],
                    total_percent_ground_low_air = lambda x: x['percent_ground_win']+x['percent_ground_lose']+x['percent_low_air_win']+x['percent_low_air_lose'],
                    log_avg_powerslide_dur = lambda x: np.log(0.01+(x['avg_powerslide_duration_win']+x['avg_powerslide_duration_lose'])/2),
                    gs_over_pssl = lambda x: (x['goals_win']+x['goals_lose']+x['saves_win']+x['saves_lose'])/(1-0.01*x['percent_supersonic_speed_lose'])            
    )

In [80]:
dropcols_wide = ['match_id', 'color_win', 'color_lose', 'rank', 'map_code', 'car_name_win', 'car_name_lose']
X = matches_wider.drop(columns = dropcols_wide).fillna(0)
y = matches_wider['rank']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify = y)

pipe = Pipeline(steps = [
    ('vt', VarianceThreshold()),
    ('scaler', StandardScaler()),
    ('rf', RandomForestClassifier())
])

pipe.fit(X_train, y_train)

y_pred = pipe.predict(X_test)
print(accuracy_score(y_test, y_pred))

0.36608684105696454


In [72]:
importances = pd.DataFrame({
    'variable': pipe['vt'].get_feature_names_out(),
    'importance': pipe['rf'].feature_importances_
})

importances.sort_values('importance', ascending = False).head(25)

Unnamed: 0,variable,importance
138,percent_supersonic_speed_lose,0.018554
53,percent_supersonic_speed_win,0.01514
173,mean_percent_supersonic_speed,0.014898
174,total_time_supersonic_speed,0.012322
126,time_supersonic_speed_lose,0.011913
140,percent_low_air_lose,0.011336
55,percent_low_air_win,0.010912
41,time_supersonic_speed_win,0.010303
52,percent_boost_speed_win,0.01023
59,avg_distance_to_ball_no_possession_win,0.010213


## (VIII) Gradient Boosting Classifier, matches wide, some feature engineering

In [67]:
dropcols_wide = ['match_id', 'color_win', 'color_lose', 'rank', 'map_code', 'car_name_win', 'car_name_lose']
X = matches_wider.drop(columns = dropcols_wide).fillna(0)
y = matches_wider['rank']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify = y)

pipe = Pipeline(steps = [
    ('vt', VarianceThreshold()),
    ('scaler', StandardScaler()),
    ('gbc', GradientBoostingClassifier())
])

pipe.fit(X_train, y_train)

y_pred = pipe.predict(X_test)
print(accuracy_score(y_test, y_pred))

0.45850484663391317


In [68]:
importances = pd.DataFrame({
    'variable': pipe['vt'].get_feature_names_out(),
    'importance': pipe['gbc'].feature_importances_
})

importances.sort_values('importance', ascending = False).head(25)

Unnamed: 0,variable,importance
173,mean_percent_supersonic_speed,0.122919
138,percent_supersonic_speed_lose,0.093343
53,percent_supersonic_speed_win,0.048738
137,percent_boost_speed_lose,0.030022
55,percent_low_air_win,0.028201
12,bcpm_win,0.026943
140,percent_low_air_lose,0.023908
59,avg_distance_to_ball_no_possession_win,0.022325
49,avg_powerslide_duration_win,0.022102
144,avg_distance_to_ball_no_possession_lose,0.021778


## (IX) Gradient Boost Classifier, grouped by match_id and aggregated by mean, outliers removed by rank, no cars

In [95]:
dropcols = ['match_id', 'rank']

X = matches_grouped_fo.drop(columns = dropcols)
y = matches_grouped_fo['rank']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify = y)

pipe = Pipeline(steps = [
        ('vt', VarianceThreshold()),
        ('scaler', StandardScaler()),
        ('gbc', GradientBoostingClassifier())
    ])

pipe.fit(X_train, y_train)
accuracy_score(y_test, pipe.predict(X_test))

0.5723912206391991

In [92]:
dropcols = ['match_id', 'rank']

X = matches_grouped_fo.drop(columns = dropcols)
y = matches_grouped_fo['rank']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify = y)

pipe = Pipeline(steps = [
        ('vt', VarianceThreshold()),
        ('scaler', StandardScaler()),
        ('gbc', GradientBoostingClassifier())
    ])

pipe.fit(X, y)
accuracy_score(y, pipe.predict(X))

0.6745126353790614

In [93]:
test_prep = test.groupby('match_id').mean().reset_index().fillna(0)
y_pred = pipe.predict(test_prep.drop(columns='match_id'))
submission = pd.DataFrame({'match_id':test_prep.index, 'rank': y_pred})
submission['rank'] = submission['rank'].map(converter)
submission['match_id'] = submission['match_id']+30121
submission

Unnamed: 0,match_id,rank
0,30121,5
1,30122,3
2,30123,4
3,30124,1
4,30125,5
...,...,...
2495,32616,3
2496,32617,1
2497,32618,6
2498,32619,3


In [94]:
#submission.to_csv('../submissions/submission_2022-04-03_v3.csv', index = False)

# (X) Gradient Boost Classifier, grouped by match_id and aggregated by mean, outliers removed by rank, resampled

Gradien Boost got worse with oversampling

In [99]:
dropcols = ['match_id', 'rank']

X = matches_grouped_fo.drop(columns = dropcols)
y = matches_grouped_fo['rank']

pipe_trans = Pipeline(steps = [
        ('vt', VarianceThreshold()),
        ('scaler', StandardScaler())
])

X_trans = pipe_trans.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_trans, y, random_state=42, stratify = y)

sm = SMOTE(random_state=42)
X_train_rs, y_train_rs = sm.fit_resample(X_train, y_train)

gbc = GradientBoostingClassifier()

gbc.fit(X_train_rs, y_train_rs)
accuracy_score(y_test, gbc.predict(X_test))

0.5541008856372738