# Rocket League Notebook 7: Logistic Regression Kitchen Sink, Outliers & Resamples

## Goals 

- Create a giant, kitchen-sink models with all numerical columns, this time filtering outliers out and oversampling minority classes

## Contents

- (I) Logistic Regression, matches grouped with mean, outliers filtered
    - Matches is grouped by match_id and aggregated by mean
    - Aggregated data is then filtered for overall outliers
    - Use VarianceThreshold with default settings
    - Use StandardScaler with default settings on all columns
    - Run logistic regression with max_iter set to 1000
    - Use RandomizedSearchCV, CV = 3, n_iter = 10, to tune hyperparameter C to a value in [100, 10, 1, 0.1, 0.01] 
- (II) Logistic Regression, matches grouped with mean, outliers filtered by rank
    - Matches is grouped by match_id and aggregated by mean
    - Aggregated data is then filtered for outliers *in each rank*
    - Use VarianceThreshold with default settings
    - Use StandardScaler with default settings on all columns
    - Run logistic regression with max_iter set to 1000
    - Use RandomizedSearchCV, CV = 3, n_iter = 10, to tune hyperparameter C to a value in [100, 10, 1, 0.1, 0.01]
- (III) Logistic Regression, grouped with mean, outiers removed by rank, resampled
    - Matches is grouped by match_id and aggregated by mean
    - Aggregated data is then filtered for outliers *in each rank*
    - Use VarianceThreshold with default settings
    - Use StandardScaler with default settings on all columns
    - Oversample data using SMOTE so that each classification is equally represented
    - Run logistic regression with max_iter set to 1000
    - Use RandomizedSearchCV, CV = 3, n_iter = 10, to tune hyperparameter C to a value in [100, 10, 1, 0.1, 0.01]
- (IV) Logistic Regression, grouped with mean, outiers removed by rank, resampled, log-transform avg powerslide duration
    - Matches is grouped by match_id and aggregated by mean
    - Aggregated data is then filtered for outliers *in each rank*
    - Add column that log-transforms the (heavily skewed) avg powerslide duration column
    - Use VarianceThreshold with default settings
    - Use StandardScaler with default settings on all columns
    - Oversample data using SMOTE so that each classification is equally represented
    - Run logistic regression with max_iter set to 1000
    - Use RandomizedSearchCV, CV = 3, n_iter = 10, to tune hyperparameter C to a value in [100, 10, 1, 0.1, 0.01]
- (V) Logistic Regression, ungrouped, outliers removed by rank
    - Aggregated data is filtered for outliers *in each rank*
    - Use VarianceThreshold with default settings
    - Use StandardScaler with default settings on all columns
    - Run logistic regression with max_iter set to 1000
    - Use RandomizedSearchCV, CV = 3, n_iter = 10, to tune hyperparameter C to a value in [100, 10, 1, 0.1, 0.01]

## Results

- (I) Logistic Regression, matches grouped with mean, outliers filtered
    - Accuracy Score:  0.5811900191938579
    - Outliers being filtered from whole dataset might not help the rank classifications be better represented
    - Yields *submission_2022-04-01_v1.csv*
- (II) Logistic Regression, matches grouped with mean, outliers filtered by rank
    - Accuracy Score:  0.6080092414324221
    - Best performance in the notebook; however, bronze and silver are still much less represented in this dataset
    - Yields *submission_2022-04-02_v1.csv*
- (III) Logistic Regression, grouped with mean, outiers removed by rank, resampled
    - Accuracy Score:  0.5943396226415094
    - Slightly worse performance than (II) but performing better on bronze and silver with oversampling
    - Yields *submission_2022-04-02_v1.csv*
- (IV) Logistic Regression, grouped with mean, outiers removed by rank, resampled, log-transform avg powerslide duration
    - Accuracy Score:  0.5995379283788987
    - This is the start of more feature engineering. avg powerslide duration shows a lot of promise for distinguishing classifications (decreases as rank increases)
- (V) Logistic Regression, ungrouped, outliers removed by rank
    - Accuracy score:  0.49894881694124843
    - Ungrouped and unaveraged data seems to predict worse than averaged; however, best approach would be to widen and/or aggregate in distinct ways for specific columns.

## Imports

In [72]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from scipy.stats import pearsonr
from imblearn.over_sampling import RandomOverSampler, SMOTE
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import VarianceThreshold, RFE
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

## Converters and Functions

In [3]:
converter = { 'bronze': 1, 'silver': 2, 'gold': 3, 'platinum': 4, 'diamond': 5, 'champion': 6 }

In [11]:
def find_outliers(col):

    try:
        Q1 = col.quantile(0.25)
        Q3 = col.quantile(0.75)
        IQR = Q3 - Q1
        lowbound = Q1-1.5*IQR
        highbound = Q3+1.5*IQR
        df_outliers = (col >= lowbound) & (col <= highbound)
    except:
        df_outliers = (col == col)

    return df_outliers

def filter_outliers(df):

    filtered_df = df[df.apply(find_outliers).all(axis = 'columns')]

    return filtered_df

## Read in

In [5]:
matches = pd.read_csv('../data/train.csv')
test = pd.read_csv('../data/test.csv')

In [6]:
print('matches shape: ', matches.shape)
print('test shape: ', test.shape)

matches shape:  (60242, 91)
test shape:  (5000, 90)


## (I) Logistic Regression, matches grouped with mean, outliers filtered

In [7]:
matches_prepped = filter_outliers(matches.groupby(['match_id', 'rank']).mean().reset_index().fillna(0))

X = matches_prepped.drop(columns = ['match_id', 'rank'])
y = matches_prepped[['rank']]

X_train, X_test, y_train, y_test =train_test_split(X, y, random_state=42, stratify = y)

pipe = Pipeline(steps = [

        ('vt', VarianceThreshold()),
        ('scaler', StandardScaler()),
        ('logreg', LogisticRegression(max_iter = 1000))
    ])

param_grid = {'logreg__C':[100, 10, 1.0, 0.1, 0.01]}

# gs = GridSearchCV(pipe, param_grid, scoring = 'accuracy')

# gs.fit(X_train, y_train)

rs = RandomizedSearchCV(pipe, param_grid, scoring = 'accuracy', verbose=2)

rs.fit(X_train, y_train)


  y = column_or_1d(y, warn=True)


Fitting 5 folds for each of 5 candidates, totalling 25 fits


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
  y = column_or_1d(y, warn=True)


[CV] END ......................................logreg__C=100; total time=   5.4s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
  y = column_or_1d(y, warn=True)


[CV] END ......................................logreg__C=100; total time=   5.7s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
  y = column_or_1d(y, warn=True)


[CV] END ......................................logreg__C=100; total time=   7.0s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
  y = column_or_1d(y, warn=True)


[CV] END ......................................logreg__C=100; total time=   6.9s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
  y = column_or_1d(y, warn=True)


[CV] END ......................................logreg__C=100; total time=   7.7s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
  y = column_or_1d(y, warn=True)


[CV] END .......................................logreg__C=10; total time=   6.8s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
  y = column_or_1d(y, warn=True)


[CV] END .......................................logreg__C=10; total time=   7.0s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
  y = column_or_1d(y, warn=True)


[CV] END .......................................logreg__C=10; total time=   8.3s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
  y = column_or_1d(y, warn=True)


[CV] END .......................................logreg__C=10; total time=   8.2s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
  y = column_or_1d(y, warn=True)


[CV] END .......................................logreg__C=10; total time=   7.2s
[CV] END ......................................logreg__C=1.0; total time=   4.8s


  y = column_or_1d(y, warn=True)


[CV] END ......................................logreg__C=1.0; total time=   7.0s


  y = column_or_1d(y, warn=True)


[CV] END ......................................logreg__C=1.0; total time=   6.8s


  y = column_or_1d(y, warn=True)


[CV] END ......................................logreg__C=1.0; total time=   5.4s


  y = column_or_1d(y, warn=True)


[CV] END ......................................logreg__C=1.0; total time=   4.5s


  y = column_or_1d(y, warn=True)


[CV] END ......................................logreg__C=0.1; total time=   2.6s


  y = column_or_1d(y, warn=True)


[CV] END ......................................logreg__C=0.1; total time=   4.3s


  y = column_or_1d(y, warn=True)


[CV] END ......................................logreg__C=0.1; total time=   6.1s


  y = column_or_1d(y, warn=True)


[CV] END ......................................logreg__C=0.1; total time=   5.6s


  y = column_or_1d(y, warn=True)


[CV] END ......................................logreg__C=0.1; total time=   4.6s


  y = column_or_1d(y, warn=True)


[CV] END .....................................logreg__C=0.01; total time=   1.7s


  y = column_or_1d(y, warn=True)


[CV] END .....................................logreg__C=0.01; total time=   1.7s


  y = column_or_1d(y, warn=True)


[CV] END .....................................logreg__C=0.01; total time=   1.6s


  y = column_or_1d(y, warn=True)


[CV] END .....................................logreg__C=0.01; total time=   1.8s


  y = column_or_1d(y, warn=True)


[CV] END .....................................logreg__C=0.01; total time=   1.7s


  y = column_or_1d(y, warn=True)


RandomizedSearchCV(estimator=Pipeline(steps=[('vt', VarianceThreshold()),
                                             ('scaler', StandardScaler()),
                                             ('logreg',
                                              LogisticRegression(max_iter=1000))]),
                   param_distributions={'logreg__C': [100, 10, 1.0, 0.1, 0.01]},
                   scoring='accuracy', verbose=2)

In [8]:
y_pred = rs.predict(X_test)

print("Accuracy Score: ", accuracy_score(y_test, y_pred))
print("Confusion Matrix: \n", confusion_matrix(y_test, y_pred))
print("Classification Report: \n", classification_report(y_test, y_pred))

Accuracy Score:  0.5811900191938579
Confusion Matrix: 
 [[  0   0   2   4   2  12]
 [  0 808 308   2  29   0]
 [  0 275 734  19 331   0]
 [  0   1  34 599 333  40]
 [  1  30 315 269 780   5]
 [  2   0   0 155  13 107]]
Classification Report: 
               precision    recall  f1-score   support

      bronze       0.00      0.00      0.00        20
    champion       0.73      0.70      0.71      1147
     diamond       0.53      0.54      0.53      1359
        gold       0.57      0.59      0.58      1007
    platinum       0.52      0.56      0.54      1400
      silver       0.65      0.39      0.49       277

    accuracy                           0.58      5210
   macro avg       0.50      0.46      0.48      5210
weighted avg       0.58      0.58      0.58      5210



In [9]:
test_prep = test.groupby('match_id').mean().fillna(0)
y_pred = rs.predict(test_prep)
y_pred = pd.Series(y_pred).map(converter)
submission = pd.concat([test_prep.reset_index()['match_id'], y_pred], axis = 1).rename(columns = {0: 'rank'})
submission

Unnamed: 0,match_id,rank
0,30121,5
1,30122,3
2,30123,4
3,30124,4
4,30125,5
...,...,...
2495,32616,3
2496,32617,6
2497,32618,6
2498,32619,3


In [10]:
#submission.to_csv("../submissions/submission_2022-04-01_v1.csv", index = False)

## (II) Logistic Regression, matches grouped with mean, outliers filtered by rank

In [13]:
matches_grouped = matches.groupby(['match_id', 'rank']).mean().reset_index().fillna(0)
matches_grouped_fo = matches_grouped.groupby('rank').apply(filter_outliers)
matches_grouped_fo.index = matches_grouped_fo.index.droplevel(level = 0)
matches_grouped_fo

Unnamed: 0,match_id,rank,duration,possession_time,time_in_side,shots,shots_against,goals,goals_against,saves,...,percent_defensive_half,percent_offensive_half,percent_behind_ball,percent_infront_ball,percent_most_back,percent_most_forward,percent_closest_to_ball,percent_farthest_from_ball,demos_inflicted,demos_taken
13,13,bronze,305.0,109.850,140.745,5.5,5.5,2.5,2.5,2.0,...,58.569202,41.430801,70.771775,29.228229,97.668655,97.668655,97.668655,97.668655,0.0,0.0
232,232,bronze,338.0,108.920,149.115,5.0,5.0,4.0,4.0,0.5,...,61.240378,38.759622,71.040054,28.959943,97.864825,97.864825,97.864825,97.864825,0.0,0.0
560,560,bronze,418.0,113.925,178.180,8.0,8.0,6.0,6.0,2.0,...,61.411490,38.588507,71.989610,28.010392,97.536205,97.536205,97.536205,97.536205,0.0,0.0
566,566,bronze,405.0,97.990,176.865,6.0,6.0,5.0,5.0,1.0,...,61.636724,38.363277,73.097965,26.902036,97.007312,97.007312,97.007312,97.007312,0.0,0.0
567,567,bronze,158.0,34.915,70.230,2.5,2.5,2.0,2.0,0.5,...,58.774396,41.225604,72.545948,27.454049,95.938950,95.938950,95.938950,95.938950,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
30100,30100,silver,362.0,132.640,165.855,5.5,5.5,3.0,3.0,1.0,...,61.630180,38.369820,76.252072,23.747926,100.205005,100.205005,100.205005,100.205005,0.5,0.5
30104,30104,silver,399.0,109.850,174.940,9.0,9.0,5.0,5.0,2.5,...,62.478385,37.521617,72.439780,27.560217,97.855410,97.855410,97.855410,97.855410,1.0,1.0
30108,30108,silver,400.0,124.265,175.290,7.5,7.5,5.0,5.0,0.5,...,61.419119,38.580884,74.710930,25.289073,98.351063,98.351063,98.351063,98.351063,0.5,0.5
30118,30118,silver,390.0,124.960,172.195,7.0,7.0,4.5,4.5,2.5,...,59.690851,40.309152,71.153999,28.846002,98.034600,98.034600,98.034600,98.034600,0.5,0.5


In [44]:
X = matches_grouped_fo.drop(columns = ['match_id', 'rank'])
y = matches_grouped_fo[['rank']]

X_train, X_test, y_train, y_test =train_test_split(X, y, random_state=42, stratify = y)

pipe = Pipeline(steps = [
        ('vt', VarianceThreshold()),
        ('scaler', StandardScaler()),
        ('logreg', LogisticRegression(max_iter = 1000))
    ])

param_grid = {'logreg__C':[100, 10, 1.0, 0.1, 0.01]}

# gs = GridSearchCV(pipe, param_grid, scoring = 'accuracy')

# gs.fit(X_train, y_train)

rs = RandomizedSearchCV(pipe, param_grid, scoring = 'accuracy', verbose=2)

rs.fit(X_train, y_train)


  y = column_or_1d(y, warn=True)


Fitting 5 folds for each of 5 candidates, totalling 25 fits


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
  y = column_or_1d(y, warn=True)


[CV] END ......................................logreg__C=100; total time=   5.2s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
  y = column_or_1d(y, warn=True)


[CV] END ......................................logreg__C=100; total time=   5.4s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
  y = column_or_1d(y, warn=True)


[CV] END ......................................logreg__C=100; total time=   6.0s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
  y = column_or_1d(y, warn=True)


[CV] END ......................................logreg__C=100; total time=   7.1s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
  y = column_or_1d(y, warn=True)


[CV] END ......................................logreg__C=100; total time=   7.0s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
  y = column_or_1d(y, warn=True)


[CV] END .......................................logreg__C=10; total time=   7.0s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
  y = column_or_1d(y, warn=True)


[CV] END .......................................logreg__C=10; total time=   7.3s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
  y = column_or_1d(y, warn=True)


[CV] END .......................................logreg__C=10; total time=   7.4s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
  y = column_or_1d(y, warn=True)


[CV] END .......................................logreg__C=10; total time=   7.2s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
  y = column_or_1d(y, warn=True)


[CV] END .......................................logreg__C=10; total time=   8.4s
[CV] END ......................................logreg__C=1.0; total time=   4.8s


  y = column_or_1d(y, warn=True)


[CV] END ......................................logreg__C=1.0; total time=   4.5s


  y = column_or_1d(y, warn=True)


[CV] END ......................................logreg__C=1.0; total time=   4.6s


  y = column_or_1d(y, warn=True)


[CV] END ......................................logreg__C=1.0; total time=   4.8s


  y = column_or_1d(y, warn=True)


[CV] END ......................................logreg__C=1.0; total time=   4.0s


  y = column_or_1d(y, warn=True)


[CV] END ......................................logreg__C=0.1; total time=   2.1s


  y = column_or_1d(y, warn=True)


[CV] END ......................................logreg__C=0.1; total time=   2.4s


  y = column_or_1d(y, warn=True)


[CV] END ......................................logreg__C=0.1; total time=   2.6s


  y = column_or_1d(y, warn=True)


[CV] END ......................................logreg__C=0.1; total time=   2.7s


  y = column_or_1d(y, warn=True)


[CV] END ......................................logreg__C=0.1; total time=   2.2s


  y = column_or_1d(y, warn=True)


[CV] END .....................................logreg__C=0.01; total time=   1.2s


  y = column_or_1d(y, warn=True)


[CV] END .....................................logreg__C=0.01; total time=   1.3s


  y = column_or_1d(y, warn=True)


[CV] END .....................................logreg__C=0.01; total time=   1.1s


  y = column_or_1d(y, warn=True)


[CV] END .....................................logreg__C=0.01; total time=   1.4s


  y = column_or_1d(y, warn=True)


[CV] END .....................................logreg__C=0.01; total time=   1.3s


  y = column_or_1d(y, warn=True)


RandomizedSearchCV(estimator=Pipeline(steps=[('vt', VarianceThreshold()),
                                             ('scaler', StandardScaler()),
                                             ('logreg',
                                              LogisticRegression(max_iter=1000))]),
                   param_distributions={'logreg__C': [100, 10, 1.0, 0.1, 0.01]},
                   scoring='accuracy', verbose=2)

In [45]:
y_pred = rs.predict(X_test)
print("Accuracy Score: ", accuracy_score(y_test, y_pred))
print("Confusion Matrix: \n", confusion_matrix(y_test, y_pred))
print("Classification Report: \n", classification_report(y_test, y_pred))

Accuracy Score:  0.6080092414324221
Confusion Matrix: 
 [[ 41   0   0   8   0  49]
 [  0 772 267   0  17   0]
 [  0 252 670  11 305   0]
 [  3   0  15 656 305  77]
 [  0  15 285 237 786   1]
 [ 21   0   0 159   9 233]]
Classification Report: 
               precision    recall  f1-score   support

      bronze       0.63      0.42      0.50        98
    champion       0.74      0.73      0.74      1056
     diamond       0.54      0.54      0.54      1238
        gold       0.61      0.62      0.62      1056
    platinum       0.55      0.59      0.57      1324
      silver       0.65      0.55      0.60       422

    accuracy                           0.61      5194
   macro avg       0.62      0.58      0.59      5194
weighted avg       0.61      0.61      0.61      5194



In [46]:
rs.fit(X, y)

  y = column_or_1d(y, warn=True)


Fitting 5 folds for each of 5 candidates, totalling 25 fits


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
  y = column_or_1d(y, warn=True)


[CV] END ......................................logreg__C=100; total time=   7.8s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
  y = column_or_1d(y, warn=True)


[CV] END ......................................logreg__C=100; total time=   9.6s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
  y = column_or_1d(y, warn=True)


[CV] END ......................................logreg__C=100; total time=  12.4s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
  y = column_or_1d(y, warn=True)


[CV] END ......................................logreg__C=100; total time=  11.9s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
  y = column_or_1d(y, warn=True)


[CV] END ......................................logreg__C=100; total time=  10.4s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
  y = column_or_1d(y, warn=True)


[CV] END .......................................logreg__C=10; total time=  12.4s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
  y = column_or_1d(y, warn=True)


[CV] END .......................................logreg__C=10; total time=  12.6s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
  y = column_or_1d(y, warn=True)


[CV] END .......................................logreg__C=10; total time=  11.7s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
  y = column_or_1d(y, warn=True)


[CV] END .......................................logreg__C=10; total time=  11.4s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
  y = column_or_1d(y, warn=True)


[CV] END .......................................logreg__C=10; total time=  11.6s
[CV] END ......................................logreg__C=1.0; total time=   8.1s


  y = column_or_1d(y, warn=True)


[CV] END ......................................logreg__C=1.0; total time=   8.3s


  y = column_or_1d(y, warn=True)


[CV] END ......................................logreg__C=1.0; total time=   8.3s


  y = column_or_1d(y, warn=True)


[CV] END ......................................logreg__C=1.0; total time=   7.4s


  y = column_or_1d(y, warn=True)


[CV] END ......................................logreg__C=1.0; total time=   8.4s


  y = column_or_1d(y, warn=True)


[CV] END ......................................logreg__C=0.1; total time=   2.9s


  y = column_or_1d(y, warn=True)


[CV] END ......................................logreg__C=0.1; total time=   4.0s


  y = column_or_1d(y, warn=True)


[CV] END ......................................logreg__C=0.1; total time=   4.4s


  y = column_or_1d(y, warn=True)


[CV] END ......................................logreg__C=0.1; total time=   3.6s


  y = column_or_1d(y, warn=True)


[CV] END ......................................logreg__C=0.1; total time=   3.1s


  y = column_or_1d(y, warn=True)


[CV] END .....................................logreg__C=0.01; total time=   2.4s


  y = column_or_1d(y, warn=True)


[CV] END .....................................logreg__C=0.01; total time=   1.8s


  y = column_or_1d(y, warn=True)


[CV] END .....................................logreg__C=0.01; total time=   2.2s


  y = column_or_1d(y, warn=True)


[CV] END .....................................logreg__C=0.01; total time=   2.1s


  y = column_or_1d(y, warn=True)


[CV] END .....................................logreg__C=0.01; total time=   2.0s


  y = column_or_1d(y, warn=True)


NotFittedError: This VarianceThreshold instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.

In [49]:
test_prep = test.groupby('match_id').mean().reset_index().fillna(0)
y_pred = rs.predict(test_prep.drop(columns = 'match_id'))

In [51]:
submission = pd.DataFrame({'match_id':test_prep.index, 'rank': y_pred})
submission['rank'] = submission['rank'].map(converter)
submission['match_id'] = submission['match_id']+30121
submission

Unnamed: 0,match_id,rank
0,30121,5
1,30122,3
2,30123,4
3,30124,4
4,30125,5
...,...,...
2495,32616,3
2496,32617,6
2497,32618,6
2498,32619,3


In [52]:
#submission.to_csv('../submissions/submission_2022-04-02_v1.csv', index = False)

## (III) Logistic Regression, grouped with mean, outiers removed by rank, resampled

In [53]:
X = matches_grouped_fo.drop(columns = ['match_id', 'rank'])
y = matches_grouped_fo['rank']

pipe_trans = Pipeline(steps = [
        ('vt', VarianceThreshold()),
        ('scaler', StandardScaler()),
])

X_trans = pipe_trans.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_trans, y, random_state=42, stratify = y)

sm = SMOTE(random_state=42)
X_train_rs, y_train_rs = sm.fit_resample(X_train, y_train)

logreg = LogisticRegression(max_iter = 1000)

param_grid = {'C':[100, 10, 1.0, 0.1, 0.01]}

rs = RandomizedSearchCV(logreg, param_grid, scoring = 'accuracy')

rs.fit(X_train_rs, y_train_rs)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

RandomizedSearchCV(estimator=LogisticRegression(max_iter=1000),
                   param_distributions={'C': [100, 10, 1.0, 0.1, 0.01]},
                   scoring='accuracy')

In [54]:
y_pred = rs.predict(X_test)
print("Accuracy Score: ", accuracy_score(y_test, y_pred))
print("Confusion Matrix: \n", confusion_matrix(y_test, y_pred))
print("Classification Report: \n", classification_report(y_test, y_pred))

Accuracy Score:  0.5943396226415094
Confusion Matrix: 
 [[ 70   0   0   5   0  23]
 [  0 806 235   0  15   0]
 [  1 294 646  15 282   0]
 [  9   0  18 610 244 175]
 [  0  18 304 290 702  10]
 [ 77   0   0  86   6 253]]
Classification Report: 
               precision    recall  f1-score   support

      bronze       0.45      0.71      0.55        98
    champion       0.72      0.76      0.74      1056
     diamond       0.54      0.52      0.53      1238
        gold       0.61      0.58      0.59      1056
    platinum       0.56      0.53      0.55      1324
      silver       0.55      0.60      0.57       422

    accuracy                           0.59      5194
   macro avg       0.57      0.62      0.59      5194
weighted avg       0.59      0.59      0.59      5194



In [56]:
rs.fit(X_trans, y)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

RandomizedSearchCV(estimator=LogisticRegression(max_iter=1000),
                   param_distributions={'C': [100, 10, 1.0, 0.1, 0.01]},
                   scoring='accuracy')

In [66]:
test_prep = test.groupby('match_id').mean().reset_index().drop(columns = ['match_id', 'assists', 'mvp']).fillna(0)


In [63]:
test_prep_trans = pipe_trans.fit_transform(test_prep)
y_pred = rs.predict(test_prep_trans)

In [64]:
submission = pd.DataFrame({'match_id':test_prep.index, 'rank': y_pred})
submission['rank'] = submission['rank'].map(converter)
submission['match_id'] = submission['match_id']+30121
submission

Unnamed: 0,match_id,rank
0,30121,5
1,30122,3
2,30123,4
3,30124,4
4,30125,5
...,...,...
2495,32616,3
2496,32617,6
2497,32618,6
2498,32619,3


In [65]:
#submission.to_csv('../submissions/submission_2022-04-02_v2.csv', index = False)

### Error in training for submission_2022-04-02_v2.csv. Adjusted below.

In [118]:
X = matches_grouped_fo.drop(columns = ['match_id', 'rank'])
y = matches_grouped_fo['rank']

pipe_trans = Pipeline(steps = [
        ('vt', VarianceThreshold()),
        ('scaler', StandardScaler()),
])

X_trans = pipe_trans.fit_transform(X)

sm = SMOTE(random_state=42)
X_rs, y_rs = sm.fit_resample(X_trans, y)

logreg = LogisticRegression(max_iter = 1000)

param_grid = {'C':[100, 10, 1.0, 0.1, 0.01]}

rs = RandomizedSearchCV(logreg, param_grid, scoring = 'accuracy')

rs.fit(X_rs, y_rs)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

RandomizedSearchCV(estimator=LogisticRegression(max_iter=1000),
                   param_distributions={'C': [100, 10, 1.0, 0.1, 0.01]},
                   scoring='accuracy')

In [119]:
test_prep = test.groupby('match_id').mean().reset_index().drop(columns = ['match_id', 'assists', 'mvp']).fillna(0)

In [120]:
test_prep_trans = pipe_trans.fit_transform(test_prep)
y_pred = rs.predict(test_prep_trans)

In [121]:
submission = pd.DataFrame({'match_id':test_prep.index, 'rank': y_pred})
submission['rank'] = submission['rank'].map(converter)
submission['match_id'] = submission['match_id']+30121
submission

Unnamed: 0,match_id,rank
0,30121,5
1,30122,3
2,30123,5
3,30124,5
4,30125,5
...,...,...
2495,32616,4
2496,32617,6
2497,32618,6
2498,32619,3


In [122]:
#submission.to_csv('../submissions/submission_2022-04-03_v1.csv', index = False)

## (IV) Logistic Regression, grouped with mean, outiers removed by rank, resampled, log-transform avg powerslide duration

In [42]:
X = matches_grouped_fo.drop(columns = ['match_id', 'rank']).assign(log_avg_powerslide_duration = lambda x: np.log(x['avg_powerslide_duration']+0.01))
y = matches_grouped_fo['rank']

pipe_trans = Pipeline(steps = [
        ('vt', VarianceThreshold()),
        ('scaler', StandardScaler()),
])

X_trans = pipe_trans.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_trans, y, random_state=42, stratify = y)

sm = SMOTE(random_state=42)
X_train_rs, y_train_rs = sm.fit_resample(X_train, y_train)

logreg = LogisticRegression(max_iter = 1000)

param_grid = {'C':[100, 10, 1.0, 0.1, 0.01]}

# gs = GridSearchCV(pipe, param_grid, scoring = 'accuracy')

# gs.fit(X_train, y_train)

rs = RandomizedSearchCV(logreg, param_grid, scoring = 'accuracy')

rs.fit(X_train_rs, y_train_rs)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

RandomizedSearchCV(estimator=LogisticRegression(max_iter=1000),
                   param_distributions={'C': [100, 10, 1.0, 0.1, 0.01]},
                   scoring='accuracy')

In [43]:
y_pred = rs.predict(X_test)
print("Accuracy Score: ", accuracy_score(y_test, y_pred))
print("Confusion Matrix: \n", confusion_matrix(y_test, y_pred))
print("Classification Report: \n", classification_report(y_test, y_pred))

Accuracy Score:  0.5995379283788987
Confusion Matrix: 
 [[ 74   0   0   5   0  19]
 [  0 804 237   0  15   0]
 [  1 294 644  16 283   0]
 [  8   0  17 619 240 172]
 [  0  18 304 277 715  10]
 [ 71   0   0  87   6 258]]
Classification Report: 
               precision    recall  f1-score   support

      bronze       0.48      0.76      0.59        98
    champion       0.72      0.76      0.74      1056
     diamond       0.54      0.52      0.53      1238
        gold       0.62      0.59      0.60      1056
    platinum       0.57      0.54      0.55      1324
      silver       0.56      0.61      0.59       422

    accuracy                           0.60      5194
   macro avg       0.58      0.63      0.60      5194
weighted avg       0.60      0.60      0.60      5194



## (V) Logistic Regression, ungrouped, outliers removed by rank

In [17]:
matches_fo = filter_outliers(matches.fillna(0))

In [23]:
dropcols = ['color', 'match_id', 'rank', 'map_code', 'car_name']
X = matches_fo.drop(columns = dropcols)
y = matches_fo['rank']

X_test, X_train, y_test, y_train = train_test_split(X, y, random_state = 42, stratify = y)

pipe = Pipeline(steps=[
    ('vt', VarianceThreshold()),
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=1000))
])

param_grid = {'logreg__C':[100, 10, 1.0, 0.1, 0.01]}

rs = RandomizedSearchCV(pipe, param_grid, scoring = 'accuracy', n_iter = 20)

rs.fit(X_train, y_train)
y_pred = rs.predict(X_test)

print('accuracy score: ', accuracy_score(y_test, y_pred))
print('confusion matrix: \n', confusion_matrix(y_test, y_pred))
print('classification report: \n', classification_report(y_test, y_pred))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

accuracy score:  0.49894881694124843
confusion matrix: 
 [[   3    1    6   43   12   70]
 [   0 3633 1584   22  347    0]
 [   0 1676 3014  233 1799    5]
 [   1   36  363 2587 1881  296]
 [   0  327 1787 1400 3407   35]
 [   3    1   12  960  208  409]]
classification report: 
               precision    recall  f1-score   support

      bronze       0.43      0.02      0.04       135
    champion       0.64      0.65      0.65      5586
     diamond       0.45      0.45      0.45      6727
        gold       0.49      0.50      0.50      5164
    platinum       0.45      0.49      0.47      6956
      silver       0.50      0.26      0.34      1593

    accuracy                           0.50     26161
   macro avg       0.49      0.39      0.41     26161
weighted avg       0.50      0.50      0.50     26161



## (VI) Logistic Regression, select columns, grouped with mean, outiers removed by rank, resampled

In [105]:
cols = ['percent_supersonic_speed', 'bcpm', 'percent_ground', 'percent_low_air']

matches_cols_grouped = matches.groupby(['match_id', 'rank'])[cols].mean().reset_index().fillna(0)
matches_cols_grouped_fo = matches_grouped.groupby('rank').apply(filter_outliers)
matches_cols_grouped_fo.index = matches_cols_grouped_fo.index.droplevel(level = 0)

X = matches_cols_grouped_fo[cols]
y = matches_cols_grouped_fo['rank']

pipe_trans = Pipeline(steps = [
        ('vt', VarianceThreshold()),
        ('scaler', StandardScaler()),
])

X_trans = pipe_trans.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_trans, y, random_state=42, stratify = y)

sm = SMOTE(random_state=42)
X_train_rs, y_train_rs = sm.fit_resample(X_train, y_train)

logreg = LogisticRegression(max_iter = 1000)

param_grid = {'C':[100, 10, 1.0, 0.1, 0.01]}

rs = RandomizedSearchCV(logreg, param_grid, scoring = 'accuracy')

rs.fit(X_train_rs, y_train_rs)



RandomizedSearchCV(estimator=LogisticRegression(max_iter=1000),
                   param_distributions={'C': [100, 10, 1.0, 0.1, 0.01]},
                   scoring='accuracy')

In [106]:
y_pred = rs.predict(X_test)
print("Accuracy Score: ", accuracy_score(y_test, y_pred))
print("Confusion Matrix: \n", confusion_matrix(y_test, y_pred))
print("Classification Report: \n", classification_report(y_test, y_pred))

Accuracy Score:  0.4326145552560647
Confusion Matrix: 
 [[ 60   0   0   8   5  25]
 [  0 707 257   8  84   0]
 [  0 427 426  77 306   2]
 [ 48  18  73 423 266 228]
 [  3 136 303 348 472  62]
 [121   0   6 109  27 159]]
Classification Report: 
               precision    recall  f1-score   support

      bronze       0.26      0.61      0.36        98
    champion       0.55      0.67      0.60      1056
     diamond       0.40      0.34      0.37      1238
        gold       0.43      0.40      0.42      1056
    platinum       0.41      0.36      0.38      1324
      silver       0.33      0.38      0.35       422

    accuracy                           0.43      5194
   macro avg       0.40      0.46      0.41      5194
weighted avg       0.43      0.43      0.43      5194



### (VIa) RFE Feature Selection with data grouped with mean (consider submitting)

In [112]:
dropcols = ['match_id', 'rank']
matches_grouped_fo = matches.groupby(['match_id', 'rank']).mean().reset_index().fillna(0).groupby('rank').apply(filter_outliers)
X = matches_cols_grouped_fo.drop(columns = dropcols)
y = matches_cols_grouped_fo['rank']
logreg = LogisticRegression(max_iter=1000)
selector = RFE(logreg, n_features_to_select=15, step=1)
selector = selector.fit(X, y)
selected_cols = selector.support_
selected_cols

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

array([ True, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False,  True, False, False, False,  True,  True, False, False,
       False, False, False,  True, False, False,  True,  True,  True,
        True,  True, False, False,  True, False, False, False, False,
       False, False, False, False, False, False, False,  True, False,
       False, False, False,  True,  True, False,  True, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False])

In [113]:
best12 = X.columns[selected_cols]

In [126]:
matches_grouped_fo = matches.groupby(['match_id', 'rank'])[best12].mean().reset_index().fillna(0).groupby('rank').apply(filter_outliers)
X = matches_cols_grouped_fo[best12]
y = matches_cols_grouped_fo['rank']

pipe_trans = Pipeline(steps = [
        ('vt', VarianceThreshold()),
        ('scaler', StandardScaler()),
])

X_trans = pipe_trans.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_trans, y, random_state=42, stratify = y)

sm = SMOTE(random_state=42)
X_train_rs, y_train_rs = sm.fit_resample(X_train, y_train)

logreg = LogisticRegression(max_iter = 1000)

param_grid = {'C':[100, 10, 1.0, 0.1, 0.01]}

rs = RandomizedSearchCV(logreg, param_grid, scoring = 'accuracy')

rs.fit(X_train_rs, y_train_rs)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

RandomizedSearchCV(estimator=LogisticRegression(max_iter=1000),
                   param_distributions={'C': [100, 10, 1.0, 0.1, 0.01]},
                   scoring='accuracy')

In [127]:
y_pred = rs.predict(X_test)
print("Accuracy Score: ", accuracy_score(y_test, y_pred))
print("Confusion Matrix: \n", confusion_matrix(y_test, y_pred))
print("Classification Report: \n", classification_report(y_test, y_pred))

Accuracy Score:  0.5390835579514824
Confusion Matrix: 
 [[ 73   0   0   5   1  19]
 [  0 769 252   1  34   0]
 [  1 339 580  28 290   0]
 [ 27   1  32 517 268 211]
 [  2  39 312 302 634  35]
 [ 92   0   0  92  11 227]]
Classification Report: 
               precision    recall  f1-score   support

      bronze       0.37      0.74      0.50        98
    champion       0.67      0.73      0.70      1056
     diamond       0.49      0.47      0.48      1238
        gold       0.55      0.49      0.52      1056
    platinum       0.51      0.48      0.49      1324
      silver       0.46      0.54      0.50       422

    accuracy                           0.54      5194
   macro avg       0.51      0.57      0.53      5194
weighted avg       0.54      0.54      0.54      5194



In [128]:
X = matches_cols_grouped_fo[best12]
y = matches_cols_grouped_fo['rank']

pipe_trans = Pipeline(steps = [
        ('vt', VarianceThreshold()),
        ('scaler', StandardScaler()),
])

X_trans = pipe_trans.fit_transform(X)

sm = SMOTE(random_state=42)
X_rs, y_rs = sm.fit_resample(X_trans, y)

logreg = LogisticRegression(max_iter = 1000)

param_grid = {'C':[100, 10, 1.0, 0.1, 0.01]}

rs = RandomizedSearchCV(logreg, param_grid, scoring = 'accuracy')

rs.fit(X_rs, y_rs)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

RandomizedSearchCV(estimator=LogisticRegression(max_iter=1000),
                   param_distributions={'C': [100, 10, 1.0, 0.1, 0.01]},
                   scoring='accuracy')

In [132]:
test_grouped = test.groupby('match_id')[best12].mean().reset_index().fillna(0)
test_trans = pipe_trans.fit_transform(test_grouped.drop(columns = 'match_id'))
y_pred = rs.predict(test_trans)

In [133]:
submission = pd.DataFrame({'match_id':test_prep.index, 'rank': y_pred})
submission['rank'] = submission['rank'].map(converter)
submission['match_id'] = submission['match_id']+30121
submission

Unnamed: 0,match_id,rank
0,30121,5
1,30122,3
2,30123,4
3,30124,3
4,30125,5
...,...,...
2495,32616,3
2496,32617,5
2497,32618,6
2498,32619,3


In [134]:
submission.to_csv('../submissions/submission_2022-04-03_v2.csv', index = False)

### (VIb) RFE Feature Selection with data grouped with mean, plus interactions

In [116]:
dropcols = ['match_id', 'rank']
#matches_grouped_fo = filter_outliers(matches.groupby(['match_id', 'rank']).mean().reset_index().fillna(0))
X = matches_cols_grouped_fo[best12]
y = matches_cols_grouped_fo['rank']

pipe_trans = Pipeline(steps = [
        ('vt', VarianceThreshold()),
        ('scaler', StandardScaler()),
])

X_trans = pipe_trans.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_trans, y, random_state=42, stratify = y)

sm = SMOTE(random_state=42)
X_train_rs, y_train_rs = sm.fit_resample(X_train, y_train)

pipe = Pipeline(steps = [
        ('pf', PolynomialFeatures(interaction_only=True)),
        ('logreg', LogisticRegression(max_iter = 1000))
])

param_grid = {'logreg__C':[100, 10, 1.0, 0.1, 0.01]}

rs = RandomizedSearchCV(pipe, param_grid, scoring = 'accuracy')

rs.fit(X_train_rs, y_train_rs)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

RandomizedSearchCV(estimator=Pipeline(steps=[('pf',
                                              PolynomialFeatures(interaction_only=True)),
                                             ('logreg',
                                              LogisticRegression(max_iter=1000))]),
                   param_distributions={'logreg__C': [100, 10, 1.0, 0.1, 0.01]},
                   scoring='accuracy')

In [117]:
y_pred = rs.predict(X_test)
print("Accuracy Score: ", accuracy_score(y_test, y_pred))
print("Confusion Matrix: \n", confusion_matrix(y_test, y_pred))
print("Classification Report: \n", classification_report(y_test, y_pred))

Accuracy Score:  0.5358105506353484
Confusion Matrix: 
 [[ 64   0   0   7   1  26]
 [  0 769 256   1  30   0]
 [  1 350 586  34 267   0]
 [ 22   4  42 512 274 202]
 [  4  46 334 294 619  27]
 [ 85   0   0  94  10 233]]
Classification Report: 
               precision    recall  f1-score   support

      bronze       0.36      0.65      0.47        98
    champion       0.66      0.73      0.69      1056
     diamond       0.48      0.47      0.48      1238
        gold       0.54      0.48      0.51      1056
    platinum       0.52      0.47      0.49      1324
      silver       0.48      0.55      0.51       422

    accuracy                           0.54      5194
   macro avg       0.51      0.56      0.53      5194
weighted avg       0.54      0.54      0.53      5194

