## Description

In a PUBG game, up to 100 players start in each match (matchId). Players can be on teams (groupId) which get ranked at the end of the game (winPlacePerc) based on how many other teams are still alive when they are eliminated. In game, players can pick up different munitions, revive downed-but-not-out (knocked) teammates, drive vehicles, swim, run, shoot, and experience all of the consequences -- such as falling too far or running themselves over and eliminating themselves.

You are provided with a large number of anonymized PUBG game stats, formatted so that each row contains one player's post-game stats. The data comes from matches of all types: solos, duos, squads, and custom; there is no guarantee of there being 100 players per match, nor at most 4 players per group.
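
The caveat about match and group sizes can be checked with two `groupby` counts. The sketch below uses a tiny made-up frame; only the column names (`matchId`, `groupId`, `Id`) come from the competition data, and all values and `demo_*` names are hypothetical.

```python
import pandas as pd

# Made-up sample that reuses the competition's column names (values are illustrative).
demo_df = pd.DataFrame({
    "matchId": ["m1", "m1", "m1", "m1", "m1", "m1", "m2", "m2"],
    "groupId": ["g1", "g1", "g1", "g1", "g1", "g2", "g3", "g4"],
    "Id": list("abcdefgh"),
})

# Number of player rows recorded for each match.
players_per_match = demo_df.groupby("matchId")["Id"].count()

# Number of player rows in each group of each match.
group_sizes = demo_df.groupby(["matchId", "groupId"])["Id"].count()

print(players_per_match)
# Groups with more than 4 members, which the description says can occur.
print(group_sizes[group_sizes > 4])
```

Running the same two counts on `df_train` (before the ID columns are dropped) would show how far the real data departs from 100 players per match and 4 players per group.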

**You must create a model which predicts players' finishing placement based on their final stats, on a scale from 1 (first place) to 0 (last place).**

## Evaluation Criteria

Submissions are evaluated on Mean Absolute Error between your predicted winPlacePerc and the observed winPlacePerc.

**Mean Absolute Error** (MAE) is the mean of the absolute value of the errors:

$$\frac 1n\sum_{i=1}^n|y_i-\hat{y}_i|$$
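
As a small worked example of the formula, the sketch below computes MAE by hand with NumPy and compares it with scikit-learn's `mean_absolute_error`. The four target and prediction values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up observed and predicted winPlacePerc values.
demo_true = np.array([0.0, 0.5, 1.0, 0.25])
demo_pred = np.array([0.1, 0.5, 0.8, 0.25])

# Absolute errors are 0.1, 0.0, 0.2 and 0.0, so the mean is about 0.075.
manual_mae = np.mean(np.abs(demo_true - demo_pred))

# scikit-learn expects the arguments in the order (y_true, y_pred).
sklearn_mae = mean_absolute_error(demo_true, demo_pred)

print(manual_mae, sklearn_mae)
```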

### Submission File
For each Id in the test set, you must predict their placement as a percentage (0 for last, 1 for first place) for the winPlacePerc variable. The file should contain a header and have the following format:

  - Id,winPlacePerc
  - 47734,0
  - 47735,0.5
  - 47736,0
  - 47737,1
  - etc.
  
See sample_submission.csv on the data page for a full sample submission.
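
A quick sanity check of the layout can catch formatting mistakes before uploading. This is only a sketch: `check_submission` is a hypothetical helper, and the four rows are copied from the format example above.

```python
import io
import pandas as pd

# Rows copied from the format example; a real file has one row per test Id.
demo_sub = pd.DataFrame({
    "Id": [47734, 47735, 47736, 47737],
    "winPlacePerc": [0.0, 0.5, 0.0, 1.0],
})

def check_submission(sub):
    """Hypothetical helper: True if columns, Ids and value range look valid."""
    has_columns = list(sub.columns) == ["Id", "winPlacePerc"]
    unique_ids = sub["Id"].is_unique
    in_range = sub["winPlacePerc"].between(0, 1).all()
    return bool(has_columns and unique_ids and in_range)

# Write to an in-memory buffer to inspect the header line without touching disk.
buffer = io.StringIO()
demo_sub.to_csv(buffer, index=False)
header = buffer.getvalue().splitlines()[0]

print(check_submission(demo_sub), header)
```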

## Setup

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import os

## Data Overview

In [None]:
#import data

PATH = r'/Users/nicholasbeaudoin/Desktop/Kaggle/PUBG'

# # Labels
# df_train = pd.read_csv('../input/train_V2.csv')

# # No labels
# df_test = pd.read_csv('../input/test_V2.csv')

# Labels
df_train = pd.read_csv(f'{PATH}/data/train_V2.csv', low_memory=False)

# No labels
df_test = pd.read_csv(f'{PATH}/data/test_V2.csv', low_memory=False)

In [None]:
df_train.head().T

In [None]:
# Dummy out match type
dummies_train = pd.get_dummies(df_train['matchType'])
dummies_test = pd.get_dummies(df_test['matchType'])

In [None]:
# Merge dummies to DF
df_train = pd.concat([df_train, dummies_train], axis=1)
df_test = pd.concat([df_test, dummies_test], axis=1)
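
Encoding the train and test sets separately only gives matching columns if both contain the same set of match types. Assuming that might not hold, one way to guard against it is to reindex the test dummies to the training columns. The frames below are made up and the `demo_*` names are hypothetical.

```python
import pandas as pd

demo_train = pd.DataFrame({"matchType": ["solo", "duo", "squad"]})
demo_test = pd.DataFrame({"matchType": ["solo", "squad"]})  # 'duo' never appears

demo_dummies_train = pd.get_dummies(demo_train["matchType"])
demo_dummies_test = pd.get_dummies(demo_test["matchType"])

# Align the test dummies to the training columns; categories that are
# missing from the test set become all-zero columns.
demo_dummies_test = demo_dummies_test.reindex(
    columns=demo_dummies_train.columns, fill_value=0
)

print(list(demo_dummies_test.columns))
```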

In [None]:
# Drop the three ID columns and the original matchType column (now dummy-encoded)
df_train.drop(['Id', 'groupId', 'matchId', 'matchType'], axis=1, inplace=True)

In [None]:
df_train.isna().sum()

#### Drop the one missing row with no observations for place

In [None]:
# Pass the axis by keyword; recent pandas versions no longer accept it positionally
df_train.dropna(axis=0, inplace=True)

In [None]:
# Confirm that one row was dropped
df_train.shape

## Model Pre-Process

Feature selection, train/test split, and feature scaling.

In [None]:
# Get all variables except outcome
feature_cols = df_train.columns[df_train.columns!='winPlacePerc']

X = df_train[feature_cols]
y = df_train.winPlacePerc

In [None]:
# Checking shape
print(X.shape)
print(y.shape)

In [None]:
# create a train test split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=42)

In [None]:
# Feature scaling
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))

X_train_scaled = scaler.fit_transform(X_train)
X_train = pd.DataFrame(X_train_scaled)

X_test_scaled = scaler.transform(X_test)
X_test = pd.DataFrame(X_test_scaled)
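
The scaler is fitted on the training split only and then reused on the held-out split. The toy example below (made-up numbers, hypothetical `demo_*` names) shows the arithmetic, `(x - min) / (max - min)` with the training minimum and maximum, and that held-out values can land outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

demo_train = np.array([[0.0], [5.0], [10.0]])
demo_test = np.array([[2.5], [12.0]])

demo_scaler = MinMaxScaler(feature_range=(0, 1))
demo_scaler.fit(demo_train)  # learns min=0 and max=10 from the training data

# Manual version of the same transform, using the training min and max.
train_min = demo_train.min(axis=0)
train_max = demo_train.max(axis=0)
manual_scaled = (demo_test - train_min) / (train_max - train_min)

sklearn_scaled = demo_scaler.transform(demo_test)

# 12.0 exceeds the training maximum, so it is scaled to a value above 1.
print(sklearn_scaled.ravel())
```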

## Linear Regression


In [None]:
# Import
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Instantiate
linreg = LinearRegression()

# Fit
linreg.fit(X_train, y_train)

# Predict
y_pred = linreg.predict(X_test)

# Evaluate
mean_absolute_error(y_pred, y_test)

## KNN Regression

In [None]:
# Import
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_absolute_error

# Instantiate
knn = KNeighborsRegressor(9)

# Fit
knn.fit(X_train, y_train)

# Predict
y_pred = knn.predict(X_test)

# Evaluate
mean_absolute_error(y_pred, y_test)

In [None]:
scores = []
for k in range(1,50):
    knn = KNeighborsRegressor(n_neighbors=k)
    knn.fit(X_train,y_train)
    y_pred = knn.predict(X_test)
    score = mean_absolute_error(y_pred, y_test)
    scores.append([k, score])

In [None]:
data = pd.DataFrame(scores,columns=['k','score'])
data.plot.line(x='k',y='score'); 
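
The k with the lowest MAE can also be read off programmatically instead of from the plot. The sketch below uses hypothetical `[k, MAE]` pairs standing in for the `scores` list built above.

```python
import pandas as pd

# Hypothetical [k, MAE] pairs; the real values come from the loop above.
demo_scores = [[1, 0.12], [3, 0.10], [5, 0.09], [7, 0.095]]
demo_data = pd.DataFrame(demo_scores, columns=["k", "score"])

# Row with the smallest error, then its k as a plain int.
best_row = demo_data.loc[demo_data["score"].idxmin()]
best_k = int(best_row["k"])

print(best_k)
```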

## Decision Tree Regression

One of the major advantages of decision trees is that they can pick up nonlinear interactions between variables in the data that linear regression can’t.
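
To illustrate the point on synthetic data (not the PUBG features), the sketch below builds a target that depends purely on an interaction between two features and compares the two models. The data, seed, and `demo_*` names are assumptions made for the illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
demo_X = rng.uniform(0, 1, size=(2000, 2))
# The target is 1 when exactly one of the two features exceeds 0.5 (XOR-like).
demo_y = ((demo_X[:, 0] > 0.5) ^ (demo_X[:, 1] > 0.5)).astype(float)

# Simple holdout: first 1500 rows to fit, last 500 rows to evaluate.
X_fit, X_eval = demo_X[:1500], demo_X[1500:]
y_fit, y_eval = demo_y[:1500], demo_y[1500:]

lin_model = LinearRegression().fit(X_fit, y_fit)
tree_model = DecisionTreeRegressor(random_state=0).fit(X_fit, y_fit)

lin_mae = mean_absolute_error(y_eval, lin_model.predict(X_eval))
tree_mae = mean_absolute_error(y_eval, tree_model.predict(X_eval))

print(f"linear MAE: {lin_mae:.3f}, tree MAE: {tree_mae:.3f}")
```

On this kind of target the linear model has no single-feature trend to exploit, while the tree can carve the plane into quadrants.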

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

# Instantiate the regressor
tree = DecisionTreeRegressor()

# Train model on training set
tree.fit(X_train, y_train)

# Predict
y_pred = tree.predict(X_test)

# Eval Metric
mean_absolute_error(y_pred, y_test)

## Random Forest

In [None]:
from sklearn.ensemble import RandomForestRegressor

# Instantiate model
rf = RandomForestRegressor()

# Train model on training set
rf.fit(X_train, y_train)

# Predict
y_pred = rf.predict(X_test)

# Eval Metric
mean_absolute_error(y_pred, y_test)

## Support Vector Regression

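 - Performs linear regression in a higher-dimensional feature space induced by the kernel
 - Built-in scikit-learn kernels: linear, poly (polynomial), rbf (Gaussian), sigmoid
 - We use rbf because we want to capture non-linear relationships
 - Features must be scaled first, since the rbf kernel depends on distances between samples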
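
One way to make sure scaling is always applied before the SVR is to bundle both steps in a pipeline. The sketch below does this on synthetic data; the data and the `demo_*` and `svr_pipe` names are hypothetical and not part of the notebook's workflow.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
# Synthetic features on a large scale and a non-linear target.
demo_X = rng.uniform(0, 100, size=(200, 3))
demo_y = np.sin(demo_X[:, 0] / 20.0)

# The pipeline fits the scaler on whatever is passed to fit() and
# applies the same scaling automatically inside predict().
svr_pipe = make_pipeline(MinMaxScaler(), SVR(kernel="rbf", gamma="scale"))
svr_pipe.fit(demo_X, demo_y)

demo_pred = svr_pipe.predict(demo_X[:5])
print(demo_pred)
```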

In [None]:
# Import model
from sklearn.svm import SVR

# Instantiate
svr = SVR(kernel = 'rbf', gamma='scale')

# Fit
svr.fit(X_train, y_train)

### Evaluation
from sklearn.metrics import mean_absolute_error

# Predict
y_pred = svr.predict(X_test)

# Evaluate
mean_absolute_error(y_pred, y_test)

## Grid Search with Best Algorithm

In [None]:
from sklearn.model_selection import GridSearchCV

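# Score with MAE to match the competition metric
# (the default score for a regressor is R^2)
parameters = {'kernel': ('linear', 'rbf', 'poly'), 'C': [1, 5, 10, 15, 20], 'epsilon': [0.1]}
svr = SVR()
clf = GridSearchCV(svr, parameters, scoring='neg_mean_absolute_error')
clf.fit(X_train, y_train)
clf.best_params_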
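
SVR fitting time grows faster than quadratically with the number of samples, so a grid search over the full training set may be impractical. A common workaround, sketched here on synthetic data, is to search on a random subsample. The subsample size, the reduced grid, and the `demo_*` names are arbitrary choices for the illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

rng = np.random.default_rng(0)
demo_X = rng.uniform(0, 1, size=(300, 4))
demo_y = demo_X[:, 0] * demo_X[:, 1]

# Random subsample without replacement (size picked arbitrarily here).
idx = rng.choice(len(demo_X), size=150, replace=False)

demo_grid = GridSearchCV(
    SVR(),
    {"kernel": ["linear", "rbf"], "C": [1, 10]},
    scoring="neg_mean_absolute_error",  # higher is better, so MAE is negated
    cv=3,
)
demo_grid.fit(demo_X[idx], demo_y[idx])

# Flip the sign to recover the cross-validated MAE of the best setting.
best_mae = -demo_grid.best_score_
print(demo_grid.best_params_, best_mae)
```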

## Apply optimal parameters from GridSearch

In [None]:
# Instantiate with the parameters selected by the grid search above
svr = SVR(gamma='scale', **clf.best_params_)

# Fit
svr.fit(X_train, y_train)

### Evaluation
from sklearn.metrics import mean_absolute_error

# Predict
y_pred = svr.predict(X_test)

# Evaluate
mean_absolute_error(y_pred, y_test)

### Submit to Kaggle

In [None]:
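def submission(df, features, algorithm):
    # Scale the out-of-sample features with the scaler fitted on the
    # training data, since the model is trained on scaled inputs
    x_oos = scaler.transform(df[features])
    algo = algorithm
    algo.fit(X_train, y_train)
    pred = algo.predict(x_oos)
    
    test_id = df["Id"]
    sub = pd.DataFrame({'Id': test_id, "winPlacePerc": pred} , columns=['Id', 'winPlacePerc'])
    # Return a preview of the first rows only
    return sub.head()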

In [None]:
submission(df_test, feature_cols, RandomForestRegressor())

In [None]:
# Create a submission file

# Predict placements for the actual test data (not X_test), scaled with
# the scaler that was fitted on the training data
X_oos = scaler.transform(df_test[feature_cols])

# Reuse the SVR that was fitted on X_train in the previous section
pred = svr.predict(X_oos)

test_id = df_test["Id"]
sub = pd.DataFrame({'Id': test_id, "winPlacePerc": pred} , columns=['Id', 'winPlacePerc'])
sub.to_csv("submission.csv", index = False)