# **PUBG : Predict player placement tutorial**

### Hi kagglers 👋👋👋

### Welcome to this tutorial! It is aimed at beginners, but it should be useful whatever your level. If you find a way to improve it, I encourage you to fork this notebook and contribute a better solution!

![](https://static.gamespot.com/uploads/original/1197/11970954/3206264-pubg+artwork_.jpg)

### In this notebook, we are going to predict the final placement of a player in the famous battle royale game PUBG. Along the way, we will go through several fundamental machine learning topics and techniques. Here is a list of these techniques, with additional resources you can consult to find out more:
  

[EDA | Data exploration](https://medium.com/python-pandemonium/introduction-to-exploratory-data-analysis-in-python-8b6bcb55c190)  
[Feature engineering](https://adataanalyst.com/machine-learning/comprehensive-guide-feature-engineering/)  
[Evaluating a model over one training | metrics](https://machinelearningmastery.com/metrics-evaluate-machine-learning-algorithms-python/)  
[Evaluating a model over several trainings | k-fold cross validation](https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6)  


## **Table of contents**

1. [Data exploration](#data_exploration)
2. [Feature engineering](#fe)
3. [Try several models](#trymodels)
4. [Choosing the best model](#choose)
5. [Make prediction](#submission)

## **Imports & useful functions**

In [None]:
import pandas as pd
import matplotlib
import pydot
import re
import dask.dataframe as dd

%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12

import numpy as np
import seaborn as sns
sns.set()
import sklearn

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

from sklearn import metrics
from sklearn.ensemble import RandomForestRegressor 
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

import lightgbm as lgb

import gc
gc.enable()

In [None]:
# Create table for missing data analysis
def draw_missing_data_table(df):
    total = df.isnull().sum().sort_values(ascending=False)
    percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False)
    missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    return missing_data

In [None]:
#path of datasets
path_train = '../input/train_V2.csv'
path_test = '../input/test_V2.csv'

## **1. Data exploration** <a name="data_exploration"></a>

In [None]:
# create dataframe for training dataset and show the first five rows as a preview
train_df_raw = pd.read_csv(path_train)
train_df_raw.head()

In [None]:
# Compute some basic statistics on the dataset
train_df_raw.describe()

### At first glance, we can suppose that there are no missing data... let's verify it in the next cell!

In [None]:
draw_missing_data_table(train_df_raw)

### Only one missing value! We will delete this row during feature engineering.
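
Before dropping the incomplete row, it can help to look at it. The sketch below runs on a small made-up frame (the `toy` name and its values are illustrative assumptions, not the real dataset); the same calls should work on `train_df_raw`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_df_raw (values are made up for illustration).
toy = pd.DataFrame({
    "kills": [0, 2, 1, 5],
    "winPlacePerc": [0.10, 0.55, np.nan, 0.98],
})

# Per-column count and share of missing values, most affected column first.
missing = pd.DataFrame({
    "Total": toy.isnull().sum(),
    "Percent": toy.isnull().mean(),
}).sort_values("Total", ascending=False)
print(missing)

# Show the rows that contain at least one missing value, then drop them.
print(toy[toy.isnull().any(axis=1)])
toy_clean = toy.dropna()
```

`isnull().mean()` gives the same ratio as dividing the missing count by the row count, in a single call.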

In [None]:
train_df_raw.info()

### From this output we can see that one feature will need to be modified: the matchType feature. All the other features are already numerical, and the Id-related features are not taken into account for feature engineering.
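
The same conclusion can be reached programmatically rather than by reading the `info()` output. This is a minimal sketch on a made-up frame; `toy` and `id_cols` are names introduced here for illustration.

```python
import pandas as pd

# Toy frame with the same kinds of columns as the real dataset (made-up values).
toy = pd.DataFrame({
    "Id": ["a", "b"],
    "groupId": ["g1", "g2"],
    "matchId": ["m1", "m1"],
    "matchType": ["solo", "duo"],
    "kills": [0, 3],
    "walkDistance": [12.5, 830.0],
})

# Columns whose dtype is not numeric.
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()

# Identifier columns are excluded from feature engineering.
id_cols = ["Id", "groupId", "matchId"]
to_encode = [col for col in non_numeric if col not in id_cols]
print(to_encode)
```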

In [None]:
# Let's plot some histograms of the main features to get a preview of the data ...
train_df_raw.drop(['Id', 'groupId', 'matchId', 'winPoints', 'rankPoints', 'teamKills', 'vehicleDestroys', 'roadKills', 'swimDistance', 'numGroups'], axis=1).hist(bins=50, figsize=(50,80), layout=(8, 3))
plt.show()

### Now, let's look for correlations in the data:

In [None]:
plt.figure(figsize=(20,15))  
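# keep only numeric columns: recent pandas versions raise an error when corr() meets string columns
sns.heatmap(train_df_raw.select_dtypes(include='number').corr(), annot=True, fmt=".2f")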
plt.show()

### With this first simple data exploration, we can observe that:

* The majority of players have no kills at the end of the game (the same holds for assists).
* Players who do get kills have between 1 and 10 of them in almost all cases (the same holds for assists).
* Walk distance does not exceed 7000-8000 meters.
* We can identify low-importance variables that are almost always zero: swim distance, vehicles destroyed, road kills.
* Team kills are extremely rare, but we can assume that killing a teammate badly compromises a player's placement, so this variable may still be relevant.
* The killPlace variable seems to show a strong correlation between placement and number of enemy players killed for placements 0 to 90, and a decorrelation at the end of the range (top 10 players).
* The final placement is well distributed between 0 and 100, with a majority of 0, probably a consequence of players leaving early.
* winPoints and killPoints seem to be redundant: players who get the most kills often win their games. We may delete one of those columns.
* Unsurprisingly, damage dealt is strongly correlated with the number of kills, but not enough to delete one of those columns.
* No other strong correlation can be spotted.
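
Reading a large heatmap by eye is error-prone, so the observations above can be cross-checked numerically. The sketch below uses synthetic data (the distributions and the 0.9 threshold are assumptions chosen for illustration) to rank features by their correlation with the target and to list highly correlated feature pairs.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the training data (made-up distributions).
rng = np.random.default_rng(0)
n = 2000
kills = rng.poisson(1.0, n)
toy = pd.DataFrame({
    "kills": kills,
    "damageDealt": 100 * kills + rng.normal(0, 30, n),
    "swimDistance": rng.normal(0, 1, n),  # unrelated noise
})
toy["winPlacePerc"] = (0.1 * kills + rng.normal(0, 0.05, n)).clip(0, 1)

# Rank features by absolute correlation with the target.
target_corr = (
    toy.corr()["winPlacePerc"]
    .drop("winPlacePerc")
    .abs()
    .sort_values(ascending=False)
)
print(target_corr)

# List feature pairs whose absolute correlation exceeds a chosen threshold.
corr = toy.drop(columns="winPlacePerc").corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().sort_values(ascending=False)
strong_pairs = pairs[pairs > 0.9]
print(strong_pairs)
```

Keeping only the upper triangle avoids reporting each pair twice and skips the diagonal.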

## **2. Feature engineering** <a name="fe"></a>

### **2.1 Non-numerical variable treatment**

#### Let's take a look at the only non-numerical variable of this dataset, the matchType column:

In [None]:
train_df_raw.matchType.unique().tolist()

### **These values need some explanation. After a bit of research, it turns out that:**

- fpp means "first-person perspective". Those games are not very different from TPP (third-person perspective) games, so we won't treat them differently.

- The crash event is described by the developers as follows: “In Crash Carnage, no firearms spawn so you’ll need to focus on melee weapons, throwables, and of course your driving skills to carry your duo to that final circle. Circles move considerably faster in this event, so loot quick, grab a vehicle, and crash your way to road warrior glory.” This mode is not standard, so let's check whether a lot of the data comes from this game mode.

- The flare event is a mode with a flare gun that lets players obtain weapons and armor by calling in a care package. This mode can be played with a four-person team and is not very different from the basic squad mode, so we will merge flare games into squad games during feature engineering.

In [None]:
train_df_raw['matchType'].value_counts()

### Interesting... Crash games are a minority: 0.1% of total games and about 0.5% of total duo games. From these observations, we can conclude that adding crash games to duo games will not skew the predictions.
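
Shares like these can be derived directly from the `value_counts()` result. The counts below are made up for illustration (they are not the real dataset figures), and treating every match type containing "duo" as a duo game is an assumption of this sketch.

```python
import pandas as pd

# Made-up counts shaped like the output of train_df_raw['matchType'].value_counts().
counts = pd.Series({
    "squad-fpp": 500,
    "duo-fpp": 300,
    "solo-fpp": 150,
    "duo": 45,
    "crashfpp": 4,
    "crashtpp": 1,
})

is_crash = counts.index.str.startswith("crash")
is_duo = counts.index.str.contains("duo")

crash_total = counts[is_crash].sum()

# Share of crash games among all games.
share_of_all = crash_total / counts.sum()

# Share of crash games once they are merged into the duo games.
share_of_duo = crash_total / (counts[is_duo].sum() + crash_total)

print(f"crash share of all games: {share_of_all:.2%}")
print(f"crash share of duo games: {share_of_duo:.2%}")
```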

### **2.2 Formatting function - features creation**

### We now have enough information to transform our data and make it ready for the machine learning algorithm. To do that, we will build a function that takes a dataframe as argument and returns a new, fully formatted dataframe ready for prediction.

In [None]:
def preprocess_data(df, with_categorical=False):

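    # reset the index so it matches the fresh index produced by the merges below;
    # otherwise the final join with match_df misaligns rows when df has index gaps (e.g. after dropna)
    processed_df = df.drop(['Id', 'rankPoints'],  axis=1).reset_index(drop=True)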
            
    # handle matchType column by creating dummies cols or creating new categorical variable column
    print('-'*5 + ' handling matchType column ' + '-'*5)
    new_matchType_cols = list()
    if with_categorical:
        for mtype in processed_df['matchType']:
            if mtype in ['squad', 'squad-fpp', 'normal-squad-fpp', 'normal-squad', 'flarefpp', 'flaretpp']:
                new_matchType_cols.append([3])
            elif mtype in ['solo', 'solo-fpp', 'normal-solo-fpp', 'normal-solo']:
                new_matchType_cols.append([1])
            else:
                new_matchType_cols.append([2])
        match_df = pd.DataFrame(new_matchType_cols, columns=['matchType'], index=processed_df.index)
        
    else:
        for mtype in processed_df['matchType']:
            if mtype in ['squad', 'squad-fpp', 'normal-squad-fpp', 'normal-squad', 'flarefpp', 'flaretpp']:
                new_matchType_cols.append([1, 0, 0])
            elif mtype in ['solo', 'solo-fpp', 'normal-solo-fpp', 'normal-solo']:
                new_matchType_cols.append([0, 0, 1])
            else:
                new_matchType_cols.append([0, 1, 0])
        match_df = pd.DataFrame(new_matchType_cols, columns=['squad','duo', 'solo'], index=processed_df.index)
        
    processed_df = processed_df.drop(['matchType'],  axis=1)
    
    # create matchSize column
    print('-'*5 + ' create matchSize column ' + '-'*5)
    match_size = processed_df.groupby(['matchId']).size().reset_index(name='matchSize')
    processed_df = processed_df.merge(match_size, how='left', on=['matchId'])
    
    # create teamSize column
    print('-'*5 + ' create teamSize column ' + '-'*5)
    processed_df['combinedId'] = processed_df['matchId'] + processed_df['groupId']
    group_size = processed_df.groupby(['combinedId']).size().reset_index(name='teamSize')
    processed_df = processed_df.merge(group_size, how='left', on=['combinedId'])
    
    # create totalDistance col
    print('-'*5 + ' create totalDistance column ' + '-'*5)
    processed_df['totalDistance'] = processed_df['rideDistance'] + processed_df['walkDistance'] + processed_df['swimDistance']
    #processed_df['headshotRate'] = processed_df['headshotKills'] / processed_df['kills']
    #processed_df['killstreaksRate'] = processed_df['killStreaks'] / processed_df['kills']
    
    processed_df = processed_df.drop(['combinedId', 'matchId', 'groupId'],  axis=1)
    processed_df = processed_df.join(match_df)
    
    # delete low importance features (the one-hot columns only exist when with_categorical=False)
    processed_df = processed_df.drop(['teamKills', 'vehicleDestroys', 'roadKills', 'swimDistance', 'headshotKills', 'solo', 'duo', 'squad'], axis=1, errors='ignore')

    return processed_df
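
The row-by-row loops over matchType can be slow on several million rows. As a hedged alternative, the sketch below shows a vectorized way to build the same three groups and the two size columns on a made-up frame. `simplify_match_type` is a hypothetical helper introduced here, and the pattern matching assumes the match type names seen earlier (anything that is neither squad/flare nor solo falls back to duo, as in the function above).

```python
import pandas as pd

# Made-up rows: three players in a squad match, two in a crash match, one solo.
toy = pd.DataFrame({
    "matchId": ["m1", "m1", "m1", "m2", "m2", "m3"],
    "groupId": ["g1", "g1", "g2", "g3", "g3", "g4"],
    "matchType": ["squad-fpp"] * 3 + ["crashfpp"] * 2 + ["solo"],
})

def simplify_match_type(match_type):
    """Map raw match types to 'squad', 'duo' or 'solo' (hypothetical helper)."""
    mode = pd.Series("duo", index=match_type.index)
    mode.loc[match_type.str.contains("squad|flare")] = "squad"
    mode.loc[match_type.str.contains("solo")] = "solo"
    return mode

toy["mode"] = simplify_match_type(toy["matchType"])

# One-hot encoding of the simplified mode.
one_hot = pd.get_dummies(toy["mode"])

# Number of players per match and per team, without an explicit merge.
toy["matchSize"] = toy.groupby("matchId")["matchId"].transform("size")
toy["teamSize"] = toy.groupby(["matchId", "groupId"])["groupId"].transform("size")
print(toy)
```

`transform` keeps the original row order and index, so there is no alignment step to get wrong afterwards.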

## **3. Make prediction**  <a name="submission"></a>

In [None]:
train_df = preprocess_data(train_df_raw.dropna())
X_train = train_df.drop('winPlacePerc', axis=1)
y_train = train_df['winPlacePerc']
sc = StandardScaler()
X_train = pd.DataFrame(sc.fit_transform(X_train.values), index=X_train.index, columns=X_train.columns)
X_train.head()

In [None]:
test_df_raw = pd.read_csv(path_test)
# check for missing data, as we did for the train dataframe
draw_missing_data_table(test_df_raw)

In [None]:
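# apply the same transformation to the test dataset as to the train dataset
X_test = preprocess_data(test_df_raw)
# reuse the scaler fitted on the training data; refitting on the test set would scale it differently
X_test = pd.DataFrame(sc.transform(X_test.values), index=X_test.index, columns=X_test.columns)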
X_test.head()
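
A scaler is normally fitted once on the training features and then reused unchanged on any other data, so that every dataset is expressed in the same units. The tiny example below, with made-up numbers, shows how much the result changes when the scaler is refitted on the second dataset instead.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature datasets.
train = np.array([[0.0], [2.0]])    # mean 1, standard deviation 1
test = np.array([[10.0], [12.0]])   # mean 11, standard deviation 1

# Fit on the training data only, then reuse the learned mean and scale.
scaler = StandardScaler().fit(train)
test_scaled = scaler.transform(test)

# Refitting on the test data hides the shift between the two datasets.
test_refit = StandardScaler().fit_transform(test)

print(test_scaled.ravel())  # [ 9. 11.]
print(test_refit.ravel())   # [-1.  1.]
```

With the refitted scaler, the model would see test values centred on zero even though they lie far from the training range.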

In [None]:
# Create and train model on train data sample
params = {
    #'num_leaves': 2048,
    'learning_rate': 0.001,
    #'n_estimators': 1000,
    #'max_depth':10,
    'min_data_in_leaf': 400,
    'max_bin': 10000,
    #'bagging_fraction':0.8,
    #'bagging_freq':5,
    #'feature_fraction':0.9,
    #'verbose':50,
    'boosting_type': 'dart',
    'random_state': 42,
    'objective' : 'regression',
    'metric': 'mae'
    }

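# the 'silent' argument is no longer part of the constructor in recent LightGBM releases; 'verbose' sets the logging level
model = lgb.LGBMRegressor(**params, verbose=2)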
model.fit(X_train, y_train, eval_metric= 'mae')
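
Training on the full training set gives no indication of how the model behaves on unseen data. A simple hold-out split with the MAE metric provides a first estimate. The sketch below is only an illustration: it uses synthetic data and a `RandomForestRegressor` as a stand-in so that it runs without LightGBM, and the split sizes and seeds are arbitrary assumptions. The same pattern should apply to the LightGBM model.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic features and a target in [0, 1], standing in for X_train / y_train.
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(2000, 3))
y = np.clip(0.6 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 0.05, 2000), 0, 1)

# Keep 20% of the rows aside for validation.
X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(X_fit, y_fit)

# MAE of the model versus a baseline that always predicts the training mean.
val_mae = mean_absolute_error(y_val, model.predict(X_val))
baseline_mae = mean_absolute_error(y_val, np.full_like(y_val, y_fit.mean()))
print(f"validation MAE: {val_mae:.4f} (baseline: {baseline_mae:.4f})")
```

Comparing against a constant baseline makes the MAE easier to interpret than the raw number alone.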

In [None]:
# Predict for test data sample
prediction = model.predict(X_test)

In [None]:
lgb.plot_importance(model)
plt.show()

In [None]:
# Tip found here: https://www.kaggle.com/anycode/simple-nn-baseline-3
for i in range(len(test_df_raw)):
    winPlacePerc = prediction[i]
    maxPlace = int(test_df_raw.iloc[i]['maxPlace'])
    if maxPlace == 0:
        winPlacePerc = 0.0
    elif maxPlace == 1:
        winPlacePerc = 1.0
    else:
        gap = 1.0 / (maxPlace - 1)
        winPlacePerc = round(winPlacePerc / gap) * gap
    
    if winPlacePerc < 0: winPlacePerc = 0.0
    if winPlacePerc > 1: winPlacePerc = 1.0    
    prediction[i] = winPlacePerc
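
The loop above calls `iloc` once per row, which can take a long time on the full test set. A vectorized NumPy version of the same idea is sketched below; `snap_to_placements` is a hypothetical helper name, and it multiplies by `maxPlace - 1` instead of dividing by the gap, which is mathematically equivalent but may differ in the last floating-point digits.

```python
import numpy as np

def snap_to_placements(pred, max_place):
    """Round predictions to the nearest achievable placement (hypothetical helper)."""
    pred = np.asarray(pred, dtype=float)
    max_place = np.asarray(max_place, dtype=int)

    # Number of steps between 0 and 1; use 1 where maxPlace <= 1 to avoid dividing by zero.
    steps = np.where(max_place > 1, max_place - 1, 1)
    snapped = np.round(pred * steps) / steps

    # Special cases handled explicitly, as in the loop above.
    snapped = np.where(max_place == 0, 0.0, snapped)
    snapped = np.where(max_place == 1, 1.0, snapped)

    # Keep the result inside the valid range.
    return np.clip(snapped, 0.0, 1.0)

# Made-up predictions and maxPlace values.
example = snap_to_placements([0.26, 1.2, 0.5, 0.7], [5, 3, 0, 1])
print(example)
```

On the real data this could be called as `snap_to_placements(prediction, test_df_raw['maxPlace'].values)`.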

In [None]:
result_df = test_df_raw.copy()
result_df['winPlacePerc'] = prediction

result_df.head()
result_df.to_csv('submission.csv', columns=['Id', 'winPlacePerc'], index=False)