<img src="images/pubg.jpg"></img>
# Winner Winner Chicken Dinner!
Author: Jin Yeom (jinyeom@utexas.edu)

## Contents
- [Configurations](#Configurations)
- [Preprocessing](#Preprocessing)
- [Basic EDA](#Basic-EDA)
- [Feature engineering](#Feature-engineering)
- [References](#References)

In [1]:
import numpy as np
import pandas as pd
import torch

In [2]:
%matplotlib inline

## Configurations

In [3]:
pd.set_option("display.max_rows", 500)
pd.set_option("display.max_columns", 500)

## Preprocessing

In [4]:
df_train = pd.read_csv("datasets/train.csv")
df_test = pd.read_csv("datasets/test.csv")

OSError: Initializing from file failed

In [6]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4357336 entries, 0 to 4357335
Data columns (total 26 columns):
Id                 int64
groupId            int64
matchId            int64
assists            int64
boosts             int64
damageDealt        float64
DBNOs              int64
headshotKills      int64
heals              int64
killPlace          int64
killPoints         int64
kills              int64
killStreaks        int64
longestKill        float64
maxPlace           int64
numGroups          int64
revives            int64
rideDistance       float64
roadKills          int64
swimDistance       float64
teamKills          int64
vehicleDestroys    int64
walkDistance       float64
weaponsAcquired    int64
winPoints          int64
winPlacePerc       float64
dtypes: float64(6), int64(20)
memory usage: 864.3 MB


[This kernel](https://www.kaggle.com/shubhammank/basic-eda-nn) presents a simple trick that substantially reduces RAM usage: downcast each column to the smallest data type that can hold its values. For example, a column whose values all fit in `int8` should be stored as `int8` rather than `int64`.

In [10]:
def suitable_dtype(curr_dtype, c_min, c_max):
    """Return the smallest dtype of the same kind whose range covers [c_min, c_max]."""
    if np.issubdtype(curr_dtype, np.integer):
        types, info = [np.int8, np.int16, np.int32, np.int64], np.iinfo
    else:
        # Only the range is checked here, so floats may lose precision when downcast.
        types, info = [np.float16, np.float32, np.float64], np.finfo
    for t in types:
        # Inclusive bounds: a value equal to the type's min or max still fits.
        if c_min >= info(t).min and c_max <= info(t).max:
            return t
    return types[-1]  # fall back to the widest type


def optim_mem(df):
    """Downcast every column of df in place and return it."""
    for c in df.columns:
        suitable = suitable_dtype(df[c].dtype, df[c].min(), df[c].max())
        df[c] = df[c].astype(suitable)
    return df
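
For comparison, pandas ships a built-in downcasting helper, `pd.to_numeric(..., downcast=...)`. The sketch below applies it to a small made-up frame; the values and the `downcast_builtin` name are illustrative assumptions, not part of this notebook. Note that `downcast="float"` never goes below `float32`, so it is more conservative than a range-only check that allows `float16`.

```python
import numpy as np
import pandas as pd


def downcast_builtin(df):
    """Return a copy of df with numeric columns downcast by pandas itself."""
    out = df.copy()
    for c in out.columns:
        if pd.api.types.is_integer_dtype(out[c]):
            out[c] = pd.to_numeric(out[c], downcast="integer")
        elif pd.api.types.is_float_dtype(out[c]):
            out[c] = pd.to_numeric(out[c], downcast="float")
    return out


# Toy data: made-up values that reuse a few column names from the dataset.
toy = pd.DataFrame({
    "kills": np.array([0, 3, 12], dtype=np.int64),
    "killPoints": np.array([1000, 1500, 2000], dtype=np.int64),
    "walkDistance": np.array([12.5, 2049.0, 3500.25], dtype=np.float64),
})
small = downcast_builtin(toy)
print(small.dtypes)
```

Because the helper works on a copy, the original frame keeps its 64-bit columns and the two can be compared side by side with `memory_usage()`.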

In [12]:
df_train = optim_mem(df_train)
df_test = optim_mem(df_test)

In [13]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4357336 entries, 0 to 4357335
Data columns (total 26 columns):
Id                 int32
groupId            int32
matchId            int32
assists            int8
boosts             int8
damageDealt        float16
DBNOs              int8
headshotKills      int8
heals              int8
killPlace          int8
killPoints         int16
kills              int8
killStreaks        int8
longestKill        float16
maxPlace           int8
numGroups          int8
revives            int8
rideDistance       float16
roadKills          int8
swimDistance       float16
teamKills          int8
vehicleDestroys    int8
walkDistance       float16
weaponsAcquired    int8
winPoints          int16
winPlacePerc       float16
dtypes: float16(6), int16(2), int32(3), int8(15)
memory usage: 178.7 MB
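
The memory saving comes with a precision cost that is worth checking before training on these columns. `float16` keeps an 11-bit significand, so it cannot represent every integer above 2048 and its largest finite value is 65504. The sketch below uses made-up distances (not values from the dataset) to show the size of the rounding error a column such as `walkDistance` could pick up.

```python
import numpy as np

# Largest finite float16 value, converted to a Python float for exact display.
print(float(np.finfo(np.float16).max))

# 2049 lies halfway between two representable float16 values and rounds to 2048.
print(float(np.float16(2049.0)))

# Made-up distances, round-tripped through float16 to measure the error.
distances = np.array([151.3, 2049.0, 9876.5], dtype=np.float64)
rounded = distances.astype(np.float16).astype(np.float64)
abs_err = np.abs(rounded - distances)
print(abs_err)
```

If errors of this size matter for a given feature, keeping that column at `float32` is a reasonable compromise.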


We're going to use an important insight from [this kernel](https://www.kaggle.com/anycode/simple-nn-baseline) and group the entries by **match** and **group**.

In [14]:
# groupby returns a DataFrameGroupBy, which has no info() method, so keep
# df_train as a DataFrame and inspect the grouping through its own attributes.
grouped = df_train.groupby(["matchId", "groupId"])
grouped.ngroups
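
A `DataFrameGroupBy` object is meant to be aggregated rather than inspected like a DataFrame. As a minimal sketch of the grouping idea, the example below builds one row per (`matchId`, `groupId`) pair from a tiny made-up frame. The values, the choice of the mean as the aggregate, and the `groupSize` column name are assumptions for illustration, not the referenced kernel's exact method.

```python
import pandas as pd

# Toy data: made-up rows that reuse column names from the training set.
toy = pd.DataFrame({
    "matchId": [0, 0, 0, 1, 1],
    "groupId": [10, 10, 11, 20, 21],
    "kills": [2, 0, 1, 4, 0],
    "walkDistance": [1200.0, 800.0, 300.0, 2500.0, 50.0],
    "winPlacePerc": [0.9, 0.9, 0.4, 1.0, 0.0],
})

grouped = toy.groupby(["matchId", "groupId"])

# One row per group: the mean of each feature plus the number of players.
group_stats = grouped.mean()
group_stats["groupSize"] = grouped.size()  # hypothetical column name
group_stats = group_stats.reset_index()
print(group_stats)
```

Keeping the grouped result in a separate variable leaves the original per-player frame available for later steps.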

## Basic EDA

## References

- https://www.kaggle.com/c/pubg-finish-placement-prediction (Kaggle competition page)
- https://www.kaggle.com/shubhammank/basic-eda-nn (Basic EDA and neural network solution)
- https://www.kaggle.com/anycode/simple-nn-baseline (Another neural network solution)