
**1. Introduction**

![Imgur](https://i.imgur.com/NmskNuo.jpg)

**PUBG (PlayerUnknown's Battlegrounds)** is a hugely successful and popular online shooter game. It is of the so-called "battle royale" type: the game ends when only one team is left alive on the map. The difference from a normal deathmatch is that once you are killed in a battle royale game you are not respawned (perma-death). Here is the [official game site](https://www.pubg.com/).
At the time this competition was launched there were only two maps: "Erangel" and "Miramar". Currently there is "Vikendi" as well, but it is not in our dataset.

There were a few datasets about this game on Kaggle before. For example, if you want to see my non-parametric survival analysis (Kaplan-Meier), click [here](https://www.kaggle.com/datark1/pubg-survival-analysis-kaplan-meier).

This kernel is mostly EDA-oriented, but we will look for some anomalies as well (possibly cheaters).

**2. Database description**

OK, let's see what's inside.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns 
import warnings
warnings.filterwarnings("ignore")

In [None]:
train = pd.read_csv('../input/train_V2.csv')

A first glance at the data. Below are the first 5 rows:

In [None]:
train.head()

In [None]:
train.shape

In total we have:
* 26 columns
* 43 257 336 observations (rows)

Now, the list of columns.

In [None]:
train.columns


For a better understanding of the dataset, here are the column descriptions:

*     **groupId** - Players team ID
*     **matchId** - Match ID
*     **assists** - Number of assisted kills. The kill itself is credited to another teammate.
*     **boosts** - Number of boost items used by a player. These are for example: energy drink, painkillers, adrenaline syringe.
*     **damageDealt** - Damage dealt to the enemy
*     **DBNOs** - Down But Not Out - when you lose all your HP but you're not killed yet. All you can do is crawl.
*     **headshotKills** - Number of enemies killed with a headshot
*     **heals** - Number of healing items used by a player. These are for example: bandages, first-aid kits
*     **killPlace** - Ranking in a match based on kills.
*     **killPoints** - Ranking in a match based on kill points.
*     **kills** - Number of enemy players killed.
*     **killStreaks** - Max number of enemy players killed in a short amount of time.
*     **longestKill** - Longest distance between player and killed enemy.
*     **maxPlace** - The worst placement in the match.
*     **numGroups** - Number of groups (teams) in the match.
*     **revives** - Number of times this player revived teammates.
*     **rideDistance** - Total distance traveled in vehicles measured in meters.
*     **roadKills** - Number of kills from a car, bike, boat, etc.
*     **swimDistance** - Total distance traveled by swimming (in meters).
*     **teamKills** - Number of teammate kills (due to friendly fire).
*     **vehicleDestroys** - Number of vehicles destroyed.
*     **walkDistance** - Total distance traveled on foot (in meters).
*     **weaponsAcquired** - Number of weapons picked up.
*     **winPoints** - Ranking in a match based on won matches.

And our target column:
*     **winPlacePerc** - Normalised placement (rank). The 1st place is 1 and the last one is 0.
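To make the idea of a normalised placement concrete, here is a minimal sketch. It assumes a simple linear mapping of the final place onto the 0 to 1 range using `maxPlace`; the helper `win_place_perc` is hypothetical and the exact formula used to build the dataset is not stated here.

```python
def win_place_perc(place: int, max_place: int) -> float:
    """Map a 1-based placement onto [0, 1]: 1st place -> 1.0, last place -> 0.0.

    Assumption: a linear mapping over max_place (not confirmed by the dataset).
    """
    if max_place <= 1:
        # Only one place in the match: return 0.0 to avoid dividing by zero.
        return 0.0
    return (max_place - place) / (max_place - 1)


# Hypothetical match with 50 teams.
for place in (1, 25, 50):
    print(place, round(win_place_perc(place, 50), 3))
```

Under this assumption a team finishing in the middle of the field gets a value close to 0.5.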



Let's create some basic descriptive statistics for each column. These will be useful for setting the visualisation parameters, filtering out the outliers and getting a feeling for the ranges/scales.

In [None]:
train.describe()

Now, let's check if there is any missing data.

In [None]:
train.isna().sum()

**3. Exploratory Data Analysis**

Nice - it looks like we do not have any missing values. That's a perfect starting point for EDA and for ML as well.

In [None]:
train.plot(x="kills",y="damageDealt", kind="scatter", figsize = (15,10))

There is an obvious correlation between the number of kills and damage dealt. We also see that there are some outliers. The maximum number of kills is 60, which is far more than the vast majority of players get.
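The strength of such a relationship can also be expressed as a number. The sketch below uses a small hypothetical table (not the competition data) and computes the Pearson coefficient, plus a rank-based (Spearman) coefficient obtained by correlating the ranks, which is less sensitive to extreme outliers.

```python
import pandas as pd

# Hypothetical toy data, not taken from the competition file.
toy = pd.DataFrame({
    "kills": [0, 1, 2, 3, 5],
    "damageDealt": [20.0, 110.0, 190.0, 320.0, 480.0],
})

# Pearson correlation (the pandas default) between the two columns.
pearson = toy["kills"].corr(toy["damageDealt"])

# Spearman correlation computed as the Pearson correlation of the ranks.
spearman = toy["kills"].rank().corr(toy["damageDealt"].rank())

print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```

On the real data the same two lines could be applied to `train["kills"]` and `train["damageDealt"]`.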

We will look at the distances at which enemies were killed further below.

First, let's look at our kills master:

In [None]:
train[train['kills']==60]

Now let's look at headshot statistics, as a headshot is one of the most satisfying things you can score during a game. Players without any headshot kills are filtered out.

In [None]:
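# Keep only players with at least one headshot kill
headshots = train[train['headshotKills']>0]
plt.figure(figsize=(15,5))
# Pass the data as x=; recent seaborn versions expect it as a keyword argument
sns.countplot(x=headshots['headshotKills'].sort_values())
print("Maximum number of headshots that the player scored: " + str(train["headshotKills"].max()))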

DBNO - Down But Not Out. How many enemy DBNOs does an average player score?

In [None]:
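# Keep only players with at least one DBNO (separate name, so the headshot frame is not overwritten)
dbnos = train[train['DBNOs']>0]
plt.figure(figsize=(15,5))
# Pass the data as x=; recent seaborn versions expect it as a keyword argument
sns.countplot(x=dbnos['DBNOs'].sort_values())
print("Mean number of DBNOs that the player scored: " + str(train["DBNOs"].mean()))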

Is there a correlation between DBNOs and kills?

In [None]:
train.plot(x="DBNOs",y="kills", kind="scatter", figsize = (15,10))

It seems that DBNOs are correlated with kills. That makes sense, as usually if a player is not killed by a headshot you have to finish him while he's in the DBNO state.

**Maximum distances**

The range is filtered to a reasonable kill distance, e.g. 200 meters. To give you a feeling for distances in the game, I prepared a small comparison in the picture below. On the left side the building I'm aiming at is approximately 100m away, on the right side around 200m.

![Imgur](https://i.imgur.com/js8kQpU.jpg)

In [None]:
dist = train[train['longestKill']<200]
dist.hist('longestKill', bins=20, figsize = (15,10))

In [None]:
print("Average longest kill distance a player achieves is {:.1f}m, 95% of them are not more than {:.1f}m and the maximum distance is {:.1f}m." .format(train['longestKill'].mean(),train['longestKill'].quantile(0.95),train['longestKill'].max()))

The longest kill of 1323m seems a bit unrealistic (cheater?), but on the other hand, with an 8x scope, a static target, a very good position and a lot of luck it is possible.

For scale, the entire Miramar map is 8x8km, and 1300 meters is roughly like shooting from the La Bendita crater to Impala city. The picture below shows this in practice.
![Imgur](https://i.imgur.com/7WzRzkQ.jpg)
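Whether a very long kill counts as an outlier can be made a bit more systematic with a quantile-based cut-off. The following is only a sketch on hypothetical values; the 90th percentile is an arbitrary choice for illustration, and a flagged row is a candidate for inspection rather than a proven cheater.

```python
import pandas as pd

# Hypothetical longest-kill distances in meters, not taken from the competition file.
toy = pd.DataFrame({
    "longestKill": [0.0, 12.5, 30.0, 45.0, 60.0, 80.0, 150.0, 210.0, 400.0, 1300.0],
})

# Arbitrary cut-off chosen for illustration: the 90th percentile of the column.
cutoff = toy["longestKill"].quantile(0.90)

# Rows above the cut-off are candidates for manual inspection.
flagged = toy[toy["longestKill"] > cutoff]
print(f"cut-off: {cutoff:.1f} m, flagged rows: {len(flagged)}")
```

On the full dataset a much higher quantile (for example 0.999) would probably be more appropriate, since 10% of all players is far too many to inspect.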

**Driving vs. Walking**

First, let's count the players who didn't walk, ride or swim at all. Later on, the zero values are filtered out of the distance histograms.

In [None]:
walk0 = train["walkDistance"] == 0
ride0 = train["rideDistance"] == 0
swim0 = train["swimDistance"] == 0
print("{} players didn't walk at all, {} players didn't drive and {} didn't swim." .format(walk0.sum(),ride0.sum(),swim0.sum()))

The numbers above indicate that a significant number of players didn't walk at all. We should think about how to interpret these records. Obviously you have to walk at least a little bit in order to play this game (to get to a car at least). Are these disconnected players? If so, they shouldn't score any points. Let's check this.

In [None]:
walk0_rows = train[walk0]
print("Average place of non-walking players is {:.3f}, the worst is {} and the best is {}, 95% of players have a score below {}." 
      .format(walk0_rows["winPlacePerc"].mean(), walk0_rows["winPlacePerc"].min(), walk0_rows["winPlacePerc"].max(),walk0_rows["winPlacePerc"].quantile(0.95)))
walk0_rows.hist('winPlacePerc', bins=40, figsize = (15,7))

As we see, most of the non-walking players finish only in the last places. However, a few of them got better places and a few even the top ones. This may indicate the presence of the famous **cheaters**! Let's print a couple of suspicious rows.

In [None]:
# First five players who won (winPlacePerc == 1) without walking at all
suspects = train.query('winPlacePerc ==1 & walkDistance ==0').head()
suspects

In [None]:
print("Maximum ride distance for suspected entries is {:.3f} meters, and swim distance is {:.1f} meters." .format(suspects["rideDistance"].max(), suspects["swimDistance"].max()))

Interestingly, all of the travel-related columns are zero for these rows.
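The same kind of check can be written as a single boolean mask. The sketch below runs on a small hypothetical table (not the competition data) and flags players who did not move at all but still scored kills; the rule itself is just one possible heuristic.

```python
import pandas as pd

# Hypothetical toy data, not taken from the competition file.
toy = pd.DataFrame({
    "walkDistance": [0.0, 0.0, 1500.0, 0.0],
    "rideDistance": [0.0, 0.0, 0.0, 800.0],
    "swimDistance": [0.0, 0.0, 20.0, 0.0],
    "kills": [0, 7, 2, 1],
    "winPlacePerc": [0.0, 1.0, 0.6, 0.4],
})

# True for rows where every travel-related column is zero.
distance_cols = ["walkDistance", "rideDistance", "swimDistance"]
no_movement = toy[distance_cols].sum(axis=1) == 0

# Heuristic: no movement at all, yet at least one kill.
suspicious = toy[no_movement & (toy["kills"] > 0)]
print(suspicious)
```

Row 0 (no movement, no kills, last place) looks like a disconnected player, while row 1 is the kind of record worth a closer look.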

In [None]:
ride = train.query('rideDistance >0 & rideDistance <10000')
walk = train.query('walkDistance >0 & walkDistance <4000')
ride.hist('rideDistance', bins=40, figsize = (15,10))
walk.hist('walkDistance', bins=40, figsize = (15,10))

The plots above show that players mostly walk during a game. That makes sense when you consider that vehicles are usually used just to loot more locations and to reach more strategic positions for attack and defense.

Now let's create a sum of walking, driving and swimming distances for each row.

In [None]:
travel_dist = train["walkDistance"] + train["rideDistance"] + train["swimDistance"]
travel_dist = travel_dist[travel_dist<5000]
travel_dist.hist(bins=40, figsize = (15,10))
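If the summed distance is meant to be reused later (for example as a feature), it can be stored as a column. This is a minimal sketch on hypothetical values; the column names `totalDistance` and `walkShare` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data, not taken from the competition file.
toy = pd.DataFrame({
    "walkDistance": [1200.0, 300.0, 0.0],
    "rideDistance": [0.0, 2700.0, 0.0],
    "swimDistance": [0.0, 0.0, 0.0],
})

# Total distance traveled per row (hypothetical column name).
distance_cols = ["walkDistance", "rideDistance", "swimDistance"]
toy["totalDistance"] = toy[distance_cols].sum(axis=1)

# Share of the distance covered on foot; rows with zero total distance
# become NaN instead of triggering a division by zero.
nonzero_total = toy["totalDistance"].where(toy["totalDistance"] > 0)
toy["walkShare"] = toy["walkDistance"] / nonzero_total
print(toy)
```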

**4. Weapons Acquired**

In [None]:
print("Average number of acquired weapons is {:.3f}, minimum is {} and the maximum {}, 99% of players acquired no more than {} weapons." 
      .format(train["weaponsAcquired"].mean(), train["weaponsAcquired"].min(), train["weaponsAcquired"].max(), train["weaponsAcquired"].quantile(0.99)))

train.hist('weaponsAcquired', figsize = (20,10),range=(0, 10))

**5. Correlation map**

In [None]:
f,ax = plt.subplots(figsize=(20, 15))
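# Correlate numeric columns only; ID-like text columns cannot be correlated
sns.heatmap(train.select_dtypes(include="number").corr(), annot=True, linewidths=.6, fmt= '.2f',ax=ax)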
plt.show()
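A full heatmap is hard to read with this many columns, so it can help to extract only the correlations with the target and sort them by absolute value. The sketch below shows the idea on a hypothetical toy table, not on the competition data.

```python
import pandas as pd

# Hypothetical toy data, not taken from the competition file.
toy = pd.DataFrame({
    "matchId": ["a", "a", "b", "b", "c"],
    "kills": [0, 1, 2, 4, 6],
    "walkDistance": [100.0, 900.0, 1500.0, 2500.0, 3200.0],
    "teamKills": [1, 0, 0, 1, 0],
    "winPlacePerc": [0.05, 0.3, 0.55, 0.8, 1.0],
})

# Correlation matrix of the numeric columns only (matchId is text).
corr = toy.select_dtypes(include="number").corr()

# Correlation of every feature with the target, strongest first.
target_corr = corr["winPlacePerc"].drop("winPlacePerc")
order = target_corr.abs().sort_values(ascending=False).index
target_corr = target_corr.reindex(order)
print(target_corr)
```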

**6. Analysis of TOP 10% of players**

In [None]:
top10 = train[train["winPlacePerc"]>0.9]
print("TOP 10% overview\n")
print("Average number of kills: {:.1f}\nMinimum: {}\nThe best: {}\n95% of players within: {} kills." 
      .format(top10["kills"].mean(), top10["kills"].min(), top10["kills"].max(),top10["kills"].quantile(0.95)))

top10.plot(x="kills", y="damageDealt", kind="scatter", figsize = (15,10))

Let's see how they travel and compare this to the overall population.

In [None]:
fig, ax1 = plt.subplots(figsize = (15,10))
walk.hist('walkDistance', bins=40, figsize = (15,10), ax = ax1)
walk10 = top10[top10['walkDistance']<5000]
walk10.hist('walkDistance', bins=40, figsize = (15,10), ax = ax1)

print("Average walking distance: " + str(top10['walkDistance'].mean()))
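When two groups of very different size are drawn on the same axes, raw counts make the smaller group almost invisible. One option is to compare densities instead of counts. The sketch below uses randomly generated hypothetical distances and `numpy.histogram` with `density=True`, so that each histogram integrates to 1 regardless of the sample size.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical walking distances in meters: a large sample and a small one.
everyone = rng.gamma(shape=2.0, scale=600.0, size=10_000)
top = rng.gamma(shape=4.0, scale=600.0, size=1_000)

# 40 equal-width bins between 0 and 5000 meters, shared by both groups.
bins = np.linspace(0, 5000, 41)
dens_all, _ = np.histogram(everyone, bins=bins, density=True)
dens_top, _ = np.histogram(top, bins=bins, density=True)

# With density=True the area under each histogram is 1.
width = np.diff(bins)
print((dens_all * width).sum(), (dens_top * width).sum())
```

The matplotlib `hist` function accepts the same `density=True` argument, so the plots could be normalised in the same way.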

In [None]:
fig, ax1 = plt.subplots(figsize = (15,10))
ride.hist('rideDistance', bins=40, figsize = (15,10), ax = ax1)
ride10 = top10.query('rideDistance >0 & rideDistance <10000')
ride10.hist('rideDistance', bins=40, figsize = (15,10), ax = ax1)
print("Average riding distance: " + str(top10['rideDistance'].mean()))

What about the longest distances at which they scored their kills?

In [None]:
print("On average the best 10% of players have the longest kill at {:.3f} meters, and the best score is {:.1f} meters." .format(top10["longestKill"].mean(), top10["longestKill"].max()))

Now let's see the correlations between the variables:

In [None]:
f,ax = plt.subplots(figsize=(15, 15))
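# Correlate numeric columns only; ID-like text columns cannot be correlated
sns.heatmap(train.select_dtypes(include="number").corr(), annot=True, linewidths=.5, fmt= '.1f',ax=ax)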
plt.show()