# Left4Dead Machine Learning Project
I used to play Left4Dead and Left4Dead 2 when I was younger, and a large dataset of player stats is available thanks to Jack Lacey on Kaggle [1]. This project showcases ML algorithms and feature engineering applied to predicting playtime.

First, let's inspect the dataset. Since the data will not be updated, just load the whole dataset from the CSV file.

In [8]:
import numpy as np # Numpy .where method
import pandas as pd # Import pandas to work with dataframes
from sklearn.model_selection import train_test_split # Split into training and testing sets

In [4]:
d = pd.read_csv('l4d2_player_stats_final.csv')

In [5]:
d.info() # Info gives quick view on dataset

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20830 entries, 0 to 20829
Columns: 113 entries, Username to Average_Friendly_Fire
dtypes: float64(111), int64(1), object(1)
memory usage: 18.0+ MB


A quick look at the info shows that there are 20830 rows and 113 columns.

Each row describes one player's stats.

Each column describes a player attribute, such as the number of kills with an Uzi submachine gun or the number of kills with a pistol.

For this project, I want to use machine learning algorithms to predict playtime, so `Playtime_(Hours)` will be the target variable.

Quickly, let's look at summary statistics for the dataset.

In [13]:
d.describe() # summary statistics

Unnamed: 0,Username,Playtime_(Hours),Pistol_Shots,Pistol_Kills,Pistol_Usage,Magnum_Shots,Magnum_Kills,Magnum_Usage,Uzi_Shots,Uzi_Kills,...,Knife_Kills,Knife_Usage,Molotovs_Thrown,Molotov_Kills,Pipe_Bombs_Thrown,Pipe_Bomb_Kills,Bile_Jars_Thrown,Bile_Jar_Hits,Most_Friendly_Fire,Average_Friendly_Fire
count,20830.0,20830.0,20830.0,20830.0,20830.0,20830.0,20830.0,20830.0,20830.0,20830.0,...,20830.0,20830.0,20830.0,20830.0,20830.0,20830.0,20830.0,20830.0,20830.0,20830.0
mean,10414.5,104.684003,12389.739558,2031.092127,8.035493,4948.215218,1984.546471,5.151902,7227.58699,1274.591119,...,18.040278,0.050348,248.814546,2000.592751,255.532117,1799.752184,120.713778,179.056313,49765.28,81.363946
std,6013.247389,1974.873029,24198.764272,3944.710074,6.324443,16679.459523,8250.642273,5.586247,20501.115475,3779.680779,...,202.059539,0.302169,1333.472437,10028.70491,924.607365,6059.289202,586.911986,873.542727,6928768.0,879.579826
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,5207.25,19.027153,3336.5,519.0,4.8,522.0,164.0,1.58,1204.0,195.0,...,0.0,0.0,33.0,237.0,43.0,286.0,15.0,22.0,301.0,35.0
50%,10414.5,36.110417,6568.0,1060.0,6.68,1462.0,502.0,3.54,2923.5,506.0,...,0.0,0.0,75.0,583.0,95.0,658.0,36.0,58.0,705.0,55.0
75%,15621.75,73.698333,12966.75,2153.0,9.47,3918.0,1446.0,6.8075,6364.25,1146.75,...,0.0,0.0,175.0,1388.75,210.75,1457.0,86.0,137.0,1441.0,88.0
max,20829.0,277827.960278,608711.0,131565.0,100.0,627966.0,411640.0,100.0,731767.0,152921.0,...,18605.0,21.32,125856.0,635486.0,55566.0,249158.0,41774.0,87673.0,1000000000.0,121347.0


It is worth noting that 0 is the minimum for every variable visible in the summary. This raises the question of whether stats are tracked for someone who logged on but never actually played the game.
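Because `describe()` hides the middle columns behind `...`, the zero-minimum observation can be checked across every numeric column rather than only the visible ones. The following is a minimal sketch on a small hypothetical frame (`toy`, with made-up values) standing in for `d`:

```python
import pandas as pd

# Hypothetical stand-in for `d`; column names mimic the real dataset, values are made up.
toy = pd.DataFrame({
    'Playtime_(Hours)': [0.0, 12.5, 36.1],
    'Pistol_Shots': [0.0, 3336.0, 6568.0],
    'Knife_Kills': [0.0, 0.0, 4.0],
    'Difficulty': ['Easy', 'Easy', 'Easy'],
})

# Column-wise minimum over numeric columns only (string columns are skipped).
col_min = toy.select_dtypes(include='number').min()

# Names of numeric columns whose minimum is not zero; empty if the observation holds.
nonzero_min_cols = col_min[col_min != 0].index.tolist()
print(nonzero_min_cols)
```

On the real data, the same two lines would be applied to `d` instead of `toy`.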

There are 533 players tracked at 0 playtime.

In [24]:
len(np.where(d['Playtime_(Hours)'] == 0)[0])

533
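The same count can also be taken directly from a boolean mask, which skips building an index array first. A small sketch on a hypothetical Series (`playtime`, with made-up values) rather than the real column:

```python
import numpy as np
import pandas as pd

# Hypothetical playtime values; two of the four players have zero hours.
playtime = pd.Series([0.0, 19.0, 0.0, 73.7], name='Playtime_(Hours)')

# Approach used in the cell above: positions where the condition holds.
count_where = len(np.where(playtime == 0)[0])

# Alternative: True counts as 1 when summing a boolean mask.
count_mask = int((playtime == 0).sum())

print(count_where, count_mask)
```

Both approaches should agree; the `np.where` version is still useful when the row positions themselves are needed later.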

Zero playtime obviously means not playing the game at all. Let's see if playtime just is not tracked for these players, or if all other attributes for that player are zero.

There are at least two ways to do this.
1. Inspect the data manually using the indices given above.
2. Create a function that returns a list of the indices of observations whose attributes are all zero.

In [25]:
# 1. Manual inspection

# array of indices where players have 0 playtime
zero_playtime_arr = np.where(d['Playtime_(Hours)'] == 0)[0]

# Locate rows where playtime is 0
d.loc[zero_playtime_arr]

Unnamed: 0,Username,Playtime_(Hours),Pistol_Shots,Pistol_Kills,Pistol_Usage,Magnum_Shots,Magnum_Kills,Magnum_Usage,Uzi_Shots,Uzi_Kills,...,Knife_Usage,Molotovs_Thrown,Molotov_Kills,Pipe_Bombs_Thrown,Pipe_Bomb_Kills,Bile_Jars_Thrown,Bile_Jar_Hits,Most_Friendly_Fire,Difficulty,Average_Friendly_Fire
33,33,0.0,0.0,0.0,0.00,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Easy,0.0
71,71,0.0,0.0,0.0,0.00,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Easy,0.0
227,227,0.0,0.0,0.0,0.00,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Easy,0.0
229,229,0.0,245.0,64.0,40.00,0.0,0.0,0.0,43.0,36.0,...,0.0,0.0,0.0,2.0,16.0,1.0,0.0,0.0,Easy,0.0
230,230,0.0,0.0,0.0,0.00,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Easy,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20667,20667,0.0,0.0,0.0,0.00,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Easy,0.0
20775,20775,0.0,0.0,0.0,0.00,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Easy,0.0
20795,20795,0.0,0.0,0.0,0.00,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Easy,0.0
20800,20800,0.0,0.0,0.0,0.00,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Easy,0.0


In [28]:
# type(d)
d.columns

Index(['Username', 'Playtime_(Hours)', 'Pistol_Shots', 'Pistol_Kills',
       'Pistol_Usage', 'Magnum_Shots', 'Magnum_Kills', 'Magnum_Usage',
       'Uzi_Shots', 'Uzi_Kills',
       ...
       'Knife_Usage', 'Molotovs_Thrown', 'Molotov_Kills', 'Pipe_Bombs_Thrown',
       'Pipe_Bomb_Kills', 'Bile_Jars_Thrown', 'Bile_Jar_Hits',
       'Most_Friendly_Fire', 'Difficulty', 'Average_Friendly_Fire'],
      dtype='object', length=113)

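We observe that indices [33, 71, 227, ..., etc] have all displayed numeric attributes at 0. Index 229, however, has zero playtime but nonzero weapon stats, so zero playtime does not always mean the player never played.

But manually inspecting all 533 observations across 113 attributes is error-prone.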

Let's use the second method and automate the process.

In [27]:
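def valid_row(df, indices, match_val=None):
    '''
    __________
    valid_row
    __________
    ARGUMENTS:
    df        - a dataframe to inspect rows (restrict it to the columns that should match)
    indices   - index labels of rows in dataframe to inspect
    match_val - a value to compare against columns
    __________
    DESCR:
    The `valid_row` function looks for match_val in a dataframe by the indices and returns a list of indices
    where match_val matches every attribute.

    '''

    if match_val is None:
        raise ValueError('match_val argument cannot be None; pass a value to compare against the DataFrame')

    rows = df.loc[indices]                       # select the rows to inspect by label
    all_match = (rows == match_val).all(axis=1)  # True where every column equals match_val
    return rows.index[all_match].tolist()        # labels of the fully matching rows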
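The all-zero check can also be written with vectorized pandas operations. The sketch below is independent of the function above and runs on a hypothetical frame (`toy`, with made-up values); it assumes that `Username` is an identifier rather than a stat and that only numeric columns should be compared against zero:

```python
import pandas as pd

# Hypothetical stand-in for `d`; row 1 mimics a player with stats but no playtime.
toy = pd.DataFrame({
    'Username': [0, 1, 2, 3],
    'Playtime_(Hours)': [0.0, 0.0, 5.5, 0.0],
    'Pistol_Shots': [0.0, 245.0, 900.0, 0.0],
    'Pistol_Kills': [0.0, 64.0, 310.0, 0.0],
    'Difficulty': ['Easy', 'Easy', 'Easy', 'Easy'],
})

# Keep numeric stat columns only; dropping Username is an assumption (it is an ID).
stats = toy.select_dtypes(include='number').drop(columns=['Username'])

zero_playtime = stats['Playtime_(Hours)'] == 0
all_zero = stats.eq(0).all(axis=1)  # True where every numeric stat is 0

# Zero playtime and every other stat zero: likely never played.
never_played = toy.index[zero_playtime & all_zero].tolist()
# Zero playtime but some stat recorded: playtime may simply be untracked.
untracked = toy.index[zero_playtime & ~all_zero].tolist()

print(never_played, untracked)
```

Separating the two groups keeps the decision about what to do with each (drop, impute, or keep) explicit.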
        
    

`Playtime_(Hours)` is the target variable; split it from the dataset.

Also split `Username` from the dataset because it does not describe anything helpful in making predictions.

In [11]:
X = d.drop(['Username','Playtime_(Hours)'], axis=1) # The dataset without Username and the target, Playtime_(Hours)
y = d['Playtime_(Hours)'] # the target variable
y # display the target

0        2433.577222
1         121.879444
2          69.955278
3          48.421667
4         307.639722
            ...     
20825      34.481389
20826      19.644722
20827      13.125278
20828      11.973333
20829      31.906667
Name: Playtime_(Hours), Length: 20830, dtype: float64
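`train_test_split` was imported at the top but has not been used yet. A possible next step is sketched here on hypothetical toy data (`toy_X`, `toy_y`); the `test_size` and `random_state` values are illustrative assumptions, not choices made in this project:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical features and target with 10 rows; values are random placeholders.
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    'Pistol_Shots': rng.integers(0, 1000, size=10),
    'Pistol_Kills': rng.integers(0, 300, size=10),
})
toy_y = pd.Series(rng.random(10) * 100, name='Playtime_(Hours)')

# Hold out 20% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```

With the real `X` and `y`, any non-numeric column such as `Difficulty` would still need encoding before most models could be fit.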

### References
[1] Jack Lacey, Left 4 Dead 2 player stats dataset, Kaggle. https://www.kaggle.com/datasets/jacklacey/left-4-dead-2-20000-player-stats