# Left 4 Dead Machine Learning Project
I used to play Left 4 Dead and Left 4 Dead 2 when I was younger, and a lot of player data is available thanks to Jack Lacey on Kaggle.com [1]. This project showcases the use of ML algorithms and feature engineering to predict playtime.

First, let's inspect the dataset. Since the data will not be updated, we can simply load the whole dataset at once.

In [1]:
import numpy as np # Numpy .where method
import pandas as pd # Import pandas to work with dataframes
from sklearn.model_selection import train_test_split # Split into training and testing sets

In [2]:
d = pd.read_csv('l4d2_player_stats_final.csv')

In [3]:
d.info() # Info gives quick view on dataset

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20830 entries, 0 to 20829
Columns: 113 entries, Username to Average_Friendly_Fire
dtypes: float64(111), int64(1), object(1)
memory usage: 18.0+ MB


A quick look at the info shows that there are 20830 rows and 113 columns.

Each row describes one player's stats.

Each column describes a player attribute, for example the number of kills with an Uzi submachine gun or the number of kills with a pistol.

For this project, I want to use Machine Learning algorithms to predict playtime, so `Playtime_(Hours)` will be the target variable.

Quickly, let's look at summary statistics for the dataset.

In [4]:
d.describe() # summary statistics

Unnamed: 0,Username,Playtime_(Hours),Pistol_Shots,Pistol_Kills,Pistol_Usage,Magnum_Shots,Magnum_Kills,Magnum_Usage,Uzi_Shots,Uzi_Kills,...,Knife_Kills,Knife_Usage,Molotovs_Thrown,Molotov_Kills,Pipe_Bombs_Thrown,Pipe_Bomb_Kills,Bile_Jars_Thrown,Bile_Jar_Hits,Most_Friendly_Fire,Average_Friendly_Fire
count,20830.0,20830.0,20830.0,20830.0,20830.0,20830.0,20830.0,20830.0,20830.0,20830.0,...,20830.0,20830.0,20830.0,20830.0,20830.0,20830.0,20830.0,20830.0,20830.0,20830.0
mean,10414.5,104.684003,12389.739558,2031.092127,8.035493,4948.215218,1984.546471,5.151902,7227.58699,1274.591119,...,18.040278,0.050348,248.814546,2000.592751,255.532117,1799.752184,120.713778,179.056313,49765.28,81.363946
std,6013.247389,1974.873029,24198.764272,3944.710074,6.324443,16679.459523,8250.642273,5.586247,20501.115475,3779.680779,...,202.059539,0.302169,1333.472437,10028.70491,924.607365,6059.289202,586.911986,873.542727,6928768.0,879.579826
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,5207.25,19.027153,3336.5,519.0,4.8,522.0,164.0,1.58,1204.0,195.0,...,0.0,0.0,33.0,237.0,43.0,286.0,15.0,22.0,301.0,35.0
50%,10414.5,36.110417,6568.0,1060.0,6.68,1462.0,502.0,3.54,2923.5,506.0,...,0.0,0.0,75.0,583.0,95.0,658.0,36.0,58.0,705.0,55.0
75%,15621.75,73.698333,12966.75,2153.0,9.47,3918.0,1446.0,6.8075,6364.25,1146.75,...,0.0,0.0,175.0,1388.75,210.75,1457.0,86.0,137.0,1441.0,88.0
max,20829.0,277827.960278,608711.0,131565.0,100.0,627966.0,411640.0,100.0,731767.0,152921.0,...,18605.0,21.32,125856.0,635486.0,55566.0,249158.0,41774.0,87673.0,1000000000.0,121347.0


Note that the minimum is 0 for every variable shown. It is worth checking whether stats are recorded for players who logged on but never actually played the game.

There are 533 players tracked at 0 playtime.

In [5]:
len(np.where(d['Playtime_(Hours)'] == 0)[0])

533
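The same count can also be obtained by summing a boolean mask, which avoids building an index array first. The sketch below uses a small made-up DataFrame (`toy`) rather than the real dataset, so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset (values are made up for illustration)
toy = pd.DataFrame({
    'Playtime_(Hours)': [0.0, 12.5, 0.0, 3.2],
    'Pistol_Shots':     [0.0, 900.0, 245.0, 150.0],
})

# Boolean mask: True where playtime is exactly zero
zero_mask = toy['Playtime_(Hours)'] == 0

# Summing a boolean Series counts the True values
n_zero = int(zero_mask.sum())

# Equivalent count using the np.where approach from the cell above
n_zero_where = len(np.where(zero_mask)[0])

print(n_zero, n_zero_where)
```

Both approaches give the same count; the mask version is also convenient for filtering rows later with `toy[zero_mask]`.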

Zero playtime should mean that the player never played the game at all. Let's see whether playtime simply was not tracked for these players, or whether all of their other attributes are zero as well.

There are two (or more) ways to do this.
1. Inspect the data manually using the indices given above.
2. Create a function that returns the indices of observations whose attributes are all zero.

In [6]:
# 1. Manual inspection

# array of indices where players have 0 playtime
zero_playtime_arr = np.where(d['Playtime_(Hours)'] == 0)[0]

# Locate rows where playtime is 0
d.loc[zero_playtime_arr]

Unnamed: 0,Username,Playtime_(Hours),Pistol_Shots,Pistol_Kills,Pistol_Usage,Magnum_Shots,Magnum_Kills,Magnum_Usage,Uzi_Shots,Uzi_Kills,...,Knife_Usage,Molotovs_Thrown,Molotov_Kills,Pipe_Bombs_Thrown,Pipe_Bomb_Kills,Bile_Jars_Thrown,Bile_Jar_Hits,Most_Friendly_Fire,Difficulty,Average_Friendly_Fire
33,33,0.0,0.0,0.0,0.00,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Easy,0.0
71,71,0.0,0.0,0.0,0.00,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Easy,0.0
227,227,0.0,0.0,0.0,0.00,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Easy,0.0
229,229,0.0,245.0,64.0,40.00,0.0,0.0,0.0,43.0,36.0,...,0.0,0.0,0.0,2.0,16.0,1.0,0.0,0.0,Easy,0.0
230,230,0.0,0.0,0.0,0.00,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Easy,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20667,20667,0.0,0.0,0.0,0.00,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Easy,0.0
20775,20775,0.0,0.0,0.0,0.00,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Easy,0.0
20795,20795,0.0,0.0,0.0,0.00,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Easy,0.0
20800,20800,0.0,0.0,0.0,0.00,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Easy,0.0


In [7]:
# type(d)
d.columns

Index(['Username', 'Playtime_(Hours)', 'Pistol_Shots', 'Pistol_Kills',
       'Pistol_Usage', 'Magnum_Shots', 'Magnum_Kills', 'Magnum_Usage',
       'Uzi_Shots', 'Uzi_Kills',
       ...
       'Knife_Usage', 'Molotovs_Thrown', 'Molotov_Kills', 'Pipe_Bombs_Thrown',
       'Pipe_Bomb_Kills', 'Bile_Jars_Thrown', 'Bile_Jar_Hits',
       'Most_Friendly_Fire', 'Difficulty', 'Average_Friendly_Fire'],
      dtype='object', length=113)

We observe that indices [33, 71, 227, ...] have all displayed attributes at 0, while index 229 has nonzero values (for example, 245 pistol shots) despite 0 playtime.

But manually checking all 533 observations across 113 attributes is error-prone.

Let's use the second method and automate the process.

In [64]:
def is_value(df, indices, match_val = None):
    '''
    __________
    is_value
    __________
    ARGUMENTS:
    df         - a dataframe whose rows are inspected
    indices    - numpy array of row indices in the dataframe to inspect
    match_val  - a value to compare against every column
    __________
    DESCR:
    - requires preprocessing the dataset so that only columns comparable to `match_val` remain
      (e.g. drop identifier or text columns first).
    - raises a ValueError if no `match_val` is specified.
    - requires an `indices` argument of observations to inspect.
    - compares `match_val` with every column in the `df` argument, for every index.
    - records True for indices where the value in every attribute matches `match_val`.
    - returns a new array of indices where `match_val` matches all attributes.
    '''

    if match_val is None: # No match value was specified; raise an error
        raise ValueError('match_val must be specified to compare against values in the DataFrame')

    # Boolean list, one entry per index: True if every column matches match_val
    all_match = list()

    for index in indices:                              # Loop through all indices
        row_matches = True
        for column in df.columns:                      # Loop through all columns per index
            if df.loc[index, column] != match_val:     # If one attribute of the index is not match_val...
                row_matches = False                    # ... the row is not all of one value
                break                                  # Skip rest of columns and go to next index
        all_match.append(row_matches)

    return indices[np.array(all_match, dtype=bool)]
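As an alternative to looping over every cell, the same check can be sketched with vectorized pandas operations. The function name `is_value_vectorized` and the toy data below are hypothetical and only meant to illustrate the idea; it assumes, like the function above, that `indices` holds row labels of `df` and that all columns are comparable to `match_val`.

```python
import numpy as np
import pandas as pd

def is_value_vectorized(df, indices, match_val):
    """Return the entries of `indices` whose rows equal `match_val` in every column."""
    indices = np.asarray(indices)
    # Compare the selected rows element-wise, then require every column to match
    all_match = df.loc[indices].eq(match_val).all(axis=1).to_numpy()
    return indices[all_match]

# Toy data (made up): rows 0 and 2 are all zero, rows 1 and 3 are not
toy = pd.DataFrame({
    'Pistol_Shots': [0.0, 245.0, 0.0, 10.0],
    'Pistol_Kills': [0.0, 64.0, 0.0, 2.0],
})

candidates = np.array([0, 1, 2])  # hypothetical rows to inspect
result = is_value_vectorized(toy, candidates, match_val=0)
print(result)
```

On a dataset with hundreds of rows and over a hundred columns, the vectorized form is typically much faster than calling `df.loc[index, column]` once per cell.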

In [65]:
only_zero_indices = is_value(d.drop(['Username','Difficulty'], axis=1), zero_playtime_arr, match_val = 0)

In [66]:
d.drop(['Username','Difficulty'], axis=1).drop(only_zero_indices, axis = 0)

Unnamed: 0,Playtime_(Hours),Pistol_Shots,Pistol_Kills,Pistol_Usage,Magnum_Shots,Magnum_Kills,Magnum_Usage,Uzi_Shots,Uzi_Kills,Uzi_Usage,...,Knife_Kills,Knife_Usage,Molotovs_Thrown,Molotov_Kills,Pipe_Bombs_Thrown,Pipe_Bomb_Kills,Bile_Jars_Thrown,Bile_Jar_Hits,Most_Friendly_Fire,Average_Friendly_Fire
0,2433.577222,94665.0,10470.0,2.77,121222.0,27056.0,7.16,44666.0,5165.0,1.37,...,1793.0,0.47,11166.0,99278.0,5817.0,23433.0,5802.0,12863.0,13653.0,142.0
1,121.879444,9136.0,1371.0,1.47,14928.0,6802.0,7.30,997.0,187.0,0.20,...,24.0,0.03,788.0,10141.0,977.0,6962.0,519.0,1557.0,1914.0,89.0
2,69.955278,4100.0,693.0,4.87,222.0,133.0,0.93,2834.0,271.0,1.90,...,0.0,0.00,23.0,130.0,445.0,1202.0,44.0,83.0,3195.0,58.0
3,48.421667,7369.0,1208.0,5.99,784.0,250.0,1.24,3322.0,496.0,2.46,...,0.0,0.00,135.0,1090.0,105.0,716.0,48.0,75.0,1412.0,76.0
4,307.639722,51944.0,9481.0,8.93,20545.0,6813.0,6.42,38224.0,5493.0,5.17,...,0.0,0.00,613.0,4797.0,515.0,4195.0,272.0,424.0,10851.0,112.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20825,34.481389,10455.0,1494.0,10.70,2738.0,839.0,6.01,4613.0,595.0,4.26,...,0.0,0.00,24.0,212.0,73.0,433.0,16.0,42.0,504.0,62.0
20826,19.644722,3834.0,766.0,11.30,3023.0,1135.0,16.75,1899.0,397.0,5.86,...,0.0,0.00,46.0,368.0,82.0,407.0,25.0,25.0,470.0,82.0
20827,13.125278,3650.0,677.0,10.43,943.0,362.0,5.58,1401.0,85.0,1.31,...,2.0,0.03,34.0,311.0,84.0,730.0,35.0,85.0,149.0,31.0
20828,11.973333,1982.0,239.0,3.62,1423.0,407.0,6.16,2200.0,516.0,7.81,...,0.0,0.00,44.0,260.0,33.0,227.0,15.0,25.0,423.0,201.0


350
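Note that `DataFrame.drop` returns a new DataFrame and leaves the original untouched, so the all-zero rows stay in `d` unless the result is assigned. A minimal sketch on toy data, where the names `toy`, `only_zero_toy` and `toy_clean` are hypothetical:

```python
import numpy as np
import pandas as pd

# Toy data (made up): row 0 is all zero, row 2 has zero playtime but other stats
toy = pd.DataFrame({
    'Playtime_(Hours)': [0.0, 12.5, 0.0],
    'Pistol_Shots':     [0.0, 900.0, 245.0],
})
only_zero_toy = np.array([0])  # row labels where every attribute is zero

# drop returns a new DataFrame; assign it to keep the filtered result
toy_clean = toy.drop(index=only_zero_toy).reset_index(drop=True)

print(len(toy), len(toy_clean))
```

Whether to reset the index afterwards is a design choice; resetting keeps row labels contiguous, while leaving it alone preserves the original row identifiers.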

`Playtime_(Hours)` is the target variable; split it from the dataset.

Also drop `Username` from the dataset, because it does not carry information that helps in making predictions.

In [10]:
X = d.drop(['Username','Playtime_(Hours)'], axis=1) # The dataset without Username and the target, Playtime_(Hours)
y = d['Playtime_(Hours)'] # the target variable
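With features and target separated, the `train_test_split` function imported at the top can divide them into training and testing sets. The sketch below uses made-up toy data; the 80/20 split and `random_state=42` are assumptions chosen for illustration, not values taken from this project.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (made up) with the target and two numeric features
toy = pd.DataFrame({
    'Playtime_(Hours)': [float(h) for h in range(10)],
    'Pistol_Shots':     [100.0 * h for h in range(10)],
    'Pistol_Kills':     [10.0 * h for h in range(10)],
})

X_toy = toy.drop(['Playtime_(Hours)'], axis=1)  # features
y_toy = toy['Playtime_(Hours)']                 # target

# Hold out 20% of the rows for testing; fix the seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```

Note that `X` above still contains the text column `Difficulty`, which would likely need to be encoded or dropped before fitting most scikit-learn models.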

### References
[1] Jack Lacey, Left 4 Dead 2 player stats dataset, Kaggle. https://www.kaggle.com/datasets/jacklacey/left-4-dead-2-20000-player-stats