## Introduction
Following the EDA in 'eda.ipynb', we found that 'acs', 'dd_per_round', and 'damage_per_round' are strongly correlated variables that we can use in a linear regression model for predicting 'kd_ratio'. Let us begin working with these features to create a reliable model.

In [163]:
# Imports
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [164]:
# Loading csv
df = pd.read_csv('sample.csv', usecols=['kd_ratio', 'acs', 'dd_per_round', 'damage_per_round'])
display(df)

Unnamed: 0,damage_per_round,kd_ratio,dd_per_round,acs
0,169.8,1.32,35.0,259.0
1,165.6,1.23,28.0,250.3
2,188.8,1.51,59.0,286.8
3,162.7,1.32,34.0,244.6
4,180.9,1.38,37.0,282.6
...,...,...,...,...
5413,177.3,1.43,44.0,270.1
5414,159.3,1.17,21.0,234.9
5415,161.3,1.16,21.0,243.9
5416,143.8,1.05,9.0,217.0


In [165]:
# Let's observe some summary statistics for our chosen features, as well as kd_ratio:
print(df.describe())

       damage_per_round     kd_ratio  dd_per_round          acs
count       5418.000000  5418.000000   5418.000000  5418.000000
mean         145.827925     1.072091      8.287006   220.623736
std           14.224661     0.119815     13.555326    22.516934
min           64.500000     0.440000    -64.000000    91.900000
25%          136.225000     0.990000      0.000000   205.200000
50%          144.850000     1.060000      7.000000   218.500000
75%          153.900000     1.130000     15.000000   233.300000
max          227.300000     2.160000    102.000000   369.100000


# Feature Engineering
* Not all of these features are on the same scale, so it would be wise to rescale the data before modeling.
* The observations made in 'eda.ipynb' show that our features all follow an approximately Gaussian distribution and contain outliers. From this we can conclude that standardization is a suitable way to rescale our data.
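
For reference, standardization maps each value x to z = (x - mean) / std, computed per column. The sketch below uses a small illustrative array (the variable names `raw`, `manual`, and `scaled` are assumptions made for this example) to check that `StandardScaler` matches the manual formula with the population standard deviation, and that one `fit_transform` call can scale several columns at once:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Small illustrative array: two feature columns, four rows.
raw = np.array([[259.0, 35.0],
                [250.3, 28.0],
                [286.8, 59.0],
                [244.6, 34.0]])

# Manual z-score per column; ddof=0 is the population standard deviation.
manual = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=0)

# StandardScaler learns a separate mean and scale for every column it is given.
scaled = StandardScaler().fit_transform(raw)

print(np.allclose(manual, scaled))
```

Passing all columns in one call avoids refitting the same scaler object once per column, although both approaches give the same numbers.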

In [166]:
# Let's begin to standardize our features.
ss = StandardScaler()

In [167]:
# Reshaping each column into a 2D (n_samples, 1) array, as StandardScaler expects:
shaped_acs = np.array(df.acs).reshape(-1, 1)
shaped_dd_per_round = np.array(df.dd_per_round).reshape(-1, 1)
shaped_damage_per_round = np.array(df.damage_per_round).reshape(-1, 1)

In [168]:
# Transforming data (each fit_transform call refits the scaler on that column):
df['acs_standardized'] = ss.fit_transform(shaped_acs)
df['dd_per_round_standardized'] = ss.fit_transform(shaped_dd_per_round)
df['damage_per_round_standardized'] = ss.fit_transform(shaped_damage_per_round)

In [169]:
# Observing new columns:
display(df)
print(df.describe())

Unnamed: 0,damage_per_round,kd_ratio,dd_per_round,acs,acs_standardized,dd_per_round_standardized,damage_per_round_standardized
0,169.8,1.32,35.0,259.0,1.704486,1.970846,1.685403
1,165.6,1.23,28.0,250.3,1.318075,1.454396,1.390114
2,188.8,1.51,59.0,286.8,2.939227,3.741531,3.021235
3,162.7,1.32,34.0,244.6,1.064909,1.897067,1.186224
4,180.9,1.38,37.0,282.6,2.752683,2.118403,2.465810
...,...,...,...,...,...,...,...
5413,177.3,1.43,44.0,270.1,2.197494,2.634853,2.212705
5414,159.3,1.17,21.0,234.9,0.634082,0.937946,0.947180
5415,161.3,1.16,21.0,243.9,1.033818,0.937946,1.087794
5416,143.8,1.05,9.0,217.0,-0.160949,0.052604,-0.142577


       damage_per_round     kd_ratio  dd_per_round          acs  \
count       5418.000000  5418.000000   5418.000000  5418.000000   
mean         145.827925     1.072091      8.287006   220.623736   
std           14.224661     0.119815     13.555326    22.516934   
min           64.500000     0.440000    -64.000000    91.900000   
25%          136.225000     0.990000      0.000000   205.200000   
50%          144.850000     1.060000      7.000000   218.500000   
75%          153.900000     1.130000     15.000000   233.300000   
max          227.300000     2.160000    102.000000   369.100000   

       acs_standardized  dd_per_round_standardized  \
count      5.418000e+03               5.418000e+03   
mean       2.413065e-16              -1.573738e-17   
std        1.000092e+00               1.000092e+00   
min       -5.717280e+00              -5.333231e+00   
25%       -6.850470e-01              -6.114033e-01   
50%       -9.432597e-02              -9.495346e-02   
75%        5.63017
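
The standardized columns report a `std` of 1.000092 rather than exactly 1. This is expected: `StandardScaler` divides by the population standard deviation (ddof=0), while `describe()` reports the sample standard deviation (ddof=1). With n = 5418 rows, the two differ by a factor of sqrt(n / (n - 1)), which a quick check reproduces:

```python
import math

n = 5418  # row count taken from the describe() output
factor = math.sqrt(n / (n - 1))
print(round(factor, 6))  # 1.000092
```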

# Model Creation
Now that we've standardized our features, we can begin to test different features and see how they perform in a linear regression model.

In [170]:
# First, let's create our testing and training sets:
x_train, x_test, y_train, y_test = train_test_split(df[['acs_standardized', 'dd_per_round_standardized', 'damage_per_round_standardized']],
                                                     df['kd_ratio'],
                                                     train_size=.8,
                                                     test_size=.2,
                                                     shuffle=True,
                                                     random_state=48)
print("training x:", x_train.head(2), "training y:", y_train.head(2), "testing x:", x_test.head(2), "testing y:", y_test.head(2), sep='\n')

training x:
      acs_standardized  dd_per_round_standardized  \
3560         -0.058794                  -0.021175   
4956          0.540810                   1.306839   

      damage_per_round_standardized  
3560                      -0.051178  
4956                       0.806566  
training y:
3560    1.03
4956    1.18
Name: kd_ratio, dtype: float64
testing x:
     acs_standardized  dd_per_round_standardized  \
832         -0.964862                  -0.316289   
213          0.314293                  -0.021175   

     damage_per_round_standardized  
832                      -1.063598  
213                       0.412848  
testing y:
832    1.09
213    1.06
Name: kd_ratio, dtype: float64
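
In the cells above, the scaler was fitted on the full dataset before the split, so the test rows contributed to the means and standard deviations. A common alternative, sketched here on synthetic data (the data, coefficients, and variable names are illustrative assumptions, not the notebook's dataset), is to split first and place the scaler inside a pipeline. For ordinary least squares the predictions do not depend on feature scaling, so the two R^2 values should agree up to rounding; the pattern matters more for models that are sensitive to scale, such as regularized regression.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical; not the notebook's dataset).
rng = np.random.default_rng(48)
X = rng.normal(loc=[220.0, 8.0, 145.0], scale=[22.0, 13.0, 14.0], size=(500, 3))
y = 0.005 * X[:, 0] + 0.008 * X[:, 1] + rng.normal(scale=0.03, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=48
)

# The scaler is fitted on the training rows only, then reused on the test rows.
scaled_model = make_pipeline(StandardScaler(), LinearRegression())
scaled_model.fit(X_train, y_train)

# Same model without scaling, for comparison.
plain_model = LinearRegression().fit(X_train, y_train)

scaled_r2 = scaled_model.score(X_test, y_test)
plain_r2 = plain_model.score(X_test, y_test)
print(scaled_r2, plain_r2)
```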


In [171]:
# Now that our testing and training data have been created, let's first create a linear regression model using all three features:
all_feature_model = LinearRegression()
all_feature_model.fit(x_train, y_train)
# Let's view the R^2 score of all_feature_model on the test set:
rr_score = all_feature_model.score(x_test, y_test)
print("all_feature_model R^2 score: " + str(rr_score))

all_feature_model R^2 score: 0.9402540007687736
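
As a reminder of what this number means, `LinearRegression.score` returns the coefficient of determination, R^2 = 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the targets from their mean. A minimal sketch on synthetic data (all values and names here are assumptions for illustration) computes it by hand and compares it with `score`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (hypothetical; not the notebook's dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.0 + 0.1 * X[:, 0] + 0.05 * X[:, 1] + rng.normal(scale=0.03, size=200)

# Fit on the first 150 rows, evaluate on the remaining 50.
model = LinearRegression().fit(X[:150], y[:150])
X_test, y_test = X[150:], y[150:]

pred = model.predict(X_test)
ss_res = np.sum((y_test - pred) ** 2)           # unexplained variation
ss_tot = np.sum((y_test - y_test.mean()) ** 2)  # total variation
manual_r2 = 1.0 - ss_res / ss_tot

print(np.isclose(manual_r2, model.score(X_test, y_test)))
```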


In [172]:
# Let's inspect the coefficient that all_feature_model assigns to each feature:
coefs = all_feature_model.coef_
print(f"acs coef: {coefs[0]}\ndd_per_round coef: {coefs[1]}\ndamage_per_round coef: {coefs[2]}")

acs coef: 0.11742619432034328
dd_per_round coef: 0.11343740975020018
damage_per_round coef: -0.11350988695075338
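
The coefficient on damage_per_round is negative here, while it is positive in the two-feature models further down. Sign changes like this often appear when predictors are strongly correlated with one another, although whether that is the cause in this dataset is not verified here. One way to check, sketched on synthetic data (the columns `a`, `b`, `c` and the helper `vif` are hypothetical names for this example), is to look at the pairwise correlations and the variance inflation factor, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on the other features:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic features (hypothetical; not the notebook's dataset).
rng = np.random.default_rng(0)
a = rng.normal(size=2000)
b = a + 0.1 * rng.normal(size=2000)  # nearly a copy of `a`
c = rng.normal(size=2000)            # unrelated to the others
X = np.column_stack([a, b, c])

# Pairwise correlations between the feature columns.
print(np.corrcoef(X, rowvar=False).round(2))


def vif(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing it on the other columns."""
    others = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    return 1.0 / (1.0 - r2)


vifs = [vif(X, j) for j in range(X.shape[1])]
print(vifs)
```

A large VIF for a feature suggests that its coefficient is hard to interpret on its own, because the other features can stand in for it.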


In [173]:
# Now let us try a model containing dd_per_round and damage_per_round:
dd_and_damage_per_round_model = LinearRegression()
dd_and_damage_per_round_model.fit(x_train[['dd_per_round_standardized', 'damage_per_round_standardized']], y_train)
# Let's see the R^2 score of dd_and_damage_per_round_model:
rr_score = dd_and_damage_per_round_model.score(x_test[['dd_per_round_standardized', 'damage_per_round_standardized']], y_test)
print("dd_and_damage_per_round_model R^2 score: " + str(rr_score))

dd_and_damage_per_round_model R^2 score: 0.915157994998116


In [174]:
# Let's inspect the coefficients that dd_and_damage_per_round_model assigns to dd_per_round and damage_per_round:
coefs = dd_and_damage_per_round_model.coef_
print(f"dd_per_round coef: {coefs[0]}\ndamage_per_round coef: {coefs[1]}")

dd_per_round coef: 0.10720129942189839
damage_per_round coef: 0.007861708567579947


In [175]:
# Let's try a model using acs and dd_per_round:
acs_and_dd_per_round_model = LinearRegression()
acs_and_dd_per_round_model.fit(x_train[['acs_standardized', 'dd_per_round_standardized']], y_train)
# Let's see the R^2 score of acs_and_dd_per_round_model:
rr_score = acs_and_dd_per_round_model.score(x_test[['acs_standardized', 'dd_per_round_standardized']], y_test)
print("acs_and_dd_per_round_model R^2 score: " + str(rr_score))

acs_and_dd_per_round_model R^2 score: 0.9192218462576195


In [176]:
# Let's inspect the coefficients that acs_and_dd_per_round_model assigns to acs and dd_per_round:
coefs = acs_and_dd_per_round_model.coef_
print(f"acs: {coefs[0]}\ndd_per_round coef: {coefs[1]}")

acs: 0.017079263124967603
dd_per_round coef: 0.09943955182110335


In [177]:
# Finally, we will observe a model with acs and damage_per_round:
acs_and_damage_per_round_model = LinearRegression()
acs_and_damage_per_round_model.fit(x_train[['acs_standardized', 'damage_per_round_standardized']], y_train)
# Let's see the R^2 score of acs_and_damage_per_round_model:
rr_score = acs_and_damage_per_round_model.score(x_test[['acs_standardized', 'damage_per_round_standardized']], y_test)
print("acs_and_damage_per_round_model R^2 score: " + str(rr_score))

acs_and_damage_per_round_model R^2 score: 0.7343417315511778


In [178]:
# Let's inspect the coefficients that acs_and_damage_per_round_model assigns to acs and damage_per_round:
coefs = acs_and_damage_per_round_model.coef_
print(f"acs: {coefs[0]}\ndamage_per_round coef: {coefs[1]}")

acs: 0.0595182794963564
damage_per_round coef: 0.04370681584399669


# Conclusion
Seeing how linear regression models predict with different combinations of features, we can conclude the following:
* Our most accurate model contained all three features with an R^2 of 0.9402540007687736.
* The worst performing model contained the features 'acs' and 'damage_per_round', with an R^2 of 0.7343417315511778.
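* Models that paired a feature with 'damage_per_round' tended to score lower than the alternatives (0.915 versus 0.919 alongside 'dd_per_round'), and the only model without 'dd_per_round' scored far lower (0.734), which suggests that 'dd_per_round' carries most of the predictive signal.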
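
The four comparisons above were run cell by cell. A more compact way to run the same kind of comparison is to loop over feature subsets with `itertools.combinations`. The sketch below uses synthetic data with the same column names (the data, coefficients, and the `scores` dictionary are assumptions for illustration, so its numbers will not match the results reported here):

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical; not the notebook's dataset).
rng = np.random.default_rng(48)
n = 1000
data = pd.DataFrame({
    "acs": rng.normal(size=n),
    "dd_per_round": rng.normal(size=n),
    "damage_per_round": rng.normal(size=n),
})
data["kd_ratio"] = (
    1.07
    + 0.02 * data["acs"]
    + 0.10 * data["dd_per_round"]
    + rng.normal(scale=0.03, size=n)
)

features = ["acs", "dd_per_round", "damage_per_round"]
X_train, X_test, y_train, y_test = train_test_split(
    data[features], data["kd_ratio"], test_size=0.2, random_state=48
)

# Fit one model per subset of two or three features and record its test R^2.
scores = {}
for size in (2, 3):
    for subset in combinations(features, size):
        cols = list(subset)
        model = LinearRegression().fit(X_train[cols], y_train)
        scores[subset] = model.score(X_test[cols], y_test)

# Print the subsets from best to worst.
for subset, r2 in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(subset, round(r2, 4))
```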