# Linear Regression
We model the data with a linear regression. I also tried other approaches, such as a TensorFlow tree model and a TensorFlow CNN, but because of their complexity I settled on a simple multivariable linear regression. This keeps the model easy to understand, while using several predictors lets us draw a slightly more precise conclusion. It is important not to sacrifice understandability, given that we are going to present the results to non-technical stakeholders. (For more accurate but less readable models, see the TF tree and TF CNN under "Less Readable Models".)

## Relevant Imports

In [1]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

## Examining the CSV File as a Pandas DataFrame

In [2]:
df = pd.read_csv("starcraft_player_data.csv")

In [3]:
display(df)

Unnamed: 0,GameID,LeagueIndex,Age,HoursPerWeek,TotalHours,APM,SelectByHotkeys,AssignToHotkeys,UniqueHotkeys,MinimapAttacks,MinimapRightClicks,NumberOfPACs,GapBetweenPACs,ActionLatency,ActionsInPAC,TotalMapExplored,WorkersMade,UniqueUnitsMade,ComplexUnitsMade,ComplexAbilitiesUsed
0,52,5,27,10,3000,143.7180,0.003515,0.000220,7,0.000110,0.000392,0.004849,32.6677,40.8673,4.7508,28,0.001397,6,0.000000,0.000000
1,55,5,23,10,5000,129.2322,0.003304,0.000259,4,0.000294,0.000432,0.004307,32.9194,42.3454,4.8434,22,0.001193,5,0.000000,0.000208
2,56,4,30,10,200,69.9612,0.001101,0.000336,4,0.000294,0.000461,0.002926,44.6475,75.3548,4.0430,22,0.000745,6,0.000000,0.000189
3,57,3,19,20,400,107.6016,0.001034,0.000213,1,0.000053,0.000543,0.003783,29.2203,53.7352,4.9155,19,0.000426,7,0.000000,0.000384
4,58,3,32,10,500,122.8908,0.001136,0.000327,2,0.000000,0.001329,0.002368,22.6885,62.0813,9.3740,15,0.001174,4,0.000000,0.000019
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3390,10089,8,?,?,?,259.6296,0.020425,0.000743,9,0.000621,0.000146,0.004555,18.6059,42.8342,6.2754,46,0.000877,5,0.000000,0.000000
3391,10090,8,?,?,?,314.6700,0.028043,0.001157,10,0.000246,0.001083,0.004259,14.3023,36.1156,7.1965,16,0.000788,4,0.000000,0.000000
3392,10092,8,?,?,?,299.4282,0.028341,0.000860,7,0.000338,0.000169,0.004439,12.4028,39.5156,6.3979,19,0.001260,4,0.000000,0.000000
3393,10094,8,?,?,?,375.8664,0.036436,0.000594,5,0.000204,0.000780,0.004346,11.6910,34.8547,7.9615,15,0.000613,6,0.000000,0.000631


## Data Cleaning:
As the table above shows, some rows are incomplete (they contain "?"). We remove those rows, and we drop "GameID" since it is a variable that we aren't interested in studying. <br> <br>

 (NOTE: Although these rows are only a small share of the data (under 2%), we will keep an eye on the affected variables, Age, HoursPerWeek, and TotalHours, to check that deleting these rows did not matter.) <br> <br>
 
 Then we convert the rest of the data points to floats, since they are all real numbers.

In [4]:
df = df.replace("?", np.nan)
df = df.dropna()
df = df.drop(["GameID"], axis = 1)
df = df.astype(float)
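
To keep an eye on how much data the cleaning step removes, the same steps can be run while counting rows. This is a minimal sketch on a small made-up frame (`toy`) rather than the real CSV; its values are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real CSV (values are invented for illustration).
toy = pd.DataFrame({
    "GameID": [52, 55, 56, 10089],
    "LeagueIndex": [5, 5, 4, 8],
    "Age": ["27", "23", "30", "?"],
    "APM": [143.7, 129.2, 70.0, 259.6],
})

rows_before = len(toy)

# Same steps as the cell above: "?" -> NaN, drop incomplete rows,
# drop the identifier column, then cast everything to float.
clean = (
    toy.replace("?", np.nan)
       .dropna()
       .drop(columns=["GameID"])
       .astype(float)
)

rows_dropped = rows_before - len(clean)
print(f"dropped {rows_dropped} of {rows_before} rows "
      f"({rows_dropped / rows_before:.1%})")
```

An alternative would be to pass `na_values="?"` to `pd.read_csv`, which marks the missing entries at load time so the `replace` step is not needed.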

#### Here is the new dataset:

In [5]:
df

Unnamed: 0,LeagueIndex,Age,HoursPerWeek,TotalHours,APM,SelectByHotkeys,AssignToHotkeys,UniqueHotkeys,MinimapAttacks,MinimapRightClicks,NumberOfPACs,GapBetweenPACs,ActionLatency,ActionsInPAC,TotalMapExplored,WorkersMade,UniqueUnitsMade,ComplexUnitsMade,ComplexAbilitiesUsed
0,5.0,27.0,10.0,3000.0,143.7180,0.003515,0.000220,7.0,0.000110,0.000392,0.004849,32.6677,40.8673,4.7508,28.0,0.001397,6.0,0.0,0.000000
1,5.0,23.0,10.0,5000.0,129.2322,0.003304,0.000259,4.0,0.000294,0.000432,0.004307,32.9194,42.3454,4.8434,22.0,0.001193,5.0,0.0,0.000208
2,4.0,30.0,10.0,200.0,69.9612,0.001101,0.000336,4.0,0.000294,0.000461,0.002926,44.6475,75.3548,4.0430,22.0,0.000745,6.0,0.0,0.000189
3,3.0,19.0,20.0,400.0,107.6016,0.001034,0.000213,1.0,0.000053,0.000543,0.003783,29.2203,53.7352,4.9155,19.0,0.000426,7.0,0.0,0.000384
4,3.0,32.0,10.0,500.0,122.8908,0.001136,0.000327,2.0,0.000000,0.001329,0.002368,22.6885,62.0813,9.3740,15.0,0.001174,4.0,0.0,0.000019
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3335,4.0,20.0,8.0,400.0,158.1390,0.013829,0.000504,7.0,0.000217,0.000313,0.003583,36.3990,66.2718,4.5097,30.0,0.001035,7.0,0.0,0.000287
3336,5.0,16.0,56.0,1500.0,186.1320,0.006951,0.000360,6.0,0.000083,0.000166,0.005414,22.8615,34.7417,4.9309,38.0,0.001343,7.0,0.0,0.000388
3337,4.0,21.0,8.0,100.0,121.6992,0.002956,0.000241,8.0,0.000055,0.000208,0.003690,35.5833,57.9585,5.4154,23.0,0.002014,7.0,0.0,0.000000
3338,3.0,20.0,28.0,400.0,134.2848,0.005424,0.000182,5.0,0.000000,0.000480,0.003205,18.2927,62.4615,6.0202,18.0,0.000934,5.0,0.0,0.000000


#### Here are our variables of interest:

In [6]:
variable_names = df.columns[1:].tolist()
print(variable_names)

['Age', 'HoursPerWeek', 'TotalHours', 'APM', 'SelectByHotkeys', 'AssignToHotkeys', 'UniqueHotkeys', 'MinimapAttacks', 'MinimapRightClicks', 'NumberOfPACs', 'GapBetweenPACs', 'ActionLatency', 'ActionsInPAC', 'TotalMapExplored', 'WorkersMade', 'UniqueUnitsMade', 'ComplexUnitsMade', 'ComplexAbilitiesUsed']


## Calculating Correlation:

To find which of our variables are most "important" in creating a linear regression, we calculate the correlation of each of the variables first:

In [7]:
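var_corr_dict = {}     # variable name -> Pearson r with LeagueIndex
var_corr_squared = {}  # variable name -> r^2 (filled in below)
y = df["LeagueIndex"]
for variable in variable_names:
    x = df[variable]
    # np.corrcoef returns a 2x2 matrix; the off-diagonal entry is r(x, y)
    var_corr_dict[variable] = np.corrcoef(x, y)[0, 1]

# Sort from the most positive to the most negative correlation
sorted_vars = dict(sorted(var_corr_dict.items(), key=lambda item: item[1], reverse=True))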

Printing each of the correlations, we can see how each variable relates to the rank of the player on a scale from -1 to 1.

At the same time, we build a dictionary of the r^2 value of each variable, so we can rank the variables purely by strength of relationship, rather than by strength and direction:

In [9]:
for entry in sorted_vars:
    correlation_value = sorted_vars[entry]
    print(entry + " Correlation = " + str(correlation_value))
    var_corr_squared[entry] = pow(correlation_value, 2)

APM Correlation = 0.6241710970772599
NumberOfPACs Correlation = 0.5891934978942297
AssignToHotkeys Correlation = 0.487279732086571
SelectByHotkeys Correlation = 0.42863675558377823
UniqueHotkeys Correlation = 0.32241463073766213
WorkersMade Correlation = 0.3104515329474209
MinimapAttacks Correlation = 0.2705257885396429
TotalMapExplored Correlation = 0.2303474332536572
HoursPerWeek Correlation = 0.21792962195020307
MinimapRightClicks Correlation = 0.2063800518574088
ComplexUnitsMade Correlation = 0.17118967334107923
ComplexAbilitiesUsed Correlation = 0.1560332352922749
UniqueUnitsMade Correlation = 0.15193343560364345
ActionsInPAC Correlation = 0.14030289579225091
TotalHours Correlation = 0.023883507561895907
Age Correlation = -0.12751785681253291
GapBetweenPACs Correlation = -0.5375356609357472
ActionLatency Correlation = -0.6599402519394563


**Sorting, then printing, the r^2 value of each variable:** <br> <br>

 **(NOTE: Age, HoursPerWeek, and TotalHours are all only weakly related to rank (r^2 below 0.05), so the slight data removal earlier is OK.)**

In [13]:
sorted_vars_sq = dict(sorted(var_corr_squared.items(), key=lambda item: item[1], reverse=True))

for entry in sorted_vars_sq:
    correlation_value_sq = sorted_vars_sq[entry]
    print(entry + " r^2 = " + str(correlation_value_sq))

ActionLatency r^2 = 0.435521136129913
APM r^2 = 0.38958955842663023
NumberOfPACs r^2 = 0.34714897796083766
GapBetweenPACs r^2 = 0.28894458677763063
AssignToHotkeys r^2 = 0.2374415373023604
SelectByHotkeys r^2 = 0.18372946823738764
UniqueHotkeys r^2 = 0.10395119411370303
WorkersMade r^2 = 0.09638015430940355
MinimapAttacks r^2 = 0.07318420226499557
TotalMapExplored r^2 = 0.05305994000654806
HoursPerWeek r^2 = 0.04749332012335843
MinimapRightClicks r^2 = 0.04259272580466674
ComplexUnitsMade r^2 = 0.029305904258625413
ComplexAbilitiesUsed r^2 = 0.02434637051577442
UniqueUnitsMade r^2 = 0.02308376885432647
ActionsInPAC r^2 = 0.01968490256769122
Age r^2 = 0.01626080380606165
TotalHours r^2 = 0.000570421933459139
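
The two rankings above can also be produced without explicit loops. The sketch below uses `DataFrame.corrwith` on a small synthetic frame (`demo`) standing in for the cleaned `df`; the column names are borrowed only for illustration and the numbers are randomly generated, not the real data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for the cleaned data (not the real dataset).
apm = rng.normal(120, 40, n)
latency = 100 - 0.4 * apm + rng.normal(0, 10, n)
age = rng.normal(22, 4, n)
league = 0.03 * apm - 0.02 * latency + rng.normal(0, 1, n)

demo = pd.DataFrame({
    "LeagueIndex": league,
    "APM": apm,
    "ActionLatency": latency,
    "Age": age,
})

# Pearson r of every predictor against the target, in one call.
r = demo.drop(columns="LeagueIndex").corrwith(demo["LeagueIndex"])

# r keeps the direction of the relationship; r**2 ranks by strength only.
r_squared = (r ** 2).sort_values(ascending=False)

print(r.sort_values(ascending=False))
print(r_squared)
```

Squaring removes the sign, which is why a strongly negative predictor can sit at the bottom of the first list and near the top of the second.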


# Linear Regression
As we can see, the four most relevant variables are "ActionLatency", "APM", "NumberOfPACs", and "GapBetweenPACs". Now we will fit a linear regression to find its parameters, along with the r^2 (and the correlation, which is just r), to see how good a fit the linear model is:

In [14]:
X = df[["ActionLatency", "APM", "NumberOfPACs", "GapBetweenPACs"]]

# Add constant term to the independent variables
X = sm.add_constant(X)
y = df["LeagueIndex"]

# Create and fit the linear regression model
model = sm.OLS(y, X)
results = model.fit()

# Print the model summary
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:            LeagueIndex   R-squared:                       0.495
Model:                            OLS   Adj. R-squared:                  0.495
Method:                 Least Squares   F-statistic:                     818.1
Date:                Wed, 24 May 2023   Prob (F-statistic):               0.00
Time:                        20:50:50   Log-Likelihood:                -4830.3
No. Observations:                3338   AIC:                             9671.
Df Residuals:                    3333   BIC:                             9701.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
const              4.1724      0.229     18.
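
For intuition about what `sm.OLS(...).fit()` estimates, the least-squares coefficients and r^2 can be computed by hand. This is a minimal sketch with NumPy on synthetic data; the predictors, coefficients, and noise level are made up and are not the StarCraft results.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Two made-up predictors and a target with known coefficients plus noise.
x1 = rng.normal(0, 1, n)
x2 = rng.normal(0, 1, n)
y = 4.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(0, 1.5, n)

# Prepend a column of ones: this plays the role of sm.add_constant.
X = np.column_stack([np.ones(n), x1, x2])

# Least-squares solution of X @ beta ~= y.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# r^2 = 1 - SS_res / SS_tot
residuals = y - X @ beta
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print("coefficients:", beta)
print("r^2:", r_squared)
```

When the model includes an intercept, r^2 equals the squared correlation between the fitted and observed values, which is why its square root can be read as a correlation.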

The model has an r^2 value of 0.495, which implies a correlation of about 0.70 ($\sqrt{0.495}$). This means that the model <br><br>
$\text{RANK} = 4.1724 - 0.019\,(\text{ActionLatency}) + 0.0082\,(\text{APM}) + 211.74\,(\text{NumberOfPACs}) - 0.0122\,(\text{GapBetweenPACs}) + \epsilon$ <br><br>
explains 49.5% of the variance in a player's rank in StarCraft 2. Although this might seem small, attributing nearly half of the variation in player ranking to just 4 specific metrics can be a very helpful tool for players trying to reach a higher rank.
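
As a worked example, the fitted equation can be applied to the first player shown in the cleaned dataset (observed LeagueIndex of 5). This is a sketch in plain Python that assumes the rounded coefficients from the equation and ignores the error term; `predict_rank` is a hypothetical helper name, not part of the notebook.

```python
def predict_rank(action_latency, apm, number_of_pacs, gap_between_pacs):
    """Apply the fitted equation with its rounded coefficients (no error term)."""
    return (4.1724
            - 0.019 * action_latency
            + 0.0082 * apm
            + 211.74 * number_of_pacs
            - 0.0122 * gap_between_pacs)

# Values from the first row of the cleaned dataset (observed LeagueIndex = 5.0).
predicted = predict_rank(
    action_latency=40.8673,
    apm=143.7180,
    number_of_pacs=0.004849,
    gap_between_pacs=32.6677,
)
print(round(predicted, 2))  # 5.2
```

A single row is only an illustration of how the coefficients combine; it says nothing about how well the model fits the other players.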