# Data Mining and Statistics
## Session 4 - Regression Analysis - ANSWERS
*Peter Stikker - Haarlem, the Netherlands*

----

In [None]:
try:
    import pandas as pd
    print('pandas already installed, only imported')
except ImportError:
    !pip install pandas
    import pandas as pd
    print('pandas was not installed, installed and imported')    

# numpy as np
try:
    import numpy as np
    print('NumPy already installed, only imported')
except ImportError:
    !pip install numpy
    import numpy as np
    print('NumPy was not installed, installed and imported')
    
    
# pyplot as plt
try:
    import matplotlib.pyplot as plt
    print('PyPlot already installed, only imported')
except ImportError:
    !pip install matplotlib
    import matplotlib.pyplot as plt
    print('PyPlot was not installed, installed and imported')

try:
    import statsmodels.api as sm
    print('statsmodels already installed, only imported')
except ImportError:
    !pip install statsmodels
    import statsmodels.api as sm
    print('statsmodels was not installed, installed and imported')    
    
# sklearn
try:
    from sklearn.linear_model import LinearRegression
    print('sklearn already installed, only imported')
except ImportError:
    !pip install scikit-learn
    from sklearn.linear_model import LinearRegression
    print('sklearn was not installed, installed and imported')

from sklearn import metrics

try:
    import seaborn as sns
    print('seaborn already installed, only imported')
except ImportError:
    !pip install seaborn
    import seaborn as sns
    print('seaborn was not installed, installed and imported')

In [None]:
soccerDF=pd.read_csv('data/Soccer2019C.csv')

In [None]:
x = soccerDF["Age"].to_numpy().reshape((-1,1))
y = soccerDF["Overall"].to_numpy().reshape((-1,1))

**The manual calculation:**

In [None]:
sx2 = x.var()
mxy = np.array(x*y).mean()
b1=(mxy-x.mean()*y.mean())/sx2
print("The coefficient (b1): ",b1)

b0=y.mean()-b1*x.mean()
print("The intercept (b0): ",b0)
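
As a sanity check on the formulas used above, here is a minimal sketch that applies the same calculation to synthetic data and compares it with `np.polyfit`. The data and all names ending in `_demo` or `_ref` are assumptions made up for illustration; they are not part of the soccer data set.

```python
import numpy as np

# Synthetic data (an assumption for illustration, not the soccer data)
rng = np.random.default_rng(0)
x_demo = rng.uniform(16, 40, size=200)
y_demo = 50 + 0.6 * x_demo + rng.normal(0, 3, size=200)

# Slope: population covariance of x and y divided by population variance of x
b1_demo = ((x_demo * y_demo).mean() - x_demo.mean() * y_demo.mean()) / x_demo.var()
# Intercept: the fitted line passes through the point of means
b0_demo = y_demo.mean() - b1_demo * x_demo.mean()

# np.polyfit with degree 1 returns the coefficients as [slope, intercept]
slope_ref, intercept_ref = np.polyfit(x_demo, y_demo, 1)
print(np.isclose(b1_demo, slope_ref), np.isclose(b0_demo, intercept_ref))
```

Note that `x.var()` is the population variance (dividing by n), which matches the covariance computed as the mean of the products minus the product of the means.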

**Using sklearn:**

In [None]:
x=x.reshape((-1,1))
y=y.reshape((-1,1))

model = LinearRegression().fit(x,y)
b1=model.coef_[0]
print('The slope (b1): ',b1[0])

b0=model.intercept_
print('The intercept (b0): ',b0[0])

yPred = model.predict(x)

det=metrics.r2_score(y,yPred)
print('Coefficient of determination: ',det)
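
The coefficient of determination reported by `metrics.r2_score` is one minus the residual sum of squares divided by the total sum of squares. The sketch below works this out by hand on synthetic data; the data and variable names are assumptions for illustration only.

```python
import numpy as np

# Synthetic data (an assumption for illustration, not the soccer data)
rng = np.random.default_rng(1)
x_demo = rng.uniform(16, 40, size=200)
y_demo = 50 + 0.6 * x_demo + rng.normal(0, 3, size=200)

# Fit a straight line and compute the predicted values
slope, intercept = np.polyfit(x_demo, y_demo, 1)
y_hat = slope * x_demo + intercept

ss_res = np.sum((y_demo - y_hat) ** 2)          # variation left unexplained by the line
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)  # total variation around the mean
r_squared = 1 - ss_res / ss_tot
print(r_squared)
```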

**Using statsmodels:**

In [None]:
X=soccerDF["Age"]
X=sm.add_constant(X)
model = sm.OLS(soccerDF["Overall"],X).fit()
model.summary()

My take on a function for linear regression:

In [None]:
def linearRegression(xVal, yVal):
    # Fit a simple linear regression and print slope, intercept and R squared.
    # xVal and yVal are expected as 2-D column arrays of shape (n, 1).
    model = LinearRegression().fit(xVal,yVal)
    yPred = model.predict(xVal)
    b1V2=model.coef_[0]
    print('The slope (b1): ',b1V2[0])

    b0V2=model.intercept_
    print('The intercept (b0): ',b0V2[0])
    det2=metrics.r2_score(yVal,yPred)
    print('Coefficient of determination: ',det2)

Now to find out which pair of variables has the highest correlation. Micha is a big fan of seaborn and liked to visualize this by creating scatterplots of all possible pairs of variables:

In [None]:
# Plotting every pair for all variables would take too long, so limit the data to the first 10 columns as an example
soccerLim = soccerDF.iloc[:,0:10]
sns.pairplot(soccerLim) # Show the scatterplots of each possible pair of variables

The full version would be a bit impractical. He also used a heat map, which also looks nice:

In [None]:
sns.heatmap(soccerDF.corr()) # Show a heatmap of which columns are correlated

I'm more of a numbers guy myself. We can generate a so-called correlation matrix using our pandas dataframe.

In [None]:
corrMatrix=soccerDF.corr(method='pearson')
corrMatrix.head()

To find the pair with the highest coefficient of determination, we simply square the correlations:

In [None]:
detMatrix=corrMatrix**2
detMatrix.head()
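
Squaring works because, for a regression with a single predictor, the coefficient of determination equals the squared Pearson correlation. A minimal sketch checking this on synthetic data (the arrays `a` and `b` are assumptions for illustration):

```python
import numpy as np

# Two synthetic, linearly related variables (an assumption for illustration)
rng = np.random.default_rng(2)
a = rng.normal(size=300)
b = 2.0 * a + rng.normal(size=300)

# Pearson correlation between a and b
r = np.corrcoef(a, b)[0, 1]

# R squared of the simple regression of b on a
slope, intercept = np.polyfit(a, b, 1)
residuals = b - (slope * a + intercept)
r2_fit = 1 - np.sum(residuals ** 2) / np.sum((b - b.mean()) ** 2)

print(np.isclose(r ** 2, r2_fit))
```

The same value results whichever of the two variables is treated as the predictor, which is why a symmetric matrix of squared correlations is enough here.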

Then replace the 1s with 0 (to exclude the diagonal, where each variable is correlated with itself), and determine the maximum:

In [None]:
detMatrix = detMatrix.replace(1,0)
maxRsquare=detMatrix.values.max()
maxRsquare

Okay, but between which two variables is this:

In [None]:
for column in detMatrix:
    if detMatrix[column].values.max()==maxRsquare:
        print(column)
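
A possible alternative, sketched here under assumptions: zero the diagonal by position rather than by value (so it does not depend on the diagonal being exactly 1, and does not wipe out a genuine perfect correlation elsewhere), then read off the row and column of the maximum directly. The column names and data below are hypothetical and only serve the illustration.

```python
import numpy as np

# Hypothetical column names and a small synthetic data set for illustration
rng = np.random.default_rng(3)
names = ["A", "B", "C", "D"]
base = rng.normal(size=500)
data = np.column_stack([
    base,
    rng.normal(size=500),
    base + 0.1 * rng.normal(size=500),  # C tracks A closely
    rng.normal(size=500),
])

# Squared correlation matrix (columns are the variables)
det = np.corrcoef(data, rowvar=False) ** 2
# Zero the diagonal by position instead of replacing values equal to 1
np.fill_diagonal(det, 0.0)

# Convert the position of the maximum in the flattened matrix to (row, column)
i, j = np.unravel_index(np.argmax(det), det.shape)
best_pair = (names[i], names[j])
print(best_pair, det[i, j])
```

With the notebook's own data, the same idea could be applied to `detMatrix.to_numpy(copy=True)`, with `detMatrix.columns` supplying the names. If the matrix contains NaN values, `np.nanargmax` would be the safer choice.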

Great, let's check:

In [None]:
X=soccerDF["StandingTackle"]
X=sm.add_constant(X)
model = sm.OLS(soccerDF["SlidingTackle"],X).fit()
model.summary()