##### Assignment 4 - Linear Regression

## Part 1 - Read your data frame.

The CSV file is "hockey_data2.csv".  This data came from Kaggle.  It has **a lot** of hockey performance features.  You will want to read up on some of them.  Looking at correlations with Salary might help you find important features.  

In [6]:
import pandas as pd

# Load the CSV file
hockey_df = pd.read_csv("hockey_data2.csv")

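# Calculate the correlation of each numeric feature with Salary
# (object columns are excluded; recent pandas versions raise an error on them)
hockey_corr = hockey_df.select_dtypes(include="number").corr()["Salary"].sort_values(ascending=False)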

# Print the correlations
print(hockey_corr)



Salary    1.000000
GF        0.659654
xGF       0.657915
SCF       0.652631
SF        0.651598
            ...   
iHDf     -0.058446
iPenDf   -0.061890
DftRd    -0.206641
Ovrl     -0.222808
DftYr    -0.471547
Name: Salary, Length: 144, dtype: float64


## Part 2 - Display info
Take a look at the information.  Note any features with missing values.  Note any features that are stored as objects but could be numerical.

In [7]:
import numpy as np

print("********INFO********")
print(hockey_df.info())
print("********IsNull********")
print(hockey_df.isnull().sum())
print("********Check for Numerical********")
for col in hockey_df.columns:
    if hockey_df[col].dtype == 'object' and pd.to_numeric(hockey_df[col], errors='coerce').notnull().all():
        print(col, 'could be numerical')

********INFO********
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 612 entries, 0 to 611
Columns: 154 entries, Salary to GS/G
dtypes: float64(73), int64(71), object(10)
memory usage: 736.4+ KB
None
********IsNull********
Salary      0
Born        0
City        0
Pr/St     153
Cntry       0
         ... 
Grit        0
DAP         0
Pace        1
GS          1
GS/G        2
Length: 154, dtype: int64
********Check for Numerical********


Clean up any missing data values.  If a lot of values of a feature are missing, remove the feature.  Otherwise remove the instance or replace by the median.
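
For the median option, here is a minimal sketch on a toy frame (the `toy` frame, its `Team` column, and its values are made up for illustration). One thing to watch for: `np.median` returns NaN when the data contains missing values, while `Series.median()` skips them by default.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only; not the hockey data
toy = pd.DataFrame({"GP": [10.0, np.nan, 30.0, 50.0],
                    "Team": ["A", None, "B", "C"]})

# np.median propagates NaN, so it cannot serve as a fill value here
naive_median = np.median(toy["GP"].to_numpy())

# Series.median() ignores NaN by default
gp_median = toy["GP"].median()

# Assign the result back rather than relying on inplace=True
toy["GP"] = toy["GP"].fillna(gp_median)
toy["Team"] = toy["Team"].fillna("unknown")
print(toy)
```

Assigning the filled column back to the frame avoids the chained-assignment pitfalls of calling `fillna(..., inplace=True)` on a selected column.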

In [17]:
print("********Cleaning Dataset********")

# Drop features where more than 40% of the values are missing
max_missing = 0.4 * len(hockey_df)
hockey_df = hockey_df.loc[:, hockey_df.isnull().sum() <= max_missing].copy()

# Fill missing values in text (object) columns with a placeholder
object_cols = hockey_df.select_dtypes(include="object").columns
hockey_df[object_cols] = hockey_df[object_cols].fillna("unknown")

# Remove the remaining instances that are missing a numeric value
# (to impute instead, fill each numeric column with its Series.median(),
# which skips NaN; np.median returns NaN when values are missing)
print("********IsNull********")
hockey_df = hockey_df.dropna()
print(hockey_df.isnull().sum())


********Cleaning Dataset********
********IsNull********
Salary    0
Born      0
City      0
Pr/St     0
Cntry     0
         ..
Grit      0
DAP       0
Pace      0
GS        0
GS/G      0
Length: 154, dtype: int64


### _Notes_
(Your notes here)

## Part 3 - Split the data frame

Use the train_test_split() function to split the data set into training (75%) and test (25%) sets.

In [19]:
from sklearn.model_selection import train_test_split

# Split the data into features (X) and target (y)
X = hockey_df.drop(['Salary'], axis=1)
Y = hockey_df['Salary']

# Split the data into training and test sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=42)

# Print the shapes of the resulting data sets
print("X_train shape:", X_train.shape)
print("y_train shape:", Y_train.shape)
print("X_test shape:", X_test.shape)
print("y_test shape:", Y_test.shape)

X_train shape: (370, 153)
y_train shape: (370,)
X_test shape: (124, 153)
y_test shape: (124,)


## Part 4 - Do some scatter plots
Our goal is to predict the salary for a player based on their stats.  We have a lot of features we can use, so you will want to look at a subset.

Use scatter_matrix.
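
As a rough sketch of how `scatter_matrix` can be called on a subset of columns: the example below builds a synthetic stand-in frame (`demo_df`, with made-up values) so it runs on its own. The column names `GP`, `G`, and `A` are taken from this assignment; which subset is actually informative is for you to decide.

```python
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Synthetic stand-in so the sketch runs on its own;
# in the notebook, use the training data instead
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    "GP": rng.integers(1, 83, size=100),
    "G": rng.integers(0, 40, size=100),
    "A": rng.integers(0, 60, size=100),
})
demo_df["Salary"] = 500_000 + 40_000 * demo_df["G"] + rng.normal(0, 250_000, size=100)

# Plot only a few columns; a matrix over all 154 columns would be unreadable
subset = ["Salary", "GP", "G", "A"]
axes = scatter_matrix(demo_df[subset], figsize=(10, 10), alpha=0.5, diagonal="hist")
```

The call returns a square array of Matplotlib axes, one per pair of columns, with histograms on the diagonal.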

In [10]:
# Your code here

## Part 5 - Linear Regression  (Input is GP = games played)
1. Pull out "GP" for the X and "Salary" for y. 
2. Fit the data.
3. Show R² and mean squared error
4. Discuss the results
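
The steps above could look roughly like the following sketch. It uses a synthetic stand-in for the training set (`demo_train`, with made-up values), so the numbers it prints say nothing about the real data; only the shape of the workflow carries over.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic stand-in for the training set (values are made up)
rng = np.random.default_rng(0)
demo_train = pd.DataFrame({"GP": rng.integers(1, 83, size=200)})
demo_train["Salary"] = (600_000 + 25_000 * demo_train["GP"]
                        + rng.normal(0, 500_000, size=200))

# scikit-learn expects a 2-D X, so select with a list of column names
X_gp = demo_train[["GP"]]
y = demo_train["Salary"]

model = LinearRegression()
model.fit(X_gp, y)

# Score the fit on the same data it was trained on
y_pred = model.predict(X_gp)
r2 = r2_score(y, y_pred)
mse = mean_squared_error(y, y_pred)

print(f"slope: {model.coef_[0]:.1f}, intercept: {model.intercept_:.1f}")
print(f"R2: {r2:.3f}")
print(f"MSE: {mse:.3e}")
```

Because salaries are large numbers, the MSE will be very large in absolute terms; taking its square root gives an error in salary units that is easier to discuss.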

In [11]:
# Your code here

Discussion: 

## Part 6 - Add features to X
1. Pick up to 4 other features that you think might improve the model and use them for X.
2. Fit the data
3. Show the scores
4. Discuss the results

In [12]:
# Your code here

Discussion: 

## Part 7 - Add a new feature to the Model

Let's make a new feature that attempts to capture a player's contributions.  We will combine G = goals over the season, A = assists over the season, and +/- = extra goals scored while the player is on the ice over the season.  (If the player's team scored 7 goals and the opponents scored 3 goals while the player was on the ice, then their plus/minus is 7 - 3 = 4.  A positive +/- is good, a negative one not so much.)

Let's create a feature that combines these and call it Goal-Power.  
Make it G + 0.4 × A + 0.3 × (+/-).

1. Fit the data using Goal-Power and Salary
1. Show the scores
1. Scatter plot the data
1. Discuss the results
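
A minimal sketch of building the combined feature, using three toy rows (made-up values). It assumes the columns are literally named `G`, `A`, and `+/-`, as written above; check the actual column names in the data frame.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration; the real columns come from the hockey data frame
toy = pd.DataFrame({"G": [20, 5, 0], "A": [30, 10, 2], "+/-": [4, -6, 0]})

# Goal-Power = G + 0.4*A + 0.3*(+/-); bracket access is needed
# because "+/-" is not a valid Python attribute name
toy["Goal-Power"] = toy["G"] + 0.4 * toy["A"] + 0.3 * toy["+/-"]
print(toy)
```

For the first row this gives 20 + 0.4 × 30 + 0.3 × 4 = 33.2. The new column can then be used as the single input to the regression, in the same way as GP in Part 5.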

In [13]:
# Your code here

Discussion:

## Part 8 - Use the new feature and others

1. Use up to 4 other features along with Goal-Power to predict the salary. Don't use any of the three features (G, A, +/-) that we used to create Goal-Power.
2. Fit the data
3. Show the scores
4. Discuss the results

In [14]:
# Your code here

Discussion:

## Part 9 - Test Set time
Evaluate the model for R² and mean squared error on the test set, and discuss your results in comparison to Part 8.
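
One way to set up the comparison is sketched below on synthetic stand-in data (`demo_df`, with made-up values and an arbitrary choice of two features). The point is the pattern: fit on the training set only, then report the same two scores for both sets.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (made-up values; feature names taken from the assignment)
rng = np.random.default_rng(0)
n = 400
demo_df = pd.DataFrame({
    "GP": rng.integers(1, 83, size=n),
    "Goal-Power": rng.normal(20, 10, size=n),
})
demo_df["Salary"] = (500_000 + 10_000 * demo_df["GP"]
                     + 60_000 * demo_df["Goal-Power"]
                     + rng.normal(0, 400_000, size=n))

features = ["GP", "Goal-Power"]
X_train, X_test, y_train, y_test = train_test_split(
    demo_df[features], demo_df["Salary"], test_size=0.25, random_state=42)

# Fit on the training set only; the test set is used once, for evaluation
model = LinearRegression().fit(X_train, y_train)

for name, X_part, y_part in [("train", X_train, y_train), ("test", X_test, y_test)]:
    pred = model.predict(X_part)
    print(f"{name}: R2 = {r2_score(y_part, pred):.3f}, "
          f"MSE = {mean_squared_error(y_part, pred):.3e}")

test_r2 = r2_score(y_test, model.predict(X_test))
```

A test score well below the training score suggests overfitting; similar scores suggest the model generalizes about as well as it fits.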

In [15]:
# Your code here

Discussion:

# Bonus options
For each of the options, redo your regression using the new features, report the results and discuss.


1. Look for a better set of features to predict Salary.  
2. Add in polynomial features.
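
For the polynomial option, a possible starting point is scikit-learn's `PolynomialFeatures` inside a pipeline. The sketch below uses synthetic one-feature data with a deliberately curved relationship (made-up values), and the choice of degree 2 is only an assumption to illustrate the mechanics.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic one-feature data with a curved relationship (made-up values)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 + 2.0 * X[:, 0] + 0.5 * X[:, 0] ** 2 + rng.normal(0, 1.0, size=200)

# Baseline: plain linear fit on the original feature
linear = LinearRegression().fit(X, y)

# Degree-2 expansion adds squared (and, with several inputs, interaction)
# terms before the linear fit; scaling keeps the expanded columns comparable
poly = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    LinearRegression(),
).fit(X, y)

linear_r2 = linear.score(X, y)
poly_r2 = poly.score(X, y)
print(f"linear R2: {linear_r2:.3f}")
print(f"degree-2 R2: {poly_r2:.3f}")
```

On the training data the polynomial model can never score lower than the linear one, so the comparison that matters is on the test set.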