## Section 1: Business Understanding
My purpose with this project is to answer a few questions about specific hockey players. I want to follow the CRISP-DM process to clean, explore, and understand the data before creating visualizations and answering the questions outlined below.

### Question 1: How good of a goalie is Carey Price?

### Question 2: How does Carey Price compare to other great goalies?

### Question 3: Can we predict how many goals a player like Alexander Ovechkin will score?

In [None]:
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
%matplotlib inline

## Section 2 (part 1): Data Understanding

In [None]:
df_players = pd.read_csv('player_info.csv')
df_players.info()

In [None]:
# Find Carey Price 
df_players[df_players.firstName == 'Carey']

### Carey Price vs other goalies

In [None]:
# read in the dataset, drop any null records for clean dataset, and view the resulting subset for Carey Price
# We don't want null records for later visualizations
df_goalies = pd.read_csv('game_goalie_stats.csv')
df_goalies = df_goalies.dropna()
df_goalies[df_goalies.player_id == 8471679]

In [None]:
# His save percentage when both teams are at even strength (no one in the penalty box)
df_goalies[df_goalies.player_id == 8471679]['evenStrengthSavePercentage'].mean()

### Carey still does better than the rest of the league.

In [None]:
# the entire league average save percentage
df_goalies['evenStrengthSavePercentage'].mean()

## Section 3 (part 1): Data Preparation

### Create DataFrame of distinct goalies and their mean save percentages

In [None]:
# Grab all goalie averages, convert to DataFrame
def goalie_averages(df_goalies):
    '''
    INPUT:
    df_goalies - pandas dataframe of goalie data from our dataset
    
    OUTPUT:
    df_goalie_averages - a new dataframe that has the following characteristics:
            1. grouped by the player id
            2. contains the mean even strength save percentage for each player
            3. converts the Series back to a dataframe
            4. rename the column to SavePercentage to focus solely on that aspect
            5. resets the index to tidy the dataframe
    '''
    df_goalie_averages = df_goalies.groupby(['player_id'])['evenStrengthSavePercentage'].mean()
    df_goalie_averages = df_goalie_averages.to_frame()
    df_goalie_averages.rename(columns = {'evenStrengthSavePercentage': 'SavePercentage'}, inplace=True)
    df_goalie_averages.reset_index(inplace=True)
    return df_goalie_averages

In [None]:
df_goalie_averages = goalie_averages(df_goalies)

In [None]:
df_goalie_averages[df_goalie_averages.player_id == 8471679]

### Plot proportions and counts of goalies based on their save percentages

In [None]:
# Plot the proportions of goalie save percentages 

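bins = np.arange(10, df_goalie_averages['SavePercentage'].max()+5, 5)
# distplot is deprecated in seaborn; histplot with stat='probability' makes the bar heights sum to 1,
# which matches the 'Proportion of Goalies' label on the y-axis
sb.histplot(df_goalie_averages['SavePercentage'], bins=bins, stat='probability')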
plt.xlabel('Save Percentage')
plt.ylabel('Proportion of Goalies')
plt.title('Even Strength Save Percentage by Proportion')
plt.show()

### Save Percentage by Proportion
As we can see above, a large proportion of NHL goalies have even strength save percentages between 80% and 95%. There are outliers in the data. I suspect that the goalies with unusually high percentages played only a few games, meaning they haven't had long careers or just had really lucky nights.

Carey Price would fall into the bin with the tallest bar.
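
One way to test the suspicion about outliers is to count games per goalie alongside the mean and keep only goalies above a minimum number of games. The sketch below uses made-up rows in the same shape as `game_goalie_stats.csv`; the `MIN_GAMES` threshold and the `games_played` column name are assumptions for illustration, not part of the original analysis.

```python
import pandas as pd

# Hypothetical per-game rows; the real data would come from game_goalie_stats.csv
games = pd.DataFrame({
    'player_id': [1, 1, 1, 2, 3, 3],
    'evenStrengthSavePercentage': [92.0, 90.0, 94.0, 100.0, 88.0, 91.0],
})

MIN_GAMES = 2  # assumed cutoff; a real analysis would likely use a much larger value

# Compute the mean save percentage and the number of games for each goalie
summary = (games.groupby('player_id')['evenStrengthSavePercentage']
                .agg(SavePercentage='mean', games_played='count')
                .reset_index())

# Keep only goalies with enough games for the average to be meaningful
established = summary[summary['games_played'] >= MIN_GAMES]
print(established)
```

In this toy data, the goalie with a single 100% game is dropped, which is the kind of outlier described above.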

In [None]:
# Plot the counts of goalies by their average save percentage
ticks = [50, 60, 70, 80, 90, 100]
labels = ['{}'.format(v) for v in ticks]
plt.hist(data = df_goalie_averages, x = 'SavePercentage', bins=bins)
plt.xscale('log')
plt.xticks(ticks, labels)
plt.xlim((75,100))
plt.xlabel('Save Percentage')
plt.ylabel('Counts of Goalies')
plt.title('Save Percentage of Goalies by Count')
plt.show()

### Save Percentages by Count
Again, this graph shows the same data as above, but it focuses on the counts of goalies instead of proportions and is "zoomed in" on the core group of goalies whose careers are long enough to be measured. Carey still falls into the bin with the tallest bar, which is what we would expect for a long-term, high-performing goalie.

### Convert game_id to str, strip the last two digits, then convert to date and add as extra column
The data in each of these files has a unique game_id. It almost looks like it could be broken down into a date and the game number in that season. Using this assumption, I want to break it out into a new date field so I can visualize Carey's saves over the course of his career.

In [None]:
df_goalies['date'] = df_goalies['game_id'].apply(str)
df_goalies

In [None]:
df_goalies['date'] = df_goalies['date'].str.slice(0, 8)
df_goalies

In [None]:
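## if date ends in 00, then remove, assumption is there are preseason games included in these dataframes that we want to remove
# .copy() avoids pandas' SettingWithCopyWarning when the date column is reassigned below
df_goalies = df_goalies[~df_goalies['date'].str.endswith('00')].copy()

# The sliced string has no separators (YYYYMMDD), so the matching format is '%Y%m%d'
df_goalies['date'] = pd.to_datetime(df_goalies['date'], format='%Y%m%d')
df_goalies.drop_duplicates(inplace=True)
df_goalies.head()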

In [None]:
df_goalies[df_goalies.player_id == 8471679]['date']

In [None]:
# Grab all of the save percentages by game for Carey Price
price_saves = df_goalies[df_goalies.player_id == 8471679]['evenStrengthSavePercentage']
price_saves

# Grab all of the dates of games for Carey Price
price_dates = df_goalies[df_goalies.player_id == 8471679]['date']
price_dates.head()


### Question 1: How good of a goalie is Carey Price?

In [None]:
# Plot the dates and amounts of saves over Carey's career
fig = plt.figure(figsize=[12, 12])
colors = df_goalies[df_goalies.player_id == 8471679]['saves']
plt.scatter(price_dates, price_saves, c=colors, cmap='viridis')
plt.ylabel('Save Percentage')
plt.xlabel('Date')
plt.title("Save percentages and numbers by game")
plt.colorbar()
plt.show()

### Save Percentages and Counts by Game
The graph above is a multivariate analysis of Carey's career. It shows all of his games by year, with the save percentage for that game shown on the left and the counts of saves shown by the color bar. A slight trend can be seen from the colors. The more saves he makes, the better his save percentage. What this might tell us is that he performs well when he's pressured.
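
The trend read off the colors can also be checked numerically with a correlation coefficient. This is a minimal sketch on made-up numbers, not Carey's actual games; with the real data, the two columns would come from the `df_goalies` rows for his `player_id`.

```python
import pandas as pd

# Hypothetical games: number of saves and even strength save percentage
price_games = pd.DataFrame({
    'saves': [20, 25, 30, 35, 40],
    'evenStrengthSavePercentage': [88.0, 90.0, 91.0, 93.0, 95.0],
})

# Pearson correlation: values near +1 mean the two columns rise together
r = price_games['saves'].corr(price_games['evenStrengthSavePercentage'])
print(round(r, 3))
```

A positive value would support the visual impression, though saves and save percentage are mechanically related (more saves on the same number of shots means a higher percentage), so it would not prove anything about performing under pressure on its own.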

### Grab all goalies higher than the average of averages and compare against Carey Price
Now I want to figure out just how good Carey Price is compared to everyone. We'll start by getting the average of all goalies and then comparing that save percentage to Carey Price's own.
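
The "average of averages" used here is not the same number as the league-wide mean over all games computed earlier, because it gives every goalie equal weight no matter how many games they played. A small sketch with made-up values shows the difference:

```python
import pandas as pd

# Hypothetical data: goalie 1 played four games, goalie 2 played one perfect game
games = pd.DataFrame({
    'player_id': [1, 1, 1, 1, 2],
    'evenStrengthSavePercentage': [90.0, 92.0, 91.0, 91.0, 100.0],
})

# Mean over every game (each game counts once)
pooled_mean = games['evenStrengthSavePercentage'].mean()

# Mean of per-goalie means (each goalie counts once)
mean_of_means = (games.groupby('player_id')['evenStrengthSavePercentage']
                      .mean()
                      .mean())

print(pooled_mean, mean_of_means)
```

Here the single-game goalie pulls the average of averages well above the per-game mean, which is worth keeping in mind when using it as a cutoff.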

In [None]:
# Verify there are no nulls
df_goalie_averages.isnull().sum()

In [None]:
# get the average of all goalies
df_goalie_averages['SavePercentage'].mean()

In [None]:
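# get all the goalies with save percentages greater than or equal to the overall average
# Filter on the SavePercentage column only; comparing the whole DataFrame would also test player_id
# and keep every row with NaN values instead of dropping the below-average goalies
best_goalies = df_goalie_averages[df_goalie_averages['SavePercentage'] >= df_goalie_averages['SavePercentage'].mean()]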

In [None]:
best_goalies.info()

In [None]:
best_goalies.describe()

In [None]:
# Outliers in the data. These players had a 100% save percentage. But this could be because they only played 1 game,
# managed to get a shutout, and then never played again. They are not indicative of career long goalies.
best_goalies.loc[best_goalies['SavePercentage']==100]

In [None]:
best_goalies.loc[best_goalies['player_id']==8471679]

### Question 2: How does Carey Price compare to other great goalies?

### Visualize Carey Price against 4 of the greatest goalies to play the game in the last 15 years.
I want to see how Andrei Vasilevskiy, Tuukka Rask, Semyon Varlamov, and Henrik Lundqvist all compare to Carey Price as goalies.

In [None]:
# Plot 5 of the best goalies in the NHL to rank their average save percentage
great_goalies = [8471679, 8468685, 8473575, 8471695, 8476883]

# DataFrame.append was removed in pandas 2.0; isin selects all five goalies in one step
top_five = best_goalies[best_goalies['player_id'].isin(great_goalies)]

top_five = pd.merge(top_five, df_players, on='player_id')
top_five['fullName'] = top_five['firstName'] + ' ' + top_five['lastName']
top_five = top_five[['player_id', 'SavePercentage', 'fullName']]
top_five.sort_values('SavePercentage', inplace=True)
top_five

In [None]:
fig, ax = plt.subplots(figsize=(16,10), dpi= 80)
# Draw one guide line per goalie, on the same y values as the scatter and across the x-range set below
ax.hlines(y=top_five.fullName, xmin=91.8, xmax=93, color='gray', alpha=0.7, linewidth=1, linestyles='dashdot')
ax.scatter(y=top_five.fullName, x=top_five.SavePercentage, s=75, color='firebrick', alpha=0.7)

# Title, Label, Ticks and Ylim
ax.set_title('Top Five Goalies by Save Percentage', fontdict={'size':22})
ax.set_xlabel('Save Percentage')
ax.set_yticks(top_five.fullName)
ax.set_yticklabels(top_five.fullName, fontdict={'horizontalalignment': 'right'})
ax.set_xlim(91.8, 93)
plt.show()

### Results?
Overall, Carey comes in 4th place against these great goalies. Stopping nearly 92 shots out of 100 is nothing to scoff at, but players like Henrik Lundqvist and Tuukka Rask are able to make that extraordinary effort to save more. I would want to dig deeper into this, such as how a team's defense plays into protecting their goalie or limiting the options forwards have to set up and make shots. Maybe we would find that Carey's team (Montreal Canadiens) didn't have as good a defense over this period as Henrik's (Rangers) and Tuukka's (Bruins).

## Section 2 (part 2): Data Understanding

### Alexander Ovechkin Goals
Alexander Ovechkin is chasing history. He wants to surpass Wayne Gretzky's all-time goal record. And he just might do it. Let's see how many goals he has amassed in his career so far.

In [None]:
# Find Alexander Ovechkin
df_players[df_players.lastName == 'Ovechkin']

In [None]:
# Clean the data of duplicates and sum Ovi's goal total
df_goals = pd.read_csv('game_skater_stats.csv')
df_goals.drop_duplicates(inplace=True)
df_goals[df_goals.player_id == 8471214].goals.sum()

## What's wrong with the data?
So I've removed any duplicates, and this dataset only goes through the 2019-2020 season, but a quick check on ESPN.com suggests that Ovechkin had somewhere between roughly 658 and 706 goals at that point, once the last two seasons are subtracted from his total. I need the goal count from the data to be as accurate as possible before building a neural network.

In [None]:
df_goals.info()

## Section 3 (part 2): Data Preparation

### Fix dates, then create running total after each game and add additional column for neural network.
Again, I'm using the assumption that the game_id is a combination of a date and game number and I want to do what I did with it for Carey Price so I can create a running total of his career goals after each game.

In [None]:
df_goals['gameID'] = df_goals['game_id'].apply(str)
df_goals.info()

In [None]:
df_goals = df_goals[~df_goals['gameID'].str.endswith('00')]
df_goals.shape

In [None]:
df_goals[df_goals.player_id == 8471214].goals.sum()

In [None]:
df_goals['date'] = df_goals['gameID'].str.slice(0, 8)
df_goals.head()

In [None]:
df_goals = df_goals[~df_goals['date'].str.endswith('00')]
df_goals.head()

In [None]:
ovi_goals = df_goals[df_goals.player_id == 8471214]
ovi_goals.info()

In [None]:
ovi_goals['goals'].sum()

### Start putting together neural net training set

In [None]:
df_games = pd.read_csv('game.csv')

In [None]:
# Merge Ovechkin data with the games data and clean to remove duplicates
result = pd.merge(ovi_goals, df_games, on='game_id')
result.drop_duplicates(inplace=True)
result.info()

### Where I discovered I was wrong...
In the cell below, I took a glance at what the game_id looked like compared to the date_time_GMT field from the game.csv. It turns out, if you break out what would look like the date from game_id, it does not match the date_time_GMT field. So what is this game_id? Is it just a random number? It doesn't look like it to me, but it won't work for what I have in mind.
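
The comparison described above can also be done programmatically instead of by eye. The sketch below is one hedged way to do it; the rows are hypothetical values in the same shape as the merged data, and `prefix_matches_date` is a helper name invented for this example.

```python
from datetime import datetime

# Hypothetical (game_id, date_time_GMT) pairs shaped like the merged result
rows = [
    (2011020001, '2011-10-06T23:00:00Z'),
    (2011020510, '2011-12-23T00:00:00Z'),
]

def prefix_matches_date(game_id, date_time_gmt):
    """Return True if the first 8 digits of game_id equal the game's calendar date."""
    prefix = str(game_id)[:8]
    actual = datetime.strptime(date_time_gmt[:10], '%Y-%m-%d').strftime('%Y%m%d')
    return prefix == actual

# One boolean per row; all False means the "game_id contains a date" assumption fails
matches = [prefix_matches_date(game_id, gmt) for game_id, gmt in rows]
print(matches)
```

On the real data, the same check could be applied row by row to `result[['game_id', 'date_time_GMT']]` to see how often the prefix and the date agree.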

In [None]:
result[['game_id', 'date_time_GMT']]

In [None]:
# Sort the result dataset by game dates
result.sort_values(by=['date_time_GMT'], inplace=True)
result.head()

In [None]:
# Create a column over the data that adds the cumulative sum of Ovechkin's goals by each game
result['careerGoals'] = result.goals.cumsum()
result

In [None]:
result.info()

In [None]:
# Create a dataset with the relevant columns for a neural net, then clean up the dates, and set them as indexes
oviData = pd.DataFrame(result, columns=['game_id', 'player_id', 'goals', 'date_time_GMT', 'careerGoals'])
oviData['date'] = oviData['date_time_GMT'].str.slice(0, 10)
oviData['date'] = pd.to_datetime(oviData['date'], format='%Y-%m-%d')
oviData = oviData.set_index('date')
oviData.shape

## Section 4: Data Modeling

### My training and test set
I put together the training/test set in the cell below for my neural net. I'm using the date as my index, with a few important fields I wanted to include in case I run into something I might need later. I've included a careerGoals field at the end that provides the running total of all of Ovechkin's goals to feed my neural network so it can make predictions about how many goals he will get.
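
The running total only makes sense if the games are in chronological order before summing. A minimal sketch of that step on made-up rows (the column names mirror the notebook, the values do not come from the dataset):

```python
import pandas as pd

# Hypothetical games, deliberately out of order
games = pd.DataFrame({
    'date_time_GMT': ['2019-10-05T23:00:00Z', '2019-10-02T23:00:00Z', '2019-10-08T23:00:00Z'],
    'goals': [2, 1, 0],
})

# Sort chronologically first, then accumulate goals game by game
games = games.sort_values('date_time_GMT').reset_index(drop=True)
games['careerGoals'] = games['goals'].cumsum()
print(games)
```

Because the timestamps are ISO-formatted strings, sorting them as text gives the same order as sorting them as dates.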

In [None]:
oviData.head()

### Build Neural Net to predict future goals for Alexander Ovechkin
Here I began the process of normalizing my data, splitting it out into a test and training set, and finally training my neural network to make predictions on how many goals Alexander Ovechkin would get in his career. I used a YouTube video from the Computer Science channel about a similar neural network trained to make stock price predictions over time and repurposed it for this project. I also included some hyperparameters to help with precision and accuracy of the neural network. It was all a little over my head, but I followed along as best I could and tried different things to get the neural network to work as best I could.
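
The core idea behind the data preparation is a sliding window: each training sample is a run of consecutive running totals, and the target is the value that comes next. The sketch below shows the idea on a tiny made-up series with a window of 3 instead of 60. The `make_windows` helper is a name invented for this example, and it scales with the training portion's min and max only, which is a variation on the notebook (the notebook fits the scaler on the full dataset).

```python
import numpy as np

def make_windows(series, window):
    """Split a 1-D array into (samples, window, 1) inputs and next-step targets."""
    x, y = [], []
    for i in range(window, len(series)):
        x.append(series[i - window:i])  # the previous `window` values
        y.append(series[i])             # the value to predict
    x = np.array(x).reshape(-1, window, 1)
    return x, np.array(y)

career_goals = np.array([0, 1, 1, 3, 4, 4, 6, 7], dtype=float)  # made-up totals
split = 6  # assumed train/test boundary for this toy series

# Min-max scale using statistics from the training portion only
lo, hi = career_goals[:split].min(), career_goals[:split].max()
scaled = (career_goals - lo) / (hi - lo)

x_train, y_train = make_windows(scaled[:split], window=3)
print(x_train.shape, y_train.shape)
```

The three-dimensional shape (samples, time steps, features) is what an LSTM layer expects as input.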

In [None]:
import math
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import Dense, LSTM
from keras import regularizers

In [None]:
# Show the career goal progression of Alexander Ovechkin
plt.figure(figsize=(16,8))
plt.title('Ovechkin Career Goal Progression')
plt.plot(oviData['careerGoals'])
plt.xlabel('Date')
plt.ylabel('Goals')
plt.show()

### Observations of Career Goal Progression
So we can see the steady incline of Ovechkin's goal progressions over time. Every game is used and each time he adds goals to his careerGoals field, the progression goes up. But what about the flat lines? Well every hockey season has to come to an end at some point and these represent the offseason where no games are played and no goals are counted. We will want to make sure that the neural net can account for this period of time.

In [None]:
# Create a DataFrame of the cumulative goal counts and get the training data length
data = oviData.filter(['careerGoals'])

dataset = data.values

training_data_len = math.ceil(len(dataset) * .8)

training_data_len

In [None]:
# scale the goal counts using MinMaxScaler to reduce them all between 0 and 1
# fit the data to be scaled.
scaler = MinMaxScaler(feature_range=(0,1))
scaled_data = scaler.fit_transform(dataset)

scaled_data

In [None]:
#get the training data from our scaled dataset; each x_train sample holds 60 consecutive records (totals after each game)
#and y_train holds the 61st record that follows them. The window length of 60 is arbitrary
train_data = scaled_data[0:training_data_len, :]

x_train = [] 
y_train = []

for i in range(60, len(train_data)):
  x_train.append(train_data[i-60:i, 0])
  y_train.append(train_data[i, 0])
  if i<= 60:
    print(x_train)
    print(y_train)
    print()

In [None]:
# convert x and y train to array objects
x_train, y_train = np.array(x_train), np.array(y_train)

In [None]:
# Reshape x_train to be 3 dimensions (# of rows, # of columns, # of features)
x_train = np.reshape(x_train, (x_train.shape[0], x_train.shape[1], 1))
x_train.shape

In [None]:
# Build our RNN model
model = Sequential()
model.add(LSTM(60, return_sequences=True, input_shape=(x_train.shape[1], 1), bias_regularizer=regularizers.l2(1e-4)))
model.add(LSTM(60, return_sequences=False))
model.add(Dense(30))
model.add(Dense(1))

In [None]:
#Compile with the Adam optimizer (an adaptive variant of stochastic gradient descent) and a loss function based on mean squared error (how we minimize errors in our model)
model.compile(optimizer='adam', loss='mean_squared_error')

In [None]:
# Fit the model on the training data: batch_size=1 updates the weights after every sample, with 3 passes (epochs) over the data
model.fit(x_train, y_train, batch_size=1, epochs=3)

In [None]:
# Create the test set, starting 60 records before the split so the first test window has a full 60-record history
test_data = scaled_data[training_data_len - 60: , :]

x_test = []
y_test = dataset[training_data_len:, :]
for i in range(60, len(test_data)):
  x_test.append(test_data[i -60:i, 0])


In [None]:
# Convert to an array
x_test = np.array(x_test)

In [None]:
# Reshape the data to be 3 dimensions
x_test = np.reshape(x_test, (x_test.shape[0], x_test.shape[1], 1))

In [None]:
# Make our predictions on the test data and invert the scaled data
predictions = model.predict(x_test)
predictions = scaler.inverse_transform(predictions)

## Section 5: Evaluation

### How did it perform?
Using the same method from the YouTube video, we calculated how well the neural net made predictions using the Root Mean Squared Error calculation. The closer to zero it is, the better the predictions were.
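
For reference, RMSE is $\sqrt{\frac{1}{n}\sum_i (\hat{y}_i - y_i)^2}$: each error is squared before averaging, so errors in opposite directions cannot cancel. The sketch below illustrates this on made-up numbers; the `rmse` helper is written for this example only.

```python
import numpy as np

def rmse(predicted, actual):
    """Root mean squared error: square each residual, average, then take the root."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))

actual = [100.0, 102.0, 104.0]
too_high_and_low = [103.0, 102.0, 101.0]  # off by +3, 0, and -3

# The plain average of the errors is zero, but the predictions are clearly not perfect
mean_error = float(np.mean(np.array(too_high_and_low) - np.array(actual)))
score = rmse(too_high_and_low, actual)
print(mean_error, score)
```

A score near zero is only meaningful when the squaring happens before the mean is taken.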

In [None]:
# Calculate the root mean squared error on our predictions (closer to zero is better)
# Square the errors before averaging so positive and negative errors cannot cancel out
rmse = np.sqrt(np.mean((predictions - y_test)**2))
rmse

### Question 3: Can we predict how many goals a player like Alexander Ovechkin will score?

In [None]:
# Plot how our model performed versus the actual results
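train = data[:training_data_len]
# .copy() so that adding the Predictions column does not trigger pandas' SettingWithCopyWarning
valid = data[training_data_len:].copy()
valid['Predictions'] = predictions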

plt.figure(figsize=(16, 8))
plt.title('Model')
plt.xlabel('Date')
plt.ylabel('Career Goals')
plt.plot(train['careerGoals'])
plt.plot(valid[['careerGoals', 'Predictions']])
plt.legend(['Train', 'Actual', 'Predictions'], loc='lower right')
plt.show()

### Neural Net Performance
Ideally, the graphic above shows the model's predictions running close to the actual results from the test portion of our dataset. I noticed that I would get varying results every time I ran this. Sometimes the RMSE would be as high as 64, and other times it would be 0.2. I tried changing the layers of my model, tuning hyperparameters, and adding L1 and L2 regularization, but other than narrowing the gap between runs, I still wasn't able to get a consistent result. Part of the variation likely comes from the random weight initialization, since no random seed is fixed. Getting stable results would require more research and possibly more data or different parameters. Many neural networks train on thousands of records, but most hockey players don't play more than 1,000 games in their careers, so the lack of data could also be why the model is inconsistent.
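
One way to put an RMSE value in context is to compare it with a naive baseline. Because career goals never decrease and change by only a few per game, simply predicting "the same total as after the previous game" is already a strong guess. The sketch below computes that baseline on a made-up series; it is an illustration under those assumptions, not a result from the dataset.

```python
import numpy as np

# Hypothetical running totals after consecutive games
career_goals = np.array([699.0, 700.0, 700.0, 701.0, 703.0, 703.0])

actual = career_goals[1:]             # totals we want to predict
naive_prediction = career_goals[:-1]  # guess: no change since the previous game

# RMSE of the naive guess, squaring the errors before averaging
baseline_rmse = float(np.sqrt(np.mean((naive_prediction - actual) ** 2)))
print(baseline_rmse)
```

If the neural network's RMSE on the test set is not lower than this kind of baseline, the model is not adding much beyond carrying the last value forward.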

In [None]:
# Show the results
valid

### Just for Fun
I wanted to see how it could predict the next game's total career goals if there was another record.

In [None]:
# predict the next record's goals 
goal_count = oviData

new_df = goal_count.filter(['careerGoals'])

last_60_values = new_df[-60:].values

last_60_values_scaled = scaler.transform(last_60_values)

X_test = []

X_test.append(last_60_values_scaled)

X_test = np.array(X_test)

X_test = np.reshape(X_test, (X_test.shape[0], X_test.shape[1], 1))

pred_goals = model.predict(X_test)

pred_goals = scaler.inverse_transform(pred_goals)
print(pred_goals)