# NBA_MVP_Prediction_Model

### Requirements/Dependencies:
1. Have Anaconda installed on your system: https://www.anaconda.com/distribution/
   - I recommend using Spyder (included in the Anaconda Distribution) for running the Python scripts.
   - You can launch Spyder through Anaconda Navigator, or from a terminal by typing `spyder` and pressing Enter.
2. Make sure the following modules/packages are available:
   - numpy, pandas, requests, BeautifulSoup4, lxml, datetime, sklearn (all of these are included in the Anaconda Distribution)

### Quick Steps:
1. Create the following folders for storage of generated csv files:
   - `per_game_stats`
   - `totals_stats`
2. First run `datasetFactory.py` to generate the csv files.
3. Then run `randForest.py` to run the predictions.

Note: The recommended IDE is Spyder (from the Anaconda Distribution), because it comes with many libraries that may not be available in the IDLE that ships with the standard Python installer.
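
Step 1 can also be done from Python instead of by hand. The following is a minimal sketch, assuming the two folder names from the list above; `make_output_folders` and `base_dir` are hypothetical names used only for illustration and are not part of the repository's scripts.

```python
import os

# Folder names taken from step 1 of the Quick Steps.
OUTPUT_FOLDERS = ("per_game_stats", "totals_stats")


def make_output_folders(base_dir="."):
    """Create the csv output folders under base_dir if they are missing.

    Hypothetical helper: the repository itself expects the folders
    to be created manually.
    """
    created = []
    for name in OUTPUT_FOLDERS:
        path = os.path.join(base_dir, name)
        # exist_ok=True makes the call safe to repeat
        os.makedirs(path, exist_ok=True)
        created.append(path)
    return created
```

Calling `make_output_folders()` once from the repository root before running `datasetFactory.py` should be enough, since the call does nothing when the folders already exist.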

The objective of the predictive model is to return a number of players from the current season that have the highest potential to be the next MVP using statistics from years 1957 to 2019 as the training dataset.

This repository initially contains the following files:
1. `NBA MVP Prediction Model - Notes.ipynb` --> The notebook where I plan everything, from pseudocode to final code, before pasting it into the respective Python files: `datasetFactory.py`, `nbaPlayers_StatsScraper.py`, `randForest.py`
2. `nbaPlayers_StatsScraper.py` --> A module containing two methods for web scraping data from https://www.basketball-reference.com/
3. `datasetFactory.py` --> Running this generates csv files of web-scraped data from https://www.basketball-reference.com/
4. `randForest.py` --> Contains the machine learning model. It currently uses Random Forest Regression, but I might consider another model that is more applicable.

The notebook, `NBA MVP Prediction Model - Notes.ipynb`, elaborates on the following major steps:

## 1. Webscraping for Data

Methods from `nbaPlayers_StatsScraper.py` module:

`scrapeNBAStats(year,type)`
`scrapeMVPs()`

## 2. Preparation of Dataset/s

Running `datasetFactory.py` delivers these outputs:
1. csv files of NBA Players Statistics Per Game for every season (1956-57 to 2018-19)
2. `training_data.csv`
3. `nbaMVPs.csv`

## 3. Building the Machine Learning Model
Running `randForest.py` delivers these csv outputs:
1. `mvpTop10candidates.csv`
2. `CompletePredictions.csv`

The script shall perform these main steps:

a. PREPARE data - creating DataFrames and pre-processing

b. DEFINE model - choose model and instantiate

c. FIT model - train the model

d. PREDICT - Create a dataframe for the stats of current season.
   
   As of this time, the current season is 2019-20.

e. EVALUATE - using mean absolute error (only shown in the .ipynb file)

f. RESULTS - display top 10 potential MVPs and the complete predictions


# 1. Webscraping for Data
## `nbaPlayers_StatsScraper.py`

In [None]:
# Import libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime
import os

## METHOD 1: `scrapeNBAStats(year,type='per_game')`
Web Scraping Source: e.g. for year 2019 -- https://www.basketball-reference.com/leagues/NBA_2019_per_game.html

**OUTPUT**: `nbaPlayers_stats_{type}_{year}.csv`
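
The URL, the season label, and the output path all follow from `year` and `type`. The sketch below isolates that string handling so it can be checked without making a web request. The helper names (`season_label`, `stats_url`, `stats_csv_path`) and the `stat_type` parameter are hypothetical; the patterns themselves are the ones used by `scrapeNBAStats`.

```python
import os

# The only two table types the scraper documents.
VALID_TYPES = ("per_game", "totals")


def season_label(year):
    """Return the season string for the year a season ends in."""
    return "{}-{}".format(year - 1, str(year)[2:])


def stats_url(year, stat_type="per_game"):
    """Build the basketball-reference URL for one season and table type."""
    if stat_type not in VALID_TYPES:
        raise ValueError("stat_type must be 'per_game' or 'totals'")
    return f"https://www.basketball-reference.com/leagues/NBA_{year}_{stat_type}.html"


def stats_csv_path(year, stat_type="per_game"):
    """Build the relative path of the csv file written for one season."""
    filename = f"nbaPlayers_stats_{stat_type}_{year}.csv"
    return os.path.join(f"{stat_type}_stats", filename)


print(season_label(2004))  # 2003-04
print(stats_url(2019))
```

Rejecting an unknown `stat_type` early is a design choice of this sketch; the original function would instead request a page that does not exist.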

In [None]:
def scrapeNBAStats(year=datetime.now().year, type='per_game'):
    '''
    Scrapes the per-game or totals statistics of all NBA players for a given season
    
    scrapeNBAStats(year=datetime.now().year, type='per_game')
    year: int object; defaults to current year.
          Pertains to the season--e.g. For season 2003-04, year to input is 2004
    type: str object; Only two choices-- 'per_game' or 'totals'. Defaults to 'per_game'
    
    OUTPUT: 'nbaPlayers_stats_{type}_yyyy.csv', saved in the '{type}_stats' folder
    'yyyy' is the year of the season
    '''

    # URL to be requested
    url = f"https://www.basketball-reference.com/leagues/NBA_{year}_{type}.html"
    
    # Create requests object: res
    print("\nNow requesting...", url)
    res = requests.get(url)
    res.raise_for_status() # Raise an HTTPError if the request failed

    # Create BeautifulSoup Object: `soup`
    soup = BeautifulSoup(res.text, features='lxml')
    print("Created BeautifulSoup object: soup")
    
    # Parse the column headers and store them in a list
    headers = soup.thead.getText().split('\n')[3:-2] # Slicers are intended to exclude unnecessary headers
    
    # Parse the rows(player stats) and store them in a list
    rows = soup.findAll('tr')[1:]
    
    # Create the rows for each player and their stats as a list of lists
    player_stats = []
    for tr in rows:
        try:
            # Parse the text within the tags and exclude the value under the
            # 'Rk' column, since it was also excluded from our headers
            row = [td.getText() for td in tr][1:]
            player_stats.append(row)
        except AttributeError:
            # About every 20th iteration of this loop hits this error;
            # skip that row and continue with the next iteration
            pass

    print("Scraping and Parsing Complete!")         
    
    # Create a pandas DataFrame
    stats = pd.DataFrame(player_stats, columns=headers)
    
    season_prefix = str(year-1)
    season_suffix = str(year)[2:]
    
    season = "{}-{}".format(season_prefix,season_suffix)
    
    stats['Season'] = season # Add this additional column to indicate the NBA Season
    
    # Remove the unnecessary '*' character under the 'Player' column.
    # regex=False treats '*' as a literal character rather than a regex pattern
    stats['Player'] = stats['Player'].str.replace('*','',regex=False)
    
    # Create a csv file from this DataFrame
    filename = f'nbaPlayers_stats_{type}_{year}.csv'
    filepath = os.path.join(f'{type}_stats',filename)
    stats.to_csv(filepath,header=True, index=False)
    print("Generated csv file last {}: {}".format(datetime.now(),filename))

In [None]:
# Test run
scrapeNBAStats(2019,type='totals')

In [None]:
# Check csv files by viewing them using pandas DataFrames
filename = 'nbaPlayers_stats_totals_2019.csv'
filepath = os.path.join('totals_stats',filename) # Folder written to by the test run above
season_df = pd.read_csv(filepath)

season_df.head()

## METHOD 2: `scrapeMVPs()`
Scrapes the NBA MVPs from 1957 to 2019
    
**OUTPUT:** `nbaMVPs.csv`

In [None]:
def scrapeMVPs():
    '''
    Scrapes the NBA MVPs from 1957 to 2019
    OUTPUT: 'nbaMVPs.csv'
    '''
    # URL to be requested
    url = "https://www.basketball-reference.com/awards/mvp.html"

    # Create requests object: res
    print("Now requesting...", url)
    res = requests.get(url)
    res.raise_for_status() # Raise an HTTPError if the request failed

    # Create BeautifulSoup Object: `soup`
    soup = BeautifulSoup(res.text, features='lxml')
    print("Created BeautifulSoup object: soup")
    
    # Parse the table
    html_table = soup.findAll('tr')[1:] # Sliced first header

    # Parse the column headers and store them in a list
    headers = [col_head.getText() for col_head in html_table][0].split('\n')[1:4] # Slicers are intended to exclude unnecessary headers

    # Parse the rows and store them in a list
    raw_rows = [col_head.getText() for col_head in html_table][1:64] # The scope of slice is from 1957 to 2019
    players = []
    for row in raw_rows:
        season = row[:7]
        league = row[7:10]
        player = row[10:].split('(V)')[0]
        players.append([season,league,player])
    
    print("Scraping and Parsing Complete!")        
    
    # Create a pandas DataFrame
    mvp = pd.DataFrame(players, columns=headers)
    
    # Remove the extra space at the end of each player name. e.g. "Stephen Curry " should be "Stephen Curry"
    # str.strip() only removes whitespace, so names without a trailing space are left intact
    mvp['Player'] = mvp['Player'].str.strip()
    
    # Create a csv file from this DataFrame
    filename = 'nbaMVPs.csv'
    mvp.to_csv(filename,header=True, index=False)
    print("Generated csv file last {}: {}".format(datetime.now(),filename))

In [None]:
# Test run:
scrapeMVPs()
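
`scrapeMVPs()` relies on fixed-width slicing of each row's text: the first 7 characters are the season, the next 3 are the league, and the player name runs up to the `(V)` marker. The sketch below isolates that step. `parse_mvp_row` is a hypothetical helper name, and the sample string is made up to mirror the layout those slices assume; it is not copied from the site.

```python
def parse_mvp_row(row_text):
    """Split one MVP table row into [season, league, player]."""
    season = row_text[:7]     # e.g. '2018-19'
    league = row_text[7:10]   # e.g. 'NBA'
    # Everything before the '(V)' marker is the player name
    player = row_text[10:].split("(V)")[0].strip()
    return [season, league, player]


# Illustrative input only; the real row text may carry more columns after '(V)'.
sample_row = "2018-19NBAGiannis Antetokounmpo (V)24"
print(parse_mvp_row(sample_row))
```

Because the slices are positional, any change in the page layout (for example a longer league code) would silently shift the fields, so it is worth spot-checking a few rows of `nbaMVPs.csv` after scraping.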

# 2. Preparation of Datasets

## `datasetFactory.py`

In [None]:
# No need to import these two modules for this notebook.
#from nbaPlayers_StatsScraper import scrapeNBAStats
#from nbaPlayers_StatsScraper import scrapeMVPs
import os
import pandas as pd

# Build a training dataset consisting of complete stats from year 1957 to 2019

type = 'per_game' # Choose between 'per_game' or 'totals'. Also used by the code below this block

##### GENERATES CSVs OF NBA STATS. COMMENT OUT THIS BLOCK ONCE THE FILES EXIST #####
years = list(range(1957,2020))

for i in years:
    scrapeNBAStats(year=i,type=type)
##### SCRAPING BLOCK ENDS HERE ###################

# Read the csv file of every season from 1957 to 2019
# and concatenate them into the DataFrame, "training_df"
season_dfs = []
for season in range(1957,2020):
    filename = 'nbaPlayers_stats_{}_{}.csv'.format(type,season)
    filepath = os.path.join(f'{type}_stats',filename)
    season_dfs.append(pd.read_csv(filepath))

training_df = pd.concat(season_dfs)

# This shall be the training data.
# Generate a csv file out of this compilation of stats.
training_filename = f'training_data_{type}.csv'
training_filepath = os.path.join(f'{type}_stats',training_filename) 
training_df.to_csv(training_filepath, header=True,index=False)
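
The concatenation step reads one csv file per season, so a single missing file stops it with a `FileNotFoundError`. A minimal sketch of a pre-check, assuming the folder and file naming used by `scrapeNBAStats`; `missing_season_files` and its parameters are hypothetical names, not part of `datasetFactory.py`.

```python
import os


def missing_season_files(stat_type="per_game", first_year=1957,
                         last_year=2019, base_dir="."):
    """Return the expected season csv paths that do not exist yet."""
    missing = []
    for year in range(first_year, last_year + 1):
        path = os.path.join(
            base_dir,
            f"{stat_type}_stats",
            f"nbaPlayers_stats_{stat_type}_{year}.csv",
        )
        if not os.path.isfile(path):
            missing.append(path)
    return missing
```

An empty list from `missing_season_files()` means every season from 1957 to 2019 is available; otherwise the returned paths show which years still need to be scraped.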

# 3. Optimizing the Machine Learning Model
Before using the full training data, we first need to find suitable parameters for the Random Forest Regression model.

In [None]:
# Import modules from sklearn library
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from datetime import datetime
import numpy as np

In [None]:
# PREPARE: Pre-processing
# Create DataFrames for building training data

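type = 'per_game' # Choose between 'per_game' or 'totals'

# Read the training data from the folder where it was saved in the previous section
train_df = pd.read_csv(os.path.join(f'{type}_stats',f'training_data_{type}.csv'))
mvp_df = pd.read_csv('nbaMVPs.csv')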

# Add an additional feature for the training data: 'MVP'
# contains only two unique values of 0 and 1. 0 for 'not MVP' and 1 for 'MVP'

# Add an additional column ['Season-Player'] on train_df and mvp_df
# This additional column shall serve as an MVP marker reference
train_df['Season-Player'] = train_df['Season'] + '_' + train_df['Player'].str.replace(' ','_')
mvp_df['Season-Player'] = mvp_df['Season'] + '_' + mvp_df['Player'].str.replace(' ','_')

# Add additional column 'MVP' on train_df: 1 if the row's 'Season-Player' value
# also appears in mvp_df, otherwise 0.
# This column shall be the target(y) for our training dataset
train_df['MVP'] = train_df['Season-Player'].isin(mvp_df['Season-Player']).astype(int)

# Save the labelled data under train_df as 'training_data.csv'
train_df.to_csv('training_data.csv', header=True,index=False)

# Drop NaN values from train_df
train_df.dropna(axis=0,inplace=True)

# Set index to 'Season-Player' column
train_df = train_df.set_index('Season-Player')
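
The labelling above depends on both tables producing exactly the same 'Season-Player' key. The plain-Python sketch below shows the same idea with a set lookup, which makes the matching rule easy to test on a few rows. `season_player_key` and `label_mvps` are hypothetical names, and the rows are a tiny hand-written sample rather than scraped data.

```python
def season_player_key(season, player):
    """Build the 'Season-Player' key, e.g. '2018-19_Giannis_Antetokounmpo'."""
    return season + "_" + player.replace(" ", "_")


def label_mvps(player_rows, mvp_rows):
    """Return 1 for each (season, player) row that is an MVP, else 0."""
    mvp_keys = {season_player_key(s, p) for s, p in mvp_rows}
    return [1 if season_player_key(s, p) in mvp_keys else 0
            for s, p in player_rows]


players = [
    ("2018-19", "Giannis Antetokounmpo"),
    ("2018-19", "James Harden"),
    ("2017-18", "James Harden"),
]
mvps = [("2018-19", "Giannis Antetokounmpo"), ("2017-18", "James Harden")]
print(label_mvps(players, mvps))  # [1, 0, 1]
```

Any difference in the name text between the two sources, such as a stray trailing space or a leftover `*`, makes the keys differ and leaves that MVP labelled 0, which is why the scrapers clean the 'Player' column first.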

In [None]:
# DEFINE: Set features & target, apply train_test_split, define model

features = ['G', 'GS', 'MP', 'FG', 'FGA', 'FG%', '3P','3PA', '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA', 'FT%', 'ORB','DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS']
X = train_df[features]
y = train_df['MVP']

# train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(X,y,train_size=0.75,test_size=0.25,random_state=0)


In [None]:
# Pre-evaluation: Use Mean Absolute Error and validate across different number of estimators
def getMAEs(n_est, X_train, X_valid, y_train, y_valid):
    for estimators in n_est:
        model = RandomForestRegressor(n_estimators=estimators,random_state=0)
        model.fit(X_train,y_train)
        model_predictions = model.predict(X_valid)
        mae = mean_absolute_error(y_valid,model_predictions)
        print("MAE at {} estimators:".format(estimators),mae)
        
n = [1000,1500,2000]
getMAEs(n, X_train, X_valid, y_train, y_valid)
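
Since the target is 0 for almost every row, it helps to see what mean absolute error measures here. The sketch below computes it by hand on a made-up four-row example; `mean_absolute_error_manual` is a hypothetical stand-in written for illustration, not the scikit-learn function used above.

```python
def mean_absolute_error_manual(y_true, y_pred):
    """Average of |actual - predicted| over all rows."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    total = sum(abs(t - p) for t, p in zip(y_true, y_pred))
    return total / len(y_true)


# Made-up example: three non-MVP rows and one MVP row
y_true = [0, 0, 0, 1]
model_like = [0.0, 0.0, 0.0, 0.5]   # gives the MVP a score of 0.5
all_zeros = [0.0, 0.0, 0.0, 0.0]    # never predicts an MVP

print(mean_absolute_error_manual(y_true, model_like))  # 0.125
print(mean_absolute_error_manual(y_true, all_zeros))   # 0.25
```

With only one MVP per season among hundreds of players, predicting 0 for everyone also yields a small MAE, so the MAE values are best read alongside the ranked predictions rather than on their own.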

In [None]:
# Create Model. Choose the optimum number of n_estimators.
model = RandomForestRegressor(n_estimators=1000,random_state=0)

In [None]:
# FIT: Train the model
model.fit(X_train,y_train)

In [None]:
# PREDICT
model_predictions = model.predict(X_valid)

In [None]:
# EVALUATE:
validation = X_valid.copy()
validation['Actual MVP'] = y_valid
validation['Prediction'] = model_predictions

In [None]:
validation[validation['Actual MVP'] == 1]
#validation[validation['Prediction'] >= 0.2]
validation.nlargest(5,'Prediction')

In [None]:
# TEST PREDICTION
# Create DataFrames for building test data, using the csv file generated by scrapeNBAStats:
test_df = pd.read_csv(os.path.join(f'{type}_stats',f'nbaPlayers_stats_{type}_2019.csv'))

# Drop NaN values from test_df
test_df.dropna(axis=0,how='any',inplace=True)

# Add 'Season-Player' column
test_df['Season-Player'] = test_df['Season'] + '_' + test_df['Player'].str.replace(' ','_')

# Set index to 'Season-Player' column
test_df = test_df.set_index('Season-Player')

X_test = test_df[features]

In [None]:
# DEFINE MODEL, FIT MODEL, PREDICT
model_final = RandomForestRegressor(n_estimators=1000,random_state=0)
model_final.fit(X,y)
model_final_preds = model_final.predict(X_test)

In [None]:
# VALIDATION OF TEST DATA
results = X_test.copy()
results['Prediction'] = model_final_preds

In [None]:
# Display results
results.nlargest(5,'Prediction')
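
The README lists a top 10 of candidates, while the cell above displays the top 5. The ranking itself is just "largest predictions first", which the plain-Python sketch below reproduces with `heapq.nlargest`. `top_candidates` is a hypothetical helper, and the scores are invented for illustration; they are not model output.

```python
import heapq


def top_candidates(predictions, n=10):
    """Return the n (player_key, score) pairs with the highest scores.

    predictions: dict mapping a 'Season-Player' key to a predicted score.
    """
    return heapq.nlargest(n, predictions.items(), key=lambda item: item[1])


# Invented scores, for illustration only
example_scores = {"player_a": 0.10, "player_b": 0.70, "player_c": 0.30}
print(top_candidates(example_scores, n=2))
```

In the notebook, the same effect can be had by changing the first argument of `results.nlargest` from 5 to 10.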