# NBA MVP Prediction Model
Build a machine learning model that predicts the likely MVP candidates for a given (preferably the current) NBA season.

Before running `datasetFactory.py`, read and test the code in this notebook. It is a convenient place to get familiar with the code behind each of these steps.

This notebook contains these 3 major parts:

1. Webscraping for Data

    Methods from `nbaPlayers_StatsScraper.py` module:
    - `scrapeNBAStats(year)`
    - `scrapeMVPs()`
    

2. Preparation of Datasets

    Running `datasetFactory.py` with the following output:
    - `training_data.csv`


3. Building the Machine Learning Model

    Main Steps:
    
    a. PREPARE data - creating DataFrames and data cleansing
    
    b. DEFINE model - choose model and instantiate
    
    c. FIT model - train the model
    
    d. PREDICT
    
    e. EVALUATE - using mean absolute error
    

# 1. Webscraping for Data
## `nbaPlayers_StatsScraper.py`

In [None]:
# Import libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime

## METHOD 1: `scrapeNBAStats(year)`
Web Scraping Source: e.g. for year 2019 -- https://www.basketball-reference.com/leagues/NBA_2019_per_game.html

**OUTPUT**: `nbaPlayers_statsPerGame_YYYY.csv`

In [None]:
def scrapeNBAStats(year=datetime.now().year):
    '''
    Scrapes the per-game statistics of all NBA players for a given season
    
    scrapeNBAStats(year=datetime.now().year)
    year: int object; defaults to current year
    
    OUTPUT: 'nbaPlayers_statsPerGame_yyyy.csv'
    'yyyy' is the year of the season
    '''
    # URL to be requested
    url = "https://www.basketball-reference.com/leagues/NBA_{}_per_game.html".format(year)
    
    # Create requests object: res
    print("Now requesting...", url)
    res = requests.get(url)
    res.raise_for_status() # Raise an HTTPError if the request failed

    # Create BeautifulSoup Object: `soup`
    soup = BeautifulSoup(res.text, features='lxml')
    print("Created BeautifulSoup object: soup")
    
    # Parse the column headers and store them in a list
    headers = soup.thead.getText().split('\n')[3:-2] # Slicers are intended to exclude unnecessary headers
    
    # Parse the rows(player stats) and store them in a list
    rows = soup.findAll('tr')[1:]
    
    # Create the rows for each player and their stats as a list of lists
    player_stats = []
    for tr in rows:
        try:
            # Parse the text within each cell and exclude the value under the
            # 'Rk' column, since it was also excluded from our headers
            row = [td.getText() for td in tr][1:]
            player_stats.append(row)
        except AttributeError:
            # About every 20th iteration of this loop raises this error;
            # skip that row and continue with the next one
            continue

    print("Scraping and Parsing Complete!")         
    
    # Create a pandas DataFrame
    stats = pd.DataFrame(player_stats, columns=headers)
    
    season_prefix = str(year-1)
    season_suffix = str(year)[2:]
    
    season = "{}-{}".format(season_prefix,season_suffix)
    
    stats['Season'] = season
    
    # Create a csv file from this DataFrame
    filename = 'nbaPlayers_statsPerGame_{}.csv'.format(year)
    stats.to_csv(filename,header=True, index=False)
    print("Generated csv file at {}: {}".format(datetime.now(),filename))

In [None]:
# Test run
scrapeNBAStats(2019)

In [None]:
# Check csv files by viewing them using pandas DataFrames
season_df = pd.read_csv('nbaPlayers_statsPerGame_2019.csv')

season_df.head()
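
The `Season` column written by `scrapeNBAStats(year)` labels a season by both of its calendar years, so the 2019 page is stored as `2018-19`. A minimal sketch of that formatting as a standalone helper follows; `season_label` is a hypothetical name used only for illustration and is not part of `nbaPlayers_StatsScraper.py`.

```python
def season_label(year):
    """Return the season label for the calendar year in which the season ends."""
    season_prefix = str(year - 1)  # Calendar year in which the season starts
    season_suffix = str(year)[2:]  # Last two digits of the year in which it ends
    return "{}-{}".format(season_prefix, season_suffix)


print(season_label(2019))  # 2018-19
print(season_label(2000))  # 1999-00
```

The label is seven characters long, the same width that `scrapeMVPs()` slices off the front of each MVP row, which is what would allow the two files to be matched by season later on.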

## METHOD 2: `scrapeMVPs()`
Scrapes the NBA MVPs from 2000 to 2018
    
**OUTPUT:** `nbaMVPs.csv`

In [None]:
def scrapeMVPs():
    '''
    Scrapes the NBA MVPs from 2000 to 2018
    OUTPUT: 'nbaMVPs.csv'
    '''
    # URL to be requested
    url = "https://www.basketball-reference.com/awards/mvp.html"

    # Create requests object: res
    print("Now requesting...", url)
    res = requests.get(url)
    res.raise_for_status() # Raise an HTTPError if the request failed

    # Create BeautifulSoup Object: `soup`
    soup = BeautifulSoup(res.text, features='lxml')
    print("Created BeautifulSoup object: soup")
    
    # Parse the table
    html_table = soup.findAll('tr')[1:] # Sliced first header

    # Parse the column headers and store them in a list
    headers = html_table[0].getText().split('\n')[1:4] # Slicers are intended to exclude unnecessary headers

    # Parse the text of the next 20 rows and store them in a list
    raw_rows = [tr.getText() for tr in html_table[1:21]]
    players = []
    for row in raw_rows:
        season = row[:7]                   # First 7 characters, e.g. '2017-18'
        league = row[7:10]                 # Next 3 characters, e.g. 'NBA'
        player = row[10:].split('(V)')[0]  # Keep only the text before the '(V)' marker
        players.append([season,league,player])
    
    print("Scraping and Parsing Complete!")         
    
    # Create a pandas DataFrame
    mvp = pd.DataFrame(players, columns=headers)

    # Create a csv file from this DataFrame
    filename = 'nbaMVPs.csv'
    mvp.to_csv(filename,header=True, index=False)
    print("Generated csv file at {}: {}".format(datetime.now(),filename))

In [None]:
# Test run:
scrapeMVPs()

In [None]:
# Check csv files by viewing them using pandas DataFrames
mvp_df = pd.read_csv('nbaMVPs.csv')

mvp_df
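
`scrapeMVPs()` splits each row by character position because the text of all cells in a row comes back as one concatenated string. The sketch below isolates that slicing so it can be tried without a network request. `parse_mvp_row` is a hypothetical helper name, and `sample_row` is made-up text shaped like a scraped row, not actual output from the page.

```python
def parse_mvp_row(row):
    """Split the concatenated text of one table row into [season, league, player]."""
    season = row[:7]                   # 'YYYY-YY' is always 7 characters
    league = row[7:10]                 # League abbreviation, 3 characters
    player = row[10:].split('(V)')[0]  # Keep only the text before the '(V)' marker
    return [season, league, player]


# Hypothetical row text for illustration only
sample_row = "2017-18NBAJames Harden(V)28.0HOU72"
print(parse_mvp_row(sample_row))  # ['2017-18', 'NBA', 'James Harden']
```

The slicing depends on the season and league fields having fixed widths, so a league code of a different length would shift every field after it.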

# 2. Preparation of Datasets

## `datasetFactory.py`

In [None]:
# No need to import these two functions in this notebook; they are already defined above.
#from nbaPlayers_StatsScraper import scrapeNBAStats
#from nbaPlayers_StatsScraper import scrapeMVPs

import pandas as pd

# Build a training dataset consisting of complete stats from year 2000 to 2018

##### GENERATE CSVs OF NBA STATS (COMMENT OUT IF THE FILES ALREADY EXIST) #####
for year in range(2000, 2019):
    scrapeNBAStats(year)
##### BLOCK ENDS HERE ###################

# Read the per-season stats from 2000 to 2018 and
# concatenate them into a single DataFrame, "training_df"
season_dfs = [pd.read_csv('nbaPlayers_statsPerGame_{}.csv'.format(year))
              for year in range(2000, 2019)]
training_df = pd.concat(season_dfs)

# This shall be the training data.
# Generate a csv file out of this compilation of stats.
training_df.to_csv('training_data.csv', header=True,index=False)

In [None]:
# Check csv files by viewing them using pandas DataFrames
train_df = pd.read_csv('training_data.csv')

train_df.head()

# 3. Building the Machine Learning Model

In [None]:
# Import sklearn modules

# PREPARE

# DEFINE

# FIT

# PREDICT

# EVALUATE

# FINAL PREDICTION
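
The steps above are still placeholders, so here is a minimal sketch of how PREPARE through EVALUATE could fit together. Everything in it is an assumption. The two small DataFrames are made-up stand-ins for `training_data.csv` and `nbaMVPs.csv`, and the column names `Player`, `Season`, `PTS`, `AST` and `TRB` are assumed to match the scraped headers. The `is_mvp` label is a hypothetical target, and `DecisionTreeRegressor` is one possible model, since the notebook does not name one. Only the use of mean absolute error comes from the outline.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# PREPARE: made-up stand-ins for training_data.csv and nbaMVPs.csv
stats = pd.DataFrame({
    'Player': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
    'Season': ['2016-17'] * 4 + ['2017-18'] * 4,
    'PTS': [31.6, 25.3, 12.1, 8.4, 30.4, 27.5, 10.2, 6.9],
    'AST': [10.4, 6.6, 2.0, 1.1, 8.8, 9.1, 3.3, 0.9],
    'TRB': [10.7, 5.1, 4.4, 2.2, 5.4, 8.6, 3.0, 1.8],
})
mvps = pd.DataFrame({'Season': ['2016-17', '2017-18'], 'Player': ['A', 'E']})

# Flag MVP rows: a left merge with indicator=True marks matched rows as 'both'
labeled = stats.merge(mvps, on=['Season', 'Player'], how='left', indicator=True)
labeled['is_mvp'] = (labeled['_merge'] == 'both').astype(int)

features = ['PTS', 'AST', 'TRB']
X = labeled[features]
y = labeled['is_mvp']
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0)

# DEFINE: choose a model and instantiate it (assumed choice)
model = DecisionTreeRegressor(random_state=0)

# FIT: train the model on the training split
model.fit(X_train, y_train)

# PREDICT: score the held-out rows
predictions = model.predict(X_val)

# EVALUATE: mean absolute error between true labels and predictions
mae = mean_absolute_error(y_val, predictions)
print("Mean absolute error:", mae)
```

With only one MVP per season the label is heavily imbalanced, so a low mean absolute error on real data would not by itself show that the model ranks candidates well.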