## Project Purpose:
### In this personal project, I will create a machine learning model that predicts whether the Chicago Bulls will win a given game based on that game's statistics (box score). To do this, I need a dataset containing the team's stat totals for all 72 regular-season games of the 2020-2021 season.
### After some searching, I was unable to find such a dataset. However, I did locate a website that links to all 72 box scores. I will scrape the desired data from these box score pages, then clean and reshape it to fit the requirements of this project. Finally, I'll use the cleaned dataset to train and test my Random Forest model.

In [1]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
pd.set_option('display.max_columns',None)
pd.set_option('display.max_rows', None)

In [2]:
url = "https://www.basketball-reference.com/teams/CHI/2021_games.html"

In [3]:
r = requests.get(url)

In [4]:
r.status_code #check that the request succeeded (200 means the page was returned)

200
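
A 200 status only confirms that the request succeeded. Whether automated access is welcome is usually stated in a site's `robots.txt` and terms of use. As a minimal sketch, the standard library's `urllib.robotparser` can evaluate such rules. The rules and `example.com` URLs below are made up for illustration and are not the actual rules of any site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
example_rules = """
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(example_rules)  # parse rules from memory instead of fetching them

allowed_page = parser.can_fetch("*", "https://example.com/teams/CHI/2021_games.html")
blocked_page = parser.can_fetch("*", "https://example.com/private/report.html")
print(allowed_page, blocked_page)
```

Against a live site, `set_url(...)` followed by `read()` would load the real file instead of the in-memory rules.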

In [5]:
data = r.text

In [6]:
soup = BeautifulSoup(data,'html.parser')

In [7]:
#pull all links for HTML page
my_list = []
for link in soup.find_all('a'): #Find all link tags
    my_list.append(link.get('href')) #append links

In [9]:
#Collect all the box score links
box_scores = []
for i in my_list:
    if i and "boxscores/2" in i: #skip <a> tags with no href (link.get returns None for those)
        box_scores.append(i)

In [10]:
box_scores #shows path to box scores

['/boxscores/202105160CHI.html',
 '/boxscores/202012230CHI.html',
 '/boxscores/202012260CHI.html',
 '/boxscores/202012270CHI.html',
 '/boxscores/202012290WAS.html',
 '/boxscores/202012310WAS.html',
 '/boxscores/202101010MIL.html',
 '/boxscores/202101030CHI.html',
 '/boxscores/202101050POR.html',
 '/boxscores/202101060SAC.html',
 '/boxscores/202101080LAL.html',
 '/boxscores/202101100LAC.html',
 '/boxscores/202101150OKC.html',
 '/boxscores/202101170DAL.html',
 '/boxscores/202101180CHI.html',
 '/boxscores/202101220CHO.html',
 '/boxscores/202101230CHI.html',
 '/boxscores/202101250CHI.html',
 '/boxscores/202101300CHI.html',
 '/boxscores/202102010CHI.html',
 '/boxscores/202102030CHI.html',
 '/boxscores/202102050ORL.html',
 '/boxscores/202102060ORL.html',
 '/boxscores/202102080CHI.html',
 '/boxscores/202102100CHI.html',
 '/boxscores/202102120CHI.html',
 '/boxscores/202102150IND.html',
 '/boxscores/202102170CHI.html',
 '/boxscores/202102190PHI.html',
 '/boxscores/202102200CHI.html',
 '/boxscor
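
The list above starts with a box score for the last game of the season, which the cleaning step later treats as a duplicate. One alternative, sketched here under the assumption that the same path can appear more than once among the page's links, is to filter and de-duplicate in a single pass. The helper name `select_box_score_links` and the sample list are hypothetical.

```python
def select_box_score_links(hrefs):
    """Keep hrefs that look like box score paths, dropping None and repeats."""
    seen = set()
    selected = []
    for href in hrefs:
        # link.get('href') returns None for <a> tags without an href attribute
        if href is None or "boxscores/2" not in href:
            continue
        if href not in seen:  # keep only the first occurrence, preserving order
            seen.add(href)
            selected.append(href)
    return selected

sample_hrefs = [
    "/boxscores/202105160CHI.html",
    None,
    "/teams/CHI/2021.html",
    "/boxscores/202012230CHI.html",
    "/boxscores/202105160CHI.html",  # repeated link
]
print(select_box_score_links(sample_hrefs))
```

The notebook itself keeps the repeated link and drops the duplicated row during cleaning, so this is only an alternative, not what was run.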

In [11]:
first_half_path = "https://www.basketball-reference.com"

In [12]:
#Create entire link to each boxscore
box_score_urls = [] 
for i in box_scores:
    i = first_half_path+i
    box_score_urls.append(i)

In [13]:
box_score_urls

['https://www.basketball-reference.com/boxscores/202105160CHI.html',
 'https://www.basketball-reference.com/boxscores/202012230CHI.html',
 'https://www.basketball-reference.com/boxscores/202012260CHI.html',
 'https://www.basketball-reference.com/boxscores/202012270CHI.html',
 'https://www.basketball-reference.com/boxscores/202012290WAS.html',
 'https://www.basketball-reference.com/boxscores/202012310WAS.html',
 'https://www.basketball-reference.com/boxscores/202101010MIL.html',
 'https://www.basketball-reference.com/boxscores/202101030CHI.html',
 'https://www.basketball-reference.com/boxscores/202101050POR.html',
 'https://www.basketball-reference.com/boxscores/202101060SAC.html',
 'https://www.basketball-reference.com/boxscores/202101080LAL.html',
 'https://www.basketball-reference.com/boxscores/202101100LAC.html',
 'https://www.basketball-reference.com/boxscores/202101150OKC.html',
 'https://www.basketball-reference.com/boxscores/202101170DAL.html',
 'https://www.basketball-reference
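
String concatenation works here because every path starts with `/`. A slightly more defensive sketch uses `urllib.parse.urljoin` from the standard library, which resolves a relative path against a base URL. The two sample paths are taken from the list above.

```python
from urllib.parse import urljoin

base_url = "https://www.basketball-reference.com"
relative_paths = ["/boxscores/202012230CHI.html", "/boxscores/202012260CHI.html"]

# urljoin resolves each root-relative path against the site's base URL
full_urls = [urljoin(base_url, path) for path in relative_paths]
print(full_urls)
```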

In [14]:
cols = ["Players", "MP", "FG","FGA","FG%","3P","3PA","3P%","FT","FTA","FT%","ORB","DRB",
        "TRB","AST","STL","BLK","TOV","PF","PTS","+/-","Chicago Score", "Opponent Score"] #3P (makes) comes before 3PA (attempts), matching the order of the scraped cells

box_score_frames_list = [] #fill up list with box score dataframes

for link in box_score_urls:
    req = requests.get(link)
    box_score_data = req.text
    box_score_soup = BeautifulSoup(box_score_data,'html.parser')
    chi_box_score_table = box_score_soup.find_all('table', id="box-CHI-game-basic") #Chicago's basic box score table
    table_rows = chi_box_score_table[0].find_all('tr')
    table_df = []

    #The final scores are the same for every row, so look them up once per page.
    #They follow the page's own order, not Chicago/opponent, so some games come out
    #swapped; the cleaning step further down corrects this.
    scores = box_score_soup.find_all('div',{'class':'score'})
    chi_score = scores[0]
    opp_score = scores[1]

    for line in table_rows:
        td = line.find_all('td')
        #The player's name sits inside an <a> tag rather than a <td>, so rebuild it as a <td>
        a = line.find_all('a') #Find all a tags in this row
        a = str(a)
        a = a.replace(a[0:37],"<td class=\"right\" data-stat=\"player\">") #Swap the opening of the a tag for an opening td tag
        a = a.replace("</a>","</td>") #Adjust the end tag to be a td end tag
        a = a.replace("]","") #Drop the closing bracket left over from str(list)
        if len(td) > 0:
            a = BeautifulSoup(a,'html.parser')
            td.insert(0,a) #Player name becomes the first cell
            td.insert(21,chi_score) #Both scores go in after the +/- cell; the second insert
            td.insert(21,opp_score) #lands in front of the first, so opp_score fills "Chicago Score"

        row = [cell.text for cell in td]
        if len(row) > 0: #Avoids any rows that are empty
            table_df.append(row)
    temp_df = pd.DataFrame(table_df, columns=cols)
    box_score_frames_list.append(temp_df)
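
The loop above requests every box score page back to back. Many sites throttle or block rapid automated requests, so a pause between requests is a common courtesy. The sketch below is one way to add it. The helper `fetch_pages`, the stand-in `fake_fetch`, and the 3-second default are all assumptions for illustration, not values taken from the site.

```python
import time

def fetch_pages(urls, fetch, delay_seconds=3.0):
    """Call fetch(url) for each URL, sleeping between consecutive requests."""
    pages = []
    for position, url in enumerate(urls):
        if position > 0:
            time.sleep(delay_seconds)  # no pause needed before the first request
        pages.append(fetch(url))
    return pages

# Stand-in for a real downloader so the sketch runs without network access.
def fake_fetch(url):
    return "<html>" + url + "</html>"

pages = fetch_pages(["page-a", "page-b"], fake_fetch, delay_seconds=0)
print(pages)
```

In this notebook, `fetch` could be something like `lambda u: requests.get(u).text`, with `box_score_urls` as the list of URLs.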

In [15]:
box_scores_df = pd.concat(box_score_frames_list)
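
By default `pd.concat` keeps each frame's own row labels, so the combined frame contains repeated index values (one `0`, one `1`, and so on per game). A small sketch with two made-up frames shows the difference that `ignore_index=True` makes.

```python
import pandas as pd

game_one = pd.DataFrame({"Players": ["A", "B"], "PTS": [10, 12]})
game_two = pd.DataFrame({"Players": ["A", "C"], "PTS": [8, 20]})

# Default: each frame keeps its own 0..n-1 index, so labels repeat.
stacked = pd.concat([game_one, game_two])

# ignore_index=True renumbers the rows 0..n-1 across the combined frame.
renumbered = pd.concat([game_one, game_two], ignore_index=True)

print(list(stacked.index), list(renumbered.index))
```

Repeated labels are harmless for the steps in this notebook, but a unique index makes label-based selection with `.loc` less surprising.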

In [23]:
box_scores_df.head(20) #This is a complete dataframe of game-by-game box scores of the Chicago Bulls 2020-2021 season
                        #still needs some cleaning though.

Unnamed: 0,Players,MP,FG,FGA,FG%,3P,3PA,3P%,FT,FTA,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,+/-,Chicago Score,Opponent Score
0,Coby White,31:16,6,13,0.462,4.0,9.0,0.444,3.0,4.0,0.75,0.0,5.0,5.0,5.0,1.0,1.0,4.0,1.0,19.0,-9.0,118.0,112.0
1,Patrick Williams,29:37,4,8,0.5,0.0,1.0,0.0,3.0,3.0,1.0,3.0,4.0,7.0,1.0,1.0,2.0,3.0,2.0,11.0,-2.0,118.0,112.0
2,Garrett Temple,28:38,3,5,0.6,1.0,2.0,0.5,0.0,0.0,,1.0,2.0,3.0,3.0,0.0,0.0,2.0,1.0,7.0,-7.0,118.0,112.0
3,Thaddeus Young,25:33,8,15,0.533,0.0,1.0,0.0,4.0,6.0,0.667,1.0,6.0,7.0,3.0,0.0,3.0,4.0,2.0,20.0,5.0,118.0,112.0
4,Lauri Markkanen,24:47,6,11,0.545,2.0,4.0,0.5,3.0,4.0,0.75,0.0,5.0,5.0,1.0,0.0,0.0,2.0,1.0,17.0,-11.0,118.0,112.0
5,Javonte Green,23:14,3,5,0.6,0.0,1.0,0.0,0.0,0.0,,0.0,2.0,2.0,1.0,4.0,1.0,0.0,3.0,6.0,12.0,118.0,112.0
6,Devon Dotson,21:25,5,10,0.5,1.0,5.0,0.2,0.0,0.0,,0.0,2.0,2.0,4.0,1.0,0.0,0.0,2.0,11.0,13.0,118.0,112.0
7,Denzel Valentine,20:49,3,9,0.333,2.0,4.0,0.5,0.0,0.0,,0.0,2.0,2.0,3.0,0.0,1.0,0.0,3.0,8.0,9.0,118.0,112.0
8,Cristiano Felício,17:57,2,5,0.4,0.0,1.0,0.0,1.0,2.0,0.5,3.0,5.0,8.0,0.0,1.0,0.0,0.0,2.0,5.0,5.0,118.0,112.0
9,Ryan Arcidiacono,16:44,5,6,0.833,4.0,4.0,1.0,0.0,0.0,,0.0,4.0,4.0,2.0,0.0,0.0,1.0,3.0,14.0,15.0,118.0,112.0


In [17]:
box_score_totals_df = box_scores_df.loc[box_scores_df['Players']==''].copy() #Select the team totals rows (blank player name); .copy() lets us add columns without a SettingWithCopyWarning

In [18]:
#Some scores are pulled into the wrong column. The next few lines fix that.
#PTS is the total points the Bulls scored, so wherever PTS equals "Opponent Score",
#the scraper has swapped the Bulls' score and the opponent's score for that game.
box_score_totals_df["Opponent Score Fix"] = np.where(box_score_totals_df['PTS'] == box_score_totals_df["Opponent Score"], 
             box_score_totals_df["Chicago Score"],box_score_totals_df["Opponent Score"]) #Where swapped, the real opponent score is in "Chicago Score"


In [19]:
box_score_totals_df=box_score_totals_df.iloc[1:] #The first row duplicates the final game of the season (vs. MIL), so drop it

In [20]:
box_score_totals_df = box_score_totals_df.drop(columns=["Players","Chicago Score","Opponent Score","+/-"])

In [21]:
box_score_totals_df = box_score_totals_df.rename(columns={"Opponent Score Fix":"Opponent Score","PTS":"Chicago Score"})

In [22]:
box_score_totals_df.head()

Unnamed: 0,MP,FG,FGA,FG%,3P,3PA,3P%,FT,FTA,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,Chicago Score,Opponent Score
15,240,36,87,0.414,8,35,0.229,24,28,0.857,8,29,37,20,10,3,15,22,104,124
15,240,34,91,0.374,16,39,0.41,22,25,0.88,13,28,41,23,7,7,20,16,106,125
15,240,44,86,0.512,17,41,0.415,23,31,0.742,7,47,54,27,4,6,24,21,128,129
15,240,38,87,0.437,15,36,0.417,24,32,0.75,5,32,37,26,7,1,14,23,115,107
13,240,50,92,0.543,14,36,0.389,19,25,0.76,9,35,44,34,9,3,15,27,133,130
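
The score fix used earlier relies on `np.where(condition, value_if_true, value_if_false)`. The sketch below replays it on two made-up totals rows, where the second row has its score columns swapped. The values are illustrative strings, as they would be straight after scraping.

```python
import numpy as np
import pandas as pd

# Made-up totals rows: in the second game the two score columns are swapped.
totals = pd.DataFrame({
    "PTS": ["104", "115"],
    "Chicago Score": ["104", "107"],
    "Opponent Score": ["124", "115"],
})

# A row is swapped when the Bulls' own point total shows up as the opponent's score.
swapped = totals["PTS"] == totals["Opponent Score"]
totals["Opponent Score Fix"] = np.where(
    swapped, totals["Chicago Score"], totals["Opponent Score"]
)
print(totals["Opponent Score Fix"].tolist())
```

Because PTS always holds the Bulls' total, it can serve as the Chicago score afterwards, which is what the rename step does.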


In [23]:
dtype_change = box_score_totals_df[['Chicago Score','Opponent Score']].astype(int) #Change scoring columns to int types

In [24]:
box_score_totals_df["Formatted Chicago Score"] = dtype_change['Chicago Score']

In [25]:
box_score_totals_df["Formatted Opponent Score"] = dtype_change['Opponent Score']

In [26]:
#Label each game: 1 if the Bulls won, 0 if they lost.
box_score_totals_df["W/L"] = np.where(box_score_totals_df["Formatted Chicago Score"] > box_score_totals_df["Formatted Opponent Score"],1,0)

In [27]:
box_score_totals_df.head()

Unnamed: 0,MP,FG,FGA,FG%,3P,3PA,3P%,FT,FTA,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,Chicago Score,Opponent Score,Formatted Chicago Score,Formatted Opponent Score,W/L
15,240,36,87,0.414,8,35,0.229,24,28,0.857,8,29,37,20,10,3,15,22,104,124,104,124,0
15,240,34,91,0.374,16,39,0.41,22,25,0.88,13,28,41,23,7,7,20,16,106,125,106,125,0
15,240,44,86,0.512,17,41,0.415,23,31,0.742,7,47,54,27,4,6,24,21,128,129,128,129,0
15,240,38,87,0.437,15,36,0.417,24,32,0.75,5,32,37,26,7,1,14,23,115,107,115,107,1
13,240,50,92,0.543,14,36,0.389,19,25,0.76,9,35,44,34,9,3,15,27,133,130,133,130,1
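
The scores must be integers before they are compared: as strings, `"99" > "104"` is true because the comparison is character by character. The sketch below shows a more compact version of the same conversion and labelling on made-up rows. It casts the columns in place rather than going through temporary "Formatted" columns, which is an alternative and not the approach used in the cells above.

```python
import pandas as pd

scores = pd.DataFrame({
    "Chicago Score": ["104", "115", "133"],
    "Opponent Score": ["124", "107", "130"],
})

# Convert both columns in place instead of building temporary "Formatted" columns.
score_cols = ["Chicago Score", "Opponent Score"]
scores[score_cols] = scores[score_cols].astype(int)

# Boolean comparison cast to int: 1 for a win, 0 for a loss.
scores["W/L"] = (scores["Chicago Score"] > scores["Opponent Score"]).astype(int)
print(scores["W/L"].tolist())
```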


In [28]:
box_score_totals_df = box_score_totals_df.drop(columns=["Chicago Score","Opponent Score"])

In [29]:
box_score_totals_df = box_score_totals_df.rename(columns={"Formatted Chicago Score": "Chicago Score","Formatted Opponent Score":"Opponent Score"})

In [30]:
box_score_totals_df.head()

Unnamed: 0,MP,FG,FGA,FG%,3P,3PA,3P%,FT,FTA,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,Chicago Score,Opponent Score,W/L
15,240,36,87,0.414,8,35,0.229,24,28,0.857,8,29,37,20,10,3,15,22,104,124,0
15,240,34,91,0.374,16,39,0.41,22,25,0.88,13,28,41,23,7,7,20,16,106,125,0
15,240,44,86,0.512,17,41,0.415,23,31,0.742,7,47,54,27,4,6,24,21,128,129,0
15,240,38,87,0.437,15,36,0.417,24,32,0.75,5,32,37,26,7,1,14,23,115,107,1
13,240,50,92,0.543,14,36,0.389,19,25,0.76,9,35,44,34,9,3,15,27,133,130,1


In [31]:
box_score_totals_df["W/L"].value_counts() #Sanity check: the Bulls finished 31-41, so expect 41 losses (0) and 31 wins (1)

0    41
1    31
Name: W/L, dtype: int64
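
These counts also give a reference point for judging a classifier later: a model that always predicts the more common outcome would already be right on a majority of games. The arithmetic, using the counts above:

```python
losses, wins = 41, 31
total_games = losses + wins

# A model that always predicts the more common class (a loss) is right this often.
majority_baseline = max(losses, wins) / total_games
print(total_games, round(majority_baseline, 3))
```

An accuracy score is only informative to the extent that it exceeds this baseline.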

In [33]:
box_score_totals_df.to_csv('chicago_bulls_2020_2021.csv',index=False) #file name matches the 2020-2021 season scraped above

### Using this refined dataset, we will use machine learning techniques to predict whether the Bulls will win or lose based on a game's statistics (box score).