# Predicting the Outcomes of NCAA Basketball Games

### Introduction  
The objective of our project is to build a model that can predict the outcome of any matchup between NCAA Division I men's basketball teams. We hypothesize that the number of points a team scores in a game depends on the quality of its opponent's defense: the worse the opponent's defense, the more points the team will score. Pretty straightforward.

### Imports

In [4]:
import pandas as pd
import numpy as np

### **Scraping Data**

For this project, all of our data will come from TeamRankings.com, a website that supplies statistics for most major American sports.

### What Data Do We Need?

We know which website we are going to scrape, but now we need to make sure it actually has the information we need. Our model has a dependent variable (points scored) and independent variables (metrics that measure the quality of a team's defense).

Without expanding too much into their definitions, the metrics we will use for measuring quality of defense are:

1. Defensive Efficiency
2. Opponent Floor Percentage
3. Opponent Effective Field Goal Percentage
4. Opponent True Shooting Percentage

These four metrics are our independent variables. 
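
Written out, the relationship we have in mind is a simple linear one. The notation below is our own shorthand (it does not come from TeamRankings): $A$ is the team whose points we are predicting, $B$ is its opponent, and the $\beta$ values are coefficients fit separately for each team $A$.

$$
\text{Points}_{A \text{ vs } B} = \beta_0^{A} + \beta_1^{A}\,\text{DefEff}_{B} + \beta_2^{A}\,\text{OppFP}_{B} + \beta_3^{A}\,\text{OppEFG}_{B} + \beta_4^{A}\,\text{OppTS}_{B}
$$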

So, when scraping the website, we need to get the four metrics listed above, along with the game outcomes for every game played for every single team. 

### Where Can We Get It?

This is where our web scraping comes in. Looking at TeamRankings.com, we should be able to pull this information together rather quickly.

First, we need to get the score of each team's games. Browsing the website, we notice that each team has its own webpage with this information nicely displayed. Below is a link to an example page for the Oakland University Golden Grizzlies (OU):

> [Oakland Golden Grizzlies](https://www.teamrankings.com/ncaa-basketball/team/Oakland-Golden-Grizzlies/game-log)

This is great! The table gives us the opponent and the score for each game - just what we need. Now, we need to analyze the URL to see if it is set up in a way that we can easily scrape from. Let's look at two teams - OU and UVA - and see how each of their URLs are set up:

> https://www.teamrankings.com/ncaa-basketball/team/Oakland-Golden-Grizzlies/game-log

> https://www.teamrankings.com/ncaa-basketball/team/Virginia-Cavaliers/game-log

Some good news! It appears from our quick research that the URL for each team's schedule follows the same pattern and differs only in the school's name and mascot name, typed out with a dash between each word. This will allow us to write some code to quickly get to each team's page. We will return to that in a minute.
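
As a minimal sketch of that URL pattern, the snippet below builds a game-log address from a team's display name. The helper names `to_slug` and `game_log_url` are hypothetical (they are not used elsewhere in this notebook), and the character clean-up simply mirrors the replacements we apply to the team list.

```python
import re

BASE = "https://www.teamrankings.com/ncaa-basketball/team/{}/game-log"

def to_slug(display_name):
    """Turn a display name into the dash-separated form seen in the URLs."""
    slug = display_name.replace(" ", "-")
    slug = re.sub(r"[.'&()]", "", slug)   # same characters this notebook strips
    return slug.replace("---", "-")       # same dash clean-up this notebook applies

def game_log_url(display_name):
    """Hypothetical helper: full game-log URL for one team."""
    return BASE.format(to_slug(display_name))

print(game_log_url("Oakland Golden Grizzlies"))
```

Whether every team on the site follows this pattern is an assumption based on the two URLs above, so any failures during scraping are worth inspecting.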

The next bit of information we need is the defensive metrics, our independent variables. TeamRankings supplies these on its website in table format, so we can scrape them directly. Given that we are only looking at four metrics, we won't bother writing a loop for this step later.

### Getting a List of all NCAA D1 Teams

In [5]:
# Get a data frame of all of the teams in D1
teams = pd.read_html('https://www.teamrankings.com/ncb/teams/')[0]

# Make numerous adjustments to the strings in the 'Team' column
# These adjustments will allow us to use the team names in web addresses later
teams['Team'] = teams['Team'].apply(lambda x: x.replace(' ','-'))
teams['Team'] = teams['Team'].apply(lambda x: x.replace(".",''))
teams['Team'] = teams['Team'].apply(lambda x: x.replace("'",''))
teams['Team'] = teams['Team'].apply(lambda x: x.replace("&",''))
teams['Team'] = teams['Team'].apply(lambda x: x.replace("(",''))
teams['Team'] = teams['Team'].apply(lambda x: x.replace(")",''))
teams['Team'] = teams['Team'].apply(lambda x: x.replace("---",'-'))

# Convert the 'Team' column to a list
team_urls = teams['Team'].tolist()

# Check a random team
print(team_urls[85])

#Check how many teams are in our list
len(team_urls)

Elon-University-Phoenix


357

### Scrape Every Team's Schedule and Save it to a Dictionary 

In [6]:
#Create a blank dictionary
team_dict = {}

# Loop through our list of teams and scrape the necessary data
# Also add a column with the team name
for name in team_urls:
    team_dict["{}".format(name)] = pd.read_html('https://www.teamrankings.com/ncaa-basketball/team/{}/game-log'.format(name))[0]
    team_dict.get("{}".format(name))['home'] = name
    
# Check a team
team_dict.get('Oakland-Golden-Grizzlies').head()

Unnamed: 0,Date,Opponent,Opp Rank,H/A/N,Score,home
0,11/25,at Xavier,55,Away,L 49-101,Oakland-Golden-Grizzlies
1,11/26,vs Toledo,83,Neutral,L 53-80,Oakland-Golden-Grizzlies
2,11/27,vs Bradley,137,Neutral,L 60-74,Oakland-Golden-Grizzlies
3,11/29,at Michigan,5,Away,L 71-81,Oakland-Golden-Grizzlies
4,12/01,at Purdue,26,Away,L 50-93,Oakland-Golden-Grizzlies
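
The loop above makes one request per team with no pause and stops at the first error. A more forgiving variant, sketched under our own assumptions, might look like the following. `scrape_all` is a hypothetical helper, and `fetch` is any function that takes a team name and returns its table (a stand-in is used here so the example runs without network access).

```python
import time

def scrape_all(names, fetch, delay=1.0):
    """Fetch one table per team, pausing between requests and recording failures."""
    tables, failed = {}, []
    for name in names:
        try:
            tables[name] = fetch(name)
        except Exception as exc:   # e.g. a network error or a page without a table
            failed.append((name, repr(exc)))
        time.sleep(delay)          # be polite to the website
    return tables, failed

# Stand-in for a real fetch function, used only for illustration
def fake_fetch(name):
    if name == "Bad-Team":
        raise ValueError("No tables found")
    return [name]

tables, failed = scrape_all(["Oakland-Golden-Grizzlies", "Bad-Team"], fake_fetch, delay=0)
print(list(tables), failed)
```

With real data, `fetch` could wrap the same `pd.read_html(...)[0]` call used in the cell above.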


### Create one Data Frame with all Team Scoring Data

In [7]:
# Combine each team's data into one data frame
# (DataFrame.append was removed in pandas 2.0, so we use pd.concat)
results = pd.concat([team_dict[name] for name in team_urls],sort=False)

# Add a 'points_scored' column (the first number in a score such as 'L 49-101'), stored as an integer
results['points_scored'] = results['Score'].apply(lambda x: int(x.split(' ')[1].split('-')[0]))

# Drop unnecessary columns
results.drop(['Date','Opp Rank','H/A/N','Score'],inplace=True,axis=1)

# Remove the leading "at " or "vs " from the 'Opponent' column
# (anchored to the start of the string so team names containing "at " are left intact)
results['Opponent'] = results['Opponent'].str.replace(r'^(?:at|vs)\s+','',regex=True)

# Check the resulting data frame
results.head()

Unnamed: 0,Opponent,home,points_scored
0,E Tenn St,Abilene-Christian-Wildcats,70
1,Austin Peay,Abilene-Christian-Wildcats,80
2,Neb Omaha,Abilene-Christian-Wildcats,70
3,Howard Payne,Abilene-Christian-Wildcats,81
4,Tarleton State,Abilene-Christian-Wildcats,69
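
The `Score` strings have the form `L 49-101`. As a small sketch, the hypothetical helper `parse_score` below splits such a string into its parts with a regular expression. Following the rest of this notebook, it assumes that the first number is the points scored by the team whose game log we are reading.

```python
import re

SCORE_RE = re.compile(r"^([WL])\s+(\d+)-(\d+)$")

def parse_score(score):
    """Split a string such as 'L 49-101' into (result, points_for, points_against)."""
    match = SCORE_RE.match(score.strip())
    if match is None:
        return None   # unexpected format; the caller decides how to handle it
    result, points_for, points_against = match.groups()
    return result, int(points_for), int(points_against)

print(parse_score("L 49-101"))
```

Returning `None` for anything that does not match makes unexpected formats easy to spot instead of failing partway through the scrape.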


### Convert the Team Names  
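We have a small problem. The team name formats differ depending on where we scrape them from: the game logs use short names such as "E Tenn St", while the URLs use long names such as "East-Tennessee-St-Buccaneers". To overcome this, we load a conversion table that maps each short name to its long name.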

In [8]:
# Load and examine the conversion CSV
conversion = pd.read_csv('team_name_conversion.csv')
conversion.head()

Unnamed: 0,team_short,team_long
0,Abl Christian,Abilene-Christian-Wildcats
1,Air Force,Air-Force-Falcons
2,Akron,Akron-Zips
3,Alab A&M,Alabama-AM-Bulldogs
4,Alabama,Alabama-Crimson-Tide


In [9]:
# Make a dictionary pairing the short names with the long names
convert_dict = pd.Series(conversion['team_long'].values,index=conversion['team_short']).to_dict()

# Create a function that returns the long name for a team (or None if there is no match)
def convert_name(bad_name):
    return convert_dict.get('{}'.format(bad_name))

# Apply the function
results['opp'] = results['Opponent'].apply(convert_name)

# Check the results
results.head()

Unnamed: 0,Opponent,home,points_scored,opp
0,E Tenn St,Abilene-Christian-Wildcats,70,East-Tennessee-St-Buccaneers
1,Austin Peay,Abilene-Christian-Wildcats,80,Austin-Peay-Governors
2,Neb Omaha,Abilene-Christian-Wildcats,70,Nebraska-Omaha-Mavericks
3,Howard Payne,Abilene-Christian-Wildcats,81,
4,Tarleton State,Abilene-Christian-Wildcats,69,Tarleton-State-Texans


In [10]:
# Some names will not convert because the opponent is not a D1 team (e.g. Howard Payne)
# We drop these games by dropping any rows with a missing value
results.dropna(axis=0,how='any',inplace=True)
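
Before dropping those rows, it can help to see which opponent names failed to convert, since a typo in the conversion file would silently remove real D1 games. The sketch below uses a few rows from the output above as toy data. The variable names `games`, `lookup` and `unmapped` are ours.

```python
import pandas as pd

# Toy data standing in for the scraped results and the conversion table
games = pd.DataFrame({"Opponent": ["E Tenn St", "Howard Payne", "Austin Peay"]})
lookup = {
    "E Tenn St": "East-Tennessee-St-Buccaneers",
    "Austin Peay": "Austin-Peay-Governors",
}

# Series.map leaves NaN wherever a name has no entry in the lookup
games["opp"] = games["Opponent"].map(lookup)

# Count how often each unmapped name occurs
unmapped = games.loc[games["opp"].isna(), "Opponent"].value_counts()
print(unmapped)
```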

In [11]:
# General Cleanup

# Keep only the columns we need (this also discards "Opponent") and rearrange them
results = results[['home','opp','points_scored']]

### Scrape the Defensive Metrics of Every Team  
Now that we have the scoring data, we need to go out and get the defensive data for each team.

In [12]:
# Get the required data
def_eff = pd.read_html('https://www.teamrankings.com/ncaa-basketball/stat/defensive-efficiency')[0]
opp_fp = pd.read_html('https://www.teamrankings.com/ncaa-basketball/stat/opponent-floor-percentage')[0]
opp_efg = pd.read_html('https://www.teamrankings.com/ncaa-basketball/stat/opponent-effective-field-goal-pct')[0]
opp_ts = pd.read_html('https://www.teamrankings.com/ncaa-basketball/stat/opponent-true-shooting-percentage')[0]

In [13]:
# Keep only the team name and the 2019 value in each data frame, then rename the value column
# (rename returns a new data frame, which avoids a SettingWithCopyWarning)

# def_eff
def_eff = def_eff[['Team','2019']].rename(columns={'2019': 'def_eff'})

# opp_fp
opp_fp = opp_fp[['Team','2019']].rename(columns={'2019': 'opp_fp'})

# opp_efg
opp_efg = opp_efg[['Team','2019']].rename(columns={'2019': 'opp_efg'})

# opp_ts
opp_ts = opp_ts[['Team','2019']].rename(columns={'2019': 'opp_ts'})

In [14]:
# Combine all metric dfs
metrics = def_eff.join(opp_fp.set_index('Team'),on='Team')
metrics = metrics.join(opp_efg.set_index('Team'),on='Team')
metrics = metrics.join(opp_ts.set_index('Team'),on='Team')

# Convert the team names
metrics['Team'] = metrics['Team'].apply(convert_name)

# Create a function to convert a percent string (e.g. '46.1%') to a float (0.461)
# Note: a missing value ('--') is returned as 0
def p2f(percent):
    if percent == '--':
        return 0
    else:
        return float(percent.strip('%'))/100

# Apply it to the columns
metrics['opp_fp'] = metrics['opp_fp'].apply(p2f)
metrics['opp_efg'] = metrics['opp_efg'].apply(p2f)
metrics['opp_ts'] = metrics['opp_ts'].apply(p2f)

#Check the df
metrics.head()

Unnamed: 0,Team,def_eff,opp_fp,opp_efg,opp_ts
0,Abilene-Christian-Wildcats,0.941,0.461,0.504,1.109
1,Memphis-Tigers,0.86,0.413,0.412,0.912
2,Jackson-State-Tigers,0.949,0.46,0.486,1.065
3,Loyola-Chicago-Ramblers,0.932,0.421,0.522,1.081
4,UAB-Blazers,0.965,0.442,0.493,1.043
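
As an aside, the percent conversion can also be done on a whole column at once. The sketch below is an alternative to `p2f`, not what this notebook uses: the hypothetical helper `percent_to_fraction` turns unparseable entries such as `'--'` into `NaN` rather than 0, so that they can later be removed with `dropna` instead of being treated as real values.

```python
import pandas as pd

def percent_to_fraction(column):
    """Convert strings like '46.1%' to 0.461; anything unparseable becomes NaN."""
    cleaned = column.astype(str).str.rstrip("%")
    return pd.to_numeric(cleaned, errors="coerce") / 100

# Sample values for illustration
sample = pd.Series(["46.1%", "50.4%", "--"])
converted = percent_to_fraction(sample)
print(converted)
```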


### Combine the Scoring Data with the Defensive Metrics Data

In [15]:
# Combine the results df with the metrics data frame
model_df = results.join(metrics.set_index('Team'),on='opp')
model_df2 = model_df.set_index('home')

# Do some cleanup
model_df2 = model_df2[(model_df2['def_eff'] != '--') & 
          (model_df2['opp_fp'] != '--') &
          (model_df2['opp_efg'] != '--') &
          (model_df2['opp_ts'] != '--')]

# Check the final result
model_df2.head()

home,opp,points_scored,def_eff,opp_fp,opp_efg,opp_ts
Abilene-Christian-Wildcats,East-Tennessee-St-Buccaneers,70,0.945,0.447,0.498,1.061
Abilene-Christian-Wildcats,Austin-Peay-Governors,80,1.014,0.479,0.531,1.119
Abilene-Christian-Wildcats,Nebraska-Omaha-Mavericks,70,1.061,0.489,0.526,1.112
Abilene-Christian-Wildcats,Texas-Tech-Red-Raiders,44,0.897,0.428,0.462,1.003
Abilene-Christian-Wildcats,Arkansas-Razorbacks,72,0.942,0.462,0.473,1.032


### Modeling Using scikit-learn

In [16]:
# Import linear_model
from sklearn import linear_model

In [17]:
# Create a linear model for each team and store the results in a dictionary

# Get unique teams
myset = set(model_df2.index.values.tolist())
mynewlist = list(myset)
team_urls = mynewlist

#Create a blank dictionary
model_dict = {}

#Loop through the list and create a model for each team
for team in team_urls:
    
    # get a df for just that team
    team_model = model_df2.loc['{}'.format(team)]

    #create a df of predictors
    lm_predictors = team_model[['opp_ts','def_eff','opp_fp','opp_efg']]
    
    #create a df for target
    lm_target = team_model[['points_scored']]
    
    #define X and y
    X = lm_predictors
    y = lm_target
    
    #fit the model
    lm = linear_model.LinearRegression()
    model = lm.fit(X,y)
    
    #make predictions
    predictions = lm.predict(X)
    
    #store everything in the dictionary
    model_dict['{}_int'.format(team)] = lm.intercept_
    model_dict['{}_coef'.format(team)] = lm.coef_
    model_dict['{}_score'.format(team)] = lm.score(X,y)
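
One detail worth checking is that the fitted coefficients come back in the same order as the predictor columns, because the prediction step relies on that order. The sketch below fits a single model on synthetic, noise-free data (the numbers are made up and are not TeamRankings values) and pairs each coefficient with its column name.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
cols = ["opp_ts", "def_eff", "opp_fp", "opp_efg"]

# Synthetic opponents: made-up metric values, for illustration only
X = pd.DataFrame(rng.uniform(0.4, 1.1, size=(30, 4)), columns=cols)
true_coef = np.array([10.0, 40.0, 5.0, 20.0])
y = 20.0 + X.to_numpy() @ true_coef   # noise-free, one-dimensional target

lm = LinearRegression().fit(X, y)

# coef_ follows the column order of X, so zipping with cols labels each coefficient
coef_by_name = dict(zip(cols, lm.coef_))
print(lm.intercept_, coef_by_name)
```

Here `y` is one-dimensional, so `coef_` has shape `(4,)`. In the loop above the target is a one-column data frame, which is why the stored coefficients carry an extra leading dimension.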

In [18]:
# Create a function that will produce a game prediction
def predict_game(team_a,team_b):
  
    #Set up variables for team A
    a_int = model_dict['{}_int'.format(team_a)][0].astype(float)
    a_coef_de = model_dict['{}_coef'.format(team_a)][0][1].astype(float)
    a_coef_fp = model_dict['{}_coef'.format(team_a)][0][2].astype(float)
    a_coef_efg = model_dict['{}_coef'.format(team_a)][0][3].astype(float)
    a_coef_ts = model_dict['{}_coef'.format(team_a)][0][0].astype(float)
    # Look up team A's defensive metrics by column name rather than by position
    a_metrics = metrics.set_index('Team').loc['{}'.format(team_a)]
    a_de = float(a_metrics['def_eff'])
    a_fp = float(a_metrics['opp_fp'])
    a_efg = float(a_metrics['opp_efg'])
    a_ts = float(a_metrics['opp_ts'])
    
    #Set up variables for team B
    b_int = model_dict['{}_int'.format(team_b)][0].astype(float)
    b_coef_de = model_dict['{}_coef'.format(team_b)][0][1].astype(float)
    b_coef_fp = model_dict['{}_coef'.format(team_b)][0][2].astype(float)
    b_coef_efg = model_dict['{}_coef'.format(team_b)][0][3].astype(float)
    b_coef_ts = model_dict['{}_coef'.format(team_b)][0][0].astype(float)
    # Look up team B's defensive metrics by column name rather than by position
    b_metrics = metrics.set_index('Team').loc['{}'.format(team_b)]
    b_de = float(b_metrics['def_eff'])
    b_fp = float(b_metrics['opp_fp'])
    b_efg = float(b_metrics['opp_efg'])
    b_ts = float(b_metrics['opp_ts'])
    
    #Calculate how many points A will score against B
    
    a_points = a_int + (a_coef_de * b_de) + (a_coef_fp * b_fp) + (a_coef_efg * b_efg) + (a_coef_ts * b_ts)
    b_points = b_int + (b_coef_de * a_de) + (b_coef_fp * a_fp) + (b_coef_efg * a_efg) + (b_coef_ts * a_ts)
    
    
    #Print out the results
    print('{}: {}'.format(team_a,a_points))
    print('{}: {}'.format(team_b,b_points))

### Use the Function

In [19]:
predict_game('Oakland-Golden-Grizzlies','Virginia-Cavaliers')

Oakland-Golden-Grizzlies: 47.271345178843234
Virginia-Cavaliers: 73.36373485315545
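
The arithmetic inside `predict_game` is an intercept plus a sum of coefficient-times-metric terms, which is a dot product. A compact sketch of that calculation is shown below. `predict_points` is a hypothetical helper and the numbers are made up for illustration.

```python
import numpy as np

def predict_points(intercept, coefs, opponent_metrics):
    """Linear prediction: intercept + sum of coef_i * metric_i."""
    return float(intercept + np.dot(coefs, opponent_metrics))

# Made-up numbers; both arrays must use the same metric order
coefs = np.array([10.0, 40.0, 5.0, 20.0])   # opp_ts, def_eff, opp_fp, opp_efg
opponent = np.array([1.0, 0.9, 0.4, 0.5])   # opp_ts, def_eff, opp_fp, opp_efg

prediction = predict_points(20.0, coefs, opponent)
print(prediction)
```

Keeping the coefficients and the opponent's metrics in one shared order removes the need to track each term by hand.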


All done! We can now use our function to produce a predicted score for a matchup between any two NCAA Division I basketball teams in our data.