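In [1]:
# Install kknn and GGally only if they are missing, then load all packages.
if (!requireNamespace("kknn", quietly = TRUE)) install.packages("kknn")
if (!requireNamespace("GGally", quietly = TRUE)) install.packages("GGally")
library(tidyverse)
library(tidymodels)
library(kknn)
library(GGally)
set.seed(3)  # fix the random seed so the split and folds are reproducible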

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.4     ✔ purrr   1.0.1
✔ tibble  3.2.1     ✔ dplyr   1.1.1
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──

✔ broom        1.0.2     ✔ rsample      1.1.1
✔ dials        1.1.0     ✔ tune         1.0.1
✔ infer        1.0.4     ✔ workflows    1.1.2

In [2]:
set.seed(3)

Group members: Jiaming Chang, Charmaine Ma, Ewan Painter, Yang Wang

Introduction

This project focuses on men's college basketball in the US, specifically the NCAA tournament, also known as March Madness. The tournament is interesting to study because its results vary more than those of other major sports. Two factors drive this variance. First, the tournament draws its field of 68 teams from the current 363 Division I schools, so many top teams never play each other in the season leading up to it. Some top teams gain entry by winning a smaller conference championship and never play another top team all season. This makes it difficult to evaluate the relative strength of teams and to predict individual matchups. Second, each round of the tournament is decided by a single game (by comparison, professional basketball playoff series are best of 7), which further enables upsets and unpredictability. Our project will test how random these results truly are, and whether successful tournament teams share common characteristics. Our dataset includes comprehensive statistics for all NCAA tournament teams from 2013 to 2021, as well as their seed in the tournament and the round of the tournament they reached.
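
To make the single-game versus best-of-7 comparison concrete, here is a minimal sketch of the arithmetic. It assumes games are independent and that the underdog has the same win probability in every game; the value 0.35 and the helper name `series_win_prob` are hypothetical and do not come from the dataset.

```r
# Probability that a team with per-game win probability p wins a best-of-n series.
# Assumes independent games with a constant p (an illustrative simplification).
series_win_prob <- function(p, n_games = 7) {
  wins_needed <- n_games %/% 2 + 1
  # Same as winning at least `wins_needed` of `n_games` independent games.
  1 - pbinom(wins_needed - 1, size = n_games, prob = p)
}

p_underdog <- 0.35  # hypothetical single-game upset probability
single_game <- series_win_prob(p_underdog, n_games = 1)
best_of_seven <- series_win_prob(p_underdog, n_games = 7)
print(c(single_game = single_game, best_of_seven = best_of_seven))
```

Under these assumptions the underdog advances 35% of the time in a single game but only about 20% of the time in a best-of-7 series, which is one way to see why a single-elimination format produces more upsets.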

Preliminary Data Analysis

In [3]:
cbb <- read_csv("https://raw.githubusercontent.com/naw333/College-Basketball-Data-Science/main/cbb.csv")
head(cbb, n = 3)

Rows: 3523 Columns: 24
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (4): TEAM, CONF, POSTSEASON, SEED
dbl (20): G, W, ADJOE, ADJDE, BARTHAG, EFG_O, EFG_D, TOR, TORD, ORB, DRB, FT...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


TEAM,CONF,G,W,ADJOE,ADJDE,BARTHAG,EFG_O,EFG_D,TOR,⋯,FTRD,2P_O,2P_D,3P_O,3P_D,ADJ_T,WAB,POSTSEASON,SEED,YEAR
<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<dbl>
North Carolina,ACC,40,33,123.3,94.9,0.9531,52.6,48.1,15.4,⋯,30.4,53.9,44.6,32.7,36.2,71.7,8.6,2ND,1,2016
Wisconsin,B10,40,36,129.1,93.6,0.9758,54.8,47.7,12.4,⋯,22.4,54.8,44.7,36.5,37.5,59.3,11.3,2ND,1,2015
Michigan,B10,40,33,114.4,90.4,0.9375,53.9,47.7,14.0,⋯,30.0,54.7,46.8,35.2,33.2,65.9,6.9,2ND,3,2018


In [7]:
cbb_split <- initial_split(cbb, prop = 0.75, strata = POSTSEASON)
cbb_train <- training(cbb_split)
cbb_test <- testing(cbb_split)

In [8]:
## Table shows the number of teams in the training set and the averages of key metrics.

cbb_summ <- summarize(cbb_train, team_count = n(),
                      mean_ADJOE = mean(ADJOE), mean_ADJDE = mean(ADJDE),
                      mean_ADJ_T = mean(ADJ_T), mean_ORB = mean(ORB),
                      mean_TOR = mean(TOR), mean_W = mean(W))
cbb_summ

team_count,mean_ADJOE,mean_ADJDE,mean_ADJ_T,mean_ORB,mean_TOR,mean_W
<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
2642,103.0669,103.1858,67.74576,29.30776,18.71363,15.89175


In [14]:
## Table keeps tournament teams only and adds win percentage (PERC = W / G).
## Candidate columns to add: WAB, FTRD, BARTHAG, DRB, 2P_O, EFG_D, 3P_O (G and W are combined into PERC).
## TODO: group by year?
## TODO: include SEED if the table is not too wide.
cbb_modified <- cbb |>
  filter(POSTSEASON != "N/A") |>
  mutate(PERC = W / G,
         POSTSEASON = as_factor(POSTSEASON),
         TEAM = as_factor(TEAM))

cbb_short <- cbb_modified|> select(
                          TEAM,
                          PERC,
                          ADJOE,
                          ADJDE,
                          EFG_O,
                          TOR,
                          TORD,
                          ORB,
                          FTR,
                          `2P_D`,
                          `3P_D`,
                          ADJ_T,
                          POSTSEASON,
                          YEAR
                          )
head(cbb_short)
# options(repr.plot.width=20, repr.plot.height=10)
# cbb_testplot <- cbb_short |> ggplot(aes(x = PERC, y = POSTSEASON)) +  
#                             geom_point() + 
#                             labs( x = "PERC", 
#                                   y = "POSTSEASON")
# cbb_testplot

TEAM,PERC,ADJOE,ADJDE,EFG_O,TOR,TORD,ORB,FTR,2P_D,3P_D,ADJ_T,POSTSEASON,YEAR
<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<fct>,<dbl>
North Carolina,0.825,123.3,94.9,52.6,15.4,18.2,40.7,32.3,44.6,36.2,71.7,2ND,2016
Wisconsin,0.9,129.1,93.6,54.8,12.4,15.8,32.1,36.2,44.7,37.5,59.3,2ND,2015
Michigan,0.825,114.4,90.4,53.9,14.0,19.5,25.5,30.7,46.8,33.2,65.9,2ND,2018
Texas Tech,0.8157895,115.2,85.2,53.5,17.7,22.8,27.4,32.9,41.9,29.7,67.5,2ND,2019
Gonzaga,0.9487179,117.8,86.3,56.6,16.2,17.1,30.0,39.0,40.0,29.0,71.5,2ND,2017
Kentucky,0.725,117.2,96.2,49.9,18.1,16.1,42.0,51.8,44.9,32.2,65.9,2ND,2014


Methods

To perform the data analysis, a K-nearest neighbors classifier will be used to classify each team's POSTSEASON (the round in which the team's season ended) based on the predictor variables ADJOE, ADJDE, ORB, 3P_O, ADJ_T, TOR, G, and W (Adjusted Offensive Efficiency, Adjusted Defensive Efficiency, Offensive Rebound Rate, Three-Point Shooting Percentage, Adjusted Tempo, Turnover Rate, Number of Games Played, and Number of Games Won, respectively). The basketball dataset is split into a training set, used to train the classifier, and a testing set, used to evaluate its performance. We will run 5-fold cross-validation on the training set to select the value of k that gives the highest accuracy. The classifier's performance will then be measured on the test set, and the trained classifier will be used to classify a new set of basketball team observations. Finally, the correlation coefficient between each predictor variable and POSTSEASON will be calculated, and the results will be visualized as a bar graph.
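
The cross-validation step described above could be sketched as follows. This is an illustrative outline, not the project's final code: the helper name `tune_knn`, the grid of k values, and the renaming of `3P_O` to `P3_O` (done only to avoid a non-syntactic column name) are all assumptions.

```r
library(tidymodels)
library(kknn)

# Hypothetical helper (not part of the original notebook): tune k by 5-fold CV.
tune_knn <- function(train_data, k_values = seq(1, 21, by = 2)) {
  # Keep tournament teams and the predictors named in the Methods section.
  model_data <- train_data |>
    filter(POSTSEASON != "N/A") |>
    rename(P3_O = `3P_O`) |>  # assumption: rename to a syntactic column name
    select(POSTSEASON, ADJOE, ADJDE, ORB, P3_O, ADJ_T, TOR, G, W) |>
    mutate(POSTSEASON = factor(POSTSEASON))

  # Standardize predictors so no single variable dominates the distance.
  knn_recipe <- recipe(POSTSEASON ~ ., data = model_data) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

  knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

  folds <- vfold_cv(model_data, v = 5, strata = POSTSEASON)

  # Cross-validated accuracy for each candidate k, best first.
  workflow() |>
    add_recipe(knn_recipe) |>
    add_model(knn_spec) |>
    tune_grid(resamples = folds,
              grid = tibble(neighbors = k_values),
              metrics = metric_set(accuracy)) |>
    collect_metrics() |>
    arrange(desc(mean))
}
```

Calling `tune_knn(cbb_train)` would return one row per candidate k, with the first row holding the k that had the highest mean cross-validated accuracy. The test set stays untouched until that k has been chosen.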

Expected Outcomes

Generally, we expect defensive metrics to be stronger indicators of tournament success than offensive metrics.
What impact could such findings have?
Uncovering key metrics can guide team strategy and improve analysts' tournament predictions.
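
One simple way to start checking the offense-versus-defense expectation is to compare average efficiencies by tournament round. The sketch below is exploratory only; the helper name `efficiency_by_round` is hypothetical, and it assumes that a lower ADJDE indicates a stronger defense (fewer points allowed).

```r
library(dplyr)

# Hypothetical helper: mean offensive and defensive efficiency per tournament round.
efficiency_by_round <- function(data) {
  data |>
    filter(POSTSEASON != "N/A") |>       # keep tournament teams only
    group_by(POSTSEASON) |>
    summarize(teams = n(),
              mean_ADJOE = mean(ADJOE),
              mean_ADJDE = mean(ADJDE),
              .groups = "drop") |>
    arrange(mean_ADJDE)                  # assumed: lower ADJDE = better defense
}
```

Running `efficiency_by_round(cbb_train)` would show whether teams that advance further tend to differ more from early exits in ADJDE than in ADJOE. It is a descriptive summary, not a test of the hypothesis.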

This research could lead to the following future questions.
- Individual Player Metrics: Can individual stats offer deeper insights into team strengths?
- Matchup Predictability: Can the classifier predict specific game outcomes, especially potential upsets?
- Metric Evolution: As basketball evolves, how does the significance of certain metrics change?
- External Factors: What other factors, like morale or injuries, influence success, and can they be incorporated for improved predictions?