# MaxBlue Algorithm for Determining Basketball Winners

Hypothesis: in a matchup between two teams, the bluer team will win.
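Stated as code, the hypothesis is a single comparison. The sketch below assumes each team already has a numeric blueness score. The helper `pick_winner`, the team names, and the scores in `blueness` are all hypothetical, made up for illustration rather than computed in this notebook.

```python
def pick_winner(team_a, team_b, blueness):
    """Return whichever team has the higher blueness score (ties go to team_a)."""
    return team_a if blueness[team_a] >= blueness[team_b] else team_b

# hypothetical scores, for illustration only
blueness = {"team-blue": 200.0, "team-red": 35.0}

winner = pick_winner("team-blue", "team-red", blueness)
print(winner)
```

How the score itself is defined is the open question the rest of the notebook explores.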

## Contents
- [Data collection](#data_collection)
    - [Scraping the page](#scraping_the_page)
    - [Saving the images](#saving_the_images)
    - [Calculating the blue](#calculating_the_blue)
- [Visualizations](#visualizations)
    - [Maximum blue value](#maximum_blue_value)
    - [Mean blue value](#mean)
    - [Mean blue value, penalized](#mean_penalized)
    - [Median blue value](#median)
    - [Median blue value, penalized](#median_penalized)

In [1]:
from bs4 import BeautifulSoup
import cv2
import numpy as np
import pandas as pd
import requests

<a id='data_collection'></a>
## Data collection

To keep some semblance of standardization, all the images come from the exact same place: the ESPN website.

<a id='scraping_the_page'></a>
### Scraping the page

Each school listed in the bracket is a link that contains the school's name and ID. We can grab all these links, and then store the names and IDs.

In [39]:
# constants

HEADERS = {
"User-Agent":'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'
}

SITE = "http://www.espn.com/mens-college-basketball/tournament/bracket"
TEAM_ID = "team/_/id/"
LOGO_DIR = "logos/"
IMAGE_LINK = "https://a.espncdn.com/i/teamlogos/ncaa/500/{}.png"
PIXEL_FILE = "pixels.csv"

In [27]:
# read the bracket, and find each team
page = requests.get(SITE, headers=HEADERS)
soup = BeautifulSoup(page.content, "lxml")
link_tags = soup.find_all("a")
links = [link_tag.get("href") for link_tag in link_tags]
# anchors without an href yield None, so skip those before the substring test
teams = [link for link in links if link and TEAM_ID in link]

In [28]:
# use the link to grab information about each team
# two teams appear twice, so skip those

repeats = ["prairie-view-a%26m-panthers", "st.-john's-red-storm"]

team_map = {}
for team in teams:
    
    splits = team.split("/")
    team_id = splits[-2]
    team_name = splits[-1]
    
    if (team_name not in repeats):
        team_map[team_name] = team_id
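To make the splitting step concrete, here is a small sketch on a single link. The URL below is a hypothetical example shaped like the hrefs matched by `TEAM_ID`, and the ID `150` in it is a placeholder, not a value taken from the scraped page. The `unquote` call is an optional extra that the notebook itself does not use.

```python
from urllib.parse import unquote

# hypothetical link, shaped like the hrefs that contain "team/_/id/"
link = "http://www.espn.com/mens-college-basketball/team/_/id/150/duke-blue-devils"

# the last two path segments are the ID and the name
splits = link.split("/")
team_id, team_name = splits[-2], splits[-1]
print(team_id, team_name)

# names arrive percent-encoded; "%26" stands for "&"
decoded = unquote("prairie-view-a%26m-panthers")
print(decoded)
```

The notebook keeps the encoded names as they are, which is harmless since they are only used as dictionary keys and file names.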

<a id='saving_the_images'></a>
### Saving the images

Use each team's ID to build the logo URL, download the image, and save it locally under `LOGO_DIR` (the directory must already exist).

In [29]:
for (team_name, team_id) in team_map.items():
    logo_url = IMAGE_LINK.format(team_id)
    file_name = LOGO_DIR + team_name + ".png"
    # download first, so a failed request does not leave an empty file behind
    response = requests.get(logo_url)
    response.raise_for_status()
    with open(file_name, "wb") as f:
        f.write(response.content)

<a id='calculating_the_blue'></a>
### Calculating the blue

Use OpenCV (`cv2`) to read in the images. Each pixel has a blue, green, red, and alpha value (OpenCV loads the channels in BGRA order, not RGBA). Summarize these per logo and store the results in a CSV.

In [37]:
# function to calculate the blueness
def get_blue(image_file):
    
    # convert to pixel list and filter out see-through pixels
    image = cv2.imread(image_file, cv2.IMREAD_UNCHANGED)
    n_pixels = image.shape[0] * image.shape[1]
    all_pixels = np.reshape(image, (n_pixels, 4))
    pixels = all_pixels[all_pixels[:, 3] != 0]
    
    # get the max blue value (scalar)
    b_max = max(pixels[:, 0])
    
    # and the mean color (list)
    bgra_mean = np.average(pixels, axis=0)
    
    # and the median color (list)
    bgra_median = np.median(pixels, axis=0)
    
    return b_max, bgra_mean, bgra_median    
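The same statistics can be checked by hand on a tiny synthetic image, with no logo file needed. This is a minimal sketch: the 2x2 array below is invented for illustration and uses the BGRA channel order that OpenCV produces, and the transparent pixels are filtered with a boolean mask instead of a Python loop.

```python
import numpy as np

# synthetic 2x2 image, channels ordered B, G, R, A (values are made up)
image = np.array([
    [[255, 0, 0, 255], [100, 50, 0, 255]],
    [[0, 0, 200, 0], [10, 20, 30, 255]],
], dtype=np.uint8)

# flatten to a list of pixels and drop the fully transparent one
pixels = image.reshape(-1, 4)
opaque = pixels[pixels[:, 3] != 0]

b_max = int(opaque[:, 0].max())          # highest blue value
bgra_mean = opaque.mean(axis=0)          # per-channel mean
bgra_median = np.median(opaque, axis=0)  # per-channel median

print(b_max, bgra_mean, bgra_median)
```

The transparent pixel has a red value of 200, so leaving it in would pull the red mean up even though it is never visible in the logo.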

In [38]:
# save the blue info

out = ["team,b_max,r_mean,g_mean,b_mean,a_mean,r_med,g_med,b_med,a_med"]
for team_name in team_map.keys():
        
        # files
        image_file = LOGO_DIR + team_name + ".png"
        max_file = LOGO_DIR + team_name + "-max.png"
        mn_file = LOGO_DIR + team_name + "-mean.png"
        md_file = LOGO_DIR + team_name + "-median.png"

        # calculate blueness
        b_max, bgra_mean, bgra_median = get_blue(image_file)
        [b_mn, g_mn, r_mn, a_mn] = bgra_mean
        [b_md, g_md, r_md, a_md] = bgra_median

        # save pixels
        row = ",".join([
            team_name, str(b_max),
            str(r_mn), str(g_mn), str(b_mn), str(a_mn),
            str(r_md), str(g_md), str(b_md), str(a_md)])
        out.append(row)

        # save color blocks--
        # this is not at all necessary for the analysis, but i like to look at them
        color_max = np.zeros((100, 100, 4), dtype=np.uint8)
        color_max[:, :] = [b_max, 0, 0, 255]
        cv2.imwrite(max_file, color_max)
        color_mn = np.zeros((100, 100, 4), dtype=np.uint8)
        color_mn[:, :] = bgra_mean
        cv2.imwrite(mn_file, color_mn)
        color_md = np.zeros((100, 100, 4), dtype=np.uint8)
        color_md[:, :] = bgra_median
        cv2.imwrite(md_file, color_md)
    
with open(PIXEL_FILE, "w") as f:
    csv = "\n".join(out)
    f.write(csv)

<a id='visualizations'></a>
## Visualizations

"MaxBlue" is a little ambiguous, so let's look at several different ways of calculating it.

In [40]:
df = pd.read_csv(PIXEL_FILE)

<a id='maximum_blue_value'></a>
### Maximum blue value

Each pixel has a blue value. Simply take the highest one.

In [60]:
# check out the results
df = pd.read_csv("pixels.csv")
df["penalized_b"] = df["b"] - (df["r"] + df["g"])
df = df.sort_values("penalized_b", ascending=False)
print(df[["team", "b", "penalized_b"]])

                             team           b  penalized_b
6                duke-blue-devils  134.000000    87.000000
32               nevada-wolf-pack   98.000000    53.000000
66              kentucky-wildcats  177.705379    37.299562
42          kansas-state-wildcats  118.000000    33.000000
63         georgia-state-panthers  185.852697    10.673037
49                  iowa-hawkeyes    0.000000     0.000000
54              utah-state-aggies   98.360339    -9.932571
44             villanova-wildcats  167.455472   -14.945385
28                  buffalo-bulls  191.579281   -14.957625
41                   oregon-ducks   73.000000   -38.000000
18        michigan-state-spartans   59.000000   -38.000000
15                  yale-bulldogs  162.119506   -39.502233
5     fairleigh-dickinson-knights  157.234022   -39.621279
67     abilene-christian-wildcats  154.874882   -52.297259
45             saint-mary's-gaels  132.956572   -52.374860
48            cincinnati-bearcats   36.811323   -68.5160

<a id='mean'></a>
### Mean blue value

Take the average of all the blue values.
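A minimal sketch of ranking teams by mean blue follows. It assumes the `b_mean` column name written to `pixels.csv`. The rows in `sample` are hypothetical stand-ins, not real results.

```python
import pandas as pd

# hypothetical rows, using the column names from the pixels.csv header
sample = pd.DataFrame({
    "team": ["team-a", "team-b", "team-c"],
    "b_mean": [40.0, 180.5, 95.2],
})

# bluest first
ranked = sample.sort_values("b_mean", ascending=False).reset_index(drop=True)
print(ranked)
```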

<a id='mean_penalized'></a>
### Mean blue value, penalized

Take the average of all the blue values, then subtract the mean red value and the mean green value.
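The reason for the penalty is easiest to see with a few pure colors: white has the maximum possible blue channel, yet nobody would call it blue. The sketch below applies the penalty to hand-picked channel values. `penalized_blue` is a hypothetical helper name, and the colors are illustrative, not taken from any logo.

```python
def penalized_blue(b, g, r):
    """Blue minus the sum of the other two color channels."""
    return b - (r + g)

# (b, g, r) values on the 0-255 scale, chosen for illustration
pure_blue = penalized_blue(255, 0, 0)
white = penalized_blue(255, 255, 255)
navy = penalized_blue(128, 0, 0)

print(pure_blue, white, navy)
```

Under the unpenalized score, white and pure blue tie at 255. With the penalty, white drops to the bottom and even a dark navy outranks it.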

<a id='median'></a>
### Median blue value

Take the median of all the blue values.

<a id='median_penalized'></a>
### Median blue value, penalized

Take the median of all the blue values, then subtract the median red value and the median green value.
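Since the five definitions can disagree, it helps to put them side by side. The following is a sketch under stated assumptions: the column names match the `pixels.csv` header, while the two teams and their numbers are hypothetical (one mostly white logo with a little blue, one mostly blue logo).

```python
import pandas as pd

# hypothetical per-team statistics, using the pixels.csv column names
sample = pd.DataFrame({
    "team": ["team-a", "team-b"],
    "b_max": [255, 200],
    "r_mean": [200.0, 20.0], "g_mean": [200.0, 30.0], "b_mean": [210.0, 150.0],
    "r_med": [255.0, 10.0], "g_med": [255.0, 20.0], "b_med": [255.0, 160.0],
})

# one column per definition of "blueness"
scores = pd.DataFrame({
    "team": sample["team"],
    "max": sample["b_max"],
    "mean": sample["b_mean"],
    "mean_penalized": sample["b_mean"] - (sample["r_mean"] + sample["g_mean"]),
    "median": sample["b_med"],
    "median_penalized": sample["b_med"] - (sample["r_med"] + sample["g_med"]),
}).set_index("team")

# the team each definition would pick as bluest
picks = scores.idxmax()
print(picks)
```

In this made-up case the three unpenalized scores favor the mostly white logo, and the two penalized scores favor the mostly blue one.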