# MaxBlue Algorithm for Determining Basketball Winners

Hypothesis: in a matchup between two teams, the bluer team will win. This will be the basis for my bracket picks.

## Contents
- [Data collection](#data_collection)
    - [Scraping the page](#scraping_the_page)
    - [Saving the images](#saving_the_images)
    - [Calculating the blue](#calculating_the_blue)
- [Visualizations](#visualizations)
    - [Maximum blue value](#max)
    - [Maximum blue value, penalized (subtraction)](#max_sub)
    - [Maximum blue value, penalized (division)](#max_div)
- [Conclusion](#conclusion)
    - [Feature selection Sunday](#selection_sunday)
    - [Bracket](#bracket)

In [1]:
from bs4 import BeautifulSoup
import cv2
import numpy as np
import os
import pandas as pd
import requests

<a id='data_collection'></a>
## Data collection

In an attempt at keeping this hot mess of an analysis standardized, all the images come from the exact same place: the ESPN website.

<a id='scraping_the_page'></a>
### Scraping the page

Each school listed in the bracket is a link that contains the school's name and ID. We can grab all these links, and then store the names and IDs.
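As a rough illustration of that idea, here is a minimal sketch that pulls the name and ID out of a single link with a regular expression. The sample tag, the school name, and the `parse_team_link` helper are all hypothetical; only the `team/_/id/` URL fragment comes from the parsing code in this notebook.

```python
import re

# Hypothetical link, shaped like the ones the bracket page is assumed to contain.
SAMPLE_LINK = (
    '<a href="http://www.espn.com/mens-college-basketball/'
    'team/_/id/9999/example-school">Example U</a>'
)

# Capture the numeric ID after "team/_/id/" and the visible link text.
TEAM_LINK = re.compile(r'team/_/id/(\d+)[^"]*"[^>]*>([^<]+)<')


def parse_team_link(html):
    """Return (name, team_id) for one team link, or None if nothing matches."""
    match = TEAM_LINK.search(html)
    if match is None:
        return None
    team_id, name = match.groups()
    return name.strip(), team_id


print(parse_team_link(SAMPLE_LINK))
```

Returning `None` for a non-matching tag makes it easy to skip empty bracket slots without a length check.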

In [2]:
# constants

HEADERS = {
"User-Agent":'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'
}

SITE = "http://www.espn.com/mens-college-basketball/tournament/bracket"
LOGO_DIR = "logos/"
IMAGE_LINK = "https://a.espncdn.com/i/teamlogos/ncaa/500/{}.png"
PIXEL_FILE = "pixels.csv"

if not os.path.exists(LOGO_DIR):
    os.mkdir(LOGO_DIR)

In [3]:
# read the bracket, and find each team
page = requests.get(SITE, headers=HEADERS)
soup = BeautifulSoup(page.content, "lxml")
tags = soup.find_all("dt")

In [4]:
# parse the html to get team info
# pls don't read this disgusting code

team_map_ = dict()
seen_ids = set()
for tag in tags:
    
    string = str(tag)
    if (len(string) < 10):
        continue
        
    [team1, team2] = string.split("<br/>")
    
    # team1
    seed = team1.split(" ")[0].replace("<dt>", "").replace("<b>", "")
    url = team1.split("team/_/id/")[1]
    id_ = url.split("/")[0]
    name = url.split('">')[1].split("<")[0]
    if (id_ not in seen_ids):
        team_map_[name] = (seed, id_)
    seen_ids.add(id_)
    
    # team2
    seed = team2.split(" ")[0].replace("<b>", "")
    url = team2.split("team/_/id/")[1]
    id_ = url.split("/")[0]
    name = url.split('">')[1].split("<")[0]
    if (id_ not in seen_ids):
        team_map_[name] = (seed, id_)
    seen_ids.add(id_)
    
# some of the names are too long to display
# let's fix that
team_map = dict()
for (team, (seed, id_)) in team_map_.items():
    
    if (len(team) > 4):
        if (" " not in team):
            team_ = team[:3].upper()
        else:
            team_ = "".join([c for c in team if c.isupper()])
    else:
        team_ = team
    
    team_map[team_] = (seed, id_)

<a id='saving_the_images'></a>
### Saving the images

Use the ID to access the team logo, and save that locally.

In [5]:
for (team_name, (seed, team_id)) in team_map.items():
    logo_url = IMAGE_LINK.format(team_id)
    file_name = LOGO_DIR + team_name + ".png"
    # fetch first, so a failed request doesn't leave an empty file behind
    response = requests.get(logo_url)
    response.raise_for_status()
    with open(file_name, "wb") as f:
        f.write(response.content)

<a id='calculating_the_blue'></a>
### Calculating the blue

Use OpenCV (`cv2`) to read in the images. Each pixel has a blue, green, red, and alpha value (that's the channel order OpenCV uses). Store the mean and median of these in a CSV.
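To show what the transparent-pixel filtering amounts to, here is a small sketch on a synthetic 2x2 image rather than a real logo. The array values are made up for illustration; the only thing assumed about OpenCV is that a four-channel PNG read with `cv2.IMREAD_UNCHANGED` comes back in blue, green, red, alpha order.

```python
import numpy as np

# Synthetic 2x2 image in OpenCV's channel order: blue, green, red, alpha.
image = np.array([
    [[255, 0, 0, 255], [0, 0, 255, 255]],  # opaque blue, opaque red
    [[0, 0, 0, 0], [255, 0, 0, 128]],      # fully transparent, half-transparent blue
], dtype=np.uint8)

# Flatten to one row per pixel, then keep rows whose alpha is non-zero.
pixels = image.reshape(-1, 4)
visible = pixels[pixels[:, 3] != 0]

print(visible.shape)         # three pixels survive the filter
print(visible[:, 0].mean())  # mean of the blue channel over visible pixels
```

Note that a half-transparent pixel counts just as much as an opaque one here; only pixels with an alpha of exactly zero are dropped.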

In [6]:
# function to calculate the blueness
def get_blue(image_file):
    
    # convert to pixel list and filter out see-through pixels
    # (assumes the PNG has an alpha channel, i.e. 4 values per pixel)
    image = cv2.imread(image_file, cv2.IMREAD_UNCHANGED)
    all_pixels = image.reshape(-1, 4)
    pixels = all_pixels[all_pixels[:, 3] != 0]
    
    # get the mean color
    bgra_mean = np.average(pixels, axis=0)
    
    # and the median color
    bgra_median = np.median(pixels, axis=0)
    
    return bgra_mean, bgra_median    

In [7]:
# save the blue info

out = ["team,seed,r_mean,g_mean,b_mean,r_med,g_med,b_med"]
for (team_name, (seed, team_id)) in team_map.items():

    # files
    image_file = LOGO_DIR + team_name + ".png"
    mn_file = LOGO_DIR + team_name + "-mean.png"
    md_file = LOGO_DIR + team_name + "-median.png"

    # calculate blueness
    bgra_mean, bgra_median = get_blue(image_file)
    [b_mn, g_mn, r_mn, a_mn] = bgra_mean
    [b_md, g_md, r_md, a_md] = bgra_median

    # save pixels
    row = ",".join([
        team_name, seed,
        str(r_mn), str(g_mn), str(b_mn),
        str(r_md), str(g_md), str(b_md)])
    out.append(row)

    # save color blocks--
    # this is not at all necessary for the analysis, but i like to look at them

    def save_color_block(file, color):
        color_block = np.zeros((100, 100, 4), dtype=np.uint8)
        color_block[:, :] = color
        cv2.imwrite(file, color_block)

    save_color_block(mn_file, bgra_mean)
    save_color_block(md_file, bgra_median)
    
with open(PIXEL_FILE, "w") as f:
    csv = "\n".join(out)
    f.write(csv)

<a id='visualizations'></a>
## Visualizations

"MaxBlue" is a little ambiguous, so we'll look at several different ways of calculating it.

When reading in the images, there were two ways to calculate the blue-ness of a logo: taking the mean value of the pixels, and taking the median. The mean blurs the colors together, while the median (taken separately for each channel) tends to land on the dominant color. For example, Iowa State's logo is predominantly red, with some yellow. The mean color is orange, but the median color is red.


<img src='demo.png'>


For each of the blue-ness calculations, we'll see how it performs with both the mean and the median.
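To make the mean-versus-median difference concrete, here is a toy version of the red-and-yellow case. The ten pixels are invented (seven pure red, three pure yellow, in RGB order) and are not taken from any real logo.

```python
import numpy as np

RED = [255, 0, 0]
YELLOW = [255, 255, 0]

# Toy "logo": 70% red pixels, 30% yellow pixels, in RGB order.
pixels = np.array([RED] * 7 + [YELLOW] * 3, dtype=np.float64)

mean_color = pixels.mean(axis=0)          # green gets averaged in: an orange
median_color = np.median(pixels, axis=0)  # green's median is 0: plain red

print(mean_color)
print(median_color)
```

The median is computed per channel, so it isn't guaranteed to be a color that actually appears in the logo. It only matches the dominant color when one color covers more than half of the visible pixels, as it does here.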

Also, all these plots were made with `ggplot2`, saved, and read in, because `matplotlib` will never be prettier no matter how hard I try. Skeleton `R` code is available on [my GitHub](https://github.com/malaikahanda/marchmadness/blob/master/plots/plots.r).

In [8]:
df = pd.read_csv(PIXEL_FILE)
df["seed"] = df["seed"].astype(int)
df.head()

| | team | seed | r_mean | g_mean | b_mean | r_med | g_med | b_med |
|---|---|---|---|---|---|---|---|---|
| 0 | BEL | 11 | 161.571232 | 125.746369 | 152.820584 | 230.0 | 68.0 | 117.0 |
| 1 | TEM | 11 | 196.945403 | 68.768823 | 104.959586 | 178.0 | 8.0 | 56.0 |
| 2 | NDS | 16 | 91.69128 | 128.851433 | 61.555582 | 0.0 | 87.0 | 61.0 |
| 3 | NCC | 16 | 158.99349 | 142.292741 | 147.590061 | 159.0 | 161.0 | 164.0 |
| 4 | AS | 11 | 196.955922 | 103.316853 | 51.170241 | 176.0 | 68.0 | 54.0 |


<a id='max'></a>
### Maximum blue value

Here we look at just the blue channel of the RGB color.

Here's maximum blue with the mean:

<img src='plots/mean.jpg'>

Here's maximum blue with the median:

<img src='plots/median.jpg'>

<a id='max_sub'></a>
### Maximum blue value, penalized (subtraction)

Just because a color has a high blue value in RGB doesn't mean that it looks blue to us. For example, white has the highest blue value you can get (255), but it also has equally high red and green values. We want a color that's mostly blue, with low red and green. Here, we'll measure blue-ness by subtracting the red and green values from the blue value.
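A quick sketch of the subtraction penalty on a few reference colors shows why white stops looking like a winner. The `blueness_sub` name is just a label for this example.

```python
def blueness_sub(r, g, b):
    """Blue value minus the red and green values."""
    return b - (r + g)


print(blueness_sub(255, 255, 255))  # white: heavily penalized
print(blueness_sub(0, 0, 255))      # pure blue: the best possible score
print(blueness_sub(0, 0, 128))      # a darker blue still beats white
```

Scores range from -510 (pure yellow, all red and green with no blue) up to 255 (pure blue).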

Here's the subtraction-penalized maximum blue with the mean:

<img src='plots/mean_sub.jpg'>

Here's the subtraction-penalized maximum blue with the median:

<img src='plots/median_sub.jpg'>

<a id='max_div'></a>
### Maximum blue value, penalized (division)

We can also try penalizing by dividing, rather than subtracting.
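The exact formula behind the division plots isn't spelled out here, so the sketch below is only one plausible reading: blue divided by the sum of red and green. The `+ 1` in the denominator is an assumption added to avoid dividing by zero, and `blueness_div` is a hypothetical name.

```python
def blueness_div(r, g, b):
    # Assumed formula: the text only says "dividing", so this ratio
    # and the +1 guard against a zero denominator are guesses.
    return b / (r + g + 1)


print(blueness_div(255, 255, 255))  # white: just under 0.5
print(blueness_div(0, 0, 255))      # pure blue: 255.0
print(blueness_div(0, 0, 128))      # darker blue: 128.0
```

Compared with subtraction, a ratio like this is much more sensitive to small amounts of red or green in an otherwise blue logo.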

Here's the division-penalized maximum blue with the mean:

<img src='plots/mean_div.jpg'>

Here's the division-penalized maximum blue with the median:

<img src='plots/median_div.jpg'>

<a id='conclusion'></a>
## Conclusion

This has been a journey. Am I more knowledgeable about basketball than before? No. But am I more knowledgeable about data science than before? Also no.

<a id='selection_sunday'></a>
### Feature selection Sunday

I am wildly proud of the above heading. Please take a moment to enjoy it.

Anyway, it was a little unclear how I was going to decide the best way of calculating blue-ness. I could look at which method correlates best with seed, or I could look at which one best captures the essence of being blue. The first option would likely give me better results, but the second option does a better job of getting the spirit of MaxBlue, so that's what I decided to do.
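For the road not taken, here is a minimal sketch of what "correlates best with seed" could look like. It uses only the five rows shown in the table preview, so the numbers it prints are a toy illustration rather than a result, and the use of Pearson correlation is an assumption.

```python
import numpy as np

# First five rows of the pixel table: (seed, r_med, g_med, b_med).
rows = [
    (11, 230.0, 68.0, 117.0),   # BEL
    (11, 178.0, 8.0, 56.0),     # TEM
    (16, 0.0, 87.0, 61.0),      # NDS
    (16, 159.0, 161.0, 164.0),  # NCC
    (11, 176.0, 68.0, 54.0),    # AS
]
seed, r, g, b = np.array(rows).T

# Two of the candidate blue-ness metrics, both on the median color.
metrics = {
    "max": b,
    "sub": b - (r + g),
}

# Pearson correlation of each metric with seed.
correlations = {
    name: np.corrcoef(values, seed)[0, 1] for name, values in metrics.items()
}
for name, corr in correlations.items():
    print(f"{name}: {corr:+.2f}")
```

A lower seed number means a stronger team, so a metric that "predicts" strength would show a negative correlation with seed.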

Based on the plots, the metric that best represents how blue a team's logo is takes the median color of the logo and penalizes it by subtraction, so that's what I've used below.
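Applied to a single matchup, the rule boils down to "higher score wins". The sketch below uses the median colors of two teams from the table preview; `pick_winner` is a hypothetical helper, and sending ties to the first team is an arbitrary choice.

```python
# Median (r, g, b) values for two teams, copied from the table preview.
LOGOS = {
    "BEL": (230.0, 68.0, 117.0),
    "TEM": (178.0, 8.0, 56.0),
}


def blueness(rgb):
    """Subtraction-penalized blue-ness of one (r, g, b) color."""
    r, g, b = rgb
    return b - (r + g)


def pick_winner(team_a, team_b, logos=LOGOS):
    """Return the team with the higher blue-ness; ties go to team_a."""
    return max((team_a, team_b), key=lambda team: blueness(logos[team]))


print(pick_winner("BEL", "TEM"))
```

Neither logo is remotely blue, so this matchup comes down to which one is less red.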

In [9]:
df["blueness"] = df["b_med"] - (df["r_med"] + df["g_med"])

def winner(seed1, seed2):
    sub = df[["seed", "team", "blueness"]]
    print(sub[sub["seed"] == seed1])
    print(sub[sub["seed"] == seed2])
    return
    
# winner(2, 4)

<a id='bracket'></a>
### Bracket

Aaaand here's my filled-out bracket:

<img src='bracket.jpg'>