# Data Mining

Disclaimer: The following code was written in Python 3.5.2

We're going to mine [gamesheets from sports-reference.com](http://www.sports-reference.com/cbb/boxscores/index.cgi?month=02&day=3&year=2017). Note that we collect only the names of the winning and losing teams and their scores. Nothing more.

How long a data mining script gets depends on the source html. The sports-reference pages need only relatively short scripts. Since this is a tutorial, you _won't_ be fully comfortable with every line of code. Focus instead on the overall process.

### Load raw html and make soup

I've already downloaded the source html for games played on February 3, 2017. Open [this link](http://www.sports-reference.com/cbb/boxscores/index.cgi?month=02&day=3&year=2017) and follow along!

In [1]:
from bs4 import BeautifulSoup
with open('./../html/sample-gamesheets/2017-2-3.txt', 'rb') as raw_html:
    soup = BeautifulSoup(raw_html.read().decode(), 'html.parser')

### Collect team names and scores

This is where the [SelectorGadget](http://selectorgadget.com/) becomes a time-saver. Every argument to the `find` and `find_all` functions comes directly from the SelectorGadget. 

In [2]:
# BeautifulSoup has keyword arguments which correspond to attribute keywords.
# Here I use a keyword argument (class_ = "game_summaries")
# and a dictionary for attribute keywords ({"class": "game_summary nohover"})
game_summary = soup.find(class_="game_summaries") \
                   .find('div', {"class": "game_summary nohover"})

winner_data = game_summary.find(class_='winner')
winning_team = winner_data.a.text.strip()
winning_score = winner_data.find(class_='right').text.strip()

loser_data = game_summary.find(class_='loser')
losing_team = loser_data.a.text.strip()
losing_score = loser_data.find(class_='right').text.strip()

print("{}  {} {}  {}".format(winning_team, winning_score, losing_score, losing_team))

Buffalo  96 69  Ball State


### Clean it up

In [3]:
from collections import namedtuple
Team = namedtuple('Team', ('name', 'points', 'won'))

def get_game_summaries(soup):
    """Get ResultSet of game summaries from soup object.
    
    Arguments:
        soup (BeautifulSoup) -- Contains game summary data.
        
    Returns: ResultSet of game summaries.
    """
    game_summaries = soup.find(class_="game_summaries") \
                         .find_all('div', {"class":'game_summary nohover'})
    return game_summaries

def get_game_data(game_summary):
    """Mine game summary data from game summary.
    
    Arguments:
        game_summary (bs4.element.Tag) -- Contains teams and scores.
    
    Returns: Tuple of (winning_team, winning_score, losing_team, losing_score)
    """
    team_data = game_summary.find(class_='winner')
    if team_data is None:  # find() returns None when nothing matches
        raise ValueError("No winner data was found.")
    winning_team = team_data.a.text.strip()
    winning_score = team_data.find(class_='right').text.strip()

    team_data = game_summary.find(class_='loser')
    if team_data is None:  # guard the loser lookup the same way
        raise ValueError("No loser data was found.")
    losing_team = team_data.a.text.strip()
    losing_score = team_data.find(class_='right').text.strip()

    return (winning_team, winning_score, losing_team, losing_score)

for game_summary in get_game_summaries(soup):
    winning_team, winning_score, losing_team, losing_score = get_game_data(game_summary)
    print("{:<25} {:>4} {:>4} {:>25}".format(winning_team, winning_score, losing_score, losing_team))

Buffalo                     96   69                Ball State
Central Michigan            86   82          Western Michigan
Yale                        87   78                  Columbia
Brown                       81   70                   Cornell
Princeton                   69   64                 Dartmouth
Rhode Island                70   59                  Davidson
Harvard                     69   59                      Penn
Monmouth                    71   70               St. Peter's
Iona                        95   76                     Rider
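
The `Team` namedtuple defined at the top of the cell is never used in the loop. As a rough sketch of how it might be put to work, the helper below (a hypothetical `to_teams`, not part of the original code) turns the tuple returned by `get_game_data` into two `Team` records and converts the scores to integers so they can be compared or subtracted.

```python
from collections import namedtuple

Team = namedtuple('Team', ('name', 'points', 'won'))

def to_teams(game_data):
    """Turn (winning_team, winning_score, losing_team, losing_score) into two Team records.

    Hypothetical helper for illustration; scores arrive as strings and are cast to int.
    """
    winning_team, winning_score, losing_team, losing_score = game_data
    winner = Team(name=winning_team, points=int(winning_score), won=True)
    loser = Team(name=losing_team, points=int(losing_score), won=False)
    return winner, loser

# Sample values copied from the first row of the output above.
winner, loser = to_teams(('Buffalo', '96', 'Ball State', '69'))
print(winner)                        # Team(name='Buffalo', points=96, won=True)
print(winner.points - loser.points)  # 27
```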


We're missing the date. That seems like valuable information we should hold onto.

### Method 1
Write a function to grab the date from each page.

In [4]:
def get_date(soup):
    """
    Collects date from html.
    """
    raw = soup.find(class_="game_summaries").h2.text.strip()
    scores, date = raw.split('—')
    month, day, year = date.strip().replace(',', '').split(' ')
    return month, day, year
get_date(soup)

('Feb', '3', '2017')
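
Method 1 returns the date as three strings, which is awkward to sort or compare. One possible follow-up, sketched below, converts that tuple into a `datetime.date`. The `MONTHS` lookup and the `to_date` name are assumptions made for this example, and it assumes the page always uses three-letter English month abbreviations like the `'Feb'` seen above.

```python
from datetime import date

# English month abbreviations, assumed to match the page heading (e.g. 'Feb').
MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

def to_date(month, day, year):
    """Convert ('Feb', '3', '2017')-style strings into a datetime.date."""
    # MONTHS.index raises ValueError for an abbreviation that is not in the tuple.
    return date(int(year), MONTHS.index(month) + 1, int(day))

game_date = to_date('Feb', '3', '2017')
print(game_date.isoformat())  # 2017-02-03
```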

When you're about to do something difficult, remember: 

## DON'T

**D**emands: Does my project require this?  
**O**nline sources: Has someone else done it better?  
**N**etwork: Are my friends smarter than me?  
**T**ry something else.

### Method 2

When scraping the original data, we used the date as our unique identifier to tell files apart. We can just borrow the date from our file path. 

In [5]:
import os.path as ospath

def get_date(fp):
    """Returns date from file path argument."""
    parent, child = ospath.split(fp)  # returns ('./../html/sample-gamesheets', '2017-2-3.txt')
    date = child.replace('.txt', '')  # returns '2017-2-3'
    return date

file_path = './../html/sample-gamesheets/2017-2-3.txt'
get_date(file_path)

'2017-2-3'
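
The string `'2017-2-3'` is fine as a label, but unpadded dates do not sort correctly as text (`'2017-2-10'` sorts before `'2017-2-3'`). If that matters for your project, a variant along these lines could return a zero-padded date instead. The name `get_iso_date` is hypothetical, and the sketch assumes every file is named `year-month-day.txt` as in this tutorial.

```python
import os.path as ospath
from datetime import date

def get_iso_date(fp):
    """Return a zero-padded 'YYYY-MM-DD' string from a path like '.../2017-2-3.txt'."""
    stem, _ext = ospath.splitext(ospath.basename(fp))  # '2017-2-3'
    year, month, day = (int(part) for part in stem.split('-'))
    # date() also validates the values, so '2017-2-30.txt' raises ValueError.
    return date(year, month, day).isoformat()

print(get_iso_date('./../html/sample-gamesheets/2017-2-3.txt'))  # 2017-02-03
```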

# Let's Pull It All Together

In [6]:
file_path = './../html/sample-gamesheets/2017-2-3.txt'
date = get_date(file_path)
for game_summary in get_game_summaries(soup):
    winning_team, winning_score, losing_team, losing_score = get_game_data(game_summary)
    print("{:<10} {:<25} {:>4} {:>4} {:>25}".format(date, winning_team, winning_score, losing_score, losing_team))

2017-2-3   Buffalo                     96   69                Ball State
2017-2-3   Central Michigan            86   82          Western Michigan
2017-2-3   Yale                        87   78                  Columbia
2017-2-3   Brown                       81   70                   Cornell
2017-2-3   Princeton                   69   64                 Dartmouth
2017-2-3   Rhode Island                70   59                  Davidson
2017-2-3   Harvard                     69   59                      Penn
2017-2-3   Monmouth                    71   70               St. Peter's
2017-2-3   Iona                        95   76                     Rider


# Writing Data

Writing data may be the easiest part of the process. The hard part is doing it _quickly_. There are several ways to do this; I'll show you the one I like best.

In [7]:
def init_writer(data_path, colnames):
    """Initialize writer.
    
    Arguments:
        data_path (str) -- Relative or absolute path to exported data.
        colnames (tuple, List) -- Column names to be exported with data.
        
    Returns: io.BufferedWriter for a tab-separated values file.
    """
    writer = open(data_path, 'wb')
    header = "\t".join(colnames) + '\n'
    writer.write(header.encode('utf-8'))
    return writer

data_path = "./../data/sample-gamesheets.txt"
colnames = ("Date", "WinningTeam", "WinningScore", "LosingTeam", "LosingScore")
writer = init_writer(data_path, colnames)

From here you can write rows as needed without reopening the file. Just remember to close it once you're done.

In [8]:
html_path = './../html/sample-gamesheets/2017-2-3.txt'
date = get_date(html_path)

data_path = "./../data/sample-gamesheets.txt"
colnames = ("Date", "WinningTeam", "WinningScore", "LosingTeam", "LosingScore")
writer = init_writer(data_path, colnames)

for game_summary in get_game_summaries(soup):    
    row = "\t".join([date] + list(get_game_data(game_summary))) + '\n'
    writer.write(row.encode())
writer.close()
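
For comparison, here is one alternative you may come across: the standard library's `csv` module inside a `with` block. This is only a sketch, not the approach used in this tutorial. It writes two sample rows to a temporary directory (an assumption made so the example doesn't overwrite the real data file), and the `with` block closes the file automatically.

```python
import csv
import os
import tempfile

colnames = ("Date", "WinningTeam", "WinningScore", "LosingTeam", "LosingScore")
rows = [
    ("2017-2-3", "Buffalo", "96", "Ball State", "69"),
    ("2017-2-3", "Yale", "87", "Columbia", "78"),
]

# A temporary directory keeps this sketch from touching the real data file.
with tempfile.TemporaryDirectory() as tmp_dir:
    data_path = os.path.join(tmp_dir, "sample-gamesheets.txt")

    # newline='' lets the csv module control line endings itself.
    with open(data_path, "w", newline="", encoding="utf-8") as ofile:
        writer = csv.writer(ofile, delimiter="\t", lineterminator="\n")
        writer.writerow(colnames)
        writer.writerows(rows)
    # The file is closed here, even if an exception was raised while writing.

    with open(data_path, newline="", encoding="utf-8") as ifile:
        lines = ifile.read().splitlines()

print(lines[1].split("\t"))  # ['2017-2-3', 'Buffalo', '96', 'Ball State', '69']
```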

Now let's take a quick look at our data file:

In [9]:
with open(data_path, 'rb') as ifile:
    for line in ifile:
        print(line.decode().split('\t'))

['Date', 'WinningTeam', 'WinningScore', 'LosingTeam', 'LosingScore\n']
['2017-2-3', 'Buffalo', '96', 'Ball State', '69\n']
['2017-2-3', 'Central Michigan', '86', 'Western Michigan', '82\n']
['2017-2-3', 'Yale', '87', 'Columbia', '78\n']
['2017-2-3', 'Brown', '81', 'Cornell', '70\n']
['2017-2-3', 'Princeton', '69', 'Dartmouth', '64\n']
['2017-2-3', 'Rhode Island', '70', 'Davidson', '59\n']
['2017-2-3', 'Harvard', '69', 'Penn', '59\n']
['2017-2-3', 'Monmouth', '71', "St. Peter's", '70\n']
['2017-2-3', 'Iona', '95', 'Rider', '76\n']


Alternatively, you can examine the data file in Excel by importing the text file. I prefer this method.
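
If you would rather stay in Python, `csv.DictReader` is another option: it strips the trailing newlines seen in the output above and lets you look up values by column name. The sketch below reads from an in-memory string that stands in for the exported file (an assumption made to keep the example self-contained).

```python
import csv
import io

# Stand-in for the exported file, using the header and first row shown above.
sample = (
    "Date\tWinningTeam\tWinningScore\tLosingTeam\tLosingScore\n"
    "2017-2-3\tBuffalo\t96\tBall State\t69\n"
)

reader = csv.DictReader(io.StringIO(sample), delimiter="\t")
games = list(reader)

# Values are still strings, so cast before doing arithmetic.
margin = int(games[0]["WinningScore"]) - int(games[0]["LosingScore"])
print(games[0]["WinningTeam"], margin)  # Buffalo 27
```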

# Miner Class

We need to use Python classes to mine efficiently. The following Python files in the /src directory will be helpful: 

- Miner.py: Base Miner Class
- BoxscoreMiner.py: Mines Box Score Data
- GamesheetMiner.py: Mines Gamesheet Data

If you're reading this later, there are several articles online which explain inheritance better than I can. Once you've covered that, go over and look at the Miner classes listed above. Most of the 

### Base Miner Class

This exists because some functions are common to each Miner. If, heaven forbid, we need to update one of these functions, we want to change it in one place, and _only_ one place. 
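
To make the idea concrete, here is a toy sketch of that arrangement. It is not the code in Miner.py; `ToyMiner`, `ToyGamesheetMiner`, and their methods are hypothetical names invented for this example. The shared `write` logic lives in the base class, and the subclass only knows how to produce its own rows.

```python
import io

class ToyMiner:
    """Hypothetical base class: owns the output stream and the shared write logic."""

    def __init__(self, stream, colnames):
        self.stream = stream
        self.rows = []
        self.stream.write("\t".join(colnames) + "\n")

    def write(self):
        """Write every pending row, then clear the queue."""
        for row in self.rows:
            self.stream.write("\t".join(row) + "\n")
        self.rows = []


class ToyGamesheetMiner(ToyMiner):
    """Hypothetical subclass: only knows how to produce gamesheet rows."""

    def __init__(self, stream):
        super().__init__(stream, ("Date", "WinningTeam", "WinningScore"))

    def mine(self, date, games):
        for winning_team, winning_score in games:
            self.rows.append((date, winning_team, winning_score))


stream = io.StringIO()  # stands in for an open file
miner = ToyGamesheetMiner(stream)
miner.mine("2017-2-3", [("Buffalo", "96"), ("Yale", "87")])
miner.write()  # inherited from ToyMiner
print(stream.getvalue())
```

If the row format ever changes, only `ToyMiner.write` needs to be touched; every subclass picks up the change.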

### Gamesheet Miner

The `GamesheetMiner` will pull everything together that we've done so far. When we're done, the following code will mine and export all the gamesheet data.

```
import os  # GamesheetMiner comes from GamesheetMiner.py, listed above

miner = GamesheetMiner("./../data/feb-gamesheets.txt")
gamesheet_dir = "./../html/sample-gamesheets/"
for root, dirs, files in os.walk(gamesheet_dir):
    for f in files:
        if f.endswith('.txt'):
            print("Mining", f)
            # Join with root, not gamesheet_dir, so files in subdirectories resolve.
            miner.mine_gamesheet(os.path.join(root, f))
            miner.write()
```

### Boxscore Miner

Because scores alone aren't very interesting, I've gone ahead and made a `BoxscoreMiner` class to get some more interesting data. In the end, the following code will mine and export all boxscore data.

```
import os  # BoxscoreMiner comes from BoxscoreMiner.py, listed above

miner = BoxscoreMiner("./../data/feb-boxscores.txt")
boxscore_dir = "./../html/sample-boxscores/"
for root, dirs, files in os.walk(boxscore_dir):
    for f in files:
        if f.endswith('.txt'):
            print("Mining", f)
            # Join with root, not boxscore_dir, so files in subdirectories resolve.
            miner.mine_boxscore(os.path.join(root, f))
            miner.write()
```

See how similar the different programs are? This is why we use classes. **It's easier to remember how to use each class when they have the same interface**.
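
Taking that one step further: if every miner exposed the same method name, a single helper could drive any of them. The sketch below assumes a shared `mine(path)` method, which the real classes do not have (they use `mine_gamesheet` and `mine_boxscore`). `RecordingMiner` and `mine_directory` are hypothetical names used only for illustration.

```python
import os
import tempfile

class RecordingMiner:
    """Hypothetical stand-in for a miner; it only records the paths it is given."""

    def __init__(self):
        self.mined = []

    def mine(self, path):
        self.mined.append(path)

    def write(self):
        pass  # a real miner would export its rows here


def mine_directory(miner, directory, suffix=".txt"):
    """Feed every file under `directory` whose name ends in `suffix` to `miner`."""
    for root, dirs, files in os.walk(directory):
        for f in sorted(files):  # sorted so the order is predictable
            if f.endswith(suffix):
                miner.mine(os.path.join(root, f))
                miner.write()


# Build a throwaway directory with two "gamesheets" and one unrelated file.
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ("2017-2-3.txt", "2017-2-4.txt", "notes.md"):
        open(os.path.join(tmp_dir, name), "w").close()
    miner = RecordingMiner()
    mine_directory(miner, tmp_dir)
    mined_names = [os.path.basename(p) for p in miner.mined]

print(mined_names)  # ['2017-2-3.txt', '2017-2-4.txt']
```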