<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Get-data-from-PinballMap-API" data-toc-modified-id="Get-data-from-PinballMap-API-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Get data from PinballMap API</a></span></li><li><span><a href="#Define-scraping-functions" data-toc-modified-id="Define-scraping-functions-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Define scraping functions</a></span></li><li><span><a href="#Test-scraping-functions" data-toc-modified-id="Test-scraping-functions-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Test scraping functions</a></span></li><li><span><a href="#Scrape-data-for-all-pinball-machines" data-toc-modified-id="Scrape-data-for-all-pinball-machines-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Scrape data for all pinball machines</a></span></li></ul></div>

# Introduction

This is a project about pinball machines, inspired by [Pinball Map](https://pinballmap.com/). In this notebook I fetch data about pinball machines from the Pinball Map API, then join it with data scraped from the [Internet Pinball Database](https://www.ipdb.org/search.pl). Once I have the data how I want it, I'll move it into Tableau and create a dashboard.

There's no real research question here; the purpose of this project is to practice my web scraping, regex, and Tableau skills.

# Get data from Pinball Map API

First I need to call up the [Pinball Map API](https://pinballmap.com/api/v1/docs) and fetch data on all the machines in its database.

In [1]:
# Import packages
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup
import time

# Define function to call API
def api_call(url, field):
    '''Fetches data from API into DataFrame
       Dependencies: requests, pandas'''
    
    response = requests.get(url)
    df = pd.DataFrame(response.json()[field])
    return df


In [2]:
# Get data on machines
machines = api_call('https://pinballmap.com/api/v1/machines.json', 'machines')
machines.head()

Unnamed: 0,created_at,id,ipdb_id,ipdb_link,is_active,machine_group_id,manufacturer,name,opdb_id,updated_at,year
0,2016-08-05T17:52:19.366-07:00,2694,6417.0,https://www.ipdb.org/machine.cgi?id=6417,False,58.0,Spooky Pinball,Rob Zombie's Spookshow International (LE),G5pp2-MBRK4,2018-07-17T11:31:02.934-07:00,2016
1,,676,5163.0,http://ipdb.org/machine.cgi?id=5163,False,,Stern,Pirates of the Caribbean,GR7ZX-MQ23b,2018-07-16T09:00:50.066-07:00,2006
2,2014-02-26T11:22:21.790-08:00,1968,483.0,http://ipdb.org/machine.cgi?id=483,False,,Gottlieb,Challenger,G50L9-MDxXD,2018-07-16T09:00:41.586-07:00,1971
3,2013-06-28T20:55:47.875-07:00,1679,527.0,http://ipdb.org/machine.cgi?id=527,False,,Bally,City Slicker,GrEVb-MLOxJ,2018-07-16T09:00:41.673-07:00,1987
4,,727,874.0,http://ipdb.org/machine.cgi?id=874,False,,Bally,Flash Gordon,G5728-MDbjD,2018-07-16T09:00:41.689-07:00,1980


In [36]:
# Check data types and missing values
machines.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1153 entries, 0 to 1152
Data columns (total 11 columns):
created_at          750 non-null object
id                  1153 non-null int64
ipdb_id             1086 non-null float64
ipdb_link           1153 non-null object
is_active           872 non-null object
machine_group_id    138 non-null float64
manufacturer        1153 non-null object
name                1153 non-null object
opdb_id             1076 non-null object
updated_at          1153 non-null object
year                1153 non-null int64
dtypes: float64(2), int64(2), object(7)
memory usage: 99.2+ KB


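The printout above reveals that several fields have a lot of missing values: `machine_group_id` is populated for only 138 of the 1153 machines, and `created_at`, `is_active`, `ipdb_id`, and `opdb_id` also have gaps. I'll check this again once I join the rest of the data.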
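As a rough sketch of that later step, the join and the follow-up missing-value check could look something like the code below. The frames `machines_demo` and `scrape_demo` are hypothetical stand-ins with made-up values, not real API or IPDB data. The sketch assumes the scraped key is named `ipd_id` and casts it to `float64` so that it matches the dtype of `ipdb_id` in the API response.

```python
import pandas as pd

# Hypothetical stand-in for the API response (values are made up)
machines_demo = pd.DataFrame({
    'name': ['Machine A', 'Machine B', 'Machine C'],
    'ipdb_id': [101.0, 102.0, None],   # float64 because of the missing ID
})

# Hypothetical stand-in for the scraped data (values are made up)
scrape_demo = pd.DataFrame({
    'ipd_id': [101, 102],
    'rating': [8.0, None],
    'units': [1000.0, 250.0],
})

# Left join keeps every machine, even those with no scraped data
merged = machines_demo.merge(
    scrape_demo.astype({'ipd_id': 'float64'}),
    how='left', left_on='ipdb_id', right_on='ipd_id')

# Count missing values per column after the join
missing = merged.isna().sum()
print(missing)
```

A left join is used here so that machines without an IPDB page stay in the result and show up as missing values rather than disappearing.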

# Define scraping functions

For now, I'm just writing some functions to scrape the data I want from the Internet Pinball Database. Later, I may come back and refactor these functions as methods of a class.

In [7]:
# Define functions to scrape desired data from pages

# Define a function to scrape a page into a bs4 object
def get_data_table(url):
    '''Returns main table data from web page as type bs4.element.ResultSet.
       Dependencies: bs4.BeautifulSoup, requests'''
    
    html_page = requests.get(url)
    soup = BeautifulSoup(html_page.content, 'html.parser')
    table = soup.find('table')
    # The data table is the ninth sibling node after the first <table>
    table_contents = table
    for _ in range(9):
        table_contents = table_contents.next_sibling
    table_data = table_contents.find_all('tr')
    return table_data
       
# Define a function to extract the IPD ID from a url
def get_id(url):
    '''Extracts machine's IPD ID from url.'''
    idx = int(re.findall(r'=(\d+)', str(url))[0])
    return idx

# Define a function to extract the star rating
def get_rating(table_data):
    '''Returns game rating from table data.
       Dependencies: bs4.BeautifulSoup, re'''
    
    if (('No ratings on file' in str(table_data)) 
        or ('Needs More Ratings!' in str(table_data))):
        value = None
    else: 
        value = int(re.findall(r'/(\d+)stars.png', str(table_data))[0])
    return value

# Define a function to extract the date of manufacture
def get_date(table_data):
    '''Returns date of manufacture from table data.
       Dependencies: bs4.BeautifulSoup, re'''
    
    if (('Date of Manufacture' in str(table_data)) 
        or ('Project Date' in str(table_data))):
        date = int(re.findall(r',\s(\d{4})', str(table_data))[0])
        return date
    else:
        return None

# Define a function to extract number of flippers
def get_flippers(table_data):
    '''Returns number of flippers from table data.
       Dependencies: bs4.BeautifulSoup, re'''
    
    if 'Flippers</a> (' in str(table_data):
        flippers = (int(str(table_data).split('Flippers</a> (')[1]
                        .split(')')[0]))
        return flippers
    else:
        return None

# Define a function to extract number of units manufactured
def get_units(table_data):
    '''Returns number of units manufactured from table data.
       Dependencies: bs4.BeautifulSoup, re'''
    
    if 'Production:' in str(table_data):
        # Match counts of any size, with or without thousands separators
        units = int(re.findall(r'>(\d[\d,]*)\sunits',
                               str(table_data))[0].replace(',', ''))
        return units
    else:
        return None

# Define a function to scrape all the desired data into a DataFrame
def scrape_urls(urls):
    '''Pulls desired data into a DataFrame
       Dependencies: bs4.BeautifulSoup, time, requests, re, pandas'''
    
    # Initialize lists
    ids = []
    ratings = []
    dates = []
    flippers = []
    units = []
        
    # Scrape each page and collect data
    for url in urls:
        # Skip machines that have no IPDB link
        if not url:
            continue

        # Scrape one page; on error, print the offending url and move on
        try:
            table_data = get_data_table(url)
            row = (get_id(url), get_rating(table_data), get_date(table_data),
                   get_flippers(table_data), get_units(table_data))
        except (ValueError, KeyError, TypeError, IndexError, AttributeError):
            print(url)
            continue

        # Append only complete rows so the five lists stay aligned
        ids.append(row[0])
        ratings.append(row[1])
        dates.append(row[2])
        flippers.append(row[3])
        units.append(row[4])

        # Pause to avoid jamming server
        time.sleep(0.1)
    
    # Concatenate the results into a DataFrame
    results = pd.concat([pd.Series(ids, name='ipd_id'), 
                         pd.Series(ratings,  name='rating'),
                         pd.Series(dates, name='date'),
                         pd.Series(flippers, name='flippers'),
                         pd.Series(units, name='units')], axis=1)
    return results

The last function, `scrape_urls`, wraps the previous ones and returns a DataFrame containing the IPD ID, rating, date of manufacture, number of flippers, and number of units manufactured for each machine represented in the API response above. Note that I included a 0.1-second pause after scraping each page to avoid overloading the server.
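To see what the extraction helpers are doing without hitting the server, here is a small offline sketch. The HTML fragment `sample_html` is hypothetical and only imitates the strings the functions search for; the real IPDB markup may differ. The patterns are modeled on the ones in the functions above, with the units pattern loosened slightly to accept counts of any size.

```python
import re

# Hypothetical fragment imitating the strings the helpers look for
sample_html = (
    '<tr><td><img src="/images/8stars.png"></td></tr>'
    '<tr><td>Date of Manufacture: March 12, 1980</td></tr>'
    '<tr><td><a href="#">Flippers</a> (2)</td></tr>'
    '<tr><td>Production:</td><td>1,500 units</td></tr>'
)

# Star rating: the digits just before "stars.png" in the image path
rating = int(re.findall(r'/(\d+)stars.png', sample_html)[0])

# Year: the first four-digit number that follows a comma and a space
year = int(re.findall(r',\s(\d{4})', sample_html)[0])

# Flippers: the number in parentheses right after the "Flippers" link
flippers = int(sample_html.split('Flippers</a> (')[1].split(')')[0])

# Units: digits (with optional commas) before " units", commas stripped
units = int(re.findall(r'>(\d[\d,]*)\sunits', sample_html)[0].replace(',', ''))

print(rating, year, flippers, units)
```

Each helper takes the first match only, so an unrelated earlier match on the page (for example, another comma followed by four digits) would be picked up instead.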

# Test scraping functions

Before scraping all 1153 urls that I need, let's do a quick test to make sure things are working as expected.

In [8]:
# Get a group of urls to scrape
sample_urls = list(machines['ipdb_link'][30:40])
sample_urls

['http://ipdb.org/machine.cgi?id=3667',
 'http://ipdb.org/machine.cgi?id=4692',
 'http://www.ipdb.org/machine.cgi?id=1622',
 'http://ipdb.org/machine.cgi?id=4358',
 'http://ipdb.org/machine.cgi?id=1871',
 'http://ipdb.org/machine.cgi?id=828',
 'http://ipdb.org/machine.cgi?id=2506',
 'http://ipdb.org/machine.cgi?id=4540',
 'http://ipdb.org/machine.cgi?id=2165',
 'http://www.ipdb.org/machine.cgi?id=2355']

In [9]:
# Scrape selected urls
sample_df = scrape_urls(sample_urls)
sample_df.head(10)

Unnamed: 0,ipd_id,rating,date,flippers,units
0,3667,,1965.0,2.0,825.0
1,4692,,,2.0,470.0
2,1622,7.0,,2.0,4315.0
3,4358,8.0,,2.0,1369.0
4,1871,7.0,,2.0,
5,828,8.0,,2.0,8045.0
6,2506,8.0,,2.0,1600.0
7,4540,,,,
8,2165,7.0,1978.0,,10320.0
9,2355,7.0,1979.0,2.0,16842.0


There are a lot of NaNs, but I can resolve those along with the rest of the missing values later. For now, this looks good. 

# Scrape data for all pinball machines

Now I'm ready to scrape all the machine metadata I need. Based on the execution time of the test run, I'm expecting the cell below to take about 49 minutes to run.

In [10]:
# Scrape data for all machines from API response
urls = list(machines['ipdb_link'])
scrape_data = scrape_urls(urls)
scrape_data.head()
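A runtime estimate like the one mentioned before this cell can be obtained by timing a small sample and extrapolating. The helper below is a minimal sketch of that idea. The name `estimate_minutes` and the `fetch` parameter are hypothetical; `fetch` stands for any function that scrapes a single url.

```python
import time

def estimate_minutes(fetch, sample_urls, total_urls, delay=0.1):
    '''Times fetch() over sample_urls and extrapolates to total_urls pages.
       Returns the estimated total run time in minutes.'''

    start = time.perf_counter()
    for url in sample_urls:
        fetch(url)
        # Include the same pause used in the real scraping loop
        time.sleep(delay)
    elapsed = time.perf_counter() - start

    # Average seconds per page, scaled up to the full url list
    seconds_per_page = elapsed / len(sample_urls)
    return seconds_per_page * total_urls / 60

# Example with a do-nothing fetch function standing in for a real scraper
estimate = estimate_minutes(lambda url: None, ['a', 'b'], 100, delay=0.01)
print(round(estimate, 3))
```

The estimate assumes every page takes about as long as the sampled ones, so it is only a rough guide when response times vary.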