# Contestant Data Scraper

Everyone should watch the Bachelor(ette). I blindly live by this axiom and assume that everyone understands the beauty that is *Sister Wives* meets the *Dating Game* meets the Jerry Springer hosted *Baggage* (if you haven't seen this, take a weekend and get back to me). And yes, I will shut down all of you sexists out there who believe this show is better suited for the female audience. That's like saying that sports are only for men, and if you believe this then you are most likely a douchebag and your opinion is nullified by default.

But let's get down to the point. A countless number (exaggeration) of people who watch this show attempt to predict who is going home, when they will go home, and, of course, who is going to win the season. Well, this is the first step in an attempt to build a probabilistic model that will forecast the odds of each contestant by learning from past season patterns. However, there is little structured data about the television show, so I had to collect it myself. Well, I had to set up a web scraper to do it for me.

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
import html5lib

To begin, I first need to collect all of the contestant data that is currently available. This can be refreshed for the current season every week as updates are made. The easiest way to do this was to go to Wikipedia. Let's first collect the URLs of those seasons that have data available...

#### The Bachelor

In [2]:
# seasons 9 through 20 (np.arange excludes the stop value)
bachelor_seasons = np.arange(9,21)

bachelor_urls = []
for season in bachelor_seasons:
    season_url = 'https://en.wikipedia.org/wiki/The_Bachelor_(season_{})'.format(season)
    bachelor_urls.append(season_url)

#### The Bachelorette

In [3]:
# season 2, then seasons 4 through 12 (np.arange excludes the stop value)
bachelorette_seasons = np.arange(4,13)
bachelorette_seasons = np.insert(bachelorette_seasons,0,2)

bachelorette_urls = []
for season in bachelorette_seasons:
    season_url = 'https://en.wikipedia.org/wiki/The_Bachelorette_(season_{})'.format(season)
    bachelorette_urls.append(season_url)
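
As a quick sanity check of the URL pattern used in the two loops above, here is a minimal sketch that builds one URL and pulls the season number back out of it with a regular expression, which is the same trick `dataCollector` relies on later. The `season_url` helper is a hypothetical name I made up for this example; it is not part of the notebook.

```python
import re

def season_url(show, season):
    """Build a Wikipedia season URL (hypothetical helper, not in the notebook)."""
    return 'https://en.wikipedia.org/wiki/{}_(season_{})'.format(show, season)

url = season_url('The_Bachelorette', 12)
print(url)

# the only digits in the URL are the season number, so joining every
# run of digits gives the season back as a string
season = ''.join(re.findall(r'\d+', url))
print(season)
```

Note that the season comes back as a string, not an integer, so it would need converting before doing any arithmetic on it.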

So the next two functions aren't really going to make sense at the moment, but essentially I need them to clean up some of the data. The `getDigits` function simply collects any numerical digits in a string. The `hasNumbers` function just checks to see if the string has digits in it. This will come in handy if you look closely at the `dataCollector` function.

In [84]:
def getDigits(str1):
    # keep only the digit characters, in their original order
    return "".join(ch for ch in str1 if ch.isdigit())

In [89]:
def hasNumbers(inputString):
    return any(char.isdigit() for char in inputString)
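
To make these two helpers a little less abstract, here is a small sketch that runs them on a few strings. The sample strings are made up for illustration (they are my guess at what an elimination cell could look like, not actual scraped values), and the two functions are repeated so the snippet runs on its own.

```python
def getDigits(str1):
    # keep only the digit characters, in their original order
    return "".join(ch for ch in str1 if ch.isdigit())

def hasNumbers(inputString):
    return any(char.isdigit() for char in inputString)

# illustrative strings only, not real scraped data
samples = ['Week 3', 'Week 2[5]', 'Winner', 'Runner-up']

for s in samples:
    print(repr(s), hasNumbers(s), repr(getDigits(s)))
```

The second sample is the interesting one: a footnote marker like `[5]` contributes its own digit, so `getDigits` returns `'25'` rather than `'2'`.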

There are going to be a string of functions involved in grabbing the data and outputting it to a dataframe and whatnot. Those are all below. This first one simply fetches the page at a given URL and returns the HTML as a BeautifulSoup object, parsed with the lxml parser.

In [4]:
def makeSoup(url):
    response=requests.get(url)
    soup=BeautifulSoup(response.content,"lxml")
    return soup
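
To see what kind of object comes back without hitting the network, here is a minimal sketch that parses a tiny hard-coded table. The HTML, the names in it, and the two-column header are all invented for this example, and I use Python's built-in `html.parser` here so the snippet does not depend on lxml being installed.

```python
from bs4 import BeautifulSoup

# made-up HTML standing in for a downloaded page
html = """
<table>
  <tr><th>Name</th><th>Age</th></tr>
  <tr><td>Jane D.</td><td>27</td></tr>
  <tr><td>John S.</td><td>30</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
table = soup.find_all('table')[0]

# one list of cell texts per table row; the header row only has <th>
# cells, so it comes back as an empty list
rows = [[td.get_text(strip=True) for td in tr.find_all('td')]
        for tr in table.find_all('tr')]
print(rows)

# skip the empty header row, then flip rows into columns
header = ['name', 'age']
body = rows[1:]
columns = dict(zip(header, zip(*body)))
print(columns)
```

The empty first row is the reason a scraper built this way has to skip one row before transposing the table into columns.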

Here is where all of the magic happens! Okay, not really. This is the very important and very boring part: cleaning the data as it is read in, then outputting it to a dictionary so that it can easily be converted to other, more flexible structures.

In [111]:
def dataCollector(url):
    import re
    # fetch the page at this url and make the soup
    this = makeSoup(url)
    
    # find the right table (the second table on the page is used)
    tables = this.findChildren('table')
    table = tables[1]
    
    # turn the data into a workable format: one list of cell texts per row
    data   = [[td.text for td in row.select('td')]
             for row in table.findAll('tr')]
    
    # create the header row and the body (the first row is skipped)
    header = ['name','age', 'hometown', 'occupation', 'elimination']
    body = data[1:]
    # transpose the rows into columns
    cols   =  zip(*body)
    
    # create a dict with the data
    tbl_d  = {name:col for name, col in zip(header,cols)}
    
    # extract the season number from the url
    number = re.findall(r'\d+', url)
    
    # join the matched digits into a single string
    num= ''.join(number)
    
    # create a new key of seasons
    tbl_d['season'] = [num] * len(tbl_d['age'])
    
    new_names = []
    for name in tbl_d['name']:
        cleaned_name = re.sub(r'\[\w+\]', ' ', name)
        new_names.append(cleaned_name)

    tbl_d['name'] = new_names
    
    # find the first name with last name abbreviation
    name_abbreviation = []
    
    for name in tbl_d['name']:
        parts = name.split(" ")

        # drop parenthesised pieces such as "(Nickname)"
        cleaned_parts = []
        for part in parts:
            cleaned_parts.append(re.sub(r'\(\w+\)', ' ', part))
        # remove blank space items
        cleaned_parts = [part for part in cleaned_parts if part.strip()]

        if len(cleaned_parts) == 1:
            new_name = cleaned_parts[0]
        else:
            new_name = "{} {}.".format(cleaned_parts[0], cleaned_parts[-1][0])
        name_abbreviation.append(new_name)
        
    tbl_d['name_abbreviation'] = name_abbreviation
    
    # clean up the hometowns
    new_hometowns = []
    
    for hometown in tbl_d['hometown']:
        cleaned_hometown = re.sub(r'\[\w+\]', ' ', hometown)
        new_hometowns.append(cleaned_hometown)

    tbl_d['hometown'] = new_hometowns
    
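    # find just the number of the elimination week
    # note: only the first digit is kept, which drops footnote digits
    # ("Week 2[5]" gives "2") but would also cut a two-digit week down to one digit
    elimination_week = []

    for item in tbl_d['elimination']:
        if hasNumbers(item):
            digits = getDigits(item)
            episode = digits[0]
        else:
            # no digits at all, so keep the text as it is
            episode = item
        elimination_week.append(episode)
    tbl_d['elimination'] = elimination_week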
    
    # return dictionary
    return tbl_d
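
Since `dataCollector` needs a live page to run, here is a rough sketch of its two fiddliest cleaning steps applied to plain strings. The `abbreviate` and `first_week_digit` helpers are hypothetical stand-alone versions I wrote for this example (they are not in the notebook), and the inputs are invented rather than real scraped values.

```python
import re

def abbreviate(raw_name):
    """Hypothetical stand-alone version of the name clean-up step."""
    # replace footnote markers such as "[3]" with a space
    no_footnotes = re.sub(r'\[\w+\]', ' ', raw_name)
    # drop parenthesised pieces, then throw away blank leftovers
    parts = [re.sub(r'\(\w+\)', ' ', p) for p in no_footnotes.split(' ')]
    parts = [p for p in parts if p.strip()]
    if len(parts) == 1:
        return parts[0]
    # first name plus the initial of the last name
    return '{} {}.'.format(parts[0], parts[-1][0])

def first_week_digit(raw):
    """Hypothetical stand-alone version of the elimination step."""
    digits = ''.join(ch for ch in raw if ch.isdigit())
    # keep only the first digit, or the original text if there is none
    return digits[0] if digits else raw

print(abbreviate('Jane Doe[3]'))
print(abbreviate('Cher'))
print(first_week_digit('Week 2[5]'))
print(first_week_digit('Winner'))
print(first_week_digit('Week 10'))
```

The last call is worth a look: keeping only the first digit turns a hypothetical `'Week 10'` into `'1'`, so this approach is only safe as long as the week numbers stay in single digits.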

This final function is meant to tie it all together and output it to a dataframe...

In [112]:
def frameMaker(urls):
    frames = []

    for url in urls:
        dictionary = dataCollector(url)
        frame = pd.DataFrame(dictionary)
        frames.append(frame)

    combined = pd.concat(frames)
    
    return combined.reset_index()
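
One detail of `frameMaker` that is easy to miss is what `reset_index()` does after `pd.concat`. Here is a minimal sketch with two made-up frames standing in for two seasons (the names and season numbers are invented) that shows the difference between `reset_index()` and `reset_index(drop=True)`.

```python
import pandas as pd

# two tiny made-up frames standing in for two scraped seasons
season_a = pd.DataFrame({'name': ['Jane D.', 'John S.'], 'season': ['9', '9']})
season_b = pd.DataFrame({'name': ['Alex P.'], 'season': ['10']})

combined = pd.concat([season_a, season_b])
# each frame keeps its own row labels, so labels repeat
print(combined.index.tolist())

# reset_index() renumbers the rows and keeps the old labels
# in a new column called 'index'
with_index_col = combined.reset_index()
print(with_index_col.columns.tolist())

# drop=True renumbers the rows without keeping the old labels
dropped = combined.reset_index(drop=True)
print(dropped.columns.tolist())
```

With plain `reset_index()`, the extra `index` column holds each contestant's row position within their season's table, and it will be written out along with the other columns when the frame is saved to CSV.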

In [116]:
bachelorette_frame = frameMaker(bachelorette_urls)

In [113]:
bachelor_frame = frameMaker(bachelor_urls)

In [120]:
bachelorette_frame.to_csv('data/bachelorette_contestants.csv', index = False)

In [121]:
bachelor_frame.to_csv('data/bachelor_contestants.csv', index = False)