Name: Kyle Salgado-Gouker <br>
Date: October 7, 2023 <br>
Class: DSC540 - Professor Williams <br>
Project: Milestone 2

### Cleaning/Formatting Flat File Source

Perform at least 5 data transformation and/or cleansing steps to your flat file data. The below examples are not required - they are just potential transformations you could do. If your data doesn't work for these scenarios, complete different transformations. You can do the same transformation multiple times if needed to clean your data. The goal is a clean dataset at the end of the milestone.

* Replace Headers
* Format data into a more readable format
* Identify outliers and bad data
* Find duplicates
* Fix casing or inconsistent values
* Conduct Fuzzy Matching

Make sure you clearly label each transformation (Step #1, Step #2, etc.) in your code and describe what it is doing in 1-2 sentences. You can submit a Jupyter Notebook or a PDF of your code. If you submit a .py file you need to also include a PDF or attachment of your results.
Milestone 2 is due Sunday, by Midnight of Week 6. Refer to the rubric for more grading detail.

In [1]:
import time

# file system searches etc
import os
from os.path import basename, exists
import glob

# regular expressions
import re
import math
# for sampling (will be required)
import random

# data frames
import pandas as pd
# smart arrays etc (will be required)
import numpy as np

# web access and html parsing (urllib, its submodules)
import requests
import urllib
import urllib.request
import urllib.error
from urllib.request import urlretrieve
# for a workaround.
import ssl

# parser of web pages
from bs4 import BeautifulSoup
# more efficient parsing.
import lxml

# for plots (will be required)
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D

# Fuzzy string matching
# If necessary, here are the installation commands.
# !pip install SciPy
# !pip install python-Levenshtein
# !pip install fuzzywuzzy
from fuzzywuzzy import fuzz

# for accessing sql (will be required)
import sqlite3

# fancy table printing
from tabulate import tabulate

In [2]:
# For testing: Make warnings fatal.

import warnings
warnings.filterwarnings("error")

In [3]:
# Store final project data in its own directory.

FINAL_DATA_DIRECTORY = "data/final"

# Check if the directory exists
if not os.path.exists(FINAL_DATA_DIRECTORY):
    # If it doesn't exist, make it
    os.makedirs(FINAL_DATA_DIRECTORY)
    print(f"Directory '{FINAL_DATA_DIRECTORY}' created.")


In [4]:
# constants for accessing files and the web.
QUEUE_TIMES_API = "https://queue-times.com/en-US/pages/api"
WIKIPEDIA_PARK_RANKINGS = "https://en.wikipedia.org/wiki/List_of_amusement_park_rankings"
WIKIPEDIA_PARK_RANKINGS_FILE = FINAL_DATA_DIRECTORY+"/amusement_park_rankings.html"
RCDB_CSV_FILE = FINAL_DATA_DIRECTORY+"/coaster_db.csv"
RCSB_URL = "https://github.com/RobMulla/twitch-stream-projects/blob/main/001-rollercoaster-dataset/dbv1.csv"
QUEUE_TIMES_PARK_LIST_URL = "https://queue-times.com/en-US/parks?group=country"

In [5]:
# web download function

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36"
}

# download: A good citizen download function
#     url - the url accessed
#     destination - local file to write
#
# respects code 429 and waits instead of pounding.

# Function to disable SSL certificate verification
def disable_ssl_verification():
    ssl._create_default_https_context = ssl._create_unverified_context

# Call the function to disable SSL verification
# This is to workaround an SSL certificate error I am getting.
disable_ssl_verification()

def download(url, destination, secure=True):
    try:
        # Send a GET request with headers
        response = requests.get(url, headers=headers, verify=secure)
    except requests.exceptions.RequestException:
        # requests raises its own exception types, not urllib.error.HTTPError
        print("Failed to download " + url)
        return
    # Check if the request was successful
    if response.status_code == 200:
        try:
            with open(destination, 'w') as f:
                f.write(response.text)
            print("Downloaded " + destination)
        except OSError:
            print("Error writing " + destination)
    elif response.status_code == 429:
        # Extract the Retry-After header value
        # This is to avoid hammering sites.
        retry_after = response.headers.get("Retry-After")
        if retry_after and retry_after.isdigit():
            # Retry-After in seconds (it can also be an HTTP date, which is not handled here)
            retry_after_seconds = int(retry_after)
            print("Rate limit exceeded. Waiting for " + str(retry_after_seconds) + " seconds.")
            time.sleep(retry_after_seconds)
            # Retry the request after waiting, keeping the same SSL setting
            download(url, destination, secure)
        else:
            print("Rate limit exceeded. No usable Retry-After header found.")
    else:
        print("Website returned " + str(response.status_code))
    return

def downloadFile(url, filename):
    # Download only when there is no local copy yet.
    if not exists(filename):
        local, _ = urlretrieve(url, filename)
        print("Downloaded " + local + "\n")
        return local, _
    # Already downloaded: return the same shape so callers can always unpack.
    return filename, None

def downloadRawFile(url, filename):
    if not os.path.exists(filename):
        # Ask for the raw content by appending "?raw=true" to the page URL.
        raw_url = url + "?raw=true"
        # Download the raw content
        local, _ = downloadFile(raw_url, filename)
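
The "wait instead of pounding" idea in `download` can also be written as a loop with a fixed number of attempts. The sketch below is only an illustration, not the code used in this notebook: `fetch_with_retry`, its parameters, and the fake responses are hypothetical names, and `fetch` stands in for a real `requests.get` call so the logic can be tried without touching a web site.

```python
import time

def fetch_with_retry(fetch, max_attempts=3, default_wait=1.0, sleep=time.sleep):
    """Call fetch() until it returns a non-429 status or the attempts run out.

    fetch is any zero-argument callable returning (status_code, headers, body).
    """
    for _ in range(max_attempts):
        status, headers, body = fetch()
        if status != 429:
            return status, body
        # Retry-After may be missing or not a plain number of seconds;
        # fall back to a default wait in that case.
        raw = headers.get("Retry-After", "")
        wait = float(raw) if raw.isdigit() else default_wait
        sleep(wait)
    return 429, None

# Usage with canned responses instead of a real web site.
responses = [(429, {"Retry-After": "2"}, None), (429, {}, None), (200, {}, "ok")]
waits = []
result = fetch_with_retry(lambda: responses.pop(0), sleep=waits.append)
print(result, waits)
```

Capping the attempts keeps a site that always answers 429 from stalling the notebook indefinitely.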


#### Useful functions

* These are mostly for output.
* Most are not used yet. (will be required)

In [6]:
def pretty_print_df(df, rows=None):
    if rows is not None:
        df = df.head(rows)  # Use head() to limit the DataFrame to the specified number of rows
    # Use Tabulate to show the data.
    print(tabulate(df, headers='keys', tablefmt='pretty', showindex=False))
    print("\n" + "="*40 + "\n")  # Separation between DataFrames

# Prints a title decorated by stars.
def formatFancyTitle(title):
    # Calculate the length of the title
    title_length = len(title)
    # format title with decoration
    title = "*" * (title_length + 4) + "\n" + f"* {title} *" + "\n" + "*" * (title_length + 4) + "\n"
    return title

def formatTestStat(value, dec=6):
    format_string = "{:."+str(dec)+"f}"
    return format_string.format(value)

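# Pillow supplies the image classes used by the draw* helpers below.
from PIL import Image, ImageDraw, ImageFont

# white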
TABLE_BACKGROUND_COLOR = (255, 255, 255)
# black
TABLE_FONT_COLOR = (0, 0, 0)
TABLE_FONT_SIZE = 12
TABLE_WIDTH = 600
TABLE_HEIGHT = 800
TYPEFACE_FILE = "/Library/Fonts/Menlo.ttc"

def drawText(image_filename, text, title = "", width=TABLE_WIDTH, height=TABLE_HEIGHT, background_color=TABLE_BACKGROUND_COLOR, font_name=TYPEFACE_FILE, font_color=TABLE_FONT_COLOR, font_size=TABLE_FONT_SIZE):
    # Create an image with white background
    image = Image.new('RGB', (width, height), background_color)
    # Set the font style and size
    font = ImageFont.truetype(font_name, font_size)
    # Create a drawing context
    draw = ImageDraw.Draw(image)
    # Calculate the position to start drawing the table
    x, y = 10, 10
    # Add an optional title.
    if len(title) > 0:
        text = formatFancyTitle(title) + "\n" + text
    # Draw the table onto the image
    draw.text((x, y), text, font=font, fill=font_color)
    # Save the image
    image.save(image_filename)
    return text
    
def drawTable(image_filename, table, title="", width=TABLE_WIDTH, height=TABLE_HEIGHT, background_color=TABLE_BACKGROUND_COLOR, font_name=TYPEFACE_FILE, font_color=TABLE_FONT_COLOR, font_size=TABLE_FONT_SIZE):
    text = drawText(image_filename, table, title, width, height, background_color, font_name, font_color, font_size)
    print(text)
    return text
    
def drawReport(image_filename, text, title="", width=TABLE_WIDTH, height=TABLE_HEIGHT, background_color=TABLE_BACKGROUND_COLOR, font_name=TYPEFACE_FILE, font_color=TABLE_FONT_COLOR, font_size=TABLE_FONT_SIZE):
    if len(title) > 0:
        drawText(image_filename, text, title, width, height, background_color, font_name, font_color, font_size)
        print(formatFancyTitle(title))
    else:
        drawText(image_filename, text)
    print(text)
    return text
    
# formats long tables side by side.    
def combineTables(table1, table2, table3):
    # Split the input strings into rows
    table1_rows = table1.strip().split("\n")
    table2_rows = table2.strip().split("\n")
    table3_rows = table3.strip().split("\n")
    
    max_row_count = max(len(table1_rows), len(table2_rows), len(table3_rows))
    combined_table = ""
    
    for row_idx in range(max_row_count):
        # Get the corresponding rows from each table
        table1_row = table1_rows[row_idx] if row_idx < len(table1_rows) else ""
        table2_row = table2_rows[row_idx] if row_idx < len(table2_rows) else ""
        table3_row = table3_rows[row_idx] if row_idx < len(table3_rows) else ""

        # Combine the rows into a single row
        combined_row = f"{table1_row} {table2_row} {table3_row}".strip()

        # Add the combined row to the overall table
        combined_table += combined_row + "\n"

    return combined_table



#### Download the data files for this project (so far)

Keep local copies, for two reasons:
* The sites may not be accessible this weekend.
* The second file sometimes changes, and a local copy means no surprises.

In [13]:
# Get wikipedia ranking page.
downloadRawFile(WIKIPEDIA_PARK_RANKINGS, WIKIPEDIA_PARK_RANKINGS_FILE)
# Get flat file.
downloadRawFile(RCSB_URL, RCDB_CSV_FILE)

# Workaround weird character conversion issue that started recently.
encoding = "utf-8"
# Read the HTML from the URL. Force "utf-8" encoding to workaround issue.
with open(WIKIPEDIA_PARK_RANKINGS_FILE, 'r', encoding = encoding) as rankings_file:
    # Read the contents of the file into a buffer
    rankings_html = rankings_file.read()
    

Downloaded data/final/coaster_db.csv



#### First, fix the issue with the missing column header.

Note: This seems to be a new problem; the file I analyzed earlier did not have it.

### Transformation 1: Fix the bad column header.

In [14]:
# Load the CSV file into a string buffer
with open(RCDB_CSV_FILE, 'r', encoding = encoding) as file:
    csv_data = file.read()

# If the file is missing the first column header, fix it.
if csv_data.startswith(','):
    print("Fixing csv file: ", RCDB_CSV_FILE)
    csv_data = "Ride name" + csv_data
    # Save the modified CSV data.
    with open(RCDB_CSV_FILE, 'w', encoding = encoding) as file:
        file.write(csv_data)


Fixing csv file:  data/final/coaster_db.csv
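
As an alternative to patching the file on disk, the missing header could be repaired after loading. This is a sketch under assumptions, not what the notebook does: the sample CSV text below is made up, and it relies on pandas labelling an empty header cell `Unnamed: 0`.

```python
import io
import pandas as pd

# Hypothetical sample that mimics the broken file: the first header is empty.
sample_csv = ",Location\nCobra,Tivoli Friheden\nComet,Waldameer Park\n"

# pandas names the empty header "Unnamed: 0", so it can be renamed after loading.
sample_df = pd.read_csv(io.StringIO(sample_csv)).rename(columns={"Unnamed: 0": "Ride name"})
print(list(sample_df.columns))
```

Rewriting the file, as done above, has the advantage that every later cell can load it with `usecols` and the correct header name.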


#### Now the csv file can be used.

#### The flat file has recently developed new issues. It is still usable, but it needs some fixes first.

### Transformation 2: Drop the columns we don't need.

* Column 0: Ride name (keep). The file is missing this header and starts with a comma.
* Columns 1-2: Drop
* Column 3: Park Name (keep)
* Columns 4-5: Drop
* Column 6: Opening Date (keep)
* Columns 7-8: Drop
* Column 9: Manufacturer (keep)
* Columns 10-11: Drop
* Column 12: Height (keep)
* Column 13: Length (keep)
* Column 14: Speed (keep)
* Column 15: Inversions (keep)
* Column 16: Duration (keep)
* Column 17: Capacity (keep)
* Column 18: Height Restriction (keep)
* Columns 19-21: Drop
* Column 22: Cost (keep)
* Columns 23-end: Drop


In [15]:
# Read the csv file with pandas and make a data frame.

columns_to_load = ["Ride name", "Location", "Opening date", "Type", "Manufacturer", "Height", "Length",
 "Speed", "Inversions", "Duration", "Capacity", "Height Restriction", "Cost", 
 "Drop", "Max vertical angle", "G-force"]

# These columns will all need converting to more useful types later; for now, load every one of them as str.
dtype_options = {
    "Ride name": str,
    "Location": str,
    "Opening date": str,
    "Type": str, 
    "Manufacturer": str,
    "Height": str,
    "Length": str,
    "Speed": str,
    "Inversions": str,
    "Duration": str,
    "Capacity": str,
    "Height Restriction": str,
    "Cost": str, 
    "Drop": str,
    "Max vertical angle": str,
    "G-force": str
}

rcdb_df = pd.read_csv(RCDB_CSV_FILE, usecols=columns_to_load, dtype=dtype_options)
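
Since every column is loaded as `str`, the dtype mapping can be derived from the column list instead of being typed out twice. A minimal sketch, assuming a shortened, made-up column list and sample CSV text:

```python
import io
import pandas as pd

# Shortened, hypothetical subset of the real column list.
columns = ["Ride name", "Height", "Speed"]
# Build the dtype mapping from the same list that is passed to usecols.
dtype_map = {name: str for name in columns}

sample_csv = "Ride name,Height,Speed,Unused\nCobra,012,50.5,x\n"
sample_df = pd.read_csv(io.StringIO(sample_csv), usecols=columns, dtype=dtype_map)

# Values stay exactly as written in the file, leading zero included.
print(sample_df.loc[0, "Height"])
```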


#### Time to have a closer look at the flat file.


In [16]:

desired_columns = ['Ride name', 'Location']
sorted_df = rcdb_df.sort_values(by='Ride name')

print(tabulate(sorted_df[desired_columns], headers='keys', tablefmt='pretty', showindex=False))


+------------------------------------------------------------------+-----------------------------------------+
|                            Ride name                             |                Location                 |
+------------------------------------------------------------------+-----------------------------------------+
|     ("Super Grover's Box Car Derby", 'SeaWorld San Antonio')     |                   nan                   |
|          ('Canyon Blaster', 'Six Flags Magic Mountain')          |        Six Flags Magic Mountain         |
|                   ('Cobra', 'Tivoli Friheden')                   |             Tivoli Friheden             |
|                   ('Comet', 'Waldameer Park')                    |             Waldameer Park              |
|              ('Crazy Bird', 'Happy Valley Tianjin')              |          Happy Valley Tianjin           |
|              ('Desmo Race', 'Mirabilandia (Italy)')              |                   nan                   |

### Transformation 3: Fix the first column's data.

* Some Ride name values are tuple-like strings such as `('Cobra', 'Tivoli Friheden')`. Split each one into a clean Ride name and Location.
* Left as they are, these garbage Ride names (the ones in parentheses) will not match Queue Times rides.

In [17]:
def parse_ride_name_and_location(row):
    # Pattern for values such as ("Ride", 'Park'):
    #   ^\(                    - the field must start with '('
    #   (?:"(.*?)"|\'(.*?)\')  - the text between double or single quotes (the ride)
    #   \s*,\s*                - the comma between the two quoted strings, with optional spaces
    #   (?:"(.*?)"|\'(.*?)\')  - again, the text between double or single quotes (the park)
    #   \)$                    - a closing parenthesis at the end of the string
    pattern = r'^\((?:"(.*?)"|\'(.*?)\')\s*,\s*(?:"(.*?)"|\'(.*?)\')\)$'

    # Check if the data in 'Ride name' matches the above pattern.
    # Non-string values (e.g. NaN) are left alone, because re.match would raise on them.
    ride_name = row['Ride name']
    match = re.match(pattern, ride_name) if isinstance(ride_name, str) else None
    
    if match:
        # If we have this pattern, fix the data of the two columns
        if match.group(1):
            row['Ride name'] = match.group(1) # first double quote (.*?)
        else:
            row['Ride name'] = match.group(2) # first single quote (.*?)
        
        if match.group(3):
            row['Location'] = match.group(3) # second double quote (.*?)
        else:
            row['Location'] = match.group(4) # second single quote (.*?)
    
    return row

# Perform this transformation on the data set.
# axis = 1 means apply the function to every row.
rcdb_df = rcdb_df.apply(parse_ride_name_and_location, axis=1)
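
To see what the pattern captures, it can be tried on a few values taken from the table above. The helper name `split_ride_and_park` is hypothetical and exists only for this illustration; the pattern itself is the one used in the cell.

```python
import re

PATTERN = r'^\((?:"(.*?)"|\'(.*?)\')\s*,\s*(?:"(.*?)"|\'(.*?)\')\)$'

def split_ride_and_park(value):
    """Return (ride, park) for a tuple-like string, or None if it does not match."""
    match = re.match(PATTERN, value)
    if not match:
        return None
    # Groups 1 and 3 hold double-quoted text, groups 2 and 4 single-quoted text.
    ride = match.group(1) or match.group(2)
    park = match.group(3) or match.group(4)
    return ride, park

print(split_ride_and_park("('Cobra', 'Tivoli Friheden')"))
print(split_ride_and_park('("Super Grover\'s Box Car Derby", \'SeaWorld San Antonio\')'))
print(split_ride_and_park("Abyss"))
```

The double-quote alternative matters for names that contain an apostrophe, such as the Super Grover ride.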


#### Review fixed data.

In [18]:
sorted_df = rcdb_df.sort_values(by='Ride name')

print(tabulate(sorted_df[desired_columns], headers='keys', tablefmt='pretty', showindex=False))


+-------------------------------------------------------+-----------------------------------------+
|                       Ride name                       |                Location                 |
+-------------------------------------------------------+-----------------------------------------+
|              10 Inversion Roller Coaster              |           Chimelong Paradise            |
|                        ARTHUR                         |               Europa-Park               |
|                         Abyss                         |             Adventure World             |
|                        Abyssus                        |              Energylandia               |
|                      Accelerator                      |              Drayton Manor              |
|                        Acrobat                        |           Nagashima Spa Land            |
|                    Adrenaline Peak                    |           Oaks Amusement Park           |


### Transformation 4: Delete obvious trash.

* Remove the records with nan in Location or Ride name.


In [19]:
print(len(rcdb_df), " original records.")

# Drop records where 'Location' or 'Ride name' is NaN or an empty string
rcdb_df.dropna(subset=['Location', 'Ride name'], inplace=True)
rcdb_df = rcdb_df[rcdb_df['Location'] != '']  # Remove empty strings in 'Location'
rcdb_df = rcdb_df[rcdb_df['Ride name'] != '']  # Remove empty strings in 'Ride name'

print(len(rcdb_df), " records after discarding trash.")


695  original records.
608  records after discarding trash.
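
The NaN check and the two empty-string filters can also be folded into one step by first turning empty strings into NaN. A small sketch on made-up rows (the ride and park pairings below are placeholders, not real records):

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one good record, one empty name, one missing name, one missing location.
sample_df = pd.DataFrame({
    "Ride name": ["Abyss", "", None, "Comet"],
    "Location": ["Adventure World", "Energylandia", "Europa-Park", None],
})

# Treat empty strings as missing, then drop every row missing either field.
cleaned_df = sample_df.replace("", np.nan).dropna(subset=["Ride name", "Location"])
print(len(sample_df), "->", len(cleaned_df))
```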


### Transformation 5: Drop duplicates.

For now I will define duplicates as same Ride name, same Location. Ride names are sometimes repeated in other parks.

In [20]:
# Define 'duplicate' as when Ride name and Location are the same.
rcdb_df.drop_duplicates(subset=['Ride name', 'Location'], inplace=True)

print(len(rcdb_df), " records after dropping duplicates.")


608  records after dropping duplicates.
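
A toy example of why the `subset` uses both fields: the same ride name at two parks should survive, while a true repeat should not. The rows are made up for illustration.

```python
import pandas as pd

# Hypothetical rows: the first two are true duplicates, the third is a
# different park that happens to reuse the ride name.
sample_df = pd.DataFrame({
    "Ride name": ["Comet", "Comet", "Comet"],
    "Location": ["Waldameer Park", "Waldameer Park", "Hersheypark"],
})

deduped_df = sample_df.drop_duplicates(subset=["Ride name", "Location"])
print(deduped_df)
```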


- Okay, this run found no duplicates, but earlier versions of the file had some. Still counts as a transformation!

#### Resolve Park Name Conflict

The park name in the flat file must be the same as in the Queue Times file. First, though, I need a common park name shared by the other two data sources (Wikipedia and Queue Times).

#### Process Wikipedia Rankings

The most important data here are the park names and the year-by-year attendance figures.

In [21]:
# Parse wikipedia rankings using Beautiful Soup with the lxml parser
soup = BeautifulSoup(rankings_html, 'lxml')

# Get the tables
tables = soup.find_all('table')

# Store DataFrames for each table in a list.
dataframes = []

# Skip table 0 - Corporations
# Skip table 1 - Worldwide
# Skip tables 6+ - Waterparks (for now)

wiki_columns = ['Rank', 'Amusement park', 'Location', '2009', '2010', '2011', '2012', 
                '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021']

# Use the regional tables (most popular parks in North America, Latin America, Asia, and Europe/Middle East)
# Note: the slice 2:5 selects tables 2, 3 and 4; including a table at index 5 would need tables[2:6].
for table in tables[2:5]:
    # Make a data frame from each table.
    df = pd.read_html(str(table))[0]
    df.columns = wiki_columns
    # Each html page scraped is a different beast.
    # This solves the issue with the wikipedia page parsing.
    # Sometimes I need the text of the td item, but sometimes the name of the park is in a title attribute
    if 'Amusement park' in df.columns:
        df['Amusement park'] = df['Amusement park'].apply(lambda x: x['title'] if isinstance(x, dict) and 'title' in x else x)
    # Add dataframe to a list.
    dataframes.append(df)
    
# Concatenate the DataFrames into one DataFrame.
all_wiki_amusement_parks_df = pd.concat(dataframes, ignore_index=True)

# Drop the 'Rank' column. Do this before finding duplicates!
all_wiki_amusement_parks_df.drop(columns=['Rank'], inplace=True)

# Remove duplicate rows based on all columns
all_wiki_amusement_parks_df.drop_duplicates(inplace=True)

# Convert all columns except 'Amusement park' and 'Location' to numbers. Needed for stats and plotting later.
columns_to_convert = all_wiki_amusement_parks_df.columns.difference(['Amusement park', 'Location'])
all_wiki_amusement_parks_df[columns_to_convert] = all_wiki_amusement_parks_df[columns_to_convert].apply(pd.to_numeric, errors='coerce')

# Replace NaN values with 0 in the combined DataFrame. Attendance = 0 when park is closed for pandemic, never opened, or gone for good.
# List of columns to fill with 0
columns_to_fillna_with_0 = ['2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021']

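# Fill NaN with 0 in all of these columns at once.
# Assigning the result back avoids calling fillna(inplace=True) on a column selection,
# which newer pandas versions warn about (and warnings are fatal in this notebook).
all_wiki_amusement_parks_df[columns_to_fillna_with_0] = all_wiki_amusement_parks_df[columns_to_fillna_with_0].fillna(0)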

# Sort the data frame by Amusement park.
all_wiki_amusement_parks_df = all_wiki_amusement_parks_df.sort_values(by='Amusement park')

# Fix the column names.
wiki_columns = ['Amusement park', 'Location', '2009', '2010', '2011', '2012', 
                '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021']

all_wiki_amusement_parks_df.columns = wiki_columns
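
The numeric conversion is the step most likely to surprise, so here it is in isolation on a made-up frame: anything that is not a number becomes NaN under `errors='coerce'`, and NaN then becomes 0. The park names and figures are placeholders.

```python
import pandas as pd

# Hypothetical attendance table with one unparseable cell and one missing cell.
sample_df = pd.DataFrame({
    "Amusement park": ["Park A", "Park B"],
    "2019": ["2650000", "n/a"],
    "2020": ["1800000", None],
})

year_columns = sample_df.columns.difference(["Amusement park"])
# Coerce text to numbers (bad values become NaN), then treat missing attendance as 0.
sample_df[year_columns] = sample_df[year_columns].apply(pd.to_numeric, errors="coerce").fillna(0)
print(sample_df)
```

Note that a 0 produced this way cannot be told apart from a genuinely reported 0, which matters for any later averages.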


#### Let's see the full list of amusement parks!

In [23]:

wiki_desired_columns = ['Amusement park', 'Location', '2009', '2021']
print(tabulate(all_wiki_amusement_parks_df[wiki_desired_columns], headers='keys', tablefmt='pretty', showindex=False))


+------------------------------------------------------------+-------------------------------------------+------------+------------+
|                       Amusement park                       |                 Location                  |    2009    |    2021    |
+------------------------------------------------------------+-------------------------------------------+------------+------------+
|            Alton Towers at Alton Towers Resort             |           Alton, United Kingdom           | 2650000.0  | 1800000.0  |
|                     Beto Carrero World                     |          Santa Catarina, Brazil           | 1000000.0  | 1895000.0  |
|                  Busch Gardens Tampa Bay                   |       Tampa, Florida, United States       | 4100000.0  | 3210000.0  |
|                    Canada's Wonderland                     |         Vaughan, Ontario, Canada          | 3160000.0  |  587000.0  |
|                        Cedar Point                         |       

#### Introduce 2 Fields for Park Matching.

* Park number: Matching Park number in Queue Times data, which will be the unique Park identifier.
* Best match: For Fuzzy string matching.

In [24]:
all_wiki_amusement_parks_df['Park number'] = ""
all_wiki_amusement_parks_df['Best match'] = ""


#### Web Scrape Queue Times for Park Numbers.

In [25]:
# URL to web scrape.
url = QUEUE_TIMES_PARK_LIST_URL

# Read it.
response = requests.get(url)

# If the request was successful
if response.status_code == 200:

    soup = BeautifulSoup(response.text, 'html.parser')

    # Initialize lists to store the data
    amusement_parks = []
    park_numbers = []
    countries = []

    # Each web page is a little different.
    # The QT park page by country has a separate panel for each country.
    # Inside each country panel is the park a_tag which has the data (see below)
    #
    # 1. Find all the <div> panels (countries)
    div_panels = soup.find_all('div', class_='panel')

    # Iterate through the countries
    for panel in div_panels:
        # Find the <h2> tag inside the panel
        country_name = panel.find('h2').text.strip()
        
        # Find all the <a> Park tags inside the panel
        a_tags = panel.find_all('a', class_='panel-block')
        
        # Iterate through the Park <a> tags
        for a_tag in a_tags:
            # Extract the Amusement park and link
            amusement_park = a_tag.text.strip()
            link = a_tag['href'] 
            amusement_park = amusement_park.split("\n")[0]
            # Append the data to the lists
            amusement_parks.append(amusement_park)
            # The park number is part of the link.
            park_numbers.append(link.split("/")[-1])
            countries.append(country_name)

    # Make the dataframe from the lists collected in the scraping.
    QT_park_list_df = pd.DataFrame({
        'Amusement park': amusement_parks,
        'Park number': park_numbers,
        'Country': countries
    })

    # ALL DONE!
    
    # Display the DataFrame
    print(tabulate(QT_park_list_df, headers='keys', tablefmt='pretty', showindex=False))
else:
    print("Failed to fetch the URL from Queue Times.")

+-------------------------------------------+-------------+---------------------+
|              Amusement park               | Park number |       Country       |
+-------------------------------------------+-------------+---------------------+
|                Familypark                 |     322     |       Austria       |
|                Bellewaerde                |     276     |       Belgium       |
|               Bobbejaanland               |     311     |       Belgium       |
|            Plopsaland De Panne            |     54      |       Belgium       |
|              Walibi Belgium               |     14      |       Belgium       |
|            Beto Carrero World             |     319     |       Brazil        |
|            Canada's Wonderland            |     58      |       Canada        |
|            La Ronde, Montreal             |     48      |       Canada        |
|          Shanghai Disney Resort           |     30      |        China        |
|             Dj

#### Match Amusement Parks by name and move Park Number field from QT DF into Wiki DF.

* Exact match
* Near match (QT inside Wiki)
* Near match (Wiki inside QT)
* Fuzzy match.


#### Note: The fuzzy matching algorithm cannot resolve a few important differences between some Wikipedia and Queue Times entries.

Fix these by hand now. The park matching algorithm will then skip them.

In [26]:
# These parks need to be set.

all_wiki_amusement_parks_df.loc[all_wiki_amusement_parks_df['Amusement park'] == 'Disneyland Park', 'Park number'] = '16'
all_wiki_amusement_parks_df.loc[all_wiki_amusement_parks_df['Amusement park'] == 'Disneyland Hong Kong', 'Park number'] = '31'
all_wiki_amusement_parks_df.loc[all_wiki_amusement_parks_df['Amusement park'] == 'Disneyland Park at Disneyland Paris', 'Park number'] = '4'
all_wiki_amusement_parks_df.loc[all_wiki_amusement_parks_df['Amusement park'] == 'Magic Kingdom Theme Park at Walt Disney World Resort', 'Park number'] = '6'
all_wiki_amusement_parks_df.loc[all_wiki_amusement_parks_df['Amusement park'] == 'Walt Disney Studios Park at Disneyland Paris', 'Park number'] = '28'

# These parks need to be removed because Queue Times doesn't include them.
parks_to_remove = ['Fantasilandia', 'La Feria', 'Parque Mundo Aventura', 'Parque Plaza Sésamo', 'Parque Warner',
                   'Parque Xcaret', 'Parque de la Costa', 'Theme Parque Nacional del Café', 'Puy du Fou',
                   'Mundo Petapa', 'Futuroscope', 'La Feria Chapultepec Mágico', 'Tivoli Gardens']
indices_to_remove = all_wiki_amusement_parks_df[all_wiki_amusement_parks_df['Amusement park'].isin(parks_to_remove)].index

# Remove rows with the specified indices
all_wiki_amusement_parks_df.drop(indices_to_remove, inplace=True)
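
The manual park numbers could also be kept in one dictionary and applied in a single pass. This is a sketch of that idea on a made-up three-row frame; `manual_park_numbers` is a hypothetical name, and only two of the overrides from the cell above are repeated here.

```python
import pandas as pd

# Two of the manual overrides from the cell above, kept in one place.
manual_park_numbers = {"Disneyland Park": "16", "Disneyland Hong Kong": "31"}

sample_df = pd.DataFrame({
    "Amusement park": ["Disneyland Park", "Cedar Point", "Disneyland Hong Kong"],
    "Park number": ["", "", ""],
})

# map() yields NaN for parks without an override; fillna() keeps their old value.
overrides = sample_df["Amusement park"].map(manual_park_numbers)
sample_df["Park number"] = overrides.fillna(sample_df["Park number"])
print(sample_df)
```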



#### Park matching algorithm.

In [27]:
for w, row_wiki in all_wiki_amusement_parks_df.iterrows():
    amusement_park_w = row_wiki['Amusement park']
    
    # Initialize variables to store the best match and its index
    best_match_index = None
    best_match_score = -1
    
    # Iterate through each row in QT_park_list_df
    for q, row_QT in QT_park_list_df.iterrows():
        amusement_park_Q = row_QT['Amusement park']
        
        # exact match first
        if row_wiki['Park number'] == "" and amusement_park_w == amusement_park_Q:
            # print(f"Exact match for {amusement_park_w} and {amusement_park_Q}")
            all_wiki_amusement_parks_df.at[w, 'Park number'] = row_QT['Park number']
            break
            
        # if queue time string is in wiki string
        elif row_wiki['Park number'] == "" and amusement_park_w.find(amusement_park_Q) != -1:
            # print(f"Near match (wiki contains QT) for {amusement_park_w} and {amusement_park_Q}")
            all_wiki_amusement_parks_df.at[w, 'Park number'] = row_QT['Park number']
            break
            
        # if wiki string is in queue time string
        elif row_wiki['Park number'] == "" and amusement_park_Q.find(amusement_park_w) != -1:
            # print(f"Near match (QT contains wiki) for {amusement_park_w} and {amusement_park_Q}")
            all_wiki_amusement_parks_df.at[w, 'Park number'] = row_QT['Park number']
            break
            
        # fuzzy match
        elif row_wiki['Park number'] == "":
            fuzz_score = fuzz.ratio(amusement_park_w, amusement_park_Q)
            # print(f"Fuzzy match for {amusement_park_w} and {amusement_park_Q}: {fuzz_score}")
            
            if fuzz_score > best_match_score:
                best_match_score = fuzz_score
                best_match_index = q
                
    if all_wiki_amusement_parks_df.at[w, 'Park number'] == "" and best_match_index is not None:
        all_wiki_amusement_parks_df.at[w, 'Park number'] = QT_park_list_df.at[best_match_index, 'Park number']
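
The loop above always accepts the best fuzzy score, however low it is. One possible refinement is a minimum score, so that a park with no real counterpart stays unmatched. The sketch below is not the notebook's algorithm: `match_park` and `min_score` are hypothetical, the candidate dictionary is a tiny hand-made sample, and the standard library's `difflib.SequenceMatcher` (scores from 0 to 1) stands in for `fuzz.ratio` (scores from 0 to 100) so the example runs without extra installs. It also checks exact matches across all candidates before trying near matches, whereas the loop above tests each candidate in turn.

```python
from difflib import SequenceMatcher

def match_park(name, candidates, min_score=0.8):
    """Return the park number for name, or None when nothing is close enough.

    candidates maps a candidate park name to its park number.
    """
    # 1. Exact match.
    if name in candidates:
        return candidates[name]
    # 2. Near match: one name contains the other.
    for candidate, number in candidates.items():
        if candidate in name or name in candidate:
            return number
    # 3. Fuzzy match: keep the best score, but accept it only above the cutoff.
    best_number, best_score = None, 0.0
    for candidate, number in candidates.items():
        score = SequenceMatcher(None, name, candidate).ratio()
        if score > best_score:
            best_number, best_score = number, score
    return best_number if best_score >= min_score else None

# Hand-made sample of Queue Times style names and park numbers.
qt_parks = {"Cedar Point": "50", "Efteling": "160", "Alton Towers": "1"}
for wiki_name in ["Cedar Point", "Alton Towers at Alton Towers Resort", "Ceder Point", "Parque Xcaret"]:
    print(wiki_name, "->", match_park(wiki_name, qt_parks))
```

With a cutoff in place, parks such as the ones removed by hand earlier would simply come back as `None` instead of being assigned an unrelated park number.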


In [28]:
# Print the two data frames after matching.
desired_columns = ['Amusement park', 'Park number']
sorted_df = all_wiki_amusement_parks_df.sort_values(by='Amusement park')
print(tabulate(sorted_df[desired_columns], headers='keys', tablefmt='pretty', showindex=False))
desired_columns = ['Amusement park', 'Park number']
sorted_df = QT_park_list_df.sort_values(by='Amusement park')
print(tabulate(sorted_df[desired_columns], headers='keys', tablefmt='pretty', showindex=False))


+------------------------------------------------------------+-------------+
|                       Amusement park                       | Park number |
+------------------------------------------------------------+-------------+
|            Alton Towers at Alton Towers Resort             |      1      |
|                     Beto Carrero World                     |     319     |
|                  Busch Gardens Tampa Bay                   |     24      |
|                    Canada's Wonderland                     |     58      |
|                        Cedar Point                         |     50      |
|              Chessington World of Adventures               |      3      |
|                        De Efteling                         |     160     |
|              Disney California Adventure Park              |     17      |
|    Disney's Animal Kingdom at Walt Disney World Resort     |      8      |
|   Disney's Hollywood Studios at Walt Disney World Resort   |      7      |

#### Now we have a uniform method of referencing parks.

- We need to apply it to the flat file.

### Transformation 6: Use a uniform park reference. 

* Match Location of RCDB against Amusement park field of QT/Wiki. We use both!
* Create Park number field.

In [29]:
# Introduce Park number field in flat file.
rcdb_df['Park number'] = ""


### Park Matching/Resolution for RCDB

* Look for matches (same strategy as before) for RCDB Location in QT
* Look for matches (same strategy as before) for RCDB Location in Wiki.
* Fuzzy matching across BOTH.
* Introduce Park number variable in RCDB to be unique park identifier across all data sets.

In [30]:
for rcdb_index, row_rcdb in rcdb_df.iterrows():
    location = row_rcdb['Location']
    
    # Initialize variables to store the best match and its index
    best_park_number = None
    best_match_score = -1
    best_match_amusement_park = ""
    
    # Iterate through each row in QT_park_list_df
    for q, row_QT in QT_park_list_df.iterrows():
        amusement_park_Q = row_QT['Amusement park']
        park_number = row_QT['Park number']
        
        # exact match first
        if row_rcdb['Park number'] == "" and location == amusement_park_Q:
            # print(f"Q Exact match for {location} and {amusement_park_Q}  - park = {park_number}")
            rcdb_df.at[rcdb_index, 'Park number'] = row_QT['Park number']
            break
            
        # if queue time string is in RCDB location string
        elif row_rcdb['Park number'] == "" and location.find(amusement_park_Q) != -1:
            # print(f"Q Near match (location contains QT) for {location} and {amusement_park_Q}  - park = {park_number}")
            rcdb_df.at[rcdb_index, 'Park number'] = row_QT['Park number']
            break
            
        # if RCDB location string is in queue time string
        elif row_rcdb['Park number'] == "" and amusement_park_Q.find(location) != -1:
            # print(f"Q Near match (QT contains location) for {location} and {amusement_park_Q}  - park = {park_number}")
            rcdb_df.at[rcdb_index, 'Park number'] = row_QT['Park number']
            break
            
        # fuzzy match
        elif row_rcdb['Park number'] == "":
            fuzz_score = fuzz.ratio(location, amusement_park_Q)
            # print(f"Fuzzy match for {location} and {amusement_park_Q}: {fuzz_score}")
            
            if fuzz_score > best_match_score:
                best_match_score = fuzz_score
                best_park_number = row_QT['Park number']
                best_match_amusement_park = amusement_park_Q
         
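    # row_rcdb is a copy made by iterrows(); re-read the row so the Wiki pass
    # sees a Park number already assigned by the QT pass and does not overwrite it.
    row_rcdb = rcdb_df.loc[rcdb_index]
    for w, row_wiki in all_wiki_amusement_parks_df.iterrows():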
        amusement_park_w = row_wiki['Amusement park']
        park_number = row_wiki['Park number']
        # exact match first
        if row_rcdb['Park number'] == "" and amusement_park_w == location:
            # print(f"W Exact match for {amusement_park_w} and {location} - park = {park_number}")
            rcdb_df.at[rcdb_index, 'Park number'] = row_wiki['Park number']
            break
            
        # near match: Wiki park name is contained in the RCDB location
        elif row_rcdb['Park number'] == "" and location.find(amusement_park_w) != -1:
            # print(f"W Near match (location contains amusement_park_w) for {amusement_park_w} and {location} - park = {park_number}")
            rcdb_df.at[rcdb_index, 'Park number'] = row_wiki['Park number']
            break
            
        # near match: RCDB location is contained in the Wiki park name
        elif row_rcdb['Park number'] == "" and amusement_park_w.find(location) != -1:
            # print(f"W Near match (amusement_park_w contains location) for {amusement_park_w} and {location} - park = {park_number}")
            rcdb_df.at[rcdb_index, 'Park number'] = row_wiki['Park number']
            break
            
        # fuzzy match
        elif row_rcdb['Park number'] == "":
            fuzz_score = fuzz.ratio(location, amusement_park_w)
            # print(f"Fuzzy match for {location} and {amusement_park_w}: {fuzz_score}")
            
            if fuzz_score > best_match_score:
                best_match_score = fuzz_score
                best_park_number = row_wiki['Park number']
                best_match_amusement_park = amusement_park_w


    if rcdb_df.at[rcdb_index, 'Park number'] == "" and best_park_number is not None:
        rcdb_df.at[rcdb_index, 'Park number'] = "-"+best_park_number
        # print(f"Fuzzy match for {location} is {best_match_amusement_park}")


#### After Matching

1. Park number positive = near match or exact match.
2. Park number negative (a string with a leading '-') = fuzzy match that still has to be checked by hand.
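
To make the review of fuzzy matches easier, the sign convention can be used to summarize the results. The following is a minimal sketch, assuming Park number is stored as a string; `match_kind` and the small `rows` sample (taken from the rows shown in this notebook) are illustrative names only.

```python
from collections import Counter


def match_kind(park_number):
    """Classify a Park number string by the sign convention above."""
    if park_number == "":
        return "unmatched"
    return "fuzzy" if park_number.startswith("-") else "exact_or_near"


# (Ride name, Location, Park number) rows as they appear after matching
rows = [
    ("ARTHUR", "Europa-Park", "51"),
    ("Abyss", "Adventure World", "-97"),
    ("Insane", "Gröna Lund", "-166"),
    ("Jetline", "Gröna Lund", "-166"),
]

# How many rides fall into each category
counts = Counter(match_kind(number) for _, _, number in rows)

# Distinct (Location, Park number) pairs that were only fuzzy-matched;
# reviewing these pairs once is enough to accept or reject all their rides.
fuzzy_pairs = sorted({(loc, number) for _, loc, number in rows
                      if match_kind(number) == "fuzzy"})

print(counts)
print(fuzzy_pairs)
```

Reviewing distinct pairs rather than individual rides keeps the manual check short, since every ride at a park shares the same Location string.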

In [31]:
desired_columns = ['Ride name', 'Location', 'Park number']
sorted_df = rcdb_df.sort_values(by='Ride name')

print(tabulate(sorted_df[desired_columns], headers='keys', tablefmt='pretty', showindex=False))


+-------------------------------------------------------+-----------------------------------------+-------------+
|                       Ride name                       |                Location                 | Park number |
+-------------------------------------------------------+-----------------------------------------+-------------+
|              10 Inversion Roller Coaster              |           Chimelong Paradise            |     -4      |
|                        ARTHUR                         |               Europa-Park               |     51      |
|                         Abyss                         |             Adventure World             |     -97     |
|                        Abyssus                        |              Energylandia               |     317     |
|                      Accelerator                      |              Drayton Manor              |     -49     |
|                        Acrobat                        |           Nagashima Spa Land  

#### Fuzzy Matches That Worked

* Hagrid’s Magical Creatures Motorbike Adventure | Universal's Islands of Adventure | -64
* Insane | Gröna Lund | -166
* Jetline | Gröna Lund | -166
* Jurassic World VelociCoaster | Universal's Islands of Adventure | -64
* Kvasten | Gröna Lund | -166
* Monster | Gröna Lund | -166
* Orkanen | Fårup Sommarland | -18
* Steel Eel | SeaWorld San Antonio | -22
* Texas Stingray | SeaWorld San Antonio | -22
* The Great White | SeaWorld San Antonio | -22
* The Incredible Hulk Coaster | Universal's Islands of Adventure | -64
* Tornado | Parque de Atracciones de Madrid | -321
* Vilda Musen | Gröna Lund | -166

These fuzzy matches cover five parks (18, 22, 64, 166, and 321). For these coasters, change the Park number to positive by stripping the '-'.

In [32]:
# Fuzzy matching worked for a few important coasters. We need to strip off the '-' of their Park number fields.
resolved_fuzzy_parks = ["-18", "-22", "-64", "-166", "-321"]

# Define a function to remove the negative sign
def remove_negative_sign(park_number):
    if park_number in resolved_fuzzy_parks:
        return park_number.replace('-', '')
    return park_number

# Apply the function to the 'Park number' column
rcdb_df['Park number'] = rcdb_df['Park number'].apply(remove_negative_sign)

desired_columns = ['Ride name', 'Location', 'Park number']
sorted_df = rcdb_df.sort_values(by='Ride name')

print(tabulate(sorted_df[desired_columns], headers='keys', tablefmt='pretty', showindex=False))


+-------------------------------------------------------+-----------------------------------------+-------------+
|                       Ride name                       |                Location                 | Park number |
+-------------------------------------------------------+-----------------------------------------+-------------+
|              10 Inversion Roller Coaster              |           Chimelong Paradise            |     -4      |
|                        ARTHUR                         |               Europa-Park               |     51      |
|                         Abyss                         |             Adventure World             |     -97     |
|                        Abyssus                        |              Energylandia               |     317     |
|                      Accelerator                      |              Drayton Manor              |     -49     |
|                        Acrobat                        |           Nagashima Spa Land  

#### The Fuzzy Match sometimes worked! We fixed the coaster entries for those fuzzy parks.

### Transformation 7: Drop all rides that are not in Parks of the Queue Times Database

In [33]:
# Drop records whose 'Park number' still contains '-' (unresolved fuzzy matches)
rcdb_df = rcdb_df[~rcdb_df['Park number'].str.contains(r'-')]

print(len(rcdb_df))

376


#### Ride List (Intersection of QT/Wiki/Flat)

In [34]:
desired_columns = ['Ride name', 'Location', 'Park number']
sorted_df = rcdb_df.sort_values(by='Ride name')

print("Master List of Rides (so far)")
print(tabulate(sorted_df[desired_columns], headers='keys', tablefmt='pretty', showindex=False))


Master List of Rides (so far)
+-------------------------------------------------------+----------------------------------+-------------+
|                       Ride name                       |             Location             | Park number |
+-------------------------------------------------------+----------------------------------+-------------+
|                        ARTHUR                         |           Europa-Park            |     51      |
|                        Abyssus                        |           Energylandia           |     317     |
|                   Adventure Express                   |           Kings Island           |     60      |
|                       Afterburn                       |            Carowinds             |     59      |
|                  Alpenexpress Enzian                  |           Europa-Park            |     51      |
|                      Alpengeist                       |    Busch Gardens Williamsburg    |     23      |
|      


### Five (or in this case seven) transformations completed!


### Next Step:

* Web-scrape the Queue Times pages for ride entries in each park.
* Add missing rides to the database.