# Step 0: Imports and Reading Data

In this initial step, we set up the environment for the analysis: we import the required libraries, configure plotting and display settings, and load the dataset.

1. **Importing Libraries**:
   - We start by importing the essential libraries that we will use throughout our analysis:
     - `pandas` for data manipulation and analysis.
     - `os` for interacting with the operating system.
     - `numpy` for numerical computations.
     - `matplotlib` (imported as `plt`) for creating visualizations.
     - `seaborn` for enhanced data visualizations built on top of matplotlib.
     - `importlib.metadata` for looking up the installed version of each package.
   
2. **Configuring Settings**:
   - We configure `matplotlib` to use the 'ggplot' style for our plots, which provides a clean and visually appealing layout.
   - We set the maximum number of columns displayed by `pandas` to 200, ensuring that we can view a large number of columns in our DataFrames without truncation.

3. **Library Versions**:
   - We define and utilize two functions:
     - `get_library_versions(libraries)`: This function takes a list of library names and returns a dictionary containing their respective versions.
     - `print_library_versions(versions)`: This function prints the versions of the libraries in a structured format.
   - We then create a list of the libraries we have imported and use these functions to display their versions, confirming which version of each library is installed.

4. **Reading the Data**:
   - We specify the relative path to our dataset, `coaster_db.csv`, located in the `../data/rollercoaster_db` directory.
   - Using the `os.path.basename` function, we extract the file name from the path.
   - We read the CSV file into a `pandas` DataFrame using the `pd.read_csv` function.
   - Finally, we print a success message indicating that the dataset has been read successfully.

By completing these steps, we ensure that our working environment is properly set up, and our data is loaded and ready for analysis. This foundational setup is crucial for maintaining a streamlined and efficient workflow throughout the project.

## Import libraries


In [1]:
import pandas as pd
import os
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import importlib.metadata
# Configure matplotlib style and pandas options
plt.style.use('ggplot')
pd.set_option('display.max_columns', 200)

# Get versions of the libraries
pandas_version = importlib.metadata.version('pandas')
numpy_version = importlib.metadata.version('numpy')
matplotlib_version = importlib.metadata.version('matplotlib')
seaborn_version = importlib.metadata.version('seaborn')

# Print libraries and versions
print('Libraries read successfully!')
print(f'- pandas version: {pandas_version}')
print(f'- numpy version: {numpy_version}')
print(f'- matplotlib version: {matplotlib_version}')
print(f'- seaborn version: {seaborn_version}')

Libraries read successfully!
- pandas version: 2.2.2
- numpy version: 2.0.0
- matplotlib version: 3.9.0
- seaborn version: 0.13.2


In [2]:
# Modularization

def get_library_versions(libraries):
    """Get the versions of the specified libraries."""
    versions = {}
    for lib in libraries:
        try:
            versions[lib] = importlib.metadata.version(lib)
        except importlib.metadata.PackageNotFoundError:
            versions[lib] = 'Not installed'
    return versions

def print_library_versions(versions):
    """Print the versions of the libraries."""
    print('Libraries read successfully!')
    for lib, version in versions.items():
        print(f'- {lib} version: {version}')

# List of libraries to check
libraries = ['pandas', 'numpy', 'matplotlib', 'seaborn']

# Get and print versions of the libraries
versions = get_library_versions(libraries)
print_library_versions(versions)

Libraries read successfully!
- pandas version: 2.2.2
- numpy version: 2.0.0
- matplotlib version: 3.9.0
- seaborn version: 0.13.2


## Reading data

In [3]:
# Relative path from the notebook to the CSV file
file_path = '../data/rollercoaster_db/coaster_db.csv'
# Extract the file name from the path
file_name = os.path.basename(file_path)
# Load the CSV file into a DataFrame
data = pd.read_csv(file_path)
print(f'{file_name} read successfully!')

coaster_db.csv read successfully!
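
If the relative path is wrong, `pd.read_csv` fails with a traceback that does not show where the notebook was actually looking. The sketch below is one possible way to fail early with the resolved path in the message. The helper name `load_csv` and the throwaway `demo.csv` file are assumptions made for illustration; they are not part of the notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_csv(path):
    """Hypothetical helper: read a CSV, failing early with a clear message."""
    path = Path(path)
    if not path.is_file():
        # Show the absolute path so a wrong working directory is easy to spot
        raise FileNotFoundError(f"Dataset not found: {path.resolve()}")
    return pd.read_csv(path)


# Demonstrate on a throwaway file so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo.csv"
    demo_path.write_text("coaster_name,speed_mph\nA,50.0\nB,62.5\n")
    demo = load_csv(demo_path)
    print(f"{demo_path.name} read successfully! Shape: {demo.shape}")
```

In the notebook itself, the call would presumably be `load_csv(file_path)` in place of `pd.read_csv(file_path)`.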


# Step 1: Data Understanding

- DataFrame ``shape``
- ``head`` and ``tail``
- ``dtypes``
- ``describe``

In this step, we focus on the initial understanding of the dataset. This involves examining the structure, contents, and basic statistics of the data. The following actions are performed:

1. **DataFrame Shape**:
   - We use the `.shape` attribute of the DataFrame to obtain the dimensions of the dataset. This returns a tuple representing the number of rows and columns, providing an overview of the dataset's size.

2. **Head and Tail**:
   - We use the `.head()` and `.tail()` methods to display the first few and last few rows of the DataFrame, respectively. This helps us get a sense of the data's structure and contents.

3. **Data Types**:
   - The `.dtypes` attribute is used to identify the data types of each column in the DataFrame. Understanding the data types is crucial for selecting appropriate data processing and analysis techniques.

4. **Descriptive Statistics**:
   - The `.describe()` method provides summary statistics for the numerical columns in the DataFrame: count, mean, standard deviation, minimum, quartiles (the 50% row is the median), and maximum. These statistics help us understand the distribution and variability of the data.

By performing these actions, we gain a preliminary understanding of the dataset's structure, content, and basic statistical properties. This foundational knowledge is essential for informing subsequent steps in the data analysis process, such as data cleaning, transformation, and modeling.
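
Before running these calls on the full dataset, it can help to see them on a frame small enough to check by hand. The following is a minimal sketch on a made-up four-row DataFrame (the names and values in `toy` are invented for illustration). It shows that `describe()` skips the text column by default and counts only non-missing values.

```python
import pandas as pd

# Made-up data: one text column, one numeric column with a missing value
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "speed_mph": [40.0, 50.0, 60.0, None],
})

print(toy.shape)    # (number of rows, number of columns)
print(toy.head(2))  # first two rows
print(toy.dtypes)   # one dtype per column

# describe() summarizes numeric columns only unless told otherwise
summary = toy.describe()
print(summary)
```

The `count` row reports 3 for `speed_mph` rather than 4, because missing values are excluded from every statistic.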

In [12]:
# DataFrame Shape
print("DataFrame Shape:", data.shape)


DataFrame Shape: (1087, 56)


In [13]:
# Display first few rows
print("Head of the DataFrame:")
data.head()

Head of the DataFrame:


Unnamed: 0,coaster_name,Length,Speed,Location,Status,Opening date,Type,Manufacturer,Height restriction,Model,Height,Inversions,Lift/launch system,Cost,Trains,Park section,Duration,Capacity,G-force,Designer,Max vertical angle,Drop,Soft opening date,Fast Lane available,Replaced,Track layout,Fastrack available,Soft opening date.1,Closing date,Opened,Replaced by,Website,Flash Pass Available,Must transfer from wheelchair,Theme,Single rider line available,Restraint Style,Flash Pass available,Acceleration,Restraints,Name,year_introduced,latitude,longitude,Type_Main,opening_date_clean,speed1,speed2,speed1_value,speed1_unit,speed_mph,height_value,height_unit,height_ft,Inversions_clean,Gforce_clean
0,Switchback Railway,600 ft (180 m),6 mph (9.7 km/h),Coney Island,Removed,"June 16, 1884",Wood,LaMarcus Adna Thompson,,Lift Packed,50 ft (15 m),,gravity,,,Coney Island Cyclone Site,1:00,1600 riders per hour,2.9,LaMarcus Adna Thompson,30°,43 ft (13 m),,,,Gravity pulled coaster,,,,,,,,,,,,,,,,1884,40.574,-73.978,Wood,1884-06-16,6 mph,9.7 km/h,6.0,mph,6.0,50.0,ft,,0,2.9
1,Flip Flap Railway,,,Sea Lion Park,Removed,1895,Wood,Lina Beecher,,,,1.0,,,a single car. Riders are arranged 1 across in ...,,,,12.0,Lina Beecher,,,,,,,,,1902.0,,,,,,,,,,,,,1895,40.578,-73.979,Wood,1895-01-01,,,,,,,,,1,12.0
2,Switchback Railway (Euclid Beach Park),,,"Cleveland, Ohio, United States",Closed,,Other,,,,,,,,,,,,,,,,,,,,,,,1895.0,,,,,,,,,,,,1896,41.58,-81.57,Other,,,,,,,,,,0,
3,Loop the Loop (Coney Island),,,Other,Removed,1901,Steel,Edwin Prescott,,,,1.0,,,a single car. Riders are arranged 2 across in ...,,,,,Edward A. Green,,,,,Switchback Railway,,,,1910.0,,Giant Racer,,,,,,,,,,,1901,40.5745,-73.978,Steel,1901-01-01,,,,,,,,,1,
4,Loop the Loop (Young's Pier),,,Other,Removed,1901,Steel,Edwin Prescott,,,,1.0,,,,,,,,Edward A. Green,,,,,,,,,1912.0,,,,,,,,,,,,,1901,39.3538,-74.4342,Steel,1901-01-01,,,,,,,,,1,


In [14]:
# Display last few rows
print("Tail of the DataFrame:")
data.tail()

Tail of the DataFrame:


Unnamed: 0,coaster_name,Length,Speed,Location,Status,Opening date,Type,Manufacturer,Height restriction,Model,Height,Inversions,Lift/launch system,Cost,Trains,Park section,Duration,Capacity,G-force,Designer,Max vertical angle,Drop,Soft opening date,Fast Lane available,Replaced,Track layout,Fastrack available,Soft opening date.1,Closing date,Opened,Replaced by,Website,Flash Pass Available,Must transfer from wheelchair,Theme,Single rider line available,Restraint Style,Flash Pass available,Acceleration,Restraints,Name,year_introduced,latitude,longitude,Type_Main,opening_date_clean,speed1,speed2,speed1_value,speed1_unit,speed_mph,height_value,height_unit,height_ft,Inversions_clean,Gforce_clean
1082,American Dreier Looping,"3,444 ft (1,050 m)",53 mph (85 km/h),Other,,,Steel,Anton Schwarzkopf,55 in (140 cm),,111 ft (34 m),3.0,Booster Wheel Lift Hill,,3 trains with 5 cars. Riders are arranged 2 ac...,,,"1,500 riders per hour",4.7,Werner Stengel,,,,,,,,,,,,,,,,,,,,,,2022,,,Steel,,53 mph,85 km/h,53.0,mph,53.0,111.0,ft,,3,4.7
1083,Pantheon (roller coaster),"3,328 ft (1,014 m)",73 mph (117 km/h),Busch Gardens Williamsburg,Under construction,2022,Steel – Launched,Intamin,,Blitz Coaster,178 ft (54 m),2.0,LSM,,2 trains with 5 cars. Riders are arranged 2 ac...,,,,,,95°,,,,,,,,,,,,,,,,,,,,,2022,37.2339,-76.6426,Steel,2022-01-01,73 mph,117 km/h,73.0,mph,73.0,178.0,ft,,2,
1084,Tron Lightcycle Power Run,"3,169.3 ft (966.0 m)",59.3[1] mph (95.4 km/h),Other,,"June 16, 2016",Steel – Launched,Vekoma,4[2] ft (122 cm),Motorbike roller coaster,78.1 ft (23.8 m),0.0,LIM Launch,,7 trains with 7 cars. Riders are arranged 2 ac...,,~2:00,"1,680 riders per hour",4.0,Walt Disney Imagineering,,,,,,,,,,,,,,Must transfer from wheelchair,Tron,Single rider line available,,,,,TRON Lightcycle / Run,2022,,,Steel,2016-06-16,59.3 mph,95.4 km/h,59.3,mph,59.3,78.1,ft,,0,4.0
1085,Tumbili,770 ft (230 m),34 mph (55 km/h),Kings Dominion,Under construction,,Steel – 4th Dimension – Wing Coaster,S&S – Sansei Technologies,,4D Free Spin,112 ft (34 m),0.0,Vertical chain lift hill,,Single car trains with riders arranged 4 acros...,Jungle X-Pedition,0:55,,,,,,,,The Crypt,,,,,,,Official website,,,,,,,,,,2022,,,Steel,,34 mph,55 km/h,34.0,mph,34.0,112.0,ft,,0,
1086,Wonder Woman Flight of Courage,"3,300 ft (1,000 m)",58 mph (93 km/h),Six Flags Magic Mountain,Under construction,2022,Steel – Single-rail,Rocky Mountain Construction,,Raptor – Custom,131 ft (40 m),3.0,Chain lift hill,,,DC Universe,,,,,87°,127 ft (39 m),,,Green Lantern: First Flight Tidal Wave,,,,,,,,,,,,,,,,,2022,,,Steel,2022-01-01,58 mph,93 km/h,58.0,mph,58.0,131.0,ft,,3,


In [16]:
# Data Types of each column
# Every column is a pandas Series, and each Series has its own dtype
print("Data Types:")
data.dtypes

Data Types:


coaster_name                      object
Length                            object
Speed                             object
Location                          object
Status                            object
Opening date                      object
Type                              object
Manufacturer                      object
Height restriction                object
Model                             object
Height                            object
Inversions                       float64
Lift/launch system                object
Cost                              object
Trains                            object
Park section                      object
Duration                          object
Capacity                          object
G-force                           object
Designer                          object
Max vertical angle                object
Drop                              object
Soft opening date                 object
Fast Lane available               object
Replaced        

In [17]:
# Descriptive Statistics
print("Descriptive Statistics:")
data.describe()

Descriptive Statistics:


Unnamed: 0,Inversions,year_introduced,latitude,longitude,speed1_value,speed_mph,height_value,height_ft,Inversions_clean,Gforce_clean
count,932.0,1087.0,812.0,812.0,937.0,937.0,965.0,171.0,1087.0,362.0
mean,1.54721,1994.986201,38.373484,-41.595373,53.850374,48.617289,89.575171,101.996491,1.326587,3.824006
std,2.114073,23.475248,15.516596,72.285227,23.385518,16.678031,136.246444,67.329092,2.030854,0.989998
min,0.0,1884.0,-48.2617,-123.0357,5.0,5.0,4.0,13.1,0.0,0.8
25%,0.0,1989.0,35.03105,-84.5522,40.0,37.3,44.0,51.8,0.0,3.4
50%,0.0,2000.0,40.2898,-76.6536,50.0,49.7,79.0,91.2,0.0,4.0
75%,3.0,2010.0,44.7996,2.7781,63.0,58.0,113.0,131.2,2.0,4.5
max,14.0,2022.0,63.2309,153.4265,240.0,149.1,3937.0,377.3,14.0,12.0


In [4]:
# Identifying categorical (object) and numerical columns
categorical_cols = data.select_dtypes(include=['object']).columns
numerical_cols = data.select_dtypes(include=['number']).columns
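
The split above relies on `select_dtypes`, which filters columns by dtype. As a rough illustration on invented data (the `toy` frame below is an assumption, and the text column is given an explicit `object` dtype to mirror the dtypes shown in this notebook):

```python
import pandas as pd

# Made-up frame with one text column and two numeric columns
toy = pd.DataFrame({
    "Status": pd.Series(["Operating", "Removed"], dtype="object"),
    "Inversions": [3.0, 0.0],
    "year_introduced": [1999, 1901],
})

# 'object' matches text columns; 'number' matches both int and float columns
categorical = toy.select_dtypes(include=["object"]).columns.tolist()
numerical = toy.select_dtypes(include=["number"]).columns.tolist()

print("Categorical:", categorical)
print("Numerical:", numerical)
```

Note that `'number'` covers both `int64` and `float64`, so integer columns such as `year_introduced` land in the numerical group.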

In [5]:
# Display information about categorical columns
print("Categorical Columns:")
data[categorical_cols].describe(include='all')

Categorical Columns:


Unnamed: 0,coaster_name,Length,Speed,Location,Status,Opening date,Type,Manufacturer,Height restriction,Model,Height,Lift/launch system,Cost,Trains,Park section,Duration,Capacity,G-force,Designer,Max vertical angle,Drop,Soft opening date,Fast Lane available,Replaced,Track layout,Fastrack available,Soft opening date.1,Closing date,Opened,Replaced by,Website,Flash Pass Available,Must transfer from wheelchair,Theme,Single rider line available,Restraint Style,Flash Pass available,Acceleration,Restraints,Name,Type_Main,opening_date_clean,speed1,speed2,speed1_unit,height_unit
count,1087,953,937,1087,874,837,1087,1028,831,744,965,795,382,718,487,765,575,362,578,357,494,96,69,173,335,19,96,236,27,88,87,50,106,44,81,22,46,60,24,35,1087,837,937,935,937,965
unique,990,569,243,280,15,656,98,102,100,317,382,116,272,221,271,208,160,70,153,83,235,84,1,140,95,1,84,144,16,67,16,1,1,33,1,6,1,50,12,18,3,602,225,229,2,2
top,Batman: The Ride,935 ft (285 m),50 mph (80 km/h),Other,Operating,1976,Steel,Vekoma,48 in (122 cm),Custom,70 ft (21 m),Chain lift hill,$10 million,2 trains with 6 cars. Riders are arranged 2 ac...,Planet Snoopy,1:30,1200 riders per hour,4,Werner Stengel,90°,100 ft (30 m),"April 29, 2005",Fast Lane available,Mine Train Through Nature's Wonderland,Twister,Fastrack available,"April 29, 2005","September 16, 2007",1895,The Joker,Official website,Flash Pass Available,Must transfer from wheelchair,Toy Story,Single rider line available,Over-the-shoulder,Flash Pass available,0 to 40 mph (0 to 64 km/h) in 3 seconds,Single Lap Bar,Das große LEGO-Rennen,Steel,1999-01-01,50 mph,80 km/h,mph,ft
freq,7,21,63,250,668,7,308,135,224,20,20,411,6,52,14,59,36,46,172,45,15,4,69,6,43,19,4,8,7,4,64,50,106,4,81,17,46,4,6,4,816,10,63,63,780,794


In [6]:
# Display information about numerical columns
print("Numerical Columns:")
data[numerical_cols].describe()

Numerical Columns:


Unnamed: 0,Inversions,year_introduced,latitude,longitude,speed1_value,speed_mph,height_value,height_ft,Inversions_clean,Gforce_clean
count,932.0,1087.0,812.0,812.0,937.0,937.0,965.0,171.0,1087.0,362.0
mean,1.54721,1994.986201,38.373484,-41.595373,53.850374,48.617289,89.575171,101.996491,1.326587,3.824006
std,2.114073,23.475248,15.516596,72.285227,23.385518,16.678031,136.246444,67.329092,2.030854,0.989998
min,0.0,1884.0,-48.2617,-123.0357,5.0,5.0,4.0,13.1,0.0,0.8
25%,0.0,1989.0,35.03105,-84.5522,40.0,37.3,44.0,51.8,0.0,3.4
50%,0.0,2000.0,40.2898,-76.6536,50.0,49.7,79.0,91.2,0.0,4.0
75%,3.0,2010.0,44.7996,2.7781,63.0,58.0,113.0,131.2,2.0,4.5
max,14.0,2022.0,63.2309,153.4265,240.0,149.1,3937.0,377.3,14.0,12.0


In [23]:
data.columns

Index(['coaster_name', 'Length', 'Speed', 'Location', 'Status', 'Opening date',
       'Type', 'Manufacturer', 'Height restriction', 'Model', 'Height',
       'Inversions', 'Lift/launch system', 'Cost', 'Trains', 'Park section',
       'Duration', 'Capacity', 'G-force', 'Designer', 'Max vertical angle',
       'Drop', 'Soft opening date', 'Fast Lane available', 'Replaced',
       'Track layout', 'Fastrack available', 'Soft opening date.1',
       'Closing date', 'Opened', 'Replaced by', 'Website',
       'Flash Pass Available', 'Must transfer from wheelchair', 'Theme',
       'Single rider line available', 'Restraint Style',
       'Flash Pass available', 'Acceleration', 'Restraints', 'Name',
       'year_introduced', 'latitude', 'longitude', 'Type_Main',
       'opening_date_clean', 'speed1', 'speed2', 'speed1_value', 'speed1_unit',
       'speed_mph', 'height_value', 'height_unit', 'height_ft',
       'Inversions_clean', 'Gforce_clean'],
      dtype='object')

In [24]:
len(data.columns)

56

# Step 2: Data Preparation
Preparation before analysis

- Dropping irrelevant columns and rows.
- Identifying duplicated rows.
- Renaming columns.
- Feature creation.

In [13]:
# Keep only the columns needed for the analysis; .copy() makes data_df independent of data
data_df = data[['coaster_name',
      'Location',
      'Status',
      'Manufacturer',
      'year_introduced',
      'latitude',
      'longitude',
      'Type_Main',
      'opening_date_clean',
      'speed_mph',
      'height_ft',
      'Inversions_clean',
      'Gforce_clean']].copy()

In [10]:
# Example of dropping a single column
# data.drop(['Opening date'], axis=1)
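
There are two common ways to reduce a DataFrame to the columns of interest: dropping what is not needed, or selecting what is. The sketch below compares them on a made-up frame (the contents of `toy` are invented for illustration).

```python
import pandas as pd

# Made-up frame with one column we want to discard
toy = pd.DataFrame({
    "coaster_name": ["A", "B"],
    "Opening date": ["1999", "2005"],
    "speed_mph": [50.0, 62.5],
})

# Option 1: drop returns a new DataFrame; toy itself is left untouched
dropped = toy.drop(columns=["Opening date"])

# Option 2: select the columns to keep; .copy() yields an independent frame
kept = toy[["coaster_name", "speed_mph"]].copy()

print(dropped.columns.tolist())
print(kept.columns.tolist())
```

Selecting is usually the shorter option when, as in this notebook, only a small subset of a wide table is kept.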

In [14]:
data_df.shape

(1087, 13)

In [15]:
data_df.dtypes

coaster_name           object
Location               object
Status                 object
Manufacturer           object
year_introduced         int64
latitude              float64
longitude             float64
Type_Main              object
opening_date_clean     object
speed_mph             float64
height_ft             float64
Inversions_clean        int64
Gforce_clean          float64
dtype: object

In [16]:
data_df['opening_date_clean']

0       1884-06-16
1       1895-01-01
2              NaN
3       1901-01-01
4       1901-01-01
           ...    
1082           NaN
1083    2022-01-01
1084    2016-06-16
1085           NaN
1086    2022-01-01
Name: opening_date_clean, Length: 1087, dtype: object

Insight: `data_df['opening_date_clean']` holds dates, but it is stored as `object` (strings). We therefore use the pandas function ``pd.to_datetime`` to convert it to a datetime type.

In [17]:
data_df['opening_date_clean'] = pd.to_datetime(data_df['opening_date_clean'])
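
To see what the conversion does to missing entries, here is a small sketch on a hand-made Series. The values are copied in style from the column above, and the `"not a date"` entry is an invented example of a malformed value.

```python
import numpy as np
import pandas as pd

# Strings in ISO format, with one missing value
dates = pd.Series(["1884-06-16", "1895-01-01", np.nan, "2022-01-01"])
converted = pd.to_datetime(dates)

print(converted.dtype)         # a datetime64 dtype instead of object
print(converted.isna().sum())  # the missing value becomes NaT and still counts as missing
print(converted.dt.year)       # datetime accessors are now available

# errors='coerce' turns unparseable strings into NaT instead of raising
messy = pd.Series(["2022-01-01", "not a date"])
coerced = pd.to_datetime(messy, errors="coerce")
print(coerced)
```

Whether `errors="coerce"` is appropriate depends on the data: it silently hides malformed values, so it is worth counting the resulting `NaT` entries afterwards.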

In [19]:
# Rename columns

data_df = data_df.rename(columns={'coaster_name': 'Coaster_Name',
                                  'year_introduced': 'Year_Introduced',
                                  'opening_date_clean': 'Opening_Date',
                                  'speed_mph': 'Speed_mph',
                                  'height_ft': 'Height_ft',
                                  'Inversions_clean': 'Inversions',
                                  'Gforce_clean': 'Gforce'})

In [20]:
# Where are the missing values?
data_df.isna()

Unnamed: 0,Coaster_Name,Location,Status,Manufacturer,Year_Introduced,latitude,longitude,Type_Main,Opening_Date,Speed_mph,Height_ft,Inversions,Gforce
0,False,False,False,False,False,False,False,False,False,False,True,False,False
1,False,False,False,False,False,False,False,False,False,True,True,False,False
2,False,False,False,True,False,False,False,False,True,True,True,False,True
3,False,False,False,False,False,False,False,False,False,True,True,False,True
4,False,False,False,False,False,False,False,False,False,True,True,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1082,False,False,True,False,False,True,True,False,True,False,True,False,False
1083,False,False,False,False,False,False,False,False,False,False,True,False,True
1084,False,False,True,False,False,True,True,False,False,False,True,False,False
1085,False,False,False,False,False,True,True,False,True,False,True,False,True


In [21]:
# How many missing values are there in each column?
data_df.isna().sum()

Coaster_Name         0
Location             0
Status             213
Manufacturer        59
Year_Introduced      0
latitude           275
longitude          275
Type_Main            0
Opening_Date       250
Speed_mph          150
Height_ft          916
Inversions           0
Gforce             725
dtype: int64
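
Raw counts are easier to judge next to the share of rows they represent; in the output above, for example, `Height_ft` is missing in 916 of 1087 rows, roughly 84%. One possible way to report both, sketched on a made-up frame (the `toy` data and the `report` name are assumptions):

```python
import numpy as np
import pandas as pd

# Made-up frame: one column mostly missing, one with a single gap
toy = pd.DataFrame({
    "Speed_mph": [50.0, np.nan, 62.5, 40.0],
    "Height_ft": [np.nan, np.nan, np.nan, 100.0],
})

missing_count = toy.isna().sum()   # number of missing values per column
missing_share = toy.isna().mean()  # mean of booleans = fraction missing

# Combine both views and list the worst columns first
report = pd.DataFrame({"missing": missing_count, "share": missing_share})
report = report.sort_values("share", ascending=False)
print(report)
```

On the real data, the same two lines would be applied to `data_df` instead of `toy`.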

In [23]:
# Are there rows that are duplicated across all columns?
data_df.loc[data_df.duplicated()]

Unnamed: 0,Coaster_Name,Location,Status,Manufacturer,Year_Introduced,latitude,longitude,Type_Main,Opening_Date,Speed_mph,Height_ft,Inversions,Gforce


In [24]:
# Rows whose Coaster_Name already appeared in an earlier row
data_df[data_df.duplicated(subset=['Coaster_Name'])]

Unnamed: 0,Coaster_Name,Location,Status,Manufacturer,Year_Introduced,latitude,longitude,Type_Main,Opening_Date,Speed_mph,Height_ft,Inversions,Gforce
43,Crystal Beach Cyclone,Crystal Beach Park,Removed,Traver Engineering,1927,42.8617,-79.0598,Wood,1926-01-01,60.0,,0,4.0
60,Derby Racer,Revere Beach,Removed,Fred W. Pearce,1937,42.4200,-70.9860,Wood,1911-01-01,,,0,
61,Blue Streak (Conneaut Lake),Conneaut Lake Park,Closed,,1938,41.6349,-80.3180,Wood,1938-05-23,50.0,,0,
167,Big Thunder Mountain Railroad,Other,,Arrow Development (California and Florida)Dyna...,1980,,,Steel,NaT,35.0,,0,
237,Thunder Run (Canada's Wonderland),Canada's Wonderland,Operating,Mack Rides,1986,43.8427,-79.5423,Steel,1981-05-23,39.8,32.8,0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1063,Lil' Devil Coaster,Six Flags Great Adventure,Operating,Zamperla,2021,40.1343,-74.4434,Steel,1999-01-01,,,0,
1064,Little Dipper (Conneaut Lake Park),Conneaut Lake Park,Operating,Allan Herschell Company,2021,41.6343,-80.3165,Steel,1950-01-01,,,0,
1080,Iron Gwazi,Busch Gardens Tampa Bay,Under construction,Rocky Mountain Construction,2022,28.0339,-82.4231,Steel,NaT,76.0,,2,
1082,American Dreier Looping,Other,,Anton Schwarzkopf,2022,,,Steel,NaT,53.0,,3,4.7
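
The two checks above give different answers because `duplicated()` compares every column by default, while `subset` restricts the comparison to the listed columns. The sketch below illustrates the difference, along with the `keep` parameter, on a made-up frame (the names and years in `toy` are invented, not taken from the dataset).

```python
import pandas as pd

# Made-up frame: the same name appears twice with different years
toy = pd.DataFrame({
    "Coaster_Name": ["Coaster A", "Coaster B", "Coaster A"],
    "Year_Introduced": [1911, 2022, 1937],
})

# Comparing all columns: no row is an exact copy of another
full_dupes = toy.duplicated()

# Comparing names only: the second "Coaster A" is flagged, the first is not
name_dupes = toy.duplicated(subset=["Coaster_Name"])

# keep=False flags every row that shares its name with another row
all_copies = toy.duplicated(subset=["Coaster_Name"], keep=False)

print(toy[all_copies])
```

Using `keep=False` is a convenient way to inspect all entries for a repeated name side by side before deciding which one to retain.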
