# Data Analysis on Roller Coasters Dataset

In [2]:
# Import necessary libraries
import pandas as pd
import seaborn as sns
coasters = pd.read_csv("/coasters.csv") # load data

## What variables are available in this dataset?
'state', 'country', 'steel', 'construction', 'height', 'speed', 'length', 'inversions', 'numinversions', 'duration', 'gforce', 'opened', 'region', 'continent'

In [7]:
column_names = coasters.columns # get column names
column_names

Index(['state', 'country', 'steel', 'construction', 'height', 'speed',
       'length', 'inversions', 'numinversions', 'duration', 'gforce', 'opened',
       'region', 'continent'],
      dtype='object')

## How many coasters does the dataset include?
408

In [None]:
len(coasters) # get length (i.e. number of rows) of dataset

408

## Do any of the descriptive statistics make any of the variables stick out from the rest?
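Yes. The 'numinversions' variable stands out most: its standard deviation (1.66) is more than twice its mean (0.78), and its 75th percentile is 0, so at least three quarters of the coasters have no inversions. The 'height' and 'length' variables also have standard deviations close to their means, which indicates high relative variation in these columns. The counts also differ widely between variables (only 60 coasters report 'gforce', for example), which points to missing values.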

In [14]:
stats = coasters.describe() # get summary statistics for numerical variables
stats

| | steel | height | speed | length | numinversions | duration | gforce | opened | region |
|---|---|---|---|---|---|---|---|---|---|
| count | 408.0 | 326.0 | 270.0 | 318.0 | 408.0 | 192.0 | 60.0 | 380.0 | 408.0 |
| mean | 0.897059 | 23.124871 | 69.362667 | 597.040716 | 0.784314 | 112.4625 | 4.115 | 1995.434211 | 1.786765 |
| std | 0.304255 | 18.523768 | 29.327743 | 432.256987 | 1.664058 | 50.518467 | 1.006896 | 12.753121 | 0.887323 |
| min | 0.0 | 2.4384 | 9.72 | 12.192 | 0.0 | 0.3 | 2.1 | 1924.0 | 1.0 |
| 25% | 1.0 | 8.6508 | 45.0 | 291.0 | 0.0 | 75.0 | 3.175 | 1991.0 | 1.0 |
| 50% | 1.0 | 18.288 | 68.85 | 415.7475 | 0.0 | 108.0 | 4.5 | 1999.0 | 1.0 |
| 75% | 1.0 | 33.1674 | 88.95 | 833.121 | 0.0 | 140.75 | 5.0 | 2004.0 | 3.0 |
| max | 1.0 | 128.016 | 194.4 | 2243.02 | 10.0 | 300.0 | 6.2 | 2014.0 | 3.0 |


## Is standardization necessary for these variables?
Whether standardization is necessary depends on the intended model. It is generally needed when the model assumes normally distributed inputs, is sensitive to differences in variance, or when the variables are measured on different scales. These variables have not been normalized, several have large coefficients of variation (for example, 'numinversions' at about 2.12), and they are all on different scales, so standardization would be necessary for any model that meets one of these criteria.

In [15]:
coast_std = stats.loc['std'] # get standard deviation from descriptive statistics
coast_mean = stats.loc['mean'] # get mean from descriptive statistics
coast_cv = coast_std / coast_mean # calculate coefficient of variation for each variable
coast_cv

| variable | coefficient of variation |
|---|---|
| steel | 0.33917 |
| height | 0.801032 |
| speed | 0.422817 |
| length | 0.723999 |
| numinversions | 2.121674 |
| duration | 0.449203 |
| gforce | 0.244689 |
| opened | 0.006391 |
| region | 0.496609 |

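The standardization discussed in this section can be sketched as a z-score transform: subtract each column's mean and divide by its standard deviation. The `toy` DataFrame below is hypothetical data invented for illustration, not values from `coasters.csv`, and this is only one common way to standardize.

```python
import pandas as pd

# Hypothetical toy data standing in for a few numeric coaster columns;
# these values are made up and are not taken from coasters.csv.
toy = pd.DataFrame({
    "height": [10.0, 20.0, 30.0, 40.0],
    "speed": [40.0, 60.0, 80.0, 100.0],
    "length": [200.0, 400.0, 800.0, 1600.0],
})

# z-score: subtract each column's mean, then divide by its sample standard deviation
toy_z = (toy - toy.mean()) / toy.std()

print(toy_z.mean().round(6))  # column means after standardization
print(toy_z.std().round(6))   # column standard deviations after standardization
```

After this transform every column has a mean of about 0 and a sample standard deviation of 1, so the columns are on a common scale regardless of their original units.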

## Are there any missing values?
Yes, there are missing values in the 'height', 'speed', 'length', 'duration', 'gforce', and 'opened' columns with the most missing values coming from the 'gforce' column.

In [16]:
coasters.isnull().sum() # get number of missing values for each column

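| column | missing values |
|---|---|
| state | 0 |
| country | 0 |
| steel | 0 |
| construction | 0 |
| height | 82 |
| speed | 138 |
| length | 90 |
| inversions | 0 |
| numinversions | 0 |
| duration | 216 |
| gforce | 348 |
| opened | 28 |
| region | 0 |
| continent | 0 |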

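Raw counts of missing values are easier to compare when shown next to the share of rows they represent. A minimal sketch, assuming a hypothetical `toy` DataFrame with made-up values (not the real dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with some missing entries (not from coasters.csv)
toy = pd.DataFrame({
    "height": [10.0, np.nan, 30.0, 40.0],
    "gforce": [np.nan, np.nan, np.nan, 4.5],
    "steel": [1, 0, 1, 1],
})

missing_count = toy.isnull().sum()   # number of missing values per column
missing_share = toy.isnull().mean()  # fraction of rows missing per column

# Combine both views and list the columns with the most missing values first
summary = pd.DataFrame({"missing": missing_count, "share": missing_share})
summary = summary.sort_values("missing", ascending=False)
print(summary)
```

Taking the mean of the boolean mask works because `True` counts as 1 and `False` as 0.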

## What continents are represented?
North America, Latin America, and Europe

In [17]:
unique_continents = coasters['continent'].unique() # get unique values of 'continent' variable
unique_continents

array(['North America', 'Latin America', 'Europe'], dtype=object)

## What are the tallest, fastest, and longest coasters included?
Tallest: Coaster No. 193 at a height of 128.016

Fastest: Coaster No. 193 at a speed of 194.4

Longest: Coaster No. 164 at a length of 2243.02

In [19]:
tallest_index = coasters['height'].idxmax() # get index of tallest coaster
tallest_height = coasters.loc[tallest_index].height # get height of tallest coaster
[tallest_index, tallest_height]

[193, 128.016]

In [20]:
fastest_index = coasters['speed'].idxmax() # get index of fastest coaster
fastest_speed = coasters.loc[fastest_index].speed # get speed of fastest coaster
[fastest_index, fastest_speed]

[193, 194.4]

In [22]:
longest_index = coasters['length'].idxmax() # get index of longest coaster
longest_length = coasters.loc[longest_index].length # get length of longest coaster
[longest_index, longest_length]

[164, 2243.02]

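The three lookups above follow the same pattern, so they can also be done in one pass by calling `idxmax` on several columns at once. This is an alternative sketch on a hypothetical `toy` DataFrame with made-up values, not the notebook's own method:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not from coasters.csv); one height is missing on purpose
toy = pd.DataFrame({
    "height": [10.0, 50.0, np.nan],
    "speed": [40.0, 90.0, 70.0],
    "length": [900.0, 300.0, 600.0],
})

cols = ["height", "speed", "length"]
max_index = toy[cols].idxmax()  # row label of the maximum in each column (NaN is skipped)
max_value = toy[cols].max()     # the maximum value in each column

# One row per variable: which row holds the record and what the record is
records = pd.DataFrame({"row": max_index, "value": max_value})
print(records)
```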
## What is the average year coasters opened in North America? Latin America? Europe?
North America: 1992.56

Latin America: 2000.10

Europe: 1997.77

In [None]:
avg_opened = coasters.groupby('continent')['opened'].mean() # get mean of 'opened' for each continent
avg_opened

| continent | opened |
|---|---|
| Europe | 1997.774194 |
| Latin America | 2000.101695 |
| North America | 1992.563452 |

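Because 'opened' has missing values, each group mean is computed only from the rows where the year is known. A minimal sketch, using a hypothetical `toy` DataFrame with made-up values, that reports the mean alongside the number of non-missing years behind it:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not from coasters.csv); one opening year is missing
toy = pd.DataFrame({
    "continent": ["Europe", "Europe", "North America", "North America", "North America"],
    "opened": [1990.0, 2000.0, 1980.0, np.nan, 2000.0],
})

# "mean" ignores NaN; "count" gives the number of non-missing values per group
opened_summary = toy.groupby("continent")["opened"].agg(["mean", "count"])
print(opened_summary)
```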

## How do the average height, speed, and length change depending on whether the coaster is wood or steel?

Wood coasters are on average taller, faster, and longer than steel ones.

In [None]:
height_construct = coasters.groupby('construction')['height'].mean() # get mean of 'height' for each construction
height_construct

| construction | height |
|---|---|
| Steel | 22.082011 |
| Wood | 31.028653 |


In [None]:
speed_construct = coasters.groupby('construction')['speed'].mean() # get mean of 'speed' for each construction
speed_construct

| construction | speed |
|---|---|
| Steel | 66.468475 |
| Wood | 89.451765 |


In [None]:
length_construct = coasters.groupby('construction')['length'].mean() # get mean of 'length' for each construction
length_construct

| construction | length |
|---|---|
| Steel | 528.607143 |
| Wood | 1072.65405 |

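The three group means above can also be produced with a single `groupby` by selecting all three columns at once, which puts the comparison in one table. A sketch on a hypothetical `toy` DataFrame with made-up values:

```python
import pandas as pd

# Hypothetical toy data (not from coasters.csv)
toy = pd.DataFrame({
    "construction": ["Steel", "Steel", "Wood", "Wood"],
    "height": [10.0, 20.0, 30.0, 40.0],
    "speed": [50.0, 70.0, 80.0, 100.0],
    "length": [300.0, 500.0, 900.0, 1100.0],
})

# Mean of each selected column within each construction type
by_construction = toy.groupby("construction")[["height", "speed", "length"]].mean()
print(by_construction)
```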

## Which continent has the tallest, fastest, and/or longest average coaster?

North America has the tallest, fastest, and longest average coaster.

In [None]:
tallest_continent = coasters.groupby('continent')['height'].mean() # get mean of 'height' for each continent
tallest_continent

| continent | height |
|---|---|
| Europe | 16.028448 |
| Latin America | 21.327692 |
| North America | 29.607641 |


In [None]:
fastest_continent = coasters.groupby('continent')['speed'].mean() # get mean of 'speed' for each continent
fastest_continent

| continent | speed |
|---|---|
| Europe | 54.955789 |
| Latin America | 67.171429 |
| North America | 81.077143 |


In [None]:
longest_continent = coasters.groupby('continent')['length'].mean() # get mean of 'length' for each continent
longest_continent

| continent | length |
|---|---|
| Europe | 484.44 |
| Latin America | 493.067692 |
| North America | 722.590188 |

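Rather than reading the leading continent off each table by eye, the group means can be computed together and `idxmax` can name the continent with the highest mean for each variable. A sketch on a hypothetical `toy` DataFrame with made-up values, so its result does not reflect the real dataset:

```python
import pandas as pd

# Hypothetical toy data (not from coasters.csv)
toy = pd.DataFrame({
    "continent": ["Europe", "Europe", "North America", "North America"],
    "height": [10.0, 20.0, 30.0, 50.0],
    "speed": [90.0, 100.0, 60.0, 80.0],
    "length": [400.0, 500.0, 700.0, 900.0],
})

# Mean height, speed, and length for each continent
continent_means = toy.groupby("continent")[["height", "speed", "length"]].mean()

# For each variable, the continent whose mean is largest
leaders = continent_means.idxmax()
print(leaders)
```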