# Roller Coaster

#### Overview

This project is slightly different from others you have encountered thus far. Instead of a step-by-step tutorial, it contains a series of open-ended requirements that describe the project you'll be building. There are many possible ways to correctly fulfill these requirements, and you should expect to use the internet, Codecademy, and other resources when you encounter a problem that you cannot easily solve.

#### Project Goals

You will work to create several data visualizations that will give you insight into the world of roller coasters.

## Prerequisites

In order to complete this project, you should have completed the first two lessons in the [Data Analysis with Pandas Course](https://www.codecademy.com/learn/data-processing-pandas) and the first two lessons in the [Data Visualization in Python course](https://www.codecademy.com/learn/data-visualization-python). This content is also covered in the [Data Scientist Career Path](https://www.codecademy.com/learn/paths/data-science/).

## Project Requirements

1. Roller coasters are thrilling amusement park rides designed to make you squeal and scream! They take you up high, drop you to the ground quickly, and sometimes even spin you upside down before returning to a stop. Today you will be taking control back from the roller coasters and visualizing data covering international roller coaster rankings and roller coaster statistics.

   Roller coasters are often split into two main categories based on their construction material: **wood** or **steel**. Rankings for the best wood and steel roller coasters from the 2013 to 2018 [Golden Ticket Awards](http://goldenticketawards.com) are provided in `'Golden_Ticket_Award_Winners_Wood.csv'` and `'Golden_Ticket_Award_Winners_Steel.csv'`, respectively. Load each CSV into a DataFrame and inspect it to gain familiarity with the data.

In [1]:
# 1 
# Import necessary libraries
%matplotlib notebook
import pandas as pd
import seaborn as sns
import numpy as np
from matplotlib import pyplot as plt
from matplotlib.ticker import FuncFormatter
import matplotlib.ticker as mtick
import matplotlib.patheffects as path_effects

# load rankings wood data
wood_rankings = pd.read_csv('Golden_Ticket_Award_Winners_Wood.csv')

# load rankings steel data
steel_rankings = pd.read_csv('Golden_Ticket_Award_Winners_Steel.csv')

In [2]:
# Inspect wood Data
wood_rankings.head()

| | Rank | Name | Park | Location | Supplier | Year Built | Points | Year of Rank |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Boulder Dash | Lake Compounce | Bristol, Conn. | CCI | 2000 | 1333 | 2013 |
| 1 | 2 | El Toro | Six Flags Great Adventure | Jackson, N.J. | Intamin | 2006 | 1302 | 2013 |
| 2 | 3 | Phoenix | Knoebels Amusement Resort | Elysburg, Pa. | Dinn/PTC-Schmeck | 1985 | 1088 | 2013 |
| 3 | 4 | The Voyage | Holiday World | Santa Claus, Ind. | Gravity Group | 2006 | 1086 | 2013 |
| 4 | 5 | Thunderhead | Dollywood | Pigeon Forge, Tenn. | GCII | 2004 | 923 | 2013 |


In [3]:
# Inspect steel Data
steel_rankings.head()

| | Rank | Name | Park | Location | Supplier | Year Built | Points | Year of Rank |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Millennium Force | Cedar Point | Sandusky, Ohio | Intamin | 2000 | 1204 | 2013 |
| 1 | 2 | Bizarro | Six Flags New England | Agawam, Mass. | Intamin | 2000 | 1011 | 2013 |
| 2 | 3 | Expedition GeForce | Holiday Park | Hassloch, Germany | Intamin | 2001 | 598 | 2013 |
| 3 | 4 | Nitro | Six Flags Great Adventure | Jackson, N.J. | B&M | 2001 | 596 | 2013 |
| 4 | 5 | Apollo’s Chariot | Busch Gardens Williamsburg | Williamsburg, Va. | B&M | 1999 | 542 | 2013 |


In [4]:
# columns
wood_rankings.columns

Index(['Rank', 'Name', 'Park', 'Location', 'Supplier', 'Year Built', 'Points',
       'Year of Rank'],
      dtype='object')

In [5]:
# How many roller coasters are included in each ranking dataset? 
wood_options_count = wood_rankings.Name.nunique()
print('Wood Roller Coaster options: ' + str(wood_options_count))

steel_options_count = steel_rankings.Name.nunique()
print('Steel Roller Coaster options: ' + str(steel_options_count))

Wood Roller Coaster options: 61
Steel Roller Coaster options: 63


In [6]:
# How many different roller coaster suppliers are included in the rankings?
wood_supplier_count = wood_rankings.Supplier.nunique()
print('Wood Roller Coaster Supplier options: ' + str(wood_supplier_count))

steel_supplier_count = steel_rankings.Supplier.nunique()
print('Steel Roller Coaster Supplier options: ' + str(steel_supplier_count))

Wood Roller Coaster Supplier options: 32
Steel Roller Coaster Supplier options: 15


2. Write a function that will plot the ranking of a given roller coaster over time as a line. Your function should take a roller coaster's name and a ranking DataFrame as arguments. Make sure to include informative labels that describe your visualization.

   Call your function with `"El Toro"` as the roller coaster name and the wood ranking DataFrame. What issue do you notice? Update your function with an additional argument to alleviate the problem, and retest your function.
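
The solution below adds a park name argument, which suggests the issue is that a coaster name alone does not always identify a single ride. Here is a minimal sketch of how one might check for that, using a small made-up DataFrame. The rows and the `Example Park` name are illustrative assumptions, not values from the CSV.

```python
import pandas as pd

# Made-up rows for illustration only; they are not taken from the CSV files.
sample = pd.DataFrame({
    'Name': ['El Toro', 'El Toro', 'El Toro', 'Boulder Dash'],
    'Park': ['Six Flags Great Adventure', 'Example Park',
             'Six Flags Great Adventure', 'Lake Compounce'],
    'Year of Rank': [2013, 2013, 2014, 2013],
})

# Count how many distinct parks each coaster name appears at.
parks_per_name = sample.groupby('Name')['Park'].nunique()

# Names that map to more than one park are ambiguous on their own.
ambiguous_names = parks_per_name[parks_per_name > 1]
print(ambiguous_names)
```

Running the same check on `wood_rankings` would show whether `'El Toro'` appears at more than one park in the real data.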

In [14]:
# 2
# Create a function to plot rankings over time for 1 roller coaster
def plot_coaster_ranking(coaster_name, park_name, rankings_df):
  coaster_rankings = rankings_df[(rankings_df['Name'] == coaster_name) & (rankings_df['Park'] == park_name)]
  fig, ax = plt.subplots()
  ax.plot(coaster_rankings['Year of Rank'],coaster_rankings['Rank'])
  ax.set_yticks(coaster_rankings['Rank'].values)
  ax.set_xticks(coaster_rankings['Year of Rank'].values)
  ax.invert_yaxis()
  plt.title("{} Rankings".format(coaster_name))
  plt.xlabel('Year')
  plt.ylabel('Ranking')
  plt.show()


# Create a plot of El Toro ranking over time
plot_coaster_ranking('El Toro', 'Six Flags Great Adventure', wood_rankings)



3. Write a function that will plot the ranking of two given roller coasters over time as lines. Your function should take both roller coasters' names and a ranking DataFrame as arguments. Make sure to include informative labels that describe your visualization.

   Call your function with `"El Toro"` as one roller coaster name, `"Boulder Dash"` as the other roller coaster name, and the wood ranking DataFrame. What issue do you notice? Update your function with two additional arguments to alleviate the problem, and retest your function.

In [8]:
# 3
# Create a function to plot rankings over time for 2 roller coasters
def plot_coasters_rankings(coaster_name1, park_name1, coaster_name2, park_name2, rankings_df):
  coaster_rankings1 = rankings_df[(rankings_df['Name'] == coaster_name1) & (rankings_df['Park'] == park_name1)]
  coaster_rankings2 = rankings_df[(rankings_df['Name'] == coaster_name2) & (rankings_df['Park'] == park_name2)]
  fig, ax = plt.subplots()
  ax.plot(coaster_rankings1['Year of Rank'],coaster_rankings1['Rank'], color = 'green', label = coaster_name1)
  ax.plot(coaster_rankings2['Year of Rank'],coaster_rankings2['Rank'], color = 'red', label = coaster_name2)
  ax.invert_yaxis()
  plt.title("{} vs {} Rankings".format(coaster_name1,coaster_name2))
  plt.xlabel('Year')
  plt.ylabel('Ranking')
  plt.legend()
  plt.show()


# Create a plot of El Toro and Boulder Dash roller coasters
plot_coasters_rankings('El Toro','Six Flags Great Adventure','Boulder Dash','Lake Compounce',wood_rankings)



4. Write a function that will plot the ranking of the top `n` ranked roller coasters over time as lines. Your function should take a number `n` and a ranking DataFrame as arguments. Make sure to include informative labels that describe your visualization.

   For example, if `n == 5`, your function should plot a line for each roller coaster that has a rank of `5` or better (that is, a `Rank` value from `1` to `5`).
   
   Call your function with a value of `n` and either the wood ranking or steel ranking DataFrame.
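
Because two coasters can share a name, a top-`n` plot that groups by name alone may merge their lines. Here is a hedged sketch of one variant that draws one line per name-and-park pair. The function name `plot_top_n_by_park` and the sample rows are hypothetical. The `Agg` backend is an assumption, chosen so the sketch can run without a display.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd


def plot_top_n_by_park(rankings_df, n):
    """Plot one line per (Name, Park) pair that reached rank n or better."""
    top_n = rankings_df[rankings_df['Rank'] <= n]
    fig, ax = plt.subplots(figsize=(10, 10))
    # Grouping by both columns keeps same-named coasters at different parks apart.
    for (name, park), group in top_n.groupby(['Name', 'Park']):
        group = group.sort_values('Year of Rank')
        ax.plot(group['Year of Rank'], group['Rank'], marker='o',
                label='{} ({})'.format(name, park))
    ax.set_yticks(list(range(1, n + 1)))
    ax.invert_yaxis()  # put rank 1 at the top
    ax.set_title('Top {} Rankings'.format(n))
    ax.set_xlabel('Year')
    ax.set_ylabel('Ranking')
    ax.legend(loc='lower right')
    return ax


# Made-up rows for illustration only; they are not taken from the CSV files.
sample = pd.DataFrame({
    'Name': ['Coaster A', 'Coaster A', 'Coaster B', 'Coaster B', 'Coaster A'],
    'Park': ['Park 1', 'Park 1', 'Park 2', 'Park 2', 'Park 3'],
    'Rank': [1, 2, 2, 1, 3],
    'Year of Rank': [2013, 2014, 2013, 2014, 2013],
})
ax = plot_top_n_by_park(sample, 3)
labels = [line.get_label() for line in ax.get_lines()]
print(labels)
```

Inside the notebook itself, the `matplotlib.use('Agg')` line would be dropped so that the figure renders with the notebook backend.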

In [15]:
# 4
# Create a function to plot top n rankings over time
def plot_top_n(rankings_df,n):
  # Keep only the rows ranked n or better (rank 1 is the best)
  top_n_rankings = rankings_df[rankings_df['Rank'] <= n]
  fig, ax = plt.subplots(figsize=(10,10))
  for coaster in set(top_n_rankings['Name']):
    coaster_rankings = top_n_rankings[top_n_rankings['Name'] == coaster]
    ax.plot(coaster_rankings['Year of Rank'],coaster_rankings['Rank'],label=coaster)
  # Tick every rank from 1 to n instead of a hard-coded range
  ax.set_yticks([i for i in range(1,n+1)])
  ax.invert_yaxis()
  # Use n in the title so it matches the data being plotted
  plt.title("Top {} Rankings".format(n))
  plt.xlabel('Year')
  plt.ylabel('Ranking')
  plt.legend(loc=4)
  plt.show()

    
# Create a plot of top n rankings over time
plot_top_n(wood_rankings,5)



5. Now that you've visualized rankings over time, let's dive into the actual statistics of roller coasters themselves. [Captain Coaster](https://captaincoaster.com/en/) is a popular site for recording roller coaster information. Data on all roller coasters documented on Captain Coaster has been accessed through its API and stored in `roller_coasters.csv`. Load the data from the csv into a DataFrame and inspect it to gain familiarity with the data.

In [16]:
# 5
# load roller coaster data
coaster_data = pd.read_csv('roller_coasters.csv')
print(coaster_data.head())

            name material_type seating_type  speed  height  length  \
0       Goudurix         Steel     Sit Down   75.0    37.0   950.0   
1  Dream catcher         Steel    Suspended   45.0    25.0   600.0   
2     Alucinakis         Steel     Sit Down   30.0     8.0   250.0   
3       Anaconda        Wooden     Sit Down   85.0    35.0  1200.0   
4         Azteka         Steel     Sit Down   55.0    17.0   500.0   

   num_inversions     manufacturer            park            status  
0             7.0           Vekoma    Parc Asterix  status.operating  
1             0.0           Vekoma   Bobbejaanland  status.operating  
2             0.0         Zamperla    Terra Mítica  status.operating  
3             0.0  William J. Cobb  Walygator Parc  status.operating  
4             0.0           Soquet          Le Pal  status.operating  


6. Write a function that plots a histogram of any numeric column of the roller coaster DataFrame. Your function should take a DataFrame and a column name for which a histogram should be constructed as arguments. Make sure to include informative labels that describe your visualization.

   Call your function with the roller coaster DataFrame and one of the column names.
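
Since a histogram only makes sense for numeric columns, it can help to list them before choosing one. This minimal sketch assumes a small made-up DataFrame that reuses some of the column names from the inspection output above. `numeric_columns` is a hypothetical helper name.

```python
import pandas as pd

# Made-up rows for illustration only; they are not taken from roller_coasters.csv.
sample = pd.DataFrame({
    'name': ['Coaster A', 'Coaster B', 'Coaster C'],
    'material_type': ['Steel', 'Wooden', 'Steel'],
    'speed': [75.0, 45.0, None],
    'height': [37.0, 25.0, 8.0],
})


def numeric_columns(df):
    """Return the names of the columns that pandas treats as numeric."""
    return list(df.select_dtypes(include='number').columns)


print(numeric_columns(sample))  # ['speed', 'height']
```

Any name returned by this helper should be a reasonable candidate for the histogram function's column argument.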

In [17]:
# 6
# Create a function to plot histogram of column values
def plot_histogram(coaster_df, column_name):
  plt.hist(coaster_df[column_name].dropna())
  plt.title('Histogram of Roller Coaster {}'.format(column_name))
  plt.xlabel(column_name)
  plt.ylabel('Count')
  plt.show()


# Create histogram of roller coaster speed
plt.clf()
plot_histogram(coaster_data, 'speed')
plt.show()


In [18]:
# Create histogram of roller coaster length
plt.clf()
plot_histogram(coaster_data, 'length')
plt.show()



In [19]:
# Create histogram of roller coaster number of inversions
plt.clf()
plot_histogram(coaster_data, 'num_inversions')
plt.show()


In [20]:
# Create a function to plot histogram of height values
def plot_height_histogram(coaster_df):
  # Only heights up to 140 are kept; larger values are treated as outliers here
  heights = coaster_df[coaster_df['height'] <= 140]['height'].dropna()
  plt.hist(heights)
  plt.title('Histogram of Roller Coaster Height')
  plt.xlabel('Height')
  plt.ylabel('Count')
  plt.show()
# Create a histogram of roller coaster height
plot_height_histogram(coaster_data)
plt.show()


7. Write a function that creates a bar chart showing the number of inversions for each roller coaster at an amusement park. Your function should take the roller coaster DataFrame and an amusement park name as arguments. Make sure to include informative labels that describe your visualization.

   Call your function with the roller coaster DataFrame and amusement park name.

In [21]:
# 7
# Create a function to plot inversions by coaster at park
def plot_inversions_by_coaster(coaster_df, park_name):
  # Select the park's coasters and sort them from most to fewest inversions
  park_coasters = coaster_df[coaster_df['park'] == park_name]
  park_coasters = park_coasters.sort_values('num_inversions', ascending=False)
  coaster_names = park_coasters['name']
  number_inversions = park_coasters['num_inversions']
  # Get the axes first, then draw the bars and tick labels on that same axes
  ax = plt.subplot()
  ax.bar(range(len(number_inversions)),number_inversions)
  ax.set_xticks(range(len(coaster_names)))
  ax.set_xticklabels(coaster_names,rotation=90)
  plt.title('Number of Inversions Per Coaster at {}'.format(park_name))
  plt.xlabel('Roller Coaster')
  plt.ylabel('# of Inversions')
  plt.show()

# Create barplot of inversions by roller coasters
fig = plt.figure(figsize=(10,30))
plt.clf()
plot_inversions_by_coaster(coaster_data, 'Six Flags Great Adventure')
plt.show()


8. Write a function that creates a pie chart that compares the number of operating roller coasters (`'status.operating'`) to the number of closed roller coasters (`'status.closed.definitely'`). Your function should take the roller coaster DataFrame as an argument. Make sure to include informative labels that describe your visualization.

   Call your function with the roller coaster DataFrame.
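
One possible way to get the two counts in a single step is `value_counts()`, sketched here on made-up data. The `'status.other'` value is a placeholder assumption for any status outside the two being compared.

```python
import pandas as pd

# Made-up rows for illustration only; 'status.other' is a placeholder status.
sample = pd.DataFrame({
    'status': ['status.operating', 'status.operating', 'status.operating',
               'status.closed.definitely', 'status.other'],
})

wanted = ['status.operating', 'status.closed.definitely']

# Count every status, then keep the two of interest in a fixed order (0 if absent).
status_counts = sample['status'].value_counts().reindex(wanted, fill_value=0)

# Percentages over the two compared statuses only, as a pie chart would show them.
percentages = (status_counts / status_counts.sum() * 100).round(1)
print(status_counts.tolist(), percentages.tolist())
```

The resulting counts could be passed to `plt.pie()` in the same order as the labels.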

In [22]:
# 8
# Create a function to plot a pie chart of status.operating
def pie_chart_status(coaster_df):
  operating_coasters = coaster_df[coaster_df['status'] == 'status.operating']
  closed_coasters = coaster_df[coaster_df['status'] == 'status.closed.definitely']
  num_operating_coasters = len(operating_coasters)
  num_closed_coasters = len(closed_coasters)
  status_counts = [num_operating_coasters,num_closed_coasters]
  plt.pie(status_counts,autopct='%0.1f%%',labels=['Operating','Closed'])
  plt.title('Operating roller coasters vs closed roller coasters')
  plt.axis('equal')
  plt.show()
  

# Create pie chart of roller coasters
pie_chart_status(coaster_data)
plt.show()


9. `.scatter()` is another useful function in matplotlib that you might not have seen before. `.scatter()` produces a scatter plot, which is similar to `.plot()` in that it plots points on a figure. `.scatter()`, however, does not connect the points with a line. This allows you to analyze the relationship between two variables. Find [`.scatter()`'s documentation here](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html).

   Write a function that creates a scatter plot of two numeric columns of the roller coaster DataFrame. Your function should take the roller coaster DataFrame and two column names as arguments. Make sure to include informative labels that describe your visualization.
   
   Call your function with the roller coaster DataFrame and two column names.
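
A scatter plot shows a relationship visually. As an optional companion that the project does not require, a correlation coefficient can summarize the same relationship as a single number. This is a minimal sketch on made-up values in which speed is exactly twice the height.

```python
import pandas as pd

# Made-up values for illustration only; the last row has a missing height.
sample = pd.DataFrame({
    'height': [10.0, 20.0, 30.0, 40.0, None],
    'speed': [20.0, 40.0, 60.0, 80.0, 55.0],
})

# Series.corr uses Pearson correlation by default and skips rows
# where either value is missing.
r = sample['height'].corr(sample['speed'])
print(round(r, 3))
```

A value near 1 or -1 indicates a strong linear relationship, while a value near 0 indicates little linear relationship.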

In [23]:
# 9
# Create a function to plot scatter of any two columns
def plot_scatter(coaster_df, column_x, column_y):
  plt.scatter(coaster_df[column_x],coaster_df[column_y])
  plt.title('Scatter Plot of {} vs. {}'.format(column_y,column_x))
  plt.xlabel(column_x)
  plt.ylabel(column_y)
  plt.show()
  

# Create a function to plot scatter of speed vs height
def plot_scatter_height_speed(coaster_df):
  coaster_df = coaster_df[coaster_df['height'] < 140]
  plt.scatter(coaster_df['height'],coaster_df['speed'])
  plt.title('Scatter Plot of Speed vs. Height')
  plt.xlabel('Height')
  plt.ylabel('Speed')
  plt.show()
  

# Create a scatter plot of roller coaster height by speed
plot_scatter_height_speed(coaster_data)
plt.show()
