### This notebook explores, and eventually cleans, three data sets of roller coaster information.

In [1]:
import pandas as pd

#### Creation of DataFrames from CSV files

In [3]:
steel = pd.read_csv("Golden_Ticket_Award_Winners_Steel.csv", index_col="Rank")
steel.head()

Rank,Name,Park,Location,Supplier,Year Built,Points,Year of Rank
1,Millennium Force,Cedar Point,"Sandusky, Ohio",Intamin,2000,1204,2013
2,Bizarro,Six Flags New England,"Agawam, Mass.",Intamin,2000,1011,2013
3,Expedition GeForce,Holiday Park,"Hassloch, Germany",Intamin,2001,598,2013
4,Nitro,Six Flags Great Adventure,"Jackson, N.J.",B&M,2001,596,2013
5,Apollo’s Chariot,Busch Gardens Williamsburg,"Williamsburg, Va.",B&M,1999,542,2013


In [5]:
wood = pd.read_csv("Golden_Ticket_Award_Winners_Wood.csv", index_col="Rank")
wood.head()

Rank,Name,Park,Location,Supplier,Year Built,Points,Year of Rank
1,Boulder Dash,Lake Compounce,"Bristol, Conn.",CCI,2000,1333,2013
2,El Toro,Six Flags Great Adventure,"Jackson, N.J.",Intamin,2006,1302,2013
3,Phoenix,Knoebels Amusement Resort,"Elysburg, Pa.",Dinn/PTC-Schmeck,1985,1088,2013
4,The Voyage,Holiday World,"Santa Claus, Ind.",Gravity Group,2006,1086,2013
5,Thunderhead,Dollywood,"Pigeon Forge, Tenn.",GCII,2004,923,2013


In [6]:
coasters = pd.read_csv("roller_coasters.csv")
coasters.head()

Unnamed: 0,name,material_type,seating_type,speed,height,length,num_inversions,manufacturer,park,status
0,Goudurix,Steel,Sit Down,75.0,37.0,950.0,7.0,Vekoma,Parc Asterix,status.operating
1,Dream catcher,Steel,Suspended,45.0,25.0,600.0,0.0,Vekoma,Bobbejaanland,status.operating
2,Alucinakis,Steel,Sit Down,30.0,8.0,250.0,0.0,Zamperla,Terra Mítica,status.operating
3,Anaconda,Wooden,Sit Down,85.0,35.0,1200.0,0.0,William J. Cobb,Walygator Parc,status.operating
4,Azteka,Steel,Sit Down,55.0,17.0,500.0,0.0,Soquet,Le Pal,status.operating


In [9]:
steel.info()
wood.info()
coasters.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 180 entries, 1 to 50
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Name          180 non-null    object
 1   Park          180 non-null    object
 2   Location      180 non-null    object
 3   Supplier      180 non-null    object
 4   Year Built    180 non-null    int64 
 5   Points        180 non-null    int64 
 6   Year of Rank  180 non-null    int64 
dtypes: int64(3), object(4)
memory usage: 11.2+ KB
<class 'pandas.core.frame.DataFrame'>
Int64Index: 180 entries, 1 to 50
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Name          180 non-null    object
 1   Park          180 non-null    object
 2   Location      180 non-null    object
 3   Supplier      179 non-null    object
 4   Year Built    180 non-null    int64 
 5   Points        180 non-null    int64 
 6   Year of Rank  180 non-nu

#### Initial Observations:
- 1 null entry in the wood data set (`Supplier` column)
- Several null entries in the coasters data set. We will leave them for now and reassess after merging with the other data sets
- `Location` will require cleaning: values are formatted as "city, state" OR "city, country" (steel & wood data sets)
- `status` will require cleaning: we must strip the `status.` prefix, then further segment the closed values into types of closed status

We will clean each data set one at a time, then merge them into one DataFrame for exploration and analysis.
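As a rough sketch of the status cleaning described above, the helper below strips the `status.` prefix and splits whatever remains into a main state and an optional detail. The function name `split_status` is hypothetical, and only `status.operating` appears in the preview above; the `status.closed.temporarily` value is an assumed example of what a closed status might look like, not a value confirmed by the data.

```python
PREFIX = "status."

def split_status(raw):
    """Strip the 'status.' prefix and split the rest into (state, detail)."""
    if not isinstance(raw, str):
        # Leave missing values (None / NaN) untouched.
        return (raw, None)
    trimmed = raw[len(PREFIX):] if raw.startswith(PREFIX) else raw
    # partition splits on the first "." only, so any deeper detail stays together.
    state, _, detail = trimmed.partition(".")
    return (state, detail or None)

print(split_status("status.operating"))           # value seen in coasters.head()
print(split_status("status.closed.temporarily"))  # assumed example value
```

Because the helper works on one value at a time, it could be applied to the whole column with something like `coasters["status"].map(split_status)` once the real set of status values has been inspected.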

#### Investigating **steel** data set

In [13]:
steel["Location"].value_counts()

Sandusky, Ohio              19
Williamsburg, Va.           12
Charlotte, N.C.             10
Tampa, Fla.                  9
Agawam, Mass.                9
Austell, Ga.                 8
Jackson, N.J.                7
Arlington, Texas             7
Mason, Ohio                  7
Vaughan, Ontario, Canada     7
Valencia, Calif.             7
Orlando, Fla.                6
Gurnee, Ill.                 6
Brühl, Germany               6
Hassloch, Germany            6
Rust, Germany                5
Louisville, Ky.              5
Gothenburg, Sweden           5
Doswell, Va.                 5
Staffordshire, England       4
Montreal, Quebec, Canada     3
Santa Claus, Ind.            3
West Mifflin, Pa.            3
San Antonio, Texas           3
Mexico City, Mexico          3
Farmington, Utah             3
Allentown, Pa.               2
Hershey, Pa.                 2
Gothemburg, Sweden           1
Stockholm, Sweden            1
Upper Marlsboro, Md.         1
Pigeon Forge, Tenn.          1
Branson,

##### Looking at the value counts, we can see that there are 36 different values. Country values use the full name, while many state names are abbreviated.

##### First, we will repeat this process with the wood data set, so we can determine all unique countries and state abbreviations. Then we will combine these into a single collection, so we can easily clean both sets with the same process.
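One possible way to build that combined collection is to take the last comma-separated component of every location (the state or country) and gather the unique values across both data sets. This is a minimal sketch: the helper names `last_component` and `trailing_components` are hypothetical, and the two sample lists are a handful of values copied from the outputs in this notebook rather than the full columns.

```python
def last_component(location):
    """Return the final comma-separated piece of a location string."""
    return location.split(",")[-1].strip()

def trailing_components(*location_lists):
    """Collect the unique trailing components across any number of lists."""
    return sorted({last_component(loc) for locs in location_lists for loc in locs})

# Small samples taken from the steel and wood outputs shown in this notebook.
steel_sample = ["Sandusky, Ohio", "Agawam, Mass.", "Hassloch, Germany",
                "Vaughan, Ontario, Canada"]
wood_sample = ["Bristol, Conn.", "Jackson, N.J.", "Gothenburg, Sweden",
               "Vancouver, B.C., Canada"]

print(trailing_components(steel_sample, wood_sample))
# ['Canada', 'Conn.', 'Germany', 'Mass.', 'N.J.', 'Ohio', 'Sweden']
```

With the real data, the same function could be called with `steel["Location"].unique()` and `wood["Location"].unique()` in place of the sample lists.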

In [14]:
wood["Location"].value_counts()

Elysburg, Pa.              12
Santa Claus, Ind.          12
Pigeon Forge, Tenn.         9
Eureka, Mo.                 9
Mason, Ohio                 7
West Mifflin, Pa.           7
Jackson, N.J.               6
Bristol, Conn.              6
Branson, Mo.                6
Erie, Pa.                   6
Hershey, Pa.                5
Santa Clara, Calif.         5
Muskegon, Mich.             4
Gothenburg, Sweden          4
Brooklyn, N.Y.              3
Conneaut Lake, Pa.          3
Bessemer, Ala.              3
Wisconsin Dells, Wis.       3
Orlando, Fla.               3
Sandusky, Ohio              3
Blackpool, England          3
Norrköping, Sweden          3
Vancouver, B.C., Canada     3
Sevenum, Netherlands        3
Rust, Germany               3
Ashbourne, Ireland          3
Soltau, Germany             3
Santa Cruz, Calif.          3
Kansas City, Mo.            3
Gurnee, Ill.                3
Lake George, N.Y.           3
Legendfeld, Germany         2
Stockholm, Sweden           2
Doswell, V

##### State abbreviations are as follows:
- Ala. : Alabama
- Calif. : California
- Conn. : Connecticut
- Fla. : Florida 
- Ga. : Georgia
- Ill. : Illinois
- Ind. : Indiana
- Ky. : Kentucky
- Mass. : Massachusetts
- Md. : Maryland
- Mich. : Michigan
- Minn. : Minnesota
- Mo. : Missouri
- N.C. : North Carolina
- N.J. : New Jersey
- N.Y. : New York
- Pa. : Pennsylvania
- Tenn. : Tennessee
- Va. : Virginia
- Wis. : Wisconsin

##### Non-State Abbreviations:
- Ont. : Ontario
- B.C. : British Columbia

##### We can also determine all unique countries:
- Countries : ["Canada", "Spain", "Sweden", "Austria", "Mexico", "England", "Germany", "Netherlands", "Ireland", "South Korea", "China", "Denmark", "Wales"]

##### Lastly, there is one anomalous entry where "Kemah Boardwalk", the park name, is listed as the location. We'll need to replace it with the target location: **Kemah, Texas, United States**
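The lookups above could be put to work roughly as follows. This is a sketch under stated assumptions, not the final cleaning code: the function name `normalize_location` and the constant names are hypothetical, the target format "city, state, United States" is inferred from the Kemah replacement above, and any location whose last component is not a listed country is assumed to be in the United States.

```python
STATE_ABBREVIATIONS = {
    "Ala.": "Alabama", "Calif.": "California", "Conn.": "Connecticut",
    "Fla.": "Florida", "Ga.": "Georgia", "Ill.": "Illinois", "Ind.": "Indiana",
    "Ky.": "Kentucky", "Mass.": "Massachusetts", "Md.": "Maryland",
    "Mich.": "Michigan", "Minn.": "Minnesota", "Mo.": "Missouri",
    "N.C.": "North Carolina", "N.J.": "New Jersey", "N.Y.": "New York",
    "Pa.": "Pennsylvania", "Tenn.": "Tennessee", "Va.": "Virginia",
    "Wis.": "Wisconsin",
}
REGION_ABBREVIATIONS = {"Ont.": "Ontario", "B.C.": "British Columbia"}
COUNTRIES = {"Canada", "Spain", "Sweden", "Austria", "Mexico", "England",
             "Germany", "Netherlands", "Ireland", "South Korea", "China",
             "Denmark", "Wales"}
# Entries that do not follow the "city, state/country" pattern at all.
SPECIAL_CASES = {"Kemah Boardwalk": "Kemah, Texas, United States"}

def normalize_location(location):
    """Expand abbreviations and make the country explicit."""
    if location in SPECIAL_CASES:
        return SPECIAL_CASES[location]
    parts = [p.strip() for p in location.split(",")]
    if parts[-1] in COUNTRIES:
        # "city, country" or "city, region, country": expand any region abbreviation.
        middle = [REGION_ABBREVIATIONS.get(p, p) for p in parts[1:-1]]
        return ", ".join([parts[0], *middle, parts[-1]])
    # Assumption: anything else is a US "city, state" entry.
    state = STATE_ABBREVIATIONS.get(parts[-1], parts[-1])
    return ", ".join([*parts[:-1], state, "United States"])

for raw in ["Agawam, Mass.", "Sandusky, Ohio", "Vancouver, B.C., Canada",
            "Hassloch, Germany", "Kemah Boardwalk"]:
    print(raw, "->", normalize_location(raw))
```

Keeping the lookups in plain dictionaries means the same function can be mapped over the `Location` column of both the steel and wood data sets, and new abbreviations or special cases can be added without touching the logic.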