<h1>SI 564 Final Project: Trails in U.S. National Parks</h1>
<p>Haley Johnson</p>

<p>Code to create normalized database tables from the AllTrails dataset</p>

In [1]:
import pandas as pd
import numpy as np
import pymysql
from sqlalchemy import create_engine 
from IPython.display import Image

<h2>Database Diagram</h2>
<p><a href = "https://www.kaggle.com/datasets/planejane/national-park-trails">National Parks data</a></p>

In [2]:
Image(url="nat_parks_erd.png", width=800, height=800)

<h2>Connect to SQL</h2>

In [3]:
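import os

# Read the password from the environment instead of hardcoding it in the notebook.
# NAT_PARKS_DB_PASSWORD is an assumed variable name; set it before running this cell.
password = os.environ["NAT_PARKS_DB_PASSWORD"]
engine = create_engine(f'mysql+pymysql://haleyej-rw:{password}@34.134.16.183:14192/nat_parks')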

<h2>Split Data Into Tables</h2>

In [4]:
df = pd.read_csv("trails_data.csv")

In [5]:
df = df.drop(columns = ['trail_id', 'city_name', 'country_name', '_geoloc', 'features', 'activities'])

In [6]:
df['area_name'] = df['area_name'].str.replace(" National Park", "")

In [7]:
df['state_name'] = df['state_name'].astype(str)
# The dataset lists Maui as a state; map it to Hawaii
df['state_name'] = df['state_name'].replace('Maui', 'Hawaii')

In [8]:
def create_table(df, col):
    '''
    Builds a lookup table from one column
    of the dataframe
    
    Function is used to split the
    big dataframe into smaller tables
    for normalization
    
    Returns a new dataframe with one row
    per unique value in the target column,
    plus an integer id column
    '''
    values = df[col].unique()
    table = pd.DataFrame(values, columns = [col])
    table = table.reset_index()
    table = table.rename(columns = {'index': 'id'})
    return table
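
<p>To make the shape of these lookup tables concrete, here is a minimal sketch that runs the same logic on a few made-up rows. The toy values are hypothetical and are not taken from trails_data.csv.</p>

```python
import pandas as pd

def create_table(df, col):
    # Same logic as the cell above: one row per unique value,
    # with the positional index exposed as an integer id column.
    values = df[col].unique()
    table = pd.DataFrame(values, columns=[col])
    return table.reset_index().rename(columns={'index': 'id'})

# Hypothetical rows for illustration only
toy = pd.DataFrame({'state_name': ['Utah', 'Maine', 'Utah', 'Hawaii']})
toy_states = create_table(toy, 'state_name')
print(toy_states)
```

<p>Each distinct value gets one row, and ids start at 0 in order of first appearance.</p>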
    

In [9]:
parks_df = create_table(df, 'area_name')
states_df = create_table(df, 'state_name')
routes_df = create_table(df, 'route_type')

<h3>Trails Table</h3>

In [10]:
metric = df[df['units'] == 'm']
imperial = df[df['units'] == 'i']

In [11]:
def meters_to_yards(s):
    '''
    Takes in a column of the dataframe
    
    Converts meters to yards
    '''
    return s * 1.09361
    

In [12]:
# Work on an explicit copy so the assignments below
# do not trigger pandas' SettingWithCopyWarning
metric = metric.copy()
metric['elevation_gain'] = meters_to_yards(metric['elevation_gain'])
metric['length'] = meters_to_yards(metric['length'])


In [13]:
df = pd.concat([imperial, metric])

In [14]:
df = df.drop(columns = ['units'])
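
<p>As an alternative to splitting and concatenating, the same conversion can be sketched in place with a boolean mask and .loc. This is only an illustration on hypothetical rows; it assumes the same 'm' unit code and the same conversion factor used in meters_to_yards.</p>

```python
import pandas as pd

METERS_TO_YARDS = 1.09361  # same factor as meters_to_yards

# Hypothetical rows for illustration only
toy = pd.DataFrame({
    'length': [1000.0, 500.0],
    'elevation_gain': [100.0, 50.0],
    'units': ['m', 'i'],
})

# Convert only the rows flagged as metric; other rows are left untouched
is_metric = toy['units'] == 'm'
toy.loc[is_metric, ['length', 'elevation_gain']] *= METERS_TO_YARDS
toy = toy.drop(columns=['units'])
print(toy)
```

<p>Unlike the split and concat approach, this keeps the rows in their original order.</p>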

<h2>Normalize</h2>

In [15]:
def normalize(df1, df2, target, fk):
    '''
    Normalize tables to prepare for SQL by
    putting foreign keys into the main table
    
    Takes in four arguments:
    the two dataframes being merged,
    the column used to merge them,
    and the name of the foreign key column
    
    Returns a dataframe with the target
    column replaced by the foreign key
    '''
    df1 = df1.merge(df2, on = target)
    df1 = df1.rename(columns = {'id': fk})
    df1 = df1.drop(columns = target)
    return df1
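
<p>A minimal sketch of what normalize does to a single column, using hypothetical toy tables. Note that merge defaults to an inner join, so any row in the main table without a match in the lookup table would be dropped.</p>

```python
import pandas as pd

def normalize(df1, df2, target, fk):
    # Same logic as the cell above: attach the lookup id,
    # rename it to the foreign key, and drop the text column.
    df1 = df1.merge(df2, on=target)
    df1 = df1.rename(columns={'id': fk})
    return df1.drop(columns=target)

# Hypothetical tables for illustration only
toy_trails = pd.DataFrame({
    'name': ['Trail A', 'Trail B', 'Trail C'],
    'state_name': ['Utah', 'Maine', 'Utah'],
})
toy_states = pd.DataFrame({'id': [0, 1], 'state_name': ['Utah', 'Maine']})

result = normalize(toy_trails, toy_states, 'state_name', 'state_id')
print(result)
```

<p>The result keeps the name column and carries a state_id in place of state_name.</p>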

In [16]:
targets = [(states_df, 'state_name', 'state_id'), (routes_df, 'route_type', 'route_type_id'),
           (parks_df, 'area_name', 'park_id')]
           
for lookup_df, target, fk in targets:
    df = normalize(df, lookup_df, target, fk)

<h2>Write To SQL</h2>

In [17]:
df.to_sql("trails", con=engine, index = False)
states_df.to_sql("states", con = engine, index = False)
parks_df.to_sql("parks", con = engine, index = False)
routes_df.to_sql("route_types", con = engine, index = False)
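
<p>One way to sanity-check the written tables is to join a foreign key back to its lookup table. The sketch below uses an in-memory SQLite connection as a stand-in for the MySQL engine, and the rows and the name column are hypothetical.</p>

```python
import sqlite3
import pandas as pd

# In-memory SQLite stands in for the MySQL engine in this sketch
con = sqlite3.connect(":memory:")

# Hypothetical tables shaped like the normalized output
states = pd.DataFrame({'id': [0, 1], 'state_name': ['Utah', 'Maine']})
trails = pd.DataFrame({
    'name': ['Trail A', 'Trail B', 'Trail C'],
    'state_id': [0, 1, 0],
})
states.to_sql("states", con=con, index=False)
trails.to_sql("trails", con=con, index=False)

# Every trail should resolve to exactly one state through state_id
check = pd.read_sql(
    "SELECT t.name, s.state_name "
    "FROM trails t JOIN states s ON t.state_id = s.id "
    "ORDER BY t.name",
    con,
)
con.close()
print(check)
```

<p>If the joined row count matches the number of rows in trails, every foreign key points at an existing lookup row.</p>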