# Data Preprocessing
Myriah Hodgson, for DSCI 372

In [127]:
# imports
import pandas as pd
import numpy as np
import seaborn as sns

Let's start by reading in the data:

In [128]:
# read in the data and look at the columns
park_visitation = pd.read_csv('Park_Data.csv')
park_visitation.columns

Index(['ParkName', 'UnitCode', 'ParkType', 'Region', 'State', 'Year', 'Month',
       'RecreationVisits', 'NonRecreationVisits', 'TentCampers', 'RVCampers',
       'Backcountry'],
      dtype='object')

And check whether there are any null values:

In [129]:
# find any null values
print(park_visitation.isnull().sum())

ParkName               0
UnitCode               0
ParkType               0
Region                 0
State                  0
Year                   0
Month                  0
RecreationVisits       0
NonRecreationVisits    0
TentCampers            0
RVCampers              0
Backcountry            0
dtype: int64


There are no null values in any of our columns. Let's look at the data type of each column:

In [130]:
print(park_visitation.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33395 entries, 0 to 33394
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   ParkName             33395 non-null  object
 1   UnitCode             33395 non-null  object
 2   ParkType             33395 non-null  object
 3   Region               33395 non-null  object
 4   State                33395 non-null  object
 5   Year                 33395 non-null  int64 
 6   Month                33395 non-null  int64 
 7   RecreationVisits     33395 non-null  int64 
 8   NonRecreationVisits  33395 non-null  int64 
 9   TentCampers          33395 non-null  int64 
 10  RVCampers            33395 non-null  int64 
 11  Backcountry          33395 non-null  int64 
dtypes: int64(7), object(5)
memory usage: 3.1+ MB
None


In [131]:
park_visitation.head(2)

Unnamed: 0,ParkName,UnitCode,ParkType,Region,State,Year,Month,RecreationVisits,NonRecreationVisits,TentCampers,RVCampers,Backcountry
0,Acadia NP,ACAD,National Park,Northeast,ME,1979,1,6011,15252,102,13,0
1,Acadia NP,ACAD,National Park,Northeast,ME,1979,2,5243,13776,53,8,0


In [132]:
park_visitation['ParkType'].unique()

array(['National Park'], dtype=object)

We can drop the 'ParkType' column because every entry has the same value, 'National Park', so it carries no information for prediction.

In [133]:
park_visitation = park_visitation.drop(columns=['ParkType'])
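
The same check can be run for every column at once instead of inspecting them one by one. The following is a minimal sketch on a made-up toy frame (the values and the `constant_cols` name are illustrative, not taken from the park data):

```python
import pandas as pd

# Toy frame standing in for park_visitation (values are made up)
toy = pd.DataFrame({
    'ParkName': ['Acadia NP', 'Arches NP', 'Acadia NP'],
    'ParkType': ['National Park'] * 3,
    'Month': [1, 1, 2],
})

# nunique() counts distinct values; a count of 1 means the column is constant
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]

# drop every constant column in one step
toy = toy.drop(columns=constant_cols)
print(constant_cols)
```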

Fortunately, our data is already fairly clean. Every column we want to be numeric already is, and there are no null values anywhere in the dataset. The only real concern is that the 'Backcountry' column contains quite a few 0 values. However, since camper counts are generally low in many months, it is plausible that backcountry camping is the least common activity, or one limited to only a few months or parks.
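
One way to put a number on "quite a few" would be to compute the share of zero entries, overall and per month. This is only a sketch on made-up values; `zero_share` and `zero_share_by_month` are illustrative names, and nothing here is computed from the real park data:

```python
import pandas as pd

# Toy frame standing in for park_visitation (values are made up)
toy = pd.DataFrame({
    'Month': [1, 1, 7, 7],
    'Backcountry': [0, 0, 120, 0],
})

# the mean of a boolean Series is the fraction of True values
zero_share = (toy['Backcountry'] == 0).mean()

# the same fraction, computed separately for each month
zero_share_by_month = toy['Backcountry'].eq(0).groupby(toy['Month']).mean()

print(zero_share)
print(zero_share_by_month)
```

If the zeros cluster in particular months or parks, that would support reading them as real "no backcountry campers" counts rather than missing data.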

My objective in this project is to classify each data entry as high season or low season based on the features of our dataset. Such predictions could show which times of year have historically been busy at particular parks, helping visitors decide when to visit and park staff decide how to staff their parks.

Let's create our target variable based on the 'RecreationVisits' column. If an entry is above the median value for its park, we will consider it high season; all other entries are low season.

In [134]:
# group by the park name, taking the median of that park's recreation visits
park_grouped = park_visitation.groupby(by='ParkName')['RecreationVisits'].median().reset_index()

# rename as median recreation
park_grouped = park_grouped.rename(columns={'RecreationVisits':'MedianRecreation'})

# merge tables
park_visitation = pd.merge(left=park_visitation, right=park_grouped, on='ParkName')


In [None]:
# create column for if high visitation or not - 1 if above that park's median, 0 else
park_visitation['High_Visitation'] = (park_visitation['RecreationVisits'] > park_visitation['MedianRecreation']).astype(int)

# drop 'RecreationVisits' and 'MedianRecreation' because they define the target and so can't be used as predictors
park_visitation = park_visitation.drop(columns=['RecreationVisits', 'MedianRecreation'])
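
The per-park median comparison can also be written without a merge, using `groupby(...).transform`, which returns one value per original row. A minimal sketch on made-up numbers (the park names 'A' and 'B' and the visit counts are placeholders):

```python
import pandas as pd

# Toy frame standing in for park_visitation (values are made up)
toy = pd.DataFrame({
    'ParkName': ['A', 'A', 'A', 'B', 'B', 'B'],
    'RecreationVisits': [10, 20, 30, 100, 300, 200],
})

# transform('median') broadcasts each park's median back onto its own rows
park_median = toy.groupby('ParkName')['RecreationVisits'].transform('median')

# 1 if strictly above that park's median, 0 otherwise
toy['High_Visitation'] = (toy['RecreationVisits'] > park_median).astype(int)
print(toy)
```

Because the comparison is strict, a row exactly equal to its park's median is labeled 0, which matches the rule used in the cell above.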

Next, we can encode our categorical variables as numeric indicator columns:

In [136]:
park_encoded = pd.get_dummies(park_visitation, columns=['ParkName', 'UnitCode', 'Region', 'State'], drop_first=True)

In [137]:
park_encoded.head()

Unnamed: 0,Year,Month,NonRecreationVisits,TentCampers,RVCampers,Backcountry,High_Visitation,ParkName_Arches NP,ParkName_Badlands NP,ParkName_Big Bend NP,...,State_SC,State_SD,State_TN,State_TX,State_UT,State_VA,State_VI,State_WA,State_WV,State_WY
0,1979,1,15252,102,13,0,0,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,1979,2,13776,53,8,0,0,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,1979,3,15252,176,37,0,0,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,1979,4,37657,1037,459,0,1,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,1979,5,50616,3193,1148,0,1,False,False,False,...,False,False,False,False,False,False,False,False,False,False
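
Before one-hot encoding both 'ParkName' and 'UnitCode', it may be worth checking whether the two columns carry the same information. The sketch below uses a made-up toy frame, and the unit codes other than ACAD are placeholders. It tests whether each park name maps to exactly one unit code and, under that assumption, encodes only one of the two columns:

```python
import pandas as pd

# Toy frame standing in for park_visitation (values are illustrative)
toy = pd.DataFrame({
    'ParkName': ['Acadia NP', 'Acadia NP', 'Arches NP', 'Badlands NP'],
    'UnitCode': ['ACAD', 'ACAD', 'ARCH', 'BADL'],
    'State': ['ME', 'ME', 'UT', 'SD'],
})

# True when every park name maps to exactly one unit code
one_code_per_park = bool(
    toy.groupby('ParkName')['UnitCode'].nunique().eq(1).all()
)

# if the mapping is one-to-one, UnitCode duplicates ParkName and can be dropped
if one_code_per_park:
    toy = toy.drop(columns=['UnitCode'])

# drop_first=True removes the first category of each column to avoid redundancy
encoded = pd.get_dummies(toy, columns=['ParkName', 'State'], drop_first=True)
print(encoded.columns.tolist())
```

Whether the real data has this one-to-one mapping is not shown above, so the check should be run on `park_visitation` before dropping anything.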


And standardize our numeric columns (zero mean, unit variance):

In [138]:
from sklearn.preprocessing import StandardScaler

num_features = ['Year', 'Month', 'NonRecreationVisits', 'TentCampers', 'RVCampers', 'Backcountry']

# scale the numeric features
scaler = StandardScaler()
park_encoded[num_features] = scaler.fit_transform(park_encoded[num_features])
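
`StandardScaler` subtracts each column's mean and divides by its population standard deviation. The sketch below reproduces that arithmetic with plain pandas on made-up numbers. As an assumption about the later modeling notebook, it also computes the statistics on a hypothetical training subset only and reuses them for held-out rows, which keeps information about the test rows out of the scaling step:

```python
import pandas as pd

# Toy frame standing in for park_encoded (values are made up)
toy = pd.DataFrame({'TentCampers': [100.0, 300.0, 200.0, 400.0]})

train = toy.iloc[:3]  # hypothetical training rows
test = toy.iloc[3:]   # hypothetical held-out rows

cols = ['TentCampers']

# statistics come from the training rows only
mean = train[cols].mean()
std = train[cols].std(ddof=0)  # ddof=0 is the population std, as StandardScaler uses

# apply the same shift and scale to both subsets
train_scaled = (train[cols] - mean) / std
test_scaled = (test[cols] - mean) / std
print(train_scaled)
print(test_scaled)
```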

Finally, we can save our cleaned data so it can be used in the next notebook.

In [139]:
# index=False keeps the row index from being written as an extra 'Unnamed: 0' column
park_encoded.to_csv('Cleaned_ParkData.csv', index=False)
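
The effect of the `index` argument of `to_csv` shows up when the file is read back. The sketch below round-trips a made-up frame through in-memory buffers (`io.StringIO`) rather than real files:

```python
import io
import pandas as pd

# Toy frame standing in for park_encoded (values are made up)
toy = pd.DataFrame({'Year': [1979, 1980], 'High_Visitation': [0, 1]})

# default: the row index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)     # ['Unnamed: 0', 'Year', 'High_Visitation']
print(cols_without_index)  # ['Year', 'High_Visitation']
```

If the index column is written, the next notebook would need to drop 'Unnamed: 0' (or pass `index_col=0` to `read_csv`) so it is not treated as a feature.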