# A Diversity Analysis of Participation in Outdoor Recreation in Washington

### Final Project for DATA 512, Autumn 2018

This notebook presents a statistical analysis and visualization of the racial and gender diversity of participants in selected outdoor recreational activities in Washington State. The work was originally completed for the University of Washington's DATA 512 course, Human-Centered Data Science, in Autumn 2018.

The notebook is divided into the following sections:

 - Methods
     - Data Acquisition
     - Data Processing
     - Statistical Analysis
     - Data Visualization

Each section is documented step by step to support reproducibility of this analysis.

In [6]:
# import necessary packages and notebook setup
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sodapy import Socrata

#### Data Acquisition

The API request code below is adapted from the Socrata developer documentation for this dataset: https://dev.socrata.com/foundry/data.wa.gov/amq9-iaai

In [23]:
import os

# Define endpoints for each of the 7 parts of the dataset
endpoints = ['amq9-iaai', 'ek6m-rgb7', '8zc8-9ad4', 'v2c2-rkrp', 'hzyw-na2k', 'uwas-gd9z', 'q62a-ce6s']

# Define app token for API requests
app_token = 'OnW02vywUSKEfNP2DEYO7OMM5'

# Create the raw data folder if it does not already exist
os.makedirs('data_raw', exist_ok=True)

# Create dictionary to save dataframes
parts = {}

# Create one API client and reuse it for every request
client = Socrata("data.wa.gov", app_token)

# Make API call for each endpoint, save the raw file, and keep the dataframe
for endpoint in endpoints:
    file_name = 'data_raw/scorp_' + endpoint + '.csv'
    # limit caps the number of rows returned by each request
    results = client.get(endpoint, limit=3200)
    results_df = pd.DataFrame.from_records(results)
    results_df.to_csv(file_name)
    parts[endpoint] = results_df
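
Because each part is saved to `data_raw/`, the analysis can be re-run later without calling the API again. The following is a minimal sketch of reloading the saved files, not part of the original workflow. The helper name `load_cached_parts` and the endpoint ID `demo-0001` are hypothetical, and the sketch assumes the files were written with the default `to_csv` settings used above, so the first CSV column holds the row index.

```python
import os
import tempfile

import pandas as pd


def load_cached_parts(endpoints, data_dir='data_raw'):
    """Rebuild the parts dictionary from previously saved CSV files (hypothetical helper)."""
    parts = {}
    for endpoint in endpoints:
        file_name = os.path.join(data_dir, 'scorp_' + endpoint + '.csv')
        # index_col=0 reads the saved row index back as the index, not as a data column
        parts[endpoint] = pd.read_csv(file_name, index_col=0)
    return parts


# Demonstration with a temporary folder and a made-up endpoint ID
with tempfile.TemporaryDirectory() as tmp_dir:
    toy = pd.DataFrame({'idnumber': [1, 2], 'gender': ['Male', 'Female']})
    toy.to_csv(os.path.join(tmp_dir, 'scorp_demo-0001.csv'))
    cached = load_cached_parts(['demo-0001'], data_dir=tmp_dir)
```

With the real data, `load_cached_parts(endpoints)` would be called after the acquisition cell has run at least once.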

#### Data Processing

In [39]:
# Merge all dataframes in the parts dictionary on the shared idnumber column
keys = list(parts.keys())
combined_data = parts[keys[0]]

# An inner join keeps only the idnumber values present in every part
for i in range(1, len(keys)):
    combined_data = combined_data.merge(parts[keys[i]], how='inner', on=['idnumber'])

In [40]:
# Check shape of resulting dataframe
combined_data.shape

(3114, 1575)
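
An inner merge silently drops any respondent missing from one of the parts, and duplicated `idnumber` values would multiply rows. The sketch below shows one way to fold a list of dataframes together while asking pandas to check that `idnumber` is unique on both sides of each merge. The helper name `merge_parts` and the toy dataframes are hypothetical; only the `idnumber` key comes from the dataset.

```python
from functools import reduce

import pandas as pd


def merge_parts(frames, key='idnumber'):
    """Inner-join a list of dataframes on key, failing if key is duplicated in any part."""
    return reduce(
        # validate='one_to_one' raises pandas.errors.MergeError on duplicate keys
        lambda left, right: left.merge(right, how='inner', on=key, validate='one_to_one'),
        frames,
    )


# Toy stand-ins for the survey parts (values are made up)
part_a = pd.DataFrame({'idnumber': [1, 2, 3], 'gender': ['Male', 'Female', 'Female']})
part_b = pd.DataFrame({'idnumber': [1, 2, 3], 'race01': [1, 0, 1]})
part_c = pd.DataFrame({'idnumber': [2, 3, 4], 'walk8x': [365.0, 100.0, 40.0]})

merged = merge_parts([part_a, part_b, part_c])
```

In the toy example only respondents 2 and 3 appear in all three parts, so the merged frame has two rows.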

In [41]:
# Check first few rows of resulting dataframe
combined_data.head()

Unnamed: 0,act1,act101,act102,act103,act104,act105,act106,act107,act108,act109,...,tab245a,tab285a,tenn2x,voll3x,walk8x,wghts2x,wski2x,wsrf2x,xski2x,xski6x
0,1111000000100000,Not checked,Not checked,Checked,Checked,Checked,Checked,Not checked,Not checked,Not checked,...,No,No,14.0,0.0,0.0,60.0,0.0,0.0,0.0,0.0
1,10,Not checked,Not checked,Not checked,Not checked,Not checked,Not checked,Not checked,Not checked,Not checked,...,No,No,0.0,0.0,365.0,0.0,0.0,0.0,0.0,0.0
2,1100000000000000,Not checked,Not checked,Checked,Checked,Not checked,Not checked,Not checked,Not checked,Not checked,...,No,No,0.0,0.0,100.0,50.0,0.0,0.0,0.0,0.0
3,100001101111001000,Checked,Not checked,Not checked,Not checked,Not checked,Checked,Checked,Not checked,Checked,...,No,Yes,0.0,15.0,40.0,0.0,12.0,20.0,5.0,0.0
4,1000000000000,Not checked,Not checked,Not checked,Not checked,Not checked,Checked,Not checked,Not checked,Not checked,...,No,No,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0


In [52]:
# Change "checked/not checked" and "yes/no" to 0/1
combined_data = combined_data.replace({'Yes': 1, 'No': 0, 'Not checked': 0, 'Checked': 1})
combined_data.head()

Unnamed: 0,act1,act101,act102,act103,act104,act105,act106,act107,act108,act109,...,tab245a,tab285a,tenn2x,voll3x,walk8x,wghts2x,wski2x,wsrf2x,xski2x,xski6x
0,1111000000100000,0,0,1,1,1,1,0,0,0,...,0,0,14.0,0.0,0.0,60.0,0.0,0.0,0.0,0.0
1,10,0,0,0,0,0,0,0,0,0,...,0,0,0.0,0.0,365.0,0.0,0.0,0.0,0.0,0.0
2,1100000000000000,0,0,1,1,0,0,0,0,0,...,0,0,0.0,0.0,100.0,50.0,0.0,0.0,0.0,0.0
3,100001101111001000,1,0,0,0,0,1,1,0,1,...,0,1,0.0,15.0,40.0,0.0,12.0,20.0,5.0,0.0
4,1000000000000,0,0,0,0,0,1,0,0,0,...,0,0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0
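
After recoding, it can be worth confirming that the indicator columns contain nothing but 0 and 1. The sketch below uses the same replacement mapping on a toy dataframe and lists any leftover values. The helper name `unexpected_values` is hypothetical, and the toy values are made up for illustration.

```python
import pandas as pd

RECODE_MAP = {'Yes': 1, 'No': 0, 'Not checked': 0, 'Checked': 1}


def unexpected_values(df, columns):
    """Return {column: values} for any values other than 0/1 left after recoding."""
    leftovers = {}
    for col in columns:
        extra = set(df[col].dropna().unique()) - {0, 1}
        if extra:
            leftovers[col] = extra
    return leftovers


# Toy data: two indicator columns and one column that is not an indicator
toy = pd.DataFrame({
    'act101': ['Checked', 'Not checked', 'Checked'],
    'tab245a': ['No', 'Yes', 'No'],
    'gender': ['Male', 'Female', "Don't know"],
})
recoded = toy.replace(RECODE_MAP)

indicator_check = unexpected_values(recoded, ['act101', 'tab245a'])
gender_check = unexpected_values(recoded, ['gender'])
```

An empty dictionary for the indicator columns means the mapping covered every value that appeared in them.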


In [59]:
# Inspect the gender values recorded in the survey
combined_data.gender.unique()

# Drop participants with no gender recorded ("Don't know")
gender_dataset = combined_data[combined_data.gender != "Don't know"]

In [66]:
# Check shape to see how many observations dropped (only 2)
gender_dataset.shape

(3112, 1575)

In [67]:
# Save gender data set to clean data folder
gender_dataset.to_csv('data_clean/gender_data_clean.csv')

In [76]:
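# Drop participants with no race/ethnicity information:
# race09 and race08 flag the "Don't know" and "Refused" responses.
# .copy() makes race_dataset independent of combined_data so that
# columns can be added to it later without a SettingWithCopyWarning.
race_dataset = combined_data[(combined_data.race09 != 1) & (combined_data.race08 != 1)].copy()
race_dataset.shape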

(2910, 1575)

The row-wise labeling function below is adapted from this Stack Overflow answer: https://stackoverflow.com/questions/26886653/pandas-create-new-column-based-on-values-from-other-columns

In [80]:
# Create one race column - define function to map the race indicator columns to one label
def label_race(row):
    # Check for multiple selections first; if 'White' were checked first,
    # a respondent who selected White plus another race would be labeled 'White'
    if row['race01'] + row['race02'] + row['race03'] + row['race04'] + row['race05'] + row['race06'] + row['race07'] > 1:
        return 'Two Or More'
    if row['race01'] == 1:
        return 'White'
    if row['race02'] == 1:
        return 'Black'
    if row['race03'] == 1:
        return 'Hispanic/Latino'
    if row['race04'] == 1:
        return 'Asian'
    if row['race05'] == 1:
        return 'Haw/Pac Isl.'
    if row['race06'] == 1:
        return 'American Indian/Alaska Native'
    # Any remaining case (for example, only race07 selected) is labeled 'Other'
    return 'Other'

In [82]:
race_dataset['race_label'] = race_dataset.apply(label_race, axis=1)
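
Row-wise `apply` calls the labeling function once per respondent. A vectorized alternative is sketched below with `numpy.select`, which evaluates the conditions in order and takes the first match. This is an illustration rather than the method used in this notebook: the helper name `label_race_vectorized` is hypothetical, the toy rows are made up, and the sketch assumes that anyone with more than one race selected should be labeled 'Two Or More' before any single-race label is considered.

```python
import numpy as np
import pandas as pd

RACE_COLUMNS = ['race01', 'race02', 'race03', 'race04', 'race05', 'race06', 'race07']
SINGLE_LABELS = {
    'race01': 'White',
    'race02': 'Black',
    'race03': 'Hispanic/Latino',
    'race04': 'Asian',
    'race05': 'Haw/Pac Isl.',
    'race06': 'American Indian/Alaska Native',
}


def label_race_vectorized(df):
    """Label each row; the first matching condition wins, otherwise 'Other'."""
    selected = df[RACE_COLUMNS].sum(axis=1)
    # Assumption: the multiple-selection check takes precedence over single labels
    conditions = [selected > 1] + [df[col] == 1 for col in SINGLE_LABELS]
    choices = ['Two Or More'] + list(SINGLE_LABELS.values())
    return pd.Series(np.select(conditions, choices, default='Other'), index=df.index)


# Toy rows: White only, White + Asian, Hispanic/Latino only, race07 only, race06 only
toy = pd.DataFrame(
    [
        [1, 0, 0, 0, 0, 0, 0],
        [1, 0, 0, 1, 0, 0, 0],
        [0, 0, 1, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 1],
        [0, 0, 0, 0, 0, 1, 0],
    ],
    columns=RACE_COLUMNS,
)
labels = label_race_vectorized(toy)
```

The result is a Series aligned with the input index, so it can be assigned directly as a new column.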

In [83]:
# Save race data set to clean data folder
race_dataset.to_csv('data_clean/race_data_clean.csv')

#### Statistical Analysis
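
One simple starting point for comparing groups is the share of respondents in each group who report an activity. The sketch below computes that share with a pandas `groupby` on toy data. The column name `hiked` and all values are hypothetical; in the cleaned datasets the grouping column would be `gender` or `race_label`, and the indicator would be one of the recoded 0/1 activity columns.

```python
import pandas as pd

# Toy data: one row per respondent, 1 = reported the activity (values are made up)
toy = pd.DataFrame({
    'gender': ['Female', 'Female', 'Male', 'Male', 'Male'],
    'hiked': [1, 0, 1, 1, 0],
})

# For each group: number of respondents, number of participants, participation rate
participation = toy.groupby('gender')['hiked'].agg(
    respondents='count',
    participants='sum',
    rate='mean',
)
```

Because the indicator is coded 0/1, its mean within a group is the group's participation rate. Reporting the respondent count alongside the rate helps flag groups that are too small to compare reliably.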

#### Data Visualization