# Exploration

The dataset I'm using came from Mountain Project. Filters include:

- Location: Boulder Canyon
- Type: 
    - Rock
    - 5.0 to 5.15d (the full range of technical climbing grades)
    - Trad, Sport, and Toprope
- No quality or pitch filters

In [1]:
# Import modules
import os, sys
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Get path
data_dir = os.path.realpath('../data/')

## Data Cleaning

I'll start by getting an overview of the data and making it more usable.

In [2]:
# Load data
df_mp = pd.read_csv(os.path.join(data_dir, 'boulder_sport_50_515d.csv'),
                    header=0, sep=',')
print('Frame shape: ' + str(df_mp.shape))
df_mp.tail(2)

Frame shape: (945, 11)


| | Route | Location | URL | Avg Stars | Your Stars | Route Type | Rating | Pitches | Length | Area Latitude | Area Longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 943 | Serps Up | Serpent's Den > Boulder Canyon > Boulder > Col... | https://www.mountainproject.com/route/12226557... | 3.0 | -1 | Sport | 5.13b | 1 | 25.0 | 40.00239 | -105.41014 |
| 944 | Aqueduct | Aqueduct Outcrop > Boulder Canyon > Boulder > ... | https://www.mountainproject.com/route/12246443... | 2.0 | -1 | Sport | 5.12a | 1 | 50.0 | 39.97527 | -105.4608 |


In [3]:
# Data types
df_mp.dtypes

Route              object
Location           object
URL                object
Avg Stars         float64
Your Stars          int64
Route Type         object
Rating             object
Pitches             int64
Length            float64
Area Latitude     float64
Area Longitude    float64
dtype: object

Variables include route name, location, Mountain Project URL, average stars, my ratings, route type, rating, pitches, length, and geographic information. 

There is a mix of objects (strings), floats, and integers. Note that "Rating" is stored as an object even though a numerical value would be more useful for analysis. I'll handle that transformation later on.

I'll do some data cleaning to only include useful variables and make variable names easier to work with. 

In [4]:
# Copy original frame
df_copy = df_mp.copy()

# Simplify variable names
df_copy.columns = [x.lower().replace(' ', '_') for x in list(df_copy.columns)]

# Remove my ratings (your_stars)
df_copy = df_copy.drop(columns='your_stars')
df_copy.tail(2)

| | route | location | url | avg_stars | route_type | rating | pitches | length | area_latitude | area_longitude |
|---|---|---|---|---|---|---|---|---|---|---|
| 943 | Serps Up | Serpent's Den > Boulder Canyon > Boulder > Col... | https://www.mountainproject.com/route/12226557... | 3.0 | Sport | 5.13b | 1 | 25.0 | 40.00239 | -105.41014 |
| 944 | Aqueduct | Aqueduct Outcrop > Boulder Canyon > Boulder > ... | https://www.mountainproject.com/route/12246443... | 2.0 | Sport | 5.12a | 1 | 50.0 | 39.97527 | -105.4608 |


I'll check for null and null replacement values next. Based on the "Your Stars" rating from the original frame, it looks like -1 is used as a replacement value.

In [5]:
# Loop through columns to find -1
for curr_col in list(df_copy.columns):
    
    search_null = df_copy[df_copy[curr_col] == -1].shape[0]
    
    if search_null != 0:
        print(f'{curr_col}: {search_null} rows with -1')

avg_stars: 7 rows with -1
pitches: 1 rows with -1
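
The same count can also be computed without an explicit loop by comparing the whole frame against -1 at once. This is a minimal sketch on a made-up frame (the `toy` data and the `placeholder_counts` name are hypothetical and only stand in for `df_copy`):

```python
import pandas as pd

# Made-up stand-in for df_copy; values are illustrative only
toy = pd.DataFrame({
    'route': ['A', 'B', 'C'],
    'avg_stars': [2.5, -1.0, -1.0],
    'pitches': [1, -1, 2],
})

# Element-wise comparison yields a boolean frame; summing counts the True values per column
placeholder_counts = (toy == -1).sum()

# Keep only the columns that actually contain the placeholder
placeholder_counts = placeholder_counts[placeholder_counts > 0]
print(placeholder_counts)
```

String columns simply compare as unequal to -1, so they drop out of the result.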


I double-checked and did see 7 rows with -1 in "avg_stars" and 1 row with -1 in "pitches." It's not possible to have negative stars or pitches, so these must be placeholder values. I'll change them to nulls so that placeholders aren't mistaken for actual values.

In [6]:
# Replace -1 with null
df_copy = df_copy.replace(to_replace=-1, value=np.nan)  # np.nan for numerical

In [7]:
# Count remaining null values
df_copy.isna().sum()

route               0
location            0
url                 0
avg_stars           7
route_type          0
rating              0
pitches             1
length            197
area_latitude       0
area_longitude      0
dtype: int64

In addition to the 8 placeholder values that were changed to nulls, there are 197 missing length values.

Next, I want to convert ratings from objects to floats. Every climb falls in the 5.0-5.15 range because I'm only looking at technical climbing, so I'll strip the leading "5." and map the remaining string portion of the rating (e.g. a, b, c, d) to numbers.
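
Stripping the prefix first matters because a direct numeric cast would misorder the grades. A small illustration on made-up values (the `ratings` and `naive` names are hypothetical and not part of the dataset code):

```python
import pandas as pd

# Made-up ratings in the same format as the 'rating' column
ratings = pd.Series(['5.9', '5.10', '5.10a'])

# Naive cast: unparseable strings become NaN instead of raising an error
naive = pd.to_numeric(ratings, errors='coerce')
print(naive.tolist())
```

Here `5.10` is read as 5.1, which sorts below 5.9 even though it is the harder grade, and `5.10a` cannot be parsed at all.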

In [15]:
# Get climbing ratings
# TODO: clean ratings
[x.replace('5.', '') for x in sorted(df_copy['rating'].unique())]

['10',
 '10 X',
 '10+',
 '10+ PG13',
 '10+ R',
 '10-',
 '10- R',
 '10a',
 '10a A0',
 '10a/b',
 '10b',
 '10b PG13',
 '10b/c',
 '10b/c PG13',
 '10c',
 '10c PG13',
 '10c R',
 '10c/d',
 '10d',
 '10d PG13',
 '11',
 '11 PG13',
 '11+',
 '11-',
 '11a',
 '11a R',
 '11a/b',
 '11b',
 '11b PG13',
 '11b/c',
 '11b/c PG13',
 '11c',
 '11c R',
 '11c/d',
 '11d',
 '11d PG13',
 '12',
 '12+',
 '12-',
 '12a',
 '12a/b',
 '12a/b A0',
 '12a/b R',
 '12b',
 '12b/c',
 '12c',
 '12c PG13',
 '12c/d',
 '12d',
 '12d V6',
 '12d V7',
 '13',
 '13-',
 '13a',
 '13a/b',
 '13b',
 '13b/c',
 '13c',
 '13c V10',
 '13d',
 '14a',
 '14c',
 '3',
 '4',
 '5',
 '6',
 '7',
 '7+',
 '8',
 '8+',
 '8-',
 '9',
 '9 R',
 '9+',
 '9+ R',
 '9-',
 '9- PG13']

In [11]:
# Climbing rating transformation, first step: strip the '5.' prefix
# (values are still strings at this point, not floats)
df_copy['rating_float'] = df_copy['rating'].apply(
    lambda x: x.replace('5.', '')
)
df_copy[['rating', 'rating_float']].sample(10)

| | rating | rating_float |
|---|---|---|
| 485 | 5.10+ | 10+ |
| 84 | 5.8 | 8 |
| 524 | 5.11 | 11 |
| 223 | 5.10c | 10c |
| 493 | 5.12a | 12a |
| 701 | 5.11c | 11c |
| 78 | 5.13a/b | 13a/b |
| 746 | 5.10b | 10b |
| 245 | 5.9+ | 9+ |
| 88 | 5.9 | 9 |
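
The values above are still strings. One way the rest of the conversion could go is sketched below. The offsets are my own illustrative assumptions rather than an established scale, and `rating_to_float`, `LETTER_OFFSET`, `SIGN_OFFSET`, and `RATING_RE` are hypothetical names. Protection and aid suffixes such as `PG13`, `R`, `X`, `A0`, and `V6` are ignored.

```python
import re
import numpy as np
import pandas as pd

# Assumed offsets within a number grade; these values are illustrative choices
LETTER_OFFSET = {'a': 0.0, 'b': 0.25, 'c': 0.5, 'd': 0.75}
SIGN_OFFSET = {'-': 0.125, '': 0.375, '+': 0.625}

# Number grade, then an optional letter, slash grade, or +/- sign
RATING_RE = re.compile(r'^5\.(\d{1,2})([a-d](?:/[a-d])?|[+-])?')

def rating_to_float(rating):
    """Convert a rating string such as '5.10b/c PG13' to a float (assumed scale)."""
    match = RATING_RE.match(str(rating))
    if match is None:
        return np.nan  # not a 5.x rating
    number = int(match.group(1))
    suffix = match.group(2) or ''
    if suffix in SIGN_OFFSET:
        offset = SIGN_OFFSET[suffix]
    else:
        # Slash grades such as 'b/c' take the average of their two letters
        letters = suffix.split('/')
        offset = sum(LETTER_OFFSET[letter] for letter in letters) / len(letters)
    return number + offset

examples = pd.Series(['5.9', '5.10a', '5.10b/c PG13', '5.12d V6', '5.9- R'])
converted = examples.apply(rating_to_float)
print(converted.tolist())
```

Under these assumptions a bare grade such as 5.10 lands in the middle of its number grade, at the same value as 5.10b/c. Whether that is the right choice depends on how the ratings will be used later.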


## Data Exploration

There are string, float, and integer variables. I'll start by understanding the numerical attributes.

In [8]:
# Summary statistics for numerical columns
df_copy.describe()

| | avg_stars | pitches | length | area_latitude | area_longitude |
|---|---|---|---|---|---|
| count | 938.0 | 944.0 | 748.0 | 945.0 | 945.0 |
| mean | 2.149893 | 1.090042 | 72.462567 | 39.996613 | -105.415008 |
| std | 0.667579 | 0.364586 | 38.805145 | 0.010475 | 0.023808 |
| min | 0.0 | 1.0 | 18.0 | 39.97207 | -105.4648 |
| 25% | 1.7 | 1.0 | 50.0 | 39.9937 | -105.4181 |
| 50% | 2.0 | 1.0 | 65.0 | 40.0001 | -105.4124 |
| 75% | 2.6 | 1.0 | 85.0 | 40.0035 | -105.3974 |
| max | 4.0 | 4.0 | 370.0 | 40.0137 | -105.3133 |


TODO: 
- Go back and change ratings to numerical
- write about it here 

- avg_stars: max = 4