# **Climbing Performance Prediction**


**Author:** Pedro Moura

**Date:** Mar 31, 2024

---

## 1. Goals of the Study


The aim of this project is to predict the hardest V grade a climber can climb using a regression model. This prediction is based on various factors including but not limited to body characteristics (height, weight, arm span), climbing experience, and detailed training habits (types and frequency of training).

## 2. Data Source and Data Cleaning/Preprocessing

The data source I’ll be using is the ["Climbharder #V3 (Responses)"](https://docs.google.com/spreadsheets/d/1J6d45EqIlIsIqNdi2X-Zl-EGFxf9d9T3R_W55xrpEAs/edit#gid=1650492946) dataset. It contains climbers' responses about their physical attributes, climbing experience, and training habits. Here are some details about the data:


In [261]:
import pandas as pd

file_path = '/content/Climbharder #V3 (Responses).xlsx'
df = pd.read_excel(file_path)

# summary of the dataset: columns, non-null counts and dtypes
print(df.info())

# check for missing values
missing_values_percentage_csv = (df.isnull().sum() / len(df)) * 100
print(missing_values_percentage_csv)

# look at the data types
print(df.dtypes)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 633 entries, 0 to 632
Data columns (total 35 columns):
 #   Column                                                                              Non-Null Count  Dtype         
---  ------                                                                              --------------  -----         
 0   Timestamp                                                                           633 non-null    datetime64[ns]
 1   Sex                                                                                 633 non-null    object        
 2   Height (cm)                                                                         633 non-null    object        
 3   Weight (KG)                                                                         633 non-null    object        
 4   Arm Span (cm)                                                                       630 non-null    object        
 5   How long have you been climbing for?              

Observations:

I) There is a lot of input variance caused by respondents. Here are some examples:

> "The one who is not sure about their answers"

> <img src="https://drive.google.com/uc?export=view&id=1t_D9WctWYtVeCO1i7ItOFFHasSGcijF9" width="400" height="70">


> "The one who doesn't know the metric system"

> <img src="https://drive.google.com/uc?export=view&id=17CnyP7topkBTXZZOHnIhVJxGd6PpxEXT" width="400" height="100">


> "The one who gets offended by the survey"

> <img src="https://drive.google.com/uc?export=view&id=1InzTpI6hQMnoL77RfPLm6rbu0yGTAlkq" width="400" height="90">



II) There are a lot of missing values in some specific columns, for example:

> **464** Missing values - "Min Edge used (mm, +kg if weight added) - Open crimp (10 seconds)"

> **434** Missing values - "Min Edge used (mm, +kg if weight added) - Half crimp (10 seconds)"

> **423** Missing values - "Max Weight hangboard 18mm edge - open crimp (KG) (10 seconds)  (added weight only) "

These are the three features with the most missing values.

### 2.1 Cleaning Data

The first thing I did before cleaning up the data was to carefully analyze each column. Columns like "Sex", "Where do you climb?" and the climbing-grade ones were all fine; I basically didn't have to change anything. But the remaining ones needed some modifications.

Here is my approach for each column/feature:
- **Height:** Convert to the metric system (cm) and remove "cm" from the response, if present.
- **Weight:** Convert to the metric system (kg) and remove "kg" from the response, if present.
- **Arm Span:** Convert to the metric system (cm) and remove "cm" from the response, if present.
- **How long have you been climbing for?:** Replace with the average.
- **All V Grade columns:** Remove the V in front of the grade to make it numerical (done in code in Section 2.2).
- **Max Weight hangboard 18mm edge - Half crimp (KG)  (10 seconds) (added weight only):** Not used
- **Max Weight hangboard 18mm edge - open crimp (KG) (10 seconds)  (added weight only):** Not used
- **Min Edge used (mm, +kg if weight added ) - Half Crimp (10 seconds):** Not used
- **Min Edge used (mm, +kg if weight added) - Open crimp (10 seconds):** Not used
- **Max pull up reps:** Fix some number approximations
- **5 rep max weighted pull ups:** Convert to the metric system (kg) and remove "kg" from the response, if present.
- **max push ups reps:** Fix some number approximations
- **max L-sit time:** Not used
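
The cleaning described above was done with Excel functions and manual fixes, so the following is only a sketch of how the height conversion could look in Python. The function name `height_to_cm` and the input formats it accepts (plain numbers, a "cm" suffix, feet and inches) are assumptions for illustration, not the formats actually found in the survey.

```python
import re

def height_to_cm(raw):
    """Parse a free-text height into centimetres, or return None if unrecognized."""
    text = str(raw).strip().lower()

    # Feet and inches, e.g. 5'10" or 5' 10
    match = re.fullmatch(r"(\d+)\s*'\s*(\d+(?:\.\d+)?)?\s*\"?", text)
    if match:
        feet = int(match.group(1))
        inches = float(match.group(2) or 0)
        return round((feet * 12 + inches) * 2.54, 1)

    # Plain number with an optional "cm" suffix, e.g. 178 or 178cm
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*(?:cm)?", text)
    if match:
        return float(match.group(1))

    # Anything else is left for manual review
    return None

for answer in ["178cm", "5'10\"", 180, "tall enough"]:
    print(repr(answer), "->", height_to_cm(answer))
```

Returning `None` for unrecognized answers keeps the ambiguous responses visible, so they can still be fixed by hand.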

If a column/feature is not listed, it is because I will not change anything in it. Also, the columns marked "Not used" are left out of my model due to the amount of missing values. I decided that by setting a threshold: if a column has more than 60% missing values, I will not use it.
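
As a minimal sketch of the 60% rule, the snippet below drops columns by their fraction of missing values. The toy DataFrame and the name `MISSING_THRESHOLD` are made up for illustration; the real decision was made on the survey columns listed above.

```python
import numpy as np
import pandas as pd

MISSING_THRESHOLD = 0.60  # drop columns with more than 60% missing values

# Toy data standing in for the survey responses (hypothetical values)
toy = pd.DataFrame({
    "Weight (kg)": [77, 81, 67, 70, 84],
    "Arm Span (cm)": [178, np.nan, 175, 178, 197],
    "max L-sit time": [np.nan, np.nan, np.nan, np.nan, 30],
})

# Fraction of missing values per column
missing_fraction = toy.isnull().mean()

# Columns above the threshold are removed from the frame
columns_to_drop = missing_fraction[missing_fraction > MISSING_THRESHOLD].index.tolist()
toy_kept = toy.drop(columns=columns_to_drop)

print(columns_to_drop)
print(list(toy_kept.columns))
```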

Here is the data after running small Excel functions/scripts and a lot, and I mean A LOT, of manual fixes:

In [262]:
file_path = '/content/Climb Data - Climb Data.csv.csv'
df_updated = pd.read_csv(file_path)


print(df_updated.head())

print(df_updated.info())

missing_values_percentage_csv = (df_updated.isnull().sum() / len(df_updated)) * 100

missing_values_percentage_csv

    Sex  Height (cm)  Weight (kg)  Arm Span (cm)  \
0  Male        173.0         77.0          178.0   
1  Male        180.0         81.0          180.0   
2  Male        178.0         67.0          175.0   
3  Male        173.0         70.0          178.0   
4  Male        184.0         84.0          197.0   

   How long have you been climbing for?          Where do you climb?  \
0                                  4.75  Indoor and outdoor climbing   
1                                  3.25         Indoor Climbing only   
2                                  0.75  Indoor and outdoor climbing   
3                                  9.25  Indoor and outdoor climbing   
4                                  6.75  Indoor and outdoor climbing   

  Hardest V Grade ever climbed Hardest V Grade climbed in the Last 3 months  \
0                           V8                                           V8   
1                           V3                                           V3   
2                

Sex                                                          0.000000
Height (cm)                                                  0.000000
Weight (kg)                                                  0.000000
Arm Span (cm)                                                3.815580
How long have you been climbing for?                         0.000000
Where do you climb?                                          0.000000
Hardest V Grade ever climbed                                 0.000000
Hardest V Grade climbed in the Last 3 months                 0.000000
The V grade you can send 90-100% of routes                   0.000000
Frequency of climbing sessions per week                      0.000000
Average hours climbing per week (not including training)     0.000000
Average hours Training for climbing per week                 0.000000
Hangboard Frequency per week                                 0.000000
Campus Board frequency per week                              0.000000
Campus Board time pe

### 2.2 Handling Missing Values

In [263]:
import numpy as np

# fill NaN values with each column's mean
numerical_columns = df_updated.select_dtypes(include=np.number).columns
df_updated[numerical_columns] = df_updated[numerical_columns].fillna(df_updated[numerical_columns].mean())

# mapping Sex
sex_mapping = {"Female": 0, "Male": 1}
df_updated["Sex"] = df_updated["Sex"].map(sex_mapping)

# mapping Climbing place
place_mapping = {"Indoor Climbing only":0, "Outdoor Climbing only": 1, "Indoor and outdoor climbing": 2}
df_updated["Where do you climb?"] = df_updated["Where do you climb?"].map(place_mapping)

# mapping "V" grades to numbers and "I don't boulder" to -1
grade_mapping = {"I don't boulder": -1}
for i in range(15):
  grade_mapping[f'V{i}'] = i

v_columns = ["Hardest V Grade ever climbed",
             "Hardest V Grade climbed in the Last 3 months",
             "The V grade you can send 90-100% of routes"]

for col in v_columns:
  df_updated[col] = df_updated[col].map(grade_mapping)

df_updated.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 629 entries, 0 to 628
Data columns (total 20 columns):
 #   Column                                                    Non-Null Count  Dtype  
---  ------                                                    --------------  -----  
 0   Sex                                                       629 non-null    int64  
 1   Height (cm)                                               629 non-null    float64
 2   Weight (kg)                                               629 non-null    float64
 3   Arm Span (cm)                                             629 non-null    float64
 4   How long have you been climbing for?                      629 non-null    float64
 5   Where do you climb?                                       629 non-null    int64  
 6   Hardest V Grade ever climbed                              629 non-null    int64  
 7   Hardest V Grade climbed in the Last 3 months              629 non-null    int64  
 8   The V grade you can 

### 2.3 Standardizing Data

In [264]:
from sklearn.preprocessing import StandardScaler

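# reuse numerical_columns from section 2.2: it was selected before the mappings,
# so "Sex", "Where do you climb?" and the V grade columns (including the target) stay unscaled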

scaler = StandardScaler()

df_updated[numerical_columns] = scaler.fit_transform(df_updated[numerical_columns])
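
In the cells above, the column means and the scaler are computed on the whole dataset before it is split, so the test rows influence the preprocessing. A possible alternative, sketched here on synthetic data with hypothetical columns, is to wrap the imputer and scaler in a scikit-learn `Pipeline` so their statistics come from the training rows only. This is not what the notebook does, and the reported results were not produced this way.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical columns and values)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Height (cm)": rng.normal(175, 8, n),
    "Max pull up reps": rng.integers(0, 30, n).astype(float),
    "Sex": rng.integers(0, 2, n),
})
toy.loc[::10, "Height (cm)"] = np.nan  # simulate missing answers
target = 0.2 * toy["Max pull up reps"] + rng.normal(0, 1, n)

# Impute and scale only the continuous columns
continuous = ["Height (cm)", "Max pull up reps"]
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer(
    [("num", numeric_steps, continuous)],
    remainder="passthrough",  # leave already-encoded columns untouched
)
model = Pipeline([("prep", preprocess), ("reg", LinearRegression())])

X_tr, X_te, y_tr, y_te = train_test_split(toy, target, test_size=0.3, random_state=0)
model.fit(X_tr, y_tr)           # imputer and scaler statistics come from X_tr only
print(model.score(X_te, y_te))  # R^2 on the held-out rows
```

The same fitted pipeline can later be applied to new rows with `model.predict`, which reuses the training statistics instead of re-fitting them.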


## 3. Training/Evaluating Models

For this project I will be training two models. Knowing this is a continuous prediction problem, I decided to compare a Linear Regression model against a Non-Linear Regression model (a Decision Tree in this case).

### 3.1 Splitting the data

In [265]:
from sklearn.model_selection import train_test_split

target = 'Hardest V Grade ever climbed'

df_lr = df_updated.copy()

X = df_lr.drop(target, axis=1)
y = df_lr[target]


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)


### 3.2 Training the models

> For Linear Regression (LR):

In [266]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

lr_model = LinearRegression()
lr_model.fit(X_train, y_train)


> For Decision Trees (DT):

In [267]:
from sklearn.tree import DecisionTreeRegressor

dt_model = DecisionTreeRegressor(random_state=0)
dt_model.fit(X_train, y_train)


### 3.3 Evaluating Models Performance

> For Linear Regression (LR):

In [268]:
# predictions on the testing set
y_pred_lr = lr_model.predict(X_test)

# calculate evaluation metrics for LR
mae_lr = mean_absolute_error(y_test, y_pred_lr)
mse_lr = mean_squared_error(y_test, y_pred_lr)
rmse_lr = np.sqrt(mse_lr)


> For Decision Trees (DT):

In [269]:
# predictions on the same testing set
y_pred_dt = dt_model.predict(X_test)

# calculate evaluation metrics for the Decision Tree model
mae_dt = mean_absolute_error(y_test, y_pred_dt)
mse_dt = mean_squared_error(y_test, y_pred_dt)
rmse_dt = np.sqrt(mse_dt)


> Comparing Linear Regression vs. Non-Linear Regression (Decision Trees):

In [270]:
header = 'Linear Reg. | Non-Linear Reg. \n {}\n'.format("-"*30)
mae_out = 'MAE:  {:.2f} | {:.2f}\n'.format(mae_lr,mae_dt)
mse_out = 'MSE:  {:.2f} | {:.2f}\n'.format(mse_lr,mse_dt)
rmse_out = 'RMSE: {:.2f} | {:.2f}\n'.format(rmse_lr,rmse_dt)

print(header,mae_out,mse_out,rmse_out)

Linear Reg. | Non-Linear Reg. 
 ------------------------------
 MAE:  0.59 | 0.52
 MSE:  0.70 | 0.74
 RMSE: 0.84 | 0.86



Both models' performance was assessed using three evaluation metrics: Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE). Here are the results:


Linear Regression Model:

  - MAE: 0.59

  - MSE: 0.70

  - RMSE: 0.84


Non-Linear Regression Model:

  - MAE: 0.52

  - MSE: 0.74

  - RMSE: 0.86


The Non-Linear Regression model showed a lower MAE than the Linear Regression model, which indicates that its predictions deviate less from the actual grades on average. Being a little more accurate, I believe the Non-Linear model potentially identified complex relationships in the data that the Linear model could not pick up on. However, when we look at the Mean Squared Error (MSE) and Root Mean Squared Error (RMSE), the Non-Linear model had slightly higher values than the Linear one. Because both metrics square the errors, large misses weigh more heavily, so while the Non-Linear model generally gets closer to the real values, it may also suffer from larger errors in some cases.
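
To make the MAE versus RMSE point concrete, here is a small worked example with made-up grades (not the project data). One hypothetical model is off by one grade for everyone, the other is exact for most climbers but misses one by three grades.

```python
import numpy as np

# Made-up true grades for five climbers
y_true = np.array([4, 5, 6, 7, 8], dtype=float)

# Model A: off by one grade on every climber
pred_a = y_true + 1.0

# Model B: exact on four climbers, off by three grades on the last one
pred_b = y_true.copy()
pred_b[-1] += 3.0

def mae(y, p):
    """Mean absolute error."""
    return np.mean(np.abs(y - p))

def rmse(y, p):
    """Root mean squared error."""
    return np.sqrt(np.mean((y - p) ** 2))

print(f"Model A: MAE={mae(y_true, pred_a):.2f}  RMSE={rmse(y_true, pred_a):.2f}")
print(f"Model B: MAE={mae(y_true, pred_b):.2f}  RMSE={rmse(y_true, pred_b):.2f}")
```

Model B has the lower MAE but the higher RMSE, the same pattern seen between the Decision Tree and Linear Regression results, although on a much more extreme toy case.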

## Conclusion

We can conclude that the Non-Linear model had a slightly lower MAE, which makes it look better at handling the complex relationships between climbers' attributes and their performance. However, given its higher MSE and RMSE, there is a chance that while its average error is lower, the variability of its errors is higher. This project shows how difficult it is to predict climbing performance.




---

## Personal work/notes/curiosities

For future work, some to-dos:
- Tune the Non-Linear model
- Test other non-linear models such as Random Forests or Gradient Boosting
- Get better data
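
As a starting point for the second to-do, a comparison like the one below could be run with cross-validation. It is only a sketch on synthetic data from `make_regression`; the hyperparameters are scikit-learn defaults and nothing here has been run on the climbing dataset.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem standing in for the climbing data
X_demo, y_demo = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

candidates = {
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
}

# 5-fold cross-validated MAE for each candidate
cv_mae = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_demo, y_demo, cv=5,
                             scoring="neg_mean_absolute_error")
    cv_mae[name] = -scores.mean()  # scikit-learn reports negated errors
    print(f"{name}: MAE = {cv_mae[name]:.2f}")
```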

### A) Feature importance

> Trying to get some insight into which features matter most to work on in order to improve climbing

In [271]:
feature_importances = dt_model.feature_importances_

features_df = pd.DataFrame({
  'Feature': X_train.columns,
  'Importance': feature_importances
}).sort_values(by='Importance', ascending=False)

print(features_df)

                                              Feature  Importance
6        Hardest V Grade climbed in the Last 3 months    0.787719
7          The V grade you can send 90-100% of routes    0.133266
4                How long have you been climbing for?    0.019356
10       Average hours Training for climbing per week    0.017367
5                                 Where do you climb?    0.005344
16                                   Max pull up reps    0.005138
3                                       Arm Span (cm)    0.004603
1                                         Height (cm)    0.004524
18                                  max push ups reps    0.004255
17                        5 rep max weighted pull ups    0.003246
11                       Hangboard Frequency per week    0.002593
14   Frequency of Endurance training sesions per week    0.002474
13                 Campus Board time per week (hours)    0.002351
2                                         Weight (kg)    0.002127
9   Averag
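
The impurity-based importances above are dominated by the two other grade columns, which leaves little signal for the training-habit features. One complementary check, sketched here on synthetic data with hypothetical column names, is permutation importance measured on held-out rows. Refitting the tree without the grade columns would be another option.

```python
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: one dominant feature, one weak feature, one pure-noise feature
rng = np.random.default_rng(0)
n = 300
demo = pd.DataFrame({
    "recent_grade": rng.integers(0, 10, n).astype(float),
    "pull_ups": rng.integers(0, 30, n).astype(float),
    "noise": rng.normal(size=n),
})
y = demo["recent_grade"] + 0.05 * demo["pull_ups"] + rng.normal(0, 0.3, n)

X_tr, X_te, y_tr, y_te = train_test_split(demo, y, test_size=0.3, random_state=0)
tree = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X_tr, y_tr)

# Shuffle each column of the test set and measure how much the score drops
result = permutation_importance(tree, X_te, y_te, n_repeats=10, random_state=0)
importance = pd.Series(result.importances_mean, index=demo.columns)
importance = importance.sort_values(ascending=False)
print(importance)
```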

### B) Tuning Attempt

In [272]:
from sklearn.model_selection import GridSearchCV

param_grid = {
  'max_depth': [None, 10, 20, 30, 40, 50],
  'min_samples_split': [2, 5, 10],
  'min_samples_leaf': [1, 2, 4]
}

grid_search = GridSearchCV(DecisionTreeRegressor(random_state=42), param_grid, cv=5,
                           scoring='neg_mean_squared_error', verbose=1, n_jobs=-1)

grid_search.fit(X_train, y_train)

best_params = grid_search.best_params_
best_score = -grid_search.best_score_

best_params, best_score


Fitting 5 folds for each of 54 candidates, totalling 270 fits


({'max_depth': 10, 'min_samples_leaf': 2, 'min_samples_split': 2},
 1.0678518407573467)

#### Testing tuned params:

In [273]:
tuned_dt_model = DecisionTreeRegressor(max_depth=10, min_samples_leaf=2, min_samples_split=2, random_state=0)
tuned_dt_model.fit(X_train, y_train)

y_pred_tuned = tuned_dt_model.predict(X_test)

mae_tuned = mean_absolute_error(y_test, y_pred_tuned)
mse_tuned = mean_squared_error(y_test, y_pred_tuned)
rmse_tuned = np.sqrt(mse_tuned)

mae_tuned, mse_tuned, rmse_tuned

(0.54832050665384, 0.8185716821291316, 0.9047495134727245)

And it got worse, nice... (MAE went from 0.52 to 0.55 and RMSE from 0.86 to 0.90.)
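
One possible reading of this result: the grid search ranks candidates by cross-validated MSE on the training folds, while the comparison here uses a single test split of roughly 190 rows, so a small difference may be within noise. The search also used `random_state=42` while the refit used `random_state=0`. The sketch below, on synthetic data, shows how `cv_results_` exposes the spread across folds and how `best_estimator_` gives the refit model directly. It has not been run on the climbing data.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem standing in for the climbing data
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=0)

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    {"max_depth": [None, 3, 5, 10], "min_samples_leaf": [1, 2, 4, 8]},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_demo, y_demo)

# Mean and standard deviation of the score across folds for the top candidates
results = pd.DataFrame(search.cv_results_)
summary = (results[["params", "mean_test_score", "std_test_score", "rank_test_score"]]
           .sort_values("rank_test_score")
           .head(5))
print(summary)

# Already refit on all of X_demo with the best parameters and the same random_state
best_tree = search.best_estimator_
```

If the standard deviations are as large as the gaps between the top candidates, the "best" parameters are not clearly better than the defaults.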

### C) Prediction based on user input

Notes:

For "Sex":
- if "Female", answer "0". Otherwise, "1".

For "Where do you climb?":

- if "Indoor Climbing only", answer "0"
- if "Outdoor Climbing only", answer "1"
- if "Indoor and outdoor climbing", answer "2"


For V grades:

- if "I don't boulder", answer "-1"
- else just put the grade without the V in front
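
The encoding rules above could also be applied by small helper functions instead of asking the user to type the codes. The helper names below are hypothetical; the codes themselves follow the notes and the mappings from Section 2.2.

```python
PLACE_CODES = {
    "Indoor Climbing only": 0,
    "Outdoor Climbing only": 1,
    "Indoor and outdoor climbing": 2,
}

def encode_sex(answer):
    # "Female" -> 0, anything else -> 1, as in the notes above
    return 0 if answer == "Female" else 1

def encode_place(answer):
    # Raises KeyError on an answer that is not one of the three options
    return PLACE_CODES[answer]

def encode_grade(answer):
    # "I don't boulder" -> -1, otherwise drop the leading V ("V8" -> 8)
    if answer == "I don't boulder":
        return -1
    return int(answer.strip().upper().lstrip("V"))

print(encode_sex("Female"), encode_place("Indoor and outdoor climbing"), encode_grade("V8"))
```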


In [None]:
feature_prompts = {
  "Sex": "Enter sex:" ,
  "Height (cm)": "Enter height (cm):" ,
  "Weight (kg)": "Enter weight (kg):" ,
  "Arm Span (cm)": "Enter arm span (cm):" ,
  "How long have you been climbing for?": "Enter how long have you been climbing for?:" ,
  "Where do you climb?": "Enter where do you climb?:" ,
  "Hardest V Grade climbed in the Last 3 months": "Enter hardest v grade climbed in the last 3 months:" ,
  "The V grade you can send 90-100% of routes": "Enter the v grade you can send 90-100% of routes:" ,
  "Frequency of climbing sessions per week": "Enter frequency of climbing sessions per week:" ,
  "Average hours climbing per week (not including training)": "Enter average hours climbing per week (not including training):" ,
  "Average hours Training for climbing per week": "Enter average hours training for climbing per week:" ,
  "Hangboard Frequency per week": "Enter hangboard frequency per week:" ,
  "Campus Board frequency per week": "Enter campus board frequency per week:" ,
  "Campus Board time per week (hours)": "Enter campus board time per week (hours):" ,
  "Frequency of Endurance training sesions per week": "Enter frequency of endurance training sesions per week:" ,
  "General Strength Training frequency per week": "Enter general strength training frequency per week:" ,
  "Max pull up reps": "Enter max pull up reps:" ,
  "5 rep max weighted pull ups": "Enter 5 rep max weighted pull ups:" ,
  "max push ups reps": "Enter max push ups reps:"
}

user_input = {}

for feature, prompt in feature_prompts.items():
    user_input[feature] = float(input(prompt))

In [275]:
def predict_user_grade(model, user_input):

    user_input_df = pd.DataFrame([user_input])

    df_updated[numerical_columns] = scaler.fit_transform(df_updated[numerical_columns])
    user_input_df = scaler.fit_transform(user_input_df)


    # Make prediction
    predicted_grade = model.predict(user_input_df)[0]

    return abs(predicted_grade)

# Assuming `tuned_dt_model` is your trained and tuned model
user_predicted_grade = predict_user_grade(tuned_dt_model, user_input)
print(f"Predicted hardest V grade: {user_predicted_grade}")


Predicted hardest V grade: 1.0




// To do: check why the model is sandbagging the grade
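
A likely cause of the low prediction: `predict_user_grade` calls `scaler.fit_transform` on the single-row user input. Standardizing one row against itself turns every feature into 0, including the grade columns that were never scaled during training. The sketch below uses synthetic data and hypothetical column choices to show the effect and one way around it: fit the scaler once on the training data, then only `transform` the same columns of the user input.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Hypothetical training frame: one scaled column, one already-encoded column
rng = np.random.default_rng(0)
train = pd.DataFrame({
    "Height (cm)": rng.normal(175, 8, 200),
    "Hardest V Grade climbed in the Last 3 months": rng.integers(0, 10, 200),
})
y = train["Hardest V Grade climbed in the Last 3 months"] + rng.integers(0, 2, 200)

scaled_cols = ["Height (cm)"]
scaler = StandardScaler().fit(train[scaled_cols])  # fit once, on training data
X = train.copy()
X[scaled_cols] = scaler.transform(train[scaled_cols])
model = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X, y)

def predict_user_grade(model, scaler, scaled_cols, feature_order, user_input):
    # Build a one-row frame with the same column order used for training
    row = pd.DataFrame([user_input], columns=feature_order)
    # Reuse the training statistics; do not re-fit on the single row
    row[scaled_cols] = scaler.transform(row[scaled_cols])
    return model.predict(row)[0]

user = {"Height (cm)": 180.0, "Hardest V Grade climbed in the Last 3 months": 6}

# Re-fitting a scaler on a single row zeroes out every feature
zeroed = StandardScaler().fit_transform(pd.DataFrame([user]))
print(zeroed)

prediction = predict_user_grade(model, scaler, scaled_cols, list(X.columns), user)
print(f"Predicted hardest V grade: {prediction:.1f}")
```

In the notebook, the equivalent change would be to pass `numerical_columns` and `X_train.columns` to the function and drop both `fit_transform` calls inside it.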