# Modeling for Powerlifting Dataset

The cells below read in the data and import the libraries to assist with the analysis.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier
from sklearn import model_selection
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold
from sklearn.feature_selection import SelectFromModel

In [2]:
data1 = './data_powerlifting/openpowerlifting.csv'
data2 = './data_powerlifting/meets.csv'
powerlift = pd.read_table(data1, sep=',')
meets = pd.read_table(data2, sep=',')
powerlift_meets = pd.merge(powerlift, meets)
power = powerlift_meets

The cell below repaces the names of the columns to be lowercase for easier coding during the analysis

In [3]:
powerlift.rename(str.lower, axis='columns', inplace = True)

In [4]:
powerlift.drop(['squat4kg', 'bench4kg', 'deadlift4kg','meetid','name','division'], axis=1, inplace=True)
powerlift.dropna(inplace=True)
powerlift.shape

(105220, 11)

**Removed all rows with missing data to clean up dataset**

In [5]:
powerlift.sample(n=3)

Unnamed: 0,sex,equipment,age,bodyweightkg,weightclasskg,bestsquatkg,bestbenchkg,bestdeadliftkg,totalkg,place,wilks
381271,F,Raw,32.0,67.5,67.5,92.99,61.23,111.13,265.35,2,270.82
49896,M,Wraps,24.0,115.8,125.0,245.0,160.0,265.0,670.0,6,388.59
385406,M,Wraps,37.0,106.7,110.0,230.0,150.0,250.0,630.0,2,374.39


# Modeling

In [6]:
powerlift.head(1)

Unnamed: 0,sex,equipment,age,bodyweightkg,weightclasskg,bestsquatkg,bestbenchkg,bestdeadliftkg,totalkg,place,wilks
0,F,Wraps,47.0,59.6,60,47.63,20.41,70.31,138.35,1,155.05


In [7]:
fd1 = pd.get_dummies(powerlift.equipment, prefix='equip', drop_first=True)
fd2 = pd.get_dummies(powerlift.weightclasskg, prefix='weightclass', drop_first=True)
fd3 = pd.get_dummies(powerlift.sex, prefix='sex', drop_first=True)

In [8]:
fd1.head()

Unnamed: 0,equip_Raw,equip_Single-ply,equip_Wraps
0,0,0,1
1,0,1,0
2,0,1,0
5,0,0,1
6,1,0,0


### Created dummified data to perform analysis and gain insights into categorical or class data and concatanate it into a single dataframe.

- **Equipment**: denoting if you use wrist wraps or other types of supports are examples as well as raw, which is no supports.
- **Weight Class**: Although this is numerical, the scale is not always the same and there is no limit at the top of a class
- **Sex**

In [9]:
power_dummy = pd.concat([powerlift, fd1,fd2,fd3], axis=1)
power_dummy.shape

(105220, 59)

**The following piece of code allows me to generate the list of columns to easily copy features to use in logisitcal regression analysis**

In [10]:
list(power_dummy);

# 1. Logistical Regression to Determine Weight Class

The modeling in this section will first look a logistic regression to determine an athlete's weightclass.  Although this is not extremely interesting, it demonstrates the ability to classify data.

## Feature Selection
Using all features except for the athlete's weight, we will predict what weightclass an athlete will be in.

In [19]:
f_feature = [
    'sex_M',
    'age',
    'totalkg',
    'wilks',
    'equip_Raw',
    'equip_Single-ply',
    'equip_Wraps',
]
    
Xf = power_dummy[f_feature]
yf = power_dummy.weightclasskg

### Once features selected, I fit a multinomial logistic regression

In [20]:
kf = model_selection.KFold(n_splits=5, shuffle=True)
f_LR = LogisticRegression(multi_class='multinomial', solver='sag')

scores = []

for train_index, test_index in kf.split(Xf, yf):
    f_LR = LogisticRegression().fit(Xf.iloc[train_index], yf.iloc[train_index])
    scores.append(f_LR.score(Xf, yf))

print(f'Mean of Accuracy for all folds: {np.mean(scores)}')

Mean of Accuracy for all folds: 0.2891313438509789


In [13]:
LR = LogisticRegression()
Xf_train, Xf_test, yf_train, yf_test = train_test_split(Xf,yf)
LR.fit(Xf_train,yf_train)
yf_pred = LR.predict(Xf_test)
print('Test Score:',LR.score(Xf_test, yf_test))

Test Score: 0.2398023189507698


**The accuracy of predicting weightclass of the model is ~31%**

**The baseline prediction for this data is the 90kg class of 8.9% making the model much more predictive than baseline**

In [14]:
power_dummy.weightclasskg.value_counts(normalize=True);

### Next, I will look at the ability to predict the sex of an athlete based on an athlete's strength and other categorical data related to competition

The cell below selects and defines the features as variables

In [15]:
# s_feature = [
#     'age',
#     'bestsquatkg',
#     'bestbenchkg',
#     'bestdeadliftkg',
#     'bodyweightkg',
#     'totalkg',
#     'wilks',
#     'equip_Raw',
#     'equip_Single-ply',
#     'equip_Wraps',
# ]
    
# Xs = power_dummy[s_feature]
# ys = power_dummy.sex

In [16]:
# kf_s = model_selection.KFold(n_splits=5, shuffle=True)
# s_LR = LogisticRegression(multi_class='multinomial', solver='sag')

# scores_s = []

# for train_index1, test_index1 in kf.split(Xs, ys):
#     s_LR = LogisticRegression().fit(Xs.iloc[train_index1], ys.iloc[train_index1])
#     scores_s.append(s_LR.score(Xs, ys))

# print(f'Mean of Accuracy for all folds: {np.mean(scores_s)}')

In [17]:
# LR1 = LogisticRegression()
# Xs_train, Xs_test, ys_train, ys_test = train_test_split(Xs,ys)
# LR1.fit(Xs_train,ys_train)
# ys_pred = LR1.predict(Xs_test)
# print('Test Score:',LR1.score(Xs_test, ys_test))

### The accuracy of this model is ~99% in its ability to predict sex compared to the
### Baseline model score which is ~68% to classify sex

In [18]:
power_dummy.sex.value_counts(normalize=True)

M    0.676487
F    0.323513
Name: sex, dtype: float64