## Body Performance

This simple project's goal is to practice data analysis and machine learning skills. The input data contains people's body traits, which we use to predict how well they perform in sports. Along the way, I'll explore different functionalities that may or may not be the most effective, purely for practice. First of all, let's load and visualize our dataset.

## Getting data

In [None]:
import pandas as pd

In [None]:
path = "../input/body-performance-data/"
df = pd.read_csv(path + "bodyPerformance.csv")

In [None]:
df.head()

As we can see, most of our variables are numeric, except for gender, and 'class' is what we want to predict. Let's visualize how the data is distributed within each variable.
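The same observation can be checked in code rather than by eye. This is a minimal sketch on a tiny hypothetical frame (`toy`, with made-up values) that only borrows a few column names from the dataset; on the real data you would run the same two calls on `df`.

```python
import pandas as pd

# Tiny hypothetical stand-in for the real dataset (values are made up).
toy = pd.DataFrame({
    "age": [27.0, 25.0, 31.0, 32.0],
    "gender": ["M", "M", "F", "F"],
    "weight_kg": [75.2, 55.8, 78.0, 71.1],
    "class": ["C", "A", "C", "B"],
})

# Columns that are not numeric need encoding before modelling.
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)

# Number of rows per target label, useful for spotting class imbalance.
class_counts = toy["class"].value_counts().sort_index()
print(class_counts)
```

Here the target shows up as non-numeric too, which is fine since it is the label and not a feature.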

## EDA

In [None]:
df.info()

In [None]:
import plotly.express as px

In [None]:
age = df.groupby('age').count().reset_index()

In [None]:
px.line(age, x = 'age', y = 'gender', labels = {
    "gender":"Count",
    "age":"Age"
})

In [None]:
gender = df.groupby('gender').count().reset_index()

In [None]:
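# After count(), the 'age' column holds the number of rows per gender,
# so the y axis is labelled as a count.
px.bar(gender, x='gender', y='age', labels = {
    "gender":"Gender",
    "age":"Count"}, color = 'gender')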

In [None]:
def scatter(df, var1, var2):
    fig = px.scatter(df, x=var1, y=var2, labels = {
        var1:var1.title(),
        var2:var2.title()
    }, color = 'class')
    fig.show()

In [None]:
scatter(df, 'weight_kg', 'height_cm')

In [None]:
scatter(df, 'diastolic', 'systolic')

In [None]:
scatter(df, 'weight_kg', 'body fat_%')

This one is interesting. Low body fat seems to be associated with better body performance, while high weight seems to be negatively correlated with it.
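One way to back up what the scatter plot suggests is to compare group averages per class. The sketch below uses a small hypothetical frame with made-up numbers and assumed labels `A` to `D`, so its output says nothing about the real data; it only shows the shape of the check.

```python
import pandas as pd

# Hypothetical values, for illustration only.
toy = pd.DataFrame({
    "class": ["A", "A", "B", "B", "C", "C", "D", "D"],
    "body fat_%": [14.0, 16.0, 19.0, 21.0, 24.0, 26.0, 30.0, 34.0],
    "weight_kg": [62.0, 66.0, 65.0, 69.0, 68.0, 72.0, 74.0, 82.0],
})

# Mean of each trait within each performance class.
summary = toy.groupby("class")[["body fat_%", "weight_kg"]].mean()
print(summary)
```

On the real dataset, replacing `toy` with `df` would give the per-class means to compare against the plot.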

Lastly, we can visualize how our variables are correlated with each other.

In [None]:
import seaborn as sns

In [None]:
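# numeric_only=True skips the text columns ('gender', 'class'),
# which recent pandas versions no longer drop silently.
sns.heatmap(df.corr(numeric_only=True))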

As we can see, body fat is highly correlated with many other variables, which is a point of attention when we run our model.
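Reading strong correlations off a heatmap can be imprecise, so it may help to list the pairs above a threshold. This is a sketch on synthetic data where `weight_kg` is deliberately built to track `body fat_%`; the 0.7 cutoff is an arbitrary assumption.

```python
from itertools import combinations

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
fat = rng.normal(25, 5, n)
toy = pd.DataFrame({
    "body fat_%": fat,
    "weight_kg": 50 + 0.9 * fat + rng.normal(0, 1, n),  # built to follow body fat
    "age": rng.normal(35, 10, n),                       # independent noise
})

corr = toy.corr(numeric_only=True)

# Visit each pair of columns once and keep those with |r| above the cutoff.
threshold = 0.7
strong = {
    (a, b): corr.loc[a, b]
    for a, b in combinations(corr.columns, 2)
    if abs(corr.loc[a, b]) > threshold
}
print(strong)
```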

## Preprocessing

Now that we've seen some visualizations, let's prepare our dataset for training and testing our machine learning model. The raw dataset is already in good shape, so there's not much work to be done here. The main task is to rescale our data, so that differences in scale between variables don't cause problems when running our model. But first, we need to transform "gender" into a dummy variable.

In [None]:
X = df.drop('class', axis = 1)
y = df['class']

In [None]:
X = pd.get_dummies(X, drop_first = True)

Now our gender column holds 1 for male and 0 for female.
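To see what `drop_first = True` does, here is a small sketch on a hypothetical frame, assuming gender is stored as `"M"` / `"F"`. The `dtype=int` argument is my addition: recent pandas versions return boolean dummies by default, and it forces the 0/1 encoding.

```python
import pandas as pd

# Hypothetical rows, assuming gender is coded as "M" / "F".
toy = pd.DataFrame({"age": [27, 31, 25], "gender": ["M", "F", "M"]})

# drop_first=True drops the first category ("F"), leaving a single indicator column.
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)
print(encoded)
```

Keeping a single indicator avoids two perfectly redundant columns, since one is just the complement of the other.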

Finally, we can rescale our data and create our train and test datasets.

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
columns = X.columns

In [None]:
scaler = StandardScaler()

In [None]:
X = pd.DataFrame(scaler.fit_transform(X))
X.columns = columns

In [None]:
X.head()

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 10)

In [None]:
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
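Note that the scaler above is fitted on the whole dataset before the split, so the test rows influence the means and standard deviations. A common alternative, sketched here on synthetic data with hypothetical names (`X_toy`, `y_toy`, `X_tr`, `X_te`), is to split first and fit the scaler on the training part only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, for illustration only.
rng = np.random.default_rng(10)
X_toy = pd.DataFrame({
    "weight_kg": rng.normal(70, 12, 100),
    "height_cm": rng.normal(170, 9, 100),
})
y_toy = pd.Series(rng.choice(["A", "B", "C", "D"], size=100), name="class")

# Split first, so the test rows never influence the scaling statistics.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=10
)

scaler = StandardScaler()
# Learn mean and standard deviation from the training rows only...
X_tr_scaled = pd.DataFrame(
    scaler.fit_transform(X_tr), columns=X_tr.columns, index=X_tr.index
)
# ...and reuse those same statistics for the test rows.
X_te_scaled = pd.DataFrame(
    scaler.transform(X_te), columns=X_te.columns, index=X_te.index
)
print(X_tr_scaled.shape, X_te_scaled.shape)
```

Passing `index=` keeps the original row labels, so the scaled features stay aligned with `y_tr` and `y_te`.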

## Model

At this point, we're ready to train our model. We'll use a random forest classifier for this multiclass classification task, tuning its hyperparameters with a grid search so we can pick the best-performing model.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

In [None]:
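rf = RandomForestClassifier()
# 3 x 3 = 9 hyperparameter combinations to try
param = {'n_estimators':[100,400,500], 'max_depth':[60, 80, 100]}
# Each combination is scored with 5-fold cross-validation (the GridSearchCV default)
rf_tuned = GridSearchCV(rf, param)
rf_tuned.fit(X_train, y_train)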

Now that we've tuned our hyperparameters, we can look at the values that were chosen and analyse how well our model performs on the test set.

In [None]:
rf_tuned.best_params_
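Besides the winning combination, it can be useful to see how close the other candidates came. This sketch fits a much smaller, hypothetical grid on synthetic data (so it runs quickly) and reads the per-combination scores from `cv_results_`; the same table could be built from `rf_tuned.cv_results_`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 4-class problem, standing in for the real training data.
X_toy, y_toy = make_classification(
    n_samples=200, n_features=8, n_informative=5, n_classes=4, random_state=10
)

# Deliberately small grid and 3 folds to keep the example fast.
search = GridSearchCV(
    RandomForestClassifier(random_state=10),
    {"n_estimators": [10, 20], "max_depth": [3, 6]},
    cv=3,
)
search.fit(X_toy, y_toy)

# One row per combination, best-ranked first.
cols = ["param_n_estimators", "param_max_depth",
        "mean_test_score", "std_test_score", "rank_test_score"]
results = pd.DataFrame(search.cv_results_)[cols].sort_values("rank_test_score")
print(results)
```

If the top rows score almost the same, the cheaper configuration (fewer trees, shallower depth) may be the more practical choice.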

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
y_pred = rf_tuned.predict(X_test)
accuracy_score(y_test, y_pred)

From that, we see that our model classified 73.59% of the body performances correctly, which is a reasonable result. Our last step is to plot the confusion matrix for the test set to visualize the performance on each of the target classes.
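As a side note on judging an accuracy figure: it helps to compare it with a trivial baseline that always predicts the most frequent training class. The sketch below uses scikit-learn's `DummyClassifier` on hypothetical labels (the features are ignored by this strategy, so zeros are enough); on the real data it would be fitted on `X_train`, `y_train`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels; "A" is the most frequent class in the training part.
y_tr_toy = np.array(["A"] * 6 + ["B"] * 5 + ["C"] * 5 + ["D"] * 4)
y_te_toy = np.array(["A", "A", "B", "B", "C", "C", "D", "D"])

# The most_frequent strategy ignores the features, so placeholders suffice.
X_tr_toy = np.zeros((len(y_tr_toy), 1))
X_te_toy = np.zeros((len(y_te_toy), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_tr_toy, y_tr_toy)
baseline_acc = accuracy_score(y_te_toy, baseline.predict(X_te_toy))
print(baseline_acc)
```

The further the model's accuracy sits above this baseline, the more it has actually learned from the features.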

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
# fmt='d' prints the counts as plain integers instead of scientific notation
sns.heatmap(confusion_matrix(y_test, y_pred), annot = True, fmt = 'd', cmap = 'Blues')

As we can see, our model predicts the class labels reasonably well, which means it can evaluate body performance with a fairly low error. We can conclude that the results are satisfactory, considering that we didn't do any complicated calculations.
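To put numbers on what the confusion matrix shows, the per-class recall can be read directly from it: the diagonal divided by each row total. This is a sketch on hand-written hypothetical labels, not on the model's real predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels, for illustration only.
y_true = ["A", "A", "A", "B", "B", "C", "C", "C", "C", "D"]
y_hat = ["A", "A", "B", "B", "C", "C", "C", "C", "D", "D"]
labels = ["A", "B", "C", "D"]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_hat, labels=labels)

# Recall per class: correctly predicted rows over all rows of that class.
recall = cm.diagonal() / cm.sum(axis=1)
for label, value in zip(labels, recall):
    print(label, round(value, 2))

# Overall accuracy is the diagonal total over all samples.
accuracy = cm.trace() / cm.sum()
print(accuracy)
```

With the real results, `y_test` and `y_pred` would take the place of the hypothetical lists.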