# Exploratory data analysis

This notebook contains exploratory data analysis focused on distributions and relationships between variables.

In [1]:
import pandas as pd
import altair as alt
import re
import statsmodels.api as sm
from statsmodels.formula.api import ols
from scipy import stats
import numpy as np

In [2]:
melted_df = pd.read_csv("melted_data.csv")
df = pd.read_csv("data.csv")

## Understanding safety scores and their relationships to other variables

```A score from 0 to 100 indicating how safe an individual is. 0 is extremely high risk of injury, and 100 is very low risk of injury. The Safety Score is comprised of the twist speed, bend angle, and tilt speed.```

Based on the definition above, I am assuming that lift rate is not part of the safety score calculation.

### Distribution of scores

In [3]:
alt.Chart(melted_df).transform_density(
    'Average Safety Score', 
    groupby=['Stage'],
    as_=['Score', 'density']
).mark_line().encode(
    x='Score:Q',
    y='density:Q',
    color='Stage'
).properties(title="Safety Score distribution")

In [4]:
# Out of curiosity at this stage: independent two-sample t-test of intervention vs. baseline scores (NaNs omitted)
stats.ttest_ind(df['Intervention Average Safety Score'], df['Baseline Average Safety Score'], nan_policy='omit')

Ttest_indResult(statistic=2.3968913068280266, pvalue=0.016921130811952838)
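
The call above treats the two columns as independent samples. If each row of `df` describes the same worker in both stages (an assumption; the source does not say so), a paired test may be the more natural comparison. The sketch below contrasts the two on synthetic data; all numbers are invented for illustration only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 200

# Synthetic paired scores: each intervention value is its baseline plus a small shift.
baseline = rng.normal(70, 8, size=n)
intervention = baseline + rng.normal(1.5, 3, size=n)
intervention[:10] = np.nan  # simulate missing intervention measurements

# Independent two-sample test, as in the cell above; NaNs are dropped per sample.
ind = stats.ttest_ind(intervention, baseline, nan_policy='omit')

# Paired test: keep only rows where both measurements are present.
both_present = ~np.isnan(baseline) & ~np.isnan(intervention)
rel = stats.ttest_rel(intervention[both_present], baseline[both_present])

print(f"independent: t = {float(ind.statistic):.2f}, p = {float(ind.pvalue):.4f}")
print(f"paired:      t = {float(rel.statistic):.2f}, p = {float(rel.pvalue):.4g}")
```

When between-worker variation is large relative to the within-worker change, the paired test tends to be considerably more sensitive, which is what this synthetic setup is built to show.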

## Relationship to the other variables

Safety score is negatively correlated with every other variable, from roughly -0.50 (twist velocity) to -0.87 (tilt velocity). This includes lift rate (-0.53), which is not part of the score calculation. The correlations are similar overall and within each stage.

In [5]:
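# Restrict to numeric columns explicitly: recent pandas versions no longer
# silently drop non-numeric columns such as 'Stage' in corr().
melted_df.drop(columns=['File Count']).select_dtypes('number').corr()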

| | Average Lift Rate | Average Max Forward Bend | Average Max Tilt Velocity | Average Safety Score | Average Twist Velocity |
|---|---|---|---|---|---|
| Average Lift Rate | 1.0 | 0.631283 | 0.10111 | -0.531391 | 0.625421 |
| Average Max Forward Bend | 0.631283 | 1.0 | 0.285163 | -0.624532 | 0.139894 |
| Average Max Tilt Velocity | 0.10111 | 0.285163 | 1.0 | -0.866162 | 0.214834 |
| Average Safety Score | -0.531391 | -0.624532 | -0.866162 | 1.0 | -0.497452 |
| Average Twist Velocity | 0.625421 | 0.139894 | 0.214834 | -0.497452 | 1.0 |


In [6]:
melted_df.drop(['File Count'], axis=1).groupby("Stage").corr()

| Stage | | Average Lift Rate | Average Max Forward Bend | Average Max Tilt Velocity | Average Safety Score | Average Twist Velocity |
|---|---|---|---|---|---|---|
| Baseline | Average Lift Rate | 1.0 | 0.621925 | 0.085359 | -0.546222 | 0.719658 |
| Baseline | Average Max Forward Bend | 0.621925 | 1.0 | 0.337489 | -0.679182 | 0.297791 |
| Baseline | Average Max Tilt Velocity | 0.085359 | 0.337489 | 1.0 | -0.849977 | 0.184507 |
| Baseline | Average Safety Score | -0.546222 | -0.679182 | -0.849977 | 1.0 | -0.551257 |
| Baseline | Average Twist Velocity | 0.719658 | 0.297791 | 0.184507 | -0.551257 | 1.0 |
| Intervention | Average Lift Rate | 1.0 | 0.634665 | 0.115799 | -0.505488 | 0.565596 |
| Intervention | Average Max Forward Bend | 0.634665 | 1.0 | 0.270625 | -0.587244 | 0.071569 |
| Intervention | Average Max Tilt Velocity | 0.115799 | 0.270625 | 1.0 | -0.887326 | 0.24512 |
| Intervention | Average Safety Score | -0.505488 | -0.587244 | -0.887326 | 1.0 | -0.473766 |
| Intervention | Average Twist Velocity | 0.565596 | 0.071569 | 0.24512 | -0.473766 | 1.0 |


### Understanding the score formula

How much of the score "formula" can we recover from the averages, assuming some sort of linear model? Nearly all of it: a linear fit on bend, tilt, and twist reaches an R-squared of 0.988. The formula appears to be approximately

`score = 120 - 0.20 * bend - 0.30 * tilt - 0.60 * twist`

In [7]:
st_melted_df = melted_df.rename({
    "Average Safety Score": "score", 
    "Average Max Forward Bend": "forward_bend", 
    'Average Max Tilt Velocity':'tilt_velocity', 
    'Average Twist Velocity': 'twist_velocity',
    'Average Lift Rate': 'lift_rate'
}, axis=1)

model = ols('score ~ forward_bend + tilt_velocity + twist_velocity', data=st_melted_df).fit()
model.summary()

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Dep. Variable: | score | R-squared: | 0.988 |
| Model: | OLS | Adj. R-squared: | 0.988 |
| Method: | Least Squares | F-statistic: | 12540.0 |
| Date: | Fri, 27 May 2022 | Prob (F-statistic): | 0.0 |
| Time: | 12:55:38 | Log-Likelihood: | -450.04 |
| No. Observations: | 476 | AIC: | 908.1 |
| Df Residuals: | 472 | BIC: | 924.8 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 120.2107 | 0.298 | 403.599 | 0.000 | 119.625 | 120.796 |
| forward_bend | -0.2027 | 0.003 | -71.898 | 0.000 | -0.208 | -0.197 |
| tilt_velocity | -0.2989 | 0.002 | -127.383 | 0.000 | -0.304 | -0.294 |
| twist_velocity | -0.5923 | 0.011 | -55.960 | 0.000 | -0.613 | -0.571 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Omnibus: | 60.608 | Durbin-Watson: | 1.962 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 164.875 |
| Skew: | -0.62 | Prob(JB): | 1.58e-36 |
| Kurtosis: | 5.603 | Cond. No. | 1210.0 |
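
As a sanity check on the fitted coefficients, the sketch below plugs them into a small helper. The function name `approx_score` is hypothetical, and the inputs are the baseline means reported later in this notebook. This is only the linear approximation recovered from the averages, not necessarily the true scoring formula.

```python
# Coefficients copied from the OLS summary above (an approximation, not the official formula).
INTERCEPT = 120.2107
COEF_BEND = -0.2027
COEF_TILT = -0.2989
COEF_TWIST = -0.5923


def approx_score(forward_bend, tilt_velocity, twist_velocity):
    """Linear approximation of the safety score from the three component averages."""
    return (INTERCEPT
            + COEF_BEND * forward_bend
            + COEF_TILT * tilt_velocity
            + COEF_TWIST * twist_velocity)


# Baseline means of the three components, as reported further down in this notebook.
baseline_estimate = approx_score(84.4992, 79.5412, 20.1480)
print(round(baseline_estimate, 2))  # 67.37
```

Note that the linear form is not bounded to the documented 0 to 100 range, so extreme inputs can produce values outside it.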


### How well does the lift rate alone predict the safety score?

In [8]:
model = ols('score ~ lift_rate', data=st_melted_df).fit()
model.summary()

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Dep. Variable: | score | R-squared: | 0.282 |
| Model: | OLS | Adj. R-squared: | 0.281 |
| Method: | Least Squares | F-statistic: | 186.5 |
| Date: | Fri, 27 May 2022 | Prob (F-statistic): | 4.83e-36 |
| Time: | 12:55:38 | Log-Likelihood: | -1416.0 |
| No. Observations: | 476 | AIC: | 2836.0 |
| Df Residuals: | 474 | BIC: | 2844.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 78.8316 | 0.825 | 95.598 | 0.000 | 77.211 | 80.452 |
| lift_rate | -0.1107 | 0.008 | -13.657 | 0.000 | -0.127 | -0.095 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Omnibus: | 54.292 | Durbin-Watson: | 1.904 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 80.456 |
| Skew: | -0.773 | Prob(JB): | 3.38e-18 |
| Kurtosis: | 4.29 | Cond. No. | 385.0 |
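
One possible reading of the two fits is that lift rate predicts the score only through its correlation with bend and twist. The sketch below illustrates that mechanism on synthetic data. All distributions and coefficients are made up, loosely modeled on the values above, and it uses plain NumPy least squares in place of the `ols` call so that it runs without statsmodels.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500

# Synthetic components; the distributions are assumptions chosen for illustration.
bend = rng.normal(80, 15, n)
tilt = rng.normal(80, 20, n)
twist = rng.normal(20, 5, n)
# Lift rate is correlated with bend and twist but never enters the score.
lift = 20 + 0.5 * bend + 2.0 * twist + rng.normal(0, 10, n)
score = 120 - 0.2 * bend - 0.3 * tilt - 0.6 * twist + rng.normal(0, 0.6, n)


def fit(y, *predictors):
    """Ordinary least squares with an intercept; returns (coefficients, R-squared)."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coef
    r2 = 1 - residuals.var() / y.var()
    return coef, r2


_, r2_lift = fit(score, lift)
_, r2_components = fit(score, bend, tilt, twist)
coef_full, r2_full = fit(score, bend, tilt, twist, lift)

print(f"lift only:         R^2 = {r2_lift:.3f}")
print(f"components:        R^2 = {r2_components:.3f}")
print(f"components + lift: R^2 = {r2_full:.3f}, lift coef = {coef_full[-1]:.4f}")
```

In this setup lift rate alone explains a modest share of the variance, yet adds essentially nothing once the three components are in the model. Whether the real data behave the same way would need to be checked by adding `lift_rate` to the three-variable model.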


In [9]:
alt.Chart(melted_df).mark_point().encode(
    x='Average Lift Rate',
    y='Average Safety Score',
    column='Stage'
)

## Distributions of the other quantitative variables

In [10]:
vrs = ['Average Lift Rate', 'Average Max Forward Bend', 'Average Max Tilt Velocity', 'Average Twist Velocity']

charts = []

for v in vrs:  
    c = alt.Chart(melted_df).transform_density(
        v, 
        groupby=['Stage'],
        as_=['Metric', 'density']
    ).mark_line().encode(
        x=alt.X('Metric:Q', title=v),
        y='density:Q',
        color='Stage'
    ).properties(title= v + " distribution")
    
    charts.append(c)
    
alt.hconcat(*charts)

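How different are these distributions? On average, the sensors appear to reduce both lift rates and how far workers bend forward; both differences are statistically significant. Twist velocity is slightly higher in the intervention stage but not significantly so at the 5% level (p ≈ 0.09), and tilt velocity is essentially unchanged.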

In [11]:
results = []
for v in vrs:
    baseline = melted_df[melted_df['Stage'] == 'Baseline'][v]
    intervention = melted_df[melted_df['Stage'] == 'Intervention'][v]
    t = stats.ttest_ind(intervention, baseline, nan_policy='omit')
    results.append({
        'metric': v,
        'statistic': t[0],
        'p-value': t[1]
    })

pd.DataFrame(results).round(4)

| | metric | statistic | p-value |
|---|---|---|---|
| 0 | Average Lift Rate | -3.1646 | 0.0017 |
| 1 | Average Max Forward Bend | -7.6308 | 0.0 |
| 2 | Average Max Tilt Velocity | -0.2096 | 0.834 |
| 3 | Average Twist Velocity | 1.6825 | 0.0931 |
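
Because four metrics are tested at once, it can be worth checking whether the conclusions survive a multiple-comparison adjustment. The sketch below applies a Bonferroni correction, a technique not used in the original analysis, to the rounded p-values from the table above.

```python
# Rounded p-values copied from the t-test table above.
p_values = {
    'Average Lift Rate': 0.0017,
    'Average Max Forward Bend': 0.0,
    'Average Max Tilt Velocity': 0.834,
    'Average Twist Velocity': 0.0931,
}

alpha = 0.05
m = len(p_values)

# Bonferroni: multiply each p-value by the number of tests, capped at 1.
adjusted = {metric: min(p * m, 1.0) for metric, p in p_values.items()}
significant = [metric for metric, p in adjusted.items() if p < alpha]

for metric, p in adjusted.items():
    print(f"{metric}: adjusted p = {p:.4f}")
print(significant)
```

With these inputs, lift rate and forward bend remain below 0.05 after adjustment, while tilt and twist velocity do not.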


## Distributions in baseline scenario

In [12]:
vrs = ['Average Lift Rate', 'Average Max Forward Bend', 'Average Max Tilt Velocity', 'Average Twist Velocity']

charts = []

baseline_melted_df = melted_df[melted_df['Stage'] == 'Baseline']

for v in vrs:  
    c = alt.Chart(baseline_melted_df).transform_density(
        v,         
        as_=['Metric', 'density']
    ).mark_line().encode(
        x=alt.X('Metric:Q', title=v),
        y='density:Q',
        #color='Stage'
    ).properties(title= v + " distribution")
    
    charts.append(c)
    
alt.hconcat(*charts)

In [13]:
baseline_melted_df[vrs].mean()

Average Lift Rate            101.8008
Average Max Forward Bend      84.4992
Average Max Tilt Velocity     79.5412
Average Twist Velocity        20.1480
dtype: float64

In [14]:
alt.Chart(baseline_melted_df).transform_density(
    'Average Safety Score', 
    #groupby=['Stage'],
    as_=['Score', 'density']
).mark_line().encode(
    x='Score:Q',
    y='density:Q',
    #color='Stage'
).properties(title="Baseline Safety Score distribution")

In [18]:
alt.Chart(df).mark_point().encode(
    x='Baseline Average Lift Rate',
    y='Intervention Average Lift Rate'
)