# Exploratory Data Analysis - Multivariate Analysis

In [2]:
import json
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
from ipywidgets import widgets

In [3]:
df = pd.read_csv('./../../../data/cleaned_data.csv')

In [4]:
# Load lists of numerical and categorical columns from the static file
with open('./../../../data/statics.json') as f:
    statics = json.load(f)
categorical_columns = statics['categorical_columns']
numerical_columns = statics['numerical_columns']


In [5]:
# Separate the dataframe into numerical and categorical dataframes
num_df = df[numerical_columns]
cat_df = df[categorical_columns]

## Correlation

For multivariate analysis, we will begin with correlation. Note that correlation coefficients can only be calculated for numerical variables; they are meaningless for categorical variables. A correlation coefficient only tells us whether two variables move together or in opposite directions. It cannot tell us anything about a cause-effect relationship between them.

In [6]:
# Compute the correlation coefficients
pearson_corr = num_df.corr()
spearman_corr = num_df.corr(method='spearman')

In [7]:
fig = px.imshow(pearson_corr)
fig.update_layout(height=600)
fig.show()

From the above chart, it can be noted that no variables have extreme negative correlations.  
Total working years is highly correlated with age, monthly income, and years at company. It would be convenient to remove this variable, but I would like to defer that decision to feature selection.
Years at current company, years in current role, years since last promotion, and years with current manager are all highly correlated with each other.
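Reading strongly correlated pairs off a heatmap can be error-prone, so it may help to list them directly. The following is a minimal sketch, not part of the original notebook: the helper name `top_correlated_pairs`, the 0.6 threshold, and the toy dataframe are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def top_correlated_pairs(corr: pd.DataFrame, threshold: float = 0.6) -> pd.Series:
    """Return column pairs whose absolute correlation is at least `threshold`.

    Hypothetical helper; the threshold is an arbitrary illustrative choice.
    """
    # Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # NaN entries (masked cells) compare as False and are dropped here
    strong = pairs[pairs.abs() >= threshold]
    # Order by absolute strength, strongest first
    order = strong.abs().sort_values(ascending=False).index
    return strong.reindex(order)


# Toy data for illustration only; not taken from the dataset
toy = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 11],
    'c': [5, 1, 4, 2, 3],
})
strong_pairs = top_correlated_pairs(toy.corr(), threshold=0.6)
print(strong_pairs)
```

In the notebook, the same helper could be called as `top_correlated_pairs(pearson_corr)` to get a ranked list to set beside the heatmap.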

Pearson correlation measures the linear relationship between two variables, while Spearman correlation is rank-based and captures any monotonic relationship, including non-linear ones. Repeating the above activity with Spearman correlation gives:

In [8]:
fig = px.imshow(spearman_corr)
fig.update_layout(height=600)
fig.show()

The picture does not change much. This suggests that the monotonic relationships among the features are approximately linear.
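Comparing two heatmaps by eye is subjective, so the gap between the two coefficients can also be computed directly. The sketch below is an illustration under stated assumptions: the helper `corr_gap` and the toy columns are hypothetical and not part of the original analysis.

```python
import pandas as pd


def corr_gap(frame: pd.DataFrame) -> pd.DataFrame:
    """Absolute difference between Spearman and Pearson correlations.

    Hypothetical helper: large values flag pairs whose relationship is
    monotonic but not well described by a straight line.
    """
    pearson = frame.corr(method='pearson')
    spearman = frame.corr(method='spearman')
    return (spearman - pearson).abs()


# Toy data for illustration only: one linear and one cubic function of x
toy = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6]})
toy['linear'] = 2 * toy['x'] + 1
toy['cubic'] = toy['x'] ** 3

gap = corr_gap(toy)
print(gap.round(3))
```

For the linear column both coefficients are 1, so the gap is zero. For the cubic column Spearman stays at 1 while Pearson drops below it. Applying `corr_gap(num_df)` and checking its largest entries would be one way to back up the visual comparison.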

To understand the correlated features more closely, let us plot the pair plots for them.

In [9]:
sub_df1 = num_df[['Age', 'TotalWorkingYears', 'MonthlyIncome', 'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion', 'YearsWithCurrManager']]

In [10]:
fig = px.scatter_matrix(sub_df1)
fig.update_layout(width=1400, height=1400)
fig.show()

Some people start their jobs at a later stage, in their 40s and 50s. Even so, age and total working years show a positive correlation. Years at company, years with current manager, and years in current role all have high positive correlations with one another.

## Relationship with target variable

Let us examine the available variables one by one with respect to attrition.

We begin our analysis with one of the most important factors in how satisfied an employee is at the company: money.

In [11]:
df['Attrition'].value_counts()

No     1233
Yes     237
Name: Attrition, dtype: int64

In [13]:
# Divide the data based on attrition
attr = df[df['Attrition'] == 'Yes']
nattr = df[df['Attrition'] == 'No']

### Numerical

In [62]:
# Create interactive plots

# Create widget to select columns
numcols3 = widgets.Dropdown(options=numerical_columns, value=numerical_columns[0], description='Numerical columns')

# Create figure for the plot
trace1 = go.Histogram(x=attr[numerical_columns[0]], opacity=0.5, histnorm='probability', name='Attrition - Yes')
trace2 = go.Histogram(x=nattr[numerical_columns[0]], opacity=0.5, histnorm='probability', name='Attrition - No')
# Create widget for the plot
ng3 = go.FigureWidget(data=[trace1, trace2],
                      layout=go.Layout(barmode='overlay'))

# Create function to respond to changes
def num_response3(change):
    """Function to change the values based on selection of column"""
    with ng3.batch_update():
        ng3.data[0].x = attr[numcols3.value]
        ng3.data[1].x = nattr[numcols3.value]
        ng3.layout.barmode = 'overlay'
        ng3.layout.xaxis.title = numcols3.value

# Observe the change in the dropdown and trigger the function on change
numcols3.observe(num_response3, names='value')

num_container2 = widgets.VBox([numcols3, ng3])


In [63]:
display(num_container2)

VBox(children=(Dropdown(description='Numerical columns', options=('Age', 'DailyRate', 'DistanceFromHome', 'Emp…

The probability that people will shift is much higher for people between ages 20 and 30 than for other age groups. People above 30 are much less likely to shift. Distance from home, hourly rate, monthly rate, percentage salary hike, number of times the employee was trained in the last year, and years since last promotion do not seem to be significant factors in the decision to leave the company. Employees in the lower bands of monthly income are more likely to leave than their counterparts in the higher bands. People who have worked in 4 or fewer companies are less likely to shift than people who have worked in more than 4 companies. People with fewer years of working experience and fewer years at the current company are more likely to shift. The longer people stay in their current role, the more the probability of leaving falls below the probability of staying. People who work under one manager for a longer period are also less likely to leave.
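The observations above come from reading overlaid histograms. A numeric check could bin a column and compute the share of leavers in each bin. This is a minimal sketch, assuming the `Attrition` column holds 'Yes'/'No' as shown earlier; the helper name `attrition_rate_by_bin`, the bin edges, and the toy rows are illustrative assumptions, not results from the dataset.

```python
import pandas as pd


def attrition_rate_by_bin(frame, column, bins, target='Attrition', positive='Yes'):
    """Share of rows with target == positive within each bin of `column`.

    Hypothetical helper; `bins` are right-inclusive edges passed to pd.cut.
    """
    binned = pd.cut(frame[column], bins=bins)
    left = frame[target].eq(positive)
    # The mean of a boolean series is the proportion of True values
    return left.groupby(binned, observed=True).mean()


# Toy rows for illustration only; not taken from the dataset
toy = pd.DataFrame({
    'Age': [22, 25, 28, 35, 41, 52],
    'Attrition': ['Yes', 'Yes', 'No', 'No', 'Yes', 'No'],
})
rates = attrition_rate_by_bin(toy, 'Age', bins=[18, 30, 40, 60])
print(rates)
```

On the real data, a call such as `attrition_rate_by_bin(df, 'Age', bins=[18, 30, 40, 50, 60])` would give per-band rates that can be compared with the overall rate of 237 out of 1470.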

### Categorical

We will repeat the above exercise with categorical variables.