# Descriptive Analytics



You know the drill by now - never reinvent the wheel!

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Part 1: Descriptive statistics

If using Google Colab, we need to upload our data (chimera_data.csv) to make it useable. Luckily, we can simply create a "choose file" button and then select and upload our file.
```
import os.path
from google.colab import files
if not os.path.exists("chimera_data.csv"):
    uploaded = files.upload()
```

Once we have uploaded our file, we need to set a path to it and access it with `pandas`:

In [2]:
path = "chimera_data.csv"
df = pd.read_csv(path, sep = ",") # create a pandas data frame to store data, set the type of separator used in your CSV file

Let's take a sneak peek at what we imported:

In [3]:
df.head()

Unnamed: 0,admin_support,age,boss_survey,boss_tenure,city_size,clock_in,core,education,gender,half_day_leaves,...,remote,salary,subordinates,team_size,tenure,tenure_unit,training,variable_pay,years_since_promotion,exit
0,2,35,0.655444,3,6.1,1,1,2,1,4,...,0,53.894035,0,9,3,3,3,11,3,0
1,0,33,0.533455,4,9.4,0,1,2,0,5,...,0,35.606964,0,6,1,1,3,1,3,0
2,0,32,0.486568,5,2.2,0,1,1,0,4,...,0,27.40036,0,10,2,2,3,1,4,0
3,0,40,0.477364,4,4.3,0,1,3,0,4,...,0,36.138199,0,8,1,1,3,0,4,0
4,2,47,0.60323,4,2.2,0,1,1,1,5,...,0,42.77858,1,9,1,1,2,11,4,1


We can also get a summary of each column:

In [None]:
summary_stats = df.describe()
print(summary_stats)

As you already know, visualization is an extremely powerful tool within the descriptive analytics arsenal. Usually, you want to check the histograms of some of the key variables to get a feel for the data and to discover any issues (more on this part in class).

In [None]:
sns.histplot(data=df, x="boss_survey")
plt.show()

We can also count the number of empty values per column (note that there are none here - the dataset is already cleaned):

In [None]:
df.isnull().sum()
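
Since the cleaned dataset shows no gaps, here is a minimal sketch of what the same call reports when values are missing, and two common ways to deal with them. The frame `toy` and its values are made up for illustration and are not taken from `chimera_data.csv`.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "age": [35, np.nan, 32, 40],
    "salary": [53.9, 35.6, np.nan, np.nan],
})

missing_per_column = toy.isnull().sum()  # number of NaN values per column
print(missing_per_column)

complete_rows = toy.dropna()             # keep only rows without any NaN
filled = toy.fillna(toy.mean())          # replace each NaN by its column mean
print(len(complete_rows), int(filled.isnull().sum().sum()))
```

Dropping rows is the simplest option but shrinks the sample; filling with the mean keeps every row at the cost of understating the variability of that column.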

# Part 2: Bivariate tests

## 2.1 t-tests

Let's see whether we can find any initial trends in the data. For example, are women more likely to exit the firm? We will use [t-tests](https://towardsdatascience.com/the-statistical-analysis-t-test-explained-for-beginners-and-experts-fd0e358bbb62) to analyze the difference in means:

In [None]:
ttest = sm.stats.ttest_ind(df[df.gender==0].exit,df[df.gender==1].exit)

tstat = ttest[0] 
pvalue = ttest[1] 

print('the tstat for gender differences in exit is =', tstat)
print('the pvalue is =', pvalue)

The data indicates that women are, on average, more likely to quit. However, the finding is not very [significant from a statistical perspective](https://hbr.org/2016/02/a-refresher-on-statistical-significance). Also, don't forget that we are just taking a look at two variables, ignoring a lot of other information!

Maybe exiters are paid less?

In [None]:
ttest = sm.stats.ttest_ind(df[df.exit==1].salary,df[df.exit==0].salary)

tstat = ttest[0]
pvalue =  ttest[1]

print('the tstat for salary differences in exit is =', tstat)
print('the pvalue is =', pvalue)

There indeed is a difference in average salary between those that exit and those that don't. This time, statistically speaking, we can be more sure of this difference. But again, don't forget that we are looking at only two variables right now, i.e. we are assuming all else is equal.

**Exercise:** can you write a test to check if exiters (1) have higher or lower job satisfaction than stayers?





## 2.2 Correlations

We create a new pandas data frame to store the [correlation](https://en.wikipedia.org/wiki/Correlation_and_dependence) information, which we can display using `seaborn`:

In [None]:
correlation_matrix = df.corr(method='pearson', min_periods=1)
sns.heatmap(correlation_matrix, vmax=0.15,vmin=-0.15,  # vmax and vmin define the upper and lower boundaries of the colormap respectively
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
plt.show()

Of course, we can improve the display. One thing to note, in particular, is that a correlation matrix is always symmetric - we do not need to see both sides of the diagonal, one is sufficient. The diagonal itself also doesn't carry any information (each variable is correlated perfectly with itself).

In [None]:
# Set the seaborn theme to use
sns.set(style="white")

# adapt the part of the matrix that is shown
mask = np.zeros_like(correlation_matrix, dtype=bool)
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
# sns.palplot(sns.diverging_palette(240, 0)) Uncomment to try out different color palettes to use below
cmap = sns.diverging_palette(240, 0, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(correlation_matrix, mask=mask, cmap=cmap, vmax=0.15,vmin=-0.15,  # vmax and vmin define the upper and lower boundaries of the colormap respectively
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
plt.show()
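
The two properties that justify the mask (symmetry and a diagonal of ones) can be checked numerically. This is a small sketch on random data; the column names `a` to `d` are placeholders.

```python
import numpy as np
import pandas as pd

# Random data with placeholder column names
rng = np.random.default_rng(1)
toy = pd.DataFrame(rng.normal(size=(200, 4)), columns=["a", "b", "c", "d"])
corr = toy.corr(method="pearson")

is_symmetric = np.allclose(corr.values, corr.values.T)  # upper triangle mirrors the lower one
diag_is_one = np.allclose(np.diag(corr.values), 1.0)    # each variable vs. itself

# Mask that is True on and above the diagonal, i.e. the cells to hide
mask = np.triu(np.ones(corr.shape, dtype=bool))
n_visible = int((~mask).sum())                          # cells that remain visible
print(is_symmetric, diag_is_one, n_visible)
```

With four variables, only the six cells below the diagonal remain visible, and they contain every distinct pairwise correlation.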

Of course, we can put the above code into a function to use it more flexibly:

In [None]:
def plot_corr(correlation_matrix):
    # Set the seaborn theme to use
    sns.set(style="white")

    # adapt the part of the matrix that is shown
    mask = np.zeros_like(correlation_matrix, dtype=bool)
    mask[np.triu_indices_from(mask)] = True

    # Set up the matplotlib figure
    f, ax = plt.subplots(figsize=(11, 9))

    # Generate a custom diverging colormap
    # sns.palplot(sns.diverging_palette(240, 0)) Uncomment to try out different color palettes to use below
    cmap = sns.diverging_palette(240, 0, as_cmap=True)

    # Draw the heatmap with the mask and correct aspect ratio
    sns.heatmap(correlation_matrix, mask=mask, cmap=cmap, vmax=0.15,vmin=-0.15,  # vmax and vmin define the upper and lower boundaries of the colormap respectively
                square=True, linewidths=.5, cbar_kws={"shrink": .5})
    plt.show()

Let's try it out:

In [None]:
plot_corr(df.corr(method='pearson', min_periods=1))

## Part 2: Linear regression in Python

In an [ordinary least squares, or OLS regression - usually simply "linear regression"](https://www.encyclopedia.com/social-sciences/applied-and-social-sciences-magazines/ordinary-least-squares-regression), we try to understand the *simultaneous* effect of multiple independent variables (X) on our dependent variable (y). As an example, we want to see how education and age, together, influence an employee's salary.

For illustrative purposes, we start with a single independent variable (age):

In [None]:
plt.scatter(df[['age']],df[['salary']],color="blue")
plt.show()

Note that we used matplotlib here for plotting, rather than seaborn. As with many things in Python, several packages serve the same purpose, with certain differences. Matplotlib is good for drawing basic graphs and for working with different data types. Seaborn is more design-oriented: it offers, for example, very high-quality color schemes.

We now move on to the regression part. Again, there are many packages that enable us to do this. We present two here: the `statsmodels` package and the `scikit-learn` library.
1. Pros of `statsmodels`: great summary of the regression, easier polynomial regression and plotting
2. Pros of `scikit-learn`: makes it very easy to apply machine learning methods

We will see how to do linear regression in both. For this part of the class, we use `statsmodels` (note that `statsmodels` has already been imported above to run hypothesis tests). For the machine learning concepts that we will see later, we will use `scikit-learn`.

In [None]:
X = df[['age']]
Y = df[['salary']]

X = sm.add_constant(X) # statsmodels fits no intercept by default, so we manually add a constant column to X
lm = sm.OLS(Y, X).fit() # Fit an OLS with Y as the dependent variable and X (including the constant) as the independent variables
print(lm.summary()) # Display the summary of model results

Note that often, it makes sense to standardize the data first. We will look at this in class.

In [None]:
from statsmodels.graphics.regressionplots import abline_plot

ax = df.plot(x="age", y="salary", kind='scatter', color="blue")
abline_plot(model_results=lm, ax=ax, color="red")
plt.show()

Let's now look at multivariate associations, by adding in education as a second explanatory variable:

In [None]:
ax = plt.axes(projection='3d')

zdata = df[['salary']]
ydata = df[['age']]
xdata = df[['education']]
ax.scatter3D(xdata, ydata, zdata, c=zdata)
ax.set_xlabel('Education')
ax.set_ylabel('Age')
ax.set_zlabel('Salary')

plt.show()

We can run an OLS just like before:

In [None]:
X = df[['age','education']]
Y = df[['salary']]
X = sm.add_constant(X)
lm = sm.OLS(Y, X).fit()
print(lm.summary())

Let's get a better look at the data by visualizing the coefficients. The code below displays the confidence interval of each coefficient:

In [None]:
err_series = lm.params - lm.conf_int(alpha=0.05)[0]
coef_df = pd.DataFrame({'coef': lm.params.values[1:],
                        'err': err_series.values[1:],
                        'varname': err_series.index.values[1:]
                       })
fig, ax = plt.subplots(figsize=(8, 5))
coef_df.plot(x='varname', y='coef', kind='bar', 
             ax=ax, color='none', 
             yerr='err', legend=False)
ax.set_ylabel('')
ax.set_xlabel('')
ax.scatter(x=np.arange(coef_df.shape[0]), # use numpy directly: the pd.np alias was removed from pandas
           marker='o', s=120, 
           y=coef_df['coef'], color='black')
ax.axhline(y=0, linestyle='-', color='red', linewidth=4)
ax.xaxis.set_ticks_position('none')
plt.show()

What we actually care about is whether employees exit the company. Hence, y should indicate whether an employee is an exiteer (not to be confused with a [Brexiteer](https://en.wikipedia.org/wiki/Glossary_of_Brexit_terms)). We combine all available information to try to explain why someone left:

In [None]:
X = df.loc[:, df.columns != 'exit']
Y = df[["exit"]]
X = sm.add_constant(X)

We are now ready to fit our OLS model:

In [None]:
lm = sm.OLS(Y,X).fit()
print(lm.summary())

Let's get a better look at the coefficients. To do so, we will first create a new function based on the code we previously saw:

In [None]:
def plot_coef(model):
    err_series = model.params - model.conf_int(alpha=0.05)[0]
    coef_df = pd.DataFrame({'coef': model.params.values[1:],
                            'err': err_series.values[1:],
                            'varname': err_series.index.values[1:]
                           })
    fig, ax = plt.subplots(figsize=(8, 5))
    coef_df.plot(x='varname', y='coef', kind='bar', 
                 ax=ax, color='none', 
                 yerr='err', legend=False)
    ax.set_ylabel('')
    ax.set_xlabel('')
    ax.scatter(x=np.arange(coef_df.shape[0]), # use numpy directly: the pd.np alias was removed from pandas
               marker='o', s=120, 
               y=coef_df['coef'], color='black')
    ax.axhline(y=0, linestyle='-', color='red', linewidth=4)
    ax.xaxis.set_ticks_position('none')
    plt.show()

In [None]:
plot_coef(lm)

## Part 3: Logistic regression in Python

If our model were correct, what would it predict for an employee who has the same attributes as Employee No. 1, but a `boss_survey` value of `-1`?

In [None]:
example = X.iloc[[0]].copy() # copy the first employee's row so that the original data stays untouched
example.at[0, 'boss_survey'] = -1
lm.predict(example)

It's unclear how to interpret the result of 1.15. It's certainly not a probability! This is where Logistic regression comes in:

In a [Logistic (or Logit) regression](https://en.wikipedia.org/wiki/Logistic_regression), we try to understand the *simultaneous* effect of multiple independent variables (X) on our dependent variable (y), when y is either 0 or 1, exactly as in our problem!
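
The key ingredient is the logistic function, which squeezes any real number into the interval between 0 and 1. A minimal sketch, with input values chosen arbitrarily (the helper name `logistic` is ours, not a library function):

```python
import numpy as np

def logistic(z):
    """Map any real-valued linear predictor z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary example values for the linear predictor
linear_predictor = np.array([-5.0, 0.0, 1.15, 5.0])
probabilities = logistic(linear_predictor)
print(probabilities)
```

A linear predictor of 0 maps to a probability of exactly 0.5, and larger values map to probabilities closer to 1, so the output can always be read as a probability.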

In [None]:
logm = sm.Logit(endog=Y, exog=X).fit()
print(logm.summary())

We can, again, use our custom-made visualization function:

In [None]:
plot_coef(logm)

For further reading on Logit, see, for example, [here](https://statisticalhorizons.com/whats-so-special-about-logit). For more on Logit in Python, see [here](https://www.datacamp.com/community/tutorials/understanding-logistic-regression-python).