In [None]:
# Import Package
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
matplotlib.style.use('ggplot')

from scipy.stats import shapiro
from scipy.stats import skew
from scipy.stats import kurtosis

Load data

In [None]:
# Load Data
gradStudents = pd.read_csv('grad-students.csv')
recentGrads = pd.read_csv('recent-grads.csv')
majorList = pd.read_csv('majors-list.csv')
allAges = pd.read_csv('all-ages.csv')
womenStem = pd.read_csv('women-stem.csv')

First, inspect the basic properties of these tables with *df.info()* and *df.describe()*, and look for relationships between them. The output shows 3 missing values in *recentGrads* and 1 missing value in *majorList*.
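
The missing-value counts quoted above can be read off `info()`, but a direct tally is easier to verify. A minimal sketch, using a small made-up frame in place of the real CSV files (the helper name `missing_summary` and the toy values are illustrative assumptions, not part of the original notebook):

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Return the number of missing values per column, keeping only columns that have any."""
    counts = df.isna().sum()
    return counts[counts > 0]


# Toy frame standing in for a table such as recentGrads (values are made up).
toy = pd.DataFrame({
    "Major": ["A", "B", "C"],
    "Total": [100.0, np.nan, 250.0],
    "Men": [40.0, np.nan, 100.0],
})

print(missing_summary(toy))
```

Applied to the real tables, the same call would be `missing_summary(recentGrads)` or `missing_summary(majorList)`.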

In [None]:
gradStudents.info()

In [None]:
recentGrads.info()

In [None]:
majorList.info()

In [None]:
allAges.info()

In [None]:
womenStem.info()

Next, present the summary statistics of the data. We can see that:
- *majorList* maps major codes to major names and also indicates the category each major belongs to. Note that the information in *majorList* is also contained in the other tables.

- *womenStem* gives the composition of men and women in each major; this information is also contained in *recentGrads*.

- *allAges* provides employment statistics of graduate students of all ages.

- *recentGrads* provides employment statistics of recent graduate students. Compared to *allAges*, *recentGrads* gives more detail on employment.

- *gradStudents* also provides employment statistics of recent graduate students. Compared to *recentGrads*, *gradStudents* additionally provides employment statistics of non-graduate students.
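
Whether one table's majors are fully covered by another can be checked with a set comparison on the major code. The sketch below uses toy frames; the helper `codes_missing_from` is hypothetical, and the key column name `Major_code` is an assumption that should be replaced by whatever each file actually uses.

```python
import pandas as pd


def codes_missing_from(reference, other, ref_key="Major_code", other_key="Major_code"):
    """Return the sorted codes present in `reference` but absent from `other`."""
    ref_codes = set(reference[ref_key].dropna())
    other_codes = set(other[other_key].dropna())
    return sorted(ref_codes - other_codes)


# Toy stand-ins for two of the tables (values are made up).
major_list_toy = pd.DataFrame({"Major_code": [1100, 1101, 1102]})
recent_toy = pd.DataFrame({"Major_code": [1100, 1101]})

# Codes listed in the first frame that the second frame does not cover.
print(codes_missing_from(major_list_toy, recent_toy))
```

An empty result means every code of the first table also appears in the second.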

In [None]:
majorList.describe()

In [None]:
womenStem.describe()

In [None]:
allAges.describe()

In [None]:
recentGrads.describe()

In [None]:
gradStudents.describe()

There is no need to explore the attributes of *majorList* and *womenStem* further. The remaining three tables are explored below; their structure is similar.

## Histograms

Histograms for *gradStudents*. All the variables are right-skewed, with most of the mass on the left and a long tail to the right. The long tails in the number of graduate students and in salary suggest large differences in employment among majors.

In [None]:
col = gradStudents.columns[3:]
plt.figure(figsize=(20,16))
plt.subplots_adjust(hspace=0.4,wspace=0.4)
for i in range(len(col)):
    plt.subplot(4, 5, i+1)
    plt.hist(gradStudents[col[i]].dropna())
    plt.title(col[i])

The histograms of these variables are shown above. All variables are right-skewed. Most variables, including Grad_total, Grad_employed, Grad_unemployed, Grad_unemployment_rate, Grad_median, Nongrad_total, Nongrad_employed, Nongrad_unemployed, Nongrad_unemployment_rate and Nongrad_median, are widely spread and have a long tail, which suggests large differences among majors.
We also run a normality test on the variables (see the Normality Test section). None of the variables follows a normal distribution, which is consistent with the histograms.

Histograms for *allAges*.

In [None]:
col = allAges.columns[3:]
plt.figure(figsize=(16, 8))
plt.subplots_adjust(hspace=0.4,wspace=0.4)
for i in range(len(col)):
    plt.subplot(2, 4, i+1)
    plt.hist(allAges[col[i]].dropna())
    plt.title(col[i])

Histograms for *recentGrads*.

In [None]:
col = recentGrads.columns[3:]
col = list(col)
col.remove('Major_category')
plt.figure(figsize=(20, 16))
plt.subplots_adjust(hspace=0.4,wspace=0.4)
for i in range(len(col)):
    plt.subplot(4, 5, i+1)
    plt.hist(recentGrads[col[i]].dropna())
    plt.title(col[i])
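
The three histogram cells repeat the same loop with a hand-picked grid size. One possible refactoring, sketched below with a hypothetical helper `plot_histograms` that is not part of the original notebook, derives the number of rows from the column count so the grid never runs out of subplots. The data are random toy values.

```python
import math

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_histograms(df, columns, ncols=5, bins=10):
    """Draw one histogram per column on a grid and return the figure and the used axes."""
    columns = list(columns)
    nrows = math.ceil(len(columns) / ncols)
    fig, axes = plt.subplots(nrows, ncols, figsize=(4 * ncols, 4 * nrows), squeeze=False)
    fig.subplots_adjust(hspace=0.4, wspace=0.4)
    flat = axes.ravel()
    for ax, name in zip(flat, columns):
        ax.hist(df[name].dropna(), bins=bins)
        ax.set_title(name)
    # Hide any grid cells left over after the last column.
    for ax in flat[len(columns):]:
        ax.set_visible(False)
    return fig, flat[:len(columns)]


# Toy data standing in for a table such as allAges (values are random).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Total": rng.exponential(1000, 50),
    "Median": rng.normal(50000, 8000, 50),
    "Unemployment_rate": rng.uniform(0, 0.1, 50),
})
fig, used_axes = plot_histograms(toy, toy.columns, ncols=2)
```

With the real tables, a call such as `plot_histograms(allAges, allAges.columns[3:], ncols=4)` would replace the corresponding cell.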

## Normality Test

Test normality for *gradStudents*.

In [None]:
def normality_test(df, col_list):
    """
    Run the Shapiro-Wilk test on each listed numerical column of a dataframe.
    H0: the sample is drawn from a Gaussian distribution
    H1: the sample is not drawn from a Gaussian distribution
    Also prints the skewness and excess kurtosis of each column.
    Params:
        - df: dataframe holding the data
        - col_list: names of the numerical columns to test
    Returns:
        - dict mapping each column for which H0 is rejected (p <= 0.05)
          to its W statistic
    """
    alpha = 0.05  # significance level
    norm_dict = {}
    for n in col_list:
        c = df[n].dropna()
        stat, p = shapiro(c)
        print('%s: (skew=%f, kurtosis=%f)' % (n, skew(c), kurtosis(c)))

        if p <= alpha:  # Reject H0 -- distribution is not normal
            norm_dict[n] = stat
    return norm_dict

normality_test(gradStudents, gradStudents.columns[3:])

Test normality for *recentGrads*

In [None]:
col = list(recentGrads.columns[3:])
col.remove('Major_category')
normality_test(recentGrads, col)

Test normality for *allAges*

In [None]:
normality_test(allAges, allAges.columns[3:])
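
Printing skewness and kurtosis while returning only the W statistics makes the results hard to compare across tables. A possible variant, sketched here with a hypothetical helper `normality_table`, collects everything in one DataFrame. Synthetic samples replace the real columns. Note that a positive skewness value corresponds to a longer right tail.

```python
import numpy as np
import pandas as pd
from scipy.stats import kurtosis, shapiro, skew


def normality_table(df, columns, alpha=0.05):
    """Run Shapiro-Wilk on each column and tabulate W, p-value, skewness and excess kurtosis."""
    rows = []
    for name in columns:
        values = df[name].dropna()
        w_stat, p_value = shapiro(values)
        rows.append({
            "column": name,
            "W": w_stat,
            "p_value": p_value,
            "skew": skew(values),
            "kurtosis": kurtosis(values),  # Fisher definition: 0 for a normal distribution
            "reject_normality": p_value <= alpha,
        })
    return pd.DataFrame(rows).set_index("column")


# Synthetic columns: one roughly symmetric, one with a long right tail.
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "symmetric": rng.normal(0, 1, 200),
    "long_tail": rng.exponential(1.0, 200),
})
print(normality_table(toy, toy.columns))
```

Returning a table rather than a dictionary keeps the p-values visible, so borderline cases are not hidden behind the 0.05 cutoff.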

## Correlation plot

Correlation plot for *gradStudents*

In [None]:
col = gradStudents.columns[3:]
plt.matshow(gradStudents.loc[:,col].corr())
plt.xticks(range(len(col)), col, rotation=90)
plt.yticks(range(len(col)), col)
plt.colorbar()
plt.show()

The correlation plot is shown above.
Several bright blocks stand out. They arise because some variables are correlated by construction, such as Grad_total, Grad_employed and Grad_unemployed, and because some metrics measure similar things, such as Grad_median, Grad_P25 and Grad_P75. The correlations therefore provide little information for the analysis.
In addition, we explore the difference between graduate and non-graduate students using the unemployment rate and the median salary.

Correlation plot for *allAges*

In [None]:
col = allAges.columns[3:]
plt.matshow(allAges.loc[:,col].corr())
plt.xticks(range(len(col)), col, rotation=90)
plt.yticks(range(len(col)), col)
plt.colorbar()
plt.show()

Correlation plot for *recentGrads*.

In [None]:
col = list(recentGrads.columns[3:])
col.remove('Major_category')
plt.matshow(recentGrads.loc[:,col].corr())
plt.xticks(range(len(col)), col, rotation=90)
plt.yticks(range(len(col)), col)
plt.colorbar()
plt.show()
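
Since the bright blocks in these matrices mostly reflect variables that measure the same thing, it can help to list the strongest pairs explicitly rather than read them off the colors. A minimal sketch with a hypothetical helper `top_correlations`, run on made-up data:

```python
import numpy as np
import pandas as pd


def top_correlations(df, n=5):
    """Return the n column pairs with the largest absolute Pearson correlation."""
    corr = df.corr()
    cols = corr.columns
    # Indices of the strict upper triangle, so each pair appears exactly once.
    row_idx, col_idx = np.triu_indices(len(cols), k=1)
    pairs = pd.DataFrame({
        "first": cols[row_idx],
        "second": cols[col_idx],
        "corr": corr.to_numpy()[row_idx, col_idx],
    })
    order = pairs["corr"].abs().sort_values(ascending=False).index
    return pairs.loc[order].head(n).reset_index(drop=True)


# Toy data: two columns tied together by construction, one unrelated.
rng = np.random.default_rng(1)
total = rng.exponential(1000, 100)
toy = pd.DataFrame({
    "Grad_total": total,
    "Grad_employed": 0.8 * total + rng.normal(0, 10, 100),
    "Grad_median": rng.normal(70000, 5000, 100),
})
print(top_correlations(toy, n=2))
```

On the real tables the call would take the numeric columns only, for example `top_correlations(gradStudents.loc[:, gradStudents.columns[3:]])`.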

## Distribution Comparison

In addition, compare the employment of graduate and non-graduate students in terms of unemployment rate and median salary.

In [None]:
gradStudents.loc[:,['Grad_unemployment_rate', 'Nongrad_unemployment_rate']].plot(kind='density')
gradStudents.loc[:,['Grad_median', 'Nongrad_median']].plot(kind='density')

From the plots above, the unemployment rate of non-graduate students tends to be higher than that of graduate students, and the median salary of graduate students is higher than that of non-graduate students. Thus, on average, graduate students have better employment outcomes than non-graduate students.
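
Because each row pairs the graduate and non-graduate figures for the same major, the visual comparison can be backed by per-major differences. The sketch below uses a hypothetical helper `paired_summary` and made-up numbers; the column names follow those used in the density plots.

```python
import pandas as pd


def paired_summary(df, col_a, col_b):
    """Summarize row-wise differences col_a - col_b, ignoring rows with missing values."""
    pair = df[[col_a, col_b]].dropna()
    diff = pair[col_a] - pair[col_b]
    return {
        "mean_diff": diff.mean(),
        "median_diff": diff.median(),
        "share_a_lower": (diff < 0).mean(),  # fraction of rows where col_a < col_b
    }


# Toy values standing in for the real rates (made up).
toy = pd.DataFrame({
    "Grad_unemployment_rate": [0.03, 0.04, 0.05, 0.02],
    "Nongrad_unemployment_rate": [0.05, 0.06, 0.04, 0.05],
})
print(paired_summary(toy, "Grad_unemployment_rate", "Nongrad_unemployment_rate"))
```

A negative mean difference together with a high `share_a_lower` would support the reading of the density plots.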


# Conclusion


The unemployment rate and the median salary can be used as KPIs. Use bar plots to present the statistics of each major. Use pie charts to display the composition of students, including employed versus unemployed and full time versus part time. Also, use bar plots or density plots to compare recent and all-age employment status.

What types of graphs and charts can be used to clearly explain the data?

Histograms and density plots can be used to present the distribution of the data. Boxplots can be used to compare data with different distributions and to identify outliers. Pie charts can be used to explore the composition of the data.
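
For the pie charts mentioned above, the composition first has to be expressed as shares. A small sketch, assuming columns named `Employed` and `Unemployed` (the helper `composition_shares` and the toy row are illustrative, not taken from the notebook):

```python
import pandas as pd


def composition_shares(row, parts):
    """Return each listed part's share of the sum of the listed parts."""
    values = row[parts].astype(float)
    return values / values.sum()


# Toy row standing in for one major (values are made up).
toy_major = pd.Series({"Major": "EXAMPLE", "Employed": 900, "Unemployed": 100})
shares = composition_shares(toy_major, ["Employed", "Unemployed"])
print(shares)
```

Calling `shares.plot(kind='pie')` on the result would then draw the chart for that major.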
