# Assignment 4
- toc: true
- badges: true
- comments: true
- categories: [jupyter]

Is there life after graduate school?

Download data on Science and Engineering PhDs awarded in the US, do some analysis in **pandas**, and make a dashboard visualization of a few interesting aspects of the data.

In [116]:
# Import packages
import pandas as pd
import dash
from dash import dcc, html
from dash.dependencies import Input, Output
import plotly.graph_objs as go
import plotly.express as px

The first part of the dashboard is based on Table 1, [Doctorate recipients from U.S. colleges and universities: 1958–2017](https://ncses.nsf.gov/pubs/nsf19301/data). This dataset gives a brief overview of the number of PhDs awarded each year and how that number changes from year to year.

In [123]:
data1 = pd.read_csv("https://raw.githubusercontent.com/lucylin1997/lucylin1997.github.io/main/table1.csv")
data1 = data1.drop([0,1,2])
data1.columns =['Year','Doctorate Recipients','% change from previous year']
data1.head()

| | Year | Doctorate Recipients | % change from previous year |
|---|---|---|---|
| 3 | 1958 | 8773 | - |
| 4 | 1959 | 9213 | 5.0 |
| 5 | 1960 | 9733 | 5.6 |
| 6 | 1961 | 10413 | 7.0 |
| 7 | 1962 | 11500 | 10.4 |

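Because the first three rows of the CSV are dropped after loading, pandas has most likely read every column as text, and the first year holds `-` instead of a number in the percentage column. Below is a minimal sketch of converting these columns to numbers and re-deriving the year-over-year change from the five rows shown above. The names `sample` and `recomputed` are illustrative and not part of the notebook.

```python
import pandas as pd

# The five rows displayed above, kept as strings to mimic what read_csv likely returns
sample = pd.DataFrame({
    "Year": ["1958", "1959", "1960", "1961", "1962"],
    "Doctorate Recipients": ["8773", "9213", "9733", "10413", "11500"],
    "% change from previous year": ["-", "5.0", "5.6", "7.0", "10.4"],
})

# errors="coerce" turns non-numeric placeholders such as "-" into NaN
for col in sample.columns:
    sample[col] = pd.to_numeric(sample[col], errors="coerce")

# Recompute the year-over-year change and compare it with the published column
recomputed = (sample["Doctorate Recipients"].pct_change() * 100).round(1)
print(pd.DataFrame({
    "published": sample["% change from previous year"],
    "recomputed": recomputed,
}))
```

For these rows the recomputed values match the published percentages, which is a cheap check that the columns were labeled correctly.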

The second part of the dashboard is based on Table 12, [Doctorate recipients, by major field of study: Selected years, 1987–2017](https://ncses.nsf.gov/pubs/nsf19301/data). This dataset gives the number and percentage of doctorate recipients in each field of study for selected years from 1987 to 2017.

In [118]:
data2_wide = pd.read_csv("https://raw.githubusercontent.com/lucylin1997/lucylin1997.github.io/main/table12_wide.csv",index_col = 0)
data2_wide.head()

| | Field_of_Study | Category | Year | Number | Percent |
|---|---|---|---|---|---|
| 1 | Aerospace, aeronautical, and astronautical eng... | Engineering | 1987 | 142 | 0.4 |
| 2 | Aerospace, aeronautical, and astronautical eng... | Engineering | 1992 | 234 | 0.6 |
| 3 | Aerospace, aeronautical, and astronautical eng... | Engineering | 1997 | 273 | 0.6 |
| 4 | Aerospace, aeronautical, and astronautical eng... | Engineering | 2002 | 209 | 0.5 |
| 5 | Aerospace, aeronautical, and astronautical eng... | Engineering | 2007 | 267 | 0.6 |


The third part of the dashboard is based on Table 6, [Doctorates awarded, by state or location, broad field of study, and sex of doctorate recipients: 2017](https://ncses.nsf.gov/pubs/nsf19301/data). We first read in the dataset and then clean it up.

In [125]:
data = pd.read_csv("https://raw.githubusercontent.com/lucylin1997/lucylin1997.github.io/main/table6.csv" )
data = data.drop(0)
data.columns = ["State or Location", 'Total_Male', 'Total_Female', 'LifeScience_Male', 'LifeScience_Female','Physical_Male', 'Physical_Female', 'Mathematics_Male', 'Mathematics_Female','Psychology_Male','Psychology_Female','Engineering_Male','Engineering_Female','Education_Male','Education_Female','Humanities_Male','Humanities_Female','Other_Male','Other_Female']
data.head()

| | State or Location | Total_Male | Total_Female | LifeScience_Male | LifeScience_Female | Physical_Male | Physical_Female | Mathematics_Male | Mathematics_Female | Psychology_Male | Psychology_Female | Engineering_Male | Engineering_Female | Education_Male | Education_Female | Humanities_Male | Humanities_Female | Other_Male | Other_Female |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | United Statesd | 29146 | 25495 | 5629 | 6958 | 4068 | 2011 | 2866 | 976 | 3693 | 5381 | 7389 | 2448 | 1521 | 3300 | 2581 | 2708 | 1399 | 1713 |
| 2 | Alabama | 365 | 342 | 96 | 92 | 38 | 21 | 35 | 13 | 28 | 61 | 100 | 31 | 34 | 81 | 13 | 15 | 21 | 28 |
| 3 | Alaska | 19 | 33 | D | D | 9 | 11 | 0 | 0 | D | D | D | D | D | D | 0 | 0 | D | D |
| 4 | Arizona | 420 | 381 | 57 | 76 | 78 | 32 | 38 | 13 | 53 | 103 | 109 | 32 | 24 | 51 | 41 | 50 | 20 | 24 |
| 5 | Arkansas | 98 | 104 | 32 | 43 | 6 | 11 | D | D | 8 | 11 | D | D | D | D | 8 | 15 | D | D |


As the output shows, the dataset is at the state or location level. Each column gives the number of PhD recipients in a specific field, where the fields are *Life Science*, *Physical*, *Mathematics*, *Psychology*, *Engineering*, *Education*, *Humanities*, and *Other*, and the counts for men and women are reported separately. Thus, for each field, we can add a new column with the total number of PhDs (female + male). Missing values are denoted *D*.

In [120]:
# Counts are read in as text because missing cells contain 'D';
# convert them to numbers first so that + adds instead of concatenating ('D' becomes NaN)
count_cols = data.columns.drop('State or Location')
data[count_cols] = data[count_cols].apply(pd.to_numeric, errors='coerce')

# For each field, add a column with the combined male + female count
fields = ['Total', 'LifeScience', 'Physical', 'Mathematics', 'Psychology',
          'Engineering', 'Education', 'Humanities', 'Other']
for field in fields:
    data[field + '_Total'] = data[field + '_Male'] + data[field + '_Female']

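One thing to watch when summing count columns like these: if they are stored as text (which is likely here, since missing cells hold the letter `D`), `+` joins the strings instead of adding the numbers. Below is a small sketch using the Alabama and Alaska rows shown above. The frame `demo` is illustrative only.

```python
import pandas as pd

demo = pd.DataFrame({
    "State or Location": ["Alabama", "Alaska"],
    "LifeScience_Male": ["96", "D"],
    "LifeScience_Female": ["92", "D"],
})

# On string columns, + concatenates: "96" + "92" gives "9692", "D" + "D" gives "DD"
as_text = demo["LifeScience_Male"] + demo["LifeScience_Female"]

# Converting first gives real sums; the "D" cells become NaN
male = pd.to_numeric(demo["LifeScience_Male"], errors="coerce")
female = pd.to_numeric(demo["LifeScience_Female"], errors="coerce")
as_number = male + female

print(as_text.tolist())
print(as_number.tolist())
```
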
To make plotting easier, we convert the dataset from wide format to long format.

In [121]:
data_long = data.melt(id_vars='State or Location', value_name='Number of People', var_name='Fields')
# Split names such as 'Engineering_Male' into the field and the male/female/total label
data_long[['Field', 'Subjects']] = data_long['Fields'].str.split('_', expand=True)
# Missing cells ('D', or 'DD' where two of them were joined) become NaN and are dropped
data_long['Number of People'] = pd.to_numeric(data_long['Number of People'], errors='coerce')
data_long = data_long.dropna(subset=['Number of People']).astype({'Number of People': int})
data_long = data_long[['State or Location', 'Field', 'Subjects', 'Number of People']]
data_long.head()

| | State or Location | Field | Subjects | Number of People |
|---|---|---|---|---|
| 0 | United Statesd | Total | Male | 29146 |
| 1 | Alabama | Total | Male | 365 |
| 2 | Alaska | Total | Male | 19 |
| 3 | Arizona | Total | Male | 420 |
| 4 | Arkansas | Total | Male | 98 |

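A quick way to sanity-check the reshaped table is to pivot one state back to wide form and confirm that the male and female counts add up to the total. Below is a sketch with the Alabama figures from the Table 6 output above. The names `long_rows` and `check` are hypothetical.

```python
import pandas as pd

# Long-format rows for Alabama: 365 men and 342 women overall, 96 and 92 in life science
long_rows = pd.DataFrame({
    "State or Location": ["Alabama"] * 6,
    "Field": ["Total", "Total", "Total", "LifeScience", "LifeScience", "LifeScience"],
    "Subjects": ["Male", "Female", "Total", "Male", "Female", "Total"],
    "Number of People": [365, 342, 707, 96, 92, 188],
})

# One row per field, one column per value of Subjects
check = long_rows.pivot(index="Field", columns="Subjects", values="Number of People")
check["Consistent"] = check["Male"] + check["Female"] == check["Total"]
print(check)
```

A `False` in the `Consistent` column would point to a problem in how the totals were computed, for example text columns being joined rather than added.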

The **Field** column gives the field of study, and the **Subjects** column indicates whether the count is for *Female*, *Male*, or *Total*.

In [None]:
app = dash.Dash()


state_filter = data_long['State or Location'].unique()
Timeseries1 = px.line(data1, x="Year", y="Doctorate Recipients", title='Number of Recipients from 1958-2017')
Timeseries2 = px.line(data1, x="Year", y="% change from previous year", title='% Change in the Number of Recipients from 1958-2017')
app.layout = html.Div([
    html.Div([# Page 1 
        html.Div([ #Subpage 1
           #Row 1 (Header) 
            html.Div([
                html.Div([      
                    html.H1('Is There Life after Graduate School?'),
                    html.H2('Dashboard1: Overview of PhD recipients from 1958-2017', style=dict(color='#7F90AC')),
                    ], className = "nine columns padded" )
            ], className = 'row gs-header gs-text-header'),
            #Row 2 (Time Series Plot)
            html.Div([
                html.Div([
                    
                    dcc.Graph(figure = Timeseries1)
                ],style={'display': 'inline-block', 'width': '40%'}),
                html.Div([
                    dcc.Graph(figure =Timeseries2)
                ],style={'display': 'inline-block', 'width': '40%'}),
                
            ], className = 'row'),
        ],className = 'subpage'),
    ],className = 'page'),
    html.Div([# Page 2 
        html.Div([ #Subpage 1
           #Row 1 (Header) 
            html.Div([
                html.Div([      
                    #html.H1('Are there Life after Graduate School'),
                    html.H2('Dashboard2: Doctorate recipients by major field of study: Selected years, 1987–2017', style=dict(color='#7F90AC')),
                    ], className = "nine columns padded" )
            ], className = 'row gs-header gs-text-header'),
            #Row 2 (Time Series Plot)
            html.Div([
                html.Div([
                    dcc.Graph(id = 'TimeSeries Plot3'),
                    dcc.Slider(
                        id='year-slider',
                        min=data2_wide['Year'].min(),
                        max=data2_wide['Year'].max(),
                        value=data2_wide['Year'].min(),
                        marks={str(year): str(year) for year in data2_wide['Year'].unique()},
                        step=None
                    )
                ],style={'display': 'inline-block', 'width': '40%'}),

            ], className = 'row'),
        ],className = 'subpage'),
    ],className = 'page'),
    html.Div([ # page 3

       html.A([ 'Print PDF' ], 
           className="button no-print", 
           style=dict(position="absolute", top=-40, right=0)),     
       html.Div([ # subpage 1

            # Row 1 (Header)

            html.Div([

                html.Div([      
                    #html.H1('Are there Life after Graduate School'),
                    html.H2('Dashboard3: The distribution of the number of PhDs across Fields and Gender in the US', style=dict(color='#7F90AC')),
                    ], className = "nine columns padded" )

            ], className = "row gs-header gs-text-header"),
            html.Br([]),
        
            # Row 2
            html.Div([
             # Create a dropdown list
                
                html.Div([
                    dcc.Dropdown(
                    id='state_filter',
                    options=[{'label': i, 'value': i} for i in state_filter],
                    value='North Carolina'
                )
            ],
            style={'width': '40%', 'display': 'inline-block'},className = 'four columns'),
            ],className = 'row'),
            # Row 3 Create two bar plots
            html.Div([
                html.Div([
                   dcc.Graph(
                   id='crossfilter-indicator-barplot',
                   hoverData={'points': [{'customdata': 'Other'}]}
                   )
               ], style={'width': '49%', 'display': 'inline-block'},className = 'four columns'),
            
           
             
                 html.Div([
                    dcc.Graph(id='Field_by_gender'),
                   ], style={'display': 'inline-block', 'width': '40%'}),
          ],className = 'row')
        ],className = "subpage"),
        ], className = "page" )
   
])
@app.callback(
    dash.dependencies.Output('TimeSeries Plot3', 'figure'),
    dash.dependencies.Input('year-slider', 'value'))
def update_figure(selected_year):
    data2_new = data2_wide[data2_wide.Year == selected_year]

    fig = px.scatter(data2_new, x="Percent", y="Number",
                     color="Category", hover_name="Field_of_Study",
                     log_x=True, size_max=30)

    fig.update_layout(transition_duration=500)

    return fig
@app.callback(
    dash.dependencies.Output('crossfilter-indicator-barplot', 'figure'),
    dash.dependencies.Input('state_filter', 'value')) 
def update_graph(selected_states):
    Total_only_df = data_long[data_long['Subjects'] == 'Total']
    filtered_df = Total_only_df[Total_only_df['State or Location'] == selected_states]
    colors = px.colors.qualitative.Pastel2
    return {
        'data':[go.Bar(
           x = filtered_df['Field'],
           y = filtered_df['Number of People'],
           text = filtered_df['Field'],
           customdata = filtered_df['Field'],
           marker_color=colors
        )],
        'layout': go.Layout(
           xaxis={
               'title': 'Field'
           },
           yaxis={
               'title': 'Number of People'
           },
           height = 450,
           hovermode = 'closest',
           title = 'Total Number of PhDs by Field'
        )
    }

def create_barchart(dff, title):
    colors = [px.colors.qualitative.Pastel1[1],px.colors.qualitative.Set1[1] ]
    return {
        'data':[go.Bar(
           x = dff['Subjects'],
           y = dff['Number of People'],
           text = dff['Subjects'],
           customdata = dff['Subjects'],
           marker_color = colors
        )],
        'layout': go.Layout(
           xaxis={
               'title': 'Gender'
           },
           yaxis={
               'title': 'Number of People'
           },
           height = 450,
           hovermode = 'closest',
           title = 'Number of Female and Male PhDs for the Selected Field'
        )
    }
@app.callback(
    dash.dependencies.Output('Field_by_gender', 'figure'),
    [dash.dependencies.Input('crossfilter-indicator-barplot', 'hoverData'),
     dash.dependencies.Input('state_filter', 'value')])
def update_create_barchart(hoverData, selected_states):
    field_name = hoverData['points'][0]['customdata']
    dff = data_long[data_long['Field'] == field_name]
    dff = dff[dff['State or Location'] == selected_states]
    dff = dff[dff['Subjects'] != 'Total']
    title = 'Field by Gender'
    return create_barchart(dff, title)

app.css.append_css({
    'external_url': 'https://codepen.io/chriddyp/pen/bWLwgP.css'
})

if __name__ == '__main__':
    app.run_server()

**Below is the demo for the dashboard**

![image1](https://github.com/lucylin1997/fastpage_copy/blob/master/images/Assignment4_dashboard1.png?raw=true)

From Dashboard 1, we can see that the number of doctorate recipients has increased overall. The growth was especially strong between the 1960s and the 1970s, when the number almost tripled. One possible explanation is that more students were willing to pursue a PhD; another is that more students were attending university in the first place. The percentage change is correspondingly large over the same period. There are also several years in which the number actually decreased, such as 1977, 1999, and 2010.

![image2](https://github.com/lucylin1997/fastpage_copy/blob/master/images/Assignment4_dashboard_demo.gif?raw=true)

Dashboard 2 is a scatter plot with a year slider. For the selected year, it plots the number of PhDs against the percentage of PhDs for each field of study.
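
The slider callback boils down to filtering the table to one year before plotting, and that step can be tried on its own without starting the Dash server. Below is a minimal sketch using the aerospace rows from the Table 12 output. `rows_for_year` is a hypothetical helper, not part of the dashboard code.

```python
import pandas as pd

# Field name truncated exactly as in the notebook output
field = "Aerospace, aeronautical, and astronautical eng..."
sample = pd.DataFrame({
    "Field_of_Study": [field] * 5,
    "Category": ["Engineering"] * 5,
    "Year": [1987, 1992, 1997, 2002, 2007],
    "Number": [142, 234, 273, 209, 267],
    "Percent": [0.4, 0.6, 0.6, 0.5, 0.6],
})

def rows_for_year(df, selected_year):
    """Return only the rows for the year picked on the slider."""
    return df[df["Year"] == selected_year]

# Slider marks: one entry per year present in the data
marks = {int(year): str(year) for year in sample["Year"].unique()}

print(rows_for_year(sample, 1997))
print(marks)
```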

Dashboard 3 is an interactive plot in which the two bar plots are linked. When you select a state, the left bar plot shows the number of doctorate recipients in each field; when you hover over a specific field, the right bar plot shows the number of male and female doctorate recipients in that field. From the dashboard, we can see that there are usually more male than female PhDs in fields such as *Physical Sciences*, *Mathematics*, and *Engineering*, and usually more female than male PhDs in fields such as *Psychology*, *Humanities*, and *Education*.
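
The gender pattern can also be quantified directly from the national row of Table 6. Below is a sketch computing the share of women among 2017 doctorate recipients in each broad field, using the counts from the first row of the Table 6 output. The names `national` and `female_share` are illustrative.

```python
import pandas as pd

# Male and female counts from the national row of Table 6 (2017)
national = pd.DataFrame(
    {
        "Male": [5629, 4068, 2866, 3693, 7389, 1521, 2581, 1399],
        "Female": [6958, 2011, 976, 5381, 2448, 3300, 2708, 1713],
    },
    index=["LifeScience", "Physical", "Mathematics", "Psychology",
           "Engineering", "Education", "Humanities", "Other"],
)

# Percentage of recipients in each field who are women
total = national["Male"] + national["Female"]
national["female_share"] = (national["Female"] / total * 100).round(1)
print(national.sort_values("female_share"))
```

With these figures, women make up roughly a quarter of recipients in engineering and in mathematics, and a majority in psychology and education.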