# Final Project
- toc: true
- badges: true
- comments: true
- categories: [jupyter]

## Overview

Our group's final project is a dashboard that gives an overview of several essential aspects of the current COVID-19 situation and an in-depth analysis of the relationship between the last status of COVID-19 positive patients and their characteristics.

## Dataset

The data we use come from several sources:
   - Vaccination data: [click here](https://data.cdc.gov/Vaccinations/COVID-19-Vaccinations-in-the-United-States-Jurisdi/unsk-b7fc)
   - Clinical data for the ML model predicting death or discharge: [click here](https://wiki.cancerimagingarchive.net/pages/viewpage.action?pageId=89096912#89096912bcab02c187174a288dbcbf95d26179e8)
   - Death Cases Forecasting Data: [click here](https://www.cdc.gov/coronavirus/2019-ncov/science/forecasting/forecasting-us.html)

## Problem Description

All patients in the clinical dataset are COVID-19 positive, and the outcome variable is their last status (discharged vs. deceased). We want to build models that classify a patient's last status from their characteristics and compare the performance of the different models. We are also interested in which factors contribute most to the prediction. The problem is clinically relevant: if we can identify factors related to the last status of COVID-19 positive patients, the results can inform clinicians who care for these patients in hospital. However, the dataset does not record whether patients were vaccinated against COVID-19, which may also be a related factor. We therefore provide data visualizations to help explore the possible relationship between the number of people vaccinated and the number of deaths within a state.

## Data Product Introduction

The dashboard has four main pages:
   - Vaccine distribution
   - Reported and forecasted death cases
   - Model Performance
   - Model Explanation 

### Vaccination Distribution

This page provides an overview of the vaccination distribution across the US, including the number of people vaccinated and the numbers who have received at least one dose, are fully vaccinated, or have received a booster. The dropdown list and the date range selector let the user choose the states and the time period they are interested in.

![vaccine_distribution](https://github.com/lucylin1997/fastpage_copy/blob/master/images/Vaccine_distribution.jpg?raw=true)

### Reported and Forecasted Death Cases

This page shows the reported and forecasted deaths across the US. The dropdown lists let the user select the states and the models used to forecast the deaths.

![death](https://github.com/lucylin1997/fastpage_copy/blob/master/images/forcasted_death.jpg?raw=true)

### Model Performance

After the exploratory data analysis, we found that the dataset has missing values and that the class labels are imbalanced, so we propose four ways to prepare the data before training:

   - Deleting all the missing values and keeping the imbalanced classes
   - Deleting all the missing values and oversampling the minority class
   - Imputing the missing values and keeping the imbalanced classes
   - Imputing the missing values and oversampling the minority class
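
The four combinations above can be sketched on a toy dataset using only the standard library. This is an illustration under stated assumptions, not the project's pipeline: the rows and values are made up, "deleting" is taken to mean dropping any row with a missing feature, imputation uses the column mean, and oversampling is plain random duplication rather than SMOTE.

```python
import random
from collections import Counter
from statistics import mean

# Made-up rows: ([length_of_stay, invasive_vent_days], last status)
rows = [
    ([5.0, 0.0], "discharged"),
    ([7.0, None], "discharged"),
    ([3.0, 1.0], "discharged"),
    ([4.0, 0.0], "discharged"),
    ([20.0, 9.0], "deceased"),
    ([None, 6.0], "deceased"),
]

def drop_missing(rows):
    """Keep only rows in which no feature is missing (assumed meaning of 'deleting')."""
    return [(x, y) for x, y in rows if None not in x]

def impute_mean(rows):
    """Replace each missing value with the mean of the observed values in its column."""
    n_features = len(rows[0][0])
    col_means = [
        mean(x[j] for x, _ in rows if x[j] is not None) for j in range(n_features)
    ]
    return [
        ([col_means[j] if v is None else v for j, v in enumerate(x)], y)
        for x, y in rows
    ]

def oversample_minority(rows, seed=0):
    """Randomly duplicate rows of smaller classes until every class matches the largest."""
    rng = random.Random(seed)
    counts = Counter(y for _, y in rows)
    target = max(counts.values())
    balanced = list(rows)
    for label, n in counts.items():
        pool = [row for row in rows if row[1] == label]
        balanced.extend(rng.choice(pool) for _ in range(target - n))
    return balanced

variants = {
    "delete, keep imbalance": drop_missing(rows),
    "delete, oversample": oversample_minority(drop_missing(rows)),
    "impute, keep imbalance": impute_mean(rows),
    "impute, oversample": oversample_minority(impute_mean(rows)),
}
for name, data in variants.items():
    print(name, dict(Counter(y for _, y in data)))
```

In a real experiment, the imputation statistics and the oversampling should be computed on the training split only, so that duplicated or imputed information does not leak into the evaluation data.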

For the classification models, we used Ridge Classifier, K Nearest Neighbors, Logistic Regression, Light Gradient Boosting Machine, Linear Discriminant Analysis, CatBoost Classifier, AdaBoost Classifier, Gradient Boosting Classifier, Random Forest Classifier, Extreme Gradient Boosting, Quadratic Discriminant Analysis, Extra Trees Classifier, Support Vector Machine, Decision Tree, and Naive Bayes.

The first plot shows the performance of different models for the same data preprocessing method (the confusion matrix changes when you hover over a different model point). Users can choose the evaluation metric and the data preprocessing method.

![Model Performance](https://github.com/lucylin1997/fastpage_copy/blob/master/images/model_performance1.png?raw=true)

The second plot shows the performances of the same model for different data preprocessing methods (the corresponding confusion matrix will change when you hover on different preprocessing methods). Users can choose the evaluation metrics and the classifier. 

![Model_performance2](https://github.com/lucylin1997/fastpage_copy/blob/master/images/Model_performance2.png?raw=true)
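
To make the link between the confusion matrix and the selectable metrics concrete, here is a minimal sketch in plain Python. The labels and predictions are invented, treating `deceased` as the positive class is our assumption for the example, and the metric names are generic rather than copied from the dashboard.

```python
def confusion_matrix(y_true, y_pred, labels=("deceased", "discharged")):
    """Return a 2x2 list: rows are true labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0, 0], [0, 0]]
    for truth, pred in zip(y_true, y_pred):
        matrix[index[truth]][index[pred]] += 1
    return matrix

def metrics(matrix):
    """Compute common scores, treating the first label as the positive class."""
    (tp, fn), (fp, tn) = matrix
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "Accuracy": (tp + tn) / (tp + fn + fp + tn),
        "Precision": precision,
        "Recall": recall,
        "F1": f1,
    }

# Invented example: 2 deceased and 6 discharged patients
y_true = ["deceased", "deceased"] + ["discharged"] * 6
y_pred = ["deceased"] + ["discharged"] * 7

cm = confusion_matrix(y_true, y_pred)
scores = metrics(cm)
print(cm)  # [[1, 1], [0, 6]]
print(scores)
```

In this toy case the accuracy is 0.875 while the recall for the minority class is only 0.5, which is why comparing models on a single metric can be misleading when the classes are imbalanced.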

### Model Explanation

To study the relationship between the last status of COVID-19 positive patients and their characteristics, we provide the SHAP value of each feature.

![Model Explanation](https://github.com/lucylin1997/fastpage_copy/blob/master/images/Model_Explanation.jpg?raw=true)

From the SHAP plot above, we find that the number of calendar days in the facility (`length_of_stay`) and the number of documented days of invasive ventilation support (`invasive_vent_days`) were the two biggest contributors to the model's mortality predictions.
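
For readers unfamiliar with SHAP values, the sketch below computes exact Shapley values for a single prediction by averaging each feature's marginal contribution over all feature orderings. It is a toy illustration only: `toy_risk` is a hypothetical scoring function, not the model from this project, and the `shap` library estimates these quantities with its own algorithms instead of enumerating orderings.

```python
from itertools import permutations

def shapley_values(predict, instance, baseline):
    """Exact Shapley values: average marginal contribution over all feature orderings."""
    features = list(instance)
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        current = dict(baseline)      # start from the reference patient
        previous = predict(current)
        for f in order:
            current[f] = instance[f]  # switch one feature to the patient's value
            updated = predict(current)
            totals[f] += updated - previous
            previous = updated
    return {f: total / len(orderings) for f, total in totals.items()}

def toy_risk(x):
    """Hypothetical risk score with an interaction term (NOT the project's model)."""
    return (0.02 * x["length_of_stay"]
            + 0.05 * x["invasive_vent_days"]
            + 0.001 * x["length_of_stay"] * x["invasive_vent_days"])

# Made-up reference patient and patient of interest
baseline = {"length_of_stay": 5, "invasive_vent_days": 0}
patient = {"length_of_stay": 20, "invasive_vent_days": 10}

values = shapley_values(toy_risk, patient, baseline)
print(values)
print(sum(values.values()), toy_risk(patient) - toy_risk(baseline))
```

The contributions add up to the difference between the patient's score and the baseline score, which is the property that lets a SHAP plot split one prediction across features. The values describe the model's behavior and do not by themselves establish a causal effect.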

## Personal Contribution and Reflection

In this project, I was mainly responsible for:

  - Data preprocessing: creating two new datasets (deleting all the missing values directly / oversampling the minority class)
  - Building the classification models
  - Building the model performance dashboard using `plotly`

What I learned from this project:

  - Methods for dealing with imbalanced data, such as oversampling the minority class
  - Different ways of handling the raw data may lead to different model results
  - How to use `plotly` to build a dashboard that compares the performance of different models and preprocessing methods, which also makes it easier to report the results to collaborators

Key points in the individual work:

- Package used to deal with the imbalanced data: `imblearn`
```python
from imblearn.over_sampling import SMOTE

# SMOTE synthesizes new minority-class samples; only the training split is resampled
X_train_resampled, y_train_resampled = SMOTE().fit_resample(X_train, y_train)
```
- Package used to build the models and save the performances: `pycaret`
```python
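from pycaret.classification import setup, compare_models, pull

def comparemodel(dataset):
    """Train and rank the candidate classifiers on one preprocessed dataset."""
    # silent=True skips the interactive dtype confirmation in PyCaret 2.x;
    # PyCaret 3.x no longer accepts this argument
    setup(
        data=dataset,
        target='last.status',
        silent=True,
        session_id=1,  # fixed seed for reproducibility
    )
    compare_models(sort='Accuracy')  # cross-validate the available models
    return pull()  # score grid of the last call, as a DataFrame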
```
- Packages used to build the dashboard: `plotly` and `dash`. The code that generates the dashboard is below:

In [None]:
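import numpy as np
import pandas as pd
import dash
# Dash >= 2.0; on Dash 1.x use `import dash_core_components as dcc`
# and `import dash_html_components as html` instead
from dash import dcc, html
import plotly.graph_objs as go_lucy
import plotly.figure_factory as ff_lucy

data_lucy = pd.read_csv('model_results_final.csv')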
app = dash.Dash()
metric_filter = data_lucy['variable'].unique()
method_filter = data_lucy['Method'].unique()
model_filter = data_lucy['Model'].unique()
app.layout = html.Div([
    html.Div([
        html.Div([
                html.Div([      
                    
                    html.H2('Dashboard1: Performance of Classification Model by Methods and Metrics', style=dict(color='#7F90AC')),
                    ], className = "nine columns padded" )
            ], className = 'row gs-header gs-text-header'),
       html.Div([

        html.Div([
            dcc.Dropdown(
                id='metric_filter_lucy',
                options=[{'label': i, 'value': i} for i in metric_filter],
                value='Accuracy'
            )
        ],style={'width': '40%', 'display': 'inline-block','align-items': 'center','justify-content': 'center','margin-left': '5%'}),
        html.Div([
            dcc.Dropdown(
                id='method_filter_lucy',
                options=[{'label': i, 'value': i} for i in method_filter],
                value='Dataset after Deleting all the NAs'
            )
        ],
        style={'width': '40%', 'display': 'inline-block','align-items': 'center','justify-content': 'center','margin-left': '10%'})
    ]),
    html.Div([
        html.Div([
        dcc.Graph(
            id='method-filter-scatter_lucy',
            hoverData={'points': [{'customdata': 'Ridge Classifier'}]}
        )],style={'width': '49%', 'display': 'inline-block', 'padding': '0 20'}),
        html.Div([
        dcc.Graph(
            id = 'confusion_matrix1_lucy')
    ], style={'width': '49%', 'display': 'inline-block', 'padding': '0 20'})
    ])
    ],className = 'page'),
    # Page 2
     html.Div([
        html.Div([
                html.Div([      
                    
                    html.H2('Dashboard2: Performance of Classification Model by Models and Metrics', style=dict(color='#7F90AC')),
                    ], className = "nine columns padded" )
            ], className = 'row gs-header gs-text-header'),
       html.Div([
         html.Div([
            dcc.Dropdown(
                id='metric_filter1_lucy',
                options=[{'label': i, 'value': i} for i in metric_filter],
                value='Accuracy'
            )
            ],style={'width': '40%', 'display': 'inline-block','margin-left': '5%'}),
         html.Div([
            dcc.Dropdown(
                id='model_filter_lucy',
                options=[{'label': i, 'value': i} for i in model_filter],
                value='Ridge Classifier',
            )
            ],style={'width': '40%', 'display': 'inline-block','margin-left': '10%'} )
        
    ]),
    html.Div([
        html.Div([
        dcc.Graph(
            id='model-filter-scatter_lucy',
            hoverData={'points': [{'customdata':'Dataset after Deleting all the NAs'}]}
        )
        ], style={'width': '49%', 'display': 'inline-block','align-items': 'center','justify-content': 'center'}),
        html.Div([
        dcc.Graph(
            id='confusion_matrix_lucy'
        )
        ], style={'width': '49%', 'display': 'inline-block',  'align-items': 'center','justify-content': 'center'})
    ])
    ],className = 'page'),
])
@app.callback(
    dash.dependencies.Output('method-filter-scatter_lucy', 'figure'),
    [dash.dependencies.Input('method_filter_lucy', 'value'),
     dash.dependencies.Input('metric_filter_lucy','value')]
    ) 
def update_graph1(selected_methods,selected_metrics):
    filtered_df = data_lucy[(data_lucy['variable'] == selected_metrics) & (data_lucy['Method'] == selected_methods)]
    return {
        'data':[go_lucy.Scatter(
           x = filtered_df['Model'],
           y = filtered_df['value'],
           text = filtered_df['Model'],
           customdata = filtered_df['Model']
           #marker_color=colors
        )],
        'layout': go_lucy.Layout(
           
           yaxis={
               'title': 'Score'
           },
           height = 450,
           hovermode = 'closest',
           title = '<b>Performance of Classifier</b>'
        )
    }

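def create_cm(dff):
    """Build an annotated 2x2 confusion-matrix heatmap from the first matching row."""
    # After dropping these columns, the remaining ones hold the confusion-matrix counts
    dff = dff.drop(['Method','Model','variable','value'],axis = 1)
    # Built-in int replaces the np.int alias, which was removed in NumPy 1.24
    arr = np.zeros((2, 2), dtype=int)
    matrix = dff.values
    arr[0,0] = matrix[0,0]
    arr[0,1] = matrix[0,1]
    arr[1,0] = matrix[0,2]
    arr[1,1] = matrix[0,3]
    x = ['deceased','discharged']
    y = ['deceased','discharged']
    # Cell annotations: the counts as strings
    z_text = [[str(cell) for cell in row] for row in arr]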
    fig = ff_lucy.create_annotated_heatmap(arr, x=x, y=y, annotation_text=z_text, colorscale=['aliceblue','aqua','aquamarine','darkturquoise'])
    fig.update_layout(title_text='<i><b>Confusion matrix</b></i>',
                  #xaxis = dict(title='x'),
                  #yaxis = dict(title='x')
                 )
    return fig
@app.callback(
    dash.dependencies.Output('confusion_matrix1_lucy', 'figure'),
    [dash.dependencies.Input('method_filter_lucy', 'value'),
     dash.dependencies.Input('method-filter-scatter_lucy','hoverData')
    ]) 
def update_cm_graph(selected_methods,hoverData):
    model_name = hoverData['points'][0]['customdata']
    dff = data_lucy[(data_lucy['Model'] == model_name) & (data_lucy['Method'] == selected_methods)]
    return create_cm(dff)
    
@app.callback(
    dash.dependencies.Output('model-filter-scatter_lucy', 'figure'),
    [dash.dependencies.Input('model_filter_lucy', 'value'),
     dash.dependencies.Input('metric_filter1_lucy','value')]
    ) 
def update_graph(selected_models,selected_metrics):
    filtered_df = data_lucy[(data_lucy['variable'] == selected_metrics) & (data_lucy['Model'] == selected_models)]
    return {
        'data':[go_lucy.Scatter(
           x = filtered_df['Method'],
           y = filtered_df['value'],
           text = filtered_df['Method'],
           customdata = filtered_df['Method']
           #marker_color=colors
        )],
        'layout': go_lucy.Layout(
           
           yaxis={
               'title': 'Score'
           },
           height = 450,
           hovermode = 'closest',
           title = '<b>Performance of Data Preprocessing Method</b>'
        )
    }
@app.callback(
    dash.dependencies.Output('confusion_matrix_lucy', 'figure'),
    [dash.dependencies.Input('model_filter_lucy', 'value'),
     dash.dependencies.Input('model-filter-scatter_lucy','hoverData')
    ]) 
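def update_cm_graph_by_method(selected_models,hoverData):
    # Renamed so it no longer redefines update_cm_graph from the first dashboard.
    # Here the hovered point's customdata is a preprocessing method, not a model.
    method_name = hoverData['points'][0]['customdata']
    dff = data_lucy[(data_lucy['Method'] == method_name) & (data_lucy['Model'] == selected_models)]
    return create_cm(dff)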
    
app.css.append_css({
    'external_url': 'https://codepen.io/chriddyp/pen/bWLwgP.css'
})

if __name__ == '__main__':
    app.run_server()  # newer Dash releases use app.run() instead