# AI Curriculum Model Evaluation

In this section you will read in and analyze a diabetes readmission dataset. We start with basic analysis and then move on to more advanced analysis, following the steps covered so far in this course, and finally prepare the model for production.

## Your Data
Fill out the following information: 

*First Name:*   
*Last Name:*   
*E-mail:*

## About the dataset
This dataset covers 10 years (1999 to 2008) of clinical care at 130 US hospitals and integrated delivery networks.
It contains 101,766 instances (hospital encounters) that are classified into three classes: no readmission, readmission in less than 30 days, and readmission in more than 30 days. Information was extracted from the database for encounters that satisfied the following criteria.

1. It is an inpatient encounter (a hospital admission).
2. It is a diabetic encounter, that is, one during which any kind of diabetes was entered to the system as a diagnosis.
3. The length of stay was at least 1 day and at most 14 days.
4. Laboratory tests were performed during the encounter.
5. Medications were administered during the encounter.

The data contains attributes such as patient number, race, gender, age, admission type, time in hospital, medical specialty of the admitting physician, number of lab tests performed, HbA1c test result, diagnosis, number of medications, diabetic medications, and the number of outpatient, inpatient, and emergency visits in the year before the hospitalization.

The question we want to answer: can we predict whether a diabetes patient will be readmitted?

## Data Set Description
**Encounter ID:** Unique identifier of an encounter</br>
**Patient number:** Unique identifier of a patient</br>
**Race:** Values: Caucasian, Asian, African American, Hispanic, and other</br>
**Gender:** Values: male, female, and unknown/invalid</br>
**Age:** Grouped in 10-year intervals: [0, 10), [10, 20), …, [90, 100)</br>
**Weight:** Weight in pounds</br>
**Admission type:** Integer identifier corresponding to 9 distinct values, for example, emergency, urgent, elective, newborn, and not available</br>
**Discharge disposition:** Integer identifier corresponding to 29 distinct values, for example, discharged to home, expired, and not available</br>
**Admission source:** Integer identifier corresponding to 21 distinct values, for example, physician referral, emergency room, and transfer from a hospital</br>
**Time in hospital:** Integer number of days between admission and discharge</br>
**Payer code:** Integer identifier corresponding to 23 distinct values, for example, Blue Cross/Blue Shield, Medicare, and self-pay</br>
**Medical specialty:** Integer identifier of a specialty of the admitting physician, corresponding to 84 distinct values, for example, cardiology, internal medicine, family/general practice, and surgeon</br>
**Number of lab procedures:** Number of lab tests performed during the encounter</br>
**Number of procedures:** Number of procedures (other than lab tests) performed during the encounter</br>
**Number of medications:** Number of distinct generic names administered during the encounter</br>
**Number of outpatient visits:** Number of outpatient visits of the patient in the year preceding the encounter</br>
**Number of emergency visits:** Number of emergency visits of the patient in the year preceding the encounter</br>
**Number of inpatient visits:** Number of inpatient visits of the patient in the year preceding the encounter</br>
**Diagnosis 1:** The primary diagnosis (coded as first three digits of ICD9); 848 distinct values</br>
**Diagnosis 2:** Secondary diagnosis (coded as first three digits of ICD9); 923 distinct values</br>
**Diagnosis 3:** Additional secondary diagnosis (coded as first three digits of ICD9); 954 distinct values</br>
**Number of diagnoses:** Number of diagnoses entered to the system</br>
**Glucose serum test result:** Indicates the range of the result or if the test was not taken. Values: “>200,” “>300,” “normal,” and “none” if not measured</br>
**A1c test result:** Indicates the range of the result or if the test was not taken. Values: “>8” if the result was greater than 8%, “>7” if the result was greater than 7% but less than 8%, “normal” if the result was less than 7%, and “none” if not measured.</br>
**Change of medications:** Indicates if there was a change in diabetic medications (either dosage or generic name). Values: “change” and “no change”</br>
**Diabetes medications:** Indicates if there was any diabetic medication prescribed. Values: “yes” and “no”</br>
**24 features for medications:** For the generic names metformin, repaglinide, nateglinide, chlorpropamide, glimepiride, acetohexamide, glipizide, glyburide, tolbutamide, pioglitazone, rosiglitazone, acarbose, miglitol, troglitazone, tolazamide, examide, sitagliptin, insulin, glyburide-metformin, glipizide-metformin, glimepiride-pioglitazone, metformin-rosiglitazone, and metformin-pioglitazone, the feature indicates whether the drug was prescribed or there was a change in the dosage. Values: “up” if the dosage was increased during the encounter, “down” if the dosage was decreased, “steady” if the dosage did not change, and “no” if the drug was not prescribed</br>
**Readmitted:** Days to inpatient readmission. Values: “<30” if the patient was readmitted in less than 30 days, “>30” if the patient was readmitted in more than 30 days, and “No” for no record of readmission

## References

1. [Beata Strack, Jonathan P. DeShazo, Chris Gennings, Juan L. Olmo, Sebastian Ventura, Krzysztof J. Cios, and John N. Clore, “Impact of HbA1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database Patient Records,” BioMed Research International, vol. 2014, Article ID 781670, 11 pages, 2014](https://www.hindawi.com/journals/bmri/2014/781670/).
    
2. [Diabetes 130-US hospitals for years 1999-2008 Data Set](https://archive.ics.uci.edu/ml/datasets/diabetes+130-us+hospitals+for+years+1999-2008) - UCI Machine Learning Repository

3. [List of features and their descriptions in the initial dataset](https://www.hindawi.com/journals/bmri/2014/781670/tab1/)

In [0]:
# Import and Setup
# It is best practice to import all the libraries you need in one place at the beginning of your code, so they are easy to keep track of

# Fill your code here

import pandas as pd






### Task 1: Import the data 
Use the `pandas.read_csv()` function to import the dataset. The data is contained in one file called diabetic_data, which you can get by cloning the git repository provided for the course or by downloading it from the source itself. The resulting pandas DataFrame will be used for data exploration.
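As a minimal sketch of the import step, the snippet below reads a tiny in-memory CSV instead of the real file so that it runs anywhere. The three rows are made up for illustration, the column names are assumed to match those in the real file, and the file name `diabetic_data.csv` in the comment is an assumption about where you saved the data.

```python
import io

import pandas as pd

# Hypothetical three-row excerpt standing in for the real file;
# the values are made up for illustration only.
sample_csv = io.StringIO(
    "encounter_id,patient_nbr,race,weight,time_in_hospital,readmitted\n"
    "1,100,Caucasian,?,3,NO\n"
    "2,101,?,?,5,>30\n"
    "3,100,AfricanAmerican,?,2,<30\n"
)

# For the real data, pass the file path instead, for example
# pd.read_csv("diabetic_data.csv") (file name and location are assumptions).
df = pd.read_csv(sample_csv)

print(df.shape)   # (number of rows, number of columns)
print(df.dtypes)  # data type pandas inferred for each column
print(df.head())
```

Note that `read_csv` keeps the `?` placeholders as ordinary strings unless told otherwise, which is why the cleaning step comes next.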

In [0]:
# Enter your code here






### Task 2: Clean your dataset
Data wrangling is the process of taking raw, messy data and transforming, cleaning, and mapping it into a tidy format that is accepted by machine learning algorithms.
1. Look at your dataset. How many columns have missing values? (Missing values can appear as '?', NaN, ' ', etc.) Check the data types and convert any numbers that were read as strings to numerical values. (Hint: You can use [str.replace()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.replace.html) to work with strings.)
2. What proportion of the values is missing in each column? Can the missing values be filled in with some data wrangling method, or should we just delete that column?
3. Are all the columns of the right data type? (Hint: You can check with `dataset.dtypes`.)
4. "readmitted" is the target variable we are trying to predict.
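One possible way to approach the checks above is sketched here on a made-up toy frame. The column names are assumed to mirror the real data, and the 50% cut-off for dropping a column is an arbitrary choice for illustration, not a recommendation.

```python
import numpy as np
import pandas as pd

# Made-up toy frame; values are for illustration only.
df = pd.DataFrame({
    "race": ["Caucasian", "?", "AfricanAmerican", "Caucasian"],
    "medical_specialty": ["?", "?", "?", "Cardiology"],
    "time_in_hospital": ["3", "5", "2", "7"],  # numbers read as strings
    "readmitted": ["NO", ">30", "<30", "NO"],
})

# Treat '?' and blank strings as missing values.
df = df.replace(["?", " ", ""], np.nan)

# Share of missing values per column, largest first.
missing_share = df.isna().mean().sort_values(ascending=False)
print(missing_share)

# Convert numeric columns that were read as strings.
df["time_in_hospital"] = pd.to_numeric(df["time_in_hospital"])

# Drop columns with more than half of their values missing
# (the 0.5 threshold is an assumption, tune it to your needs).
df = df.loc[:, df.isna().mean() <= 0.5]
print(df.dtypes)
```

Whether to drop, impute, or keep a "missing" category depends on the column, so treat the threshold rule as a starting point only.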

In [0]:
# Enter your code here






Please tell us in a few sentences what you learned from this task. What were the easiest and hardest parts? </br>
**Your interpretation:**

### Task 3: Exploratory data analysis 

1. Explore the dataset. Write a short description of the dataset stating the number of items and the number of variables, and check whether the values are reasonable.
2. How many variables are categorical and how many are numerical? Do we need to encode the variables?
3. Compute the correlation matrix and use a heat map to visualize the correlation coefficients.
4. What are the proportions of the various race/ethnicity, age, and gender groups in the dataset?
5. Check the distributions of the variables. What can we interpret from them?
6. Is the data imbalanced?
7. If possible, create visualizations to support your understanding. You can click [here](https://www.data-to-viz.com/#violin) for examples of various types of visualization.
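The exploration questions above can be sketched with plain pandas as follows. The six rows are made up for illustration, and the column names are assumptions about how the real columns are named. Plotting is left out so the snippet has no extra dependencies.

```python
import pandas as pd

# Made-up toy frame; values are for illustration only.
df = pd.DataFrame({
    "race": ["Caucasian", "Caucasian", "AfricanAmerican",
             "Hispanic", "Caucasian", "AfricanAmerican"],
    "gender": ["Female", "Male", "Female", "Female", "Male", "Female"],
    "time_in_hospital": [1, 3, 2, 7, 4, 5],
    "num_medications": [5, 12, 8, 20, 11, 16],
    "readmitted": ["NO", "NO", ">30", "NO", "<30", "NO"],
})

n_rows, n_cols = df.shape

# Split the columns by type.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

# Proportions of demographic groups and of the target classes.
gender_share = df["gender"].value_counts(normalize=True)
class_share = df["readmitted"].value_counts(normalize=True)

# Pairwise Pearson correlation between the numeric columns.
corr = df[numeric_cols].corr()

print(n_rows, n_cols)
print(numeric_cols, categorical_cols)
print(class_share)
print(corr)
```

The `corr` frame can be passed to a heat map function of your plotting library of choice, and `class_share` gives a first impression of how imbalanced the target is.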

In [0]:
# Enter your code here





Please tell us in a few sentences what you learned from this task. What were the easiest and hardest parts? </br>
**Your interpretation:**

### Task 4: Feature Engineering 
Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data.

1. Take a look at the dataset. Can we replace some missing values with something meaningful? (You can consult a subject matter expert.)
2. Do we need to scale or normalize some features?
3. Can we create new features from the information already in the data using feature crossing (combining two or more numerical features with an arithmetic operator)?
4. There are three diagnosis columns, each with more than 700 possible values. Can we bring some order to them by grouping the diagnoses into a smaller set of categories?
5. There are many medication columns. What can we interpret from them? Can we perform some feature engineering there?
6. Explore other features we could create that might be helpful for the model.
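A few of the ideas above are sketched below on made-up data: grouping ICD-9 codes into coarse categories, counting active medications, and crossing two numeric features. The helper `icd9_group` is hypothetical, and its code ranges are an illustrative subset loosely following the grouping in reference 1; check them with a subject matter expert before relying on them. Column names are assumptions.

```python
import pandas as pd

# Made-up toy frame; values are for illustration only.
df = pd.DataFrame({
    "diag_1": ["250.83", "428", "486", "V57", "?"],
    "metformin": ["No", "Steady", "Up", "No", "No"],
    "insulin": ["Up", "No", "Steady", "Down", "No"],
    "time_in_hospital": [2, 4, 1, 5, 3],
    "num_medications": [10, 8, 5, 15, 6],
})


def icd9_group(code):
    """Map one ICD-9 code string to a coarse category (illustrative ranges)."""
    if pd.isna(code) or code == "?":
        return "Missing"
    code = str(code)
    if code.startswith(("V", "E")):  # supplementary V and E codes
        return "Other"
    num = int(float(code))  # '250.83' -> 250
    if num == 250:
        return "Diabetes"
    if 390 <= num <= 459 or num == 785:
        return "Circulatory"
    if 460 <= num <= 519 or num == 786:
        return "Respiratory"
    if 520 <= num <= 579 or num == 787:
        return "Digestive"
    if 800 <= num <= 999:
        return "Injury"
    return "Other"


# Collapse hundreds of diagnosis codes into a handful of groups.
df["diag_1_group"] = df["diag_1"].map(icd9_group)

# Count how many of the listed drugs were prescribed during the encounter.
med_cols = ["metformin", "insulin"]
df["n_active_meds"] = (df[med_cols] != "No").sum(axis=1)

# Feature cross: medications per day of stay (stays last at least 1 day).
df["meds_per_day"] = df["num_medications"] / df["time_in_hospital"]

print(df[["diag_1_group", "n_active_meds", "meds_per_day"]])
```

The same `icd9_group` mapping could be applied to the second and third diagnosis columns, and `med_cols` extended to all medication columns in the real data.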

In [0]:
# Enter your code here







Please tell us in a few sentences what you learned from this task. What were the easiest and hardest parts? </br>
**Your interpretation:**

### Task 5: Create Baseline model
A baseline is a model that is both simple to set up and has a reasonable chance of providing decent results. Experimenting with baselines is usually quick and low cost, since implementations are widely available in popular packages.

1. Choose a baseline model that you think suits the data, based on the analysis we performed.
2. Think about which performance metrics to choose based on the problem statement.
3. Split the dataset into train and test sets.
4. What is the performance of the model? Can we improve on it?
5. Compute feature importances. Do they make sense?
6. Perform hyperparameter tuning. Did your model performance improve? State the best set of hyperparameters.
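The baseline workflow above might look roughly like the sketch below. It assumes scikit-learn is installed and uses synthetic, imbalanced three-class data from `make_classification` in place of the prepared features. Logistic regression and macro-averaged F1 are just one reasonable choice of baseline and metric, not the required ones.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced three-class data standing in for the real features.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4, n_classes=3,
    weights=[0.6, 0.3, 0.1], random_state=0,
)

# Stratify so each split keeps the class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Reference point: always predict the most frequent class.
dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
dummy_f1 = f1_score(y_test, dummy.predict(X_test), average="macro")

# Simple baseline, tuned over the regularization strength C.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="f1_macro",
    cv=5,
)
search.fit(X_train, y_train)
baseline_f1 = f1_score(y_test, search.predict(X_test), average="macro")

print(f"majority-class macro F1: {dummy_f1:.3f}")
print(f"logistic regression macro F1: {baseline_f1:.3f}")
print("best hyperparameters:", search.best_params_)
```

Macro F1 weights all three classes equally, which is why the majority-class predictor scores poorly on it even though its plain accuracy would look acceptable on imbalanced data.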

In [0]:
# Enter your code here







Please tell us in a few sentences what you learned from this task. What were the easiest and hardest parts? </br>
**Your interpretation:**

### Task 6: Model Development

A machine learning model is the product of training a machine learning algorithm with training data.

1. Does the baseline model satisfy our performance criteria?
2. What can we interpret from our baseline model and its feature importances?
3. What other models can we use? Can we create more features based on the information we gathered?
4. Split the dataset into train and test sets, and use the same performance metric as for the baseline model.
5. Perform cross-validation.
6. What is the final model performance?
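As one possible illustration of the steps above, the sketch below cross-validates a random forest on the training split and then scores it once on the held-out test split. It assumes scikit-learn is installed, uses synthetic data in place of the real features, and picks the random forest only as an example of "another model".

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import (
    StratifiedKFold, cross_val_score, train_test_split,
)

# Synthetic, imbalanced three-class data standing in for the real features.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4, n_classes=3,
    weights=[0.6, 0.3, 0.1], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)

# 5-fold stratified cross-validation on the training data only,
# with the same metric as the baseline (macro F1 is assumed here).
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="f1_macro")
print(f"CV macro F1: mean={cv_scores.mean():.3f}, std={cv_scores.std():.3f}")

# Final check on the untouched test split.
model.fit(X_train, y_train)
test_f1 = f1_score(y_test, model.predict(X_test), average="macro")
print(f"test macro F1: {test_f1:.3f}")

# Impurity-based importances, one value per feature.
importances = model.feature_importances_
print(importances.round(3))
```

Keeping the test split out of cross-validation means the final score is measured on data that played no part in model selection.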

In [0]:
# Enter your code here







Please tell us in a few sentences what you learned from this task. What were the easiest and hardest parts? </br>
**Your interpretation:**

### Task 7: Output and save your model

1. Create a pickle file to save your model.
2. Reload the saved pickle file to predict on the test set provided.
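A minimal sketch of the save-and-reload round trip with Python's standard `pickle` module is shown below. It assumes scikit-learn is installed. The model, the synthetic data, and the file name `readmission_model.pkl` are placeholders for your own trained model and path.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Placeholder model trained on synthetic data.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Hypothetical file name, written to the system temp directory.
model_path = Path(tempfile.gettempdir()) / "readmission_model.pkl"

# Save the fitted model.
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Reload it and predict with the restored object.
with open(model_path, "rb") as f:
    reloaded = pickle.load(f)

same_predictions = np.array_equal(model.predict(X), reloaded.predict(X))
print("reloaded model reproduces predictions:", same_predictions)
```

Only unpickle files you trust, since loading a pickle can execute arbitrary code, and reload the model with the same library versions that were used to save it.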

In [0]:
# Enter your code here



