# Diagnostic: The Data Science Game

## If you were a data scientist today, what would you do with this dataset?

 

We love data. Think of this assignment as a game: a warm-up and stretch before the bootcamp.

Here's a dataset about __heart disease__ originally from the University of California at Irvine. We'd like to use this first assignment to get you familiar with the hands-on nature of this course, as well as get a sense for your current level of applied data science.

We will check the notebooks you upload to get a sense of where you are, as well as how specifically we can help you grow.

 
__Instructions__

1. Download and open the Jupyter notebook called Data_Science_Game_[Your Last Name].
2. Rename this notebook and change [Your Last Name] to your actual last name.
3. Make sure you completed step 2.
4. Download the heart.csv file.
5. Have fun and do whatever!
6. When you're done playing with the data, save the file and upload it here.
7. Really, that's it. Try not to spend more than 1-2 hours on this.

__Files:__

Notebook

__Data_Science_Game_[Your Last Name].ipynb__

Dataset

__heart.csv__

 
`Clarifying step 5:`

__What do you mean, do whatever?__

We gave you some template instructions in the notebook, but as a data scientist, you'll often encounter data...and not much else. You choose what you want to do with it! Do you want to visualize it, find the descriptive statistics, run some what-if scenarios, build a dashboard, build a machine learning model, etc.? There are so many options. Use what you already know how to do - your background may give you very different ideas of how to approach this. A business intelligence professional, a marketer, an engineer, and a statistician may all look at the same dataset and come up with completely different analyses. That's part of the diversity of data science!

__What if I can't do anything?__

You should be able to do some things! Check the table, look at the top values, find the mean, etc.

But it's ok if you can't do much. That's why you're in the bootcamp after all. Do the things you can, even if it's just 1 or 2 things, and then save the file. At least we'll know where you are.

__Can I look up stuff to do online?__

Sure, data scientists are constantly looking online for inspiration or refreshers. If you remember learning a statistic or a programming function but forgot how to do it, look it up online to refresh yourself. If there's a cool new trick you can learn, go ahead!

Just try not to spend more than 1-2 hours on this.

__Can I use Excel, PowerBI, Tableau, R, etc.?__

We will use this assignment to assess your comfort level with Python, so we STRONGLY prefer Python. That said, if you are new to Python but strong at data analysis with another tool, you can upload a PDF of the analysis you did with that tool (e.g. a dashboard).

__Can I combine this with other datasets?__

Sure, but try not to spend more than 1-2 hours on this.

__Can I copy a Kaggle kernel?__

No, don't do that.

__Can I do [some really fancy model that takes days to deploy]?__

Um, you probably shouldn't be in this bootcamp.




In [1]:
# Run this line of code to check the version of Python in your Jupyter notebook.
# You can click the Run button or press Shift + Enter (Windows) or use the command key ⌘ (Mac).

!python --version

Python 3.7.3


# Business & Data Understanding

In [None]:
# E.g. What do you want to find out about this data set?

__Features:__

> 1. `age` - age in years 
2. `sex` - (1 = male; 0 = female) 
3. `cp` - chest pain type 
    * 0: Typical angina: chest pain related to decreased blood supply to the heart
    * 1: Atypical angina: chest pain not related to heart
    * 2: Non-anginal pain: typically esophageal spasms (non heart related)
    * 3: Asymptomatic: chest pain not showing signs of disease
4. `trestbps` - resting blood pressure (in mm Hg on admission to the hospital)
    * anything above 130-140 is typically cause for concern
5. `chol` - serum cholesterol in mg/dl
    * serum = LDL + HDL + .2 * triglycerides
    * above 200 is cause for concern
6. `fbs` - (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false) 
    * '>126' mg/dL signals diabetes
7. `restecg` - resting electrocardiographic results
    * 0: Nothing to note
    * 1: ST-T Wave abnormality
        - can range from mild symptoms to severe problems
        - signals non-normal heart beat
    * 2: Possible or definite left ventricular hypertrophy
        - Enlarged heart's main pumping chamber
8. `thalach` - maximum heart rate achieved 
9. `exang` - exercise induced angina (1 = yes; 0 = no) 
10. `oldpeak` - ST depression induced by exercise relative to rest 
    * looks at stress of heart during exercise
    * unhealthy heart will stress more
11. `slope` - the slope of the peak exercise ST segment
    * 0: Upsloping: better heart rate with exercise (uncommon)
    * 1: Flatsloping: minimal change (typical healthy heart)
    * 2: Downsloping: signs of unhealthy heart
12. `ca` - number of major vessels (0-3) colored by fluoroscopy
    * colored vessel means the doctor can see the blood passing through
    * the more blood movement the better (no clots)
13. `thal` - thallium stress result
    * 1,3: normal
    * 6: fixed defect: used to be defect but ok now
    * 7: reversible defect: no proper blood movement when exercising
14. `target` - have disease or not (1=yes, 0=no) (= the predicted attribute)
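
The rough cutoffs mentioned in the feature notes can be turned into simple boolean flags. The sketch below is only an illustration: it rebuilds a few columns of the five rows shown by `df.head()` in this notebook rather than reading `heart.csv`, and the names `sample`, `high_bp`, and `high_chol` are hypothetical. It assumes the lower end of the "130-140" range as the blood pressure cutoff; these are rules of thumb from the notes above, not clinical criteria.

```python
import pandas as pd

# Five rows copied from the df.head() output in this notebook (subset of columns).
sample = pd.DataFrame({
    "age": [63, 37, 41, 56, 57],
    "trestbps": [145, 130, 130, 120, 120],
    "chol": [233, 250, 204, 236, 354],
    "fbs": [1, 0, 0, 0, 0],
})

# Hypothetical flag columns based on the rough cutoffs in the feature notes.
sample["high_bp"] = sample["trestbps"] > 130    # assumed cutoff: above 130 mm Hg
sample["high_chol"] = sample["chol"] > 200      # above 200 mg/dl

print(sample[["trestbps", "high_bp", "chol", "high_chol"]])
print("rows with high blood pressure:", int(sample["high_bp"].sum()))
```

The same two comparison lines should work on the full `df` once it is loaded.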



In [4]:
# You can add additional cells if needed.

# Import Data & Libraries

In [2]:
# E.g. Import important libraries (any libraries you want)
import pandas as pd

In [4]:
# E.g. Download the data and make it available in your coding environment
df = pd.read_csv("heart.csv")

In [None]:
# You can add additional cells if needed.

# Exploratory Data Analysis

In [8]:
# E.g. Check out the shape of the dataset.
df.shape

(303, 14)

In [9]:
# E.g. Take a look at the first few rows, and take note of the column names. 
df.head()

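|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |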


In [10]:
# E.g. Check out the data-type for each column of the dataset. 
df.dtypes

age           int64
sex           int64
cp            int64
trestbps      int64
chol          int64
fbs           int64
restecg       int64
thalach       int64
exang         int64
oldpeak     float64
slope         int64
ca            int64
thal          int64
target        int64
dtype: object

In [None]:
# E.g. Do I need to clean the data?
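
One possible way to start answering this question is to count missing values and repeated rows. The sketch below is an assumption about what "cleaning" might involve, not a required approach; it uses a hypothetical `sample` frame built from the five rows shown by `df.head()` above so that it runs without `heart.csv`.

```python
import pandas as pd

# Five rows copied from the df.head() output above (subset of columns).
sample = pd.DataFrame({
    "age": [63, 37, 41, 56, 57],
    "sex": [1, 1, 0, 1, 0],
    "chol": [233, 250, 204, 236, 354],
    "target": [1, 1, 1, 1, 1],
})

missing_per_column = sample.isna().sum()          # missing values in each column
duplicate_rows = int(sample.duplicated().sum())   # rows that repeat an earlier row

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
```

Running the same two checks on the full `df` would show whether any rows need attention before analysis.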

In [None]:
# E.g. Visualize/plot the data. Use the most appropriate graphs for your data, as many as you want.

In [None]:
# E.g. Is there anything interesting?

In [None]:
# You can add additional cells if needed.

In [12]:
df.mean()

age          54.366337
sex           0.683168
cp            0.966997
trestbps    131.623762
chol        246.264026
fbs           0.148515
restecg       0.528053
thalach     149.646865
exang         0.326733
oldpeak       1.039604
slope         1.399340
ca            0.729373
thal          2.313531
target        0.544554
dtype: float64
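
For columns coded as 0/1, the mean is the share of rows equal to 1, so multiplying by the row count recovers how many rows carry that value. The sketch below works from the numbers printed above and by `df.shape`; the names `n_rows`, `binary_means`, and `counts` are hypothetical.

```python
# Means reported by df.mean() above for four 0/1 columns.
n_rows = 303  # row count from df.shape
binary_means = {
    "sex": 0.683168,
    "fbs": 0.148515,
    "exang": 0.326733,
    "target": 0.544554,
}

# For a 0/1 column, mean * row count recovers the number of ones.
counts = {name: round(mean * n_rows) for name, mean in binary_means.items()}
print(counts)
```

By this arithmetic, 165 of the 303 rows have `target` equal to 1. Means of coded categorical columns such as `cp` or `thal` do not have an equally direct interpretation.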

In [16]:
df.describe()

|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 |
| mean | 54.366337 | 0.683168 | 0.966997 | 131.623762 | 246.264026 | 0.148515 | 0.528053 | 149.646865 | 0.326733 | 1.039604 | 1.39934 | 0.729373 | 2.313531 | 0.544554 |
| std | 9.082101 | 0.466011 | 1.032052 | 17.538143 | 51.830751 | 0.356198 | 0.52586 | 22.905161 | 0.469794 | 1.161075 | 0.616226 | 1.022606 | 0.612277 | 0.498835 |
| min | 29.0 | 0.0 | 0.0 | 94.0 | 126.0 | 0.0 | 0.0 | 71.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 47.5 | 0.0 | 0.0 | 120.0 | 211.0 | 0.0 | 0.0 | 133.5 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 | 0.0 |
| 50% | 55.0 | 1.0 | 1.0 | 130.0 | 240.0 | 0.0 | 1.0 | 153.0 | 0.0 | 0.8 | 1.0 | 0.0 | 2.0 | 1.0 |
| 75% | 61.0 | 1.0 | 2.0 | 140.0 | 274.5 | 0.0 | 1.0 | 166.0 | 1.0 | 1.6 | 2.0 | 1.0 | 3.0 | 1.0 |
| max | 77.0 | 1.0 | 3.0 | 200.0 | 564.0 | 1.0 | 2.0 | 202.0 | 1.0 | 6.2 | 2.0 | 4.0 | 3.0 | 1.0 |



# Train/Test Split

In [None]:
# E.g. Set aside some data for testing.
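
A minimal sketch of one way to set data aside, using only pandas: sample a fraction of the rows for testing and keep the rest for training. The 20% fraction, the `random_state` value, and the names `sample`, `train_set`, and `test_set` are assumptions for illustration; scikit-learn's `train_test_split` is a common alternative.

```python
import pandas as pd

# Stand-in frame: five rows copied from the df.head() output above.
sample = pd.DataFrame({
    "age": [63, 37, 41, 56, 57],
    "chol": [233, 250, 204, 236, 354],
    "target": [1, 1, 1, 1, 1],
})

# Hold out 20% of the rows for testing; random_state makes the split repeatable.
test_set = sample.sample(frac=0.2, random_state=42)
train_set = sample.drop(test_set.index)

print(len(train_set), "training rows,", len(test_set), "test rows")
```

Dropping by index keeps the two sets disjoint, so no test row leaks into training.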

In [None]:
# You can add additional cells if needed.

# Prepare for ML

In [None]:
# E.g. Do you need to transform the data? Yes or no?
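
One transformation worth considering: columns such as `cp` hold category codes, not quantities, so some models may treat the numbers as an ordered scale. One-hot encoding is a common way around that. The sketch below is a hedged example on a hypothetical `sample` frame built from the `df.head()` rows above; whether encoding helps depends on the model you choose.

```python
import pandas as pd

# Stand-in frame: five rows copied from the df.head() output above.
sample = pd.DataFrame({
    "age": [63, 37, 41, 56, 57],
    "cp": [3, 2, 1, 1, 0],
    "target": [1, 1, 1, 1, 1],
})

# Replace the coded `cp` column with one indicator column per chest pain type.
encoded = pd.get_dummies(sample, columns=["cp"], prefix="cp")
print(encoded.columns.tolist())
```

Each row ends up with exactly one of the `cp_*` indicators set.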

In [None]:
# You can add additional cells if needed.

# Pick your Models

In [None]:
# E.g. choose your model. Go wild!!

In [None]:
# You can add additional cells if needed.

# Model Selection

In [None]:
# E.g. Pick one algorithm.

In [None]:
# You can add additional cells if needed.

# Model Tuning

In [None]:
# E.g. You can tune model hyperparameters.

In [None]:
# You can add additional cells if needed.

# Pick the best model

In [None]:
# E.g. Pick the model that performed the best.
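
When comparing models, it can help to have a floor to beat. A simple one is the accuracy of always predicting the more common class, which follows directly from the `target` mean reported by `df.mean()` above. This is a sketch computed on the full-dataset proportion; the value on your own test split will likely differ somewhat, and the names `positive_rate` and `baseline_accuracy` are hypothetical.

```python
# Share of rows with target == 1, taken from the df.mean() output above.
positive_rate = 0.544554

# Always predicting the more common class is right this fraction of the time.
baseline_accuracy = max(positive_rate, 1 - positive_rate)
print(f"majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

A model that scores close to this number has learned little beyond the class balance.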

In [None]:
# You can add additional cells if needed.

# Conclusion

In [None]:
# E.g. Additional insight that you want to convey.

In [None]:
# You can add additional cells if needed.