## Import Packages

**pandas - data manipulation (e.g. reading a CSV file)**

**numpy - math functions applied to arrays**

**matplotlib.pyplot - tools for plotting figures**

**seaborn - data visualization (e.g. heatmap)**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, HTML 

In [9]:
import wget
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data'
cleveland_data = wget.download(url)
print (cleveland_data)
#df = pd.read_csv(cleveland_data)
#df.head()

processed.cleveland (6).data


## Read File

**Use pandas to read the downloaded file from where it was saved (a delimiter may be needed to split the data into columns). Add headers to label the columns of the dataframe.**

In [10]:
df = pd.read_csv(cleveland_data, header = None, index_col = False, delimiter = ',')
df.columns = ['Age','Sex','Chest_Pain','RestBP','Chol','FBS', 'RestECG', 'MaxHR', 'Exang', 'Oldpeak', 'Slope_ST_seg', 'Ca', 'Thal', 'Num']

df.head()
#1st 5 rows of table with header

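| | Age | Sex | Chest_Pain | RestBP | Chol | FBS | RestECG | MaxHR | Exang | Oldpeak | Slope_ST_seg | Ca | Thal | Num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63.0 | 1.0 | 1.0 | 145.0 | 233.0 | 1.0 | 2.0 | 150.0 | 0.0 | 2.3 | 3.0 | 0.0 | 6.0 | 0 |
| 1 | 67.0 | 1.0 | 4.0 | 160.0 | 286.0 | 0.0 | 2.0 | 108.0 | 1.0 | 1.5 | 2.0 | 3.0 | 3.0 | 2 |
| 2 | 67.0 | 1.0 | 4.0 | 120.0 | 229.0 | 0.0 | 2.0 | 129.0 | 1.0 | 2.6 | 2.0 | 2.0 | 7.0 | 1 |
| 3 | 37.0 | 1.0 | 3.0 | 130.0 | 250.0 | 0.0 | 0.0 | 187.0 | 0.0 | 3.5 | 3.0 | 0.0 | 3.0 | 0 |
| 4 | 41.0 | 0.0 | 2.0 | 130.0 | 204.0 | 0.0 | 2.0 | 172.0 | 0.0 | 1.4 | 1.0 | 0.0 | 3.0 | 0 |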
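The column labels can also be passed to `read_csv` through its `names` parameter instead of being assigned afterwards. A minimal sketch, assuming an in-memory sample (two rows copied from the table above) stands in for the downloaded file; `sample` and `sample_df` are names made up for this illustration:

```python
import io
import pandas as pd

# Column labels used in this notebook.
columns = ['Age', 'Sex', 'Chest_Pain', 'RestBP', 'Chol', 'FBS', 'RestECG',
           'MaxHR', 'Exang', 'Oldpeak', 'Slope_ST_seg', 'Ca', 'Thal', 'Num']

# Two rows copied from the table above, standing in for the real file.
sample = io.StringIO(
    "63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0\n"
    "67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2\n"
)

# header=None: the data has no header row; names=...: label the columns on read.
sample_df = pd.read_csv(sample, header=None, names=columns)
print(sample_df.shape)
```

Labelling on read keeps the column names next to the call that creates the dataframe, which may be easier to follow than a separate assignment.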


## Assess Basic Info

**Find basic statistics and trends in the data using .describe(), .info(), .isnull().sum(), .dtypes (an attribute, so no parentheses), .nunique(), and .plot()**

In [None]:
'''
# Disabled cell: replace the '?' placeholders in 'Ca' and 'Thal' with 0.0, then convert to float.
# df.set_value was removed in pandas 1.0; .replace does the same substitution without a loop.
df['Ca'] = df['Ca'].replace('?', '0.0').astype(float)
df['Thal'] = df['Thal'].replace('?', '0.0').astype(float)
'''

In [None]:
df[df['Ca']=='?'].Ca
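The cell above looks for rows where 'Ca' holds the '?' placeholder. As an alternative to substituting 0.0, the placeholders could be turned into missing values with `pd.to_numeric(..., errors='coerce')`, so that `.isnull().sum()` counts them. A sketch on made-up values (the `toy` frame is hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical stand-in for the 'Ca' and 'Thal' columns, which are read as text
# when the file contains '?' placeholders.
toy = pd.DataFrame({'Ca': ['0.0', '3.0', '?', '2.0'],
                    'Thal': ['6.0', '?', '7.0', '3.0']})

# Rows where 'Ca' holds the placeholder (same filter as the cell above).
print(toy[toy['Ca'] == '?'])

# errors='coerce' converts anything that cannot be parsed as a number into NaN.
for col in ['Ca', 'Thal']:
    toy[col] = pd.to_numeric(toy[col], errors='coerce')

print(toy.dtypes)          # both columns are now float64
print(toy.isnull().sum())  # one missing value per column
```

Whether 0.0 or a missing value is the better stand-in depends on the analysis; 0.0 is a valid 'Ca' value in the table above, so substituting it hides which rows were unknown.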

In [None]:
df.describe()

In [None]:
df.info()

In [None]:
print (df.isnull().sum())

In [None]:
print(df.dtypes)                                            #dtypes is an attribute; calling df.dtypes() raises a TypeError

In [None]:
#unique counts per category -- issues with Num column?
#print(df.apply(lambda x: x.nunique()))                          #alternative that also gives nunique per column
print (df.nunique())                                            #worked after updating pandas and conda and switching to Python 2
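On a small frame it is easy to see what each of these checks reports. A sketch with made-up values (the `toy` frame is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical values for illustration only.
toy = pd.DataFrame({'Sex': [1.0, 0.0, 1.0, 1.0],
                    'Num': [0, 2, 1, 0],
                    'Chol': [233.0, np.nan, 229.0, 250.0]})

print(toy.dtypes)          # attribute: written without parentheses
print(toy.nunique())       # distinct non-missing values per column
print(toy.isnull().sum())  # missing values per column
```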

In [None]:
df.plot()
plt.title('Plot of All Data', fontweight='bold')
plt.xlabel('Index')
plt.ylabel('Value')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)    #place legend box to right of plot
plt.show()


## In-depth Analysis

**Using the basic information gathered above, look for patterns that help narrow down the conclusions that can be drawn from the data.**

- **Plot the data on a log scale to see the range of magnitudes the columns span.**

In [None]:
df.plot()
plt.yscale('log')                                                 # makes plot log scale 


plt.title('Log Plot of All Data', fontweight='bold')              # figure features
plt.xlabel('Index')
plt.ylabel('Log Scale')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.) 
plt.show()
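The spread that the log plot shows can also be read off numerically. A sketch, assuming made-up values in a hypothetical `toy` frame, that reports each column's maximum and its power of ten:

```python
import numpy as np
import pandas as pd

# Hypothetical values, roughly in the ranges shown in the table above.
toy = pd.DataFrame({'Chol': [233.0, 286.0, 229.0],
                    'Oldpeak': [2.3, 1.5, 2.6],
                    'Sex': [1.0, 1.0, 0.0]})

col_max = toy.max()
# floor(log10(max)) gives the power of ten of each column's largest value.
magnitude = np.floor(np.log10(col_max)).astype(int)
print(pd.DataFrame({'max': col_max, 'power_of_10': magnitude}))
```

Values of 0 have no logarithm, so columns that contain zeros will show gaps on a log axis.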

**A basic understanding of the dataframe is needed here. Each column is a 'Series', which is a 1D array, so you have to use Series-specific functions to get information from a particular column.**

- **Find how many people of each age are part of the study**

In [None]:
each_age = df.Age.value_counts()
print(each_age)                                         #number of participants at each age, most common age first
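A short sketch of `value_counts()` on a single Series, using the five ages from the table above (the name `ages` is made up for the illustration):

```python
import pandas as pd

# First five ages from the table above.
ages = pd.Series([63.0, 67.0, 67.0, 37.0, 41.0], name='Age')

counts = ages.value_counts()   # sorted by count, most frequent first
print(counts)
print(counts.sort_index())     # same counts, ordered by age instead
```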


- **Find trends by grouping data according to 'Num' output (0 - no heart disease, 4 - severe heart disease)**

In [None]:
group_by_num = df.groupby(["Num", "Sex"])
print(group_by_num.Age.max())     #oldest participant in each (Num, Sex) group
print(group_by_num)               #prints the GroupBy object itself, not its contents
for key, item in group_by_num:
    print(key)                    #each key is a (Num, Sex) pair
    #display(HTML(group_by_num.get_group(key).to_html()))
    #print(group_by_num.get_group(key))
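Grouping by two columns produces one group per (Num, Sex) combination that occurs in the data. A sketch on made-up rows (the `toy` frame is hypothetical):

```python
import pandas as pd

# Hypothetical rows for illustration only.
toy = pd.DataFrame({'Num': [0, 0, 0, 1, 1],
                    'Sex': [1.0, 0.0, 1.0, 1.0, 1.0],
                    'Age': [63.0, 41.0, 37.0, 67.0, 58.0]})

grouped = toy.groupby(['Num', 'Sex'])
oldest = grouped.Age.max()        # Series indexed by (Num, Sex)
print(oldest)

for key, rows in grouped:         # key is a (Num, Sex) tuple, rows is a sub-dataframe
    print(key, len(rows))
```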

- **From .nunique() and the data provided, the predicted heart disease outcome ('Num') has 5 categories (0, 1, 2, 3, 4). Group the rows by 'Num' and find the mean of every column for each category.**

In [None]:
group2 = df.groupby("Num")
print (group2.mean())

**To access information from a specific 'Num' category (e.g. 4), use get_group(4)**

In [None]:
group2 = df.groupby("Num")
print(group2.Age.get_group(4), "\n\n")
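The per-category means and `get_group` can be checked by hand on a few rows. A sketch with made-up values (the `toy` frame and `by_num` are hypothetical names):

```python
import pandas as pd

# Hypothetical rows for illustration only.
toy = pd.DataFrame({'Num': [0, 0, 1, 1, 4],
                    'Age': [63.0, 37.0, 67.0, 57.0, 60.0],
                    'Chol': [233.0, 250.0, 229.0, 241.0, 300.0]})

by_num = toy.groupby('Num')
print(by_num.mean())               # one row per 'Num' value, one column per variable
print(by_num.Age.get_group(4))     # ages of the rows with Num == 4
```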


## Visualize Data

- **Plot the age distribution for each 'Num' outcome: five bar charts, one per 'Num' category (0, 1, 2, 3, 4)**

In [None]:
for i in range(5):
    group1 = group2.Age.get_group(i)                 #ages of participants with Num == i
    group1 = group1.value_counts().sort_index()      #count per age, ordered by age
    age_pos = np.arange(len(group1.index))
    plt.figure(figsize=(7,3))
    plt.xlabel('Age')
    plt.ylabel('Count')
    plt.title('Age Distribution for Num = ' + str(i))   #label each chart with its category
    plt.bar(age_pos, group1.values)
    plt.xticks(age_pos, group1.index, rotation=90)
    plt.show()
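The counts behind these bar charts can also be tabulated in one step with `pd.crosstab`, which is swapped in here as an alternative to the loop. A sketch on made-up rows (the `toy` frame is hypothetical):

```python
import pandas as pd

# Hypothetical rows for illustration only.
toy = pd.DataFrame({'Num': [0, 0, 0, 1, 1],
                    'Age': [63.0, 41.0, 41.0, 67.0, 67.0]})

# Rows: ages, columns: 'Num' categories, cells: number of participants.
table = pd.crosstab(toy['Age'], toy['Num'])
print(table)
```

Each column of `table` holds the bar heights for one 'Num' category, with 0 where no participant of that age falls in the category.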


- **Histogram to assess resting blood pressure distribution**

In [None]:
#RestBP distribution of study
x = df['RestBP']
num_bins = 5
n, bins, patches = plt.hist(x, num_bins, alpha=0.5)
plt.title('RestBP', fontweight='bold')
plt.xlabel('RestBP')
plt.ylabel('Count')
plt.show()
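The bin counts and bin edges that a histogram draws can be computed without plotting through `np.histogram`. A sketch using the five RestBP values from the table above (`rest_bp` is a name made up for the illustration):

```python
import numpy as np

# First five RestBP values from the table above.
rest_bp = np.array([145.0, 160.0, 120.0, 130.0, 130.0])

# bins=5 splits the range from the minimum to the maximum into five equal-width bins.
counts, edges = np.histogram(rest_bp, bins=5)
print(counts)   # number of values in each bin
print(edges)    # six edges delimit five bins
```

With only five bins, each bar covers a wide range of blood pressures, so a larger `num_bins` may show more of the distribution's shape.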

In [None]:
#reference shape: histogram of 1000 random draws from a standard normal distribution
normal_sample = np.random.normal(size = 1000)
plt.hist(normal_sample, bins=30)
plt.title('Random Normal Sample')
plt.ylabel('Count')
plt.show()

- **Create a scatterplot of maximum heart rate vs. resting blood pressure**

In [None]:
#scatterplot of MaxHR (x) against RestBP (y) with uniform markers
#(markers that vary in size and color would turn this into a multi-dimensional plot)
x = df['MaxHR']
y = df['RestBP']
plt.xlabel('MaxHR')
plt.ylabel('RestBP')
#plt.errorbar(x,y,linestyle= 'None', marker='s')
plt.scatter(x, y, alpha=0.5)      #alpha=0.5 keeps overlapping points visible
plt.show()
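Marker size and color can carry two more variables on the same scatterplot. A sketch using the first five rows of the table above; mapping 'Chol' to marker size and 'Age' to color, and the divisor of 2, are arbitrary choices made for this illustration:

```python
import matplotlib.pyplot as plt
import pandas as pd

# First five rows of the table above, restricted to four columns.
toy = pd.DataFrame({'MaxHR': [150.0, 108.0, 129.0, 187.0, 172.0],
                    'RestBP': [145.0, 160.0, 120.0, 130.0, 130.0],
                    'Age': [63.0, 67.0, 67.0, 37.0, 41.0],
                    'Chol': [233.0, 286.0, 229.0, 250.0, 204.0]})

fig, ax = plt.subplots()
points = ax.scatter(toy['MaxHR'], toy['RestBP'],
                    s=toy['Chol'] / 2,     # marker area follows cholesterol (arbitrary scaling)
                    c=toy['Age'],          # marker color follows age
                    cmap='viridis', alpha=0.5)
fig.colorbar(points, ax=ax, label='Age')   # legend for the color mapping
ax.set_xlabel('MaxHR')
ax.set_ylabel('RestBP')
```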


- **Create a heatmap using .corr() to visualize the pairwise correlation of all columns. The diagonal should be all 1.0 because each column is compared against itself.**

In [None]:
#heatmap
f,ax = plt.subplots(figsize=(18, 16))
sns.heatmap(df.corr(), annot=True, linewidths=.8, fmt= '.1f',ax=ax)
plt.show()
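The matrix that the heatmap colors can be inspected directly. A sketch using four columns from the first five rows of the table above (the `toy` frame is a made-up subset):

```python
import numpy as np
import pandas as pd

# First five rows of the table above, restricted to four numeric columns.
toy = pd.DataFrame({'Age': [63.0, 67.0, 67.0, 37.0, 41.0],
                    'RestBP': [145.0, 160.0, 120.0, 130.0, 130.0],
                    'Chol': [233.0, 286.0, 229.0, 250.0, 204.0],
                    'MaxHR': [150.0, 108.0, 129.0, 187.0, 172.0]})

corr = toy.corr()              # pairwise Pearson correlation by default
print(corr.round(1))
print(np.diag(corr))           # each column against itself
```

The matrix is symmetric, so the upper and lower triangles of the heatmap carry the same information. Columns still stored as text (such as 'Ca' and 'Thal' if they contain '?') may need converting or excluding first, depending on the pandas version.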