# <font color='SEAGREEN'>Day 2</font>
# <font color='MEDIUMSEAGREEN'>Visualizing the Data</font>
To get a better sense of our data, let's use some visualization techniques in Python.


Load the data in a dataframe called ``data``

In [None]:
# Your code

In [None]:
import matplotlib.pyplot as plt

### Histograms

In this cell we will learn how to plot histograms. Histograms are diagrams consisting of rectangles (bins) whose height is proportional to the frequency of a variable (i.e., number of times the variable appears in the data) and whose width is equal to the class interval (i.e., the range of values that fall within the bin). 
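
Before plotting, it can help to see the counting that a histogram performs. The sketch below is only an illustration: the ages in `toy_ages` are made up and are not taken from our dataset. It uses numpy's `histogram` function to count how many values fall in each bin. Note that describing 4 bins takes 5 bin edges.

```python
import numpy as np

# Hypothetical ages, invented for this illustration only.
toy_ages = np.array([29, 34, 35, 41, 44, 52, 54, 57, 58, 61])

# Five edges define four bins of width 10: [25,35), [35,45), [45,55), [55,65].
# The last bin also includes its right edge.
edges = np.arange(25, 70, 10)

# counts[i] is the number of ages that fall between edges[i] and edges[i+1].
counts, edges = np.histogram(toy_ages, bins=edges)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print('{0}-{1}: {2}'.format(left, right, count))
```

The height of each rectangle in the plotted histogram corresponds to one of these counts.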

In [None]:
import numpy as np
# Before we plot the histograms, let's check the age range for the patients to figure out what is a reasonable
# width for the bins and how many bins we should use
age = data.age
print('Oldest: {0}; Youngest: {1}; Age Range: {2}'.format(max(age),min(age),max(age)-min(age)))
# Based on the results, we don't need any bins below 29 years old or above 77 years old.
# We will set the bin width to one year to get a good resolution of the data. One-year bins between 29 and 77
# need 49 bin edges (48 bins), so we use the numpy linspace function to create an array of 49 evenly spaced numbers.
bins = np.linspace(29, 77, 49)
# Let's create the matplotlib figure where we will plot the histogram.
fig = plt.figure(figsize=(10,10))
# Plot the histogram. We use pyplot's hist function and set the opacity of the plot to 0.5 through the alpha argument.
plt.hist(age, bins, alpha=0.5, label='age of the patients')
# Label the plot.
plt.xlabel('Age')
plt.ylabel('Frequency')
# To make the plot look nicer, we set x-tick marks every 5 years. 
plt.xticks(np.arange(29, 77, 5.0))
# same for the y-axis.
plt.yticks(np.arange(0, 20, 2.0))
# Plot a legend so that we can match the color of the histogram to the data.
plt.legend(loc='upper right')
# Show the plot
plt.show()

What can you infer from the above plot?

In [None]:
# Your answer:

Decrease the number of bins and run the code again.

*Show your work to the instructor*

Repeat the same steps to plot the histogram for the "sex" column.

In [None]:
# Your code

### Box Plots

The box plot is a standardized way of displaying the distribution of data based on five key properties: minimum, first quartile, median, third quartile, and maximum. The median (Q2) of the data appears as a line inside a box, where the lower and upper limits of the box denote the first (Q1) and third quartile (Q3) of the data. Box plots may also have lines extending vertically from the boxes (whiskers) indicating variability outside the upper and lower quartiles. Data that fall outside of the whiskers (outliers) usually appear as individual data points in the box plot.

We will use the boxplot function from pyplot, which computes the inter-quartile range IQR = (Q3-Q1), sets the upper whisker to extend up to the last data point less than Q3 + 1.5 IQR and sets the lower whisker to extend down to the first data point greater than Q1 - 1.5 IQR. Beyond the whiskers, data are considered outliers and are plotted as individual points.
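
To make the whisker rule concrete, here is a minimal sketch that applies the rule described above by hand with numpy. The numbers in `values` are made up for illustration, and the variable names are our own; this is not the code that pyplot runs internally.

```python
import numpy as np

# Made-up data with one unusually large value.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

# Quartiles and inter-quartile range.
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Limits beyond which points are treated as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points that are still inside the limits.
inside = values[(values >= lower_fence) & (values <= upper_fence)]
lower_whisker, upper_whisker = inside.min(), inside.max()
outliers = values[(values < lower_fence) | (values > upper_fence)]

print('Q1: {0}; Median: {1}; Q3: {2}; IQR: {3}'.format(q1, median, q3, iqr))
print('Whiskers: {0} to {1}; Outliers: {2}'.format(lower_whisker, upper_whisker, outliers))
```

In this toy example the value 30 lies above Q3 + 1.5 IQR, so it would be drawn as an individual point rather than being covered by the upper whisker.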

Below, we plot a box plot for age. 

In [None]:
age_heart_diease = data.age[data.target==1]
age_not_heart_diease = data.age[data.target==0]
age = [age_heart_diease, age_not_heart_diease]
print('Oldest: {0}; Youngest: {1}; Age Range: {2}'.format(max(age_heart_diease),min(age_heart_diease),max(age_heart_diease)-min(age_heart_diease)))
print('Oldest: {0}; Youngest: {1}; Age Range: {2}'.format(max(age_not_heart_diease),min(age_not_heart_diease),max(age_not_heart_diease)-min(age_not_heart_diease)))
# multiple box plots on one figure
fig = plt.figure(figsize=(10,10))
plt.boxplot(age)
# Set y-axis to be between 20 and 80
plt.ylim(20,80)
# Set x-ticks to display labels
labels = ('Heart Disease','Healthy')
plt.xticks([1,2], labels)
plt.show()

Explain the code. What are the first few lines doing?

In [None]:
# Your answer:

### Scatter Plots

In this cell we will learn how to use scatter plots. Scatter plots are graphs in which the values of two variables are plotted along two axes; the pattern of the resulting points reveals any correlation present.
We will use the patients' resting blood pressure and cholesterol measurements.


In [None]:
# Your code
# to create a scatter plot using the pyplot scatter function
# Pick the person's resting blood pressure, save it into a variable named trestbps
# and the person's cholesterol measurement, save it into a variable named chol

# EXERCISE, FIND OUT HOW YOU CAN USE SCATTER PLOT

# HINT: plt.scatter()
# Use plt.xlabel, plt.ylabel, plt.xlim, plt.ylim and plt.title to make your plot more self-explanatory.


**What is correlation?**

Correlation is a statistical measure that shows whether and how strongly pairs of variables are related.

When the y variable tends to increase as the x variable increases, we say there is a positive correlation between the variables. Also, when the y variable tends to decrease as the x variable increases, we say there is a negative correlation between the variables. Finally, if there is no clear relationship between the two variables, we say there is no correlation between the two variables.

<img src="images/corr.png">

If there is a very strong positive or negative correlation between two features, one of the features is largely redundant. In other words, having that feature adds little information to our data and we may be able to eliminate it.

Based on your scatter plot, are those features correlated?

In [None]:
# Your answer:

Draw the scatter plot for ("resting blood pressure" vs "maximum heart rate") and ("maximum heart rate" vs "cholesterol measurement"), and see if there is a correlation between the features.

*Show your work to the Instructor*

In [None]:
# Your code:

We can compute the correlation using the numpy corrcoef function, which computes the Pearson correlation coefficient. The result is a number that lies between -1 and 1. A value of -1 means that the variables have a perfect negative linear relationship, 1 means that they have a perfect positive linear relationship, and 0 means that the variables are not linearly correlated with each other.

Note that the Pearson correlation is a linear correlation measure (it only tells you whether a line in a scatter plot captures the relationship between the variables) and does not provide information on any nonlinear relationship that might exist between the two variables.
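
The sketch below illustrates these points on synthetic data (the arrays are generated here and are not from our dataset). Two perfectly linear relationships give coefficients of 1 and -1, while a parabola, which is a clear but nonlinear relationship, gives a coefficient close to 0.

```python
import numpy as np

# Synthetic x values, symmetric around zero.
x = np.linspace(-1, 1, 201)

# corrcoef returns a 2x2 matrix; entry [0, 1] is the correlation between the two inputs.
r_positive = np.corrcoef(x, 2 * x + 1)[0, 1]  # increasing straight line
r_negative = np.corrcoef(x, -3 * x)[0, 1]     # decreasing straight line
r_parabola = np.corrcoef(x, x ** 2)[0, 1]     # y depends on x, but not linearly

print('Increasing line: {0:.3f}'.format(r_positive))
print('Decreasing line: {0:.3f}'.format(r_negative))
print('Parabola: {0:.3f}'.format(r_parabola))
```

A coefficient near 0 therefore does not prove that two variables are unrelated; looking at the scatter plot is still worthwhile.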

In [None]:
# corrcoef returns a matrix of Pearson product-moment correlation coefficients. The correlation between the two
# variables is given by the off-diagonal entries of the matrix.
Pearson_corr = np.corrcoef(trestbps, chol)
bp_chol_corr = Pearson_corr[0,1] # or equivalently Pearson_corr[1,0]
print('The correlation between "Resting Blood Pressure" and "Cholesterol Measurement" is {0}'.format(bp_chol_corr))

Find the correlation for the other two pairs of features you plotted.

In [None]:
# Your code

### Visualizing Categorical (Discrete) Variables

#### Pie Charts

You have probably seen and used pie charts before. They are great for visualizing proportions. Below we use a pie chart to visualize the chest pain experienced by the patients.

As you learned yesterday, these are categorical variables, meaning that the variable can take on one of a limited, and usually fixed number of possible values, assigning each entry in the data to a particular group.

We use the ``cp`` feature from the data, which contains encoded information about the experienced chest pain. The entries in ``cp`` are either 0, 1, 2 or 3:

1. Value 0: typical angina
2. Value 1: atypical angina
3. Value 2: non-anginal pain
4. Value 3: asymptomatic
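
As a side note, pandas can count the entries of a categorical column directly. The sketch below is only an illustration on made-up codes: `toy_cp`, `cp_names`, `counts` and `named_counts` are hypothetical names, and the seven values are not from our dataset. It uses `value_counts` to count each code and then replaces the codes with the names listed above.

```python
import pandas as pd

# Hypothetical chest pain codes for seven made-up patients.
toy_cp = pd.Series([0, 2, 1, 0, 3, 2, 2])

# Mapping from the encoded values to their meaning, as listed above.
cp_names = {0: 'typical angina', 1: 'atypical angina',
            2: 'non-anginal pain', 3: 'asymptomatic'}

# Count how often each code appears and sort the result by code.
counts = toy_cp.value_counts().sort_index()

# Replace the numeric codes in the index with readable names.
named_counts = counts.rename(index=cp_names)
print(named_counts)
```

The cell below obtains the same kind of counts step by step with the len function.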

In [None]:
cp_heart_diease = data.cp[data.target==1]
# How many in each category? Use the len function to obtain the length of an array.
cp_0 = len(cp_heart_diease[cp_heart_diease==0])
cp_1 = len(cp_heart_diease[cp_heart_diease==1])
cp_2 = len(cp_heart_diease[cp_heart_diease==2])
cp_3 = len(cp_heart_diease[cp_heart_diease==3])

# Pie chart, where the slices will be ordered and plotted counter-clockwise:
labels = 'typical angina','atypical angina','non-anginal pain','asymptomatic' 
cp_sizes = [cp_0, cp_1, cp_2, cp_3]
# We can also pass an array called "explode" which determines how much we want to highlight one of the pie slices.
explode = (0, 0, 0, 0.1)  # only "explode" the 4th slice.
# Set the plot font size to 22
plt.rcParams.update({'font.size': 22})
# Create the figure that will hold the pie chart.
fig = plt.figure(figsize=(30,10))
# Give the chart a title.
plt.title('Experienced Chest Pain')
plt.pie(cp_sizes, explode=explode, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90)
plt.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()

Create the pie chart for the healthy persons.

In [None]:
# Your code

What can you infer from the pie charts?

In [None]:
# Your answer

#### Bar Plots

Another great way to visualize categorical variables is to use bar plots. Below, we plot the chest pain of patients with heart disease and ones without heart disease. To do this, we again use the feature ``cp``.

In [None]:

cp_heart_diease = data.cp[data.target==1]
cp_healthy = data.cp[data.target==0]

cp_heart_diease_count = []
cp_healthy_count = []
cp_heart_diease_count.append(len(cp_heart_diease[cp_heart_diease==0]))
cp_heart_diease_count.append(len(cp_heart_diease[cp_heart_diease==1]))
cp_heart_diease_count.append(len(cp_heart_diease[cp_heart_diease==2]))
cp_heart_diease_count.append(len(cp_heart_diease[cp_heart_diease==3]))

cp_healthy_count.append(len(cp_healthy[cp_healthy==0]))
cp_healthy_count.append(len(cp_healthy[cp_healthy==1]))
cp_healthy_count.append(len(cp_healthy[cp_healthy==2]))
cp_healthy_count.append(len(cp_healthy[cp_healthy==3]))
    

# We need to define the position of the bars. Since we have 4 levels of chest pain, we will set the position of the
# bars to be at 0, 1, 2 and 3 on the x-axis. To make the two bars fit nicely, we set the width of the bars to 0.35
bar_pos = np.arange(4)
bar_width = 0.35
 
# create plot
fig = plt.figure(figsize=(30,10))

rects1 = plt.bar(bar_pos, cp_heart_diease_count, bar_width, color='b', label='Heart Disease') 
rects2 = plt.bar(bar_pos + bar_width, cp_healthy_count, bar_width,color='g', label='Healthy') 
plt.title('Experienced Chest Pain')

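# Chest pain labels for the x-ticks, in the same order as the cp codes 0 to 3.
labels = ('typical angina', 'atypical angina', 'non-anginal pain', 'asymptomatic')
plt.xticks(bar_pos + bar_width/2, labels)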
plt.legend()
plt.tight_layout()

plt.show()

Modify the above code to use "for loops" for this part:
```python
cp_heart_diease_count.append(len(cp_heart_diease[cp_heart_diease==0]))
cp_heart_diease_count.append(len(cp_heart_diease[cp_heart_diease==1]))
cp_heart_diease_count.append(len(cp_heart_diease[cp_heart_diease==2]))
cp_heart_diease_count.append(len(cp_heart_diease[cp_heart_diease==3]))

cp_healthy_count.append(len(cp_healthy[cp_healthy==0]))
cp_healthy_count.append(len(cp_healthy[cp_healthy==1]))
cp_healthy_count.append(len(cp_healthy[cp_healthy==2]))
cp_healthy_count.append(len(cp_healthy[cp_healthy==3]))
```

*Show your work to the instructor*

## Exploring more about the dataset


Let's split the labels from the features.

In [None]:
X = data.drop('target', axis=1)
print(X.head())
y = data['target']
print(y.head())

What do you think the drop function does?

In [None]:
# Your answer:

### Exporting the data
Now we'd like to export this, so that we can use it in the future:

In [None]:
X.to_csv("X.csv", index=False)
y.to_csv("y.csv", index=False)
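
To use an exported file later, it can be read back with `pd.read_csv`. The sketch below is a small round-trip check on a made-up table: `toy`, `path` and `restored` are hypothetical names, and the file is written to a temporary folder so that nothing in your working directory is touched.

```python
import os
import tempfile

import pandas as pd

# A tiny made-up table standing in for the real features.
toy = pd.DataFrame({'age': [63, 37, 41], 'chol': [233, 250, 204]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy.csv')
    # index=False keeps the row numbers out of the file.
    toy.to_csv(path, index=False)
    # Read the file back into a new dataframe.
    restored = pd.read_csv(path)

print(restored.equals(toy))
```

Writing with `index=False` means the row index is not stored as an extra column, so the table that comes back has the same columns as the one that was saved.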

### Learn More about Visualization
**Exercise:** Write code that counts the patients with heart disease and the healthy persons.

In [None]:
# Your code

**Exercise:** Show the histogram of the "target" column.

In [None]:
# Your code

You can show two histograms at the same time in one plot. To do so, call `plt.hist` once for each histogram:
```python
plt.hist(age1, bins, alpha=0.5, label='age1')
plt.hist(age2, bins, alpha=0.5, label='age2')
```
**Exercise:** Rewrite the histogram code so that it shows the histogram for healthy persons as well as the heart disease patients in one plot.

In [None]:
# Your code

Learn how you can use subplots. 
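
As a starting point, here is a minimal sketch of how subplots can be laid out with `plt.subplots`. The numbers and labels are made up purely to show the layout, and the names `sizes_left`, `sizes_right` and `slice_labels` are our own.

```python
import matplotlib.pyplot as plt

# Made-up category sizes, used only to illustrate the layout.
sizes_left = [40, 35, 25]
sizes_right = [10, 60, 30]
slice_labels = ['A', 'B', 'C']

# One row and two columns: plt.subplots returns the figure and an array of axes.
fig, axes = plt.subplots(1, 2, figsize=(12, 6))

# Each axes object has its own plotting methods and its own title.
axes[0].pie(sizes_left, labels=slice_labels, autopct='%1.1f%%')
axes[0].set_title('Group 1')
axes[1].pie(sizes_right, labels=slice_labels, autopct='%1.1f%%')
axes[1].set_title('Group 2')

plt.show()
```

Note that with subplots you call the plotting functions on the individual axes (for example `axes[0].pie`) rather than on `plt`.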

**Exercise:** Show the pie charts you already have in a unified plot, each in one subplot.

In [None]:
# Your code

In [None]:
print("Nice work today!")

References:

- Princeton AI4ALL Fragile Families Project
- Khan Academy