# Work-Along Notebook for:
## [Understand the Data With Univariate And Multivariate Charts and Plots in Python by Rashida Nasrin Sucky](https://towardsdatascience.com/understand-the-data-with-univariate-and-multivariate-charts-and-plots-in-python-3b9fcd68cd8) 


This notebook is designed to be used as you read through the article.  

**Note:** This is not an exact one-to-one walkthrough.  You will find extra questions, extra Python methods, and extra plots.  The goal is to help you learn how to ask questions of your data so that you can discover correlations and patterns.  Also, this is NOT the only way to explore your dataset.

**Chart Context:** The [Heart dataset](https://www.kaggle.com/johnsmith88/heart-disease-dataset) is also part of the GitHub folder.  Be sure to explore the context provided with the dataset on Kaggle.

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from statsmodels import api as sm
import numpy as np

df = pd.read_csv("heart.csv")

In [2]:
df

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,52,1,0,125,212,0,1,168,0,1.0,2,2,3,0
1,53,1,0,140,203,1,0,155,1,3.1,0,0,3,0
2,70,1,0,145,174,0,1,125,1,2.6,0,0,3,0
3,61,1,0,148,203,0,1,161,0,0.0,2,1,3,0
4,62,0,0,138,294,1,1,106,0,1.9,1,3,2,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1020,59,1,1,140,221,0,1,164,1,0.0,2,0,2,1
1021,60,1,0,125,258,0,0,141,1,2.8,1,1,3,0
1022,47,1,0,110,275,0,0,118,1,1.0,1,1,2,0
1023,50,0,0,110,254,0,0,159,0,0.0,2,0,2,1


In [3]:
df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,52,1,0,125,212,0,1,168,0,1.0,2,2,3,0
1,53,1,0,140,203,1,0,155,1,3.1,0,0,3,0
2,70,1,0,145,174,0,1,125,1,2.6,0,0,3,0
3,61,1,0,148,203,0,1,161,0,0.0,2,1,3,0
4,62,0,0,138,294,1,1,106,0,1.9,1,3,2,0


##### Question: (Answer to yourself)
Have you looked at the context behind the data on Kaggle? (link in top box)
Do you know or have an understanding of each item being studied?
If not, have you done some basic research yet?

In [None]:
# The author of the article has already devised a few questions for us to work through.
# As we continue, you will be asked questions about other items within the data.
# This is to help you experience the question/answer phase, and help you better explore your data

### A Quick Note!

The data in this article has already been cleaned up. 
The exact methods have not been disclosed.  In this notebook, we cleaned the data in a way we felt worked best.  However, the final outcome is 302 records instead of 303.  Please be aware of this slight discrepancy when checking your answers against the walkthrough.  Your values might differ slightly, but the exploration process will still be of value to you.

Data cleaning includes removing duplicate records, updating values, reformatting column names, etc.  We will cover this topic in more depth in the next lesson.  In the next few blocks, you will see a sneak peek of the cleaning process.  

In [4]:
# Print the column names  
df.columns

Index(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
       'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target'],
      dtype='object')

In [5]:
# Note the difference between your column names and the names from the article
# Since the names are strings, let's use a string method to make them match

df.columns = df.columns.str.capitalize()
df.columns

Index(['Age', 'Sex', 'Cp', 'Trestbps', 'Chol', 'Fbs', 'Restecg', 'Thalach',
       'Exang', 'Oldpeak', 'Slope', 'Ca', 'Thal', 'Target'],
      dtype='object')
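
A small aside on `str.capitalize()`: it upper-cases the first character and lower-cases everything after it.  The sketch below uses made-up column names (they are not from the heart dataset) to show why a name such as `restECG` would come out as `Restecg`, and why a separate rename step can still be useful afterwards.

```python
import pandas as pd

# Toy column names, invented for illustration only.
cols = pd.Index(["age", "restECG", "CHOL"])

# capitalize() upper-cases the first letter and lower-cases the rest.
capitalized = cols.str.capitalize()
print(list(capitalized))
```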

In [8]:
# Renaming columns for easier comprehension
df = df.rename(columns={"Cp": "ChestPain", "Trestbps": "RestBP", "Restecg": "RestECG", "Exang":"ExAng"})
df.columns

Index(['Age', 'Sex', 'ChestPain', 'RestBP', 'Chol', 'Fbs', 'RestECG',
       'Thalach', 'ExAng', 'Oldpeak', 'Slope', 'Ca', 'Thal', 'Target'],
      dtype='object')

In [9]:
# Checking initial size of dataframe
df.shape

(1025, 14)

In [10]:
# Checking for duplicates
# The printed shape is (rows, columns), so the first number is the duplicate count
duplicate_rows_df = df[df.duplicated()]
print("Number of duplicated rows: ", duplicate_rows_df.shape)

Number of duplicated rows:  (723, 14)


In [11]:
# Dropping the duplicates then rechecking the shape
df = df.drop_duplicates()
df.shape

# Note: our final count is 302.
# The final count in the article is 303.
# This small discrepancy should not affect your values enough to prevent you from walking through the article.

(302, 14)
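
If the numbers above look surprising, it may help to see how `duplicated()` counts.  By default it flags every repeat *after* the first occurrence, so the rows kept by `drop_duplicates()` plus the flagged rows add up to the original total (here, 302 + 723 = 1025).  The sketch below uses a tiny made-up frame, not the heart data.

```python
import pandas as pd

# Toy data: the first three rows are identical.
toy = pd.DataFrame({"age": [52, 52, 52, 61], "chol": [212, 212, 212, 203]})

dup_mask = toy.duplicated()         # True only for repeats after the first copy
n_duplicates = int(dup_mask.sum())  # number of rows that would be dropped
deduped = toy.drop_duplicates()     # keeps the first copy of each row

print(dup_mask.tolist())
print(n_duplicates, len(deduped), len(toy))
```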

In [12]:
# Updating the values in "Sex" where 0 = "Female" and 1 = "Male"

df = df.replace({"Sex": {0: "Female", 1: "Male"}})

In [13]:
# Updating the values in the "Thal" column to reflect what each value means.

df = df.replace({"Thal": {1: "Normal", 2: "Fixed", 3: "Reversable", 0: "Missing"}})

In [14]:
# Updating the values in the "ChestPain" column to reflect what each number means

df = df.replace({"ChestPain" : {0 : "asymptomatic", 1 : "nonanginal", 2 : "nontypical", 3 : "typical"}})
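
The nested dictionary passed to `replace()` in the cells above has the form `{column: {old: new}}`, which limits the replacement to the named column.  Here is a minimal sketch on a made-up frame (the values are illustrative only) showing that other columns holding the same numbers are left alone.

```python
import pandas as pd

# Toy frame: both columns contain 0s and 1s.
toy = pd.DataFrame({"Sex": [0, 1, 1], "Fbs": [1, 0, 1]})

# Only the "Sex" column is recoded; "Fbs" keeps its 0/1 values.
recoded = toy.replace({"Sex": {0: "Female", 1: "Male"}})
print(recoded)
```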

In [15]:
# Testing the replaced values
df.head()



# Here ends the data cleaning section

Unnamed: 0,Age,Sex,ChestPain,RestBP,Chol,Fbs,RestECG,Thalach,ExAng,Oldpeak,Slope,Ca,Thal,Target
0,52,Male,asymptomatic,125,212,0,1,168,0,1.0,2,2,Reversable,0
1,53,Male,asymptomatic,140,203,1,0,155,1,3.1,0,0,Reversable,0
2,70,Male,asymptomatic,145,174,0,1,125,1,2.6,0,0,Reversable,0
3,61,Male,asymptomatic,148,203,0,1,161,0,0.0,2,1,Reversable,0
4,62,Female,asymptomatic,138,294,1,1,106,0,1.9,1,3,Fixed,0


### Solve Some Questions

### Question 1:  Find the population proportions with different types of blood disorders

**Part A:** In this example we are going to use the "Thal" column.
Follow along with the article for syntax and explanations of the output.

*Remember:* with duplicated rows dropped, our final record count is 302 and the author's is 303.
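
Before working with the real column, here is a rough sketch of the counts → proportions → percentages pattern on a tiny made-up Series.  The values and the name `thal_toy` are invented for illustration, so your real output will look different.

```python
import pandas as pd

# Toy stand-in for a categorical column (not the real data).
thal_toy = pd.Series(["Fixed", "Reversable", "Fixed", "Normal", "Fixed"], name="Thal")

counts = thal_toy.value_counts()      # how many times each category appears
proportions = counts / counts.sum()   # share of the total, between 0 and 1
percentages = proportions * 100       # the same shares as percentages

# value_counts can also return the proportions directly.
shortcut = thal_toy.value_counts(normalize=True)

print(counts)
print(percentages)
```

Dividing by `counts.sum()` rather than a hard-coded total keeps the calculation correct whether you have 302 rows or 303.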

In [None]:
# Definitions and Question:

# Define 'value_counts()':

# Why is this method used as part of initializing a new variable?


In [None]:
# Define a variable for the value counts of the Thal column.
x = df.Thal.value_counts()
x

# Note: In the dataset, some values were recorded as 0 rather than left missing.
# These values have been replaced with "Missing".
# Having them filled in already will not affect your walkthrough at this point.

In [None]:
# Calculate the population proportion here:



In [None]:
# Turn those proportions into percentages:



**Part B:**  Let's examine the proportions of other populations within this dataset.

Let's look at "ExAng" or, as the provided context informs us, exercise induced angina.

We are going to follow the steps provided with the "Thal" values.

In [None]:
# Define a variable for the value counts of the ExAng column.

# y = 
# y

In [None]:
# Calculate the population proportions:

# Look at the context to see what 1 and 0 mean.

In [None]:
# Turn those proportions into percentages:



**Part C:**  Select one of the other populations to examine.

In [None]:
# Define a variable for the value counts of your column.

# z = 
# z

In [None]:
# Calculate the population proportions:

# Does the context provide you with any information about the values?

In [None]:
# Turn those proportions into percentages:

### Question 2: Find the minimum, maximum, average, and standard deviation of Cholesterol data.

Remember: Our total is 302, not 303.

In [None]:
# using only the Cholesterol column in the dataset, apply the describe function.

df["Chol"].describe()
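
To help you read the `describe()` output, here is a small sketch on four made-up numbers.  It shows which labels hold the minimum, maximum, average, and standard deviation; note that pandas reports the *sample* standard deviation, the same value `Series.std()` returns by default.

```python
import pandas as pd

# Toy values, invented for illustration.
chol_toy = pd.Series([200, 220, 240, 260], name="Chol")

summary = chol_toy.describe()
print(summary)

# Individual statistics can be pulled out by label.
print(summary["min"], summary["max"], summary["mean"], summary["std"])
```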

### Question 3: Make a plot of the distribution of the Cholesterol data.

**Part A:** Plot the Cholesterol data

In [None]:
# Plot the distribution here:

sns.displot(df.Chol.dropna())

# Why is this a "displot" and not a "distplot"?
# According to the seaborn (sns) documentation,
#  "distplot" is a deprecated function and will be removed in a future version.
# So let's explore other options in this learning space.

# You could also try sns.histplot(df.Chol.dropna())

In [None]:
# Do you see any outliers?

In [None]:
# Want the histogram with a kernel density estimate (KDE) curve on the same plot?

# Here is the syntax:
sns.histplot(data=df, x="Chol", kde=True)

[Documentation on the histplot](https://seaborn.pydata.org/generated/seaborn.histplot.html) to better understand the syntax.
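
If you would like a number to back up what your eyes tell you about outliers, one common rule of thumb (our suggestion, not something taken from the article) is the 1.5 × IQR rule, the same rule a box plot uses to draw its whiskers.  The sketch below applies it to made-up values.

```python
import pandas as pd

# Toy cholesterol-like values with one extreme reading (invented data).
chol_toy = pd.Series([180, 200, 210, 220, 230, 240, 250, 560])

q1 = chol_toy.quantile(0.25)
q3 = chol_toy.quantile(0.75)
iqr = q3 - q1                      # interquartile range

lower = q1 - 1.5 * iqr             # anything below this is flagged
upper = q3 + 1.5 * iqr             # anything above this is flagged

outliers = chol_toy[(chol_toy < lower) | (chol_toy > upper)]
print(lower, upper)
print(outliers.tolist())
```

A flagged value is only a candidate for a closer look; whether it is an error or a genuine reading depends on the context of the data.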

**Part B:** Describe and Plot the Resting Blood Pressure values

In [None]:
# code the describe() for RestBP


In [None]:
# Let's plot the resting blood pressure, "RestBP".
# You can choose to plot using "histplot" or "displot".  Your choice.
sns.displot(df.RestBP.dropna())

In [None]:
# Any outliers?

**Part C:** Describe and Plot the values you selected in Question 1, Part C.

In [None]:
# Plot Part C here: 


### Question 4: Find the mean of the RestBP and calculate the population proportion of people whose RestBP is higher than the mean RestBP.

In [None]:
# Calculate the mean of the RestBP and print it



In [None]:
# Question: Is the .dropna() significant or necessary for calculating the mean_rbp?

In [None]:
# Calculate population proportion of people who have higher RestBP than the average


# Questions: 

# 1. Translate this line of code into plain English.

# 2.  Break this line and examine your results.  What did you change?  What happened?
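
While you experiment with Question 4, the following sketch may help.  It uses made-up readings (including one missing value) to show two things: pandas skips `NaN` by default when it computes a mean, and comparing a Series against a number gives a column of `True`/`False` values that can be counted.  The names `rbp_toy`, `above`, and `prop` are ours, not the article's.

```python
import numpy as np
import pandas as pd

# Toy readings with one missing value (invented data).
rbp_toy = pd.Series([110, 120, 130, 140, 150, np.nan], name="RestBP")

mean_rbp = rbp_toy.mean()     # NaN is skipped by default
above = rbp_toy > mean_rbp    # True/False per row; NaN compares as False

# Proportion of the non-missing readings that sit above the mean.
prop = above.sum() / rbp_toy.notna().sum()

print(mean_rbp)
print(above.tolist())
print(prop)
```

Dividing by the count of non-missing values keeps a missing reading from being treated as "not above the mean".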


### Question 5: Plot the Cholesterol data against the age group

In [None]:
df["agegrp"]=pd.cut(df.Age, [29,40,50,60,70,80])
plt.figure(figsize=(12,5))
sns.boxplot(x = "agegrp", y = "Chol", data=df)

In [None]:
# Questions for you to answer:

# Looking at the code that was used to create the box plot, 
#   explain what the following code means in your own words:

# 1. pandas.cut():

# 2. Looking at the cut() method signature, what is a "bin": 

# 3. plt.figure():

# 3 A - in the code above, what do the numbers in figsize mean?  What happens if you change them?

# 4. sns.boxplot(x = ??, y = ??, data= ??):
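
As a hint for the `pandas.cut()` questions, here is a short sketch on a handful of made-up ages using the same bin edges as the cell above.  By default each bin is open on the left and closed on the right, so `(29, 40]` includes 40 but not 29, and an age of exactly 29 would end up without a group unless `include_lowest=True` is passed.

```python
import pandas as pd

# Toy ages, invented for illustration.
ages = pd.Series([29, 35, 40, 41, 77])
bins = [29, 40, 50, 60, 70, 80]

groups = pd.cut(ages, bins)                        # default: right-closed bins
groups_incl = pd.cut(ages, bins, include_lowest=True)

print(groups.tolist())
print(int(groups.isna().sum()), int(groups_incl.isna().sum()))
```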


In [None]:
# Combine sex and cholesterol data here:




# Why did the author run this: df["Sex1"] = df.Sex.replace({1: "Male", 0: "Female"})

# What does that line mean, in your own words:


# There is a new argument in the boxplot function.  Look up what "hue" means.

In [None]:
# How does the sns.boxplot(x = "Sex1", y = "Age", data=df) differ from the one in the cell above?

# What information does this provide you?

### Question 6:  Make a chart to show the number of people having each type of chest pain

In [None]:
df.groupby("agegrp")["ChestPain"].value_counts().unstack()

In [None]:
# Run the following line of code:

#df.groupby("agegrp")["ChestPain"].value_counts()

# Question:  
# 1. How does the output vary between these two lines?  
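
To compare the two outputs on something small, here is a sketch with a made-up frame (the group labels and rows are invented).  Without `unstack()` the counts come back as one long column indexed by both group and category; `unstack()` pivots the inner index level into columns, leaving `NaN` wherever a combination never occurs.

```python
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({
    "agegrp": ["30s", "30s", "30s", "40s", "40s"],
    "ChestPain": ["typical", "typical", "asymptomatic", "typical", "nontypical"],
})

long_counts = toy.groupby("agegrp")["ChestPain"].value_counts()  # long format
wide_counts = long_counts.unstack()                              # wide format

print(long_counts)
print(wide_counts)
```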

### Question 7: Add on to your chart, but separate the counts by sex.

In [None]:
dx = df.dropna().groupby(["agegrp", "Sex"])["ChestPain"].value_counts().unstack()
print(dx)

### Question 8:  Present population proportion for each type of chest pain

In [None]:
dx = dx.apply(lambda x: x/x.sum(), axis=1)
print(dx)

**Note on your output:** As we made this notebook, we discovered that the article's dataset had been cleaned before we used it.  The methods were not disclosed, so we cleaned the data ourselves, which could explain any discrepancies in this chart.

In [None]:
# Questions: In your own words:

# 1. What does the apply() function do?


# 2. What is lambda? 
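
Once you have written your own answers, this sketch may help you check them.  It runs the same `apply(lambda ...)` pattern on a tiny made-up table of counts: with `axis=1`, the lambda receives one row at a time and divides each entry by that row's total, so every row sums to 1.  The `div()` line is an alternative we added for comparison; it is not from the article.

```python
import pandas as pd

# Toy table of counts, invented for illustration.
counts = pd.DataFrame(
    {"asymptomatic": [3, 1], "typical": [1, 3]},
    index=["30s", "40s"],
)

# axis=1: the lambda is called once per row, with x being that row.
row_props = counts.apply(lambda x: x / x.sum(), axis=1)

# Same result without a lambda: divide each row by its own total.
same_result = counts.div(counts.sum(axis=1), axis=0)

print(row_props)
```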