<a href="https://colab.research.google.com/github/kytk/AI-MAILs/blob/main/python_2_pandas_seaborn_en.ipynb?hl=en" target="_blank"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Python for Healthcare Professionals: Pandas and Seaborn

Kiyotaka Nemoto (Department of Psychiatry, Faculty of Medicine, University of Tsukuba)

Ver.20240703

## Reference Materials
- [Pandas Official Documentation (English)](https://pandas.pydata.org/docs/index.html)
- [Seaborn Official Documentation (English)](https://seaborn.pydata.org/tutorial/introduction)

## Data Used Today
- Diabetes Dataset https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html
- The column names of the distributed file were edited and split into two Excel files
- diabetes_demographics.xlsx: Age, Sex, BMI, Average Blood Pressure
- diabetes_data.xlsx: T-Cho, LDL, HDL, T-Cho/HDL, Log of TG, Blood Sugar, Y (Progression over 1 year)

## Goals of This Section
- Be able to read Excel files using Pandas
- Be able to extract specific columns or rows using Pandas
- Be able to merge two files using Pandas
- Be able to create various graphs using Seaborn

## Content
1. Overview of Pandas and Seaborn
2. Basics of Pandas
   - Data loading
   - Data display
   - Data manipulation
   - Data merging
   - Handling missing values
3. Basics of Seaborn
   - Basic graphs
   - Customization
4. Quiz


## 1. Overview of Pandas and Seaborn
- Pandas
    - A tool for data analysis and manipulation in Python
    - "Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language."
- Seaborn
    - A tool for easily creating graphs from statistical data in Python
    - "Seaborn is a library for making statistical graphics in Python."

### 1.0. Installing Pandas and Seaborn (not necessary this time)
- When setting up your own Python environment, install Pandas and Seaborn with the following:

    ```
    pip install pandas seaborn
    ```

### 1.1. Importing Pandas, Seaborn, and os
- pandas is often imported as pd
- seaborn is often imported as sns
- We will also import the os module to check files


In [None]:
# pandas is often imported as pd
import pandas as pd

# seaborn is often imported as sns (seaborn name space)
import seaborn as sns

# Import the os module to check files
import os

### 1.2. Loading Data into Google Colab
- When working in Google Colab, you usually drag & drop data under "Files" on the left
- You can then access it with 'filename'
- Now, we will download and use the data
- Executing the cell below will download three xlsx files
- **Note**: Files in a Google Colab session are deleted when the runtime disconnects or times out. To keep your results, download them periodically

<img src="https://www.nemotos.net/nb/img/colabo_files.png" width=300>

In [None]:
# Download the data we'll use today
# File names are diabetes_demographics.xlsx, diabetes_data.xlsx, diabetes_data_short.xlsx
#
# (Note: This is not essential for this lecture, so don't worry if you don't understand it)
# ! is used when you want to execute a program in the shell from Python
# [[ -f diabetes_demographics.xlsx ]] is a shell test that asks "Is there a file called diabetes_demographics.xlsx?"
# || means "run the command on the right only if the test on the left fails"
# wget is a Linux program for downloading

![[ -f diabetes_demographics.xlsx ]] || wget https://raw.githubusercontent.com/kytk/AI-MAILs/main/data/diabetes_demographics.xlsx
![[ -f diabetes_data.xlsx ]] || wget https://raw.githubusercontent.com/kytk/AI-MAILs/main/data/diabetes_data.xlsx
![[ -f diabetes_data_short.xlsx ]] || wget https://raw.githubusercontent.com/kytk/AI-MAILs/main/data/diabetes_data_short.xlsx
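
The same "download only if the file is missing" logic can also be written in plain Python. The following is a minimal sketch, not the notebook's own method: `download_if_missing` is a hypothetical helper name, and the demonstration uses an empty placeholder file so that nothing is actually fetched.

```python
import os
import tempfile
import urllib.request


def download_if_missing(url, filename):
    """Download url to filename unless the file already exists.

    Returns True if a download happened, False if the file was already there.
    (download_if_missing is a hypothetical helper, not part of the notebook.)
    """
    if os.path.isfile(filename):               # same test as [[ -f filename ]]
        return False
    urllib.request.urlretrieve(url, filename)  # plays the role of wget
    return True


# Demonstration without network access: the file exists, so nothing is fetched
with tempfile.TemporaryDirectory() as tmpdir:
    existing = os.path.join(tmpdir, 'diabetes_data.xlsx')
    open(existing, 'wb').close()               # create an empty placeholder file
    downloaded = download_if_missing(
        'https://raw.githubusercontent.com/kytk/AI-MAILs/main/data/diabetes_data.xlsx',
        existing)
    print(downloaded)
```

Calling the helper with a file name that does not exist yet would download it, which requires an internet connection.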


In [None]:
# You can display a list of files under the current directory using the listdir() function in the os module
# Confirm that the following three files are present:
#   diabetes_demographics.xlsx
#   diabetes_data.xlsx
#   diabetes_data_short.xlsx

os.listdir()

## 2. Basics of Pandas

### 2.1. Examples of What Pandas Can Do
- Handle tables
- Output descriptive statistics of tables
- Extract parts of tables
- Merge multiple tables
- Generate new columns from information in multiple columns
- Pandas DataFrames can be used directly for graph creation in Seaborn

### 2.2. Necessity of Pandas: Data Preprocessing and Cleaning
- Data preprocessing is an important step in data analysis
- Using Pandas, you can handle missing values, convert data types, remove duplicate data, etc.

### 2.3. Pandas Terminology: "DataFrame"
- Think of it as a general table
- In Pandas, a single table is called a "DataFrame"
- Horizontal lines of the table are rows; vertical lines are columns
- A DataFrame is often assigned to a variable named 'df', short for DataFrame

<img src="https://www.nemotos.net/nb/img/pandas_01.png" width=400>
Figure: Quoted from the official Pandas documentation

### 2.4. Loading Data into Pandas
- Pandas can read Excel files and CSV files
- Excel files can be read with `pd.read_excel('excel_file')`
- CSV files can be read with `pd.read_csv('csv_file')`
- Specifying IDs etc. as an index makes handling easier

- diabetes_demographics.xlsx

    <img src="https://www.nemotos.net/nb/img/diabetes_demographics_screenshot.png" width=300>

- diabetes_data.xlsx

    <img src="https://www.nemotos.net/nb/img/diabetes_data_screenshot.png" width=450>

In [None]:
# Read diabetes_demographics.xlsx as df_demographics. Set the 0th column as the index column
df_demographics = pd.read_excel('diabetes_demographics.xlsx',index_col=0)

# Read diabetes_data.xlsx as df_data. Set the 0th column as the index column
df_data = pd.read_excel('diabetes_data.xlsx',index_col=0)

# Read diabetes_data_short.xlsx as df_data_short. Set the 0th column as the index column
df_data_short = pd.read_excel('diabetes_data_short.xlsx',index_col=0)
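
To see what `index_col=0` changes without needing the Excel files, here is a small sketch that reads the same kind of table from CSV text. The three rows are made-up illustrative values, and `read_csv` is used only because it can read from a string; `index_col` works the same way in `read_excel`.

```python
import io
import pandas as pd

# Hypothetical three-row table standing in for diabetes_demographics.xlsx
csv_text = """ID,AGE,SEX,BMI
sub001,59,2,32.1
sub002,48,1,21.6
sub003,72,2,30.5
"""

# Without index_col, ID is an ordinary column and the index is 0, 1, 2
df_plain = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column (ID) becomes the row labels
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(df_plain.columns.tolist())         # ID is listed among the columns
print(df_indexed.columns.tolist())       # ID is no longer a column
print(df_indexed.loc['sub002', 'AGE'])   # rows can now be looked up by ID
```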

#### Checking the Overview of the DataFrame
- You can find out what information is in the DataFrame using the info() method of the pandas DataFrame

In [None]:
# Get an overview of df_demographics
df_demographics.info()

#### Handling Categorical Data
- Currently, SEX is int64, which is an integer type.
- There is a way to change this to categorical data as follows:
    ```
    # Data type conversion
    df_demographics['SEX'] = df_demographics['SEX'].astype('category')
    ```
- However, in this case, it tends to cause errors later, so we intentionally do not change it here
- As a result, the descriptive statistics that follow treat SEX as a number, so interpret its mean and standard deviation with caution
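
The effect of the conversion on descriptive statistics can be checked on a small example. This is a sketch with made-up values (`df_toy` is not part of the lecture data); it only assumes that SEX is coded 1 or 2 as in the notebook.

```python
import pandas as pd

# Hypothetical data; SEX is coded 1 or 2 as in the notebook
df_toy = pd.DataFrame({'AGE': [59, 48, 72, 24],
                       'SEX': [2, 1, 2, 1]})

# As int64, SEX is treated as a number, so describe() reports its mean
print(df_toy.describe().columns.tolist())

# After conversion, describe() skips SEX because it is no longer numeric
df_cat = df_toy.copy()
df_cat['SEX'] = df_cat['SEX'].astype('category')
print(df_cat['SEX'].dtype)
print(df_cat.describe().columns.tolist())

# value_counts() is a more meaningful summary for a categorical column
print(df_cat['SEX'].value_counts())
```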

#### Pandas DataFrame Type
- A Pandas DataFrame has its own type
- The type of a pandas DataFrame can be known using the `type()` function

In [None]:
# DataFrame type
# pandas.core.frame.DataFrame type
type(df_demographics)

### 2.5. Displaying Data
- If the loaded DataFrame is df, you can display the first 5 rows using the head() method
    - Using head(10) will display 10 rows
- The size of the table can be checked with df.shape

In [None]:
# Check the first 5 rows of df_demographics

# Note that ID is now the index, so its header is displayed one line below the other column names
df_demographics.head()

In [None]:
# Check what happens if we don't specify the index column
# ID is read as one variable
# The index is 0, 1, 2 ... on the far left

df_demo_without_index = pd.read_excel('diabetes_demographics.xlsx')
df_demo_without_index.head()

In [None]:
# Check the size of the df_demographics table
# 442 rows and 4 columns excluding the index (ID)
df_demographics.shape

In [None]:
# Check the first 5 rows of df_data
df_data.head()

In [None]:
# Check the size of the df_data table
# 442 rows and 7 columns excluding the index (ID)
df_data.shape

In [None]:
# Check the first 5 rows of df_data_short
# Data where ID is not a sequential number
df_data_short.head()

In [None]:
# Check the size of the df_data_short table
# 252 rows and 7 columns excluding the index (ID)
df_data_short.shape

### 2.6. Descriptive Statistics of Data
- Because Pandas is designed as a data analysis tool, descriptive statistics are easy to calculate

In [None]:
# df.describe() calculates descriptive statistics for each item
# For numeric columns, it outputs the number of samples, mean, standard deviation, minimum, 25th percentile, 50th percentile, 75th percentile, and maximum as a table
df_demographics.describe()

- Using the groupby() method, you can calculate descriptive statistics for each group

In [None]:
# Use the groupby() method to calculate the mean for each sex
df_demographics.groupby(by='SEX').mean()
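
The grouping step can be followed by hand on a tiny table. This sketch uses made-up values (`df_toy` is hypothetical) and also shows `agg()`, which is one way to request several statistics at once.

```python
import pandas as pd

# Hypothetical data: two people per group
df_toy = pd.DataFrame({'SEX': [1, 1, 2, 2],
                       'AGE': [40, 60, 30, 50],
                       'BMI': [22.0, 26.0, 24.0, 30.0]})

# One row per SEX value, one column per remaining variable
group_means = df_toy.groupby(by='SEX').mean()
print(group_means)

# agg() accepts a list of statistics for a selected column
group_stats = df_toy.groupby(by='SEX')['BMI'].agg(['mean', 'min', 'max'])
print(group_stats)
```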

- Using the corr() method, you can calculate correlation coefficients

In [None]:
# Correlation can also be easily calculated with the corr() method
# Calculate the correlation between each column of df_data

# Display 3 rows of df_data
df_data.head(3)

In [None]:
# Calculate the correlation
df_data.corr()
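
To get a feel for how to read the output of `corr()`, the following sketch builds columns with a known relationship. The data are made up; `corr()` computes Pearson correlation coefficients by default.

```python
import pandas as pd

x = [1, 2, 3, 4, 5]
df_toy = pd.DataFrame({'x': x,
                       'up': [2 * v + 1 for v in x],    # rises exactly with x
                       'down': [10 - v for v in x]})    # falls exactly with x

corr = df_toy.corr()
print(corr.round(2))

# The result is itself a DataFrame, so a single coefficient can be picked out
print(corr.loc['x', 'down'])
```

The diagonal is always 1 because each column is perfectly correlated with itself, and the table is symmetric.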

### 2.7. Data Manipulation (1)
#### Extracting Columns
- In Pandas, you can extract specific columns from data using column names
- Python uses [] to extract elements. Following this principle, you can extract specific columns by using df['column_name']

In [None]:
# The column names of df_data can be obtained with df_data.columns
df_data.columns

In [None]:
# If you want to extract only the 'T-Cho' column, use df_data['T-Cho']
df_data['T-Cho']

In [None]:
# You can also calculate the mean of just the extracted column
df_data['T-Cho'].mean()

- When you want to extract multiple columns, create a list of column names and use df[list]
- A list is itself written with [], and extracting elements also uses [], so written out in full it becomes df[[...]]

In [None]:
# Want to extract T-Cho, LDL, HDL, Glu, Y
# The idea is to first create a list
# Then put that list inside df_data[]
# ['T-Cho', 'LDL', 'HDL', 'Glu', 'Y']
df_data[['T-Cho', 'LDL', 'HDL', 'Glu', 'Y']]

In [None]:
# Here too, you can calculate the mean values in the same way
# The mean is calculated only for continuous values
df_data[['T-Cho', 'LDL', 'HDL', 'Glu', 'Y']].mean()

#### Exercise 1
- To help you understand the significance of using lists when you want to extract multiple columns in pandas, please execute the following:
- Assign 'T-Cho', 'LDL', 'Glu' to a list called a
- Try executing df_data[a]

In [None]:
# Your answer
# Assign 'T-Cho', 'LDL', 'Glu' to a list called a
a =
# Display a
# If it displays ['T-Cho', 'LDL', 'Glu'], it's correct
print(a)

In [None]:
# Execute df_data[a]
df_data[a]

In [None]:
# Example answer
a = ['T-Cho', 'LDL', 'Glu']
print(a)
df_data[a]
# Since a itself is enclosed in [], df_data[a] is equivalent to df_data[['T-Cho', 'LDL', 'Glu']]
# Until you get used to it, it might be good to consciously create a list first and then put it in df[]
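
One practical consequence of single versus double brackets is the type of the result. The sketch below uses a made-up two-row table (`df_toy` is hypothetical) to show the difference.

```python
import pandas as pd

# Hypothetical values, with column names borrowed from df_data
df_toy = pd.DataFrame({'T-Cho': [157, 183],
                       'LDL': [93.2, 103.2],
                       'Glu': [87, 69]},
                      index=['sub001', 'sub002'])

single = df_toy['T-Cho']       # a column name    -> Series (one column of values)
as_frame = df_toy[['T-Cho']]   # a list of names  -> DataFrame with one column

cols = ['T-Cho', 'Glu']        # build the list first ...
subset = df_toy[cols]          # ... then put it inside df[]

print(type(single).__name__)
print(type(as_frame).__name__)
print(subset.shape)
```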

#### Extracting Rows
- To extract rows, use loc (**loc**ation) or iloc (**i**nteger **loc**ation)
- After loc, specify the value of the index you want to extract
- After iloc, specify the row number you want to extract. You can specify multiple rows using slicing

In [None]:
# loc can extract rows with specific indexes
# Now, the index is ID
# Extract ID sub400
df_data.loc['sub400']

In [None]:
# If you specify IDs as a list, you can extract multiple rows
df_data.loc[['sub400', 'sub410']]

In [None]:
# iloc can extract rows specified by row number
# Slicing is the same as Python basics
# The row number of the first row is 0
# If you want to extract from the 3rd to 5th row, assuming the rows start from 1,
# considering the row number starts from 0, you need to extract row numbers 2 to 4,
# so think of it as 2 or more and less than 5
df_data.iloc[2:5]

In [None]:
# Confirm by displaying the first 5 rows with df_data.head()
df_data.head()
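
`loc` also accepts slices, with one difference from `iloc`: a label slice includes its end label, while a position slice excludes its end position. A minimal sketch with made-up values (`df_toy` is hypothetical):

```python
import pandas as pd

df_toy = pd.DataFrame({'Glu': [87, 69, 85, 89, 80]},
                      index=['sub001', 'sub002', 'sub003', 'sub004', 'sub005'])
df_toy.index.name = 'ID'

# Position slice: positions 2, 3, 4 (position 5 is excluded)
by_position = df_toy.iloc[2:5]

# Label slice: sub003 through sub005 (the end label is included)
by_label = df_toy.loc['sub003':'sub005']

print(by_position.index.tolist())
print(by_label.index.tolist())

# loc can also take a row label and a column name together
print(df_toy.loc['sub002', 'Glu'])
```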

### 2.8. Data Merging
#### Horizontal Merging of DataFrames
- One of the strengths of pandas is merging tables
- In Excel, it can be quite time-consuming when the IDs in two tables don't match perfectly
- You can merge tables (DataFrames) horizontally using the pd.merge() function
- Specify the common key in the two DataFrames using `on='key'`

#### Inner Join and Outer Join
- Merges can be inner joins or outer joins
    - Note: Strictly speaking, there are three types of outer joins (left, right, and full), but only the full outer join is explained here
- An inner join keeps only the rows whose key appears in both datasets (in the figure below, only ID04, ID05, ID06 are merged)
- An outer join keeps every row from both datasets; values missing on one side become NaN
- In mathematical terms, for two groups A and B:
    - Inner join is the intersection: A∩B
    - Outer join is the union: A∪B

<img src="https://www.nemotos.net/nb/img/pandas_02.png" width=300>

- By default, only rows with common keys are merged (inner join)

- Now, df_demographics has 442 rows, df_data has 442 rows, and df_data_short has 252 rows (imagine some data couldn't be obtained in a research dataset)

- We will do three things:
    - Merge df_demographics (442 rows) and df_data (442 rows) (This can be done easily in Excel)
    - Inner join df_demographics (442 rows) and df_data_short (252 rows) (This is difficult in Excel)
    - Outer join df_demographics (442 rows) and df_data_short (252 rows) (This is difficult in Excel)

In [None]:
# Merge df_demographics and df_data using the 'ID' key to generate a DataFrame called df
df = pd.merge(df_demographics, df_data, on='ID')

In [None]:
# Display only the first 5 rows of df
# Note that the columns from demographics and data have been merged
df.head()

In [None]:
# Check the size of df
# 442 rows and 11 columns
df.shape

In [None]:
# Merge df_demographics and df_data_short using the 'ID' key to generate a DataFrame called df_short
df_short = pd.merge(df_demographics, df_data_short, on='ID')

In [None]:
# Check the size of df_short
# 252 rows and 11 columns
df_short.shape

In [None]:
# Display only the first 5 rows of df_short
# Note that the number of IDs has decreased
df_short.head()

In [None]:
# Perform an outer join on df_demographics and df_data_short to generate a DataFrame called df_short_outer
# For outer join, specify how='outer'
df_short_outer = pd.merge(df_demographics, df_data_short, how='outer', on='ID')

In [None]:
# Check the size of df_short_outer
# 442 rows and 11 columns
df_short_outer.shape

In [None]:
# Check the first 5 rows of df_short_outer
# Note that for those without blood data, the blood data items are NaN
df_short_outer.head()
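
The difference between the two joins is easiest to see on tables small enough to check by eye. In this sketch the tables and values are made up, and ID is kept as an ordinary column rather than the index, which is an assumption that differs from the notebook. `indicator=True` adds a `_merge` column that records where each row came from.

```python
import pandas as pd

# Hypothetical tables: ID03 and ID04 appear in both
demo = pd.DataFrame({'ID': ['ID01', 'ID02', 'ID03', 'ID04'],
                     'AGE': [59, 48, 72, 24]})
blood = pd.DataFrame({'ID': ['ID03', 'ID04', 'ID05'],
                      'Glu': [85, 89, 80]})

# Inner join (default): only IDs present in both tables
inner = pd.merge(demo, blood, on='ID')

# Outer join: all IDs from either table; missing values become NaN
outer = pd.merge(demo, blood, on='ID', how='outer', indicator=True)

print(inner)
print(outer)
```

In the outer join, rows marked `left_only` have no blood data and rows marked `right_only` have no demographic data.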

### 2.9. Handling Missing Values
- We'll use df_short_outer to learn about handling missing values
- The isnull() method of the DataFrame sets True for values that don't exist and False for others

In [None]:
df_short_outer.isnull()

In [None]:
# By using the sum() method on df_short_outer.isnull(), you can find out the number of missing values

# Now we can see that there are 190 missing values for each blood data item
df_short_outer.isnull().sum()

In [None]:
# For missing values, there are two options: "drop" or "impute"

# First, let's look at dropping
# Execute the dropna() method without arguments
df_dropped = df_short_outer.dropna()

In [None]:
# Originally there were 442 rows; dropping the 190 rows that contain NaN (Not a Number) leaves 442 - 190 = 252
df_dropped.shape

In [None]:
# For imputation, specify the mean or other values as an argument to the fillna() method
df_filled = df_short_outer.fillna(df_short_outer.mean())

In [None]:
# NaN disappears
df_filled.head()

In [None]:
# Compared to the original, you can see that NaN has been uniformly filled with the mean of each column
# Of course, in reality, this method would be inappropriate if there were so many missing values, but it's introduced here as an example
df_short_outer.head()
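
The three steps (count, drop, impute) can be verified on a table with only four rows. This is a sketch with made-up values (`df_toy` is hypothetical); the last lines illustrate why mean imputation should be used with care, since it shrinks the spread of the data.

```python
import numpy as np
import pandas as pd

# Hypothetical data: Glu is missing for two of four people
df_toy = pd.DataFrame({'AGE': [59, 48, 72, 24],
                       'Glu': [87.0, np.nan, 85.0, np.nan]},
                      index=['sub001', 'sub002', 'sub003', 'sub004'])

n_missing = df_toy.isnull().sum()       # missing values per column
dropped = df_toy.dropna()               # keep only complete rows
filled = df_toy.fillna(df_toy.mean())   # replace NaN with each column's mean

print(n_missing)
print(dropped.shape)
print(filled['Glu'].tolist())

# Filling with the mean makes the standard deviation smaller
print(df_toy['Glu'].std(), filled['Glu'].std())
```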

### 2.10. Data Manipulation (2)
#### Conditional Extraction
- You can also extract only those that meet certain conditions
- df['AGE'] > 40 returns True for each row where AGE is greater than 40 and False otherwise
- By putting this inside df[], you can extract only the rows that meet the condition

In [None]:
# df['AGE'] > 40 returns True or False
df['AGE'] > 40

In [None]:
# By putting the above inside the df index, you can extract only the True ones
df[df['AGE'] > 40]

In [None]:
# You can create multiple conditions using &, |, ~
# Rules:
#   Use &, |, ~ to combine conditions. The Python keywords and, or, not do not work here
#   Each individual condition must be enclosed in ()

# Older than 50 and female
(df['AGE'] > 50) & (df['SEX'] == 2)

In [None]:
# Let's put this condition inside df[]
df[(df['AGE'] > 50) & (df['SEX'] == 2)]
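
The three operators, and the error raised by `and`, can be tried on a small table. This sketch uses made-up values (`df_toy` is hypothetical) with SEX coded as in the notebook.

```python
import pandas as pd

df_toy = pd.DataFrame({'AGE': [59, 48, 72, 24],
                       'SEX': [2, 2, 1, 1]})

older = df_toy['AGE'] > 50     # True, False, True, False
female = df_toy['SEX'] == 2    # True, True, False, False

both = df_toy[older & female]      # AND: both conditions are True
either = df_toy[older | female]    # OR: at least one condition is True
not_older = df_toy[~older]         # NOT: the condition is False

print(both.index.tolist())
print(either.index.tolist())
print(not_older.index.tolist())

# The keyword "and" cannot compare two columns row by row
error_message = None
try:
    df_toy[older and female]
except ValueError as err:
    error_message = str(err)
    print('and failed:', error_message)
```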

#### Exercise 2
- We want to extract data where 'AGE' is 40 or older and 'BMI' is 30 or higher. To deepen your understanding, please execute the following:
    - Create a condition expression for 'AGE' being 40 or older and assign it to condition1
    - Create a condition expression for 'BMI' being 30 or higher and assign it to condition2
    - Create a condition expression for condition1 AND condition2 and assign it to b
    - Execute df[b] and create a DataFrame called df_b
    - Display the first 10 rows of df_b
    - Calculate the descriptive statistics of df_b

In [None]:
# 'AGE' 40 or older AND 'BMI' 30 or higher
# Hint: 40 or older is >= 40
# Hint: AND is &

# condition1: 'AGE' is 40 or older
condition1 =

# Display condition1
condition1.head()

In [None]:
# condition2: 'BMI' is 30 or higher
condition2 =
# Display condition2
condition2.head()

In [None]:
# b: condition1 AND condition2
b =

# Display b
b

In [None]:
# Execute df[b] and create a DataFrame called df_b
df_b = df[b]

# Display the first 10 rows of df_b
# Please write below (Hint: head)


In [None]:
# Calculate the descriptive statistics of df_b
# Please write below (Hint: describe)


In [None]:
#### Example answer
# condition1: 'AGE' is 40 or older
condition1 = df['AGE'] >= 40

# condition2: 'BMI' is 30 or higher
condition2 = df['BMI'] >= 30

# b: condition1 AND condition2
b = condition1 & condition2

# Execute df[b] and create a DataFrame called df_b
df_b = df[b]

# Display the first 10 rows of df_b
df_b.head(10)

In [None]:
# Calculate the descriptive statistics of df_b
df_b.describe()

#### Adding Calculated Values as New Columns
- In Pandas, it's easy to calculate items and generate new columns
- Not only calculations but also Boolean values can be added

In [None]:
# Calculate the Z-score of Glucose
# (Glu - mean of Glu) / standard deviation of Glu
df['Glu_Z'] = (df['Glu'] - df['Glu'].mean())/df['Glu'].std()

In [None]:
df.head()

In [None]:
# Create a column HC that indicates whether T-Cho is greater than 220 or not
df['HC'] = df['T-Cho']>220

In [None]:
# Note that HC has been created in the last column
df.head()
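
A quick way to check a Z-score column is to confirm that its mean is 0 and its standard deviation is 1. The sketch below uses made-up glucose values (`df_toy` is hypothetical). Note that pandas `std()` uses the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

df_toy = pd.DataFrame({'Glu': [80.0, 90.0, 100.0, 110.0, 120.0]})

# Same formula as above: (value - mean) / standard deviation
df_toy['Glu_Z'] = (df_toy['Glu'] - df_toy['Glu'].mean()) / df_toy['Glu'].std()
print(df_toy)

# The standardized column has mean 0 and standard deviation 1
print(df_toy['Glu_Z'].mean(), df_toy['Glu_Z'].std())

# std() divides by n - 1 by default; ddof=0 divides by n instead
sample_std = df_toy['Glu'].std()
population_std = df_toy['Glu'].std(ddof=0)
print(sample_std, population_std)
```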

#### File Output
- Pandas can easily save a DataFrame as a CSV or Excel file
- The method names are `to_` followed by the file type
- df.to_csv('filename')
- df.to_excel('filename')

In [None]:
# Output the current df as a file named 'diabetes.xlsx'
# The file is created in the Google Colaboratory session and can be downloaded from "Files" on the left
df.to_excel('diabetes.xlsx')
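
To confirm that a saved file can be read back, here is a sketch of a CSV round trip. It writes to an in-memory buffer instead of a real file so that it runs anywhere; the table is made up, and with a real file you would pass a file name to `to_csv()` and `read_csv()` instead.

```python
import io
import pandas as pd

# Hypothetical table with ID as the index, as in the notebook
df_toy = pd.DataFrame({'AGE': [59, 48], 'HC': [False, True]},
                      index=pd.Index(['sub001', 'sub002'], name='ID'))

# Write the DataFrame as CSV text (the index is written as the first column)
buffer = io.StringIO()
df_toy.to_csv(buffer)
csv_text = buffer.getvalue()
print(csv_text)

# Read it back, restoring ID as the index
df_back = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(df_back)
```

If you do not want the index in the output, `to_csv()` and `to_excel()` both accept `index=False`.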

## 3. Basics of Seaborn
- Seaborn can draw various graphs with simple commands
- This allows you to visualize relationships in data

In [None]:
# To use seaborn themes, execute the following command
sns.set_theme()

#### General Rules of Seaborn
- After sns., specify the function prepared for each graph
    - sns.scatterplot()
- Specify the following as arguments:
    - data=DataFrame
    - x='Item to use for X-axis'
    - y='Item to use for Y-axis'
    - hue='Item to color-code'
    - style='Item to differentiate plot shapes'

### 3.1. Basic Graphs

In [None]:
# Review of df
df.head()

#### A. scatterplot
- Purpose of the graph: Visualize the relationship between two continuous variables

In [None]:
sns.scatterplot(data=df, x='AGE', y='BMI', hue='SEX')

In [None]:
# If you want to specify colors, create a dictionary of variable-color correspondences and specify it with palette
color_dict = {1: 'blue', 2: 'orange'}  # Example: 1 is male, 2 is female
sns.scatterplot(data=df, x='AGE', y='BMI', hue='SEX', palette=color_dict)
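
Titles and axis labels can be customized through the object that `sns.scatterplot()` returns, which is a Matplotlib Axes. The following is a minimal sketch with made-up data (`df_toy` is hypothetical); the `matplotlib.use('Agg')` line is only there so that the example runs without a display and can be omitted in Google Colab.

```python
import matplotlib
matplotlib.use('Agg')  # draw without a display; omit this line in Colab
import pandas as pd
import seaborn as sns

# Hypothetical data with the same column names as df
df_toy = pd.DataFrame({'AGE': [59, 48, 72, 24],
                       'BMI': [32.1, 21.6, 30.5, 25.3],
                       'SEX': [2, 1, 2, 1]})

# scatterplot() returns the Matplotlib Axes it drew on
ax = sns.scatterplot(data=df_toy, x='AGE', y='BMI', hue='SEX',
                     palette={1: 'blue', 2: 'orange'})

# Customize the title and axis labels through the Axes
ax.set_title('BMI by age')
ax.set_xlabel('Age (years)')
ax.set_ylabel('BMI')
```

To save the figure as an image, `ax.figure.savefig('filename.png')` can be called afterwards.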

#### B. regplot (Scatter plot with regression line)
- Purpose of the graph: Visualize the relationship between two continuous variables and their regression line

In [None]:
sns.regplot(data=df, x='AGE', y='BMI')

In [None]:
# The confidence interval is 95% by default
# If you want to change it to 99%, specify ci=99
# If you don't need a confidence interval, use ci=None
sns.regplot(data=df, x='BMI', y='Glu', ci=99)

#### C. histplot (Histogram)
- Purpose of the graph: Visualize the distribution of variables. Kernel Density Estimation (KDE) can also be added

In [None]:
# Multiple histograms can be drawn simultaneously
sns.histplot(df[['T-Cho','LDL']])

In [None]:
# Add Kernel Density Estimation
sns.histplot(data=df, x='Glu', kde=True)

In [None]:
# (Reference)
# Unfortunately, plotting a normal distribution curve can't be done with Seaborn alone
# Use Matplotlib as follows

import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import norm

# Plot histogram
sns.histplot(data=df, x='Glu', stat='density', bins=20)

# Calculate normal distribution curve
mean = df['Glu'].mean()
std = df['Glu'].std()
x = np.linspace(df['Glu'].min(), df['Glu'].max(), 100)
p = norm.pdf(x, mean, std)

# Plot normal distribution curve
plt.plot(x, p, 'k', linewidth=1)
plt.title('Histogram and Normal Distribution Curve')
plt.show()

#### D. boxplot
- Purpose of the graph: Check the dispersion of data and outliers

In [None]:
sns.boxplot(df[['T-Cho', 'LDL', 'HDL', 'Glu']])

#### E. violinplot
- Purpose of the graph: Check the distribution and density of data

In [None]:
sns.violinplot(df[['T-Cho', 'LDL', 'HDL', 'Glu']])

#### F. swarmplot
- Purpose of the graph: Visualize the distribution of data as individual points, used when avoiding overlap of data points

In [None]:
sns.swarmplot(data=df, x='SEX', y='BMI', hue='HC')

#### G. pairplot
- Purpose of the graph: Visualize relationships between multiple variables at once

In [None]:
sns.pairplot(df[['AGE','BMI','T-Cho','LDL','Glu']])

#### H. heatmap
- Purpose of the graph: Visualize a correlation matrix to understand correlations between variables

In [None]:
# First, calculate the correlation coefficients between variables and assign to a variable called correlation
correlation = df[['AGE','BMI','T-Cho','LDL','Glu']].corr()
# Check the contents of correlation
correlation

In [None]:
# Display correlation as a heatmap
# annot=True allows displaying numbers
# fmt='.2f' means up to the second decimal place
sns.heatmap(correlation, annot=True, fmt='.2f')

#### I. countplot
- Purpose of the graph: Visualize the distribution of categorical data

In [None]:
sns.countplot(data=df, x='SEX')

#### J. clustermap
- Purpose of the graph: Visualize data clusters to explore patterns and similarities in data

In [None]:
sns.clustermap(df.corr(), annot=True, cmap='coolwarm')

#### K. Interactive Plot
- Purpose of the graph: Interactively manipulate plots to explore relationships in data

In [None]:
from ipywidgets import interact

def plot_scatter(x, y):
    sns.scatterplot(data=df, x=x, y=y)

interact(plot_scatter, x=df.columns, y=df.columns)

#### L. relplot (Relation Plot)
- Purpose of the graph: Visualize multiple variables simultaneously to comprehensively understand relationships between variables. Grouping is possible by color (hue), shape (style), and size (size)

    ```
    sns.relplot(data=DataFrame,
                x=column to use for x-axis,
                y=column to use for y-axis,
                hue=column to change color,
                style=column to change the shape of the plot,
                size=column to reflect in the size of the plot)
    ```

In [None]:
# Simply visualize the relationship between BMI and blood glucose (Glu)
sns.relplot(data=df, x='BMI',y='Glu')

In [None]:
# Using hue allows color-coding of groups
# Now, color-code groups by presence or absence of hypercholesterolemia
sns.relplot(data=df, x='BMI',y='Glu', hue='HC')

In [None]:
# Using style allows differentiation of groups by plot shape
# Now, we want to change the shape for men and women
sns.relplot(data=df, x='BMI',y='Glu', hue='HC', style='SEX')

In [None]:
# Using size allows changing the size of individual plots
# We want to reflect age in the plot
sns.relplot(data=df, x="BMI", y="Glu",hue="HC",style="SEX",size="AGE")

## 4. Quiz
1. Calculate the mean of the 'AGE' column in the df DataFrame
2. Find the maximum value of the 'BMI' column in the df DataFrame
3. Extract only the data where 'SEX' is 1 from the df DataFrame
4. Create a scatter plot showing the relationship between AGE (age) and BMI from the df DataFrame.
5. For the scatter plot created in quiz 4, color-code by SEX (gender).
6. Create a box plot showing the distribution of BMI by SEX (gender) from the df DataFrame.
7. Create a violin plot showing the distribution of BMI by SEX (gender) from the df DataFrame.
8. Calculate the correlation matrix of AGE, BMI, T-Cho, LDL, Glu from the df DataFrame.
9. Visualize the correlation matrix created in quiz 8 as a heatmap and display the correlation coefficients in each cell.

In [None]:
# Quiz 1
# Hint: 2.6. Descriptive Statistics


In [None]:
# Quiz 2
# Hint: 2.6. Descriptive Statistics

In [None]:
# Quiz 3
# Hint: 2.10. Data Manipulation (2)


In [None]:
# Quiz 4
# Hint: 3.1.A
  

In [None]:
# Quiz 5
# Hint: 3.1.A


In [None]:
# Quiz 6
# Hint: 3.1.D


In [None]:
# Quiz 7
# Hint: 3.1.E


In [None]:
# Quiz 8
# Hint: 2.6


In [None]:
# Quiz 9
# Hint: 3.1.H


In [None]:
#### Example Answers
# Quiz 1
df['AGE'].mean()

In [None]:
# Quiz 2
df['BMI'].max()

In [None]:
# Quiz 3
df[df['SEX'] == 1]

In [None]:
# Quiz 4
sns.scatterplot(data=df, x='AGE', y='BMI')

In [None]:
# Quiz 5
sns.scatterplot(data=df, x='AGE', y='BMI', hue='SEX')

In [None]:
# Quiz 6
sns.boxplot(data=df, x='SEX', y='BMI')

In [None]:
# Quiz 7
sns.violinplot(data=df, x='SEX', y='BMI')

In [None]:
# Quiz 8
correlation = df[['AGE', 'BMI', 'T-Cho', 'LDL', 'Glu']].corr()
correlation

In [None]:
# Quiz 9
sns.heatmap(correlation, annot=True, fmt='.2f', cmap='coolwarm')