# Exploratory Data Analysis (EDA)
---

0. **[Introduction to Exploratory Data Analysis](#Introduction-to-EDA)**
1. **[Discovery](#1.-Discovery)**
2. **[Structuring](#2.-Structuring)**
3. **[Cleaning](#3.-Cleaning)**
4. **[Joining](#4.-Joining)**
5. **[Validating](#5.-Validating)**
6. **[Presenting](#6.-Presenting)**

---
<a name="Introduction-to-EDA"></a>
### Introduction to Exploratory Data Analysis

**Exploratory Data Analysis (EDA) |** The process of investigating, organizing, and analyzing datasets and summarizing their main characteristics, often employing data wrangling and visualization methods.

**6 Practices of EDA:**
- **Discovering |** process of data familiarization in order to conceptualize how the data can be used
- **Structuring |** the process of taking raw data and organizing or transforming it to be more easily visualized, explained, or modeled 
- **Cleaning |** the process of removing errors that may distort your data or make it less useful
- **Joining |** the process of augmenting or adjusting data by adding values from other datasets
- **Validating |** the process of verifying that the data is consistent and high quality
- **Presenting |** making the cleaned dataset or data visualizations available to others for analysis or further modeling


---
<a name="1.-Discovery"></a>
### 1. Discovery

#### 1.1 Reference Guide

**Questions to ask during the discovery phase:**
1. How can I break this data into smaller groups so that I can understand it better?
2. How can I prove my hypothesis?
3. In its current form, can this data give me the answers I need?

**Functions for data discovery:**

| Function | Description |
| ---- | ---- |
| `DataFrame.head()` | The head() method will display the first n rows of the dataframe. <br> In the argument field, input the number of rows you want displayed in a Python notebook. The default is 5 rows. |
| `DataFrame.info()` | The info() method will display a summary of the dataframe, including the range index, dtypes, column headers, and memory usage.<br> Leaving the argument field blank will return a full summary. As an option, in the argument field you can type in `show_counts=True`, which will return the count of non-null values for each column. |
| `DataFrame.describe()` | The describe() method will return descriptive statistics of the dataset (by default, only its numeric columns), including total count, mean, minimum, maximum, dispersion, and distribution. <br> Leaving the argument field blank will default to returning a summary of the data frame’s numeric columns. As an option, you can use `include=[X]` and `exclude=[X]`, which will limit the results to specific data types, depending on what you input in the brackets. |
| `DataFrame.shape` | shape is an attribute that returns a tuple representing the dimensions of the dataframe by number of rows and columns. Remember that attributes are not followed by parentheses. |

#### 1.2 Code Cells

##### Import packages

In [None]:
# For data manipulation
import numpy as np
import pandas as pd

# For data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# For displaying all of the columns in dataframes
pd.set_option('display.max_columns', None)



##### Load Data

In [None]:
# Load the dataset into a DataFrame and save in a variable
df0 = pd.read_csv("example_file.csv")

##### Gather basic information about the data

In [None]:
# Display the first 10 rows of the data
df0.head(10)

In [None]:
# Gather basic information about the dataset
df0.info()

In [None]:
# Gather descriptive statistics about the data
df0.describe()

In [None]:
# Display the size of the dataframe
df0.shape

---
<a name="2.-Structuring"></a>
### 2. Structuring

#### 2.1 Reference Guide

**Sorting |** The process of arranging data into meaningful order for analysis

**Extracting |** The process of retrieving data from a dataset or source for further processing

**Filtering |** The process of selecting a smaller part of your dataset based on specific parameters and using it for viewing or analysis

**Slicing |** A method for breaking information down into smaller parts to facilitate efficient examination and analysis from different viewpoints

**Grouping |** Aggregating individual observations of a variable into groups


**Functions for extracting or selecting data:**

| Function | Description |
| ---- | ---- |
| `df[[columns]]` | Use df[[columns]] to extract/select columns from a dataframe. |
| `df.select_dtypes` | A method available to the DataFrame class. <br> Use df.select_dtypes() to return a subset of the dataframe’s columns based on the column dtypes (e.g., float64, int64, bool, object, etc.). |

**Functions for filtering, sorting, slicing data:**

| Function | Description |
| ---- | ---- |
| `df[condition]` | Use df[condition] to create a Boolean mask, then apply the mask to the dataframe to filter according to selected condition. | 
| `df.sort_values()` | A method available to the DataFrame class. <br> Use df.sort_values() to sort data according to selected parameters. |
| `df.iloc[]` | Use df.iloc[] to slice a dataframe based on an integer index location. |
| `df.loc[]` | Use df.loc[] to slice a dataframe based on a label or Boolean array. |


**Manipulating datetime strings in Python:**

| Code | Format | Example |
| --- | --- | --- |
| `%a` | Abbreviated weekday | Sun |
| `%A` | Weekday | Sunday |
| `%b` | Abbreviated month | Jan |
| `%B` | Month name | January |
| `%c` | Date and time | Sun Jan 1 00:00:00 2021 |
| `%d` | Day (leading zeros) | 01 to 31 |
| `%H` | 24 hours | 00 to 23 |
| `%I` | 12 hours | 01 to 12 |
| `%j` | Day of year | 001 to 366 |
| `%m` | Month | 01 to 12 |
| `%M` | Minute | 00 to 59 |
| `%p` | AM or PM | AM/PM |
| `%S` | Seconds | 00 to 60 |
| `%U` | Week number (Sun) | 00 to 53 |
| `%W` | Week number (Mon) | 00 to 53 |
| `%w` | Weekday as a number (0 = Sunday) | 0 to 6 |
| `%x` | Locale’s appropriate date representation | 08/16/88 (None) <br> 08/16/1988 (en_US) <br> 16.08.1988 (de_DE) |
| `%X` | A locale’s appropriate time representation | 21:30:00 (en_US) <br> 21:30:00 (de_DE) |
| `%y` | Year without century | 00 to 99 |
| `%Y` | Year | 2022 |
| `%z` | Offset | +0900 |
| `%Z` | Time zone | EDT/JST/WET etc (GMT) |

**Datetime functions to remember**

| Code | Input Type | Input Example | Output Type | Output Example |
| --- | --- | --- | --- | --- |
| `datetime.strptime("25/11/2022", "%d/%m/%Y")` | string | "25/11/2022" | DateTime | "2022-11-25 00:00:00" |
| `datetime.strftime(dt_object, "%d/%m/%Y")` | DateTime | "2022-11-25 00:00:00" | string | "25/11/2022" |
| `dt_object = datetime.strptime("25/11/2022", "%d/%m/%Y")` <br> `datetime.timestamp(dt_object)` | string | "25/11/2022" | float (timestamp in seconds) | 1669334400.0 (when the local time zone is UTC; a naive datetime is interpreted as local time) |
| `datetime.strptime("25/11/2022", "%d/%m/%Y").strftime("%Y-%m-%d")` | string | "25/11/2022" | string | "2022-11-25" |
| `datetime.fromtimestamp(1669334400.0)` | float (UTC timestamp in seconds) | 1669334400.0 | DateTime | "2022-11-25 00:00:00" (when the local time zone is UTC) |
| `datetime.fromtimestamp(1669334400.0).strftime("%d/%m/%Y")` | float (UTC timestamp in seconds) | 1669334400.0 | string | "25/11/2022" (when the local time zone is UTC) |
| `from pytz import timezone` <br> `ny_time = datetime.strptime("25-11-2022 09:34:00-0500", "%d-%m-%Y %H:%M:%S%z")` <br> `tokyo_time = ny_time.astimezone(timezone('Asia/Tokyo'))` | string | New York time "25-11-2022 09:34:00-0500" | DateTime | Tokyo time "2022-11-25 23:34:00+09:00" |
| `datetime.strptime("20:00", "%H:%M").strftime("%I:%M %p")` | string | "20:00" | string | "08:00 PM" |
| `datetime.strptime("08:00 PM", "%I:%M %p").strftime("%H:%M")` | string | "08:00 PM" | string | "20:00" |
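
The conversions in the table can be tried out with the short sketch below. It uses only the standard library and pins the time zone to UTC explicitly so that the timestamp does not depend on the machine's local settings; the fixed `-0500` and `+09:00` offsets stand in for New York and Tokyo and are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

# String -> datetime
dt_object = datetime.strptime("25/11/2022", "%d/%m/%Y")

# datetime -> string in a different layout
iso_string = dt_object.strftime("%Y-%m-%d")

# 24-hour clock -> 12-hour clock, and back again
twelve_hour = datetime.strptime("20:00", "%H:%M").strftime("%I:%M %p")
twenty_four_hour = datetime.strptime("08:00 PM", "%I:%M %p").strftime("%H:%M")

# Timestamp round trip, with the time zone stated explicitly as UTC
utc_dt = dt_object.replace(tzinfo=timezone.utc)
ts = utc_dt.timestamp()
round_trip = datetime.fromtimestamp(ts, tz=timezone.utc)

# Convert a time with a -05:00 offset to a +09:00 offset
ny_time = datetime.strptime("25-11-2022 09:34:00-0500", "%d-%m-%Y %H:%M:%S%z")
tokyo_time = ny_time.astimezone(timezone(timedelta(hours=9)))

print(iso_string, twelve_hour, twenty_four_hour, ts, tokyo_time)
```

Attaching `tzinfo` before calling `timestamp()` is a deliberate choice here: a naive datetime is interpreted in the local time zone, so the same code can return different numbers on different machines.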





#### 2.2 Code Cells
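
The cell below is a minimal sketch of the structuring operations listed in the reference guide (extracting, filtering, sorting, slicing, grouping). The small dataframe and its column names (`department`, `tenure`, `salary`) are hypothetical and exist only to make the example self-contained; swap in your own dataframe and columns.

```python
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({
    "department": ["sales", "sales", "it", "it", "hr"],
    "tenure": [2, 5, 3, 8, 4],
    "salary": [40.0, 55.0, 60.0, 82.0, 45.0],
})

# Extracting: select columns by name or by dtype
subset = df[["department", "salary"]]
numeric = df.select_dtypes(include="number")

# Filtering: build a Boolean mask and apply it
long_tenure = df[df["tenure"] >= 4]

# Sorting: order rows by a column
by_salary = df.sort_values(by="salary", ascending=False)

# Slicing: by integer position (iloc) or by label / Boolean array (loc)
first_two = df.iloc[0:2]
it_salaries = df.loc[df["department"] == "it", "salary"]

# Grouping: aggregate observations per group
mean_salary = df.groupby("department")["salary"].mean()
print(mean_salary)
```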

---
<a name="3.-Cleaning"></a>
### 3. Cleaning

#### 3.1 Reference Guide

##### 3.1.1 Missing Data


**What to do with missing data:**
- Request the missing values to be filled in by the owner of the data
- Delete the missing column(s), row(s), or value(s)
- Create a NaN category
- Derive new representative value(s)
    - Forward filling
    - Backward filling
    - Deriving mean values 
    - Deriving median values

**Useful Functions:**

| Function | Description |
| ---- | ---- |
| `df.info()` | A DataFrame method that returns a concise summary of the dataframe, including a ‘non-null count,’ which helps you know the number of missing values |
| `pd.isna() / pd.isnull()` | pd.isna() is a pandas function that returns a same-sized Boolean array indicating whether each value is null (you can also use pd.isnull() as an alias). Note that this function also exists as a DataFrame method. |
| `pd.notna() / pd.notnull()` | A pandas function that returns a same-sized Boolean array indicating whether each value is NOT null (you can also use pd.notnull() as an alias). Note that this function also exists as a DataFrame method. |
| `df.fillna()` | A DataFrame method that fills in missing values using a specified value or method |
| `df.replace()` | A DataFrame method that replaces specified values with other specified values. Can also be applied to pandas Series. |
| `df.dropna()` | A DataFrame method that removes rows or columns that contain missing values, depending on the axis you specify. |
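
As a rough illustration of the "derive new representative values" options listed above, the sketch below applies forward filling, backward filling, and mean and median imputation to a small hypothetical Series. The values are made up; on real data, which strategy is appropriate depends on why the values are missing.

```python
import numpy as np
import pandas as pd

# Hypothetical series with two missing values
s = pd.Series([1.0, np.nan, 3.0, np.nan, 8.0])

# Count the missing values
n_missing = s.isna().sum()

# Forward fill: carry the previous valid value forward
forward = s.ffill()

# Backward fill: pull the next valid value backward
backward = s.bfill()

# Impute the mean or median of the non-missing values
mean_filled = s.fillna(s.mean())
median_filled = s.fillna(s.median())

# Or drop the missing entries altogether
dropped = s.dropna()

print(n_missing, forward.tolist(), backward.tolist())
```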

##### 3.1.2 Duplicated Data



**Identifying duplicates** 
A simple way to identify duplicates is to use the `df.duplicated()` DataFrame method from pandas. This method returns a Series of True/False values, with True indicating the row is a duplicate of an earlier row, and False indicating it is a unique row or the first occurrence.

**Keeping or Dropping Duplicates** Every dataset is unique and you cannot treat every dataset the same. When making the decision on whether to eliminate duplicate values or not, think deeply about the dataset itself and about the objective you wish to achieve. What impact will dropping duplicates have on your dataset and your objective? 
1. **Deciding to drop |** You should drop or eliminate duplicate values if duplicate values are clearly mistakes or will misrepresent the remaining unique values in the dataset.  
2. **Deciding to NOT drop |** You should keep duplicated data in your dataset if the duplicate values are clearly not mistakes and should be taken into account when representing the dataset as a whole. 

##### 3.1.3 Outliers


**Outliers |** Observations that are an abnormal distance from other values or an overall pattern in a data population

**3 Types of Outliers**
- Global outliers 
- Contextual outliers
- Collective outliers

**Global outliers |** Values that are completely different from the overall data group and have no association with any other outliers

**Contextual outliers |** Normal data points under certain conditions but become anomalies under most other conditions 

**Collective outliers |** A group of abnormal points that follow similar patterns and are isolated from the rest of the population

**How to handle outliers** 

It is important to not only detect outliers, but also to have a plan for them.

Whether you keep outliers as they are, delete them, or reassign values is a decision that you make on a dataset-by-dataset basis. To help you make the decision, you can start with these general guidelines:

- **Delete them**: If you are sure the outliers are mistakes, typos, or errors and the dataset will be used for modeling or machine learning, then you are more likely to decide to delete outliers. Of the three choices, you’ll use this one the least.
- **Reassign them**: If the dataset is small and/or the data will be used for modeling or machine learning, you are more likely to choose a path of deriving new values to replace the outlier values.
- **Leave them**: For a dataset that you plan to do EDA/analysis on and nothing else, or for a dataset you are preparing for a model that is resistant to outliers, it is most likely that you are going to leave them in.

**Useful Functions:**

| Function | Description |
| ---- | ---- |
| `df.describe()` | A DataFrame method that returns general statistics about the dataframe which can help determine outliers |
| `sns.boxplot()` | A seaborn function that generates a box plot. By default, data points more than 1.5x the interquartile range below the first quartile or above the third quartile are drawn as outliers. |

##### 3.1.4 Categorical and Numeric Data

**Categorical Data |** Data that is divided into a limited number of qualitative groups 

Data Transformation: 

1. **Label encoding |** Data transformation technique where each category is assigned a unique number instead of a qualitative value

Some potential problems with label encoding:

Imagine you’re analyzing a dataset with categories of music genres. You label encode “Blues,” “Electronic Dance Music (EDM),” “Hip Hop,” “Jazz,” “K-Pop,” “Metal,” and “Rock” with the numeric values 1, 2, 3, 4, 5, 6, and 7.

With this label encoding, the resulting machine learning model could derive not only a ranking, but also a closer connection between Blues (1) and EDM (2) because of how close they are numerically than, say, Blues(1) and Jazz(4). In addition to these presumed relationships (which you may or may not want in your analysis) you should also notice that each code is equidistant from the other in the numeric sequence, as in 1 to 2 is the same distance as 5 to 6, etc. The question is, does that equidistant relationship accurately represent the relationships between the music genres in your dataset? To ask another question, after encoding, will the visualization or model you build treat the encoded labels as a ranking? 

The same could be said for any other set of unordered categories, such as mushroom types. After label encoding mushroom types, are you satisfied with the fact that the mushrooms are now in a presumed ranked order, with button mushrooms ranked first and toadstools ranked eighth?

In summary, label encoding may introduce unintended relationships between the categorical data in your dataset. When you are making decisions about label encoding, consider the algorithm you’ll apply to the data and how it may or may not impact label encoded categorical data.

2. **One-hot encoding |** Uses *dummy variables* with values of 0 or 1, which indicate the presence or absence of something.

With this method, we solve the problem of the unintended and problematic relationships that label encoding presented. 

But one-hot encoding does present its own set of problems, particularly when it comes to logistic and linear regression.

**Label encoding or one-hot encoding: How to decide?** 

There is no simple answer to whether you should use label encoding or one-hot encoding. The decision needs to be made on a case-by-case, or dataset-by-dataset basis. But there are some guidelines to help you. 

Use label encoding when:
- There are a large number of different categorical variables — because label encoding uses far less data than one-hot encoding
- The categorical values have a particular order to them (for example, age groups can be grouped as youngest to oldest or oldest to youngest)
- You plan to use a decision tree or random forest machine learning model

Use one-hot encoding when: 
- There is a relatively small number of categorical variables, because one-hot encoding uses much more data than label encoding
- The categorical variables have no particular order
- You use a machine learning model in combination with dimensionality reduction (like Principal Component Analysis (PCA))

**Useful Functions:**

| Function | Description |
| ---- | ---- |
| `df.astype()` | A DataFrame method that allows you to encode its data as a specified dtype. Note that this method can also be used on Series objects. |
| `Series.cat.codes` | A Series attribute that returns the numeric category codes of the series. |
| `pd.get_dummies()` | A function that converts categorical values into new binary columns, one for each different category |
| `LabelEncoder()` | A transformer from `sklearn.preprocessing` that encodes specified categories or labels with numeric codes. Note that when building predictive models it should only be used on target variables (i.e., y data). |
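
To make the contrast between the two encodings concrete, here is a small sketch on a hypothetical `genre` column. It label encodes with `Series.cat.codes` and one-hot encodes with `pd.get_dummies()`. The `drop_first=True` variant is shown as one common way to avoid a redundant column when the result feeds a linear or logistic regression; whether you need it depends on your model.

```python
import pandas as pd

# Hypothetical categorical data, for illustration only
genres = pd.DataFrame({"genre": ["Blues", "Jazz", "Rock", "Jazz"]})

# Label encoding: each category becomes an integer code.
# Codes follow the sorted category order (Blues=0, Jazz=1, Rock=2),
# which implies a ranking that may not exist in the data.
genres["genre_code"] = genres["genre"].astype("category").cat.codes

# One-hot encoding: one binary column per category, no implied order
one_hot = pd.get_dummies(genres["genre"], prefix="genre")

# Same encoding with the first category dropped (it becomes the baseline)
one_hot_reduced = pd.get_dummies(genres["genre"], prefix="genre", drop_first=True)

print(genres)
print(one_hot)
```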


#### 3.2 Code Cells

##### 3.2.1 Rename columns

In [None]:
# Display all column names
df0.columns

In [None]:
# Rename columns as needed
df0 = df0.rename(columns={'Work_accident': 'work_accident',
                          'average_montly_hours': 'average_monthly_hours',
                          'time_spend_company': 'tenure',
                          'Department': 'department'})

# Display all column names after the update
df0.columns

##### 3.2.2 Check Missing Data

In [None]:
# Check for missing values
df0.isna().sum()

##### 3.2.3 Check Duplicate Data

In [None]:
# Check for duplicates
duplicates = df0.duplicated().sum()

# Percentage of duplicated data
percentage = df0.duplicated().sum() / df0.shape[0] * 100

print(f'{duplicates} rows contain duplicates amounting to {percentage.round(2)}% of the total data.')

In [None]:
# Inspect some rows containing duplicates as needed
df0[df0.duplicated()].head()

###### 3.2.3.1 Resolve Duplicates

In [None]:
# Drop duplicates and save resulting dataframe in a new variable as needed
df1 = df0.drop_duplicates(keep='first')

# Display first few rows of new dataframe as needed
df1.head()

##### 3.2.4 Check Outliers

Best practice is to check for outliers in all variables of interest, because outliers can distort statistical analyses and machine learning models.

In [None]:
# Display general statistics about the dataframe which can help determine outliers
df1.describe()

###### 3.2.4.1 Boxplots

In [None]:
# Create a boxplot to visualize distribution of all numeric variables and detect any outliers

# plot 1 boxplot for all variables so must first normalize the scale
from sklearn.preprocessing import MinMaxScaler

# select numeric columns
num_columns = df1[['variable_1', 'variable_2', 'variable_3']]

#normalize values using min-max scaling
scaler = MinMaxScaler()
normalized_data = scaler.fit_transform(num_columns)

# Create df with normalized data
df_normalized = pd.DataFrame(normalized_data, columns=num_columns.columns)

sns.boxplot(data= df_normalized)
plt.xticks(rotation=45)

plt.show()

###### 3.2.4.2 Outlier investigation

In [None]:
# Determine the number of rows containing outliers for each variable that needs to be addressed

# Compute the 25th percentile value in `X_n`
percentile25 = df1['X_n'].quantile(0.25)

# Compute the 75th percentile value in `X_n`
percentile75 = df1['X_n'].quantile(0.75)

# Compute the interquartile range in `X_n`
iqr = percentile75 - percentile25

# Define the upper limit and lower limit for non-outlier values in `X_n`
upper_limit = percentile75 + 1.5 * iqr
lower_limit = percentile25 - 1.5 * iqr
print("Lower limit:", lower_limit)
print("Upper limit:", upper_limit)

# Identify subset of data containing outliers in `X_n`
outliers = df1[(df1['X_n'] > upper_limit) | (df1['X_n'] < lower_limit)]

# Count how many rows in the data contain outliers in `X_n`
print("Number of rows in the data containing outliers in `X_n`:", len(outliers))

###### 3.2.4.3 Outlier Resolution

Certain types of models are more sensitive to outliers than others. At the time of model construction, consider whether to remove outliers based on the type of model being used.
- **Delete them**: If you are sure the outliers are mistakes, typos, or errors and the dataset will be used for modeling or machine learning, then you are more likely to decide to delete outliers. Of the three choices, you’ll use this one the least.


In [None]:
# Use a Boolean mask to keep only the rows within the limits computed for `X_n` above
mask = (df1['X_n'] >= lower_limit) & (df1['X_n'] <= upper_limit)

df1 = df1[mask].copy()


- **Reassign them**: If the dataset is small and/or the data will be used for modeling or machine learning, you are more likely to choose a path of deriving new values to replace the outlier values.
    1. **Create a floor and ceiling at a quantile:** For example, you could place walls at the 90th and 10th percentile of the distribution of data values. Any value above the 90% mark or below the 10% mark is changed to fit within the walls you set.
    2. **Impute the average:** In some cases, it might be best to reassign all outlier values to match the median or mean value. This will ensure that your median and distribution are based solely on the non-outlier values, leaving the original outliers excluded.


In [None]:
# floor and ceiling method

# Calculate 10th percentile
tenth_percentile = np.percentile(df1['X_n'], 10)

# Calculate 90th percentile
ninetieth_percentile = np.percentile(df1['X_n'], 90)

# Cap values below the 10th percentile or above the 90th percentile at those thresholds
df1['X_n'] = df1['X_n'].clip(lower=tenth_percentile, upper=ninetieth_percentile)

In [None]:
# imputing the average

# Flag outliers in `X_n` using the limits computed above
is_outlier = (df1['X_n'] < lower_limit) | (df1['X_n'] > upper_limit)

# Calculate median of all NON-OUTLIER values
median = df1.loc[~is_outlier, 'X_n'].median()

# Impute the median for all outlier values
df1['X_n'] = np.where(is_outlier, median, df1['X_n'])

- **Leave them**: For a dataset that you plan to do EDA/analysis on and nothing else, or for a dataset you are preparing for a model that is resistant to outliers, it is most likely that you are going to leave them in.


##### 3.2.5 Convert Categorical to Numeric Data

In [None]:
# create list of columns that need to be encoded
columns_to_encode = ['x_1', 'x_2']

# Create a new dataframe with the selected columns one-hot encoded
df2 = pd.get_dummies(df1, columns=columns_to_encode)

df2.head()

##### 3.2.6 Check for class imbalance

**Class imbalance:** When a dataset has a target (outcome) variable that contains more instances of one class than another

**Balancing a Dataset**
1. **Downsampling:**  the process of making the minority class represent a larger share of the whole dataset simply by removing observations from the majority class. It is mostly used with datasets that are large. 

2. **Upsampling:** the opposite of downsampling, done when the dataset doesn’t have a very large number of observations in the first place. Instead of removing observations from the majority class, you increase the number of observations in the minority class.
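
Before resampling, it helps to measure how imbalanced the classes actually are. The sketch below does this with `value_counts()` on a hypothetical `target_class` column; the data is made up, and what counts as "too imbalanced" is a judgment call that depends on the dataset and the model.

```python
import pandas as pd

# Hypothetical target column: 8 observations of class 0, 2 of class 1
df = pd.DataFrame({"target_class": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]})

# Absolute number of observations per class
class_counts = df["target_class"].value_counts()

# Share of each class in the whole dataset
class_share = df["target_class"].value_counts(normalize=True)

print(class_counts)
print(class_share)
```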


In [None]:
# to downsample data use the resample() function from the sklearn.utils module.

from sklearn.utils import resample

# Separate your data into majority and minority classes
majority_data = df1[df1['target_class'] == 0]  # majority class
minority_data = df1[df1['target_class'] == 1]  # minority class

# Downsample the majority class
downsampled_majority = resample(majority_data, replace=False, n_samples=len(minority_data), random_state=42)

# Combine the downsampled majority class with the minority class
downsampled_data = pd.concat([downsampled_majority, minority_data])

# Check the class distribution of the downsampled data
downsampled_data['target_class'].value_counts()

In [None]:
# To upsample data use the resample() function from the sklearn.utils module.

from sklearn.utils import resample

# Separate your data into majority and minority classes
majority_data = df1[df1['target_class'] == 0]  # majority class
minority_data = df1[df1['target_class'] == 1]  # minority class

# Upsample the minority class
upsampled_minority = resample(minority_data, replace=True, n_samples=len(majority_data), random_state=42)

# Combine the upsampled minority class with the majority class
upsampled_data = pd.concat([majority_data, upsampled_minority])

# Check the class distribution of the upsampled data
upsampled_data['target_class'].value_counts()

---
<a name="4.-Joining"></a>
### 4. Joining

#### 4.1 Reference Guide

**Merging |** Method to combine two different data frames along a specified starting column


**Functions for combining data:**

| Function | Description |
| ---- | ---- |
| `df.merge()` | A method available to the DataFrame class. <br> Use df.merge() to take columns or indices from other dataframes and combine them with the one to which you’re applying the method. |
| `pd.concat()` | A pandas function to combine series and/or dataframes. <br> Use pd.concat() to join columns, rows, or dataframes along a particular axis. |
| `df.join()` | A method available to the DataFrame class. <br> Use df.join() to combine columns with another dataframe either on an index or on a key column. Efficiently join multiple DataFrame objects by index at once by passing a list. |

#### 4.2 Code Cells

In [3]:
# Use this section to add data as required
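
As a starting point, the sketch below shows the three combining functions from the reference guide on two small hypothetical tables. The table and column names (`employees`, `departments`, `department_id`) are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical tables, for illustration only
employees = pd.DataFrame({"emp_id": [1, 2, 3], "department_id": [10, 20, 10]})
departments = pd.DataFrame({"department_id": [10, 20], "department": ["sales", "it"]})

# merge: add department names by matching on a shared key column
merged = employees.merge(departments, on="department_id", how="left")

# concat: stack additional rows underneath an existing dataframe
new_rows = pd.DataFrame({"emp_id": [4], "department_id": [20]})
stacked = pd.concat([employees, new_rows], ignore_index=True)

# join: match a key column on the left against the index on the right
joined = employees.join(departments.set_index("department_id"), on="department_id")

print(merged)
```

A left merge keeps every row of `employees` even when no matching department exists, which makes unmatched keys visible as missing values rather than silently dropping rows.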

---
<a name="5.-Validating"></a>
### 5. Validating

#### 5.1 Reference Guide


**Input Validation |** The practice of thoroughly analyzing and double-checking to make sure data is complete, error-free, and high-quality.

**Why validate data?:**
- Make more accurate business decisions
- Improve complex model performance 
- Prevent future system crashes, coding issues, or wrong predictions

**Questions to ask while validating data:**
- Are all entries in the same format?
- Are all entries in the same range? 
- Are the applicable data entries expressed in the same data type?

#### 5.2 Code Cells
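
The three validation questions above (same format, same range, same data type) can be turned into simple checks. The sketch below is one possible approach on a hypothetical dataframe; the column names, the expected date format, and the 0 to 1 range are all assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical data with one deliberate problem per column
df = pd.DataFrame({
    "date": ["2022-11-25", "2022-11-26", "26/11/2022"],
    "satisfaction": [0.4, 1.7, 0.9],   # assumed valid range: 0 to 1
    "tenure": ["3", "5", "four"],      # assumed to be numeric
})

# Same format: entries that do not parse with the expected format become NaT
parsed_dates = pd.to_datetime(df["date"], format="%Y-%m-%d", errors="coerce")
bad_dates = df[parsed_dates.isna()]

# Same range: flag values outside the expected interval
out_of_range = df[~df["satisfaction"].between(0, 1)]

# Same data type: entries that cannot be converted to numbers become NaN
tenure_numeric = pd.to_numeric(df["tenure"], errors="coerce")
bad_tenure = df[tenure_numeric.isna()]

print(bad_dates, out_of_range, bad_tenure, sep="\n\n")
```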

---
<a name="6.-Presenting"></a>
### 6. Presenting

#### 6.1 Reference Guide



**Data visualization |** Refers to the graphical representation of data and information using visual elements such as charts, graphs, maps, and other visual aids. It is a way of presenting complex data in a visually appealing and easy-to-understand manner, allowing individuals to analyze and interpret patterns, trends, and relationships within the data.

**Data visualization serves multiple purposes, including:**

- Exploration and analysis: It helps in exploring and analyzing large datasets, enabling users to identify patterns, outliers, correlations, and insights that may not be immediately apparent in raw data.
- Communication and storytelling: Visualizing data enhances communication by presenting information in a concise and compelling manner. It allows individuals to effectively convey their findings, narratives, or arguments based on data to others.
- Decision-making: By presenting data visually, decision-makers can gain better insights and make informed decisions. Visual representations facilitate understanding and enable stakeholders to grasp complex information quickly.
- Identifying trends and patterns: Data visualizations help in identifying trends, patterns, and relationships between variables, enabling businesses and organizations to make data-driven decisions and predictions.
- Data exploration and hypothesis testing: Visualizations can aid in exploring data, formulating hypotheses, and testing assumptions. They provide a visual framework to analyze data from different angles and validate or invalidate hypotheses.

#### 6.2 Common Graphs

##### 6.2.1 Boxplots

- Box plots are very useful in visualizing distributions within data
- Can be deceiving without the context of the sample sizes that they represent 
    - Solution is to plot a stacked histogram alongside to visualize the distribution of data in boxplots

In [None]:
# Set figure and axes
fig, ax = plt.subplots(1, 2, figsize = (22,8))

# Create boxplot 
sns.boxplot(
    data=df1, 
    x='X_n', 
    y='Dependent Variable', 
    hue='Dependent Variable',   # specifies the variable from the data to be used for grouping or coloring the boxes
    orient="h",                 # determines the orientation of the box plot. "h" refers to horizontal, so the box plots will be drawn horizontally.
    ax=ax[0]                    # specifies the axes object on which the box plot will be drawn. ax[0] refers to the first subplot or axes object.
)
ax[0].invert_yaxis()            # used to invert the y-axis. This is done to have the dependent variable values displayed in descending order on the y-axis.
ax[0].set_title('Title', fontsize='14')

# Create histogram showing distribution
sns.histplot(
    data=df1,
    x='Y',                      # placeholder: the variable whose distribution (and sample sizes) you want to show
    hue='Dependent Variable',   # specifies the variable from the data to be used for grouping or coloring the histogram bars
    multiple='dodge',           # determines the method used to handle overlapping bars. 'dodge' means the bars are positioned side by side for different values of 'Dependent Variable'
    shrink=2,                   # scales the width of each bar relative to the bin width. A higher value like 2 makes the bars wider, while a lower value makes them narrower.
    ax=ax[1]                    # specifies the axes object on which the histogram will be drawn. ax[1] refers to the second subplot or axes object.
)
ax[1].set_title('Title', fontsize='14')

# Display the plots
plt.show()

##### 6.2.2 Histograms


A histogram is a graphical representation that displays the distribution of a continuous variable by dividing the data into intervals (bins) and representing the frequency or count of observations within each bin using vertical bars.

In [None]:
# Set figure and axes
fig, ax = plt.subplots(1, 2, figsize = (22,8))

# Define filtered data for plot 1 (replace the placeholder condition with your own)
df_filter1 = df1[df1['variable'] == 'value_1']

# Define filtered data for plot 2 (replace the placeholder condition with your own)
df_filter2 = df1[df1['variable'] == 'value_2']

# Plot 1 histogram
sns.histplot(
    data=df_filter1, 
    x='variable name', 
    hue='variable name', 
    discrete=True, 
    hue_order=['1', '2', '3'], 
    multiple='dodge', 
    shrink=.5, 
    ax=ax[0]
)
ax[0].set_title('Title', fontsize='14')

# Plot 2 histogram
sns.histplot(
    data=df_filter2, 
    x='variable name', 
    hue='variable name', 
    discrete=True, 
    hue_order=['1', '2', '3'], 
    multiple='dodge', 
    shrink=.5, 
    ax=ax[1]
)
ax[1].set_title('Title', fontsize='14')

# Display the plots
plt.show()

##### 6.2.3 Scatterplots


A scatter plot is a graphical representation that displays the relationship or correlation between two continuous variables by plotting individual data points as dots on a two-dimensional coordinate system.

In [None]:
# Create scatterplot of `X_1` versus `X_2`
plt.figure(figsize=(16, 9))

sns.scatterplot(
    data=df1, 
    x='X_1', 
    y='X_2', 
    hue='variable_x', 
    alpha=0.4
)
plt.axvline(x=insert_value, color='#ff6361', label='string_label', ls='--')
plt.legend(labels=['name1', 'name2', 'name3'])

plt.title('Title', fontsize='14');

##### 6.2.4 Heatmaps (correlation)

Checks for strong correlations between variables in the data.

In [None]:
# Plot a correlation heatmap
plt.figure(figsize=(12, 6))

heatmap = sns.heatmap(
    data= df0.select_dtypes(include='number').corr(), 
    vmin=-1, 
    vmax=1, 
    annot=True, 
    cmap=sns.color_palette("vlag", as_cmap=True)
)

heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':14}, pad=10);