# Module 04 Lab - Exploratory Data Analysis

**Objective:** To learn how to explore a dataset to find patterns, anomalies, and insights before modeling. EDA is like being a detective; you are looking for clues in the data that will help you build a better model.

**This lab is fully coded.** Your task is to run each cell, read the detailed explanations, understand the purpose of each visualization, and then complete the experimentation section.

## Part 1: Setup and Data Loading

**What is Exploratory Data Analysis (EDA)?**

EDA is the process of using summary statistics and visualizations to understand a dataset's main characteristics. Before you can build a model, you need to understand your data. What stories does it tell? Are there errors or missing values? Are there strong relationships between variables? EDA helps answer these questions.

We will use the famous Titanic dataset for this lab. It contains information about passengers and, crucially, whether they survived the disaster.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns  # Seaborn is a library built on top of Matplotlib that makes creating beautiful plots easier.

# Load the dataset directly from a URL
df = pd.read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')

print("--- First 5 Rows ---")
print(df.head())

print("--- Basic Info ---")
# .info() is a great first command. It tells us the column names, how many non-null values are in each column, and their data types.
# Notice that 'Age' and 'Cabin' have missing values!
df.info()
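
The `.info()` output hints at missing values, but it can help to count them explicitly. The sketch below is one possible way to do that. It uses a small, hypothetical `sample` DataFrame with Titanic-style column names (not the real data) so that it runs on its own; in the lab you would apply the same calls to `df`.

```python
import pandas as pd

# Hypothetical five-row sample with Titanic-style columns (an assumption, not the real data).
sample = pd.DataFrame({
    "Age": [22.0, None, 26.0, 35.0, None],
    "Cabin": [None, "C85", None, "C123", None],
    "Embarked": ["S", "C", "S", None, "Q"],
    "Survived": [0, 1, 1, 1, 0],
})

# Count missing values per column, then express them as a share of all rows.
missing_count = sample.isna().sum()
missing_share = sample.isna().mean()

# Combine both views and list the columns with the most gaps first.
missing_report = pd.DataFrame({
    "missing": missing_count,
    "share": missing_share,
}).sort_values("missing", ascending=False)

print(missing_report)
```

Replacing `sample` with `df` should give the same kind of report for the full dataset.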

## Part 2: Descriptive Statistics

Let's start by getting a high-level numerical summary of the data. The `.describe()` method is perfect for this. It calculates statistics like mean, standard deviation, min, and max for the numerical columns.

In [None]:
# Get summary statistics for numerical columns
print("--- Descriptive Statistics ---")
print(df.describe())

print("--- Key Insights from Statistics ---")
print(f"The average age of a passenger was {df['Age'].mean():.1f} years.")
print(f"The overall survival rate was {df['Survived'].mean():.1%}.")
print(f"Fares ranged from ${df['Fare'].min()} to a whopping ${df['Fare'].max()}.")
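
By default `.describe()` only covers numerical columns, so text columns such as `Sex` or `Embarked` are left out. The sketch below shows one way to summarize those as well, and compares mean and median for a skewed column. The `sample` DataFrame and its values are hypothetical stand-ins for `df`.

```python
import pandas as pd

# Hypothetical sample with Titanic-style columns (an assumption, not the real data).
sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male"],
    "Embarked": ["S", "C", "S", "S", "Q"],
    "Fare": [7.25, 71.28, 7.92, 8.05, 8.46],
})

# Summarize the non-numeric columns: count, number of unique values,
# most frequent value (top) and how often it occurs (freq).
categorical_summary = sample.select_dtypes(exclude="number").describe()
print(categorical_summary)

# For a skewed column, one large value pulls the mean well above the median.
fare_mean = sample["Fare"].mean()
fare_median = sample["Fare"].median()
print(f"Mean fare: {fare_mean:.2f}, median fare: {fare_median:.2f}")
```

A large gap between mean and median is a quick hint that a column contains outliers worth plotting.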

## Part 3: Visual EDA - Telling Stories with Plots

Numbers are great, but plots make patterns and relationships immediately obvious. The goal of visual EDA is to turn data into insights.

**A Note on Plotting Libraries:**

*   **Matplotlib:** The foundational library, gives you full control over everything.
*   **Seaborn:** Built on Matplotlib, it provides a simpler, high-level interface for creating common statistical plots. We will use Seaborn for its ease of use and attractive defaults.

### Visualization 1: How many survived?

A simple `countplot` is the best way to see the distribution of a categorical variable, like our target `Survived`.

In [None]:
sns.set_style('whitegrid')  # Sets a nice visual style for our plots

plt.figure(figsize=(8, 6))
sns.countplot(x='Survived', data=df)
plt.title('Survival Distribution (0 = Died, 1 = Survived)')
plt.show()

print("Insight: More passengers died than survived. The imbalance is moderate here, but imbalanced datasets can sometimes be a challenge for machine learning models.")
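
A count plot shows the imbalance visually; `value_counts(normalize=True)` can put a number on it. This is a minimal sketch using a hypothetical ten-value `survived` Series rather than the real column.

```python
import pandas as pd

# Hypothetical target column (an assumption): 7 passengers died, 3 survived.
survived = pd.Series([0, 0, 1, 0, 1, 0, 0, 1, 0, 0], name="Survived")

# normalize=True returns proportions instead of raw counts.
class_share = survived.value_counts(normalize=True).sort_index()
print(class_share)

# Share of the largest class: a simple baseline that any model should beat.
majority_share = class_share.max()
print(f"Majority class share: {majority_share:.0%}")
```

In the lab, `df['Survived'].value_counts(normalize=True)` would give the corresponding proportions for the full dataset.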

### Visualization 2: Does passenger class matter for survival?

Now we want to see if there's a relationship between two variables. We can use the `hue` parameter in `countplot` to split the bars by another category.

In [None]:
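# Map the 0/1 target to readable labels so the legend entries always match the bar colors.
plot_df = df.assign(Outcome=df['Survived'].map({0: 'Died', 1: 'Survived'}))

plt.figure(figsize=(10, 6))
sns.countplot(x='Pclass', hue='Outcome', hue_order=['Died', 'Survived'], data=plot_df)
plt.title('Survival by Passenger Class')
plt.show()

print("Insight: This is a very strong pattern. 1st class passengers had a much higher chance of survival compared to 3rd class passengers. Money seems to have made a difference.")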
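
Because the classes contain different numbers of passengers, raw counts can be hard to compare. Since `Survived` is coded 0/1, its mean within each group is the survival rate. The sketch below illustrates this on a hypothetical nine-row `sample`; the numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical sample (an assumption, not the real data).
sample = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1, 0],
})

# Group size and survival rate per class. The mean of a 0/1 column
# equals the fraction of rows that are 1.
by_class = sample.groupby("Pclass")["Survived"].agg(
    passengers="size",
    survival_rate="mean",
)
print(by_class)
```

Running the same `groupby` on `df` is one way to check whether the pattern in the plot holds up as a rate.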

### Visualization 3: What about gender?

The 'women and children first' mantra is famous. Let's see if the data supports it.

In [None]:
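# Map the 0/1 target to readable labels so the legend entries always match the bar colors.
plot_df = df.assign(Outcome=df['Survived'].map({0: 'Died', 1: 'Survived'}))

plt.figure(figsize=(10, 6))
sns.countplot(x='Sex', hue='Outcome', hue_order=['Died', 'Survived'], data=plot_df)
plt.title('Survival by Gender')
plt.show()

print("Insight: The pattern is undeniable. A much higher proportion of females survived compared to males. This is another very strong predictor.")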
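
The insight above talks about proportions, while the plot shows counts. A row-normalized cross-tabulation is one way to read the proportions directly. The `sample` below is a hypothetical eight-row stand-in for `df`.

```python
import pandas as pd

# Hypothetical sample (an assumption, not the real data).
sample = pd.DataFrame({
    "Sex":      ["female", "female", "female", "male", "male", "male", "male", "female"],
    "Survived": [1, 1, 0, 0, 0, 1, 0, 1],
})

# normalize="index" divides each row by its total, so every row sums to 1
# and the column for Survived == 1 is the survival rate per gender.
rates = pd.crosstab(sample["Sex"], sample["Survived"], normalize="index")
print(rates)
```

With `normalize="index"` omitted, `pd.crosstab` returns the raw counts that the count plot displays.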

### Visualization 4: How does age play a role?

For a continuous variable like `Age`, a histogram is a great way to see its distribution.

In [None]:
# A FacetGrid allows us to create multiple plots side-by-side to compare distributions.
# Here, col='Survived' creates one histogram for passengers who died (Survived = 0) and one for those who survived (Survived = 1).
# Rows with a missing 'Age' are dropped first so the histogram only receives valid numbers.
g = sns.FacetGrid(df.dropna(subset=['Age']), col='Survived', height=6)
g.map(plt.hist, 'Age', bins=20)
plt.show()

print("Insight: The age distribution for those who did not survive is centered around the 20-40 age range. For those who survived, there is a noticeable spike for young children. This supports the 'children' part of the mantra.")
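
To back up the histogram with numbers, a continuous column can be grouped into bands with `pd.cut` and then summarized per band. The sketch below uses a hypothetical ten-row `sample`; the bin edges and labels are illustrative assumptions, not values prescribed by the lab.

```python
import pandas as pd

# Hypothetical sample with one missing age (an assumption, not the real data).
sample = pd.DataFrame({
    "Age":      [4, 8, 16, 25, 31, 38, 45, 62, None, 2],
    "Survived": [1, 1, 0, 0, 1, 0, 0, 0, 1, 1],
})

# Illustrative bin edges; each interval is open on the left and closed on the right,
# e.g. (0, 12] for the first band. Rows with a missing age get no band.
bins = [0, 12, 18, 40, 80]
labels = ["0-12", "13-18", "19-40", "41-80"]
sample["AgeBand"] = pd.cut(sample["Age"], bins=bins, labels=labels)

# Survival rate per age band; rows without a band are excluded by groupby.
rate_by_band = sample.groupby("AgeBand", observed=True)["Survived"].mean()
print(rate_by_band)
```

Different bin edges can change the picture, so it is worth trying a few before drawing conclusions.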

## Part 4: Student Experimentation

**Instructions:** Create your own visualizations to explore other relationships in the data.

### Experiment 1: Port of Embarkation

1.  The `Embarked` column tells you where a passenger boarded the ship (C = Cherbourg, Q = Queenstown, S = Southampton).
2.  Create a `countplot` to see how survival rates differed by the port of embarkation. Does where they boarded seem to be related to their survival?

In [None]:
# --- ENTER YOUR CODE HERE ---

### Experiment 2: Fare vs. Survival

1.  `Fare` is a continuous numerical variable.
2.  A `boxplot` or `violinplot` is a great way to see the distribution of fares for those who survived vs. those who didn't.
3.  Create one of these plots to compare the `Fare` distribution by `Survived`. What does it tell you?

In [None]:
# --- ENTER YOUR CODE HERE ---

## 📝 Knowledge Check

**Instructions:** Answer the following questions in this markdown cell.

1.  **What is the primary goal of Exploratory Data Analysis (EDA)?**
2.  **Based on the plots in this lab, what kind of person had the best chance of surviving the Titanic?** (Describe them in terms of class, gender, and age).
3.  **Why is it important to visualize data instead of just looking at summary statistics?** What can a plot show you that a number like 'mean' or 'count' can't?

**[ENTER YOUR ANSWERS HERE]**