# 01_Foundations_EDA.ipynb

---
## Week 1: Foundations & Data Analysis

### Notebook Overview
This notebook covers:

- **Math & Probability Refresher** (Monday–Tuesday)
  - Linear Algebra basics (matrix multiplication, eigenvalues, SVD basics)
  - Calculus basics (derivatives, chain rule)
  - Probability distributions, Bayes’ theorem, basic statistical inference

- **Python Data Handling & Exploratory Data Analysis** (Wednesday–Thursday)
  - Pandas for data cleaning, merges, groupby operations
  - Data visualization with Matplotlib/Seaborn (histograms, boxplots, correlation heatmaps)

- **Mini-Project** (Friday)
  - Short EDA pipeline on a chosen dataset (Titanic or Iris)
  - Insights & next steps

- **Weekend Consolidation**
  - Review math concepts and code
  - **ADHD Tip**: Reward yourself for accomplishing tasks

---

## 1. Introduction & Objectives

**Objective**: By the end of this week, you should be able to:
1. Refresh key math concepts relevant to ML (linear algebra, calculus, probability).
2. Use Python libraries (NumPy, Pandas, Matplotlib, Seaborn) for data manipulation & visualization.
3. Perform a thorough Exploratory Data Analysis (EDA) on a real dataset.
4. Understand how basic data analysis leads to business insights or decisions.

**Industry Context**: EDA is a crucial first step in any data science or AI/ML project. It helps uncover patterns, anomalies, or insights that guide modeling and decision-making.

---

## 2. Monday–Tuesday: Math & Probability Refresher

### 2.1 Linear Algebra Basics

<details>
  <summary><strong>Key Definitions & Concepts</strong></summary>

  - **Matrix Multiplication**: Combining two matrices A and B to produce C, where each element c<sub>ij</sub> is the dot product of row i of A and column j of B.
  - **Eigenvalues and Eigenvectors**: For a matrix A, if Av = λv, then λ is an eigenvalue and v is the corresponding eigenvector.
  - **Singular Value Decomposition (SVD)**: Factorizing a matrix M into U Σ Vᵀ, which is used in dimensionality reduction and other ML techniques.
</details>

#### 2.1.1 Demonstrating Matrix Operations in NumPy

```python
import numpy as np

# TODO: Example - Create two matrices A and B, then multiply them
A = np.array([[1, 2], 
              [3, 4]])
B = np.array([[5, 6], 
              [7, 8]])

# Matrix multiplication
C = A @ B  # the @ operator is equivalent to A.dot(B)
print("Matrix A:\n", A)
print("Matrix B:\n", B)
print("A dot B = \n", C)

# OPTIONAL: Demonstrate some other operations, e.g. SVD
# TODO: Implement a small snippet that performs np.linalg.svd on A or B
```
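
The optional SVD step, together with the eigenvalue definition from the concepts list, could be sketched as follows. This is a minimal illustration on the same matrix `A`; the variable names (`A_rebuilt`, `eigvals`, `eigvecs`) are chosen here for illustration and are not part of the original notebook.

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])

# SVD: A = U @ diag(S) @ Vt, with the singular values in S sorted in descending order
U, S, Vt = np.linalg.svd(A)
A_rebuilt = U @ np.diag(S) @ Vt
print("Singular values:", S)
print("Reconstruction matches A:", np.allclose(A_rebuilt, A))

# Eigen-decomposition: each column v of eigvecs satisfies A @ v = lambda * v
eigvals, eigvecs = np.linalg.eig(A)
for lam, v in zip(eigvals, eigvecs.T):
    print(f"lambda = {lam:.4f}, A @ v equals lambda * v:", np.allclose(A @ v, lam * v))
```

Checking the factorization with `np.allclose` rather than `==` avoids false mismatches caused by floating-point rounding.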

Your Observations:

- [Write here what you notice about the matrix multiplication result.]

### 2.2 Calculus Refresher (Derivatives & Chain Rule)
<details>
  <summary><strong>Key Definitions & Concepts</strong></summary>

  - **Derivative**: Rate of change of a function f(x) with respect to x.
  - **Chain Rule**: If y = f(g(x)), then dy/dx = f'(g(x)) * g'(x).
</details>

Example: symbolic derivative using SymPy (optional):
```python
import sympy as sp

x = sp.Symbol('x', real=True)
function = x**2 + 3*x + 5
derivative = sp.diff(function, x)
print("Function: ", function)
print("Derivative: ", derivative)
```
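
The chain rule can be checked symbolically in the same way. The sketch below uses sin(x²) as an example composite function; that choice, and the helper symbol `u`, are assumptions made for illustration only.

```python
import sympy as sp

x = sp.Symbol('x', real=True)
u = sp.Symbol('u', real=True)

inner = x**2                      # g(x)
outer = sp.sin(u)                 # f(u)
composite = outer.subs(u, inner)  # f(g(x)) = sin(x**2)

# Chain rule by hand: f'(g(x)) * g'(x)
by_chain_rule = sp.diff(outer, u).subs(u, inner) * sp.diff(inner, x)

# Direct differentiation of the composite, for comparison
direct = sp.diff(composite, x)

print("Chain rule:", by_chain_rule)
print("Direct:    ", direct)
print("Equal:", sp.simplify(by_chain_rule - direct) == 0)
```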

Your Observations:

[Write notes on how derivatives might relate to gradient descent or backprop.]
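
As one possible starting point for these notes, here is a minimal gradient descent sketch on the same function, f(x) = x² + 3x + 5, whose derivative is 2x + 3 and whose minimum lies at x = -1.5. The starting point, learning rate, and step count are arbitrary assumptions, not recommended settings.

```python
def f(x):
    return x**2 + 3*x + 5

def f_prime(x):
    # Derivative of f, matching the SymPy result 2*x + 3
    return 2*x + 3

x_current = 4.0        # arbitrary starting point (assumption)
learning_rate = 0.1    # arbitrary step size (assumption)

for step in range(100):
    # Move against the derivative, i.e. downhill
    x_current -= learning_rate * f_prime(x_current)

print(f"Estimated minimizer: {x_current:.4f}")
print(f"f at that point:     {f(x_current):.4f}")
```

Backpropagation applies the same idea to many parameters at once, using the chain rule to obtain each derivative.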


### 2.3 Probability & Basic Statistical Inference

<details>
  <summary><strong>Key Definitions & Concepts</strong></summary>

  - **Probability Distribution**: Describes how probabilities are assigned to all possible outcomes.
  - **Bayes’ Theorem**: P(A|B) = [P(B|A) * P(A)] / P(B).
  - **Statistical Inference**: Drawing conclusions about a population based on a sample.
</details>
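
A worked instance of Bayes' theorem may make the formula more concrete. The scenario (a diagnostic test) and all three input probabilities below are hypothetical numbers chosen only for illustration.

```python
# Hypothetical inputs, chosen only for illustration
p_disease = 0.01             # P(A): prior probability of having the condition
p_pos_given_disease = 0.95   # P(B|A): test is positive when the condition is present
p_pos_given_healthy = 0.05   # P(B|not A): false positive rate

# Law of total probability: P(B) = P(B|A) P(A) + P(B|not A) P(not A)
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes' theorem: P(A|B) = P(B|A) P(A) / P(B)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos

print(f"P(positive)           = {p_pos:.4f}")
print(f"P(disease | positive) = {p_disease_given_pos:.4f}")
```

With these inputs the posterior is roughly 16%, far below the 95% test sensitivity, because the condition is rare to begin with.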

```python
import numpy as np

# Example: Generating random data from a normal distribution
data = np.random.normal(loc=0.0, scale=1.0, size=1000)

mean = np.mean(data)
std_dev = np.std(data)  # population formula (ddof=0); pass ddof=1 for the sample estimate

print(f"Mean of data: {mean:.3f}")
print(f"Std Dev of data: {std_dev:.3f}")

# TODO: Try some probability calculations or small examples of Bayes' theorem by hand or code
```
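
To connect the simulated data to statistical inference, the following sketch estimates a 95% confidence interval for the mean using the normal approximation. The fixed seed and the z-value of 1.96 are assumptions made for reproducibility and simplicity; a t-based interval would be the more careful choice for small samples.

```python
import numpy as np

rng = np.random.default_rng(seed=42)   # fixed seed so the run is repeatable
sample = rng.normal(loc=0.0, scale=1.0, size=1000)

n = sample.size
sample_mean = sample.mean()
sample_std = sample.std(ddof=1)        # ddof=1 gives the sample standard deviation
standard_error = sample_std / np.sqrt(n)

z = 1.96                               # normal-approximation multiplier for 95%
ci_low = sample_mean - z * standard_error
ci_high = sample_mean + z * standard_error

print(f"Sample mean: {sample_mean:.3f}")
print(f"95% CI for the mean: ({ci_low:.3f}, {ci_high:.3f})")
```
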

Short Exercise:

1. Compute P(A|B) for a simple scenario with known probabilities.
2. Write a quick example that demonstrates conditional probability.

Your Observations:

[Write down any insights or “aha!” moments about probability distributions, e.g., how data might cluster around the mean.]

## 3. Wednesday–Thursday: Python Data Handling & EDA

### 3.1 Pandas for Data Cleaning & Manipulation
<details>
  <summary><strong>Key Concepts</strong></summary>

  - **DataFrame**: Main data structure in pandas, designed for labeled data.
  - **`.merge()`, `.groupby()`, `.pivot_table()`**: Methods to combine and aggregate data.
  - **Handling Missing Values**: `df.fillna()`, `df.dropna()`.
</details>
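
The methods listed above can be tried on a tiny example before moving to a real dataset. The two DataFrames below (`passengers`, `outcomes`) and their values are made up for illustration and do not come from the Titanic file.

```python
import pandas as pd

# Hypothetical toy tables, invented for this sketch
passengers = pd.DataFrame({
    "passenger_id": [1, 2, 3, 4],
    "pclass": [1, 3, 3, 2],
    "sex": ["female", "male", "female", "male"],
    "fare": [80.0, 7.5, None, 13.0],
})
outcomes = pd.DataFrame({
    "passenger_id": [1, 2, 3, 4],
    "survived": [1, 0, 1, 0],
})

# Handle missing values: fill the missing fare with the column median
passengers["fare"] = passengers["fare"].fillna(passengers["fare"].median())

# merge: combine the two tables on their shared key
merged = passengers.merge(outcomes, on="passenger_id", how="left")

# groupby: mean survival per passenger class
rate_by_class = merged.groupby("pclass")["survived"].mean()

# pivot_table: survival rate broken down by class (rows) and sex (columns)
pivot = merged.pivot_table(values="survived", index="pclass",
                           columns="sex", aggfunc="mean")

print(merged)
print(rate_by_class)
print(pivot)
```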

```python
import pandas as pd

# TODO: Load the Titanic or Iris dataset
# For Titanic:
url_titanic = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df_titanic = pd.read_csv(url_titanic)

df_titanic.head()
```

#### Data Cleaning Steps (example placeholders):
```python
# TODO: Check for missing values
df_titanic.isnull().sum()

# TODO: Drop or fill missing values
df_titanic['Age'] = df_titanic['Age'].fillna(df_titanic['Age'].median())

# TODO: Convert categorical columns if needed
df_titanic['Sex'] = df_titanic['Sex'].map({'male': 0, 'female': 1})
```
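
Before choosing between dropping and filling, it can help to see how much of each column is missing. The helper below, `missing_report`, is a hypothetical function written for this sketch (it is not a pandas API), and the toy DataFrame is invented for illustration.

```python
import numpy as np
import pandas as pd

def missing_report(df):
    """Return missing counts and percentages per column, worst first."""
    counts = df.isnull().sum()
    report = pd.DataFrame({
        "n_missing": counts,
        "pct_missing": (100 * counts / len(df)).round(1),
    })
    # Keep only columns that actually have gaps
    return report[report["n_missing"] > 0].sort_values("n_missing", ascending=False)

# Hypothetical toy data with a few gaps
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 8.05, 53.10],
})

report = missing_report(toy)
print(report)
```

The same function could be called on `df_titanic` once it has been loaded.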



### 3.2 Basic Data Exploration & Visualization
<details>
  <summary><strong>Key Visualization Tools</strong></summary>

  - **Matplotlib**: Lower-level plotting library.
  - **Seaborn**: Built on Matplotlib, offers a high-level API for statistical plots.
  - **Common Plots**: histograms, bar charts, box plots, pair plots, heatmaps.
</details>
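
Box plots are listed among the common plots, so here is a minimal sketch of one. It runs on synthetic data generated in the block itself (random class labels and ages, an assumption made so the example is self-contained); swapping in `df_titanic` would be the natural next step.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic stand-in data, not real passenger records
rng = np.random.default_rng(seed=0)
sample = pd.DataFrame({
    "Pclass": rng.choice([1, 2, 3], size=200),
    "Age": rng.normal(loc=30, scale=12, size=200).clip(min=0),
})

# One box per class: median, quartiles, and outliers of Age
plt.figure(figsize=(8, 5))
ax = sns.boxplot(data=sample, x="Pclass", y="Age")
ax.set_title("Age by Passenger Class (synthetic data)")
plt.show()
```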

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Distribution plot of Age
plt.figure(figsize=(8,5))
sns.histplot(df_titanic['Age'], bins=30, kde=True)
plt.title("Age Distribution of Titanic Passengers")
plt.show()

# Correlation heatmap
plt.figure(figsize=(8,6))
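# Restrict to numeric columns; text columns such as Name or Ticket cannot be correlated
sns.heatmap(df_titanic.select_dtypes(include="number").corr(), annot=True, cmap="coolwarm")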
plt.title("Correlation Heatmap")
plt.show()
```

Short Exercise:

1. Explore the relationship between Sex and Survived.
2. Create a groupby operation to see survival rates by passenger class.
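
One possible way to approach the first exercise is a normalized cross-tabulation. The sketch below uses a six-row toy table invented for illustration, with `Sex` already encoded as 0 = male and 1 = female to match the mapping applied during cleaning; results on the real data will differ.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Titanic file
toy = pd.DataFrame({
    "Sex":      [0, 0, 0, 1, 1, 1],   # 0 = male, 1 = female
    "Survived": [0, 0, 1, 1, 1, 0],
})

# normalize="index" turns counts into row-wise proportions,
# so each row shows the outcome split within one group
rates = pd.crosstab(toy["Sex"], toy["Survived"], normalize="index")
print(rates)
```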

Your Observations:

[Write your interpretation of the correlation or any interesting patterns in the data.]

## 4. Friday: Mini-Project – EDA

### 4.1 Task Description

Combine all steps from earlier in the week:
1. Data Loading & Cleaning (handle missing values, transform columns as needed).
2. Descriptive Statistics & Visualizations (histograms, boxplots, correlation).
3. Insights & Possible Next Steps (explain potential features to engineer for modeling).

```python
# TODO: Consolidate your final EDA code here
# Example structure:
# 1. Load dataset
# 2. Clean missing values
# 3. Exploratory visuals
# 4. Summarize key findings
```
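
The structure above could be organized into small functions, as in the sketch below. The function names (`clean`, `summarize`), their parameters, and the toy input are all hypothetical choices for illustration; loading and plotting are left out so the block stays self-contained.

```python
import numpy as np
import pandas as pd

def clean(df):
    """Fill numeric gaps with each column's median; leave other columns untouched."""
    out = df.copy()
    numeric_cols = out.select_dtypes(include="number").columns
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].median())
    return out

def summarize(df, target, group_col):
    """Collect a few headline numbers for the write-up."""
    numeric = df.select_dtypes(include="number")
    return {
        "n_rows": len(df),
        "n_missing_numeric": int(numeric.isnull().sum().sum()),
        "target_rate": float(df[target].mean()),
        "rate_by_group": df.groupby(group_col)[target].mean().to_dict(),
    }

# Hypothetical toy input standing in for the loaded dataset
raw = pd.DataFrame({
    "Pclass": [1, 1, 3, 3],
    "Age": [38.0, np.nan, 22.0, 26.0],
    "Survived": [1, 1, 0, 1],
})

findings = summarize(clean(raw), target="Survived", group_col="Pclass")
print(findings)
```

Keeping cleaning and summarizing separate makes it easier to rerun the summary after trying a different imputation strategy.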



### 4.2 Industry Context
How can such EDA inform business decisions?

- Example: A travel company analyzing Titanic-like data might segment customers based on certain demographics.
- Risk analysis: Understanding who is more likely to "survive" in a business sense, e.g., which customers are retained and which churn.

Your Notes:

[Add your own reflections on EDA’s importance in real-world scenarios.]

## 5. Weekend: Consolidation
Review:
1. Revisit linear algebra, probability, and calculus notes.
2. Make sure you’re comfortable with Pandas data manipulation.

## Additional Resources (Optional)
- Linear Algebra Review (Khan Academy)
- Probability & Statistics (Khan Academy)
- Pandas Official Documentation
- Seaborn Official Documentation