### In Session 01, you learned:

- How Python executes code
- How variables store information
- Why automation matters
- How programming complements SQL

Now we move from writing Python to using Python for data analysis.

## The Analytical Workflow

Most data analyses follow the same basic sequence of steps:
```mermaid
flowchart LR
    A[Load Data] --> B[Inspect Structure]
    B --> C[Clean & Transform]
    C --> D[Analyze & Aggregate]
    D --> E[Visualize]
```

## Why pandas?

Python's built-in data structures, such as lists and dictionaries, are not optimized for table-based data.

We use pandas, the core data manipulation package in the Python ecosystem.

It provides:
- DataFrames (table-like structures)
- Powerful filtering
- Fast aggregation
- SQL-like transformations
- Integration with visualization libraries

![the pandas pyramid](/img/python/session1/pandas pyramid.png)

### Loading Data

Most real-world data arrives as:
- CSV files
- Excel files
- Database exports

#### Loading a CSV File

```python
import pandas as pd

df = pd.read_csv("sales.csv")
```
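
Real files often need a few extra arguments. The sketch below reads from an in-memory string instead of a file on disk so that it runs anywhere; the semicolon delimiter and the column names (`order_id`, `order_date`, `price`, `quantity`) are assumptions made for illustration, not the layout of the course's `sales.csv`.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file on disk.
csv_text = """order_id;order_date;price;quantity
1;2024-01-05;19.99;2
2;2024-01-06;5.50;10
3;2024-01-06;12.00;1
"""

df = pd.read_csv(
    io.StringIO(csv_text),       # a path such as "sales.csv" works the same way
    sep=";",                     # delimiter, if the file is not comma-separated
    parse_dates=["order_date"],  # convert this column to datetime while loading
)

print(df.dtypes)
```

Parsing dates at load time saves a separate conversion step later, but it is optional; the plain `pd.read_csv("sales.csv")` call is enough for a first look.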

#### First Inspection

```python
df.head()
df.shape
df.info()
```

##### What These Do

`head()` → shows the first rows (5 by default)

`shape` → a `(rows, columns)` tuple

`info()` → column data types and non-null counts, which reveal missing values
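
A small sketch of these calls on a toy table. The data and column names are made up for illustration, and `isna().sum()` is added here as a common companion to `info()` for counting missing values per column.

```python
import numpy as np
import pandas as pd

# Toy data with one missing price (values are illustrative only).
df = pd.DataFrame({
    "product": ["pen", "book", "lamp", "desk"],
    "price": [1.5, 12.0, np.nan, 150.0],
    "quantity": [10, 3, 1, 2],
})

print(df.head(2))            # first 2 rows; head() with no argument shows 5
n_rows, n_cols = df.shape    # shape is an attribute, so no parentheses
print(n_rows, n_cols)
df.info()                    # prints dtypes and non-null counts

missing = df.isna().sum()    # number of missing values in each column
print(missing)
```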

#### Info

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2],
                   'B': [1.0, 3.0],
                   'C': ['a', 'b']})

df.info()
```

Output: 
```markdown
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       2 non-null      int64
 1   B       2 non-null      float64
 2   C       2 non-null      object
dtypes: float64(1), int64(1), object(1)
memory usage: 200.0+ bytes
```

#### Describe: `df.describe()`

Generates summary statistics for numeric columns:
- count
- mean
- std
- min / max
- quartiles (25%, 50%, 75%)

This is your first statistical snapshot of the dataset.

![describe output](/img/python/02/describe1.png)
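
To see what `describe()` returns, the sketch below runs it on a tiny made-up table where the statistics are easy to check by hand. The column names and values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "price": [10.0, 20.0, 30.0, 40.0],
    "category": ["A", "B", "A", "C"],
})

summary = df.describe()   # only the numeric column "price" is summarized
print(summary)

# The result is itself a DataFrame, so single statistics can be looked up.
print(summary.loc["mean", "price"])   # 25.0
print(summary.loc["50%", "price"])    # median: 25.0
```

The text column `category` is skipped by default, which is why `value_counts()` is the usual tool for categorical columns.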

## Cleaning & Transforming Data

Raw data is rarely analysis-ready.

Common tasks:

- Removing duplicates
- Filtering invalid rows
- Creating new columns
- Changing data types
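
Of these tasks, changing data types is the one without its own subsection, so here is a minimal sketch. The columns and values are hypothetical; the assumption is that everything arrived as text, which is common after an export.

```python
import pandas as pd

# Every column starts out as text.
df = pd.DataFrame({
    "price": ["19.99", "5.50", "n/a"],
    "quantity": ["2", "10", "1"],
    "order_date": ["2024-01-05", "2024-01-06", "2024-01-06"],
})

df["price"] = pd.to_numeric(df["price"], errors="coerce")  # "n/a" becomes NaN
df["quantity"] = df["quantity"].astype(int)                # raises on unparseable values
df["order_date"] = pd.to_datetime(df["order_date"])        # text to datetime

print(df.dtypes)
```

`errors="coerce"` is a deliberate choice here: it keeps the row and marks the bad value as missing, so it can be inspected or filtered afterwards.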
    

### Filtering Data

Equivalent to SQL WHERE.

```python
df_filtered = df[df["price"] > 0]
```

This keeps only rows where price is positive.
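
Filters can also combine several conditions, much like `AND`, `OR`, and `NOT` in SQL. A minimal sketch on made-up data (the `category` column and its values are assumptions for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "price": [10.0, -5.0, 0.0, 25.0, 40.0],
    "category": ["A", "A", "B", "B", "C"],
})

# Each condition goes in its own parentheses; use & (and), | (or), ~ (not).
mask = (df["price"] > 0) & (df["category"].isin(["A", "B"]))
valid_ab = df[mask]

# Inverting a condition selects the rows that would be dropped.
invalid = df[~(df["price"] > 0)]

print(valid_ab)
print(invalid)
```

Looking at the rejected rows (`invalid`) before discarding them is a cheap way to confirm the filter does what was intended.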

### Creating New Columns

```python
df["revenue"] = df["price"] * df["quantity"]
```

You are extending the dataset with derived metrics.
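
A short sketch of derived columns on toy data, assuming revenue is defined as price times quantity. The `is_large_order` flag and its threshold of 50 are hypothetical and only show that a new column can also hold a condition.

```python
import pandas as pd

df = pd.DataFrame({
    "price": [10.0, 2.5, 100.0],
    "quantity": [3, 4, 1],
})

df["revenue"] = df["price"] * df["quantity"]   # computed element-wise, row by row
df["is_large_order"] = df["revenue"] >= 50     # boolean flag (arbitrary threshold)

print(df)
```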

### Removing Duplicates

```python
df = df.drop_duplicates()
```

Data consistency is essential for reliable results.
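
Before dropping anything, it helps to count how many duplicates exist. The sketch below uses invented order data; the `subset` and `keep` arguments are optional extras not used in the one-liner above.

```python
import pandas as pd

df = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "customer_id": [10, 11, 11, 10],
    "price": [5.0, 7.0, 7.0, 9.0],
})

print(df.duplicated().sum())   # fully duplicated rows: 1

deduped = df.drop_duplicates()  # drop rows that match in every column

# Keep only the first row seen for each customer.
first_per_customer = df.drop_duplicates(subset="customer_id", keep="first")
print(first_per_customer)
```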

## Value Counts

To understand categorical variables:

```python
df["category"].value_counts()
```

This shows:

- Frequency per category
- Distribution patterns
- Potential data imbalance
    

> Often used in customer segmentation and behavioral analysis.
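
Two optional arguments are worth knowing: `normalize=True` returns shares instead of counts, and `dropna=False` includes missing values. A sketch on made-up categories:

```python
import pandas as pd

df = pd.DataFrame({"category": ["A", "B", "A", "C", "A", "B", None]})

counts = df["category"].value_counts()                    # missing values are excluded
shares = df["category"].value_counts(normalize=True)      # proportions of non-missing rows
with_missing = df["category"].value_counts(dropna=False)  # missing values get their own row

print(counts)
print(shares)
print(with_missing)
```

The result is sorted from most to least frequent, so the first index entry is the dominant category.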

## Aggregation

Aggregation summarizes data.

Equivalent to SQL GROUP BY.

### Example: Revenue per Customer

```python
df.groupby("customer_id")["revenue"].sum()
```

### Multiple Aggregations

```python
df.groupby("category")["revenue"].agg(["sum", "mean", "count"])
```

You can compute:

- Totals
- Averages
- Counts
- Custom metrics
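
One way to combine built-in and custom metrics is named aggregation, where each output column gets an explicit name. This is a sketch on invented data; the metric `revenue_range` is a hypothetical example of a custom calculation.

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["A", "A", "B", "B", "B"],
    "revenue": [100.0, 300.0, 50.0, 150.0, 100.0],
})

summary = df.groupby("category").agg(
    total_revenue=("revenue", "sum"),
    avg_revenue=("revenue", "mean"),
    n_orders=("revenue", "count"),
    # Custom metric: gap between the largest and smallest order in the group.
    revenue_range=("revenue", lambda s: s.max() - s.min()),
)

print(summary)
```

Naming the outputs up front avoids having to rename columns like `sum` and `mean` afterwards.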
    

## Merging Datasets

Real projects involve multiple tables.

Equivalent to SQL JOIN.

```python
customers = pd.read_csv("customers.csv")
sales = pd.read_csv("sales.csv")

merged = pd.merge(sales, customers, on="customer_id", how="left")
```

Merge types:

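- `inner`: keep only keys found in both tables
- `left`: keep every row of the left table (`sales` above)
- `right`: keep every row of the right table (`customers` above)
- `outer`: keep all rows from both tables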

![merge types](aca/img/python/02/join-types-merge-names.jpg)
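
The difference between merge types is easiest to see on two small tables with unmatched keys. The data below is invented for illustration; `indicator=True` is an optional argument that adds a `_merge` column showing where each row came from.

```python
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3],            # customer 3 has no sales
    "name": ["Ana", "Ben", "Cara"],
})
sales = pd.DataFrame({
    "customer_id": [1, 1, 2, 4],         # customer 4 is missing from customers
    "revenue": [100, 50, 75, 20],
})

left = pd.merge(sales, customers, on="customer_id", how="left")
inner = pd.merge(sales, customers, on="customer_id", how="inner")
outer = pd.merge(sales, customers, on="customer_id", how="outer", indicator=True)

print(len(left), len(inner), len(outer))   # 4 3 5
print(outer)
```

Comparing row counts before and after a merge is a quick sanity check: an unexpected increase usually means duplicate keys, and missing values in the joined columns point to unmatched keys.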


## Data Visualization

We use `matplotlib` for foundational visualization.

For higher-level plotting, we can also use:

- `seaborn` (statistical charts built on top of matplotlib)
- `plotly` (interactive charts)
    

### Why Visualization Matters

Visualization helps:

- Detect patterns
- Identify anomalies
- Understand relationships
- Communicate clearly
    

### Histogram (Distribution)

Used for continuous numeric variables.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic revenue: 1,000 draws from a normal distribution (mean 100, std 15)
data = np.random.normal(loc=100, scale=15, size=1000)
df = pd.DataFrame({"revenue": data})

df["revenue"].hist()
plt.title("Histogram of Revenue")
plt.xlabel("Revenue")
plt.ylabel("Frequency")
plt.show()
```

This reveals:

- Shape of distribution
- Skewness
- Spread
- Potential outliers
    

### Bar Chart (Category Frequency)

Used for categorical comparisons.

```python
categories = np.random.choice(["A", "B", "C", "D"], size=500)
df = pd.DataFrame({"category": categories})

df["category"].value_counts().plot(kind="bar")
plt.title("Category Distribution")
plt.xlabel("Category")
plt.ylabel("Count")
plt.show()
```

This reveals:

- Most frequent category
- Imbalance between groups
    

### Box Plot (Outliers & Spread)

Used to examine variability and extreme values.

```python
data = np.random.normal(200, 40, 800)
df = pd.DataFrame({"revenue": data})

df.boxplot(column="revenue")
plt.title("Boxplot of Revenue")
plt.ylabel("Revenue")
plt.show()
```

This shows:

- Median
- Quartiles
- Outliers
    

Extremely useful in financial or operational analytics.
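
The points a box plot draws as outliers follow a simple rule: by default, matplotlib flags values more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below computes the same quantities numerically on synthetic data; the fixed seed is an addition so the numbers are reproducible.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)   # fixed seed, for reproducibility only
revenue = pd.Series(rng.normal(200, 40, 800), name="revenue")

q1 = revenue.quantile(0.25)
q3 = revenue.quantile(0.75)
iqr = q3 - q1                    # height of the box

lower = q1 - 1.5 * iqr           # values below this are flagged
upper = q3 + 1.5 * iqr           # values above this are flagged
outliers = revenue[(revenue < lower) | (revenue > upper)]

print(f"median={revenue.median():.1f}, IQR={iqr:.1f}, outliers={len(outliers)}")
```

Having the outliers as a Series makes it possible to inspect or filter them, which the plot alone does not allow.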

### Line Chart (Trend Over Time)

Used for time-series trends.

```python
dates = pd.date_range(start="2024-01-01", periods=100)
values = np.random.randint(50, 150, size=100)

df = pd.DataFrame({"date": dates, "sales": values})

df.set_index("date")["sales"].plot(kind="line")
plt.title("Sales Over Time")
plt.xlabel("Date")
plt.ylabel("Sales")
plt.show()
```

Used for:

- Growth tracking
- Seasonality
- Operational monitoring
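
Daily series are often noisy, so trends are easier to read after smoothing or aggregating. The sketch below shows two common options on synthetic data, a 7-day rolling mean and weekly totals; the window size and the weekly frequency are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)   # fixed seed, for reproducibility only
dates = pd.date_range(start="2024-01-01", periods=100)
sales = pd.Series(rng.integers(50, 150, size=100), index=dates, name="sales")

rolling_7d = sales.rolling(window=7).mean()   # first 6 values are NaN (window not full yet)
weekly_total = sales.resample("W").sum()      # one value per calendar week

print(rolling_7d.tail())
print(weekly_total.head())
```

Calling `.plot()` on either result draws it as a line chart, in the same way as the daily series.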
    

### Scatter Plot (Relationship Between Two Variables)

Used to detect correlation or relationships.

```python
x = np.random.normal(50, 10, 500)
y = 2 * x + np.random.normal(0, 15, 500)

df = pd.DataFrame({"advertising_spend": x, "revenue": y})

df.plot(kind="scatter", x="advertising_spend", y="revenue")
plt.title("Scatter Plot: Advertising vs Revenue")
plt.xlabel("Advertising Spend")
plt.ylabel("Revenue")
plt.show()
```

This reveals:

- Linear relationships
- Strength of association
- Potential outliers
    

> Scatter plots are fundamental in regression analysis.
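
What the scatter plot shows visually can also be put into numbers. The sketch below rebuilds the same kind of synthetic data with a fixed seed, then computes the Pearson correlation and a straight-line fit; using `np.polyfit` for the fit is one simple option, not the only one.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)   # fixed seed, for reproducibility only
x = rng.normal(50, 10, 500)
y = 2 * x + rng.normal(0, 15, 500)
df = pd.DataFrame({"advertising_spend": x, "revenue": y})

# Pearson correlation: strength of the linear association, between -1 and 1.
r = df["advertising_spend"].corr(df["revenue"])

# Degree-1 polynomial fit returns the slope first, then the intercept.
slope, intercept = np.polyfit(df["advertising_spend"], df["revenue"], deg=1)

print(f"r={r:.2f}, slope={slope:.2f}, intercept={intercept:.2f}")
```

Because the data was generated as `y = 2 * x + noise`, the fitted slope should land close to 2.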

### Heatmap (For Fun — Correlation Structure)


A heatmap visualizes relationships between multiple variables.

For this, we use `seaborn`.

```python
import seaborn as sns

data = pd.DataFrame({
    "sales": np.random.normal(200, 30, 300),
    "marketing": np.random.normal(100, 20, 300),
    "profit": np.random.normal(80, 15, 300),
    "expenses": np.random.normal(120, 25, 300)
})

corr = data.corr()

sns.heatmap(corr, annot=True)
plt.title("Correlation Heatmap")
plt.show()
```

This reveals:

- Positive correlations
- Negative correlations
- Variable interdependence
    

> Heatmaps are especially useful during exploratory data analysis (EDA).
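
One caveat about the synthetic data above: the four columns are drawn independently, so their pairwise correlations sit near zero and the heatmap looks flat apart from the diagonal. The sketch below builds in some dependence so there is structure to see. The formulas linking the columns are invented purely for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)   # fixed seed, for reproducibility only
n = 300

marketing = rng.normal(100, 20, n)
sales = 1.5 * marketing + rng.normal(0, 20, n)       # rises with marketing
expenses = 0.5 * marketing + rng.normal(60, 10, n)   # partly driven by marketing
profit = sales - expenses + rng.normal(0, 10, n)     # derived from the columns above

data = pd.DataFrame({
    "sales": sales,
    "marketing": marketing,
    "profit": profit,
    "expenses": expenses,
})

corr = data.corr()
print(corr.round(2))
```

Passing this `corr` to `sns.heatmap(corr, annot=True)` produces a heatmap with clearly visible off-diagonal values.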

### Choosing the Right Visualization


| Question                                         | Chart Type     |
|--------------------------------------------------|---------------|
| What is the distribution?                        | Histogram     |
| Compare categories?                              | Bar chart     |
| Detect outliers?                                 | Box plot      |
| Show trend over time?                            | Line chart    |
| Examine relationships?                           | Scatter plot  |
| Inspect correlations across many variables?      | Heatmap       |

### Conceptual Takeaway

Visualization is:

- Structural analysis
- Pattern detection
- Decision support
    

You now understand how to generate synthetic datasets and visualize:

- Distributions
- Categories
- Outliers
- Trends
- Relationships
- Correlation structures
    

This is the foundation of exploratory data analysis.