# 🐼 Pandas Mastery: Data Analysis Superpower

<img src='https://pandas.pydata.org/static/img/pandas_white.svg' width='400' alt='Pandas Logo' style='background-color: #130754; padding: 20px;'>

## 📚 Welcome to Data Analysis Heaven!

**Pandas** is the Swiss Army knife of data analysis in Python. If NumPy is the engine, Pandas is the luxury car built on top of it!

### 🎯 Why Pandas Rules the Data World:
- **DataFrames** - Think Excel on steroids! 💪
- **Missing Data Handling** - Clean messy data like a pro
- **Time Series** - Financial data analysis made easy
- **Data Wrangling** - Merge, join, reshape with ease
- **SQL-like Operations** - GROUP BY, JOIN, and more!
- **File I/O** - Read/write CSV, Excel, JSON, SQL, and more

### 📊 What We'll Master Today:
1. **Series & DataFrames** - The building blocks
2. **Data Loading & Inspection** - Get your data in
3. **Indexing & Selection** - Access any data point
4. **Data Cleaning** - Handle missing/duplicate data
5. **Data Transformation** - Apply, map, and more
6. **Grouping & Aggregation** - SQL-like operations
7. **Merging & Joining** - Combine datasets
8. **Time Series Analysis** - Work with dates
9. **Visualization** - Quick plots
10. **Real-World Projects** - Apply everything!

---

## 🚀 Let's Begin Our Journey!

In [1]:
# Import the essentials
import pandas as pd
import numpy as np
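import warnings
# Silence warnings to keep the notebook output tidy. This also hides
# deprecation warnings, so comment it out when testing against a new pandas release.
warnings.filterwarnings('ignore')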

# Display settings for better visibility
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 10)
pd.set_option('display.width', 1000)

print(f"🐼 Pandas Version: {pd.__version__}")
print("\n✅ Pandas loaded successfully! Let's analyze some data!")


🐼 Pandas Version: 2.3.1

✅ Pandas loaded successfully! Let's analyze some data!


---

## 📌 Section 1: Series & DataFrames - The Building Blocks

### 🔍 Understanding Pandas Data Structures

- **Series**: 1D labeled array (like a column in Excel)
- **DataFrame**: 2D labeled table (like an Excel spreadsheet)
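
The relationship between the two structures can be sketched in a few lines. This is a minimal illustration with made-up data (the names `frame`, `temp_c`, and `humidity` are hypothetical): selecting one column or one row of a DataFrame gives back a Series.

```python
import pandas as pd

# Hypothetical data, used only to illustrate the relationship.
frame = pd.DataFrame(
    {"temp_c": [22, 24, 19], "humidity": [40, 55, 61]},
    index=["Mon", "Tue", "Wed"],
)

column = frame["temp_c"]   # selecting one column yields a Series
row = frame.loc["Tue"]     # selecting one row also yields a Series

print(type(column).__name__)             # Series
print(column.index.equals(frame.index))  # True: the column shares the row labels
print(row.index.tolist())                # ['temp_c', 'humidity']
```

A column Series is indexed by the row labels, while a row Series is indexed by the column names.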

<img src='https://miro.medium.com/max/1400/1*xfJGLLKJzJfzkRhLZE-anw.png' width='600' alt='Series vs DataFrame'>

In [2]:
# 1.1 Creating a Series
print("📊 Creating Pandas Series\n" + "="*40)

# From a list
temperatures = pd.Series([22, 24, 19, 23, 25, 18, 21])
print("Simple Series:")
print(temperatures)

# With custom index
days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
temp_series = pd.Series([22, 24, 19, 23, 25, 18, 21], index=days, name='Temperature')
print("\nSeries with custom index:")
print(temp_series)

# From dictionary
sales_dict = {'iPhone': 500, 'Samsung': 450, 'Xiaomi': 300, 'OnePlus': 150}
sales_series = pd.Series(sales_dict, name='Units Sold')
print("\nSeries from dictionary:")
print(sales_series)

📊 Creating Pandas Series
Simple Series:
0    22
1    24
2    19
3    23
4    25
5    18
6    21
dtype: int64

Series with custom index:
Mon    22
Tue    24
Wed    19
Thu    23
Fri    25
Sat    18
Sun    21
Name: Temperature, dtype: int64

Series from dictionary:
iPhone     500
Samsung    450
Xiaomi     300
OnePlus    150
Name: Units Sold, dtype: int64


In [3]:
# 1.2 Creating DataFrames
print("📊 Creating DataFrames\n" + "="*40)

# Method 1: From dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
    'Age': [25, 30, 35, 28, 32],
    'City': ['NYC', 'LA', 'Chicago', 'Houston', 'Phoenix'],
    'Salary': [70000, 85000, 95000, 65000, 78000]
}

df = pd.DataFrame(data)
print("DataFrame from dictionary:")
print(df)

# Method 2: From list of dictionaries
records = [
    {'Product': 'Laptop', 'Price': 1200, 'Quantity': 5},
    {'Product': 'Mouse', 'Price': 25, 'Quantity': 50},
    {'Product': 'Keyboard', 'Price': 75, 'Quantity': 30},
    {'Product': 'Monitor', 'Price': 300, 'Quantity': 10}
]

products_df = pd.DataFrame(records)
print("\nDataFrame from list of dicts:")
print(products_df)

📊 Creating DataFrames
DataFrame from dictionary:
      Name  Age     City  Salary
0    Alice   25      NYC   70000
1      Bob   30       LA   85000
2  Charlie   35  Chicago   95000
3    David   28  Houston   65000
4     Emma   32  Phoenix   78000

DataFrame from list of dicts:
    Product  Price  Quantity
0    Laptop   1200         5
1     Mouse     25        50
2  Keyboard     75        30
3   Monitor    300        10


In [4]:
# 1.3 DataFrame Information
print("ℹ️ DataFrame Information\n" + "="*40)

# Create a sample DataFrame
np.random.seed(42)
df = pd.DataFrame({
    'A': np.random.randn(100),
    'B': np.random.randint(0, 50, 100),
    'C': np.random.choice(['X', 'Y', 'Z'], 100),
    'D': pd.date_range('2024-01-01', periods=100),
    'E': np.random.random(100) * 1000
})

print("DataFrame shape:", df.shape)
print("\nColumn names:", df.columns.tolist())
print("\nData types:")
print(df.dtypes)
print("\nMemory usage:")
print(df.memory_usage())
print("\nBasic info:")
df.info()

ℹ️ DataFrame Information
DataFrame shape: (100, 5)

Column names: ['A', 'B', 'C', 'D', 'E']

Data types:
A           float64
B             int32
C            object
D    datetime64[ns]
E           float64
dtype: object

Memory usage:
Index    132
A        800
B        400
C        800
D        800
E        800
dtype: int64

Basic info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   A       100 non-null    float64       
 1   B       100 non-null    int32         
 2   C       100 non-null    object        
 3   D       100 non-null    datetime64[ns]
 4   E       100 non-null    float64       
dtypes: datetime64[ns](1), float64(2), int32(1), object(1)
memory usage: 3.6+ KB


### 🏋️ Exercise 1: Create Your First DataFrame

Create a DataFrame containing information about 5 movies with columns:
- Title, Year, Rating, Runtime (minutes), Revenue (millions)

In [5]:
# Your solution here:

# Solution:
movies = pd.DataFrame({
    'Title': ['Inception', 'The Matrix', 'Interstellar', 'The Dark Knight', 'Avatar'],
    'Year': [2010, 1999, 2014, 2008, 2009],
    'Rating': [8.8, 8.7, 8.6, 9.0, 7.8],
    'Runtime': [148, 136, 169, 152, 162],
    'Revenue': [836.8, 467.2, 677.5, 1004.6, 2847.2]
})

print("Movies DataFrame:")
print(movies)
print(f"\nTotal revenue: ${movies['Revenue'].sum():.1f} million")
print(f"Average rating: {movies['Rating'].mean():.2f}")

Movies DataFrame:
             Title  Year  Rating  Runtime  Revenue
0        Inception  2010     8.8      148    836.8
1       The Matrix  1999     8.7      136    467.2
2     Interstellar  2014     8.6      169    677.5
3  The Dark Knight  2008     9.0      152   1004.6
4           Avatar  2009     7.8      162   2847.2

Total revenue: $5833.3 million
Average rating: 8.58


---

## 📌 Section 2: Data Loading & Inspection

### 📁 Working with Realistic Sample Data

In [6]:
# 2.1 Creating Sample Data
print("📁 Creating Sample Dataset\n" + "="*40)

# Generate a realistic sales dataset
np.random.seed(42)
n_records = 1000

sales_data = pd.DataFrame({
    'OrderID': range(1001, 1001 + n_records),
    'Date': pd.date_range('2024-01-01', periods=n_records, freq='h'),  # 'h' = hourly; the uppercase 'H' alias is deprecated since pandas 2.2
    'Product': np.random.choice(['Laptop', 'Phone', 'Tablet', 'Watch', 'Headphones'], n_records),
    'Category': np.random.choice(['Electronics', 'Accessories'], n_records),
    'Quantity': np.random.randint(1, 10, n_records),
    'Price': np.random.uniform(50, 2000, n_records).round(2),
    'Customer': ['Customer_' + str(i) for i in np.random.randint(1, 200, n_records)],
    'City': np.random.choice(['NYC', 'LA', 'Chicago', 'Houston', 'Phoenix'], n_records),
    'PaymentMethod': np.random.choice(['Credit', 'Debit', 'Cash', 'PayPal'], n_records)
})

# Add some missing values for realism
sales_data.loc[sales_data.sample(50).index, 'City'] = np.nan
sales_data.loc[sales_data.sample(30).index, 'PaymentMethod'] = np.nan

print("Sample Sales Dataset Created!")
print(f"Shape: {sales_data.shape}")
print(f"Date range: {sales_data['Date'].min()} to {sales_data['Date'].max()}")

📁 Creating Sample Dataset
Sample Sales Dataset Created!
Shape: (1000, 9)
Date range: 2024-01-01 00:00:00 to 2024-02-11 15:00:00


In [7]:
# 2.2 Data Inspection Methods
print("🔍 Inspecting the Data\n" + "="*40)

# First few rows
print("First 5 rows:")
print(sales_data.head())

# Last few rows
print("\nLast 3 rows:")
print(sales_data.tail(3))

# Random sample
print("\nRandom 3 rows:")
print(sales_data.sample(3))

# Basic statistics
print("\nNumerical columns statistics:")
print(sales_data.describe())

# Categorical columns
print("\nCategorical columns info:")
print(sales_data.describe(include=['object']))

🔍 Inspecting the Data
First 5 rows:
   OrderID                Date     Product     Category  Quantity    Price      Customer     City PaymentMethod
0     1001 2024-01-01 00:00:00       Watch  Accessories         7  1260.81  Customer_185      NYC           NaN
1     1002 2024-01-01 01:00:00  Headphones  Accessories         1   737.13    Customer_4  Chicago         Debit
2     1003 2024-01-01 02:00:00      Tablet  Accessories         8  1690.82  Customer_135  Phoenix        PayPal
3     1004 2024-01-01 03:00:00  Headphones  Accessories         3   969.01   Customer_61      NYC           NaN
4     1005 2024-01-01 04:00:00  Headphones  Accessories         6  1959.25  Customer_107      NYC          Cash

Last 3 rows:
     OrderID                Date Product     Category  Quantity    Price      Customer     City PaymentMethod
997     1998 2024-02-11 13:00:00  Laptop  Electronics         2  1658.25  Customer_170       LA        PayPal
998     1999 2024-02-11 14:00:00   Watch  Accessories     

In [8]:
# 2.3 Missing Data Detection
print("🔎 Missing Data Analysis\n" + "="*40)

# Check for missing values
missing = sales_data.isnull().sum()
missing_pct = (missing / len(sales_data)) * 100

missing_df = pd.DataFrame({
    'Missing_Count': missing,
    'Percentage': missing_pct.round(2)
})

print("Missing values per column:")
print(missing_df[missing_df['Missing_Count'] > 0])

# Summarize the missing-data pattern at row level
print("\nRows with any missing values:", sales_data.isnull().any(axis=1).sum())
print("Complete rows:", sales_data.notna().all(axis=1).sum())

🔎 Missing Data Analysis
Missing values per column:
               Missing_Count  Percentage
City                      50         5.0
PaymentMethod             30         3.0

Rows with any missing values: 77
Complete rows: 923
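
The output above reports 80 missing cells (50 + 30) but only 77 affected rows, which means three rows are missing both fields. The sketch below reproduces that cell-versus-row distinction on a tiny hypothetical frame (`toy` and its values are made up for illustration). It also shows that the mean of a boolean mask gives the missing share directly.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are illustrative only.
toy = pd.DataFrame({
    "City": ["NYC", np.nan, "LA", np.nan],
    "PaymentMethod": ["Cash", "Debit", np.nan, np.nan],
    "Quantity": [1, 2, 3, 4],
})

# The mean of a boolean mask is the fraction of True values,
# so this is the percentage of missing entries per column.
missing_share = toy.isna().mean().mul(100).round(2)

# A row missing both fields is counted once here ...
rows_with_gaps = int(toy.isna().any(axis=1).sum())
# ... but contributes two cells here.
total_missing_cells = int(toy.isna().sum().sum())

print(missing_share)
print(rows_with_gaps, total_missing_cells)  # 3 4
```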


---

## 📌 Section 3: Indexing & Selection - Access Your Data

### 🎯 Multiple Ways to Select Data

In [9]:
# 3.1 Column Selection
print("📊 Column Selection\n" + "="*40)

# Single column (returns Series)
products = sales_data['Product']
print("Single column (Series):")
print(products.head())
print(f"Type: {type(products)}")

# Multiple columns (returns DataFrame)
subset = sales_data[['Product', 'Price', 'Quantity']]
print("\nMultiple columns (DataFrame):")
print(subset.head())
print(f"Type: {type(subset)}")

📊 Column Selection
Single column (Series):
0         Watch
1    Headphones
2        Tablet
3    Headphones
4    Headphones
Name: Product, dtype: object
Type: <class 'pandas.core.series.Series'>

Multiple columns (DataFrame):
      Product    Price  Quantity
0       Watch  1260.81         7
1  Headphones   737.13         1
2      Tablet  1690.82         8
3  Headphones   969.01         3
4  Headphones  1959.25         6
Type: <class 'pandas.core.frame.DataFrame'>


In [10]:
# 3.2 Row Selection with loc and iloc
print("🎯 Row Selection: loc vs iloc\n" + "="*40)

# Create a simple DataFrame with custom index
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
    'Score': [85, 92, 78, 95, 88],
    'Grade': ['B', 'A', 'C', 'A', 'B']
}, index=['ST001', 'ST002', 'ST003', 'ST004', 'ST005'])

print("Original DataFrame:")
print(df)

# loc: label-based selection
print("\n🏷️ Using loc (label-based):")
print("Select row 'ST003':")
print(df.loc['ST003'])
print("\nSelect rows ST002 to ST004:")
print(df.loc['ST002':'ST004'])
print("\nSelect specific rows and columns:")
print(df.loc[['ST001', 'ST005'], ['Name', 'Score']])

# iloc: integer position-based selection
print("\n🔢 Using iloc (position-based):")
print("Select row at position 2:")
print(df.iloc[2])
print("\nSelect rows 1 to 3:")
print(df.iloc[1:4])
print("\nSelect specific positions:")
print(df.iloc[[0, 4], [0, 1]])

🎯 Row Selection: loc vs iloc
Original DataFrame:
          Name  Score Grade
ST001    Alice     85     B
ST002      Bob     92     A
ST003  Charlie     78     C
ST004    David     95     A
ST005     Emma     88     B

🏷️ Using loc (label-based):
Select row 'ST003':
Name     Charlie
Score         78
Grade          C
Name: ST003, dtype: object

Select rows ST002 to ST004:
          Name  Score Grade
ST002      Bob     92     A
ST003  Charlie     78     C
ST004    David     95     A

Select specific rows and columns:
        Name  Score
ST001  Alice     85
ST005   Emma     88

🔢 Using iloc (position-based):
Select row at position 2:
Name     Charlie
Score         78
Grade          C
Name: ST003, dtype: object

Select rows 1 to 3:
          Name  Score Grade
ST002      Bob     92     A
ST003  Charlie     78     C
ST004    David     95     A

Select specific positions:
        Name  Score
ST001  Alice     85
ST005   Emma     88
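
The `loc` and `iloc` slices above return the same rows, but they treat the end of the slice differently. A minimal sketch, using a cut-down hypothetical copy of the student table:

```python
import pandas as pd

students = pd.DataFrame(
    {"Score": [85, 92, 78, 95, 88]},
    index=["ST001", "ST002", "ST003", "ST004", "ST005"],
)

by_label = students.loc["ST002":"ST004"]  # label slice: the end label is included
by_position = students.iloc[1:4]          # position slice: position 4 is excluded
shorter = students.iloc[1:3]              # stops before position 3

print(by_label.index.tolist())     # ['ST002', 'ST003', 'ST004']
print(by_position.index.tolist())  # ['ST002', 'ST003', 'ST004']
print(shorter.index.tolist())      # ['ST002', 'ST003']
```

Label slices are inclusive because there is no general way to name "the label after ST004", whereas position slices follow normal Python slicing rules.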


In [11]:
# 3.3 Boolean Indexing (Filtering)
print("🔍 Boolean Indexing\n" + "="*40)

# Simple filter
high_price = sales_data[sales_data['Price'] > 1000]
print(f"Products with price > $1000: {len(high_price)} records")
print(high_price[['Product', 'Price']].head())

# Multiple conditions
laptops_nyc = sales_data[
    (sales_data['Product'] == 'Laptop') & 
    (sales_data['City'] == 'NYC')
]
print(f"\nLaptops sold in NYC: {len(laptops_nyc)} records")

# Using isin()
tech_cities = sales_data[sales_data['City'].isin(['NYC', 'LA'])]
print(f"\nSales in NYC or LA: {len(tech_cities)} records")

# Using query() method
query_result = sales_data.query('Price > 500 and Quantity > 5')
print(f"\nHigh value bulk orders: {len(query_result)} records")

🔍 Boolean Indexing
Products with price > $1000: 511 records
      Product    Price
0       Watch  1260.81
2      Tablet  1690.82
4  Headphones  1959.25
5       Phone  1286.58
7      Tablet  1368.55

Laptops sold in NYC: 30 records

Sales in NYC or LA: 356 records

High value bulk orders: 342 records
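
Two details of boolean indexing are easy to miss: comparisons against a missing value evaluate to `False`, and `query()` can read local variables through the `@` prefix. The sketch below uses a small hypothetical `orders` frame and a made-up `min_price` threshold to show both.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
orders = pd.DataFrame({
    "Product": ["Laptop", "Phone", "Laptop", "Watch"],
    "City": ["NYC", np.nan, "LA", "NYC"],
    "Price": [1200.0, 800.0, 950.0, 300.0],
})

in_nyc = orders["City"] == "NYC"      # NaN == "NYC" evaluates to False
not_in_nyc = ~in_nyc                  # so the row with a missing city lands here
known_not_nyc = orders["City"].notna() & ~in_nyc  # exclude unknown cities explicitly

min_price = 900  # hypothetical threshold
pricey = orders.query("Price > @min_price")  # @ refers to a local variable

print(int(in_nyc.sum()), int(not_in_nyc.sum()), int(known_not_nyc.sum()))  # 2 2 1
print(pricey["Product"].tolist())  # ['Laptop', 'Laptop']
```

Each condition needs its own parentheses when combined with `&` or `|`, because those operators bind more tightly than comparisons.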


### 🏋️ Exercise 2: Data Selection Challenge

From the sales_data DataFrame:
1. Find all Phone sales with quantity >= 5
2. Select orders from January 2024
3. Find the top 5 most expensive orders

In [12]:
# Your solution here:

# Solution:
# 1. Phone sales with quantity >= 5
phone_bulk = sales_data[(sales_data['Product'] == 'Phone') & (sales_data['Quantity'] >= 5)]
print(f"Phone bulk orders: {len(phone_bulk)} records")

# 2. January 2024 orders (check the year as well as the month)
jan_mask = (sales_data['Date'].dt.year == 2024) & (sales_data['Date'].dt.month == 1)
jan_orders = sales_data[jan_mask]
print(f"\nJanuary orders: {len(jan_orders)} records")

# 3. Top 5 most expensive orders
# Note: this adds a TotalValue column to sales_data itself, so later cells see it too
sales_data['TotalValue'] = sales_data['Price'] * sales_data['Quantity']
top_5 = sales_data.nlargest(5, 'TotalValue')[['OrderID', 'Product', 'TotalValue']]
print("\nTop 5 most expensive orders:")
print(top_5)

Phone bulk orders: 105 records

January orders: 744 records

Top 5 most expensive orders:
     OrderID     Product  TotalValue
802     1803       Watch    17623.08
880     1881       Watch    17485.38
824     1825      Laptop    17421.48
807     1808      Tablet    17071.47
918     1919  Headphones    17030.16


---

## 📌 Section 4: Data Cleaning - Handle Real-World Messiness

### 🧹 Clean Your Data Like a Pro

In [13]:
# 4.1 Handling Missing Data
print("🧹 Handling Missing Data\n" + "="*40)

# Create sample data with missing values
messy_data = sales_data.copy()

print("Missing values before cleaning:")
print(messy_data.isnull().sum()[messy_data.isnull().sum() > 0])

# Method 1: Drop missing values
clean_drop = messy_data.dropna()
print(f"\nAfter dropna(): {len(clean_drop)} rows (lost {len(messy_data) - len(clean_drop)} rows)")

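# Method 2: Fill with a value
# Assign the result back to the column. Calling fillna(inplace=True) on a
# selected column triggers a FutureWarning in recent pandas 2.x releases and
# will stop updating the DataFrame once Copy-on-Write becomes the default.
messy_data['City'] = messy_data['City'].fillna('Unknown')
messy_data['PaymentMethod'] = messy_data['PaymentMethod'].fillna('Cash')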

print("\nAfter fillna():")
print(messy_data.isnull().sum())

# Method 3: Forward fill (carry the last valid value forward)
time_series = pd.DataFrame({
    'Date': pd.date_range('2024-01-01', periods=10),
    'Value': [100, np.nan, np.nan, 150, 200, np.nan, 250, 300, np.nan, 350]
})

print("\nTime series with missing values:")
print(time_series)

print("\nForward filled:")
# ffill() replaces the deprecated fillna(method='ffill')
print(time_series.ffill())

🧹 Handling Missing Data
Missing values before cleaning:
City             50
PaymentMethod    30
dtype: int64

After dropna(): 923 rows (lost 77 rows)

After fillna():
OrderID          0
Date             0
Product          0
Category         0
Quantity         0
Price            0
Customer         0
City             0
PaymentMethod    0
TotalValue       0
dtype: int64

Time series with missing values:
        Date  Value
0 2024-01-01  100.0
1 2024-01-02    NaN
2 2024-01-03    NaN
3 2024-01-04  150.0
4 2024-01-05  200.0
5 2024-01-06    NaN
6 2024-01-07  250.0
7 2024-01-08  300.0
8 2024-01-09    NaN
9 2024-01-10  350.0

Forward filled:
        Date  Value
0 2024-01-01  100.0
1 2024-01-02  100.0
2 2024-01-03  100.0
3 2024-01-04  150.0
4 2024-01-05  200.0
5 2024-01-06  200.0
6 2024-01-07  250.0
7 2024-01-08  300.0
8 2024-01-09  300.0
9 2024-01-10  350.0
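
Forward fill copies the last valid value into every gap, however long the gap is. The sketch below, on a short hypothetical Series, compares it with a capped forward fill (`limit=1`) and with backward fill so the differences are visible side by side.

```python
import numpy as np
import pandas as pd

values = pd.Series([100, np.nan, np.nan, 150, np.nan, 200], name="Value")

forward = values.ffill()        # carry the last seen value forward
capped = values.ffill(limit=1)  # fill at most one consecutive missing value
backward = values.bfill()       # pull the next valid value backward

print(pd.DataFrame({
    "raw": values,
    "ffill": forward,
    "ffill_limit_1": capped,
    "bfill": backward,
}))
```

Capping the fill is one way to avoid stretching a stale value across a long outage. Whether that suits a given dataset is a judgment call.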


In [14]:
# 4.2 Removing Duplicates
print("🔄 Removing Duplicates\n" + "="*40)

# Create data with duplicates
dup_data = pd.DataFrame({
    'ID': [1, 2, 2, 3, 4, 4, 4, 5],
    'Name': ['Alice', 'Bob', 'Bob', 'Charlie', 'David', 'David', 'David', 'Emma'],
    'Score': [85, 90, 90, 75, 88, 88, 92, 95]
})

print("Original data with duplicates:")
print(dup_data)

# Check for duplicates
print(f"\nDuplicate rows: {dup_data.duplicated().sum()}")
print("\nDuplicate rows details:")
print(dup_data[dup_data.duplicated(keep=False)])

# Remove duplicates
clean_data = dup_data.drop_duplicates()
print(f"\nAfter removing duplicates: {len(clean_data)} rows")
print(clean_data)

# Keep last occurrence
keep_last = dup_data.drop_duplicates(subset=['ID'], keep='last')
print("\nKeeping last occurrence per ID:")
print(keep_last)

🔄 Removing Duplicates
Original data with duplicates:
   ID     Name  Score
0   1    Alice     85
1   2      Bob     90
2   2      Bob     90
3   3  Charlie     75
4   4    David     88
5   4    David     88
6   4    David     92
7   5     Emma     95

Duplicate rows: 2

Duplicate rows details:
   ID   Name  Score
1   2    Bob     90
2   2    Bob     90
4   4  David     88
5   4  David     88

After removing duplicates: 6 rows
   ID     Name  Score
0   1    Alice     85
1   2      Bob     90
3   3  Charlie     75
4   4    David     88
6   4    David     92
7   5     Emma     95

Keeping last occurrence per ID:
   ID     Name  Score
0   1    Alice     85
2   2      Bob     90
3   3  Charlie     75
6   4    David     92
7   5     Emma     95
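
The count of 2 duplicates and the 4 rows in the details listing above come from the `keep` parameter of `duplicated()`. A minimal sketch on a hypothetical four-row frame:

```python
import pandas as pd

dup = pd.DataFrame({
    "ID": [1, 2, 2, 3],
    "Score": [85, 90, 90, 75],
})

later_copies = dup.duplicated()          # keep="first": flag only the repeats
all_copies = dup.duplicated(keep=False)  # flag every member of a duplicate group
by_id = dup.duplicated(subset=["ID"])    # compare on the ID column alone

print(later_copies.tolist())  # [False, False, True, False]
print(all_copies.tolist())    # [False, True, True, False]
print(by_id.tolist())         # [False, False, True, False]
```

Use the default to count how many rows `drop_duplicates()` would remove, and `keep=False` to inspect the whole group before deciding.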


In [15]:
# 4.3 Data Type Conversions
print("🔧 Data Type Conversions\n" + "="*40)

# Create sample data with wrong types
wrong_types = pd.DataFrame({
    'ID': ['001', '002', '003', '004'],
    'Price': ['19.99', '29.99', '39.99', '49.99'],
    'Date': ['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04'],
    'InStock': ['True', 'False', 'True', 'True']
})

print("Original data types:")
print(wrong_types.dtypes)
print("\nData:")
print(wrong_types)

# Convert to appropriate types
# Note: casting ID to int drops the zero padding ('001' becomes 1)
wrong_types['ID'] = wrong_types['ID'].astype(int)
wrong_types['Price'] = wrong_types['Price'].astype(float)
wrong_types['Date'] = pd.to_datetime(wrong_types['Date'])
wrong_types['InStock'] = wrong_types['InStock'].map({'True': True, 'False': False})

print("\nCorrected data types:")
print(wrong_types.dtypes)
print("\nData after conversion:")
print(wrong_types)

🔧 Data Type Conversions
Original data types:
ID         object
Price      object
Date       object
InStock    object
dtype: object

Data:
    ID  Price        Date InStock
0  001  19.99  2024-01-01    True
1  002  29.99  2024-01-02   False
2  003  39.99  2024-01-03    True
3  004  49.99  2024-01-04    True

Corrected data types:
ID                  int32
Price             float64
Date       datetime64[ns]
InStock              bool
dtype: object

Data after conversion:
   ID  Price       Date  InStock
0   1  19.99 2024-01-01     True
1   2  29.99 2024-01-02    False
2   3  39.99 2024-01-03     True
3   4  49.99 2024-01-04     True
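
`astype()` works above because every string is well formed. Real files often contain placeholders such as `"N/A"`, and `astype(float)` would then raise an error. The sketch below, on hypothetical input, shows the `errors="coerce"` option of `pd.to_numeric` and `pd.to_datetime`, which turns unparseable entries into missing values instead.

```python
import pandas as pd

# Hypothetical messy input, for illustration only.
raw_prices = pd.Series(["19.99", "29.99", "N/A", "49.99"])
raw_dates = pd.Series(["2024-01-01", "not a date", "2024-01-03"])

# errors="coerce" converts what it can and marks the rest as missing.
prices = pd.to_numeric(raw_prices, errors="coerce")  # "N/A" becomes NaN
dates = pd.to_datetime(raw_dates, errors="coerce")   # "not a date" becomes NaT

print(prices)
print(dates)
print(f"Unparseable prices: {int(prices.isna().sum())}")
```

Coercion hides bad input rather than fixing it, so it is worth counting the resulting missing values before moving on.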


In [16]:
# 4.4 String Operations
print("📝 String Operations\n" + "="*40)

# Create sample data
text_data = pd.DataFrame({
    'Name': ['  John Doe  ', 'jane smith', 'BOB JOHNSON', 'Alice_Brown'],
    'Email': ['JOHN@GMAIL.COM', 'Jane@Yahoo.com', 'bob@HOTMAIL.COM', 'alice@outlook.com'],
    'Phone': ['123-456-7890', '(555) 123-4567', '9876543210', '555.123.4567']
})

print("Original messy text data:")
print(text_data)

# Clean the data
text_data['Name'] = text_data['Name'].str.strip()  # Remove spaces
text_data['Name'] = text_data['Name'].str.title()  # Title case
text_data['Name'] = text_data['Name'].str.replace('_', ' ')  # Replace underscore

text_data['Email'] = text_data['Email'].str.lower()  # Lowercase emails

# Extract phone digits only
text_data['Phone_Clean'] = text_data['Phone'].str.replace(r'\D', '', regex=True)

print("\nCleaned text data:")
print(text_data)

# String contains
gmail_users = text_data[text_data['Email'].str.contains('gmail')]
print(f"\nGmail users: {len(gmail_users)}")

📝 String Operations
Original messy text data:
           Name              Email           Phone
0    John Doe       JOHN@GMAIL.COM    123-456-7890
1    jane smith     Jane@Yahoo.com  (555) 123-4567
2   BOB JOHNSON    bob@HOTMAIL.COM      9876543210
3   Alice_Brown  alice@outlook.com    555.123.4567

Cleaned text data:
          Name              Email           Phone Phone_Clean
0     John Doe     john@gmail.com    123-456-7890  1234567890
1   Jane Smith     jane@yahoo.com  (555) 123-4567  5551234567
2  Bob Johnson    bob@hotmail.com      9876543210  9876543210
3  Alice Brown  alice@outlook.com    555.123.4567  5551234567

Gmail users: 1
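
The `str.contains('gmail')` filter above works because the emails were lowercased first and none are missing. The sketch below, on a hypothetical Series, shows three optional parameters that make the same check more robust: `case`, `na`, and `regex`.

```python
import numpy as np
import pandas as pd

# Hypothetical input with mixed case and a missing value.
emails = pd.Series(["JOHN@GMAIL.COM", "jane@yahoo.com", np.nan, "bob@gmail.com"])

# case=False ignores letter case; na=False treats missing values as
# non-matches, which keeps the mask usable for boolean indexing.
is_gmail = emails.str.contains("gmail", case=False, na=False)

# regex=False matches the text literally, so "." is not a wildcard.
has_dot_com = emails.str.contains(".com", case=False, na=False, regex=False)

print(is_gmail.tolist())     # [True, False, False, True]
print(has_dot_com.tolist())  # [True, True, False, True]
```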


---

## 📌 Section 5: Data Transformation - Apply, Map, and More

### 🔄 Transform Your Data

In [17]:
# 5.1 Apply Functions
print("🔧 Apply Functions\n" + "="*40)

# Sample data
df = pd.DataFrame({
    'Price': [100, 200, 150, 300, 250],
    'Quantity': [2, 1, 3, 2, 4],
    'Discount': [0.1, 0.15, 0.2, 0.05, 0.1]
})

print("Original DataFrame:")
print(df)

# Apply to Series
df['Total'] = df['Price'] * df['Quantity']
df['Final_Price'] = df['Total'].apply(lambda x: x * 0.9 if x > 500 else x)

print("\nAfter applying discount logic:")
print(df)

# Apply to DataFrame rows
def calculate_revenue(row):
    base = row['Price'] * row['Quantity']
    return base * (1 - row['Discount'])

df['Revenue'] = df.apply(calculate_revenue, axis=1)

print("\nWith revenue calculation:")
print(df)

🔧 Apply Functions
Original DataFrame:
   Price  Quantity  Discount
0    100         2      0.10
1    200         1      0.15
2    150         3      0.20
3    300         2      0.05
4    250         4      0.10

After applying discount logic:
   Price  Quantity  Discount  Total  Final_Price
0    100         2      0.10    200        200.0
1    200         1      0.15    200        200.0
2    150         3      0.20    450        450.0
3    300         2      0.05    600        540.0
4    250         4      0.10   1000        900.0

With revenue calculation:
   Price  Quantity  Discount  Total  Final_Price  Revenue
0    100         2      0.10    200        200.0    180.0
1    200         1      0.15    200        200.0    170.0
2    150         3      0.20    450        450.0    360.0
3    300         2      0.05    600        540.0    570.0
4    250         4      0.10   1000        900.0    900.0
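
Row-wise `apply` calls a Python function once per row, which is slow on large frames. Both calculations above can also be written as column arithmetic. The sketch below rebuilds the same five-row frame and checks that the vectorized result matches the row-wise one. `np.where` is a swapped-in alternative to the lambda, not part of the original cell.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Price": [100, 200, 150, 300, 250],
    "Quantity": [2, 1, 3, 2, 4],
    "Discount": [0.1, 0.15, 0.2, 0.05, 0.1],
})

total = df["Price"] * df["Quantity"]

# Conditional logic without apply: 10% off when the total exceeds 500.
final_price = np.where(total > 500, total * 0.9, total)

# Whole-column arithmetic replaces the row-wise function.
revenue_vectorized = total * (1 - df["Discount"])
revenue_rowwise = df.apply(
    lambda row: row["Price"] * row["Quantity"] * (1 - row["Discount"]), axis=1
)

print(final_price)
print(np.allclose(revenue_vectorized, revenue_rowwise))  # True
```

`apply(axis=1)` is still a reasonable choice when the per-row logic cannot be expressed with column operations.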


In [18]:
# 5.2 Map and Replace
print("🗺️ Map and Replace\n" + "="*40)

# Sample data
df = pd.DataFrame({
    'Grade': ['A', 'B', 'C', 'A', 'B', 'D', 'A', 'C'],
    'Status': ['Pass', 'Pass', 'Pass', 'Pass', 'Pass', 'Fail', 'Pass', 'Pass']
})

print("Original DataFrame:")
print(df)

# Map values
grade_points = {'A': 4.0, 'B': 3.0, 'C': 2.0, 'D': 1.0, 'F': 0.0}
df['GPA'] = df['Grade'].map(grade_points)

# Replace values (astype(int) makes the resulting integer dtype explicit)
df['Status_Code'] = df['Status'].replace({'Pass': 1, 'Fail': 0}).astype(int)

print("\nAfter mapping and replacing:")
print(df)

# Binning continuous values
scores = pd.DataFrame({'Score': [45, 67, 89, 72, 91, 55, 78, 83, 62, 95]})
bins = [0, 60, 70, 80, 90, 100]
labels = ['F', 'D', 'C', 'B', 'A']
scores['Grade'] = pd.cut(scores['Score'], bins=bins, labels=labels)

print("\nBinning scores into grades:")
print(scores)

🗺️ Map and Replace
Original DataFrame:
  Grade Status
0     A   Pass
1     B   Pass
2     C   Pass
3     A   Pass
4     B   Pass
5     D   Fail
6     A   Pass
7     C   Pass

After mapping and replacing:
  Grade Status  GPA  Status_Code
0     A   Pass  4.0            1
1     B   Pass  3.0            1
2     C   Pass  2.0            1
3     A   Pass  4.0            1
4     B   Pass  3.0            1
5     D   Fail  1.0            0
6     A   Pass  4.0            1
7     C   Pass  2.0            1

Binning scores into grades:
   Score Grade
0     45     F
1     67     D
2     89     B
3     72     C
4     91     A
5     55     F
6     78     C
7     83     B
8     62     D
9     95     A
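
Two edge cases are worth checking when mapping and binning: values without a dictionary entry, and scores that sit exactly on a bin edge. The sketch below uses hypothetical scores chosen to hit the edges. Raising the last edge to 101 in the `right=False` variant is an assumption made here so that a score of 100 still falls into a bin.

```python
import pandas as pd

scores = pd.Series([59, 60, 70, 90, 100])
labels = ["F", "D", "C", "B", "A"]

# Default right=True: bins are (0, 60], (60, 70], ... so 60 is graded "F".
right_closed = pd.cut(scores, bins=[0, 60, 70, 80, 90, 100], labels=labels)

# right=False: bins are [0, 60), [60, 70), ... so 60 is graded "D".
left_closed = pd.cut(scores, bins=[0, 60, 70, 80, 90, 101],
                     labels=labels, right=False)

# map() returns NaN for any value that has no dictionary entry.
mapped = pd.Series(["A", "B", "E"]).map({"A": 4.0, "B": 3.0})

print(right_closed.astype(str).tolist())  # ['F', 'F', 'D', 'B', 'A']
print(left_closed.astype(str).tolist())   # ['F', 'D', 'C', 'A', 'A']
print(mapped.isna().tolist())             # [False, False, True]
```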


In [19]:
# 5.3 Creating New Features
print("🏗️ Feature Engineering\n" + "="*40)

# Work with the sales data
df = sales_data.copy()

# Extract date features
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day
df['DayOfWeek'] = df['Date'].dt.day_name()
df['Hour'] = df['Date'].dt.hour
df['IsWeekend'] = df['Date'].dt.dayofweek.isin([5, 6])

# Calculate features
df['TotalValue'] = df['Price'] * df['Quantity']
df['PriceCategory'] = pd.qcut(df['Price'], q=3, labels=['Low', 'Medium', 'High'])

print("New features created:")
print(df[['Date', 'Year', 'Month', 'Day', 'DayOfWeek', 'IsWeekend', 'PriceCategory']].head())

# Aggregate features
customer_stats = df.groupby('Customer').agg({
    'OrderID': 'count',
    'TotalValue': ['sum', 'mean']
}).round(2)

print("\nCustomer statistics (first 5):")
print(customer_stats.head())

🏗️ Feature Engineering
New features created:
                 Date  Year  Month  Day DayOfWeek  IsWeekend PriceCategory
0 2024-01-01 00:00:00  2024      1    1    Monday      False        Medium
1 2024-01-01 01:00:00  2024      1    1    Monday      False        Medium
2 2024-01-01 02:00:00  2024      1    1    Monday      False          High
3 2024-01-01 03:00:00  2024      1    1    Monday      False        Medium
4 2024-01-01 04:00:00  2024      1    1    Monday      False          High

Customer statistics (first 5):
             OrderID TotalValue          
               count        sum      mean
Customer                                 
Customer_1         2   23086.60  11543.30
Customer_10        2    5412.62   2706.31
Customer_100       2    4221.74   2110.87
Customer_101       4   23799.22   5949.80
Customer_102       4   22924.73   5731.18


---

## 📌 Section 6: Grouping & Aggregation - SQL-like Operations

### 📊 GROUP BY in Pandas

In [20]:
# 6.1 Basic GroupBy
print("📊 GroupBy Operations\n" + "="*40)

# Group by single column
product_groups = sales_data.groupby('Product')

# Basic aggregations
print("Sales by Product:")
print(product_groups['Price'].agg(['count', 'mean', 'sum']).round(2))

# Multiple aggregations
summary = sales_data.groupby('Product').agg({
    'OrderID': 'count',
    'Quantity': 'sum',
    'Price': ['mean', 'min', 'max']
}).round(2)

# Flatten column names
summary.columns = ['_'.join(col).strip() for col in summary.columns]
print("\nDetailed Product Summary:")
print(summary)

📊 GroupBy Operations
Sales by Product:
            count     mean        sum
Product                              
Headphones    204  1098.42  224077.93
Laptop        210  1000.27  210056.70
Phone         190  1041.24  197835.30
Tablet        190  1022.59  194291.43
Watch         206   962.78  198333.57

Detailed Product Summary:
            OrderID_count  Quantity_sum  Price_mean  Price_min  Price_max
Product                                                                  
Headphones            204           985     1098.42      50.02    1991.10
Laptop                210          1097     1000.27      66.22    1988.00
Phone                 190           943     1041.24      67.63    1970.47
Tablet                190           985     1022.59      50.26    1949.68
Watch                 206          1004      962.78      52.87    1995.75
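
Flattening the MultiIndex columns by hand, as above, works, but named aggregation gives flat and readable column names in one step. A minimal sketch on a small hypothetical `sales` frame; the output names `orders`, `units`, `avg_price`, and `max_price` are illustrative choices.

```python
import pandas as pd

# Hypothetical data for illustration only.
sales = pd.DataFrame({
    "Product": ["Laptop", "Laptop", "Phone", "Phone", "Phone"],
    "OrderID": [1, 2, 3, 4, 5],
    "Quantity": [1, 2, 3, 1, 1],
    "Price": [1000.0, 1200.0, 500.0, 700.0, 600.0],
})

# Each keyword becomes an output column: name=(source column, aggregation).
summary = sales.groupby("Product").agg(
    orders=("OrderID", "count"),
    units=("Quantity", "sum"),
    avg_price=("Price", "mean"),
    max_price=("Price", "max"),
)

print(summary)
```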


In [21]:
# 6.2 Multi-level GroupBy
print("📊 Multi-level GroupBy\n" + "="*40)

# Group by multiple columns
city_product = sales_data.groupby(['City', 'Product'])['Price'].agg(['count', 'mean'])
city_product = city_product.round(2)

print("Sales by City and Product:")
print(city_product.head(10))

# Pivot table alternative
pivot = sales_data.pivot_table(
    values='Price',
    index='Product',
    columns='City',
    aggfunc='mean',
    fill_value=0
).round(2)

print("\nPivot Table - Average Price by Product and City:")
print(pivot)

📊 Multi-level GroupBy
Sales by City and Product:
                    count     mean
City    Product                   
Chicago Headphones     44   953.61
        Laptop         41  1022.34
        Phone          35  1216.69
        Tablet         31  1087.79
        Watch          46  1050.31
Houston Headphones     41  1131.12
        Laptop         41  1013.16
        Phone          39  1001.79
        Tablet         43  1021.49
        Watch          39   902.02

Pivot Table - Average Price by Product and City:
City        Chicago  Houston       LA      NYC  Phoenix
Product                                                
Headphones   953.61  1131.12  1117.24   968.32  1337.90
Laptop      1022.34  1013.16  1100.04   840.98   986.86
Phone       1216.69  1001.79   952.15  1145.84   897.69
Tablet      1087.79  1021.49  1231.85   885.36   852.06
Watch       1050.31   902.02   970.20   998.87   901.79
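
By default, `groupby` drops rows whose group key is missing. Assuming `City` still contains the missing values introduced in Section 2, the tables above leave those rows out. The sketch below shows the effect on a hypothetical five-row frame, and how `dropna=False` keeps them as their own group.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two of the five rows have no city.
sales = pd.DataFrame({
    "City": ["NYC", "NYC", np.nan, "LA", np.nan],
    "Price": [100.0, 300.0, 500.0, 200.0, 700.0],
})

default_counts = sales.groupby("City")["Price"].count()
with_missing = sales.groupby("City", dropna=False)["Price"].count()

print(int(default_counts.sum()))  # 3: only rows with a known city
print(int(with_missing.sum()))    # 5: all rows, missing cities included
print(with_missing)
```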


In [22]:
# 6.3 Transform and Filter Groups
print("🔄 Transform and Filter Groups\n" + "="*40)

# Add group statistics to original data
df = sales_data.copy()
df['TotalValue'] = df['Price'] * df['Quantity']

# Add group means
df['Product_Avg_Price'] = df.groupby('Product')['Price'].transform('mean')
df['Price_vs_Avg'] = df['Price'] - df['Product_Avg_Price']

print("Data with group statistics:")
print(df[['Product', 'Price', 'Product_Avg_Price', 'Price_vs_Avg']].head(10))

# Filter groups
# Keep only products with more than 100 sales
popular_products = df.groupby('Product').filter(lambda x: len(x) > 100)
print(f"\nProducts with >100 sales: {popular_products['Product'].nunique()}")
print(f"Total records: {len(popular_products)}")

🔄 Transform and Filter Groups
Data with group statistics:
      Product    Price  Product_Avg_Price  Price_vs_Avg
0       Watch  1260.81         962.784320    298.025680
1  Headphones   737.13        1098.421225   -361.291225
2      Tablet  1690.82        1022.586474    668.233526
3  Headphones   969.01        1098.421225   -129.411225
4  Headphones  1959.25        1098.421225    860.828775
5       Phone  1286.58        1041.238421    245.341579
6      Tablet   296.22        1022.586474   -726.366474
7      Tablet  1368.55        1022.586474    345.963526
8      Tablet   683.95        1022.586474   -338.636474
9  Headphones  1388.34        1098.421225    289.918775

Products with >100 sales: 5
Total records: 1000
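
The difference between aggregating, transforming, and filtering groups is easiest to see from the shape of the result. A minimal sketch on a hypothetical five-row frame (`ShareOfProduct` is an illustrative column name):

```python
import pandas as pd

sales = pd.DataFrame({
    "Product": ["Laptop", "Laptop", "Phone", "Phone", "Watch"],
    "Price": [1000.0, 1200.0, 500.0, 700.0, 300.0],
})
grouped = sales.groupby("Product")["Price"]

per_group = grouped.mean()           # aggregation: one value per product
per_row = grouped.transform("mean")  # transform: one value per original row

# Because transform keeps the original shape, it combines directly with columns.
sales["ShareOfProduct"] = sales["Price"] / grouped.transform("sum")

# filter keeps or drops whole groups and returns the original rows.
multi_sale = sales.groupby("Product").filter(lambda g: len(g) >= 2)

print(len(per_group), len(per_row), len(multi_sale))  # 3 5 4
```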


---

## 📌 Section 7: Merging & Joining - Combine Your Data

### 🔗 SQL-like JOINs in Pandas

In [23]:
# 7.1 Merge Operations
print("🔗 Merge Operations\n" + "="*40)

# Create sample DataFrames
customers = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4, 5],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
    'City': ['NYC', 'LA', 'Chicago', 'Houston', 'Phoenix']
})

orders = pd.DataFrame({
    'OrderID': [101, 102, 103, 104, 105, 106],
    'CustomerID': [1, 2, 1, 3, 5, 2],
    'Product': ['Laptop', 'Phone', 'Tablet', 'Laptop', 'Phone', 'Watch'],
    'Amount': [1200, 800, 600, 1500, 900, 400]
})

print("Customers:")
print(customers)
print("\nOrders:")
print(orders)

# Inner join (default)
inner_join = pd.merge(customers, orders, on='CustomerID')
print("\nInner Join Result:")
print(inner_join)

# Left join
left_join = pd.merge(customers, orders, on='CustomerID', how='left')
print("\nLeft Join Result (all customers):")
print(left_join)

🔗 Merge Operations
Customers:
   CustomerID     Name     City
0           1    Alice      NYC
1           2      Bob       LA
2           3  Charlie  Chicago
3           4    David  Houston
4           5     Emma  Phoenix

Orders:
   OrderID  CustomerID Product  Amount
0      101           1  Laptop    1200
1      102           2   Phone     800
2      103           1  Tablet     600
3      104           3  Laptop    1500
4      105           5   Phone     900
5      106           2   Watch     400

Inner Join Result:
   CustomerID     Name     City  OrderID Product  Amount
0           1    Alice      NYC      101  Laptop    1200
1           1    Alice      NYC      103  Tablet     600
2           2      Bob       LA      102   Phone     800
3           2      Bob       LA      106   Watch     400
4           3  Charlie  Chicago      104  Laptop    1500
5           5     Emma  Phoenix      105   Phone     900

Left Join Result (all customers):
   CustomerID     Name     City  OrderID P
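
A left join keeps customers without orders, but it does not say which rows were unmatched. One possible way to audit that, sketched here on trimmed copies of the tables above, is the `indicator` and `validate` parameters of `pd.merge`:

```python
import pandas as pd

# Trimmed copies of the sample tables from the cell above
customers = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4, 5],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
})
orders = pd.DataFrame({
    'OrderID': [101, 102, 103, 104, 105, 106],
    'CustomerID': [1, 2, 1, 3, 5, 2],
    'Amount': [1200, 800, 600, 1500, 900, 400],
})

# indicator=True adds a '_merge' column: 'both', 'left_only', or 'right_only'
# validate='one_to_many' raises if CustomerID is duplicated in customers
audit = pd.merge(customers, orders, on='CustomerID', how='left',
                 indicator=True, validate='one_to_many')

# Customers that found no matching order
no_orders = audit.loc[audit['_merge'] == 'left_only', 'Name'].tolist()
print(no_orders)
```

Here `no_orders` contains only David, the one customer whose `CustomerID` never appears in `orders`.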

In [24]:
# 7.2 Concatenation
print("📚 Concatenation\n" + "="*40)

# Create sample DataFrames
q1_sales = pd.DataFrame({
    'Month': ['Jan', 'Feb', 'Mar'],
    'Sales': [100, 120, 140]
})

q2_sales = pd.DataFrame({
    'Month': ['Apr', 'May', 'Jun'],
    'Sales': [130, 150, 160]
})

# Vertical concatenation
half_year = pd.concat([q1_sales, q2_sales], ignore_index=True)
print("Vertical Concatenation:")
print(half_year)

# Horizontal concatenation
revenue = pd.DataFrame({
    'Month': ['Jan', 'Feb', 'Mar'],
    'Revenue': [10000, 12000, 14000]
})

combined = pd.concat([q1_sales, revenue[['Revenue']]], axis=1)
print("\nHorizontal Concatenation:")
print(combined)

📚 Concatenation
Vertical Concatenation:
  Month  Sales
0   Jan    100
1   Feb    120
2   Mar    140
3   Apr    130
4   May    150
5   Jun    160

Horizontal Concatenation:
  Month  Sales  Revenue
0   Jan    100    10000
1   Feb    120    12000
2   Mar    140    14000
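
Horizontal concatenation lines rows up by index label, not by position. The example above works because both frames share the index 0 to 2. A small sketch of what can happen otherwise, using a hypothetical `revenue` frame whose index starts at 1:

```python
import pandas as pd

q1_sales = pd.DataFrame({'Month': ['Jan', 'Feb', 'Mar'],
                         'Sales': [100, 120, 140]})

# Hypothetical: revenue arrives with a shifted index (e.g. after filtering)
revenue = pd.DataFrame({'Revenue': [10000, 12000, 14000]}, index=[1, 2, 3])

# Index labels 0..2 and 1..3 are unioned, so unmatched labels become NaN
misaligned = pd.concat([q1_sales, revenue], axis=1)

# Resetting the index first restores positional alignment
aligned = pd.concat([q1_sales, revenue.reset_index(drop=True)], axis=1)

print(misaligned)
print(aligned)
```

If the two frames should be matched on a key such as `Month`, `pd.merge` is usually the safer choice.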


---

## 📌 Section 8: Time Series Analysis

### ⏰ Working with Dates and Time

In [25]:
# 8.1 Date Operations
print("📅 Date Operations\n" + "="*40)

# Create time series data
dates = pd.date_range('2024-01-01', periods=365, freq='D')
ts_data = pd.DataFrame({
    'Date': dates,
    'Sales': np.random.randn(365).cumsum() + 100,
    'Temperature': np.sin(np.arange(365) * 2 * np.pi / 365) * 20 + 20
})

ts_data.set_index('Date', inplace=True)

print("Time Series Data:")
print(ts_data.head())

# Resampling ('ME' = month end; the older 'M' alias is deprecated since pandas 2.2)
monthly = ts_data.resample('ME').mean()
print("\nMonthly Averages:")
print(monthly.head())

# Rolling windows
ts_data['Sales_MA7'] = ts_data['Sales'].rolling(window=7).mean()
ts_data['Sales_MA30'] = ts_data['Sales'].rolling(window=30).mean()

print("\nWith Moving Averages:")
print(ts_data[['Sales', 'Sales_MA7', 'Sales_MA30']].head(35))

📅 Date Operations
Time Series Data:
                 Sales  Temperature
Date                               
2024-01-01  100.332999    20.000000
2024-01-02  100.135998    20.344267
2024-01-03  100.435514    20.688432
2024-01-04   99.985298    21.032393
2024-01-05  100.700114    21.376049

Monthly Averages:
                 Sales  Temperature
Date                               
2024-01-31  100.781140    25.046746
2024-02-29  104.429304    33.844528
2024-03-31  104.096512    38.995555
2024-04-30  103.893286    39.186639
2024-05-31  100.518349    34.184007

With Moving Averages:
                 Sales   Sales_MA7  Sales_MA30
Date                                          
2024-01-01  100.332999         NaN         NaN
2024-01-02  100.135998         NaN         NaN
2024-01-03  100.435514         NaN         NaN
2024-01-04   99.985298         NaN         NaN
2024-01-05  100.700114         NaN         NaN
...                ...         ...         ...
2024-01-31  103.754640  101.573855  100.79
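
The leading `NaN` values in the moving-average columns appear because a rolling window only produces a result once it is full. A minimal sketch on a hypothetical five-day series shows the default behaviour next to `min_periods=1`, which accepts partial windows:

```python
import pandas as pd

# Hypothetical five-day series, invented for illustration
s = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0],
              index=pd.date_range('2024-01-01', periods=5, freq='D'))

# Default: the first window-1 results are NaN
strict = s.rolling(window=3).mean()

# min_periods=1: average whatever values are available so far
partial = s.rolling(window=3, min_periods=1).mean()

print(pd.DataFrame({'value': s, 'strict': strict, 'partial': partial}))
```

Whether partial windows are acceptable depends on the analysis; early values of `partial` are averages over fewer points and are noisier.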

In [38]:
# 8.2 Date Range Filtering
print("🔍 Date Filtering\n" + "="*40)

# Filter by month: partial-string indexing on rows needs .loc
# (ts_data['2024-01'] is treated as a column lookup in pandas 2.x and raises a KeyError)
jan_data = ts_data.loc['2024-01']
print(f"January 2024 data: {len(jan_data)} days")
print(jan_data.head())

# Between dates (both endpoints are included)
q1_data = ts_data.loc['2024-01-01':'2024-03-31']
print(f"\nQ1 2024 data: {len(q1_data)} days")

# Shift operations
ts_data['Sales_Yesterday'] = ts_data['Sales'].shift(1)
ts_data['Sales_Change'] = ts_data['Sales'] - ts_data['Sales_Yesterday']

print("\nWith lag features:")
print(ts_data[['Sales', 'Sales_Yesterday', 'Sales_Change']].head())

---

## 📌 Section 9: Visualization with Pandas

### 📈 Quick Plots

In [27]:
# 9.1 Basic Plotting
print("📈 Pandas Plotting\n" + "="*40)

# Note: In a real notebook, these calls would render plots.
# Here we only describe what each one would show.

# Line plot
print("1. Line Plot: ts_data['Sales'].plot()")
print("   - Shows sales trend over time")

# Bar plot
product_sales = sales_data.groupby('Product')['Price'].count()
print("\n2. Bar Plot: product_sales.plot(kind='bar')")
print("   - Shows count of sales per product")

# Histogram
print("\n3. Histogram: sales_data['Price'].plot(kind='hist', bins=30)")
print("   - Shows distribution of prices")

# Box plot
print("\n4. Box Plot: sales_data.boxplot(column='Price', by='Product')")
print("   - Shows price distribution by product")

# Scatter plot
print("\n5. Scatter Plot: sales_data.plot(x='Quantity', y='Price', kind='scatter')")
print("   - Shows relationship between quantity and price")

📈 Pandas Plotting
1. Line Plot: ts_data['Sales'].plot()
   - Shows sales trend over time

2. Bar Plot: product_sales.plot(kind='bar')
   - Shows count of sales per product

3. Histogram: sales_data['Price'].plot(kind='hist', bins=30)
   - Shows distribution of prices

4. Box Plot: sales_data.boxplot(column='Price', by='Product')
   - Shows price distribution by product

5. Scatter Plot: sales_data.plot(x='Quantity', y='Price', kind='scatter')
   - Shows relationship between quantity and price


---

## 🎯 Section 10: Real-World Projects

### Project 1: Customer Analytics Dashboard

In [28]:
# Project 1: Customer Analytics
print("📊 CUSTOMER ANALYTICS DASHBOARD\n" + "="*50)

# Prepare the data
df = sales_data.copy()
df['TotalValue'] = df['Price'] * df['Quantity']
df['Month'] = df['Date'].dt.to_period('M')

# 1. Customer Segmentation
customer_summary = df.groupby('Customer').agg({
    'OrderID': 'count',
    'TotalValue': 'sum',
    'Date': lambda x: (df['Date'].max() - x.max()).days
}).rename(columns={
    'OrderID': 'Orders',
    'TotalValue': 'Revenue',
    'Date': 'DaysSinceLastOrder'
})

# Categorize customers
customer_summary['Segment'] = pd.cut(
    customer_summary['Revenue'],
    bins=[0, 1000, 5000, 10000, float('inf')],
    labels=['Bronze', 'Silver', 'Gold', 'Platinum']
)

print("Customer Segmentation:")
print(customer_summary.head(10))

# 2. Segment Analysis
# observed=False keeps segments with no customers and makes the default explicit
segment_analysis = customer_summary.groupby('Segment', observed=False).agg({
    'Orders': ['count', 'mean'],
    'Revenue': ['sum', 'mean'],
    'DaysSinceLastOrder': 'mean'
}).round(2)

print("\nSegment Analysis:")
print(segment_analysis)

# 3. Top Customers
top_customers = customer_summary.nlargest(5, 'Revenue')[['Orders', 'Revenue', 'Segment']]
print("\nTop 5 Customers:")
print(top_customers)

# 4. Churn Risk
churn_risk = customer_summary[customer_summary['DaysSinceLastOrder'] > 30].sort_values(
    'Revenue', ascending=False
).head(10)[['Orders', 'Revenue', 'DaysSinceLastOrder']]

print("\nChurn Risk (No order in 30+ days):")
print(churn_risk)

📊 CUSTOMER ANALYTICS DASHBOARD
Customer Segmentation:
              Orders   Revenue  DaysSinceLastOrder   Segment
Customer                                                    
Customer_1         2  23086.60                   9  Platinum
Customer_10        2   5412.62                   4      Gold
Customer_100       2   4221.74                  24    Silver
Customer_101       4  23799.22                   0  Platinum
Customer_102       4  22924.73                   1  Platinum
Customer_103       5  25454.93                   0  Platinum
Customer_104      11  68279.90                   1  Platinum
Customer_105       2   9657.99                   4      Gold
Customer_106       3  13875.98                  15  Platinum
Customer_107       5  32246.14                   2  Platinum

Segment Analysis:
         Orders           Revenue           DaysSinceLastOrder
          count  mean         sum      mean               mean
Segment                                                       
Bronze
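
One detail of `pd.cut` is easy to miss: bins are right-inclusive by default, so the edges above mean `(0, 1000]`, `(1000, 5000]`, and so on. A short sketch with hypothetical revenue values sitting on the edges:

```python
import pandas as pd

# Hypothetical revenues chosen to sit on or near the bin edges
revenue = pd.Series([0, 1000, 1000.01, 5000, 12000], name='Revenue')
bins = [0, 1000, 5000, 10000, float('inf')]
labels = ['Bronze', 'Silver', 'Gold', 'Platinum']

# Default: intervals are open on the left, so exactly 0 falls outside every bin
default_cut = pd.cut(revenue, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left as well
inclusive_cut = pd.cut(revenue, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({'Revenue': revenue,
                    'default': default_cut,
                    'include_lowest': inclusive_cut}))
```

With the default settings, a revenue of exactly 1000 is Bronze, exactly 5000 is Silver, and exactly 0 gets no segment at all.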

### Project 2: Sales Performance Report

In [29]:
# Project 2: Sales Performance Report
print("📈 SALES PERFORMANCE REPORT\n" + "="*50)

# Monthly trends
monthly_sales = df.groupby('Month').agg({
    'OrderID': 'count',
    'TotalValue': 'sum',
    'Customer': 'nunique'
}).rename(columns={
    'OrderID': 'Orders',
    'TotalValue': 'Revenue',
    'Customer': 'UniqueCustomers'
})

# Calculate growth
monthly_sales['Revenue_Growth'] = monthly_sales['Revenue'].pct_change() * 100
monthly_sales['AOV'] = monthly_sales['Revenue'] / monthly_sales['Orders']

print("Monthly Performance:")
print(monthly_sales.round(2))

# Product performance
product_performance = df.groupby('Product').agg({
    'OrderID': 'count',
    'TotalValue': 'sum',
    'Quantity': 'sum'
}).rename(columns={
    'OrderID': 'Orders',
    'TotalValue': 'Revenue',
    'Quantity': 'Units'
})

product_performance['Avg_Order_Value'] = (
    product_performance['Revenue'] / product_performance['Orders']
).round(2)

product_performance['Revenue_Share'] = (
    product_performance['Revenue'] / product_performance['Revenue'].sum() * 100
).round(2)

print("\nProduct Performance:")
print(product_performance.sort_values('Revenue', ascending=False))

# City performance
city_performance = df.dropna(subset=['City']).groupby('City').agg({
    'TotalValue': ['sum', 'mean'],
    'OrderID': 'count'
}).round(2)

print("\nCity Performance:")
print(city_performance)

# Key Metrics
print("\n📊 KEY METRICS")
print("=" * 50)
print(f"Total Revenue: ${df['TotalValue'].sum():,.2f}")
print(f"Total Orders: {df['OrderID'].nunique():,}")
print(f"Unique Customers: {df['Customer'].nunique()}")
print(f"Average Order Value: ${df['TotalValue'].mean():.2f}")
print(f"Best Selling Product: {product_performance['Revenue'].idxmax()}")
print(f"Top City: {city_performance[('TotalValue', 'sum')].idxmax()}")

📈 SALES PERFORMANCE REPORT
Monthly Performance:
         Orders     Revenue  UniqueCustomers  Revenue_Growth      AOV
Month                                                                
2024-01     744  3747136.04              192             NaN  5036.47
2024-02     256  1396758.63              148          -62.72  5456.09

Product Performance:
            Orders     Revenue  Units  Avg_Order_Value  Revenue_Share
Product                                                              
Headphones     204  1102167.99    985          5402.78          21.43
Laptop         210  1099719.83   1097          5236.76          21.38
Tablet         190  1011230.27    985          5322.26          19.66
Phone          190   970814.44    943          5109.55          18.87
Watch          206   959962.14   1004          4660.01          18.66

City Performance:
         TotalValue          OrderID
                sum     mean   count
City                                
Chicago  1027717.20  5216.84  
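
To make the `Revenue_Growth` column less of a black box, the sketch below reproduces it from the two monthly revenue figures printed above. `pct_change()` is equivalent to dividing each value by the previous one and subtracting 1:

```python
import pandas as pd

# The two monthly revenue totals shown in the output above
monthly_revenue = pd.Series(
    [3747136.04, 1396758.63],
    index=pd.PeriodIndex(['2024-01', '2024-02'], freq='M'),
    name='Revenue',
)

# Built-in percentage change, scaled to percent
growth = monthly_revenue.pct_change() * 100

# The same calculation written out by hand with shift
manual = (monthly_revenue / monthly_revenue.shift(1) - 1) * 100

print(growth.round(2))
```

The first month has no predecessor, so its growth is `NaN`. February also has far fewer orders than January in the output above, so the large drop may reflect a partial month rather than a real decline.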

---

## 🎓 Advanced Pandas Techniques

### 🚀 Pro-Level Skills

In [30]:
# Advanced: Method Chaining
print("⛓️ Method Chaining\n" + "="*40)

# Clean, elegant data processing
result = (
    sales_data
    .assign(TotalValue=lambda x: x['Price'] * x['Quantity'])
    .query('TotalValue > 1000')
    .groupby('Product')
    .agg({'TotalValue': 'sum', 'OrderID': 'count'})
    .rename(columns={'TotalValue': 'Revenue', 'OrderID': 'Orders'})
    .sort_values('Revenue', ascending=False)
    .head()
)

print("Top products by revenue (>$1000 orders only):")
print(result)

⛓️ Method Chaining
Top products by revenue (>$1000 orders only):
               Revenue  Orders
Product                       
Headphones  1086799.37     177
Laptop      1082285.48     179
Tablet       999385.61     169
Phone        955406.29     163
Watch        939522.78     170
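
The `agg` plus `rename` pair in a chain like this can also be written with named aggregation, which sets the output column names in the same step. A minimal sketch on a hypothetical four-row table:

```python
import pandas as pd

# Hypothetical toy orders, invented for illustration
toy = pd.DataFrame({
    'OrderID': [1, 2, 3, 4],
    'Product': ['Watch', 'Phone', 'Watch', 'Phone'],
    'Price': [600.0, 300.0, 900.0, 2000.0],
    'Quantity': [2, 1, 2, 1],
})

result = (
    toy
    .assign(TotalValue=lambda x: x['Price'] * x['Quantity'])
    .query('TotalValue > 1000')
    .groupby('Product')
    # new_column=(source_column, aggregation)
    .agg(Revenue=('TotalValue', 'sum'), Orders=('OrderID', 'count'))
    .sort_values('Revenue', ascending=False)
)

print(result)
```

Only three of the four toy orders exceed 1000, so `Watch` ends up with two orders and `Phone` with one.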


In [31]:
# Advanced: MultiIndex Operations
print("📚 MultiIndex Operations\n" + "="*40)

# Create a MultiIndex Series (City, Product) with a two-key groupby
multi_df = df.groupby(['City', 'Product'])['TotalValue'].sum().round(2)
print("MultiIndex Series:")
print(multi_df.head(10))

# Access MultiIndex data
print("\nAccess NYC data:")
print(multi_df.loc['NYC'])

# Unstack to pivot
pivoted = multi_df.unstack(fill_value=0)
print("\nUnstacked (Pivoted):")
print(pivoted)

📚 MultiIndex Operations
MultiIndex Series:
City     Product   
Chicago  Headphones    161110.31
         Laptop        229577.32
         Phone         221616.38
         Tablet        179662.69
         Watch         235750.50
Houston  Headphones    229551.81
         Laptop        200158.05
         Phone         179512.09
         Tablet        209217.30
         Watch         183142.66
Name: TotalValue, dtype: float64

Access NYC data:
Product
Headphones    145896.87
Laptop        137539.76
Phone         190958.46
Tablet        141657.38
Watch         170772.88
Name: TotalValue, dtype: float64

Unstacked (Pivoted):
Product  Headphones     Laptop      Phone     Tablet      Watch
City                                                           
Chicago   161110.31  229577.32  221616.38  179662.69  235750.50
Houston   229551.81  200158.05  179512.09  209217.30  183142.66
LA        211277.31  273845.84  202567.06  252482.33  142610.73
NYC       145896.87  137539.76  190958.46  141657.38 
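
`.loc['NYC']` selects on the outer index level. Selecting on the inner level (one product across all cities) needs a different tool; `.xs` is one option. A small sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical toy data, invented for illustration
toy = pd.DataFrame({
    'City': ['NYC', 'NYC', 'LA', 'LA', 'LA'],
    'Product': ['Phone', 'Watch', 'Phone', 'Phone', 'Tablet'],
    'TotalValue': [100.0, 200.0, 50.0, 25.0, 300.0],
})
multi = toy.groupby(['City', 'Product'])['TotalValue'].sum()

# Outer level: all products for one city
nyc = multi.loc['NYC']

# Inner level: one product across all cities
phones = multi.xs('Phone', level='Product')

# A single cell is addressed with a tuple
one_cell = multi.loc[('LA', 'Tablet')]

# Missing (City, Product) pairs become 0 after unstacking
pivoted = multi.unstack(fill_value=0)

print(phones)
print(pivoted)
```
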

In [33]:
# Advanced: Window Functions
print("🪟 Window Functions\n" + "="*40)

# Ranking within groups
df['Rank_in_City'] = df.groupby('City')['TotalValue'].rank(method='dense', ascending=False)

# Cumulative sum within groups
df['Cumulative_City_Revenue'] = df.groupby('City')['TotalValue'].cumsum()

# Percentage of group total
df['Pct_of_City_Total'] = df.groupby('City')['TotalValue'].transform(lambda x: x / x.sum() * 100)

print("Window function results:")
print(df[['City', 'Product', 'TotalValue', 'Rank_in_City', 
          'Cumulative_City_Revenue', 'Pct_of_City_Total']].head(10))

🪟 Window Functions
Window function results:
      City     Product  TotalValue  Rank_in_City  Cumulative_City_Revenue  Pct_of_City_Total
0      NYC       Watch     8825.67          25.0                  8825.67           1.121681
1  Chicago  Headphones      737.13         183.0                   737.13           0.071725
2  Phoenix      Tablet    13526.56          10.0                 13526.56           1.344900
3      NYC  Headphones     2907.03         102.0                 11732.70           0.369463
4      NYC  Headphones    11755.50           9.0                 23488.20           1.494042
5       LA       Phone    11579.22          27.0                 11579.22           1.069394
6  Phoenix      Tablet     1777.32         150.0                 15303.88           0.176713
7      NaN      Tablet    12316.95           NaN                      NaN                NaN
8      NYC      Tablet     4787.65          72.0                 28275.85           0.608477
9  Phoenix  Headphones    
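
The `method` argument of `rank` only matters when values tie. A quick sketch on a hypothetical series with one tie compares `dense` (used above) with two other methods:

```python
import pandas as pd

# Hypothetical values with a tie at the top
s = pd.Series([300, 100, 300, 200])

dense = s.rank(method='dense', ascending=False)    # ties share a rank, no gaps after
min_rank = s.rank(method='min', ascending=False)   # ties share a rank, gaps after
average = s.rank(ascending=False)                  # default: ties get the mean rank

print(pd.DataFrame({'value': s, 'dense': dense,
                    'min': min_rank, 'average': average}))
```

With `dense`, the value 200 is ranked 2; with `min`, it is ranked 3 because two rows are already ahead of it.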

---

## 🏆 Final Challenge: Complete E-Commerce Analysis

### Build a Complete Analytics Pipeline

In [34]:
# Final Project: E-Commerce Analytics Pipeline
print("🎯 E-COMMERCE ANALYTICS PIPELINE\n" + "="*50)

class EcommerceAnalyzer:
    def __init__(self, data):
        self.data = data.copy()
        self.prepare_data()
    
    def prepare_data(self):
        """Prepare and clean data"""
        self.data['TotalValue'] = self.data['Price'] * self.data['Quantity']
        self.data['Month'] = self.data['Date'].dt.to_period('M')
        self.data['DayOfWeek'] = self.data['Date'].dt.day_name()
        self.data['Hour'] = self.data['Date'].dt.hour
    
    def executive_summary(self):
        """Generate executive summary"""
        summary = {
            'Total Revenue': f"${self.data['TotalValue'].sum():,.2f}",
            'Total Orders': self.data['OrderID'].nunique(),
            'Unique Customers': self.data['Customer'].nunique(),
            'Average Order Value': f"${self.data['TotalValue'].mean():.2f}",
            'Products Sold': self.data['Quantity'].sum(),
            # Share of an assumed pool of 200 customers who ordered at least once
            'Conversion Rate': f"{(self.data['Customer'].nunique() / 200 * 100):.1f}%"
        }
        return pd.Series(summary)
    
    def customer_lifetime_value(self):
        """Calculate CLV"""
        clv = self.data.groupby('Customer').agg({
            'TotalValue': 'sum',
            'OrderID': 'count',
            'Date': lambda x: (self.data['Date'].max() - x.min()).days
        }).rename(columns={
            'TotalValue': 'CLV',
            'OrderID': 'Orders',
            'Date': 'CustomerAge_Days'
        })
        
        clv['Avg_Order_Value'] = clv['CLV'] / clv['Orders']
        return clv.sort_values('CLV', ascending=False)
    
    def product_analytics(self):
        """Product performance analysis"""
        products = self.data.groupby('Product').agg({
            'TotalValue': 'sum',
            'Quantity': 'sum',
            'OrderID': 'count',
            'Customer': 'nunique'
        }).rename(columns={
            'TotalValue': 'Revenue',
            'Quantity': 'Units_Sold',
            'OrderID': 'Orders',
            'Customer': 'Unique_Buyers'
        })
        
        products['Avg_Price'] = products['Revenue'] / products['Units_Sold']
        products['Market_Share'] = products['Revenue'] / products['Revenue'].sum() * 100
        
        return products.round(2)
    
    def time_analysis(self):
        """Time-based patterns"""
        # Best hours
        hourly = self.data.groupby('Hour')['TotalValue'].agg(['sum', 'count'])
        best_hours = hourly.nlargest(3, 'sum')
        
        # Best days
        daily = self.data.groupby('DayOfWeek')['TotalValue'].sum()
        best_day = daily.idxmax()
        
        return {
            'Best Hours': best_hours.index.tolist(),
            'Best Day': best_day,
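            'Weekend vs Weekday': {
                # reindex avoids a KeyError if a day of the week is absent from the data
                'Weekend': daily.reindex(['Saturday', 'Sunday'], fill_value=0).sum(),
                'Weekday': daily.reindex(
                    ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday'], fill_value=0
                ).sum()
            }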
        }

# Run the analysis
analyzer = EcommerceAnalyzer(sales_data)

print("📊 EXECUTIVE SUMMARY")
print("=" * 50)
print(analyzer.executive_summary())

print("\n💰 TOP CUSTOMER LIFETIME VALUES")
print("=" * 50)
print(analyzer.customer_lifetime_value().head())

print("\n📦 PRODUCT ANALYTICS")
print("=" * 50)
print(analyzer.product_analytics())

print("\n⏰ TIME ANALYSIS")
print("=" * 50)
time_insights = analyzer.time_analysis()
print(f"Best Hours for Sales: {time_insights['Best Hours']}")
print(f"Best Day of Week: {time_insights['Best Day']}")
print(f"Weekend Revenue: ${time_insights['Weekend vs Weekday']['Weekend']:,.2f}")
print(f"Weekday Revenue: ${time_insights['Weekend vs Weekday']['Weekday']:,.2f}")

print("\n✅ Analysis Complete!")

🎯 E-COMMERCE ANALYTICS PIPELINE
📊 EXECUTIVE SUMMARY
Total Revenue          $5,143,894.67
Total Orders                    1000
Unique Customers                 194
Average Order Value         $5143.89
Products Sold                   5014
Conversion Rate                97.0%
dtype: object

💰 TOP CUSTOMER LIFETIME VALUES
                   CLV  Orders  CustomerAge_Days  Avg_Order_Value
Customer                                                         
Customer_61   69797.85      10                41      6979.785000
Customer_104  68279.90      11                37      6207.263636
Customer_110  65688.77       7                39      9384.110000
Customer_177  59990.64      10                39      5999.064000
Customer_195  56244.70      10                40      5624.470000

📦 PRODUCT ANALYTICS
               Revenue  Units_Sold  Orders  Unique_Buyers  Avg_Price  Market_Share
Product                                                                           
Headphones  1102167.99         

---

## 🎯 Summary & Next Steps

### 🏆 What You've Mastered:

✅ **Data Structures**
- Series and DataFrames
- MultiIndex operations

✅ **Data Manipulation**
- Indexing and selection
- Filtering and boolean indexing
- Apply, map, and transform

✅ **Data Cleaning**
- Missing data handling
- Duplicate removal
- Type conversions

✅ **Data Analysis**
- GroupBy operations
- Aggregations
- Pivot tables

✅ **Data Combination**
- Merging and joining
- Concatenation

✅ **Time Series**
- Date operations
- Resampling
- Rolling windows

✅ **Real-World Applications**
- Customer analytics
- Sales performance
- E-commerce analysis

### 🚀 Next Steps:

1. **Practice with Real Datasets**: Kaggle, UCI ML Repository
2. **Learn Visualization**: Matplotlib, Seaborn, Plotly
3. **SQL Integration**: Read/write from databases
4. **Big Data**: Dask, PySpark for larger datasets
5. **Machine Learning**: Feed data to scikit-learn

### 💡 Pro Tips:

- **Use vectorized operations** instead of loops
- **Chain methods** for cleaner code
- **Profile memory usage** with `.memory_usage()`
- **Use categories** for string columns to save memory
- **Read documentation**: pandas.pydata.org
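
The memory tips above can be checked directly. This sketch builds a hypothetical low-cardinality string column (five product names repeated many times) and compares its footprint before and after converting to `category`:

```python
import pandas as pd

# Hypothetical column: 10,000 rows but only 5 distinct values
products = pd.Series(['Watch', 'Phone', 'Tablet', 'Laptop', 'Headphones'] * 2000)

# deep=True counts the string contents, not just the pointers
as_object = products.memory_usage(deep=True)
as_category = products.astype('category').memory_usage(deep=True)

print(f"strings:  {as_object:,} bytes")
print(f"category: {as_category:,} bytes")
```

The saving comes from storing each distinct string once plus a small integer code per row, so it shrinks as the number of distinct values approaches the number of rows.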

### 📚 Resources:

- Official Pandas Documentation
- Pandas Cookbook by Ted Petrou
- Python for Data Analysis by Wes McKinney
- Kaggle Learn Pandas Course

---

## 🎉 Congratulations!

You've mastered Pandas - the Swiss Army knife of data analysis!

With Pandas, you can now:
- **Clean messy data** 🧹
- **Analyze complex datasets** 📊
- **Generate business insights** 💡
- **Prepare data for ML** 🤖
- **Create reports** 📈

**Keep practicing, keep analyzing, and keep discovering insights with Pandas!** 🐼

In [35]:
# 🎊 Course Complete!
print("🎊" * 20)
print("\n    🏆 PANDAS MASTERY ACHIEVED! 🏆")
print("\n    You're now ready to:")
print("    → Analyze any dataset")
print("    → Clean messy data")
print("    → Generate insights")
print("    → Build data pipelines")
print("\n    Next: Visualization with Matplotlib! 📈")
print("\n" + "🎊" * 20)

🎊🎊🎊🎊🎊🎊🎊🎊🎊🎊🎊🎊🎊🎊🎊🎊🎊🎊🎊🎊

    🏆 PANDAS MASTERY ACHIEVED! 🏆

    You're now ready to:
    → Analyze any dataset
    → Clean messy data
    → Generate insights
    → Build data pipelines

    Next: Visualization with Matplotlib! 📈

🎊🎊🎊🎊🎊🎊🎊🎊🎊🎊🎊🎊🎊🎊🎊🎊🎊🎊🎊🎊
