# **AI TECH INSTITUTE** · *Intermediate AI & Data Science*
### Week 01 · Lab 01A — Pandas Basics
**Instructor:** Amir Charkhi  |  **Duration:** 45 minutes  |  **Difficulty:** ⭐⭐☆☆☆

> **Goal:** Master Series and basic DataFrame operations through hands-on exercises.


## Learning Objectives
- Create and manipulate pandas Series
- Build DataFrames from various sources
- Perform basic selections and filtering
- Apply simple transformations

## Setup
Run this cell first to import libraries and set up your environment.

In [1]:
import pandas as pd
import numpy as np

# Display settings
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', None)

print("🐼 Pandas version:", pd.__version__)
print("Ready to start!")

🐼 Pandas version: 2.3.2
Ready to start!


## Part 1: Series Fundamentals (15 minutes)
A Series is a one-dimensional labeled array and the building block of pandas: think of it as a Python list whose values carry index labels.

### Exercise 1.1 — Temperature Tracker (easy)
Create a Series of daily temperatures for a week and perform basic analysis.

In [26]:
# TODO: Create a Series with these temperatures: [22.5, 23.1, 21.8, 24.2, 22.9, 25.5, 20.3]
# Use days of the week as the index
# Your code here:
temperatures = pd.Series([22.5, 23.1, 21.8, 24.2, 22.9, 25.5, 20.3], index=["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"])
print(temperatures)

# TODO: Find and print:
# 1. The hottest day and temperature
print(f"Hottest Day: {temperatures.idxmax()}")
print(f"Hottest Temperature: {temperatures.max()} C")
# 2. Days above 23 degrees
tempabove23 = temperatures[temperatures > 23]
print(f"Days above 23 degrees: {', '.join(tempabove23.index)}")

# 3. Average temperature for the week
# Your code here:
print(f"Average Temperature: {temperatures.mean():.2f} C")

Monday       22.5
Tuesday      23.1
Wednesday    21.8
Thursday     24.2
Friday       22.9
Saturday     25.5
Sunday       20.3
dtype: float64
Hottest Day: Saturday
Hottest Temperature: 25.5 C
Days above 23 degrees: Tuesday, Thursday, Saturday
Average Temperature: 22.90 C


<details>
<summary><b>Hint</b></summary>

- Use `pd.Series(data, index=labels)` to create the Series
- `.idxmax()` returns the index of the maximum value
- Boolean indexing: `series[series > value]`
</details>
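
The calls named in the hint can be tried on a small, made-up Series before applying them to the temperatures. The sketch below is only an illustration: the `rainfall` data and variable names are hypothetical and not part of the lab.

```python
import pandas as pd

# Hypothetical rainfall readings in mm (illustrative, not lab data)
rainfall = pd.Series([0.0, 4.2, 1.1, 7.8], index=["Mon", "Tue", "Wed", "Thu"])

wettest_day = rainfall.idxmax()      # index label of the largest value
wettest_amount = rainfall.max()      # the largest value itself
wet_days = rainfall[rainfall > 1.0]  # boolean mask keeps only matching entries

print(wettest_day, wettest_amount)
print(list(wet_days.index))
```

Note that `.idxmax()` returns a label while `.max()` returns a value, which is why the exercise needs both.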

### Exercise 1.2 — Stock Portfolio (medium)
Track stock prices and calculate portfolio value.

In [63]:
# Given data
stocks = {'AAPL': 175.50, 'GOOGL': 142.80, 'MSFT': 405.20, 'AMZN': 178.35}
shares = {'AAPL': 50, 'GOOGL': 30, 'MSFT': 40, 'AMZN': 25}

# TODO: Create Series for stocks and shares
# Your code here:
stocks = pd.Series(stocks)
shares = pd.Series(shares)

# TODO: Calculate:
# 1. Portfolio value for each stock (price * shares)
portfolio_stocks = stocks*shares
formatted_portfolio = portfolio_stocks.apply(lambda x: f"$ {x:.2f}")
print(f"{formatted_portfolio}\n")

# 2. Total portfolio value
total_portfolio = portfolio_stocks.sum()
print(f"Total: $ {total_portfolio:.2f}\n")

# 3. Percentage allocation of each stock (stock value as a share of the total)
# Your code here:
allocation = portfolio_stocks / total_portfolio * 100
print("Allocation:")
print(allocation.apply(lambda x: f" {x:.2f} %"))

AAPL      $ 8775.00
GOOGL     $ 4284.00
MSFT     $ 16208.00
AMZN      $ 4458.75
dtype: object

Total: $ 33725.75

Allocation:
AAPL      26.02 %
GOOGL     12.70 %
MSFT      48.06 %
AMZN      13.22 %
dtype: object
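
Arithmetic between two Series, such as `stocks * shares`, pairs values by index label rather than by position. The following sketch uses hypothetical tickers (`AAA`, `BBB`, `CCC`) listed in different orders to show the alignment, and what happens when a label is missing on one side.

```python
import pandas as pd

# Hypothetical tickers and values, deliberately in different orders
prices = pd.Series({"AAA": 10.0, "BBB": 20.0, "CCC": 30.0})
held = pd.Series({"CCC": 1, "AAA": 5})  # no "BBB" position

value = prices * held                # matched by label; "BBB" becomes NaN
weights = value / value.sum() * 100  # sum() skips NaN by default

print(value)
print(weights.round(2))
```

A label present in only one Series produces `NaN` instead of raising an error, so it is worth checking for missing values after combining Series that come from different sources.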


### Exercise 1.3 — Sales Growth Analysis (medium)
Analyze month-over-month sales growth.

In [71]:
# Monthly sales data
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']
sales = [45000, 48000, 52000, 49000, 55000, 58000]

# TODO: Create a Series and calculate:
sales_growth = pd.Series(sales, index=months)

# 1. Month-over-month growth rate (percentage)
current_month = sales_growth
previous_month = sales_growth.shift(1)
growth_rate = (current_month-previous_month)/previous_month*100
print(growth_rate.apply(lambda x: f"{x:.2f} %"))

# 2. Best and worst growth months
print(f"Best month: {growth_rate.idxmax()}")
print(f"Worst month: {growth_rate.idxmin()}")

# 3. Average monthly growth rate
# Your code here:
print(f"Average monthly growth rate: {growth_rate.mean():.2f}%")

Jan      nan %
Feb     6.67 %
Mar     8.33 %
Apr    -5.77 %
May    12.24 %
Jun     5.45 %
dtype: object
Best month: May
Worst month: Apr
Average monthly growth rate: 5.39%
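
The shift-based formula above can also be written with the built-in `Series.pct_change()`. A minimal sketch with hypothetical `revenue` figures (not the lab's sales data) comparing the two approaches:

```python
import pandas as pd

# Hypothetical monthly figures (illustrative only)
revenue = pd.Series([200, 220, 198], index=["Jan", "Feb", "Mar"])

# Manual version: (current - previous) / previous * 100
manual = (revenue - revenue.shift(1)) / revenue.shift(1) * 100

# Built-in version: pct_change() returns a fraction, so multiply by 100
builtin = revenue.pct_change() * 100

print(builtin.round(2))
```

In both versions the first month is `NaN` because it has no previous value; `.mean()`, `.idxmax()` and `.idxmin()` skip it by default.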


## Part 2: DataFrame Creation & Basic Operations (15 minutes)

### Exercise 2.1 — Student Records (easy)
Create and explore a student database.

In [81]:
# Student data
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'age': [22, 24, 23, 22, 25],
    'major': ['CS', 'Math', 'CS', 'Physics', 'Math'],
    'gpa': [3.8, 3.5, 3.9, 3.7, 3.6],
    'credits': [120, 115, 125, 118, 122]
}

# TODO: Create DataFrame and:
students = pd.DataFrame(data)
print(f"DataFrame:\n{students}\n")

# 1. Display first 3 rows
print(f"First three rows:\n{students.head(3)}")

# 2. Show DataFrame info and basic statistics
students.info()
# A cell displays only its last expression, so print the statistics explicitly
print(students.describe())

# 3. Select only name and gpa columns
students[['name', 'gpa']]


DataFrame:
      name  age    major  gpa  credits
0    Alice   22       CS  3.8      120
1      Bob   24     Math  3.5      115
2  Charlie   23       CS  3.9      125
3    Diana   22  Physics  3.7      118
4      Eve   25     Math  3.6      122

First three rows:
      name  age major  gpa  credits
0    Alice   22    CS  3.8      120
1      Bob   24  Math  3.5      115
2  Charlie   23    CS  3.9      125
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   name     5 non-null      object 
 1   age      5 non-null      int64  
 2   major    5 non-null      object 
 3   gpa      5 non-null      float64
 4   credits  5 non-null      int64  
dtypes: float64(1), int64(2), object(2)
memory usage: 332.0+ bytes
             age       gpa     credits
count   5.000000  5.000000    5.000000
mean   23.200000  3.700000  120.000000
std     1.303840  0.158114    3.807887
min    22.000000  3.500000  115.000000
25%    22.000000  3.600000  118.000000
50%    23.000000  3.700000  120.000000
75%    24.000000  3.800000  122.000000
max    25.000000  3.900000  125.000000


      name  gpa
0    Alice  3.8
1      Bob  3.5
2  Charlie  3.9
3    Diana  3.7
4      Eve  3.6
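
Selecting columns with a list of names, as in `students[['name', 'gpa']]`, returns a DataFrame, whereas a single name in single brackets returns a Series. The sketch below illustrates the difference on a hypothetical `pets` frame that is not part of the lab.

```python
import pandas as pd

# Hypothetical two-row frame for illustration
pets = pd.DataFrame({
    "name": ["Rex", "Tom"],
    "legs": [4, 4],
    "weight_kg": [30.5, 4.2],
})

one_col = pets["name"]                    # single label -> Series
sub_frame = pets[["name", "weight_kg"]]   # list of labels -> DataFrame

print(type(one_col).__name__, type(sub_frame).__name__)
print(sub_frame.shape)
```

The distinction matters later on, because methods such as `.describe()` or `.merge()` behave differently on a Series and on a DataFrame.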


### Exercise 2.2 — Product Inventory (medium)
Manage an electronics store inventory.

In [104]:
# Create inventory DataFrame from lists
products = ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Webcam', 'Headphones']
prices = [1200, 25, 80, 350, 120, 150]
stock = [15, 102, 45, 28, 33, 67]
categories = ['Computer', 'Accessory', 'Accessory', 'Computer', 'Accessory', 'Accessory']

# TODO: Create DataFrame with these columns
# Your code here:
product_inventory = pd.DataFrame({
    'products': products,
    'prices': prices,
    'stock': stock, 
    'categories': categories
})

# TODO: Add new columns:
# 1. 'inventory_value' = price * stock
product_inventory['inventory_value'] = product_inventory['prices'] * product_inventory['stock']

# 2. 'needs_reorder' = True if stock < 30
product_inventory['needs_reorder'] = product_inventory['stock'] < 30

# 3. 'price_category' = 'Budget' (<100), 'Mid' (100-500), 'Premium' (>500)
product_inventory['price_category'] = np.where(product_inventory['prices'] < 100, 'Budget',
                                      np.where(product_inventory['prices'] <= 500, 'Mid', 'Premium'))
print(product_inventory)

     products  prices  stock categories  inventory_value  needs_reorder  \
0      Laptop    1200     15   Computer            18000           True   
1       Mouse      25    102  Accessory             2550          False   
2    Keyboard      80     45  Accessory             3600          False   
3     Monitor     350     28   Computer             9800           True   
4      Webcam     120     33  Accessory             3960          False   
5  Headphones     150     67  Accessory            10050          False   

  price_category  
0        Premium  
1         Budget  
2         Budget  
3            Mid  
4            Mid  
5            Mid  
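
Nested `np.where` calls are one way to bin prices; `pd.cut` is another, but its bins are closed on one side only, so boundary values can land in a different tier. The sketch below compares both on hypothetical prices chosen to sit on the boundaries. It is an illustration under those assumptions, not a replacement for the exercise's rule.

```python
import numpy as np
import pandas as pd

# Hypothetical prices chosen to sit on the tier boundaries
price = pd.Series([99, 100, 500, 501])

# Nested np.where, same rule as the exercise: <100, 100-500, >500
tier_where = pd.Series(
    np.where(price < 100, "Budget", np.where(price <= 500, "Mid", "Premium"))
)

# pd.cut with left-closed bins: [-inf, 100), [100, 500), [500, inf)
tier_cut = pd.cut(
    price,
    bins=[-np.inf, 100, 500, np.inf],
    labels=["Budget", "Mid", "Premium"],
    right=False,
).astype(str)

print(pd.DataFrame({"price": price, "np.where": tier_where, "pd.cut": tier_cut}))
```

The two columns agree except at a price of exactly 500, which the exercise's rule labels `Mid` and the left-closed bins label `Premium`. Checking boundary values like this is a quick way to confirm that a binning rule matches its specification.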


### Exercise 2.3 — Data Filtering Challenge (medium)
Practice DataFrame filtering with multiple conditions.

In [116]:
# Create sales data
np.random.seed(42)
sales_data = pd.DataFrame({
    'date': pd.date_range('2025-08-01', periods=20),
    'product': np.random.choice(['A', 'B', 'C'], 20),
    'quantity': np.random.randint(1, 20, 20),
    'price': np.random.uniform(10, 100, 20).round(2),
    'customer_type': np.random.choice(['Regular', 'VIP'], 20)
})

# TODO: Find:
print(sales_data)

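# 1. All VIP customer purchases
# Note: a notebook cell displays only its last expression, so wrap results 1 and 2
# in print() or display() if you want to see them in the output.
sales_data[sales_data['customer_type'] == 'VIP']

# 2. Product A sales with quantity > 10
sales_data[(sales_data['product'] == 'A') & (sales_data['quantity'] > 10)]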

# 3. Sales between Aug 5-10 with price > 50
sales_data[(sales_data['date'] >= '2025-08-05') & 
           (sales_data['date'] <= '2025-08-10') & 
           (sales_data['price'] > 50)]

         date product  quantity  price customer_type
0  2025-08-01       C         1  44.69       Regular
1  2025-08-02       A        12  11.44           VIP
2  2025-08-03       C        12  30.78           VIP
3  2025-08-04       C        17  31.69       Regular
4  2025-08-05       A        10  71.49           VIP
..        ...     ...       ...    ...           ...
15 2025-08-16       A         7  85.81           VIP
16 2025-08-17       B        18  50.48       Regular
17 2025-08-18       B         4  45.56           VIP
18 2025-08-19       B        14  93.40           VIP
19 2025-08-20       B        18  75.45           VIP

[20 rows x 5 columns]


        date product  quantity  price customer_type
4 2025-08-05       A        10  71.49           VIP
5 2025-08-06       A        16  64.90       Regular
6 2025-08-07       C        15  84.99           VIP

## Part 3: Data Selection & Indexing (15 minutes)

### Exercise 3.1 — loc vs iloc Practice (medium)
Master the difference between label-based (`loc`) and position-based (`iloc`) selection.

In [132]:
# Create sample DataFrame
df = pd.DataFrame({
    'A': [1, 4, 7, 10, 13],
    'B': [2, 5, 8, 11, 14],
    'C': [3, 6, 9, 12, 15]
}, index=['row1', 'row2', 'row3', 'row4', 'row5'])

print("Original DataFrame:")
print(df)
print()

# TODO: Use loc to:
# 1. Select row2 to row4, columns A and C (loc slices include both endpoints)
print(df.loc['row2':'row4', ['A', 'C']])
print()

# 2. Select all rows where column B > 8
# Your code here:
print(df.loc[df['B'] > 8, :])
print()

# TODO: Use iloc to:
# 1. Select first 3 rows and last 2 columns
print(df.iloc[0:3, -2:])
print()

# 2. Select every other row
# Your code here:
df.iloc[::2, :]

Original DataFrame:
       A   B   C
row1   1   2   3
row2   4   5   6
row3   7   8   9
row4  10  11  12
row5  13  14  15

       A   C
row2   4   6
row3   7   9
row4  10  12

       A   B   C
row4  10  11  12
row5  13  14  15

      B  C
row1  2  3
row2  5  6
row3  8  9


       A   B   C
row1   1   2   3
row3   7   8   9
row5  13  14  15
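
A common source of confusion is that `loc` slices include the end label while `iloc` slices exclude the end position, as ordinary Python slices do. The sketch below shows this on a hypothetical `grid` frame, separate from the exercise's `df`.

```python
import pandas as pd

# Hypothetical frame with string labels
grid = pd.DataFrame({"x": [10, 20, 30, 40]}, index=["a", "b", "c", "d"])

by_label = grid.loc["b":"c"]    # label slice: includes both "b" and "c"
by_position = grid.iloc[1:2]    # position slice: stops before position 2

print(list(by_label.index))
print(list(by_position.index))
```

The label slice returns two rows and the position slice returns one, even though both start at the same row.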


### Exercise 3.2 — Customer Segmentation (hard)
Segment customers based on purchase behavior.

In [151]:
# Generate customer data
np.random.seed(100)
customers = pd.DataFrame({
    'customer_id': range(1, 51),
    'total_purchases': np.random.randint(1, 50, 50),
    'total_spent': np.random.uniform(100, 5000, 50).round(2),
    'days_since_last_purchase': np.random.randint(1, 180, 50)
})

# TODO: Create customer segments:
# 1. Add 'avg_order_value' column
customers['avg_order_value'] = (customers['total_spent'] / customers['total_purchases'])
print(customers.round())

# 2. Add 'segment' column based on:
#    - 'Champions': total_spent > 3000 AND days_since_last_purchase < 30
#    - 'Loyal': total_purchases > 20 AND days_since_last_purchase < 60
#    - 'At Risk': days_since_last_purchase > 90
#    - 'Regular': all others
champions = (customers['total_spent'] > 3000) & (customers['days_since_last_purchase'] < 30)
loyal = (customers['total_purchases'] > 20) & (customers['days_since_last_purchase'] < 60)
at_risk = customers['days_since_last_purchase'] > 90
conditions = [champions, loyal, at_risk]
choices = ['Champions', 'Loyal', 'At Risk']
customers['segment'] = np.select(conditions, choices, default='Regular')
print(customers.round())

# 3. Count customers in each segment
print(customers['segment'].value_counts())

    customer_id  total_purchases  total_spent  days_since_last_purchase  \
0             1                9       2083.0                        18   
1             2               25       2267.0                        46   
2             3                4        248.0                        90   
3             4               40       3902.0                       174   
4             5               24       2800.0                        36   
..          ...              ...          ...                       ...   
45           46               35       4681.0                        75   
46           47               28        268.0                        31   
47           48               47       3464.0                        68   
48           49               31       2251.0                       157   
49           50               49       4891.0                       166   

    avg_order_value  
0             231.0  
1              91.0  
2              62.0  
3          
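
With `np.select`, the order of the condition list matters: when a row satisfies several conditions, the first matching one wins. A minimal sketch with one hypothetical customer who meets both the 'Champions' and the 'Loyal' rules from the exercise:

```python
import numpy as np
import pandas as pd

# Hypothetical customer who satisfies both rules (illustrative only)
demo = pd.DataFrame({
    "total_purchases": [30],
    "total_spent": [4000.0],
    "days_since_last_purchase": [10],
})

champions = (demo["total_spent"] > 3000) & (demo["days_since_last_purchase"] < 30)
loyal = (demo["total_purchases"] > 20) & (demo["days_since_last_purchase"] < 60)

# Same conditions, different priority order
champions_first = np.select([champions, loyal], ["Champions", "Loyal"], default="Regular")
loyal_first = np.select([loyal, champions], ["Loyal", "Champions"], default="Regular")

print(champions_first[0], loyal_first[0])
```

Listing the most specific or highest-priority segment first, as the exercise does with 'Champions', keeps overlapping customers in the intended group.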

## Part 4: Challenge Problems (15 minutes)

### Challenge 1 — Missing Data Handler (hard)
Clean and impute missing values intelligently.

In [164]:
# Create data with missing values
messy_data = pd.DataFrame({
    'product': ['A', 'B', 'C', 'D', 'E', 'F'],
    'price': [100, None, 150, 200, None, 180],
    'stock': [50, 30, None, 45, 60, None],
    'category': ['Electronics', 'Electronics', None, 'Clothing', 'Clothing', 'Electronics'],
    'rating': [4.5, 4.2, None, 3.8, 4.6, None]
})

print("Messy data:")
print(messy_data)
print()

# TODO: Clean the data:
# 1. Fill missing prices with category average
messy_data['price'] = messy_data['price'].fillna(messy_data.groupby('category')['price'].transform('mean'))


# 2. Fill missing stock with median stock
messy_data['stock'] = messy_data['stock'].fillna(messy_data['stock'].median())


# 3. Fill missing categories by looking at similar price ranges
print(messy_data.groupby('category')['price'].describe())
# Create a mask for missing categories
missing_cat = messy_data['category'].isna()
# Fill based on price ranges 
messy_data.loc[missing_cat & (messy_data['price'] <= 180), 'category'] = 'Electronics'
messy_data.loc[missing_cat & (messy_data['price'] > 180), 'category'] = 'Clothing'


# 4. Fill missing ratings with overall average
messy_data['rating'] = messy_data['rating'].fillna(messy_data['rating'].mean())


# 5. Create a report showing what was fixed
# Your code here:
# Show missing values in each column
print("Missing values per column:")
print(messy_data.isnull().sum())

print("Data Cleaning Report:")
print("- Filled missing prices with category averages")
print("- Filled missing stock with median values") 
print("- Filled missing categories based on price ranges")
print("- Filled missing ratings with overall average")

print("Cleaned Dataset:")
print(messy_data)
print(f"\nTotal missing values remaining: {messy_data.isnull().sum().sum()}")

Messy data:
  product  price  stock     category  rating
0       A  100.0   50.0  Electronics     4.5
1       B    NaN   30.0  Electronics     4.2
2       C  150.0    NaN         None     NaN
3       D  200.0   45.0     Clothing     3.8
4       E    NaN   60.0     Clothing     4.6
5       F  180.0    NaN  Electronics     NaN

             count   mean   std    min    25%    50%    75%    max
category                                                          
Clothing       2.0  200.0   0.0  200.0  200.0  200.0  200.0  200.0
Electronics    3.0  140.0  40.0  100.0  120.0  140.0  160.0  180.0
Missing values per column:
product     0
price       0
stock       0
category    0
rating      0
dtype: int64
Data Cleaning Report:
- Filled missing prices with category averages
- Filled missing stock with median values
- Filled missing categories based on price ranges
- Filled missing ratings with overall average
Cleaned Dataset:
  product  price  stock     category  rating
0       A  100.0   50.0  
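
One way to make the cleaning report show what was actually fixed is to count missing values before and after imputation. The sketch below does this on a hypothetical `toy` frame (not the lab's `messy_data`) and also shows why `groupby(...).transform('mean')` pairs well with `fillna`: it returns one group mean per row, aligned with the original index.

```python
import pandas as pd

# Hypothetical frame with gaps (illustrative only)
toy = pd.DataFrame({
    "group": ["x", "x", "y", "y"],
    "value": [10.0, None, 30.0, None],
})

missing_before = toy.isnull().sum()

# One group mean per row, so it lines up with the rows that need filling
group_mean = toy.groupby("group")["value"].transform("mean")
toy["value"] = toy["value"].fillna(group_mean)

report = pd.DataFrame({
    "missing_before": missing_before,
    "missing_after": toy.isnull().sum(),
})
report["filled"] = report["missing_before"] - report["missing_after"]
print(report)
```

Capturing `missing_before` has to happen before any `fillna` call; counting only after cleaning will always report zeros.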

### Challenge 2 — Time Series Basics (hard)
Analyze daily sales patterns.

In [169]:
# Generate time series data
dates = pd.date_range('2025-07-01', periods=30)
daily_sales = pd.DataFrame({
    'date': dates,
    'sales': np.random.randint(1000, 5000, 30) + 
             np.sin(np.arange(30) * 2 * np.pi / 7) * 500  # Weekly pattern
})

# TODO: Analyze the time series:
# 1. Add day_of_week column
daily_sales['day_of_week'] = daily_sales['date'].dt.day_name()

# 2. Calculate 7-day rolling average
daily_sales['rolling_avg'] = daily_sales['sales'].rolling(window=7).mean()
print(daily_sales)

# 3. Find best and worst days of the week
daily_avg = daily_sales.groupby('day_of_week')['sales'].mean()
best_day = daily_avg.idxmax()
worst_day = daily_avg.idxmin()
print(f"Best day: {best_day}")
print(f"Worst day: {worst_day}")

# 4. Calculate week-over-week growth
# Your code here:
weekly_sales = daily_sales.set_index('date').resample('W')['sales'].sum()
weekly_growth = weekly_sales.pct_change() * 100
print(weekly_growth)

         date        sales day_of_week  rolling_avg
0  2025-07-01  3113.000000     Tuesday          NaN
1  2025-07-02  1714.915741   Wednesday          NaN
2  2025-07-03  3717.463956    Thursday          NaN
3  2025-07-04  3904.941870      Friday          NaN
4  2025-07-05  3174.058130    Saturday          NaN
..        ...          ...         ...          ...
25 2025-07-26  1692.058130    Saturday  3575.000000
26 2025-07-27  3233.536044      Sunday  3652.285714
27 2025-07-28  2288.084259      Monday  3447.714286
28 2025-07-29  4217.000000     Tuesday  3657.142857
29 2025-07-30  2758.915741   Wednesday  3476.285714

[30 rows x 4 columns]
Best day: Friday
Worst day: Saturday
date
2025-07-06          NaN
2025-07-13    25.208974
2025-07-20   -19.211047
2025-07-27    42.341740
2025-08-03   -63.764375
Freq: W-SUN, Name: sales, dtype: float64
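
The first and last weekly bins above are incomplete: the data starts on Tuesday 2025-07-01 and ends on 2025-07-30, so the week ending 2025-08-03 holds only three days, which accounts for much of the -63.76% drop. The sketch below uses hypothetical constant sales to show the effect and one possible fix, keeping only weeks with seven observations. The `flat` series and the seven-day rule are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical constant daily sales: any "growth" comes only from week length
days = pd.date_range("2025-07-01", periods=30)
flat = pd.Series(100.0, index=days)

weekly = flat.resample("W").agg(["sum", "count"])
full_weeks = weekly[weekly["count"] == 7]        # keep complete weeks only
growth = full_weeks["sum"].pct_change() * 100

print(weekly)
print(growth)
```

With partial weeks removed, constant daily sales correctly show 0% week-over-week growth.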


### Challenge 3 — Data Aggregation Pipeline (hard)
Build a complete analysis pipeline.

In [179]:
# E-commerce transaction data
np.random.seed(42)
transactions = pd.DataFrame({
    'transaction_id': range(1, 101),
    'customer_id': np.random.randint(1, 21, 100),
    'product': np.random.choice(['Laptop', 'Phone', 'Tablet', 'Watch'], 100),
    'amount': np.random.uniform(100, 2000, 100).round(2),
    'date': pd.date_range('2025-08-01', periods=100, freq='h')
})

# TODO: Create a comprehensive analysis:
# 1. Calculate total revenue by product
total_revenue = transactions.groupby('product')['amount'].sum()

# 2. Find top 5 customers by total spending  
top_5_cust = transactions.groupby('customer_id')['amount'].sum().nlargest(5)

# 3. Identify peak shopping hours
peak_hours = transactions.groupby(transactions['date'].dt.hour).size()
busiest_hour = peak_hours.idxmax()

# 4. Skip customer metrics for now

# 5. Create a summary report DataFrame
summary_report = pd.DataFrame({
    'Product': total_revenue.index,
    'Total_Revenue': total_revenue.values
})
print(summary_report)

  Product  Total_Revenue
0  Laptop       24941.12
1   Phone       23487.18
2  Tablet       30381.74
3   Watch       25319.93
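
Step 4 (customer metrics) was skipped above. One possible approach, sketched here on hypothetical transactions rather than the lab's random data, is named aggregation with `groupby(...).agg(...)`. The column names `orders`, `total_spent` and `avg_order` are illustrative choices.

```python
import pandas as pd

# Hypothetical transactions (illustrative only)
tx = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "amount": [100.0, 300.0, 50.0, 50.0, 200.0],
})

# Named aggregation: new_column=(source_column, aggregation)
customer_metrics = tx.groupby("customer_id").agg(
    orders=("amount", "count"),
    total_spent=("amount", "sum"),
    avg_order=("amount", "mean"),
)
print(customer_metrics)
```

The result has one row per customer and flat column names, which makes it easy to sort or to join into a summary report.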


## 🎯 Bonus Challenge: Real-World Mini Project
Combine everything you've learned!

In [184]:
# Restaurant order analysis
# You're analyzing data for a restaurant chain

# Generate realistic restaurant data
np.random.seed(123)
menu_items = ['Burger', 'Pizza', 'Salad', 'Pasta', 'Steak', 'Fish', 'Soup', 'Dessert']
locations = ['Downtown', 'Mall', 'Airport', 'Suburb']

orders = pd.DataFrame({
    'order_id': range(1, 201),
    'location': np.random.choice(locations, 200, p=[0.3, 0.25, 0.2, 0.25]),
    'item': np.random.choice(menu_items, 200),
    'quantity': np.random.randint(1, 5, 200),
    'price': np.random.uniform(8, 45, 200).round(2),
    'day_of_week': np.random.choice(['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'], 200),
    'meal_type': np.random.choice(['Breakfast', 'Lunch', 'Dinner'], 200, p=[0.2, 0.35, 0.45])
})

# TODO: Complete analysis to answer these business questions:
# 1. Which location has the highest average order value?
orders['order_value'] = orders['price'] * orders['quantity']
aov_by_location = orders.groupby('location')['order_value'].mean()
print(orders)

# 2. What's the most popular item at each location?
item_counts = orders.groupby(['location', 'item']).size()
most_popular = item_counts.groupby('location').idxmax()

# 3. Which meal type generates the most revenue?
highest_revenue_meal = orders.groupby('meal_type')['order_value'].sum().idxmax()

# 4. What's the busiest day of the week?
busiest_day = orders.groupby('day_of_week').size().idxmax()

# 5. Create a location performance summary with key metrics
location_summary = orders.groupby('location').agg({
    'order_value': ['count', 'sum', 'mean'],  # Order count, total revenue, avg order value
    'quantity': 'sum',                         # Total items sold
    'price': 'mean'                           # Average item price
}).round(2)
print(location_summary)

     order_id  location     item  quantity  price day_of_week  meal_type  \
0           1   Airport     Fish         2  24.88         Fri  Breakfast   
1           2  Downtown    Pizza         2  18.35         Tue     Dinner   
2           3  Downtown    Salad         2  42.52         Tue     Dinner   
3           4   Airport  Dessert         1  19.63         Sat  Breakfast   
4           5   Airport   Burger         1  41.66         Fri     Dinner   
..        ...       ...      ...       ...    ...         ...        ...   
195       196   Airport    Pasta         2  24.51         Sun     Dinner   
196       197  Downtown     Soup         2  32.55         Wed     Dinner   
197       198   Airport     Soup         3  28.33         Fri      Lunch   
198       199      Mall    Salad         4   9.02         Mon     Dinner   
199       200  Downtown    Pasta         3   9.18         Wed     Dinner   

     order_value  
0          49.76  
1          36.70  
2          85.04  
3          
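
In question 2, `idxmax()` on a Series indexed by `(location, item)` returns whole index tuples rather than item names. A minimal sketch on hypothetical orders (not the lab's random data) showing how the item could be pulled out of each tuple:

```python
import pandas as pd

# Hypothetical orders (illustrative only)
demo_orders = pd.DataFrame({
    "location": ["Mall", "Mall", "Mall", "Airport", "Airport"],
    "item": ["Pizza", "Pizza", "Soup", "Salad", "Salad"],
})

counts = demo_orders.groupby(["location", "item"]).size()
top = counts.groupby(level="location").idxmax()   # values are (location, item) tuples
top_items = top.map(lambda pair: pair[1])          # keep only the item name

print(top_items)
```

If two items tie at a location, `idxmax()` returns the first one it encounters, so ties may need separate handling.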

## 📊 Lab Summary Checklist
Before submitting, ensure you've completed:

- [ ] All Series exercises (1.1 - 1.3)
- [ ] DataFrame creation and operations (2.1 - 2.3)
- [ ] Selection and indexing practice (3.1 - 3.2)
- [ ] At least 2 challenge problems
- [ ] Bonus challenge (optional)

**Self-Assessment:**
- I can create and manipulate Series ✅
- I can build DataFrames from various sources ✅
- I understand loc vs iloc selection ✅
- I can filter data with multiple conditions ✅
- I can add and modify columns ✅

## 🚀 What's Next?
**Lab 01B:** Data wrangling with groupby, merge, and pivot operations!