# **AI TECH INSTITUTE** · *Intermediate AI & Data Science*
### Week 01 · Lab 01A — Pandas Basics
**Instructor:** Amir Charkhi  |  **Duration:** 45 minutes  |  **Difficulty:** ⭐⭐☆☆☆

> **Goal:** Master Series and basic DataFrame operations through hands-on exercises.


## Learning Objectives
- Create and manipulate pandas Series
- Build DataFrames from various sources
- Perform basic selections and filtering
- Apply simple transformations

## Setup
Run this cell first to import libraries and set up your environment.

In [3]:
import pandas as pd
import numpy as np

# Display settings
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', None)

print("🐼 Pandas version:", pd.__version__)
print("Ready to start!")

🐼 Pandas version: 2.3.2
Ready to start!


## Part 1: Series Fundamentals (15 minutes)
A Series is a one-dimensional labeled array and the building block of every DataFrame: think of it as a Python list whose elements also carry index labels.

### Exercise 1.1 — Temperature Tracker (easy)
Create a Series of daily temperatures for a week and perform basic analysis.

In [5]:
# TODO: Create a Series with these temperatures: [22.5, 23.1, 21.8, 24.2, 22.9, 25.5, 20.3]
# Use days of the week as the index
# Your code here:
temperatures = [22.5, 23.1, 21.8, 24.2, 22.9, 25.5, 20.3]
labels = ['Mon','Tue','Wed','Thu','Fri','Sat','Sun']
temp_pd = pd.Series(temperatures, index=labels)

# TODO: Find and print:
# 1. The hottest day and temperature
# 2. Days above 23 degrees
# 3. Average temperature for the week
# Your code here:
print(f"Hottest day: {temp_pd.idxmax()} at {temp_pd.max()} degC.")
print(f"\nDays above 23 degC:\n{temp_pd[temp_pd>23]}")
print(f"\nAverage temperature for the week: {temp_pd.mean():.1f} degC.")

Hottest day: Sat at 25.5 degC.

Days above 23 degC:
Tue    23.1
Thu    24.2
Sat    25.5
dtype: float64

Average temperature for the week: 22.9 degC.


<details>
<summary><b>Hint</b></summary>

- Use `pd.Series(data, index=labels)` to create the Series
- `.idxmax()` returns the index of the maximum value
- Boolean indexing: `series[series > value]`
</details>
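
The three hints can be tried together on a tiny Series. This is only a sketch: the `rainfall` values and labels are made up for illustration and are not part of the exercise.

```python
import pandas as pd

# Hypothetical data, used only to illustrate the hints above
rainfall = pd.Series([0.0, 4.2, 1.5, 7.8], index=['Mon', 'Tue', 'Wed', 'Thu'])

wettest_day = rainfall.idxmax()      # index label of the largest value
wet_days = rainfall[rainfall > 2]    # boolean mask keeps only matching entries

print(wettest_day)
print(wet_days)
```

Note that `.idxmax()` returns a label, not a position, which is why it can be printed directly as the day name.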

### Exercise 1.2 — Stock Portfolio (medium)
Track stock prices and calculate portfolio value.

In [10]:
# Given data
stocks = {'AAPL': 175.50, 'GOOGL': 142.80, 'MSFT': 405.20, 'AMZN': 178.35}
shares = {'AAPL': 50, 'GOOGL': 30, 'MSFT': 40, 'AMZN': 25}

# TODO: Create Series for stocks and shares
# Your code here:
stocks_pd = pd.Series(stocks)
shares_pd = pd.Series(shares)

# TODO: Calculate:
# 1. Portfolio value for each stock (price * shares)
# 2. Total portfolio value
# 3. Percentage allocation of each stock
# Your code here:
portfolio = stocks_pd * shares_pd
total_portfolio = portfolio.sum()
portfolio_percentage = round(portfolio/total_portfolio*100,2)

print(f"Portfolio\n{portfolio}")
print(f"\nTotal Portfolio: {total_portfolio}")
print(f"\nPortfolio Percentage\n{portfolio_percentage}")

Portfolio
AAPL      8775.00
GOOGL     4284.00
MSFT     16208.00
AMZN      4458.75
dtype: float64

Total Portfolio: 33725.75

Portfolio Percentage
AAPL     26.02
GOOGL    12.70
MSFT     48.06
AMZN     13.22
dtype: float64
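
The multiplication above works because pandas aligns two Series on their index labels, not on position. A minimal sketch of that behaviour, using hypothetical tickers (`AAA`, `BBB`, `CCC`) that do not appear in the exercise:

```python
import pandas as pd

# Hypothetical tickers and values, chosen only to show label alignment
prices = pd.Series({'AAA': 10.0, 'BBB': 20.0, 'CCC': 30.0})
held = pd.Series({'CCC': 1, 'AAA': 5})   # different order, and 'BBB' is missing

value = prices * held    # matched by label, so order does not matter
total = value.sum()      # sum() skips NaN by default

print(value)             # 'BBB' is NaN because it has no partner in `held`
print(total)
```

If a label is missing from one side, the result is `NaN` for that label, so it is worth checking for missing values before trusting a total.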


### Exercise 1.3 — Sales Growth Analysis (medium)
Analyze month-over-month sales growth.

In [11]:
# Monthly sales data
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']
sales = [45000, 48000, 52000, 49000, 55000, 58000]

# TODO: Create a Series and calculate:
# 1. Month-over-month growth rate (percentage)
# 2. Best and worst growth months
# 3. Average monthly growth rate
# Your code here:

sales_pd = pd.Series(sales, index=months)

growth_rate=[]
for i in range(1,len(sales_pd)):
    # .iloc selects by position; sales_pd[i] on a string-labelled index is deprecated
    growth = (sales_pd.iloc[i]-sales_pd.iloc[i-1])/sales_pd.iloc[i-1]*100
    growth_rate.append(growth)
growth_rate_pd = pd.Series(growth_rate,index=months[1:])

print("Monthly growth rate, %")
for m,g in zip(months[1:],growth_rate_pd):
    print(f"{m}: {g:.2f}")

print(f"\nBest growth month: {growth_rate_pd.idxmax()} at {growth_rate_pd.max():.2f} %")
print(f"Worst growth month: {growth_rate_pd.idxmin()} at {growth_rate_pd.min():.2f} %")
print(f"Average growth rate: {growth_rate_pd.mean():.2f} %")

Monthly growth rate, %
Feb: 6.67
Mar: 8.33
Apr: -5.77
May: 12.24
Jun: 5.45

Best growth month: May at 12.24 %
Worst growth month: Apr at -5.77 %
Average growth rate: 5.39 %
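
The same growth rates can be obtained without a loop. The sketch below uses `Series.pct_change()`, which returns the fractional change from the previous element; the variable name `growth_pct` is introduced here for illustration only.

```python
import pandas as pd

months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']
sales = pd.Series([45000, 48000, 52000, 49000, 55000, 58000], index=months)

# The first entry has no predecessor, so pct_change() gives NaN there;
# dropna() removes it before converting the fractions to percentages
growth_pct = sales.pct_change().dropna() * 100

print(growth_pct.round(2))
print(f"Best: {growth_pct.idxmax()}, worst: {growth_pct.idxmin()}")
```

The vectorized form avoids positional indexing altogether, so it does not depend on how the index is labelled.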




## Part 2: DataFrame Creation & Basic Operations (15 minutes)

### Exercise 2.1 — Student Records (easy)
Create and explore a student database.

In [29]:
# Student data
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'age': [22, 24, 23, 22, 25],
    'major': ['CS', 'Math', 'CS', 'Physics', 'Math'],
    'gpa': [3.8, 3.5, 3.9, 3.7, 3.6],
    'credits': [120, 115, 125, 118, 122]
}

In [30]:
# TODO: Create DataFrame and:
# 1. Display first 3 rows
df_student = pd.DataFrame(data)
df_student.head(3)

      name  age  major  gpa  credits
0    Alice   22     CS  3.8      120
1      Bob   24   Math  3.5      115
2  Charlie   23     CS  3.9      125


In [31]:
# (option 2) Display first 3 rows
df_student.iloc[:3]

      name  age  major  gpa  credits
0    Alice   22     CS  3.8      120
1      Bob   24   Math  3.5      115
2  Charlie   23     CS  3.9      125


In [32]:
# 2. Show DataFrame info and basic statistics
print("DataFrame info:\n")
df_student.info()  # info() prints its report itself and returns None, so it is not wrapped in print()
print("\nDataFrame statistics:")
print(df_student.describe())

DataFrame info:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   name     5 non-null      object 
 1   age      5 non-null      int64  
 2   major    5 non-null      object 
 3   gpa      5 non-null      float64
 4   credits  5 non-null      int64  
dtypes: float64(1), int64(2), object(2)
memory usage: 332.0+ bytes

DataFrame statistics:
            age       gpa     credits
count   5.00000  5.000000    5.000000
mean   23.20000  3.700000  120.000000
std     1.30384  0.158114    3.807887
min    22.00000  3.500000  115.000000
25%    22.00000  3.600000  118.000000
50%    23.00000  3.700000  120.000000
75%    24.00000  3.800000  122.000000
max    25.00000  3.900000  125.000000


In [33]:
# 3. Select only name and gpa columns
df_student[['name','gpa']]

      name  gpa
0    Alice  3.8
1      Bob  3.5
2  Charlie  3.9
3    Diana  3.7
4      Eve  3.6


### Exercise 2.2 — Product Inventory (medium)
Manage an electronics store inventory.

In [34]:
# Create inventory DataFrame from lists
products = ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Webcam', 'Headphones']
prices = [1200, 25, 80, 350, 120, 150]
stock = [15, 102, 45, 28, 33, 67]
categories = ['Computer', 'Accessory', 'Accessory', 'Computer', 'Accessory', 'Accessory']

In [37]:
# TODO: Create DataFrame with these columns
# Your code here:
# Reuse the lists defined in the previous cell instead of retyping the values
product_df = pd.DataFrame({
    'products': products,
    'prices': prices,
    'stock': stock,
    'categories': categories
})

In [42]:
# TODO: Add new columns:
# 1. 'inventory_value' = price * stock
# 2. 'needs_reorder' = True if stock < 30
product_df['inventory_value'] = product_df['prices']*product_df['stock']
product_df['needs_reorder'] = product_df['stock']<30

In [46]:
# 3. 'price_category' = 'Budget' (<100), 'Mid' (100-500), 'Premium' (>500)
def price_category(n):
    if n < 100:
        return 'Budget'
    elif n <= 500:  # 'Mid' covers 100-500 inclusive; only prices above 500 are 'Premium'
        return 'Mid'
    else:
        return 'Premium'
product_df['price_category']= product_df['prices'].apply(price_category)

In [48]:
product_df

     products  prices  stock  categories  inventory_value  needs_reorder  price_category
0      Laptop    1200     15    Computer            18000           True         Premium
1       Mouse      25    102   Accessory             2550          False          Budget
2    Keyboard      80     45   Accessory             3600          False          Budget
3     Monitor     350     28    Computer             9800           True             Mid
4      Webcam     120     33   Accessory             3960          False             Mid
5  Headphones     150     67   Accessory            10050          False             Mid
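
For binning a numeric column into labelled ranges, `pd.cut` is a common alternative to a hand-written function with `.apply`. The sketch below is one possible setup, not the exercise's required solution; the bin edges are an assumption. With `right=False` each bin includes its left edge, so a price of exactly 500 would be labelled `'Premium'`; adjust the edges if 500 should count as `'Mid'`.

```python
import pandas as pd

prices = pd.Series([1200, 25, 80, 350, 120, 150])

# right=False gives the half-open bins [-inf, 100), [100, 500), [500, inf)
labels = pd.cut(
    prices,
    bins=[-float('inf'), 100, 500, float('inf')],
    labels=['Budget', 'Mid', 'Premium'],
    right=False,
)

print(labels.tolist())
```

The result is a categorical Series, which also remembers the order Budget < Mid < Premium.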


### Exercise 2.3 — Data Filtering Challenge (medium)
Practice DataFrame filtering with multiple conditions.

In [49]:
# Create sales data
np.random.seed(42)
sales_data = pd.DataFrame({
    'date': pd.date_range('2025-08-01', periods=20),
    'product': np.random.choice(['A', 'B', 'C'], 20),
    'quantity': np.random.randint(1, 20, 20),
    'price': np.random.uniform(10, 100, 20).round(2),
    'customer_type': np.random.choice(['Regular', 'VIP'], 20)
})


In [56]:
# 1. All VIP customer purchases
VIP_sales = sales_data[sales_data['customer_type']=='VIP']
Total_VIP_sales = (VIP_sales['quantity']*VIP_sales['price']).sum()
print(f"VIP Sales\n{VIP_sales}")
print(f"\nTotal VIP Sales: ${Total_VIP_sales:.2f}")

VIP Sales
         date product  quantity  price customer_type
1  2025-08-02       A        12  11.44           VIP
2  2025-08-03       C        12  30.78           VIP
4  2025-08-05       A        10  71.49           VIP
6  2025-08-07       C        15  84.99           VIP
7  2025-08-08       B        15  25.60           VIP
..        ...     ...       ...    ...           ...
14 2025-08-15       B         9  12.82           VIP
15 2025-08-16       A         7  85.81           VIP
17 2025-08-18       B         4  45.56           VIP
18 2025-08-19       B        14  93.40           VIP
19 2025-08-20       B        18  75.45           VIP

[12 rows x 5 columns]

Total VIP Sales: $7002.48


In [62]:
# 2. Product A sales with quantity > 10
Product_A_sales = sales_data[(sales_data['product']=='A') & (sales_data['quantity']>10)]
Total_Product_A_sales = (Product_A_sales['quantity']*Product_A_sales['price']).sum()
print(f"Product A sales (>10)\n{Product_A_sales}")
print(f"\nTotal Product A sales: ${Total_Product_A_sales:.2f}")

Product A sales (>10)
         date product  quantity  price customer_type
1  2025-08-02       A        12  11.44           VIP
5  2025-08-06       A        16  64.90       Regular
12 2025-08-13       A        19  28.71       Regular

Total Product A sales: $1721.17


In [69]:
# 3. Sales between Aug 5-10 with price > 50
start_date = pd.Timestamp('2025-08-05')
end_date = pd.Timestamp('2025-08-10')
Sales_filtered = sales_data[(sales_data['date']>= start_date) & (sales_data['date']<= end_date) & (sales_data['price']>50)]
Total_sales_filtered = (Sales_filtered['quantity']*Sales_filtered['price']).sum()
print(f"Sales between Aug 5-10 with Price > $50\n{Sales_filtered}")
print(f"\nTotal sales during this period: ${Total_sales_filtered:.2f}")

Sales between Aug 5-10 with Price > $50
        date product  quantity  price customer_type
4 2025-08-05       A        10  71.49           VIP
5 2025-08-06       A        16  64.90       Regular
6 2025-08-07       C        15  84.99           VIP

Total sales during this period: $3028.15
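
Range filters like the date window above can also be written with `Series.between`, which is inclusive on both ends by default. The rows in this sketch are hypothetical and only mimic the shape of `sales_data`.

```python
import pandas as pd

# Hypothetical rows, shaped like sales_data in the exercise
demo = pd.DataFrame({
    'date': pd.date_range('2025-08-04', periods=5),
    'price': [40.0, 71.5, 64.9, 12.0, 85.0],
})

in_window = demo['date'].between(pd.Timestamp('2025-08-05'), pd.Timestamp('2025-08-07'))
pricey = demo['price'] > 50

# Each condition is a boolean Series; combine them with & (and), | (or), ~ (not)
subset = demo[in_window & pricey]
print(subset)
```

Naming each mask separately keeps long filters readable and makes it easy to inspect one condition at a time.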


## Part 3: Data Selection & Indexing (15 minutes)

### Exercise 3.1 — loc vs iloc Practice (medium)
Master the difference between label-based (`loc`) and position-based (`iloc`) selection.

In [72]:
# Create sample DataFrame
df = pd.DataFrame({
    'A': [1, 4, 7, 10, 13],
    'B': [2, 5, 8, 11, 14],
    'C': [3, 6, 9, 12, 15]
}, index=['row1', 'row2', 'row3', 'row4', 'row5'])

print("Original DataFrame:")
print(df)
print()


Original DataFrame:
       A   B   C
row1   1   2   3
row2   4   5   6
row3   7   8   9
row4  10  11  12
row5  13  14  15



In [84]:
# TODO: Use loc to:
# 1. Select row2 to row4, columns A and C
df_filtered_loc1 = df.loc['row2':'row4',['A','C']]
print(f"Row 2 to 4, columns A and C: \n{df_filtered_loc1}")

# 2. Select all rows where column B > 8
df_filtered_loc2 = df.loc[df['B']>8]
print(f"\nAll rows where B>8: \n{df_filtered_loc2}")


Row 2 to 4, columns A and C: 
       A   C
row2   4   6
row3   7   9
row4  10  12

All rows where B>8: 
       A   B   C
row4  10  11  12
row5  13  14  15


In [93]:
# TODO: Use iloc to:
# 1. Select first 3 rows and last 2 columns
df_filtered_iloc1 = df.iloc[0:3,-2:]
print(f"First 3 rows and last 2 columns: \n{df_filtered_iloc1}")

# 2. Select every other row
df_filtered_iloc2 = df.iloc[::2,:]
print(f"\nEvery other row: \n{df_filtered_iloc2}")

First 3 rows and last 2 columns: 
      B  C
row1  2  3
row2  5  6
row3  8  9

Every other row: 
       A   B   C
row1   1   2   3
row3   7   8   9
row5  13  14  15
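
One detail that often trips people up: a label slice with `loc` includes its end label, while a positional slice with `iloc` excludes its end position, like ordinary Python slicing. A small sketch (the name `df_demo` is introduced here for illustration):

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'A': [1, 4, 7, 10, 13]},
    index=['row1', 'row2', 'row3', 'row4', 'row5'],
)

by_label = df_demo.loc['row2':'row4']   # label slice: 'row4' is included
by_position = df_demo.iloc[1:3]         # positional slice: position 3 is excluded

print(len(by_label), len(by_position))
```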


### Exercise 3.2 — Customer Segmentation (hard)
Segment customers based on purchase behavior.

In [98]:
# Generate customer data
np.random.seed(100)
customers = pd.DataFrame({
    'customer_id': range(1, 51),
    'total_purchases': np.random.randint(1, 50, 50),
    'total_spent': np.random.uniform(100, 5000, 50).round(2),
    'days_since_last_purchase': np.random.randint(1, 180, 50)
})


In [96]:
customers.head(5)

   customer_id  total_purchases  total_spent  days_since_last_purchase
0            1                9      2083.31                        18
1            2               25      2267.04                        46
2            3                4       247.60                        90
3            4               40      3902.43                       174
4            5               24      2799.70                        36


In [103]:
# TODO: Create customer segments:
# 1. Add 'avg_order_value' column
customers['avg_order_value']=round(customers['total_spent']/customers['total_purchases'],2)
print(f"Added average order value per customer: \n{customers.head()}")

Added average order value per customer: 
   customer_id  total_purchases  total_spent  days_since_last_purchase  \
0            1                9      2083.31                        18   
1            2               25      2267.04                        46   
2            3                4       247.60                        90   
3            4               40      3902.43                       174   
4            5               24      2799.70                        36   

   avg_order_value  
0           231.48  
1            90.68  
2            61.90  
3            97.56  
4           116.65  


In [104]:
customers.describe()

       customer_id  total_purchases  total_spent  days_since_last_purchase  avg_order_value
count         50.0             50.0         50.0                      50.0             50.0
mean          25.5             23.8    2473.2616                     85.82         237.2016
std       14.57738        14.951623  1563.637286                 52.274273       404.463512
min            1.0              2.0       155.81                       1.0             6.23
25%          13.25            10.25     850.1225                     35.25            59.11
50%           25.5             24.5      2687.73                      88.5            94.68
75%          37.75             35.0    3949.5975                    133.25           191.67
max           50.0             49.0      4914.89                     174.0          1839.47


In [117]:
# 2. Add 'segment' column based on:
#    - 'Champions': total_spent > 3000 AND days_since_last_purchase < 30
#    - 'Loyal': total_purchases > 20 AND days_since_last_purchase < 60
#    - 'At Risk': days_since_last_purchase > 90
#    - 'Regular': all others

def function_segment(row):
    # Called once per row (axis=1), so each lookup is a scalar: use `and`, not `&`
    if row['total_spent']>3000 and row['days_since_last_purchase']<30:
        return 'Champions'
    elif row['total_purchases']>20 and row['days_since_last_purchase']<60:
        return 'Loyal'
    elif row['days_since_last_purchase']>90:
        return 'At Risk'
    else:
        return 'Regular'
    
customers['segment']=customers.apply(function_segment, axis=1)
print(f"Added 'segment' column: \n{customers.head()}")

Added 'segment' column: 
   customer_id  total_purchases  total_spent  days_since_last_purchase  \
0            1                9      2083.31                        18   
1            2               25      2267.04                        46   
2            3                4       247.60                        90   
3            4               40      3902.43                       174   
4            5               24      2799.70                        36   

   avg_order_value  segment  
0           231.48  Regular  
1            90.68    Loyal  
2            61.90  Regular  
3            97.56  At Risk  
4           116.65    Loyal  


In [121]:
# 3. Count customers in each segment
segment_count = customers['segment'].value_counts()
print(f"Customer counts: \n{segment_count}")

Customer counts: 
segment
At Risk      24
Regular      17
Champions     5
Loyal         4
Name: count, dtype: int64
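
Row-wise `apply` is easy to read but slow on large tables. One vectorized alternative is `np.select`, which takes a list of boolean conditions and picks the first match per row, mirroring an if/elif chain. This is a sketch on four hypothetical customers, not the notebook's data.

```python
import numpy as np
import pandas as pd

# Hypothetical customers, one for each segment
demo = pd.DataFrame({
    'total_purchases': [10, 25, 5, 8],
    'total_spent': [3500.0, 900.0, 400.0, 1200.0],
    'days_since_last_purchase': [12, 40, 120, 75],
})

conditions = [
    (demo['total_spent'] > 3000) & (demo['days_since_last_purchase'] < 30),
    (demo['total_purchases'] > 20) & (demo['days_since_last_purchase'] < 60),
    demo['days_since_last_purchase'] > 90,
]
choices = ['Champions', 'Loyal', 'At Risk']

# Conditions are checked in order; rows matching none get the default
demo['segment'] = np.select(conditions, choices, default='Regular')
print(demo)
```

Here `&` is correct because each condition compares whole columns, producing boolean Series.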


In [124]:
print(f"The Champions: \n{customers[customers['segment']=='Champions'][['customer_id','total_spent','days_since_last_purchase']]}")

The Champions: 
    customer_id  total_spent  days_since_last_purchase
10           11      4328.43                        21
12           13      4142.96                         9
22           23      3994.91                         1
37           38      3678.94                        25
38           39      4243.53                        29


In [125]:
print(f"The Loyals:\n{customers[customers['segment']=='Loyal'][['customer_id','total_purchases','days_since_last_purchase']]}")

The Loyals:
    customer_id  total_purchases  days_since_last_purchase
1             2               25                        46
4             5               24                        36
6             7               49                        13
46           47               28                        31


In [128]:
print(f"At risk:\n{customers[customers['segment']=='At Risk']['days_since_last_purchase'].describe()}")

At risk:
count     24.000000
mean     132.666667
std       24.588203
min       91.000000
25%      115.500000
50%      135.000000
75%      147.500000
max      174.000000
Name: days_since_last_purchase, dtype: float64


In [130]:
print(f"The Regulars: \n{customers[customers['segment']=='Regular'][['total_purchases','total_spent','days_since_last_purchase','avg_order_value']].describe()}")

The Regulars: 
       total_purchases  total_spent  days_since_last_purchase  avg_order_value
count        17.000000    17.000000                 17.000000        17.000000
mean         15.647059  2082.197059                 52.705882       296.010000
std          13.453351  1700.201778                 27.476728       442.573584
min           2.000000   155.810000                  3.000000         9.170000
25%           4.000000   628.800000                 34.000000        61.900000
50%          11.000000  1967.260000                 64.000000       122.950000
75%          20.000000  3464.070000                 74.000000       231.480000
max          47.000000  4788.700000                 90.000000      1481.650000


## Part 4: Challenge Problems (15 minutes)

### Challenge 1 — Missing Data Handler (hard)
Clean and impute missing values intelligently.

In [62]:
# Create data with missing values
messy_data = pd.DataFrame({
    'product': ['A', 'B', 'C', 'D', 'E', 'F'],
    'price': [100, None, 150, 200, None, 180],
    'stock': [50, 30, None, 45, 60, None],
    'category': ['Electronics', 'Electronics', None, 'Clothing', 'Clothing', 'Electronics'],
    'rating': [4.5, 4.2, None, 3.8, 4.6, None]
})

print("Messy data:")
print(messy_data)
print()

Messy data:
  product  price  stock     category  rating
0       A  100.0   50.0  Electronics     4.5
1       B    NaN   30.0  Electronics     4.2
2       C  150.0    NaN         None     NaN
3       D  200.0   45.0     Clothing     3.8
4       E    NaN   60.0     Clothing     4.6
5       F  180.0    NaN  Electronics     NaN



In [63]:
# 1. Fill missing prices with category average
average_prices = messy_data.groupby('category')['price'].transform('mean')
print(f"Average prices:\n{average_prices}")

messy_data['price'] = messy_data['price'].fillna(average_prices)
print(f"\nFilled prices:\n{messy_data}")

Average prices:
0    140.0
1    140.0
2      NaN
3    200.0
4    200.0
5    140.0
Name: price, dtype: float64

Filled prices:
  product  price  stock     category  rating
0       A  100.0   50.0  Electronics     4.5
1       B  140.0   30.0  Electronics     4.2
2       C  150.0    NaN         None     NaN
3       D  200.0   45.0     Clothing     3.8
4       E  200.0   60.0     Clothing     4.6
5       F  180.0    NaN  Electronics     NaN
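
The fill above relies on `transform('mean')` returning one value per original row, so the result lines up with the column being filled. The sketch below contrasts it with a plain `mean()` on a hypothetical two-category table.

```python
import pandas as pd

# Hypothetical data with one missing price in category 'X'
demo = pd.DataFrame({
    'category': ['X', 'X', 'Y', 'Y'],
    'price': [10.0, None, 30.0, 50.0],
})

group_mean = demo.groupby('category')['price'].mean()            # one value per group
row_mean = demo.groupby('category')['price'].transform('mean')   # one value per row

filled = demo['price'].fillna(row_mean)   # aligned by row index
print(filled)
```

Rows whose group key is missing get `NaN` from `transform`, which is why product C's group average is `NaN` in the output above.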


In [64]:
# 2. Fill missing stock with median stock
median_stock = messy_data['stock'].median()
print(f"Median stock: {median_stock}")
messy_data['stock']=messy_data['stock'].fillna(median_stock)
print(f"\nFilled stock: \n{messy_data}")

Median stock: 47.5

Filled stock: 
  product  price  stock     category  rating
0       A  100.0   50.0  Electronics     4.5
1       B  140.0   30.0  Electronics     4.2
2       C  150.0   47.5         None     NaN
3       D  200.0   45.0     Clothing     3.8
4       E  200.0   60.0     Clothing     4.6
5       F  180.0   47.5  Electronics     NaN


In [75]:
# 3. Fill missing categories by looking at similar price ranges
average_prices_by_category = messy_data.groupby('category')['price'].mean()
print(f"\nAvg prices by category: \n{average_prices_by_category}")

rows_missing_category = messy_data['category'].isna() & messy_data['price'].notna()
print(f"\nRows with missing category:\n{rows_missing_category}")

# Assumes only one row is missing its category; with several, repeat the lookup per row
price_missing_category = messy_data.loc[rows_missing_category,'price'].iloc[0]
delta_price_by_category = abs(average_prices_by_category - price_missing_category)
print(f"\nAbsolute price difference by category: \n{delta_price_by_category}")

closest_category = delta_price_by_category.idxmin()
print(f"\nClosest category: {closest_category}")

messy_data['category'] = messy_data['category'].fillna(closest_category)
print(f"\nFilled category:\n{messy_data}")


Avg prices by category: 
category
Clothing       200.0
Electronics    140.0
Name: price, dtype: float64

Rows with missing category:
0    False
1    False
2     True
3    False
4    False
5    False
dtype: bool

Absolute price difference by category: 
category
Clothing       50.0
Electronics    10.0
Name: price, dtype: float64

Closest category: Electronics

Filled category:
  product  price  stock     category  rating
0       A  100.0   50.0  Electronics     4.5
1       B  140.0   30.0  Electronics     4.2
2       C  150.0   47.5  Electronics     NaN
3       D  200.0   45.0     Clothing     3.8
4       E  200.0   60.0     Clothing     4.6
5       F  180.0   47.5  Electronics     NaN


In [80]:
# 4. Fill missing ratings with overall average
avg_rating = messy_data['rating'].mean()
print(f"Avg rating: {avg_rating:.2f}")

messy_data['rating'] = messy_data['rating'].fillna(avg_rating)
print(f"\nFilled rating: \n{messy_data}")

Avg rating: 4.27

Filled rating: 
  product  price  stock     category  rating
0       A  100.0   50.0  Electronics   4.500
1       B  140.0   30.0  Electronics   4.200
2       C  150.0   47.5  Electronics   4.275
3       D  200.0   45.0     Clothing   3.800
4       E  200.0   60.0     Clothing   4.600
5       F  180.0   47.5  Electronics   4.275


In [85]:
# 5. Create a report showing what was fixed
print(f"Data Cleaning and Imputation")
print(f"============================")
print(f"\n(1) Fill missing price with category average:\n{average_prices_by_category}")
print(f"\n(2) Fill missing stock with median stock: {median_stock}")
print(f"\n(3) Fill missing category with reference to the similar price range, i.e. {closest_category}")
print(f"\n(4) Fill missing rating with average rating: {avg_rating:.2f}")

Data Cleaning and Imputation
============================

(1) Fill missing price with category average:
category
Clothing       200.0
Electronics    140.0
Name: price, dtype: float64

(2) Fill missing stock with median stock: 47.5

(3) Fill missing category with reference to the similar price range, i.e. Electronics

(4) Fill missing rating with average rating: 4.27
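
A cleaning report can also be built from the data itself by counting missing values before and after imputation. This is a sketch under assumptions: the `before` frame is hypothetical, and in the notebook you would need to copy `messy_data` before filling it to have a "before" snapshot.

```python
import pandas as pd

# Hypothetical snapshot taken before any filling
before = pd.DataFrame({
    'price': [100.0, None, 150.0],
    'stock': [50.0, 30.0, None],
})

# fillna accepts a dict mapping column name to fill value
after = before.fillna({
    'price': before['price'].mean(),
    'stock': before['stock'].median(),
})

report = pd.DataFrame({
    'missing_before': before.isna().sum(),
    'missing_after': after.isna().sum(),
})
report['filled'] = report['missing_before'] - report['missing_after']
print(report)
```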


### Challenge 2 — Time Series Basics (hard)
Analyze daily sales patterns.

In [86]:
# Generate time series data
dates = pd.date_range('2025-07-01', periods=30)
daily_sales = pd.DataFrame({
    'date': dates,
    'sales': np.random.randint(1000, 5000, 30) + 
             np.sin(np.arange(30) * 2 * np.pi / 7) * 500  # Weekly pattern
})

In [87]:
daily_sales

         date        sales
0  2025-07-01  1297.000000
1  2025-07-02  3329.915741
2  2025-07-03  3198.463956
3  2025-07-04  4688.941870
4  2025-07-05  3071.058130
..        ...          ...
25 2025-07-26  2949.058130
26 2025-07-27  2278.536044
27 2025-07-28  2534.084259
28 2025-07-29  4932.000000
29 2025-07-30  2672.915741


In [89]:
# TODO: Analyze the time series:
# 1. Add day_of_week column
daily_sales['day_of_week']=daily_sales['date'].dt.day_name()
print(f"Added day of week:\n{daily_sales}")

Added day of week:
         date        sales day_of_week
0  2025-07-01  1297.000000     Tuesday
1  2025-07-02  3329.915741   Wednesday
2  2025-07-03  3198.463956    Thursday
3  2025-07-04  4688.941870      Friday
4  2025-07-05  3071.058130    Saturday
..        ...          ...         ...
25 2025-07-26  2949.058130    Saturday
26 2025-07-27  2278.536044      Sunday
27 2025-07-28  2534.084259      Monday
28 2025-07-29  4932.000000     Tuesday
29 2025-07-30  2672.915741   Wednesday

[30 rows x 3 columns]


In [94]:
# 2. Calculate 7-day rolling average
daily_sales['7d_rolling_avg'] = daily_sales['sales'].rolling(window=7).mean()
print(f"Added 7-day rolling average: \n{daily_sales.head(10)}")

Added 7-day rolling average: 
        date        sales day_of_week  7d_rolling_avg
0 2025-07-01  1297.000000     Tuesday             NaN
1 2025-07-02  3329.915741   Wednesday             NaN
2 2025-07-03  3198.463956    Thursday             NaN
3 2025-07-04  4688.941870      Friday             NaN
4 2025-07-05  3071.058130    Saturday             NaN
5 2025-07-06  3922.536044      Sunday             NaN
6 2025-07-07  2972.084259      Monday     3211.428571
7 2025-07-08  4761.000000     Tuesday     3706.285714
8 2025-07-09  3905.915741   Wednesday     3788.571429
9 2025-07-10  3830.463956    Thursday     3878.857143
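
The first six rolling values are `NaN` because a 7-day window is not yet full. If partial windows are acceptable, `min_periods` relaxes that requirement. A small sketch with a 3-value window on made-up numbers:

```python
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0, 40.0])

strict = s.rolling(window=3).mean()                  # NaN until 3 values are available
lenient = s.rolling(window=3, min_periods=1).mean()  # averages whatever is available

print(pd.DataFrame({'strict': strict, 'lenient': lenient}))
```

Early values from the lenient version are averages over fewer points, so they are noisier than the rest.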


In [100]:
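# 3. Find best and worst days of the week
# Note: idxmax/idxmin on the rolling average return the single dates on which the
# 7-day average peaked and bottomed out (both happen to be Saturdays here).
# This does not compare weekdays against each other.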
best_day = daily_sales.loc[daily_sales['7d_rolling_avg'].idxmax(),'day_of_week']
worst_day = daily_sales.loc[daily_sales['7d_rolling_avg'].idxmin(),'day_of_week']
print(f"Best day is {best_day} and worst day is {worst_day}")

Best day is Saturday and worst day is Saturday
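
To compare weekdays against each other, one option is to average sales per weekday with `groupby` and then take the largest and smallest group. The sketch below uses two weeks of hypothetical sales figures, not the notebook's random data.

```python
import pandas as pd

# Hypothetical two weeks of sales; the values are made up for illustration
demo = pd.DataFrame({
    'date': pd.date_range('2025-07-07', periods=14),   # 2025-07-07 is a Monday
    'sales': [100, 120, 90, 110, 200, 250, 80,
              110, 130, 95, 105, 210, 240, 70],
})
demo['day_of_week'] = demo['date'].dt.day_name()

avg_by_day = demo.groupby('day_of_week')['sales'].mean()
best_day, worst_day = avg_by_day.idxmax(), avg_by_day.idxmin()

print(f"Best day is {best_day} and worst day is {worst_day}")
```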


In [128]:
# 4. Calculate week-over-week growth
week1 = daily_sales[(daily_sales['date']>='2025-07-01')&(daily_sales['date']<='2025-07-07')]
week2 = daily_sales[(daily_sales['date']>='2025-07-08')&(daily_sales['date']<='2025-07-14')]
week3 = daily_sales[(daily_sales['date']>='2025-07-15')&(daily_sales['date']<='2025-07-21')]
week4 = daily_sales[(daily_sales['date']>='2025-07-22')&(daily_sales['date']<='2025-07-28')]
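# Week 5 is partial: the data ends on 2025-07-30, so it holds only 2 days
# and its week-over-week growth is not comparable with the full weeks
week5 = daily_sales[(daily_sales['date']>='2025-07-29')&(daily_sales['date']<='2025-07-31')]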

weekly_sales= pd.DataFrame({
    'week': range(1,6),
    'weekly_total_sales' : [week1['sales'].sum(), week2['sales'].sum(), week3['sales'].sum(), week4['sales'].sum(), week5['sales'].sum()]
})

In [135]:
weekly_sales['week_on_week_growth']=round(weekly_sales['weekly_total_sales'].pct_change()*100,2)
print(f"Weekly sales with % of week-on-week growth:\n{weekly_sales}")

Weekly sales with % of week-on-week growth:
   week  weekly_total_sales  week_on_week_growth
0     1        22480.000000                  NaN
1     2        26371.000000                17.31
2     3        18279.000000               -30.69
3     4        22012.000000                20.42
4     5         7604.915741               -65.45
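
The large drop in week 5 comes from comparing a 2-day partial week with full weeks. One way to avoid hard-coding date ranges, and to exclude incomplete weeks, is to number 7-day blocks from the first date and count the days in each. This is a sketch on hypothetical data; the column names `week`, `total`, `days` and `growth_pct` are introduced here.

```python
import pandas as pd

# Hypothetical data: two full weeks followed by a 2-day partial week
demo = pd.DataFrame({
    'date': pd.date_range('2025-07-01', periods=16),
    'sales': [100.0] * 7 + [150.0] * 7 + [200.0] * 2,
})

# Number the 7-day blocks starting from the first date in the data
demo['week'] = (demo['date'] - demo['date'].min()).dt.days // 7 + 1

weekly = demo.groupby('week')['sales'].agg(total='sum', days='count')
complete = weekly[weekly['days'] == 7].copy()   # drop the partial last week
complete['growth_pct'] = complete['total'].pct_change() * 100

print(complete)
```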


### Challenge 3 — Data Aggregation Pipeline (hard)
Build a complete analysis pipeline.

In [131]:
# E-commerce transaction data
np.random.seed(42)
transactions = pd.DataFrame({
    'transaction_id': range(1, 101),
    'customer_id': np.random.randint(1, 21, 100),
    'product': np.random.choice(['Laptop', 'Phone', 'Tablet', 'Watch'], 100),
    'amount': np.random.uniform(100, 2000, 100).round(2),
    'date': pd.date_range('2025-08-01', periods=100, freq='h')
})


In [136]:
# TODO: Create a comprehensive analysis:
# 1. Calculate total revenue by product
total_revenue_by_product = transactions.groupby('product')['amount'].sum()
print(f"Total revenue by product ($):\n{total_revenue_by_product}")

Total revenue by product ($):
product
Laptop    24941.12
Phone     23487.18
Tablet    30381.74
Watch     25319.93
Name: amount, dtype: float64


In [167]:
# 2. Find top 5 customers by total spending
unique_customer_id = (transactions['customer_id'].unique()).tolist()
print(f"There are {len(unique_customer_id)} customers in total.")

top5_amount=transactions['amount'].sort_values(ascending=False).head(5)
print(f"\nTop 5 transaction amount ($): \n{top5_amount}")

total_spending_per_customer = transactions.groupby('customer_id')['amount'].sum()
total_spending_per_customer_top5= total_spending_per_customer.sort_values(ascending=False).head(5)
print(f"\nTop 5 customers in total spending ($): \n{total_spending_per_customer_top5}")

There are 20 customers in total.

Top 5 transaction amount ($): 
18    1972.74
90    1948.72
3     1946.39
4     1928.65
42    1879.79
Name: amount, dtype: float64

Top 5 customers in total spending ($): 
customer_id
8     10605.64
12     9721.97
7      8468.18
15     7266.93
4      6586.46
Name: amount, dtype: float64


In [159]:
# 3. Identify peak shopping hours
transactions['hours']=transactions['date'].dt.hour
shopping_hour = transactions.groupby('hours')['amount'].sum()
shopping_hour_top3 = shopping_hour.sort_values(ascending=False).head(3)
print(f"Top 3 peak shopping hours :\n{shopping_hour_top3}")

Top 3 peak shopping hours :
hours
3     7106.50
18    6208.34
23    5666.21
Name: amount, dtype: float64


In [194]:
# 4. Calculate customer metrics: total spent, avg transaction, transaction count
# 5. Create a summary report DataFrame

# One groupby pass computes all three metrics and keeps them aligned by customer_id
customer_metrics = (
    transactions.groupby('customer_id')['amount']
    .agg(['sum', 'mean', 'count'])
    .reset_index()
)
customer_metrics.columns = ['Customer ID', 'Total spend ($)', 'Average spend ($)', 'Transaction count']

print(f"Customer metrics: \n{customer_metrics}")

Customer metrics: 
    Customer ID  Total spend ($)  Average spend ($)  Transaction count
0             1          5077.48        1015.496000                  5
1             2          5125.43         854.238333                  6
2             3          3946.43         789.286000                  5
3             4          6586.46        1097.743333                  6
4             5          2348.09         782.696667                  3
..          ...              ...                ...                ...
15           16          5610.75        1402.687500                  4
16           17          4040.37        1346.790000                  3
17           18          5577.91        1115.582000                  5
18           19          4864.93        1216.232500                  4
19           20          3170.75        1056.916667                  3

[20 rows x 4 columns]


## 🎯 Bonus Challenge: Real-World Mini Project
Combine everything you've learned!

In [198]:
# Restaurant order analysis
# You're analyzing data for a restaurant chain

# Generate realistic restaurant data
np.random.seed(123)
menu_items = ['Burger', 'Pizza', 'Salad', 'Pasta', 'Steak', 'Fish', 'Soup', 'Dessert']
locations = ['Downtown', 'Mall', 'Airport', 'Suburb']

orders = pd.DataFrame({
    'order_id': range(1, 201),
    'location': np.random.choice(locations, 200, p=[0.3, 0.25, 0.2, 0.25]),
    'item': np.random.choice(menu_items, 200),
    'quantity': np.random.randint(1, 5, 200),
    'price': np.random.uniform(8, 45, 200).round(2),
    'day_of_week': np.random.choice(['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'], 200),
    'meal_type': np.random.choice(['Breakfast', 'Lunch', 'Dinner'], 200, p=[0.2, 0.35, 0.45])
})
orders

     order_id  location     item  quantity  price  day_of_week  meal_type
0           1   Airport     Fish         2  24.88          Fri  Breakfast
1           2  Downtown    Pizza         2  18.35          Tue     Dinner
2           3  Downtown    Salad         2  42.52          Tue     Dinner
3           4   Airport  Dessert         1  19.63          Sat  Breakfast
4           5   Airport   Burger         1  41.66          Fri     Dinner
..        ...       ...      ...       ...    ...          ...        ...
195       196   Airport    Pasta         2  24.51          Sun     Dinner
196       197  Downtown     Soup         2  32.55          Wed     Dinner
197       198   Airport     Soup         3  28.33          Fri      Lunch
198       199      Mall    Salad         4   9.02          Mon     Dinner


In [207]:
# TODO: Complete analysis to answer these business questions:
# 1. Which location has the highest average order value?
# Each row is one order, so its value is quantity x price
orders['total_order_value'] = orders['quantity'] * orders['price']
average_order = orders.groupby('location')['total_order_value'].mean()
print(f"Average order by location: \n{average_order}")
print(f"\nLocation with the highest average order value: {average_order.idxmax()}")

Average order by location: 
location
Airport     63.289259
Downtown    68.964694
Mall        69.520714
Suburb      67.018049
Name: total_order_value, dtype: float64

Location with the highest average order value: Mall


In [237]:
# 2. What's the most popular item at each location?
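# Total quantity sold for each (location, item) pair
item_by_location = orders.groupby(['location', 'item'])['quantity'].sum()

# Within each location, idxmax returns the (location, item) label with the largest total
popular_item = item_by_location.groupby(level=0).idxmax()
popular_item_by_location = item_by_location.loc[popular_item]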
print(f"Popular item by location based on order quantity:\n{popular_item_by_location}")

Popular item by location based on order quantity:
location  item   
Airport   Soup       35
Downtown  Fish       25
Mall      Dessert    30
Suburb    Soup       19
Name: quantity, dtype: int32
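
An alternative way to get the top item per location is to sort the totals and keep the first row for each location. The sketch below illustrates the idea on a small hypothetical dataset (`toy_orders` is made up for this example and is not the lab's `orders` frame). The toy data has no ties; with ties, which item is kept depends on the sort order.

```python
import pandas as pd

# Hypothetical mini dataset, chosen so that no location has a tie for first place.
toy_orders = pd.DataFrame({
    "location": ["Mall", "Mall", "Mall", "Airport", "Airport", "Airport"],
    "item":     ["Pizza", "Soup", "Pizza", "Fish", "Soup", "Soup"],
    "quantity": [2, 1, 3, 4, 1, 2],
})

# Total quantity per (location, item) pair, returned as a flat DataFrame.
totals = toy_orders.groupby(["location", "item"], as_index=False)["quantity"].sum()

# Sort so the largest totals come first, then keep the first row seen per location.
top_items = (
    totals.sort_values("quantity", ascending=False)
          .drop_duplicates("location")
          .sort_values("location")
          .reset_index(drop=True)
)
print(top_items)
```

The result is a regular DataFrame with `location`, `item`, and `quantity` columns, which can be easier to merge into a summary table than a MultiIndex Series.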


In [221]:
# 3. Which meal type generates the most revenue?
meal_revenue = orders.groupby('meal_type')['total_order_value'].sum()
print(f"Meal revenue: \n{meal_revenue}")
print(f"\nMeal with most revenue: {meal_revenue.idxmax()}")

Meal revenue: 
meal_type
Breakfast    1998.60
Dinner       6122.72
Lunch        5316.47
Name: total_order_value, dtype: float64

Meal with most revenue: Dinner


In [224]:
# 4. What's the busiest day of the week?
transaction_count = orders.groupby('day_of_week')['order_id'].count()
print(f"Transaction count by day of the week: \n{transaction_count}")
print(f"\nBusiest day of the week: {transaction_count.idxmax()}")

Transaction count by day of the week: 
day_of_week
Fri    29
Mon    32
Sat    20
Sun    29
Thu    38
Tue    26
Wed    26
Name: order_id, dtype: int64

Busiest day of the week: Thu
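
Note that `groupby` sorts its keys alphabetically, which is why the days above appear as Fri, Mon, Sat and so on. If a calendar-ordered view is preferred, one option is to reindex the result. The sketch below uses hypothetical counts (not the lab's results) to show the idea.

```python
import pandas as pd

# Hypothetical counts, indexed alphabetically the way groupby would return them.
counts = pd.Series({"Fri": 3, "Mon": 5, "Sat": 2, "Sun": 4, "Thu": 6, "Tue": 1, "Wed": 1})

# Reindex into calendar order so the week reads Monday to Sunday.
calendar_order = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
counts_by_day = counts.reindex(calendar_order)

print(counts_by_day)
print("Busiest day:", counts_by_day.idxmax())
```

Reindexing changes only the display order; `idxmax()` returns the same day either way.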


In [246]:
# 5. Create a location performance summary with key metrics
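# Take the location labels from the groupby index so the row order matches the
# grouped metrics below (unique() returns first-appearance order, which may differ)
unique_location = average_order.index.tolist()
total_order = orders.groupby('location')['total_order_value'].sum()
# idxmax on a MultiIndex Series returns (location, item) tuples; keep only the item
popular_item_clean = [t[1] if isinstance(t, tuple) else t for t in popular_item.tolist()]

# Most popular meal type per location, by total quantity ordered
meal_by_location = orders.groupby(['location', 'meal_type'])['quantity'].sum()
popular_meal = meal_by_location.groupby(level=0).idxmax()
popular_meal_clean = [t[1] if isinstance(t, tuple) else t for t in popular_meal.tolist()]

# Busiest day per location, by number of orders
busyday_by_location = orders.groupby(['location', 'day_of_week'])['order_id'].count()
busy_day = busyday_by_location.groupby(level=0).idxmax()
busy_day_clean = [t[1] if isinstance(t, tuple) else t for t in busy_day.tolist()]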

location_performance_metrics = pd.DataFrame({
    'Location': unique_location,
    'Popular item' : popular_item_clean,
    'Popular meal type' : popular_meal_clean,
    'Average order ($)' : average_order.tolist(),
    'Total order ($)' : total_order.tolist(),
    'Busiest day': busy_day_clean
    })

print(f"Location performance metrics: \n{location_performance_metrics}")

Location performance metrics: 
   Location Popular item Popular meal type  Average order ($)  \
0   Airport         Soup            Dinner          63.289259   
1  Downtown         Fish            Dinner          68.964694   
2      Mall      Dessert            Dinner          69.520714   
3    Suburb         Soup             Lunch          67.018049   

   Total order ($) Busiest day  
0          3417.62         Thu  
1          3379.27         Sun  
2          3893.16         Thu  
3          2747.74         Mon  
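
A summary like this can also be built without converting to lists, by letting pandas align everything on the location index. The following is a minimal sketch on a hypothetical dataset (`toy_orders` and the column names in `summary` are made up for illustration), using named aggregation for the numeric metrics.

```python
import pandas as pd

# Hypothetical mini dataset (not the lab's `orders` frame).
toy_orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5],
    "location": ["Mall", "Airport", "Mall", "Airport", "Mall"],
    "meal_type": ["Dinner", "Lunch", "Dinner", "Lunch", "Lunch"],
    "quantity": [2, 1, 3, 2, 1],
    "price": [10.0, 20.0, 5.0, 8.0, 12.0],
})
toy_orders["total_order_value"] = toy_orders["quantity"] * toy_orders["price"]

# Numeric metrics in one pass; the result is indexed by location.
summary = toy_orders.groupby("location").agg(
    average_order=("total_order_value", "mean"),
    total_order=("total_order_value", "sum"),
    order_count=("order_id", "count"),
)

# Most popular meal type per location, by total quantity.
meal_totals = toy_orders.groupby(["location", "meal_type"])["quantity"].sum()
top_meal = meal_totals.groupby(level="location").idxmax().map(lambda pair: pair[1])

# Assigning a Series aligns on the index label (location), not on row position.
summary["popular_meal_type"] = top_meal
print(summary)
```

Because every piece shares the same location index, the rows stay matched even if the pieces were computed in different orders.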


## 📊 Lab Summary Checklist
Before submitting, ensure you've completed:

- [ ] All Series exercises (1.1 - 1.3)
- [ ] DataFrame creation and operations (2.1 - 2.3)
- [ ] Selection and indexing practice (3.1 - 3.2)
- [ ] At least 2 challenge problems
- [ ] Bonus challenge (optional)

**Self-Assessment:**
- I can create and manipulate Series ✅
- I can build DataFrames from various sources ✅
- I understand loc vs iloc selection ✅
- I can filter data with multiple conditions ✅
- I can add and modify columns ✅

## 🚀 What's Next?
**Lab 01B:** Data wrangling with groupby, merge, and pivot operations!