# Python for Data Analysis - Week 2
## Major Group Assignment: Pandas Fundamentals I

**Due Date:** Thursday, April 24, 2025

### Overview
In this assignment, you will work in groups of 3-4 students to explore and analyze an e-commerce dataset using pandas. You will apply the core pandas concepts covered in Week 2: DataFrame creation, column selection, row filtering, and translating SQL queries to pandas code.

### Learning Objectives
By completing this assignment, you will be able to:
- Load and inspect data using pandas
- Select and filter data using pandas methods
- Create derived columns and manipulate DataFrame structure
- Translate SQL queries into equivalent pandas operations
- Perform basic data analysis using pandas

### Dataset
You will be working with an e-commerce sales dataset containing information about orders, products, customers, and transactions. The dataset is available in the `Data` folder as `ecommerce_sales.csv`.

### Assignment Structure
This assignment is divided into five parts:
1. Data Loading and Inspection
2. Basic DataFrame Operations
3. Selection and Filtering
4. SQL to Pandas Translation
5. Data Analysis Challenge

### Submission Guidelines
- Submit your completed notebook via the course portal
- Include the names of all team members in the notebook
- Ensure all code cells are executed and outputs are visible
- Add comments to explain your code and reasoning

### Grading Criteria
- Correctness of code: 60%
- Code efficiency and best practices: 20%
- Analysis and interpretation: 15%
- Code documentation: 5%

Let's begin!

## Team Information

**Team Members:**
1. 
2. 
3. 
4. 

## Setup

First, let's import the necessary libraries and set up our environment.

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Set pandas display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 50)
pd.set_option('display.width', 1000)

# For plotting in the notebook
%matplotlib inline

## Part 1: Data Loading and Inspection (15 points)

### 1.1 Load the Dataset

Load the e-commerce sales dataset (`ecommerce_sales.csv`) into a pandas DataFrame.
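
As a minimal sketch of how `pd.read_csv` behaves, the example below reads a small in-memory CSV instead of the real file. The column names and values are made up for illustration; for the assignment you would pass the path to `ecommerce_sales.csv` (relative to wherever your notebook runs) instead of the `StringIO` object.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for the real file.
csv_text = "order_id,order_date,price\n1,2024-01-05,19.99\n2,2024-01-06,5.00\n"

# parse_dates asks pandas to convert the listed columns to datetime while reading.
toy = pd.read_csv(io.StringIO(csv_text), parse_dates=["order_date"])

print(toy.dtypes)
```

Parsing dates at load time is optional; the same conversion can be done later with `pd.to_datetime`.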

In [None]:
# Load the dataset
# Your code here

### 1.2 Inspect the Dataset

Perform a thorough inspection of the dataset to understand its structure and content. Answer the following questions:

1. How many rows and columns does the dataset have?
2. What are the column names and their data types?
3. Are there any missing values in the dataset?
4. What is the time range of the orders in the dataset?
5. What are the different product categories in the dataset?
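
The questions above map onto a handful of standard pandas calls. The sketch below shows them on a toy DataFrame; the column names (`order_date`, `category`, and so on) are assumptions and may not match the real dataset.

```python
import pandas as pd

# Toy stand-in for the real data; column names are assumptions.
toy = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "order_date": ["2024-01-05", "2024-01-20", "2024-02-02", "2024-02-14"],
    "category": ["Electronics", "Books", "Electronics", None],
    "price": [199.0, 15.5, 89.0, 42.0],
})

n_rows, n_cols = toy.shape              # (rows, columns)
dtypes = toy.dtypes                     # data type of each column
missing_per_column = toy.isna().sum()   # number of missing values per column

# Parse the dates before taking min/max so the comparison is chronological.
dates = pd.to_datetime(toy["order_date"])
first_order, last_order = dates.min(), dates.max()

# dropna() keeps the missing entry out of the list of distinct categories.
categories = sorted(toy["category"].dropna().unique())
```

`toy.info()` prints much of this in one call, but the individual attributes are easier to reuse in later cells.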

In [None]:
# Display the first few rows of the dataset
# Your code here

In [None]:
# Check the dataset's shape (rows, columns)
# Your code here

In [None]:
# Display information about the DataFrame
# Your code here

In [None]:
# Check for missing values
# Your code here

In [None]:
# Determine the time range of orders
# Your code here

In [None]:
# List the unique product categories
# Your code here

### 1.3 Dataset Summary

Generate summary statistics for the numerical columns in the dataset and interpret the results.
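
A small illustration of what `describe()` returns, using made-up numbers. Comparing the mean with the median (the `50%` row) is one quick way to spot skew when writing the interpretation.

```python
import pandas as pd

# Hypothetical numeric columns for illustration only.
toy = pd.DataFrame({"quantity": [1, 2, 2, 5], "price": [10.0, 20.0, 30.0, 40.0]})

# describe() summarizes numeric columns: count, mean, std, min, quartiles, max.
summary = toy.describe()

mean_price = summary.loc["mean", "price"]
median_quantity = summary.loc["50%", "quantity"]
```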

In [None]:
# Generate summary statistics
# Your code here

**Interpretation:**

Write a brief interpretation of the summary statistics here.

## Part 2: Basic DataFrame Operations (20 points)

In this section, you will perform various operations to manipulate the DataFrame and extract specific information.

### 2.1 Data Type Conversions

Convert the `order_date` column to datetime format, if it's not already, and create new columns for the year, month, and day of the week.
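
The conversion can be sketched as follows on a toy column. The new column names (`order_year`, `order_month`, `order_weekday`) are only suggestions, and `errors="coerce"` is an optional choice that turns unparseable values into `NaT` instead of raising an error.

```python
import pandas as pd

# Toy data with one deliberately invalid value.
toy = pd.DataFrame({"order_date": ["2024-01-06", "2024-01-08", "not a date"]})

# Convert text to datetime; invalid entries become NaT because of errors="coerce".
toy["order_date"] = pd.to_datetime(toy["order_date"], errors="coerce")

# The .dt accessor exposes the date components of a datetime column.
toy["order_year"] = toy["order_date"].dt.year
toy["order_month"] = toy["order_date"].dt.month
toy["order_weekday"] = toy["order_date"].dt.day_name()
```

Rows that failed to parse carry missing values in every derived column, so it is worth counting them with `toy["order_date"].isna().sum()` after converting.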

In [None]:
# Convert order_date to datetime format and create new columns
# Your code here

### 2.2 Creating Derived Columns

Create the following new columns in the DataFrame:

1. `total_amount`: The total price of the order (quantity × price)
2. `price_category`: Categorize products as 'Low', 'Medium', or 'High' based on their price
   - Low: price < 50
   - Medium: 50 ≤ price < 100
   - High: price ≥ 100
3. `shipping_time`: The number of days between order date and shipping date (if available)
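
One possible way to build these three columns is sketched below on toy data. The column name `shipping_date` is an assumption; check what the real dataset calls it. `pd.cut` is only one option for the price bands (`np.select` or a custom function would also work).

```python
import numpy as np
import pandas as pd

# Toy data; shipping_date is a hypothetical column name.
toy = pd.DataFrame({
    "quantity": [2, 1, 3],
    "price": [49.99, 50.0, 100.0],
    "order_date": pd.to_datetime(["2024-01-01", "2024-01-03", "2024-01-10"]),
    "shipping_date": pd.to_datetime(["2024-01-04", None, "2024-01-12"]),
})

# Element-wise product of two columns.
toy["total_amount"] = toy["quantity"] * toy["price"]

# right=False closes each bin on the left: [0, 50), [50, 100), [100, inf).
toy["price_category"] = pd.cut(
    toy["price"],
    bins=[0, 50, 100, np.inf],
    labels=["Low", "Medium", "High"],
    right=False,
)

# Subtracting datetimes gives timedeltas; .dt.days extracts whole days.
# Rows without a shipping date end up as NaN.
toy["shipping_time"] = (toy["shipping_date"] - toy["order_date"]).dt.days
```

With these bins a negative price would fall outside every interval and become `NaN`, which can double as a data-quality check.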

In [None]:
# Create total_amount column
# Your code here

In [None]:
# Create price_category column
# Your code here

In [None]:
# Create shipping_time column (if shipping date is available)
# Your code here

### 2.3 Column Selection and Manipulation

1. Create a new DataFrame that contains only customer information (customer ID, gender, age, email domain)
2. Create a new DataFrame that contains order details (order ID, product name, quantity, price, total amount)
3. Calculate and display the total number of orders and the total revenue by month
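
The sketch below illustrates the two less obvious steps: deriving an email domain from an address and aggregating by month. It assumes hypothetical `email`, `order_id`, and `total_amount` columns on a toy DataFrame.

```python
import pandas as pd

# Toy data; column names are assumptions.
toy = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer_id": ["C1", "C2", "C1"],
    "email": ["ana@example.com", "bo@example.org", "ana@example.com"],
    "order_date": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-02"]),
    "total_amount": [100.0, 50.0, 25.0],
})

# Select a subset of columns with a list of names; .copy() avoids
# modifying a view of the original DataFrame.
customers = toy[["customer_id", "email"]].drop_duplicates().copy()

# Everything after the "@" is the domain.
customers["email_domain"] = customers["email"].str.split("@").str[-1]

# Group by calendar month and compute two aggregates with named aggregation.
monthly = toy.groupby(toy["order_date"].dt.to_period("M")).agg(
    num_orders=("order_id", "nunique"),
    revenue=("total_amount", "sum"),
)
```

Grouping by `dt.to_period("M")` keeps the year and month together, so January 2024 and January 2025 are not merged.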

In [None]:
# Create customer information DataFrame
# Your code here

In [None]:
# Create order details DataFrame
# Your code here

In [None]:
# Calculate total orders and revenue by month
# Your code here

## Part 3: Selection and Filtering (25 points)

In this section, you will practice selecting and filtering data based on various conditions.

### 3.1 Basic Filtering

Filter the DataFrame to show:

1. Orders with a total amount greater than $500
2. Orders placed on weekends (Saturday or Sunday)
3. Orders shipped to a specific region (choose a region from your dataset)
4. Orders of a specific product category
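
Each of these filters is a boolean mask applied to the DataFrame. A minimal sketch on toy data, with assumed column names `region` and `total_amount`:

```python
import pandas as pd

# Toy data; 2024-01-06 is a Saturday, 2024-01-07 a Sunday, 2024-01-08 a Monday.
toy = pd.DataFrame({
    "order_date": pd.to_datetime(["2024-01-06", "2024-01-08", "2024-01-07"]),
    "region": ["West", "East", "West"],
    "total_amount": [620.0, 80.0, 450.0],
})

# A comparison on a column yields a boolean Series used to select rows.
big_orders = toy[toy["total_amount"] > 500]

# dayofweek numbers days Monday=0 through Sunday=6, so 5 and 6 are the weekend.
weekend_orders = toy[toy["order_date"].dt.dayofweek >= 5]

# Equality test against a single value; use .isin([...]) for several values.
west_orders = toy[toy["region"] == "West"]
```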

In [None]:
# 1. Orders with total amount > $500
# Your code here

In [None]:
# 2. Orders placed on weekends
# Your code here

In [None]:
# 3. Orders shipped to a specific region
# Your code here

In [None]:
# 4. Orders of a specific product category
# Your code here

### 3.2 Complex Filtering

Apply multiple conditions to filter the DataFrame:

1. High-value orders (total amount > $200) placed by female customers
2. Orders of electronic products with a quantity greater than 2
3. Orders placed in January 2024 (or another month in your dataset) with express shipping
4. Orders where the shipping time was more than 7 days for premium customers
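
Multiple conditions are combined with `&` (and), `|` (or), and `~` (not). The sketch below uses toy data with assumed column names and values. Each comparison needs its own parentheses because `&` binds more tightly than `>` or `==` in Python.

```python
import pandas as pd

# Toy data; the gender and category codes are assumptions.
toy = pd.DataFrame({
    "gender": ["F", "M", "F", "F"],
    "category": ["Electronics", "Electronics", "Books", "Electronics"],
    "quantity": [3, 4, 1, 1],
    "total_amount": [250.0, 900.0, 30.0, 150.0],
})

# Boolean-mask style: wrap every condition in parentheses.
high_value_female = toy[(toy["total_amount"] > 200) & (toy["gender"] == "F")]

# Equivalent style using DataFrame.query with a string expression.
bulk_electronics = toy.query("category == 'Electronics' and quantity > 2")
```

Both styles return the same kind of result; `query` is often easier to read when there are many conditions.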

In [None]:
# 1. High-value orders placed by female customers
# Your code here

In [None]:
# 2. Electronic product orders with quantity > 2
# Your code here

In [None]:
# 3. Orders in January 2024 (or your chosen month) with express shipping
# Your code here

In [None]:
# 4. Orders with shipping time > 7 days for premium customers
# Your code here

### 3.3 Sorting and Limiting Results

1. Find the top 10 customers by total purchase amount
2. Find the 5 most popular products by quantity sold
3. Find the days with the highest number of orders
4. Find the most profitable product categories (if the dataset has no cost or profit column, rank by total revenue and state that assumption)
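
"Top N" questions usually combine a `groupby` aggregation with either `nlargest` or `sort_values(...).head(...)`. A sketch on toy data with assumed column names:

```python
import pandas as pd

# Toy data; column names are assumptions.
toy = pd.DataFrame({
    "customer_id": ["C1", "C2", "C1", "C3", "C2"],
    "product_name": ["Mouse", "Lamp", "Lamp", "Mouse", "Mouse"],
    "quantity": [1, 2, 1, 4, 1],
    "total_amount": [20.0, 60.0, 30.0, 90.0, 20.0],
})

# Total spend per customer, then keep the two largest totals.
top_customers = toy.groupby("customer_id")["total_amount"].sum().nlargest(2)

# Same idea with an explicit sort followed by head().
top_products = (
    toy.groupby("product_name")["quantity"]
    .sum()
    .sort_values(ascending=False)
    .head(1)
)
```

When totals are tied, the row that makes the cut depends on the method's tie-breaking rule, so it is worth checking for ties before reporting a ranking.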

In [None]:
# 1. Top 10 customers by total purchase amount
# Your code here

In [None]:
# 2. Top 5 most popular products by quantity sold
# Your code here

In [None]:
# 3. Days with the highest number of orders
# Your code here

In [None]:
# 4. Most profitable product categories
# Your code here

## Part 4: SQL to Pandas Translation (25 points)

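In this section, you will translate SQL queries into equivalent pandas operations. Study each SQL query to understand what it computes, then implement the same logic in pandas. The queries reference separate tables (`products`, `orders`, `order_items`, `customers`); if your data is a single DataFrame loaded from `ecommerce_sales.csv`, treat it as the already-joined result and adapt the column names to those in your dataset.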
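
To illustrate the general mapping between SQL clauses and pandas methods, here is a sketch for a hypothetical query that is not part of the assignment: `SELECT region, SUM(total_amount) AS revenue FROM orders WHERE status = 'shipped' GROUP BY region HAVING SUM(total_amount) > 100 ORDER BY revenue DESC;`. The table and column names are invented for this example.

```python
import pandas as pd

# Hypothetical orders table.
orders = pd.DataFrame({
    "region": ["West", "East", "West", "North", "East"],
    "status": ["shipped", "shipped", "cancelled", "shipped", "shipped"],
    "total_amount": [120.0, 70.0, 500.0, 40.0, 60.0],
})

result = (
    orders[orders["status"] == "shipped"]        # WHERE status = 'shipped'
    .groupby("region", as_index=False)           # GROUP BY region
    .agg(revenue=("total_amount", "sum"))        # SUM(total_amount) AS revenue
    .query("revenue > 100")                      # HAVING SUM(total_amount) > 100
    .sort_values("revenue", ascending=False)     # ORDER BY revenue DESC
    .reset_index(drop=True)
)
```

Note the order of operations: the `WHERE` filter runs before grouping, while the `HAVING` filter runs on the aggregated result.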

### 4.1 Simple Queries

**SQL Query 1:**
```sql
SELECT product_id, product_name, price 
FROM products
WHERE category = 'Electronics'
ORDER BY price DESC;
```

In [None]:
# Pandas equivalent of SQL Query 1
# Your code here

**SQL Query 2:**
```sql
SELECT customer_id, COUNT(*) AS order_count, SUM(total_amount) AS total_spent
FROM orders
GROUP BY customer_id
HAVING COUNT(*) > 3;
```

In [None]:
# Pandas equivalent of SQL Query 2
# Your code here

### 4.2 Complex Queries

**SQL Query 3:**
```sql
SELECT 
    EXTRACT(YEAR FROM order_date) AS year,
    EXTRACT(MONTH FROM order_date) AS month,
    product_category,
    COUNT(DISTINCT orders.order_id) AS num_orders,
    SUM(quantity) AS total_quantity,
    SUM(price * quantity) AS total_revenue,
    AVG(price * quantity) AS avg_order_value
FROM orders
JOIN order_items ON orders.order_id = order_items.order_id
JOIN products ON order_items.product_id = products.product_id
GROUP BY 
    EXTRACT(YEAR FROM order_date),
    EXTRACT(MONTH FROM order_date),
    product_category
ORDER BY 
    year, month, total_revenue DESC;
```
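
As a hint for the join and distinct-count parts of a query like this, the sketch below uses three small hypothetical tables. It is deliberately simplified (grouping by category only) and is not a full translation. `merge` plays the role of `JOIN`, and the `"nunique"` aggregation plays the role of `COUNT(DISTINCT ...)`.

```python
import pandas as pd

# Hypothetical tables; with a single pre-joined CSV the merges are unnecessary.
orders = pd.DataFrame({
    "order_id": [1, 2],
    "order_date": pd.to_datetime(["2024-01-05", "2024-02-10"]),
})
order_items = pd.DataFrame({
    "order_id": [1, 1, 2],
    "product_id": ["P1", "P2", "P1"],
    "quantity": [1, 2, 3],
})
products = pd.DataFrame({
    "product_id": ["P1", "P2"],
    "product_category": ["Toys", "Books"],
    "price": [10.0, 5.0],
})

# JOIN ... ON: merge defaults to an inner join on the named key.
joined = orders.merge(order_items, on="order_id").merge(products, on="product_id")

# Compute the expression once so it can be aggregated by name.
joined["line_total"] = joined["price"] * joined["quantity"]

per_category = joined.groupby("product_category", as_index=False).agg(
    num_orders=("order_id", "nunique"),      # COUNT(DISTINCT order_id)
    total_revenue=("line_total", "sum"),     # SUM(price * quantity)
)
```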

In [None]:
# Pandas equivalent of SQL Query 3
# Your code here

**SQL Query 4:**
```sql
SELECT 
    CASE 
        WHEN age < 25 THEN 'Under 25'
        WHEN age BETWEEN 25 AND 34 THEN '25-34'
        WHEN age BETWEEN 35 AND 44 THEN '35-44'
        WHEN age BETWEEN 45 AND 54 THEN '45-54'
        ELSE '55 and above'
    END AS age_group,
    gender,
    COUNT(DISTINCT customers.customer_id) AS num_customers,
    COUNT(DISTINCT order_id) AS num_orders,
    SUM(total_amount) AS total_spent,
    AVG(total_amount) AS avg_order_value
FROM customers
JOIN orders ON customers.customer_id = orders.customer_id
WHERE order_date >= '2024-01-01' AND order_date < '2024-03-01'
GROUP BY age_group, gender
ORDER BY age_group, gender;
```
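
A `CASE WHEN` expression can be mirrored with `np.select`, which checks conditions in order and takes the first match, much like SQL does. The sketch below uses a different, hypothetical bucketing (order size by quantity) to show the pattern; `pd.cut` is an alternative when the buckets are contiguous numeric ranges.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration.
toy = pd.DataFrame({"quantity": [1, 2, 5, 9]})

# Series.between is inclusive on both ends by default, like SQL BETWEEN.
conditions = [
    toy["quantity"] < 2,             # WHEN quantity < 2 THEN 'Single'
    toy["quantity"].between(2, 5),   # WHEN quantity BETWEEN 2 AND 5 THEN 'Few'
]
choices = ["Single", "Few"]

# default supplies the ELSE branch.
toy["order_size"] = np.select(conditions, choices, default="Bulk")
```

Keep in mind that sorting text labels such as these is alphabetical, both in SQL and in pandas, unless the column is turned into an ordered categorical.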

In [None]:
# Pandas equivalent of SQL Query 4
# Your code here

### 4.3 Comparing SQL and Pandas

Based on your experience translating these SQL queries to pandas code, discuss the advantages and disadvantages of each approach. When might you prefer using SQL over pandas, and vice versa?

*Write your comparison here*

## Part 5: Data Analysis Challenge (15 points)

Using the skills you've learned, conduct a comprehensive analysis of the e-commerce dataset to answer the following business questions:

### 5.1 Sales Trends Analysis

Analyze the sales trends over time. Identify patterns, seasonal variations, and growth trends.
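
One simple starting point, sketched on toy data with assumed column names, is to total revenue per month and look at the month-over-month change. This is only one of many possible views of a trend.

```python
import pandas as pd

# Toy data; column names are assumptions.
toy = pd.DataFrame({
    "order_date": pd.to_datetime(
        ["2024-01-05", "2024-01-20", "2024-02-02", "2024-03-15"]
    ),
    "total_amount": [100.0, 50.0, 300.0, 150.0],
})

# Revenue per calendar month.
monthly_revenue = toy.groupby(toy["order_date"].dt.to_period("M"))["total_amount"].sum()

# Fractional change from the previous month; the first month has no predecessor.
growth = monthly_revenue.pct_change()
```

Calling `monthly_revenue.plot()` gives a quick line chart of the same series for visual inspection.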

In [None]:
# Sales trends analysis
# Your code here

**Interpretation:**

*Write your analysis of sales trends here*

### 5.2 Customer Segmentation

Segment customers based on their purchasing behavior (frequency, recency, monetary value) and demographic information. Identify the most valuable customer segments.
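
Recency, frequency, and monetary value can be computed per customer with a single `groupby`. The sketch below assumes hypothetical column names and measures recency against the day after the last order in the data; that reference date is an assumption you should choose and justify yourself.

```python
import pandas as pd

# Toy data; column names are assumptions.
toy = pd.DataFrame({
    "customer_id": ["C1", "C1", "C2", "C3", "C3", "C3"],
    "order_id": [1, 2, 3, 4, 5, 6],
    "order_date": pd.to_datetime(
        ["2024-01-05", "2024-03-01", "2024-01-10",
         "2024-02-01", "2024-02-20", "2024-03-10"]
    ),
    "total_amount": [100.0, 40.0, 500.0, 20.0, 30.0, 25.0],
})

# Assumed reference date: one day after the most recent order.
snapshot = toy["order_date"].max() + pd.Timedelta(days=1)

rfm = toy.groupby("customer_id").agg(
    recency_days=("order_date", lambda d: (snapshot - d.max()).days),
    frequency=("order_id", "nunique"),
    monetary=("total_amount", "sum"),
)
```

From here, segments can be formed by binning each metric (for example with `pd.qcut`) and combining the bins with demographic columns.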

In [None]:
# Customer segmentation analysis
# Your code here

**Interpretation:**

*Write your analysis of customer segments here*

### 5.3 Product Performance Analysis

Evaluate the performance of different product categories and identify the best-selling and most profitable products.

In [None]:
# Product performance analysis
# Your code here

**Interpretation:**

*Write your analysis of product performance here*

### 5.4 Business Recommendations

Based on your analysis, provide at least three specific business recommendations to improve sales, customer satisfaction, or profitability.

**Business Recommendations:**

1. 

2. 

3. 

## Conclusion

Summarize what you've learned from this assignment and how you've applied pandas concepts to analyze the e-commerce dataset.

*Write your conclusion here*