# Brewed Insights: Coffee Sales Analysis

### 1. Install and import the relevant libraries

In [1]:
import sqlite3
import pandas as pd
import os
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
import numpy as np

### 2. Load the dataset

In [2]:
# Load CSV data into a Pandas DataFrame
base_path = os.path.dirname(os.getcwd())  # go up from notebooks folder
data_path = os.path.join(base_path, 'data', 'coffee_sales.csv')

df = pd.read_csv(data_path)

### 3. Create a temporary database using SQLite and insert the table

In [3]:
# Create a temporary SQLite database
conn = sqlite3.connect('coffee.db')

# Write the DataFrame into a SQL table
df.to_sql('coffee_sales', conn, index=False, if_exists='replace')

3547

### 4. Create a helper function to reuse later

In [4]:
def run_query(query, conn):
    """Helper function to run SQL queries and return DataFrame."""
    return pd.read_sql_query(query, conn)
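
As a usage illustration, the helper can be tried against a small in-memory database. This is a minimal sketch with made-up rows; `demo_conn` and `run_query_params` are hypothetical names that appear nowhere else in this notebook. The second helper assumes you may want to bind values through the `params` argument of `pd.read_sql_query` instead of formatting them into the SQL string.

```python
import sqlite3
import pandas as pd

def run_query(query, conn):
    """Same helper as above, repeated so the sketch is self-contained."""
    return pd.read_sql_query(query, conn)

def run_query_params(query, conn, params=None):
    """Hypothetical variant: forwards bound parameters to the driver."""
    return pd.read_sql_query(query, conn, params=params)

# Toy data (made up for illustration, not the real dataset)
demo_conn = sqlite3.connect(":memory:")
toy = pd.DataFrame({
    "coffee_name": ["Latte", "Latte", "Cortado"],
    "money": [35.0, 35.0, 28.0],
})
toy.to_sql("coffee_sales", demo_conn, index=False)

# Plain query: number of sales per drink
counts = run_query(
    "SELECT coffee_name, COUNT(*) AS n FROM coffee_sales "
    "GROUP BY coffee_name ORDER BY n DESC;",
    demo_conn,
)

# Parameterized query: the drink name is bound via the ? placeholder
latte = run_query_params(
    "SELECT SUM(money) AS revenue FROM coffee_sales WHERE coffee_name = ?;",
    demo_conn,
    params=("Latte",),
)
demo_conn.close()

print(counts)
print(latte)
```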

### 5. Exploratory Data Analysis (EDA)

#### 5.1 Top-Selling Coffee Products

The five best-selling drinks by units sold, listed here in order of total revenue:

1. **Latte** – 757 units sold, $26,875.30 revenue  
2. **Americano with Milk** – 809 units sold, $24,751.12 revenue  
3. **Cappuccino** – 486 units sold, $17,439.14 revenue  
4. **Americano** – 564 units sold, $14,650.26 revenue  
5. **Cortado** – 287 units sold, $7,384.86 revenue

**Strategy Recommendations:**
- **Menu focus**: Promote top-performing drinks like Latte and Americano with Milk during peak hours to maximize revenue.
- **Upselling opportunities**: Encourage add-ons or combos for mid-level sellers like Cappuccino and Cortado to boost the average ticket.
- **Inventory planning**: Ensure adequate stock of high-demand drinks, especially during morning and afternoon peaks.

In [5]:
top_sellers = run_query("""
SELECT coffee_name, COUNT(*) AS total_sales, ROUND(SUM(money), 2) AS total_revenue
FROM coffee_sales
GROUP BY coffee_name
ORDER BY total_sales DESC
LIMIT 5;
""", conn)

top_sellers

,coffee_name,total_sales,total_revenue
0,Americano with Milk,809,24751.12
1,Latte,757,26875.3
2,Americano,564,14650.26
3,Cappuccino,486,17439.14
4,Cortado,287,7384.86
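
Ranking by units sold and ranking by revenue can disagree, as Latte and Americano with Milk show above. The sketch below illustrates the difference on made-up rows in an in-memory SQLite table; the prices and the `demo_conn` name are assumptions for illustration only.

```python
import sqlite3

# Toy rows (hypothetical, not the real dataset): (coffee_name, money)
rows = [
    ("Latte", 38.0), ("Latte", 38.0),
    ("Americano", 25.0), ("Americano", 25.0), ("Americano", 25.0),
    ("Cortado", 28.0),
]

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE coffee_sales (coffee_name TEXT, money REAL)")
demo_conn.executemany("INSERT INTO coffee_sales VALUES (?, ?)", rows)

# Same aggregation as the query above; only the sort key changes
base = """
SELECT coffee_name, COUNT(*) AS total_sales, ROUND(SUM(money), 2) AS total_revenue
FROM coffee_sales
GROUP BY coffee_name
ORDER BY {key} DESC;
"""
by_units = [r[0] for r in demo_conn.execute(base.format(key="total_sales"))]
by_revenue = [r[0] for r in demo_conn.execute(base.format(key="total_revenue"))]
demo_conn.close()

print("By units:  ", by_units)
print("By revenue:", by_revenue)
```

In the toy data the cheaper drink sells more units while the pricier one earns more, so the two orderings differ.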


#### 5.2 Peak Hours

- **10 AM is the peak sales hour**, generating $10,198 in revenue. This suggests mornings are the busiest period, likely due to office commuters and the customary "morning coffee" routine.

- **1–2 PM sees a slight dip** in revenue (~$7,100), which could be an opportunity for **promotions or lunch combos** to boost revenue during this quieter period.

- **7 PM–9 PM** still generates decent revenue (~$6,398–$7,752), but less than the afternoon. People still buy coffee in the evening, so **seasonal warm drinks** like Hot Chocolate could help increase evening sales.

- **6–7 AM** are the slowest hours: $149.40 (6 AM) and $2,846.02 (7 AM). Opening very early may not justify heavy staffing unless there is a loyal early-morning customer base. **Consider reducing staff or offering pre-order options.**

**Strategy Recommendations:**
- **Staffing**: Allocate more baristas from 9 AM–12 PM and 4 PM–5 PM to handle peak demand.

- **Product promotions**: Target slow hours (1–3 PM) with discounts or combos to increase average revenue.

- **Menu focus**: Highlight high-margin drinks during peak hours to maximize profits.

- **Operational planning**: Monitor inventory for popular drinks during peak hours to avoid shortages.

In [6]:
peak_hours = run_query("""
SELECT hour_of_day, SUM(money) AS total
FROM coffee_sales
GROUP BY hour_of_day
ORDER BY hour_of_day;
""", conn)

peak_hours

,hour_of_day,total
0,6,149.4
1,7,2846.02
2,8,7017.88
3,9,7264.28
4,10,10198.52
5,11,8453.1
6,12,7419.62
7,13,7028.76
8,14,7173.8
9,15,7476.02
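
The peak and slowest hours can also be picked out programmatically rather than by eye. A minimal sketch, assuming only the ten rows visible in the output above (6:00 to 15:00); the dictionary name is illustrative.

```python
# Hourly revenue totals copied from the rows shown in the output above
hourly_totals = {
    6: 149.40, 7: 2846.02, 8: 7017.88, 9: 7264.28, 10: 10198.52,
    11: 8453.10, 12: 7419.62, 13: 7028.76, 14: 7173.80, 15: 7476.02,
}

# Hour with the largest and smallest revenue among the displayed rows
peak_hour = max(hourly_totals, key=hourly_totals.get)
slowest_hour = min(hourly_totals, key=hourly_totals.get)

# Share of the displayed revenue earned in the peak hour
peak_share = hourly_totals[peak_hour] / sum(hourly_totals.values())

print(peak_hour, slowest_hour, f"{peak_share:.1%}")
```

On the full `peak_hours` DataFrame, `peak_hours.loc[peak_hours['total'].idxmax()]` would give the same peak row.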


#### 5.3 Revenue by Day of the Week

- **Tuesday generates the highest revenue** with $18,168.38, suggesting demand is strongest early in the week.

- **Monday** follows closely at $17,363.10, showing strong early-week sales, likely as people start their workweek.

- **Friday** brings in $16,802.66, still significant but slightly below Monday and Tuesday, which could reflect early weekend patterns.

- **Thursday ($16,091.40) and Wednesday ($15,750.46)** show steady mid-week revenue.

- **Weekends see lower revenue**: Saturday at $14,733.52 and Sunday at $13,336.06, indicating less foot traffic compared to weekdays.

**Strategy Recommendations:**

- **Staffing**: Schedule more staff on weekdays, particularly Tuesday and Monday, to handle higher demand.

- **Promotions**: Offer weekend promotions to boost sales during slower days.

- **Inventory planning**: Ensure top-selling drinks are well-stocked during high-revenue weekdays.

In [7]:
revenue_by_day = run_query("""
SELECT Weekday, SUM(money) AS total_revenue
FROM coffee_sales
GROUP BY Weekday
ORDER BY Weekdaysort;
""", conn)

revenue_by_day

,Weekday,total_revenue
0,Mon,17363.1
1,Tue,18168.38
2,Wed,15750.46
3,Thu,16091.4
4,Fri,16802.66
5,Sat,14733.52
6,Sun,13336.06
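
To put a number on the weekday versus weekend gap, the daily totals can be averaged per group. This is a rough sketch using the seven totals from the output above; it assumes each weekday appears about equally often in the data, which the notebook does not verify.

```python
# Revenue per weekday, copied from the output above
weekday_revenue = {
    "Mon": 17363.10, "Tue": 18168.38, "Wed": 15750.46, "Thu": 16091.40,
    "Fri": 16802.66, "Sat": 14733.52, "Sun": 13336.06,
}
weekend_days = {"Sat", "Sun"}

weekday_vals = [v for d, v in weekday_revenue.items() if d not in weekend_days]
weekend_vals = [v for d, v in weekday_revenue.items() if d in weekend_days]

# Average total per day-of-week within each group
weekday_avg = sum(weekday_vals) / len(weekday_vals)
weekend_avg = sum(weekend_vals) / len(weekend_vals)

# How much higher the weekday average is, relative to the weekend average
gap_pct = (weekday_avg - weekend_avg) / weekend_avg * 100

print(f"{weekday_avg:.2f} {weekend_avg:.2f} {gap_pct:.1f}%")
```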


#### 5.4 Average Sale per Hour

- **Morning transactions (6–9 AM)** tend to be smaller, averaging around $29–$32 per sale, reflecting lighter orders early in the day.  

- **Late morning to early afternoon (10 AM–2 PM)** sees moderate average sales ($30–$32), coinciding with peak traffic hours.  

- **Afternoon and evening (3–9 PM)** have the highest average transaction values, peaking at $33.85 around 7 PM, suggesting customers are purchasing larger or "premium" drinks later in the day.  

- **Night hours (10–11 PM)** maintain relatively high averages despite fewer customers, indicating that the few orders placed then are slightly larger in value.

**Strategy Recommendations:**

- **Upselling:** Promote premium drinks or add-ons in the afternoon and evening to maximize revenue per transaction.  

- **Early morning offers:** Introduce combos or incentives to increase average sales when traffic is lighter. 

In [8]:
avg_sale_hour = run_query("""
SELECT hour_of_day,
       ROUND(AVG(money), 2) AS avg_sale_per_hour
FROM coffee_sales
GROUP BY hour_of_day
ORDER BY hour_of_day;
""", conn)

avg_sale_hour

,hour_of_day,avg_sale_per_hour
0,6,29.88
1,7,32.34
2,8,29.86
3,9,30.02
4,10,31.09
5,11,29.87
6,12,30.79
7,13,31.24
8,14,31.88
9,15,31.68
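
An hourly average is only as reliable as the number of sales behind it. As a rough check, the transaction count per hour can be estimated by dividing the hourly total (section 5.2) by the hourly average shown above. The sketch below does this for three hours; the results are approximate because the averages are rounded to cents.

```python
# (total revenue, average sale) per hour, copied from the two output tables
hours = {
    6: (149.40, 29.88),
    7: (2846.02, 32.34),
    10: (10198.52, 31.09),
}

# count is approximately total / average
est_counts = {h: round(total / avg) for h, (total, avg) in hours.items()}

print(est_counts)
```

Under these numbers the 6 AM average would rest on only a handful of transactions, so it may be worth reading with caution. Adding `COUNT(*)` to the SQL query would give exact counts.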


#### 5.5 Monthly Sales Performance: Growth Rate by Month

- **Initial growth**: Feb (+107%) jumps from January’s $6,399 to $13,215. March (+20%) continues growth, but at a slower pace.

- **Fluctuations**: Apr drops sharply (-64%), then May recovers (+43%), showing volatility in spring months.

- **Moderate stability**: Jun–Aug see smaller changes (-9% to +10%), indicating steadier sales.

- **High season**: Sep–Oct (+31% and +39%) mark strong late-year performance.

- **Slowdown**: Nov drops (-38%), possibly due to seasonal factors like colder weather or lower foot traffic.

In [9]:
monthly_sales = df.groupby(['Month_name', 'Monthsort'])['money'].sum().reset_index()
monthly_sales = monthly_sales.sort_values('Monthsort') # Sort by Monthsort to ensure correct order
monthly_sales['sales_growth_rate'] = (
    monthly_sales['money'].pct_change().fillna(0) * 100 # Calculate percentage change and convert to percentage
).map(lambda x: f"{x:.2f}%") # Format as percentage string
monthly_sales = monthly_sales.drop(columns='Monthsort')  # Remove Monthsort for cleaner output
print(monthly_sales)

   Month_name     money sales_growth_rate
4         Jan   6398.86             0.00%
3         Feb  13215.48           106.53%
7         Mar  15891.64            20.25%
0         Apr   5719.56           -64.01%
8         May   8164.42            42.75%
6         Jun   7617.76            -6.70%
5         Jul   6915.94            -9.21%
1         Aug   7613.84            10.09%
11        Sep   9988.64            31.19%
10        Oct  13891.16            39.07%
9         Nov   8590.54           -38.16%
2         Dec   8237.74            -4.11%
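
The growth rate reported by `pct_change()` is the change from the previous month divided by the previous month's total, times 100. The sketch below reproduces the first few rates by hand from the totals in the output above; the variable names are illustrative.

```python
# First four monthly totals, copied from the output above
monthly_totals = [
    ("Jan", 6398.86), ("Feb", 13215.48), ("Mar", 15891.64), ("Apr", 5719.56),
]

growth = []
# Pair each month with the one before it
for (_, prev), (name, curr) in zip(monthly_totals, monthly_totals[1:]):
    rate = (curr - prev) / prev * 100  # percentage change vs. previous month
    growth.append((name, f"{rate:.2f}%"))

print(growth)
```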


### 6. Outliers: Extreme Sales

#### 6.1 Overview

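- Transaction count: 1,415 purchases meet or exceed the high-value threshold (the 75th percentile of `money`). Because the filter uses `>=` and many sales share the threshold price, this is about 40% of the 3,547 transactions rather than exactly 25%.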

- Revenue impact: These transactions generate $51,511.80, accounting for 46% of total revenue, highlighting their outsized contribution.

- Average value: Each high-value transaction averages $36.40, slightly above typical sales, confirming that larger purchases are common.

In [10]:
# Threshold: 75th percentile of transaction value (>= also keeps ties at the threshold)
threshold = df['money'].quantile(0.75)
high_value_sales = df[df['money'] >= threshold]

num_outliers = len(high_value_sales) # Number of high-value transactions
total_outliers = high_value_sales['money'].sum() # Total revenue from high-value transactions
avg_outliers = high_value_sales['money'].mean() # Average value of high-value transactions
pct_of_total = total_outliers / df['money'].sum() * 100 # Percentage of total revenue from high-value transactions

# Save summary into a DataFrame
summary = pd.DataFrame({
    "Metric": ["High-value transactions", "Total revenue", "Average value", "% of total revenue"], 
    "Value": [num_outliers, f"${total_outliers:.2f}", f"${avg_outliers:.2f}", f"{pct_of_total:.2f}%"]
})
summary

,Metric,Value
0,High-value transactions,1415
1,Total revenue,$51511.80
2,Average value,$36.40
3,% of total revenue,45.89%
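
When many transactions share the same price, a `>=` filter on the 75th percentile can select well over a quarter of the rows. The following sketch shows the effect on a made-up price series; the values are hypothetical and chosen only to make the ties obvious.

```python
import pandas as pd

# Toy prices (hypothetical): half of the sales sit at the top price
money = pd.Series([20, 25, 25, 30, 30, 35, 35, 35, 35, 35], dtype=float)

# 75th percentile using pandas' default linear interpolation
threshold = money.quantile(0.75)

# Fraction of rows kept by an inclusive vs. a strict comparison
share_ge = (money >= threshold).mean()
share_gt = (money > threshold).mean()

print(threshold, share_ge, share_gt)
```

Here the inclusive filter keeps half of the rows and the strict one keeps none, so the choice of comparison matters whenever the threshold coincides with a common price.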


#### 6.2 High-Value Coffee Transactions: Item Contribution to Top Sales

- **Top contributor**: Latte dominates high-value sales, generating $20,508 (40% of high-value revenue) from 563 transactions averaging $36.43 each.

- **Strong performer**: Cappuccino follows with $14,288 (28%) from 390 transactions and a similar average transaction value ($36.64).

- **Mid-tier contributor**: Hot Chocolate brings in $9,080 (18%) across 250 transactions, below the top two in both revenue and count.

- **Smaller but notable**: Cocoa adds $7,635 (15%) from 212 transactions, with a comparable average value ($36.01).

**Strategy Recommendations**:
- Upselling or promotions targeting Lattes and Cappuccinos could maximize revenue, while Hot Chocolate and Cocoa represent steady secondary options for high-value sales.

In [11]:
# Aggregate high-value sales by coffee item
coffee_high_value = high_value_sales.groupby('coffee_name').agg(
    total_revenue=('money', 'sum'),
    transaction_count=('money', 'count'),
    avg_value=('money', 'mean')
).sort_values(by='total_revenue', ascending=False)

# Optionally, calculate what % of total revenue each coffee contributes among high-value sales
coffee_high_value['pct_of_high_value_revenue'] = coffee_high_value['total_revenue'] / high_value_sales['money'].sum() * 100

coffee_high_value

coffee_name,total_revenue,transaction_count,avg_value,pct_of_high_value_revenue
Latte,20508.22,563,36.426679,39.812664
Cappuccino,14288.42,390,36.636974,27.738149
Hot Chocolate,9080.14,250,36.32056,17.627301
Cocoa,7635.02,212,36.014245,14.821885


### 7. Predictive Analysis

#### 7.1 Hourly Sales Forecast (6:00 - 22:00)
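Predicting the expected average transaction value for each hour of the day using polynomial regression to capture peak and off-peak trends. The model is fit on the hourly mean of `money`, so the predictions are per-sale averages rather than total hourly revenue.

- **Highest values** are expected between 16:00–22:00, with the predicted average sale rising gradually throughout the day.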

In [12]:
# Average transaction value by hour (mean of money, not total hourly revenue)
hourly_sales = df.groupby('hour_of_day')['money'].mean().reset_index()

# Features and target
X = hourly_sales[['hour_of_day']]  # Feature
y = hourly_sales['money']          # Target

# Polynomial regression (degree=2 for curve fitting)
poly_hour_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_hour_model.fit(X, y)

# Predict for cafe operating hours only (6:00 to 22:00)
future_hours = pd.DataFrame({'hour_of_day': range(6, 23)})
predicted_hourly_sales = poly_hour_model.predict(future_hours)

# Combine results for clarity
predicted_hourly_sales_df = future_hours.copy()
predicted_hourly_sales_df['predicted_sales'] = predicted_hourly_sales

# Format numbers
# predicted_hourly_sales_df['predicted_sales'] = predicted_hourly_sales_df['predicted_sales'].apply(lambda x: f"${x:,.2f}")

predicted_hourly_sales_df

,hour_of_day,predicted_sales
0,6,30.164497
1,7,30.342098
2,8,30.521475
3,9,30.70263
4,10,30.885562
5,11,31.070271
6,12,31.256758
7,13,31.445021
8,14,31.635063
9,15,31.826881


#### 7.2 Monthly Sales Forecast (Next 3 Months)
Predicting expected monthly revenue from the twelve historical monthly totals using polynomial regression, extrapolated to three simulated future months (`Monthsort` 13–15).

- Revenue is projected to **increase gradually**, reaching ~$11,076 in March.

In [13]:
# Aggregate monthly sales
monthly_sales = df.groupby(['Month_name', 'Monthsort'])['money'].sum().reset_index()
monthly_sales = monthly_sales.sort_values('Monthsort')  # ensure correct order

# Features and target
X_month = monthly_sales[['Monthsort']]  # Feature
y_month = monthly_sales['money']        # Target

# Polynomial regression (degree=2)
poly_month_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_month_model.fit(X_month, y_month)

# Predict for next 3 months (simulate months 13, 14, 15)
predicted_monthly_sales_df = pd.DataFrame({'Monthsort': [13, 14, 15]})
predicted_monthly_sales = poly_month_model.predict(predicted_monthly_sales_df)

# Map future month numbers to names for clarity
predicted_monthly_sales_df['Month_name'] = ['Jan_next', 'Feb_next', 'Mar_next']
predicted_monthly_sales_df['predicted_sales'] = predicted_monthly_sales

# Format numbers
#predicted_monthly_sales_df['predicted_sales'] = predicted_monthly_sales_df['predicted_sales'].apply(lambda x: f"${x:,.2f}")

predicted_monthly_sales_df

,Monthsort,Month_name,predicted_sales
0,13,Jan_next,10091.926818
1,14,Feb_next,10547.34486
2,15,Mar_next,11076.018576
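
Both forecasts fit a degree-2 polynomial and then evaluate it outside the range it was fitted on. The sketch below shows that mechanic on synthetic data, using `numpy.polyfit` in place of the scikit-learn pipeline purely to keep the example short. The data is an exact quadratic made up for illustration, so the extrapolated values can be checked by hand; real sales totals are noisier, and extrapolated values deserve correspondingly more caution.

```python
import numpy as np

# Synthetic data (hypothetical): an exact quadratic over 12 "months"
x = np.arange(1, 13)              # stand-in for Monthsort 1..12
y = 100 + 5 * x + 2 * x**2

# Least-squares fit of y = c2*x^2 + c1*x + c0
c2, c1, c0 = np.polyfit(x, y, deg=2)

# Evaluate the fitted curve beyond the training range (months 13-15)
x_future = np.array([13, 14, 15])
y_future = np.polyval([c2, c1, c0], x_future)

print(np.round([c2, c1, c0], 6))
print(np.round(y_future, 2))
```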


### 8. Close the Connection

In [14]:
conn.close()