# Lab: Ingesting and Analyzing Avalanche Shipping Logs

**Objective:**  
You’ll use Snowflake and Python to ingest a raw Markdown file containing Avalanche shipping logs. You will then parse and clean the data and extract insights that can inform the next stage of your GenAI prototype.

**Story Context:**  
You just got your hands on internal shipping logs from Avalanche's distribution warehouse. These logs contain important operational data: delivery errors, shipping times, product IDs, and destinations. However, everything is stored in a single raw `.md` file.

Your job is to:
1. Upload the markdown file into Snowflake.
2. Parse and structure it into a usable table.
3. Clean and explore the data.
4. Start identifying shipping trends or anomalies.

## ✅ Step 1: Upload the file to a Snowflake Notebook

In [None]:
# Load file into a variable
with open("/files/shipping-logs.md", "r", encoding="utf-8") as file:
    raw_text = file.read()

# Show a preview
print(raw_text[:1000])  # Preview first 1000 characters
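
Before parsing, it can help to check how the file is actually delimited. The sketch below assumes entries are separated by lines containing only `---`, the same delimiter the parsing cell looks for. `count_delimiter_lines` and `sample_text` are hypothetical names and data used for illustration; they are not part of the lab files.

```python
import re

def count_delimiter_lines(text: str, delimiter: str = "---") -> int:
    """Count lines that consist only of the delimiter (ignoring spaces and tabs)."""
    pattern = rf"^[ \t]*{re.escape(delimiter)}[ \t]*$"
    return len(re.findall(pattern, text, flags=re.MULTILINE))

# Hypothetical sample that mimics the assumed layout; use raw_text in the notebook
sample_text = (
    "---\n"
    "2024-01-05 08:30\n"
    "Denver\n"
    "Package delayed by snow.\n"
    "---\n"
    "2024-01-06 09:10\n"
    "Boise\n"
    "Wrong address on label.\n"
    "---\n"
)

delimiter_count = count_delimiter_lines(sample_text)
print(delimiter_count)
```

If the delimiters wrap the whole file, N delimiter lines imply at most N - 1 entries. A large mismatch after parsing suggests the extraction pattern needs adjusting.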

In [None]:
# Simple log pattern extraction (customize as needed)
import re
import pandas as pd

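# Capture the text between consecutive '---' delimiters. The lookahead leaves
# the closing delimiter in place so it can also open the next entry; a plain
# r"---(.*?)---" consumes it and skips every other entry when a single
# delimiter separates entries.
log_entries = re.findall(r"(?s)---(.*?)(?=---)", raw_text)

structured_logs = []

for entry in log_entries:
    # Drop blank lines so the positional indexing below is stable
    lines = [line.strip() for line in entry.splitlines() if line.strip()]
    if len(lines) < 2:
        continue  # Skip empty or malformed entries
    record = {
        "timestamp": lines[0],
        "location": lines[1],
        "summary": " ".join(lines[2:]),
    }
    structured_logs.append(record)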

df_logs = pd.DataFrame(structured_logs)
df_logs.head()
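
Step 3 of the lab calls for cleaning the data, which is easiest to do while it is still a pandas DataFrame. The following is a minimal sketch, assuming the `timestamp` column holds parseable date strings (the lab does not specify the format). `clean_logs` and the sample rows are hypothetical.

```python
import pandas as pd

def clean_logs(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy: trimmed text, parsed timestamps, no blank or duplicate rows."""
    out = df.copy()
    # Trim stray whitespace in the text columns
    for column in ("location", "summary"):
        out[column] = out[column].astype(str).str.strip()
    # Unparseable timestamps become NaT instead of raising
    out["timestamp"] = pd.to_datetime(out["timestamp"], errors="coerce")
    # Drop rows without a location, then exact duplicates
    out = out[out["location"] != ""]
    return out.drop_duplicates().reset_index(drop=True)

# Hypothetical rows for illustration only
sample = pd.DataFrame({
    "timestamp": ["2024-01-05 08:30", "2024-01-05 08:30", "not a date"],
    "location": [" Denver ", " Denver ", "Boise"],
    "summary": ["Package delayed by snow.", "Package delayed by snow.", "Wrong address on label."],
})

cleaned = clean_logs(sample)
print(cleaned)
```

In the notebook this would be applied as `df_logs = clean_logs(df_logs)` before the write step. After writing, it is worth verifying that `timestamp` arrived in Snowflake with the type you expect.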

In [None]:
# Connect to Snowflake and write the data to a table
from snowflake.snowpark.session import Session

connection_params = {
  "account": "<your_account>",
  "user": "<your_username>",
  "password": "<your_password>",
  "role": "ACCOUNTADMIN",
  "warehouse": "COMPUTE_WH",
  "database": "AVALANCHE_DB",
  "schema": "AVALANCHE_SCHEMA"
}

session = Session.builder.configs(connection_params).create()

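# Write the pandas DataFrame to a Snowflake table (created if missing, replaced if present).
# Column names are stored as quoted, case-sensitive identifiers, e.g. "location".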
session.write_pandas(df_logs, "SHIPPING_LOGS", auto_create_table=True, overwrite=True)
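
Hard-coding a password in a notebook cell makes it easy to leak. One hedged alternative is to build the same dictionary from environment variables. The `SNOWFLAKE_*` variable names and the `connection_params_from_env` helper below are chosen for illustration; nothing in the lab sets them.

```python
import os

REQUIRED_KEYS = ("account", "user", "password")
OPTIONAL_DEFAULTS = {
    "role": "ACCOUNTADMIN",
    "warehouse": "COMPUTE_WH",
    "database": "AVALANCHE_DB",
    "schema": "AVALANCHE_SCHEMA",
}

def connection_params_from_env(env=None) -> dict:
    """Build connection parameters from SNOWFLAKE_* variables (hypothetical names)."""
    env = os.environ if env is None else env
    params = {}
    for key in REQUIRED_KEYS:
        name = f"SNOWFLAKE_{key.upper()}"
        if not env.get(name):
            raise KeyError(f"Missing required environment variable: {name}")
        params[key] = env[name]
    # Fall back to the lab's defaults when an optional variable is unset
    for key, default in OPTIONAL_DEFAULTS.items():
        params[key] = env.get(f"SNOWFLAKE_{key.upper()}", default)
    return params

# Demonstration with a stand-in mapping instead of the real environment
demo_env = {
    "SNOWFLAKE_ACCOUNT": "<your_account>",
    "SNOWFLAKE_USER": "<your_username>",
    "SNOWFLAKE_PASSWORD": "<your_password>",
}
demo_params = connection_params_from_env(demo_env)
print(sorted(demo_params))
```

The result can be passed to `Session.builder.configs(...)` in place of `connection_params`. Inside a Snowflake Notebook, `snowflake.snowpark.context.get_active_session()` typically returns an already-authenticated session, in which case no credentials are needed in the code at all.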

In [None]:
# Run basic analysis
from snowflake.snowpark.functions import col, length

df = session.table("SHIPPING_LOGS")

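# Count log entries per location, most frequent first.
# Column names are quoted because write_pandas stored them in lowercase.
df.group_by('"location"').count().order_by(col("count").desc()).show()

# Show the five entries with the longest summaries
df.with_column("summary_length", length(col('"summary"'))).order_by(col("summary_length").desc()).limit(5).show()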
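
The same two questions can be sanity-checked locally in pandas alongside the Snowpark queries. This is a sketch on hypothetical rows, not output from the real logs; `top_locations` and `longest_summaries` are illustrative helper names.

```python
import pandas as pd

def top_locations(df: pd.DataFrame, n: int = 5) -> pd.Series:
    """Number of log entries per location, most frequent first."""
    return df["location"].value_counts().head(n)

def longest_summaries(df: pd.DataFrame, n: int = 5) -> pd.DataFrame:
    """The n rows with the longest summary text."""
    with_length = df.assign(summary_length=df["summary"].str.len())
    return with_length.nlargest(n, "summary_length")

# Hypothetical rows for illustration only
sample = pd.DataFrame({
    "location": ["Denver", "Boise", "Denver"],
    "summary": ["Delayed.", "Wrong address on label.", "Damaged box."],
})

counts = top_locations(sample)
longest = longest_summaries(sample, n=1)
print(counts)
print(longest)
```

Note that an entry count is only a proxy for "problem locations": it measures how often a location appears in the logs, not how severe the issues were.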

## ✅ Challenge Extension (Optional)
Try building a quick prototype in Streamlit that lets a user:
- Search shipping logs by location or keyword
- Display the top locations with shipping issues
- Summarize long logs using the OpenAI API
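
As a starting point for the search feature, the filtering logic can be written and tested as a plain function before any Streamlit widgets are wired up. The following is a minimal sketch, assuming a pandas DataFrame with the `location` and `summary` columns created earlier. `search_logs` is a hypothetical helper and the rows are made up.

```python
import pandas as pd

def search_logs(df: pd.DataFrame, query: str) -> pd.DataFrame:
    """Case-insensitive substring match against location or summary."""
    query = query.strip()
    if not query:
        return df
    # regex=False treats the query as literal text, so characters like '(' are safe
    in_location = df["location"].str.contains(query, case=False, regex=False, na=False)
    in_summary = df["summary"].str.contains(query, case=False, regex=False, na=False)
    return df[in_location | in_summary]

# Hypothetical rows for illustration only
sample = pd.DataFrame({
    "location": ["Denver", "Boise", "Denver"],
    "summary": ["Delayed.", "Wrong address on label.", "Damaged box."],
})

denver_hits = search_logs(sample, "denver")
label_hits = search_logs(sample, "label")
print(denver_hits)
print(label_hits)
```

In a Streamlit app, the query would typically come from `st.text_input` and the filtered result would be passed to `st.dataframe`.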