# Lab: Ingesting and Analyzing Avalanche Shipping Logs

**Objective:**  
You’ll use Snowflake and Python to ingest a raw Markdown file containing Avalanche shipping logs. You will parse and clean the data, then extract insights that can inform the next stage of your GenAI prototype.

**Story Context:**  
You just got your hands on internal shipping logs from Avalanche's distribution warehouse. These logs contain important operational data: delivery errors, shipping times, product IDs, and destinations. But it’s all stored as raw, unstructured Markdown (`.md`).

Your job is to:
1. Upload the markdown file into Snowflake.
2. Parse and structure it into a usable table.
3. Clean and explore the data.
4. Start identifying shipping trends or anomalies.

## ✅ Step 1: Upload the file to a Snowflake Notebook

In [None]:
# Read the uploaded Markdown file into a string
with open("shipping-logs.md", "r", encoding="utf-8") as file:
    raw_text = file.read()

# Show a preview
print(raw_text[:1000])  # Preview first 1000 characters

In [None]:
import re
import pandas as pd

# Regular expression with one capture group per field, in the order the fields appear in each record
pattern = r'Order ID:\s*(\d+)\s+Shipping Date:\s*(\d{4}-\d{2}-\d{2})\s+Carrier:\s*(.*?)\s+Tracking Number:\s*(\d+)\s+Latitude:\s*([-\d.]+)\s+Longitude:\s*([-\d.]+)\s+Status:\s*(\w+)'

# Find every record; each match is a tuple of the seven captured fields
matches = re.findall(pattern, raw_text, re.DOTALL)

# Build a DataFrame with one row per matched record
df = pd.DataFrame(matches, columns=["ORDER_ID", "SHIPPING_DATE", "CARRIER", "TRACKING_NUMBER", "LATITUDE", "LONGITUDE", "STATUS"])

# Convert columns to appropriate data types
df = df.astype({
    'ORDER_ID': 'int',
    'SHIPPING_DATE': 'datetime64[ns]',
    'TRACKING_NUMBER': 'str',
    'LATITUDE': 'float',
    'LONGITUDE': 'float',
    'STATUS': 'str'
})

# Convert SHIPPING_DATE back to a YYYY-MM-DD string for Snowflake compatibility
df["SHIPPING_DATE"] = df["SHIPPING_DATE"].dt.strftime('%Y-%m-%d')

# Display the resulting DataFrame
df.head()
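
Before relying on the parsed table, it can help to confirm that the pattern matches the record layout you expect. The sketch below runs the same pattern against a made-up record; `sample_text` and its values are hypothetical, and the real layout of `shipping-logs.md` may differ (for example, if field labels are wrapped in Markdown emphasis, the pattern would need adjusting).

```python
import re

# Same pattern as in the parsing cell
PATTERN = r'Order ID:\s*(\d+)\s+Shipping Date:\s*(\d{4}-\d{2}-\d{2})\s+Carrier:\s*(.*?)\s+Tracking Number:\s*(\d+)\s+Latitude:\s*([-\d.]+)\s+Longitude:\s*([-\d.]+)\s+Status:\s*(\w+)'

# Hypothetical record for illustration only; not taken from the real file
sample_text = """
Order ID: 1001
Shipping Date: 2024-01-15
Carrier: Alpine Freight
Tracking Number: 123456789
Latitude: 39.7392
Longitude: -104.9903
Status: Delivered
"""

sample_matches = re.findall(PATTERN, sample_text, re.DOTALL)

# If these two numbers differ, some records did not match the pattern
record_count = sample_text.count("Order ID:")
print(f"{len(sample_matches)} of {record_count} records parsed")
print(sample_matches)
```

Running the same comparison on `raw_text` (`raw_text.count("Order ID:")` against `len(matches)`) is one way to spot records that the pattern silently skipped.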

In [None]:
from snowflake.snowpark.context import get_active_session

session = get_active_session()

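# Write the pandas DataFrame to a Snowflake table named SHIPPING_LOGS,
# creating the table if needed and replacing any existing contents
session.write_pandas(df, "SHIPPING_LOGS", auto_create_table=True, overwrite=True)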

In [None]:
# Run basic analysis
from snowflake.snowpark.functions import col

df_from_snowflake = session.table("SHIPPING_LOGS")

# Count shipments per carrier, most frequent first
df_from_snowflake.group_by('"CARRIER"').count().order_by(col("count").desc()).show()
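
Shipment counts alone do not show where problems occur. As a next step toward spotting anomalies, the sketch below computes the share of shipments per carrier whose status is not a success value. The helper name `problem_rate_by_carrier`, the sample pairs, and the assumption that `"Delivered"` is the success status are all hypothetical; check the actual `STATUS` values in your data first.

```python
from collections import Counter

def problem_rate_by_carrier(records, ok_status="Delivered"):
    """Return {carrier: fraction of shipments whose status differs from ok_status}."""
    totals = Counter()
    problems = Counter()
    for carrier, status in records:
        totals[carrier] += 1
        if status != ok_status:
            problems[carrier] += 1
    return {carrier: problems[carrier] / totals[carrier] for carrier in totals}

# Hypothetical (carrier, status) pairs for illustration only
sample_records = [
    ("Alpine Freight", "Delivered"),
    ("Alpine Freight", "Delayed"),
    ("Summit Express", "Delivered"),
    ("Summit Express", "Delivered"),
]

rates = problem_rate_by_carrier(sample_records)

# List carriers from highest to lowest problem rate
for carrier, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{carrier}: {rate:.0%}")
```

In the notebook, the pairs could come from the pandas DataFrame built earlier, for example `zip(df["CARRIER"], df["STATUS"])`.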

## ✅ Challenge Extension (Optional)
Try building a quick prototype in Streamlit that lets a user:
- Search shipping logs by location or keyword
- Display the top locations with shipping issues
- Summarize long logs using the OpenAI API
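
As a starting point for the search feature, here is a minimal sketch of a case-insensitive keyword filter that a Streamlit app could call. The function name `search_logs` and the sample rows are hypothetical, and the Streamlit UI and OpenAI summarization are left out.

```python
def search_logs(rows, keyword):
    """Return the rows in which any field contains keyword (case-insensitive)."""
    needle = keyword.lower()
    return [
        row for row in rows
        if any(needle in str(value).lower() for value in row.values())
    ]

# Hypothetical rows for illustration only
sample_rows = [
    {"ORDER_ID": 1001, "CARRIER": "Alpine Freight", "STATUS": "Delivered"},
    {"ORDER_ID": 1002, "CARRIER": "Summit Express", "STATUS": "Delayed"},
]

print(search_logs(sample_rows, "delayed"))
```

The rows could be produced from the parsed DataFrame with `df.to_dict("records")`, which yields one dictionary per shipment.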