# Phase A1: Data Ingestion & Normalization

This notebook ingests the raw Google Maps data for Georgia and writes it out as cleaned Parquet files.

**Inputs:**
- `data/raw/review-Georgia.json` (7.2 GB)
- `data/raw/meta-Georgia.json` (168 MB)

**Outputs:**
- `data/processed/ga/reviews_ga.parquet`
- `data/processed/ga/biz_ga.parquet`

In [2]:
import sys
from pathlib import Path

# Add src to path
sys.path.append(str(Path.cwd().parent / 'src'))

from data.ingest import main

print("✓ Imports successful")

✓ Imports successful


## Run Data Ingestion

Running `main()` will:
1. Load and normalize metadata (businesses)
2. Load and normalize reviews
3. Apply data cleaning and deduplication
4. Save to Parquet format
5. Print summary statistics
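
The deduplication in step 3 can be illustrated with a small, library-independent sketch. It assumes that duplicates are identified by a single key column (here `gmap_id`, which appears in the business schema) and that the first occurrence wins. The helper name `dedupe_by_key` and the sample rows are hypothetical; the actual logic in `data.ingest` is not shown in this notebook and may differ.

```python
from typing import Any, Hashable, Iterable, Iterator


def dedupe_by_key(rows: Iterable[dict[str, Any]], key: str) -> Iterator[dict[str, Any]]:
    """Yield the first row seen for each distinct value of `key`.

    Rows that lack the key (or have it set to None) are skipped.
    """
    seen: set[Hashable] = set()
    for row in rows:
        value = row.get(key)
        if value is None or value in seen:
            continue
        seen.add(value)
        yield row


# Hypothetical sample rows, not taken from the real dataset.
businesses = [
    {"gmap_id": "a1", "name": "Peach Diner"},
    {"gmap_id": "b2", "name": "Marietta Tacos"},
    {"gmap_id": "a1", "name": "Peach Diner"},  # repeat of the first row
    {"name": "No ID Cafe"},  # missing key, dropped
]
unique_businesses = list(dedupe_by_key(businesses, key="gmap_id"))
print(len(unique_businesses))
```

Keeping the first occurrence preserves input order, which makes the result reproducible for a fixed input file.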

**Note:** Ingestion may take 10-20 minutes for the 7.2 GB review file.
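
A file of this size is usually easier to handle in batches than in one read. The sketch below shows one way to do that with the standard library only. It assumes the raw file is in JSON Lines form (one JSON object per line), which this notebook does not confirm. The function name `iter_json_batches` and the sample file are hypothetical and are not part of `data.ingest`.

```python
import json
import tempfile
from itertools import islice
from pathlib import Path
from typing import Any, Iterator


def iter_json_batches(path: Path, batch_size: int = 100_000) -> Iterator[list[dict[str, Any]]]:
    """Yield lists of parsed records, at most `batch_size` records per list."""
    with path.open("r", encoding="utf-8") as handle:
        # Parse lazily and skip blank lines, so only one batch is in memory at a time.
        records = (json.loads(line) for line in handle if line.strip())
        while True:
            batch = list(islice(records, batch_size))
            if not batch:
                break
            yield batch


# Demonstration on a tiny temporary file with five records.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.json"
    lines = [json.dumps({"rating": r}) for r in [5, 4, 3, 2, 1]]
    sample.write_text("\n".join(lines) + "\n", encoding="utf-8")
    batch_sizes = [len(batch) for batch in iter_json_batches(sample, batch_size=2)]

print(batch_sizes)
```

Each batch can then be cleaned and appended to the output independently, which keeps peak memory roughly proportional to `batch_size` rather than to the file size.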

In [5]:
# Run the ingestion pipeline
main()


PHASE A1: INGESTING METADATA
Input: /Users/istantheman/Forkast/data/raw/meta-Georgia.json
Output: /Users/istantheman/Forkast/data/processed/ga/biz_ga.parquet

[1/6] Reading JSON file...
  Loaded 166,381 raw businesses

[2/6] Filtering to Georgia geographic bounds...
  Retained 166,334 businesses in Georgia

[3/6] Parsing price buckets...

[4/6] Detecting closed businesses...

[5/7] Normalizing categories...

[6/7] Filtering to food-only businesses...
  Retained 27,757 food-related businesses

[7/7] Finalizing schema...
  Final count: 27,710 unique businesses

Writing to /Users/istantheman/Forkast/data/processed/ga/biz_ga.parquet...

✓ Metadata ingestion complete!
  Output size: 4.8 MB

PHASE A1: INGESTING REVIEWS
Input: /Users/istantheman/Forkast/data/raw/review-Georgia.json
Output: /Users/istantheman/Forkast/data/processed/ga/reviews_ga.parquet

[1/5] Reading JSON file...
  Loaded 24,060,125 raw reviews

[2/5] Converting timestamps...

[3/5] Filtering invalid timestamps...
  Retained 
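
The timestamp steps in the review log (convert, then filter invalid values) might look roughly like the following. This is a sketch under two assumptions that the notebook does not confirm: that raw review times are Unix epoch milliseconds, and that "invalid" means unparseable or outside a plausible date window. The window limits and the name `parse_epoch_ms` are hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional

# Hypothetical validity window; the real pipeline's limits are not shown here.
MIN_VALID = datetime(2000, 1, 1, tzinfo=timezone.utc)
MAX_VALID = datetime(2030, 1, 1, tzinfo=timezone.utc)


def parse_epoch_ms(value: object) -> Optional[datetime]:
    """Convert epoch milliseconds to an aware UTC datetime, or None if unusable."""
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return None
    try:
        parsed = datetime.fromtimestamp(value / 1000, tz=timezone.utc)
    except (OverflowError, OSError, ValueError):
        # Raised for values that fall outside the platform's supported range.
        return None
    return parsed if MIN_VALID <= parsed < MAX_VALID else None


raw_times = [1609459200000, None, -5, 10**18]
parsed_times = [parse_epoch_ms(t) for t in raw_times]
valid_times = [t for t in parsed_times if t is not None]
print(valid_times)
```

Returning `None` instead of raising keeps the conversion and the filter as two separate passes, matching the two log steps.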

## Verify Output

Let's load and inspect the generated Parquet files.

In [6]:
import polars as pl

# Load the processed data
base_dir = Path.cwd().parent
biz_df = pl.read_parquet(base_dir / "data/processed/ga/biz_ga.parquet")
reviews_df = pl.read_parquet(base_dir / "data/processed/ga/reviews_ga.parquet")

print("Business Data Schema:")
print(biz_df.schema)
print(f"\nShape: {biz_df.shape}")
print("\nFirst 5 rows:")
print(biz_df.head())

print("\n" + "="*80)
print("\nReview Data Schema:")
print(reviews_df.schema)
print(f"\nShape: {reviews_df.shape}")
print("\nFirst 5 rows:")
print(reviews_df.head())

Business Data Schema:
Schema({'gmap_id': String, 'name': String, 'lat': Float32, 'lon': Float32, 'category_main': String, 'category_all': List(String), 'avg_rating': Float32, 'num_reviews': Int32, 'price_bucket': Int8, 'is_closed': Boolean, 'relative_results': List(String)})

Shape: (27710, 11)

First 5 rows:
shape: (5, 11)
┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐
│ gmap_id   ┆ name      ┆ lat       ┆ lon       ┆ … ┆ num_revie ┆ price_buc ┆ is_closed ┆ relative │
│ ---       ┆ ---       ┆ ---       ┆ ---       ┆   ┆ ws        ┆ ket       ┆ ---       ┆ _results │
│ str       ┆ str       ┆ f32       ┆ f32       ┆   ┆ ---       ┆ ---       ┆ bool      ┆ ---      │
│           ┆           ┆           ┆           ┆   ┆ i32       ┆ i8        ┆           ┆ list[str │
│           ┆           ┆           ┆           ┆   ┆           ┆           ┆           ┆ ]        │
╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪═══
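
Beyond eyeballing the schema, a few row-level sanity checks can catch ingestion mistakes early. The sketch below is library-independent and works on plain dictionaries. The bounding box is an approximate, padded box around the state of Georgia chosen for illustration, not the bounds the pipeline uses, and `find_violations` and the sample rows are hypothetical.

```python
from typing import Any

# Approximate, padded bounding box for Georgia (an assumption; the pipeline's
# exact geographic bounds are not shown in this notebook).
GA_LAT = (30.3, 35.1)
GA_LON = (-85.7, -80.7)


def find_violations(row: dict[str, Any]) -> list[str]:
    """Return the names of the checks that a business row fails."""
    problems = []
    lat, lon = row.get("lat"), row.get("lon")
    if lat is None or not GA_LAT[0] <= lat <= GA_LAT[1]:
        problems.append("lat")
    if lon is None or not GA_LON[0] <= lon <= GA_LON[1]:
        problems.append("lon")
    rating = row.get("avg_rating")
    if rating is not None and not 1.0 <= rating <= 5.0:
        problems.append("avg_rating")
    return problems


# Hypothetical rows; with polars these could come from biz_df.iter_rows(named=True).
rows = [
    {"gmap_id": "a1", "lat": 33.749, "lon": -84.388, "avg_rating": 4.5},
    {"gmap_id": "b2", "lat": 40.713, "lon": -74.006, "avg_rating": 4.0},
    {"gmap_id": "c3", "lat": 32.081, "lon": -81.091, "avg_rating": 6.0},
]
report = {row["gmap_id"]: find_violations(row) for row in rows}
print(report)
```

An empty list for every row means all checks passed; any non-empty entry points at the column to inspect.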

## Quick Visualizations

In [7]:
import plotly.express as px

# Category distribution
# pl.len() replaces pl.count(), which is deprecated since Polars 0.20.5
cat_dist = biz_df.group_by("category_main").agg(pl.len().alias("count")).sort("count", descending=True)
cat_dist_pd = cat_dist.to_pandas()

fig = px.bar(cat_dist_pd.head(15), x="category_main", y="count",
             title="Top 15 Business Categories in Georgia",
             labels={"category_main": "Category", "count": "Number of Businesses"})
fig.show()

# Rating distribution
rating_dist = reviews_df.group_by("rating").agg(pl.len().alias("count")).sort("rating")
rating_dist_pd = rating_dist.to_pandas()

fig = px.bar(rating_dist_pd, x="rating", y="count",
             title="Review Rating Distribution",
             labels={"rating": "Rating (1-5)", "count": "Number of Reviews"})
fig.show()

# Geographic distribution (sample for speed)
biz_sample = biz_df.sample(n=min(5000, len(biz_df))).to_pandas()
# scatter_map replaces the deprecated scatter_mapbox (see https://plotly.com/python/mapbox-to-maplibre/)
fig = px.scatter_map(biz_sample, lat="lat", lon="lon",
                     color="category_main",
                     hover_name="name",
                     hover_data=["avg_rating", "num_reviews"],
                     title="Business Locations in Georgia (Sample)",
                     zoom=6, height=600)
fig.update_layout(map_style="open-street-map")
fig.show()

