# Step 2.5: Review Transformation - Tutorial

**Purpose:** Transform and enrich collected reviews before classification

**What you'll learn:**
- How to normalize review fields (dates, ratings, text)
- How to add geographic regions to reviews
- How to build aggregates for analysis
- How to handle missing data gracefully

**For Junior Developers:**
- This step is optional but highly recommended
- Transforms raw collected data into analysis-ready format
- Adds regional context for geographic analysis
- Creates pre-computed aggregates for faster dashboards

## What's New: Transform Pipeline

This notebook demonstrates the **new transform step** in the pipeline:

### Transform Components:
1. **Normalize Reviews** (`normalize_reviews.py`):
   - Parse French relative dates ("il y a X mois" → datetime)
   - Clean and validate ratings (clip to the 1-5 range)
   - Normalize city names (remove accents, lowercase)
   - Extract month for temporal analysis

2. **Add Regions** (`geocode.py`):
   - Use GeoJSON polygons to assign Moroccan regions
   - Fall back to city-name mapping when coordinates are unavailable
   - Validate coordinate bounds (Morocco only)

3. **Build Aggregates** (`aggregates.py`):
   - Pre-compute statistics by business, city, region, month
   - Count reviews, average ratings, sentiment distribution
   - Ready for dashboards and reports
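
As a rough illustration of the normalization bullets above, the sketch below shows one way to strip accents and lowercase a city name, and to clamp a raw rating into a valid star range. The helper names `normalize_city` and `clean_rating` are hypothetical and are not taken from `normalize_reviews.py`; the 1-5 bounds are an assumption based on a five-star scale.

```python
import math
import unicodedata
from typing import Optional


def normalize_city(name: Optional[str]) -> Optional[str]:
    """Return a lowercase, accent-free city name, or None for missing input."""
    if name is None or not str(name).strip():
        return None
    # NFKD splits "é" into "e" plus a combining accent; drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", str(name).strip())
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_accents.lower()


def clean_rating(value, low: int = 1, high: int = 5) -> Optional[int]:
    """Coerce a raw rating to an int clipped to [low, high]; None if unparseable."""
    try:
        # Accept a decimal comma ("4,0"), which is common in French-formatted data.
        number = float(str(value).replace(",", "."))
    except (TypeError, ValueError):
        return None
    if not math.isfinite(number):  # rejects NaN and infinities
        return None
    return max(low, min(high, int(round(number))))
```

For example, `normalize_city("  Fès ")` gives `"fes"`, and `clean_rating("7")` is clipped to `5`. Returning `None` for unparseable values keeps missing data explicit instead of silently inventing a rating.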

### Pipeline Flow:
```
discover  →  collect    →  TRANSFORM  →  classify
   ↓            ↓              ↓             ↓
0_raw/       0_interim/     0_interim/    0_processed/
discovery    collection     transform     classification
```

## Data Architecture Overview

The pipeline uses an organized folder structure:

```
data/
    00_config/          # Static configurations
        cities/         # regions.geojson for geocoding
    0_raw/              # Immutable source data
    0_interim/          # Recomputable cache
        collection/     # Collected reviews (input for transform)
        transform/      # Normalized + enriched reviews
    0_processed/        # Final outputs
        transform/      # Aggregates for analysis
    0_analysis/         # Reports, figures, dashboards
```

**This notebook processes:**
- Input: `data/0_interim/collection/reviews.parquet` (from collect step)
- Output: `data/0_interim/transform/reviews_normalized.parquet`
- Aggregates: `data/0_processed/transform/aggregates_*.csv`

## Setup and Imports

In [None]:
# Add parent directory to path
import sys
from pathlib import Path

project_root = Path().resolve().parent
sys.path.insert(0, str(project_root / "src"))

print(f"Project root: {project_root}")
print(f"Python path updated")

In [None]:
# Import required libraries
import pandas as pd
import json
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns

# Import transform modules
from review_analyzer.transformers.normalize_reviews import normalize_reviews_df
from review_analyzer.transformers.geocode import add_region, add_region_by_city, CITY_REGION_MAPPING
from review_analyzer.transformers.aggregates import build_aggregates
from review_analyzer import config

print("All imports successful!")
print(f"\nTransform components loaded:")
print(f" - normalize_reviews_df (date parsing, field cleaning)")
print(f" - add_region (GeoJSON-based geocoding)")
print(f" - add_region_by_city (city-name fallback)")
print(f" - build_aggregates (pre-computed statistics)")
print(f"\nCity-to-region mapping: {len(CITY_REGION_MAPPING)} cities")

## Test 1: Load Raw Reviews

**What this does:** Loads reviews from the collection step

**Expected input:** Reviews with raw dates, text, and ratings from Step 2 (collection)

**Expected columns:** `text`, `date`, `rating`, `lat`, `lng`, `city`, `business_id`, etc.

In [None]:
# Load reviews from collection step
# Try multiple locations in order of preference
input_file = None

candidates = [
    project_root / "data" / "0_interim" / "collection" / "reviews.parquet",
    project_root / "data" / "0_interim" / "collection" / "reviews.csv",
    project_root / "data" / "0_interim" / "collection" / "bank_reviews.csv",
    project_root / "data" / "output" / "reviews_for_classification.csv",  # Legacy
]

for candidate in candidates:
    if candidate.exists():
        input_file = candidate
        break

if input_file and input_file.exists():
    # Load based on file type
    if input_file.suffix == '.parquet':
        reviews_df = pd.read_parquet(input_file)
    else:
        # Try multiple encodings for CSV files (French text may use Latin-1).
        # 'utf-8-sig' goes first so a byte-order mark is stripped, and
        # 'latin-1' goes last because it never raises UnicodeDecodeError.
        for encoding in ['utf-8-sig', 'utf-8', 'cp1252', 'latin-1']:
            try:
                reviews_df = pd.read_csv(input_file, encoding=encoding)
                print(f" Loaded with encoding: {encoding}")
                break
            except UnicodeDecodeError:
                continue
        else:
            # Fallback: replace undecodable bytes (encoding_errors needs pandas >= 1.3)
            reviews_df = pd.read_csv(input_file, encoding='utf-8', encoding_errors='replace')
            print(" Loaded with fallback encoding (some characters may be replaced)")

    print(" RAW REVIEWS LOADED")
    print("=" * 80)
    print(f" Source: {input_file.relative_to(project_root)}")
    print(f" Total reviews: {len(reviews_df)}")
    print(f" Columns: {list(reviews_df.columns)}")

    # Show data types
    print("\n Data types:")
    for col in ['date', 'rating', 'lat', 'lng', 'city', 'text']:
        if col in reviews_df.columns:
            print(f" {col}: {reviews_df[col].dtype}")

    # Show sample
    print("\n Sample raw data:")
    sample_cols = [c for c in ['date', 'rating', 'city', 'text'] if c in reviews_df.columns]
    display(reviews_df[sample_cols].head())

    # Check for issues
    print("\n Data quality checks:")
    if 'date' in reviews_df.columns:
        french_dates = reviews_df['date'].astype(str).str.contains('il y a', na=False).sum()
        print(f" French relative dates: {french_dates}/{len(reviews_df)} (need parsing)")

    if 'rating' in reviews_df.columns:
        print(f" Rating type: {reviews_df['rating'].dtype} (should be numeric)")
        print(f" Rating range: {reviews_df['rating'].min()} - {reviews_df['rating'].max()}")

    if 'city' in reviews_df.columns:
        cities_with_accents = reviews_df['city'].dropna().str.contains('[éèêëàâäùûüôö]', regex=True, na=False).sum()
        print(f" Cities with accents: {cities_with_accents} (need normalization)")

    print("\n This data needs transformation!")

else:
    print(" No reviews file found!")
    print("\n Tried locations:")
    for candidate in candidates:
        print(f" - {candidate.relative_to(project_root)}")
    print("\n Please run collect_reviews.ipynb first.")

## Test 2: Normalize Review Fields

**What this does:** Cleans and normalizes review data

**Transformations:**
- Parse French relative dates ("il y a X mois" → datetime)
- Convert ratings to Int64 and clip to the 1-5 range
- Normalize city names (remove accents, lowercase)
- Create `created_at` and `month` columns for temporal analysis
- Clean text fields (strip whitespace)

**Expected output:** DataFrame with datetime dates, Int64 ratings, and normalized fields
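
To make the date-parsing step more concrete, here is a minimal sketch of how a French relative date such as "il y a 3 mois" could be turned into an approximate datetime and a month key. It is not the code inside `normalize_reviews_df`; the function names, the day counts per unit (30 days per month, 365 per year), and the `YYYY-MM` month format are all assumptions made for illustration.

```python
import re
from datetime import datetime, timedelta
from typing import Optional

# Approximate length of each French unit in days (months and years are estimates).
_UNIT_DAYS = {
    "jour": 1, "jours": 1,
    "semaine": 7, "semaines": 7,
    "mois": 30,
    "an": 365, "ans": 365,
}
_PATTERN = re.compile(
    r"il y a\s+(une|un|\d+)\s+(jours?|semaines?|mois|ans?)", re.IGNORECASE
)


def parse_french_relative_date(text, now: Optional[datetime] = None) -> Optional[datetime]:
    """Convert 'il y a X <unit>' into an approximate datetime, or None if no match."""
    if not isinstance(text, str):
        return None
    # Scraped text often contains non-breaking spaces; normalize them first.
    cleaned = text.replace("\u00a0", " ").replace("\u202f", " ").strip()
    match = _PATTERN.search(cleaned)
    if match is None:
        return None
    count_raw, unit = match.group(1).lower(), match.group(2).lower()
    count = 1 if count_raw in ("un", "une") else int(count_raw)
    reference = now or datetime.now()
    return reference - timedelta(days=count * _UNIT_DAYS[unit])


def month_key(moment: Optional[datetime]) -> Optional[str]:
    """Return a 'YYYY-MM' string for grouping by month, or None for missing dates."""
    return moment.strftime("%Y-%m") if moment else None
```

Passing a fixed `now` makes the result reproducible: with `now=datetime(2024, 6, 15)`, "il y a 3 mois" resolves to 90 days earlier, which falls in `2024-03`. Values that do not match the pattern come back as `None`, which lines up with the "unparsed dates" count reported by the cell below.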

In [None]:
# Normalize reviews
if input_file and input_file.exists():
    print(" NORMALIZING REVIEW FIELDS")
    print("=" * 80)
    print(f" Processing {len(reviews_df)} reviews...\n")

    # Check date column before normalization
    date_col = 'review_date' if 'review_date' in reviews_df.columns else 'date'
    if date_col in reviews_df.columns:
        print(f" Raw date column: '{date_col}'")
        print(" Sample values (repr to see hidden characters):")
        for i, val in enumerate(reviews_df[date_col].head()):
            print(f" [{i}] {repr(val)}")

    # Apply normalization
    normalized_df = normalize_reviews_df(reviews_df.copy())

    print("\n NORMALIZATION COMPLETE")
    print("=" * 80)

    # Compare before/after
    print("\n New columns added:")
    new_cols = set(normalized_df.columns) - set(reviews_df.columns)
    if new_cols:
        for col in sorted(new_cols):
            print(f" + {col}: {normalized_df[col].dtype}")
    else:
        print(" (none)")

    # Show transformations
    print("\n Field transformations:")

    if 'created_at' in normalized_df.columns:
        parsed_dates = normalized_df['created_at'].notna().sum()
        unparsed_dates = normalized_df['created_at'].isna().sum()
        print(f" Dates parsed: {parsed_dates}/{len(normalized_df)} ({parsed_dates/len(normalized_df)*100:.1f}%)")
        print(f" Type: {normalized_df['created_at'].dtype}")
        if parsed_dates > 0:
            print(f" Range: {normalized_df['created_at'].min()} to {normalized_df['created_at'].max()}")
        if unparsed_dates > 0:
            print(f" Unparsed: {unparsed_dates} dates could not be parsed")
            # Show examples of unparsed dates (only if the raw date column exists)
            if date_col in reviews_df.columns:
                unparsed_examples = reviews_df.loc[normalized_df['created_at'].isna(), date_col].head().tolist()
                if unparsed_examples:
                    print(f" Examples: {unparsed_examples}")
    else:
        print(" created_at NOT created - check date column format!")
        if date_col in reviews_df.columns:
            print(f" Raw date samples: {reviews_df[date_col].head().tolist()}")

    if 'rating' in normalized_df.columns:
        print(f" Ratings normalized: {normalized_df['rating'].dtype}")
        print(f" Range: {normalized_df['rating'].min()} - {normalized_df['rating'].max()} (clipped to 1-5)")

    if 'city_normalized' in normalized_df.columns:
        unique_cities = normalized_df['city_normalized'].nunique()
        print(f" Cities normalized: {unique_cities} unique cities")
        print(f" Sample: {normalized_df['city_normalized'].dropna().unique()[:5].tolist()}")

    if 'month' in normalized_df.columns:
        unique_months = normalized_df['month'].nunique()
        print(f" Months extracted: {unique_months} unique months")
        print(f" Range: {normalized_df['month'].min()} to {normalized_df['month'].max()}")

    # Show sample
    print("\n Sample normalized data:")
    sample_cols = [c for c in ['created_at', 'month', 'rating', 'city_normalized'] if c in normalized_df.columns]
    if sample_cols:
        display(normalized_df[sample_cols].head())

else:
    print(" No data loaded. Run Test 1 first!")

## Test 3: Add Regions (GeoJSON-based)

**What this does:** Assigns Moroccan regions using coordinates and GeoJSON polygons

**Method:**
1. Load `regions.geojson` with Morocco's 12 regions
2. Use lat/lng to find which polygon contains each review
3. Validate that coordinates are within Morocco's bounds

**Regions:** The 12 administrative regions of Morocco

**Expected output:** DataFrame with `region` column added
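
The method above boils down to a bounds check followed by a point-in-polygon test. The sketch below illustrates that idea with a plain ray-casting test so it runs without extra dependencies; it is a stand-in for illustration, not how `add_region` is implemented. The bounding-box values, the function names, and the simplified region format (one outer ring per region, no holes or multipolygons) are all assumptions.

```python
from typing import Dict, Optional, Sequence, Tuple

Point = Tuple[float, float]  # (lng, lat), the coordinate order used by GeoJSON

# Rough bounding box around Morocco; approximate values for illustration only.
MOROCCO_BOUNDS = {"min_lat": 20.0, "max_lat": 36.5, "min_lng": -18.0, "max_lng": -0.5}


def in_bounds(lat: float, lng: float, bounds: dict = MOROCCO_BOUNDS) -> bool:
    """Cheap sanity check that rejects coordinates far outside the country."""
    return (bounds["min_lat"] <= lat <= bounds["max_lat"]
            and bounds["min_lng"] <= lng <= bounds["max_lng"])


def point_in_ring(lng: float, lat: float, ring: Sequence[Point]) -> bool:
    """Ray casting: count how many polygon edges a ray going east would cross."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lng < x_cross:
                inside = not inside
    return inside


def assign_region(lat, lng, regions: Dict[str, Sequence[Point]]) -> Optional[str]:
    """Return the first region whose ring contains the point, else None."""
    if lat is None or lng is None or not in_bounds(lat, lng):
        return None
    for name, ring in regions.items():
        if point_in_ring(lng, lat, ring):
            return name
    return None
```

With two toy squares as regions, a point inside the first square is assigned to it, a point in Paris is rejected by the bounds check, and a point inside the bounding box but outside every polygon returns `None`, which is the case where a city-name fallback becomes useful.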

In [None]:
# Add regions using GeoJSON
if 'normalized_df' in locals():
    print(" ADDING GEOGRAPHIC REGIONS")
    print("=" * 80)

    # Check for regions.geojson
    regions_file = config.REGIONS_FILE

    if regions_file.exists():
        print(f" Using: {regions_file.relative_to(project_root)}\n")

        # Apply geocoding
        geocoded_df = add_region(
            df=normalized_df.copy(),
            regions_path=regions_file
        )

        print("\n GEOCODING COMPLETE")
        print("=" * 80)

        # Statistics
        if 'region' in geocoded_df.columns:
            regions_assigned = geocoded_df['region'].notna().sum()
            regions_missing = geocoded_df['region'].isna().sum()

            print("\n Results:")
            print(f" Regions assigned: {regions_assigned}/{len(geocoded_df)} ({regions_assigned/len(geocoded_df)*100:.1f}%)")
            print(f" Missing regions: {regions_missing}/{len(geocoded_df)} ({regions_missing/len(geocoded_df)*100:.1f}%)")

            # Region distribution (text bar chart, one block per 2% of reviews)
            print("\n Region distribution:")
            region_counts = geocoded_df['region'].value_counts()
            for region, count in region_counts.items():
                percentage = (count / len(geocoded_df)) * 100
                bar = '█' * int(percentage / 2)
                print(f" {region:<30} {count:>5} {bar} {percentage:.1f}%")

            # Show sample
            print("\n Sample with regions:")
            sample_cols = [c for c in ['city', 'lat', 'lng', 'region'] if c in geocoded_df.columns]
            display(geocoded_df[sample_cols].head())

    else:
        print(f" regions.geojson not found at: {regions_file}")
        print(" Using city-name fallback instead...\n")

        # Fallback to city-name mapping
        geocoded_df = add_region_by_city(normalized_df.copy())

        print("\n GEOCODING COMPLETE (City-name fallback)")
        print("=" * 80)

        if 'region' in geocoded_df.columns:
            regions_assigned = geocoded_df['region'].notna().sum()
            print("\n Results:")
            print(f" Regions assigned: {regions_assigned}/{len(geocoded_df)}")
            print("\n Region distribution:")
            region_counts = geocoded_df['region'].value_counts()
            for region, count in region_counts.head(10).items():
                print(f" {region:<30} {count:>5}")

        print("\n For better accuracy, add regions.geojson to data/00_config/cities/")

else:
    print(" No normalized data. Run Test 2 first!")

## Test 4: Visualize Transformations

**What this does:** Creates visualizations showing data quality improvements

**Visualizations:**
- Date distribution (before/after parsing)
- Region coverage map
- Rating distribution
- Temporal trends (reviews per month)

In [None]:
# Visualize transformations
if 'geocoded_df' in locals():
    print(" TRANSFORMATION VISUALIZATIONS")
    print("=" * 80)
    print("\nCreating charts...\n")

    # Set style
    plt.style.use('seaborn-v0_8-whitegrid')

    # 1. Date parsing success
    if 'created_at' in geocoded_df.columns:
        fig, ax = plt.subplots(figsize=(10, 6))

        parsed = geocoded_df['created_at'].notna().sum()
        unparsed = geocoded_df['created_at'].isna().sum()

        ax.bar(['Parsed', 'Unparsed'], [parsed, unparsed], color=['#2ecc71', '#e74c3c'])
        ax.set_title('Date Parsing Results', fontsize=14, fontweight='bold')
        ax.set_ylabel('Number of Reviews', fontsize=12)
        ax.grid(axis='y', alpha=0.3)

        # Add percentage labels just above each bar
        total = len(geocoded_df)
        ax.text(0, parsed, f'{parsed/total*100:.1f}%', ha='center', va='bottom', fontsize=12)
        ax.text(1, unparsed, f'{unparsed/total*100:.1f}%', ha='center', va='bottom', fontsize=12)

        plt.tight_layout()
        plt.show()

    # 2. Region coverage
    if 'region' in geocoded_df.columns:
        fig, ax = plt.subplots(figsize=(12, 6))

        region_counts = geocoded_df['region'].value_counts().sort_values(ascending=True)
        region_counts.plot(kind='barh', ax=ax, color='steelblue')

        ax.set_title('Reviews by Region', fontsize=14, fontweight='bold')
        ax.set_xlabel('Number of Reviews', fontsize=12)
        ax.set_ylabel('Region', fontsize=12)
        ax.grid(axis='x', alpha=0.3)

        plt.tight_layout()
        plt.show()

    # 3. Temporal trends
    if 'month' in geocoded_df.columns:
        fig, ax = plt.subplots(figsize=(12, 6))

        month_counts = geocoded_df['month'].value_counts().sort_index()
        month_counts.plot(kind='line', ax=ax, marker='o', color='coral', linewidth=2)

        ax.set_title('Reviews Over Time', fontsize=14, fontweight='bold')
        ax.set_xlabel('Month', fontsize=12)
        ax.set_ylabel('Number of Reviews', fontsize=12)
        ax.grid(axis='both', alpha=0.3)

        plt.xticks(rotation=45, ha='right')
        plt.tight_layout()
        plt.show()

    # 4. Rating distribution
    if 'rating' in geocoded_df.columns:
        fig, ax = plt.subplots(figsize=(10, 6))

        rating_counts = geocoded_df['rating'].value_counts().sort_index()
        colors = ['#d62728', '#ff7f0e', '#ffbb78', '#98df8a', '#2ca02c']
        rating_counts.plot(kind='bar', ax=ax, color=colors)

        ax.set_title('Rating Distribution (After Normalization)', fontsize=14, fontweight='bold')
        ax.set_xlabel('Rating (Stars)', fontsize=12)
        ax.set_ylabel('Number of Reviews', fontsize=12)
        ax.set_xticklabels([f'{int(x)}' for x in rating_counts.index], rotation=0)
        ax.grid(axis='y', alpha=0.3)

        plt.tight_layout()
        plt.show()

else:
    print(" No geocoded data. Run Test 3 first!")

## Test 5: Save Transformed Data

**What this does:** Saves transformed reviews for the classification step

**Output location:** `data/0_interim/transform/reviews_normalized.parquet`

**Format:** Parquet (efficient, preserves types)

**This data is now ready for classification!**

In [None]:
# Save transformed data
if 'geocoded_df' in locals():
    # Output paths
    output_parquet = project_root / "data" / "0_interim" / "transform" / "reviews_normalized.parquet"
    output_csv = project_root / "data" / "0_interim" / "transform" / "reviews_normalized.csv"

    # Create directories
    output_parquet.parent.mkdir(parents=True, exist_ok=True)

    # Save both formats
    geocoded_df.to_parquet(output_parquet, index=False)
    geocoded_df.to_csv(output_csv, index=False)

    print(" TRANSFORMED DATA SAVED")
    print("=" * 80)
    print(f" Parquet: {output_parquet.relative_to(project_root)}")
    print(f" CSV: {output_csv.relative_to(project_root)}")
    print(f" Records: {len(geocoded_df)}")
    print(f" Columns: {len(geocoded_df.columns)}")

    # Summary of transformations
    print("\n TRANSFORMATION SUMMARY:")

    if 'created_at' in geocoded_df.columns:
        parsed = geocoded_df['created_at'].notna().sum()
        print("\n Dates:")
        print(f" Parsed: {parsed}/{len(geocoded_df)} ({parsed/len(geocoded_df)*100:.1f}%)")

    if 'rating' in geocoded_df.columns:
        print("\n Ratings:")
        print(f" Type: {geocoded_df['rating'].dtype} (expected: Int64)")
        print(f" Range: {geocoded_df['rating'].min()} - {geocoded_df['rating'].max()}")
        print(f" Average: {geocoded_df['rating'].mean():.2f}")

    if 'region' in geocoded_df.columns:
        regions_assigned = geocoded_df['region'].notna().sum()
        print("\n Regions:")
        print(f" Assigned: {regions_assigned}/{len(geocoded_df)} ({regions_assigned/len(geocoded_df)*100:.1f}%)")
        print(f" Unique regions: {geocoded_df['region'].nunique()}")

    if 'city_normalized' in geocoded_df.columns:
        print("\n Cities:")
        print(f" Normalized: {geocoded_df['city_normalized'].nunique()} unique cities")

    print("\n Data saved to: data/0_interim/transform/")
    print("\n Next: Classify reviews (classify_reviews.ipynb)")
    print(" The classification notebook will automatically load this transformed data!")

else:
    print(" No data to save. Run previous tests first!")

## Test 6: Build Aggregates (Optional)

**What this does:** Pre-computes statistics for dashboards and reports

**Aggregation levels:**
- By business
- By city
- By region
- By month

**Metrics computed:**
- Review count
- Average rating
- Sentiment distribution (if classified)
- Category counts (if classified)

**Note:** This step is most useful after classification!
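
To show what a single aggregation level amounts to, here is a minimal dependency-free sketch that groups review records by one key and computes the review count and average rating. It is not the `build_aggregates` implementation; the function name `aggregate_by`, the dictionary-based record format, and the output field names are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Optional


def aggregate_by(reviews: Iterable[dict], key: str) -> Dict[str, dict]:
    """Group reviews by `key` and compute count and average rating per group."""
    groups = defaultdict(list)
    for review in reviews:
        group = review.get(key)
        if group is None:
            continue  # skip reviews that lack the grouping value
        groups[group].append(review.get("rating"))

    result = {}
    for group, ratings in groups.items():
        # Missing ratings still count as reviews but are left out of the average.
        valid = [r for r in ratings if r is not None]
        avg: Optional[float] = round(mean(valid), 2) if valid else None
        result[group] = {"review_count": len(ratings), "avg_rating": avg}
    return result
```

Calling `aggregate_by(reviews, "city")` and then `aggregate_by(reviews, "region")` on the same records mirrors the idea of producing one table per aggregation level. Sentiment and category metrics would follow the same grouping pattern once those columns exist after classification.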

In [None]:
# Build aggregates (optional - more useful after classification)
if 'geocoded_df' in locals():
    print(" BUILDING AGGREGATES")
    print("=" * 80)
    print(" Note: More metrics available after classification!\n")

    # Output directory
    output_dir = project_root / "data" / "0_processed" / "transform"
    output_dir.mkdir(parents=True, exist_ok=True)

    # Build aggregates
    try:
        aggregate_files = build_aggregates(
            df_labeled=geocoded_df,
            output_dir=output_dir,
            date_suffix=True
        )

        print("\n AGGREGATES COMPLETE")
        print("=" * 80)
        print("\n Files created:")
        for key, path in aggregate_files.items():
            rel_path = Path(path).relative_to(project_root)
            print(f" {key}: {rel_path}")

        # Show sample from one aggregate
        if 'by_business.parquet' in aggregate_files:
            agg_df = pd.read_parquet(aggregate_files['by_business.parquet'])
            print("\n Sample: Reviews by Business")
            display(agg_df.head())

        print("\n These aggregates will be more informative after classification!")
        print(" Run classify_reviews.ipynb, then re-run this step.")

    except Exception as e:
        print(f"\n Error building aggregates: {e}")
        print(" This is normal if data hasn't been classified yet.")
        print(" Aggregates work best with sentiment and category data.")

else:
    print(" No data available. Run previous tests first!")

## Summary for New Developers

**What you learned:**

1. **Transform Pipeline** - The middle step between collection and classification
2. **Normalization** - Converting raw data into a clean, typed, analysis-ready format
3. **Geocoding** - Adding regional context using GeoJSON polygons or city names
4. **Aggregation** - Pre-computing statistics for faster dashboards
5. **Data Flow** - How data moves through the pipeline stages

**Key takeaways:**
- Transform step is **optional but highly recommended**
- Normalization **improves data quality** (dates, ratings, text)
- Regions enable **geographic analysis** (heatmaps, regional comparisons)
- Aggregates **speed up dashboards** (pre-computed metrics)
- Parquet format **preserves types** and is more efficient than CSV
- Fallback mechanisms ensure **graceful degradation** (GeoJSON → city names)

**Transform Components:**

1. **normalize_reviews_df()** - Main normalization function
   - Parses French relative dates ("il y a X mois" → datetime)
   - Converts ratings to Int64, clips to the 1-5 range
   - Normalizes city names (removes accents, lowercase)
   - Extracts month for temporal analysis
   - Cleans text fields (strips whitespace)

2. **add_region()** - GeoJSON-based geocoding
   - Loads regions.geojson with Morocco's 12 regions
   - Uses shapely to check if coordinates fall within polygons
   - Validates coordinates are within Morocco bounds
   - Falls back to city-name mapping if GeoJSON unavailable

3. **add_region_by_city()** - City-name fallback
   - Uses CITY_REGION_MAPPING (cities → regions)
   - Case-insensitive matching
   - Works when coordinates are unavailable

4. **build_aggregates()** - Pre-computed statistics
   - Multiple aggregation levels (business, city, region, month)
   - Smart column detection (works with/without classification)
   - Saves separate CSV for each level
   - More useful after classification (includes sentiment/categories)
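
The fallback behaviour described for the geocoding components can be pictured as a two-stage lookup: keep the polygon-based region when there is one, otherwise try a case-insensitive city-name match. The sketch below is only an illustration of that pattern. `EXAMPLE_CITY_REGIONS` is a tiny hypothetical subset and not the contents of `CITY_REGION_MAPPING`, and `region_with_fallback` is a made-up helper name.

```python
import unicodedata
from typing import Dict, Optional

# Illustrative subset only; the real CITY_REGION_MAPPING lives in geocode.py.
EXAMPLE_CITY_REGIONS: Dict[str, str] = {
    "casablanca": "Casablanca-Settat",
    "rabat": "Rabat-Salé-Kénitra",
    "fes": "Fès-Meknès",
}


def _city_key(city: str) -> str:
    """Lowercase and strip accents so 'FÈS' and 'fes' map to the same key."""
    decomposed = unicodedata.normalize("NFKD", city.strip())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).lower()


def region_with_fallback(
    polygon_region: Optional[str],
    city: Optional[str],
    mapping: Dict[str, str] = EXAMPLE_CITY_REGIONS,
) -> Optional[str]:
    """Prefer the polygon-based region; otherwise look the city up by name."""
    if polygon_region:
        return polygon_region
    if not city:
        return None
    return mapping.get(_city_key(city))
```

A review with no usable coordinates but a city of "  FÈS " would still resolve to "Fès-Meknès", while an unknown city stays `None` so the gap remains visible in the "missing regions" statistics.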

**Data Flow (Complete Pipeline):**
```
discover  →  collect    →  TRANSFORM  →  classify        →  analyze
   ↓            ↓              ↓             ↓                 ↓
0_raw/       0_interim/     0_interim/    0_processed/      0_analysis/
discovery    collection     transform     classification    reports
```

**This notebook processes:**
- **Input**: `data/0_interim/collection/reviews.parquet` (from collect step)
- **Output**: `data/0_interim/transform/reviews_normalized.parquet`
- **Aggregates**: `data/0_processed/transform/aggregates_*.csv`

**Next steps:**
1. You've discovered place_ids (Step 1: discover_placeids.ipynb)
2. You've collected reviews (Step 2: collect_reviews.ipynb)
3. You've transformed reviews (Step 2.5: transform_reviews.ipynb - this notebook!)
4. Classify reviews to extract sentiments/topics (Step 3: classify_reviews.ipynb)
5. Analyze results, create reports, build dashboards!

**Pipeline Command (Run all steps):**
```bash
python -m review_analyzer.main pipeline \
 --businesses "Attijariwafa Bank" \
 --cities "Casablanca" \
 --business-type "bank"
```

**Or step by step:**
```bash
# 1. Discover
python -m review_analyzer.main discover --businesses "Bank" --cities "City"

# 2. Collect
python -m review_analyzer.main collect --input agencies.csv --mode csv

# 3. Transform (this step!)
python -m review_analyzer.main transform --regions regions.geojson

# 4. Classify
python -m review_analyzer.main classify --input reviews.csv --wide-format
```

**Troubleshooting:**
- No input data? Run collect_reviews.ipynb first
- Missing regions? Add regions.geojson to data/00_config/cities/ or use city-name fallback
- Aggregate errors? Normal if data not classified yet - run classify_reviews.ipynb first
- Date parsing issues? French relative dates are auto-detected ("il y a X mois/jours/ans")

Happy transforming! 