# 🏢 Hamburg Branchenbuch - Category Filtering Workflow

## 📋 Quick Start Guide

Run cells in order:

1. **Cells 1-2**: Load Branchenbuch data (39,507 companies)
2. **Cell 3**: View all categories *(optional, for reference)*
3. **Cells 4-7**: Load Excel file and extract categories marked for deletion
4. **Cell 8**: Apply filter to dataset
5. **Cell 9**: Save filtered dataset to CSV

---

## 📝 Process

This notebook filters the Hamburg Branchenbuch dataset by removing categories marked in your Excel file (`20_10_2025_Branchenbuch_Categories_to_be_deleted.xlsx`).

The Excel file should have:
- **Column 1**: Category names (a trailing number after the name is removed automatically)
- **Column 2**: Mark with "x" to delete that category
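
The two-column layout described above can be sketched with a small in-memory table. This is only an illustration: the column header `Category`, the three sample rows, and the extra whitespace stripping are assumptions, not part of the real Excel file or of the notebook's own code.

```python
import pandas as pd

# Hypothetical stand-in for the Excel sheet: the first column holds category
# names (each followed by a leftover number), the second column holds an "x"
# for every category that should be deleted.
sheet = pd.DataFrame(
    {
        "Category": ["AIDS Hilfe 2", "Abbruchunternehmen 4", "Abendkleider 5"],
        "To be deleted": ["x", None, "X"],
    }
)

# Columns are addressed by position, so the actual header text does not matter.
# Assumption: the marker is compared case-insensitively and surrounding
# whitespace is ignored (the notebook itself only lower-cases the marker).
marker = sheet.iloc[:, 1].astype(str).str.strip().str.lower()
marked = sheet.loc[marker == "x", sheet.columns[0]].tolist()

print(marked)
```

Rows without a marker (empty cells) are simply skipped, so only categories explicitly marked with "x" end up in the removal list.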

---


In [1]:
import pandas as pd

In [2]:
df_hamburg_branchenbuch = pd.read_csv('hamburg_branchenbuch_companies_details_from_map_20250930_162445_with_websites.csv')

In [3]:
# See all unique categories (sorted alphabetically)
unique_categories = sorted(df_hamburg_branchenbuch['category'].unique())
print(f"Total unique categories: {len(unique_categories)}\n")

# Print all categories, one per line
for i, cat in enumerate(unique_categories, 1):
    print(f"{i:3d}. {cat}")
    
# Alternative: See categories with their counts
print("\n" + "="*50)
print("Categories with counts:")
print("="*50)
df_hamburg_branchenbuch['category'].value_counts()

Total unique categories: 554

  1. AIDS Hilfe
  2. AVGS Coaching
  3. Abbruchunternehmen
  4. Abendkleider
  5. Abendschule
  6. Abrechnungsstelle
  7. Abschleppdienst
  8. Accor Hotel
  9. Adidas
 10. Afghanisches Restaurant
 11. Afrikanisches Restaurant
 12. Aktenlagerung
 13. Aktenvernichtung
 14. Akupunktur
 15. Alarmanlagen
 16. Alessi
 17. Alfa Romeo
 18. All you can eat
 19. Allergologe
 20. Allgemeinmediziner / Hausarzt
 21. Altbausanierung
 22. Altenheime
 23. Alternative Heilmethoden
 24. Ambulantes OP-Zentrum
 25. Amerikanisches Restaurant
 26. An- und Verkauf
 27. Angelladen
 28. Angelverein
 29. Anhängerverleih
 30. Anlageimmobilie
 31. Anlagenbau
 32. Antiquariat
 33. Antiquitäten
 34. Anwalt Arbeitsrecht
 35. Anwalt Bankrecht
 36. Anwalt Baurecht
 37. Anwalt Erbrecht
 38. Anwalt Familienrecht
 39. Anwalt Gewerblicher Rechtsschutz
 40. Anwalt IT Recht
 41. Anwalt Immobilienrecht
 42. Anwalt Insolvenzrecht
 43. Anwalt Markenrecht
 44. Anwalt Medizinrecht
 45. Anwalt Miet

category
Handel & Shopping          3062
Auto & Verkehr             2558
Gesellschaft & Soziales    1963
Arzt                       1906
Gesundheit & Medizin       1731
                           ... 
Ballonfahrt                   1
Musikschule                   1
Nähkurs                       1
Inneneinrichtung              1
Stadtführung                  1
Name: count, Length: 554, dtype: int64

---

## 📂 Step 1: Load Categories from Excel File

Load the categories marked for deletion from `20_10_2025_Branchenbuch_Categories_to_be_deleted.xlsx`.


In [4]:
import re

# Read the deletion list from Excel/CSV
deletion_file = '20_10_2025_Branchenbuch_Categories_to_be_deleted.xlsx'

# Try reading as Excel first; if that fails, fall back to CSV
try:
    df_deletion = pd.read_excel(deletion_file, header=0)
    print("✅ Loaded Excel file")
except Exception:
    # Fall back to a semicolon-separated CSV with the same base name
    df_deletion = pd.read_csv(deletion_file.replace('.xlsx', '.csv'), sep=';', header=0)
    print("✅ Loaded CSV file")

print(f"Loaded {len(df_deletion)} rows\n")
print("First few rows:")
print(df_deletion.head(10))


✅ Loaded Excel file
Loaded 554 rows

First few rows:
  Total unique categories: 554 1 To be deleted
0                   AIDS Hilfe 2             x
1                AVGS Coaching 3             x
2           Abbruchunternehmen 4           NaN
3                 Abendkleider 5             x
4                  Abendschule 6             x
5            Abrechnungsstelle 7             x
6              Abschleppdienst 8             x
7                  Accor Hotel 9             x
8                      Adidas 10             x
9     Afghanisches Restaurant 11             x


In [5]:
# Extract categories marked for deletion
# Column 1: Category name with trailing number (e.g., " AIDS Hilfe 2")
# Column 2: "x" marks categories to delete

# Get column names
col_category = df_deletion.columns[0]  # First column
col_to_delete = df_deletion.columns[1]  # Second column

print(f"Category column: '{col_category}'")
print(f"Deletion marker column: '{col_to_delete}'")
print()

# Filter rows where second column contains 'x'
df_marked_for_deletion = df_deletion[df_deletion[col_to_delete].astype(str).str.lower() == 'x'].copy()

print(f"Found {len(df_marked_for_deletion)} categories marked for deletion\n")

# Clean the category names: remove trailing numbers (e.g., " AIDS Hilfe 2" -> "AIDS Hilfe")
def clean_category_name(cat_str):
    """Remove leading/trailing spaces and trailing numbers from category names"""
    if pd.isna(cat_str):
        return ""
    
    # Convert to string and strip whitespace
    cat_str = str(cat_str).strip()
    
    # Remove trailing pattern like " 2", " 123", etc.
    # Pattern: one or more whitespace characters + one or more digits at the end
    cat_str = re.sub(r'\s+\d+$', '', cat_str)
    
    return cat_str

df_marked_for_deletion['cleaned_category'] = df_marked_for_deletion[col_category].apply(clean_category_name)

# Create the list of categories to remove
categories_to_remove = df_marked_for_deletion['cleaned_category'].tolist()

# Remove any empty strings
categories_to_remove = [cat for cat in categories_to_remove if cat]

print(f"✅ Extracted {len(categories_to_remove)} categories to remove\n")
print("First 20 categories to be removed:")
for i, cat in enumerate(categories_to_remove[:20], 1):
    count = len(df_hamburg_branchenbuch[df_hamburg_branchenbuch['category'] == cat])
    if count > 0:
        print(f"  {i:2d}. {cat} ({count:,} companies)")
    else:
        print(f"  {i:2d}. {cat} ⚠️ NOT FOUND in dataset")


Category column: 'Total unique categories: 554 1'
Deletion marker column: 'To be deleted'

Found 466 categories marked for deletion

✅ Extracted 466 categories to remove

First 20 categories to be removed:
   1. AIDS Hilfe (4 companies)
   2. AVGS Coaching (24 companies)
   3. Abendkleider (43 companies)
   4. Abendschule (2 companies)
   5. Abrechnungsstelle (13 companies)
   6. Abschleppdienst (4 companies)
   7. Accor Hotel (11 companies)
   8. Adidas (8 companies)
   9. Afghanisches Restaurant (5 companies)
  10. Afrikanisches Restaurant (3 companies)
  11. Aktenlagerung (4 companies)
  12. Aktenvernichtung (4 companies)
  13. Akupunktur (39 companies)
  14. Alarmanlagen (35 companies)
  15. Alessi (7 companies)
  16. Alfa Romeo (4 companies)
  17. All you can eat (6 companies)
  18. Allergologe (3 companies)
  19. Allgemeinmediziner / Hausarzt (1 companies)
  20. Altenheime (73 companies)
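
The cleaning step above can be checked in isolation. The following is a minimal sketch using the same regular expression as `clean_category_name`; the helper name `strip_trailing_number` and the sample `"Formel 1"` are hypothetical and only serve to show an edge case.

```python
import re


def strip_trailing_number(name: str) -> str:
    """Strip surrounding whitespace, then a trailing ' <digits>' suffix."""
    return re.sub(r"\s+\d+$", "", name.strip())


samples = [
    " AIDS Hilfe 2",
    "Adidas 10",
    "Allgemeinmediziner / Hausarzt 21",
    "Formel 1",  # hypothetical category whose real name ends in a number
]
cleaned = [strip_trailing_number(s) for s in samples]

print(cleaned)
# ['AIDS Hilfe', 'Adidas', 'Allgemeinmediziner / Hausarzt', 'Formel']
```

The last sample shows the limitation of this approach: a category whose genuine name ends in a number would also lose that number. In this run every cleaned name matched a category in the dataset, so the case did not occur.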


In [6]:
# Verify: Show summary statistics
print("=" * 70)
print("SUMMARY OF CATEGORIES TO REMOVE")
print("=" * 70)

total_companies_to_remove = 0
categories_found = []
categories_not_found = []

for cat in categories_to_remove:
    count = len(df_hamburg_branchenbuch[df_hamburg_branchenbuch['category'] == cat])
    if count > 0:
        categories_found.append((cat, count))
        total_companies_to_remove += count
    else:
        categories_not_found.append(cat)

print(f"✅ Categories found in dataset: {len(categories_found)}")
print(f"⚠️  Categories NOT found: {len(categories_not_found)}")
print(f"📊 Total companies to be removed: {total_companies_to_remove:,}")
print(f"📊 Percentage of dataset: {total_companies_to_remove/len(df_hamburg_branchenbuch)*100:.1f}%")

if categories_not_found:
    print(f"\n⚠️  Categories not found in dataset (typos?):")
    for cat in categories_not_found[:10]:  # Show first 10
        print(f"   - '{cat}'")


SUMMARY OF CATEGORIES TO REMOVE
✅ Categories found in dataset: 466
⚠️  Categories NOT found: 0
📊 Total companies to be removed: 28,529
📊 Percentage of dataset: 72.2%
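
The same summary can also be computed without a Python loop. The sketch below is an alternative to the cell above, not the notebook's own method; the tiny DataFrame and the entry `"Nicht vorhanden"` are made-up stand-ins for the real data.

```python
import pandas as pd

# Made-up stand-ins for df_hamburg_branchenbuch and categories_to_remove
df = pd.DataFrame({"category": ["Arzt", "Arzt", "Friseur", "Anlagenbau"]})
to_remove = ["Arzt", "Friseur", "Nicht vorhanden"]

# Count every category once, then look up the removal list in that table;
# categories missing from the dataset get a count of 0.
counts = df["category"].value_counts()
per_category = counts.reindex(to_remove, fill_value=0)

found = per_category[per_category > 0]
not_found = per_category[per_category == 0].index.tolist()
total_to_remove = int(found.sum())

print(f"Found: {len(found)}, not found: {not_found}, companies: {total_to_remove}")
```

Counting once with `value_counts` avoids filtering the full DataFrame once per category, which matters little for 466 categories but keeps the cell short.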


In [7]:
# OPTIONAL: Show ALL categories that will be removed (sorted by company count)
print("=" * 70)
print("ALL CATEGORIES TO BE REMOVED (sorted by number of companies)")
print("=" * 70)

categories_sorted = sorted(categories_found, key=lambda x: x[1], reverse=True)

for i, (cat, count) in enumerate(categories_sorted, 1):
    print(f"{i:3d}. {cat:50s} {count:5,} companies")

print()
print(f"Total: {len(categories_sorted)} categories, {total_companies_to_remove:,} companies")


ALL CATEGORIES TO BE REMOVED (sorted by number of companies)
  1. Handel & Shopping                                  3,062 companies
  2. Gesellschaft & Soziales                            1,963 companies
  3. Arzt                                               1,906 companies
  4. Gesundheit & Medizin                               1,731 companies
  5. Service & Dienstleistung                           1,616 companies
  6. Essen & Trinken                                    1,571 companies
  7. Alternative Heilmethoden                             968 companies
  8. Grafik / Design / Fotografie                         575 companies
  9. Reise & Übernachtung                                 564 companies
 10. EDV-Beratung / Software                              399 companies
 11. Bäckerei                                             397 companies
 12. Friseur                                              323 companies
 13. Marketingberatung                                    306 companies
 

---

## 📊 Step 2: Filter the Dataset

Apply the filter using the categories loaded from Excel.

In [8]:
# Show statistics BEFORE filtering
print("=" * 60)
print("BEFORE FILTERING:")
print("=" * 60)
print(f"Total companies: {len(df_hamburg_branchenbuch):,}")
print(f"Total categories: {df_hamburg_branchenbuch['category'].nunique()}")
print()

# Filter: Keep only rows where category is NOT in the removal list
df_filtered = df_hamburg_branchenbuch[~df_hamburg_branchenbuch['category'].isin(categories_to_remove)].copy()

# Show statistics AFTER filtering
print("=" * 60)
print("AFTER FILTERING:")
print("=" * 60)
print(f"Total companies: {len(df_filtered):,}")
print(f"Total categories: {df_filtered['category'].nunique()}")
print()

# Show what was removed
companies_removed = len(df_hamburg_branchenbuch) - len(df_filtered)
print("=" * 60)
print("SUMMARY:")
print("=" * 60)
print(f"✅ Companies removed: {companies_removed:,} ({companies_removed/len(df_hamburg_branchenbuch)*100:.1f}%)")
print(f"✅ Companies remaining: {len(df_filtered):,} ({len(df_filtered)/len(df_hamburg_branchenbuch)*100:.1f}%)")
print(f"✅ Categories removed: {len(categories_to_remove)}")


BEFORE FILTERING:
Total companies: 39,507
Total categories: 554

AFTER FILTERING:
Total companies: 10,978
Total categories: 88

SUMMARY:
✅ Companies removed: 28,529 (72.2%)
✅ Companies remaining: 10,978 (27.8%)
✅ Categories removed: 466
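
The numbers above are consistent (39,507 - 28,529 = 10,978 companies and 554 - 466 = 88 categories). A quick programmatic check along these lines could be added after the filter; this is a sketch on made-up data, and the variable names are illustrative rather than taken from the notebook.

```python
import pandas as pd

# Made-up stand-ins for the full dataset and the removal list
df = pd.DataFrame(
    {
        "name": ["A", "B", "C", "D"],
        "category": ["Arzt", "Anlagenbau", "Friseur", "Anlagenbau"],
    }
)
to_remove = ["Arzt", "Friseur"]

# Same filtering technique as in the notebook: keep rows NOT in the list
mask = df["category"].isin(to_remove)
filtered = df[~mask].copy()

# 1) No removed category survives in the filtered data
assert not filtered["category"].isin(to_remove).any()
# 2) Removed rows plus remaining rows add up to the original row count
assert len(filtered) + int(mask.sum()) == len(df)

print(f"{len(filtered)} rows kept, {int(mask.sum())} rows removed")
```

Note that `isin` returns `False` for missing values, so rows with an empty category are kept by this kind of filter.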


---

## 💾 Step 3: Save the Filtered Dataset


In [9]:
# Save the filtered dataset
output_filename = 'hamburg_branchenbuch_filtered_categories.csv'
df_filtered.to_csv(output_filename, index=False)
print(f"✅ Filtered dataset saved to: {output_filename}")
print(f"   {len(df_filtered):,} companies across {df_filtered['category'].nunique()} categories")


✅ Filtered dataset saved to: hamburg_branchenbuch_filtered_categories.csv
   10,978 companies across 88 categories
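
Because the data contains German umlauts, it can be worth reading the saved file back to confirm that nothing was lost. The sketch below is an optional check on made-up data. The `utf-8-sig` encoding is an assumption, not what the notebook uses: it writes a byte order mark, which typically helps spreadsheet programs such as Excel detect UTF-8.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up stand-in for df_filtered, including umlauts
filtered = pd.DataFrame(
    {"name": ["Bäckerei Müller"], "category": ["Anlagenbau"]}
)

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name, written to a temporary directory
    path = Path(tmp) / "filtered_example.csv"

    # Assumption: write with a UTF-8 byte order mark
    filtered.to_csv(path, index=False, encoding="utf-8-sig")

    # Read the file back with the same encoding
    reloaded = pd.read_csv(path, encoding="utf-8-sig")

# The reloaded rows should match what was written
assert reloaded.to_dict("records") == filtered.to_dict("records")
print(reloaded)
```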
