# 🏙️ **Cleaning Municipalities Dataset**
### **Ironhack Data Science and Machine Learning Bootcamp**
📅 **Date:** February 10, 2025  
📁 **Notebook:** `clean_municipalities.ipynb`  
👩‍💻 **Authors:** Ginosca Alejandro Dávila & Natanael Santiago Morales  

---

## **📌 Project Overview**
This notebook is part of **The Hitchhiker’s Guide to Puerto Rico**, a **travel planning chatbot** designed to recommend **landmarks, municipalities, and points of interest** in Puerto Rico. The chatbot integrates **historical news, weather forecasts, and user preferences** to enhance recommendations.

This notebook focuses on **cleaning and structuring the Municipalities dataset**, extracted from Wikipedia, ensuring the chatbot can effectively use it for recommendations.

### **🔹 Dataset Usage**
The cleaned dataset will be used for:
- ✅ **Chatbot Responses** – Providing structured information about municipalities.
- ✅ **Location-Based Filtering** – Helping users refine their travel preferences.
- ✅ **Geographical Itinerary Planning** – Structuring trips based on user input.
- ✅ **Map-Based Visualization** – Displaying municipalities interactively.
- ✅ **Enhancing Landmark Recommendations** – Providing regional context for chatbot suggestions.

---

## **📂 Dataset Description**
- **Source:** Wikipedia articles saved as `.txt` files (each file holds the raw HTML of one page).  
- **Format:** `.zip` file containing `.txt` files (each representing a municipality).  
- **Location:**  
  📁 `My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/data/municipalities.zip`  

### **🔹 Key Fields in the Dataset**
| Field | Required? | Purpose |
|---|---|---|
| 🌆 Municipality Name | ✅ Yes | Used for location filtering & chatbot queries. |
| 🌎 Coordinates (Latitude, Longitude) | ✅ Yes | Useful for mapping & distance calculations. |
| 📝 Brief Summary / Description | ✅ Yes | Provides essential information for chatbot responses. |
| 🏛 Historical Significance (if mentioned) | ⚠️ Optional | Enhances chatbot responses with historical context. |
| 🎭 Notable Attractions (if available) | ⚠️ Optional | Helps users discover places of interest within municipalities. |

---

## **🛠️ What This Notebook Does**
✔ **Step 1:** Extract and inspect raw text files.  
✔ **Step 2:** Remove unnecessary **HTML tags, metadata, and special characters**.  
✔ **Step 3:** Extract structured information, including:  
   - Municipality names  
   - Geographical coordinates  
   - Brief descriptions  

✔ **Step 4:** Store cleaned data in a structured format (**CSV/JSON**) for later use.

---

## **💾 Project Structure**
📁 `My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/` → **Main project folder**  
📄 `clean_municipalities.ipynb` → **Notebook for cleaning municipalities data**  
📁 `data/` → **Raw dataset (`municipalities.zip`)**  
📁 `cleaned data/cleaned municipalities data/` → **Stores the processed municipalities dataset**  

---

🔹 **Let’s clean the municipalities dataset and prepare it for analysis! 🚀**


## 🔗 Mounting Google Drive

Since our dataset is stored in **Google Drive**, we need to **mount Google Drive** to access the project folder.

This will allow us to later extract the `municipalities.zip` file and inspect its contents.


In [1]:
from google.colab import drive

# 🔹 Mount Google Drive
drive.mount('/content/drive')


Mounted at /content/drive


## 📂 Extracting the Municipalities Dataset

Now that Google Drive is mounted, we will:

✔ Locate the `municipalities.zip` file inside the **data folder**.  
✔ Extract its contents inside the **same `data` folder**.  
✔ List the extracted `.txt` files to verify successful extraction.

This will allow us to inspect the raw dataset before cleaning it.
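
Before extracting, it can help to check what the archive contains. The sketch below lists the `.txt` entries of a ZIP file without unpacking it. The helper name `list_txt_members` and the in-memory sample archive are assumptions made for illustration; neither is part of the project.

```python
import io
import zipfile

def list_txt_members(zip_source):
    """Return the sorted names of .txt entries in a ZIP archive without extracting it."""
    with zipfile.ZipFile(zip_source, "r") as archive:
        return sorted(
            name for name in archive.namelist()
            if name.lower().endswith(".txt")
        )

# Build a tiny in-memory archive so the sketch runs anywhere (contents are made up)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("municipalities/Aguada.txt", "placeholder")
    archive.writestr("municipalities/Adjuntas.txt", "placeholder")
    archive.writestr("municipalities/notes.md", "placeholder")

buffer.seek(0)
print(list_txt_members(buffer))
```

With the real dataset, `zip_source` would be the path to `municipalities.zip` instead of the in-memory buffer.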


In [None]:
import zipfile
import os

# 🔹 Define paths
data_folder = "/content/drive/My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/data"
zip_path = os.path.join(data_folder, "municipalities.zip")
extract_path = data_folder  # Extract directly inside 'data' folder
extracted_folder = os.path.join(data_folder, "municipalities")  # The expected extracted folder

# 🔹 Check if the folder is already extracted
if os.path.exists(extracted_folder) and len(os.listdir(extracted_folder)) > 0:
    print("✅ The 'municipalities' folder already exists. Skipping extraction.")
else:
    # Extract ZIP file
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(extract_path)

    # Verify extraction
    if os.path.exists(extracted_folder):
        files = os.listdir(extracted_folder)
        print(f"✅ Extraction successful! Total files extracted: {len(files)}")
        print("Sample files:", files[:10])  # Show first 10 files
    else:
        print("⚠️ Extraction failed. Check the file paths.")


✅ Extraction successful! Total files extracted: 78
Sample files: ['Adjuntas.txt', 'Aguada.txt', 'Aguadilla.txt', 'Aguas Buenas.txt', 'Aibonito.txt', 'Añasco.txt', 'Arecibo.txt', 'Arroyo.txt', 'Barceloneta.txt', 'Barranquitas.txt']


## 📄 Previewing Full Municipality Files

Before deciding on the cleaning steps, we need to **fully inspect** some sample `.txt` files.

This will help us:

✔ Understand how the data is structured.  
✔ Identify **unnecessary metadata, scripts, or unwanted content**.  
✔ Determine whether **all files follow the same structure** or if different cleaning steps are needed.  

We will preview **a fixed set of files** to ensure that if we re-run the notebook after a runtime disconnect, we get the **same output** for better decision-making. 🚀


In [None]:
# 🔹 Re-list files in case of runtime reset (sorted, because os.listdir returns entries in arbitrary order)
files = sorted(os.listdir(extracted_folder))

# 🔹 Select a fixed set of sample files (first 5 alphabetically)
sample_files = files[:5]  # Same five files on every run

# 🔹 Preview full content of selected files
for i, file in enumerate(sample_files, start=1):
    file_path = os.path.join(extracted_folder, file)

    # Read content
    with open(file_path, "r", encoding="utf-8") as f:
        content = f.read()

    # Print full content
    print(f"📂 Full Preview of File {i}: {file}")
    print("=" * 80)
    print(content)  # Display full content
    print("=" * 80)
    print("\n")  # Space between previews


📂 Full Preview of File 1: Adjuntas.txt
b'<!DOCTYPE html>\n<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-enabled skin-theme-clientpref-day vector-toc-available" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8">\n<title>Adjuntas, Puerto Rico - Wikipedia</title>\n<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disable

### **📌 File Type and Content Overview**

The dataset consists of **`.txt` files**, each containing raw **HTML pages**. These files appear to be **webpage dumps** from Wikipedia, including metadata, embedded scripts, and text content.

A review of the files shows:
- They start with an **HTML document structure** (`<!DOCTYPE html>`), wrapped in a Python bytes literal (`b'...'`) in which newlines appear as literal `\n`.
- They contain **metadata, JavaScript, and CSS links**.
- The main text content is embedded within HTML tags.

This confirms that **text extraction and cleaning** will be necessary before using the dataset for analysis.
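
If each file really is the text representation of a Python `bytes` object, as the `b'` prefix in the preview suggests, one possible way to recover the HTML is `ast.literal_eval` followed by UTF-8 decoding. This is a minimal sketch under that assumption; `decode_bytes_literal` is a hypothetical helper, the sample string is made up, and the notebook itself does not take this route.

```python
import ast

def decode_bytes_literal(raw_text):
    """Return real HTML text if raw_text looks like the repr of a bytes object.

    Hypothetical helper: assumes the whole file is a single b'...' literal
    and returns the input unchanged when that assumption fails.
    """
    stripped = raw_text.strip()
    if not (stripped.startswith("b'") or stripped.startswith('b"')):
        return raw_text
    try:
        value = ast.literal_eval(stripped)
    except (ValueError, SyntaxError):
        return raw_text  # Truncated or malformed literal
    if isinstance(value, bytes):
        return value.decode("utf-8", errors="replace")
    return raw_text

# Made-up sample that mimics the structure seen in the preview
sample = r"b'<p>A\xc3\xb1asco is a town.</p>\n'"
decoded = decode_bytes_literal(sample)
print(decoded)
```

Decoding at this stage would turn escaped sequences such as `\xc3\xb1` into `ñ` before any text is extracted, at the cost of parsing each file as a Python literal first.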

---

### **📌 Preview of the First Five Files**

The previous output displays the raw contents of the first five municipality text files. These files contain **full HTML pages**, typically including metadata, infoboxes, and structured data.

🚀 **Next Step:** Extract relevant information directly from the original files (**raw HTML**), focusing on key details such as:

- **Municipality Name**
- **Coordinates (Latitude, Longitude)**
- **Wikipedia URL**
- **Brief Description of the Municipality** (from the first paragraph)
- **Notable Attractions** (landmarks, parks, cultural sites, if available)
- **Historical Significance**

Once this information has been **extracted**, we will move on to the **cleaning phase** to tidy the extracted text.

📌 **Current Focus:** Extracting structured information **directly from raw HTML** before moving on to text cleaning.


## 🏷️ Extracting Municipality Names

Now that we have previewed the structure of the municipality files, the next step is to **extract the municipality names**.

✔ Each file represents a **specific municipality** and its content is extracted from **Wikipedia**.  
✔ The **municipality name** is found in **the file name**.  
✔ We will extract **the file name as the municipality name**, ensuring consistency.

### 🔹 **How We Will Store This Data**
- The extracted municipality names will be **stored in a structured table**.
- We will save the table in:  
  📁 `My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/structured-information-from-datasets/structured-information-municipalities/municipality_names.csv`

This structured data will help in **municipality-based filtering, chatbot responses, and travel itinerary planning**. 🚀  


In [None]:
import os
import pandas as pd

# 🔹 Define paths
structured_data_folder = "/content/drive/My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/structured-information-from-datasets/structured-information-municipalities"
os.makedirs(structured_data_folder, exist_ok=True)  # Ensure the directory exists

output_csv = os.path.join(structured_data_folder, "municipality_names.csv")

# 🔹 Initialize storage list
municipality_data = []

# 🔹 Function to properly format municipality names
def format_municipality_name(name):
    return " ".join(word.capitalize() for word in name.lower().split())  # Ensures only the first letter of each word is capitalized

# 🔹 Iterate through all municipality files
for file in os.listdir(extracted_folder):
    if file.endswith(".txt"):
        # Extract municipality name directly from the file name
        raw_name = file.replace(".txt", "").replace("_", " ")
        municipality_name = format_municipality_name(raw_name)  # Apply formatting

        # Store extracted data
        municipality_data.append({"File Name": file, "Municipality Name": municipality_name})

# 🔹 Convert to DataFrame
df_municipalities = pd.DataFrame(municipality_data)

# 🔹 Save extracted municipality names to CSV
df_municipalities.to_csv(output_csv, index=False, encoding="utf-8")

# ✅ Display the extracted municipality names
from IPython.display import display

display(df_municipalities)  # Show DataFrame output

print(f"✅ Municipality names saved to: {output_csv}")


Unnamed: 0,File Name,Municipality Name
0,Adjuntas.txt,Adjuntas
1,Aguada.txt,Aguada
2,Aguadilla.txt,Aguadilla
3,Aguas Buenas.txt,Aguas Buenas
4,Aibonito.txt,Aibonito
...,...,...
73,Vega Baja.txt,Vega Baja
74,Vieques.txt,Vieques
75,Villalba.txt,Villalba
76,Yabucoa.txt,Yabucoa


✅ Municipality names saved to: /content/drive/My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/structured-information-from-datasets/structured-information-municipalities/municipality_names.csv


## 📍 Extracting Municipality Coordinates  

Now that we have extracted the **municipality names**, the next step is to **extract their location coordinates**.  

✔ The **coordinates** (latitude & longitude) are embedded within the **raw HTML files**.  
✔ They can be found in different formats, such as **JSON objects, HTML metadata, and embedded map links**.  
✔ We will use **multiple regex patterns** to capture coordinates from these various sources.  

### 🔹 **How We Will Store This Data**  
- The extracted coordinates will be **stored in a structured table** with:  
  - **File Name**  
  - **Latitude**  
  - **Longitude**  
- The table will be saved in:  
  📁 `My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/structured-information-from-datasets/structured-information-municipalities/municipality_coordinates.csv`  

- Any files where **coordinates were not found** will be logged separately:  
  📁 `municipalities_missing_coordinates.txt`  

This structured data will allow us to **map municipalities**, **provide location-based recommendations**, and **integrate with travel planning tools**. 🚀  
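
Wikipedia pages often display coordinates in degrees, minutes, and seconds (DMS) rather than decimal degrees, and a DMS string cannot be passed to `float()` directly. The sketch below shows one way to convert such a string. `dms_to_decimal` and `DMS_PATTERN` are hypothetical names, and the sample strings are illustrative values, not text taken from the files.

```python
import re

# Hypothetical pattern: degrees, minutes, seconds, then a hemisphere letter.
# \x27 is the ASCII apostrophe and \x22 the ASCII double quote.
DMS_PATTERN = re.compile(
    r"(\d{1,3})°\s*(\d{1,2})[′\x27]\s*(\d{1,2}(?:\.\d+)?)[″\x22]?\s*([NSEW])"
)

def dms_to_decimal(dms_text):
    """Convert a DMS string such as 18°09′46″N to signed decimal degrees."""
    match = DMS_PATTERN.search(dms_text)
    if not match:
        return None
    degrees, minutes, seconds, hemisphere = match.groups()
    value = int(degrees) + int(minutes) / 60 + float(seconds) / 3600
    # South and West are negative by convention
    return -value if hemisphere in "SW" else value

lat = dms_to_decimal("18°09′46″N")
lon = dms_to_decimal("66°43′20″W")
print(round(lat, 6), round(lon, 6))
```

Puerto Rico lies west of the prime meridian, so every longitude should come out negative. That makes a quick sanity check on any extracted value.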


In [None]:
import os
import re
import json
import pandas as pd

# 🔹 Define paths
structured_data_folder = "/content/drive/My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/structured-information-from-datasets/structured-information-municipalities"
os.makedirs(structured_data_folder, exist_ok=True)  # Ensure the directory exists

output_csv = os.path.join(structured_data_folder, "municipality_coordinates.csv")
missing_files_log = os.path.join(structured_data_folder, "municipalities_missing_coordinates.txt")

municipalities_folder = "/content/drive/My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/data/municipalities"

# 🔹 List to store extracted data
data = []
missing_municipalities = set()  # Store files with missing coordinates

# Define regex patterns for extracting coordinates
coordinate_patterns = [
    r'"wgCoordinates"\s*:\s*({.*?})',  # JSON dictionary inside "wgCoordinates"
    r'"lat"\s*:\s*(-?\d+\.\d+)\s*,\s*"lon"\s*:\s*(-?\d+\.\d+)',  # "lat" and "lon"
    r'"latitude"\s*:\s*(-?\d+\.\d+)\s*,\s*"longitude"\s*:\s*(-?\d+\.\d+)',  # "latitude" and "longitude"
    r'"latLng"\s*:\s*\[\s*(-?\d+\.\d+)\s*,\s*(-?\d+\.\d+)\s*\]',  # "latLng" array
    r'LatLng\s*\(\s*(-?\d+\.\d+)\s*,\s*(-?\d+\.\d+)\s*\)',  # Google Maps `LatLng()`
    r'coordinates\s*:\s*\[\s*(-?\d+\.\d+)\s*,\s*(-?\d+\.\d+)\s*\]',  # Generic "coordinates" array
    r'data-location\s*=\s*"\s*(-?\d+\.\d+)\s*,\s*(-?\d+\.\d+)\s*"',  # HTML `data-location`
    r'<span class="geo">\s*(-?\d+\.\d+)\s*;\s*(-?\d+\.\d+)\s*</span>',  # Wikipedia geo <span>
    r'geo:lat" content="(-?\d+\.\d+)"[^>]*geo:long" content="(-?\d+\.\d+)"',  # HTML meta geo tags
    r'geohack\.toolforge\.org/.*?params=(-?\d+\.\d+)_(-?\d+\.\d+)',  # Wikipedia GeoHack links
    r'(\d{1,3}°\s*\d{1,2}\'\s*\d{1,2}(?:\.\d+)?["″]?\s*[NS]),\s*(\d{1,3}°\s*\d{1,2}\'\s*\d{1,2}(?:\.\d+)?["″]?\s*[EW])',  # DMS format
    r'(-?\d{1,3}\.\d+)\s*,\s*(-?\d{1,3}\.\d+)',  # Comma-separated decimal degrees
    r'geo\.position"\s*content="\s*(-?\d+\.\d+);\s*(-?\d+\.\d+)"',  # Geo meta position tags
    r'www\.openstreetmap\.org/\?mlat=(-?\d+\.\d+)&mlon=(-?\d+\.\d+)',  # OpenStreetMap URLs
    r'www\.google\.com/maps/@(-?\d+\.\d+),(-?\d+\.\d+),\d+z',  # Google Maps URL parameters
    r'UTM\s*Zone\s*\d+\s*[NS]\s*Easting:\s*\d+\s*Northing:\s*\d+',  # UTM coordinates
    r'"coordinates"\s*:\s*\[\s*(-?\d+\.\d+)\s*,\s*(-?\d+\.\d+)\s*\]',  # GeoJSON format
    r'([-+]\d{2,3}\.\d+)([-+]\d{2,3}\.\d+)',  # ISO 6709 Format
    r'<meta\s+property="place:location:latitude"\s+content="(-?\d+\.\d+)"[^>]*>',  # Facebook place location latitude
    r'<meta\s+property="place:location:longitude"\s+content="(-?\d+\.\d+)"[^>]*>',  # Facebook place location longitude
    r'<meta\s+name="ICBM"\s+content="(-?\d+\.\d+),\s*(-?\d+\.\d+)"',  # Deprecated but still used
    r'<meta\s+name="geo\.position"\s+content="(-?\d+\.\d+);\s*(-?\d+\.\d+)"',  # Another meta tag format
    r'data-lat\s*=\s*"(-?\d+\.\d+)"\s+data-lon\s*=\s*"(-?\d+\.\d+)"',  # HTML attributes for coordinates
    r'data-geo\s*=\s*"(-?\d+\.\d+),\s*(-?\d+\.\d+)"',  # HTML `data-geo`
    r'www\.google\.com/maps/embed\?pb=!1m\d+!1d(-?\d+\.\d+)!2d(-?\d+\.\d+)',  # Google Maps Embed URL
    r'L\.marker\(\[\s*(-?\d+\.\d+),\s*(-?\d+\.\d+)\s*\]\)',  # Leaflet.js Map Coordinates
    r'"geometry"\s*:\s*{"type":"Point","coordinates":\s*\[\s*(-?\d+\.\d+),\s*(-?\d+\.\d+)\s*\]}',  # Mapbox GeoJSON format
    r'www\.google\.com/maps/api/staticmap\?.*?center=(-?\d+\.\d+),(-?\d+\.\d+)',  # Google Static Maps API
    r'maps\.apple\.com/\?ll=(-?\d+\.\d+),(-?\d+\.\d+)',  # Apple Maps URLs
    r'maps\.yahoo\.com/#lat=(-?\d+\.\d+)&lon=(-?\d+\.\d+)',  # Yahoo Maps URLs
    r'www\.bing\.com/maps\?v=\d+&where1=(-?\d+\.\d+),(-?\d+\.\d+)',  # Bing Maps URLs
    r'P625"\s*:\s*\{"type":"Point","coordinates":\s*\[\s*(-?\d+\.\d+),\s*(-?\d+\.\d+)\s*\]\}',  # Wikidata GeoCoordinates
    r'<coordinates>\s*(-?\d+\.\d+),\s*(-?\d+\.\d+),?\s*(-?\d+\.\d+)?\s*</coordinates>',  # KML (Google Earth)
    r'<trkpt\s+lat="(-?\d+\.\d+)"\s+lon="(-?\d+\.\d+)"'  # GPX (GPS Data)
]

# 🔹 Function to extract coordinates from content
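def extract_coordinates(content):
    for pattern in coordinate_patterns:
        match = re.search(pattern, content)
        if not match:
            continue
        groups = match.groups()
        try:
            if len(groups) == 1:
                # Single group: expected to be a JSON object such as wgCoordinates
                coordinates = json.loads(groups[0])
                if not isinstance(coordinates, dict):
                    continue  # e.g. a lone latitude from a meta tag
                lat = coordinates.get("lat", coordinates.get("latitude"))
                lon = coordinates.get("lon", coordinates.get("longitude"))
            elif len(groups) == 2:
                lat, lon = float(groups[0]), float(groups[1])
            else:
                continue  # Patterns with zero or three groups (UTM, KML) are not handled here
        except (json.JSONDecodeError, ValueError, TypeError):
            continue  # Unparseable match (e.g. a DMS string); try the next pattern
        if lat is not None and lon is not None:
            return lat, lon
    return None, None  # No valid coordinates found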

# 🔹 Iterate through all text files in the municipalities folder
for filename in os.listdir(municipalities_folder):
    if filename.endswith(".txt"):
        file_path = os.path.join(municipalities_folder, filename)

        try:
            with open(file_path, "r", encoding="utf-8") as file:
                content = file.read()

            lat, lon = extract_coordinates(content)

            if lat is None or lon is None:
                missing_municipalities.add(filename.replace(".txt", ""))  # Mark as missing

            # Store extracted data (even if coordinates are missing); compare with None so a 0.0 value is kept
            data.append({"File Name": filename, "Latitude": lat if lat is not None else "", "Longitude": lon if lon is not None else ""})

        except Exception as e:
            print(f"⚠️ Error processing {filename}: {str(e)}")
            missing_municipalities.add(filename.replace(".txt", ""))  # Mark as missing

# 🔹 Convert to DataFrame
df = pd.DataFrame(data)

# 🔹 Save extracted municipality coordinates to CSV
df.to_csv(output_csv, index=False, encoding="utf-8")

# 🔹 Save missing municipalities **only if there are missing coordinates**
if missing_municipalities:
    with open(missing_files_log, "w", encoding="utf-8") as missing_file:
        for municipality in sorted(missing_municipalities):
            missing_file.write(municipality + "\n")
    print(f"⚠️ Missing municipality coordinates saved to: {missing_files_log}")
else:
    print("✅ No missing municipalities detected. Skipping log file creation.")

# ✅ Display the extracted municipality coordinates
from IPython.display import display
display(df)  # Show DataFrame output

print(f"✅ Municipality coordinates saved to: {output_csv}")


✅ No missing municipalities detected. Skipping log file creation.


Unnamed: 0,File Name,Latitude,Longitude
0,Adjuntas.txt,18.162778,-66.722222
1,Aguada.txt,18.379444,-67.188333
2,Aguadilla.txt,18.430000,-67.154444
3,Aguas Buenas.txt,18.256944,-66.103056
4,Aibonito.txt,18.140000,-66.266111
...,...,...,...
73,Vega Baja.txt,18.446111,-66.387500
74,Vieques.txt,18.116667,-65.416667
75,Villalba.txt,18.127222,-66.492222
76,Yabucoa.txt,18.050556,-65.879444


✅ Municipality coordinates saved to: /content/drive/My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/structured-information-from-datasets/structured-information-municipalities/municipality_coordinates.csv


## 🔗 Extracting Wikipedia URLs  

Now that we have extracted the **municipality names** and **coordinates**, the next step is to **extract the Wikipedia URLs**.  

✔ The **Wikipedia URL** is typically embedded in the **HTML content of each file**.  
✔ It can often be found in **metadata tags, references, or direct Wikipedia links within the content**.  
✔ We will use **regular expressions** to extract the most relevant Wikipedia URL.  

### 🔹 **How We Will Store This Data**  
- The extracted Wikipedia URLs will be **stored in a structured table** with:  
  - **File Name**  
  - **Wikipedia URL**  
- The table will be saved in:  
  📁 `My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/structured-information-from-datasets/structured-information-municipalities/municipality_wikipedia_urls.csv`  

- Any files where **a Wikipedia URL was not found** will be logged separately:  
  📁 `municipalities_missing_wikipedia_urls.txt`  

This data will allow us to **provide direct references to Wikipedia articles**, enhancing the **information available for each municipality**. 🌍  
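
Of the places a URL can appear, the `<link rel="canonical">` tag is usually the most dependable, because it names the page itself rather than a page it links to. The sketch below reads that tag with Python's built-in `html.parser` as an alternative to regular expressions. `CanonicalLinkParser` and `find_canonical_url` are hypothetical names, and the sample HTML is made up.

```python
from html.parser import HTMLParser

class CanonicalLinkParser(HTMLParser):
    """Collects the href of the first <link rel="canonical"> tag."""

    def __init__(self):
        super().__init__()
        self.canonical_url = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical_url is not None:
            return
        attributes = dict(attrs)
        # rel may hold several space-separated values, or be missing entirely
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical_url = attributes.get("href")

def find_canonical_url(html_text):
    parser = CanonicalLinkParser()
    parser.feed(html_text)
    return parser.canonical_url

sample_html = (
    '<html><head>'
    '<link rel="stylesheet" href="/style.css">'
    '<link rel="canonical" href="https://en.wikipedia.org/wiki/Adjuntas,_Puerto_Rico">'
    '</head><body>'
    '<a href="https://en.wikipedia.org/wiki/Ponce,_Puerto_Rico">Ponce</a>'
    '</body></html>'
)
print(find_canonical_url(sample_html))
```

Unlike a first-match regex, this ignores links to other articles in the page body.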


In [None]:
import os
import re
import pandas as pd

# 🔹 Define paths
structured_data_folder = "/content/drive/My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/structured-information-from-datasets/structured-information-municipalities"
os.makedirs(structured_data_folder, exist_ok=True)  # Ensure the directory exists

output_csv = os.path.join(structured_data_folder, "municipality_wikipedia_urls.csv")
missing_urls_log = os.path.join(structured_data_folder, "municipalities_missing_wikipedia_urls.txt")

municipalities_folder = "/content/drive/My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/data/municipalities"

# 🔹 List to store extracted data
data = []
all_files = set()  # Store all filenames
missing_wikipedia_urls = set()  # Store files with missing Wikipedia URLs

# 🔹 Regular Expression Patterns for Extracting Wikipedia URL
wikipedia_patterns = [
    r'<link rel="canonical" href="(https?://[a-z]{2,3}\.wikipedia\.org/wiki/[^\s"<>#]+)"',  # Canonical link (most reliable)
    r'<meta property="og:url" content="(https?://[a-z]{2,3}\.wikipedia\.org/wiki/[^\s"<>#]+)"',  # Open Graph URL
    r'<a href="(https?://[a-z]{2,3}\.wikipedia\.org/wiki/[^\s"<>#]+)"',  # Hyperlink to Wikipedia
    r'https?://[a-z]{2,3}\.wikipedia\.org/wiki/[^\s"<>#]+',  # Any standard Wikipedia URL (fallback)
    r'www\.wikipedia\.org/wiki/[^\s"<>#]+',  # Wikipedia URL without protocol
]

# 🔹 Function to extract Wikipedia URL from content
def extract_wikipedia_url(content):
    for pattern in wikipedia_patterns:
        match = re.search(pattern, content)
        if match:
            # Return the captured URL when the pattern has a group, otherwise the whole match
            return match.group(match.lastindex or 0)
    return None  # No valid Wikipedia URL found

# 🔹 Iterate through all text files in the municipalities folder
for filename in os.listdir(municipalities_folder):
    if filename.endswith(".txt"):
        all_files.add(filename.replace(".txt", ""))  # Store filenames without .txt
        file_path = os.path.join(municipalities_folder, filename)

        try:
            with open(file_path, "r", encoding="utf-8") as file:
                content = file.read()

            wikipedia_url = extract_wikipedia_url(content)

            if not wikipedia_url:
                missing_wikipedia_urls.add(filename.replace(".txt", ""))  # Mark as missing

            # Store extracted data (even if Wikipedia URL is missing)
            data.append({"File Name": filename, "Wikipedia URL": wikipedia_url if wikipedia_url else ""})

        except Exception as e:
            print(f"⚠️ Error processing {filename}: {str(e)}")
            missing_wikipedia_urls.add(filename.replace(".txt", ""))  # Mark as missing

# 🔹 Convert to DataFrame
df = pd.DataFrame(data)

# 🔹 Save extracted Wikipedia URLs to CSV
df.to_csv(output_csv, index=False, encoding="utf-8")

# 🔹 Save missing Wikipedia URLs **only if there are missing entries**
if missing_wikipedia_urls:
    with open(missing_urls_log, "w", encoding="utf-8") as missing_file:
        for municipality in sorted(missing_wikipedia_urls):
            missing_file.write(municipality + "\n")
    print(f"⚠️ Missing Wikipedia URLs saved to: {missing_urls_log}")
else:
    print("✅ No missing Wikipedia URLs detected. Skipping log file creation.")

# ✅ Display the extracted Wikipedia URLs
from IPython.display import display
display(df)  # Show DataFrame output

print(f"✅ Municipality Wikipedia URLs saved to: {output_csv}")


✅ No missing Wikipedia URLs detected. Skipping log file creation.


Unnamed: 0,File Name,Wikipedia URL
0,Adjuntas.txt,"https://en.wikipedia.org/wiki/Adjuntas,_Puerto..."
1,Aguada.txt,"https://en.wikipedia.org/wiki/Aguada,_Puerto_Rico"
2,Aguadilla.txt,"https://en.wikipedia.org/wiki/Aguadilla,_Puert..."
3,Aguas Buenas.txt,"https://en.wikipedia.org/wiki/Aguas_Buenas,_Pu..."
4,Aibonito.txt,"https://en.wikipedia.org/wiki/Aibonito,_Puerto..."
...,...,...
73,Vega Baja.txt,"https://en.wikipedia.org/wiki/Vega_Baja,_Puert..."
74,Vieques.txt,"https://en.wikipedia.org/wiki/Vieques,_Puerto_..."
75,Villalba.txt,"https://en.wikipedia.org/wiki/Villalba,_Puerto..."
76,Yabucoa.txt,"https://en.wikipedia.org/wiki/Yabucoa,_Puerto_..."


✅ Municipality Wikipedia URLs saved to: /content/drive/My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/structured-information-from-datasets/structured-information-municipalities/municipality_wikipedia_urls.csv


## 📖 Extracting Brief Descriptions from Municipality Files  

Before performing full data cleaning, we will **extract the first paragraph** from each municipality file as the **Brief Description**.  

✔ The **brief description** is typically the **first meaningful paragraph** in each file.  
✔ We will **ignore empty lines** and **extract the first non-empty paragraph**.  
✔ The extracted data will be **stored in a structured table** for easy reference.  

### 🔹 **How We Will Store This Data**  
- The extracted brief descriptions will be **stored in a structured table** with:  
  - **File Name**  
  - **Brief Description** (First paragraph of each file)  
- The table will be saved in:  
  📁 `My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/structured-information-from-datasets/structured-information-municipalities/municipality_brief_descriptions_v1.csv`  

This structured dataset will be useful for **chatbot responses** and **travel recommendations**. 🚀  
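
When pulling text out of `<p>` elements, how the fragments around inline tags are joined matters. Stripping each fragment before joining glues neighbouring words together (for example `townandmunicipality`). The standard-library sketch below joins fragments unchanged and collapses whitespace once at the end. The class and function names are hypothetical, and the sample HTML is made up.

```python
import re
from html.parser import HTMLParser

class ParagraphCollector(HTMLParser):
    """Collects the text fragments of <p> elements, one list per paragraph."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._inside_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._inside_p = True
            self.paragraphs.append([])

    def handle_endtag(self, tag):
        if tag == "p":
            self._inside_p = False

    def handle_data(self, data):
        if self._inside_p:
            self.paragraphs[-1].append(data)

def first_long_paragraph(html_text, min_length=100):
    parser = ParagraphCollector()
    parser.feed(html_text)
    for fragments in parser.paragraphs:
        # Join fragments unchanged, then collapse whitespace once at the end
        text = re.sub(r"\s+", " ", "".join(fragments)).strip()
        if len(text) > min_length:
            return text
    return None

sample_html = (
    '<p class="mw-empty-elt">\n</p>'
    '<p><b>Adjuntas</b> is a small mountainside <a href="/wiki/Town">town</a> and '
    '<a href="/wiki/Municipality">municipality</a> in Puerto Rico.</p>'
)
print(first_long_paragraph(sample_html, min_length=20))
```

The `min_length` threshold plays the same role as a "skip short paragraphs" rule. It is lowered here only because the sample paragraph is short.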


In [None]:
import os
import re
import pandas as pd
from bs4 import BeautifulSoup
import unicodedata

# 🔹 Define paths
structured_data_folder = "/content/drive/My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/structured-information-from-datasets/structured-information-municipalities"
os.makedirs(structured_data_folder, exist_ok=True)  # Ensure the directory exists

output_csv = os.path.join(structured_data_folder, "municipality_brief_descriptions_v1.csv")
municipalities_folder = "/content/drive/My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/data/municipalities"

# 🔹 List to store extracted data
data = []

# 🔹 Function to clean text encoding and spacing
def clean_text(text):
    # Normalize Unicode to NFKC (unifies equivalent forms such as non-breaking spaces).
    # Note: this does NOT decode literal escape sequences such as \xc3\xb1
    text = unicodedata.normalize("NFKC", text)

    # Remove every parenthetical: this drops IPA pronunciations, but also any other bracketed text
    text = re.sub(r"\(.*?\)", "", text)
    text = re.sub(r"\s+", " ", text).strip()  # Collapse whitespace (including newlines) to single spaces

    return text

# 🔹 Function to extract the first paragraph
def extract_first_paragraph(content):
    soup = BeautifulSoup(content, "html.parser")

    # Find all paragraphs <p> in the main content
    paragraphs = soup.find_all("p")

    for para in paragraphs:
        text = para.get_text(strip=True)  # Note: strip=True trims every text fragment, so words next to links can run together
        text = clean_text(text)  # Clean extracted text
        if len(text) > 100:  # Ensure it's not a short sentence
            return text  # Return the first relevant paragraph
    return None  # Return None if no paragraph is found

# 🔹 Iterate through all text files in the municipalities folder
for filename in os.listdir(municipalities_folder):
    if filename.endswith(".txt"):
        file_path = os.path.join(municipalities_folder, filename)

        try:
            with open(file_path, "r", encoding="utf-8") as file:
                content = file.read()

            brief_description = extract_first_paragraph(content)

            # Store extracted data
            data.append({"File Name": filename, "Brief Description": brief_description if brief_description else "Not Found"})

        except Exception as e:
            print(f"⚠️ Error processing {filename}: {str(e)}")

# 🔹 Convert to DataFrame
df = pd.DataFrame(data)

# 🔹 Save extracted brief descriptions to CSV
df.to_csv(output_csv, index=False, encoding="utf-8")

# ✅ Display the extracted brief descriptions
from IPython.display import display
display(df)  # Show DataFrame output

print(f"✅ Municipality brief descriptions saved to: {output_csv}")


Unnamed: 0,File Name,Brief Description
0,Adjuntas.txt,Adjuntas is a small mountainsidetownandmunicip...
1,Aguada.txt,"Aguada, originallySan Francisco de As\xc3\xads..."
2,Aguadilla.txt,"Aguadilla, founded in 1775 by Luis de C\xc3\xb..."
3,Aguas Buenas.txt,"Aguas Buenas, , popularly known as ""La Ciudad ..."
4,Aibonito.txt,Aibonito is a small mountaintownandmunicipalit...
...,...,...
73,Vega Baja.txt,Vega Baja is atownandmunicipalitylocated on th...
74,Vieques.txt,"Vieques, officiallyIsla de Vieques, is an isla..."
75,Villalba.txt,"Villalba, originally known asVilla Alba, is at..."
76,Yabucoa.txt,Yabucoa is atownandmunicipalityinPuerto Ricolo...


✅ Municipality brief descriptions saved to: /content/drive/My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/structured-information-from-datasets/structured-information-municipalities/municipality_brief_descriptions_v1.csv


## 🔧 Fixing Encoding Issues in Municipality Descriptions  

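After extracting brief descriptions, we noticed **encoding artifacts** (e.g., `\xc3\xb1` instead of `ñ`). These are UTF-8 byte sequences written out as literal escape text, which is consistent with the raw files being stored as Python byte strings (`b'...'`). To ensure **readability and accuracy**, we will **apply encoding fixes** to `municipality_brief_descriptions_v1.csv`.  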

### 🔹 **Common Encoding Issues Found**
✔ Special characters like `á, é, í, ó, ú, ñ` were **misrepresented** in the extracted descriptions.  
✔ Some descriptions contained **Unicode artifacts** instead of proper text.  

### 🔹 **Fixing Approach**
1. **UTF-8 round trip:** Re-encode and decode the text, dropping any characters that are not valid UTF-8.  
2. **Manual corrections:** Replace misencoded sequences (e.g., `\xc3\xb1` → `ñ`).  
3. **Unicode normalization:** Standardize diacritics and symbols.  
4. **Whitespace cleanup:** Remove unnecessary spaces.  

### 📁 **Output File**
- **Fixed descriptions** will be stored in:  
  📁 `structured-information-municipalities/municipality_brief_descriptions_v2.csv`  

This cleaned dataset will **improve chatbot responses** and **user experience** when interacting with municipal information. 🚀  
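
A lookup table only fixes the sequences it lists. As a complementary sketch, a regular expression can find every run of literal `\xNN` escapes and decode the run as UTF-8, which covers characters not anticipated in advance. `decode_escaped_utf8` and `ESCAPED_BYTES` are hypothetical names, and the sample sentence is made up.

```python
import re

# Matches one or more consecutive literal \xNN escapes (backslash, 'x', two hex digits)
ESCAPED_BYTES = re.compile(r"(?:\\x[0-9a-fA-F]{2})+")

def decode_escaped_utf8(text):
    """Replace runs of literal \\xNN escapes with the UTF-8 text they encode."""

    def _decode_run(match):
        hex_digits = match.group(0).replace("\\x", "")
        try:
            return bytes.fromhex(hex_digits).decode("utf-8")
        except UnicodeDecodeError:
            return match.group(0)  # Not valid UTF-8: leave the run untouched

    return ESCAPED_BYTES.sub(_decode_run, text)

broken = r"Aguada, originally San Francisco de As\xc3\xads, and A\xc3\xb1asco"
print(decode_escaped_utf8(broken))
```

Runs that do not form valid UTF-8 are left as they are, so the function is safe to apply to text that is already clean.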


In [3]:
import pandas as pd

# 🔹 Define paths
input_csv = "/content/drive/My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/structured-information-from-datasets/structured-information-municipalities/municipality_brief_descriptions_v1.csv"
output_csv = "/content/drive/My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/structured-information-from-datasets/structured-information-municipalities/municipality_brief_descriptions_v2.csv"

# 🔹 Load the CSV file
df = pd.read_csv(input_csv)

# 🔹 Function to Fix Encoding Issues
def fix_encoding(text):
    try:
        # Step 1: Round-trip through UTF-8, dropping characters that cannot be encoded (e.g. lone surrogates).
        # This does not convert literal "\xNN" text; Step 2 handles that.
        fixed_text = text.encode("utf-8", errors="ignore").decode("utf-8")

        # Step 2: Handle known misencoded sequences manually.
        # Keys use lowercase hex because that is how Python writes byte escapes.
        fix_map = {
            # Lowercase Latin characters with accents
            "\\xc3\\xa1": "á", "\\xc3\\xa9": "é", "\\xc3\\xad": "í", "\\xc3\\xb3": "ó", "\\xc3\\xba": "ú",
            "\\xc3\\xb1": "ñ", "\\xc3\\xa0": "à", "\\xc3\\xa2": "â", "\\xc3\\xa4": "ä", "\\xc3\\xa7": "ç",
            "\\xc3\\xb6": "ö", "\\xc3\\xbc": "ü",

            # Uppercase Latin characters with accents
            "\\xc3\\x80": "À", "\\xc3\\x82": "Â", "\\xc3\\x87": "Ç",
            "\\xc3\\x81": "Á", "\\xc3\\x89": "É", "\\xc3\\x8d": "Í", "\\xc3\\x93": "Ó", "\\xc3\\x9a": "Ú",
            "\\xc3\\x91": "Ñ", "\\xc3\\x84": "Ä", "\\xc3\\x96": "Ö", "\\xc3\\x9c": "Ü", "\\xc3\\x8a": "Ê",
            "\\xc3\\x8b": "Ë", "\\xc3\\x99": "Ù", "\\xc3\\x83": "Ã",

            # Other accented letters
            "\\xc3\\xb5": "õ", "\\xc3\\xb8": "ø", "\\xc3\\xb2": "ò", "\\xc3\\xb4": "ô",

            # Symbols and Spanish punctuation
            "\\xc2\\xb0": "°", "\\xc2\\xa9": "©", "\\xc2\\xae": "®", "\\xc2\\xa1": "¡", "\\xc2\\xbf": "¿",
            "\\xc2\\xad": "\u00ad",  # Soft hyphen

            # Dashes, quotes, and ellipsis
            "\\xe2\\x80\\x93": "–", "\\xe2\\x80\\x94": "—",
            "\\xe2\\x80\\x9d": "”", "\\xe2\\x80\\x9c": "“", "\\xe2\\x80\\x98": "‘", "\\xe2\\x80\\x99": "’",
            "\\xe2\\x80\\xa6": "…",

            # Miscellaneous symbols
            "\\xe2\\x99\\xa5": "♥", "\\xe2\\x9c\\x94": "✔", "\\xe2\\x9d\\x8f": "❏", "\\xe2\\x97\\x8f": "●",
            "\\xe2\\x81\\x84": "⁄", "\\xe2\\x83\\xa3": "\u20e3",  # U+20E3 is the combining enclosing keycap
            "\\xe2\\x88\\x82": "∂", "\\xe2\\x9c\\x9d": "✝",
        }

        # Apply the encoding fixes from the map
        for wrong, correct in fix_map.items():
            fixed_text = fixed_text.replace(wrong, correct)

        return fixed_text.strip()

    except Exception as e:
        print(f"⚠️ Encoding fix failed for: {text} → {e}")
        return text  # Return the original if fixing fails

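# 🔹 Apply encoding fix to the "Brief Description" column
# Non-string cells (e.g., NaN) are passed through unchanged instead of becoming the text "nan"
df['Brief Description'] = df['Brief Description'].apply(lambda t: fix_encoding(t) if isinstance(t, str) else t)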

# 🔹 Save the cleaned CSV as version 2
df.to_csv(output_csv, index=False, encoding="utf-8")

# ✅ Display the updated DataFrame (for review)
from IPython.display import display
display(df)  # Show DataFrame output

print(f"✅ Encoding fixed and updated brief descriptions saved to: {output_csv}")


Unnamed: 0,File Name,Brief Description
0,Adjuntas.txt,Adjuntas is a small mountainsidetownandmunicip...
1,Aguada.txt,"Aguada, originallySan Francisco de Asís de la ..."
2,Aguadilla.txt,"Aguadilla, founded in 1775 by Luis de Córdova,..."
3,Aguas Buenas.txt,"Aguas Buenas, , popularly known as ""La Ciudad ..."
4,Aibonito.txt,Aibonito is a small mountaintownandmunicipalit...
...,...,...
73,Vega Baja.txt,Vega Baja is atownandmunicipalitylocated on th...
74,Vieques.txt,"Vieques, officiallyIsla de Vieques, is an isla..."
75,Villalba.txt,"Villalba, originally known asVilla Alba, is at..."
76,Yabucoa.txt,Yabucoa is atownandmunicipalityinPuerto Ricolo...


✅ Encoding fixed and updated brief descriptions saved to: /content/drive/My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/structured-information-from-datasets/structured-information-municipalities/municipality_brief_descriptions_v2.csv


## 📖 Fixing Spacing and Formatting Issues in Municipality Descriptions  

Now that we have **fixed encoding issues**, we need to **resolve spacing and formatting problems** in the municipality descriptions.  

✔ **Issues Identified:**  
- Missing space before a capitalized word (e.g., `"north ofYauco"` → `"north of Yauco"`)  
- All-lowercase words merged together (e.g., `"mountainsidetownandmunicipality"` → `"mountainside town and municipality"`)  
- Both problems in the same phrase (e.g., `"is atownandmunicipalityofPuerto Rico"` → `"is a town and municipality of Puerto Rico"`)  
- Missing or repeated spaces around punctuation marks  

### **🔹 How We Will Fix It**  
To improve readability, we will:  
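✔ **Ensure proper spacing** after punctuation marks  
✔ **Separate merged words** at lowercase-to-uppercase boundaries (all-lowercase merges such as `"townandmunicipality"` have no such boundary and are left unchanged)  
✔ **Remove excess spaces** and normalize whitespace  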
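
To make the scope of the case-boundary rule concrete, here is a minimal sketch of it in isolation. The function name `split_case_boundaries` is illustrative; the regular expression is the same lowercase-then-uppercase pattern used for this step.

```python
import re

def split_case_boundaries(text: str) -> str:
    """Insert a space wherever a lowercase ASCII letter is directly followed by an uppercase one."""
    return re.sub(r"([a-z])([A-Z])", r"\1 \2", text)

fixed = split_case_boundaries("north ofYauco")
unchanged = split_case_boundaries("is atownandmunicipality")
print(fixed)      # north of Yauco
print(unchanged)  # is atownandmunicipality
```

One side effect to keep in mind: a name with an internal capital, such as `McKinley`, would also be split by this pattern.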

📁 The improved dataset will be saved as:  
📄 `municipality_brief_descriptions_v3.csv` in the structured dataset folder.  

This cleanup step **improves the formatting** of the descriptions ahead of **chatbot integration**. 🚀


In [2]:
import pandas as pd
import re

# 🔹 Define file paths
input_csv = "/content/drive/My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/structured-information-from-datasets/structured-information-municipalities/municipality_brief_descriptions_v2.csv"
output_csv = "/content/drive/My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/structured-information-from-datasets/structured-information-municipalities/municipality_brief_descriptions_v3.csv"

# 🔹 Load the CSV
df = pd.read_csv(input_csv)

# 🔹 Function to fix spacing and formatting
def fix_spacing(text):
    if not isinstance(text, str):
        return text  # Skip if not a string (e.g., NaN)

    # Ensure a space after punctuation that sits directly between two letters ("town.It" → "town. It")
    text = re.sub(r"([a-zA-Z])([.,;!?])([A-Za-z])", r"\1\2 \3", text)

    # Insert a space at lowercase-to-uppercase boundaries (fixes "north ofYauco" → "north of Yauco").
    # All-lowercase merges such as "townandmunicipality" have no case boundary and are left as-is.
    text = re.sub(r"([a-z])([A-Z])", r"\1 \2", text)

    # Remove whitespace before punctuation and ensure a single space after it.
    # The lookahead skips punctuation directly followed by a digit, so "18.16" and "1,775" stay intact.
    text = re.sub(r"\s*([.,;:])(?!\d)\s*", r"\1 ", text)

    # Replace multiple spaces with a single space
    text = re.sub(r"\s+", " ", text).strip()

    return text

# 🔹 Apply fixes to the "Brief Description" column
df["Brief Description"] = df["Brief Description"].apply(fix_spacing)

# 🔹 Save the improved CSV
df.to_csv(output_csv, index=False, encoding="utf-8")

# ✅ Display the cleaned dataset
from IPython.display import display
display(df)

print(f"✅ Spacing issues fixed and saved to: {output_csv}")


Unnamed: 0,File Name,Brief Description
0,Adjuntas.txt,Adjuntas is a small mountainsidetownandmunicip...
1,Aguada.txt,"Aguada, originally San Francisco de Asís de la..."
2,Aguadilla.txt,"Aguadilla, founded in 1775 by Luis de Córdova,..."
3,Aguas Buenas.txt,"Aguas Buenas, , popularly known as ""La Ciudad ..."
4,Aibonito.txt,Aibonito is a small mountaintownandmunicipalit...
...,...,...
73,Vega Baja.txt,Vega Baja is atownandmunicipalitylocated on th...
74,Vieques.txt,"Vieques, officially Isla de Vieques, is an isl..."
75,Villalba.txt,"Villalba, originally known as Villa Alba, is a..."
76,Yabucoa.txt,Yabucoa is atownandmunicipalityin Puerto Ricol...


✅ Spacing issues fixed and saved to: /content/drive/My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/structured-information-from-datasets/structured-information-municipalities/municipality_brief_descriptions_v3.csv


## ✅ **Finalizing Municipality Brief Descriptions**  

We have now processed the **brief descriptions** of Puerto Rican municipalities by:  

✔ **Extracting the first meaningful paragraph** from each municipality file.  
✔ **Fixing encoding issues** to ensure proper display of special characters.  
✔ **Cleaning up spacing and formatting issues** to improve readability.  

### 🔹 **Current Status**  
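The dataset `municipality_brief_descriptions_v3.csv` has been generated with **improved descriptions**. Some formatting issues remain: the displayed rows still show all-lowercase merged words (e.g., `townandmunicipality` in the Adjuntas and Vega Baja descriptions) and a doubled comma in the Aguas Buenas description. We will proceed with this version for further integration.  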

### 🚀 **Next Steps**  
- This dataset is now **ready for chatbot integration** and **travel recommendations**.  
- If necessary, future refinements can be applied to address remaining formatting inconsistencies.  
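
One possible refinement for the all-lowercase merges is a small, hand-maintained table of glued phrases. This is only a sketch under assumptions: the `KNOWN_GLUED_PHRASES` table and the function name `split_known_phrases` are hypothetical, the single entry is based on the phrase visible in the displayed rows, and any real table would need to be built by inspecting the full dataset.

```python
import re

# Hypothetical, hand-maintained map of glued phrases to their spaced form
KNOWN_GLUED_PHRASES = {
    "townandmunicipality": "town and municipality",
}

def split_known_phrases(text: str) -> str:
    for glued, spaced in KNOWN_GLUED_PHRASES.items():
        # Pad with spaces so words glued on either side are separated as well
        text = text.replace(glued, f" {spaced} ")
    text = re.sub(r"\s+([.,;:!?])", r"\1", text)  # no space before punctuation
    return re.sub(r"\s+", " ", text).strip()      # collapse repeated spaces

# Fragment taken from the displayed Vega Baja row
print(split_known_phrases("Vega Baja is atownandmunicipalitylocated on"))
```

A fixed phrase table is deliberately conservative: it never splits a word it does not know, at the cost of missing merges that are not listed.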

📄 **Final Cleaned File Path:**  
📁 `/content/drive/My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/structured-information-from-datasets/structured-information-municipalities/municipality_brief_descriptions_v3.csv`  

✅ **This marks the completion of the municipality brief description preprocessing!** 🎉  


## 🔗 Merging Municipality Data into a Single Table  

Each of the CSV files below describes the same municipalities and shares a `File Name` column, so we will **merge them on that key** into a single structured dataset.  

### 🔹 **Files to be Merged**
| File Name | Description |
|-----------|------------|
| `municipality_names.csv` | List of municipality names linked to file names |
| `municipality_coordinates.csv` | Latitude and longitude coordinates for each municipality |
| `municipality_wikipedia_urls.csv` | Wikipedia URLs for more detailed information |
| `municipality_brief_descriptions_v3.csv` | The extracted and cleaned brief descriptions |
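
Because the merge relies on `File Name` as a shared key, it can be worth checking that the key is unique and finding rows without a match. The sketch below uses two tiny stand-in tables rather than the real CSV files. The `validate` and `indicator` arguments are standard `DataFrame.merge` parameters, but using them here is a suggestion and not part of the notebook's own merge.

```python
import pandas as pd

# Tiny stand-in tables; the real data comes from the CSV files listed above
df_names = pd.DataFrame({
    "File Name": ["Adjuntas.txt", "Aguada.txt"],
    "Municipality Name": ["Adjuntas", "Aguada"],
})
df_coordinates = pd.DataFrame({
    "File Name": ["Adjuntas.txt"],
    "Latitude": [18.162778],
    "Longitude": [-66.722222],
})

# validate raises pandas.errors.MergeError if "File Name" is duplicated on either side;
# indicator adds a "_merge" column recording where each row was found
checked = df_names.merge(
    df_coordinates, on="File Name", how="left",
    validate="one_to_one", indicator=True,
)
unmatched = checked.loc[checked["_merge"] == "left_only", "File Name"].tolist()
print(unmatched)  # ['Aguada.txt']
```

With `how="left"`, unmatched rows are kept with missing values, so a check like this surfaces gaps that would otherwise pass silently.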

### 🏗 **Final Table Structure**
| File Name | Municipality Name | Latitude | Longitude | Wikipedia URL | Brief Description |
|-----------|------------------|----------|-----------|---------------|-------------------|
| Adjuntas.txt | Adjuntas | 18.162778 | -66.722222 | [Wikipedia](URL) | Small mountainside town... |
| Aguada.txt | Aguada | 18.379444 | -67.188333 | [Wikipedia](URL) | Located in the northwest... |

### ✅ **Why This Is Useful**
- Provides **structured** and **centralized** data for chatbot responses.
- Location coordinates enable **mapping and distance calculations**.
- Wikipedia links allow **further exploration of municipalities**.
- The brief descriptions serve as **summaries** for user interactions.

The merged dataset will be **saved as**:  
📁 `structured-information-municipalities/municipality_data_combined.csv`
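
As an illustration of the distance calculations that the coordinate columns make possible, here is a minimal haversine sketch. It assumes a spherical Earth with a mean radius of 6371 km; the function name `haversine_km` is hypothetical, and the two coordinate pairs are the Adjuntas and Aguada values shown in the merged table.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

adjuntas = (18.162778, -66.722222)
aguada = (18.379444, -67.188333)
distance_km = haversine_km(*adjuntas, *aguada)
print(f"Adjuntas to Aguada: {distance_km:.1f} km")
```

This gives straight-line distance only; road distance and travel time would need a routing service.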


In [3]:
import os
import pandas as pd
from IPython.display import display

# 🔹 Define paths
structured_data_folder = "/content/drive/My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/structured-information-from-datasets/structured-information-municipalities"
os.makedirs(structured_data_folder, exist_ok=True)  # Ensure directory exists

# 🔹 File paths for each dataset
names_csv = os.path.join(structured_data_folder, "municipality_names.csv")
coordinates_csv = os.path.join(structured_data_folder, "municipality_coordinates.csv")
urls_csv = os.path.join(structured_data_folder, "municipality_wikipedia_urls.csv")
descriptions_csv = os.path.join(structured_data_folder, "municipality_brief_descriptions_v3.csv")
output_csv = os.path.join(structured_data_folder, "municipality_data_combined.csv")

# 🔹 Load CSV files
df_names = pd.read_csv(names_csv)  # Columns: File Name, Municipality Name
df_coordinates = pd.read_csv(coordinates_csv)  # Columns: File Name, Latitude, Longitude
df_urls = pd.read_csv(urls_csv)  # Columns: File Name, Wikipedia URL
df_descriptions = pd.read_csv(descriptions_csv)  # Columns: File Name, Brief Description

# 🔹 Fix potential column name typos
df_descriptions.rename(columns={"Fila Name": "File Name"}, inplace=True)

# 🔹 Merge datasets using "File Name" as the common key
df_merged = df_names.merge(df_coordinates, on="File Name", how="left") \
                    .merge(df_urls, on="File Name", how="left") \
                    .merge(df_descriptions, on="File Name", how="left")

# 🔹 Save merged dataset
df_merged.to_csv(output_csv, index=False, encoding="utf-8")

# ✅ Display merged dataset
display(df_merged)

print(f"✅ Municipality data combined and saved to: {output_csv}")


Unnamed: 0,File Name,Municipality Name,Latitude,Longitude,Wikipedia URL,Brief Description
0,Adjuntas.txt,Adjuntas,18.162778,-66.722222,"https://en.wikipedia.org/wiki/Adjuntas,_Puerto...",Adjuntas is a small mountainsidetownandmunicip...
1,Aguada.txt,Aguada,18.379444,-67.188333,"https://en.wikipedia.org/wiki/Aguada,_Puerto_Rico","Aguada, originally San Francisco de Asís de la..."
2,Aguadilla.txt,Aguadilla,18.430000,-67.154444,"https://en.wikipedia.org/wiki/Aguadilla,_Puert...","Aguadilla, founded in 1775 by Luis de Córdova,..."
3,Aguas Buenas.txt,Aguas Buenas,18.256944,-66.103056,"https://en.wikipedia.org/wiki/Aguas_Buenas,_Pu...","Aguas Buenas, , popularly known as ""La Ciudad ..."
4,Aibonito.txt,Aibonito,18.140000,-66.266111,"https://en.wikipedia.org/wiki/Aibonito,_Puerto...",Aibonito is a small mountaintownandmunicipalit...
...,...,...,...,...,...,...
73,Vega Baja.txt,Vega Baja,18.446111,-66.387500,"https://en.wikipedia.org/wiki/Vega_Baja,_Puert...",Vega Baja is atownandmunicipalitylocated on th...
74,Vieques.txt,Vieques,18.116667,-65.416667,"https://en.wikipedia.org/wiki/Vieques,_Puerto_...","Vieques, officially Isla de Vieques, is an isl..."
75,Villalba.txt,Villalba,18.127222,-66.492222,"https://en.wikipedia.org/wiki/Villalba,_Puerto...","Villalba, originally known as Villa Alba, is a..."
76,Yabucoa.txt,Yabucoa,18.050556,-65.879444,"https://en.wikipedia.org/wiki/Yabucoa,_Puerto_...",Yabucoa is atownandmunicipalityin Puerto Ricol...


✅ Municipality data combined and saved to: /content/drive/My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/structured-information-from-datasets/structured-information-municipalities/municipality_data_combined.csv
