# 🏛️ **Cleaning Landmarks Dataset**
### **Ironhack Data Science and Machine Learning Bootcamp**
📅 **Date:** February 10, 2025  
📁 **Notebook:** `clean_landmarks.ipynb`  
👩‍💻 **Authors:** Ginosca Alejandro Dávila & Natanael Santiago Morales  

---

## **📌 Project Overview**
This project is part of **The Hitchhiker’s Guide to Puerto Rico**, an interactive **travel planning chatbot** designed to help visitors explore Puerto Rico based on their interests. The chatbot will recommend **landmarks, historical sites, and attractions**, while retrieving relevant **news articles** for additional context.

This notebook focuses on **extracting and structuring data from the Landmarks dataset**, which consists of **raw HTML pages saved from Wikipedia as text files**. Our goal is to **extract key information**, such as:
- **Landmark names**
- **Location coordinates (latitude & longitude)**
- **Physical address (if available)**
- **Municipality (if available)**
- **Short descriptions/summaries**
- **Landmark types/categories (if available)**
- **Wikipedia URL for reference**

### **🛠️ How This Data Will Be Used**
The cleaned landmarks dataset will be used for:
- ✅ **Chatbot Recommendations** – Providing users with landmark suggestions based on interests.  
- ✅ **Weather-Based Travel Warnings** – Checking if a landmark is affected by weather conditions using the OpenWeather API.  
- ✅ **Distance-Based Itinerary Planning** – Helping users create efficient travel routes.  
- ✅ **Structured Landmark Database** – Enabling efficient search and filtering of landmarks.  

---

## **📂 Dataset Description**
- **Source:** Wikipedia-extracted **landmarks text files**.  
- **Format:** `.zip` file containing multiple `.txt` files (each representing a landmark).  
- **Location:**  
  📁 `My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/data/landmarks.zip`  

---

## **🛠️ What We Are Doing in This Notebook**
✔ **Step 1:** Extract and inspect raw text files.  
✔ **Step 2:** Extract structured information directly from the raw HTML (before cleaning), including:
   - **Landmark names**
   - **Coordinates (if available)**
   - **Wikipedia URL**
   - **Landmark categories/types (if present in Infobox)**  
   - **Physical address** (if available in Infobox or main content)
   - **Municipality** (if explicitly mentioned)

✔ **Step 3:** Apply selective text cleaning and extract:
   - **Short descriptions** (from the first paragraph)
   - **Historical significance** (if available)
   - **Opening hours & fees** (if available)  

✔ **Step 4:** Save cleaned data in **CSV/JSON format** for later use.

---

## **💾 Project Structure**
📁 `My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/` → Main project folder  
📄 `clean_landmarks.ipynb` → Notebook for extracting and structuring landmarks data  
📁 `data/` → Contains `landmarks.zip` (raw dataset)  
📁 `cleaned data/cleaned landmarks data/` → Stores the **cleaned landmarks dataset**  


## 🔗 Mounting Google Drive

Since our dataset is stored in **Google Drive**, we need to **mount Google Drive** to access the project folder.

This will allow us to later extract the `landmarks.zip` file and inspect its contents.


In [1]:
from google.colab import drive

# 🔹 Mount Google Drive
drive.mount('/content/drive')


Mounted at /content/drive


## 📂 Extracting the Landmarks Dataset

Now that Google Drive is mounted, we will:

✔ Locate the `landmarks.zip` file inside the **data folder**.  
✔ Extract its contents inside the **same `data` folder**.  
✔ List the extracted `.txt` files to verify successful extraction.

This will allow us to inspect the raw dataset before cleaning it.
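
Before extracting, it can be useful to look inside the archive without writing anything to disk. The following is a minimal sketch, assuming the archive holds a top-level `landmarks/` folder of `.txt` files. The helper name `list_txt_members` and the in-memory demo archive are illustrative only and are not part of the project.

```python
import io
import zipfile

def list_txt_members(zip_source):
    """Return the sorted names of .txt members in a ZIP archive without extracting it."""
    with zipfile.ZipFile(zip_source, "r") as zf:
        return sorted(name for name in zf.namelist() if name.endswith(".txt"))

# Demo on a small in-memory archive (hypothetical file names and content)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("landmarks/el_morro.txt", "<html></html>")
    zf.writestr("landmarks/la_fortaleza.txt", "<html></html>")
    zf.writestr("landmarks/README.md", "not a landmark")

members = list_txt_members(buffer)
print(members)
```

In the notebook, the same helper could be called with `zip_path` instead of the in-memory buffer.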


In [None]:
import zipfile
import os

# 🔹 Define paths
data_folder = "/content/drive/My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/data"
zip_path = os.path.join(data_folder, "landmarks.zip")
extract_path = data_folder  # Extract directly inside 'data' folder
extracted_folder = os.path.join(data_folder, "landmarks")  # The expected extracted folder

# 🔹 Check if the folder is already extracted
if os.path.exists(extracted_folder) and len(os.listdir(extracted_folder)) > 0:
    print("✅ The 'landmarks' folder already exists. Skipping extraction.")
else:
    # Extract ZIP file
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(extract_path)

    # Verify extraction
    if os.path.exists(extracted_folder):
        files = os.listdir(extracted_folder)
        print(f"✅ Extraction successful! Total files extracted: {len(files)}")
        print("Sample files:", files[:10])  # Show first 10 files
    else:
        print("⚠️ Extraction failed. Check the file paths.")


✅ The 'landmarks' folder already exists. Skipping extraction.


## 📄 Previewing Full Landmark Files

Before deciding on the cleaning steps, we need to **fully inspect** some sample `.txt` files.

This will help us:

✔ Understand how the data is structured.  
✔ Identify **unnecessary metadata, scripts, or unwanted content**.  
✔ Determine whether **all files follow the same structure** or if different cleaning steps are needed.  

We will preview **a fixed set of files** to ensure that if we re-run the notebook after a runtime disconnect, we get the **same output** for better decision-making. 🚀


In [None]:
# 🔹 Re-list files in case of runtime reset (sorted, because os.listdir order is not guaranteed)
files = sorted(os.listdir(extracted_folder))

# 🔹 Select a fixed set of sample files (first 5 alphabetically)
sample_files = files[:5]  # Same 5 files on every run

# 🔹 Preview full content of selected files
for i, file in enumerate(sample_files, start=1):
    file_path = os.path.join(extracted_folder, file)

    # Read content
    with open(file_path, "r", encoding="utf-8") as f:
        content = f.read()

    # Print full content
    print(f"📂 Full Preview of File {i}: {file}")
    print("=" * 80)
    print(content)  # Display full content
    print("=" * 80)
    print("\n")  # Space between previews


📂 Full Preview of File 1: academia_del_perpetuo_socorro.txt
b'<!DOCTYPE html>\n<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-enabled skin-theme-clientpref-day vector-toc-available" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8">\n<title>Academia del Perpetuo Socorro - Wikipedia</title>\n<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-fea

### **📌 File Type and Content Overview**

The dataset consists of **`.txt` files**, each containing a raw **HTML page**. As the `<title>` tag in the preview shows, these are **Wikipedia page dumps**, including metadata, embedded scripts, and text content.

A review of the files shows:
- They start with an **HTML document structure** (`<!DOCTYPE html>`).
- They contain **metadata, JavaScript, and CSS links**.
- The main text content is embedded within HTML tags.

This confirms that **text extraction and cleaning** will be necessary before using the dataset for analysis.  
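
The preview begins with `b'<!DOCTYPE html>`, which suggests that each file stores the string representation of a Python `bytes` object rather than decoded text. If that holds for all files (an assumption worth checking), the content can be converted back to proper text with `ast.literal_eval` instead of replacing escape sequences one by one. The helper name `decode_bytes_repr` and the sample string below are hypothetical.

```python
import ast

def decode_bytes_repr(raw_text):
    """Convert the text form of a bytes literal back into a decoded string.

    Returns the input unchanged when it is not a valid bytes literal.
    """
    stripped = raw_text.strip()
    if not (stripped.startswith("b'") or stripped.startswith('b"')):
        return raw_text
    try:
        value = ast.literal_eval(stripped)
    except (ValueError, SyntaxError):
        return raw_text  # Truncated or malformed literal: leave as-is
    if not isinstance(value, bytes):
        return raw_text
    return value.decode("utf-8", errors="replace")

# Hypothetical sample in the same shape as the previewed files
sample = r"b'<title>Juana D\xc3\xadaz - Wikipedia</title>\n'"
print(decode_bytes_repr(sample))
```

Decoding the whole file this way would handle every escaped character at once, including IPA symbols in pronunciation guides, but it has not been validated against the full dataset here.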

---

### **📌 Preview of the First Five Files**

The previous output displays the raw contents of the first five text files. These files contain **full HTML pages**, typically including metadata, infoboxes, and structured data.

🚀 **Next Step:** Extract relevant information directly from the original files (**raw HTML**), focusing on key details such as:

- **Landmark names**
- **Coordinates (if available)**
- **Wikipedia URL**
- **Landmark categories/types (if present in Infobox)**
- **Physical address** (if available in Infobox or main content)
- **Municipality** (if explicitly mentioned)

Once this information has been **extracted**, we will proceed to the **cleaning phase** to retrieve additional structured details:

- **Short descriptions** (from the first paragraph)
- **Historical significance** (if available)
- **Opening hours & fees** (if available)

📌 **Current Focus:** Extracting structured information **directly from raw HTML** before moving on to text cleaning.


## 🏷️ Extracting Landmark Names

Now that we have previewed the structure of the landmark files, the next step is to **extract the landmark names**.

✔ Each file represents a **specific landmark** and its content is extracted from **Wikipedia**.  
✔ The **landmark name** is typically found in **the file name** and in **the article's title/header inside the file**.  
✔ We will extract **both the file name and the landmark name from the content** to compare and verify accuracy.  

### 🔹 **How We Will Store This Data**
- The extracted landmark names will be **stored in a structured table**.
- We will save the table in:  
  📁 `My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/structured-information-from-datasets/structured-information-landmarks/landmark_names.csv`
- Any files where we **could not extract a landmark name** will be logged in a separate file:  
  📁 `landmarks_missing_names.txt`

This structured data will help in **landmark recommendations, user search queries, and travel itinerary planning**. 🚀  
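
The comparison between the two name sources can be sketched on a single in-memory sample. This is an illustration under stated assumptions: the helper names `extract_title` and `name_from_filename` are hypothetical, and the sample HTML is a shortened stand-in for a real file.

```python
import html
import os
import re

TITLE_PATTERN = re.compile(r"<title>(.*?) - Wikipedia</title>", re.IGNORECASE | re.DOTALL)

def extract_title(page_html):
    """Return the article title from the <title> tag, or None if it is absent."""
    match = TITLE_PATTERN.search(page_html)
    if not match:
        return None
    return html.unescape(match.group(1)).strip()  # Resolve entities such as &amp;

def name_from_filename(file_name):
    """Build a fallback name from a file name such as 'academia_san_jorge.txt'."""
    stem = os.path.splitext(file_name)[0]
    return stem.replace("_", " ").title()

sample_html = "<head><title>Academia del Perpetuo Socorro - Wikipedia</title></head>"
print(extract_title(sample_html))
print(name_from_filename("academia_del_perpetuo_socorro.txt"))
```

The file-name fallback capitalizes every word ("Del" instead of "del"), which is one reason to prefer the title taken from the page content when it is available.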


In [None]:
import os
import pandas as pd
import re

# 🔹 Define paths
structured_data_folder = "/content/drive/My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/structured-information-from-datasets/structured-information-landmarks"
os.makedirs(structured_data_folder, exist_ok=True)  # Ensure the directory exists

output_csv = os.path.join(structured_data_folder, "landmark_names.csv")
missing_names_log = os.path.join(structured_data_folder, "landmarks_missing_names.txt")

# 🔹 Initialize storage lists
landmark_data = []
missing_files = []

# 🔹 Regular Expression Pattern for Landmark Name Extraction
landmark_name_pattern = r"<title>(.*?) - Wikipedia</title>"  # Extract from HTML title tag

# 🔹 Function to Fix Encoding Issues
def fix_encoding(text):
    try:
        # Step 1: Round-trip through UTF-8, dropping any characters that cannot be encoded
        fixed_text = text.encode("utf-8", errors="ignore").decode("utf-8")

        # Step 2: Replace known escaped byte sequences (literal text such as \xc3\xad) with the intended character
        fix_map = {
            "\\xc3\\xad": "í", "\\xc3\\xb1": "ñ", "\\xc3\\xa1": "á", "\\xc3\\xa9": "é",
            "\\xc3\\xb3": "ó", "\\xc3\\xba": "ú", "\\xc3\\x81": "Á", "\\xc3\\x89": "É",
            "\\xc3\\x8d": "Í", "\\xc3\\x93": "Ó", "\\xc3\\x9a": "Ú", "\\xc3\\x91": "Ñ"
        }
        for wrong, correct in fix_map.items():
            fixed_text = fixed_text.replace(wrong, correct)

        return fixed_text.strip()

    except Exception as e:
        print(f"⚠️ Encoding fix failed for: {text} → {e}")
        return text  # Return the original if fixing fails

# 🔹 Iterate through all landmark files
for file in os.listdir(extracted_folder):
    if file.endswith(".txt"):
        file_path = os.path.join(extracted_folder, file)
        file_name = file.replace(".txt", "").replace("_", " ").title()  # Format filename as a potential landmark name

        try:
            with open(file_path, "r", encoding="utf-8", errors="ignore") as f:
                content = f.read()

            # Extract the landmark name from the Wikipedia title
            match = re.search(landmark_name_pattern, content)
            landmark_name = match.group(1) if match else None

            # If extraction from content fails, fall back to the file name and log the file
            if not landmark_name:
                landmark_name = file_name  # Use file name as best guess
                missing_files.append(file)  # Record that no title was found in the content

            # 🔹 Fix encoding issues
            landmark_name = fix_encoding(landmark_name)

            # Store extracted data
            landmark_data.append({"File Name": file, "Landmark Name": landmark_name})

        except Exception as e:
            print(f"⚠️ Error processing {file}: {str(e)}")
            missing_files.append(file)

# 🔹 Convert to DataFrame
df_landmarks = pd.DataFrame(landmark_data)

# 🔹 Save extracted landmark names to CSV
df_landmarks.to_csv(output_csv, index=False, encoding="utf-8")

# 🔹 Save files with missing names to a log file **only if there are missing names**
if missing_files:
    with open(missing_names_log, "w", encoding="utf-8") as missing_file:
        for missing in missing_files:
            missing_file.write(missing + "\n")
    print(f"⚠️ Missing landmark name files logged in: {missing_names_log}")
else:
    print("✅ No missing landmark names detected. Skipping log file creation.")

# ✅ Display the extracted landmark names
from IPython.display import display

display(df_landmarks)  # Show DataFrame output

print(f"✅ Landmark names saved to: {output_csv}")


✅ No missing landmark names detected. Skipping log file creation.


Unnamed: 0,File Name,Landmark Name
0,academia_del_perpetuo_socorro.txt,Academia del Perpetuo Socorro
1,academia_interamericana_metro.txt,Academia Interamericana Metro
2,academia_maria_reina.txt,Academia Maria Reina
3,academia_san_jorge.txt,Academia San Jorge
4,adjuntas_barrio-pueblo.txt,Adjuntas barrio-pueblo
...,...,...
569,william_miranda_marín_botanical_and_cultural_...,William Miranda Marín Botanical and Cultural G...
570,world_war_ii.txt,World War II
571,yabucoa_barrio-pueblo.txt,Yabucoa barrio-pueblo
572,yauco_barrio-pueblo.txt,Yauco barrio-pueblo


✅ Landmark names saved to: /content/drive/My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/structured-information-from-datasets/structured-information-landmarks/landmark_names.csv


In [None]:
import os
import re
import pandas as pd
from bs4 import BeautifulSoup

# 🔹 Define paths
structured_data_folder = "/content/drive/My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/structured-information-from-datasets/structured-information-municipalities"
os.makedirs(structured_data_folder, exist_ok=True)  # Ensure the directory exists

output_csv = os.path.join(structured_data_folder, "municipality_brief_descriptions_cleaned_v2.csv")
municipalities_folder = "/content/drive/My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/data/municipalities"

# 🔹 List to store extracted data
data = []

# 🔹 Function to Fix Encoding Issues (same method used in landmark extraction)
def fix_encoding(text):
    try:
        # Step 1: Round-trip through UTF-8, dropping any characters that cannot be encoded
        fixed_text = text.encode("utf-8", errors="ignore").decode("utf-8")

        # Step 2: Replace known escaped byte sequences (literal text such as \xc3\xad) with the intended character
        fix_map = {
            "\\xc3\\xad": "í", "\\xc3\\xb1": "ñ", "\\xc3\\xa1": "á", "\\xc3\\xa9": "é",
            "\\xc3\\xb3": "ó", "\\xc3\\xba": "ú", "\\xc3\\x81": "Á", "\\xc3\\x89": "É",
            "\\xc3\\x8d": "Í", "\\xc3\\x93": "Ó", "\\xc3\\x9a": "Ú", "\\xc3\\x91": "Ñ"
        }
        for wrong, correct in fix_map.items():
            fixed_text = fixed_text.replace(wrong, correct)

        return fixed_text.strip()

    except Exception as e:
        print(f"⚠️ Encoding fix failed for: {text} → {e}")
        return text  # Return the original if fixing fails

# 🔹 Function to Extract the First Valid Paragraph
def extract_first_paragraph(content):
    soup = BeautifulSoup(content, "html.parser")

    # Find all paragraphs <p> in the main content
    paragraphs = soup.find_all("p")

    for para in paragraphs:
        text = para.get_text(strip=True)
        text = fix_encoding(text)  # Apply encoding fixes

        # Restore proper spacing
        text = re.sub(r"([a-z])([A-Z])", r"\1 \2", text)  # Fix missing spaces between words
        text = re.sub(r"\s+", " ", text)  # Remove extra spaces
        text = text.replace("\n", " ").strip()  # Remove new lines

        # Ensure paragraph is meaningful
        if len(text) > 100 and "." in text:
            return text  # Return the first valid paragraph
    return None  # Return None if no valid paragraph is found

# 🔹 Iterate through all text files in the municipalities folder
for filename in os.listdir(municipalities_folder):
    if filename.endswith(".txt"):
        file_path = os.path.join(municipalities_folder, filename)

        try:
            with open(file_path, "r", encoding="utf-8") as file:
                content = file.read()

            brief_description = extract_first_paragraph(content)

            # Store extracted data
            data.append({"File Name": filename, "Brief Description": brief_description if brief_description else "Not Found"})

        except Exception as e:
            print(f"⚠️ Error processing {filename}: {str(e)}")

# 🔹 Convert to DataFrame
df = pd.DataFrame(data)

# 🔹 Save extracted brief descriptions to CSV
df.to_csv(output_csv, index=False, encoding="utf-8")

# ✅ Display the extracted brief descriptions
from IPython.display import display
display(df)  # Show DataFrame output

print(f"✅ Municipality brief descriptions saved to: {output_csv}")


Unnamed: 0,File Name,Brief Description
0,Juana Díaz.txt,Juana Díaz(Spanish pronunciation:[\xcb\x88xwan...
1,Barceloneta.txt,Barceloneta(Spanish pronunciation:[ba\xc9\xbes...
2,Carolina.txt,Carolina(/\xcb\x8ck\xc3\xa6ro\xca\x8a\xcb\x88l...
3,Aguadilla.txt,Aguadilla(Spanish pronunciation:[a\xc9\xa3wa\x...
4,Aguas Buenas.txt,"Aguas Buenas, (Spanish pronunciation:[\xcb\x88..."
...,...,...
73,Patillas.txt,Patillas(Spanish pronunciation:[pa\xcb\x88ti\x...
74,Trujillo Alto.txt,Trujillo Alto(Spanish pronunciation:[t\xc9\xbe...
75,Barranquitas.txt,Barranquitas(Spanish pronunciation:[bara\xc5\x...
76,Juncos.txt,Juncos(Spanish pronunciation:[\xcb\x88xu\xc5\x...


✅ Municipality brief descriptions saved to: /content/drive/My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/structured-information-from-datasets/structured-information-municipalities/municipality_brief_descriptions_cleaned_v2.csv


## 📍 Extracting Landmark Coordinates  

Now that we have extracted the **landmark names**, the next step is to **extract their location coordinates**.  

✔ The **coordinates** (latitude & longitude) are embedded within the **raw HTML files**.  
✔ They can be found in different formats, such as **JSON objects, HTML metadata, and embedded map links**.  
✔ We will use **multiple regex patterns** to capture coordinates from these various sources.  

### 🔹 **How We Will Store This Data**  
- The extracted coordinates will be **stored in a structured table** with:  
  - **File Name**  
  - **Latitude**  
  - **Longitude**  
- The table will be saved in:  
  📁 `My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/structured-information-from-datasets/structured-information-landmarks/landmarks_coordinates.csv`  

- Any files where **coordinates were not found** will be logged separately:  
  📁 `landmarks_missing_coordinates.txt`  

This structured data will allow us to **map landmarks**, **provide location-based recommendations**, and **integrate with travel planning tools**. 🚀  
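
Because broad patterns can match unrelated numbers in a page, it may help to parse the most specific source first and then sanity-check the result. The sketch below reads the `wgCoordinates` JSON object and tests the values against a rough bounding box around Puerto Rico. The bounding-box limits are approximate assumptions, and the helper names `parse_wg_coordinates` and `in_puerto_rico` are hypothetical.

```python
import json
import re

WG_COORDINATES = re.compile(r'"wgCoordinates"\s*:\s*(\{.*?\})')

# Rough bounding box around Puerto Rico and its smaller islands (approximate, assumed values)
PR_LAT_RANGE = (17.5, 19.0)
PR_LON_RANGE = (-68.0, -65.0)

def parse_wg_coordinates(page_html):
    """Return (lat, lon) from the wgCoordinates JSON object, or None if unavailable."""
    match = WG_COORDINATES.search(page_html)
    if not match:
        return None
    try:
        coords = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    lat, lon = coords.get("lat"), coords.get("lon")
    if lat is None or lon is None:
        return None
    return float(lat), float(lon)

def in_puerto_rico(lat, lon):
    """Check whether a coordinate pair falls inside the assumed bounding box."""
    return (PR_LAT_RANGE[0] <= lat <= PR_LAT_RANGE[1]
            and PR_LON_RANGE[0] <= lon <= PR_LON_RANGE[1])

# Shortened stand-in for the script block of a saved page
sample = '{"wgPageName":"Example","wgCoordinates":{"lat":18.454444,"lon":-66.084722},"wgIsArticle":true}'
result = parse_wg_coordinates(sample)
print(result, in_puerto_rico(*result))
```

Rows that fail the bounding-box check could be routed to the missing-coordinates log for manual review rather than being saved as valid locations.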


In [None]:
import os
import re
import json
import pandas as pd

# 🔹 Define paths
structured_data_folder = "/content/drive/My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/structured-information-from-datasets/structured-information-landmarks"
os.makedirs(structured_data_folder, exist_ok=True)  # Ensure the directory exists

output_csv = os.path.join(structured_data_folder, "landmarks_coordinates.csv")
missing_files_log = os.path.join(structured_data_folder, "landmarks_missing_coordinates.txt")

landmarks_folder = "/content/drive/My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/data/landmarks"

# 🔹 List to store extracted data
data = []
all_files = set()  # Store all filenames
missing_landmarks = set()  # Store files with missing coordinates

# Define regex patterns for extracting coordinates
coordinate_patterns = [
    r'"wgCoordinates"\s*:\s*({.*?})',  # JSON dictionary inside "wgCoordinates"
    r'"lat"\s*:\s*(-?\d+\.\d+)\s*,\s*"lon"\s*:\s*(-?\d+\.\d+)',  # "lat" and "lon"
    r'"latitude"\s*:\s*(-?\d+\.\d+)\s*,\s*"longitude"\s*:\s*(-?\d+\.\d+)',  # "latitude" and "longitude"
    r'"latLng"\s*:\s*\[\s*(-?\d+\.\d+)\s*,\s*(-?\d+\.\d+)\s*\]',  # "latLng" array
    r'LatLng\s*\(\s*(-?\d+\.\d+)\s*,\s*(-?\d+\.\d+)\s*\)',  # Google Maps `LatLng()`
    r'coordinates\s*:\s*\[\s*(-?\d+\.\d+)\s*,\s*(-?\d+\.\d+)\s*\]',  # Generic "coordinates" array
    r'data-location\s*=\s*"\s*(-?\d+\.\d+)\s*,\s*(-?\d+\.\d+)\s*"',  # HTML `data-location`
    r'<span class="geo">\s*(-?\d+\.\d+)\s*;\s*(-?\d+\.\d+)\s*</span>',  # Wikipedia geo <span>
    r'geo:lat" content="(-?\d+\.\d+)"[^>]*geo:long" content="(-?\d+\.\d+)"',  # HTML meta geo tags
    r'geohack\.toolforge\.org/.*?params=(-?\d+\.\d+)_(-?\d+\.\d+)',  # Wikipedia GeoHack links
    r'(\d{1,3}°\s*\d{1,2}\'\s*\d{1,2}(?:\.\d+)?["″]?\s*[NS]),\s*(\d{1,3}°\s*\d{1,2}\'\s*\d{1,2}(?:\.\d+)?["″]?\s*[EW])',  # DMS format
    r'(-?\d{1,3}\.\d+)\s*,\s*(-?\d{1,3}\.\d+)',  # Comma-separated decimal degrees
    r'geo\.position"\s*content="\s*(-?\d+\.\d+);\s*(-?\d+\.\d+)"',  # Geo meta position tags
    r'www\.openstreetmap\.org/\?mlat=(-?\d+\.\d+)&mlon=(-?\d+\.\d+)',  # OpenStreetMap URLs
    r'www\.google\.com/maps/@(-?\d+\.\d+),(-?\d+\.\d+),\d+z',  # Google Maps URL parameters
    r'UTM\s*Zone\s*\d+\s*[NS]\s*Easting:\s*\d+\s*Northing:\s*\d+',  # UTM coordinates
    r'"coordinates"\s*:\s*\[\s*(-?\d+\.\d+)\s*,\s*(-?\d+\.\d+)\s*\]',  # GeoJSON format
    r'([-+]\d{2,3}\.\d+)([-+]\d{2,3}\.\d+)',  # ISO 6709 Format
    r'<meta\s+property="place:location:latitude"\s+content="(-?\d+\.\d+)"[^>]*>',  # Facebook place location latitude
    r'<meta\s+property="place:location:longitude"\s+content="(-?\d+\.\d+)"[^>]*>',  # Facebook place location longitude
    r'<meta\s+name="ICBM"\s+content="(-?\d+\.\d+),\s*(-?\d+\.\d+)"',  # Deprecated but still used
    r'<meta\s+name="geo\.position"\s+content="(-?\d+\.\d+);\s*(-?\d+\.\d+)"',  # Another meta tag format
    r'data-lat\s*=\s*"(-?\d+\.\d+)"\s+data-lon\s*=\s*"(-?\d+\.\d+)"',  # HTML attributes for coordinates
    r'data-geo\s*=\s*"(-?\d+\.\d+),\s*(-?\d+\.\d+)"',  # HTML `data-geo`
    r'www\.google\.com/maps/embed\?pb=!1m\d+!1d(-?\d+\.\d+)!2d(-?\d+\.\d+)',  # Google Maps Embed URL
    r'L\.marker\(\[\s*(-?\d+\.\d+),\s*(-?\d+\.\d+)\s*\]\)',  # Leaflet.js Map Coordinates
    r'"geometry"\s*:\s*{"type":"Point","coordinates":\s*\[\s*(-?\d+\.\d+),\s*(-?\d+\.\d+)\s*\]}',  # Mapbox GeoJSON format
    r'www\.google\.com/maps/api/staticmap\?.*?center=(-?\d+\.\d+),(-?\d+\.\d+)',  # Google Static Maps API
    r'maps\.apple\.com/\?ll=(-?\d+\.\d+),(-?\d+\.\d+)',  # Apple Maps URLs
    r'maps\.yahoo\.com/#lat=(-?\d+\.\d+)&lon=(-?\d+\.\d+)',  # Yahoo Maps URLs
    r'www\.bing\.com/maps\?v=\d+&where1=(-?\d+\.\d+),(-?\d+\.\d+)',  # Bing Maps URLs
    r'P625"\s*:\s*\{"type":"Point","coordinates":\s*\[\s*(-?\d+\.\d+),\s*(-?\d+\.\d+)\s*\]\}',  # Wikidata GeoCoordinates
    r'<coordinates>\s*(-?\d+\.\d+),\s*(-?\d+\.\d+),?\s*(-?\d+\.\d+)?\s*</coordinates>',  # KML (Google Earth)
    r'<trkpt\s+lat="(-?\d+\.\d+)"\s+lon="(-?\d+\.\d+)"'  # GPX (GPS Data)
]

# 🔹 Function to extract coordinates from content
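def extract_coordinates(content):
    for pattern in coordinate_patterns:
        match = re.search(pattern, content)
        if not match:
            continue
        groups = match.groups()
        try:
            if len(groups) == 1:
                # Single group: expected to be a JSON object such as wgCoordinates
                coordinates = json.loads(groups[0])
                if not isinstance(coordinates, dict):
                    continue  # e.g. a lone latitude captured from a meta tag
                lat = coordinates.get("lat", coordinates.get("latitude"))
                lon = coordinates.get("lon", coordinates.get("longitude"))
            elif len(groups) >= 2:
                # Two or more groups: treat the first two as latitude and longitude
                lat, lon = float(groups[0]), float(groups[1])
            else:
                continue  # Pattern without capture groups (e.g. UTM) gives no usable values
        except (ValueError, TypeError):
            continue  # Invalid JSON or non-decimal text (e.g. DMS); try the next pattern
        if lat is not None and lon is not None:
            return lat, lon
    return None, None  # No valid coordinates found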

# 🔹 Iterate through all text files in the landmarks folder
for filename in os.listdir(landmarks_folder):
    if filename.endswith(".txt"):
        all_files.add(filename.replace(".txt", ""))  # Store filenames without .txt
        file_path = os.path.join(landmarks_folder, filename)

        try:
            with open(file_path, "r", encoding="utf-8") as file:
                content = file.read()

            lat, lon = extract_coordinates(content)

            if lat is None or lon is None:
                missing_landmarks.add(filename.replace(".txt", ""))  # Mark as missing

            # Store extracted data (even if coordinates are missing)
            data.append({"File Name": filename, "Latitude": lat if lat is not None else "", "Longitude": lon if lon is not None else ""})

        except Exception as e:
            print(f"⚠️ Error processing {filename}: {str(e)}")
            missing_landmarks.add(filename.replace(".txt", ""))  # Mark as missing

# 🔹 Convert to DataFrame
df = pd.DataFrame(data)

# 🔹 Save extracted landmark coordinates to CSV
df.to_csv(output_csv, index=False, encoding="utf-8")

# 🔹 Save missing landmarks **only if there are missing coordinates**
if missing_landmarks:
    with open(missing_files_log, "w", encoding="utf-8") as missing_file:
        for landmark in sorted(missing_landmarks):
            missing_file.write(landmark + "\n")
    print(f"⚠️ Missing landmark coordinates saved to: {missing_files_log}")
else:
    print("✅ No missing landmarks detected. Skipping log file creation.")

# ✅ Display the extracted landmark coordinates
from IPython.display import display
display(df)  # Show DataFrame output

print(f"✅ Landmarks coordinates saved to: {output_csv}")


⚠️ Missing landmark coordinates saved to: /content/drive/My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/structured-information-from-datasets/structured-information-landmarks/landmarks_missing_coordinates.txt


Unnamed: 0,File Name,Latitude,Longitude
0,academia_del_perpetuo_socorro.txt,18.454444,-66.084722
1,academia_interamericana_metro.txt,18.448531,-66.072122
2,academia_maria_reina.txt,18.383442,-66.085516
3,academia_san_jorge.txt,18.450556,-66.061667
4,adjuntas_barrio-pueblo.txt,18.163776,-66.723544
...,...,...,...
569,william_miranda_marín_botanical_and_cultural_...,18.241389,-66.061667
570,world_war_ii.txt,,
571,yabucoa_barrio-pueblo.txt,18.047304,-65.880083
572,yauco_barrio-pueblo.txt,18.036342,-66.84947


✅ Landmarks coordinates saved to: /content/drive/My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/structured-information-from-datasets/structured-information-landmarks/landmarks_coordinates.csv


## 🏙️ Extracting Municipality from Coordinates  

Now that we have extracted the **landmark coordinates**, the next step is to **determine the municipality in Puerto Rico** where each landmark is located.  

✔ Each landmark has **latitude and longitude** extracted from the raw HTML files.  
✔ We will use **reverse geocoding** to find the **municipality name** from these coordinates.  
✔ The extracted municipality will be **stored in a structured table**, using the same list of landmarks.  

### 🔹 **How We Will Store This Data**  
- The extracted municipalities will be **stored in a structured table** with:  
  - **File Name**  
  - **Municipality**  
- The table will be saved in:  
  📁 `My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/structured-information-from-datasets/structured-information-landmarks/landmarks_municipalities.csv`  

- Any files where **coordinates were missing or the municipality was not found** will be logged separately:  
  📁 `landmarks_missing_municipalities.txt`  

This structured data will allow us to **group landmarks by municipality**, **filter search results**, and **provide location-specific recommendations** for travelers. 🌍  
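
Reverse geocoding hundreds of landmarks one request at a time is slow, and public geocoding services generally ask clients to limit and cache their requests. One option is to cache results by rounded coordinates so that nearby or repeated points trigger only one lookup. The sketch below is a minimal illustration: `make_cached_lookup` is a hypothetical helper, and `fake_lookup` stands in for a real API call so the example runs without network access.

```python
def make_cached_lookup(lookup, precision=4):
    """Wrap a (lat, lon) -> municipality function with a cache keyed on rounded coordinates."""
    cache = {}

    def cached(lat, lon):
        key = (round(lat, precision), round(lon, precision))
        if key not in cache:
            cache[key] = lookup(lat, lon)  # Only call the underlying lookup on a cache miss
        return cache[key]

    return cached

# Stand-in for a real reverse-geocoding request (hypothetical, no network access)
calls = []

def fake_lookup(lat, lon):
    calls.append((lat, lon))
    return "San Juan"

lookup = make_cached_lookup(fake_lookup)
first = lookup(18.454444, -66.084722)
second = lookup(18.454441, -66.084719)  # Rounds to the same key, so no second call
print(first, second, len(calls))
```

Four decimal places correspond to roughly ten meters, so the rounding is unlikely to move a landmark across a municipal boundary, although that trade-off is an assumption to revisit for points near borders.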


In [None]:
import os
import pandas as pd
import requests
import time

# 📁 Define paths
structured_data_folder = "/content/drive/My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/structured-information-from-datasets/structured-information-landmarks"
os.makedirs(structured_data_folder, exist_ok=True)  # Ensure the directory exists

input_csv = os.path.join(structured_data_folder, "landmarks_coordinates.csv")  # File with extracted coordinates
output_csv = os.path.join(structured_data_folder, "landmarks_municipalities.csv")
missing_municipalities_log = os.path.join(structured_data_folder, "landmarks_missing_municipalities.txt")

# 🌍 API User-Agent (Required to avoid 403 errors)
HEADERS = {
    "User-Agent": "LandmarkMapper/1.0 (ginosca23@gmail.com)"
}

# 🔹 Load extracted coordinates
df = pd.read_csv(input_csv)

# 📜 List to store extracted municipality data
municipality_data = []
missing_municipalities = set()

# 🔍 Function to get the municipality from coordinates using Nominatim API
def get_municipality(lat, lon):
    url = f"https://nominatim.openstreetmap.org/reverse?lat={lat}&lon={lon}&format=json&addressdetails=1"

    try:
        response = requests.get(url, headers=HEADERS, timeout=2)  # Added timeout to avoid long waits
        response.raise_for_status()  # Raise error if response is unsuccessful
        data = response.json()

        # Extract municipality from response
        address = data.get("address", {})
        municipality = address.get("town") or address.get("city") or address.get("village") or address.get("county")

        return municipality.strip() if municipality else None

    except requests.exceptions.RequestException as e:
        print(f"⚠️ Request failed for ({lat}, {lon}): {e}")
        return None

# 🔄 Iterate through the dataset
for index, row in df.iterrows():
    file_name = row["File Name"]
    lat, lon = row["Latitude"], row["Longitude"]

    if pd.notna(lat) and pd.notna(lon):
        municipality = get_municipality(lat, lon)

        if not municipality:
            missing_municipalities.add(file_name.replace(".txt", ""))  # Mark as missing

        municipality_data.append({"File Name": file_name, "Municipality": municipality if municipality else ""})

    else:
        # If no coordinates available, mark as missing
        missing_municipalities.add(file_name.replace(".txt", ""))
        municipality_data.append({"File Name": file_name, "Municipality": ""})

    time.sleep(1)  # ⏳ Add delay to avoid API rate limits

# 📊 Convert to DataFrame
df_municipality = pd.DataFrame(municipality_data)

# 💾 Save extracted municipalities to CSV
df_municipality.to_csv(output_csv, index=False, encoding="utf-8")

# 📜 Save missing municipalities **only if there are missing entries**
if missing_municipalities:
    with open(missing_municipalities_log, "w", encoding="utf-8") as missing_file:
        for landmark in sorted(missing_municipalities):
            missing_file.write(landmark + "\n")
    print(f"⚠️ Missing municipality data saved to: {missing_municipalities_log}")
else:
    print("✅ No missing municipality data detected. Skipping log file creation.")

# ✅ Display the extracted municipalities
from IPython.display import display
display(df_municipality)  # Show DataFrame output

print(f"✅ Landmarks municipalities saved to: {output_csv}")


⚠️ Missing municipality data saved to: /content/drive/My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/structured-information-from-datasets/structured-information-landmarks/landmarks_missing_municipalities.txt


Unnamed: 0,File Name,Municipality
0,academia_del_perpetuo_socorro.txt,San Juan
1,academia_interamericana_metro.txt,San Juan
2,academia_maria_reina.txt,San Juan
3,academia_san_jorge.txt,Río Piedras
4,adjuntas_barrio-pueblo.txt,Adjuntas
...,...,...
569,william_miranda_marín_botanical_and_cultural_...,Caguas
570,world_war_ii.txt,
571,yabucoa_barrio-pueblo.txt,Yabucoa
572,yauco_barrio-pueblo.txt,Yauco


✅ Landmarks municipalities saved to: /content/drive/My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/structured-information-from-datasets/structured-information-landmarks/landmarks_municipalities.csv


## 🔗 Extracting Wikipedia URLs  

Now that we have extracted the **landmark names**, **coordinates**, and **municipalities**, the next step is to **extract the Wikipedia URLs**.  

✔ The **Wikipedia URL** is typically embedded in the **HTML content of each file**.  
✔ It can often be found in **metadata tags, references, or direct Wikipedia links within the content**.  
✔ We will use **regular expressions** to extract the most relevant Wikipedia URL.  

### 🔹 **How We Will Store This Data**  
- The extracted Wikipedia URLs will be **stored in a structured table** with:  
  - **File Name**  
  - **Wikipedia URL**  
- The table will be saved in:  
  📁 `My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/structured-information-from-datasets/structured-information-landmarks/landmarks_wikipedia_urls.csv`  

- Any files where **a Wikipedia URL was not found** will be logged separately:  
  📁 `landmarks_missing_wikipedia_urls.txt`  

This data will allow us to **provide direct references to Wikipedia articles**, enhancing the **information available for each landmark**. 🌍  
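
When a file links to several Wikipedia articles, the order in which patterns are tried decides which URL is kept. The sketch below is one possible ordering, not the notebook's method: it prefers the `<link rel="canonical">` tag and falls back to the first Wikipedia URL found anywhere. The helper name `pick_wikipedia_url` and the sample HTML are made up for illustration.

```python
import re

# Canonical link tag: the captured group is the article's own URL
CANONICAL_RE = re.compile(
    r'<link rel="canonical" href="(https?://[a-z]{2,3}\.wikipedia\.org/wiki/[^\s"<>#]+)"'
)
# Fallback: any Wikipedia article URL, stopping before quotes, spaces, or a #fragment
ANY_WIKI_RE = re.compile(r'https?://[a-z]{2,3}\.wikipedia\.org/wiki/[^\s"<>#]+')

def pick_wikipedia_url(content):
    """Return the canonical URL if present, else the first Wikipedia URL, else None."""
    canonical = CANONICAL_RE.search(content)
    if canonical:
        return canonical.group(1)  # Only the URL, not the surrounding tag
    fallback = ANY_WIKI_RE.search(content)
    return fallback.group(0) if fallback else None

# Made-up snippet: an unrelated link appears before the canonical tag
sample_html = (
    '<a href="https://en.wikipedia.org/wiki/Puerto_Rico">Puerto Rico</a>'
    '<link rel="canonical" href="https://en.wikipedia.org/wiki/Academia_San_Jorge"/>'
)
print(pick_wikipedia_url(sample_html))
```

With this ordering, an unrelated link that appears earlier in the file does not displace the article's own URL.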


In [None]:
import os
import re
import pandas as pd

# 🔹 Define paths
structured_data_folder = "/content/drive/My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/structured-information-from-datasets/structured-information-landmarks"
os.makedirs(structured_data_folder, exist_ok=True)  # Ensure the directory exists

output_csv = os.path.join(structured_data_folder, "landmarks_wikipedia_urls.csv")
missing_urls_log = os.path.join(structured_data_folder, "landmarks_missing_wikipedia_urls.txt")

landmarks_folder = "/content/drive/My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/data/landmarks"

# 🔹 List to store extracted data
data = []
all_files = set()  # Store all filenames
missing_wikipedia_urls = set()  # Store files with missing Wikipedia URLs

# 🔹 Regular Expression Patterns for Extracting Wikipedia URL
# Patterns are tried in order; the first one that matches anywhere in the file wins.
# Note: the first pattern also matches the URL inside every tag-specific pattern below,
# so in practice the tag-specific patterns are never reached.
wikipedia_patterns = [
    r'https?://[a-z]{2,3}\.wikipedia\.org/wiki/[^\s"<>#]+',  # Standard Wikipedia URL
    r'www\.wikipedia\.org/wiki/[^\s"<>#]+',  # Wikipedia URL without protocol
    r'<link rel="canonical" href="(https?://[a-z]{2,3}\.wikipedia\.org/wiki/[^\s"<>#]+)"',  # Canonical link
    r'<meta property="og:url" content="(https?://[a-z]{2,3}\.wikipedia\.org/wiki/[^\s"<>#]+)"',  # Open Graph URL
    r'<a href="(https?://[a-z]{2,3}\.wikipedia\.org/wiki/[^\s"<>#]+)"',  # Hyperlink to Wikipedia
]

# 🔹 Function to extract Wikipedia URL from content
def extract_wikipedia_url(content):
    for pattern in wikipedia_patterns:
        match = re.search(pattern, content)
        if match:
            # Return the captured URL when the pattern has a group, else the whole match
            return match.group(1) if match.groups() else match.group(0)
    return None  # No valid Wikipedia URL found

# 🔹 Iterate through all text files in the landmarks folder
for filename in os.listdir(landmarks_folder):
    if filename.endswith(".txt"):
        all_files.add(filename.replace(".txt", ""))  # Store filenames without .txt
        file_path = os.path.join(landmarks_folder, filename)

        try:
            with open(file_path, "r", encoding="utf-8") as file:
                content = file.read()

            wikipedia_url = extract_wikipedia_url(content)

            if not wikipedia_url:
                missing_wikipedia_urls.add(filename.replace(".txt", ""))  # Mark as missing

            # Store extracted data (even if Wikipedia URL is missing)
            data.append({"File Name": filename, "Wikipedia URL": wikipedia_url if wikipedia_url else ""})

        except Exception as e:
            print(f"⚠️ Error processing {filename}: {str(e)}")
            missing_wikipedia_urls.add(filename.replace(".txt", ""))  # Mark as missing

# 🔹 Convert to DataFrame
df = pd.DataFrame(data)

# 🔹 Save extracted Wikipedia URLs to CSV
df.to_csv(output_csv, index=False, encoding="utf-8")

# 🔹 Save missing Wikipedia URLs **only if there are missing entries**
if missing_wikipedia_urls:
    with open(missing_urls_log, "w", encoding="utf-8") as missing_file:
        for landmark in sorted(missing_wikipedia_urls):
            missing_file.write(landmark + "\n")
    print(f"⚠️ Missing Wikipedia URLs saved to: {missing_urls_log}")
else:
    print("✅ No missing Wikipedia URLs detected. Skipping log file creation.")

# ✅ Display the extracted Wikipedia URLs
from IPython.display import display
display(df)  # Show DataFrame output

print(f"✅ Landmarks Wikipedia URLs saved to: {output_csv}")


✅ No missing Wikipedia URLs detected. Skipping log file creation.


Unnamed: 0,File Name,Wikipedia URL
0,academia_del_perpetuo_socorro.txt,https://en.wikipedia.org/wiki/Academia_del_Per...
1,academia_interamericana_metro.txt,https://en.wikipedia.org/wiki/Academia_Interam...
2,academia_maria_reina.txt,https://en.wikipedia.org/wiki/Academia_Maria_R...
3,academia_san_jorge.txt,https://en.wikipedia.org/wiki/Academia_San_Jorge
4,adjuntas_barrio-pueblo.txt,https://en.wikipedia.org/wiki/Adjuntas_barrio-...
...,...,...
569,william_miranda_marín_botanical_and_cultural_...,https://en.wikipedia.org/wiki/William_Miranda_...
570,world_war_ii.txt,https://en.wikipedia.org/wiki/World_War_II
571,yabucoa_barrio-pueblo.txt,https://en.wikipedia.org/wiki/Yabucoa_barrio-p...
572,yauco_barrio-pueblo.txt,https://en.wikipedia.org/wiki/Yauco_barrio-pueblo


✅ Landmarks Wikipedia URLs saved to: /content/drive/My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/structured-information-from-datasets/structured-information-landmarks/landmarks_wikipedia_urls.csv


## 📖 Extracting Brief Descriptions from Landmark Files

Before performing full data cleaning, we will **extract the first paragraph** from each landmark file as the **Brief Description**.

✔ The **brief description** is typically the **first meaningful paragraph** in each file.  
✔ We will **skip paragraphs of 100 characters or fewer** (such as empty or one-line paragraphs) and **keep the first longer paragraph**.  
✔ The extracted data will be **stored in a structured table** for easy reference.  

### 🔹 **How We Will Store This Data**  
- The extracted brief descriptions will be **stored in a structured table** with:  
  - **File Name**  
  - **Brief Description** (First paragraph of each file)  
- The table will be saved in:  
  📁 `My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/structured-information-from-datasets/structured-information-landmarks/landmark_brief_descriptions_v1.csv`  

This structured dataset will be useful for **chatbot responses** and **travel recommendations**. 🚀
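
The selection rule can be sketched without third-party packages. The example below swaps in the standard library's `html.parser` in place of BeautifulSoup, purely for illustration. It collects the text of each `<p>`, joins the text chunks exactly as they appear so that spaces around inline tags survive, and returns the first paragraph longer than a threshold. `ParagraphCollector`, `first_long_paragraph`, and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser

class ParagraphCollector(HTMLParser):
    """Collect the text of each <p> element, keeping spaces around inline tags."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._parts = None  # List of text chunks while inside a <p>, else None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._parts = []

    def handle_endtag(self, tag):
        if tag == "p" and self._parts is not None:
            # Join chunks unchanged, then collapse runs of whitespace
            self.paragraphs.append(" ".join("".join(self._parts).split()))
            self._parts = None

    def handle_data(self, data):
        if self._parts is not None:
            self._parts.append(data)

def first_long_paragraph(html_text, min_chars=100):
    """Return the first paragraph longer than min_chars, or None."""
    collector = ParagraphCollector()
    collector.feed(html_text)
    collector.close()
    for paragraph in collector.paragraphs:
        if len(paragraph) > min_chars:
            return paragraph
    return None

# Made-up sample: a short stub followed by a paragraph containing inline tags
sample = (
    "<p>Short stub.</p>"
    "<p>The <b>Example Landmark</b> is a made-up <a href='#'>historic site</a> "
    "used only to illustrate how inline tags split a paragraph into several text chunks.</p>"
)
print(first_long_paragraph(sample))
```

Because the chunks are joined before any stripping, "The" and "Example Landmark" stay separated by the space that was present in the HTML.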


In [None]:
import os
import re
import pandas as pd
from bs4 import BeautifulSoup
import unicodedata

# 🔹 Define paths
structured_data_folder = "/content/drive/My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/structured-information-from-datasets/structured-information-landmarks"
os.makedirs(structured_data_folder, exist_ok=True)  # Ensure the directory exists

output_csv = os.path.join(structured_data_folder, "landmark_brief_descriptions_v1.csv")
landmarks_folder = "/content/drive/My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/data/landmarks"

# 🔹 List to store extracted data
data = []

# 🔹 Function to clean text encoding and spacing
def clean_text(text):
    # Normalize Unicode to NFKC form (unifies compatibility characters).
    # This does not decode literal escape sequences such as \xc3\xb1; those are fixed in a later step.
    text = unicodedata.normalize("NFKC", text)

    # Remove all parenthesized text (e.g., IPA pronunciations; other parentheticals are dropped too)
    text = re.sub(r"\(.*?\)", "", text)
    text = re.sub(r"\s+", " ", text).strip()  # Collapse whitespace (including newlines) to single spaces

    return text

# 🔹 Function to extract the first paragraph
def extract_first_paragraph(content):
    soup = BeautifulSoup(content, "html.parser")

    # Find all paragraphs <p> in the main content
    paragraphs = soup.find_all("p")

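    for para in paragraphs:
        # Note: strip=True strips each text node and joins them with no separator,
        # which is why words around inline tags run together (e.g., "TheAcademia")
        text = para.get_text(strip=True)
        text = clean_text(text)  # Clean extracted text
        if len(text) > 100:  # Skip short paragraphs (100 characters or fewer)
            return text  # Return the first relevant paragraph
    return None  # Return None if no paragraph is long enough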

# 🔹 Iterate through all text files in the landmarks folder
for filename in os.listdir(landmarks_folder):
    if filename.endswith(".txt"):
        file_path = os.path.join(landmarks_folder, filename)

        try:
            with open(file_path, "r", encoding="utf-8") as file:
                content = file.read()

            brief_description = extract_first_paragraph(content)

            # Store extracted data
            data.append({"File Name": filename, "Brief Description": brief_description if brief_description else "Not Found"})

        except Exception as e:
            print(f"⚠️ Error processing {filename}: {str(e)}")

# 🔹 Convert to DataFrame
df = pd.DataFrame(data)

# 🔹 Save extracted brief descriptions to CSV
df.to_csv(output_csv, index=False, encoding="utf-8")

# ✅ Display the extracted brief descriptions
from IPython.display import display
display(df)  # Show DataFrame output

print(f"✅ Landmark brief descriptions saved to: {output_csv}")


Unnamed: 0,File Name,Brief Description
0,academia_del_perpetuo_socorro.txt,Academia del Perpetuo Socorro was founded in 1...
1,academia_interamericana_metro.txt,TheAcademia Interamericana Metro was founded i...
2,academia_maria_reina.txt,Academia Maria Reina is a school of theSisters...
3,academia_san_jorge.txt,"Academia San Jorge is a private,Roman Catholic..."
4,adjuntas_barrio-pueblo.txt,Adjuntas barrio-pueblois abarrioand the admini...
...,...,...
569,william_miranda_marín_botanical_and_cultural_...,TheWilliam Miranda Mar\xc3\xadn Botanical and ...
570,world_war_ii.txt,World War II[b]or theSecond World War was aglo...
571,yabucoa_barrio-pueblo.txt,Yabucoa barrio-pueblois abarrioand the adminis...
572,yauco_barrio-pueblo.txt,Yauco barrio-pueblois abarrioand the administr...


✅ Landmark brief descriptions saved to: /content/drive/My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/structured-information-from-datasets/structured-information-landmarks/landmark_brief_descriptions_v1.csv


## 📖 Improving Landmark Brief Descriptions Encoding

In this step, we will **improve the encoding** of the extracted **brief descriptions** in `landmark_brief_descriptions_v1.csv` to handle any misencoded characters.

✔ We will **fix common encoding issues** such as characters with accents, diacritical marks, and symbols that were misrepresented during the initial extraction.  
✔ This will ensure that the descriptions are displayed correctly and consistently.  
✔ The improved data will be **saved in a new version** of the file: `landmark_brief_descriptions_v2.csv`.  

### 🔹 **How We Will Handle This**  
- We will apply a **character replacement map** (`fix_map`) that replaces literal byte-escape sequences left in the text (e.g., `\xc3\xad`) with the characters they encode (e.g., `í`).  
- The improved data will be **saved** in:  
  📁 `My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/structured-information-from-datasets/structured-information-landmarks/landmark_brief_descriptions_v2.csv`  

This version will be useful for further processing and chatbot responses. 🚀
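
A fixed map only covers the sequences listed in it. As a complement, runs of literal `\xNN` escapes can be decoded generically. The sketch below assumes the escapes represent UTF-8 bytes; `decode_escaped_utf8` is a hypothetical helper and is not used elsewhere in this notebook.

```python
import re

# One or more consecutive literal "\xNN" escapes, e.g. the 8 characters \xc3\xad
ESCAPED_BYTES = re.compile(r"(?:\\x[0-9a-fA-F]{2})+")

def decode_escaped_utf8(text):
    """Decode runs of literal \\xNN escapes as UTF-8; leave invalid runs unchanged."""
    def _decode(match):
        # "\xc3\xad" -> "c3ad" -> b"\xc3\xad"
        raw = bytes.fromhex(match.group(0).replace("\\x", ""))
        try:
            return raw.decode("utf-8")
        except UnicodeDecodeError:
            return match.group(0)  # Not valid UTF-8: keep the original text
    return ESCAPED_BYTES.sub(_decode, text)

sample = "TheWilliam Miranda Mar\\xc3\\xadn Botanical"
print(decode_escaped_utf8(sample))
```

Runs that do not form valid UTF-8 are left as they are, so the function can be applied to a whole column without raising.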


In [3]:
import pandas as pd

# 🔹 Load the CSV file
input_csv = '/content/drive/My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/structured-information-from-datasets/structured-information-landmarks/landmark_brief_descriptions_v1.csv'
df = pd.read_csv(input_csv)

# 🔹 Function to Fix Encoding Issues
def fix_encoding(text):
    try:
        # Step 1: Round-trip through UTF-8, dropping any characters that cannot be encoded
        fixed_text = text.encode("utf-8", errors="ignore").decode("utf-8")

        # Step 2: Replace literal byte-escape sequences with the characters they encode.
        # Keys use lowercase hex digits, the form seen in the extracted text (e.g., \xc3\xad).
        fix_map = {
            # Lowercase letters with accents, plus À, Â, Ç
            "\\xc3\\xa1": "á", "\\xc3\\xa9": "é", "\\xc3\\xad": "í", "\\xc3\\xb3": "ó", "\\xc3\\xba": "ú",
            "\\xc3\\xb1": "ñ", "\\xc3\\xa0": "à", "\\xc3\\xa2": "â", "\\xc3\\xa4": "ä", "\\xc3\\xa7": "ç",
            "\\xc3\\xb6": "ö", "\\xc3\\xbc": "ü", "\\xc3\\x80": "À", "\\xc3\\x82": "Â", "\\xc3\\x87": "Ç",

            # Uppercase letters with accents
            "\\xc3\\x81": "Á", "\\xc3\\x89": "É", "\\xc3\\x8d": "Í", "\\xc3\\x93": "Ó", "\\xc3\\x9a": "Ú",
            "\\xc3\\x91": "Ñ", "\\xc3\\x84": "Ä", "\\xc3\\x96": "Ö", "\\xc3\\x9c": "Ü", "\\xc3\\x8b": "Ë",
            "\\xc3\\x99": "Ù",

            # Other accented letters
            "\\xc3\\xb5": "õ", "\\xc3\\xb8": "ø", "\\xc3\\xb2": "ò", "\\xc3\\xb4": "ô",

            # Degree, copyright, and registered signs
            "\\xc2\\xb0": "°", "\\xc2\\xa9": "©", "\\xc2\\xae": "®",

            # Ã, soft hyphen, dashes, and decorative symbols
            "\\xc3\\x83": "Ã", "\\xc2\\xad": "\u00ad", "\\xe2\\x80\\x93": "–", "\\xe2\\x80\\x94": "—",
            "\\xe2\\x99\\xa5": "♥", "\\xe2\\x9c\\x94": "✔", "\\xe2\\x9d\\x8f": "❏", "\\xe2\\x97\\x8f": "●",

            # Quotation marks, ellipsis, fraction slash, combining keycap
            "\\xe2\\x80\\x9d": "”", "\\xe2\\x80\\x9c": "“", "\\xe2\\x80\\x98": "‘", "\\xe2\\x80\\x99": "’",
            "\\xe2\\x80\\xa6": "…", "\\xe2\\x81\\x84": "⁄", "\\xe2\\x83\\xa3": "\u20e3",

            # Other symbols
            "\\xe2\\x88\\x82": "∂", "\\xe2\\x9c\\x9d": "✝", "\\xc2\\xa1": "¡", "\\xc2\\xbf": "¿",
        }

        # Apply the encoding fixes from the map
        for wrong, correct in fix_map.items():
            fixed_text = fixed_text.replace(wrong, correct)

        return fixed_text.strip()

    except Exception as e:
        print(f"⚠️ Encoding fix failed for: {text} → {e}")
        return text  # Return the original if fixing fails

# 🔹 Apply encoding fix to the "Brief Description" column
df['Brief Description'] = df['Brief Description'].apply(fix_encoding)

# 🔹 Save the cleaned CSV as version 2
output_csv = '/content/drive/My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/structured-information-from-datasets/structured-information-landmarks/landmark_brief_descriptions_v2.csv'
df.to_csv(output_csv, index=False, encoding="utf-8")

# ✅ Display the updated DataFrame (for review)
from IPython.display import display
display(df)  # Show DataFrame output

print(f"✅ Encoding fixed and updated brief descriptions saved to: {output_csv}")


Unnamed: 0,File Name,Brief Description
0,academia_del_perpetuo_socorro.txt,Academia del Perpetuo Socorro was founded in 1...
1,academia_interamericana_metro.txt,TheAcademia Interamericana Metro was founded i...
2,academia_maria_reina.txt,Academia Maria Reina is a school of theSisters...
3,academia_san_jorge.txt,"Academia San Jorge is a private,Roman Catholic..."
4,adjuntas_barrio-pueblo.txt,Adjuntas barrio-pueblois abarrioand the admini...
...,...,...
569,william_miranda_marín_botanical_and_cultural_...,TheWilliam Miranda Marín Botanical and Cultura...
570,world_war_ii.txt,World War II[b]or theSecond World War was aglo...
571,yabucoa_barrio-pueblo.txt,Yabucoa barrio-pueblois abarrioand the adminis...
572,yauco_barrio-pueblo.txt,Yauco barrio-pueblois abarrioand the administr...


✅ Encoding fixed and updated brief descriptions saved to: /content/drive/My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/structured-information-from-datasets/structured-information-landmarks/landmark_brief_descriptions_v2.csv


## 📖 Cleaning Landmark Brief Descriptions

In this step, we will **clean** the **brief descriptions** in `landmark_brief_descriptions_v2.csv` to fix encoding issues, remove extra whitespace, and ensure consistent formatting.

✔ We will **decode HTML entities**, fix **special character encoding**, **correct concatenated words**, and **remove extra spaces** to ensure text consistency.  
✔ The cleaned descriptions will be **standardized**, ensuring that all special characters, spaces, and formatting are correctly applied.  
✔ The improved data will be **saved in a new version** of the file: `landmark_brief_descriptions_v3.csv`.  

### 🔹 **How We Will Handle This**  
- We will apply a **text-cleaning process** to:
  - Decode **HTML entities** (e.g., `&amp;` to `&`)
  - Replace problematic characters (e.g., `\xe2\x80\xb2` to `'`)
  - Correct **concatenated words** (e.g., `inMiramarinPuerto` to `in Miramar in Puerto`)
  - Remove **extra whitespace** and **line breaks**.
- The cleaned data will be **saved** in:  
  📁 `My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/structured-information-from-datasets/structured-information-landmarks/landmark_brief_descriptions_v3.csv`  

This version will be ready for further analysis and integration into the chatbot. 🚀
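
One conservative way to separate concatenated words is to insert a space only where a lowercase letter is immediately followed by an uppercase letter. The sketch below assumes that rule; `split_joined_words` is a hypothetical helper shown for comparison, not the cleaning function applied in this notebook.

```python
import re

def split_joined_words(text):
    """Insert a space between a lowercase ASCII letter and a following uppercase letter."""
    return re.sub(r"([a-z])([A-Z])", r"\1 \2", text)

print(split_joined_words("TheAcademia Interamericana Metro"))
print(split_joined_words("World War II or theSecond World War"))
```

This rule leaves runs of capitals such as "II" intact. It cannot separate words joined in lowercase (for example `abarrioand`), it ignores accented letters, and it would split names with internal capitals such as "McDonald".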


In [7]:
import pandas as pd
import html
import re

# 🔹 Load the encoding-fixed dataset (v2)
file_path = '/content/drive/My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/structured-information-from-datasets/structured-information-landmarks/landmark_brief_descriptions_v2.csv'
df = pd.read_csv(file_path)

# 🔹 Function to clean the text
def clean_text(text):
    if not isinstance(text, str):
        return text  # Skip if not a string

    # Decode HTML entities (e.g., &amp; to &)
    text = html.unescape(text)

    # Replace leftover literal byte-escape sequences not covered by fix_map.
    # Backslashes are doubled so that the literal text \xe2\x80\xb2 in the CSV is matched.
    text = text.replace("\\xe2\\x80\\xb2", "'")  # Prime (′), used as an apostrophe
    text = text.replace("\\xc3\\xa9", "é")  # Special character 'é'
    text = text.replace("\\xe2\\x80\\x93", "–")  # En dash (–)

    # Fix concatenated words: insert a space between any letter and a following capital letter.
    # Known side effect: runs of capitals are split too (e.g., "II" becomes "I I").
    text = re.sub(r'([a-zA-Z])([A-Z])', r'\1 \2', text)

    # Ensure space after punctuation marks (e.g., "inPonce,Puerto" -> "in Ponce, Puerto")
    text = re.sub(r'([a-zA-Z])([.,;!?])([A-Z])', r'\1\2 \3', text)

    # Ensure a single space after periods, commas, semicolons, and colons
    # (this also affects numbers, e.g., "1,000" becomes "1, 000")
    text = re.sub(r"\s*([.,;:])\s*", r"\1 ", text)

    # Remove extra whitespaces and newlines (combine multiple spaces into one and remove newlines)
    text = ' '.join(text.split()).strip()

    return text

# 🔹 Apply the cleaning function to the 'Brief Description' column
df['Brief Description'] = df['Brief Description'].apply(clean_text)

# 🔹 Save the cleaned data as a new CSV file: landmark_brief_descriptions_v3.csv
cleaned_file_path = '/content/drive/My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/structured-information-from-datasets/structured-information-landmarks/landmark_brief_descriptions_v3.csv'
df.to_csv(cleaned_file_path, index=False)

# ✅ Display the updated DataFrame (for review)
from IPython.display import display
display(df)  # Show DataFrame output

# ✅ Output the cleaned file path
print(f"✅ Cleaned file saved to: {cleaned_file_path}")


Unnamed: 0,File Name,Brief Description
0,academia_del_perpetuo_socorro.txt,Academia del Perpetuo Socorro was founded in 1...
1,academia_interamericana_metro.txt,The Academia Interamericana Metro was founded ...
2,academia_maria_reina.txt,Academia Maria Reina is a school of the Sister...
3,academia_san_jorge.txt,"Academia San Jorge is a private, Roman Catholi..."
4,adjuntas_barrio-pueblo.txt,Adjuntas barrio-pueblois abarrioand the admini...
...,...,...
569,william_miranda_marín_botanical_and_cultural_...,The William Miranda Marín Botanical and Cultur...
570,world_war_ii.txt,World War I I[b]or the Second World War was ag...
571,yabucoa_barrio-pueblo.txt,Yabucoa barrio-pueblois abarrioand the adminis...
572,yauco_barrio-pueblo.txt,Yauco barrio-pueblois abarrioand the administr...


✅ Cleaned file saved to: /content/drive/My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/structured-information-from-datasets/structured-information-landmarks/landmark_brief_descriptions_v3.csv


## ✅ **Finalizing Landmark Brief Descriptions**  

We have now processed the **brief descriptions** of Puerto Rican landmarks by:  

✔ **Extracting the first meaningful paragraph** from each landmark file.  
✔ **Fixing encoding issues** to ensure proper display of special characters.  
✔ **Cleaning up spacing and formatting issues** to improve readability.  

### 🔹 **Current Status**  
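The dataset `landmark_brief_descriptions_v3.csv` has been generated with **improved descriptions**. Some formatting issues remain: words joined in lowercase during extraction are not separated (e.g., `barrio-pueblois abarrioand`), and runs of capital letters are split (e.g., `World War I I`). We will proceed with this version for further integration.  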

### 🚀 **Next Steps**  
- This dataset is now **ready for chatbot integration** and **travel recommendations**.  
- If necessary, future refinements can be applied to address remaining formatting inconsistencies.  

📄 **Final Cleaned File Path:**  
📁 `/content/drive/My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/structured-information-from-datasets/structured-information-landmarks/landmark_brief_descriptions_v3.csv`  

✅ **This marks the completion of the landmark brief description preprocessing!** 🎉  


## 🔗 Merging Landmark Data into a Single Table  

Since our individual datasets contain information related to landmarks, we will **merge multiple CSV files** into a single structured dataset.  

### 🔹 **Files to be Merged**
| File Name | Description |
|-----------|------------|
| `landmark_names.csv` | List of landmark names linked to file names |
| `landmarks_coordinates.csv` | Latitude and longitude coordinates for each landmark |
| `landmarks_municipalities.csv` | Municipality for each landmark |
| `landmarks_wikipedia_urls.csv` | Wikipedia URLs for more detailed information |
| `landmark_brief_descriptions_v3.csv` | The extracted and cleaned brief descriptions |

### 🏗 **Final Table Structure**
| File Name | Landmark Name | Latitude | Longitude | Municipality | Wikipedia URL | Brief Description |
|-----------|---------------|----------|-----------|--------------|---------------|-------------------|
| academia_del_perpetuo_socorro.txt | Academia del Perpetuo Socorro | 18.454444 | -66.084722 | San Juan | [Wikipedia](URL) | Academia del Perpetuo Socorro was founded in 1... |
| academia_interamericana_metro.txt | Academia Interamericana Metro | 18.448531 | -66.072122 | San Juan | [Wikipedia](URL) | The Academia Interamericana Metro was founded ... |

### ✅ **Why This Is Useful**
- Provides **structured** and **centralized** data for chatbot responses.
- Location coordinates enable **mapping and distance calculations**.
- Wikipedia links allow **further exploration of landmarks**.
- The brief descriptions serve as **summaries** for user interactions.

The merged dataset will be **saved in the folder**:  
📁 `/structured-information-landmarks/landmark_data_combined.csv`
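
A left merge on `File Name` keeps every row of the left table and fills missing matches with `NaN`. The sketch below uses two toy DataFrames (made-up file names and URLs) to show how the `validate` and `indicator` parameters of `DataFrame.merge` can confirm that keys are unique and list the rows that found no match. It is an optional check, not part of the merge performed in this notebook.

```python
import pandas as pd

# Toy tables: "c.txt" has a name but no URL row
names = pd.DataFrame({
    "File Name": ["a.txt", "b.txt", "c.txt"],
    "Landmark Name": ["A", "B", "C"],
})
urls = pd.DataFrame({
    "File Name": ["a.txt", "b.txt"],
    "Wikipedia URL": ["https://en.wikipedia.org/wiki/A", "https://en.wikipedia.org/wiki/B"],
})

# validate raises MergeError if "File Name" is duplicated on either side;
# indicator adds a "_merge" column telling where each row came from
merged = names.merge(urls, on="File Name", how="left",
                     validate="one_to_one", indicator=True)

# Rows present only in the left table, i.e. landmarks without a URL
unmatched = merged.loc[merged["_merge"] == "left_only", "File Name"].tolist()
print(unmatched)

merged = merged.drop(columns="_merge")  # Remove the helper column before saving
```

Checking `unmatched` after each merge makes it easy to see which landmarks end up with empty cells in the combined table.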


In [11]:
import os
import pandas as pd
from IPython.display import display

# 🔹 Define paths
structured_data_folder = "/content/drive/My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/structured-information-from-datasets/structured-information-landmarks"
os.makedirs(structured_data_folder, exist_ok=True)  # Ensure directory exists

# 🔹 File paths for each dataset
names_csv = os.path.join(structured_data_folder, "landmark_names.csv")
coordinates_csv = os.path.join(structured_data_folder, "landmarks_coordinates.csv")
municipalities_csv = os.path.join(structured_data_folder, "landmarks_municipalities.csv")
urls_csv = os.path.join(structured_data_folder, "landmarks_wikipedia_urls.csv")
descriptions_csv = os.path.join(structured_data_folder, "landmark_brief_descriptions_v3.csv")
output_csv = os.path.join(structured_data_folder, "landmark_data_combined.csv")

# 🔹 Load CSV files
df_names = pd.read_csv(names_csv)  # Columns: File Name, Landmark Name
df_coordinates = pd.read_csv(coordinates_csv)  # Columns: File Name, Latitude, Longitude
df_municipalities = pd.read_csv(municipalities_csv)  # Columns: File Name, Municipality
df_urls = pd.read_csv(urls_csv)  # Columns: File Name, Wikipedia URL
df_descriptions = pd.read_csv(descriptions_csv)  # Columns: File Name, Brief Description

# 🔹 Merge datasets using "File Name" as the common key
df_merged = df_names.merge(df_coordinates, on="File Name", how="left") \
                    .merge(df_municipalities, on="File Name", how="left") \
                    .merge(df_urls, on="File Name", how="left") \
                    .merge(df_descriptions, on="File Name", how="left")

# 🔹 Save merged dataset
df_merged.to_csv(output_csv, index=False, encoding="utf-8")

# ✅ Display merged dataset
display(df_merged)

print(f"✅ Landmark data combined and saved to: {output_csv}")


Unnamed: 0,File Name,Landmark Name,Latitude,Longitude,Municipality,Wikipedia URL,Brief Description
0,academia_del_perpetuo_socorro.txt,Academia del Perpetuo Socorro,18.454444,-66.084722,San Juan,https://en.wikipedia.org/wiki/Academia_del_Per...,Academia del Perpetuo Socorro was founded in 1...
1,academia_interamericana_metro.txt,Academia Interamericana Metro,18.448531,-66.072122,San Juan,https://en.wikipedia.org/wiki/Academia_Interam...,The Academia Interamericana Metro was founded ...
2,academia_maria_reina.txt,Academia Maria Reina,18.383442,-66.085516,San Juan,https://en.wikipedia.org/wiki/Academia_Maria_R...,Academia Maria Reina is a school of the Sister...
3,academia_san_jorge.txt,Academia San Jorge,18.450556,-66.061667,Río Piedras,https://en.wikipedia.org/wiki/Academia_San_Jorge,"Academia San Jorge is a private, Roman Catholi..."
4,adjuntas_barrio-pueblo.txt,Adjuntas barrio-pueblo,18.163776,-66.723544,Adjuntas,https://en.wikipedia.org/wiki/Adjuntas_barrio-...,Adjuntas barrio-pueblois abarrioand the admini...
...,...,...,...,...,...,...,...
569,william_miranda_marín_botanical_and_cultural_...,William Miranda Marín Botanical and Cultural G...,18.241389,-66.061667,Caguas,https://en.wikipedia.org/wiki/William_Miranda_...,The William Miranda Marín Botanical and Cultur...
570,world_war_ii.txt,World War II,,,,https://en.wikipedia.org/wiki/World_War_II,World War I I[b]or the Second World War was ag...
571,yabucoa_barrio-pueblo.txt,Yabucoa barrio-pueblo,18.047304,-65.880083,Yabucoa,https://en.wikipedia.org/wiki/Yabucoa_barrio-p...,Yabucoa barrio-pueblois abarrioand the adminis...
572,yauco_barrio-pueblo.txt,Yauco barrio-pueblo,18.036342,-66.849470,Yauco,https://en.wikipedia.org/wiki/Yauco_barrio-pueblo,Yauco barrio-pueblois abarrioand the administr...


✅ Landmark data combined and saved to: /content/drive/My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/structured-information-from-datasets/structured-information-landmarks/landmark_data_combined.csv
