# Geocoding and Location Details Extraction from Movie Data

The code below extracts geographic details from the location names in a dataset using a geocoding service. Each part of the script is described in turn:

1. **Geolocator Initialization**:
   - A geolocator instance is created using the `Nominatim` geocoding service from the `geopy` library. The `user_agent` parameter is set to "geopy_app" to identify the application making the request.

2. **Function to Clean Location Names**:
   - The `clean_location_name` function is designed to preprocess location names. It:
     - Strips any leading or trailing whitespace.
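     - Removes every character that is not an ASCII letter, whitespace, or a comma. This also drops digits and accented letters.
     - Removes any occurrences of "##" (already covered by the previous step, since `#` is not a letter) and collapses repeated whitespace, ensuring clean input for geocoding.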

3. **Function to Fetch Location Details**:
   - The `get_location_details` function takes a location name and attempts to retrieve its geographic details (neighborhood, city, region, country).
     - It first checks if the location name is not empty or just whitespace.
     - It cleans the location name using the `clean_location_name` function.
     - It uses the geolocator to fetch the geographic data with a timeout of 10 seconds.
     - If successful, it extracts details like neighborhood, city, region, and country from the returned data.
     - If an error occurs (e.g., due to an invalid location), the function prints the error and returns `None` for all details.

4. **Batch Processing Function**:
   - The `process_batch` function processes a batch of rows from the DataFrame (`df_batch`). For each row:
     - It retrieves the locations extracted by spaCy (assumed to be stored as a comma-separated string in the column `spacy_extracted_locations`).
     - It cleans each location name using the `clean_location_name` function and performs geocoding for each location.
     - For each location, it appends the corresponding geographic details (neighborhood, city, region, country) to respective lists.
     - It also records a `geocoding_success` flag for each row.
     - A delay of 1 second (`time.sleep(1)`) is included between API calls, in line with the public Nominatim service's limit of one request per second.

5. **Test Sample and Processing**:
   - A test sample of the DataFrame is loaded, with the first 6000 rows selected (`df.head(6000)`).
   - The `process_batch` function is then called to process the sample and extract geographic details for each location.

6. **Updating the DataFrame**:
   - After processing, the results are used to populate new columns in the DataFrame:
     - `neighbourhood`, `city`, `region`, `country`, and `geocoding_success` (indicating whether geocoding succeeded for that row).
   - The updated DataFrame is saved to a CSV file named `test_location_details.csv`.
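
The cleaning rules from step 2 can be tried on a few inputs without calling any service. The sketch below repeats the same rules as `clean_location_name`; the sample strings are made up for illustration and do not come from the dataset.

```python
import re

def clean_location_name(location: str) -> str:
    """Apply the same cleaning rules as the notebook's function."""
    location = location.strip()                       # Trim outer whitespace
    location = re.sub(r"[^a-zA-Z\s,]", "", location)  # Keep ASCII letters, whitespace, commas
    location = location.replace("##", "").strip()     # No effect here: "#" is already gone
    return " ".join(location.split())                 # Collapse repeated whitespace

# Hypothetical inputs chosen to show each rule
samples = ["  New York!! ", "##Paris", "S\u00e3o Paulo", "10 Downing Street"]
cleaned = {raw: clean_location_name(raw) for raw in samples}

for raw, out in cleaned.items():
    print(f"{raw!r} -> {out!r}")
```

Because the pattern keeps only ASCII letters, the house number in "10 Downing Street" and the accented letter in "São Paulo" are both removed, which may make some names harder for the geocoder to match.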

Finally, the script prints a message indicating that the processing is complete and the results have been saved.
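
The field extraction described in step 3 can also be illustrated offline. The sketch below applies a fallback lookup to a hand-written dictionary shaped like the `address` part of a Nominatim response. The helper names `first_present` and `extract_address_fields`, the extra `suburb` fallback, and the sample values are assumptions made for this example, not part of the original script.

```python
from typing import Optional, Sequence, Tuple

def first_present(details: dict, keys: Sequence[str]) -> Optional[str]:
    """Return the first non-empty value found under any of the given keys."""
    for key in keys:
        value = details.get(key)
        if value:
            return value
    return None

def extract_address_fields(details: dict) -> Tuple[Optional[str], ...]:
    """Map an address dictionary to (neighbourhood, city, region, country)."""
    return (
        first_present(details, ["neighbourhood", "suburb"]),  # "suburb" fallback is an assumption
        first_present(details, ["city", "town", "village"]),  # Smaller places use town or village
        details.get("state"),
        details.get("country"),
    )

# Hand-written sample, not the result of a real request
sample_address = {"village": "Giverny", "state": "Normandy", "country": "France"}
fields = extract_address_fields(sample_address)
print(fields)
```

A place that is tagged only as a village still yields a value for the city slot, while fields that are absent from the response come back as `None`.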

In [None]:
import pandas as pd
import time
from geopy.geocoders import Nominatim
from tqdm import tqdm  # For the progress bar
import re  # For cleaning the location strings

In [None]:
df = pd.read_csv('/content/13k_finalsummary.csv')

In [None]:
# Initialize geolocator
geolocator = Nominatim(user_agent="geopy_app")

# Function to clean location names
def clean_location_name(location):
    location = location.strip()  # Remove leading/trailing whitespace
    location = re.sub(r'[^a-zA-Z\s,]', '', location)  # Keep letters, spaces, commas
    location = location.replace("##", "").strip()  # Remove "##" and extra spaces
    return ' '.join(location.split())

# Function to fetch location details with error handling
def get_location_details(location):
    if not location or location.strip() == "":
        return None, None, None, None

    cleaned_location = clean_location_name(location)

    try:
        location_data = geolocator.geocode(
            cleaned_location, addressdetails=True, language='en', timeout=10
        )
        if location_data:
            details = location_data.raw.get("address", {})
            # Nominatim spells this address key "neighbourhood"
            neighborhood = details.get("neighbourhood")
            city = details.get("city", details.get("town", details.get("village")))
            region = details.get("state")
            country = details.get("country")
            return neighborhood, city, region, country
    except Exception as e:
        print(f"Error for location '{cleaned_location}': {e}")
    return None, None, None, None

# Batch processing function
def process_batch(df_batch):
    results = []
    for _, row in tqdm(df_batch.iterrows(), total=len(df_batch), desc="Processing Batch"):
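        locations = row["spacy_extracted_locations"]
        # pandas reads empty cells as NaN (a float), which has no .split()
        if not isinstance(locations, str):
            locations = ""
        location_list = [clean_location_name(loc) for loc in locations.split(",")]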

        neighborhood_list, city_list, region_list, country_list = [], [], [], []
        found_flags = []  # One entry per location: did geocoding return any detail?
        for location in location_list:
            neighborhood, city, region, country = get_location_details(location)
            neighborhood_list.append(neighborhood or "")
            city_list.append(city or "")
            region_list.append(region or "")
            country_list.append(country or "")
            found_flags.append(any([neighborhood, city, region, country]))
            if location:
                time.sleep(1)  # Only pause when a request was actually sent

        results.append(
            {
                "neighbourhood": "; ".join(neighborhood_list),
                "city": "; ".join(city_list),
                "region": "; ".join(region_list),
                "country": "; ".join(country_list),
                # A row counts as "success" only if every one of its locations
                # returned at least one address detail
                "geocoding_success": "success" if all(found_flags) else "fail",
            }
        )
    return results

# Load a test sample of the first 6000 rows
test_df = df.head(6000).copy()

# Process the batch and update the DataFrame
results = process_batch(test_df)

# Extract results and assign to columns
test_df["neighbourhood"] = [res["neighbourhood"] for res in results]
test_df["city"] = [res["city"] for res in results]
test_df["region"] = [res["region"] for res in results]
test_df["country"] = [res["country"] for res in results]
test_df["geocoding_success"] = [res["geocoding_success"] for res in results]

# Save results to a CSV
test_df.to_csv("test_location_details.csv", index=False)

print("Test processing complete. Results saved to 'test_location_details.csv'")


Processing Batch:  23%|██▎       | 1369/6000 [53:23<3:16:25,  2.54s/it]
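
At roughly 2.5 seconds per row, the run above takes several hours. If the same location names recur across rows (an assumption about this dataset, not something the output shows), caching the lookups avoids repeating requests. The sketch below uses `functools.lru_cache` around a stand-in function; `fake_geocode` and `cached_location_details` are hypothetical names, and no network call is made.

```python
from functools import lru_cache

call_log = []  # Records every simulated request

def fake_geocode(name):
    """Stand-in for get_location_details; returns a dummy 4-tuple."""
    call_log.append(name)
    return (None, name.title(), None, None)

@lru_cache(maxsize=None)
def cached_location_details(name):
    """Look each distinct name up once and reuse the stored result afterwards."""
    return fake_geocode(name)

names = ["paris", "london", "paris", "paris", "london"]
results = [cached_location_details(n) for n in names]

print(f"{len(names)} lookups, {len(call_log)} simulated requests")
```

In the real script, the one-second pause would need to sit inside the uncached function so that cache hits return immediately. Note that failed lookups are cached as well, so a transient error would not be retried within the same run.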