# Data Preprocessing

This document describes the initial data preparation steps for the GeoMind project. The goal was to take a raw metadata file of geographic coordinates and enrich it with country and region labels for training our machine learning model.

### Step 1: Getting the Country Code

First, I used the `reverse_geocoder` library to look up the two-letter country code (e.g., `PL` for Poland or `US` for the United States) for each image. The `rg.search` function took the latitude and longitude of each photo and returned the nearest known place, whose `cc` field holds the country code.
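
A minimal sketch of this lookup, assuming the coordinates are available as `(latitude, longitude)` pairs. The helper name `country_codes` and its `search` parameter are hypothetical and not part of the original script; by default the helper falls back to `rg.search`, which returns one record per coordinate.

```python
def country_codes(coords, search=None):
    """Return the two-letter country code for each (lat, lon) pair."""
    if search is None:
        # Imported lazily; `rg` is the usual alias for reverse_geocoder.
        import reverse_geocoder as rg
        search = rg.search

    # rg.search expects a tuple or a list of tuples, so normalise the input.
    pairs = [(float(lat), float(lon)) for lat, lon in coords]
    if not pairs:
        return []

    # A single batched call handles every photo at once.
    results = search(pairs)
    # Each result is a dict-like record; "cc" is the country code.
    return [record["cc"] for record in results]
```

Passing all coordinates in one call, rather than looping over the photos, keeps the lookup to a single query against the library's place index.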

### Step 2: Assigning Smart Super-Regions

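Next, I grouped the countries into 13 super-regions. The regions were designed to be visually distinct for the model, based on cues such as language, architecture, and landscape. I created a set of rules that assign each `countryCode` to a region. Twelve regions are defined by the explicit country lists below; the thirteenth, Rare Regions (ID `12` in Step 3), has no list of its own and presumably collects every remaining country code.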

**Here are the rules I used:**

> * **North America:** `US`, `CA`
> * **Latin America:** `MX`, `GT`, `SV`, `HN`, `NI`, `CR`, `PA`, `CO`, `VE`, `EC`, `PE`, `BO`, `PY`, `AR`, `CL`, `UY`, `BR`, `GY`, `SR`, `GF`, `CU`, `HT`, `DO`, `PR`, `VI`, `BZ`
> * **Western & Northern Europe:** `FR`, `DE`, `NL`, `BE`, `LU`, `AT`, `CH`, `LI`, `GB`, `IE`, `DK`, `SE`, `NO`, `FI`, `IS`, `AX`, `FO`
> * **Southern Europe:** `ES`, `PT`, `IT`, `GR`, `MT`, `AD`, `SM`, `VA`, `CY`
> * **Eastern Europe & Balkans:** `TR`, `PL`, `CZ`, `SK`, `HU`, `EE`, `LV`, `LT`, `SI`, `HR`, `BA`, `RS`, `ME`, `MK`, `AL`, `RO`, `BG`, `MD`
> * **Russia & Cyrillic:** `RU`, `UA`, `BY`, `MN`, `KZ`, `KG`, `UZ`
> * **East Asia:** `JP`, `KR`, `TW`, `HK`, `CN`
> * **Southeast Asia:** `TH`, `MY`, `SG`, `ID`, `PH`, `VN`, `KH`, `LA`, `BN`, `TL`
> * **South Asia:** `IN`, `BD`, `LK`, `BT`, `NP`
> * **Africa:** `NA`, `ZA`, `NG`, `KE`, `SZ`, `LS`, `SN`, `BW`, `GH`, `RW`, `UG`, `GM`, `CI`, `BF`, `TG`, `GN`, `TZ`, `ET`, `ML`, `ZW`, `CD`, `GW`
> * **Arabia:** `PS`, `LB`, `QA`, `IL`, `AE`, `OM`, `TN`, `JO`, `SY`, `YE`, `MR`
> * **Oceania:** `AU`, `NZ`
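
The rules above can be sketched as a lookup table. This is an illustration rather than the original script: the names `REGION_COUNTRIES`, `COUNTRY_TO_REGION`, and `region_for` are hypothetical, and sending unlisted codes to "Rare Regions" is an assumption based on that region having no list of its own.

```python
# Country lists copied from the rules above (space-separated for brevity).
REGION_COUNTRIES = {
    "North America": "US CA",
    "Latin America": ("MX GT SV HN NI CR PA CO VE EC PE BO PY AR CL UY BR "
                      "GY SR GF CU HT DO PR VI BZ"),
    "Western & Northern Europe": "FR DE NL BE LU AT CH LI GB IE DK SE NO FI IS AX FO",
    "Southern Europe": "ES PT IT GR MT AD SM VA CY",
    "Eastern Europe & Balkans": "TR PL CZ SK HU EE LV LT SI HR BA RS ME MK AL RO BG MD",
    "Russia & Cyrillic": "RU UA BY MN KZ KG UZ",
    "East Asia": "JP KR TW HK CN",
    "Southeast Asia": "TH MY SG ID PH VN KH LA BN TL",
    "South Asia": "IN BD LK BT NP",
    "Africa": "NA ZA NG KE SZ LS SN BW GH RW UG GM CI BF TG GN TZ ET ML ZW CD GW",
    "Arabia": "PS LB QA IL AE OM TN JO SY YE MR",
    "Oceania": "AU NZ",
}

# Assumption: any code not listed above belongs to the catch-all region.
FALLBACK_REGION = "Rare Regions"

# Invert the table so each country code points at its region.
COUNTRY_TO_REGION = {
    code: region
    for region, codes in REGION_COUNTRIES.items()
    for code in codes.split()
}

# Guard against a code being listed under two regions by mistake.
assert len(COUNTRY_TO_REGION) == sum(
    len(codes.split()) for codes in REGION_COUNTRIES.values()
)


def region_for(country_code):
    """Return the super-region for a two-letter country code."""
    return COUNTRY_TO_REGION.get(country_code, FALLBACK_REGION)
```

Keeping the rules as data rather than a chain of `if` statements makes it easy to move a country between regions and to check for duplicates.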

### Step 3: Adding a Numerical Region ID

Classification models need numeric targets rather than text labels, so I added a `region_id` column that maps each region name to a unique integer. This ID is the target label that our model will learn to predict.

**The mapping from region to ID is as follows:**

> * **North America:** `0`
> * **Latin America:** `1`
> * **Western & Northern Europe:** `2`
> * **Southern Europe:** `3`
> * **Eastern Europe & Balkans:** `4`
> * **Russia & Cyrillic:** `5`
> * **East Asia:** `6`
> * **Southeast Asia:** `7`
> * **South Asia:** `8`
> * **Africa:** `9`
> * **Arabia:** `10`
> * **Oceania:** `11`
> * **Rare Regions:** `12`
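
One way to encode this mapping, shown as a sketch: the IDs follow the order of a single list, so the names and numbers cannot drift apart. The identifiers `REGION_NAMES`, `REGION_TO_ID`, `ID_TO_REGION`, and `region_ids` are hypothetical.

```python
# Order matters: the position in this list is the region_id.
REGION_NAMES = [
    "North America",
    "Latin America",
    "Western & Northern Europe",
    "Southern Europe",
    "Eastern Europe & Balkans",
    "Russia & Cyrillic",
    "East Asia",
    "Southeast Asia",
    "South Asia",
    "Africa",
    "Arabia",
    "Oceania",
    "Rare Regions",
]

REGION_TO_ID = {name: idx for idx, name in enumerate(REGION_NAMES)}
# Reverse lookup, handy for turning model predictions back into names.
ID_TO_REGION = dict(enumerate(REGION_NAMES))


def region_ids(regions):
    """Convert region names to integer IDs; unknown names raise KeyError."""
    return [REGION_TO_ID[name] for name in regions]
```

With a pandas DataFrame, the same dictionary can be applied through `Series.map`, for example `df["region_id"] = df["region"].map(REGION_TO_ID)`, where the `region` column name is an assumption.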


### Final Step: Saving the Result

Finally, I saved the fully processed DataFrame to a new file named `metadata_final.csv`. This file is now ready for the next stages of the project, such as data analysis and model training.
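
A minimal sketch of the save step and of reading the file back in a later stage, assuming pandas. The helper names `save_metadata` and `load_metadata` are hypothetical. One detail worth guarding against on reload: by default `pd.read_csv` treats the string `NA` as a missing value, which would silently erase Namibia's country code.

```python
import pandas as pd


def save_metadata(df, path="metadata_final.csv"):
    """Write the processed DataFrame without the row index."""
    df.to_csv(path, index=False)


def load_metadata(path="metadata_final.csv"):
    """Read the file back, keeping the literal string "NA" (Namibia)."""
    # keep_default_na=False disables the built-in missing-value markers;
    # na_values=[""] still treats genuinely empty cells as missing.
    return pd.read_csv(path, keep_default_na=False, na_values=[""])
```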