# Lesson 5: Geoparsing and Sentiment Mapping in Python

**🎯 Learning Objectives:**
- Extract geographic locations from text using an advanced geoparser
- Combine location data with sentiment analysis
- Create interactive maps to visualize sentiment by location
- Apply data science techniques to literary analysis

**📋 Prerequisites:**
- Complete `lesson_5_0_installation_setup.ipynb` first
- Data from previous lessons (sentiment analysis results)

## Overview

In this lesson, we'll take text data about Virginia's history and:

1. **Extract Locations**: Use a sophisticated geoparser to find and resolve geographic references
2. **Combine with Sentiment**: Link location mentions to emotional sentiment in the text
3. **Visualize on Maps**: Create interactive maps showing sentiment patterns across different places

This process helps us understand how different locations are portrayed in historical texts - whether they're mentioned positively, negatively, or neutrally.

---

In [None]:
# Import all required libraries
try:
    from geoparser import Geoparser
    from tqdm.notebook import tqdm
    import pandas as pd
    import plotly.express as px
    import mapclassify as mc
    import warnings
    
    # Suppress warnings for cleaner output
    warnings.simplefilter(action='ignore', category=FutureWarning)
    
    print("✅ All libraries imported successfully!")
    
except ImportError as e:
    print(f"❌ Missing library: {e}")
    print("Please run the installation notebook first: lesson_5_0_installation_setup.ipynb")

## Part 1: Initialize the Geoparser System

The geoparser is a sophisticated tool that can identify place names in text and resolve them to actual geographic coordinates.

### Step 1.1: Initialize the Geoparser

We'll create a geoparser with optimized settings for accuracy:

In [None]:
try:
    print("Initializing geoparser... (this may take a minute)")
    geo = Geoparser(
        spacy_model='en_core_web_trf',                    # Advanced language model
        transformer_model='dguzh/geo-all-distilroberta-v1', # Geographic transformer
        gazetteer='geonames'                              # Geographic database
    )
    print("✅ Geoparser initialized successfully!")
    
except Exception as e:
    print(f"❌ Error initializing geoparser: {e}")
    print("Make sure you ran the installation notebook first!")

**What these parameters do:**
- `spacy_model`: Advanced language processing for accurate text understanding
- `transformer_model`: Specialized AI model trained to recognize geographic references  
- `gazetteer`: Database containing millions of place names and their coordinates

### Step 1.2: Test the Geoparser

Let's test the geoparser with some sample sentences. Try changing the text below to include places you know:

In [None]:
# Test with sample sentences - feel free to modify these!
test_sentences = [
    "I traveled from New York to Richmond, Virginia last summer.",
    "The battle took place near Harrisonburg in the Shenandoah Valley.",
    "London and Paris are popular European destinations."
]

try:
    docs = geo.parse(test_sentences)
    print(f"✅ Successfully parsed {len(docs)} sentences!")
    
except Exception as e:
    print(f"❌ Error during parsing: {e}")
    print("Try restarting the kernel and running from the beginning.")

### Step 1.3: Examine the Results

Let's see what locations the geoparser found. Each "toponym" is a place name with detailed geographic information:

In [None]:
print("🗺️  LOCATIONS FOUND:")
print("=" * 50)

for i, doc in enumerate(docs):
    print(f"\nSentence {i+1}: \"{test_sentences[i]}\"")
    
    if doc.toponyms:
        for toponym in doc.toponyms:
            print(f"  📍 Found: {toponym}")
    else:
        print("  ❌ No locations found in this sentence")
        
print("\n" + "=" * 50)

### Understanding the Data Structure

Each toponym contains detailed geographic information. Here's what a complete location record looks like:

```python
{
    'geonameid': 2867714,
    'name': 'Munich',
    'latitude': 48.13743,
    'longitude': 11.57549,
    'country_name': 'Germany',
    'admin1_name': 'Bavaria',        # State/Province
    'admin2_name': 'Upper Bavaria',  # County/Region
    'feature_name': 'seat of a first-order administrative division',
    'population': 1260391
}
```

We can access specific pieces of information using `.location['key_name']`:

In [None]:
# Extract specific geographic information
print("📍 DETAILED LOCATION DATA:")
print("=" * 60)

for i, doc in enumerate(docs):
    print(f"\nSentence {i+1}: \"{test_sentences[i]}\"")
    
    for toponym in doc.toponyms:
        if toponym.location:
            name = toponym.location['name']
            lat = toponym.location['latitude']
            lon = toponym.location['longitude']
            country = toponym.location.get('country_name', 'Unknown')
            
            print(f"  🏛️  Place: {name}")
            print(f"  🌍 Country: {country}")
            print(f"  📐 Coordinates: ({lat:.4f}, {lon:.4f})")
            print()
        else:
            print(f"  ❌ Location '{toponym}' could not be resolved to coordinates")
            print()

## Part 2: Load and Process Historical Text Data

Now we'll work with real historical text that already has sentiment analysis completed.

### Step 2.1: Load Sentiment Data

This dataset contains sentences from historical texts about Virginia, with sentiment scores already calculated:

In [None]:
try:
    df_virginia_toponyms_sentiment = pd.read_pickle('df_virginia_toponym_sentiment_full.pickle')
    print(f"✅ Loaded {len(df_virginia_toponyms_sentiment):,} sentences with sentiment data")
    print(f"📊 Columns: {list(df_virginia_toponyms_sentiment.columns)}")
    
except FileNotFoundError:
    print("❌ Data file not found!")
    print("You may need to run previous lessons first to generate the sentiment data.")
    print("Or check that you're in the correct directory.")

### Step 2.2: Demo with Sample Data

**⏱️ Note**: Full geoparsing takes 1+ hours, so we'll demonstrate with a small sample first, then load pre-processed results.

In [None]:
# Create a small sample for demonstration
try:
    df_virginia_sample = df_virginia_toponyms_sentiment.head(100).copy()
    print(f"📝 Created sample with {len(df_virginia_sample)} sentences")
    print("\n🔍 Sample of the data:")
    display(df_virginia_sample[['cleaned_sentences', 'roberta_pos', 'roberta_neu', 'roberta_neg']].head(3))
    
except NameError:
    print("❌ Data not loaded yet. Please run the previous cell first.")

### Step 2.3: The Geoparsing Function

Here's a streamlined function that processes text and extracts geographic information:

**Key features:**
- Processes multiple sentences at once for efficiency
- Filters for Administrative areas (Countries, States) and Population centers (Cities)
- Extracts coordinates and place information

In [None]:
def geoparse_dataframe(df, text_column='cleaned_sentences'):
    """
    Extract geographic locations from text data.
    
    Args:
        df: DataFrame with text data
        text_column: Column containing the text to parse
    
    Returns:
        DataFrame with added location columns
    """
    print(f"🔍 Processing {len(df)} sentences for geographic locations...")
    
    # Convert text column to list for batch processing
    sentences = df[text_column].tolist()
    
    try:
        # Process all sentences at once (more efficient)
        docs = geo.parse(sentences, feature_filter=['A', 'P'])  # A=Administrative, P=Population centers
        
        # Initialize storage for results
        places, latitudes, longitudes, feature_names = [], [], [], []
        
        # Extract information from each processed document
        for doc in tqdm(docs, desc="Extracting locations"):
            doc_places = []
            doc_latitudes = []
            doc_longitudes = []
            doc_feature_names = []
            
            # Get all toponyms found in this document
            for toponym in doc.toponyms:
                if toponym.location:
                    doc_places.append(toponym.location.get('name'))
                    doc_latitudes.append(toponym.location.get('latitude'))
                    doc_longitudes.append(toponym.location.get('longitude'))
                    doc_feature_names.append(toponym.location.get('feature_name'))
            
            # Store results (empty lists if no locations found)
            places.append(doc_places)
            latitudes.append(doc_latitudes)
            longitudes.append(doc_longitudes)
            feature_names.append(doc_feature_names)
        
        # Add new columns to dataframe
        df_result = df.copy()
        df_result['place'] = places
        df_result['latitude'] = latitudes
        df_result['longitude'] = longitudes
        df_result['feature_name'] = feature_names
        
        print(f"✅ Geoparsing complete!")
        return df_result
        
    except Exception as e:
        print(f"❌ Error during geoparsing: {e}")
        return df

In [None]:
# Run geoparsing on sample data
try:
    geoparse_results_sample = geoparse_dataframe(df_virginia_sample)
    print(f"\n📊 Results: {len(geoparse_results_sample)} sentences processed")
    
except Exception as e:
    print(f"❌ Error: {e}")
    print("Make sure the geoparser was initialized successfully in Part 1.")

In [None]:
geoparse_results_sample.sample(5, random_state =10)

There are several things to note in the data. First, for some of the sentences the geoparser did not find a toponym, which is indicated by empty lists `[]`. This is because the geoparser is more accurate than the earlier NER-based extraction and will likely produce fewer false positives. We will have to remember to remove these rows. Second, the parsing is currently set to include both administrative areas such as countries and states (e.g. the US and Virginia) and population centers (e.g. Richmond and Harrisonburg). We will have to think about how to deal with these down the road.

**We can run the geoparser for all the data and expect to wait at least an hour!**

In [None]:
# Run the geoparser over the entire 'cleaned_sentences' column
#geoparse_results = geoparse_dataframe(df_virginia_toponyms_sentiment)

In [None]:
# Display the updated DataFrame with new columns
#geoparse_results.sample(10)

In [None]:
#geoparse_results.to_pickle('df_virginia_geoparsed_complete.pickle')

---

## Part 3: Work with Complete Geoparsed Dataset

Since geoparsing the full dataset takes hours, we'll load the pre-processed results.

### Step 3.1: Load Pre-processed Data

The complete geoparsing process took over an hour on a modern computer:

![Processing Time](geoparser_completion.png)

In [None]:
try:
    df_virginia_all = pd.read_pickle('df_virginia_geoparsed_complete.pickle')
    print(f"✅ Loaded complete dataset: {len(df_virginia_all):,} sentences")
    print(f"📊 New columns added: place, latitude, longitude, feature_name")
    
except FileNotFoundError:
    print("❌ Complete geoparsed file not found!")
    print("This file should be provided with the lesson materials.")
    print("You can also create it by running the full geoparsing process (takes 1+ hours).")

### Step 3.2: Understanding Accuracy

The advanced geoparser is more accurate than simple named entity recognition, so it produces fewer false positives:

In [None]:
df_virginia_all[['toponyms','place']].sample(7, random_state = 19)

Calculate the percentage of sentences where the geoparser resolved no place. These are likely false positives in the original toponyms.

In [None]:
empty_percent = (df_virginia_all['place'].str.len() == 0).mean() * 100
print(f"Sentences with no resolved place: {empty_percent:.0f}%")

### 2.1 Remove false positives

Here a false positive is any sentence where the rough NER extraction found a toponym but the more fine-grained geoparser resolved no place, leaving an empty list in `place`.
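
To see why `.str.len()` works as a filter here, the following minimal sketch applies it to a toy DataFrame. The rows and the name "Grant" are made up for illustration and are not taken from the lesson data.

```python
import pandas as pd

# Toy data (hypothetical): 'toponyms' holds rough NER hits,
# 'place' holds what the geoparser managed to resolve.
toy = pd.DataFrame({
    "toponyms": [["Richmond"], ["Grant"], ["Virginia", "Norfolk"]],
    "place": [["Richmond"], [], ["Virginia", "Norfolk"]],
})

# On a column of lists, .str.len() returns the length of each list,
# so a length of 0 marks rows where nothing was resolved.
is_unresolved = toy["place"].str.len() == 0

# The mean of a boolean column is the share of True values.
share_unresolved = is_unresolved.mean() * 100

# Keep only the rows with at least one resolved place.
toy_cleaned = toy[~is_unresolved].copy()

print(f"{share_unresolved:.0f}% of rows have no resolved place")
print(toy_cleaned)
```

In this toy example the second row is dropped because its `place` list is empty even though the rough extraction flagged a name.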

In [None]:
df_virginia_cleaned = df_virginia_all[df_virginia_all['place'].str.len() != 0].copy()

In [None]:
df_virginia_cleaned[['cleaned_sentences','place','latitude','longitude']].sample(5, random_state= 15)

#### 2.1.1 `explode` the dataframe

As before, we want one location per row. Some sentences contain multiple locations, so we use `.explode()` to give each location its own row.
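
As a small illustration of what exploding several list columns at once does, here is a sketch on toy data. The sentences, coordinates, and scores are hypothetical, and passing a list of columns to `.explode()` assumes pandas 1.3 or newer.

```python
import pandas as pd

# Toy data (hypothetical values): each row holds parallel lists,
# one entry per place found in the sentence.
toy = pd.DataFrame({
    "cleaned_sentences": ["from richmond to norfolk", "near harrisonburg"],
    "place": [["Richmond", "Norfolk"], ["Harrisonburg"]],
    "latitude": [[37.55, 36.85], [38.45]],
    "longitude": [[-77.46, -76.29], [-78.87]],
    "roberta_neg": [0.10, 0.60],
})

# Exploding the columns together keeps the lists aligned element by element.
# The lists within a row must all have the same length.
toy_long = toy.explode(["place", "latitude", "longitude"])

print(toy_long[["place", "latitude", "roberta_neg"]])
```

Two things are worth noticing in the output: the sentence-level sentiment score is repeated for every place in that sentence, and the original index is repeated too, which is why the index gets reset later on.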

In [None]:
df_virginia_long = df_virginia_cleaned.explode(['place', 'latitude', 'longitude', 'feature_name'])

#### 2.1.2 Investigate the data frame

In [None]:
df_virginia_long[['place', 'latitude', 'longitude', 'feature_name']].sample(5, random_state = 7)

#### 2.1.3 Critical Question

From a cartographic perspective, what is the difference between the types of features recovered?

### 2.2 Remove empty values

We can remove all of the missing (`None`) values with `.notna()`, which returns `True` wherever the value is not missing.

In [None]:
df_virginia_long = df_virginia_long[df_virginia_long.place.notna()]

When we do this we have to remember to reset the index column.

In [None]:
df_virginia_long = df_virginia_long.reset_index(drop=True)

Check the result

In [None]:
df_virginia_long.sample(3, random_state=25)

### 2.3 Creating an Aggregate table with `.groupby()`

We now have a row for every place mention, with its latitude, longitude, and feature type, coupled with the sentiment data. We now need to consolidate these mentions into a single point per place. We do this with the `.groupby()` method, which performs calculations by group. Here the group is `place`, so for every place we count the number of times it appears and calculate the average sentiment scores, which gives a general sentiment trend for each location.
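
The sketch below shows the same named-aggregation pattern on a three-row toy table, so the effect of each argument is easy to check by hand. The values are hypothetical.

```python
import pandas as pd

# Toy long-format data (hypothetical): one row per place mention.
toy_long = pd.DataFrame({
    "place": ["Richmond", "Richmond", "Norfolk"],
    "latitude": [37.55, 37.55, 36.85],
    "roberta_pos": [0.2, 0.6, 0.1],
})

# Named aggregation: new_column=(source_column, aggregation_function)
summary = toy_long.groupby("place").agg(
    location_count=("place", "size"),         # number of mentions per place
    latitude=("latitude", "first"),           # coordinates are the same within a place
    avg_roberta_pos=("roberta_pos", "mean"),  # average positive score per place
).reset_index()

print(summary)
```

One caveat with grouping by name alone: two different places that share a name would be merged into one row, and `'first'` would keep the coordinates of whichever mention comes first.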

In [None]:
df_geolocations_sentiments = df_virginia_long.groupby('place').agg(
    location_count=('place', 'size'),  # Count occurrences of each location
    latitude=('latitude', 'first'),    # Take the first latitude (you can also use 'mean')
    longitude=('longitude', 'first'),  # Take the first longitude (or 'mean')
    location=('feature_name','first'),
    avg_roberta_pos=('roberta_pos', 'mean'),  # Average of roberta_pos
    avg_roberta_neu=('roberta_neu', 'mean'),  # Average of roberta_neu
    avg_roberta_neg=('roberta_neg', 'mean')   # Average of roberta_neg
).reset_index()



Check the results

In [None]:
df_geolocations_sentiments.sample(5, random_state=5)

#### 2.3.1 Find top locations

In [None]:
df_geolocations_sentiments.sort_values(by='location_count', ascending=False).head(5)

What can we tell about the number of frequent locations?

### 2.4 Consolidate the Roberta Score

Since the RoBERTa output has separate positive, negative, and neutral scores, we will consolidate it into one score that is easier to understand. We take the difference between positive and negative and multiply it by the non-neutral share, `1 - neutral`. This way, if a score is very neutral, the difference between positive and negative is shrunk toward zero.
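
To make the formula concrete, here is a minimal sketch that applies it to two made-up score triples. The helper name `compound_score` and the input values are assumptions for illustration only.

```python
def compound_score(pos: float, neu: float, neg: float) -> float:
    """Collapse three sentiment scores into one value between -1 and 1.

    (pos - neg) gives the direction and strength of the sentiment;
    (1 - neu) shrinks it toward 0 when the text is mostly neutral.
    """
    return (pos - neg) * (1 - neu)


# Hypothetical scores for two places
mostly_neutral = compound_score(pos=0.15, neu=0.80, neg=0.05)
clearly_negative = compound_score(pos=0.05, neu=0.10, neg=0.85)

print(f"mostly neutral:   {mostly_neutral:.2f}")
print(f"clearly negative: {clearly_negative:.2f}")
```

The mostly neutral place ends up close to 0 (about 0.02), while the clearly negative place ends up at about -0.72. Note that the notebook applies the formula to the averaged scores per place, which is not the same as averaging a per-sentence compound score.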

In [None]:
# Calculate the compound score and add it as a new column 'avg_roberta_compound'
df_geolocations_sentiments['avg_roberta_compound'] = (
    df_geolocations_sentiments['avg_roberta_pos'] - df_geolocations_sentiments['avg_roberta_neg']
) * (1 - df_geolocations_sentiments['avg_roberta_neu'])


#### 2.4.1 The most negative place

Let's find the most negative place.

In [None]:
# Sort the DataFrame by 'avg_roberta_compound' in ascending order and keep the 10 most negative places
top_10_negative = df_geolocations_sentiments.sort_values(by='avg_roberta_compound').head(10)
top_10_negative

### 2.4.2 Critical Question

We can already tell there might be some challenges here. The values with the strongest scores tend to be low in count. 

- Why might this be?

Also, these are different types of features. 

- How might this distort results?

In [None]:
#df_geolocations_sentiments.to_pickle('df_geolocations_sentiments.pickle')

##### 3 (Optional) Load in completed file

In [None]:
df_geolocations_sentiments = pd.read_pickle('df_geolocations_sentiments.pickle')

---

## Part 4: Create Interactive Sentiment Maps

Now we'll create beautiful interactive maps showing how sentiment varies by location.

### 3.1 Basic Plotly Map

Below is the basic stub from the Plotly instructions for creating a [map](https://plotly.com/python/tile-scatter-maps/).

In [None]:
import plotly.express as px

# Create the map using plotly.express 
fig = px.scatter_mapbox(
    df_geolocations_sentiments,  #put your dataframe here
    lat="latitude",               # Latitude column
    lon="longitude",              # Longitude column
    size="location_count",        # Bubble size based on location count
    color="avg_roberta_compound",      # Color based on sentiment score
    color_continuous_scale=px.colors.cyclical.IceFire,  # Use IceFire scale (blue to red)
    size_max=15,                  # Maximum size of the bubbles
    center={"lat": 0.0, "lon": 0.0},
    zoom=1                        # Adjust zoom level for better visibility
)

# Update the layout to use the default map style (which doesn't need a token)
fig.update_layout(
    mapbox_style="open-street-map",  # No token needed for this style
    margin={"r":0,"t":0,"l":0,"b":0}  # Remove margins for a cleaner view
)

fig.show()

### 3.2 Critical Questions

- What are some issues with this map?
- How can we make it better?

#### 3.2.1 Adjust the position and the zoom

Look at the comments next to each argument in the plotly function. What does each one do?
- If I wanted to get a closer zoom how would I fix it?
- If I want to set a different center what should I choose?
- How would I change the size of the bubbles?
- How can I get a different mapbox style?

Take some time to mess around with your map.


[Read the full documentation](https://plotly.github.io/plotly.py-docs/generated/plotly.express.scatter_mapbox.html)

In [None]:
# Create the map using plotly.express 
fig = px.scatter_mapbox(
    df_geolocations_sentiments,  #put your dataframe here
    lat="latitude",               # Latitude column
    lon="longitude",              # Longitude column
    size="location_count",        # Bubble size based on location count
    color="avg_roberta_compound",      # Color based on sentiment score
    color_continuous_scale=px.colors.cyclical.IceFire,  # Use IceFire scale (blue to red)
    size_max=50,                  # Maximum size of the bubbles
    center={"lat": 0.0, "lon": 0.0},
    zoom=1                        # Adjust zoom level for better visibility
)

# Update the layout to use the default map style (which doesn't need a token)
fig.update_layout(
    mapbox_style="open-street-map",  # No token needed for this style
    margin={"r":0,"t":0,"l":0,"b":0}  # Remove margins for a cleaner view
)

fig.show()

### 3.3 Bubble Size

One major issue is simply the size of the bubbles. The data is very spread out: the lowest count is 1 and the highest is about 11,000. We can make life a little easier by removing some of the lower counts. We can do this arbitrarily by removing every location below a certain count, or we can be a bit more thoughtful and keep only a certain percentage of values. We can get a nice summary of a column using the `.describe()` method.
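
As a sketch of the more thoughtful option, a percentile can be turned into a cutoff with `.quantile()`. The counts and the 70% level below are hypothetical choices for illustration, not values from the lesson data.

```python
import pandas as pd

# Toy counts (hypothetical), skewed like the real data: many small, few huge.
counts = pd.Series([1, 1, 2, 3, 5, 8, 40, 150, 900, 11000], name="location_count")

# The 0.70 quantile is the value below which 70% of the locations fall.
cutoff = counts.quantile(0.70)

# Keep only the locations at or above that cutoff (the top 30%).
kept = counts[counts >= cutoff]

print(f"cutoff: {cutoff:.0f}")
print(kept.tolist())
```

A percentile-based cutoff adapts to the data, whereas a fixed count such as 100 is easier to explain to a reader. Either is a judgment call.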



In [None]:
df_geolocations_sentiments.location_count.describe()

It actually looks like a significant number of locations appear very few times. We will have to keep this in mind going forward. For now, let's set our count threshold to 100.

### 3.4 Critical Question

Create a new dataframe that sets the minimum location count to 100.

In [None]:
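# .copy() makes an independent DataFrame so new columns can be added later without a SettingWithCopyWarning
df_geolocations_sentiments_small = df_geolocations_sentiments[df_geolocations_sentiments.location_count >= 100].copy()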

In [None]:
# Create the map using plotly.express 
fig = px.scatter_mapbox(
    df_geolocations_sentiments_small,  #put your dataframe here
    lat="latitude",               # Latitude column
    lon="longitude",              # Longitude column
    size="location_count",        # Bubble size based on location count
    color="avg_roberta_compound",      # Color based on sentiment score
    color_continuous_scale=px.colors.cyclical.IceFire,  # Use IceFire scale (blue to red)
    size_max=20,                  # Maximum size of the bubbles
    center={"lat": 37.5246322, "lon": -77.5758331},
    zoom=4                        # Adjust zoom level for better visibility
)

# Update the layout to use the default map style (which doesn't need a token)
fig.update_layout(
    mapbox_style="carto-darkmatter",  # No token needed for this style
    margin={"r":0,"t":0,"l":0,"b":0}  # Remove margins for a cleaner view
)



fig.show()

### 3.5 Critical Question

- How did reducing the number of values improve the data visualization?
- Where does it run into issues?

### 3.6 Bucketing Values

The above map still runs into a sizing issue. Even though the values are a bit closer together, there is still a difference of roughly 11,000 between the lowest value and the highest value. This is a common problem when representing values on a map. The easiest way to fix this is to "bucket" the values into classes such as "smallest", "small", "medium", "large", and "largest". The problem is deciding how to do this in the best way.

   - **Equal Interval**: This method divides the entire range of values into equal-sized buckets. It’s simple to implement and works well when data is evenly distributed. However, it may not be as effective when the data is heavily skewed, as it can lead to many data points clustering within certain buckets.
   - **Quantile (Percentiles)**: Quantile bucketing divides data so that each bucket contains an equal number of data points. This method is useful for data with uneven distributions, as it ensures that each category has a similar representation.
   - **Natural Breaks (Jenks)**: The Jenks method automatically identifies clusters and gaps within the data to create buckets based on natural groupings. This technique is particularly beneficial for data with distinct groupings, as it helps to highlight these patterns and produce visually distinct buckets that better reflect the distribution of values.
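
To make the difference between the first two methods concrete, here is a rough pure-Python sketch that buckets the same skewed counts both ways. The counts are hypothetical, and the `bucket` helper is an illustrative stand-in rather than how `mapclassify` works internally.

```python
from bisect import bisect_left
from statistics import quantiles

# Toy counts (hypothetical), heavily skewed toward small values.
counts = [100, 110, 120, 150, 200, 300, 500, 900, 2500, 11000]
k = 5  # number of buckets

# Equal interval: split the range [min, max] into k equally wide bins.
lo, hi = min(counts), max(counts)
width = (hi - lo) / k
equal_edges = [lo + width * i for i in range(1, k + 1)]  # upper edge of each bin

# Quantile: choose edges so each bin holds about the same number of values.
quantile_edges = quantiles(counts, n=k, method="inclusive") + [hi]


def bucket(value, edges):
    """Return the 1-based bucket whose upper edge is the first one >= value."""
    return bisect_left(edges, value) + 1


equal_buckets = [bucket(c, equal_edges) for c in counts]
quantile_buckets = [bucket(c, quantile_edges) for c in counts]

print("equal interval:", equal_buckets)
print("quantile:      ", quantile_buckets)
```

With equal intervals, eight of the ten values land in bucket 1 and buckets 3 and 4 stay empty, while the quantile method puts two values in every bucket. Natural breaks sits between these extremes by placing the edges at gaps in the data.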


We are going to go with natural breaks using `mapclassify`.

In [None]:
import mapclassify as mc  # if this raises an ImportError, install it with: pip install mapclassify

jenks_breaks = mc.NaturalBreaks(y=df_geolocations_sentiments_small['location_count'], k=5)
jenks_breaks

#### 3.6.1 Critical Question

Now that we have a better view of the data, what would it have looked like with equal interval?

In [None]:
# Create a new column 'location_count_bucket' for the classified values
df_geolocations_sentiments_small.loc[:,'location_count_bucket'] = jenks_breaks.find_bin(df_geolocations_sentiments_small['location_count'])+1

#### 3.6.2 Optional Retrieve Backup

Load a copy of `df_geolocations_sentiments_small.pickle` if the above does not work.

In [None]:
#df_geolocations_sentiments_small = pd.read_pickle("df_geolocations_sentiments_small.pickle")

#### 3.6.3 Explore Buckets

In [None]:
df_geolocations_sentiments_small.sample(5, random_state=4)

What do we notice about the buckets?

In [None]:
# Create the map using plotly.express 
fig = px.scatter_mapbox(
    df_geolocations_sentiments_small,  #put your dataframe here
    lat="latitude",               # Latitude column
    lon="longitude",              # Longitude column
    size="location_count_bucket",        # Bubble size based on location count
    color="avg_roberta_compound",      # Color based on sentiment score
    color_continuous_scale=px.colors.cyclical.IceFire,  # Use IceFire scale (blue to red)
    size_max=20,                  # Maximum size of the bubbles
    center={"lat": 37.5246322, "lon": -77.5758331},
    zoom=4                        # Adjust zoom level for better visibility
)

# Update the layout to use the default map style (which doesn't need a token)
fig.update_layout(
    mapbox_style="carto-darkmatter",  # No token needed for this style
    margin={"r":0,"t":0,"l":0,"b":0}  # Remove margins for a cleaner view
)



fig.show()


### 3.7 Improving labels and colors

Now the markers need some labels, and the colors need to be reversed and given a new midpoint.

We can set the midpoint for the color scale with the argument:

```python
color_continuous_midpoint=0,
```

We can reverse the `IceFire` color scale using reverse list slicing:

```python
[::-1]
```

Or you can pick a different one [here](https://plotly.com/python/builtin-colorscales/)




In [19]:
import plotly.express as px

fig = px.scatter_mapbox(
    df_geolocations_sentiments_small,  # DataFrame
    lat="latitude",                   # Latitude column
    lon="longitude",                  # Longitude column
    size="location_count_bucket",     # Bubble size based on location count bucket
    color="avg_roberta_compound",     # Color based on sentiment score
    color_continuous_scale=px.colors.cyclical.Edge[::-1],  # Edge color scale, reversed with [::-1]
    color_continuous_midpoint=0,      # Set 0 as the center point
    size_max=20,                      # Maximum size of the bubbles
    center={"lat": 37.5246322, "lon": -77.5758331},
    zoom=4,                           # Zoom level
    hover_name="place",               # Show place name
    hover_data={
        "avg_roberta_compound": ':.2f',   # Round sentiment score to 2 decimals
        "longitude": False,
        "latitude": False,
        "location_count_bucket": False
    }
)

# Update the layout to use the default map style
fig.update_layout(
    mapbox_style="carto-darkmatter",  # No token needed
    margin={"r": 0, "t": 0, "l": 0, "b": 0}  # Remove margins for cleaner view
)

# Show the plot
fig.show()



How else can this map be improved?

In [None]:
df_geolocations_sentiments_small.to_pickle('df_geolocations_sentiments_small.pickle')

---

## 🎯 Lesson Summary

### What We Accomplished:

1. **🔧 Setup & Installation**: Used a one-stop installation notebook to set up all dependencies
2. **🗺️ Geoparsing**: Used advanced AI to extract and resolve geographic locations from text
3. **📊 Data Integration**: Combined location data with sentiment analysis results
4. **🎨 Interactive Mapping**: Created beautiful, interactive maps showing sentiment patterns

### Key Skills Learned:

- **Advanced Text Processing**: Using transformer models for geographic entity recognition
- **Data Pipeline**: Combining multiple data processing steps into a coherent workflow  
- **Geographic Visualization**: Creating sophisticated interactive maps with plotly
- **Data Science**: Handling real-world data challenges like false positives and scale differences

### Technical Tools Mastered:

- `geoparser`: State-of-the-art geographic text processing
- `plotly`: Interactive data visualization  
- `mapclassify`: Intelligent data bucketing for visualization
- `pandas`: Advanced data manipulation and aggregation

### Applications:

This workflow can be applied to:
- **Literary Analysis**: Studying geographic patterns in literature
- **Historical Research**: Mapping sentiment about places over time
- **Social Media Analysis**: Understanding geographic sentiment in tweets/posts
- **News Analysis**: Tracking how different locations are portrayed in media

### Next Steps:

- Try this workflow with your own text data
- Experiment with different visualization styles and color schemes
- Explore temporal patterns (how sentiment changes over time)
- Combine with other data sources (demographics, economics, etc.)

🎉 **Congratulations!** You've completed a full geospatial text analysis pipeline!