In [2]:
from geoparser import Geoparser
from tqdm.autonotebook import tqdm, trange
import pandas as pd

import warnings

# Suppress all FutureWarnings
warnings.simplefilter(action='ignore', category=FutureWarning)





### Load Geoparser

To use Geoparser, instantiate an object of the Geoparser class with optional specifications for the spaCy model, transformer model, and gazetteer. By default, the library uses an accuracy-optimised configuration:

In [6]:
geo = Geoparser(spacy_model='en_core_web_trf', transformer_model='dguzh/geo-all-distilroberta-v1', gazetteer='geonames')

Geoparser is optimised for parsing large collections of texts at once. To perform parsing, supply a list of strings to the parse method. This method processes the input and returns a list of GeoDoc objects, each containing identified and resolved toponyms:

```python
docs = geo.parse(["Sample text 1", "Sample text 2", "Sample text 3"])
```

Modify the code above to parse a few strings of your own for geodata.


In [9]:
docs = geo.parse(["The text goes here. New York", "There are also locations in Harrisonburg"])

Toponym Recognition...


Batches:   0%|          | 0/2 [00:00<?, ?it/s]

Toponym Resolution...


Batches:   0%|          | 0/5 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

We can get the results by looping over every toponym in each document of the collection stored in `docs`.

In [12]:
for doc in docs:
    for toponym in doc.toponyms:
        print(toponym)

New York
Harrisonburg


The toponym variable actually stores a lot of other information as well. The data structure looks like this:

```python
{
'geonameid': 2867714,
'name': 'Munich',
'admin2_geonameid': 2861322,
'admin2_name': 'Upper Bavaria',
'admin1_geonameid': 2951839,
'admin1_name': 'Bavaria',
'country_geonameid': 2921044,
'country_name': 'Germany',
'feature_name': 'seat of a first-order administrative division',
'latitude': 48.13743,
'longitude': 11.57549,
'elevation': None,
'population': 1260391
}
```

Much like indexing a list or a DataFrame, we can reach these individual values through the toponym's `.location` attribute and then look each one up by its `key`.

In [16]:
for doc in docs:
    for toponym in doc.toponyms:
        if toponym.location:
            # If location is resolved
            place_name = toponym
            latitude = toponym.location['latitude']
            longitude = toponym.location['longitude']
            print(f"Place: {place_name}, Latitude: {latitude}, Longitude: {longitude}")
        else:
            # If no location is resolved
            print(f"No location found for toponym: {toponym}")


Place: New York, Latitude: 40.71427, Longitude: -74.00597
Place: Harrisonburg, Latitude: 38.44957, Longitude: -78.86892
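
The lookup pattern can be tried without loading any models. The following is a minimal sketch, assuming a plain dictionary shaped like the Munich structure shown earlier; the `location` variable and the `describe` helper are illustrative stand-ins and not part of Geoparser.

```python
# Stand-in for a resolved location, using a few keys from the Munich example
location = {
    "name": "Munich",
    "country_name": "Germany",
    "latitude": 48.13743,
    "longitude": 11.57549,
    "elevation": None,
}

def describe(location):
    """Return a one-line summary, or a fallback when nothing was resolved."""
    if not location:  # None or an empty dict both count as unresolved
        return "No location found"
    # .get() returns None instead of raising KeyError when a key is missing
    return f"{location.get('name')}: {location.get('latitude')}, {location.get('longitude')}"

summary = describe(location)
unresolved = describe(None)
print(summary)
print(unresolved)
```

Using `.get()` rather than square brackets is a defensive choice here: a missing key yields `None` instead of stopping the loop with an error.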


### Load in sentiment data

We can load the data from the last lesson and start from there.

In [20]:
df_virginia_toponyms_sentiment = pd.read_pickle('df_virginia_toponym_sentiment_full.pickle')

Because the toponym extraction process is very time-intensive, we will run it on a sample just to show how it works. The full results will be imported below.

In [25]:
df_virginia_sample = df_virginia_toponyms_sentiment.head(300).copy()
df_virginia_sample.sample(5)

Unnamed: 0,text_id,title,subjects,last_name,first_name,birth,death,cleaned_sentences,toponyms,nltk_sentiment,roberta_neg,roberta_neu,roberta_pos
77,2674,The Complete Writings of Charles Dudley Warner...,Autobiographies; Virginia -- Description and t...,Warner,Charles Dudley,1829,1900,The proprietor sat in a little railed veranda.,[veranda],0.0,0.082492,0.845996,0.071511
251,2898,Pioneers of the Old South: A Chronicle of Engl...,"Southern States -- History -- Colonial period,...",Johnston,Mary,1870,1936,Such was the Virginia between the Potomac and ...,[Virginia],0.0,0.066193,0.886669,0.047138
227,2898,Pioneers of the Old South: A Chronicle of Engl...,"Southern States -- History -- Colonial period,...",Johnston,Mary,1870,1936,Its letters patent were for North Virginia.,[North Virginia],0.0,0.093607,0.870762,0.035631
136,2674,The Complete Writings of Charles Dudley Warner...,Autobiographies; Virginia -- Description and t...,Warner,Charles Dudley,1829,1900,I shall go away with a high opinion of the ho...,[Mitchell County],0.4939,0.018908,0.601169,0.379923
127,2674,The Complete Writings of Charles Dudley Warner...,Autobiographies; Virginia -- Description and t...,Warner,Charles Dudley,1829,1900,"At the top he encountered a stranger, on a sor...",[Burnsville],-0.296,0.131433,0.804757,0.063811


Since `geo.parse` expects a list of strings and returns a fairly complicated data structure, this logic has been wrapped in a function to keep things simple. The function goes through the strings in `cleaned_sentences` and tries to identify and resolve toponyms. It also passes a `feature_filter` that restricts results to administrative areas [`A`] (countries, states, counties) and populated places [`P`]; without it the process would take much longer and would also return geographic features such as streams, rivers, and gullies.

In [35]:
def geoparse_column(df):
    sentences = df['cleaned_sentences'].tolist()  # Convert column to list
    docs = geo.parse(sentences, feature_filter=['A', 'P'])  # Run geo.parse on the entire list

    # Initialize lists to store the extracted fields
    places, latitudes, longitudes, feature_names = [], [], [], []

    # Iterate through the results and extract toponyms and their locations
    for doc in docs:
        doc_places = []
        doc_latitudes = []
        doc_longitudes = []
        doc_feature_names = []

        for toponym in doc.toponyms:
            if toponym.location:
                doc_places.append(toponym.location.get('name'))
                doc_latitudes.append(toponym.location.get('latitude'))
                doc_longitudes.append(toponym.location.get('longitude'))
                doc_feature_names.append(toponym.location.get('feature_name'))
            else:
                doc_places.append(None)
                doc_latitudes.append(None)
                doc_longitudes.append(None)
                doc_feature_names.append(None)

        # Append the extracted data for the document
        places.append(doc_places)
        latitudes.append(doc_latitudes)
        longitudes.append(doc_longitudes)
        feature_names.append(doc_feature_names)

    # Assign the extracted data to the DataFrame as new columns
    df['place'] = places
    df['latitude'] = latitudes
    df['longitude'] = longitudes
    df['feature_name'] = feature_names

    return df
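
To see the shape of what this function builds without running the models, here is a small sketch. The `SimpleNamespace` objects are hypothetical stand-ins that mimic only the `.toponyms` and `.location` attributes used above; they are not real GeoDoc objects.

```python
from types import SimpleNamespace

# Hypothetical stand-ins for parsed documents (not the Geoparser API)
docs = [
    SimpleNamespace(toponyms=[]),  # sentence with no toponym recognised
    SimpleNamespace(toponyms=[
        SimpleNamespace(location=None),  # recognised but not resolved
        SimpleNamespace(location={"name": "South Carolina", "latitude": 34.00043}),
    ]),
]

# One inner list per document, one entry per toponym (None when unresolved)
places = [
    [t.location.get("name") if t.location else None for t in doc.toponyms]
    for doc in docs
]
print(places)
```

Each row of the DataFrame therefore receives a list, which may be empty or may contain `None` entries.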


In [37]:
geoparse_results_sample = geoparse_column(df_virginia_sample)

Toponym Recognition...


Batches:   0%|          | 0/300 [00:00<?, ?it/s]

Toponym Resolution...


Batches:   0%|          | 0/505 [00:00<?, ?it/s]

Batches:   0%|          | 0/57 [00:00<?, ?it/s]

In [41]:
geoparse_results_sample.sample(5, random_state=10)

Unnamed: 0,text_id,title,subjects,last_name,first_name,birth,death,cleaned_sentences,toponyms,nltk_sentiment,roberta_neg,roberta_neu,roberta_pos,place,latitude,longitude,feature_name
24,2674,The Complete Writings of Charles Dudley Warner...,Autobiographies; Virginia -- Description and t...,Warner,Charles Dudley,1829,1900,"When supper came, he never went near Cynthia, ...",[Cynthia],0.4215,0.263413,0.641921,0.094665,[],[],[],[]
65,2674,The Complete Writings of Charles Dudley Warner...,Autobiographies; Virginia -- Description and t...,Warner,Charles Dudley,1829,1900,"Yes, they'd got a Republican member of Congre...",[he'd],-0.128,0.252675,0.648327,0.098998,[],[],[],[]
113,2674,The Complete Writings of Charles Dudley Warner...,Autobiographies; Virginia -- Description and t...,Warner,Charles Dudley,1829,1900,Grandfather loomed up much more loftily than t...,[South Carolina],-0.7003,0.021178,0.523528,0.455294,"[None, South Carolina]","[None, 34.00043]","[None, -81.00009]","[None, first-order administrative division]"
261,2898,Pioneers of the Old South: A Chronicle of Engl...,"Southern States -- History -- Colonial period,...",Johnston,Mary,1870,1936,Ten years' time from this first Virginia voyag...,[Virginia],-0.25,0.285936,0.682786,0.031278,"[Virginia, Java]","[37.54812, 36.83597]","[-77.44675, -79.2278]","[first-order administrative division, populate..."
188,2674,The Complete Writings of Charles Dudley Warner...,Autobiographies; Virginia -- Description and t...,Warner,Charles Dudley,1829,1900,The stream that originates in Hickory Nut Gap ...,[Rutherford County],0.0258,0.021638,0.93245,0.045912,"[None, East Windsor, Rutherford County, Columb...","[None, 41.91232, 35.8427, 35.61507, 39.46883, ...","[None, -72.54509, -86.41674, -87.03528, -74.63...","[None, populated place, second-order administr..."


Because each resolved location comes back as an individual dictionary, `geoparse_column` has already extracted the fields we need into the columns `place`, `latitude`, `longitude`, and `feature_name`.

There are several things of note in the data. First, for some sentences the geoparser did not find a toponym, which is indicated by empty lists `[]`. This is because the model is more accurate and will likely produce fewer false positives. A `None` entry means a toponym was recognised but could not be resolved to a location. We will have to remember to remove these. Likewise, the parsing is currently set to include administrative areas such as countries and states (e.g. the US and Virginia) as well as populated places (e.g. Richmond, Harrisonburg). We will have to think about how to deal with these down the road.
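
One possible way to handle the empty lists and the `None` entries is sketched below on a made-up three-row DataFrame. The `toy` frame and its values are assumptions for illustration only.

```python
import pandas as pd

# Made-up rows mirroring the three cases seen in the sample output
toy = pd.DataFrame({
    "cleaned_sentences": ["s1", "s2", "s3"],
    "place": [[], [None, "South Carolina"], ["Virginia", "Java"]],
})

# .str.len() gives the length of each list; 0 means no toponym was recognised
n_places = toy["place"].str.len()
has_toponym = toy[n_places > 0].copy()

# Drop unresolved (None) entries inside each remaining list
has_toponym["place"] = has_toponym["place"].apply(
    lambda places: [p for p in places if p is not None]
)
print(has_toponym)
```

If the `latitude`, `longitude`, and `feature_name` lists are kept alongside `place`, the same filtering would need to be applied to them so the lists stay aligned.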

**We can run the geoparser for all the data and expect to wait at least an hour!**

In [None]:
# Run the geoparser over the entire 'cleaned_sentences' column
geoparse_results = geoparse_column(df_virginia_toponyms_sentiment)

Toponym Recognition...


Batches:   0%|          | 0/45972 [00:00<?, ?it/s]

In [None]:
# Display the updated DataFrame with new columns
df_virginia_toponyms_sentiment.head()

In [None]:
# Add the results back to the original DataFrame
new_columns = ['place', 'latitude', 'longitude', 'feature_name']
df_virginia_toponyms_sentiment[new_columns] = geoparse_results[new_columns]

# Display the updated DataFrame with new columns
df_virginia_toponyms_sentiment.head()

In [None]:
df_virginia_toponyms_sentiment.to_pickle('df_virginia_toponyms_all.pickle')

In [None]:
df_virginia_all = pd.read_pickle('df_virginia_toponyms_all.pickle')

In [None]:
df_virginia_all.sample(10)

In [None]:
empty_percent = (df_virginia_toponyms_sentiment['place'].str.len() == 0).mean() * 100
empty_percent

In [None]:
# Keep only rows where at least one toponym was found
df_virginia_cleaned = df_virginia_all[df_virginia_all['place'].str.len() != 0].copy()

In [None]:
df_virginia_cleaned[['cleaned_sentences', 'place']].sample(5, random_state=15)

In [None]:
df_virginia_long = df_virginia_cleaned.explode(['place', 'latitude', 'longitude', 'location'])
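
As a small illustration of what `explode` does with several list columns at once, consider the sketch below. The `toy` frame and its coordinates are made up, and passing a list of columns assumes pandas 1.3 or later.

```python
import pandas as pd

# Made-up data: each row holds parallel lists of equal length
toy = pd.DataFrame({
    "sentence_id": [1, 2],
    "place": [["Virginia", "Java"], ["Richmond"]],
    "latitude": [[37.5, 36.8], [37.6]],
})

# Each list element becomes its own row; other columns are repeated
long = toy.explode(["place", "latitude"]).reset_index(drop=True)
print(long)
```

The lists in the exploded columns must have the same length within each row, otherwise pandas raises an error.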

In [None]:
df_virginia_long

In [None]:
df_virginia_long = df_virginia_long.reset_index(drop=True)

In [None]:
df_virginia_long.sample()

In [None]:
df_virginia_long.to_pickle('df_virginia_long.pickle')

In [None]:
df_geolocations_sentiments = df_virginia_long.groupby('place').agg(
    location_count=('place', 'size'),  # Count occurrences of each location
    latitude=('latitude', 'first'),    # Take the first latitude (you can also use 'mean')
    longitude=('longitude', 'first'),  # Take the first longitude (or 'mean')
    location=('location','first'),
    avg_roberta_pos=('roberta_pos', 'mean'),  # Average of roberta_pos
    avg_roberta_neu=('roberta_neu', 'mean'),  # Average of roberta_neu
    avg_roberta_neg=('roberta_neg', 'mean')   # Average of roberta_neg
).reset_index()
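
The named-aggregation pattern used here can be checked on a tiny example. This is a sketch with invented values; `toy_long` is an assumed stand-in for the exploded DataFrame, and the count is taken on a non-key column, which gives the same group size.

```python
import pandas as pd

# Invented rows: one row per (sentence, place) pair
toy_long = pd.DataFrame({
    "place": ["Richmond", "Richmond", "Norfolk"],
    "latitude": [37.55, 37.55, 36.85],
    "roberta_pos": [0.2, 0.4, 0.1],
    "roberta_neg": [0.1, 0.3, 0.5],
})

# Each keyword is an output column: (input column, aggregation)
summary = toy_long.groupby("place").agg(
    location_count=("roberta_pos", "size"),   # rows per place
    latitude=("latitude", "first"),           # first coordinate seen
    avg_roberta_pos=("roberta_pos", "mean"),
    avg_roberta_neg=("roberta_neg", "mean"),
).reset_index()
print(summary)
```

Taking `'first'` for the coordinates is reasonable only if every row with the same `place` name resolved to the same location; two different places that share a name would be merged.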



In [None]:
df_geolocations_sentiments.sample(10)

In [None]:
df_geolocations_sentiments.sort_values(by='location_count', ascending=False).head(10)

In [None]:
# Compute a single sentiment score by subtracting avg_roberta_neg from avg_roberta_pos
df_geolocations_sentiments['sentiment_score'] = df_geolocations_sentiments['avg_roberta_pos'] - df_geolocations_sentiments['avg_roberta_neg']

In [None]:
# Set the display option to show all columns
pd.set_option('display.max_columns', None)

# Display all rows where 'place' is 'Virginia'
df_geolocations_sentiments.location[df_geolocations_sentiments.place == 'Virginia']

In [None]:
# Filter the 'location' data for rows where 'place' is 'Virginia'
locations_virginia = df_geolocations_sentiments.location[df_geolocations_sentiments.place == 'Virginia']

# Iterate over the filtered locations and print each key-value pair in the dictionary
for loc in locations_virginia:
    if isinstance(loc, dict):
        print("Location details:")
        for key, value in loc.items():
            print(f"{key}: {value}")
        print("\n" + "-"*40 + "\n")  # Separator for each dictionary
    else:
        print("No valid location data found.")

In [None]:
# Filter the 'location' data for rows where 'place' is 'Richmond'
locations_richmond = df_geolocations_sentiments.location[df_geolocations_sentiments.place == 'Richmond']

# Iterate over the filtered locations and print each key-value pair in the dictionary
for loc in locations_richmond:
    if isinstance(loc, dict):
        print("Location details:")
        for key, value in loc.items():
            print(f"{key}: {value}")
        print("\n" + "-"*40 + "\n")  # Separator for each dictionary
    else:
        print("No valid location data found.")


In [None]:
# Extract 'feature_name' from the 'location' dictionary and create a new column
df_geolocations_sentiments['feature_name'] = df_geolocations_sentiments['location'].apply(lambda x: x.get('feature_name') if isinstance(x, dict) else None)

# Display the first few rows to verify the new column
df_geolocations_sentiments.feature_name.unique()
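
The `apply` call above can be read as "take one key from each dictionary, and fall back to `None` when the cell is not a dictionary". A minimal sketch on made-up data, where `toy` and `feature_name_of` are illustrative names:

```python
import pandas as pd

# Made-up rows: one resolved location and one missing value
toy = pd.DataFrame({
    "place": ["Virginia", "Unknown"],
    "location": [
        {"name": "Virginia", "feature_name": "first-order administrative division"},
        None,
    ],
})

def feature_name_of(loc):
    """Return the 'feature_name' entry, or None if loc is not a dictionary."""
    return loc.get("feature_name") if isinstance(loc, dict) else None

toy["feature_name"] = toy["location"].apply(feature_name_of)
print(toy[["place", "feature_name"]])
```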


In [None]:
import plotly.express as px

# Create the map using plotly.express
fig = px.scatter_mapbox(
    df_geolocations_sentiments,
    lat="latitude",               # Latitude column
    lon="longitude",              # Longitude column
    size="location_count",        # Bubble size based on location count
    color="sentiment_score",      # Color based on sentiment score
    color_continuous_scale=px.colors.cyclical.IceFire,  # Use IceFire scale (blue to red)
    size_max=15,                  # Maximum size of the bubbles
    zoom=5                        # Adjust zoom level for better visibility
)

# Update the layout to use the default map style (which doesn't need a token)
fig.update_layout(
    mapbox_style="stamen-toner",  # No token needed for this style
    margin={"r":0,"t":0,"l":0,"b":0}  # Remove margins for a cleaner view
)

In [None]:
import plotly.express as px

# Create the map using plotly.express
fig = px.scatter_mapbox(
    df_geolocations_sentiments,     # DataFrame with latitude, longitude, etc.
    lat="latitude",                # Latitude column
    lon="longitude",               # Longitude column
    size="location_count",         # Bubble size based on location count
    color="sentiment_score",       # Color based on sentiment score
    hover_name="place",            # Display place name on hover
    hover_data={"location_count": True, "sentiment_score": True},  # Show count and sentiment score
    color_continuous_scale=px.colors.cyclical.IceFire,  # Use IceFire scale (blue to red)
    size_max=50,                   # Maximum size of the bubbles
    zoom=8                         # Adjust zoom level for closer view
)

# Update the layout to use the 'carto-positron' map style and center on Richmond, VA
fig.update_layout(
    mapbox_style="carto-positron",  # Black and white light theme
    mapbox_center={"lat": 37.5407, "lon": -77.4360},  # Center map on Richmond, Virginia
    margin={"r":0,"t":0,"l":0,"b":0},  # Remove margins for a cleaner view
    mapbox_zoom=8  # Initial zoom level (adjust to your preference)
)

# Display the map
fig.show()


In [None]:
import plotly.express as px
import numpy as np

# Define a function to scale location_count to a specific range with a minimum size
def scale_bubble_size(counts, min_size=5, max_size=50):
    # Scale the location_count to be within min_size and max_size
    scaled_size = np.interp(counts, (counts.min(), counts.max()), (min_size, max_size))
    return scaled_size

# Scale the location_count column to enforce a minimum size of 5
df_geolocations_sentiments['scaled_size'] = scale_bubble_size(df_geolocations_sentiments['location_count'], min_size=5, max_size=15)

# Create the map using plotly.express
fig = px.scatter_mapbox(
    df_geolocations_sentiments,     # DataFrame with latitude, longitude, etc.
    lat="latitude",                # Latitude column
    lon="longitude",               # Longitude column
    size="scaled_size",            # Use the scaled bubble size
    color="sentiment_score",       # Color based on sentiment score
    color_continuous_scale=px.colors.cyclical.IceFire,  # Use IceFire scale (blue to red)
    zoom=8,                        # Adjust zoom level for closer view
    hover_name="place",            # Display place name on hover
    hover_data={"location_count": True, "sentiment_score": True}  # Show count and sentiment score
)

# Update the layout to use the 'carto-positron' map style and center on Richmond, VA
fig.update_layout(
    mapbox_style="carto-positron",  # Black and white light theme
    mapbox_center={"lat": 37.5407, "lon": -77.4360},  # Center map on Richmond, Virginia
    margin={"r":0,"t":0,"l":0,"b":0},  # Remove margins for a cleaner view
    mapbox_zoom=8  # Initial zoom level (adjust to your preference)
)

# Display the map
fig.show()
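
To make the rescaling step concrete, the sketch below applies the same `np.interp` idea to three counts. The `scale_sizes` name and the guard for identical counts are additions made for this illustration, not part of the cell above.

```python
import numpy as np

def scale_sizes(counts, min_size=5, max_size=15):
    """Linearly map counts onto the range [min_size, max_size]."""
    counts = np.asarray(counts, dtype=float)
    lo, hi = counts.min(), counts.max()
    if lo == hi:
        # Assumed fallback: with identical counts there is nothing to spread out
        return np.full_like(counts, min_size)
    return np.interp(counts, (lo, hi), (min_size, max_size))

sizes = scale_sizes([1, 10, 100])
print(sizes)
```

The smallest count maps to `min_size` and the largest to `max_size`, so rarely mentioned places remain visible while frequently mentioned ones do not dominate the map.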


In [None]:
import plotly.express as px
import numpy as np

# Keep only locations mentioned at least 50 times; .copy() avoids a SettingWithCopyWarning below
df_filtered = df_geolocations_sentiments[df_geolocations_sentiments['location_count'] >= 50].copy()

# Define a function to scale location_count to a specific range with a minimum size
def scale_bubble_size(counts, min_size=5, max_size=15):
    # Scale the location_count to be within min_size and max_size
    scaled_size = np.interp(counts, (counts.min(), counts.max()), (min_size, max_size))
    return scaled_size

# Scale the location_count column to enforce a minimum size of 5
df_filtered['scaled_size'] = scale_bubble_size(df_filtered['location_count'], min_size=5, max_size=15)

# Create the map using plotly.express
fig = px.scatter_mapbox(
    df_filtered,                  # Filtered DataFrame with latitude, longitude, etc.
    lat="latitude",               # Latitude column
    lon="longitude",              # Longitude column
    size="scaled_size",           # Use the scaled bubble size
    color="sentiment_score",      # Color based on sentiment score
    color_continuous_scale=px.colors.cyclical.IceFire,  # Use IceFire scale (blue to red)
    zoom=8,                       # Adjust zoom level for closer view
    hover_name="place",           # Display place name on hover
    hover_data={"location_count": True, "sentiment_score": True}  # Show count and sentiment score
)

# Update the layout to use the 'carto-positron' map style and center on Richmond, VA
fig.update_layout(
    mapbox_style="carto-positron",  # Black and white light theme
    mapbox_center={"lat": 37.5407, "lon": -77.4360},  # Center map on Richmond, Virginia
    margin={"r":0,"t":0,"l":0,"b":0},  # Remove margins for a cleaner view
    mapbox_zoom=8  # Initial zoom level (adjust to your preference)
)

# Display the map
fig.show()
