# Mapping Regional Bias in the Field of Italian Studies
by Mac Callahan / 21 December 2023 / CLS 161: Introduction to Digital Humanities

## Introduction

As digital humanist Crystal Hall notes, the field of Italian Studies, like Digital Humanities, has struggled with “inclusivity and the representation of traditionally marginalized voices” (Hall, p. 103). In an Italian context, this includes “the [Southern Italian] stories and voices which have been underrepresented in both print and digital knowledge production…[and] marginalized in their national contexts” (Risam, p. 139). The lack of scholarship focused on Southern Italy echoes a broader conundrum -- *la questione meridionale*, or "The Southern Question" -- which concerns the political and economic underdevelopment of southern Italy relative to its northern regions (Riall, p. 90). For my final project, I therefore set out to address the following research question: Does the contemporary field of Italian Studies reflect the regional diversity of the Italian peninsula?

To answer this question, I compiled a dataset of all articles (excluding book reviews) published in a leading, peer-reviewed journal (aptly) titled [Italian Studies](https://www-tandfonline-com.ezproxy.library.tufts.edu/journals/yits20) in 2023. This included 31 unique works, some in English and others in Italian, that traversed discussions of Italy's rich history, art, literature, cultural studies and more.

I then used named entity recognition (NER), a natural language processing (NLP) method, to extract location data from the articles. Using an open-source gazetteer of locations in Italy, I matched location names with latitude and longitude coordinates. Finally, I mapped the location data onto an ArcGIS heat map, creating a data visualization for further analysis.

## Data
I downloaded all 31 articles as PDFs, converted them into Word documents, and copied and pasted their bodies into a single .txt file. Copying from Word documents made it easy to exclude footnotes from the data set. Footnotes often name the cities where sources were published and would therefore add data that do not reflect the articles' subject matter. See: [Compilation of 31 Italian Studies articles (All_texts.txt)](https://github.com/mac-callahan/cls161_fall23/blob/main/Final_Project/data/All_texts.txt)

GeoNames is a free, open-source geographical database that includes over eleven million placenames across all countries. I downloaded its Italy-specific geographical database, which includes coordinates. See: [GeoNames Italy Gazetteer (IT.csv)](https://github.com/mac-callahan/cls161_fall23/blob/main/Final_Project/IT.csv)

## Code

In [64]:
# import necessary libraries
import os
import pandas as pd
import stanza
import json
import wget
import requests

In [154]:
# load stanza nlp pipeline that tokenizes and performs Named Entity Recognition
# note: the English models are applied to every article, including those written in Italian
nlp_ner = stanza.Pipeline(lang='en', processors='tokenize,ner')

2023-12-17 21:38:57 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


2023-12-17 21:38:58 INFO: Loading these models for language: en (English):
| Processor | Package          |
--------------------------------
| tokenize  | combined         |
| ner       | ontonotes_charlm |

2023-12-17 21:38:58 INFO: Using device: cpu
2023-12-17 21:38:58 INFO: Loading: tokenize
2023-12-17 21:38:58 INFO: Loading: ner
2023-12-17 21:38:59 INFO: Done loading processors!


In [155]:
# create function that will take a text string as input and return a dictionary
# with locations and location counts from the text string
def get_locations_from_text(text):
    locations_dict = {}
    doc = nlp_ner(text)
    for sentence in doc.sentences:
        for token in sentence.tokens:
            # 'S-GPE' marks a single-token geopolitical entity, so multi-word
            # place names (tagged B-/I-/E-GPE) are not counted here
            if token.ner == 'S-GPE':
                locations_dict[token.text] = locations_dict.get(token.text, 0) + 1
    return locations_dict
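
Because the function above only counts tokens tagged `S-GPE`, place names that span several tokens are skipped. The following is a minimal sketch of how BIOES tags could be merged so that multi-word places are counted as well. It assumes the tokens are available as `(text, tag)` pairs; `count_gpe_spans` and the example tokens are hypothetical and not part of the original notebook.

```python
from collections import Counter

def count_gpe_spans(tagged_tokens):
    """Count GPE entities from (text, BIOES-tag) pairs, joining multi-token spans."""
    counts = Counter()
    current = []  # tokens of the multi-token entity being assembled
    for text, tag in tagged_tokens:
        if tag == 'S-GPE':                  # single-token entity
            counts[text] += 1
            current = []
        elif tag == 'B-GPE':                # first token of a multi-token entity
            current = [text]
        elif tag == 'I-GPE' and current:    # middle token of an open entity
            current.append(text)
        elif tag == 'E-GPE' and current:    # last token: emit the joined name
            current.append(text)
            counts[' '.join(current)] += 1
            current = []
        else:                               # any other tag closes an open span
            current = []
    return dict(counts)

# hypothetical tagged tokens, for illustration only
example = [('San', 'B-GPE'), ('Marino', 'E-GPE'), ('and', 'O'),
           ('Rome', 'S-GPE'), ('and', 'O'), ('Rome', 'S-GPE')]
print(count_gpe_spans(example))
```

With stanza, the input pairs could be built from the same attributes the function above already uses, for example `[(t.text, t.ner) for s in doc.sentences for t in s.tokens]`.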

In [156]:
# identify path to the text file of 31 articles
path = 'data/All_texts.txt'

In [157]:
# read text from text file
with open(path, encoding='utf-8', mode='r') as f:
    text = f.read()

In [158]:
# apply function to get locations and location counts
locations = get_locations_from_text(text)

In [160]:
# preview
locations

{'Inghilterra': 1,
 'Crete': 1,
 'Costantinople': 1,
 'Naples': 9,
 'Florence': 34,
 'Italia': 18,
 'Germania': 2,
 'Germany': 5,
 'Italy': 252,
 'Ravenna': 1,
 'Costantinopoli': 1,
 'Livorno': 1,
 'Savoy': 1,
 'Nice': 1,
 'Rome': 51,
 'Rovereto': 1,
 'Trento': 1,
 'Trieste': 1,
 'London': 11,
 'England': 1,
 'Sicily': 2,
 'Washington': 1,
 'Matera': 21,
 'Plovdiv': 1,
 'Basilicata': 18,
 'Potenza': 6,
 'Bari': 4,
 'Altamura': 2,
 'Ferrandina': 2,
 'Torino': 2,
 'Melfi': 2,
 'Moliterno': 1,
 'Rionero': 1,
 'Tricarico': 1,
 'Venosa': 1,
 'Episcopia': 1,
 'Irsina': 1,
 'Catania': 8,
 'Messina': 1,
 'Palermo': 1,
 'Mezzogiorno': 2,
 'Zanardelli': 1,
 'Sicilia': 3,
 'Milan': 16,
 'Acerenza': 1,
 'Calabria': 4,
 'Cività': 1,
 'Largo': 1,
 'Taranto': 1,
 'Kingdom': 3,
 'Monterrone': 4,
 'Puglia': 1,
 'Lucania': 1,
 'Ethiopia': 21,
 'Piredda': 1,
 'Ujiji': 2,
 'Cavour': 1,
 'Britain': 4,
 'France': 6,
 'Tripolitania': 1,
 'Brunetta': 1,
 'Seraye': 1,
 'Eritrea': 2,
 'Libya': 3,
 'Somalia': 1,
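
The preview shows the same place under both its Italian and English names (for example `'Italia'` and `'Italy'`, or `'Sicilia'` and `'Sicily'`), so their counts are split. One possible way to combine them, sketched here under the assumption that a hand-built alias table is acceptable, is to map each variant to one canonical name before counting. `ALIASES` and `merge_aliases` are hypothetical names, and the table covers only variants visible in the output above.

```python
from collections import Counter

# hypothetical alias table: variant spelling -> canonical name
ALIASES = {
    'Italia': 'Italy',
    'Sicilia': 'Sicily',
    'Germania': 'Germany',
    'Inghilterra': 'England',
    'Costantinopoli': 'Constantinople',
    'Costantinople': 'Constantinople',
}

def merge_aliases(counts, aliases=ALIASES):
    """Sum the counts of all variants that share a canonical name."""
    merged = Counter()
    for name, n in counts.items():
        merged[aliases.get(name, name)] += n
    return dict(merged)

# counts copied from the preview above
sample = {'Italia': 18, 'Italy': 252, 'Sicily': 2,
          'Sicilia': 3, 'Germania': 2, 'Germany': 5}
print(merge_aliases(sample))
```

Since the gazetteer lookup matches on an exact name, the canonical form chosen for each place would need to be whichever spelling the gazetteer itself uses.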

In [161]:
# convert dictionary to DataFrame for easier processing
location_count_df = pd.DataFrame.from_dict(locations, orient='index').reset_index().rename(columns={'index':'place_name', 0:'count'})
# preview DataFrame
location_count_df.head()

|   | place_name | count |
|---|---|---|
| 0 | Inghilterra | 1 |
| 1 | Crete | 1 |
| 2 | Costantinople | 1 |
| 3 | Naples | 9 |
| 4 | Florence | 34 |


In [162]:
# convert GeoNames Italy Gazetteer to DataFrame and preview it
places_df = pd.read_csv('IT.csv')
places_df.head()

|   | geonameid | name | asciiname | alternatenames | latitude | longitude |
|---|---|---|---|---|---|---|
| 0 | 781059 | Colognole | Colognole |  | 43.50972 | 10.44833 |
| 1 | 781060 | Casale Sant'Antonio | Casale Sant'Antonio |  | 44.61907 | 11.02235 |
| 2 | 2522617 | Graham Island | Graham Island | Banco Graham,Banco Grahm,Ferdinandea Bank,Ferd... | 37.14266 | 12.88126 |
| 3 | 2522676 | Zungti | Zungti |  | 38.65 | 15.98333 |
| 4 | 2522677 | Zumpano | Zumpano | Zumpano | 39.31053 | 16.29269 |


In [163]:
# return GeoNames ID of the first gazetteer row whose name exactly matches
# a given location in Italy (None if there is no match)
def get_geonames_id(location):
    place_row = places_df.loc[places_df['name'] == location]
    if len(place_row) >= 1:
        return int(place_row['geonameid'].iloc[0])
    else:
        return None

# return latitude of a location, given its GeoName ID
def get_latitude(geoname_id):
    places_row = places_df.loc[places_df['geonameid'] == geoname_id]
    if len(places_row) == 1:
        return places_row.latitude.iloc[0]

# return longitude of a location, given its GeoName ID
def get_longitude(geoname_id):
    places_row = places_df.loc[places_df['geonameid'] == geoname_id]
    if len(places_row) == 1:
        return places_row.longitude.iloc[0]

In [164]:
# add GeoNames IDs to location DataFrame
# drop places not recognized in GeoNames
location_count_df['geoname_id'] = location_count_df['place_name'].apply(get_geonames_id)
location_count_df = location_count_df.dropna(subset=['geoname_id']).reset_index(drop=True)

# add coordinates to location DataFrame
location_count_df['latitude'] = location_count_df['geoname_id'].apply(get_latitude)
location_count_df['longitude'] = location_count_df['geoname_id'].apply(get_longitude)
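
The lookups above run one function call per row. As a rough alternative, the same matching could be expressed as a single pandas join. This is a sketch under stated assumptions rather than the method used in this project: the two small frames stand in for `location_count_df` and `places_df`, their values are copied from outputs shown in this notebook, and the variable names `counts`, `gazetteer`, `first_match`, and `merged` are hypothetical.

```python
import pandas as pd

# small stand-in frames; values copied from outputs shown in this notebook
counts = pd.DataFrame({'place_name': ['Naples', 'Florence', 'Inghilterra'],
                       'count': [9, 34, 1]})
gazetteer = pd.DataFrame({'geonameid': [3172394, 3176959],
                          'name': ['Naples', 'Florence'],
                          'latitude': [40.85216, 43.77925],
                          'longitude': [14.26811, 11.24626]})

# keep the first gazetteer row per name, mirroring the .iloc[0] choice above
first_match = gazetteer.drop_duplicates(subset='name', keep='first')

# an inner join drops places that have no gazetteer match
merged = (
    counts.merge(
        first_match[['name', 'geonameid', 'latitude', 'longitude']],
        left_on='place_name', right_on='name', how='inner',
    )
    .drop(columns='name')
    .rename(columns={'geonameid': 'geoname_id'})
)
print(merged)
```

A join of this kind also keeps `geoname_id` as an integer column, because unmatched rows are removed before any missing values are introduced.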

In [165]:
# Clean up location DataFrame by removing observed non-Italian or overly-general locations
dispose_of = ['Crete', 'Calvino', 'Italia', 'Savoy', 'London', 'Washington',
              'Vienna', 'Paris', 'America', 'Houston', 'Londra', 'Boston',
              'Sacramento', 'Cambridge', 'Mexico', 'Orlando', 'Cuba', 'Versailles']
location_count_df = location_count_df[~location_count_df['place_name'].isin(dispose_of)]

In [166]:
# save location DataFrame to a .csv file for import into ArcGIS
file_name = 'location_count_df.csv'
location_count_df.to_csv(file_name)

## Artifacts

[Post-NER location data with matched coordinates (location_count_df.csv)](https://github.com/mac-callahan/cls161_fall23/blob/main/Final_Project/location_count_df.csv)

In [167]:
display(location_count_df)

|   | place_name | count | geoname_id | latitude | longitude |
|---|---|---|---|---|---|
| 1 | Naples | 9 | 3172394.0 | 40.85216 | 14.26811 |
| 2 | Florence | 34 | 3176959.0 | 43.77925 | 11.24626 |
| 4 | Ravenna | 1 | 3169561.0 | 44.41344 | 12.20121 |
| 5 | Livorno | 1 | 3174659.0 | 43.54427 | 10.32615 |
| 7 | Nice | 1 | 10294319.0 | 45.47237 | 12.22590 |
| ... | ... | ... | ... | ... | ... |
| 107 | Risorgimento | 1 | 8958713.0 | 45.75488 | 12.27099 |
| 108 | Alba | 1 | 3183364.0 | 44.69990 | 8.03470 |
| 109 | Mombarcaro | 1 | 6534291.0 | 44.46764 | 8.08824 |
| 110 | Valdivilla | 1 | 8955600.0 | 44.70948 | 8.18276 |


[Location data mapped onto ArcGIS heat map](https://tuftsgis.maps.arcgis.com/apps/instant/basic/index.html?appid=ba0cb371279e46ddbf2d9830c26f298d)

In [4]:
from arcgis.gis import GIS

gis = GIS()

webmap_search = gis.content.search(
  query="Mapping Geographic Bias in the Field of Italian Studies",
  item_type="Web Map"
)
webmap_search

webmap_item = webmap_search[0]
webmap_item

from arcgis.mapping import WebMap
italy_map = WebMap(webmap_item)
italy_map

MapView(hide_mode_switch=True, layout=Layout(height='400px', width='100%'))

## Discussion
By using computational tools to extract location data from recent Italian Studies journal articles and projecting them onto a map, this project lets viewers see that scholars' references to northern Italy are overrepresented in comparison to southern Italy. Typically, *il mezzogiorno* ("The South") is considered to comprise the regions of the former Kingdom of Sicily (beginning between Rome and Naples and extending to the southern tip of Sicily). The visualization reveals that Italian Studies scholarship revolves around cultural centers north of that line -- namely Rome, Florence, Venice, and Milan. Only one location in Southern Italy garnered more than 15 references -- the region of Basilicata -- and it was referenced disproportionately in a single article focused on southern Italy ("Matera in posa: The Photographic Self-Portrait of a Southern-Italian City, 1900–1920"). Setting this article aside, the representation of southern cultural centers in recent Italian Studies literature appears sparse. For example, Palermo is mentioned only once, and Syracuse is not mentioned at all in the data set. Neither is the island region of Sardinia, which has a population of over 1.6 million.
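
The north/south comparison could also be expressed as a single number. The sketch below is a crude illustration, not part of the original analysis: it treats everything south of a chosen parallel as "the South" and computes that area's share of all references. The cut-off `SOUTH_OF_LAT` is a hypothetical value placed between Rome (about 41.9° N) and Naples (about 40.85° N), `southern_share` is a hypothetical helper, and the four rows are copied from the table in the Artifacts section.

```python
# hypothetical cut-off latitude between Rome and Naples (an assumption, not a standard)
SOUTH_OF_LAT = 41.4

# (place_name, count, latitude) rows copied from the Artifacts table
rows = [
    ('Naples', 9, 40.85216),
    ('Florence', 34, 43.77925),
    ('Ravenna', 1, 44.41344),
    ('Livorno', 1, 43.54427),
]

def southern_share(rows, cutoff=SOUTH_OF_LAT):
    """Return the fraction of all references that fall south of the cut-off."""
    total = sum(count for _, count, _ in rows)
    south = sum(count for _, count, lat in rows if lat < cutoff)
    return south / total if total else 0.0

share = southern_share(rows)
print(f'{share:.1%} of references fall south of {SOUTH_OF_LAT} N')
```

A single parallel is only a rough proxy for a regional boundary, so a fuller version would assign each place to its administrative region instead.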

While it is impossible to draw sweeping conclusions about regional bias across the entire Italian Studies discipline, my analysis of a subsection of recent literature shows that progress is still needed to level the playing field of regional representation in the field. With more time and easier access to NER-processable Italian Studies articles, the influence of outliers like those present in my visualization would be neutralized and broader patterns could emerge. One could gain further insight by comparing southern and northern representation in Italian Studies scholarship over time, to measure what progress, if any, has occurred.

## Bibliography

Hall, Crystal. (2019) *Digital Humanities and Italian Studies: Intersections and Oppositions.* Abingdon-on-Thames: Taylor & Francis.

Riall, Lucy. (2000) "Which road to the south? Revisionists revisit the Mezzogiorno." *Journal of Modern Italian Studies*.

Risam, Roopika. (2018) *New Digital Worlds: Postcolonial Digital Humanities in Theory, Praxis, and Pedagogy*. Evanston: Northwestern University Press.