# Mapping Geographic Bias in the Field of Italian Studies
by Mac Callahan / 21 December 2023 / CLS 161: Introduction to Digital Humanities

## Introduction

For my final project, I aimed to answer the following question: Has the field of Italian Studies made progress in equalizing its study of social and cultural phenomena across the Italian Peninsula? This question arises from a broader debate in both Italian Studies and the Digital Humanities. As noted by, “the [Southern Italian] stories and voices which have been underrepresented in both print and digital knowledge production…[and] marginalized in their national contexts — can be heard”.

To answer this question, I compiled a dataset of all articles (excluding book reviews) published in 2023 in a leading peer-reviewed journal (aptly) titled [Italian Studies](https://www-tandfonline-com.ezproxy.library.tufts.edu/journals/yits20). The dataset comprises 31 unique works, some in English and others in Italian, that span Italy's history, art, literature, cultural studies, and more.



## Data
I downloaded all 31 articles as PDFs, converted them into Word documents, and copied and pasted their bodies into a single .txt file. Copying from Word documents made it easy to exclude footnotes from the dataset. Footnotes often name the cities where sources were published, so including them would add place names that do not reflect the articles' subject matter.

The .txt file can be viewed here: 



In [64]:
# import necessary libraries
import os
import pandas as pd
import stanza
import json
import wget
import requests

In [16]:
# load stanza NLP pipeline that tokenizes and performs Named Entity Recognition
nlp_ner = stanza.Pipeline(lang='en', processors='tokenize,ner')

2023-12-17 11:13:49 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


2023-12-17 11:13:50 INFO: Loading these models for language: en (English):
| Processor | Package          |
--------------------------------
| tokenize  | combined         |
| ner       | ontonotes_charlm |

2023-12-17 11:13:50 INFO: Using device: cpu
2023-12-17 11:13:50 INFO: Loading: tokenize
2023-12-17 11:13:50 INFO: Loading: ner
2023-12-17 11:13:51 INFO: Done loading processors!


In [125]:
# create function that takes a text string as input and returns a dictionary
# mapping each location found in the text to the number of times it appears
def get_locations_from_text(text):
    locations_dict = {}
    doc = nlp_ner(text)
    for sentence in doc.sentences:
        for token in sentence.tokens:
            # 'S-GPE' marks a single-token geopolitical entity; multi-token
            # place names carry B-/I-/E-GPE tags and are not counted here
            if token.ner == 'S-GPE':
                locations_dict[token.text] = locations_dict.get(token.text, 0) + 1
    return locations_dict
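
The function above checks only for the `S-GPE` tag, so a place name made of several tokens would not be counted as one location. The sketch below shows one way the BIOES-style tags could be merged into full spans. It works on hand-written `(text, tag)` pairs rather than on real stanza output, and `count_gpe_spans` and the sample tokens are hypothetical names introduced for illustration. Stanza also exposes recognized entity spans through `doc.ents`, which may be a simpler route in practice.

```python
from collections import Counter

def count_gpe_spans(tagged_tokens):
    """Count GPE mentions from (text, tag) pairs, joining multi-token spans."""
    counts = Counter()
    current = []  # tokens of the span currently being built
    for text, tag in tagged_tokens:
        if tag == 'S-GPE':            # single-token place name
            counts[text] += 1
            current = []
        elif tag == 'B-GPE':          # first token of a longer name
            current = [text]
        elif tag == 'I-GPE' and current:
            current.append(text)      # middle token
        elif tag == 'E-GPE' and current:
            current.append(text)      # last token: close the span
            counts[' '.join(current)] += 1
            current = []
        else:
            current = []              # any other tag breaks an open span
    return counts

# Hypothetical tagged tokens, written by hand for illustration
sample = [
    ('Reggio', 'B-GPE'), ('Calabria', 'E-GPE'), ('and', 'O'),
    ('Matera', 'S-GPE'), ('and', 'O'), ('Matera', 'S-GPE'),
]
gpe_counts = count_gpe_spans(sample)
```

Counting whole spans would change the totals reported in this notebook, so the outputs shown here still reflect the single-token approach.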

In [126]:
# identify the path to the text file you want to use
path = os.path.join(os.getcwd(), 'All_texts.txt')

In [127]:
# read text from text file
with open(path, encoding='utf-8', mode='r') as f:
    text = f.read()

In [128]:
# apply function to get locations and location counts
# this will take a few minutes
locations = get_locations_from_text(text)
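
Because the call above runs over the whole file at once, nothing is available until it finishes. One possible alternative, sketched here under assumptions, is to split the text on blank lines and add up the per-chunk counts, which makes it easier to report progress or keep partial results. `count_locations_by_chunk` and `fake_count_fn` are hypothetical. The second is a stand-in for `get_locations_from_text` so the example runs without loading the stanza models. Whether chunking changes the total runtime is not something this sketch measures.

```python
from collections import Counter

def count_locations_by_chunk(text, count_fn):
    """Apply count_fn to each blank-line-separated chunk and sum the counts."""
    total = Counter()
    for chunk in text.split('\n\n'):
        if chunk.strip():                  # skip empty chunks
            total.update(count_fn(chunk))  # Counter.update adds counts
    return total

def fake_count_fn(chunk):
    # Stand-in for get_locations_from_text: counts two hard-coded names
    known = ('Matera', 'Rome')
    return {name: chunk.count(name) for name in known if name in chunk}

sample_text = 'Matera and Rome.\n\nRome again.\n\n\n'
chunk_counts = count_locations_by_chunk(sample_text, fake_count_fn)
```

In the notebook itself, `get_locations_from_text` would be passed as `count_fn` in place of the stand-in.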

In [129]:
locations

{'Inghilterra': 1,
 'Crete': 1,
 'Costantinople': 1,
 'Naples': 9,
 'Florence': 34,
 'Italia': 18,
 'Germania': 2,
 'Germany': 5,
 'Italy': 252,
 'Ravenna': 1,
 'Costantinopoli': 1,
 'Livorno': 1,
 'Savoy': 1,
 'Nice': 1,
 'Rome': 51,
 'Rovereto': 1,
 'Trento': 1,
 'Trieste': 1,
 'London': 11,
 'England': 1,
 'Sicily': 2,
 'Washington': 1,
 'Matera': 21,
 'Plovdiv': 1,
 'Basilicata': 18,
 'Potenza': 6,
 'Bari': 4,
 'Altamura': 2,
 'Ferrandina': 2,
 'Torino': 2,
 'Melfi': 2,
 'Moliterno': 1,
 'Rionero': 1,
 'Tricarico': 1,
 'Venosa': 1,
 'Episcopia': 1,
 'Irsina': 1,
 'Catania': 8,
 'Messina': 1,
 'Palermo': 1,
 'Mezzogiorno': 2,
 'Zanardelli': 1,
 'Sicilia': 3,
 'Milan': 16,
 'Acerenza': 1,
 'Calabria': 4,
 'Cività': 1,
 'Largo': 1,
 'Taranto': 1,
 'Kingdom': 3,
 'Monterrone': 4,
 'Puglia': 1,
 'Lucania': 1,
 'Ethiopia': 21,
 'Piredda': 1,
 'Ujiji': 2,
 'Cavour': 1,
 'Britain': 4,
 'France': 6,
 'Tripolitania': 1,
 'Brunetta': 1,
 'Seraye': 1,
 'Eritrea': 2,
 'Libya': 3,
 'Somalia': 1,
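
Since the corpus mixes English and Italian articles, some places appear above under two names (for example `Italia` and `Italy`, or `Sicilia` and `Sicily`). A minimal sketch of how such variants could be merged before mapping follows. The alias table `ITALIAN_TO_ENGLISH` and the helper `merge_name_variants` are assumptions made for illustration and are not part of the original workflow. The sample counts are copied from the output above.

```python
from collections import Counter

# Hypothetical alias map: Italian spelling -> English spelling
ITALIAN_TO_ENGLISH = {
    'Italia': 'Italy',
    'Sicilia': 'Sicily',
    'Germania': 'Germany',
    'Inghilterra': 'England',
}

def merge_name_variants(counts, aliases):
    """Sum counts of names that the alias map treats as the same place."""
    merged = Counter()
    for name, count in counts.items():
        merged[aliases.get(name, name)] += count
    return merged

# A few counts copied from the dictionary output above
sample_counts = {
    'Italia': 18, 'Italy': 252, 'Sicilia': 3, 'Sicily': 2,
    'Germania': 2, 'Germany': 5, 'Inghilterra': 1, 'England': 1,
    'Matera': 21,
}
merged_counts = merge_name_variants(sample_counts, ITALIAN_TO_ENGLISH)
```

Which spelling to normalize toward would depend on the names the gazetteer uses for matching.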

In [130]:
# convert dictionary to DataFrame for easier processing
location_count_df = (
    pd.DataFrame.from_dict(locations, orient='index')
    .reset_index()
    .rename(columns={'index': 'place_name', 0: 'count'})
)
# preview DataFrame
location_count_df.head()

|   | place_name | count |
|---|---|---|
| 0 | Inghilterra | 1 |
| 1 | Crete | 1 |
| 2 | Costantinople | 1 |
| 3 | Naples | 9 |
| 4 | Florence | 34 |


In [131]:
# load the GeoNames Italy gazetteer into a DataFrame and preview it
places_df = pd.read_csv('IT.csv')
places_df.head()

|   | geonameid | name | asciiname | alternatenames | latitude | longitude |
|---|---|---|---|---|---|---|
| 0 | 781059 | Colognole | Colognole |  | 43.50972 | 10.44833 |
| 1 | 781060 | Casale Sant'Antonio | Casale Sant'Antonio |  | 44.61907 | 11.02235 |
| 2 | 2522617 | Graham Island | Graham Island | Banco Graham,Banco Grahm,Ferdinandea Bank,Ferd... | 37.14266 | 12.88126 |
| 3 | 2522676 | Zungti | Zungti |  | 38.65 | 15.98333 |
| 4 | 2522677 | Zumpano | Zumpano | Zumpano | 39.31053 | 16.29269 |


In [149]:
# function returns the GeoNames ID of the first gazetteer row whose name
# exactly matches the given location, or None if nothing matches
# (exact matching may also pick up an unrelated Italian locality that shares a name)
def get_geonames_id(location):
    place_row = places_df.loc[places_df['name'] == location]
    if len(place_row) >= 1:
        return int(place_row['geonameid'].iloc[0])
    return None

# function returns latitude of a location, given its GeoNames ID,
# or None if the ID is not found exactly once
def get_latitude(geoname_id):
    places_row = places_df.loc[places_df['geonameid'] == geoname_id]
    if len(places_row) == 1:
        return places_row.latitude.iloc[0]
    return None

# function returns longitude of a location, given its GeoNames ID,
# or None if the ID is not found exactly once
def get_longitude(geoname_id):
    places_row = places_df.loc[places_df['geonameid'] == geoname_id]
    if len(places_row) == 1:
        return places_row.longitude.iloc[0]
    return None
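
The lookup functions above scan the gazetteer once per place and match only on the `name` column. As a sketch of an alternative, the code below builds a single dictionary keyed on both primary and alternate names, so each lookup is a dictionary access. It uses plain dictionaries in place of the DataFrame so that it stays self-contained. `build_gazetteer_index` is a hypothetical helper, and the alternate names given for Naples are illustrative assumptions rather than values read from `IT.csv`.

```python
def _record(row):
    # (geonameid, latitude, longitude) for one gazetteer row
    return (int(row['geonameid']), float(row['latitude']), float(row['longitude']))

def build_gazetteer_index(rows):
    """Map each primary and alternate name to its row's ID and coordinates."""
    index = {}
    # First pass: primary names take priority over alternate names
    for row in rows:
        index.setdefault(row['name'], _record(row))
    # Second pass: alternate names fill in only where no entry exists yet
    for row in rows:
        for alt in (row.get('alternatenames') or '').split(','):
            alt = alt.strip()
            if alt:
                index.setdefault(alt, _record(row))
    return index

# Rows modeled on the gazetteer preview; Naples' alternate names are assumed
rows = [
    {'geonameid': '3172394', 'name': 'Naples', 'alternatenames': 'Napoli,Neapolis',
     'latitude': '40.85216', 'longitude': '14.26811'},
    {'geonameid': '2522677', 'name': 'Zumpano', 'alternatenames': 'Zumpano',
     'latitude': '39.31053', 'longitude': '16.29269'},
]
gazetteer_index = build_gazetteer_index(rows)
naples_by_italian_name = gazetteer_index.get('Napoli')
```

Matching on alternate names would find more places but could also add more false matches, so the results would still need manual review.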

In [150]:
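# add GeoNames IDs to location DataFrame and drop places with no gazetteer match
location_count_df['geoname_id'] = location_count_df['place_name'].apply(get_geonames_id)
location_count_df = location_count_df.dropna(subset=['geoname_id']).reset_index(drop=True)
# store IDs as integers now that unmatched rows are gone
location_count_df['geoname_id'] = location_count_df['geoname_id'].astype(int)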

# add coordinates to location DataFrame
location_count_df['latitude'] = location_count_df['geoname_id'].apply(get_latitude)
location_count_df['longitude'] = location_count_df['geoname_id'].apply(get_longitude)

In [151]:
# preview
location_count_df

|   | place_name | count | geoname_id | latitude | longitude |
|---|---|---|---|---|---|
| 0 | Naples | 9 | 3172394 | 40.85216 | 14.26811 |
| 1 | Florence | 34 | 3176959 | 43.77925 | 11.24626 |
| 2 | Ravenna | 1 | 3169561 | 44.41344 | 12.20121 |
| 3 | Livorno | 1 | 3174659 | 43.54427 | 10.32615 |
| 4 | Nice | 1 | 10294319 | 45.47237 | 12.22590 |
| ... | ... | ... | ... | ... | ... |
| 89 | Risorgimento | 1 | 8958713 | 45.75488 | 12.27099 |
| 90 | Alba | 1 | 3183364 | 44.69990 | 8.03470 |
| 91 | Mombarcaro | 1 | 6534291 | 44.46764 | 8.08824 |
| 92 | Valdivilla | 1 | 8955600 | 44.70948 | 8.18276 |


In [146]:
# clean up data by removing observed non-Italian, overly general, or misidentified entries
dispose_of = [
    'Crete', 'Calvino', 'Italia', 'Savoy', 'London', 'Washington', 'Vienna',
    'Paris', 'America', 'Houston', 'Londra', 'Boston', 'Sacramento',
    'Cambridge', 'Mexico', 'Orlando', 'Cuba', 'Versailles',
]
# keep only the rows whose place name is not in the removal list
location_count_df = location_count_df[~location_count_df['place_name'].isin(dispose_of)]

In [145]:
# save location DataFrame to a .csv file to import into ArcGIS
# (index=False avoids writing the row index as an extra unnamed column)
file_name = 'location_count_df.csv'
location_count_df.to_csv(file_name, index=False)