# Geospatial Tutorial

### Data Format

As always when thinking about digital methods, start from the *method*, not the software. Software comes and goes, but the core logic stays the same. So: what do we need in order to model data in geographic space? To put a point on a map, all we need is its latitude and longitude. Most geospatial software lets you add this information "by hand," but in practice you will generally be *importing* a dataset. Two common formats for importing geospatial data are GeoJSON and CSV (comma-separated values).

In principle, this is enough data to import into something like QGIS or ArcGIS and produce a map (which would consist of a single point on a map):

| Latitude | Longitude |
|----------|-----------|
| 40.7128 | -74.0060 |

Recall that the above table is the *rendered* version of raw CSV (the negative longitude marks a position west of the prime meridian):

```
Latitude,Longitude
40.7128,-74.0060
```
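Since GeoJSON is the other format mentioned above, here is a minimal sketch of how the same single point could be expressed as GeoJSON, using only the Python standard library. It assumes the coordinates are stored as signed decimal degrees (negative for west and south); the variable names are illustrative only. Note that GeoJSON lists longitude before latitude.

```python
import csv
import io
import json

# Raw CSV text with signed decimal degrees (negative longitude = west).
raw_csv = "Latitude,Longitude\n40.7128,-74.0060\n"

features = []
for row in csv.DictReader(io.StringIO(raw_csv)):
    lat = float(row["Latitude"])
    lon = float(row["Longitude"])
    features.append({
        "type": "Feature",
        # GeoJSON stores coordinates as [longitude, latitude], i.e., X before Y.
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": {},
    })

feature_collection = {"type": "FeatureCollection", "features": features}
print(json.dumps(feature_collection, indent=2))
```

Any extra CSV columns would naturally go into the `properties` dictionary of each feature.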

A more "realistic" dataset you might keep for historical purposes could look like this:

| Latitude | Longitude | Place_Name | Population |
|----------|-----------|------------|------------|
| 40.7128° N | 74.0060° W | New York City, USA | 8,804,190 |
| 35.6762° N | 139.6503° E | Tokyo, Japan | 13,929,286 |
| 51.5074° N | 0.1278° W | London, UK | 8,982,000 |
| 33.8688° S | 151.2093° E | Sydney, Australia | 5,367,206 |
| 1.3521° N | 103.8198° E | Singapore | 5,685,807 |


When you import this data, you have to make sure that whatever software you are using knows that the first column (Latitude) corresponds to the Y axis and the second (Longitude) to the X axis. Most software packages will assume that any remaining columns are *attributes*. For instance, once your points show up on the map, you could tell the software to use the third column (Place_Name) to label the points and the fourth column (Population) to size the dots, as a way of visualizing the cities in relative terms.
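Coordinates written with a degree sign and a hemisphere letter, and populations written with thousands separators, are easy for humans to read but are typically easier to import as plain numbers. The following is a rough sketch of one way to clean such values before import; the helper names `to_signed_degrees` and `to_int` are hypothetical, and the sketch assumes every coordinate follows the exact `value° letter` pattern used in the table above.

```python
def to_signed_degrees(text):
    """Convert a string like '74.0060° W' to a signed float (west/south negative)."""
    value, hemisphere = text.replace("°", " ").split()
    sign = -1.0 if hemisphere.upper() in ("S", "W") else 1.0
    return sign * float(value)


def to_int(text):
    """Convert a string like '8,804,190' to the integer 8804190."""
    return int(text.replace(",", ""))


# Two rows copied from the table above.
rows = [
    ("40.7128° N", "74.0060° W", "New York City, USA", "8,804,190"),
    ("33.8688° S", "151.2093° E", "Sydney, Australia", "5,367,206"),
]

cleaned = [
    {
        "Latitude": to_signed_degrees(lat),    # Y axis
        "Longitude": to_signed_degrees(lon),   # X axis
        "Place_Name": name,                    # attribute: label
        "Population": to_int(pop),             # attribute: dot size
    }
    for lat, lon, name, pop in rows
]

for record in cleaned:
    print(record)
```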



### Vectors / Polygons versus Rasters

Before you get started, it's worth pausing to consider an important conceptual distinction in the way a computer codes images. If you take a picture with your phone, you are producing a *raster* image, which is made up of a grid of pixels (or cells). If you zoom in, the picture will become grainy as you get closer to the pixels.

A *vector* or *polygon* shape is fundamentally different: essentially, a vector image is a collection of points in space, which the computer joins together with lines. Three points in space are therefore already enough for a vector graphic (a triangle), and because the shape is redrawn from its points, it stays sharp however far you zoom in.
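To make the distinction concrete, here is a toy sketch of the two data models in plain Python. It is only an illustration of the idea, not how any particular GIS package stores its data, and the `scale` helper is hypothetical.

```python
# Raster: a grid of cells, each holding a value (here 1 = filled, 0 = empty).
raster = [
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [1, 1, 1, 1, 1],
]

# Vector: an ordered list of (x, y) points that the computer joins with lines.
triangle = [(0.0, 0.0), (4.0, 0.0), (2.0, 3.0)]


def scale(points, factor):
    """Enlarge a vector shape by multiplying every coordinate."""
    return [(x * factor, y * factor) for x, y in points]


# Zooming in on the vector shape still needs only three points.
big_triangle = scale(triangle, 10)

# The raster, by contrast, has a fixed number of cells to work with.
filled_cells = sum(sum(row) for row in raster)

print(big_triangle)
print(filled_cells)
```

Enlarging the raster would mean stretching the same fifteen cells, which is where the graininess comes from.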



### Databases and Concatenating Tables

If your research goal is to simply track places with some sort of information attached to those places in a one-to-one relationship, then all you need is a CSV file similar to the example above. However, for many historical research projects, you will need to join information kept in separate tables.

For instance: what if determining the likely location of a historical place is itself a research finding, and you are tracking multiple attested locations for that place at the same time? In this case, you would likely maintain one table for location entities and another for location attributes -- and the latter table could hold multiple attested coordinates for a given location.


Table 1: Locations

| Location_ID | Location_Name |
|-------------|---------------|
| L001 | Alexandria |
| L002 | Timbuktu |
| L003 | Tenochtitlan |


Table 2: Location Coordinates

| Location_ID | Latitude | Longitude |
|-------------|----------|-----------|
| L001 | 31.2001° N | 29.9187° E |
| L001 | 31.1855° N | 29.8962° E |
| L001 | 31.2156° N | 29.9090° E |
| L002 | 16.7666° N | 3.0026° W |
| L003 | 19.4326° N | 99.1332° W |
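A minimal sketch of how these two tables could be joined with Pandas follows. The dataframes are typed in by hand to mirror the tables above, with the coordinates rewritten as signed decimal degrees (an assumption made here for the sake of the example); the variable names are illustrative.

```python
import pandas as pd

# Table 1: one row per location entity.
locations = pd.DataFrame({
    "Location_ID": ["L001", "L002", "L003"],
    "Location_Name": ["Alexandria", "Timbuktu", "Tenochtitlan"],
})

# Table 2: possibly several attested coordinates per location.
coordinates = pd.DataFrame({
    "Location_ID": ["L001", "L001", "L001", "L002", "L003"],
    "Latitude": [31.2001, 31.1855, 31.2156, 16.7666, 19.4326],
    "Longitude": [29.9187, 29.8962, 29.9090, -3.0026, -99.1332],
})

# Attach the name to every attested coordinate (a one-to-many relationship).
attested = pd.merge(coordinates, locations, on="Location_ID", how="left")

# Count how many candidate coordinates each place has.
counts = attested.groupby("Location_Name").size()

print(attested)
print(counts)
```

Each row of the result is one candidate point, so all three attested positions for Alexandria would appear on the map with the same label.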


Or, even if you are certain of all your locations (i.e., you can just keep them as columns in one single table), what if you want to keep track of the travel itineraries of individuals, i.e., who went where? In that case, you would need a table for locations, a table for individuals, and then a join table connecting them.


Table 1: Historical Figures

| Person_ID | Person_Name | Lifetime |
|-----------|-------------|----------|
| P001 | Marco Polo | 1254-1324 |
| P002 | Ibn Battuta | 1304-1369 |
| P003 | Zheng He | 1371-1433 |


Table 2: Locations

| Location_ID | Location_Name | Region |
|-------------|---------------|--------|
| L001 | Venice | Italy |
| L002 | Constantinople | Byzantine Empire |
| L003 | Beijing | China |
| L004 | Calicut | India |
| L005 | Malacca | Malaysia |


Table 3: Travel Itineraries (Join Table)

| Journey_ID | Person_ID | Location_ID | Visit_Year | Duration_Months | Notes |
|------------|-----------|-------------|------------|-----------------|-------|
| J001 | P001 | L001 | 1271 | 0 | Departure point |
| J002 | P001 | L002 | 1272 | 3 | First major stop |
| J003 | P001 | L003 | 1275 | 204 | Served Kublai Khan |
| J004 | P002 | L002 | 1325 | 4 | Pilgrimage stop |
| J005 | P002 | L004 | 1342 | 6 | Trading port visit |
| J006 | P003 | L003 | 1405 | 0 | Departure point |
| J007 | P003 | L004 | 1407 | 2 | Diplomatic mission |
| J008 | P003 | L005 | 1409 | 3 | Established relations |


You already know how to stitch together tables using the Pandas library in Python [from a previous lesson](https://github.com/pickettj/teaching/blob/main/Pandas_Relational_Tables.ipynb). We will follow very similar steps to produce an output file of interesting historical-geographical data ready for import into a geospatial software package.

### Concatenating Tables for Geospatial Import

Let's start the walkthrough: you can follow along with the [provided sample CSV files](https://github.com/pickettj/teaching/tree/main/geospatial_sample_data).

```python
# Import the pandas library and give it the nickname 'pd'
import pandas as pd
```

```python
# Read in the data as a "dataframe" (i.e., the tabular data format in Pandas)
df_people = pd.read_csv("./geospatial_sample_data/people.csv")
df_locations = pd.read_csv("./geospatial_sample_data/locations.csv")
df_journeys = pd.read_csv("./geospatial_sample_data/journeys.csv")

# Note that the dot in ./geospatial... means "start in the current location
# of your computer directory system and then look for a directory called..."
# This is why, if you pull or download this code from GitHub, the path (file location)
# will work on your computer even though your "absolute path" will be different from the
# one on my computer.
# See: https://phoenixnap.com/kb/absolute-path-vs-relative-path
```
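To see the difference between a relative and an absolute path for yourself, you can ask Python to expand the relative path. This is a small sketch using the standard-library `pathlib` module; it assumes nothing about whether the file actually exists on your machine. The "current location" is the working directory of the running Python process, which for a notebook is usually, but not always, the folder containing the notebook.

```python
from pathlib import Path

# The relative path used in the cell above.
relative_path = Path("./geospatial_sample_data/people.csv")

# Expand it against the current working directory.
absolute_path = relative_path.resolve()

print(Path.cwd())       # where Python is currently "standing"
print(absolute_path)    # the same file, spelled out in full
print(relative_path.is_absolute(), absolute_path.is_absolute())
```

The second line printed will differ from one computer to the next, while the relative path stays the same.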

```python
# Merge df_journeys with df_locations
merged_with_locations = pd.merge(
    df_journeys, df_locations,
    left_on='Location_ID', right_on='UID', how='left'
)

# Merge the result with df_people
final_merged_df = pd.merge(
    merged_with_locations, df_people,
    left_on='Person_ID', right_on='UID', how='left'
)
```
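With a left merge, a journey whose ID has no match in the other table is kept but silently filled with missing values. As an optional sanity check, `pd.merge` accepts the `validate` and `indicator` parameters. The sketch below uses small hand-typed stand-ins for the sample files, and the ID `L999` is deliberately made up to show what an unmatched row looks like.

```python
import pandas as pd

# Stand-in data; "L999" is a hypothetical ID with no matching location.
journeys = pd.DataFrame({
    "Location_ID": ["L001", "L002", "L999"],
    "Person_ID": ["P001", "P001", "P001"],
    "Year": [1271, 1272, 1273],
})
locations = pd.DataFrame({
    "UID": ["L001", "L002"],
    "Place_Name": ["Venice", "Constantinople"],
})

checked = pd.merge(
    journeys, locations,
    left_on="Location_ID", right_on="UID",
    how="left",
    validate="many_to_one",  # raise an error if a UID appears twice in locations
    indicator=True,          # add a "_merge" column recording where each row matched
)

# Rows that found no partner in the locations table.
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched)
```

An empty `unmatched` dataframe would suggest that every journey points at a known location.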


```python
# Clean up: drop the redundant columns UID_x and UID_y
final_cleaned_df = final_merged_df.drop(columns=['UID_x', 'UID_y'])

# Display the cleaned DataFrame
final_cleaned_df
```

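| Unnamed: 0 | Location_ID | Person_ID | Year | Place_Name | Latitude | Longitude | Person_Name | Birth_Year | Death_Year |
|------------|-------------|-----------|------|------------|----------|-----------|-------------|------------|------------|
| 0 | L001 | P001 | 1271 | Venice | 45.4408 | 12.3155 | Marco Polo | 1254 | 1324 |
| 1 | L002 | P001 | 1272 | Constantinople | 41.0082 | 28.9784 | Marco Polo | 1254 | 1324 |
| 2 | L007 | P001 | 1274 | Baghdad | 33.3152 | 44.3661 | Marco Polo | 1254 | 1324 |
| 3 | L008 | P001 | 1275 | Samarkand | 39.6270 | 66.9750 | Marco Polo | 1254 | 1324 |
| 4 | L003 | P001 | 1275 | Beijing | 39.9042 | 116.4074 | Marco Polo | 1254 | 1324 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 112 | L045 | P001 | 1295 | Trabzon | 41.0027 | 39.7178 | Marco Polo | 1254 | 1324 |
| 113 | L002 | P001 | 1295 | Constantinople | 41.0082 | 28.9784 | Marco Polo | 1254 | 1324 |
| 114 | L024 | P001 | 1296 | Genoa | 44.4056 | 8.9463 | Marco Polo | 1254 | 1324 |
| 115 | L034 | P001 | 1300 | Palermo | 38.1157 | 13.3615 | Marco Polo | 1254 | 1324 |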
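The last step is to write the merged table to a file that a geospatial package can import. Here is a minimal sketch using `DataFrame.to_csv`; it works on a two-row stand-in for the merged dataframe and writes to a temporary folder, and the file name `journeys_for_gis.csv` is only a placeholder.

```python
import os
import tempfile

import pandas as pd

# A two-row stand-in for the merged dataframe shown above.
sample = pd.DataFrame({
    "Location_ID": ["L001", "L002"],
    "Person_ID": ["P001", "P001"],
    "Year": [1271, 1272],
    "Place_Name": ["Venice", "Constantinople"],
    "Latitude": [45.4408, 41.0082],
    "Longitude": [12.3155, 28.9784],
    "Person_Name": ["Marco Polo", "Marco Polo"],
})

# Placeholder output location; in practice you would pick your own path.
output_path = os.path.join(tempfile.mkdtemp(), "journeys_for_gis.csv")

# index=False keeps the row index from being written as an extra unnamed column.
sample.to_csv(output_path, index=False)

# Read the file back to confirm what the GIS software will see.
round_trip = pd.read_csv(output_path)
print(round_trip)
```

On import, you would then point the software at the Longitude column for the X axis and the Latitude column for the Y axis, as described in the Data Format section.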
