# Plotting the Publication Location of Exile Magazines

In this notebook, we will work with selected data from the Czech Literary Exile Bibliography. It contains records of books and articles published by Czech exile publishers and magazines. We will focus only on article records that include the magazine and its place of publication, both of which are stored in the `773 $t` field.

This notebook will show you how to extract the place of publication from the data, obtain its coordinates, and then visualize it on a map using the `geopandas` and `matplotlib` libraries.

This notebook is suitable for both beginners and those who want to familiarize themselves with data processing in Python.

## Requirements

To work with this notebook, you need to have a CSV file created from the Explore Data notebook.

## Prerequisites

This notebook does not require deep knowledge of Python, but a basic understanding of programming will be helpful.

## Notebook Structure

The notebook is divided into several sections:

0. **Preparation**: We will install and import the libraries needed to process the data.

1. **Loading from CSV**: We will show how to load our data stored in a CSV file.

2. **Extraction of Publication Place**: From the data, we will find and extract the place of publication.

3. **Data Cleaning**: We will correct errors in the places of publication, such as typos, etc.

4. **Obtaining Place Coordinates**: We will learn how to obtain the coordinates of the publication place using an API and save the results to a CSV file.

5. **Loading Coordinates**: We will load the coordinates from the CSV file.

6. **Map**: Finally, we will plot our data on a map.

## Additional Resources

- [LearnPython.org](https://www.learnpython.org/): This online course offers Python tutorials for both beginners and advanced learners. It can be a useful resource for those looking to expand their Python knowledge.

- [W3Schools.com/Python](https://www.w3schools.com/python/): An extensive tutorial that covers Python along with some popular Python libraries.


### 0. Preparation
First, we need to install the libraries we will be working with. Libraries are packages of functions that are not part of the Python language's core. <br>
To install a library, use the command `%pip install <library_name>`. Then, we add it to our notebook using the command `import <library_name> (as alias)`. To access a function from the library, use `library_name.function_name`. <br>
If we only want to use a single function from a library, we add it using `from <library_name> import <function_name>`.

In [None]:
# Install libraries
%pip install geopandas
%pip install matplotlib
%pip install numpy 
%pip install pandas
%pip install requests
%pip install shapely

# Add libraries
from collections import Counter
import geopandas as gpd
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D
import numpy as np
import pandas as pd
import re
import requests
from shapely.geometry import Point
from statistics import median, mean

### 1. Load from CSV

First, we use the `pandas` library to load our saved CSV data into a DataFrame data structure (similar to an Excel table). Rows in the DataFrame represent individual records, columns represent different types of data (e.g., author names).
Some fields and subfields may repeat. In the CSV they are concatenated with semicolons.
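
The splitting step can be sketched on a few made-up values before we apply it to the whole DataFrame. The helper name `split_joined` and the sample cells are assumptions for illustration only. The check for `str` matters because `pandas` loads empty CSV cells as the floating-point value `NaN`, which has no `split()` method.

```python
import math


def split_joined(value, separator=";"):
    """Split a semicolon-joined cell into a list; missing values become []."""
    # Empty CSV cells arrive as float NaN, so only strings are split
    if isinstance(value, str):
        return value.split(separator)
    return []


# Illustrative values only, not taken from the bibliography
cells = ["Novák, Jan;Svoboda, Petr", "Dvořák, Karel", math.nan]
split_cells = [split_joined(cell) for cell in cells]
print(split_cells)
```

Every cell ends up as a list, so later steps can treat repeated and non-repeated fields in the same way.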

In [None]:
# Select base
base = 'cle'

# Path to our data
csv_data = 'data/csv/out_{base}.csv'.format(base = base)

# Load data
df = pd.read_csv(csv_data, delimiter=',')

print("Data loaded to DataFrame df.")

# Iterate through the columns of the DataFrame
for column in df.columns:
    # There is only one publication date, so we don't need to split this column
    if column != 'year': 

        # Split joined values into a list 
        df[column] = df[column].apply(lambda x: x.split(';') if isinstance(x, str)  else [])


We'll take a look at our data. First, we'll determine how many records contain information about magazines. These are the records about articles that interest us. The remaining records will be considered as book records. Then, we'll display the first 5 and the last 5 items in the DataFrame.

First, using a lambda function, we'll find rows that contain information about magazines. We'll assign 1 to them and sum them all up. This will give us the total number of article records.<br>
To obtain the number of book records, we'll subtract the total number of article records from the overall record count. Finally, we'll keep only the records about articles in our DataFrame df.

###### Note: If you require more precise data, the MARC record contains a field called LDR (leader) that carries information about the record type.

In [None]:
# Count nonempty rows 
magazines_counts = df['magazine'].apply(lambda x: 1 if len(x) > 0 else 0)

sum_magazines_counts = magazines_counts.sum()

print("Number of article records: ", sum_magazines_counts)

sum_books_counts = len(df) - sum_magazines_counts

print("Number of books records: ", sum_books_counts)

# Keep only article records (drop book records)
df = df[df['magazine'].apply(lambda x: len(x) > 0)]

We can see that articles are in the majority. 

In [None]:
# Print first 5 records in the DataFrame 'df'
df.head()

In [None]:
# Print last 5 records in the DataFrame 'df'
df.tail()

We have displayed the beginning and end of our DataFrame. We can observe that the place of publication is written in parentheses next to the magazine name and that it is not present in all records.

### 2. Extracting the Place of Publication

Now, we extract the place of publication and determine how many records lack this information. To achieve this, we use a regular expression that finds the text within the first pair of parentheses. For a magazine entry that has no place of publication, the result is `None`. We then count the records whose only magazine entry yields `None`.
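
A minimal sketch of the regular expression at work, assuming magazine strings shaped like the ones described above. The helper name `extract_place` and the sample strings are illustrative and are not read from the data.

```python
import re

# Non-greedy match of whatever stands between "(" and the next ")"
pattern_cities = r"\((.*?)\)"


def extract_place(title):
    """Return the text inside the first parentheses, or None if there is none."""
    match = re.search(pattern_cities, title)
    return match.group(1) if match else None


# Illustrative strings only
samples = ["Svědectví (Paříž)", "Listy (Řím)", "Proměny"]
places = [extract_place(sample) for sample in samples]
print(places)
```

Note that `re.search` stops at the first match, so a string with two pairs of parentheses yields only the first one.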

In [None]:
# Regex pattern that captures the text inside the first pair of parentheses
pattern_cities = r"\((.*?)\)"

# Return the place of publication from a magazine string, or None if there is none
def extract_city(magazine):
    match = re.search(pattern_cities, magazine)
    return match.group(1) if match else None

# Save cities (substrings inside parentheses) to a list of lists
cities = df['magazine'].apply(lambda x: [extract_city(y) for y in x]).tolist()

# Count records whose only magazine entry has no place of publication
sum_None = sum(1 for x in cities if len(x) == 1 and x[0] is None)

print("Number of magazines without a place of publication: ", sum_None)

As we can see, only a few records lack a place of publication, so we can leave them out.

Now, we determine the places of publication and their frequencies.

To avoid writing the same code multiple times, we write it once as a function that we can later easily call. In this case, we write a function that creates one list from several nested lists. This will be useful when we want to calculate frequencies.


In [None]:
# Function that flattens nested lists 
def flatten_list(strings):
    flattened_list = []
    if strings is not None: # Check if element is not None
        for item in strings:
            if isinstance(item, str):  # If element is a string, add it to the list
                flattened_list.append(item)
            elif isinstance(item, list):  # If element is a list, flatten it recursively
                flattened_list.extend(flatten_list(item))
    # Always return a list, even when the input is None
    return flattened_list

print("Function saved")             

We find which places of publication are in our database. 

In [None]:
# Create a flattened list, which also drops None values
cities = flatten_list(cities)

print("Unique places of publication:", np.unique(cities))

### 3. Data Cleaning

From the unique values, we can see that some cities are stored under their original name as well as under their Czech alternative (London - Londýn, Köln - Kolín nad Rýnem). Some have typos (Wintertuhr-Obstladen -> Winterthur-Obstalden) that need to be corrected. Others contain multiple cities (Wintertuhr-Obstladen, Ženeva-Middlesex-Mnichov). In this case, we store all cities in a list.

We use `lambda` functions together with `map()`, which applies a function to every element of a list, much as `apply()` does for a column in the `pandas` library.
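
As a small sketch of this pattern, the following applies one replacement rule to a short, hand-written list. The list comprehension at the end is shown only as an equivalent alternative and is not used elsewhere in the notebook.

```python
# Hand-written sample of raw place names
raw_places = ["London", "London, Index on Censorship", "Paříž", "New York-Paříž"]

# map() calls the lambda once per element; list() collects the results
unified = list(map(lambda x: "Londýn" if "London" in x else x, raw_places))

# The same rule written as a list comprehension
unified_alt = ["Londýn" if "London" in x else x for x in raw_places]

print(unified)
```

Because the rule tests for a substring (`'London' in x`), it also catches longer entries such as 'London, Index on Censorship'.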


In [None]:
# Delete Index on Censorship from an element 'London, Index on Censorship', combine 'London' and 'Londýn'
cities = list(map(lambda x: 'Londýn' if 'London' in x else x, cities))

# Correct 'Obstladen' -> 'Obstalden' and Winterthur
cities = list(map(lambda x: ['Winterthur', 'Obstalden'] if 'Obstladen' in x else x, cities))

# Create two elements from 'New York-Paříž'
cities = list(map(lambda x: ['New York','Paříž'] if 'New York-Paříž' in x else x, cities))

# Create three elements from 'Ženeva-Middlesex-Mnichov'
cities = list(map(lambda x: ['Ženeva','Middlesex', 'Mnichov'] if 'Ženeva-Middlesex-Mnichov' in x else x, cities))

# Overwrite 'Köln-Ehrenfeld' to 'Kolín nad Rýnem'
cities = list(map(lambda x: 'Kolín nad Rýnem' if 'Köln-Ehrenfeld' in x else x, cities))

# Flatten list 
cities = flatten_list(cities)

print("Unique places of publication after correction:", np.unique(cities))

We count the number of records for each city using the `Counter()` function.
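
A short sketch of how `Counter` behaves, using a made-up list of cities rather than the real data:

```python
from collections import Counter

# Made-up list of places, one entry per record
sample_cities = ["Londýn", "Paříž", "Londýn", "Mnichov", "Londýn"]

# Counter maps each distinct value to the number of times it occurs
counts = Counter(sample_cities)

# most_common(1) returns the most frequent value with its count
top = counts.most_common(1)
print(top)
```

A `Counter` behaves like a dictionary, but asking for a key that never occurred returns 0 instead of raising an error.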

In [None]:
# Count the number of occurrences
cities_number_of_records = Counter(cities)

print(cities_number_of_records)

We convert our dictionary to a DataFrame called `cities_df`.

In [None]:
# Create DataFrame
cities_df = pd.DataFrame.from_dict(cities_number_of_records, orient='index').reset_index()

# Add column titles
cities_df.columns = ['city', 'number of records']

# Print 
cities_df

### 4. Obtaining Location Coordinates

This code demonstrates how to obtain the coordinates of cities using an API. It requires a personal API key, which can be obtained from this website - https://opencagedata.com/api. You simply need to add the API key to the `api_key` variable. The code then sends a request to the URL with our API key and the city name. If the request is successful, the function returns the latitude and longitude.
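
The two steps that are easiest to get wrong, encoding the city name in the URL and reading the coordinates out of the reply, can be sketched without a network connection. The helper `parse_coordinates` and both sample dictionaries are hypothetical: they are hand-written and contain only the fields that the function in the next cell reads (`total_results`, `results`, `geometry`, `lat`, `lng`), not a real API reply.

```python
from urllib.parse import urlencode


def parse_coordinates(data):
    """Return (lat, lon) from a decoded reply, or None if nothing was found."""
    if data.get("total_results", 0) > 0:
        geometry = data["results"][0]["geometry"]
        return geometry["lat"], geometry["lng"]
    return None


# City names may contain spaces and diacritics, so they must be URL-encoded
query = urlencode({"q": "Kolín nad Rýnem", "key": "MY KEY"})

# Hand-written stand-ins for a decoded JSON reply
sample_response = {
    "total_results": 1,
    "results": [{"geometry": {"lat": 48.8566, "lng": 2.3522}}],
}
empty_response = {"total_results": 0, "results": []}

print(query)
print(parse_coordinates(sample_response))
print(parse_coordinates(empty_response))
```

Returning `None` for an empty reply lets the caller decide what to do with a city that could not be located.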

<i> This cell will not run. To execute it, you need to remove the first line `%%script echo skip`. </i>


In [None]:
%%script echo skip

# Function that finds the coordinates of a city by its name
def get_city_coordinates(city):
    api_key = "MY KEY"
    url = "https://api.opencagedata.com/geocode/v1/json"

    # Send the request; passing params lets requests URL-encode the city name
    response = requests.get(url, params={"q": city, "key": api_key})

    if response.status_code == 200:
        # Decode the JSON body only after a successful request
        data = response.json()

        # If the city is found, return its coordinates
        if data["total_results"] > 0:
            lat = data["results"][0]["geometry"]["lat"]
            lon = data["results"][0]["geometry"]["lng"]
            return lat, lon
        else:
            print("No results found for the city.")
    else:
        print("Error occurred while fetching data.")


Using the `get_city_coordinates` function, we obtain the coordinates of the cities and save them in a DataFrame. At the end, the DataFrame is saved to a CSV file.

<i> This cell will not run. To execute it, you need to remove the first line `%%script echo skip`. </i>

In [None]:
%%script echo skip

# Add empty coordinate columns that are filled in below
cities_df['latitude'] = None
cities_df['longitude'] = None

# Unique cities
unique_cities = np.unique(cities)
coordinates = {}

# Iterate unique cities
for city in unique_cities:
    # Use try/except so that one failed lookup does not stop the loop
    try:
        (latitude, longitude) = get_city_coordinates(city)
        print(f"Coordinates of {city}: Latitude={latitude}, Longitude={longitude}")
        coordinates[city] = (latitude, longitude)
        cities_df.loc[cities_df['city'] == city, 'latitude'] = latitude
        cities_df.loc[cities_df['city'] == city, 'longitude'] = longitude
    except Exception:
        print(f"City {city} not found.")
          
# Create DataFrame from dictionary
df_coordinates = pd.DataFrame.from_dict(coordinates)

# Transpose table
df_coordinates = df_coordinates.T

df_coordinates.to_csv('data/coordinates/coordinates.csv')

### 5. Loading Coordinates

We saved the result to 'coordinates.csv'. We now load our data from this file.
The CSV file contains a list of cities with their latitude and longitude.
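
The following sketch shows the loading and merging steps on a tiny in-memory CSV, assuming the file has the layout written in the previous section (city names in the first column, followed by two coordinate columns). The values are rough and only illustrative. The `how="left"` and `indicator=True` arguments are an addition for checking purposes: they reveal cities that have no coordinates, which a default merge would drop silently.

```python
import io

import pandas as pd

# In-memory stand-in for coordinates.csv (illustrative values)
csv_text = ",0,1\nLondýn,51.5,-0.13\nPaříž,48.86,2.35\n"
sample_coordinates = pd.read_csv(io.StringIO(csv_text))
sample_coordinates.columns = ["city", "latitude", "longitude"]

# Made-up record counts, including one city without coordinates
sample_cities_df = pd.DataFrame(
    {"city": ["Londýn", "Paříž", "Mnichov"], "number of records": [3, 2, 1]}
)

# A left merge keeps every city; the _merge column shows where each row came from
merged = pd.merge(
    sample_cities_df, sample_coordinates, on="city", how="left", indicator=True
)
missing = merged.loc[merged["_merge"] == "left_only", "city"].tolist()
print("Cities without coordinates:", missing)
```

If the list of missing cities is not empty, the spelling of those names is a good first thing to check.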

In [None]:
# Load coordinates from a file 
df_coordinates = pd.read_csv('data/coordinates/coordinates.csv')

# Add column titles
df_coordinates.columns = ['city','latitude', 'longitude']

# Merge the tables on the city name
points_df = pd.merge(cities_df[['city', 'number of records']], df_coordinates, on='city')

points_df

### 6. Map
In the final section, we visualize the data on a map, which is prepared in the 'data/geojson' folder. First, we determine which part of the world we want to display. Then, we will gradually plot cities as points on the map. The size of the points will depend on the number of published articles. We will also add a legend and, if necessary, labels for individual cities.

#### 6.1 Bounding Box
Since all the cities are located either in Europe or North America, we can define the bounding box of the world map we plot. We define the box using minimum and maximum latitude and longitude. Then, we remove all cities that are outside of this boundary.
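
The bounding-box test itself is a pair of range checks, which can be sketched in plain Python. The helper `in_bbox` and the sample places are assumptions for illustration, and the coordinates are approximate.

```python
def in_bbox(lon, lat, bbox):
    """Return True if the point lies inside [minx, miny, maxx, maxy]."""
    minx, miny, maxx, maxy = bbox
    return minx <= lon <= maxx and miny <= lat <= maxy


# North America and Europe, as [min lon, min lat, max lon, max lat]
sample_bbox = [-130, 15, 50, 80]

# Illustrative places with approximate (longitude, latitude)
sample_places = {"Paříž": (2.35, 48.86), "Sydney": (151.21, -33.87)}

inside = [
    name for name, (lon, lat) in sample_places.items()
    if in_bbox(lon, lat, sample_bbox)
]
print(inside)
```

Note the order of the values: longitude is the x coordinate and latitude is the y coordinate, so longitude comes first in the box.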

In [None]:
# Bounding box of our map  
bbox = [-130, 15, 50, 80]  # [minx, miny, maxx, maxy] - minimal longitude, minimal latitude, maximal longitude, maximal latitude 

# List of cities we want to omit  
remove = []

# Unpack the bounding box once
minx, miny, maxx, maxy = bbox

# Iterate through DataFrame
for _, row in points_df.iterrows():
    lat = row['latitude']
    lon = row['longitude']

    # Add every city outside the bounding box to the remove list
    if lon < minx or lat < miny or lon > maxx or lat > maxy:
        remove.append(row['city'])

# Remove cities that are on the remove list 
points_df = points_df[~points_df['city'].isin(remove)]

points_df

<div class='alert alert-block alert-info'>
    <b>Try It!</b>  You can experiment with different geographical boundaries. <br>
For example, [-130, 15, 50, 80] is the bounding box of North America and Europe. [-5, 35, 30, 55] is an approximate boundary of Europe. [6, 45, 11, 48] roughly defines the border of Switzerland. <br> 
</div>


#### 6.2 Map Visualization

We use the `geopandas` and `matplotlib` libraries for visualization.<br> 
First, we create `Point` objects from longitude and latitude using the `shapely` library and convert them into a `GeoDataFrame`, which can be easily plotted on a map.<br>
We load the map and plot it. Our map will be displayed according to the bounding box defined in the previous cell.<br>
We then add points to the map - our places of publication - based on their longitude and latitude. The size of each point depends on the number of articles published in that location. <br>
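
The cell below uses two different size values: `number of records / div` for the points on the map and the square root of that value for the points in the legend. As far as we understand matplotlib's conventions, the marker size of plotted points is interpreted as an area (in points squared), while the `markersize` of a `Line2D` legend handle is a length (in points), so the square root keeps the two visually comparable. The sketch below only reproduces this arithmetic; the function names are hypothetical.

```python
import math


def scatter_area(n_records, div=10):
    """Marker area used for a point on the map."""
    return n_records / div


def legend_marker_size(n_records, div=10):
    """Marker length used for the matching legend entry."""
    return math.sqrt(n_records / div)


for n in (10, 250, 1000):
    print(n, scatter_area(n), round(legend_marker_size(n), 2))
```

Raising `div` shrinks both values together, which is why a single parameter is enough to rescale the map and its legend.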

In [None]:
# Create shapely points from longitude and latitude of each city
geometry = [Point(lon, lat) for lon, lat in zip(points_df['longitude'], points_df['latitude'])]

# Convert DataFrame to GeoDataFrame
points_gdf = gpd.GeoDataFrame(points_df, geometry=geometry, crs="EPSG:4326") 

# Load map 
base_map_data = gpd.read_file("data/geojson/world_1960.geojson")

# Set figure size 
figsize = (15,12)

# Clip the map to the bounding box and plot it; plot() creates the figure itself
ax = base_map_data.clip(bbox).plot(figsize=figsize)

# Parameter for point size 
div = 10

# Parameter for annotating the point
ann = False

# Plot map and cities-points
points_gdf.plot(figsize=figsize,
                ax=ax, 
                color = "red",
                marker = 'o',
                markersize=points_gdf['number of records'].apply(lambda x: x/div),  # Set point size  
                )

# If parameter is set to True, annotate points 
if ann:
     # Add name of the cities as labels
     for x, y, label in zip(points_df.longitude, points_df.latitude, points_df.city):
          ax.annotate(label, xy=(x, y), xytext=(3, 3), textcoords="offset points", fontsize = 8)

# Round the record counts to the nearest ten
rounded_counts = points_gdf['number of records'].apply(lambda x: int(round(x, -1)))

# Set sizes of legend points: smallest, average and largest count
point_sizes = [min(rounded_counts), mean(rounded_counts), max(rounded_counts)]

# Create legend points
legend_handles = [Line2D([], [], marker = 'o', lw=0, color='red', markersize=np.sqrt(size/div), label=str(int(round(size, -1)))) for size in point_sizes]

# Add legend
ax.legend(handles=legend_handles, title='Number of Records', loc='upper right')

plt.title("Exile magazines")
plt.grid(False)
ax.set_axis_off()  
plt.savefig("plots/exil_cropped_map.svg")
plt.show()

<div class='alert alert-block alert-info'>
    <b>Try It!</b> The code includes two parameters - div and ann.<br> 
    Using the div parameter, you can change the point size. A higher value will result in smaller points. Higher values are more suitable for maps that cover larger areas to prevent points from overlapping.    
    The ann parameter adds city labels. When set to True, it adds the city name next to its point.
</div>
