# MapScapingMap

My plan is to create a map with the locations of all the interview partners Daniel O'Donohoe has had on his [MapScaping](https://mapscaping.com/) Podcast.
For this, I scrape his website for the basic info (episode, date, duration, title, categories) on each of the >200 podcast episodes. Then I have to add the interviewees, their locations, etc. manually.

In [1]:
# import all necessary modules and packages
import requests
from bs4 import BeautifulSoup
from tqdm import tqdm
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
import time
warnings.simplefilter(action='ignore', category=FutureWarning)

In [2]:
pd.set_option('display.max_rows', None)  # Optionally, ensure all rows are shown
# geocoded_df

In [3]:
base_url = 'https://mapscaping.com/podcasts/page/'
page_numbers = range(1, 13)

# Creating a list of URLs for pages 1 to 12 /!\ except page 4, see below!
links_list = [base_url + str(page_number) for page_number in page_numbers if page_number != 4]
# links_list

In [4]:
all_dfs = []

for url in tqdm(links_list): # Iterate over the links and track progress
    response = requests.get(url, timeout=30)
    soup = BeautifulSoup(response.text, 'html.parser')  # name the parser explicitly to avoid bs4's warning

    date_soup = soup.findAll('span', class_= "blog-meta-date-display")
    dates = [date.text for date in date_soup]

    title_soup = soup.findAll('h2', class_= "secondline-blog-title")
    titles = [title.text for title in title_soup]
    titles = [title.replace('\n','').replace('\t','') for title in titles] # not nice, but works

    episode_soup = soup.findAll('div', class_= "blog-meta-serie-episode")
    episodes = [episode.text for episode in episode_soup]

    categories_soup = soup.findAll('span', class_= "blog-meta-category-list")
    categories = [category.text for category in categories_soup]
    categories = [category.replace('\n','').replace('\t','') for category in categories] # not nice, but works

    duration_soup = soup.findAll('span', class_= "blog-meta-time-slt")
    durations = [duration.text for duration in duration_soup]

    current_df = pd.DataFrame({'Episode': episodes, 'Date': dates, 'Title': titles, 'Duration': durations, 'Categories': categories})

    # Append the current DataFrame to the list
    all_dfs.append(current_df)

    # Adding a 5-second delay
    time.sleep(5)

# Concatenate all DataFrames in the list
result_df = pd.concat(all_dfs, ignore_index=True)


100%|███████████████████████████████████████████| 11/11 [01:07<00:00,  6.13s/it]


In [5]:
result_df['Date'] = pd.to_datetime(result_df['Date'])
result_df['Episode'] = result_df['Episode'].str.replace('Episode ', '')
result_df

Unnamed: 0,Episode,Date,Title,Duration,Categories
0,230,2024-05-23,GeoParquet for beginners,42:00,mapscaping.com
1,229,2024-05-16,Finding Stuff Indoors,49:37,mapscaping.com
2,228,2024-05-01,What is humanitarian GIS?,47:24,mapscaping.com
3,227,2024-04-12,AI Autocomplete for QGIS,42:52,mapscaping.com
4,226,2024-03-21,GNSS receivers – why precise positioning will ...,51:01,mapscaping.com
5,225,2024-03-15,The Way You Talk About Your Geospatial Skills ...,52:17,mapscaping.com
6,224,2024-02-29,Modern Geospatial,48:25,mapscaping.com
7,223,2024-02-15,Introduction To LIDAR & Point Clouds,48:52,mapscaping.com
8,222,2024-01-26,Introduction to Cloud Native Geospatial,55:13,mapscaping.com
9,221,2024-01-09,GeeMap,54:55,mapscaping.com


### On page 4 of the website:


Episode 162 (QGIS Offline And In The Field, July 20, 2022) on page 4 has no duration or episode number on the website. The scraped lists for this page therefore have different lengths, and pandas cannot build a DataFrame from them.
In the code below, I insert the missing information manually and append page 4 to the rest of the scraped data.

Episode 162 is located:
* between 'Whitebox Tools' and 'Sentinel Hub' in the list of titles
* between episodes 161 and 163 in the list of episode numbers
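
The length mismatch described above can be detected before a DataFrame is built. The following is a minimal sketch, assuming the scraped fields of one page are collected in a dict of lists. The helper name `find_short_fields` and the sample values are hypothetical and not part of the scraper.

```python
def find_short_fields(fields):
    """Return {field name: number of missing entries} for lists shorter than the longest one."""
    longest = max(len(values) for values in fields.values())
    return {name: longest - len(values)
            for name, values in fields.items()
            if len(values) < longest}


# Hypothetical page with three posts, one of which lacks episode number and duration
page = {
    'Date': ['July 27, 2022', 'July 20, 2022', 'July 6, 2022'],
    'Title': ['A', 'B', 'C'],
    'Episode': ['Episode 164', 'Episode 163'],
    'Duration': ['59:04', '41:26'],
}
print(find_short_fields(page))
```

A non-empty result tells which fields need manual completion on that page. It does not tell which post is affected, so the position still has to be looked up on the website.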

In [6]:
base_url = 'https://mapscaping.com/podcasts/page/4'

response = requests.get(base_url)
soup = BeautifulSoup(response.text, 'html.parser')

In [7]:
date_soup = soup.findAll('span', class_= "blog-meta-date-display")
dates = [date.text for date in date_soup]
print(len(dates))
dates

21


['August 10, 2022',
 'August 8, 2022',
 'August 3, 2022',
 'July 27, 2022',
 'July 20, 2022',
 'July 20, 2022',
 'July 6, 2022',
 'June 29, 2022',
 'June 25, 2022',
 'June 8, 2022',
 'June 1, 2022',
 'April 28, 2022',
 'April 20, 2022',
 'April 13, 2022',
 'April 6, 2022',
 'March 30, 2022',
 'March 18, 2022',
 'March 9, 2022',
 'March 2, 2022',
 'February 23, 2022',
 'February 16, 2022']

In [8]:
title_soup = soup.findAll('h2', class_= "secondline-blog-title")
titles = [title.text for title in title_soup]
titles = [title.replace('\n','').replace('\t','') for title in titles] # not nice, but works
print(len(titles))
titles

21


['Bathymetric Lidar and Blue Carbon',
 'Re-Published – QGIS Offline And In The Field',
 'The Open Geospatial Consortium',
 'Monetizing An Open-Source Geospatial Project',
 'Whitebox Tools Is The Backend To Many Frontends',
 'QGIS Offline And In The Field',
 'Sentinel Hub',
 'Unstructured Data Is Dark Data',
 'What Is Modern GIS?',
 'FOSS4G',
 'Building a web based mapping tool into a business',
 'Digital twins – not just a buzzword',
 'Build Your Own SaaS',
 'Getting Your Dream Job in Earth Observation',
 'Fake Satellite Imagery',
 'Cloud Native Geospatial',
 'Python Maps',
 'Mentorship, Leadership and Career Advice',
 'The Role Of Geospatial In Open Source Intelligence',
 'Geospatial Design and User Experience Can Reduce The Time To Science',
 'Business Ideas For Geospatial People']

In [9]:
episode_soup = soup.findAll('div', class_= "blog-meta-serie-episode")
episodes = [episode.text for episode in episode_soup]
# Episode 162 has no episode tag on the site; insert it at the position of its title
missing_idx = titles.index('QGIS Offline And In The Field')  # 5 on this page
episodes.insert(missing_idx, 'Episode 162')
print(len(episodes))
episodes

21


['Episode 167',
 'Episode 166',
 'Episode 165',
 'Episode 164',
 'Episode 161',
 'Episode 162',
 'Episode 163',
 'Episode 160',
 'Episode 159',
 'Episode 158',
 'Episode 157',
 'Episode 156',
 'Episode 155',
 'Episode 154',
 'Episode 153',
 'Episode 152',
 'Episode 151',
 'Episode 150',
 'Episode 149',
 'Episode 148',
 'Episode 147']

In [10]:
categories_soup = soup.findAll('span', class_= "blog-meta-category-list")
categories = [category.text for category in categories_soup]
categories = [category.replace('\n','').replace('\t','') for category in categories] # not nice, but works
print(len(categories))
categories

21


['Earth Observation, Geospatial Concepts, Geospatial Tech and Tools',
 'mapscaping.com',
 'Geospatial Career, Geospatial Startups, Geospatial Tech and Tools',
 'Geospatial Startups, Geospatial Tech and Tools, GIS',
 'Geospatial Tech and Tools, GIS',
 'Geospatial Career, Geospatial Tech and Tools, GIS',
 'Earth Observation, Geospatial Concepts, Geospatial Tech and Tools',
 'Geospatial Startups, Geospatial Tech and Tools, GIS',
 'Geospatial Career, Geospatial Tech and Tools, GIS',
 'Geospatial Tech and Tools',
 'Geospatial Career, Geospatial Startups, Geospatial Tech and Tools, GIS',
 'Geospatial Concepts, Geospatial Tech and Tools, GIS',
 'Earth Observation, Geospatial Startups, Geospatial Tech and Tools',
 'Earth Observation, Geospatial Career, GIS',
 'Earth Observation, Geospatial Tech and Tools',
 'Earth Observation, Geospatial Concepts, Geospatial Tech and Tools',
 'Geospatial Career, Geospatial Tech and Tools, GIS',
 'Geospatial Career, GIS',
 'Earth Observation, Geospatial Tech an

In [11]:
duration_soup = soup.findAll('span', class_= "blog-meta-time-slt")
durations = [duration.text for duration in duration_soup]
# the duration of episode 162 is missing as well; insert it at the position of its title
missing_idx = titles.index('QGIS Offline And In The Field')  # 5 on this page
durations.insert(missing_idx, '35:53')
print(len(durations))
durations

21


['44:16',
 '35:52',
 '44:56',
 '59:04',
 '50:35',
 '35:53',
 '51:06',
 '41:26',
 '48:00',
 '26:10',
 '49:35',
 '41:54',
 '44:58',
 '53:12',
 '40:34',
 '40:59',
 '39:33',
 '43:41',
 '33:42',
 '49:00',
 '30:27']

In [12]:
df_page3 = pd.DataFrame({'Episode': episodes, 'Date': dates, 'Title': titles, 'Duration': durations, 'Categories': categories})
df_page3['Date'] = pd.to_datetime(df_page3['Date'])
df_page3['Episode'] = df_page3['Episode'].str.replace('Episode ', '')

df_page3.info()
df_page3

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21 entries, 0 to 20
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   Episode     21 non-null     object        
 1   Date        21 non-null     datetime64[ns]
 2   Title       21 non-null     object        
 3   Duration    21 non-null     object        
 4   Categories  21 non-null     object        
dtypes: datetime64[ns](1), object(4)
memory usage: 968.0+ bytes


Unnamed: 0,Episode,Date,Title,Duration,Categories
0,167,2022-08-10,Bathymetric Lidar and Blue Carbon,44:16,"Earth Observation, Geospatial Concepts, Geospa..."
1,166,2022-08-08,Re-Published – QGIS Offline And In The Field,35:52,mapscaping.com
2,165,2022-08-03,The Open Geospatial Consortium,44:56,"Geospatial Career, Geospatial Startups, Geospa..."
3,164,2022-07-27,Monetizing An Open-Source Geospatial Project,59:04,"Geospatial Startups, Geospatial Tech and Tools..."
4,161,2022-07-20,Whitebox Tools Is The Backend To Many Frontends,50:35,"Geospatial Tech and Tools, GIS"
5,162,2022-07-20,QGIS Offline And In The Field,35:53,"Geospatial Career, Geospatial Tech and Tools, GIS"
6,163,2022-07-06,Sentinel Hub,51:06,"Earth Observation, Geospatial Concepts, Geospa..."
7,160,2022-06-29,Unstructured Data Is Dark Data,41:26,"Geospatial Startups, Geospatial Tech and Tools..."
8,159,2022-06-25,What Is Modern GIS?,48:00,"Geospatial Career, Geospatial Tech and Tools, GIS"
9,158,2022-06-08,FOSS4G,26:10,Geospatial Tech and Tools


In [13]:
df_complete = pd.concat([result_df, df_page3], ignore_index = True)

In [14]:
final_df = df_complete.sort_values(by='Date')
final_df

Unnamed: 0,Episode,Date,Title,Duration,Categories
208,1,2019-01-14,The future of collecting and updating geospati...,46:08,"Earth Observation, Geospatial Startups, Geospa..."
207,2,2019-02-08,"Indoor mapping and navigation: Manage, visuali...",40:37,"Geospatial Concepts, Geospatial Tech and Tools..."
206,3,2019-03-10,Bellerby & Co – The globemakers,46:35,GIS
205,4,2019-04-02,freelance mappers create maps for machines,43:43,"Geospatial Startups, Geospatial Tech and Tools..."
204,5,2019-04-08,Powering location intelligence with geo social...,33:48,"Geospatial Concepts, Geospatial Startups, Geos..."
203,6,2019-04-16,Data discovery – the way it should be,38:40,mapscaping.com
202,7,2019-04-25,Augmented reality will change the way you thin...,29:52,"Geospatial Concepts, Geospatial Tech and Tools..."
201,8,2019-05-03,Geo-tagged audio – another way of augmenting r...,29:25,"Geospatial Concepts, Geospatial Startups, Geos..."
200,9,2019-05-08,"Proof of location, bringing the blockchain to ...",26:14,mapscaping.com
199,10,2019-05-15,Mapping Personalised Workplace Risk,32:29,GIS


In [None]:
final_df.to_csv('MapScaping_scraped.tsv', sep='\t', index = False)

# Interviewee info

This is the trickier part. Daniel does not list the names or contact information of his interview partners in a structured way on his website. Therefore, I have to go through the transcripts manually (luckily, he provides them!) to find the names of the interview partners and then search the internet for social media profiles etc. with data on them.

#### Variables I try to find for each interviewee are:
* (interviewee id)
* first name
* last name
* gender (based on pronouns on LinkedIn or how other people/websites refer to them)
* position/seniority *at time of interview*
* weblinks (to LinkedIn/Twitter/personal page and to company page)
* company name *at time of interview*

#### Shortcomings
* not everyone is 'from' somewhere. Several interviewees have moved often in their lives and don't consider themselves to be from a specific place
* gender is assumed from name & photos if not explicitly stated on LinkedIn or clear from how Daniel refers to them
* the location I use is point data, while some interviewees state a region (or even a country) as their location
* quite a few companies that the interviewees worked for at the time of the interview have since been acquired by other firms or closed
* some names can't be written correctly because of UTF-8 encoding issues, e.g.
    * Lyden Foust
    * Josh Kopecek
    * Markus Müller
* the same applies to several place names (especially in PL, CZ)

Not sure whether the map is too much advertisement for the interview partners / their companies. Daniel does not want the podcast to be a sales pitch for them.

### Struggles
I use Excel to type in the data that I found online.
Now I can't read the CSV file into Python because of UTF-8 encoding issues...

In [None]:
with open("MapScaping_extended.csv", 'rb') as f:
  contents = f.read()

In [None]:
contents
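
One way to narrow down the encoding problem is to test the raw bytes against a few candidate encodings. The sketch below assumes that Excel saved the file in a Windows code page such as `cp1252`, which is common but not confirmed for this file. The helper name `first_working_encoding` and the sample bytes are made up for illustration.

```python
def first_working_encoding(raw, candidates=('utf-8', 'cp1252', 'iso-8859-1')):
    """Return the first candidate encoding that decodes raw without error, else None."""
    for encoding in candidates:
        try:
            raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # this encoding cannot represent the bytes, try the next one
        return encoding
    return None


# Sample bytes as a cp1252 export would contain them (the 'ü' is a single byte 0xfc)
sample = 'Markus Müller'.encode('cp1252')
print(first_working_encoding(sample))
```

The returned name can be passed to `pd.read_csv(..., encoding=...)`. Note that `iso-8859-1` accepts every byte sequence, so a successful decode with it does not prove that the characters come out right. The names with special characters should still be checked by eye.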

In [None]:
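import os

## adapted from
## https://stackoverflow.com/questions/48812580/python-pandas-unicodedecodeerror-utf-8-codec-cant-decode-byte-0xcd-in-pos

def read_csv(filepath):
    """Try several separators and encodings until pandas can parse the file."""
    if os.path.splitext(filepath)[1] != '.csv':
        raise ValueError("{!r} is not a .csv file".format(filepath))
    seps = [',', ';', '\t']                    # ',' is the pandas default
    encodings = [None, 'utf-8', 'ISO-8859-1']  # None falls back to the pandas default (utf-8)
    # note: a wrong separator often does not raise but yields a single column,
    # so check the shape of the returned DataFrame
    for sep in seps:
        for encoding in encodings:
            try:
                return pd.read_csv(filepath, encoding=encoding, sep=sep)
            except (UnicodeDecodeError, pd.errors.ParserError):
                continue  # try the next combination
    raise ValueError("{!r} has no encoding in {} or separator in {}"
                     .format(filepath, encodings, seps))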

In [None]:
df = read_csv("MapScaping_extended.csv")

In [None]:
# double square brackets to keep it as DataFrame (instead of Series)
locations = df[['location']]
# drop duplicate locations to make geocoding faster and avoid merge complications later on
locations = locations.drop_duplicates(subset='location')

locations #.info()

#### Nominatim API to geocode addresses

In [None]:
import pandas as pd
import requests
import time

# Load addresses from CSV
addresses_df = locations # pd.read_csv('location.csv')
addresses = addresses_df['location'].tolist()

# Function to geocode address using Nominatim API
def geocode_address(address):
    url = 'https://nominatim.openstreetmap.org/search'
    headers = {'User-Agent': 'Nicolas'}
    params = {'q': address, 'format': 'json'}
    
    response = requests.get(url, headers=headers, params=params, timeout=10)  # do not hang forever on a stalled request
    if response.status_code == 200:
        results = response.json()
        if results:
            return results[0]['lat'], results[0]['lon']  # Return the latitude and longitude of the first result
    return None, None  # Return None if no results or an error occurred

# Create a list to hold geocoded results
geocoded_addresses = []

# Geocode each address
for address in tqdm(addresses):
    lat, lon = geocode_address(address)
    geocoded_addresses.append({'address': address, 'latitude': lat, 'longitude': lon})
    time.sleep(1)  # Sleep to respect Nominatim's usage policy

# Convert results to a DataFrame
geocoded_df = pd.DataFrame(geocoded_addresses)

# Optionally, save the geocoded results to a new CSV file
geocoded_df.to_csv('geocoded_addresses.csv', index=False)

print('Geocoding complete. Results saved to geocoded_addresses.csv.')


In [None]:
geocoded_df
# manually insert coordinates for...
# ...Bellerby & Co., as it's too specific for Nominatim
geocoded_df.at[2, 'latitude'] = 51.5625709
geocoded_df.at[2, 'longitude'] = -0.0788484
# ...Cologne-Bonn-Region in Germany, as it's unknown to Nominatim
geocoded_df.at[112, 'latitude'] = 50.8285133
geocoded_df.at[112, 'longitude'] = 6.9960294
# ...NaN, as geocoding this makes no sense at all ;)
# use a real missing value (not the string 'NaN') so that dropna() removes the row
geocoded_df.at[1, 'latitude'] = np.nan
geocoded_df.at[1, 'longitude'] = np.nan

geocoded_df
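
The row numbers used for the manual corrections (2, 112, 1) depend on the order of the locations, so they can silently point at the wrong row after a re-run with new data. A possible alternative, sketched here under assumptions, is to key the corrections by the address text and apply them to the list of result dicts before it is converted to a DataFrame. The function name `apply_overrides` and the address strings are hypothetical; only the coordinates are taken from the cell above.

```python
def apply_overrides(records, overrides):
    """Return a copy of records with coordinates replaced for addresses listed in overrides."""
    patched = []
    for record in records:
        record = dict(record)  # copy, so the input list stays untouched
        if record['address'] in overrides:
            record['latitude'], record['longitude'] = overrides[record['address']]
        patched.append(record)
    return patched


# Address strings are placeholders; the real spelling in the location column may differ
overrides = {
    'Bellerby & Co': (51.5625709, -0.0788484),
    'Cologne-Bonn-Region': (50.8285133, 6.9960294),
}
records = [
    {'address': 'Bellerby & Co', 'latitude': None, 'longitude': None},
    {'address': 'Somewhere else', 'latitude': '1.0', 'longitude': '2.0'},
]
patched = apply_overrides(records, overrides)
print(patched[0])
```

The `records` structure mirrors the `geocoded_addresses` list built during geocoding (keys `address`, `latitude`, `longitude`), so the same call could be applied to that list.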

In [None]:
# drop rows without location / coordinates
# create explicit copy of the DataFrame
geocoded_df_clean = geocoded_df.dropna().copy()
geocoded_df_clean.info()

In [None]:
# convert to float and round to 7 decimal places (approx. 1 cm)
# assign the columns directly so that their dtype actually becomes float
geocoded_df_clean['latitude'] = geocoded_df_clean['latitude'].astype(float).round(7)
geocoded_df_clean['longitude'] = geocoded_df_clean['longitude'].astype(float).round(7)

In [None]:
geocoded_df_clean = geocoded_df_clean.rename(columns={'address': 'location'})
geocoded_df_clean.info()
df#.info()

In [None]:
# merge geocoded coordinates to main dataframe
result_df = pd.merge(df, geocoded_df_clean, on='location', how='left')

result_df.sort_values(by=['Duration'])
# transform 'duration' to timedelta
# difficult, because excel changed the time format ...
# -> use columns from original scraped final_df?
result_df['Duration'] = pd.to_timedelta(result_df['Duration'].replace('', np.nan))  # np.NaN was removed in NumPy 2.0
result_df.groupby('gender')['Duration'].mean()
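
If the durations are taken from the originally scraped data instead, they have the form `mm:ss` (for example `42:00`). A minimal sketch of a parser for such strings follows. It assumes that the values are plain `mm:ss` or `h:mm:ss` text and have not been reformatted by Excel. The function name `parse_duration` is an illustration, not part of the notebook.

```python
from datetime import timedelta


def parse_duration(text):
    """Convert 'mm:ss' or 'h:mm:ss' to a timedelta; return None for empty or non-string input."""
    if not isinstance(text, str) or not text.strip():
        return None
    seconds = 0
    for part in text.strip().split(':'):
        seconds = seconds * 60 + int(part)  # each further part is a smaller unit
    return timedelta(seconds=seconds)


print(parse_duration('42:00'))    # 0:42:00
print(parse_duration('1:02:03'))  # 1:02:03
```

Mapping this function over the column, e.g. `result_df['Duration'].map(parse_duration)`, and passing the result to `pd.to_timedelta` should give a timedelta column that supports `mean()`.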

In [None]:
result_df['Date'] = pd.to_datetime(result_df['Date'], format='%d.%m.%y')
# check that datetime conversion worked properly
result_df = result_df.sort_values(by=['Date'])
result_df.info()

In [None]:
# result_df = result_df.drop(multiple_guest)
result_df# .info()

In [None]:
# result_df = result_df.rename(columns={'address': 'location'})

In [None]:
# Creating a combined name column for convenience
result_df['full_name'] = result_df['first_name'] + ' ' + result_df['last_name']

# Counting occurrences and creating a new column
result_df['number_of_interviews'] = result_df.groupby('full_name')['full_name'].transform('count') #.astype(int)

In [None]:
# who appears how often?
result_df.sort_values(by=['number_of_interviews', 'full_name'], ascending = False)

In [None]:
result_df.info()

In [None]:
# subset only first columns to check that file is okay
subset = result_df[['Episode', 'Date', 'Title', 'Duration', 'Categories', 'first_name', 'last_name', 'interviewee_link', 'gender', 'position', 'company_name', 'company_link', 'location', 'latitude', 'longitude', 'full_name', 'number_of_interviews']]

# 

In [None]:
subset.info()

In [None]:
# toy example (stored under its own name so that result_df is not overwritten)
data = {'Episode': [1, 2, 2, 3, 4, 4, 5],
        'Title': ['A', 'B', 'B', 'C', 'D', 'D', 'E']}
toy_df = pd.DataFrame(data)
toy_df

In [None]:
# save as tab-separated file so that commas in addresses don't mess it up
subset.to_csv('geocoded_addresses.tsv', sep='\t', index = False)

### To Do Preprocessing
* rename variables to be meaningful
* create variable with count of interviews per person
* change data types to time / timedelta? Or only later in postgis??





### To Do Frontend
* jitter points so that they are not exactly on top of each other
* clustering of points that overlap too much (solve css problems)
    * use colorcoding for percentage of male/female interviewees in cluster
* handle special characters correctly (in names)
* include more in popup
    * interviewee name (make link clickable)
    * episode
    * company name (make link clickable)
* allow filtering for 
    * category (use regex to check whether category is in string)
    * gender
    * date (range filter?)
    * number of interviews (checkbox)
    * interviewee name?
    * company name?

To keep interviewees with identical coordinates from lying exactly on top of each other, the points in the interviewees table can be jittered by roughly 1 kilometer with the `ST_Translate` function in PostGIS. This function shifts a geometry by a given amount in the x (longitude) and y (latitude) directions. Because the coordinates are in degrees, the desired distance has to be converted to degrees first.

One degree of latitude is approximately 111 kilometers, so a shift of about 0.009° corresponds to roughly 1 kilometer. A degree of longitude covers less ground the further a point is from the equator (about 111 km × cos(latitude)), so the same 0.009° is only about 0.6 km east-west at 50° latitude. For simplicity, the same value is used for both directions here, which is exact only near the equator.

An `UPDATE` statement that passes two independent random offsets to `ST_Translate` moves each point by up to approximately 1 kilometer along each axis. The `RANDOM()` function generates a value between 0 and 1, so `RANDOM() * 0.018 - 0.009` produces a shift between -0.009° and +0.009° in latitude and in longitude, which creates a jitter around the original point.
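
The same jitter can also be applied in Python before the data is loaded into the database. The sketch below is an assumption-laden illustration, not the PostGIS approach itself. It uses the 111 km per degree approximation from the text and additionally scales the longitude offset by the cosine of the latitude. The function name `jitter_point` and the clamp value are hypothetical choices.

```python
import math
import random

KM_PER_DEGREE_LAT = 111.0  # approximation used in the text above


def jitter_point(lat, lon, max_km=1.0, rng=random):
    """Shift a point by up to max_km along each axis, correcting longitude for latitude."""
    max_dlat = max_km / KM_PER_DEGREE_LAT
    # a degree of longitude shrinks with cos(latitude); clamp to avoid huge offsets near the poles
    cos_lat = max(math.cos(math.radians(lat)), 0.01)
    max_dlon = max_dlat / cos_lat
    return (lat + rng.uniform(-max_dlat, max_dlat),
            lon + rng.uniform(-max_dlon, max_dlon))


rng = random.Random(42)  # fixed seed so that the jitter is reproducible
new_lat, new_lon = jitter_point(51.5625709, -0.0788484, rng=rng)
print(new_lat, new_lon)
```

Because the two offsets are drawn independently, a point can move up to about 1.4 km along the diagonal. If a strict 1 km radius matters, a random angle and distance could be drawn instead.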

not needed
* interviewee_id
* page_on_website
* incomplete

**Attention!!!**
* there are two different people named Eric Jensen
* Peter Petrik's episode was re-published because of its audio quality
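
Both points affect the interview count per person, because counting by name alone merges namesakes and counts a re-published episode twice. A minimal sketch of a count that at least separates namesakes is shown below. It assumes that the profile link in `interviewee_link` is unique per person; the sample rows and URLs are placeholders.

```python
from collections import Counter


def count_interviews(rows, key_fields=('full_name', 'interviewee_link')):
    """Count interviews per person, identified by name plus a disambiguating field."""
    return Counter(tuple(row[field] for field in key_fields) for row in rows)


# Placeholder rows: two different people share a name, one person appears twice
rows = [
    {'full_name': 'Eric Jensen', 'interviewee_link': 'https://example.com/eric-a'},
    {'full_name': 'Eric Jensen', 'interviewee_link': 'https://example.com/eric-b'},
    {'full_name': 'Peter Petrik', 'interviewee_link': 'https://example.com/peter'},
    {'full_name': 'Peter Petrik', 'interviewee_link': 'https://example.com/peter'},
]
counts = count_interviews(rows)
print(counts)
```

In pandas, the equivalent would be grouping by both columns, e.g. `result_df.groupby(['full_name', 'interviewee_link'])['full_name'].transform('count')`. The re-published episode still counts as a second interview unless its row is dropped first.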

## troubleshooting

In [325]:
!pip install qiskit

python(14510) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.


Collecting qiskit
  Downloading qiskit-1.0.2-cp38-abi3-macosx_10_9_x86_64.whl (4.3 MB)
Collecting stevedore>=3.0.0
  Downloading stevedore-5.2.0-py3-none-any.whl (49 kB)
Collecting rustworkx>=0.14.0
  Downloading rustworkx-0.14.2-cp39-cp39-macosx_10_12_x86_64.whl (1.8 MB)
Collecting symengine>=0.11
  Downloading symengine-0.11.0-cp39-cp39-macosx_10_9_x86_64.whl (23.8 MB)
Collecting dill>=0.3
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Collecting pbr!=2.1.0,>=2.0.0
  Downloading pbr-6.0.0-py2.py3-none-any.whl (107 kB)