#Scientist Map

####*Illuminating the Pathways of Global Academia and Unlocking the Potential of Scientific Mobility*

######*My project from the International Conference on the Science of Science and Innovation (ICSSI) Hackathon*

>**Overview**: This project aims to be a tool that helps researchers find potential partnerships, quickly explore a colleague's history and connections, and view anyone's chronological work history on a map. I built it during a conference held at Northwestern University, and it is far from complete; I have saved it here in case I ever want to come back and finish it. Overall it was a great learning opportunity: a reminder of how to search through, combine, and format dense data structures, how to problem-solve with external APIs (for geolocation), and how to draw maps with folium.

>**Tool Requirements**
>This tool requires the following Python packages: pyalex, fuzzywuzzy, geopy, pandas, nltk, and folium. Install any that are missing before running the tool, for example with the following command:

>`pip install pyalex fuzzywuzzy geopy pandas nltk folium`
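Before running the notebook, it can help to confirm that every module its first cell imports is actually available. The following is a minimal sketch using only the standard library; the helper name `missing_packages` and the `REQUIRED` list are illustrative and not part of the original notebook.

```python
from importlib.util import find_spec


def missing_packages(module_names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in module_names if find_spec(name) is None]


# Top-level modules imported by the notebook's first cell.
REQUIRED = ["pyalex", "fuzzywuzzy", "geopy", "pandas", "nltk", "folium"]

if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

Any name it reports can then be passed to `pip install`.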

>**How to use**
>Run the first few cells until you are prompted for the name you're interested in. Spell the name correctly, or the API will not be able to locate the person. Next, follow the prompts to identify the right person (in case several people share the same name). Finally, run the last two cells to see the returned data and the map.
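The disambiguation step accepts either a list number or a typed institution name. As a rough illustration of that idea, here is a simplified sketch that uses the standard library's `difflib` in place of fuzzywuzzy so it runs without extra installs. The function `resolve_choice` and its `cutoff` parameter are hypothetical and do not appear in the notebook.

```python
import difflib


def resolve_choice(user_input, institutions, cutoff=0.6):
    """Map free-form input to one of the listed institutions, or None."""
    text = user_input.strip()
    # A number selects by position in the printed list (1-based).
    if text.isdigit():
        index = int(text) - 1
        return institutions[index] if 0 <= index < len(institutions) else None
    # Otherwise fall back to the closest name, ignoring case.
    lowered = {name.lower(): name for name in institutions}
    close = difflib.get_close_matches(text.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[close[0]] if close else None


options = ["Boston University", "University of Toronto"]
print(resolve_choice("2", options))
print(resolve_choice("boston universty", options))
```

Returning `None` for an out-of-range number or an unrecognizable name lets the caller ask again instead of guessing.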

>**Shortcomings**
>The OpenAlex API does not always contain accurate information. A representative for OpenAlex has shared that the organization is in the process of correcting incorrect institution names, occasionally inconsistent years, and more. The program will locate who you're looking for, but it is only as good as the data used to build it.
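One way to make the data issues visible, rather than silently mapping them, is to separate records with implausible years before aggregating. This is only a sketch under assumed bounds: the function `flag_suspect_years` and its `earliest` and `latest` limits are hypothetical choices, not something the notebook or OpenAlex prescribes.

```python
from datetime import date
from numbers import Integral


def flag_suspect_years(records, earliest=1900, latest=None):
    """Split (school, year) pairs into plausible and suspect lists."""
    latest = date.today().year if latest is None else latest
    plausible, suspect = [], []
    for school, year in records:
        # A year is plausible only if it is an integer inside the allowed range.
        if isinstance(year, Integral) and earliest <= year <= latest:
            plausible.append((school, year))
        else:
            suspect.append((school, year))
    return plausible, suspect


ok, bad = flag_suspect_years(
    [("School A", 2015), ("School B", None), ("School C", 3020)], latest=2024
)
print("plausible:", ok)
print("suspect:", bad)
```

The school names above are placeholders; in the notebook the pairs would come from the `school` and `year` lists gathered in Step 2.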

------

In [3]:
!pip install pyalex
!pip install fuzzywuzzy
!pip install geopy

from pyalex import Works, Authors, Sources, Institutions, Concepts, Publishers, Funders
import pandas as pd
import sys
from fuzzywuzzy import process, fuzz
from collections import Counter
import nltk
import re
import time
from geopy.geocoders import Nominatim
import folium

nltk.download('punkt')
nltk.download('stopwords')



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [31]:
#All the helper functions

def generate_acronym(name):
    #Drop any parenthesised part, then keep the first letter of every non-stopword
    name_without_brackets = re.sub(r'\(.*\)', '', name)
    tokens = nltk.word_tokenize(name_without_brackets)
    stop_words = set(nltk.corpus.stopwords.words('english'))  #load the stopword list once

    return ''.join(token[0].upper() for token in tokens if token.lower() not in stop_words)


def acronym(s):
    return ''.join([word[0].upper() for word in s.split(' ')])

def is_same_entity(s1, s2, threshold=90):
    ratio = fuzz.ratio(s1, s2)
    return ratio > threshold

def adjust_duplicate_locations(df):
    new_locations = []
    seen = {}

    for i, loc in enumerate(df['locations']):
        if loc[0] is not None and loc[1] is not None:
            loc_tuple = tuple(loc)
            if loc_tuple in seen:
                new_loc = [coord + (1 * (seen[loc_tuple] + 1)) if coord is not None else None for coord in loc]
                new_locations.append(new_loc)
                seen[loc_tuple] += 1
            else:
                new_locations.append(loc)
                seen[loc_tuple] = 1
        else:
            new_locations.append(loc)

    df['locations'] = new_locations
    return df
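
To see what `adjust_duplicate_locations` does without needing pandas, the same offsetting rule can be written over a plain list of `[lat, lon]` pairs. This is a sketch that mirrors the helper above; the name `offset_duplicates` and the `step` parameter are illustrative.

```python
def offset_duplicates(coords, step=1.0):
    """Shift repeated [lat, lon] pairs so their markers do not overlap."""
    seen = {}
    result = []
    for lat, lon in coords:
        if lat is None or lon is None:
            result.append([lat, lon])  # nothing to shift for a missing location
            continue
        count = seen.get((lat, lon), 0)
        if count == 0:
            result.append([lat, lon])  # first occurrence stays where it is
        else:
            # Same rule as the helper above: the n-th repeat moves by (n + 1) steps.
            shift = step * (count + 1)
            result.append([lat + shift, lon + shift])
        seen[(lat, lon)] = count + 1
    return result


print(offset_duplicates([[10.0, 20.0], [10.0, 20.0], [None, None], [10.0, 20.0]]))
# [[10.0, 20.0], [12.0, 22.0], [None, None], [13.0, 23.0]]
```

Note that a one-degree step is a large distance on a map, and shifting latitude this way could push a point past 90 degrees near the poles; a smaller `step` may be a better fit for display purposes.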

In [45]:
#Ask for the name we're looking for
TheInput = input("Who are you looking for? \n-----------------------------\n")
print(f'-----------------------------\nOK, here are all the institutions where someone named {TheInput} works...')

#Go through the whole API result 200 at a time
pager = Authors().search_filter(display_name=TheInput).paginate(per_page=200)

#Collect all the institutions
institutions = []
items = []
for page in pager:
    if page is not None:
        for item in page:
            if item is not None and 'last_known_institution' in item and item['last_known_institution'] is not None and 'display_name' in item['last_known_institution']:
                institution = item['last_known_institution']['display_name']
                institutions.append(institution)
                items.append(item)

# Sort institutions by frequency
counter = Counter(institutions)
institutions = [item for item, _ in counter.most_common()]

if institutions:
    if len(institutions) == 1:
        selected_institution = institutions[0]
    else:
        #Generate acronyms for all places worked
        acronyms = {}
        for institution in institutions:
            acronyms[generate_acronym(institution)] = institution

        #Print the list of unique institutions so the user can select the right person
        for i, institution in enumerate(institutions[:5], start=1):
            print(f"{i}. {institution}")

        #Don't print more than 5 results if there are a lot
        if len(institutions) > 5:
            print("\nThere are more; type Yes if you want to load more institutions. Otherwise,\n")


        #Prompt the user to select which is the right person
        while True:
            selected_institution_input = input(f"Which institution does {TheInput} work at? \n")
            print('-----------------------------')
            if selected_institution_input.lower() == 'yes':
                #Print the remaining institutions, continuing the numbering from 6
                for i, institution in enumerate(institutions[5:], start=6):
                    print(f"{i}. {institution}")
                continue  #ask again now that the full list is shown
            #In case user selects for example "1" for the first entry
            if selected_institution_input.isdigit():
                choice = int(selected_institution_input)
                if 1 <= choice <= len(institutions):
                    selected_institution = institutions[choice - 1]
                    break
                print(f"Please enter a number between 1 and {len(institutions)}.")
            else:
                #If user inputs an acronym rather than the full name
                acronym_match = acronyms.get(selected_institution_input.upper(), None)
                if acronym_match:
                    selected_institution = acronym_match  #select the right place
                    break
                else:
                    matches = process.extract(selected_institution_input, institutions)
                    selected_institution = matches[0][0]  #Picks the top result
                    if matches[0][1] >= 95:
                        break
                    else:
                        #In case the user didn't spell it exactly
                        confirmation = input(f"Did you mean {selected_institution}? (Yes/No) ")
                        if confirmation.lower() == 'yes':
                            break



    filtered_items = []
    for item in items:
        if item['last_known_institution']['display_name'] == selected_institution:
            filtered_items.append(item)

    if filtered_items:
        author_id = None
        for filtered_item in filtered_items:
            if 'id' in filtered_item and filtered_item['id'].startswith('https://openalex.org/'):
                author_id = filtered_item['id'].split('/')[-1]
                break

        if author_id is not None:
            works = Works().filter(authorships={"author": {"id": f"https://openalex.org/{author_id}"}}).get()

            if not works:  #also catches an empty result list
                print("No works found for the author.")
        else:
            print("No valid author ID found.")
    else:
        print("No results found for the selected institution.")
else:
    print("No results found for the search.")
    sys.exit()  #nothing to map, so stop here

#---------------------------------------------------
#STEP 2
#---------------------------------------------------

#Gather all the data on the selected person; all the papers they wrote and where and when they did the research
school = []
country = []
year= []
date = []
for papers in works:
  for author in papers['authorships']:
    if author['author']['id']==f"https://openalex.org/{author_id}":
      for institution in author['institutions']:
        school.append(institution['display_name'])
        year.append(papers['publication_year'])
        country.append(institution['country_code'])
        date.append(papers['publication_date'])

data = list(zip(school, country, date, year))
df = pd.DataFrame(data, columns=["School", "Country", "Date", "Year"])

#Figure out which years a person was at an institution
df_new = df.groupby("School").agg({"Year": ["min", "max"]})
df_new.columns = ["Earliest Year", "Latest Year"]

#---------------------------------------------------
#STEP 3
#---------------------------------------------------

school_names = df_new.index.tolist()

mapping_dict = {}
for i in range(len(school_names)):
    for j in range(i+1, len(school_names)):
        if is_same_entity(acronym(school_names[i]), school_names[j]) or \
           is_same_entity(acronym(school_names[j]), school_names[i]):
            mapping_dict[school_names[j]] = school_names[i]

#Replace school names in the index using the mapping_dict
df_new.index = df_new.index.map(lambda x: mapping_dict[x] if x in mapping_dict else x)

#Group by index (School name), and take the min of 'Earliest Year' and the max of 'Latest Year'
new_df = df_new.groupby(df_new.index).agg({'Earliest Year':'min', 'Latest Year':'max'})

if not new_df.empty:
    sorted_df = new_df.sort_values(by='Latest Year', ascending=False)
    print(f'Selected: {TheInput} from {selected_institution}\n')
else:
    print(f"There is no affiliated information for {TheInput} at {selected_institution}")

#-------------------------------------
#STEP 4
#-------------------------------------

#Initialize the API to locate coords of addresses to feed to folium map
geolocator = Nominatim(user_agent="MyApp")

locations = []
for i in sorted_df.reset_index().School:
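  #Prefer the text inside parentheses when the name has one; otherwise use the text before it
  split = i.split('(')
  try:
      location = geolocator.geocode(split[1].split(')')[0])
  except Exception:
      location = geolocator.geocode(split[0])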

  if location is None:
      locations.append([None, None])
      print(f"Unable to find location for '{i}'")
  else:
      locations.append([location.latitude,location.longitude])


map_df = sorted_df.copy()
map_df['locations'] = locations

map_df = adjust_duplicate_locations(map_df)

#-------------------------------------
#STEP 5
#-------------------------------------
#Create a map centered at 0
map = folium.Map(location=[0, 0], zoom_start=2)

solid_line_locations = []
dotted_line_locations = []

#Add a marker to the map for each row
for index, row in map_df.iterrows():
    #if the location is available (missing ones are stored as [None, None])
    if row['locations'] is not None and None not in row['locations']:
        #Calculate the person's tenure
        tenure = row['Latest Year'] - row['Earliest Year']

        #Add a marker to the map
        folium.Marker(
            location=row['locations'],
            popup=index,
            icon=folium.Icon(color='red'),
            tooltip=str(index)
        ).add_to(map)

        #If the tenure is less than two years, draw a dotted segment from the previous
        #long-term location (if one exists) to this one; otherwise extend the solid line
        if tenure < 2:
            if solid_line_locations and solid_line_locations[-1] is not None:
                dotted_line_locations.append(solid_line_locations[-1])
                dotted_line_locations.append(row['locations'])
        else:
            solid_line_locations.append(row['locations'])

# Filter out None values
solid_line_locations = [loc for loc in solid_line_locations if loc is not None]
dotted_line_locations = [loc for loc in dotted_line_locations if loc is not None]

#Add a solid line connecting all the solid_line_locations
folium.PolyLine(solid_line_locations, color="blue", weight=2.5, opacity=1).add_to(map)

#Add a dotted line connecting all the dotted_line_locations
for i in range(0, len(dotted_line_locations), 2):
    folium.PolyLine(dotted_line_locations[i:i+2], color="red", weight=2.5, opacity=1, dash_array='5, 5').add_to(map)

Who are you looking for? 
-----------------------------
Laurina Zhang
-----------------------------
OK, here are all the institutions where someone named Laurina Zhang works...
Selected: Laurina Zhang from Boston University
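
Step 4 sends one request to Nominatim per institution. The public Nominatim service asks clients to stay at or below one request per second, so a small wrapper that caches results and spaces out calls can help. The sketch below is an assumption-laden illustration: `make_throttled` is a hypothetical helper, and it wraps any callable rather than geopy specifically. geopy also ships its own `RateLimiter` in `geopy.extra.rate_limiter`, which may be the better choice in practice.

```python
import time


def make_throttled(geocode, min_delay=1.0, sleep=time.sleep, clock=time.monotonic):
    """Wrap a geocode callable with a cache and a minimum delay between calls."""
    cache = {}
    last_call = [None]  # mutable cell so the inner function can update it

    def lookup(query):
        if query in cache:
            return cache[query]  # repeated names cost no extra request
        if last_call[0] is not None:
            wait = min_delay - (clock() - last_call[0])
            if wait > 0:
                sleep(wait)
        try:
            result = geocode(query)
        except Exception:
            result = None  # treat a failed lookup as "not found"
        last_call[0] = clock()
        cache[query] = result
        return result

    return lookup


# Demonstration with a stand-in geocoder; no network access is needed.
def fake_geocode(query):
    return (0.0, 0.0)


lookup = make_throttled(fake_geocode, min_delay=0)
print(lookup("Boston University"))
```

In the notebook, the wrapped callable would be `geolocator.geocode`, for example `lookup = make_throttled(geolocator.geocode)`.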



In [46]:
#Return the data in a table
sorted_df

| School | Earliest Year | Latest Year |
|---|---|---|
| Boston University | 2013 | 2022 |
| Quest University Canada | 2014 | 2021 |
| Georgia Institute of Technology | 2017 | 2019 |
| Western University | 2015 | 2018 |
| University of Toronto | 2013 | 2014 |
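
The table is sorted by latest year, descending. For a chronological reading of the work history, the same rows can be reordered by their first recorded year. The following sketch uses the values from the table above as plain tuples; the function name `chronological` is illustrative.

```python
history = [
    ("Boston University", 2013, 2022),
    ("Quest University Canada", 2014, 2021),
    ("Georgia Institute of Technology", 2017, 2019),
    ("Western University", 2015, 2018),
    ("University of Toronto", 2013, 2014),
]


def chronological(rows):
    """Order (school, earliest, latest) rows by first, then last, recorded year."""
    return sorted(rows, key=lambda row: (row[1], row[2]))


for school, first, last in chronological(history):
    print(f"{first}-{last}  {school}")
```

With a DataFrame like `sorted_df`, a comparable ordering could be obtained with `sort_values(by=['Earliest Year', 'Latest Year'])`.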


In [47]:
#draw it on a map
map