## Aim

In this notebook, I continue working on `get_gender_race_aff_pred.py` and test its functions one by one. The output file should be author data with gender, race, and affiliation predictions. Each affiliation prediction should come with the ROR affiliation name in its original capitalization (called "upper case" below) and the associated ROR ID.

In [7]:
import pandas as pd
import requests
import numpy as np
import re
from collections import Counter
import random
import json

In [8]:
race_pred = pd.read_csv('../data/author_with_pred.csv')
race_pred.shape

(13603, 25)

In [9]:
race_pred.head(2)

Unnamed: 0,doi,url,year,title,journal,datePublished,authorFullName,firstName,lastName,numberOfAuthors,...,genderAccuracy,api,black,hispanic,white,race,raceHighest,raceSecondHighest,raceDiff,racePredAccuracy
0,10.1093/joc/jqac004,https://academic.oup.com/joc/article/72/3/297/...,2022,The Gender Divide in Wikipedia: Quantifying an...,Journal of Communication,2022-02-16,Isabelle Langrock,Isabelle,Langrock,2.0,...,High,0.007762,0.066429,0.030049,0.89576,white,0.89576,0.066429,0.829331,High
1,10.1093/joc/jqac004,https://academic.oup.com/joc/article/72/3/297/...,2022,The Gender Divide in Wikipedia: Quantifying an...,Journal of Communication,2022-02-16,Sandra González-Bailón,Sandra,González-Bailón,2.0,...,High,0.010406,0.0003,0.979183,0.010111,hispanic,0.979183,0.010406,0.968777,High


In [10]:
len(race_pred)

13603

In [11]:
# testing whether I can select duplicate columns; yes
cols_to_keep = ['doi', 'url', 'doi', 'url']

In [12]:
race_pred[cols_to_keep].head()

Unnamed: 0,doi,url,doi.1,url.1
0,10.1093/joc/jqac004,https://academic.oup.com/joc/article/72/3/297/...,10.1093/joc/jqac004,https://academic.oup.com/joc/article/72/3/297/...
1,10.1093/joc/jqac004,https://academic.oup.com/joc/article/72/3/297/...,10.1093/joc/jqac004,https://academic.oup.com/joc/article/72/3/297/...
2,10.1093/joc/jqac009,https://academic.oup.com/joc/article/72/3/322/...,10.1093/joc/jqac009,https://academic.oup.com/joc/article/72/3/322/...
3,10.1093/joc/jqac009,https://academic.oup.com/joc/article/72/3/322/...,10.1093/joc/jqac009,https://academic.oup.com/joc/article/72/3/322/...
4,10.1093/joc/jqac008,https://academic.oup.com/joc/article/72/3/345/...,10.1093/joc/jqac008,https://academic.oup.com/joc/article/72/3/345/...


### Process aff text

In [13]:
def notNaN(aff):
    '''returns True if aff is not NaN (NaN is the only value not equal to itself)
    '''
    return aff == aff

def process_affiliation_text(aff):
    '''process aff text: lower case, remove content within (), keep only
    the letters a-z and single spaces
    '''
    if notNaN(aff):
        aff = aff.lower()
        # delete anything between ()
        aff = re.sub(r'\(.*?\)', '', aff)
        # replace anything other than a-z and spaces with a space
        aff = re.sub('[^a-z ]+', ' ', aff)
        aff = ' '.join(aff.split())
        return aff
    else:
        return np.nan

In [14]:
test_text = "1 Abran J. Salazar (Ph.D., University of Iowa, 1991) is assistant professor in the Department of Speech Communication at Texas A&amp;M University"

In [15]:
process_affiliation_text(test_text)

'abran j salazar is assistant professor in the department of speech communication at texas a amp m university'
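
The stray `amp` token in the output above comes from the HTML entity `&amp;` in the raw text. A minimal sketch of one way to avoid it, assuming entities should be decoded before cleaning; `process_affiliation_text_unescaped` is a hypothetical name and is not part of `get_gender_race_aff_pred.py`:

```python
import html
import re


def process_affiliation_text_unescaped(aff):
    """Hypothetical variant: decode HTML entities, then clean as before."""
    if not isinstance(aff, str):
        # NaN (a float) and other non-strings are returned as None here
        return None
    aff = html.unescape(aff).lower()
    # delete anything between ()
    aff = re.sub(r'\(.*?\)', '', aff)
    # replace anything other than a-z and spaces with a space
    aff = re.sub(r'[^a-z ]+', ' ', aff)
    return ' '.join(aff.split())


print(process_affiliation_text_unescaped('Texas A&amp;M University (USA)'))
```

Note that this variant returns `None` instead of `np.nan` for missing values, only to keep the sketch free of third-party imports.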

In [16]:
race_pred['affProcessed'] = [
    process_affiliation_text(aff) for aff in race_pred.affiliation]

### Get deduplicated affs to predict

In [17]:
def get_deduplicated_affs_to_predict(race_pred):
    '''deduplicate the column of affProcessed, remove nan and '', and return the 
    list of dedup_affs_to_predict
    '''
    affs = race_pred.affProcessed
    # deduplicate
    affs = list(set(affs))
    # remove nan
    affs = [x for x in affs if str(x) != 'nan' and x != '']
    print(f'There are in total {len(affs)} deduplicated affiliations to predict')
    return affs

In [18]:
dedup_affs_to_predict = get_deduplicated_affs_to_predict(race_pred)

There are in total 8409 deduplicated affiliations to predict
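
One caveat about `list(set(affs))`: the iteration order of a set of strings is not stable across Python sessions, so a slice such as `[0:10]` can pick different affiliations on each run. A minimal sketch of an order-preserving alternative, assuming first-seen order is acceptable; `dedup_keep_order` is a hypothetical helper:

```python
def dedup_keep_order(affs):
    """Drop NaN/empty values and duplicates, keeping first-seen order."""
    cleaned = (a for a in affs if isinstance(a, str) and a != '')
    # dicts preserve insertion order (Python 3.7+), so the result is deterministic
    return list(dict.fromkeys(cleaned))


# toy input with a duplicate, an empty string, and a NaN
print(dedup_keep_order(['b', 'a', float('nan'), '', 'b', 'c']))
```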


In [19]:
dedup_affs_to_predict[0:10]

['department of communication university of california santa barbara usa',
 'ronald c kessler is at the department of sociology and the institute for social research university of michigan ann arbor',
 'new mexico state university',
 'jakob d jensen and ryan j hurley are doctoral candidates in the department of speech communication at the university of illinois at urbana champaign',
 'is assistant professor of speech at fayetteville state university his research examines the functions of ethos and identity in all persuasive discourse but especially that found in computer mediated communication this essay was a part of his dissertation work which focused also on ethos and identity in online political and religious environments',
 'department of speech communication the university of georgia athens ga usa',
 'stephen g west is associate professor of psychology at florida state university',
 'department of communication the hebrew university of jerusalem jerusalem israel',
 'college of co

In [20]:
for i in dedup_affs_to_predict:
    if i == '':
        print('yes')
        break

### Load ROR raw dataset

In [21]:
def load_ror_dataset(ROR_RAW_DATA):
    '''read in ROR_DATA
    Output:
        a list of dictionaries
    '''
    with open(ROR_RAW_DATA, 'r') as myfile:
        data = json.load(myfile)
    return data

In [22]:
rorData = load_ror_dataset('../data/raw/large/ror.json')

In [23]:
rorData[0]

{'id': 'https://ror.org/019wvm592',
 'name': 'Australian National University',
 'types': ['Education'],
 'links': ['http://www.anu.edu.au/'],
 'aliases': [],
 'acronyms': ['ANU'],
 'status': 'active',
 'wikipedia_url': 'http://en.wikipedia.org/wiki/Australian_National_University',
 'labels': [],
 'email_address': None,
 'ip_addresses': [],
 'established': 1946,
 'country': {'country_code': 'AU', 'country_name': 'Australia'},
 'relationships': [{'type': 'Related',
   'label': 'Calvary Hospital',
   'id': 'https://ror.org/041c7s516'},
  {'type': 'Related',
   'label': 'Canberra Hospital',
   'id': 'https://ror.org/04h7nbn38'},
  {'type': 'Related',
   'label': 'Goulburn Base Hospital',
   'id': 'https://ror.org/030jpqj15'},
  {'type': 'Child',
   'label': 'ARC Centre of Excellence for Transformative Meta-Optical Systems',
   'id': 'https://ror.org/05sh7tb37'},
  {'type': 'Child',
   'label': 'ARC Centre of Excellence in Plant Energy Biology',
   'id': 'https://ror.org/01a1mq059'},
  {'ty

### Get ROR dictionaries

In [24]:
def get_ror_dics(rorData):
    '''dictionaries of 
        1. ror aff name in lower case -> the same name in its original (upper) case
        2. ror aff name (upper case) -> its corresponding ror id
    Note that these two dics contain ALL affiliations in rorData
    '''
    ror_lower_upper_dic = {}
    ror_upper_id_dic = {}
    for i in rorData:
        upper_affname = i['name']
        lower_affname = i['name'].lower()
        ror_lower_upper_dic[lower_affname] = upper_affname
        ror_upper_id_dic[upper_affname] = i['id']
    return ror_lower_upper_dic, ror_upper_id_dic

In [25]:
ror_lower_upper_dic, ror_upper_id_dic = get_ror_dics(rorData)
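
Because both dictionaries are keyed by name, two ROR records that share a name would silently overwrite each other, and only the last one would keep its ID. A minimal sketch of a check for that, using toy records rather than real ROR data; `find_name_collisions` is a hypothetical helper:

```python
from collections import Counter


def find_name_collisions(ror_records):
    """Return lower-cased names that occur in more than one record."""
    counts = Counter(rec['name'].lower() for rec in ror_records)
    return {name: n for name, n in counts.items() if n > 1}


# toy records, not real ROR data
toy_records = [
    {'name': 'Example University', 'id': 'id-1'},
    {'name': 'Example University', 'id': 'id-2'},
    {'name': 'Other College', 'id': 'id-3'},
]
print(find_name_collisions(toy_records))
```

Running the same function on `rorData` would show whether any collisions actually occur in the dataset.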

### Get select ror affnames

In [26]:
def get_select_ror_affnames(rorData, target_str, to_remove_affs):
	'''
	ror has A LOT of affiliations. I only select some of them. 

	In the selected ones, some obviously will lead to wrong (exact) match later, 
		so I delete them here. 

	The select affnames are in lower case. 
	'''
	select_ror_affnames = []
	for i in rorData:
		affname = i['name'].lower()
		if any(x in affname for x in target_str):
			select_ror_affnames.append(affname)
	select_ror_affnames = [x for x in select_ror_affnames if x not in to_remove_affs]
	print(f'There are a total of {len(select_ror_affnames)} select ror affnames')
	return select_ror_affnames

In [27]:
target_str = [
    'university', 
    'school',
    'college', 
    "universität", 
    "université", 
    "inc.", 
    "company", 
    'coorporation',  # sic: probably meant 'corporation'; kept so the count below still holds
    'institute',
    'center',
    'centre',
]
to_remove_affs = [
    'he university',
    'ege university',
    'ces university',
    'coe college',
    'kes college',
    'ie university',
    'health center',
    'cancer institute',
    'rk university',
    'air university',

]
select_ror_affnames = get_select_ror_affnames(
    rorData, target_str, to_remove_affs)

There are a total of 29813 select ror affnames
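
Short names such as `'he university'` are presumably in `to_remove_affs` because a plain substring test finds them inside phrases like `'the university'`. A minimal sketch of an alternative that requires matches to align on word boundaries, assuming both strings are already normalized to single spaces; `contains_whole_words` is a hypothetical helper:

```python
def contains_whole_words(aff, name):
    """True if `name` occurs in `aff` aligned on word boundaries."""
    # padding both sides with spaces forces the match to start and end at a space
    return f' {name} ' in f' {aff} '


aff = 'department of communication the university of georgia'
print('he university' in aff)                              # plain substring test
print(contains_whole_words(aff, 'he university'))          # whole-word test
print(contains_whole_words(aff, 'university of georgia'))  # a genuine match
```

With a check like this, the manual removal list might shrink, although names that are real but ambiguous (for example `'health center'`) would still need handling.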


### Exact match

In [28]:
def get_exact_match_list(dedup_affs_to_predict, select_ror_affnames):
	"""for each aff in dedup_affs_to_predict, check whether any of its substring 
	can be exactly matched with any of the aff in select_ror_affnames
	Output:
		a dictionary where key is aff_to_predict, and value is the matched 
		aff in select ror affnames
	"""
	total = len(dedup_affs_to_predict)
	exact_match_dic = {}
	exact_match = 0
	for aff_to_predict in dedup_affs_to_predict:
		for x in select_ror_affnames:
			if x in aff_to_predict:
				exact_match += 1
				exact_match_dic[aff_to_predict] = x
				break
	print(f'{exact_match} out of {total} affiliations have been exactly matched')
	return exact_match_dic

In [29]:
exact_match_dic = get_exact_match_list(
    dedup_affs_to_predict, select_ror_affnames)

5821 out of 8409 affiliations have been exactly matched


In [30]:
pd.DataFrame(exact_match_dic.items()).head()

Unnamed: 0,0,1
0,ronald c kessler is at the department of socio...,institute for social research
1,new mexico state university,new mexico state university
2,is assistant professor of speech at fayettevil...,fayetteville state university
3,department of speech communication the univers...,university of georgia
4,stephen g west is associate professor of psych...,florida state university
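
Row 0 above shows a side effect of stopping at the first hit: the affiliation also mentions the University of Michigan, but `institute for social research` is what gets recorded. A minimal sketch for collecting every hit so that ambiguous affiliations can be reviewed; `all_matches` is a hypothetical helper and the candidate list is a toy one:

```python
def all_matches(aff, candidate_names):
    """Return every candidate name contained in `aff`, in candidate order."""
    return [name for name in candidate_names if name in aff]


# toy candidate list; the real one would be select_ror_affnames
candidates = ['institute for social research', 'university of michigan']
aff = ('ronald c kessler is at the department of sociology and the institute '
       'for social research university of michigan ann arbor')
hits = all_matches(aff, candidates)
print(hits, '(ambiguous)' if len(hits) > 1 else '')
```

Which hit should win is a separate decision; this sketch only makes the ambiguity visible.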


### API query

In [31]:
def get_to_api_query_list(dedup_affs_to_predict, exact_match_dic):
	"""affs in dedup_affs_to_predict that are not exactly matched
	"""
	to_api_query_list = [
		x for x in dedup_affs_to_predict if x not in exact_match_dic.keys()]
	print(f'{len(to_api_query_list)} affiliations were not exactly matched. Will use API to query')
	return to_api_query_list

In [32]:
to_api_query_list = get_to_api_query_list(
		dedup_affs_to_predict, exact_match_dic)

2588 affiliations were not exactly matched. Will use API to query


In [33]:
random.sample(to_api_query_list, 10)

['faculty of statistical demographical and social sciences university of padova',
 'panamericana university mexico city mexico',
 'maudie l graham is assistant professor in the department of communication at the university of wisconsin milwaukee',
 'department of communication at purdue university',
 'paolo mancini is an associate professor at the istituto di studi sociali universit di perugia italy',
 'department of design communication and media it university copenhagen',
 'barbara j wilson is professor in the department of communication university of california santa barbara',
 'boston',
 'department of educational psychology university of wisconsin madison madison wi',
 'professor for information systems at the university of muenster germany his main areas of research are the development and impact of inter organizational systems and electronic commerce he is co organizer of the research symposium on electronic markets research track chair of the bled international conference on el

In [34]:
def get_api_query_match_dic(to_api_query_list):
    '''For affs not exactly matched, query the ror api and take the first result
    Output:
        a dictionary where keys are affs in to_api_query_list and values are
        the matched ror affnames in lower case (np.nan if nothing was matched)
    '''
    api_query_match_dic = {}
    api_query_matched = 0
    # testing on the first 10 only; loop over to_api_query_list for the full run
    for idx, aff in enumerate(to_api_query_list[0:10], start=1):
        # pass the query through params so that it gets URL-encoded
        response = requests.get(
            'https://api.ror.org/organizations', params={'query': aff})
        try:
            j = response.json()['items'][0]
            ror_matched_affname = j['name'].lower()
            api_query_matched += 1
        except (KeyError, IndexError, ValueError):
            # no items returned, or the response was not valid JSON
            ror_matched_affname = np.nan
        api_query_match_dic[aff] = ror_matched_affname
        print(f'{idx}/{len(to_api_query_list)} is done')
    print(f'{api_query_matched} out of {len(to_api_query_list)} have been identified and matched on ROR')
    return api_query_match_dic

In [35]:
api_query_match_dic = get_api_query_match_dic(to_api_query_list)

1/2588 is done
2/2588 is done
3/2588 is done
4/2588 is done
5/2588 is done
6/2588 is done
7/2588 is done
8/2588 is done
9/2588 is done
10/2588 is done
10 out of 2588 have been identified and matched on ROR
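
With 2588 affiliations left to query, it may help to cache results on disk so that an interrupted or repeated run does not query the same affiliation again. A minimal sketch under stated assumptions: `fetch` is a hypothetical callable that returns the decoded JSON for one affiliation, the payload has the `items`/`name` shape used in the cell above, and `lookup_with_cache` and `first_item_name` are hypothetical helpers:

```python
import json
import os


def first_item_name(payload):
    """Lower-cased name of the first item in the payload, or None."""
    items = payload.get('items') or []
    if not items:
        return None
    name = items[0].get('name')
    return name.lower() if isinstance(name, str) else None


def lookup_with_cache(affs, fetch, cache_path):
    """Call `fetch(aff)` only for affs that are missing from the JSON cache."""
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path, 'r', encoding='utf-8') as f:
            cache = json.load(f)
    for aff in affs:
        if aff not in cache:
            # misses are cached as None so they are not retried either
            cache[aff] = first_item_name(fetch(aff))
    with open(cache_path, 'w', encoding='utf-8') as f:
        json.dump(cache, f)
    return cache
```

In the notebook, `fetch` could be something like `lambda aff: requests.get('https://api.ror.org/organizations', params={'query': aff}, timeout=10).json()`; the timeout value is an assumption.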


### Update race_pred

In [36]:
def get_match_method(aff_processed, exact_match_dic, api_query_match_dic):
    """return the match method for one processed affiliation; used to fill
    the "matchMethod" column in race_pred
    The match method is either 'Exact', 'API_QUERY', or np.nan
    """
    if aff_processed in exact_match_dic.keys():
        return 'Exact'
    elif aff_processed in api_query_match_dic.keys():
        print('good')
        return 'API_QUERY'
    else:
        return np.nan

In [37]:
race_pred['matchMethod'] = race_pred['affProcessed'].apply(
    get_match_method, 
    args=(exact_match_dic, api_query_match_dic)
)

good
good
good
good
good
good
good
good
good
good
good
good


In [38]:
combined_dic = {**exact_match_dic, **api_query_match_dic}

In [39]:
def get_matched_ror_affname(aff_processed, combined_dic, ror_lower_upper_dic):
    """get corresponding ror affname in upper case
    """
    try:
        lower_ror_affname = combined_dic[aff_processed]
        return ror_lower_upper_dic[lower_ror_affname]
    except KeyError:
        return np.nan

In [40]:
race_pred['ROR_AFFNAME'] = race_pred['affProcessed'].apply(
    get_matched_ror_affname, 
    args=(combined_dic, ror_lower_upper_dic)
)

In [41]:
def get_ror_id(ror_affname, ror_upper_id_dic):
    """get the ror id for a ror affname (upper case); np.nan if not found
    """
    try:
        return ror_upper_id_dic[ror_affname]
    except KeyError:
        return np.nan

In [42]:
race_pred['ROR_ID'] = race_pred['ROR_AFFNAME'].apply(
    get_ror_id,
    args=(ror_upper_id_dic, )
)
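
The two lookups above (processed affiliation to ROR name, then ROR name to ROR ID) can also be chained in one place with `dict.get`, which avoids the `try`/`except` blocks. A minimal sketch with toy dictionaries; `resolve_affiliation` is a hypothetical helper, and it returns `None` rather than `np.nan` for missing values:

```python
def resolve_affiliation(aff_processed, combined_dic, lower_upper_dic, upper_id_dic):
    """Return (ror_name, ror_id) for one processed affiliation, or (None, None)."""
    lower_name = combined_dic.get(aff_processed)
    ror_name = lower_upper_dic.get(lower_name)
    ror_id = upper_id_dic.get(ror_name)
    return ror_name, ror_id


# toy dictionaries shaped like combined_dic, ror_lower_upper_dic, ror_upper_id_dic
toy_combined = {'some department university of vienna': 'university of vienna'}
toy_lower_upper = {'university of vienna': 'University of Vienna'}
toy_upper_id = {'University of Vienna': 'https://ror.org/03prydq77'}

print(resolve_affiliation(
    'some department university of vienna',
    toy_combined, toy_lower_upper, toy_upper_id))
print(resolve_affiliation('boston', toy_combined, toy_lower_upper, toy_upper_id))
```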

In [59]:
def get_gscholarLink(row):
    # add google scholar link
    gscholar_str = 'https://scholar.google.com/scholar?hl=en&as_sdt=0%252C50&q='
    if str(row['firstName']) != 'nan' and str(row['lastName']) != 'nan':
        gscholarLink = gscholar_str + str(row['firstName']) + '+' + str(row['lastName'])
    else:
        gscholarLink = np.nan
    return gscholarLink

In [60]:
race_pred['gscholarLink'] = race_pred.apply(get_gscholarLink, axis = 1)
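
The hard-coded prefix in `get_gscholarLink` contains `%252C`, which looks like a double-encoded comma (`%2C`), and the names are appended without any escaping. A minimal sketch that lets `urllib.parse.urlencode` do the encoding, assuming the intended `as_sdt` value is `0,50`; `build_gscholar_link` is a hypothetical helper:

```python
from urllib.parse import urlencode


def build_gscholar_link(first_name, last_name):
    """Return a Google Scholar search URL, or None if a name part is missing."""
    if not isinstance(first_name, str) or not isinstance(last_name, str):
        # NaN values are floats, so they end up here
        return None
    params = {'hl': 'en', 'as_sdt': '0,50', 'q': f'{first_name} {last_name}'}
    # urlencode percent-encodes the comma and non-ASCII letters, and turns spaces into '+'
    return 'https://scholar.google.com/scholar?' + urlencode(params)


print(build_gscholar_link('Sandra', 'González-Bailón'))
```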

In [61]:
race_pred

Unnamed: 0,doi,url,year,title,journal,datePublished,authorFullName,firstName,lastName,numberOfAuthors,...,race,raceHighest,raceSecondHighest,raceDiff,racePredAccuracy,affProcessed,matchMethod,ROR_AFFNAME,ROR_ID,gscholarLink
0,10.1093/joc/jqac004,https://academic.oup.com/joc/article/72/3/297/...,2022,The Gender Divide in Wikipedia: Quantifying an...,Journal of Communication,2022-02-16,Isabelle Langrock,Isabelle,Langrock,2.0,...,white,0.895760,0.066429,0.829331,High,annenberg school for communication university ...,Exact,University of Pennsylvania,https://ror.org/00b30xv10,https://scholar.google.com/scholar?hl=en&as_sd...
1,10.1093/joc/jqac004,https://academic.oup.com/joc/article/72/3/297/...,2022,The Gender Divide in Wikipedia: Quantifying an...,Journal of Communication,2022-02-16,Sandra González-Bailón,Sandra,González-Bailón,2.0,...,hispanic,0.979183,0.010406,0.968777,High,annenberg school for communication university ...,Exact,University of Pennsylvania,https://ror.org/00b30xv10,https://scholar.google.com/scholar?hl=en&as_sd...
2,10.1093/joc/jqac009,https://academic.oup.com/joc/article/72/3/322/...,2022,Mapping Exposure Diversity: The Divergent Effe...,Journal of Communication,2022-03-16,Pascal Jürgens,Pascal,Jürgens,2.0,...,white,0.873102,0.088763,0.784340,High,department of communication jakob welder weg j...,,,,https://scholar.google.com/scholar?hl=en&as_sd...
3,10.1093/joc/jqac009,https://academic.oup.com/joc/article/72/3/322/...,2022,Mapping Exposure Diversity: The Divergent Effe...,Journal of Communication,2022-03-16,Birgit Stark,Birgit,Stark,2.0,...,white,0.927664,0.043421,0.884243,High,department of communication jakob welder weg j...,,,,https://scholar.google.com/scholar?hl=en&as_sd...
4,10.1093/joc/jqac008,https://academic.oup.com/joc/article/72/3/345/...,2022,Democratic Consequences of Incidental Exposure...,Journal of Communication,2022-03-17,Andreas Nanz,Andreas,Nanz,2.0,...,white,0.663489,0.185628,0.477861,Low,department of communication university of vien...,Exact,University of Vienna,https://ror.org/03prydq77,https://scholar.google.com/scholar?hl=en&as_sd...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13598,10.1111/j.1753-9137.2007.00010.x,https://academic.oup.com/ccc/article/1/1/92/40...,2008,Crossing Boundaries: New Media and Networked J...,"Communication, Culture and Critique",2008-02-29,Robin Mansell,Robin,Mansell,2.0,...,white,0.910091,0.054096,0.855996,High,department of media and communications london ...,Exact,London School of Economics and Political Science,https://ror.org/0090zs177,https://scholar.google.com/scholar?hl=en&as_sd...
13599,10.1111/j.1753-9137.2007.00011.x,https://academic.oup.com/ccc/article/1/1/105/4...,2008,Knowledge Workers of the World! Unite?,"Communication, Culture and Critique",2008-02-29,Vincent Mosco,Vincent,Mosco,1.0,...,white,0.882340,0.054859,0.827481,High,department of sociology queen s university kin...,,,,https://scholar.google.com/scholar?hl=en&as_sd...
13600,10.1111/j.1753-9137.2007.00012.x,https://academic.oup.com/ccc/article/1/1/116/4...,2008,The Other Sides of Globalization: Communicatio...,"Communication, Culture and Critique",2008-02-29,Radhika Parameswaran,Radhika,Parameswaran,1.0,...,api,0.850295,0.137539,0.712756,High,school of journalism indiana university bloomi...,Exact,Indiana University,https://ror.org/01kg8sb98,https://scholar.google.com/scholar?hl=en&as_sd...
13601,10.1111/j.1753-9137.2007.00013.x,https://academic.oup.com/ccc/article/1/1/126/4...,2008,The Militarization of U.S. Communications,"Communication, Culture and Critique",2008-02-29,Dan Schiller,Dan,Schiller,1.0,...,white,0.965455,0.022555,0.942900,High,department of speech communication and graduat...,,,,https://scholar.google.com/scholar?hl=en&as_sd...
