# Data Wrangling (Optional)

For the evaluation (`01-Evaluation.ipynb`), you need to prepare a CSV file that has a header row and includes, at minimum, a `url` column.

This notebook creates the CSV files used in the evaluation. The formatting guidelines for the data are:

* Input files should be CSV files
* Input files should have headers
* For consistency, column names are lowercase with underscores, e.g., `short_name` instead of `shortName`
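
The guidelines above can be checked mechanically before running the evaluation. The following is a minimal sketch using only the standard library; the helper name `check_input_csv` and the sample data are illustrative and not part of this notebook.

```python
import csv
import io
import re

# Lowercase letters/digits separated by single underscores, e.g. `short_name`
SNAKE_CASE = re.compile(r'^[a-z0-9]+(_[a-z0-9]+)*$')


def check_input_csv(text):
    """Return a list of guideline violations found in the CSV header (hypothetical helper)."""
    header = next(csv.reader(io.StringIO(text)), [])
    problems = []
    if 'url' not in header:
        problems.append('missing required `url` column')
    for name in header:
        if not SNAKE_CASE.match(name):
            problems.append(f'column `{name}` is not lowercase with underscores')
    return problems


# Illustrative inputs: the first follows the guidelines, the second does not
good = 'short_name,url\nExample DB,https://example.org\n'
bad = 'shortName,homepage\nExample DB,https://example.org\n'

print(check_input_csv(good))
print(check_input_csv(bad))
```

The first call reports no problems; the second reports the missing `url` column and the camelCase `shortName` column.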

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import re

In [2]:
"""
The name of the base folder you want to work on under `../data/`
"""
TIME_STAMP_FOLDER_NAME = '08-01-2024'

## Database Commons (JSON)

The Database Commons (https://ngdc.cncb.ac.cn/databasecommons/) contains 6K+ data resources with URLs and other metadata.

You can download the data using the following URL:

> https://ngdc.cncb.ac.cn/databasecommons/database/browse?term=&q=&draw=1&columns%5B0%5D.data=0&columns%5B0%5D.name=&columns%5B0%5D.searchable=false&columns%5B0%5D.orderable=false&columns%5B0%5D.search.value=&columns%5B0%5D.search.regex=false&columns%5B1%5D.data=zindex&columns%5B1%5D.name=&columns%5B1%5D.searchable=true&columns%5B1%5D.orderable=true&columns%5B1%5D.search.value=&columns%5B1%5D.search.regex=false&columns%5B2%5D.data=citation&columns%5B2%5D.name=&columns%5B2%5D.searchable=true&columns%5B2%5D.orderable=true&columns%5B2%5D.search.value=&columns%5B2%5D.search.regex=false&columns%5B3%5D.data=shortName&columns%5B3%5D.name=&columns%5B3%5D.searchable=true&columns%5B3%5D.orderable=true&columns%5B3%5D.search.value=&columns%5B3%5D.search.regex=false&columns%5B4%5D.data=foundedYear&columns%5B4%5D.name=&columns%5B4%5D.searchable=true&columns%5B4%5D.orderable=true&columns%5B4%5D.search.value=&columns%5B4%5D.search.regex=false&order%5B0%5D.column=1&order%5B0%5D.dir=desc&order%5B1%5D.column=4&order%5B1%5D.dir=DESC&start=0&length=10000&search.value=&search.regex=false&_=1667231167872
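
The URL above is long because it carries many query parameters. Decoding it makes the relevant ones easier to see. The sketch below parses an abbreviated copy of the URL (most `columns[...]` parameters are omitted for brevity) with the standard library. Reading `start=0` and `length=10000` as "return up to 10,000 records in one response" is an inference from the parameter names, not something stated by the source.

```python
from urllib.parse import urlsplit, parse_qs

# Abbreviated copy of the download URL above; the omitted parameters follow the same pattern
url = (
    'https://ngdc.cncb.ac.cn/databasecommons/database/browse'
    '?term=&q=&draw=1'
    '&columns%5B3%5D.data=shortName'
    '&order%5B0%5D.column=1&order%5B0%5D.dir=desc'
    '&start=0&length=10000'
)

# keep_blank_values=True retains empty parameters such as `term=` and `q=`
params = parse_qs(urlsplit(url).query, keep_blank_values=True)

for name, values in params.items():
    print(f'{name} = {values[0]!r}')
```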

In [None]:
"""
Load the original data
"""
df = pd.read_json(f'../data/{TIME_STAMP_FOLDER_NAME}/input/data-portal/unformatted/database-commons.json')

In [None]:
"""
We use underscore lowercase column names
"""
df.columns = (df.columns.str.replace('(?<=[a-z])(?=[A-Z])', '_', regex=True).str.lower())
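
To see what the regular expression in the cell above does, the same pattern can be applied with Python's `re` module. It inserts an underscore at every boundary between a lowercase letter and an uppercase letter, and the result is then lowercased. The helper name `to_snake_case` is illustrative; the sample names are ones that appear elsewhere in this notebook.

```python
import re


def to_snake_case(name):
    # Zero-width match: preceded by a lowercase letter, followed by an uppercase letter
    return re.sub(r'(?<=[a-z])(?=[A-Z])', '_', name).lower()


for name in ['shortName', 'foundedYear', 'datatypeName', 'zindex']:
    print(name, '->', to_snake_case(name))
```

A run of capitals is not split, so a hypothetical name such as `dbURL` would become `db_url`, while `URLList` would become `urllist`.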

In [None]:
"""
Optionally, drop columns that we don't need
"""
df.drop(columns=['biodb_ranks', 'rating_list'], inplace=True)

In [None]:
# Commenting out the deprecated data field
# """
# The id from the sources are consistently "source_id"
# The values should be a string type, and it has the prefix that represents the source (e.g. dc_ for Database Commons)
# """
# df.rename(columns={ "db_id": "source_id" }, inplace=True)
# df.source_id = df.source_id.apply(lambda x: 'dc_' + str(x))

In [None]:
"""
Some columns from Database Commons hold lists of JSON objects; convert them to comma-separated strings
Example: [{ "id": 1, "name": "foo" }, { "id": 2, "name": "bar" }] --> 'foo, bar'
"""
json_column_names_and_keys = {
    'data_type_list': 'datatypeName', 
    'category_list': 'name',
    'keywords_list': 'name',
    'data_object_list': 'name',
    'organism_list': 'organismName',
    'theme_list': 'name'
}

for column, key in json_column_names_and_keys.items():
    # `item` instead of `object` to avoid shadowing the Python built-in
    df[column] = df[column].apply(lambda items: ', '.join(item[key] for item in items))
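
The conversion applied to each cell can be tried on its own with the example from the docstring above. This is a small sketch in plain Python; `join_values` is an illustrative name, and it assumes every object in the list has the requested key with a string value.

```python
def join_values(items, key):
    """Join the `key` field of each dict in `items` into one comma-separated string."""
    return ', '.join(item[key] for item in items)


example = [{'id': 1, 'name': 'foo'}, {'id': 2, 'name': 'bar'}]

print(join_values(example, 'name'))   # foo, bar
print(repr(join_values([], 'name')))  # '' (an empty list yields an empty string)
```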

In [None]:
"""
Add unique IDs of websites
"""
df['website_id'] = df.short_name
df.website_id = df.website_id.apply(lambda x: x.replace(' ', '-').replace('/', '-').replace(',', '-'))

# Important: website names can be duplicated, so append a numeric suffix to make each website_id unique
mask = df.website_id.duplicated(keep=False)
df.loc[mask, 'website_id'] += df.groupby('website_id').cumcount().add(1).astype(str)
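
The ID logic in the cell above can be mirrored in plain Python to make the outcome easier to follow: every name that occurs more than once gets a running number appended, and names that occur once are left alone. The function names and sample names below are hypothetical.

```python
from collections import Counter


def slugify(name):
    # Same character replacements as in the cell above
    return name.replace(' ', '-').replace('/', '-').replace(',', '-')


def make_unique_ids(names):
    slugs = [slugify(n) for n in names]
    totals = Counter(slugs)  # how often each slug occurs overall
    seen = Counter()         # how often each slug has occurred so far
    ids = []
    for slug in slugs:
        if totals[slug] > 1:
            seen[slug] += 1
            ids.append(f'{slug}{seen[slug]}')
        else:
            ids.append(slug)
    return ids


print(make_unique_ids(['Gene DB', 'Gene/DB', 'PDB']))
```

Here `Gene DB` and `Gene/DB` both map to `Gene-DB`, so they become `Gene-DB1` and `Gene-DB2`, while `PDB` is unchanged. Note that a suffixed ID could still collide with a name that already ends in a digit; neither this sketch nor the cell above guards against that.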

In [None]:
"""
Add additional information for the analysis
"""
df['page_type'] = 'home'
df['page_id'] = 0

In [None]:
df.head(3)

In [None]:
"""
Export the data
"""
df.to_csv(
    f'../data/{TIME_STAMP_FOLDER_NAME}/input/data-portal/database-commons.csv',
    index=False,
    # to be safe, uncomment the line below to disallow overwriting an existing file
    # mode='x'
)
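
The commented-out `mode='x'` option in the export cell refers to exclusive-creation mode, in which opening an existing file fails instead of overwriting it. The sketch below shows that behavior with the standard library only; `write_csv_once` and the sample rows are illustrative, and the file is written to a temporary directory rather than to `../data/`.

```python
import csv
import os
import tempfile


def write_csv_once(path, header, rows):
    # 'x' = exclusive creation: raises FileExistsError if the file already exists
    with open(path, 'x', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'database-commons.csv')
    write_csv_once(path, ['short_name', 'url'], [['Example DB', 'https://example.org']])
    try:
        # Second write to the same path must not succeed
        write_csv_once(path, ['short_name', 'url'], [])
        overwritten = True
    except FileExistsError:
        overwritten = False

print('overwritten:', overwritten)
```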

## SCImago Journal Rank (SJR) (CSV)

The journal data is collected from SJR (https://www.scimagojr.com/journalrank.php). One important piece of information is missing: the data does not contain the URLs of the journals. We retrieve them from the SJR website using the ID of each journal.

In [3]:
"""
Load the original data
"""
df = pd.read_csv(f'../data/{TIME_STAMP_FOLDER_NAME}/input/journal/unformatted/sjr2022.csv', sep=';')

In [4]:
"""
We use underscore lowercase column names
"""
df.columns = (df.columns.str.replace('(?<=[a-z])(?=[A-Z])', '_', regex=True).str.lower())
df.columns = (df.columns.str.replace('.', '', regex=False)) # remove dots
df.columns = (df.columns.str.replace('(', '', regex=False)) # remove parentheses
df.columns = (df.columns.str.replace(')', '', regex=False))
df.columns = (df.columns.str.replace('/', 'per', regex=False)) # replace slash with "per"
df.columns = (df.columns.str.replace(' ', '_', regex=False)) # replace space with underscore
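
The chain of replacements in the cell above can be followed on single strings. The sketch below applies the same steps in the same order using plain Python. The raw header strings are assumptions reconstructed from the cleaned column names shown in the output further down (for example `total_docs_2022`); they are not taken from the CSV file itself.

```python
import re


def clean_column_name(name):
    # 1. camelCase boundary -> underscore, then lowercase
    name = re.sub(r'(?<=[a-z])(?=[A-Z])', '_', name).lower()
    # 2. Literal replacements, in the same order as the cell above
    for old, new in [('.', ''), ('(', ''), (')', ''), ('/', 'per'), (' ', '_')]:
        name = name.replace(old, new)
    return name


# Assumed raw headers, for illustration only
for raw in ['Total Docs. (2022)', 'Cites / Doc. (2years)', 'SJR Best Quartile']:
    print(raw, '->', clean_column_name(raw))
```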

In [5]:
df.head(3)

| | rank | sourceid | title | type | issn | sjr | sjr_best_quartile | h_index | total_docs_2022 | total_docs_3years | ... | total_cites_3years | citable_docs_3years | cites_per_doc_2years | ref_per_doc | country | region | publisher | coverage | categories | areas |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 28773 | Ca-A Cancer Journal for Clinicians | journal | 15424863, 00079235 | 86091 | Q1 | 198 | 44 | 118 | ... | 30318 | 85 | 29999 | 9700 | United States | Northern America | Wiley-Blackwell | 1950-2022 | Hematology (Q1); Oncology (Q1) | Medicine |
| 1 | 2 | 29431 | Quarterly Journal of Economics | journal | 00335533, 15314650 | 36730 | Q1 | 292 | 36 | 122 | ... | 2141 | 122 | 1483 | 6661 | United Kingdom | Western Europe | Oxford University Press | 1886-2022 | Economics and Econometrics (Q1) | Economics, Econometrics and Finance |
| 2 | 3 | 20315 | Nature Reviews Molecular Cell Biology | journal | 14710072, 14710080 | 34201 | Q1 | 485 | 121 | 328 | ... | 13331 | 156 | 3547 | 8929 | United Kingdom | Western Europe | Nature Publishing Group | 2000-2022 | Cell Biology (Q1); Molecular Biology (Q1) | Biochemistry, Genetics and Molecular Biology |


In [6]:
# Commenting out the deprecated data field
# """
# The id from the sources are consistently "source_id"
# The values should be a string type, and it has the prefix that represents the source (e.g. dc_ for Database Commons)
# """
# df.rename(columns={ "sourceid": "source_id" }, inplace=True)
# df.source_id = df.source_id.apply(lambda x: 'sjr_' + str(x))

In [7]:
"""
Filter out journals that are not related to life science.
"""
# If you want to see the full list of areas, uncomment the below code
# unique_areas = set()
# for item in set(df.areas):
#     for area in item.split(';'):
#         unique_areas.add(area.strip())
# print(unique_areas)

# Interested in the following areas
areas_interested = [
    'Biochemistry, Genetics and Molecular Biology',
    'Health Professions',
    'Immunology and Microbiology',
    'Medicine',
    'Multidisciplinary',
    'Neuroscience',
    'Pharmacology, Toxicology and Pharmaceutics',
    'Psychology',
    'Agricultural and Biological Sciences'
]

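# Keep only journals in the areas of interest
# (.copy() avoids pandas' SettingWithCopyWarning on later column assignments)
df = df[df.areas.str.contains('|'.join(areas_interested))].copy()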
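
The filter in the cell above joins the area names with `|` into one regular expression and keeps every row whose `areas` string contains at least one of them. The sketch below reproduces that matching with the `re` module on a few strings. The use of `re.escape` is an addition for safety and is not in the notebook, and the last sample string is hypothetical.

```python
import re

areas_of_interest = [
    'Medicine',
    'Neuroscience',
    'Biochemistry, Genetics and Molecular Biology',
]

# One alternation pattern; re.escape guards against regex metacharacters in names
pattern = re.compile('|'.join(re.escape(area) for area in areas_of_interest))


def is_of_interest(areas):
    # search() matches anywhere in the string, like Series.str.contains
    return pattern.search(areas) is not None


samples = [
    'Medicine',
    'Economics, Econometrics and Finance',
    'Biochemistry, Genetics and Molecular Biology; Medicine',
    'Veterinary Medicine',  # hypothetical area name, shows substring matching
]
for areas in samples:
    print(repr(areas), '->', is_of_interest(areas))
```

Because the match is a substring match, any area name that merely contains one of the listed names is also kept, as the last sample shows.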

In [8]:
"""
Exclude categories that are not relevant
"""
categories_to_keep = [
 'Advanced and Specialized Nursing',
 'Aging',
 'Agricultural and Biological Sciences',
 'Agronomy and Crop Science',
 'Anatomy',
 'Anesthesiology and Pain Medicine',
 'Animal Science and Zoology',
 'Anthropology',
 'Applied Microbiology and Biotechnology',
 'Applied Psychology',
 'Assessment and Diagnosis',
 'Atmospheric Science',
 'Atomic and Molecular Physics, and Optics',
 'Behavioral Neuroscience',
 'Biochemistry',
 'Biochemistry, Genetics and Molecular Biology',
 'Bioengineering',
 'Biological Psychiatry',
 'Biomaterials',
 'Biomedical Engineering',
 'Biophysics',
 'Biotechnology',
 'Cancer Research',
 'Cardiology and Cardiovascular Medicine',
 'Catalysis',
 'Cell Biology',
 'Cellular and Molecular Neuroscience',
 'Chemical Health and Safety',
 'Chiropractics',
 'Clinical Biochemistry',
 'Clinical Psychology',
 'Cognitive Neuroscience',
 'Complementary and Alternative Medicine',
 'Complementary and Manual Therapy',
 'Critical Care Nursing',
 'Critical Care and Intensive Care Medicine',
 'Demography',
 'Dental Assisting',
 'Dental Hygiene',
 'Dentistry',
 'Dermatology',
 'Development',
 'Developmental Biology',
 'Developmental Neuroscience',
 'Developmental and Educational Psychology',
 'Drug Discovery',
 'Drug Guides',
 'Emergency Medical Services',
 'Emergency Medicine',
 'Emergency Nursing',
 'Endocrine and Autonomic Systems',
 'Endocrinology',
 'Endocrinology, Diabetes and Metabolism',
 'Epidemiology',
 'Experimental and Cognitive Psychology',
 'Food Animals',
 'Food Science',
 'Gastroenterology',
 'Gender Studies',
 'Genetics',
 'Health',
 'Health Informatics',
 'Health Information Management',
 'Health Policy',
 'Health Professions',
 'Health, Toxicology and Mutagenesis',
 'Hematology',
 'Hepatology',
 'Histology',
 'Horticulture',
 'Human Factors and Ergonomics',
 'Immunology',
 'Immunology and Allergy',
 'Immunology and Microbiology',
 'Infectious Diseases',
 'Insect Science',
 'Internal Medicine',
 'Life-span and Life-course Studies',
 'Linguistics and Language',
 'Maternity and Midwifery',
 'Medical Assisting and Transcription',
 'Medical Laboratory Technology',
 'Medical Terminology',
 'Medical and Surgical Nursing',
 'Medicine',
 'Microbiology',
 'Molecular Biology',
 'Molecular Medicine',
 'Multidisciplinary',
 'Nanoscience and Nanotechnology',
 'Nephrology',
 'Neurology',
 'Neuropsychology and Physiological Psychology',
 'Neuroscience',
 'Nurse Assisting',
 'Nursing',
 'Nutrition and Dietetics',
 'Obstetrics and Gynecology',
 'Occupational Therapy',
 'Oncology',
 'Ophthalmology',
 'Optometry',
 'Oral Surgery',
 'Organic Chemistry',
 'Orthodontics',
 'Orthopedics and Sports Medicine',
 'Otorhinolaryngology',
 'Paleontology',
 'Parasitology',
 'Pathology and Forensic Medicine',
 'Pediatrics',
 'Pediatrics, Perinatology and Child Health',
 'Periodontics',
 'Pharmaceutical Science',
 'Pharmacology',
 'Pharmacology, Toxicology and Pharmaceutics',
 'Pharmacy',
 'Physical Therapy, Sports Therapy and Rehabilitation',
 'Physiology',
 'Plant Science',
 'Podiatry',
 'Process Chemistry and Technology',
 'Psychiatry and Mental Health',
 'Psychology',
 'Public Health, Environmental and Occupational Health',
 'Pulmonary and Respiratory Medicine',
 'Radiation',
 'Radiological and Ultrasound Technology',
 'Radiology, Nuclear Medicine and Imaging',
 'Rehabilitation',
 'Reproductive Medicine',
 'Respiratory Care',
 'Rheumatology',
 'Sensory Systems',
 'Social Psychology',
 'Speech and Hearing',
 'Structural Biology',
 'Surgery',
 'Tourism, Leisure and Hospitality Management',
 'Toxicology',
 'Transplantation',
 'Urology',
 'Veterinary',
 'Virology'
]


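def remove_unkeep_areas(categories):
    """Keep only the categories listed in `categories_to_keep`."""
    # Categories look like "Hematology (Q1); Oncology (Q1)"
    category_list = categories.split("; ")
    # Drop the quartile suffix, e.g. "Hematology (Q1)" -> "Hematology"
    category_list = [c.split(' (')[0] for c in category_list]
    filtered_list = [c for c in category_list if c in categories_to_keep]
    return ";".join(filtered_list)

df['categories'] = df['categories'].apply(remove_unkeep_areas)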
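
The effect of the category filter on a single value can be checked in isolation. The sketch below uses the `categories` strings from the sample output above and a deliberately small keep-set; `filter_categories` is an illustrative name.

```python
# A small subset of the keep-list, for illustration
keep = {'Hematology', 'Oncology', 'Cell Biology'}


def filter_categories(categories, keep):
    # "Hematology (Q1); Oncology (Q1)" -> ["Hematology", "Oncology"]
    names = [c.split(' (')[0] for c in categories.split('; ')]
    return ';'.join(name for name in names if name in keep)


print(filter_categories('Hematology (Q1); Oncology (Q1)', keep))
print(repr(filter_categories('Economics and Econometrics (Q1)', keep)))
```

The first call returns `Hematology;Oncology`. The second returns an empty string: a journal whose categories are all excluded stays in the data with an empty `categories` value rather than being dropped.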

In [9]:
"""
Remove journals from a manually selected list
"""
journals_to_unkeep = pd.read_csv(f'../data/{TIME_STAMP_FOLDER_NAME}/input/journal/unformatted/sjr-journals-to-filter.csv').title.unique().tolist()
# Drop the listed journals directly instead of overwriting their titles with a sentinel value
df = df[~df.title.isin(journals_to_unkeep)].copy()

In [11]:
"""
Add unique IDs of websites
"""
df['website_id'] = df.title
df.website_id = df.website_id.apply(lambda x: x.replace(' ', '-').replace('/', '-').replace(',', '-'))

# Important: website names can be duplicated, so append a numeric suffix to make each website_id unique
mask = df.website_id.duplicated(keep=False)
df.loc[mask, 'website_id'] += df.groupby('website_id').cumcount().add(1).astype(str)

In [12]:
"""
Remove discontinued journals
"""
# .copy() avoids pandas' SettingWithCopyWarning when columns are added in later cells
df = df[~df.website_id.str.contains('discontinued')].copy()

In [13]:
"""
Add additional information for the analysis
"""
df['page_type'] = 'home'
df['page_id'] = 0



In [14]:
"""
Using `sourceid` of SJR, get URLs of individual journal portals
TODO: Reuse the previously identified home pages
"""
def infer_homepage(sourceid):
    info_url = f'https://www.scimagojr.com/journalsearch.php?q={sourceid}&tip=sid&clean=0'
    html_text = requests.get(info_url).text
    soup = BeautifulSoup(html_text, 'html.parser')
    # `string=` replaces the deprecated `text=` argument of find_all
    urls = soup.find_all('a', string=re.compile('Homepage'))
    if len(urls) > 0:
        return urls[0].get('href')
    print(f'No homepage found for {sourceid}')
    return None

df['url'] = df['sourceid'].apply(infer_homepage)


No homepage found for 19700175113
No homepage found for 5000158305
No homepage found for 71628
No homepage found for 25392
No homepage found for 21100784787
No homepage found for 21100983356
No homepage found for 21101047803
No homepage found for 21100894516
No homepage found for 19900191708
No homepage found for 21100851290
No homepage found for 21100851285
No homepage found for 21100896491
No homepage found for 21100244807
No homepage found for 21100784717
No homepage found for 21100243806
No homepage found for 21101058912
No homepage found for 21101044876
No homepage found for 21100314711
No homepage found for 21100313904
No homepage found for 21101041809
No homepage found for 27974
No homepage found for 21101042490
No homepage found for 52142
No homepage found for 21100784450
No homepage found for 21100201082
No homepage found for 4300151409
No homepage found for 21100284963
No homepage found for 27539
No homepage found for 21100894629
No homepage found for 21100817408
No homepage 


In [15]:
"""
Filter out journals with no URL
"""
df = df[df.url.notnull()]
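
The TODO in the homepage lookup ("Reuse the previously identified home pages") could be approached by consulting a mapping of already-known URLs before making a request. The following is only a sketch of that idea, not the notebook's implementation: `lookup_homepages`, `fake_fetch`, and the `example.org` URLs are hypothetical, and in practice the known mapping might be built from a previously exported CSV, with `infer_homepage` passed in as the fetch function.

```python
def lookup_homepages(source_ids, known_urls, fetch):
    """Return {source_id: url}, calling `fetch` only for IDs missing from `known_urls`."""
    result = {}
    for source_id in source_ids:
        if source_id in known_urls:
            result[source_id] = known_urls[source_id]
        else:
            result[source_id] = fetch(source_id)
    return result


# Stand-in for a network lookup; records which IDs were actually requested
requested = []


def fake_fetch(source_id):
    requested.append(source_id)
    return f'https://example.org/{source_id}'


# Placeholder mapping of previously identified home pages
known = {28773: 'https://example.org/cached'}
urls = lookup_homepages([28773, 29431], known, fake_fetch)

print(urls)
print('fetched:', requested)
```

Only the ID that is missing from the known mapping triggers a fetch, which would reduce the number of requests sent to the SJR website on repeated runs.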

In [16]:
"""
Export the data
"""
df.to_csv(
    f'../data/{TIME_STAMP_FOLDER_NAME}/input/journal/sjr2022.csv',
    index=False,
    # to be safe, uncomment the line below to disallow overwriting an existing file
    # mode='x'
)