
Collaborative Network of NSF Programs and Awardees: An Adaptation of an Identity Disambiguation Method with Heterogeneous Data from Scopus.com

Data Mining Research Project for the Department of Environmental Science, New York University

This project aims to answer the following questions:

1. How do NSF programs vary in the extent to which they:

  1. foster new collaborations
  2. strengthen existing collaborations
  3. foster/strengthen different kinds of collaborations, e.g. cross-disciplinary collaborations; this may not be an appropriate question for all programs, but it will be for many sustainability science programs and other programs that specifically seek to encourage collaboration across traditional disciplinary boundaries as a way of tackling “wicked” problems.

2. Which programs foster successful collaborations, as measured by:

  1. more publications,
  2. more joint authorship,
  3. higher-ranked publications [open question: how to measure?],
  4. more highly cited publications?

3. Which programs foster the production of manuscripts that have broad implications within/beyond their field (as analyzed through network analysis and the number and origin of citations)?

  1. Which papers connect to other clusters of authors/papers most effectively?

4. Across programs, how do factors such as the following affect the success and network impact of collaborations?

  1. award size/amount,
  2. geographic distribution (affiliations/universities) of co-awardees,
  3. number of awardees, etc.,

*Cross-disciplinarity might also be thought of as a factor that potentially influences success.
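The collaboration questions above reduce to operations on a co-authorship graph: nodes are authors, edges are joint papers, and an award year splits the graph into before/after snapshots. Below is a minimal sketch of that idea, assuming a hypothetical list of (author, author, year) co-authorship records built from the Scopus data; the input format and function names are illustrative, not the project's actual pipeline.

```python
# Illustrative sketch: classify collaborations around an award year.
# Assumes coauthorships = [(author_a, author_b, pub_year), ...] --
# a hypothetical format, not a file from this repository.
import networkx as nx

def collaboration_change(coauthorships, award_year):
    """Split co-authorship edges into pre/post-award graphs and compare."""
    before, after = nx.Graph(), nx.Graph()
    for a, b, year in coauthorships:
        g = before if year < award_year else after
        w = g[a][b]["weight"] + 1 if g.has_edge(a, b) else 1
        g.add_edge(a, b, weight=w)
    new = [e for e in after.edges if not before.has_edge(*e)]
    strengthened = [e for e in after.edges if before.has_edge(*e)]
    return new, strengthened

# Toy usage: one pre-existing pair, one new pair after a 2015 award.
pubs = [("A1", "A2", 2013), ("A1", "A2", 2016), ("A2", "A3", 2016)]
new, strengthened = collaboration_change(pubs, award_year=2015)
print(len(new), len(strengthened))  # -> 1 1
```

Cross-cluster reach (question 3) could be measured on the same graph with standard centrality measures such as nx.betweenness_centrality.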

Diagrams:

A simple example: [figure: relationship]

Data pipeline example: [figure: pipeline]

UI and disambiguation demonstration example: [figure: demonstration]

Technical stack proposal: [figure: technical stack]

Development Log:

  • [Oct 2016] Design the research direction and articulate the steps of development.
    • Proposed using the Scopus open API for data retrieval.
    • Obtained the 2015 NSF award data and stored it in MySQL.
  • [Nov 2016] Implement a PHP website that can find the exact awardee with the correct Scopus ID (an author-search sketch in Python appears after this log).
  • [Dec 2016] Add a function to the website that downloads the top 20 publications of an awardee on which the awardee is the first author.
  • [Jan 2017] Due to Scopus API call limits and the slowness of data retrieval, abandon the previous implementation design. Propose bringing the Scopus data local for speed-up and data pre-processing.
  • [Feb 2017] Hack through the Scopus web requests to find a non-API approach that can download detailed, structured data to a local computer.
  • [March 2017] Obtain a list of all potential Scopus IDs for each awardee. Obtain lists of co-authors for each found Scopus ID. Find the exact awardee Scopus ID. (For a detailed description, steps, scripts, and a diagram, see Data_Scrapping_And_PreProcess/README.md.)
    • [March 15, 2017] Some authors do not exist on the Scopus website; the not-found awardee names are collected in HTML_NOT_FOUND_LIST.txt (154) and AWARDEES_NOT_FOUND_IN_SCOPUS.txt (3265).
    • [March 16, 2017] Due to the bash script, some requests to the Scopus server returned zero-byte HTML files; these awardee names have to be re-requested from the server (a detection sketch appears after this log). Lists of such awardee names/IDs are stored in ZERO_BYTE_HTML_LIST.txt (2525) and ZERO_BYTE_ID.txt (2525), where the ID refers to the $ID used in author_$ID.html. A text file that contains the HTML links for each of the zero-byte awardees is html_links_for_zero_byte_awardees.txt (2544).
    • [March 16, 2017] Some awardee names are missing from investigator.csv. Verify_All_Awardees_By_Name.ipynb is used to find any missing awardee names; all of them are stored in MISSING_AWARDEES.txt (4202). The text file that contains the actual HTML links is missing_awardee_htmls.txt (4201), generated by running a Java program that does the string concatenation. Because of Java's limitation on OutputStream, and to ease the programming, a MISSING_AWARDEES_LAST.txt file is created for only the last 22 rows of awardee URLs. missing_awardee_htmls.txt and MISSING_AWARDEES.txt both contain 4201 rows. All these awardee names have to be re-run with the script get_authors_html_page.sh.
    • [March 16, 2017] Figure out the cause of the zero-byte HTML issue and fix the URL-encoding problem with the regular expression (?<=[a-z])((\s)(?=[a-z])) (usage sketched after this log).
    • [March 17, 2017] There are 48152 unique awardee names in investigator.csv; 16 of them have names that include "Jr.", which results in an empty search result from Scopus.
    • [March 17, 2017] 48134 CSV files are in the directory all-investigator-csv-final48134 (a backup tarball is created as well: all-investigator-csv-final48134.tar.gz). Each contains zero, one, or more possible awardee names with the corresponding Scopus ID and affiliation name(s). The next step is to get all the Scopus IDs from the 48134 CSV files, then retrieve a co-author list (containing at most 150 co-authors) for each Scopus ID.
    • [March 31, 2017] /Data_Scrapping_And_PreProcess/all_scopusID/GetAllScopusID.ipynb reads all the author CSV files from the directory /all-investigator-csv-final48134/ and extracts 198815 unique author Scopus IDs (there are 203696 Scopus IDs across the 48134 files; some are repeated). The 198815 unique Scopus IDs are stored in unique_scopus_ids.txt (a deduplication sketch appears after this log).
  • [April 2017] In April, the major tasks are to leverage the Hadoop ecosystem to query all the requested data relationships and to format the results into a readable network-relationship graph: (1) obtain the exact Scopus ID for each awardee; (2) obtain all publication data; (3) migrate to HDFS; (4) design queries; (5) draw the network-relationship graph, possibly in D3.js.
    • [April 4, 2017] Use the 198815 Scopus IDs to get all of the co-author list data; up to 150 co-authors per author can be found on Scopus. Each co-author list is downloaded as scopus_id.html, and the {coauthorName, coauthor_scopusID} pairs are then extracted into scopus_id.csv (a parsing sketch appears after this log). Data is stored under all-coauthors-html/CSVs/.
    • [April 7, 2017] Have the final datasets:
      • Data_Scrapping_And_PreProcess/original-award-data/award.csv
      • Data_Scrapping_And_PreProcess/original-award-data/CleanedInvestigator.csv
      • Data_Scrapping_And_PreProcess/all-investigator-csv-final48134-merged.csv
      • Data_Scrapping_And_PreProcess/coauthor_id_fn_ln.csv
    • [April 9, 2017] Feed the data into the Dumbo HPC cluster. Create Apache Pig scripts to do the MapReduce tasks.
    • [TODO] Optimize the Hadoop architecture; consider using HBase, ZooKeeper, and other Hadoop-ecosystem frameworks.
    • [TODO] Productionize this project by April 25, 2017 and give a presentation.
    • [ISSUE] Nicknames for the awardees should also be parsed. For example, James Kurose is not the most commonly used name for this author; instead, he uses Jim Kurose on the Scopus website, while James Kurose is shown as a previously used name in a smaller font.
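Code Sketches:

The sketches below illustrate selected log entries in Python; they are hedged examples, not the repository's actual PHP/bash/Java code.

For the [Nov 2016] awardee lookup: a minimal author search against Elsevier's Scopus Author Search API. The endpoint and X-ELS-APIKey header are Elsevier's documented API surface; the query construction and key handling here are assumptions for illustration.

```python
# Sketch of a Scopus author search; needs a registered Elsevier API key.
import requests

API_KEY = "YOUR_ELSEVIER_API_KEY"  # placeholder, not from the repo

def search_author(last_name, first_name):
    """Return candidate Scopus author entries for an awardee name."""
    resp = requests.get(
        "https://api.elsevier.com/content/search/author",
        params={"query": f"authlast({last_name}) and authfirst({first_name})"},
        headers={"X-ELS-APIKey": API_KEY, "Accept": "application/json"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["search-results"].get("entry", [])

for entry in search_author("Kurose", "Jim"):
    print(entry.get("dc:identifier"), entry.get("preferred-name"))
```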
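For the [March 16, 2017] zero-byte pages: detecting empty author_$ID.html downloads and collecting the IDs to re-request. The file-name convention follows the log; the directory name author-html/ is an assumption.

```python
# Find zero-byte author_$ID.html files and record the IDs to re-request.
import os

ids_to_retry = []
for fname in os.listdir("author-html"):  # assumed download directory
    if fname.startswith("author_") and fname.endswith(".html"):
        if os.path.getsize(os.path.join("author-html", fname)) == 0:
            ids_to_retry.append(fname[len("author_"):-len(".html")])

with open("ZERO_BYTE_ID.txt", "w") as out:
    out.write("\n".join(ids_to_retry))
```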
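For the [March 16, 2017] URL-encoding fix: the regular expression (?<=[a-z])((\s)(?=[a-z])) matches a whitespace character that is both preceded and followed by a lowercase letter, i.e. the interior spaces of a name. A sketch of applying it in Python; replacing each match with %20 is an assumption consistent with repairing the request URLs.

```python
import re

# Whitespace sandwiched between two lowercase letters (the spaces that
# broke the Scopus request URLs).
SPACE_BETWEEN_LOWERCASE = re.compile(r"(?<=[a-z])((\s)(?=[a-z]))")

def encode_name_for_url(name):
    # Percent-encode interior spaces so the request URL is well formed.
    return SPACE_BETWEEN_LOWERCASE.sub("%20", name.lower())

print(encode_name_for_url("Van Der Berg"))  # -> van%20der%20berg
```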
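For the [March 31, 2017] deduplication: collecting every Scopus ID across the 48134 CSV files into a set and writing unique_scopus_ids.txt. The column name scopus_id is an assumption; the actual CSV schema is not shown in this README.

```python
import csv
import glob

unique_ids = set()
for path in glob.glob("all-investigator-csv-final48134/*.csv"):
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            scopus_id = (row.get("scopus_id") or "").strip()  # assumed column
            if scopus_id:
                unique_ids.add(scopus_id)

with open("unique_scopus_ids.txt", "w") as out:
    out.write("\n".join(sorted(unique_ids)))

print(len(unique_ids))  # the log reports 198815 unique of 203696 total
```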
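For the [April 4, 2017] co-author extraction: pulling {coauthorName, coauthor_scopusID} pairs out of a saved scopus_id.html. The README does not show the page markup, so the authorId= link pattern below is a placeholder; only the input/output shape matches the log.

```python
# Sketch: parse one saved co-author page into scopus_id.csv.
# The selector and URL pattern are placeholder assumptions; adjust
# them to the real markup of the saved Scopus pages.
import csv
from bs4 import BeautifulSoup

def extract_coauthors(html_path, csv_path):
    with open(html_path, encoding="utf-8") as f:
        soup = BeautifulSoup(f, "html.parser")
    rows = []
    for link in soup.select("a[href*='authorId=']"):  # assumed markup
        name = link.get_text(strip=True)
        author_id = link["href"].split("authorId=")[1].split("&")[0]
        rows.append((name, author_id))
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["coauthorName", "coauthor_scopusID"])
        writer.writerows(rows[:150])  # Scopus lists at most 150 co-authors

extract_coauthors("scopus_id.html", "scopus_id.csv")
```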

About

Find collaboration strength between NSF awards, awardees, and more.
