# Welcome !
This Colab notebook is a walkthrough on how to use my function **expertiseFinder_NameOrInst()** in the .py file **ExpertiseFinder_MSI.py**. The purpose of this function is to identify researchers at a specific institution who have a presence in the NASA ADS database. The "Expertise Finder" detects which researchers are from the target institution by querying NASA ADS using the institution name, then pulling out the author names of papers whose first author is affiliated with the target institution. The code then focuses on their astronomical expertise by selecting publications in astronomical journals (e.g. ApJ, MNRAS, SPIE).

The primary goal of this tutorial is to walk the user through the process of using said function on provided data files and prepare them for using the function on their own.

The items you will need to run this code are listed below:
1. The file **ExpertiseFinder_MSI.py**.
2. The supporting file **TextAnalysis.py**.
3. A file of ignorable "stop words" and its directory path in Google Drive: **stopwords.txt**.
4. An institution to research of your choosing. For the purposes of this tutorial:  **University of the Virgin Islands**.
5. Your own **NASA ADS API token**. This is a long string of characters generated by ADS that gives you access to their API. You can find instructions on how to get an ADS token [here](https://ads.readthedocs.io/en/v1/api-key.html).

Please make sure you have these items available before starting this tutorial. If you do not have access to these files, please contact maire.volz@nasa.gov or antonino.cucchiara@nasa.gov. Thank you !


### The "Expertise Finder"
The function has the following arguments:
1. **token** = an ADS API token from https://ui.adsabs.harvard.edu/help/api/ (*string*)
2. **directorypath** = the file location of 'stopwords.txt' on your device (*string*)
3. **inst** = the institution name of interest (*string*)
4. **name** ('LastName, FirstName') = an optional argument that, when not 'None', searches ADS for a specific author's name at a specific institution (*string*)
5. **refereed** (True or False) = an optional argument that toggles whether NASA ADS search results need to be peer-refereed (*boolean*)
6. **year** = an optional argument that determines a cutoff point for consideration; papers published before this year will not be considered (*int*)
7. **strictness** = an optional keyword argument that determines how strict the function will be when filtering ADS results (can be one of three *strings*: 'default', 'low', or 'high')
8. **fileName** ('___.csv') = an optional keyword argument that defines the name under which the newData output is saved to the user's device (*string* with default "expertiseFinder_output.csv")

Running this function will return the following:

1. **newData** = a spreadsheet containing information on author name, institution, paper titles, publication years, keywords, abstracts, top 10 words, bi-grams, and tri-grams, and a "CLEAN" or "DIRTY" classification.
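
To give a feel for what the "top 10 words, bi-grams, and tri-grams" columns contain, here is a minimal sketch of n-gram counting using only the Python standard library. It is not the code inside **TextAnalysis.py**: the function name `top_ngrams`, the simple regular-expression tokenizer, and the tiny stop word set are all assumptions made for illustration, and the sketch skips any lemmatization the real code may perform.

```python
import re
from collections import Counter


def top_ngrams(text, n=1, k=10, stop_words=frozenset()):
    """Return the k most common n-grams in `text` as (ngram, count) pairs.

    Hypothetical helper for illustration only; not part of TextAnalysis.py.
    """
    # Lowercase the text, keep alphabetic tokens, and drop stop words.
    tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in stop_words]
    if n == 1:
        grams = tokens
    else:
        # Slide a window of length n over the token list.
        grams = list(zip(*(tokens[i:] for i in range(n))))
    return Counter(grams).most_common(k)


# Made-up example text and stop words (assumptions, not tutorial data).
abstract = "Ultra long gamma ray bursts are long bursts. Ultra long bursts are rare."
stop = {"are", "a", "the"}

words = top_ngrams(abstract, n=1, k=3, stop_words=stop)
bigrams = top_ngrams(abstract, n=2, k=2, stop_words=stop)
```

Each result is a list of `(item, count)` pairs, which is the same general shape as the entries shown in the "Top 10 Words" and "Top 10 Bigrams" columns later in this tutorial.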

In this tutorial, we will only be defining the arguments **token**, **directorypath**, **inst**, **year**, **strictness**, and **fileName**.

### STEP 1: Import necessary files and packages to this notebook.
In order to run the "Expertise Finder", we must define the location of and upload the necessary files.

In [1]:
# Install the ADS library
!pip install ads



In [2]:
# Before I forget, though, you should import the outside packages that this code needs to run.
import ads
import nltk
nltk.download('punkt')
nltk.download('wordnet')
import pandas as pd

[nltk_data] Downloading package punkt to /Users/mairevolz/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/mairevolz/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [3]:
# Define the directory path of the stopwords file. If it is in the same directory as this notebook, topdirectory will remain blank.
# Modify the string below if 'stopwords.txt' is in another directory:
topdirectory = ''
stopwordspath = topdirectory + 'stopwords.txt'
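
The string concatenation above only works if `topdirectory` ends with a path separator. As an optional alternative, the sketch below joins the two pieces with `pathlib` and fails early if the file is missing. The helper name `resolve_stopwords_path` is hypothetical and is not part of **ExpertiseFinder_MSI.py**.

```python
from pathlib import Path


def resolve_stopwords_path(topdirectory="", filename="stopwords.txt"):
    """Join directory and file name, raising an error if the file is missing.

    Hypothetical helper for illustration only.
    """
    # Path("") / "stopwords.txt" is simply "stopwords.txt", so a blank
    # topdirectory still points at the notebook's own directory.
    path = Path(topdirectory) / filename
    if not path.is_file():
        raise FileNotFoundError(f"Could not find the stop words file at: {path}")
    return str(path)
```

Calling `resolve_stopwords_path(topdirectory)` would return a string that can be passed wherever `stopwordspath` is used.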

In [4]:
# Now, we will import the functions from TextAnalysis.py and ExpertiseFinder_MSI.py.
import TextAnalysis as TA
import ExpertiseFinder_MSI as EF

**Hooray !** Now all of our necessary files, packages, and modules are imported.

### STEP 2: Running the code.
In this step, we will accomplish one of our main goals: running the function **expertiseFinder_NameOrInst()** on an institution name of our choice.

In this tutorial, we will only be defining the arguments **token**, **directorypath**, **inst**, **year**, **strictness**, and **fileName**. Because we define **inst** but not **name**, our code will search for ALL top author names associated with the institution that we defined. If we were to define both **inst** and **name**, the code would "switch modes", in a way: it would only search for the papers written by a single author (**name**) while at the institution (**inst**).
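
To make the two "modes" more concrete, here is a rough sketch of how a search string for each mode might be assembled. This is an assumption about the general approach, not the query that **expertiseFinder_NameOrInst()** actually sends: the helper `build_ads_query` is hypothetical, and the field names (`aff:`, `author:"^..."`, `year:`, `property:refereed`) follow ADS search syntax as we understand it, so please check them against the ADS search documentation before relying on them.

```python
def build_ads_query(inst, name=None, year=None, refereed=False):
    """Assemble an ADS-style query string (hypothetical helper, illustration only)."""
    # Institution mode: always restrict results by affiliation.
    parts = [f'aff:"{inst}"']
    if name is not None:
        # Name mode: the caret asks for this author in first-author position.
        parts.append(f'author:"^{name}"')
    if year is not None:
        # Assumed range syntax: from the cutoff year onward.
        parts.append(f"year:{year}-9999")
    if refereed:
        parts.append("property:refereed")
    return " ".join(parts)


# Institution only, versus a single author at an institution.
inst_query = build_ads_query("University of the Virgin Islands", year=2000)
name_query = build_ads_query("Lehigh University", name="Pepper, Joshua", year=2000)
```

The only difference between the two modes in this sketch is the extra author clause; everything else about the search stays the same.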

The **strictness** keyword warrants more explanation. The user's choice of "default", "low", or "high" decides which set of filters is applied to each paper belonging to the top authors of an institution.

* LOW strictness triggers filters that permit an exact author match, a close institution match, or the presence of a journal name (ApJ, MNRAS, SPIE, AJ, Science, PASP, Nature, arXiv).
* HIGH strictness triggers filters that permit a relevant journal name AND an exact author match or a close institution match (the provided institution name string is included in the official affiliation name).
* DEFAULT strictness triggers filters that permit a close author match AND close institution match, OR an exact author match, OR a close institution match, OR a relevant journal name.

In this tutorial, we will use the "default" filter strictness.

In the output **newData**, the "CLEAN" or "DIRTY" classification is determined by whether a specific paper met the filter qualifications specified by the **strictness** keyword.
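
The three strictness levels can be read as boolean rules. The sketch below encodes them exactly as worded in the list above; it is an interpretation, not the filter code inside **ExpertiseFinder_MSI.py**. The function `classify_paper` and its four boolean inputs are hypothetical, and for "high" we assume the wording means a relevant journal AND (an exact author match OR a close institution match).

```python
def classify_paper(strictness, exact_author, close_author, close_inst, relevant_journal):
    """Return "CLEAN" or "DIRTY" for one paper (hypothetical helper, illustration only)."""
    if strictness == "low":
        passed = exact_author or close_inst or relevant_journal
    elif strictness == "high":
        # Assumed grouping: journal AND (author OR institution).
        passed = relevant_journal and (exact_author or close_inst)
    elif strictness == "default":
        passed = (
            (close_author and close_inst)
            or exact_author
            or close_inst
            or relevant_journal
        )
    else:
        raise ValueError("strictness must be 'default', 'low', or 'high'")
    return "CLEAN" if passed else "DIRTY"


# A paper with matching author and institution but no relevant journal.
high_result = classify_paper("high", True, True, True, False)
low_result = classify_paper("low", True, True, True, False)
```

Read literally, the "default" rule accepts the same papers as "low", because a close institution match alone already passes. The real filters may be more nuanced than this sketch suggests.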

In [5]:
# First, define your ADS token (replace the string below with your own token):
token = 'EPi9fOI0hRn5vHHflaJleJKTer69r4aEkP9TSXuj'
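
If you plan to share your notebook, you may prefer not to paste your token directly into a cell. One option, sketched below, is to read it from an environment variable. The helper `load_ads_token` is hypothetical; the default variable name `ADS_DEV_KEY` is, to our knowledge, the one the `ads` package itself looks for, but any name you set yourself would work here.

```python
import os


def load_ads_token(env_var="ADS_DEV_KEY"):
    """Read the ADS API token from an environment variable.

    Hypothetical helper for illustration only.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"No ADS token found. Set the {env_var} environment variable first."
        )
    return token


# Example usage, once the variable is set in your session:
# token = load_ads_token()
```

The returned string can then be passed as the **token** argument in the same way as the hard-coded value.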

In [6]:
# Let's do it !
newData = EF.expertiseFinder_NameOrInst(token, stopwordspath, 'University of the Virgin Islands', year = 2000, strictness = 'default', fileName = topdirectory+'UVI_topAuthors.csv')

Currently searching for information on University of the Virgin Islands; ADS records show that the following first authors are affiliated with this institution:
Carbone, Dario
Gendre, B.
Staff, Jan E.
Staff, Jan. E.
Strausbaugh, Robert
Code complete.


In [7]:
# Check out what you made !
newData

Unnamed: 0,True Author,True Institution,First Author,Bibcode,Title,Year,Keywords,Affiliations,Abstract,Top 10 Words,Top 10 Bigrams,Top 10 Trigrams,Data Type
0,"Carbone, Dario",University of the Virgin Islands,"Carbone, Dario",2020ApJ...889...36C,An Optimized Radio Follow-up Strategy for Stri...,2020,Radio astronomy,"Department of Physics and Astronomy, Texas Tec...",Several ongoing or planned synoptic optical su...,"[(radio, 7), (type, 3), (core, 3), (collapse, ...","[((core, collapse), 3), ((rarest, type), 2), (...","[((stripped, envelope, core), 2), ((envelope, ...",CLEAN
1,"Gendre, B.","University of the Virgin Islands, University o...","Gendre, B., Gendre, B.","2019MNRAS.486.2471G, 2022ApJ...929...16G",Can we quickly flag ultra-long gamma-ray burst...,"2019, 2022","methods: observational, Gamma-ray bursts","University of the Virgin Islands, College of S...",Ultra-long gamma-ray bursts are a class of hig...,"[(long, 8), (burst, 6), (ultra, 5), (observati...","[((ultra, long), 5), ((long, grbs), 4), ((gamm...","[((ultra, long, grbs), 3), ((gamma, ray, burst...",CLEAN
2,"Staff, Jan E.","University of the Virgin Islands, University o...","Staff, Jan. E., Staff, Jan E., Staff, Jan E.","2018ApJ...862...74S, 2019ApJ...882..123S, 2023...",The Role of Dredge-up in Double White Dwarf Me...,"2018, 2019, 2023","binaries: close, stars: formation, Star formation","College of Science and Math, University of the...",We present the results of an investigation of ...,"[(outflow, 11), (simulation, 10), (star, 10), ...","[((star, formation), 6), ((opening, angle), 4)...","[((star, formation, efficiency), 2), ((formati...",CLEAN
3,"Staff, Jan. E.","University of the Virgin Islands, University o...","Staff, Jan. E., Staff, Jan E., Staff, Jan E.","2018ApJ...862...74S, 2019ApJ...882..123S, 2023...",The Role of Dredge-up in Double White Dwarf Me...,"2018, 2019, 2023","binaries: close, stars: formation, Star formation","College of Science and Math, University of the...",We present the results of an investigation of ...,"[(outflow, 11), (simulation, 10), (star, 10), ...","[((star, formation), 6), ((opening, angle), 4)...","[((star, formation, efficiency), 2), ((formati...",CLEAN
4,"Strausbaugh, Robert",University of the Virgin Islands,"Strausbaugh, Robert",2022AJ....163...95S,Finding Fast Transients in Real Time Using a N...,2022,1957,"University of the Virgin Islands, Number 2, Br...",The current data acquisition rate of astronomi...,"[(transient, 5), (data, 4), (algorithm, 4), (d...","[((optical, data), 2), ((analysis, algorithm),...","[((current, data, acquisition), 1), ((data, ac...",CLEAN


The cell above displays the final output of this function.

Before we end here, make sure to save your output(s) ! The expertise finder does automatically save your file as a .csv, but if you would like to instead have it as an Excel document, run this cell separately:

In [None]:
# To save the output as an Excel (.xlsx) file, run the following command:
newData.to_excel(topdirectory+'UVI_topAuthors.xlsx')

Any new files created within this tutorial should now appear in the same directory as this tutorial.

### BONUS pt. 1: Search a single name and institution.
The function **expertiseFinder_NameOrInst()** has another "mode" where, instead of searching only for an institution name and returning several author names from that institution, a user can search for a single name at a single institution. *This is done by defining the optional keyword "name".*

In this example, we will search for information on Joshua Pepper while he was working at Lehigh University.

In [10]:
# Define the new keywords.
josh = 'Pepper, Joshua'
lehigh = 'Lehigh University'

In [12]:
joshData = EF.expertiseFinder_NameOrInst(token, stopwordspath, inst = lehigh, name = josh, year = 2000, strictness = 'default', fileName = topdirectory+'joshData.csv')

Code complete.


In [14]:
# Check it out !
joshData

Unnamed: 0,True Author,True Institution,First Author,Bibcode,Title,Year,Keywords,Affiliations,Abstract,Top 10 Words,Top 10 Bigrams,Top 10 Trigrams,Data Type
0,"Pepper, Joshua","Lehigh University, Lehigh University, Lehigh U...","Pepper, Joshua, Pepper, Joshua, Pepper, Joshua...","2017AJ....153..215P, 2017AJ....153..177P, 2013...",KELT-11b: A Highly Inflated Sub-Saturn Exoplan...,"2017, 2017, 2013, 2020, 2019","planetary systems, galaxies: star clusters: in...","Department of Physics, Lehigh University, 16 M...",We report the discovery of a transiting exopla...,"[(planet, 15), (star, 13), (host, 10), (public...","[((host, star), 7), ((transiting, exoplanet), ...","[((fully, convective, boundary), 4), ((orbital...",CLEAN


### BONUS pt. 2: Search for several names in a large spreadsheet.
While the function **expertiseFinder_NameOrInst()** was built for processing single names or institutions, its parent function **expertiseFinder()** can process large spreadsheets of researcher names and their affiliate institutions. 

The **expertiseFinder()** has some slightly different keywords that are detailed below:
1. **rawFile** = a Pandas data frame containing at least the following two columns: **"LastName, FirstName"** (containing researcher names) and **"Institution Name"** (containing those researchers' affiliate institutions).
2. **start** = the index of **rawFile** at which the function should begin searching ADS (*int*)
3. **count** = the number of rows after **start** that the function should continue searching ADS (*int*). This keyword is put in place to prevent a user from exceeding the 5000-queries-a-day limit of the ADS API.
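
For a long spreadsheet, **start** and **count** let you split the work into chunks, for example one chunk per day. The sketch below shows one way to plan those chunks. The helper `plan_batches` is hypothetical, and it assumes that a call with a given **start** and **count** covers rows `start` through `start + count - 1`. It does not know how many ADS queries each row costs, so choose a batch size with some safety margin.

```python
def plan_batches(total_rows, batch_size, start=0):
    """Return a list of (start, count) pairs covering rows start..total_rows-1.

    Hypothetical helper for illustration only.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    batches = []
    while start < total_rows:
        # The last batch may be shorter than batch_size.
        count = min(batch_size, total_rows - start)
        batches.append((start, count))
        start += count
    return batches


# Example: a 12-row spreadsheet processed 5 rows at a time.
example_batches = plan_batches(12, 5)
```

Each pair can then be supplied as the **start** and **count** arguments of a separate **expertiseFinder()** run.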

**expertiseFinder()** also contains optional keywords **year** and **strictness** but does NOT have keywords **inst**, **name**, or **refereed**.

As for outputs, instead of returning a single data frame, **expertiseFinder()** returns TWO:
1. **top10Df** = a Pandas data frame containing results from ADS that are considered "clean", or match the "strictness" filters.
2. **top10DirtyDf** = a Pandas data frame containing results from ADS that are considered "dirty", or did NOT match the "strictness" filters.

The cells to follow display how this parent function analyses sample data from NHFP fellowship recipients and their host institutions.

In [23]:
# Import the data to analyse:
NHFPdf = pd.read_csv('NHFPFellows_Sample.csv')
NHFPdf

Unnamed: 0,Full Name,"LastName, FirstName",Institution Name
0,Aaron Barth,"Barth, Aaron",California Institute of Technology
1,Aaron Boley,"Boley, Aaron",University of Florida
2,Aaron Meisner,"Meisner, Aaron",California Institute of Technology
3,Aaron Smith,"Smith, Aaron",Massachusetts Institute of Technology
4,Adam Burgasser,"Burgasser, Adam",University of California Los Angeles
5,Adam Burgasser,"Burgasser, Adam",American Museum of Natural History
6,Adam Frank,"Frank, Adam",University of Minnesota
7,Adam Kraus,"Kraus, Adam",University of Hawaii
8,Adam Leroy,"Leroy, Adam",National Radio Astronomy Observatory
9,Adam Miller,"Miller, Adam",Jet Propulsion Laboratory


Above is the sample we will be analysing. 

**PLEASE NOTE: In order for** expertiseFinder() **to analyze a spreadsheet, there must be at least two columns named** "LastName, FirstName" **and** "Institution Name"**.** So make sure that, when you use this function on your own data, you rename your columns to match !!
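
As a small illustration of that renaming step, the sketch below renames the columns of a data frame and checks that both required names are present before anything is sent to ADS. The helper `prepare_input` and the starting column names "Name" and "Affiliation" are hypothetical; substitute whatever your own spreadsheet uses.

```python
import pandas as pd

REQUIRED_COLUMNS = ["LastName, FirstName", "Institution Name"]


def prepare_input(df, column_map):
    """Rename columns and confirm the required ones exist.

    Hypothetical helper for illustration only.
    """
    renamed = df.rename(columns=column_map)
    missing = [c for c in REQUIRED_COLUMNS if c not in renamed.columns]
    if missing:
        raise KeyError(f"Missing required column(s): {missing}")
    return renamed


# A one-row example with made-up column names.
my_df = pd.DataFrame(
    {"Name": ["Barth, Aaron"], "Affiliation": ["California Institute of Technology"]}
)
ready = prepare_input(
    my_df, {"Name": "LastName, FirstName", "Affiliation": "Institution Name"}
)
```

The resulting data frame `ready` has the two column names that **expertiseFinder()** expects.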

In [24]:
# Start by defining the row at which to start the analysis and how many rows to analyse.
start = 12
count = 5

In [25]:
# Then, run the code !
NHFPData = EF.expertiseFinder(token, stopwordspath, NHFPdf, start, count, year = 2000, strictness = 'default')

In [28]:
# expertiseFinder() returns two data frames: the "clean" results first, then the "dirty" results.
cleanData = NHFPData[0]
dirtyData = NHFPData[1]

In [29]:
cleanData

Unnamed: 0,True Author,True Institution,First Author,Bibcode,Title,Year,Keywords,Affiliations,Abstract,Top 10 Words,Top 10 Bigrams,Top 10 Trigrams
0,"Bogdan, Akos","Center for Astrophysics, Center for Astrophysi...","Bogdán, Ákos, Bogdán, Ákos, Bogdán, Ákos, Bogd...","2013ApJ...772...97B, 2017ApJ...850...98B, 2018...",Hot X-Ray Coronae around Massive Spiral Galaxi...,"2013, 2017, 2018, 2015, 2015, 2017, 2020, 2018...","galaxies: individual: NGC 1961 NGC 6753, galax...","Harvard Smithsonian Center for Astrophysics, 6...",Luminous X-ray gas coronae in the dark matter ...,"[(galaxy, 130), (ray, 76), (ngc, 57), (mass, 5...","[((ngc, ngc), 21), ((galaxy, cluster), 19), ((...","[((dark, matter, halos), 9), ((superluminous, ..."
1,"Ji, Alexander","Carnegie Observatories, Carnegie Observatories...","Ji, Alexander P., Ji, Alexander P., Ji, Alexan...","2016ApJ...830...93J, 2019ApJ...870...83J, 2015...",Complete Element Abundances of Nine Stars in t...,"2016, 2019, 2015, 2016, 2020, 2018, 2014, 2016...","galaxies: dwarf, galaxies: dwarf, stars: chemi...",Department of Physics and Kavli Institute for ...,We present chemical abundances derived from hi...,"[(star, 101), (abundance, 51), (galaxy, 37), (...","[((dwarf, galaxy), 23), ((neutron, capture), 1...","[((ultra, faint, dwarf), 10), ((low, neutron, ..."
2,"Kuznetsova, Aleksandra","American Museum of Natural History, American M...","Kuznetsova, Aleksandra, Kuznetsova, Aleksandra...","2022ApJ...928...92K, 2015ApJ...815...27K, 2018...",Anisotropic Infall and Substructure Formation ...,"2022, 2015, 2018, 2019, 2020, 2018, 2017, 2020","Hydrodynamics, stars: formation, stars: format...","American Museum of Natural History, 200 Centra...",The filamentary nature of accretion streams fo...,"[(mass, 26), (star, 25), (formation, 18), (cor...","[((angular, momentum), 9), ((power, law), 9), ...","[((power, law, mass), 4), ((law, mass, functio..."
3,"Philippov, Alexander","University of California Berkeley, University ...","Philippov, Alexander A., Philippov, Alexander,...","2018ApJ...855...94P, 2014MNRAS.441.1879P, 2020...",Ab-initio Pulsar Magnetosphere: Particle Accel...,"2018, 2014, 2020, 2019, 2015, 2014, 2015, 2013...","plasmas, stars: magnetic field, Astrophysics -...","Department of Astrophysical Sciences, Princeto...",We perform global particle-in-cell simulations...,"[(pulsar, 31), (pair, 24), (particle, 16), (fi...","[((current, sheet), 10), ((pulsar, magnetosphe...","[((particle, cell, simulations), 5), ((open, f..."
4,"Sadowski, Aleksander","Massachusetts Institute of Technology, Massach...","Sadowski, Aleksander, Sądowski, Aleksander, Są...","2008ApJ...676.1162S, 2014MNRAS.439..503S, 2015...",The Total Merger Rate of Compact Object Binari...,"2008, 2014, 2015, 2013, 2013, 2016, 2015, 2009...","binaries: close, accretion, accretion, accreti...","Nicolaus Copernicus Astronomical Center, Barty...","Using a population synthesis approach, we comp...","[(accretion, 48), (hole, 43), (black, 42), (di...","[((black, hole), 42), ((accretion, rate), 15),...","[((dot, dot, edd), 6), ((black, hole, accretio..."


This concludes the tutorial. If you have any questions, comments, concerns, or encounter any bugs, please do not hesitate to contact maire.volz@nasa.gov or antonino.cucchiara@nasa.gov. Happy coding !