# Welcome !
This CoLab notebook is a walkthrough of how to use my function **expertiseFinder_NameOrInst()** *(plus some bonus functions)* in the .py file **ExpertiseFinder_MSI.py**. The purpose of this function is to identify researchers at a specific institution who have a presence in the NASA ADS database. The "Expertise Finder" detects which researchers are from the target institution by querying NASA ADS with the institution name, then pulling out the author names of papers whose first author is affiliated with that institution. The code then focuses on their astronomical expertise by selecting publications in astronomical journals (e.g. ApJ, MNRAS, SPIE).

The primary goal of this tutorial is to walk the user through the process of using said function on provided data files and prepare them for using the function on their own.

The items you will need to run this code are listed below:
1. The file **ExpertiseFinder_MSI.py**.
2. The supporting file **TextAnalysis.py**.
3. A file of ignorable "stop words" and its directory path in Google Drive: **stopwords.txt**.
4. For the BONUS section, the file **NHFPFellows_Sample.csv** as an input.
5. Your own **NASA ADS API token**. This is a long string of characters generated by ADS that gives you access to their API. You can find instructions on how to get an ADS token [here](https://ads.readthedocs.io/en/v1/api-key.html).

Please make sure you have these items available before starting this tutorial. If you do not have access to these files, please contact maire.volz@nasa.gov or antonino.cucchiara@nasa.gov. Thank you !


### The "Expertise Finder"
The function has the following arguments:
1. **token** = an ADS API token from https://ui.adsabs.harvard.edu/help/api/ (*string*)
2. **directorypath** = the file location of 'stopwords.txt' on your device (*string*)
3. **inst** = the institution name of interest (*string*)
4. **name** ('LastName, FirstName') = an optional argument that, when not 'None', searches ADS for a specific author's name at a specific institution (*string*)
5. **refereed** (True or False) = an optional argument that toggles whether NASA ADS search results need to be peer-refereed (*boolean*)
6. **year** = an optional argument that determines a cutoff point for consideration; papers published before this year will not be considered (*int*)
7. **strictness** = an optional keyword argument that determines how strict the function will be when filtering ADS results (can be one of three *strings*: 'default', 'low', or 'high')
8. **fileName** ('___.csv') = an optional keyword argument that defines the name under which the newData output is saved to the user's device (*string* with default "expertiseFinder_output.csv")
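
To make the optional arguments more concrete, the sketch below shows how **inst**, **name**, **refereed**, and **year** could translate into an ADS search string. It is an illustration only, not the code inside ExpertiseFinder_MSI.py: the helper `build_ads_query` and its `end_year` parameter are hypothetical, and the sketch assumes the ADS search fields `aff:`, `author:`, `year:`, and `property:refereed`.

```python
from datetime import date


def build_ads_query(inst, name=None, refereed=False, year=None, end_year=None):
    """Hypothetical helper: assemble an ADS search string from the tutorial's arguments."""
    # The institution is always part of the search.
    parts = [f'aff:"{inst}"']
    # Adding a name narrows the search to one author at that institution.
    if name is not None:
        parts.append(f'author:"{name}"')
    # Papers published before `year` are left out (assumed year-range syntax).
    if year is not None:
        last = end_year if end_year is not None else date.today().year
        parts.append(f"year:{year}-{last}")
    # Optionally keep only peer-refereed results.
    if refereed:
        parts.append("property:refereed")
    return " ".join(parts)


query = build_ads_query(
    "University of the Virgin Islands", refereed=True, year=2000, end_year=2020
)
print(query)
```

The function only builds a string, so it can be inspected without spending any ADS queries.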

Running this function will return the following:

1. **newData** = a spreadsheet containing information on author name, institution, paper titles, publication years, keywords, abstracts, top 10 words, bi-grams, and tri-grams, and a "CLEAN" or "DIRTY" classification.
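
For readers unfamiliar with the "top 10 words, bi-grams, and tri-grams" columns, here is a minimal sketch of how such counts can be produced from a piece of text once stop words are removed. This is not the implementation in TextAnalysis.py: the helper `top_ngrams`, the tiny stop-word set, and the sample sentence are all made up for illustration.

```python
import re
from collections import Counter


def top_ngrams(text, stop_words, n=1, top=10):
    """Hypothetical helper: most common n-grams after removing stop words."""
    # Lowercase the text, keep alphabetic tokens, and drop stop words.
    tokens = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in stop_words]
    # Slide a window of length n across the remaining tokens.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(g) for g in grams).most_common(top)


stop_words = {"the", "of", "in", "a", "and"}  # stand-in for the stop-words file
abstract = "The dust in the galaxy cluster and the gas in the galaxy cluster"

top_words = top_ngrams(abstract, stop_words, n=1, top=2)
top_bigrams = top_ngrams(abstract, stop_words, n=2, top=1)
print(top_words)
print(top_bigrams)
```

In this sketch the stop words are removed before the n-grams are formed, so a bi-gram can join two words that were separated by a stop word in the original text. The real module may handle this differently.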

In this tutorial, we will only be defining the arguments **token**, **directorypath**, **inst**, **year**, **strictness**, and **fileName**.

### STEP 1: Import necessary files and packages to this notebook.
In order to run the "Expertise Finder", we must define the location of and upload the necessary files.
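
Before running the cells in this step, it can help to confirm that the required files are where you expect them. The sketch below is an optional check and is not part of the Expertise Finder itself: the helper `missing_files` is hypothetical, and the demonstration uses a temporary folder so it does not depend on your own directory layout.

```python
import os
import tempfile

REQUIRED_FILES = ["ExpertiseFinder_MSI.py", "TextAnalysis.py", "stopwords.txt"]


def missing_files(directory, names=REQUIRED_FILES):
    """Hypothetical helper: return the names that are not present in `directory`."""
    # An empty string means "same directory as this notebook".
    folder = directory or "."
    return [n for n in names if not os.path.isfile(os.path.join(folder, n))]


# Demonstration in a temporary folder that holds only one of the three files.
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "stopwords.txt"), "w").close()
    demo_missing = missing_files(demo_dir)

print(demo_missing)
```

To check your own setup, call `missing_files('')` for the notebook's directory, or pass the folder that holds the files. An empty list means everything was found.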

In [None]:
# Install the ADS library
!pip install ads

In [None]:
# Import the outside packages that this code needs to run.
import ads
import nltk
nltk.download('punkt')
nltk.download('wordnet')
import pandas as pd

In [None]:
# Define the directory path of the stopwords file. If it is in the same directory as this notebook, leave topdirectory blank.
# Modify the string below if 'stopwords.txt' is in another directory (end the path with a '/', since the file name is appended directly):
topdirectory = ''
stopwordspath = topdirectory + 'stopwords.txt'

In [None]:
# Now, we will import the functions from TextAnalysis.py and ExpertiseFinder_MSI.py.
import TextAnalysis as TA
import ExpertiseFinder_MSI as EF

**Hooray !** Now all of our necessary files, packages, and modules are imported.

### STEP 2: Running the code.
In this step, we will accomplish one of our main goals: running the function **expertiseFinder_NameOrInst()** on an institution name of our choice.

In this tutorial, we will only be defining the arguments **token**, **directorypath**, **inst**, **year**, **strictness**, and **fileName**. Because we define **inst** but not **name**, the code will search for ALL top author names associated with the institution we provide. If we were to define both **inst** and **name**, the code would "switch modes", in a way: it would only search for papers written by a single author (**name**) while at the institution (**inst**).

The **strictness** keyword warrants more explanation. The user's choice of "default", "low", or "high" decides which set of filters is applied to each paper belonging to the top authors of an institution.

* LOW strictness triggers filters that permit an exact author match, a close institution match, or the presence of a journal name (ApJ, MNRAS, SPIE, AJ, Science, PASP, Nature, Arxiv).
* HIGH strictness triggers filters that permit a relevant journal name AND an exact author match or a close institution match (the provided institution name string is included in the official affiliation name).
* DEFAULT strictness triggers filters that permit a close author match AND close institution match, OR an exact author match, OR a close institution match, OR a relevant journal name.

In this tutorial, we will use the "default" filter strictness.

In the output **newData**, the "CLEAN" or "DIRTY" classification is determined by whether a specific paper met the filter qualifications specified by the **strictness** keyword.
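
The three strictness rules can be written out as plain boolean logic. The sketch below follows the wording of the bullet points above and nothing more: the functions `passes_filter` and `classify` and their four True/False inputs are hypothetical, and the actual checks in ExpertiseFinder_MSI.py may be implemented differently.

```python
def passes_filter(strictness, exact_author, close_author, close_inst, journal):
    """Hypothetical helper: apply the strictness rules as described in this tutorial."""
    if strictness == "low":
        # Any one of the three conditions is enough.
        return bool(exact_author or close_inst or journal)
    if strictness == "high":
        # A relevant journal is required, plus an author or institution match.
        return bool(journal and (exact_author or close_inst))
    if strictness == "default":
        return bool(
            (close_author and close_inst) or exact_author or close_inst or journal
        )
    raise ValueError("strictness must be 'default', 'low', or 'high'")


def classify(strictness, **checks):
    """Label a paper CLEAN if it passes the chosen filter, DIRTY otherwise."""
    return "CLEAN" if passes_filter(strictness, **checks) else "DIRTY"


# A paper in a relevant journal with no author or institution match.
paper = dict(exact_author=False, close_author=False, close_inst=False, journal=True)
labels = {level: classify(level, **paper) for level in ("low", "default", "high")}
print(labels)
```

Taken literally, the first clause of the "default" rule is already covered by its close-institution clause, so this sketch gives the same answer for "default" and "low". That is a property of the wording above, not necessarily of the real code.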

In [None]:
# First, define your ADS token:
token = 'blah'

In [None]:
# Let's do it !
newData = EF.expertiseFinder_NameOrInst(token, stopwordspath, 'University of the Virgin Islands', year = 2000, strictness = 'default', fileName = topdirectory+'UVI_topAuthors.csv')

In [None]:
# Check out what you made !
newData

The cell above displays the final output of this function.

Before we end here, make sure to save your output(s) ! The expertise finder does automatically save your file as a .csv, but if you would like to instead have it as an Excel document, run this cell separately:

In [None]:
# To save the output as an Excel file, run the following command:
newData.to_excel(topdirectory+'UVI_topAuthors.xlsx')

Any new files created within this tutorial should now appear in the same directory as this tutorial.

### BONUS pt. 1: Search a single name and institution.
The function **expertiseFinder_NameOrInst()** has another "mode" where, instead of searching only for an institution name and returning several author names from that institution, a user can search for a single name at a single institution. *This is done by defining the optional keyword "name".*

In this example, we will search for information on Joshua Pepper while he was working at Lehigh University.

In [None]:
# Define the new keywords.
josh = 'Pepper, Joshua'
lehigh = 'Lehigh University'

In [None]:
joshData = EF.expertiseFinder_NameOrInst(token, stopwordspath, inst = lehigh, name = josh, year = 2000, strictness = 'default', fileName = topdirectory+'joshData.csv')

In [None]:
# Check it out !
joshData

### BONUS pt. 2: Search for several names in a large spreadsheet.
While the function **expertiseFinder_NameOrInst()** was built for processing single names or institutions, its parent function **expertiseFinder()** can process large spreadsheets of researcher names and their affiliate institutions. 

**expertiseFinder()** has some slightly different keywords that are detailed below:
1. **rawFile** = a Pandas data frame containing at least the following two columns: **"LastName, FirstName"** (containing researcher names) and **"Institution Name"** (containing those researchers' affiliate institutions).
2. **start** = the index of **rawFile** at which the function should begin searching ADS (*int*)
3. **count** = the number of rows after **start** that the function should continue searching ADS (*int*). This keyword is in place to prevent a user from exceeding the ADS API limit of 5000 queries a day.

**expertiseFinder()** also contains optional keywords **year** and **strictness** but does NOT have keywords **inst**, **name**, or **refereed**.
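
The **start** and **count** keywords make it possible to work through a long spreadsheet in pieces, for example one piece per day. The sketch below only plans the pieces; it does not call ADS. The helper `batches` is hypothetical, and it assumes that a call with a given **start** and **count** covers rows `start` through `start + count - 1`.

```python
def batches(total_rows, count, start=0):
    """Hypothetical helper: yield (start, count) pairs that cover the remaining rows."""
    if count <= 0:
        raise ValueError("count must be a positive integer")
    for begin in range(start, total_rows, count):
        # The last batch may be shorter than `count`.
        yield begin, min(count, total_rows - begin)


# Plan the runs for a 23-row spreadsheet, 5 rows at a time, beginning at row 12.
plan = list(batches(total_rows=23, count=5, start=12))
print(plan)
```

Each pair in `plan` could then be passed as the **start** and **count** arguments of one **expertiseFinder()** run.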

As for outputs, instead of returning a single data frame, **expertiseFinder()** returns TWO:
1. **top10Df** = a Pandas data frame containing results from ADS that are considered "clean", or match the "strictness" filters.
2. **top10DirtyDf** = a Pandas data frame containing results from ADS that are considered "dirty", or did NOT match the "strictness" filters.

The cells to follow display how this parent function analyses sample data from NHFP fellowship recipients and their host institutions.

In [None]:
# Import the data to analyse:
NHFPdf = pd.read_csv('NHFPFellows_Sample.csv')
NHFPdf

Above is the sample we will be analysing. 

**PLEASE NOTE: In order for** expertiseFinder() **to analyze a spreadsheet, there must be at least two columns named** "LastName, FirstName" **and** "Institution Name"**.** So make sure that, when you use this function on your own data, you rename some columns !!
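
If your own spreadsheet uses different column headers, one way to rename them with Pandas is sketched below. The headers "Name" and "Host" and the single placeholder row are made up for illustration; substitute the headers from your own file.

```python
import pandas as pd

# Placeholder data standing in for a user's own spreadsheet.
own_df = pd.DataFrame({"Name": ["Doe, Jane"], "Host": ["Example University"]})

# Map the existing headers onto the two names the function expects.
renamed = own_df.rename(
    columns={"Name": "LastName, FirstName", "Host": "Institution Name"}
)

# Confirm that both required columns are now present.
required = {"LastName, FirstName", "Institution Name"}
missing = required - set(renamed.columns)
print(list(renamed.columns), missing)
```

`rename` returns a new data frame and leaves `own_df` unchanged, so the original headers remain available if you need them.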

In [None]:
# Start by defining where you would like to start and end analysis.
start = 12
count = 5

In [None]:
# Then, run the code !
NHFPData = EF.expertiseFinder(token, stopwordspath, NHFPdf, start, count, year = 2000, strictness = 'default')

In [None]:
cleanData = NHFPData[0]
dirtyData = NHFPData[1]

In [None]:
cleanData

In [None]:
dirtyData

This concludes the tutorial. If you have any questions, comments, or concerns, or encounter any bugs, please do not hesitate to contact maire.volz@nasa.gov or antonino.cucchiara@nasa.gov. Happy coding !