# Welcome!
This Colab notebook is a walkthrough of how to use my function **CarnegieMatching_OneSpreadsheet()** in the .py file **CarnegieMatchingFunctions.py**. The goal of this tutorial is to identify the research and MSI classification of each institution name included in an input spreadsheet.

The files you will need to run this code are listed below:
1. Your own Colab notebook.
2. The file **CarnegieMatchingFunctions.py**.
3. The most up-to-date "exceptions" file: **Exceptions_3-10_1.csv**.
4. Carnegie's database of institutions: **CarnegieClassification_Data.csv**.
5. An input file of your choosing. In this notebook, I will be using **ADAP_MergedList_fromNino.csv**.

Please make sure each of these required files is on your device before beginning this tutorial. If you do not have access to these files, please contact maire.volz@nasa.gov or antonino.cucchiara@nasa.gov. Thank you!


### STEP 1: Import packages and files.
Before we begin coding, we have to import all of the necessary packages and files. The files listed in the previous text cell should be downloaded onto your device for access.

In [None]:
# To import .py files, you must first mount your Google Drive in Google Colab.
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Import pandas
import pandas as pd

In [None]:
# Import the Carnegie and Exceptions databases.
# Define the path of the directory that holds this tutorial and your required files. On your Google Drive, it should be the following:
topdirectory = '/content/drive/MyDrive/Colab Notebooks/Carnegie Matching Tutorial/'

carnegieData = pd.read_csv(topdirectory + 'CarnegieClassification_Data.csv')
exceptionsData = pd.read_csv(topdirectory + 'Exceptions_3-10_1.csv')
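If one of the `pd.read_csv` calls above raises a `FileNotFoundError`, a quick check like the one below can show which of the required files is missing from the directory. This is only a minimal sketch: the helper name `find_missing_files` and the constant `REQUIRED_FILES` are hypothetical, and the file names are copied from the list at the top of this notebook.

```python
from pathlib import Path

# File names taken from the list of required files at the top of this notebook.
REQUIRED_FILES = [
    'CarnegieMatchingFunctions.py',
    'Exceptions_3-10_1.csv',
    'CarnegieClassification_Data.csv',
    'ADAP_MergedList_fromNino.csv',
]

def find_missing_files(directory, filenames=REQUIRED_FILES):
    """Return the names in `filenames` that are not present in `directory`."""
    folder = Path(directory)
    return [name for name in filenames if not (folder / name).is_file()]
```

Calling `find_missing_files(topdirectory)` should return an empty list when everything is in place.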

In [None]:
# Now, we will import the functions from CarnegieMatchingFunctions.py.
# Since we've already mounted Google Drive, we can now insert the directory to your python path using sys. The argument of the "sys.path.append" function should be the path of the directory in which the .py file is saved.
import sys 
sys.path.append(topdirectory)

In [None]:
import CarnegieMatchingFunctions as CMF

In [None]:
# Finally, import and view the raw data we'll be analyzing
inputData_ADAP = pd.read_csv(topdirectory + 'ADAP_MergedList_fromNino.csv')
inputData_ADAP

### STEP 2: Running the code.
In this step, we will run the function **CMF.CarnegieMatching_OneSpreadsheet()**. This function has the following arguments:
1. **inputData** = A data frame containing institution names that we wish to identify (imported earlier).
2. **institution_column** = A string name of the column within *inputData* that contains information on institution names.
3. **carnegieData** = A data frame of the Carnegie classification database (imported earlier).
4. **exceptionsData** = A data frame of potentially erroneous or missing names from the Carnegie database (imported earlier).

The function returns the following:
1. **outputDf** = A data frame identical to inputData, except with three additional columns: "Homogenized Institution Name", "Research Classification" and "MSI Classification".
2. **exceptionsDf** = A data frame identical to exceptionsData, except with additional columns containing new exceptions encountered when analyzing *inputData*.


#### **IMPORTANT NOTE**

For this function to return a complete list of MSI and research classifications, it may pause at institution names it cannot find in either **carnegieData** or **exceptionsData**. This can happen because the institution is not a university, because it has not been encountered in previous runs of this function, or because the name has unusual capitalization or spacing.
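To see how capitalization or spacing alone can make two spellings of the same name look different, here is a minimal sketch of a clean-up step. It is an illustration only and not necessarily what **CarnegieMatchingFunctions.py** does internally; the helper names `tidy_name` and `comparison_key` are hypothetical.

```python
import re

def tidy_name(name):
    """Trim the ends and collapse runs of whitespace into single spaces."""
    return re.sub(r'\s+', ' ', str(name)).strip()

def comparison_key(name):
    """Build a case-insensitive key for comparing two spellings of a name."""
    return tidy_name(name).casefold()

# Two spellings that differ only in spacing and capitalization share one key.
same = comparison_key('  UNIVERSITY  of kentucky ') == comparison_key('University of Kentucky')
```

Comparing on the key rather than the raw string keeps the original spelling available for display.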

In any case, the user will be prompted to input another name for the erroneous institution that the code can search for again. You may need to manually search **carnegieData** or **exceptionsData** for the correct name. If the name is not in either of those spreadsheets, you will need to input the classification by hand. Follow the code's instructions for more details. 
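When you need to search the reference spreadsheets by hand, a fuzzy lookup can shorten the hunt. The sketch below uses Python's standard `difflib` module to suggest close spellings; the helper name `suggest_names` is hypothetical and is not part of **CarnegieMatchingFunctions.py**.

```python
from difflib import get_close_matches

def suggest_names(query, known_names, n=5, cutoff=0.6):
    """Return up to `n` entries of `known_names` that resemble `query`."""
    # Compare case-insensitively, but hand back the original spellings.
    lookup = {}
    for name in known_names:
        lookup.setdefault(str(name).casefold(), str(name))
    hits = get_close_matches(str(query).casefold(), list(lookup), n=n, cutoff=cutoff)
    return [lookup[hit] for hit in hits]
```

In this notebook, the list of known names could come from the institution-name column of `carnegieData`, for example `suggest_names('Harvard Universit', carnegieData['name'].dropna())`. The column name `'name'` is an assumption here; check `carnegieData.columns` for the actual one.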

In the last step before the code finishes running, you will be prompted with a final question: "*Would you like to automatically update and save your exceptions file ? (Yes/No)*". Answering "Yes" is strongly encouraged: the code will then ask for your input and save your exceptions file as a .csv to your device, which protects any new exceptions from being lost. If you respond "No", your exceptions file will still be returned from the function, but you will need to save it yourself.
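If you answer "No", one way to save the returned exceptions data frame yourself is sketched below. The helper name `save_exceptions` and the timestamped file-name pattern are assumptions made for illustration; they are not the naming scheme used by the function's own save prompt.

```python
from datetime import datetime
from pathlib import Path

def save_exceptions(exceptions_df, directory, prefix='Exceptions'):
    """Write `exceptions_df` to a new, timestamped .csv file and return its path."""
    stamp = datetime.now().strftime('%Y-%m-%d_%H%M%S')
    path = Path(directory) / f'{prefix}_{stamp}.csv'
    # index=False keeps the row index from being written as an extra column.
    exceptions_df.to_csv(path, index=False)
    return path
```

After the function in this step has returned, this could be called as `save_exceptions(test_exceptionsDf, topdirectory)`. Writing to a new file each time avoids overwriting an older exceptions file by accident.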

In [None]:
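# For demonstration purposes, in this code box, I shorten inputData to only 10 rows.
# .copy() makes the subset an independent data frame, so adding columns to it later does not trigger a pandas SettingWithCopyWarning.
test_inputData = inputData_ADAP[0:10].copy()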

In [None]:
# Define the column in inputData with institution names.
institution_column = 'Company'
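Before starting a long interactive run, it can help to confirm that the chosen column exists and to see how many distinct names it holds. The following is a minimal sketch with a hypothetical helper, `check_institution_column`, shown on a small made-up data frame rather than the real input file.

```python
import pandas as pd

def check_institution_column(input_df, column):
    """Raise a readable error if `column` is missing; otherwise return basic counts."""
    if column not in input_df.columns:
        raise KeyError(
            f'Column {column!r} not found. Available columns: {list(input_df.columns)}'
        )
    names = input_df[column]
    return {
        'rows': len(names),
        'blank': int(names.isna().sum()),
        'unique': int(names.nunique()),  # blank entries are not counted as a name
    }

# Small illustrative data frame (not the real input file).
example = pd.DataFrame({'Company': ['SAIC', 'SAIC', None, 'NASA/GSFC']})
print(check_institution_column(example, 'Company'))
```

With the notebook's own variables, the call would be `check_institution_column(test_inputData, institution_column)`.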

In [None]:
# Now it's time to run the function! When you run this code, you will encounter several input requests along the way.
# Respond to each question to the best of your knowledge, referring to carnegieData and exceptionsData when needed.
test_outputDf, test_exceptionsDf = CMF.CarnegieMatching_OneSpreadsheet(test_inputData, 
                                                                       institution_column, 
                                                                       carnegieData, 
                                                                       exceptionsData)

STEPS 1-3 COMPLETE
STEP 4 COMPLETE
The institution "Foundation for Research and Technology (FORTH)" was not found in our references. Please input the name we should search for instead:Foundation for Research and Technology (FORTH)
Your input "Foundation for Research and Technology (FORTH)" was not found in the Carnegie database. Your input may not qualify as a degree-granting institution. Please type the name of your institution.Foundation for Research and Technology (FORTH)
What is the research classification of this institution ? (R1, R2, 4Y, G = Government, RC = Research Center, F = Foreign, I = Industry, or None)F
What is the MSI classification of this institution ? (HSI, BSI, HBCU, or None)None
The institution "Spectral Sciences Inc." was not found in our references. Please input the name we should search for instead:Spectral Sciences, Inc
The institution "Smithsonian Observatory" was not found in our references. Please input the name we should search for instead:Harvard Universit

In [None]:
# Success! Now that the code is done, make sure to save your files if you haven't done so already.
# index=False keeps the row index from being saved as another "Unnamed: 0" column.
test_outputDf.to_csv(topdirectory + 'TEST_ADAP_MergedList_withClassifications.csv', index=False)

In [None]:
test_outputDf

Unnamed: 0.1,Unnamed: 0,Review Name,Review Acronym,Panel Name,Panel Acronym,First Name,Last Name,Company,Country,Email,Homogenized Institution Name,MSI Classification,Research Classification
0,0,Astrophysics Data Analysis Program 2013,ADAP13,Active Galaxies and Quasars - 1,AGAL 1,GEORGE,CHARTAS,College of Charleston,US,chartasg@cofc.edu,College of Charleston,,R2 - 19
1,1,Astrophysics Data Analysis Program 2013,ADAP13,Active Galaxies and Quasars - 1,AGAL 1,Tanio,Diaz-Santos,Foundation for Research and Technology (FORTH),GR,tanio@ia.forth.gr,Foundation for Research and Technology (FORTH),,F
2,2,Astrophysics Data Analysis Program 2013,ADAP13,Active Galaxies and Quasars - 1,AGAL 1,Jonathan,Gelbord,Spectral Sciences Inc.,US,jgelbord@spectral.com,"Spectral Sciences, Inc",,I
3,3,Astrophysics Data Analysis Program 2013,ADAP13,Active Galaxies and Quasars - 1,AGAL 1,Dan,Schwartz,Smithsonian Observatory,US,dschwartz@cfa.harvard.edu,Harvard University,,R1
4,4,Astrophysics Data Analysis Program 2013,ADAP13,Active Galaxies and Quasars - 1,AGAL 1,Renbin,Yan,University of Kentucky,US,rya225@g.uky.edu,University of Kentucky,,R1
5,5,Astrophysics Data Analysis Program 2013,ADAP13,Active Galaxies and Quasars - 2,AGAL 2,Marco,Chiaberge,Space Telescope Science Institute,US,marcoc@stsci.edu,Space Telescope Science Institute,,RC
6,6,Astrophysics Data Analysis Program 2013,ADAP13,Active Galaxies and Quasars - 2,AGAL 2,Elisa,Costantini,SRON space research organization netherlands,NL,e.costantini@sron.nl,Netherlands Institute for Space Research,,F
7,7,Astrophysics Data Analysis Program 2013,ADAP13,Active Galaxies and Quasars - 2,AGAL 2,Karen,Davis,SAIC,US,kdavis@nasaprs.com,Science Applications International Corporation,,I
8,8,Astrophysics Data Analysis Program 2013,ADAP13,Active Galaxies and Quasars - 2,AGAL 2,Karen,Davis,SAIC,US,kdavis@nasaprs.com,Science Applications International Corporation,,I
9,9,Astrophysics Data Analysis Program 2013,ADAP13,Active Galaxies and Quasars - 2,AGAL 2,Davide,Donato,NASA/GSFC,US,donato@milkyway.gsfc.nasa.gov,NASA Goddard Space Flight Center,,G
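Once the output data frame looks right, a quick tally of the two classification columns gives an overview of the results. This is a minimal sketch: the helper name `summarize_classifications` is hypothetical, and the column names are the ones listed for **outputDf** in STEP 2.

```python
def summarize_classifications(output_df,
                              research_col='Research Classification',
                              msi_col='MSI Classification'):
    """Count rows per research and MSI classification, keeping blank entries visible."""
    # dropna=False counts rows with no classification instead of hiding them.
    research = output_df[research_col].value_counts(dropna=False)
    msi = output_df[msi_col].value_counts(dropna=False)
    return research, msi
```

For the table above, this would be called as `research_counts, msi_counts = summarize_classifications(test_outputDf)`.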
