# Welcome!
This Colab notebook is a walkthrough of how to use Maire Volz's function **CarnegieMatching_OneSpreadsheet()** from the file **CarnegieMatchingFunctions.py**. The goal of this tutorial is to identify the research and MSI (Minority Serving Institution) classification of each institution named in an input spreadsheet.

The files you will need to run this code are listed below:
1. Your own Colab notebook.
2. The file **CarnegieMatchingFunctions.py**.
3. The most up-to-date "exceptions" file. It is useful to include the creation date in the file name, e.g. **Exceptions_3-10_1.csv** (the code below loads **Exceptions_01_26b_24.csv**).
4. Carnegie's database of institutions (2021 is the latest): **CarnegieClassification_Data.csv**.
5. An input file of your choosing. In this notebook, I will be using **Inst_class_test.csv**.

Please make sure each of these required files is on your device before beginning this tutorial. If you do not have access to these files, contact antonino.cucchiara@nasa.gov. Thank you!


Original GitHub: [Maire Volz](https://github.com/maireav/NASA-Internship/tree/main/Fellow%20and%20Institution%20Map)

### STEP 1: Import packages and files.
Before we begin coding, we have to import all of the necessary packages and files. The files listed in the previous text cell should already be downloaded onto your device.

In [3]:
# Google Colab is often convenient because many useful libraries are already installed there.
# To import .py files in Colab, you must first mount your Google Drive.
# If you are running in Colab, UNCOMMENT the following two lines:
#from google.colab import drive
#drive.mount('/content/drive')
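
If the same notebook is run both locally and in Colab, the mount step can be guarded rather than commented out. The sketch below is one possible approach and is not part of the original notebook. The helper name `running_in_colab` is hypothetical, and it assumes that the `google.colab` package is importable only inside Colab.

```python
import importlib.util


def running_in_colab() -> bool:
    """Return True when the google.colab package can be found (assumed to mean Colab)."""
    try:
        return importlib.util.find_spec("google.colab") is not None
    except ModuleNotFoundError:
        # Raised when the parent package "google" is not installed at all.
        return False


if running_in_colab():
    # Only executed in Colab; this opens the usual Drive authorization prompt.
    from google.colab import drive
    drive.mount('/content/drive')
else:
    print("Not running in Colab; skipping the Google Drive mount.")
```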

In [4]:
# Import pandas
import pandas as pd

In [5]:
!pwd

/Users/acucchia/codes/ninoc_git/Institution_classification


In [6]:
# Import the Carnegie and Exceptions databases
# Define the path of the tutorial directory that holds your necessary files. On your Google Drive,
# it could be something like the following:
#topdirectory = '/content/drive/MyDrive/Colab Notebooks/Carnegie Matching Tutorial/'
#
topdirectory = '/Users/acucchia/codes/ninoc_git/Institution_classification/'
#
carnegieData = pd.read_csv(topdirectory + 'CarnegieClassification_Data.csv')
exceptionsData = pd.read_csv(topdirectory + 'Exceptions_01_26b_24.csv')
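
Because `pd.read_csv` fails with a `FileNotFoundError` on the first missing file, it can help to check all required files up front. The following is a minimal sketch using only the standard library. The helper name `find_missing_files` is hypothetical, and the file names are the ones used in this notebook.

```python
from pathlib import Path


def find_missing_files(directory, filenames):
    """Return the names in `filenames` that are not regular files inside `directory`."""
    base = Path(directory)
    return [name for name in filenames if not (base / name).is_file()]


required_files = [
    "CarnegieMatchingFunctions.py",
    "CarnegieClassification_Data.csv",
    "Exceptions_01_26b_24.csv",
    "Inst_class_test.csv",
]

# Replace "." with your own topdirectory.
missing = find_missing_files(".", required_files)
print("Missing files:", missing)
```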

In [7]:
# Now, we will import the functions from CarnegieMatchingFunctions.py.
# Add the directory that contains the .py file to the Python path using sys.
# The argument of "sys.path.append" should be the path of the directory in which the .py file is saved.
# In Colab, Google Drive must already be mounted for this directory to be visible.
import sys
sys.path.append(topdirectory)

In [8]:
import CarnegieMatchingFunctions as CMF
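
The `sys.path` mechanism used above can be tried in isolation. The sketch below writes a small stand-in module to a temporary directory, adds that directory to the Python path, and imports it. The module name `demo_functions` and its `greet` function are hypothetical placeholders and have nothing to do with the contents of **CarnegieMatchingFunctions.py**.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Write a tiny stand-in module to a temporary directory (for demonstration only).
module_dir = Path(tempfile.mkdtemp())
(module_dir / "demo_functions.py").write_text("def greet():\n    return 'hello'\n")

# Make the directory importable, avoiding duplicate entries when the cell is re-run.
if str(module_dir) not in sys.path:
    sys.path.append(str(module_dir))

# Refresh the import system's caches so the newly created file is found.
importlib.invalidate_caches()
demo = importlib.import_module("demo_functions")
print(demo.greet())
```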

In [9]:
# Finally, import and view the raw data we'll be analyzing 
inputData = pd.read_csv(topdirectory + 'Inst_class_test.csv')

inputData

Unnamed: 0,Institution
0,Jackson State University
1,Tuskegee University
2,Delaware State University
3,Colorado Mesa University
4,Delaware State University
5,"California State University, Northridge"
6,Florida Agricultural and Mechanical University
7,University of Houston-Clear Lake
8,California State University Northridge
9,California State University Los Angels
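
The sample above contains an exact duplicate (Delaware State University) and two entries that differ only in punctuation (the California State University Northridge rows). The sketch below counts distinct names before and after a simple normalization. This normalization is an illustrative assumption: the tutorial does not describe how `CarnegieMatching_OneSpreadsheet()` homogenizes names, and the helper `normalize_name` is hypothetical.

```python
import re

# The ten institution names shown in the output above.
names = [
    "Jackson State University",
    "Tuskegee University",
    "Delaware State University",
    "Colorado Mesa University",
    "Delaware State University",
    "California State University, Northridge",
    "Florida Agricultural and Mechanical University",
    "University of Houston-Clear Lake",
    "California State University Northridge",
    "California State University Los Angels",
]


def normalize_name(name: str) -> str:
    """Lowercase the name and replace each run of non-alphanumeric characters with one space."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


raw_unique = set(names)
normalized_unique = {normalize_name(name) for name in names}

print(len(raw_unique))         # 9 distinct raw strings
print(len(normalized_unique))  # 8 distinct names after normalization
```

Note that a misspelling such as "Los Angels" survives this kind of normalization, which is the sort of case where the function's interactive prompts come in.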


### STEP 2: Running the code.
In this step, we will run the function **CMF.CarnegieMatching_OneSpreadsheet()**. This function has the following arguments:
1. **inputData** = A data frame containing institution names that we wish to identify (imported earlier).
2. **institution_column** = The name (a string) of the column within *inputData* that contains the institution names.
3. **carnegieData** = A data frame of the Carnegie classification database (imported earlier).
4. **exceptionsData** = A data frame of potentially erroneous or missing names from the Carnegie database (imported earlier).

The function returns the following:
1. **outputDf** = A data frame identical to inputData, except with three additional columns: "Homogenized Institution Name", "Research Classification" and "MSI Classification".
2. **exceptionsDf** = A data frame identical to exceptionsData, except with additional columns containing new exceptions encountered when analyzing *inputData*.
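
Once the function has run, a quick sanity check is to confirm that the three added columns are present in the output. The sketch below works on any list of column names. The helper `missing_output_columns` is hypothetical, and the column names are taken from the description above.

```python
# Column names that the function is described as adding to the output.
EXPECTED_NEW_COLUMNS = [
    "Homogenized Institution Name",
    "Research Classification",
    "MSI Classification",
]


def missing_output_columns(columns):
    """Return the expected new columns that are absent from `columns`."""
    present = set(columns)
    return [name for name in EXPECTED_NEW_COLUMNS if name not in present]


# Example with a made-up column list in which one expected column is missing.
example_columns = ["Institution", "Homogenized Institution Name", "Research Classification"]
print(missing_output_columns(example_columns))  # ['MSI Classification']
```

With a real result, the call would be `missing_output_columns(outputDf.columns)`, and an empty list means all three columns were added.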


#### **IMPORTANT NOTE**

To return a complete list of MSI and research classifications, the function may pause at institution names that it does not recognize in either **carnegieData** or **exceptionsData**. This can happen because the institution is not a university, because it has not been encountered in previous runs of this function, or because the name has unusual capitalization or spacing.

In any case, the user will be prompted to input another name for the erroneous institution that the code can search for again. You may need to manually search **carnegieData** or **exceptionsData** for the correct name. If the name is not in either of those spreadsheets, you will need to input the classification by hand. Follow the code's instructions for more details.

In the last step before the code finishes, you will be prompted with a final question: "*Would you like to automatically update and save your exceptions file ? (Yes/No)*". Answering "Yes" is highly encouraged: the code will then ask for your input and save the exceptions file as a .csv on your device, which protects any new exceptions from being lost. If you respond "No", the exceptions file is still returned by the function, but you will need to save it yourself.
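
If you answer "No", one way to save the returned exceptions yourself is to build a dated file name first, in line with the earlier suggestion to keep the creation date in the name. The sketch below is an assumption about a convenient naming scheme, not the scheme used by the function. The helper `dated_exceptions_path` is hypothetical.

```python
from datetime import date
from pathlib import Path


def dated_exceptions_path(directory, day=None):
    """Build a path such as <directory>/Exceptions_2024-01-26.csv (hypothetical naming scheme)."""
    day = day or date.today()
    return Path(directory) / f"Exceptions_{day.isoformat()}.csv"


# Example with a fixed date so the resulting name is reproducible.
print(dated_exceptions_path("my_exceptions_folder", date(2024, 1, 26)).name)
```

The returned data frame could then be written with `exceptionsDf.to_csv(dated_exceptions_path(topdirectory), index=False)`. Whether to pass `index=False` depends on whether your existing exceptions file includes an index column, so check that first.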

In [10]:
# For demonstration purposes, you can shorten inputData in this cell.
# The test file has only 10 rows, so we use all of them here.
test_inputData = inputData

In [11]:
# Define the column in inputData where the institution names are listed
# (it is OK if there are more columns; the code will ignore them)

institution_column = 'Institution'

In [None]:
# Now it's time to run the function! When you run this code, you will encounter various input requests as you go along.
# Respond to each question to the best of your knowledge, referring to carnegieData and exceptionsData when needed.
test_outputDf, test_exceptionsDf = CMF.CarnegieMatching_OneSpreadsheet(test_inputData,
                                                                       institution_column,
                                                                       carnegieData,
                                                                       exceptionsData)

STEPS 1-3 COMPLETE
STEP 4 COMPLETE


In [None]:
# Success! Now that the code is done, make sure to save your files if you haven't done so already.
test_outputDf.to_csv(topdirectory + 'test_prop_withClassifications.csv')

In [None]:
test_outputDf