# HOWTO: PACMan Model Training/Testing

The goal of this notebook is to demonstrate the steps required to train a new multinomial Naive Bayes classification model.
We will start with raw proposal data located in the `proposal_data` directory and perform the following steps:

1. Proposal Scraping
  1. Extracting the Abstract and Scientific Justification sections from the .txtx files generated by the PDF-to-ASCII converter
1. Text pre-processing
  1. Tokenization
  1. Filtering stop words
  1. Lemmatization
1. Training the model on our hand classified proposals

In [2]:
import os
import sys
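# Make the parent of the current working directory importable,
# so that the `pacman2020` and `utils` imports below can be resolved.
cwd = os.getcwd()
pacman_directory = os.path.dirname(cwd)
sys.path.append(pacman_directory)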

from pacman2020 import PACManTrain, PACManPipeline
from utils.proposal_scraper import HSTProposalScraper

### 1. Proposal Scraping
We use the `HSTProposalScraper` class contained in the `proposal_scraper` module in the `utils` subpackage. We specify that we are scraping the proposals with the intention of using them for training and that we only want to scrape proposals in Cycle 24.
- By setting `for_training=True`, the software automatically looks for a file containing the hand classifications for the list of proposals and saves the scraped proposal information in a subdirectory of `~/PACMan_dist/training_data/`. In this example, the subdirectory will be named `training_corpus_cy24` and it will contain all of the training data for the given cycle, as well as the file containing the hand classifications.
- For the hand classifications, we adopt the following naming convention: `cycle_CYCLENUMBER_hand_classifications.txt`
   - e.g. `cycle_24_hand_classifications.txt` contains the hand classification of each proposal for cycle 24.
- Additionally, the file should only contain two columns, `proposal_num` and `hand_classification`. Below is an example snippet of what the file should look like:
    
    ```text
    proposal_num,hand_classification
    0001,stellar physics
    0002,stellar physics
            .
            . 
            .
    ```
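
The naming convention and two-column layout described above can be checked with a few lines of standard-library Python. This is only a sketch: the helper names `hand_classification_filename` and `read_hand_classifications` are hypothetical and are not part of PACMan. Proposal numbers are kept as strings so that leading zeros such as `0001` survive.

```python
import csv
import io


def hand_classification_filename(cycle: int) -> str:
    """Build a file name that follows the naming convention above."""
    return f"cycle_{cycle}_hand_classifications.txt"


def read_hand_classifications(handle) -> dict:
    """Map proposal number -> category, keeping leading zeros intact."""
    reader = csv.DictReader(handle)
    expected = ["proposal_num", "hand_classification"]
    if reader.fieldnames != expected:
        raise ValueError(f"expected columns {expected}, got {reader.fieldnames}")
    return {row["proposal_num"]: row["hand_classification"] for row in reader}


# In-memory stand-in for the contents of cycle_24_hand_classifications.txt
example = io.StringIO(
    "proposal_num,hand_classification\n"
    "0001,stellar physics\n"
    "0002,stellar physics\n"
)
labels = read_hand_classifications(example)
print(hand_classification_filename(24))
print(labels)
```

To read a real file, pass an open file handle (for example `open(path, newline="")`) instead of the `io.StringIO` stand-in.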


In [None]:
pacman_scraper = HSTProposalScraper(for_training=True, cycles_to_analyze=[24])
pacman_scraper.scrape_cycles()

In [None]:
!ls ../training_data/training_corpus_cy24/ | wc

### 2. Text Preprocessing
The `PACManTrain` class contained in the `pacman2020` module is capable of performing all of the necessary preprocessing steps. Just like before, we specify the cycles we want to analyze, which in this case is just cycle 24.

In summary, this step processes each input proposal with the `spaCy` NLP package to generate a `Doc` object, which is a sequence of tokens. Each token is an individual word that carries a variety of semantic information derived from the word and its context in a sentence. We leverage this information to filter out stop words, punctuation, etc. This is the slowest step of the entire process and, if needed, it can be sped up using the multithreading behavior of `spaCy`.
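
The three preprocessing steps (tokenization, stop-word filtering, lemmatization) can be illustrated with a toy, standard-library stand-in. This is not how `spaCy` or PACMan implement them: the `STOP_WORDS` set, the `LEMMAS` lookup table, and the `preprocess` function are made up for this sketch, and `spaCy` uses far richer, context-aware resources.

```python
import re

# Tiny illustrative resources; real NLP libraries ship much larger ones.
STOP_WORDS = {"the", "of", "in", "we", "to", "a", "and", "are", "is"}
LEMMAS = {
    "galaxies": "galaxy",
    "stars": "star",
    "observed": "observe",
    "forming": "form",
}


def preprocess(text: str) -> list:
    # 1. Tokenization: lowercase and keep alphabetic runs, dropping punctuation.
    tokens = re.findall(r"[a-z]+", text.lower())
    # 2. Stop-word filtering: drop very common words that carry little meaning.
    tokens = [tok for tok in tokens if tok not in STOP_WORDS]
    # 3. Lemmatization: map inflected forms to a base form via a lookup table.
    return [LEMMAS.get(tok, tok) for tok in tokens]


cleaned = preprocess("We observed the stars forming in nearby galaxies.")
print(cleaned)  # ['observe', 'star', 'form', 'nearby', 'galaxy']
```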


In [None]:
pacman_training = PACManTrain(cycles_to_analyze=[24])
pacman_training.read_training_data(parallel=False)

For each proposal cycle in the `cycles_to_analyze` argument, the tokenizer will perform the necessary preprocessing steps and save the proposal number, text, cleaned text, filename, the hand-classified science category, and the encoded value of the hand-classified category. The results are stored in the `PACManTrain.proposal_data` attribute, which holds one pandas DataFrame per cycle under keys such as `cycle_24`.

In [None]:
print(pacman_training.proposal_data.keys())
pacman_training.proposal_data['cycle_24'].head()
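
The "encoded value" of a category is simply an integer that stands in for the category name so that a classifier can work with it. The sketch below shows one common way to build such an encoding. It is an illustration only: the category names other than "stellar physics" are hypothetical, and the exact encoding PACMan uses may differ.

```python
# Hypothetical hand classifications for four proposals.
categories = ["stellar physics", "galaxies", "stellar physics", "solar system"]

# Assign each distinct category an integer, in sorted order for reproducibility.
encoder = {name: code for code, name in enumerate(sorted(set(categories)))}
encoded = [encoder[name] for name in categories]

print(encoder)  # {'galaxies': 0, 'solar system': 1, 'stellar physics': 2}
print(encoded)  # [2, 0, 2, 1]
```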

We can use the resulting DataFrame to take a quick look at the distribution of proposal categories:

In [None]:
proposal_categories = pacman_training.proposal_data['cycle_24']['hand_classification'].value_counts()
print(proposal_categories)
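
For a rough visual impression of such counts without a plotting library, a text bar chart is often enough. The sketch below uses a hypothetical list of labels standing in for the `hand_classification` column; the category names and counts are invented for illustration.

```python
from collections import Counter

# Hypothetical labels standing in for the `hand_classification` column.
labels = ["stellar physics"] * 5 + ["galaxies"] * 3 + ["solar system"] * 2

counts = Counter(labels)
width = max(len(name) for name in counts)

# One line per category, most frequent first: padded name, bar, count.
lines = [f"{name:<{width}} {'#' * n} {n}" for name, n in counts.most_common()]
print("\n".join(lines))
```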

### 3. Training

Now that we have all the proposal information loaded, we can train our Multinomial Naive Bayes classifier. When no model or vectorizer is specified, the software will use the default classifier (Multinomial Naive Bayes) and the default vectorizer (term frequency-inverse document frequency, TF-IDF). In theory, you can pass any classifier and any vectorizer you want!

In [None]:
pacman_training.fit_model(pacman_training.proposal_data["cycle_24"])

In [None]:
pacman_training.model
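
To make the combination of a TF-IDF vectorizer and a Multinomial Naive Bayes classifier concrete, here is a from-scratch sketch in plain Python. It is not PACMan's implementation: all function names (`fit_idf`, `tfidf`, `fit_nb`, `predict`) and the toy documents are hypothetical. It uses a smoothed IDF and Laplace smoothing, and it skips the vector normalization that production vectorizers typically apply.

```python
import math
from collections import Counter, defaultdict


def fit_idf(docs):
    """Smoothed inverse document frequency for every term in the corpus."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}


def tfidf(doc, idf):
    """Weight raw term counts by IDF; terms unseen in training are dropped."""
    return {t: n * idf[t] for t, n in Counter(doc).items() if t in idf}


def fit_nb(docs, labels, idf, alpha=1.0):
    """Multinomial Naive Bayes on TF-IDF weights with Laplace smoothing."""
    totals = defaultdict(Counter)  # class -> summed weight of each term
    for doc, label in zip(docs, labels):
        totals[label].update(tfidf(doc, idf))
    class_counts = Counter(labels)
    model = {}
    for label, weights in totals.items():
        denom = sum(weights.values()) + alpha * len(idf)
        log_prior = math.log(class_counts[label] / len(labels))
        log_lik = {t: math.log((weights[t] + alpha) / denom) for t in idf}
        model[label] = (log_prior, log_lik)
    return model


def predict(doc, idf, model):
    """Return the class with the highest log posterior score."""
    features = tfidf(doc, idf)
    scores = {
        label: log_prior + sum(w * log_lik[t] for t, w in features.items())
        for label, (log_prior, log_lik) in model.items()
    }
    return max(scores, key=scores.get)


# Toy, already-preprocessed training documents (hypothetical).
train_docs = [
    ["star", "cluster", "stellar", "mass"],
    ["stellar", "spectrum", "star"],
    ["galaxy", "redshift", "halo"],
    ["galaxy", "cluster", "redshift"],
]
train_labels = ["stellar physics", "stellar physics", "galaxies", "galaxies"]

idf = fit_idf(train_docs)
model = fit_nb(train_docs, train_labels, idf)
prediction = predict(["stellar", "mass", "star"], idf, model)
print(prediction)  # stellar physics
```

Terms that appear only in one class pull the score strongly toward that class, while the smoothing term `alpha` keeps unseen term and class combinations from producing a zero probability.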

### 4. Testing 

In [3]:
pacman_pipeline = PACManPipeline(cycle=25, model_name='pacman_production_model.joblib')

In [4]:
pacman_pipeline.read_data(cycle=25, N=30, parallel=False)

INFO [pacman2020.read_data:311] Reading in 30 proposals...
Data Directory: /user/nmiles/PACMan_dist/unclassified_proposals/corpus_cy25
0it [00:00, ?it/s]
INFO [pacman2020.preprocess:282] Total time for preprocessing: 0.000
