# Exploring NLP on project XML files 
You can write this in a new Python file in PyCharm, or you can work on it as a Jupyter Lab notebook that you edit on your local computer using just a web browser. I recommend working on this as a Jupyter Lab notebook.

I recommend that you copy my notebook file into your own filespace (a personal GitHub repo) where you set up a virtual environment.

* Navigate to your filespace (preferably a personal GitHub repo) in your Git Bash (Windows) or Terminal (Mac), and set up your virtual environment with this command:

`python -m venv .`

Troubleshooting: If your system does not respond to `python`, run `python3 -m venv .` instead. 

    * On Windows, that will create a `Scripts/` directory in your repo.
    * On Mac, it will create a `bin/` directory.
    * Look inside either of these: you'll see pip and other executables to manage your local venv. See the activate script? We need to run it.
* Run the appropriate activate script for your system. There's a script file named `activate` in your new virtual environment, and you need to run it with a command of the form `source yourFilePath/To/activate`. If you are at the root of your GitHub repo with the virtual environment folders nested inside, run the command appropriate to your computer:
    * Windows: `source Scripts/activate`
    * Mac: `source bin/activate`
* Now try entering: `jupyter lab`.
    * If it works, you'll open your directory in a browser and be able to navigate to the interactive notebook, and work with it and edit it there.
    * If it doesn't work, you need to install Jupyter Lab in your virtual environment. Do that with `pip install jupyterlab`, and try again.
* Where you've set up your python environment, run `pip install saxonche` or `pip3 install saxonche` as needed.
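
If you are not sure whether the activate script took effect, here is a minimal sketch of a check you could run from Python or from a notebook cell. It assumes the environment was created with `python -m venv` as described above; the function name `in_virtualenv` is made up for this example.

```python
import sys

def in_virtualenv():
    """Return True when Python is running from inside a virtual environment."""
    # Inside a venv, sys.prefix points at the venv folder,
    # while sys.base_prefix still points at the system-wide Python install.
    return sys.prefix != sys.base_prefix

print("Virtual environment active:", in_virtualenv())
# The path printed here should sit inside your Scripts/ or bin/ directory.
print("Python executable:", sys.executable)
```

If the first line prints `False`, the `pip install` commands above are probably installing into your system Python rather than into the virtual environment.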

You should be able to run this notebook on your local computer: 
* Navigate to the Class Examples/Python directory in your Git Bash (Windows) or Terminal (Mac),
* Type in `jupyter lab` and press enter
* Then open the localhost address you're given in your web browser. 

In [None]:
!pip install saxonche
import os
import spacy
import re as regex
# re lets us work with regular expressions in Python
from saxonche import PySaxonProcessor
# You may need to pip install saxonche at the command line if the install doesn't work in the notebook here.
# This lets us use Saxon XPath parsers over XML files

Remember the spaCy language models? Let's try loading the large one to get the maximum amount of information from it!
There's a lot we can experiment with from spaCy, so here's a link to the documentation for our ready reference:
<https://spacy.io/usage/spacy-101> 

We're going to start by just reviewing its POS (part of speech) and NER (named entity recognition) taggers to see what we can see in your project files.


In [None]:
# spacy.cli.download("en_core_web_lg")
# ONLY NEED ABOVE LINE ONCE. REMEMBER: COMMENT OUT THE ABOVE LINE THE NEXT TIME YOU RUN THIS.
nlp = spacy.load('en_core_web_lg')
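
To get a first look at the POS and NER taggers, here is a small sketch of a helper that collects both kinds of tags from any string. The function name `review_tags` is hypothetical; it takes the loaded `nlp` object as a parameter and relies only on spaCy's standard `token.pos_`, `doc.ents`, and `ent.label_` attributes.

```python
def review_tags(nlp, text):
    """Run text through a loaded spaCy pipeline and collect POS and NER tags."""
    doc = nlp(text)
    # token.pos_ is the coarse part-of-speech label (NOUN, VERB, PROPN, ...)
    pos_tags = [(token.text, token.pos_) for token in doc]
    # doc.ents holds the named-entity spans; ent.label_ is the entity type
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    return pos_tags, entities
```

In the notebook, after the cell that loads `nlp`, a call might look like `pos_tags, entities = review_tags(nlp, "Fry delivers pizza in New New York.")`, followed by `print(pos_tags)` and `print(entities)` to inspect the results. The sentence is just an invented test string; try a line from your own project instead.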

Okay, let's explore some project files!
We've loaded the XML directory prepared by the Futurama team for our example here. 

* If you have some basic XML right now, like the Futurama team has prepared, we can easily zero in on tagged sections of your collection. Swap out the Futurama collection with yours, and adjust the Python code below accordingly.
* If you don't have XML at this point, you can work with plain text files instead, or just explore the Futurama collection.
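
For the plain-text route, a minimal sketch might look like the following. It assumes your files end in `.txt` and are saved as UTF-8; the function name `readPlainTextFiles` is made up for this example and is not part of the notebook's own code.

```python
import os

def readPlainTextFiles(InputPath):
    """Return a dictionary of {filename: text} for every .txt file in InputPath."""
    texts = {}
    for file in sorted(os.listdir(InputPath)):
        if file.endswith('.txt'):
            filepath = os.path.join(InputPath, file)
            # The with-block closes each file as soon as it has been read.
            with open(filepath, encoding='utf-8') as f:
                texts[file] = f.read()
    return texts
```

Each string in the returned dictionary can be sent to spaCy directly, since no XPath step is needed to pull text out of markup.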

In [None]:
# DEFINE SOME FILE PATHS FOR INPUT, AND (ONCE WE'RE READY) OUTPUT
InputPath = 'futurama-xml'
OutputPath = 'testOutput' 
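
Relative paths like these are resolved from the directory where Jupyter Lab is running, so it can help to confirm that they point where you expect before going further. Here is a sketch of one way to do that; the function name `preparePaths` is hypothetical, and creating the output folder up front is my assumption about how you will want to use `OutputPath` later.

```python
import os

def preparePaths(inputPath, outputPath):
    """Check the input directory, create the output directory, and list the XML files."""
    if not os.path.isdir(inputPath):
        raise FileNotFoundError(f"Input directory not found: {inputPath}")
    # exist_ok=True means this does nothing if the directory is already there.
    os.makedirs(outputPath, exist_ok=True)
    return sorted(f for f in os.listdir(inputPath) if f.endswith('.xml'))
```

A call such as `xmlFiles = preparePaths(InputPath, OutputPath)` followed by `print(xmlFiles)` shows which files the later cells will process.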

Now, here are some functions to: 
* read input files
* pull from the XML elements with some simple XPath
* run stuff through spaCy's NLP

Read (and adapt) the functions in the following cell from the bottom up.

In [None]:
def readTextFiles(InputPath):
    # This function uses XPath to read the XML input
    for file in os.listdir(InputPath):
        if file.endswith('.xml'):
            filepath = f"{InputPath}/{file}"
            with PySaxonProcessor(license=False) as proc:
                # Read the file inside a with-block so it is closed once we have its contents.
                with open(filepath, encoding='utf-8') as f:
                    xml = f.read()
                # ebb: Here we apply the Saxon processor to read files with XPath.
                xp = proc.new_xpath_processor()
                node = proc.parse_xml(xml_text=xml)
                xp.set_context(xdm_item=node)

                # From here on, we select the string that Python will send to NLP. 
                # xpath = xp.evaluate('//your/xpath/here')
                xpath = xp.evaluate('(//speak/text()) => string-join()')
                string = str(xpath)
                print(string)
                
                
readTextFiles(InputPath)
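
When you adapt the XPath, it helps to know exactly what `//speak/text()` selects: only the text nodes that are direct children of `<speak>`, not text wrapped in elements nested inside it. The sketch below illustrates the difference using Python's built-in `xml.etree.ElementTree` rather than Saxon, so it only imitates the XPath behavior. The sample XML is invented: only the `<speak>` element name comes from the XPath above, while `<scene>`, `<emph>`, and the `who` attribute are hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented sample data for illustration only.
sample = ("<scene>"
          "<speak who='Fry'>Hello <emph>there</emph>, Leela!</speak>"
          "<speak who='Leela'>Hi.</speak>"
          "</scene>")

def direct_text(root, tag):
    """Imitate //tag/text(): only text nodes that are direct children of <tag>."""
    parts = []
    for el in root.iter(tag):
        if el.text:
            parts.append(el.text)         # text before the first child element
        for child in el:
            if child.tail:
                parts.append(child.tail)  # text that follows each child element
    return "".join(parts)

def all_text(root, tag):
    """Imitate //tag//text(): every text node anywhere inside <tag>.
    Assumes <tag> elements are not nested inside each other."""
    return "".join("".join(el.itertext()) for el in root.iter(tag))

root = ET.fromstring(sample)
print(repr(direct_text(root, "speak")))  # 'Hello , Leela!Hi.'
print(repr(all_text(root, "speak")))     # 'Hello there, Leela!Hi.'
```

Notice two things in the output. The word inside `<emph>` disappears from the first result, so if your speeches contain inline markup you may prefer `//speak//text()`. Also, because `string-join()` is called without a separator, neighboring speeches run together (`Leela!Hi.`); if that interferes with spaCy's tokenizing, `=> string-join(' ')` is one option to try.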