# Exploring NLP on project XML files 
Start with installs and imports.
On your local computer, install the Saxon library with pip.
* In the Python environment you've set up, run `pip install saxonche` or `pip3 install saxonche` as needed.

You should be able to run this notebook on your local computer: 
* Navigate to the Class Examples/Python directory in your Git Bash (Windows) or Terminal (Mac),
* Type in `jupyter lab` and press enter
* Then open the localhost address you're given in your web browser. 

In [3]:
!pip install saxonche
# If the install doesn't work here in the notebook, run pip install saxonche at the command line.
import os
import spacy
# re lets us work with regular expressions in Python
import re as regex
# saxonche lets us run Saxon's XPath processor over XML files
from saxonche import PySaxonProcessor



Remember the spaCy language models? Let's try loading the large one to get the maximum amount of information from it!
There's a lot we can experiment with from spaCy, so here's a link to the documentation for our ready reference:
<https://spacy.io/usage/spacy-101> 

We're going to start by just reviewing its POS (part of speech) and NER (named entity recognition) taggers to see what we can see in your project files.


In [4]:
# spacy.cli.download("en_core_web_lg")
# YOU ONLY NEED THE LINE ABOVE ONCE: UNCOMMENT IT FOR YOUR FIRST RUN, THEN COMMENT IT OUT AGAIN.
# (download() only installs the model; spacy.load() below is what gives us the nlp object.)
nlp = spacy.load('en_core_web_lg')
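
To get a feel for what the POS and NER taggers return, here is a minimal sketch. The helper name `tag_summary` is made up for this example, and the sample sentence is one of Fry's lines from the Futurama collection. The block loads the model itself so it can run on its own, and it skips the demo if spaCy or the model isn't available.

```python
def tag_summary(doc):
    """Return (token, POS) pairs and (entity, label) pairs from a spaCy Doc."""
    # token.pos_ is the coarse part-of-speech tag; whitespace tokens are skipped.
    pos_tags = [(token.text, token.pos_) for token in doc if not token.is_space]
    # doc.ents holds the named entities the model recognized.
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    return pos_tags, entities


try:
    import spacy
    nlp = spacy.load("en_core_web_lg")
except Exception:
    # Any failure to import spaCy or load the model: skip the demo below.
    nlp = None

if nlp is not None:
    doc = nlp("Mr. Nimoy, I came as soon as I heard what happened centuries ago.")
    pos_tags, entities = tag_summary(doc)
    print(pos_tags)
    print(entities)
```

In the notebook, where `nlp` is already loaded, calling `tag_summary(nlp(some_string))` should be enough.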

Okay, let's explore some project files!
We've loaded the XML directory prepared by the Futurama team for our example here. 

* If you have some basic XML right now, like the Futurama team has prepared, we can easily zero in on tagged sections of your collection. Swap out the Futurama collection with yours, and adjust the Python code below accordingly.
* If you don't have XML at this point, you can work with plain text files instead, or just explore the Futurama collection.
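
For the text-file route, a minimal sketch might look like the following. It assumes your files end in `.txt` and are UTF-8 encoded; the function name `read_plain_text_files` is made up for this example.

```python
import os


def read_plain_text_files(input_path):
    """Yield (filename, text) for every .txt file in input_path."""
    for name in sorted(os.listdir(input_path)):
        if name.endswith(".txt"):
            # The with block closes each file as soon as it has been read.
            with open(os.path.join(input_path, name), encoding="utf-8") as f:
                yield name, f.read()
```

You could then loop over it with something like `for name, text in read_plain_text_files('your-text-directory'):` and send each `text` to spaCy. The directory name here is a placeholder for your own.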

In [5]:
# DEFINE SOME FILE PATHS FOR INPUT, AND (ONCE WE'RE READY) OUTPUT
InputPath = 'futurama-xml'
OutputPath = 'testOutput' 
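
Before going further, it can help to confirm that these paths point where you expect. This is an optional sketch, not part of the original notebook; the helper name `check_paths` is made up for this example.

```python
import os


def check_paths(input_path, output_path):
    """Confirm the input directory exists, create the output one, list the XML files."""
    if not os.path.isdir(input_path):
        raise FileNotFoundError(f"Input directory not found: {input_path}")
    # exist_ok=True makes it harmless to rerun this cell.
    os.makedirs(output_path, exist_ok=True)
    return sorted(name for name in os.listdir(input_path) if name.endswith(".xml"))
```

Calling `check_paths(InputPath, OutputPath)` would return the XML filenames found, or raise an error if the notebook was started from a different directory than you thought.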

Now, here are some functions to: 
* read input files
* pull from the XML elements with some simple XPath
* run stuff through spaCy's NLP
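
To see the select-then-clean idea in miniature without installing anything, here is a sketch that swaps in Python's built-in `xml.etree.ElementTree` for Saxon. ElementTree understands only a small subset of XPath, so treat it as a stand-in and not as a replacement for the Saxon code in this notebook. The `speak` element and `who` attribute come from the Futurama files; the function name `speeches_for`, the `episode` wrapper, and the sample XML are made up for this example.

```python
import re as regex
import xml.etree.ElementTree as ET


def speeches_for(xml_text, who):
    """Return one whitespace-normalized string per <speak who="..."> element."""
    root = ET.fromstring(xml_text)
    speeches = []
    # ElementTree has no text() step, so we select the elements themselves.
    for speak in root.iterfind(f".//speak[@who='{who}']"):
        # itertext() gathers all text in the element, including any child elements.
        raw = "".join(speak.itertext())
        # Collapse line breaks and runs of spaces into single spaces.
        speeches.append(regex.sub(r"\s+", " ", raw).strip())
    return speeches


# Made-up sample in the shape of the Futurama markup.
sample = """<episode>
  <speak who="FRY">Hey, it's that "barbecue's over" sound
again.</speak>
  <speak who="LEELA">A made-up line for contrast.</speak>
  <speak who="FRY">My God!</speak>
</episode>"""

for speech in speeches_for(sample, "FRY"):
    print(speech)
```

Normalizing whitespace like this before calling spaCy keeps the hard line breaks in the source files from showing up as extra tokens.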

Read (and adapt) the functions in the following cell from the bottom up.

In [10]:
def readTextFiles(InputPath):
    # This function reads each XML file in InputPath and uses XPath to select text from it.
    for file in os.listdir(InputPath):
        if file.endswith('.xml'):
            filepath = os.path.join(InputPath, file)
            # Read the file inside a with block so it is closed again right away.
            with open(filepath, encoding='utf-8') as f:
                xml = f.read()
            with PySaxonProcessor(license=False) as proc:
                # ebb: Here we apply the Saxon processor to read files with XPath.
                xp = proc.new_xpath_processor()
                node = proc.parse_xml(xml_text=xml)
                xp.set_context(xdm_item=node)

                # From here on, we select the string that Python will send to NLP.
                # Swap in your own XPath here, for example:
                #   '//your/xpath/here'
                #   '//info/text()'
                result = xp.evaluate('//speak[@who="FRY"]/text()')
                string = str(result)
                print(string)

readTextFiles(InputPath)


So what's Mars Day about, anyway?

Lighten up, Leela. It's funny!

I'll have a thorax and some feelers.

Yuck!

What's that weird sound?

Aw, man!

Yee-haw!

Where?

Oh. Oh!

Where?

Pfft, that's not scary!

Hook on the hand!

Man in the attic!

With pleasure. 
 Once, not 
far from here, four people set out on 
a cattle drive--

And then, while they sat helplessly 
around the campfire ... a demented knife-wielding 
escaped lunatic libertarian zombie mutant 
snuck up and--

Hey, it's that "barbecue's over" sound 
again.

Where?

My God!

Also, I didn't know buggalo could fly.

Hm.

Yeah!

Wait. That's the bead you traded your 
land for?

You know what movies average out to 
be really good? The first six Star Trek 
movies!

What words? Star Trek?

She's all yours, buddy!

Mr. Nimoy, I came as soon as I heard 
what happened centuries ago. I can't 
believe your show was banned.

You know? 1966? 79 episodes, about 30 
good ones.

Come on! Remember that episode where 
you go high on spores 