## Words to Network Data

Now that we know some ways to pull an interesting selection of words, let's think about them like a network. Network analysis involves looking at how things relate to each other based on shared properties. 

In this notebook, we will address a simple research question: Which **files** (or **chapters**, or **scenes**, or **authors**, or **titles**) share each other's top 5 or 10 **verbs** (or **adjectives**, or **nouns**, or **WordNet synsets**, etc.)? 

We will try exploring this as a network of data and import it into network visualization software. 

### Installs in your Terminal 
* If you did not do this in the previous notebook: `pip install spacy` (or `python3.12 -m pip install spacy`)
* We need the **pandas** library to make dataframes: `pip install pandas` (or `python3.12 -m pip install pandas`)

<pre>
    ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣤⣶⣿⣷⣦⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⣿⣿⣿⣿⣿⡇⠀⠀⠀⠀⠀⠀⠀⣰⣿⣿⣿⣦⡀
⠀⠀⠀⠀⢀⠄⠀⠀⠀⠀⢈⣿⣿⣿⠟⠛⠁⠀⠀⠀⠀⠀⠀⠐⢿⣿⣿⣿⣿⣷
⠀⠀⢀⡔⠁⠀⠀⠀⢀⣴⡿⠃⠈⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⢿⣿⠟⠁
⠀⣠⡟⠀⠀⠀⠀⣰⣿⣿⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⢠⣿⠁⠀⠀⠀⢰⣿⣿⡏⠀⠀⠀⢀⣤⣤⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⣾⣿⠀⠀⠀⠀⢸⣿⣿⡇⠀⠀⢠⣿⣿⣿⣿⠀⠀⠀⠀⣰⣾⣶⣆⠀⠀⠀⠀⡄
⣿⣿⡀⠀⠀⠀⢸⣿⣿⣧⠀⠀⢸⣿⣿⡿⠃⠀⠀⠀⠀⢿⣿⣿⣿⡆⠀⠀⢸⣧
⢿⣿⣧⠀⠀⠀⢸⣿⣿⣿⣦⡀⠀⠉⠉⠀⠀⠀⠀⠀⠀⠈⠻⣿⠿⠁⠀⢠⣿⣿
⢸⣿⣿⣷⣄⠀⣿⣿⣿⣿⣿⣿⣷⣦⣄⠀⠲⣶⣶⣶⠀⠀⠀⠀⠀⣀⣴⣿⣿⣿
⢰⣿⣿⣿⣿⣷⣽⣿⣿⣿⣿⣿⣿⣿⣿⣦⡄⣬⣅⣀⣠⣾⣿⣿⣿⣿⣿⣿⣿⡟
⠈⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣷⣤⣤⣾⣿⣿⣿⣿⣿⣿⣿⣿⣿⠇
⠀⠹⣿⣿⣿⣿⣿⣿⢿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡏⠀⠈⣿⣿⣿⣿⣿⣿⣿⡿⠀
⠀⠀⠻⣿⣿⣿⣿⣿⡎⢿⣿⣿⣿⣿⣿⣿⣿⣿⡇⠀⠀⣿⣿⣿⣿⣿⣿⣿⠃⠀
⠀⠀⠀⠙⠻⠿⣿⣿⠗⠀⢻⣿⣿⣿⣿⣿⣿⣿⡇⠀⢸⣿⣿⣿⣿⣿⡿⠃⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠹⣿⣿⣿⣿⣿⣿⣿⠀⢸⣿⣿⣿⡿⠟⠁⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠛⠿⢿⣿⣿⡿⠇⠀⠈⠉⠁⠀⠀⠀⠀⠀⠀
</pre>



Image credit: <https://emojicombos.com/panda-ascii-art> 

### Language Models
Like NLTK's collections that we downloaded, [spaCy has trained language models](https://spacy.io/usage/models) to download. You download these in your Python script after you import spacy, and after you download once, you don't need to do it again (so you can comment out the download line). We're going to try the medium and large models in English. (It's good to know that both spaCy and NLTK have resources for NLP on other languages, too!)  

In [None]:
import spacy
import os

### Downloading language models
To work with spaCy's pre-trained language models, you need to download them to your virtual environment. There are:
* en_core_web_sm (smallest--not as much info as the others)
* **en_core_web_md** (Pretty good data here, size: 34 MB)
* **en_core_web_lg** (Lots of data here, size: 400 MB)

Note that the LARGEST one will have the most data and probably be the most reliable. 

In [None]:
# CAN YOU SKIP THIS???
# After you download a model into your virtual environment for the first time, you can comment out the download line.
# spaCy's medium and large models will give us the best results for NLP tagging.
# Note: the download call only installs the model. It does not return it, so we don't store its result in a variable.
# spacy.cli.download("en_core_web_sm")
# spacy.cli.download("en_core_web_md")
spacy.cli.download("en_core_web_lg")

### Load the model 
Now we define the nlp variable by LOADING the model you downloaded.

In [None]:
nlp = spacy.load("en_core_web_lg")

## File Collection
You can set this to read a collection of text files that you've prepped for natural language processing. 
For this notebook, we'll work with just the dialogue text we pulled from the One Piece project. 
The text is organized as one file for each volume of One Piece, so we can explore how much the dialogue relies on certain words, and how that might change over the volumes.

We're going to get volume information from the One Piece filenames (which look like this: `vol-9.txt`). We can use the `os` library to help us isolate the filename from the file extension. Let's try that so we can simplify our references to the volumes when we output our data.


In [None]:
collPath = 'onepiece-nlp-text'
for file in os.listdir(collPath):
    if file.endswith(".txt"):
        filepath = f"{collPath}/{file}"
        name, extension = os.path.splitext(file)
        print(name)
        with open(filepath, 'r', encoding='utf8') as f:
            readFile = f.read()
            lengthFile = len(readFile)
            print(lengthFile)
# We're just printing out the filepaths and lengths of the files as a "smoke test" to see if we're reading the files. 
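
One thing to keep in mind: `os.listdir()` returns filenames in arbitrary order, so the volumes may not print in sequence. Below is a minimal sketch of one way to sort them numerically, assuming every filename follows the `vol-N.txt` pattern. The helper name `volumeNumber` and the sample list are made up for illustration.

```python
import os

def volumeNumber(filename):
    # Hypothetical helper: "vol-9.txt" -> 9 (assumes the vol-N.txt naming pattern)
    name, extension = os.path.splitext(filename)
    return int(name.split("-")[1])

# Sample filenames standing in for os.listdir(collPath)
sampleFiles = ["vol-10.txt", "vol-2.txt", "notes.md", "vol-1.txt"]
txtFiles = [file for file in sampleFiles if file.endswith(".txt")]

# Sort by the number in the filename rather than alphabetically
sortedFiles = sorted(txtFiles, key=volumeNumber)
print(sortedFiles)
```

Sorting alphabetically instead would put `vol-10.txt` before `vol-2.txt`, which is why the sketch converts the volume number to an integer first.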

## Collect some words with a little help from spaCy
Let's build this up to go file by file. We'll start by surveying what we have... (Ultimately we want to collect the top 5 or 10 most frequently repeating words of a kind.)

In [None]:
# SURVEY IT ALL! :-) (Yeah, lots of data...)
for file in os.listdir(collPath):
    if file.endswith(".txt"):
        filepath = f"{collPath}/{file}"
        name, extension = os.path.splitext(file)
        print(name)
        with open(filepath, 'r', encoding='utf8') as f:
            readFile = f.read()
            spacyRead = nlp(readFile)
            for token in spacyRead:
                print(token.text, "---->", token.pos_, ":::::", token.lemma_)                 

In [None]:
# Remember, we can request an explanation from spaCy of its tags: 
spacy.explain("DET")

### Okay, let's select something interesting
 ...and maybe only collect the lemmas.

Let's do this with a **function** call. We can easily switch the kind of word we want to output this way.
Here we are starting to associate words with volumes...we're going to refine that by making a dataframe next.
 

In [None]:
def wordCollector(words, unit):
    wordList = []
    for token in words:
        if token.pos_ == "ADJ":
            wordList.append((token.lemma_, unit))
    uniqueLems = set(wordList)
    return uniqueLems

for file in os.listdir(collPath):
    if file.endswith(".txt"):
        filepath = f"{collPath}/{file}"
        name, extension = os.path.splitext(file)
        with open(filepath, 'r', encoding='utf8') as f:
            readFile = f.read()
            spacyRead = nlp(readFile)
            myWords = wordCollector(spacyRead, name)
            print(myWords)
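
The set above tells us which adjectives appear in each volume, but not how often. To get at the top 5 or 10 words, one option is `collections.Counter` from the standard library. This is a sketch, not this notebook's own method: the function name `topWords` is made up for illustration, and plain `(lemma, pos)` tuples stand in for spaCy tokens so the example runs without a model loaded.

```python
from collections import Counter

def topWords(taggedWords, pos, howMany):
    # Count only the lemmas whose part of speech matches, then keep the most frequent.
    counts = Counter(lemma for lemma, tag in taggedWords if tag == pos)
    return counts.most_common(howMany)

# Invented sample data standing in for (token.lemma_, token.pos_) pairs
sampleWords = [("great", "ADJ"), ("pirate", "NOUN"), ("great", "ADJ"),
               ("big", "ADJ"), ("great", "ADJ"), ("big", "ADJ"), ("new", "ADJ")]
print(topWords(sampleWords, "ADJ", 2))
```

With real spaCy output, you could probably build the input list with something like `[(token.lemma_, token.pos_) for token in spacyRead]`.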

### Let's try dataframes to store these relations of word-form to unit 
Dataframes, supported by the pandas library in Python, facilitate handling data in spreadsheet / CSV / TSV format.
We want to output some data to a TSV (tab-separated values) file so we can send it to network visualization software.

If you haven't already, open a terminal and **pip install pandas** (or python3.12 -m pip install pandas) to get dataframes working.

In [None]:
import pandas as pd

def wordCollector(words, unit):
    wordList = []
    nodeAtts = []
    unitList = []
    for token in words:
        if token.pos_ == "ADJ":
            wordList.append(token.lemma_)
            nodeAtts.append(token.pos_)
            unitList.append(unit)

    data = {
        'word': wordList,
        'nodeType': nodeAtts,
        'unit': unitList
    }
    df = pd.DataFrame(data)
    return df

for file in os.listdir(collPath):
    if file.endswith(".txt"):
        filepath = f"{collPath}/{file}"
        name, extension = os.path.splitext(file)
        with open(filepath, 'r', encoding='utf8') as f:
            readFile = f.read()
            spacyRead = nlp(readFile)
            myDataFrame = wordCollector(spacyRead, name)
            print(myDataFrame)

## Write the dataframes to an output TSV file
TSV means **tab-separated-values**. We'll use the tab character as a separator.
The other popular alternative is the CSV: **comma-separated-values**, which uses the comma as a separator.
You can use either one for our network data.

I'm going to output some "node attributes" for the network. In a network visualization, we'll want to be able to quickly distinguish the word nodes from the unit nodes (possibly by shape), and providing a qualifier or "node attribute" in the dataframe will be helpful.
We can add other kinds of node attributes perhaps as numerical ranges (see below).

In [None]:
import pandas as pd

def wordCollector(words, unit):
    wordList = []
    nodeAtts = []
    unitList = []
    for token in words:
        if token.pos_ == "ADJ":
            wordList.append(token.lemma_)
            nodeAtts.append(token.pos_)
            unitList.append(unit)

    data = {
        'word': wordList,
        'nodeType': nodeAtts,
        'unit': unitList
    }
    df = pd.DataFrame(data)
    return df
    # This is returning a separate dataframe for every source text file. 

# We need to consolidate all the dataframes into one file. Collect all dataframes here!
allDataFrames = []

for file in os.listdir(collPath):
    if file.endswith(".txt"):
        filepath = f"{collPath}/{file}"
        name, extension = os.path.splitext(file)
        with open(filepath, 'r', encoding='utf8') as f:
            readFile = f.read()
            spacyRead = nlp(readFile)
            myDataFrame = wordCollector(spacyRead, name)
            # Add each individual dataframe as it comes out into the list of dataframes!
            allDataFrames.append(myDataFrame)

# Make an output filepath
outputFilePath = 'networkData.tsv'
# Turn the list of dataframes into one dataframe:
fullDataFrame = pd.concat(allDataFrames, ignore_index=True)

# Note, since Pandas knows how to open and write files line by line, we can skip that open() step we used last time.
fullDataFrame.to_csv(outputFilePath, sep='\t', index=False)
print('I just saved a dataframe as a TSV file.')
# Go check your filestash for the file. 
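
Because the collector appends one row per token, the TSV repeats the same word-unit pair every time that word occurs. Some network tools work more smoothly with one row per connection and a count attached. Here is a minimal sketch of that, assuming the same column names as above. The `weight` column name and the tiny sample dataframe are assumptions for illustration.

```python
import pandas as pd

# Invented sample rows in the same shape as the dataframe built above
sampleFrame = pd.DataFrame({
    'word': ['great', 'great', 'big', 'great'],
    'nodeType': ['ADJ', 'ADJ', 'ADJ', 'ADJ'],
    'unit': ['vol-1', 'vol-1', 'vol-1', 'vol-2']
})

# Collapse repeated rows into one row per word-unit pair, with a count of the repeats.
weighted = (sampleFrame
            .groupby(['word', 'nodeType', 'unit'])
            .size()
            .reset_index(name='weight'))
print(weighted)
```

The same `groupby` step could be applied to `fullDataFrame` before calling `to_csv()`, so the count travels along as an edge attribute.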

### Something to try...
* Try developing this to return synsets from WordNet.
* Add an ambiguity metric: count the number of synsets each word belongs to.
* Output this as a second "node attribute" qualifier to the Part of Speech. We could use that to color-code the words by ambiguity...

In [None]:
# This time, with WordNet data!
# If you haven't downloaded WordNet yet, run nltk.download('wordnet') once first.
import pandas as pd
import nltk
from nltk.corpus import wordnet as wn

def wordCollector(words, unit):
    wordList = []
    nodeAtts = []
    synsetCounts = []
    unitList = []
    for token in words:
        if token.pos_ == "ADJ":
            # Look up the lemma in WordNet and count its synsets (across all parts of speech).
            synsets = len(wn.synsets(token.lemma_))
            wordList.append(token.lemma_)
            nodeAtts.append(token.pos_)
            synsetCounts.append(synsets)
            unitList.append(unit)
                          
    data = {
        'word': wordList,
        'nodeType': nodeAtts,
        'synsetCount': synsetCounts,
        'unit': unitList
    }
    df = pd.DataFrame(data)
    return df
    # This is returning a separate dataframe for every source text file. 

# We need to consolidate all the dataframes into one file. Collect all dataframes here!
allDataFrames = []

for file in os.listdir(collPath):
    if file.endswith(".txt"):
        filepath = f"{collPath}/{file}"
        name, extension = os.path.splitext(file)
        with open(filepath, 'r', encoding='utf8') as f:
            readFile = f.read()
            spacyRead = nlp(readFile)
            myDataFrame = wordCollector(spacyRead, name)
            # Add each individual dataframe as it comes out into the list of dataframes!
            allDataFrames.append(myDataFrame)

# Make an output filepath
outputFilePath = 'networkData-2.tsv'
# Turn the list of dataframes into one dataframe:
fullDataFrame = pd.concat(allDataFrames, ignore_index=True)

# Note, since Pandas knows how to open and write files line by line, we can skip that open() step we used last time.
fullDataFrame.to_csv(outputFilePath, sep='\t', index=False)
print('I just saved a NEW dataframe as a NEW TSV file.')
# Go check your filestash for the file. 
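
If a color scale with dozens of distinct synset counts turns out to be hard to read, one option is to group the counts into a few ranges and output the range label as another node attribute. The function name `ambiguityRange`, the cut-off values, and the labels below are arbitrary placeholders, not recommendations. Adjust them after looking at your own data.

```python
def ambiguityRange(synsetCount):
    # Hypothetical cut-offs: tune these to the spread of counts in your own data.
    if synsetCount == 0:
        return 'none'
    elif synsetCount <= 3:
        return 'low'
    elif synsetCount <= 10:
        return 'medium'
    else:
        return 'high'

# Invented sample counts to show each range
sampleCounts = [0, 2, 7, 25]
print([ambiguityRange(count) for count in sampleCounts])
```

Inside `wordCollector`, you could collect `ambiguityRange(synsets)` in one more list and add it to the `data` dictionary as an extra column.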

## What will we do with this fancy TSV?

We can analyze it as a **network** of data using network analysis. You could try importing your TSV into kumu.io, but we're going to introduce Cytoscape, which is a very powerful, artful tool with lots of visualization options. 

To view the data we extracted in this notebook as an SVG export from Cytoscape, see <https://raw.githubusercontent.com/newtfire/textAnalysis-Hub/refs/heads/main/Class-Examples/Python/readFileCollections-examples/networkData-2.svg>
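
Before opening any visualization software, you can already ask the network question in plain Python: which units share words? Here is a rough sketch, assuming rows shaped like the `word` and `unit` columns of our TSV. The sample rows are invented, and `wordsByUnit` and `sharedWords` are illustrative names.

```python
from itertools import combinations

# Invented (word, unit) rows standing in for the TSV's word and unit columns
sampleRows = [("great", "vol-1"), ("big", "vol-1"), ("great", "vol-2"),
              ("new", "vol-2"), ("big", "vol-3"), ("great", "vol-3")]

# Group the words by unit
wordsByUnit = {}
for word, unit in sampleRows:
    wordsByUnit.setdefault(unit, set()).add(word)

# Compare every pair of units and keep the words they have in common
sharedWords = {}
for unitA, unitB in combinations(sorted(wordsByUnit), 2):
    sharedWords[(unitA, unitB)] = wordsByUnit[unitA] & wordsByUnit[unitB]

for pair, words in sharedWords.items():
    print(pair, sorted(words))
```

Each shared word here corresponds to a word node connected to two unit nodes in the network view.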



## Your Turn! 
Adapt this notebook to explore one or more of your project files and prepare some network data of your own. 

* Output a CSV or TSV of data from your project for import into our network visualization software.
* [Install Cytoscape](https://cytoscape.org/) on your computer and check to make sure it is working.
     * Cytoscape now comes bundled with its own Java, so you should be good to go for running it (much like oXygen). If you have trouble, work through the [Troubleshooting](https://cytoscape.org/troubleshooting.html) guide first. Bring questions to class or post them on Canvas!
     * After you install Cytoscape, see if you can also install the yFiles Layout Algorithms. Open Cytoscape, go to **Layout → Install yFiles**, which directs you to an App store to download it and apply it to Cytoscape. 