## Words to Network Data

Now that we know some ways to pull an interesting selection of words, let's think about them like a network. Network analysis involves looking at how things relate to each other based on shared properties. 

In this notebook, we will address a simple research question: Which **files** (or **chapters**, or **scenes**, or **authors**, or **titles**) share each other's top 5 or 10 **verbs** (or **adjectives**, or **nouns**, or **WordNet synsets**, etc.)?

We will try exploring this as a network of data and import it into network visualization software. 
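
To make the question concrete, here is a minimal sketch of what "sharing top words" looks like as network data. The unit names and word lists are invented placeholders, not results from the corpus. Each (unit, word) pair is one edge, and any word connected to more than one unit is what links those units together.

```python
from collections import defaultdict

# Hypothetical top adjectives per unit (made-up values for illustration only).
top_words = {
    "vol-1": ["good", "great", "old"],
    "vol-2": ["good", "bad", "old"],
    "vol-3": ["strong", "bad", "new"],
}

# Each (unit, word) pair is one edge in a two-mode network of units and words.
edges = [(unit, word) for unit, words in top_words.items() for word in words]

# Invert the edges: which units does each word connect?
units_by_word = defaultdict(set)
for unit, word in edges:
    units_by_word[word].add(unit)

# Words linked to more than one unit are the shared ones.
shared = {word: sorted(units) for word, units in units_by_word.items() if len(units) > 1}
print(shared)
```

Network visualization software typically starts from an edge list like `edges`, which is the shape of data the rest of this notebook works toward.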

### Installs in your Terminal 
* If you did not do this in the previous notebook: `pip install spacy` (or `python3.12 -m pip install spacy`)

### Language Models
Like NLTK's collections that we downloaded, [spaCy has trained language models](https://spacy.io/usage/models) to download. You download these in your Python script after you import spacy, and after you download once, you don't need to do it again (so you can comment out the download line). We're going to try the medium and large models in English. (It's good to know that both spaCy and NLTK have resources for NLP on other languages, too!)  

In [2]:
import spacy
import os

### Downloading language models
To work with spaCy's pre-trained language models, you need to download them to your virtual environment. There are:
* en_core_web_sm (smallest; not as much info as the others)
* **en_core_web_md** (Pretty good data here, size: 34 MB)
* **en_core_web_lg** (Lots of data here, size: 400 MB)

Note that the largest one has the most data and will probably be the most reliable.
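
One way to tell whether a model still needs downloading is to ask Python whether a package with that name can be found. This is only a sketch: `model_is_installed` is a hypothetical helper, and it assumes the model was installed as an ordinary Python package in the environment you are running in, which is how spaCy's downloader normally installs models.

```python
import importlib.util

def model_is_installed(model_name):
    """Return True if a package with this name can be found in the current environment."""
    return importlib.util.find_spec(model_name) is not None

model_name = "en_core_web_lg"
if model_is_installed(model_name):
    print(f"{model_name} is already installed; the download can be skipped.")
else:
    print(f"{model_name} was not found; run the download once.")
```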

In [None]:
# CAN YOU SKIP THIS??? Yes, if the model is already installed in your virtual environment.
# After you download a model into your virtual environment for the first time, you can comment out the download line.
# spaCy's medium and large models will give us the best results for NLP tagging.
# spacy.cli.download() installs the model and returns nothing, so there is no result to store in a variable.
# spacy.cli.download("en_core_web_sm")
# spacy.cli.download("en_core_web_md")
spacy.cli.download("en_core_web_lg")

Collecting en-core-web-lg==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.8.0/en_core_web_lg-3.8.0-py3-none-any.whl (400.7 MB)

### Load the model 
Now we set the `nlp` variable by loading the model you downloaded.

In [3]:
nlp = spacy.load("en_core_web_lg")

## File Collection
You can set this to read a collection of text files that you've prepped for natural language processing.
For this notebook, we'll work with just the dialogue text we pulled from the One Piece project.
The text is organized as one file per volume of One Piece, so we can explore how heavily the dialogue relies on certain words and how that changes across volumes.

We're going to get volume information from the One Piece filenames (which look like this: `vol-9.txt`). We can use the `os` library to separate the filename from its file extension. Let's try that so we can simplify our references to the volumes when we output our data.
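
As a small sketch of the filename handling, the example below splits a name like `vol-9.txt` into its parts and pulls out the volume number. `os.listdir()` returns files in arbitrary order, so sorting by that number is an optional extra if you want the volumes in sequence. The filename list and the `volume_number` helper are made up for illustration.

```python
import os

# Sample filenames in the vol-N.txt pattern (a short made-up list).
filenames = ["vol-9.txt", "vol-16.txt", "vol-4.txt", "notes.md"]

def volume_number(filename):
    """Return the integer after 'vol-' in names like 'vol-9.txt'."""
    name, extension = os.path.splitext(filename)  # e.g. ('vol-9', '.txt')
    return int(name.split("-")[1])

# Keep only the text files, then order them by volume number rather than alphabetically.
text_files = [f for f in filenames if f.endswith(".txt")]
ordered = sorted(text_files, key=volume_number)
print(ordered)
```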


In [4]:
collPath = 'onepiece-nlp-text'
for file in os.listdir(collPath):
    if file.endswith(".txt"):
        filepath = f"{collPath}/{file}"
        name, extension = os.path.splitext(file)
        print(name)
        with open(filepath, 'r', encoding='utf8') as f:
            readFile = f.read()
            lengthFile = len(readFile)
            print(lengthFile)
# We're just printing out the file names (without extensions) and the character counts of the files as a "smoke test" to see if we're reading the files.

vol-9
6089
vol-8
5571
vol-16
4610
vol-17
3879
vol-15
5330
vol-14
5469
vol-10
5412
vol-11
285
vol-13
3932
vol-12
5817
vol-23
6614
vol-22
3316
vol-20
6774
vol-21
4644
vol-25
5306
vol-19
4629
vol-18
5272
vol-24
6093
vol-26
2592
vol-27
7330
vol-6
6371
vol-7
6018
vol-5
1924
vol-4
6884


## Collect some words with a little help from spaCy
Let's build this up to go file by file. We'll start by surveying what we have. (Ultimately, we want to collect the 5 or 10 most frequently repeated words of a given kind.)
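
For the "top 5 or 10" goal, one possible approach is Python's built-in `Counter`. The sketch below uses invented (lemma, part-of-speech) pairs as a stand-in for spaCy output so that it runs without a model; `top_lemmas` is a hypothetical helper name.

```python
from collections import Counter

# Stand-in for spaCy output: (lemma, part-of-speech) pairs, invented for illustration.
tagged = [
    ("good", "ADJ"), ("stop", "VERB"), ("good", "ADJ"), ("old", "ADJ"),
    ("hair", "NOUN"), ("old", "ADJ"), ("good", "ADJ"), ("small", "ADJ"),
]

def top_lemmas(pairs, pos, n):
    """Count lemmas tagged with `pos` and return the n most frequent with their counts."""
    counts = Counter(lemma for lemma, tag in pairs if tag == pos)
    return counts.most_common(n)

print(top_lemmas(tagged, "ADJ", 2))
```

With a real spaCy document, the pairs could be built from each token's `lemma_` and `pos_` attributes.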

In [32]:
# SURVEY IT ALL! :-) (Yeah, lots of data...)
for file in os.listdir(collPath):
    if file.endswith(".txt"):
        filepath = f"{collPath}/{file}"
        name, extension = os.path.splitext(file)
        print(name)
        with open(filepath, 'r', encoding='utf8') as f:
            readFile = f.read()
            spacyRead = nlp(readFile)
            for token in spacyRead:
                print(token.text, "---->", token.pos_, ":::::", token.lemma_)                 

vol-9
  ----> SPACE :::::  
Sensei ----> PROPN ::::: Sensei
, ----> PUNCT ::::: ,
I ----> PRON ::::: I
ca ----> AUX ::::: can
n't ----> PART ::::: not
stop ----> VERB ::::: stop
my ----> PRON ::::: my
" ----> PUNCT ::::: "
okay ----> ADJ ::::: okay
hair ----> NOUN ::::: hair
" ----> PUNCT ::::: "
. ----> PUNCT ::::: .
How ----> SCONJ ::::: how
do ----> AUX ::::: do
I ----> PRON ::::: I
stop ----> VERB ::::: stop
it ----> PRON ::::: it
? ----> PUNCT ::::: ?

  ----> SPACE ::::: 
 
There ----> PRON ::::: there
's ----> VERB ::::: be
no ----> DET ::::: no
need ----> NOUN ::::: need
to ----> PART ::::: to
! ----> PUNCT ::::: !
! ----> PUNCT ::::: !
  ----> SPACE :::::  
The ----> DET ::::: the
SBS ----> PROPN ::::: SBS
begins ----> VERB ::::: begin
! ----> PUNCT ::::: !
! ----> PUNCT ::::: !

  ----> SPACE ::::: 
 
To ----> ADP ::::: to
Oda ----> PROPN ::::: Oda
- ----> PUNCT ::::: -
Sensei ----> PROPN ::::: Sensei
, ----> PUNCT ::::: ,
I ----> PROPN ::::: I
always(x2 ----> NOUN ::::: alwa

In [41]:
# Remember, we can request an explanation from spaCy of its tags: 
spacy.explain("DET")

'determiner'

### Okay, let's select something interesting
...and maybe only collect the lemmas.

Let's do this with a **function** call. We can easily switch the kind of word we want to output this way.
Here we are starting to associate words with volumes. We'll refine that next by making a dataframe.
 

In [5]:
def wordCollector(words, unit):
    wordList = []
    for token in words:
        if token.pos_ == "ADJ":
            wordList.append((token.lemma_, unit))
    uniqueLems = set(wordList)
    return uniqueLems

for file in os.listdir(collPath):
    if file.endswith(".txt"):
        filepath = f"{collPath}/{file}"
        name, extension = os.path.splitext(file)
        with open(filepath, 'r', encoding='utf8') as f:
            readFile = f.read()
            spacyRead = nlp(readFile)
            myWords = wordCollector(spacyRead, name)
            print(myWords)

{('surprised', 'vol-9'), ('medium', 'vol-9'), ('wrong', 'vol-9'), ('flustered', 'vol-9'), ('awfull', 'vol-9'), ('pivoting', 'vol-9'), ('golden', 'vol-9'), ('many', 'vol-9'), ('young', 'vol-9'), ('heavy', 'vol-9'), ('famous', 'vol-9'), ('old', 'vol-9'), ('refined', 'vol-9'), ('full', 'vol-9'), ('much', 'vol-9'), ('good', 'vol-9'), ('terrible', 'vol-9'), ('right', 'vol-9'), ('bad', 'vol-9'), ('left', 'vol-9'), ('angry', 'vol-9'), ('pleased', 'vol-9'), ('popular', 'vol-9'), ('filthy', 'vol-9'), ('inside', 'vol-9'), ('elementary', 'vol-9'), ('smelly', 'vol-9'), ('sorry', 'vol-9'), ('long', 'vol-9'), ('more', 'vol-9'), ('embarrassed', 'vol-9'), ('bald', 'vol-9'), ('small', 'vol-9'), ('next', 'vol-9'), ('dry', 'vol-9'), ('true', 'vol-9'), ('light', 'vol-9'), ('complete', 'vol-9'), ('great', 'vol-9'), ('okay', 'vol-9'), ('naughty', 'vol-9'), ('third', 'vol-9'), ('whole', 'vol-9'), ('flintlock', 'vol-9'), ('kindred', 'vol-9')}
{('alien', 'vol-8'), ('astonishing', 'vol-8'), ('earthling', 'vol-8
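
The function above has `"ADJ"` written into its body, so switching to verbs or nouns means editing the function. A possible variation, sketched here under the hypothetical name `collect_lemmas`, takes the part of speech as a parameter. The stand-in tokens only mimic the two attributes the function reads; a real spaCy document would supply actual tokens.

```python
from types import SimpleNamespace

def collect_lemmas(tokens, unit, pos="ADJ"):
    """Return the set of (lemma, unit) pairs for tokens whose pos_ matches `pos`."""
    return {(token.lemma_, unit) for token in tokens if token.pos_ == pos}

# Stand-in tokens with the two attributes used above (invented for illustration).
tokens = [
    SimpleNamespace(lemma_="stop", pos_="VERB"),
    SimpleNamespace(lemma_="okay", pos_="ADJ"),
    SimpleNamespace(lemma_="hair", pos_="NOUN"),
    SimpleNamespace(lemma_="begin", pos_="VERB"),
]

print(collect_lemmas(tokens, "vol-9", pos="VERB"))
```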

### Let's try dataframes to store these relations of word-form to unit 
Dataframes, supported by the pandas library in Python, facilitate handling data in spreadsheet / CSV / TSV format.
We want to output some data to a TSV (tab-separated values) file so we can send it to network visualization software.

We'll need to open a terminal and run `pip install pandas` (or `python3.12 -m pip install pandas`) to get dataframes working.

In [6]:
import pandas as pd

def wordCollector(words, unit):
    wordList = []
    nodeAtts = []
    unitList = []
    for token in words:
        if token.pos_ == "ADJ":
            wordList.append(token.lemma_)
            nodeAtts.append(token.pos_)
            unitList.append(unit)

    data = {
        'word': wordList,
        'nodeType': nodeAtts,
        'unit': unitList
    }
    df = pd.DataFrame(data)
    return df

for file in os.listdir(collPath):
    if file.endswith(".txt"):
        filepath = f"{collPath}/{file}"
        name, extension = os.path.splitext(file)
        with open(filepath, 'r', encoding='utf8') as f:
            readFile = f.read()
            spacyRead = nlp(readFile)
            myDataFrame = wordCollector(spacyRead, name)
            print(myDataFrame)

         word nodeType   unit
0        okay      ADJ  vol-9
1       third      ADJ  vol-9
2       small      ADJ  vol-9
3   surprised      ADJ  vol-9
4      famous      ADJ  vol-9
..        ...      ...    ...
56        old      ADJ  vol-9
57       more      ADJ  vol-9
58        bad      ADJ  vol-9
59     smelly      ADJ  vol-9
60       next      ADJ  vol-9

[61 rows x 3 columns]
            word nodeType   unit
0           same      ADJ  vol-8
1           long      ADJ  vol-8
2      earthling      ADJ  vol-8
3           half      ADJ  vol-8
4          alien      ADJ  vol-8
..           ...      ...    ...
60         other      ADJ  vol-8
61  astronomical      ADJ  vol-8
62       careful      ADJ  vol-8
63     notorious      ADJ  vol-8
64    ridiculous      ADJ  vol-8

[65 rows x 3 columns]
          word nodeType    unit
0      popular      ADJ  vol-16
1        ready      ADJ  vol-16
2         good      ADJ  vol-16
3        quick      ADJ  vol-16
4        right      ADJ  vol-16
5     

## Write the dataframes to an output TSV file
TSV means **tab-separated-values**. We'll use the tab character as a separator.
The other popular alternative is the CSV: **comma-separated-values**, which uses the comma as a separator.
You can use either one for our network data.

I'm going to output some "node attributes" for the network. In a network visualization, we'll want to be able to quickly distinguish the word nodes from the unit nodes (possibly by shape), and providing a qualifier or "node attribute" in the dataframe will be helpful.
We can add other kinds of node attributes, perhaps as numerical ranges (see below).
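
One numeric value that could be added is how often each word occurs in each unit. Strictly speaking this describes the word-unit connection (an edge weight) rather than a node, and the `weight` column name is an assumption. The sketch uses plain Python and invented occurrences so that it runs without pandas.

```python
from collections import Counter

# Invented (word, unit) occurrences, one entry per time a word appears in a unit.
occurrences = [
    ("good", "vol-9"), ("good", "vol-9"), ("old", "vol-9"),
    ("good", "vol-8"), ("alien", "vol-8"), ("alien", "vol-8"),
]

# Count repeats so each word-unit pair carries a numeric weight.
weights = Counter(occurrences)

# One row per distinct word-unit pair, in the same column style as the dataframes here.
rows = [
    {"word": word, "nodeType": "ADJ", "unit": unit, "weight": count}
    for (word, unit), count in weights.items()
]
for row in rows:
    print(row)
```

A list of dictionaries in this shape can be passed to `pd.DataFrame(rows)` to get a dataframe with one row per word-unit pair.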

In [7]:
import pandas as pd

def wordCollector(words, unit):
    wordList = []
    nodeAtts = []
    unitList = []
    for token in words:
        if token.pos_ == "ADJ":
            wordList.append(token.lemma_)
            nodeAtts.append(token.pos_)
            unitList.append(unit)

    data = {
        'word': wordList,
        'nodeType': nodeAtts,
        'unit': unitList
    }
    df = pd.DataFrame(data)
    return df
    # This is returning a separate dataframe for every source text file. 

# We need to consolidate all the dataframes into one file. Collect all dataframes here!
allDataFrames = []

for file in os.listdir(collPath):
    if file.endswith(".txt"):
        filepath = f"{collPath}/{file}"
        name, extension = os.path.splitext(file)
        with open(filepath, 'r', encoding='utf8') as f:
            readFile = f.read()
            spacyRead = nlp(readFile)
            myDataFrame = wordCollector(spacyRead, name)
            # Add each individual dataframe as it comes out into the list of dataframes!
            allDataFrames.append(myDataFrame)

# Make an output filepath
outputFilePath = 'networkData.tsv'
# Turn the list of dataframes into one dataframe:
fullDataFrame = pd.concat(allDataFrames, ignore_index=True)

# Note, since Pandas knows how to open and write files line by line, we can skip that open() step we used last time.
fullDataFrame.to_csv(outputFilePath, sep='\t', index=False)
print('I just saved a dataframe as a TSV file.')
# Go check your filestash for the file. 

I just saved a dataframe as a TSV file.
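
To check a TSV without leaving Python, the standard `csv` module can read it back with the tab character as the delimiter. The sketch below reads a few sample rows from an in-memory string so that it runs on its own; the rows are placeholders in the same three-column layout.

```python
import csv
import io

# A few sample rows in the same three-column layout as the TSV written above.
sample_tsv = "word\tnodeType\tunit\nokay\tADJ\tvol-9\nthird\tADJ\tvol-9\nsame\tADJ\tvol-8\n"

# For a real file, replace io.StringIO(...) with open('networkData.tsv', encoding='utf8', newline='').
reader = csv.DictReader(io.StringIO(sample_tsv), delimiter="\t")
rows = list(reader)

print(reader.fieldnames)
print(len(rows), "rows")
print(rows[0])
```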


### Something to try...
* Try developing this to return synsets from WordNet.
* Add an ambiguity metric that counts the number of synsets each word belongs to.
* Output this as a second "node attribute" alongside the part of speech. We could use that to color-code the words by ambiguity...
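
For color-coding, a raw synset count may be easier to use once it is grouped into a few labeled ranges. The sketch below shows one way to do that; the `ambiguity_band` name, the thresholds, and the sample counts are all arbitrary assumptions for illustration.

```python
def ambiguity_band(synset_count):
    """Map a synset count to a coarse label (thresholds are arbitrary assumptions)."""
    if synset_count == 0:
        return "none"
    if synset_count <= 3:
        return "low"
    if synset_count <= 10:
        return "medium"
    return "high"

# Invented counts for illustration; real values would come from WordNet lookups.
sample_counts = {"flintlock": 1, "bald": 4, "good": 27, "awfull": 0}
bands = {word: ambiguity_band(count) for word, count in sample_counts.items()}
print(bands)
```

A "none" band is worth keeping because misspelled or unusual words may have no synsets at all.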

In [11]:
# This time, with WordNet data!
# If the WordNet collection has not been downloaded yet, run nltk.download('wordnet') once first.
import pandas as pd
import nltk
from nltk.corpus import wordnet as wn

def wordCollector(words, unit):
    wordList = []
    nodeAtts = []
    synsetCounts = []
    unitList = []
    for token in words:
        if token.pos_ == "ADJ":
            synsets = len(wn.synsets(token.lemma_))
            # We should be able to look up and count those synsets right here!          
            wordList.append(token.lemma_)
            nodeAtts.append(token.pos_)
            synsetCounts.append(synsets)
            unitList.append(unit)
                          
    data = {
        'word': wordList,
        'nodeType': nodeAtts,
        'synsetCount': synsetCounts,
        'unit': unitList
    }
    df = pd.DataFrame(data)
    return df
    # This is returning a separate dataframe for every source text file. 

# We need to consolidate all the dataframes into one file. Collect all dataframes here!
allDataFrames = []

for file in os.listdir(collPath):
    if file.endswith(".txt"):
        filepath = f"{collPath}/{file}"
        name, extension = os.path.splitext(file)
        with open(filepath, 'r', encoding='utf8') as f:
            readFile = f.read()
            spacyRead = nlp(readFile)
            myDataFrame = wordCollector(spacyRead, name)
            # Add each individual dataframe as it comes out into the list of dataframes!
            allDataFrames.append(myDataFrame)

# Make an output filepath
outputFilePath = 'networkData-2.tsv'
# Turn the list of dataframes into one dataframe:
fullDataFrame = pd.concat(allDataFrames, ignore_index=True)

# Note, since Pandas knows how to open and write files line by line, we can skip that open() step we used last time.
fullDataFrame.to_csv(outputFilePath, sep='\t', index=False)
print('I just saved a NEW dataframe as a NEW TSV file.')
# Go check your filestash for the file. 

I just saved a NEW dataframe as a NEW TSV file.


## Your Turn! 
Adapt this notebook to explore one or more of your project files and prepare some network data of your own!