# Networking popular words in lyrics
For this exercise, we'll create a network of words (based on part of speech or some other classification) that are frequently used in collections of song lyrics assembled by the Olivia Rodriguez project team. For our network example, we ask: Which of the popular words is used the most frequently by particular artists? 

If you're adapting this to your own project, take some time with your team to think about what's interesting to explore as a network from your project. You can use this cell to sketch out in markdown what you want to do. 

The steps to create this network are:
## Collect the words, rank them, and select them
* Collect the distinct words from all of the lyrics together by part of speech. (Let's look at nouns in this example.) Return these as a sorted list with duplicates removed, ranked from most to least frequent. 
* Streamline the list, by choosing, say, the top 10 or 20 words.
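
The two steps above can be sketched on a tiny invented list before we run them on real lyrics. The lemmas in `sample_lemmas` are made up for illustration; in the notebook they would come from spaCy.

```python
from collections import Counter

# Hypothetical noun lemmas standing in for what spaCy would return.
sample_lemmas = ["heart", "night", "heart", "car", "night", "heart", "home"]

# Counter collapses duplicates and keeps a count for each distinct lemma.
lemma_counts = Counter(sample_lemmas)

# most_common(n) returns (lemma, count) pairs ranked from most to least frequent.
top_two = lemma_counts.most_common(2)
print(top_two)  # [('heart', 3), ('night', 2)]
```

Changing the number passed to `most_common()` is all it takes to streamline the list to the top 10 or 20.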

## Find out which artists use each word and how much
* Reach into the collections assembled by artist.
* For each word in our streamlined list, return a count of how much the artist repeats that word.
* Prepare network data (arranged as a TSV or pandas dataframe) with this structure. (The syntax will be different, but this is just an idea of the information we need.)

  ```
     word | used by | artist | count of times used by this artist
  ```
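
As a rough sketch of what that structure looks like as real tab-separated data, here is one way to write a few rows with Python's `csv` module. The words, artists, and counts in `sample_edges` are invented for illustration and are not results from the collection.

```python
import csv
from io import StringIO

# Hypothetical rows: (word, relationship, artist, count). Invented numbers.
sample_edges = [
    ("heart", "used by", "olivia", 12),
    ("heart", "used by", "billie", 7),
    ("night", "used by", "taylor", 9),
]

# Write the rows into an in-memory text buffer, separating values with tabs.
buffer = StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerows(sample_edges)

tsv_text = buffer.getvalue()
print(tsv_text)
```

Each line holds one connection in the network: a word, the "used by" relationship, an artist, and the count.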

### Alternative ways to develop networks of information
There are lots of ways to think about how to explore word use in song lyrics. We are making something of a "big picture" study of the most popular words by part of speech used by all artists, and looking across their albums: a word is just used by an artist (regardless of which song or album it's in). But we could change this to take a closer look at other patterns. For example:
* Change the context: word use by song or within an album!

* Start with a word of interest **to accept as input into the notebook** and:
    * Look for all the ones most closely related in a collection using cosine similarity (see our early Python homework assignments to explore words similar to a word of interest).
    * Try an adjacency network of words: Find out which other words of its type are sitting close (adjacent) to the word of interest. 
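
For the adjacency idea, a minimal sketch might look like the following. The function name `adjacent_words` and the sample tokens are assumptions for illustration. This version looks only at the immediate neighbors and does not filter by part of speech, which you would add with spaCy.

```python
from collections import Counter

def adjacent_words(tokens, word_of_interest):
    """Count the tokens sitting immediately before or after word_of_interest."""
    neighbors = Counter()
    for i, token in enumerate(tokens):
        if token == word_of_interest:
            if i > 0:
                neighbors[tokens[i - 1]] += 1
            if i < len(tokens) - 1:
                neighbors[tokens[i + 1]] += 1
    return neighbors

# A small token list based on two lines from the sample lyrics.
sample_tokens = "show me the way home so tell me you will come home".split()
home_neighbors = adjacent_words(sample_tokens, "home")
print(home_neighbors)
```

Each neighbor and its count could then become an edge in a word-to-word network.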


# Practical considerations: an overview of our process!
## Consider how to visualize this network
This would be a bimodal network to show which words are shared the most by which artists in the collection. 
* Node1 = word
* Node2 = artist
* Edge connection = "used by"
* Edge weight data = count of times used by this artist (needs to be an integer)
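
Here is a small sketch of how those pieces could map onto a NetworkX graph. The edge data is invented, and the node attribute name `kind` is just a label chosen for this example, not something NetworkX requires.

```python
import networkx as nx

# Hypothetical (word, artist, count) data, invented for illustration.
sample_edges = [
    ("heart", "olivia", 12),
    ("heart", "billie", 7),
    ("night", "taylor", 9),
]

G = nx.Graph()
for word, artist, count in sample_edges:
    # Tag each node with its mode so the two kinds can be styled differently later.
    G.add_node(word, kind="word")
    G.add_node(artist, kind="artist")
    # The edge weight must be an integer for the visualization step.
    G.add_edge(word, artist, label="used by", weight=int(count))

print(G.number_of_nodes(), G.number_of_edges())  # 5 3
```

This sketch assumes that no word is spelled exactly like an artist's name. If that could happen in your data, prefix the node names to keep them apart.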

## How will we do all this (with the Olivia Rodriguez Team collections)?
* Python imports and functions to coordinate processing
* Apply saxonche to import the team's XML: pull the text you want from the XML nodes in a collection using XQuery
    * _Without XML data_, collect strings of words by opening the files and reading them in. Refer to our early Python assignments.
* Use NLP to find the words of interest: We can send the text for the whole collection to spaCy (much as we did in [project ipynb exercise 3](3-projectExplore-dataCounts.ipynb)) to retrieve the words of interest (e.g. nouns).
    * Remove duplicates and rank them (use `Counter` and `most_common()`).
    * Slice this list to get you the top 20 (or however many you want to plot).
      
* Now return to our XML collection to find out information about who is using the words and where.
    * For our example from the Olivia Rodriguez team, the team has organized files in folders named by artist and album. We'll use this organization to help collect information based on the artists. (See the local folder in this projectExamples directory named `lyricXML/`.)
    * We'll return to saxonche and **write an XQuery function that defines the collections we need to reach into based on the artist**. (**NOTE: This part is tricky: It will be specific to the project team's folder structure**. Ask for help if you need to on this part to adapt to your project!)
        * _Without XML data_, return to the text files: open them based on filename or folder name or whatever structure will help you establish context for your files
    * Start with a for loop over your words of interest. Each word is sent into XQuery to retrieve how much it's used by an artist and return the count of its use.
        *  _Without XML data_, each word is sent to a Python function that retrieves a count of how much it's used in whatever you're using to delimit a special "bucket" of files or folders in your project.
     
    * Output a pandas dataframe  to prepare data to be read by NetworkX and pyVis.
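
For the _Without XML data_ route, a sketch of the counting step might look like this. It assumes plain `.txt` files organized in one folder per artist. The function name `count_word_by_artist` and the folder names `artistA` and `artistB` are hypothetical, and the demo builds a throwaway folder so it can run anywhere.

```python
import re
import tempfile
from pathlib import Path

def count_word_by_artist(root_folder, word):
    """Return {artist_folder_name: count} of whole-word matches in .txt files."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    counts = {}
    for artist_dir in sorted(Path(root_folder).iterdir()):
        if artist_dir.is_dir():
            # rglob reaches into nested album folders under each artist.
            text = " ".join(
                p.read_text(encoding="utf-8") for p in artist_dir.rglob("*.txt")
            )
            counts[artist_dir.name] = len(pattern.findall(text))
    return counts

# Demo on a temporary folder structure: artistA/album1/song.txt and an empty artistB.
with tempfile.TemporaryDirectory() as tmp:
    album = Path(tmp) / "artistA" / "album1"
    album.mkdir(parents=True)
    (album / "song.txt").write_text("Home again, home at last", encoding="utf-8")
    (Path(tmp) / "artistB").mkdir()
    demo_counts = count_word_by_artist(tmp, "home")

print(demo_counts)  # {'artistA': 2, 'artistB': 0}
```

Here the "bucket" is the top-level artist folder. You could swap in albums or filenames if that suits your project better.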
 
## Time to code this...

In [None]:
# START WITH INSTALLS AND IMPORTS!

# If you're missing anything in the import cells below, you should install it with pip (or pip3) in your virtual environment. 

# TRY uncommenting the lines here to see if the notebook will handle the imports directly. 
# Here you need the `!` in front if it's going to work.
# IF THAT DOESN'T WORK:
# Go to your command line in the git bash shell (Windows) or Terminal (Mac) and 
# activate your virtual environment where you've set it on your local computer: 
# Windows: source Scripts/activate
# Mac: source bin/activate 
# Watch for your virtual environment to show you it's active in the shell.
# Then enter your pip installs (without the `!` exclamation points). 

# INSTALLS
# (pathlib is part of Python's standard library, so it needs no install.)
# !pip install spacy
# !pip install saxonche
# !pip install pandas
# !pip install networkx
# !pip install pyvis


In [None]:
# IMPORTS for the text NLP processing
import pathlib
import spacy
from pathlib import Path
from saxonche import PySaxonProcessor
from collections import Counter

# Just in case you want it:

# import re as regex
# re is standard to Python3: lets us work with regular expressions in Python. 
# Uncomment it if you want to try it here to search for a specific pattern in your texts with Python.

In [None]:
# IMPORTS For the network visualizations
import numpy as np
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
from pyvis.network import Network


In [None]:
# spacy.cli.download("en_core_web_lg")
# ONLY NEED ABOVE LINE ONCE. REMEMBER: COMMENT OUT THE ABOVE LINE THE NEXT TIME YOU RUN THIS.
nlp = spacy.load('en_core_web_lg')

## Collect the words, rank them, and select them
* Collect the distinct words from all of the lyrics together by part of speech. (Let's look at nouns in this example.) Return these as a sorted list with duplicates removed, ranked from most to least frequent. 
* Streamline the list, by choosing, say, the top 10 or 20 words.


### Sample XML code for a file in the lyricXML collection

```xml
<lyrics>
    <section type="verse" n="1">
        <l>Told you not to worry</l>
        <l>But maybe that's a lie</l>
        <l>Honey, what's your hurry?</l>
        <l>Won't you stay inside?</l>
        <l>Remember not to get too close to stars</l>
        <l>They're never gonna give you love like ours</l>
    </section>
    <section type="chorus">
        <l>Where did you go?</l>
        <l>I should know, but it's cold</l>
        <l>And I don't wanna be lonely</l>
        <l>So show me the way home</l>
        <l>I can't lose another life</l>
    </section>
    <section type="refrain">
        <l>Hurry, I'm worried</l>
    </section>
    <section type="verse" n="2">
        <l>The world's a little blurry</l>
        <l>Or maybe it's my eyes</l>
        <l>The friends I've had to bury</l>
        <l>They keep me up at night</l>
        <l>Said I couldn't love someone</l>
        <l>'Cause I might break</l>
        <l>If you're gonna die, not by mistake</l>
    </section>
    <section type="chorus">
        <l>So, where did you go?</l>
        <l>I should know, but it's cold</l>
        <l>And I don't wanna be lonely</l>
        <l>So tell me you'll come home</l>
        <l>Even if it's just a lie</l>
    </section>
    <section type="bridge">
        <l>I tried not to upset you</l>
        <l>Let you rescue me the day I met you</l>
        <l>I just wanted to protect you</l>
        <l>But now I'll never get to</l>
    </section>
    <section type="refrain">
        <l>Hurry, I'm worried</l>
    </section>
    <section type="outro">
        <l>Where did you go?</l>
        <l>I should know, but it's cold</l>
        <l>And I don't wanna be lonely</l>
        <l>Was hoping you'd come home</l>
        <l>I don't care if it's a lie</l>
    </section>
</lyrics>
```
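
To get a feel for what pulling the lines out of this XML means, here is a small sketch using Python's built-in `xml.etree.ElementTree` instead of XQuery. It is an illustration only, not the method used in the cells below, and `sample_xml` is a shortened copy of the file above.

```python
import xml.etree.ElementTree as ET

# A shortened version of the sample file shown above.
sample_xml = """<lyrics>
    <section type="refrain">
        <l>Hurry, I'm worried</l>
    </section>
    <section type="chorus">
        <l>Where did you go?</l>
        <l>I should know, but it's cold</l>
    </section>
</lyrics>"""

root = ET.fromstring(sample_xml)

# iter("l") visits every <l> element at any depth, much like //l in XPath.
lines = [line.text for line in root.iter("l")]

# Join the lines with spaces to make one string, ready to hand to NLP.
joined = " ".join(lines)
print(joined)
```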

### The next two cells...
Define your input and output filepaths...and send them to XQuery for processing.

#### About reaching into your file collections with XQuery
Our input for this exercise in lyricXML is a set of nested folders, so we need to recurse through them. 

The `collection()` function in our XQuery is set to **recurse** through each of the folders and find all the XML files inside. 
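
If you want to check which files a recursive search would reach, `Path.rglob()` in Python does a similar walk through nested folders. This is a sketch for checking your folder structure, not part of the XQuery. The helper name `list_xml_files` and the folder and file names in the demo are invented.

```python
import tempfile
from pathlib import Path

def list_xml_files(folder):
    """Return the relative paths of all .xml files under folder, at any depth."""
    return sorted(p.relative_to(folder).as_posix() for p in Path(folder).rglob("*.xml"))

# Demo on a temporary nested folder structure.
with tempfile.TemporaryDirectory() as tmp:
    nested = Path(tmp) / "artistA" / "album1"
    nested.mkdir(parents=True)
    (nested / "song1.xml").write_text("<lyrics/>", encoding="utf-8")
    (Path(tmp) / "notes.txt").write_text("not XML", encoding="utf-8")
    found = list_xml_files(tmp)

print(found)  # ['artistA/album1/song1.xml']
```

On your own computer you could call `list_xml_files('lyricXML')` to see what is inside the collection.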

#### Keeping your outputs from scrolling forever
On an output cell with a LONG BLOB of text, right-click and select "Enable Scrolling for Outputs"

In [None]:
# DEFINE SOME FILE PATHS FOR INPUT, AND (ONCE WE'RE READY) OUTPUT
InputPath = 'lyricXML'
OutputPath = 'testOutput' 

# NOTE: We need to use a return line on this function to return the string value of `r` as the result of our python function.
# With the return line, that makes it possible to call the function in the next cell when we need to deliver the output to nlp.

In [None]:
def xqueryAndNLP(InputPath):
    # This time, let's try XQuery over a collection of files:
    with PySaxonProcessor(license=False) as proc:
        print(proc.version)
        xq = proc.new_xquery_processor()
        xq.set_query_base_uri(Path('.').absolute().as_uri() + '/')
        # The f-string passes InputPath into the query, so changing InputPath changes the collection.
        xq.set_query_content(f'''
let $allTheLyrics := collection('{InputPath}/.?select=*.xml;recurse=yes')
(: ebb: our collection variable is set to recurse through the internal nested folders. :)
let $lines := $allTheLyrics//l ! text()
return string-join($lines, ' ')

''')
        r = xq.run_query_to_string()
        # print(r)
        r = str(r)
    return r

xqueryAndNLP(InputPath)

### Let's roll this ball of text over to NLP now. . .

In [None]:
# If everything's working properly and you have lots of text for the computer to read, this cell may take a moment to run. 

inputstring = xqueryAndNLP(InputPath)

# start playing with spaCy and nlp:
words = nlp(inputstring)
# print(words)

# Collecting the lemmatized forms will be better than just all the words. (Remember what these are?)
Lemmas = []
for token in words:
    if token.pos_ == "NOUN":
        lemma = token.lemma_
        Lemmas.append(lemma)

# Okay, we'll use Python's Counter() to find out how frequently each noun lemma shows up in the entire list of noun lemmas.
# Counter() removes duplicates and counts the number of times something appears. 
# It stores the result as a dictionary of key:value pairs, and when printed it shows them from highest to lowest count.

# print(Lemmas)

lemmaFreq = Counter(Lemmas)
totalLemmaCount = len(lemmaFreq) 

print(f"Lemma count: {totalLemmaCount}")

print(f"Lemma frequency {lemmaFreq}")

# We could even calculate the percentage each noun lemma is used.
# The total noun count would be the length of the Lemmas list: len(Lemmas).



In [None]:
# As with our previous bar graph examples in exercise 3, we don't want to plot every last word here.
# But we have a lot of data, so we can experiment!
# To access data in our Counter list and keep it organized from highest to lowest value, we use `most_common()`.
# Then we can slice it to store however many we want to plot. [:10] would store the first 10 values (positions 0 through 9), since Python starts counting from zero and stops just before the end number.

mostCommon = dict(lemmaFreq.most_common()[:29])
print(f"mostCommon Lemmas {mostCommon}")

# Here we are unpacking our sliced dictionary of most common noun lemmas into lists of the values and keys,
# and checking to make sure they remain in their dictionary order here. 
# We will use the list of lemmas in the next code cell to look for each one as used by each artist.  
# (We used them when plotting bar graphs, 
# so you could output some bar graphs in the next cells if you want, and then return to the network we're building!)

listCounts = list(mostCommon.values())
listLems = list(mostCommon.keys())
print(f"listCounts: {listCounts}")
print(f"listLems: {listLems}")


## Okay, time to build our network...
### Find out which artists use each word and how much
* Reach into the collections assembled by artist, **with XQuery again, but this time, reaching into specific collections for each artist** 
* **For each word in our streamlined list**, return a count of how much the artist repeats that word
* Prepare network data (arranged as a TSV or pandas dataframe) with this structure.
* This is just to remind us what we're constructing for our network. **The syntax for our dataframes will be different** (no vertical `|`'s). 

  ```
     word | used by | artist | count of times used by this artist
  ```
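
One decision to make when counting is what exactly counts as a "use". The sketch below compares two options on a few lines. The first, second, and fourth lines are invented, and the third comes from the sample lyrics. Counting lines that contain the letters of the word will also catch longer words like "homesick", while a whole-word pattern will not.

```python
import re

sample_lines = [
    "Come home, come home",
    "The homesick feeling stays",
    "So show me the way home",
    "Nobody's homeward bound",
]
word = "home"

# Option 1: how many lines contain the word as a string of characters.
lines_containing = sum(1 for line in sample_lines if word in line)

# Option 2: how many times the word appears as a whole word, ignoring case.
whole_word = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
whole_word_uses = sum(len(whole_word.findall(line)) for line in sample_lines)

print(lines_containing, whole_word_uses)  # 4 3
```

Neither option is wrong, but your team should know which one your edge weights represent.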


In [None]:
### CURRENTLY RUNNING BUT UNFINISHED!

def networkQuery(listLems, InputPath):
    # Collect one block of TSV rows per lemma, so that no lemma's result gets overwritten.
    results = []
    with PySaxonProcessor(license=False) as proc:
        for lemma in listLems:
            xq = proc.new_xquery_processor()
            xq.set_query_base_uri(Path('.').absolute().as_uri() + '/')
            # XQuery escapes a single quote inside a quoted string by doubling it.
            safeLemma = lemma.replace("'", "''")
            # Python turns \t and \n in this f-string into a real tab and newline
            # before XQuery reads the query, and XQuery accepts them inside its quoted strings.
            xquery = f'''
                declare variable $lemma as xs:string := '{safeLemma}';
                declare variable $string as xs:string := string-join(
                let $allTheLyrics := collection('{InputPath}/.?select=*.xml;recurse=yes')
                (: ebb: our collection variable is set to recurse through the internal nested folders. :)

                let $artistNames := ('billie', 'olivia', 'sabrina', 'taylor')
                for $name in $artistNames
                (: NOTE: this counts the lines that contain the lemma as a string of characters, not every single use of the word. :)
                let $lemmaLines := $allTheLyrics[base-uri() ! contains(., $name)]//l ! text()[contains(., $lemma)]
                let $lemmaCount := count($lemmaLines)
                return ($lemma || '\t' || 'used by' || '\t' || $name || '\t' || $lemmaCount), '\n');

                (: IF NEEDED: in place of \t, we can use `&#x9;`. :)
                (: IF NEEDED: in place of \n, we can use `&#10;`, the special character for a newline or hard return. :)
                $string
            '''
            xq.set_query_content(xquery)
            r = xq.run_query_to_value()
            # print(r)
            results.append(str(r))

    # Join every lemma's rows into one TSV string.
    return '\n'.join(results)

networkQuery(listLems, InputPath)
    
    

### Network Vis Time!
We have lovely network data formatted as TSV (tab-separated-values), and we need to convert it into a format that Python easily interprets for use in visualizations. 

We have lots of options here, but let's try out pandas dataframes. Dataframes are used frequently in text data analytics for organizing and reading values into charts and graphs. Read more about it at <https://www.geeksforgeeks.org/python-pandas-dataframe/>.
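
As a small sketch of that conversion, here is how a TSV string with no header row can be read into a dataframe. The data in `sample_tsv` and the column names are invented for illustration.

```python
from io import StringIO

import pandas as pd

# Hypothetical TSV text with no header row. Invented numbers.
sample_tsv = "heart\tused by\tolivia\t12\nheart\tused by\tbillie\t7\n"

# header=None tells pandas that the first line is data, not column names.
# names supplies the column names we want instead.
sample_df = pd.read_csv(
    StringIO(sample_tsv),
    sep="\t",
    header=None,
    names=["word", "relation", "artist", "count"],
)
print(sample_df)
```

Without `header=None`, pandas treats the first row as column names, and that row disappears from the data.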


In [None]:
from io import StringIO

# Store the output of our networkQuery in a nice variable: 
networkData = networkQuery(listLems, InputPath)

# If the dataframe looks wrong, it can help to output the whole TSV file and inspect it.
# We create the output folder first in case it doesn't exist yet:
OutputPath = 'testOutput/network.tsv'
Path(OutputPath).parent.mkdir(parents=True, exist_ok=True)
with open(OutputPath, "w", encoding="utf-8") as file:
    # Write the whole TSV string to the file
    file.write(networkData)

# Now let's try converting it into a pandas dataframe.
# Our TSV has no header row, so header=None keeps the first data row from being read as column names.
dataframe = pd.read_csv(StringIO(networkData), sep='\t', header=None,
                        names=['col1', 'col2', 'col3', 'col4'])

# An alternative: split the data into lines, then split each line by tabs
# rows = [line.split('\t') for line in networkData.split('\n')]
# Create a DataFrame from the list of lists
# df = pd.DataFrame(rows, columns=['col1', 'col2', 'col3', 'col4'])

# col1 = word, col2 = "used by", col3 = artist, col4 = count
print(dataframe)