# Interdisciplinary Communication Exploration

In this final segment you will use cyberinfrastructure to computationally explore the words used in academic articles. This is a peek into computational linguistics and we'll use some natural language processing tools. If you want to know more about those things, google them!

This segment is displayed in "Notebook Mode" rather than "Presentation Mode." So you will need to scroll down to explore the content. Notebook mode allows you to see more content at once. It also allows you to easily compare and contrast cells and visualizations. 

Here you are free to explore as much as you want. Once you see how the code works, feel free to change attributes, code pieces, etc.

In [1]:
# This code cell starts the necessary setup for Hour of CI lesson notebooks.
# First, it enables users to hide and unhide code by producing a 'Toggle raw code' button below.
# Second, it imports the hourofci package, which is necessary for lessons and interactive Jupyter Widgets.
# Third, it helps hide/control other aspects of Jupyter Notebooks to improve the user experience
# This is an initialization cell
# It is not displayed because the Slide Type is 'Skip'

from IPython.display import HTML, IFrame, Javascript, display
from ipywidgets import interactive
import ipywidgets as widgets
from ipywidgets import Layout

import getpass # This library allows us to get the username (User agent string)

# import package for hourofci project
import sys
sys.path.append('../../supplementary') # relative path (may change depending on the location of the lesson notebook)
import hourofci

# Retrieve the user agent string, it will be passed to the hourofci submit button
agent_js = """
IPython.notebook.kernel.execute("user_agent = " + "'" + navigator.userAgent + "'");
"""
Javascript(agent_js)

# load javascript to initialize/hide cells, get user agent string, and hide output indicator
# hide code by introducing a toggle button "Toggle raw code"
HTML(''' 
    <script type="text/javascript" src=\"../../supplementary/js/custom.js\"></script>
    
    <input id="toggle_code" type="button" value="Toggle raw code">
''')

## Setup
As always, you have to import the specific Python packages you'll need. However, for this exploration, many of the functions are basic Python functions that don't require separate imports. So we'll import packages as we need them, so that you can see when we are using something more than the base functionality of Python. 

Remember to run each code cell by clicking the "Run" button to the left of the code cell. Wait for the <pre>In [ ]:</pre> to change from an asterisk to a number. That is when you know the code is finished running.

## What text to explore?
As you'll see here, you can use the computer to deconstruct simple strings of letters into sentences, phrases and words. Words can be tagged with their "parts of speech"--nouns, verbs, prepositions, etc. That allows us to quantify and compare content, vocabulary, writing styles and sentiment, all things that make different disciplines unique in their forms of communication. The word clouds you worked with earlier in this lesson were created using these tools.

Although the code in this exploration will work with any PDF in which the text is encoded (i.e. not just images of pages), to get started we'll work with some geospatial science journal articles. Later you can use any PDF file you wish. We'll start with a pair of articles published in a special 10 year anniversary issue of the open *Journal of Spatial Information Science* (http://JOSIS.org). Both articles discuss how spatial information science can help us examine mobility in transportation systems. This is a research area that often requires analysis of massive datasets varying over space and time (i.e. big data!). Good fodder for cyberinfrastructure and cyber literacy for GIScience.

We'll compare these two short articles to see how similar they are: 
- Martin Raubal, 2020. <u>Spatial data science for sustainable mobility</u>. JOSIS 20, pp. 109–114, doi:10.5311/JOSIS.2020.20.651
- Harvey J. Miller, 2020. <u>Movement analytics for sustainable mobility</u>. JOSIS 20, pp. 115–123 doi:10.5311/JOSIS.2020.20.663

Since these are open access, we can download them directly from their URLs:
- Raubal - http://josis.org/index.php/josis/article/viewFile/651/271
- Miller - http://josis.org/index.php/josis/article/viewFile/663/279

We'll start with Raubal's article, then you can try your hand at processing Miller's article. As much as possible, this code below is generic to make it easier to run again. You'll see a few times where we've created specially named result files to be sure we've still got them when we get around to comparing the two articles at the end. 

## Get the data

First we get the document we want to examine from the web using the system command *!wget*. **wget** is a software package that helps us download files from the internet. We use it here to download our PDF files. Since you want to reuse this code later, we're going to put the downloaded document into a temporary file called *article.pdf*.

Then we have to translate the PDF format into plain text using the system command *!pdftotext*, naming the output *article.txt*. 

In [None]:
!wget http://josis.org/index.php/josis/article/viewFile/651/271 -O article.pdf
!pdftotext -enc ASCII7 'article.pdf' 'article.txt'
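
If *wget* is not available on your system, the download step can also be sketched in pure Python using the standard library. The helper name `download_file` is our own invention for illustration. This is only a minimal sketch: it assumes the URL points directly at the file and does no error handling.

```python
from urllib.request import urlretrieve

def download_file(url, destination="article.pdf"):
    """Download the file at url and save it as destination (hypothetical helper)."""
    # urlretrieve returns the local file name and the response headers
    saved_path, _headers = urlretrieve(url, destination)
    return saved_path

# Example call (commented out so nothing is downloaded until you choose to run it):
# download_file("http://josis.org/index.php/josis/article/viewFile/651/271")
```

The PDF would still need to be converted to text with *!pdftotext* afterwards.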

Finally, we need to open the text file so that we can read it and begin our analysis. The following code opens *article.txt* for "r"eading, reads the file, saves the contents in the variable *article_string*, and then closes the file.

In [None]:
file_text = open("article.txt", "r", encoding='utf-8') 
article_string = file_text.read()
file_text.close()
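
Python also offers a *with* statement that closes the file for you, even if something goes wrong while reading. Here is a minimal sketch of the same read, wrapped in a hypothetical helper we are calling `read_text_file`:

```python
def read_text_file(path):
    """Return the full contents of a text file as one string."""
    # The with block closes the file automatically when it ends
    with open(path, "r", encoding="utf-8") as file_text:
        return file_text.read()

# In this notebook the call would look like:
# article_string = read_text_file("article.txt")
```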

## View the data
Once you have read the data, you should look at it to make sure it is what you expected before you start your analysis.

In [None]:
print(article_string)

## Clean the data
You'll notice that there are a lot of "typos" in this translation from PDF to text. While we could, with some clever coding, get rid of the strange text, for now we'll just take off the top bit and the references, since they, in particular, are not very clean.

We'll do this by finding the index (location) of the words "Abstract" and "References". Using these locations we will extract the text between these start and end words (see the [start:end] piece of code below). This is sometimes called 'trimming.'

In [None]:
start = article_string.index('Abstract')
end = article_string.index('References')
article_extract = article_string[start:end]
print(article_extract)
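
To see how trimming works on something small, here is a sketch using a made-up sentence instead of the article. Note that *index* finds the first occurrence of a word, so if "References" also appeared earlier in the body text, the extract would be cut short.

```python
# A made-up stand-in for the article text
sample = "Title page Abstract This is the body. References [1] A paper"

start = sample.index("Abstract")     # position where 'Abstract' begins
end = sample.index("References")     # position where 'References' begins
extract = sample[start:end]          # keeps 'Abstract', stops just before 'References'
print(extract)

# index() raises a ValueError when the word is missing; find() returns -1 instead
missing = sample.find("Appendix")
print(missing)
```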

# The Fun Begins!
Now, we introduce some advanced chunks of code that will enable text processing. We do not expect you to fully understand all of the details of the code, nor become an expert in this advanced technology. Just enjoy the exploration and see what makes sense to you. There are brief explanations of each code chunk that explain the key concepts involved. The code and the explanations contain technical jargon that is most likely unfamiliar to you. As you learn to communicate interdisciplinarily, it is important to reflect on these new experiences where you are exposed to unfamiliar concepts, words, and jargon. This is an opportunity to try out a new experience and hopefully pick up one or two ideas along the way. The most important aspect is to have fun!

## Start processing
To begin, we'll use something new called TextBlob, which comes from the textblob package. textblob is a Python package for processing textual data. For more information see the QuickStart guide: https://textblob.readthedocs.io/en/dev/quickstart.html

We have to import the module, then turn the trimmed text *article_extract* into a *textblob*, a specific data format used by TextBlob. Then we will be able to start analyzing this text using functions provided by the TextBlob module. 

In [None]:
from textblob import TextBlob
tblob = TextBlob(article_extract)

Now we'll try some of the functions that can be applied to the textblob called *tblob*. Remember that to the computer this is just a string of characters representing text. To do textual analysis, we have to get the computer to recognize meaningful "chunks" of this text. Thus, we break the text string up into words, phrases and sentences.

Here we will focus only on words since we'd need a cleaner version of the text for more complex analysis such as phrases. Run the next code and see what textblob comes up with. 

In [None]:
words = tblob.words
words
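
To get a feel for what "breaking a string into words" involves, here is a deliberately crude stand-in that uses only Python's built-in *re* module. This is not how TextBlob does it; it simply treats any run of letters or apostrophes as a word, and the sample sentence is made up.

```python
import re

text = "Mobility is changing. Cities need better data!"

# Find every run of letters (and apostrophes), skipping spaces and punctuation
simple_words = re.findall(r"[A-Za-z']+", text)
print(simple_words)
```

A real tokenizer has to handle many more cases (hyphens, numbers, abbreviations), which is why we lean on a package.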

That's pretty impressive when you realize the computer just started with a string of characters. Let's take this to the next level, by getting the computer to decide what kinds of words each of these are. This is called Part Of Speech (POS) tagging. For more information about POS tagging, see this Wikipedia article https://en.wikipedia.org/wiki/Part-of-speech_tagging.

In [None]:
tags = tblob.tags
tags

This function produces a *list of tuples*. A **tuple** is an ordered collection of items (such as words) that cannot change. A **list** is an ordered collection of items that can change. Here, each tuple contains a word and its associated tag (i.e. a pair of items). Can you figure out what the tags mean? 

[OK, here's the hint: NN are variations of nouns, NNP are proper nouns (names of people and places), VB are verbs, CC conjunctions, etc. ] 
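
The difference between a list and a tuple is easiest to see by trying to change each one. The small example below uses made-up words and tags rather than real TextBlob output.

```python
# A list of (word, tag) tuples, invented for illustration
tagged = [("mobility", "NN"), ("is", "VBZ"), ("sustainable", "JJ")]

tagged.append(("cities", "NNS"))    # a list can grow
word, tag = tagged[0]               # a tuple can be unpacked into its parts

try:
    tagged[0][1] = "VB"             # but a tuple cannot be changed in place
except TypeError as error:
    message = str(error)
    print(message)
```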

Next we can quantify the words this author chose to use in their article. Let's focus only on nouns (NN and NNS, singular and plural). 

In [None]:
article_nouns = [word for (word, tag) in tags if tag == 'NN' or tag == 'NNS']

#returns a Python list, as indicated by the square bracket at the start.
article_nouns
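
The one-line filter above is called a list comprehension. It is a compact way of writing a loop, as this sketch with a few made-up tagged words shows:

```python
# Made-up (word, tag) pairs for illustration
sample_tags = [("data", "NNS"), ("move", "VB"), ("city", "NN"), ("Zurich", "NNP")]

# Long form: an explicit loop that keeps only NN and NNS words
nouns_loop = []
for (word, tag) in sample_tags:
    if tag == "NN" or tag == "NNS":
        nouns_loop.append(word)

# Short form: the same filter as a list comprehension
nouns_comprehension = [word for (word, tag) in sample_tags if tag in ("NN", "NNS")]

print(nouns_loop == nouns_comprehension)
```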

Alright! But like any good programmer, you should scan this output to see if it looks good. 

Wait! There are lots of instances of ']' and 'https' in this supposed list of nouns. Let's quickly get rid of them so they don't contaminate our results. 

In [None]:
non_nouns = {']', 'https'}  
all_nouns = [noun for noun in article_nouns if noun not in non_nouns] 
all_nouns

That's better! There are still a few unique non-nouns, but we're only interested in the frequent nouns, so this is good.

Now, before we go any further, let's keep a copy of these nouns for later.

**IMPORTANT!**  Read the comment in the code chunk below that instructs you to change the variable name *Raubal_nouns* when you rerun the code as instructed later.

In [None]:
#IMPORTANT! Change this variable name when you run your code again, otherwise you will overwrite the saved Raubal nouns!
Raubal_nouns = all_nouns

OK, back to our processing. It's hard to see from this long list which nouns are the most frequently used. It is too much data. Let's see if we can get some information by visualizing this data in a word cloud. This is basically a data to information transformation through visualization, which is a lot of words to say that we are distilling a lot of data into a small amount of useful information in the form of a visual word cloud.

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

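# join the nouns into one space-separated string so that no brackets or quote marks end up in the cloud
wordcloud = WordCloud(colormap='cividis', background_color='white').generate(' '.join(all_nouns))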

plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")

OK, that's interesting. Let's make a copy of that word cloud for later...

**IMPORTANT!**  Again, pay attention to the important note to change the variable name when rerunning the code.

In [None]:
#IMPORTANT! Change this variable name when you run the code again!
Raubal_wordcloud = wordcloud

But there are a lot of words in this visualization and it's hard to see which words are, say, the top 25 most common words. Let's dig into this deeper with a bit of computation.

We can start by counting up how many times each of these nouns was used. There are plenty of ways to do this frequency count. We'll use a **for loop**, which will iterate over each item in our list *all_nouns*. We will store the results in a Python **dictionary**, a special kind of Python data format that stores a collection of items organized as pairs of keys and values. To learn more about dictionaries and loops check out the link [here](https://www.w3schools.com/python/python_dictionaries.asp). 

In [None]:
#create an empty dictionary, indicated by the curly brackets
noun_count = {}

# loop over the list of nouns, identifying unique words and incrementing
# the total for repeated words
for item in all_nouns:
    if item in noun_count:     # We already saw this noun, so add 1
        noun_count[item] += 1
    else:                      # We have not seen this noun yet, so make the count 1
        noun_count[item] = 1

#show the resulting dictionary (note the curly brackets)
noun_count
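
As mentioned, there are plenty of ways to do a frequency count. One alternative is the **Counter** class from Python's built-in *collections* module, sketched here on a short made-up list of nouns:

```python
from collections import Counter

# A made-up list of nouns for illustration
sample_nouns = ["data", "mobility", "data", "transport", "mobility", "data"]

# Counter does the loop-and-increment work for us
sample_count = Counter(sample_nouns)

print(sample_count["data"])          # how often 'data' appears
print(sample_count.most_common(2))   # the two most frequent nouns with their counts
```

We use the explicit loop in this lesson because it shows each step of the counting.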

Cool! Of course there are a lot of words used only once (after all that's good grammatical style), so let's create a new dictionary of only the top 25 words.

While the code chunk below is deceptively simple, it's pretty sophisticated. See if this description makes sense to you. Deconstruct the code from the inside out. 

- We use a function called **itemgetter** from the module *operator*. Remember that dictionaries are composed of pairs of keys (here, the nouns) and values (the counts), and each key-value pair is called an item. Calling itemgetter(1) creates a small function that picks out the second element of each item, which is the count. 

- That result is sent to the **sorted** function to sort all of the items in our dictionary by value. 

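- Then the result of the **sorted** function, which lists the largest counts first because of *reverse = True*, is cut down to its first N entries with the slice [:N] and transformed back to a dictionary using the **dict** function. This will create a *topN* dictionary containing the N most common words in our article. 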

In [None]:
N = 25  #sets the number of items to extract
from operator import itemgetter 
topN = dict(sorted(noun_count.items(), key = itemgetter(1), reverse = True)[:N])  
topN
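
If the one-liner is hard to follow, it may help to see the same steps taken one at a time. This sketch uses a tiny made-up dictionary and keeps the top 2 instead of the top 25.

```python
from operator import itemgetter

# Made-up counts for illustration
sample_count = {"city": 2, "data": 5, "road": 1, "trip": 3}

pairs = list(sample_count.items())      # (noun, count) pairs
get_count = itemgetter(1)               # a function that returns element 1 of a pair
ranked = sorted(pairs, key=get_count, reverse=True)   # largest count first
top2 = dict(ranked[:2])                 # keep the first two pairs, back to a dictionary

print(ranked)
print(top2)
```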

That's it! We've got a sorted dictionary of the top 25 nouns in the article. 

OK, let's save this result and then you can run the other article! 

In [None]:
#IMPORTANT! Change this variable name when you run again!
Raubal_top25 = topN

# Process the other article
It's your turn to run the Miller article through the same analysis. Start at the beginning again, running each code cell, being sure to make the necessary changes to process the Miller article and not overwrite the key Raubal variables. When you've generated 'Miller_nouns', 'Miller_wordcloud', and 'Miller_top25', you'll be ready for the next step.

GO BACK TO THE TOP!
_______________________________________________________________

# Compare the articles
OK, now you've got two sets of results, Raubal's and Miller's. First let's look at the two word clouds.

In [None]:
plt.imshow(Raubal_wordcloud, interpolation='bilinear')
plt.axis("off")

In [None]:
plt.imshow(Miller_wordcloud, interpolation='bilinear')
plt.axis("off")

What differences and similarities do you see? 

Finally, let's look at the word counts in a table. For visualization purposes we can put the two top 25 lists into a single pandas dataframe.

In [None]:
import pandas
Raubal_top25_df = pandas.DataFrame(list(Raubal_top25.items()),columns = ['noun','count']) 
Miller_top25_df = pandas.DataFrame(list(Miller_top25.items()),columns = ['noun','count']) 
result = pandas.concat([Raubal_top25_df, Miller_top25_df], axis=1).reindex(Raubal_top25_df.index)
result

What do these two lists tell you about differences in the articles? Can you spot some "words" that shouldn't be in this list? Oops, that would call for a bit more cleaning up, but for now, we're good!
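
If you would like the computer to help with the comparison, Python sets can pick out the nouns two dictionaries share. The sketch below uses two small made-up dictionaries (the counts are invented, not results from the articles). You could try the same idea on *Raubal_top25* and *Miller_top25*.

```python
# Invented counts standing in for the two top-25 dictionaries
raubal_sample = {"data": 9, "mobility": 7, "transport": 4}
miller_sample = {"mobility": 8, "data": 5, "movement": 6}

# Turning a dictionary into a set keeps just its keys (the nouns)
shared = sorted(set(raubal_sample) & set(miller_sample))        # in both
only_raubal = sorted(set(raubal_sample) - set(miller_sample))   # only in the first
only_miller = sorted(set(miller_sample) - set(raubal_sample))   # only in the second

print("Shared:", shared)
print("Only in first:", only_raubal)
print("Only in second:", only_miller)
```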

# Congratulations!


**You have finished an Hour of CI!**


But, before you go ... 

1. Please fill out a very brief questionnaire to provide feedback and help us improve the Hour of CI lessons. It is fast and your feedback is very important to let us know what you learned and how we can improve the lessons in the future.
2. If you would like a certificate, then please type your name below and click "Create Certificate" and you will be presented with a PDF certificate.

<font size="+1"><a style="background-color:blue;color:white;padding:12px;margin:10px;font-weight:bold;" href="https://forms.gle/JUUBm76rLB8iYppN7">Take the questionnaire and provide feedback</a></font>


In [2]:
# This code cell has a tag "Hide" (Setting by going to Toolbar > View > Cell Toolbar > Tags)
# Code input is hidden when the notebook is loaded and can be hidden or shown using the toggle button "Toggle raw code" at the top

# This code cell loads the Interact Textbox that will ask users for their name
# Once they click "Create Certificate" then it will add their name to the certificate template
# And present them a PDF certificate
from PIL import Image
from PIL import ImageFont
from PIL import ImageDraw

from ipywidgets import interact

def make_cert(learner_name):
    cert_filename = 'hourofci_certificate.pdf'

    img = Image.open("../../supplementary/hci-certificate-template.jpg")
    draw = ImageDraw.Draw(img)

    cert_font = ImageFont.load_default()
    cert_font = ImageFont.truetype('../../supplementary/times.ttf', 150) 
    
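    # getbbox replaces getsize, which was removed in Pillow 10; its right and bottom values match the old (w, h)
    _, _, w, h = cert_font.getbbox(learner_name)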
    draw.text( xy = (1650-w/2,1100-h/2), text = learner_name, fill=(0,0,0),font=cert_font)
    
    img.save(cert_filename, "PDF", resolution=100.0)   
    return cert_filename


interact_cert=interact.options(manual=True, manual_name="Create Certificate")

@interact_cert(name="Your Name")
def f(name):
    print("Congratulations",name)
    filename = make_cert(name)
    print("Download your certificate by clicking the link below.")
    
    
    



<font size="+1"><a style="background-color:blue;color:white;padding:12px;margin:10px;font-weight:bold;" href="hourofci_certificate.pdf">Download your certificate</a></font>