<a href="https://colab.research.google.com/github/kevin-kaianalytics/coursera_course_reviews/blob/master/bcirp2020_workshop_nlp_higher_ed.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<p><img alt="Kai Analytics logo" height="45px" src="https://static.wixstatic.com/media/8e400c_2cf51ced4ded437f9151bf60ba5a32d0~mv2.png/v1/fill/w_108,h_45,al_c,q_85,usm_0.66_1.00_0.01/8e400c_2cf51ced4ded437f9151bf60ba5a32d0~mv2.webp" align="left" hspace="10px" vspace="0px"></p>
<br>
<br>
<h1>Text Analytics for Institutional Research</h1>
<h2>BCIRP 2020 Virtual Workshop</h2>

The workbook accompanies the following presentation on 

<hr>

# Getting Started

The document you are reading is not a static web page, but an interactive environment called a **Jupyter Notebook** (*powered by Google Colab*) that lets you write and execute code.

For example, below is a **code cell** with a short Python script that prints a statement. Simply press the ![Play Button](https://img.icons8.com/ios-glyphs/30/000000/play-button-circled.png) button to see it run!

In [0]:
print("Hello Everyone!")

Woohoo! You just ran some Python code! Congratulations.

![Thumbs up](https://img.icons8.com/dusk/64/000000/thumb-up.png)

<hr>

# Load our Dataset

Today we will work with a subset of the Coursera course review dataset from [Kaggle](https://www.kaggle.com/septa97/100k-courseras-course-reviews-dataset/).

Kaggle is the world's largest data science community with powerful tools and resources to help you achieve your data science goals.

This dataset contains only feedback related to courses that teach English writing. In the next **code cell**, we will:

*   Load the first Python library in today's workshop, pandas. It will help us easily read and manipulate our dataset
*   Read our dataset and look at some of its attributes


In [0]:
# Let's import the pandas (Python Data Analysis Library) library.
# It comes preinstalled in Colab, so we don't have to install it.

import pandas as pd

# Set the url to where the dataset is located.
# In this case, it's one of my public gitHub repositories.
url = 'https://raw.githubusercontent.com/kevin-kaianalytics/coursera_course_reviews/master/writing_course_review.csv'

# Read the CSV file.
df = pd.read_csv(url) # df is shorthand for dataframe

# Let's check and see we've loaded our data properly.
# We do this by checking its first 10 rows.
pd.options.display.max_columns = None
display(df.head(10))

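# If you want to upload a CSV from your desktop instead:
# import io
# from google.colab import files
# uploaded = files.upload()  # returns a dict of {filename: file contents as bytes}
# df = pd.read_csv(io.BytesIO(next(iter(uploaded.values()))))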

We can also see how many non-empty values are in a column of our data with the **count()** command. How many rows should we expect to see?

In [0]:
print("Number of rows in the 'Review' column: ", df['Review'].count())

<hr>

## Our First Word Cloud

Okay, let's see what a word cloud of the responses looks like. *(Don't worry about the complexity of the code here, we're just generating a word cloud)*

In [0]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Quickly make everything lower case and strip punctuation
df_wordcloud = df['Review'].astype(str).str.lower().str.replace(r'[^\w\s]', '', regex=True)

# Join all reviews into one long string for the word cloud
text = ' '.join(df_wordcloud)
wordcloud = WordCloud().generate(text)
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

# Install and Import the NLTK Package

"The Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing for English written in the Python programming language."

It is great for understanding the key concepts we discussed today.

In the next **code cell**, we will be installing and importing the NLTK package.

*Note: This may take a moment*

In [0]:
!pip install nltk  # Install the Natural Language Toolkit
import nltk

# Text Pre-Processing

In this section, we use some of NLTK's text pre-processing functions to help us improve the results of our analysis. As introduced in our presentation, these steps will include:

*   Tokenizing text into words
*   Removing stop words
*   Lemmatization
*   Parts of Speech Tagging

## Tokenize the comments

From the NLTK tokenize module, we will import the word_tokenize function.
We will also need to download the pre-trained [Punkt tokenizer](https://www.nltk.org/api/nltk.tokenize.html) for English. This tokenizer divides a text into a list of sentences, by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences.

In [0]:
# Import the word_tokenize function from the nltk.tokenize module
from nltk.tokenize import word_tokenize

# Download the Punkt sentence tokenizer models used by word_tokenize
nltk.download('punkt')
# Newer NLTK releases (3.9 and later) look for 'punkt_tab' instead
nltk.download('punkt_tab')

# Standard text clean-up: lower case, then replace anything that is not a letter or '#' with a space
df['Review'] = df['Review'].astype(str).str.lower().str.replace(r'[^a-zA-Z#]', ' ', regex=True)

# Apply Tokenizer and create a new column named "ReviewTokenized"
df['ReviewTokenized'] = df['Review'].apply(word_tokenize)

# Output the first 5 results
print(df['ReviewTokenized'].head())
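
To make the clean-up step more concrete, here is a minimal sketch that uses only Python's standard library. The sample sentence is made up, and `str.split()` is only a rough stand-in for `word_tokenize`, which treats contractions and punctuation more carefully.

```python
import re

# Hypothetical sample review (not taken from the dataset)
sample = "Great course! I learned 10 new things about essay-writing."

# Lower case, then replace every character that is not a letter or '#' with a space
cleaned = re.sub(r'[^a-zA-Z#]', ' ', sample.lower())

# Rough stand-in for word_tokenize: split on runs of whitespace
tokens = cleaned.split()

print(tokens)
```

Notice that the digits and the hyphen disappear along with the punctuation, so "essay-writing" becomes two separate tokens.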

## Remove Stopwords

Remove frequent words that are redundant to our analysis. Let's see what stop words come out of the box.

In [0]:
# download default list of stopwords
nltk.download('stopwords')

# Import the stopwords corpus reader from nltk.corpus
from nltk.corpus import stopwords

In [0]:
# We will load the English stop words into the stop variable and see its output
stop = stopwords.words('english')

# Output all the default stopwords
print(stop)

Let's remove the stop words from our dataset.
*This next **code cell** contains some advanced concepts: lambdas and list comprehensions. Basically, we apply a one-off function (lambda) to every row; inside it, a list comprehension loops through every token in that row and keeps only the words not in the stop word list.*

In [0]:
df['stop_words'] = df['ReviewTokenized'].apply(lambda x: [item for item in x if item not in stop])

# Output the first row and compare it with the previous steps
print(df[['Review']].iloc[0].to_string())
print('')
print(df[['ReviewTokenized']].iloc[0].to_string())
print('')
print(df[['stop_words']].iloc[0].to_string())
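
If lambdas and list comprehensions are new to you, the following sketch may help. It performs the same kind of filtering on two made-up rows, first with a named function and then with an equivalent lambda. The tiny stop word set and the helper names are assumptions for illustration only; NLTK's English list is much longer.

```python
# Hypothetical mini stop word list, stored as a set for fast lookups
stop_words = {"i", "the", "a", "this", "was", "and"}

def remove_stop_words(tokens):
    """Keep only the tokens that are not in the stop word set."""
    return [token for token in tokens if token not in stop_words]

# The same logic written as a one-off (lambda) function
remove_stop_words_lambda = lambda tokens: [t for t in tokens if t not in stop_words]

# Two made-up tokenized reviews
rows = [
    ["this", "course", "was", "great"],
    ["i", "loved", "the", "instructor", "and", "the", "content"],
]

filtered = [remove_stop_words(row) for row in rows]
print(filtered)
```

Using a set instead of a list for the stop words is an optional speed-up: checking membership in a set does not get slower as the list of stop words grows.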

## Lemmatization

Lemmatize using WordNet's built-in morphy function. Returns the input word unchanged if it cannot be found in WordNet. [WordNet -> Stem](https://www.nltk.org/_modules/nltk/stem/wordnet.html)

In [0]:
nltk.download('wordnet')

from nltk.stem import WordNetLemmatizer
# Create a WordNetLemmatizer instance and simply name it lemmatizer
lemmatizer = WordNetLemmatizer()

Now we test it with some English words

In [0]:
print('dogs: ', lemmatizer.lemmatize('dogs'))
print('kittens: ', lemmatizer.lemmatize('kittens'))
print('students: ', lemmatizer.lemmatize('students'))
print('profs: ', lemmatizer.lemmatize('profs'))
print('professors: ', lemmatizer.lemmatize('professors'))
print('')
print('**teaching: ', lemmatizer.lemmatize('teaching'))

By default, WordNetLemmatizer() treats all words as nouns. So it needs a part of speech tag to help it out.

('v' is for verb)

In [0]:
print('Now with a verb tag')
print('---------')
print('teaching: ', lemmatizer.lemmatize('teaching', 'v'))
print('teaches: ', lemmatizer.lemmatize('teaches', 'v'))
print('taught: ', lemmatizer.lemmatize('taught', 'v'))

As you can see, without a part of speech (POS) tag, the WordNetLemmatizer() treats all words as nouns. So how can we assign POS tags more efficiently?

We can use the [NLTK Taggers package](https://www.nltk.org/_modules/nltk/tag.html)!
This package contains classes and interfaces for part-of-speech tagging, or simply "tagging".

## Parts of Speech Tagging

"This package defines several taggers, which take a list of tokens, assign a tag to each one, and return the resulting list of tagged tokens. Most of the taggers are built automatically based on a training corpus. For example, the unigram tagger tags each word *w* by checking what the most frequent tag for *w* was in a training corpus" -- *from NLTK 3.5 documentation*

In [0]:
from nltk import pos_tag
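nltk.download('averaged_perceptron_tagger')
# Newer NLTK releases (3.9 and later) look for 'averaged_perceptron_tagger_eng' instead
nltk.download('averaged_perceptron_tagger_eng')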

df['token_pos'] = df['stop_words'].apply(pos_tag)
print(df['token_pos'].head())

### A note on the "greedy averaged perceptron tagger." ###

A greedy algorithm is an approach to solving an optimization problem by taking the best-looking choice at each step. In the case of this tagger, it picks the most likely POS tag for each word as it goes, based on a model pre-trained on a corpus annotated with the *Penn Treebank tagset*. There are numerous benefits to this approach, but it has an important limitation. It is not uncommon for one English word to hold multiple meanings and POS representations. For example:

*   The child had a nasty **fall** from the wall.
*   We went to see a water **fall** yesterday.
*   There is no chance of **fall** in prices of essential commodities.

*Averaged* means that the tagger's weights were averaged over several iterations during training. Said plainly, it's quite fast, but it's not perfect, especially for words used with a rarer POS. So use with caution!
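
The limitation is easiest to see with a much simpler tagger than the averaged perceptron. The sketch below is a toy "most frequent tag" tagger in the spirit of the unigram tagger quoted earlier. It is not how NLTK's perceptron tagger works internally, and the tiny training list is made up for illustration.

```python
from collections import Counter, defaultdict

# Tiny hand-made "training corpus" of (word, tag) pairs; purely illustrative
training = [
    ("fall", "NN"), ("fall", "NN"), ("fall", "VB"),
    ("prices", "NNS"), ("prices", "NNS"),
]

# Count how often each tag was seen for each word
tag_counts = defaultdict(Counter)
for word, tag in training:
    tag_counts[word][tag] += 1

def most_frequent_tag(word, default="NN"):
    """Return the tag seen most often for `word`, or `default` if unseen."""
    if word in tag_counts:
        return tag_counts[word].most_common(1)[0][0]
    return default

# "fall" is used as a verb here, but the toy tagger still picks the noun tag
sentence = ["prices", "fall"]
tagged = [(w, most_frequent_tag(w)) for w in sentence]
print(tagged)
```

Because the toy tagger never looks at neighbouring words, the rarer verb reading of "fall" is always missed. Real taggers use context to do better, but the rarer readings remain the ones most likely to be mistagged.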

---
Now let's try lemmatizing again with the additional POS tags. 
To begin, we need to define a custom function that converts the [Penn Treebank tagset](https://catalog.ldc.upenn.edu/docs/LDC95T7/cl93.html) to nouns (n), verbs (v) and adjectives (a). These are the main POS that the WordNet lemmatizer accepts (it also accepts adverbs, 'r'). Actually, if you think about it, prepositions don't have much left in them to truncate! 

*Sorry, but you won't find any 'But' jokes in this workshop!*

In [0]:
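def lemmatize_all(token):
    # Tag the single token, then lemmatize it with the matching WordNet POS.
    # Note: tagging one token at a time means the tagger sees no sentence context.
    word, tag = pos_tag([token])[0]
    if tag.startswith('NN'):
        return lemmatizer.lemmatize(word, pos='n')
    elif tag.startswith('VB'):
        return lemmatizer.lemmatize(word, pos='v')
    elif tag.startswith('JJ'):
        return lemmatizer.lemmatize(word, pos='a')
    else:
        # Other tags (prepositions, pronouns, etc.) are returned unchanged
        return word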
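
The tag conversion on its own can be sketched without NLTK. The helper name `penn_to_wordnet` is hypothetical; it only mirrors the prefix checks used in the function above and returns `None` for tags we leave unchanged.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, or None if unmapped."""
    if tag.startswith("NN"):   # NN, NNS, NNP, NNPS
        return "n"
    if tag.startswith("VB"):   # VB, VBD, VBG, VBN, VBP, VBZ
        return "v"
    if tag.startswith("JJ"):   # JJ, JJR, JJS
        return "a"
    return None                # e.g. IN (preposition): no lemmatization

examples = ["NNS", "VBD", "JJR", "IN"]
mapped = {tag: penn_to_wordnet(tag) for tag in examples}
print(mapped)
```

Checking only the first two letters works because the Penn Treebank tagset groups related tags under a common prefix.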

With the function defined, we can now apply the tagged lemmatization to every row in our dataset. 

*This might just take a second or two*

In [0]:
df['tagged_token'] = df['ReviewTokenized'].apply(lambda x: [lemmatize_all(y) for y in x]).apply(lambda x: [item for item in x if item not in stop])

Great! Now we're ready to look at the results. We'll run it just for the first row.

In [0]:
row_number = 0

print(df[['Review']].iloc[row_number].to_string())
print('')
print(df[['ReviewTokenized']].iloc[row_number].to_string())
print('')
print(df[['stop_words']].iloc[row_number].to_string())
print('')
print(df[['tagged_token']].iloc[row_number].to_string())

Can you see the difference from the original sentence?

Notice how we've simplified our sentence yet retained much of its meaning?

---

**[Optional]** If you want, you can download a copy of the results up to this point by running the next **code cell**.

In [0]:
from google.colab import files

df.to_csv('df.csv')
files.download('df.csv')

# Text Analysis
Now that we've applied some common text pre-processing methods (*By the way, congratulations on getting this far*), we can analyze some text! 

In the next two sections we will take a look at the n-gram method and see what the results look like when visualized.

## N-Grams

"An n-gram is a contiguous sequence of n items from a given sample of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech." -- [Wikipedia](https://en.wikipedia.org/wiki/N-gram).

Most of the time we're dealing with words, but sometimes we might go down to the character level. For example, "A", "T", "G" and "C", the building blocks of DNA.

N-grams can be read forward or in reverse. For example, the user may have searched "blue sky", but were they in fact interested in the colour "sky blue"?

Sometimes you want to use skip-grams, especially when it comes to predicting the next word (e.g. autofill). 

For example, fill in the blank to this sentence, "the cat ___ on the mat"...
Is it "sat", "played", or hopefully not... "pooped"?
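
To see what "a contiguous sequence of n items" means in practice, here is a small sketch that builds n-grams with plain Python. The helper name `make_ngrams` is an assumption for illustration; later in the workbook we use NLTK's own `ngrams` function instead.

```python
def make_ngrams(tokens, n):
    """Return all contiguous n-token tuples from a list of tokens."""
    # zip the list against copies of itself shifted by 1, 2, ..., n-1 positions
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ["the", "cat", "sat", "on", "the", "mat"]

bigrams = make_ngrams(tokens, 2)
trigrams = make_ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

A sentence with six tokens gives five bigrams and four trigrams. If `n` is larger than the number of tokens, the result is simply an empty list.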

---

Jokes aside, let's start by importing some libraries to help us with our analysis.

In [0]:
from collections import Counter
#Counter will help us keep track of elements (our n-grams) and their count.

from nltk.util import ngrams
# this is for a generalized n-gram. see also bigrams.

In [0]:
#Specify number of grams: bigrams(2), trigrams(3), etc.
number_of_grams = 2

ngrams_results = []
for x in (df['tagged_token']):
    ngrams_results.append(list(ngrams(x,number_of_grams))) 
ngrams_results = [y for x in ngrams_results for y in x]

ngram_count = Counter(ngrams_results)

dataset_ngram = pd.DataFrame.from_dict(ngram_count, orient='index').reset_index()
dataset_ngram.columns = ["ngrams", "count"]

dataset_ngram = dataset_ngram.sort_values(by=['count'], ascending=False).head(40)

print(dataset_ngram.sort_values(by=['count'], ascending=False))
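
If you only need the top n-grams and not a DataFrame, `Counter` can do the ranking by itself. The sketch below counts bigrams across two made-up rows, which stand in for `df['tagged_token']`, and asks for the most common one.

```python
from collections import Counter

# Hypothetical token lists standing in for two rows of df['tagged_token']
rows = [
    ["great", "course", "learn", "lot"],
    ["great", "course", "good", "instructor"],
]

# Count adjacent word pairs across all rows
bigram_counts = Counter()
for tokens in rows:
    bigram_counts.update(zip(tokens, tokens[1:]))

# most_common(k) returns the k highest counts as (item, count) pairs
top = bigram_counts.most_common(1)
print(top)
```

Counting within each row, as we do here, avoids creating artificial bigrams that span the end of one review and the start of the next.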



## Visualizing a Network Graph

"In graph theory, a graph is a structure amounting to a set of objects in which some pairs of the objects are in some sense "related". The objects correspond to mathematical abstractions called vertices and each of the related pairs of vertices is called an edge." -- [Wikipedia](https://en.wikipedia.org/wiki/Graph_(discrete_mathematics))

In [0]:
d = dataset_ngram.set_index('ngrams').T.to_dict('records')
n_size = dataset_ngram['count'].astype(int).tolist()
n_list = dataset_ngram['ngrams'].astype(str).tolist()

print(d)
print(len(n_size))
print(len(n_list))
print(d[0])
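
Before handing the bigrams to a graph library, it can help to picture the graph as a plain dictionary: each word is a vertex, and each bigram is an edge weighted by its count. This is only a sketch with made-up counts, in the same shape as `d[0]` above.

```python
from collections import defaultdict

# Hypothetical bigram counts in the same shape as d[0]: {(word_a, word_b): count}
bigram_counts = {
    ("great", "course"): 12,
    ("course", "material"): 7,
    ("great", "instructor"): 5,
}

# Undirected graph stored as {vertex: {neighbour: weight}}
adjacency = defaultdict(dict)
for (a, b), count in bigram_counts.items():
    adjacency[a][b] = count
    adjacency[b][a] = count

vertices = sorted(adjacency)
print(vertices)            # every word that appears in at least one bigram
print(adjacency["great"])  # the edges touching "great"
```

Words that appear in many frequent bigrams, such as "great" and "course" here, end up with the most connections, and those tend to sit near the centre of the plotted network.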

In [0]:
import networkx as nx
from matplotlib import cm
import matplotlib.pyplot as plt
from google.colab import files # this will help us save our plot as an image
#from matplotlib.pyplot import figure

# Create network plot 
G = nx.Graph()

# Create connections between nodes
for k, v in d[0].items():
    G.add_edge(k[0], k[1], weight=(v * 5))
#    G.add_nodes_from(n_list)


fig, ax = plt.subplots(figsize=(10, 8), dpi = 90)

pos = nx.spring_layout(G, k=2)


"""

Working on ways to vary the size of each node based on frequency or weight...

print(pos)

print(G.number_of_nodes())
print(G.nodes())
print(G.edges())


print(len(pos))
print(len(G))

colors = range(len(G))
print(colors)
print(type(colors))
print(plt.cm.Blues)

"""

#nx.draw(G, pos, node_color=range(24), node_size=800, cmap=plt.cm.Blues)

# Plot networks
nx.draw_networkx(G, pos,
                 font_size=14,
#                 nodelist = n_list,
#                 node_size= n_size,
                 width=3,
                 edge_color='grey',
                 edge_cmap=plt.cm.Blues,
                 node_color='#4CB1D7',
                 with_labels = False,
                 ax=ax
                 )

# nx.draw_networkx_edges(G = G, pos = pos, edge_color='#73CAC7', alpha=0.6, width=(n_size))

# Create offset labels to improve readability
for key, value in pos.items():
    x, y = value[0], value[1]
    ax.text(x, y,
            s=key,
            bbox=dict(facecolor='#5DC5C1', alpha=0.25),
            horizontalalignment='center', fontsize=12)

# If you want to download the graph, just uncomment the next two lines
# plt.savefig("network_graph_coursera.png")
# files.download("network_graph_coursera.png")    
plt.show()


## And Finally, A Better Word Cloud

In [0]:
import wordcloud as w

word_dict = dict(zip(dataset_ngram["ngrams"].astype(str), dataset_ngram["count"]))
# word cloud needs a dictionary format if we're generating from frequencies

wordcloud = w.WordCloud(collocations=False, 
                        max_words=500, 
                        width=1000,
                        height=500).generate_from_frequencies(word_dict)

# Set plot size and output the graph
plt.figure(figsize=(15,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

# If you want to download the word cloud, uncomment the next two lines.
# to_file saves the image directly, so it works even after plt.show()
# wordcloud.to_file("ngram_coursera_wordcloud.png")
# files.download("ngram_coursera_wordcloud.png")
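
One optional refinement: converting the bigram tuples with `astype(str)` produces labels such as `('great', 'course')`, brackets and quotes included. A possible alternative, sketched here with made-up counts, is to join each tuple into a plain phrase before building the frequency dictionary.

```python
# Hypothetical bigram counts keyed by tuples
bigram_counts = {("great", "course"): 12, ("learn", "lot"): 7}

# Join each tuple into a plain phrase so the label reads "great course"
# rather than "('great', 'course')"
word_freq = {" ".join(gram): count for gram, count in bigram_counts.items()}

print(word_freq)
```

A dictionary like `word_freq` has the same string-to-count shape that `generate_from_frequencies` receives in the cell above.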