#IST664/CIS668 - Homework 4 Template

##My name: __________

By adding your name to the space above, you attest that this work is all your own, except in those code and text blocks where you have given attribution to another author. You do not need to provide attribution for code copied from the labs or exercises for this class.

Sometimes it is helpful to discuss the homework with other members of the class. This is fine as long as you do not share code. If you collaborated with one or more individuals, list their names here:

###My collaborators: __________

For this homework, you will process a dataset in which each row contains a series of sentences separated by semicolons. The goal is to develop a matrix representation of the relationships among these sentences that a CNN can process to make predictions.

You will use a pre-trained sentence vectorization model to produce sentence embeddings for each sentence and then compute a square matrix of cosine similarities among the sentences. This matrix will serve as the input to the CNN predictive models.
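
To make the idea of a square cosine similarity matrix concrete, here is a minimal sketch in plain NumPy. The three vectors are made-up stand-ins for sentence embeddings, and `cosine_similarity_matrix` is a hypothetical helper, not part of the template; the template itself relies on the `cos_sim` utility from `sentence_transformers` later on.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Return the square matrix of pairwise cosine similarities."""
    vectors = np.asarray(vectors, dtype=float)
    # Scale each row to unit length so that dot products become cosines.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / norms
    return unit @ unit.T

# Three made-up 4-dimensional "sentence embeddings" (illustrative values only).
toy_vectors = np.array([
    [1.0, 0.0, 2.0, 0.0],
    [2.0, 0.0, 4.0, 0.0],   # same direction as the first vector
    [0.0, 3.0, 0.0, 1.0],   # orthogonal to the first two
])

sim = cosine_similarity_matrix(toy_vectors)
print(np.round(sim, 3))
```

The result is symmetric with ones on the diagonal, because every sentence is perfectly similar to itself. A paragraph with n sentences therefore yields an n x n matrix.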

This template helps to pre-process the data with the following steps:

1. Parse the data into sentences.
2. Develop a vector summary for each sentence.
3. Create a matrix of similarities among the sentences.
4. Pad the matrix to a common (square) size. This will be the input to the CNN.
5. Convert input matrices and output vectors into tensors.


In [None]:
import pandas as pd
# Read in the data from Github
url = "https://raw.githubusercontent.com/jmstanto/ist664/main/paracoh.csv"

sentDB = pd.read_csv(url)

sentDB.shape

In [None]:
sentDB # Preview the data

The text field contains several sentences harvested from Wikipedia articles and separated by semicolons. The polys field is a readability index that is computed by looking at the ratio of polysyllabic words to total word count. cl is the Coleman-Liau index, another measure of readability that takes into account word length and sentence length.
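
For intuition about the cl field, the following sketch computes the standard Coleman-Liau formula, 0.0588 * L - 0.296 * S - 15.8, where L is the number of letters per 100 words and S is the number of sentences per 100 words. The tokenization here is deliberately rough and the function name is hypothetical; the values in the dataset may have been computed with different preprocessing, so do not expect an exact match.

```python
import re

def coleman_liau(text, sentence_sep=";"):
    """Coleman-Liau index: 0.0588 * L - 0.296 * S - 15.8."""
    # Sentences are assumed to be separated by semicolons, as in the dataset.
    sentences = [s for s in text.split(sentence_sep) if s.strip()]
    # Rough tokenization: a "word" is any run of ASCII letters.
    words = re.findall(r"[A-Za-z]+", text)
    letters = sum(len(w) for w in words)
    L = 100.0 * letters / len(words)          # letters per 100 words
    S = 100.0 * len(sentences) / len(words)   # sentences per 100 words
    return 0.0588 * L - 0.296 * S - 15.8

simple = "The cat sat on the mat;It was a sunny day"
dense = "Photosynthesis converts atmospheric carbon dioxide"

print(round(coleman_liau(simple), 2))
print(round(coleman_liau(dense), 2))
```

Longer words and longer sentences both push the index up, so the second text scores higher than the first. Reordering the sentences changes neither count, which is why the index cannot reflect coherence.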

Both polys and cl are insensitive to the order of sentences in a paragraph and therefore do not measure paragraph coherence. In contrast, the incoh score counts the number of ordering substitutions in a paragraph: 0 indicates that the sentences are in their original order, 1 indicates that one sentence is in the wrong place, and so on.

The technique from this lab, capturing a matrix of similarity scores among the sentences in a paragraph and analyzing it with a CNN, should in theory be able to detect when the sentences in a paragraph appear out of order.
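
The sketch below illustrates why sentence order could be visible in a similarity matrix. It uses a made-up 4 x 4 matrix (the values are illustrative, not taken from the dataset) and swaps two sentences. Reordering sentences permutes the rows and columns together, so the set of similarity values stays the same while their positions change.

```python
import numpy as np

# Made-up symmetric similarity matrix for four sentences (illustrative values only).
S = np.array([
    [1.0, 0.8, 0.3, 0.1],
    [0.8, 1.0, 0.7, 0.2],
    [0.3, 0.7, 1.0, 0.6],
    [0.1, 0.2, 0.6, 1.0],
])

# Swap the second and third sentences.
order = [0, 2, 1, 3]
S_shuffled = S[np.ix_(order, order)]

# The collection of similarity values is unchanged...
same_values = np.array_equal(np.sort(S, axis=None), np.sort(S_shuffled, axis=None))

# ...but their positions are not. The first superdiagonal holds the
# similarity of each sentence to the one that follows it.
adjacent_before = np.diag(S, k=1)
adjacent_after = np.diag(S_shuffled, k=1)

print(same_values)
print(adjacent_before, adjacent_after)
```

In this toy case the similarities between neighboring sentences drop after the swap. Spatial patterns of this kind are what a convolutional layer is able to pick up, whereas any summary that ignores position would not change.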

In [None]:
# These are the three metrics you will predict
print(type(sentDB["polys"])) # The polysyllabic score
print(type(sentDB["cl"])) # The Coleman-Liau Index
print(type(sentDB["incoh"])) # The number of sentence ordering incoherencies

**Task 1**

Plot histograms that show the shape of the distribution for each of the three outcome variables. You don't need to take any action if the distributions look unusual, but doing this kind of diagnosis is an important part of becoming oriented to any data science problem.

In [None]:
# HW4T1a
# Task 1: Produce histograms for each of the three metric outcome variables
#

In [None]:
# How big does our padded (square) input matrix need to be?
max_mat_size = max([len(sent.split(";")) for sent in sentDB["text"]])
max_mat_size

**Task 2**

Show a few of the texts from the dataset. Note the semicolons that separate the sentences. Each row of the dataset has a set of sentences – the number of sentences varies per row.

In [None]:
# HW4T2a
# Task 2: Examine a few of the texts.
#


In [None]:
# We will need the library for loading sentence transformers
# This generates a lot of output, but should run pretty fast.
!pip install sentence-transformers

In this next step you have the opportunity to select one of three sentence vectorizers. These vary in terms of their dimensionality. The cosine similarities that you generate to measure the similarity of sentences will be affected by which model you choose.

In [None]:
#@title Task 3: Choose a Pretrained Sentence Summarizer

model_name = 'Six Level Mini-LM V2 (d=384)'  #@param ["Six Level Mini-LM V2 (d=384)", "Multilingual Sentence BERT (d=512)", "Multi-QA MPnet (d=768)"]

map_name_to_handle = {
    'Six Level Mini-LM V2 (d=384)':
        'sentence-transformers/all-MiniLM-L6-v2',
    'Multilingual Sentence BERT (d=512)':
        'sentence-transformers/distiluse-base-multilingual-cased-v2',
    'Multi-QA MPnet (d=768)':
        'sentence-transformers/multi-qa-mpnet-base-dot-v1'
    
}


my_transformer = map_name_to_handle[model_name]

print(f'Sentence model selected           : {my_transformer}')


In [None]:
# Now load the pre-trained sentence transformer, based on your selection above.
# This downloads a lot of data to your virtual machine and takes half a minute or so.
from sentence_transformers import SentenceTransformer


model = SentenceTransformer(my_transformer)

In [None]:
# This defines a function that takes an input matrix and puts it in the
# upper left corner of a padded, standard-sized matrix

import numpy as np

def pad_matrix(in_mat, mat_size):

  ret_mat = np.zeros(shape=(mat_size, mat_size))

  # By using Python indices, we can target the upper left subset
  # of the return matrix in the same shape as the input matrix.
  ret_mat[0:in_mat.shape[0] , 0:in_mat.shape[1] ] = in_mat
  
  return ret_mat

In [None]:
# Test the padder
M = np.arange(3*3).reshape((3,3))
print(M)

pad_matrix(M, 5)

In [None]:
# HW4T3a
# Exercise: Test the padder using a 4x4 input matrix that needs to be padded to 6x6
#


In [None]:
# Process the individual texts: Computational time depends on the sentence summarizer selected,
# but could take 15-20 minutes for nearly 5000 instances.
from sentence_transformers.util import cos_sim

mat_list = []

for text in sentDB["text"]:
  items = text.split(";")
  vect_list = model.encode(items)
  sim_matrix = cos_sim(vect_list, vect_list)
  mat_list.append(pad_matrix(sim_matrix, max_mat_size))

In [None]:
# Should have the same number of matrices as rows in the pandas df
len(mat_list)

In [None]:
# HW4T3b
# (End of) Task 3: Review one of the padded matrices. Comment on what you see.
#


**Task 4**

Set up a CNN model to process the similarity matrices. The goal is to train models that can predict each of the three measures of quality using the similarity matrices as input. Use the same architecture for all three models. Don't forget that all three of these outcome variables are interval/metric data. You should therefore make the final layer of the model a single linear activation node. This is not the only possible option, but it is the most sensible one. Make sure to show a model summary that shows the layers and the shapes of data flowing between the layers.

In [None]:
# First set up testing and training splits that will be used by each of the three models
from sklearn.model_selection import train_test_split
import tensorflow as tf


# Do random splits for testing and training
PX_train, PX_test, Py_train, Py_test = train_test_split(mat_list, sentDB["polys"], test_size=0.33, random_state=42)
CX_train, CX_test, Cy_train, Cy_test = train_test_split(mat_list, sentDB["cl"], test_size=0.33, random_state=42)
IX_train, IX_test, Iy_train, Iy_test = train_test_split(mat_list, sentDB["incoh"], test_size=0.33, random_state=42)

# Convert to tensors for presenting to TF
PX_train_tensor = tf.convert_to_tensor(PX_train)
PX_test_tensor = tf.convert_to_tensor(PX_test)
Py_train_tensor = tf.convert_to_tensor(Py_train)
Py_test_tensor = tf.convert_to_tensor(Py_test)

CX_train_tensor = tf.convert_to_tensor(CX_train)
CX_test_tensor = tf.convert_to_tensor(CX_test)
Cy_train_tensor = tf.convert_to_tensor(Cy_train)
Cy_test_tensor = tf.convert_to_tensor(Cy_test)

IX_train_tensor = tf.convert_to_tensor(IX_train)
IX_test_tensor = tf.convert_to_tensor(IX_test)
Iy_train_tensor = tf.convert_to_tensor(Iy_train)
Iy_test_tensor = tf.convert_to_tensor(Iy_test)



In [None]:
# Here are a few keras imports that will probably be needed. You can include
# other kinds of layers appropriate to CNNs if you like. 
from keras.models import Sequential
from keras.layers import Conv2D
from keras.layers import MaxPooling2D
from keras.layers import Flatten
from keras.layers import Dropout
from keras.layers import Dense

from keras.losses import MeanSquaredError

# Hint: after running modelC = Sequential(), you can add each new layer using
# modelC.add()

At this point in the code, create three separate CNN models using Sequential(). Look to the Week 8 lecture and Lab 8 for information on configuring CNN models. The most basic model could consist of a 2D convolutional layer (because your input is a set of 2D matrices) followed by a max-pooling-2D layer, followed by a flattening layer, and concluding with a single unit dense layer with linear activation (because you are predicting a metric output rather than a categorical one). You should try more complex models as well. Note that your error function should be Mean Squared Error.
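
One detail to keep in mind: with the default channels-last layout, a Keras Conv2D layer expects each sample to have the shape (rows, cols, channels), whereas a stack of 2D matrices has the shape (samples, rows, cols). The sketch below shows one way to add the missing channel axis, using NumPy and hypothetical toy matrices in place of the real `mat_list`. Whether you need this step depends on how you feed data to your model, so treat it as an assumption to check against your own shapes.

```python
import numpy as np

# Hypothetical stand-ins: five padded 8x8 matrices, like the entries of mat_list.
toy_size = 8
toy_mat_list = [np.eye(toy_size) for _ in range(5)]

# Stacking the list gives shape (samples, rows, cols).
X = np.stack(toy_mat_list)

# Add a trailing channel axis so each sample has shape (rows, cols, 1).
X_with_channel = np.expand_dims(X, axis=-1)

print(X.shape, X_with_channel.shape)
```

The same operation is available for tensors as `tf.expand_dims(x, axis=-1)`.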

Hint: Get the first model working first (the one where you will predict Py_train_tensor during training), making adjustments to the layers and hyperparameters as needed to improve the training. Once that first model is working well, you can simply copy it and change the training inputs and outputs.

In [None]:
# These are some hyperparameters that may be helpful in specifying the layers
input_shape = (max_mat_size, max_mat_size, 1) # Each input is a single layer square matrix 
num_filters = 6 # You can adjust this up or down to try to improve model fit
kern_size = (6, 6) # You can adjust this up or down as needed to improve model fit
max_pooling_size = (2, 2) # Making a pooling window larger than 2,2 can result in a loss of important data
dense_size = 64 # How many neurons to receive the output of the max pooling layer
dense_act = 'linear'
val_split = 0.2

Remember, because you are predicting metric outcomes (i.e., floating point numbers) rather than categories, you should use MeanSquaredError() as your loss function. Make sure to show a model summary that confirms the shapes of the various layers.  
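
Before reading a model summary, it can help to work out by hand what shapes to expect. The sketch below does the arithmetic for the most basic architecture (convolution, max pooling, flatten, single linear unit), assuming the Keras defaults of no padding and stride 1 for the convolution and non-overlapping pooling windows. The input size of 20 is a hypothetical placeholder for `max_mat_size`, and the helper functions are illustrative only.

```python
def conv2d_output(size, kernel, filters):
    """Shape after a no-padding, stride-1 convolution on a square input."""
    side = size - kernel + 1
    return (side, side, filters)

def maxpool_output(shape, pool):
    """Shape after non-overlapping max pooling; leftover rows/columns are dropped."""
    side, _, filters = shape
    return (side // pool, side // pool, filters)

# Hypothetical input size; the real value is max_mat_size from the data.
toy_mat_size = 20
num_filters, kern, pool = 6, 6, 2   # same values as the hyperparameter cell

conv_shape = conv2d_output(toy_mat_size, kern, num_filters)
pool_shape = maxpool_output(conv_shape, pool)
flat_units = pool_shape[0] * pool_shape[1] * pool_shape[2]

# Convolution weights: one kernel per filter per input channel, plus one bias per filter.
conv_params = kern * kern * 1 * num_filters + num_filters
# Single linear output unit: one weight per flattened input, plus a bias.
dense_params = flat_units * 1 + 1

print(conv_shape, pool_shape, flat_units, conv_params, dense_params)
```

If the numbers in your model summary differ from your hand calculation, that usually points to a different padding, stride, or input shape than you intended.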

During training, carefully monitor the number of epochs needed to fully train the model without overtraining it. Keep your eye on the validation loss to make sure it does not start to climb far higher than the training loss.
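
As a sketch of what monitoring can look like, the code below inspects a dictionary of per-epoch losses. The numbers are made up for illustration; in practice the History object returned by the Keras `fit` method exposes a `history` dictionary with `loss` and `val_loss` lists when validation data is supplied. The helper names are hypothetical.

```python
# Made-up loss curves for illustration; real values come from training.
history = {
    "loss":     [0.90, 0.60, 0.45, 0.36, 0.30, 0.26, 0.23, 0.21],
    "val_loss": [0.95, 0.70, 0.55, 0.50, 0.49, 0.52, 0.58, 0.66],
}

def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=val_losses.__getitem__) + 1

def loss_gap(history):
    """Validation loss minus training loss at each epoch."""
    return [v - t for t, v in zip(history["loss"], history["val_loss"])]

stop_at = best_epoch(history["val_loss"])
gaps = loss_gap(history)

print(stop_at)
print([round(g, 2) for g in gaps])
```

In this invented example the training loss keeps falling while the validation loss bottoms out and then rises, so training beyond that point mostly fits noise in the training set.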

Once you have a trained model that is satisfactory for each of your outcome variables, make predictions with the model and compute a Pearson correlation coefficient (e.g., with numpy corrcoef) between the predicted and actual values to confirm how well the model performs.

In [None]:
#
# (End of) Task 4: Compile and summarize your models here.
# Hint: Adam would be a reasonable optimizer to use: optimizer=tf.keras.optimizers.Adam(learning_rate=0.0005)
# You can also experiment with a faster learning rate.
#


**Task 5**

Train your CNN models. Experiment with the learning rate and the number of epochs to make your training as efficient as you can. Remember, get the first model working satisfactorily first, before you make copies for predicting the other two outcome variables.

In [None]:
# HW4T5a
# Train your models here

**Task 6**

Once each model is fully trained to your satisfaction, use the "predict" method on your test set to compute a set of predicted values from the set of input matrices. Use a Pearson's correlation (r) between predicted and actual values (from the test set) as a final model performance metric. You can use np.corrcoef() to calculate this. You may need to append .squeeze(axis=-1) to the predict function to get the predictions into the shape they need to be in for np.corrcoef().
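
The shape handling described above can be sketched with synthetic data. The arrays below are random stand-ins, not model output: `predicted` is given the shape (N, 1) to mimic what a single-unit output layer typically returns.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: actual values and noisy "predictions" shaped (N, 1).
actual = rng.normal(size=200)
predicted = (actual + rng.normal(scale=0.5, size=200)).reshape(-1, 1)

# Drop the trailing axis so both inputs are 1-D vectors of the same length.
predicted_1d = predicted.squeeze(axis=-1)

# corrcoef returns a 2x2 matrix; the off-diagonal entry is Pearson's r.
r = np.corrcoef(predicted_1d, actual)[0, 1]
print(round(r, 3))
```

Indexing with `[0, 1]` matters: the diagonal entries of the returned matrix are always 1 and say nothing about model performance.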

Report the value of Pearson's r for each of the three models.

In [None]:
# HW4T6a
# Task 6: Evaluate your models here
#


**Task 7** 


The very last block in the notebook should be a text block that documents and discusses your results. Make sure your discussion describes the performance you achieved on each of the three metrics. Give some thought to why the performance levels of the three models may differ from each other. Based on your results, what can be learned from vectorizing a series of sentences? Comment on whether you think that your models are overtrained – that is, would these models generalize to new sentences extracted from Wikipedia?

HWCC

Replace this text with your comments.