# Hands-On Week 1
In week 1 we learned about calculating the information in spoken language: how can we go from an audio signal, to strings of text, to measures of information transfer? <br>
In this first hands-on class, we will take a closer look at quantifying information in language. We'll learn how to use Python to train a very simple model of next-word probabilities, and how we can use these values to learn about spoken language use. We provide you with the main components, but you will need to use your own Python skills to complete the exercises!<br>
Before we begin, download the DailyDialog corpus from Kaggle: https://www.kaggle.com/datasets/thedevastator/dailydialog-unlock-the-conversation-potential-in?resource=download 

## Initializing Variables
First, we will need to initialize some variables and load in our data.

In [15]:
import numpy as np
import os
import pandas as pd
from nltk import trigrams

import matplotlib.pyplot as plt

####
# Set up NLP parameters
from collections import defaultdict
import math
from scipy.stats import entropy

# Create a placeholder for probability model
model_prob = defaultdict(lambda: defaultdict(lambda: 0))
# Create a placeholder for surprisal model
model_surprisal = defaultdict(lambda: defaultdict(lambda: 0))

# Highest surprisal seen so far; updated as the model is trained
max_surprisal = 0

### DIY: Preparing the corpus
Now that the packages are loaded, we need to read in the corpus and format it for processing. Our simple language model requires a <i>list</i> of strings, where each string is a sentence in the corpus. The corpus that we are using, the DailyDialog corpus, is not yet in this format, because it has more features than we actually need. It provides a list of dialogues (collections of exchanges) that each consist of several speech turns (i.e., sentences, for the purpose of this tutorial), as well as emotion and dialogue act labels. <br>
<b>Your Task:</b><br>
- Inspect the format of the data
- Extract the sentences, removing any punctuation and expanding contractions (hint: check out the contractions and string packages)
- Create a new list variable called 'sentences' that contains only the speech sentences

In [14]:
raw_corpus = pd.read_csv("./archive/train.csv")
raw_corpus.head()


# REMOVE
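import contractions
import string

# Translation table that deletes every punctuation character (commas included)
remove_punctuation = str.maketrans('', '', string.punctuation)

sentences = []
for dialogue in raw_corpus['dialog']:
    # Each line of the 'dialog' field is treated as one speech turn
    for sentence in dialogue.split('\n'):
        sentence_cleaned = contractions.fix(sentence)  # e.g. "don't" -> "do not"
        # Note: deleting punctuation outright fuses words that had no space
        # around the punctuation mark in the corpus (e.g. "rightBut")
        sentence_cleaned = sentence_cleaned.translate(remove_punctuation)
        sentences.append(sentence_cleaned)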

print(sentences[0:10])

['Say  Jim  how about going for a few beers after dinner  ', '  You know that is tempting but is really not good for our fitness  ', '  What do you mean  It will help us to relax  ', '  Do you really think so  I do not  It will just make us fat and act silly  Remember last time  ', '  I guess you are rightBut what shall we do  I do not feel like sitting at home  ', '  I suggest a walk over to the gym where we can play singsong and meet some of our friends  ', '  That is a good idea  I hear Mary and Sally often go there to play pingpongPerhaps we can make a foursome with them  ', '  Sounds great to me  If they are willing  we could ask them to go dancing with usThat is excellent exercise and fun  too  ', '  GoodLet  s go now    All right  ', 'Can you do pushups  ']



## Calculating Surprisal

In [17]:
#Step 4: Count frequency of co-occurrence
for sentence in sentences:
    sequence = sentence.split(" ")
    # Padding adds None tokens so that sentence-initial and sentence-final words also get a context
    for w1, w2, w3 in trigrams(sequence, pad_right=True, pad_left=True):
        model_prob[(w1, w2)][w3] += 1
        model_surprisal[(w1, w2)][w3] += 1

#Step 5: Transform the counts to probabilities
for w1_w2 in model_prob:
    total_count = float(sum(model_prob[w1_w2].values()))
    for w3 in model_prob[w1_w2]:
        model_prob[w1_w2][w3] /= total_count # probability

#Step 6: Transform the counts to surprisal
for w1_w2 in model_surprisal:
    total_count = float(sum(model_surprisal[w1_w2].values()))
    for w3 in model_surprisal[w1_w2]:
        if w3:  # skip padding (None) and empty-string tokens
            probability = model_surprisal[w1_w2][w3] / total_count
            # Surprisal is -log(p); math.log is the natural log, so the unit is nats
            surprisal = -math.log(probability)  # <-- Smith and Levy, 2013
            model_surprisal[w1_w2][w3] = surprisal

            # update max -- this is used for corpus values not encountered during training
            max_surprisal = max(max_surprisal, surprisal)
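
The comment on `max_surprisal` suggests it is meant as a fallback for trigrams that never occurred during training. Plain indexing into the `defaultdict` would return 0 for such trigrams, which reads as "perfectly predictable". Below is a minimal sketch of a safer lookup on toy values; `lookup_surprisal`, `toy_surprisal`, and the numbers in it are hypothetical and not part of the notebook.

```python
from collections import defaultdict

# Toy stand-ins for the trained model (values are made up for illustration)
toy_surprisal = defaultdict(lambda: defaultdict(lambda: 0))
toy_surprisal[("are", "you")]["going"] = 1.6
toy_surprisal[("are", "you")]["happy"] = 7.2
toy_max_surprisal = 7.2


def lookup_surprisal(model, context, word, fallback):
    """Return the stored surprisal, or `fallback` for an unseen trigram."""
    # Membership tests with `in` do not trigger the defaultdict factory,
    # so the lookup leaves the model unchanged.
    if context in model and word in model[context]:
        return model[context][word]
    return fallback


seen = lookup_surprisal(toy_surprisal, ("are", "you"), "going", toy_max_surprisal)
unseen = lookup_surprisal(toy_surprisal, ("are", "you"), "purple", toy_max_surprisal)
print(seen, unseen)
```

With the real model, the same helper could be called as `lookup_surprisal(model_surprisal, (w1, w2), w3, max_surprisal)`.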

In [48]:
#Step 7: Test probability model
print(model_prob[('I','think')]['sometimes'])
print(model_prob[('I','think')]['that'])


0.00046446818392940084
0.08732001857872736


In [28]:
#Step 8: Test surprisal model
print(model_surprisal['are','you']['going'])
print(model_surprisal['are','you']['happy'])

1.640282587785199
7.157735484249907


## Next-Word Entropy
While surprisal captures the new information gained at each word, next-word entropy measures how uncertain you are about the next upcoming word.
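
To make the difference concrete, here is a small sketch with two made-up next-word distributions: one where a single continuation dominates, and one where all continuations are equally likely. The distributions and helper names are hypothetical. The sketch uses base 2 (bits) throughout, whereas the surprisal model above uses the natural log.

```python
import math

# Hypothetical next-word distributions after two different contexts
predictable = {"you": 0.9, "sure": 0.05, "okay": 0.05}
uncertain = {"you": 1 / 3, "sure": 1 / 3, "okay": 1 / 3}


def surprisal_bits(p):
    """Surprisal of a single outcome with probability p, in bits."""
    return -math.log2(p)


def entropy_bits(dist):
    """Entropy = expected surprisal over all possible next words, in bits."""
    return sum(p * surprisal_bits(p) for p in dist.values() if p > 0)


h_predictable = entropy_bits(predictable)
h_uncertain = entropy_bits(uncertain)
print(h_predictable, h_uncertain)
```

The peaked distribution has an entropy of about 0.57 bits, while the uniform one reaches the maximum for three options, about 1.58 bits. Surprisal is a property of the word that actually occurs; entropy is a property of the context before the word is heard.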

In [None]:
 
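def calc_entropy(w1_w2, model_prob):
    # expects w1_w2 to be a tuple of the format: ("some", "words")
    # entropy is defined over probabilities, so use the probability model, not the surprisal values
    prob_dist = [model_prob[w1_w2][w3] for w3 in model_prob[w1_w2]]

    NWE = entropy(prob_dist, base=2)  # next-word entropy in bits
    return NWE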

### DIY: Surprisal and NWE in dialogue
- Read in the train.csv file
- Use your trained model to calculate, and visualize, the surprisal or NWE profile within dialogues or sentences
- <b>Answer</b> in the LC1 review: what patterns do you see?
- Calculate average surprisal per dialogue act <b>or</b> per emotion category
- <b>Answer</b> in the LC1 review: do you see any correlations between surprisal/NWE and a particular dialogue act or emotion?
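
As a possible starting point for the averaging step, the sketch below groups a toy table by label and takes the mean. The column names `act` and `mean_surprisal` and all values are hypothetical; you will need to build the per-utterance surprisal values from your own model and check the actual column names in the data.

```python
import pandas as pd

# Toy per-utterance table; column names and values are hypothetical
toy = pd.DataFrame({
    "act": [1, 1, 2, 2, 3],
    "mean_surprisal": [2.0, 4.0, 1.0, 3.0, 5.0],
})

# Average surprisal within each label
avg_by_act = toy.groupby("act")["mean_surprisal"].mean()
print(avg_by_act)
```

The same pattern works for emotion categories by grouping on the corresponding column instead.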