# Practical session

## Computational Semantics and Pragmatics, 02/11/2017

This is the introductory practical session of the course Computational Semantics and Pragmatics. We will get you started with the corpus that we will use throughout the course and give you a brief introduction to *Jupyter Notebooks*, which we will use for the practical assignments.

## Getting the materials

Start by downloading the folder *cosp2017_practical* from here: https://surfdrive.surf.nl/files/index.php/s/8MVVoz5sna0LhTg (you may need to unzip it after downloading).

This folder contains the corpus, some Python scripts, and other auxiliary materials that will be used in the different assignments throughout the course.

## The Switchboard corpus

In the assignments of this course we will explore a dataset called the Switchboard corpus, consisting of telephone conversations between speakers of American English (https://catalog.ldc.upenn.edu/LDC97S62). Several distributions of the Switchboard corpus are available (see http://groups.inf.ed.ac.uk/switchboard/structure.html for a description of the project). The distribution we are using is a version we have preprocessed for this course. It includes both linguistic annotations (the dialogue acts of the utterances, and parsed and POS-annotated versions of these utterances) and timing information (beginning and end of turns). It covers a subset of the overall corpus, containing 642 dialogues. You will find this version of the corpus in the folder *swda_time*.

The folder contains csv files named swXXXX.csv. Each of them corresponds to one conversation, with one utterance per row, annotated with different types of information. The user manual at https://web.stanford.edu/~jurafsky/ws97/manual.august1.html can be used as a reference for the meaning of the different annotation tags.

Explore a couple of conversation files by opening them with a csv editor/viewer. **Make sure NOT to save them, so that your default csv editor does not change their encoding.**

The folder also contains a file *swda-metadata-ext.csv* with metadata for all conversations in the corpus (including the conversations that are not in this specific distribution), such as the identities of the speakers and the topic of the conversation. Take a look at it, but again do not modify or save it.
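
If you would rather not open the files in an editor at all, you can also take a read-only look at them from Python. The following is a minimal sketch using only the standard library. The helper name *preview_rows* is our own and not part of the course scripts, and the column names in the inline example are made up for illustration.

```python
from __future__ import print_function
import csv
from itertools import islice


def preview_rows(lines, n_rows=3):
    """Return (header, first n_rows data rows) from an iterable of csv lines.

    `lines` can be an open file object or a list of strings. Nothing is
    written back, so the source file stays untouched.
    """
    reader = csv.reader(lines)
    header = next(reader)
    rows = list(islice(reader, n_rows))
    return header, rows


# Small inline example; these column names are hypothetical.
example = ["speaker,text", "A,Hi.", "B,Hi there."]
header, rows = preview_rows(example, n_rows=1)
print(header)
print(rows)
```

With a real conversation file you would pass a file object opened in read mode instead of the list, for instance the result of opening *swda_time/sw2020.csv*.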

## Jupyter Notebook

For the assignments we will use an environment called _Jupyter Notebook_ (formerly known as IPython Notebook), in which code and Markdown text can be combined (for more information, see http://jupyter.readthedocs.io/en/latest/index.html). You can view the instructions in an online notebook viewer, but to edit code and hand in your assignments, you will need to install Jupyter Notebook on your own machine. The Notebook environment is most easily installed using *Anaconda*, but you can also use *pip*. How exactly you should install it depends on your operating system; for instructions, check https://jupyter.readthedocs.io/en/latest/install.html.

**Important: the code we will use is written in Python 2 and will not run under Python 3. In particular, if you install Anaconda, make sure you choose the Python 2.7 version. Adding a Python 2 kernel to an existing Jupyter installation might also work, but is not recommended.**

Now install IPython/Jupyter Notebook on the machine you are using, as described above. You can check which Python version a notebook is running as follows:

In [1]:
import sys
print(sys.version)

2.7.14 |Anaconda, Inc.| (default, Oct 16 2017, 17:29:19) 
[GCC 7.2.0]


##  Python scripts and libraries

To extract (meta)information from the corpus, we will use a set of Python classes provided by Chris Potts (http://compprag.christopherpotts.net/swda.html). The distribution of the corpus that these scripts assume is slightly different from ours, so we will use an adapted version of the scripts, *swda_time.py*, which you can find in the main folder *cosp2017_practical*.

These scripts depend on the latest version of a library called nltk, so make sure you have that installed (http://www.nltk.org/install.html). If you already have nltk installed, you can check your current version in the terminal with the following command:

$ python -c "import nltk; print(nltk.\__version\__)"

If you use pip, you can upgrade using the flag -U:

$ pip install -U nltk

Anaconda allows you to specify the version when you install nltk; choose the latest version (3.2.1 or later).

If you are working on a computer where you cannot install packages, or you have other code that depends on an older version of nltk, consider using virtualenv to create a virtual environment with the proper setup.

To be able to run the code in the notebooks without problems, it is important that the notebook has access to the Python classes we are using. Download the ipynb notebook file *practical_session.ipynb* from https://surfdrive.surf.nl/files/index.php/s/FS3EkC09xUuqFlD and copy it into your *cosp2017_practical* folder.


## Using the notebook

Now you are ready to start up your notebook. Open a terminal, navigate to the folder *cosp2017_practical* containing your files and run "jupyter notebook" (if you already had ipython notebook installed, you may also use that). You should now see the contents of your folder in your web browser, and you should be able to open and edit the notebook files in it.

Some things that may be helpful when working with Jupyter notebooks:

* The notebook can be seen as an interactive Python session, which means that every time you open your notebook you have to rerun the blocks of code you need. If your code in block X depends on an import in block Y, or uses a variable created in block Y, you have to run block Y before you can run block X. You can run all cells in the notebook by using the menu: *Cell > Run All*.
* The shortcuts for running a cell when your cursor is in it are *Shift-Enter* (run cell and go to the next one) and *Ctrl-Enter* (run cell in place). Find more shortcuts in the Help menu.
* If your notebook crashes somehow or seems to be unresponsive, consider using the menu to restart/interrupt the kernel in the Kernel menu. If you restart the kernel, your variables will be lost.

Now open your own version of this notebook, so that you can play around with the corpus and test if everything works.

In [2]:
import nltk
print(nltk.__version__)

3.2.4
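
If you want to check a version string like the one printed above against a minimum version, note that comparing the raw strings alphabetically can give the wrong answer (for example, "3.10" sorts before "3.2"). Below is a small sketch that compares the numeric parts instead. The helper names are our own, and the sketch assumes purely numeric version strings without suffixes such as "rc1".

```python
from __future__ import print_function


def version_tuple(version_string):
    """Turn a dotted version string such as '3.2.4' into a tuple of ints."""
    return tuple(int(part) for part in version_string.split("."))


def is_at_least(installed, required):
    """Compare two dotted version strings numerically, not alphabetically."""
    return version_tuple(installed) >= version_tuple(required)


print(is_at_least("3.2.4", "3.2.1"))
print(is_at_least("3.10.0", "3.2.1"))
```

In a notebook you would pass *nltk.__version__* as the first argument.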


## Exploring the Switchboard corpus

We will now do a small exploration of the Switchboard corpus and the scripts that can be used to extract information from it. Let's start with looking at a single conversation.

### Transcripts

The file *swda_time.py* contains a _Transcript_ class. You can create a Transcript object for one of the csv files by passing the path of that file, together with the path of the general metadata file in the corpus folder, to the class, e.g.:

In [3]:
from swda_time import Transcript
trans = Transcript('swda_time/sw2020.csv', 'swda_time/swda-metadata-ext.csv')

Transcript objects have several accessible attributes, such as the date and topic of the dialogue, the age, gender and dialect of the two speakers, and the transcription of the utterances spoken in the dialogue (for a complete list, check section 2.2.1 in http://compprag.christopherpotts.net/swda.html, or look in the metadata file). Since these are phone conversations, the speakers are referred to as 'callers' (each dialogue is a call 'from A to B'). All attributes are accessible by name:

In [4]:
print "Id of speaker 'A': %s" % trans.from_caller_id
print "Id of speaker 'B': %s" % trans.to_caller_id
print "Dialect area of B: %s" % trans.to_caller_dialect_area
print "Gender of B: %s" % trans.to_caller_sex
print "Education of B: %s" % trans.to_caller_education
print "Topic description of conversation: %s" % trans.topic_description
print "Number of utterances in conversation: %i" % len(trans.utterances)

Id of speaker 'A': 1176
Id of speaker 'B': 1169
Dialect area of B: NORTH MIDLAND
Gender of B: MALE
Education of B: 2
Topic description of conversation: MUSIC
Number of utterances in conversation: 269
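
Because the attributes are plain Python attributes, you can also look them up by name in a loop instead of writing one print statement per attribute. The sketch below illustrates this with *getattr*. It uses a small stand-in object filled with the values printed above, since the real Transcript object requires the corpus files; *FakeTranscript* and *describe* are hypothetical names, not part of *swda_time.py*.

```python
from __future__ import print_function
from collections import namedtuple

# Stand-in for a Transcript object, holding values from the output above.
# The real object would come from swda_time.Transcript.
FakeTranscript = namedtuple(
    "FakeTranscript", ["from_caller_id", "to_caller_id", "topic_description"])


def describe(obj, attribute_names):
    """Return one 'name: value' line per attribute, looked up by name."""
    return ["%s: %s" % (name, getattr(obj, name)) for name in attribute_names]


trans_like = FakeTranscript(from_caller_id=1176, to_caller_id=1169,
                            topic_description="MUSIC")
for line in describe(trans_like, ["from_caller_id", "to_caller_id",
                                  "topic_description"]):
    print(line)
```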


### Utterances

The individual utterances, which correspond to rows in the csv file, are *Utterance* objects. They also carry several types of metadata, e.g., the sex, birth year, education level and dialect area of the speaker. Furthermore, an Utterance object contains the transcription of the utterance, a POS-tagged version and a parse tree, the dialogue act performed by the utterance, the number of the turn the utterance belongs to, and the start and end time of this turn (for a complete list, check the Utterance class in the source code). You can access all attributes in the same way as for the Transcript class, e.g.:

In [5]:
utt1, utt2 = trans.utterances[0:2]

print "Utterance 1:\t%s\t Speaker: %s \t dialogue act: %s\t start turn: %f\t end turn: %f" \
        % (utt1.text, utt1.caller, utt1.act_tag, utt1.start_turn, utt1.end_turn)
print "Utterance 2:\t%s\t Speaker: %s \t dialogue act: %s \t start turn: %f\t end turn: %f" \
        % (utt2.text, utt2.caller, utt2.act_tag, utt2.start_turn, utt2.end_turn)
    

Utterance 1:	Hi. /	 Speaker: A 	 dialogue act: fp	 start turn: 1.747375	 end turn: 2.100750
Utterance 2:	Hi,  /	 Speaker: B 	 dialogue act: fp^m 	 start turn: 2.539000	 end turn: 9.651625
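
The timing attributes make it easy to derive simple measures such as the duration of a turn or the gap between two turns. Here is a small worked example with the values printed above, assuming that the times are given in seconds (the unit is not stated here). The function names are our own.

```python
from __future__ import print_function


def turn_duration(start_turn, end_turn):
    """Length of a turn: its end time minus its start time."""
    return end_turn - start_turn


def gap_between(previous_end, next_start):
    """Time between two turns; a negative value means the turns overlap."""
    return next_start - previous_end


# Values copied from the output above.
print("Duration of the first turn: %.6f" % turn_duration(1.747375, 2.100750))
print("Gap before the second turn: %.6f" % gap_between(2.100750, 2.539000))
```

Keep in mind that the start and end times belong to the turn, so all utterances within the same turn share them.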


Take a moment to look at the meanings of the dialogue act tags here: http://compprag.christopherpotts.net/swda.html#tags. We will use them in other assignments.

Sometimes, a turn can consist of multiple utterances, in which case the turn_index stays the same across consecutive utterances while the subutterance_index increases:

In [8]:
utterances = trans.utterances[25:31]

for utterance in utterances:
    print utterance.subutterance_index, utterance.text

print "\n",
    
for utterance in utterances:
    print "speaker: %s\tturn index: %i\t subutterance index: %s" \
        % (utterance.caller, utterance.turn_index, utterance.subutterance_index)


1 {C But, }  the, {F uh, } - /
2 there's such a wide selection,  /
3 [ I think I like a lot, + I like a little bit of a lot of ] different types of music.  /
4 {D You know, } [ [ I, + I, ] + I ] like music  [ that  is, +  that I feel ] - /
5 if it is performed correctly or if it's done right, or if the version is done right, I like it <laughter>, /
1 Yeah. /

speaker: B	turn index: 14	 subutterance index: 1
speaker: B	turn index: 14	 subutterance index: 2
speaker: B	turn index: 14	 subutterance index: 3
speaker: B	turn index: 14	 subutterance index: 4
speaker: B	turn index: 14	 subutterance index: 5
speaker: A	turn index: 15	 subutterance index: 1
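
One possible way to group utterances into turns is *itertools.groupby*, which merges consecutive items that share a key. The sketch below works on plain (caller, turn index, subutterance index) triples copied from the output above rather than on real Utterance objects; the function name *turn_lengths* is our own.

```python
from __future__ import print_function
from itertools import groupby

# (caller, turn_index, subutterance_index) triples from the output above.
rows = [("B", 14, 1), ("B", 14, 2), ("B", 14, 3),
        ("B", 14, 4), ("B", 14, 5), ("A", 15, 1)]


def turn_lengths(rows):
    """Return [(caller, turn_index, number_of_utterances), ...] in order."""
    result = []
    # groupby only merges *consecutive* items with the same key.
    for (caller, turn_index), group in groupby(rows, key=lambda r: (r[0], r[1])):
        result.append((caller, turn_index, len(list(group))))
    return result


print(turn_lengths(rows))
```

With real data, the key function would read *utterance.caller* and *utterance.turn_index* instead of tuple positions.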


Most utterances also contain a list of syntactic parse trees, which are represented as nltk Tree objects (http://www.nltk.org/_modules/nltk/tree.html) in our script. This means that you can use the methods of this class to access the nodes of a tree, for instance the label of its root node:

In [9]:
utterance = trans.utterances[4]
tree = utterance.trees[0]
root_label = tree.label()
print "Utterance text: %s\n\ncorresponding tree: %s\n\nroot node label: %s" % (utterance.text, tree, root_label)

Utterance text: {D Well, } I mostly listen to popular music.  /

corresponding tree: (S
  (INTJ (UH Well))
  (, ,)
  (NP-SBJ (PRP I))
  (ADVP (RB mostly))
  (VP (VBP listen) (PP (IN to) (NP (JJ popular) (NN music))))
  (. .)
  (-DFL- E_S))

root node label: S


The Utterance class also provides several additional methods, such as *damsl_act_tag()*, which collapses the more than 200 dialogue act tags into a reduced set of 44 tags (see http://compprag.christopherpotts.net/swda.html#tags), and *pos_lemmas()*, which returns a list of the words in the utterance (without the extra annotations).

### CorpusReader

The *swda_time.py* file also contains a class that allows you to work on all dialogues in the corpus directly. This class, called _CorpusReader_, has a method to iterate over all transcripts in the corpus (*iter_transcripts*) or all utterances in all transcripts in the corpus (*iter_utterances*). Using this class, you can easily extract information about the entire corpus, for instance, a distribution over the topics:

In [10]:
from swda_time import CorpusReader
from collections import defaultdict

corpus = CorpusReader('swda_time', 'swda_time/swda-metadata-ext.csv')

topics = defaultdict(int)
# Count the number of dialogues per topic
for transcript in corpus.iter_transcripts(display_progress=True):
    topics[transcript.topic_description] += 1

print "How many dialogues are about the weather:", topics['WEATHER CLIMATE'],
print ""
print "How many dialogues are about music:", topics['MUSIC'],


transcript 645

How many dialogues are about the weather: 11 
How many dialogues are about music: 14
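
As an alternative to *defaultdict(int)*, the standard library's *collections.Counter* does the same counting and can additionally return the most frequent items. The sketch below uses a short made-up list of topic labels in place of the real *transcript.topic_description* values, so the counts are illustrative only.

```python
from __future__ import print_function
from collections import Counter

# Made-up topic labels standing in for transcript.topic_description values.
topic_labels = ["MUSIC", "WEATHER CLIMATE", "MUSIC", "PETS",
                "MUSIC", "WEATHER CLIMATE"]

topic_counts = Counter(topic_labels)

# most_common(n) returns the n most frequent (label, count) pairs.
print(topic_counts.most_common(2))
# Labels that never occurred count as zero, as with defaultdict(int).
print(topic_counts["GARDENING"])
```

With the corpus, you could feed the Counter directly from the loop over *corpus.iter_transcripts()*.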





In the same way, we can compute the average number of utterances per turn and the percentage of utterances that are questions:

In [11]:
import numpy as np

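utterances_per_turn = []   # number of utterances in each completed turn
turn_index = 1             # index of the turn we are currently counting
no_utterances = 0          # utterances seen so far in the current turn
questions = 0              # utterances whose DAMSL tag starts with 'q'
for utterance in corpus.iter_utterances(display_progress=False):
    if utterance.damsl_act_tag()[0] == 'q':
        questions += 1

    if utterance.turn_index == turn_index:
        # Same turn index as before: the current turn continues.
        no_utterances += 1
    else:
        # A new turn starts: store the finished turn and reset the counter.
        # Note that the very last turn of the corpus is never appended.
        utterances_per_turn.append(no_utterances)
        turn_index = utterance.turn_index
        no_utterances = 1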

print "Average number of utterances per turn: ", np.mean(utterances_per_turn)
print "Maximum number of utterances per turn: ", np.max(utterances_per_turn)

print "\nPercentage of utterances that are questions: %f%%" % (float(questions)/np.sum(utterances_per_turn)*100)

Average number of utterances per turn:  1.78885387295
Maximum number of utterances per turn:  30

Percentage of utterances that are questions: 4.466641%
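
The loop above only stores a turn when the next turn begins, so the final turn is not counted, and two consecutive conversations could in principle be merged if one ends and the next starts with the same turn index. A possible variant that avoids both issues is sketched below. It works on one key per utterance, for example a (conversation identifier, turn index) pair; how to obtain a conversation identifier from the corpus objects is left open here, and the keys in the example are made up.

```python
from __future__ import print_function


def utterances_per_turn(turn_keys):
    """Count consecutive utterances that share the same turn key.

    `turn_keys` is an iterable with one key per utterance, for example
    (conversation id, turn index) pairs. The final turn is recorded too.
    """
    counts = []
    previous_key = None
    current = 0
    for key in turn_keys:
        if current and key != previous_key:
            # The key changed: the previous turn is complete.
            counts.append(current)
            current = 0
        previous_key = key
        current += 1
    if current:
        # Flush the last turn, which has no following turn to trigger it.
        counts.append(current)
    return counts


# Made-up keys: two conversations with hypothetical ids 1 and 2.
keys = [(1, 1), (1, 2), (1, 2), (2, 1), (2, 1), (2, 1)]
counts = utterances_per_turn(keys)
print(counts)
print(sum(counts) / float(len(counts)))
```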


## Starting with the first assignment

We have now seen a few examples of how information can be extracted from the Switchboard corpus. When you finish going through the above examples, you can try to extract data from the corpus that you think is interesting, or you can start with the first assignment.

Download the first assignment from https://surfdrive.surf.nl/files/index.php/s/4whnfKWM6JJYFhJ. Save the file in your *cosp2017_practical* folder and follow the instructions above to get it running. Try to at least open the assignment and run the cells in it during this practical session, so that you can ask for help if you run into trouble.