# Technical Assignment

The task is to study real two-party conversations and model what people talk about in various contexts, along with the emotions they express.
Each multi-turn conversation \([dialogues_text.txt](dialogues_text.txt)\) has been annotated with:
* one dialogue act for each turn \([dialogues_act.txt](dialogues_act.txt)\)
* one emotion of the speaker for each turn \([dialogues_emotion.txt](dialogues_emotion.txt)\) 
* an overall conversation topic \([dialogues_topic.txt](dialogues_topic.txt)\)

#### Task 1. Use proper data science techniques to ingest the raw data into a form amenable for analysis.
Importing some useful libraries and constants

In [None]:
import re
import sqlite3

DB_FILE = 'task-1.db'

Defining a function to read tab/space separated values

In [None]:
def readNumeric(file):
    with open(file) as f:
        for line in f:
            # Splitting on any run of whitespace (tabs or spaces)
            result = line.split()
            # Casting values as integers
            result = [int(x) for x in result]
            yield result
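
To see what the parser yields without the dataset files at hand, the sketch below applies the same split-and-cast logic to an in-memory sample. The sample lines and the helper name `parse_numeric_lines` are made up for illustration; the real files are assumed to hold one whitespace-separated row of integer labels per conversation. Unlike `readNumeric`, this variant also skips blank lines.

```python
import io

def parse_numeric_lines(lines):
    """Yield one list of ints per non-empty line (hypothetical helper)."""
    for line in lines:
        tokens = line.split()  # splits on any run of spaces or tabs
        if tokens:             # skip blank lines instead of yielding []
            yield [int(tok) for tok in tokens]

# Made-up sample, not taken from the dataset
sample = io.StringIO("3 4 2 2\n1  1\t2\n\n")
rows = list(parse_numeric_lines(sample))
print(rows)
```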

Dumping numeric values into lists

In [None]:
listAct = [x for x in readNumeric("dialogues_act.txt")]
listEmotion = [x for x in readNumeric("dialogues_emotion.txt")]
listTopic = [x for x in readNumeric("dialogues_topic.txt")]

Testing the function

In [None]:
print(listAct[0:2])
print(listEmotion[0:2])
print(listTopic[0:2])

Word-level tokenization might not be useful at this point, so defining a function to *detokenize* the text (and fix inconsistent apostrophes):

In [None]:
def detokenize(text):
    # punctuation join
    text = re.sub(r" +([!\?\.,;:])", r"\1", text)
    # plural apostrophe join
    text = re.sub(r"s +['’]", "s'", text)
    # other apostrophes join
    text = re.sub(r" +['’] *", "'", text)
    # parenthetical info removal
    text = re.sub(r" *\([^\)]*\)", "", text)
    # trimming
    return text.strip()
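
The effect of each substitution is easier to follow on concrete input. The sketch below repeats the function so that the block runs on its own and applies it to a few made-up tokenized strings; the samples are illustrative and not taken from the dataset.

```python
import re

def detokenize(text):
    text = re.sub(r" +([!\?\.,;:])", r"\1", text)  # punctuation join
    text = re.sub(r"s +['’]", "s'", text)          # plural apostrophe join
    text = re.sub(r" +['’] *", "'", text)          # other apostrophes join
    text = re.sub(r" *\([^\)]*\)", "", text)       # parenthetical info removal
    return text.strip()

# Made-up samples covering each rule
samples = [
    "Say , Jim , how about a few beers ?",
    "I don ' t know .",
    "It ’ s fine",
    "The students ' books",
    " Hello ( laughs ) there . ",
]
results = [detokenize(s) for s in samples]
for raw, clean in zip(samples, results):
    print(repr(raw), "->", repr(clean))
```

Note that the rules run in order: the plural rule fires first so that `students '` keeps the space after the apostrophe, while the general rule glues contractions such as `don ' t` together.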

Defining a function to read the conversations' files

In [None]:
def readConversation(file):
    with open(file, encoding="utf8") as f:
        for line in f:
            # Splitting utterances and dropping the trailing fragment after the final __eou__ marker
            result = line.split("__eou__")[:-1]
            # Detokenizing for now
            yield [detokenize(x) for x in result]
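
The `[:-1]` slice assumes that every line ends with an `__eou__` marker; if a line did not, its last utterance would be dropped. As a hedged alternative, the sketch below filters out blank fragments instead of slicing. The helper name `split_utterances` and the sample lines are hypothetical.

```python
EOU = "__eou__"

def split_utterances(line):
    """Split one raw line into stripped, non-empty utterances (hypothetical helper)."""
    return [part.strip() for part in line.split(EOU) if part.strip()]

# Made-up lines: one with the trailing marker, one without
with_marker = split_utterances("Hello , how are you ? __eou__ Fine , thanks . __eou__\n")
without_marker = split_utterances("No trailing marker __eou__ Still kept")
print(with_marker)
print(without_marker)
```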

Dumping conversations into a list and testing function

In [None]:
listText = [x for x in readConversation("dialogues_text.txt")]
listText[0:2]

Creating relational database using SQLite and [this schema](ddl.sql.txt)

In [None]:
conn = sqlite3.connect(DB_FILE)
with open("ddl.sql.txt") as f:
    # Running the whole script at once, so statements may span several lines
    conn.executescript(f.read())
conn.close()
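
The schema file itself is not reproduced in this notebook. The sketch below shows one schema that would be compatible with the insert and query code used here. It is an assumption throughout: the table names and the columns `cId`, `uId`, `Utterance`, `actId`, `emotionId` and `topicId` are inferred from the test query and from the three-value inserts, while the types, the keys, and the name of the middle column of `conversation_topic` are guesses. The real definitions live in `ddl.sql.txt`.

```python
import sqlite3

# ASSUMED schema, inferred from the queries in this notebook; not the actual ddl.sql.txt
ASSUMED_DDL = """
CREATE TABLE utterance (cId INTEGER, uId INTEGER, Utterance TEXT, PRIMARY KEY (cId, uId));
CREATE TABLE utterance_act (cId INTEGER, uId INTEGER, actId INTEGER, PRIMARY KEY (cId, uId));
CREATE TABLE utterance_emotion (cId INTEGER, uId INTEGER, emotionId INTEGER, PRIMARY KEY (cId, uId));
CREATE TABLE conversation_topic (cId INTEGER, uId INTEGER, topicId INTEGER, PRIMARY KEY (cId, uId));
"""

# An in-memory database keeps the sketch from touching task-1.db
conn = sqlite3.connect(":memory:")
conn.executescript(ASSUMED_DDL)
tables = sorted(
    row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
conn.close()
print(tables)
```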

Defining a function to unfold lists into tuples

In [None]:
def getUnfoldedList(myList):
    for cId, container in enumerate(myList, start=1):
        for uId, value in enumerate(container, start=1):
            yield (cId, uId, value)

Testing the function

In [None]:
[t for t in getUnfoldedList(listText[0:2])]

Defining a function to generically write into the database

In [None]:
def insertInto(myTable, myList):
    conn = sqlite3.connect(DB_FILE)
    c = conn.cursor()
    c.executemany("INSERT INTO " + myTable + " VALUES (?, ?, ?)", getUnfoldedList(myList))
    conn.commit()
    conn.close()
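
Because SQLite placeholders can bind values but not identifiers, the table name has to be concatenated into the statement. A minimal sketch of one way to keep that safe is shown below: the name is checked against an allow list before the statement is built. The names `ALLOWED_TABLES`, `unfold` and `insert_into`, the in-memory database, and the single test table are illustrative assumptions rather than part of the original notebook.

```python
import sqlite3

# Hypothetical allow list: the four tables this notebook writes to
ALLOWED_TABLES = {"utterance", "utterance_act", "utterance_emotion", "conversation_topic"}

def unfold(nested):
    """Yield (conversation id, utterance id, value) tuples with 1-based ids."""
    for c_id, container in enumerate(nested, start=1):
        for u_id, value in enumerate(container, start=1):
            yield (c_id, u_id, value)

def insert_into(conn, table, nested):
    # Identifiers cannot be bound as parameters, so validate before formatting
    if table not in ALLOWED_TABLES:
        raise ValueError(f"unexpected table name: {table!r}")
    with conn:  # commits on success, rolls back on error
        conn.executemany(f"INSERT INTO {table} VALUES (?, ?, ?)", unfold(nested))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE utterance_act (cId INTEGER, uId INTEGER, actId INTEGER)")
insert_into(conn, "utterance_act", [[3, 4], [1]])  # made-up labels
rows = conn.execute("SELECT * FROM utterance_act ORDER BY cId, uId").fetchall()
conn.close()
print(rows)
```

Passing the connection in, rather than opening one per call, also lets several inserts share a single connection.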

Ingesting the data into the database for later analysis

In [None]:
insertInto("utterance", listText)
insertInto("utterance_act", listAct)
insertInto("utterance_emotion", listEmotion)
insertInto("conversation_topic", listTopic)
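
Since every utterance is expected to carry exactly one act and one emotion, a quick alignment check on the in-memory lists can catch parsing problems before they reach the analysis. The sketch below is one possible check, run here on made-up lists; the helper name `find_misaligned` is hypothetical. Note that `zip` stops at the shortest list, so the top-level lengths should be compared separately.

```python
def find_misaligned(texts, acts, emotions):
    """Return 1-based ids of conversations whose per-turn lists differ in length."""
    return [
        c_id
        for c_id, (t, a, e) in enumerate(zip(texts, acts, emotions), start=1)
        if not (len(t) == len(a) == len(e))
    ]

# Made-up data: the second conversation has one utterance but two acts
texts = [["Hi.", "Hello."], ["Bye."]]
acts = [[2, 2], [1, 1]]
emotions = [[0, 0], [0]]
misaligned = find_misaligned(texts, acts, emotions)
print(misaligned)
```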

Testing the database with a random conversation

In [None]:
sql = '''
SELECT topicId, actId, emotionId, Utterance FROM utterance AS u
 INNER JOIN conversation_topic AS t ON (t.cId = u.cId)
 INNER JOIN utterance_act AS a ON (a.cId = u.cId AND a.uId = u.uId)
 INNER JOIN utterance_emotion AS e ON (e.cId = u.cId AND e.uId = u.uId)
WHERE u.cId IN (
 SELECT cId FROM conversation_topic ORDER BY RANDOM() LIMIT 1
) 
'''
conn = sqlite3.connect(DB_FILE)
for row in conn.execute(sql):
    print(row)
# Closing the connection once the rows have been printed
conn.close()

**Task 2. Analysis of the data** continues [here](2-analysis.ipynb)