## CUNY DATA698
### Topic Modeling for Forensics Analysis of Text-Based Conversations
#### Michael Ippolito
#### May 2024

This is part of a series of Python Jupyter notebooks in support of my master's capstone project. The aim of the project is to study various methods of preprocessing, topic modeling, and postprocessing text-based conversation data often extracted from electronic devices recovered during criminal or cybersecurity investigations.

The Jupyter notebooks used in this project are as follows:

| Module | Purpose |
|--------|---------|
| eda1.ipynb | Exploratory data analysis of the four datasets used in the study. |
| modeling1.ipynb | Loads and preprocesses the datasets, fits various topic models, and postprocesses the topic representations. |
| survey1.ipynb | Generates conversation text and topic representations to submit to Mechanical Turk. It later parses the results and incorporates them into my own hand-labeled results. |
| survey2.ipynb | Loads Mechanical Turk survey results and evaluates them for quality based on reading speed and attention questions. |
| eval1.ipynb | Evaluates the topic modeling and survey results based on topic coherence, semantic quality, and topic relevance. |

The study uses the following four datasets:

1. Chitchat
2. Topical Chat
3. Ubuntu Dialogue
4. Enron Email

For further details and attribution, see my paper in this GitHub repo.


### Mechanical Turk Survey (Survey Preparation and Parsing)
#### survey1.ipynb

The code in the first half of the module generates conversations and topic representations to submit to Mechanical Turk. The second half of the module parses the results of the survey and incorporates them into my own hand-labeled results.


### Initialization

This section loads the required libraries and sets module-wide parameters.


In [67]:
# Load libraries
import os
import re
import json
import numpy as np
import pandas as pd
import chitchat_dataset as ccc
import random
from collections import Counter
import html
import csv
from IPython.core.display import HTML
from IPython.display import clear_output
import time


In [68]:
# Set params
capstone_dir = 'C:/Users/micha/Box Sync/cuny/698-Capstone'
pickle_dir = 'C:/tmp/pickles'
mturk_dir = 'C:/Users/micha/Box Sync/cuny/698-Capstone/mturk/'
topn_words = 5
nsamples = 10
rnd_seed = 77
rnd_seed2 = 777
num_topic_words = 10


In [69]:
# Instantiate chitchat dataset
print('instantiating chitchat dataset')
ccds = ccc.Dataset()
print()


instantiating chitchat dataset



In [70]:
# Function to return friendly sender names
def friendly_senders(senders):
    """
    Purpose:              To map an arbitrary list of conversation participants to anonymized names (e.g. UserA, UserB, UserC, etc.).
    Parameters:
        senders           List of conversation participants to anonymize and reformat.
    Returns:
        r                 Dict mapping each original participant to its anonymized "friendly" name.
    """
    fsn_prefix = 'User'
    return {sender: f"{fsn_prefix}{chr(65 + i)}" for i, sender in enumerate(senders)}

# Test suite
print(friendly_senders(['asdf', 2, 'hop', 'blaaaaaah']))
      

{'asdf': 'UserA', 2: 'UserB', 'hop': 'UserC', 'blaaaaaah': 'UserD'}
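
`chr(65 + i)` yields the letters A to Z only for the first 26 participants; after that it runs into punctuation and lowercase characters. That is not a problem for the two-person chats used here, but the following is a minimal sketch of a variant that keeps producing letter-only labels for longer sender lists. The names `friendly_label` and `friendly_senders_ext` and the spreadsheet-style suffix scheme are assumptions for illustration and are not part of the project code.

```python
def friendly_label(i, prefix='User'):
    """Return prefix plus spreadsheet-style letters: 0 -> A, 25 -> Z, 26 -> AA, ..."""
    letters = ''
    n = i
    while True:
        n, rem = divmod(n, 26)
        letters = chr(65 + rem) + letters
        if n == 0:
            break
        n -= 1  # shift so that index 26 maps to 'AA' rather than 'BA'
    return f"{prefix}{letters}"

def friendly_senders_ext(senders, prefix='User'):
    """Hypothetical drop-in variant of friendly_senders for more than 26 participants."""
    return {sender: friendly_label(i, prefix) for i, sender in enumerate(senders)}

print(friendly_senders_ext(['asdf', 2, 'hop']))
```

For up to 26 senders this should give the same mapping as `friendly_senders`.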


### Chitchat dataset

Loads the Chitchat dataset.


In [71]:
#############################################
# Chitchat dataset
#############################################

# Function to load the Chitchat dataset
def load_chitchat(num_docs):

    # Init convo list
    convs_txt = []

    # Init document list
    docs_txt = []
    
    # Iterate through items in dataset
    ct = 0
    ct_msgs_corpus = []
    ct_chats_corpus = []
    ct_words = []
    for convo_id, convo in ccds.items():

        # Conversation header info
        #print(convo_id, convo['ratings'], convo['start'], convo['prompt'], '\n')
        ct += 1

        # Init conversation text array
        convo_msgs = []

        # Init sender map (will be UserA, UserB)
        sender_map = {}
    
        # Init document text
        doc_msgs = ''
        
        # Iterate through messages; each message is from a single person and can contain multiple chats, e.g.:
        # {"messages": [[{"text": "Hello", "timestamp": "2018-05-02T19:38:15Z", "sender": "720840be-e522-47ba-9e9f-143f66372673"}...
        ct_msgs = 0
        ct_chats = []
        for msg in convo['messages']:

            # Concatenate all chats within this message
            msg_chats = [chat['text'] for chat in msg]
            msg_senders = [chat['sender'] for chat in msg]
            ct_chat = len(msg_chats)
            ct_chats.append(ct_chat)
            msg_chats = ' '.join(msg_chats)
            doc_msgs += msg_chats + ' '
            ct_msgs += 1

            # Convo text
            convo_msgs.append({'sender': np.unique(msg_senders)[0], 'txt': msg_chats})
            #print({'sender': np.unique(msg_senders)[0], 'txt': msg_chats})

        # Append count to the overall corpus counts (for stats purposes)
        #print(ct_msgs)
        #print(ct_chats)
        ct_msgs_corpus.append(ct_msgs)
        ct_chats_corpus.append(ct_chats)
    
        # Append to docs list
        docs_txt.append(doc_msgs)

        # Get friendly sender names
        senders = np.unique([e['sender'] for e in convo_msgs])
        fsenders = friendly_senders(senders)

        # Convo text
        conv_txt = ''
        for convo_msg in convo_msgs:
            conv_txt += f"{fsenders[convo_msg['sender']]}: {convo_msg['txt']}\n\n"
        convs_txt.append(conv_txt)

        # Approximate word count: number of non-word characters (spaces, punctuation, newlines) + 1
        ct_words.append(len(re.findall(r'\W', conv_txt)) + 1)
                       
        # Show first few docs
        """
        if ct < 6:
            print(ct)
            print(doc_msgs)
            print()
        """
    
        # Break early
        if num_docs > 0 and ct >= num_docs: break
    
    # Doc summary
    print(f'Number of docs (conversations): {len(docs_txt)}')

    # Return
    return docs_txt, ct_msgs_corpus, ct_chats_corpus, convs_txt, ct_words


In [72]:
# Load chitchat dataset
cc_txt, cc_msg_ct, cc_chat_ct, cc_conv, cc_words = load_chitchat(0)

# Make dataframe
dfcc = pd.DataFrame({'txt': cc_txt, 'msg_ct': cc_msg_ct, 'chat_ct': cc_chat_ct, 'conv': cc_conv, 'words': cc_words})
print(dfcc.shape)

# Only take conversations with more than 5 exchanges
dfcc = dfcc[dfcc['msg_ct'] > 5].reset_index()
dfcc['index'] = dfcc.index
display(dfcc.head())


Number of docs (conversations): 7168
(7168, 5)


| | index | txt | msg_ct | chat_ct | conv | words |
|---|-------|-----|--------|---------|------|-------|
| 0 | 0 | Hello How are you doing today? whats up MD im ... | 35 | [2, 3, 1, 2, 1, 1, 1, 1, 3, 2, 2, 1, 1, 1, 2, ... | UserB: Hello How are you doing today?\n\nUserA... | 661 |
| 1 | 1 | hi anyone here hey whats up yeah how are you i... | 57 | [2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 2, 1, 2, 1, ... | UserA: hi anyone here\n\nUserB: hey whats up y... | 835 |
| 2 | 2 | Hey! Hey I'm gonna close the other window if t... | 18 | [1, 1, 1, 1, 1, 1, 2, 1, 1, 3, 3, 3, 1, 1, 2, ... | UserB: Hey!\n\nUserA: Hey I'm gonna close the ... | 458 |
| 3 | 3 | I don't know what falafel is In fact I don't e... | 25 | [7, 2, 1, 1, 1, 2, 1, 1, 1, 2, 3, 1, 1, 2, 4, ... | UserA: I don't know what falafel is In fact I ... | 488 |
| 4 | 4 | Helllo!!!  Hello! I think this program is bugg... | 140 | [1, 1, 3, 1, 2, 2, 1, 1, 2, 3, 3, 3, 2, 3, 1, ... | UserA: Helllo!!!\n\nUserB:  Hello! I think thi... | 3886 |
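
The `words` column is computed as the number of non-word characters plus one, so punctuation, speaker-label colons, and the blank lines between turns all add to the count. The sketch below contrasts that expression with a token-based count on a tiny made-up conversation; the function names and the sample text are illustrative assumptions, not part of the notebook.

```python
import re

def count_nonword_plus_one(text):
    # Same expression the notebook uses: every non-word character counts as a separator
    return len(re.findall(r'\W', text)) + 1

def count_word_tokens(text):
    # Alternative: count runs of word characters (speaker labels are still included)
    return len(re.findall(r'\w+', text))

sample = "UserA: Hello there!\n\nUserB: Hi\n\n"
print(count_nonword_plus_one(sample), count_word_tokens(sample))
```

Because the same expression is applied to every dataset, the counts remain comparable with each other; they should just be read as an upper-leaning proxy for length rather than an exact number of words.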


### Topical Chat dataset

Loads the Topical Chat dataset.


In [73]:
#############################################
# Topical Chat dataset
#############################################

# Function to load the Topical Chat dataset
def load_topical(num_docs):

    # Path
    path_to_docs = 'C:/Users/micha/Documents/698/corpora/topical_chat/train.json'
    
    # Init convo list
    convs_txt = []

    # Init document list
    docs_txt = []

    # Load file
    with open(path_to_docs, 'r', encoding='latin-1') as fh:

        j = json.load(fh)

    # Iterate through items in dataset
    ct = 0
    ct_msgs_corpus = []
    ct_chats_corpus = []
    ct_words = []
    for k in j.keys():

        ct += 1

        # Get conversation
        conv = j[k]['content']

        # Init document and conversation text
        doc_msgs = ''
        conv_txt = ''
        
        # Iterate over each message in the conversation; each message is by one person and can contain multiple sentences
        ct_msgs = 0
        for msg in conv:

            # Concatenate all chats within this message
            msg_txt = msg['message']
            sender = msg['agent']
            doc_msgs += msg_txt + ' '
            conv_txt += f"{sender}: {msg_txt}\n\n"
            ct_msgs += 1

        # Append count to the overall corpus counts (for stats purposes)
        ct_msgs_corpus.append(ct_msgs)
    
        # Append to docs list
        docs_txt.append(doc_msgs)

        # Convos
        convs_txt.append(conv_txt)
    
        # Approximate word count: number of non-word characters (spaces, punctuation, newlines) + 1
        ct_words.append(len(re.findall(r'\W', conv_txt)) + 1)
                       
        # Break early
        if num_docs > 0 and ct >= num_docs: break
    
    # Doc summary
    print(f'Number of docs (conversations): {len(docs_txt)}')

    # Return
    return docs_txt, ct_msgs_corpus, convs_txt, ct_words


In [74]:
# Load topical chat dataset
tc_txt, tc_msg_ct, tc_conv, tc_words = load_topical(3000)

# Make dataframe
dftc = pd.DataFrame({'txt': tc_txt, 'msg_ct': tc_msg_ct, 'chat_ct': 1, 'conv': tc_conv, 'words': tc_words})
print(dftc.shape)

# Only take conversations with more than 5 exchanges
dftc = dftc[dftc['msg_ct'] > 5].reset_index()
display(dftc.head())


Number of docs (conversations): 3000
(3000, 5)


| | index | txt | msg_ct | chat_ct | conv | words |
|---|-------|-----|--------|---------|------|-------|
| 0 | 0 | Are you a fan of Google or Microsoft? Both are... | 21 | 1 | agent_1: Are you a fan of Google or Microsoft?... | 493 |
| 1 | 1 | do you like dance? Yes  I do. Did you know Bru... | 21 | 1 | agent_1: do you like dance?\n\nagent_2: Yes  I... | 300 |
| 2 | 2 | Hey what's up do use Google very often?I reall... | 21 | 1 | agent_1: Hey what's up do use Google very ofte... | 502 |
| 3 | 3 | Hi!  do you like to dance? I love to dance a l... | 23 | 1 | agent_1: Hi!  do you like to dance?\n\nagent_2... | 471 |
| 4 | 4 | do you like dance? I love it. Did you know Bru... | 21 | 1 | agent_1: do you like dance?\n\nagent_2: I love... | 325 |


### Ubuntu Dialogue dataset

Loads the Ubuntu Dialogue dataset.


In [75]:
#############################################
# Ubuntu Dialogue dataset
#############################################

# Function to load the Ubuntu Dialogue dataset
def load_ubuntu(num_docs):

    # Path
    path_to_docs = 'C:/Users/micha/Documents/698/corpora/ubuntu_dialogues/dialogs'
    
    # Init document list
    docs_txt = []

    # Init convo list
    convs_txt = []

    # Iterate over directories
    ct_conv = 0
    ct_err = 0
    ct_msgs_corpus = []
    ct_words = []
    for d in os.listdir(path_to_docs):

        # Iterate over files in directory
        print('dir', d)
        for f in os.listdir(path_to_docs + '/' + d):

            # Verify it's a file
            fn = path_to_docs + '/' + d + '/' + f
            if os.path.isfile(fn):

                # Init document and conversation text
                doc_msgs = ''
                conv_txt = ''
        
                # Read the file
                with open(fn, 'r', encoding='latin-1') as fh:

                    # Read each line; each line is a separate message from a single user
                    ct_conv += 1
                    ct_msgs = 0
                    while True:

                        # Read line
                        l = fh.readline()
                        if not l:
                            break

                        # Split to an array; each line will be in this format: timestamp[tab]sender[tab]receiver[tab]message
                        # e.g.: 2005-05-26T16:54:00.000Z[tab]lifeless[tab]we2by[tab]calm down please
                        tmp = l.strip().split('\t')
                        if len(tmp) == 4:

                            # Concatenate to all the messages in the conversation
                            sender = tmp[1]
                            doc_msgs += tmp[3] + ' '
                            conv_txt += f"{sender}: {tmp[3]}\n\n"
                            ct_msgs += 1
                            
                        else:
                            # Not the right number of fields in this message
                            ct_err += 1
                            
                # Append count to the overall corpus counts (for stats purposes)
                ct_msgs_corpus.append(ct_msgs)

                # Convos
                convs_txt.append(conv_txt)

                # Append to docs list
                docs_txt.append(doc_msgs)

                # Approximate word count: number of non-word characters (spaces, punctuation, newlines) + 1
                ct_words.append(len(re.findall(r'\W', conv_txt)) + 1)
                               
            # Break early
            if ct_conv % 1000 == 0: print(ct_conv)
            if num_docs > 0 and ct_conv >= num_docs:
                break
                
        # Break early
        if ct_conv % 1000 == 0: print(ct_conv)
        if num_docs > 0 and ct_conv >= num_docs:
            break

    # Doc summary
    print(f'Number of docs (conversations): {ct_conv}')
    print(f'Number of errs: {ct_err}')
    
    # Return
    return docs_txt, ct_msgs_corpus, convs_txt, ct_words


In [76]:
# Load Ubuntu Dialogue data set
ud_txt, ud_msg_ct, ud_conv, ud_words = load_ubuntu(10000)

# Make dataframe
dfud = pd.DataFrame({'txt': ud_txt, 'msg_ct': ud_msg_ct, 'chat_ct': 1, 'conv': ud_conv, 'words': ud_words})
print(dfud.shape)

# Only take conversations with more than 5 exchanges
dfud = dfud[dfud['msg_ct'] > 5].reset_index()
display(dfud.head())


dir 10
1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
10000
Number of docs (conversations): 10000
Number of errs: 12
(10000, 5)


| | index | txt | msg_ct | chat_ct | conv | words |
|---|-------|-----|--------|---------|------|-------|
| 0 | 0 | hi sudo echo Y > /sys/module/usbcore/parameter... | 10 | 1 | we2by: hi\n\nwe2by: sudo echo Y > /sys/module/... | 92 |
| 1 | 1 | Hmm Why doesn't GLX work with X.Org (I just ch... | 10 | 1 | rapha: Hmm\n\nrapha: Why doesn't GLX work with... | 145 |
| 2 | 2 | hi can someone tell me where shell prompt name... | 10 | 1 | bitnumus: hi can someone tell me where shell p... | 267 |
| 3 | 3 | How do I boot in safe mode with 12.04?  you me... | 10 | 1 | MorganJarl: How do I boot in safe mode with 12... | 271 |
| 4 | 4 | Hello, I have a minimal linux system: how can ... | 10 | 1 | fk91: Hello, I have a minimal linux system: ho... | 107 |
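
Each Ubuntu Dialogue line is expected to carry four tab-separated fields (timestamp, sender, receiver, message), and lines that do not split into exactly four fields are counted as errors. As a rough sketch of that parsing step in isolation, the hypothetical helper below strips only the line ending rather than all surrounding whitespace, on the assumption that an empty trailing field should be kept instead of being treated as a malformed line. The helper name and that design choice are not from the notebook.

```python
def parse_ubuntu_line(line):
    """Split one dialogue line into its four tab-separated fields, or return None."""
    # Strip only the line ending so that empty trailing fields are preserved
    fields = line.rstrip('\r\n').split('\t')
    if len(fields) != 4:
        return None
    timestamp, sender, receiver, message = fields
    return {'timestamp': timestamp, 'sender': sender, 'receiver': receiver, 'message': message}

example = "2005-05-26T16:54:00.000Z\tlifeless\twe2by\tcalm down please\n"
print(parse_ubuntu_line(example))
```

A caller would increment its error counter whenever the helper returns `None`.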


### Enron Email dataset

Loads the Enron Email dataset.


In [77]:
#############################################
# Enron Email dataset
#############################################

# Preload enron emails into dataframe
path_to_emails = 'C:/Users/micha/Documents/698/corpora/enron_emails/emails.csv'
dfee = pd.read_csv(path_to_emails)
dfee.drop(columns=['file'], inplace=True)
dfee['msg_ct'] = 1  # just assume each email = 1 message = 1 conversation
dfee['chat_ct'] = 1
dfee.rename(columns={'message': 'txt'}, inplace=True)
dfee = dfee[:5000]
dfee['index'] = dfee.index


In [78]:
# Function to remove header info from Enron emails
def proc_email(msg):
    """
    Purpose:             To remove email headers from Enron emails.
    Parameters:
        msg              The text of the message.
    Returns:
        r                The text of the message with header information stripped.
    """

    # Init processed msg
    r = ''

    # Find subject
    subj = ''
    i = msg.find('Subject: ')
    if i > -1:
        j = msg.find('\n', i)
        if j > -1:
            subj = msg[i + len('Subject: '):j]

    # Strip off any leading 'Re:' and 'Fwd:' prefixes, including chained ones (e.g. 'Re: Fwd: ...')
    subj = subj.strip()
    while True:
        if subj[0:3].lower() == 're:':
            subj = subj[3:].strip()
        elif subj[0:4].lower() == 'fwd:':
            subj = subj[4:].strip()
        else:
            break

    # Find the first double \n; this should be the start of the message
    i = msg.find('\n\n')
    r = subj + '\n\n'
    if i > -1:
        r += msg[i + 2:]

    # Return
    return r.strip()

# Test suite (random.randint is inclusive at both ends, so cap at the last valid index)
msg_before = dfee.loc[dfee.index == random.randint(0, len(dfee) - 1), 'txt'].values[0]
msg_after = proc_email(msg_before)
print(msg_after)
print('***************************************************************')
print('***************************************************************')
print('***************************************************************')
print(msg_before)


12APR HOUSTON TO NEW YORK = JENNIFER WHITE = TICKETED

---------------------- Forwarded by John Arnold/HOU/ECT on 04/04/2001 10:40 
PM ---------------------------


sandra delgado <sdelgado_vitoltvl@yahoo.com> on 03/30/2001 04:27:11 PM
To: JOHN.ARNOLD@ENRON.COM
cc:  
Subject: 12APR HOUSTON TO NEW YORK = JENNIFER WHITE = TICKETED


                                          AGENT SS/SS BOOKING REF
YFRJLU

                                          WHITE/JENNIFER


  ENRON
  1400 SMITH
  HOUSTON TX 77002
  ATTN: JOHN ARNOLD


  DATE:  MAR 30 2001                   ENRON

SERVICE               DATE  FROM           TO             DEPART
ARRIVE

CONTINENTAL AIRLINES  12APR HOUSTON TX     NEW YORK NY    335P    817P
CO 1700    V          THU   G.BUSH INTERCO LA GUARDIA
                            TERMINAL C     TERMINAL M
                            SNACK                         NON STOP
                            RESERVATION CONFIRMED         3:42 DURATION
                  AIRCRAFT: BOEING 

In [79]:
# Remove header info from Enron emails
dfee['txt'] = dfee['txt'].apply(proc_email)
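
`proc_email` separates headers from the body by searching for `Subject: ` and for the first blank line. As a point of comparison only, here is a minimal sketch of the same split using Python's standard `email` package. It assumes the raw Enron messages are plain, non-multipart RFC 822-style text; the function name `split_email` and the sample message are made up for illustration.

```python
from email import message_from_string

def split_email(raw):
    """Return (subject, body) for a plain-text message using the stdlib parser."""
    msg = message_from_string(raw)
    subject = msg.get('Subject', '')
    # get_payload() returns a string for non-multipart messages (assumed here)
    body = msg.get_payload() if not msg.is_multipart() else ''
    return subject.strip(), body.strip()

raw = (
    "Message-ID: <123@example.com>\n"
    "From: a@example.com\n"
    "Subject: Re: Lunch\n"
    "\n"
    "Let's shoot for Tuesday.\n"
)
print(split_email(raw))
```

This variant only reads the subject from the header block, so a `Subject:` line quoted inside a forwarded body would not be picked up by mistake. Stripping reply and forward prefixes would still be a separate step.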


In [80]:
# Set conversation field
dfee['conv'] = dfee['txt']

# Approximate word count per doc: number of non-word characters (spaces, punctuation, newlines) + 1
dfee['words'] = dfee['conv'].apply(lambda x: len(re.findall(r'\W', x)) + 1)

# Display
print(dfee.shape)
display(dfee.head())


(5000, 6)


| | txt | msg_ct | chat_ct | index | conv | words |
|---|-----|--------|---------|-------|------|-------|
| 0 | Here is our forecast | 1 | 1 | 0 | Here is our forecast | 4 |
| 1 | Traveling to have a business meeting takes the... | 1 | 1 | 1 | Traveling to have a business meeting takes the... | 163 |
| 2 | test\n\ntest successful. way to go!!! | 1 | 1 | 2 | test\n\ntest successful. way to go!!! | 12 |
| 3 | Randy,\n\n Can you send me a schedule of the s... | 1 | 1 | 3 | Randy,\n\n Can you send me a schedule of the s... | 46 |
| 4 | Hello\n\nLet's shoot for Tuesday at 11:45. | 1 | 1 | 4 | Hello\n\nLet's shoot for Tuesday at 11:45. | 11 |


### Prepare Survey Questions

This section prepares survey questions by exploring candidate conversations and topic representations to include.


In [81]:
# Function to load the last pickle file of the specified type (bow, word2vec, or transformer)
def load_last_pickle(model_type):

    d = os.listdir(f'{pickle_dir}/{model_type}')
    last_pickle = sorted(d, reverse=True)[0]
    dftmp = pd.read_pickle(f'{pickle_dir}/{model_type}/{last_pickle}')
    return dftmp
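
`load_last_pickle` treats the name that sorts last as the most recent file, which holds only if the filenames embed a sortable timestamp. The sketch below illustrates that assumption on throwaway files. The `results_YYYYMMDD_HHMMSS.pkl` naming pattern is hypothetical (modeled on the `Timestamp` values in the results dataframe), as is the `latest_file` helper.

```python
import os
import tempfile

def latest_file(directory, suffix='.pkl'):
    """Return the name that sorts last among files ending in `suffix`."""
    names = [n for n in os.listdir(directory) if n.endswith(suffix)]
    if not names:
        raise FileNotFoundError(f'no {suffix} files in {directory}')
    # Lexicographic max equals the newest file only when names embed a sortable timestamp
    return max(names)

with tempfile.TemporaryDirectory() as d:
    # Hypothetical filenames; the unrelated notes.txt is ignored by the suffix filter
    for name in ['results_20240401_133320.pkl', 'results_20240504_155929.pkl', 'notes.txt']:
        open(os.path.join(d, name), 'w').close()
    newest = latest_file(d)

print(newest)
```

Filtering on the suffix and raising on an empty directory are additions for robustness; the notebook version would instead fail with an `IndexError` if the directory were empty.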


In [82]:
# This section creates a document map to map document ID in the result dataframe back to a
# document ID in the initial dataset's dataframe.

# Init the docmap
docmap = {'chitchat': {}, 'topical chat': {}, 'ubuntu dialogue': {}, 'enron email': {}}
datasetmap = {'cc': 'chitchat', 'tc': 'topical chat', 'ud': 'ubuntu dialogue', 'ee': 'enron email'}

# Iterate through result dataframe
for dataset in docmap.keys():

    # Get list of document indices
    if dataset == 'chitchat':
        alldocs_i = list(dfcc.index)
    elif dataset == 'topical chat':
        alldocs_i = list(dftc.index)
    elif dataset == 'ubuntu dialogue':
        alldocs_i = list(dfud.index)
    elif dataset == 'enron email':
        alldocs_i = list(dfee.index)

    # Create a document map mapping a document number (0 to 249) to a document id in the dataset from which it originated
    random.seed(rnd_seed)
    docs_i = random.sample(alldocs_i, 250)
    for i, doc_i in enumerate(docs_i):
        docmap[dataset][i] = doc_i

# Print and save docmap
print(docmap)
with open(f"{capstone_dir}/docmap.json", 'w') as fh:
    json.dump(docmap, fh)
    

{'chitchat': {0: 1035, 1: 1334, 2: 808, 3: 985, 4: 791, 5: 471, 6: 1197, 7: 1950, 8: 2285, 9: 2504, 10: 974, 11: 595, 12: 2276, 13: 2583, 14: 2707, 15: 2050, 16: 9, 17: 1146, 18: 116, 19: 2036, 20: 707, 21: 783, 22: 2077, 23: 1591, 24: 749, 25: 1357, 26: 1316, 27: 480, 28: 357, 29: 2256, 30: 1066, 31: 2187, 32: 1923, 33: 839, 34: 924, 35: 785, 36: 294, 37: 331, 38: 819, 39: 2855, 40: 2294, 41: 1792, 42: 2438, 43: 2884, 44: 2206, 45: 312, 46: 408, 47: 2897, 48: 868, 49: 1347, 50: 359, 51: 388, 52: 2362, 53: 1889, 54: 324, 55: 2203, 56: 436, 57: 1113, 58: 1821, 59: 2568, 60: 2261, 61: 497, 62: 1439, 63: 2169, 64: 230, 65: 1655, 66: 2385, 67: 2480, 68: 2174, 69: 1015, 70: 897, 71: 2345, 72: 1934, 73: 1649, 74: 566, 75: 1574, 76: 52, 77: 143, 78: 213, 79: 2030, 80: 2590, 81: 161, 82: 1466, 83: 1772, 84: 482, 85: 1048, 86: 2064, 87: 2305, 88: 2819, 89: 2002, 90: 229, 91: 2665, 92: 1931, 93: 2834, 94: 2680, 95: 389, 96: 603, 97: 1221, 98: 2175, 99: 376, 100: 70, 101: 1988, 102: 1445, 103: 17
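
One detail to keep in mind when `docmap.json` is read back in another notebook: JSON object keys are always strings, so the integer document numbers used as keys above come back as `'0'`, `'1'`, and so on. A minimal sketch of the round trip and of restoring integer keys follows; the two-entry `docmap_example` is a cut-down stand-in for the real map.

```python
import json

# Cut-down stand-in for the real docmap
docmap_example = {'chitchat': {0: 1035, 1: 1334}}

# Round trip through JSON: integer keys are serialized as strings
restored = json.loads(json.dumps(docmap_example))

# Convert the inner keys back to integers
restored_int = {ds: {int(k): v for k, v in m.items()} for ds, m in restored.items()}

print(restored)
print(restored_int)
```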

In [83]:
# Load latest pickle
dfr = load_last_pickle('sim')
print(dfr.shape)
display(dfr.head())
display(dfr.tail())
print(dfr.columns)
print()


(400, 25)


Unnamed: 0,index,Dataset,Num_docs,Rnd_seed,Model,Num_topics,Model_params,Cv_score,Cuci_score,Cnpmi_score,...,Runtime,Timestamp,Topic_words,Doc_topics,Flan_topic,Cosine_similarity,Model_family,Topic_words2,Flan_topic2,Keyphrases
0,0,chitchat,250,77,lsi,8,{'num_topics': 15},0.256103,-12.5412,-0.44286,...,77.495494,20240401_133320,"[(0, [('tear', 0.1492861747332724), ('eye', 0....","{0: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,...","[[0, ""People""], [1, ""People""], [2, ""Science/Te...","{""0"": [[0, 0.19291281700134277], [1, 0.0759510...",bow,"[[0, [[""job"", 0], [""work"", 0], [""eros"", 0], [""...","[[0, ""eros""], [1, ""eros""], [2, ""Brazil""], [3, ...",False
1,1,chitchat,250,77,lsi,2,{'num_topics': 15},0.378848,-14.478989,-0.512248,...,93.398113,20240401_133437,"[(0, [('change', 0.041310894306892804), ('expe...","{0: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,...","[[0, ""Change""], [1, ""Change""], [2, ""Dissipate""...","{""0"": [[0, 0.05856030061841011], [1, 0.0758973...",bow,"[[0, [[""looker"", 0], [""eddy"", 0], [""vesture"", ...","[[0, ""Dog""], [1, ""Change""], [2, ""trespass""], [...",False
2,2,chitchat,250,77,lsi,1,{'num_topics': 15},0.618475,-13.53559,-0.484803,...,133.312275,20240401_133611,"[(0, [('pass', 0.02115486804364902), ('break',...","{0: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,...","[[0, ""game""], [1, ""Science/Tech""], [2, ""Sports...","{""0"": [[0, 0.0338706336915493], [1, -0.0393538...",bow,"[[0, [[""follow"", 0], [""deferment"", 0], [""alter...","[[0, ""Running game""], [1, ""Deferment""], [2, ""S...",False
3,3,chitchat,250,77,lsi,2,{'num_topics': 15},0.662265,-13.528606,-0.484894,...,137.099871,20240401_133824,"[(0, [('experience', 0.01997448274133498), ('p...","{0: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,...","[[0, ""game""], [1, ""game""], [2, ""Electrical out...","{""0"": [[0, 0.05463644862174988], [1, 0.0174532...",bow,"[[0, [[""follow"", 0], [""deferment"", 0], [""vestu...","[[0, ""Deferment""], [1, ""Sports""], [2, ""Animal""...",False
4,4,chitchat,250,77,lsi,11,{'num_topics': 30},0.249136,-12.43753,-0.436446,...,91.362547,20240401_134041,"[(0, [('tear', 0.14928553793317761), ('eye', 0...","{0: [0, 1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13...","[[0, ""People""], [1, ""People""], [2, ""Science/Te...","{""0"": [[0, 0.19291281700134277], [1, 0.0759510...",bow,"[[0, [[""job"", 0], [""work"", 0], [""eros"", 0], [""...","[[0, ""eros""], [1, ""eros""], [2, ""Brazil""], [3, ...",False


Unnamed: 0,index,Dataset,Num_docs,Rnd_seed,Model,Num_topics,Model_params,Cv_score,Cuci_score,Cnpmi_score,...,Runtime,Timestamp,Topic_words,Doc_topics,Flan_topic,Cosine_similarity,Model_family,Topic_words2,Flan_topic2,Keyphrases
395,395,topical chat,250,77,bertopic,9,{'repr_model': 'keybert'},0.368992,-9.86131,-0.197724,...,11.998523,20240504_155104,"[(-1, [('troopers', 0.5052358), ('military', 0...","[{-1: [2, 4, 6, 11, 12, 21, 22, 35, 36, 38, 41...","[[-1, ""World""], [0, ""Science/Tech""], [1, ""Aero...","{""-1"": [[2, 0.08672984689474106], [4, 0.014690...",transformer,"[[-1, [[""bad person"", 0], [""chessman"", 0], [""s...","[[-1, ""wikileaks""], [0, ""YouTube""], [1, ""Black...",True
396,396,ubuntu dialogue,250,77,bertopic,2,{'repr_model': 'none'},0.542989,-14.670168,-0.459252,...,10.875522,20240504_155415,"[(0, [('ubuntu', 0.14119847234575922), ('linux...","[{0: [0, 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 13, 1...","[[0, ""Linux""], [1, ""World""]]","{""0"": [[0, 0.2154170125722885], [1, 0.11550097...",transformer,"[[0, [[""phone card"", 0], [""bank bill"", 0], [""a...","[[0, ""Science/Tech""], [1, ""Fox""]]",True
397,397,ubuntu dialogue,250,77,bertopic,2,{'repr_model': 'keybert'},0.447699,-14.460991,-0.470929,...,11.713669,20240504_155426,"[(0, [('ubuntu', 0.7777027), ('linux', 0.62036...","[{0: [0, 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 13, 1...","[[0, ""Linux""], [1, ""World""]]","{""0"": [[0, 0.1624004989862442], [1, 0.03420655...",transformer,"[[0, [[""artifact"", 0], [""adroitness"", 0], [""ge...","[[0, ""xubuntu""], [1, ""Race (United States)""]]",True
398,398,enron email,250,77,bertopic,3,{'repr_model': 'none'},,,,...,3.13473,20240504_155926,"[(0, [('gas', 0.07409395651732713), ('enron', ...","[{0: [0, 1, 2, 3, 4, 5, 6, 7, 11, 13, 14, 15, ...","[[0, ""Energy""], [1, """"], [2, """"]]","{""0"": [[0, 0.17197903990745544], [1, 0.2113838...",transformer,"[[0, [[""futures"", 0], [""entropy"", 0], [""tradin...","[[0, ""Future""], [1, """"], [2, """"]]",True
399,399,enron email,250,77,bertopic,3,{'repr_model': 'keybert'},,,,...,3.332339,20240504_155929,"[(0, [('enron', 0.7303953), ('gas', 0.38910866...","[{0: [0, 1, 2, 3, 4, 5, 6, 7, 11, 13, 14, 15, ...","[[0, ""Business""], [1, """"], [2, """"]]","{""0"": [[0, 0.020173335447907448], [1, 0.006645...",transformer,"[[0, [[""futures"", 0], [""guard"", 0], [""trading""...","[[0, ""Business""], [1, """"], [2, """"]]",True


Index(['index', 'Dataset', 'Num_docs', 'Rnd_seed', 'Model', 'Num_topics',
       'Model_params', 'Cv_score', 'Cuci_score', 'Cnpmi_score', 'Umass_score',
       'Spell_checked', 'Text_speak', 'Synonyms', 'Hypernyms', 'Runtime',
       'Timestamp', 'Topic_words', 'Doc_topics', 'Flan_topic',
       'Cosine_similarity', 'Model_family', 'Topic_words2', 'Flan_topic2',
       'Keyphrases'],
      dtype='object')



In [84]:
# Find which rows correspond to which model runs
#print(dfr.loc[(dfr['Model']=='lda') & (dfr['Model_params'].astype(str)=="{'num_topics': 15, 'num_passes': 25}") & (dfr['Keyphrases']==True)])
#print(dfr.loc[(dfr['Model']=='bertopic') & (dfr['Model_params'].astype(str)=="{'repr_model': 'none'}") & (dfr['Keyphrases']==True)])
#print(dfr.loc[(dfr['Model']=='word2vec') & (dfr['Model_params'].astype(str)=="{'embeddings': 'word2vec', 'vector_size': 200, 'min_count': 1, 'cluster_alg': 'kmeans', 'n_clusters': 15, 'max_iter': 300, 'tol': 0.0001}") & (dfr['Keyphrases']==True)])
#print(dfr.loc[(dfr['Model']=='word2vec') & (dfr['Model_params'].astype(str)=="{'embeddings': 'glove', 'vector_size': 200, 'min_count': 1, 'cluster_alg': 'kmeans', 'n_clusters': 15, 'max_iter': 300, 'tol': 0.0001}") & (dfr['Keyphrases']==True)])


In [85]:
# Read the manually defined topic words
dfkw = pd.read_excel('C:/Users/micha/Box Sync/cuny/698-Capstone/keywords.xlsx', sheet_name='human')
display(dfkw)


| | Dataset | Doc_id | Snippet | Topic_words | Human_topic | Relevant |
|---|---------|--------|---------|-------------|-------------|----------|
| 0 | chitchat | 1035.0 | i never turn mine off anyways updates would ju... | browser,college,major,BYU,student,media,marketing | traveling | 1 |
| 1 | chitchat | 1334.0 | Hello fellow human! As a human with skin and h... | smell,friends,hangout,rpg,game,play,munchkin,b... | | 1 |
| 2 | chitchat | 808.0 | Hello! Hahaha hey again! :) I was having a con... | pizza,restaurant,topping,pepper,santa cruz,lag... | | 1 |
| 3 | chitchat | 985.0 | Hello? hey! Hello! Sorry, this is the first ti... | book,bank,florence,machine,panic,disco,music,j... | | 1 |
| 4 | chitchat | 791.0 | Oh for sure :p oh my finally a person! I anni ... | south korea,metal,listen,music,rap,hiphop,rb,c... | | 1 |
| ... | ... | ... | ... | ... | ... | ... |
| 163 | enron email | 4128.0 | | payout,investment,builders,profit,repaid,split... | | 0 |
| 164 | enron email | 205.0 | | notes,outlook,migration,survey,fill,computer,p... | | 0 |
| 165 | enron email | 205.0 | | overdue,access,request,mat,smith,pending,approval | | 0 |
| 166 | enron email | 3418.0 | | extension, document, engineer, architect, acco... | | 0 |


In [86]:
# Get stats on models
print(np.unique(dfr['Model'].astype(str) + ' - ' + dfr['Model_params'].astype(str)))
#print(np.unique(dfr['Model_params'].astype(str)))
[print(e) for e in list(dfr.columns)]
print()


["bertopic - {'repr_model': 'keybert'}"
 "bertopic - {'repr_model': 'none'}"
 "lda - {'num_topics': 15, 'num_passes': 25}"
 "lda - {'num_topics': 30, 'num_passes': 25}" "lsi - {'num_topics': 15}"
 "lsi - {'num_topics': 30}" "nmf - {'num_topics': 15, 'num_passes': 15}"
 "nmf - {'num_topics': 30, 'num_passes': 15}"
 "word2vec - {'embeddings': 'fasttext', 'vector_size': 200, 'min_count': 1, 'cluster_alg': 'dbscan', 'eps': 0.1, 'min_samples': 3}"
 "word2vec - {'embeddings': 'fasttext', 'vector_size': 200, 'min_count': 1, 'cluster_alg': 'dbscan', 'eps': 1, 'min_samples': 2}"
 "word2vec - {'embeddings': 'fasttext', 'vector_size': 200, 'min_count': 1, 'cluster_alg': 'kmeans', 'n_clusters': 15, 'max_iter': 300, 'tol': 0.0001}"
 "word2vec - {'embeddings': 'fasttext', 'vector_size': 200, 'min_count': 1, 'cluster_alg': 'kmeans', 'n_clusters': 30, 'max_iter': 300, 'tol': 0.0001}"
 "word2vec - {'embeddings': 'glove', 'vector_size': 200, 'min_count': 1, 'cluster_alg': 'dbscan', 'eps': 0.1, 'min_samp

In [87]:
# Get word count stats

"""
print(dfcc['words'].describe())
print()
print(dftc['words'].describe())
print()
print(dfud['words'].describe())
print()
print(dfee['words'].describe())
print()
"""

#dftmp = pd.DataFrame(dfcc['words'])
dftmp = pd.concat([dfcc['words'], dftc['words'], dfud['words'], dfee['words']])
print(dftmp.shape)
display(dftmp.head())
print()
display(dftmp.describe())


(20963,)


0     661
1     835
2     458
3     488
4    3886
Name: words, dtype: int64




count    20963.000000
mean       381.118256
std        696.331635
min          1.000000
25%        130.000000
50%        191.000000
75%        441.000000
max      25471.000000
Name: words, dtype: float64

In [88]:
# Choose 2 conversations per dataset that have between 210 and 610 words (avg words/convo = 410).
# Caveat: this approach is flawed because ud and ee don't follow the same indexes as cc and tc,
# so none of the indexes I chose to hand-label for these two datasets are actually in the list of randomly chosen 250 documents below.
# Instead of using this section to choose the ud and ee docs, use the next two sections.

# See if any of the conversations that I already prelabeled are in that range
labeled = [1035, 1334, 808, 985, 791, 471, 1197, 1950, 2285, 2504, 974, 595, 2276, 2583, 2707, 2050, 9, 1146, 116, 2036]

# For each dataset, generate list of word counts for the above documents
l = {}
l['cc'] = dfcc.loc[dfcc['index'].isin(labeled), ['index', 'words']].sort_values(['words'])
l['tc'] = dftc.loc[dftc['index'].isin(labeled), ['index', 'words']].sort_values(['words'])
l['ud'] = dfud.loc[dfud['index'].isin(labeled), ['index', 'words']].sort_values(['words'])
l['ee'] = dfee.loc[dfee['index'].isin(labeled), ['index', 'words']].sort_values(['words'])

# Print doc ids for reference
print('doc ids:')
[print(f"{i}: {id}", end=', ') for i, id in enumerate(labeled)]
print('\n')

# Print word counts for each dataset
for dataset in l.keys():
    print(dataset)
    [print(f"{row['index']}: {row['words']}", end=', ') for _, row in l[dataset].iterrows()]
    print('\n')


doc ids:
0: 1035, 1: 1334, 2: 808, 3: 985, 4: 791, 5: 471, 6: 1197, 7: 1950, 8: 2285, 9: 2504, 10: 974, 11: 595, 12: 2276, 13: 2583, 14: 2707, 15: 2050, 16: 9, 17: 1146, 18: 116, 19: 2036, 

cc
985: 121, 1950: 173, 595: 193, 2050: 210, 1197: 294, 2276: 305, 2504: 384, 471: 533, 2707: 608, 2036: 621, 1146: 744, 2583: 793, 791: 1083, 1035: 1141, 974: 1445, 808: 1565, 1334: 1676, 116: 1735, 2285: 1967, 9: 2805, 

tc
2276: 382, 595: 394, 116: 405, 1197: 433, 1334: 481, 808: 494, 2050: 495, 2707: 506, 985: 550, 974: 561, 2036: 565, 2504: 579, 2583: 610, 1035: 616, 1950: 659, 791: 665, 2285: 687, 471: 736, 9: 780, 1146: 794, 

ud
2036: 90, 116: 96, 2050: 105, 9: 108, 974: 112, 1146: 113, 1950: 115, 2707: 117, 595: 120, 1334: 136, 471: 145, 1197: 156, 985: 157, 808: 171, 2583: 177, 2504: 200, 2276: 248, 791: 256, 2285: 288, 1035: 334, 

ee
1197: 6, 791: 14, 2707: 18, 1035: 22, 808: 23, 974: 31, 595: 60, 2036: 126, 2276: 193, 985: 243, 2583: 260, 2285: 275, 2504: 292, 1146: 304, 9: 556, 471: 5

In [89]:
# Look for a suitably sized ubuntu conversation
docid_list = [docmap['ubuntu dialogue'][docid] for docid in docmap['ubuntu dialogue']]
#print(dfud.loc[dfud['index'].isin(docid_list), ['index', 'words']].sort_values(by='words').values)
print(dfud.loc[dfud['index'] == 4140, 'conv'].values[0])
print('*************************************')
print(dfud.loc[dfud['index'] == 6584, 'conv'].values[0])


Dasda: anyone else have trouble viewing ebooks in ubuntu?

Squideshi: In what format and with what application?

Dasda: exe and with ebookpro

Squideshi: The ebook is in EXE format?

Dasda: yes, thats why i wouldnt mind it if i could even get the exe to unpack but cant even do that

Squideshi: EXE files are usually for Microsoft operating systems. Where did you get the file?

Dasda: vgsports

Dasda: Squdeshi i can give u the link for download. it is only 1.8megs

Squideshi: Don't give me the link for download. Give me the link for the webpage that has the link for download.

Dasda: i message it to you cause I was unsure if we are allowed to post links here


*************************************
aanonymouss: In order to run VNC (like vino, remote desktop) do I need to have a video card installed?  I booted up w/o vid card and I can ssh into the box but cannot vnc in and it seems like x11 applications aren't running

fryguy: you'll need to run XvFB or something

fryguy: or forward throu

In [90]:
# Look for a suitably sized enron email
docid_list = [docmap['enron email'][docid] for docid in docmap['enron email']]
#print(dfee.loc[dfee['index'].isin(docid_list), ['index', 'words']].sort_values(by='words').values)
print(dfee.loc[dfee['index'] == 995, 'conv'].values[0])
print('*************************************')
print(dfee.loc[dfee['index'] == 4128, 'conv'].values[0])


Jacques,

Still trying to close the loop on the $15,000 of extensions.  Assuming that 
it is worked out today or tomorrow, I would like to get whatever documents 
need to be
completed to convey the partnership done.  I need to work with the engineer 
and architect to get things moving.  I am planning on  writing a personal 
check to the engineer while I am setting up new accounts.  Let me know if 
there is a reason I should not do this.

Thanks for all your help so far.  Between your connections and expertise in 
structuring the loan, you saved us from getting into a bad deal.

Phillip
*************************************
Last night

Lady, c'mon...you're just one of the guys!  Wanna go to Treasures tonight?


From: Margaret Allen@ENRON on 10/13/2000 08:39 AM
To: John Arnold/HOU/ECT@ECT
cc:  
Subject: Last night

Hey Buster John,

Despite the X's you received last night for your ill behavior, I wanted to 
thank you for dinner because I had a great time.  Although, I do take 
personal o

In [91]:
# Selected mturk docs based on word count
mturk_docs = {
    'cc': [2276, 2504],
    'tc': [116, 808],
    'ud': [4140, 6584],
    'ee': [995, 4128]
}

# Select rows
display(dfr.head(1))
print()
print(dfr.loc[
      (dfr['Model'] == 'word2vec') &
      (dfr['Model_params'].astype('string') == "{'embeddings': 'glove', 'vector_size': 200, 'min_count': 1, 'cluster_alg': 'kmeans', 'n_clusters': 15, 'max_iter': 300, 'tol': 0.0001}") & 
      (dfr['Spell_checked'] == True) &
      (dfr['Text_speak'] == True) &
      (dfr['Synonyms'] == True) & 
      (dfr['Hypernyms'] == True),
      ['Dataset', 'index']])
print()
mturk_rows = [16, 19, 288, 292, 96, 99, 112, 115, 40, 43, 296, 300, 144, 147, 160, 163, 64, 67, 304, 308, 192, 195, 208, 211, 88, 91, 312, 316, 240, 243, 256, 259]
no_flan = [292, 300, 308, 316]  # no need to run flan on these because it was processed with keybert
mturk_rows2 = [324, 330, 336, 342, 344, 348, 356, 360, 368, 372, 380, 384, 392, 394, 396, 398]  # with keyphrase extraction

# See if any selected rows only have a single topic
print('mturk rows with only one topic:')
print(dfr.loc[(dfr.index.isin(mturk_rows)) & (dfr['Num_topics'] == 1), ['Num_topics', 'Doc_topics']])
print('mturk2 rows with only one topic:')
print(dfr.loc[(dfr.index.isin(mturk_rows2)) & (dfr['Num_topics'] == 1), ['Num_topics', 'Doc_topics']])
print()

# Average word count
wordct = 0
docct = 0
for dataset in mturk_docs.keys():
    
    # Get the document map for this dataset
    print(f"dataset {dataset}")
    ds_docmap = docmap[datasetmap[dataset]]  # This is the docmap for this dataset

    # Get handle to the dataframe
    df = {'cc': dfcc, 'tc': dftc, 'ud': dfud, 'ee': dfee}[dataset]
    #display(df.head())

    # Iterate over each doc in the dataset
    for docid in mturk_docs[dataset]:

        # Get the document number from the docmap
        print(f"\tdocid {docid}")
        #print(ds_docmap)
        #print([docnum for docnum in ds_docmap.keys() if ds_docmap[docnum] == docid])
        docnum = [docnum for docnum in ds_docmap.keys() if ds_docmap[docnum] == docid][0]

        # Get conversation text
        conv = df.loc[df['index'] == docid, 'conv'].values[0]
        conv = conv.replace('\n\n', '\n')

        # Counts
        words = len(re.split(r'\W', conv))
        print(f"\t\twords {words}")
        wordct += words
        docct += 1

# Summary
print(f"docs: {docct}")
print(f"words: {wordct}")
print(f"words/doc: {wordct / docct}")
print()


Unnamed: 0,index,Dataset,Num_docs,Rnd_seed,Model,Num_topics,Model_params,Cv_score,Cuci_score,Cnpmi_score,...,Runtime,Timestamp,Topic_words,Doc_topics,Flan_topic,Cosine_similarity,Model_family,Topic_words2,Flan_topic2,Keyphrases
0,0,chitchat,250,77,lsi,8,{'num_topics': 15},0.256103,-12.5412,-0.44286,...,77.495494,20240401_133320,"[(0, [('tear', 0.1492861747332724), ('eye', 0....","{0: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,...","[[0, ""People""], [1, ""People""], [2, ""Science/Te...","{""0"": [[0, 0.19291281700134277], [1, 0.0759510...",bow,"[[0, [[""job"", 0], [""work"", 0], [""eros"", 0], [""...","[[0, ""eros""], [1, ""eros""], [2, ""Brazil""], [3, ...",False



             Dataset  index
115         chitchat    115
163     topical chat    163
211  ubuntu dialogue    211
259      enron email    259

mturk rows with only one topic:
Empty DataFrame
Columns: [Num_topics, Doc_topics]
Index: []
mturk2 rows with only one topic:
Empty DataFrame
Columns: [Num_topics, Doc_topics]
Index: []

dataset cc
	docid 2276
		words 290
	docid 2504
		words 357
dataset tc
	docid 116
		words 384
	docid 808
		words 473
dataset ud
	docid 4140
		words 143
	docid 6584
		words 217
dataset ee
	docid 995
		words 131
	docid 4128
		words 134
docs: 8
words: 2129
words/doc: 266.125
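
The word counts above come from `len(re.split(r'\W', conv))`, which splits on every single non-word character. Consecutive separators (a comma followed by a space, for example) therefore produce empty strings that are counted too, so the figures run somewhat higher than a count of actual tokens. A minimal sketch of the difference, using a made-up sentence; `count_tokens` is a hypothetical helper and not part of the notebook:

```python
import re

def count_split(text):
    """Count pieces the way the cell above does: split on each non-word character."""
    return len(re.split(r'\W', text))

def count_tokens(text):
    """Hypothetical alternative: count runs of word characters only."""
    return len(re.findall(r'\w+', text))

sample = "Hello, world! How are you?"  # made-up text, not from any dataset
print(count_split(sample), count_tokens(sample))
```

Either measure works for picking conversations of comparable length, as long as the same one is used throughout.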



In [92]:
# Prework for second round of mturk

# Look at what I chose before for rows
print(mturk_rows)
print(no_flan)


[16, 19, 288, 292, 96, 99, 112, 115, 40, 43, 296, 300, 144, 147, 160, 163, 64, 67, 304, 308, 192, 195, 208, 211, 88, 91, 312, 316, 240, 243, 256, 259]
[292, 300, 308, 316]


In [93]:
# Generate mturk docs

# Set round #
#     first round was without keyphrase extraction
#     second round was with keyphrase extraction
round = 2
which_rows = mturk_rows2 if round == 2 else mturk_rows

# Iterate over each doc; mturk_docs is keyed on dataset, with each dataset containing an array of docs
for dataset in mturk_docs.keys():

    # Get the document map for this dataset
    print(f"dataset {dataset}")
    ds_docmap = docmap[datasetmap[dataset]]  # This is the docmap for this dataset

    # Get handle to the dataframe
    df = {'cc': dfcc, 'tc': dftc, 'ud': dfud, 'ee': dfee}[dataset]
    #display(df.head())

    # Iterate over each doc in the dataset
    for docid in mturk_docs[dataset]:

        # Get the document number from the docmap
        print(f"\tdocid {docid}")
        #print(ds_docmap)
        #print([docnum for docnum in ds_docmap.keys() if ds_docmap[docnum] == docid])
        docnum = [docnum for docnum in ds_docmap.keys() if ds_docmap[docnum] == docid][0]

        # Get conversation text
        conv = df.loc[df['index'] == docid, 'conv'].values[0]
        conv = conv.replace('\n\n', '\n')
        o = f"dataset={dataset}, docnum={docnum}, docid={docid}, wordct={len(re.split(r'\W', conv))}\n\n"
        o += '[--- conv start ---]\n'
        o += conv + '\n'
        o += '[--- conv end ---]\n\n\n'

        # Get human-labeled keywords (only needed in round 1)
        if round == 1:
            topic = dfkw.loc[(dfkw['Dataset'] == datasetmap[dataset]) & (dfkw['Doc_id'] == docid), 'Topic_words'].values[0]
            topic = topic.replace(',', ', ')
            o += "[--- row start ---]\nrowid 9999, Human_keywords\n"
            o += f"[--- topic start ---]{topic}[--- topic end ---]\n[--- row end ---]\n\n"
            #print(o)

        # Get human-labeled friendly topic (only needed in round 1)
        if round == 1:
            topic = dfkw.loc[(dfkw['Dataset'] == datasetmap[dataset]) & (dfkw['Doc_id'] == docid), 'Human_topic'].values[0]
            topic = topic.replace(',', ', ')
            o += "[--- row start ---]\nrowid 9999, Human_friendly_topic\n"
            o += f"[--- topic start ---]{topic}[--- topic end ---]\n[--- row end ---]\n\n"
            #print(o)

        # Iterate over result rows
        for i, row in dfr.loc[dfr.index.isin(which_rows)].iterrows():

            # Make sure this row corresponds to the dataset we're currently working on
            if row['Dataset'] != datasetmap[dataset]:
                continue

            # Get topic number
            print(f"\t\trowid {i}")
            all_topics = row['Doc_topics']
            # bertopic Doc_topics will be a list of dicts, each with a single entry keyed on the topic number
            if isinstance(all_topics, list):
                tmp = {}
                for d in all_topics:
                    for k in d.keys():
                        tmp[k] = d[k]
                all_topics = tmp
            topicnum = [topicnum for topicnum in all_topics.keys() if docnum in all_topics[topicnum]][0]
            #print(topicnum)

            # Get topic representations
            # Do this three times: once without flan, once with flan, and once with enhanced flan;
            # enhanced flan is when I ran the resulting topic representations from the existing model
            # through the synonym and hypernym functions
            for colname in ['Topic_words', 'Flan_topic', 'Flan_topic2']:

                # Skip this row if it's flan and in the no_flan list; these are rows that used keybert topic representations,
                # so it will only have a single topic word; it would be stupid to generate a flan representation for a single
                # word, because it would just *be* that word!
                #if (colname == 'Flan_topic' or colname == 'Flan_topic2') and row['index'] in no_flan:
                #    continue

                # Get topics
                all_topics = row[colname]
                if colname == 'Topic_words':
                    topic = [word[0] for word in [topic_tuple[1][:num_topic_words] for topic_tuple in all_topics if topic_tuple[0]==topicnum][0] if len(word[0])>0]
                    topic = ', '.join(topic)
                elif colname == 'Flan_topic' or colname == 'Flan_topic2':
                    all_topics = json.loads(all_topics)
                    topic = [topic[1] for topic in all_topics if topic[0] == topicnum][0]
                otmp = f"[--- row start ---]\nrowid {i}, {colname}\n"
                otmp += f"[--- topic start ---]{topic}[--- topic end ---]\n[--- row end ---]\n\n"
                o += otmp
                #print(otmp)

        # Write to output file
        with open(f"{mturk_dir}round{round}/{dataset}_{docid}.txt", 'w') as fh:
            fh.write(o)
    

dataset cc
	docid 2276
		rowid 324
		rowid 344
		rowid 348
		rowid 392
	docid 2504
		rowid 324
		rowid 344
		rowid 348
		rowid 392
dataset tc
	docid 116
		rowid 330
		rowid 356
		rowid 360
		rowid 394
	docid 808
		rowid 330
		rowid 356
		rowid 360
		rowid 394
dataset ud
	docid 4140
		rowid 336
		rowid 368
		rowid 372
		rowid 396
	docid 6584
		rowid 336
		rowid 368
		rowid 372
		rowid 396
dataset ee
	docid 995
		rowid 342
		rowid 380
		rowid 384
		rowid 398
	docid 4128
		rowid 342
		rowid 380
		rowid 384
		rowid 398


In [None]:
# This is a check of topic numbers across the various columns to make sure they match up

#display(dfr[(dfr.index>=100) & (dfr.index<=111)])

for i, row in dfr.iterrows():

    if i >= 97 and i <=319:
        print(row['index'], row['Dataset'], row['Model'], row['Num_topics'])
        print(row['Model_params'])

        nt1 = row['Num_topics']
        nt2 = len([topic for topic in row['Topic_words']])
        nt3 = len(row['Doc_topics'])
        nt4 = len(json.loads(row['Flan_topic']))
        print(row['index'], nt1, nt2, nt3, nt4)

        print()


In [95]:
# This section creates a CSV to upload to Mechanical Turk

# Set round #
#     first round was without keyphrase extraction
#     second round was with keyphrase extraction
round = 2

# Set whether this is the file to send to AWS or the one for my reference (which will have row IDs, etc)
aws_version = True

# Init result
d = {'conversation': [], 'topic': [], 'rowid': [], 'setid': [], 'dataset': [], 'docnum': [], 'docid': [], 'wordct': []}

# Iterate through files in dir
for f in os.listdir(f"{mturk_dir}round{round}"):

    # Read the file
    print(f)
    with open(f"{mturk_dir}round{round}/{f}", 'r') as fh:
        s = fh.read()

    # Read first row to get doc params - looks like this: dataset=cc, docnum=12, docid=2276, wordct=290
    m = re.findall(r'^(.+?)\r?\n', s)
    if len(m) != 1:
        print('\tstop!')
        break

    # Get doc params
    mm = re.findall(r'dataset=(.+?), docnum=(\d+), docid=(\d+), wordct=(\d+)', m[0])
    if len(mm) == 0 or len(mm[0]) != 4:
        print('\tno doc params!')
        break
    dataset = mm[0][0]
    docnum = mm[0][1]
    docid = mm[0][2]
    wordct = mm[0][3]

    # Get conversation
    conv = ''
    m = re.findall(r'\[--- conv start ---\]\r?\n(.+?)\r?\n\[--- conv end ---\]', s, re.DOTALL)
    if len(m) == 0:
        print('\tzero-length conversation!')
        break
    conv = m[0]
    #conv = html.escape(conv).encode('ascii', 'xmlcharrefreplace')
    conv = html.escape(conv)
    conv = re.sub(r'\r?\n', '<br />', conv)
    #print(conv)
    #print()

    # Read rows
    m = re.findall(r'\[--- row start ---\]\r?\n(.+?)\r?\n\[--- row end ---\]', s, re.DOTALL)
    if len(m) == 0:
        print('\tno rows!')
        break

    # Iterate over topic rows
    for row in m:

        mm = re.findall(r'^rowid (\d+), (.+?)\r?\n\[--- topic start ---\]', row)
        rowid = mm[0][0]
        setid = mm[0][1]

        # Read topics
        mm = re.findall(r'\[--- topic start ---\](.+?)\[--- topic end ---\]', row)
        if len(mm) == 0:
            print('\tno topics!')
            continue

        # Append topic to conversation
        topic = mm[0]
        #conv_to_add = conv + '<br /><br />Topic:&nbsp;' + html.escape(topic) + '<br /><br />'  # Uncomment to add the topic to the end of the conversation
        conv_to_add = conv + '<br />'

        # Add topic to result
        #print(topic)
        d['conversation'].append(conv_to_add)
        d['topic'].append(topic)
        d['rowid'].append(rowid)
        d['setid'].append(setid)
        d['dataset'].append(dataset)
        d['docnum'].append(docnum)
        d['docid'].append(docid)
        d['wordct'].append(wordct)

# Convert result dict to df and write to csv
df = pd.DataFrame(d)
display(df)
if aws_version:
    df[['conversation', 'topic']].to_csv(f'C:/Users/micha/Box Sync/cuny/698-Capstone/mturk_upload_round{round}.csv', index=False, quoting=csv.QUOTE_ALL)
else:
    df.to_csv(f'C:/Users/micha/Box Sync/cuny/698-Capstone/mturk_upload_round{round}_full.csv', index=False, quoting=csv.QUOTE_ALL)


cc_2276.txt
cc_2504.txt
	no topics!
ee_4128.txt
	no topics!
	no topics!
	no topics!
ee_995.txt
	no topics!
	no topics!
	no topics!
tc_116.txt
tc_808.txt
ud_4140.txt
ud_6584.txt


Unnamed: 0,conversation,topic,rowid,setid,dataset,docnum,docid,wordct
0,UserB: Traveling the world<br />UserA: I would...,"weddings, engagements, Europe, China, Korea, i...",324,Topic_words,cc,12,2276,290
1,UserB: Traveling the world<br />UserA: I would...,weddings,324,Flan_topic,cc,12,2276,290
2,UserB: Traveling the world<br />UserA: I would...,World,324,Flan_topic2,cc,12,2276,290
3,UserB: Traveling the world<br />UserA: I would...,"China, India, Africa, Europe, Germany, Greece,...",344,Topic_words,cc,12,2276,290
4,UserB: Traveling the world<br />UserA: I would...,World,344,Flan_topic,cc,12,2276,290
...,...,...,...,...,...,...,...,...
84,"aanonymouss: In order to run VNC (like vino, r...",Science/Tech,372,Flan_topic,ud,249,6584,217
85,"aanonymouss: In order to run VNC (like vino, r...",film industry,372,Flan_topic2,ud,249,6584,217
86,"aanonymouss: In order to run VNC (like vino, r...","ubuntu, linux, package, firebox, manager, grub...",396,Topic_words,ud,249,6584,217
87,"aanonymouss: In order to run VNC (like vino, r...",Linux,396,Flan_topic,ud,249,6584,217
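
The per-document text files use bracketed markers (`[--- conv start ---]`, `[--- row start ---]`, and so on) that the cell above reads back with regular expressions. The sketch below builds one made-up document in that layout and parses it with the same patterns; the conversation and topic text are invented for illustration.

```python
import html
import re

# One document in the marker format written by the generation cell (made-up content)
doc = (
    "dataset=cc, docnum=12, docid=2276, wordct=290\n\n"
    "[--- conv start ---]\n"
    "UserA: Tom & Jerry?\nUserB: <yes>\n"
    "[--- conv end ---]\n\n\n"
    "[--- row start ---]\nrowid 324, Topic_words\n"
    "[--- topic start ---]travel, world[--- topic end ---]\n[--- row end ---]\n\n"
)

# Header line: document parameters
params = re.search(r'dataset=(.+?), docnum=(\d+), docid=(\d+), wordct=(\d+)', doc).groups()

# Conversation body, HTML-escaped with newlines converted to <br />
conv = re.search(r'\[--- conv start ---\]\r?\n(.+?)\r?\n\[--- conv end ---\]', doc, re.DOTALL).group(1)
conv_html = re.sub(r'\r?\n', '<br />', html.escape(conv))

# Topic rows: (rowid, column name, topic text)
rows = re.findall(
    r'\[--- row start ---\]\r?\nrowid (\d+), (.+?)\r?\n'
    r'\[--- topic start ---\](.+?)\[--- topic end ---\]',
    doc,
)
print(params, conv_html, rows)
```

Because `(.+?)` needs at least one character, a row whose topic text is empty does not match the topic pattern, which is one way the `no topics!` message in the output above can arise.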


In [96]:
# Function to get the topic number from a list or dict of topics
def get_topic_num(all_topics, docnum):

    # all_topics can either look like this:
    # {0: [2, 10, 24, 25, 55, 66, 74, 80, 83, 92, 10...
    # or like this:
    # [{-1: [0, 1, 2, 3, 4, 5, 10, 11, 16, 17, 18, 2...
    
    # See if it's a list or a dict; bertopic Doc_topics will be a list of dicts,
    # each with a single entry keyed on the topic number
    if isinstance(all_topics, list):
        tmp = {}
        for d in all_topics:
            for k in d.keys():
                tmp[k] = d[k]
        all_topics = tmp
    topicnum = [topicnum for topicnum in all_topics.keys() if docnum in all_topics[topicnum]][0]
    return topicnum
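
As a quick illustration of the two `Doc_topics` shapes that `get_topic_num` handles, here is a minimal sketch with made-up topic assignments. `flatten_doc_topics` and `topic_of` are hypothetical names used only for this example.

```python
def flatten_doc_topics(all_topics):
    """Return a single {topic number: [document numbers]} dict for either input shape."""
    if isinstance(all_topics, list):
        merged = {}
        for entry in all_topics:  # each entry is a one-key dict
            merged.update(entry)
        return merged
    return all_topics

def topic_of(all_topics, docnum):
    """Return the first topic number whose document list contains docnum."""
    flat = flatten_doc_topics(all_topics)
    return next(t for t, docs in flat.items() if docnum in docs)

dict_shape = {0: [2, 10, 24], 1: [3, 5]}        # made-up dict form
list_shape = [{-1: [0, 1, 2]}, {0: [3, 5]}]     # made-up bertopic-style list form

print(topic_of(dict_shape, 10), topic_of(list_shape, 1))
```

Note that `next` raises `StopIteration` when the document is in no topic, whereas the list-indexing form in `get_topic_num` raises `IndexError`; either way, a missing document is an error the caller has to handle.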


### Hand-Label Relevance Dataset

This section prompts me with a conversation and a randomly selected topic representation that I can hand-label in preparation for topic relevance modeling.


In [97]:
# Function to get topic representations
def get_topic_repr(all_topics, colname, topicid):

    # See if we're looking for the standard topic words or the flan topic word
    if colname == 'Topic_words':
        topic = [word[0] for word in [topic_tuple[1][:num_topic_words] for topic_tuple in all_topics if topic_tuple[0] == topicid][0]]
        topic = ', '.join(topic)
    elif colname == 'Flan_topic' or colname == 'Flan_topic2':
        all_topics = json.loads(all_topics)
        try:
            topic = [topic[1] for topic in all_topics if topic[0] == topicid][0]
        except Exception as ex:
            print(ex)
            print(colname, topicid)
            print(all_topics)
            topic = ''

    # Return
    if isinstance(topic, list) and len(topic) == 0:
        topic = ''
    return topic
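
The two column formats that `get_topic_repr` reads can be seen in the `dfr` preview earlier: `Topic_words` holds a list of `(topic number, [(word, weight), ...])` tuples, while `Flan_topic` and `Flan_topic2` hold a JSON string of `[topic number, label]` pairs. The sketch below mimics both with made-up words and weights; `top_words`, `flan_label`, and the value of `num_topic_words` are assumptions for illustration only.

```python
import json

num_topic_words = 3  # assumed value; the notebook sets this parameter elsewhere

# Made-up data in the same shapes as the Topic_words and Flan_topic columns
topic_words = [(0, [('tear', 0.15), ('eye', 0.12), ('cry', 0.10), ('sad', 0.05)]),
               (1, [('job', 0.20), ('work', 0.18)])]
flan_topic = json.dumps([[0, "People"], [1, "Science/Tech"]])

def top_words(all_topics, topicid, n):
    """Join the first n words of the requested topic into a comma-separated string."""
    pairs = next(pairs for num, pairs in all_topics if num == topicid)
    return ', '.join(word for word, _ in pairs[:n])

def flan_label(all_topics_json, topicid):
    """Look up the single-label representation; return '' when the topic is absent."""
    labels = {num: label for num, label in json.loads(all_topics_json)}
    return labels.get(topicid, '')

print(top_words(topic_words, 0, num_topic_words))
print(flan_label(flan_topic, 1))
```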


In [98]:
# Init resultset for hand-labeling of topic relevance
d = {}
with open('relevance.json', 'r') as fh:
    d = json.load(fh)
print(d)


{'ee': {'70': {'72': {'Topic_words': 3, 'Flan_topic': 1, 'Flan_topic2': 1}}, '235': {'242': {'Topic_words': 1, 'Flan_topic': 1, 'Flan_topic2': 3}, '75': {'Topic_words': 1, 'Flan_topic': 1, 'Flan_topic2': 1}, '262': {'Topic_words': 1, 'Flan_topic': 2, 'Flan_topic2': 1}, '278': {'Topic_words': 2, 'Flan_topic': 1, 'Flan_topic2': 1}, '281': {'Topic_words': 1, 'Flan_topic': 1, 'Flan_topic2': 1}, '252': {'Topic_words': 1, 'Flan_topic': 3, 'Flan_topic2': 3}, '253': {'Topic_words': 1, 'Flan_topic': 1, 'Flan_topic2': 1}}, '90': {'254': {'Topic_words': 1, 'Flan_topic': 1, 'Flan_topic2': 1}, '248': {'Topic_words': 1, 'Flan_topic': 1, 'Flan_topic2': 1}, '256': {'Topic_words': 2, 'Flan_topic': 1, 'Flan_topic2': 1}, '280': {'Topic_words': 1, 'Flan_topic': 1, 'Flan_topic2': 1}, '245': {'Topic_words': 2, 'Flan_topic': 2, 'Flan_topic2': 2}, '313': {'Topic_words': 2, 'Flan_topic': 2, 'Flan_topic2': 1}, '269': {'Topic_words': 1, 'Flan_topic': 1, 'Flan_topic2': 1}, '275': {'Topic_words': 1, 'Flan_topic': 

In [None]:
# This section prompts me to label a conversation as relevant or not

# Iterate until I get sick of it
stop = False
while not stop:

    # Select document at random
    ds = random.sample(list(datasetmap.keys()), 1)[0]
    docnum = int(random.randint(0, 249))
    docid = int(docmap[datasetmap[ds]][docnum])

    # Temp
    if ds != 'ee': continue

    # Get handle to dataframe
    dfconv = {'cc': dfcc, 'tc': dftc, 'ud': dfud, 'ee': dfee}[ds]
    conv = dfconv.loc[docid, 'conv']

    # Skip long conversations (the number of non-word characters is used as a rough proxy for word count)
    wordct = len(re.findall(r'\W', conv))
    if wordct > 500:
        continue

    # Do multiple topic sets with the same document to be more efficient
    i = 0
    while i < 8:

        # Select row at random
        #row = dfr.sample(1)
        row = dfr[dfr['index']>=320].sample(1)  # only rows with keyphrase extraction
        #rowid = int(row.index[0])
        rowid = int(row['index'].iloc[0])  # row is a one-row DataFrame, so take the scalar explicitly

        # Make sure this row corresponds to the dataset
        if row['Dataset'].values[0] != datasetmap[ds]:
            continue

        # Get topic number of this docnum
        topicid = int(get_topic_num(row['Doc_topics'].values[0], docnum))
        
        # Get topic representations for each topic type
        print(f"row {rowid}, topicid {topicid}")
        for colname in ['Topic_words', 'Flan_topic', 'Flan_topic2']:
    
            # Get topic representation
            print(f"before topicid {topicid}")
            topic = get_topic_repr(row[colname].values[0], colname, topicid)
            print(f"after topicid {topicid}")

            # Print
            if colname == 'Topic_words':
                clear_output()
                print("**************************************************************************************************")
                print(f"dataset {ds}, docnum {docnum}, docid {docid}, topicid {topicid}")
                print()
                print(conv)
                print()

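            # Make sure we haven't recorded this answer yet; keys loaded from relevance.json are strings,
            # while keys added during this session are ints, so check both forms
            if (colname in d.get(ds, {}).get(docnum, {}).get(rowid, {}) or
                    colname in d.get(ds, {}).get(str(docnum), {}).get(str(rowid), {})):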
                print("already done:")
                print(f"dataset {ds}, docnum {docnum}, docid {docid}, topicid {topicid}")
                print(f"row {rowid}, column {colname}, topic {topicid}: {topic}")
                print()
                continue
    
            # Always print out the topic
            print(f"row {rowid}, column {colname}, topic {topicid}:\n{topic}")
            print()

            # Get user input
            time.sleep(0.25)
            yn = input("Relevance? (1-5/c/s)")
            if yn == 'c':  # cancel
                stop = True
                break

            # See if we're skipping this one
            if yn != 's':  # skip

                # Update dict
                if ds not in d.keys(): d[ds] = {}
                if docnum not in d[ds].keys(): d[ds][docnum] = {}
                if rowid not in d[ds][docnum].keys(): d[ds][docnum][rowid] = {}
                #if colname not in d[ds][docnum][rowid].keys():
                d[ds][docnum][rowid][colname] = int(yn)

        # Increment counter
        i += 1
        
        if stop:
            break
    
print(d)
with open('relevance.json', 'w') as fh:
    fh.write(json.dumps(d))


In [None]:
# Convert int64s to ints
dnew = {}
for ds in d.keys():
    #print(f"dataset {ds}")
    if ds not in dnew.keys(): dnew[ds] = {}
    for docnum in d[ds].keys():
        #print(f"\tdocnum {docnum}, {type(docnum)}")
        newdocnum = int(docnum)
        if newdocnum not in dnew[ds].keys(): dnew[ds][newdocnum] = {}
        for rowid in d[ds][docnum].keys():
            #print(f"\t\trowid {rowid}")
            newrowid = int(rowid)
            if newrowid not in dnew[ds][newdocnum].keys(): dnew[ds][newdocnum][newrowid] = {}
            for colname in d[ds][docnum][rowid].keys():
                #print(f"\t\t\tcolname {colname}")
                # Cast the stored result to a plain int
                dnew[ds][newdocnum][newrowid][colname] = int(d[ds][docnum][rowid][colname])

print(json.dumps(dnew))

with open('relevance_new.json', 'w') as fh:
    fh.write(json.dumps(dnew))
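
One detail worth keeping in mind when reloading these files: JSON object keys are always strings, so a dict written with integer document and row numbers comes back with string keys. The sketch below shows the round trip on a made-up entry; `has_label` is a hypothetical helper, not a function from this notebook.

```python
import json

# Int keys, as they exist in memory during a labeling session (made-up entry)
labels = {'ee': {70: {72: {'Topic_words': 3}}}}

# What a later session gets back after writing to disk and reloading
reloaded = json.loads(json.dumps(labels))

print(70 in labels['ee'], 70 in reloaded['ee'], '70' in reloaded['ee'])

def has_label(d, ds, docnum, rowid, colname):
    """Hypothetical helper: membership test that always looks up string keys."""
    return colname in d.get(ds, {}).get(str(docnum), {}).get(str(rowid), {})

print(has_label(reloaded, 'ee', 70, 72, 'Topic_words'))
```

Normalizing keys in one place (always `str` on disk, always `int` in memory, or a helper like the one above) avoids silent misses when checking whether a label already exists.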


In [None]:
# This section adds the hand-labeled quality result to the relevance df

# Walk resultset
ct = 0
cty = 0
ctn = 0
d2 = {'dataset': [], 'docnum': [], 'rowid': [], 'colname': [], 'result': []}
for ds in d.keys():
    #print(f"dataset {ds}")
    for docnum in d[ds].keys():
        #print(f"\tdocnum {docnum}")
        newdocnum = int(docnum)
        for rowid in d[ds][docnum].keys():
            #print(f"\t\trowid {rowid}")
            newrowid = int(rowid)
            for colname in d[ds][docnum][rowid].keys():
                #print(f"\t\t\tcolname {colname}")
                if ds in d.keys() and docnum in d[ds].keys() and rowid in d[ds][docnum].keys() and colname in d[ds][docnum][rowid].keys():
                    result = int(d[ds][docnum][rowid][colname])
                    #print(f"result: {result} (dataset {ds}, docnum {docnum}, rowid {rowid}, colname {colname})")
                    d2['dataset'].append(ds)
                    d2['docnum'].append(newdocnum)
                    d2['rowid'].append(newrowid)
                    d2['colname'].append(colname)
                    d2['result'].append(result)
                    ct += 1
                    if result > 2:
                        cty += 1
                    else:
                        ctn += 1

# Print summary
print()
print(f"ct={ct}, cty={cty}, ctn={ctn}")
print()

print(d2)

# Create dataframe
dfrel = pd.DataFrame(d2)
display(dfrel)
timestamp = str(pd.Timestamp.now())[:19].replace('-', '').replace(' ', '_').replace(':', '')
dfrel.to_pickle('C:\\tmp\\pickles\\relevance\\dfrel_' + timestamp + '.pkl')
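
The walk over the nested relevance dict can also be written as a single comprehension that yields one record per label. This is a sketch on a made-up entry, not the notebook's own code; it also shows a `strftime` call that produces the same `YYYYMMDD_HHMMSS` shape as the string-slicing expression used for the pickle name.

```python
from datetime import datetime

# Made-up excerpt in the same nesting as relevance.json: dataset -> docnum -> rowid -> column -> score
nested = {'ee': {'70': {'72': {'Topic_words': 3, 'Flan_topic': 1}}}}

rows = [
    {'dataset': ds, 'docnum': int(docnum), 'rowid': int(rowid),
     'colname': colname, 'result': int(result)}
    for ds, docs in nested.items()
    for docnum, rowsets in docs.items()
    for rowid, cols in rowsets.items()
    for colname, result in cols.items()
]

# Timestamp in the form 20240401_133320
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
print(rows, timestamp)
```

A list of dicts like `rows` can be passed directly to `pd.DataFrame(rows)` to get the same columns as `dfrel`.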


In [107]:
# This was used for troubleshooting flan topics - can be ignored now that it is fixed
"""
for i in range(104, 319):
    clear_output(wait=True)
    print(i, dfr.loc[dfr.index == i, 'Dataset'].values)
    print()
    print(dfr.loc[dfr.index == i, 'Doc_topics'].values)
    print()
    print(dfr.loc[dfr.index == i, 'Flan_topic'].values)
    print()
    print(dfr.loc[dfr.index == i, 'Flan_topic2'].values)
    print()
    print(dfr.loc[dfr.index == i, 'Topic_words'].values[0])
    print()
    time.sleep(0.25)
    input('here')
"""



In [108]:
# This section adds irrelevant documents to the keyword dataframe;
# not needed anymore because it's all done.

"""
# Init dict to store new irrelevant doc data
dirr = {'Dataset': [], 'Doc_id': [], 'Snippet': [], 'Topic_words': [], 'Human_topic': [], 'Relevant': []}

# Iterate over keyword df
for i, row in dfkw.iterrows():

    dirr['Dataset'].append(row['Dataset'])
    dirr['Doc_id'].append(row['irr_id1'])
    dirr['Snippet'].append('')
    dirr['Topic_words'].append(row['Topic_words'])
    dirr['Human_topic'].append('')
    dirr['Relevant'].append(0)

dftmp = pd.DataFrame(dirr)
dftmp.to_csv('tmpkw.csv', index=False)
"""



### Parse Survey Results

This section reads the results of the Mechanical Turk survey and incorporates them into the hand-labeled relevance dataframe. Run the code in survey2.ipynb before proceeding with this section.


In [110]:
# Load survey results - these were saved in survey2.ipynb
dfsr = pd.read_pickle(f"{capstone_dir}/mturk_results.pkl")
print(dfsr.shape)
display(dfsr.head(1))
print(list(dfsr.columns))


(891, 40)


Unnamed: 0,index,HITId,HITTypeId,Title,Description,Keywords,Reward,CreationTime,MaxAssignments,RequesterAnnotation,...,Answer.relevance.label,Approve,Reject,rowid,setid,dataset,docnum,docid,wordct,quality
0,1,3XEIP58NMJ7N4WYIKUQFPJFMW7SZLF,3U2Q8YJASS9X6MVGG7Q77MB02CJT5L,How relevant is the conversation to the given ...,How relevant is the conversation to the given ...,"text, conversation, relevance",$0.22,Thu Apr 18 11:35:34 PDT 2024,4,BatchId:5211639;OriginalHitTemplateId:928390850;,...,4,,,9999,Human_keywords,cc,12.0,2276.0,290.0,good


['index', 'HITId', 'HITTypeId', 'Title', 'Description', 'Keywords', 'Reward', 'CreationTime', 'MaxAssignments', 'RequesterAnnotation', 'AssignmentDurationInSeconds', 'AutoApprovalDelayInSeconds', 'Expiration', 'NumberOfSimilarHITs', 'LifetimeInSeconds', 'AssignmentId', 'WorkerId', 'AssignmentStatus', 'AcceptTime', 'SubmitTime', 'AutoApprovalTime', 'ApprovalTime', 'RejectionTime', 'RequesterFeedback', 'WorkTimeInSeconds', 'LifetimeApprovalRate', 'Last30DaysApprovalRate', 'Last7DaysApprovalRate', 'Input.conversation', 'Input.topic', 'Answer.relevance.label', 'Approve', 'Reject', 'rowid', 'setid', 'dataset', 'docnum', 'docid', 'wordct', 'quality']


In [111]:
# Load keywords
dfkw = pd.read_excel('C:/Users/micha/Box Sync/cuny/698-Capstone/keywords.xlsx', sheet_name='human')
print(dfkw.shape)
display(dfkw.head(1))


(168, 6)


Unnamed: 0,Dataset,Doc_id,Snippet,Topic_words,Human_topic,Relevant
0,chitchat,1035.0,i never turn mine off anyways updates would ju...,"browser,college,major,BYU,student,media,marketing",traveling,1


In [112]:
# Load relevance results
dfrel = load_last_pickle('relevance')
print(dfrel.shape)
display(dfrel.head(1))


(456, 5)


Unnamed: 0,dataset,docnum,rowid,colname,result
0,ee,70,72,Topic_words,3


In [113]:
# Load modeling results
dfr = load_last_pickle('sim')
print(dfr.shape)
display(dfr.head(1))


(400, 25)


Unnamed: 0,index,Dataset,Num_docs,Rnd_seed,Model,Num_topics,Model_params,Cv_score,Cuci_score,Cnpmi_score,...,Runtime,Timestamp,Topic_words,Doc_topics,Flan_topic,Cosine_similarity,Model_family,Topic_words2,Flan_topic2,Keyphrases
0,0,chitchat,250,77,lsi,8,{'num_topics': 15},0.256103,-12.5412,-0.44286,...,77.495494,20240401_133320,"[(0, [('tear', 0.1492861747332724), ('eye', 0....","{0: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,...","[[0, ""People""], [1, ""People""], [2, ""Science/Te...","{""0"": [[0, 0.19291281700134277], [1, 0.0759510...",bow,"[[0, [[""job"", 0], [""work"", 0], [""eros"", 0], [""...","[[0, ""eros""], [1, ""eros""], [2, ""Brazil""], [3, ...",False


In [None]:
# Read document map from disk - this maps a document number (0-249) to a document id within a specific dataset
with open(f"{capstone_dir}/docmap.json", 'r') as fh:
    docmap = json.load(fh)
datasetmap = {'cc': 'chitchat', 'tc': 'topical chat', 'ud': 'ubuntu dialogue', 'ee': 'enron email'}
print(docmap)
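
Several cells look up a document number from a document id by scanning the map with a list comprehension. Inverting the map once gives the same answer with a plain dict lookup. The sketch below uses a made-up excerpt of the map (only the pairing of docnum 12 with docid 2276 appears in the outputs above); `docnum_by_id` is a hypothetical name.

```python
# Made-up excerpt; the real map has 250 entries per dataset
docmap = {'chitchat': {'0': 1035, '1': 1334, '12': 2276}}
datasetmap = {'cc': 'chitchat'}

# Invert once so docid -> docnum is a dict lookup instead of a scan.
# int() covers the case where the map came from json.load and the keys are strings.
docnum_by_id = {docid: int(docnum) for docnum, docid in docmap[datasetmap['cc']].items()}

print(docnum_by_id[2276])
```

This assumes each document id appears only once per dataset; if ids could repeat, the inverted dict would keep only the last document number seen.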


In [115]:
# Data prep for semantic quality and relevance

# First look
print('relevance')
print(dfrel.shape)
display(dfrel.head(1))
print(dfrel.columns)
print()
print('survey results')
print(dfsr.shape)
display(dfsr.head(1))
print(dfsr.columns)
print()
print('keywords')
print(dfkw.shape)
display(dfkw.head(1))
print(dfkw.columns)
print()


relevance
(456, 5)


Unnamed: 0,dataset,docnum,rowid,colname,result
0,ee,70,72,Topic_words,3


Index(['dataset', 'docnum', 'rowid', 'colname', 'result'], dtype='object')

survey results
(891, 40)


Unnamed: 0,index,HITId,HITTypeId,Title,Description,Keywords,Reward,CreationTime,MaxAssignments,RequesterAnnotation,...,Answer.relevance.label,Approve,Reject,rowid,setid,dataset,docnum,docid,wordct,quality
0,1,3XEIP58NMJ7N4WYIKUQFPJFMW7SZLF,3U2Q8YJASS9X6MVGG7Q77MB02CJT5L,How relevant is the conversation to the given ...,How relevant is the conversation to the given ...,"text, conversation, relevance",$0.22,Thu Apr 18 11:35:34 PDT 2024,4,BatchId:5211639;OriginalHitTemplateId:928390850;,...,4,,,9999,Human_keywords,cc,12.0,2276.0,290.0,good


Index(['index', 'HITId', 'HITTypeId', 'Title', 'Description', 'Keywords',
       'Reward', 'CreationTime', 'MaxAssignments', 'RequesterAnnotation',
       'AssignmentDurationInSeconds', 'AutoApprovalDelayInSeconds',
       'Expiration', 'NumberOfSimilarHITs', 'LifetimeInSeconds',
       'AssignmentId', 'WorkerId', 'AssignmentStatus', 'AcceptTime',
       'SubmitTime', 'AutoApprovalTime', 'ApprovalTime', 'RejectionTime',
       'RequesterFeedback', 'WorkTimeInSeconds', 'LifetimeApprovalRate',
       'Last30DaysApprovalRate', 'Last7DaysApprovalRate', 'Input.conversation',
       'Input.topic', 'Answer.relevance.label', 'Approve', 'Reject', 'rowid',
       'setid', 'dataset', 'docnum', 'docid', 'wordct', 'quality'],
      dtype='object')

keywords
(168, 6)


Unnamed: 0,Dataset,Doc_id,Snippet,Topic_words,Human_topic,Relevant
0,chitchat,1035.0,i never turn mine off anyways updates would ju...,"browser,college,major,BYU,student,media,marketing",traveling,1


Index(['Dataset', 'Doc_id', 'Snippet', 'Topic_words', 'Human_topic',
       'Relevant'],
      dtype='object')



In [116]:
# This section creates a full labeled dataset by combining the hand-labeled relevance data with the survey result data

# Change column names so that we can merge the hand-labeled relevance df with the survey result df
dfrel.rename(columns={'dataset': 'Dataset', 'docnum': 'Doc_num', 'rowid': 'Row_id', 'colname': 'Set_id', 'result': 'Quality'}, inplace=True)
dfsr2 = dfsr[['dataset', 'docnum', 'docid', 'rowid', 'setid', 'Answer.relevance.label', 'Input.conversation', 'Input.topic']].copy()
dfsr2.rename(columns={'dataset': 'Dataset', 'docnum': 'Doc_num', 'docid': 'Doc_id', 'rowid': 'Row_id', 'setid': 'Set_id', \
                      'Answer.relevance.label': 'Quality', 'Input.conversation': 'Conv2', 'Input.topic': 'Topic_words2'}, inplace=True)

# Label result source as either hand-labeled or mturk
dfrel['Source'] = 'Author'
dfsr2['Source'] = 'MTurk'

# Look up the doc id for each row of the hand-labeled relevance df (the MTurk df already has the doc id)
for i, row in dfrel.iterrows():
    dfrel.at[i, 'Doc_id'] = docmap[datasetmap[row['Dataset']]][str(int(row['Doc_num']))]
print(dfrel.shape)
print(dfsr2.shape)

# Add temp columns to dfrel; this is just added as a sanity check to make sure the convos and topic words line up
dfrel['Conv2'] = pd.NA
dfrel['Topic_words2'] = pd.NA

# Combine the two dfs by stacking their rows (a concatenation, not a key-based merge)
dfrel2 = pd.concat([dfrel, dfsr2], ignore_index=True)

# Set relevant or not
dfrel2.loc[dfrel2['Quality'] > 2, 'Relevant'] = 1
dfrel2.loc[dfrel2['Quality'] < 3, 'Relevant'] = 0

# Verify
print(dfrel2.shape)
display(dfrel2.head())
display(dfrel2.tail())


(456, 7)
(891, 9)
(1347, 10)


Unnamed: 0,Dataset,Doc_num,Row_id,Set_id,Quality,Source,Doc_id,Conv2,Topic_words2,Relevant
0,ee,70.0,72,Topic_words,3,Author,287.0,,,1.0
1,ee,70.0,72,Flan_topic,1,Author,287.0,,,0.0
2,ee,70.0,72,Flan_topic2,1,Author,287.0,,,0.0
3,ee,235.0,242,Topic_words,1,Author,1881.0,,,0.0
4,ee,235.0,242,Flan_topic,1,Author,1881.0,,,0.0


Unnamed: 0,Dataset,Doc_num,Row_id,Set_id,Quality,Source,Doc_id,Conv2,Topic_words2,Relevant
1342,ud,249.0,396,Flan_topic,3,MTurk,6584.0,"aanonymouss: In order to run VNC (like vino, r...",Linux,1.0
1343,ud,249.0,396,Flan_topic,5,MTurk,6584.0,"aanonymouss: In order to run VNC (like vino, r...",Linux,1.0
1344,ud,249.0,396,Flan_topic2,3,MTurk,6584.0,"aanonymouss: In order to run VNC (like vino, r...",Science/Tech,1.0
1345,ud,249.0,396,Flan_topic2,4,MTurk,6584.0,"aanonymouss: In order to run VNC (like vino, r...",Science/Tech,1.0
1346,ud,249.0,396,Flan_topic2,4,MTurk,6584.0,"aanonymouss: In order to run VNC (like vino, r...",Science/Tech,1.0
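
The document-id lookup and the relevance threshold in the cell above can also be written without an explicit `iterrows` loop. The following is a sketch on toy data, not a replacement for the cell. The `toy*` names are hypothetical, and the values are copied from the output shown above.

```python
import pandas as pd

# Toy rows in the same shape as the hand-labeled relevance df
toy = pd.DataFrame({'Dataset': ['ee', 'ee', 'ud'], 'Doc_num': [70, 235, 249], 'Quality': [3, 1, 5]})
toy_docmap = {'enron email': {'70': 287, '235': 1881}, 'ubuntu dialogue': {'249': 6584}}
toy_datasetmap = {'ee': 'enron email', 'ud': 'ubuntu dialogue'}

# Look up the doc id for each (dataset, doc number) pair
toy['Doc_id'] = [toy_docmap[toy_datasetmap[d]][str(int(n))] for d, n in zip(toy['Dataset'], toy['Doc_num'])]

# Ratings of 3 or higher count as relevant; ratings of 1 or 2 do not
toy['Relevant'] = (toy['Quality'] > 2).astype(int)
print(toy)
```

One difference from the cell above: a missing `Quality` value would become 0 here, whereas the two `.loc` assignments leave `Relevant` as NaN for such rows.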


In [120]:
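# Iterate over relevance results to look up each row's conversation, topic words, and cosine similarity
for i, row in dfrel2.iterrows():

    # Get the conversation text from the dataset-specific df (dfcc, dftc, dfud, or dfee)
    df = globals()[f"df{row['Dataset']}"]  # name lookup without eval
    conv = df.loc[df['index'] == row['Doc_id'], 'conv'].values[0]
    conv = conv.replace('\n\n', '\n')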

    # Get doc topics
    if row['Set_id'] == 'Human_keywords':

        # Find topic words in the keywords df
        topicid = 999
        topic_words = dfkw.loc[(dfkw['Dataset']==datasetmap[row['Dataset']]) & (dfkw['Doc_id']==row['Doc_id']), 'Topic_words'].values[0]
        cs = pd.NA  # cosine sim
        
    elif row['Set_id'] == 'Human_friendly_topic':

        # Find human topics in the keywords df
        topicid = 9999
        topic_words = dfkw.loc[(dfkw['Dataset']==datasetmap[row['Dataset']]) & (dfkw['Doc_id']==row['Doc_id']), 'Human_topic'].values[0]
        cs = pd.NA  # cosine sim
        
    else:
        
        # Get row in modeling results
        row2 = dfr[dfr.index==row['Row_id']]

        # Find the topic id based on the document number
        topicid = get_topic_num(row2['Doc_topics'].values[0], row['Doc_num'])

        # Find the topic words for this topic id
        topic_words = get_topic_repr(row2[row['Set_id']].values[0], row['Set_id'], topicid)
        topic_words = topic_words.replace(', ', ',')

        # Cosine similarity
        all_cos = row2['Cosine_similarity'].values[0]
        if pd.isna(all_cos):
            all_cos = {}
        else:
            all_cos = json.loads(all_cos)  # cos sim for all topics and all docs
        #print(f"dataset {row['Dataset']}, docnum {row['Doc_num']}, docid {row['Doc_id']}, setid {row['Set_id']}, topicid {topicid}, rowid {row['Row_id']}")
        # [docnum, cos sim] pairs for all docs with this topic id; empty if the model has no cos sim
        topic_cos = all_cos.get(str(int(topicid)), [])
        if row['Row_id'] == 999:  # debugging output for a single modeling-results row
            print(topic_words)
            print()
            print(all_cos)
            print()
            print(topicid, topic_cos)
            print()
            print(row)
            print()
            print(row2)
            print()
        # Cos sim for this document; NA if none was recorded
        matches = [pair[1] for pair in topic_cos if int(pair[0]) == int(row['Doc_num'])]
        cs = matches[0] if matches else pd.NA

    # Set values
    dfrel2.at[i, 'Topic_words'] = topic_words
    dfrel2.at[i, 'Conv'] = conv
    dfrel2.at[i, 'Cosine_similarity'] = cs
print('done')


done
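
The cosine-similarity lookup in the loop above relies on the `Cosine_similarity` column holding a JSON string that maps each topic id to a list of `[document number, score]` pairs. The sketch below isolates that lookup. The helper name `lookup_cos_sim` and the toy scores are hypothetical, and it returns `None` instead of raising when the topic or document is absent.

```python
import json

# Toy JSON string in the same shape as the Cosine_similarity column:
# {topic id: [[doc number, score], ...]}
cos_json = '{"0": [[0, 0.19], [1, 0.08]], "1": [[2, 0.41]]}'

def lookup_cos_sim(cos_json, topic_id, doc_num):
    """Hypothetical helper: score for one document under one topic, or None if absent."""
    if not isinstance(cos_json, str):
        return None  # e.g. NaN for models without cosine similarity
    all_cos = json.loads(cos_json)
    pairs = all_cos.get(str(int(topic_id)), [])  # JSON keys are strings
    return next((score for num, score in pairs if int(num) == int(doc_num)), None)

print(lookup_cos_sim(cos_json, 0, 1))
print(lookup_cos_sim(cos_json, 1, 0))
```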


In [121]:
# Drop unneeded columns; I added these just as a sanity check to make sure the conversation and topic words matched up
if 'Topic_words2' in dfrel2.columns:
    dfrel2.drop(columns=['Topic_words2', 'Conv2'], inplace=True)

# Save df to disk
timestamp = pd.Timestamp.now().strftime('%Y%m%d_%H%M%S')  # e.g. 20240401_133320
dfrel2.to_pickle('C:\\tmp\\pickles\\relevance\\dfrel_' + timestamp + '.pkl')
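
The timestamped file name used above can be built in one place with `pathlib`, which avoids hand-escaped backslashes. This is a sketch only. The helper name `timestamped_pickle_path` and the relative directory are hypothetical, and the fixed timestamp is passed in purely to make the example reproducible.

```python
from pathlib import Path
import pandas as pd

def timestamped_pickle_path(base_dir, prefix, now=None):
    """Hypothetical helper: build <base_dir>/<prefix>_<YYYYMMDD_HHMMSS>.pkl."""
    now = pd.Timestamp.now() if now is None else now
    return Path(base_dir) / f"{prefix}_{now.strftime('%Y%m%d_%H%M%S')}.pkl"

example_path = timestamped_pickle_path('pickles/relevance', 'dfrel', pd.Timestamp('2024-04-01 13:33:20'))
print(example_path.name)
```

Because the timestamp sorts lexicographically, the most recent pickle in a directory is simply the last file name in sorted order.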


In [122]:
# This is a sanity check between doc topics and cos sim
# to make sure the topic-to-doc mapping in Cosine_similarity matches the topic-to-doc mapping in Doc_topics
for i, row in dfr.iterrows():

    # generate dict like this:
    # {topicid: [docnum, docnum, ...]}

    # Get doc topics
    all_topics = row['Doc_topics']
    if isinstance(all_topics, list):
        tmp = {}
        for topic in all_topics:
            for k in topic.keys():
                tmp[k] = topic[k]
        all_topics = tmp

    # Get cos sim
    all_cos = row['Cosine_similarity']
    if not pd.isna(all_cos):
        all_cos = json.loads(all_cos)
        tmp = {}
        for topicid in all_cos.keys():
            tmp[topicid] = [doc[0] for doc in all_cos[topicid]]
        all_cos = tmp

    # Make keys ints
    tmp = {}
    for topicid in all_topics:
        tmp[int(topicid)] = all_topics[topicid]
    all_topics = tmp
    tmp = {}
    if not pd.isna(all_cos):
        for topicid in all_cos:
            tmp[int(topicid)] = all_cos[topicid]
        all_cos = tmp
    
    # Report any row whose two mappings disagree
    if all_cos != all_topics:
        print(f'NOT THE SAME: {i}')
        print(all_topics)
        print(all_cos)
        print()
    # Rows without cos sim: show the model and its doc topics
    if pd.isna(all_cos):
        print(i, row['Model'], all_topics)
print('done')


done
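
The sanity check above amounts to normalizing both columns into the same `{topic id: [document numbers]}` shape and comparing them. Below is a compact sketch of that idea on toy data. The helper names `normalize_doc_topics` and `normalize_cos_sim` are hypothetical, and the input shapes are assumed from how the cell above handles them: `Doc_topics` as a dict or a list of dicts, `Cosine_similarity` as a JSON string.

```python
import json

def normalize_doc_topics(doc_topics):
    """Hypothetical helper: {int topic id: [doc numbers]} from a dict or a list of dicts."""
    if isinstance(doc_topics, list):
        # Flatten a list of per-topic dicts into one dict
        doc_topics = {k: v for d in doc_topics for k, v in d.items()}
    return {int(k): list(v) for k, v in doc_topics.items()}

def normalize_cos_sim(cos_json):
    """Hypothetical helper: {int topic id: [doc numbers]} from the JSON-encoded similarity column."""
    return {int(k): [pair[0] for pair in pairs] for k, pairs in json.loads(cos_json).items()}

doc_topics_example = [{0: [0, 1]}, {1: [2]}]
cos_json_example = '{"0": [[0, 0.19], [1, 0.08]], "1": [[2, 0.41]]}'
print(normalize_doc_topics(doc_topics_example) == normalize_cos_sim(cos_json_example))
```

Casting the keys to `int` on both sides is what makes the comparison meaningful, since keys loaded from JSON are strings while in-memory topic ids may be integers.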
