## CUNY DATA698
### Topic Modeling for Forensics Analysis of Text-Based Conversations
#### Michael Ippolito
#### May 2024

This is part of a series of Python Jupyter notebooks in support of my master's capstone project. The aim of the project is to study various methods of preprocessing, topic modeling, and postprocessing text-based conversation data often extracted from electronic devices recovered during criminal or cybersecurity investigations.

The Jupyter notebooks used in this project are as follows:

| Module | Purpose |
|--------|---------|
| eda1.ipynb | Exploratory data analysis of the four datasets used in the study. |
| modeling1.ipynb | Loads and preprocesses the datasets, fits several topic models, and postprocesses the topic representations. |
| survey1.ipynb | Generates conversation text and topic representations to submit to Mechanical Turk. It later parses the results and incorporates them into my own hand-labeled results. |
| survey2.ipynb | Loads Mechanical Turk survey results and evaluates them for quality based on reading speed and attention questions. |
| eval1.ipynb | Evaluates the topic modeling and survey results based on topic coherence, semantic quality, and topic relevance. |

The study uses the following four datasets:

1. Chitchat
2. Topical Chat
3. Ubuntu Dialogue
4. Enron Email

For further details and attribution, see my paper in this GitHub repo.


### Mechanical Turk Survey (Quality Evaluation)
#### survey2.ipynb

The code in this module evaluates the quality of the results obtained from the Mechanical Turk survey in terms of average reading speed and whether the worker responded correctly to the "attention responses" included in the survey (i.e., answers that would be obvious to a human if they were paying attention).


### Initialization

This section loads required libraries and sets module-wide parameters.


In [2]:
#Load libraries
import os
import re
import json
import numpy as np
import pandas as pd
import random
from collections import Counter
import html
import csv


In [3]:
# Params
capstone_dir = 'C:/Users/micha/Box Sync/cuny/698-Capstone'
pickle_dir = 'C:/tmp/pickles'
mturk_dir = 'C:/Users/micha/Box Sync/cuny/698-Capstone/mturk/'


### Read Survey Results

This section loads the survey results. Two rounds of the survey were conducted; I ran the second after adding keyphrase preprocessing, to see what Mechanical Turk workers thought of the semantic quality.


In [4]:
# Read Mechanical Turk survey result CSVs
dfsr1 = pd.read_csv(f"{capstone_dir}/Batch_5211639_batch_results_final.csv")  # round 1
#dfsr1 = dfsr1[dfsr1['AssignmentStatus'] != 'Rejected'].reset_index()
dfsr2 = pd.read_csv(f"{capstone_dir}/Batch_5218310_batch_results_final.csv")  # round 2
#dfsr2 = dfsr2[dfsr2['AssignmentStatus'] != 'Rejected'].reset_index()
dfsr = pd.concat([dfsr1, dfsr2], ignore_index=True)

# Get length of each round
print('round 1:', dfsr1.shape)
print('round 2:', dfsr2.shape)

# Convert relevance to integer (only take the first character since it will be a number from 1 to 5, skipping the trailing text "least relevant" and "most relevant")
dfsr['Answer.relevance.label'] = dfsr['Answer.relevance.label'].apply(lambda x: int(x[:1]))

# Show the combined shape and a sample of rows (the rejected-answer filters above are commented out)
print(dfsr.shape)
display(dfsr.head())
display(dfsr.tail())


round 1: (730, 32)
round 2: (267, 32)
(997, 32)


Unnamed: 0,HITId,HITTypeId,Title,Description,Keywords,Reward,CreationTime,MaxAssignments,RequesterAnnotation,AssignmentDurationInSeconds,...,RequesterFeedback,WorkTimeInSeconds,LifetimeApprovalRate,Last30DaysApprovalRate,Last7DaysApprovalRate,Input.conversation,Input.topic,Answer.relevance.label,Approve,Reject
0,3XEIP58NMJ7N4WYIKUQFPJFMW7SZLF,3U2Q8YJASS9X6MVGG7Q77MB02CJT5L,How relevant is the conversation to the given ...,How relevant is the conversation to the given ...,"text, conversation, relevance",$0.22,Thu Apr 18 11:35:34 PDT 2024,4,BatchId:5211639;OriginalHitTemplateId:928390850;,3600,...,Impossibly fast response time,5,0% (0/12),0% (0/12),0% (0/12),UserB: Traveling the world<br />UserA: I would...,"saipan, marianna islands, wwii, ww2, scuba, ko...",4,,
1,3XEIP58NMJ7N4WYIKUQFPJFMW7SZLF,3U2Q8YJASS9X6MVGG7Q77MB02CJT5L,How relevant is the conversation to the given ...,How relevant is the conversation to the given ...,"text, conversation, relevance",$0.22,Thu Apr 18 11:35:34 PDT 2024,4,BatchId:5211639;OriginalHitTemplateId:928390850;,3600,...,,85,100% (1/1),100% (1/1),100% (1/1),UserB: Traveling the world<br />UserA: I would...,"saipan, marianna islands, wwii, ww2, scuba, ko...",4,,
2,3XEIP58NMJ7N4WYIKUQFPJFMW7SZLF,3U2Q8YJASS9X6MVGG7Q77MB02CJT5L,How relevant is the conversation to the given ...,How relevant is the conversation to the given ...,"text, conversation, relevance",$0.22,Thu Apr 18 11:35:34 PDT 2024,4,BatchId:5211639;OriginalHitTemplateId:928390850;,3600,...,,775,100% (12/12),100% (12/12),100% (12/12),UserB: Traveling the world<br />UserA: I would...,"saipan, marianna islands, wwii, ww2, scuba, ko...",4,,
3,3XEIP58NMJ7N4WYIKUQFPJFMW7SZLF,3U2Q8YJASS9X6MVGG7Q77MB02CJT5L,How relevant is the conversation to the given ...,How relevant is the conversation to the given ...,"text, conversation, relevance",$0.22,Thu Apr 18 11:35:34 PDT 2024,4,BatchId:5211639;OriginalHitTemplateId:928390850;,3600,...,,112,0% (0/0),0% (0/0),0% (0/0),UserB: Traveling the world<br />UserA: I would...,"saipan, marianna islands, wwii, ww2, scuba, ko...",3,,
4,3FJ2RVH26IQ2XJUX6QEBXZEKB4D29E,3U2Q8YJASS9X6MVGG7Q77MB02CJT5L,How relevant is the conversation to the given ...,How relevant is the conversation to the given ...,"text, conversation, relevance",$0.22,Thu Apr 18 11:35:34 PDT 2024,3,BatchId:5211639;OriginalHitTemplateId:928390850;,3600,...,,155,100% (18/18),100% (18/18),100% (18/18),UserB: Traveling the world<br />UserA: I would...,World travel,5,,


Unnamed: 0,HITId,HITTypeId,Title,Description,Keywords,Reward,CreationTime,MaxAssignments,RequesterAnnotation,AssignmentDurationInSeconds,...,RequesterFeedback,WorkTimeInSeconds,LifetimeApprovalRate,Last30DaysApprovalRate,Last7DaysApprovalRate,Input.conversation,Input.topic,Answer.relevance.label,Approve,Reject
992,3T2HW4QDVERFV1MZ3J3H9CN6N6D9C3,3VXAMLAOAAOLUVX4SBSYFM3UC2JFZ6,How relevant is the conversation to the given ...,How relevant is the conversation to the given ...,"text, conversation, relevance",$0.20,Sun May 05 03:51:23 PDT 2024,3,BatchId:5218310;OriginalHitTemplateId:928390850;,3600,...,,89,100% (3/3),100% (3/3),0% (0/0),"aanonymouss: In order to run VNC (like vino, r...",Linux,3,,
993,3T2HW4QDVERFV1MZ3J3H9CN6N6D9C3,3VXAMLAOAAOLUVX4SBSYFM3UC2JFZ6,How relevant is the conversation to the given ...,How relevant is the conversation to the given ...,"text, conversation, relevance",$0.20,Sun May 05 03:51:23 PDT 2024,3,BatchId:5218310;OriginalHitTemplateId:928390850;,3600,...,,988,0% (0/0),0% (0/0),0% (0/0),"aanonymouss: In order to run VNC (like vino, r...",Linux,5,,
994,3INZSNUD9JAP0TSD3FYSTI5O86P9D3,3VXAMLAOAAOLUVX4SBSYFM3UC2JFZ6,How relevant is the conversation to the given ...,How relevant is the conversation to the given ...,"text, conversation, relevance",$0.20,Sun May 05 03:51:23 PDT 2024,3,BatchId:5218310;OriginalHitTemplateId:928390850;,3600,...,,65,100% (3/3),100% (3/3),0% (0/0),"aanonymouss: In order to run VNC (like vino, r...",Science/Tech,3,,
995,3INZSNUD9JAP0TSD3FYSTI5O86P9D3,3VXAMLAOAAOLUVX4SBSYFM3UC2JFZ6,How relevant is the conversation to the given ...,How relevant is the conversation to the given ...,"text, conversation, relevance",$0.20,Sun May 05 03:51:23 PDT 2024,3,BatchId:5218310;OriginalHitTemplateId:928390850;,3600,...,,149,100% (5/5),100% (5/5),0% (0/0),"aanonymouss: In order to run VNC (like vino, r...",Science/Tech,4,,
996,3INZSNUD9JAP0TSD3FYSTI5O86P9D3,3VXAMLAOAAOLUVX4SBSYFM3UC2JFZ6,How relevant is the conversation to the given ...,How relevant is the conversation to the given ...,"text, conversation, relevance",$0.20,Sun May 05 03:51:23 PDT 2024,3,BatchId:5218310;OriginalHitTemplateId:928390850;,3600,...,,2374,0% (0/0),0% (0/0),0% (0/0),"aanonymouss: In order to run VNC (like vino, r...",Science/Tech,4,,
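
The relevance conversion in the cell above takes the first character of each label and casts it to an integer, which raises an error on an empty or missing label. A slightly more defensive variant might look like the sketch below. The label strings are assumptions for illustration (the exact text shown to workers is not reproduced in this notebook), and `parse_relevance` is a hypothetical helper, not part of the original code.

```python
import re

def parse_relevance(label):
    """Return the leading 1-5 rating from a label, or None if there is none."""
    # str() lets the helper accept NaN/None without raising
    match = re.match(r"\s*([1-5])", str(label))
    return int(match.group(1)) if match else None

# Hypothetical label strings; the real survey labels may differ
examples = ["1 - least relevant", "3", "5 - most relevant", "", None]
ratings = [parse_relevance(x) for x in examples]
print(ratings)
```

With pandas, the same helper could be applied via `dfsr['Answer.relevance.label'].apply(parse_relevance)`, leaving unparseable labels as missing values rather than stopping the run.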


In [5]:
# Read csv of rows and topics uploaded to mturk
dfmt1 = pd.read_csv(f"{capstone_dir}/mturk_upload_round1_full.csv")
dfmt2 = pd.read_csv(f"{capstone_dir}/mturk_upload_round2_full.csv")
dfmt = pd.concat([dfmt1, dfmt2], ignore_index=True)
display(dfmt.head())


Unnamed: 0,conversation,topic,rowid,setid,dataset,docnum,docid,wordct
0,UserB: Traveling the world<br />UserA: I would...,"saipan, marianna islands, wwii, ww2, scuba, ko...",9999,Human_keywords,cc,12,2276,290
1,UserB: Traveling the world<br />UserA: I would...,World travel,9999,Human_friendly_topic,cc,12,2276,290
2,UserB: Traveling the world<br />UserA: I would...,"people, class, world, remember, end, bit, year...",16,Topic_words,cc,12,2276,290
3,UserB: Traveling the world<br />UserA: I would...,People,16,Flan_topic,cc,12,2276,290
4,UserB: Traveling the world<br />UserA: I would...,kinfolk,16,Flan_topic2,cc,12,2276,290


In [6]:
# Match survey results to mturk upload

# Add rowid to survey results so we can tell when the row has been filled
dfsr['rowid'] = pd.NA

# Iterate over mturk upload rows
for i, row in dfmt.iterrows():

    # Filter survey results
    dftmp = dfsr[(dfsr['Input.conversation'] == row['conversation']) & (dfsr['Input.topic'] == row['topic']) & (pd.isna(dfsr['rowid']))]
    #print(f"{i}: {dftmp.shape}, {list(dftmp.index)}")
    if dftmp.shape[0] == 0:
        print('************** NOT FOUND!!!! ******************')

    # Iterate over matched rows
    ct = 0
    for j in dftmp.index:

        # Set the rowid and setid in the mturk survey results df
        #print(f"\t{j}")
        dfsr.at[j, 'rowid'] = row['rowid']
        dfsr.at[j, 'setid'] = row['setid']
        dfsr.at[j, 'dataset'] = row['dataset']
        dfsr.at[j, 'docnum'] = row['docnum']
        dfsr.at[j, 'docid'] = row['docid']
        dfsr.at[j, 'wordct'] = row['wordct']
        ct += 1
        if ct > 2: break  # Only use the first 3 rows in case there are duplicate topics/conversations

# Make sure there are no NAs
print()
print('NAs')
print(len(dfsr[pd.isna(dfsr['rowid'])]))
print()

# Display
display(dfsr.head())



NAs
106



Unnamed: 0,HITId,HITTypeId,Title,Description,Keywords,Reward,CreationTime,MaxAssignments,RequesterAnnotation,AssignmentDurationInSeconds,...,Input.topic,Answer.relevance.label,Approve,Reject,rowid,setid,dataset,docnum,docid,wordct
0,3XEIP58NMJ7N4WYIKUQFPJFMW7SZLF,3U2Q8YJASS9X6MVGG7Q77MB02CJT5L,How relevant is the conversation to the given ...,How relevant is the conversation to the given ...,"text, conversation, relevance",$0.22,Thu Apr 18 11:35:34 PDT 2024,4,BatchId:5211639;OriginalHitTemplateId:928390850;,3600,...,"saipan, marianna islands, wwii, ww2, scuba, ko...",4,,,9999.0,Human_keywords,cc,12.0,2276.0,290.0
1,3XEIP58NMJ7N4WYIKUQFPJFMW7SZLF,3U2Q8YJASS9X6MVGG7Q77MB02CJT5L,How relevant is the conversation to the given ...,How relevant is the conversation to the given ...,"text, conversation, relevance",$0.22,Thu Apr 18 11:35:34 PDT 2024,4,BatchId:5211639;OriginalHitTemplateId:928390850;,3600,...,"saipan, marianna islands, wwii, ww2, scuba, ko...",4,,,9999.0,Human_keywords,cc,12.0,2276.0,290.0
2,3XEIP58NMJ7N4WYIKUQFPJFMW7SZLF,3U2Q8YJASS9X6MVGG7Q77MB02CJT5L,How relevant is the conversation to the given ...,How relevant is the conversation to the given ...,"text, conversation, relevance",$0.22,Thu Apr 18 11:35:34 PDT 2024,4,BatchId:5211639;OriginalHitTemplateId:928390850;,3600,...,"saipan, marianna islands, wwii, ww2, scuba, ko...",4,,,9999.0,Human_keywords,cc,12.0,2276.0,290.0
3,3XEIP58NMJ7N4WYIKUQFPJFMW7SZLF,3U2Q8YJASS9X6MVGG7Q77MB02CJT5L,How relevant is the conversation to the given ...,How relevant is the conversation to the given ...,"text, conversation, relevance",$0.22,Thu Apr 18 11:35:34 PDT 2024,4,BatchId:5211639;OriginalHitTemplateId:928390850;,3600,...,"saipan, marianna islands, wwii, ww2, scuba, ko...",3,,,,,,,,
4,3FJ2RVH26IQ2XJUX6QEBXZEKB4D29E,3U2Q8YJASS9X6MVGG7Q77MB02CJT5L,How relevant is the conversation to the given ...,How relevant is the conversation to the given ...,"text, conversation, relevance",$0.22,Thu Apr 18 11:35:34 PDT 2024,3,BatchId:5211639;OriginalHitTemplateId:928390850;,3600,...,World travel,5,,,9999.0,Human_friendly_topic,cc,12.0,2276.0,290.0
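
The matching loop above assigns each uploaded row to at most three survey responses with the same conversation and topic, so a fourth response to the same HIT stays unmatched (as with row 3 in the output above). As a rough sketch, the same rule can be expressed without an explicit loop by numbering responses within each conversation/topic pair and merging on that number. The frames below are invented stand-ins for `dfsr` and `dfmt`, and the `slot` column is a hypothetical helper name.

```python
import pandas as pd

# Toy stand-ins for dfsr (survey results) and dfmt (uploaded rows); values are invented
survey = pd.DataFrame({
    "Input.conversation": ["c1"] * 4 + ["c2"] * 2,
    "Input.topic": ["t1"] * 4 + ["t2"] * 2,
    "WorkerId": ["w1", "w2", "w3", "w4", "w5", "w6"],
})
upload = pd.DataFrame({
    "conversation": ["c1", "c2"],
    "topic": ["t1", "t2"],
    "rowid": [16, 9999],
    "setid": ["Topic_words", "Human_keywords"],
})

keys_s = ["Input.conversation", "Input.topic"]
keys_u = ["conversation", "topic"]
per_row = 3  # responses claimed by each uploaded row, as in the loop

# Slot n holds the n-th group of three responses for a conversation/topic pair
survey["slot"] = survey.groupby(keys_s).cumcount() // per_row
# The n-th uploaded row with that pair claims slot n
upload["slot"] = upload.groupby(keys_u).cumcount()

# Left merge keeps every survey response; unclaimed ones get missing values
matched = survey.merge(
    upload,
    how="left",
    left_on=keys_s + ["slot"],
    right_on=keys_u + ["slot"],
)
print(matched[["WorkerId", "rowid", "setid"]])
print("unmatched:", matched["rowid"].isna().sum())
```

In this toy data the fourth response to `c1`/`t1` has no uploaded row left to claim it, so it is the single unmatched row.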


### Impossibly fast readers

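This section detects workers who answered so quickly that they would have to be impossibly fast readers, i.e. whose work time is shorter than the time needed to read the conversation at the very fast rate of 1,000 words per minute.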


In [10]:
# Find impossibly fast readers: flag responses whose work time is less than the time needed to read the conversation at the very fast rate (60 * word count / vfast_wpm)
avg_wpm = 238  # Average words per minute a human can read
fast_wpm = 500 # Fast words per minute
vfast_wpm = 1000 # Very fast words per minute

# Set starting index depending on round (round 1 starts at 0, round 2 starts at 624)
start_index = 0
ifr = dfsr.loc[(dfsr.index >= start_index) & (dfsr['WorkTimeInSeconds'] < (60 * dfsr['wordct'] / vfast_wpm)), ['WorkTimeInSeconds', 'WorkerId']]
print(ifr.values)

# Group by worker id
ifrg = ifr.groupby(['WorkerId']).count()
crappy_workers = list(ifrg[ifrg['WorkTimeInSeconds'] > 1].index)
print()
print('crappy workers:')
print(crappy_workers)

# Count number of responses in the survey results answered by crappy workers
print('# of responses by crappy workers:', dfsr[dfsr['WorkerId'].isin(crappy_workers)].shape[0])


[[5 'A1DLZK8TZJ7ESF']
 [4 'A1DLZK8TZJ7ESF']
 [3 'A1DLZK8TZJ7ESF']
 [3 'A1DLZK8TZJ7ESF']
 [5 'A1DLZK8TZJ7ESF']
 [5 'A1DLZK8TZJ7ESF']
 [4 'A1DLZK8TZJ7ESF']
 [4 'A1DLZK8TZJ7ESF']
 [21 'A2KOECHP94T0TI']]

crappy workers:
['A1DLZK8TZJ7ESF']
# of responses by crappy workers: 12
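
To make the threshold concrete, here is a small worked example using figures that appear earlier in this notebook: the first conversation in the upload table has a word count of 290, and row 0 of the survey results has a work time of 5 seconds. The function name `min_plausible_seconds` is hypothetical and only restates the expression used in the cell above.

```python
def min_plausible_seconds(wordct, wpm=1000):
    """Seconds needed to read `wordct` words at `wpm` words per minute."""
    return 60 * wordct / wpm

wordct = 290          # word count of the first conversation in the upload table
work_time = 5         # WorkTimeInSeconds for row 0 of the survey results

threshold = min_plausible_seconds(wordct)          # very fast reader
average = min_plausible_seconds(wordct, wpm=238)   # average reader

print(threshold)               # 17.4
print(round(average, 1))       # 73.1
print(work_time < threshold)   # True
```

A 5-second response to a 290-word conversation falls well under the 17.4-second floor, so it is flagged; an average reader would need a little over a minute.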


### Incorrect answer to attention responses

This section detects workers who answered "attention responses" incorrectly, i.e. items whose correct rating would be obvious to anyone who was paying attention.
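
The rule can be illustrated on toy data before it is applied to the real results: a response counts as a miss when an obviously good topic is rated 1 or an obviously bad topic is rated 5, and a worker is flagged after more than one miss. The data below is invented, and `rating` is a stand-in name for the `Answer.relevance.label` column.

```python
import pandas as pd

# Invented responses; 'quality' is the hand label, 'rating' stands in for Answer.relevance.label
responses = pd.DataFrame({
    "WorkerId": ["w1", "w1", "w2", "w3", "w3"],
    "quality": ["bad", "good", "bad", "good", "bad"],
    "rating": [5, 1, 5, 4, 2],
})

# A miss is rating an obviously good topic 1 or an obviously bad topic 5
missed = ((responses["quality"] == "good") & (responses["rating"] < 2)) | (
    (responses["quality"] == "bad") & (responses["rating"] > 4)
)

# Flag workers with more than one miss
miss_counts = responses.loc[missed, "WorkerId"].value_counts()
flagged = sorted(miss_counts[miss_counts > 1].index)
print(flagged)
```

Here `w1` misses twice and is flagged, `w2` misses once and is given the benefit of the doubt, and `w3` has no misses.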


In [11]:
# Look for obviously good and bad answers

# Read csv where I hand-evaluated the obviously good and bad answers
dfoa = pd.read_csv(f"{capstone_dir}/obvious_answers.csv")
print(dfoa.shape)
#display(dfoa.head())
#display(dfsr.head())

# Merge obvious answers with survey results
dfm = pd.merge(dfsr, dfoa, how='left', on=['dataset', 'docid', 'rowid', 'setid'])
print(dfm.shape)
print(dfsr.shape)
display(dfm.head(1))
print()

# Save results; will need these later
dfm.to_pickle(f"{capstone_dir}/mturk_results.pkl")

# Find obviously wrong answers (good or bad)
dfowa = dfm.loc[((dfm['quality'] == 'good') & (dfm['Answer.relevance.label'] < 2)) | \
    ((dfm['quality'] == 'bad') & (dfm['Answer.relevance.label'] > 4)), ['WorkerId', 'quality', 'Answer.relevance.label']]
print(dfowa)
print(dfowa.shape)

# Group by worker id
dfowag = dfowa.groupby(['WorkerId']).count()
crappy_workers2 = list(dfowag[dfowag['Answer.relevance.label'] > 1].index)
crappy_workers.extend(crappy_workers2)
print()
print('crappy workers:')
print(crappy_workers)


(38, 5)
(997, 39)
(997, 38)


Unnamed: 0,HITId,HITTypeId,Title,Description,Keywords,Reward,CreationTime,MaxAssignments,RequesterAnnotation,AssignmentDurationInSeconds,...,Answer.relevance.label,Approve,Reject,rowid,setid,dataset,docnum,docid,wordct,quality
0,3XEIP58NMJ7N4WYIKUQFPJFMW7SZLF,3U2Q8YJASS9X6MVGG7Q77MB02CJT5L,How relevant is the conversation to the given ...,How relevant is the conversation to the given ...,"text, conversation, relevance",$0.22,Thu Apr 18 11:35:34 PDT 2024,4,BatchId:5211639;OriginalHitTemplateId:928390850;,3600,...,4,,,9999,Human_keywords,cc,12.0,2276.0,290.0,good



           WorkerId quality  Answer.relevance.label
37   A1S9DHY61PTNLE     bad                       5
62   A226S9LUL53Q01     bad                       5
63   A3UKP9P1WXWHOX     bad                       5
74    AUIRY42QWTHEO     bad                       5
122   ACZLEW8JE4L7E    good                       1
126  A1S9DHY61PTNLE    good                       1
219  A108OZCQQWRJWF     bad                       5
220  A1AMGHYG5PT0L2     bad                       5
221   A2522PWSAG9XP     bad                       5
228   A2522PWSAG9XP     bad                       5
299  A1S9DHY61PTNLE    good                       1
309   ACZLEW8JE4L7E     bad                       5
339  A1XNJ79L33QYGT     bad                       5
355   AL42QQ0WUTNJY     bad                       5
357  A28UYPIYP4C1B9     bad                       5
358   AUIRY42QWTHEO     bad                       5
371   AUIRY42QWTHEO     bad                       5
372  A1Y3JL53XAO7QV     bad                       5
430   A3USC

### Filter bad workers

Filter out HITs from Mechanical Turk workers who were either impossibly fast readers or who answered the attention responses incorrectly.


In [12]:
# Find all survey results completed by crappy workers
dfcrappy = dfsr.loc[dfsr['WorkerId'].isin(crappy_workers), 'AssignmentId']
print(f"{dfcrappy.shape[0]} crappy rows out of {dfsr.shape[0]}")
#print(dfcrappy.values)


106 crappy rows out of 997
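
The cell above only counts the affected rows. If the goal is to drop them, a boolean mask with `isin` is one straightforward option. The sketch below uses invented data; `results` and `flagged_workers` are stand-ins for `dfsr` and `crappy_workers`.

```python
import pandas as pd

# Invented stand-in for the survey results
results = pd.DataFrame({
    "WorkerId": ["w1", "w2", "w1", "w3"],
    "AssignmentId": ["a1", "a2", "a3", "a4"],
    "rating": [5, 3, 1, 4],
})
# Duplicate IDs (possible after list.extend) do not affect isin
flagged_workers = ["w1", "w1"]

# Mark rows from flagged workers, then keep everything else
is_flagged = results["WorkerId"].isin(flagged_workers)
kept = results[~is_flagged].reset_index(drop=True)

print(f"{is_flagged.sum()} flagged rows out of {len(results)}")
print(kept)
```

Keeping the mask separate from the filtered frame makes it easy to report how many rows were removed and to inspect them later.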


In [13]:
# Find good workers, i.e. those who answered multiple attention responses correctly (obviously right answers); these are candidates for a reward
dfora = dfm.loc[(dfm.index >= start_index) & (((dfm['quality'] == 'good') & (dfm['Answer.relevance.label'] > 3)) | \
    ((dfm['quality'] == 'bad') & (dfm['Answer.relevance.label'] < 3))), ['WorkerId', 'quality', 'Answer.relevance.label']]
print(dfora.values)
print(dfora.shape)

# Group by worker id
dforag = dfora.groupby(['WorkerId']).count()
good_workers = list(dforag[dforag['Answer.relevance.label'] > 2].index)

# Make sure they're not in the bad workers list
for e in crappy_workers:
    if e in good_workers:
        good_workers.remove(e)
print()
print('good workers:')
print(good_workers)


[['A1DLZK8TZJ7ESF' 'good' 4]
 ['AAC9DJ81ZXUE7' 'good' 4]
 ['A2VNIOI5GX51C6' 'good' 4]
 ['AEIACTPDXL4MJ' 'good' 5]
 ['A1RD8CM11VK3QR' 'good' 4]
 ['A22AM327QRSABB' 'bad' 2]
 ['AEIACTPDXL4MJ' 'bad' 1]
 ['A3Q9UK9RCL87O8' 'bad' 2]
 ['A2NZ4U7L5TG9X4' 'bad' 1]
 ['A2KOECHP94T0TI' 'bad' 1]
 ['A1DLZK8TZJ7ESF' 'good' 5]
 ['A3USCXGR3HY75' 'good' 5]
 ['A1GD9JRU44WE8P' 'good' 4]
 ['AEIACTPDXL4MJ' 'good' 5]
 ['A2522PWSAG9XP' 'good' 5]
 ['A1DLZK8TZJ7ESF' 'good' 4]
 ['A1G34EE9FIBKPF' 'good' 5]
 ['AEIACTPDXL4MJ' 'good' 5]
 ['AL42QQ0WUTNJY' 'bad' 1]
 ['A38KCCSGBFR26L' 'bad' 1]
 ['A9K0CV70JWG1W' 'bad' 1]
 ['AL42QQ0WUTNJY' 'bad' 1]
 ['A38DC3BG1ZCVZ2' 'bad' 1]
 ['AEIACTPDXL4MJ' 'bad' 1]
 ['A226S9LUL53Q01' 'bad' 1]
 ['AEIACTPDXL4MJ' 'bad' 1]
 ['A226S9LUL53Q01' 'bad' 1]
 ['A2RIDT5SZESV21' 'bad' 2]
 ['A38DC3BG1ZCVZ2' 'good' 5]
 ['A2S4PLMN34DDBR' 'good' 5]
 ['A1OOBCJDJNRE50' 'good' 4]
 ['ATBO8AV9ADX1C' 'good' 5]
 ['AURYD2FH3FUOQ' 'bad' 1]
 ['A52843SHVSRKM' 'bad' 2]
 ['A9K0CV70JWG1W' 'bad' 1]
 ['ACZLEW8JE4L7E' '