Create a CSV file with title and text in plaintext for Amazon Mechanical Turk with 2000 unique support questions #92

rtanglao · 2020-03-03T20:58:23Z

Requirements:

The CSV file should have the following headers (sample file with 9 questions):
sumo-ticket-title,sumo-ticket-text
The ticket text should have the HTML parsed out to plain text.
The file should include the 200 tickets we have already tagged.
It should then include a further 1800 tickets selected randomly from our giant file of support tickets.
There may be some overlap in 3. Please make sure the file has exactly 2000 unique tickets.
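
The selection described above could be sketched roughly as follows. This is only an illustration: the function name `select_unique_tickets`, the `(title, text)` tuple representation, and the choice to judge uniqueness on the full tuple are assumptions, not something specified in this issue.

```python
import random


def select_unique_tickets(tagged, pool, total=2000, seed=0):
    """Return exactly `total` unique tickets: all of `tagged`, topped up from `pool`.

    Tickets are assumed to be hashable (title, text) tuples; two tickets
    count as duplicates when both fields are identical (an assumption).
    """
    # De-duplicate the already-tagged tickets while keeping their order.
    selected = list(dict.fromkeys(tagged))
    seen = set(selected)

    # Drop pool entries that repeat each other or overlap with the tagged set.
    candidates = [t for t in dict.fromkeys(pool) if t not in seen]

    needed = total - len(selected)
    if needed < 0 or needed > len(candidates):
        raise ValueError("cannot build exactly %d unique tickets" % total)

    # A fixed seed keeps the random sample reproducible between runs.
    rng = random.Random(seed)
    return selected + rng.sample(candidates, needed)
```

Removing the overlap before sampling, rather than sampling 1800 and de-duplicating afterwards, is what keeps the final count at exactly 2000.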

Ana16boo · 2020-03-03T22:10:18Z

Do you think we need to include non-English tickets for this case?

willfenton · 2020-03-04T16:01:55Z

The CSV should probably include ticket ID as well, and if we need a version without we can easily remove that column.

willfenton · 2020-03-04T16:21:26Z

@rtanglao Just to clarify the HTML bit, would this

Ticket text bla bla <p>asdfasdf</p> ticket text

turn into this?

Ticket text bla bla asdfasdf ticket text
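
Assuming the answer is yes, a minimal standard-library sketch of that transformation might look like the following. The names `_TextExtractor` and `html_to_text` are made up for illustration, and this is not the script from the linked pull request.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment, ignoring all tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Strip tags, decode entities, and collapse runs of whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    # Joining with a space keeps words on either side of a tag from fusing.
    return " ".join(" ".join(parser.parts).split())


print(html_to_text("Ticket text bla bla <p>asdfasdf</p> ticket text"))
# -> Ticket text bla bla asdfasdf ticket text
```

One trade-off of joining on spaces: inline markup inside a word (for example `<b>bo</b>ld`) would come out as two words.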

willfenton · 2020-03-04T17:13:22Z

This is how many tickets we have with X annotations, not including SUMO. We'll definitely include the 100 tickets with 7-9 annotations, and then should we include all of the tickets with 2 or 3 annotations, for a total of 300 human-tagged tickets? Or add 100 tickets with 3 annotations to round out the 200?

# of annotations	# of tickets
9	99
8	0
7	1
6	0
5	0
4	0
3	109
2	191
1	38
Total	438

@rtanglao @mlopatka

mlopatka · 2020-03-04T18:18:37Z

@willfenton Let's include all tickets with any number (greater than 1) of non-SUMO annotations.
So from the table in the comment above that looks like 400 rather than 300.
Let's get all of those into the MTurk pipeline and then select an additional 1600 randomly from the most recent 12-18 months.

mlopatka · 2020-03-04T18:19:34Z

Keep in mind that each of these 2000 tickets will be tagged in triplicate, by 3 different workers.
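
For reference, the arithmetic behind these numbers can be checked against the annotation table above. The variable names are illustrative only.

```python
# Number of tickets per annotation count, copied from the table above.
annotation_counts = {9: 99, 8: 0, 7: 1, 6: 0, 5: 0, 4: 0, 3: 109, 2: 191, 1: 38}

total_tagged = sum(annotation_counts.values())  # 438
multi_annotated = sum(n for k, n in annotation_counts.items() if k > 1)  # 400

target = 2000
random_needed = target - multi_annotated  # 1600 tickets to sample at random

workers_per_ticket = 3
judgments = target * workers_per_ticket  # 6000 individual worker judgments

print(total_tagged, multi_annotated, random_needed, judgments)
```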

willfenton · 2020-03-04T18:55:14Z

Ok, sounds good. Just need some clarification on the HTML preprocessing now

mlopatka · 2020-03-04T19:02:59Z

Ticket text bla bla <p>asdfasdf</p> ticket text
turn into this?
Ticket text bla bla asdfasdf ticket text

Yes, you can work under that assumption.
We do need to drop the ticket ID column, since this feeds into an automated processing pipeline that expects exactly 2 columns: sumo-ticket-title and sumo-ticket-text.

I have confirmed that the annotations output includes the original raw texts for both fields, so we will be able to rejoin the annotations against our own data and re-associate with a ticket-id after tagging is complete.
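
A rough sketch of that rejoin step, assuming both the annotation output and our own data can be loaded as lists of dicts. The `ticket_id` and `label` field names and the function `reattach_ticket_ids` are hypothetical and only serve the example.

```python
def reattach_ticket_ids(annotations, tickets):
    """Add a `ticket_id` to each annotation row by matching on the raw texts.

    Both inputs are lists of dicts carrying the two CSV columns; `tickets`
    additionally has a (hypothetical) `ticket_id` field.
    """
    # Index our own data by the exact (title, text) pair sent to MTurk.
    index = {}
    for ticket in tickets:
        key = (ticket["sumo-ticket-title"], ticket["sumo-ticket-text"])
        index.setdefault(key, []).append(ticket["ticket_id"])

    joined = []
    for row in annotations:
        key = (row["sumo-ticket-title"], row["sumo-ticket-text"])
        ids = index.get(key, [])
        # Only accept an unambiguous match; otherwise leave the ID unset.
        ticket_id = ids[0] if len(ids) == 1 else None
        joined.append({**row, "ticket_id": ticket_id})
    return joined
```

Since the join key is the text itself, two tickets with identical title and text cannot be told apart afterwards, which is one more reason the 2000 rows need to be unique.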

willfenton · 2020-03-04T22:10:23Z

Sample output from my script, imported into Google Sheets

My preprocessing removes carriage returns and newlines (\r, \n) and then runs the text through textpipe's CleanText operation.
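
For readers without textpipe installed, the newline handling and the two-column output can be approximated with the standard library alone. This is a stand-in sketch, not the actual script: it does not call textpipe, the helper names are invented, and it replaces line breaks with a space rather than deleting them so that adjacent words do not fuse (an assumption about the intended result).

```python
import csv
import io

HEADERS = ["sumo-ticket-title", "sumo-ticket-text"]


def flatten(text):
    """Turn carriage returns and newlines into spaces and collapse whitespace."""
    # str.split() with no arguments splits on any whitespace, including \r and \n.
    return " ".join(text.split())


def write_mturk_csv(tickets, out):
    """Write (title, text) pairs to `out` as a CSV with exactly two columns."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(HEADERS)
    for title, text in tickets:
        writer.writerow([flatten(title), flatten(text)])


buf = io.StringIO()
write_mturk_csv([("Title\r\n1", "line one\nline two, with comma")], buf)
print(buf.getvalue())
```

The `csv` module quotes any field containing a comma, so ticket text with commas still lands in a single column.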

rtanglao changed the title ~~Create a CSV file with title and text in plaintext for Amazon Mechanical Turk with 2000 support questions~~ Create a CSV file with title and text in plaintext for Amazon Mechanical Turk with 2000 unique support questions Mar 3, 2020

willfenton self-assigned this Mar 4, 2020

willfenton linked a pull request Mar 4, 2020 that will close this issue

Script for generating our Mechanical Turk CSV file #93

Merged

willfenton closed this as completed in #93 Mar 5, 2020

rtanglao mentioned this issue Mar 5, 2020

sample file of 2000 tickets has 4 duplicates #94

Closed
