Create a CSV file with title and text in plaintext for Amazon Mechanical Turk with 2000 unique support questions #92

Closed
rtanglao opened this issue Mar 3, 2020 · 9 comments · Fixed by #93

@rtanglao
Collaborator

rtanglao commented Mar 3, 2020

Requirements:

  1. The CSV file should have the following headers (sample file with 9 questions):
    sumo-ticket-title,sumo-ticket-text
  2. The ticket text should have the HTML parsed out to plain text.
  3. The file should have the 200 tickets we have already tagged.
  4. And then a further 1800 selected randomly from our giant file of support tickets.
  5. There may be some overlap in 3. Please make sure the file has exactly 2000 unique tickets.
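
The requirements above could be sketched roughly as follows. This is only an illustration, not the script that was merged: the function names, the shape of the inputs (dicts mapping a ticket ID to a `(title, text)` tuple), and the idea of treating "unique" as "unique ticket ID" are all assumptions.

```python
import csv
import io
import random


def build_ticket_rows(tagged, pool, total=2000, seed=0):
    """Return `total` unique tickets: every tagged ticket plus a random sample.

    `tagged` and `pool` are hypothetical inputs mapping ticket ID -> (title, text).
    """
    selected = dict(tagged)
    # Drop IDs that are already tagged so overlap cannot produce duplicates.
    candidates = sorted(tid for tid in pool if tid not in selected)
    needed = total - len(selected)
    if needed < 0 or needed > len(candidates):
        raise ValueError("cannot reach the requested number of unique tickets")
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    for tid in rng.sample(candidates, needed):
        selected[tid] = pool[tid]
    return selected


def write_csv(tickets, out):
    """Write only the two columns named in requirement 1."""
    writer = csv.writer(out)
    writer.writerow(["sumo-ticket-title", "sumo-ticket-text"])
    for title, text in tickets.values():
        writer.writerow([title, text])
```

Sampling only from IDs that are not already in the tagged set is what keeps the final count at exactly 2000 even when the two sources overlap.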
@rtanglao rtanglao changed the title from "Create a CSV file with title and text in plaintext for Amazon Mechanical Turk with 2000 support questions" to "Create a CSV file with title and text in plaintext for Amazon Mechanical Turk with 2000 unique support questions" Mar 3, 2020
@Ana16boo
Collaborator

Ana16boo commented Mar 3, 2020

Do you think we need to include non-English tickets for this case?

@willfenton
Collaborator

The CSV should probably include ticket ID as well, and if we need a version without we can easily remove that column.

@willfenton willfenton self-assigned this Mar 4, 2020
@willfenton
Collaborator

@rtanglao Just to clarify the HTML bit, would this

Ticket text bla bla <p>asdfasdf</p> ticket text

turn into this?

Ticket text bla bla asdfasdf ticket text
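
For reference, the transformation asked about here can be sketched with only the Python standard library. This is an assumption about what "parsed out to plain text" means (drop the tags, keep the text, collapse whitespace); `html_to_text` is a hypothetical helper name, not something from the project.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    # Join pieces with a space so words on either side of a tag do not merge,
    # then collapse runs of whitespace into single spaces.
    return " ".join(" ".join(parser.parts).split())


print(html_to_text("Ticket text bla bla <p>asdfasdf</p> ticket text"))
# Ticket text bla bla asdfasdf ticket text
```

Joining the pieces with a space is a design choice: it keeps text on either side of block tags such as `<p>` or `<br>` apart, at the cost of splitting a word that has an inline tag in the middle of it.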

@willfenton
Collaborator

This is how many tickets we have with X annotations, not including SUMO. We'll definitely include the 100 tickets with 7-9 annotations, and then should we include all of the tickets with 2 or 3 annotations, for a total of 300 human-tagged tickets? Or add 100 tickets with 3 annotations to round out the 200?

| # of annotations | # of tickets |
| --- | --- |
| 9 | 99 |
| 8 | 0 |
| 7 | 1 |
| 6 | 0 |
| 5 | 0 |
| 4 | 0 |
| 3 | 109 |
| 2 | 191 |
| 1 | 38 |
| Total | 438 |

@rtanglao @mlopatka

@mlopatka
Owner

mlopatka commented Mar 4, 2020

@willfenton Let's include all tickets with more than 1 non-SUMO annotation.
From the table in the comment above, that comes to 400 rather than 300.
Let's get all of those into the MTurk pipeline and then select an additional 1600 randomly from the most recent 12-18 months.
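
The arithmetic behind those numbers, using the counts from the table earlier in the thread (the variable names are only for illustration):

```python
# Number of tickets per annotation count, copied from the table above.
annotation_counts = {9: 99, 8: 0, 7: 1, 6: 0, 5: 0, 4: 0, 3: 109, 2: 191, 1: 38}

# Tickets with more than 1 non-SUMO annotation.
human_tagged = sum(n for k, n in annotation_counts.items() if k > 1)

# Tickets still to be drawn at random to reach 2000.
random_needed = 2000 - human_tagged

print(human_tagged, random_needed)  # 400 1600
```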

@mlopatka
Owner

mlopatka commented Mar 4, 2020

Keep in mind that each of these 2000 tickets will be tagged by 3 different workers.

@willfenton
Collaborator

Ok, sounds good. I just need some clarification on the HTML preprocessing now.

@mlopatka
Owner

mlopatka commented Mar 4, 2020

> Ticket text bla bla <p>asdfasdf</p> ticket text
> turn into this?
> Ticket text bla bla asdfasdf ticket text

Yes, you can work under that assumption.
We do need to drop the ticket ID column, since this is feeding into an automated processing pipeline that expects exactly 2 columns: sumo-ticket-title and sumo-ticket-text.

I have confirmed that the annotations output includes the original raw texts for both fields, so we will be able to rejoin the annotations against our own data and re-associate with a ticket-id after tagging is complete.
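
A rough sketch of what that rejoin could look like, assuming the annotation output is available as rows that carry the two raw text fields. The function name, the `ticket-id` key, and the dict-based row format are hypothetical; the real MTurk output format is not shown in this thread.

```python
def rejoin_ticket_ids(our_tickets, annotations):
    """Attach a ticket ID to each annotation row by matching the raw texts.

    `our_tickets`: dicts with "ticket-id", "sumo-ticket-title", "sumo-ticket-text".
    `annotations`: dicts with at least the two sumo-ticket-* fields.
    """
    index = {}
    for ticket in our_tickets:
        key = (ticket["sumo-ticket-title"], ticket["sumo-ticket-text"])
        index.setdefault(key, []).append(ticket["ticket-id"])

    joined = []
    for row in annotations:
        key = (row["sumo-ticket-title"], row["sumo-ticket-text"])
        ids = index.get(key, [])
        # Leave the ID empty when the match is missing or ambiguous.
        ticket_id = ids[0] if len(ids) == 1 else None
        joined.append({**row, "ticket-id": ticket_id})
    return joined
```

Matching on the exact title and text pair only works if both sides went through the same preprocessing, and two tickets with identical title and text cannot be told apart, which is why ambiguous matches are left as `None` here.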

@willfenton
Collaborator

Sample output from my script, imported into Google Sheets

My preprocessing removes newlines and carriage returns (\n, \r) and then runs the text through textpipe's CleanText operation.
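
The line-break part of that preprocessing might look something like the sketch below. It does not reproduce textpipe's CleanText step, and replacing each break with a single space (rather than deleting it outright) is an assumption made here so that adjacent words do not run together; `strip_line_breaks` is a hypothetical name.

```python
import re


def strip_line_breaks(text):
    """Replace each run of \\r / \\n (and surrounding whitespace) with one space."""
    return re.sub(r"\s*[\r\n]+\s*", " ", text).strip()


print(strip_line_breaks("line one\r\nline two\n\nline three\r"))
# line one line two line three
```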
