TweetSumm

A dataset focused on summarization of dialogs, which represents the rich domain of Twitter customer care conversations

Tweetsumm comprises of 1,100 dialogs reconstructed from Tweets that appear in the Kaggle Customer Support On Twitter dataset each accompanied by 3 extractive and 3 abstractive summaries generated by human annotators.

If you use this dataset in your work, please cite our paper

@inproceedings{feigenblat-etal-2021-tweetsumm-dialog,
    title = "{TWEETSUMM} - A Dialog Summarization Dataset for Customer Service",
    author = "Feigenblat, Guy  and
      Gunasekara, Chulaka  and
      Sznajder, Benjamin  and
      Joshi, Sachindra  and
      Konopnicki, David  and
      Aharonov, Ranit",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-emnlp.24",
    pages = "245--260",
}

TweetSumm format

TweetSumm is released with a partition in Train/Test/Valid files. Each file is in JSON Lines format.

A TweetSum entry (line) has the following format:

conversation_id : a unique identifier of the dialog
tweet_ids_sentence_offset : a list of
- tweet_id : corresponding to Twitter Id in Kaggle Customer Support On Twitter dataset
- sentence offsets : the offsets of the sentences splitting we used.
annotations : a list of summaries generated by the human annotators - each entry contains :
- extractive : a list of sentences selected from the initial dialog. The sentences are in the format tweet_id, sentence offsets
- abstractive : a list of one or two sentences.

TweetSumm processing

The script tweet_summ_processor allows to transform the TweetSumm entries to readable strings for further processing.

The TweetSumProcessor class requires one parameter : the path to the twcs file of Kaggle Customer Support On Twitter. This file can be download here.

The TweetSumProcessor has one method : get_dialog_with_summaries which gets a list of TweetSumm entries (lines in the TweetSumm files) and returns a list of corresponding DialogWithSummaries objects. These objects allow to access to the readable conversation associated with their readable human generated summaries.

Code sample:

    processor = TweetSumProcessor(TWCS_FILE_PATH)
    with open(TWEET_SUMM_FILE_PATH) as f:
        dialog_with_summaries = processor.get_dialog_with_summaries(f.readlines())
        for dialog_with_summary in dialog_with_summaries:
            json_format = dialog_with_summary.get_json()
            string_format = str(dialog_with_summary)
            ...

TweetSumm entry sample

{
  "conversation_id": "bbde6d8ec7c39c4551da1ff6024f997b",
  "tweet_ids_sentence_offset": [
    {
      "tweet_id": 2263653,
      "sentence_offsets": [
        "[0, 80]",
        "[82, 95]"
      ]
    },
    {
      "tweet_id": 2263651,
      "sentence_offsets": [
        "[0, 43]",
        "[44, 68]",
        "[69, 134]"
      ]
    },
    {
      "tweet_id": 2263652,
      "sentence_offsets": [
        "[0, 57]"
      ]
    },
    {
      "tweet_id": 2263654,
      "sentence_offsets": [
        "[0, 14]",
        "[15, 114]"
      ]
    },
    {
      "tweet_id": 2263655,
      "sentence_offsets": [
        "[0, 24]",
        "[25, 76]"
      ]
    },
    {
      "tweet_id": 2263656,
      "sentence_offsets": [
        "[0, 42]",
        "[43, 132]"
      ]
    },
    {
      "tweet_id": 2263657,
      "sentence_offsets": [
        "[0, 67]",
        "[68, 118]",
        "[119, 163]",
        "[164, 177]"
      ]
    },
    {
      "tweet_id": 2263658,
      "sentence_offsets": [
        "[0, 16]",
        "[17, 45]",
        "[46, 108]",
        "[109, 110]"
      ]
    }
  ],
  "annotations": [
    {
      "extractive": [
        {
          "tweet_id": 2263653,
          "sentence_offset": "[0, 80]"
        },
        {
          "tweet_id": 2263654,
          "sentence_offset": "[15, 114]"
        },
        {
          "tweet_id": 2263656,
          "sentence_offset": "[43, 132]"
        },
        {
          "tweet_id": 2263658,
          "sentence_offset": "[46, 108]"
        }
      ],
      "abstractive": [
        "Customer is complaining that the watchlist is not updated with new episodes from past two days.",
        "Agent informed that the team is working hard to investigate to show new episodes on page."
      ]
    },
    {
      "extractive": [
        {
          "tweet_id": 2263653,
          "sentence_offset": "[0, 80]"
        },
        {
          "tweet_id": 2263651,
          "sentence_offset": "[69, 134]"
        },
        {
          "tweet_id": 2263654,
          "sentence_offset": "[15, 114]"
        },
        {
          "tweet_id": 2263655,
          "sentence_offset": "[25, 76]"
        },
        {
          "tweet_id": 2263656,
          "sentence_offset": "[43, 132]"
        }
      ],
      "abstractive": [
        "Customer is complaining that my watch list is not updating with the new episodes.",
        "Agent updated to recommend checking the show page for these shows as the new eps will be there."
      ]
    },
    {
      "extractive": [
        {
          "tweet_id": 2263653,
          "sentence_offset": "[0, 80]"
        },
        {
          "tweet_id": 2263651,
          "sentence_offset": "[69, 134]"
        },
        {
          "tweet_id": 2263654,
          "sentence_offset": "[15, 114]"
        },
        {
          "tweet_id": 2263656,
          "sentence_offset": "[43, 132]"
        },
        {
          "tweet_id": 2263657,
          "sentence_offset": "[0, 67]"
        }
      ],
      "abstractive": [
        "Customer is complaining that he is not getting the updated watch list with new episodes for some shows.",
        "Agent assures that their team is working on it to resolve the issue and requests to navigate manually in the meantime."
      ]
    }
  ]
}

License

Dataset released under the CDLA-Sharing license https://cdla.io/sharing-1-0/

Disclaimer

IBM is not responsible for the content of the data, nor for any claim related to the data (including claims related to alleged intellectual property or privacy breach).

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
tweet_sum_data_files		tweet_sum_data_files
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
tweet_sum_processor.py		tweet_sum_processor.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

TweetSumm

TweetSumm format

TweetSumm processing

TweetSumm entry sample

License

Disclaimer

About

Releases

Packages

Contributors 2

Languages

License

guyfe/Tweetsumm

Folders and files

Latest commit

History

Repository files navigation

TweetSumm

TweetSumm format

TweetSumm processing

TweetSumm entry sample

License

Disclaimer

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages