microsoft/Azure_OpenAI_Summarization

Introduction

The repository is structured to facilitate document summarization, featuring an application for user interaction and a core algorithm suite for processing documents of varying lengths. It includes:

  1. An application interface that allows users to input text and receive summarized versions, serving as a demonstration platform.
  2. A 'summary' folder containing the primary codebase, which consists of two classes:
     a. The 'doc_summary' class, which contains the algorithms for summarizing long, medium, and short documents.
     b. The 'utilities' class, which provides common methods used across the summarization processes.
  3. Three Jupyter notebooks named evaluation_longdocs.ipynb, evaluation_mediumdocs.ipynb, and evaluation_shortdocs.ipynb, each using a subset of a dataset of 203,037 scientific research papers sourced from Hugging Face's repository of arXiv papers. The documents are categorized by token count into long (20,000-100,000 tokens), medium (3,500-20,000 tokens), and short (100-3,500 tokens).

The GPT-3.5 4k model is used for summarization. The long-document algorithm uses k-means or hierarchical clustering to identify the key sections of a document, then applies a map-reduce strategy to summarize them. Medium documents are processed with a map-reduce approach, while short documents use the stuffing method, in which the whole document is placed in a single prompt.

To validate these algorithms, a random sample of around 50 documents is drawn from each category. The selected documents are summarized with the algorithms described above, and the results are compared to the human-written abstracts. Summary quality is assessed with ROUGE and BERT scores, and coherence, fluency, relevance, and consistency are additionally rated using the GPT-3.5 4k model. Together, these pieces provide a framework for document summarization research and development.
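
To illustrate what the ROUGE comparison measures, here is a minimal sketch of ROUGE-1 (unigram overlap) written from scratch. It is not the notebooks' evaluation code, which may rely on a dedicated package. The function name `rouge1` and the lowercase whitespace tokenization are assumptions made for this example.

```python
from collections import Counter


def rouge1(candidate: str, reference: str) -> dict:
    """Return ROUGE-1 precision, recall, and F1 using lowercase whitespace tokens."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: a token counts at most as often as it appears in both texts.
    overlap = sum((cand_counts & ref_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Generated summary (candidate) versus a human-written abstract (reference).
scores = rouge1(
    "the model summarizes long papers",
    "the model condenses long scientific papers",
)
```

Here four of the five candidate tokens also appear in the six-token reference, so precision is 4/5 and recall is 4/6. Real evaluations usually add stemming and report ROUGE-2 and ROUGE-L as well.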

Build & Run:

To use the repository:

  1. Clone the repository to your local machine.
  2. Create a virtual environment and install the necessary packages listed in the requirements.txt file.
  3. Locate the config.json file within the config folder and update it with your OpenAI configuration values.
  4. Once the setup is complete, you can run the notebooks or application as intended.
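
Before running anything, it can help to confirm that the configuration file is filled in. The sketch below loads a JSON file and reports empty or missing values. It assumes the file holds a flat JSON object, and the key names `api_key` and `endpoint` are hypothetical placeholders, not the repository's actual keys; check config.json for the real ones.

```python
import json
import tempfile
from pathlib import Path


def load_config(path, required_keys):
    """Load a flat JSON config and fail early if a required value is missing or empty."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [key for key in required_keys if not config.get(key)]
    if missing:
        raise ValueError(f"Missing values in {path}: {', '.join(missing)}")
    return config


# Demo on a throwaway file; "api_key" and "endpoint" are hypothetical key names.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "config.json"
    demo_path.write_text(
        json.dumps({"api_key": "<your-key>", "endpoint": ""}), encoding="utf-8"
    )
    try:
        load_config(demo_path, ["api_key", "endpoint"])
        error_message = None
    except ValueError as exc:
        # The empty "endpoint" value is reported here.
        error_message = str(exc)
```

Against the real file, the call would look like `load_config("config/config.json", [...])` with the key names that config.json actually defines.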

Application:

  1. The application is built with the **Streamlit** library.
  2. Open a command prompt and activate the virtual environment you created during setup.
  3. Launch the application by typing `streamlit run generate_summary`.
  4. The application interface will open in a web browser.
  5. To generate a summary, input the desired text and click the "Submit" button.

Running Notebooks for Document Summarization

In this section, we discuss how to run notebooks for document summarization. The documents are categorized based on their token count into three groups:

  1. Long Documents (20,000-100,000 tokens)
  2. Medium Documents (3,500-20,000 tokens)
  3. Short Documents (100-3,500 tokens)
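
The grouping above can be sketched as a small helper. The ranges as written share their endpoints (3,500 and 20,000), so this example assumes the lower bound of each range is inclusive and the shared boundary belongs to the larger category. The function name `categorize_document` is hypothetical, and how the notebooks count tokens is not shown here.

```python
def categorize_document(token_count: int) -> str:
    """Map a token count to the length category used in this README."""
    if 100 <= token_count < 3_500:
        return "short"
    if 3_500 <= token_count < 20_000:
        return "medium"
    if 20_000 <= token_count <= 100_000:
        return "long"
    # Outside the 100-100,000 token range covered by the notebooks.
    return "out_of_range"


categories = [categorize_document(n) for n in (250, 3_500, 45_000, 150_000)]
```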

Dataset and Source

The notebooks use a subset of a dataset containing 203,037 scientific research papers. These papers are sourced from Hugging Face's repository of arXiv papers (Scientific papers datasets).

Steps for Running Notebooks

  1. Before running the notebooks, ensure that you have set up the appropriate Python interpreter within your virtual environment.

Evaluation Notebooks

We have three evaluation notebooks, each targeting a specific document length:

1. evaluation_longdocs.ipynb

  • This notebook focuses on summarizing long documents.
  • The summarization algorithm leverages k-means/hierarchical clustering to identify key sections within the document.
  • A map-reduce strategy is then applied to abstract the content.
  • To validate the effectiveness of the algorithms, we randomly select around 50 documents.
  • The generated summaries are evaluated against the provided dataset summaries using BERT scores and ROUGE scores.
  • Additionally, we assess coherence, fluency, consistency, and relevance using GPT-3.5.
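
The clustering step can be pictured with the sketch below: split the document into chunks, embed each chunk, run k-means, and keep the chunk closest to each centroid as a "key section". This is one common way to do it and an assumption about the approach, not the code in the 'doc_summary' class. The embeddings are toy 2-D vectors, the k-means is a plain-Python version with a deterministic start, and the names `kmeans` and `select_key_chunks` are hypothetical.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means on lists of floats; returns the final centroids."""
    # Deterministic start: take k points spread evenly through the input order.
    step = len(points) / k
    centroids = [list(points[int(i * step)]) for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids


def select_key_chunks(embeddings, k):
    """Return indices, in document order, of the chunk closest to each centroid."""
    centroids = kmeans(embeddings, k)
    chosen = {
        min(range(len(embeddings)), key=lambda i: math.dist(embeddings[i], centroid))
        for centroid in centroids
    }
    return sorted(chosen)


# Toy 2-D "embeddings" for nine chunks that form three well-separated groups.
toy_embeddings = [
    [0.0, 0.0], [1.0, 0.0], [2.0, 0.0],
    [10.0, 10.0], [11.0, 10.0], [12.0, 10.0],
    [20.0, 0.0], [21.0, 0.0], [22.0, 0.0],
]
key_chunks = select_key_chunks(toy_embeddings, k=3)
```

The selected chunks would then be passed to the map-reduce step, which keeps the number of model calls bounded by `k` rather than by document length.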

2. evaluation_mediumdocs.ipynb

  • For medium-length documents, we employ a map-reduce approach for summarization.
  • Similar to the long document evaluation, we randomly select around 50 documents.
  • The generated summaries are evaluated using BERT scores and ROUGE scores. They are also evaluated for coherence, fluency, consistency, and relevance using GPT-3.5.
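
A map-reduce summarizer can be sketched as follows: summarize each chunk independently (map), then summarize the concatenated partial summaries (reduce). This is a generic illustration under stated assumptions, not the repository's implementation. The model call is replaced by a stub so the example runs offline, chunk size is measured in whitespace-separated words rather than model tokens, and all function names are hypothetical.

```python
from typing import Callable, List


def split_into_chunks(text: str, max_words: int) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def map_reduce_summarize(
    text: str, summarize: Callable[[str], str], max_words: int = 1500
) -> str:
    """Summarize each chunk (map), then summarize the joined partial summaries (reduce)."""
    chunks = split_into_chunks(text, max_words)
    if len(chunks) <= 1:
        # Small enough for a single call; no reduce step needed.
        return summarize(text)
    partial_summaries = [summarize(chunk) for chunk in chunks]
    return summarize("\n".join(partial_summaries))


# Stand-in for a model call: keeps only the first and last word of its input.
def fake_summarize(text: str) -> str:
    words = text.split()
    return f"{words[0]} {words[-1]}"


result = map_reduce_summarize("a b c d e f g h i j", fake_summarize, max_words=5)
```

With the stub, the two chunks reduce to "a e" and "f j", and the final reduce call sees both partial summaries, which is the property that lets the approach cover a document larger than the model's context window.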

3. evaluation_shortdocs.ipynb

  • Short documents are summarized using a technique called stuffing.
  • We evaluate a random sample of 50 documents.
  • The summaries are compared to the dataset summaries using BERT scores and ROUGE scores. Coherence, fluency, consistency, and relevance of the summaries are evaluated using GPT-3.5.
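
Stuffing means placing the entire document into a single prompt and making one model call. The sketch below shows the idea with a stub in place of the model so that it runs offline. The prompt wording, the function names, and the word-count length check are assumptions for illustration, not the repository's code.

```python
from typing import Callable


def build_stuffing_prompt(document: str) -> str:
    """Place the entire document into one prompt (hypothetical wording)."""
    return f"Summarize the following document.\n\nDocument:\n{document}\n\nSummary:"


def stuff_summarize(
    document: str, call_model: Callable[[str], str], max_tokens: int = 3500
) -> str:
    """Summarize in a single model call, refusing documents that are too long."""
    # Whitespace word count is only a rough stand-in for a real tokenizer.
    if len(document.split()) > max_tokens:
        raise ValueError("Document too long for stuffing; use a chunked strategy instead.")
    return call_model(build_stuffing_prompt(document))


# Stub model that records the prompt it receives.
sent_prompts = []


def fake_model(prompt: str) -> str:
    sent_prompts.append(prompt)
    return "stub summary"


summary = stuff_summarize("A short paper about summarization.", fake_model)
```

The length guard reflects why stuffing is reserved for the short category: the document and the instructions must fit in the model's context window together.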

Contact

For more details or help deploying, contact the following:

Acknowledgements

This repository is built in part using the following frameworks:

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.
