mkreager/nlp-summarization

Automated Lecture Video Summarization

Abstract

We present a sophisticated, yet pragmatic, solution for automatic summarization of lecture videos combining the state-of-the-art long document Transformer model, LED, with the long-proven graph-based model, TextRank. The model outputs a web page that includes a high-level abstract summary generated by LED, with key words/phrases and a detailed summary body extracted by TextRank. Summary body sentences are hyperlinked to the source sentences in the full transcript, which in turn are hyperlinked to the timestamps in the source video. Our statistical (ROUGE) and qualitative assessment shows that our model provides a significant improvement over BERT Extractive Summarizer.

Summary of techniques

Pre-processing

Fetch the transcript from a link to the YouTube video and insert it into a Pandas dataframe. Prepend timestamps to phrases in a new column in the dataframe. Join the original text column and the timestamp-prepended text column each into a continuous string. Then segment both strings into sentences using NLTK and insert them into a new dataframe. This gives segmented original sentences matched up with segmented timestamp-embedded sentences, which allows us to link each sentence back to the video at the approximate time of utterance.
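The timestamp alignment can be sketched as follows. This is a minimal illustration, not the notebook's code: it assumes each transcript segment is a dict with `text` and `start` (seconds) keys, uses a hypothetical `[[seconds]]` marker format, works on plain lists rather than a Pandas dataframe, and swaps in a naive regex splitter for NLTK's sentence tokenizer so that it runs with the standard library only.

```python
import re

# Hypothetical marker format: "[[<seconds>]] " prepended to each phrase.
MARK = re.compile(r"\[\[(\d+)\]\]\s*")

def embed_timestamps(segments):
    """Join all phrases into one string, each prefixed with its start time."""
    return " ".join(f"[[{int(seg['start'])}]] {seg['text']}" for seg in segments)

def split_sentences(text):
    """Naive stand-in for NLTK sentence segmentation (splits after . ! ?)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def align_sentences(segments):
    """Return one row per sentence with its approximate start time."""
    rows = []
    last_seen = 0  # most recent timestamp seen in earlier sentences
    for stamped in split_sentences(embed_timestamps(segments)):
        marks = [int(m) for m in MARK.findall(stamped)]
        # A sentence that starts mid-phrase inherits the previous timestamp.
        start = marks[0] if MARK.match(stamped) else last_seen
        if marks:
            last_seen = marks[-1]
        rows.append({"sentence": MARK.sub("", stamped).strip(), "start": start})
    return rows

segments = [
    {"text": "Welcome to the", "start": 0.0},
    {"text": "lecture. Today we cover", "start": 3.2},
    {"text": "graphs. Let us begin.", "start": 7.9},
]
for row in align_sentences(segments):
    print(row["start"], row["sentence"])
```

Because caption phrases rarely end on sentence boundaries, the start time attached to a sentence is only approximate: it is the start of the phrase in which the sentence begins.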

Extractive summarization

We use the PyTextRank implementation of TextRank with our spaCy pipeline to extract the top ranked words/phrases and sentences from the pre-segmented input sentences. We set the limit on returned words/phrases to 20 and the limit on sentences to 30% of the original sentence count. We use a ratio of 5:1 for ranked phrases to sentences for the graph model vertices. The resulting output is 144 sentences, returned in order of original appearance (not ranked order), containing 6434 words (57% fewer than the original text).
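To illustrate the idea of graph-based extraction, here is a minimal sketch of the classic sentence-level TextRank variant: sentences are vertices, edges are weighted by word overlap, and PageRank scores decide which sentences to keep. It is not the PyTextRank algorithm used in the notebook (which ranks phrases first and then scores sentences against them), and the function names and the tokenizer are assumptions made for this example. It does mirror two choices described above: the sentence limit is a ratio of the input count, and the selected sentences are returned in their original order.

```python
import math
import re

def tokenize(sentence):
    """Lower-cased word set; a crude stand-in for a spaCy pipeline."""
    return set(re.findall(r"[a-z']+", sentence.lower()))

def similarity(a, b):
    """Word overlap normalized by log sentence lengths."""
    if len(a) < 2 or len(b) < 2:
        return 0.0  # avoids log(1) = 0 in the denominator
    return len(a & b) / (math.log(len(a)) + math.log(len(b)))

def pagerank(weights, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration; isolated vertices keep the base score."""
    n = len(weights)
    scores = [1.0 / n] * n
    out_sums = [sum(row) for row in weights]
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(
                weights[j][i] / out_sums[j] * scores[j]
                for j in range(n)
                if out_sums[j] > 0
            )
            for i in range(n)
        ]
    return scores

def extract_summary(sentences, ratio=0.3):
    """Keep the top `ratio` of sentences by rank, in order of appearance."""
    if not sentences:
        return []
    tokens = [tokenize(s) for s in sentences]
    n = len(sentences)
    weights = [
        [similarity(tokens[i], tokens[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    scores = pagerank(weights)
    keep = max(1, round(n * ratio))
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:keep]
    return [sentences[i] for i in sorted(top)]  # original order, not rank order

sentences = [
    "Graphs model relationships between objects.",
    "A graph has vertices and edges.",
    "Edges connect pairs of vertices in a graph.",
    "Weighted edges carry a cost between vertices.",
    "Lunch break!",
    "Shortest path algorithms use weighted edges in graphs.",
]
print(extract_summary(sentences, ratio=0.5))
```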

Abstractive summarization

We take the TextRank output sentences from the previous step as input to a pre-trained LED model from the Hugging Face Transformers library, fine-tuned on the PubMed dataset. This step outputs a short abstractive summary of the transcript.
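A rough sketch of this step with the Transformers API is shown below. The checkpoint name, the generation settings (`num_beams`, `max_summary_tokens`), and the function names are assumptions for illustration; the notebook may use different values. Putting global attention on the first token only follows the usual recommendation for LED summarization.

```python
# Assumed checkpoint: an LED model fine-tuned on PubMed from the Hugging Face Hub.
CHECKPOINT = "patrickvonplaten/led-large-16384-pubmed"

def global_attention_flags(num_tokens):
    """1 for the first token (global attention), 0 (local attention) elsewhere."""
    return [1] + [0] * (num_tokens - 1)

def abstract_from_sentences(sentences, max_summary_tokens=256):
    """Generate a short abstractive summary from the extracted sentences."""
    # Imported lazily: these packages and the model download are heavy.
    import torch
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
    model = AutoModelForSeq2SeqLM.from_pretrained(CHECKPOINT)

    # LED accepts up to 16384 input tokens; longer inputs are truncated.
    inputs = tokenizer(
        " ".join(sentences),
        return_tensors="pt",
        truncation=True,
        max_length=16384,
    )
    flags = global_attention_flags(inputs["input_ids"].shape[1])
    global_attention_mask = torch.tensor([flags])

    summary_ids = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        global_attention_mask=global_attention_mask,
        num_beams=4,
        max_length=max_summary_tokens,
    )
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
```

Feeding the extractive output rather than the full transcript keeps the input shorter and already focused on the highest-ranked sentences.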

Document output

The final output is a minimally formatted HTML web page with the following sections: Overview (a short introduction); Abstract (the abstractive summary); Keywords/phrases (top 20 words/phrases); Summary (the extractive summary); Full Transcript (the full video transcript). Sentences are grouped into paragraphs in the ‘Summary’ section based on their positions in the transcript. Paragraphs give the user visual cues for where context gaps may exist. Long paragraphs indicate several sentences in close proximity with minimal pruning between them. Short paragraphs and ‘orphaned’ sentences suggest that more context may be needed. Sentences in the ‘Summary’ section are hyperlinked to the ‘Full Transcript’ section. Sentences in the ‘Full Transcript’ section are hyperlinked to the video at the approximate time of utterance, using the timestamps that we extracted in the pre-processing step. The hyperlinks are provided as a convenience for the user who wishes to obtain additional context or clarification.
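The paragraph grouping and the two levels of hyperlinks could look roughly like this. The grouping rule (start a new paragraph when the gap between kept sentence indices exceeds `max_gap`), the `s<index>` anchor scheme, and the function names are assumptions; the README does not specify them. The `&t=<seconds>s` query parameter is the standard way to link to a time offset in a YouTube video.

```python
from html import escape

def group_by_position(indices, max_gap=1):
    """Group kept sentence indices into paragraphs of near-adjacent sentences."""
    paragraphs = []
    for idx in sorted(indices):
        if paragraphs and idx - paragraphs[-1][-1] <= max_gap:
            paragraphs[-1].append(idx)  # close to the previous sentence
        else:
            paragraphs.append([idx])  # gap too large: start a new paragraph
    return paragraphs

def summary_link(idx, sentence):
    """Summary sentence linking to its anchor in the full transcript."""
    return f'<a href="#s{idx}">{escape(sentence)}</a>'

def transcript_link(idx, sentence, start, video_id):
    """Transcript sentence (anchor target) linking to the video at `start` seconds."""
    url = f"https://www.youtube.com/watch?v={video_id}&t={int(start)}s"
    return f'<a id="s{idx}" href="{escape(url)}">{escape(sentence)}</a>'

# Indices 0-2 form a long paragraph, 7 is an 'orphaned' sentence.
print(group_by_position([0, 1, 2, 7, 12, 13]))
print(transcript_link(7, "Hi", 93.4, "abc123"))
```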

Evaluation

To evaluate our model, we conduct a qualitative analysis of the abstractive and extractive summaries (checking for factuality, faithfulness and fluency), as well as a statistical evaluation (ROUGE) of the extractive summary.
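For readers unfamiliar with ROUGE, the following simplified sketch computes ROUGE-N precision, recall and F1 from clipped n-gram overlap between a candidate summary and a reference such as the human summary. It omits stemming and other options found in standard ROUGE packages, so its numbers will not match the project's reported scores exactly; the function name and tokenizer are assumptions for this example.

```python
import re
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """Simplified ROUGE-N: clipped n-gram overlap, no stemming."""
    def ngrams(text):
        tokens = re.findall(r"\w+", text.lower())
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())  # counts clipped to the smaller side
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

scores = rouge_n("the cat sat on the mat", "the cat lay on the mat", n=1)
print(scores)
```

Recall measures how much of the reference is covered by the candidate, and precision measures how much of the candidate appears in the reference.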

Explanation of files

There are two IPython notebook files. The ‘Summarization.ipynb’ file is the main code file, and ‘Bert_Extractive_Summarizer.ipynb’ is for comparison with another recent approach for lecture summarization.

The IPython notebooks are also available as HTML: Summarization and Bert_Extractive_Summarizer. The summarization output is shown in the summary.html file, which is generated by running the Summarization IPython notebook. The project report is also available in the repository.

There are various text files (*.txt) in the repository. The 'human.txt' file is the manually-extracted human summary for comparison with our model. The 'summary_textrank_144_6434.txt' file is the TextRank output generated by the ‘Summarization.ipynb’ file. The 'summary_positionrank_144_6239.txt' file is the PositionRank output generated by the ‘Summarization.ipynb’ file (as an alternative to TextRank). Finally, the 'summary_bert_144_3713.txt' file is the BERT Extractive Summarizer output generated by the ‘Bert_Extractive_Summarizer.ipynb’ file.
