We present a sophisticated, yet pragmatic, solution for automatic summarization of lecture videos combining the state-of-the-art long document Transformer model, LED, with the long-proven graph-based model, TextRank. The model outputs a web page that includes a high-level abstract summary generated by LED, with key words/phrases and a detailed summary body extracted by TextRank. Summary body sentences are hyperlinked to the source sentences in the full transcript, which in turn are hyperlinked to the timestamps in the source video. Our statistical (ROUGE) and qualitative assessment shows that our model provides a significant improvement over BERT Extractive Summarizer.
We fetch the transcript from a link to the YouTube video and load it into a Pandas dataframe. We prepend timestamps to the phrases in a new dataframe column, then join the original-text column and the timestamp-prepended column into two continuous strings. We segment both strings into sentences using NLTK and insert the results into a new dataframe, so that each original sentence is matched with its timestamp-embedded counterpart. This allows us to link each sentence back to the video at the approximate time of utterance.
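The timestamp-embedding idea can be sketched with the standard library alone. This is a minimal illustration, not the notebook's code: the segment fields (`text`, `start`), the `[[seconds]]` marker format, and the regex sentence splitter (a naive stand-in for NLTK) are all assumptions, and a plain list takes the place of the Pandas dataframe. Unlike the description above, the sketch splits only the timestamped string and strips the markers afterwards, which keeps the two versions of each sentence aligned by construction.

```python
import re

# Hypothetical transcript segments; the "text" and "start" field names are assumptions.
segments = [
    {"text": "Welcome to the lecture.", "start": 0.0},
    {"text": "Today we cover graphs", "start": 2.5},
    {"text": "and ranking. Let us begin", "start": 5.0},
    {"text": "with TextRank.", "start": 7.5},
]

# The "[[seconds]]" marker format is an illustrative choice.
MARKER = re.compile(r"\[\[(\d+(?:\.\d+)?)\]\]\s*")


def embed_timestamps(segments):
    """Prepend a timestamp marker to every phrase and join into one string."""
    return " ".join(f"[[{seg['start']}]] {seg['text']}" for seg in segments)


def split_sentences(text):
    """Naive splitter (stand-in for NLTK): break after ., ! or ? plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def align_sentences(segments):
    """Return (start_seconds, plain_sentence) pairs."""
    rows = []
    last_time = 0.0
    for sentence in split_sentences(embed_timestamps(segments)):
        times = [float(t) for t in MARKER.findall(sentence)]
        # A sentence that starts mid-phrase inherits the most recent marker.
        start = times[0] if MARKER.match(sentence) else last_time
        if times:
            last_time = times[-1]
        rows.append((start, MARKER.sub("", sentence).strip()))
    return rows


for start, sentence in align_sentences(segments):
    print(f"{start:6.1f}s  {sentence}")
```

Each pair holds the approximate start time and the clean sentence, which is all that is needed later to build a link back into the video.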
We use the PyTextRank implementation of TextRank with our spaCy pipeline to extract the top ranked words/phrases and sentences from the pre-segmented input sentences. We set the limit on returned words/phrases to 20 and the limit on sentences to 30% of the original sentence count. We use a ratio of 5:1 for ranked phrases to sentences for the graph model vertices. The resulting output is 144 sentences, returned in order of original appearance (not ranked order), containing 6434 words (57% fewer than the original text).
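To make the ranking and selection step concrete, here is a small standard-library sketch. It is not PyTextRank: it swaps in the sentence-level TextRank formulation (word-overlap similarity plus PageRank-style iteration), and the sample sentences, damping factor, and iteration count are illustrative assumptions. What it does share with the step above is the 30% sentence limit and the return of selected sentences in order of original appearance rather than ranked order.

```python
import math
import re

# Hypothetical input sentences, already segmented.
sentences = [
    "Graph algorithms rank the nodes of a network.",
    "TextRank builds a graph from the sentences of a text.",
    "Each sentence is a node in the graph.",
    "Edges connect sentences that share words.",
    "Thanks everyone for coming today.",
    "PageRank scores the nodes of the graph iteratively.",
    "Sentences with high scores form the summary.",
    "The summary keeps the original order of the sentences.",
    "Coffee is available in the hallway.",
    "Ranking by score alone would scramble the order of the text.",
]


def tokenize(sentence):
    return set(re.findall(r"[a-z']+", sentence.lower()))


def similarity(a, b):
    """Word overlap normalized by sentence length (log-scaled)."""
    if len(a) < 2 or len(b) < 2:
        return 0.0
    return len(a & b) / (math.log(len(a)) + math.log(len(b)))


def rank_sentences(sentences, damping=0.85, iterations=50):
    """Score sentences by PageRank-style iteration over the similarity graph."""
    tokens = [tokenize(s) for s in sentences]
    n = len(sentences)
    weights = [
        [similarity(tokens[i], tokens[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


def extract_summary(sentences, ratio=0.3):
    """Keep the top `ratio` of sentences, returned in original order."""
    scores = rank_sentences(sentences)
    keep = max(1, round(len(sentences) * ratio))
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:keep]
    return [sentences[i] for i in sorted(top)]


for sentence in extract_summary(sentences):
    print(sentence)
```

Sorting the selected indices before returning is what preserves the lecture's narrative order, at the cost of hiding which sentences ranked highest.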
We pass the TextRank output sentences from the previous step to a pre-trained LED model from the Hugging Face Transformers library that has been fine-tuned on the PubMed dataset. This step outputs a short abstractive summary of the transcript.
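A rough sketch of this step with the Transformers library might look as follows. The checkpoint name, the generation settings (`num_beams`, `max_length`), and the helper names are assumptions for illustration, since the text above does not specify them. The heavy imports are placed inside the function so that the module loads without downloading a model.

```python
# Assumed checkpoint: a publicly available LED model fine-tuned on PubMed.
# The checkpoint actually used in the notebook may differ.
CHECKPOINT = "patrickvonplaten/led-large-16384-pubmed"


def join_sentences(sentences):
    """Concatenate the extracted sentences into a single input string."""
    return " ".join(s.strip() for s in sentences if s.strip())


def summarize_with_led(sentences, checkpoint=CHECKPOINT, max_summary_tokens=512):
    """Generate an abstractive summary; requires `torch` and `transformers`."""
    import torch
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

    inputs = tokenizer(
        join_sentences(sentences),
        return_tensors="pt",
        truncation=True,
        max_length=16384,  # LED's long-input limit for this checkpoint family
    )
    # LED combines local attention with global attention on chosen tokens;
    # putting global attention on the first token is the usual choice
    # for summarization.
    global_attention_mask = torch.zeros_like(inputs["input_ids"])
    global_attention_mask[:, 0] = 1

    summary_ids = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        global_attention_mask=global_attention_mask,
        num_beams=4,  # illustrative setting
        max_length=max_summary_tokens,
    )
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
```

Calling `summarize_with_led(textrank_sentences)` on the extractive output would return the abstract as a single string. Feeding the model the pruned TextRank sentences rather than the full transcript keeps the input within the model's token limit.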
The final output is a minimally-formatted HTML web page with the following sections: Overview (a short introduction); Abstract (the abstractive summary); Keywords/phrases (top 20 words/phrases); Summary (the extractive summary); Full Transcript (the full video transcript). Sentences are grouped into paragraphs in the ‘Summary’ section based on their positional locations. Paragraphs provide visual cues to the user for where context gaps may exist. Long paragraphs indicate several sentences in close proximity with minimal pruning between them. Short paragraphs and ‘orphaned’ sentences suggest that more context may be needed. Sentences in the ‘Summary’ section are hyperlinked to the ‘Full Transcript’ section. Sentences in the ‘Full Transcript’ section are hyperlinked to the video at the approximate time of utterance, using the timestamps that we extracted in the preprocessing step. The hyperlinks are provided as a convenience for the user who wishes to obtain additional context or clarification.
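The paragraph grouping and the two levels of hyperlinks can be sketched as below. The grouping rule (start a new paragraph whenever the gap between kept sentence positions exceeds `max_gap`), the anchor naming scheme, and the function names are hypothetical; the text above only says that grouping is based on positional locations.

```python
from html import escape


def group_by_position(indices, max_gap=1):
    """Group kept sentence positions into paragraphs of near-adjacent sentences."""
    paragraphs = []
    for i in sorted(indices):
        if paragraphs and i - paragraphs[-1][-1] <= max_gap:
            paragraphs[-1].append(i)
        else:
            paragraphs.append([i])
    return paragraphs


def summary_link(index, sentence):
    """Summary sentence linking to its anchor in the full transcript."""
    return f'<a href="#s{index}">{escape(sentence)}</a>'


def transcript_link(index, sentence, video_id, start_seconds):
    """Transcript sentence linking to the video at the approximate time."""
    url = f"https://www.youtube.com/watch?v={video_id}&amp;t={int(start_seconds)}s"
    return f'<a id="s{index}" href="{url}">{escape(sentence)}</a>'


def render_summary(sentences, kept_indices):
    """Render the kept sentences as <p> blocks, one per positional group."""
    blocks = []
    for group in group_by_position(kept_indices):
        links = " ".join(summary_link(i, sentences[i]) for i in group)
        blocks.append(f"<p>{links}</p>")
    return "\n".join(blocks)
```

With this rule, kept positions `[0, 1, 2, 5, 9, 10]` yield three paragraphs, and the single-sentence paragraph for position 5 is the kind of 'orphaned' sentence that signals a possible context gap.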
To evaluate our model, we conduct a qualitative analysis of the abstractive and extractive summaries (checking for factuality, faithfulness and fluency), as well as a statistical evaluation (ROUGE) of the extractive summary.
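For readers unfamiliar with ROUGE, the following sketch computes ROUGE-1 (unigram overlap) between a candidate summary and a reference summary. Which ROUGE variants and which library the notebooks use is not stated here, so this is only an illustration of the metric, with a simple regex tokenizer as an assumption.

```python
import re
from collections import Counter


def rouge_1(candidate, reference):
    """Unigram-overlap precision, recall and F1 between two texts."""
    cand = Counter(re.findall(r"\w+", candidate.lower()))
    ref = Counter(re.findall(r"\w+", reference.lower()))
    # Clipped overlap: each word counts at most as often as it appears in both.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge_1("the cat sat on the mat", "the cat lay on the mat")
print(scores)
```

In this setting the candidate would be a model's extractive summary and the reference a human-written one. Recall rewards covering the reference content, while precision penalizes padding the summary with unrelated words.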
There are two IPython notebook files. The ‘Summarization.ipynb’ file is the main code file, and ‘Bert_Extractive_Summarizer.ipynb’ is for comparison with another recent approach for lecture summarization.
The IPython notebooks are also available as HTML: Summarization and Bert_Extractive_Summarizer. The summarization output is shown in the summary.html file, which is generated by running the Summarization IPython notebook. The project report is also available in the repository.
There are various text files (*.txt) in the repository. The ‘human.txt’ file is the manually-extracted human summary for comparison with our model. The ‘summary_textrank_144_6434.txt’ file is the TextRank output generated by the ‘Summarization.ipynb’ file. The ‘summary_positionrank_144_6239.txt’ file is the PositionRank output generated by the ‘Summarization.ipynb’ file (as an alternative to TextRank). Finally, the ‘summary_bert_144_3713.txt’ file is the BERT Extractive Summarizer output generated by the ‘Bert_Extractive_Summarizer.ipynb’ file.