<a href="https://colab.research.google.com/github/paulowoicho/msc_project/blob/master/Summarization_Task_Results_and_Evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Prepare the Data

The data is hosted in a Google Cloud Storage bucket. It was preprocessed beforehand to convert the original JSON files into a single CSV file with episode_id and transcript columns. The code for that conversion is on GitHub.

The dataset contains over 100k episodes. Episodes are grouped into shows, and one show can span several episodes.
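The conversion script itself is not part of this notebook. As a rough sketch of what such a step could look like, the snippet below flattens one transcript file into an episode_id,transcript row. It assumes each JSON file follows a speech-to-text layout with a top-level "results" list whose items hold "alternatives" with a "transcript" string; that layout, the function names, and the sample episode ID are assumptions made for illustration, not the code from the repository.

```python
import csv
import io
import json


def transcript_from_json(raw_json):
    """Concatenate the first transcript alternative of every result chunk."""
    payload = json.loads(raw_json)
    pieces = []
    for result in payload.get("results", []):
        alternatives = result.get("alternatives", [])
        # Assumed: some chunks carry no transcript text, so skip those.
        if alternatives and alternatives[0].get("transcript"):
            pieces.append(alternatives[0]["transcript"].strip())
    return " ".join(pieces)


def write_rows(episodes, handle):
    """Write (episode_id, raw_json) pairs as episode_id,transcript CSV rows."""
    writer = csv.writer(handle)
    writer.writerow(["episode_id", "transcript"])
    for episode_id, raw_json in episodes:
        writer.writerow([episode_id, transcript_from_json(raw_json)])


# Toy input standing in for one episode file (hypothetical content).
sample = json.dumps({"results": [
    {"alternatives": [{"transcript": "Hello and welcome. "}]},
    {"alternatives": [{"transcript": "Today we talk about fear."}]},
    {"alternatives": [{}]},
]})

buffer = io.StringIO()
write_rows([("spotify:episode:EXAMPLE", sample)], buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Writing to an in-memory buffer keeps the sketch self-contained; a real run would open a file on disk and iterate over the episode files instead.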

In [None]:
from google.colab import auth
auth.authenticate_user()

In [None]:
project_id = 'test-281700'
!gcloud config set project {project_id}
!gsutil ls

Updated property [core/project].


To take a quick anonymous survey, run:
  $ gcloud survey

gs://spotify_asr_dataset/
gs://staging.test-281700.appspot.com/
gs://test-281700.appspot.com/


In [None]:
bucket_name = 'spotify_asr_dataset'
#download dataset
!gsutil -m cp -r gs://{bucket_name}/dataset.csv /content/

Copying gs://spotify_asr_dataset/dataset.csv...
/ [1/1 files][  2.9 GiB/  2.9 GiB] 100% Done  27.7 MiB/s ETA 00:00:00           
Operation completed over 1 objects/2.9 GiB.                                      


In [None]:
import pandas as pd
dataset = pd.read_csv('dataset.csv')

In [None]:
dataset.head(5)

Unnamed: 0,episode_id,transcript
0,spotify:episode:399kdfMnjw0KYANZU7CQJ0,It's the mother back a podcast. Well that was...
1,spotify:episode:49wcMBeJfaaL6KFFdsWvac,If you haven't heard about anchor is the easi...
2,spotify:episode:0JOymLFsRdeBVZbEA72ayj,Hello and welcome to the podcast the first ev...
3,spotify:episode:7sHyO8wLeEd1LuxfS8AIls,"Hey, hey. Hey. Hey. Hey, this is your girl Je..."
4,spotify:episode:1WosITIkpJemzZaPh8zAVb,This is the planetary potential podcast for t...


In [None]:
len(dataset)

105360

In [None]:
#download gold summaries
!gsutil -m cp -r gs://{bucket_name}/150gold.tsv /content/

#download metadata for episodes
!gsutil -m cp -r gs://{bucket_name}/metadata.tsv /content/

Copying gs://spotify_asr_dataset/150gold.tsv...
- [1/1 files][379.9 KiB/379.9 KiB] 100% Done                                    
Operation completed over 1 objects/379.9 KiB.                                    
Copying gs://spotify_asr_dataset/metadata.tsv...
\ [1/1 files][112.2 MiB/112.2 MiB] 100% Done                                    
Operation completed over 1 objects/112.2 MiB.                                    


Spotify also provides a set of 150 episodes for which summaries were generated with baseline models. These summaries are compared against the creator-provided episode descriptions, which vary in quality according to Spotify's annotators, who graded them on an EGFB scale (Excellent, Good, Fair, Bad).

The metadata file contains additional information about each episode, which may or may not be useful for training models.
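To compare systems on the EGFB scale, it can help to tally the grades and turn them into a rough score. The sketch below does that for the five LexRank grades visible in the preview of the gold file. The numeric mapping (E=3, G=2, F=1, B=0) is an assumption chosen for illustration; only the ordering of the grades comes from the scale itself.

```python
from collections import Counter

# Assumed numeric mapping; only the ordering E > G > F > B comes from the scale.
GRADE_POINTS = {"E": 3, "G": 2, "F": 1, "B": 0}


def grade_report(grades):
    """Return (count per grade, mean score) for a list of EGFB letters."""
    counts = Counter(grades)
    mean_score = sum(GRADE_POINTS[g] for g in grades) / len(grades)
    return counts, mean_score


# EGFB.1 (LexRank) grades from the first five rows of the gold file.
lexrank_grades = ["G", "E", "B", "F", "B"]
counts, mean_score = grade_report(lexrank_grades)
print(dict(counts), round(mean_score, 2))
```

The same function could be applied to each EGFB column of the full 150-episode file to rank the baselines.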

In [None]:
gold_summaries = pd.read_csv('150gold.tsv', sep='\t')
podcasts_metadata = pd.read_csv('metadata.tsv', sep='\t')

In [None]:
gold_summaries.head(5)

Unnamed: 0,show name,episode name,episode id,creator description,EGFB,lexrank summary,EGFB.1,textrank summary,EGFB.2,lsa summary,EGFB.3,quasi-supervised summary,EGFB.4,supervised summary,EGFB.5
0,Alpha Male Strategies,Passive Aggressive Women & Developing Mental S...,spotify:episode:4KRC1TZ28FavN3J5zLHEtQ,Boost the podcast! Leave a 5-star review on th...,B,All right guys now as y'all guys might know so...,G,There's no such thing as talk about passengers...,F,I'll pay for all you guys who don't know what ...,F,All women a passive-aggressive. When a woman w...,G,Rejection is a woman’s way of saying you’re no...,B
1,I Hope the Day Has a Good YOU!,"If You Are Bored, You Are Boring.",spotify:episode:4tdDQcsBOUVWnA9XrpgTzS,☀️ aGoodYOU.com 🧡 Discuss this episode in 🗣: f...,B,It was the first and last time I ever said tha...,E,The answer usually comes in the form of a whis...,E,It was the first and last time I ever said tha...,F,If you are bored you are boring. Most people m...,E,"If you are bored, you are boring. --- Suppo...",B
2,American English Grammar Review,Prepositions of Movement Review Two American E...,spotify:episode:626YAxomH0HZ6nCW9NLlGY,Prepositions of movement review two is the sec...,F,So off is again the opposite of on what about ...,B,So off is again the opposite of on what about ...,B,Let's start out with into and out of into is t...,B,Prepositions of movement in this lesson. Let's...,B,"In, out, and off refer to services differently...",B
3,Simmers Digest,Simmer's Digest ep. 1.0 : Introductions,spotify:episode:6AUFl7KQWN6pzGFEIEKFQu,Welcome one and all to the newest the first ev...,F,I hope you enjoy it and I am excited to get to...,F,My passion for The Sims 4 Grew From consuming ...,E,So sit back turn your volume up to 11 and let'...,B,Techobabble is the host of the simmers digest ...,F,Welcome to the very first episode of The Simme...,B
4,Kitty's Pod,TXT Run Away Reaction,spotify:episode:6C4V9iKa9qygtvJCngPO93,TXT Run Away Reaction --- This episode is s...,B,And I will tell you that I am very thrilled an...,B,And I will tell you that I am very thrilled an...,B,"But before we get into that, if you haven't he...",B,Anchor is the easiest way to make a podcast. T...,B,Listen to the song Runaway by txt --- This...,B


In [None]:
podcasts_metadata.head(5)

Unnamed: 0,show_uri,show_name,show_description,publisher,language,rss_link,episode_uri,episode_name,episode_description,duration,show_filename_prefix,episode_filename_prefix
0,spotify:show:2NYtxEZyYelR6RMKmjfPLB,Kream in your Koffee,A 20-something blunt female takes on the world...,Katie Houle,['en'],https://anchor.fm/s/11b84b68/podcast/rss,spotify:episode:000A9sRBYdVh66csG2qEdj,1: It’s Christmas Time!,On the first ever episode of Kream in your Kof...,12.700133,show_2NYtxEZyYelR6RMKmjfPLB,000A9sRBYdVh66csG2qEdj
1,spotify:show:15iWCbU7QoO23EndPEO6aN,Morning Cup Of Murder,Ever wonder what murder took place on today in...,Morning Cup Of Murder,['en'],https://anchor.fm/s/b07181c/podcast/rss,spotify:episode:000HP8n3hNIfglT2wSI2cA,The Goleta Postal Facility shootings- January ...,"See something, say something. It’s a mantra ma...",6.019383,show_15iWCbU7QoO23EndPEO6aN,000HP8n3hNIfglT2wSI2cA
2,spotify:show:6vZRgUFTYwbAA79UNCADr4,Inside The 18 : A Podcast for Goalkeepers by G...,Inside the 18 is your source for all things Go...,Inside the 18 GK Media,['en'],https://anchor.fm/s/81a072c/podcast/rss,spotify:episode:001UfOruzkA3Bn1SPjcdfa,Ep.36 - Incorporating a Singular Goalkeeping C...,Today’s episode is a sit down Michael and Omar...,43.616333,show_6vZRgUFTYwbAA79UNCADr4,001UfOruzkA3Bn1SPjcdfa
3,spotify:show:5BvKEjaMSuvUsGROGi2S7s,Arrowhead Live!,Your favorite podcast for everything @Chiefs! ...,Arrowhead Live!,['en-US'],https://anchor.fm/s/917dba4/podcast/rss,spotify:episode:001i89SvIQgDuuyC53hfBm,Episode 1: Arrowhead Live! Debut,Join us as we take a look at all current Chief...,58.1892,show_5BvKEjaMSuvUsGROGi2S7s,001i89SvIQgDuuyC53hfBm
4,spotify:show:7w3h3umpH74veEJcbE6xf4,FBoL,"The comedy podcast about toxic characters, wri...",Emily Edwards,['en'],https://www.fuckboisoflit.com/episodes?format=rss,spotify:episode:0025RWNwe2lnp6HcnfzwzG,"The Lion, The Witch, And The Wardrobe - Ashley...",The modern morality tail of how to stay good f...,51.78205,show_7w3h3umpH74veEJcbE6xf4,0025RWNwe2lnp6HcnfzwzG


In [None]:
#one show can span multiple episodes
podcasts_metadata[podcasts_metadata['show_uri'] == 'spotify:show:2NYtxEZyYelR6RMKmjfPLB']

Unnamed: 0,show_uri,show_name,show_description,publisher,language,rss_link,episode_uri,episode_name,episode_description,duration,show_filename_prefix,episode_filename_prefix
0,spotify:show:2NYtxEZyYelR6RMKmjfPLB,Kream in your Koffee,A 20-something blunt female takes on the world...,Katie Houle,['en'],https://anchor.fm/s/11b84b68/podcast/rss,spotify:episode:000A9sRBYdVh66csG2qEdj,1: It’s Christmas Time!,On the first ever episode of Kream in your Kof...,12.700133,show_2NYtxEZyYelR6RMKmjfPLB,000A9sRBYdVh66csG2qEdj
11957,spotify:show:2NYtxEZyYelR6RMKmjfPLB,Kream in your Koffee,A 20-something blunt female takes on the world...,Katie Houle,['en'],https://anchor.fm/s/11b84b68/podcast/rss,spotify:episode:0sTNg31EACSHfZlt41RHmS,2: Tan Hands Save Lives,The do’s & don’ts of self-tanning. A weird con...,20.016367,show_2NYtxEZyYelR6RMKmjfPLB,0sTNg31EACSHfZlt41RHmS
46132,spotify:show:2NYtxEZyYelR6RMKmjfPLB,Kream in your Koffee,A 20-something blunt female takes on the world...,Katie Houle,['en'],https://anchor.fm/s/11b84b68/podcast/rss,spotify:episode:3Ny7dKZ1QHZwadslXJ8Umf,6: #BYOD (with Liz Pickles),On this week’s episode the gorgeous Liz Pickle...,80.472617,show_2NYtxEZyYelR6RMKmjfPLB,3Ny7dKZ1QHZwadslXJ8Umf


In [None]:
full_dataset = pd.merge(left=podcasts_metadata, right=dataset, how='left', left_on='episode_uri', right_on='episode_id')
del full_dataset['episode_uri']

In [None]:
full_dataset.head(5)

Unnamed: 0,show_uri,show_name,show_description,publisher,language,rss_link,episode_name,episode_description,duration,show_filename_prefix,episode_filename_prefix,episode_id,transcript
0,spotify:show:2NYtxEZyYelR6RMKmjfPLB,Kream in your Koffee,A 20-something blunt female takes on the world...,Katie Houle,['en'],https://anchor.fm/s/11b84b68/podcast/rss,1: It’s Christmas Time!,On the first ever episode of Kream in your Kof...,12.700133,show_2NYtxEZyYelR6RMKmjfPLB,000A9sRBYdVh66csG2qEdj,spotify:episode:000A9sRBYdVh66csG2qEdj,Hello. Hello. Hello everyone. This is Katie a...
1,spotify:show:15iWCbU7QoO23EndPEO6aN,Morning Cup Of Murder,Ever wonder what murder took place on today in...,Morning Cup Of Murder,['en'],https://anchor.fm/s/b07181c/podcast/rss,The Goleta Postal Facility shootings- January ...,"See something, say something. It’s a mantra ma...",6.019383,show_15iWCbU7QoO23EndPEO6aN,000HP8n3hNIfglT2wSI2cA,spotify:episode:000HP8n3hNIfglT2wSI2cA,There were two more murders 15 miles away arr...
2,spotify:show:6vZRgUFTYwbAA79UNCADr4,Inside The 18 : A Podcast for Goalkeepers by G...,Inside the 18 is your source for all things Go...,Inside the 18 GK Media,['en'],https://anchor.fm/s/81a072c/podcast/rss,Ep.36 - Incorporating a Singular Goalkeeping C...,Today’s episode is a sit down Michael and Omar...,43.616333,show_6vZRgUFTYwbAA79UNCADr4,001UfOruzkA3Bn1SPjcdfa,spotify:episode:001UfOruzkA3Bn1SPjcdfa,Welcome to inside the 18. Today's episode is ...
3,spotify:show:5BvKEjaMSuvUsGROGi2S7s,Arrowhead Live!,Your favorite podcast for everything @Chiefs! ...,Arrowhead Live!,['en-US'],https://anchor.fm/s/917dba4/podcast/rss,Episode 1: Arrowhead Live! Debut,Join us as we take a look at all current Chief...,58.1892,show_5BvKEjaMSuvUsGROGi2S7s,001i89SvIQgDuuyC53hfBm,spotify:episode:001i89SvIQgDuuyC53hfBm,Hey cheese fans before we get started. I want...
4,spotify:show:7w3h3umpH74veEJcbE6xf4,FBoL,"The comedy podcast about toxic characters, wri...",Emily Edwards,['en'],https://www.fuckboisoflit.com/episodes?format=rss,"The Lion, The Witch, And The Wardrobe - Ashley...",The modern morality tail of how to stay good f...,51.78205,show_7w3h3umpH74veEJcbE6xf4,0025RWNwe2lnp6HcnfzwzG,spotify:episode:0025RWNwe2lnp6HcnfzwzG,"Sorry to interrupt the show, but I do have to..."


#Recreate Baselines (TextRank, LexRank, LSA)

Three of the extractive baselines used on the dataset are TextRank, LexRank, and Latent Semantic Analysis (LSA). They can be replicated with the Python sumy package.
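For intuition about what a graph-based extractive baseline does, here is a deliberately simplified TextRank sketch using only the standard library. It is not Sumy's implementation: the sentence splitter is a naive regex, the similarity is plain word overlap normalised by log sentence lengths, and all function names are made up for this example.

```python
import math
import re


def split_sentences(text):
    # Naive splitter on ., ! or ? followed by whitespace (assumption for the sketch).
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def similarity(a, b):
    # Word overlap normalised by the log of each sentence's vocabulary size.
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if len(words_a) < 2 or len(words_b) < 2:
        return 0.0
    return len(words_a & words_b) / (math.log(len(words_a)) + math.log(len(words_b)))


def textrank(text, sentences_count=2, damping=0.85, iterations=50):
    """Score sentences with weighted PageRank and return the top ones in document order."""
    sentences = split_sentences(text)
    n = len(sentences)
    if n == 0:
        return ""
    weights = [[similarity(a, b) if i != j else 0.0
                for j, b in enumerate(sentences)]
               for i, a in enumerate(sentences)]
    out_sums = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        # Each sentence receives score from its neighbours, proportional to edge weight.
        scores = [
            (1 - damping) / n + damping * sum(
                scores[j] * weights[j][i] / out_sums[j]
                for j in range(n) if out_sums[j] > 0)
            for i in range(n)
        ]
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    top = sorted(ranked[:sentences_count])
    return " ".join(sentences[i] for i in top)


toy_text = ("Fear keeps us safe at night. Fear keeps us alive every day. "
            "I like pancakes with syrup. Courage and fear go hand in hand.")
toy_summary = textrank(toy_text, sentences_count=2)
print(toy_summary)
```

The off-topic sentence shares no words with the others, so it receives only the baseline score and is never selected. LexRank follows the same graph idea with a different similarity measure, while LSA ranks sentences through a matrix decomposition instead.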

In [None]:
one_fifty_gold_summaries = full_dataset.loc[full_dataset["episode_id"].isin(gold_summaries['episode id'])]
len(one_fifty_gold_summaries)

150

In [None]:
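# .copy() makes an independent frame, so adding summary columns later does not raise SettingWithCopyWarning
one_fifty_gold_summaries = one_fifty_gold_summaries[['episode_id', 'episode_name', 'episode_description', 'transcript']].copy()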
one_fifty_gold_summaries

Unnamed: 0,episode_id,episode_name,episode_description,transcript
1926,spotify:episode:08hXUWN6aOnHULXrqMiwTi,Recruiting Secrets From a MLM Recruiting Monst...,If you want to start mastering recruiting whic...,Hello everybody. What's going on in this Jess...
2692,spotify:episode:0CExTNH4LFqp1ec1mhTd4I,A Public Service Announcement from the mrbrown...,Don't be a silent victim of crime. a parody fr...,"Ladies, have you been molested don't be a vic..."
2704,spotify:episode:0CGkzBarXCoYA24w5buABS,Favorite parts!!!,We will share some of our favorite parts of Ha...,"Hello everybody, welcome back to butter girls..."
2943,spotify:episode:0DKARHBAz6GNwwIBz9Y4Sk,Fear? I Don't Know Her!,"Hey eaters!! To commemorate spooky season, thi...","Hey heaters, it's Daisy and Abby and we are t..."
3463,spotify:episode:0Fg2g7eNnVBUj9FmtN8uKd,Unschool -001 /Introduction,Introduction | Show Formalities | What to Expe...,"Hey, what's up, this is Uncle Sharma from the..."
...,...,...,...,...
99028,spotify:episode:7JdHnC4Jm05d7gsvpt2iMN,Self Guided 30 Minute Meditation | Meditation ...,Learn to meditate by practicing your own medit...,Thank you for joining me for another piece pr...
99372,spotify:episode:7LBJuMprDt6nN8lWZhw1ZN,The Lady Lemur,"This week, John & Kat are joined by the Lady L...",Welcome to the Leaky nib a podcast about pens...
100490,spotify:episode:7dEbuTgXCXjIufsfgyohnI,1. Pilot,"Bruce Forsyth, Sherlock's hair and time travel...",I've got something very excited to tell you t...
104281,spotify:episode:7uv6JrgV01fiecvCgxRMy8,Best Organic Dessert Recipes,Comfort Foods for Cold Nights,Ingredients bolognese sauce 1 large onion coa...


In [None]:
!pip install sumy

Collecting sumy
  Downloading https://files.pythonhosted.org/packages/61/20/8abf92617ec80a2ebaec8dc1646a790fc9656a4a4377ddb9f0cc90bc9326/sumy-0.8.1-py2.py3-none-any.whl (83kB)
Collecting breadability>=0.1.20
  Downloading https://files.pythonhosted.org/packages/ad/2d/bb6c9b381e6b6a432aa2ffa8f4afdb2204f1ff97cfcc0766a5b7683fec43/breadability-0.1.20.tar.gz
Collecting pycountry>=18.2.23
  Downloading https://files.pythonhosted.org/packages/76/73/6f1a412f14f68c273feea29a6ea9b9f1e268177d32e0e69ad6790d306312/pycountry-20.7.3.tar.gz (10.1MB)
Building wheels for collected packages: breadability, pycountry
  Building wheel for breadability (setup.py) ... done
  Created wheel for breadability: filename=breadability-0.1.20-py2.py3-none-any.whl size=21684 sha256=f7496fa4e217bf72f27f42955b357119b96ab33e569aed04945c798207718d3b
  Stored in directory: /root/.cache

In [None]:
!python sumy_example.py

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
from sumy_example import summarize

one_fifty_gold_summaries['textrank_summary'] = one_fifty_gold_summaries.apply(lambda row: summarize(row['transcript'], 'text_rank'), axis=1)
one_fifty_gold_summaries['lexrank_summary'] = one_fifty_gold_summaries.apply(lambda row: summarize(row['transcript'], 'lex_rank'), axis=1)
one_fifty_gold_summaries['lsa_summary'] = one_fifty_gold_summaries.apply(lambda row: summarize(row['transcript'], 'lsa'), axis=1)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


  warn(message % (words_count, sentences_count))
  warn(message % (words_count, sentences_count))


In [None]:
one_fifty_gold_summaries.iloc[3]['textrank_summary']

"Yeah, and it gives you opportunity to be that courageous person where if you weren't afraid of something you would never know that it was actually like something you could have done and then they talked kind of talk about the fight or flight response and it's the eye  Dia of fear keeps us safe in a lot of situations like you and I are both very aware of where we are at night and looking around to different people and making sure we're safe and that fear Keeps Us Alive everyday. Yeah before you can actually show forth, which is one of the hardest parts of overcoming fear because admitting fear to yourself is scary foot and I got anything for your own fear to um, yeah. "

In [None]:
one_fifty_gold_summaries.iloc[3]['lexrank_summary']

"I kind of get that fear also a side note of what I'm afraid of kidnap being kidnapped is probably the biggest one but in a much more real sense, it's going like I'm not accomplishing enough for giving enough that really ties in with the time thing because it says it yeah, we should be having something done. And what's my fear? "

In [None]:
one_fifty_gold_summaries.iloc[3]['lsa_summary']

"Yeah, so the idea of habituation  In the psychology today article talks about how it's when our nervous system returns to a more comfortable state, which is a good thing. When you happen to be in a very dangerous situation, once you return to this state of habituation, then you know that your body has calmed down but then we don't go outside and explore to a mode and we can go on autopilot and then then we're not as aware that we're doing it because we're just naturally doing it. "

In [None]:
one_fifty_gold_summaries.iloc[3]['episode_description']

#Recreate Baselines (Semi-Supervised Bart + Supervised Bart)

The abstractive baselines used on the data are a semi-supervised BART, which is a BART model fine-tuned only on the CNN/DailyMail dataset (it never sees podcast data), and a supervised BART, which is further fine-tuned on the Spotify dataset. Due to compute constraints, the supervised model is fine-tuned on only 5000 samples; I plan to fine-tune on more samples to see whether that improves the results.

In [None]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/27/3c/91ed8f5c4e7ef3227b4119200fc0ed4b4fd965b1f0172021c25701087825/transformers-3.0.2-py3-none-any.whl (769kB)
Collecting tokenizers==0.8.1.rc1
  Downloading https://files.pythonhosted.org/packages/40/d0/30d5f8d221a0ed981a186c8eb986ce1c94e3a6e87f994eae9f4aa5250217/tokenizers-0.8.1rc1-cp36-cp36m-manylinux1_x86_64.whl (3.0MB)
Collecting sentencepiece!=0.1.92
  Downloading https://files.pythonhosted.org/packages/d4/a4/d0a884c4300004a78cca907a6ff9a5e9fe4f090f5d95ab341c53d28cbc58/sentencepiece-0.1.91-cp36-cp36m-manylinux1_x86_64.whl (1.1MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB

##semi supervised

Inference with the model consumes too much RAM on very long inputs, so I truncate each transcript to its first 4000 characters. This almost certainly affects the results, since everything after the opening of an episode is ignored.
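A hard cut at 4000 characters can end in the middle of a sentence. One possible refinement, sketched here as an assumption rather than what this notebook runs, is to back up to the last sentence-ending punctuation mark inside the character budget. The helper name truncate_at_sentence is hypothetical.

```python
import re


def truncate_at_sentence(text, max_chars=4000):
    """Cut text to at most max_chars, ending on a sentence boundary when one exists."""
    if len(text) <= max_chars:
        return text
    window = text[:max_chars]
    # Positions just after each ., ! or ? inside the window.
    ends = [match.end() for match in re.finditer(r"[.!?]", window)]
    # Fall back to the hard cut when the window has no sentence end at all.
    return window[:ends[-1]] if ends else window


demo = truncate_at_sentence("One two. Three four. Five six seven.", max_chars=25)
print(demo)
```

Whether this changes summary quality is an open question; it only guarantees that the model never sees a sentence cut in half.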

In [None]:
from transformers import pipeline
bart_summarizer = pipeline("summarization")


def semi_supervised_bart(transcript):
  summary = bart_summarizer(transcript[:4000], min_length=50, max_length=150)
  return summary[0]['summary_text']


In [None]:
one_fifty_gold_summaries['semi_supervised'] = one_fifty_gold_summaries.apply(lambda row: semi_supervised_bart(row['transcript']), axis=1)

Your max_length is set to 150, but you input_length is only 134. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=50)
Your max_length is set to 150, but you input_length is only 142. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=50)
Your max_length is set to 150, but you input_length is only 142. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=50)


In [None]:
one_fifty_gold_summaries.iloc[3]['semi_supervised']

" Daisy and Abby talk about fear and how courage and fear go hand in hand . Abby says fear is not a sign of what you shouldn't do, it's a call to action . Abby: I have this weird fear that I'm not doing enough at the moment and that I kind of compare myself to other people and I'm like, well this person is doing this and I could be doing a lot more ."

In [None]:
one_fifty_gold_summaries.iloc[1]['semi_supervised']

" If you have been molested call the police shout for help and call nine nine nine . Don't be a silent victim avoid walking through secluded areas alone . Have someone escort you home when it's late above all try your best not to be molested by a culprit from a good University who has good academic results and the potential to excel in life ."

##supervised (trained on 5000 transcripts)

In [None]:
!pip install ohmeow-blurr
!pip install nlp
!pip install pyarrow==0.16.0


Collecting ohmeow-blurr
  Downloading https://files.pythonhosted.org/packages/fd/80/6dfb38121c4d61d3f8d72d16ff45b23e6a1ef0ac798a1e2b03542e8e0500/ohmeow_blurr-0.0.7-py3-none-any.whl
Collecting nlp
  Downloading https://files.pythonhosted.org/packages/e5/77/9b8a06120e85d8057339a87dca967c8b73280fe463702cc4966c30f43960/nlp-0.3.0-py3-none-any.whl (1.1MB)
Collecting seqeval
  Downloading https://files.pythonhosted.org/packages/34/91/068aca8d60ce56dd9ba4506850e876aba5e66a6f2f29aa223224b50df0de/seqeval-0.0.12.tar.gz
Collecting rouge-score
  Downloading https://files.pythonhosted.org/packages/1f/56/a81022436c08b9405a5247b71635394d44fe7e1dbedc4b28c740e09c2840/rouge_score-0.0.4-py2.py3-none-any.whl
Collecting fastcore
  Downloading https://files.pythonhosted.org/packages/47/54/84e9f7c5ab718526e891fef0801e3cab98b4e4e28471fbd0e41efbbbe412/fastcore-0.1.20-py3-none-any.whl
Collecting fastai2
  Downloading https://files.python

Collecting pyarrow==0.16.0
  Downloading https://files.pythonhosted.org/packages/00/d2/695bab1e1e7a4554b6dbd287d55cca096214bd441037058a432afd724bb1/pyarrow-0.16.0-cp36-cp36m-manylinux2014_x86_64.whl (63.1MB)
Installing collected packages: pyarrow
  Found existing installation: pyarrow 1.0.0
    Uninstalling pyarrow-1.0.0:
      Successfully uninstalled pyarrow-1.0.0
Successfully installed pyarrow-0.16.0


Copying gs://spotify_asr_dataset/BART_finetuned_5000.pkl...
- [1/1 files][  1.8 GiB/  1.8 GiB] 100% Done  53.4 MiB/s ETA 00:00:00           
Operation completed over 1 objects/1.8 GiB.                                      


In [None]:
one_fifty_gold_summaries.to_csv('150_gold.csv', index=False)

In [None]:
#restarted run time
import pandas as pd
one_fifty_gold_summaries = pd.read_csv('150_gold.csv')
one_fifty_gold_summaries.head(2)

Unnamed: 0,episode_id,episode_name,episode_description,transcript,textrank_summary,lexrank_summary,lsa_summary,semi_supervised,supervised
0,spotify:episode:08hXUWN6aOnHULXrqMiwTi,Recruiting Secrets From a MLM Recruiting Monst...,If you want to start mastering recruiting whic...,Hello everybody. What's going on in this Jess...,Let's download these stories about whoever I d...,You can do whatever networking events and I'm ...,We're going to talk about some powerful recrui...,"You cannot scummy hashtag wisely, or the peop...","You cannot scummy hashtag wisely, or the peopl..."
1,spotify:episode:0CExTNH4LFqp1ec1mhTd4I,A Public Service Announcement from the mrbrown...,Don't be a silent victim of crime. a parody fr...,"Ladies, have you been molested don't be a vic...",Have someone escort you home when it's late ab...,But you can touch his life. But you can touch ...,If you only touched you in minor ways never mi...,If you have been molested call the police sho...,"Ladies, have you been molested don't be a vict..."


In [None]:
import nlp
from fastai2.text.all import *
from transformers import *

from blurr.data.all import *
from blurr.modeling.all import *

In [None]:
supervised_bart_model = load_learner(fname='BART_finetuned_5000.pkl')

def supervised_bart(transcript):
  return supervised_bart_model.generate_text(transcript)[0]



In [None]:
one_fifty_gold_summaries['supervised'] = one_fifty_gold_summaries.apply(lambda row: supervised_bart(row['transcript']), axis=1)

In [None]:
one_fifty_gold_summaries.to_csv('150_gold.csv', index=False)

#Recreate Baselines (First Five Sentences/first min?)

With the way I reformatted the data, a baseline that uses the words uttered in the first minute of the podcast might be ineffective, as it would rest on many assumptions. For example, if the average speaker says 125 words per minute, taking the first 125 words as the summary would not account for speakers who pause a lot while talking.

The original dataset contains the timestamp at which each word was uttered. I could have hosted the files somewhere and looked up the associated transcript via an API as needed, but the set of uncompressed files is very large.

I decided to go with the first five sentences, on the assumption that they are fairly representative of the first minute of a podcast and that this baseline is trivial to implement.
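For comparison, the two alternatives discussed above can be sketched in a few lines. The first helper takes a fixed word budget from the flattened transcript; the second uses per-word start times. The (word, start_seconds) tuple format and both function names are assumptions for illustration, since the timestamped files are not loaded in this notebook.

```python
def first_n_words(transcript, n=125):
    """Approximate the first minute with a fixed word budget (ignores pauses)."""
    return " ".join(transcript.split()[:n])


def first_minute(timed_words, limit_seconds=60.0):
    """Keep the words whose start time falls inside the first limit_seconds."""
    return " ".join(word for word, start in timed_words if start < limit_seconds)


# Hypothetical timed words: a slow speaker reaches only four words in a minute.
timed_words = [("Hello", 0.0), ("everybody", 0.4), ("welcome", 1.1),
               ("back", 59.8), ("today", 61.2)]
minute_text = first_minute(timed_words)
budget_text = first_n_words("Hello everybody welcome back today", n=3)
print(minute_text)
print(budget_text)
```

The timestamp version adapts to the speaker's pace, which is exactly what the word-budget version cannot do.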

In [1]:
import pandas as pd
results = pd.read_csv('150_gold.csv') #or download from bucket
results.head(2)

Unnamed: 0,episode_id,episode_name,episode_description,transcript,textrank_summary,lexrank_summary,lsa_summary,semi_supervised,supervised,extractive,t5_abstractive,first_five
0,spotify:episode:08hXUWN6aOnHULXrqMiwTi,Recruiting Secrets From a MLM Recruiting Monst...,If you want to start mastering recruiting whic...,Hello everybody. What's going on in this Jess...,Let's download these stories about whoever I d...,You can do whatever networking events and I'm ...,We're going to talk about some powerful recrui...,"You cannot scummy hashtag wisely, or the peop...",In this episode I talk about some powerful re...,"You cannot scummy hashtag wisely, or the peopl...",--- This episode is sponsored by Anchor: The ...,Hello everybody. What's going on in this Jess...
1,spotify:episode:0CExTNH4LFqp1ec1mhTd4I,A Public Service Announcement from the mrbrown...,Don't be a silent victim of crime. a parody fr...,"Ladies, have you been molested don't be a vic...",Have someone escort you home when it's late ab...,But you can touch his life. But you can touch ...,If you only touched you in minor ways never mi...,If you have been molested call the police sho...,"If you have been molested, here are some of t...","Ladies, have you been molested don't be a vict...","Ladies, have you been molested? Don't be a sil...","Ladies, have you been molested don't be a vic..."


In [2]:
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize

def first_five(transcript, threshold=5):
  """Return the first `threshold` sentences of the transcript as one string."""
  sentences = sent_tokenize(transcript)
  return ' '.join(sentences[:threshold])

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
results['first_five'] = results.apply(lambda row: first_five(row['transcript'], 5), axis=1)
results.to_csv('150_gold.csv', index=False)

In [26]:
results.iloc[0]['first_five']

" Hello everybody. What's going on in this Jesse Lee? You cannot scummy hashtag wisely, or the people's Mentor in today. We're going to talk about some powerful recruiting techniques for massive growth in your business and in your team, and so if you are new to this program, feel free to subscribe share with a friend do all the good things. I appreciate you guys."

#Extractive (BERT + KMeans)

The BERT + KMeans approach is based on the bert-extractive-summarizer project by Derek Miller (dmmiller612), where BERT is used to encode sentences and k-means is used to cluster the encodings. The sentences that make the final summary are those closest to the cluster centroids. The default number of clusters is two, and I stick with that. However, I wonder how the results would change with a different number of clusters. Also, the model is not fine-tuned. What would happen if it were?
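The selection step can be illustrated without BERT. The sketch below runs a plain k-means on toy 2-D vectors that stand in for sentence embeddings, then keeps the sentence nearest to each centroid. It is not the library's code: the embeddings are made up, the initialisation simply takes the first k points, and the function names are hypothetical. It needs Python 3.8+ for math.dist.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means, initialised with the first k points."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids


def select_sentences(sentences, embeddings, k=2):
    """Return, in document order, the sentence closest to each cluster centroid."""
    centroids = kmeans(embeddings, k)
    chosen = {
        min(range(len(embeddings)), key=lambda i: math.dist(embeddings[i], c))
        for c in centroids
    }
    return [sentences[i] for i in sorted(chosen)]


sentences = [
    "Today we talk about fear.",
    "Fear keeps us safe at night.",
    "Courage and fear go hand in hand.",
    "This episode is sponsored by a mattress company.",
    "Use our promo code at checkout.",
    "Thanks again to our sponsor.",
]
# Made-up 2-D embeddings: three topic sentences and three sponsor sentences.
embeddings = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.2),
              (5.0, 5.0), (5.4, 5.2), (4.9, 5.1)]
selected = select_sentences(sentences, embeddings, k=2)
print(selected)
```

With two clusters, one representative comes from each group, which also shows a weakness for podcasts: a cluster of sponsor reads can contribute a sentence to the summary.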

In [None]:
#restarted run time
import pandas as pd
one_fifty_gold_summaries = pd.read_csv('150_gold.csv')
one_fifty_gold_summaries.head(2)

Unnamed: 0,episode_id,episode_name,episode_description,transcript,textrank_summary,lexrank_summary,lsa_summary,semi_supervised,supervised
0,spotify:episode:08hXUWN6aOnHULXrqMiwTi,Recruiting Secrets From a MLM Recruiting Monst...,If you want to start mastering recruiting whic...,Hello everybody. What's going on in this Jess...,Let's download these stories about whoever I d...,You can do whatever networking events and I'm ...,We're going to talk about some powerful recrui...,"You cannot scummy hashtag wisely, or the peop...",In this episode I talk about some powerful re...
1,spotify:episode:0CExTNH4LFqp1ec1mhTd4I,A Public Service Announcement from the mrbrown...,Don't be a silent victim of crime. a parody fr...,"Ladies, have you been molested don't be a vic...",Have someone escort you home when it's late ab...,But you can touch his life. But you can touch ...,If you only touched you in minor ways never mi...,If you have been molested call the police sho...,"If you have been molested, here are some of t..."


In [None]:
!git clone https://github.com/paulowoicho/bert-extractive-summarizer.git
!pip install spacy
!pip install neuralcoref
!mv bert-extractive-summarizer summarizer

Cloning into 'bert-extractive-summarizer'...
remote: Enumerating objects: 4, done.
remote: Counting objects: 100% (4/4), done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 378 (delta 0), reused 3 (delta 0), pack-reused 374
Receiving objects: 100% (378/378), 85.48 KiB | 735.00 KiB/s, done.
Resolving deltas: 100% (215/215), done.
Collecting neuralcoref
  Downloading https://files.pythonhosted.org/packages/ea/24/0ec7845a5b73b637aa691ff4d1b9b48f3a0f3369f4002a59ffd7a7462fdb/neuralcoref-4.0-cp36-cp36m-manylinux1_x86_64.whl (287kB)
Installing collected packages: neuralcoref
Successfully installed neuralcoref-4.0


In [None]:
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize, word_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/27/3c/91ed8f5c4e7ef3227b4119200fc0ed4b4fd965b1f0172021c25701087825/transformers-3.0.2-py3-none-any.whl (769kB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
Collecting sentencepiece!=0.1.92
  Downloading https://files.pythonhosted.org/packages/d4/a4/d0a884c4300004a78cca907a6ff9a5e9fe4f090f5d95ab341c53d28cbc58/sentencepiece-0.1.91-cp36-cp36m-manylinux1_x86_64.whl (1.1MB)
Collecting tokenizers==0.8.1.rc1
  Downloading https://files.pythonhosted.org/packages/40/d0/30d5f8d221a0ed981a186c8eb986ce1c94e3a6e87f994eae9f4aa5250217/tokenizers-0.8.1rc1-cp36-cp36m-manylinux1_x86_64.whl (3.0MB

In [None]:
from summarizer import Summarizer
extractive_model = Summarizer()

def extractive_summary(transcript):
  result = extractive_model(transcript, min_length=60, ratio=0.01)
  return ''.join(result)


In [None]:
one_fifty_gold_summaries['extractive'] = one_fifty_gold_summaries.apply(lambda row: extractive_summary(row['transcript']), axis=1)
one_fifty_gold_summaries.to_csv('150_gold.csv', index=False)

In [None]:
one_fifty_gold_summaries.head(2)

Unnamed: 0,episode_id,episode_name,episode_description,transcript,textrank_summary,lexrank_summary,lsa_summary,semi_supervised,supervised,extractive
0,spotify:episode:08hXUWN6aOnHULXrqMiwTi,Recruiting Secrets From a MLM Recruiting Monst...,If you want to start mastering recruiting whic...,Hello everybody. What's going on in this Jess...,Let's download these stories about whoever I d...,You can do whatever networking events and I'm ...,We're going to talk about some powerful recrui...,"You cannot scummy hashtag wisely, or the peop...",In this episode I talk about some powerful re...,"You cannot scummy hashtag wisely, or the peopl..."
1,spotify:episode:0CExTNH4LFqp1ec1mhTd4I,A Public Service Announcement from the mrbrown...,Don't be a silent victim of crime. a parody fr...,"Ladies, have you been molested don't be a vic...",Have someone escort you home when it's late ab...,But you can touch his life. But you can touch ...,If you only touched you in minor ways never mi...,If you have been molested call the police sho...,"If you have been molested, here are some of t...","Ladies, have you been molested don't be a vict..."


In [None]:
one_fifty_gold_summaries.iloc[7]['extractive']

"If you ever heard about anchor it's the easiest way to make a podcast. They make 90 degree turns and they're following people and they like all suddenly like six eliminations in a row and while when I say, they're doing 90 degree rotations, that's also while building structures and going through walls and editing its it was really an interesting to watch the perfect gives you ideas of how to play better. I like he got a lot of them will eliminations but he doesn't get credit for them because he didn't actually hit them with a bullet."

#Abstractive (T5)

The T5 model is fine-tuned on only 3000 samples due to compute constraints.

In [None]:
#set up again because restarted runtime
from google.colab import auth
auth.authenticate_user()

project_id = 'test-281700'
!gcloud config set project {project_id}

#download model..restarted runtime
bucket_name = 'spotify_asr_dataset'
!gsutil -m cp -r gs://{bucket_name}/t5-model-3000.zip /content/

Updated property [core/project].


To take a quick anonymous survey, run:
  $ gcloud survey

Copying gs://spotify_asr_dataset/t5-model-3000.zip...
- [1/1 files][787.9 MiB/787.9 MiB] 100% Done                                    
Operation completed over 1 objects/787.9 MiB.                                    


In [None]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/27/3c/91ed8f5c4e7ef3227b4119200fc0ed4b4fd965b1f0172021c25701087825/transformers-3.0.2-py3-none-any.whl (769kB)
Collecting sentencepiece!=0.1.92
  Downloading https://files.pythonhosted.org/packages/d4/a4/d0a884c4300004a78cca907a6ff9a5e9fe4f090f5d95ab341c53d28cbc58/sentencepiece-0.1.91-cp36-cp36m-manylinux1_x86_64.whl (1.1MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
Collecting tokenizers==0.8.1.rc1
  Downloading https://files.pythonhosted.org/packages/40/d0/30d5f8d221a0ed981a186c8eb986ce1c94e3a6e87f994eae9f4aa5250217/tokenizers-0.8.1rc1-cp36-cp36m-manylinux1_x86_64.whl 

In [None]:
!unzip t5-model-3000.zip

Archive:  t5-model-3000.zip
   creating: t5-model-3000/
  inflating: t5-model-3000/pytorch_model.bin  
 extracting: t5-model-3000/tokenizer_config.json  
  inflating: t5-model-3000/config.json  
  inflating: t5-model-3000/spiece.model  
  inflating: t5-model-3000/special_tokens_map.json  


In [None]:
import pandas as pd
one_fifty_gold_summaries = pd.read_csv('150_gold.csv')
one_fifty_gold_summaries.head(2)

Unnamed: 0,episode_id,episode_name,episode_description,transcript,textrank_summary,lexrank_summary,lsa_summary,semi_supervised,supervised,extractive,t5_abstractive
0,spotify:episode:08hXUWN6aOnHULXrqMiwTi,Recruiting Secrets From a MLM Recruiting Monst...,If you want to start mastering recruiting whic...,Hello everybody. What's going on in this Jess...,Let's download these stories about whoever I d...,You can do whatever networking events and I'm ...,We're going to talk about some powerful recrui...,"You cannot scummy hashtag wisely, or the peop...",In this episode I talk about some powerful re...,"You cannot scummy hashtag wisely, or the peopl...",--- This episode is sponsored by Anchor: The ...
1,spotify:episode:0CExTNH4LFqp1ec1mhTd4I,A Public Service Announcement from the mrbrown...,Don't be a silent victim of crime. a parody fr...,"Ladies, have you been molested don't be a vic...",Have someone escort you home when it's late ab...,But you can touch his life. But you can touch ...,If you only touched you in minor ways never mi...,If you have been molested call the police sho...,"If you have been molested, here are some of t...","Ladies, have you been molested don't be a vict...","Ladies, have you been molested? Don't be a sil..."


In [None]:
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

# # # Setting up the device for GPU usage
# from torch import cuda
# device = 'cuda' if cuda.is_available() else 'cpu'

tokenizer = T5Tokenizer.from_pretrained('/content/t5-model-3000')
model = T5ForConditionalGeneration.from_pretrained('/content/t5-model-3000')


def t5_inference(transcript):
  threshold = 7000  # maximum number of tokens passed to the model
  t5_form = 'summarize: ' + transcript
  tokenized_text = tokenizer.encode(t5_form, return_tensors="pt")
  if len(tokenized_text[0]) > threshold:
    #run out of RAM/crashes on large number of tokens, so keep only the first half
    #split the raw transcript (not t5_form) so the 'summarize: ' prefix is not added twice
    sentences = sent_tokenize(transcript)
    first_half = sentences[:len(sentences) // 2] #maybe they talk about content in the first half? find proof
    return t5_inference(' '.join(first_half))
  summary_ids = model.generate(tokenized_text, max_length=150, num_beams=2, repetition_penalty=2.5, length_penalty=1.0, early_stopping=True)
  output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
  return output

In [None]:
one_fifty_gold_summaries['t5_abstractive'] = one_fifty_gold_summaries.apply(lambda row: t5_inference(row['transcript']), axis=1)
one_fifty_gold_summaries.to_csv('150_gold.csv', index=False)
!gsutil -m cp -r /content/150_gold.csv gs://{bucket_name}/

Token indices sequence length is longer than the specified maximum sequence length for this model (5814 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (5474 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (2458 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (2514 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1462 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length fo

Copying file:///content/150_gold.csv [Content-Type=text/csv]...
/ [1/1 files][  4.3 MiB/  4.3 MiB] 100% Done
Operation completed over 1 objects/4.3 MiB.                                      


In [None]:
one_fifty_gold_summaries.iloc[5]['t5_abstractive']

'I talk about anger and how we deal with it. --- Send in a voice message: https://anchor.fm/thebeast-podcast/message Support this podcast: https://anchor.fm/thebeast-podcast/support this podcast: https://anchor.fm/thebeast-podcast/support this podcast: https://anchor.fm/thebeast-podcast/support this podcast: https://anchor.fm/thebeast-podcast/support this podcast: https://anchor.fm/thebeast/support this podcast: https://anchor.fm/thebeast-podcast'

#Extractive + Abstractive



Given the performance of the abstractive and extractive models, I would like to see how well a combination of both performs. My best extractive models are First Five Sentences and Bert + KMeans. On their own, I doubt that their outputs would provide enough context to serve as meaningful input to an abstractive model, so I can expand First Five Sentences to the first 15 sentences and raise the Bert + KMeans ratio so that it keeps more of the important sentences (it currently keeps 1% of all sentences; maybe 10%?).
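
The combination described here can be sketched as a two-stage pipeline: an extractive step narrows the transcript, and an abstractive step rewrites what is left. In the sketch below, `first_n_sentences`, `extract_then_abstract`, and the regex sentence splitter are illustrative assumptions rather than the notebook's own helpers, and the abstractive step is passed in as a plain callable so the example runs without a model.

```python
import re
from typing import Callable

def first_n_sentences(text: str, n: int = 15) -> str:
    """Return the first n sentences of text."""
    # Assumption: a naive regex split stands in for nltk's sent_tokenize.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    return ' '.join(sentences[:n])

def extract_then_abstract(transcript: str,
                          extract: Callable[[str], str],
                          abstract: Callable[[str], str]) -> str:
    """Run the extractive step, then feed its output to the abstractive step."""
    return abstract(extract(transcript))

# Stand-in abstractive step: upper-cases its input so the data flow is visible.
demo = extract_then_abstract(
    "One. Two! Three? Four.",
    extract=lambda t: first_n_sentences(t, 2),
    abstract=str.upper,
)
print(demo)
```

In the notebook's setting, `abstract` would presumably be `t5_inference`, and `extract` either the first-15 helper or the Bert + KMeans summarizer with a larger ratio.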

In [6]:
#first_15
extractive_results = pd.DataFrame(columns=['first_15', 'bert_kmeans'])
extractive_results['first_15'] = results.apply(lambda row: first_five(row['transcript'], 15), axis=1)

extractive_results.head(2)

Unnamed: 0,first_15,bert_kmeans
0,Hello everybody. What's going on in this Jess...,
1,"Ladies, have you been molested don't be a vic...",


In [7]:
#bert + kmeans (10%)

" Hello everybody. What's going on in this Jesse Lee? You cannot scummy hashtag wisely, or the people's Mentor in today. We're going to talk about some powerful recruiting techniques for massive growth in your business and in your team, and so if you are new to this program, feel free to subscribe share with a friend do all the good things. I appreciate you guys. I love all you and if you haven't already go ahead and screenshot this bad boy put it in your Instagram story and I will repost and I've been doing all kinds of giveaways. So I appreciate you guys also, make sure you Review over on iTunes. There will be a fan of the Week on here and check my Instagram story at I mbos SLE within 24 hours this podcast. I will be giving away $50 $50 cashola US dollars to to one of you who left a review. So love you guys. Appreciate you guys. Let's Jump Right In  So I feel like if you're listening to this, you probably want to become a network marketing machine, right? So I'll just tell you that p

#Evaluation (Rouge 1, Rouge 2, Rouge L)

##Post Processing

Some generated summaries contain links and other fluff, which could negatively impact ROUGE scores. It does seem that creator descriptions containing links lead to auto-generated summaries that also contain links. I will explore this further.

The function below cleans up the text.

In [16]:
import re

def post_process(string):
  # strip URLs of the form scheme://host/path
  string = re.sub(r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*', '', string)
  # collapse runs of non-word characters that are followed by whitespace into one space
  string = re.sub(r'\W+\s', ' ', string)
  return string
  #remove duplicates?
  # string = string.split()
  # return " ".join(sorted(set(string), key=words.index))

results['t5_abstractive'] = results.apply(lambda row: post_process(row['t5_abstractive']), axis=1)
results['supervised'] = results.apply(lambda row: post_process(row['supervised']), axis=1)
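
To see what the two substitutions do in practice, the short sketch below repeats them on a single string. The sample input is hypothetical, modelled on the anchor.fm boilerplate that shows up in the T5 output; note that the second pattern also removes sentence punctuation that is followed by whitespace, and can leave a trailing space.

```python
import re

def post_process(string):
    # Same two substitutions as in the notebook: drop URLs, then collapse
    # punctuation runs that are followed by whitespace into a single space.
    string = re.sub(r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*', '', string)
    string = re.sub(r'\W+\s', ' ', string)
    return string

# Hypothetical input, not taken from the dataset.
sample = "I talk about anger. --- Support this podcast: https://anchor.fm/thebeast-podcast/support"
cleaned = post_process(sample)
print(repr(cleaned))
```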

##Results

My results differ from the baselines provided; however, the same conclusions can still be reached.

In [17]:
!pip install rouge

Collecting rouge
  Downloading https://files.pythonhosted.org/packages/43/cc/e18e33be20971ff73a056ebdb023476b5a545e744e3fc22acd8c758f1e0d/rouge-1.0.0-py3-none-any.whl
Installing collected packages: rouge
Successfully installed rouge-1.0.0


In [18]:
results.columns.values

array(['episode_id', 'episode_name', 'episode_description', 'transcript',
       'textrank_summary', 'lexrank_summary', 'lsa_summary',
       'semi_supervised', 'supervised', 'extractive', 't5_abstractive',
       'first_five'], dtype=object)

In [27]:
from rouge import Rouge
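rouge = Rouge()

# Note: Rouge.get_scores takes (hypotheses, references). The descriptions are passed
# first below, so precision and recall come out swapped relative to that convention.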
textrank = rouge.get_scores(results['episode_description'], results['textrank_summary'], avg=True)
lexrank = rouge.get_scores(results['episode_description'], results['lexrank_summary'], avg=True)
lsa = rouge.get_scores(results['episode_description'], results['lsa_summary'], avg=True)
semi_supervised = rouge.get_scores(results['episode_description'], results['semi_supervised'], avg=True)
supervised = rouge.get_scores(results['episode_description'], results['supervised'], avg=True)
extractive = rouge.get_scores(results['episode_description'], results['extractive'], avg=True)
t5_abstractive = rouge.get_scores(results['episode_description'], results['t5_abstractive'], avg=True)
first_five = rouge.get_scores(results['episode_description'], results['first_five'], avg=True)
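
To make the scores easier to interpret, here is a minimal sketch of how a ROUGE-1 F-score can be computed from unigram overlap. `rouge1_f` is a hypothetical helper that uses whitespace tokenization and clipped counts; it is not the `rouge` package's implementation, which may tokenize and count differently, so its values need not match the numbers reported in this notebook.

```python
from collections import Counter

def rouge1_f(hypothesis: str, reference: str) -> float:
    """F1 over unigram overlap between a hypothesis and a reference."""
    hyp = Counter(hypothesis.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((hyp & ref).values())  # clipped count of shared unigrams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())  # share of hypothesis words found in the reference
    recall = overlap / sum(ref.values())     # share of reference words covered by the hypothesis
    return 2 * precision * recall / (precision + recall)

# Hypothetical sentences: all 4 hypothesis words appear in the 8-word reference.
score = rouge1_f("we talk about anger",
                 "in this episode we talk about anger management")
print(round(score, 3))
```

With precision 1.0 and recall 0.5, the F-score here works out to 2/3, which shows how a short but accurate summary is still penalized for missing reference content.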

In [28]:
row_names = ['textrank', 'lexrank', 'lsa', 'semi_supervised', 'supervised', 'extractive', 't5_abstractive', 'first_five']
rouge1_scores = [textrank['rouge-1']['f'], lexrank['rouge-1']['f'], lsa['rouge-1']['f'], semi_supervised['rouge-1']['f'], supervised['rouge-1']['f'], extractive['rouge-1']['f'], t5_abstractive['rouge-1']['f'], first_five['rouge-1']['f']]

rouge1_df = pd.DataFrame({'Rouge1-F':rouge1_scores}, index=row_names)
rouge1_df

Unnamed: 0,Rouge1-F
textrank,0.116462
lexrank,0.119708
lsa,0.132847
semi_supervised,0.161447
supervised,0.22137
extractive,0.137192
t5_abstractive,0.203084
first_five,0.138515


In [29]:
rouge2_scores = [textrank['rouge-2']['f'], lexrank['rouge-2']['f'], lsa['rouge-2']['f'], semi_supervised['rouge-2']['f'], supervised['rouge-2']['f'], extractive['rouge-2']['f'], t5_abstractive['rouge-2']['f'], first_five['rouge-2']['f']]

rouge2_df = pd.DataFrame({'Rouge2-F':rouge2_scores}, index=row_names)
rouge2_df

Unnamed: 0,Rouge2-F
textrank,0.014054
lexrank,0.014022
lsa,0.010752
semi_supervised,0.029776
supervised,0.091048
extractive,0.020735
t5_abstractive,0.092811
first_five,0.02927


In [30]:
rougel_scores = [textrank['rouge-l']['f'], lexrank['rouge-l']['f'], lsa['rouge-l']['f'], semi_supervised['rouge-l']['f'], supervised['rouge-l']['f'], extractive['rouge-l']['f'], t5_abstractive['rouge-l']['f'], first_five['rouge-l']['f']]

rougel_df = pd.DataFrame({'RougeL-F':rougel_scores}, index=row_names)
rougel_df

Unnamed: 0,RougeL-F
textrank,0.104832
lexrank,0.107688
lsa,0.109165
semi_supervised,0.146533
supervised,0.21152
extractive,0.116782
t5_abstractive,0.226783
first_five,0.126801


In [31]:
rouge_scores = rouge1_df.merge(rouge2_df, left_index=True, right_index = True)
rouge_scores = rouge_scores.merge(rougel_df, left_index=True, right_index = True)

In [32]:
rouge_scores

Unnamed: 0,Rouge1-F,Rouge2-F,RougeL-F
textrank,0.116462,0.014054,0.104832
lexrank,0.119708,0.014022,0.107688
lsa,0.132847,0.010752,0.109165
semi_supervised,0.161447,0.029776,0.146533
supervised,0.22137,0.091048,0.21152
extractive,0.137192,0.020735,0.116782
t5_abstractive,0.203084,0.092811,0.226783
first_five,0.138515,0.02927,0.126801
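
As a quick check on which method leads on each metric, the sketch below picks the top scorer per column. The numbers are copied from the combined table; the dictionary layout and the `best` variable are just one illustrative way to do this without pandas.

```python
# Scores copied from the combined table: (Rouge1-F, Rouge2-F, RougeL-F).
rouge_table = {
    'textrank':        (0.116462, 0.014054, 0.104832),
    'lexrank':         (0.119708, 0.014022, 0.107688),
    'lsa':             (0.132847, 0.010752, 0.109165),
    'semi_supervised': (0.161447, 0.029776, 0.146533),
    'supervised':      (0.22137,  0.091048, 0.21152),
    'extractive':      (0.137192, 0.020735, 0.116782),
    't5_abstractive':  (0.203084, 0.092811, 0.226783),
    'first_five':      (0.138515, 0.02927,  0.126801),
}
metrics = ('Rouge1-F', 'Rouge2-F', 'RougeL-F')

# For each metric, find the method with the highest F-score.
best = {
    metric: max(rouge_table, key=lambda name: rouge_table[name][i])
    for i, metric in enumerate(metrics)
}
print(best)
```

On the DataFrame itself, `rouge_scores.idxmax()` should give the same per-column answer.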


Note to self: On the 150 gold summaries, the T5 model scores highest on ROUGE-2 and ROUGE-L, with supervised BART a very close second (and slightly ahead on ROUGE-1). Among the extractive techniques, the first-five-sentences approach performs best, closely followed by Bert + KMeans. However, remember that:
 

*   The reference episode descriptions vary in quality (a good ROUGE score on a terrible reference is not ideal)
*   Post-processing has not yet been done on the results of the abstractive models to clean up summaries where links and other fluff were generated. (This affects precision, and therefore the overall F1 score; then again, some references also contain links and other promotional material)
*   Abstractive models were only trained on a subset of the data because of Colab constraints (BART: 5000, T5: 3000); the models might perform better if fine-tuned on more samples
*   For the T5 model, very long text was shortened to avoid using up all the RAM. Ideally, podcasters would talk about their content in the first few minutes of the episode, but that might not always be the case. This could also have affected performance
*   Also need to test a combination of the two summarization approaches (extractive + abstractive); maybe that could improve ROUGE scores?
*   Summaries need to be short, standalone, grammatically complete statements that help the user decide if the episode is worth a listen. They should be readable on a phone screen. (Some of the observed generative and extractive summaries seem to have done well in this regard)
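
The shortening workaround noted for the T5 model, keeping the first half of the sentences until the input fits, can be sketched as a loop. `halve_until_fits`, the word-count stand-in for the tokenizer, and the regex sentence splitter are assumptions made so the example runs without a model; the notebook itself relies on the T5 tokenizer and `sent_tokenize`.

```python
import re
from typing import Callable

def halve_until_fits(text: str, threshold: int,
                     count_tokens: Callable[[str], int] = lambda t: len(t.split())) -> str:
    """Repeatedly keep the first half of the sentences until text fits the threshold."""
    while count_tokens(text) > threshold:
        # Assumption: naive regex sentence split instead of nltk's sent_tokenize.
        sentences = re.split(r'(?<=[.!?])\s+', text.strip())
        if len(sentences) <= 1:
            break  # a single over-long sentence cannot be halved any further
        text = ' '.join(sentences[:len(sentences) // 2])
    return text

long_text = "One two. Three four. Five six. Seven eight."
shortened = halve_until_fits(long_text, threshold=4)
print(shortened)
```

Everything after the midpoint is discarded, so any content the host introduces late in the episode never reaches the model.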


Try/To do:


*   Fine-tuning the BERT model used for the extractive technique
*   Fine-tuning T5 and BART on more samples to see if that improves results
*   Cleaning up generated summaries to remove links and other promotional content (reduces overall rouge scores)
*   Evaluating the result of combining an extractive and abstractive model
*   Pass a fine-tuned T5 or BART into the extractive pipeline and see (does not work)
*   Would the extractive model have performed better if the threshold for final sentences were higher?
*   Coreference resolution with spacy

