# EXTRACTIVE SUMMARIZATION

#### Dependencies

* To avoid dependency issues, install the following versions:

Python = 3.6.9 <br>
torch==1.7.0 <br> 
spacy==2.3.1 <br>
bert-extractive-summarizer <br>

In [54]:
from summarizer import Summarizer
import traceback 
import re

### Parsing SRT files

In [7]:
def subtitle_to_textblob(subtitle_file):
    """Parse an SRT file into parallel lists of cue texts and cue time ranges.

    Assumes every cue is exactly three non-blank lines: index, time range, text.
    """
    input_text_list = []
    input_times_list = []

    # Position inside the current cue: 0 = index, 1 = time range, 2 = text
    count = 0
    with open(subtitle_file, 'r') as fp:
        for line in fp:
            line = line.strip()

            if line:
                if count == 0:
                    # Cue index line: nothing to store
                    count += 1
                elif count == 1:
                    input_times_list.append(line)
                    count += 1
                elif count == 2:
                    input_text_list.append(line)
                    count = 0
    return input_text_list, input_times_list
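
The parser above expects each cue to be an index line, a time-range line, and a single text line. The sketch below is not part of the notebook: it shows one possible way to parse SRT content by splitting on blank lines, so that a cue with several text lines stays aligned with its time range. The helper name `parse_srt_blocks` and the sample cues are illustrative assumptions.

```python
import re

def parse_srt_blocks(srt_text):
    """Return (texts, times) from SRT content, one entry per cue.

    Hypothetical helper: cues are separated by blank lines, and any
    text lines after the time range are joined with spaces.
    """
    texts, times = [], []
    # Split on blank lines (which may contain stray whitespace)
    for block in re.split(r'\n\s*\n', srt_text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if len(lines) < 3:
            continue  # skip malformed cues (missing index, time, or text)
        times.append(lines[1])
        texts.append(' '.join(lines[2:]))
    return texts, times

# Made-up sample: the second cue has two text lines
sample = """1
00:00:04,160 --> 00:00:04,400
At

2
00:00:04,400 --> 00:00:04,640
some
point
"""
texts, times = parse_srt_blocks(sample)
print(texts)
print(times)
```

Note that the timestamp lookup later in the notebook still relies on one word per cue, so joining multi-line cues only keeps the two lists the same length; it does not by itself make word positions match cue indices.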

In [8]:
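def extractive_summarization(input_text, num_sentences, debug=False):

    model = Summarizer()
    # One fewer sentence is requested than num_sentences; the outputs below all
    # open with the transcript's first sentence, so this appears to offset the
    # summarizer adding that sentence on top of the requested count.
    output_text = model(input_text, num_sentences=num_sentences-1)

    if debug:
        print('----------------- TOP', num_sentences, 'SENTENCES -----------------')
        print(output_text)
        print('----------------------------------------------------')

    return output_text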

In [55]:
def extracted_text_to_output(input_text, output_text, output_file, input_times_list, time_delimiter):
    """Write each summary sentence to output_file, prefixed with its start and end times.

    Assumes one word per subtitle entry, so word positions in input_text index input_times_list.
    """
    try:
        extracted_sentences = re.split(r'[.!?\n]\s*', output_text.strip())
        print('Size of times list: ', len(input_times_list))

        # Write to the output_file argument rather than a global op_file
        with open(output_file, 'w') as fp:

            for sentence in extracted_sentences:

                sentence = sentence.strip()

                if sentence:
                    search_list = sentence.split()

                    start_char_index = input_text.find(sentence)
                    if start_char_index == -1:
                        # Sentence not found verbatim in the transcript: skip it
                        # rather than derive a bogus word index from -1
                        print('Sentence not found in input text: ', sentence)
                        continue

                    # The number of words before the sentence is the index of its first word
                    start_word_index = len(input_text[:start_char_index].split())
                    end_word_index = start_word_index + len(search_list) - 1

                    start_ip_time = input_times_list[start_word_index].split(time_delimiter)[0].strip()
                    end_ip_time = input_times_list[end_word_index].split(time_delimiter)[1].strip()

                    fp.write(start_ip_time + time_delimiter + end_ip_time + time_delimiter + sentence + '\n')
    except Exception:
        print('Exception in extracted_text_to_output()')
        traceback.print_exc()
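
The alignment step relies on the transcript having one word per subtitle entry, so the position of a word in `input_text` is also its index in `input_times_list`. Below is a minimal sketch of that lookup on toy data, with the not-found case handled explicitly. The helper `sentence_time_range` and the sample words and times are hypothetical, not part of the notebook. When `str.find` returns -1, slicing with it yields almost the whole transcript, so the derived word index lands at or near the end of the times list; this is one plausible way to hit `IndexError: list index out of range`.

```python
def sentence_time_range(input_text, sentence, input_times_list, time_delimiter='-->'):
    """Return (start_time, end_time) for sentence, or None if it cannot be located.

    Assumes input_text is the space-joined cue texts and that each cue
    holds exactly one word.
    """
    char_index = input_text.find(sentence)
    if char_index == -1:
        return None  # sentence does not occur verbatim in the transcript
    # Number of words before the sentence == index of its first word
    start_word = len(input_text[:char_index].split())
    end_word = start_word + len(sentence.split()) - 1
    if end_word >= len(input_times_list):
        return None  # word positions and cue indices are out of sync
    start_time = input_times_list[start_word].split(time_delimiter)[0].strip()
    end_time = input_times_list[end_word].split(time_delimiter)[1].strip()
    return start_time, end_time

# Toy transcript: one word per cue, made-up timestamps
words = ['We', 'launched', 'in', 'July.', 'Sales', 'dropped.']
times_list = ['00:00:0%d,000 --> 00:00:0%d,500' % (i, i) for i in range(len(words))]
text = ' '.join(words)

result = sentence_time_range(text, 'Sales dropped', times_list)
missing = sentence_time_range(text, 'Not in transcript', times_list)
print(result)
print(missing)
```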

In [39]:
def extract_from_output_text(input_file, output_file, output_text_file, num_sentences):
    """Add timestamps to a previously saved summary (output_text_file) and write them to output_file."""
    try:
        time_delimiter = '-->'

        input_text_list, input_times_list = subtitle_to_textblob(input_file)

        input_text = ' '.join(input_text_list)

        # Read the summary text produced by an earlier summarization run
        with open(output_text_file, 'r') as fp:
            output_text = fp.read()

        extracted_text_to_output(input_text, output_text, output_file, input_times_list, time_delimiter)
        print('Output with timestamps written to ', output_file)

    except Exception:
        print('Exception in extract_from_output_text()')
        traceback.print_exc()

In [58]:
num_sentences_list = [15]
input_file = 'data/podcast__transcription_test.srt'

for num_sentences in num_sentences_list:
    output_text_file = input_file.split('.')[0] + '_optext_' + str(num_sentences) + '.txt'
    output_file = input_file.split('.')[0] + '_op_' + str(num_sentences) + '.txt'
    
    extract_from_output_text(input_file, output_file, output_text_file, num_sentences)

Size of times list:  17425
At some point, we tried to raise more money and you know, the investor was like the entire natural deodorant industry is something like $30 million a year
0 0 27
0 27
00:00:04,160
00:00:12,170
Aluminum is probably the largest thing that we're like, people are concerned about when it comes to the deodorant category
4825 926 945
926 945
00:04:14,019
00:04:19,360
I really liked what it stood for, which was like, we want to use ingredients that were like native to the earth
7211 1393 1414
1393 1414
00:06:08,594
00:06:13,635
And this was in 2015, you know, like today there's a new, natural theater at the lunches every day in 2015
14349 2739 2759
2739 2759
00:11:32,400
00:11:37,890
And she's like, we're going to scale this business
17030 3241 3249
3241 3249
00:13:46,240
00:13:47,680
I don't really want to be making these deodorants in my apartment
17720 3375 3386
3375 3386
00:14:18,814
00:14:21,095
And the meantime I'm like Googling and I'm like, okay, Who makes de
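
The file names in the cell above are derived with `input_file.split('.')[0]`, which keeps only the text before the first dot anywhere in the path. A small sketch of the same naming scheme using `os.path.splitext`, which strips only the final extension. The helper name `derived_paths` and the dotted directory path are assumptions for illustration.

```python
import os

def derived_paths(input_file, num_sentences):
    """Return (summary_text_path, timestamped_output_path) for an SRT file."""
    # splitext removes only the last extension, so dots in directory
    # names or earlier in the file name are preserved
    root, _ext = os.path.splitext(input_file)
    suffix = str(num_sentences) + '.txt'
    return root + '_optext_' + suffix, root + '_op_' + suffix

summary_path, timed_path = derived_paths('data/podcast__transcription_test.srt', 15)
print(summary_path)
print(timed_path)

# Hypothetical path with a dot in a directory name
dotted = derived_paths('data/v1.2/talk.srt', 5)
print(dotted)
```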

In [18]:
def extract_from_srt(input_file, output_file, num_sentences):
    """Summarize an SRT transcript and write the summary sentences with timestamps."""
    try:
        time_delimiter = '-->'

        input_text_list, input_times_list = subtitle_to_textblob(input_file)

        input_text = ' '.join(input_text_list)

        output_text = extractive_summarization(input_text, num_sentences, True)

        # Save the raw summary text
        with open(output_file, 'w') as fp:
            fp.write(output_text)

        # Align each summary sentence with its timestamps and write the result
        extracted_text_to_output(input_text, output_text, output_file, input_times_list, time_delimiter)
    except Exception:
        print('Exception in extract_from_srt()')
        traceback.print_exc()

In [19]:
'''
ip_file = 'data/AE_Shopify Walkthrough 1.srt'
op_file = 'data/AE_Shopify Walkthrough 1_op.txt'
num_sentences = 10

extract(ip_file, op_file, num_sentences)
'''

"\nip_file = 'data/AE_Shopify Walkthrough 1.srt'\nop_file = 'data/AE_Shopify Walkthrough 1_op.txt'\nnum_sentences = 10\n\nextract(ip_file, op_file, num_sentences)\n"

In [13]:
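num_sentences_list = [5]

ip_file = 'data/podcast__transcription_test.srt'
time_delimiter = '-->'

# Parse the transcript once; both loops below need the text and the times
input_text_list, input_times_list = subtitle_to_textblob(ip_file)
input_text = ' '.join(input_text_list)

for num_sentences in num_sentences_list:

    # Step 1: summarize and save the raw summary text
    ip_textfile = ip_file.split('.')[0] + '_optext_' + str(num_sentences) + '.txt'
    output_text = extractive_summarization(input_text, num_sentences, True)
    with open(ip_textfile, 'w') as fp:
        fp.write(output_text)

for num_sentences in num_sentences_list:

    # Step 2: read the saved summary back and attach timestamps
    ip_textfile = ip_file.split('.')[0] + '_optext_' + str(num_sentences) + '.txt'
    with open(ip_textfile, 'r') as fp:
        output_text = fp.read()

    op_file = ip_file.split('.')[0] + '_op_' + str(num_sentences) + '.txt'
    extracted_text_to_output(input_text, output_text, op_file, input_times_list, time_delimiter)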

START --- data/podcast__transcription_test_op_5.txt
----------------- TOP 5 SENTENCES -----------------
At some point, we tried to raise more money and you know, the investor was like the entire natural deodorant industry is something like $30 million a year. If you're thinking about starting an idea, A business or you are, you want to do that someday, or you're doing it right now, either way, listening to speakers, tell their stories, being in the audience, amongst a bunch of other like-minded people that is a really, really good way to get amongst it. And so I tweeted it and I was like, If you're wondering when I decided to destroy your business, it was when you sent me this letter, like before I was like, you know, you guys are a Nat, uh, and I don't care about your tiny business and I'll let you exist. And it's hard, like, you know, for us, we had a great safety net of like a success already or like momentum. Like, you know, we have this contest internally to make another email fun

In [15]:
num_sentences_list = [10,15,20,25,30,35,40,45,50,60,70,80,90,100]

ip_file = 'data/podcast__transcription_test.srt'

for num_sentences in num_sentences_list:
    
    op_file = ip_file.split('.')[0] + '_op_' + str(num_sentences) + '.txt'
    
    # print('START ---', op_file)
    extract_from_srt(ip_file, op_file, num_sentences)
    # print('END ---', op_file)

----------------- TOP 10 SENTENCES -----------------
At some point, we tried to raise more money and you know, the investor was like the entire natural deodorant industry is something like $30 million a year. And this was in 2015, you know, like today there's a new, natural theater at the lunches every day in 2015. And she's like, we're going to scale this business. If you're thinking about starting an idea, A business or you are, you want to do that someday, or you're doing it right now, either way, listening to speakers, tell their stories, being in the audience, amongst a bunch of other like-minded people that is a really, really good way to get amongst it. It sounds like it really paid off that you weren't just using a Chinese manufacturer. You know, they hear you say we're only at a hundred thousand a month in revenue. Sir, we hired another person, a third, uh, the third person to join the team and he's doing customer service. And you know, I have no idea what our business will lo

Traceback (most recent call last):
  File "<ipython-input-14-9ba77870bbc0>", line 23, in extracted_text_to_output
    start_ip_time = input_times_list[start_word_index].split(time_delimiter)[0].strip()
IndexError: list index out of range


----------------- TOP 20 SENTENCES -----------------
At some point, we tried to raise more money and you know, the investor was like the entire natural deodorant industry is something like $30 million a year. And so that, that was like the Genesis of like, okay, look, we think we can create a deodorant that does the job of an, a. Antiperspirant the other way I thought about it was look like a lot of us work in office environments, where we commute in a car, we get to an office, we sit at a desk and work at a computer and, you know, using an antiperspirant everyday. I put it on product hunt, product hunt has multiple pages. And this was in 2015, you know, like today there's a new, natural theater at the lunches every day in 2015. I want to have another meeting because I think there's a problem with the deodorants. It sounds like it really paid off that you weren't just using a Chinese manufacturer. But we're doing like tiny changes, like, you know, like changes that wouldn't even influe

Traceback (most recent call last):
  File "<ipython-input-14-9ba77870bbc0>", line 23, in extracted_text_to_output
    start_ip_time = input_times_list[start_word_index].split(time_delimiter)[0].strip()
IndexError: list index out of range


----------------- TOP 30 SENTENCES -----------------
At some point, we tried to raise more money and you know, the investor was like the entire natural deodorant industry is something like $30 million a year. So why would anyone be interested in investing in a category that only has $30 million a year run rates? And the tactics that he shared are actually quite insightful. And so that, that was like the Genesis of like, okay, look, we think we can create a deodorant that does the job of an, a. Antiperspirant the other way I thought about it was look like a lot of us work in office environments, where we commute in a car, we get to an office, we sit at a desk and work at a computer and, you know, using an antiperspirant everyday. And so we're shipping these boxes out, we shipped 60 out and then sales start dropping because you know, we're off the product hunt and I'm like, okay. And this was in 2015, you know, like today there's a new, natural theater at the lunches every day in 2015. I

Traceback (most recent call last):
  File "<ipython-input-14-9ba77870bbc0>", line 23, in extracted_text_to_output
    start_ip_time = input_times_list[start_word_index].split(time_delimiter)[0].strip()
IndexError: list index out of range


----------------- TOP 35 SENTENCES -----------------
At some point, we tried to raise more money and you know, the investor was like the entire natural deodorant industry is something like $30 million a year. So why would anyone be interested in investing in a category that only has $30 million a year run rates? Aluminum is probably the largest thing that we're like, people are concerned about when it comes to the deodorant category. And so that, that was like the Genesis of like, okay, look, we think we can create a deodorant that does the job of an, a. Antiperspirant the other way I thought about it was look like a lot of us work in office environments, where we commute in a car, we get to an office, we sit at a desk and work at a computer and, you know, using an antiperspirant everyday. Uh, so we launched the company in July, 2015. We get one sale and I'm like, okay, this business is over. And so he said that he said, told a story on the podcast, which was that at some point you guy

Traceback (most recent call last):
  File "<ipython-input-14-9ba77870bbc0>", line 23, in extracted_text_to_output
    start_ip_time = input_times_list[start_word_index].split(time_delimiter)[0].strip()
IndexError: list index out of range


----------------- TOP 40 SENTENCES -----------------
At some point, we tried to raise more money and you know, the investor was like the entire natural deodorant industry is something like $30 million a year. You build a deodorant brand of all things. And so that, that was like the Genesis of like, okay, look, we think we can create a deodorant that does the job of an, a. Antiperspirant the other way I thought about it was look like a lot of us work in office environments, where we commute in a car, we get to an office, we sit at a desk and work at a computer and, you know, using an antiperspirant everyday. I really liked what it stood for, which was like, we want to use ingredients that were like native to the earth. And I was just like, look, these products, aren't cutting it. I get a bunch of crinkle paper delivered to the house with wonderful is my brother is the messiest. And this was in 2015, you know, like today there's a new, natural theater at the lunches every day in 2015. Sh

Traceback (most recent call last):
  File "<ipython-input-14-9ba77870bbc0>", line 23, in extracted_text_to_output
    start_ip_time = input_times_list[start_word_index].split(time_delimiter)[0].strip()
IndexError: list index out of range


----------------- TOP 45 SENTENCES -----------------
At some point, we tried to raise more money and you know, the investor was like the entire natural deodorant industry is something like $30 million a year. And the tactics that he shared are actually quite insightful. And so that's really where the problem starts. Aluminum is probably the largest thing that we're like, people are concerned about when it comes to the deodorant category. And so that, that was like the Genesis of like, okay, look, we think we can create a deodorant that does the job of an, a. Antiperspirant the other way I thought about it was look like a lot of us work in office environments, where we commute in a car, we get to an office, we sit at a desk and work at a computer and, you know, using an antiperspirant everyday. Cause I don't think that that's necessarily the case. I'm not sure if it's my personal body chemistry or what it is, but yeah, w we were testing natural deodorants left and right from Etsy. I get

Traceback (most recent call last):
  File "<ipython-input-14-9ba77870bbc0>", line 23, in extracted_text_to_output
    start_ip_time = input_times_list[start_word_index].split(time_delimiter)[0].strip()
IndexError: list index out of range


----------------- TOP 50 SENTENCES -----------------
At some point, we tried to raise more money and you know, the investor was like the entire natural deodorant industry is something like $30 million a year. He is the founder of native deodorant, which is a natural deodorant brand. And it'll be on my body for 23 hours and 45 minutes for the next 60 years, I should at least be able to pronounce some of the ingredients in this thing. I really liked what it stood for, which was like, we want to use ingredients that were like native to the earth. We get one sale and I'm like, okay, this business is over. And what was the little tagline at that time? I get a bunch of crinkle paper delivered to the house with wonderful is my brother is the messiest. And this was in 2015, you know, like today there's a new, natural theater at the lunches every day in 2015. We're basically one of the few guys out there. I used to be selling products at a farmer's market, and now I'm making 500 dealers for you

Traceback (most recent call last):
  File "<ipython-input-14-9ba77870bbc0>", line 23, in extracted_text_to_output
    start_ip_time = input_times_list[start_word_index].split(time_delimiter)[0].strip()
IndexError: list index out of range


----------------- TOP 60 SENTENCES -----------------
At some point, we tried to raise more money and you know, the investor was like the entire natural deodorant industry is something like $30 million a year. And the tactics that he shared are actually quite insightful. So I'm in New York city running another e-commerce business. I've lived in the same place for like four years. And, you know, I'm an attorney, I'm not a dumb guy. And what it does is aluminum acts as like a plug to block your sweat glands from excluding sweat. And so that, that was like the Genesis of like, okay, look, we think we can create a deodorant that does the job of an, a. Antiperspirant the other way I thought about it was look like a lot of us work in office environments, where we commute in a car, we get to an office, we sit at a desk and work at a computer and, you know, using an antiperspirant everyday. And this was in 2015, you know, like today there's a new, natural theater at the lunches every day in 201

Traceback (most recent call last):
  File "<ipython-input-14-9ba77870bbc0>", line 23, in extracted_text_to_output
    start_ip_time = input_times_list[start_word_index].split(time_delimiter)[0].strip()
IndexError: list index out of range


----------------- TOP 70 SENTENCES -----------------
At some point, we tried to raise more money and you know, the investor was like the entire natural deodorant industry is something like $30 million a year. But because he is a very entertaining guy, very charismatic. So I'm in New York city running another e-commerce business. And it'll be on my body for 23 hours and 45 minutes for the next 60 years, I should at least be able to pronounce some of the ingredients in this thing. And what it does is aluminum acts as like a plug to block your sweat glands from excluding sweat. Cause I don't think that that's necessarily the case. Bought the name in July, 2015, basically 12 days later launched the business. I'm not sure if it's my personal body chemistry or what it is, but yeah, w we were testing natural deodorants left and right from Etsy. I started contacting all these like reporters and I'm like, Hey, do you want to write it out native? And this was in 2015, you know, like today there'

Traceback (most recent call last):
  File "<ipython-input-14-9ba77870bbc0>", line 23, in extracted_text_to_output
    start_ip_time = input_times_list[start_word_index].split(time_delimiter)[0].strip()
IndexError: list index out of range


----------------- TOP 80 SENTENCES -----------------
At some point, we tried to raise more money and you know, the investor was like the entire natural deodorant industry is something like $30 million a year. So why would anyone be interested in investing in a category that only has $30 million a year run rates? And so he sold it two and a half years later to Proctor and gamble for a hundred million dollars. But because he is a very entertaining guy, very charismatic. So where are you when you have the idea for native deodorant? And so I'm really familiar with this place. And so that, that was like the Genesis of like, okay, look, we think we can create a deodorant that does the job of an, a. Antiperspirant the other way I thought about it was look like a lot of us work in office environments, where we commute in a car, we get to an office, we sit at a desk and work at a computer and, you know, using an antiperspirant everyday. Cause I don't think that that's necessarily the case. Boug

Traceback (most recent call last):
  File "<ipython-input-14-9ba77870bbc0>", line 23, in extracted_text_to_output
    start_ip_time = input_times_list[start_word_index].split(time_delimiter)[0].strip()
IndexError: list index out of range


----------------- TOP 90 SENTENCES -----------------
At some point, we tried to raise more money and you know, the investor was like the entire natural deodorant industry is something like $30 million a year. So why would anyone be interested in investing in a category that only has $30 million a year run rates? He is the founder of native deodorant, which is a natural deodorant brand. And so he sold it two and a half years later to Proctor and gamble for a hundred million dollars. And the tactics that he shared are actually quite insightful. I've lived in the same place for like four years. Aluminum is probably the largest thing that we're like, people are concerned about when it comes to the deodorant category. I don't want to take a pill to stop going to the bathroom. Aluminum free deodorant will do the job for you. And so you say we, but at the beginning it was just you correct? It was like, And, you know, we had gotten financial product. So I started testing every deodorant I coul

Traceback (most recent call last):
  File "<ipython-input-14-9ba77870bbc0>", line 23, in extracted_text_to_output
    start_ip_time = input_times_list[start_word_index].split(time_delimiter)[0].strip()
IndexError: list index out of range


----------------- TOP 100 SENTENCES -----------------
At some point, we tried to raise more money and you know, the investor was like the entire natural deodorant industry is something like $30 million a year. So why would anyone be interested in investing in a category that only has $30 million a year run rates? And I was like, if the natural deodorant industry is $30 million a year where the entire natural deodorant, we're doing $30 million a year at this point, Okay. And the tactics that he shared are actually quite insightful. So where are you when you have the idea for native deodorant? And you know, I've been seeing this problem for the last four years since I've been buying the deodorant from Dwayne Reed. And it'll be on my body for 23 hours and 45 minutes for the next 60 years, I should at least be able to pronounce some of the ingredients in this thing. And then my sister gets pregnant and she's telling me how she's using dove. And what it does is aluminum acts as like a plug 

Traceback (most recent call last):
  File "<ipython-input-14-9ba77870bbc0>", line 23, in extracted_text_to_output
    start_ip_time = input_times_list[start_word_index].split(time_delimiter)[0].strip()
IndexError: list index out of range
