# Meeting Notes Summarizer: AWS Summarize

### This notebook uses SageMaker and Hugging Face to summarize the transcript of a given meeting and organize the summary for later reference.

## GOALS:

#### Integrate one of the Hugging Face pretrained models, fine-tune it on a large amount of self-created data, and then build and deploy it.

#### STEPS:

1. Build, train and deploy the model from the HuggingFace pretrained model library.

2. Use self-recorded Chime meetings, with all of the transcripts stored in the S3 bucket that we will use for reference and training.

3. Use the trained model to create an efficient notes organizer for AWS employees and meeting members.

#### Integrate a speech-to-text converter that writes what each speaker says during the meeting into a live document, which our model can refer to and which we can use as training data.
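
As a rough sketch of what that live document could look like, the snippet below turns a list of speaker-labelled segments into `Speaker: text` lines, merging consecutive segments from the same speaker. The segment structure (`speaker` and `text` keys) and the function name `segments_to_document` are assumptions for illustration; they are not the output format of any particular speech-to-text service.

```python
def segments_to_document(segments):
    """Merge consecutive segments by the same speaker into 'Speaker: text' lines."""
    lines = []
    last_speaker = None
    for seg in segments:
        speaker = seg["speaker"].strip()
        text = seg["text"].strip()
        if not text:
            continue  # skip empty segments
        if speaker == last_speaker:
            # Same speaker is still talking: extend the current line
            lines[-1] += " " + text
        else:
            lines.append(f"{speaker}: {text}")
            last_speaker = speaker
    return "\n".join(lines)


# Hypothetical segments, for illustration only
example_segments = [
    {"speaker": "Speaker A", "text": "Let's get started."},
    {"speaker": "Speaker A", "text": "First item is the demo."},
    {"speaker": "Speaker B", "text": "The demo is ready."},
]
print(segments_to_document(example_segments))
```

Keeping one line per speaker turn gives the same `Name: text` layout as the self-recorded transcript used later in this notebook.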

## STEP 0: INSTALL THE TRANSFORMERS SDK LOCALLY



In [2]:
%%writefile requirements.txt

transformers == 4.6.1


Overwriting requirements.txt


In [3]:
%%capture
import IPython
import sys

!{sys.executable} -m pip install ipywidgets
IPython.Application.instance().kernel.do_shutdown(True)  # has to restart kernel so changes are used

In [1]:
## Install the requirements for this model
!pip install -r requirements.txt

Collecting transformers==4.6.1
  Using cached transformers-4.6.1-py3-none-any.whl (2.2 MB)
Collecting filelock
  Using cached filelock-3.12.2-py3-none-any.whl (10 kB)
Collecting regex!=2019.12.17
  Using cached regex-2023.6.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (772 kB)
Collecting tokenizers<0.11,>=0.10.1
  Using cached tokenizers-0.10.3-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting huggingface-hub==0.0.8
  Using cached huggingface_hub-0.0.8-py3-none-any.whl (34 kB)
Collecting sacremoses
  Using cached sacremoses-0.0.53-py3-none-any.whl
Installing collected packages: tokenizers, regex, filelock, sacremoses, huggingface-hub, transformers
Successfully installed filelock-3.12.2 huggingface-hub-0.0.8 regex-2023.6.3 sacremoses-0.0.53 tokenizers-0.10.3 transformers-4.6.1

## STEP 1: DOWNLOAD A PRETRAINED FACEBOOK BART MODEL AND TEST IT LOCALLY

In [2]:
pip install ipywidgets widgetsnbextension pandas-profiling

Collecting pandas-profiling
  Using cached pandas_profiling-3.6.6-py2.py3-none-any.whl (324 kB)
Collecting ydata-profiling
  Using cached ydata_profiling-4.4.0-py2.py3-none-any.whl (356 kB)
Collecting visions[type_image_path]==0.7.5
  Using cached visions-0.7.5-py3-none-any.whl (102 kB)
Collecting dacite>=1.8
  Using cached dacite-1.8.1-py3-none-any.whl (14 kB)
Collecting multimethod<2,>=1.4
  Using cached multimethod-1.9.1-py3-none-any.whl (10 kB)
Collecting imagehash==4.3.1
  Using cached ImageHash-4.3.1-py2.py3-none-any.whl (296 kB)
Collecting phik<0.13,>=0.11.1
  Using cached phik-0.12.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (679 kB)
Collecting statsmodels<1,>=0.13.2
  Using cached statsmodels-0.14.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (10.2 MB)
Collecting wordcloud>=1.9.1
  Using cached wordcloud-1.9.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (461 kB)
Collecting pydantic<2,>=1.8.1
  Using cached pydantic-1.10.12-cp38-cp38-man

In [7]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

notes_gpt = "facebook/bart-large-cnn"

tokenizer = AutoTokenizer.from_pretrained(notes_gpt)
model = AutoModelForSeq2SeqLM.from_pretrained(notes_gpt)

Downloading:   0%|          | 0.00/1.58k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.63G [00:00<?, ?B/s]

In [17]:
from transformers import set_seed

## Decode and display the generated sequences
def get_outputs(sample_outputs, tokenizer):
    
    ## Takes the raw output token IDs from the model and a tokenizer, decodes each
    ## sequence, prints it, and returns the decoded strings as a list
    rt = []
    
    print("Output:\n" + 100 * '-')
    for i, sample_output in enumerate(sample_outputs):
        txt = tokenizer.decode(sample_output, skip_special_tokens = True)
        print("{}: {}...".format(i, txt))
        print('')
        rt.append(txt)
        
    return rt

## Setting the seed makes sampling reproducible: with the same seed, the model outputs stay consistent across runs
set_seed(42)

text = "Karen hadn’t asked to be named Karen. She hadn’t asked to be dressed in modest dresses, always with tights and shoes. She certainly hadn’t asked for her parents to use the sort of psychological conditioning that led to so many people saying, “Butt out, Karen!” Once Mom and Dad passed away, Karen decided she’d finally do something about all the negative comments. She colored her hair, bought a pair of honest-to-goodness jeans, and changed her name to Kathy. Upon leaving the Social Security Administration, she spied a couple arguing heatedly about what their married last name ought to be. Kathy couldn’t stand to see and hear such animosity between two people in love, and walked toward them. Before she could even open her mouth, however, the woman turned to her and said, “Butt out, Karen!”."

input_ids = tokenizer.encode(text, return_tensors = 'pt')

sample_outputs = model.generate(input_ids,
                                do_sample = True,
                                ##max_length = 90,
                                num_return_sequences = 1)

## Decode and print the generated summary
generic_outputs = get_outputs(sample_outputs, tokenizer)

Output:
----------------------------------------------------------------------------------------------------
0: Karen decided she’d finally do something about all the negative comments. She colored her hair, bought a pair of honest-to-goodness jeans, and changed her name to Kathy. Upon leaving the Social Security Administration, she spied a couple arguing heatedly about what their married last name ought to be. Kathy couldn’t stand to see and hear such animosity between two people in love, and walked toward them. Before she could even open her mouth, however, the woman turned to her and said, “Butt out, Karen!”...
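
The summary above is copied almost verbatim from the tail of the input. One rough way to quantify that is to compare lengths and count how many summary sentences appear word for word in the source. The helper below is a sketch only: `copy_stats` and its naive sentence split are assumptions for illustration, not part of `transformers`, and the example strings are made up.

```python
import re


def copy_stats(source, summary):
    """Return (compression_ratio, fraction_of_summary_sentences_copied_verbatim)."""
    # Naive sentence split: break after ., ! or ? when followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", summary) if s.strip()]
    copied = sum(1 for s in sentences if s in source)
    ratio = len(summary.split()) / max(len(source.split()), 1)
    frac = copied / len(sentences) if sentences else 0.0
    return ratio, frac


# Made-up strings, for illustration only
source = "The team met on Monday. They agreed to ship the fix. Lunch was pizza."
summary = "They agreed to ship the fix."
ratio, frac = copy_stats(source, summary)
print(f"compression ratio: {ratio:.2f}, copied sentences: {frac:.0%}")
```

A copied fraction close to 1 suggests the model is selecting sentences rather than rephrasing them, which is worth keeping in mind when judging the fine-tuned model later.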



## STEP 2: FINE TUNE THE FACEBOOK BART SUMMARIZER WITH A REAL MEETING TRANSCRIPT (SELF-RECORDED)

#### Here, we will fine-tune the model on a real meeting transcript that I recorded myself, to check how the model performs on it:

#### GOALS:

1. Summarize a transcript in an organized way.

2. Make sure all important points come across to the reader (maybe we assign labels to the meeting).

3. Make sure the model uses the labels to pick up the important points from the meeting.

In [11]:
%%writefile train.txt

Madhur: Hey, how's it going? So let me turn on the transcript uh language preference. Let's go with English for now. OK. 
So I can see that the machine generated captions are by Amazon transcribe. Well, this is the first time I'm joining 
Amazon Shine with my hair all open. Usually during my workout, I tie, tie them back because it just looks uh I look like a broccoli. 
But anyways, uh this is an experiment. So I'm trying to uh this, I'm trying to work on a project where for every meeting, I'm trying to 
extract the transcripts through the chime calls and then display it after the calls have ended or the meetings have ended to the uh 
members of the meeting in a summarized manner or in a manner where they feel comfortable to read. Maybe they feel like being more organized 
after they missed a huge meeting. So they just want to look at the, the important pointers. So I will be focusing on taking this transcript and 
actually using it in the prototype that I'm trying to create. And let's see how it goes. I'm trying to see if I can get this transcript really long 
so that I can see that my protype works or not. And I'm just looking at the transcripts right now because I'm just kind of distracted at how Amazon
chime also has a one second delay, maybe a millisecond delay in their um meeting. So I can see my lips moving a bit slower than they actually are. 
So, so, yeah, a lot of redundant information there. Let's move on forward and uh, try this transcript out. All right. See you.

Overwriting train.txt


#### We are going to use a script written by Hugging Face, `run_clm.py`, which sits in the Hugging Face repo and accepts generic text (we do not have to tokenize it ourselves).

In [14]:
data = []

## Read the training transcript written above, keeping only the non-empty lines

with open('train.txt') as f:
    for row in f.readlines():
        d = row.strip()
        if len(d) > 0:
            data.append(d)

In [15]:
print(data[:10])

["Madhur: Hey, how's it going? So let me turn on the transcript uh language preference. Let's go with English for now. OK.", "So I can see that the machine generated captions are by Amazon transcribe. Well, this is the first time I'm joining", 'Amazon Shine with my hair all open. Usually during my workout, I tie, tie them back because it just looks uh I look like a broccoli.', "But anyways, uh this is an experiment. So I'm trying to uh this, I'm trying to work on a project where for every meeting, I'm trying to", 'extract the transcripts through the chime calls and then display it after the calls have ended or the meetings have ended to the uh', 'members of the meeting in a summarized manner or in a manner where they feel comfortable to read. Maybe they feel like being more organized', 'after they missed a huge meeting. So they just want to look at the, the important pointers. So I will be focusing on taking this transcript and', "actually using it in the prototype that I'm trying to c
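
Because `train.txt` is hard-wrapped, each element of `data` is a fragment of a sentence rather than a complete speaker turn. If whole turns are preferable, one option is to rejoin the rows and start a new turn only when a row begins with a `Name:` prefix. This is a sketch under stated assumptions: the `merge_turns` helper and the regular expression for speaker labels are hypothetical and may need adjusting for other transcripts.

```python
import re

# Assumed label format: a short capitalized name at the start of a row, followed by ": "
SPEAKER_RE = re.compile(r"^[A-Z][\w .'-]{0,30}: ")


def merge_turns(rows):
    """Join wrapped rows so that each item is one full speaker turn."""
    turns = []
    for row in rows:
        row = row.strip()
        if not row:
            continue
        if SPEAKER_RE.match(row) or not turns:
            # A speaker label (or the very first row) starts a new turn
            turns.append(row)
        else:
            # Otherwise the row continues the previous turn
            turns[-1] += " " + row
    return turns


rows = [
    "Madhur: Hey, how's it going? So let me turn on the transcript uh language preference.",
    "So I can see that the machine generated captions are by Amazon transcribe.",
]
print(merge_turns(rows))
```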

In [18]:
## Create a SageMaker session and upload the training file to the session's default S3 bucket

import sagemaker

sess = sagemaker.Session()
bucket = sess.default_bucket()

train_file_name = 'train.txt'
s3_train_data = 's3://{}/bart/{}'.format(bucket, train_file_name)

!aws s3 cp {train_file_name} {s3_train_data}

upload: ./train.txt to s3://sagemaker-us-east-1-988564344122/bart/train.txt
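
The estimator cell below sets `entry_point='run_summarization.py'`. To my knowledge, that example script reads csv or JSON training files with one input column and one target column rather than plain text, so a transcript may need to be paired with a reference summary first. The sketch below writes one JSON Lines record per example. The column names `text` and `summary`, the file name `train.json`, and the placeholder summary are all assumptions for illustration.

```python
import json


def write_jsonl(examples, path):
    """Write (transcript, summary) pairs as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for transcript, summary in examples:
            record = {"text": transcript, "summary": summary}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Placeholder pair, for illustration only; a real reference summary
# would have to be written by hand for each transcript.
examples = [
    ("Madhur: Hey, how's it going? ...", "PLACEHOLDER: hand-written summary goes here."),
]
write_jsonl(examples, "train.json")
```

If other column names are used, the script's `text_column` and `summary_column` arguments can, as far as I know, be passed as hyperparameters to point at them.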


In [23]:
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFace

# Use the notebook's execution role; outside SageMaker, look the role up by name
try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

hyperparameters = {
    'model_name_or_path': 'facebook/bart-large-cnn',
    'output_dir': '/opt/ml/model',
    'do_train': True,
    # NOTE (assumption to verify): run_summarization.py appears to expect a csv or json
    # train_file, so a plain .txt file may be rejected by the script
    'train_file': '/opt/ml/input/data/train/{}'.format(train_file_name),
    'num_train_epochs': 5,
    'per_device_train_batch_size': 64,
    # add your remaining hyperparameters
    # more info here https://github.com/huggingface/transformers/tree/v4.26.0/examples/pytorch/summarization
}

# git configuration to download our fine-tuning script
git_config = {'repo': 'https://github.com/huggingface/transformers.git', 'branch': 'v4.26.0'}

# creates Hugging Face estimator
huggingface_estimator = HuggingFace(
    entry_point='run_summarization.py',
    # At tag v4.26.0 the script lives in examples/pytorch/summarization.
    # Pointing source_dir at './examples/pytorch/seq2seq' fails with
    # "ValueError: Source directory does not exist in the repo."
    source_dir='./examples/pytorch/summarization',
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    role=role,
    git_config=git_config,
    transformers_version='4.26.0',
    pytorch_version='1.13.1',
    py_version='py39',
    hyperparameters=hyperparameters,

    ## Pass the training compiler config to speed up your job
    ##compiler_config = TrainingCompilerConfig(),
    environment={'GPU_NUM_DEVICES': '1'},
    disable_profiler=True,
    debugger_hook_config=False
)

# starting the train job
huggingface_estimator.fit({'train': s3_train_data}, wait=True)