# Meeting Notes Summarizer: AWS Summarize

### This notebook uses SageMaker and Hugging Face to summarize the transcript of a given meeting and organize the summary for later reference.

## GOALS:

#### Integrate one of the Hugging Face pretrained models, fine-tune it on self-created data, and then build and deploy it.

#### STEPS:

1. Build, train and deploy the model from the HuggingFace pretrained model library.

2. Use self-recorded Chime meetings, with all of the transcripts stored in an S3 bucket, for reference and training.

3. Use the trained model to create an efficient notes organizer for AWS employees and meeting members.

#### Integrate a speech-to-text converter that writes what each speaker says during the meeting into a live document, which our model can then refer to and train on.

## STEP 0: INSTALL THE TRANSFORMERS SDK LOCALLY



In [2]:
%%writefile requirements.txt

transformers == 4.6.1


Overwriting requirements.txt


In [3]:
## Represents installing the requirements for this model
!pip install -r requirements.txt

Collecting transformers==4.6.1
  Using cached transformers-4.6.1-py3-none-any.whl (2.2 MB)
Collecting sacremoses
  Using cached sacremoses-0.0.53-py3-none-any.whl
Collecting tokenizers<0.11,>=0.10.1
  Using cached tokenizers-0.10.3-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting regex!=2019.12.17
  Using cached regex-2023.6.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (772 kB)
Collecting filelock
  Using cached filelock-3.12.2-py3-none-any.whl (10 kB)
Collecting huggingface-hub==0.0.8
  Using cached huggingface_hub-0.0.8-py3-none-any.whl (34 kB)
Installing collected packages: tokenizers, regex, filelock, sacremoses, huggingface-hub, transformers
Successfully installed filelock-3.12.2 huggingface-hub-0.0.8 regex-2023.6.3 sacremoses-0.0.53 tokenizers-0.10.3 transformers-4.6.1

## STEP 1: DOWNLOAD A PRETRAINED FACEBOOK BART MODEL AND TEST IT LOCALLY

In [4]:

# Load model directly
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

notes_gpt = "knkarthick/MEETING_SUMMARY"

tokenizer = AutoTokenizer.from_pretrained(notes_gpt)
model = AutoModelForSeq2SeqLM.from_pretrained(notes_gpt)

  from .autonotebook import tqdm as notebook_tqdm


In [5]:
from transformers import set_seed

## Represents displaying the output in the way we need
def get_outputs(sample_outputs, tokenizer):
    
    ## Represents taking in a tokenizer, and raw output from the given model, decoding and 
    ## formatting the output nicely
    rt = []
    
    print("Output:\n" + 100 * '-')
    for i, sample_output in enumerate(sample_outputs):
        txt = tokenizer.decode(sample_output, skip_special_tokens = True)
        print("{}: {}...".format(i, txt))
        print('')
        rt.append(txt)
        
    return rt

## Setting the seed helps us ensure reproducibility, and when the seed is consistent, the model outputs will be consistent
set_seed(42)

text = "Karen hadn’t asked to be named Karen. She hadn’t asked to be dressed in modest dresses, always with tights and shoes. She certainly hadn’t asked for her parents to use the sort of psychological conditioning that led to so many people saying, “Butt out, Karen!” Once Mom and Dad passed away, Karen decided she’d finally do something about all the negative comments. She colored her hair, bought a pair of honest-to-goodness jeans, and changed her name to Kathy. Upon leaving the Social Security Administration, she spied a couple arguing heatedly about what their married last name ought to be. Kathy couldn’t stand to see and hear such animosity between two people in love, and walked toward them. Before she could even open her mouth, however, the woman turned to her and said, “Butt out, Karen!”."

input_ids = tokenizer.encode(text, return_tensors = 'pt')

sample_outputs = model.generate(input_ids, 
                                     do_sample = True, 
                                     ##max_length = 90,
                                     num_return_sequences = 1)

## Represents giving out the output
generic_outputs = get_outputs(sample_outputs, tokenizer)

  next_indices = next_tokens // vocab_size


Output:
----------------------------------------------------------------------------------------------------
0: Karen's parents forced her to bear the name Karen. After her parents passed away, Kathy decided to change her name to Kathy. She noticed a couple arguing at the Social Security Administration about what their married last name ought to be....
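
The cell above calls `set_seed(42)` before sampling with `do_sample = True`. As a rough illustration of why a fixed seed makes random sampling repeatable, here is a toy sketch that uses only the standard library. The vocabulary and weights are made up, and this is not how the transformers library samples internally.

```python
import random

# Hypothetical next-token distribution (made up for illustration)
TOKENS = ["Karen", "Kathy", "parents", "name", "couple"]
WEIGHTS = [0.4, 0.3, 0.15, 0.1, 0.05]


def sample_tokens(seed, n=8):
    """Draw n tokens from the toy distribution with a freshly seeded generator."""
    rng = random.Random(seed)
    return rng.choices(TOKENS, weights=WEIGHTS, k=n)


first = sample_tokens(seed=42)
second = sample_tokens(seed=42)

print(first)
print("same seed, same draw:", first == second)
```

Re-running a sampling cell without resetting the seed continues the random stream, so its output can differ from the first run.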



## STEP 2: FINE TUNE THE FACEBOOK BART SUMMARIZER WITH A REAL MEETING TRANSCRIPT (SELF-RECORDED)

#### Here, we fine-tune the model on a real experiment: I recorded myself to generate a meeting transcript, and we check how the model performs on it:

#### GOALS:

1. Summarizing a transcript in an organized way

2. Making sure all important points come across to the reader (maybe we assign labels to the meeting)

3. Make sure the model uses the label to pick up the important pointers from the meeting.

In [6]:
%%writefile train.txt

Madhur: Hey, how's it going? So let me turn on the transcript uh language preference. Let's go with English for now. OK. 
So I can see that the machine generated captions are by Amazon transcribe. Well, this is the first time I'm joining 
Amazon Shine with my hair all open. Usually during my workout, I tie, tie them back because it just looks uh I look like a broccoli. 
But anyways, uh this is an experiment. So I'm trying to uh this, I'm trying to work on a project where for every meeting, I'm trying to 
extract the transcripts through the chime calls and then display it after the calls have ended or the meetings have ended to the uh 
members of the meeting in a summarized manner or in a manner where they feel comfortable to read. Maybe they feel like being more organized 
after they missed a huge meeting. So they just want to look at the, the important pointers. So I will be focusing on taking this transcript and 
actually using it in the prototype that I'm trying to create. And let's see how it goes. I'm trying to see if I can get this transcript really long 
so that I can see that my protype works or not. And I'm just looking at the transcripts right now because I'm just kind of distracted at how Amazon
chime also has a one second delay, maybe a millisecond delay in their um meeting. So I can see my lips moving a bit slower than they actually are. 
So, so, yeah, a lot of redundant information there. Let's move on forward and uh, try this transcript out. All right. See you.

Overwriting train.txt


#### We are going to use a script written by Hugging Face, run_clm.py, which sits in the Hugging Face transformers repo. We can pass in plain text (we do not have to tokenize it ourselves).
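
When SageMaker runs an entry script in script mode, the estimator's hyperparameters reach the script as command-line arguments. The sketch below approximates that mapping for the hyperparameters used later in this notebook. `to_cli_args` is a hypothetical helper written for illustration, not part of the SageMaker SDK, and the exact formatting the training toolkit uses may differ.

```python
def to_cli_args(hyperparameters):
    """Flatten a hyperparameter dict into a ['--key', 'value', ...] list."""
    args = []
    for key, value in hyperparameters.items():
        args.extend(["--" + key, str(value)])
    return args


# Same keys as the hyperparameters dict passed to the estimator in this notebook
example = {
    "model_name_or_path": "knkarthick/MEETING_SUMMARY",
    "output_dir": "/opt/ml/model",
    "do_train": True,
    "train_file": "/opt/ml/input/data/train/train.txt",
    "num_train_epochs": 5,
    "per_device_train_batch_size": 64,
}

# Roughly the command the training container would end up running
print(" ".join(["python", "run_clm.py"] + to_cli_args(example)))
```

Seeing the flattened command makes it easier to check each key against the arguments the script actually accepts.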

In [7]:
data = []

## Represents going over the training transcript sample above

with open('train.txt') as f:
    for row in f.readlines():
        d = row.strip()
        if len(d) > 0:
            data.append(d)
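
The loop above keeps each wrapped line of the transcript as a separate entry, so one speaker turn is spread over many list items. A possible follow-up step, sketched here under the assumption that a turn begins with a short label such as `Madhur:`, is to merge the wrapped lines back into (speaker, utterance) pairs. The regular expression and the `UNKNOWN` fallback are illustrative choices.

```python
import re

# A new turn starts with a short speaker label such as "Madhur:" (assumed format)
SPEAKER_RE = re.compile(r"^([A-Z][\w .'-]{0,30}):\s+(.*)$")


def group_turns(lines):
    """Merge wrapped transcript lines into (speaker, utterance) pairs."""
    turns = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        match = SPEAKER_RE.match(line)
        if match:
            turns.append([match.group(1), match.group(2)])
        elif turns:
            # Continuation of the previous speaker's turn
            turns[-1][1] += " " + line
        else:
            # Text that appears before any speaker label
            turns.append(["UNKNOWN", line])
    return [(speaker, text) for speaker, text in turns]


sample = [
    "Madhur: Hey, how's it going? So let me turn on the transcript uh language preference.",
    "So I can see that the machine generated captions are by Amazon transcribe.",
]
for speaker, text in group_turns(sample):
    print(speaker, "->", text)
```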

In [8]:
print(data[:10])

["Madhur: Hey, how's it going? So let me turn on the transcript uh language preference. Let's go with English for now. OK.", "So I can see that the machine generated captions are by Amazon transcribe. Well, this is the first time I'm joining", 'Amazon Shine with my hair all open. Usually during my workout, I tie, tie them back because it just looks uh I look like a broccoli.', "But anyways, uh this is an experiment. So I'm trying to uh this, I'm trying to work on a project where for every meeting, I'm trying to", 'extract the transcripts through the chime calls and then display it after the calls have ended or the meetings have ended to the uh', 'members of the meeting in a summarized manner or in a manner where they feel comfortable to read. Maybe they feel like being more organized', 'after they missed a huge meeting. So they just want to look at the, the important pointers. So I will be focusing on taking this transcript and', "actually using it in the prototype that I'm trying to c

In [9]:
## Represents creating the SageMaker session and uploading the training file to S3

import sagemaker

sess = sagemaker.Session()
bucket = sess.default_bucket()

train_file_name = 'train.txt'
s3_train_data = 's3://{}/bart/{}'.format(bucket, train_file_name)

!aws s3 cp {train_file_name} {s3_train_data}

upload: ./train.txt to s3://sagemaker-us-east-1-988564344122/bart/train.txt
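
The upload above shells out to the AWS CLI. A rough Python equivalent using boto3 could look like the sketch below. `split_s3_uri` and `upload_to_s3` are hypothetical helper names, the bucket in the example is a placeholder, and the upload itself needs AWS credentials, so only the offline parsing step is exercised here.

```python
def split_s3_uri(uri):
    """Split 's3://bucket/key/parts' into (bucket, key)."""
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError("not an S3 URI: {!r}".format(uri))
    bucket, _, key = uri[len(prefix):].partition("/")
    if not bucket or not key:
        raise ValueError("S3 URI needs both a bucket and a key: {!r}".format(uri))
    return bucket, key


def upload_to_s3(local_path, s3_uri):
    """Upload a local file to the given S3 URI using boto3 (imported lazily)."""
    import boto3  # requires AWS credentials to be configured

    bucket, key = split_s3_uri(s3_uri)
    boto3.client("s3").upload_file(local_path, bucket, key)
    return bucket, key


# Parsing works offline; the bucket name here is a placeholder
print(split_s3_uri("s3://my-example-bucket/bart/train.txt"))
```

In the notebook this would be called as `upload_to_s3(train_file_name, s3_train_data)`.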


In [10]:
!pip install -U sagemaker

Collecting sagemaker
  Using cached sagemaker-2.175.0-py2.py3-none-any.whl
Collecting platformdirs
  Using cached platformdirs-3.10.0-py3-none-any.whl (17 kB)
Collecting attrs<24,>=23.1.0
  Using cached attrs-23.1.0-py3-none-any.whl (61 kB)
Collecting boto3<2.0,>=1.26.131
  Using cached boto3-1.28.20-py3-none-any.whl (135 kB)
Collecting PyYAML~=6.0
  Using cached PyYAML-6.0.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (736 kB)
Collecting jsonschema
  Using cached jsonschema-4.18.6-py3-none-any.whl (83 kB)
Collecting tblib==1.7.0
  Using cached tblib-1.7.0-py2.py3-none-any.whl (12 kB)
Collecting botocore<1.32.0,>=1.31.20
  Using cached botocore-1.31.20-py3-none-any.whl (11.1 MB)
Collecting pkgutil-resolve-name>=1.3.10
  Using cached pkgutil_resolve_name-1.3.10-py3-none-any.whl (4.7 kB)
Collecting jsonschema-specifications>=2023.03.6
  Using cached jsonschema_specifications-2023.7.1-py3-none-any.whl (17 kB)
Collecting referencing>=0.28.4
  Using cached referencing-0.30.2-py

In [11]:
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFace

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

hyperparameters = {
    'model_name_or_path': 'knkarthick/MEETING_SUMMARY',
    'output_dir': '/opt/ml/model',
    'do_train': True,
    'train_file': '/opt/ml/input/data/train/{}'.format(train_file_name),
    'num_train_epochs': 5,
    'per_device_train_batch_size': 64,
    # add your remaining hyperparameters
    # more info here https://github.com/huggingface/transformers/tree/v4.26.0/examples/pytorch/seq2seq
}

# git configuration to download our fine-tuning script
git_config = {'repo': 'https://github.com/huggingface/transformers.git', 'branch': 'v4.26.0'}

# creates Hugging Face estimator
huggingface_estimator = HuggingFace(
    entry_point='run_clm.py',
    source_dir='examples/pytorch/language-modeling',
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    role=role,
    git_config=git_config,
    transformers_version='4.26.0',
    pytorch_version='1.13.1',
    py_version='py39',
    hyperparameters=hyperparameters,
    ## Pass the training compiler config to speed up your job
    ##compiler_config = TrainingCompilerConfig(),
    environment={'GPU_NUM_DEVICES': '1'},
    disable_profiler=True,
    debugger_hook_config=False
)

# starting the train job
huggingface_estimator.fit({'train': s3_train_data}, wait=True)


INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: huggingface-pytorch-training-2023-08-06-00-18-54-732


2023-08-06 00:19:14 Starting - Starting the training job...
2023-08-06 00:19:40 Starting - Preparing the instances for training.........
2023-08-06 00:21:11 Downloading - Downloading input data
2023-08-06 00:21:11 Training - Downloading the training image.....................
2023-08-06 00:24:37 Training - Training image download completed. Training in progress...
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2023-08-06 00:24:56,950 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2023-08-06 00:24:56,969 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2023-08-06 00:24:56,981 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2023-08-06 00:24:56,984 sagemaker_pytorch_container.training INFO     Invoking user training script.
2023-08-06 00:24:57,2

## STEP 4: TESTING OUR TRAINED MODEL LOCALLY

In [12]:
from sagemaker.huggingface import HuggingFace
import time

## The local folder name does not depend on the estimator, so set it once
local_model_path = 'bart_finetuned'

try:
    s3_model_data = huggingface_estimator.model_data
except Exception:
    ## Brief pause, then retry once
    time.sleep(5)
    s3_model_data = huggingface_estimator.model_data

In [13]:
!mkdir -p {local_model_path}
!aws s3 cp {s3_model_data} {local_model_path}
!tar -xvf {local_model_path}/model.tar.gz -C {local_model_path}
!rm {local_model_path}/model.tar.gz

download: s3://sagemaker-us-east-1-988564344122/huggingface-pytorch-training-2023-08-06-00-18-54-732/output/model.tar.gz to bart_finetuned/model.tar.gz
generation_config.json
tokenizer.json
merges.txt
tokenizer_config.json
pytorch_model.bin
all_results.json
trainer_state.json
special_tokens_map.json
training_args.bin
vocab.json
config.json
train_results.json
README.md
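
The shell commands above create the folder, unpack the archive and list its files. For reference, the unpacking step can also be done from Python with the standard `tarfile` module. This is a minimal sketch; `extract_model_archive` is a hypothetical helper name, and it assumes the archive has already been copied from S3.

```python
import os
import tarfile


def extract_model_archive(archive_path, target_dir):
    """Unpack a .tar.gz model archive into target_dir and return the member names."""
    os.makedirs(target_dir, exist_ok=True)  # no error if the folder already exists
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(target_dir)
    return names


# Usage, assuming model.tar.gz was already copied into the local folder:
# extract_model_archive("bart_finetuned/model.tar.gz", "bart_finetuned")
```

Because `os.makedirs(..., exist_ok=True)` tolerates an existing folder, re-running the cell does not fail on the directory step.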


In [14]:
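## Load into the transformers SDK.
## Note: this loads the original checkpoint from the Hugging Face Hub,
## not the fine-tuned weights extracted to local_model_path above.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("knkarthick/MEETING_SUMMARY")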

In [15]:
## to make sure we can run inference with this model locally
model.eval()

BartForConditionalGeneration(
  (model): BartModel(
    (shared): Embedding(50264, 1024, padding_idx=1)
    (encoder): BartEncoder(
      (embed_tokens): Embedding(50264, 1024, padding_idx=1)
      (embed_positions): BartLearnedPositionalEmbedding(1026, 1024)
      (layers): ModuleList(
        (0): BartEncoderLayer(
          (self_attn): BartAttention(
            (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear(in_features=1024, out_features=4096, bias=True)
          (fc2): Linear(in_features=4096, out_features=1024, bias=True)
          (final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
   

In [16]:
from transformers import set_seed

## Setting the seed helps us ensure reproducibility, and when the seed is consistent, the model outputs will be consistent
set_seed(42)

text = 'Madhur: Hey, hows it going? So let me turn on the transcript uh language preference. Lets go with English for now. OK. So I can see that the machine generated captions are by Amazon transcribe. Well, this is the first time Im joining  Amazon Shine with my hair all open. Usually during my workout, I tie, tie them back because it just looks uh I look like a broccoli.  But anyways, uh this is an experiment. So Im trying to uh this, Im trying to work on a project where for every meeting, Im trying to extract the transcripts through the chime calls and then display it after the calls have ended or the meetings have ended to the uh members of the meeting in a summarized manner or in a manner where they feel comfortable to read. Maybe they feel like being more organized after they missed a huge meeting. So they just want to look at the, the important pointers. So I will be focusing on taking this transcript and actually using it in the prototype that Im trying to create. And lets see how it goes. Im trying to see if I can get this transcript really long so that I can see that my protype works or not. And Im just looking at the transcripts right now because Im just kind of distracted at how Amazonchime also has a one second delay, maybe a millisecond delay in their um meeting. So I can see my lips moving a bit slower than they actually are. So, so, yeah, a lot of redundant information there. Lets move on forward and uh, try this transcript out. All right. See you.'

input_ids = tokenizer.encode(text, return_tensors = 'pt')

sample_outputs = model.generate(input_ids, 
                                     do_sample = True, 
                                     max_length = 90,
                                     num_return_sequences = 1)

## Represents giving out the output
generic_outputs = get_outputs(sample_outputs, tokenizer)

  next_indices = next_tokens // vocab_size


Output:
----------------------------------------------------------------------------------------------------
0: Madhur is working on a project where he wants to extract the transcripts from the Amazon Chime calls and present them to the participants in a summarized manner or in a manner where they feel comfortable to read them....
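
The model printout earlier in this step shows `BartLearnedPositionalEmbedding(1026, 1024)`, which suggests the encoder handles roughly 1024 token positions, so a much longer transcript may not fit in one pass. One common workaround, sketched here as an assumption and not as part of the original pipeline, is to split the transcript into overlapping windows and summarize each one. Word counts are only a rough proxy for tokens, and the default sizes are arbitrary.

```python
def chunk_words(text, max_words=400, overlap=40):
    """Split text into word windows of at most max_words, overlapping by `overlap` words."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    chunks = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


demo = " ".join(str(i) for i in range(10))
for piece in chunk_words(demo, max_words=4, overlap=1):
    print(piece)
```

Each chunk could then be passed through `tokenizer.encode` and `model.generate` in turn, with the partial summaries joined afterwards.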



### Passing in different parameters to play with the sample outputs


In [17]:
sample_outputs = model.generate(input_ids, 
                                     do_sample = True, 
                                ## nucleus sampling: keep the smallest set of most likely tokens whose cumulative probability reaches top_p
                                top_p = 0.85,
                                ## only sample from the top_k most likely tokens
                                top_k=200,
                                     ##max_length = 90,
                                     num_return_sequences = 3)

## Represents giving out the output
generic_outputs = get_outputs(sample_outputs, tokenizer)

Output:
----------------------------------------------------------------------------------------------------
0: Madhur is working on a project to extract the transcripts from the Amazon Chime calls and display them after the calls have ended. He is trying to create a prototype of the project....

1: Madhur is working on a project to extract the transcripts from the Amazon Chime calls and present them to the participants in a summarized manner after the calls have ended....

2: Madhur is working on a project to extract the transcripts from the Amazon Chime calls and present them to the participants in a summarized manner after the calls have ended. He is trying to create a prototype of the project....



In [18]:
sample_outputs = model.generate(input_ids, 
                                     do_sample = True, 
                                ## nucleus sampling: keep the smallest set of most likely tokens whose cumulative probability reaches top_p
                                top_p = 0.95,
                                ## only sample from the top_k most likely tokens
                                top_k=100,
                                     ##max_length = 90,
                                     num_return_sequences = 3)

## Represents giving out the output
generic_outputs = get_outputs(sample_outputs, tokenizer)

Output:
----------------------------------------------------------------------------------------------------
0: Madhur is working on a project to extract the transcripts from the Amazon Chime calls and display them after the calls have ended to give them to the participants of the meeting in a summarized manner or in a manner where they feel comfortable to read....

1: Madhur is working on a project where he wants to extract the transcripts from the Amazon Chime calls and present them to the participants in a summarized manner or in a manner where they feel comfortable to read them....

2: Madhur is working on a project to extract the transcripts from the Amazon Chime calls and display them after the calls have ended. He is trying to create a prototype of the project....
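
To make the effect of `top_k` and `top_p` more concrete, the sketch below applies both filters to a made-up four-token distribution. It is a simplified illustration in plain Python, not the transformers implementation: it keeps the `top_k` most likely tokens, renormalizes, then keeps the smallest prefix whose cumulative probability reaches `top_p`.

```python
def top_k_top_p_filter(probs, top_k=0, top_p=1.0):
    """Return the renormalized distribution left after top-k, then top-p, filtering.

    probs maps token -> probability. top_k=0 disables the top-k step.
    """
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]

    # Renormalize so the top-p step sees a proper distribution
    mass = sum(p for _, p in ranked)
    ranked = [(token, p / mass) for token, p in ranked]

    kept = []
    cumulative = 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break  # smallest prefix whose mass reaches top_p

    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Hypothetical next-token probabilities (made up for illustration)
toy = {"project": 0.5, "meeting": 0.3, "broccoli": 0.15, "delay": 0.05}
print(top_k_top_p_filter(toy, top_k=3, top_p=0.8))
```

Lower values of either parameter shrink the candidate pool, which is consistent with the three sampled summaries above being close variations of one another.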



### DEPLOYING OUR MODEL TO AN ENDPOINT BY PUBLISHING TO HUGGING FACE AND THEN DEPLOYING IT TO STREAMLIT


In [19]:


model_endpoint = huggingface_estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large")

INFO:sagemaker:Creating model with name: huggingface-pytorch-training-2023-08-06-00-28-51-706
INFO:sagemaker:Creating endpoint-config with name huggingface-pytorch-training-2023-08-06-00-28-51-706
INFO:sagemaker:Creating endpoint with name huggingface-pytorch-training-2023-08-06-00-28-51-706


------!

In [87]:
model_endpoint.endpoint_name

'huggingface-pytorch-training-2023-08-06-00-28-51-706'
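
Once the endpoint is in service, the predictor returned by `deploy()` can be called with a JSON-style payload. The sketch below assumes the Hugging Face inference container accepts the usual `{"inputs": ..., "parameters": ...}` shape for this model; whether it does depends on the task the container infers. `build_payload` and `summarize_with_endpoint` are hypothetical helper names.

```python
def build_payload(text, max_length=90, do_sample=False):
    """Build a request body in the {'inputs': ..., 'parameters': ...} shape."""
    return {
        "inputs": text,
        "parameters": {"max_length": max_length, "do_sample": do_sample},
    }


def summarize_with_endpoint(predictor, text):
    """Send text to a deployed endpoint and return the raw response."""
    return predictor.predict(build_payload(text))


# Usage in the notebook, where model_endpoint is the predictor returned by deploy():
# response = summarize_with_endpoint(model_endpoint, text)
# model_endpoint.delete_endpoint()  # stop paying for the instance when finished

print(build_payload("Madhur: Hey, how's it going?"))
```

Deleting the endpoint after testing matters because the instance is billed for as long as it is running.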

# Meeting Notes Summarizer: AWS Instructor

### This notebook uses SageMaker and Hugging Face to generate instructions from the input text given to the model.

## GOALS:

#### Integrate one of the Hugging Face pretrained models, fine-tune it on self-created data, and then build and deploy it.

#### STEPS:

1. Build, train and deploy the model from the HuggingFace pretrained model library.

2. Use self-recorded Chime meetings, with all of the transcripts stored in an S3 bucket, for reference and training.

3. Use the trained model to create an efficient notes organizer for AWS employees and meeting members.

#### Integrate a speech-to-text converter that writes what each speaker says during the meeting into a live document, which our model can then refer to and train on.

## STEP 0: INSTALL THE TRANSFORMERS SDK LOCALLY

In [88]:
%%writefile requirements1.txt

transformers == 4.6.1

Writing requirements1.txt


In [89]:
## Represents installing the requirements for this model
!pip install -r requirements1.txt

Collecting transformers==4.6.1
  Using cached transformers-4.6.1-py3-none-any.whl (2.2 MB)
Collecting tokenizers<0.11,>=0.10.1
  Using cached tokenizers-0.10.3-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting huggingface-hub==0.0.8
  Using cached huggingface_hub-0.0.8-py3-none-any.whl (34 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
  Attempting uninstall: tokenizers
    Found existing installation: tokenizers 0.13.3
    Uninstalling tokenizers-0.13.3:
      Successfully uninstalled tokenizers-0.13.3
  Attempting uninstall: huggingface-hub
    Found existing installation: huggingface-hub 0.16.4
    Uninstalling huggingface-hub-0.16.4:
      Successfully uninstalled huggingface-hub-0.16.4
  Attempting uninstall: transformers
    Found existing installation: transformers 4.27.0.dev0
    Uninstalling transformers-4.27.0.dev0:
      Successfully uninstalled transformers-4.27.0.dev0
Successful
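
The install log above shows pip replacing transformers 4.27.0.dev0 with the pinned 4.6.1, so it can be worth checking which version the kernel will actually import before loading a model. A minimal sketch using the standard library follows. `installed_version` and `matches_pin` are hypothetical helper names.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_pin(package, pinned):
    """True only when `package` is installed at exactly the pinned version."""
    return installed_version(package) == pinned


# Compare the active version against the pin from requirements1.txt
print("transformers:", installed_version("transformers"))
print("matches 4.6.1 pin:", matches_pin("transformers", "4.6.1"))
```

If a package was upgraded or downgraded in a running session, the kernel usually needs a restart before the new version is imported.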

## STEP 1: DOWNLOAD THE PRETRAINED llm-blender/gen_fuser_770m MODEL AND TEST IT LOCALLY

In [11]:
pip install git+https://github.com/zphang/transformers@llama_push

Collecting git+https://github.com/zphang/transformers@llama_push
  Cloning https://github.com/zphang/transformers (to revision llama_push) to /tmp/pip-req-build-6h9jekei
  Running command git clone --filter=blob:none --quiet https://github.com/zphang/transformers /tmp/pip-req-build-6h9jekei
  Running command git checkout -b llama_push --track origin/llama_push
  Switched to a new branch 'llama_push'
  Branch 'llama_push' set up to track remote branch 'llama_push' from 'origin'.
  Resolved https://github.com/zphang/transformers to commit 3884da12ce327667d4df5101aef3533cc32be61f
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Using cached tokenizers-0.13.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Using cached huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
Bui

In [13]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("llm-blender/gen_fuser_770m")
model = AutoModelForSeq2SeqLM.from_pretrained("llm-blender/gen_fuser_770m")

Downloading: 100%|██████████| 788/788 [00:00<00:00, 485kB/s]
Downloading: 100%|██████████| 792k/792k [00:00<00:00, 77.3MB/s]
Downloading: 100%|██████████| 2.42M/2.42M [00:00<00:00, 50.6MB/s]
Downloading: 100%|██████████| 2.20k/2.20k [00:00<00:00, 813kB/s]
Downloading: 100%|██████████| 2.54k/2.54k [00:00<00:00, 1.05MB/s]
Downloading: 100%|██████████| 1.57G/1.57G [00:27<00:00, 56.4MB/s]


In [None]:
from transformers import set_seed

## Represents displaying the output in the way we need
def get_outputs(sample_outputs, tokenizer):
    ## Represents taking in a tokenizer, and raw output from the given model, decoding and 
    ## formatting the output nicely
    rt = []
    print("Output:\n" + 100 * '-')
    for i, sample_output in enumerate(sample_outputs):
        txt = tokenizer.decode(sample_output, skip_special_tokens=True)
        print("{}: {}...".format(i, txt))
        print('')
        rt.append(txt)
    return rt

## Setting the seed helps us ensure reproducibility, and when the seed is consistent, the model outputs will be consistent
set_seed(42)

# Initialize the question or prompt
question = "What is the project about?"

text = "Madhur is working on a project and he wants to be able to take the transcripts of a meeting and be able to summarize them for the meeting attendees and then display them in the form of instructions to make meetings easier, efficient, and more effective."

# Combine the question and text using appropriate separators like "[SEP]"
input_text = question + " [SEP] " + text

# Tokenize the combined text
input_ids = tokenizer.encode(input_text, return_tensors='pt')

sample_outputs = model.generate(input_ids,
                                do_sample=True,
                                num_return_sequences=1)

## Represents giving out the output
generic_outputs = get_outputs(sample_outputs, tokenizer)


Output:
----------------------------------------------------------------------------------------------------
0: The project is to develop a tool to easily summarize and display the transcripts of meeting proceedings...
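
Since the goal of this notebook is to present meeting content as instructions, a simple post-processing step can turn a model summary into a numbered list. The sketch below is a naive, rule-based heuristic and does not use the model; `to_numbered_points` is a hypothetical helper, and the sample text reuses one of the summaries produced in the summarizer notebook.

```python
import re


def to_numbered_points(summary):
    """Split a summary into sentences and return them as a numbered list string."""
    # Naive split on ., ! or ? followed by whitespace; abbreviations will confuse it
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", summary) if s.strip()]
    return "\n".join("{}. {}".format(i, s) for i, s in enumerate(sentences, start=1))


summary = (
    "Madhur is working on a project to extract the transcripts from the Amazon Chime calls "
    "and display them after the calls have ended. He is trying to create a prototype of the project."
)
print(to_numbered_points(summary))
```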



## STEP 2: FINE TUNE THE llama-2-7b Instruction Generator 

#### Here, we fine-tune the model on a real experiment: I recorded myself to generate a meeting transcript, and we check how the model performs on it:

#### GOALS:

1. Posting instructions based on the meeting notes given

2. Making sure all important points come across to the reader (maybe we assign labels to the meeting)

3. Make sure the model uses the label to pick up the important pointers from the meeting.

In [15]:
%%writefile train.txt

Madhur: Hey, how's it going? So let me turn on the transcript uh language preference. Let's go with English for now. OK. 
So I can see that the machine generated captions are by Amazon transcribe. Well, this is the first time I'm joining 
Amazon Shine with my hair all open. Usually during my workout, I tie, tie them back because it just looks uh I look like a broccoli. 
But anyways, uh this is an experiment. So I'm trying to uh this, I'm trying to work on a project where for every meeting, I'm trying to 
extract the transcripts through the chime calls and then display it after the calls have ended or the meetings have ended to the uh 
members of the meeting in a summarized manner or in a manner where they feel comfortable to read. Maybe they feel like being more organized 
after they missed a huge meeting. So they just want to look at the, the important pointers. So I will be focusing on taking this transcript and 
actually using it in the prototype that I'm trying to create. And let's see how it goes. I'm trying to see if I can get this transcript really long 
so that I can see that my protype works or not. And I'm just looking at the transcripts right now because I'm just kind of distracted at how Amazon
chime also has a one second delay, maybe a millisecond delay in their um meeting. So I can see my lips moving a bit slower than they actually are. 
So, so, yeah, a lot of redundant information there. Let's move on forward and uh, try this transcript out. All right. See you.

Overwriting train.txt


In [16]:
data = []

## Represents going over the training transcript sample above

with open('train.txt') as f:
    for row in f.readlines():
        d = row.strip()
        if len(d) > 0:
            data.append(d)

In [12]:
## Represents creating the SageMaker session and uploading the training file to S3

import sagemaker

sess = sagemaker.Session()
bucket = sess.default_bucket()

train_file_name1 = 'train.txt'
s3_train_data = 's3://{}/llm/{}'.format(bucket, train_file_name1)

!aws s3 cp {train_file_name1} {s3_train_data}



upload: ./train.txt to s3://sagemaker-us-east-1-988564344122/llm/train.txt


In [13]:
!pip install -U sagemaker


## STEP 4: TESTING OUR TRAINED MODEL LOCALLY

In [18]:
from sagemaker.huggingface import HuggingFace
import time

## The local folder name does not depend on the estimator, so set it once
local_model_path = 'gen_fuser_770m'

try:
    s3_model_data = huggingface_estimator.model_data
except Exception:
    ## Brief pause, then retry once
    time.sleep(5)
    s3_model_data = huggingface_estimator.model_data

In [26]:
pip install transformers[torch]

Note: you may need to restart the kernel to use updated packages.


In [27]:
pip install accelerate -U

Note: you may need to restart the kernel to use updated packages.


In [45]:
!pip install transformers torch torchvision torchaudio

In [50]:
from transformers import set_seed
## Setting the seed helps us ensure reproducibility, and when the seed is consistent, the model outputs will be consistent
set_seed(42)

# Initialize the question or prompt
question = "What is the project about and what does madhur have to do?"

text = "Madhur is working on a project and he wants to be able to take the transcripts of a meeting and be able to summarize them for the meeting attendees and then display them in the form of instructions to make meetings easier, efficient, and more effective."

# Combine the question and text using appropriate separators like "[SEP]"
input_text = question + " [SEP] " + text

# Tokenize the combined text
input_ids = tokenizer.encode(input_text, return_tensors='pt')

sample_outputs = model.generate(input_ids,
                                do_sample=True,
                                num_return_sequences=1)

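## Represents giving out the output.
## get_outputs is defined in the earlier helper cell; that cell must be run in the
## current kernel session first, otherwise this call fails with the NameError shown below.
generic_outputs = get_outputs(sample_outputs, tokenizer)



NameError: name 'get_outputs' is not defined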