# Meeting Notes Summarizer: AWS Summarize

#### This notebook uses SageMaker and Hugging Face to summarize the transcripts of a given meeting and organize them for later reference.

#### GOALS:

1. Integrate the Hugging Face text-to-text generation model Flan-T5-large so that it takes meeting transcripts as input and generates organized text that meets the requirements of the person conducting the meeting.

2. Integrate a speech-to-text converter that records what each speaker says during the meeting in a live document, which our model can then refer to and which we can train on.

### STEP 1: Integrate the hugging face model: flan-t5-large

In [2]:
## Larger checkpoint named in the goals, left commented out
## model_id = "google/flan-t5-large"

## Checkpoint actually used in this run
model_id = "google/flan-t5-base"

## Dataset ID that we will refer to: https://huggingface.co/datasets/lytang/MeetingBank-transcript
dataset_id = "lytang/MeetingBank-transcript"

#### The dataset above has the following columns:

1. meeting_id (string): The ID of the meeting whose transcript we want to analyze.

2. source (string): The transcript text, containing the speech of each speaker, which we use as the model input.

3. type (string): The field we use as our label. It describes the purpose of the meeting and can also be extracted from the meeting title.

4. reference (string): Not needed for our training.

5. city (string): The city/location of the meeting; not needed for the goals of this project.

### STEP 2: SETUP

#### Install and upgrade the libraries we need: transformers, datasets, and sagemaker

In [3]:
pip install --upgrade pip

DEPRECATION: pyodbc 4.0.0-unsupported has a non-standard version number. pip 23.3 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of pyodbc or contact the author to suggest that they release a version with a conforming version number. Discussion can be found at https://github.com/pypa/pip/issues/12063
Note: you may need to restart the kernel to use updated packages.


In [4]:
!pip -q install transformers datasets sagemaker --upgrade

In [5]:
!pip -q install widgetsnbextension ipywidgets

#### Set up the SageMaker session and get the default bucket in which the training data will be stored

In [6]:
## Import sagemaker
import sagemaker

## Print the version of sagemaker that we are working with
print(sagemaker.__version__)

## Initialize the session
sess = sagemaker.Session()

## Get the default S3 bucket for this session
bucket = sess.default_bucket()

2.173.0


#### Now, we will import the transformers and datasets libraries from Hugging Face

In [7]:
## Importing the transformers library
import transformers

## Importing the datasets library from hugging face
import datasets

## Print the versions of both libraries that we are working with
print(transformers.__version__)
print(datasets.__version__)

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


4.30.2
2.13.1


### STEP 3: LOAD THE DATASET

#### Now, we will load the dataset identified by dataset_id, which we will use to train our model

In [8]:
## Import the dataset loading helpers
from datasets import load_dataset, load_from_disk

dataset = load_dataset(dataset_id)

## Print the dataset information
dataset

Found cached dataset csv (/root/.cache/huggingface/datasets/lytang___csv/lytang--MeetingBank-transcript-1a764c51490b6b19/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d)


  0%|          | 0/3 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['meeting_id', 'source', 'type', 'reference', 'city'],
        num_rows: 5169
    })
    validation: Dataset({
        features: ['meeting_id', 'source', 'type', 'reference', 'city'],
        num_rows: 861
    })
    test: Dataset({
        features: ['meeting_id', 'source', 'type', 'reference', 'city'],
        num_rows: 862
    })
})
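
The split sizes printed above can be turned into percentages with a few lines of arithmetic. This is only a quick check that uses the row counts from the output; the variable names are illustrative.

```python
# Row counts copied from the DatasetDict output above.
split_rows = {"train": 5169, "validation": 861, "test": 862}

total_rows = sum(split_rows.values())

# Share of each split as a percentage of all rows, rounded to one decimal.
split_share = {
    name: round(100 * rows / total_rows, 1) for name, rows in split_rows.items()
}

print(total_rows)
print(split_share)
```

The training split holds about three quarters of the rows, with the remainder divided almost evenly between validation and test.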

### STEP 4: PREPROCESS THE DATASET

In [9]:
## Import AutoTokenizer, which converts text into token IDs for the model
from transformers import AutoTokenizer

## Load the tokenizer that matches the model checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)

## Task prefix prepended to every transcript
prefix = "summarize: "

## Maximum number of input tokens (longer transcripts are truncated)
input_max_length = 2000

## Maximum number of label tokens
output_max_length = 150

## Preprocessing function applied to each batch of examples
def preprocess_function(examples):

    ## Prepend the prefix to every transcript in the batch
    inputs = [prefix + transcript for transcript in examples["source"]]

    ## Tokenize the inputs, truncating to the maximum input length
    model_inputs = tokenizer(inputs, max_length=input_max_length, truncation=True)

    ## Tokenize the "type" column to use as labels
    labels = tokenizer(
        text_target=examples["type"], max_length=output_max_length, truncation=True
    )

    ## Attach the label token IDs to the model inputs
    model_inputs["labels"] = labels["input_ids"]

    return model_inputs
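
To make the shape of the data concrete: with `batched=True`, `map` passes the function a dictionary of column lists rather than one record at a time. The sketch below mimics that flow in plain Python. `toy_tokenize` is a hypothetical whitespace tokenizer standing in for the real Hugging Face tokenizer, and the two example rows are made up, so only the structure (prefixing, truncation, attaching `labels`) carries over.

```python
prefix = "summarize: "

def toy_tokenize(texts, max_length):
    """Hypothetical stand-in for the tokenizer: split on whitespace, then truncate."""
    return {"input_ids": [text.split()[:max_length] for text in texts]}

def toy_preprocess(examples, input_max_length=6, output_max_length=3):
    # Batched input: each key maps to a list with one entry per record.
    inputs = [prefix + transcript for transcript in examples["source"]]
    model_inputs = toy_tokenize(inputs, max_length=input_max_length)
    labels = toy_tokenize(examples["type"], max_length=output_max_length)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Made-up batch of two records, for illustration only.
batch = {
    "source": [
        "Speaker 0: We move to the next item on the agenda now.",
        "Speaker 1: Motion carries.",
    ],
    "type": ["agenda item", "vote"],
}

result = toy_preprocess(batch)
print(result["input_ids"][0])
print(result["labels"])
```

The first transcript is cut to six tokens while the second, being shorter, is left untouched, which is the same effect `truncation=True` has with a real `max_length`.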
    

In [10]:
## Now, we will run this on our dataset with the map function in one go

tokenized_dataset = dataset.map(
    preprocess_function, batched=True, remove_columns=['meeting_id', 'source', 'type', 'reference', 'city']
)

Loading cached processed dataset at /root/.cache/huggingface/datasets/lytang___csv/lytang--MeetingBank-transcript-1a764c51490b6b19/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d/cache-704b52d273e963cd.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/lytang___csv/lytang--MeetingBank-transcript-1a764c51490b6b19/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d/cache-e9cb19c15b5760eb.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/lytang___csv/lytang--MeetingBank-transcript-1a764c51490b6b19/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d/cache-e5aa10435d78532e.arrow


### STEP 5: UPLOADING THE DATA TO S3

In [11]:
## Import the S3 filesystem helper from the datasets library
from datasets.filesystems import S3FileSystem

## Initialize the filesystem
s3 = S3FileSystem()

s3_prefix = "huggingface/summarize_transcripts"

## Build the S3 paths for the dataset root and for the training and validation data
dataset_input_path = "s3://{}/{}".format(bucket, s3_prefix)
train_input_path = "{}/train".format(dataset_input_path)
valid_input_path = "{}/validation".format(dataset_input_path)

## Print the paths
print(dataset_input_path)
print(train_input_path)
print(valid_input_path)

s3://sagemaker-us-east-1-988564344122/huggingface/summarize_transcripts
s3://sagemaker-us-east-1-988564344122/huggingface/summarize_transcripts/train
s3://sagemaker-us-east-1-988564344122/huggingface/summarize_transcripts/validation
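
The same path layout can be wrapped in a small helper, which may be convenient if more channels are added later. This is a sketch only: `channel_uris` is a hypothetical function, and `example-bucket` is a placeholder rather than the bucket used in this notebook.

```python
def channel_uris(bucket, prefix, channels=("train", "validation")):
    """Return the dataset root URI and one URI per channel under it."""
    root = f"s3://{bucket}/{prefix.strip('/')}"
    return root, {channel: f"{root}/{channel}" for channel in channels}

# "example-bucket" is a placeholder, not the bucket used in this notebook.
root, uris = channel_uris("example-bucket", "huggingface/summarize_transcripts")
print(root)
print(uris["train"])
print(uris["validation"])
```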


In [12]:
## Now, we will save the tokenized splits to the S3 paths defined above.
## Note that the "test" split is what gets written to the validation path here.

tokenized_dataset["train"].save_to_disk(train_input_path, fs=s3)
tokenized_dataset["test"].save_to_disk(valid_input_path, fs=s3)



Saving the dataset (0/1 shards):   0%|          | 0/5169 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/862 [00:00<?, ? examples/s]

### STEP 6: FINE-TUNE ON SAGEMAKER BY USING A HUGGING FACE DEEP LEARNING CONTAINER

In [13]:
!pygmentize train.py

import argparse
import logging
import os

import evaluate
import numpy as np
from datasets import load_from_disk
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

rouge = evaluate.load("rouge")


def compute_met
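
The listing is cut off before the metric function, so its actual body is not shown here. As a rough illustration of what a ROUGE-1 score measures, the sketch below computes a simplified unigram-overlap F1 in plain Python. It is not the `evaluate` implementation and not the author's function; `rouge1_f1` and the example sentences are hypothetical.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1 F1: unigram overlap on lowercased whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: each token counts at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

identical = rouge1_f1("the council approved the budget", "the council approved the budget")
partial = rouge1_f1("the council approved the budget", "council rejected the budget")
print(identical, partial)
```

The real metric adds stemming options and further variants such as ROUGE-2 and ROUGE-L, which this sketch leaves out.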

In [14]:
## Defining the hyperparameters
hyperparameters = {
    "epochs": 1,
    "learning_rate": 0.0001,
    "train-batch-size": 1,
    "eval-batch-size": 3,
    "model-name": model_id,
}
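
SageMaker's script mode passes each hyperparameter to the entry point as a `--key value` command-line argument. Because the `train.py` listing in this notebook is truncated, its real argument handling is not visible; the following is only a sketch of how such a script might read these five keys with `argparse`. The defaults and the `parse_hyperparameters` name are assumptions.

```python
import argparse

def parse_hyperparameters(argv):
    """Parse the hyperparameter flags SageMaker would append to the script call."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--learning_rate", type=float, default=1e-4)
    parser.add_argument("--train-batch-size", type=int, default=1)
    parser.add_argument("--eval-batch-size", type=int, default=3)
    parser.add_argument("--model-name", type=str, default="google/flan-t5-base")
    # parse_known_args ignores any extra flags the platform may add.
    args, _unknown = parser.parse_known_args(argv)
    return args

# Simulated command line built from the hyperparameters dictionary above.
args = parse_hyperparameters([
    "--epochs", "1",
    "--learning_rate", "0.0001",
    "--train-batch-size", "1",
    "--eval-batch-size", "3",
    "--model-name", "google/flan-t5-base",
])
print(args.epochs, args.learning_rate, args.train_batch_size, args.model_name)
```

Note that `argparse` turns hyphens in flag names into underscores on the parsed object, so `--train-batch-size` is read back as `args.train_batch_size`.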

In [20]:
## Now, we will create the Hugging Face estimator, passing in the training script and its requirements

from sagemaker.huggingface import HuggingFace

huggingface_estimator = HuggingFace(

    ## IAM execution role used by SageMaker
    role=sagemaker.get_execution_role(),

    ## Fine-tuning script, its dependencies, and the hyperparameters
    entry_point="train.py",
    dependencies=["requirements.txt"],
    hyperparameters=hyperparameters,

    ## Container versions and training infrastructure
    transformers_version="4.26.0",
    pytorch_version="1.13.1",
    py_version="py39",
    instance_type="ml.p3.16xlarge",
    instance_count=1,
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)
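
For a sense of scale: an `ml.p3.16xlarge` instance has 8 GPUs, so with data parallelism each optimizer step processes one batch per GPU. Assuming `train-batch-size` is a per-GPU batch size (the truncated `train.py` does not confirm this), the numbers work out as sketched below.

```python
import math

train_rows = 5169          # rows in the train split, from the dataset output
per_gpu_batch_size = 1     # "train-batch-size" hyperparameter
gpus_per_instance = 8      # ml.p3.16xlarge has 8 GPUs
instance_count = 1
epochs = 1

# Assumption: the batch size applies per GPU, so the global batch is the product.
global_batch_size = per_gpu_batch_size * gpus_per_instance * instance_count
steps_per_epoch = math.ceil(train_rows / global_batch_size)
total_steps = steps_per_epoch * epochs

print(global_batch_size, steps_per_epoch, total_steps)
```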

In [21]:
pip install -U sagemaker

Note: you may need to restart the kernel to use updated packages.


In [22]:
## Now, we will fit the model and start the training job
huggingface_estimator.fit({"train": train_input_path, "valid": valid_input_path})

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: huggingface-pytorch-training-2023-07-31-02-55-44-513


Using provided s3_resource
2023-07-31 02:55:44 Starting - Starting the training job......
2023-07-31 02:56:41 Starting - Preparing the instances for training.........
2023-07-31 02:57:50 Downloading - Downloading input data...
2023-07-31 02:58:21 Training - Downloading the training image............
2023-07-31 03:00:36 Training - Training image download completed. Training in progress.....
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2023-07-31 03:01:09,550 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2023-07-31 03:01:09,617 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2023-07-31 03:01:09,630 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2023-07-31 03:01:09,632 sagemaker_pytorch_container.training INFO     Invoking SMDataParallel
20
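
The channel names passed to `fit` (`train` and `valid`) determine where the data appears inside the training container: SageMaker exposes each channel through an `SM_CHANNEL_<NAME>` environment variable that points to a directory under `/opt/ml/input/data/`. The sketch below shows how an entry script could resolve those directories. The `channel_dir` helper and its fallback are assumptions, since the `train.py` listing is cut off before this part.

```python
import os

def channel_dir(name, environ=None):
    """Return the local directory for a SageMaker input channel."""
    environ = os.environ if environ is None else environ
    # Fallback mirrors SageMaker's default layout when the variable is missing.
    return environ.get(f"SM_CHANNEL_{name.upper()}", f"/opt/ml/input/data/{name}")

# Simulated environment, as it would look inside the training container.
fake_env = {
    "SM_CHANNEL_TRAIN": "/opt/ml/input/data/train",
    "SM_CHANNEL_VALID": "/opt/ml/input/data/valid",
}
print(channel_dir("train", fake_env))
print(channel_dir("valid", fake_env))
```

A script would then typically pass these directories to `load_from_disk`, which is already imported in the listing.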

### STEP 7: DEPLOY ON SAGEMAKER WITH A HUGGING FACE DEEP LEARNING CONTAINER

In [26]:
huggingface_predictor = huggingface_estimator.deploy(
    initial_instance_count=1, instance_type="ml.g4dn.xlarge"
)

INFO:sagemaker:Creating model with name: huggingface-pytorch-training-2023-07-31-03-10-39-147
INFO:sagemaker:Creating endpoint-config with name huggingface-pytorch-training-2023-07-31-03-10-39-147
INFO:sagemaker:Creating endpoint with name huggingface-pytorch-training-2023-07-31-03-10-39-147


---------!

### Now, let's use an example to see how a meeting transcript can be summarized

In [27]:
## Note: dataset['test'][20] is the full record (all columns), and prefix already ends with ": "
test_transcript = {"inputs": f"{prefix}: {dataset['test'][20]}"}
print(test_transcript)

{'inputs': 'summarize: : {\'meeting_id\': \'LongBeachCC_03222022_22-0281\', \'source\': "Speaker 0: Thank you. We have a first item up is item 17.\\nSpeaker 2: Item 17 Communication from Councilwoman Zendaya\'s Recommendation to direct city attorney to draft resolution to advocate changes to the California retail food code.\\nSpeaker 0: There\'s a motion in a second. I think we have come to consider how do you want to do public comment first or do you want to address, please? There\'s one member of the public.\\nSpeaker 2: Mankind could.\\nSpeaker 6: Try again. My name is Cameron Coon and I started a catering service at the worst possible time in January of 2020. In order to survive the pandemic. My partner, Juan Fernandez, and I took our pop up cafe intended for film set, and we became street vendors. During the past year, we\'ve served a cappuccino to Robert Garcia, a peppermint mocha to Rex Richardson. A Ice Americano to Suzy Price\'s husband and at least two hot chocolates to Mary\
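
The printed payload shows that the whole record, including `meeting_id` and the other columns, was interpolated into the prompt, and that the colon after the prefix is doubled. A minimal sketch of a helper that sends only the transcript text follows; `build_payload` and the sample record are hypothetical.

```python
prefix = "summarize: "

def build_payload(record, prefix=prefix):
    """Build a request body that contains only the prefixed transcript text."""
    return {"inputs": f"{prefix}{record['source']}"}

# Shortened, made-up record with the same columns as the dataset.
record = {
    "meeting_id": "example_meeting",
    "source": "Speaker 0: Thank you. We have a first item up.",
    "type": "example type",
    "reference": "example reference",
    "city": "example city",
}

payload = build_payload(record)
print(payload)
```

With the real data, the equivalent call would be `build_payload(dataset["test"][20])`, which keeps the input in the same form the model saw during preprocessing.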

In [28]:
prediction = huggingface_predictor.predict(test_transcript)
print(prediction)

[{'generated_text': 'The machine\'s next item is item 18, please."'}]
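
In this run the endpoint returned a list holding one dictionary, with the model output under `generated_text`. A small sketch of pulling the text out, using the response printed above as a literal (`extract_summaries` is an illustrative name):

```python
# Response copied from the cell output above.
prediction = [{"generated_text": 'The machine\'s next item is item 18, please."'}]

def extract_summaries(response):
    """Collect the generated text from each entry of the endpoint response."""
    return [item["generated_text"].strip() for item in response]

summaries = extract_summaries(prediction)
print(summaries[0])
```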
