# **A COMPUTATIONAL CREATIVITY PROJECT**


---
 **AUTOMATED CRICKET COMMENTARY**

Nirmalkumar Pajany,
MSc AI, Queen Mary University of London.
nirmalkumarnk10111@gmail.com


## Logistics Code:

In [2]:
#@title BLOCK-1.1: Prerequisites
!pip install transformers
!pip install datasets


In [3]:
#@title Block-1.2 Imports
import csv
import os
import random
from datasets import load_dataset
from transformers import pipeline
from transformers import GPT2Tokenizer
from transformers import Trainer, TrainingArguments
from transformers import GPT2LMHeadModel, AutoModelForCausalLM

In [4]:
#@title Block-1.3 Mounting GDrive
# from google.colab import drive
# drive.mount('/content/drive')

In [5]:
#@title Block-1.4 Global variables
model_checkpoint = "gpt2"
block_size = 128
distilled_commentary_model = "nirmalkumar/distilledgpt2-cric-commentary"
gpt2_commentary_model = "nirmalkumar/gpt2-cric-commentary"

In [6]:
commentary_model = "nirmalkumar/gpt2-cric-commentary" # @param ["nirmalkumar/distilledgpt2-cric-commentary", "nirmalkumar/gpt2-cric-commentary"]


In [7]:
print("Model of choice: {}".format(commentary_model))

Model of choice: nirmalkumar/gpt2-cric-commentary


# **Data Preparation**

The data was prepared as follows:



1.   The dataset for this task was downloaded from [Kaggle](https://www.kaggle.com/datasets/raghuvansht/cricket-scorecard-and-commentary-dataset).
2.   After downloading and extracting the archive, the data of interest is located at "archive/COMMENTARY_INTL_MATCH/COMMENTARY_INTL_MATCH". This folder was uploaded to Google Drive.
3.   The data has many columns, but only the columns that influence the commentary are selected. They are listed in the code below.
4.   Each table row is parsed and converted to a text format inspired by the TableGPT paper, as mentioned in the report. Two new tokens are introduced: `<start_of_table>` and `<end_of_table>`. Refer: *[Block 2.1]*
5.   While parsing, 50,204 rows are selected for training, 12,816 rows for testing and 27,381 rows for validation.
6.   The parsed data is then saved to Google Drive.
7.   The saved data is loaded into a Dataset using a script called ***txt.py***, which was saved alongside the data on the drive. It is shared here for reference. Refer: *[Block 2.2]*
8.   To make loading easy for the reader, the same dataset has also been uploaded as a public dataset to the HuggingFace Hub, and that copy is the one loaded here. Refer: *[Block 2.3]*
9.   At this point the data is prepared and loaded but not yet processed. Processing is handled by ***GPT2Tokenizer***, which is loaded from the pretrained GPT-2 checkpoint. Refer: *[Block 2.5]*
10.  The two new tokens are not in the tokenizer vocabulary, so they are added, along with a padding token. Refer: *[Block 2.6]*
11. Finally, the input is tokenized and grouped into fixed-size blocks, as the model requires. Refer: *[Block 2.7]*
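
The table-to-text conversion in step 4 can be sketched as a small helper. This is only an illustration, not the script used for the project: the function name `linearize_row` and the dict-based input are assumptions made for the example, while the `label is value` pattern and the two marker tokens follow Block 2.1. The sketch uses single spaces throughout, whereas the original script leaves a double space before `commentary`.

```python
# Hypothetical helper illustrating the table-to-text format from step 4.
START, END = "<start_of_table>", "<end_of_table>"

def linearize_row(fields, commentary):
    """Turn an ordered mapping of column label -> value into one training line."""
    cells = " ".join(f"{label} is {value}" for label, value in fields.items())
    return f"{START} {cells} {END} commentary {commentary}\n"

example = linearize_row(
    {
        "play type description": "no run",
        "batting team": "India",
        "bowling team": "England",
        "total runs on delivery": 0,
    },
    "short of a length and some shape into off stump.",
)
print(example)
```

Everything before `<end_of_table>` acts as the conditioning context, and the text after `commentary` is what the model learns to generate.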

In [8]:
#@title Block 2.1: Data preparation from txt files (refer point 4 to 6)
# #The following lines are commented because they are not needed anymore to load the data
# #The code is posted here to let the reader know how the data was prepared





# path='/content/drive/MyDrive/Q/Garage/CC/CC_DO/COMMENTARY_INTL_MATCH/'
# dir_list = os.listdir(path)
# row_row = []

# table_file = ""

# j = True
# i = 0

# for f in dir_list:
#   print("i: {}".format(i))
#   if j == False:
#     break
#   table_file = "/content/drive/MyDrive/Q/Garage/CC/CC_DO/data/TableText/tablefile50000.txt"
#   with open(table_file, 'a') as tf:
#     file_path = str(path) + str(f)
#     with open(file_path) as csv_file:
#       if i > 50000:
#         j = False
#         break
#       print("File being read: {0}".format(file_path))
#       table = csv.reader(csv_file, delimiter=',')
#       for row in table:
#         try:
#           row_str = '<start_of_table>'
#           # print(row[0])
#           if row[0] != 'PlayType_description':
#             i+=1


#             # The following columns were selected



#             row_str = row_str + ' play type description is ' + str(row[0])
#             row_str = row_str + ' batting team is ' + str(row[2])
#             row_str = row_str + ' bowling team is ' + str(row[4])
#             row_str = row_str + ' total runs on delivery is ' + str(row[6])
#             row_str = row_str + ' bowler name is ' + str(row[11])
#             row_str = row_str + ' batsman name is ' + str(row[25])
#             row_str = row_str + ' over runs is ' + str(row[44])
#             row_str = row_str + ' dismissal is ' + str(row[46])
#             row_str = row_str + ' dismissal type is ' + str(row[47])
#             row_str = row_str + ' dismissal text is ' + str(row[50])
#             row_str = row_str + ' innings wickets is ' + str(row[55])
#             row_str = row_str + ' <end_of_table> ' + ' commentary ' + str(row[7]) + '\n'
#             tf.write(row_str)
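#         except Exception:
#           # Skip malformed rows (for example, rows with missing columns)
#           pass
#   # Both files are closed automatically by their "with" blocks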

In [9]:
#@title Block 2.2: txt.py (refer point 7)

# Below is the script txt.py, saved in the data folder, which loads the data into a Dataset




# import os
# import datasets
# class TableToTextData(datasets.GeneratorBasedBuilder):
#   _DESCRIPTION = """

#   Table To Text DataSet Loader by NK

#   """
#   # _URL = "/Users/nirmalkumarp/Q/Garage/SEM_B/CC/DataSet/TableText/"
  
#   # TRAIN_URL = "https://drive.google.com/file/d/1-69_2ZA_kKBG40PRgXlMBxZASlYXembE/view?usp=sharing"
#   # TEST_URL = "https://drive.google.com/file/d/1-v-4YQoUY-NWMQHqooCUrfo82XJ40rXq/view?usp=sharing"
#   # VAL_DIR = "https://drive.google.com/file/d/1-ywrK4i_508t6rEqfwHcpkhabmQJz5LF/view?usp=sharing"

#   def _info(self):
#     return datasets.DatasetInfo(
#       description=self._DESCRIPTION,
#       features=datasets.Features(
#         {
#           "rows": datasets.Value("string")
#         }
#       ),
#       supervised_keys=None
#     )
  

#   def _split_generators(self, dl_manager):
#     train_path = dl_manager.download_and_extract(self.TRAIN_URL)
#     test_path = dl_manager.download_and_extract(self.TEST_URL)
#     val_path = dl_manager.download_and_extract(self.VAL_DIR)

#     return [
#       datasets.SplitGenerator(
#         name=datasets.Split.TRAIN, gen_kwargs={"file_path": train_path}
#       ),
#       datasets.SplitGenerator(
#         name=datasets.Split.TEST, gen_kwargs={"file_path": test_path}
#       ),
#       datasets.SplitGenerator(
#         name=datasets.Split.VALIDATION, gen_kwargs={"file_path": val_path}
#       )
#     ]

  
#   def _generate_examples(self, file_path):
#     _id = 0
#     print(file_path)
#     try:
#       with open(file_path, 'r') as fp:
#         lines = fp.readlines()
#         for line in lines:
#           _id += 1
#           yield _id, {
#               "rows":line
#           }
#     except:
#       pass
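
The core of `_generate_examples` can be tried without the `datasets` library. The sketch below is a standalone approximation, assuming a plain text file with one linearized row per line; the function name `generate_examples` and the temporary file are made up for the demonstration.

```python
import os
import tempfile

def generate_examples(file_path):
    """Yield (id, record) pairs, one per line, mirroring _generate_examples."""
    with open(file_path, "r", encoding="utf-8") as fp:
        for _id, line in enumerate(fp, start=1):
            yield _id, {"rows": line}

# Write two dummy lines to a temporary file and read them back.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("first row\nsecond row\n")
    tmp_path = tmp.name

records = list(generate_examples(tmp_path))
os.remove(tmp_path)
print(records)
```

Each record carries a single string feature named `rows`, which matches the `features` declared in `_info`.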

In [10]:
#@title Block 2.3: Loading the final Data (refer point 8)

dataset = load_dataset("nirmalkumar/cricket-commentary")

Using custom data configuration nirmalkumar--cricket-commentary-4995d8df88487f01


Downloading and preparing dataset table_to_text_data/default (download: 6.66 MiB, generated: 35.23 MiB, post-processed: Unknown size, total: 41.90 MiB) to /root/.cache/huggingface/datasets/parquet/nirmalkumar--cricket-commentary-4995d8df88487f01/0.0.0/0b6d5799bb726b24ad7fc7be720c170d8e497f575d02d47537de9a5bac074901...


Dataset parquet downloaded and prepared to /root/.cache/huggingface/datasets/parquet/nirmalkumar--cricket-commentary-4995d8df88487f01/0.0.0/0b6d5799bb726b24ad7fc7be720c170d8e497f575d02d47537de9a5bac074901. Subsequent calls will reuse this data.



In [11]:
#@title Block 2.4: print sample data 

dataset['train'][0]

{'rows': '<start_of_table> play type description is no run batting team is India bowling team is England total runs on delivery is 0 bowler name is James Anderson batsman name is Shikhar Dhawan over runs is 2 dismissal is False dismissal type is  dismissal text is 0.0 innings wickets is 0 <end_of_table>  commentary short of a length and some shape into off stump. Allows it come on and dabs with soft hands in front of point\n'}

In [12]:
#@title Block 2.5: Tokenizer (refer point 9)
tokenizer = GPT2Tokenizer.from_pretrained(model_checkpoint, use_fast=False)


In [13]:
#@title Block 2.6: Adding special tokens to Tokenizer (refer point 10)
tokenizer.add_tokens(["<start_of_table>", "<end_of_table>"])
tokenizer.add_special_tokens({'pad_token': '[PAD]'})

1

In [14]:
#@title Block 2.7: tokenizing and grouping the input (refer point 11)

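def tokenize_function(examples):
  # Tokenize each linearized row; padding=True pads to the longest row in the batch
  return tokenizer(examples['rows'], truncation=True, padding=True)

def group_texts(examples):
    # Concatenate all token lists in the batch into one long sequence per key
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Drop the trailing remainder so the length is a multiple of block_size
    total_length = (total_length // block_size) * block_size
    # Split each sequence into chunks of block_size tokens
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    # For causal language modelling the labels are a copy of the inputs
    result["labels"] = result["input_ids"].copy()
    return result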

tokenized_datasets = dataset.map(tokenize_function, batched=True, num_proc=4, remove_columns=['rows'], load_from_cache_file=False)
lm_datasets = tokenized_datasets.map(group_texts, batched=True, batch_size=1000, num_proc=4)
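
To see what the grouping step does, here is a toy run on hand-made token ids. It is a sketch only: `group_texts_demo` repeats the logic of `group_texts` but takes `block_size` as an argument, and the ids and the block size of 4 are invented for illustration (the notebook uses 128).

```python
def group_texts_demo(examples, block_size):
    # Same logic as group_texts, with block_size passed explicitly.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the remainder that does not fill a whole block.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

toy = {
    "input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9]],
    "attention_mask": [[1, 1, 1], [1, 1], [1, 1, 1, 1]],
}
grouped = group_texts_demo(toy, block_size=4)
print(grouped["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

The nine input tokens become two blocks of four, and the ninth token is discarded. Row boundaries are not preserved, so a block can start or end in the middle of a row.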

        


## Model Preparation

In [15]:
#@title Block 3.1: Model Selection
#@markdown By default "nirmalkumar/gpt2-cric-commentary" model will be loaded
# ["nirmalkumar/distilledgpt2-cric-commentary", "nirmalkumar/gpt2-cric-commentary"]

if commentary_model == 'nirmalkumar/distilledgpt2-cric-commentary':
  print("Model being Loaded: {}".format(commentary_model))
  model = AutoModelForCausalLM.from_pretrained(commentary_model)
elif commentary_model == 'nirmalkumar/gpt2-cric-commentary':
  print("Model being Loaded: {}".format(commentary_model))
  model = GPT2LMHeadModel.from_pretrained(commentary_model)
else:
  # Unrecognised choice: fall back to the default fine-tuned GPT-2 checkpoint
  print("Default Model being Loaded: {}".format('nirmalkumar/gpt2-cric-commentary'))
  model = GPT2LMHeadModel.from_pretrained('nirmalkumar/gpt2-cric-commentary')


Model being Loaded: nirmalkumar/gpt2-cric-commentary



In [16]:
#@title Block 3.2: Resizing the Model embeddings
model.resize_token_embeddings(len(tokenizer))

Embedding(50260, 768)
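
The first dimension of the resized embedding can be checked by hand. The sketch below assumes the standard GPT-2 vocabulary size of 50,257 and counts the tokens added in Block 2.6; the constant names are chosen for this example only.

```python
GPT2_BASE_VOCAB = 50257  # size of the pretrained GPT-2 vocabulary
TABLE_TOKENS = ["<start_of_table>", "<end_of_table>"]  # added via add_tokens
PAD_TOKENS = ["[PAD]"]  # added via add_special_tokens

expected_rows = GPT2_BASE_VOCAB + len(TABLE_TOKENS) + len(PAD_TOKENS)
print(expected_rows)  # 50260
```

This matches the `Embedding(50260, 768)` reported above, where 768 is the hidden size of the base GPT-2 model.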

In [17]:
#@title Block 3.3: Preparing the Training Arguments
training_args = TrainingArguments(
    f"{commentary_model}-finetuned-tabetext",
    evaluation_strategy = 'epoch',
    learning_rate=2e-5,
    weight_decay=0.01,
    push_to_hub=False,
    num_train_epochs=4.0,
    save_strategy="epoch"
)

In [18]:
#@title Block 3.4: Preparing the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets['train'],
    eval_dataset=lm_datasets['validation']
)

In [19]:
#@title Block 3.5: Training on the trainer

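# Training is left commented out; uncomment the next line to fine-tune
# trainer.train()
# Move the model to the CPU for inference
model = model.to(device='cpu')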

In [20]:
#@title Block 3.6: Preparing the generator
generator = pipeline('text-generation', model=model, tokenizer=tokenizer)

In [21]:
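# Pick a random row from the test split
choice = random.randrange(len(dataset['test']))
test_str = str(dataset['test'][choice]['rows'])
# Keep only the table part and end the prompt with the 'commentary' cue
test_str = test_str.split(" commentary")[0]
test_str = str(test_str) + ' commentary'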
print("len {}".format(len(test_str)))
print(test_str)

len 311
<start_of_table> play type description is no run batting team is South Africa bowling team is India total runs on delivery is 0 bowler name is Hardik Pandya batsman name is Faf du Plessis over runs is 5 dismissal is False dismissal type is  dismissal text is 3.04 innings wickets is 0 <end_of_table>  commentary
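
Prompt construction could also be keyed on the `<end_of_table>` marker instead of the word `commentary`. The following is a hypothetical alternative, not the code used above: `make_prompt` is a name invented for this sketch, and the shortened row is for illustration only.

```python
END = "<end_of_table>"

def make_prompt(row):
    """Keep everything up to the end-of-table marker and append the cue word."""
    table, sep, _ = row.partition(END)
    if not sep:
        raise ValueError("row has no <end_of_table> marker")
    # Two spaces before 'commentary', matching the format of the dataset rows.
    return f"{table}{END}  commentary"

row = (
    "<start_of_table> play type description is no run batting team is India "
    "<end_of_table>  commentary short of a length\n"
)
prompt = make_prompt(row)
print(prompt)
```

For rows in the dataset format this produces the same prompt as splitting on `" commentary"`, and it raises an error for a row that lacks the marker rather than silently passing it through.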


In [35]:
gen = generator(test_str, max_length = 128, num_return_sequences=3)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [51]:
for g in gen:
  gen_sent = str((str(g['generated_text']).split('commentary'))[1])
  print("Generated Sentence: {}".format(gen_sent))
# print(gen)

Generated Sentence:  good length on middle stump, defended off the back foot
play type description is no run batting team is Sri Lanka total runs on delivery is 0 bowler name is Lakshan Sandakan over runs is 4 dismissal is False dismissal type is  dismissal text is 4.03 innings wickets is 0 <end_of_table> comment
Generated Sentence:  full at off stump, defended on the front foot. This one has to be struck
play type description is no run batting team is New Zealand total runs on delivery is 0 bowler name is Trent Boult over runs is 2 dismissal is False dismissal type is  dismissal text is 4.28 innings wickets
Generated Sentence:  good length outside off, blocked down the pitch off the front foot
play type description is no run batting team is Sri Lanka total runs on delivery is 0 bowler name is Lahiru Gamage batsman name is Kusal Mendis over runs is 3 dismissal is False dismissal type is  dismissal text is 0
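
As the output shows, the model keeps generating new table rows after the commentary until `max_length` is reached. One possible post-processing step, sketched here under the assumption that the commentary ends at the first line break, is to keep only the first line after the cue word. The helper name `first_commentary_line` and the shortened sample are assumptions for this example.

```python
def first_commentary_line(generated_text):
    """Return only the first line of text after the 'commentary' cue."""
    _, _, tail = generated_text.partition("commentary")
    stripped = tail.strip()
    return stripped.splitlines()[0] if stripped else ""

sample = (
    "<start_of_table> play type description is no run <end_of_table>  commentary"
    " good length on middle stump, defended off the back foot\n"
    "play type description is no run batting team is Sri Lanka"
)
print(first_commentary_line(sample))
```

It could be applied to each `g['generated_text']` in the loop above in place of the plain `split('commentary')`.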
