# Coding the Attention Paper from Scratch (Part 1: Dataset)
We want to code the famous 2017 paper, **"Attention Is All You Need"**, from scratch. This is the landmark paper that introduced the Transformer architecture to the world. Before the Transformer, most machine translation was done with sequence-to-sequence models built on recurrent networks such as the RNN (recurrent neural network) and the LSTM (long short-term memory). The paper proposed an entirely new architecture that uses attention as its base instead of an RNN/LSTM. When it came out, it reported state-of-the-art BLEU (Bilingual Evaluation Understudy) scores on the WMT 2014 English-to-German and WMT 2014 English-to-French benchmarks.

I'm trying to replicate the paper in code, and to do that I first need a dataset to train and evaluate on. Since I don't have the complete benchmark data, I'll use the public Malay news data from Mesolitica, which pairs Malay sentences with English translations. Credit to them. The data is available at https://huggingface.co/datasets/mesolitica/google-translate-malay-news
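Before loading anything, it helps to pin down the record layout. This is a minimal sketch, assuming each line of the downloaded files is a JSON object with the Malay sentence under `src` and the English translation under `r` → `result` (the same fields the loader in this post reads). The `parse_record` helper and the sample record are made up for illustration and are not an official schema.

```python
import json

def parse_record(line):
    """Return (source, translation) from one JSON line, or None if unusable.

    Hypothetical helper: the field names mirror the loader in this post.
    """
    line = line.strip()
    if not line:
        return None  # blank line
    try:
        entry = json.loads(line)
        return entry["src"], entry["r"]["result"]
    except (json.JSONDecodeError, KeyError, TypeError):
        # Malformed JSON, a missing field, or an unexpected value type
        return None

# Invented sample record, only to show the assumed shape
sample = '{"src": "Selamat pagi", "r": {"result": "Good morning"}}'
pair = parse_record(sample)
```

Returning `None` instead of raising lets a long load skip the occasional bad line rather than stop partway through two million records.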

In [1]:
# Let's prepare the dataset
import json
import glob

In [2]:
import pandas as pd

# Folder where your LFS files are stored
folder_path = '/Users/abuhuzaifahbidin/Documents/GitHub/attention-paper/dataset/'

# Use glob to find all files matching the pattern (folder_path already ends with '/');
# sorted() keeps the load order the same from run to run
files = sorted(glob.glob(f"{folder_path}malay-news.*.splitted.requested"))

# Initialize lists to store the source and translated text
all_source_texts = []
all_translated_texts = []

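# Process each file one by one; every line is expected to hold one JSON object
for file_path in files:
    # Read as UTF-8 explicitly so the result doesn't depend on the platform default
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            line = line.strip()
            if not line:
                continue  # skip blank lines, which json.loads would reject
            entry = json.loads(line)
            src_text = entry["src"]                 # original Malay sentence
            translated_text = entry["r"]["result"]  # English translation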

            # Append to lists
            all_source_texts.append(src_text)
            all_translated_texts.append(translated_text)

# Create a DataFrame to store the data
dataset_df = pd.DataFrame({
    'source_text': all_source_texts,
    'translated_text': all_translated_texts
})

# Save the combined dataset as a CSV file
output_csv_path = '/Users/abuhuzaifahbidin/Documents/GitHub/attention-paper/dataset/combined_dataset.csv'
dataset_df.to_csv(output_csv_path, index=False)

dataset_df

| | source_text | translated_text |
|---|---|---|
| 0 | Antara yang terlibat adalah Program Permata Ne... | Among those involved is the National Jewels Pr... |
| 1 | Program 1Malaysia For Youth (1M4U) cetusan ide... | The 1Malaysia For Youth (1M4U) Program of Datu... |
| 2 | Sejumlah agensi diletakkan bawah Kementerian H... | A number of agencies are placed under the Mini... |
| 3 | Selain itu, Jabatan Perangkaan, Halal Developm... | In addition, the Department of Statistics, Hal... |
| 4 | Penstrukturan semula ini juga menyaksikan lima... | This restructuring also saw five corridors and... |
| ... | ... | ... |
| 2087459 | 'Tunggakan gaji pemain Melaka selesai' | 'Melacca players' salary arrears are over' |
| 2087460 | KEMELUT masalah tunggakan gaji yang didakwa me... | THE PROBLEM of salary arrears allegedly plagui... |
| 2087461 | Presiden Persatuan Bola Sepak Melaka (MUSA), A... | The president of Melaka Football Association (... |
| 2087462 | Beliau yang juga Ketua Menteri Melaka mengakui... | He, who is also the Chief Minister of Malacca,... |


The dataset is now ready. 
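"Ready" here means loaded, not yet cleaned. This is a sketch of a light sanity pass before training, working on the two parallel lists built above. The `clean_pairs` and `split_pairs` helpers, the 1% validation fraction, and the seed are my own choices for illustration; none of them come from the paper.

```python
import random

def clean_pairs(sources, targets):
    """Drop pairs with missing or empty text and exact duplicates, keeping order."""
    seen = set()
    pairs = []
    for src, tgt in zip(sources, targets):
        if not isinstance(src, str) or not isinstance(tgt, str):
            continue  # missing value on either side
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt or (src, tgt) in seen:
            continue  # empty text or a pair we already kept
        seen.add((src, tgt))
        pairs.append((src, tgt))
    return pairs

def split_pairs(pairs, valid_frac=0.01, seed=42):
    """Shuffle a copy of the pairs and split it into (train, valid) lists."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    # Keep at least one validation pair whenever there is any data
    n_valid = max(1, int(len(shuffled) * valid_frac)) if shuffled else 0
    return shuffled[n_valid:], shuffled[:n_valid]

# Toy data standing in for all_source_texts / all_translated_texts
toy_src = ["Selamat pagi", "Selamat pagi", "", None, "Terima kasih"]
toy_tgt = ["Good morning", "Good morning", "x", "y", "Thank you"]
pairs = clean_pairs(toy_src, toy_tgt)
train, valid = split_pairs(pairs, valid_frac=0.5)
```

On the real data you would call `clean_pairs(all_source_texts, all_translated_texts)` instead of using the toy lists. A fixed seed keeps the split the same between runs, which matters when comparing BLEU scores later.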