# Solving NLP (Natural Language Processing) Problems with HuggingFace Library on AWS for Turkish Language

## Use Case
In this demo, we will demonstrate two NLP use cases.
- _Sentiment Analysis:_ You can find positive, negative and neutral mentions about your business, competitors or any topic provided as text to the Machine Learning model.
- _Question Answering:_ Question-answering models are deep learning models that answer questions given some context, and sometimes without any context (e.g. open-domain QA). They can extract answer phrases from paragraphs, paraphrase the answer generatively, or choose one option from a list of given options.

## Dataset
We will use the following datasets:
- _Turkish Product Reviews:_ This dataset contains 235,165 product reviews collected online: 220,284 positive and 14,881 negative.
- _Turkish NLP Q&A Dataset:_ This is a Turkish question-answering dataset on Turkish science history.
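
The review counts point to a strongly imbalanced dataset. A quick calculation, using only the numbers quoted above, suggests that a model that always predicts "positive" would already be right for roughly 94% of reviews, so accuracy alone may be a misleading metric:

```python
# Review counts as stated in the dataset description above
n_positive = 220_284
n_negative = 14_881
n_total = n_positive + n_negative

positive_share = n_positive / n_total
negative_share = n_negative / n_total

# Accuracy of a trivial classifier that always answers with the majority class
majority_baseline_accuracy = max(positive_share, negative_share)

print(f"total reviews:     {n_total}")
print(f"positive share:    {positive_share:.1%}")
print(f"negative share:    {negative_share:.1%}")
print(f"majority baseline: {majority_baseline_accuracy:.1%}")
```

Per-class precision and recall, or F1 on the negative class, are likely to be more informative than plain accuracy here.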

## Approach
Instead of creating a new Machine Learning (ML) model for every new task, we can leverage the concept of *Transfer Learning*.
In particular, we can take a generic language model and teach it new tasks by fine-tuning it on the corresponding datasets.
In this notebook we will use a Turkish language model created by the MDZ Digital Library team (dbmdz) at the Bavarian State Library (https://github.com/stefan-it/turkish-bert). We will use the Hugging Face Model Hub to download the model (https://huggingface.co/dbmdz/bert-base-turkish-uncased) and then fine-tune it for two different tasks. We will then deploy the fine-tuned models to SageMaker for real-time inference.
- Sentiment Analysis: We will see how easily the fine-tuned model achieves state-of-the-art (SoTA) performance for Turkish sentiment analysis.
- Question Answering:

## How to Run this Notebook in Amazon SageMaker
You can run this notebook in SageMaker Studio. Please select the `PyTorch 1.6 Python 3.6 CPU Optimized` kernel.

## SageMaker Setup

In [None]:
!pip install transformers -q -U

In [None]:
!pip install datasets -q -U

In [None]:
!pip install ipywidgets IProgress -q

In [None]:
!pip install sagemaker -q -U

In [None]:
!mkdir data

In [None]:
import sagemaker

sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

role = sagemaker.get_execution_role()
sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sagemaker_session_bucket}")
print(f"sagemaker session region: {sess.boto_region_name}")

#### Define model name

In [None]:
model_name = 'dbmdz/bert-base-turkish-uncased'

## Sentiment Analysis

### Downloading dataset and splitting into test and training sets

We will download the data directly from Hugging Face: https://huggingface.co/datasets/turkish_product_reviews

In [None]:
from datasets import load_dataset
import pandas as pd
from transformers import AutoTokenizer
from sagemaker.huggingface.model import HuggingFacePredictor

In [None]:
dataset_name = 'turkish_product_reviews'
dataset = load_dataset(dataset_name)

We will only take 10% of the data to reduce training time

In [None]:
sample = dataset['train'].train_test_split(test_size=0.1)

Now we split this 10% sample into a training set (90%) and a test set (10%)

In [None]:
dataset = sample['test']
train_test = dataset.train_test_split(test_size=0.1)

In [None]:
train_dataset = train_test['train']
test_dataset = train_test['test']
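
The two-stage split can be sketched with the standard library alone, which makes the resulting sizes easy to see. This is an illustration on row indices, not the `datasets` implementation: the helper name `two_stage_split` and the seed value are assumptions, and the library's own rounding may differ by a row. The notebook's calls do not fix a seed, so the selected rows change between runs; `train_test_split` accepts a `seed` argument if reproducibility is needed.

```python
import random

def two_stage_split(n_rows, sample_fraction=0.1, test_fraction=0.1, seed=42):
    """Subsample n_rows row indices, then split the subsample into train and test."""
    rng = random.Random(seed)          # fixed seed keeps the split reproducible
    indices = list(range(n_rows))
    rng.shuffle(indices)

    n_sample = int(n_rows * sample_fraction)   # stage 1: keep 10% of all rows
    sample = indices[:n_sample]

    n_test = int(n_sample * test_fraction)     # stage 2: 10% of the sample for testing
    return sample[n_test:], sample[:n_test]

train_idx, test_idx = two_stage_split(235_165)
print(f"train rows: {len(train_idx)}, test rows: {len(test_idx)}")
```

In other words, training uses roughly 9% of the full dataset and evaluation roughly 1%.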

Now we can inspect the training data

In [None]:
df_train = pd.DataFrame(train_dataset)

In [None]:
pd.set_option('display.max_colwidth', 0)

In [None]:
df_inspect = pd.concat([df_train[df_train['sentiment']==0].head(3), df_train[df_train['sentiment']==1].head(3)])

In [None]:
df_inspect

Before we can start the training, we need to tokenize the data and save it to S3

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

# tokenizer helper function
def tokenize(batch):
    return tokenizer(batch['sentence'], padding='max_length', truncation=True)

In [None]:
train_dataset = train_dataset.map(tokenize, batched=True)
test_dataset = test_dataset.map(tokenize, batched=True)
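
To make the effect of `padding='max_length'` and `truncation=True` concrete, here is a minimal sketch on plain lists of token ids. The helper `pad_and_truncate` and the example ids are hypothetical; the real tokenizer additionally handles special tokens (for instance, it keeps the closing separator token when truncating), which this sketch ignores.

```python
def pad_and_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                 # truncation: drop tokens past the limit
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad           # padding: fill up to the fixed length
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask

# A short sequence is padded, a long one is cut off
short_ids, short_mask = pad_and_truncate([101, 7, 8, 102], max_length=6)
long_ids, long_mask = pad_and_truncate(list(range(1, 11)), max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Padding every review to the model's maximum length keeps all tensors the same shape, at the cost of extra computation on short reviews.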

In [None]:
train_dataset =  train_dataset.rename_column("sentiment", "labels")
train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])
test_dataset = test_dataset.rename_column("sentiment", "labels")
test_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])

In [None]:
s3_prefix_sentiment = 'datasets/turkish_product_reviews'

In [None]:
import botocore
from datasets.filesystems import S3FileSystem

s3 = S3FileSystem()  

# save train_dataset to s3
training_input_path = f's3://{sagemaker_session_bucket}/{s3_prefix_sentiment}/train'
train_dataset.save_to_disk(training_input_path,fs=s3)

# save test_dataset to s3
test_input_path = f's3://{sagemaker_session_bucket}/{s3_prefix_sentiment}/test'
test_dataset.save_to_disk(test_input_path,fs=s3)

### Model Training

In [None]:
from sagemaker.huggingface import HuggingFace

# hyperparameters, which are passed into the training job
hyperparameters_sentiment={'epochs': 1,
                 'train_batch_size': 8,
                 'model_name': model_name
                 }

In [None]:
huggingface_estimator_sentiment = HuggingFace(entry_point='train.py',
                                    source_dir='./scripts',
                                    instance_type='ml.p3.2xlarge',
                                    instance_count=1,
                                    role=role,
                                    transformers_version='4.6',
                                    pytorch_version='1.7',
                                    py_version='py36',
                                    hyperparameters=hyperparameters_sentiment,
                                    )

In [None]:
# wait=False returns immediately; the training job must finish before the model can be deployed
huggingface_estimator_sentiment.fit({'train': training_input_path, 'test': test_input_path}, wait=False)
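
The training script `scripts/train.py` is not shown in this notebook. As a rough sketch of how such a script might receive its inputs: SageMaker passes the hyperparameters as command-line arguments and exposes each input channel and the model output directory through environment variables (`SM_CHANNEL_TRAIN`, `SM_CHANNEL_TEST`, `SM_MODEL_DIR`). The argument names `training_dir`, `test_dir` and `model_dir` below are assumptions, not taken from the actual script.

```python
import argparse
import os

def parse_args(argv=None):
    """Parse hyperparameters and SageMaker-provided directories."""
    parser = argparse.ArgumentParser()

    # Hyperparameters: keys of the dict passed to the estimator become --key value
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--train_batch_size", type=int, default=8)
    parser.add_argument("--model_name", type=str, required=True)

    # Directories: one environment variable per channel named in fit()
    parser.add_argument("--training_dir", default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--test_dir", default=os.environ.get("SM_CHANNEL_TEST"))
    parser.add_argument("--model_dir", default=os.environ.get("SM_MODEL_DIR"))

    # parse_known_args ignores any extra arguments the container may add
    args, _ = parser.parse_known_args(argv)
    return args

# Simulate the command line that the training container would build
args = parse_args([
    "--epochs", "1",
    "--train_batch_size", "8",
    "--model_name", "dbmdz/bert-base-turkish-uncased",
])
print(args.epochs, args.train_batch_size, args.model_name)
```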

### Model Deployment

In [None]:
predictor_sentiment = huggingface_estimator_sentiment.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    wait=False,
    endpoint_name="turkish-sentiment-endpoint"
)

### Model Testing

In [None]:
# This is only required to create a predictor from an already deployed model
predictor_sentiment = HuggingFacePredictor('turkish-sentiment-endpoint')

In [None]:
# Input text: "This is a pretty bad product, I wouldn't recommend this to anyone"
sentiment_input= {"inputs": "Bu oldukça kötü bir ürün, bunu kimseye tavsiye etmem"}
predictor_sentiment.predict(sentiment_input)

In [None]:
#Input text: "I love this shampoo, it makes my hair so shiny"
sentiment_input= {"inputs": "Bu şampuanı seviyorum, saçlarımı çok parlak yapıyor"}
predictor_sentiment.predict(sentiment_input)
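
A text-classification endpoint of this kind typically returns a list of `label`/`score` dictionaries with generic label names such as `LABEL_0`. The helper below is a small sketch for turning that into a readable result. It assumes that label ids follow the dataset's `sentiment` column with 0 = negative and 1 = positive; verify this against the inspected training data before relying on it. The example response is made up for illustration and is not actual endpoint output.

```python
# Assumption: LABEL_0 / LABEL_1 correspond to sentiment 0 (negative) / 1 (positive)
LABEL_NAMES = {"LABEL_0": "negative", "LABEL_1": "positive"}

def readable_prediction(response):
    """Return (label_name, score) for the highest-scoring entry of a response."""
    best = max(response, key=lambda item: item["score"])
    # Fall back to the raw label if it is not in the mapping
    return LABEL_NAMES.get(best["label"], best["label"]), best["score"]

# Hypothetical response, for illustration only
example_response = [{"label": "LABEL_0", "score": 0.9}]
print(readable_prediction(example_response))
```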

## Question Answering

### Downloading the data

The data is taken from https://github.com/TQuad/turkish-nlp-qa-dataset

In [None]:
!wget https://raw.githubusercontent.com/TQuad/turkish-nlp-qa-dataset/master/train-v0.1.json -q

In [None]:
!wget https://raw.githubusercontent.com/TQuad/turkish-nlp-qa-dataset/master/dev-v0.1.json -q

In [None]:
!mv train-v0.1.json data/train-v0.1.json
!mv dev-v0.1.json data/dev-v0.1.json

The JSON files use a nested SQuAD-style layout and must be flattened to one JSON record per question so that they can be used to train a Q&A model

In [None]:
import json
from datasets import load_dataset

def convert_json(input_filename, output_filename):
    with open(input_filename) as f:
        dataset = json.load(f)

    with open(output_filename, "w") as f:
        for article in dataset["data"]:
            title = article["title"]
            for paragraph in article["paragraphs"]:
                context = paragraph["context"]
                answers = {}
                for qa in paragraph["qas"]:
                    question = qa["question"]
                    idx = qa["id"]
                    answers["text"] = [a["text"] for a in qa["answers"]]
                    answers["answer_start"] = [int(a["answer_start"]) for a in qa["answers"]]
                    f.write(
                        json.dumps(
                            {
                                "id": idx,
                                "title": title,
                                "context": context,
                                "question": question,
                                "answers": answers,
                            }
                        )
                    )
                    f.write("\n")

In [None]:
convert_json('data/train-v0.1.json', 'data/train.json')
convert_json('data/dev-v0.1.json', 'data/val.json')

In [None]:
data_files = {}
data_files["train"] = 'data/train.json'
data_files["validation"] = 'data/val.json'

In [None]:
from datasets import load_dataset
ds = load_dataset("json", data_files=data_files)

In [None]:
df = pd.DataFrame(ds['train'])

In [None]:
df.iloc[7518:7521]

Uploading to S3

In [None]:
s3_prefix_qa = 'datasets/turkish_qa'

In [None]:
!aws s3 cp data/train.json s3://$sagemaker_session_bucket/$s3_prefix_qa/train.json
!aws s3 cp data/val.json s3://$sagemaker_session_bucket/$s3_prefix_qa/val.json

### Model Training

In [None]:
from sagemaker.huggingface import HuggingFace

hyperparameters_qa={
    'model_name_or_path': model_name,
    'train_file': '/opt/ml/input/data/train/train.json',
    'validation_file': '/opt/ml/input/data/val/val.json',
    'do_train': True,
    'do_eval': False,
    'fp16': True,
    'per_device_train_batch_size': 4,
    'per_device_eval_batch_size': 4,
    'num_train_epochs': 2,
    'max_seq_length': 384,
    'pad_to_max_length': True,
    'doc_stride': 128,
    'output_dir': '/opt/ml/model'
}

instance_type = 'ml.p3.16xlarge'
instance_count = 1
volume_size = 200

In [None]:
huggingface_estimator_qa = HuggingFace(entry_point='run_qa.py',
                                       source_dir='./scripts',
                                       instance_type=instance_type,
                                       instance_count=instance_count,
                                       volume_size=volume_size,
                                       role=role,
                                       transformers_version='4.10',
                                       pytorch_version='1.9',
                                       py_version='py38',
                                       hyperparameters=hyperparameters_qa,
                                       disable_profiler=True,
                                      )

In [None]:
# wait=False returns immediately; the training job must finish before the model can be deployed
huggingface_estimator_qa.fit({'train': f's3://{sagemaker_session_bucket}/{s3_prefix_qa}/', 'val': f's3://{sagemaker_session_bucket}/{s3_prefix_qa}/'}, wait=False)
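
Each channel name passed to `fit` becomes a directory under `/opt/ml/input/data/` inside the training container, which is why `train_file` and `validation_file` in the hyperparameters point there. The short sketch below checks that each file path sits in the directory of a declared channel. The helper `channel_of` is hypothetical and runs locally without SageMaker.

```python
import posixpath

CHANNEL_ROOT = "/opt/ml/input/data"

# Channel names as passed to fit()
channels = ["train", "val"]

# File paths as given in hyperparameters_qa
files = {
    "train_file": "/opt/ml/input/data/train/train.json",
    "validation_file": "/opt/ml/input/data/val/val.json",
}

def channel_of(path):
    """Return the channel whose directory directly contains path, or None."""
    for name in channels:
        if posixpath.dirname(path) == posixpath.join(CHANNEL_ROOT, name):
            return name
    return None

resolved = {key: channel_of(path) for key, path in files.items()}
print(resolved)
```

Because both channels point to the same S3 prefix in this notebook, `train.json` and `val.json` are both copied into each channel directory; the script only reads the one named in its hyperparameters.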

### Model Deployment

In [None]:
predictor_qa = huggingface_estimator_qa.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    wait=False,
    endpoint_name="turkish-qa-endpoint"
)

### Model Testing

In [None]:
# This is only required to create a predictor from an already deployed model
predictor_qa = HuggingFacePredictor('turkish-qa-endpoint')

In [None]:
#Question: "When did he start a vagabond life?"
#Predicted answer: "On his father's death"

data = {
"inputs": {
    "question": "Ne zaman avare bir hayata başladı?",
    "context": """ABASIYANIK, Sait Faik. Hikayeci (Adapazarı 23 Kasım 1906-İstanbul 11 Mayıs 1954). \
İlk öğrenimine Adapazarı’nda Rehber-i Terakki Mektebi’nde başladı. İki yıl kadar Adapazarı İdadisi’nde okudu. \
İstanbul Erkek Lisesi’nde devam ettiği orta öğrenimini Bursa Lisesi’nde tamamladı (1928). İstanbul Edebiyat \
Fakültesi’ne iki yıl devam ettikten sonra babasının isteği üzerine iktisat öğrenimi için İsviçre’ye gitti. \
Kısa süre sonra iktisat öğrenimini bırakarak Lozan’dan Grenoble’a geçti. Üç yıl başıboş bir edebiyat öğrenimi \
gördükten sonra babası tarafından geri çağrıldı (1933). Bir müddet Halıcıoğlu Ermeni Yetim Mektebi'nde Türkçe \
gurup dersleri öğretmenliği yaptı. Ticarete atıldıysa da tutunamadı. Bir ay Haber gazetesinde adliye muhabirliği \
yaptı (1942). Babasının ölümü üzerine aileden kalan emlakin geliri ile avare bir hayata başladı. Evlenemedi. \
Yazları Burgaz adasındaki köşklerinde, kışları Şişli’deki apartmanlarında annesi ile beraber geçen bu fazla \
içkili bohem hayatı ömrünün sonuna kadar sürdü."""
    }
}
predictor_qa.predict(data)['answer']

In [None]:
#Question: "When did Einstein return to Germany?"
#Predicted answer: "1914"

data = {
"inputs": {
    "question": "Ne zaman Almanya’ya döndü?",
    "context": """1908’de artık oldukça tanınmış, büyük bir bilim adamı olarak tanınıyordu ve Bern \
Üniversitesinde öğretmen olarak atanmıştı. Sonraki sene patent ofisindeki işinden ve öğretmenlikten \
ayrıldı ve Zürih Üniversitesinde fizik doçentliğine başladı. 1911 yılında Prag’da Karl-Ferdinand \
Üniversitesinde profesörlük unvanı aldı. 1914 yılında Almanya’ya döndü, Kaiser Willhelm Fizik \
Enstitüsü’nde yönetici, Berlin Humboldt Üniversitesinde profesör oldu. Bu işlerindeki \
sözleşmelerinde öğretmenlik görevlerini oldukça azaltan maddeler vardı."""
    }
}
predictor_qa.predict(data)['answer']
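
Besides `answer`, a question-answering response of this kind usually carries a `score` and the `start` and `end` character offsets of the answer. An extractive model always returns some span, even when the context does not contain the answer, so it can be useful to discard low-confidence results. The helper and the threshold of 0.5 below are assumptions for illustration, and the example responses are invented rather than real endpoint output.

```python
def confident_answer(response, min_score=0.5):
    """Return the answer text, or None when the score is below min_score."""
    if response.get("score", 0.0) < min_score:
        return None
    return response["answer"].strip()

# Hypothetical responses, for illustration only
high = {"score": 0.91, "start": 10, "end": 14, "answer": "1914"}
low = {"score": 0.07, "start": 0, "end": 4, "answer": "1908"}

print(confident_answer(high))
print(confident_answer(low))
```

A suitable threshold depends on the model and the data and is best chosen on a validation set.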