#Question 1: Sentiment Analysis with Transformers
Dataset Problem: Use the IMDB movie reviews dataset to perform sentiment analysis with a Transformer model. Load the dataset from the TensorFlow Datasets library and solve the problem.
Because Transformer models are large and complex, use them through a library such as Hugging Face's Transformers. Feel free to experiment with more than one Transformer model, compare the results, and briefly explain which model performs best and why.


In [1]:
import pandas as pd
import tensorflow as tf
import sklearn
from tqdm import tqdm
df=pd.read_csv("/content/IMDB Dataset.csv")
df.sample(5)

Unnamed: 0,review,sentiment
28382,This one grew on me. I love the R.D. Burman mu...,positive
8318,"If this could be rated a 0, it would be. From ...",negative
17764,"""A Mouse in the House"" is a very classic carto...",positive
14961,I'm watching this film as I write this. It's a...,negative
47897,"I have officially vomited in my own mouth, tha...",negative


In [None]:
# Install the Transformers library
!pip install transformers

In [None]:
# Loading the BERT Classifier and Tokenizer along with Input module
from transformers import BertTokenizer, TFBertForSequenceClassification
from transformers import InputExample, InputFeatures

model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

In [4]:
# changing positive and negative into numeric values

def cat2num(value):
    if value=='positive':
        return 1
    else:
        return 0

df['sentiment']  =  df['sentiment'].apply(cat2num)
train = df[:45000]
test = df[45000:]

#Data Preprocessing
To train a model with BERT, we need some additional preprocessing. Let's go through the steps one by one:

Add the special tokens [SEP] (to separate sentences) and [CLS] (whose output is used for classification)
Pad or truncate every sequence to a constant length
Create an array of 0s (pad tokens) and 1s (real tokens) called the attention mask
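
The three steps above can be sketched in plain Python, without the tokenizer. This is only an illustration under stated assumptions: the helper name `encode_fixed_length` is hypothetical, and the IDs 101, 102 and 0 are the [CLS], [SEP] and [PAD] entries of the bert-base-uncased vocabulary. In the notebook itself the tokenizer performs all of this.

```python
# Special token IDs in the bert-base-uncased vocabulary
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0

def encode_fixed_length(token_ids, max_length):
    """Wrap token IDs in [CLS]/[SEP], then truncate or pad to max_length."""
    # Reserve two positions for [CLS] and [SEP]
    body = token_ids[: max_length - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    # 1 marks a real token, 0 marks padding
    attention_mask = [1] * len(input_ids)
    pad_count = max_length - len(input_ids)
    input_ids += [PAD_ID] * pad_count
    attention_mask += [0] * pad_count
    return input_ids, attention_mask

# Token IDs for "in this ka ##ggle notebook"
ids, mask = encode_fixed_length([1999, 2023, 10556, 24679, 14960], max_length=8)
print(ids)   # [101, 1999, 2023, 10556, 24679, 14960, 102, 0]
print(mask)  # [1, 1, 1, 1, 1, 1, 1, 0]
```

The attention mask lets the model ignore the padded positions, so padding to a common length does not change what the model attends to.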

In [5]:
# But first, let's look at a BERT tokenizer example

example='In this Kaggle notebook, I will do sentiment analysis using BERT with Huggingface'
tokens=tokenizer.tokenize(example)
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(tokens)
print(token_ids)

['in', 'this', 'ka', '##ggle', 'notebook', ',', 'i', 'will', 'do', 'sentiment', 'analysis', 'using', 'bert', 'with', 'hugging', '##face']
[1999, 2023, 10556, 24679, 14960, 1010, 1045, 2097, 2079, 15792, 4106, 2478, 14324, 2007, 17662, 12172]
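
The `##` pieces in the output above come from WordPiece tokenization, which splits a word that is not in the vocabulary into the longest known sub-word pieces, scanning left to right. The following is a minimal sketch of that greedy matching. The function name `wordpiece` and the tiny `toy_vocab` are hypothetical and stand in for the real 30k-entry vocabulary; the real tokenizer also lowercases text and splits punctuation first.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one lowercase word into sub-word pieces by greedy longest match."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, for illustration only
toy_vocab = {"ka", "##ggle", "hugging", "##face", "notebook"}
print(wordpiece("kaggle", toy_vocab))       # ['ka', '##ggle']
print(wordpiece("huggingface", toy_vocab))  # ['hugging', '##face']
print(wordpiece("notebook", toy_vocab))     # ['notebook']
```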


In [6]:
def convert_data_to_examples(train, test, review, sentiment):
    train_InputExamples = train.apply(lambda x: InputExample(guid=None, # Globally unique ID for bookkeeping, unused in this case
                                                          text_a = x[review],
                                                          label = x[sentiment]), axis = 1)

    validation_InputExamples = test.apply(lambda x: InputExample(guid=None, # Globally unique ID for bookkeeping, unused in this case
                                                          text_a = x[review],
                                                          label = x[sentiment]), axis = 1,)

    return train_InputExamples, validation_InputExamples

train_InputExamples, validation_InputExamples = convert_data_to_examples(train,  test, 'review',  'sentiment')

In [7]:
def convert_examples_to_tf_dataset(examples, tokenizer, max_length=128):
    features = [] # -> will hold InputFeatures to be converted later

    for e in tqdm(examples):
        input_dict = tokenizer.encode_plus(
            e.text_a,
            add_special_tokens=True,    # add '[CLS]' and '[SEP]'
            max_length=max_length,      # upper bound on the sequence length
            return_token_type_ids=True,
            return_attention_mask=True,
            padding='max_length',       # pad on the right up to max_length (replaces the deprecated pad_to_max_length=True)
            truncation=True             # truncate if the sequence is longer than max_length
        )

        input_ids, token_type_ids, attention_mask = (input_dict["input_ids"],input_dict["token_type_ids"], input_dict['attention_mask'])
        features.append(InputFeatures( input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids, label=e.label) )

    def gen():
        for f in features:
            yield (
                {
                    "input_ids": f.input_ids,
                    "attention_mask": f.attention_mask,
                    "token_type_ids": f.token_type_ids,
                },
                f.label,
            )

    return tf.data.Dataset.from_generator(
        gen,
        ({"input_ids": tf.int32, "attention_mask": tf.int32, "token_type_ids": tf.int32}, tf.int64),
        (
            {
                "input_ids": tf.TensorShape([None]),
                "attention_mask": tf.TensorShape([None]),
                "token_type_ids": tf.TensorShape([None]),
            },
            tf.TensorShape([]),
        ),
    )


DATA_COLUMN = 'review'
LABEL_COLUMN = 'sentiment'

In [8]:
train_data = convert_examples_to_tf_dataset(list(train_InputExamples), tokenizer)
# Note: repeat(2) together with epochs=2 in model.fit gives four passes over the training data.
train_data = train_data.shuffle(100).batch(32).repeat(2)

100%|██████████| 45000/45000 [03:52<00:00, 193.83it/s]


In [9]:
validation_data = convert_examples_to_tf_dataset(list(validation_InputExamples), tokenizer)
validation_data = validation_data.batch(32)

100%|██████████| 5000/5000 [00:25<00:00, 196.32it/s]


In [None]:
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=[tf.keras.metrics.SparseCategoricalAccuracy('accuracy')])

model.fit(train_data, epochs=2, validation_data=validation_data)

Epoch 1/2
      4/Unknown - 244s 48s/step - loss: 0.6860 - accuracy: 0.5781

In [None]:
# Reloading the BERT classifier and tokenizer (for example after a runtime restart).
# Note: from_pretrained("bert-base-uncased") returns a model whose classification head
# is freshly initialized, so its predictions are only meaningful after fine-tuning.
from transformers import BertTokenizer, TFBertForSequenceClassification
from transformers import InputExample, InputFeatures

model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")


In [4]:
pred_sentences = ['worst movie of my life, will never watch movies from this series', 'Wow, blew my mind, what a movie by Marvel, animation and story is amazing']

In [7]:
import tensorflow as tf

# Tokenize the sentences before passing them to the model
tf_batch = tokenizer(pred_sentences, max_length=128, padding=True, truncation=True, return_tensors='tf')
tf_outputs = model(tf_batch)
# tf_outputs[0] holds the logits; softmax over the last axis turns them into class probabilities
tf_predictions = tf.nn.softmax(tf_outputs[0], axis=-1)
labels = ['Negative','Positive']
label = tf.argmax(tf_predictions, axis=1)
label = label.numpy()
for i in range(len(pred_sentences)):
    print(pred_sentences[i], ": ", labels[label[i]])

worst movie of my life, will never watch movies from this series :  Positive
Wow, blew my mind, what a movie by Marvel, animation and story is amazing :  Positive
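
The prediction step above (logits, then softmax, then argmax) can be reproduced in plain Python as a rough sketch. The logits below are hypothetical values chosen for illustration and were not produced by the model. A classification head that has not been fine-tuned produces essentially arbitrary logits, which would be consistent with both sentences above receiving the same label.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

labels = ["Negative", "Positive"]
# Hypothetical logits for two sentences (not produced by the model above)
batch_logits = [[2.0, -1.0], [-0.5, 1.5]]
predictions = []
for logits in batch_logits:
    probs = softmax(logits)
    # argmax: index of the most probable class
    predicted = max(range(len(probs)), key=probs.__getitem__)
    predictions.append(labels[predicted])
    print(labels[predicted], [round(p, 3) for p in probs])
```

Because softmax preserves the ordering of the logits, taking the argmax of the raw logits would give the same labels; the softmax is only needed when the probabilities themselves are of interest.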


#Question 2: Text Generation with Transformers
Dataset Problem: Using a pre-trained GPT model (any version) from Hugging Face's Transformers, generate a short story based on a given prompt. An example prompt is given below.
Prompt = "In a distant future, humanity has discovered"


In [None]:
!pip install transformers torch numpy

In [None]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

model_name = "gpt2"  # You can choose a different model on hugging face or fine-tune a model
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

In [3]:
prompt = "In a distant future, humanity has discovered"

# Tokenize input
input_ids = tokenizer.encode(prompt, return_tensors="pt", add_special_tokens=True)

# Create an attention mask
attention_mask = torch.ones(input_ids.shape, dtype=torch.long)

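# Generate text with the attention mask.
# Note: top_k, top_p and temperature only take effect when do_sample=True;
# without it, generate() falls back to greedy decoding.
output = model.generate(input_ids, attention_mask=attention_mask, max_length=100, num_return_sequences=1, no_repeat_ngram_size=2, top_k=50, top_p=0.95, temperature=0.7)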

generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In a distant future, humanity has discovered a way to make the world a better place.

The world is a place of peace, harmony, and harmony. It is the place where the human race has been born. The world has become a world of love. And the love of the people is not only the greatest love, but the most important love in the universe. Love is love for the whole of humanity.
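
The `temperature`, `top_k` and `top_p` arguments passed to `generate` above control how the next token is sampled. The following is a simplified sketch of the idea in plain Python, not the library's implementation: the function name `sample_next_token` and the toy logits are hypothetical, and the library's own filtering differs in details such as renormalizing between the top-k and top-p steps.

```python
import math
import random

def sample_next_token(logits, temperature=0.7, top_k=50, top_p=0.95, rng=random):
    """Pick one token index from raw logits using temperature, top-k and top-p."""
    # A temperature below 1 sharpens the distribution, above 1 flattens it
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep only the k most probable tokens
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Sample from the surviving tokens in proportion to their probabilities
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

# Hypothetical logits over a four-token vocabulary
toy_logits = [5.0, 1.0, 0.0, -1.0]
print(sample_next_token(toy_logits, top_k=1))  # 0, since only the top token survives
```

With `top_k=1` the choice is deterministic and equals greedy decoding; larger `top_k` and `top_p` values let less likely tokens through, which tends to reduce the repetition visible in the generated text above.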
