# Basic NLP tasks using Hugging Face Transformers 🤗
This notebook contains examples of NLP tasks like:  
- text summarization
- text classification
- machine translation
- question answering
- named entity recognition  

By the end of this notebook, you will understand how to use Transformer models for common NLP tasks, starting with sentiment analysis, using Python and the Transformers library. Whether you are a data scientist, a machine learning enthusiast, or simply curious about NLP, this notebook is designed to give you practical, hands-on experience.

## Hugging Face platform 🤗

Hugging Face is at the forefront of the latest Natural Language Processing (NLP) advancements. It has become synonymous with state-of-the-art machine learning models, especially in language understanding and generation. Renowned for its comprehensive, open-source library "Transformers", Hugging Face provides an easy-to-use platform that houses a wide range of pre-trained models such as BERT, GPT, T5, and DistilBERT.

These models, built using deep learning techniques, can perform various complex NLP tasks, including but not limited to sentiment analysis, text summarization, translation, and question-answering. Hugging Face's platform facilitates rapid deployment and experimentation, making it a favorite among researchers, data scientists, and developers in academic settings.

First, the notebook presents a sentence classification application. Then, additional tasks typical for NLP are presented.

In [None]:
import numpy as np
import pandas as pd

In [None]:
import logging

logging.getLogger("transformers").setLevel(logging.ERROR)

## Sentiment analysis example 🧐

In [None]:
# Select the model from the Hugging Face Hub
model_name = "distilbert-base-uncased-finetuned-sst-2-english"

In [None]:
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification, pipeline

model_classification = TFAutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer_classification = AutoTokenizer.from_pretrained(model_name)

classifier = pipeline("sentiment-analysis", 
                      model=model_classification,
                      tokenizer=tokenizer_classification)
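
Each prediction returned by `classifier` is a dictionary with a `label` and a `score`. In the usual single-label setup, the score is a probability obtained by applying a softmax to the model's raw outputs (logits). The sketch below illustrates that step only. The `softmax` helper, the logits, and the label order are illustrative assumptions, not values taken from the model.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one review, assumed to be ordered as (NEGATIVE, POSITIVE)
labels = ["NEGATIVE", "POSITIVE"]
logits = [-2.0, 3.0]

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)  # index of the most likely label
print({"label": labels[best], "score": round(probs[best], 4)})
```

A score close to 1 means the model strongly prefers one label. Scores near 0.5 suggest a borderline case.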

In [None]:
reviews = [
    ("This product is amazing! Highly recommended.", "POSITIVE"),
    ("Absolutely terrible service, very disappointed.", "NEGATIVE"),
    ("Quite good, but could be better.", "POSITIVE"),
    ("I love this! Definitely buying again.", "POSITIVE"),
    ("Not what I expected, quite underwhelming.", "NEGATIVE"),
    ("This is the best purchase I've ever made.", "POSITIVE"),
    ("Complete waste of money, do not buy this.", "NEGATIVE"),
    ("An average product, nothing special.", "NEGATIVE"),
    ("Exceeded my expectations, wonderful quality!", "POSITIVE"),
    ("The service was bad, and the product is faulty.", "NEGATIVE"),
    ("I'm so happy with this, great job!", "POSITIVE"),
    ("It's okay, but I've seen better.", "NEGATIVE"),
    ("Worst experience ever, will not be returning.", "NEGATIVE"),
    ("This is exactly what I needed, thank you!", "POSITIVE"),
    ("Mediocre, not worth the hype.", "NEGATIVE"),
    ("Impressed with the fast delivery and quality.", "POSITIVE"),
    ("Terrible quality, broke after one use.", "NEGATIVE"),
    ("Good for the price, but has some issues.", "POSITIVE"),
    ("I'm very satisfied with my purchase.", "POSITIVE"),
    ("Disappointing product, not as described.", "NEGATIVE")
]

In [None]:
output_dict = {"review": [],
               "actual_sent": [], 
               "predicted_sent": []}
for review, actual_sentiment in reviews:
    predicted_sentiment = classifier(review)[0]['label']
    output_dict["review"].append(review)    
    output_dict["actual_sent"].append(actual_sentiment)
    output_dict["predicted_sent"].append(predicted_sentiment)
output_df = pd.DataFrame(output_dict)
output_df
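
To see at a glance how often the predicted labels agree with the hand-assigned ones, and which rows differ, a small helper can compare the two columns. This is a minimal sketch: `agreement` is a hypothetical helper and the labels below are made up for illustration, not real model output.

```python
def agreement(actual, predicted):
    """Return the share of matching labels and the indices where they differ."""
    if len(actual) != len(predicted):
        raise ValueError("label lists must have the same length")
    mismatches = [i for i, (a, p) in enumerate(zip(actual, predicted)) if a != p]
    rate = 1 - len(mismatches) / len(actual)
    return rate, mismatches

# Hypothetical labels, not real model output
actual = ["POSITIVE", "NEGATIVE", "POSITIVE", "NEGATIVE"]
predicted = ["POSITIVE", "NEGATIVE", "NEGATIVE", "NEGATIVE"]

rate, mismatches = agreement(actual, predicted)
print(f"Agreement: {rate:.2f}, mismatched rows: {mismatches}")
```

In this notebook the same helper could be called with `output_dict["actual_sent"]` and `output_dict["predicted_sent"]`. Mixed reviews such as "Quite good, but could be better." are the likeliest candidates for disagreement.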

### Text classification on existing dataset

Let's load an existing dataset from the Hugging Face Hub. The dataset used in the example was obtained from `https://huggingface.co/datasets/glue`.

In [None]:
import datasets

dataset = datasets.load_dataset("glue", "sst2")
print(dataset)

In [None]:
dataset = datasets.load_dataset("glue", "sst2", split='train')
dataset_df = pd.DataFrame(dataset)
dataset_df.head()

In [None]:
n_samples = 200

subset = dataset_df.sample(n_samples, random_state=53)
X = list(subset['sentence'])
y = list(subset['label'])

In [None]:
results = [classifier(rev)[0]['label'] for rev in X]
yhat = [1 if res == "POSITIVE" else 0 for res in results]

In [None]:
# Accuracy is the share of predictions that match the reference labels
accuracy = np.mean(np.array(y) == np.array(yhat))
print(f'Accuracy on the set: {accuracy:.2f}')
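
Accuracy alone does not show which kind of error the classifier makes. A possible next step is to count true and false positives and negatives, then derive precision and recall. The sketch below assumes binary labels with 1 meaning positive. `confusion_counts` is a hypothetical helper and the labels are made up, not results from the model.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    names = {(1, 1): "tp", (0, 1): "fp", (0, 0): "tn", (1, 0): "fn"}
    counts = Counter(names[(t, p)] for t, p in zip(y_true, y_pred))
    return {k: counts.get(k, 0) for k in ("tp", "fp", "tn", "fn")}

# Hypothetical labels, not results from the model
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]

c = confusion_counts(y_true, y_pred)
acc = (c["tp"] + c["tn"]) / len(y_true)
precision = c["tp"] / (c["tp"] + c["fp"])  # share of predicted positives that are correct
recall = c["tp"] / (c["tp"] + c["fn"])     # share of actual positives that were found
print(c, f"accuracy={acc:.2f} precision={precision:.2f} recall={recall:.2f}")
```

With the notebook's variables, the call would be `confusion_counts(y, yhat)`.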

## Text summarization

In [None]:
# Example texts
news_article = """
Climate change is accelerating, with carbon dioxide levels rising and global temperatures increasing at an alarming rate. 
The impact is seen worldwide, with more frequent and severe weather events like hurricanes, droughts, and wildfires. 
Scientists are urging immediate action to reduce greenhouse gas emissions to mitigate these effects.
"""

scientific_abstract = """
In this study, we explore the application of convolutional neural networks (CNNs) in classifying medical imaging. 
Our dataset comprises 10,000 MRI scans of various brain diseases. We trained our CNN model using this dataset and 
achieved a 95% accuracy in differentiating between malignant and benign tumors, outperforming traditional methods.
"""

story_excerpt = """
Once upon a time in a faraway land, there was a kingdom of extraordinary beauty. The kingdom was known for its 
enchanting forests and a majestic castle where the beloved royal family lived. Despite its beauty, the kingdom faced 
troubles from a fearsome dragon that threatened peace.
"""

In [None]:
# Summarizing each text
summarizer = pipeline("summarization")

articles = [news_article, scientific_abstract, story_excerpt]
for i, text in enumerate(articles):
    summary = summarizer(text, max_length=30, min_length=10, do_sample=False)
    print(f'Summary of the article no {i+1}:\n{summary[0]["summary_text"]}\n')
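
Note that `max_length` and `min_length` are counted in model tokens, not words. As a rough sanity check on how much a summary shortens its source, one can compare word counts. This is only a sketch: `compression_ratio` is a hypothetical helper, and the summary below is hand-written for illustration rather than produced by the model.

```python
def compression_ratio(source, summary):
    """Ratio of summary length to source length, in whitespace-separated words."""
    n_source = len(source.split())
    if n_source == 0:
        raise ValueError("source text is empty")
    return len(summary.split()) / n_source

source = ("Scientists are urging immediate action to reduce greenhouse gas "
          "emissions to mitigate these effects.")
summary = "Scientists urge action to reduce emissions."  # hand-written, not model output

ratio = compression_ratio(source, summary)
print(f"The summary is {ratio:.0%} of the source length")
```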

## Text translation

In [None]:
translator = pipeline("translation", model="sdadas/mt5-base-translator-en-pl")
enpl_translation = translator("We are now learning how to use Natural Language Processing in Python")
print(enpl_translation[0]['translation_text'])

## Question answering

In [None]:
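# Name the task explicitly instead of relying on it being inferred from the model
oracle = pipeline("question-answering", model="deepset/roberta-base-squad2")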
oracle(question="Where do I live?", context="My name is Wolfgang and I live in Wroclaw")

In [None]:
context = """
Pythagoras was an ancient Ionian Greek philosopher and the eponymous founder of Pythagoreanism. His political and 
religious teachings were well known in Magna Graecia and influenced the philosophies of Plato, Aristotle, and, 
through them, Western philosophy. Knowledge of his life is clouded by legend, but he appears to have been the son of 
Mnesarchus, a gem engraver on the island of Samos. Modern scholars disagree regarding Pythagoras's education and 
influences, but they do agree that, around 530 BC, he traveled to Croton, where he founded a school in which 
initiates were sworn to secrecy and lived a communal, ascetic lifestyle.
"""

questions = [
    "Who was Pythagoras?",
    "What did Pythagoras influence?",
    "Where did Pythagoras found his school?"
]

# Answering each question
for question in questions:
    result = oracle(question=question, context=context)
    print(f"Question: {question}")
    print(f"Answer: {result['answer']}\n")
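
Besides `answer`, each result dictionary also carries a `score` and the character offsets `start` and `end` of the answer within the context. The sketch below shows one way these fields might be used to reject low-confidence answers. The result dictionary is hand-built for illustration, the score is made up, and both `accept_answer` and its `min_score` threshold are assumptions.

```python
context = "My name is Wolfgang and I live in Wroclaw"

# Hypothetical result in the shape a question-answering pipeline returns;
# the score is made up and the offsets are computed here for illustration.
answer = "Wroclaw"
start = context.find(answer)
result = {"score": 0.9, "start": start, "end": start + len(answer), "answer": answer}

def accept_answer(result, context, min_score=0.5):
    """Return the answer span if the score clears the threshold, else None."""
    if result["score"] < min_score:
        return None
    # The offsets index directly into the original context string
    return context[result["start"]:result["end"]]

accepted = accept_answer(result, context)
rejected = accept_answer({**result, "score": 0.1}, context)
print(accepted, rejected)
```

A threshold like this can be useful with models trained on SQuAD 2.0, which also contains questions that the context cannot answer.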


## Named Entity Recognition

In [None]:
# Initialize NER pipeline
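# aggregation_strategy="simple" merges word pieces into whole entities
# (it replaces the deprecated grouped_entities=True)
ner_pipeline = pipeline("ner", aggregation_strategy="simple")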

# Sample text
text = "Google was founded by Larry Page and Sergey Brin while they were students at Stanford University."

# Performing NER
ner_results = ner_pipeline(text)
for entity in ner_results:
    print(f"Entity: {entity['word']}, Type: {entity['entity_group']}, Score: {entity['score']:.2f}")
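
For downstream use it is often handier to collect the recognized entities by type. The sketch below works on a hand-written list in the same shape as the entries printed above (`word`, `entity_group`, `score`). The scores are made up, and `group_entities` with its `min_score` parameter is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical entries in the shape of the NER output; scores are made up
sample_entities = [
    {"word": "Google", "entity_group": "ORG", "score": 0.99},
    {"word": "Larry Page", "entity_group": "PER", "score": 0.99},
    {"word": "Sergey Brin", "entity_group": "PER", "score": 0.98},
    {"word": "Stanford University", "entity_group": "ORG", "score": 0.60},
]

def group_entities(results, min_score=0.0):
    """Map each entity type to the list of words whose score is high enough."""
    grouped = defaultdict(list)
    for entity in results:
        if entity["score"] >= min_score:
            grouped[entity["entity_group"]].append(entity["word"])
    return dict(grouped)

by_type = group_entities(sample_entities, min_score=0.9)
print(by_type)
```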

## Text generation

In [None]:
generator = pipeline("text-generation", model="gpt2")
generated = generator('The most popular programming language is')

In [None]:
print(generated[0]['generated_text'])

## Prompt Engineering

In [None]:
# Initialize the text generation pipeline
generator = pipeline('text-generation', model='gpt2')

# Style-specific prompts
prompts = {
    "Shakespearean": "To be or not to be, that is the question:",
    "News Report": "Today in New York City, a major event took place where",
    "Science Fiction": "In a distant future, humanity has colonized Mars and"
}

# Generating and displaying responses
for style, prompt in prompts.items():
    result = generator(prompt, max_length=50, num_return_sequences=1)
    print(f"Style: {style}")
    print(f"Generated Text: {result[0]['generated_text']}\n")

In [None]:
# Steering the response by slightly altering the prompt (no model weights are changed)
original_prompt = "What is the best way to learn programming?"
modified_prompts = [
    original_prompt,
    "As a beginner, " + original_prompt,
    "In a fun and engaging way, " + original_prompt
]

# Generating responses
for prompt in modified_prompts:
    result = generator(prompt, max_length=50, num_return_sequences=1)
    print(f"Prompt: {prompt}")
    print(f"Generated Text: {result[0]['generated_text']}\n")
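
When many prefixes are being compared, building the variants with a small function keeps the experiment tidy. This is a minimal sketch: `build_prompt_variants` is a hypothetical helper, not part of the Transformers library.

```python
def build_prompt_variants(base_prompt, prefixes):
    """Return the base prompt followed by one variant per prefix."""
    variants = [base_prompt]
    for prefix in prefixes:
        # Normalize trailing whitespace so each prefix is joined by one space
        variants.append(f"{prefix.rstrip()} {base_prompt}")
    return variants

base = "What is the best way to learn programming?"
variants = build_prompt_variants(base, ["As a beginner,", "In a fun and engaging way, "])
for v in variants:
    print(v)
```

The resulting list can be passed to the same generation loop in place of `modified_prompts`.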

In [None]:
# Genre-specific prompts
genres = {
    "Horror": "In a dark, abandoned house, there was a mysterious noise that",
    "Comedy": "At the comedy club, the stand-up comedian started his act by saying:",
    "Romantic": "In the beautiful city of Paris, two lovers met and"
}

# Generating genre-specific texts
for genre, prompt in genres.items():
    result = generator(prompt, max_length=50, num_return_sequences=1)
    print(f"Genre: {genre}")
    print(f"Generated Text: {result[0]['generated_text']}\n")