<a href="https://colab.research.google.com/github/miguelangel18241/NLP/blob/main/DistilBERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install transformers




In [2]:
from transformers import pipeline



In [None]:
## Create a pipeline for fill-mask task with DistilBERT model
unmasker = pipeline('fill-mask', model='distilbert-base-uncased')

In [None]:
## Perform masked language modeling
result = unmasker("Hello I'm [MASK] model")

In [None]:
print(result)

[{'score': 0.09484544396400452, 'token': 2026, 'token_str': 'my', 'sequence': "hello i'm my model"}, {'score': 0.09461958706378937, 'token': 2115, 'token_str': 'your', 'sequence': "hello i'm your model"}, {'score': 0.06731895357370377, 'token': 1037, 'token_str': 'a', 'sequence': "hello i'm a model"}, {'score': 0.05952874943614006, 'token': 1996, 'token_str': 'the', 'sequence': "hello i'm the model"}, {'score': 0.01111823134124279, 'token': 7592, 'token_str': 'hello', 'sequence': "hello i'm hello model"}]


In [None]:
for prediction in result:
  print(f"Sequence: {prediction['sequence']}")
  print(f"Score: {round(prediction['score'], 4)}")
  print(f"Token: {prediction['token']}")
  print(f"Token String: {prediction['token_str']}")
  print()


Sequence: hello i'm my model
Score: 0.0948
Token: 2026
Token String: my

Sequence: hello i'm your model
Score: 0.0946
Token: 2115
Token String: your

Sequence: hello i'm a model
Score: 0.0673
Token: 1037
Token String: a

Sequence: hello i'm the model
Score: 0.0595
Token: 1996
Token String: the

Sequence: hello i'm hello model
Score: 0.0111
Token: 7592
Token String: hello
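
The `score` values above are probabilities: the fill-mask pipeline applies a softmax over the model's vocabulary logits at the `[MASK]` position and returns the highest-ranked tokens (five by default). Because the probability mass is spread over the whole vocabulary, even the top candidates can have small scores. The sketch below illustrates the idea in plain Python; the six-word vocabulary and its logits are made up for illustration and are not real model output.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(val - peak) for tok, val in logits.items()}
    total = sum(exps.values())
    return {tok: val / total for tok, val in exps.items()}

def top_k(probs, k=5):
    """Return the k most probable (token, probability) pairs, best first."""
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]

# Hypothetical logits for a tiny vocabulary (assumption, not DistilBERT output)
toy_logits = {"my": 2.1, "your": 2.0, "a": 1.7, "the": 1.6, "hello": 0.1, "cat": -1.0}
toy_probs = softmax(toy_logits)

for token_str, score in top_k(toy_probs, k=5):
    print(f"{token_str}: {score:.4f}")
```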



In [None]:
list_of_scores = [round(prediction['score'], 4) for prediction in result]
sorted_scores = sorted(list_of_scores, reverse=True)

for score in sorted_scores:
  print(score)

0.0948
0.0946
0.0673
0.0595
0.0111


In [None]:
## To build the dictionary, first decide what each entry should hold:
## extract the two values (token string and score) inside the for loop,
## then pair them. The token string is the key because rounded scores can tie.

sorted_pairs = {}

for prediction in result:
  score = round(prediction['score'], 4)
  token_str = prediction['token_str']

  # Pair them
  sorted_pairs[token_str] = score

# Sort by score, highest first
sorted_pairs = dict(sorted(sorted_pairs.items(), key=lambda item: item[1], reverse=True))

for token_str, score in sorted_pairs.items():
  print(f"Score: {score}, Token String: {token_str} ")


Score: 0.0948, Token String: my 
Score: 0.0946, Token String: your 
Score: 0.0673, Token String: a 
Score: 0.0595, Token String: the 
Score: 0.0111, Token String: hello 
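
When scores and tokens are paired in a dictionary, the choice of key matters: if the rounded score is the key, two predictions that round to the same value overwrite each other and one token disappears without any error. A minimal sketch of that pitfall follows, using made-up predictions (the tokens and scores are assumptions chosen to force a tie), along with a list of tuples as one alternative that keeps every entry.

```python
# Made-up predictions: the first two scores differ only in the fifth decimal place
toy_result = [
    {"token_str": "red", "score": 0.12344},
    {"token_str": "blue", "score": 0.12341},
    {"token_str": "green", "score": 0.05},
]

# Keyed by rounded score: "blue" overwrites "red" because both round to 0.1234
by_score = {round(p["score"], 4): p["token_str"] for p in toy_result}

# A list of (score, token) pairs keeps every prediction, ties included
pairs = sorted(
    ((round(p["score"], 4), p["token_str"]) for p in toy_result),
    reverse=True,
)

print(len(by_score))  # entries that survived in the dictionary
print(len(pairs))     # entries in the list of pairs
```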


In [None]:
## Perform masked language modeling
## sfnt : Sentence from news today
sfnt = unmasker("Britain's Conservative Party [MASK] for years of scandal")

In [None]:
## Access the predictions for the sentence
## Get the score and the token string from each one
# Pair them in a dictionary
# Sort it
# Print it

sorted_sfnt = {}

for prediction in sfnt:
  score = round(prediction['score'], 4)
  token_string = prediction['token_str']

  # Pair them (token string as key, since rounded scores can tie)
  sorted_sfnt[token_string] = score

## Sort the dictionary by score, highest first
sorted_sfnt = dict(sorted(sorted_sfnt.items(), key=lambda item: item[1], reverse=True))

# Print the sorted scores
for token_string, score in sorted_sfnt.items():
  print(f"Score: {score}, Token String: {token_string}")

Score: 0.0818, Token String: mp
Score: 0.0667, Token String: leader
Score: 0.0305, Token String: mps
Score: 0.0233, Token String: chairman
Score: 0.0215, Token String: candidate


In [None]:

# Provided contextual information
context = """
Today, I haven't eaten anything.
"""

# The masked sentence
masked_sentence = "I am very [MASK]."

# Combine the masked sentence with the context
input_text = f"{context} {masked_sentence}"

# Perform masked language modeling with context
result = unmasker(input_text)

# Print the results
for prediction in result:
    print(f"Sequence: {prediction['sequence']}")
    print(f"Score: {prediction['score']:.4f}")
    print(f"Token: {prediction['token']}")
    print(f"Token String: {prediction['token_str']}")
    print()


Sequence: today, i haven't eaten anything. i am very hungry.
Score: 0.8127
Token: 7501
Token String: hungry

Sequence: today, i haven't eaten anything. i am very thirsty.
Score: 0.0834
Token: 24907
Token String: thirsty

Sequence: today, i haven't eaten anything. i am very tired.
Score: 0.0194
Token: 5458
Token String: tired

Sequence: today, i haven't eaten anything. i am very starving.
Score: 0.0158
Token: 18025
Token String: starving

Sequence: today, i haven't eaten anything. i am very sick.
Score: 0.0043
Token: 5305
Token String: sick



**Electra-base-discriminator MODEL**

In [None]:
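## electra-base-discriminator

# Initialize pipeline for fill-mask task with Electra base model
# Note: the discriminator checkpoint ships without a trained masked-LM head (see the
# warning in the output), so its predictions here are effectively random;
# google/electra-base-generator is the ELECTRA checkpoint trained for fill-mask.
unmasker = pipeline('fill-mask', model='google/electra-base-discriminator')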

# Provided contextual information
context = """
Liz Truss had been British prime minister for less than two months when she was asked in parliament why she was still there.
"I'm a fighter, not a quitter," she said. The next day, she resigned.
It was October 2022, and the Conservative Party — which had been in power for 12 years — were feasting on themselves.
Truss's tenure, which lasted just 49 days, was a disaster.
"But she didn't come out of nowhere," says veteran political journalist and broadcaster Ian Dunt.
"""

# The masked sentence
masked_sentence = "Britain's Conservative Party [MASK] for years of scandal."

# Combine the masked sentence with the context
input_text = f"{context} {masked_sentence}"

# Perform masked language modeling with context
result = unmasker(input_text)

# Print the results
for prediction in result:
    print(f"Sequence: {prediction['sequence']}")
    print(f"Score: {prediction['score']:.4f}")
    print(f"Token: {prediction['token']}")
    print(f"Token String: {prediction['token_str']}")
    print()


Some weights of ElectraForMaskedLM were not initialized from the model checkpoint at google/electra-base-discriminator and are newly initialized: ['generator_lm_head.bias', 'generator_predictions.LayerNorm.bias', 'generator_predictions.LayerNorm.weight', 'generator_predictions.dense.bias', 'generator_predictions.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Sequence: liz truss had been british prime minister for less than two months when she was asked in parliament why she was still there. " i'm a fighter, not a quitter, " she said. the next day, she resigned. it was october 2022, and the conservative party — which had been in power for 12 years — were feasting on themselves. truss's tenure, which lasted just 49 days, was a disaster. " but she didn't come out of nowhere, " says veteran political journalist and broadcaster ian dunt. britain's conservative party for years of scandal.
Score: 0.0196
Token: 0
Token String: [PAD]

Sequence: liz truss had been british prime minister for less than two months when she was asked in parliament why she was still there. " i'm a fighter, not a quitter, " she said. the next day, she resigned. it was october 2022, and the conservative party — which had been in power for 12 years — were feasting on themselves. truss's tenure, which lasted just 49 days, was a disaster. " but she didn't come out of nowhere,

In [None]:
from transformers import GPTNeoForCausalLM, GPT2Tokenizer

# Load pretrained GPT-Neo model and tokenizer
model = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B")
tokenizer = GPT2Tokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")

# Input prompt
prompt = (
    "In a shocking finding, scientists discovered a herd of unicorns living in a remote, "
    "previously unexplored valley, in the Andes Mountains. Even more surprising to the "
    "researchers was the fact that the unicorns spoke perfect English."
)

# Tokenize the prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Generate text using the model
gen_tokens = model.generate(
    input_ids,
    do_sample=True,
    temperature=0.9,
    max_length=100,
)

# Decode generated tokens into text
gen_text = tokenizer.batch_decode(gen_tokens)[0]

# Print the generated text
print(gen_text)


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In a shocking finding, scientists discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.

This discovery has also sparked an international controversy.



The scientists who discovered the unicorns believed that the valley might contain a secret, yet to be uncovered, lake of water with a rare plant called Lavinia bicolora - the ‘gold
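
With `do_sample=True` the next token is drawn at random from the model's distribution instead of being chosen greedily, which is why each run produces a different continuation. The `temperature` value divides the logits before the softmax: values below 1 sharpen the distribution and values above 1 flatten it. The sketch below shows that effect on three hypothetical logits (the numbers are assumptions, not GPT-Neo output).

```python
import math

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature, then normalize into probabilities."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]

toy_logits = [3.0, 2.0, 0.5]  # hypothetical scores for three candidate tokens

for temperature in (0.5, 0.9, 1.5):
    probs = softmax_with_temperature(toy_logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

The lower the temperature, the more probability the top candidate receives, so sampled text becomes more predictable.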


In [4]:
from transformers import GPTNeoForCausalLM, GPT2Tokenizer

# Load pretrained GPT-Neo model and tokenizer
model = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B")
tokenizer = GPT2Tokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")

# Input prompt
prompt = (
    "Copa América action continued Friday on FS1 with Group A runner-up Canada pulling a 1(4)-1(3) upset on penalty kicks against Group B winner Venezuela facing in the quarterfinals at AT&T Stadium in Arlington, Texas"
    "Defending Copa América champion Argentina, who advanced past Ecuador in penalties on Thursday, awaits Canada in the semifinals for a rematch of the tournament opener, a 2-0 victory for Lionel Messi's squad on June 20."
)



# Tokenize the prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Generate text using the model
gen_tokens = model.generate(
    input_ids,
    do_sample=True,
    temperature=0.9,
    max_length=100,
)

# Decode generated tokens into text
gen_text = tokenizer.batch_decode(gen_tokens)[0]

# Print the generated text
print(gen_text)


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Copa América action continued Friday on FS1 with Group A runner-up Canada pulling a 1(4)-1(3) upset on penalty kicks against Group B winner Venezuela facing in the quarterfinals at AT&T Stadium in Arlington, TexasDefending Copa América champion Argentina, who advanced past Ecuador in penalties on Thursday, awaits Canada in the semifinals for a rematch of the tournament opener, a 2-0 victory for Lionel Messi's squad on June 20.The victory was
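
`max_length=100` caps the total sequence length, prompt tokens included, so a long prompt leaves little room for new text. That is consistent with the output above, which stops after "The victory was". The `max_new_tokens` argument of `generate` budgets only the generated part instead. The toy loop below is a stand-in for the real generation loop, and the prompt length of 97 tokens is a made-up figure rather than a measured one.

```python
def toy_generate(prompt_tokens, max_length=None, max_new_tokens=None):
    """Toy stand-in for a generation loop; appends placeholder tokens."""
    tokens = list(prompt_tokens)
    if max_new_tokens is not None:
        limit = len(tokens) + max_new_tokens  # budget counts new tokens only
    else:
        limit = max_length                    # budget counts prompt + new tokens
    step = 0
    while len(tokens) < limit:
        tokens.append(f"<new{step}>")
        step += 1
    return tokens

prompt = ["tok"] * 97  # hypothetical prompt length, close to the cap

capped_total = toy_generate(prompt, max_length=100)
capped_new = toy_generate(prompt, max_new_tokens=50)

print(len(capped_total) - len(prompt))  # new tokens under max_length=100
print(len(capped_new) - len(prompt))    # new tokens under max_new_tokens=50
```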


In [25]:
text = "Copa América action continued Friday on FS1 with Group A runner-up Canada pulling a 1(4)-1(3) upset on penalty kicks against Group B winner Venezuela facing in the quarterfinals at AT&T Stadium in Arlington, TexasDefending Copa América champion Argentina, who advanced past Ecuador in penalties on Thursday, awaits Canada in the semifinals for a rematch of the tournament opener, a 2-0 victory for Lionel Messi's squad on June 20.The victory was"

## To work with the words in a string, first split it; split() returns a list of words
## that can then be indexed or counted

words = text.split()

dict_of_words = {}

# Iterate through the words and map each 1-based position to its word
for index, word in enumerate(words, start=1):
  dict_of_words[index] = word

dict_of_words

{1: 'Copa',
 2: 'América',
 3: 'action',
 4: 'continued',
 5: 'Friday',
 6: 'on',
 7: 'FS1',
 8: 'with',
 9: 'Group',
 10: 'A',
 11: 'runner-up',
 12: 'Canada',
 13: 'pulling',
 14: 'a',
 15: '1(4)-1(3)',
 16: 'upset',
 17: 'on',
 18: 'penalty',
 19: 'kicks',
 20: 'against',
 21: 'Group',
 22: 'B',
 23: 'winner',
 24: 'Venezuela',
 25: 'facing',
 26: 'in',
 27: 'the',
 28: 'quarterfinals',
 29: 'at',
 30: 'AT&T',
 31: 'Stadium',
 32: 'in',
 33: 'Arlington,',
 34: 'TexasDefending',
 35: 'Copa',
 36: 'América',
 37: 'champion',
 38: 'Argentina,',
 39: 'who',
 40: 'advanced',
 41: 'past',
 42: 'Ecuador',
 43: 'in',
 44: 'penalties',
 45: 'on',
 46: 'Thursday,',
 47: 'awaits',
 48: 'Canada',
 49: 'in',
 50: 'the',
 51: 'semifinals',
 52: 'for',
 53: 'a',
 54: 'rematch',
 55: 'of',
 56: 'the',
 57: 'tournament',
 58: 'opener,',
 59: 'a',
 60: '2-0',
 61: 'victory',
 62: 'for',
 63: 'Lionel',
 64: "Messi's",
 65: 'squad',
 66: 'on',
 67: 'June',
 68: '20.The',
 69: 'victory',
 70: 'was'}
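
The dictionary above maps positions to words. To count how often each word occurs instead, `collections.Counter` from the standard library is one option. A minimal sketch on a short excerpt of the text follows; lowercasing before the split is an assumption about how the words should be normalized.

```python
from collections import Counter

# Short excerpt of the generated text, used here to keep the example small
sample = "awaits Canada in the semifinals for a rematch of the tournament opener"

# Lowercase first so that "The" and "the" would be counted together
word_counts = Counter(sample.lower().split())

# Most frequent words first
for word, count in word_counts.most_common(3):
    print(word, count)
```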

In [14]:
print(words[0])
# print(word_count[0])

Copa


In [1]:
# Step 1: Install transformers library
!pip install transformers

# Step 2: Import necessary modules
from transformers import pipeline

# Step 3: Initialize question-answering pipeline
qa_pipeline = pipeline("question-answering")

# Step 4: Define the context and question
context = """
Transformers have revolutionized the field of natural language processing by introducing mechanisms for
handling long-range dependencies and parallel processing.
"""
question = "What have transformers revolutionized?"

# Step 5: Perform question answering
result = qa_pipeline(question=question, context=context)

# Step 6: Print the result
print(result)




No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'score': 0.5600082874298096, 'start': 47, 'end': 74, 'answer': 'natural language processing'}
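
In the result, `start` and `end` are character offsets into the `context` string, so slicing the context with them recovers the answer text. The sketch below checks this with the values copied from the output above; no model is needed to run it.

```python
# Same context string as in the cell above (it begins with a newline character)
context = """
Transformers have revolutionized the field of natural language processing by introducing mechanisms for
handling long-range dependencies and parallel processing.
"""

# Result dictionary copied from the printed output
result = {'score': 0.5600082874298096, 'start': 47, 'end': 74, 'answer': 'natural language processing'}

# Slice the context with the reported character offsets
span = context[result['start']:result['end']]
print(span)
print(span == result['answer'])
```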
