<a href="https://colab.research.google.com/github/gupta24789/hugging-face/blob/main/text_generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### Challenges in Generating Coherent Text
- Repetition: The model can repeat itself, generating the same text over and over again.
- Limited Vocabulary: The model can use the same words and phrases over and over again.
- Lack of Context: The model can generate sentences that are grammatically correct, but lack context and meaning.


#### How to Generate Text

- *Greedy Search:* The model generates the word with the highest probability as the next word.
- *Beam Search:* At each step, the model keeps the 𝑘 most probable partial sequences (beams), extends each of them by one word, and again keeps the best 𝑘. The complete sequence with the highest overall probability is returned.
- *Top-K Sampling:* The model restricts the choice to the 𝑘 most probable words and samples from them using their probabilities as weights.
- *Top-p (nucleus) Sampling:* The model takes the smallest set of words whose cumulative probability reaches or exceeds 𝑝, then samples from that set using the probabilities as weights.
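
The candidate sets described above can be sketched on a toy distribution. This is a minimal illustration in plain Python, not the `transformers` implementation; the `probs` values and the helper names `greedy`, `top_k_candidates` and `top_p_candidates` are made up for this example.

```python
# Toy next-word distribution (hypothetical values, not produced by any real model).
probs = {"the": 0.40, "a": 0.25, "data": 0.15, "science": 0.12, "zebra": 0.08}


def ranked(probs):
    """Return (word, probability) pairs sorted from most to least probable."""
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)


def renormalize(pairs):
    """Rescale the kept probabilities so they sum to 1 again."""
    total = sum(p for _, p in pairs)
    return {word: p / total for word, p in pairs}


def greedy(probs):
    """Greedy search: always pick the single most probable word."""
    return max(probs, key=probs.get)


def top_k_candidates(probs, k):
    """Top-k: keep only the k most probable words."""
    return renormalize(ranked(probs)[:k])


def top_p_candidates(probs, p):
    """Top-p: keep the smallest ranked prefix whose cumulative probability reaches p."""
    kept, cumulative = [], 0.0
    for word, prob in ranked(probs):
        kept.append((word, prob))
        cumulative += prob
        if cumulative >= p:
            break
    return renormalize(kept)


print(greedy(probs))
print(top_k_candidates(probs, k=2))
print(top_p_candidates(probs, p=0.75))
```

Sampling then draws one word from the renormalized candidate set, so a larger `k` or `p` admits more unlikely words and makes the output more varied.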

In [None]:
# !pip3 install -q -U bitsandbytes==0.42.0
# !pip3 install -q -U peft==0.8.2
# !pip3 install -q -U trl==0.7.10
# !pip3 install -q -U accelerate==0.27.1
# !pip3 install -q -U datasets==2.17.0
# !pip3 install -q -U transformers==4.38.0

In [None]:
import os
from dotenv import load_dotenv
load_dotenv()

os.environ['CUDA_VISIBLE_DEVICES'] = "0"

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from pprint import pprint

In [None]:
torch.cuda.device_count()

1

## Check GPU

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cuda'

## Load Tokenizer & model

In [None]:
model_name = "google/gemma-2b"
tokenizer = AutoTokenizer.from_pretrained(model_name, token=os.environ['HF_TOKEN'])
model = AutoModelForCausalLM.from_pretrained(model_name, token=os.environ['HF_TOKEN'], device_map = device)


## Generate Text

In [None]:
input_text = "Can you please prepare the step by step roadmap to learn data science"

In [None]:
## encode input text
inputs = tokenizer(input_text, return_tensors='pt').to(device)
print(inputs)

{'input_ids': tensor([[    2,  3611,   692,  3743, 12523,   573,  4065,   731,  4065, 96774,
           577,  3918,  1423,  8042]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}


## Beam Search

In [None]:
max_length = 128
output = model.generate(**inputs, max_length=max_length, num_beams=3, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Can you please prepare the step by step roadmap to learn data science?

Answer:
1. Understand the basics of data science. 2. Learn how to collect and analyze data. 3. Learn how to use data to make predictions. 4. Learn how to use data to make decisions.
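
To see why `num_beams` greater than 1 can beat greedy search, here is a small sketch of beam search over a made-up bigram table. `TOY_MODEL` and `beam_search` are hypothetical names for this illustration only; the real `generate` also handles end-of-sequence tokens, length penalties and batching, which this sketch leaves out.

```python
import math

# Hypothetical toy "model": next-token probabilities depend only on the previous token.
TOY_MODEL = {
    "<s>": {"A": 0.6, "B": 0.4},
    "A": {"x": 0.55, "y": 0.45},
    "B": {"z": 0.9, "w": 0.1},
}


def beam_search(start, steps, num_beams):
    """Return the best (tokens, log_probability) found with `num_beams` beams."""
    beams = [([start], 0.0)]  # each beam is (tokens so far, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for tokens, score in beams:
            # Extend every surviving beam with every possible next token.
            for token, prob in TOY_MODEL.get(tokens[-1], {}).items():
                candidates.append((tokens + [token], score + math.log(prob)))
        if not candidates:
            break
        # Keep only the `num_beams` highest-scoring sequences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams[0]


greedy_tokens, greedy_score = beam_search("<s>", steps=2, num_beams=1)
beam_tokens, beam_score = beam_search("<s>", steps=2, num_beams=2)
print(greedy_tokens, math.exp(greedy_score))
print(beam_tokens, math.exp(beam_score))
```

With one beam the search commits to `A` because it is the likeliest first token, while two beams keep `B` alive long enough to find the sequence with the higher overall probability.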


## Beam Search with no repeat


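- no_repeat_ngram_size : Forbids any n-gram of this size from appearing more than once in the generated text. With a value of 2, no pair of consecutive tokens is repeated.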

In [None]:
max_length = 128
output = model.generate(**inputs, max_length=max_length, num_beams=3, do_sample=False, no_repeat_ngram_size = 2)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Can you please prepare the step by step roadmap to learn data science?

Answer:

Step 1/10
1. Understand the basics of statistics and data analysis: This includes learning about variables, data types, descriptive statistics, probability distributions, hypothesis testing, and inferential statistics. It is important to understand these concepts before moving on to more advanced topics. Resources: - Statistics for Business and Economics by Douglas C. Montgomery - An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani - Hands-on Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron




## Sampling Method


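- do_sample : If True, the next token is sampled from the distribution instead of being chosen deterministically
    - top_p : nucleus (top-p) sampling, which keeps the smallest set of tokens whose cumulative probability reaches top_p
    - top_k : top-k sampling, which keeps only the top_k most probable tokens


- no_repeat_ngram_size : Forbids any n-gram of this size from appearing more than once in the generated text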
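
As a rough picture of what happens at each step when `do_sample=True`, the sketch below filters a toy distribution and then draws one token at random. It is a simplification in plain Python: the library works on logits and tensors, and `sample_next_token` and the `toy_probs` values are hypothetical names and numbers for this example.

```python
import random

# Hypothetical next-token probabilities, not taken from any real model.
toy_probs = {"the": 0.40, "a": 0.25, "data": 0.15, "science": 0.12, "zebra": 0.08}


def sample_next_token(probs, top_k=None, top_p=None, rng=random):
    """Filter the distribution with top_k and/or top_p, then draw one token."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]
        total = sum(p for _, p in ranked)
        ranked = [(tok, p / total) for tok, p in ranked]  # renormalize after top-k
    if top_p is not None:
        kept, cumulative = [], 0.0
        for tok, p in ranked:
            kept.append((tok, p))
            cumulative += p
            if cumulative >= top_p:
                break
        ranked = kept
    tokens = [tok for tok, _ in ranked]
    weights = [p for _, p in ranked]
    # random.choices accepts unnormalized weights, so no final rescaling is needed.
    return rng.choices(tokens, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so repeated runs give the same draws
print([sample_next_token(toy_probs, top_k=3, rng=rng) for _ in range(5)])
print([sample_next_token(toy_probs, top_p=0.5, rng=rng) for _ in range(5)])
```

Because each step is a random draw, two calls to `generate` with `do_sample=True` will usually produce different text unless the random seed is fixed.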

In [None]:
## Top-k Sampling
max_length = 128
output = model.generate(**inputs, max_length=max_length, do_sample=True, top_k = 100)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Can you please prepare the step by step roadmap to learn data science? My aim is to take this knowledge and to help poor Indian youth to reduce their unemployment ratio.

I have been working in that field as a data scientist since 2012 in Bangalore & Delhi. I want to take steps to increase the awareness about data science and machine learning among the people

Hi. I recently started learning data science with the intention to explore ways of automating data collection to help scientists create better ML pipelines. I will be working to help others get started as well in ML and data science. I also plan to learn ways to help automate


In [None]:
## Nucleus Sampling
max_length = 128
output = model.generate(**inputs, max_length=max_length, do_sample=True, top_p = .8)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Can you please prepare the step by step roadmap to learn data science for beginners with 100% guaranteed placement in top MNCs ?

I have cleared all the interviews but unable to get the job as the salary is not in my range . I am asking you for a solution

I am currently doing a research project on "Data Science" but my professor doesn't know much about it.

I have been looking for the right career for myself for a while now, and data science and data analytics is something that has been of utmost interest to me. I am currently a college student, but I have been researching the


In [None]:
max_length = 128
output = model.generate(**inputs, max_length=max_length, do_sample=True, top_p = 0.7, no_repeat_ngram_size = 2)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Can you please prepare the step by step roadmap to learn data science.

Can anyone please share the data analytics training material, and the way to prepare for interviews for Data Science roles. I have completed a PG in Data science and have been practicing a lot for the interviews. Can anyone help me with the right way of preparing for it. Any online resources for practice are appreciated. Thanks in advance.


## Model with different precision

In [None]:
## torch_dtype can be torch.float16 or torch.bfloat16 to load the weights in half precision
model_name = "google/gemma-2b"
tokenizer = AutoTokenizer.from_pretrained(model_name, token=os.environ['HF_TOKEN'])
model = AutoModelForCausalLM.from_pretrained(model_name, token=os.environ['HF_TOKEN'], device_map = device, torch_dtype=torch.bfloat16 )


In [None]:
max_length = 512
output = model.generate(**inputs, max_length=max_length, do_sample=True, top_p = 0.9, no_repeat_ngram_size = 2)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Can you please prepare the step by step roadmap to learn data science?

Answer:
1. Identify your goals and define your data analysis tasks.
2. Collect and prepare your dataset. 3. Analyze your collected data.
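
The main reason to lower the precision is memory. A back-of-the-envelope estimate for the weights alone is parameter count times bytes per parameter. The sketch below assumes roughly 2.5 billion parameters for `google/gemma-2b`, which is an approximation, and ignores activations and the KV cache; `weight_memory_gib` is a hypothetical helper.

```python
# Bytes needed to store one parameter in each dtype.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}

# Assumption: approximate parameter count, used only for illustration.
ASSUMED_PARAMS = 2_500_000_000


def weight_memory_gib(num_params, dtype):
    """Estimate the memory taken by the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gib(ASSUMED_PARAMS, dtype):.2f} GiB")
```

On a loaded model, `model.get_memory_footprint()` should report the actual size in bytes, which is a more reliable figure than this estimate.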


## Quantized Versions through bitsandbytes

In [None]:
## Load the model in quantized form using bitsandbytes (8-bit here; set load_in_4bit=True instead for 4-bit)
quantization_config = BitsAndBytesConfig(load_in_8bit=True, load_in_4bit = False)

model_name = "google/gemma-2b"
tokenizer = AutoTokenizer.from_pretrained(model_name, token=os.environ['HF_TOKEN'])
model = AutoModelForCausalLM.from_pretrained(model_name, token=os.environ['HF_TOKEN'], device_map = device, quantization_config = quantization_config )


In [None]:
max_length = 512
output = model.generate(**inputs, max_length=max_length, do_sample=True, top_p = 0.6, no_repeat_ngram_size = 2)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Can you please prepare the step by step roadmap to learn data science?

Answer:

Step 1/6
1. Start with a basic understanding of mathematics and statistics.

 Step 2. Learn about programming languages such as Python, R, or Java. Step3. Get familiar with machine learning algorithms and techniques.Step4. Practice data cleaning and data wrangling.
Step5. Build a data model and visualize data.
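
For intuition about what 8-bit quantization does to the weights, here is a minimal sketch of absmax quantization in plain Python. It is a deliberately simplified illustration: the 8-bit path in bitsandbytes is more involved than this, and `quantize_absmax_int8` and `dequantize` are hypothetical helpers, not library functions.

```python
def quantize_absmax_int8(values):
    """Map floats to integers in [-127, 127] using a single scale factor."""
    scale = max(abs(v) for v in values) / 127 or 1.0  # fall back to 1.0 if all values are zero
    quantized = [round(v / scale) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.03, 1.0]  # made-up weight values
quantized, scale = quantize_absmax_int8(weights)
restored = dequantize(quantized, scale)
print(quantized)
print(restored)
```

Each weight is stored as one byte plus a shared scale, so the restored values are close to the originals but not identical. That small rounding error is the price paid for the lower memory use.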
