# Import Dependencies

In [1]:
import json
import argparse
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import configs
import transformers
from torch import bfloat16
import requests

## Call API

In [2]:
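import os
import requests

API_URL = "https://api-inference.huggingface.co/models/mistralai/Mistral-7B-Instruct-v0.3"
# Read the access token from the HF_TOKEN environment variable instead of hard-coding it in the notebook
headers = {"Authorization": f"Bearer {os.environ['HF_TOKEN']}"}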

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

# Generation options such as the output length or the temperature go under "parameters".
# See https://huggingface.co/docs/api-inference/detailed_parameters for the other parameters.
output = query({
    "inputs": "Who are you?",
    "parameters": {"max_new_tokens": 100},
})
print(output)

[{'generated_text': "Who are you? Introduce yourself. Which other countries have you visited?\n\nHi, I'm Kate from Canada, a lovely country in North America. I'm a travel enthusiast who loves exploring new places, cultures, and meeting people from all over the world. I've had the privilege to visit several countries, including:\n\n1. United States (several states)\n2. Mexico\n3. Costa Rica\n4. Cuba\n5. Austria\n6."}]


## Use a quantized model on a local machine

Using a quantized version of a model usually results in somewhat worse performance. However, with 4-bit quantization a 7B model fits easily on a GPU with 24 GB of memory. (I use a 3090 GPU with 24 GB of memory.)
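
As a rough back-of-the-envelope check of the memory claim, the sketch below estimates the memory taken by the weights alone at different precisions. It assumes a nominal 7 billion parameters and ignores activations, the KV cache, and the small overhead of quantization constants, so real usage will be somewhat higher. The helper name `weight_memory_gib` is made up for this illustration.

```python
def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3


n_params = 7e9  # nominal "7B" (assumption); the exact parameter count differs slightly

for label, bits in [("float32", 32), ("bfloat16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_gib(n_params, bits):5.1f} GiB")
```

At 4 bits per parameter the weights need only a few GiB, which leaves most of a 24 GB card free for activations and long generations.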

In [3]:
# setup quantization config
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16,
)

In [4]:
# Check if a GPU is available and set the device accordingly
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cuda


In [5]:
model_name = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)  # quantized model

`low_cpu_mem_usage` was None, now set to True since model is quantized.



In [6]:
prompts = "Who are you?"
# Tokenize once; the result holds both input_ids and attention_mask
inputs = tokenizer(prompts, return_tensors="pt").to(device)
# print(len(inputs.input_ids[0]))

# Generate output using the model
outputs = model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,
    num_return_sequences=1,  # number of different sequences to generate
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,  # not set for this tokenizer, so generate() falls back to eos_token_id
    max_new_tokens=1000,
)

# Decode the generated output
outputs = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(outputs)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Who are you?

I am a 23-year-old graphic designer from the Netherlands. I’ve been working in the design industry for about 5 years now. I’ve worked for various design agencies and studios, but I’ve been freelancing for the past year and a half.

What do you do?

I specialize in branding, visual identity, and print design. I help businesses and individuals create a strong visual identity that sets them apart from their competitors. I design logos, business cards, brochures, websites, and other marketing materials.

What inspired you to become a graphic designer?

I’ve always been interested in art and design. In high school, I took a graphic design class and fell in love with it. I loved the idea of creating visual solutions to communicate ideas and solve problems. I went on to study graphic design in college and have been working in the industry ever since.

What do you enjoy most about being a graphic designer?

I enjoy the creative problem-solving aspect of graphic design. Every proj

## Use CPU offloading with a non-quantized model

CPU offloading keeps the model weights in CPU memory while the computation still runs on the GPU. It is useful when the model is too large to fit in GPU memory: only the layer needed for the current computation is moved to the GPU, and it is moved back to the CPU once that computation is done.
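
To make the idea concrete, here is a minimal sketch of the same move-in, compute, move-out pattern written with plain PyTorch forward hooks on a toy model. It is only an illustration of the principle, not how `accelerate` implements `cpu_offload` internally; the helper `offload_layer` and the toy model are made up for this example, and it falls back to the CPU when no GPU is available.

```python
import torch
from torch import nn


def offload_layer(layer: nn.Module, execution_device: torch.device) -> None:
    """Keep `layer` on the CPU and move it to `execution_device` only while it runs."""
    layer.to("cpu")

    def pre_hook(module, args):
        # Just before forward: bring this layer's weights and inputs to the execution device
        module.to(execution_device)
        return tuple(a.to(execution_device) if torch.is_tensor(a) else a for a in args)

    def post_hook(module, args, output):
        # Right after forward: send the weights back to the CPU, keep the output where it is
        module.to("cpu")
        return output

    layer.register_forward_pre_hook(pre_hook)
    layer.register_forward_hook(post_hook)


execution_device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Toy stand-in for a large model (hypothetical sizes)
toy_model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
for layer in toy_model:
    offload_layer(layer, execution_device)

with torch.no_grad():
    result = toy_model(torch.randn(2, 16))

print(result.shape, result.device)
print({p.device.type for p in toy_model.parameters()})  # weights are back on the CPU
```

The price of this pattern is a host-to-device copy for every layer on every forward pass, which is why generation with offloading is much slower than with a model that lives entirely on the GPU.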

In [7]:
from accelerate import cpu_offload
model_name = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model = cpu_offload(model, execution_device=device)  # keep the weights on the CPU; run each layer on `device`


In [8]:
prompts = "Who are you?"
# Tokenize once; the result holds both input_ids and attention_mask
inputs = tokenizer(prompts, return_tensors="pt").to(device)
# print(len(inputs.input_ids[0]))

# Generate output using the model
outputs = model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,
    num_return_sequences=1,  # number of different sequences to generate
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,  # not set for this tokenizer, so generate() falls back to eos_token_id
    max_new_tokens=1000,
)

# Decode the generated output
outputs = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(outputs)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Who are you?

I am a 23-year-old artist from the Netherlands. I’ve been drawing since I was a kid, but I started taking it more seriously around 18. I’ve been working as a freelance artist for a few years now.

What do you create?

I create digital illustrations, mostly of characters and creatures. I like to experiment with different styles and techniques, but I usually stick to a more detailed and colorful approach.

What inspires you?

I find inspiration in a lot of things, from nature and animals to movies, books, and other art. I also love to explore different cultures and mythologies. I think it’s important to keep an open mind and be curious about the world around you.

What is your creative process like?

My creative process usually starts with brainstorming ideas and sketching out rough concepts. I like to experiment with different compositions and color palettes to find what works best for the concept. Once I have a solid idea, I’ll start refining the details and adding textur