# Code Explanation using Transformers

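In this notebook, we use the `transformers` library to summarize and explain Python code with an instruction-tuned Code Llama model. The cells below load `codellama/CodeLlama-7b-Instruct-hf`; the larger `codellama/CodeLlama-13b-Instruct-hf` is left as a commented-out alternative.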

## Step-by-Step Instructions

### 1. Import the Required Libraries

First, we install `accelerate`, which `transformers` requires when a model is loaded with `device_map`, and then import the necessary modules from `transformers` along with `torch`.


In [1]:
!pip install accelerate



In [2]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch

In [3]:
device = "cuda"

### 2. Load the Tokenizer and Model
We load the tokenizer and model for the selected model ID, `codellama/CodeLlama-7b-Instruct-hf`, in `bfloat16` and place the weights on the GPU.



In [5]:
# model_id = "codellama/CodeLlama-13b-Instruct-hf"
# Let's take a smaller one :)
# model_id = "meta-llama/CodeLlama-7b-hf"
model_id = "codellama/CodeLlama-7b-Instruct-hf"

dtype = torch.bfloat16

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map=device,
    torch_dtype=dtype,
)
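
As a rough sanity check on whether the model fits on the GPU, the weight footprint can be estimated from the parameter count and the width of the dtype. The sketch below is a back-of-the-envelope estimate only: the parameter count of roughly 6.74 billion is an assumption based on the published size of the 7B Llama family, `estimate_weight_gb` is a hypothetical helper name, and activations and the key-value cache are not included.

```python
# Bytes used per parameter for a few common dtypes
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def estimate_weight_gb(num_params: float, dtype: str = "bfloat16") -> float:
    """Return the approximate size of the model weights in gigabytes (1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Assumed parameter count for a 7B-class model (illustrative, not read from the checkpoint)
approx_params = 6.74e9

print(f"bfloat16: {estimate_weight_gb(approx_params, 'bfloat16'):.2f} GB")
print(f"float32:  {estimate_weight_gb(approx_params, 'float32'):.2f} GB")
```

Loading in `bfloat16` halves the weight memory relative to `float32`, which is why the notebook sets `torch_dtype=dtype`.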


### 3. Define the Function for Code Explanation
We define a helper, `summarize_code`, that takes a fully formatted prompt (the instructions plus the code to explain) and a cap on the number of generated tokens, and returns the model's answer with the prompt stripped off.

In [6]:
# Generate a response for a fully formatted prompt and return only the model's answer
def summarize_code(max_new_tokens: int, prompt: str) -> str:
    # Tokenize the prompt and move the tensors to the same device as the model
    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    # Generate at most max_new_tokens tokens; the output holds the prompt followed by the new tokens
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Decode, then keep only the text after the final [/INST] marker
    summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return summary.split("[/INST]")[-1].strip()
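
The final string handling matters because `generate` returns the prompt tokens followed by the newly generated ones, so the decoded text still begins with the instruction. The sketch below reproduces only that post-processing step on a hand-written stand-in for the decoded output, so it runs without a model. The `decoded` text is made up for illustration and `extract_answer` is a hypothetical helper name.

```python
def extract_answer(decoded: str, marker: str = "[/INST]") -> str:
    """Return the text that follows the last occurrence of the marker."""
    # split() yields everything before and after each marker; [-1] is the tail
    return decoded.split(marker)[-1].strip()


# Stand-in for tokenizer.decode(...): the prompt is echoed before the answer
decoded = "[INST] Explain this code: print('hi') [/INST]  It prints the text hi.  "

print(extract_answer(decoded))
```

If the marker is absent, the whole string is returned unchanged apart from the stripped whitespace, so a missing `[/INST]` in the prompt shows up as an echoed prompt rather than as an error.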


### Example Code
Let's consider an example Python script that trains an LDA topic model with Gensim. We will use our function to summarize and explain this code.
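
The example script in the next cell assumes a `corpus` variable that already holds tokenized documents, that is, a list of token lists. For readers unfamiliar with that shape, here is a minimal, standard-library-only sketch of one way to produce it. The sample documents, the tiny stop-word set, and the `tokenize` helper are all illustrative assumptions; a real pipeline would typically use a fuller stop-word list and possibly lemmatization.

```python
import re

# Deliberately small stop-word set, for illustration only
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "is", "in", "on"}


def tokenize(text: str) -> list[str]:
    """Lowercase the text, keep alphabetic words, and drop stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [word for word in words if word not in STOP_WORDS]


# Made-up documents standing in for real text data
raw_documents = [
    "The cat sat on the mat.",
    "Dogs and cats are popular pets.",
]

# One list of tokens per document, the shape the example script expects
corpus = [tokenize(doc) for doc in raw_documents]
print(corpus)
```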

In [9]:
code_example = """
import gensim
from gensim import corpora
from gensim.models import LdaModel
from pprint import pprint

# Assuming 'corpus' is a list of documents, where each document is a list of tokens
# If you don't have this, you'll need to preprocess your text data first

# Create a dictionary from the corpus
dictionary = corpora.Dictionary(corpus)

# Create a document-term matrix
doc_term_matrix = [dictionary.doc2bow(doc) for doc in corpus]

# Set parameters for the LDA model
num_topics = 5  # Number of topics to extract
passes = 20     # Number of passes through the corpus during training
iterations = 400  # Number of iterations in each pass

# Train the LDA model
lda_model = LdaModel(corpus=doc_term_matrix,
                     id2word=dictionary,
                     num_topics=num_topics,
                     random_state=100,
                     update_every=1,
                     chunksize=100,
                     passes=passes,
                     iterations=iterations,
                     alpha='auto',
                     per_word_topics=True)

# Print the topics
print("The topics extracted from the corpus are:")
pprint(lda_model.print_topics())

# Print the top 10 words for each topic
print("\nTop 10 words for each topic:")
for idx, topic in lda_model.print_topics(-1):
    print(f"Topic: {idx}")
    print(", ".join([word.split("*")[1].strip().replace('"', '') for word in topic.split("+")[:10]]))
    print()

# To see the dominant topic for each document
print("Dominant topic for each document:")
for i, doc in enumerate(doc_term_matrix):
    top_topics = lda_model.get_document_topics(doc, minimum_probability=0.0)
    top_topic = max(top_topics, key=lambda x: x[1])
    print(f"Document {i}: Topic {top_topic[0]} (Probability: {top_topic[1]:.2f})")
"""

### Examples

In [10]:
prompt = f"""
[INST] <<SYS>>
You are an expert in Python Programming. Below is a line of python code that describes a task.
Return only one line of summary that appropriately describes the task that the code is
performing. You must write only summary without any prefix or suffix explanations.
Note: The summary should have minimum 1 words and can have on an average 25 words. Your reader is not an expert in python
<</SYS>>
{code_example} [/INST]
"""

print(f'Generating explanation of code.')
# Get the summary of the code
summary = summarize_code(max_new_tokens=256, prompt=prompt)

print('-'*50)
print(summary)


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Generating explanation of code.
--------------------------------------------------
The code is performing topic modeling on a corpus of text documents using the Latent Dirichlet Allocation (LDA) algorithm. The code first creates a dictionary from the corpus, which is a list of documents, where each document is a list of tokens. The code then creates a document-term matrix, which is a matrix where each row represents a document and each column represents a term in the vocabulary.

The code then sets the parameters for the LDA model, including the number of topics to extract, the number of passes through the corpus during training, and the number of iterations in each pass. The code then trains the LDA model using the corpus and the document-term matrix.

Finally, the code prints the topics extracted from the corpus and the top 10 words for each topic, as well as the dominant topic for each document.


The next cells reuse the same code with different instructions to show how the prompt shapes the explanation.
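
The prompts in this notebook all share one shape: a system instruction, then the code, wrapped in instruction markers. A small helper makes it easier to vary only the system text. This is a sketch that assumes the Llama 2 style chat layout with `<<SYS>>` markers used by the Code Llama Instruct models; `build_prompt` is a hypothetical helper, not part of `transformers`.

```python
def build_prompt(system: str, code: str) -> str:
    """Wrap a system instruction and a code snippet in Llama 2 style chat markers."""
    return f"[INST] <<SYS>>\n{system.strip()}\n<</SYS>>\n\n{code.strip()} [/INST]"


# Illustrative inputs; any system instruction and code string can be used
system = "You are an expert in Python programming. Explain the code step by step."
code = "print('hello')"

prompt = build_prompt(system, code)
print(prompt)
```

The resulting string can be passed to `summarize_code` as the `prompt` argument. If the tokenizer ships a chat template, `tokenizer.apply_chat_template` is the library's own way to build such prompts from a list of messages.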


In [11]:
prompt = f"""
[INST] <<SYS>>
You are an expert in Python programming. Below is a Python code that describes a task.
Explain the code step by step with details about the implementation.
<</SYS>>
{code_example} [/INST]
"""

print(f'Generating explanation of code.')
# Get the summary of the code
summary = summarize_code(max_new_tokens=256, prompt=prompt)

print('-'*50)
print(summary)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Generating explanation of code.
--------------------------------------------------
This code is using the Gensim library to perform Latent Dirichlet Allocation (LDA) on a corpus of text documents. The code first creates a dictionary from the corpus, which is a list of documents, where each document is a list of tokens. The dictionary is used to create a document-term matrix, which is a matrix where each row represents a document and each column represents a term (word) in the corpus.

The code then sets the parameters for the LDA model, including the number of topics to extract (num_topics), the number of passes through the corpus during training (passes), the number of iterations in each pass (iterations), and the random state (100). The LDA model is then trained on the corpus using the `LdaModel` class from Gensim.

Once the model is trained, the code prints the topics extracted from the corpus using the `print_topics` method of the `LdaModel` class. The `print_topics` method takes a
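
The explanation above stops mid-sentence, which is what one would expect when generation reaches the `max_new_tokens=256` limit before the model emits its end-of-sequence token. One way to detect that case is to compare the number of generated tokens with the limit and check the final token. The sketch below works on plain lists of token ids so it runs without a model; `hit_token_limit` is a hypothetical helper, the ids are made up, and the end-of-sequence id of 2 is taken from the `eos_token_id`:2 log message in this notebook.

```python
def hit_token_limit(output_ids, prompt_length, max_new_tokens, eos_token_id):
    """Return True if generation used the full token budget without ending on EOS."""
    # Everything after the prompt is newly generated
    new_tokens = list(output_ids[prompt_length:])
    if not new_tokens:
        return False
    return len(new_tokens) >= max_new_tokens and new_tokens[-1] != eos_token_id


EOS_ID = 2  # matches the id reported in the generation log

# Made-up ids: a 3-token prompt followed by 4 new tokens, with no EOS at the end
cut_off = [1, 10, 11, 20, 21, 22, 23]
# Same prompt, but the model finished early with EOS
finished = [1, 10, 11, 20, 21, EOS_ID]

print(hit_token_limit(cut_off, prompt_length=3, max_new_tokens=4, eos_token_id=EOS_ID))
print(hit_token_limit(finished, prompt_length=3, max_new_tokens=4, eos_token_id=EOS_ID))
```

With the real model, `prompt_length` would be `inputs["input_ids"].shape[1]` and `output_ids` would be `outputs[0].tolist()`. Raising `max_new_tokens` is the straightforward remedy when the check returns `True`.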

In [12]:
prompt = f"""
[INST] <<SYS>>
You are an expert in Python programming. Below is a Python code that describes a task.
Explain the code step by step with details about the implementation in a friendly and chatty way!
<</SYS>>
{code_example} [/INST]
"""

print(f'Generating explanation of code.')
# Get the summary of the code
summary = summarize_code(max_new_tokens=256, prompt=prompt)

print('-'*50)
print(summary)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Generating explanation of code.
--------------------------------------------------
This code is using the Gensim library to perform Latent Dirichlet Allocation (LDA) on a corpus of text documents. The goal of LDA is to extract topics from the corpus, which are groups of words that are co-occurring in the documents.

Here's a step-by-step explanation of the code:

1. The first step is to import the necessary libraries: Gensim, corpora, and pprint. Gensim is a Python library for natural language processing, and corpora is a module within Gensim that provides functions for working with text corpora. pprint is a module that allows you to pretty-print Python objects.
2. The next step is to create a dictionary from the corpus. This is done using the `corpora.Dictionary` function, which takes a list of documents as input and returns a dictionary that maps each word in the corpus to its frequency in the corpus.
3. The next step is to create a document-term matrix, which is a matrix where each 

In [14]:
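# Unload the current model from the GPU before loading another one
import gc

# pop() instead of del, so re-running this cell does not raise NameError
globals().pop("tokenizer", None)
globals().pop("model", None)

# Collect the released objects, then return cached GPU memory to the driver
gc.collect()
torch.cuda.empty_cache()
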

In [17]:
# model_id = "meta-llama/Meta-Llama-3-8B"
# model_id = "meta-llama/Llama-2-7b-hf"
model_id = "meta-llama/Llama-2-7b-chat-hf"

dtype = torch.bfloat16

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map=device,
    torch_dtype=dtype,
)


In [None]:
prompt = f"""
[INST] <<SYS>>
You are an expert in Python programming. Below is a Python code that describes a task.
Explain the code step by step with details about the implementation in a friendly and chatty way!
<</SYS>>
{code_example} [/INST]
"""

print(f'Generating explanation of code.')
# Get the summary of the code
summary = summarize_code(max_new_tokens=256, prompt=prompt)

print('-'*50)
print(summary)

Generating explanation of code.
--------------------------------------------------

