# Leveraging LLMs for Code Analysis

In [1]:
%load_ext jupyter_ai

In [3]:
%%ai ollama:deepseek-coder-v2

How do I create an array in Python?

To create an array in Python, you can use the `array` module or simply a list. Here’s how to create and print both:

```python
import array as arr

# Creating an array of integers
int_array = arr.array('i', [1, 2, 3, 4, 5])  # 'i' denotes integer type

# Printing the array
print(int_array)

# Alternatively, using a list
python_list = [1, 2, 3, 4, 5]

# Printing the list
print(python_list)
```

Output:
```
array('i', [1, 2, 3, 4, 5])
[1, 2, 3, 4, 5]
```

In [4]:
import array as arr


# Creating an array of integers
int_array = arr.array('i', [1, 2, 3, 4, 5])  # 'i' denotes integer type

# Printing the array
print(int_array)

# Alternatively, using a list
python_list = [1, 2, 3, 4, 5]

# Printing the list
print(python_list)

array('i', [1, 2, 3, 4, 5])
[1, 2, 3, 4, 5]


In [6]:
%%ai ollama:deepseek-coder-v2

You are an expert Python programmer. Explain the following Python code delimited by 
``` to a more junior developer.

```
import array as arr


# Creating an array of integers
int_array = arr.array('i', [1, 2, 3, 4, 5])  # 'i' denotes integer type

# Printing the array
print(int_array)

# Alternatively, using a list
python_list = [1, 2, 3, 4, 5]

# Printing the list
print(python_list)
```

 Sure! Let's break down this Python code for a more junior developer:

### Code Explanation

1. **Importing the Array Module**
   ```python
   import array as arr
   ```
   - Here, we are importing the `array` module and giving it an alias `arr`. This allows us to use functions or classes from this module more conveniently later in our code.

2. **Creating an Array of Integers**
   ```python
   int_array = arr.array('i', [1, 2, 3, 4, 5])  # 'i' denotes integer type
   ```
   - We are creating an array specifically for integers (`'i'` stands for integer).
   - The `arr.array('i', [1, 2, 3, 4, 5])` part means we are initializing an array with the name `int_array`, and it contains the elements `[1, 2, 3, 4, 5]`.

3. **Printing the Array**
   ```python
   print(int_array)
   ```
   - This line prints the contents of the array to the console. The output will be: `array('i', [1, 2, 3, 4, 5])`.

4. **Creating a List Using Python's Built-in Syntax**
   ```python
   python_list = [1, 2, 3, 4, 5]
   ```
   - Here, we are creating a list named `python_list` that contains the same integers as our array. The syntax for lists in Python is square brackets `[]`.

5. **Printing the List**
   ```python
   print(python_list)
   ```
   - This line prints the contents of the list to the console. The output will be: `[1, 2, 3, 4, 5]`.

### Summary
- We used the `array` module to create an array of integers, and Python's built-in list syntax to create a list. Both are collections that can hold multiple items.
- Arrays in Python's `array` module are more specific about the type of elements they contain (e.g., only integers), while lists are more flexible and can contain any data types.
- We printed both the array and the list to verify their contents.

### Output
When you run this code, it will produce the following output:
```
array('i', [1, 2, 3, 4, 5])
[1, 2, 3, 4, 5]
```

This should give a clear understanding of how arrays and lists are created and used in Python.
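
The difference in element typing described in the summary can be checked directly. The following is a minimal sketch; the flag name `array_accepts_str` is illustrative and not part of the original code.

```python
import array as arr

int_array = arr.array('i', [1, 2, 3])   # typed: every element is stored as a C signed int
python_list = [1, 2, 3]                 # untyped: elements can be any Python object

# A list accepts a value of a different type
python_list.append("four")

# The array rejects anything that is not an integer
try:
    int_array.append("four")
    array_accepts_str = True
except TypeError:
    array_accepts_str = False

print(array_accepts_str)
# The type code and the per-element size in bytes (the size is platform dependent)
print(int_array.typecode, int_array.itemsize)
```

Because every element has the same fixed size, an `array` stores its values more compactly than a list of the same integers, at the cost of flexibility.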

In [7]:
%%ai ollama:deepseek-coder-v2

What happens if I try to access data in an empty Python array?

 ```python
# Create an empty list
my_array = []

# Attempt to access the first element of the array
try:
    first_element = my_array[0]
except IndexError as e:
    print(f"An error occurred: {e}")
```

Output: 
```
An error occurred: list index out of range
```
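
The response above uses a list. As a minimal sketch, assuming the question refers to a typed array from the `array` module, the same access raises the same exception type; `empty_array`, `message`, and `first` are illustrative names.

```python
import array as arr

# A typed array with no elements
empty_array = arr.array('i')

try:
    empty_array[0]
    message = None
except IndexError as e:
    # Indexing an empty array raises IndexError, just as it does for a list
    message = str(e)

print(message)

# Checking the length first avoids the exception altogether
first = empty_array[0] if len(empty_array) > 0 else None
print(first)
```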

In [None]:
Instruction 1:

Write some code to implement a linked list in Python, and then ask an LLM to explain the downsides and overheads associated with linked lists compared to other data structures.



In [None]:
Instruction 2:

Choose one of the points that the LLM raised in response to your last prompt, and ask some follow-up questions to dig deeper. Continue the conversation to get more information on any points that interest you or that relate to concepts you may not be familiar with.

The goal is to explore how an ongoing conversation can reveal deeper insights about the problem you are working on and options for solving it. One useful skill is asking an LLM to critique your understanding of an issue. This can help you improve your knowledge and skills.



In [None]:
Instruction 3:

Take a look at the linked-list code included in the video downloads (this code was generated using GPT-4o). Take some time to implement code that enables you to remove a node from the linked list. You can do this by hand, or ask an LLM to do it for you. Then go back and ask the LLM to profile the new code from the point of view of an expert in mitigating denial-of-service attacks, and to help you update the code to make sure it is safe from attack and performs well.

### Exercise

In [None]:
Consider the reasons why someone would want to implement a doubly-linked list rather than a singly-linked one. Take a few moments to think of some reasons. Then prompt an LLM with the same question, and compare the model’s answers to the items that you identified.

In [None]:
Instruction 1:

Take a look at the binary search tree code included in the downloads for this video. What issues would you have to consider before deploying code that uses a BST like this into production? Make a note of any issues you come up with. 

Then pass the code to an LLM and ask it to assume the role of a site reliability engineer, analyze the code, and suggest changes to make it perform well and securely in production. See what issues it comes up with, and compare the model’s answer to your own list. Take a close look at any code the LLM writes and ask follow-up questions to make sure you understand what it generated.



In [None]:
Instruction 2:

Take a look at the AVL.py code in the downloads for this video. Try running the code in the Coursera lab environment, in Google Colab, or on your own machine. Try to break the code: there are some bugs in the implementation. Hint: think of a role to assign the LLM to help you identify edge cases. Then work with the LLM to fix the bugs you find and enhance the code.

In [None]:
Instruction 1:

Examine the code in the graph.py file in the downloads for this video. This code implements a directed graph. Take a few moments to try the code out, see how it works, and identify any issues that might arise if this code were used in production.

What improvements or changes would you need to implement to make the code ready for production settings? Is there any missing functionality? Or any security vulnerabilities to consider?

After thinking through these issues yourself, pass the code to an LLM and ask it to critique the code and explore these issues for you. Read the model’s output, ask follow-up questions if you need to, and then try out any code it generates to check that it works. When you are done, come back to see how that conversation progressed for Laurence.

Instruction 2:

Download the code in graph2.py from the video downloads, which Laurence generated after his back and forth with ChatGPT. 

Try the code out and see if you can break it, or ask an LLM to help you identify edge cases and test those out. Then work with the LLM to fix any issues that you identify.

## Hash tables and maps

In [2]:
%%ai ollama:deepseek-coder-v2

Please write plain Python code using the base language to download text from a given url 
and then count every instance of every word in the text.


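Here's a simple Python script that downloads text from a given URL and counts the occurrences of each word in the downloaded text. The script uses the `requests` library to fetch the content from the URL and the `collections.Counter` class from the standard library to count the words. Note that `requests` is a third-party package, not part of the base language, so it must be installed separately (for example with `pip install requests`).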

```python
import requests
from collections import Counter
import re

def download_text(url):
    response = requests.get(url)
    if response.status_code == 200:
        return response.text
    else:
        raise Exception("Failed to retrieve the content from the URL")

def count_words(text):
    # Use regex to find all words, ignoring punctuation and case
    words = re.findall(r'\b\w+\b', text.lower())
    return Counter(words)

# Example usage:
url = 'http://example.com'  # Replace with the desired URL
text = download_text(url)
word_counts = count_words(text)

print("Word Counts:")
for word, count in word_counts.items():
    print(f"{word}: {count}")
```

This script will output a list of words along with their counts from the text found at the specified URL. Each line represents a unique word and how many times it appears in the downloaded text.
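
Since the prompt asked for plain Python, here is a minimal sketch of the same idea using only the standard library, with `urllib.request` in place of `requests`. The `timeout` parameter and the UTF-8 fallback are assumptions added for illustration and do not come from the original script.

```python
import re
from collections import Counter
from urllib.request import urlopen

def download_text(url, timeout=10):
    """Fetch a URL and return its body as text.

    urlopen raises an HTTPError for error status codes, so no manual
    status check is needed. UTF-8 is assumed when the server does not
    declare a charset.
    """
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")

def count_words(text):
    # Same regex-based word matching as the script above, lowercased first
    return Counter(re.findall(r"\b\w+\b", text.lower()))

# Example usage. The download performs a network request, so it is left commented out:
# text = download_text("http://example.com")
word_counts = count_words("The cat saw the other cat")
print(word_counts.most_common(2))
```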

In [3]:
%%ai ollama:deepseek-coder-v2

Assume the role of an expert Python programmer and explain the functionality of the Counter 
class from the collections package of the Python standard library.


 The `Counter` class is a part of the `collections` module in the Python standard library, which provides specialized container datatypes beyond those typically available in standard Python (via the built-in types like list, dict, and tuple). 

The `Counter` class is designed to count hashable objects. It allows you to easily count the occurrences of each element in an iterable or a mapping. Here’s how it works:

### Basic Usage
You can create a Counter object by passing an iterable (like a list or string) to its constructor:

```python
from collections import Counter

# Example with a list
c = Counter([1, 2, 2, 3, 4, 4, 4])
print(c)
```

### Output
The output will be:
```
Counter({4: 3, 2: 2, 1: 1, 3: 1})
```
This means that the number `4` appears `3` times, `2` appears `2` times, and so on.

### Functions and Methods
- **`.elements()`**: Returns an iterator over elements, repeating each as many times as its count. Elements are returned in the order they were first encountered:
    ```python
    print(list(c.elements()))  # Output: [1, 2, 2, 3, 4, 4, 4]
    ```
- **`.most_common([n])`**: Returns a list of the `n` most common elements and their counts from the most common to the least:
    ```python
    print(c.most_common())  # Output: [(4, 3), (2, 2), (1, 1), (3, 1)]
    ```
- **`.update([iterable-or-mapping])`**: Adds counts from an iterable or a mapping to the existing counts:
    ```python
    c.update([2, 3, 3])
    print(c)  # Output: Counter({2: 3, 3: 3, 4: 3, 1: 1})
    ```
- **`.subtract([iterable-or-mapping])`**: Subtracts counts from an iterable or a mapping. Unlike the `-` operator, it keeps zero and negative counts:
    ```python
    c.subtract([1, 2])
    print(c)  # Output: Counter({3: 3, 4: 3, 2: 2, 1: 0})
    ```
- **`.total()`**: Returns the sum of all counts (equivalent to `sum(c.values())`). Available in Python 3.10 and later:
    ```python
    print(c.total())  # Output: 8
    ```

### Use Cases
1. **Counting occurrences in a list or string:**
    ```python
    text = "mississippi"
    c = Counter(text)
    print(c['s'])  # Output: 4
    ```
2. **Finding the most common elements:**
    ```python
    words = ["apple", "banana", "apple", "cherry", "banana", "apple"]
    word_count = Counter(words)
    print(word_count.most_common())  # Output: [('apple', 3), ('banana', 2), ('cherry', 1)]
    ```

### Summary
The `Counter` class is a powerful tool for counting and analyzing the frequency of elements in an iterable or mapping. It simplifies handling counts, making it easier to work with data that involves repeated occurrences of values. This makes it particularly useful in scenarios such as text analysis, inventory management, and more.

In [4]:
%%ai ollama:deepseek-coder-v2

Why would you use the Counter class from the Python collections package to count the occurrence 
of words in a text file instead of a plain Python dictionary?


 The `Counter` class from the Python `collections` package is particularly useful for counting the occurrences of elements in an iterable, such as words in a text file. Here are some reasons why you might prefer using `Counter` over a plain Python dictionary:

1. **Ease of Use**: The `Counter` class provides methods like `.most_common()` which make it easy to find the most common elements and their counts. For example, you can easily get the top N words by frequency with just one line of code. This is not as straightforward using a plain dictionary.
   ```python
   from collections import Counter

   # Example usage
   text = "this is a test file this file contains some sample text"
   word_counts = Counter(text.split())
   print(word_counts.most_common(5))  # Outputs the top 5 most common words
   ```

2. **Mathematical Operations**: `Counter` objects support arithmetic operations such as addition (`+`), subtraction (`-`), and union (`|`) which can be useful for combining word counts from multiple texts or removing specific counts. This is more cumbersome with plain dictionaries.
   ```python
   c1 = Counter(a=4, b=3)
   c2 = Counter(a=2, b=5)
   result = c1 - c2  # Result: Counter({'a': 2}); zero and negative counts are dropped
   ```

3. **Default Values**: Unlike plain dictionaries, `Counter` objects handle missing elements by providing a default count of zero rather than raising a key error. This makes it easier to deal with words that might not be present in the text.
   ```python
   c = Counter(a=1, b=2)
   print(c['c'])  # Outputs: 0 (instead of KeyError)
   ```

4. **Readability**: Using `Counter` can make your code more readable and concise, especially when dealing with simple counting tasks like word frequencies in a text file. It clearly expresses the intent that you are trying to count occurrences of elements.

5. **Integration with Other Libraries**: `Counter` is a subclass of `dict` and part of Python's standard library, so it requires no additional installation and can be passed to any code that expects a dictionary.

In summary, while a plain dictionary can be used for counting words, using `Counter` from the `collections` package offers more functionality and ease-of-use for handling counted data.
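
To make the comparison concrete, here is a minimal sketch that tallies the same words both ways. The variable names are illustrative.

```python
from collections import Counter

text = "this is a test file this file contains some sample text"
words = text.split()

# Plain dictionary: the missing-key case has to be handled by hand
dict_counts = {}
for word in words:
    dict_counts[word] = dict_counts.get(word, 0) + 1

# Counter: the same tally in one call
counter_counts = Counter(words)

# Finding the top two words with a plain dict needs an explicit sort
top_two_dict = sorted(dict_counts.items(), key=lambda kv: kv[1], reverse=True)[:2]
top_two_counter = counter_counts.most_common(2)

print(dict_counts == dict(counter_counts))  # both hold the same counts
print(top_two_dict == top_two_counter)
```

Both approaches produce the same counts; the difference is in how much code you write and read.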

In [5]:
%%ai ollama:deepseek-coder-v2

Suppose that I have millions of text files and I need to count the occurrence of each word across 
all the files. Does the Counter class from the Python collections library scale better than a 
standard Python dictionary?


Not in any fundamental way. `Counter` is a subclass of `dict`, so both use the same hash table and both update a count in constant time on average. `Counter` is more convenient, with methods such as `.update()` and `.most_common()`, but it does not scale better than a standard Python dictionary. With millions of text files, reading the files, tokenizing the text, and holding the vocabulary in memory usually dominate the cost.
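
One way to approach a corpus of many files is to count each file separately and merge the results. The following is a minimal sketch under stated assumptions: the files are UTF-8 text, and `count_file`, `count_corpus`, and `WORD_RE` are hypothetical names introduced here.

```python
import re
from collections import Counter

# Compile the pattern once so it is not rebuilt for every line
WORD_RE = re.compile(r"\b\w+\b")

def count_file(path):
    """Count words in one file, reading line by line so the whole file is never held in memory."""
    counts = Counter()
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            counts.update(WORD_RE.findall(line.lower()))
    return counts

def count_corpus(paths):
    """Merge per-file counts into a single Counter."""
    total = Counter()
    for path in paths:
        total.update(count_file(path))  # update() adds to existing counts, it does not overwrite them
    return total

# Counter is a subclass of dict, so lookups and inserts use the same hash table
print(issubclass(Counter, dict))
```

Because each file produces an independent `Counter`, the per-file step could also be distributed across worker processes and the results merged at the end.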

In [7]:
%%ai ollama:deepseek-coder-v2

I see that the Python code uses regular expressions to break the corpus into words. Is this the 
fastest and most scalable solution? Suppose that I have millions of text files to process?


Not necessarily, but for plain word counting a regular expression is usually a reasonable choice: a compiled pattern runs in optimized C code, and so does `str.split()`. Libraries such as NLTK (Natural Language Toolkit) and spaCy are designed for natural language processing. Their tokenizers handle contractions, punctuation, and language-specific rules more accurately, but that extra work generally makes them slower than a simple regular expression, not faster. With millions of text files, how you read the files and spread the work across processes typically matters more than the tokenizer.

Here's a basic example using Python with the `re` module to split text into words:

```python
import re

def tokenize_text(text):
    # Find every run of word characters, which skips whitespace and punctuation
    tokens = re.findall(r'\b\w+\b', text)
    return tokens

# Example usage
corpus = "This is a sample text file for processing."
words = tokenize_text(corpus)
print(words)
```

If you need linguistically aware tokenization, NLTK provides one:

```python
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')  # Download the tokenizer models (newer NLTK releases may require 'punkt_tab' instead)

def tokenize_text(text):
    tokens = word_tokenize(text)
    return tokens

# Example usage
corpus = "This is a sample text file for processing."
words = tokenize_text(corpus)
print(words)
```

Or with spaCy:

```python
import spacy
# Load the small English model (install it first with: python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')

def tokenize_text(text):
    doc = nlp(text)
    tokens = [token.text for token in doc]
    return tokens

# Example usage
corpus = "This is a sample text file for processing."
words = tokenize_text(corpus)
print(words)
```

These libraries trade speed for tokenization quality, so measure each option on a sample of your own corpus before committing to one.
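
A quick way to check performance claims like these is to time the candidates on your own data. The following is a minimal sketch using `timeit` from the standard library; `tokenize_regex`, `tokenize_split`, and the repeated sample text are illustrative assumptions, and the timings will vary by machine.

```python
import re
import timeit

# Compile the pattern once, outside the function
WORD_RE = re.compile(r"\b\w+\b")

def tokenize_regex(text):
    # Word characters only, so punctuation is dropped
    return WORD_RE.findall(text.lower())

def tokenize_split(text):
    # Whitespace split only, so punctuation stays attached to words
    return text.lower().split()

# A synthetic sample; replace it with text from your own corpus
sample = "This is a sample text file for processing. " * 1000

regex_seconds = timeit.timeit(lambda: tokenize_regex(sample), number=20)
split_seconds = timeit.timeit(lambda: tokenize_split(sample), number=20)
print(f"regex: {regex_seconds:.4f}s  split: {split_seconds:.4f}s")
```

Note that the two functions do not return identical tokens, so speed is only one part of the decision. The same harness can be extended to time an NLTK or spaCy tokenizer if those libraries are installed.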

### Exercise

Take a moment to examine the code in `../src/count_words2.py`, which is available in the downloads for this video. Try the code out, see how it works, and think through any vulnerabilities that might exist in the code and note down your ideas. 

Next, pass the code to an LLM and ask it to identify any vulnerabilities in your code. Compare your list to the items that the LLM comes up with. 

Read the model’s output, ask follow-up questions if you need to, and then try out any code it generates to check that it works. When you are done, come back to see how that conversation progressed for Laurence.


### Exercise

Compare the code in `../src/count_words3.py` to the enhanced code you developed after chatting with an LLM. Critique the code and identify any vulnerabilities and edge cases.

Then ask an LLM to fix these and to offer any other suggestions for improving the code. Some things you could try include: writing test cases, implementing support for other languages, refactoring the code into another programming language, or adding documentation. Your goal is to write a robust word counting function that could work well at scale and that is safe and secure.
