<a href="https://colab.research.google.com/github/mbumba1/Data-Science-For-Beginners/blob/main/(Student_version)_Week2_Day1_AI_Assisted_Coding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to AI Assisted Coding

Instructor: Zheng Guo (zhgguo@umich.edu)

In this lecture, we'll explore how to leverage AI tools for coding.

**Topics:**
- Writing code with Gemini
- Fixing problems with Gemini
- Testing with Gemini

## Tools

Many AI tools can help you program more effectively. Here are some examples:

- GitHub Copilot: https://github.com/features/copilot
- Cursor: https://cursor.com/
- Amazon Q Developer: https://aws.amazon.com/q/developer
- Google Gemini: https://gemini.google.com/

Today we will focus on using Gemini in Google Colab.

## Setup and Installation

First, let's make sure the necessary libraries are available. Run the following cell to import them and load the iris dataset.

In [None]:
# Import libraries
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

iris_features, iris_type = load_iris(return_X_y=True, as_frame=True)
print(iris_features.shape)

## Gemini in Google Colab

At the top right corner of Google Colab, near your account avatar, there is a "Gemini" button that opens a chat box where you can use Gemini.

In your code editor (such as VS Code), once Copilot is installed and enabled, you can write a comment or a function signature, and Copilot will provide suggestions. You can accept, reject, or modify these suggestions as needed.

**TODO: add more screen shots on how to do each**

Below, we use the iris dataset as a running example to demonstrate:
- Loading data
- Exploratory data analysis (EDA)
- Fixing a bug in a function
- Refactoring code
- Writing tests

Let's get started!

## Code Completion

When you write code in Google Colab, you will often see gray text appear after your cursor. This is an intelligent code completion suggestion. To accept a suggestion, press Tab. To reject it, just keep typing.

Note that this code completion differs from library function suggestions: the gray suggestions may combine library functions, local variables, and constants.

Code completion does not work when you edit in the middle of a line, i.e., when there is text immediately after your cursor.

In [None]:
import numpy as np

x = np.random.randint(3,4,5)

## Generate Code from Natural Language


### Write the prompt in the chat

* Specify the function's inputs and outputs
* Provide a high-level description of its functionality
* State any special constraints on the implementation
* Prompt in small steps if possible

### Example 1: Manipulating pandas DataFrames

In this part, we present several examples of DataFrame manipulation.

In [None]:
# prompt: write a function of two arguments "sepal" and "petal", and the function returns the feature table where sepal length is greater than "sepal" cm and petal length greater than "petal" cm

**Q1**: Try the same prompt several times. Do you get the same generated program each time?

**Q2**: Try a similar prompt with different wording. Do you get significantly different programs?

In [None]:
# prompt: write a function that returns the average values of features aggregate by labels
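
One plausible shape for the code returned by a prompt like this is sketched below. It is only an illustration: the function name `mean_features_by_label`, its two parameters, and the four toy rows are assumptions made for this example, and the code Gemini actually generates will likely differ from run to run.

```python
import pandas as pd

def mean_features_by_label(features: pd.DataFrame, labels: pd.Series) -> pd.DataFrame:
    """Return the mean of every feature column, one row per distinct label."""
    # groupby aligns `labels` with `features` on their shared index
    return features.groupby(labels).mean()

# Made-up toy rows (not real iris measurements), used only to show the call
toy_features = pd.DataFrame({
    "sepal length (cm)": [5.0, 5.2, 6.0, 6.4],
    "petal length (cm)": [1.4, 1.6, 4.0, 4.4],
})
toy_labels = pd.Series([0, 0, 1, 1], name="label")

toy_means = mean_features_by_label(toy_features, toy_labels)
print(toy_means)
```

With the notebook's data, the equivalent call would be `mean_features_by_label(iris_features, iris_type)`.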

**Q3**: Write a prompt for the following task and test the results.

Given a label mapping m, e.g. `{0 : "setosa", 1 : "versicolor", 2 : "virginica"}`, create two functions:

1. `map_labels`: generate a feature table where integer labels are replaced with corresponding new labels in the mapping;

2. `avg_feature_by_label`: produce the average values of features for a mapped label.


### Example 2: Plot important information about the dataset

#### Example 2.1: Histogram of petal lengths for different iris types
Draw a histogram of how the petal lengths are distributed for different types with Matplotlib.

In [None]:
# prompt: Draw a histogram of how the petal lengths are distributed for different types with Matplotlib. When two histogram bars overlap, show the blended colors for the overlapping parts and do not draw them side by side. Control the overall bin number to 10

You will probably need to refine the prompt until the results match your expectations.

**Q4**: Plot histograms for all features

Could you make a 2x2 plot grid where each cell is a histogram of one feature from the iris dataset? You will need to refine your prompt to get the desired behavior.

In [None]:
# make a 2x2 plot grid
# each cell is a histogram of features from the iris dataset
# the histograms should be colored by the iris type
# each histogram should use the same set of bins
def plot_feature_histograms():
    """
    Plots a 2x2 grid of histograms of features from the iris dataset colored by iris type.
    Each histogram uses the same set of bins for different iris types.

    Returns:
        None
    """
    fig, axes = plt.subplots(2, 2, figsize=(10, 10))
    feature_names = iris_features.columns
    bins = np.linspace(iris_features.min().min(), iris_features.max().max(), 21)  # Define common bins for all features
    for i, ax in enumerate(axes.flat):
        for j in range(3):
            ax.hist(
                iris_features[feature_names[i]][iris_type == j],
                bins=bins,
                label=['Setosa', 'Versicolor', 'Virginica'][j],
                color=plt.cm.Set2(j / 2),  # Use the Set2 colormap
                alpha=0.7,
                rwidth=0.8
            )
        ax.set_xlabel(feature_names[i])
        ax.set_ylabel('Frequency')
        ax.legend(title='Iris Type')
    plt.suptitle('Distribution of Features by Iris Type')
    plt.tight_layout()
    plt.show()

plot_feature_histograms()

You can give context in your prompt, for example by referring to a previous cell or to a function that has been implemented above.

**Warnings and Suggestions**:
1. AI may hallucinate. For example, a library function that an AI tool generates may not exist.
2. AI may generate code that does not match your library version because it was trained on older versions.
3. AI may generate garbage code that takes you a long time to recognize as garbage. Train yourself to judge when to spend time examining AI-generated code and when to give the AI more hints instead.
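
The first two warnings can be checked directly in a notebook cell. The sketch below uses only the Python standard library. The helper names `installed_versions` and `attribute_exists` are hypothetical and chosen for this example. Note that `installed_versions` expects distribution names as used by `pip` (for example `scikit-learn`, not `sklearn`).

```python
from importlib import import_module
from importlib.metadata import PackageNotFoundError, version

def installed_versions(package_names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in package_names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

def attribute_exists(module_name, attribute):
    """Return True if the importable module exposes `attribute`."""
    return hasattr(import_module(module_name), attribute)

# Compare these versions with the ones the AI-generated code assumes
print(installed_versions(["numpy", "pandas", "scikit-learn", "matplotlib"]))

# Check whether a suggested name really exists before relying on it
print(attribute_exists("collections", "Counter"))       # a real standard-library class
print(attribute_exists("collections", "FancyCounter"))  # a made-up name
```

If a suggested function turns out not to exist, or the installed version is newer than the generated code assumes, tell the AI so in your next prompt or consult the library's documentation.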

## Fix problems in the code

### Example 1: Explain code

AI tools are very good at identifying patterns in complex code structures and at explaining unfamiliar library functions to you.

For example, given the following code, without using AI tools, can you find out what it is doing?

In [None]:
import pandas as pd
import io

csv = '''
breed,size,weight,height
Labrador Retriever,medium,67.5,23.0
German Shepherd,large,,24.0
Beagle,small,,14.0
Golden Retriever,medium,60.0,22.75
Yorkshire Terrier,small,5.5,
Bulldog,medium,45.0,
Boxer,medium,,23.25
Poodle,medium,,16.0
Dachshund,small,24.0,
Rottweiler,large,,24.5
'''
dogs = pd.read_csv(io.StringIO(csv))

(dogs
  .sort_values('size')
  .groupby('size')['height']
  .agg(['sum', 'mean', 'std'])
)

code source: https://osf.io/zgybd/wiki/home/?view_only=20e52fb304fd484593fac7cf508b5f27

If you cannot work out what it is doing, feel free to click the "explain code" button and see how an LLM understands it.

What about the next one?

In [None]:
from collections import Counter

test_str = 'geeksforgeeks'
print("The original string is : " + str(test_str))
res = Counter(test_str[idx : idx + 2] for idx in range(len(test_str) - 1))

print(res)

code source: https://www.geeksforgeeks.org/python-bigrams-frequency-in-string/

**Q5**: Ask an LLM to find out what the following code snippets do.

source: programiz

In [None]:
from functools import reduce

def f1(x):
  assert x >= 0
  x_str = str(x)
  return reduce(lambda acc, x: acc + int(x), x_str, 0)

source: geeksforgeeks

In [None]:
def f2(xss, ys):
  assert len(xss) == len(ys)
  results = []
  for i, xs in enumerate(xss):
    results.append([x+(ys[i],) for x in xs])

  return results

**Warning**: AI tools may perform very well on these toy examples, but as your code grows longer and spans more files, the AI is more likely to make mistakes. If you doubt an AI-generated result, trust the official documentation over the AI tool.

### Example 2: Debug errors

In [None]:
print(iris_features.columns)

Interact with the AI to find out what is wrong with the following code.

In [None]:
def filter_iris_features(sepal, petal):
    """
    Filters the iris feature table to include only rows where
    sepal length > sepal and petal length > petal.
    """
    filtered_features = iris_features[
        (iris_features['sepal length'] > sepal) &
        (iris_features['petal length'] > petal)
    ]
    return filtered_features

print(filter_iris_features(6.6, 5.5))

Unit tests are small test cases that check whether the basic functionality of an implementation matches the requirements. The following code does not pass its unit tests. Use AI to find out whether the unit tests are wrong or the implementation has a bug.

In [None]:
# Start with importing packages that we need
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

iris_features, iris_type = load_iris(return_X_y=True, as_frame=True)
print(iris_features.shape)

def filter_iris_features(sepal, petal):
    """
    Filters the iris feature table to include only rows where
    sepal length > sepal and petal length > petal.

    Args:
        sepal (float): Minimum sepal length in cm.
        petal (float): Minimum petal length in cm.

    Returns:
        pd.DataFrame: Filtered feature table.
    """
    filtered_features = iris_features[
        (iris_features['sepal length (cm)'] > sepal) &
        (iris_features['petal length (cm)'] > petal)
    ]
    return filtered_features

# add unit tests for the functions
def test_filter_iris_features():
    """
    Tests the filter_iris_features function.
    """
    assert filter_iris_features(6.6, 5.5).shape == (8, 4), "Test case 1 failed"
    assert filter_iris_features(5.5, 4.5).shape == (100, 4), "Test case 2 failed"
    print("All test cases passed")

test_filter_iris_features()
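
To show the unit-test idea in isolation, here is a minimal sketch in which the tests pass. The function `count_vowels` and its test are hypothetical and unrelated to the iris exercise above. They only illustrate the pattern of pairing an implementation with a few small `assert` checks.

```python
def count_vowels(text: str) -> int:
    """Count the vowels (a, e, i, o, u) in `text`, ignoring case."""
    return sum(1 for ch in text.lower() if ch in "aeiou")

def test_count_vowels():
    """Each assert checks one small, easily verified case."""
    assert count_vowels("") == 0, "empty string should have no vowels"
    assert count_vowels("Gemini") == 3, "e, i, i"
    assert count_vowels("xyz") == 0, "no vowels present"

test_count_vowels()
print("All test cases passed")
```

When a test like this fails, the message after the comma tells you which case broke, which is a useful detail to paste into the chat when asking the AI for help.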

**Q6**: Find the bugs in the following code snippets

source: github

In [None]:
# Goal: print the numbers 0 to 20
for i in range(0, 20):
  print(i)

In [None]:
# Goal: double each number in the list
def f(xs):
  ys = []
  for x in xs:
    y = x * 2
  ys.append(y)
  return ys

In [None]:
# Goal: capitalize the first letters in all words
words = ['debugging', 'with', 'vscode']

for i, word in enumerate(words):
  word = word.capitalize()

print(words)

### Example 3: Add documentation

AI tools can help us write documentation for an existing method. We can request documentation through the chat window.

In [None]:
def avg_feature_values(label):

    avg_values = iris_features[iris_type == label].mean()
    return avg_values

**Warning**: Again, AI tools cannot guarantee that the generated documentation is correct, so a human still needs to double-check it.

## Using Copilot in VS Code

If you use VS Code in your local environment, you may choose GitHub Copilot, which integrates well with the IDE.

Next, we will go through most of the features mentioned above and demo them again in VS Code. Its interface is different and allows more interaction between your code and the LLM.