<a href="https://colab.research.google.com/github/jblocher/mgt6534/blob/main/2_py4genai_tests_apis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tests, Conditional Execution, Iteration, and Practice
> Rounding out our Python knowledge for success with langchain

In today's lesson, we'll cover conditional execution, iteration, tests, and get some practice with these concepts. We'll continue learning the syntax and grammar of the Python language to effectively communicate our goals to Python.

In this lesson, you'll learn:
* Communicating conditional execution syntax to Python
* Different ways of communicating iteration to Python
* Testing - how to make sure your code does what it should do
* How to use APIs when genAI fails you!

Let's get started!


# Conditional Execution
Let's start by revisiting our email example. Recall that our function was:

In [None]:
# process document function
def process_document(document_string):
    lines = document_string.split('\n')

    header_dict = {}
    email_body = []

    # Flag to know when the header ends and the body starts
    body_flag = False

    for line in lines:
        if not body_flag:
            # Header lines processing
            if ':' in line:
                key, value = line.split(':', 1)
                header_dict[key.strip()] = value.strip()

                if 'Lines' in key:
                    body_flag = True
            else:
                # Header lines without ':' are considered part of the last field
                header_dict[key] += '\n' + line.strip()

        else:
            # Body lines processing
            email_body.append(line)

    return header_dict, email_body

We used this function by executing something similar to:
```
# Read in the document
with open('document.txt', 'r') as file:
    document_string = file.read()

# Process the document
header, body = process_document(document_string)

print(header)
print(body)
```

This already has an example of conditional execution, in which a variable is only modified if conditions are met. Let's explore this at greater depth.

## Writing conditional execution code
What we've written above is essentially a combination of `if`, `if-else`, and `if-elif-else` statements. The syntax for communicating conditional execution looks like so:

### `if` statements
```
if condition:
  #code block for if condition true
```

### `if-else` statements
`if-else` statements allow binary, mutually exclusive decisions:

```
if condition:
  #code block for if condition true
else:
  #code block for if condition false
```

### `if-elif-else` statements

`if-elif-else` statements allow multiple, mutually exclusive decisions:
```
if condition A:
  #code block for if condition A true
elif condition B:
  #code block for if condition B true
elif condition C:
  #code block for if condition C true
else:
  #code block for when none of the above conditions are true
```
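
To make the templates above concrete, here is a minimal sketch of an `if-elif-else` chain. The function name `describe_line` and its labels are hypothetical and chosen only for illustration; they are not part of `process_document`.

```python
def describe_line(line):
    """Classify one line of text (hypothetical helper, for illustration only)."""
    if line.strip() == '':
        # First condition: the line holds nothing but whitespace
        return 'blank'
    elif ':' in line:
        # Only checked when the first condition was False
        return 'header-like'
    else:
        # Reached when none of the conditions above were True
        return 'body-like'


for sample in ['From: John Doe', '', 'Dear Jane,']:
    print(repr(sample), '->', describe_line(sample))
```

Because the branches are mutually exclusive, exactly one of the three `return` statements runs for any given line.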

Let's see what this looks like for our code. Go back to the code above and identify the conditional statements. What are they doing? Why?

# Iteration
We've already seen some examples of iteration, where we cycle through a collection data structure (e.g., a list or dictionary) and apply statements to one or more of its elements.

There are 2 primary ways that you see iteration in Python:
* `for` loops
* `comprehensions` (list, dictionary, generator)

A final type that we will see _at length_ with Huggingface is:
* `map`
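
As a quick preview, here is a small sketch of Python's built-in `map`, which applies one function to every element of a collection. This is the plain Python version, not the Hugging Face one, and the helper `shout` is a hypothetical name used only for this example.

```python
def shout(text):
    """Hypothetical helper: upper-case a string and add an exclamation mark."""
    return text.upper() + '!'


words = ['cat', 'dog', 'groundhog']

# map() is lazy: it returns an iterator, so we wrap it in list() to see the results
shouted = list(map(shout, words))
print(shouted)
```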

Let's explore this.

## `for` loops
We've already seen some examples of `for` loops when we were learning about lists and dictionaries.

We said that our `for` goes through cyclical iterations, updating the index to process each element.

Our syntax was as follows:
```
for dummy_name in collection:
  ## indented code block steps to take
```
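
As a small refresher, the sketch below fills in that template for a list and for a dictionary. The values are made up for illustration.

```python
subjects = ['sci.med', 'rec.autos', 'talk.politics.misc']

# Loop over a list: `subject` takes the value of each element in turn
for subject in subjects:
    print('Newsgroup:', subject)

header = {'From': 'John Doe', 'To': 'Jane Doe'}

# Loop over a dictionary: .items() yields one (key, value) pair per iteration
for field, value in header.items():
    print(field, '->', value)

# A common pattern: start with an empty list and append one result per iteration
lengths = []
for subject in subjects:
    lengths.append(len(subject))

print(lengths)
```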

Let's explore this.

### Finishing our email processing and exploration
Although we've read in one email, recall that we wanted to process all (we'll do a subset here for the sake of time) of the documents in the folder. Let's try this using the syntax that we learned for iteration.
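
Before working with the real data, here is a hedged sketch of how `glob` and a `for` loop fit together. It builds a throwaway folder with made-up files so that it runs anywhere; the file names and contents are assumptions for illustration, and the pattern you need for the newsgroup data depends on how the archive unpacks.

```python
import glob
import os
import tempfile

# Build a throwaway folder with three small files so the example is self-contained
demo_dir = tempfile.mkdtemp()
for name in ['a.txt', 'b.txt', 'notes.md']:
    with open(os.path.join(demo_dir, name), 'w') as f:
        f.write('From: someone\nLines: 1\n\nhello\n')

# '*' matches any run of characters in a file name; only the .txt files match here
txt_paths = sorted(glob.glob(os.path.join(demo_dir, '*.txt')))
print(len(txt_paths), 'text files found')

# Iterate over the matching paths and read each file
for path in txt_paths:
    with open(path, 'r') as f:
        contents = f.read()
    print(os.path.basename(path), 'has', len(contents), 'characters')
```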

In [None]:
import glob

In [None]:
# download the data using command line tools
!curl -Os http://qwone.com/~jason/20Newsgroups/20news-19997.tar.gz

In [None]:
# Untar the data to the current directory
# note that we can use %%capture if we don't want to see all of the output
!tar -zxf 20news-19997.tar.gz

In [None]:
#use glob to get a list with each file name in it
email_paths_list = ???

#print length
print('Number of emails:', len(email_paths_list))

# Print first 10
email_paths_list[:10]

In [None]:
#write a for loop to iterate through 20 emails and separate them into header and body with our function above

???
...
???

In [None]:
#Show a single email to see that it worked
print(all_emails[19])

#show header
display('\nPrinting Header......', all_headers[19], '\n')

#show body
print('\nPrinting body.........', all_bodies[19])

## List Comprehensions
Another very compact way to represent for loops is through list comprehensions. They're great if you:
* Essentially have one function to apply to a list of elements
* Want to do binary conditional execution on elements of a list
* Want to perform filtering of elements (reduce the size of the list based on some condition)

The difficulty with list and dictionary comprehensions is the syntax, because the whole `for` loop is expressed so concisely. In return, they offer wonderful functionality: they streamline the creation of new lists based on old lists. Let's look at a brief comparison.

<center>
<img src="https://github.com/vanderbilt-data-science/p4ai-essentials/blob/main/img/iteration_comparison.png?raw=true" width="800">
</center>
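
The three use cases listed above can be sketched in code as follows. The sample `lines` list is made up for illustration.

```python
lines = ['  From: John Doe ', '', 'Dear Jane,', 'The cat is out of the bag.']

# The for loop version: create an empty list, then append one result per element
stripped_loop = []
for line in lines:
    stripped_loop.append(line.strip())

# 1. Apply one function to every element
stripped = [line.strip() for line in lines]

# 2. Binary conditional execution: the if/else goes BEFORE the `for`
kinds = ['header' if ':' in line else 'other' for line in lines]

# 3. Filtering: a lone `if` goes AFTER the collection and drops elements
non_empty = [line for line in lines if line != '']

print(stripped == stripped_loop)
print(kinds)
print(non_empty)
```

Note the two different positions of `if`: placed before `for` it chooses between two values and keeps every element, while placed after the collection it decides whether an element is kept at all.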

### Converting part of our `for loop` to list comprehension
Yep, that's right - we're doing the same exact example again, except using list comprehensions. Don't think about the outputs of this too hard; this is more about gaining experience with the syntax of list comprehensions than about any output functionality.

Recall that we had:
```
#write for loop to iterate through elements and do the task
max_emails = 20
all_headers = []
all_bodies = []
all_emails = []

for email_path in email_paths_list[:max_emails]:
  with open(email_path, 'r') as f:
    this_email = f.read()
  
  #add the email
  all_emails.append(this_email)

  #process the email
  email_header, email_body = process_document(this_email)

  #save for later
  all_headers.append(email_header)
  all_bodies.append(email_body)
```
We can re-express this as a list comprehension. We will start with a single portion, using the existing `all_emails` list. You'll investigate the use of list comprehensions on other components in a breakout room later.



In [None]:
# recall all_emails, where we'll show the first 2
all_emails[:2]

In [None]:
# Choose to express as a list comprehension


In [None]:
# Get headers and emails separately


In [None]:
# Check
display(all_headers)
print()
display(all_bodies)

## Breakout Session 2 (10 minutes)

In this breakout session, you'll re-express the code we had before as a set of list comprehensions. Try to think through the list comprehension yourself first and write it. If you run into issues, work with GenAI to determine where the error in your logic (or syntax) exists.

**In your group:**
1. Write a function `read_doc` which reads a single email given a file path and returns the email.
2. Create a list comprehension which reads all (20 for the sake of time) of the emails. It should return a list of all the emails (`all_emails`) as above. Recall that we have `email_paths_list`.
3. Leverage the list comprehension written above to process all the emails and separate the results into `all_headers` and `all_bodies`, then condense steps 1-3 into a single cell.
4. Make sure your code runs, and ensure the output is as you expect.
5. Compare and contrast the approaches of the `for` loop and the list comprehension. What are the advantages/disadvantages of both?
6. (If time) Create a function which contains all of the functionality described above and the `glob` functionality. Your function should take as input the directory to be read and the number of emails that should be returned.

## Your Work

In [None]:
#TODO: Complete #1 here
def read_doc(filepath):
  ???


# Check a single email
read_doc(email_paths_list[0])

In [None]:
#TODO: Complete  #2 here
all_emails_bo = [???]


len(all_emails_bo)
print(all_emails_bo[0], '\n')
print(all_emails_bo[19])

In [None]:
# Complete #3 , running it all together.
#TODO: read all emails
all_emails_bo = [read_doc(fpath) for fpath in email_paths_list[:20]]

#TODO: process all emails - use our process_document() function above.
processed_emails_bo = [???]

#separate into component pieces - use GenAI to see what this line of code is doing.
all_headers_bo, all_bodies_bo = zip(*processed_emails_bo)
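
If the `zip(*...)` line above looks mysterious, the toy sketch below may help. The `pairs` list is made-up data standing in for a list of `(header, body)` results.

```python
# Each element is a (header, body) pair, like the output of a two-value function
pairs = [({'From': 'A'}, 'body A'),
         ({'From': 'B'}, 'body B'),
         ({'From': 'C'}, 'body C')]

# The * unpacks the list so each pair becomes a separate argument to zip();
# zip() then groups all the first items together and all the second items together
headers, bodies = zip(*pairs)

print(headers)
print(bodies)
```

Both results are tuples rather than lists; wrap them in `list()` if you need to append to them later.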

In [None]:
#TODO: Complete #4 here

#headers
display(???)
print()
#bodies
display(???)

# Tests and Asserts
Tests are code that you write to ensure your code is:
1. Semantically correct
2. Robust to all failure modes (e.g., explore all branches of if statements, etc)

This can be conceived of as formulating **test cases.** How can we do this for our `process_document` function? We can decide on some tests for point (1) by thinking about what we need the code to do.
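
In Python, a simple way to write such a test is the `assert` statement: `assert condition, 'message'` does nothing when the condition is true and raises an `AssertionError` with the message when it is false. Here is a minimal sketch on a toy function; `count_words` is a hypothetical example and not part of our email code.

```python
def count_words(text):
    """Hypothetical function under test: count whitespace-separated words."""
    return len(text.split())


result = count_words('The cat is out of the bag.')

# Structure: is the output the type we expect?
assert isinstance(result, int), 'count_words should return an integer'

# Semantics: is the value correct?
assert result == 7, 'count_words returned the wrong count'

# Robustness: does an edge case (empty input) behave sensibly?
assert count_words('') == 0, 'An empty string should contain zero words'

# A failing assert raises AssertionError carrying our message
try:
    assert count_words('one two') == 3, 'Expected 3 words'
except AssertionError as err:
    failure_message = str(err)
    print('Test failed:', failure_message)
```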

## Example Test 0: Basic Structure
> Make sure the inputs and outputs are as expected

* Inputs: string
* Output: dictionary, string (2)

In [None]:
# Formulate test case
input_email = """From: John Doe
To: Jane Doe
Organizer-Names: Friends and Family
Message ID: 253421-2
Reply-To:
Lines: 3

Dear Jane,
The cat is out of the bag.
John
"""

print(input_email)

In [None]:
# Call the function
processed_email = process_document(input_email)
processed_email

In [None]:
#Verify the behavior
# 1: it should return a tuple of two objects
# 2: the first object should be a dictionary
# 3: the second object should be a string


???


## Example Test 1: Semantics
> Making sure input and output values are correct

* We will use the same `input_email` from before
* We will use the same `processed_email` from before

Both are repeated here for clarity.

In [None]:
# Original email
input_email = """From: John Doe
To: Jane Doe
Organizer-Names: Friends and Family
Message ID: 253421-2
Reply-To:
Lines: 3

Dear Jane,
The cat is out of the bag.
John
"""

print(input_email)

In [None]:
# Formulate the expected outputs
true_out_header = {'From': 'John Doe',
  'To': 'Jane Doe',
  'Organizer-Names': 'Friends and Family',
  'Message ID': '253421-2',
  'Reply-To': '',
  'Lines': '3'}

true_out_body = """
Dear Jane,
The cat is out of the bag.
John
"""

In [None]:
# Call the function
processed_email = process_document(input_email)
processed_email

In [None]:
#separate the components of the output tuple
proc_out_header, proc_out_body = processed_email

#What should be true about this?
???


Huh. It does not seem that our function is doing exactly what we want it to do...

## Example Test 2: Robustness
> What if the email has no body and is only a header?



In [None]:
# Formulate test case
input_email = """From: John Doe
To: Jane Doe
Organizer-Names: Friends and Family
Message ID: 253421-2
Reply-To:
Lines: 3"""

In [None]:
# Call the function
out_header, out_body = process_document(input_email)

In [None]:
#What should be true about this, then?

???

## Breakout Session 1 (10 mins)
In this breakout session, you will continue the process of generating and writing tests/test cases for the provided code. Your assignments are as follows:

1. Write a test that ensures that the code can run on the standard header provided in one of the emails.
2. Write a test that ensures that the code can run with no header and only a message body.
3. Generate your own test to test the robustness of the function (either you can generate this or experiment by asking your GenAI).
4. (If time) Fix the code for where the asserts fail.

### Do your work here

In [None]:
# Use these examples to write tests (scroll to bottom of this code block)
standard_header_email = """Newsgroups: sci.med
Path: cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!fs7.ece.cmu.edu!europa.eng.gtefsd.com!darwin.sura.net!sgiblab!sdd.hp.com!decwrl!decwrl!uunet!utcsri!utnut!utzoo!telly!problem!intacc!bed
From: bed@intacc.uucp (Deb Waddington)
Subject: INFO NEEDED: Gaucher's Disease
Message-ID: <1993Mar18.002149.1111@intacc.uucp>
Date: Thu, 18 Mar 1993 00:21:49 GMT
Distribution: Everywhere
Expires: 01 Jun 93
Reply-To: bed@intacc.UUCP (Deb Waddington)
Organization: Matrix Artists' Network
Lines: 33


Dear Cat,

Please stop eating my food. You have your own. It doesn't matter that I eat yours.

Sincerely,

Dog

"""

true_standard_header = {'Newsgroups': 'sci.med',
 'Path': 'cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!fs7.ece.cmu.edu!europa.eng.gtefsd.com!darwin.sura.net!sgiblab!sdd.hp.com!decwrl!decwrl!uunet!utcsri!utnut!utzoo!telly!problem!intacc!bed',
 'From': 'bed@intacc.uucp (Deb Waddington)',
 'Subject': "INFO NEEDED: Gaucher's Disease",
 'Message-ID': '<1993Mar18.002149.1111@intacc.uucp>',
 'Date': 'Thu, 18 Mar 1993 00:21:49 GMT',
 'Distribution': 'Everywhere',
 'Expires': '01 Jun 93',
 'Reply-To': 'bed@intacc.UUCP (Deb Waddington)',
 'Organization': "Matrix Artists' Network",
 'Lines': '33'}

true_standard_body = """

Dear Cat,

Please stop eating my food. You have your own. It doesn't matter that I eat yours.

Sincerely,

Dog

"""

#TODO: Write tests here for #1
???


In [None]:
# #2
no_header_email = """Dear Cat,

Please stop eating my food. You have your own. It doesn't matter that I eat yours.

Sincerely,

Dog

"""

#TODO: Write tests here for #2
???

In [None]:
#TODO: Give #3 a try here. What else could we test?



In [None]:
#TODO: Rewrite function here in a way that all the tests above will pass (tests are just below, no need to copy-paste them)



**Run most of the original tests, making sure to run the problematic ones.**

In [None]:
# Run tests
input_email = """From: John Doe
To: Jane Doe
Organizer-Names: Friends and Family
Message ID: 253421-2
Reply-To:
Lines: 3

Dear Jane,
The cat is out of the bag.
John
"""

# Formulate the expected outputs
true_out_header = {'From': 'John Doe',
  'To': 'Jane Doe',
  'Organizer-Names': 'Friends and Family',
  'Message ID': '253421-2',
  'Reply-To': '',
  'Lines': '3'}

true_out_body = """
Dear Jane,
The cat is out of the bag.
John
"""

# Call the function
processed_email = process_document(input_email)
processed_email

# Basic structure
assert len(processed_email)==2, 'The function did not return the correct number of elements'
assert isinstance(processed_email[0], dict), 'The function did not return the first element as a dictionary'
assert isinstance(processed_email[1], str), 'The function did not return the second element as a string.'

# Semantics 1
assert processed_email[0] == true_out_header, 'The header of the function output is incorrect.'
assert processed_email[1] == true_out_body, 'The body of the function output is incorrect.'

# Robustness
no_header_header, no_header_body = process_document(no_header_email)
assert not no_header_header, "The header should be empty since there isn't one"
assert no_header_body, "The body should exist and not be empty"

**Yay!**

# Guided Exploration of HF API

Most of the time, the code we see when using an API looks a lot like functions. We see the "contract" components of the new functions that are available to us: the function name, the inputs required and available, and the outputs. Let's experiment with one!
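
To see what a "contract" looks like in code, the sketch below inspects a toy function with Python's standard `inspect` module. `classify_length` is a hypothetical function invented for this example; the same inspection steps should also work on many real library functions.

```python
import inspect


def classify_length(text: str, threshold: int = 20) -> str:
    """Hypothetical function: label a text 'short' or 'long' by character count."""
    return 'long' if len(text) > threshold else 'short'


# The contract: the name, the inputs (with any defaults), and the output type
signature = inspect.signature(classify_length)
print(classify_length.__name__, signature)

# The docstring describes what the function promises to do
print(inspect.getdoc(classify_length))

# Inputs without a default are required; inputs with a default are optional
param_names = list(signature.parameters)
threshold_default = signature.parameters['threshold'].default
print(param_names, threshold_default)
```

In a notebook, `help(classify_length)` shows the same information in one step.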

In [None]:
%%capture
# install packages (%%capture must be the first line of the cell)
!pip install transformers

In [None]:
#make pipeline package available to python kernel


In [None]:
# create a pipeline to do text-classification


In [None]:
#when in doubt, check the type


In [None]:
#Use the pipeline for inference
a_text = 'The dog was walking on a beautiful summer day.'
???

## Guided Example 1
What if we want to return all scores?

In [None]:
#Return all scores


## Guided Example 2
What if we had a list of texts, loaded, for example, from file?

In [None]:
#Using lists
texts = ['The dog was happy and ran along playfully.',
         'The cat glared at me, judging me from afar.',
         'The groundhog peeked its head above the ground.',
         'Opossums are criminally underrated.']


## Guided Example 3
Maybe we don't like how this model is performing. Can we use a different model? How?

Reference: [Huggingface Models](https://huggingface.co/models)

In [None]:
#modify code to use different model
cardiff_pipe = pipeline('text-classification')
cardiff_scores = cardiff_pipe(texts)

In [None]:
#see results
cardiff_scores

# Breakout Session 3 (10 minutes): APIs
Now, try this on your own using the following APIs:
* [HuggingFace Documentation](https://huggingface.co/docs/transformers/index)
* [OpenAI Documentation](https://platform.openai.com/docs/api-reference)

### Using one of the quickstart pages or tutorials, answer the following questions:
1. What is the package being used?
2. What is being imported? Does it appear to be imported from a module? Which module and how can you tell?
3. Try to run the quickstart code to try out the example.
4. Summarize what you think the code is doing. Otherwise, use a GenAI to help summarize the behavior.
5. Modify the code to implement something of interest to you.

# Congratulations!
You made it through the first crash course with Python for using HuggingFace! You now:

1. Know how to use Google Colab
2. Have built some intuition around what it is to program and that programming is just another language with syntax, semantics, and grammar
3. Know several standard Python data types and how to use them
4. Know several standard Python data structures and how to use them
5. Have learned how to use functions, what they expect, and what they return
6. Know how packages, libraries, modules, classes, functions, and methods all relate and how you can leverage this information to help understand APIs
7. Understand APIs as contracts about what is expected to be input and what should be returned
8. Have learned how to communicate conditional execution
9. Have learned about standard types of iteration with Python

That is A LOT to cover - and I'm proud of you for sticking with it!

