# Requests, Exceptions, Generators
## Advanced Python for Life Sciences @ Physalia courses (Summer 2025)
### Marco Chierici, Fondazione Bruno Kessler

---

# HTTP Requests

## Basic requests

HTTP requests are commonly used to retrieve data from a specified resource, such as a website. Often that resource is exposed through an API (Application Programming Interface): the API defines how computer programs communicate with each other, and how you should form your query in order to get something from the website.

In [1]:
import requests

url = "https://catfact.ninja/fact"

# Make a request
response = requests.get(url)
response.content

b'{"fact":"A cat has more bones than a human being; humans have 206 and the cat has 230 bones.","length":83}'

In [2]:
response.json()

{'fact': 'A cat has more bones than a human being; humans have 206 and the cat has 230 bones.',
 'length': 83}

In [3]:
response.status_code

200
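
The three views of the response shown above are related. As a minimal sketch (using only the standard library's `json` module and a hard-coded copy of the bytes printed earlier, so no network access is needed), decoding and parsing the raw bytes gives the same kind of dictionary that `response.json()` returned. The variable names `raw_content`, `text`, and `data` are chosen here for illustration, and UTF-8 encoding is assumed.

```python
import json

# Hard-coded copy of the raw bytes shown in the output of `response.content` above
raw_content = b'{"fact":"A cat has more bones than a human being; humans have 206 and the cat has 230 bones.","length":83}'

# Decode the bytes into a string (assumption: the body is UTF-8 encoded)
text = raw_content.decode("utf-8")

# Parse the JSON string into a Python dictionary
data = json.loads(text)

print(data["fact"])
print(data["length"])
```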

## API with parameters

The "Cat facts" query above is very simple: we did not pass any parameters to it. Most of the time, however, APIs are more complex and accept further parameters.

To deal with this, you can use `requests.get(url, params)`, where

- `url` is the base URL of the API, and
- `params` is a Python dictionary containing the API's parameters and their corresponding values as key-value pairs.
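
To get a feeling for what happens to such a dictionary, the sketch below builds a query string with the standard library's `urllib.parse.urlencode`. This is only an illustration of the idea, not the actual internals of `requests`; the helper `build_url()`, the URL `https://api.example.org/search`, and the `limit` parameter are hypothetical.

```python
from urllib.parse import urlencode


def build_url(base_url, params):
    """Append URL-encoded query parameters to a base URL (hypothetical helper)."""
    return f"{base_url}?{urlencode(params)}"


# Hypothetical API endpoint and parameters, for illustration only
example_url = build_url("https://api.example.org/search", {"country": "Finland", "limit": 2})
print(example_url)
```

Keeping the parameters in a dictionary makes it easy to add, remove, or change one of them without editing the URL string by hand.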

### Exercise

You will retrieve a list of universities and their domain names using a web API.

1. Try a simple query on the URL `http://universities.hipolabs.com/search?country=Finland` using the above method. Note that here we have:
  - the base URL, `http://universities.hipolabs.com/search`
  - followed by `?`
  - followed by the parameter `country` set to the value `Finland`.

In [4]:
url = "http://universities.hipolabs.com/search?country=Finland"

# Make a request
response = requests.get(url)

2. Display the response in JSON format, like above, but this time limit the output to the first two elements (use Python's list slicing syntax).

In [5]:
response.json()[:2]

[{'country': 'Finland',
  'name': 'Abo Akademi University',
  'alpha_two_code': 'FI',
  'state-province': None,
  'web_pages': ['http://www.abo.fi/'],
  'domains': ['abo.fi']},
 {'country': 'Finland',
  'name': 'Central Ostrobothnia University of Applied Sciences',
  'alpha_two_code': 'FI',
  'state-province': None,
  'web_pages': ['http://www.cou.fi/'],
  'domains': ['cou.fi']}]

3. Adapt your code and repeat the query using the `requests.get(url, params)` syntax.

In [6]:
url = "http://universities.hipolabs.com/search"
params = {"country": "Finland"}

# Make a request
response = requests.get(url, params=params)
response.json()[:2]

[{'country': 'Finland',
  'name': 'Abo Akademi University',
  'alpha_two_code': 'FI',
  'state-province': None,
  'web_pages': ['http://www.abo.fi/'],
  'domains': ['abo.fi']},
 {'country': 'Finland',
  'name': 'Central Ostrobothnia University of Applied Sciences',
  'alpha_two_code': 'FI',
  'state-province': None,
  'web_pages': ['http://www.cou.fi/'],
  'domains': ['cou.fi']}]

## Querying UniProt

This is a case study for querying an API with parameters. Suppose you want to download the FASTA sequence of the DNA replication licensing factor MCM7 from UniProt.

By reading the [UniProt API documentation](https://www.uniprot.org/help/api_retrieve_entries), you know that the endpoint for getting a FASTA output from a UniProt ID is in the form

```https://rest.uniprot.org/uniprotkb/<UNIPROT ID>.fasta```

Given that the UniProt ID for MCM7 is P33993, let's give it a try with Python.

In [7]:
uniprot_id = "P33993"
url = f"https://rest.uniprot.org/uniprotkb/{uniprot_id}.fasta"
response = requests.get(url)
print(response.text)

>sp|P33993|MCM7_HUMAN DNA replication licensing factor MCM7 OS=Homo sapiens OX=9606 GN=MCM7 PE=1 SV=4
MALKDYALEKEKVKKFLQEFYQDDELGKKQFKYGNQLVRLAHREQVALYVDLDDVAEDDP
ELVDSICENARRYAKLFADAVQELLPQYKEREVVNKDVLDVYIEHRLMMEQRSRDPGMVR
SPQNQYPAELMRRFELYFQGPSSNKPRVIREVRADSVGKLVTVRGIVTRVSEVKPKMVVA
TYTCDQCGAETYQPIQSPTFMPLIMCPSQECQTNRSGGRLYLQTRGSRFIKFQEMKMQEH
SDQVPVGNIPRSITVLVEGENTRIAQPGDHVSVTGIFLPILRTGFRQVVQGLLSETYLEA
HRIVKMNKSEDDESGAGELTREELRQIAEEDFYEKLAASIAPEIYGHEDVKKALLLLLVG
GVDQSPRGMKIRGNINICLMGDPGVAKSQLLSYIDRLAPRSQYTTGRGSSGVGLTAAVLR
DSVSGELTLEGGALVLADQGVCCIDEFDKMAEADRTAIHEVMEQQTISIAKAGILTTLNA
RCSILAAANPAYGRYNPRRSLEQNIQLPAALLSRFDLLWLIQDRPDRDNDLRLAQHITYV
HQHSRQPPSQFEPLDMKLMRRYIAMCREKQPMVPESLADYITAAYVEMRREAWASKDATY
TSARTLLAILRLSTALARLRMVDVVEKEDVNEAIRLMEMSKDSLLGDKGQTARTQRPADV
IFATVRELVSGGRSVRFSEAEQRCVSRGFTPAQFQAALDEYEELNVWQVNASRTRITFV



## NCBI Eutils

Let's now query the [Entrez](https://www.ncbi.nlm.nih.gov/books/NBK25500/) nucleotide database for the ID 1676324825 (MYCN proto-oncogene transcript variant 3), returning the output in GenBank format.

In [8]:
# simple query
gene_id = 1676324825
url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id={gene_id}&rettype=gb"
response = requests.get(url)
print(response.text)

LOCUS       NM_001293231            1706 bp    mRNA    linear   PRI 28-APR-2025
DEFINITION  Homo sapiens MYCN proto-oncogene, bHLH transcription factor (MYCN),
            transcript variant 3, mRNA.
ACCESSION   NM_001293231
VERSION     NM_001293231.2
KEYWORDS    RefSeq.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 1706)
  AUTHORS   Vempuluru,V.S., Maniar,A., Bakal,K. and Kaliki,S.
  TITLE     Role of MYCN in retinoblastoma: A review of current literature
  JOURNAL   Surv Ophthalmol 69 (5), 697-706 (2024)
   PUBMED   38796108
  REMARK    GeneRIF: Role of MYCN in retinoblastoma: A review of current
            literature.
            Review article
REFERENCE   2  (bases 1 to 1706)
  AUTHORS   Sundaramoorthy,S., Colombo,D.F., Sanalkumar,R., Broye,L., Balmas
  

**Your turn:** create an improved and more readable version of the above code to work with the `params` syntax.

In [9]:
# improved version with params (more readable)
gene_id = 1676324825
url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
params = {"db": "nucleotide", "id": gene_id, "rettype": "gb"}

response = requests.get(url, params=params)
print(response.text)

LOCUS       NM_001293231            1706 bp    mRNA    linear   PRI 28-APR-2025
DEFINITION  Homo sapiens MYCN proto-oncogene, bHLH transcription factor (MYCN),
            transcript variant 3, mRNA.
ACCESSION   NM_001293231
VERSION     NM_001293231.2
KEYWORDS    RefSeq.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 1706)
  AUTHORS   Vempuluru,V.S., Maniar,A., Bakal,K. and Kaliki,S.
  TITLE     Role of MYCN in retinoblastoma: A review of current literature
  JOURNAL   Surv Ophthalmol 69 (5), 697-706 (2024)
   PUBMED   38796108
  REMARK    GeneRIF: Role of MYCN in retinoblastoma: A review of current
            literature.
            Review article
REFERENCE   2  (bases 1 to 1706)
  AUTHORS   Sundaramoorthy,S., Colombo,D.F., Sanalkumar,R., Broye,L., Balmas
  

In [None]:
# more complete example (check for status code)
def fetch_gene_sequence(gene_id):
    url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id={gene_id}&rettype=fasta&retmode=text"
    response = requests.get(url)
    if response.status_code == 200:
        return response.text
    else:
        print(f"Failed to retrieve data: {response.status_code}")


# Usage
gene_id = "NC_000852"  # or 9790228
gene_sequence = fetch_gene_sequence(gene_id)

In [None]:
gene_sequence[:10]

We'll see later how we could improve on the check for status code.

## Scrape data from web pages

Here we use `requests` to get the HTML source code of a web page (`www.example.com`), and then we extract links from it using [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/), a great Python library for pulling data out of HTML and XML files.

If needed, install Beautiful Soup with `conda install -y bs4` or `pip install bs4`.

You can do that without leaving Jupyter Lab: just start a code cell with `!`, immediately followed by the command, which will be run in the shell.

`!conda install -y bs4` or `!pip install bs4`

*This interface is non-interactive, so with `conda` remember to pass the `-y` argument, i.e., to answer 'yes' automatically to any confirmation prompt. `pip install` does not ask for confirmation and has no `-y` option.*

In [None]:
from bs4 import BeautifulSoup

url = "http://www.example.com/"
# make a request
response = requests.get(url)
# display the response
print(response.content)

The content looks like HTML: let's access it as a string using the `text` attribute.

In [None]:
html = response.text
print(html)

In [None]:
# Create soup
soup = BeautifulSoup(html, "html.parser")

The `soup` variable is a data structure representing a parsed HTML document.

In [None]:
# Extract page title from the HTML
print(f"Found title: {soup.title.text}")

In [None]:
# Extract links (hrefs) from the HTML
for link in soup.find_all("a"):
    print(f"Found link: {link.get('href')}")

In [None]:
# Extract all text from the HTML
print(f"Found text: {soup.get_text()}")

---

# Interact with FTP servers

FTP servers may look old-fashioned next to fancy HTTP requests, API queries, and web scraping. But a lot of bioinformatics-related services still use them: for example, reference genome builds on the UCSC hub, the NCBI Sequence Read Archive (SRA), Gene Expression Omnibus (GEO) files, and more.

Python provides several tools to deal with FTP: we'll look at two of them here.



## `urllib.request`

This solution is based on Python's built-in URL handling, which supports FTP URLs out of the box.

General syntax:

```python
from urllib.request import urlopen

with urlopen('ftp://username:password@server/path/to/file') as resource:
    data = resource.read()
    with open('file', 'wb') as f:
        f.write(data)
```

Let's download a FASTA sequence for the human mitochondrial chromosome from the [UCSC FTP server](https://genome.ucsc.edu/goldenPath/help/ftp.html).

From the above help page, we take note of the FTP server we need to use: `ftp://hgdownload.soe.ucsc.edu`

To obtain the file path and location, we may browse the [UCSC downloads page](http://hgdownload.soe.ucsc.edu/) and look for the file of interest. Let's go ahead and look for the hg38 assembly > sequence data by chromosome > scroll down to chrM.fa.gz

The HTTP URL of the file is `https://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chrM.fa.gz`, so the path relative to the server is `goldenPath/hg38/chromosomes/chrM.fa.gz`.

Let's put all this together and download the file with Python.

In [10]:
from urllib.request import urlopen

SERVER = "ftp://hgdownload.soe.ucsc.edu"
path = "goldenPath/hg38/chromosomes"
filename = "chrM.fa.gz"

with urlopen(f'{SERVER}/{path}/{filename}') as resource:
    data = resource.read()
    with open(filename, 'wb') as f:
        f.write(data)

## `ftplib`

You can use `ftplib` if you want more control over the FTP file actions: for example, change directories, list files, or download many files in a loop.

In [11]:
from ftplib import FTP

SERVER = "hgdownload.soe.ucsc.edu"  # no "ftp://" here
path = "goldenPath"

ftp = FTP(SERVER)
ftp.login()  # anonymous by default
ftp.cwd(path)
files = ftp.nlst()  # a Python list of the directory content


In [None]:
sorted(files)[1:20]

# for fn in files:
#     print(" ", fn)

In [None]:
path = "hg38/chromosomes"  # relative to the previous path!
ftp.cwd(path)
files = ftp.nlst()
sorted(files)[1:20]

In [None]:
filename = "chrM.fa.gz"

with open(filename, "wb") as fout:
    ftp.retrbinary(f"RETR {filename}", fout.write)

ftp.quit()

As another example, we'll use `ftplib` to list and download available files from NCBI's Gene Expression Omnibus (GEO) repository.

The data set of interest has a GEO accession number GSE186032 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE186032) and involves gene expression profiling by RNA-seq to identify changes in mouse breast tumor tissues resistant to anti-PD-L1.

The main files of a GEO data set are:
- a SOFT formatted file
- a MINiML formatted file
- a Series Matrix file

In addition to these, supplementary files may be available. They can be, for example, clinical information, READMEs, raw read counts, etc. and they can be in different formats.

In our case, there is one supplementary file consisting of RSEM estimated counts.

By hovering on `ftp` in the data set's page on GEO, we could simply copy the FTP URL of the file:

`ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE186nnn/GSE186032/suppl/GSE186032%5FEMT6%2DRSEM%2Destimated%2Dcounts.txt.gz`

But we are more interested in finding it programmatically!

GEO has a convention for forming its FTP URLs. The base URL for any accession is:

`ftp://ftp.ncbi.nlm.nih.gov/geo/series/<PREFIX>nnn/<ACCESSION>/<FILETYPE>`

- where `<ACCESSION>` is the full GEO accession, i.e., GSE186032
- `<PREFIX>` is a substring of the accession minus its last 3 digits, i.e., GSE186
- `<FILETYPE>` is one of `soft`, `matrix`, `miniml`, or `suppl`.

With this information, we can connect to the host, change into the correct directory, list the available supplementary files, and grab the one we want.
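
The convention above can be wrapped in a small function. This is a minimal sketch: the helper name `geo_series_path()` is hypothetical, and it assumes accessions of the form `GSE` followed by at least four digits (shorter accessions are not handled here).

```python
import re

VALID_FILETYPES = {"soft", "matrix", "miniml", "suppl"}


def geo_series_path(accession, filetype="suppl"):
    """Build the FTP directory path for a GEO series accession (hypothetical helper)."""
    # Assumption: accessions look like GSE + at least 4 digits
    if not re.fullmatch(r"GSE\d{4,}", accession):
        raise ValueError(f"Unexpected accession format: {accession}")
    if filetype not in VALID_FILETYPES:
        raise ValueError(f"Unknown file type: {filetype}")
    prefix = accession[:-3]  # the accession minus its last 3 digits
    return f"geo/series/{prefix}nnn/{accession}/{filetype}"


print(geo_series_path("GSE186032"))
```

Validating the inputs up front means that a typo in the accession is reported immediately, rather than as a less obvious FTP error later on.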

In [None]:
ACCESSION = "GSE186032"
PREFIX = ACCESSION[:-3]
FTP_HOST = "ftp.ncbi.nlm.nih.gov"
FTP_PATH = f"geo/series/{PREFIX}nnn/{ACCESSION}/suppl"

ftp = FTP(FTP_HOST)
ftp.login()
ftp.cwd(FTP_PATH)

In [None]:
suppl_files = ftp.nlst()
print(f"Available supplementary files for accession {ACCESSION}:")

for fn in suppl_files:
    print(" ", fn)

In [None]:
# there is only 1 file
my_file = suppl_files[0]

with open(my_file, "wb") as fout:
    ftp.retrbinary(f"RETR {my_file}", fout.write)

ftp.quit()

The file is now on our local computer, ready for downstream analysis.
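
Files like this one are gzip-compressed, and the standard library's `gzip` module can read them without decompressing them on disk first. The sketch below does not use the downloaded file: it writes a tiny tab-separated file with made-up content (the file name, column names, and values are all invented for illustration) and reads it back.

```python
import gzip
import os
import tempfile

# Create a small gzip-compressed, tab-separated file with made-up content
# (a stand-in for a downloaded file; these are NOT real data)
tmp_dir = tempfile.mkdtemp()
example_path = os.path.join(tmp_dir, "example_counts.txt.gz")
with gzip.open(example_path, "wt") as fout:
    fout.write("gene\tsample1\tsample2\n")
    fout.write("geneA\t10\t12\n")
    fout.write("geneB\t0\t3\n")

# Read it back in text mode ("rt"), line by line
with gzip.open(example_path, "rt") as fin:
    header = fin.readline().rstrip("\n").split("\t")
    rows = [line.rstrip("\n").split("\t") for line in fin]

print(header)
print(rows)
```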

---

# Assertions

Assertions are useful sanity checks in Python: they allow you to check if some specific condition is True.
They are mostly used while debugging your code: when the code goes into production, they are typically switched off (for example, by running Python with the `-O` option).

General syntax:

```
assert condition[, assertion_message]
```

The `assertion_message` is optional but highly recommended: after all, it should be helpful!

In [None]:
number = 42
assert number > 0

In [None]:
number = -42
assert number > 0

In [None]:
number = -42
assert number > 0, f"Positive number requested, got {number}"

In [None]:
# comparison
assert 3>2

In [None]:
# membership
numbers = [1, 3, 5, 7, 9]
assert 10 in numbers, "Input number not found in list"

In [None]:
# identity
x = 1
y = x
assert x is y
assert x is not y

In [None]:
# type check
number = 42.0
assert isinstance(number, int), f"{number} is not integer"

Let's check whether an input nucleotide sequence contains only valid nucleotide letters: 'A', 'C', 'G', or 'T'.
As often happens, this could be accomplished in many alternative ways. Here I choose to check whether each letter belongs to the set of characters 'ACGT'.

In [None]:
nucleotide_sequence = "ACCGAGTACG"
assert all(base in 'ACGT' for base in nucleotide_sequence), f"Invalid nucleotide detected in {nucleotide_sequence}"

In [None]:
nucleotide_sequence = "ACCGBGTACG"
assert all(base in 'ACGT' for base in nucleotide_sequence), f"Invalid nucleotide detected in {nucleotide_sequence}"
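
As one of the possible alternatives, the sketch below uses a set difference to report *which* characters are invalid, instead of only stating that something is wrong. The names `VALID_BASES` and `invalid_bases()` are hypothetical.

```python
VALID_BASES = set("ACGT")


def invalid_bases(sequence):
    """Return a sorted list of the characters that are not valid nucleotides."""
    return sorted(set(sequence) - VALID_BASES)


print(invalid_bases("ACCGAGTACG"))
print(invalid_bases("ACCGBGTACG"))

# The result can be used directly in an assertion message
nucleotide_sequence = "ACCGAGTACG"
assert not invalid_bases(nucleotide_sequence), f"Invalid nucleotides: {invalid_bases(nucleotide_sequence)}"
```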

Another common application is to check the response code of an HTTP request.
Remember the UniProt example?

In [None]:
uniprot_id = "P33993"
url = f"https://rest.uniprot.org/uniprotkb/{uniprot_id}.fasta"
response = requests.get(url)
print(response.text)

Suppose now you used a different endpoint, because you found it in an old script or a Stack Exchange post.

In [None]:
uniprot_id = "P33993"
url = f"https://www.uniprot.org/uniprot/?query=id:{uniprot_id}&columns=sequence&format=tab"
response = requests.get(url)
print(response.text)

Not the output we expected, right?
This is actually not so uncommon: APIs change over time, so we can't expect a server to always respond in the same way. We need to check its response.

In [None]:
# status_code is an integer, so compare it with 200 (not the string '200')
assert response.status_code == 200, 'Something went wrong during the query: check the API'

# Exceptions

Exceptions are commonly referred to as "errors". In Python, there are many types of exceptions, depending on the specific error.

In [None]:
fruits = ["apple", "orange", "banana", "apricot"]
fruits[4]

In [None]:
fruits.remove("tomato")

In [None]:
fruits[0] + 5
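
Each of the three cells above fails with a different exception type. The sketch below collects their names in one place; it relies on the `try...except` construct, which is explained in the next section, and the helper `exception_name()` is hypothetical. It deliberately catches the broad `Exception` class, only because its purpose is to report whatever was raised.

```python
fruits = ["apple", "orange", "banana", "apricot"]


def exception_name(action):
    """Call `action` and return the name of the exception it raises, or None."""
    try:
        action()
    except Exception as err:
        return type(err).__name__
    return None


print(exception_name(lambda: fruits[4]))                # index out of range
print(exception_name(lambda: fruits.remove("tomato")))  # value not in the list
print(exception_name(lambda: fruits[0] + 5))            # str + int is not allowed
```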

## Exception handling

Unhandled exceptions terminate the execution of your program. To avoid this, we need to write code that "handles" the exceptions gracefully, so that the script can continue running.

In [None]:
a = 10
b = 0
c = a / b
print("a / b = ", c)

In [None]:
a = 10
b = 0
try:
    c = a/b
    print("a / b = ", c)
except:
    print("Can't divide by zero.")

In [None]:
a = int(input("a = "))
b = int(input("b = "))
c = a / b
print("a / b = ", c)

In [None]:
try:
    a = int(input("a = "))
    b = int(input("b = "))
    c = a / b
    print("a / b = ", c)
except ValueError:
    print("Entered value must be a number.")
except ZeroDivisionError:
    print("Can't divide by zero.")

Another typical situation is where we want to read from a file, but that file is actually missing!

In [None]:
with open("file.log") as file:
    read_data = file.read()

print("Here we are")

Note that "Here we are" is never evaluated if the block above (in particular `open('file.log')`) throws an exception.

The code can be improved by handling the situation in which a file is potentially missing.

In [None]:
try:
    with open("file.log") as file:
        read_data = file.read()
except:
    print("Could not open file.log")

print("Here we are")

It is always good practice to specify the exact exception that the `except` clause should catch: in this case, `FileNotFoundError` (see above).

In [None]:
try:
    with open("file.log") as file:
        read_data = file.read()
except FileNotFoundError as nf_err:
    print(nf_err)

print("Here we are")
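
To illustrate why naming the exception matters, here is a minimal sketch with a hypothetical `safe_divide()` helper: it handles only `ZeroDivisionError`, so an unrelated mistake (here, passing a string) is not silently hidden by the handler.

```python
def safe_divide(a, b):
    """Return a / b, or None when b is zero (hypothetical helper)."""
    try:
        return a / b
    except ZeroDivisionError:
        return None


print(safe_divide(10, 2))
print(safe_divide(10, 0))

# A TypeError is not caught inside safe_divide, so the mistake stays visible
try:
    safe_divide("10", 2)
except TypeError as err:
    print(f"Not handled inside safe_divide: {err}")
```

With a bare `except:` inside `safe_divide()`, the last call would have returned `None` as well, and the wrong argument type would have gone unnoticed.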

An example application is to be sure that a dictionary with gene expression values actually contains valid items.

In [None]:
gene_expr = {"Rb1": 2.5, "P53": None, "Ras": "missing"}

for gene, expression in gene_expr.items():
    try:
        value = float(expression)
        print(f"expression level of {gene} = {value}")
    except TypeError as te:
        print(f"Expression for {gene} is missing!")
    except ValueError:
        print(f"Invalid expression for {gene}!")

We could use the exception handling mechanism to check if a nucleotide sequence is valid, i.e., contains only 'A', 'C', 'G', or 'T' letters. We elaborate on the previous example by creating a function `validate_sequence()` and using it in a `try...except` construct.

In [None]:
def validate_sequence(seq):
    assert all(base in "ATGC" for base in seq), "Invalid nucleotide detected"
    return True


sequence = "ATGBC"
try:
    validate_sequence(sequence)
except AssertionError as e:
    print(f"Validation failed: {e}")

print("Here we are!")
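
Since assertions can be switched off, a common alternative for validating input is to raise an exception explicitly. The following is a minimal sketch of that idea, not a replacement for the function above: the name `check_sequence()` and the wording of the error message are hypothetical.

```python
def check_sequence(seq):
    """Raise ValueError at the first character that is not A, C, G, or T."""
    for position, base in enumerate(seq):
        if base not in "ACGT":
            raise ValueError(f"Invalid nucleotide {base!r} at position {position}")
    return True


for sequence in ["ATGC", "ATGBC"]:
    try:
        check_sequence(sequence)
        print(f"{sequence}: valid")
    except ValueError as e:
        print(f"{sequence}: validation failed ({e})")
```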

Let's revisit our previous example of querying the NCBI Eutils. We got this:

In [None]:
def fetch_gene_sequence(gene_id):
    url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id={gene_id}&rettype=fasta&retmode=text"
    response = requests.get(url)
    if response.status_code == 200:
        return response.text
    else:
        print(f"Failed to retrieve data: {response.status_code}")


# Usage
gene_id = "NC_000852"  # or try with an invalid ID such as "NM_00005"
gene_sequence = fetch_gene_sequence(gene_id)

We can now improve the code above by using a `try...except` construct, which is the preferred way to deal with this kind of situation:

In [None]:
# better exception handling
def fetch_gene_sequence(gene_id):
    url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id={gene_id}&rettype=fasta"
    response = requests.get(url)
    try:
        response.raise_for_status()  # throws an HTTPError exception if something bad happens
        return response.text
    except requests.HTTPError:
        print(
            f"Failed to retrieve sequence for accession {gene_id}: code {response.status_code}"
        )
        return None


# Usage
gene_id = "NC_000852"  # or try with an invalid ID such as "NM_00005"
sequence = fetch_gene_sequence(gene_id)
if sequence:
    print(sequence)

---

# Iterators and Generators

Iterators are objects that can be iterated upon, meaning that they return one action or item at a time. To be considered an iterator, objects need to implement two methods: `__iter__()` and `__next__()`.
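
As a minimal sketch of these two methods in action, here is a hand-written iterator class (the name `Countdown` is hypothetical) that counts down from a starting number. It has to store its own state and signal the end of the iteration by raising `StopIteration`.

```python
class Countdown:
    """Iterator that yields start, start - 1, ..., 1."""

    def __init__(self, start):
        self.current = start  # internal state we have to track ourselves

    def __iter__(self):
        # An iterator returns itself from __iter__()
        return self

    def __next__(self):
        if self.current <= 0:
            raise StopIteration  # tells the caller there are no more items
        value = self.current
        self.current -= 1
        return value


for number in Countdown(3):
    print(number)
```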

We have already used iterators implicitly: `for` loops and list comprehensions obtain an iterator from the object they loop over and repeatedly ask it for the next item.

Generators are a Pythonic way to create iterators without explicitly implementing a class with `__iter__()` and `__next__()` methods. Most importantly, you don't need to keep track of the object's internal state yourself. An important thing to note is that generators produce their items **lazily**, meaning each item is computed only when it is requested, rather than all items being stored in memory at once.

The `yield` statement controls the flow of a generator: it returns a value to the caller and pauses the generator function, preserving its state, until the next value is requested (for example, by calling the `next()` function on the generator object).

In [None]:
# creating a simple generator in Python
def return_n_values(n):
    num = 0
    while num < n:
        yield num
        num += 1


values = return_n_values(10)

print(type(values))
print(values)

# start unpacking the generator
print(next(values))
print(next(values))

In [None]:
# unpacking generators with a for loop
def return_n_values(n):
    num = 0
    while num < n:
        yield num
        num += 1


values = return_n_values(5)

for val in values:
    print(val)

In [None]:
# creating a generator with a for loop
def return_n_values(n):
    for i in range(n):
        yield i


values = return_n_values(3)

for val in values:
    print(val, end=" ")

## Exercise: from functions to generators

Rewrite as a generator the following function:

```python
def even_numbers(n):
    evens = []
    for i in range(n):
        if i % 2 == 0:
            evens.append(i)
    return evens
```

Test both the function and the generator with `n=10`, printing their outputs.

In [None]:
def even_numbers(n):
    evens = []
    for i in range(n):
        if i % 2 == 0:
            evens.append(i)
    return evens


evens = even_numbers(10)
print(evens)

In [None]:
def even_numbers(n):
    for i in range(n):
        if i % 2 == 0:
            yield i


evens = even_numbers(10)
print(evens)  # generator object

for even in evens:
    print(even, end=" ")

## Exercise: codon generator

Write a simple generator function that yields the codons from a given DNA sequence.

Example input: "ATGCGTATGCCTAATGATCTAAGCTAGCTGATCGATCTAGCTAGATGTAG"

Example output:
```
ATG
CGT
ATG
CCT
(truncated)
```

In [None]:
def codon_generator(dna_sequence):
    for i in range(0, len(dna_sequence), 3):
        # Ensure slicing does not go beyond the sequence length
        if i + 3 <= len(dna_sequence):
            yield dna_sequence[i : i + 3]


sequence = "ATGCGTATGCCTAATGATCTAAGCTAGCTGATCGATCTAGCTAGATGTAG"

for codon in codon_generator(sequence):
    print(codon)

## Applications of generators

Since generators are lazily evaluated, they are useful when iterating through large datasets, avoiding the need to load the whole dataset into memory. They can even represent infinite sequences, such as the Fibonacci sequence.
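
As a rough illustration of the memory argument, the sketch below compares a list holding 100,000 squared numbers with a generator producing the same values. The function names are hypothetical, and `sys.getsizeof` only measures the container object itself (not the numbers it refers to), so the figures are indicative and vary between Python versions.

```python
import sys


def squares_list(n):
    """Build the full list of squares in memory."""
    result = []
    for i in range(n):
        result.append(i**2)
    return result


def squares_generator(n):
    """Yield the squares one at a time."""
    for i in range(n):
        yield i**2


n = 100_000
list_size = sys.getsizeof(squares_list(n))
generator_size = sys.getsizeof(squares_generator(n))

print(f"List object:      {list_size} bytes")
print(f"Generator object: {generator_size} bytes")
```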

### Reading large files

If you have a large file that you need to process line by line, you can use a generator to read the file one line at a time, instead of reading the entire file into memory at once. This can save a lot of memory, especially if the file is very large.

```python
def read_file(filename):
    with open(filename, 'r') as f:
        for line in f:
            yield line

for line in read_file('large_file.txt'):
    print(line)
```

In [None]:
# large file reader: inefficient version
def file_reader(file_path):
    rows = []
    # the `with` block closes the file when done
    with open(file_path, "r") as f:
        for row in f:
            rows.append(row)
    return rows
  
def print_row_count(file_path):
    count = 0
    for row in file_reader(file_path):
        count += 1
    print(f"Total count is {count}")

print_row_count('data/ecoli_K12_genomic.fasta')

In [None]:
# large file reader: better version
def gen_file_reader(file_path):
    # the `with` block closes the file once the generator is exhausted
    with open(file_path, "r") as f:
        for row in f:
            yield row
        
def gen_print_row_count(file_path):
    count = 0
    for row in gen_file_reader(file_path):
        count += 1
    print(f"Total count is {count}")
    
gen_print_row_count('data/ecoli_K12_genomic.fasta')
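
The same idea extends from single lines to whole records. The following is a minimal sketch of a generator that yields one `(header, sequence)` pair at a time from FASTA-formatted text. The function name `fasta_records()` is hypothetical, and the example uses a small made-up FASTA string wrapped in `io.StringIO` instead of a real file, so that it is self-contained.

```python
import io


def fasta_records(handle):
    """Yield (header, sequence) tuples from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    # Emit the last record
    if header is not None:
        yield header, "".join(chunks)


# Made-up FASTA content, used in place of a real file
example = io.StringIO(">seq1 first example\nACGT\nACGT\n>seq2 second example\nGGCC\n")

for header, sequence in fasta_records(example):
    print(header, len(sequence))
```

Only one record is held in memory at any time, so the approach also works for files with many sequences.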

### Generating infinite sequences

In [None]:
def fibonacci():
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b


for i, fib in enumerate(fibonacci()):
    print(fib)
    if i > 10:
        break

In [None]:
fib = fibonacci()

In [None]:
# try re-executing this code cell over and over again (CMD+ENTER or CTRL+ENTER)
next(fib)
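
Instead of counting with `enumerate` and breaking out of the loop, the standard library's `itertools.islice` can take a fixed number of items from an infinite generator. A minimal sketch, redefining the same infinite `fibonacci()` generator so that the block is self-contained (the variable `first_ten` is chosen here for illustration):

```python
from itertools import islice


def fibonacci():
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b


# Take only the first 10 values from the infinite generator
first_ten = list(islice(fibonacci(), 10))
print(first_ten)
```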

We can revise our Fibonacci generator with a parameter representing how many numbers to generate:

In [None]:
def fibonacci(nums):
    a, b = 0, 1
    for _ in range(nums):
        a, b = b, a + b
        yield a


for fib in fibonacci(10):
    print(fib)

Similarly to list comprehensions, there is a shorthand syntax for generators!

Use the generator expression syntax to create simple generators. For example, you can use the following syntax to create a generator that yields the even numbers between 0 and n:

In [None]:
n = 10
evens = (i for i in range(n) if i % 2 == 0)
print(list(evens))

In [None]:
evens_squared = (i**2 for i in range(n) if i % 2 == 0)
print(list(evens_squared))

### Pipelines: Filtering and transforming data

Generators can be used to filter and transform data in a single step, which can make your code more readable.

In [None]:
def even_numbers(numbers):
    for number in numbers:
        if number % 2 == 0:
            yield number


def square(numbers):
    for number in numbers:
        yield number**2


numbers = [1, 2, 3, 4, 5, 6]
even_squares = square(even_numbers(numbers))
print(list(even_squares))  # [4, 16, 36]

We can combine our `square` and (revised) `fibonacci` generators to print the sum of squares of the first N=10 numbers in the Fibonacci series.

In [None]:
sum(square(fibonacci(10)))

---

# Credits

Partially abridged from work by Lee Stott (MIT License), Sebastian Bassi (MIT License), and datanagy.io