# **Introduction to Python**


## Session 4 Exercises

Create a Python script called `ID_exercise_block1_part4.py`.



#### Exercise 1
Create a **generator function** that reads a FASTA file.
In each iteration, the function must yield a tuple with the following format: `(identifier, sequence)`.

```python
FASTA_iterator(fasta_filename)
```


In [1]:
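def FASTA_iterator(fasta_filename):
    """
    Generator function to read multiline FASTA files

    Yield a tuple (identifier, sequence) for each sequence in the FASTA file
    """

    identifier = None
    sequence = ""

    # The with block closes the file when the generator finishes or is closed
    with open(fasta_filename) as fd:
        for line in fd:
            if line.startswith(">"):
                # Yield the previous record, skipping headers that have no sequence
                if identifier is not None and sequence:
                    yield (identifier, sequence)
                identifier = line[1:].strip()
                sequence = ""
            else:
                sequence += line.strip()

    # Yield the last record in the file
    if identifier is not None and sequence:
        yield (identifier, sequence)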

In [2]:
FASTA_iterator("example_fasta_file.fa")

<generator object FASTA_iterator at 0x7fcd4f496350>
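Calling a generator function only creates the generator object shown above; no line of the file is read until the generator is iterated. The sketch below illustrates this with a made-up two-record file. The names `toy_fasta_records` and `toy_path` and the file contents are assumptions for illustration only, and the compact reader is included just to keep the example self-contained.

```python
import os
import tempfile


def toy_fasta_records(path):
    """Hypothetical compact reader, used only to keep this example self-contained."""
    identifier, chunks = None, []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                if identifier is not None:
                    yield (identifier, "".join(chunks))
                identifier, chunks = line[1:].strip(), []
            else:
                chunks.append(line.strip())
    if identifier is not None:
        yield (identifier, "".join(chunks))


# Write a made-up two-record FASTA file to a temporary location
with tempfile.NamedTemporaryFile("w", suffix=".fa", delete=False) as tmp:
    tmp.write(">seq_A\nMKT\nLLV\n>seq_B\nGGA\n")
    toy_path = tmp.name

records = toy_fasta_records(toy_path)  # nothing has been read yet
first_record = next(records)           # runs the body up to the first yield
remaining_records = list(records)      # consumes the rest of the generator
os.remove(toy_path)

print(first_record)       # ('seq_A', 'MKTLLV')
print(remaining_records)  # [('seq_B', 'GGA')]
```

Because each record is produced on demand, only one sequence needs to be held in memory at a time, which matters for large FASTA files.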

In [3]:
n = 0
for seq_id, sequence in FASTA_iterator("example_fasta_file.fa"):
    print(seq_id, len(sequence))
    n += 1
n
    

Q8WZ42 34350
Q5S007 2527
P00533 1210
A2ASS6 35213
P10721 976
P21802 821
P31749 480
P11362 822
P06213 1382
P09619 1106
P35968 1356
P12931 536
P08581 1390
P04626 1255
P17948 1338
P16234 1089
P07949 1114
Q9NYL2 800
P22607 806
P00519 1130
P53350 603
P35916 1363
P53355 1430
P29320 983
Q9BXM7 581
Q02763 1124
P27361 379
P53779 464
P47811 360
Q96JB8 637
Q16659 721
Q8NB16 471
Q02750 393
P29678 393
P46734 347
Q4KSH7 419
O54874 1732
Q6DT37 1551
P47809 397
Q9NZW5 540
Q5T2T1 576
Q8CE90 535
P52564 334
Q9Y5S2 1711
Q5VT25 1732
Q32MK0 819
Q9H1R3 596
Q8WXR4 1341
A8C984 715
Q15746 1914
Q8K3H5 900


51

#### Exercise 2

Given a list of FASTA files, create a function that **returns a dictionary** containing the
following four keys with their associated values:

- `intersection`: a **set** with the identifiers common to all the files.

- `union`: a **set** with all the unique identifiers found in any of the files.

- `frequency`: a **dictionary** with all the identifiers as keys and the number of files in which each one appears as values (int).

- `specific`: a **dictionary** with the names of the input files as keys and a set with the specific identifiers as values (i.e. identifiers that are exclusive to that FASTA file).

> Note 1: Identifier comparison must be case-insensitive (i.e. Code_A, code_a and CODE_A are equivalent).

> Note 2: It must use the `FASTA_iterator` function created in Exercise 1.

```python
compare_fasta_file_identifiers(fasta_filenames_list)
```
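As a rough illustration of what the four keys should contain, the sketch below applies the definitions to three made-up identifier sets instead of real files. The file names, identifiers and the variable `toy_ids` are assumptions for illustration; upper-casing is one possible way to satisfy Note 1.

```python
from collections import Counter

# Made-up identifier sets standing in for three FASTA files
toy_ids = {
    "file_1.fa": {"code_a", "Code_B", "CODE_C"},
    "file_2.fa": {"CODE_A", "code_b"},
    "file_3.fa": {"Code_A", "code_b", "code_d"},
}

# Note 1: normalise the case before comparing identifiers
normalised = {name: {i.upper() for i in ids} for name, ids in toy_ids.items()}

# Identifiers present in every file, and in at least one file
intersection = set.intersection(*normalised.values())
union = set.union(*normalised.values())

# Number of files in which each identifier appears
frequency = dict(Counter(i for ids in normalised.values() for i in ids))

# An identifier is specific to a file when it appears in that file only
specific = {
    name: {i for i in ids if frequency[i] == 1}
    for name, ids in normalised.items()
}

print(sorted(intersection))  # ['CODE_A', 'CODE_B']
print(sorted(union))         # ['CODE_A', 'CODE_B', 'CODE_C', 'CODE_D']
print(specific["file_1.fa"]) # {'CODE_C'}
```

Note that `file_2.fa` ends up with an empty set under `specific`, since both of its identifiers also appear in the other files.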

In [4]:
def compare_fasta_file_identifiers(fasta_filenames_list):

    ids_dict = {}
    
    return_dict = {}
    
    for fasta_file in fasta_filenames_list:
        ids_dict[fasta_file] = set()
        
        for seq_id, sequence in FASTA_iterator(fasta_file):
            # Upper-case the identifier so the comparison is case-insensitive (Note 1)
            ids_dict[fasta_file].add(seq_id.upper())
            
        return_dict.setdefault("intersection", set(ids_dict[fasta_file])).intersection_update(ids_dict[fasta_file])
        return_dict.setdefault("union", set()).update(ids_dict[fasta_file])
        
    frequencies = {}
    specific = {}
    
    for fasta_file, ids_set in ids_dict.items():
        for seq_id in ids_set:
            if seq_id not in frequencies:
                frequencies[seq_id] = 0
            frequencies[seq_id] += 1
        
        specific[fasta_file] = ids_set.copy()    
        for other_fasta_file, other_ids_set in ids_dict.items():
            if other_fasta_file != fasta_file:
                specific[fasta_file].difference_update(other_ids_set)
    
    return_dict["frequency"] = frequencies
    return_dict["specific"]  = specific
    
    return return_dict

In [5]:
info = compare_fasta_file_identifiers([
    "uniprot_sprot_sample.fasta",
    "uniprot_sprot_sample2.fasta",
    "uniprot_sprot_sample3.fasta"])

In [6]:
info.keys()

dict_keys(['intersection', 'union', 'frequency', 'specific'])

In [7]:
[(k, type(info[k]), len(info[k])) for k in info.keys()]

[('intersection', set, 490),
 ('union', set, 504),
 ('frequency', dict, 504),
 ('specific', dict, 3)]

In [8]:
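def compare_fasta_file_identifiers_v2(fasta_filenames_list):
    """
    Variant that updates intersection, union and frequency while reading each
    file, then derives the file-specific identifiers from the frequencies
    """

    ids_dict = {}
    return_dict = {}
    frequencies = {}

    for fasta_file in fasta_filenames_list:
        # Upper-case the identifiers so the comparison is case-insensitive (Note 1)
        ids_dict[fasta_file] = set(seq_id.upper() for seq_id, sequence in FASTA_iterator(fasta_file))

        return_dict.setdefault("intersection", set(ids_dict[fasta_file])).intersection_update(ids_dict[fasta_file])
        return_dict.setdefault("union", set()).update(ids_dict[fasta_file])

        for seq_id in ids_dict[fasta_file]:
            frequencies[seq_id] = frequencies.get(seq_id, 0) + 1

    # An identifier is specific to a file when it appears in exactly one file.
    # This has to wait until every file has been read: removing only the
    # identifiers seen in earlier files would leave shared ones in the first file.
    return_dict["specific"] = {
        fasta_file: set(seq_id for seq_id in ids_set if frequencies[seq_id] == 1)
        for fasta_file, ids_set in ids_dict.items()}
    return_dict["frequency"] = frequencies

    return return_dict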
