# Objectives

* Work with dicts (review [https://docs.python.org/3.5/tutorial/datastructures.html#dictionaries](https://docs.python.org/3.5/tutorial/datastructures.html#dictionaries)), tuples and sets

# Dicts

Dicts are actually *sets* of *key*/*value* pairs. 
Like lists, except that you access elements by a key rather than by their location in the structure.
Like Webster's dictionary: look up a value (the definition) using a key (the word).

They are **very** useful



## Example: count words

Simple start: 

```
	wordCounts={'this':2,'that':3,'forever':1}
	wordCounts['this']
	wordCounts['that']
```

or, imagine:

```	
	sequence="CGGATCGNNAAGCTCTGTTGTTGGTGANNNYYGGATAYAGGUUNYGTAACTGGCCT"
	nucs=['A','C','G','T']
	amb=['N','Y','U']

	characters = nucs+amb
	charCounts = [ sequence.count(x) for x in nucs+amb ]
```

Look up a couple of entries:

```
	charCounts[1]  # C's
	charCounts[5]  # Y's
```

To see all the counts:

```
	for nextIndex in range(len(characters)):
		print( "%s\t%d" % ( characters[ nextIndex ], charCounts[ nextIndex ] ) )
```

Easier would be:

```
	charCountDict['C']  # C's
	charCountDict['Y']  # Y's
	
	for nextKey in charCountDict:
		print( "%s\t%d" % ( nextKey, charCountDict[nextKey] ) )
```

Which requires a dictionary:

```
	charCountDict = { 'A':	9, 'C':	7, 'G':	16, 'T':	12, 'N':	6, 'Y':	4, 'U':	2 }
```

* moral of the story: dicts let you look up entries in a collection using a meaningful *key* (such as a string) rather than a numeric *index*



## Creating Dictionaries

* Easy to add entries

```
		my_dict = {}
		my_dict['A'] = sequence.count('A')
```

* To create from scratch: set of pairs

```
		char_count_dict = { 'A':	9, 'C':	7, 'G':	16, 'T': 12, 'N': 6, 'Y': 4, 'U': 2 }
```

But spacing can make it easier to read:

```
	char_count_dict = { 
		'A':	9, 
		'C':	7, 
		'G':	16, 
		'T': 12, 
		'N': 6, 
		'Y': 4, 
		'U': 2 
		}
```

**Note** that in Python 3.6 and earlier the order of entries in a dict is unpredictable (based on hash()). Since Python 3.7, dicts preserve insertion order.

>print(char_count_dict)


## A more interesting example:

```
    codontable = {
    	'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
    	'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
    	'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
    	'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
    	'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
    	'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
    	'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
    	'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
    	'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
    	'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
    	'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
    	'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
    	'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
    	'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
    	'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_',
    	'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W',
    }
```
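One thing a table like this makes easy is translating DNA one codon at a time. The following is a minimal sketch, not a full translator: `codon_subset` is a small subset of the table above chosen for illustration, and `translate()` is a hypothetical helper name.

```python
# a small subset of the codon table above, just for illustration
codon_subset = {
    'ATG': 'M', 'GCC': 'A', 'AAA': 'K', 'TGG': 'W',
    'TAA': '_', 'TAG': '_', 'TGA': '_',
}

def translate(dna, table):
    """Translate dna three letters at a time using table (hypothetical helper)."""
    protein = ''
    # step through the sequence in jumps of 3, ignoring an incomplete last codon
    for start in range(0, len(dna) - len(dna) % 3, 3):
        codon = dna[start:start + 3]
        protein = protein + table[codon]
    return protein

print(translate('ATGGCCAAATGGTAA', codon_subset))
```

A codon that is missing from the table raises a KeyError here, which is the problem the next section addresses.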



## dict.get() method with default value

Easy to look up entries that already exist. What about when they don't exist?

```
	my_dict['B']    # raises KeyError: 'B' is not a key
	
	my_dict.get('B', "it's not there")    # returns the default
	my_dict.get('A', "it's not there")    # returns the stored value
```
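The difference between the two kinds of lookup can be seen in a small self-contained sketch. The dict contents here are made up for illustration.

```python
my_dict = {'A': 9}

# direct lookup of a missing key raises KeyError
try:
    my_dict['B']
except KeyError:
    print('no entry for B')

# get() returns the default instead of raising
print(my_dict.get('B', 0))
print(my_dict.get('A', 0))

# with no default given, get() returns None for a missing key
print(my_dict.get('B'))

# get() only looks; it does not add the key
print('B' in my_dict)
```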

For example, count words in a file (deconstruct this):

```
	word_counts = {}
	with open('TSE.txt') as in_file:
		for next_line in in_file: 
			print(next_line)
			words = next_line.split()
			for next_word in words:
				word_counts[next_word] = word_counts.get(next_word,0) + 1
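				# word_counts[next_word] = word_counts[next_word] + 1   # KeyError the first time a word is seen
	print(sorted(word_counts.items()))   # dicts have no sort() method; sort the (word, count) pairs instead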
```
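To try the counting pattern without a file, the same loop can be run over a list of strings. This sketch assumes two made-up lines standing in for the contents of `TSE.txt`.

```python
# stand-in for the lines of a file (assumed data, not the real TSE.txt)
lines = [
    'this is the way the world ends',
    'not with a bang but a whimper',
]

word_counts = {}
for next_line in lines:
    for next_word in next_line.split():
        # get() supplies 0 the first time a word is seen
        word_counts[next_word] = word_counts.get(next_word, 0) + 1

print(word_counts['the'])
print(word_counts['a'])
```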



## Useful Dict Methods

* dict.get(): see above
* iterators
	* dict.keys()
	* dict.values()
	* dict.items()

```
	word_counts.keys()
	word_counts.values()
	word_counts.items()
	word_counts

	for next_word in sorted( word_counts.keys() ):
		print( "{}\t{}".format(next_word, word_counts[next_word] ) ) 

	for word, count in word_counts.items():
		print( "{}\t{}".format( word, count ) )
```

* keys must be *hashable*, but you don't have to know what that really means. Keys can be
	* tuples
	* strings
	* int
	* float
* keys can **not** be lists or dicts

Describe how hashing works. Note that the order of entries in a dict is **unpredictable** in Python 3.6 and earlier; since Python 3.7, dicts preserve insertion order.

* dictionary values can be *any* object: lists, dictionaries, mixed types, lists of lists of lists, ...
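
The rules for keys and values listed above can be checked directly. This is a small sketch with made-up data; the `positions` dict and its contents are hypothetical.

```python
# tuples, strings and ints all work as keys
positions = {}
positions[('chr1', 1050)] = 'A'   # tuple key (hypothetical data)
positions['name'] = 'example'
positions[42] = 'an int key'

print(positions[('chr1', 1050)])

# a list is mutable, so it is not hashable and cannot be a key
try:
    positions[['chr1', 1050]] = 'A'
except TypeError as error:
    print('cannot use a list as a key:', error)

# values, on the other hand, can be anything, including lists and dicts
positions['reads'] = [1, 2, [3, 4]]
positions['nested'] = {'inner': 'dict'}
```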



## A still MORE interesting example

Consider this HISEQ fasta header:

```
GFU6RAN01AFQ0P_cs_nbp_rc cs_nbp=31-445 sample=700113576 rbarcode=TCGTTGTC primer=V3-V5 subject=764346198 body_site=Mid vagina center=WUGSC
```

A useful dict entry might be:

```
	sequences = {}
	sequences['GFU6RAN01AFQ0P'] = { 
		'sample': 700113576, 
		'rbarcode' : 'TCGTTGTC',
		'primer' : 'V3-V5',
		'subject' : 764346198,
		'body_site' : 'Mid vagina',
		'center' : 'WUGSC'
		}
```

Note that this is a dict of dicts. That can be *very* useful.

or, equivalently

```
	sequences = {
		'GFU6RAN01AFQ0P': 
			{
			'sample': 700113576, 
			'rbarcode' : 'TCGTTGTC',
			'primer' : 'V3-V5',
			'subject' : 764346198,
			'body_site' : 'Mid vagina',
			'center' : 'WUGSC'
			}
		}
```

Look up all information about a specific read:

```
	sequences['GFU6RAN01AFQ0P']
```

From which patient was this DNA taken? Who did the sequencing?

```
	sequences['GFU6RAN01AFQ0P']['subject']
	sequences['GFU6RAN01AFQ0P']['center']
```
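An entry like this does not have to be typed by hand. Here is one possible sketch of building it from the header line. It assumes that the read name is the part of the first field before the first underscore, and that any field without an `=` belongs to the previous value (as in `Mid vagina`). Unlike the hand-written dict, all values stay strings and the `cs_nbp` field is kept.

```python
header = ("GFU6RAN01AFQ0P_cs_nbp_rc cs_nbp=31-445 sample=700113576 "
          "rbarcode=TCGTTGTC primer=V3-V5 subject=764346198 "
          "body_site=Mid vagina center=WUGSC")

fields = header.split()

# assumption: the read name is the part of the first field before the first '_'
read_id = fields[0].split('_')[0]

info = {}
last_key = None
for field in fields[1:]:
    if '=' in field:
        # a new key=value pair
        last_key, value = field.split('=', 1)
        info[last_key] = value
    else:
        # no '=': treat it as the rest of the previous value ("Mid vagina")
        info[last_key] = info[last_key] + ' ' + field

sequences = {}
sequences[read_id] = info

print(sequences['GFU6RAN01AFQ0P']['body_site'])
print(sequences['GFU6RAN01AFQ0P']['center'])
```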



## Other useful functions



### zip(): good for combining lists into a dict

*zip(a, b)* returns a zip object (an iterator) that yields tuples, where the first entry of each tuple is from *a* and the second is from *b*

For example, remember that we have:

```
    sequence="CGGATCGNNAAGCTCTGTTGTTGGTGANNNYYGGATAYAGGUUNYGTAACTGGCCT"
    nucs=['A','C','G','T']
    amb=['N','Y','U']
    char_counts = [ sequence.count(x) for x in nucs+amb ]
```

Try this:

```
	zip( nucs+amb, char_counts )
	list(zip( nucs+amb, char_counts ))
```

Another way to make a dict:

```
	dict( zip( nucs+amb, char_counts ) )
```

Yet another, using a dict comprehension:

```
	{ x:y for (x,y) in zip( nucs+amb, char_counts ) }
```



### enumerate()

*enumerate(b)* returns an enumerate object (an iterator) that yields tuples, where the first entry of each tuple is the
position (starting at 0) of the element in *b* and the second is the actual element in *b*.

Try this:

```
	enumerate( nucs+amb )    # enumerate takes one iterable (plus an optional integer start)
	list(enumerate( nucs+amb ))
```

This can be useful for printing a sequence of data with "line numbers".
For example, if you have a list of lines of poetry, you can print them with line numbers (starting at 0) like this:

```
	poem = ['this is the way the world ends',
			'this is the way the world ends',
			'this is the way the world ends',
            '\n',
			'not with a bang',
            '\n',
			'but a whimper']
	for (line_number, line) in enumerate(poem):
		print(line_number, line)
```
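If numbering from 0 is not what you want, *enumerate()* accepts an optional integer *start* argument. A short sketch, using the `nucs` list from earlier:

```python
nucs = ['A', 'C', 'G', 'T']

# default numbering starts at 0
print(list(enumerate(nucs)))

# start=1 gives human-style numbering
for (number, nuc) in enumerate(nucs, start=1):
    print(number, nuc)
```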


# (if time) Tuples and sets

## Tuples

Tuples are (iterable) sequences of values. Unlike lists (but like strings) they are immutable. 

Syntax 

>(value, value, value...)

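You have seen tuples before! Each entry produced by
>zip( nucs+amb, char_counts )

is a tuple whose first element is a string and whose second element is an int.

You can also build a tuple directly. This is a tuple whose first element is a list of strings and whose second element is a list of ints:

>( nucs+amb, char_counts )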

Indexing and slicing tuples is same as with lists.

There are only two tuple methods. (compare to string!)

>tuple.count()
>tuple.index()

Examples:

```
    sequence="CGGATCGNNAAGCTCTGTTGTTGGTGANNNYYGGATAYAGGUUNYGTAACTGGCCT"
    nucs=['A','C','G','T']
    amb=['N','Y','U']
    char_counts = [ sequence.count(x) for x in nucs+amb ]

    nuc_tuple = ( nucs+amb, char_counts )
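    nuc_tuple[:-1]    # slicing a tuple gives a tuple

    tuple(sequence)    # a tuple of single characters

    sequence.count('A')

    # the same count done by hand
    counts = 0
    for next_char in sequence:
        if next_char == 'A':
            counts = counts + 1

    # tuple packing/unpacking: swap two values without a temporary variable
    x, y = 0, 1
    print(x,y)
    x, y = y, x
    print(x,y)

    sequence[0] = 'X'    # TypeError: strings (like tuples) are immutable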
```       
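
The two tuple methods mentioned above can be tried on a small tuple. The `bases` tuple here is made up for illustration.

```python
bases = ('A', 'C', 'G', 'A', 'T', 'A')

print(bases.count('A'))   # how many times 'A' appears
print(bases.index('G'))   # position of the first 'G'

# tuples are immutable: assigning to an element raises TypeError
try:
    bases[0] = 'X'
except TypeError:
    print('tuples cannot be changed in place')
```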

## Sets

Sets are iterable, unordered collections of *unique* elements (like mathematical sets).

Useful methods (note that they use the *object.method(other_object)* format we saw with *str.join(list)*):

>set.add(item)
>set.remove(item)

>set.difference(set)
>set.intersection(set)
>set.symmetric_difference(set)
>set.union(set)

```
mammals = {'horse', 'platypus', 'cat', 'human'}
quadrapeds = {'horse', 'platypus', 'dog'}
placental_mammals = {'horse', 'cat', 'human', 'fish'} # remember to check your data for accuracy!

placental_mammals.remove('fish')
placental_mammals.add('rat')

hairy_quadrapeds = quadrapeds.intersection(placental_mammals)
hairy_quadrapeds_placentals = quadrapeds.intersection(placental_mammals.intersection(mammals))

'booze_hound' in mammals
'dog' in quadrapeds

all_animals = mammals.union(quadrapeds.union(placental_mammals))
len(all_animals)
```
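
The remaining two methods from the list, *difference* and *symmetric_difference*, work the same way. A short sketch, reusing two of the sets from above:

```python
mammals = {'horse', 'platypus', 'cat', 'human'}
quadrapeds = {'horse', 'platypus', 'dog'}

# in mammals but not in quadrapeds
print(mammals.difference(quadrapeds))

# in exactly one of the two sets
print(mammals.symmetric_difference(quadrapeds))

# these methods return new sets; the originals are unchanged
print(len(mammals), len(quadrapeds))
```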

Common use: get rid of duplicates (the original order of the list is not guaranteed to survive):

```
mammals = [ 'horse', 'platypus', 'cat', 'human', 'horse', 'horse']
print(mammals)
mammals = list(set(mammals))
print(mammals)
```

Sets are iterable, so they can be used in loops:

```
for next_animal in mammals:
   print('my next favorite animal', next_animal)
```
