# Module 2 - Programming Assignment

## Directions

1. Change the name of this file to be your JHED id as in `jsmith299.ipynb`. Be sure you use your JHED ID (it's made out of your name and not your student id which is just letters and numbers).
2. Make sure the notebook you submit is cleanly and fully executed. I do not grade unexecuted notebooks.
3. Submit your notebook back in Blackboard where you downloaded this file.

*Provide the output **exactly** as requested*

In [1]:
from pprint import pprint

## Local Search - Genetic Algorithm

There are some key ideas in the Genetic Algorithm.

First, there is a problem of some kind that either *is* an optimization problem or the solution can be expressed in terms of an optimization problem.
For example, if we wanted to minimize the function

$$f(x) = \sum_{i=1}^{n} (x_i - 0.5)^2$$

where $n = 10$.
This *is* an optimization problem. Normally, optimization problems are much, much harder.
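
As a minimal sketch, the function above can be written in a few lines of Python. The name `sphere_objective` is hypothetical and not part of the assignment.

```python
from typing import List

def sphere_objective(xs: List[float]) -> float:
    # Sum of squared distances of each coordinate from 0.5
    return sum((x - 0.5) ** 2 for x in xs)

best = sphere_objective([0.5] * 10)   # every coordinate sits at the optimum
worse = sphere_objective([0.0] * 10)  # each of the 10 terms contributes 0.25
```

The minimum value is 0, reached when every $x_i$ equals 0.5.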

![Eggholder](http://www.sfu.ca/~ssurjano/egg.png)

The function we wish to optimize is often called the **objective function**.
The objective function is closely related to the **fitness** function in the GA.
If we have a **maximization** problem, then we can use the objective function directly as a fitness function.
If we have a **minimization** problem, then we need to convert the objective function into a suitable fitness function, since fitness functions must always mean "more is better".
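
One common conversion, shown here only as an illustration, maps a non-negative objective value $f$ to $1 / (1 + f)$ so that smaller objective values produce larger fitness. The name `to_fitness` is an assumption, and other conversions (such as negating the objective) can work as well.

```python
def to_fitness(objective_value: float) -> float:
    # Assumes objective_value >= 0; a perfect objective of 0 maps to fitness 1.0
    return 1.0 / (1.0 + objective_value)

fitness_perfect = to_fitness(0.0)
fitness_poor = to_fitness(9.0)
```

With this mapping, "more is better" holds: a lower loss always yields a higher fitness.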

Second, we need to *encode* candidate solutions using an "alphabet" analogous to G, A, T, C in DNA.
This encoding can be quite abstract.
You saw this in the Self Check.
There, a floating point number was encoded as bits, just as in a computer, and a sophisticated decoding scheme was then required.

Sometimes, the encoding need not be very complicated at all.
For example, in the real-valued GA, discussed in the Lectures, we could represent 2.73 as....2.73.
This is similarly true for a string matching problem.
We *could* encode "a" as "a", 97, or '01100001'.
And then "hello" would be:

```
["h", "e", "l", "l", "o"]
```

or

```
[104, 101, 108, 108, 111]
```

or

```
0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1
```

In Genetics terminology, this is the **chromosome** of the individual. And if this individual had the **phenotype** "h" for the first character then they would have the **genotype** for "h" (either as "h", 104, or 01101000).
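
The three encodings of "hello" listed above can be produced with Python built-ins. This is a small illustration; the variable names are ours.

```python
word = "hello"

as_chars = list(word)                                  # one gene per character
as_codes = [ord(ch) for ch in word]                    # one gene per decimal code
as_bitstrings = [format(ord(ch), "08b") for ch in word]  # 8 bits per character
```

Each list is a different genotype for the same phenotype, the string "hello".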

To keep it straight, think **geno**type is **genes** and **pheno**type is **phenomenon**, the actual thing that the genes express.
So while we might encode a number as 10110110 (genotype), the number itself, 182, is what goes into the fitness function.
The environment operates on zebras, not the genes for stripes.

## String Matching

You are going to write a Genetic Algorithm that will solve the problem of matching a target string (at least at the start).
Now, this is kind of silly because in order for this to work, you need to know the target string and if you know the target string, why are you trying to do it?
Well, the problem is *pedagogical*.
It's a fun way of visualizing the GA at work, because as the GA finds better and better candidates, they make more and more sense.

Now, string matching is not *directly* an optimization problem so this falls under the general category of "if we convert the problem into an optimization problem we can solve it with an optimization algorithm" approach to problem solving.
This happens all the time.
We have a problem.
We can't solve it.
We convert it to a problem we *can* solve.
In this case, we're using the GA to solve the optimization part.

And all we need is some sort of measure of the difference between two strings.
We can use that measure as a **loss function**.
A loss function gives us a score that tells us how far apart two strings are: the lower the score, the more similar they are.
The loss function becomes our objective function and we use the GA to minimize it by converting the objective function to a fitness function.
So that's the first step, come up with the loss/objective function.
The only stipulation is that it must calculate the score based on element to element (character to character) comparisons with no global transformations of the candidate or target strings.
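
One possible loss that respects this stipulation is the sum of absolute differences between character codes, taken one position at a time. The sketch below is only an example of the idea; the name `char_loss` is hypothetical.

```python
def char_loss(candidate: str, target: str) -> int:
    # Compare position by position; no whole-string transformation is applied
    assert len(candidate) == len(target)
    total = 0
    for i in range(len(target)):
        total += abs(ord(candidate[i]) - ord(target[i]))
    return total

exact = char_loss("hello", "hello")
near = char_loss("hellp", "hello")
```

A perfect match scores 0, and every character that is off by one code point adds 1 to the loss.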

And since this is a GA, we need a **genotype**.
The genotype for this problem is a list of "characters" (individual letters aren't special in Python like they are in some other languages):

```
["h", "e", "l", "l", "o"]
```

and the **phenotype** is the resulting string:

```
"hello"
```

In addition to the generic code and problem specific loss function, you'll need to pick parameters for the run.
These parameters include:

1. population size
2. number of generations
3. probability of crossover
4. probability of mutation

You will also need to pick a selection algorithm, either roulette wheel or tournament selection.
In the latter case, you will need a tournament size.
This is all part of the problem.
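
For comparison with tournament selection, here is a minimal sketch of roulette wheel selection. It assumes fitness values are non-negative, that "more is better", and that at least one value is positive; the name `roulette_select` is hypothetical.

```python
import random
from typing import List, Sequence

def roulette_select(population: Sequence[str], fitnesses: List[float]) -> str:
    # Probability of being picked is proportional to fitness
    return random.choices(population, weights=fitnesses, k=1)[0]

random.seed(0)  # fixed seed only so the sketch is repeatable
picks = [roulette_select(["aaa", "bbb", "ccc"], [0.0, 1.0, 3.0]) for _ in range(1000)]
```

An individual with zero fitness is never chosen, and "ccc" should be chosen roughly three times as often as "bbb".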

Every **ten** (10) generations, you should print out the fitness, genotype, and phenotype of the best individual in the population for the specific generation.
The function should return the best individual *of the entire run*, using the same format.

In [2]:
ALPHABET = "abcdefghijklmnopqrstuvwxyz "

<a id="Helper Functions"></a>
## Helper Functions

<a id="imports"></a>
### imports

In [3]:
from typing import List, Tuple, Dict, Callable
import random
from numpy.random import randint
from numpy.random import rand
from copy import deepcopy

<a id="ENCODER"></a>
### ENCODER

`ENCODER` creates a dictionary encoding mechanism using decimal codes for each letter / character in the alphabet. 

In [4]:
ENCODER = {'a': 97, 'b': 98, 'c': 99, 'd': 100, 'e': 101, 'f': 102, 'g':103, 'h':104, 'i':105, 'j':106, 'k':107, 
          'l':108, 'm':109, 'n':110, 'o':111, 'p':112, 'q':113, 'r':114, 's':115, 't':116, 'u':117, 'v':118, 'w':119, 
          'x':120, 'y':121, 'z':122, ' ':32}

<a id="encode"></a>
### encode

`encode` uses an encoding dictionary to encode a target string in decimal values. **Used by**: [generate_population](#generate_population), [evaluate](#evaluate)

***Inputs:***
* **target**: String: target string - value we're trying to reproduce
* **encoder**: Dict[str, int]: Dictionary of encoding scheme, letters > decimal values

**returns** genotype: encoded chromosome of target string


In [5]:
def encode(target: str, encoder: Dict[str, int]): 
    genotype = []
    for char in target: 
        genotype.append(encoder[char])
        
    return genotype

In [6]:
#Tests / Assertions
test_target1 = 'abc'
test_genotype = [97, 98, 99]
assert encode(test_target1, ENCODER) == test_genotype

test_target2 = 'a b c '
test_genotype = [97, 32, 98, 32, 99, 32]
assert encode(test_target2, ENCODER) == test_genotype

test_target3 = 'z a q'
test_genotype = [122, 32, 97, 32, 113]
assert encode(test_target3, ENCODER) == test_genotype

<a id="decode"></a>
### decode

`decode` takes in a genotype and converts it back to letters in the alphabet. **Used by**: [get_best](#get_best)

***Inputs:***
* **genotype**: List[int]: encoded value (chromosome) of interest
* **encoder**: Dict[str, int]: Dictionary of encoding scheme, letters > decimal values

**returns** phenotype: decoded chromosome in letter / alphabet format


In [7]:
def decode(genotype: List[int], encoder: Dict[str, int]): 
    phenotype = []
    for code in genotype: 
        phenotype.append(next(key for key, value in encoder.items() if value == code))
    # phenotype = ''.join(phenotype)    
    return phenotype

In [8]:
#Tests / Assertions
test_genotype1 = [97, 98, 99]
test_phenotype1 = 'abc'
assert ''.join(decode(test_genotype1, ENCODER)) == test_phenotype1

test_genotype2 = [97, 32, 98, 32, 99, 32]
test_phenotype2 = 'a b c '
assert ''.join(decode(test_genotype2, ENCODER)) == test_phenotype2

test_genotype3 = [122, 32, 97, 32, 113]
test_phenotype3 = 'z a q'
assert ''.join(decode(test_genotype3, ENCODER)) == test_phenotype3

<a id="generate_population"></a>
### generate_population

`generate_population` creates a randomized population of individuals (chromosomes / genotypes) using size and length parameters, an alphabet, and an encoder dictionary. **Used by**: [genetic_algorithm](#genetic_algorithm)

***Inputs:***
* **pop_size**: Integer: Number of individuals in population
* **indiv_length**: Integer: Length of each individual chromosome in the population; this should match the target string length
* **alphabet**: String: String of characters used in target string, and phenotypes
* **encoder**: Dict[str, int]: Dictionary of encoding scheme, letters > decimal values

**returns** population: List[List[int]] of chromosomes in population

In [9]:
def generate_population(pop_size: int, indiv_length: int, alphabet: str, encoder: Dict[str, int]): 
    population = []
    for _ in range(pop_size): #iterate over population size
        random_string = ''.join(random.choice(alphabet) for _ in range(indiv_length))
        code = encode(random_string, encoder)
        population.append(code)
    
    return population 

In [10]:
# Tests / Assertions
test_pop = generate_population(5, 7, ALPHABET, ENCODER)
assert len(test_pop) == 5
assert len(test_pop[0]) == 7
assert type(test_pop) == list

<a id="evaluate"></a>
### evaluate

`evaluate` takes in a population, compares each individual to a target string on a character to character level, and assigns a fitness score for each individual. **Used by**: [genetic_algorithm](#genetic_algorithm)

***Inputs:***
* **population**: List[List[int]]: Full population of chromosomes / genotypes
* **target**: String: Target string we're trying to reproduce
* **encoder**: Dict[str, int]: Dictionary of encoding scheme, letters > decimal values
* **method**: String: Evaluation method - linked to Problems 1 - 3: 'default' looks for the closest identical string, 'reverse' looks for the target string reversed, and 'ROT13' looks for the target string encoded with the Caesar Cypher

**returns** pop_scores: List[int] of fitness scores for each individual in a population; same size as *population*

In [11]:
def evaluate(population: List[List[int]], target: str, encoder: Dict[str, int], method='default'): 
    genotype = encode(target, encoder) #target
    pop_scores = []
    for indiv in population: 
        assert len(indiv) == len(genotype) #double check same length
        score = 0 #reset
        for i in range(len(indiv)): 
            if method == 'default': 
                score += abs(genotype[i] - indiv[i]) #lower score is better
            elif method == 'reverse': 
                score += abs(genotype[(len(genotype)-1-i)] - indiv[i]) #lower score is better
            elif method == 'ROT13': 
                r_index = (genotype[i] - 97 + 13) % 26 + 97
                score += abs(r_index - indiv[i]) #lower score is better
            else: 
                print('Evaluation error! Method not acceptable')
                return None
        pop_scores.append(score)
    return pop_scores

In [12]:
# Assertions / Tests
eval_target = 'abc'
eval_pop = [[116, 103, 112], [114, 114, 98], [115, 100, 103]]
eval_scores = evaluate(eval_pop, eval_target, ENCODER)
assert eval_scores == [37, 34, 24]

eval_pop2 = [[97, 98, 99]]
eval_scores2 = evaluate(eval_pop2, eval_target, ENCODER)
assert eval_scores2 == [0]

eval_target = 'cba'
eval_pop3 = [[97, 98, 99]]
eval_scores3 = evaluate(eval_pop3, eval_target, ENCODER, method='reverse')
assert eval_scores3 == [0]
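
The assertions above do not exercise the `'ROT13'` branch. A small check of its per-character arithmetic, written as a hypothetical helper `rot13_code` rather than a call into `evaluate`, might look like this:

```python
def rot13_code(code: int) -> int:
    # Same arithmetic as the 'ROT13' branch: rotate within the 26 lowercase letters
    return (code - 97 + 13) % 26 + 97

# Worked example: 'g' is 103, and (103 - 97 + 13) % 26 + 97 = 116, which is 't'
decoded = "".join(chr(rot13_code(ord(ch))) for ch in "guvf")
```

Note that the formula assumes codes in the range 97 to 122, so it does not handle the space character (32).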

<a id="pick_parents"></a>
### pick_parents

`pick_parents` takes in a population and population scores (from the *evaluate* function), and uses Tournament Selection to select two parents from the population. This function draws 7 random individuals from the population (with replacement, so the same individual can be drawn more than once), looks up each one's score, and selects the one with the best fitness score (lowest evaluation score). This is run twice to select two parents in total. **Used by**: [reproduce](#reproduce)

***Inputs:***
* **population**: List[List[int]]: Full population of chromosomes / genotypes
* **pop_scores**: List[int]: Fitness scores for each individual in population

**returns** parents: List[List[int]] two individuals selected from initial population for use in reproduction 

In [13]:
def pick_parents(population: List[List[int]], pop_scores: List[int]): 
    assert len(population) == len(pop_scores) #indices line up
    assert len(population) >= 6
    parents = [] #init
    
    for iteration in range(2): 
        indices = []
        for i in range(7): indices.append(random.randint(0,len(population)-1)) #pull 7 indices randomly
       
        subscores, subpop = [], []
        for j in indices: 
            subscores.append(pop_scores[j]) #pull scores for those 7 index vals
            subpop.append(population[j]) #pull indiv w/ same index
        sub_index = subscores.index(min(subscores)) #min score sub-index
        
        parents.append(subpop[sub_index]) #add parent individual / genotype
    
    return parents

In [14]:
# Tests / Assertions
eval_target = 'abcdefg'
test_pop = generate_population(7, 7, ALPHABET, ENCODER)
test_scores = evaluate(test_pop, eval_target, ENCODER)
# test_pop[test_scores.index(min(test_scores))]
parents = pick_parents(test_pop, test_scores)
assert len(parents) == 2
assert type(parents) == list
assert len(parents[0]) == 7

<a id="crossover"></a>
### crossover

`crossover` takes in two parents and a probability of crossover, determines if crossover will happen, then creates two children from the two parents. If crossover does not occur, the parents are returned. **Used by**: [reproduce](#reproduce)

***Inputs:***
* **parents**: List[List[int]]: Two individuals selected from overall population
* **p_cross**: Float: probability of crossover, usually around 0.8

**returns** children: List[List[int]] two children resulting from parental crossover

In [15]:
def crossover(parents: List[List[int]], p_cross: float): 
    indiv1, indiv2 = parents[0], parents[1]
    child1, child2 = [], [] #init
    gene_index = randint(0,len(indiv1))
    
    if rand() < p_cross: #crossover happens
        for i in range(0, gene_index): #build first portion of children
            child1.append(indiv1[i])
            child2.append(indiv2[i]) 
        for j in range(gene_index, len(indiv1)): 
            child1.append(indiv2[j])
            child2.append(indiv1[j])
    else: #no crossover
        return parents
        
    return [child1, child2]

In [16]:
# Tests / Assertions
eval_target = 'abcdefg'
test_pop = generate_population(7, 7, ALPHABET, ENCODER)
test_scores = evaluate(test_pop, eval_target, ENCODER)
parents = pick_parents(test_pop, test_scores)
children = crossover(parents, 0.8)
assert len(children) == 2
assert len(children[0]) == len(parents[0])
assert type(children) == list
# assert children[0][0] == parents[0][0]
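
Because `crossover` draws its cut point at random, its output is hard to check by eye. A deterministic sketch of the same one-point scheme with a fixed cut (the helper `one_point` is hypothetical) shows what the children look like:

```python
from typing import List, Tuple

def one_point(p1: List[int], p2: List[int], cut: int) -> Tuple[List[int], List[int]]:
    # Each child takes the head of one parent and the tail of the other
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

child1, child2 = one_point([97, 98, 99, 100], [119, 120, 121, 122], cut=2)
```

Here `child1` starts with the first two genes of the first parent and ends with the last two genes of the second, and `child2` is the mirror image.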

<a id="mutate"></a>
### mutate

`mutate` takes in two children and a probability of mutation, determines if mutation will happen, then creates two mutated children from the two original children. If mutation does not occur, the original children are returned. **Used by**: [reproduce](#reproduce)

***Inputs:***
* **children**: List[List[int]]: Two individual children generated from two selected parents in the overall population
* **p_mut**: Float: probability of mutation, usually around 0.05
* **encoder**: Dict[str,int]: Dictionary of encoding scheme, letters > decimal values

**returns** mutants: List[List[int]] two children resulting from mutation

In [17]:
def mutate(children: List[List[int]], p_mut: float, encoder: Dict[str,int]): 
    mutants = deepcopy(children) 
    for idx in range(len(children)): #idx avoids shadowing the built-in iter
        if rand() < p_mut: #mutation occurs
            mutation = random.choice(list(encoder.values()))
            mut_loc = randint(0, len(children[0]))
            mutants[idx][mut_loc] = mutation
    return mutants

In [18]:
# Tests / Assertions
eval_target = 'abcdefg'
test_pop = generate_population(7, 7, ALPHABET, ENCODER)
test_scores = evaluate(test_pop, eval_target, ENCODER)
parents = pick_parents(test_pop, test_scores)
children = crossover(parents, 0.8)
mutants = mutate(children, 0.05, ENCODER)
assert len(mutants) == len(children)
assert len(mutants[0]) == len(children[0])
assert type(mutants) == list

<a id="reproduce"></a>
### reproduce

`reproduce` takes in a population, population scores, probabilities of crossover and mutation, the alphabet, and an encoder to select parents and reproduce into a new generation. This function repeats the selection / crossover / mutation process until the new population is the same size as the original population. The new population is then returned. **Used by**: [genetic_algorithm](#genetic_algorithm)

***Inputs:***
* **population**: List[List[int]]: Full population of chromosomes / genotypes
* **pop_scores**: List[int]: Fitness scores for each individual in population
* **p_cross**: Float: probability of crossover, usually around 0.8
* **p_mut**: Float: probability of mutation, usually around 0.05
* **alphabet**: String: String of characters used in target string, and phenotypes
* **encoder**: Dict[str,int]: Dictionary of encoding scheme, letters > decimal values

**returns** new_pop: List[List[int]] new population resulting from parental selection, crossover, and mutation; equal in size to original population

In [19]:
def reproduce(population: List[List[int]], pop_scores: List[int], p_cross: float, p_mut: float, 
              alphabet: str, encoder: Dict[str, int]): 
    new_pop = []
    while len(new_pop) < len(population): #'<' avoids an infinite loop when the population size is odd
        parents = pick_parents(population, pop_scores) #pick parents
        children = crossover(parents, p_cross) #crossover
        mutants = mutate(children, p_mut, encoder) #mutation
        child1, child2 = mutants[0], mutants[1]
        new_pop.append(child1)
        new_pop.append(child2)
    
    return new_pop[:len(population)] #trim the extra child if the size is odd

In [20]:
# Tests / Assertions
eval_target = 'abcdefg'
test_pop = generate_population(10, 7, ALPHABET, ENCODER)
test_scores = evaluate(test_pop, eval_target, ENCODER)
parents = pick_parents(test_pop, test_scores)
children = crossover(parents, 0.8)
mutants = mutate(children, 0.05, ENCODER)
new_pop = reproduce(test_pop, test_scores, 0.8, 0.05, ALPHABET, ENCODER)
new_scores = evaluate(new_pop, eval_target, ENCODER)
assert len(new_pop) == len(test_pop)
assert len(new_pop[0]) == len(test_pop[0])
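# There is no elitism, so min(new_scores) <= min(test_scores) is likely but not guaranteed; check gene validity instead
assert all(gene in ENCODER.values() for indiv in new_pop for gene in indiv)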

<a id="get_best"></a>
### get_best

`get_best` takes in a population, population scores, and an encoder dictionary to find the best individual in the given population. The function finds the best (minimum) fitness score, and uses that index to identify the associated genotype. This value is then decoded using the encoder dictionary. This best individual is stored in a custom dictionary, then returned. **Used by**: [genetic_algorithm](#genetic_algorithm)

***Inputs:***
* **population**: List[List[int]]: Full population of chromosomes / genotypes
* **pop_scores**: List[int]: Fitness scores for each individual in population
* **encoder**: Dict[str,int]: Dictionary of encoding scheme, letters > decimal values

**returns** best_indiv: Dictionary of best individual genotype, phenotype, and fitness score from a given population.

In [21]:
def get_best(population: List[List[int]], pop_scores: List[int], encoder: Dict[str,int]): 
    best_index = pop_scores.index(min(pop_scores))
    genotype = population[best_index] #best individual
    phenotype = decode(genotype, encoder)
    fitness = pop_scores[best_index]
    phenotype_f = ''.join(phenotype)

    best_indiv = {'genotype': genotype, 'phenotype': phenotype, 'fitness': fitness, 'formatted_phenotype': phenotype_f}
    return best_indiv

In [22]:
# Tests / Assertions
eval_target = 'abcdefg'
test_pop = generate_population(10, 7, ALPHABET, ENCODER)
test_scores = evaluate(test_pop, eval_target, ENCODER)
best_indiv = get_best(test_pop, test_scores, ENCODER)
assert type(best_indiv) == dict
assert best_indiv['fitness'] == min(test_scores)
assert best_indiv['phenotype'] == decode(best_indiv['genotype'], ENCODER)

<a id="ga_print"></a>
### ga_print

`ga_print` takes in a generation number and best individual in a population, decomposes this dictionary, and prints out the requested output. This is called for every 10th generation in *genetic_algorithm*. **Used by**: [genetic_algorithm](#genetic_algorithm)

***Inputs:***
* **generation**: Integer: generation number within genetic_algorithm loop
* **best_indiv**: Dictionary: Best individual from given generation & population; includes genotype, phenotype, and fitness score.

**returns** None: print statements

In [23]:
def ga_print(generation: int, best_indiv: Dict):
    genotype = best_indiv['genotype']
    phenotype = best_indiv['phenotype']
    fitness = best_indiv['fitness']
    print('Gen: ', generation, ' || Phenotype: ', ''.join(phenotype), ' || Genotype: ', genotype, ' || Fitness: ', fitness)
    return None

In [24]:
# Tests / Assertions
eval_target = 'abcdefg'
test_pop = generate_population(10, 7, ALPHABET, ENCODER)
test_scores = evaluate(test_pop, eval_target, ENCODER)
best_indiv = get_best(test_pop, test_scores, ENCODER)
ga_print(10, best_indiv)

Gen:  10  || Phenotype:  ikjiqhj  || Genotype:  [105, 107, 106, 105, 113, 104, 106]  || Fitness:  46


<div style="background: #4682b4">

***JW Note: Given this is a simple print function, no additional unit tests / assertions are necessary***

</div>

<a id="genetic_algorithm"></a>
### genetic_algorithm

`genetic_algorithm` is the main driver function for the genetic algorithm logic. This function calls individual helper functions to create a population, pick parents, reproduce into a new generation, and continue to loop until a generation limit is met. Every 10th generation calls the *ga_print* function to output the best individual's genotype, phenotype, and fitness score. 

***Inputs:***
* **target**: String: set of letters we're trying to replicate
* **pop_size**: Integer: Desired number of individuals in a population
* **num_gens**: Integer: Iteration limit for algorithm, Number of total generations
* **p_cross**: Float: probability of crossover, usually around 0.8
* **p_mut**: Float: probability of mutation, usually around 0.05
* **alphabet**: String: String of characters used in target string, and phenotypes
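* **encoder**: Dict[str,int]: Dictionary of encoding scheme, letters > decimal values
* **method**: String: Evaluation method passed through to *evaluate*: 'default', 'reverse', or 'ROT13'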

**returns** best_overall: best individual throughout all generations

In [25]:
def genetic_algorithm(target: str, pop_size: int, num_gens: int, p_cross: float, p_mut: float, alphabet: str, 
                     encoder: Dict[str, int], method='default'):
    population = generate_population(pop_size, len(target), alphabet, encoder) #init
    for gen in range(num_gens): 
        pop_scores = evaluate(population, target, encoder, method) #evaluate
        best_indiv = get_best(population, pop_scores, encoder) #check best
        if (gen == 0) or (best_indiv['fitness'] <= best_overall['fitness']): #first generation, or at least as good as the best so far
            best_overall = best_indiv
        
        if best_overall['fitness'] == 0: # stop & return
            print('Solution reached! Generation: ', gen, ' Best: ')
            ga_print(gen, best_overall)
            return best_overall
        
        if gen % 10 == 0: #call print
            ga_print(gen, best_indiv)
        
        population = reproduce(population, pop_scores, p_cross, p_mut, alphabet, encoder)
    return best_overall # return the best individual of the entire run.

<div style="background: #4682b4">

***JW Note: No assertions necessary for main function; follow-on Problems will prove out function***

</div>

## Problem 1

The target is the string "this is so much fun".
The challenge, aside from implementing the basic algorithm, is deriving a fitness function based on "b" - "p" (for example).
The fitness function should come up with a fitness score based on element to element comparisons between target v. phenotype.

In [26]:
target1 = "this is so much fun"

In [27]:
pop_size = 100
num_gens = 1000
p_cross = 0.8
p_mut = 0.05

In [28]:
result1 = genetic_algorithm(target1, pop_size, num_gens, p_cross, p_mut, ALPHABET, ENCODER) # do what you need to do for your implementation but don't change the lines above or below.

Gen:  0  || Phenotype:  lwbiyiv fvjxkzm ghi  || Genotype:  [108, 119, 98, 105, 121, 105, 118, 32, 102, 118, 106, 120, 107, 122, 109, 32, 103, 104, 105]  || Fitness:  294
Gen:  10  || Phenotype:  shgl gn qn kyap btl  || Genotype:  [115, 104, 103, 108, 32, 103, 110, 32, 113, 110, 32, 107, 121, 97, 112, 32, 98, 116, 108]  || Fitness:  43
Gen:  20  || Phenotype:  shhv gt qn luab ctl  || Genotype:  [115, 104, 104, 118, 32, 103, 116, 32, 113, 110, 32, 108, 117, 97, 98, 32, 99, 116, 108]  || Fitness:  26
Gen:  30  || Phenotype:  shhv gt qn lucf ctn  || Genotype:  [115, 104, 104, 118, 32, 103, 116, 32, 113, 110, 32, 108, 117, 99, 102, 32, 99, 116, 110]  || Fitness:  18
Gen:  40  || Phenotype:  shhv gt qn luch ctn  || Genotype:  [115, 104, 104, 118, 32, 103, 116, 32, 113, 110, 32, 108, 117, 99, 104, 32, 99, 116, 110]  || Fitness:  16
Gen:  50  || Phenotype:  shhv gt tn luch ftn  || Genotype:  [115, 104, 104, 118, 32, 103, 116, 32, 116, 110, 32, 108, 117, 99, 104, 32, 102, 116, 110]  || Fitness:

In [29]:
pprint(result1, compact=True)

{'fitness': 0,
 'formatted_phenotype': 'this is so much fun',
 'genotype': [116, 104, 105, 115, 32, 105, 115, 32, 115, 111, 32, 109, 117, 99,
              104, 32, 102, 117, 110],
 'phenotype': ['t', 'h', 'i', 's', ' ', 'i', 's', ' ', 's', 'o', ' ', 'm', 'u',
               'c', 'h', ' ', 'f', 'u', 'n']}


## Problem 2

You should have working code now.
The goal here is to think a bit more about fitness functions.
The target string is now, 'nuf hcum os si siht'.
This is obviously target #1 but reversed.
If we just wanted to match the string, this would be trivial.
Instead, in this problem, we want to "decode" the string so that the best individual displays the target forwards.
In order to do this, you'll need to come up with a fitness function that measures how successful candidates are towards this goal.
The constraint is that you may not perform any global operations on the target or individuals.
Your fitness function must still compare a single gene against a single gene.
Your solution will likely not be Pythonic but use indexing.
That's ok.
<div style="background: lemonchiffon; margin:20px; padding: 20px;">
    <strong>Important</strong>
    <p>
        You may not reverse an entire string (either target or candidate) at any time.
        Everything must be a computation of one gene against one gene (one letter against one letter).
        Failure to follow these directions will result in 0 points for the problem.
    </p>
</div>

The best individual in the population is the one who expresses this string *forwards*.

In [30]:
target2 = "nuf hcum os si siht"

In [31]:
pop_size = 100
num_gens = 1000
p_cross = 0.8
p_mut = 0.05

In [32]:
result2 = genetic_algorithm(target2, pop_size, num_gens, p_cross, p_mut, ALPHABET, ENCODER, method='reverse')

Gen:  0  || Phenotype:  sbaa vqczkadyhgopti  || Genotype:  [115, 98, 97, 97, 32, 118, 113, 99, 122, 107, 97, 100, 121, 104, 103, 111, 112, 116, 105]  || Fitness:  305
Gen:  10  || Phenotype:  yeez nn xm osbf msz  || Genotype:  [121, 101, 101, 122, 32, 110, 110, 32, 120, 109, 32, 111, 115, 98, 102, 32, 109, 115, 122]  || Fitness:  64
Gen:  20  || Phenotype:  ueiz nn xm osbf dsi  || Genotype:  [117, 101, 105, 122, 32, 110, 110, 32, 120, 109, 32, 111, 115, 98, 102, 32, 100, 115, 105]  || Fitness:  44
Gen:  30  || Phenotype:  ueiz nn vm osbf dsi  || Genotype:  [117, 101, 105, 122, 32, 110, 110, 32, 118, 109, 32, 111, 115, 98, 102, 32, 100, 115, 105]  || Fitness:  42
Gen:  40  || Phenotype:  ueiz jn vm osbf esj  || Genotype:  [117, 101, 105, 122, 32, 106, 110, 32, 118, 109, 32, 111, 115, 98, 102, 32, 101, 115, 106]  || Fitness:  36
Gen:  50  || Phenotype:  ueiz jn qm osbf eso  || Genotype:  [117, 101, 105, 122, 32, 106, 110, 32, 113, 109, 32, 111, 115, 98, 102, 32, 101, 115, 111]  || Fitnes

In [33]:
pprint(result2, compact=True)

{'fitness': 0,
 'formatted_phenotype': 'this is so much fun',
 'genotype': [116, 104, 105, 115, 32, 105, 115, 32, 115, 111, 32, 109, 117, 99,
              104, 32, 102, 117, 110],
 'phenotype': ['t', 'h', 'i', 's', ' ', 'i', 's', ' ', 's', 'o', ' ', 'm', 'u',
               'c', 'h', ' ', 'f', 'u', 'n']}


## Problem 3

This is a variation on the theme of Problem 2.
The version of the Caesar Cypher used here replaces each letter of a string with the letter 13 characters down the alphabet (rotating from "z" back to "a" as needed).
This is also known as ROT13 (for "rotate 13").
Latin did not have spaces (and the space is not contiguous with the letters a-z) so we'll remove them from our alphabet.
Again, the goal is to derive a fitness function that compares a single gene against a single gene, without global transformations.
This fitness function assigns higher scores to individuals that correctly decode the target.

<div style="background: lemonchiffon; margin:20px; padding: 20px;">
    <strong>Important</strong>
    <p>
        You may not apply ROT13 to an entire string (either target or candidate) at any time.
        Everything must be a computation of one gene against one gene.
        Failure to follow these directions will result in 0 points for the problem.
    </p>
</div>

The best individual will express the target *decoded*.

In [34]:
ALPHABET3 = "abcdefghijklmnopqrstuvwxyz"

In [35]:
target3 = "guvfvffbzhpusha"

In [36]:
pop_size = 100
num_gens = 1000
p_cross = 0.8
p_mut = 0.05

In [37]:
result3 = genetic_algorithm(target3, pop_size, num_gens, p_cross, p_mut, ALPHABET3, ENCODER, method='ROT13')

Gen:  0  || Phenotype:  gavtiptmtujfnvf  || Genotype:  [103, 97, 118, 116, 105, 112, 116, 109, 116, 117, 106, 102, 110, 118, 102]  || Fitness:  73
Gen:  10  || Phenotype:  thhtiptmrubfhun  || Genotype:  [116, 104, 104, 116, 105, 112, 116, 109, 114, 117, 98, 102, 104, 117, 110]  || Fitness:  18
Gen:  20  || Phenotype:  thhtiptmrubfgun  || Genotype:  [116, 104, 104, 116, 105, 112, 116, 109, 114, 117, 98, 102, 103, 117, 110]  || Fitness:  17
Gen:  30  || Phenotype:  thhtiptmjubfgun  || Genotype:  [116, 104, 104, 116, 105, 112, 116, 109, 106, 117, 98, 102, 103, 117, 110]  || Fitness:  15
Gen:  40  || Phenotype:  thhtiprmlubfgun  || Genotype:  [116, 104, 104, 116, 105, 112, 114, 109, 108, 117, 98, 102, 103, 117, 110]  || Fitness:  13
Gen:  50  || Phenotype:  thhtitrmlubfgun  || Genotype:  [116, 104, 104, 116, 105, 116, 114, 109, 108, 117, 98, 102, 103, 117, 110]  || Fitness:  11
Gen:  60  || Phenotype:  thhsittolubfgun  || Genotype:  [116, 104, 104, 115, 105, 116, 116, 111, 108, 117, 98, 10

In [38]:
pprint(result3, compact=True)

{'fitness': 0,
 'formatted_phenotype': 'thisissomuchfun',
 'genotype': [116, 104, 105, 115, 105, 115, 115, 111, 109, 117, 99, 104, 102,
              117, 110],
 'phenotype': ['t', 'h', 'i', 's', 'i', 's', 's', 'o', 'm', 'u', 'c', 'h', 'f',
               'u', 'n']}


## Problem 4

There is no code for this problem.

In Problem 3, we assumed we knew what the shift was in ROT-13.
What if we didn't?
Describe how you might solve that problem including a description of the solution encoding (chromosome and interpretation) and fitness function. Assume we can add spaces into the message.

**Answer:** If we assume the letters are still in order and just shifted by a constant value (e.g. 13), and we assume the target value is a readable / logical phrase, we could use a test message and a brute force loop to evaluate the shift. We could create a function to generate an encoding dictionary based on the standard alphabet string and an assumed (variable) offset. This dictionary could be a simple custom numerical encoding (e.g. 1, 2, 3 rather than 32, 97, 98, etc.).

We could run the Genetic Algorithm for ~500 generations in each loop, and loop 27 times (26 letters, one space) each with a different character offset, saving the decoded "best individual" from each run. At the end, we'd have a list of 27 individual decoded strings. From here, we can either manually select the index that makes logical sense (i.e. the phrase that is readable), or we could use an imported language model to determine best fit with our list (create secondary fitness function to parse out words, compare to language model). 

The individual chromosomes and interpretations would be identical in form to the existing implementation, but shifted in values by the constant character offset. The fitness function would still perform a character-to-character evaluation, and the *evaluate* function would still seek to minimize that fitness value within each generation. 
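
A minimal sketch of the brute-force idea in this answer follows. It skips the GA step entirely, ignores spaces, and uses a tiny hypothetical word list (`KNOWN_WORDS`) as a stand-in for a language model; every name in it is an assumption made for illustration.

```python
from typing import Tuple

LETTERS = "abcdefghijklmnopqrstuvwxyz"
KNOWN_WORDS = ["this", "is", "so", "much", "fun"]  # hypothetical stand-in for a language model

def shift_back(message: str, shift: int) -> str:
    # Undo a constant shift one character at a time
    return "".join(LETTERS[(LETTERS.index(ch) - shift) % 26] for ch in message)

def readability(text: str) -> int:
    # Crude score: total length of the known words found in the text
    return sum(len(word) for word in KNOWN_WORDS if word in text)

def best_shift(message: str) -> Tuple[int, str]:
    # Try every shift and keep the decoding that looks most like language
    candidates = [(shift, shift_back(message, shift)) for shift in range(26)]
    return max(candidates, key=lambda pair: readability(pair[1]))

shift, decoded = best_shift("guvfvffbzhpusha")
```

In a full solution, the readability score would play the role of the secondary fitness function described above, and a real language model or dictionary would replace the word list.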

## Challenge

**You do not need to do this problem and it won't be graded if you do. It's just here if you want to push your understanding.**

The original GA used binary encodings for everything.
We're basically using a Base 27 encoding.
You could, however, write a version of the algorithm that uses an 8 bit encoding for each letter (ignore spaces as they're a bit of a bother).
That is, a 5 letter candidate looks like this:

```
0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1
```

If you wrote your `genetic_algorithm` code generally enough, with higher order functions, you should be able to implement it using bit strings instead of Latin strings.
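
As a starting point, the bit-level genotype needs a round trip between strings and flat bit lists. The sketch below assumes 8 bits per character with the most significant bit first; `to_bits` and `from_bits` are hypothetical names.

```python
from typing import List

def to_bits(text: str) -> List[int]:
    # 8 bits per character, most significant bit first
    return [int(b) for ch in text for b in format(ord(ch), "08b")]

def from_bits(bits: List[int]) -> str:
    # Read the genotype back 8 bits at a time
    chunks = [bits[i:i + 8] for i in range(0, len(bits), 8)]
    return "".join(chr(int("".join(map(str, chunk)), 2)) for chunk in chunks)

genotype = to_bits("hello")
phenotype = from_bits(genotype)
```

With this encoding, crossover and mutation operate on individual bits, so a single mutation can move a letter by anywhere from 1 to 128 code points.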

## Before You Submit...

1. Did you provide output exactly as requested?
2. Did you re-execute the entire notebook? ("Restart Kernel and Run All Cells...")
3. If you did not complete the assignment or had difficulty please explain what gave you the most difficulty in the Markdown cell below.
4. Did you change the name of the file to `jhed_id.ipynb`?

Do not submit any other files.