<a href="https://colab.research.google.com/github/lrbenitez/ColabInteligentes/blob/main/Genetics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!git clone -l -s https://github.com/aimacode/aima-python
%cd aima-python
%pip install -r requirements.txt

Cloning into 'aima-python'...
remote: Enumerating objects: 5092, done.
remote: Total 5092 (delta 0), reused 0 (delta 0), pack-reused 5092
Receiving objects: 100% (5092/5092), 17.43 MiB | 28.98 MiB/s, done.
Resolving deltas: 100% (3416/3416), done.
/content/aima-python
Collecting image
  Downloading image-1.5.33.tar.gz (15 kB)
Collecting ipythonblocks
  Downloading ipythonblocks-1.9.0-py2.py3-none-any.whl (13 kB)
Collecting pytest-cov
  Downloading pytest_cov-3.0.0-py3-none-any.whl (20 kB)
Collecting qpsolvers
  Downloading qpsolvers-1.7.1-py3-none-any.whl (35 kB)
Collecting django
  Downloading Django-3.2.9-py3-none-any.whl (7.9 MB)
     |████████████████████████████████| 7.9 MB 47.1 MB/s 
Collecting pytest>=4.6
  Downloading pytest-6.2.5-py3-none-any.whl (280 kB)
     |████████████████████████████████| 280 kB 52.8 MB/s 
Collecting coverage[toml]>=5.2.1
  Downloading coverage-6.1.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x

In [None]:
!git submodule init
!git submodule update

Submodule 'aima-data' (https://github.com/aimacode/aima-data.git) registered for path 'aima-data'
Cloning into '/content/aima-python/aima-data'...
Submodule path 'aima-data': checked out 'f6cbea61ad0c21c6b7be826d17af5a8d3a7c2c86'


In [None]:
from search import *
from notebook import psource, heatmap, gaussian_kernel, show_map, final_path_colors, display_visual, plot_NQueens

# Needed to hide warnings in the matplotlib sections
import warnings
warnings.filterwarnings("ignore")

In [None]:
%matplotlib inline
import networkx as nx
import matplotlib.pyplot as plt
from matplotlib import lines

from ipywidgets import interact
import ipywidgets as widgets
from IPython.display import display
import time

## GENETIC ALGORITHM

Genetic algorithms (or GA) are inspired by natural evolution and are particularly useful in optimization and search problems with large state spaces.

Given a problem, a genetic algorithm maintains a *population* of candidate solutions (also called *states*). At each iteration (often called a *generation*), the population gets updated using methods inspired by biology and evolution, like *crossover*, *mutation* and *natural selection*.

### Overview

A genetic algorithm works in the following way:

1) Initialize random population.

2) Calculate population fitness.

3) Select individuals for mating.

4) Mate selected individuals to produce new population.

     * Random chance to mutate individuals.

5) Repeat from step 2) until an individual is fit enough or the maximum number of iterations was reached.
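The five steps above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the `search` module's implementation: the function name `run_ga`, its parameters, and the toy problem (maximizing the number of 1s in a bit list) are all assumptions made for this example, and fitness values are assumed to be non-negative.

```python
import random

def run_ga(fitness, gene_pool, length, pop_size=50, ngen=200,
           pmut=0.1, target_fitness=None, rng=None):
    """Hypothetical minimal genetic algorithm following the five steps above."""
    rng = rng or random.Random()
    # 1) Initialize a random population.
    population = [[rng.choice(gene_pool) for _ in range(length)]
                  for _ in range(pop_size)]
    for _ in range(ngen):
        # 2) Calculate the fitness of every individual.
        scores = [fitness(ind) for ind in population]
        # 5) Stop early once some individual is fit enough.
        best = population[scores.index(max(scores))]
        if target_fitness is not None and max(scores) >= target_fitness:
            return best
        # random.choices needs a positive total weight; fall back to uniform.
        weights = scores if sum(scores) > 0 else None
        next_population = []
        for _ in range(pop_size):
            # 3) Select two parents, fitter individuals being more likely.
            x, y = rng.choices(population, weights=weights, k=2)
            # 4) Mate them with a single-point crossover...
            cut = rng.randrange(length + 1)
            child = x[:cut] + y[cut:]
            #    ...and mutate the child with probability pmut.
            if rng.random() < pmut:
                child[rng.randrange(length)] = rng.choice(gene_pool)
            next_population.append(child)
        population = next_population
    return max(population, key=fitness)

# Toy usage: evolve a list of ten bits towards all ones.
best = run_ga(fitness=sum, gene_pool=[0, 1], length=10,
              target_fitness=10, rng=random.Random(0))
print(best, sum(best))
```

Passing a seeded `random.Random` makes a run repeatable, which helps when comparing parameter choices.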

### Glossary

Before we continue, we will lay out the basic terminology of the algorithm.

* Individual/State: A list of elements (called *genes*) that represents a possible solution.

* Population: The list of all the individuals/states.

* Gene pool: The alphabet of possible values for an individual's genes.

* Generation/Iteration: One round of updating the population. The number of generations is the number of times the population will be updated.

* Fitness: An individual's score, calculated by a function specific to the problem.

### Crossover

Two individuals/states can "mate" and produce one child. This offspring bears characteristics from both of its parents. There are many ways we can implement this crossover. Here we will take a look at the most common ones. Most other methods are variations of those below.

* Point Crossover: The crossover occurs around one (or more) point. The parents get "split" at the chosen point or points and then get merged. In the example below we see two parents get split and merged at the 3rd digit, producing the following offspring after the crossover.

![point crossover](images/point_crossover.png)

* Uniform Crossover: This type of crossover randomly chooses which parent each gene comes from. Here genes 1, 2 and 5 were chosen from the first parent, so genes 3 and 4 came from the second parent.

![uniform crossover](images/uniform_crossover.png)
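As a rough sketch of the two crossover types, assuming individuals are lists of genes: the function names and the two parents below are hypothetical and chosen only for illustration, they are not taken from the `search` module or from the images.

```python
import random

def point_crossover(x, y, point):
    """Take the genes before `point` from x and the remaining genes from y."""
    return x[:point] + y[point:]

def uniform_crossover(x, y, rng=random):
    """Pick each gene independently from one of the two parents."""
    return [rng.choice(pair) for pair in zip(x, y)]

# Hypothetical parents, chosen so it is easy to see where each gene came from.
parent_a = list("11111")
parent_b = list("00000")

print(point_crossover(parent_a, parent_b, 3))  # ['1', '1', '1', '0', '0']
print(uniform_crossover(parent_a, parent_b))   # a random mix of '1' and '0'
```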

### Mutation

When an offspring is produced, there is a chance it will mutate, having one (or more, depending on the implementation) of its genes altered.

For example, let's say the new individual to undergo mutation is "abcde". Randomly we pick to change its third gene to 'z'. The individual now becomes "abzde" and is added to the population.
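The example above can be reproduced with a short sketch. The helper names `mutate_at` and `random_mutation` are hypothetical, and the individual is kept as a string here purely to match the "abcde" example.

```python
import random

def mutate_at(individual, index, new_gene):
    """Return a copy of `individual` with the gene at `index` replaced."""
    return individual[:index] + new_gene + individual[index + 1:]

def random_mutation(individual, gene_pool, pmut, rng=random):
    """With probability pmut, replace one randomly chosen gene."""
    if rng.random() >= pmut:
        return individual  # no mutation this time
    index = rng.randrange(len(individual))
    return mutate_at(individual, index, rng.choice(gene_pool))

print(mutate_at("abcde", 2, "z"))  # abzde (the third gene has index 2)
print(random_mutation("abcde", "xyz", pmut=0.5))
```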

### Selection

At each iteration, individuals are picked at random to mate and produce offspring, with fitter individuals more likely to be picked. We measure an individual's fitness with a *fitness function*. That function depends on the given problem and is used to score an individual. Usually, the higher the score, the better.

The selection process is this:

1) Individuals are scored by the fitness function.

2) Individuals are picked randomly, according to their score (higher score means higher chance to get picked). Usually the formula to calculate the chance to pick an individual is the following (for population *P* and individual *i*):

$$ \text{chance}(i) = \dfrac{\text{fitness}(i)}{\sum_{k \in P}{\text{fitness}(k)}} $$
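The formula can be tried out on a tiny population. The sketch below is an assumption-laden illustration: the function names, the three-string population and the "count the letter a" fitness function are all made up for this example, and the total fitness is assumed to be greater than zero.

```python
import random

def selection_chances(population, fitness_fn):
    """chance(i) = fitness(i) / sum of fitness(k) over the population."""
    scores = [fitness_fn(ind) for ind in population]
    total = sum(scores)  # assumed to be positive
    return [score / total for score in scores]

def select_parents(population, fitness_fn, k=2, rng=random):
    """Draw k individuals, each with probability proportional to its fitness."""
    weights = [fitness_fn(ind) for ind in population]
    return rng.choices(population, weights=weights, k=k)

def count_a(individual):
    """Hypothetical fitness function: the number of 'a' genes."""
    return individual.count("a")

toy_population = ["aaa", "aab", "abb"]
print(selection_chances(toy_population, count_a))  # roughly 0.5, 0.33, 0.17
print(select_parents(toy_population, count_a))
```

Note that an individual can be drawn more than once, so the two parents may be the same individual.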

### Implementation

Below we look over the implementation of the algorithm in the `search` module.

First the implementation of the main core of the algorithm:

In [None]:
psource(genetic_algorithm)

The algorithm takes the following input:

* `population`: The initial population.

* `fitness_fn`: The problem's fitness function.

* `gene_pool`: The gene pool of the states/individuals. By default 0 and 1.

* `f_thres`: The fitness threshold. If an individual reaches that score, iteration stops. By default `None`, which means the algorithm will not halt until all the generations have run.

* `ngen`: The number of iterations/generations.

* `pmut`: The probability of mutation.

The algorithm gives as output the state with the largest score.

For each generation, the algorithm updates the population. First it calculates the fitnesses of the individuals, then it selects parents according to their fitness, and finally it crosses them over to produce offspring. There is a chance that the offspring will be mutated, given by `pmut`. If at the end of the generation an individual meets the fitness threshold, the algorithm halts and returns that individual.

Mating is carried out by the function `recombine`:

In [None]:
psource(recombine)

The function picks a point at random and merges the parents (`x` and `y`) around it.

Mutation is done in the function `mutate`:

In [None]:
psource(mutate)

We pick a gene in `x` to mutate and a gene from the gene pool to replace it with.

To help initialize the population, we have the helper function `init_population`:

In [None]:
psource(init_population)

The function takes as input the number of individuals in the population, the gene pool and the length of each individual/state. It creates individuals with random genes and returns the population when done.
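To make that description concrete, here is a sketch of a function with the same inputs and output. It is written under a different, hypothetical name (`make_population`) so that it does not shadow the `search` module's `init_population`, and it is not claimed to match that implementation line for line.

```python
import random

def make_population(pop_number, gene_pool, state_length, rng=random):
    """Build pop_number individuals, each a list of state_length random genes."""
    return [[rng.choice(gene_pool) for _ in range(state_length)]
            for _ in range(pop_number)]

# Four individuals of six binary genes each.
for individual in make_population(4, [0, 1], 6):
    print(individual)
```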

### Explanation

Before we solve problems using the genetic algorithm, we will explain how to intuitively understand the algorithm using a trivial example.

#### Generating Phrases

In this problem, we use a genetic algorithm to generate a particular target phrase from a population of random strings. This is a classic example that helps build intuition about how to use this algorithm in other problems as well. Before we break the problem down, let us try to brute force the solution. Let us say that we want to generate the phrase "genetic algorithm". The phrase is 17 characters long. We can use any of the 26 lowercase characters and the space character. To generate a random phrase of length 17, each position can be filled in 27 ways. So the total number of possible phrases is

$$ 27^{17} = 2153693963075557766310747 $$

which is a massive number. If we wanted to generate the phrase "Genetic Algorithm", we would also have to take the 26 uppercase characters into consideration, increasing the sample space from 27 characters to 53 characters. The total number of possible phrases would then be

$$ 53^{17} = 205442259656281392806087233013 $$
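Both counts can be checked with Python's arbitrary-precision integers. The last lines give a feel for the scale; the rate of one billion guesses per second is an arbitrary assumption made only for this illustration.

```python
phrase_length = len("genetic algorithm")  # 17 positions
lowercase_pool = 26 + 1                   # lowercase letters plus the space
mixed_case_pool = 26 + 26 + 1             # both cases plus the space

lowercase_phrases = lowercase_pool ** phrase_length
mixed_case_phrases = mixed_case_pool ** phrase_length
print(lowercase_phrases)   # 2153693963075557766310747
print(mixed_case_phrases)  # 205442259656281392806087233013

# Assumed (hypothetical) brute-force speed: one billion guesses per second.
guesses_per_second = 10 ** 9
seconds_per_year = 60 * 60 * 24 * 365
years = lowercase_phrases // (guesses_per_second * seconds_per_year)
print(years, "years to enumerate every lowercase phrase")
```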

If we wanted to include punctuation and numerals in the sample space, we would further complicate an already intractable problem. Hence, brute forcing is not an option. Now we'll apply the genetic algorithm and see how it approaches the target while evaluating only a tiny fraction of the search space. We essentially want to *evolve* our population of random strings so that they better approximate the target phrase as the number of generations increases. Genetic algorithms work on the principle of Darwinian natural selection, according to which three key concepts need to be in place for evolution to happen. They are:

* **Heredity**: There must be a process in place by which children receive the properties of their parents. <br> 
For this particular problem, two strings from the population will be chosen as parents and will be split at a random index and recombined as described in the `recombine` function to create a child. This child string will then be added to the new generation.


* **Variation**: There must be a variety of traits present in the population or a means with which to introduce variation. <br>If there is no variation in the sample space, we might never reach the global optimum. To ensure that there is enough variation, we can initialize a large population, but this gets computationally expensive as the population gets larger. Hence, we often use another method called mutation. In this method, we randomly change one or more characters of some strings in the population based on a predefined probability value called the mutation rate or mutation probability as described in the `mutate` function. The mutation rate is usually kept quite low. A mutation rate of zero fails to introduce variation in the population and a high mutation rate (say 50%) is as good as a coin flip and the population fails to benefit from the previous recombinations. An optimum balance has to be maintained between population size and mutation rate so as to reduce the computational cost as well as have sufficient variation in the population.


* **Selection**: There must be some mechanism by which some members of the population have the opportunity to be parents and pass down their genetic information and some do not. This is typically referred to as "survival of the fittest". <br>
There has to be some way of determining which phrases in our population have a better chance of eventually evolving into the target phrase. This is done by introducing a fitness function that calculates how close the generated phrase is to the target phrase. The function will simply return a scalar value corresponding to the number of matching characters between the generated phrase and the target phrase.

Before solving the problem, we first need to define our target phrase.

In [None]:
target = 'Genetic Algorithm'

We then need to define our gene pool, i.e. the elements an individual from the population may be composed of. Here, the gene pool contains all uppercase and lowercase letters of the English alphabet and the space character.

In [None]:
# The ASCII codes of the uppercase letters run from 65 to 90 (range excludes its upper bound)
u_case = [chr(x) for x in range(65, 91)]
# The ASCII codes of the lowercase letters run from 97 to 122
l_case = [chr(x) for x in range(97, 123)]

gene_pool = []
gene_pool.extend(u_case) # adds the uppercase list to the gene pool
gene_pool.extend(l_case) # adds the lowercase list to the gene pool
gene_pool.append(' ')    # adds the space character to the gene pool
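As an aside, the same 53 genes can be built without spelling out ASCII codes, using the constants in Python's standard `string` module. The variable name `alt_gene_pool` is hypothetical and is used here only to avoid overwriting `gene_pool`.

```python
import string

# Uppercase letters, then lowercase letters, then the space character.
alt_gene_pool = list(string.ascii_uppercase + string.ascii_lowercase + ' ')
print(len(alt_gene_pool))  # 53
```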

We **print** all the possible values that the genes can take.

In [None]:
print (gene_pool)

['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', ' ']


We now need to define the maximum size of the population. Larger populations have more variation but are computationally more expensive to run algorithms on.

In [None]:
max_population = 100

As our population is not very large, we can afford to keep a relatively large mutation rate.

In [None]:
mutation_rate = 0.07 # 7%

Great! Now, we need to define the most important metric for the genetic algorithm, i.e. the fitness function. This will simply return the number of matching characters between the generated sample and the target phrase.

In [None]:
def fitness_fn(sample):
    # initialize fitness to 0
    fitness = 0
    for i in range(len(sample)):
        # increment fitness by 1 for every matching character
        if sample[i] == target[i]:
            fitness += 1
    return fitness

Before we run our genetic algorithm, we need to initialize a random population. We will use the `init_population` function to do this. We need to pass in the maximum population size, the gene pool and the length of each individual, which in this case will be the same as the length of the target phrase.

In [None]:
population = init_population(max_population, gene_pool, len(target))

Now, we show the **initial population**

In [None]:
for i in population:
    print (target)
    print (i,fitness_fn(i))

Genetic Algorithm
['l', 'M', 'B', 'k', 'U', 'H', 'o', 'l', 'H', 'l', 'W', 'Y', 'M', 'A', 'e', 'F', 'E'] 1
Genetic Algorithm
['u', 'n', 'r', ' ', 's', 'c', 'G', 'x', 'R', 'C', 'V', 'D', 'J', 'C', 'j', 'P', 'C'] 0
Genetic Algorithm
['d', 'K', 'n', 'b', 'f', 'l', 'M', 'd', 'g', 'm', 'l', 'K', 'T', 'i', 'Y', 's', 'r'] 2
Genetic Algorithm
['l', 'w', 'l', 'Q', 'y', 'V', 'L', 'O', 'q', 'p', 'H', 'r', 'l', 'Z', 'w', 'g', 'O'] 0
Genetic Algorithm
['a', 'A', 'P', 'x', 'J', 'K', 'T', 'M', 'v', 'C', 'K', 'O', 'G', 'r', 'I', 'K', 'f'] 0
Genetic Algorithm
['B', 'w', 's', 'V', 'F', 'h', 'D', 'H', 'I', 'i', 'j', 'x', 'p', 'o', 'U', 'S', 'm'] 1
Genetic Algorithm
['e', 'W', 'g', 'r', 'h', 'M', 'i', 'n', 'L', 'P', 'F', 'R', 'k', 'l', 't', 'S', 'Z'] 1
Genetic Algorithm
['s', 'a', 'L', ' ', 'G', 'v', 's', 'u', 'N', 'A', 'K', 'n', 'd', 'X', 'L', 'f', 'E'] 0
Genetic Algorithm
['D', 'q', 'i', 'f', 'M', 't', 'a', 'V', 'J', 'a', 'N', 'H', 'g', 'S', 'm', 'V', 'P'] 0
Genetic Algorithm
['m', 'U', 'i', 'l', 'N', 'H

We will now define how the individuals in the population should change as the number of generations increases. First, the `select` function will be run on the population to select *two* individuals, with fitter individuals more likely to be picked. These will be the parents, which will then be recombined using the `recombine` function to generate the child.

In [None]:
parents = select(2, population, fitness_fn) 

In [None]:
for i in parents:
    print (i,fitness_fn(i))

['l', 'M', 'B', 'k', 'U', 'H', 'o', 'l', 'H', 'l', 'W', 'Y', 'M', 'A', 'e', 'F', 'E'] 1
['J', 's', 'D', 'e', 'N', 'U', 'c', 'J', 'L', 'H', 'I', 'N', 'F', 'd', 'u', 'Z', 'Q'] 2


In [None]:
# The recombine function takes two parents as arguments, so we need to unpack the previous variable
child = recombine(*parents)

In [None]:
print (child)
print (fitness_fn(child))

['l', 'M', 'B', 'k', 'U', 'U', 'c', 'J', 'L', 'H', 'I', 'N', 'F', 'd', 'u', 'Z', 'Q']
1


Next, we need to apply a mutation according to the mutation rate. We call the `mutate` function on the child with the gene pool and mutation rate as the additional arguments.

In [None]:
print(child)
print (fitness_fn(child))
child = mutate(child, gene_pool, mutation_rate)
print (child)
print (fitness_fn(child))

['l', 'M', 'B', 'k', 'U', 'U', 'c', 'J', 'L', 'H', 'I', 'N', 'F', 'd', 'u', 'Z', 'Q']
1
['l', 'M', 'B', 'k', 'U', 'U', 'c', 'J', 'L', 'H', 'I', 'N', 'F', 'z', 'u', 'Z', 'Q']
1


The above lines can be condensed into

`child = mutate(recombine(*select(2, population, fitness_fn)), gene_pool, mutation_rate)`

We need to do this once `for` every individual in the current population to generate the new population.

In [None]:
population = [mutate(recombine(*select(2, population, fitness_fn)), gene_pool, mutation_rate) for i in range(len(population))]
for i in population:
    print (target)
    print (i,fitness_fn(i))

Genetic Algorithm
['L', 'b', 'w', 'Q', 'g', 'u', 'y', 'C', 'y', 'k', 'g', 'y', 'f', 'z', 'u', 'm', 'i'] 1
Genetic Algorithm
['d', 'K', 'n', 'b', 'f', 'l', 'M', 'd', 'g', 'm', 'l', 'K', 'T', 'd', 'u', 'Z', 'Q'] 1
Genetic Algorithm
['J', 'F', 'i', 'v', 'g', 'v', ' ', 'C', 'G', 'x', 'l', 'K', 'T', 'i', 'Y', 's', 'r'] 1
Genetic Algorithm
['J', 'F', 'i', 'v', 'g', 'B', 'd', 'q', 'l', 'C', 'g', 's', 'P', 'q', 'U', 'j', 'f'] 1
Genetic Algorithm
['J', 's', 'D', 'e', 'N', 'U', 'c', 'J', 'L', 'H', 'I', 'N', 'F', 'd', 'u', 'F', 'E'] 2
Genetic Algorithm
['F', 'e', 'i', 'x', 'O', 'a', 'E', 'S', 'A', 'j', 'D', 'J', 'f', 's', 'P', 'j', 'N'] 2
Genetic Algorithm
['e', 'W', 'g', 'r', 'h', 'M', 'i', 'n', 'L', 'P', 'F', 'R', 'k', 'l', 'K', 'Z', 'd'] 0
Genetic Algorithm
['O', 'l', 'o', 'Y', 'M', 'h', 'D', 'H', 'I', 'i', 'j', 'x', 'p', 'o', 'U', 'S', 'm'] 1
Genetic Algorithm
['F', 'e', 'i', 'Y', 'M', 'W', 'I', 'V', 'P', 'C', 'M', 'o', 'p', 'i', 'B', 'N', 'M'] 3
Genetic Algorithm
['z', 'O', 'u', 'C', 'v', 'm

The individual with the highest fitness can then be found using the `max` function.

In [None]:
current_best = max(population, key=fitness_fn)

Let's print this out

In [None]:
print(current_best, fitness_fn(current_best))

['F', 'e', 'i', 'Y', 'M', 'W', 'I', 'V', 'P', 'C', 'M', 'o', 'p', 'i', 'B', 'N', 'M'] 3


We see that this is a list of characters. It can be converted to a string using the string method `join`.

In [None]:
current_best_string = ''.join(current_best)
print(current_best_string)

FeiYMWIVPCMopiBNM


We now need to define the conditions to terminate the algorithm. This can happen in two ways:

1. Termination after a predefined number of generations.
2. Termination when the fitness of the best individual of the current generation reaches a predefined threshold value.

We define these variables below.

In [None]:
ngen = 500 # maximum number of generations
# we set the threshold fitness equal to the length of the target phrase
# i.e. the algorithm only terminates when it has got all the characters correct
# or it has completed 'ngen' number of generations
f_thres = len(target)
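The threshold test itself is small. The sketch below shows one way such a check could look; the name `reached_threshold`, the toy target and the toy population are hypothetical, and the `search` module's own helper may differ in detail (inspect it with `psource` to be sure).

```python
def reached_threshold(fitness_fn, f_thres, population):
    """Return the fittest individual if it meets f_thres, otherwise None."""
    if not f_thres:
        return None  # no threshold set: never stop early
    fittest = max(population, key=fitness_fn)
    return fittest if fitness_fn(fittest) >= f_thres else None

# Toy check with a hypothetical three-letter target.
toy_target = "cat"

def toy_fitness(sample):
    """Number of positions where the sample matches the toy target."""
    return sum(1 for a, b in zip(sample, toy_target) if a == b)

print(reached_threshold(toy_fitness, len(toy_target), ["cot", "dog"]))  # None
print(reached_threshold(toy_fitness, len(toy_target), ["cot", "cat"]))  # cat
```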

To generate `ngen` generations, we run a `for` loop `ngen` times. After each generation, we calculate the fitness of the best individual of the generation and compare it to the value of `f_thres` using the `fitness_threshold` function. After every generation, we print out the best individual of the generation and the corresponding fitness value. Let's now write a function to do this.

In [None]:
def genetic_algorithm_stepwise(population, fitness_fn, gene_pool=[0, 1], f_thres=None, ngen=1200, pmut=0.1):
    for generation in range(ngen):
        population = [mutate(recombine(*select(2, population, fitness_fn)), gene_pool, pmut) for i in range(len(population))]
        # stores the individual genome with the highest fitness in the current population
        current_best = ''.join(max(population, key=fitness_fn))
        # the f prefix is required for the {placeholders} to be filled in
        print(f'Current best: {current_best}\t\tGeneration: {generation}\t\tFitness: {fitness_fn(current_best)}\r', end='')

        # compare the fitness of the current best individual to f_thres
        fittest_individual = fitness_threshold(fitness_fn, f_thres, population)

        # if fitness is greater than or equal to f_thres, we terminate the algorithm
        if fittest_individual:
            return fittest_individual, generation
    return max(population, key=fitness_fn), generation

The function defined above is essentially the same as the one defined in `search.py` with the added functionality of printing out the data of each generation.

In [None]:
psource(genetic_algorithm)

We have defined all the required functions and variables. Let's now create a new population and test the function we wrote above.

In [None]:
population = init_population(max_population, gene_pool, len(target))
solution, generations = genetic_algorithm_stepwise(population, fitness_fn, gene_pool, f_thres, ngen, mutation_rate)



In [None]:
print(solution)

['G', 'e', 'n', 'e', 't', 'i', 'o', ' ', 'A', 'o', 'O', 'A', 'r', 'i', 't', 'B', 'j']
