## 1. Import the libraries and prepare dataset

Download [`heart-disease` dataset from kaggle](https://www.kaggle.com/ronitf/heart-disease-uci)

> 1. "age" = age
> 2. "sex" = sex
> 3. "cp" = chest pain type (4 values)
> 4. "trestbps" = resting blood pressure
> 5. "chol" = serum cholestoral in mg/dl
> 6. "fbs" = fasting blood sugar > 120 mg/dl
> 7. "restecg" = resting electrocardiographic results (values 0,1,2)
> 8. "thalach" = maximum heart rate achieved
> 9. "exang" = exercise induced angina
> 10. "oldpeak" = oldpeak = ST depression induced by exercise relative to rest
> 11. "slope" = the slope of the peak exercise ST segment
> 12. "ca" = number of major vessels (0-3) colored by flourosopy
> 13. "thal" = thal: 3 = normal; 6 = fixed defect; 7 = reversable defect
> 14. "target" = has heart disease

Credits: [Samiran Bera, 2020](https://medium.com/analytics-vidhya/feature-selection-using-genetic-algorithm-20078be41d16)

In [21]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.metrics import mean_squared_error, accuracy_score
from sklearn.preprocessing import StandardScaler
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Import regressor values
df_r_x = pd.read_csv('Annual_Stock_Price_Fundamentals_Ratios.csv', index_col=0)
df_r_y = pd.read_csv('Annual_Stock_Price_Performance_Percentage.csv', index_col=0)
df_r_y = df_r_y['Performance']

r_x = df_r_x.values
r_y = df_r_y.values

sc = StandardScaler()
r_x = sc.fit_transform(r_x)

df_r = pd.concat([pd.DataFrame(r_x, columns=df_r_x.columns), pd.DataFrame(r_y, columns=['Performance'])], axis=1, join='inner')    

# Import classifier values
df_c = pd.read_csv('heart_disease.csv')

# Impute missing values
df_c.ca.loc[df_c.ca == '?']     = '0' 
df_c.thal.loc[df_c.thal == '?'] = '3'

# Remove outliers
# df_c = df_c[df_c.chol < 500][df_c.oldpeak < 5]

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_block(indexer, value, name)


## 2. Create the Genetic Algorithm

Getting an idea of key concepts,

<img src="https://miro.medium.com/max/1112/1*vIrsxg12DSltpdWoO561yA.png" width="300" height="auto" />


From [Vijini Mallawaarachchi](https://towardsdatascience.com/introduction-to-genetic-algorithms-including-example-code-e396e98d8bf3),

**Initial Population**

The initial set of individuals which is called a **Population**. Each individual is a solution to the problem you want to solve.

**Genes**

An individual is characterized by a set of parameters (variables) known as Genes, which are joined into a string to form a **Chromosome** (solution).

**Fitness Function**

Determines how fit an individual is - which is the ability of an individual to compete with other individuals. 

It gives a fitness score to each individual. 

The probability that an individual will be selected for reproduction is based on its fitness score.

**Selection**

Select the fittest individuals and let them pass their genes to the next generation.

Two pairs of individuals (parents) are selected based on their fitness scores. 

Individuals with _**high** fitness have more chance to be **selected** for reproduction_.

**Crossover**

Crossover is the most significant phase in a genetic algorithm. 

For each pair of parents to be mated, a crossover point is chosen at random from within the genes.

For example, consider the crossover point to be 3 as shown below.

<img src="https://miro.medium.com/max/654/1*Wi6ou9jyMHdxrF2dgczz7g.png" width="200" height="auto" />

Offspring are created by exchanging the genes of parents among themselves until the crossover point is reached.

<img src="https://miro.medium.com/max/622/1*eQxFezBtdfdLxHsvSvBNGQ.png" width="200" height="auto" />

This exchanges genes among parents, and the new offspring are added to the population.

<img src="https://miro.medium.com/max/622/1*_Dl6Hwkay-UU24DJ_oVrLw.png" width="200" height="auto" />

**Mutation**

In certain new offspring formed, some of their genes can be subjected to a mutation with a low random probability. This implies that some of the bits in the bit string can be flipped.

<img src="https://miro.medium.com/max/702/1*CGt_UhRqCjIDb7dqycmOAg.png" width="200" height="auto" />

Mutation occurs to maintain diversity within the population and prevent premature convergence.

**Termination**

The algorithm terminates if the population has converged (does not produce offspring which are significantly different from the previous generation). Then it is said that the genetic algorithm has provided a set of solutions to our problem.


Example pseudocode,

```
START
Generate the initial population
Compute fitness
REPEAT
    Selection
    Crossover
    Mutation
    Compute fitness
UNTIL population has converged
STOP
```

In [25]:
import math

class GeneticAlgorithm:
    """
    data_df: The dataset used in the study
    features: A set of features which need to be optimized
    target: Denotes the dependent variable
    n: The size of the population
    max_iterations: The number of iterations to evaluate
    """
    
    def __init__(self, features, target='target', epochs=100, max_iterations=1000, is_classifier=True):
        self.epochs = epochs
        self.max_iterations = max_iterations
        self.features = features
        self.target = target
        self.is_classifier = is_classifier
    
    def _init_population(self):
        return np.array([
            [math.ceil(e) for e in pop] for pop in (np.random.rand(self.epochs, len(self.features)) - 0.5)
        ])
    
    def _predictive_model(self, x, y):
        x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=7)

        if self.is_classifier:
            model = LogisticRegression(solver='liblinear', max_iter=100, random_state=7)
            model.fit(x_train,y_train)    
            return accuracy_score(y_test, model.predict(x_test))

        model = LinearRegression()
        model.fit(x, y)
        return mean_squared_error(y, model.predict(x))

    def _get_fitness(self, df_data, population):
        """
        For each chromosome, the _predictive_model() function evaluates the accuracy score,
           which is aggregated at _get_fitness() function for the entire population.
        """
        fitness = []
        n = population.shape[1]

        for i in range(population.shape[0]):        
            fitness.append(
                self._predictive_model(df_data[
                    [self.features[j] for j in range(n) if population[i, j] == 1]
                ], df_data[self.target])
            )                

        return fitness
    
    def _random_selection(self, population):
        r = population.shape[0]
        return np.array([population[np.random.randint(0, r)] for i in range(r)])

    def _single_point_crossover(self, population):
        r = population.shape[0]
        c = population.shape[1]
        n = np.random.randint(1, c)

        for i in range(0, r, 2):            
            # What happens when it is out of bound?
            if r == i + 1:
                population[i-1] = np.append(population[i-1][0:n], population[i][n:c])
                population[i] = np.append(population[i][0:n], population[i-1][n:c]) 
                continue

            population[i] = np.append(population[i][0:n], population[i+1][n:c])
            population[i+1] = np.append(population[i+1][0:n], population[i][n:c])        

        return population
    
    def _flip_mutation(self, population):
        return population.max() - population
    
    def _replace_duplicate(self, population):
        return np.unique(population, axis=0)
    
    def find(self, df_data):
        population = self._init_population()
        population = self._replace_duplicate(population)
        fitness = self._get_fitness(df_data, population)
    
        optimal_value = max(fitness)
        optimal_solution = population[np.where(fitness == optimal_value)][0]    
    
        for i in range(self.max_iterations):                
            population = self._random_selection(population)
            population = self._single_point_crossover(population)                        
        
            if np.random.rand() < 0.3:
                population = self._flip_mutation(population)   
    
            population = self._replace_duplicate(population)    
            fitness = self._get_fitness(df_data, population)
                
            if max(fitness) > optimal_value:
                optimal_value = max(fitness)
                optimal_solution = population[np.where(fitness == optimal_value)][0]                               

        return dict(optimal_solution=optimal_solution, 
                    optimal_value=optimal_value, 
                    population=population,
                    fitness=fitness)

In [29]:
def apply_ga(df, title, target, is_classifier=True):
    print(f'\n===== {title} =====')

    ga = GeneticAlgorithm(df.columns, epochs=25, max_iterations=100, target=target, is_classifier=is_classifier)

    print('Input:')
    print(f'  Features={ga.features.values}')
    print(f'  Target={ga.target}')

    result = ga.find(df)

    print('\nOutput:')
    print(f'  Optimal Features={[ga.features[i] for i, v in enumerate(result["optimal_solution"]) if v == 1]}')
    print(f'  Optimal Accuracy={round(result["optimal_value"] * 100, 2)}%')


apply_ga(df_r, 'Using continuous variables', 'Performance', is_classifier=False)
apply_ga(df_c, 'Using categorical binary variables', 'target')


===== Using continuous variables =====
Input:
  Features=['EV/EBIT' 'Op. In./(NWC+FA)' 'P/E' 'P/B' 'P/S' 'Op. In./Interest Expense'
 'Working Capital Ratio' 'ROE' 'ROCE' 'Debt/Equity' 'Debt Ratio'
 'Cash Ratio' 'Asset Turnover' 'Gross Profit Margin' '(CA-CL)/TA' 'RE/TA'
 'EBIT/TA' 'Book Equity/TL' 'Performance']
  Target=Performance

Output:
  Optimal Features=['EV/EBIT', 'P/E', 'Op. In./Interest Expense', 'Debt/Equity', 'Cash Ratio', 'EBIT/TA']
  Optimal Accuracy=31.54%

===== Using categorical binary variables =====
Input:
  Features=['age' 'sex' 'cp' 'trestbps' 'chol' 'fbs' 'restecg' 'thalach' 'exang'
 'oldpeak' 'slope' 'ca' 'thal' 'target']
  Target=target

Output:
  Optimal Features=['trestbps', 'exang', 'ca', 'thal', 'target']
  Optimal Accuracy=100.0%
