## Computational Intelligence: Genetic Algorithms Project

In a broad sense, Genetic Algorithms can be defined as population-based models that
use selection and recombination operators to generate new sample points in a search space. The idea behind Genetic Algorithms is heavily influenced by the Theory of Evolution: potential solutions to a specific problem are encoded into simple chromosome-like data structures, and recombination operators are applied to these structures so as to preserve critical information.
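
As a rough illustration of the two operators named above, the following sketch builds one new generation from an existing population. Tournament selection, one-point crossover, the function names and the toy fitness (`sum`) are all assumptions chosen for illustration; they are not necessarily the operators used in this project.

```python
import random

def tournament_select(population, fitness, rng, k=2):
    """Pick the fittest of k randomly drawn individuals."""
    contenders = rng.sample(population, k)
    return max(contenders, key=fitness)

def one_point_crossover(parent_a, parent_b, point):
    """Swap the tails of two equal-length chromosomes at `point`."""
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b

def next_generation(population, fitness, rng):
    """Build a new population of the same size via selection and recombination."""
    offspring = []
    while len(offspring) < len(population):
        a = tournament_select(population, fitness, rng)
        b = tournament_select(population, fitness, rng)
        point = rng.randrange(1, len(a))  # cut strictly inside the chromosome
        offspring.extend(one_point_crossover(a, b, point))
    return offspring[:len(population)]

# Toy run: chromosomes are tuples of integers, fitness is their sum.
rng = random.Random(0)
population = [(1, 2, 3), (4, 5, 6), (7, 8, 9), (10, 11, 12)]
new_population = next_generation(population, fitness=sum, rng=rng)
```

Each child keeps every gene at the position it had in one of its parents, which is how recombination passes existing information on to the new sample points.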


In [11]:
import pandas as pd
import numpy as np 
from IPython.display import display
from difflib import get_close_matches
from sklearn.feature_extraction.text import TfidfVectorizer
import utils as ul
import importlib
importlib.reload(ul)

<module 'utils' from 'c:\\Users\\kwnka\\vs-code projects\\CompIntel\\project_B\\utils.py'>

### Preprocessing

1. **Loading**: First, the dataset is loaded into a Pandas dataframe and some information about it is printed using the `print_dataframe_info()` function, to help understand the data better. The dataframe is then filtered to keep only the rows with `region_main_id == 1693`. This means that only the inscriptions of one particular region are kept, which are generally more likely to have similar content.
2. **BoW with Tf-Idf**: In order to implement the Bag of Words model using tf-idf vectorization, the following steps are performed:
    * An object of the `TfidfVectorizer()` Class is instantiated.
    * The method `fit_transform()` of the Class is used to construct the vocabulary and to transform the inscriptions into the appropriate form.
    * The **output** of the vectorizer is a **sparse matrix**, where the rows are the different inscriptions and the columns are the different features of the vocabulary. A non-zero value at a given position is the tf-idf weight of that feature in that inscription; a zero means the feature does not occur in it.
3. **Target Inscription**: Since the damaged inscription that needs to be filled in is not part of the dataset, the word *αλεξανδρε* does not exist in the vocabulary built during the BoW step. It is therefore replaced by the closest word in the dictionary, *ανδρες*, which is found using the `get_close_matches(missing_word, vocab, n=1)` function from the `difflib` library, where:
    * **missing_word**: The word that is missing from the vocabulary, i.e. *αλεξανδρε*.
    * **vocab**: The vocabulary obtained from the `print_vectorizer_info()` function.
    * **n=1**: Signifies that only the single closest match to the missing word should be returned.
    
So the new target inscription after this procedure is [...]*ανδρες ουδις*[...], which is then transformed into a tf-idf vector using the same vectorizer as before.
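
The lookup described in step 3 can be sketched as follows. `get_close_matches` returns an empty list when no candidate reaches its similarity `cutoff` (0.6 by default), so the sketch guards against that case. The helper name `closest_word` and the three-word `toy_vocab` are assumptions for illustration; the real vocabulary has 1678 entries.

```python
from difflib import get_close_matches

def closest_word(word, vocabulary, cutoff=0.6):
    """Return the vocabulary entry most similar to `word`,
    or None if no entry scores at least `cutoff`."""
    matches = get_close_matches(word, list(vocabulary), n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Hypothetical miniature vocabulary, for illustration only.
toy_vocab = ["αβ", "ανδρες", "ουδις"]
replacement = closest_word("αλεξανδρε", toy_vocab)
print(replacement)
```

Returning `None` instead of indexing `[0]` directly avoids an `IndexError` when the missing word has no sufficiently similar neighbour in the vocabulary.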

In [12]:
# Load dataset into a dataframe and print information about it
df = pd.read_csv('iphi2802.csv', delimiter='\t')

ul.print_dataframe_info(df)


Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2802 entries, 0 to 2801
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   id              2802 non-null   int64  
 1   text            2802 non-null   object 
 2   metadata        2802 non-null   object 
 3   region_main_id  2802 non-null   int64  
 4   region_main     2802 non-null   object 
 5   region_sub_id   2802 non-null   int64  
 6   region_sub      2802 non-null   object 
 7   date_str        2802 non-null   object 
 8   date_min        2802 non-null   float64
 9   date_max        2802 non-null   float64
 10  date_circa      2802 non-null   float64
dtypes: float64(3), int64(3), object(5)
memory usage: 240.9+ KB

Number of NULL values per column:
id                0
text              0
metadata          0
region_main_id    0
region_main       0
region_sub_id     0
region_sub        0
date_str          0
date_min          0
date_ma

In [13]:
# Keep rows of the same region as the target inscription.
df_filtered = df.query("region_main_id == 1693")

# Initialize the tf-idf vectorizer and transform the text column of the dataframe.
vectorizer = TfidfVectorizer()
index_matrix = vectorizer.fit_transform(df_filtered['text'].to_list())

vocab_dict = ul.print_vectorizer_info(vectorizer, index_matrix, True)

"The dictionary of the dataset: ['αβ' 'αββεος' 'αβδαασθωρης' ... 'ϙε' 'ϙτ' 'ϛπ']"

'The shape of the output matrix: (127, 1678)'

'The matrix: [[0. 0. 0. ... 0. 0. 0.]\n [0. 0. 0. ... 0. 0. 0.]\n [0. 0. 0. ... 0. 0. 0.]\n ...\n [0. 0. 0. ... 0. 0. 0.]\n [0. 0. 0. ... 0. 0. 0.]\n [0. 0. 0. ... 0. 0. 0.]]'

"The unique values of the output matrix: [<127x1678 sparse matrix of type '<class 'numpy.float64'>'\n \twith 2342 stored elements in Compressed Sparse Row format>]"

In [14]:
# Find the word closest to the missing word 'αλεξανδρε' and turn the incomplete inscription into a tf-idf vector
missing_word = 'αλεξανδρε'
replaced_word = get_close_matches(missing_word, vocab_dict.values(), n=1)
print(f"The word closest to the missing word: {replaced_word[0]}")
replaced_inscription = f'{replaced_word[0]} ουδις'
print(f"The new inscription: {replaced_inscription}")
incomplete_vector = vectorizer.transform([replaced_inscription]).toarray()

The word closest to the missing word: ανδρες
The new inscription: ανδρες ουδις
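
The entries of `index_matrix` and `incomplete_vector` are tf-idf weights. As a minimal sketch of what one such weight means, the pure-Python function below follows the formula that scikit-learn's `TfidfVectorizer` applies under its default settings (raw term counts, smoothed idf, L2-normalised rows). The function name `tfidf_rows`, the whitespace tokenisation and the three-document corpus are assumptions made for illustration; the real vectorizer uses its own tokeniser.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocab, rows), where rows holds one dict of
    L2-normalised tf-idf weights per document."""
    tokenised = [doc.split() for doc in docs]  # simplification: whitespace tokens
    vocab = sorted({tok for toks in tokenised for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears.
    df = Counter(tok for toks in tokenised for tok in set(toks))
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    rows = []
    for toks in tokenised:
        counts = Counter(toks)
        raw = {tok: counts[tok] * idf[tok] for tok in counts}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        rows.append({tok: w / norm for tok, w in raw.items()} if norm else {})
    return vocab, rows

# Hypothetical corpus; the first document mirrors the target inscription.
vocab, rows = tfidf_rows(["ανδρες ουδις", "ανδρες αβ", "αβ αβ"])
print(rows[0])
```

In the first row the rarer word *ουδις* receives a larger weight than *ανδρες*, which appears in two documents, and words absent from a document are simply not stored, which is why the real matrix is sparse.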


### Encoding

The chromosomes of the population consist of two genes, one for each missing word. For this project the genes, and hence the chromosomes, are encoded using integer value encoding. Since the dictionary of this project consists of 1678 different features, a gene can take the values 0-1677. The final chromosome is a tuple of two numbers, which are the indices of the words in the dictionary.
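
A minimal sketch of this encoding is shown below. The helper names `random_chromosome` and `decode` and the three-word `toy_vocab` are assumptions for illustration; `vocab` is assumed to be an indexable sequence of words ordered by feature index, such as the array returned by `vectorizer.get_feature_names_out()`.

```python
import random

VOCAB_SIZE = 1678  # number of vocabulary features reported by the vectorizer

def random_chromosome(rng, vocab_size=VOCAB_SIZE):
    """Draw two gene values uniformly from 0 .. vocab_size - 1."""
    return (rng.randrange(vocab_size), rng.randrange(vocab_size))

def decode(chromosome, vocab):
    """Map each gene (a vocabulary index) back to its word."""
    return tuple(vocab[gene] for gene in chromosome)

rng = random.Random(42)
chromosome = random_chromosome(rng)

# Hypothetical miniature vocabulary, for illustration only.
toy_vocab = ["αβ", "ανδρες", "ουδις"]
words = decode((1, 2), toy_vocab)
print(chromosome, words)
```

Because each gene is drawn with `randrange(vocab_size)`, every freshly generated chromosome already lies inside the valid index range.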

Using this particular encoding ensures that no values outside of the range 0-1677 will occur, which makes the fitness function simpler and more cost-effective, since it has fewer comparisons to perform. Even if values past 1677 occur, the chromosomes carrying these values can be treated in one of the following ways:


In [15]:
ul.fitness_func

<function utils.fitness_func(solution, solution_idx)>