# **Libraries Used**

In [8]:
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import EarlyStopping
from keras.utils import to_categorical
from keras import backend as K

# **Selecting and combining data**

Here I select the files whose modality code is 01 and whose emotion code is either 03 or 04. From each selected file, the columns holding the facial-recognition features are kept, and the data from all files is combined into a single DataFrame.

In [6]:
def select_and_combine_data(dataset_path, modality_code, emotion_codes):
    print("Selecting and combining data entries...")
    dataframes = []

    for file_name in os.listdir(dataset_path):
        if file_name.endswith('.csv'):
            file_info = file_name.split('.')[0].split('-')
            if (file_info[0] == modality_code) and (file_info[2] in emotion_codes):
                df = pd.read_csv(os.path.join(dataset_path, file_name))
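                # Columns 298..636 are the facial-feature columns; .copy() makes an
                # independent DataFrame so adding 'label' does not trigger pandas'
                # SettingWithCopyWarning
                df_selected = df.iloc[:, 298:637].copy()
                df_selected['label'] = file_info[2]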
                dataframes.append(df_selected)

    if not dataframes:
        raise ValueError("No dataframes to concatenate. Check your file filtering conditions.")

    combined_df = pd.concat(dataframes, ignore_index=True)
    print("Combined DataFrame:")
    print(combined_df)

    return combined_df
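
To make the file-name filter concrete, here is a minimal sketch of the same check in isolation. The file names are hypothetical, and `parse_codes` and `is_selected` are helper names introduced only for this illustration. The length guard is an addition: the function above would raise an `IndexError` on a CSV whose name has fewer than three hyphen-separated fields.

```python
def parse_codes(file_name):
    """Split a name like 'aa-bb-cc.csv' into its hyphen-separated codes."""
    stem = file_name.split('.')[0]
    return stem.split('-')


def is_selected(file_name, modality_code, emotion_codes):
    """Return True if the file passes the modality and emotion filter."""
    if not file_name.endswith('.csv'):
        return False
    codes = parse_codes(file_name)
    if len(codes) < 3:  # guard added for this sketch
        return False
    # Field 0 is the modality code, field 2 is the emotion code
    return codes[0] == modality_code and codes[2] in emotion_codes


# Hypothetical file names, for illustration only
examples = [
    "01-01-03-01-01-01-01.csv",  # modality 01, emotion 03: kept
    "01-01-05-02-01-01-02.csv",  # emotion 05: dropped
    "02-01-04-01-02-01-03.csv",  # modality 02: dropped
    "notes.txt",                 # not a CSV: dropped
]
selected = [name for name in examples if is_selected(name, '01', ['03', '04'])]
print(selected)  # ['01-01-03-01-01-01-01.csv']
```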

Each chromosome is initialized with random 0s and 1s, where 1 marks a feature as present (selected) and 0 marks it as absent.


In [9]:
def initialize_population(population_size, chromosome_length):
    print("Initializing population...")
    return np.random.randint(2, size=(population_size, chromosome_length))
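
As a quick illustration of what this produces, the sketch below builds a tiny population with the same `np.random.randint` call. The sizes (4 chromosomes, 6 features) and the seed are assumptions chosen only to keep the example small and repeatable.

```python
import numpy as np

np.random.seed(0)  # seed chosen only so the example is repeatable

# 4 chromosomes, each with one bit per feature (6 features here)
population = np.random.randint(2, size=(4, 6))

print(population.shape)  # (4, 6)

# Number of features each chromosome selects (count of 1 bits per row)
features_per_chromosome = population.sum(axis=1)
print(features_per_chromosome)
```

Each row is one candidate feature subset, and each column corresponds to one feature of the dataset.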

# **Fitness Function**

Input parameters:

1. **chromosome**: A binary array in which each bit indicates whether the corresponding feature is selected (1) or not (0).
2. **X_train**: The feature matrix of the training dataset.
3. **X_test**: The feature matrix of the testing dataset.
4. **y_train**: The target labels of the training dataset.
5. **y_test**: The target labels of the testing dataset.

The **chromosome** is used to select features from the dataset: a feature is kept if its bit is 1 and dropped if its bit is 0.
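
The selection relies on a boolean mask (`chromosome == 1`), which is worth distinguishing from indexing with the raw 0/1 integers. The sketch below uses a plain NumPy array of hypothetical column names as a stand-in for the DataFrame's columns. pandas column indexes are expected to follow the same distinction, but that is an assumption of this example.

```python
import numpy as np

# Hypothetical column names standing in for X_train.columns
columns = np.array(['f0', 'f1', 'f2', 'f3', 'f4'])
chromosome = np.array([1, 0, 1, 1, 0])

# Boolean mask: keeps the positions where the bit is 1
mask_selected = columns[chromosome == 1]
print(mask_selected)   # ['f0' 'f2' 'f3']

# Integer indexing: treats each 0/1 as a position, so only
# columns 0 and 1 are ever returned, with repeats
index_selected = columns[chromosome]
print(index_selected)  # ['f1' 'f0' 'f1' 'f1' 'f0']
```

Only the boolean-mask form gives the selected feature subset.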

**Neural network:** The model is built with the Keras Sequential API. It consists of eight dense hidden layers with ReLU activation, followed by a two-unit output layer with softmax activation.

The model is compiled with appropriate loss function ('categorical_crossentropy'), optimizer ('adam'), and evaluation metric ('accuracy').

The EarlyStopping callback monitors the validation loss during training and stops training early if the loss does not improve for a given number of epochs (patience=1 in the code below). Because training runs for only one epoch here, the callback never gets a chance to stop training early; it only matters if the number of epochs is raised.

The model is trained on the training data (X_train_selected and y_train) for a specified number of epochs (1 in the code below) with a batch size of 256, holding out 10% of the training data for validation (validation_split=0.1). Training runs silently (verbose=0) to avoid printing progress to the console.

After training, the model's performance is evaluated on the testing data (X_test_selected and y_test). The accuracy of the model on the testing dataset is computed and returned as the fitness score.

Finally, Keras session resources are cleared (K.clear_session()) and the model object is deleted to release memory resources.

In [10]:
def evaluate_fitness(chromosome, X_train, X_test, y_train, y_test):
    print("Evaluating fitness...")
    selected_features = X_train.columns[chromosome == 1]
    X_train_selected = X_train[selected_features]
    X_test_selected = X_test[selected_features]

    model = Sequential([
        Dense(120, input_dim=X_train_selected.shape[1], activation='relu'),
        Dense(80, activation='relu'),
        Dense(80, activation='relu'),
        Dense(80, activation='relu'),
        Dense(80, activation='relu'),
        Dense(80, activation='relu'),
        Dense(100, activation='relu'),
        Dense(100, activation='relu'),
        Dense(2, activation='softmax')
    ])

    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

    early_stopping = EarlyStopping(monitor='val_loss', patience=1, restore_best_weights=True)

    batch_size = 256
    epochs = 1
    model.fit(X_train_selected, to_categorical(y_train), epochs=epochs, batch_size=batch_size, verbose=0, validation_split=0.1, callbacks=[early_stopping])

    _, accuracy = model.evaluate(X_test_selected, to_categorical(y_test), verbose=0)

    K.clear_session()
    del model

    return accuracy

# **Crossover Function**

Imagine you have two strings of numbers, like "010110" and "111000". These strings represent different traits, like hair color or height, that you inherit from your parents.

Now, to create children with a mix of traits, we randomly pick a single cut point that is applied to both strings. Let's say we cut after the third position. Then, we take the first part of the first string ("010") and combine it with the second part of the second string ("000"), creating one child ("010000"). For the second child, we do the opposite: we take the first part of the second string ("111") and combine it with the second part of the first string ("110"), creating the second child ("111110").

In this way, we're essentially mixing traits from both parents to create new children with a blend of characteristics.
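
The worked example can be checked with a short sketch. `crossover_at` is a hypothetical helper that takes the cut point as an argument instead of drawing it at random, so the result is deterministic. With the cut after position 3, the tail of "111000" is "000" and the tail of "010110" is "110".

```python
import numpy as np


def crossover_at(parent1, parent2, point):
    """Single-point crossover with an explicit cut point."""
    child1 = np.concatenate((parent1[:point], parent2[point:]))
    child2 = np.concatenate((parent2[:point], parent1[point:]))
    return child1, child2


parent1 = np.array([0, 1, 0, 1, 1, 0])  # "010110"
parent2 = np.array([1, 1, 1, 0, 0, 0])  # "111000"

child1, child2 = crossover_at(parent1, parent2, 3)
print(''.join(map(str, child1)))  # 010000
print(''.join(map(str, child2)))  # 111110
```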

In [11]:
def crossover(parent1, parent2):
    print("Performing crossover...")
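    # randint's upper bound is exclusive, so the cut falls in 1..len-1 and
    # each child receives at least one gene from each parent
    crossover_point = np.random.randint(1, len(parent1))
    child1 = np.concatenate((parent1[:crossover_point], parent2[crossover_point:]))
    child2 = np.concatenate((parent2[:crossover_point], parent1[crossover_point:]))
    return child1, child2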


# **Mutation Function**

It initializes a loop that iterates over each gene in the chromosome.
For each gene, it generates a random number between 0 and 1 using np.random.rand().

If the random number is less than the mutation rate, it means the gene will undergo mutation.

In the mutation process, it flips the gene by subtracting its value from 1 (1 - chromosome[i]). If the gene is 0, it becomes 1, and if it's 1, it becomes 0.

After iterating through all genes in the chromosome, the mutated chromosome is returned.

Think of mutation as a random change in one of the traits encoded in the string.

Imagine you have a string of numbers representing traits, like "010110". Mutation means that, occasionally, one of these numbers randomly flips to its opposite. So, if we have a mutation rate of 0.1 (10%), for each trait in the string, there's a 10% chance it will change.

For example, if we randomly select a position and it's a "0", with a mutation rate of 0.1, there's a 10% chance it will become a "1". Similarly, if it's a "1", there's a 10% chance it will become a "0".

This process introduces randomness and diversity in the population, helping to explore new solutions in the search space.
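
The same per-gene flip can also be written without an explicit loop. The sketch below is an alternative formulation, not the function used in this notebook: `mutate_vectorized` is a hypothetical name, it takes a NumPy random generator as an extra argument, and it returns a new array instead of modifying the chromosome in place.

```python
import numpy as np


def mutate_vectorized(chromosome, mutation_rate, rng):
    """Flip each gene independently with probability mutation_rate."""
    # One uniform draw in [0, 1) per gene; True marks a gene to flip
    flip = rng.random(len(chromosome)) < mutation_rate
    # 1 - gene turns 0 into 1 and 1 into 0
    return np.where(flip, 1 - chromosome, chromosome)


rng = np.random.default_rng(0)  # seed chosen only for repeatability
chromosome = np.array([0, 1, 0, 1, 1, 0])

unchanged = mutate_vectorized(chromosome, 0.0, rng)    # rate 0: no gene flips
all_flipped = mutate_vectorized(chromosome, 1.0, rng)  # rate 1: every gene flips
sometimes = mutate_vectorized(chromosome, 0.1, rng)    # about 10% of genes flip on average

print(unchanged)    # [0 1 0 1 1 0]
print(all_flipped)  # [1 0 1 0 0 1]
```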

In [12]:
def mutate(chromosome, mutation_rate):
    print("Mutating...")
    for i in range(len(chromosome)):
        if np.random.rand() < mutation_rate:
            chromosome[i] = 1 - chromosome[i]
    return chromosome

# **Genetic Algorithm**

This function, `genetic_algorithm`, is the main logic for running a genetic algorithm for feature selection. Here's a simple breakdown of what it does:

1. **Initialization**: It starts by initializing the population of potential solutions (chromosomes) using the `initialize_population` function.

2. **Iterative Process**: It then enters a loop that iterates over a specified number of generations.

3. **Evaluation**: For each generation, it evaluates the fitness of each chromosome in the population using the `evaluate_fitness` function.

4. **Selection**: Based on the fitness scores, it selects two parent chromosomes from the population probabilistically, giving higher chances to chromosomes with higher fitness scores. Sampling is done with replacement, so the same chromosome can be drawn as both parents.

5. **Crossover**: It performs crossover (recombination) between the selected parent chromosomes to produce two offspring chromosomes using the `crossover` function.

6. **Mutation**: It applies mutation to the offspring chromosomes with a certain probability using the `mutate` function.

7. **Update Population**: It replaces the least fit chromosome, and the chromosome at the next index in the population, with the two mutated offspring.

8. **Tracking Best Solution**: It keeps track of the best solution (chromosome) found so far along with its fitness score.

9. **Return**: Finally, it returns the best solution found and a list of fitness scores for each generation.

This process iterates for the specified number of generations, gradually improving the population's fitness and hopefully converging towards an optimal solution for the feature selection problem.
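
The selection step is fitness-proportional (roulette-wheel) sampling: each fitness score is divided by the total to obtain a probability. The sketch below uses made-up fitness values and NumPy's `Generator.choice`. The `replace=False` variant is shown only as an option for forcing two distinct parents and is not what the function below does.

```python
import numpy as np

# Made-up fitness scores for a population of 4 chromosomes
fitness_scores = np.array([0.50, 0.75, 0.25, 0.50])

# Normalize so the probabilities sum to 1
probabilities = fitness_scores / fitness_scores.sum()
print(probabilities)  # [0.25  0.375 0.125 0.25 ]

rng = np.random.default_rng(0)  # seed chosen only for repeatability

# Default behaviour: sampling with replacement, so both parents
# can turn out to be the same chromosome
parents_with_replacement = rng.choice(len(fitness_scores), size=2, p=probabilities)

# Optional variant: force two distinct parents
distinct_parents = rng.choice(len(fitness_scores), size=2, replace=False, p=probabilities)
```

Note that the normalization assumes at least one fitness score is greater than zero; otherwise the division produces invalid probabilities.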

In [13]:
def genetic_algorithm(population_size, num_generations, mutation_rate, X_train, X_test, y_train, y_test):
    print("Starting Genetic Algorithm for Feature Selection...")
    chromosome_length = X_train.shape[1]
    population = initialize_population(population_size, chromosome_length)
    best_solution = None
    best_fitness = 0
    accuracies = []

    for generation in range(num_generations):
        print(f"Generation {generation + 1}/{num_generations}")
        fitness_scores = [evaluate_fitness(chromosome, X_train, X_test, y_train, y_test) for chromosome in population]
        accuracies.append(fitness_scores)
        max_fitness_index = np.argmax(fitness_scores)
        if fitness_scores[max_fitness_index] > best_fitness:
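            # Copy the row: indexing a single row returns a view, and the
            # population is overwritten in place later in the loop
            best_solution = population[max_fitness_index].copy()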
            best_fitness = fitness_scores[max_fitness_index]
            print(f"Generation {generation + 1}: Best Fitness = {best_fitness}")

        # Selection, crossover, and mutation
        parents = population[np.random.choice(population_size, size=2, p=np.array(fitness_scores) / np.sum(fitness_scores))]
        child1, child2 = crossover(parents[0], parents[1])
        min_fitness_index = np.argmin(fitness_scores)
        population[min_fitness_index] = mutate(child1, mutation_rate)
        population[(min_fitness_index + 1) % population_size] = mutate(child2, mutation_rate)

    return best_solution, accuracies
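
One detail worth keeping in mind when tracking the best chromosome: selecting a single row of a NumPy array with an integer index returns a view, not a copy. If that row of the population is overwritten later, a stored view changes with it. A minimal sketch with a toy population:

```python
import numpy as np

population = np.array([[1, 0, 1],
                       [0, 1, 1]])

best_view = population[0]         # view into the population array
best_copy = population[0].copy()  # independent snapshot

# Overwrite row 0, as happens when offspring replace a chromosome
population[0] = np.array([0, 0, 0])

print(best_view)  # [0 0 0]  (changed along with the population)
print(best_copy)  # [1 0 1]  (still the original chromosome)
```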

# **Output and Usage**

1. **Data Selection and Combination**: Read data files from a specific directory, filter them based on criteria like modality code and emotion codes, and combine the selected data into a single DataFrame.

2. **Data Encoding**: Encode categorical labels into numerical format using LabelEncoder.

3. **Data Splitting**: Separate features and labels, then split the data into training and testing sets.

4. **Genetic Algorithm for Feature Selection**: Initialize a population of potential solutions (chromosomes) and evaluate the fitness of each chromosome (solution) using a neural network model. Then iteratively evolve the population over several generations: select parent chromosomes based on fitness scores, perform crossover and mutation operations to create new offspring chromosomes, replace less fit chromosomes with mutated offspring, and track the best solution (selected features) and its accuracy.

5. **Evaluation**: Evaluate the accuracy of the best solution (selected features) using the neural network model, and evaluate the accuracy of using all features for comparison.

6. **Result Analysis**: Print the best solution (selected features), print the accuracies of using all features and selected features, determine whether feature selection improved classification accuracy, and print the accuracies achieved in each generation during the genetic algorithm's execution.
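
To show what the label encoding and the later one-hot conversion do to the emotion codes, here is a NumPy-only stand-in. It is not the implementation used by `LabelEncoder` or `to_categorical`; it only mirrors their effect on a small, made-up label column, assuming the encoder numbers the sorted unique labels from 0.

```python
import numpy as np

# Made-up label column holding the two emotion codes
labels = np.array(['03', '04', '04', '03'])

# Integer encoding: sorted unique labels are numbered 0, 1, ...
classes, encoded = np.unique(labels, return_inverse=True)
print(classes)  # ['03' '04']
print(encoded)  # [0 1 1 0]

# One-hot encoding: row i of the identity matrix represents class i
one_hot = np.eye(len(classes))[encoded]
print(one_hot)
```

The two columns of `one_hot` line up with the two softmax outputs of the network.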







In [None]:
combined_df = select_and_combine_data("/content/drive/MyDrive/archive/", '01', ['03', '04'])
label_encoder = LabelEncoder()
combined_df['label'] = label_encoder.fit_transform(combined_df['label'])
X = combined_df.drop(columns=['label'])
y = combined_df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

best_solution, accuracies = genetic_algorithm(50, 10, 0.1, X_train, X_test, y_train, y_test)

print("\nBest Solution (Selected Features):")
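# Use a boolean mask; indexing with the raw 0/1 integers would return
# only columns 0 and 1, repeated
print(X_train.columns[best_solution == 1])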

accuracy_all_features = evaluate_fitness(np.ones(X_train.shape[1], dtype=int), X_train, X_test, y_train, y_test)
accuracy_selected_features = evaluate_fitness(best_solution, X_train, X_test, y_train, y_test)

print("Accuracy with all features:", accuracy_all_features)
print("Accuracy with selected features:", accuracy_selected_features)

if accuracy_selected_features > accuracy_all_features:
    print("Feature selection improved classification accuracy.")
else:
    print("Feature selection did not improve classification accuracy.")

# Output all accuracies
for i, accuracy in enumerate(accuracies):
    print(f"Generation {i+1} Accuracies:", accuracy)

Selecting and combining data entries...


In [4]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
