# Band Gap Prediction Project

This notebook is the main interface for the Band Gap Prediction project. It loads the dataset, preprocesses the data, trains an MLP model, evaluates its performance, and visualizes the results. GNN and GAN models are part of the project but are not covered here yet.

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load the dataset
data = pd.read_csv('../data/Bandgap_data.csv')

# Display the first few rows of the dataset
data.head()

Unnamed: 0,composition,density,formation_energy,Eg (eV)
0,NaCl,2.16,-7.73,8.5
1,MgO,3.58,-6.98,7.8
2,SiO2,2.65,-7.54,8.9
3,Al2O3,3.95,-8.0,8.8
4,ZnO,5.61,-4.8,3.4
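
A column named `Unnamed: 0` typically appears when a CSV was written together with its row index and then read back without `index_col`. Whether this is the case for `Bandgap_data.csv` is an assumption; the sketch below reproduces the effect on two rows from the preview, held in an in-memory string instead of the real file.

```python
import io
import pandas as pd

# Two rows from the preview, written the way DataFrame.to_csv() writes them
# when the index is included: the first header field is empty.
csv_text = (
    ",composition,density,formation_energy,Eg (eV)\n"
    "0,NaCl,2.16,-7.73,8.5\n"
    "1,MgO,3.58,-6.98,7.8\n"
)

# Default read: the unnamed first column is kept as a regular column.
default_df = pd.read_csv(io.StringIO(csv_text))

# With index_col=0 the first column becomes the row index instead.
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default_df.columns))
print(list(indexed_df.columns))
```

If the column is only a saved index, reading with `index_col=0` keeps it out of the feature set.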


## Data Preprocessing

In this section, we preprocess the data: drop rows with missing values, separate the target column `Eg (eV)` from the features, and one-hot encode the categorical columns.

In [8]:
import pandas as pd

def preprocess_data(data):
    """
    Preprocess the dataset for band gap prediction.

    Parameters:
        data (pd.DataFrame): The raw dataset containing features and target.

    Returns:
        features (pd.DataFrame): Processed feature set.
        target (pd.Series): Target variable (band gap).
    """
    # Drop rows with missing values
    data = data.dropna()

    # Separate features and target
    target = data['Eg (eV)']  # Assuming 'Eg (eV)' is the target column
    features = data.drop(columns=['Eg (eV)'])  # Drop the target column from features

    # Identify categorical columns
    categorical_columns = features.select_dtypes(include=['object']).columns

    # Perform one-hot encoding for categorical features
    features = pd.get_dummies(features, columns=categorical_columns, drop_first=True)

    return features, target

# Preprocessing the data
# Assuming 'data' is already loaded in the notebook
features, target = preprocess_data(data)

# Display the processed features and target
print(features.head())
print(target.head())

   density  formation_energy  composition_BaTiO3  composition_C  \
0     2.16             -7.73               False          False   
1     3.58             -6.98               False          False   
2     2.65             -7.54               False          False   
3     3.95             -8.00               False          False   
4     5.61             -4.80               False          False   

   composition_CaF2  composition_CdS  composition_Cu2O  composition_Fe2O3  \
0             False            False             False              False   
1             False            False             False              False   
2             False            False             False              False   
3             False            False             False              False   
4             False            False             False              False   

   composition_GaN  composition_Ge  ...  composition_KBr  composition_LiF  \
0            False           False  ...            False 
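
The `preprocess_data` function above leaves numeric columns such as `density` and `formation_energy` on their original scales. A minimal sketch of z-score standardization follows, assuming those column names from the dataset preview. The helper `standardize_numeric` is hypothetical and not part of the project's `src` package, and the column `composition_ZnO` is only an illustrative one-hot column.

```python
import pandas as pd

def standardize_numeric(features: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with numeric (non-boolean) columns scaled to zero mean and unit variance."""
    scaled = features.copy()
    # include="number" selects int/float columns and skips boolean one-hot columns
    numeric_cols = scaled.select_dtypes(include="number").columns
    for col in numeric_cols:
        mean = scaled[col].mean()
        std = scaled[col].std(ddof=0)  # population standard deviation
        if std > 0:
            scaled[col] = (scaled[col] - mean) / std
        else:
            # A constant column carries no information; map it to zeros
            scaled[col] = 0.0
    return scaled

# Values taken from the five preview rows; the one-hot column is illustrative
demo = pd.DataFrame({
    "density": [2.16, 3.58, 2.65, 3.95, 5.61],
    "formation_energy": [-7.73, -6.98, -7.54, -8.0, -4.8],
    "composition_ZnO": [False, False, False, False, True],
})
scaled_demo = standardize_numeric(demo)
print(scaled_demo)
```

In practice the mean and standard deviation would usually be computed on the training rows only and then reused for any held-out rows.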

## Model Training

In this section, we will train the models defined in the `models` directory. We will start with the MLP model.

In [23]:
import pandas as pd
from src.data_preprocessing import preprocess_data
from models.mlp_model import MLPModel
from src.train import train_model

# Load the dataset
data = pd.read_csv('../data/Bandgap_data.csv')

# Preprocess the data
features, target = preprocess_data(data)

# Initialize the MLP model
mlp_model = MLPModel(input_dim=features.shape[1])

# Train the model
train_model(mlp_model, features, target)

ValueError: could not convert string to float: 'KBr'
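
The `ValueError` above indicates that a string value (`'KBr'`, a composition) reached code that expects floats. Note that this cell imports `preprocess_data` from `src.data_preprocessing`, which replaces the function defined earlier in the notebook and may not encode the `composition` column. A hedged sketch of a guard that could run before training is shown below: it reports any non-numeric columns and otherwise converts the frame to a float array. The helper `to_float_matrix` is hypothetical, and whether `train_model` accepts a NumPy array is an assumption.

```python
import numpy as np
import pandas as pd

def to_float_matrix(features: pd.DataFrame) -> np.ndarray:
    """Convert a fully numeric/boolean DataFrame to a float32 array, or fail with a clear message."""
    # Columns that are neither numeric nor boolean cannot be cast to float
    bad_cols = features.select_dtypes(exclude=["number", "bool"]).columns.tolist()
    if bad_cols:
        raise TypeError(f"Non-numeric feature columns: {bad_cols}")
    return features.to_numpy(dtype=np.float32)

# Two rows from the dataset preview, with the raw string column still present
raw = pd.DataFrame({"composition": ["NaCl", "MgO"], "density": [2.16, 3.58]})

message = ""
try:
    to_float_matrix(raw)
except TypeError as err:
    message = str(err)  # names the offending column
print(message)

# After one-hot encoding, every column is numeric or boolean
encoded = pd.get_dummies(raw, columns=["composition"])
matrix = to_float_matrix(encoded)
print(matrix.shape, matrix.dtype)
```

Failing early with the column name makes it easier to see which preprocessing step was skipped than the conversion error raised deep inside training.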

## Model Evaluation

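After training, we evaluate the model using the mean squared error (MSE) and the R^2 score. The cell below passes the same `features` and `target` that were used for training, so unless `train_model` or `evaluate_model` holds out data internally, these metrics describe fit to the training data rather than generalization.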

In [None]:
# Evaluate the model
from src.evaluate import evaluate_model

# Evaluate the trained MLP model
mse, r2 = evaluate_model(mlp_model, features, target)

# Display the evaluation metrics
print(f'Mean Squared Error: {mse}')
print(f'R^2 Score: {r2}')
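
The internals of `evaluate_model` are not shown in this notebook, so the following is only a sketch of how the two metrics are defined, written in plain NumPy. The helper `mse_and_r2` is hypothetical and is not the project's implementation. MSE is the mean of the squared residuals, and R^2 is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np

def mse_and_r2(y_true, y_pred):
    """Return (MSE, R^2) for two equal-length sequences of numbers."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = float(np.mean(residuals ** 2))
    ss_res = float(np.sum(residuals ** 2))                 # residual sum of squares
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))  # total sum of squares
    # R^2 is undefined when the target is constant
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return mse, r2

# Band gaps (eV) from the five preview rows
y_true = [8.5, 7.8, 8.9, 8.8, 3.4]

# Baseline that always predicts the mean band gap
baseline = [np.mean(y_true)] * len(y_true)
mse_base, r2_base = mse_and_r2(y_true, baseline)

# Perfect predictions
mse_perfect, r2_perfect = mse_and_r2(y_true, y_true)

print(f"baseline: MSE={mse_base:.3f}, R^2={r2_base:.3f}")
print(f"perfect:  MSE={mse_perfect:.3f}, R^2={r2_perfect:.3f}")
```

A predictor that always returns the mean scores an R^2 of zero, which gives a reference point for judging the trained model.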

## Visualization

In this section, we plot the predicted band gaps against the actual values.

In [None]:
# Visualize the results
from src.visualize import plot_results

# Plot the predicted vs actual band gaps
plot_results(mlp_model, features, target)
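
The implementation of `plot_results` is not shown here. As an illustration of what a predicted-versus-actual plot usually looks like, the sketch below draws a parity plot with Matplotlib. The function `parity_plot` is hypothetical, and the predicted values are made up purely for demonstration; they are not model output.

```python
import numpy as np
import matplotlib.pyplot as plt

def parity_plot(y_true, y_pred):
    """Scatter predicted against actual values with a y = x reference line."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)

    fig, ax = plt.subplots(figsize=(5, 5))
    ax.scatter(y_true, y_pred, alpha=0.7)

    # Points on the diagonal are perfect predictions
    lo = min(y_true.min(), y_pred.min())
    hi = max(y_true.max(), y_pred.max())
    ax.plot([lo, hi], [lo, hi], linestyle="--", color="gray", label="ideal (y = x)")

    ax.set_xlabel("Actual band gap (eV)")
    ax.set_ylabel("Predicted band gap (eV)")
    ax.legend()
    return fig, ax

actual = [8.5, 7.8, 8.9, 8.8, 3.4]     # from the dataset preview
predicted = [8.0, 8.0, 8.5, 9.0, 4.0]  # made-up values, for illustration only
fig, ax = parity_plot(actual, predicted)
```

Points far from the dashed diagonal correspond to compounds whose band gap the model over- or underestimates.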

## Conclusion

This notebook outlines a workflow for predicting the band gap of compounds: loading the data, preprocessing, training an MLP model, evaluation, and visualization. Further sections can be added to explore the GNN and GAN models in the same way.