<a href="https://colab.research.google.com/github/majoduran/Denovo_immunogenicitymodel/blob/main/Data_input.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Modeling Immunogenicity of De Novo Proteins
Welcome to our project repository! This document describes the input data structure required to model the immunogenicity of de novo proteins.

## Data Input Requirements
This project takes two types of input datasets: the test dataset and the training datasets of the models being evaluated. Although their contents differ, both must follow the same structure. The required format for each input file is described below.

### CSV File Structure
Each CSV file should contain the following columns in the specified order:

Sequence: The protein sequence of interest.

Length: The length of the sequence.

Tm: The melting temperature of the sequence.

#### Additional columns
The test dataset contains two extra columns that are required for downstream analysis:

Category: The immunogenic classification of the sequence.

Immunogenic Score: The known immunogenic score for controls.

#### Additional Notes
The CSV file must contain a header row with the column names as specified above.

Ensure each column is correctly populated, as missing or malformed data may lead to processing errors.

Once the data matches the structure outlined above, place the CSV file in the designated directory within the repository.
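The checks described in these notes can be sketched with the standard library before a file is added to the repository. This is only a minimal sketch: the helper name `validate_rows`, the exact header spelling, and the choice of checks (field count, empty fields, numeric Length and Tm) are assumptions, not part of the project code.

```python
import csv
import io

# Column names follow the structure described above; the exact header
# spelling is an assumption made for this sketch.
REQUIRED_COLUMNS = ["Sequence", "Length", "Tm"]
TEST_ONLY_COLUMNS = ["Category", "Immunogenic Score"]


def validate_rows(handle, is_test_dataset=False):
    """Return a list of human-readable problems found in an open CSV handle."""
    expected = REQUIRED_COLUMNS + (TEST_ONLY_COLUMNS if is_test_dataset else [])
    reader = csv.reader(handle)
    header = next(reader, None)
    if header != expected:
        return [f"header {header} does not match expected {expected}"]

    problems = []
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        if len(row) != len(expected):
            problems.append(
                f"line {line_no}: expected {len(expected)} fields, got {len(row)}"
            )
            continue
        if any(not field.strip() for field in row):
            problems.append(f"line {line_no}: empty field")
            continue
        try:
            int(row[1])    # Length must be an integer
            float(row[2])  # Tm must be numeric
        except ValueError:
            problems.append(f"line {line_no}: Length or Tm is not numeric")
    return problems


good = "Sequence,Length,Tm\nPTSSST,133,52.9069\n"
bad = "Sequence,Length,Tm\nPTSSST,,52.9069\nGPEEEG,15\n"
print(validate_rows(io.StringIO(good)))
print(validate_rows(io.StringIO(bad)))
```

An empty list means no problems were found. For the test dataset, pass `is_test_dataset=True` so the two extra columns are expected as well.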

#### Example Data
An example of how the data should be structured is shown below:

```
Sequence,Length,Tm,Category,Immunogenic Score
PTSSST,133,52.9069,clinically approved,0
GPEEEG,15,59.10659,"de-immunized mutant, non-self",0
MCDLPQ,166,54.9079,clinically approved,0
```
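One detail in the example deserves care: the Category value `de-immunized mutant, non-self` contains a comma, so it has to be wrapped in double quotes for the row to keep five fields. The short sketch below uses Python's standard `csv` module to show the difference; it is an illustration only and not part of the project code.

```python
import csv
import io

# Row taken from the example above, once without and once with quoting
# around the Category value that contains a comma.
unquoted = 'GPEEEG,15,59.10659,de-immunized mutant, non-self,0\n'
quoted = 'GPEEEG,15,59.10659,"de-immunized mutant, non-self",0\n'

unquoted_row = next(csv.reader(io.StringIO(unquoted)))
quoted_row = next(csv.reader(io.StringIO(quoted)))

print(len(unquoted_row))  # 6 fields: the category is split in two
print(len(quoted_row))    # 5 fields, as required
print(quoted_row[3])      # de-immunized mutant, non-self
```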


### Data Cleaning Process
To ensure the integrity and consistency of the input data, the project includes a data cleaning pipeline. It reads several file types, removes duplicate sequences, and saves the cleaned output in a standard CSV format.

### Overview of the Cleaning Pipeline
The pipeline reads data from three supported file formats:

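Text files (txt): Assumes whitespace-separated columns in the order Sequence, Length, Tm and, optionally, Category and Immunogenic Score. Because columns are split on whitespace, field values must not contain spaces.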

Excel files (xlsx): Reads data from an Excel spreadsheet.

FASTA files (fasta): Parses protein or DNA sequence data.
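Since each format maps to a file extension, the file type can be inferred from the path instead of being typed by hand. The helper `infer_file_type` and the mapping below are hypothetical and not part of the project code; the returned labels ('txt', 'excel', 'fasta') are the values the cleaning function's file_type argument accepts.

```python
from pathlib import Path

# Hypothetical mapping from file extension to the file_type labels
# accepted by the cleaning function.
EXTENSION_TO_TYPE = {
    ".txt": "txt",
    ".xlsx": "excel",
    ".fasta": "fasta",
}


def infer_file_type(path):
    """Return the file_type label for a path, based on its extension."""
    suffix = Path(path).suffix.lower()
    try:
        return EXTENSION_TO_TYPE[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file extension: {suffix!r}") from None


print(infer_file_type("input_data/sample_data.xlsx"))
```

Raising an error for unknown extensions makes an unsupported file fail loudly rather than being skipped silently.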

#### Deduplication Function
The core function, deduplicate_and_save, follows these steps:

Input Reading: Depending on the specified file type, the function reads the corresponding input file.

Deduplication: The function removes duplicate sequences, keeping only the first occurrence of each exact sequence string.
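The deduplication step relies on pandas' `drop_duplicates`. A minimal sketch of its behavior on a small, made-up DataFrame (the Tm values are illustrative, not real measurements):

```python
import pandas as pd

# Illustrative data: the first and third rows share the same sequence.
df = pd.DataFrame({
    "Sequence": ["PTSSST", "GPEEEG", "PTSSST"],
    "Tm": [52.9, 59.1, 60.0],
})

# keep="first" retains the earliest row for each repeated sequence.
clean_df = df.drop_duplicates(subset=["Sequence"], keep="first")
print(clean_df)
```

The comparison is an exact string match, so sequences that differ only in letter case or surrounding whitespace are treated as distinct.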

#### Data Processing
For text files, it reads the data into a DataFrame, specifying columns for Sequence, Length, Tm, Category, and Immunogenic Score.

For Excel files, it loads the first sheet of the workbook into a DataFrame.

For FASTA files, it extracts sequences and creates a DataFrame from them.
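The FASTA case can be sketched in plain Python. The helper `parse_fasta` below is hypothetical and, unlike the project function, also keeps the header line of each record. The record names in the example are made up.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines may wrap, so collect them and join at the end.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = ">seq1 example\nPTSS\nST\n>seq2\nGPEEEG\n"
records = parse_fasta(example)
print(records)
```

Using `splitlines()` and stripping each line also handles files with Windows line endings.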

### Usage Instructions
1. Place your input files in an accessible directory.
2. Edit the path and type entries of the input_files list in the main block so that they point to your files.
3. Run the script to clean the data and write the deduplicated output as CSV.

```python
# Data Cleaning Code

import logging

import pandas as pd

# Setup logging configuration
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

# Column order expected in whitespace-separated text files
TXT_COLUMNS = ['Sequence', 'Length', 'Tm', 'Category', 'Immunogenic_Score']


def deduplicate_and_save(input_file, output_file, file_type):
    """
    Reads various file types, deduplicates the sequences, and saves to a specified output file.

    Args:
        input_file (str): The path to the input file (text, Excel, or FASTA).
        output_file (str): The path for the output deduplicated file (CSV).
        file_type (str): The type of the input file ('txt', 'excel', or 'fasta').

    Returns:
        None
    """
    try:
        # Read the input file based on its type
        if file_type == 'txt':
            # Columns are split on whitespace, so field values must not contain spaces.
            # Files with only three columns get NaN in the two optional columns.
            df = pd.read_csv(input_file, sep=r'\s+', header=None, names=TXT_COLUMNS)

            # Drop the optional columns when the file does not provide them
            if df['Category'].isnull().all() and df['Immunogenic_Score'].isnull().all():
                df = df[['Sequence', 'Length', 'Tm']]

        elif file_type == 'excel':
            # Reads the first sheet; it must contain a 'Sequence' column
            df = pd.read_excel(input_file)

        elif file_type == 'fasta':
            with open(input_file, "r") as file:
                fasta_data = file.read()

            # Text before the first '>' is not a record, so skip it
            entries = fasta_data.split(">")[1:]
            # Drop each header line and join the (possibly wrapped) sequence lines;
            # splitlines() also handles Windows line endings
            data = [
                {"Sequence": "".join(line.strip() for line in entry.splitlines()[1:])}
                for entry in entries
            ]
            df = pd.DataFrame(data, columns=["Sequence"])

        else:
            logging.error(f"Unsupported file type provided: {file_type}")
            return

        if "Sequence" not in df.columns:
            logging.error(f"No 'Sequence' column found in {input_file}")
            return

        # Deduplicate entries based on the 'Sequence' column
        clean_df = df.drop_duplicates(subset=["Sequence"], keep="first")
        clean_df.to_csv(output_file, index=False)
        logging.info(f"Deduplicated data saved to {output_file}")

    except FileNotFoundError:
        logging.error(f"File not found: {input_file}")
    except Exception as e:
        logging.error(f"An error occurred in processing the {file_type} file: {e}")


if __name__ == "__main__":
    # Example configuration options
    input_files = [
        {'path': 'input_data/sample_data.txt', 'type': 'txt'},
        {'path': 'input_data/sample_data.xlsx', 'type': 'excel'},
        {'path': 'input_data/sample_data.fasta', 'type': 'fasta'}
    ]

    # Process each input file, writing one output per input so that
    # later files do not overwrite earlier results
    for input_info in input_files:
        output_csv = f"deduplicated_sequences_{input_info['type']}.csv"
        deduplicate_and_save(input_info['path'], output_csv, input_info['type'])
```
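If a single combined file is wanted rather than one result per input, the DataFrames can be concatenated before deduplicating. The helper `combine_and_deduplicate` below is hypothetical and not part of the script; the data is illustrative.

```python
import pandas as pd


def combine_and_deduplicate(frames):
    """Concatenate DataFrames and keep the first row for each sequence."""
    combined = pd.concat(frames, ignore_index=True)
    deduplicated = combined.drop_duplicates(subset=["Sequence"], keep="first")
    return deduplicated.reset_index(drop=True)


# Illustrative inputs: one frame with a Length column, one with sequences only.
txt_df = pd.DataFrame({"Sequence": ["PTSSST", "GPEEEG"], "Length": [133, 15]})
fasta_df = pd.DataFrame({"Sequence": ["GPEEEG", "MCDLPQ"]})

merged = combine_and_deduplicate([txt_df, fasta_df])
print(merged)
```

Columns missing from one input, such as Length for FASTA-derived rows, are filled with NaN. The order of the input list matters, because the first occurrence of each sequence is the one that is kept.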