# Raw Data Analysis

In this section, we analyze the raw data to understand its structure, identify inconsistencies, and gather initial insights. The columns of interest are `species`, `chromosome`, and `region`.

## Steps Taken

1. **Loading the Data**:
    - We used `pandas` to load the data into a DataFrame for easy manipulation and analysis.

2. **Initial Exploration**:
    - Displayed the first few rows of the DataFrame to get an overview of the data.
    - Checked for any missing or null values to ensure data integrity.

3. **Data Summary**:
    - Generated summary statistics for the data to understand its distribution.
    - Reviewed the unique values in each column to identify any potential anomalies.

4. **Identifying Conflicts**:
    - Applied the `find_conflicting_chromosomes` function to identify regions that have different chromosome values for the same species.
    - Grouped the data by `species` and `region` and counted the unique chromosome values to pinpoint discrepancies.
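
The exploration and summary steps above can be sketched as follows. This is a minimal illustration on a small made-up DataFrame (`toy_df` and its values are assumptions, not the project's data); the real analysis would run the same calls on the loaded DataFrame.

```python
import pandas as pd

# Toy data standing in for the real dataset (values are illustrative only)
toy_df = pd.DataFrame(
    {
        "species": ["A", "A", "B", "B"],
        "chromosome": ["chr1", "chr2", "chr1", None],
        "region": ["r1", "r1", "r2", "r3"],
    }
)

# Overview of the first few rows
print(toy_df.head())

# Number of missing values per column
missing_counts = toy_df.isna().sum()

# Summary statistics; include="all" also covers non-numeric columns
summary = toy_df.describe(include="all")

# Number of distinct (non-missing) values per column
unique_counts = toy_df.nunique()
```

Note that `nunique` ignores missing values by default, so a column with gaps can look more consistent than it is; checking `isna().sum()` alongside it avoids that blind spot.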

## Findings

- **Initial Data Overview**:
    - The data consists of several species, each associated with multiple regions and chromosomes.
    - Some species and regions have multiple chromosome values, indicating potential conflicts.

- **Summary Statistics**:
    - Provided insights into the range and distribution of the data.
    - Highlighted any outliers or unusual patterns in the dataset.

- **Conflicting Chromosomes**:
    - Identified specific `species` and `region` combinations with more than one unique `chromosome` value.
    - This will help in further cleaning and validation of the data.
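
The grouping-and-counting check behind this finding can be sketched roughly as below. The data and the column name `n_chromosomes` are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Toy data (illustrative only): species A / region r1 maps to two chromosomes
toy_df = pd.DataFrame(
    {
        "species": ["A", "A", "A", "B"],
        "region": ["r1", "r1", "r2", "r1"],
        "chromosome": ["chr1", "chr2", "chr1", "chr3"],
    }
)

# Count the distinct chromosome values for every (species, region) pair
chromosome_counts = (
    toy_df.groupby(["species", "region"])["chromosome"]
    .nunique()
    .reset_index(name="n_chromosomes")
)

# Keep only the pairs associated with more than one chromosome
conflicts = chromosome_counts[chromosome_counts["n_chromosomes"] > 1]
```

Unlike filtering the original rows, this produces one row per conflicting pair, which is convenient for a quick count of how many pairs are affected.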

By thoroughly analyzing the raw data, we ensure that we have a clear understanding of its structure and any inconsistencies that need to be addressed in subsequent steps.


In [1]:
import os
import sys
import pandas as pd

from pandas import DataFrame

In [2]:
# Assumes the notebook is run from the notebooks directory, one level below the project root
notebooks_dir = os.getcwd()
project_root = os.path.abspath(os.path.join(notebooks_dir, os.pardir))

# Add the project root directory to sys.path
if project_root not in sys.path:
    sys.path.append(project_root)

# Import the data-loading helpers from the project's utils package
try:
    from utils.load_data import (
        get_expression_data,
        get_merged_data,
        get_upstream_data,
    )
    print("Import successful")
except ImportError as e:
    print(f"Error importing module: {e}")
    raise  # Later cells depend on these helpers, so do not continue silently

Import successful


In [3]:
# Paths
expression_data_path = os.path.join(notebooks_dir, '..', 'data', 'data_expression')
upstream_data_path = os.path.join(notebooks_dir, '..', 'data', 'data_sequences_upstream', 'upstream_sequences.xlsx')

In [4]:
# Get expression data
expression_df = get_expression_data(expression_data_path)

# Get upstream data
upstream_df = get_upstream_data(upstream_data_path)

# Get merged data
merged_df = get_merged_data(expression_df, upstream_df)

In [5]:
# Every CSV filename and every ('csv', 'region') pair in the upstream data
# must also appear in the expression data.
assert set(upstream_df["csv"]).issubset(expression_df["csv"]), (
    "Upstream data contains CSV filenames missing from the expression data."
)
upstream_pairs = set(map(tuple, upstream_df[["csv", "region"]].values))
expression_pairs = set(map(tuple, expression_df[["csv", "region"]].values))
assert upstream_pairs.issubset(expression_pairs), (
    "Upstream data contains ('csv', 'region') combinations missing from the expression data."
)
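
The assertions above only report that a mismatch exists. If one fails, a sketch like the following could list the offending pairs. It uses two small made-up frames (`toy_upstream` and `toy_expression` are assumptions) and relies on the `indicator` option of `DataFrame.merge`.

```python
import pandas as pd

# Toy frames (illustrative only): the pair ("a.csv", "r2") is missing from expression
toy_upstream = pd.DataFrame(
    {"csv": ["a.csv", "a.csv", "b.csv"], "region": ["r1", "r2", "r1"]}
)
toy_expression = pd.DataFrame({"csv": ["a.csv", "b.csv"], "region": ["r1", "r1"]})

key_columns = ["csv", "region"]

# Left-merge the unique upstream pairs against the unique expression pairs;
# indicator=True adds a "_merge" column telling where each row was found
merged = (
    toy_upstream[key_columns]
    .drop_duplicates()
    .merge(
        toy_expression[key_columns].drop_duplicates(),
        on=key_columns,
        how="left",
        indicator=True,
    )
)

# Pairs present in the upstream data but absent from the expression data
missing_pairs = merged.loc[merged["_merge"] == "left_only", key_columns]
```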


## Conflicting Regions

We now identify rows that share the same `species` and `region` but have different `chromosome` values.

In [6]:
def find_conflicting_chromosomes(df: DataFrame) -> DataFrame:
    """
    Find rows that share the same species and region but have different chromosome values.

    Args:
        df (DataFrame): The input DataFrame with columns 'species', 'chromosome', and 'region'.

    Returns:
        DataFrame: The rows of `df` belonging to any ('species', 'region') group
            that contains more than one unique 'chromosome' value.
    """
    # Group by 'species' and 'region', then keep only groups with more than one unique 'chromosome'
    return df.groupby(['species', 'region']).filter(
        lambda group: group['chromosome'].nunique() > 1
    )
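
As a quick sanity check, the function can be exercised on a small made-up DataFrame. The data below is hypothetical, and the function is repeated only so that the snippet runs on its own.

```python
import pandas as pd
from pandas import DataFrame


def find_conflicting_chromosomes(df: DataFrame) -> DataFrame:
    # Same logic as the function defined above, repeated for a standalone example
    return df.groupby(["species", "region"]).filter(
        lambda group: group["chromosome"].nunique() > 1
    )


# Toy data (illustrative only): only species A / region r1 has two chromosomes
toy_df = pd.DataFrame(
    {
        "species": ["A", "A", "A", "B"],
        "region": ["r1", "r1", "r2", "r1"],
        "chromosome": ["chr1", "chr2", "chr1", "chr3"],
    }
)

conflicting_rows = find_conflicting_chromosomes(toy_df)
print(conflicting_rows)
```

The result keeps the original rows (and their index) for every conflicting group, so the two rows for species `A` and region `r1` are returned while the others are dropped.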