## EDA module explanation

This Jupyter notebook explains how the 'EDA' module works.
The EDA function below performs only basic exploratory data analysis. It should work well in general, but you should still take the particulars of your own dataset into account. The function does the following:

- Check for missing values: This involves finding columns that contain null or NaN values, which can indicate missing or incomplete data.
- Check for outliers: Outliers are data points that differ significantly from the majority of the data. They can have a large impact on the results of an analysis, so it is important to identify and understand them. This code uses the interquartile range (IQR) and flags values that lie more than 1.5 times the IQR below the first quartile or above the third quartile.
- Check data types: This involves finding the data type of each column, which can be useful for ensuring that the data is in the correct format for a particular analysis.
- Check value counts: This involves counting the number of occurrences of each unique value in a column. This can be useful for identifying patterns in categorical data.
- Check correlations: This involves calculating the correlation between pairs of columns, which can give insight into the strength and direction of the relationship between the variables.
- Check for skewness: This involves calculating the skewness of the distribution of each column, which can give insight into the symmetry of the data. Positive skewness indicates a distribution with a longer tail on the right side, while negative skewness indicates a distribution with a longer tail on the left side.
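
To make the IQR rule and the skewness check concrete, here is a minimal sketch on a toy Series. The values and variable names are invented for illustration and are not taken from the module; the sketch only assumes pandas is installed.

```python
import pandas as pd

# Toy data, invented for illustration: one extreme value on the right.
values = pd.Series([10, 12, 11, 13, 12, 11, 14, 95], name="toy")

# IQR rule: flag points below Q1 - 1.5*IQR or above Q3 + 1.5*IQR.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outlier_mask = (values < lower_bound) | (values > upper_bound)
outliers = values[outlier_mask]

# Skewness: positive means a longer right tail, negative a longer left tail.
skewness = values.skew()

print(f"Bounds: [{lower_bound}, {upper_bound}]")
print(f"Outliers: {outliers.tolist()}")
print(f"Skewness is positive: {skewness > 0}")
```

In this toy example the single extreme value is both flagged by the IQR rule and responsible for the positive skewness, which is why the two checks are often read together.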


In [2]:
# Basic exploratory data analysis (EDA) for a pandas DataFrame

import warnings

import matplotlib.pyplot as plt
import pandas as pd

# Silences all warnings, including pandas deprecation notices
warnings.filterwarnings('ignore')

def EDA(data):
    """Perform exploratory data analysis on a given dataset.
    Args:
      data (pandas.DataFrame): The dataset to analyze.
    """
    from IPython.display import display
    display(data.head())
     
    # Print the shape of the dataset
    print(f"Number of rows: {data.shape[0]}")
    print(f"Number of columns: {data.shape[1]}")   

    # Check for missing values
    if data.isnull().values.any():
        # Find the number of missing values for each column
        missing_counts = data.isnull().sum()
        # Calculate the percentage of missing values for each column
        missing_percentages = 100 * missing_counts / len(data)
        # Select only the columns with missing values
        missing_columns = missing_percentages[missing_percentages > 0]
        print("Percentage of missing values in each column:")
        print(missing_columns)
    else:
        print("No missing values found")

    # Check for duplicated rows
    duplicates = data[data.duplicated()]
    if duplicates.empty:
        print("No duplicated rows found")
    else:
        duplicate_count = len(duplicates)
        total_count = len(data)
        print(f"{duplicate_count} out of {total_count} rows are duplicated ({100 * duplicate_count / total_count:.2f}%)")

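    # Check for outliers with the IQR rule (numeric columns only,
    # since quantiles are undefined for text columns)
    numeric = data.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - (1.5 * iqr)
    upper_bound = q3 + (1.5 * iqr)
    # Count values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR in each column
    outliers = ((numeric < lower_bound) | (numeric > upper_bound)).sum()
    if outliers.sum() == 0:
        print("No outliers found")
    else:
        print("Number of outliers per column:")
        print(outliers)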
    
    # Check data types
    data_types = data.dtypes
    print("Data types:")
    print(data_types)

    # Check correlations (numeric columns only; text columns cannot be correlated)
    correlations = data.select_dtypes(include="number").corr()
    print("Correlation table:")
    print(correlations)

## Things to note

- If missing values are found, the code outputs the percentage of missing values for each affected column.
- If any outliers are found, the code outputs the number of outliers in each column.
- Value counts serve as a consistency check for categorical data. The main focus was on columns relating to gender, which contained several variant codes for Male and Female, and on columns holding Yes or No information.
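
As an illustration of the value-count consistency check, here is a minimal sketch. The 'Gender' column, its codes, and the `canonical` mapping are hypothetical and not part of the EDA function shown above; a real dataset will need its own mapping.

```python
import pandas as pd

# Hypothetical data: the column name and codes are invented for illustration.
data = pd.DataFrame(
    {"Gender": ["Male", "M", "female", "F", "male", "Female", None]}
)

# Raw counts expose inconsistent spellings; dropna=False also counts missing entries.
raw_counts = data["Gender"].value_counts(dropna=False)
print(raw_counts)

# Hypothetical mapping from lower-cased codes to one canonical label each.
canonical = {"male": "Male", "m": "Male", "female": "Female", "f": "Female"}
cleaned = data["Gender"].str.strip().str.lower().map(canonical)

# After normalisation, each category should appear under a single label.
clean_counts = cleaned.value_counts(dropna=False)
print(clean_counts)
```

Codes that are missing from the mapping become NaN after `map`, so comparing the number of missing entries before and after normalisation is a quick way to spot codes the mapping does not yet cover.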

## How to use the function?

To use the EDA function, import the eda module with an 'import' statement and then call the function like this:

In [None]:
import pandas as pd
import eda

# Load data
data = pd.read_csv('data.csv')
# Run EDA
eda.EDA(data)