## EDA module explanation

This Jupyter notebook explains how the 'EDA' module works.
The EDA function below performs only basic exploratory data analysis (EDA). It should work well in general, but you should still take the specifics of your own dataset into account. The function does the following:

- Displays the first five rows of the dataset using the display function from IPython.
- Prints the number of rows and columns in the dataset using the shape attribute of the pandas DataFrame.
- Checks if there are any missing values in the dataset using the isnull().values.any() method. If there are missing values, it calculates the percentage of missing values in each column and prints the columns that have any.
- Checks for duplicated rows using the duplicated() method. If there are duplicated rows, it prints the number of duplicated rows and their percentage.
- Checks for outliers in the dataset using the interquartile range (IQR) method. It calculates the lower and upper bounds (Q1 - 1.5 * IQR and Q3 + 1.5 * IQR) and checks for values that are less than the lower bound or greater than the upper bound. If there are outliers, it prints the number of outliers per column.
- Prints the data type of each column using the dtypes attribute of the pandas DataFrame.
- Calculates the correlation matrix for the dataset using the corr() method and prints it.
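
The IQR rule described in the list above can be tried on a single toy column before running the full function. The sketch below is only an illustration: the Series `values` and its numbers are made up, and the quantiles rely on pandas' default linear interpolation.

```python
import pandas as pd

# Hypothetical toy column; 100 is the value we expect to be flagged
values = pd.Series([10, 12, 11, 13, 12, 11, 100])

# Quartiles and interquartile range
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Bounds used by the IQR rule
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Boolean mask: True where a value falls outside the bounds
is_outlier = (values < lower_bound) | (values > upper_bound)
outlier_count = int(is_outlier.sum())

print(f"Bounds: [{lower_bound}, {upper_bound}]")
print(f"Outliers found: {outlier_count}")
```

The 1.5 multiplier is the conventional choice and the one the function uses. A larger multiplier flags fewer points.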

In [2]:
# Simple code to analyse the dataset

import pandas as pd
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt

def EDA(data):
    """Perform exploratory data analysis on a given dataset.
    Args:
      data (pandas.DataFrame): The dataset to analyze.
    """
    from IPython.display import display
    display(data.head())
     
    # Print the shape of the dataset
    print(f"Number of rows: {data.shape[0]}")
    print(f"Number of columns: {data.shape[1]}")   

    # Check for missing values
    if data.isnull().values.any():
        # Find the number of missing values for each column
        missing_counts = data.isnull().sum()
        # Calculate the percentage of missing values for each column
        missing_percentages = 100 * missing_counts / len(data)
        # Select only the columns with missing values
        missing_columns = missing_percentages[missing_percentages > 0]
        print("Percentage of missing values in each column:")
        print(missing_columns)
    else:
        print("No missing values found")

    # Check for duplicated rows
    duplicates = data[data.duplicated()]
    if duplicates.empty:
        print("No duplicated rows found")
    else:
        duplicate_count = len(duplicates)
        total_count = len(data)
        print(f"{duplicate_count} out of {total_count} rows are duplicated ({100 * duplicate_count / total_count:.2f}%)")

    # Check for outliers with the IQR rule (numeric columns only,
    # because quantiles are not defined for text columns)
    numeric = data.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - (1.5 * iqr)
    upper_bound = q3 + (1.5 * iqr)
    outliers = ((numeric < lower_bound) | (numeric > upper_bound)).sum()
    if outliers.sum() == 0:
        print("No outliers found")
    else:
        print("Number of outliers per column:")
        print(outliers)
    
    # Check data types
    print("Data types:")
    print(data.dtypes)

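    # Check correlations (numeric columns only; recent pandas versions
    # raise an error if corr() is called on text columns)
    correlations = data.select_dtypes(include="number").corr()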
    print("Correlation table:")
    print(correlations)

## Things to note

- If missing values are found, the code outputs the percentage of missing values for each affected column.
- If any outliers are found, the code outputs the number of outliers per column, not their positions.
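
To see what these two outputs look like, here is a small sketch on a made-up DataFrame. The name `toy`, its column names, and its values are hypothetical. The sketch repeats the missing-value and outlier checks on their own, outside the function.

```python
import pandas as pd

# Hypothetical toy dataset: 'age' has one missing value,
# 'income' has one extreme value
toy = pd.DataFrame({
    "age": [25, 30, None, 35, 40],
    "income": [30, 32, 31, 33, 500],
})

# Percentage of missing values, keeping only columns that have any
missing_percentages = 100 * toy.isnull().sum() / len(toy)
missing_columns = missing_percentages[missing_percentages > 0]
print("Percentage of missing values in each column:")
print(missing_columns)

# Number of outliers per column with the IQR rule
q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1
outliers = ((toy < q1 - 1.5 * iqr) | (toy > q3 + 1.5 * iqr)).sum()
print("Number of outliers per column:")
print(outliers)
```

Only 'age' appears in the missing-value output, and the outlier output is a count per column. Missing values are never counted as outliers, because comparisons against NaN evaluate to False.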

## How to use the function?

To use the EDA function, import the eda module and call EDA on your DataFrame:

In [None]:
import pandas as pd
import eda
# Load data
data = pd.read_csv('data.csv')
# Run EDA
eda.EDA(data)