# Exploratory Data Analysis (EDA)

In this notebook, we perform exploratory data analysis (EDA) on the training dataset to understand its structure, visualize key features, and identify issues (for example missing values or class imbalance) that may need to be addressed during preprocessing.

In [None]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Set visualization style (set_theme is the preferred name for sns.set in seaborn 0.11+)
sns.set_theme(style='whitegrid')

# Load the dataset
data_path = '../data/processed/preprocessed_data.csv'
df = pd.read_csv(data_path)

# Display the first few rows of the dataset
df.head()

## Data Overview

Let's check the number of rows and columns, the data type of each column, and how many non-null values each column contains.

In [None]:
# Summarize row count, column data types, and non-null counts
df.info()
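
`df.info()` reports non-null counts, but missing values are often easier to compare as percentages. The sketch below shows one possible way to tabulate them. The helper name `summarize_missing` is hypothetical and is not part of the original notebook.

```python
import pandas as pd


def summarize_missing(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, worst column first.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    summary = pd.DataFrame({
        # Number of NaN/None entries in each column
        "missing_count": frame.isna().sum(),
        # Share of rows that are missing, as a percentage
        "missing_pct": (frame.isna().mean() * 100).round(2),
        # Data type, to spot columns parsed with an unexpected type
        "dtype": frame.dtypes.astype(str),
    })
    return summary.sort_values("missing_count", ascending=False)
```

In this notebook it could be called as `summarize_missing(df)`.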

## Descriptive Statistics

Next, we look at descriptive statistics (count, mean, standard deviation, minimum, quartiles, and maximum) to understand the distribution of the numerical features.

In [None]:
# Descriptive statistics
df.describe()
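
For a DataFrame with mixed types, `df.describe()` summarizes only the numeric columns by default. If the dataset also contains text, categorical, or boolean columns, a sketch like the following could summarize them as well. The helper name `describe_non_numeric` is an assumption made for illustration, and the original notebook does not say whether such columns exist.

```python
import pandas as pd


def describe_non_numeric(frame: pd.DataFrame) -> pd.DataFrame:
    """Summarize non-numeric columns: unique values and the most frequent one.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    non_numeric = frame.select_dtypes(exclude="number")
    rows = {}
    for col in non_numeric.columns:
        # value_counts sorts by frequency, most common value first
        counts = non_numeric[col].value_counts(dropna=True)
        rows[col] = {
            "n_unique": int(non_numeric[col].nunique(dropna=True)),
            "top": counts.index[0] if len(counts) else None,
            "top_freq": int(counts.iloc[0]) if len(counts) else 0,
        }
    # Returns an empty DataFrame when every column is numeric
    return pd.DataFrame.from_dict(rows, orient="index")
```

In this notebook it could be called as `describe_non_numeric(df)`.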

## Visualizations

We start by plotting the distribution of the target variable, `correct_background?`, to see how the classes are balanced.

In [None]:
# Distribution of the target variable
plt.figure(figsize=(8, 6))
sns.countplot(x='correct_background?', data=df)
plt.title('Distribution of Correct Background')
plt.xlabel('Correct Background')
plt.ylabel('Count')
plt.show()
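
The count plot shows class balance visually. The exact proportions can also be tabulated, as in the sketch below. The helper name `class_balance` is hypothetical, and the only column name taken from the notebook is `correct_background?`.

```python
import pandas as pd


def class_balance(frame: pd.DataFrame, target: str) -> pd.DataFrame:
    """Return the count and share of each value in the target column.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    # dropna=False keeps missing labels visible as their own row
    counts = frame[target].value_counts(dropna=False)
    return pd.DataFrame({
        "count": counts,
        "share": (counts / counts.sum()).round(3),
    })
```

In this notebook it could be called as `class_balance(df, 'correct_background?')`.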

## Correlation Matrix

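We can also look at the correlation matrix to see how the numeric features relate to each other. Each cell shows the Pearson correlation coefficient between two features, ranging from -1 to 1.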

In [None]:
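# Correlation matrix (Pearson) over numeric columns only;
# in pandas 2.0+, df.corr() raises an error if non-numeric columns are present
plt.figure(figsize=(10, 8))
correlation = df.select_dtypes(include='number').corr()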
sns.heatmap(correlation, annot=True, fmt='.2f', cmap='coolwarm', square=True)
plt.title('Correlation Matrix')
plt.show()
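
A heatmap becomes hard to read as the number of features grows. One option, sketched below, is to list the most strongly correlated feature pairs in a table. The helper name `top_correlated_pairs` and its parameter `n` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def top_correlated_pairs(frame: pd.DataFrame, n: int = 10) -> pd.DataFrame:
    """List the n numeric feature pairs with the largest absolute Pearson correlation.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    corr = frame.select_dtypes(include="number").corr()
    cols = corr.columns
    # Indices of the upper triangle, excluding the diagonal (k=1),
    # so each pair appears once and self-correlations are skipped
    i, j = np.triu_indices(len(cols), k=1)
    pairs = pd.DataFrame({
        "feature_a": cols[i],
        "feature_b": cols[j],
        "corr": corr.to_numpy()[i, j],
    })
    # Constant columns produce NaN correlations; drop those pairs
    pairs = pairs.dropna(subset=["corr"])
    # Sort by absolute value so strong negative correlations rank high too
    pairs = pairs.sort_values("corr", key=lambda s: s.abs(), ascending=False)
    return pairs.head(n).reset_index(drop=True)
```

In this notebook it could be called as `top_correlated_pairs(df, n=10)`.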

## Conclusion

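In this notebook, we performed exploratory data analysis on the training dataset. We inspected its structure and data types, reviewed descriptive statistics for the numerical features, visualized the distribution of the target variable (`correct_background?`), and examined the correlations between features. Any issues observed along the way, such as class imbalance or strongly correlated features, are candidates for further investigation during preprocessing.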