# Basic Exploratory Data Analysis (EDA)

This notebook walks through the basic techniques of exploratory data analysis (EDA). EDA is a critical first step in any data analysis project: it lets you understand the structure, patterns, and peculiarities of your dataset before applying more advanced analytical or machine learning techniques.

## What is EDA?

Exploratory Data Analysis is an approach to analyzing datasets to summarize their main characteristics, often using visual methods. The primary goals are to:
- Understand the data structure
- Detect outliers and anomalies
- Identify patterns and relationships
- Test underlying assumptions
- Develop an initial understanding before formal modeling

Let's explore the fundamental techniques of EDA!

## 1. Import Required Libraries

We'll start by importing the essential Python libraries needed for data analysis:

In [None]:
# Import essential libraries for data analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Configure visualization settings
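# Note: the 'seaborn-whitegrid' matplotlib style name was removed in matplotlib 3.8,
# so the grid style is set through seaborn instead
sns.set_theme(style="whitegrid", palette="muted")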
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['figure.dpi'] = 100

# Display all columns and up to 100 rows when printing DataFrames
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.width', 1000)

# Suppress warning messages
import warnings
warnings.filterwarnings('ignore')

print("Libraries imported successfully!")

## 2. Load Dataset

For this demonstration, we'll load a sample dataset. We'll use the Titanic dataset from seaborn, which is a classic dataset for data analysis. We'll also show how to load data from different sources.

In [None]:
# Method 1: Load a dataset directly from seaborn
titanic_df = sns.load_dataset('titanic')

# Display the first few rows
print("Dataset from seaborn:")
titanic_df.head()

In [None]:
# Method 2: Load from scikit-learn (another common source)
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
iris_df['target'] = iris.target

print("Iris dataset from scikit-learn:")
iris_df.head()

In [None]:
# Method 3: Load from a CSV file (commented out because no local file ships with this notebook)
# From a local CSV file:
# data = pd.read_csv('path/to/your/file.csv')
#
# From a URL (requires network access):
# url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
# data_from_url = pd.read_csv(url)
# print("Dataset from URL:")
# data_from_url.head()

# For the rest of this notebook, we'll use the Titanic dataset
df = titanic_df.copy()
print(f"We'll use the Titanic dataset with {df.shape[0]} rows and {df.shape[1]} columns")
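As a small offline illustration of the CSV route, the sketch below parses CSV text from an in-memory buffer rather than a file or URL. The rows are made up for the example and are not taken from the Titanic data. By default, `pd.read_csv` treats empty fields and the string `NA` as missing values.

```python
import io

import pandas as pd

# Hypothetical CSV content, used only to demonstrate parsing behavior
csv_text = """survived,pclass,age,fare
0,3,22,7.25
1,1,,71.28
1,3,26,NA
"""

# read_csv accepts any file-like object, so StringIO stands in for a real file
data = pd.read_csv(io.StringIO(csv_text))

print(data)
print(data.isna().sum())  # one missing value each in 'age' and 'fare'
```

The same call works unchanged with a file path or URL in place of the buffer.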

## 3. Data Overview & Structure

Next, we'll explore the structure and composition of our dataset using various pandas methods to get a clear understanding of what we're working with.

In [None]:
# Let's examine the first few rows of the dataset
print("First 5 rows of the dataset:")
df.head()

In [None]:
# Overview of the dataset information
print("Dataset information:")
df.info()

In [None]:
# Dataset dimensions
print(f"Dataset shape: {df.shape} (rows, columns)")

# Column names
print("\nColumn names:")
print(df.columns.tolist())

# Check data types
print("\nData types:")
print(df.dtypes)

# Check for duplicate rows
duplicate_count = df.duplicated().sum()
print(f"\nNumber of duplicate rows: {duplicate_count}")
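The separate checks above can also be gathered into one table with a row per column. This is only a sketch: the helper name `column_overview` and the toy frame are hypothetical, and in this notebook you would pass `df` instead.

```python
import pandas as pd


def column_overview(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, non-null count, missing count, unique count."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "non_null": frame.notna().sum(),
        "missing": frame.isna().sum(),
        "unique": frame.nunique(),  # NaN is not counted as a distinct value
    })


# Toy data for illustration only
toy = pd.DataFrame({
    "age": [22.0, 38.0, None, 35.0],
    "sex": ["male", "female", "female", "male"],
    "fare": [7.25, 71.28, 7.92, 53.10],
})

overview = column_overview(toy)
print(overview)
```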

## 4. Descriptive Statistics

Now that we understand the structure of our data, let's calculate and examine summary statistics to get a better sense of the data's central tendency, dispersion, and distribution.

In [None]:
# Generate descriptive statistics for numerical columns
print("Descriptive statistics for numerical columns:")
df.describe()

In [None]:
# For categorical columns, include both 'object' and 'category' dtypes
# (in the seaborn Titanic data, 'class' and 'deck' are stored as 'category')
print("Descriptive statistics for categorical columns:")
df.describe(include=['object', 'category'])

In [None]:
# We can also look at specific statistics for individual columns
print("Age statistics:")
print(f"Mean age: {df['age'].mean():.2f} years")
print(f"Median age: {df['age'].median():.2f} years")
print(f"Min age: {df['age'].min()} years")
print(f"Max age: {df['age'].max()} years")

# Mode for categorical variables
print("\nMost frequent passenger class:", df['class'].mode()[0])
print("Most frequent embarkation point:", df['embark_town'].mode()[0])
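Summary statistics are often more informative when split by a grouping variable. The sketch below uses a made-up toy frame; with the notebook's data, the equivalent call would group `df` by `'class'`.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "class": ["First", "First", "Third", "Third", "Third"],
    "fare": [80.0, 60.0, 8.0, 7.0, 9.0],
})

# One row per class, one column per statistic
by_class = toy.groupby("class")["fare"].agg(["count", "mean", "median"])
print(by_class)
```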

## 5. Missing Value Analysis

Missing data can significantly impact the analysis and subsequent modeling. Let's identify, visualize, and address missing values in our dataset.

In [None]:
# Check for missing values in each column
missing_values = df.isna().sum()
missing_percentage = (missing_values / len(df)) * 100

# Create a summary DataFrame for missing values
missing_data = pd.DataFrame({
    'Missing Values': missing_values,
    'Percentage (%)': missing_percentage
})

# Sort by missing percentage in descending order
print("Missing data analysis:")
missing_data[missing_data['Missing Values'] > 0].sort_values('Missing Values', ascending=False)

In [None]:
# Visualize missing values
plt.figure(figsize=(12, 8))
sns.heatmap(df.isna(), cbar=False, yticklabels=False, cmap='viridis')
plt.title('Missing Value Heatmap', fontsize=16)
plt.tight_layout()
plt.show()

In [None]:
# Let's also visualize the percentage of missing values per column
missing_data = missing_data[missing_data['Missing Values'] > 0].sort_values('Percentage (%)', ascending=True)

plt.figure(figsize=(10, 6))
plt.barh(missing_data.index, missing_data['Percentage (%)'], color='steelblue')
plt.xlabel('Percentage of Missing Values (%)')
plt.ylabel('Columns')
plt.title('Percentage of Missing Values by Column', fontsize=16)
plt.tight_layout()
plt.show()

### Handling Missing Values

There are several ways to handle missing values:
1. Remove rows with missing values
2. Fill with statistics (mean, median, mode)
3. Use advanced imputation techniques

Let's demonstrate a few of these methods:

In [None]:
# Create a copy of the dataframe to avoid modifying the original
df_clean = df.copy()

# 1. Fill missing numerical values with median
df_clean['age'] = df_clean['age'].fillna(df_clean['age'].median())

# 2. Fill missing categorical values with the mode (most frequent)
df_clean['embark_town'] = df_clean['embark_town'].fillna(df_clean['embark_town'].mode()[0])
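# 'deck' is a categorical column, so 'Unknown' must be registered as a category before filling
# (filling with a value outside the existing categories raises an error)
df_clean['deck'] = df_clean['deck'].cat.add_categories('Unknown').fillna('Unknown')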

# 3. Drop the 'embarked' column as it's redundant with 'embark_town'
df_clean.drop('embarked', axis=1, inplace=True)

# Check if we've handled all missing values
print("Remaining missing values after handling:")
print(df_clean.isna().sum())

# Compare the mean, median, min, max of age before and after imputation
print("\nAge statistics before imputation:")
print(df['age'].describe())
print("\nAge statistics after imputation:")
print(df_clean['age'].describe())
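The cell above fills gaps with overall statistics (option 2 in the list). For comparison, here is a sketch of option 1 (dropping rows) and of a group-wise variant of option 2 that fills each missing age with the median of the passenger's class. The toy frame is made up, and whether group-wise filling suits the real data is an assumption you would need to check.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "pclass": [1, 1, 1, 3, 3, 3],
    "age": [40.0, np.nan, 50.0, 20.0, 24.0, np.nan],
})

# Option 1: drop rows where 'age' is missing
dropped = toy.dropna(subset=["age"])

# Group-wise variant of option 2: median age within each class,
# broadcast back to the original row order with transform
group_median = toy.groupby("pclass")["age"].transform("median")
filled = toy.assign(age=toy["age"].fillna(group_median))

print(f"Rows kept after dropping: {len(dropped)} of {len(toy)}")
print(filled)
```

Dropping rows is simplest but discards information in the other columns; group-wise filling keeps every row at the cost of an extra modeling assumption.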

## 6. Distribution of Variables

Understanding how variables are distributed helps us identify patterns, outliers, and potential issues with our data. Let's analyze the distributions of key variables.

In [None]:
# Let's examine the distribution of passenger ages
plt.figure(figsize=(12, 6))

# Histogram
plt.subplot(1, 2, 1)
sns.histplot(df_clean['age'], kde=True, bins=30)
plt.title('Age Distribution', fontsize=14)
plt.axvline(df_clean['age'].mean(), color='red', linestyle='--', label=f'Mean: {df_clean["age"].mean():.2f}')
plt.axvline(df_clean['age'].median(), color='green', linestyle='-.', label=f'Median: {df_clean["age"].median():.2f}')
plt.legend()

# QQ Plot for normality check
plt.subplot(1, 2, 2)
stats.probplot(df_clean['age'].dropna(), plot=plt)
plt.title('QQ Plot of Age', fontsize=14)

plt.tight_layout()
plt.show()

In [None]:
# Examine the distribution of fare
plt.figure(figsize=(12, 6))

# Histogram - original scale
plt.subplot(1, 3, 1)
sns.histplot(df_clean['fare'], kde=True, bins=30)
plt.title('Fare Distribution', fontsize=14)
plt.axvline(df_clean['fare'].mean(), color='red', linestyle='--', label=f'Mean: {df_clean["fare"].mean():.2f}')
plt.axvline(df_clean['fare'].median(), color='green', linestyle='-.', label=f'Median: {df_clean["fare"].median():.2f}')
plt.legend()

# Histogram - log scale
plt.subplot(1, 3, 2)
sns.histplot(np.log1p(df_clean['fare']), kde=True, bins=30)
plt.title('Log-transformed Fare Distribution', fontsize=14)

# Box plot
plt.subplot(1, 3, 3)
sns.boxplot(y=df_clean['fare'])
plt.title('Fare Box Plot', fontsize=14)

plt.tight_layout()
plt.show()
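The effect of the log transform can also be quantified with the sample skewness. The sketch below uses a handful of made-up fare-like values; on the notebook's data you would call `.skew()` on `df_clean['fare']` and on `np.log1p(df_clean['fare'])`.

```python
import numpy as np
import pandas as pd

# Made-up, right-skewed values for illustration only
fares = pd.Series([7.25, 7.9, 8.05, 10.5, 13.0, 26.0, 31.3, 71.3, 263.0, 512.3])

raw_skew = fares.skew()            # skewness on the original scale
log_skew = np.log1p(fares).skew()  # skewness after log(1 + x)

print(f"Skewness (raw):   {raw_skew:.2f}")
print(f"Skewness (log1p): {log_skew:.2f}")
```

A value near zero indicates a roughly symmetric distribution; a large positive value indicates a long right tail.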

In [None]:
# Distribution of categorical variables
plt.figure(figsize=(16, 12))

plt.subplot(2, 2, 1)
sns.countplot(x='class', data=df_clean, palette='viridis')
plt.title('Passenger Class Distribution', fontsize=14)
plt.ylabel('Count')

plt.subplot(2, 2, 2)
sns.countplot(x='sex', data=df_clean, palette='viridis')
plt.title('Gender Distribution', fontsize=14)
plt.ylabel('Count')

plt.subplot(2, 2, 3)
sns.countplot(x='embark_town', data=df_clean, palette='viridis')
plt.title('Embarkation Port Distribution', fontsize=14)
plt.xticks(rotation=45)
plt.ylabel('Count')

plt.subplot(2, 2, 4)
sns.countplot(x='survived', data=df_clean, palette='viridis')
plt.title('Survival Distribution', fontsize=14)
plt.ylabel('Count')

plt.tight_layout()
plt.show()
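Count plots show absolute frequencies; the matching proportions come from `value_counts(normalize=True)`. A minimal sketch on a made-up column:

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({"sex": ["male", "male", "male", "female"]})

# Share of each category, as fractions that sum to 1
shares = toy["sex"].value_counts(normalize=True)
print(shares)
```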

## 7. Correlation Analysis

Correlation analysis helps us understand relationships between numerical variables. Let's calculate and visualize correlations in our dataset.

In [None]:
# Select only numeric columns for correlation
numeric_df = df_clean.select_dtypes(include=['float64', 'int64'])

# Calculate the correlation matrix
correlation_matrix = numeric_df.corr()

# Print the correlation matrix
print("Correlation matrix:")
correlation_matrix

In [None]:
# Visualize the correlation matrix using a heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1, fmt='.2f')
plt.title('Correlation Matrix of Numerical Variables', fontsize=16)
plt.tight_layout()
plt.show()
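`DataFrame.corr` computes Pearson correlation by default, which measures linear association. For skewed variables such as fare, a rank-based measure like Spearman correlation can be a useful cross-check. The sketch below uses a made-up monotonic but non-linear relationship to show how the two can differ.

```python
import pandas as pd

# Toy data: y increases with x, but not linearly
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [1, 4, 9, 16, 100],
})

pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Spearman correlation depends only on the ranks, so any strictly increasing relationship gives a value of 1.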

In [None]:
# Create scatter plots for important correlations
plt.figure(figsize=(18, 6))

# Fare vs Age
plt.subplot(1, 3, 1)
sns.scatterplot(x='age', y='fare', hue='survived', data=df_clean, palette='viridis', alpha=0.7)
plt.title('Fare vs Age (colored by survival)', fontsize=14)

# Fare vs Pclass
plt.subplot(1, 3, 2)
sns.boxplot(x='class', y='fare', data=df_clean, palette='viridis')
plt.title('Fare Distribution by Class', fontsize=14)

# Age vs Pclass
plt.subplot(1, 3, 3)
sns.boxplot(x='class', y='age', data=df_clean, palette='viridis')
plt.title('Age Distribution by Class', fontsize=14)

plt.tight_layout()
plt.show()

## 8. Categorical Variable Analysis

Let's explore the categorical variables in more depth to understand their distributions and relationships with other variables.

In [None]:
# Distribution of survival by sex
plt.figure(figsize=(14, 5))

plt.subplot(1, 2, 1)
sns.countplot(x='sex', hue='survived', data=df_clean, palette='viridis')
plt.title('Survival Count by Gender', fontsize=14)
plt.xlabel('Gender')
plt.ylabel('Count')

# Add percentages to the plot
for i, gender in enumerate(['male', 'female']):
    total = len(df_clean[df_clean['sex'] == gender])
    survived = len(df_clean[(df_clean['sex'] == gender) & (df_clean['survived'] == 1)])
    survival_rate = survived / total * 100
    plt.annotate(f'{survival_rate:.1f}%', 
                xy=(i, survived/2), 
                ha='center', 
                va='center',
                fontsize=12)

ax = plt.subplot(1, 2, 2)
# Create a cross-tabulation to calculate survival rates by gender
survival_by_sex = pd.crosstab(df_clean['sex'], df_clean['survived'])
survival_by_sex_pct = survival_by_sex.div(survival_by_sex.sum(1), axis=0) * 100
# Pass ax explicitly; otherwise DataFrame.plot opens a new figure and leaves this subplot empty
survival_by_sex_pct.plot(kind='bar', stacked=True, ax=ax, colormap='viridis')
plt.title('Survival Rate by Gender (%)', fontsize=14)
plt.xlabel('Gender')
plt.ylabel('Percentage')
plt.legend(['Did not survive', 'Survived'])

plt.tight_layout()
plt.show()

In [None]:
# Distribution of survival by class
plt.figure(figsize=(14, 5))

plt.subplot(1, 2, 1)
sns.countplot(x='class', hue='survived', data=df_clean, palette='viridis')
plt.title('Survival Count by Class', fontsize=14)
plt.xlabel('Passenger Class')
plt.ylabel('Count')

ax = plt.subplot(1, 2, 2)
# Create a cross-tabulation to calculate survival rates by class
survival_by_class = pd.crosstab(df_clean['class'], df_clean['survived'])
survival_by_class_pct = survival_by_class.div(survival_by_class.sum(1), axis=0) * 100
# Pass ax explicitly; otherwise DataFrame.plot opens a new figure and leaves this subplot empty
survival_by_class_pct.plot(kind='bar', stacked=True, ax=ax, colormap='viridis')
plt.title('Survival Rate by Class (%)', fontsize=14)
plt.xlabel('Passenger Class')
plt.ylabel('Percentage')
plt.legend(['Did not survive', 'Survived'])

plt.tight_layout()
plt.show()

In [None]:
# Let's create a contingency table to analyze the relationship between passenger class and gender
contingency_table = pd.crosstab(df_clean['class'], df_clean['sex'])
print("Class vs Gender Contingency Table:")
print(contingency_table)

# DataFrame.plot creates its own figure, so set the size there instead of calling plt.figure first
contingency_table.plot(kind='bar', stacked=True, colormap='viridis', figsize=(10, 6))
plt.title('Distribution of Gender by Passenger Class', fontsize=14)
plt.xlabel('Passenger Class')
plt.ylabel('Count')
plt.legend(title='Gender')
plt.tight_layout()
plt.show()
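A contingency table can be condensed into a single number describing how strongly two categorical variables are associated. One common choice is Cramér's V, computed from the chi-square statistic. The sketch below computes it with NumPy only (no continuity correction); the helper name `cramers_v` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def cramers_v(table: pd.DataFrame) -> float:
    """Cramér's V for a contingency table (0 = no association, 1 = perfect association)."""
    observed = table.to_numpy(dtype=float)
    n = observed.sum()
    # Expected counts under independence: (row total * column total) / grand total
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    k = min(observed.shape) - 1
    return float(np.sqrt(chi2 / (n * k)))


# Toy data for illustration only
toy = pd.DataFrame({
    "sex": ["male"] * 6 + ["female"] * 4,
    "survived": [0, 0, 0, 0, 0, 1, 1, 1, 1, 0],
})

table = pd.crosstab(toy["sex"], toy["survived"])
v = cramers_v(table)
print(table)
print(f"Cramér's V: {v:.3f}")
```

In this notebook, the same function could be applied to `contingency_table` or to the survival cross-tabulations built earlier.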

## 9. Outlier Detection

Outliers can significantly impact statistical analyses and model performance. Let's identify and visualize outliers in our dataset.

In [None]:
# Box plots are excellent for identifying outliers
plt.figure(figsize=(15, 10))

# Age Outliers
plt.subplot(2, 2, 1)
sns.boxplot(y='age', data=df_clean)
plt.title('Age Distribution with Outliers', fontsize=14)

# Fare Outliers
plt.subplot(2, 2, 2)
sns.boxplot(y='fare', data=df_clean)
plt.title('Fare Distribution with Outliers', fontsize=14)

# Fare by Class - to see if high fares are outliers or just first class
plt.subplot(2, 2, 3)
sns.boxplot(x='class', y='fare', data=df_clean)
plt.title('Fare Distribution by Class', fontsize=14)

# Numerical approach to identifying outliers in fare
plt.subplot(2, 2, 4)
plt.scatter(range(len(df_clean)), df_clean['fare'], alpha=0.5)
plt.axhline(y=df_clean['fare'].mean() + 3*df_clean['fare'].std(), color='r', linestyle='--', label='3σ threshold')
plt.title('Fare Values with 3σ Threshold', fontsize=14)
plt.ylabel('Fare')
plt.xlabel('Index')
plt.legend()

plt.tight_layout()
plt.show()

In [None]:
# Identifying outliers using Z-score and IQR methods

def identify_outliers_zscore(df, column, threshold=3):
    """Identify outliers using Z-score method"""
    z_scores = np.abs(stats.zscore(df[column].dropna()))
    outliers = df[column].dropna()[z_scores > threshold]
    return outliers

def identify_outliers_iqr(df, column):
    """Identify outliers using IQR method"""
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)][column]
    return outliers

# Identify fare outliers
fare_outliers_zscore = identify_outliers_zscore(df_clean, 'fare')
fare_outliers_iqr = identify_outliers_iqr(df_clean, 'fare')

print(f"Fare outliers using Z-score method: {len(fare_outliers_zscore)}")
print(f"Fare outliers using IQR method: {len(fare_outliers_iqr)}")

# Identify age outliers
age_outliers_zscore = identify_outliers_zscore(df_clean, 'age')
age_outliers_iqr = identify_outliers_iqr(df_clean, 'age')

print(f"\nAge outliers using Z-score method: {len(age_outliers_zscore)}")
print(f"Age outliers using IQR method: {len(age_outliers_iqr)}")

# Let's look at some of the fare outliers
print("\nTop 10 fare outliers (Z-score method):")
fare_outliers_df = df_clean.loc[fare_outliers_zscore.index]
fare_outliers_df.sort_values(by='fare', ascending=False)[['class', 'fare', 'sex', 'age']].head(10)
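The two methods can disagree, as the differing counts above suggest. One reason is that an extreme value inflates the standard deviation used by the Z-score, which can hide the very point it should flag, while quartiles are much less affected. The made-up series below illustrates this: the IQR rule flags the value 100, but its Z-score stays just under the threshold of 3.

```python
import pandas as pd

# Toy data for illustration only: nine ordinary values and one extreme value
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

# IQR rule: flag points more than 1.5 * IQR outside the quartiles
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = values[(values < lower) | (values > upper)]

# Z-score rule with population standard deviation (ddof=0, as in scipy.stats.zscore)
z_scores = (values - values.mean()) / values.std(ddof=0)
z_outliers = values[z_scores.abs() > 3]

print(f"IQR bounds: [{lower}, {upper}]")
print(f"IQR outliers: {iqr_outliers.tolist()}")
print(f"Z-score outliers: {z_outliers.tolist()}")
```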

## 10. Basic Data Visualization

Finally, let's create a few multi-variable visualizations to look more closely at the relationships in the dataset.

In [None]:
# Create a pair plot for numerical variables
# (sns.pairplot builds its own figure, so a separate plt.figure call would only leave an empty figure behind)
sns.pairplot(df_clean, 
             vars=['age', 'fare', 'pclass'],
             hue='survived',
             diag_kind='kde',
             plot_kws={'alpha': 0.6},
             palette='viridis')
plt.suptitle('Pair Plot of Key Variables', y=1.02, fontsize=16)
plt.tight_layout()
plt.show()

In [None]:
# Violin plots to compare distributions
plt.figure(figsize=(16, 8))

plt.subplot(1, 2, 1)
sns.violinplot(x='class', y='age', hue='survived', data=df_clean, palette='viridis', split=True)
plt.title('Age Distribution by Class and Survival', fontsize=14)

plt.subplot(1, 2, 2)
sns.violinplot(x='class', y='fare', hue='survived', data=df_clean, palette='viridis', split=True)
plt.title('Fare Distribution by Class and Survival', fontsize=14)

plt.tight_layout()
plt.show()

In [None]:
# Distribution of passengers by class, gender, and survival
# Create a FacetGrid (catplot is figure-level and builds its own figure, so no plt.figure call is needed)
g = sns.catplot(
    data=df_clean, kind="count",
    x="class", hue="survived", col="sex",
    palette="viridis", height=6, aspect=.7)

# Customize the plot
g.set_axis_labels("Passenger Class", "Count")
g.set_titles(col_template="{col_name}")
g.fig.suptitle('Passenger Counts by Class, Gender, and Survival', fontsize=16, y=1.05)
# catplot already adds a legend for the hue variable; retitle it instead of adding a second one
g.legend.set_title("Survived")

plt.tight_layout()
plt.show()

In [None]:
# Create a swarm plot to visualize individual data points
plt.figure(figsize=(14, 8))

plt.subplot(1, 2, 1)
sns.swarmplot(x='class', y='age', hue='survived', data=df_clean.sample(n=min(300, len(df_clean))), palette='viridis')
plt.title('Age by Class and Survival (Sample of Points)', fontsize=14)

plt.subplot(1, 2, 2)
sns.swarmplot(x='embark_town', y='fare', hue='survived', data=df_clean.sample(n=min(300, len(df_clean))), palette='viridis')
plt.title('Fare by Embarkation Point and Survival (Sample of Points)', fontsize=14)
plt.xticks(rotation=45)

plt.tight_layout()
plt.show()

## Summary and Conclusions

From our exploratory data analysis of the Titanic dataset, we've gained several key insights:

1. **Demographics**: The dataset contains information about 891 passengers with variables including age, sex, class, fare, and survival status.

2. **Missing Data**: There were significant missing values in 'age' (~20%) and 'deck' (~77%), which we handled with median imputation for 'age' and an explicit 'Unknown' category for 'deck'.

3. **Survival Patterns**:
   - Women had a much higher survival rate than men
   - Passengers in higher classes (1st class) had better survival rates than those in lower classes
   - Age also played a role in survival, with children having better odds

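4. **Correlations**:
   - Fare was clearly related to passenger class, with higher fares in higher classes (because first class is coded as 1 in `pclass`, this shows up as a negative correlation)
   - Survival was associated with sex, class, and fare (sex is categorical, so it does not appear in the numeric correlation matrix)
   - Age showed some correlation with class and survival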

5. **Distributions**:
   - Age was roughly bell-shaped with a mean around 30 years; median imputation adds an artificial spike at the median
   - Fare was right-skewed, with most passengers paying lower fares and a few paying much higher amounts
   - There were more male passengers than female passengers

6. **Outliers**: We identified several outliers, particularly in the fare variable, mostly attributed to first-class passengers.

These insights provide a solid foundation for further analysis and modeling, highlighting important relationships and potential predictors of survival.
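One way to check the age pattern from item 3 directly is to bin ages and compare survival rates per band. This is a sketch only: the helper name `survival_rate_by_age_band`, the band cutoffs, and the toy rows are assumptions for illustration. In this notebook you would pass `df` (before imputation) or `df_clean`.

```python
import pandas as pd


def survival_rate_by_age_band(frame: pd.DataFrame) -> pd.Series:
    """Mean of 'survived' within age bands; the cutoffs are illustrative assumptions."""
    bands = pd.cut(
        frame["age"],
        bins=[0, 12, 60, 120],  # intervals (0, 12], (12, 60], (60, 120]
        labels=["child", "adult", "senior"],
    ).rename("age_band")
    return frame.groupby(bands, observed=True)["survived"].mean()


# Toy data for illustration only
toy = pd.DataFrame({
    "age": [5, 8, 30, 40, 70],
    "survived": [1, 1, 0, 1, 0],
})

rates = survival_rate_by_age_band(toy)
print(rates)
```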

## Next Steps

Based on our EDA, potential next steps could include:

1. **Feature Engineering**: Create new features like family size (combining siblings/spouses and parents/children) or fare per person

2. **Data Preprocessing**: Normalize numerical features and encode categorical features for modeling

3. **Model Building**: Build predictive models to determine factors that influenced survival

4. **Advanced Analysis**: Conduct more sophisticated statistical tests to validate the observed patterns

5. **Visualization**: Create an interactive dashboard for stakeholders to explore the data and findings
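As a sketch of the feature engineering idea in step 1, the snippet below derives a family size and a fare per person. The column names `sibsp` and `parch` match the seaborn Titanic dataset; the toy rows are made up, and dividing fare by family size assumes the recorded fare covers the whole family, which is not verified here.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "sibsp": [1, 0, 3],   # siblings/spouses aboard
    "parch": [0, 0, 2],   # parents/children aboard
    "fare": [14.0, 8.0, 30.0],
})

# Family size counts the passenger plus accompanying relatives
toy["family_size"] = toy["sibsp"] + toy["parch"] + 1

# Assumption: the fare was paid once for the whole family
toy["fare_per_person"] = toy["fare"] / toy["family_size"]

print(toy)
```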

EDA is just the beginning of the data science workflow, but it provides critical insights that guide all subsequent analytical decisions.