# Exploratory Data Analysis (EDA): Tips & Tricks (Python)

**Role:** Expert in EDA using Python  
**Audience:** Data scientists, analysts, beginners & intermediate users  

In this notebook we work through EDA step by step: libraries, data cleaning, visualizations, and best practices. The examples use Seaborn's classic `titanic` dataset. Expect a bit of a "desi" touch too: simple, practical, and friendly (like a chat over chai). ☕🇵🇰

References & further learning:
- codanics: https://codanics.com
- codanics YouTube channel: https://www.youtube.com/c/codanics

Notebook structure (four pillars in separate sections):
1. Data composition
2. Data distribution
3. Data relationships
4. Data comparison

Run the cells in order. If you are working in a local environment, uncomment the `pip install` line if needed.

In [None]:
# Uncomment to install if you're on a fresh environment
# !pip install pandas numpy matplotlib seaborn plotly missingno scipy

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import missingno as msno
from scipy import stats

import warnings
warnings.filterwarnings('ignore')

# Useful display settings
sns.set(style='whitegrid')
plt.rcParams['figure.figsize'] = (10, 5)
pd.set_option('display.max_columns', 60)
pd.set_option('display.max_rows', 100)

print('Libraries imported, ready for EDA!')

## Load dataset: Titanic (Seaborn)

Seaborn ships a ready-made `titanic` dataset. We'll load it, keep a copy of the original, and inspect basic info.

In [None]:
df = sns.load_dataset('titanic')
df_original = df.copy()
print('Shape:', df.shape)
df.head()

### Quick overview: useful functions
- `df.shape`: rows & columns
- `df.info()`: dtypes & non-null counts
- `df.describe()`: numeric summary
- `df.describe(include=['object','category'])`: categorical summary
- `df.head()`, `df.tail()`

Run the cell below to see a quick overview.

In [None]:
df.info()
display(df.describe())
display(df.describe(include=['object','category']))
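
The overview calls above can also be folded into one table per column. The sketch below is a minimal illustration on a small made-up frame; the `overview` helper and the `toy` data are hypothetical names introduced here, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 54.0, np.nan],
    "fare": [7.25, 71.28, 8.05, 51.86, 8.46],
    "sex": ["male", "female", "female", "male", "male"],
})

def overview(frame: pd.DataFrame) -> pd.DataFrame:
    """One row per column: dtype, missing count, missing percent, distinct values."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
        "pct_missing": (frame.isna().mean() * 100).round(1),
        "n_unique": frame.nunique(),  # NaN is not counted as a distinct value
    })

summary = overview(toy)
print(summary)
```

On the real data you would call `overview(df)`; sorting the result by `pct_missing` is a quick way to see which columns need attention first.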

----
## Data Cleaning Techniques: Practical tips
Selected tips that come in handy every day:
- Missing values: understand pattern, impute or drop intelligently
- Duplicates: drop if duplicates are not meaningful
- Data types: convert to category for better memory & plotting
- Outliers: detect and treat depending on objective
- Feature engineering: create helpful derived features

Each step comes with a code snippet and an explanation; read the comments carefully (the code is well commented).

In [None]:
# 1) Missing value pattern
print('Missing per column:')
display(df.isna().sum().sort_values(ascending=False))

# Visualize missingness quickly (matrix + bar)
msno.matrix(df)
plt.title('Missingness matrix')
plt.show()  # render the matrix before drawing the next figure

msno.bar(df)
plt.title('Missingness bar')
plt.show()


Interpretation / tip:
- `age` has many missing values → impute carefully (median by groups is a good starting point).
- `deck` is sparse â†’ keep as 'Missing' category or drop depending on task.

Example imputations shown below (practical).

In [None]:
# Work on a copy
df_clean = df.copy()

# Impute 'age' with median by ('pclass', 'sex') to keep signal
df_clean['age'] = df_clean.groupby(['pclass', 'sex'])['age'].transform(lambda x: x.fillna(x.median()))

# Impute 'embarked' with mode (only a couple nulls)
df_clean['embarked'] = df_clean['embarked'].fillna(df_clean['embarked'].mode()[0])

# 'deck' is sparse -> mark as 'Missing' category (keeps info that deck was missing)
df_clean['deck'] = df_clean['deck'].astype(object).fillna('Missing')

# Convert some columns to category
for c in ['sex', 'embarked', 'class', 'who', 'deck', 'alive', 'alone']:
    if c in df_clean.columns:
        df_clean[c] = df_clean[c].astype('category')

print('After imputations:')
display(df_clean.isna().sum())
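
One optional refinement, sketched here as an assumption rather than something the notebook does: record which rows were missing before filling them, so the information "this value was imputed" is not lost. The column name `age_was_missing` and the toy data are hypothetical, and the grouping uses class only to keep the example small.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two classes, one missing age in each
toy = pd.DataFrame({
    "pclass": [1, 1, 1, 3, 3, 3],
    "age": [38.0, np.nan, 54.0, 22.0, 26.0, np.nan],
})

# Flag the missing rows BEFORE imputing, otherwise the signal is gone
toy["age_was_missing"] = toy["age"].isna()

# Same group-median idea as above, grouped by class only in this toy case
toy["age"] = toy.groupby("pclass")["age"].transform(lambda s: s.fillna(s.median()))

print(toy)
```

The flag can later be plotted or compared against the target to check whether missingness itself carries information.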

### Duplicates & data types
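- Check duplicates with `df.duplicated().sum()`. This version of `titanic` has no passenger identifier (no name or ticket column), so rows flagged as duplicates may be different passengers who share the same attributes; do not drop them blindly.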
- Converting strings to `category` often helps with plotting and memory.
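
To make the memory point concrete, here is a small sketch comparing the same labels stored as `object` and as `category`. The labels and the repetition count are made up for illustration.

```python
import pandas as pd

# Hypothetical column: 3 distinct labels repeated many times
labels = pd.Series(["Southampton", "Cherbourg", "Queenstown"] * 1000)

as_object = labels.astype(object)
as_category = labels.astype("category")

# deep=True counts the actual string payload, not just the pointers
bytes_object = as_object.memory_usage(deep=True)
bytes_category = as_category.memory_usage(deep=True)

print(f"object:   {bytes_object:,} bytes")
print(f"category: {bytes_category:,} bytes")
```

The saving comes from storing each distinct label once plus a small integer code per row, so it is largest when a column has few distinct values relative to its length.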


In [None]:
print('Duplicates:', df_clean.duplicated().sum())
print('\nMemory usage (after converting categorical):')
display(df_clean.memory_usage(deep=True))

### Outliers detection (IQR method): numeric columns
Tip: outliers are not always "bad"; decide based on the domain. For skewed monetary values such as `fare`, a log transform can help.

In [None]:
num_cols = df_clean.select_dtypes(include=['number']).columns.tolist()
print('Numeric columns:', num_cols)  # a bare expression mid-cell would not be displayed

def detect_iqr_outliers(series):
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return series[(series < lower) | (series > upper)]

for c in ['age', 'fare']:
    out = detect_iqr_outliers(df_clean[c].dropna())
    print(f"{c}: outliers detected = {out.shape[0]}")
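
Detection is only half of "detect and treat". One possible treatment, shown here as a sketch and not as the notebook's chosen method, is to cap values at the IQR fences instead of dropping rows. The helper name `iqr_fences` and the toy fares are hypothetical.

```python
import pandas as pd

def iqr_fences(series: pd.Series, k: float = 1.5):
    """Return the (lower, upper) fences used by the IQR rule."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical fares with one extreme value
fares = pd.Series([7.0, 8.0, 8.0, 10.0, 13.0, 15.0, 26.0, 30.0, 512.0])

lower, upper = iqr_fences(fares)
capped = fares.clip(lower=lower, upper=upper)  # values beyond a fence are set to the fence

print("fences:", lower, upper)
print(capped.tolist())
```

Capping keeps every row, which matters on a small dataset, but it also distorts the tail; whether that trade-off is acceptable depends on the objective.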

----
## Pillar 1: Data Composition (what's inside the dataset?)
Focus: counts, unique categories, class imbalance, composition by groups.

In [None]:
# Value counts and bar plots for categorical columns
cat_cols = ['sex', 'class', 'embarked', 'who', 'deck']
for c in cat_cols:
    print('\n', c.upper())
    display(df_clean[c].value_counts())
    plt.figure(figsize=(6,3))
    sns.countplot(data=df_clean, x=c, order=df_clean[c].value_counts().index, palette='pastel')
    plt.title(f'Count plot: {c}')
    plt.show()

Interpretation (desi note):  
- There are more `male` than `female` passengers (check the composition).  
- `Third` class is the largest group; keep this imbalance in mind during analysis.

These composition checks help decide sampling strategies (up/down sampling) and stratified splits.
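
As a rough sketch of what a stratified draw looks like in pandas, the example below measures class balance as proportions and then samples the same fraction from each class. The toy frame is made up; `groupby(...).sample(...)` assumes pandas 1.1 or newer.

```python
import pandas as pd

# Hypothetical imbalanced target: 60 zeros, 40 ones
toy = pd.DataFrame({
    "survived": [0] * 60 + [1] * 40,
    "fare": list(range(100)),
})

# Class balance as proportions rather than raw counts
balance = toy["survived"].value_counts(normalize=True)
print(balance)

# Stratified 20% sample: draw the same fraction from every class
sample = toy.groupby("survived").sample(frac=0.2, random_state=0)
sample_balance = sample["survived"].value_counts(normalize=True)
print(sample_balance)
```

Because each class is sampled at the same rate, the sample keeps the original 60/40 split, which a plain random sample only does approximately.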

----
## Pillar 2: Data Distribution (how features are distributed)
Focus: histograms, density plots, boxplots, transformations (log), skew detection.

In [None]:
# Histograms + KDE for numeric columns
plt.figure(figsize=(12,5))
sns.histplot(df_clean['age'].dropna(), bins=30, kde=True, color='skyblue')
plt.title('Age distribution (imputed)')
plt.xlabel('Age')
plt.show()

plt.figure(figsize=(12,5))
sns.histplot(df_clean['fare'].dropna(), bins=40, kde=True, color='salmon')
plt.title('Fare distribution (raw)')
plt.xlim(0, 200)
plt.show()

# Fare is right skewed; try a log transform for visualization
plt.figure(figsize=(10,4))
sns.histplot(np.log1p(df_clean['fare']), bins=30, kde=True, color='olivedrab')
plt.title('Log(1+fare) distribution')
plt.show()

# Boxplots to spot outliers
plt.figure(figsize=(8,4))
sns.boxplot(x='pclass', y='fare', data=df_clean, palette='muted')
plt.ylim(0, 200)
plt.title('Fare by Pclass (boxplot)')
plt.show()

Interpretation:
- Age is roughly bell-shaped with a mild right skew; the group-median imputation done earlier adds artificial spikes at the imputed values.  
- Fare is heavily right-skewed; a log transform often helps for models and visualizations.  
- Boxplots show Pclass differences in fare, which makes sense given the domain (First > Third).
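
Skew can also be checked numerically instead of by eye. The sketch below uses simulated right-skewed values (a lognormal sample, chosen here only for illustration) and compares `Series.skew()` before and after `np.log1p`.

```python
import numpy as np
import pandas as pd

# Simulated right-skewed values; the distribution and its parameters are assumptions
rng = np.random.default_rng(0)
fares = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=1000))

raw_skew = fares.skew()             # strongly positive for a long right tail
log_skew = np.log1p(fares).skew()   # much closer to 0 after the transform

print(f"skew (raw):   {raw_skew:.2f}")
print(f"skew (log1p): {log_skew:.2f}")
```

A skew near 0 suggests a roughly symmetric distribution; on the real data, `df_clean['fare'].skew()` gives the same kind of check.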

----
## Pillar 3: Data Relationships (how features interact)
Focus: correlations, categorical relationships, pairwise plots, interactive visualizations.

In [None]:
# Correlation matrix (numeric)
plt.figure(figsize=(8,6))
corr = df_clean.select_dtypes(include=[np.number]).corr()
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm', center=0)
plt.title('Numeric correlation matrix')
plt.show()

# Pairplot (small selection); can be slow on large data
sns.pairplot(df_clean[['age','fare','sibsp','parch','survived']].dropna(),
             hue='survived', corner=True, plot_kws={'alpha':0.6})
plt.suptitle('Pairplot (subset)', y=1.02)
plt.show()

Categorical relationships (violin/box) and a Plotly interactive example below.

In [None]:
# Violin plot: age distribution by survival and sex
plt.figure(figsize=(10,5))
sns.violinplot(x='survived', y='age', hue='sex', data=df_clean, split=True, palette='Set2')
plt.title('Age distribution by survival & sex')
plt.show()

# Interactive scatter: age vs fare colored by survival
fig = px.scatter(df_clean.dropna(subset=['age','fare']), x='age', y='fare',
                 color='survived', hover_data=['sex','pclass'], title='Age vs Fare (interactive)')
fig.update_layout(height=600)
fig.show()

# Animated histogram of age across classes (plotly)
df_for_anim = df_clean.dropna(subset=['age'])
fig2 = px.histogram(df_for_anim, x='age', animation_frame='class', nbins=30,
                    title='Age distribution animated across class (Plotly)')
fig2.update_layout(height=500)
fig2.show()

Interpretations and tips:
- Pairplots help detect linear relationships and clusters.  
- Violin/box plots show distribution differences across groups (e.g., whether survivors skew younger or older can differ from one group to another).  
- Interactive Plotly charts are great for presentations and exploratory clicks (hover to see details).
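
One caveat worth a quick sketch: the default correlation (Pearson) measures linear association, so a strong but curved relationship can look weaker than it is. Ranking both variables first gives the Spearman correlation, which only cares about monotonic order. The data below is synthetic and chosen purely to show the gap.

```python
import numpy as np
import pandas as pd

x = pd.Series(np.arange(1, 21, dtype=float))
y = np.exp(x / 2)  # strictly increasing, but far from a straight line

pearson = x.corr(y)                  # linear association
spearman = x.rank().corr(y.rank())   # Pearson on ranks, i.e. Spearman

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

When the two numbers differ a lot, it is a hint to look at the scatter plot before trusting the heatmap.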

----
## Pillar 4: Data Comparison (compare groups & segments)
Focus: groupby, pivot tables, survival rates by groups, and simple statistical tests (chi-square, t-test).

In [None]:
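# Survival rates by class and sex
# observed=True groups only on categories that occur (avoids a pandas FutureWarning on categoricals)
surv_by_class = df_clean.groupby('class', observed=True)['survived'].mean().sort_values(ascending=False)
surv_by_sex = df_clean.groupby('sex', observed=True)['survived'].mean()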
print('Survival rate by class:')
display(surv_by_class)
print('\nSurvival rate by sex:')
display(surv_by_sex)

# Pivot table: survival rate by class & sex
pt = df_clean.pivot_table(index='class', columns='sex', values='survived', aggfunc='mean', observed=True)
display(pt)

sns.heatmap(pt, annot=True, fmt='.2f', cmap='Blues')
plt.title('Survival rate by class & sex')
plt.show()

Statistical checks (quick):
- Chi-square for independence between sex and survival (categorical vs categorical).  
- t-test for difference in age between survivors and non-survivors (numeric vs binary).
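
To show what the chi-square test actually computes, here is a hand-rolled sketch with NumPy on a made-up 2x2 table. The counts are hypothetical and are not the Titanic numbers.

```python
import numpy as np

# Hypothetical 2x2 table: rows = sex (female, male), columns = survived (0, 1)
observed = np.array([[20, 80],
                     [90, 10]], dtype=float)

row_totals = observed.sum(axis=1, keepdims=True)  # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 2)
n = observed.sum()

# Expected counts if the two variables were independent
expected = row_totals @ col_totals / n

# Pearson chi-square statistic (no continuity correction)
chi2_stat = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(expected)
print("chi2 =", round(chi2_stat, 2), "dof =", dof)
```

Note that `scipy.stats.chi2_contingency` applies Yates' continuity correction by default on 2x2 tables, so its statistic will be slightly smaller than this uncorrected value unless you pass `correction=False`.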

In [None]:
from scipy.stats import chi2_contingency, ttest_ind

# Chi-square: sex vs survived
cont = pd.crosstab(df_clean['sex'], df_clean['survived'])
chi2, p, dof, expected = chi2_contingency(cont)
print('Chi-square test (sex vs survived):')
print('chi2=', chi2, 'p-value=', p)

# t-test: age between survived groups
ages_surv = df_clean[df_clean['survived']==1]['age'].dropna()
ages_not = df_clean[df_clean['survived']==0]['age'].dropna()
tstat, pval = ttest_ind(ages_surv, ages_not, equal_var=False)
print('\nt-test (age: survived vs not):')
print('t-statistic=', tstat, 'p-value=', pval)


Interpretation (desi style):  
- A small chi-square p-value means survival is associated with sex (women had a higher survival rate); keep the historical context in mind too.  
- A small t-test p-value suggests the mean age differs between survivors and non-survivors, but it is still important to look at how much the two distributions overlap.
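
A p-value says whether a difference is detectable, not how large it is. One common way to quantify the size of a mean difference is Cohen's d (the difference in means divided by the pooled standard deviation). This is an optional addition sketched under that assumption; the `cohens_d` helper and the two age lists are hypothetical.

```python
import numpy as np

def cohens_d(a, b) -> float:
    """Standardized mean difference using the pooled sample standard deviation."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return float((a.mean() - b.mean()) / np.sqrt(pooled_var))

# Hypothetical ages for two groups
ages_a = [22, 25, 28, 31, 34]
ages_b = [24, 27, 30, 33, 36]

d = cohens_d(ages_a, ages_b)
print(f"Cohen's d = {d:.3f}")
```

A value of d close to 0 means the groups overlap heavily even if the p-value is small, which is exactly the overlap check recommended above.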

----
## Best Practices & Tips (short checklist)
- Always start with shape, info, describe.  
- Visualize missingness early (missingno).  
- For skewed distributions try log transforms.  
- For categorical comparisons use pivot tables and bar/stacked-bar charts.  
- When in doubt, plot it: visuals catch issues quickly.  
- Keep a copy of the original dataset (df_original) so you can revert.  
- Document decisions: why you imputed, why you dropped rows, etc.  

Desi note: "EDA is like the first cup of chai: get it right once and everything else falls into place." ☕

In [None]:
# Uncomment to save the cleaned data
# df_clean.to_csv('titanic_clean.csv', index=False)
# df_clean.to_parquet('titanic_clean.parquet')  # needs pyarrow or fastparquet installed
print('Done. Notebook contains both static (matplotlib/seaborn) and interactive (plotly) examples ⚡.')

----
## Final notes & references
- This guide is designed to be practical: start with composition & distribution, then move on to relationships and comparisons.  
- Use the visualizations to form hypotheses, then test them statistically.  

Further learning:  
- codanics: https://codanics.com  
- codanics YouTube: https://www.youtube.com/c/codanics  

If you like, I can expand this notebook with modeling-ready feature engineering steps as well; just say the word! 🔥