# 🧾 Exploratory Data Analysis (EDA) - Sellers

This notebook performs a complete EDA on the **Sellers** dataset, including:

- General overview of the dataset  
- Basic data cleaning  
- Summary statistics  
- Visualization of numerical and categorical variables


PostgreSQL Connection Setup

In [1]:
import psycopg2
try:
    conn = psycopg2.connect(
        host="localhost",
        database="project",
        user="postgres",
        password="vender2299",
        client_encoding='WIN1252')
    cur = conn.cursor()
    print("Successfully connected to PostgreSQL!")

except psycopg2.Error as e:
    print(f"Error connecting to PostgreSQL: {e}")

Successfully connected to PostgreSQL!
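
The cell above keeps the database password in the notebook itself. As a minimal sketch of one alternative, the credentials could be read from environment variables and assembled into a connection URL. The variable names (`PGUSER`, `PGPASSWORD`, `PGHOST`, `PGPORT`, `PGDATABASE`) follow the libpq convention; the helper `build_pg_url` is hypothetical and not part of the original notebook, and the defaults simply mirror the values used above.

```python
import os
from urllib.parse import quote_plus


def build_pg_url(env=None):
    """Build a PostgreSQL connection URL from environment-style settings.

    Hypothetical helper for illustration. `env` is any mapping; it defaults
    to the real process environment.
    """
    env = os.environ if env is None else env
    # Escape user and password so characters such as '/' or '@' stay valid in a URL.
    user = quote_plus(env.get("PGUSER", "postgres"))
    password = quote_plus(env["PGPASSWORD"])  # raises KeyError if it is not set
    host = env.get("PGHOST", "localhost")
    port = env.get("PGPORT", "5432")
    database = env.get("PGDATABASE", "project")
    return f"postgresql://{user}:{password}@{host}:{port}/{database}"


# Example with an explicit mapping instead of the real environment.
example_url = build_pg_url({"PGPASSWORD": "s3cret/pw"})
print(example_url)
```

The resulting string could then be passed to `create_engine(...)` in place of a literal URL, so the password never appears in the saved notebook.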


Using SQLAlchemy to Load Data from PostgreSQL into a Pandas DataFrame

In [2]:
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://postgres:vender2299@localhost:5432/project')

query = "SELECT * FROM sellers_clean"
sellers = pd.read_sql(query, con=engine)

print(sellers.head())


                          seller_id seller_name seller_postal_code  \
0  d1b65fc7debc3361ea86b5f14c68d2e2  NeuroLabsX           DE-14469   
1  51a04a8a6bdcb23deccc82b0b80742cf   SwiftLabs            DE-6108   
2  e49c26c3edfa46d227d5121a6b6e4d37  EcoFutures           ES-33003   
3  1b938a7ec6ac5061a66a3766e0e75f90    HyperHub            DE-6112   
4  a7a9b880c49781da66651ccf4ba9ac38     EliteAI           DE-18069   

     seller_city country_code seller_country  
0        Potsdam           DE        Germany  
1  Halle (Saale)           DE        Germany  
2         Oviedo           ES          Spain  
3  Halle (Saale)           DE        Germany  
4        Rostock           DE        Germany  
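
In the preview above, `seller_postal_code` combines a country prefix with the local code. Two of the German values (`DE-6108`, `DE-6112`) have only four digits, although German postal codes are five digits long, which suggests a leading zero was dropped somewhere upstream. The sketch below shows one way such values could be padded. The helper `normalize_postal_code` and the `CODE_WIDTH` table are hypothetical, and the widths cover only the two countries visible in the preview.

```python
# Hypothetical lookup: expected postal-code length per country prefix.
# Only the two countries seen in the preview are listed here.
CODE_WIDTH = {"DE": 5, "ES": 5}


def normalize_postal_code(value):
    """Left-pad the numeric part of a 'CC-12345' style code with zeros.

    Values that are not strings, have no '-' separator, use an unknown
    prefix, or have a non-numeric local part are returned unchanged.
    """
    if not isinstance(value, str):
        return value
    prefix, sep, local = value.partition("-")
    width = CODE_WIDTH.get(prefix)
    if not sep or width is None or not local.isdigit():
        return value
    return f"{prefix}-{local.zfill(width)}"


for raw in ["DE-14469", "DE-6108", "ES-33003"]:
    print(raw, "->", normalize_postal_code(raw))
```

On the DataFrame this could be applied with `sellers["seller_postal_code"].map(normalize_postal_code)`, ideally after checking how many rows would actually change.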


General Overview, Duplicates, Missing Values, and Summary Statistics

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display  # explicit import so display() also works outside a notebook

# 🔍 General overview
print("🔹 Dimensions:", sellers.shape)
print("\n🔹 Columns:")
print(sellers.columns)
print("\n🔹 Data types:")
print(sellers.dtypes)
print("\n🔹 First 5 rows:")
print(sellers.head())

# 🔁 Duplicates
duplicates = sellers.duplicated()
print("\n🔁 Duplicate rows:", duplicates.sum())
if duplicates.sum() > 0:
    print("\n🔍 Showing duplicates:")
    display(sellers[duplicates])

# 🧼 Missing values
print("\n🔹 Missing values per column:")
print(sellers.isnull().sum())

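# 📊 Basic statistics
# Note: if the frame has no numeric columns, describe() falls back to
# count / unique / top / freq for the object columns.
print("\n📈 Numerical statistics:")
print(sellers.describe())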

# 🧠 Separate columns by type
num_cols = sellers.select_dtypes(include=np.number).columns
cat_cols = sellers.select_dtypes(include='object').columns

print("\n🔸 Numerical columns:", list(num_cols))
print("🔸 Categorical columns:", list(cat_cols))

# 🎯 Unique values in categorical columns
for col in cat_cols:
    print(f"\n🎲 {col} - Unique values:")
    print(sellers[col].value_counts())

# 📦 Boxplots for numerical columns
for col in num_cols:
    plt.figure(figsize=(6, 4))
    sns.boxplot(x=sellers[col])
    plt.title(f'Outliers - {col}')
    plt.show()


🔹 Dimensions: (3095, 6)

🔹 Columns:
Index(['seller_id', 'seller_name', 'seller_postal_code', 'seller_city',
       'country_code', 'seller_country'],
      dtype='object')

🔹 Data types:
seller_id             object
seller_name           object
seller_postal_code    object
seller_city           object
country_code          object
seller_country        object
dtype: object

🔹 First 5 rows:
                          seller_id seller_name seller_postal_code  \
0  d1b65fc7debc3361ea86b5f14c68d2e2  NeuroLabsX           DE-14469   
1  51a04a8a6bdcb23deccc82b0b80742cf   SwiftLabs            DE-6108   
2  e49c26c3edfa46d227d5121a6b6e4d37  EcoFutures           ES-33003   
3  1b938a7ec6ac5061a66a3766e0e75f90    HyperHub            DE-6112   
4  a7a9b880c49781da66651ccf4ba9ac38     EliteAI           DE-18069   

     seller_city country_code seller_country  
0        Potsdam           DE        Germany  
1  Halle (Saale)           DE        Germany  
2         Oviedo           ES          Spain  