# Data Insight Exploration

This notebook serves as the starting point of the Northwind database analysis project.  
Its purpose is to explore the content of the database in order to understand the available data, identify the main tables, examine key relationships, detect irrelevant columns, and uncover business-relevant patterns.

This exploration phase will guide the next steps of the project, including data cleaning, modeling, and visualization.  
All queries and observations are documented here to ensure full traceability and reproducibility.

## 1. Connecting to the SQLite database

In this section, we establish a connection to the `northwind.db` file using Python's built-in `sqlite3` module.  
This connection allows us to run SQL queries directly from the notebook and load the results into pandas DataFrames for further analysis.

In [1]:
import sqlite3
import pandas as pd

conn = sqlite3.connect("../data/northwind.db")

query = "SELECT name FROM sqlite_master WHERE type='table';"
tables_df = pd.read_sql_query(query, conn)
tables_df

Unnamed: 0,name
0,Categories
1,sqlite_sequence
2,CustomerCustomerDemo
3,CustomerDemographics
4,Customers
5,Employees
6,EmployeeTerritories
7,Order Details
8,Orders
9,Products
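
Since some tables may turn out to be empty, a row count per table can complement the listing above. The following is a minimal sketch and not part of the original notebook: `table_row_counts` is a hypothetical helper, and the in-memory database with two toy tables only stands in for `northwind.db` so the snippet runs on its own.

```python
import sqlite3


def table_row_counts(conn):
    """Return {table_name: row_count} for every user table (hypothetical helper)."""
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%';"
        )
    ]
    # Double-quote identifiers so names containing spaces ("Order Details") work.
    return {
        name: conn.execute(f'SELECT COUNT(*) FROM "{name}";').fetchone()[0]
        for name in names
    }


# Toy in-memory database standing in for northwind.db (illustrative only).
demo = sqlite3.connect(":memory:")
demo.execute('CREATE TABLE "Order Details" (OrderID INTEGER, ProductID INTEGER);')
demo.execute("CREATE TABLE CustomerDemographics (CustomerTypeID TEXT);")
demo.executemany(
    'INSERT INTO "Order Details" VALUES (?, ?);', [(10248, 11), (10248, 42)]
)

counts = table_row_counts(demo)
print(counts)
```

In the notebook itself, the same helper could be called with the existing `conn` object instead of `demo`.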


## 2. Exploring the `Categories` table

In [2]:
categories = pd.read_sql_query("SELECT * FROM Categories;", conn)
categories.shape # Number of rows and columns

(8, 4)

In [3]:
categories  # Display the whole table

Unnamed: 0,CategoryID,CategoryName,Description,Picture
0,1,Beverages,"Soft drinks, coffees, teas, beers, and ales",b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x02\x00...
1,2,Condiments,"Sweet and savory sauces, relishes, spreads, an...",b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x02\x00...
2,3,Confections,"Desserts, candies, and sweet breads",b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x02\x00...
3,4,Dairy Products,Cheeses,b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x02\x00...
4,5,Grains/Cereals,"Breads, crackers, pasta, and cereal",b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x02\x00...
5,6,Meat/Poultry,Prepared meats,b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x02\x00...
6,7,Produce,Dried fruit and bean curd,b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x02\x00...
7,8,Seafood,Seaweed and fish,b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x02\x00...


### Summary of findings – `Categories` table

**Use cases:**
- Identify best-selling categories by joining with Products → Order Details.

**Key relations:**
- Categories (1) > Products (many)

**Irrelevant columns:**
- `Picture`
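
The best-selling-categories use case can be sketched as a two-step join from `Categories` through `Products` to `Order Details`. This is an illustration under stated assumptions rather than the project's final query: revenue is taken as `UnitPrice * Quantity * (1 - Discount)`, which assumes `Discount` is stored as a fraction, and the order lines in the toy in-memory database are invented for the demo.

```python
import sqlite3

# Revenue formula is an assumption: Discount is treated as a fraction (0 to 1).
CATEGORY_SALES_SQL = """
SELECT c.CategoryName,
       SUM(od.Quantity) AS UnitsSold,
       SUM(od.UnitPrice * od.Quantity * (1 - od.Discount)) AS Revenue
FROM Categories AS c
JOIN Products AS p ON p.CategoryID = c.CategoryID
JOIN "Order Details" AS od ON od.ProductID = p.ProductID
GROUP BY c.CategoryID, c.CategoryName
ORDER BY Revenue DESC;
"""

# Toy in-memory database; the order lines are invented for illustration.
demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE Categories (CategoryID INTEGER PRIMARY KEY, CategoryName TEXT);
CREATE TABLE Products (ProductID INTEGER PRIMARY KEY, ProductName TEXT,
                       CategoryID INTEGER);
CREATE TABLE "Order Details" (OrderID INTEGER, ProductID INTEGER,
                              UnitPrice REAL, Quantity INTEGER, Discount REAL);
INSERT INTO Categories VALUES (1, 'Beverages'), (2, 'Condiments');
INSERT INTO Products VALUES (1, 'Chai', 1), (2, 'Chang', 1),
                            (3, 'Aniseed Syrup', 2);
INSERT INTO "Order Details" VALUES
    (1, 1, 18.0, 10, 0.0),
    (1, 3, 10.0, 5, 0.0),
    (2, 2, 19.0, 4, 0.5);
""")

category_sales = demo.execute(CATEGORY_SALES_SQL).fetchall()
print(category_sales)
```

Against the real database, the same SQL string could be passed to `pd.read_sql_query(CATEGORY_SALES_SQL, conn)` to get a DataFrame.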


## 3. Exploring the `Products` table

In [11]:
products = pd.read_sql_query("SELECT * FROM Products;", conn)
products.head()  # Display the first five rows

Unnamed: 0,ProductID,ProductName,SupplierID,CategoryID,QuantityPerUnit,UnitPrice,UnitsInStock,UnitsOnOrder,ReorderLevel,Discontinued
0,1,Chai,1,1,10 boxes x 20 bags,18.0,39,0,10,0
1,2,Chang,1,1,24 - 12 oz bottles,19.0,17,40,25,0
2,3,Aniseed Syrup,1,2,12 - 550 ml bottles,10.0,13,70,25,0
3,4,Chef Anton's Cajun Seasoning,2,2,48 - 6 oz jars,22.0,53,0,0,0
4,5,Chef Anton's Gumbo Mix,2,2,36 boxes,21.35,0,0,0,1


### Summary of findings – `Products` table

**Use cases:**
- Top-selling products by revenue and quantity
- Compare listed catalog prices (Products.UnitPrice) with actual transaction prices ("Order Details".UnitPrice) to detect price deviations (e.g., negotiated deals or pricing errors)
- Analyze supplier concentration among top-performing products using SupplierID
- Evaluate whether the ReorderLevel is aligned with actual product demand, to avoid stockouts or overstock

**Key relations:**
- Categories (1) > Products (many)
- Products (1) > Order Details (many)
- Suppliers (1) > Products (many)

**Hierarchy:**
- Category > Product
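
One way to approach the price-comparison use case is to list the order lines whose unit price differs from the current catalog price. The sketch below is illustrative only: the two catalog prices match the `Products` output above, while the order lines are invented so that the snippet is self-contained.

```python
import sqlite3

PRICE_GAP_SQL = """
SELECT od.OrderID,
       p.ProductName,
       p.UnitPrice  AS ListPrice,
       od.UnitPrice AS SoldPrice
FROM "Order Details" AS od
JOIN Products AS p ON p.ProductID = od.ProductID
WHERE od.UnitPrice <> p.UnitPrice
ORDER BY od.OrderID, p.ProductID;
"""

demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE Products (ProductID INTEGER PRIMARY KEY, ProductName TEXT,
                       UnitPrice REAL);
CREATE TABLE "Order Details" (OrderID INTEGER, ProductID INTEGER,
                              UnitPrice REAL, Quantity INTEGER, Discount REAL);
-- Catalog prices as shown in the Products output.
INSERT INTO Products VALUES (1, 'Chai', 18.0), (2, 'Chang', 19.0);
-- Order lines invented for the demo.
INSERT INTO "Order Details" VALUES
    (1, 1, 14.4, 10, 0.0),
    (1, 2, 19.0, 5, 0.0),
    (2, 1, 18.0, 3, 0.0);
""")

# Only the line sold below the list price should be reported.
price_gaps = demo.execute(PRICE_GAP_SQL).fetchall()
print(price_gaps)
```

A deviation found this way is not necessarily an error: the catalog price may simply have changed after the order was placed.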


## 4. Exploring the `Order Details` table

In [10]:
order_details = pd.read_sql_query('SELECT * FROM "Order Details";', conn)
order_details.head()

Unnamed: 0,OrderID,ProductID,UnitPrice,Quantity,Discount
0,10248,11,14.0,12,0.0
1,10248,42,9.8,10,0.0
2,10248,72,34.8,5,0.0
3,10249,14,18.6,9,0.0
4,10249,51,42.4,40,0.0


### Summary of findings – `Order Details` table

**Use cases:**
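- Compute sales revenue per order line, e.g. UnitPrice × Quantity × (1 − Discount), assuming Discount is stored as a fraction between 0 and 1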
- Analyze the impact of discounts on sales volume (e.g., do discounted products sell more?)
- Join with the Products table to identify top-selling products and explore pricing behaviors
  
**Key relations:**
- Products (1) > Order Details (many)

**Irrelevant columns:**
- None
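
A first look at the discount question could compare the average quantity on discounted and full-price order lines. In this hedged sketch, the five full-price rows are the ones displayed above, and the two discounted rows (OrderID 99999) are invented so that both groups appear.

```python
import sqlite3

DISCOUNT_SQL = """
SELECT CASE WHEN Discount > 0 THEN 'discounted' ELSE 'full price' END AS PriceType,
       COUNT(*)      AS Lines,
       AVG(Quantity) AS AvgQuantity
FROM "Order Details"
GROUP BY PriceType
ORDER BY PriceType;
"""

demo = sqlite3.connect(":memory:")
demo.execute(
    'CREATE TABLE "Order Details" (OrderID INTEGER, ProductID INTEGER, '
    "UnitPrice REAL, Quantity INTEGER, Discount REAL);"
)
demo.executemany(
    'INSERT INTO "Order Details" VALUES (?, ?, ?, ?, ?);',
    [
        # Rows shown in the notebook output (all without discount).
        (10248, 11, 14.0, 12, 0.0),
        (10248, 42, 9.8, 10, 0.0),
        (10248, 72, 34.8, 5, 0.0),
        (10249, 14, 18.6, 9, 0.0),
        (10249, 51, 42.4, 40, 0.0),
        # Invented discounted rows, for illustration only.
        (99999, 11, 14.0, 20, 0.1),
        (99999, 42, 9.8, 30, 0.1),
    ],
)

discount_summary = demo.execute(DISCOUNT_SQL).fetchall()
print(discount_summary)
```

A gap between the two averages only suggests an association; it does not show that discounts cause larger orders.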


## 5. Exploring the `Orders` table

In [13]:
orders = pd.read_sql_query('SELECT * FROM Orders;', conn)
orders.head()

Unnamed: 0,OrderID,CustomerID,EmployeeID,OrderDate,RequiredDate,ShippedDate,ShipVia,Freight,ShipName,ShipAddress,ShipCity,ShipRegion,ShipPostalCode,ShipCountry
0,10248,VINET,5,2016-07-04,2016-08-01,2016-07-16,3,16.75,Vins et alcools Chevalier,59 rue de l-Abbaye,Reims,Western Europe,51100,France
1,10249,TOMSP,6,2016-07-05,2016-08-16,2016-07-10,1,22.25,Toms Spezialitäten,Luisenstr. 48,Münster,Western Europe,44087,Germany
2,10250,HANAR,4,2016-07-08,2016-08-05,2016-07-12,2,25.0,Hanari Carnes,"Rua do Paço, 67",Rio de Janeiro,South America,05454-876,Brazil
3,10251,VICTE,3,2016-07-08,2016-08-05,2016-07-15,1,20.25,Victuailles en stock,"2, rue du Commerce",Lyon,Western Europe,69004,France
4,10252,SUPRD,4,2016-07-09,2016-08-06,2016-07-11,2,36.25,Suprêmes délices,"Boulevard Tirou, 255",Charleroi,Western Europe,B-6000,Belgium


### Summary of findings – `Orders` table

**Use cases:**
- Link each order to its customer, employee, and shipper for multi-angle sales analysis
- Perform seasonality analysis using a custom Date table (based on OrderDate)
- Assess delivery performance by comparing OrderDate, ShippedDate, and RequiredDate
- Analyze the geographical distribution of shipments using ShipCity and ShipCountry
  
**Key relations:**
- Customers (1) > Orders (many)
- Employees (1) > Orders (many)
- Shippers (1) > Orders (many)
- Date custom table (1) > Orders (many) (via OrderDate)

**Hierarchy:**
- Year > Quarter > Month, provided by a custom Date table built from OrderDate

**Irrelevant columns:**
- ShipAddress, ShipPostalCode
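
Delivery performance can be sketched directly in SQL with SQLite's `julianday` function. The example below uses the five orders displayed above and assumes that all dates are stored as `YYYY-MM-DD` text, as they appear in that output; with that format, plain string comparison also orders dates correctly.

```python
import sqlite3

SHIPPING_SQL = """
SELECT OrderID,
       CAST(julianday(ShippedDate) - julianday(OrderDate) AS INTEGER) AS DaysToShip,
       ShippedDate > RequiredDate AS IsLate
FROM Orders
WHERE ShippedDate IS NOT NULL  -- skip orders that have not shipped yet
ORDER BY OrderID;
"""

demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, OrderDate TEXT, "
    "RequiredDate TEXT, ShippedDate TEXT);"
)
# The five orders shown in the notebook output.
demo.executemany(
    "INSERT INTO Orders VALUES (?, ?, ?, ?);",
    [
        (10248, "2016-07-04", "2016-08-01", "2016-07-16"),
        (10249, "2016-07-05", "2016-08-16", "2016-07-10"),
        (10250, "2016-07-08", "2016-08-05", "2016-07-12"),
        (10251, "2016-07-08", "2016-08-05", "2016-07-15"),
        (10252, "2016-07-09", "2016-08-06", "2016-07-11"),
    ],
)

shipping = demo.execute(SHIPPING_SQL).fetchall()
print(shipping)
```

`IsLate` is returned as 0 or 1, so averaging it over all orders would give the share of late shipments.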

## 6. Exploring the `Customers` table

In [14]:
customers = pd.read_sql_query("SELECT * FROM Customers;", conn)
customers.head()

Unnamed: 0,CustomerID,CompanyName,ContactName,ContactTitle,Address,City,Region,PostalCode,Country,Phone,Fax
0,ALFKI,Alfreds Futterkiste,Maria Anders,Sales Representative,Obere Str. 57,Berlin,Western Europe,12209,Germany,030-0074321,030-0076545
1,ANATR,Ana Trujillo Emparedados y helados,Ana Trujillo,Owner,Avda. de la Constitución 2222,México D.F.,Central America,05021,Mexico,(5) 555-4729,(5) 555-3745
2,ANTON,Antonio Moreno Taquería,Antonio Moreno,Owner,Mataderos 2312,México D.F.,Central America,05023,Mexico,(5) 555-3932,
3,AROUT,Around the Horn,Thomas Hardy,Sales Representative,120 Hanover Sq.,London,British Isles,WA1 1DP,UK,(171) 555-7788,(171) 555-6750
4,BERGS,Berglunds snabbköp,Christina Berglund,Order Administrator,Berguvsvägen 8,Luleå,Northern Europe,S-958 22,Sweden,0921-12 34 65,0921-12 34 67


### Summary of findings – `Customers` table

**Use cases:**
- Identify top customers based on total sales volume or frequency of orders
- Perform geographic segmentation using City, Region, and Country
- Assess customer retention and ordering patterns over time (via relationship with Orders)
  
**Key relations:**
- Customers (1) > Orders (many)

**Irrelevant columns:**
- Address, Phone, Fax
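
Ranking customers by revenue and order count relies on the `Orders` and `Order Details` tables. The sketch below uses only the two orders and five order lines displayed earlier in this notebook; the revenue formula `UnitPrice * Quantity * (1 - Discount)` is an assumption about how `Discount` is stored.

```python
import sqlite3

TOP_CUSTOMERS_SQL = """
SELECT o.CustomerID,
       COUNT(DISTINCT o.OrderID) AS OrderCount,
       ROUND(SUM(od.UnitPrice * od.Quantity * (1 - od.Discount)), 2) AS Revenue
FROM Orders AS o
JOIN "Order Details" AS od ON od.OrderID = o.OrderID
GROUP BY o.CustomerID
ORDER BY Revenue DESC;
"""

demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, CustomerID TEXT);
CREATE TABLE "Order Details" (OrderID INTEGER, ProductID INTEGER,
                              UnitPrice REAL, Quantity INTEGER, Discount REAL);
-- Rows taken from the Orders and Order Details outputs in this notebook.
INSERT INTO Orders VALUES (10248, 'VINET'), (10249, 'TOMSP');
INSERT INTO "Order Details" VALUES
    (10248, 11, 14.0, 12, 0.0),
    (10248, 42, 9.8, 10, 0.0),
    (10248, 72, 34.8, 5, 0.0),
    (10249, 14, 18.6, 9, 0.0),
    (10249, 51, 42.4, 40, 0.0);
""")

top_customers = demo.execute(TOP_CUSTOMERS_SQL).fetchall()
print(top_customers)
```

Joining `Customers` on `CustomerID` would add `CompanyName`, `City`, or `Country` to the result for geographic segmentation.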

## 7. Exploring the `Suppliers` table

In [15]:
suppliers = pd.read_sql_query("SELECT * FROM Suppliers;", conn)
suppliers.head()

Unnamed: 0,SupplierID,CompanyName,ContactName,ContactTitle,Address,City,Region,PostalCode,Country,Phone,Fax,HomePage
0,1,Exotic Liquids,Charlotte Cooper,Purchasing Manager,49 Gilbert St.,London,British Isles,EC1 4SD,UK,(171) 555-2222,,
1,2,New Orleans Cajun Delights,Shelley Burke,Order Administrator,P.O. Box 78934,New Orleans,North America,70117,USA,(100) 555-4822,,#CAJUN.HTM#
2,3,Grandma Kelly's Homestead,Regina Murphy,Sales Representative,707 Oxford Rd.,Ann Arbor,North America,48104,USA,(313) 555-5735,(313) 555-3349,
3,4,Tokyo Traders,Yoshi Nagase,Marketing Manager,9-8 Sekimai\nMusashino-shi,Tokyo,Eastern Asia,100,Japan,(03) 3555-5011,,
4,5,Cooperativa de Quesos 'Las Cabras',Antonio del Valle Saavedra,Export Administrator,Calle del Rosal 4,Oviedo,Southern Europe,33007,Spain,(98) 598 76 54,,


### Summary of findings – `Suppliers` table

**Use cases:**
- Analyze supplier concentration across top-selling products (dependency risks)
- Understand the geographical distribution of suppliers (e.g., by Country or City)
  
**Key relations:**
- Suppliers (1) > Products (many)

**Irrelevant columns:**
- Address, PostalCode, Phone, Fax, HomePage
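
As a simplified starting point for the concentration analysis, the sketch below counts catalog products per supplier. It uses only the suppliers and products visible in the outputs above, so the counts are not representative of the full database, and it ignores sales volume, which the full analysis would need to take into account.

```python
import sqlite3

SUPPLIER_SQL = """
SELECT s.CompanyName, COUNT(p.ProductID) AS ProductCount
FROM Suppliers AS s
JOIN Products AS p ON p.SupplierID = s.SupplierID
GROUP BY s.SupplierID, s.CompanyName
ORDER BY ProductCount DESC, s.CompanyName;
"""

demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE Suppliers (SupplierID INTEGER PRIMARY KEY, CompanyName TEXT);"
)
demo.execute(
    "CREATE TABLE Products (ProductID INTEGER PRIMARY KEY, ProductName TEXT, "
    "SupplierID INTEGER);"
)
# Rows taken from the head() outputs; parameters avoid quoting issues
# with apostrophes in product names.
demo.executemany(
    "INSERT INTO Suppliers VALUES (?, ?);",
    [(1, "Exotic Liquids"), (2, "New Orleans Cajun Delights")],
)
demo.executemany(
    "INSERT INTO Products VALUES (?, ?, ?);",
    [
        (1, "Chai", 1),
        (2, "Chang", 1),
        (3, "Aniseed Syrup", 1),
        (4, "Chef Anton's Cajun Seasoning", 2),
        (5, "Chef Anton's Gumbo Mix", 2),
    ],
)

supplier_counts = demo.execute(SUPPLIER_SQL).fetchall()
print(supplier_counts)
```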

## 8. Exploring the `Employees` table

In [16]:
employees = pd.read_sql_query("SELECT * FROM Employees;", conn)
employees.head()

Unnamed: 0,EmployeeID,LastName,FirstName,Title,TitleOfCourtesy,BirthDate,HireDate,Address,City,Region,PostalCode,Country,HomePhone,Extension,Photo,Notes,ReportsTo,PhotoPath
0,1,Davolio,Nancy,Sales Representative,Ms.,1968-12-08,2012-05-01,507 - 20th Ave. E.Apt. 2A,Seattle,North America,98122,USA,(206) 555-9857,5467,b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x02\x00...,Education includes a BA in psychology from Col...,2.0,http://accweb/emmployees/davolio.bmp
1,2,Fuller,Andrew,"Vice President, Sales",Dr.,1972-02-19,2012-08-14,908 W. Capital Way,Tacoma,North America,98401,USA,(206) 555-9482,3457,b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x02\x00...,Andrew received his BTS commercial in 1974 and...,,http://accweb/emmployees/fuller.bmp
2,3,Leverling,Janet,Sales Representative,Ms.,1983-08-30,2012-04-01,722 Moss Bay Blvd.,Kirkland,North America,98033,USA,(206) 555-3412,3355,b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x02\x00...,Janet has a BS degree in chemistry from Boston...,2.0,http://accweb/emmployees/leverling.bmp
3,4,Peacock,Margaret,Sales Representative,Mrs.,1957-09-19,2013-05-03,4110 Old Redmond Rd.,Redmond,North America,98052,USA,(206) 555-8122,5176,b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x02\x00...,Margaret holds a BA in English literature from...,2.0,http://accweb/emmployees/peacock.bmp
4,5,Buchanan,Steven,Sales Manager,Mr.,1975-03-04,2013-10-17,14 Garrett Hill,London,British Isles,SW1 8JR,UK,(71) 555-4848,3453,b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x02\x00...,Steven Buchanan graduated from St. Andrews Uni...,2.0,http://accweb/emmployees/buchanan.bmp


### Summary of findings – `Employees` table

**Use cases:**
- Analyze employee performance through total sales generated
- Conduct tenure analysis by comparing HireDate with sales performance
- Explore the reporting hierarchy through the self-referencing ReportsTo field: analyze team structure, size, and composition, compare performance across managers, and identify the manager with the best-performing sales team

**Key relations:**
- Employees (1) > Orders (many)
- Employees (self-join) via ReportsTo (to model the managerial hierarchy)

**Irrelevant columns:**
- TitleOfCourtesy, BirthDate, Address, PostalCode, HomePhone, Extension, Photo, Notes, PhotoPath
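
The self-referencing relationship can be resolved with a self-join that pairs each employee with the row their `ReportsTo` value points to. The sketch below uses the five employees displayed above; a `LEFT JOIN` keeps the employee who has no manager.

```python
import sqlite3

MANAGER_SQL = """
SELECT e.LastName AS Employee,
       m.LastName AS Manager
FROM Employees AS e
LEFT JOIN Employees AS m ON m.EmployeeID = e.ReportsTo
ORDER BY e.EmployeeID;
"""

demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE Employees (EmployeeID INTEGER PRIMARY KEY, LastName TEXT, "
    "Title TEXT, ReportsTo INTEGER);"
)
# Rows taken from the Employees output; a missing ReportsTo is stored as NULL.
demo.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?, ?);",
    [
        (1, "Davolio", "Sales Representative", 2),
        (2, "Fuller", "Vice President, Sales", None),
        (3, "Leverling", "Sales Representative", 2),
        (4, "Peacock", "Sales Representative", 2),
        (5, "Buchanan", "Sales Manager", 2),
    ],
)

reporting_lines = demo.execute(MANAGER_SQL).fetchall()
print(reporting_lines)
```

In the pandas output above, `ReportsTo` appears as `2.0` because the column contains a missing value and is therefore converted to floating point; the join in SQL is unaffected by this.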

## 9. Exploring the `Shippers` table

In [17]:
shippers = pd.read_sql_query("SELECT * FROM Shippers;", conn)
shippers

Unnamed: 0,ShipperID,CompanyName,Phone
0,1,Speedy Express,(503) 555-9831
1,2,United Package,(503) 555-3199
2,3,Federal Shipping,(503) 555-9931


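### Summary of findings – `Shippers` table

**Use cases:**
- Label shipments with the name of the shipping company (CompanyName)

**Key relations:**
- Shippers (1) > Orders (many) (via Orders.ShipVia)

**Irrelevant columns:**
- Phone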

## 10. Tables excluded from the analysis

The following tables will not be used in the analysis:
- Territories
- Regions
- EmployeeTerritories
- CustomerDemographics (empty)
- CustomerCustomerDemo (empty)