# Data Structure Exploration

The goal of this notebook is to explore the structure of the Northwind database — including table schemas, column data types, and formatting issues — and to detect potential problems such as missing values, duplicates, or inconsistent entries.

This structural analysis will guide future steps such as data cleaning, modeling, and visualization.  
All queries and observations are documented here to ensure full traceability and reproducibility of the project.

## Steps:

In this notebook, we will follow a structured approach to analyze the integrity and structure of the Northwind tables:

- **Load and preview data**  
   Quick overview of table dimensions and samples

- **Check for duplicated rows**  
   Identify exact duplicates within each table

- **Check for missing values**  
   Detect null values and assess their impact

- **Column uniqueness & primary key validation**  
   Verify if identifiers are unique as expected

- **Review column data types**  
   Ensure consistency between expected and actual types

- **Analyze value distributions and format issues**  
   Spot inconsistent formats, rare categories, or outliers

- **Review column relevance**  
   Flag unnecessary or irrelevant columns for removal

- **Summary and next cleaning actions**  
   Document findings and define the data cleaning roadmap
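
The structural checks listed above can be bundled into one small helper, so every table is examined the same way. This is only a sketch: `profile_table` and the three-row `demo` frame (including its missing phone number) are hypothetical and not part of the notebook.

```python
import pandas as pd


def profile_table(df: pd.DataFrame, key_columns: list[str]) -> dict:
    """Summarize the basic structural checks for one table."""
    return {
        "rows": len(df),
        "columns": df.shape[1],
        # Exact duplicates across all columns
        "duplicate_rows": int(df.duplicated().sum()),
        # Total number of null cells in the table
        "missing_values": int(df.isnull().sum().sum()),
        # True when no combination of the key columns repeats
        "key_is_unique": not bool(df.duplicated(subset=key_columns).any()),
    }


# Hypothetical three-row table, used only to exercise the helper.
demo = pd.DataFrame(
    {
        "ShipperID": [1, 2, 3],
        "CompanyName": ["Speedy Express", "United Package", "Federal Shipping"],
        "Phone": ["(503) 555-9831", None, "(503) 555-9931"],
    }
)

report = profile_table(demo, key_columns=["ShipperID"])
print(report)
```

Passing the key as a list lets the same helper handle tables whose identifier spans more than one column.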

## 1. Connecting to the SQLite database

In this section, we establish a connection to the `northwind.db` file using Python's built-in `sqlite3` module.  
This connection allows us to run SQL queries directly from the notebook and load the results into pandas DataFrames for further analysis.

In [3]:
import sqlite3
import pandas as pd

conn = sqlite3.connect("../data/northwind.db")

query = "SELECT name FROM sqlite_master WHERE type='table';"
tables_df = pd.read_sql_query(query, conn)
tables_df

Unnamed: 0,name
0,Categories
1,sqlite_sequence
2,CustomerCustomerDemo
3,CustomerDemographics
4,Customers
5,Employees
6,EmployeeTerritories
7,Order Details
8,Orders
9,Products
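
For a quick overview of table dimensions, the table list can be combined with a row count per table. The sketch below uses an in-memory database with two tiny, hypothetical tables instead of `northwind.db`. Table names are wrapped in double quotes because names such as `Order Details` contain a space.

```python
import sqlite3

import pandas as pd

# In-memory stand-in for northwind.db (hypothetical contents).
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "Order Details" (OrderID INTEGER, ProductID INTEGER)')
conn.execute("CREATE TABLE Shippers (ShipperID INTEGER PRIMARY KEY, CompanyName TEXT)")
conn.executemany('INSERT INTO "Order Details" VALUES (?, ?)', [(10248, 11), (10248, 42)])
conn.execute("INSERT INTO Shippers VALUES (1, 'Speedy Express')")

# List user tables, skipping SQLite's internal ones (e.g. sqlite_sequence).
names = pd.read_sql_query(
    "SELECT name FROM sqlite_master "
    "WHERE type='table' AND name NOT LIKE 'sqlite_%' ORDER BY name;",
    conn,
)["name"]

# The names come from sqlite_master itself, so they are interpolated as quoted identifiers.
row_counts = {
    name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
    for name in names
}
print(row_counts)
conn.close()
```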


## 2. Exploring the `Categories` table

We begin our table-level exploration with the `Categories` table, which contains information about the different product categories available in the database.  
We will display its schema and preview the data.

In [2]:
categories = pd.read_sql_query("SELECT * FROM Categories;", conn)
categories.shape # Number of rows and columns

(8, 4)

In [4]:
categories #displays the full table

Unnamed: 0,CategoryID,CategoryName,Description,Picture
0,1,Beverages,"Soft drinks, coffees, teas, beers, and ales",b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x02\x00...
1,2,Condiments,"Sweet and savory sauces, relishes, spreads, an...",b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x02\x00...
2,3,Confections,"Desserts, candies, and sweet breads",b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x02\x00...
3,4,Dairy Products,Cheeses,b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x02\x00...
4,5,Grains/Cereals,"Breads, crackers, pasta, and cereal",b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x02\x00...
5,6,Meat/Poultry,Prepared meats,b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x02\x00...
6,7,Produce,Dried fruit and bean curd,b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x02\x00...
7,8,Seafood,Seaweed and fish,b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x02\x00...


In [8]:
#check Data types
categories.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   CategoryID    8 non-null      int64 
 1   CategoryName  8 non-null      object
 2   Description   8 non-null      object
 3   Picture       8 non-null      object
dtypes: int64(1), object(3)
memory usage: 388.0+ bytes
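
`info()` reports the pandas dtypes, not the column types declared in SQLite. A minimal sketch of reading the declared schema with `PRAGMA table_info` follows. The `CREATE TABLE` statement is an assumption made so the example runs on its own, and the real column declarations in `northwind.db` may differ.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
# Hypothetical declaration, for illustration only.
conn.execute(
    """
    CREATE TABLE Categories (
        CategoryID   INTEGER PRIMARY KEY,
        CategoryName TEXT NOT NULL,
        Description  TEXT,
        Picture      BLOB
    )
    """
)

# One row per column: declared type, NOT NULL flag, and primary-key position.
schema = pd.read_sql_query('PRAGMA table_info("Categories");', conn)
print(schema[["name", "type", "notnull", "pk"]])
conn.close()
```

The `pk` column is non-zero for primary-key columns, which gives a second way to confirm the expected identifier.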


### Summary – Categories table

- No duplicates or missing values. 
- Primary key (`CategoryID`) is unique. 

> **Cleaning actions:**
> - The column (`Picture`) is irrelevant and will be dropped in the cleaning phase. 
> - `object` columns will be converted to strings for consistency. 

## 3. Exploring the `Products` table

In [4]:
products = pd.read_sql_query("SELECT * FROM Products;", conn)
products.shape

(77, 10)

Table shape: 77 rows and 10 columns.

In [10]:
products.head() #displays the first five rows

Unnamed: 0,ProductID,ProductName,SupplierID,CategoryID,QuantityPerUnit,UnitPrice,UnitsInStock,UnitsOnOrder,ReorderLevel,Discontinued
0,1,Chai,1,1,10 boxes x 20 bags,18.0,39,0,10,0
1,2,Chang,1,1,24 - 12 oz bottles,19.0,17,40,25,0
2,3,Aniseed Syrup,1,2,12 - 550 ml bottles,10.0,13,70,25,0
3,4,Chef Anton's Cajun Seasoning,2,2,48 - 6 oz jars,22.0,53,0,0,0
4,5,Chef Anton's Gumbo Mix,2,2,36 boxes,21.35,0,0,0,1


In [11]:
#check duplicates
products.duplicated().sum()

np.int64(0)

No duplicated rows detected.

In [12]:
#check missing values
products.isnull().sum()

ProductID          0
ProductName        0
SupplierID         0
CategoryID         0
QuantityPerUnit    0
UnitPrice          0
UnitsInStock       0
UnitsOnOrder       0
ReorderLevel       0
Discontinued       0
dtype: int64

No missing values found.

In [13]:
#check column uniqueness for primary key
products['ProductID'].is_unique

True

`ProductID` is unique and serves as the primary key of this table.

In [14]:
#check data types
products.dtypes

ProductID            int64
ProductName         object
SupplierID           int64
CategoryID           int64
QuantityPerUnit     object
UnitPrice          float64
UnitsInStock         int64
UnitsOnOrder         int64
ReorderLevel         int64
Discontinued        object
dtype: object

Most data types are appropriate, with one exception:
- `Discontinued` is stored as `object` but should be converted to integer (binary 0/1)
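
Before casting an `object` column to integer, it can help to confirm that it only holds the expected flags. The sketch below assumes the flags are stored as the strings `"0"` and `"1"`. The sample series is hypothetical, and the real column may hold other representations.

```python
import pandas as pd

# Hypothetical sample: 0/1 flags stored as text in an object column.
discontinued = pd.Series(["0", "1", "0", "0"], name="Discontinued")

# Fail early if anything other than a 0/1 flag is present.
unexpected = set(discontinued.astype(str).unique()) - {"0", "1"}
assert not unexpected, f"Unexpected Discontinued values: {unexpected}"

# Safe to cast once the values are known to be clean.
discontinued_int = discontinued.astype(int)
print(discontinued_int.tolist())
```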

In [None]:
# Check value distribution for numerical columns using descriptive statistics
products.describe()

Unnamed: 0,ProductID,SupplierID,CategoryID,UnitPrice,UnitsInStock,UnitsOnOrder,ReorderLevel
count,77.0,77.0,77.0,77.0,77.0,77.0,77.0
mean,39.0,13.649351,4.116883,28.866364,40.506494,10.12987,12.467532
std,22.371857,8.220267,2.395028,33.815111,36.147222,23.141072,10.931105
min,1.0,1.0,1.0,2.5,0.0,0.0,0.0
25%,20.0,7.0,2.0,13.25,15.0,0.0,0.0
50%,39.0,13.0,4.0,19.5,26.0,0.0,10.0
75%,58.0,20.0,6.0,33.25,61.0,0.0,25.0
max,77.0,29.0,8.0,263.5,125.0,100.0,30.0


**Data quality check:**

- **UnitPrice**:
    - Min: 2.5 | Max: 263.5 | Std: 33.8
    - Observation: Large variance with extreme max value — potential outliers to investigate

- **UnitsOnOrder**:
    - Mean: 10.1 | Median: 0
    - Observation: Strong right skew. Most products currently have no units on order

- **ReorderLevel**:
    - 25th percentile: 0 | Median: 10 | Max: 30
    - Observation: At least a quarter of products have a reorder level of 0, i.e. no defined reorder threshold. May require business logic clarification

- No major format issues were observed in numeric fields

In [5]:
#Top 5 most expensive products
products.sort_values('UnitPrice', ascending=False).head()

Unnamed: 0,ProductID,ProductName,SupplierID,CategoryID,QuantityPerUnit,UnitPrice,UnitsInStock,UnitsOnOrder,ReorderLevel,Discontinued
37,38,Côte de Blaye,18,1,12 - 75 cl bottles,263.5,17,0,15,0
28,29,Thüringer Rostbratwurst,12,6,50 bags x 30 sausgs.,123.79,0,0,0,1
8,9,Mishi Kobe Niku,4,6,18 - 500 g pkgs.,97.0,29,0,0,1
19,20,Sir Rodney's Marmalade,8,3,30 gift boxes,81.0,40,0,0,0
17,18,Carnarvon Tigers,7,8,16 kg pkg.,62.5,42,0,0,0
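
One way to make the "potential outlier" label explicit is the conventional 1.5 × IQR rule of thumb. The sketch below applies it to the five prices listed above, using the `UnitPrice` quartiles from the `describe()` output (Q1 = 13.25, Q3 = 33.25). The rule is a generic screening heuristic and not something the notebook itself applies.

```python
# Quartiles of UnitPrice, taken from the products.describe() output.
q1, q3 = 13.25, 33.25
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr  # 1.5 x IQR rule of thumb

# The five highest prices shown in the table above.
top_prices = {
    "Côte de Blaye": 263.5,
    "Thüringer Rostbratwurst": 123.79,
    "Mishi Kobe Niku": 97.0,
    "Sir Rodney's Marmalade": 81.0,
    "Carnarvon Tigers": 62.5,
}

# Products priced above the fence are candidates for manual review.
flagged = [name for name, price in top_prices.items() if price > upper_fence]
print(upper_fence, flagged)
```

A flag only marks a value for review. It does not mean the price is wrong.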


The top-priced products were manually reviewed. Their high prices are justified by the product type (e.g. fine wine, Kobe beef, bulk sausages, gift packages).  
> No cleaning action required.

In [37]:
# convert discontinued column to int
products['Discontinued'] = products['Discontinued'].astype(int)

# Count the number of active (non-discontinued) products that currently have no units on order
base_condition = (products['UnitsOnOrder'] == 0) & (products['Discontinued'] == 0)

base_condition.sum()

np.int64(52)

In [38]:
# Filter active products (not discontinued) with no units on order,
# where UnitsInStock > ReorderLevel and ReorderLevel is not zero
reassess_stock = products[
    base_condition &
    (products['ReorderLevel'] != 0) &
    (products['UnitsInStock'] > products['ReorderLevel'])
]

# Display count
print(f"Number of such products: {len(reassess_stock)}")

Number of such products: 35


35 active products with no units on order are stocked above their reorder threshold and require no action.

In [39]:
# Filter active products with 0 units on order and 0 reorderLevel
active_no_reorders = products[
    base_condition &
    (products['ReorderLevel'] == 0)
]

print(f"Active products with no reorder level: {len(active_no_reorders)}")
active_no_reorders[['ProductID', 'ProductName', 'UnitsInStock', 'ReorderLevel']].head()

Active products with no reorder level: 16


Unnamed: 0,ProductID,ProductName,UnitsInStock,ReorderLevel
3,4,Chef Anton's Cajun Seasoning,53,0
7,8,Northwoods Cranberry Sauce,6,0
9,10,Ikura,31,0
11,12,Queso Manchego La Pastora,86,0
13,14,Tofu,35,0


From the result above, we can see that 16 products are active and have:
- no units currently on order
- no reorder level defined
- remaining stock

**What this suggests:**
- These products are still active but not being restocked (ReorderLevel = 0)  
- Some still have substantial stock left (e.g., 86 units), so that may explain no restocking yet  
- However, others have low stock (e.g., 6 units), yet no reorder threshold is defined → This might represent a gap in inventory policy

**Potential actions**
- Review if these products should be phased out (but not yet marked as discontinued) 
- Or define a ReorderLevel to ensure automatic restocking before stock runs out 
- Talk to stakeholders to clarify product status and replenishment rules  

In [40]:
# Identify at-risk active products with no units on order and stock below reorder level
at_risk_products = products[
    base_condition &
    (products['UnitsInStock'] < products['ReorderLevel'])
]

# Display the count and the products
print(f"Number of at-risk products: {len(at_risk_products)}")
at_risk_products[['ProductID', 'ProductName', 'UnitPrice', 'UnitsInStock', 'ReorderLevel']]

Number of at-risk products: 1


Unnamed: 0,ProductID,ProductName,UnitPrice,UnitsInStock,ReorderLevel
29,30,Nord-Ost Matjeshering,25.89,10,15


- One active product is below its reorder threshold with no incoming stock — it should be restocked soon to avoid a stockout.
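
The three filters above can also be expressed as a single status column, which makes it easier to confirm that every active product with no units on order falls into exactly one group. This is a sketch on five rows copied from the outputs in this section. The `StockStatus` column and its labels are hypothetical names.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "ProductName": [
            "Chai",
            "Chang",
            "Chef Anton's Cajun Seasoning",
            "Chef Anton's Gumbo Mix",
            "Nord-Ost Matjeshering",
        ],
        "UnitsInStock": [39, 17, 53, 0, 10],
        "UnitsOnOrder": [0, 40, 0, 0, 0],
        "ReorderLevel": [10, 25, 0, 0, 15],
        "Discontinued": [0, 0, 0, 1, 0],
    }
)

# Same scope as base_condition: active products with nothing on order.
in_scope = (sample["UnitsOnOrder"] == 0) & (sample["Discontinued"] == 0)
has_level = in_scope & (sample["ReorderLevel"] != 0)

sample["StockStatus"] = "out of scope"
sample.loc[in_scope & (sample["ReorderLevel"] == 0), "StockStatus"] = "no reorder level"
sample.loc[has_level & (sample["UnitsInStock"] > sample["ReorderLevel"]), "StockStatus"] = "above threshold"
sample.loc[has_level & (sample["UnitsInStock"] == sample["ReorderLevel"]), "StockStatus"] = "at threshold"
sample.loc[has_level & (sample["UnitsInStock"] < sample["ReorderLevel"]), "StockStatus"] = "at risk"

print(sample[["ProductName", "StockStatus"]])
```

On the full table, `value_counts()` on this column should match the counts reported above if the groups are exhaustive.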

### Summary – Products table
The table is structurally clean (no nulls or duplicates).

> **Cleaning actions:**  
> - Convert `Discontinued` to integer and `ProductName` to string for consistency

> **Data quality check:**  
> - There are 16 active products with no reorder level:  
>     - Review whether these products should be phased out (they are not yet marked as discontinued)  
>     - Or define a `ReorderLevel` to ensure automatic restocking before stock runs out  
>     - Talk to stakeholders to clarify product status and replenishment rules  
> - There is 1 at-risk product, i.e. one with no units on order and stock below its reorder level:  
>     - It should be restocked to avoid a stockout.

## 4. Exploring the `Order Details` table

In [42]:
order_details = pd.read_sql_query('SELECT * FROM "Order Details";', conn)
order_details.shape

(609283, 5)

Table shape: 609,283 rows and 5 columns

In [5]:
order_details.head()

Unnamed: 0,OrderID,ProductID,UnitPrice,Quantity,Discount
0,10248,11,14.0,12,0.0
1,10248,42,9.8,10,0.0
2,10248,72,34.8,5,0.0
3,10249,14,18.6,9,0.0
4,10249,51,42.4,40,0.0


In [6]:
#check duplicates
order_details.duplicated().sum()

np.int64(0)

No duplicated rows.

In [7]:
#check missing values
order_details.isnull().sum()

OrderID      0
ProductID    0
UnitPrice    0
Quantity     0
Discount     0
dtype: int64

No missing values found

In [8]:
#check column uniqueness for primary key
order_details['OrderID'].is_unique

False

`OrderID` is **not unique**, as a single order can contain multiple products.
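
Since `OrderID` alone repeats, the row identifier for this table is usually the pair (`OrderID`, `ProductID`). The absence of fully duplicated rows does not by itself guarantee that this pair is unique, so it can be checked separately. The sketch below uses the five rows previewed above. Treating the pair as the key is an assumption based on the usual Northwind layout.

```python
import pandas as pd

# The five rows shown by order_details.head().
sample = pd.DataFrame(
    {
        "OrderID": [10248, 10248, 10248, 10249, 10249],
        "ProductID": [11, 42, 72, 14, 51],
    }
)

order_id_unique = sample["OrderID"].is_unique
# A composite key is unique when no (OrderID, ProductID) pair repeats.
pair_unique = not bool(sample.duplicated(subset=["OrderID", "ProductID"]).any())

print(order_id_unique, pair_unique)
```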

In [9]:
#check data types
order_details.dtypes

OrderID        int64
ProductID      int64
UnitPrice    float64
Quantity       int64
Discount     float64
dtype: object

Data types are appropriate

In [10]:
# Check value distribution for numerical columns using descriptive statistics
order_details.describe()

Unnamed: 0,OrderID,ProductID,UnitPrice,Quantity,Discount
count,609283.0,609283.0,609283.0,609283.0,609283.0
mean,18785.560685,38.999563,28.850379,25.503095,0.000199
std,4484.093759,22.229827,33.56547,14.453939,0.005978
min,10248.0,1.0,2.0,1.0,0.0
25%,14907.0,20.0,13.25,13.0,0.0
50%,18789.0,39.0,19.5,25.0,0.0
75%,22681.0,58.0,33.25,38.0,0.0
max,26529.0,77.0,263.5,130.0,0.25


**Data quality check:**

- **Quantity**:
    - Min: 1 | Max: 130 | Mean: 25.50
    - Observation: The middle 50% of values lie between 13 and 38. A quantity of 130 is unusually high and may need investigation

In [47]:
order_details.sort_values('Quantity', ascending=False).head(5)

Unnamed: 0,OrderID,ProductID,UnitPrice,Quantity,Discount
1363,10764,39,18.0,130,0.1
2120,11072,64,33.25,130,0.0
1221,10711,53,32.8,120,0.0
1691,10894,75,7.75,120,0.05
703,10515,27,43.9,120,0.0


No issue here — the highest quantities (120–130 units) are repeated across several orders and look legitimate.

### Summary – Order Details table

Table is structurally clean:  
- No missing values  
- No duplicates  
- Appropriate data types  

`OrderID` is **not unique**, as expected: one order can include multiple products.

## 5. Exploring the `Orders` table

In [49]:
orders = pd.read_sql_query('SELECT * FROM Orders;', conn)
orders.shape

(16282, 14)

Table shape: 16,282 rows and 14 columns

In [13]:
orders.head()

Unnamed: 0,OrderID,CustomerID,EmployeeID,OrderDate,RequiredDate,ShippedDate,ShipVia,Freight,ShipName,ShipAddress,ShipCity,ShipRegion,ShipPostalCode,ShipCountry
0,10248,VINET,5,2016-07-04,2016-08-01,2016-07-16,3,16.75,Vins et alcools Chevalier,59 rue de l-Abbaye,Reims,Western Europe,51100,France
1,10249,TOMSP,6,2016-07-05,2016-08-16,2016-07-10,1,22.25,Toms Spezialitäten,Luisenstr. 48,Münster,Western Europe,44087,Germany
2,10250,HANAR,4,2016-07-08,2016-08-05,2016-07-12,2,25.0,Hanari Carnes,"Rua do Paço, 67",Rio de Janeiro,South America,05454-876,Brazil
3,10251,VICTE,3,2016-07-08,2016-08-05,2016-07-15,1,20.25,Victuailles en stock,"2, rue du Commerce",Lyon,Western Europe,69004,France
4,10252,SUPRD,4,2016-07-09,2016-08-06,2016-07-11,2,36.25,Suprêmes délices,"Boulevard Tirou, 255",Charleroi,Western Europe,B-6000,Belgium


In [14]:
#check duplicated
orders.duplicated().sum()

np.int64(0)

No duplicated rows

In [15]:
#check missing values
orders.isnull().sum()

OrderID             0
CustomerID          0
EmployeeID          0
OrderDate           0
RequiredDate        0
ShippedDate        21
ShipVia             0
Freight             0
ShipName            0
ShipAddress         0
ShipCity            0
ShipRegion          0
ShipPostalCode    172
ShipCountry         0
dtype: int64

- `ShippedDate`: 21 missing values. These likely correspond to orders not yet shipped (e.g., pending or canceled). This column is relevant and should be kept, but missing values may require specific handling depending on the analysis.  
- `ShipPostalCode`: 172 missing values. This column is not needed for our analysis and will be removed. For geographic analysis, we can use city, region, and country.
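
One possible form of "specific handling" for the missing shipping dates is to keep the nulls and add an explicit flag, so later analyses can include or exclude unshipped orders deliberately. In this sketch the first two rows come from the preview above. The third row (order `99999`) and the `IsShipped` column name are hypothetical.

```python
import pandas as pd

orders_sample = pd.DataFrame(
    {
        "OrderID": [10248, 10249, 99999],  # 99999 is a made-up unshipped order
        "ShippedDate": ["2016-07-16", "2016-07-10", None],
    }
)

# Keep the missing dates as they are and record shipment status separately.
orders_sample["IsShipped"] = orders_sample["ShippedDate"].notna()
pending = orders_sample.loc[~orders_sample["IsShipped"], "OrderID"].tolist()

print(pending)
```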

In [18]:
#check column uniqueness for primary key
orders['OrderID'].is_unique

True

`OrderID` is unique and serves as the primary key of this table.

In [19]:
#check data types
orders.dtypes

OrderID             int64
CustomerID         object
EmployeeID          int64
OrderDate          object
RequiredDate       object
ShippedDate        object
ShipVia             int64
Freight           float64
ShipName           object
ShipAddress        object
ShipCity           object
ShipRegion         object
ShipPostalCode     object
ShipCountry        object
dtype: object

- These columns have appropriate types: `OrderID`, `EmployeeID`, `Freight`, `ShipVia`.
- The following columns are stored as `object` and should be explicitly converted:
    - `CustomerID` → string (alphanumeric identifier)
    - `OrderDate`, `RequiredDate`, `ShippedDate` → datetime
    - `ShipName`, `ShipCity`, `ShipRegion`, `ShipCountry` → string (text data)
- These conversions will improve data handling for filtering, grouping, and time-based operations.
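
When converting text to datetime, it is worth checking that the conversion does not silently create new missing values. The sketch below parses with `errors="coerce"` and counts values that were present but could not be parsed. The sample values, including the deliberately invalid one, are hypothetical.

```python
import pandas as pd

# Hypothetical raw column: two valid dates, one missing value, one bad entry.
raw = pd.Series(["2016-07-04", "2016-07-16", None, "not a date"], name="ShippedDate")

# Unparseable entries become NaT instead of raising an error.
parsed = pd.to_datetime(raw, errors="coerce")

# Values that were present in the source but lost during parsing.
failed = parsed.isna() & raw.notna()
print(int(failed.sum()))

# Text identifiers can be moved to the dedicated string dtype.
customer_ids = pd.Series(["VINET", "TOMSP"], name="CustomerID").astype("string")
```

A non-zero count of failed values would point to a format issue in the source data, as opposed to a date that was missing all along.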

In [None]:
# Check value distribution for numerical columns using descriptive statistics
orders.describe()

Unnamed: 0,OrderID,EmployeeID,ShipVia,Freight
count,16282.0,16282.0,16282.0,16282.0
mean,18388.5,4.968861,2.007739,248.585585
std,4700.352877,2.576741,0.814275,148.978822
min,10248.0,1.0,1.0,10.25
25%,14318.25,3.0,1.0,117.25
50%,18388.5,5.0,2.0,245.25
75%,22458.75,7.0,3.0,377.25
max,26529.0,9.0,3.0,587.0


In [25]:
# Convert date columns to datetime so describe() can summarize them
# (format='mixed' requires pandas >= 2.0)
orders['OrderDate'] = pd.to_datetime(orders['OrderDate'], format='mixed')
orders['RequiredDate'] = pd.to_datetime(orders['RequiredDate'], format='mixed')
orders['ShippedDate'] = pd.to_datetime(orders['ShippedDate'], format='mixed')

orders[['OrderDate', 'RequiredDate', 'ShippedDate']].describe()

Unnamed: 0,OrderDate,RequiredDate,ShippedDate
count,16282,16282,16261
mean,2018-02-22 23:35:38.544036352,2018-03-14 02:43:24.552020480,2018-03-02 17:47:56.761453824
min,2012-07-10 15:40:46,2012-07-12 11:00:21,2012-07-13 21:20:47
25%,2015-07-09 16:32:16,2015-07-28 09:21:39.750000128,2015-07-14 22:48:26
50%,2018-01-09 00:00:00,2018-01-29 10:37:28.500000,2018-01-15 00:00:00
75%,2020-11-17 08:13:09.249999872,2020-12-06 19:41:10.750000128,2020-11-26 16:42:39
max,2023-10-28 00:09:48,2023-12-14 23:09:18,2023-11-19 02:55:24


- Orders span from mid-2012 to late 2023, indicating over 11 years of historical data
- No issues detected.

### Summary – Orders table

Table is generally clean and well-structured:
- No duplicated rows
- `OrderID` is unique and acts as primary key
- Missing values in only two columns:
  - `ShippedDate`: 21 missing, likely pending or canceled orders → to be kept
  - `ShipPostalCode`: 172 missing, irrelevant to our analysis → will be removed

> **Cleaning actions:**
> - Drop the columns `ShipAddress` (not relevant to the analysis) and `ShipPostalCode` (not relevant, 172 missing values)
> - Convert `OrderDate`, `RequiredDate`, `ShippedDate` to datetime
> - Convert `CustomerID`, `ShipName`, `ShipCity`, `ShipRegion`, `ShipCountry` to string

> **Data quality check:**  
> - Keep missing `ShippedDate` values for now; handle later in Power Query if needed

## 6. Exploring the `Customers` table

In [26]:
customers = pd.read_sql_query("SELECT * FROM Customers;", conn)
customers.shape

(93, 11)

Table shape: 93 rows and 11 columns

In [27]:
customers.head()

Unnamed: 0,CustomerID,CompanyName,ContactName,ContactTitle,Address,City,Region,PostalCode,Country,Phone,Fax
0,ALFKI,Alfreds Futterkiste,Maria Anders,Sales Representative,Obere Str. 57,Berlin,Western Europe,12209,Germany,030-0074321,030-0076545
1,ANATR,Ana Trujillo Emparedados y helados,Ana Trujillo,Owner,Avda. de la Constitución 2222,México D.F.,Central America,05021,Mexico,(5) 555-4729,(5) 555-3745
2,ANTON,Antonio Moreno Taquería,Antonio Moreno,Owner,Mataderos 2312,México D.F.,Central America,05023,Mexico,(5) 555-3932,
3,AROUT,Around the Horn,Thomas Hardy,Sales Representative,120 Hanover Sq.,London,British Isles,WA1 1DP,UK,(171) 555-7788,(171) 555-6750
4,BERGS,Berglunds snabbköp,Christina Berglund,Order Administrator,Berguvsvägen 8,Luleå,Northern Europe,S-958 22,Sweden,0921-12 34 65,0921-12 34 67


In [28]:
#check duplicated
customers.duplicated().sum()

np.int64(0)

No duplicated rows

In [29]:
#check missing values
customers.isnull().sum()

CustomerID       0
CompanyName      0
ContactName      0
ContactTitle     0
Address          2
City             2
Region           2
PostalCode       3
Country          2
Phone            2
Fax             24
dtype: int64

- A few missing values in the address and contact fields (2 to 3 per column), likely negligible
- `Fax` is missing for 24 of the 93 customers and will be removed

In [30]:
#check column uniqueness for primary key
customers['CustomerID'].is_unique

True

CustomerID is unique and serves as the primary key of this table. The Northwind database contains 93 customers in total.

In [31]:
#check data types
customers.dtypes

CustomerID      object
CompanyName     object
ContactName     object
ContactTitle    object
Address         object
City            object
Region          object
PostalCode      object
Country         object
Phone           object
Fax             object
dtype: object

- All columns will be converted to `string`
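
A minimal sketch of how this conversion could be combined with the handling of missing location values: drop an unused column, cast to `string`, then replace nulls in the location fields with a placeholder. The first two rows come from the preview above. The third row and the `'Unknown'` placeholder are assumptions for illustration.

```python
import pandas as pd

customers_sample = pd.DataFrame(
    {
        "CustomerID": ["ALFKI", "ANATR", "ZZZZZ"],  # ZZZZZ is a made-up customer
        "CompanyName": [
            "Alfreds Futterkiste",
            "Ana Trujillo Emparedados y helados",
            "Hypothetical Co.",
        ],
        "City": ["Berlin", "México D.F.", None],
        "Region": ["Western Europe", "Central America", None],
        "Country": ["Germany", "Mexico", None],
        "Fax": ["030-0076545", "(5) 555-3745", None],
    }
)

location_cols = ["City", "Region", "Country"]

# Drop the unused column, then cast every remaining column to the string dtype.
cleaned = customers_sample.drop(columns=["Fax"]).astype("string")
# Replace missing location values with an explicit placeholder.
cleaned[location_cols] = cleaned[location_cols].fillna("Unknown")

print(cleaned)
```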

### Summary – Customers table

Table is generally clean and ready for use:  
- No duplicated rows  
- `CustomerID` is unique → serves as the **primary key**  
- Minor missing values in address-related fields (`Address`, `City`, `Region`, `PostalCode`, `Country`, `Phone`)  

> **Cleaning actions:**
> - Drop the `Address`, `PostalCode`, `Phone` and `Fax` columns (not useful for the analysis; `Fax` is also missing for 24 of 93 customers)  
> - Convert all columns to `string` for consistency and Power BI compatibility  
> - Replace missing values in `City`, `Region`, and `Country` with `'Unknown'` to avoid nulls in Power BI

## 7. Exploring the `Suppliers` table

In [33]:
suppliers = pd.read_sql_query("SELECT * FROM Suppliers;", conn)
suppliers.shape

(29, 12)

Table shape: 29 rows and 12 columns

In [34]:
suppliers.head()

Unnamed: 0,SupplierID,CompanyName,ContactName,ContactTitle,Address,City,Region,PostalCode,Country,Phone,Fax,HomePage
0,1,Exotic Liquids,Charlotte Cooper,Purchasing Manager,49 Gilbert St.,London,British Isles,EC1 4SD,UK,(171) 555-2222,,
1,2,New Orleans Cajun Delights,Shelley Burke,Order Administrator,P.O. Box 78934,New Orleans,North America,70117,USA,(100) 555-4822,,#CAJUN.HTM#
2,3,Grandma Kelly's Homestead,Regina Murphy,Sales Representative,707 Oxford Rd.,Ann Arbor,North America,48104,USA,(313) 555-5735,(313) 555-3349,
3,4,Tokyo Traders,Yoshi Nagase,Marketing Manager,9-8 Sekimai\nMusashino-shi,Tokyo,Eastern Asia,100,Japan,(03) 3555-5011,,
4,5,Cooperativa de Quesos 'Las Cabras',Antonio del Valle Saavedra,Export Administrator,Calle del Rosal 4,Oviedo,Southern Europe,33007,Spain,(98) 598 76 54,,


In [36]:
suppliers.duplicated().sum()

np.int64(0)

No duplicated rows

In [37]:
suppliers.isnull().sum()

SupplierID       0
CompanyName      0
ContactName      0
ContactTitle     0
Address          0
City             0
Region           1
PostalCode       0
Country          0
Phone            0
Fax             16
HomePage        24
dtype: int64

- One missing value in `Region`, likely negligible (consider replacing it with `'Unknown'` if the column is used for grouped analysis)
- Missing values in `Fax` and `HomePage` columns. These are not useful for our analysis, so they will be removed

In [38]:
suppliers['SupplierID'].is_unique

True

SupplierID is unique and serves as the primary key of this table. The Northwind database contains 29 suppliers in total.

In [39]:
suppliers.dtypes

SupplierID       int64
CompanyName     object
ContactName     object
ContactTitle    object
Address         object
City            object
Region          object
PostalCode      object
Country         object
Phone           object
Fax             object
HomePage        object
dtype: object

All `object` columns will be converted to `string` for consistency

### Summary – Suppliers table

Table is generally clean and ready for use:  
- No duplicated rows  
- `SupplierID` is unique → serves as the **primary key**  
- One missing value in `Region` (likely negligible)  
- Several missing values in `Fax` and `HomePage`  

> **Cleaning actions:**
> - Drop the `Address`, `PostalCode`, `Phone`, `Fax` and `HomePage` columns (not useful)
> - Convert all `object` columns to `string` for consistency and Power BI compatibility  
> - Replace missing `Region` with `'Unknown'`

## 8. Exploring the `Employees` table

In [40]:
employees = pd.read_sql_query("SELECT * FROM Employees;", conn)
employees.shape

(9, 18)

Table shape: 9 rows and 18 columns

In [41]:
employees

Unnamed: 0,EmployeeID,LastName,FirstName,Title,TitleOfCourtesy,BirthDate,HireDate,Address,City,Region,PostalCode,Country,HomePhone,Extension,Photo,Notes,ReportsTo,PhotoPath
0,1,Davolio,Nancy,Sales Representative,Ms.,1968-12-08,2012-05-01,507 - 20th Ave. E.Apt. 2A,Seattle,North America,98122,USA,(206) 555-9857,5467,b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x02\x00...,Education includes a BA in psychology from Col...,2.0,http://accweb/emmployees/davolio.bmp
1,2,Fuller,Andrew,"Vice President, Sales",Dr.,1972-02-19,2012-08-14,908 W. Capital Way,Tacoma,North America,98401,USA,(206) 555-9482,3457,b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x02\x00...,Andrew received his BTS commercial in 1974 and...,,http://accweb/emmployees/fuller.bmp
2,3,Leverling,Janet,Sales Representative,Ms.,1983-08-30,2012-04-01,722 Moss Bay Blvd.,Kirkland,North America,98033,USA,(206) 555-3412,3355,b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x02\x00...,Janet has a BS degree in chemistry from Boston...,2.0,http://accweb/emmployees/leverling.bmp
3,4,Peacock,Margaret,Sales Representative,Mrs.,1957-09-19,2013-05-03,4110 Old Redmond Rd.,Redmond,North America,98052,USA,(206) 555-8122,5176,b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x02\x00...,Margaret holds a BA in English literature from...,2.0,http://accweb/emmployees/peacock.bmp
4,5,Buchanan,Steven,Sales Manager,Mr.,1975-03-04,2013-10-17,14 Garrett Hill,London,British Isles,SW1 8JR,UK,(71) 555-4848,3453,b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x02\x00...,Steven Buchanan graduated from St. Andrews Uni...,2.0,http://accweb/emmployees/buchanan.bmp
5,6,Suyama,Michael,Sales Representative,Mr.,1983-07-02,2013-10-17,Coventry House\nMiner Rd.,London,British Isles,EC2 7JR,UK,(71) 555-7773,428,b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x02\x00...,Michael is a graduate of Sussex University (MA...,5.0,http://accweb/emmployees/davolio.bmp
6,7,King,Robert,Sales Representative,Mr.,1980-05-29,2014-01-02,Edgeham Hollow\nWinchester Way,London,British Isles,RG1 9SP,UK,(71) 555-5598,465,b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x02\x00...,Robert King served in the Peace Corps and trav...,5.0,http://accweb/emmployees/davolio.bmp
7,8,Callahan,Laura,Inside Sales Coordinator,Ms.,1978-01-09,2014-03-05,4726 - 11th Ave. N.E.,Seattle,North America,98105,USA,(206) 555-1189,2344,b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x02\x00...,Laura received a BA in psychology from the Uni...,2.0,http://accweb/emmployees/davolio.bmp
8,9,Dodsworth,Anne,Sales Representative,Ms.,1986-01-27,2014-11-15,7 Houndstooth Rd.,London,British Isles,WG2 7LT,UK,(71) 555-4444,452,b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x02\x00...,Anne has a BA degree in English from St. Lawre...,5.0,http://accweb/emmployees/davolio.bmp


In [42]:
employees.duplicated().sum()

np.int64(0)

In [43]:
employees.isnull().sum()

EmployeeID         0
LastName           0
FirstName          0
Title              0
TitleOfCourtesy    0
BirthDate          0
HireDate           0
Address            0
City               0
Region             0
PostalCode         0
Country            0
HomePhone          0
Extension          0
Photo              0
Notes              0
ReportsTo          1
PhotoPath          0
dtype: int64

In [44]:
employees.dtypes

EmployeeID           int64
LastName            object
FirstName           object
Title               object
TitleOfCourtesy     object
BirthDate           object
HireDate            object
Address             object
City                object
Region              object
PostalCode          object
Country             object
HomePhone           object
Extension           object
Photo               object
Notes               object
ReportsTo          float64
PhotoPath           object
dtype: object
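
`ReportsTo` appears as `float64` because the single missing value forces a float column. The sketch below, on four rows taken from the table above, checks the uniqueness of `EmployeeID` explicitly and shows one way to build a combined name and an integer `ReportsTo`. The `FullName` column name and the `-1` placeholder are assumptions.

```python
import pandas as pd

employees_sample = pd.DataFrame(
    {
        "EmployeeID": [1, 2, 5, 6],
        "LastName": ["Davolio", "Fuller", "Buchanan", "Suyama"],
        "FirstName": ["Nancy", "Andrew", "Steven", "Michael"],
        "ReportsTo": [2.0, None, 2.0, 5.0],
    }
)

# Primary key check, in the same style as the other tables.
key_is_unique = employees_sample["EmployeeID"].is_unique

# Merge first and last name into a single display column.
employees_sample["FullName"] = (
    employees_sample["FirstName"] + " " + employees_sample["LastName"]
)

# Use -1 as a placeholder for "no manager", then restore an integer dtype.
employees_sample["ReportsTo"] = employees_sample["ReportsTo"].fillna(-1).astype(int)

print(employees_sample[["EmployeeID", "FullName", "ReportsTo"]])
```

An alternative is `astype("Int64")`, pandas' nullable integer type, which keeps the missing value as missing instead of encoding it as `-1`.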

### Summary – Employees table

Table is generally clean and ready for use:  
- No duplicated rows  
- `EmployeeID` is unique → serves as the **primary key**. The Northwind database contains 9 employees in total.
- One missing value in `ReportsTo`: Andrew Fuller (Vice President, Sales) has no manager recorded, which suggests he is the top-level manager  

> **Cleaning actions:**  
> - Drop the `TitleOfCourtesy`, `BirthDate`, `Address`, `PostalCode`, `HomePhone`, `Extension`, `Photo`, `Notes`, and `PhotoPath` columns (not useful)  
> - Convert `HireDate` to datetime  
> - Convert all `object` columns to `string` for consistency and Power BI compatibility  
> - Merge the `FirstName` and `LastName` columns  
> - Replace missing value in `ReportsTo` with `-1` (likely top-level manager)

## 9. Exploring the `Shippers` table

In [45]:
shippers = pd.read_sql_query("SELECT * FROM Shippers;", conn)
shippers

Unnamed: 0,ShipperID,CompanyName,Phone
0,1,Speedy Express,(503) 555-9831
1,2,United Package,(503) 555-3199
2,3,Federal Shipping,(503) 555-9931


In [46]:
shippers.dtypes

ShipperID       int64
CompanyName    object
Phone          object
dtype: object

### Summary – Shippers table

- The table lists the three shipping companies (ID, name, and phone number)

> **Cleaning actions:**
> - The `Phone` column will be removed
> - Convert `object` columns to `string`

### Other changes

- Some column names will be renamed directly in Power Query for clarity and consistency  
- A further check of data types will be done in Power Query to ensure compatibility with visualizations  
- Column ordering and formatting (e.g. dates, merged names) can also be adjusted in Power Query if needed  
- Final handling of missing values (e.g. `Unknown`, blanks) will be validated in Power BI