This notebook provides an initial exploratory review of the dataset to understand its structure, quality, and overall integrity before deeper analysis in SQL. The findings highlight potential data quality issues, guide cleaning decisions, and ensure that subsequent SQL-based transformations are built on reliable, well-understood data.

## Part 1: Setup and Data Loading

### 1.1: Import necessary libraries

In [1]:
# Core libraries for data manipulation and numerical operations
import numpy as np
import pandas as pd
# SQLAlchemy engine factory for the database connection
from sqlalchemy import create_engine

### 1.2: Load data from SQL

### Database Configuration

This notebook reads database credentials from environment variables.
Credentials are **not stored in the notebook or repository**.

Required variables:

- DB_USER
- DB_PASSWORD
- DB_HOST
- DB_NAME

For local development:

- Create a `.env` file (see `.env.example`)
- Load variables using `python-dotenv`
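
As a minimal sketch of the setup described above, the helper below checks that the four variables are set before building the connection URL, and percent-encodes the user name and password so that characters such as `@` or `/` do not break the URL. The function name `build_mysql_url` and the sample values are hypothetical and not part of this notebook.

```python
import os
from urllib.parse import quote_plus

REQUIRED_VARS = ("DB_USER", "DB_PASSWORD", "DB_HOST", "DB_NAME")

def build_mysql_url(env=None):
    """Return a mysql+mysqlconnector URL built from environment-style settings."""
    env = os.environ if env is None else env
    # Fail early with a clear message instead of connecting with 'None' in the URL
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing database settings: {', '.join(missing)}")
    # Percent-encode credentials so special characters are safe inside the URL
    user = quote_plus(env["DB_USER"])
    password = quote_plus(env["DB_PASSWORD"])
    return f"mysql+mysqlconnector://{user}:{password}@{env['DB_HOST']}/{env['DB_NAME']}"

# Hypothetical values, for illustration only
example = {
    "DB_USER": "analyst",
    "DB_PASSWORD": "p@ss/word",
    "DB_HOST": "localhost",
    "DB_NAME": "gdb041",
}
print(build_mysql_url(example))
```

The returned string can be passed to `create_engine` in place of the inline f-string.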

In [2]:
from dotenv import load_dotenv
import os

load_dotenv()

engine = create_engine(
    f"mysql+mysqlconnector://{os.getenv('DB_USER')}:{os.getenv('DB_PASSWORD')}"
    f"@{os.getenv('DB_HOST')}/{os.getenv('DB_NAME')}"
)

# Load the market dimension table into a DataFrame
query = "SELECT * FROM gdb041.dim_market"
df_dim_market = pd.read_sql(query, engine)
df_dim_market.head()

| | market | sub_zone | region |
|---|---|---|---|
| 0 | China | ROA | APAC |
| 1 | India | India | APAC |
| 2 | Indonesia | ROA | APAC |
| 3 | Japan | ROA | APAC |
| 4 | Pakistan | ROA | APAC |


### 1.3: Inspect the data

The inspection covers:

1. Total rows and columns
2. Data type of each column
3. Mixed data types within a column
4. Distinct values per column
5. Null values and their percentage per column
6. Duplicate rows in the dataset

In [3]:
def initial_report(df):
    """Print structure, dtypes, mixed types, distinct counts, nulls, and duplicates for df."""
    print(" *** Initial Inspection ***\n" + "-"*40)

    print(f"*** Structure:\n- Total Rows: {df.shape[0]}\n- Total Columns: {df.shape[1]}")
    print(f"- Column Names: {list(df.columns)}\n")

    
    print(" Data Types:")
    for col, dtype in df.dtypes.items():
        print(f"  {col}: {dtype}")
    print()

    print("Mixed Data Types:")
    has_mixed_types = False
    for col in df.columns:
        try:
            # Count Python types per cell; a true NaN is a float, so nulls in a text column appear as mixed
            type_counts = df[col].apply(type).value_counts()
            if len(type_counts) > 1:
                has_mixed_types = True
                print(f"  {col}:")
                for t, count in type_counts.items():
                    print(f"    - {t.__name__}: {count}")
        except Exception as e:
            print(f"  {col}: Error checking types - {e}")

    if not has_mixed_types:
        print("  No mixed data types found")
    print()

    print("*** Distinct Values per Column:")
    for col in df.columns:
        print(f"  {col}: {df[col].nunique()}")
    print()

    print("*** Null Values and Percentages:")
    has_null_value=False
    nulls = df.isnull().sum()
    for col in df.columns:
        pct_missing = np.mean(df[col].isnull())
        if nulls[col] > 0: # Only print if there are missing values
            has_null_value=True
            print(f"  {col}: Missing Values: {nulls[col]}, Pct: {round(pct_missing * 100, 3)}%")
    if not has_null_value:
        print("  No null values found")
    print()

    
    print(f"\n*** Duplicates: {df.duplicated().sum()}")

initial_report(df_dim_market)

 *** Initial Inspection ***
----------------------------------------
*** Structure:
- Total Rows: 27
- Total Columns: 3
- Column Names: ['market', 'sub_zone', 'region']

 Data Types:
  market: object
  sub_zone: object
  region: object

Mixed Data Types:
  No mixed data types found

*** Distinct Values per Column:
  market: 27
  sub_zone: 7
  region: 4

*** Null Values and Percentages:
  No null values found


*** Duplicates: 0


### 1.4: Summary of categorical columns to assess uniqueness and distribution

In [4]:
df_dim_market.describe(include='O')

| | market | sub_zone | region |
|---|---|---|---|
| count | 27 | 27 | 27 |
| unique | 27 | 7 | 4 |
| top | China | ROA | EU |
| freq | 1 | 7 | 11 |


### Comment:
There are 27 unique markets, 7 unique sub-zones, and 4 unique regions. Every market appears exactly once, so `market` uniquely identifies each row.

In [5]:
# Display value counts for each column to validate uniqueness and detect potential data quality issues.
for col in df_dim_market:
    print(df_dim_market[col].value_counts(dropna=False).sort_index())

market
Australia         1
Austria           1
Bangladesh        1
Brazil            1
Canada            1
Chile             1
China             1
Columbia          1
France            1
Germany           1
India             1
Indonesia         1
Italy             1
Japan             1
Mexico            1
Netherlands       1
Newzealand        1
Norway            1
Pakistan          1
Philiphines       1
Poland            1
Portugal          1
South Korea       1
Spain             1
Sweden            1
USA               1
United Kingdom    1
Name: count, dtype: int64
sub_zone
ANZ      2
India    1
LATAM    4
NE       7
ROA      7
SE       4
nan      2
Name: count, dtype: int64
region
APAC     10
EU       11
LATAM     4
nan       2
Name: count, dtype: int64


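### Comment:
Both `sub_zone` and `region` contain the value **nan** in 2 rows each. The null check in 1.3 reported no missing values and `describe` counted 27 non-null entries per column, so these appear to be the literal string `'nan'` rather than true nulls.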
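
A small sketch for catching string placeholders of this kind, which `isnull()` does not flag. The helper name `count_placeholders`, the token list, and the sample frame are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Illustrative set of strings that often stand in for missing values (an assumption)
PLACEHOLDERS = {"", "nan", "null", "none", "n/a"}

def count_placeholders(df):
    """Count, per column, the string cells that look like missing-value placeholders."""
    counts = {}
    for col in df.columns:
        # Only real strings are tested, so true nulls and numbers are ignored
        hits = df[col].map(
            lambda v: isinstance(v, str) and v.strip().lower() in PLACEHOLDERS
        ).sum()
        if hits:
            counts[col] = int(hits)
    return counts

# Made-up frame mirroring the pattern seen in the value counts
sample = pd.DataFrame({
    "market": ["India", "USA", "Canada"],
    "sub_zone": ["India", "nan", "nan"],
    "region": ["APAC", "nan", "nan"],
})
print(count_placeholders(sample))
```

The token list deliberately leaves out `NA`, since that string may be a legitimate region code in this dataset.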

In [6]:
# Find rows where sub_zone or region holds the string 'nan'
df_dim_market[
    (df_dim_market['sub_zone'] == 'nan') |
    (df_dim_market['region'] == 'nan')
]

| | market | sub_zone | region |
|---|---|---|---|
| 21 | USA | | |
| 22 | Canada | | |


### Comment:
The `nan` values occur only in the USA and Canada rows.
This is an error: both should be NA (North America).
The source table in the SQL database needs to be updated.
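
The correction can be prototyped in pandas before touching the database. This is only a sketch, assuming the two affected rows should carry `NA` in both `sub_zone` and `region`; the frame `sample` is made-up data rather than the real table, and the actual fix belongs in SQL as noted above.

```python
import pandas as pd

# Made-up frame mirroring the affected rows
sample = pd.DataFrame({
    "market": ["India", "USA", "Canada"],
    "sub_zone": ["India", "nan", "nan"],
    "region": ["APAC", "nan", "nan"],
})

# Rows for the two North American markets
is_north_america = sample["market"].isin(["USA", "Canada"])

cleaned = sample.copy()
for col in ["sub_zone", "region"]:
    # Only overwrite cells that hold the literal string 'nan'
    fix = is_north_america & (cleaned[col] == "nan")
    cleaned.loc[fix, col] = "NA"

print(cleaned)
```

If the cleaned data is ever exported to CSV and read back with `pd.read_csv`, note that pandas treats the string `NA` as missing by default; passing `keep_default_na=False` avoids that.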

### Conclusion:
The analysis ends here. A new table, `dim_market_clean`, has been created in SQL.