### The project has evolved: the CSV files are now stored in the `olist.db` database. Because the original schema describes the relationships between the CSV files, it is useful to map the old CSV files to the database tables and to list the keys the tables share. https://drive.google.com/file/d/1cC1h5ZiakQMM6Ut9Hqf13r-jTpMHV8d5/view?usp=sharing

### Schema of the relationships between the data when it was stored as CSV files (see https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce)

![data_schema.PNG](../data/data_schema.PNG)


Here is a mapping table in text format between the CSV files and the tables in `olist.db`, including the identical keys between tables when they exist:

| CSV File                               | SQLite Table       | Identical Keys                            |
|----------------------------------------|--------------------|-------------------------------------------|
| olist_customers_dataset.csv            | customers          | customer_id, customer_zip_code_prefix     |
| olist_geolocation_dataset.csv          | geoloc             | geolocation_zip_code_prefix               |
| olist_order_items_dataset.csv          | order_items        | order_id, product_id, seller_id           |
| olist_order_payments_dataset.csv       | order_pymts        | order_id                                  |
| olist_order_reviews_dataset.csv        | order_reviews      | order_id                                  |
| olist_orders_dataset.csv               | orders             | customer_id, order_id                     |
| olist_products_dataset.csv             | products           | product_id                                |
| olist_sellers_dataset.csv              | sellers            | seller_id, seller_zip_code_prefix         |
| product_category_name_translation.csv  | translation        | product_category_name                     |

### Mapping Details

1. **olist_customers_dataset.csv -> customers**
   - Identical Keys: `customer_id`, `customer_zip_code_prefix`

2. **olist_geolocation_dataset.csv -> geoloc**
   - Identical Keys: `geolocation_zip_code_prefix`

3. **olist_order_items_dataset.csv -> order_items**
   - Identical Keys: `order_id`, `product_id`, `seller_id`

4. **olist_order_payments_dataset.csv -> order_pymts**
   - Identical Keys: `order_id`

5. **olist_order_reviews_dataset.csv -> order_reviews**
   - Identical Keys: `order_id`

6. **olist_orders_dataset.csv -> orders**
   - Identical Keys: `customer_id`, `order_id`

7. **olist_products_dataset.csv -> products**
   - Identical Keys: `product_id`

8. **olist_sellers_dataset.csv -> sellers**
   - Identical Keys: `seller_id`, `seller_zip_code_prefix`

9. **product_category_name_translation.csv -> translation**
   - Identical Keys: `product_category_name`

### Explanation
- Identical keys are used to establish relationships between different tables, thus facilitating the necessary joins for complex queries.
- For example, `order_id` is a common key between the tables `orders`, `order_pymts`, `order_reviews`, and `order_items`, allowing the linking of order information, payment, reviews, and order items.

This mapping table can serve as a reference to understand how the data from the CSV files is structured in the `olist.db` database and how they can be joined to answer analytical questions.
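As a rough illustration of how the shared `order_id` key supports joins, here is a minimal sketch on a toy in-memory database. The rows are invented, only a few columns are reproduced, and the `payment_value` column name is an assumption (it does not appear elsewhere in this notebook).

```python
import sqlite3

# Toy in-memory tables that mimic only the key columns (invented rows, not from olist.db)
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id TEXT, customer_id TEXT);
CREATE TABLE order_items (order_id TEXT, product_id TEXT, seller_id TEXT, price REAL);
CREATE TABLE order_pymts (order_id TEXT, payment_value REAL);  -- payment_value is an assumed column
INSERT INTO orders VALUES ('o1', 'c1'), ('o2', 'c2');
INSERT INTO order_items VALUES
    ('o1', 'p1', 's1', 10.0), ('o1', 'p2', 's1', 5.0), ('o2', 'p1', 's2', 7.5);
INSERT INTO order_pymts VALUES ('o1', 15.0), ('o2', 7.5);
""")

# Join the three tables on the shared order_id key
rows = conn.execute("""
SELECT o.order_id, o.customer_id, COUNT(i.product_id) AS n_items, p.payment_value
FROM orders AS o
JOIN order_items AS i ON i.order_id = o.order_id
JOIN order_pymts AS p ON p.order_id = o.order_id
GROUP BY o.order_id, o.customer_id, p.payment_value
ORDER BY o.order_id;
""").fetchall()
conn.close()

for row in rows:
    print(row)
```

Each toy order has a single payment row here; with several payments per order, the join would repeat item rows, so counts would need to be computed per table before joining.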

### Olist E-Commerce: checking the database connection

In [4]:
import pandas as pd
import sqlite3
import os

# Relative path to the database
db_path = os.path.join('..', 'data', 'olist.db')

if os.path.exists(db_path):
    print("Connecting to the database...")
    conn = sqlite3.connect(db_path)
    print("Database connected.")
    
    query = "SELECT COUNT(*) FROM orders;"
    try:
        data = pd.read_sql(query, conn)
        print("Total number of rows in orders table:", data.iloc[0, 0])
    except Exception as e:
        print("An error occurred:", e)
    finally:
        conn.close()
        print("Database connection closed.")
else:
    print("The file does not exist at the specified location:", db_path)



Connecting to the database...
Database connected.
Total number of rows in orders table: 99441
Database connection closed.
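The cell above closes the connection explicitly in a `finally` block. A more compact pattern, sketched below, wraps the connection in `contextlib.closing`. Note that `with sqlite3.connect(...)` on its own only commits or rolls back the transaction and does not close the connection. The helper name `count_rows` and the temporary demo database are hypothetical and used only for illustration.

```python
import os
import sqlite3
import tempfile
from contextlib import closing

def count_rows(db_path, table):
    """Return the number of rows in `table`, closing the connection afterwards."""
    # closing() guarantees conn.close() is called when the block exits
    with closing(sqlite3.connect(db_path)) as conn:
        # Table names cannot be bound as SQL parameters, so check the catalog first
        known = {r[0] for r in conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table'")}
        if table not in known:
            raise ValueError(f"Unknown table: {table}")
        return conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]

# Build a small throwaway database file to exercise the helper (hypothetical data)
demo_path = os.path.join(tempfile.mkdtemp(), "demo.db")
with closing(sqlite3.connect(demo_path)) as conn:
    conn.execute("CREATE TABLE orders (order_id TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?)", [("o1",), ("o2",), ("o3",)])
    conn.commit()

print("Rows in demo orders table:", count_rows(demo_path, "orders"))
```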


### Inspecting the Database Schema

In [7]:
import pandas as pd
import sqlite3
import os

# Relative path to the database
db_path = os.path.join('..', 'data', 'olist.db')

if os.path.exists(db_path):
    print("Connecting to the database...")
    
    # Create a connection to the database
    conn = sqlite3.connect(db_path)
    print("Database connected.")
    
    try:
        # List all tables
        tables_query = "SELECT name FROM sqlite_master WHERE type='table';"
        tables = pd.read_sql(tables_query, conn)
        print("Tables in the database:")
        print(tables)

        # Inspect schema of the relevant tables
        for table in tables['name']:
            schema_query = f"PRAGMA table_info({table});"
            schema = pd.read_sql(schema_query, conn)
            print(f"Schema of {table}:")
            print(schema)
    except Exception as e:
        print("An error occurred:", e)
    finally:
        # Close the connection
        conn.close()
        print("Database connection closed.")
else:
    print("The file does not exist at the specified location:", db_path)



Connecting to the database...
Database connected.
Tables in the database:
            name
0      customers
1         geoloc
2    order_items
3    order_pymts
4  order_reviews
5         orders
6       products
7        sellers
8    translation
Schema of customers:
   cid                      name    type  notnull dflt_value  pk
0    0                     index  BIGINT        0       None   0
1    1               customer_id    TEXT        0       None   0
2    2        customer_unique_id    TEXT        0       None   0
3    3  customer_zip_code_prefix  BIGINT        0       None   0
4    4             customer_city    TEXT        0       None   0
5    5            customer_state    TEXT        0       None   0
Schema of geoloc:
   cid                         name    type  notnull dflt_value  pk
0    0                        index  BIGINT        0       None   0
1    1  geolocation_zip_code_prefix  BIGINT        0       None   0
2    2              geolocation_lat   FLOAT        0      

### Checking for Common Values in Key Columns

In [7]:
import pandas as pd
import sqlite3
import os

# Relative path to the database
db_path = os.path.join('..', 'data', 'olist.db')

# Function to execute SQL queries and return DataFrames
def execute_query(query):
    with sqlite3.connect(db_path) as conn:
        return pd.read_sql(query, conn)

# Function to get unique values of a column
def get_unique_column_values(table, column):
    query = f"SELECT DISTINCT {column} FROM {table}"
    df = execute_query(query)
    return set(df[column].values)

# List of tables and columns to check
tables_columns = {
    'customers': ['customer_id', 'customer_unique_id', 'customer_zip_code_prefix'],
    'geoloc': ['geolocation_zip_code_prefix'],
    'order_items': ['order_id', 'product_id', 'seller_id'],
    'order_pymts': ['order_id'],
    'order_reviews': ['order_id', 'review_id'],
    'orders': ['order_id', 'customer_id'],
    'products': ['product_id'],
    'sellers': ['seller_id', 'seller_zip_code_prefix'],
    'translation': ['product_category_name'],
}

# Specific validation for customer_id, order_id, and customer_unique_id
def compare_specific_columns(columns_to_compare):
    """Report overlaps between ID columns that should be unrelated; return True if any is found."""
    special_cases = [
        ('customers.customer_id', 'orders.order_id'),
        ('customers.customer_id', 'customers.customer_unique_id'),
        ('customers.customer_unique_id', 'orders.order_id')
    ]
    
    overlap_found = False
    for table_column1, table_column2 in special_cases:
        values1 = columns_to_compare.get(table_column1, set())
        values2 = columns_to_compare.get(table_column2, set())
        common_values = values1.intersection(values2)
        if common_values:
            print(f"Common values found between {table_column1} and {table_column2}: {common_values}")
            overlap_found = True
        else:
            print(f"No common values found between {table_column1} and {table_column2}")
    # Return the flag so the caller's `if compare_specific_columns(...)` test is meaningful
    return overlap_found

# Main comparison: pairwise check of all key columns, followed by the specific validation
def compare_columns(tables_columns):
    common_values_found = False
    columns_to_compare = {}
    
    for table, columns in tables_columns.items():
        for column in columns:
            columns_to_compare[f"{table}.{column}"] = get_unique_column_values(table, column)
    
    keys = list(columns_to_compare.keys())
    
    for i in range(len(keys)):
        table_column1 = keys[i]
        values1 = columns_to_compare[table_column1]
        for j in range(i + 1, len(keys)):
            table_column2 = keys[j]
            values2 = columns_to_compare[table_column2]
            common_values = values1.intersection(values2)
            if common_values:
                print(f"Common values found between {table_column1} and {table_column2}:")
                print(f"Number of common values: {len(common_values)}")
                print(f"Sample common values: {list(common_values)[:5]}")  # Print only the first 5 common values as a sample
                common_values_found = True
    
    # Specific validation
    if compare_specific_columns(columns_to_compare):
        common_values_found = True
    
    if not common_values_found:
        print("No common values found between the specified columns of different tables.")

# Execute the column comparison
compare_columns(tables_columns)


Common values found between customers.customer_id and orders.customer_id:
Number of common values: 99441
Sample common values: ['79b14a47cb76b8a85cbb4fccf73a1b2b', '8b3b03fef783fb837ca0b86a72b2e202', '4f1f2b13805c2ab2ce70a6cad8001b18', '0040a8417928d0d5abd5169cd7877181', '201e36d57411ab16157cd234082bdad8']
Common values found between customers.customer_zip_code_prefix and geoloc.geolocation_zip_code_prefix:
Number of common values: 14837
Sample common values: [65540, 65550, 65560, 98335, 98338]
Common values found between customers.customer_zip_code_prefix and sellers.seller_zip_code_prefix:
Number of common values: 2162
Sample common values: [81925, 81930, 8215, 8220, 8223]
Common values found between geoloc.geolocation_zip_code_prefix and sellers.seller_zip_code_prefix:
Number of common values: 2239
Sample common values: [81925, 81930, 8215, 8220, 8223]
Common values found between order_items.order_id and order_pymts.order_id:
Number of common values: 98665
Sample common values: ['69

### Key Observations

1. **Expected Foreign Key Relationships**:
    - `order_items.order_id`, `order_pymts.order_id`, `order_reviews.order_id`, and `orders.order_id`: High numbers of common values, indicating expected foreign key relationships.
    - `order_items.product_id` and `products.product_id`: 32951 common values, indicating a correct relationship.
    - `order_items.seller_id` and `sellers.seller_id`: 3095 common values, indicating a correct relationship.

2. **Zip Code Relationships**:
    - `customers.customer_zip_code_prefix` and `geoloc.geolocation_zip_code_prefix`: 14837 common values.
    - `customers.customer_zip_code_prefix` and `sellers.seller_zip_code_prefix`: 2162 common values.
    - `geoloc.geolocation_zip_code_prefix` and `sellers.seller_zip_code_prefix`: 2239 common values.

### Specific Validation

1. **No Common Values**:
    - `customers.customer_id` and `orders.order_id`: No common values found.
    - `customers.customer_id` and `customers.customer_unique_id`: No common values found.
    - `customers.customer_unique_id` and `orders.order_id`: No common values found.

This indicates that the `customer_id` in the `customers` table is properly distinguished from `order_id` in the `orders` table and `customer_unique_id` in the `customers` table. This is the expected result and confirms that there are no unexpected overlaps between these IDs.

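- `customers.customer_id` and `orders.customer_id`: 99441 common values, which matches the 99441 rows of the `orders` table, so every `orders.customer_id` value is present in `customers` and the column can serve as the join key between the two tables.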
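Counting common values shows that two columns overlap, but not whether some keys are missing on one side. A complementary check, sketched here on invented toy rows, is an anti-join (`LEFT JOIN ... IS NULL`) that lists the keys present in one table and absent from the other. Only the table and column names come from the notebook; the data is hypothetical.

```python
import sqlite3

# Toy tables with one deliberate mismatch on each side (invented rows)
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id TEXT);
CREATE TABLE order_items (order_id TEXT, product_id TEXT);
INSERT INTO orders VALUES ('o1'), ('o2'), ('o3');
INSERT INTO order_items VALUES ('o1', 'p1'), ('o2', 'p2'), ('o9', 'p3');
""")

# Keys used in order_items that have no matching row in orders
orphans = conn.execute("""
SELECT DISTINCT i.order_id
FROM order_items AS i
LEFT JOIN orders AS o ON o.order_id = i.order_id
WHERE o.order_id IS NULL;
""").fetchall()

# Orders that no order_items row refers to
unreferenced = conn.execute("""
SELECT o.order_id
FROM orders AS o
LEFT JOIN order_items AS i ON i.order_id = o.order_id
WHERE i.order_id IS NULL;
""").fetchall()
conn.close()

print("Item rows without an order:", orphans)
print("Orders without items:", unreferenced)
```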

### Query to Display a Few Rows and the Date Format

In [1]:
import pandas as pd
import sqlite3
import os

# Relative path to the database
db_path = os.path.join('..', 'data', 'olist.db')

if os.path.exists(db_path):
    print("Connecting to the database...")
    conn = sqlite3.connect(db_path)
    print("Database connected.")
    
    query = """
    SELECT 
        order_id,
        customer_id,
        order_status,
        order_purchase_timestamp
    FROM orders
    LIMIT 5;
    """
    try:
        data = pd.read_sql(query, conn)
        print("Sample rows from orders table:")
        print(data)  # Display sample rows to check date format
    except Exception as e:
        print("An error occurred:", e)
    finally:
        conn.close()
        print("Database connection closed.")
else:
    print("The file does not exist at the specified location:", db_path)


Connecting to the database...
Database connected.
Sample rows from orders table:
                           order_id                       customer_id  \
0  e481f51cbdc54678b7cc49136f2d6af7  9ef432eb6251297304e76186b10a928d   
1  53cdb2fc8bc7dce0b6741e2150273451  b0830fb4747a6c6d20dea0b8c802d7ef   
2  47770eb9100c2d0c44946d9cf07ec65d  41ce2a54c0b03bf3443c3d931a367089   
3  949d5b44dbf5de918fe9c16f97b45f8a  f88197465ea7920adcdbec7375364d82   
4  ad21c59c0840e6cb83a9ceb5573f8159  8ab97904e6daea8866dbdbc4fb7aad2c   

  order_status order_purchase_timestamp  
0    delivered      2017-10-02 10:56:33  
1    delivered      2018-07-24 20:41:37  
2    delivered      2018-08-08 08:38:49  
3    delivered      2017-11-18 19:28:06  
4    delivered      2018-02-13 21:18:39  
Database connection closed.
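The sample rows above show timestamps stored as text in the `YYYY-MM-DD HH:MM:SS` layout. A small sketch, using four of the values printed above, illustrates why this layout is convenient: sorting the strings as text gives the same order as sorting the parsed dates, which is what makes plain string comparisons in SQLite date filters behave chronologically. The variable names are illustrative only.

```python
from datetime import datetime

# Format observed in order_purchase_timestamp
FMT = "%Y-%m-%d %H:%M:%S"

samples = [
    "2017-10-02 10:56:33",
    "2018-07-24 20:41:37",
    "2018-08-08 08:38:49",
    "2017-11-18 19:28:06",
]

# Parse each string into a datetime object
parsed = [datetime.strptime(s, FMT) for s in samples]

# Sorting as text and sorting as dates give the same order for this layout
text_sorted = sorted(samples)
time_sorted = [d.strftime(FMT) for d in sorted(parsed)]
print("Text order matches chronological order:", text_sorted == time_sorted)
```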


### Query to Display the Current Date in order_purchase_timestamp Format

In [3]:
import pandas as pd
import sqlite3
import os

# Relative path to the database
db_path = os.path.join('..', 'data', 'olist.db')

if os.path.exists(db_path):
    print("Connecting to the database...")
    conn = sqlite3.connect(db_path)
    print("Database connected.")
    
    query = """
    SELECT strftime('%Y-%m-%d %H:%M:%S', 'now') AS current_date;
    """
    try:
        data = pd.read_sql(query, conn)
        print("Current date (formatted):")
        print(data)  # Display the current date to ensure it's being calculated correctly
    except Exception as e:
        print("An error occurred:", e)
    finally:
        conn.close()
        print("Database connection closed.")
else:
    print("The file does not exist at the specified location:", db_path)


Connecting to the database...
Database connected.
Current date (formatted):
          current_date
0  2024-07-17 07:52:15
Database connection closed.


### 1. Which recent orders (less than 90 days old, excluding canceled orders) were delivered at least 3 days late? (urgent requests)

### Explanation of the Query
1. **Recent Orders**: The query filters the orders based on the purchase timestamp to select only the recent orders placed in the last 90 days.
2. **Order Status**: It excludes the canceled orders by checking the order status.
3. **Delay Calculation**: The delay is the difference in days between `order_delivered_customer_date` and `order_estimated_delivery_date`, computed with `julianday`; a positive value means the order arrived after the estimated date.
4. **Common Table Expression (CTE)**: The query uses a Common Table Expression (CTE) named `RelevantOrders` to filter the orders based on the conditions specified.
5. **Result Columns**: The final result includes the `order_id`, `customer_id`, `order_status`, `order_purchase_timestamp`, `order_delivered_customer_date`, and `delay_days` columns for the selected orders.
6. **Data Loading**: The query result is loaded into a pandas DataFrame for further analysis and display.
7. **Display**: The first few rows of the resulting DataFrame are displayed to show the recent orders with at least 3 days of delay.
8. **Context**: The 90-day window is measured against the current date, so with older data it may match no orders; the script counts the recent orders first and prints an explanatory message when the result is empty.

In [1]:
import pandas as pd
import sqlite3
import os
from datetime import datetime

# Get today's date
today = datetime.today().strftime('%Y-%m-%d')

# Relative path to the database
db_path = os.path.join('..', 'data', 'olist.db')

# Function to execute SQL queries and return results as a pandas DataFrame
def execute_query(query):
    with sqlite3.connect(db_path) as conn:
        return pd.read_sql(query, conn)

# Check if the file exists before attempting to connect
if os.path.exists(db_path):
    print("Connecting to the database...")
    
    # Create a connection to the database
    conn = sqlite3.connect(db_path)
    print("Database connected.")
    
    # SQL query to count recent orders
    count_query = """
    SELECT COUNT(*) AS recent_orders_count
    FROM orders
    WHERE 
        order_status <> 'canceled' AND 
        order_purchase_timestamp >= datetime('now', '-90 days');
    """
    
    # SQL query to retrieve recent delayed orders using a Common Table Expression (CTE).
    # The WITH clause defines RelevantOrders, a named result set that exists only for the
    # duration of this query and filters orders by the conditions specified.
    data_query = """
    WITH RelevantOrders AS (
        SELECT 
            order_id,
            customer_id,
            order_status,
            order_purchase_timestamp,
            order_delivered_customer_date,
            julianday(order_delivered_customer_date) - julianday(order_estimated_delivery_date) AS delay_days
        FROM orders
        WHERE 
            order_status <> 'canceled' AND 
            order_purchase_timestamp >= datetime('now', '-90 days')
    )
    SELECT *
    FROM RelevantOrders
    WHERE delay_days >= 3;  -- "at least 3 days" includes a delay of exactly 3 days
    """

    try:
        # Execute the count query using the function and print the result
        count_data = execute_query(count_query)
        recent_orders_count = count_data.iloc[0, 0]
        print("Number of recent orders in the last 90 days:", recent_orders_count)
        
        # Execute the data query using the function and load data into a DataFrame
        data = execute_query(data_query)
        if recent_orders_count == 0 or data.empty:
            print(f"No data available for recent orders with at least 3 days of delay (excluding canceled orders) for orders less than 90 days old. This may be due to the date range issue because today's date is {today}, and the data in the database may be older.")
        else:
            print("Query executed successfully.")
            print(data.head())  # Display the first few rows of the result
    except Exception as e:
        print("An error occurred:", e)
    finally:
        # Close the connection
        conn.close()
        print("Database connection closed.")
else:
    print("The file does not exist at the specified location:", db_path)


Connecting to the database...
Database connected.
Number of recent orders in the last 90 days: 0
No data available for recent orders with at least 3 days of delay (excluding canceled orders) for orders less than 90 days old. This may be due to the date range issue because today's date is 2024-07-18, and the data in the database may be older.
Database connection closed.
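The empty result above comes from comparing historical data with `'now'`. One possible workaround, sketched here on a toy in-memory table, is to measure the 90-day window from the most recent purchase in the data instead of from the current date. The sample rows and the choice of `MAX(order_purchase_timestamp)` as the reference date are assumptions made for illustration, not part of the original analysis.

```python
import sqlite3

# Toy orders table with the columns used by the delay query (invented rows)
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (
    order_id TEXT,
    order_status TEXT,
    order_purchase_timestamp TEXT,
    order_delivered_customer_date TEXT,
    order_estimated_delivery_date TEXT
);
INSERT INTO orders VALUES
    ('o1', 'delivered', '2018-08-01 10:00:00', '2018-08-20 10:00:00', '2018-08-15 00:00:00'),
    ('o2', 'delivered', '2018-08-10 09:00:00', '2018-08-14 00:00:00', '2018-08-15 00:00:00'),
    ('o3', 'canceled',  '2018-08-05 12:00:00', NULL,                  '2018-08-20 00:00:00'),
    ('o4', 'delivered', '2018-01-01 08:00:00', '2018-01-30 00:00:00', '2018-01-20 00:00:00');
""")

# The reference date is the latest purchase in the data, not the current date
late = conn.execute("""
WITH ref AS (
    SELECT MAX(order_purchase_timestamp) AS ref_date FROM orders
)
SELECT
    o.order_id,
    julianday(o.order_delivered_customer_date)
        - julianday(o.order_estimated_delivery_date) AS delay_days
FROM orders AS o
CROSS JOIN ref
WHERE o.order_status <> 'canceled'
  AND o.order_purchase_timestamp >= datetime(ref.ref_date, '-90 days')
  AND julianday(o.order_delivered_customer_date)
        - julianday(o.order_estimated_delivery_date) >= 3
ORDER BY o.order_id;
""").fetchall()
conn.close()

for order_id, delay_days in late:
    print(order_id, round(delay_days, 2))
```

In this toy data only `o1` qualifies: `o2` arrived early, `o3` is canceled, and `o4` was purchased more than 90 days before the latest order.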


### 2. Sellers generating revenue over 100,000 Real via Olist?

### Explanation of the Query
1. **Revenue Calculation**: The query calculates the total revenue generated by each seller by summing the prices of the products they sold. It uses the `order_items` table to get the price of each product and the `orders` table to filter only the orders that are delivered (`order_status = 'delivered'`).
2. **Common Key**: The common key between the `order_items` and `sellers` tables is the `seller_id`, which is used to join the two tables and retrieve additional information about the sellers.
3. **Filtering Criteria**: The query filters the sellers based on the total revenue generated, selecting only those sellers who have generated over 100,000 Real.
4. **Result Columns**: The final result includes the `seller_id`, `seller_zip_code_prefix`, and `total_revenue` columns for the selected sellers.
5. **Data Loading**: The query result is loaded into a pandas DataFrame for further analysis and display.
6. **Display**: The first few rows of the resulting DataFrame are displayed to show the sellers who have generated revenue over 100,000 Real.
7. **Context**: This query does not filter on the current date, so it returns results regardless of how old the data in the database is.

In [1]:
import pandas as pd
import sqlite3
import os

# Relative path to the database
db_path = os.path.join('..', 'data', 'olist.db')

# Function to execute SQL queries and return DataFrames
def execute_query(query):
    with sqlite3.connect(db_path) as conn:
        return pd.read_sql(query, conn)

# SQL query for sellers generating revenue over 100,000 Real
query_revenue_sellers = """
WITH Revenue AS (
    SELECT
        i.seller_id,
        SUM(i.price) AS total_revenue
    FROM order_items AS i
    JOIN orders AS o ON i.order_id = o.order_id
    WHERE o.order_status = 'delivered'
    GROUP BY i.seller_id
)
SELECT s.seller_id, s.seller_zip_code_prefix, r.total_revenue
FROM Revenue AS r
JOIN sellers AS s ON r.seller_id = s.seller_id
WHERE r.total_revenue > 100000;
"""

# Execute the query and load data into a DataFrame
data_revenue_sellers = execute_query(query_revenue_sellers)
print("DataFrame: revenue_sellers")
print(data_revenue_sellers.head())
print()


DataFrame: revenue_sellers
                          seller_id  seller_zip_code_prefix  total_revenue
0  7e93a43ef30c4f03f38b393420bc753a                    6429      165981.49
1  7d13fca15225358621be4086e1eb0964                   14050      112436.18
2  955fee9216a65b617aa5c0531780ce60                    4782      131836.71
3  1f50f920176fa81dab994f9023523100                   15025      106655.71
4  fa1c13f2614d7b5c4749cbc52fecda94                   13170      190917.14
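For comparison, the same aggregation can be written without a CTE by filtering the grouped totals with `HAVING`. The sketch below runs on invented toy rows with a much smaller threshold, passed as a bound parameter; it is an alternative formulation under those assumptions, not the query used above.

```python
import sqlite3

# Toy tables with only the columns needed for the revenue calculation (invented rows)
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id TEXT, order_status TEXT);
CREATE TABLE order_items (order_id TEXT, seller_id TEXT, price REAL);
INSERT INTO orders VALUES ('o1', 'delivered'), ('o2', 'delivered'), ('o3', 'canceled');
INSERT INTO order_items VALUES
    ('o1', 's1', 60.0), ('o2', 's1', 50.0), ('o2', 's2', 40.0), ('o3', 's2', 500.0);
""")

threshold = 100.0  # illustrative threshold, far below the 100,000 used in the notebook

# HAVING filters on the aggregated sum; the threshold is bound with a ? placeholder
revenue = conn.execute("""
SELECT i.seller_id, SUM(i.price) AS total_revenue
FROM order_items AS i
JOIN orders AS o ON o.order_id = i.order_id
WHERE o.order_status = 'delivered'
GROUP BY i.seller_id
HAVING SUM(i.price) > ?
ORDER BY total_revenue DESC;
""", (threshold,)).fetchall()
conn.close()

print(revenue)
```

Seller `s2` is excluded because its large sale belongs to a canceled order, which the `WHERE` clause removes before the sum is computed.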



### 3. Who are the new sellers (less than 3 months of seniority) who are already highly engaged with the platform (having already sold more than 30 products)?

### Explanation of the Query
1. **New Sellers**: The query identifies new sellers who have been active on the platform for less than 90 days. It calculates the seller's seniority based on the date of their first order.
2. **Highly Engaged Sellers**: The query filters the new sellers who have already sold more than 30 products, indicating high engagement with the platform.
3. **Common Table Expressions (CTEs)**: The query uses two Common Table Expressions (CTEs) to first find the first order date for each seller and then calculate the total number of products sold by each seller.
4. **Result Columns**: The final result includes the `seller_id` and `seller_zip_code_prefix` columns for the selected new sellers who are highly engaged with the platform.
5. **Data Loading**: The query result is loaded into a pandas DataFrame for further analysis and display.
6. **Display**: The first few rows of the resulting DataFrame are displayed to show the new sellers who are highly engaged with the platform.
7. **Context**: Seniority is measured against the current date, so with older data the result may be empty; the script prints an explanatory message in that case.
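The steps listed above (first order date per seller, then number of items sold) can be sketched on toy data as follows. To keep the example reproducible, the reference date, the maximum seniority, and the minimum item count are passed as named parameters; these parameter names, their values, and the sample rows are all hypothetical.

```python
import sqlite3

# Toy tables with invented rows; only the columns used by the query are present
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id TEXT, order_purchase_timestamp TEXT);
CREATE TABLE order_items (order_id TEXT, order_item_id INTEGER, seller_id TEXT);
INSERT INTO orders VALUES
    ('o1', '2018-07-01 10:00:00'), ('o2', '2018-07-15 10:00:00'),
    ('o3', '2018-01-10 10:00:00'), ('o4', '2018-08-01 10:00:00'),
    ('o5', '2018-08-05 10:00:00');
INSERT INTO order_items VALUES
    ('o1', 1, 's_new'), ('o2', 1, 's_new'), ('o2', 2, 's_new'),
    ('o3', 1, 's_old'), ('o4', 1, 's_old'), ('o4', 2, 's_old'), ('o4', 3, 's_old'),
    ('o5', 1, 's_few');
""")

params = {
    "ref_date": "2018-08-31 00:00:00",  # fixed reference date instead of 'now'
    "max_age_days": 90,                 # "new" means first order less than 90 days ago
    "min_items": 2,                     # illustrative; the notebook uses 30
}

# One CTE computes both the first order date and the item count per seller
new_sellers = conn.execute("""
WITH SellerStats AS (
    SELECT
        i.seller_id,
        MIN(o.order_purchase_timestamp) AS first_order_date,
        COUNT(*) AS total_products_sold
    FROM order_items AS i
    JOIN orders AS o ON o.order_id = i.order_id
    GROUP BY i.seller_id
)
SELECT seller_id, total_products_sold
FROM SellerStats
WHERE julianday(:ref_date) - julianday(first_order_date) < :max_age_days
  AND total_products_sold > :min_items
ORDER BY seller_id;
""", params).fetchall()
conn.close()

print(new_sellers)
```

Here `s_old` is excluded by seniority and `s_few` by the item count, leaving only `s_new`.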

In [2]:
import pandas as pd
import sqlite3
import os
from datetime import datetime

# Get today's date
today = datetime.today().strftime('%Y-%m-%d')  # Format: YYYY-MM-DD

# Relative path to the database
db_path = os.path.join('..', 'data', 'olist.db')

# Function to execute SQL queries and return DataFrames
def execute_query(query):
    with sqlite3.connect(db_path) as conn:
        return pd.read_sql(query, conn)

# Step 1: Check Min and Max Dates in the Orders Table (Optional for context)
query_check_dates = """
SELECT 
    MIN(order_purchase_timestamp) AS min_date,
    MAX(order_purchase_timestamp) AS max_date
FROM orders;
"""

data_check_dates = execute_query(query_check_dates)
print("DataFrame: check_dates")
print(data_check_dates)
if data_check_dates.empty:
    print("No data available: the orders table returned no rows.")
print()

# Step 2: Check Distribution of Dates (Optional for context)
query_distribution_dates = """
SELECT 
    order_purchase_timestamp,
    COUNT(*) AS order_count
FROM orders
GROUP BY order_purchase_timestamp
ORDER BY order_purchase_timestamp;
"""

data_distribution_dates = execute_query(query_distribution_dates)
print("DataFrame: distribution_dates")
print(data_distribution_dates.head(10))  # Display the first 10 rows
if data_distribution_dates.empty:
    print("No data available: the orders table returned no rows.")
print()

# Step 3: Check Earliest Order Dates for Sellers (Optional for context)
query_earliest_order_dates = """
SELECT 
    i.seller_id,
    MIN(o.order_purchase_timestamp) AS first_order_date
FROM order_items AS i
JOIN orders AS o ON i.order_id = o.order_id
GROUP BY i.seller_id
ORDER BY first_order_date DESC
LIMIT 10;
"""

data_earliest_order_dates = execute_query(query_earliest_order_dates)
print("DataFrame: earliest_order_dates")
print(data_earliest_order_dates)
if data_earliest_order_dates.empty:
    print("No data available: no seller has an order in the database.")
print()

# Final Query: New Sellers with More Than 30 Products Sold in the Last 90 Days
query_new_sellers_final = """
WITH SellerFirstOrder AS (
    SELECT 
        i.seller_id,
        MIN(o.order_purchase_timestamp) AS first_order_date
    FROM order_items AS i
    JOIN orders AS o ON i.order_id = o.order_id
    GROUP BY i.seller_id
),
NewSellers AS (
    SELECT 
        sfo.seller_id,
        COUNT(i.order_item_id) AS total_products_sold
    FROM SellerFirstOrder AS sfo
    JOIN order_items AS i ON sfo.seller_id = i.seller_id
    JOIN orders AS o ON i.order_id = o.order_id
    WHERE 
        julianday('now') - julianday(sfo.first_order_date) < 90
    GROUP BY sfo.seller_id
    HAVING total_products_sold > 30
)
SELECT s.seller_id, s.seller_zip_code_prefix
FROM NewSellers AS ns
JOIN sellers AS s ON ns.seller_id = s.seller_id;
"""

data_new_sellers_final = execute_query(query_new_sellers_final)
print("DataFrame: new_sellers_final")
print(data_new_sellers_final)
if data_new_sellers_final.empty:
    print(f"No data available. This may be due to the date range issue because today's date is {today}, and the data in the database are older than 90 days.")
print()


DataFrame: check_dates
              min_date             max_date
0  2016-09-04 21:15:19  2018-10-17 17:30:18

DataFrame: distribution_dates
  order_purchase_timestamp  order_count
0      2016-09-04 21:15:19            1
1      2016-09-05 00:15:34            1
2      2016-09-13 15:24:19            1
3      2016-09-15 12:16:38            1
4      2016-10-02 22:07:52            1
5      2016-10-03 09:44:50            1
6      2016-10-03 16:56:50            1
7      2016-10-03 21:01:41            1
8      2016-10-03 21:13:36            1
9      2016-10-03 22:06:03            1

DataFrame: earliest_order_dates
                          seller_id     first_order_date
0  6561d6bf844e464b4019442692b40e02  2018-08-28 09:26:43
1  3296662b1331dea51e744505065ae889  2018-08-27 12:41:49
2  e8ff5a6ceb895583033fc2a0f314e3c2  2018-08-26 14:17:08
3  b76f4d90e85657a240495c876313adc5  2018-08-25 22:28:18
4  26e2e5033827d2ba53929f43e03d8ffe  2018-08-25 12:50:59
5  edb58a1390adf273840030a3d6253829  2018-0

### 4. Which are the 5 zip codes with more than 30 reviews that have the worst average review scores over the last 12 months?

### Explanation of the Query
1. **Average Review Scores**: The query calculates the average review scores for each zip code based on the reviews received over the last 12 months.
2. **Filtering Criteria**: The query filters the zip codes based on the number of reviews received, selecting only those zip codes with more than 30 reviews.
3. **Common Table Expression (CTE)**: The query uses a Common Table Expression (CTE) named `RecentReviews` to filter the reviews based on the review creation date within the last 12 months.
4. **Result Columns**: The final result includes the `customer_zip_code_prefix` and `average_score` columns for the selected zip codes.
5. **Data Loading**: The query result is loaded into a pandas DataFrame for further analysis and display.
6. **Display**: The first few rows of the resulting DataFrame are displayed to show the 5 zip codes with the worst average review scores over the last 12 months.
7. **Context**: The 12-month window is measured against the current date, so with older data the result may be empty; in that case the script reports the most recent review date in the database.

In [3]:
import pandas as pd
import sqlite3
import os
from datetime import datetime

# Get today's date
today = datetime.today().strftime('%Y-%m-%d')

# Relative path to the database
db_path = os.path.join('..', 'data', 'olist.db')

# Function to execute SQL queries and return DataFrames
def execute_query(query):
    with sqlite3.connect(db_path) as conn:
        return pd.read_sql(query, conn)

# SQL query to find the 5 zip codes with the worst average review scores over the last 12 months
query_zip_codes_reviews = """
WITH RecentReviews AS (
    SELECT 
        r.review_id,
        r.review_score,
        o.order_id,
        c.customer_zip_code_prefix,
        r.review_creation_date
    FROM order_reviews AS r
    JOIN orders AS o ON r.order_id = o.order_id
    JOIN customers AS c ON o.customer_id = c.customer_id
    WHERE 
        julianday('now') - julianday(r.review_creation_date) <= 365
),
ZipScores AS (
    SELECT 
        customer_zip_code_prefix,
        AVG(review_score) AS average_score,
        COUNT(review_id) AS review_count
    FROM RecentReviews
    GROUP BY customer_zip_code_prefix
    HAVING review_count > 30
)
SELECT 
    customer_zip_code_prefix,
    average_score
FROM ZipScores
ORDER BY average_score ASC
LIMIT 5;
"""

# Execute the query and load data into a DataFrame
data_zip_codes_reviews = execute_query(query_zip_codes_reviews)
print("DataFrame: zip_codes_reviews")
print(data_zip_codes_reviews)

# Check if the DataFrame is empty and provide additional date information
if data_zip_codes_reviews.empty:
    # Query to get the most recent review date
    query_recent_review_date = """
    SELECT MAX(review_creation_date) AS most_recent_review_date
    FROM order_reviews;
    """
    recent_review_date = execute_query(query_recent_review_date)
    most_recent_date = recent_review_date['most_recent_review_date'].iloc[0]
    
    print(f"No data available. This may be due to the date range issue because today's date is {today}, and the data in the database are older than 12 months.")
    print(f"The most recent review date in the database is {most_recent_date}.")

print()


DataFrame: zip_codes_reviews
Empty DataFrame
Columns: [customer_zip_code_prefix, average_score]
Index: []
No data available. This may be due to the date range issue because today's date is 2024-07-18, and the data in the database are older than 12 months.
The most recent review date in the database is 2018-08-31 00:00:00.
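Since the most recent review dates from 2018, a window relative to `'now'` cannot match anything. A minimal sketch of one alternative, assuming the window should instead end at the latest review in the data, is shown below on invented toy rows. The minimum review count is lowered to 1 and passed as a named parameter so that the small sample produces output; neither choice comes from the original analysis.

```python
import sqlite3

# Toy tables with invented rows; only the columns used by the query are present
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id TEXT, customer_zip_code_prefix INTEGER);
CREATE TABLE orders (order_id TEXT, customer_id TEXT);
CREATE TABLE order_reviews (
    review_id TEXT, order_id TEXT, review_score INTEGER, review_creation_date TEXT
);
INSERT INTO customers VALUES
    ('c1', 1000), ('c2', 1000), ('c3', 2000), ('c4', 2000), ('c5', 3000);
INSERT INTO orders VALUES
    ('o1', 'c1'), ('o2', 'c2'), ('o3', 'c3'), ('o4', 'c4'), ('o5', 'c5');
INSERT INTO order_reviews VALUES
    ('r1', 'o1', 1, '2018-08-01 00:00:00'),
    ('r2', 'o2', 2, '2018-07-01 00:00:00'),
    ('r3', 'o3', 5, '2018-08-10 00:00:00'),
    ('r4', 'o4', 4, '2018-06-01 00:00:00'),
    ('r5', 'o5', 1, '2018-08-31 00:00:00'),
    ('r6', 'o1', 5, '2016-10-01 00:00:00');
""")

# The window ends at the latest review date found in the data
worst_zips = conn.execute("""
WITH ref AS (
    SELECT MAX(review_creation_date) AS ref_date FROM order_reviews
)
SELECT
    c.customer_zip_code_prefix,
    AVG(r.review_score) AS average_score,
    COUNT(r.review_id) AS review_count
FROM order_reviews AS r
JOIN orders AS o ON o.order_id = r.order_id
JOIN customers AS c ON c.customer_id = o.customer_id
CROSS JOIN ref
WHERE julianday(ref.ref_date) - julianday(r.review_creation_date) <= 365
GROUP BY c.customer_zip_code_prefix
HAVING COUNT(r.review_id) > :min_reviews
ORDER BY average_score ASC
LIMIT 5;
""", {"min_reviews": 1}).fetchall()
conn.close()

for zip_prefix, average_score, review_count in worst_zips:
    print(zip_prefix, average_score, review_count)
```

In the toy data, the 2016 review falls outside the window and zip prefix 3000 is dropped because it has a single review.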

