## Task
Initialize Google Cloud BigQuery environment for data operations.

### Subtask:
Authenticate with Google Cloud, define BigQuery project and table paths, and instantiate the BigQuery client.

### Reasoning:
The code sets up the necessary environment for interacting with Google Cloud BigQuery by authenticating the user, configuring the project ID and source table path, and initializing the BigQuery client. This prepares the environment for subsequent data querying and manipulation tasks.

In [None]:
import os
from google.cloud import bigquery
from google.colab import auth

auth.authenticate_user()

PROJECT_ID = "finalproject-480220"  # Google Cloud project ID that owns the dataset
REGION = "us"
TABLE_PATH = "finalproject-480220.Final_Project.Weather"  # Use a valid public or custom project.dataset.table path

os.environ["PROJECT_ID"] = PROJECT_ID
os.environ["REGION"] = REGION
bq = bigquery.Client(project=PROJECT_ID)

print("BQ Project:", PROJECT_ID)
print("Source table:", TABLE_PATH)

BQ Project: finalproject-480220
Source table: finalproject-480220.Final_Project.Weather
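
A malformed `TABLE_PATH` only surfaces as an error once a query is sent, so a quick local check of the `project.dataset.table` shape can save a round trip. The sketch below is not part of the original notebook: `split_table_path` is a hypothetical helper, and its regular expression is a simplifying assumption about which characters appear in each part, not BigQuery's full naming rules.

```python
import re

# Hypothetical helper (an assumption, not from the notebook): checks that a
# path has exactly three dot-separated parts and returns them.
# The character classes are a simplification of BigQuery's naming rules.
_TABLE_PATH_RE = re.compile(
    r"^(?P<project>[A-Za-z0-9\-]+)\."
    r"(?P<dataset>[A-Za-z0-9_]+)\."
    r"(?P<table>[A-Za-z0-9_\-]+)$"
)


def split_table_path(table_path):
    """Return (project, dataset, table) or raise ValueError if malformed."""
    match = _TABLE_PATH_RE.match(table_path)
    if match is None:
        raise ValueError(
            f"Expected 'project.dataset.table', got: {table_path!r}"
        )
    return match.group("project"), match.group("dataset"), match.group("table")


project, dataset, table = split_table_path(
    "finalproject-480220.Final_Project.Weather"
)
print("Project:", project)
print("Dataset:", dataset)
print("Table:", table)
```

This only validates the string's shape. Whether the table actually exists and is readable still depends on BigQuery and on the authenticated user's permissions.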


## Task
Perform initial data quality checks and explain a data transformation strategy.

### Subtask:
Check for NULL values in a specified critical column and explain a method for handling missing values using SQL.

### Reasoning:
The code performs a data quality check by querying the BigQuery table to count NULL values in a critical column (e.g., 'Humidity'), which helps identify data completeness issues. It then prints an explanation of a common data transformation strategy: handling missing numerical values (e.g., 'Precipitation') through imputation, here by replacing NULLs with zero using SQL's `COALESCE` function. The example SQL is only printed, not executed, so the table itself is unchanged. Zero-filling assumes a missing reading means no precipitation; where that assumption does not hold, it can bias totals and averages.
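
Because the check assumes a 'Humidity' column exists, it may help to confirm the schema first. The following is a minimal sketch, assuming the dataset's `INFORMATION_SCHEMA.COLUMNS` view is readable by the authenticated user; `build_columns_query` and `column_exists` are hypothetical helper names introduced here for illustration.

```python
def build_columns_query(table_path):
    """Return SQL that lists the columns of a project.dataset.table path."""
    project, dataset, table = table_path.split(".")
    return (
        "SELECT column_name, data_type, is_nullable\n"
        f"FROM `{project}.{dataset}.INFORMATION_SCHEMA.COLUMNS`\n"
        f"WHERE table_name = '{table}'\n"
        "ORDER BY ordinal_position"
    )


def column_exists(column_names, wanted):
    """Case-insensitive membership test over a list of column names."""
    return wanted.lower() in {name.lower() for name in column_names}


sql = build_columns_query("finalproject-480220.Final_Project.Weather")
print(sql)

# With the notebook's client this could be run as follows
# (assumes `bq` from the setup cell is available):
# names = [row.column_name for row in bq.query(sql).result()]
# print(column_exists(names, "Humidity"))
```

The table name is interpolated directly into the SQL string, which is acceptable for a trusted constant such as `TABLE_PATH` but should not be done with untrusted input.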

In [None]:
# Data Quality Check: Check for NULL values in a critical column (e.g., 'Humidity')
# This assumes 'Humidity' is a column in the Weather table.
# If 'Humidity' does not exist, the query will raise an error.
# For a more robust solution without prior schema knowledge, one would first query INFORMATION_SCHEMA.COLUMNS.

assumed_dq_column = "Humidity" # Common column in weather data

query_dq = f"""
SELECT
    COUNT(*) AS total_rows,
    COUNTIF({assumed_dq_column} IS NULL) AS null_count
FROM
    `{TABLE_PATH}`
"""

print(f"Running Data Quality Check: Counting NULLs in '{assumed_dq_column}' column...")
try:
    query_job_dq = bq.query(query_dq)
    results_dq = query_job_dq.result()
    for row in results_dq:
        total_rows = row.total_rows
        null_count = row.null_count
        print(f"Total rows: {total_rows}")
        print(f"Rows with NULL '{assumed_dq_column}': {null_count}")
        if null_count > 0:
            print(f"Action required: Investigate and handle NULL values in '{assumed_dq_column}'.")
        else:
            print(f"'{assumed_dq_column}' column has no NULL values. Good data quality for this aspect.")
except Exception as e:
    print(f"Error during Data Quality Check. Please ensure '{assumed_dq_column}' is a valid column in the table, or replace it with an existing column name. Error: {e}")


print("\n---")

# Transformation Logic Explanation: Handling missing values (e.g., imputation for 'Precipitation')
print("Transformation Logic Explanation:")
print("One common data transformation is handling missing values. For numerical columns like 'Precipitation',")
print("missing values (NULLs) can be imputed. A simple imputation strategy is to replace NULLs with the mean, median, or zero.")
print("For example, replacing missing 'Precipitation' values with 0 (assuming 0 precipitation for missing records):")
print("In SQL, this transformation could look like:")
print("  SELECT")
print("    Date,")
print("    Max_Temperature,")
print("    Min_Temperature,")
print("    COALESCE(Precipitation, 0) AS Imputed_Precipitation,")
print("    Humidity")
print(f"  FROM `{TABLE_PATH}`;")
print("This transformation ensures that analyses involving 'Precipitation' are not skewed or halted by missing data.")

Running Data Quality Check: Counting NULLs in 'Humidity' column...
Total rows: 110303
Rows with NULL 'Humidity': 0
'Humidity' column has no NULL values. Good data quality for this aspect.

---
Transformation Logic Explanation:
One common data transformation is handling missing values. For numerical columns like 'Precipitation',
missing values (NULLs) can be imputed. A simple imputation strategy is to replace NULLs with the mean, median, or zero.
For example, replacing missing 'Precipitation' values with 0 (assuming 0 precipitation for missing records):
In SQL, this transformation could look like:
  SELECT
    Date,
    Max_Temperature,
    Min_Temperature,
    COALESCE(Precipitation, 0) AS Imputed_Precipitation,
    Humidity
  FROM `finalproject-480220.Final_Project.Weather`;
This transformation ensures that analyses involving 'Precipitation' are not skewed or halted by missing data.
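
The printed explanation mentions mean and median imputation but only shows the zero-fill case. As a rough sketch, the three strategies could be generated from one helper. The function names are hypothetical, the column name 'Precipitation' is taken from the printed example rather than from a verified schema, and nothing here is executed against BigQuery.

```python
def imputation_expr(column, strategy="zero"):
    """Return a BigQuery SELECT expression that fills NULLs in `column`."""
    if strategy == "zero":
        fill = "0"
    elif strategy == "mean":
        # AVG as a window function over the whole table; it skips NULLs.
        fill = f"AVG({column}) OVER ()"
    elif strategy == "median":
        # PERCENTILE_CONT is an analytic function in BigQuery.
        fill = f"PERCENTILE_CONT({column}, 0.5) OVER ()"
    else:
        raise ValueError(f"Unknown strategy: {strategy!r}")
    return f"COALESCE({column}, {fill}) AS Imputed_{column}"


def build_imputation_query(table_path, column, strategy="zero"):
    """Return SQL selecting every column, with `column` replaced by its imputed form."""
    return (
        f"SELECT * EXCEPT ({column}),\n"
        f"  {imputation_expr(column, strategy)}\n"
        f"FROM `{table_path}`"
    )


for strategy in ("zero", "mean", "median"):
    print(build_imputation_query(
        "finalproject-480220.Final_Project.Weather", "Precipitation", strategy
    ))
    print()
```

Which strategy is appropriate depends on why values are missing. Zero suits sensors that report nothing on dry days, while mean or median may be safer when gaps reflect outages.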


## Overall Summary

The first code block sets up the Google Cloud BigQuery environment by authenticating the user, defining the project and table paths, and initializing a BigQuery client. This prepares the notebook for interacting with BigQuery.

The second code block runs an initial data quality check by counting NULL values in a critical column ('Humidity') of the BigQuery table; the run shown found 0 NULLs in 110,303 rows. It then prints, without executing, an example of a basic transformation strategy: handling missing numerical data (e.g., 'Precipitation') by replacing NULLs with zero using SQL's `COALESCE` function. Together, these steps confirm the environment is ready and give a first assessment of data quality for one column; the imputation itself has not yet been applied to the data.