```markdown
# Overview of the Notebook

This Jupyter notebook extracts the distinct pickup and dropoff coordinates from the Chicago Taxi Trips dataset in Google BigQuery, reverse-geocodes them, and loads the result back into BigQuery. Below is a step-by-step explanation of the workflow:

## 1. **Setup and Configuration**
    - The notebook begins by importing necessary libraries such as `json`, `os`, `re`, `pandas`, and `pandas_gbq`.
    - It loads configuration data from a JSON file (`locations_conf.json`) to retrieve directory paths and credentials for accessing Google Cloud services.

## 2. **Google Cloud Authentication**
    - The notebook uses the `google.oauth2.service_account` module to create credentials from a service account key file. These credentials are used to authenticate with Google Cloud services, including BigQuery.

## 3. **Querying BigQuery**
    - A SQL query is executed on the `bigquery-public-data.chicago_taxi_trips.taxi_trips` table to extract the distinct, non-null latitude and longitude pairs from both pickup and dropoff locations.
    - The results are loaded into a Pandas DataFrame (`chi_geo_data`), and the total row count and unique row count are calculated and displayed.

## 4. **Reverse Geocoding**
    - The notebook uses the `reverse_geocode` library to look up human-readable location information (e.g., city, state, country) for each latitude and longitude pair in the dataset.
    - The resolved data is stored in a new DataFrame (`geo_data_df`). Many coordinates resolve to the same place, so duplicate rows are removed to leave one row per place.

## 5. **Loading Data to BigQuery**
    - The processed geographic data is uploaded to a BigQuery table (`Chicago_Taxi_Trips.Geo_Locations`) using the `pandas_gbq.to_gbq` function with `if_exists='replace'`, so an existing table is overwritten.

## 6. **Documentation**
    - Markdown cells are used throughout the notebook to document each step of the workflow, making it easier to understand and maintain.

## Key Variables
    - `credentials`: Google Cloud credentials used for authentication.
    - `locations_data`: A dictionary containing configuration data, including the path to the service account key file.
    - `chi_geo_data`: A DataFrame containing the unique latitude and longitude pairs from the Chicago Taxi dataset.
    - `geo_data_df`: A DataFrame containing resolved location information for the geographic coordinates.

## Purpose
The primary goal of this notebook is to extract, process, and enrich geographic data from the Chicago Taxi Trips dataset and load the processed data into BigQuery for further analysis or visualization.
```

## Get My BQ Credentials to Access the Dataset

## Load Directory Locations

In [1]:
import json
import os
import re

# Check if the file exists and load the JSON file into a dictionary
file_path = r'C:\Users\mike\Develop\Projects\Code Notebook\Credentials\locations_conf.json'
if os.path.exists(file_path):
    with open(file_path, 'r') as f:
        locations_data = json.load(f)
    for key, value in locations_data.items():
        if key == 'BQ_Service_Key' and isinstance(value, str):
            # Mask the unique identifier part of the file name
            masked_value = re.sub(r'([a-f0-9]{12})', '************', value)
            print(f"{key}: {masked_value}")
        else:
            print(f"{key}: {value}")
else:
    print(f"File not found: {file_path}")

Common_Funcs_Dir: /Users/mike/Develop/Projects/Code Notebook/Common/Functions
Credentials_Dir: /Users/mike/Develop/Projects/Code Notebook/Credentials
Rel_Pickes_Dir: ../.pickles
Pub_Data_Dir: '/Users/mike/Data/Public
BQ_Service_Key: /Users/mike/Develop/Conf/GCP Service Keys/mikecancell-development-************.json
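
The masking step in the cell above can be factored into a small helper, which makes it easier to reuse and to test. The sketch below is a variant, not the notebook's code: the function name `mask_key_id` is hypothetical, and the pattern is anchored to the `.json` suffix (an assumption about how the key file is named) so that a 12-character hex run elsewhere in the path is left alone.

```python
import re

# Hypothetical helper, not part of the notebook: hides the 12-character
# hex identifier that sits directly before the ".json" suffix.
_KEY_ID = re.compile(r"[a-f0-9]{12}(?=\.json$)")


def mask_key_id(path: str) -> str:
    """Return `path` with the trailing key identifier replaced by asterisks."""
    return _KEY_ID.sub("*" * 12, path)


if __name__ == "__main__":
    # The example path is made up for illustration.
    print(mask_key_id("/keys/example-project-0123456789ab.json"))
```

A path without a matching identifier is returned unchanged, so the helper can be applied to every configuration value without a separate key check.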


# Connect to Google Cloud
from google.cloud import bigquery

In [2]:
from google.oauth2 import service_account

# Resolve the key path from the locations data
key_path = locations_data.get('BQ_Service_Key', 'default_key_path.json')

# Create credentials using the key file
credentials = service_account.Credentials.from_service_account_file(key_path)
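
The `.get()` call above falls back to `'default_key_path.json'` when the configuration entry is missing, so a misconfigured file would most likely surface only later, as a file-not-found error while loading the credentials. One alternative is to fail early with a clear message. The following is a minimal sketch using only the standard library; `resolve_key_path` is a hypothetical helper, and the default key name is taken from the configuration output shown earlier.

```python
import os


def resolve_key_path(conf: dict, key: str = "BQ_Service_Key") -> str:
    """Return the service account key path from `conf`, or raise early."""
    if key not in conf:
        raise KeyError(f"{key!r} is missing from the configuration")
    path = conf[key]
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Service account key not found: {path}")
    return path
```

In the notebook this could replace the `.get()` line, for example `key_path = resolve_key_path(locations_data)`.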

# Get All Unique Latitude and Longitude Pairs from the Chicago Taxi Dataset

In [27]:
import warnings
from pandas_gbq.exceptions import LargeResultsWarning

# Suppress the LargeResultsWarning
warnings.simplefilter('ignore', category=LargeResultsWarning)

# Import the pandas_gbq library
import pandas_gbq

# Define the SQL query
query = """
        SELECT DISTINCT
            latitude,
            longitude
        FROM (
            SELECT
                pickup_latitude AS latitude,
                pickup_longitude AS longitude
            FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
            UNION DISTINCT
            SELECT
                dropoff_latitude AS latitude,
                dropoff_longitude AS longitude
            FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
        )
        WHERE latitude IS NOT NULL AND longitude IS NOT NULL
"""

# Read the data from BigQuery into a pandas DataFrame
chi_geo_data = pandas_gbq.read_gbq(query, project_id=credentials.project_id, credentials=credentials)

# Print the total number of rows in the dataframe
print(f"Total number of rows: {len(chi_geo_data)}")
# Calculate and print the total number of unique rows
num_unique_rows = len(chi_geo_data.drop_duplicates())
print(f"Total number of unique rows: {num_unique_rows}")
# Display the first few rows of the dataframe
print(chi_geo_data.head())


Downloading: 100%|██████████|
Total number of rows: 875
Total number of unique rows: 875
    latitude  longitude
0  41.787279 -87.634342
1  41.827613 -87.604242
2  41.844145 -87.682996
3  41.869119 -87.756068
4  41.921701 -87.655912
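
As a rough illustration of what the query computes, the same logic can be written with Python sets: collect the pickup pairs and the dropoff pairs, take their union, and discard any pair with a missing coordinate. The trip records below are made up for the example. Because a set union already removes duplicates, this also suggests why the row count and the unique-row count above agree.

```python
# Made-up trip records: (pickup_lat, pickup_lon, dropoff_lat, dropoff_lon).
# None stands in for SQL NULL.
trips = [
    (41.78, -87.63, 41.82, -87.60),
    (41.82, -87.60, 41.78, -87.63),   # same two points, reversed
    (41.84, -87.68, None, None),      # dropoff location missing
]

pickups = {(t[0], t[1]) for t in trips}
dropoffs = {(t[2], t[3]) for t in trips}

# The set union plays the role of UNION DISTINCT; the filter plays the role
# of the WHERE ... IS NOT NULL clause.
unique_points = {
    (lat, lon)
    for lat, lon in pickups | dropoffs
    if lat is not None and lon is not None
}

print(sorted(unique_points))
```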


# Reverse Geocode the Unique Coordinates from the Chicago Taxi Dataset

In [35]:
import pandas as pd
import reverse_geocode

# Build a list of (latitude, longitude) tuples, one per row
coordinates = list(chi_geo_data[['latitude', 'longitude']].itertuples(index=False, name=None))

# Use reverse_geocode to resolve location information for each latitude and longitude pair
resolved_locations = reverse_geocode.search(coordinates)

# Convert resolved locations into a new DataFrame
geo_data_df = pd.DataFrame(resolved_locations)

# Display the resulting DataFrame
print(geo_data_df.head())

# Print the number of rows before deduplication
print(f"Final number of rows in the new DataFrame: {len(geo_data_df)}")
# Count the rows that repeat an earlier row
num_duplicates = len(geo_data_df) - len(geo_data_df.drop_duplicates())
print(f"Number of duplicate rows: {num_duplicates}")
# Drop duplicate rows from the DataFrame
geo_data_df = geo_data_df.drop_duplicates()

# Print the final number of rows after removing duplicates
print(f"Final number of rows after removing duplicates: {len(geo_data_df)}")

  country_code           city  latitude  longitude  population     state  \
0           US      Englewood  41.77976  -87.64588       26121  Illinois   
1           US        Douglas  41.83476  -87.61811       20323  Illinois   
2           US  McKinley Park  41.83170  -87.67366       15612  Illinois   
3           US         Cicero  41.84559  -87.75394       83886  Illinois   
4           US   Lincoln Park  41.92170  -87.64783       66959  Illinois   

        county        country  
0  Cook County  United States  
1  Cook County  United States  
2  Cook County  United States  
3  Cook County  United States  
4  Cook County  United States  
Final number of rows in the new DataFrame: 875
Number of duplicate rows: 811
Final number of rows after removing duplicates: 64
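
The drop from 875 rows to 64 is consistent with many coordinates resolving to the same place: the `latitude` and `longitude` columns in the output appear to be those of the matched place, not the input point. The sketch below imitates that behavior with a brute-force nearest-place search over a tiny gazetteer built from the five places printed above. It is a stand-in under that assumption, not the `reverse_geocode` implementation; `PLACES`, `haversine_km`, and `nearest_place` are hypothetical names, and the last input point is invented to show two inputs collapsing into one place.

```python
from math import asin, cos, radians, sin, sqrt

# Tiny gazetteer taken from the output above: (city, latitude, longitude).
PLACES = [
    ("Englewood", 41.77976, -87.64588),
    ("Douglas", 41.83476, -87.61811),
    ("McKinley Park", 41.83170, -87.67366),
    ("Cicero", 41.84559, -87.75394),
    ("Lincoln Park", 41.92170, -87.64783),
]


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def nearest_place(lat, lon):
    """Return the name of the gazetteer entry closest to (lat, lon)."""
    return min(PLACES, key=lambda p: haversine_km(lat, lon, p[1], p[2]))[0]


# First five points from the query output, plus one invented point.
points = [
    (41.787279, -87.634342),
    (41.827613, -87.604242),
    (41.844145, -87.682996),
    (41.869119, -87.756068),
    (41.921701, -87.655912),
    (41.780000, -87.640000),  # invented; close to the first point
]

resolved = [nearest_place(lat, lon) for lat, lon in points]
unique_places = sorted(set(resolved))
print(len(points), "points ->", len(unique_places), "places")
```

A linear scan like this is fine for a handful of places; a real gazetteer with many thousands of entries would call for a spatial index instead.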


# Load the Data to BigQuery

In [36]:
# Define the destination table in BigQuery
destination_table = 'Chicago_Taxi_Trips.Geo_Locations'

# Load the DataFrame to BigQuery
pandas_gbq.to_gbq(geo_data_df, destination_table, project_id=credentials.project_id, credentials=credentials, if_exists='replace')

print(f"Data successfully loaded to {destination_table} in BigQuery.")

100%|██████████| 1/1 [00:00<00:00, 16320.25it/s]

Data successfully loaded to Chicago_Taxi_Trips.Geo_Locations in BigQuery.



