# NYC Data Analysis: Dataset Join Feasibility Assessment

This notebook assesses whether NYC 311 service request data can be joined with capital projects data, along three dimensions:
1. **Temporal dimension** - checking for time overlap
2. **Geographic dimension** - checking for common geography
3. **Possible join keys** - identifying common fields for data linking

## Data Structure
- `311-service-requests-from-2010-to-present.csv` - citizen service requests
- `capital-project-schedules-and-budgets.csv` - capital construction projects
- `311-web-content-services.csv` - web service content
- Data dictionaries and metadata files

In [9]:
# Import necessary libraries
import pandas as pd
import numpy as np
import json
import os
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 20)
plt.style.use('default')

# Path to data folder
data_path = './data/'
print("Available files in data folder:")
for file in os.listdir(data_path):
    print(f"- {file}")

Available files in data folder:
- SCA_Capital_Project_Schedules_and_Budgets_Data_Dictionary.xlsx
- socrata_metadata_311-service-requests-from-2010-to-present.json
- 311-web-content-services.csv
- 311-service-requests-from-2010-to-present.csv
- socrata_metadata.json
- capital-project-schedules-and-budgets.csv
- 311_SR_Data_Dictionary_2018.xlsx
- socrata_metadata_311-web-content-services.json


## 1. Data Files Structure Overview

First, let's load the main datasets and examine their structure:

In [None]:
# Load main datasets
print("=== DATA LOADING ===\n")

# 1. 311 Service Requests (main dataset)
print("1. Loading 311-service-requests...")
try:
    # Load first 100,000 rows for quick analysis
    df_311 = pd.read_csv(data_path + '311-service-requests-from-2010-to-present.csv',
                         nrows=100000, low_memory=False)
    print(f"   Size: {df_311.shape[0]:,} rows, {df_311.shape[1]} columns")
    print(f"   Columns: {list(df_311.columns[:10])}{'...' if len(df_311.columns) > 10 else ''}")
except Exception as e:
    print(f"   Error: {e}")

print()

# 2. Capital Projects
print("2. Loading capital-project-schedules...")
try:
    df_capital = pd.read_csv(data_path + 'capital-project-schedules-and-budgets.csv', low_memory=False)
    print(f"   Size: {df_capital.shape[0]:,} rows, {df_capital.shape[1]} columns")
    print(f"   Columns: {list(df_capital.columns[:10])}{'...' if len(df_capital.columns) > 10 else ''}")
except Exception as e:
    print(f"   Error: {e}")
    
print()

# 3. Web Content Services
print("3. Loading 311-web-content-services...")
try:
    df_web = pd.read_csv(data_path + '311-web-content-services.csv', low_memory=False)
    print(f"   Size: {df_web.shape[0]:,} rows, {df_web.shape[1]} columns")
    print(f"   Columns: {list(df_web.columns)}")
except Exception as e:
    print(f"   Error: {e}")

=== DATA LOADING ===

1. Loading 311-service-requests...
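
Because the cell above reads only the first 100,000 rows, any date range derived from `df_311` describes that sample, not the whole file. The block below is a minimal sketch of one way to get the full range without loading every column: scan the CSV in chunks and keep a running minimum and maximum. The helper name `created_date_range`, the chunk size, and the in-memory demo data are assumptions for illustration only.

```python
import io
import pandas as pd

def created_date_range(csv_source, date_col="Created Date", chunksize=250_000):
    """Return (earliest, latest) timestamp of `date_col`, scanning the CSV in chunks."""
    earliest, latest = pd.NaT, pd.NaT
    # usecols keeps memory low: only the date column is parsed
    for chunk in pd.read_csv(csv_source, usecols=[date_col], chunksize=chunksize):
        dates = pd.to_datetime(chunk[date_col], errors="coerce").dropna()
        if dates.empty:
            continue
        lo, hi = dates.min(), dates.max()
        earliest = lo if pd.isna(earliest) else min(earliest, lo)
        latest = hi if pd.isna(latest) else max(latest, hi)
    return earliest, latest

# Tiny in-memory stand-in for the real file, only to show the call
demo_csv = io.StringIO(
    "Unique Key,Created Date\n"
    "1,2019-11-12 16:48:09\n"
    "2,not a date\n"
    "3,2010-01-01 00:00:00\n"
    "4,2019-12-01 02:04:01\n"
)
first, last = created_date_range(demo_csv, chunksize=2)
print(first, last)
```

For the real file, the call would be `created_date_range(data_path + '311-service-requests-from-2010-to-present.csv')`; run time depends on file size.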


In [None]:
# Detailed structure overview of each dataset
print("=== DETAILED STRUCTURE OVERVIEW ===\n")

print("📊 1. 311 SERVICE REQUESTS DATASET:")
print("="*50)
print("Main columns:", df_311.columns.tolist())
print("\nData information:")
df_311.info()  # info() prints its report itself and returns None

print("\n" + "="*70)
print("📊 2. CAPITAL PROJECTS DATASET:")
print("="*50)
print("Main columns:", df_capital.columns.tolist())
print("\nData information:")
df_capital.info()  # info() prints its report itself and returns None

print("\n" + "="*70)
print("📊 3. WEB CONTENT SERVICES DATASET:")
print("="*50)
print("Main columns:", df_web.columns.tolist())
print(f"\nFirst 3 rows:")
print(df_web.head(3))

=== DETAILED STRUCTURE OVERVIEW ===

📊 1. 311 SERVICE REQUESTS DATASET:
Main columns: ['Unique Key', 'Created Date', 'Closed Date', 'Agency', 'Agency Name', 'Complaint Type', 'Descriptor', 'Location Type', 'Incident Zip', 'Incident Address', 'Street Name', 'Cross Street 1', 'Cross Street 2', 'Intersection Street 1', 'Intersection Street 2', 'Address Type', 'City', 'Landmark', 'Facility Type', 'Status', 'Due Date', 'Resolution Description', 'Resolution Action Updated Date', 'Community Board', 'BBL', 'Borough', 'X Coordinate (State Plane)', 'Y Coordinate (State Plane)', 'Open Data Channel Type', 'Park Facility Name', 'Park Borough', 'Vehicle Type', 'Taxi Company Borough', 'Taxi Pick Up Location', 'Bridge Highway Name', 'Bridge Highway Direction', 'Road Ramp', 'Bridge Highway Segment', 'Latitude', 'Longitude', 'Location', 'Zip Codes', 'Community Districts', 'Borough Boundaries', 'City Council Districts', 'Police Precincts']

Informat

## 2. Temporal Dimension Analysis (Time Overlap Analysis)

Let's check if there's time overlap between 311 service requests and capital projects:

In [None]:
# Analysis of temporal columns in 311 and Capital Projects datasets
print("=== TEMPORAL DATA ANALYSIS ===\n")

# 1. Analysis of 311 dataset
print("📅 1. 311 SERVICE REQUESTS DATASET - Temporal columns:")
print("-" * 50)

# Find all date columns
date_columns_311 = [col for col in df_311.columns if 'date' in col.lower() or 'time' in col.lower()]
print(f"Date columns: {date_columns_311}")

# Convert dates and analyze periods
df_311['Created Date'] = pd.to_datetime(df_311['Created Date'], errors='coerce')
df_311['Closed Date'] = pd.to_datetime(df_311['Closed Date'], errors='coerce')

print(f"\n311 data period:")
print(f"  Earliest creation date: {df_311['Created Date'].min()}")
print(f"  Latest creation date: {df_311['Created Date'].max()}")
print(f"  Earliest closure date: {df_311['Closed Date'].min()}")
print(f"  Latest closure date: {df_311['Closed Date'].max()}")

print("\n" + "="*70)

# 2. Analysis of Capital Projects dataset  
print("📅 2. CAPITAL PROJECTS DATASET - Temporal columns:")
print("-" * 50)

# Find all date columns
date_columns_capital = [col for col in df_capital.columns if 'date' in col.lower()]
print(f"Date columns: {date_columns_capital}")

# Convert dates
for date_col in date_columns_capital:
    df_capital[date_col] = pd.to_datetime(df_capital[date_col], errors='coerce')
    print(f"\n{date_col}:")
    print(f"  Min: {df_capital[date_col].min()}")
    print(f"  Max: {df_capital[date_col].max()}")
    print(f"  Number of non-null values: {df_capital[date_col].notna().sum()}/{len(df_capital)}")

=== TEMPORAL DATA ANALYSIS ===

📅 1. 311 SERVICE REQUESTS DATASET - Temporal columns:
--------------------------------------------------
Date columns: ['Created Date', 'Closed Date', 'Due Date', 'Resolution Action Updated Date']

311 data period:
  Earliest creation date: 2019-11-12 16:48:09
  Latest creation date: 2019-12-01 02:04:01
  Earliest closure date: 2019-09-27 14:25:00
  Latest closure date: 2019-12-01 01:59:50

📅 2. CAPITAL PROJECTS DATASET - Temporal columns:
--------------------------------------------------
Date columns: ['Project Phase Actual Start Date', 'Project Phase Planned End Date', 'Project Phase Actual End Date']

Project Phase Actual Start Date:
  Min: 2003-09-12 00:00:00
  Max: 2020-12-31 00:00:00
  Number of non-null values: 5502/



In [None]:
# Analysis of temporal period overlap
print("\n=== PERIOD OVERLAP ANALYSIS ===")

# Define periods for each dataset
print("\n🔍 Temporal period comparison:")
print("-" * 40)

# 311 period (from our sample)
period_311_start = df_311['Created Date'].min()
period_311_end = df_311['Created Date'].max()
print(f"311 Service Requests (sample): {period_311_start.date()} - {period_311_end.date()}")

# Capital Projects period
period_capital_start = df_capital['Project Phase Actual Start Date'].min()
period_capital_end = df_capital['Project Phase Actual Start Date'].max()
print(f"Capital Projects (start dates): {period_capital_start.date()} - {period_capital_end.date()}")

# Check for overlap
overlap_start = max(period_311_start, period_capital_start)
overlap_end = min(period_311_end, period_capital_end)

print(f"\n✅ OVERLAP ANALYSIS RESULT:")
if overlap_start <= overlap_end:
    print(f"🎯 OVERLAP EXISTS! Period: {overlap_start.date()} - {overlap_end.date()}")
    overlap_days = (overlap_end - overlap_start).days
    print(f"📊 Overlap duration: {overlap_days} days")
    
    # Count records in overlap period
    count_311_overlap = df_311[
        (df_311['Created Date'] >= overlap_start) & 
        (df_311['Created Date'] <= overlap_end)
    ].shape[0]
    
    count_capital_overlap = df_capital[
        (df_capital['Project Phase Actual Start Date'] >= overlap_start) & 
        (df_capital['Project Phase Actual Start Date'] <= overlap_end)
    ].shape[0]
    
    print(f"📈 311 requests in overlap period: {count_311_overlap:,}")
    print(f"📈 Capital projects (start) in period: {count_capital_overlap:,}")
else:
    print("❌ NO OVERLAP")

# Also check with all capital project dates
print(f"\n🔄 Additional analysis with all capital project dates:")
capital_all_dates = pd.concat([
    df_capital['Project Phase Actual Start Date'].dropna(),
    df_capital['Project Phase Planned End Date'].dropna(),
    df_capital['Project Phase Actual End Date'].dropna()
])

capital_min_all = capital_all_dates.min()
capital_max_all = capital_all_dates.max()
print(f"Full capital projects period: {capital_min_all.date()} - {capital_max_all.date()}")

overlap_start_all = max(period_311_start, capital_min_all)
overlap_end_all = min(period_311_end, capital_max_all)

if overlap_start_all <= overlap_end_all:
    print(f"✅ Overlap with all dates: {overlap_start_all.date()} - {overlap_end_all.date()}")
    print(f"📊 Duration: {(overlap_end_all - overlap_start_all).days} days")
else:
    print("❌ No overlap")


=== PERIOD OVERLAP ANALYSIS ===

🔍 Temporal period comparison:
----------------------------------------
311 Service Requests (sample): 2019-11-12 - 2019-12-01
Capital Projects (start dates): 2003-09-12 - 2020-12-31

✅ OVERLAP ANALYSIS RESULT:
🎯 OVERLAP EXISTS! Period: 2019-11-12 - 2019-12-01
📊 Overlap duration: 18 days
📈 311 requests in overlap period: 100,000
📈 Capital projects (start) in period: 74

🔄 Additional analysis with all capital project dates:
Full capital projects period: 2003-09-12 - 2023-09-03
✅ Overlap with all dates: 2019-11-12 - 2019-12-01
📊 Duration: 18 days
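
The count above covers only projects whose start date falls inside the overlap window. For questions about construction impact, the projects that were active during the window may matter more. The block below is a minimal sketch of that interval test using the two date columns found above. The helper name `projects_active_in_window` and the demo rows are hypothetical, and treating a missing end date as "still open" is an assumption, not something the dataset documents.

```python
import pandas as pd

def projects_active_in_window(projects, window_start, window_end,
                              start_col="Project Phase Actual Start Date",
                              end_col="Project Phase Actual End Date"):
    """Rows whose [start, end] interval intersects [window_start, window_end].

    Assumption: a missing end date means the phase is still open.
    Rows without a start date never match (NaT comparisons are False).
    """
    start = pd.to_datetime(projects[start_col], errors="coerce")
    end = pd.to_datetime(projects[end_col], errors="coerce")
    started_before_window_ends = start <= window_end
    ended_after_window_starts = end.isna() | (end >= window_start)
    return projects[started_before_window_ends & ended_after_window_starts]

# Hypothetical rows covering each case: spanning, open-ended, later, earlier, no start
demo = pd.DataFrame({
    "Project Phase Actual Start Date": ["2019-01-15", "2019-11-20", "2020-03-01", "2018-05-01", None],
    "Project Phase Actual End Date": ["2020-02-01", None, "2020-06-01", "2019-06-30", "2019-11-30"],
})
active = projects_active_in_window(
    demo, pd.Timestamp("2019-11-12"), pd.Timestamp("2019-12-01")
)
print(active)
```

Only the first two demo rows are kept: one phase spans the window and one started inside it with no recorded end.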


## 3. Geographic Data Analysis (Spatial Analysis)

Let's check if there are common geographic identifiers for spatial joining:

In [None]:
# Analysis of geographic columns in datasets
print("=== GEOGRAPHIC DATA ANALYSIS ===\n")

# 1. Analysis of geographic columns in 311 dataset
print("📍 1. 311 SERVICE REQUESTS DATASET - Geographic columns:")
print("-" * 60)

# Find columns with geographic data
geo_keywords = ['location', 'address', 'borough', 'zip', 'latitude', 'longitude', 'district', 'community']
geo_columns_311 = [col for col in df_311.columns 
                   if any(keyword in col.lower() for keyword in geo_keywords)]

print(f"Geographic columns: {geo_columns_311}")

# Analyze key geographic fields
key_geo_fields_311 = ['Borough', 'Incident Zip', 'Latitude', 'Longitude', 'Community Board']
for field in key_geo_fields_311:
    if field in df_311.columns:
        unique_count = df_311[field].nunique()
        null_count = df_311[field].isnull().sum()
        print(f"\n{field}:")
        print(f"  Unique values: {unique_count}")
        print(f"  Missing values: {null_count}/{len(df_311)} ({null_count/len(df_311)*100:.1f}%)")
        if unique_count < 20:  # Show values if not too many
            print(f"  Values: {sorted(df_311[field].dropna().unique())}")

print("\n" + "="*70)

# 2. Analysis of geographic columns in Capital Projects dataset
print("📍 2. CAPITAL PROJECTS DATASET - Geographic columns:")
print("-" * 60)

geo_columns_capital = [col for col in df_capital.columns 
                      if any(keyword in col.lower() for keyword in geo_keywords)]
print(f"Geographic columns: {geo_columns_capital}")

# Analyze key geographic fields
key_geo_fields_capital = ['Project Geographic District ', 'Project School Name']
for field in key_geo_fields_capital:
    if field in df_capital.columns:
        unique_count = df_capital[field].nunique()
        null_count = df_capital[field].isnull().sum()
        print(f"\n{field}:")
        print(f"  Unique values: {unique_count}")
        print(f"  Missing values: {null_count}/{len(df_capital)} ({null_count/len(df_capital)*100:.1f}%)")
        if unique_count < 30:  # Show values if not too many
            sample_values = df_capital[field].dropna().unique()[:10]  # First 10 values
            print(f"  Sample values: {list(sample_values)}")

# Specific analysis of Geographic District
if 'Project Geographic District ' in df_capital.columns:
    print(f"\n🔍 Detailed analysis of Project Geographic District:")
    district_counts = df_capital['Project Geographic District '].value_counts()
    print(f"Top 10 districts by project count:")
    print(district_counts.head(10))

=== GEOGRAPHIC DATA ANALYSIS ===

📍 1. 311 SERVICE REQUESTS DATASET - Geographic columns:
------------------------------------------------------------
Geographic columns: ['Location Type', 'Incident Zip', 'Incident Address', 'Address Type', 'Community Board', 'Borough', 'Park Borough', 'Taxi Company Borough', 'Taxi Pick Up Location', 'Latitude', 'Longitude', 'Location', 'Zip Codes', 'Community Districts', 'Borough Boundaries', 'City Council Districts']

Borough:
  Unique values: 6
  Missing values: 0/100000 (0.0%)
  Values: ['BRONX', 'BROOKLYN', 'MANHATTAN', 'QUEENS', 'STATEN ISLAND', 'Unspecified']

Incident Zip:
  Unique values: 209
  Missing values: 2749/100000 (2.7%)

Latitude:
  Unique values: 50039
  Missing values: 3304/100000 (3.3%)

Longitude:
  Unique v
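
`Borough` reports 0.0% missing values, yet its value list includes `'Unspecified'`, so the null count alone overstates how many rows can be joined on borough. The block below is a small sketch of a missing-rate check that also counts such placeholder labels. The helper name `effective_missing_rate` and the demo series are hypothetical, and which labels count as placeholders is an assumption to confirm per column.

```python
import pandas as pd

def effective_missing_rate(series, placeholders=("Unspecified",)):
    """Share of values that are null or equal to a placeholder label."""
    is_placeholder = series.isin(list(placeholders))
    return float((series.isna() | is_placeholder).mean())

# Hypothetical values: one placeholder and one true null out of five
demo_borough = pd.Series(["BRONX", "Unspecified", "QUEENS", None, "BROOKLYN"])
rate = effective_missing_rate(demo_borough)
print(f"{rate:.1%}")
```

On the notebook's data the call would be `effective_missing_rate(df_311['Borough'])`.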

## 4. Join Keys Identification

Let's analyze possible ways to join the datasets:

In [None]:
# Search for possible keys to join datasets
print("=== JOIN KEYS SEARCH ===\n")

# 1. Compare all columns to find common ones
print("🔍 1. COLUMN COMPARISON BETWEEN DATASETS:")
print("-" * 50)

columns_311 = set(df_311.columns)
columns_capital = set(df_capital.columns)

# Search for exact matches
exact_matches = columns_311.intersection(columns_capital)
print(f"Exact column matches: {list(exact_matches) if exact_matches else 'None'}")

# Search for similar columns (by name)
similar_pairs = []
for col_311 in columns_311:
    for col_capital in columns_capital:
        # Check similarity by keywords
        keywords_common = ['district', 'borough', 'location', 'address', 'zip', 'community']
        col_311_lower = col_311.lower()
        col_capital_lower = col_capital.lower()
        
        for keyword in keywords_common:
            if keyword in col_311_lower and keyword in col_capital_lower:
                similar_pairs.append((col_311, col_capital, keyword))

print(f"\nSimilar columns by keywords:")
for pair in similar_pairs:
    print(f"  311: '{pair[0]}' <-> Capital: '{pair[1]}' (common: {pair[2]})")

print("\n" + "="*70)

# 2. Analysis of spatial join possibilities
print("🗺️ 2. SPATIAL JOIN POSSIBILITIES:")
print("-" * 50)

# Check Borough in 311 and Geographic District in Capital
if 'Borough' in df_311.columns and 'Project Geographic District ' in df_capital.columns:
    
    # Unique boroughs in 311
    boroughs_311 = set(df_311['Borough'].dropna().unique())
    print(f"Boroughs in 311 dataset ({len(boroughs_311)}): {sorted(boroughs_311)}")
    
    # Unique districts in Capital Projects
    districts_capital = set(df_capital['Project Geographic District '].dropna().unique())
    print(f"\nDistricts in Capital dataset ({len(districts_capital)}):")
    print(f"First 10: {sorted(list(districts_capital))[:10]}")
    
    # Try to find correspondences between Borough and District
    print(f"\n🔄 Search for Borough <-> District correspondences:")
    
    # The five NYC borough names; used below to skip values such as 'Unspecified'
    nyc_boroughs = ['MANHATTAN', 'BROOKLYN', 'QUEENS', 'BRONX', 'STATEN ISLAND']
    
    for borough in boroughs_311:
        if borough and borough.upper() in nyc_boroughs:
            # Count records for this borough
            count_311 = df_311[df_311['Borough'] == borough].shape[0]
            print(f"  {borough}: {count_311:,} 311 requests")

print("\n" + "="*70)

# 3. Community Board analysis as possible key
print("🏘️ 3. COMMUNITY BOARD ANALYSIS:")
print("-" * 50)

if 'Community Board' in df_311.columns:
    cb_311 = df_311['Community Board'].dropna().unique()
    print(f"Community Board in 311 ({len(cb_311)} unique):")
    print(f"Examples: {sorted(cb_311)[:10]}")
    
    # Check if there are similar fields in Capital
    cb_like_fields = [col for col in df_capital.columns if 'board' in col.lower() or 'community' in col.lower()]
    print(f"\nSimilar fields in Capital: {cb_like_fields}")

print("\n" + "="*70)

# 4. Coordinates analysis
print("📍 4. COORDINATES ANALYSIS:")
print("-" * 50)

if 'Latitude' in df_311.columns and 'Longitude' in df_311.columns:
    lat_count = df_311['Latitude'].notna().sum()
    lon_count = df_311['Longitude'].notna().sum()
    print(f"311 dataset: {lat_count:,} records with latitude coordinates, {lon_count:,} with longitude")
    
    print(f"311 coordinate ranges:")
    print(f"  Latitude: {df_311['Latitude'].min():.4f} - {df_311['Latitude'].max():.4f}")
    print(f"  Longitude: {df_311['Longitude'].min():.4f} - {df_311['Longitude'].max():.4f}")

# Check if there are coordinates in Capital
coord_fields_capital = [col for col in df_capital.columns if any(word in col.lower() for word in ['lat', 'lon', 'coord'])]
print(f"\nCoordinate fields in Capital: {coord_fields_capital if coord_fields_capital else 'No explicit coordinate fields'}")

print("\n" + "="*70)

# 5. Summary of possible join strategies
print("💡 5. POSSIBLE JOIN STRATEGIES:")
print("-" * 50)

strategies = [
    {
        'name': 'Geographic join via Borough/District',
        'feasible': bool(boroughs_311 and districts_capital),
        'description': 'Mapping Borough (311) -> Geographic District (Capital)',
        'challenge': 'Need additional mapping between Borough and District numbers'
    },
    {
        'name': 'Spatial join via coordinates',
        'feasible': lat_count > 0 and len(coord_fields_capital) > 0,  # a direct spatial join needs coordinates on both sides
        'description': 'Using 311 coordinates to determine proximity to Capital projects',
        'challenge': 'Capital projects lack coordinates - need geocoding'
    },
    {
        'name': 'Temporal-geographic join',
        'feasible': True,
        'description': 'Combining temporal overlap + geographic proximity',
        'challenge': 'Need additional geographic reference'
    },
    {
        'name': 'Join via external sources',
        'feasible': True,
        'description': 'Using additional NYC geographic reference data',
        'challenge': 'Need external data for mapping'
    }
]

for i, strategy in enumerate(strategies, 1):
    status = "✅ Possible" if strategy['feasible'] else "❌ Difficult"
    print(f"{i}. {strategy['name']} - {status}")
    print(f"   Description: {strategy['description']}")
    print(f"   Challenge: {strategy['challenge']}")
    print()

=== JOIN KEYS SEARCH ===

🔍 1. COLUMN COMPARISON BETWEEN DATASETS:
--------------------------------------------------
Exact column matches: None

Similar columns by keywords:
  311: 'Community Districts' <-> Capital: 'Project Geographic District ' (common: district)
  311: 'City Council Districts' <-> Capital: 'Project Geographic District ' (common: district)

🗺️ 2. SPATIAL JOIN POSSIBILITIES:
--------------------------------------------------
Boroughs in 311 dataset (6): ['BRONX', 'BROOKLYN', 'MANHATTAN', 'QUEENS', 'STATEN ISLAND', 'Unspecified']

Districts in Capital dataset (33):
First 10: [np.int64(1), np.int64(2), np.int64(3), np.int64(4), np.int64(5), np.int64(6), np.int64(7), np.int64(8), np.int64(9), np.int64(10)]

🔄 Search for Borough <-> District correspondences:
  BRON
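
The data dictionary file name (`SCA_Capital_Project_Schedules_and_Budgets_Data_Dictionary.xlsx`) suggests the capital dataset comes from the School Construction Authority, so `Project Geographic District ` may hold NYC community school district numbers. That is an assumption, not something verified in this notebook. If it holds, a borough label can be derived from the district number with a lookup like the sketch below. The names `SCHOOL_DISTRICT_BOROUGH` and `add_borough_from_district` are hypothetical, and values outside 1 to 32 are left unmapped.

```python
import pandas as pd

# Assumed layout of NYC community school districts by borough.
# Hypothetical lookup; check it against the data dictionary before relying on it.
SCHOOL_DISTRICT_BOROUGH = {
    **{d: "MANHATTAN" for d in range(1, 7)},
    **{d: "BRONX" for d in range(7, 13)},
    **{d: "BROOKLYN" for d in list(range(13, 24)) + [32]},
    **{d: "QUEENS" for d in range(24, 31)},
    31: "STATEN ISLAND",
}

def add_borough_from_district(projects, district_col="Project Geographic District"):
    """Return a copy with stripped column names and a derived 'Borough' column."""
    # The source column name has a trailing space; strip it once up front
    out = projects.rename(columns=lambda c: c.strip())
    district = pd.to_numeric(out[district_col], errors="coerce")
    # Unknown or missing district numbers map to None
    out["Borough"] = district.map(lambda d: SCHOOL_DISTRICT_BOROUGH.get(d))
    return out

# Hypothetical district values, including one outside the 1-32 range
demo = pd.DataFrame({"Project Geographic District ": [1, 7, 32, 31, 24, 75]})
mapped = add_borough_from_district(demo)
print(mapped)
```

The derived labels use the same upper-case spelling as the 311 `Borough` column, so the two frames could then share a key.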

## 5. Conclusions and Recommendations

Based on the conducted analysis, here are the summary conclusions regarding data joining possibilities:

In [None]:
# Final conclusions and recommendations
print("=" * 80)
print("📋 FINAL CONCLUSIONS AND RECOMMENDATIONS")
print("=" * 80)

print("\n✅ 1. TEMPORAL DIMENSION (TIME OVERLAP)")
print("-" * 50)
print("RESULT: Significant temporal overlap exists!")
print(f"• Overlap period: 18 days (2019-11-12 to 2019-12-01)")
print(f"• 311 requests in period: 100,000 records")
print(f"• Capital projects: 74 projects started in this period")
print("• Overall Capital Projects period: 2003-2023 (covers all possible 311 periods)")

print("\n✅ 2. GEOGRAPHIC DIMENSION")
print("-" * 50)
print("RESULT: Geographic joining possibilities exist!")
print("• 311 dataset has Borough (5 NYC boroughs) + coordinates (96,696 records)")
print("• Capital Projects has Geographic District (33 districts)")
print("• Community Board in 311 (76 unique districts)")
print("• Coordinates exist only in 311, Capital Projects lacks them")

print("\n💡 3. RECOMMENDED JOIN STRATEGIES")
print("-" * 50)

strategies = [
    {
        "priority": "High",
        "name": "Borough → District Mapping",
        "description": "Create mapping between Borough (311) and Geographic District (Capital)",
        "implementation": "Use NYC School Districts or Community Districts reference",
        "pros": "Direct relationship, high accuracy",
        "cons": "Requires additional reference data"
    },
    {
        "priority": "Medium", 
        "name": "Temporal-geographic join",
        "description": "Combine temporal overlap + geographic proximity",
        "implementation": "Filter by time + group by Borough/District",
        "pros": "Enables analysis of construction impact on requests",
        "cons": "More complex logic, requires validation"
    },
    {
        "priority": "Low",
        "name": "Geocoding + spatial join", 
        "description": "Add coordinates to Capital Projects via geocoding",
        "implementation": "Geocode project addresses, use search radius",
        "pros": "Most accurate spatial joining",
        "cons": "Requires geocoding, computationally intensive"
    }
]

for i, strategy in enumerate(strategies, 1):
    print(f"\n{i}. {strategy['name']} (Priority: {strategy['priority']})")
    print(f"   📝 Description: {strategy['description']}")
    print(f"   🔨 Implementation: {strategy['implementation']}")
    print(f"   ✅ Advantages: {strategy['pros']}")
    print(f"   ⚠️  Disadvantages: {strategy['cons']}")

print(f"\n🎯 4. BEST APPROACH FOR ANALYSIS")
print("-" * 50)
print("Recommended combined strategy:")
print("1️⃣ Temporal filtering: select overlap period")
print("2️⃣ Geographic grouping: Borough (311) + additional mapping to District")
print("3️⃣ Correlation analysis: requests before/during/after projects")
print("4️⃣ Visualization: maps with overlaid zones and time series")

print(f"\n📊 5. EXPECTED ANALYSIS RESULTS")
print("-" * 50)
research_questions = [
    "Do 311 requests increase during active construction projects?",
    "What types of requests are most commonly related to construction work?", 
    "In which districts does construction most impact citizen requests?",
    "How long does the impact of construction projects on request volume last?",
    "Are there seasonal or weekly patterns in the relationship between projects and requests?"
]

for i, question in enumerate(research_questions, 1):
    print(f"{i}. {question}")

print(f"\n📋 6. NEXT STEPS")
print("-" * 50)
next_steps = [
    "Load complete 311 dataset (not just 100K records)",
    "Find or create Borough → Geographic District mapping",
    "Implement temporal-geographic join", 
    "Conduct exploratory analysis of joined data",
    "Create visualizations to test hypotheses",
    "Statistically verify correlations between projects and requests"
]

for i, step in enumerate(next_steps, 1):
    print(f"{i}. {step}")

print("\n" + "=" * 80)

📋 FINAL CONCLUSIONS AND RECOMMENDATIONS

✅ 1. TEMPORAL DIMENSION (TIME OVERLAP)
--------------------------------------------------
RESULT: Significant temporal overlap exists!
• Overlap period: 18 days (2019-11-12 to 2019-12-01)
• 311 requests in period: 100,000 records
• Capital projects: 74 projects started in this period
• Overall Capital Projects period: 2003-2023 (covers all possible 311 periods)

✅ 2. GEOGRAPHIC DIMENSION
--------------------------------------------------
RESULT: Geographic joining possibilities exist!
• 311 dataset has Borough (5 NYC boroughs) + coordinates (96,696 records)
• Capital Projects has Geographic District (33 districts)
• Community Board in 311
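
The temporal-geographic join named in the strategies can be sketched as a merge on a shared borough key followed by a date-window filter. This is an illustration under stated assumptions, not the notebook's implementation. It presumes the capital projects already carry a `Borough` column (for example via a district-to-borough lookup), and the column names `project_id`, `start`, and `end`, the helper `requests_during_projects`, and all demo rows are hypothetical.

```python
import pandas as pd

def requests_during_projects(requests, projects):
    """Pair each request with same-borough projects active on its creation date."""
    # Many-to-many merge on borough: every request meets every project there
    paired = requests.merge(projects, on="Borough", how="inner")
    in_window = (
        (paired["Created Date"] >= paired["start"])
        & (paired["Created Date"] <= paired["end"])
    )
    return paired[in_window]

# Hypothetical 311 requests
requests = pd.DataFrame({
    "Unique Key": [1, 2, 3],
    "Borough": ["BRONX", "BRONX", "QUEENS"],
    "Created Date": pd.to_datetime(["2019-11-15", "2019-12-20", "2019-11-20"]),
})
# Hypothetical projects with a derived borough and an active window
projects = pd.DataFrame({
    "project_id": ["P1", "P2"],
    "Borough": ["BRONX", "QUEENS"],
    "start": pd.to_datetime(["2019-11-01", "2019-12-01"]),
    "end": pd.to_datetime(["2019-11-30", "2020-01-31"]),
})
matched = requests_during_projects(requests, projects)
print(matched[["Unique Key", "project_id"]])
```

A borough-level merge multiplies rows quickly, so on the full 311 file it would likely be safer to aggregate requests by borough and day before joining, or to filter both frames to the overlap period first.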