# **<center>Introduction</center>**
***

This notebook presents an analysis and data visualization for the class "Python for Everybody", taught by Dr. Charles Severance (AKA "Dr. Chuck") of the University of Michigan and offered on Coursera. The assignment was to select an open dataset and perform a basic analysis and visualization demonstrating mastery of the concepts taught in the course, such as web scraping, data cleaning, databases, and plotting.

From the provided list of open data sources, I went to [data.gov](https://data.gov/) and searched for datasets pertaining to New York City, my hometown. That led me to **[NYC OpenData](https://opendata.cityofnewyork.us/)**, where I got the datasets used here. After browsing what was available, I settled on real estate: several large datasets on that topic caught my eye, and those are the ones analyzed here.

My hope is to design this analysis generically enough that large parts of its code can be reused in analyses of other NYC OpenData datasets.

# **<center>Environment setup</center>**
***
* ### Import libraries

In [4]:
# Import standard libraries
import os
import re
import json
import dataclasses
import codecs
import sqlite3
from urllib.request import urlopen
import datetime

# Import third-party libraries
import geopandas as gpd
from geoalchemy2 import Geometry
import pandas as pd
import numpy as np
import pyogrio
import requests
import sqlalchemy
from sqlalchemy import create_engine, Column, Integer, Float, String, Date, MetaData, event, Table, text, LargeBinary, ForeignKey
from sqlalchemy.dialects.sqlite import insert
from sqlalchemy.orm import sessionmaker, declarative_base
from sqlalchemy.sql.sqltypes import Boolean
from sqlalchemy.event import listen
from sqlalchemy.engine import Engine
# from sqlalchemy.ext.declarative import declarative_base

import fiona
from fiona.crs import from_epsg

# Import custom helper functions
# import helperz
# from helperz import *


* ### Set the paths where the data folder (for downloaded datasets) and the SQLite database will be created

In [None]:
# Directory where the data folder for this analysis is to be created
datadir = "/home/james/Massive/PROJECTDATA"

# Name of the folder in which the project data is stored
project_name = "nyc_real_estate"

* ### Set the names and URLs of the datasets to be used. They are already set for the analysis I am doing, but the hope is to keep this as adaptable to other datasets as possible.

In [None]:
# Define datasets to be downloaded and processed
datasets = {
    "MapPLUTO": {
        "url": "https://s-media.nyc.gov/agencies/dcp/assets/files/zip/data-tools/bytes/nyc_mappluto_24v3_1_fgdb.zip",
        "title": "NYC MapPLUTO 24v3.1",
        "description": "Geographic data for tax lots",
        "data_dictionary": "https://s-media.nyc.gov/agencies/dcp/assets/files/pdf/data-tools/bytes/pluto_datadictionary.pdf",
    },
    "property_valuation_and_assessment_data": {
        "url": "https://data.cityofnewyork.us/City-Government/Property-Valuation-and-Assessment-Data/yjxr-fw8i/about_data",
        "title": "Property Valuation and Assessment",
        "description": "Property Valuation and Assessment Data",
        "data_dictionary": "https://www1.nyc.gov/assets/finance/downloads/pdf/pdfs/2023/2023_property_tax_valuation_and_assessment_data.pdf",
    },
}
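
The two URLs above have different shapes: one points directly at a zip file, the other at an NYC OpenData dataset page. Below is a minimal sketch of how each entry could be inspected before downloading. The helper names `socrata_id` and `download_filename` are hypothetical, and the pattern for the dataset identifier (two groups of four lowercase letters or digits joined by a hyphen) is an assumption based on the URL shown above.

```python
import re
from pathlib import PurePosixPath
from urllib.parse import urlparse


def socrata_id(url):
    """Return an identifier such as 'yjxr-fw8i' from a dataset page URL, or None."""
    path = urlparse(url).path
    # Assumed identifier shape: xxxx-xxxx as a complete path segment.
    match = re.search(r"/([a-z0-9]{4}-[a-z0-9]{4})(?:/|$)", path)
    return match.group(1) if match else None


def download_filename(url):
    """Return the last path component of a URL, e.g. the name of a zip file."""
    return PurePosixPath(urlparse(url).path).name


pluto_url = "https://s-media.nyc.gov/agencies/dcp/assets/files/zip/data-tools/bytes/nyc_mappluto_24v3_1_fgdb.zip"
assessment_url = "https://data.cityofnewyork.us/City-Government/Property-Valuation-and-Assessment-Data/yjxr-fw8i/about_data"

print(download_filename(pluto_url))
print(socrata_id(assessment_url))
print(socrata_id(pluto_url))
```

A direct file link yields a usable file name and no dataset identifier, while a dataset page yields an identifier that could later fill the `id` field of a metadata record.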

In [None]:
@dataclasses.dataclass
class Dataset:
    """Class to hold dataset metadata"""
    id: str
    main_url: str
    data_url: str
    data_dict_url: str
    name: str
    attribution: str
    createdAt: str
    description: str
    provenance: str
    publicationDate: str
    rowsUpdatedAt: str


In [None]:
# dataset_names = ["lien_data", "assessment_data"]

In [None]:
# Define shared dataset configurations
# NOTE: PROJECTDATA is defined in the "Construct more variables" cell below; run that cell first.
shared_dataset_configs = {
    "prefix": PROJECTDATA,
    "cols_to_drop": ["id", "sid", "position", "created_at", "created_meta", "updated_at", "updated_meta", "borough", "meta"],
    "cols_to_rename": {},
    "dtype_exceptions": {'zip_code': String},
    "lookup_columns": [],
    "datatype_mappings": {"meta_data": String, "postcode": String, "calendar_date": Date, "number": Integer, "text": String, "point": String}
}

# Define specific dataset configurations
specific_dataset_configs = {
    "lien_data": {
        "prefix": f'{PROJECTDATA}/intermediate_files',
        "cols_to_drop": [],
        "cols_to_rename": {'BORO': 'borough'},
        "lookup_columns": [],
        "dtype_exceptions": {}
    },
    "assessment_data": {
        "prefix": f'{PROJECTDATA}/intermediate_files',
        "cols_to_drop": [],
        "cols_to_rename": {"BLDGCL": "building_class", "TAXCLASS": "tax_class_code", "Zip Codes": "zip_code"},
        "lookup_columns": ["building_class", "street_name", "owner", "zip_code"],
        "dtype_exceptions": {}
    }
}
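
One way the shared and dataset-specific settings could be combined is sketched below. The merge rules are my assumption, not something fixed by the configuration itself: lists are concatenated without duplicates, dictionaries are merged with the specific values winning, and anything else (such as `prefix`) is replaced. The helper name `merge_configs` and the shortened sample dictionaries are illustrative.

```python
import copy


def merge_configs(shared, specific):
    """Combine a shared config with a dataset-specific one (assumed rules)."""
    merged = copy.deepcopy(shared)
    for key, value in specific.items():
        current = merged.get(key)
        if isinstance(current, list) and isinstance(value, list):
            # Extend the shared list, skipping entries that are already present.
            merged[key] = current + [v for v in value if v not in current]
        elif isinstance(current, dict) and isinstance(value, dict):
            # Specific entries override shared entries with the same key.
            merged[key] = {**current, **value}
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Shortened stand-ins for the dictionaries defined above.
shared_example = {
    "prefix": "/data",
    "cols_to_drop": ["id", "sid"],
    "cols_to_rename": {},
    "lookup_columns": [],
}
specific_example = {
    "prefix": "/data/intermediate_files",
    "cols_to_drop": [],
    "cols_to_rename": {"BORO": "borough"},
    "lookup_columns": [],
}

lien_config = merge_configs(shared_example, specific_example)
print(lien_config)
```

With these rules an empty `cols_to_drop` in a specific config leaves the shared list intact rather than wiping it out, which seems to be the intent of keeping the common columns in one place.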

* ### Construct more variables from the paths you set above.

In [5]:
project_path = os.getcwd()

PROJECTDATA = f"{datadir}/{project_name}_data"
sqlite_path = f'sqlite:///{PROJECTDATA}/{project_name}_db.sqlite'


# Create necessary directories
os.makedirs(PROJECTDATA, exist_ok=True)
os.makedirs(f"{PROJECTDATA}/Downloads", exist_ok=True)
os.makedirs(f"{PROJECTDATA}/intermediate_files", exist_ok=True)

# Set environment variables
os.environ["PROJECTDATA"] = PROJECTDATA
os.environ["project_name"] = project_name
os.chdir(project_path)


In [None]:
engine = create_engine(f'{sqlite_path}?check_same_thread=False', echo=False)

SessionLocal = sessionmaker(bind=engine, autoflush=False, autocommit=False)
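
A minimal sketch of how a session factory like `SessionLocal` is typically used, assuming SQLAlchemy 1.4 or later. It runs against an in-memory database with a made-up `lots` table so that it does not touch the project database; `demo_engine`, `DemoSession`, and the table and its values are illustrative only.

```python
from sqlalchemy import create_engine, text
from sqlalchemy.orm import sessionmaker

# In-memory database standing in for the on-disk SQLite file.
demo_engine = create_engine("sqlite:///:memory:", echo=False)
DemoSession = sessionmaker(bind=demo_engine, autoflush=False)

# The context manager closes the session when the block exits.
with DemoSession() as session:
    session.execute(text("CREATE TABLE lots (bbl TEXT PRIMARY KEY, zip_code TEXT)"))
    session.execute(
        text("INSERT INTO lots (bbl, zip_code) VALUES (:bbl, :zip_code)"),
        {"bbl": "demo-1", "zip_code": "00000"},  # placeholder row
    )
    row_count = session.execute(text("SELECT COUNT(*) FROM lots")).scalar_one()
    session.commit()

print(row_count)
```

Bound parameters (`:bbl`, `:zip_code`) keep values out of the SQL string, and the explicit `commit()` is needed for changes to persist once the session is closed.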