makepath · brendancol · Jan 11, 2023 · Jun 22, 2022
diff --git a/examples/housing_price_feature_engineering.ipynb b/examples/housing_price_feature_engineering.ipynb
@@ -0,0 +1,311 @@
+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "name": "housing_price_feature_engineering",
+      "provenance": [],
+      "collapsed_sections": []
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    },
+    "language_info": {
+      "name": "python"
+    }
+  },
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Housing price prediction feature engineering\n",
+        "\n",
+        "In general, price of a house is determined under many factors and the location always plays a paramount role in making value of the property. In this notebook, we will discover how geo-related aspects affect housing price in the USA. We will consider a house based on the information provided in ***an existing dataset*** with some addtional spatial attributes extracted from its location using xarray-spatial ***(and probably some elevation dataset, and census-parquet as well?)***.\n",
+        "\n",
+        "Existing features:\n",
+        "- ...\n",
+        "\n",
+        "New features:\n",
+        "- ***Slope?*** (from an elevation dataset)\n",
+        "- ***Population density, ...?*** (from Census if none available in the existing features)\n",
+        "- Distance to nearest hospital (or grocery store / university / pharmacy)\n",
+        "\n",
+        "We'll first build a machine learning model and train it with all existing features. For each newly added feature, we'll retrain it and compare the results to find out which features help enrich the model."
+      ],
+      "metadata": {
+        "id": "WEWW_CD4owNL"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## Imports\n",
+        "\n",
+        "First, let's import all neccesary libraries."
+      ],
+      "metadata": {
+        "id": "LWdCWgs31tjo"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "VoNuwihJovVG"
+      },
+      "outputs": [],
+      "source": [
+        "import numpy as np\n",
+        "import pandas as pd\n",
+        "import rasterio\n",
+        "\n",
+        "import datashader as ds\n",
+        "from datashader.transfer_functions import shade\n",
+        "from datashader.transfer_functions import stack\n",
+        "from datashader.transfer_functions import dynspread\n",
+        "from datashader.transfer_functions import set_background\n",
+        "from datashader.colors import Elevation\n",
+        "\n",
+        "from xrspatial import slope\n",
+        "from xrspatial import proximity"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## Load the existing dataset"
+      ],
+      "metadata": {
+        "id": "eHISReTx1zmb"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# assume the data contain lat lon coords with some additional values\n",
+        "df = pd.DataFrame({\n",
+        "    'id': [0, 1, 2, 3, 4],\n",
+        "    'x': [0, 1, 2, 0, 4],\n",
+        "    'y': [2, 0, 1, 3, 1],\n",
+        "    'column_1': [2, 3, 4, 2, 6],\n",
+        "    'price': [1, 3, 4, 3, 7]\n",
+        "})"
+      ],
+      "metadata": {
+        "id": "o492hk5B5_mM"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## Build and train housing price model\n",
+        "\n",
+        "We'll split the data into train data and test data."
+      ],
+      "metadata": {
+        "id": "NoXJhz-7153G"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        ""
+      ],
+      "metadata": {
+        "id": "VrvvjQ0v6FME"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Now let's build the model to predict housing price. "
+      ],
+      "metadata": {
+        "id": "mIm7gBOf9Mi-"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        ""
+      ],
+      "metadata": {
+        "id": "rSEWxAWh9Q7j"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "After tuning the hyper parameters, we selected the best model as below."
+      ],
+      "metadata": {
+        "id": "t7KiSyr29ROE"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        ""
+      ],
+      "metadata": {
+        "id": "hmLkxoO-963c"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Prediction accuracy on the test set."
+      ],
+      "metadata": {
+        "id": "AL8mMHzI9-Hz"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        ""
+      ],
+      "metadata": {
+        "id": "m4Pg2hXv99tf"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## Calculated spatial attributes\n",
+        "\n",
+        "As stated above, we'll calculate spatial attributes (slope, ...?) of each location and its proximities to some nearest services. \n",
+        "\n",
+        "**TBD**: What is the format of additional data? Is it in vector or raster format?\n",
+        "- If raster (preferred), load directly as 2D xarray DataArrays\n",
+        "- If vector, load into a pandas/geopandas DataFrame and rasterize with datashader.\n",
+        "\n",
+        "Assume that the data is in vector format."
+      ],
+      "metadata": {
+        "id": "18o3Swi-6Fjg"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# bounding box of the raster\n",
+        "xmin, xmax, ymin, ymax = (\n",
+        "    df.x.min(),\n",
+        "    df.x.max(),\n",
+        "    df.x.min(),\n",
+        "    df.x.max()\n",
+        ")\n",
+        "xrange = (xmin, xmax)\n",
+        "yrange = (ymin, ymax)\n",
+        "\n",
+        "# width and height of the raster image\n",
+        "W, H = 800, 600\n",
+        "\n",
+        "# canvas object to rasterize the houses\n",
+        "cvs = ds.Canvas(plot_width=W, plot_height=H, x_range=xrange, y_range=yrange)\n",
+        "raster = cvs.points(df, x='x', y='y', agg=ds.min('id'))\n",
+        "\n",
+        "# visualize the raster\n",
+        "points_shaded = dynspread(shade(raster, cmap='salmon', min_alpha=0, span=(0,1), how='linear'), threshold=1, max_px=5)\n",
+        "set_background(points_shaded, 'black')"
+      ],
+      "metadata": {
+        "id": "mtQu5wSd6V_Y"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Identify location in pixel space of houses."
+      ],
+      "metadata": {
+        "id": "6k9E72q9Ge_d"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        ""
+      ],
+      "metadata": {
+        "id": "EW3-QIoQLRN_"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Calculate new feature value."
+      ],
+      "metadata": {
+        "id": "kmkbu2e2LrQo"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        ""
+      ],
+      "metadata": {
+        "id": "NJ6Djtu1Lqsv"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Retrain the model with new feature and compute test accuracy."
+      ],
+      "metadata": {
+        "id": "b15dhXjXLxMj"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        ""
+      ],
+      "metadata": {
+        "id": "bVjatg3KLvvz"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## Feature selection"
+      ],
+      "metadata": {
+        "id": "YEFwSg5Q6W1Z"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        ""
+      ],
+      "metadata": {
+        "id": "KQB3kmlg6Zh4"
+      },
+      "execution_count": null,
+      "outputs": []
+    }
+  ]
+}