
Commit

Merge pull request #135 from hackforla/dev
dev
gennaer committed Dec 18, 2019
2 parents 826a443 + 3057074 commit d3ec81d
Showing 19 changed files with 520 additions and 157 deletions.
10 changes: 7 additions & 3 deletions .gitignore
@@ -33,11 +33,15 @@ config.js
.env
settings.cfg

docker-compose.yml
Orchestration/docker-compose.yml

__pycache__/

# csv files
/dataAnalysis/rawdata
/dataAnalysis/ryanAnalysis.ipynb
# csv files
/dataAnalysis/datasets
/server/src/static

# checkpoints
/dataAnalysis/.ipynb_checkpoints
server/src/static/temp
Binary file removed dataAnalysis/ETL/__pycache__/utils.cpython-36.pyc
Binary file not shown.
18 changes: 0 additions & 18 deletions dataAnalysis/ETL/extract_clean.py

This file was deleted.

Empty file removed dataAnalysis/ETL/transform.py
Empty file.
46 changes: 0 additions & 46 deletions dataAnalysis/ETL/utils.py

This file was deleted.

4 changes: 2 additions & 2 deletions dataAnalysis/Karlencleaning.ipynb
@@ -13,7 +13,7 @@
"from shapely.geometry import LineString, Polygon, Point\n",
"from shapely import wkt\n",
"\n",
"from ETL import utils\n",
"import utils\n",
"\n",
"import matplotlib.pyplot as plt\n",
"%matplotlib inline"
@@ -320,4 +320,4 @@
},
"nbformat": 4,
"nbformat_minor": 2
}
}
52 changes: 0 additions & 52 deletions dataAnalysis/dataCleaning.py

This file was deleted.

163 changes: 163 additions & 0 deletions dataAnalysis/onboarding/311-onboarding.ipynb
@@ -0,0 +1,163 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 311-data\n",
"\n",
"## Welcome!\n",
"\n",
"Hi! Thanks for joining the 311-data team. In this document we'll do our best to bring you up to speed with the project's goals and organization; how the data science team has fit into that equation; and how our codebase is structured.\n",
"\n",
"## Documentation\n",
"\n",
"The project's aim is to provide data visualization and reporting tools to neighborhood councils and their constituents. The data science team has been responsible for data provenance, data cleaning, and providing the infrastructure for those datastores to be accessed and stored.\n",
"\n",
"Become familiar with the information referenced in the main [README.md](https://github.com/hackforla/311-data/blob/master/README.md) file. \n",
"\n",
"### Interface Overview\n",
"\n",
"The project prototype main page will contain three main components.\n",
"\n",
" - Pin Map: A geographical map overlaid with pins representing individual or grouped requests\n",
" - Time To Close Bar Chart: A bar chart showing, for each request type, the median time from a request's opening to its closing, along with the range of closing times\n",
" - Request Frequency Plot: A plot of the number of requests received per unit time over a selected time scale\n",
" \n",
"Users can specify:\n",
"\n",
" - The time period to be analyzed\n",
" - The request type to be analyzed\n",
" - The geographic area to be analyzed (as selected by either the map interface or by neighborhood council name)\n",
"\n",
"*Mock-up graphics for the project will be provided soon.* \n",
"\n",
"## Technologies\n",
"\n",
"### Git\n",
"\n",
"The 311-data project [is hosted at GitHub.](https://github.com/hackforla/311-data/) The main branch in which development takes place is the `dev` branch. Fork the repo to your personal GitHub account and perform development on your personal fork. When you feel your commits are ready to be incorporated into the main repo's `dev` branch, go ahead and open a pull request.\n",
"\n",
"### General Operations\n",
"\n",
"311-data makes use of a number of tools, but from a global level the vast majority of the code base exists in `Python 3`.\n",
"\n",
"[pandas](https://pandas.pydata.org/) is the workhorse of the data analysis. The majority of data cleaning, filtering, and reorganization are performed using this library. It also handles ingestion of data into Postgres. [numpy](https://numpy.org/) forms the underpinning of `pandas` and is also heavily utilized.\n",
" \n",
"### Data Source\n",
"\n",
"Data for 311-data is hosted by the city of Los Angeles. Automated requests can be performed through the `Socrata` API, the key of which is available on the export page of each data set. The primary package in python for retrieval is the sodapy package, which provides tools for retrieving and iterating through `Socrata` compatible data stores. For bulk import, `csv` files are also available in the `export > download` tab of each dataset.\n",
"\n",
"The 311 dataset exists in five yearly data stores for the years 2015-2019 as listed below:\n",
"\n",
"[2015 Data](https://data.lacity.org/A-Well-Run-City/MyLA311-Service-Request-Data-2015/ms7h-a45h)\n",
"[2016 Data](https://data.lacity.org/A-Well-Run-City/MyLA311-Service-Request-Data-2016/ndkd-k878)\n",
"[2017 Data](https://data.lacity.org/A-Well-Run-City/MyLA311-Service-Request-Data-2017/d4vt-q4t5)\n",
"[2018 Data](https://data.lacity.org/A-Well-Run-City/MyLA311-Service-Request-Data-2018/h65r-yf5i)\n",
"[2019 Data](https://data.lacity.org/A-Well-Run-City/MyLA311-Service-Request-Data-2019/pvft-t768)\n",
"\n",
"Each dataset contains approximately 1 million rows of information. Data field descriptions are provided on the OpenData pages for each dataset. \n",
"\n",
"Some confusing data distinctions:\n",
"\n",
" - 311 requests are indexed by their CreatedDate. Because of this, the bulk csv files for previous years may change after the fact, so multiple datasets may need to be queried to catch all updated requests.\n",
" - There is a difference between ServiceDate and ClosedDate: ServiceDate indicates that a request was serviced by the city on a given date, while ClosedDate indicates that the request may have been passed on to another agency or closed without servicing (see the sketch below).\n",
" \n",
"### Database\n",
" \n",
" Our main data repository is `Postgres 12.1` running in a `Docker` container. The `docker` instance is represented in the repository here. Install docker if you don't have it installed and run `docker-compose up` in the directory holding the yaml file to initialize the database. Once it is running, it should be available for access at `locahost:5432`.\n",
" \n",
" In order to import and access `Postgres` the primary tool has been [sqlAlchemy](https://www.sqlalchemy.org/). `sqlAlchemy` is used for import and export of data, receiving data handed off from `pandas` on the import end and in turn handing off data as `json` when moving it to the front end.\n",
" \n",
"### Data Retrieval\n",
"\n",
"The main interface between data layers is implemented using [Sanic](https://sanic.readthedocs.io/en/latest/). `Sanic` is a [Flask](http://flask.palletsprojects.com/en/1.1.x/)-like microserver that functions as the main lever for gluing the various services of the project together. The various scripts and data objects are generally shuffled through the chain of operations from the server implementation in the `server` directory.\n",
"\n",
"### EDA\n",
"\n",
"Data analysis and EDA is primarily performed in [Jupyter](https://jupyter.org/) notebook format. The archived analyses and experimental data manipulation implementations live in the [dataAnalysis folder](https://github.com/hackforla/311-data/tree/dev/dataAnalysis) on the `dev` branch. New data analysis and data cleaning procedures are committed as `Jupyter` notebooks and then incorporated into the data cleaning pipeline.\n",
"\n",
"## Getting up and Running\n",
"\n",
"### Introductory EDA\n",
"\n",
"First get familiar with the data. Open the [dataAnalysis folder](https://github.com/hackforla/311-data/tree/dev/dataAnalysis) and see what other 311-data members have investigated in the past. Download one of the `csv` files listed above and create a `Jupyter` notebook. Go through your data cleaning procedure to see if you can find any trends in the data that may have been overlooked. Areas where the data may be inconsistent are especially important, as we want to provide the highest degree of accuracy possible in the reports provided.\n",
"\n",
"### Install Dependencies\n",
"\n",
"An install script [has been provided in the github repo.](https://github.com/hackforla/311-data/blob/e10b829bd669b549c63a6a4fac3efe6f0c937979/onboard.sh) Run the script and make sure you can pass the associated checks.\n",
"\n",
"Take a look at the [requirements file](https://github.com/hackforla/311-data/blob/dev/server/requirements.txt) and see if you have everything you need to get going. The most important tools are:\n",
"\n",
" - Python 3\n",
" - pandas\n",
" - sqlAlchemy\n",
" - sodapy\n",
" - numpy\n",
" - Docker\n",
" - sanic\n",
" \n",
"You may also want to install the following python packages to be able to access `Jupyter` notebook files submitted by other contributors:\n",
"\n",
" - seaborn\n",
" - geopandas\n",
" - shapely\n",
" - matplotlib\n",
" - pandas_profiling\n",
" - statistics\n",
" - scipy\n",
" - sklearn\n",
" - fuzzywuzzy\n",
" - hdbscan\n",
" \n",
"### Fork the repo and go crazy\n",
"\n",
"Most important files are included in the server subdirectory. \n",
"\n",
"Run `docker-compose up` using the [yml file](https://github.com/hackforla/311-data/blob/master/Orchestration/docker-compose-default.yml) included in the repo. Reference the [README.md file in the Orchestration folder](https://github.com/hackforla/311-data/blob/master/Orchestration/README.md) for information on how to initialize the docker instance. The SODAPY_APPTOKEN field should not be necessary unless making more than 1000 requests per hour, but if you would like to apply for a personal token you can do so at [Socrata's developer site.](https://opendata.socrata.com/login?return_to=%2Fprofile%2Fedit%2Fdeveloper_settings)\n",
"\n",
"Create a copy of [settings.example.cfg](https://github.com/hackforla/311-data/blob/dev/server/src/settings.example.cfg) in the /server/src/ directory and name it `settings.cfg`. You will need to specify your local password for your dockerized `postgres` instance. `settings.cfg` is included in `.gitignore` so changes to this file will not be committed to the repo, allowing you to save your personal login credentials here. For the values that match the dockerized `postgres` instance see the [Orchestration README.md file](https://github.com/hackforla/311-data/blob/master/Orchestration/README.md) for more information.\n",
"\n",
"`sqlIngest.py` includes methods for acquiring data via `Socrata` and `csv` files. It is also the main place that data cleaning happens. Once decisions are made regarding the best way to address data integrity concerns, cleaning operations can be included. The goal is to have a unified data cleaning pipeline that reflects the consensus opinions of best practice from all data scientists working on the team.\n",
"\n",
"`app.py` is the main `sanic` script, and is the routing center for the 311-data project. User-facing triggers and requests occur here. \n",
"\n",
"Take a look at the tasks on the [project board](https://github.com/hackforla/311-data/projects/4) and see if there's anything you can take a swing at. \n",
"\n",
"### Don't be afraid to reach out\n",
"\n",
"The Slack chat is your friend. Feel free to reach out on #311-data, #311-data-dev, or #311-data-analysis, or go ahead and send a direct message. Other team members will be on the board periodically through the week and can help out with any questions or concerns. \n",
"\n",
"## Happy Hacking!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}