## <center>Intro to Python for Data Science with DC OpenData</center>

![dc flag](./images/4994-004-096A5339.jpg)

<small>This notebook was prepared by [Nicole Donnelly](mailto:nicole.donnelly@dc.gov) for the DC area regional Women in Data Science Conference [(DCMDVAWiDSRegional)](https://sites.google.com/view/dcmdvawidsregional/agenda?authuser=0) on March 5, 2018 and presented as a one-hour workshop.</small>

### Introduction
As cities embrace the [open data](https://en.wikipedia.org/wiki/Open_data) movement (you can find links to datasets for 85 cities [here](https://www.forbes.com/sites/metabrown/2017/06/30/quick-links-to-municipal-open-data-portals-for-85-us-cities/#290b91072290)), data scientists have an ever-expanding body of data available to analyze and incorporate into other projects. As with any data source, unless you are designing and collecting it yourself, you will likely need to do some data wrangling before moving on to exploratory data analysis (EDA) and machine learning.

During the course of this workshop, we will look at using [Python](https://www.python.org/) to wrangle [open data available from the Government of the District of Columbia](http://opendata.dc.gov/) in preparation for machine learning (this workshop will not cover machine learning). We will also look at some initial EDA once we build a data set we want to use.

### Overview
If you do not have a particular project in mind, I encourage you to [browse through the available data sets](http://opendata.dc.gov/datasets) (951 as of the time this workshop was created). We are going to start today with the [Computer Assisted Mass Appraisal - Condominium](http://opendata.dc.gov/datasets/computer-assisted-mass-appraisal-condominium) data. There is a lot that can be done with this data, particularly in conjunction with other data available from DC ([tax data](http://opendata.dc.gov/datasets/integrated-tax-system-public-extract), [crime data](http://opendata.dc.gov/datasets?q=crime), [construction data](http://opendata.dc.gov/datasets?q=construction), or [city service requests](http://opendata.dc.gov/datasets?q=311), for example) or other sources like the [United States Census Bureau](https://www.census.gov/data.html).

Buying a house in DC can be a daunting task. Inventory was described in November 2017 as ["dismally low"](https://www.washingtonpost.com/news/where-we-live/wp/2017/11/14/buyers-are-gaining-more-leverage-in-the-hot-d-c-area-housing-market/?utm_term=.b1aa57960214). But maybe, armed with some appraisal data and machine learning, we can understand condominium values a little better. For example, maybe we could create a simple application to determine appraisal value, similar to [this example](https://github.com/georgetown-analytics/machine-learning/blob/master/examples/bbengfort/home%20sales/home_sales.ipynb), which uses housing sales data.

### Data

Here is [some information](https://www.arcgis.com/sharing/rest/content/items/d6c70978daa8461992658b69dccb3dbf/info/metadata/metadata.xml?format=default&output=html) available to us about the data.

**Abstract**: Computer Assisted Mass Appraisal (CAMA) database. The dataset contains attribution on housing characteristics for commercial properties, and was created as part of the DC Geographic Information System (DC GIS) for the D.C. Office of the Chief Technology Officer (OCTO) and participating D.C. government agencies. All DC GIS data is stored and exported in Maryland State Plane coordinates NAD 83 meters. 

METADATA CONTENT IS IN PROCESS OF VALIDATION AND SUBJECT TO CHANGE.

**Purpose**: This data is used for the planning and management of Washington, D.C. by local government agencies.

**Supplemental Information**: Most lots have one building in the cama file, assigned BLDG_NUM of one in the table. For parcels where multiple buildings exist, the primary building (such as the main residence) is assigned BLDG_NUM = 1. The other buildings or structures have BLDG_NUM values in random sequential order. After the primary structure, there is no way to associate BLDG_NUM > 2 records with any particular structure on the lot.



There is also some attribute information available. Some of it has been copied here. Not all of it is overly descriptive. 


***Entity and Attribute Information***:


**Attribute Label**: SALEDATE

**Attribute**:


**Attribute Label**: Sale_Num

**Attribute Definition**: sale number


**Attribute Label**: EYB

**Attribute Definition:** The calculated or apparent year, that an improvement was built that is most often more recent than actual year built.


**Attribute Label**: Shape

**Attribute Definition**: Feature geometry.


**Attribute Label**: OWNERNAME

**Attribute Definition**: property owner name


**Attribute Label**: SSL

**Attribute Definition**: square suffix and lot


**Attribute Label**: Extwall_D

**Attribute Definition**: exterior wall description


**Attribute Label**: PRICE

**Attribute**:


**Attribute Label**: Yr_Rmdl

**Attribute Definition**: year structure was remodeled


**Attribute Label**: Saledate

**Attribute Definition**: date of last sale


**Attribute Label**: AYB

**Attribute Definition**: The earliest time the main portion of the building was built. It is not affected by subsequent construction.


**Attribute Label**: Price

**Attribute Definition**: price of last sale


**Attribute Label**: GBA

**Attribute Definition**: gross building area in square feet


### Tools

A popular package for working with data in Python is [pandas](https://pandas.pydata.org/pandas-docs/stable/).

From the above link:

"**pandas** is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, **real world** data analysis in Python. Additionally, it has the broader goal of becoming **the most powerful and flexible open source data analysis / manipulation tool available in any language**. It is already well on its way toward this goal.

pandas is well suited for many different kinds of data:

* Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
* Ordered and unordered (not necessarily fixed-frequency) time series data.
* Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
* Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure"

We will also use [Seaborn](https://seaborn.pydata.org/), which is a visualization package built on [matplotlib](https://matplotlib.org/), a 2D plotting library in Python.
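
To make the "tabular data with heterogeneously-typed columns" idea concrete, here is a minimal sketch of a pandas DataFrame. The rows are made up for illustration; only the column names are borrowed from the condominium data used later in this notebook.

```python
import pandas as pd

# Toy records; the values are invented for illustration only.
toy = pd.DataFrame(
    {
        "SSL": ["0001 2001", "0001 2002", "0002 2001"],
        "BEDRM": [2, 3, 1],
        "PRICE": [450000.0, None, 310000.0],
        "SALEDATE": pd.to_datetime(["2016-05-01", "2017-08-15", "2015-01-20"]),
    }
)

# Each column keeps its own dtype, much like the columns of a SQL table.
print(toy.dtypes)

# Boolean masks select rows; here we keep units with at least 2 bedrooms.
two_plus = toy[toy["BEDRM"] >= 2]
print(two_plus)
```

The missing `PRICE` is stored as `NaN`, which is how pandas marks missing numeric values.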



In [None]:
import os
import urllib.request  # urlretrieve lives in the urllib.request submodule
import openpyxl

import pandas as pd
import seaborn as sns

import matplotlib.pyplot as plt

In [None]:
pd.options.display.max_columns = 35
%matplotlib inline

In [None]:
DATA_DIR = "./data"
cama_url = "https://opendata.arcgis.com/datasets/d6c70978daa8461992658b69dccb3dbf_24.csv"
cama_file = os.path.join(DATA_DIR, "cama-condo.csv")

In [None]:
def get_data(dname, furl, fname):
    if not os.path.exists(dname):
        print("making directory")
        os.makedirs(dname)
    else:
        print("directory exists")
    if not os.path.isfile(fname):
        print("downloading file")
        urllib.request.urlretrieve(furl, fname)
    else:
        print("file exists")
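
The download-if-missing pattern can also be written with `pathlib`. This is only a sketch of the same idea, not a replacement for the function above: `fetch_if_missing` is a hypothetical helper name, and the demonstration uses a local `file://` URL in a temporary folder so it runs without network access.

```python
import tempfile
import urllib.request
from pathlib import Path


def fetch_if_missing(url, dest):
    """Download url to dest unless dest already exists.

    Returns True if a download happened, False if the existing file was reused.
    """
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)  # create folders as needed
    if dest.is_file():
        return False
    urllib.request.urlretrieve(url, str(dest))
    return True


# Demonstration with a local file:// URL, so no network access is needed.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source.csv"
    source.write_text("SSL,PRICE\n0001 2001,450000\n")
    target = Path(tmp) / "data" / "copy.csv"

    first = fetch_if_missing(source.as_uri(), target)   # copies the file
    second = fetch_if_missing(source.as_uri(), target)  # reuses the copy
    copied_text = target.read_text()

print(first, second)
```

Returning a boolean instead of printing makes it easy to check whether a cached copy was used.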

In [None]:
get_data(DATA_DIR, cama_url, cama_file)

In [None]:
df = pd.read_csv(cama_file)

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.columns

In [None]:
df.info()

In [None]:
df.drop(['OBJECTID', 'QUALIFIED', 'USECODE', 'LANDAREA', 'GIS_LAST_MOD_DTTM'], axis=1, inplace=True)

In [None]:
df['SALEDATE'] = pd.to_datetime(df['SALEDATE'], errors='coerce')
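
As a small illustration of what `errors='coerce'` does, the sketch below parses a made-up Series. Values that cannot be parsed become `NaT` (the datetime version of a missing value) instead of raising an error.

```python
import pandas as pd

# Made-up inputs: one valid date, one junk string, one missing value.
raw = pd.Series(["2017-11-14", "not a date", None])

parsed = pd.to_datetime(raw, errors="coerce")
print(parsed)
print("unparseable or missing:", parsed.isna().sum())
```

Counting the `NaT` values after conversion is a quick way to see how many dates were lost to coercion.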

In [None]:
df.info()

In [None]:
# assign() returns a new DataFrame and has no inplace option, so rebind df
df = df.assign(SALEDATE=pd.to_datetime(df['SALEDATE'], errors='coerce'))

In [None]:
df = df[df.PRICE.notnull()]

In [None]:
df.info()

Imputation: what are we going to do about missing values?
I am interested in condos that have at least 2 bedrooms and are in Ward 6, but this data set has no address or ward information.
So first, let's deal with what we do have and drop every unit with fewer than 2 bedrooms.

In [None]:
df = df[df.BEDRM >= 2.0]

In [None]:
df.info()

* YR_RMDL: too many missing values, so drop it
* ROOMS: we are less concerned with the overall room count, so drop it
* HF_BATHRM: if there is no value, assume 0
* FIREPLACES: if there is no value, assume 0
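
A blanket `fillna(0)` fills every column that has gaps. If you only want to touch the columns named in these notes, you can pass a dict instead. The sketch below uses made-up rows, and the uppercase column names `HF_BATHRM` and `FIREPLACES` are assumed from the notes rather than checked against the file.

```python
import numpy as np
import pandas as pd

# Made-up rows; the column names are assumptions based on the notes.
toy = pd.DataFrame(
    {
        "HF_BATHRM": [1.0, np.nan, 0.0],
        "FIREPLACES": [np.nan, 1.0, np.nan],
        "LIVING_GBA": [900.0, np.nan, 1200.0],
    }
)

# A dict limits the fill to the named columns; other gaps are left alone.
filled = toy.fillna({"HF_BATHRM": 0, "FIREPLACES": 0})
print(filled)
print(filled.isna().sum())
```

Here the missing `LIVING_GBA` stays missing, which keeps a "no fireplace" assumption from turning into a "zero square feet" assumption.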

In [None]:
df.drop(['YR_RMDL', 'ROOMS'], axis=1, inplace=True)

In [None]:
df.info()

In [None]:
df.fillna(0, inplace=True)

In [None]:
df.info()

So now how do I know where these places are? The Master Address Repository (MAR) and the Address Residential Units data set can help:

* https://octo.dc.gov/node/1161947
* https://octo.dc.gov/sites/default/files/dc/sites/octo/publication/attachments/DCGIS-MARGeocoderUserGuide_1.pdf
* http://opendata.dc.gov/datasets/address-residential-units

**Address Residential Units**: This table contains residential units and attributes of address points, created as part of the Master Address Repository (MAR) for the D.C. Office of the Chief Technology Officer (OCTO) and DC Department of Consumer and Regulatory Affairs. Residential units can be condominiums or apartments. It contains the addresses in the District of Columbia which are typically placed on the buildings. More information on the MAR can be found at http://dcgis.dc.gov.

In [None]:
aru_url = "https://opendata.arcgis.com/datasets/c3c0ae91dca54c5d9ce56962fa0dd645_68.csv"
aru_file = os.path.join(DATA_DIR, "address_residential_unit.csv")

In [None]:
get_data(DATA_DIR, aru_url, aru_file)

In [None]:
aru_df = pd.read_csv(aru_file)

In [None]:
aru_df.head()

In [None]:
aru_df.shape

In [None]:
aru_df.info()

In [None]:
df['SSL'].isin(aru_df['SSL']).value_counts()

In [None]:
condos = pd.merge(df, aru_df, on='SSL')
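
`pd.merge` defaults to an inner join, so rows without a match are dropped and rows with several matches are repeated. The sketch below shows both effects on made-up tables. The `indicator=True` option adds a `_merge` column that reports where each row came from.

```python
import pandas as pd

# Made-up tables: "C" has no address record, "A" has two, "D" has no appraisal.
left = pd.DataFrame({"SSL": ["A", "B", "C"], "PRICE": [100, 200, 300]})
right = pd.DataFrame(
    {
        "SSL": ["A", "A", "B", "D"],
        "FULLADDRESS": ["1 MAIN ST", "1 MAIN ST", "2 OAK ST", "9 ELM ST"],
    }
)

# Default inner join: "C" disappears and "A" appears twice.
inner = pd.merge(left, right, on="SSL")
print(len(left), "rows before,", len(inner), "rows after")

# An outer join with an indicator column audits what matched.
audit = pd.merge(left, right, on="SSL", how="outer", indicator=True)
print(audit["_merge"].value_counts())
```

Comparing row counts before and after a merge is a cheap sanity check that the join key behaves the way you expect.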

In [None]:
condos.shape

In [None]:
condos.head()

In [None]:
condos.info()

In [None]:
print(condos.UNITTYPE.value_counts())
print('\n')
print(condos.STATUS.value_counts())

In [None]:
condos = condos[condos.STATUS != 'RETIRE']

In [None]:
condos.drop(['OBJECTID', 'STATUS', 'UNITTYPE', 'METADATA_ID'], axis=1, inplace=True)

In [None]:
condos.info()

In [None]:
mar_file = os.path.join(DATA_DIR, "addresses.xlsx")

```
# Export the unique addresses to an Excel file.
# The context manager saves and closes the workbook on exit.
addresses = pd.DataFrame(condos['FULLADDRESS'].unique(), columns=['full_address'])
with pd.ExcelWriter(mar_file) as writer:
    addresses.to_excel(writer, index=False)
```

Running the unique address list through the MAR geocoder took about 5 minutes. If we had submitted every row, duplicates included, it could have taken a while.
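
The idea of looking up each distinct address once and then joining the results back can be sketched on toy data. `fake_lookup` is a hypothetical stand-in for a slow geocoder call, and the wards it returns are invented.

```python
import pandas as pd

# Made-up units: three of the four rows share one building address.
units = pd.DataFrame(
    {
        "FULLADDRESS": ["1 MAIN ST", "1 MAIN ST", "2 OAK ST", "1 MAIN ST"],
        "UNITNUM": ["101", "102", "1", "103"],
    }
)


def fake_lookup(address):
    """Hypothetical stand-in for a geocoder; the wards are invented."""
    return {"1 MAIN ST": "Ward 6", "2 OAK ST": "Ward 2"}[address]


# Look up each distinct address once...
unique_addresses = pd.DataFrame({"full_address": units["FULLADDRESS"].unique()})
unique_addresses["ward"] = unique_addresses["full_address"].map(fake_lookup)

# ...then join the results back onto every unit.
tagged = units.merge(unique_addresses, left_on="FULLADDRESS", right_on="full_address")
print(len(units), "rows,", len(unique_addresses), "lookups")
```

With many units per building, the number of lookups can be far smaller than the number of rows.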

In [None]:
mar = pd.read_excel(mar_file)

In [None]:
mar.info()

In [None]:
condos = pd.merge(condos, mar, left_on='FULLADDRESS',  right_on='full_address')

In [None]:
condos.shape

In [None]:
condos.head()

In [None]:
condos.MAR_WARD.value_counts()

In [None]:
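# .copy() gives an independent DataFrame, so later in-place edits do not trigger SettingWithCopyWarning
condo_6 = condos[condos.MAR_WARD == 'Ward 6'].copy()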

In [None]:
condo_6.shape

In [None]:
condo_6.drop(['full_address',  'MAR_MATCHADDRESS', 'MAR_XCOORD', 'MAR_YCOORD', 'MAR_LATITUDE', 'MAR_LONGITUDE', 'MAR_WARD',
               'MAR_ZIPCODE', 'MARID', 'MAR_ERROR', 'MAR_SCORE', 'MAR_SOURCEOPERATION', 'MAR_IGNORE'], axis=1, inplace=True)

In [None]:
condo_6.shape

In [None]:
condo_6.info()

In [None]:
condo_6.describe()

In [None]:
fig, ax = plt.subplots(figsize=(10,8))
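# distplot is deprecated in recent seaborn releases; histplot plus rugplot draws the same layers
g = sns.histplot(condo_6.PRICE, kde=True, stat="density", ax=ax)
sns.rugplot(condo_6.PRICE, ax=ax)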
t = g.set_title("Distribution of Sale Prices")

In [None]:
fig, ax = plt.subplots(figsize=(14,8))
g = sns.boxplot(y='PRICE', x=condo_6['SALEDATE'].dt.year, data=condo_6, ax=ax)
t = g.set_title("Distribution of Sale Price by Year")
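
A box plot by year pairs well with a numeric summary of the same grouping. The sketch below uses made-up sales to show how grouping on `.dt.year` gives a count and a median price per year.

```python
import pandas as pd

# Made-up sales for illustration only.
sales = pd.DataFrame(
    {
        "SALEDATE": pd.to_datetime(
            ["2015-03-01", "2015-09-12", "2016-01-05", "2016-07-30", "2016-11-02"]
        ),
        "PRICE": [400000, 500000, 450000, 550000, 650000],
    }
)

# Group by sale year, then summarize the price within each year.
by_year = sales.groupby(sales["SALEDATE"].dt.year)["PRICE"].agg(["count", "median"])
print(by_year)
```

The count column is worth a look before reading too much into any one box, since a year with only a handful of sales can look very different from its neighbors.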

In [None]:
# seaborn 0.9 and later use height (formerly size) for the figure size
g = sns.jointplot(y="PRICE", x="LIVING_GBA", data=condo_6, kind="hex", height=8)

In [None]:
condo_6.info()

In [None]:
# drop(columns=...) keeps the remaining columns in their original order
numerical = condo_6.drop(columns=['SSL', 'SALEDATE', 'FULLADDRESS', 'UNITNUM'])
numerical.info()

In [None]:
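# restrict to numeric columns; recent pandas versions raise an error on text columns in corr()
corr_matrix = numerical.select_dtypes(include='number').corr()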
fig, ax = plt.subplots(figsize=(10,10))
sns.heatmap(corr_matrix, ax=ax);
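
A heatmap shows the whole correlation matrix at once, but it can also help to rank features by their correlation with the target. Here is a minimal sketch on made-up numbers. The column names match ones used in this notebook, but the values are invented.

```python
import pandas as pd

# Made-up values: area rises with price, build year falls with price.
toy = pd.DataFrame(
    {
        "PRICE": [300, 400, 500, 600, 700],
        "LIVING_GBA": [600, 800, 1000, 1200, 1400],
        "AYB": [2005, 1990, 1975, 1960, 1945],
        "BEDRM": [2, 2, 3, 2, 3],
    }
)

toy_corr = toy.corr()

# Take the PRICE column, drop its self-correlation, and sort.
price_corr = toy_corr["PRICE"].drop("PRICE").sort_values(ascending=False)
print(price_corr)
```

Correlation only captures linear relationships, so treat a ranking like this as a starting point for EDA rather than a feature-selection result.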