# Begin the tiered rental score by updating SDAT data

## Sales and assessment data available from the MD Department of Planning and MD Open Data are the foundation for the ownership, tax ID, and address information used to tie rentals to owners


### SDAT information is maintained from a baseline that was updated to 2017, available in a zip file at this link. The file of interest is: Dorc2017.dbf
https://www.dropbox.com/s/oc1l1frorg66vlr/DORC_MPV17.zip
Extract it using:
```
$> tar -xf DORC_MPV17.zip Dorc2017/ATDATA/DATABASE/Dorc2017.dbf
```
or on macOS:
```
unzip DORC_MPV17.zip Dorc2017/ATDATA/DATABASE/Dorc2017.dbf
``` 
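The same extraction can be done from Python, which avoids depending on which `tar` or `unzip` is installed. This is a minimal sketch using the standard-library `zipfile` module; the helper name `extract_member` is hypothetical, and the member path in the usage comment is assumed from the file name given above.

```python
import zipfile
from pathlib import Path

def extract_member(zip_path, member, dest_dir="."):
    """Extract one file from a zip archive and return the path it was written to."""
    with zipfile.ZipFile(zip_path) as archive:
        # Fail early with a clear message if the member path is wrong
        if member not in archive.namelist():
            raise FileNotFoundError(f"{member} not found in {zip_path}")
        return Path(archive.extract(member, path=dest_dir))

# Example call (archive layout is an assumption):
# extract_member("DORC_MPV17.zip", "Dorc2017/ATDATA/DATABASE/Dorc2017.dbf", "data/md_sdat")
```

Listing `archive.namelist()` first is a quick way to confirm the exact member path before extracting.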

### SDAT is updated by aggregating all published updates from the MD Department of Planning
Updating this file requires downloading monthly (or quarterly) changes listed on this page under 'Sales Data':
https://planning.maryland.gov/Pages/OurProducts/DownloadFiles.aspx

All of the update files must be saved in the /SDATA folder so that this notebook can aggregate the updates into a single file. The notebook pulls data updates from every Sales .dbf file saved in the /SDATA folder path.

### Housing assessment data is then pulled from the MD Open Data portal: 'Maryland Real Property Assessments: Hidden Property Owner Names'
This data set is updated monthly and enriches the SDAT data with assessment and square-footage data. Available at https://opendata.maryland.gov/Business-and-Economy/Maryland-Real-Property-Assessments-Hidden-Property/ed4q-f8tm
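A filtered CSV download of this dataset can be requested through the portal's resource endpoint. The sketch below builds such a URL; the endpoint, the `jurisdiction_code_mdp_field_jurscode` filter, and the `$limit` parameter are the ones used in the download step of this notebook, while the function name `assessments_url` and its defaults are hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "https://opendata.maryland.gov/resource/ed4q-f8tm.csv"

def assessments_url(jurisdiction="DORC", limit=25000):
    """Build the CSV download URL for one jurisdiction."""
    params = {
        "jurisdiction_code_mdp_field_jurscode": jurisdiction,
        "$limit": limit,
    }
    # safe="$" keeps the "$limit" parameter name unescaped
    return BASE_URL + "?" + urlencode(params, safe="$")

print(assessments_url())
```

The limit should stay above the number of records expected for the jurisdiction, otherwise the download is silently cut short.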


**Latest update:** Through July 2021

In [46]:
# TODO: Set environment dynamically 
enviro = 'dev'

####
# Set Latest File Names & data path - UPDATE TO LATEST 
###
sdat_fn = 'md_sdat/Dorc2017.dbf'
test_sdat_fn = 'Sale0521.dbf'
can_enrich_fn = 'geocoding/CAN-ref.csv'
output_fn = '_output/SDAT-CAN-ref-072021_src_sdat_etl.csv'

# Data file path 
if enviro == 'dev':
    path = 'data/'
    
    # Folder with all sdat .dbf updates
    sdat_updates_folder = 'md_sdat/SDAT_Updates/'
else: 
    path = '/content/drive/My Drive/pita 2021/'
    
    # mount google drive with data 
    from google.colab import drive
    drive.mount('/content/drive')
    
    # Folder with all sdat .dbf updates
    sdat_updates_folder = 'drive/My Drive/SDAT/'
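
One possible way to address the TODO above is to check whether the `google.colab` module is already loaded in the running kernel. This is a sketch under the assumption that the Colab runtime preloads that module; the function name `detect_environment` and the `'colab'` label are hypothetical (the cell above only distinguishes `'dev'` from everything else).

```python
import sys

def detect_environment():
    """Return 'colab' when running inside Google Colab, otherwise 'dev'."""
    # Assumption: the Colab runtime has google.colab loaded at kernel start
    return 'colab' if 'google.colab' in sys.modules else 'dev'

enviro = detect_environment()
print(enviro)
```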

In [2]:
# add any packages that aren't available by default
!pip install simpledbf



In [3]:
import pandas as pd
from simpledbf import Dbf5
from os import walk

## Step one - Create a single SDAT file

#### Open the baseline 2017 sdat file from the state

In [4]:
dbf = Dbf5(path+sdat_fn)
df = dbf.to_dataframe()
df = df.set_index('acctid')

#### Discover all the update files, and append them in the order they were published

In [5]:
def update(df):
    print("rows:", len(df))
    for (dirpath, dirnames, filenames) in walk(sdat_updates_folder):
        # Apply the sales files in sorted filename order
        for file in [name for name in sorted(filenames) if 'SALE' in name.upper()]:
            print(dirpath + file)
            new_df = Dbf5(dirpath + file).to_dataframe()
            new_df.columns = [col.lower() for col in new_df.columns]
            new_df = new_df.query('jurscode == "DORC"').set_index('acctid')
            new_df = new_df[~new_df.index.duplicated(keep='last')]  # keep last update
            # Drop the existing rows that this file replaces
            updates = [str(v) for v in set(df.index.values).intersection(set(new_df.index.values)) if int(v) > 0]
            df = df.drop(updates)
            # Carry over only the columns the baseline already has, in baseline order
            update_columns = [col for col in df.columns if col in new_df.columns]
            # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
            df = pd.concat([df, new_df[update_columns]])
        break  # only the top-level folder is read; subfolders are ignored
    df = df[~df.index.duplicated(keep='last')]
    print("final:", len(df))
    return df

merged_df = update(df.copy())

rows: 23202
final: 23202
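
The loop above applies files in `sorted(filenames)` order. If the update files follow a `SaleMMYY.dbf` pattern (an assumption based on the name `Sale0521.dbf`), plain alphabetical order is not publication order once the files span more than one year. The sketch below shows a sort key that orders by year and then month; the function name `sale_file_sort_key` is hypothetical, and two-digit years are assumed to mean 20YY.

```python
import re

def sale_file_sort_key(filename):
    """Sort key for names like 'Sale0521.dbf', read as month 05 of year 2021."""
    match = re.search(r'sale(\d{2})(\d{2})', filename, flags=re.IGNORECASE)
    if match is None:
        # Names that do not fit the pattern sort last, alphabetically
        return (9999, 99, filename.lower())
    month, year = int(match.group(1)), 2000 + int(match.group(2))
    return (year, month, filename.lower())

names = ['Sale0521.dbf', 'Sale0121.dbf', 'Sale1220.dbf']
print(sorted(names))                          # alphabetical order
print(sorted(names, key=sale_file_sort_key))  # chronological order
```

Passing `key=sale_file_sort_key` to `sorted` in the loop would be enough to switch to chronological order, provided the naming assumption holds for every file in the folder.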


## TEST CASE

#### Check that the updates from MD were applied correctly. Look up one record that we know was updated: the merged_df row should match the new record, and the original df row should differ.

In [6]:
add_df = Dbf5(path+sdat_updates_folder+test_sdat_fn)
test_df = add_df.to_dataframe()
test_df.columns = [col.lower() for col in test_df.columns]
test_df = test_df.query('jurscode == "DORC"').set_index('acctid')

test_df.query('acctid == "1001000020"')

| acctid | jurscode | digxcord | digycord | ct2010 | bg2010 | geogcode | ooi | address | city | zipcode | ... | mortgag1 | curlndvl | curimpvl | curttlvl | sallndvl | salimpvl | salttlvl | ptype | sdatwebadr | existing |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1001000020 | DORC | 505757.4 | 99135.2 | 24019970100 | 240199701002 | 80 | D | 5430 INDIANTOWN ROAD | RHODESDALE | 21659 | ... | 0 | 71800 | 266300 | 0 | 71800 | 266300 | 0 | 2 | https://sdat.dat.maryland.gov/RealProperty/Pag... | MDPV2017_18 |


In [7]:
df.query('acctid == "1001000020"')

| acctid | jurscode | digxcord | digycord | ct2010 | bg2010 | geogcode | ooi | resityp | address | strtnum | ... | resi1990 | resiuths | aprtment | trailer | special | other | ptype | sdatwebadr | existing | mdpvdate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1001000020 | DORC | 505757.4 | 99135.2 | 24019970100 | 240199701002 | 80 | D | SF | 5430 INDIANTOWN ROAD | 5430 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 2 | http://sdat.dat.maryland.gov/RealProperty/Page... | MDPV2017_18 | 2020JUN |


In [8]:
merged_df.query('acctid == "1001000020"')

| acctid | jurscode | digxcord | digycord | ct2010 | bg2010 | geogcode | ooi | resityp | address | strtnum | ... | resi1990 | resiuths | aprtment | trailer | special | other | ptype | sdatwebadr | existing | mdpvdate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1001000020 | DORC | 505757.4 | 99135.2 | 24019970100 | 240199701002 | 80 | D | SF | 5430 INDIANTOWN ROAD | 5430 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 2 | http://sdat.dat.maryland.gov/RealProperty/Page... | MDPV2017_18 | 2020JUN |
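
The side-by-side lookups above can also be compared programmatically. The following sketch lists the columns whose values differ for one account between two frames indexed by `acctid`; the helper name `changed_columns` and the toy frames are hypothetical, and the helper assumes the account appears exactly once in each frame.

```python
import pandas as pd

def changed_columns(before, after, key):
    """Return the shared columns whose values differ for the row `key`."""
    shared = [col for col in before.columns if col in after.columns]
    old = before.loc[key, shared]
    new = after.loc[key, shared]
    # Treat two missing values as equal so NaN columns are not flagged
    same = (old == new) | (old.isna() & new.isna())
    return list(same[~same].index)

# Toy frames standing in for df and merged_df
before = pd.DataFrame({'owner': ['A'], 'value': [1]},
                      index=pd.Index(['1001'], name='acctid'))
after = pd.DataFrame({'owner': ['B'], 'value': [1]},
                     index=pd.Index(['1001'], name='acctid'))
print(changed_columns(before, after, '1001'))
```

An empty list for a record that is known to have been updated would indicate that the update was not applied to the merged frame.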


## TEST CASE

#### Verify the result joins cleanly with some enrichment data we carry from year to year  
This includes things like the names of rental operation groups, notes, etc. that we might want to reference alongside the new data. It will be added later so we don't have to store it multiple times.

In [23]:
enrichment = pd.read_csv(path+can_enrich_fn).set_index('acctid')
e_merge = merged_df.merge(enrichment, left_index=True, right_index=True, how='outer', indicator=True)
# Enrichment rows that found no matching SDAT record
unmatched = e_merge.query('_merge == "right_only"')
print((len(enrichment), "records. Enriched after the join:", len(e_merge.query('_merge == "both"'))))
print("These have an issue, but that looks ok because only :",
      len(unmatched[unmatched['CAN_OWNCLASS'] == "HOUSING"]),
      "Housing have issues")

print("Remaining with an issue :",
      len(unmatched[unmatched['CAN_OWNCLASS'] != "HOUSING"]),
      "are not Housing")


(7924, 'records. Enriched after the join:', 7854)
These have an issue, but that looks ok because only : 3 Housing have issues
Remaining with an issue : 67 are not Housing


In [24]:
# View records that could not be enriched
e_merge.query('_merge == "right_only"')[e_merge.query('_merge == "right_only"')['CAN_OWNCLASS']=="HOUSING"]

| acctid | jurscode | digxcord | digycord | ct2010 | bg2010 | geogcode | ooi | resityp | address | strtnum | ... | other | ptype | sdatwebadr | existing | mdpvdate | CAN_GROUP | CAN_OWNCLASS | GEOLATLON | GEOHASH | _merge |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1007127049-1 | | | | | | | | | | | ... | | | | | | BRADLEY | HOUSING | 38.5705247976085, -76.08614165037464 | dqcgktec86uy | right_only |
| 1007127049-2 | | | | | | | | | | | ... | | | | | | BRADLEY | HOUSING | 38.5705247976085, -76.08614165037464 | dqcgktec86uy | right_only |
| 1007159773 | | | | | | | | | | | ... | | | | | | | HOUSING | 38.56280556455046, -76.08029511712246 | dqcgku0h8eup | right_only |
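
The counts printed in this test come from the `_merge` column that `indicator=True` adds to the joined frame. As a minimal sketch on toy data (the frames and values below are made up for illustration), the same breakdown can be produced with `value_counts` and `groupby`:

```python
import pandas as pd

# Toy stand-ins for merged_df and the enrichment file
sdat = pd.DataFrame({'owner': ['A', 'B']},
                    index=pd.Index(['1', '2'], name='acctid'))
extra = pd.DataFrame({'CAN_OWNCLASS': ['HOUSING', 'OTHER']},
                     index=pd.Index(['2', '3'], name='acctid'))

joined = sdat.merge(extra, left_index=True, right_index=True,
                    how='outer', indicator=True)

# How many rows matched on both sides, or on one side only
summary = joined['_merge'].value_counts()
print(summary)

# Break the unmatched enrichment rows down by owner class
unmatched_by_class = (joined[joined['_merge'] == 'right_only']
                      .groupby('CAN_OWNCLASS')
                      .size())
print(unmatched_by_class)
```

Grouping the `right_only` rows by class gives the Housing and non-Housing counts in one step instead of two separate filters.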


## Step two - Add assessment fields to SDAT

### Add the latest assessment data, grabbing it directly from MD Open Data

In [25]:
assessments = pd.read_csv('https://opendata.maryland.gov/resource/ed4q-f8tm.csv?jurisdiction_code_mdp_field_jurscode=DORC&$limit=25000')
assessment_fields = ['account_id_mdp_field_acctid','real_property_search_link',
                     'search_google_maps_for_this_location',
                     'c_a_m_a_system_data_structure_area_sq_ft_mdp_field_sqftstrc_sdat_field_241',
                     'current_assessment_year_total_phase_in_value_sdat_field_171',
                     'c_a_m_a_system_data_year_built_yyyy_mdp_field_yearblt_sdat_field_235',
                     'premise_address_number_mdp_field_premsnum_sdat_field_20',
                     'premise_address_number_suffix_sdat_field_21',
                     'premise_address_direction_mdp_field_premsdir_sdat_field_22',
                     'premise_address_name_mdp_field_premsnam_sdat_field_23',
                     'premise_address_type_mdp_field_premstyp_sdat_field_24',
                     'premise_address_city_mdp_field_premcity_sdat_field_25',
                     'premise_address_zip_code_mdp_field_premzip_sdat_field_26',
                     'mdp_street_address_mdp_field_address']
assessments = assessments[assessment_fields]
# Short names, paired in order with the first eleven entries of assessment_fields.
# zip() stops at the shorter list, so the last three fields (city, zip code,
# MDP street address) keep their source names.
assessment_column_names = ['acctid','sdat','google_maps','struct_sqft',
                           'assessed_value','year_built','address_number','address_unit_id',
                           'street_direction','street_name','street_type']
assessments.rename(columns=dict(zip(assessment_fields,assessment_column_names)),inplace=True)
# Cast the key to str so it matches the string acctid index of the SDAT data
assessments.acctid = assessments.acctid.astype(str)
assessments.set_index('acctid',inplace=True)



### Merge updated SDAT and assessment data

In [42]:
sdat_plus_assessments = merged_df.merge(assessments,how='outer', \
                                        left_index=True, right_index=True,indicator=True)
print("SDAT has ",len(merged_df), "records.  After the join there are:",\
      len(sdat_plus_assessments.query('_merge == "both"')))

SDAT has  23202 records.  After the join there are: 23202
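
Matching row counts alone do not rule out duplicated account IDs on either side of the join. As a hedged sketch on toy data, pandas can enforce a one-to-one join through the `validate` argument of `merge`, which raises `pandas.errors.MergeError` when a key repeats; the helper name `safe_merge` and the frames below are hypothetical.

```python
import pandas as pd

sdat = pd.DataFrame({'owner': ['A', 'B']},
                    index=pd.Index(['1', '2'], name='acctid'))
# Account '1' appears twice on the assessment side
assessed = pd.DataFrame({'struct_sqft': [900, 950, 1200]},
                        index=pd.Index(['1', '1', '2'], name='acctid'))

def safe_merge(left, right):
    """Outer-join on the index, refusing any duplicated key on either side."""
    try:
        return left.merge(right, how='outer', left_index=True, right_index=True,
                          indicator=True, validate='one_to_one')
    except pd.errors.MergeError as err:
        print("merge rejected:", err)
        return None

rejected = safe_merge(sdat, assessed)
# Keeping only the last row per account makes the join one-to-one again
deduped = assessed[~assessed.index.duplicated(keep='last')]
accepted = safe_merge(sdat, deduped)
print(accepted)
```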


## Step three - Save merged data for later use 

### Write the combined data set out for use later.

In [47]:
sdat_plus_assessments.query('_merge == "both"')\
            .drop(columns='_merge')\
            .to_csv(path+output_fn)
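
When the saved file is read back later, pandas will infer a numeric type for `acctid` if every value looks like a number, which would no longer match string account IDs elsewhere. The sketch below shows a round trip that keeps the key as text; the helper name `roundtrip`, the temporary file, and the sample values are hypothetical.

```python
import os
import tempfile

import pandas as pd

def roundtrip(frame, csv_path):
    """Write a frame indexed by acctid to CSV and read it back with string keys."""
    frame.to_csv(csv_path)
    # dtype keeps acctid as text, preserving leading zeros and suffixes
    return pd.read_csv(csv_path, dtype={'acctid': str}).set_index('acctid')

sample = pd.DataFrame({'assessed_value': [100, 200]},
                      index=pd.Index(['0012', '1001000020'], name='acctid'))
csv_path = os.path.join(tempfile.mkdtemp(), 'sample.csv')
restored = roundtrip(sample, csv_path)
print(list(restored.index))
```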