# Exploring and Preparing Land Prices Data by Census Tract 

[Data Source](https://www.fhfa.gov/PolicyProgramsResearch/Research/Pages/wp1901.aspx)


[FAQs](https://www.fhfa.gov/PolicyProgramsResearch/Research/PaperDocuments/FAQs-Land-10-28-20.pdf)

In this notebook, I prepare our land values data for use in the decision tree for our random forest evaluation. The steps are: filter down to the census tracts within our areas of interest, clean the data (checking for NAs and duplicates), join the data to census tract boundaries, and prepare it to be spatially joined to the parcel-level SCAG data.

First, look at the census-tract-level land values data. This dataset covers residential properties only.

In [2]:
import pandas as pd

alldf = pd.read_csv('land_vals.csv', encoding_errors = 'ignore')

In [3]:
alldf.sample(5)

Unnamed: 0,State,County,Census Tract,"Land Value\n(1/4 Acre Lot, Standardized)","Land Value\n(Per Acre, As-Is)",Land Share of Property Value,Lot Size,Interior Square Feet,Property Value (Standardized),Property Value (As-is)
33815,New York,Sullivan County,36105952500,18400,25400,0.146,37280,1290,154500,148900
15315,Illinois,Cook County,17031804510,52100,229500,0.253,8920,1300,263200,185900
20700,Louisiana,St. Tammany Parish,22103040802,58200,191300,0.188,14120,2550,223300,329100
195,Alabama,Lawrence County,1079979500,11800,12700,0.13,50480,1590,128500,113300
41236,Pennsylvania,Philadelphia County,42101021200,99400,859100,0.26,2880,1420,334600,218000


In [4]:
#filter down to counties of interest
socal_counties = ['Riverside County', 'San Bernardino County', 'Los Angeles County']
socal = alldf[alldf.County.isin(socal_counties)]
socal.sample(10)

Unnamed: 0,State,County,Census Tract,"Land Value\n(1/4 Acre Lot, Standardized)","Land Value\n(Per Acre, As-Is)",Land Share of Property Value,Lot Size,Interior Square Feet,Property Value (Standardized),Property Value (As-is)
3248,California,Los Angeles County,6037501900,318700,1832100,0.627,6460,1290,621000,433300
2500,California,Los Angeles County,6037101300,406500,1826100,0.621,8890,1580,733100,599700
3240,California,Los Angeles County,6037500600,287400,1777100,0.612,5740,1260,572400,383100
2862,California,Los Angeles County,6037262400,2146700,5212500,0.747,21460,3610,2261300,3439000
4457,California,Riverside County,6065041702,192400,887700,0.539,8800,1290,450500,332600
3612,California,Los Angeles County,6037910603,55300,294900,0.218,7120,1600,270800,221500
3058,California,Los Angeles County,6037403402,480800,2291800,0.671,8400,1870,731100,658500
2872,California,Los Angeles County,6037265201,2231000,9977600,0.754,9170,3350,2221800,2784400
3456,California,Los Angeles County,6037620522,951600,5036700,0.747,7210,2260,1150100,1116100
3587,California,Los Angeles County,6037901011,61200,285400,0.212,8410,1960,278800,260000
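
The introduction mentions checking for NAs and duplicates. A minimal sketch of that check is below, using a small made-up frame in place of `socal`. The helper name `summarize_quality` and the toy values are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def summarize_quality(df, key):
    """Return NA counts per column and the number of duplicated key values."""
    na_counts = df.isna().sum()
    n_dup_keys = int(df[key].duplicated().sum())
    return na_counts, n_dup_keys

# Toy stand-in for `socal`; the rows are invented for illustration.
toy = pd.DataFrame({
    'Census Tract': [6037101300, 6037101300, 6065041702],
    'Lot Size': [8890, 8890, None],
})

na_counts, n_dup_keys = summarize_quality(toy, 'Census Tract')
print(na_counts)    # missing values per column
print(n_dup_keys)   # tract IDs that appear more than once
```

Running the same helper on `socal` with `'Census Tract'` as the key would show whether any tract appears more than once before the join.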


In [5]:
import geopandas as gpd

tracts = gpd.read_file('tl_2019_06_tract/tl_2019_06_tract.shp')
tracts.dtypes

STATEFP       object
COUNTYFP      object
TRACTCE       object
GEOID         object
NAME          object
NAMELSAD      object
MTFCC         object
FUNCSTAT      object
ALAND          int64
AWATER         int64
INTPTLAT      object
INTPTLON      object
geometry    geometry
dtype: object

In [6]:
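# cast GEOID to a numeric type so it can be matched to the numeric tract IDs
# in the land values data; the display below is rounded, the stored value is exact
tracts['GEOID'] = tracts['GEOID'].astype('float')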
tracts.head()

Unnamed: 0,STATEFP,COUNTYFP,TRACTCE,GEOID,NAME,NAMELSAD,MTFCC,FUNCSTAT,ALAND,AWATER,INTPTLAT,INTPTLON,geometry
0,6,37,139301,6037139000.0,1393.01,Census Tract 1393.01,G5020,S,2865657,0,34.1781538,-118.5581265,"POLYGON ((-118.57150 34.17758, -118.57148 34.1..."
1,6,37,139302,6037139000.0,1393.02,Census Tract 1393.02,G5020,S,338289,0,34.176723,-118.5383655,"POLYGON ((-118.54073 34.18019, -118.54070 34.1..."
2,6,37,139502,6037140000.0,1395.02,Census Tract 1395.02,G5020,S,1047548,0,34.1628402,-118.526311,"POLYGON ((-118.53225 34.16201, -118.53177 34.1..."
3,6,37,139600,6037140000.0,1396.0,Census Tract 1396,G5020,S,2477482,0,34.1640599,-118.5101001,"POLYGON ((-118.51858 34.15858, -118.51858 34.1..."
4,6,37,139701,6037140000.0,1397.01,Census Tract 1397.01,G5020,S,3396396,2411,34.157429,-118.4954117,"POLYGON ((-118.50980 34.15691, -118.50848 34.1..."
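
The float cast works for joining because 11-digit tract IDs fit exactly in a 64-bit float, but the rounded display makes different tracts look identical. An alternative sketch, assuming the keys can be kept as text, is to format both sides as 11-character zero-padded strings. The helper name `to_geoid_str` is hypothetical.

```python
import pandas as pd

def to_geoid_str(series):
    """Format tract IDs as 11-character, zero-padded strings."""
    return series.astype('int64').astype(str).str.zfill(11)

# Two tract IDs taken from the sample output above, stored as integers,
# which is how they lose the leading zero of the state FIPS code.
land = pd.DataFrame({'Census Tract': [6037139301, 1079979500]})
land['GEOID'] = to_geoid_str(land['Census Tract'])
print(land['GEOID'].tolist())
```

With string keys on both frames, the index displays the full tract ID and the join no longer depends on float precision.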


In [7]:
# rename and cast the tract ID so it matches the GEOID key in `tracts`
socal = socal.rename(columns = {'Census Tract':'GEOID'})
socal['GEOID'] = socal['GEOID'].astype('float')
socal = socal.set_index('GEOID')

In [8]:
tracts = tracts.set_index('GEOID')

In [9]:
socal.shape

(1889, 9)

In [10]:
tracts.head()

GEOID,STATEFP,COUNTYFP,TRACTCE,NAME,NAMELSAD,MTFCC,FUNCSTAT,ALAND,AWATER,INTPTLAT,INTPTLON,geometry
6037139000.0,6,37,139301,1393.01,Census Tract 1393.01,G5020,S,2865657,0,34.1781538,-118.5581265,"POLYGON ((-118.57150 34.17758, -118.57148 34.1..."
6037139000.0,6,37,139302,1393.02,Census Tract 1393.02,G5020,S,338289,0,34.176723,-118.5383655,"POLYGON ((-118.54073 34.18019, -118.54070 34.1..."
6037140000.0,6,37,139502,1395.02,Census Tract 1395.02,G5020,S,1047548,0,34.1628402,-118.526311,"POLYGON ((-118.53225 34.16201, -118.53177 34.1..."
6037140000.0,6,37,139600,1396.0,Census Tract 1396,G5020,S,2477482,0,34.1640599,-118.5101001,"POLYGON ((-118.51858 34.15858, -118.51858 34.1..."
6037140000.0,6,37,139701,1397.01,Census Tract 1397.01,G5020,S,3396396,2411,34.157429,-118.4954117,"POLYGON ((-118.50980 34.15691, -118.50848 34.1..."


In [11]:
socalGdf = socal.join(tracts[['geometry']], on= 'GEOID')
socalGdf.head()

GEOID,State,County,"Land Value\n(1/4 Acre Lot, Standardized)","Land Value\n(Per Acre, As-Is)",Land Share of Property Value,Lot Size,Interior Square Feet,Property Value (Standardized),Property Value (As-is),geometry
6037101000.0,California,Los Angeles County,325200,1722700,0.568,7100,1400,668000,494500,"POLYGON ((-118.30229 34.25870, -118.30091 34.2..."
6037101000.0,California,Los Angeles County,327100,1233000,0.538,11340,1850,652600,596600,"POLYGON ((-118.30334 34.27371, -118.30330 34.2..."
6037101000.0,California,Los Angeles County,302800,1467400,0.503,7840,1600,660200,525100,"POLYGON ((-118.29945 34.25598, -118.29792 34.2..."
6037101000.0,California,Los Angeles County,325600,1684900,0.554,7170,1400,683800,500800,"POLYGON ((-118.28592 34.25227, -118.28592 34.2..."
6037101000.0,California,Los Angeles County,406500,1826100,0.621,8890,1580,733100,599700,"POLYGON ((-118.27822 34.25068, -118.27822 34.2..."
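
After a join like this, it may be worth confirming that every land value row actually found a tract geometry. The sketch below shows one way to do that with a left merge and pandas' `indicator` flag. The frames `values` and `shapes` are toy stand-ins for `socal` and `tracts`, and the second GEOID, its land value, and the placeholder geometry string are made up.

```python
import pandas as pd

# Toy stand-ins: `values` plays the role of `socal`, `shapes` of `tracts`.
values = pd.DataFrame({
    'GEOID': ['06037101300', '06037999999'],   # second ID is invented
    'land_value': [406500, 100000],
})
shapes = pd.DataFrame({
    'GEOID': ['06037101300'],
    'geometry': ['POLYGON (...)'],              # placeholder, not a real geometry
})

# A left merge keeps every land value row; `_merge` records where each row matched.
merged = values.merge(shapes, on='GEOID', how='left', indicator=True)
unmatched = merged[merged['_merge'] == 'left_only']
print(len(unmatched))
```

Note also that `DataFrame.join` returns a plain pandas DataFrame, so for spatial operations the result would likely need to be wrapped, for example with `gpd.GeoDataFrame(socalGdf, geometry='geometry', crs=tracts.crs)`.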


# Parcel Level Data

Next, look at parcel-level data instead. Data source: [Riverside County GIS data distribution](https://gis.rivco.org/pages/data-distribution), Parcels Attributed, 2021.

In [16]:
import pandas as pd

riverside = pd.read_csv('riv_parcels_values.csv')


In [17]:
riverside.shape

(827053, 35)

In [18]:
riverside.columns

Index(['OID_', 'APN', 'FLAG', 'MAIL_STREET', 'MAIL_CITY', 'SITUS_STREET',
       'SITUS_CITY', 'STREET_NUMBER', 'STREET_PREDIRECTION', 'STREET_NAME',
       'STREET_TYPE', 'STREET_SUFFIX', 'UNIT_NUMBER', 'CITY', 'ZIP_CODE',
       'CLASS_CODE', 'MULTIPLE', 'SUBDIVISION_NAME', 'ACREAGE',
       'RECORDER_MAP_TYPE', 'BOOK', 'PAGE', 'MAP_BOOK_PAGE', 'COUNTY_CODE',
       'LOT_TYPE', 'LOT', 'BLOCK', 'CAME_FROM', 'TAX_RATE_AREA', 'LAND',
       'STRUCTURES', 'PRIMARY_OWNER', 'ALL_OWNER_LIST', 'SHAPE_Length',
       'SHAPE_Area'],
      dtype='object')

In [19]:
desired_cols = ['APN', 'ACREAGE', 'LOT_TYPE', 'CLASS_CODE','LAND','SHAPE_Length', 'SHAPE_Area']
riverside = riverside[desired_cols]
riverside = riverside.dropna()
riverside.head()

Unnamed: 0,APN,ACREAGE,LOT_TYPE,CLASS_CODE,LAND,SHAPE_Length,SHAPE_Area
349,101160001,1.37,Lot,Vacant Land - Predominate Agricultural Use,1612.0,1034.716993,63941.295998
356,101200001,0.26,Lot,HOMESITE/< 1 ACRE,33813.0,482.02856,11167.476229
357,101200010,0.21,Lot,Vacant Land - Predominate Agricultural Use,4751.0,429.474623,9283.691471
358,101200011,0.21,Lot,HOMESITE/< 1 ACRE,7803.0,429.620279,9288.04807
379,101160002,0.87,Lot,Vacant Land - Predominate Agricultural Use,1048.0,796.745006,39647.856447


In [20]:
riverside['APN'].is_unique

True

In [21]:
riverside['LAND'].median()

75738.0

In [22]:
zero = riverside[riverside.LAND==0]
zero

Unnamed: 0,APN,ACREAGE,LOT_TYPE,CLASS_CODE,LAND,SHAPE_Length,SHAPE_Area
477,101460007,5.96,L,Common Area/No Imps,0.0,3566.400118,259451.650854
502,101510035,0.74,L,Common Area/No Imps,0.0,805.957931,32402.130188
524,101460004,10.49,L,Common Area/No Imps,0.0,5845.847237,457551.435433
546,102091030,0.97,L,CT-Golf Course,0.0,843.991375,42098.167793
547,102101047,0.26,P,Vacant Commercial Land,0.0,1827.004432,59174.682883
...,...,...,...,...,...,...,...
824265,290980072,0.21,L,Common Area/No Imps,0.0,418.152843,9073.659732
824518,290980070,0.33,L,Common Area/No Imps,0.0,542.660180,14397.984327
824530,290980071,0.54,L,Common Area/No Imps,0.0,881.103382,23554.920015
824531,290980076,0.02,L,Common Area/No Imps,0.0,192.020207,1030.067370


In [23]:
riverside.shape

(736371, 7)

In [24]:
riverside = riverside[riverside.LAND!=0]
riverside.shape

(732615, 7)

In [25]:
riverside['LAND'].median()

76498.0

In [26]:
riverside.sort_values(by = "LAND", ascending = True, inplace = True)

In [27]:
riverside.sample(20)

Unnamed: 0,APN,ACREAGE,LOT_TYPE,CLASS_CODE,LAND,SHAPE_Length,SHAPE_Area
707681,919462007,0.12,Lot,Single Family Dwelling,70986.0,307.989087,5252.725205
583545,906712020,0.67,Lot,Single Family Dwelling,107219.0,780.405109,28621.810658
202590,694260026,0.4,Lot,Single Family Dwelling,91977.0,545.596322,17223.558289
279081,661410017,0.19,Lot,Single Family Dwelling,69991.0,368.986128,8198.591627
186291,692560024,0.17,Lot,Single Family Dwelling,78920.0,349.541597,7320.748235
289466,646192006,0.18,Lot,Single Family Dwelling,82022.0,357.252148,7875.195133
553814,304510041,0.17,Lot,Single Family Dwelling,65617.0,363.695713,7329.447563
401252,146280042,0.11,Lot,Single Family Dwelling,113582.0,298.171153,4742.368831
323918,504291002,0.46,Lot,Single Family Dwelling,310374.0,554.958668,19121.571243
409840,225072012,0.17,Lot,Single Family Dwelling,50276.0,349.431447,7358.051545


In [29]:
riverside['LAND'].describe()

count    7.326150e+05
mean     1.238838e+05
std      4.486991e+05
min      1.000000e+00
25%      4.707800e+04
50%      7.649800e+04
75%      1.158850e+05
max      7.907040e+07
Name: LAND, dtype: float64

In [30]:
socal['Land Value\n(1/4 Acre Lot, Standardized)'].describe()

count    1.889000e+03
mean     4.002681e+05
std      4.683973e+05
min      1.050000e+04
25%      1.118000e+05
50%      2.742000e+05
75%      4.683000e+05
max      4.592800e+06
Name: Land Value\n(1/4 Acre Lot, Standardized), dtype: float64

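Discuss with group: should we drop some of the lowest values? The parcel-level `LAND` minimum is 1, while the minimum tract-level standardized land value is 10,500. The two measures are not directly comparable, since one is a per-parcel value and the other is standardized to a quarter-acre lot.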
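
To inform that discussion, one option is to count how many parcels would be removed at a few candidate cutoffs, and to scale `LAND` to a quarter-acre lot so it is closer in meaning to the tract-level column. This is only a sketch: the frame `parcels` is a toy stand-in for `riverside`, the cutoffs are arbitrary, and treating `LAND / ACREAGE * 0.25` as comparable to the standardized tract value is an assumption.

```python
import pandas as pd

# Toy stand-in for `riverside`; the rows are invented for illustration.
parcels = pd.DataFrame({
    'LAND': [1.0, 500.0, 47078.0, 76498.0, 115885.0],
    'ACREAGE': [0.10, 0.25, 0.20, 0.17, 0.50],
})

# Land value scaled to a quarter-acre lot (an assumption about comparability).
parcels['land_per_quarter_acre'] = parcels['LAND'] / parcels['ACREAGE'] * 0.25

# Count how many parcels fall below each candidate cutoff.
cutoffs = [100, 1000, 10000]
dropped = {c: int((parcels['LAND'] < c).sum()) for c in cutoffs}
print(dropped)
```

Applied to the real data, the counts per cutoff would show how much of the sample each choice removes before the group settles on a threshold.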

In [36]:
riverside.to_csv('riversideparcels.csv')