# Data Download and Investigation

In this notebook, I load in the `Real Property Sales` data and investigate the columns to get a feel for the data types, NA values and how the data is generally structured.

### Bash commands to get data:

#### Real Property Sales:

In [107]:
# Get real property sales zip file:
! wget -P ../data/raw https://aqua.kingcounty.gov/extranet/assessor/Real%20Property%20Sales.zip

--2020-06-14 15:25:26--  https://aqua.kingcounty.gov/extranet/assessor/Real%20Property%20Sales.zip
Resolving aqua.kingcounty.gov (aqua.kingcounty.gov)... 146.129.240.28
Connecting to aqua.kingcounty.gov (aqua.kingcounty.gov)|146.129.240.28|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 128312536 (122M) [application/x-zip-compressed]
Saving to: ‘../data/raw/Real Property Sales.zip.1’



In [108]:
# unzip real property sales zip file:
! unzip ../data/raw/Real\ Property\ Sales.zip -d ../data

Archive:  ../data/raw/Real Property Sales.zip
replace ../data/EXTR_RPSale.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: ^C


#### Parcel:

In [None]:
# Get parcel zip file:
! wget -P ../data/raw https://aqua.kingcounty.gov/extranet/assessor/Parcel.zip

--2020-06-14 15:25:53--  https://aqua.kingcounty.gov/extranet/assessor/Parcel.zip
Resolving aqua.kingcounty.gov (aqua.kingcounty.gov)... 146.129.240.28
Connecting to aqua.kingcounty.gov (aqua.kingcounty.gov)|146.129.240.28|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 30467367 (29M) [application/x-zip-compressed]
Saving to: ‘../data/raw/Parcel.zip.1’

Parcel.zip.1          7%[>                   ]   2.29M  3.76MB/s               ^C


In [None]:
# unzip parcel zip file:
! unzip ../data/raw/Parcel.zip -d ../data

#### Residential Building:

In [None]:
# Get residential building zip file:
! wget -P ../data/raw https://aqua.kingcounty.gov/extranet/assessor/Residential%20Building.zip

In [None]:
# unzip residential building zip file:
! unzip ../data/raw/Residential\ Building.zip -d ../data
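
The `unzip` call for the Real Property Sales archive stopped at an interactive `replace ...?` prompt because the CSV already existed. A minimal sketch of a non-interactive alternative using Python's standard `zipfile` module follows. The helper name `extract_zip` and its `overwrite` flag are hypothetical, not part of the original workflow.

```python
import zipfile
from pathlib import Path


def extract_zip(zip_path, dest_dir, overwrite=False):
    """Extract members of zip_path into dest_dir without prompting.

    Existing files are skipped unless overwrite=True.
    Returns the list of member names that were written.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            target = dest / name
            if target.exists() and not overwrite:
                continue  # keep the copy that is already on disk
            zf.extract(name, dest)
            written.append(name)
    return written


# Example call (paths as used in this notebook):
# extract_zip("../data/raw/Real Property Sales.zip", "../data")
```

Skipping existing files by default means re-running the notebook does not hang on a prompt or silently replace data that was already extracted.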

### Create Pandas DataFrame: Real Property Sales data

In [1]:
# test that the downloaded file loads:
import pandas as pd
import numpy as np

# create the real property sales df:
rps = pd.read_csv("../data/EXTR_RPSale.csv")

  interactivity=interactivity, compiler=compiler, result=result)


In [2]:
rps.head()

Unnamed: 0,ExciseTaxNbr,Major,Minor,DocumentDate,SalePrice,RecordingNbr,Volume,Page,PlatNbr,PlatType,...,PropertyType,PrincipalUse,SaleInstrument,AFForestLand,AFCurrentUseLand,AFNonProfitUse,AFHistoricProperty,SaleReason,PropertyClass,SaleWarning
0,2687551,138860,110,08/21/2014,245000,20140828001436,,,,,...,3,6,3,N,N,N,N,1,8,
1,1235111,664885,40,07/09/1991,0,199203161090,71.0,1.0,664885.0,C,...,3,0,26,N,N,N,N,18,3,11
2,2704079,423943,50,10/11/2014,0,20141205000558,,,,,...,3,6,15,N,N,N,N,18,8,18 31 51
3,2584094,403700,715,01/04/2013,0,20130110000910,,,,,...,3,6,15,N,N,N,N,11,8,18 31 38
4,3027422,213043,120,12/20/2019,560000,20191226000848,,,,,...,11,6,3,N,N,N,N,1,8,


In [3]:
rps.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2089099 entries, 0 to 2089098
Data columns (total 24 columns):
ExciseTaxNbr          int64
Major                 object
Minor                 object
DocumentDate          object
SalePrice             int64
RecordingNbr          object
Volume                object
Page                  object
PlatNbr               object
PlatType              object
PlatLot               object
PlatBlock             object
SellerName            object
BuyerName             object
PropertyType          int64
PrincipalUse          int64
SaleInstrument        int64
AFForestLand          object
AFCurrentUseLand      object
AFNonProfitUse        object
AFHistoricProperty    object
SaleReason            int64
PropertyClass         int64
dtypes: int64(7), object(17)
memory usage: 382.5+ MB


In [4]:
# check na's:
rps.isna().sum()

ExciseTaxNbr          0
Major                 0
Minor                 0
DocumentDate          0
SalePrice             0
RecordingNbr          0
Volume                0
Page                  0
PlatNbr               0
PlatType              0
PlatLot               0
PlatBlock             0
SellerName            0
BuyerName             0
PropertyType          0
PrincipalUse          0
SaleInstrument        0
AFForestLand          0
AFCurrentUseLand      0
AFNonProfitUse        0
AFHistoricProperty    0
SaleReason            0
PropertyClass         0
dtype: int64

`isna()` reports no null values in any column, yet the preview shows blank entries, so the blanks must be stored as something other than NaN...
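
One way to check for blanks that `isna()` does not catch is to count whitespace-only strings directly. The sketch below uses a small toy frame (hypothetical values, with column names borrowed from the preview) rather than the full `rps` data.

```python
import pandas as pd

# Toy frame mimicking the preview: blanks are whitespace strings, not NaN.
df = pd.DataFrame({
    "SalePrice": [245000, 0, 560000],
    "Volume": ["   ", "71.0", "   "],
    "PlatType": [" ", "C", " "],
})

# isna() finds nothing missing because "   " is a valid (non-null) string.
na_counts = df.isna().sum()

# Count whitespace-only entries in the text columns instead.
text_cols = ["Volume", "PlatType"]
blank_counts = df[text_cols].apply(lambda s: s.str.strip().eq("").sum())

print(na_counts)
print(blank_counts)
```

A column whose blank count equals the number of rows carries no information and is a candidate for dropping.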

`SalePrice` is an `int64` (as expected), but `DocumentDate` is `object`, so it needs to be converted to a datetime type.

In [5]:
rps['DocumentDate'] = pd.to_datetime(rps['DocumentDate'])

In [6]:
rps.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2089099 entries, 0 to 2089098
Data columns (total 24 columns):
ExciseTaxNbr          int64
Major                 object
Minor                 object
DocumentDate          datetime64[ns]
SalePrice             int64
RecordingNbr          object
Volume                object
Page                  object
PlatNbr               object
PlatType              object
PlatLot               object
PlatBlock             object
SellerName            object
BuyerName             object
PropertyType          int64
PrincipalUse          int64
SaleInstrument        int64
AFForestLand          object
AFCurrentUseLand      object
AFNonProfitUse        object
AFHistoricProperty    object
SaleReason            int64
PropertyClass         int64
dtypes: datetime64[ns](1), int64(7), object(16)
memory usage: 382.5+ MB


Now `DocumentDate` has the correct dtype.
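
The preview shows dates in month/day/year order (for example `08/21/2014`). Assuming every row follows that pattern, passing an explicit `format` to `pd.to_datetime` avoids format inference across two million rows. This is a sketch on a few sample strings, not a claim about the whole column.

```python
import pandas as pd

# Sample strings copied from the DocumentDate preview.
dates = pd.Series(["08/21/2014", "07/09/1991", "12/20/2019"])

# An explicit format skips format inference; errors="coerce" turns anything
# that does not match into NaT instead of raising.
parsed = pd.to_datetime(dates, format="%m/%d/%Y", errors="coerce")

print(parsed.dt.year.tolist())
```

With `errors="coerce"`, checking `parsed.isna().sum()` afterwards shows how many rows did not match the assumed format.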

In [7]:
rps['DocumentDate'][0]

Timestamp('2014-08-21 00:00:00')

In [8]:
# create year column:
rps['year'] = rps['DocumentDate'].dt.year

rps.head()

Unnamed: 0,ExciseTaxNbr,Major,Minor,DocumentDate,SalePrice,RecordingNbr,Volume,Page,PlatNbr,PlatType,...,PrincipalUse,SaleInstrument,AFForestLand,AFCurrentUseLand,AFNonProfitUse,AFHistoricProperty,SaleReason,PropertyClass,SaleWarning,year
0,2687551,138860,110,2014-08-21,245000,20140828001436,,,,,...,6,3,N,N,N,N,1,8,,2014
1,1235111,664885,40,1991-07-09,0,199203161090,71.0,1.0,664885.0,C,...,0,26,N,N,N,N,18,3,11,1991
2,2704079,423943,50,2014-10-11,0,20141205000558,,,,,...,6,15,N,N,N,N,18,8,18 31 51,2014
3,2584094,403700,715,2013-01-04,0,20130110000910,,,,,...,6,15,N,N,N,N,11,8,18 31 38,2013
4,3027422,213043,120,2019-12-20,560000,20191226000848,,,,,...,6,3,N,N,N,N,1,8,,2019


In [9]:
# test if this gets all rows from 2019
rps[rps['year'] == 2019]

Unnamed: 0,ExciseTaxNbr,Major,Minor,DocumentDate,SalePrice,RecordingNbr,Volume,Page,PlatNbr,PlatType,...,PrincipalUse,SaleInstrument,AFForestLand,AFCurrentUseLand,AFNonProfitUse,AFHistoricProperty,SaleReason,PropertyClass,SaleWarning,year
4,3027422,213043,120,2019-12-20,560000,20191226000848,,,,,...,6,3,N,N,N,N,1,8,,2019
118,2999169,919715,200,2019-07-08,192000,20190712001080,,,,,...,2,3,N,N,N,N,1,3,,2019
144,3000673,894444,200,2019-06-26,185000,20190722001395,,,,,...,2,3,N,N,N,N,1,3,,2019
164,3002257,940652,630,2019-07-22,435000,20190730001339,,,,,...,6,3,N,N,N,N,1,8,,2019
445,2980836,937630,695,2019-03-28,550000,20190404001008,,,,,...,6,3,N,N,N,N,1,8,,2019
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2089056,3004408,66000,2210,2019-08-07,41040000,20190812000849,,,,,...,7,22,N,N,N,N,18,2,45,2019
2089057,3004408,66000,2225,2019-08-07,41040000,20190812000849,,,,,...,7,22,N,N,N,N,18,2,45,2019
2089058,3004408,66000,2195,2019-08-07,41040000,20190812000849,,,,,...,7,22,N,N,N,N,18,2,45,2019
2089059,3004408,66000,2220,2019-08-07,41040000,20190812000849,,,,,...,7,22,N,N,N,N,18,2,45,2019


In [10]:
# create data frame with just data from 2019
rps_2019 = rps[rps['year'] == 2019]
rps_2019.head()

Unnamed: 0,ExciseTaxNbr,Major,Minor,DocumentDate,SalePrice,RecordingNbr,Volume,Page,PlatNbr,PlatType,...,PrincipalUse,SaleInstrument,AFForestLand,AFCurrentUseLand,AFNonProfitUse,AFHistoricProperty,SaleReason,PropertyClass,SaleWarning,year
4,3027422,213043,120,2019-12-20,560000,20191226000848,,,,,...,6,3,N,N,N,N,1,8,,2019
118,2999169,919715,200,2019-07-08,192000,20190712001080,,,,,...,2,3,N,N,N,N,1,3,,2019
144,3000673,894444,200,2019-06-26,185000,20190722001395,,,,,...,2,3,N,N,N,N,1,3,,2019
164,3002257,940652,630,2019-07-22,435000,20190730001339,,,,,...,6,3,N,N,N,N,1,8,,2019
445,2980836,937630,695,2019-03-28,550000,20190404001008,,,,,...,6,3,N,N,N,N,1,8,,2019


In [11]:
# check that only year is 2019:
rps_2019['year'].unique()

array([2019])

In [12]:
rps_2019.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 61363 entries, 4 to 2089082
Data columns (total 25 columns):
ExciseTaxNbr          61363 non-null int64
Major                 61363 non-null object
Minor                 61363 non-null object
DocumentDate          61363 non-null datetime64[ns]
SalePrice             61363 non-null int64
RecordingNbr          61363 non-null object
Volume                61363 non-null object
Page                  61363 non-null object
PlatNbr               61363 non-null object
PlatType              61363 non-null object
PlatLot               61363 non-null object
PlatBlock             61363 non-null object
SellerName            61363 non-null object
BuyerName             61363 non-null object
PropertyType          61363 non-null int64
PrincipalUse          61363 non-null int64
SaleInstrument        61363 non-null int64
AFForestLand          61363 non-null object
AFCurrentUseLand      61363 non-null object
AFNonProfitUse        61363 non-null object
AFHist

As you can see, there are no null rows for any of the columns. However, when we preview the head of the data we can see blank values for columns like `Volume`, `Page` and `PlatNbr`. They're listed as `object`, so perhaps they are all just empty strings?

We can also see from the preview that the columns `AFForestLand`, `AFCurrentUseLand`, `AFNonProfitUse`, and `AFHistoricProperty` all have 'N's as entries and have no lookup codes available. Let's preview these columns more closely...

In [13]:
columns = list(rps_2019.columns)

In [14]:
# don't need to see unique entries of dates...
columns.remove('DocumentDate')

In [15]:
# Find unique values for columns:
for col in columns:
    print(f"Unique entries for {col}: {rps_2019[col].unique()}")

Unique entries for ExciseTaxNbr: [3027422 2999169 3000673 ... 2998779 3004408 3009227]
Unique entries for Major: [213043 919715 894444 ... 135400 156270 638657]
Unique entries for Minor: [120 200 630 ... 2561 9437 4239]
Unique entries for SalePrice: [  560000   192000   185000 ...   632075   700988 41040000]
Unique entries for RecordingNbr: ['20191226000848' '20190712001080' '20190722001395' ... '20190711000444'
 '20190812000849' '20190909000838']
Unique entries for Volume: ['   ']
Unique entries for Page: ['   ']
Unique entries for PlatNbr: ['      ']
Unique entries for PlatType: [' ']
Unique entries for PlatLot: ['              ']
Unique entries for PlatBlock: ['       ']
Unique entries for SellerName: ['DOYLE REGAN M+STERLING C                          '
 'WAGNERESTATES LLC                                 '
 'MAY THOMAS A+SHIRLEY E                            ' ...
 'KURUP REVATHY                                     '
 '1901 MINOR LLC                                    '
 'LEONG JOHN

So we can see that `Volume`, `Page`, `PlatNbr`, `PlatType`, `PlatLot`, and `PlatBlock` contain only white space strings in the 2019 data. These columns can thus be dropped from the df without loss of data.

Columns `AFForestLand`, `AFCurrentUseLand`, `AFNonProfitUse`, and `AFHistoricProperty` appear to be binary yes/no flags. If they are needed as model features or for filtering the dataframe, it would be best to convert them to boolean types; if they aren't needed for the analysis, they could be dropped.
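
A sketch of the boolean conversion on a toy column is shown below. The `'Y'` value and the padded `'N '` entry are assumptions for illustration, since the previews only show `'N'`. It also shows why a plain `astype(bool)` is not suitable here.

```python
import pandas as pd

# Toy column; 'Y' and the padded 'N ' are assumed values for illustration.
flags = pd.Series(["N", "Y", "N ", "N"], dtype=object)

# Pitfall: every non-empty string is truthy, so 'N' becomes True as well.
naive = flags.astype(bool)

# Explicit mapping after stripping whitespace; unexpected values become NaN
# so they can be inspected rather than silently converted.
mapped = flags.str.strip().map({"Y": True, "N": False})

print(naive.tolist())
print(mapped.tolist())
```

Mapping through a dictionary keeps any unforeseen code visible as a missing value, which is safer than a comparison that would quietly treat it as `False`.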

`SellerName` and `BuyerName` also have a lot of trailing white space in their entries. This could be cleaned up, but the seller/buyer names are probably not relevant to our analysis, so we could drop these columns instead.

`year` only has 2019 as we saw before so we don't need to keep track of this (this information is kept in the `DocumentDate` column anyway).

`SaleWarning` contains many different combinations of numbers in string format. There are also white space entries, which would signify that there was no sale warning. It may be worth reformatting this column into something more helpful/descriptive. If sale warnings are necessary for analysis, one option is to one-hot encode the different types of sale warnings. We won't remove this column, but we also won't format it until our analysis needs are determined.
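
If one-hot encoding of `SaleWarning` turns out to be needed, `Series.str.get_dummies` can split the space-separated codes into indicator columns. The sketch below uses sample values modelled on the preview and assumes the codes are always separated by spaces.

```python
import pandas as pd

# Sample values modelled on the SaleWarning preview; blanks mean "no warning".
sale_warning = pd.Series([" ", "11", "18 31 51", "18 31 38"], dtype=object)

# Split each string on spaces and build one indicator column per code.
# Empty tokens are ignored, so blank rows end up with all zeros.
dummies = sale_warning.str.strip().str.get_dummies(sep=" ")

print(dummies)
```

The resulting column names are the warning codes as strings, so they can be joined back onto the sales frame by index if required.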

### Drop irrelevant columns:

In [16]:
# list of cols to initially drop:
cols_to_drop = ['Volume', 'Page', 'PlatNbr', 'PlatType', 
                'PlatLot', 'PlatBlock', 'SellerName', 'BuyerName', 'year']
rps_2019.drop(columns = cols_to_drop, inplace = True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,
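
The warning above appears because `rps_2019` was created by filtering `rps`, and pandas cannot tell whether an in-place change is meant for the subset or the original. A minimal sketch of one common way to avoid it, taking an explicit `.copy()` when filtering, is shown on a toy frame (`rps_demo` and `subset` are hypothetical names):

```python
import pandas as pd

# Toy stand-in for rps with a few of the real column names.
rps_demo = pd.DataFrame({
    "DocumentDate": pd.to_datetime(["2014-08-21", "2019-12-20", "2019-07-08"]),
    "SalePrice": [245000, 560000, 192000],
    "Volume": ["   ", "   ", "   "],
})

# .copy() makes the subset an independent DataFrame, so later changes
# no longer target a slice of the original.
mask = rps_demo["DocumentDate"].dt.year == 2019
subset = rps_demo.loc[mask].copy()

# Reassigning the result of drop() avoids inplace=True altogether.
subset = subset.drop(columns=["Volume"])

print(subset.shape)
```

The original frame keeps all of its columns, and the subset can be modified freely.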


In [17]:
rps_2019.head()

Unnamed: 0,ExciseTaxNbr,Major,Minor,DocumentDate,SalePrice,RecordingNbr,PropertyType,PrincipalUse,SaleInstrument,AFForestLand,AFCurrentUseLand,AFNonProfitUse,AFHistoricProperty,SaleReason,PropertyClass,SaleWarning
4,3027422,213043,120,2019-12-20,560000,20191226000848,11,6,3,N,N,N,N,1,8,
118,2999169,919715,200,2019-07-08,192000,20190712001080,3,2,3,N,N,N,N,1,3,
144,3000673,894444,200,2019-06-26,185000,20190722001395,3,2,3,N,N,N,N,1,3,
164,3002257,940652,630,2019-07-22,435000,20190730001339,11,6,3,N,N,N,N,1,8,
445,2980836,937630,695,2019-03-28,550000,20190404001008,3,6,3,N,N,N,N,1,8,


In [18]:
rps_2019.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 61363 entries, 4 to 2089082
Data columns (total 16 columns):
ExciseTaxNbr          61363 non-null int64
Major                 61363 non-null object
Minor                 61363 non-null object
DocumentDate          61363 non-null datetime64[ns]
SalePrice             61363 non-null int64
RecordingNbr          61363 non-null object
PropertyType          61363 non-null int64
PrincipalUse          61363 non-null int64
SaleInstrument        61363 non-null int64
AFForestLand          61363 non-null object
AFCurrentUseLand      61363 non-null object
AFNonProfitUse        61363 non-null object
AFHistoricProperty    61363 non-null object
SaleReason            61363 non-null int64
PropertyClass         61363 non-null int64
dtypes: datetime64[ns](1), int64(7), object(8)
memory usage: 8.0+ MB


In [19]:
# create a list of columns that I want to be bools:
cols_to_bool = ['AFForestLand', 'AFCurrentUseLand', 'AFNonProfitUse', 'AFHistoricProperty']
# work on an independent copy so the assignments below don't target a slice of rps:
rps_2019 = rps_2019.copy()
for col in cols_to_bool:
    # assumes the only values are 'Y'/'N' (possibly padded with spaces);
    # .astype('bool') is not used because 'N' is a non-empty string and would become True
    rps_2019[col] = rps_2019[col].str.strip() == 'Y'