## Acquire

Acquire the data from MySQL by connecting and querying with a Python module. You will want to end with a single dataframe. Make sure to include the logerror and all available fields related to the properties. You will end up using all the tables in the database.

* Be sure to do the correct join (inner, outer, etc.). We do not want to eliminate properties purely because they may have a null value for airconditioningtypeid.
* Only include properties with a transaction in 2017, and include only the last transaction for each property (so no duplicate property IDs), along with the Zestimate error and the date of the transaction.
* Only include properties that include a latitude and longitude value.
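
The point about join type in the first bullet can be illustrated on toy data. The sketch below uses two small hypothetical frames (the parcel IDs and descriptions are made up, not taken from the Zillow database) to show that a left merge keeps a property whose airconditioningtypeid is null, while an inner merge drops it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; parcel IDs and values are made up for illustration.
properties = pd.DataFrame({
    "parcelid": [1, 2, 3],
    "airconditioningtypeid": [1.0, np.nan, 13.0],
})
ac_types = pd.DataFrame({
    "airconditioningtypeid": [1.0, 13.0],
    "airconditioningdesc": ["Central", "Yes"],
})

# A left merge keeps every property, filling the description with NaN
# when there is no matching type row.
left = properties.merge(ac_types, on="airconditioningtypeid", how="left")

# An inner merge keeps only properties with a matching type row.
inner = properties.merge(ac_types, on="airconditioningtypeid", how="inner")

print(len(left), len(inner))
```

The same reasoning applies to the SQL `LEFT JOIN` clauses on the lookup tables.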

In [1]:
import numpy as np
import pandas as pd
import env

In [2]:
zillow_query = """
SELECT * FROM properties_2017
LEFT JOIN predictions_2017 ON predictions_2017.parcelid = properties_2017.parcelid
LEFT JOIN airconditioningtype ON airconditioningtype.airconditioningtypeid = properties_2017.airconditioningtypeid
LEFT JOIN architecturalstyletype ON architecturalstyletype.architecturalstyletypeid = properties_2017.architecturalstyletypeid
LEFT JOIN buildingclasstype ON buildingclasstype.buildingclasstypeid = properties_2017.buildingclasstypeid
LEFT JOIN storytype ON storytype.storytypeid = properties_2017.storytypeid
LEFT JOIN typeconstructiontype ON typeconstructiontype.typeconstructiontypeid = properties_2017.typeconstructiontypeid
LEFT JOIN heatingorsystemtype ON heatingorsystemtype.heatingorsystemtypeid = properties_2017.heatingorsystemtypeid
LEFT JOIN propertylandusetype ON propertylandusetype.propertylandusetypeid = properties_2017.propertylandusetypeid
WHERE predictions_2017.transactiondate >= '2017-01-01'
  AND predictions_2017.transactiondate <= '2017-12-31'
  AND properties_2017.latitude IS NOT NULL
  AND properties_2017.longitude IS NOT NULL;
"""

In [3]:
zillow_url = f'mysql+pymysql://{env.user}:{env.password}@{env.host}/zillow'
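
If the password contains characters such as `@` or `/`, a URL built with a plain f-string may not parse correctly. One possible safeguard, sketched here, is to percent-encode the credentials first. The helper name `build_mysql_url` and the example credentials are hypothetical and are not part of the original notebook or of `env`.

```python
from urllib.parse import quote_plus


def build_mysql_url(user, password, host, database):
    """Return a SQLAlchemy-style MySQL URL with percent-encoded credentials.

    Hypothetical helper for illustration only.
    """
    return (
        f"mysql+pymysql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}/{database}"
    )


# Made-up credentials; in the notebook these would come from env.
example_url = build_mysql_url("some_user", "p@ss/word", "db.example.com", "zillow")
print(example_url)
```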

In [4]:
zillow = pd.read_sql(zillow_query, zillow_url)
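
Because the query uses `SELECT *` across tables that share key columns, the resulting dataframe may contain repeated column names (for example, `parcelid` from both `properties_2017` and `predictions_2017`). A minimal sketch of one way to keep only the first occurrence of each name, shown on a made-up one-row frame rather than the real query result:

```python
import pandas as pd

# Toy frame with a repeated column name, as a SELECT * join can produce.
joined = pd.DataFrame(
    [[1, 261, 261, "Single Family Residential"]],
    columns=[
        "parcelid",
        "propertylandusetypeid",
        "propertylandusetypeid",
        "propertylandusedesc",
    ],
)

# columns.duplicated() marks the second and later occurrences of a name;
# negating the mask keeps only the first occurrence of each.
deduped = joined.loc[:, ~joined.columns.duplicated()]
print(list(deduped.columns))
```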

In [5]:
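# Duplicate parcelids still need to be dropped.
# Sort the dataframe by transaction date (ascending) so that drop_duplicates with keep='last' retains the latest transaction.
# sort_values returns a new object, so assign the result back; kind='stable' preserves the existing order of ties.
zillow = zillow.sort_values('transactiondate', kind='stable')
zillow.transactiondate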

0        2017-01-01
1        2017-01-01
2        2017-01-01
3        2017-01-01
4        2017-01-01
            ...    
77574    2017-09-20
77575    2017-09-20
77576    2017-09-21
77577    2017-09-21
77578    2017-09-25
Name: transactiondate, Length: 77579, dtype: object

In [6]:
duplicate_parcels = zillow[zillow.duplicated(subset = 'parcelid')].copy()
duplicate_parcels

Unnamed: 0,id,parcelid,airconditioningtypeid,architecturalstyletypeid,basementsqft,bathroomcnt,bedroomcnt,buildingclasstypeid,buildingqualitytypeid,calculatedbathnbr,...,buildingclasstypeid.1,buildingclassdesc,storytypeid,storydesc,typeconstructiontypeid,typeconstructiondesc,heatingorsystemtypeid,heatingorsystemdesc,propertylandusetypeid,propertylandusedesc
117,2463969,11393337,,,,3.0,3.0,,4.0,3.0,...,,,,,,,,,247,"Triplex (3 Units, Any Combination)"
624,2026522,14634203,1.0,,,2.0,3.0,,,2.0,...,,,,,,,24.0,Yes,266,Condominium
1017,616260,11721753,,,,2.0,3.0,,6.0,2.0,...,,,,,,,7.0,Floor/Wall,261,Single Family Residential
1246,2061546,11289917,1.0,,,2.0,3.0,,6.0,2.0,...,,,,,,,2.0,Central,261,Single Family Residential
1732,2554497,11637029,1.0,,,2.0,3.0,,9.0,2.0,...,,,,,,,2.0,Central,266,Condominium
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59986,2008746,13066981,1.0,,,2.0,4.0,,8.0,2.0,...,,,,,,,2.0,Central,261,Single Family Residential
62214,492024,10852812,,,,7.0,11.0,,8.0,7.0,...,,,,,,,2.0,Central,260,Residential General
63107,2407178,12136147,,,,2.0,3.0,,5.0,2.0,...,,,,,,,,,246,"Duplex (2 Units, Any Combination)"
64253,2938730,17282392,,,,2.0,3.0,,,2.0,...,,,,,,,,,261,Single Family Residential
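
Note that `duplicated` with its default `keep='first'` flags only the second and later rows for each parcelid. To inspect every transaction of a repeated parcel side by side, `keep=False` can be used instead. The sketch below demonstrates the difference on a small hypothetical frame (values are made up):

```python
import pandas as pd

# Hypothetical transactions; parcel 10 sold twice.
sales = pd.DataFrame({
    "parcelid": [10, 11, 10, 12],
    "transactiondate": ["2017-01-05", "2017-02-01", "2017-06-30", "2017-03-15"],
    "logerror": [0.01, -0.02, 0.03, 0.00],
})

# keep=False flags every row whose parcelid occurs more than once.
all_dupes = sales[sales.duplicated(subset="parcelid", keep=False)]

# The default (keep='first') flags only the later occurrences.
later_only = sales[sales.duplicated(subset="parcelid")]

print(all_dupes.sort_values(["parcelid", "transactiondate"]))
```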


In [7]:
zillow.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77579 entries, 0 to 77578
Data columns (total 77 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   id                            77579 non-null  int64  
 1   parcelid                      77579 non-null  int64  
 2   airconditioningtypeid         25007 non-null  float64
 3   architecturalstyletypeid      207 non-null    float64
 4   basementsqft                  50 non-null     float64
 5   bathroomcnt                   77579 non-null  float64
 6   bedroomcnt                    77579 non-null  float64
 7   buildingclasstypeid           15 non-null     float64
 8   buildingqualitytypeid         49809 non-null  float64
 9   calculatedbathnbr             76963 non-null  float64
 10  decktypeid                    614 non-null    float64
 11  finishedfloor1squarefeet      6037 non-null   float64
 12  calculatedfinishedsquarefeet  77378 non-null  float64
 13  f

In [8]:
# Now drop duplicates, keeping the last occurrence. With rows in ascending transactiondate order, this keeps the latest transaction for each parcel.
zillow = zillow.drop_duplicates(subset = 'parcelid', keep = 'last')
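
Keeping the last occurrence only yields the latest transaction if the rows are ordered by date. A minimal sketch of the full pattern on hypothetical data (deliberately out of order, values made up), with a uniqueness check at the end:

```python
import pandas as pd

# Hypothetical transactions, deliberately not in date order.
sales = pd.DataFrame({
    "parcelid": [10, 11, 10, 12],
    "transactiondate": ["2017-06-30", "2017-02-01", "2017-01-05", "2017-03-15"],
    "logerror": [0.03, -0.02, 0.01, 0.00],
})
sales["transactiondate"] = pd.to_datetime(sales["transactiondate"])

# Sort ascending by date, then keep the last row for each parcelid.
latest = (
    sales.sort_values("transactiondate")
         .drop_duplicates(subset="parcelid", keep="last")
)

# Sanity check: one row per parcel remains.
assert latest["parcelid"].is_unique
print(latest)
```

Without the sort, `keep='last'` on this toy frame would retain the January row for parcel 10 rather than the June one.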

In [9]:
zillow.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 77381 entries, 0 to 77578
Data columns (total 77 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   id                            77381 non-null  int64  
 1   parcelid                      77381 non-null  int64  
 2   airconditioningtypeid         24953 non-null  float64
 3   architecturalstyletypeid      206 non-null    float64
 4   basementsqft                  50 non-null     float64
 5   bathroomcnt                   77381 non-null  float64
 6   bedroomcnt                    77381 non-null  float64
 7   buildingclasstypeid           15 non-null     float64
 8   buildingqualitytypeid         49672 non-null  float64
 9   calculatedbathnbr             76772 non-null  float64
 10  decktypeid                    614 non-null    float64
 11  finishedfloor1squarefeet      6023 non-null   float64
 12  calculatedfinishedsquarefeet  77185 non-null  float64
 13  f