>#### Imports and Calling in Data:

In [1]:
#imports for calling in data
import wrangle
df=wrangle.get_zillow_data()

#needed imports for general notebook needs
import pandas as pd
pd.options.display.max_rows = 100
import matplotlib.pyplot as plt
import seaborn as sns



# Acquisition Phase:

>### Summary of the Raw Acquired Data:

In [2]:
#using a summary df built in wrangle.py
wrangle.df_summary(df)

---Shape: (52319, 68)

---Info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52319 entries, 0 to 52318
Data columns (total 68 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   id                            52319 non-null  int64  
 1   parcelid                      52319 non-null  int64  
 2   airconditioningtypeid         13615 non-null  float64
 3   architecturalstyletypeid      70 non-null     float64
 4   basementsqft                  47 non-null     float64
 5   bathroomcnt                   52319 non-null  float64
 6   bedroomcnt                    52319 non-null  float64
 7   buildingclasstypeid           0 non-null      float64
 8   buildingqualitytypeid         33654 non-null  float64
 9   calculatedbathnbr             52184 non-null  float64
 10  decktypeid                    389 non-null    float64
 11  finishedfloor1squarefeet      4371 non-null   float64
 12  calculatedfinishedsquarefeet 

# Preparation Phase:

>### Handling the nulls and Missing Values & dtypes

Notes:
    <br>- I first handled missing values by computing the count and percentage missing for each feature. Applying a 60% threshold to both columns and rows (`prop_required_column=.6`, `prop_required_row=.6`) removed the sparsest data without reducing my row count by much, and left me with some good features that I think will help the clustering model.
    <br>- I then converted all dtypes to int for later scaling/modeling.
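
The steps in these notes can be sketched roughly as follows. This is a minimal illustration on toy data, not the actual contents of `wrangle.data_prep`, which are not shown in this notebook. The helper names `missing_summary` and `drop_sparse` are hypothetical, and the sketch assumes each proportion is the minimum share of non-null values needed to keep a column or row.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values for each column."""
    n_missing = df.isnull().sum()
    return pd.DataFrame({
        "num_missing": n_missing,
        "pct_missing": (n_missing / len(df) * 100).round(2),
    })


def drop_sparse(df, prop_required_column=0.6, prop_required_row=0.6):
    """Keep columns, then rows, that have enough non-null values.

    Assumption: each proportion is the minimum non-null share required.
    """
    col_thresh = int(round(prop_required_column * len(df)))
    df = df.dropna(axis=1, thresh=col_thresh)
    row_thresh = int(round(prop_required_row * df.shape[1]))
    return df.dropna(axis=0, thresh=row_thresh)


# Toy data standing in for the Zillow frame
toy = pd.DataFrame({
    "bedroomcnt": [3, 2, 4, 3, 2],
    "bathroomcnt": [2.0, 1.0, np.nan, 2.0, 1.0],
    "basementsqft": [np.nan, np.nan, np.nan, np.nan, 500.0],
})

summary = missing_summary(toy)   # basementsqft is 80% missing
cleaned = drop_sparse(toy)       # drops the sparse basementsqft column
final = cleaned.dropna().astype(int)  # drop leftover nulls, then cast to int
```

Dropping by threshold first and calling `dropna()` afterwards keeps sparse columns from wiping out otherwise complete rows.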

In [3]:
#bringing in my function that 1) handles missing values by percentage, 2) drops unneeded feature columns,
#and 3) drops the small number of rows that still contain nulls
df = wrangle.data_prep(df, cols_to_remove=['propertylandusetypeid','transactiondate','propertycountylandusecode','propertylandusedesc','propertyzoningdesc','buildingqualitytypeid','heatingorsystemtypeid','unitcnt','heatingorsystemdesc','calculatedbathnbr','id','finishedsquarefeet12','fullbathcnt','structuretaxvaluedollarcnt','landtaxvaluedollarcnt','taxamount','regionidcity','censustractandblock'], prop_required_column=.6, prop_required_row=.6)

In [4]:
#just to show what columns were left and that nulls are now cleaned
df.isnull().sum()

parcelid                        0
bathroomcnt                     0
bedroomcnt                      0
calculatedfinishedsquarefeet    0
fips                            0
latitude                        0
longitude                       0
lotsizesquarefeet               0
rawcensustractandblock          0
regionidcounty                  0
regionidzip                     0
roomcnt                         0
yearbuilt                       0
taxvaluedollarcnt               0
assessmentyear                  0
logerror                        0
dtype: int64

## Split the Data:

In [5]:
#using an 80/20 split for train/test, then a 70/30 split for train/validate, stratifying on fips
#so each split gets an even mix of the three counties
train, validate, test = wrangle.split_data(df)
train.head()

Unnamed: 0,parcelid,bathroomcnt,bedroomcnt,calculatedfinishedsquarefeet,fips,latitude,longitude,lotsizesquarefeet,rawcensustractandblock,regionidcounty,regionidzip,roomcnt,yearbuilt,taxvaluedollarcnt,assessmentyear,logerror
34222,11533536,1,2,864,6037,33986536,-118424309,5002,60372753,3101,96047,0,1950,787004,2016,0
21551,11058331,2,3,1243,6037,34270033,-118492886,7795,60371111,3101,96370,0,1953,174930,2016,0
39107,14003091,2,4,1157,6059,33794310,-117985512,7350,60590879,1286,97023,7,1954,274539,2016,0
9029,14208016,2,3,1491,6059,33861905,-117769542,6000,60590219,1286,97027,6,1975,523394,2016,0
2982,10844442,2,4,1843,6037,34173761,-118454651,7202,60371284,3101,96420,0,1949,113758,2016,0
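
For reference, a split like the one described above might look like the sketch below. It assumes scikit-learn's `train_test_split` is used. The function name `split_stratified` and the seed value are hypothetical, since the body of `wrangle.split_data` is not shown here.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_stratified(df, stratify_col="fips", seed=123):
    """80/20 train+validate vs. test, then 70/30 train vs. validate."""
    # First cut: hold out 20% as the test set, stratified by county code
    train_validate, test = train_test_split(
        df, test_size=0.2, random_state=seed, stratify=df[stratify_col]
    )
    # Second cut: 30% of the remainder becomes the validate set
    train, validate = train_test_split(
        train_validate,
        test_size=0.3,
        random_state=seed,
        stratify=train_validate[stratify_col],
    )
    return train, validate, test


# Toy frame with an uneven mix of the three county codes
toy = pd.DataFrame({
    "fips": [6037] * 60 + [6059] * 30 + [6111] * 10,
    "logerror": range(100),
})
train, validate, test = split_stratified(toy)
print(len(train), len(validate), len(test))
```

Stratifying both cuts keeps the county proportions close to the full dataset in each split, which matters here because one county dominates the data.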


## Encoding Fips:
I want to encode this column as a quick 0/1 flag that I can use in clustering and modeling.

In [7]:
train = wrangle.one_hot_encode(train)
train.head()

Unnamed: 0,parcelid,bathroomcnt,bedroomcnt,calculatedfinishedsquarefeet,fips,latitude,longitude,lotsizesquarefeet,rawcensustractandblock,regionidcounty,regionidzip,roomcnt,yearbuilt,taxvaluedollarcnt,assessmentyear,logerror,is_Los_Angeles
34222,11533536,1,2,864,6037,33986536,-118424309,5002,60372753,3101,96047,0,1950,787004,2016,0,1
21551,11058331,2,3,1243,6037,34270033,-118492886,7795,60371111,3101,96370,0,1953,174930,2016,0,1
39107,14003091,2,4,1157,6059,33794310,-117985512,7350,60590879,1286,97023,7,1954,274539,2016,0,0
9029,14208016,2,3,1491,6059,33861905,-117769542,6000,60590219,1286,97027,6,1975,523394,2016,0,0
2982,10844442,2,4,1843,6037,34173761,-118454651,7202,60371284,3101,96420,0,1949,113758,2016,0,1
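
The `is_Los_Angeles` flag shown above could be produced along these lines. This is a sketch only: the function name `encode_fips` is hypothetical, and `wrangle.one_hot_encode` may be implemented differently. FIPS code 6037 is Los Angeles County.

```python
import pandas as pd

LOS_ANGELES_FIPS = 6037  # FIPS code for Los Angeles County, CA


def encode_fips(df: pd.DataFrame) -> pd.DataFrame:
    """Add a 0/1 column flagging rows located in Los Angeles County."""
    out = df.copy()  # leave the caller's frame untouched
    out["is_Los_Angeles"] = (out["fips"] == LOS_ANGELES_FIPS).astype(int)
    return out


# fips values taken from the five rows of train.head() above
toy = pd.DataFrame({"fips": [6037, 6037, 6059, 6059, 6037]})
encoded = encode_fips(toy)
print(encoded["is_Los_Angeles"].tolist())  # [1, 1, 0, 0, 1]
```

If a separate flag per county were wanted, `pd.get_dummies` on the `fips` column would be one alternative.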


## Data Distributions:
>### Look at each feature and its distribution

>### Outliers: Creating the ranges of data to train on