# Zillow1 - Load Data

This notebook loads raw data saved on disk* and assigns a short name to each dataset. The user passes `geography` and `tenure` so that only the necessary datasets are loaded.

*Data was saved on disk for reproducibility, because Zillow updates its data and changes its methodology. Script for data download: `Zillow0_donwload.ipynb`.

In [1]:
import pandas as pd
import os

In [None]:
# print a usage message when this notebook is loaded from another script
print("Function load_zillow(geography, tenure='all')")
print("geography=('state'/'county'), tenure=('rentals'/'prices'/'all')")
print()

In [2]:
# raw data is stored in subdirectories of this folder
input_folder = '../input'

#### Create list of names to load data

In [3]:
# collect the paths of the raw files, relative to the working directory
root_dir = os.getcwd()
file_list = []

for dir_, _, files in os.walk(input_folder):
    for file_name in files:
        rel_dir = os.path.relpath(dir_, root_dir)
        rel_file = os.path.join(rel_dir, file_name)
        file_list.append(rel_file)

file_list.pop(0) # remove first item
file_list

['../input/zri/State_Zri_AllHomesPlusMultifamily.csv',
 '../input/zri/State_Zri_MultiFamilyResidenceRental.csv',
 '../input/zri/County_Zri_AllHomesPlusMultifamily_Summary.csv',
 '../input/zri/County_Zri_MultiFamilyResidenceRental.csv',
 '../input/zri/State_Zri_AllHomesPlusMultifamily_Summary.csv',
 '../input/zri/County_Zri_AllHomesPlusMultifamily.csv',
 '../input/zhvi/County_Zhvi_SingleFamilyResidence.csv',
 '../input/zhvi/State_Zhvi_AllHomes.csv',
 '../input/zhvi/County_Zhvi_AllHomes.csv',
 '../input/zhvi/State_Zhvi_SingleFamilyResidence.csv',
 '../input/zhvi/State_Zhvi_BottomTier.csv',
 '../input/zhvi/State_Zhvi_TopTier.csv',
 '../input/zhvi/County_Zhvi_BottomTier.csv',
 '../input/zhvi/County_Zhvi_TopTier.csv']
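Dropping the first entry with `pop(0)` relies on the order in which `os.walk` lists files, which follows the operating system's directory listing and is not guaranteed. Below is a minimal sketch of an alternative, assuming the unwanted entry is simply a non-CSV file (the notebook does not say what it is): keep only paths ending in `.csv` and sort them. The helper name `list_csv_files` is hypothetical.

```python
import os

def list_csv_files(folder):
    """Return sorted paths of all .csv files under `folder` (hypothetical helper)."""
    paths = []
    for dir_, _, files in os.walk(folder):
        for file_name in files:
            # assumption: every data file ends in .csv; anything else is skipped
            if file_name.lower().endswith('.csv'):
                paths.append(os.path.join(dir_, file_name))
    # sorting makes the result independent of the directory listing order
    return sorted(paths)
```

Sorting changes the order relative to the listing above, which should be harmless here because names and paths are derived from the same list.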

Next, I create simpler names from the file names in the subfolders.

In [4]:
# name generator
file_names = pd.Series([s.split('/')[-1][:-4] for s in file_list]) # extract only file name
file_names # check file names 

0              State_Zri_AllHomesPlusMultifamily
1           State_Zri_MultiFamilyResidenceRental
2     County_Zri_AllHomesPlusMultifamily_Summary
3          County_Zri_MultiFamilyResidenceRental
4      State_Zri_AllHomesPlusMultifamily_Summary
5             County_Zri_AllHomesPlusMultifamily
6              County_Zhvi_SingleFamilyResidence
7                            State_Zhvi_AllHomes
8                           County_Zhvi_AllHomes
9               State_Zhvi_SingleFamilyResidence
10                         State_Zhvi_BottomTier
11                            State_Zhvi_TopTier
12                        County_Zhvi_BottomTier
13                           County_Zhvi_TopTier
dtype: object
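The slicing above assumes forward-slash separators and a three-character extension. As a hedged alternative, the standard `os.path` functions can do the same extraction without those assumptions. The helper name `file_stem` is hypothetical.

```python
import os

def file_stem(path):
    """Return the file name without directory or extension (hypothetical helper)."""
    base = os.path.basename(path)         # drop the directory part
    stem, _ext = os.path.splitext(base)   # drop the extension, whatever its length
    return stem

print(file_stem('../input/zhvi/State_Zhvi_AllHomes.csv'))  # State_Zhvi_AllHomes
```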

In [5]:
# substitute long expressions to simplify file names
dic_sub = {'MultiFamilyResidenceRental': 'MFR',
           'SingleFamilyResidence': 'SFR',
           'AllHomesPlusMultifamily': 'All',
           'Tier':'',
           'Homes':'', 
           'County':'C',
           'State':'S',
           'l_Summary':'s',
           'Bottom':'Bot'}

var_names = file_names

# make substitutions
for key, value in dic_sub.items():
    var_names = var_names.str.replace(key, value)
    
# check result
var_names[:5]

0    S_Zri_All
1    S_Zri_MFR
2    C_Zri_Als
3    C_Zri_MFR
4    S_Zri_Als
dtype: object
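The result depends on the order of the substitutions: `'l_Summary'` only matches after `'AllHomesPlusMultifamily'` has been shortened to `'All'`, which is how `_Summary` files end up as `Als`. The sketch below replays the same substitutions with plain Python strings to make that visible. The names `subs` and `simplify` are hypothetical, and the sketch relies on dictionaries preserving insertion order (Python 3.7+).

```python
# same substitutions as dic_sub, in the same order
subs = {'MultiFamilyResidenceRental': 'MFR',
        'SingleFamilyResidence': 'SFR',
        'AllHomesPlusMultifamily': 'All',
        'Tier': '',
        'Homes': '',
        'County': 'C',
        'State': 'S',
        'l_Summary': 's',
        'Bottom': 'Bot'}

def simplify(name, substitutions=subs):
    """Apply each substitution in order to a single file name."""
    for old, new in substitutions.items():
        name = name.replace(old, new)
    return name

print(simplify('County_Zri_AllHomesPlusMultifamily_Summary'))  # C_Zri_Als
print(simplify('State_Zhvi_BottomTier'))                       # S_Zhvi_Bot
```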

In [6]:
# verify name-path correspondence visually
pd.DataFrame({'file': var_names, 'path': file_list})

         file  path
0   S_Zri_All  ../input/zri/State_Zri_AllHomesPlusMultifamily...
1   S_Zri_MFR  ../input/zri/State_Zri_MultiFamilyResidenceRen...
2   C_Zri_Als  ../input/zri/County_Zri_AllHomesPlusMultifamil...
3   C_Zri_MFR  ../input/zri/County_Zri_MultiFamilyResidenceRe...
4   S_Zri_Als  ../input/zri/State_Zri_AllHomesPlusMultifamily...
5   C_Zri_All  ../input/zri/County_Zri_AllHomesPlusMultifamil...
6  C_Zhvi_SFR  ../input/zhvi/County_Zhvi_SingleFamilyResidenc...
7  S_Zhvi_All  ../input/zhvi/State_Zhvi_AllHomes.csv
8  C_Zhvi_All  ../input/zhvi/County_Zhvi_AllHomes.csv
9  S_Zhvi_SFR  ../input/zhvi/State_Zhvi_SingleFamilyResidence...


#### Load only selected data
By running the function `load_zillow`, the user can choose specific groups of datasets.

In [1]:
# maps function arguments to substrings contained in the file names and variable names
dic = {'county': 'C_',
       'state': 'S_',
       'prices': 'Zhvi',
       'rentals': 'Zri',
       'all': 'Zhvi|Zri'}

In [2]:
def load_zillow(geography, tenure='all'):

    # boolean mask of the names that match the chosen geography and tenure
    pos = var_names.str.contains(dic[tenure]) & var_names.str.contains(dic[geography])

    # iterate over the selected variable names and paths to load data
    for var, path in zip(var_names[pos].tolist(), pd.Series(file_list)[pos]):
        # globals()[var] uses the string held in var as the name of a new global variable
        globals()[var] = pd.read_csv(path, encoding='ISO-8859-1')
        print(var, path)
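Writing into `globals()` keeps the short variable names, but it also makes it hard to see which datasets were created. As a hedged alternative (not the approach used in this notebook), the selection step could return a dictionary keyed by short name. The sketch below covers only the selection logic, using the standard `re` module in place of `Series.str.contains`. The names `PATTERNS` and `select_datasets` are hypothetical.

```python
import re

# same argument-to-pattern mapping as dic above
PATTERNS = {'county': 'C_',
            'state': 'S_',
            'prices': 'Zhvi',
            'rentals': 'Zri',
            'all': 'Zhvi|Zri'}

def select_datasets(names, paths, geography, tenure='all'):
    """Return {short name: path} for entries matching geography and tenure."""
    if geography not in ('county', 'state'):
        raise ValueError("geography must be 'county' or 'state'")
    if tenure not in ('prices', 'rentals', 'all'):
        raise ValueError("tenure must be 'prices', 'rentals' or 'all'")
    return {name: path
            for name, path in zip(names, paths)
            if re.search(PATTERNS[geography], name)
            and re.search(PATTERNS[tenure], name)}
```

Each selected path could then be passed to `pd.read_csv(path, encoding='ISO-8859-1')`, as in `load_zillow`, and the resulting DataFrame stored under its short name.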
    