# Data Shape

Let's take a quick look at the data.

Note: you'll need to install the `pandas`, `numpy`, and `requests` packages, along with `ipykernel`, before running the commands in this file. Also, make sure you've selected a kernel using the **Select Kernel** button in the upper right of this window.

In [2]:
## Run this to install pandas in the notebook kernel. You only need to do this once.
## conda install -p .\.conda pandas numpy requests

import pandas as pd
import numpy as np
import io
import requests

Next, we'll download and import the data into a data frame.

You'll immediately notice that the data frame is rather wide: it includes a column for each month from January 31, 2000 to the present. The entire dataset is sorted by `SizeRank`, with `New York, NY` at the top (just below the `United States` row) and `Lamesa, TX` (a mere 13.31 km² according to a quick web search) at the bottom. The data does not include the actual size of the area, only its rank.

In [21]:

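url = "https://files.zillowstatic.com/research/public_csvs/zhvi/Metro_zhvi_uc_sfrcondo_tier_0.33_0.67_sm_sa_month.csv?t=1699740944"
## Download the CSV and fail early on an HTTP error instead of parsing an error page.
response = requests.get(url)
response.raise_for_status()
## The bytes are decoded here, so read_csv does not need an encoding argument.
df = pd.read_csv(io.StringIO(response.content.decode('utf-8')), index_col="RegionID")
df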


RegionID,SizeRank,RegionName,RegionType,StateName,2000-01-31,2000-02-29,2000-03-31,2000-04-30,2000-05-31,2000-06-30,...,2022-12-31,2023-01-31,2023-02-28,2023-03-31,2023-04-30,2023-05-31,2023-06-30,2023-07-31,2023-08-31,2023-09-30
102001,0,United States,country,,121428.348338,121641.979730,121906.914111,122475.146355,123129.113541,123830.254505,...,341524.686817,340331.965361,339460.290723,339398.680892,340364.871948,341993.734727,343935.026591,345686.216142,347311.245095,348538.961877
394913,1,"New York, NY",msa,NY,216218.985144,217137.793836,218065.112352,219944.218493,221890.097976,224047.392222,...,607957.914467,607138.375566,605781.039331,606096.535158,608105.120897,612136.785991,616308.017005,619911.494379,623211.984905,625939.720932
753899,2,"Los Angeles, CA",msa,CA,222303.044856,223130.294238,224232.182526,226424.570068,228822.354722,231203.347369,...,883096.482193,874754.169505,863791.280350,853971.725026,851581.705998,855385.245879,863224.274090,874493.271073,888127.785264,901894.488002
394463,3,"Chicago, IL",msa,IL,152289.701354,152430.677254,152699.168339,153367.107433,154170.557823,155072.246539,...,292513.670485,291906.868582,291753.784548,292397.065395,294085.140305,296271.600804,298827.032389,301362.893227,303811.358338,305636.274755
394514,4,"Dallas, TX",msa,TX,125341.331449,125397.158916,125461.338304,125628.005434,125847.751482,126070.179855,...,373653.261774,371121.004193,368863.612270,367332.301710,366782.695296,367067.613765,367882.973641,368856.669150,369775.450746,370189.087719
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
753929,935,"Zapata, TX",msa,TX,,,,,,,...,128820.616257,124500.076959,121698.517557,120678.497416,120861.552144,120377.367718,119691.405639,119336.614239,119555.327710,119218.474344
394743,936,"Ketchikan, AK",msa,AK,,,,,,,...,393779.436232,391467.727268,389754.325134,387641.460892,386519.303527,386847.150776,388239.179204,389755.421023,389345.066681,387400.567422
753874,937,"Craig, CO",msa,CO,98973.406536,99226.488457,99697.660699,100368.653492,101148.393050,101883.417385,...,265092.079451,266386.963394,267729.652398,268933.717873,271002.669802,273691.861160,277371.349796,280306.434975,282581.826984,283845.901872
395188,938,"Vernon, TX",msa,TX,,,,,,,...,92945.807423,90805.018699,89793.168060,90918.434770,92419.782770,93680.519832,93642.904321,93103.127344,92088.769220,91114.278438
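
Because the frame is wide, with one column per month, many time-series operations are easier after reshaping it to a long format with one row per region and month. The following is a minimal sketch of that reshaping with `DataFrame.melt`. It is not part of the original notebook: it runs on a tiny made-up frame (`wide`) rather than the downloaded data, and the names `long`, `Date`, and `ZHVI` are illustrative assumptions.

```python
import pandas as pd

# Tiny stand-in for the Zillow frame; the values are made up for illustration.
wide = pd.DataFrame(
    {
        "SizeRank": [1, 2],
        "RegionName": ["New York, NY", "Los Angeles, CA"],
        "RegionType": ["msa", "msa"],
        "StateName": ["NY", "CA"],
        "2000-01-31": [100.0, 200.0],
        "2000-02-29": [110.0, 210.0],
    },
    index=pd.Index([394913, 753899], name="RegionID"),
)

# Columns that describe the region rather than a month.
id_cols = ["SizeRank", "RegionName", "RegionType", "StateName"]

# One row per (region, month) instead of one column per month.
long = wide.reset_index().melt(
    id_vars=["RegionID"] + id_cols,
    var_name="Date",
    value_name="ZHVI",
)

# The former column names are strings; parse them into real timestamps.
long["Date"] = pd.to_datetime(long["Date"], format="%Y-%m-%d")
```

The same call should work on `df` from the cell above, since it has the same identifying columns, but that has not been run here.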


The dataset includes only one row for `country` (the United States); the rest refer to "msa" regions.

If you're curious, "msa" stands for [Metropolitan Statistical Area (MSA)](https://en.wikipedia.org/wiki/Metropolitan_statistical_area) or [Micropolitan Statistical Area (μSA)](https://en.wikipedia.org/wiki/Micropolitan_statistical_area), which (according to Wikipedia) are geographical regions with a relatively high population density at their core and close economic ties. In other words, the region doesn't necessarily represent a single city, nor when combined do they represent an entire state. As of 2020, there were 927 such regions defined in the U.S., so the ZHVI doesn't include all of them.



In [41]:
df.groupby('RegionType').count()

RegionType,SizeRank,RegionName,StateName,2000-01-31,2000-02-29,2000-03-31,2000-04-30,2000-05-31,2000-06-30,2000-07-31,...,2022-12-31,2023-01-31,2023-02-28,2023-03-31,2023-04-30,2023-05-31,2023-06-30,2023-07-31,2023-08-31,2023-09-30
country,1,1,0,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
msa,894,894,894,430,431,432,434,436,437,438,...,894,894,894,894,894,894,894,894,894,894
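
The counts above also show that the early months are sparse: only 430 of the 894 "msa" rows have a value for January 2000. Here is a small sketch of how that coverage could be measured directly. It uses a made-up frame (`frame`) with hypothetical regions `A`, `B`, and `C`; the variable names `coverage` and `complete` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data: region A has a full history, B and C start later.
frame = pd.DataFrame(
    {
        "RegionName": ["A", "B", "C"],
        "RegionType": ["msa", "msa", "msa"],
        "2000-01-31": [1.0, np.nan, np.nan],
        "2000-02-29": [1.1, 2.0, np.nan],
        "2000-03-31": [1.2, 2.1, 3.0],
    }
)

# Everything that is not an identifying column is a month column.
date_cols = [c for c in frame.columns if c not in ("RegionName", "RegionType")]

# Number of regions with a value in each month.
coverage = frame[date_cols].notna().sum()

# Regions that have a value in every month.
complete = frame[date_cols].notna().all(axis=1)
```

Filtering on `complete` is one way to restrict an analysis to regions with a full history, at the cost of dropping regions that Zillow started tracking later.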


Except for the first few columns, the bulk of each row is numeric. Here are the descriptive statistics for those columns, excluding the row for `country`.

In [44]:
mask = df['RegionType'].isin(['country'])
df[~mask].describe()

statistic,SizeRank,2000-01-31,2000-02-29,2000-03-31,2000-04-30,2000-05-31,2000-06-30,2000-07-31,2000-08-31,2000-09-30,...,2022-12-31,2023-01-31,2023-02-28,2023-03-31,2023-04-30,2023-05-31,2023-06-30,2023-07-31,2023-08-31,2023-09-30
count,894.0,430.0,431.0,432.0,434.0,436.0,437.0,438.0,439.0,440.0,...,894.0,894.0,894.0,894.0,894.0,894.0,894.0,894.0,894.0,894.0
mean,462.268456,108310.945759,108489.909344,108641.374011,109318.156068,110056.874462,110628.937589,111391.302582,112008.951956,112686.244461,...,269082.6,268102.3,267387.5,267428.6,268426.8,270038.2,271793.0,273200.1,274360.7,275202.9
std,268.416053,47229.579627,47344.087705,47538.093094,48193.001942,48801.545688,49423.742901,50136.91484,50879.421199,51631.385851,...,170065.7,168845.0,167760.4,166918.7,166848.1,167307.3,168402.7,169757.2,171245.3,172605.0
min,1.0,33790.975592,33785.351808,33768.393579,33732.132381,33714.859687,33712.200697,33791.095617,33911.701453,34086.411984,...,47403.82,46430.33,45638.52,45586.66,46519.09,47387.35,47623.64,47130.12,46480.75,46117.54
25%,231.25,78069.762376,78140.417898,78211.300594,78364.339855,78552.121678,78821.85877,79272.481026,79693.692929,80057.032029,...,167407.8,166770.3,166120.7,166044.6,166986.3,168574.2,170079.4,170854.1,171200.6,170700.4
50%,460.5,96952.700498,97223.14719,97193.681321,97489.968825,98126.272063,98377.056982,98893.484177,99197.086804,99715.147671,...,216952.5,216102.6,216040.4,216792.1,218066.9,220567.5,222056.7,222955.5,223668.4,224644.5
75%,689.75,124617.174821,124729.425788,124914.013803,125536.802045,126500.416654,127052.274804,128229.353424,128399.854166,128703.986107,...,314298.0,312788.3,314102.5,314705.9,316760.7,319608.2,322746.7,324832.8,325688.2,326169.5
max,939.0,364053.011758,365929.309281,368755.917179,376410.574381,384436.013479,393631.233852,401547.487966,411291.530866,421165.689735,...,1430127.0,1420832.0,1411023.0,1401486.0,1399193.0,1406153.0,1422072.0,1440191.0,1452815.0,1461070.0


Here are the same numbers, grouped by `StateName`.

In [45]:
df.groupby('StateName').describe()

,SizeRank,SizeRank,SizeRank,SizeRank,SizeRank,SizeRank,SizeRank,SizeRank,2000-01-31,2000-01-31,...,2023-08-31,2023-08-31,2023-09-30,2023-09-30,2023-09-30,2023-09-30,2023-09-30,2023-09-30,2023-09-30,2023-09-30
StateName,count,mean,std,min,25%,50%,75%,max,count,mean,...,75%,max,count,mean,std,min,25%,50%,75%,max
AK,4.0,571.0,365.196751,139.0,340.0,604.5,835.5,936.0,1.0,134525.107965,...,408625.389605,466466.4,4.0,378765.808622,72257.438151,291078.379681,349841.759733,378415.060253,407339.109142,467154.7
AL,20.0,361.4,207.754511,51.0,199.25,325.0,478.5,788.0,7.0,103282.32359,...,222759.968069,377431.0,20.0,206149.831877,65104.278864,75433.789604,172340.85086,188461.197168,222903.243157,376925.6
AR,20.0,579.2,265.685528,81.0,413.75,652.5,801.5,927.0,18.0,76799.787159,...,193335.282137,327117.2,20.0,160194.380993,65765.003416,46117.539607,113304.228733,164049.604354,194043.866247,328624.9
AZ,10.0,344.3,236.552108,11.0,219.75,318.5,530.25,718.0,8.0,107958.673398,...,372479.961453,601481.5,10.0,353597.612767,109846.738473,246452.341539,264985.881776,348182.771961,373953.270394,605112.6
CA,34.0,228.205882,212.549,2.0,72.5,176.0,306.5,840.0,20.0,205376.068989,...,820564.018596,1425581.0,34.0,594245.073938,296048.577231,232887.071784,371551.825402,478731.26091,827769.439363,1452954.0
CO,17.0,486.647059,314.672834,19.0,161.0,566.0,811.0,937.0,13.0,172244.258831,...,729481.063477,1221034.0,17.0,579980.341244,294456.630733,247119.893467,330939.132868,488766.175841,731873.399679,1233960.0
CT,5.0,119.6,85.183919,49.0,60.0,68.0,185.0,236.0,5.0,170410.09707,...,361186.615034,578859.2,5.0,399248.606635,105238.770188,336414.183187,347226.719948,360484.275687,365732.073009,586385.8
DE,2.0,189.0,77.781746,134.0,161.5,189.0,216.5,244.0,1.0,131508.409926,...,391673.40403,410144.2,2.0,375056.821589,52285.1084,338085.666885,356571.244237,375056.821589,393542.398942,412028.0
FL,29.0,266.310345,243.545552,8.0,90.0,155.0,390.0,859.0,25.0,103317.804252,...,394730.459785,1006245.0,29.0,363215.076527,152229.174828,204916.775962,272209.520508,355223.805175,396032.816796,1007954.0
GA,37.0,512.027027,255.916163,9.0,300.0,553.0,688.0,923.0,32.0,89401.320664,...,241137.80898,375358.0,37.0,211165.644466,72624.443775,107669.306704,158602.668327,192857.101479,243943.278563,377265.7
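
Describing every month for every state produces a very wide table. If only a single month is of interest, a narrower summary may be easier to read. The sketch below is an assumption about how one might do that, not something from the original notebook: it groups a made-up frame by `StateName` and aggregates only the last column. Note that `groupby` drops rows whose key is missing by default, so a row without a `StateName` (like the `country` row) does not appear in the result.

```python
import pandas as pd

# Made-up values; the last row mimics the country row, which has no StateName.
frame = pd.DataFrame(
    {
        "StateName": ["TX", "TX", "CA", "CA", None],
        "2023-09-30": [100.0, 300.0, 500.0, 700.0, 400.0],
    }
)

# Most recent month is the last column.
latest = frame.columns[-1]

# A few statistics for that one month, highest median first.
by_state = (
    frame.groupby("StateName")[latest]
    .agg(["count", "median", "max"])
    .sort_values("median", ascending=False)
)
```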


Before we dig into this data further, I want to share a method for parsing the column names as dates.

The worst part about this data set is the decision to use the last day of the month in the date, rather than the first. As a result, the February column falls on the 28th in most years and on the 29th in leap years, which makes even a minor calculation like **Year-Over-Year (YOY)** fraught with peril.

In [69]:
from datetime import datetime
import calendar

dformat = '%Y-%m-%d'
colname = '2020-03-31' ## or '2020-02-29' for the leap-day case, or df.columns[-1]
dt = datetime.strptime(colname, dformat)

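## Same month, one year earlier. February's last day depends on whether the
## earlier year is a leap year; every other month keeps the same last day.
year = dt.year - 1
day = dt.day if dt.month != 2 else 29 if calendar.isleap(year) else 28
dt_start = datetime(year, dt.month, day)

## Show the prior-year column next to the selected column.
df[[dt_start.strftime(dformat), dt.strftime(dformat)]]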

Unnamed: 0_level_0,2019-03-31,2020-03-31
RegionID,Unnamed: 1_level_1,Unnamed: 2_level_1
102001,236914.871571,249513.852467
394913,489448.186378,505127.712208
753899,647333.340345,673275.311995
394463,235046.844159,240305.188157
394514,249110.605153,257843.283963
...,...,...
753929,112850.359106,120926.449679
394743,311697.502003,335840.463594
753874,176692.211266,189523.496954
395188,71937.854546,75896.936771
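
As an alternative to the manual leap-year check, pandas date offsets can find the matching month end. The sketch below is one possible approach, shown on made-up numbers. The helper names `prior_year_month_end` and `yoy_change` are hypothetical and not part of pandas. Subtracting `pd.DateOffset(years=1)` clips the day to a valid date, and adding `pd.offsets.MonthEnd(0)` rolls forward to the end of the month if the date is not already there.

```python
import pandas as pd


def prior_year_month_end(colname: str) -> str:
    """Return the month-end column name one year before `colname`."""
    ts = pd.Timestamp(colname)
    # Step back a year (Feb 29 becomes Feb 28), then snap to the month end
    # so that Feb 28 becomes Feb 29 when the earlier year is a leap year.
    prior = ts - pd.DateOffset(years=1) + pd.offsets.MonthEnd(0)
    return prior.strftime("%Y-%m-%d")


def yoy_change(frame: pd.DataFrame, colname: str) -> pd.Series:
    """Fractional year-over-year change for one month column."""
    return frame[colname] / frame[prior_year_month_end(colname)] - 1.0


# Made-up values for a single region across three Februaries.
frame = pd.DataFrame(
    {"2019-02-28": [100.0], "2020-02-29": [110.0], "2021-02-28": [121.0]},
    index=pd.Index([1], name="RegionID"),
)

yoy_2020 = yoy_change(frame, "2020-02-29")
yoy_2021 = yoy_change(frame, "2021-02-28")
```

Both calls look up the correct February column even though the day of the month differs between the two years.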
