<h1>Exploratory Data Analysis</h1>
<p>In this notebook, we'll characterize the data we cleaned in the previous notebook and draw some coarse, first-pass insights from it. We plan to do the following:</p>
    <ul>
        <li>Load in data</li>
        <li>Use descriptive statistics on some key metrics</li>
        <li>Compare neighborhood similarity</li>
        <li>Choose features for predictive machine learning</li>
    </ul>


In [2]:
#As always, first we load in relevant packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px

from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)

In [13]:
#Next, load in the pickled data.
df = pd.read_pickle('./Data/CleanData.pkl')
print(df.columns)

Index(['mean_average_price', 'sum_houses_sold', 'sum_crimes', 'code',
       'median_salary', 'life_satisfaction', 'mean_salary', 'recycling_pct',
       'population_size', 'number_of_jobs', 'area_size', 'no_of_houses',
       'borough_flag'],
      dtype='object')


In [17]:
#Reshape mean_average_price into a 2d table (area x year) and preview it
df2 = df['mean_average_price'].unstack()
df2.head()

area,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
barking and dagenham,51818.0,51718.25,55974.25,60285.75,65320.833333,77549.5,88664.0,112221.916667,142499.0,158176.0,...,163465.083333,165863.916667,173733.666667,201172.25,233460.083333,273919.75,287734.833333,295196.666667,299294.166667,301057.0
barnet,91792.5,94000.416667,106883.25,122359.25,136004.416667,167952.666667,185563.333333,220746.166667,251212.833333,271854.083333,...,338978.0,358627.416667,374770.583333,430363.333333,478688.083333,525939.5,538280.916667,533266.416667,519552.583333,520682.0
bexley,64291.583333,65490.5,70789.5,80632.0,86777.666667,103186.5,116527.083333,136798.0,164482.083333,179141.25,...,200672.083333,202546.416667,213470.25,244459.583333,274209.25,321563.666667,335694.416667,342603.583333,337537.166667,331683.0
brent,73029.916667,75236.0,86749.083333,100692.666667,112157.416667,140962.416667,157287.333333,185898.083333,216501.75,236023.416667,...,298964.416667,314112.833333,339655.75,394687.416667,440951.75,489469.416667,487703.75,492845.333333,474172.0,408523.0
bromley,81967.25,83547.416667,94224.666667,108286.5,120874.083333,147826.916667,162131.833333,186646.083333,215993.083333,234462.666667,...,274874.5,282025.083333,296669.25,347857.333333,385681.5,428008.166667,441218.666667,443410.0,437700.333333,436757.0
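
<p>To make the reshaping step concrete, here is a minimal sketch of what <code>unstack()</code> does. It assumes, based on the output above, that the cleaned data is indexed by (area, year); the toy series <code>long_prices</code> is hypothetical and only reuses a few values from the table above.</p>

```python
import pandas as pd

# Hypothetical long-format series indexed by (area, year).
# This mimics the assumed shape of df['mean_average_price'].
idx = pd.MultiIndex.from_product(
    [["barnet", "bexley"], [1995, 1996]], names=["area", "year"]
)
long_prices = pd.Series(
    [91792.5, 94000.416667, 64291.583333, 65490.5],
    index=idx,
    name="mean_average_price",
)

# unstack() pivots the innermost index level (year) into columns,
# leaving one row per area.
wide_prices = long_prices.unstack()
print(wide_prices)
```

<p>Each row of the result is an area and each column is a year, which is the layout the later cells rely on when they select columns such as <code>df2[1995]</code>.</p>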


In [26]:
#Since this is longitudinal data, we want to know how things change over time.
#For a basic tabulation, let's find the percentage change of each value for each area between the first and last timepoints

#initialize a new df for this task
change_df = pd.DataFrame()

#for monthly measures (mean_average_price, sum_houses_sold), this is pretty straightforward. let's do that first.
#pull out 2d table of mean average price (area x year)
df2 = df['mean_average_price'].unstack()
#monthly values run from 1995 to 2020, so we calculate 100*((2020/1995)-1) to find the % change
change_df['mean_average_price'] = 100*((df2[2020]/df2[1995])-1)

df2 = df['sum_houses_sold'].unstack()
change_df['sum_houses_sold'] = 100*((df2[2020]/df2[1995])-1)

change_df.head()

area,mean_average_price,sum_houses_sold
barking and dagenham,480.989232,-82.204155
barnet,467.238064,-94.37691
bexley,415.904233,-93.00508
brent,459.3913,-91.354045
bromley,432.8433,-94.587375
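
<p>As a sanity check on the percentage-change arithmetic, the sketch below computes 100*((last/first)-1) on a two-row toy table. The helper name <code>pct_change_first_last</code> and the frame <code>toy</code> are hypothetical; the 1995 and 2020 prices are copied from the mean_average_price table earlier in the notebook.</p>

```python
import pandas as pd

def pct_change_first_last(wide, first, last):
    """Percentage change from column `first` to column `last`, per row."""
    return 100 * (wide[last] / wide[first] - 1)

# Toy area x year table with two rows of mean_average_price
toy = pd.DataFrame(
    {1995: [51818.0, 91792.5], 2020: [301057.0, 520682.0]},
    index=pd.Index(["barking and dagenham", "barnet"], name="area"),
)

toy_change = pct_change_first_last(toy, 1995, 2020)
print(toy_change.round(2))
```

<p>Note that the subtraction of 1 happens before multiplying by 100. Writing 100*(last/first)-1 instead would shift every result by 99 percentage points.</p>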
