# DS-SF-27 | Codealong 03 | Exploratory Data Analysis

In [87]:
import os

import math

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 20)
pd.set_option('display.notebook_repr_html', True)
pd.set_option('display.max_columns', 10)

import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')

## Part A - Review and Activity | Subsetting with pandas

In [88]:
df = pd.DataFrame({'name': ['Alice', 'Bob', 'Carol', 'Dave', 'Eve', 'Frank'],
    'gender': ['Female', 'Male', 'Female', 'Male', 'Female', 'Male'],
    'age': [24, 34, 44, 41, 52, 43],
    'marital_status': [0, 2, 1, 2, 0, 1]}).\
        set_index('name')

In [89]:
df

name,age,gender,marital_status
Alice,24,Female,0
Bob,34,Male,2
Carol,44,Female,1
Dave,41,Male,2
Eve,52,Female,0
Frank,43,Male,1


> Question 1.  Subset the dataframe on the age and gender columns

In [92]:
# TODO
df[["age", "gender"]]


name,age,gender
Alice,24,Female
Bob,34,Male
Carol,44,Female
Dave,41,Male
Eve,52,Female
Frank,43,Male


> Question 2.  Subset the dataframe on the age column alone, first as a `DataFrame`, then as a `Series`

In [61]:
# TODO (DataFrame)
df[["age"]]

name,age
Alice,24
Bob,34
Carol,44
Dave,41
Eve,52
Frank,43


In [93]:
# TODO (Series)
#df["age"]
df.age

name
Alice    24
Bob      34
Carol    44
Dave     41
Eve      52
Frank    43
Name: age, dtype: int64
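
The difference between the two answers comes down to what goes inside the brackets. A minimal sketch, assuming a small hypothetical frame `people` that mirrors the one defined earlier (the name and values are for illustration only):

```python
import pandas as pd

# Hypothetical toy frame, shaped like the one used in this part
people = pd.DataFrame(
    {'age': [24, 34, 44], 'gender': ['Female', 'Male', 'Female']},
    index=pd.Index(['Alice', 'Bob', 'Carol'], name='name'))

as_frame = people[['age']]   # a list of labels returns a DataFrame (2-D)
as_series = people['age']    # a single label returns a Series (1-D)

print(type(as_frame).__name__, as_frame.shape)
print(type(as_series).__name__, as_series.shape)
```

The same rule applies to rows with `.loc`: a list of labels keeps the result two-dimensional, a single label drops a dimension.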

> Question 3.  Subset the dataframe on the rows Bob and Carol

In [63]:
# TODO
df.loc[["Bob", "Carol"]]

name,age,gender,marital_status
Bob,34,Male,2
Carol,44,Female,1


> Question 4.  Subset the dataframe on the row Eve alone, first as a `DataFrame`, then as a `Series`

In [64]:
# TODO (DataFrame)
df.loc[["Eve"]]

name,age,gender,marital_status
Eve,52,Female,0


In [65]:
# TODO (Series)
df.loc["Eve"]

age                   52
gender            Female
marital_status         0
Name: Eve, dtype: object

> Question 5.  How old is Frank?

In [99]:
# TODO
#df.at["Frank", "age"]
#df.age.Frank
df.age["Frank"]

43
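
The alternatives commented out above should all return the same scalar. A short sketch on a hypothetical frame `people` (illustrative values) compares three of them:

```python
import pandas as pd

# Hypothetical toy frame indexed by name
people = pd.DataFrame({'age': [41, 52, 43]},
                      index=pd.Index(['Dave', 'Eve', 'Frank'], name='name'))

by_at = people.at['Frank', 'age']       # fast scalar lookup by row and column label
by_loc = people.loc['Frank', 'age']     # general label-based lookup
by_series = people['age']['Frank']      # select the column first, then the row label

print(by_at, by_loc, by_series)
```

Attribute access such as `df.age.Frank` only works when the labels happen to be valid Python identifiers, so the bracketed forms are the safer default.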

## Part B

- `.mean()`
- `.var()`, `.std()`

In [67]:
df = pd.read_csv(os.path.join('..', 'datasets', 'zillow-03-starter.csv'), index_col = 'ID')

In [68]:
df.head(5)

ID,Address,DateOfSale,SalePrice,IsAStudio,BedCount,BathCount,Size,LotSize,BuiltInYear
15063471,"55 Vandewater St APT 9, San Francisco, CA",12/4/15,710000.0,0.0,1.0,,550.0,,1980.0
15063505,"740 Francisco St, San Francisco, CA",11/30/15,2150000.0,0.0,,2.0,1430.0,2435.0,1948.0
15063609,"819 Francisco St, San Francisco, CA",11/12/15,5600000.0,0.0,2.0,3.5,2040.0,3920.0,1976.0
15064044,"199 Chestnut St APT 5, San Francisco, CA",12/11/15,1500000.0,0.0,1.0,1.0,1060.0,,1930.0
15064257,"111 Chestnut St APT 403, San Francisco, CA",1/15/16,970000.0,0.0,2.0,2.0,1299.0,,1993.0


### `Series.mean()` - Compute the `Series` mean value

In [69]:
df.SalePrice.mean()

1397422.943

> What's `Size`'s mean?

In [70]:
# TODO
df.Size.mean()

1641.3009307135471

> What fraction of the properties sold in the dataset are studios?

In [71]:
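# TODO
# This returns the number of studios; divide by df.IsAStudio.count()
# (or use df.IsAStudio.mean()) to get the fraction
len(df[df.IsAStudio > 0])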

29
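
To turn a count like this into a fraction, divide by the number of rows where the flag is known. A minimal sketch on a hypothetical 0/1 column `is_studio` (made-up values, with one missing entry):

```python
import numpy as np
import pandas as pd

# Hypothetical 0/1 indicator with one unknown value
is_studio = pd.Series([0.0, 1.0, 0.0, np.nan, 0.0])

n_studios = (is_studio > 0).sum()   # NaN > 0 is False, so unknown rows are not counted
n_known = is_studio.count()         # number of non-NaN values
fraction = n_studios / n_known

print(n_studios, n_known, fraction)
print(is_studio.mean())             # the mean of a 0/1 column is the same fraction
```

Because `.mean()` skips `NaN` by default, the fraction is taken over the properties whose studio status is known, not over all rows.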

### `DataFrame.mean()` - Compute the `DataFrame` mean value

In [72]:
# TODO
df.mean()

SalePrice      1.397423e+06
IsAStudio      2.941176e-02
BedCount       2.572967e+00
BathCount      1.977548e+00
Size           1.641301e+03
LotSize        3.020640e+03
BuiltInYear    1.947533e+03
dtype: float64

### `.var()` - Compute the unbiased variance (normalized by `N-1` by default)

In [73]:
# TODO
df.var()

SalePrice      3.015131e+12
IsAStudio      2.857569e-02
BedCount       1.564729e+00
BathCount      1.277654e+00
Size           1.054762e+06
LotSize        8.142064e+06
BuiltInYear    1.445639e+03
dtype: float64

> What's the variance for the number of beds in the dataset?

In [74]:
# TODO
df.BedCount.var()

1.5647293928888621

### `.std()` - Compute the sample standard deviation (normalized by `N-1` by default)

In [75]:
# TODO
df.std()

SalePrice      1.736413e+06
IsAStudio      1.690435e-01
BedCount       1.250891e+00
BathCount      1.130334e+00
Size           1.027016e+03
LotSize        2.853430e+03
BuiltInYear    3.802156e+01
dtype: float64

> What's the standard deviation for the number of beds in the dataset?

In [76]:
# TODO
df.BedCount.std()

1.2508914392899417
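
To make the `N-1` normalization concrete, the sketch below recomputes the variance by hand on a hypothetical series `beds` (illustrative values) and compares it with pandas, using the `ddof` parameter to switch to the population version:

```python
import math
import pandas as pd

# Hypothetical bed counts
beds = pd.Series([1.0, 2.0, 2.0, 3.0, 4.0])

n = beds.count()
squared_deviations = ((beds - beds.mean()) ** 2).sum()

sample_var = squared_deviations / (n - 1)   # what .var() returns by default (ddof=1)
population_var = squared_deviations / n     # what .var(ddof=0) returns

print(sample_var, beds.var())
print(population_var, beds.var(ddof=0))
print(math.sqrt(sample_var), beds.std())    # .std() is the square root of .var()
```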

## Part C

- `.median()`
- `.count()`, `.dropna()`, `.isnull()`
- `.min()`, `.max()`
- `.quantile()`
- `.describe()`

### `.median()` - Compute the median value

In [77]:
# TODO
df.median()

SalePrice      1100000.0
IsAStudio            0.0
BedCount             2.0
BathCount            2.0
Size              1350.0
LotSize           2622.0
BuiltInYear       1939.0
dtype: float64

> What's the median sale price for properties in the dataset?

In [78]:
# TODO
df.SalePrice.median()

1100000.0

### `.count()` - Compute the number of rows/observations without `NaN` and `.sum()` - Compute the sum of the values

In [79]:
df.count()

Address        1000
DateOfSale     1000
SalePrice      1000
IsAStudio       986
BedCount        836
BathCount       942
Size            967
LotSize         556
BuiltInYear     975
dtype: int64

In [80]:
df.IsAStudio.count()

986

`count()` counts the number of non-`NaN` values:

In [81]:
df.IsAStudio.dropna().shape[0]

986L

In [82]:
df.IsAStudio.isnull().sum()

14

That leaves 14 properties for which we don't know whether they are studios.

In [83]:
df.IsAStudio.dropna().shape[0] + df.IsAStudio.isnull().sum()

1000

In [84]:
df.IsAStudio.sum()

29.0

29 properties are studios.
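
The bookkeeping above can be checked on a small scale. A sketch with a hypothetical indicator series `flags` (made-up values) shows how the row count, `.count()`, `.isnull().sum()` and `.sum()` relate:

```python
import numpy as np
import pandas as pd

# Hypothetical 0/1 flags with two missing values
flags = pd.Series([0.0, 1.0, np.nan, 1.0, np.nan, 0.0])

n_rows = flags.shape[0]            # all rows, including NaN
n_known = flags.count()            # non-NaN rows only
n_missing = flags.isnull().sum()   # True counts as 1 when summed
n_true = flags.sum()               # NaN is skipped, so this counts the 1s

print(n_rows, n_known, n_missing, n_true)
```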

### `.min()` and `.max()` - Compute the minimum and maximum values

In [85]:
df.min()

Address        1 Crescent Way APT 1402, San Francisco, CA
DateOfSale                                        1/10/16
SalePrice                                               1
IsAStudio                                               0
BedCount                                                1
BathCount                                               1
Size                                                  264
LotSize                                                44
BuiltInYear                                          1870
dtype: object

> Which properties were sold at the lowest price?  At what price?

In [86]:
# TODO
df.columns

Index([u'Address', u'DateOfSale', u'SalePrice', u'IsAStudio', u'BedCount',
       u'BathCount', u'Size', u'LotSize', u'BuiltInYear'],
      dtype='object')
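
One possible way to answer the lowest-price question is to compare the column against its own minimum. The sketch below uses a hypothetical frame `sales` (addresses and prices are made up) and is only one of several reasonable approaches:

```python
import pandas as pd

# Hypothetical sales data with a tie for the lowest price
sales = pd.DataFrame(
    {'Address': ['A St', 'B St', 'C St', 'D St'],
     'SalePrice': [500.0, 120.0, 900.0, 120.0]},
    index=pd.Index([1, 2, 3, 4], name='ID'))

lowest_price = sales.SalePrice.min()
cheapest = sales[sales.SalePrice == lowest_price]   # keeps every row tied at the minimum
first_cheapest_id = sales.SalePrice.idxmin()        # index label of the first minimum only

print(lowest_price)
print(cheapest)
print(first_cheapest_id)
```

The boolean-mask form returns all tied rows, whereas `idxmin()` reports a single label. Swapping in `.max()` and `.idxmax()` gives the highest-price counterpart.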

In [None]:
df.max()

> Which properties were sold at the highest price?  At what price?

In [None]:
# TODO

### `.quantile()` - Compute values at the given quantile

In [None]:
df.quantile(.5)

In [None]:
df.median()

In [None]:
df.quantile(.25)

In [None]:
df.quantile(.75)
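
The cells above suggest that the 0.5 quantile and the median coincide. A small sketch on a hypothetical series `prices` (illustrative values) checks this and derives the interquartile range from the other two quantiles:

```python
import pandas as pd

# Hypothetical values, already sorted for readability
prices = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

q1 = prices.quantile(.25)   # first quartile
q2 = prices.quantile(.5)    # second quartile, same as the median
q3 = prices.quantile(.75)   # third quartile
iqr = q3 - q1               # interquartile range

print(q1, q2, q3, iqr)
print(q2 == prices.median())
```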

### `.describe()` - Generate various summary statistics

In [None]:
df.describe()

In [None]:
df.SalePrice.describe()

## Part D

- Boxplots

In [None]:
df.SalePrice.plot(kind = 'box', figsize = (8, 8))

> In the same plot, plot the boxplots of `BedCount` and `BathCount`

In [None]:
# TODO
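
One possible approach, sketched on a hypothetical frame `rooms` (made-up values): selecting both columns before calling `.plot(kind = 'box')` draws one box per column on shared axes. The `Agg` backend line is only there so the sketch runs outside a notebook; it is not needed with `%matplotlib inline`.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, assumed only for running outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical bed and bath counts
rooms = pd.DataFrame({'BedCount': [1.0, 2.0, 2.0, 3.0, 4.0],
                      'BathCount': [1.0, 1.0, 2.0, 2.5, 3.0]})

# One box per selected column, drawn on the same axes
axes = rooms[['BedCount', 'BathCount']].plot(kind='box', figsize=(8, 8))
```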

## Part E

- Histograms

In [None]:
df.BedCount.plot(kind = 'hist', figsize = (8, 8))

> In the same plot, plot the histograms of `BedCount` and `BathCount`

In [None]:
# TODO
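
A sketch of one way to overlay the two histograms, again on a hypothetical frame `rooms` (made-up values). The `alpha` setting is a choice made here so that overlapping bars stay visible, and the `Agg` backend line is only needed outside a notebook.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, assumed only for running outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical bed and bath counts
rooms = pd.DataFrame({'BedCount': [1.0, 2.0, 2.0, 3.0, 4.0],
                      'BathCount': [1.0, 1.0, 2.0, 2.5, 3.0]})

# Both columns share the same bins and axes; alpha makes the overlap readable
axes = rooms[['BedCount', 'BathCount']].plot(kind='hist', alpha=0.5, figsize=(8, 8))
```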

## Part F

- `.mode()`

### `.mode()` - Compute the mode value(s)

In [None]:
df.mode()

From the [documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.mode.html): Gets the mode(s) of each element along the columns.  Empty if nothing has 2+ occurrences. Adds a row for each mode per label, fills in gaps with `NaN`.  Note that there could be multiple values returned in the columns (when more than one value share the maximum frequency), which is the reason why a dataframe is returned.
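
The `NaN` padding described above is easiest to see on a tiny example. In the sketch below, the hypothetical frame `toy` has two modes in one column and a single mode in the other:

```python
import pandas as pd

# Hypothetical data: 'beds' has two modes (2 and 3), 'baths' has one (1)
toy = pd.DataFrame({'beds': [1, 2, 2, 3, 3],
                    'baths': [1, 1, 1, 2, 3]})

modes = toy.mode()
print(modes)
```

The result has one row per mode of the column with the most modes; the `baths` column is filled with `NaN` in the second row because it has only one mode.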

In [None]:
df.Address[df.Address == '1 Mono St # B, San Francisco, CA']

In [None]:
df.Address[df.Address == '829 Folsom St UNIT 906, San Francisco, CA']

In [None]:
df[df.DateOfSale == '11/20/15'].shape[0]

In [None]:
(df.DateOfSale == '11/20/15').sum()
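
The last two cells count the same thing in different ways. A sketch on a hypothetical frame `dates` (made-up values) shows why they agree: summing a boolean mask counts its `True` entries, which are exactly the rows the filter keeps.

```python
import pandas as pd

# Hypothetical sale dates
dates = pd.DataFrame({'DateOfSale': ['11/20/15', '12/4/15', '11/20/15']})

mask = dates.DateOfSale == '11/20/15'   # boolean Series, one entry per row
n_by_filter = dates[mask].shape[0]      # filter first, then count the rows
n_by_sum = mask.sum()                   # True counts as 1, False as 0

print(n_by_filter, n_by_sum)
```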

## Part G

- `.corr()`
- Heatmaps
- Scatter plots
- Scatter matrices

In [None]:
df.corr()
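
To help read the correlation table, the sketch below builds a hypothetical frame `toy` in which `y` rises with `x` and `z` falls with it, then recomputes one entry by hand from the covariance and standard deviations (the default method is Pearson correlation):

```python
import numpy as np
import pandas as pd

# Hypothetical columns: y = 2x (perfect positive), z decreases as x increases
toy = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0],
                    'y': [2.0, 4.0, 6.0, 8.0],
                    'z': [4.0, 3.0, 2.0, 1.0]})

toy_corr = toy.corr()

# Pearson correlation by hand: covariance divided by the product of standard deviations
manual_xy = toy.x.cov(toy.y) / (toy.x.std() * toy.y.std())

print(toy_corr)
print(manual_xy)
```

The matrix is symmetric with ones on the diagonal, since every column is perfectly correlated with itself.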

### Heatmaps

In [None]:
corr = df.corr()

corr

In [None]:
# TODO

Let's pretty this up.

In [None]:
list(corr.columns)

In [None]:
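figure = plt.figure()
subplot = figure.add_subplot(1, 1, 1)
figure.colorbar(subplot.matshow(corr))
# Pin one tick per column so each label lines up with its row/column
subplot.set_xticks(range(len(corr.columns)))
subplot.set_yticks(range(len(corr.columns)))
subplot.set_xticklabels(list(corr.columns), rotation = 90)
subplot.set_yticklabels(list(corr.columns))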

### Scatter plots

In [None]:
df[ ['BedCount', 'BathCount'] ].plot(kind = 'scatter', x = 'BedCount', y = 'BathCount', s = 100, figsize = (8, 8))

### Scatter matrices

In [None]:
pd.plotting.scatter_matrix(df[ ['BedCount', 'BathCount'] ], diagonal = 'kde', s = 500, figsize = (8, 8))

In [None]:
pd.plotting.scatter_matrix(df[ ['SalePrice', 'Size'] ], s = 200, figsize = (8, 8))

## Part H

- `.value_counts()`
- `pd.crosstab()`

> Reproduce the `BedCount` histogram above.  For each possible bed count, how many properties share that bed count?

In [None]:
# TODO

> Be careful to check for `NaN` values!

In [None]:
# TODO
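
A sketch of one way to approach both prompts, on a hypothetical series `beds` (made-up values). By default `.value_counts()` drops `NaN`, so the missing entries only appear when `dropna=False` is passed:

```python
import numpy as np
import pandas as pd

# Hypothetical bed counts with two missing values
beds = pd.Series([1.0, 2.0, 2.0, np.nan, 3.0, 2.0, np.nan])

counts = beds.value_counts().sort_index()           # NaN excluded, ordered by bed count
counts_with_nan = beds.value_counts(dropna=False)   # NaN gets its own row

print(counts)
print(counts_with_nan)
```

Sorting by index puts the table in the same order as the histogram's x-axis; without it the rows are ordered by frequency.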

> Create a frequency table for `BathCount` over `BedCount`.

In [None]:
# TODO
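
A possible sketch using `pd.crosstab`, on a hypothetical frame `rooms` (made-up values). It assumes that "`BathCount` over `BedCount`" means bath counts as rows and bed counts as columns; swap the arguments for the other orientation.

```python
import pandas as pd

# Hypothetical bed and bath counts
rooms = pd.DataFrame({'BedCount': [1, 1, 2, 2, 2, 3],
                      'BathCount': [1, 1, 1, 2, 2, 2]})

# Rows: BathCount values; columns: BedCount values; cells: number of properties
table = pd.crosstab(rooms.BathCount, rooms.BedCount)
print(table)
```

Each cell counts the rows that share a given pair of values, and combinations that never occur are filled with 0.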