# Codealong 3

In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

pd.set_option('display.max_rows', 10)
pd.set_option('display.notebook_repr_html', True)
pd.set_option('display.max_columns', 10)

%matplotlib inline
plt.style.use('ggplot')

In [None]:
df = pd.read_csv(os.path.join('..', 'datasets', 'zillow-03-starter.csv'), index_col = 'ID')

In [None]:
df

## Part A
- `.mean()`
- `.var()`, `.std()`

### `Series.mean()` - Compute the `Series` mean value

In [None]:
df.SalePrice.mean()

In [None]:
df.Size.mean()

`Size` contains `nan` values, which `.mean()` skips by default.
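
To make the skipping behavior concrete, here is a minimal sketch on a made-up Series (the `sizes` values are hypothetical, not taken from the dataset). It relies on the `skipna` parameter of `Series.mean()`, which defaults to `True`.

```python
import numpy as np
import pandas as pd

# Hypothetical sizes; one value is missing
sizes = pd.Series([1000.0, np.nan, 2000.0])

# By default nan is skipped: (1000 + 2000) / 2
mean_skipping_nan = sizes.mean()

# With skipna=False a single nan makes the whole result nan
mean_with_nan = sizes.mean(skipna=False)

print(mean_skipping_nan)  # 1500.0
print(mean_with_nan)      # nan
```

Note that the denominator is the number of non-nan values (2), not the length of the Series (3).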

In [None]:
df.IsAStudio.mean()

About 3% of the properties sold are studios.  (Note that `.mean()` "drops" the properties with no studio information.)

### `DataFrame.mean()` - Compute the `DataFrame` mean value

In [None]:
df.mean()

`DataFrame.mean()` only averages the numerical columns; the address and date of sale aren't included.  (Recent pandas versions may require `df.mean(numeric_only = True)` to get this behavior.)

### `.var()` - Compute the unbiased variance (normalized by `N-1` by default)

In [None]:
df.var()

In [None]:
df.BedCount.var()

### `.std()` - Compute the sample standard deviation (normalized by `N-1` by default)

In [None]:
df.std()

In [None]:
df.BedCount.std()
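
The `N-1` normalization can be checked by hand. The sketch below uses a small made-up Series (`values` is hypothetical) and compares a manual computation against `.var()` and `.std()`; the `ddof` parameter controls the normalization and defaults to 1.

```python
import numpy as np
import pandas as pd

# Hypothetical data, chosen so the arithmetic is easy to follow
values = pd.Series([1.0, 2.0, 3.0, 4.0])
n = len(values)

# Sum of squared deviations from the mean (2.5)
squared_deviations = ((values - values.mean()) ** 2).sum()

sample_var = squared_deviations / (n - 1)   # normalized by N-1 (pandas default)
population_var = squared_deviations / n     # normalized by N

print(np.isclose(values.var(), sample_var))            # default: ddof=1
print(np.isclose(values.var(ddof=0), population_var))  # ddof=0 divides by N
print(np.isclose(values.std(), np.sqrt(sample_var)))   # std is the square root of var
```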

## Part B
- `.median()`
- `.count()`, `.dropna()`, `.isnull()`
- `.min()`, `.max()`
- `.quantile()`
- `.describe()`

### `.median()` - Compute the median value

In [None]:
df.median()

In [None]:
df.SalePrice.median()

### `.count()` - Compute the number of non-nan rows/observations and `.sum()` - Compute the sum of the values

In [None]:
df.BuiltInYear.mode()

In [None]:
df.count()

In [None]:
df.IsAStudio.count()

`count()` counts the number of non-nan values:

In [None]:
len(df.IsAStudio.dropna())

In [None]:
df.IsAStudio.isnull().sum()

That leaves 14 houses for which we don't know whether they are studios.

In [None]:
df.IsAStudio.sum()

29 properties are studios.
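
The relationship between `.count()`, `.isnull().sum()`, `.sum()` and `.mean()` can be sketched on a toy Series. The flags below are hypothetical and much smaller than the real `IsAStudio` column; the point is only that the mean of a 0/1 column is the sum divided by the count of known values.

```python
import numpy as np
import pandas as pd

# Hypothetical studio flags: 1 = studio, 0 = not a studio, nan = unknown
is_a_studio = pd.Series([1.0, 0.0, 0.0, np.nan, 0.0, np.nan])

known = is_a_studio.count()            # number of non-nan values
unknown = is_a_studio.isnull().sum()   # number of nan values
studios = is_a_studio.sum()            # nan is skipped, so this counts the 1s

# The share of studios among the properties with known status
share_of_studios = studios / known

print(known, unknown, studios)                 # 4 2 1.0
print(share_of_studios == is_a_studio.mean())  # True
```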

### `.min()` and `.max()` - Compute the minimum and maximum values

In [None]:
df.min()

In [None]:
df[df.SalePrice == df.SalePrice.min()]

A 7-bedroom/6-bathroom house for $1.  What a bargain!

In [None]:
df.max()

In [None]:
df[df.SalePrice == df.SalePrice.max()]
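
As a possible alternative to the boolean mask, `Series.idxmin()` and `Series.idxmax()` return the index label of the extreme value. The sketch below uses a hypothetical four-row frame (the labels and prices are made up) to show one difference: the mask keeps every tied row, while `idxmin()` returns only the first one.

```python
import pandas as pd

# Hypothetical sale prices; two homes tie for the lowest price
homes = pd.DataFrame({'SalePrice': [300, 1, 500, 1]},
                     index=['a', 'b', 'c', 'd'])

# Boolean mask: every row whose price equals the minimum
cheapest_all = homes[homes.SalePrice == homes.SalePrice.min()]

# idxmin: label of the first occurrence of the minimum
first_cheapest_label = homes.SalePrice.idxmin()

print(cheapest_all.index.tolist())  # ['b', 'd']
print(first_cheapest_label)         # b
```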

### `.quantile()` - Compute values at the given quantile

In [None]:
df.quantile(.5) 

In [None]:
df.median()

In [None]:
df.quantile(.25) 

In [None]:
df.quantile(.75)
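
A small sketch of how the quantiles relate to each other, on a hypothetical five-value Series (`prices` is made up). It assumes the default linear interpolation of `.quantile()`.

```python
import pandas as pd

# Hypothetical, already sorted for readability
prices = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

q25 = prices.quantile(.25)   # first quartile
q50 = prices.quantile(.5)    # second quartile, i.e. the median
q75 = prices.quantile(.75)   # third quartile

# The interquartile range covers the middle 50% of the values
iqr = q75 - q25

print(q25, q50, q75)           # 2.0 3.0 4.0
print(q50 == prices.median())  # True
print(iqr)                     # 2.0
```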

### `.describe()` - Generate various summary statistics

In [None]:
df.describe()

In [None]:
df.SalePrice.describe()
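
For a numerical Series, `.describe()` bundles the statistics covered so far. The following sketch, on a hypothetical Series, checks a few entries of the summary against the individual methods.

```python
import pandas as pd

# Hypothetical prices
prices = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

summary = prices.describe()

print(list(summary.index))
# ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']

# Each entry matches the corresponding method
print(summary['count'] == prices.count())        # True
print(summary['50%'] == prices.median())         # True
print(summary['75%'] == prices.quantile(.75))    # True
```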

## Part C
- Boxplots

In [None]:
df.SalePrice.plot(kind = 'box', figsize = (8, 8))

In [None]:
df[ ['BedCount', 'BathCount'] ].plot(kind = 'box', figsize = (8, 8))
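
To read a boxplot, it helps to know where the box and whiskers come from. The sketch below assumes matplotlib's default rule (whiskers reach the most extreme points within 1.5 times the interquartile range of the box; anything beyond is drawn as an outlier) and reproduces the bounds by hand on a hypothetical Series.

```python
import pandas as pd

# Hypothetical prices with one extreme value
prices = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])

q1 = prices.quantile(.25)   # bottom of the box
q3 = prices.quantile(.75)   # top of the box
iqr = q3 - q1               # height of the box

# Points outside these fences are plotted individually as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = prices[(prices < lower_fence) | (prices > upper_fence)]

print(q1, q3, iqr)               # 2.25 4.75 2.5
print(lower_fence, upper_fence)  # -1.5 8.5
print(outliers.tolist())         # [100.0]
```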

## Part D
- Histograms

In [None]:
df.BedCount.plot(kind = 'hist', figsize = (8, 8))

In [None]:
df[ ['BedCount', 'BathCount'] ].plot(kind = 'hist', figsize = (8, 8))

## Part E
- `.mode()`

### `.mode()` - Compute the mode value(s)

In [None]:
df.mode()

From the [documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.mode.html): Gets the mode(s) of each element along the columns.  Empty if nothing has 2+ occurrences. Adds a row for each mode per label, fills in gaps with nan.  Note that there could be multiple values returned in the columns (when more than one value share the maximum frequency), which is the reason why a dataframe is returned.  
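
The nan padding described above is easiest to see on a tiny example. The frame below is hypothetical: column `a` has two modes, column `b` has one, so `b` is padded with nan in the second row.

```python
import pandas as pd

# Hypothetical frame: 'a' has a tie between 1 and 2, 'b' has a single mode
frame = pd.DataFrame({'a': [1, 1, 2, 2],
                      'b': [5, 5, 5, 6]})

modes = frame.mode()

print(len(modes))                     # 2 rows, one per mode of 'a'
print(modes['a'].tolist())            # both modes of 'a', in sorted order
print(modes['b'].iloc[0])             # the single mode of 'b'
print(pd.isnull(modes['b'].iloc[1]))  # True: the gap is filled with nan
```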

In [None]:
df.Address[df.Address == '1 Mono St # B, San Francisco, CA']

In [None]:
df.Address[df.Address == '829 Folsom St UNIT 906, San Francisco, CA']

In [None]:
len(df[df.DateOfSale == '11/20/15'])

In [None]:
(df.DateOfSale == '11/20/15').sum()

In [None]:
bed_counts = df.BedCount.dropna().unique()

In [None]:
bed_counts

In [None]:
for bed_count in np.sort(bed_counts):
    home_count = (df.BedCount == bed_count).sum()
    print('{} homes have {} bedrooms'.format(home_count, bed_count))

Note: That's the same information we got from the histogram above.
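
The same tally can probably be obtained in one step with `Series.value_counts()`, which skips nan by default. A minimal sketch on hypothetical bedroom counts:

```python
import numpy as np
import pandas as pd

# Hypothetical bedroom counts, including one unknown
bed_count = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, np.nan])

# Count the homes per distinct value, then order by bedroom count
homes_per_bed_count = bed_count.value_counts().sort_index()

for beds, homes in homes_per_bed_count.items():
    print('{} homes have {} bedrooms'.format(homes, beds))
```

Unlike the explicit loop, this needs neither `.dropna()` nor `.unique()`.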

In [None]:
df.BedCount.isnull().sum()

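Be careful when checking for `nan` values.  `nan` never compares equal to anything, not even to itself, so the following comparison matches no rows; use `.isnull()` instead: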

In [None]:
(df.BedCount == np.nan).sum()
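
A short sketch of the pitfall on a hypothetical Series with two missing values: the equality test finds none of them, while `.isnull()` finds both.

```python
import numpy as np
import pandas as pd

# Hypothetical bedroom counts with two missing values
bed_count = pd.Series([2.0, np.nan, 3.0, np.nan])

print(np.nan == np.nan)  # False: nan is not equal to itself

equal_to_nan = (bed_count == np.nan).sum()  # the comparison is False everywhere
missing = bed_count.isnull().sum()          # the reliable way to count nan

print(equal_to_nan)  # 0
print(missing)       # 2
```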

## Part F
- `.corr()`
- Heatmaps
- Scatter plots
- Scatter matrices

In [None]:
df.corr()
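
By default `.corr()` computes the Pearson correlation coefficient between each pair of columns. As a rough check, the sketch below recomputes one coefficient by hand on a hypothetical four-row frame (the values are made up; only the column names mirror the dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical bedroom and bathroom counts
homes = pd.DataFrame({'BedCount': [1.0, 2.0, 3.0, 4.0],
                      'BathCount': [1.0, 3.0, 2.0, 5.0]})

# Deviations from each column's mean
x = homes.BedCount - homes.BedCount.mean()
y = homes.BathCount - homes.BathCount.mean()

# Pearson r: sum of cross products over the product of the deviation norms
manual_r = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

pandas_r = homes.corr().loc['BedCount', 'BathCount']

print(np.isclose(manual_r, pandas_r))  # True
```

The diagonal of the correlation matrix is always 1, since every column is perfectly correlated with itself.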

### Heatmaps

In [None]:
plt.matshow(df.corr())

In [None]:
corr = df.corr()

figure = plt.figure()
subplot = figure.add_subplot(1, 1, 1)
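figure.colorbar(subplot.matshow(corr))
# Put one tick on every row/column so that the labels line up with the cells
subplot.set_xticks(range(len(corr.columns)))
subplot.set_yticks(range(len(corr.columns)))
subplot.set_xticklabels(corr.columns, rotation = 90)
subplot.set_yticklabels(corr.columns)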
plt.show()

### Scatter plots

In [None]:
df[ ['BedCount', 'BathCount'] ].plot(kind = 'scatter', x = 'BedCount', y = 'BathCount', s = 100, figsize = (8, 8))

### Scatter matrices

In [1]:
pd.plotting.scatter_matrix(df[ ['BedCount', 'BathCount'] ], diagonal = 'kde', s = 500, figsize = (8, 8))

In [None]:
pd.plotting.scatter_matrix(df[ ['SalePrice', 'Size'] ], s = 200, figsize = (8, 8))