# Data Wrangling in Pandas

This session draws primarily on Chapter 7 of Python for Data Analysis. It covers methods that are used heavily in 'data wrangling': the data manipulation that is often needed to transform raw data into a form that is useful for analysis. We'll stick to the data and examples used in the book for most of this session, since the examples are clearer on the tiny datasets. After that we will work through some of these methods again using real data.

Key methods covered include:

* Merging and Concatenating
* Reshaping data
* Data transformations
* Categorization
* Detecting and Filtering Outliers
* Creating Dummy Variables


In [1]:
import pandas as pd
import numpy as np

## Merging

Merging two datasets is a very common operation in preparing data for analysis. It generally means adding columns from one table to columns from another, where the value of some key, or merge field, matches.

Let's begin by creating two simple DataFrames to be merged.

In [2]:
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'],'data2': range(3)})
print(df1)
print(df2)

  key  data1
0   b      0
1   b      1
2   a      2
3   c      3
4   a      4
5   a      5
6   b      6
  key  data2
0   a      0
1   b      1
2   d      2


Here is a many-to-one merge. The join field is implicit: merge uses the columns that the two DataFrames have in common. Note that they share some values of the key field (a, b), but do not share key values c and d. What do you expect to happen when we merge them? The result contains the values from both inputs where they both have a value of the merge field, which is 'key' in this example. The default behavior is that the key value has to be in both inputs to be kept. In set terms it would be an intersection of the two sets.

In [None]:
# The default is an inner merge, which drops every key that has no match in the other DataFrame.
# An outer merge keeps everything; left and right merges keep all keys from one side.
# Use the how= parameter to choose 'inner', 'outer', 'left' or 'right'.

In [3]:
pd.merge(df1,df2)

  key  data1  data2
0   b      0      1
1   b      1      1
2   b      6      1
3   a      2      0
4   a      4      0
5   a      5      0


In [4]:
pd.merge(df1,df2, how = 'right')

  key  data1  data2
0   b    0.0      1
1   b    1.0      1
2   b    6.0      1
3   a    2.0      0
4   a    4.0      0
5   a    5.0      0
6   d    NaN      2


Here is the first (inner) merge again, this time making the join field explicit.


In [4]:
pd.merge(df1,df2, on='key')

  key  data1  data2
0   b      0      1
1   b      1      1
2   b      6      1
3   a      2      0
4   a      4      0
5   a      5      0


In [None]:
# What if a key value appears more than once in both dataframes? This is a many-to-many merge.
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],'data1': range(7)})
df3 = pd.DataFrame({'key': ['a', 'b', 'b', 'd'],'data2': range(4)})
print(df1)
print(df3)
pd.merge(df1,df3, on='key')
#This produces a cartesian product of the number of occurrences of each key value in both dataframes:
# (b shows up 3 times in df1 and 2 times in df3, so we get 6 occurrences in the result of the merge)
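The row count of a many-to-many merge can be predicted from the key counts on each side. The sketch below reuses the values of df1 and df3 under the illustrative names `left` and `right`; `expected_rows` is a hypothetical helper variable, not something from the book.

```python
import pandas as pd

left = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
right = pd.DataFrame({'key': ['a', 'b', 'b', 'd'], 'data2': range(4)})

# Count how often each key occurs on each side.
left_counts = left['key'].value_counts()
right_counts = right['key'].value_counts()

# For an inner merge, each shared key contributes (left count * right count) rows.
# Keys missing from one side become NaN in the product and are dropped.
expected_rows = (left_counts * right_counts).dropna().astype(int)

merged = pd.merge(left, right, on='key')
print(expected_rows.to_dict())
print(len(merged), expected_rows.sum())
```

If the two printed totals disagree with what you expected, that is usually a sign that a key you assumed to be unique is duplicated on one side.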

In [None]:
# There are several types of joins: left, right, inner, and outer. Let's compare them.
# How does a 'left' join compare to our initial join?  Note that it keeps the result if it shows up in df1,
# regardless of whether it also shows up in df2.  It fills in a value of NaN for the missing value from df2.
pd.merge(df1,df3, on='key', how='left')

In [None]:
#How does a 'right' join compare?  Same idea, but this time it keeps a result if it shows up in df2, regardless
# of whether it also shows up in df1.
pd.merge(df1,df3, on='key', how='right')

In [None]:
#How does an 'inner' join compare?
pd.merge(df1,df3, on='key', how='inner')
# 'inner' is the default value of how, so this matches pd.merge(df1,df3, on='key')

In [None]:
#How does an 'outer' join compare?  If inner joins are like an intersection of two sets, outer joins are unions.
pd.merge(df1,df3, on='key', how='outer')
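One way to compare the four join types on a single result is the `indicator` parameter of `pd.merge`, which labels where each row came from. This is a minimal sketch using the same df1 and df3 values; the variable name `outer` is only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df3 = pd.DataFrame({'key': ['a', 'b', 'b', 'd'], 'data2': range(4)})

# indicator=True adds a '_merge' column holding 'left_only', 'right_only'
# or 'both' for every row of the result.
outer = pd.merge(df1, df3, on='key', how='outer', indicator=True)

# Rows marked 'both' are what an inner join keeps; a left join adds the
# 'left_only' rows and a right join adds the 'right_only' rows.
print(outer['_merge'].value_counts())
print(outer[outer['_merge'] != 'both'])
```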

In [None]:
#What if the join fields have different names?  No problem - just specify the names.
df4 = pd.DataFrame({'key_1': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],'data1': range(7)})
df5 = pd.DataFrame({'key_2': ['a', 'b', 'b', 'd'],'data2': range(4)})
pd.merge(df4,df5, left_on='key_1', right_on='key_2')

In [None]:
# Here is an example that uses a combination of a data column and an index to merge two dataframes.
df4 = pd.DataFrame({'key_1': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],'data1': range(7)})
df5 = pd.DataFrame({'data2': [4,6,8,10]}, index=['a','b','c','d'])
pd.merge(df4,df5, left_on='key_1', right_index=True)

## Concatenating

In [None]:
# Concatenating can append rows, or columns, depending on which axis you use. Default is 0
s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
s3 = pd.Series([5, 6], index=['f', 'g'])
pd.concat([s1, s2, s3])
# Since we are concatenating series on axis 0, this creates a longer series, appending each of the three series

In [None]:
# What if we concatenate on axis 1?
pd.concat([s1, s2, s3], axis=1)

In [None]:
# Outer join is the default:
pd.concat([s1, s2, s3], axis=1, join='outer')

In [None]:
# What would an inner join produce?
pd.concat([s1, s2, s3], axis=1, join='inner')

In [None]:
# We need some overlapping values to have the inner join produce non-empty results
s4 = pd.Series([4, 5, 6], index=['c', 'd', 'e'])
s5 = pd.Series([1, 2, 3], index=['d', 'e', 'f'])
s6 = pd.Series([7, 8, 9, 10], index=['d', 'e', 'f', 'g'])
pd.concat([s4, s5, s6], axis=1, join='outer')

In [None]:
# Here is the inner join 
pd.concat([s4, s5, s6], axis=1, join='inner')
# Note that it contains only entries that overlap in all three series.

## Reshaping with Hierarchical Indexing

In [6]:
data = pd.DataFrame(np.arange(6).reshape((2, 3)),
                 index=pd.Index(['Ohio', 'Colorado'], name='state'),
                 columns=pd.Index(['one', 'two', 'three'], name='number'))
data

number    one  two  three
state                    
Ohio        0    1      2
Colorado    3    4      5


In [7]:
# Stack pivots the columns into rows, producing a Series with a hierarchical index:
result = data.stack()
result

state     number
Ohio      one       0
          two       1
          three     2
Colorado  one       3
          two       4
          three     5
dtype: int64

In [8]:
# Unstack reverses this process:
result.unstack()

number    one  two  three
state                    
Ohio        0    1      2
Colorado    3    4      5


See also the related pivot method, which reshapes long-format data (one row per observation) into a wide table.
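As a rough illustration of how pivot relates to stack and unstack, the sketch below builds a long-format version of the Ohio/Colorado table and reshapes it both ways. The names `long_df`, `wide` and `via_unstack` and the column name `value` are assumptions made for this example.

```python
import pandas as pd

# Long format: one row per (state, number) pair.
long_df = pd.DataFrame({
    'state': ['Ohio', 'Ohio', 'Ohio', 'Colorado', 'Colorado', 'Colorado'],
    'number': ['one', 'two', 'three'] * 2,
    'value': range(6),
})

# pivot picks one column for the row labels, one for the column labels
# and one for the cell values.
wide = long_df.pivot(index='state', columns='number', values='value')

# The same reshape expressed with a hierarchical index and unstack.
via_unstack = long_df.set_index(['state', 'number'])['value'].unstack()

print(wide)
print(via_unstack)
```

Note that the row and column labels of the result may come back in sorted order rather than in the order they appear in the long table.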

## Data Transformations

In [None]:
# Start with a dataframe containing some duplicate values
data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4,'k2': [1, 1, 2, 3, 3, 4, 99]})
data

In [None]:
# How to see which rows contain duplicate values
data.duplicated()

In [None]:
# How to remove duplicate values
data.drop_duplicates()

In [None]:
#If 99 is a code for missing data, we could replace any such values with NaNs
data['k2'].replace(99,np.nan)

## Categorization (binning)

In [None]:
# Let's look at how to create categories of data using ranges to bin the data using cut
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
type(cats)

In [None]:
cats.categories

In [None]:
cats.codes

In [None]:
pd.Series(cats).value_counts()

In [None]:
# Consistent with mathematical notation for intervals, a parenthesis means that the side is open while the 
#square bracket means it is closed (inclusive). Which side is closed can be changed by passing right=False:
cats = pd.cut(ages, bins, right=False)
print(ages)
print(pd.Series(cats).value_counts())

## Removing Outliers

In [None]:
# Start by creating a dataframe with 4 columns of 1,000 random numbers
# We'll use a fixed seed for the random number generator to get repeatable results
np.random.seed(12345)
data = pd.DataFrame(np.random.randn(1000, 4))
data.describe()

In [None]:
# This identifies any values in column 3 with absolute values > 3
col = data[3]
col[np.abs(col) > 3]

In [None]:
# This identifies all the rows with any column containing absolute values > 3
data[(np.abs(data) > 3).any(axis=1)]

In [None]:
# Now we can cap the values at -3 to 3 using this:
data[np.abs(data) > 3] = np.sign(data) * 3
data.describe()
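The capping step can also be written with `DataFrame.clip`. The sketch below compares it with the sign-based assignment on the same kind of random data; it uses a local `RandomState` (named `rng` here as an assumption) instead of the global seed, and the names `capped` and `manual` are illustrative.

```python
import numpy as np
import pandas as pd

# Local random generator so the example does not touch global state.
rng = np.random.RandomState(12345)
data = pd.DataFrame(rng.randn(1000, 4))

# clip bounds every value to the interval [-3, 3] and returns a new DataFrame.
capped = data.clip(lower=-3, upper=3)

# The sign-based approach, applied to a copy so both results can be compared.
manual = data.copy()
manual[np.abs(manual) > 3] = np.sign(manual) * 3

print(np.allclose(capped.values, manual.values))
print(capped.describe())
```

Unlike the in-place assignment, clip leaves the original DataFrame untouched, which makes it easier to compare before and after.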

## Computing Dummy Variables

In [9]:
df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],'data1': range(6)})
df

  key  data1
0   b      0
1   b      1
2   a      2
3   c      3
4   a      4
5   b      5


In [None]:
# This generates dummy variables for each value of key
# Dummy variables are useful in statistical modeling, to have 0/1 indicator
# variables for the presence of some condition
pd.get_dummies(df['key'])

In [None]:
# This generates dummy variables for each value of key and appends these to the dataframe
dummies = pd.get_dummies(df['key'], prefix='key')
df_with_dummy = df[['data1']].join(dummies)
df_with_dummy

Notice that we used join instead of merge. The join method is very similar to merge, but by default it joins on the indexes rather than on shared columns. From the documentation:

http://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging
merge is a function in the pandas namespace, and it is also available as a DataFrame instance method, with the calling DataFrame being implicitly considered the left object in the join.

The related DataFrame.join method, uses merge internally for the index-on-index and index-on-column(s) joins, but joins on indexes by default rather than trying to join on common columns (the default behavior for merge). If you are joining on index, you may wish to use DataFrame.join to save yourself some typing
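To make the relationship concrete, here is a small sketch showing that joining on the index with `DataFrame.join` gives the same table as `pd.merge` with `left_index=True` and `right_index=True`. The variable names `joined` and `merged` are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})
dummies = pd.get_dummies(df['key'], prefix='key')

# join aligns the two frames on their index by default.
joined = df[['data1']].join(dummies)

# The equivalent merge has to be told to use the index on both sides.
merged = pd.merge(df[['data1']], dummies, left_index=True, right_index=True)

print(joined.equals(merged))
print(joined)
```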

## A bit more
1. Filter out records with more than 4 bedrooms
2. Create dummy variables for each bedroom count (e.g. bed_1 would have 1 for rows with 1 bedroom, 0 for others), and merge them with the dataframe
3. Filter sqft < 500 and > 3000
4. Create a set of 5 bins for price and do counts of how many records are in each category

In [2]:
import pandas as pd
import numpy as np
df = pd.read_csv("items.csv",encoding = 'utf-8')

In [3]:
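# Strip the literal dollar sign; regex=False keeps '$' from being read as a regex end-of-string anchor
df['price'] = df['price'].str.replace('$', '', regex=False)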

In [4]:
df = df[df['price'].notnull()]

In [5]:
df['price'] = df['price'].astype(int)
df

,neighborhood,title,price,bedrooms,pid,longitude,date,link,latitude,sqft,sourcepage
0,(SOMA / south beach),"1bed + Den, 1bath at Mission Bay",2895,/ 1br - 950ft² -,4046628359,-122.399663,Sep 4 2013,/sfc/apa/4046628359.html,37.774623,/ 1br - 950ft² -,http://sfbay.craigslist.org/sfc/apa/
1,(SOMA / south beach),Love where you live!,3354,/ 1br - 710ft² -,4046761563,,Sep 4 2013,/sfc/apa/4046761563.html,,/ 1br - 710ft² -,http://sfbay.craigslist.org/sfc/apa/
2,(inner sunset / UCSF),We Welcome Your Furry Friends! Call Today!,2865,/ 1br - 644ft² -,4046661504,-122.470727,Sep 4 2013,/sfc/apa/4046661504.html,37.765739,/ 1br - 644ft² -,http://sfbay.craigslist.org/sfc/apa/
3,(financial district),Golden Gateway Commons | 2BR + office townhous...,5500,/ 2br - 1450ft² -,4036170429,,Sep 4 2013,/sfc/apa/4036170429.html,,/ 2br - 1450ft² -,http://sfbay.craigslist.org/sfc/apa/
4,(lower nob hill),Experience Luxury Living in San Fransisco!,3892,/ 2br -,4046732678,,Sep 4 2013,/sfc/apa/4046732678.html,,/ 2br -,http://sfbay.craigslist.org/sfc/apa/
5,(sunset / parkside),"$1250 - 1 bdrm, 1 bath",1250,/ 1br -,4046731229,,Sep 4 2013,/sfc/apa/4046731229.html,,/ 1br -,http://sfbay.craigslist.org/sfc/apa/
6,(SOMA / south beach),Made For The Die-hard Giants Fan,3249,/ 549ft² -,4046730047,-122.389798,Sep 4 2013,/sfc/apa/4046730047.html,37.774192,/ 549ft² -,http://sfbay.craigslist.org/sfc/apa/
7,(russian hill),Open Concept 1bed 1bath,2690,/ 1br - 781ft² -,4046570245,-122.420787,Sep 4 2013,/sfc/apa/4046570245.html,37.796034,/ 1br - 781ft² -,http://sfbay.craigslist.org/sfc/apa/
8,,"Contemporary, charming 2bds/1ba with private d...",2850,/ 2br -,4006732632,-122.457100,Sep 4 2013,/sfc/apa/4006732632.html,37.735400,/ 2br -,http://sfbay.craigslist.org/sfc/apa/
9,(pacific heights),"2bd/2.5ba, 2 car tandem parking @ 1998 Broadwa...",6500,/ 2br - 1400ft² -,4046018830,-122.429850,Sep 4 2013,/sfc/apa/4046018830.html,37.794973,/ 2br - 1400ft² -,http://sfbay.craigslist.org/sfc/apa/index200.html


In [6]:
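# The bedrooms column needs to be integers, e.g. '/ 1br - 950ft² -' -> 1.
def bedrooms(value):
    # Non-string values (such as NaN) fall through and return None.
    if isinstance(value,str):
        i = value.find('br')
        if i == -1:
            # No bedroom count in the text.
            return None
        else:
            # The number sits between '/ ' and 'br'.
            k = value.find('/') + 2
            return int(value[k:i])
df['bedrooms'] = df['bedrooms'].map(bedrooms)

# Convert sqft to integers, e.g. '/ 1br - 950ft² -' -> 950.
def sqft(value):
    if isinstance(value, str):
        i = value.find('f')
        if i == -1:
            # No square footage in the text.
            return None
        else:
            k = value.find('br')
            if k==-1:
                # No bedroom count: the number follows '/ '.
                z = value.find('/')+2
            else:
                # Otherwise the number follows the first '- '.
                z = value.find('-')+2
            return int(value[z:i])
df['sqft'] = df['sqft'].map(sqft)

# Keep listings with fewer than 4 bedrooms; rows with a missing bedroom count are dropped too.
less_than_four = df[df['bedrooms']<4]
less_than_four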

,neighborhood,title,price,bedrooms,pid,longitude,date,link,latitude,sqft,sourcepage
0,(SOMA / south beach),"1bed + Den, 1bath at Mission Bay",2895,1.0,4046628359,-122.399663,Sep 4 2013,/sfc/apa/4046628359.html,37.774623,950.0,http://sfbay.craigslist.org/sfc/apa/
1,(SOMA / south beach),Love where you live!,3354,1.0,4046761563,,Sep 4 2013,/sfc/apa/4046761563.html,,710.0,http://sfbay.craigslist.org/sfc/apa/
2,(inner sunset / UCSF),We Welcome Your Furry Friends! Call Today!,2865,1.0,4046661504,-122.470727,Sep 4 2013,/sfc/apa/4046661504.html,37.765739,644.0,http://sfbay.craigslist.org/sfc/apa/
3,(financial district),Golden Gateway Commons | 2BR + office townhous...,5500,2.0,4036170429,,Sep 4 2013,/sfc/apa/4036170429.html,,1450.0,http://sfbay.craigslist.org/sfc/apa/
4,(lower nob hill),Experience Luxury Living in San Fransisco!,3892,2.0,4046732678,,Sep 4 2013,/sfc/apa/4046732678.html,,,http://sfbay.craigslist.org/sfc/apa/
5,(sunset / parkside),"$1250 - 1 bdrm, 1 bath",1250,1.0,4046731229,,Sep 4 2013,/sfc/apa/4046731229.html,,,http://sfbay.craigslist.org/sfc/apa/
7,(russian hill),Open Concept 1bed 1bath,2690,1.0,4046570245,-122.420787,Sep 4 2013,/sfc/apa/4046570245.html,37.796034,781.0,http://sfbay.craigslist.org/sfc/apa/
8,,"Contemporary, charming 2bds/1ba with private d...",2850,2.0,4006732632,-122.457100,Sep 4 2013,/sfc/apa/4006732632.html,37.735400,,http://sfbay.craigslist.org/sfc/apa/
9,(pacific heights),"2bd/2.5ba, 2 car tandem parking @ 1998 Broadwa...",6500,2.0,4046018830,-122.429850,Sep 4 2013,/sfc/apa/4046018830.html,37.794973,1400.0,http://sfbay.craigslist.org/sfc/apa/index200.html
10,(SOMA / south beach),"Stunning Modern GARDEN Loft! Upgrades, Views &...",4100,1.0,4045981009,-122.402387,Sep 4 2013,/sfc/apa/4045981009.html,37.781055,,http://sfbay.craigslist.org/sfc/apa/index200.html
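The same parsing can be sketched with regular expressions through `Series.str.extract`, which avoids the manual index arithmetic. The sample strings below are copied from the listing text shown earlier; the Series name `raw` and the two patterns are assumptions for this example, not the approach used in the cells above.

```python
import pandas as pd

# A few raw values in the same shape as the bedrooms/sqft columns.
raw = pd.Series(['/ 1br - 950ft² -', '/ 2br -', '/ 549ft² -', None])

# Capture the digits directly before 'br' and before 'ft'.
# Rows without a match (or with missing text) come back as NaN.
bedrooms = raw.str.extract(r'(\d+)br', expand=False).astype(float)
sqft = raw.str.extract(r'(\d+)ft', expand=False).astype(float)

print(pd.DataFrame({'raw': raw, 'bedrooms': bedrooms, 'sqft': sqft}))
```

The results are floats rather than integers because NaN marks the missing values, which matches the 1.0 and 950.0 style values in the output above.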


In [7]:
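# Dummy variables: one indicator column per bedroom count (bed_1.0, bed_2.0, ...)
dummy_variables = pd.get_dummies(less_than_four['bedrooms'], prefix='bed')
df_with_dummy = less_than_four[['bedrooms']].join(dummy_variables)
# Merge with the dataframe. The only shared column is 'bedrooms', which is not unique,
# so this is a many-to-many merge: every listing is paired with every dummy row that
# has the same bedroom count, which is why rows repeat in the output.
# less_than_four.join(dummy_variables) would align on the index and keep one row per listing.
df_merge = pd.merge(less_than_four, df_with_dummy,how='outer')
df_merge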

,neighborhood,title,price,bedrooms,pid,longitude,date,link,latitude,sqft,sourcepage,bed_1.0,bed_2.0,bed_3.0
0,(SOMA / south beach),"1bed + Den, 1bath at Mission Bay",2895,1.0,4046628359,-122.399663,Sep 4 2013,/sfc/apa/4046628359.html,37.774623,950.0,http://sfbay.craigslist.org/sfc/apa/,1,0,0
1,(SOMA / south beach),"1bed + Den, 1bath at Mission Bay",2895,1.0,4046628359,-122.399663,Sep 4 2013,/sfc/apa/4046628359.html,37.774623,950.0,http://sfbay.craigslist.org/sfc/apa/,1,0,0
2,(SOMA / south beach),"1bed + Den, 1bath at Mission Bay",2895,1.0,4046628359,-122.399663,Sep 4 2013,/sfc/apa/4046628359.html,37.774623,950.0,http://sfbay.craigslist.org/sfc/apa/,1,0,0
3,(SOMA / south beach),"1bed + Den, 1bath at Mission Bay",2895,1.0,4046628359,-122.399663,Sep 4 2013,/sfc/apa/4046628359.html,37.774623,950.0,http://sfbay.craigslist.org/sfc/apa/,1,0,0
4,(SOMA / south beach),"1bed + Den, 1bath at Mission Bay",2895,1.0,4046628359,-122.399663,Sep 4 2013,/sfc/apa/4046628359.html,37.774623,950.0,http://sfbay.craigslist.org/sfc/apa/,1,0,0
5,(SOMA / south beach),"1bed + Den, 1bath at Mission Bay",2895,1.0,4046628359,-122.399663,Sep 4 2013,/sfc/apa/4046628359.html,37.774623,950.0,http://sfbay.craigslist.org/sfc/apa/,1,0,0
6,(SOMA / south beach),"1bed + Den, 1bath at Mission Bay",2895,1.0,4046628359,-122.399663,Sep 4 2013,/sfc/apa/4046628359.html,37.774623,950.0,http://sfbay.craigslist.org/sfc/apa/,1,0,0
7,(SOMA / south beach),"1bed + Den, 1bath at Mission Bay",2895,1.0,4046628359,-122.399663,Sep 4 2013,/sfc/apa/4046628359.html,37.774623,950.0,http://sfbay.craigslist.org/sfc/apa/,1,0,0
8,(SOMA / south beach),"1bed + Den, 1bath at Mission Bay",2895,1.0,4046628359,-122.399663,Sep 4 2013,/sfc/apa/4046628359.html,37.774623,950.0,http://sfbay.craigslist.org/sfc/apa/,1,0,0
9,(SOMA / south beach),"1bed + Den, 1bath at Mission Bay",2895,1.0,4046628359,-122.399663,Sep 4 2013,/sfc/apa/4046628359.html,37.774623,950.0,http://sfbay.craigslist.org/sfc/apa/,1,0,0
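The repeated rows in this output come from merging on a column whose values are not unique. The sketch below reproduces the effect on a tiny made-up table (the frame `listings` and its four rows are hypothetical) and contrasts it with an index-aligned join.

```python
import pandas as pd

# Hypothetical miniature of the listings table.
listings = pd.DataFrame({'price': [2895, 3354, 5500, 3892],
                         'bedrooms': [1.0, 1.0, 2.0, 2.0]})
dummies = pd.get_dummies(listings['bedrooms'], prefix='bed')

# Index-aligned join: one output row per listing.
by_index = listings.join(dummies)

# Merge on the shared 'bedrooms' column: a many-to-many merge, so each listing
# is repeated once for every row with the same bedroom count.
by_column = pd.merge(listings, listings[['bedrooms']].join(dummies), how='outer')

print(len(listings), len(by_index), len(by_column))
```

With two listings per bedroom count, the column-based merge returns 2 x 2 rows per count, which is the same Cartesian-product behavior described in the merging section.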


In [8]:
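# Select listings smaller than 500 sqft or larger than 3000 sqft.
# Despite its name, sqft_500_to_3000 holds the rows that fall outside the 500-3000 range.
sqft500 = df_merge[df_merge['sqft']<500]
sqft3000 = df_merge[df_merge['sqft']>3000]
sqft_500_to_3000 = pd.concat([sqft500, sqft3000])
sqft_500_to_3000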

,neighborhood,title,price,bedrooms,pid,longitude,date,link,latitude,sqft,sourcepage,bed_1.0,bed_2.0,bed_3.0
29670,(SOMA / south beach),Sunny Garden Suite with Private Patio & Pet Fr...,2995,1.0,4027078978,,Sep 4 2013,/sfc/apa/4027078978.html,,420.0,http://sfbay.craigslist.org/sfc/apa/index100.html,1,0,0
29671,(SOMA / south beach),Sunny Garden Suite with Private Patio & Pet Fr...,2995,1.0,4027078978,,Sep 4 2013,/sfc/apa/4027078978.html,,420.0,http://sfbay.craigslist.org/sfc/apa/index100.html,1,0,0
29672,(SOMA / south beach),Sunny Garden Suite with Private Patio & Pet Fr...,2995,1.0,4027078978,,Sep 4 2013,/sfc/apa/4027078978.html,,420.0,http://sfbay.craigslist.org/sfc/apa/index100.html,1,0,0
29673,(SOMA / south beach),Sunny Garden Suite with Private Patio & Pet Fr...,2995,1.0,4027078978,,Sep 4 2013,/sfc/apa/4027078978.html,,420.0,http://sfbay.craigslist.org/sfc/apa/index100.html,1,0,0
29674,(SOMA / south beach),Sunny Garden Suite with Private Patio & Pet Fr...,2995,1.0,4027078978,,Sep 4 2013,/sfc/apa/4027078978.html,,420.0,http://sfbay.craigslist.org/sfc/apa/index100.html,1,0,0
29675,(SOMA / south beach),Sunny Garden Suite with Private Patio & Pet Fr...,2995,1.0,4027078978,,Sep 4 2013,/sfc/apa/4027078978.html,,420.0,http://sfbay.craigslist.org/sfc/apa/index100.html,1,0,0
29676,(SOMA / south beach),Sunny Garden Suite with Private Patio & Pet Fr...,2995,1.0,4027078978,,Sep 4 2013,/sfc/apa/4027078978.html,,420.0,http://sfbay.craigslist.org/sfc/apa/index100.html,1,0,0
29677,(SOMA / south beach),Sunny Garden Suite with Private Patio & Pet Fr...,2995,1.0,4027078978,,Sep 4 2013,/sfc/apa/4027078978.html,,420.0,http://sfbay.craigslist.org/sfc/apa/index100.html,1,0,0
29678,(SOMA / south beach),Sunny Garden Suite with Private Patio & Pet Fr...,2995,1.0,4027078978,,Sep 4 2013,/sfc/apa/4027078978.html,,420.0,http://sfbay.craigslist.org/sfc/apa/index100.html,1,0,0
29679,(SOMA / south beach),Sunny Garden Suite with Private Patio & Pet Fr...,2995,1.0,4027078978,,Sep 4 2013,/sfc/apa/4027078978.html,,420.0,http://sfbay.craigslist.org/sfc/apa/index100.html,1,0,0
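A range filter like this can also be expressed as a single boolean mask. This is a minimal sketch on made-up square footage values (the frame `homes` is hypothetical) showing both selections: the rows outside the 500 to 3000 range and the rows inside it.

```python
import pandas as pd

# Hypothetical square footage values, including one missing value.
homes = pd.DataFrame({'sqft': [420.0, 950.0, None, 3200.0, 500.0, 3000.0]})

# Rows outside the range: combine two conditions with | (element-wise "or").
outside = (homes['sqft'] < 500) | (homes['sqft'] > 3000)

# Rows inside the range: between is inclusive on both ends by default.
inside = homes['sqft'].between(500, 3000)

print(homes[outside])
print(homes[inside])
```

Rows with a missing sqft fail both conditions, so they appear in neither selection.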


In [12]:
# Price bins: first make sure price has no missing values and is an integer
df=df[df['price'].notnull()]

In [13]:
df['price'] = df['price'].astype(int)
df

,neighborhood,title,price,bedrooms,pid,longitude,date,link,latitude,sqft,sourcepage
0,(SOMA / south beach),"1bed + Den, 1bath at Mission Bay",2895,1.0,4046628359,-122.399663,Sep 4 2013,/sfc/apa/4046628359.html,37.774623,950.0,http://sfbay.craigslist.org/sfc/apa/
1,(SOMA / south beach),Love where you live!,3354,1.0,4046761563,,Sep 4 2013,/sfc/apa/4046761563.html,,710.0,http://sfbay.craigslist.org/sfc/apa/
2,(inner sunset / UCSF),We Welcome Your Furry Friends! Call Today!,2865,1.0,4046661504,-122.470727,Sep 4 2013,/sfc/apa/4046661504.html,37.765739,644.0,http://sfbay.craigslist.org/sfc/apa/
3,(financial district),Golden Gateway Commons | 2BR + office townhous...,5500,2.0,4036170429,,Sep 4 2013,/sfc/apa/4036170429.html,,1450.0,http://sfbay.craigslist.org/sfc/apa/
4,(lower nob hill),Experience Luxury Living in San Fransisco!,3892,2.0,4046732678,,Sep 4 2013,/sfc/apa/4046732678.html,,,http://sfbay.craigslist.org/sfc/apa/
5,(sunset / parkside),"$1250 - 1 bdrm, 1 bath",1250,1.0,4046731229,,Sep 4 2013,/sfc/apa/4046731229.html,,,http://sfbay.craigslist.org/sfc/apa/
6,(SOMA / south beach),Made For The Die-hard Giants Fan,3249,,4046730047,-122.389798,Sep 4 2013,/sfc/apa/4046730047.html,37.774192,549.0,http://sfbay.craigslist.org/sfc/apa/
7,(russian hill),Open Concept 1bed 1bath,2690,1.0,4046570245,-122.420787,Sep 4 2013,/sfc/apa/4046570245.html,37.796034,781.0,http://sfbay.craigslist.org/sfc/apa/
8,,"Contemporary, charming 2bds/1ba with private d...",2850,2.0,4006732632,-122.457100,Sep 4 2013,/sfc/apa/4006732632.html,37.735400,,http://sfbay.craigslist.org/sfc/apa/
9,(pacific heights),"2bd/2.5ba, 2 car tandem parking @ 1998 Broadwa...",6500,2.0,4046018830,-122.429850,Sep 4 2013,/sfc/apa/4046018830.html,37.794973,1400.0,http://sfbay.craigslist.org/sfc/apa/index200.html


In [15]:
df.price.describe()

count      997.000000
mean      4196.676028
std       2830.259243
min        195.000000
25%       2600.000000
50%       3375.000000
75%       4950.000000
max      45000.000000
Name: price, dtype: float64

In [16]:
prices = sqft_500_to_3000['price']
# Five bins of width 1000; prices above 5000 fall outside every bin, become NaN, and are not counted
bins = [0, 1000, 2000, 3000, 4000, 5000]
cats = pd.cut(prices, bins)
cats.value_counts()

(2000, 3000]    1380
(1000, 2000]    1035
(3000, 4000]     345
(4000, 5000]       0
(0, 1000]          0
Name: price, dtype: int64
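Values that fall outside explicit bin edges are assigned NaN by `pd.cut` and drop out of the counts. The sketch below shows this on a handful of made-up prices (the Series `prices` here is hypothetical) and, as two possible alternatives, lets pandas choose five bins itself with `pd.cut(prices, 5)` or `pd.qcut(prices, 5)`.

```python
import pandas as pd

# Hypothetical prices, some of them above the last explicit bin edge.
prices = pd.Series([195, 1250, 2895, 3354, 5500, 9975, 45000])

# Explicit edges: anything above 5000 gets no category (NaN).
fixed = pd.cut(prices, [0, 1000, 2000, 3000, 4000, 5000])
print(fixed.isna().sum())

# An integer asks for that many equal-width bins spanning the data range.
equal_width = pd.cut(prices, 5)
print(equal_width.value_counts())

# qcut instead builds bins from quantiles, so each bin holds a similar number of rows.
quantiles = pd.qcut(prices, 5)
print(quantiles.value_counts())
```

Equal-width bins are easy to read but can leave most rows in one bin when a few prices are very high; quantile bins spread the rows more evenly at the cost of uneven edges.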

In [17]:
prices

29670     2995
29671     2995
29672     2995
29673     2995
29674     2995
29675     2995
29676     2995
29677     2995
29678     2995
29679     2995
29680     2995
29681     2995
29682     2995
29683     2995
29684     2995
29685     2995
29686     2995
29687     2995
29688     2995
29689     2995
29690     2995
29691     2995
29692     2995
29693     2995
29694     2995
29695     2995
29696     2995
29697     2995
29698     2995
29699     2995
          ... 
223454    9975
223455    9975
223456    9975
223457    9975
223458    9975
223459    9975
223460    9975
223461    9975
223462    9975
223463    9975
223464    9975
223465    9975
223466    9975
223467    9975
223468    9975
223469    9975
223470    9975
223471    9975
223472    9975
223473    9975
223474    9975
223475    9975
223476    9975
223477    9975
223478    9975
223479    9975
223480    9975
223481    9975
223482    9975
223483    9975
Name: price, Length: 3293, dtype: int64