# Data Wrangling in Pandas

This session draws primarily on Chapter 7 in Python for Data Analysis.  It covers methods that are used heavily in 'data wrangling', which refers to the data manipulation that is often needed to transform raw data into a form that is useful for analysis.  We'll stick to the data and examples used in the book for most of this session, since the examples are clearer on the tiny datasets.  After that we will work through some of these methods again using real data.

Key methods covered include:

* Merging and Concatenating
* Reshaping data
* Data transformations
* Categorization
* Detecting and Filtering Outliers
* Creating Dummy Variables


In [None]:
import pandas as pd
import numpy as np

## Merging

Merging two datasets is a very common operation in preparing data for analysis.  It generally means adding columns from one table to columns from another, where the value of some key, or merge field, matches.

Let's begin by creating two simple DataFrames to be merged.

In [None]:
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'],'data2': range(3)})
print(df1)
print(df2)

Here is a many to one merge.  The join field is implicit, based on what columns it finds in common between the two dataframes. Note that they share some values of the key field (a, b), but do not share key values c and d.  What do you expect to happen when we merge them? The result contains the values from both inputs where they both have a value of the merge field, which is 'key' in this example.  The default behavior is that the key value has to be in both inputs to be kept.  In set terms it would be an intersection of the two sets.

In [None]:
pd.merge(df1, df2)

In [None]:
# For comparison, an outer join keeps keys found in either input (a, b, c and d)
pd.merge(df1,df2, how='outer')

Here is the same merge, but making the join field explicit.


In [None]:
# For comparison, a left join keeps every row of df1, including key 'c', which has no match in df2
pd.merge(df1,df2, how='left')

In [None]:
pd.merge(df1,df2, on='key')
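
As a quick check of the intersection idea, the sketch below rebuilds the two frames from above and compares the keys in the merged result with the set intersection of the input keys. The names `left`, `right`, `merged` and `shared_keys` are introduced only for this illustration.

```python
import pandas as pd

left = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
right = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})

# With no `how` argument, merge performs an inner join on the shared column(s)
merged = pd.merge(left, right)

# Keys that appear in both inputs
shared_keys = set(left['key']) & set(right['key'])

print(shared_keys)
print(set(merged['key']))
print(len(merged))
```

Keys 'c' and 'd' each appear in only one input, so neither should show up in `merged`.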

In [None]:
# What if a key value appears more than once in both dataframes? This is a many-to-many merge.
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],'data1': range(7)})
df3 = pd.DataFrame({'key': ['a', 'b', 'b', 'd'],'data2': range(4)})
print(df1)
print(df3)
pd.merge(df1,df3, on='key')
# For each key value, this produces the Cartesian product of the matching rows from the two dataframes:
# (b shows up 3 times in df1 and 2 times in df3, so we get 6 occurrences in the result of the merge)
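
To make the row arithmetic concrete, here is a small sketch that multiplies the per-key counts from each side and compares them with the counts in the merged result. The variable names are ours, chosen for illustration.

```python
import pandas as pd

left = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
right = pd.DataFrame({'key': ['a', 'b', 'b', 'd'], 'data2': range(4)})

# Occurrences of each key on each side
left_counts = left['key'].value_counts()
right_counts = right['key'].value_counts()

# Multiplication aligns on the key; keys missing from one side become NaN and are dropped
expected_rows = (left_counts * right_counts).dropna().astype(int)

# Occurrences of each key in the inner merge
actual_rows = pd.merge(left, right, on='key')['key'].value_counts()

print(expected_rows.sort_index())
print(actual_rows.sort_index())
```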

In [None]:
# There are several types of joins: left, right, inner, and outer. Let's compare them.
# How does a 'left' join compare to our initial join?  Note that it keeps the result if it shows up in df1,
# regardless of whether it also shows up in df2.  It fills in a value of NaN for the missing value from df2.
pd.merge(df1,df3, on='key', how='left')

In [None]:
#How does a 'right' join compare?  Same idea, but this time it keeps a result if it shows up in df2, regardless
# of whether it also shows up in df1.
pd.merge(df1,df3, on='key', how='right')

In [None]:
#How does an 'inner' join compare?
pd.merge(df1,df3, on='key', how='inner')
# 'inner' is the default value of the how argument, so this matches the merge without how

In [None]:
#How does an 'outer' join compare?  If inner joins are like an intersection of two sets, outer joins are unions.
pd.merge(df1,df3, on='key', how='outer')
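
When comparing join types it can help to see where each row came from. A minimal sketch using the `indicator` argument of `pd.merge`, which adds a `_merge` column to the result; the frame names `left`, `right` and `tagged` are ours.

```python
import pandas as pd

left = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
right = pd.DataFrame({'key': ['a', 'b', 'b', 'd'], 'data2': range(4)})

# indicator=True adds a _merge column: 'both', 'left_only' or 'right_only'
tagged = pd.merge(left, right, on='key', how='outer', indicator=True)

print(tagged)
print(tagged['_merge'].value_counts())
```

Rows tagged `both` are the ones an inner join would return; a left join adds the `left_only` rows, and a right join adds the `right_only` rows.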

In [None]:
#What if the join fields have different names?  No problem - just specify the names.
df4 = pd.DataFrame({'key_1': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],'data1': range(7)})
df5 = pd.DataFrame({'key_2': ['a', 'b', 'b', 'd'],'data2': range(4)})
pd.merge(df4,df5, left_on='key_1', right_on='key_2')

In [None]:
# Here is an example that uses a combination of a data column and an index to merge two dataframes.
df4 = pd.DataFrame({'key_1': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],'data1': range(7)})
df5 = pd.DataFrame({'data2': [4,6,8,10]}, index=['a','b','c','d'])
pd.merge(df4,df5, left_on='key_1', right_index=True)

## Concatenating

In [None]:
# Concatenating can append rows, or columns, depending on which axis you use. Default is 0
s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
s3 = pd.Series([5, 6], index=['f', 'g'])
pd.concat([s1, s2, s3])
# Since we are concatenating series on axis 0, this creates a longer series, appending each of the three series

In [None]:
# What if we concatenate on axis 1?
pd.concat([s1, s2, s3], axis=1)

In [None]:
# Outer join is the default:
pd.concat([s1, s2, s3], axis=1, join='outer')

In [None]:
# What would an inner join produce?
pd.concat([s1, s2, s3], axis=1, join='inner')

In [None]:
# We need some overlapping values to have the inner join produce non-empty results
s4 = pd.Series([4, 5, 6], index=['c', 'd', 'e'])
s5 = pd.Series([1, 2, 3], index=['d', 'e', 'f'])
s6 = pd.Series([7, 8, 9, 10], index=['d', 'e', 'f', 'g'])
pd.concat([s4, s5, s6], axis=1, join='outer')

In [None]:
# Here is the inner join 
pd.concat([s4, s5, s6], axis=1, join='inner')
# Note that it contains only entries that overlap in all three series.
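
The examples above concatenate Series, but the same function works on DataFrames. Here is a small sketch with two made-up frames (`top` and `bottom` are hypothetical names) showing what happens to the row index when appending rows, and two ways to deal with it.

```python
import pandas as pd

top = pd.DataFrame({'key': ['a', 'b'], 'value': [1, 2]})
bottom = pd.DataFrame({'key': ['c', 'd'], 'value': [3, 4]})

# Default: rows are appended and the original index labels are kept (0, 1, 0, 1)
stacked = pd.concat([top, bottom])

# ignore_index=True renumbers the rows 0..n-1
renumbered = pd.concat([top, bottom], ignore_index=True)

# keys= adds an outer index level recording which input each row came from
labelled = pd.concat([top, bottom], keys=['top', 'bottom'])

print(stacked)
print(renumbered)
print(labelled)
```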

## Reshaping with Hierarchical Indexing

In [None]:
data = pd.DataFrame(np.arange(6).reshape((2, 3)),
                 index=pd.Index(['Ohio', 'Colorado'], name='state'),
                 columns=pd.Index(['one', 'two', 'three'], name='number'))
data

In [None]:
# Stack pivots the columns into rows, producing a Series with a hierarchical index:
result = data.stack()
result

In [None]:
# Unstack reverses this process:
result.unstack()

See also the related pivot method
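
As a rough illustration of pivot, the sketch below starts from a hand-built long-format table holding the same numbers as the state/number example and pivots it back to wide form. The `long_form` frame and its column names are assumptions made for this example.

```python
import pandas as pd

# Long format: one row per (state, number) pair
long_form = pd.DataFrame({
    'state': ['Ohio'] * 3 + ['Colorado'] * 3,
    'number': ['one', 'two', 'three'] * 2,
    'value': range(6),
})

# pivot spreads the values of one column out into column headers
wide = long_form.pivot(index='state', columns='number', values='value')

print(wide)
```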

## Data Transformations

In [None]:
# Start with a dataframe containing some duplicate values
data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4,'k2': [1, 1, 2, 3, 3, 4, 99]})
data

In [None]:
# How to see which rows contain duplicate values
data.duplicated()

In [None]:
# How to remove duplicate values
data.drop_duplicates()

In [None]:
#If 99 is a code for missing data, we could replace any such values with NaNs
data['k2'].replace(99,np.nan)
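
Two details are easy to miss here: `replace` returns a new Series rather than changing `data`, and `drop_duplicates` can be limited to a subset of columns. A short sketch, reusing the same toy frame (`last_per_k1` is a name we made up):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4, 'k2': [1, 1, 2, 3, 3, 4, 99]})

# replace returns a new Series; assign it back to keep the change
data['k2'] = data['k2'].replace(99, np.nan)

# Look for duplicates in k1 only, keeping the last row of each group
last_per_k1 = data.drop_duplicates(subset=['k1'], keep='last')

print(data)
print(last_per_k1)
```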

## Categorization (binning)

In [None]:
# Let's look at how to create categories of data using ranges to bin the data using cut
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
type(cats)

In [None]:
cats.categories

In [None]:
cats.codes

In [None]:
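# pd.value_counts is deprecated in recent pandas versions; the Series method gives the same counts
pd.Series(cats).value_counts()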

In [None]:
# Consistent with mathematical notation for intervals, a parenthesis means that the side is open while the
# square bracket means it is closed (inclusive). Which side is closed can be changed by passing right=False:
cats = pd.cut(ages, bins, right=False)
print(ages)
print(pd.Series(cats).value_counts())
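
Interval labels can be hard to read in a report, so here is a sketch that passes readable names through the `labels` argument of `pd.cut`. The group names are illustrative choices, not something defined earlier in this session.

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]

# One label per bin, in the same order as the bin edges (names are illustrative)
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']

labelled = pd.cut(ages, bins, labels=group_names)
counts = pd.Series(labelled).value_counts()

print(labelled)
print(counts)
```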

## Detecting and Filtering Outliers

In [None]:
# Start by creating a dataframe with 4 columns of 1,000 random numbers
# We'll use a fixed seed for the random number generator to get repeatable results
np.random.seed(12345)
data = pd.DataFrame(np.random.randn(1000, 4))
data.describe()

In [None]:
# This identifies any values in column 3 with absolute values > 3
col = data[3]
col[np.abs(col) > 3]

In [None]:
# This identifies all the rows with any column containing absolute values > 3
data[(np.abs(data) > 3).any(axis=1)]

In [None]:
# Now we can cap the values at -3 to 3 using this:
data[np.abs(data) > 3] = np.sign(data) * 3
data.describe()
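
The same capping can also be written with `DataFrame.clip`. The sketch below regenerates the random data with the same seed and checks that both approaches agree; `capped_sign` and `capped_clip` are names introduced for this comparison.

```python
import numpy as np
import pandas as pd

np.random.seed(12345)
data = pd.DataFrame(np.random.randn(1000, 4))

# Approach used above: overwrite values beyond +/-3 with +/-3
capped_sign = data.copy()
capped_sign[np.abs(capped_sign) > 3] = np.sign(capped_sign) * 3

# Alternative: clip limits every value to the range [-3, 3]
capped_clip = data.clip(lower=-3, upper=3)

print(capped_clip.describe())
print(np.allclose(capped_sign, capped_clip))
```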

## Computing Dummy Variables

In [None]:
# This generates dummy variables for each value of key, using df1 from the merging section
# Dummy variables are useful in statistical modeling, to have 0/1 indicator
# variables for the presence of some condition
pd.get_dummies(df1['key'])

In [None]:
# This generates dummy variables for each value of key and appends these to the dataframe
dummies = pd.get_dummies(df1['key'], prefix='key')
df_with_dummy = df1[['data1']].join(dummies)
df_with_dummy

Notice that we used join instead of merge.  The join method is very similar to merge, but by default it matches rows on their indexes rather than on shared columns.  From the documentation:

http://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging
merge is a function in the pandas namespace, and it is also available as a DataFrame instance method, with the calling DataFrame being implicitly considered the left object in the join.

The related DataFrame.join method, uses merge internally for the index-on-index and index-on-column(s) joins, but joins on indexes by default rather than trying to join on common columns (the default behavior for merge). If you are joining on index, you may wish to use DataFrame.join to save yourself some typing
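
To see the relationship concretely, this sketch joins two small made-up frames on their indexes and then reproduces the result with an explicit index-on-index merge. The frames and their contents are hypothetical.

```python
import pandas as pd

left = pd.DataFrame({'data1': [0, 1, 2]}, index=['a', 'b', 'c'])
right = pd.DataFrame({'data2': [10, 20]}, index=['a', 'b'])

# join matches on the index and performs a left join by default
joined = left.join(right)

# The equivalent merge has to spell out both choices
merged = pd.merge(left, right, left_index=True, right_index=True, how='left')

print(joined)
print(joined.equals(merged))
```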

## A bit more: 
1. Filter out records with more than 4 bedrooms
2. Create dummy variables for each bedroom count (e.g. bed_1 would have 1 for rows with 1 bedroom, 0 for others), and merge them with the dataframe
3. Filter out records with sqft below 500 or above 3000
4. Create a set of 5 bins for price and do counts of how many records are in each category
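
One possible way to approach steps 1 to 3, sketched on a tiny made-up table rather than the real rents file. The `rents` frame and its values are invented for illustration, and only the column names follow the dataset used below.

```python
import pandas as pd

# Hypothetical listings, made up for this sketch
rents = pd.DataFrame({
    'bedrooms': [1, 2, 5, 3, 2],
    'sqft': [450, 900, 3200, 1500, 3000],
    'price': [1500, 2400, 6000, 3300, 5200],
})

# 1. Drop records with more than 4 bedrooms
rents = rents[rents['bedrooms'] <= 4]

# 2. One indicator column per bedroom count, joined back on the index
bed_dummies = pd.get_dummies(rents['bedrooms'], prefix='bed')
rents = rents.join(bed_dummies)

# 3. Keep only listings between 500 and 3000 sqft (between is inclusive by default)
rents = rents[rents['sqft'].between(500, 3000)]

print(rents)
```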

In [16]:
import pandas as pd
import numpy as np

df = pd.read_csv('rents_full.csv', encoding="ISO-8859-1")
#df_filter = df.query('bedrooms > 4')
#print(df_filter)
df.head() 

Unnamed: 0,date,title,neighborhood,sqft,price,bedrooms,longitude,latitude
0,11/14/14 12:26,Comfort & Convenience At an Affordable Price,foster city,755.0,2495.0,1.0,-122.27,37.5538
1,11/14/14 12:25,"$250 Visa Gift Card! Brand new flooring, appli...",palo alto,443.0,2695.0,,-122.161524,37.450289
2,11/14/14 12:24,"Sunny 2 bed/2 bath Spacious Condo, personal wa...",brisbane,1242.0,3150.0,2.0,-122.417912,37.692415
3,11/14/14 12:24,Spacious Updated Apt. Close to Stanford,palo alto,,2800.0,2.0,,
4,11/14/14 12:24,BOTTOM FLOOR ONE BEDROOM! PG&E IS INCLUDED! $1...,san mateo,676.0,2196.0,1.0,-122.2998,37.5395


In [17]:
# Drop rows where the bedroom count is missing
df = df.dropna(subset=['bedrooms'])
df.head()

Unnamed: 0,date,title,neighborhood,sqft,price,bedrooms,longitude,latitude
0,11/14/14 12:26,Comfort & Convenience At an Affordable Price,foster city,755.0,2495.0,1.0,-122.27,37.5538
2,11/14/14 12:24,"Sunny 2 bed/2 bath Spacious Condo, personal wa...",brisbane,1242.0,3150.0,2.0,-122.417912,37.692415
3,11/14/14 12:24,Spacious Updated Apt. Close to Stanford,palo alto,,2800.0,2.0,,
4,11/14/14 12:24,BOTTOM FLOOR ONE BEDROOM! PG&E IS INCLUDED! $1...,san mateo,676.0,2196.0,1.0,-122.2998,37.5395
5,11/14/14 12:28,Elegant Three Bd. W/ Unique Wood-Burning Firep...,santa clara,1138.0,3264.0,3.0,,


In [19]:
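# Note: this selects the listings with more than 4 bedrooms; to filter them out instead, use df['bedrooms'] <= 4
df_bed4 = df[df['bedrooms']>4]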
df_bed4.head()


Unnamed: 0,date,title,neighborhood,sqft,price,bedrooms,longitude,latitude
17,11/14/14 12:23,Winston Park Area-House for Rent,south san francisco,,4500.0,5.0,-122.457417,37.667437
46,11/14/14 12:22,"5Br.-3Ba. Great house, location, floor plan an...",san jose north,2776.0,4150.0,5.0,,
47,11/14/14 12:19,STUNNING 5 Bedroom House in Beautiful Saratoga...,saratoga,4005.0,6950.0,5.0,-122.013569,37.271943
101,11/14/14 11:34,Napa*Huge*5 Bed*loft*$3495 a month,napa county,3500.0,3495.0,5.0,-122.269684,38.290821
108,11/14/14 12:17,Large 6 bedroom 3 bath with w/d in unit and pa...,alamo square / nopa,,15000.0,6.0,-122.440948,37.779505


In [3]:
df_filtered = df[(df.sqft >500) & (df.sqft <3000)]
print(df_filtered)

               date                                              title  \
0    11/14/14 12:26       Comfort & Convenience At an Affordable Price   
2    11/14/14 12:24  Sunny 2 bed/2 bath Spacious Condo, personal wa...   
4    11/14/14 12:24  BOTTOM FLOOR ONE BEDROOM! PG&E IS INCLUDED! $1...   
5    11/14/14 12:28  Elegant Three Bd. W/ Unique Wood-Burning Firep...   
6    11/14/14 12:28   Welcome Home! Warmth and convenience all in one.   
7    11/14/14 12:28  Luxury Townhome!Located Next To Everything!Lim...   
10   11/14/14 12:30                   Gorgeous one BR with in-unit W/D   
11   11/14/14 12:18  Special Price For this Spacious Two Bedroom Co...   
12   11/14/14 12:18  Sit Back and Relax In This Wonderful Two Bedroom!   
13   11/14/14 12:17  Newly Renovated 1 Bedroom 1 Bath!2nd Floor w/B...   
14   11/14/14 12:33  Located in the Heart of Corte Madera / Georgeo...   
16   11/14/14 12:24      Sunnyvale 3/1.5bath, upgraded, conv. location   
18   11/14/14 12:21  Exec worthy condo

In [4]:
df['price'].min()

32.0

In [24]:
df['price']

0      2495.0
2      3150.0
3      2800.0
4      2196.0
5      3264.0
6      2000.0
7      4740.0
8      3395.0
9      2699.0
10     3620.0
11     2025.0
12     2378.0
13     1795.0
14     4299.0
15      950.0
16     2695.0
17     4500.0
18     3900.0
19     2939.0
20     2045.0
21     2505.0
22     3000.0
23     3100.0
24     2481.0
25     2295.0
26     3525.0
27     2595.0
28     1400.0
29     3050.0
30     2450.0
        ...  
467    4995.0
468    4745.0
469    1595.0
470    1454.0
471    1800.0
472    8200.0
473    1749.0
474    2833.0
475    2905.0
476    2759.0
478    7080.0
479    2650.0
480    2079.0
481    2195.0
482    3182.0
483    1639.0
484    2650.0
485    4400.0
486    3700.0
487    3200.0
488    3800.0
489    2850.0
490    4700.0
491    3000.0
492    3300.0
493    3449.0
495    2250.0
497    1175.0
498    1450.0
499     800.0
Name: price, Length: 470, dtype: float64

In [5]:
df['price'].max()

15000.0

In [25]:
price=df.price.tolist()
price

[2495.0,
 3150.0,
 2800.0,
 2196.0,
 3264.0,
 2000.0,
 4740.0,
 3395.0,
 2699.0,
 3620.0,
 2025.0,
 2378.0,
 1795.0,
 4299.0,
 950.0,
 2695.0,
 4500.0,
 3900.0,
 2939.0,
 2045.0,
 2505.0,
 3000.0,
 3100.0,
 2481.0,
 2295.0,
 3525.0,
 2595.0,
 1400.0,
 3050.0,
 2450.0,
 2495.0,
 2955.0,
 2950.0,
 3699.0,
 3400.0,
 2079.0,
 1794.0,
 1540.0,
 2245.0,
 3595.0,
 1750.0,
 2595.0,
 4500.0,
 3669.0,
 2995.0,
 4150.0,
 6950.0,
 4750.0,
 2895.0,
 5400.0,
 2625.0,
 1895.0,
 2320.0,
 1705.0,
 4250.0,
 3495.0,
 3289.0,
 1375.0,
 3600.0,
 1695.0,
 2425.0,
 2795.0,
 3750.0,
 3600.0,
 1799.0,
 2700.0,
 1395.0,
 2175.0,
 3050.0,
 2895.0,
 3995.0,
 2775.0,
 2744.0,
 4899.0,
 2531.0,
 5500.0,
 5875.0,
 4856.0,
 1595.0,
 1351.0,
 1495.0,
 3200.0,
 1275.0,
 1200.0,
 2576.0,
 2088.0,
 2950.0,
 2769.0,
 2695.0,
 2995.0,
 3995.0,
 5195.0,
 3250.0,
 1350.0,
 2626.0,
 1995.0,
 1795.0,
 2606.0,
 3495.0,
 2739.0,
 4100.0,
 2046.0,
 2307.0,
 3450.0,
 2200.0,
 15000.0,
 3500.0,
 950.0,
 1200.0,
 1937.0,
 2282.0,
 6

In [26]:
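#price= df['price']
# Note: cut uses right-closed intervals by default, so the minimum price (32) falls outside
# the first bin (32, 3032] and gets code -1; pass include_lowest=True to keep it
bins = [32, 3032, 6032, 9032, 12032, 15000]
c = pd.cut(price, bins)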

c



[(32, 3032], (3032, 6032], (32, 3032], (32, 3032], (3032, 6032], ..., (3032, 6032], (32, 3032], (32, 3032], (32, 3032], (32, 3032]]
Length: 470
Categories (5, interval[int64]): [(32, 3032] < (3032, 6032] < (6032, 9032] < (9032, 12032] < (12032, 15000]]

In [27]:
c.categories

IntervalIndex([(32, 3032], (3032, 6032], (6032, 9032], (9032, 12032], (12032, 15000]]
              closed='right',
              dtype='interval[int64]')

In [28]:
c.codes

array([ 0,  1,  0,  0,  1,  0,  1,  1,  0,  1,  0,  0,  0,  1,  0,  0,  1,
        1,  0,  0,  0,  0,  1,  0,  0,  1,  0,  0,  1,  0,  0,  0,  0,  1,
        1,  0,  0,  0,  0,  1,  0,  0,  1,  1,  0,  1,  2,  1,  0,  1,  0,
        0,  0,  0,  1,  1,  1,  0,  1,  0,  0,  0,  1,  1,  0,  0,  0,  0,
        1,  0,  1,  0,  0,  1,  0,  1,  1,  1,  0,  0,  0,  1,  0,  0,  0,
        0,  0,  0,  0,  0,  1,  1,  1,  0,  0,  0,  0,  0,  1,  0,  1,  0,
        0,  1,  0,  4,  1,  0,  0,  0,  0,  1,  0,  0,  1,  1,  0,  0,  0,
        0,  1,  1,  0,  0,  0,  0,  1,  1,  1,  0,  1,  0,  0,  0,  1,  1,
        1,  0,  0,  1,  1,  1,  0,  0,  0,  0,  0,  0,  1,  0,  0,  0,  0,
        0,  0,  1,  0,  0,  0,  1,  1,  0,  1,  0,  0,  1,  1,  0,  0,  0,
        1,  0,  0,  0,  0,  0, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  1,  0,  0,  0,  1,  0,  1,  0,  0,  0,  0,  1,  1,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  1,  1,  0,  0,  1,  0,  0,  0,  0,  0,
        0,  1,  0,  0,  0

In [29]:
pd.Series(c).value_counts()

(32, 3032]        325
(3032, 6032]      135
(6032, 9032]        7
(12032, 15000]      2
(9032, 12032]       0
dtype: int64
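
The counts above add up to 469 rather than 470 because the lowest price equals the first bin edge, and that edge is excluded by default. A minimal sketch of the difference `include_lowest=True` makes, using a handful of prices picked from the output above (the short `prices` list is only a stand-in for the full column):

```python
import pandas as pd

# A few prices from the listing data, including the minimum (32) and maximum (15000)
prices = [32, 950, 2495, 3150, 6950, 15000]
bins = [32, 3032, 6032, 9032, 12032, 15000]

# Default: the first interval is open on the left, so 32 falls in no bin
default_cut = pd.cut(prices, bins)

# include_lowest=True makes the first interval include its left edge
inclusive_cut = pd.cut(prices, bins, include_lowest=True)

print(default_cut.codes)
print(inclusive_cut.codes)
```

A code of -1 marks a value that was not assigned to any bin.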