
# 2.5 Data Munging/Cleaning

## Munging Defined 

Munging shapes a dataset so that it `feeds` easily into `visualisation tools` and `machine learning models`. Most datasets need a `similar set of operations`:

* Cleaning: strings changed to numbers, numbers/strings changed to dates, `missing values` dealt with, unneeded columns pruned
* Transformation: `data grouped` into meaningful subsets, `summary statistics` calculated, `further filtering` applied

Here, we will take you through all the usual munging steps using resale prices for HDB flats from 2012 onwards.

Variables include: 
```
month                   object
town                    object
flat_type               object
floor_area_sqm         float64
lease_commence_date      int64
resale_price           float64
dtype: object
```

In [26]:
import pandas as pd 
import numpy as np

## 2.5.1. First look at the data 

In [27]:
# load data 
resale = pd.read_csv("hdb_munging.csv", index_col=0)

In [28]:
# take a first look at 2 randomly sampled rows
resale.sample(2)


Unnamed: 0,month,town,flat_type,floor_area_sqm,lease_commence_date,resale_price
13547,2012-09,CHOA CHU KANG,4 ROOM,104.0,1989,400000.0
88692,2016-11,SEMBAWANG,4 ROOM,96.0,2010,380000.0


In [29]:
#copy to df as a backup
df=resale.copy()


In [30]:
# how many rows and columns?  
resale.shape

(2000, 6)

In [31]:
# look for unique values 
resale["town"].unique()

array(['ANG MO KIO', 'BEDOK', 'BISHAN', 'BUKIT BATOK', 'BUKIT MERAH',
       'BUKIT PANJANG', 'BUKIT TIMAH', 'CENTRAL AREA', 'CHOA CHU KANG',
       'CLEMENTI', 'GEYLANG', 'HOUGANG', 'JURONG EAST', 'JURONG WEST',
       'KALLANG/WHAMPOA', 'MARINE PARADE', 'PASIR RIS', 'PUNGGOL',
       'QUEENSTOWN', 'SEMBAWANG', 'SENGKANG', 'SERANGOON', 'TAMPINES',
       'TOA PAYOH', 'WOODLANDS', 'YISHUN'], dtype=object)

##### How many unique months? (use the nunique function)

In [32]:
resale["month"].nunique()

64

In [33]:
resale['resale_price'].head(3)

87705    590000.0
86039    490000.0
52213    290000.0
Name: resale_price, dtype: float64

In [34]:
resale.dtypes
#observe the resale_price columns

month                   object
town                    object
flat_type               object
floor_area_sqm         float64
lease_commence_date      int64
resale_price           float64
dtype: object

In [35]:
resale.resale_price.head(3)
#observe the dtype

87705    590000.0
86039    490000.0
52213    290000.0
Name: resale_price, dtype: float64

The first record has the index '87705', the second record '86039' and the third one '52213'.

##### Assign the first 3 rows to x, y, z

In [36]:
x, y, z=resale.resale_price.iloc[[0,1,2]]

##### Assign index 87705,86039,52213 to the original number, but in a `string` format

In [37]:
resale.loc[[87705,86039,52213],['resale_price']]=str(x), str(y), str(z)
#resale.iloc[0:3,5]=str(x), str(y), str(z)

In [38]:
resale.dtypes
#observe the resale_price columns

month                   object
town                    object
flat_type               object
floor_area_sqm         float64
lease_commence_date      int64
resale_price            object
dtype: object

##### The first 3 rows are strings and the others are numbers

In [39]:
resale['resale_price'].head()

87705    590000.0
86039    490000.0
52213    290000.0
46561      425000
69890      760000
Name: resale_price, dtype: object

##### Convert everything to numbers. Anything that cannot be parsed becomes NaN

In [40]:
pd.to_numeric?

In [16]:
resale["resale_price"] = resale["resale_price"].apply(pd.to_numeric, errors = "coerce")
resale['resale_price'].head()

87705    590000.0
86039    490000.0
52213    290000.0
46561    425000.0
69890    760000.0
Name: resale_price, dtype: float64

In [41]:
# same conversion, this time applied to a one-column DataFrame

resale[["resale_price"]] = resale[["resale_price"]].apply(pd.to_numeric, errors = "coerce")
resale["resale_price"].head()

87705    590000.0
86039    490000.0
52213    290000.0
46561    425000.0
69890    760000.0
Name: resale_price, dtype: float64

In [42]:
resale.dtypes #back to float (resale_price)

month                   object
town                    object
flat_type               object
floor_area_sqm         float64
lease_commence_date      int64
resale_price           float64
dtype: object
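The effect of `errors="coerce"` is easier to see when the column holds a value that is not a number at all. The sketch below uses a small made-up Series (the values and the `"n/a"` entry are assumptions for illustration, not part of the HDB file):

```python
import pandas as pd

# Hypothetical column mixing floats, numeric strings and one non-numeric entry
prices = pd.Series([590000.0, "490000.0", "n/a", 425000], dtype=object)

# errors="coerce" turns anything that cannot be parsed into NaN
clean = pd.to_numeric(prices, errors="coerce")

print(clean.dtype)          # the result is a float column
print(clean.isna().sum())   # number of values that could not be converted
```

Counting the NaN values before and after the conversion is a quick way to see how many entries were silently coerced.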

## Exercise 2.5.1

###### a) Load cars.csv data into dataframe named cars:

In [18]:
# load cars.csv data into dataframe named cars
cars = pd.____("______",index_col=None)

In [19]:
x, y, z=cars.MPG.iloc[[0,1,2]]
cars.loc[[0,1,2],['MPG']]=str(x), str(y), str(z)

###### b) Observe the first three rows of the dataframe using head:

In [20]:
cars.____(_)

Unnamed: 0,Car,MPG,Cylinders,Displacement,Horsepower,Weight,Acceleration,Model,Origin,Color
0,Chevrolet Chevelle Malibu,18.0,8,307.0,130,3504,12.0,70,US,blue
1,Buick Skylark 320,15.0,8,350.0,165,3693,11.5,70,US,red
2,Plymouth Satellite,18.0,8,318.0,150,3436,11.0,70,US,blue


###### c) Check how many rows and columns using shape

In [21]:
cars.______

(406, 10)

###### d) Get 10 samples from the dataframe

In [22]:
cars.____(___)

Unnamed: 0,Car,MPG,Cylinders,Displacement,Horsepower,Weight,Acceleration,Model,Origin,Color
217,Toyota Mark II,19.0,6,156.0,108,2930,15.5,76,Japan,green
2,Plymouth Satellite,18.0,8,318.0,150,3436,11.0,70,US,blue
256,Oldsmobile Cutlass Salon Brougham,19.9,8,260.0,110,3365,15.5,78,US,green
168,Chevrolete Chevelle Malibu,16.0,6,250.0,105,3897,18.5,75,US,green
363,Toyota Corolla,32.4,4,108.0,75,2350,16.8,81,Japan,green
92,Buick Century 350,13.0,8,350.0,175,4100,13.0,73,US,blue
68,Ford Pinto Runabout,21.0,4,122.0,86,2226,16.5,72,US,red
159,Plymouth Valiant Custom,19.0,6,225.0,95,3264,16.0,75,US,blue
1,Buick Skylark 320,15.0,8,350.0,165,3693,11.5,70,US,red
344,Honda Accord,32.4,4,107.0,72,2290,17.0,80,Japan,red


###### e) Check how many unique cars are in the cars DataFrame:

In [23]:
# How many Unique Car in Cars DataFrame?
cars.__._______

308

###### f) Check how many unique Cylinders values are in the cars DataFrame:

In [24]:
# How many Unique Cylinders in Cars DataFrame?
cars._____.______

5

###### g) Check which unique origins appear in the cars DataFrame:

In [25]:
# Where are these cars from?
cars.____._____

array(['US', 'Europe', 'Japan'], dtype=object)

###### h) Check the datatypes in the Cars DataFrame:

In [26]:
cars._____

Car              object
MPG              object
Cylinders         int64
Displacement    float64
Horsepower        int64
Weight            int64
Acceleration    float64
Model             int64
Origin           object
Color            object
dtype: object

###### i) MPG is not recognised as a number. Convert the object values in MPG to float64:

In [27]:
# Convert MPG to float64 to_numeric

In [28]:
cars[["MPG"]] = cars[["MPG"]].apply(pd.______, errors = "coerce")
cars["MPG"].head()

0    18.0
1    15.0
2    18.0
3    16.0
4    17.0
Name: MPG, dtype: float64

###### j) check the datatype in cars dataframe again:

In [29]:
cars._____

Car              object
MPG             float64
Cylinders         int64
Displacement    float64
Horsepower        int64
Weight            int64
Acceleration    float64
Model             int64
Origin           object
Color            object
dtype: object

## 2.5.2. Playing, Slicing, Dicing with 10 records (call it head dataframe)

In [43]:
# quick intro into indexing - slicing
# create a small dataset with 10 rows 
head = resale.head(10)


In [44]:
head

Unnamed: 0,month,town,flat_type,floor_area_sqm,lease_commence_date,resale_price
87705,2016-11,ANG MO KIO,4 ROOM,92.0,1977,590000.0
86039,2016-10,ANG MO KIO,4 ROOM,92.0,1978,490000.0
52213,2015-01,ANG MO KIO,3 ROOM,68.0,1979,290000.0
46561,2014-09,ANG MO KIO,4 ROOM,92.0,1981,425000.0
69890,2015-12,ANG MO KIO,EXECUTIVE,153.0,1996,760000.0
11026,2012-08,ANG MO KIO,3 ROOM,67.0,1978,335000.0
79036,2016-06,ANG MO KIO,3 ROOM,68.0,1981,277000.0
75,2012-03,ANG MO KIO,3 ROOM,89.0,1979,435000.0
29949,2013-08,ANG MO KIO,3 ROOM,68.0,1980,348000.0
11021,2012-08,ANG MO KIO,3 ROOM,68.0,1980,327000.0


In [45]:
# access a column
head['resale_price']

87705    590000.0
86039    490000.0
52213    290000.0
46561    425000.0
69890    760000.0
11026    335000.0
79036    277000.0
75       435000.0
29949    348000.0
11021    327000.0
Name: resale_price, dtype: float64

In [46]:
# access second row, keep all columns
head.iloc[1,:]

month                     2016-10
town                   ANG MO KIO
flat_type                  4 ROOM
floor_area_sqm                 92
lease_commence_date          1978
resale_price               490000
Name: 86039, dtype: object

In [47]:
# all rows, only the third column (position 2)
head.iloc[:,2]

87705       4 ROOM
86039       4 ROOM
52213       3 ROOM
46561       4 ROOM
69890    EXECUTIVE
11026       3 ROOM
79036       3 ROOM
75          3 ROOM
29949       3 ROOM
11021       3 ROOM
Name: flat_type, dtype: object

In [48]:
head.loc[head['resale_price'] > 100000.0, :]

Unnamed: 0,month,town,flat_type,floor_area_sqm,lease_commence_date,resale_price
87705,2016-11,ANG MO KIO,4 ROOM,92.0,1977,590000.0
86039,2016-10,ANG MO KIO,4 ROOM,92.0,1978,490000.0
52213,2015-01,ANG MO KIO,3 ROOM,68.0,1979,290000.0
46561,2014-09,ANG MO KIO,4 ROOM,92.0,1981,425000.0
69890,2015-12,ANG MO KIO,EXECUTIVE,153.0,1996,760000.0
11026,2012-08,ANG MO KIO,3 ROOM,67.0,1978,335000.0
79036,2016-06,ANG MO KIO,3 ROOM,68.0,1981,277000.0
75,2012-03,ANG MO KIO,3 ROOM,89.0,1979,435000.0
29949,2013-08,ANG MO KIO,3 ROOM,68.0,1980,348000.0
11021,2012-08,ANG MO KIO,3 ROOM,68.0,1980,327000.0


In [49]:
head.loc[head['resale_price'] > 100000.0]

Unnamed: 0,month,town,flat_type,floor_area_sqm,lease_commence_date,resale_price
87705,2016-11,ANG MO KIO,4 ROOM,92.0,1977,590000.0
86039,2016-10,ANG MO KIO,4 ROOM,92.0,1978,490000.0
52213,2015-01,ANG MO KIO,3 ROOM,68.0,1979,290000.0
46561,2014-09,ANG MO KIO,4 ROOM,92.0,1981,425000.0
69890,2015-12,ANG MO KIO,EXECUTIVE,153.0,1996,760000.0
11026,2012-08,ANG MO KIO,3 ROOM,67.0,1978,335000.0
79036,2016-06,ANG MO KIO,3 ROOM,68.0,1981,277000.0
75,2012-03,ANG MO KIO,3 ROOM,89.0,1979,435000.0
29949,2013-08,ANG MO KIO,3 ROOM,68.0,1980,348000.0
11021,2012-08,ANG MO KIO,3 ROOM,68.0,1980,327000.0
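Because the resale index is not a simple 0, 1, 2 sequence, it helps to keep label-based and position-based access apart. A minimal sketch on a made-up three-row frame (the frame `flats` and its values are assumptions for illustration):

```python
import pandas as pd

# Toy frame with a non-sequential index, similar in spirit to the resale data
flats = pd.DataFrame(
    {"flat_type": ["4 ROOM", "3 ROOM", "EXECUTIVE"],
     "resale_price": [590000.0, 290000.0, 760000.0]},
    index=[87705, 52213, 69890],
)

by_position = flats.iloc[1]    # second row, whatever its label is
by_label = flats.loc[52213]    # row whose index label is 52213

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
pricey_non_exec = flats.loc[
    (flats["resale_price"] > 500000) & (flats["flat_type"] != "EXECUTIVE")
]
print(pricey_non_exec)
```

Here `iloc[1]` and `loc[52213]` return the same row, because 52213 happens to be the label of the second row.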


## Exercise 2.5.2

###### a) Find the efficient cars (those that run more than 40 miles per gallon). List ALL OF THEM

In [37]:
cars.loc[______ > ____]

Unnamed: 0,Car,MPG,Cylinders,Displacement,Horsepower,Weight,Acceleration,Model,Origin,Color
251,Volkswagen Rabbit Custom Diesel,43.1,4,90.0,48,1985,21.5,78,Europe,blue
316,Volkswagen Rabbit,41.5,4,98.0,76,2144,14.7,80,Europe,green
329,Mazda GLC,46.6,4,86.0,65,2110,17.9,80,Japan,green
331,Datsun 210,40.8,4,85.0,65,2110,19.2,80,Japan,blue
332,Volkswagen Rabbit C (Diesel),44.3,4,90.0,48,2085,21.7,80,Europe,blue
333,Volkswagen Dasher (diesel),43.4,4,90.0,48,2335,23.7,80,Europe,green
336,Honda Civic 1500 gl,44.6,4,91.0,67,1850,13.8,80,Japan,red
337,Renault Lecar Deluxe,40.9,4,85.0,0,1835,17.3,80,Europe,red
402,Volkswagen Pickup,44.0,4,97.0,52,2130,24.6,82,Europe,green


###### b) Find the efficient cars (those that run more than 40 miles per gallon). List ONLY those from Japan

In [38]:
cars.loc[(cars.MPG > 40) ___ (cars.Origin ___ 'Japan')]

Unnamed: 0,Car,MPG,Cylinders,Displacement,Horsepower,Weight,Acceleration,Model,Origin,Color
329,Mazda GLC,46.6,4,86.0,65,2110,17.9,80,Japan,green
331,Datsun 210,40.8,4,85.0,65,2110,19.2,80,Japan,blue
336,Honda Civic 1500 gl,44.6,4,91.0,67,1850,13.8,80,Japan,red


###### c) Find the efficient cars (those that run more than 40 miles per gallon). List ONLY those NOT from Japan

In [39]:
cars.loc[(cars.MPG > 40) ___ (cars.Origin___='Japan')]

Unnamed: 0,Car,MPG,Cylinders,Displacement,Horsepower,Weight,Acceleration,Model,Origin,Color
251,Volkswagen Rabbit Custom Diesel,43.1,4,90.0,48,1985,21.5,78,Europe,blue
316,Volkswagen Rabbit,41.5,4,98.0,76,2144,14.7,80,Europe,green
332,Volkswagen Rabbit C (Diesel),44.3,4,90.0,48,2085,21.7,80,Europe,blue
333,Volkswagen Dasher (diesel),43.4,4,90.0,48,2335,23.7,80,Europe,green
337,Renault Lecar Deluxe,40.9,4,85.0,0,1835,17.3,80,Europe,red
402,Volkswagen Pickup,44.0,4,97.0,52,2130,24.6,82,Europe,green


##### *d) Self-Study (Google).  Find all cars with the keyword "Renault"*

In [40]:
cars[cars['____'].str.contains("____")]

Unnamed: 0,Car,MPG,Cylinders,Displacement,Horsepower,Weight,Acceleration,Model,Origin,Color
86,Renault 12 (sw),26.0,4,96.0,69,2189,18.0,72,Europe,red
193,Renault 12tl,27.0,4,101.0,83,2202,15.3,76,Europe,blue
225,Renault 5 GTL,36.0,4,79.0,58,1825,18.6,77,Europe,blue
337,Renault Lecar Deluxe,40.9,4,85.0,0,1835,17.3,80,Europe,red
361,Renault 18i,34.5,4,100.0,0,2320,15.8,81,Europe,blue


## 2.5.3. Cleaning: Dealing with missing values, outliers and time formats 

### Missing Values 
Real world data, or data "in the wild", is rarely standardised. Often, especially with surveys, fields are not filled in and hence we have to deal with missing values. Pandas provides several tools to handle missing data. 

In [50]:
# Generate a small dataset with 8 rows and 3 columns to play with, and name it resale_small
resale_small = resale.loc[:, ['resale_price','lease_commence_date','floor_area_sqm']].head(8)
resale_small

Unnamed: 0,resale_price,lease_commence_date,floor_area_sqm
87705,590000.0,1977,92.0
86039,490000.0,1978,92.0
52213,290000.0,1979,68.0
46561,425000.0,1981,92.0
69890,760000.0,1996,153.0
11026,335000.0,1978,67.0
79036,277000.0,1981,68.0
75,435000.0,1979,89.0


#### It is always good practice to check for missing values

In [51]:
##Generate a boolean mask showing missing values 
resale_small.isnull()

Unnamed: 0,resale_price,lease_commence_date,floor_area_sqm
87705,False,False,False
86039,False,False,False
52213,False,False,False
46561,False,False,False
69890,False,False,False
11026,False,False,False
79036,False,False,False
75,False,False,False


`resale_small.isnull()` can be overwhelming when we have many columns and rows.
***Here is the alternative***

In [52]:
# resale_small.isnull() can be overwhelming when we have so many columns and rows
# Here is the solution.
resale_small.isnull().any()

resale_price           False
lease_commence_date    False
floor_area_sqm         False
dtype: bool

#### The above data has no missing values. Let us insert some NaN values

In [53]:
from numpy import nan

In [54]:
resale_small.iloc[1:5, 1]=nan

In [55]:
resale_small.iloc[:2, 2]=nan

In [56]:
resale_small

Unnamed: 0,resale_price,lease_commence_date,floor_area_sqm
87705,590000.0,1977.0,
86039,490000.0,,
52213,290000.0,,68.0
46561,425000.0,,92.0
69890,760000.0,,153.0
11026,335000.0,1978.0,67.0
79036,277000.0,1981.0,68.0
75,435000.0,1979.0,89.0


In [57]:
resale_small.isnull().sum()

resale_price           0
lease_commence_date    4
floor_area_sqm         2
dtype: int64

**Now we have 4 missing values in lease_commence_date and 2 missing values in floor_area_sqm**

In [58]:
# similarly, create a boolean mask, but this time showing non-missing values 
resale_small.notnull()

Unnamed: 0,resale_price,lease_commence_date,floor_area_sqm
87705,True,True,False
86039,True,False,False
52213,True,False,True
46561,True,False,True
69890,True,False,True
11026,True,True,True
79036,True,True,True
75,True,True,True


In [59]:
# notnull becomes useful when we use the boolean array as an index 
# here, all NaN values for lease_commence_date are filtered out 

resale_small[resale_small.lease_commence_date.notnull()]

Unnamed: 0,resale_price,lease_commence_date,floor_area_sqm
87705,590000.0,1977.0,
11026,335000.0,1978.0,67.0
79036,277000.0,1981.0,68.0
75,435000.0,1979.0,89.0


In [60]:
# notnull becomes useful when we use the boolean array as an index 
# here, all NaN values for floor_area_sqm are filtered out 

resale_small[resale_small.floor_area_sqm.notnull()]

Unnamed: 0,resale_price,lease_commence_date,floor_area_sqm
52213,290000.0,,68.0
46561,425000.0,,92.0
69890,760000.0,,153.0
11026,335000.0,1978.0,67.0
79036,277000.0,1981.0,68.0
75,435000.0,1979.0,89.0


#### Filtering out all NaN in floor_area_sqm and lease_commence_date

In [61]:
resale_small[(resale_small.floor_area_sqm.notnull()) & (resale_small.lease_commence_date.notnull())]

Unnamed: 0,resale_price,lease_commence_date,floor_area_sqm
11026,335000.0,1978.0,67.0
79036,277000.0,1981.0,68.0
75,435000.0,1979.0,89.0


In [62]:
resale_small.dropna()

Unnamed: 0,resale_price,lease_commence_date,floor_area_sqm
11026,335000.0,1978.0,67.0
79036,277000.0,1981.0,68.0
75,435000.0,1979.0,89.0


Note:
* Here the combined `notnull()` filter and `dropna()` give the same result.
* By default, `dropna()` drops EVERY record that has a NaN in any column.
* The combined filter gives us the flexibility to drop records based only on the columns we choose (one column, two, or more).
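`dropna()` itself can also be restricted to chosen columns through its `subset` and `how` parameters. A minimal sketch on a made-up frame (the frame `small` and its values are assumptions that mimic `resale_small`):

```python
import numpy as np
import pandas as pd

small = pd.DataFrame({
    "resale_price": [590000.0, 490000.0, 290000.0, 335000.0],
    "lease_commence_date": [1977.0, np.nan, np.nan, 1978.0],
    "floor_area_sqm": [np.nan, np.nan, 68.0, 67.0],
})

# Default: drop rows with a NaN in any column
any_nan = small.dropna()

# Only look at one column when deciding which rows to drop
one_col = small.dropna(subset=["lease_commence_date"])

# Drop a row only if ALL of the listed columns are NaN
all_nan = small.dropna(how="all", subset=["lease_commence_date", "floor_area_sqm"])

print(len(any_nan), len(one_col), len(all_nan))
```

The three calls keep progressively more rows, which makes the trade-off between strictness and lost data visible.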

##### Filling out values
Especially when our dataset is small and we want to preserve as much information as we can, we don't have to drop NaN values. Instead, we can replace them with zero and retain the information in the other columns. Often better, we can fill NaN values with an average or some other sensible value. 

In [63]:
# filling with zero
resale_small.fillna(0)

Unnamed: 0,resale_price,lease_commence_date,floor_area_sqm
87705,590000.0,1977.0,0.0
86039,490000.0,0.0,0.0
52213,290000.0,0.0,68.0
46561,425000.0,0.0,92.0
69890,760000.0,0.0,153.0
11026,335000.0,1978.0,67.0
79036,277000.0,1981.0,68.0
75,435000.0,1979.0,89.0


In [64]:
# propagate the next valid value backward 
resale_small.fillna(method='bfill')

Unnamed: 0,resale_price,lease_commence_date,floor_area_sqm
87705,590000.0,1977.0,68.0
86039,490000.0,1978.0,68.0
52213,290000.0,1978.0,68.0
46561,425000.0,1978.0,92.0
69890,760000.0,1978.0,153.0
11026,335000.0,1978.0,67.0
79036,277000.0,1981.0,68.0
75,435000.0,1979.0,89.0


In [65]:
# propagate the previous valid value forward 
resale_small.fillna(method='ffill')

Unnamed: 0,resale_price,lease_commence_date,floor_area_sqm
87705,590000.0,1977.0,
86039,490000.0,1977.0,
52213,290000.0,1977.0,68.0
46561,425000.0,1977.0,92.0
69890,760000.0,1977.0,153.0
11026,335000.0,1978.0,67.0
79036,277000.0,1981.0,68.0
75,435000.0,1979.0,89.0
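Filling with an average, as mentioned above, could look like the sketch below. It uses a made-up frame (`small` and its values are assumptions) and also shows the `ffill()` and `bfill()` methods, which do the same job as `fillna(method=...)`; newer pandas releases warn about the `method=` argument, so the method-style calls may be the safer choice there.

```python
import numpy as np
import pandas as pd

small = pd.DataFrame({
    "lease_commence_date": [1977.0, np.nan, np.nan, 1978.0],
    "floor_area_sqm": [np.nan, 92.0, 68.0, 67.0],
})

# Fill each column with its own median (NaN values are skipped when computing it)
filled_median = small.fillna(small.median())

# Method-style equivalents of fillna(method="ffill") and fillna(method="bfill")
filled_forward = small.ffill()
filled_backward = small.bfill()

print(filled_median)
```

Note that a forward fill cannot fill a NaN in the very first row, because there is no earlier value to carry forward.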


## Exercise 2.5.3

In [None]:
cars.iloc[:2, 1]=nan
cars.iloc[1:10, 2]=nan

###### a) Check the first 10 records and find the NaN

In [58]:
cars.head(___)

Unnamed: 0,Car,MPG,Cylinders,Displacement,Horsepower,Weight,Acceleration,Model,Origin,Color
0,Chevrolet Chevelle Malibu,,8.0,307.0,130,3504,12.0,70,US,blue
1,Buick Skylark 320,,,350.0,165,3693,11.5,70,US,red
2,Plymouth Satellite,18.0,,318.0,150,3436,11.0,70,US,blue
3,AMC Rebel SST,16.0,,304.0,150,3433,12.0,70,US,green
4,Ford Torino,17.0,,302.0,140,3449,10.5,70,US,red
5,Ford Galaxie 500,15.0,,429.0,198,4341,10.0,70,US,red
6,Chevrolet Impala,14.0,,454.0,220,4354,9.0,70,US,green
7,Plymouth Fury iii,14.0,,440.0,215,4312,8.5,70,US,blue
8,Pontiac Catalina,14.0,,455.0,225,4425,10.0,70,US,blue
9,AMC Ambassador DPL,15.0,,390.0,190,3850,8.5,70,US,green


##### b) Propagate the next valid value backward (fillna)

In [59]:
cars._____(method='bfill').head(11)

Unnamed: 0,Car,MPG,Cylinders,Displacement,Horsepower,Weight,Acceleration,Model,Origin,Color
0,Chevrolet Chevelle Malibu,18.0,8.0,307.0,130,3504,12.0,70,US,blue
1,Buick Skylark 320,18.0,4.0,350.0,165,3693,11.5,70,US,red
2,Plymouth Satellite,18.0,4.0,318.0,150,3436,11.0,70,US,blue
3,AMC Rebel SST,16.0,4.0,304.0,150,3433,12.0,70,US,green
4,Ford Torino,17.0,4.0,302.0,140,3449,10.5,70,US,red
5,Ford Galaxie 500,15.0,4.0,429.0,198,4341,10.0,70,US,red
6,Chevrolet Impala,14.0,4.0,454.0,220,4354,9.0,70,US,green
7,Plymouth Fury iii,14.0,4.0,440.0,215,4312,8.5,70,US,blue
8,Pontiac Catalina,14.0,4.0,455.0,225,4425,10.0,70,US,blue
9,AMC Ambassador DPL,15.0,4.0,390.0,190,3850,8.5,70,US,green


## 2.5.4. Transforming Data with a Mapping

In [67]:
resale.town.unique()

array(['ANG MO KIO', 'BEDOK', 'BISHAN', 'BUKIT BATOK', 'BUKIT MERAH',
       'BUKIT PANJANG', 'BUKIT TIMAH', 'CENTRAL AREA', 'CHOA CHU KANG',
       'CLEMENTI', 'GEYLANG', 'HOUGANG', 'JURONG EAST', 'JURONG WEST',
       'KALLANG/WHAMPOA', 'MARINE PARADE', 'PASIR RIS', 'PUNGGOL',
       'QUEENSTOWN', 'SEMBAWANG', 'SENGKANG', 'SERANGOON', 'TAMPINES',
       'TOA PAYOH', 'WOODLANDS', 'YISHUN'], dtype=object)

In [68]:
district = resale.sample(n=23).copy()
district

Unnamed: 0,month,town,flat_type,floor_area_sqm,lease_commence_date,resale_price
86710,2016-10,JURONG EAST,4 ROOM,102.0,1998,465000.0
48935,2014-10,QUEENSTOWN,3 ROOM,65.0,1975,430000.0
70204,2015-12,CHOA CHU KANG,4 ROOM,104.0,1990,365000.0
9350,2012-07,BUKIT PANJANG,5 ROOM,122.0,1988,470000.0
76354,2016-04,PASIR RIS,4 ROOM,106.0,1990,390000.0
18594,2012-12,BEDOK,3 ROOM,67.0,1977,335000.0
94393,2017-03,TAMPINES,4 ROOM,108.0,1987,443000.0
40805,2014-04,TAMPINES,3 ROOM,78.0,1985,370000.0
99908,2017-06,SERANGOON,5 ROOM,140.0,1985,518000.0
23904,2013-04,ANG MO KIO,5 ROOM,110.0,2006,780000.0


In [69]:
# load data 
resale = pd.read_csv("hdb_munging.csv", index_col=0)


In [70]:
mapTownRegion = {
'ANG MO KIO':'North East',
'BEDOK':'East',
'BISHAN':'Central',
'BUKIT BATOK':'West',
'BUKIT MERAH':'Central',
'BUKIT PANJANG':'West',
'BUKIT TIMAH':'Central',
'CENTRAL AREA':'Central',
'CHOA CHU KANG':'West',
'CLEMENTI':'West',
'GEYLANG':'Central',
'HOUGANG':'North East',
'JURONG EAST':'West',
'JURONG WEST':'West',
'KALLANG/WHAMPOA':'Central',
'MARINE PARADE':'East',
'PASIR RIS':'East',
'PUNGGOL':'North East',
'QUEENSTOWN':'Central',
'SEMBAWANG':'North',
'SENGKANG':'North East',
'SERANGOON':'North East',
'TAMPINES':'East',
'TOA PAYOH':'Central',
'WOODLANDS':'North',
'YISHUN':'North',

}




When we have many values (e.g., millions of rows), it is worth the effort to use mapping.

In [71]:
district['area'] = district.loc[:,'town'].map(mapTownRegion) 
district

Unnamed: 0,month,town,flat_type,floor_area_sqm,lease_commence_date,resale_price,area
86710,2016-10,JURONG EAST,4 ROOM,102.0,1998,465000.0,West
48935,2014-10,QUEENSTOWN,3 ROOM,65.0,1975,430000.0,Central
70204,2015-12,CHOA CHU KANG,4 ROOM,104.0,1990,365000.0,West
9350,2012-07,BUKIT PANJANG,5 ROOM,122.0,1988,470000.0,West
76354,2016-04,PASIR RIS,4 ROOM,106.0,1990,390000.0,East
18594,2012-12,BEDOK,3 ROOM,67.0,1977,335000.0,East
94393,2017-03,TAMPINES,4 ROOM,108.0,1987,443000.0,East
40805,2014-04,TAMPINES,3 ROOM,78.0,1985,370000.0,East
99908,2017-06,SERANGOON,5 ROOM,140.0,1985,518000.0,North East
23904,2013-04,ANG MO KIO,5 ROOM,110.0,2006,780000.0,North East


## String Operations 

In [23]:
# we do the mapping for ALL complete data
resale['Region']=resale.loc[:,'town'].map(mapTownRegion) 

In [24]:
# split 'YYYY-MM' into two text columns; note that 'Quarter' actually holds the month number
resale[['Year', 'Quarter']] = resale['month'].str.split('-', n=1, expand=True)
resale.head()
#a, b, c  =resale['month'].str.split('1',2).str
#a, b, c, d =resale['month'].str.split('1',3).str

Unnamed: 0,month,town,flat_type,floor_area_sqm,lease_commence_date,resale_price,Region,Year,Quarter
87705,2016-11,ANG MO KIO,4 ROOM,92.0,1977,590000.0,North East,2016,11
86039,2016-10,ANG MO KIO,4 ROOM,92.0,1978,490000.0,North East,2016,10
52213,2015-01,ANG MO KIO,3 ROOM,68.0,1979,290000.0,North East,2015,1
46561,2014-09,ANG MO KIO,4 ROOM,92.0,1981,425000.0,North East,2014,9
69890,2015-12,ANG MO KIO,EXECUTIVE,153.0,1996,760000.0,North East,2015,12


In [25]:
# save our clean data 
resale.to_csv('clean_resale.csv')

## Exercise 2.5.4

The current data is not consistent: in the Origin column, Europe is a continent while US and Japan are countries. We want to make it consistent by filling in a "Continent" column. Here is the dictionary:

In [67]:
cars=pd.read_csv('cars.csv')

In [68]:
continent = {
'Europe':'Europe',
'US':'America',
'Japan':'Asia'}

###### a) Using the dictionary above, map each value in Origin to its continent and store the result in a new column named Continent (cars.Continent)

In [69]:
cars['____'] = cars.loc[:,'Origin'].____(continent)
cars

Unnamed: 0,Car,MPG,Cylinders,Displacement,Horsepower,Weight,Acceleration,Model,Origin,Color,Continent
0,Chevrolet Chevelle Malibu,18.0,8,307.0,130,3504,12.0,70,US,blue,America
1,Buick Skylark 320,15.0,8,350.0,165,3693,11.5,70,US,red,America
2,Plymouth Satellite,18.0,8,318.0,150,3436,11.0,70,US,blue,America
3,AMC Rebel SST,16.0,8,304.0,150,3433,12.0,70,US,green,America
4,Ford Torino,17.0,8,302.0,140,3449,10.5,70,US,red,America
5,Ford Galaxie 500,15.0,8,429.0,198,4341,10.0,70,US,red,America
6,Chevrolet Impala,14.0,8,454.0,220,4354,9.0,70,US,green,America
7,Plymouth Fury iii,14.0,8,440.0,215,4312,8.5,70,US,blue,America
8,Pontiac Catalina,14.0,8,455.0,225,4425,10.0,70,US,blue,America
9,AMC Ambassador DPL,15.0,8,390.0,190,3850,8.5,70,US,green,America


###### b) Save the Data Frame as newcar.csv

In [70]:
cars._____('newcar.csv')

## Drop Columns

In [50]:
# drop columns
# inplace so that original data can be modified without creating a copy.
resale.drop('month', axis = 1, inplace=True)
resale.head()

Unnamed: 0,town,flat_type,floor_area_sqm,lease_commence_date,resale_price,Year,Quarter
87705,ANG MO KIO,4 ROOM,92.0,1977,590000.0,2016,11
86039,ANG MO KIO,4 ROOM,92.0,1978,490000.0,2016,10
52213,ANG MO KIO,3 ROOM,68.0,1979,290000.0,2015,1
46561,ANG MO KIO,4 ROOM,92.0,1981,425000.0,2014,9
69890,ANG MO KIO,EXECUTIVE,153.0,1996,760000.0,2015,12


Other string operations include **contains, count, endswith, startswith, findall, lower, upper, match, strip, etc.** 
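A few of these string methods in action, as a minimal sketch on a made-up Series of town names (the inconsistent spacing and casing are assumptions added for illustration):

```python
import pandas as pd

towns = pd.Series([" ang mo kio ", "BUKIT BATOK", "Bukit Merah", "YISHUN"])

# Remove surrounding spaces and normalise the case
tidy = towns.str.strip().str.upper()

# Boolean mask: which names start with "BUKIT"?
is_bukit = tidy.str.startswith("BUKIT")

# How many times does the letter "O" occur in each name?
o_count = tidy.str.count("O")

print(tidy[is_bukit])
```

All of these methods live under the `.str` accessor and work element by element, so they can be chained as shown.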

## Exercise 2.5.4

###### c) Drop 'Origin' Field in cars dataframe

In [72]:
cars=pd.read_csv('newcar.csv', index_col=0)
cars.head(2)

Unnamed: 0,Car,MPG,Cylinders,Displacement,Horsepower,Weight,Acceleration,Model,Origin,Color,Continent
0,Chevrolet Chevelle Malibu,18.0,8,307.0,130,3504,12.0,70,US,blue,America
1,Buick Skylark 320,15.0,8,350.0,165,3693,11.5,70,US,red,America


In [73]:
cars._____('Origin', axis = 1, inplace=True)
cars.head(2)

Unnamed: 0,Car,MPG,Cylinders,Displacement,Horsepower,Weight,Acceleration,Model,Color,Continent
0,Chevrolet Chevelle Malibu,18.0,8,307.0,130,3504,12.0,70,blue,America
1,Buick Skylark 320,15.0,8,350.0,165,3693,11.5,70,red,America


## Dealing with time format  

In [51]:
resale['Year'].dtypes

dtype('O')

In [52]:
from datetime import datetime 
resale['Year'] = pd.to_datetime(resale['Year'])
resale['Year'].dtypes

dtype('<M8[ns]')

In [39]:
resale['Year'].head(3)

87705   2016-01-01
86039   2016-01-01
52213   2015-01-01
Name: Year, dtype: datetime64[ns]

In [53]:
resale['Year'] = [x.strftime('%Y') for x in resale['Year']]
resale['Year'].dtypes

dtype('O')
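The same round trip can be done without a list comprehension by using the `.dt` accessor. The sketch below starts from made-up 'YYYY-MM' strings like those in the `month` column (the Series `months` is an assumption for illustration) and also derives the calendar quarter:

```python
import pandas as pd

months = pd.Series(["2016-11", "2015-01", "2014-09"])

# An explicit format avoids guessing how the string should be parsed
dates = pd.to_datetime(months, format="%Y-%m")

year = dates.dt.year                  # integer year
quarter = dates.dt.quarter            # calendar quarter, 1 to 4
year_text = dates.dt.strftime("%Y")   # back to strings, like the list comprehension above

print(pd.DataFrame({"year": year, "quarter": quarter, "year_text": year_text}))
```

Keeping the column as a datetime for as long as possible makes attributes such as `.dt.quarter` or `.dt.month` available without further string handling.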

## 2.5.5. Supplementing our dataset with more data: Merging and Joining Datasets 

Before we purchase a property, it's useful to take note of any amenities nearby, particularly schools. Our original dataset doesn't come with information about schools, but with Pandas, adding information to our dataset is easy with the merge( ) function. 

In [54]:
# create two data frames 
# price of 4-room for 2016 
price = pd.DataFrame({'town': ['Yishun','Woodlands','Toa Payoh','Tampines','Serangoon'],
                        'price' : [369000,350000,565000,430000,465000]})
# number of secondary schools in area in 2016
schools = pd.DataFrame({'town' : ['Woodlands','Yishun','Toa Payoh','Tampines','Serangoon'],
                        'school_num' : [10,9,6,10,4]})

In [55]:
price

Unnamed: 0,town,price
0,Yishun,369000
1,Woodlands,350000
2,Toa Payoh,565000
3,Tampines,430000
4,Serangoon,465000


In [56]:
schools

Unnamed: 0,town,school_num
0,Woodlands,10
1,Yishun,9
2,Toa Payoh,6
3,Tampines,10
4,Serangoon,4


In [57]:
# combine both data frames into one 
merged = pd.merge(price, schools)
merged 

Unnamed: 0,town,price,school_num
0,Yishun,369000,9
1,Woodlands,350000,10
2,Toa Payoh,565000,6
3,Tampines,430000,10
4,Serangoon,465000,4


Note that the pd.merge( ) function sees that each DataFrame has a 'town' column and hence joins using this column as a key. We have thus painlessly combined the information from the two inputs. 

For the SQL folks, pd.merge( ) also supports more complex joins such as many-to-many, left and inner joins. See the pandas documentation for more details; Wes McKinney's Python for Data Analysis is also a great resource. 
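A minimal sketch of those join types, using two small made-up frames in which not every town appears on both sides (the frames and their values are toy data for illustration):

```python
import pandas as pd

price = pd.DataFrame({"town": ["Yishun", "Woodlands", "Sengkang"],
                      "price": [369000, 350000, 408000]})
schools = pd.DataFrame({"town": ["Woodlands", "Yishun", "Tampines"],
                        "school_num": [10, 9, 10]})

# Inner join (the default): only towns present in both frames
inner = pd.merge(price, schools, on="town")

# Left join: keep every row of price, with NaN where no school count exists
left = pd.merge(price, schools, on="town", how="left")

# Outer join: keep everything; indicator=True adds a _merge column naming the source
outer = pd.merge(price, schools, on="town", how="outer", indicator=True)

print(outer)
```

Naming the key explicitly with `on="town"` makes the join easier to read than relying on pandas to detect the shared column.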

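A quick alternative to the pd.merge( ) function is the pd.concat( ) function. This function allows us to specify whether to join the DataFrames by columns or by rows. Unlike pd.merge( ), it lines rows up by index rather than by the values in 'town', so rows can be paired with the wrong town when the two frames are ordered differently. 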

In [45]:
pd.concat([price, schools], axis=1)

Unnamed: 0,price,town,school_num,town.1
0,369000,Yishun,10,Woodlands
1,350000,Woodlands,9,Yishun
2,565000,Toa Payoh,6,Toa Payoh
3,430000,Tampines,10,Tampines
4,465000,Serangoon,4,Serangoon


### The importance of reset_index()

In [58]:
price.sort_values(by='town')

Unnamed: 0,town,price
4,Serangoon,465000
3,Tampines,430000
2,Toa Payoh,565000
1,Woodlands,350000
0,Yishun,369000


In [59]:
price.sort_values(by='town').reset_index()

Unnamed: 0,index,town,price
0,4,Serangoon,465000
1,3,Tampines,430000
2,2,Toa Payoh,565000
3,1,Woodlands,350000
4,0,Yishun,369000


In [60]:
p=price.sort_values(by='town').reset_index()

In [61]:
s=schools.sort_values(by='town').reset_index()

In [62]:
pd.concat([p, s], axis=1)

Unnamed: 0,index,town,price,index.1,town.1,school_num
0,4,Serangoon,465000,4,Serangoon,4
1,3,Tampines,430000,3,Tampines,10
2,2,Toa Payoh,565000,2,Toa Payoh,6
3,1,Woodlands,350000,0,Woodlands,10
4,0,Yishun,369000,1,Yishun,9
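A possible tidier variant, sketched on made-up two-row frames: `reset_index(drop=True)` discards the old index instead of keeping it as an extra `index` column, and dropping the duplicate `town` column before concatenating avoids the `town.1` column seen above.

```python
import pandas as pd

price = pd.DataFrame({"town": ["Yishun", "Woodlands"], "price": [369000, 350000]})
schools = pd.DataFrame({"town": ["Woodlands", "Yishun"], "school_num": [10, 9]})

# Sort both by the key, then discard the old index so rows line up by position
p = price.sort_values(by="town").reset_index(drop=True)
s = schools.sort_values(by="town").reset_index(drop=True)

# Keep a single town column in the combined frame
side_by_side = pd.concat([p, s.drop(columns="town")], axis=1)
print(side_by_side)
```

This only works when both frames contain exactly the same towns; when that is not guaranteed, `pd.merge` on the `town` key is the safer route.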


In [63]:
# row-wise by default 

price_2 = pd.DataFrame({'town': ['Sengkang','Sembawang'],
                        'price' : [408000,361500]})
pd.concat([price, price_2])

Unnamed: 0,town,price
0,Yishun,369000
1,Woodlands,350000
2,Toa Payoh,565000
3,Tampines,430000
4,Serangoon,465000
0,Sengkang,408000
1,Sembawang,361500


## 2.5.6. Aggregation

After loading our dataset, cleaning up our NaN values, merging our datasets and wrangling with data types, a common next step is to look at `group statistics` such as the `mean` or `median` to get a better feel for our data. In general terms, we want to `split` our dataset into `interesting groups` and apply a function to each group. For this, the pandas `groupby` facility is particularly useful, and allows us to summarise our datasets flexibly and naturally. 

Indeed, one of the reasons why SQL and relational databases are so popular is that they `easily join, filter and transform datasets`. However, the group operations they can perform are limited. Pandas is `more expressive than SQL`, and thus allows more complex grouped operations. As long as a function can accept a pandas object or NumPy array, we'll be fine. 

Here, we'll look at: 

- Performing aggregation using the `Split-Apply-Combine` paradigm 
- `Pivot Tables`

References: `Python for Data Analysis` by Wes McKinney and `Python Data Science Handbook` by Jake VanderPlas

## Split-Apply-Combine 
![alt text](resources/img/splitapplycombine.png)

When dealing with groups (aggregations), we follow a process. Namely, we *split* the data, be it a column or a row, according to a certain *key*. Then, we *apply* a function such as a sum calculation to each group. Finally, we *combine* the results of the function into a new object. 

To make things concrete, let's look at an example, using the HDB resale dataset from before 

In [72]:
import pandas as pd
import numpy as np
resale=pd.read_csv('clean_resale.csv', index_col=0)

In [73]:
resale.head()

Unnamed: 0,month,town,flat_type,floor_area_sqm,lease_commence_date,resale_price,Region,Year,Quarter
87705,2016-11,ANG MO KIO,4 ROOM,92.0,1977,590000.0,North East,2016,11
86039,2016-10,ANG MO KIO,4 ROOM,92.0,1978,490000.0,North East,2016,10
52213,2015-01,ANG MO KIO,3 ROOM,68.0,1979,290000.0,North East,2015,1
46561,2014-09,ANG MO KIO,4 ROOM,92.0,1981,425000.0,North East,2014,9
69890,2015-12,ANG MO KIO,EXECUTIVE,153.0,1996,760000.0,North East,2015,12


In [74]:
# create groupby object that is ready for apply operation that follows.
grouped = resale['resale_price'].groupby(resale['Year'])
grouped

<pandas.core.groupby.groupby.SeriesGroupBy object at 0x000001B83F74C5F8>

In [75]:
# apply mean operation 
grouped.mean()

Year
2012    461212.766234
2013    481367.291545
2014    462673.532308
2015    437969.096859
2016    434037.286486
2017    445642.466051
Name: resale_price, dtype: float64

In [76]:
# another example 
grouped = resale['resale_price'].groupby(resale['town'])
grouped.mean()

town
ANG MO KIO         439061.955556
BEDOK              426231.366667
BISHAN             584746.531250
BUKIT BATOK        415772.394366
BUKIT MERAH        556656.810127
BUKIT PANJANG      448087.831169
BUKIT TIMAH        640500.000000
CENTRAL AREA       612127.066667
CHOA CHU KANG      414550.305882
CLEMENTI           477348.590164
GEYLANG            423222.666667
HOUGANG            446038.921348
JURONG EAST        423955.555556
JURONG WEST        426874.733333
KALLANG/WHAMPOA    496186.920635
MARINE PARADE      510032.761905
PASIR RIS          491206.356164
PUNGGOL            494690.918919
QUEENSTOWN         507820.943273
SEMBAWANG          430690.909091
SENGKANG           480549.843750
SERANGOON          445737.837838
TAMPINES           465085.083333
TOA PAYOH          444101.189189
WOODLANDS          424822.878981
YISHUN             384895.669421
Name: resale_price, dtype: float64
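
To see what `groupby(...).mean()` is doing for us, the three steps can be written out by hand. This is only a sketch on made-up toy data (not the contents of `clean_resale.csv`), and it is not how pandas implements grouping internally.

```python
import pandas as pd

# Toy data, invented for this sketch
toy = pd.DataFrame({
    "Year": [2012, 2012, 2013, 2013, 2013],
    "resale_price": [400000.0, 440000.0, 300000.0, 450000.0, 600000.0],
})

# Split: iterating over a groupby object yields (key, sub-Series) pairs
pieces = {year: prices for year, prices in toy["resale_price"].groupby(toy["Year"])}

# Apply: compute the mean of each piece separately
means = {year: prices.mean() for year, prices in pieces.items()}

# Combine: gather the per-group results into a single Series
manual = pd.Series(means, name="resale_price")
manual.index.name = "Year"

# The one-line version used in the cells above
auto = toy["resale_price"].groupby(toy["Year"]).mean()

print(manual)
print(auto)
```

Both printouts should show the same per-year averages; the one-liner simply hides the loop.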

## Exercise 2.5.6

###### a) Sample 3 records from the dataframe

In [91]:
cars.sample(3)

Unnamed: 0,Car,MPG,Cylinders,Displacement,Horsepower,Weight,Acceleration,Model,Color,Continent
277,Toyota Celica GT Liftback,21.1,4,134.0,95,2515,14.8,78,green,Asia
151,Toyota Corolla,31.0,4,76.0,52,1649,16.5,74,green,Asia
144,Buick Century Luxus (sw),13.0,8,350.0,150,4699,14.5,74,green,America


###### b) Find the MPG mean grouped by Continent.  Which Continent has the lowest average MPG? (do not buy cars manufactured by this continent??) 

In [92]:
group_car = cars['MPG'].______(cars['Continent'])
group_car.mean()

Continent
America    19.688189
Asia       30.450633
Europe     26.745205
Name: MPG, dtype: float64

###### c) Find the MPG mean grouped by Year (Model).  Which Year has the highest average MPG? (Buy cars manufactured in this Year??) 

In [93]:
group_car = cars['MPG'].____(cars['_____'])
group_car.mean()

Model
70    14.657143
71    20.517241
72    18.714286
73    17.100000
74    22.703704
75    20.266667
76    21.573529
77    23.375000
78    24.061111
79    25.093103
80    33.696552
81    29.323333
82    31.709677
Name: MPG, dtype: float64

## Combine Keys

In [69]:
# combine keys 
grouped = resale['resale_price'].groupby([resale['Year'], resale['town']])
grouped.mean()

Year  town           
2012  ANG MO KIO         411400.000000
      BEDOK              465013.043478
      BISHAN             513250.000000
      BUKIT BATOK        420533.333333
      BUKIT MERAH        504000.000000
      BUKIT PANJANG      480549.142857
      BUKIT TIMAH        715000.000000
      CENTRAL AREA       485000.000000
      CHOA CHU KANG      456064.000000
      CLEMENTI           430752.941176
      GEYLANG            389727.272727
      HOUGANG            463644.400000
      JURONG EAST        442600.000000
      JURONG WEST        446944.444444
      KALLANG/WHAMPOA    377250.000000
      MARINE PARADE      545750.000000
      PASIR RIS          493291.666667
      PUNGGOL            526555.500000
      QUEENSTOWN         456624.916667
      SEMBAWANG          456166.666667
      SENGKANG           536718.074074
      SERANGOON          414714.285714
      TAMPINES           445495.130435
      TOA PAYOH          488127.777778
      WOODLANDS          455305.529412
   

In [70]:
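# the code above groups by Series that are passed in explicitly, so the keys need not be columns of the data frame
# if the grouping key is a column of the same data frame, there is a shorter way: pass the column name
# numeric_only=True keeps only the numeric columns (recent pandas versions no longer drop text columns silently)
grouped = resale.groupby('town').mean(numeric_only=True)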
grouped.head()

Unnamed: 0_level_0,floor_area_sqm,lease_commence_date,resale_price,Year,Quarter
town,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
ANG MO KIO,85.533333,1982.944444,439061.955556,2014.655556,6.588889
BEDOK,88.891667,1983.233333,426231.366667,2014.158333,6.55
BISHAN,102.96875,1987.15625,584746.53125,2014.625,6.59375
BUKIT BATOK,93.380282,1989.549296,415772.394366,2014.352113,5.605634
BUKIT MERAH,87.620253,1986.405063,556656.810127,2014.303797,6.607595
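
The shorter form can be sketched on a toy frame (values invented for illustration). It shows two common variations: averaging every numeric column, and selecting a single column before aggregating.

```python
import pandas as pd

# Toy data, invented for this sketch
toy = pd.DataFrame({
    "town": ["BEDOK", "BEDOK", "YISHUN", "YISHUN"],
    "flat_type": ["3 ROOM", "4 ROOM", "4 ROOM", "5 ROOM"],
    "floor_area_sqm": [68.0, 92.0, 90.0, 110.0],
    "resale_price": [300000.0, 420000.0, 380000.0, 460000.0],
})

# Group by column name and average every numeric column;
# numeric_only=True explicitly skips the text column flat_type
all_numeric = toy.groupby("town").mean(numeric_only=True)

# Select one column first to get a Series instead of a DataFrame
price_only = toy.groupby("town")["resale_price"].mean()

print(all_numeric)
print(price_only)
```

Selecting the column before calling `mean()` also avoids computing averages that are not meaningful, such as the mean of `Year`.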


## Exercise 2.5.6

###### d) Find the average MPG groupby Continent, and Year (Model)

In [96]:
group_car = cars['MPG'].____([cars['_____'], cars['____']])

In [97]:
group_car.mean()

Continent  Model
America    70       12.444444
           71       18.100000
           72       16.277778
           73       15.034483
           74       18.333333
           75       17.550000
           76       19.431818
           77       20.722222
           78       21.772727
           79       23.478261
           80       25.914286
           81       27.530769
           82       29.450000
Asia       70       25.500000
           71       29.500000
           72       24.200000
           73       20.000000
           74       29.333333
           75       27.500000
           76       28.000000
           77       27.416667
           78       29.687500
           79       32.950000
           80       35.400000
           81       32.958333
           82       34.888889
Europe     70       21.000000
           71       23.000000
           72       22.000000
           73       24.000000
           74       27.000000
           75       24.500000
           76       24.

In [98]:
# the code above groups by Series that are passed in explicitly, so the keys need not be columns of the data frame
# if the grouping keys are columns of the same data frame, there is a shorter way: pass the column names
# use numeric_only=True so that only numeric columns are averaged

######  e) Find the average of ALL numbers available in Dataframe, groupby Continent, and Year

In [99]:
cars.groupby([cars['_____'],cars['_____']]).mean(numeric_only=True)

Unnamed: 0_level_0,Unnamed: 1_level_0,MPG,Cylinders,Displacement,Horsepower,Weight,Acceleration
Continent,Model,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
America,70,12.444444,7.703704,339.185185,165.962963,3752.148148,11.685185
America,71,18.1,6.2,257.0,113.85,3401.6,14.575
America,72,16.277778,6.888889,281.25,138.777778,3682.666667,14.055556
America,73,15.034483,7.241379,314.103448,146.62069,3821.448276,13.62069
America,74,18.333333,6.266667,236.066667,104.666667,3503.333333,15.966667
America,75,17.55,6.4,253.4,108.7,3533.2,16.35
America,76,19.431818,6.363636,243.954545,110.5,3405.409091,15.786364
America,77,20.722222,6.222222,242.333333,118.388889,3422.0,15.238889
America,78,21.772727,6.0,217.545455,107.272727,3141.136364,15.545455
America,79,23.478261,6.26087,231.26087,109.434783,3210.217391,15.243478


## Pivot Table

In [100]:
import pandas as pd
resale=pd.read_csv('clean_resale.csv', index_col=0)
resale_groupby_year_region=resale['resale_price'].groupby([resale['Year'],resale['Region']])

# convert the result to a DataFrame
m_resale_groupby_year_region=pd.DataFrame(resale_groupby_year_region.mean())

In [101]:
m_resale_groupby_year_region

Unnamed: 0_level_0,Unnamed: 1_level_0,resale_price
Year,Region,Unnamed: 2_level_1
2012,Central,463483.547945
2012,East,468454.645161
2012,North,441326.375
2012,North East,488158.097561
2012,West,446294.0
2013,Central,505523.28169
2013,East,478893.863014
2013,North,425864.40678
2013,North East,515623.0
2013,West,479246.285714
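
The table above is in "long" form, with one row per (Year, Region) pair. A pivot table holds the same averages in "wide" form. The sketch below uses `pd.pivot_table` on made-up toy data; the column names mirror the resale data, but the numbers are invented.

```python
import pandas as pd

# Toy data, invented for this sketch
toy = pd.DataFrame({
    "Year": [2012, 2012, 2012, 2013, 2013],
    "Region": ["East", "East", "West", "East", "West"],
    "resale_price": [400000.0, 500000.0, 420000.0, 480000.0, 440000.0],
})

# Long form: one row per (Year, Region) pair
long_form = toy.groupby(["Year", "Region"])["resale_price"].mean()

# Wide form: Year down the rows, Region across the columns
wide = pd.pivot_table(toy, values="resale_price", index="Year",
                      columns="Region", aggfunc="mean")

# unstack() moves the Region level of the long form into the columns
same = long_form.unstack("Region")

print(wide)
print(same)
```

The wide layout is often easier to read or plot, while the long layout is convenient for further grouping.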


### 2.5.7. Retrieving from DataFrame MultiIndex (Reference: Self-Study)

In [102]:
grouped=resale['resale_price'].groupby([resale['Year'],resale['Quarter']])

In [103]:
Year_Quarter=pd.DataFrame(grouped.mean())

In [104]:
Year_Quarter.index

MultiIndex(levels=[[2012, 2013, 2014, 2015, 2016, 2017], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]],
           labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5], [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5]],
           names=['Year', 'Quarter'])

In [105]:
type(Year_Quarter)

pandas.core.frame.DataFrame

In [106]:
Year_Quarter

Unnamed: 0_level_0,Unnamed: 1_level_0,resale_price
Year,Quarter,Unnamed: 2_level_1
2012,3,477562.222222
2012,4,434671.428571
2012,5,480906.545455
2012,6,418536.000000
2012,7,439475.288889
2012,8,466418.604651
2012,9,468322.615385
2012,10,454875.288889
2012,11,512059.314286
2012,12,455626.441176


In [107]:
Year_Quarter.xs(2012)

Unnamed: 0_level_0,resale_price
Quarter,Unnamed: 1_level_1
3,477562.222222
4,434671.428571
5,480906.545455
6,418536.0
7,439475.288889
8,466418.604651
9,468322.615385
10,454875.288889
11,512059.314286
12,455626.441176


In [108]:
Year_Quarter.xs(12, level='Quarter')
# All the resale_price in December

Unnamed: 0_level_0,resale_price
Year,Unnamed: 1_level_1
2012,455626.441176
2013,471669.833333
2014,485293.103448
2015,438255.466667
2016,407162.933333


In [109]:
# all months in 2012
Year_Quarter.loc[2012, :]  

Unnamed: 0_level_0,resale_price
Quarter,Unnamed: 1_level_1
3,477562.222222
4,434671.428571
5,480906.545455
6,418536.0
7,439475.288889
8,466418.604651
9,468322.615385
10,454875.288889
11,512059.314286
12,455626.441176


In [110]:
# the resale price in Dec 2012 
Year_Quarter.loc[2012, 12, :] 

Unnamed: 0_level_0,Unnamed: 1_level_0,resale_price
Year,Quarter,Unnamed: 2_level_1
2012,12,455626.441176


In [111]:
# December of every year
Year_Quarter.loc[(slice(None), slice(12, 12)), :]

Unnamed: 0_level_0,Unnamed: 1_level_0,resale_price
Year,Quarter,Unnamed: 2_level_1
2012,12,455626.441176
2013,12,471669.833333
2014,12,485293.103448
2015,12,438255.466667
2016,12,407162.933333


In [112]:
from pandas import IndexSlice as idx

In [113]:
Year_Quarter.loc[idx[2012:2014,12],:]

Unnamed: 0_level_0,Unnamed: 1_level_0,resale_price
Year,Quarter,Unnamed: 2_level_1
2012,12,455626.441176
2013,12,471669.833333
2014,12,485293.103448
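
The selection styles used in this subsection can be compared side by side on a tiny MultiIndex frame. The data here is made up, and the level names simply copy the ones above; treat it as a sketch rather than a complete guide to MultiIndex slicing.

```python
import pandas as pd

# Toy two-level index (Year, Quarter) with invented prices
index = pd.MultiIndex.from_product([[2012, 2013], [11, 12]],
                                   names=["Year", "Quarter"])
toy = pd.DataFrame({"resale_price": [100.0, 200.0, 300.0, 400.0]}, index=index)

idx = pd.IndexSlice

# Outer level only: every row for 2012 (the Year level is dropped)
year_2012 = toy.xs(2012)

# Inner level only: label 12 across all years (the Quarter level is dropped)
dec_all = toy.xs(12, level="Quarter")

# One exact (Year, Quarter) pair, passed as a tuple: returns that row as a Series
dec_2012 = toy.loc[(2012, 12), :]

# Slice on the outer level plus a single inner label: both levels are kept
dec_range = toy.loc[idx[2012:2013, 12], :]

print(year_2012, dec_all, dec_2012, dec_range, sep="\n\n")
```

Slicing with `IndexSlice` expects the index to be sorted, which `from_product` on sorted lists provides here.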


### More Mapping and Groupby

In [73]:
# group with a mapping 
# DataFrame of property guru ratings 
import numpy as np
import pandas as pd
rating = pd.DataFrame(np.random.randint(0,5, size=(4,4)),
                  columns=['Bedok','Changi','Queenstown','Toa Payoh'],
                  index=['Tan','Dennis','Jane','Wong'])
rating

Unnamed: 0,Bedok,Changi,Queenstown,Toa Payoh
Tan,1,1,2,1
Dennis,2,2,3,1
Jane,1,3,3,3
Wong,1,3,1,4


In [74]:
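# avoid naming the dictionary `map`, which would shadow Python's built-in map()
town_to_region = {
    'Bedok' : 'South East',
    'Changi' : 'South East',
    'Queenstown' : 'Central',
    'Toa Payoh' : 'Central'
}

# group the columns by region: transpose, group the rows, then transpose back
# (groupby(..., axis=1) is deprecated in recent pandas versions)
by_column = rating.T.groupby(town_to_region)
by_column.mean().T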

Unnamed: 0,Central,South East
Tan,1.5,1.0
Dennis,2.0,2.0
Jane,3.0,2.0
Wong,2.5,2.0


In [75]:
# group with a function 
# len is called on each index label, so rows are grouped by name length:
# 'Tan' -> 3, 'Jane' and 'Wong' -> 4, 'Dennis' -> 6
# for group 4, min() takes the column-wise minimum over Jane and Wong,
# so that row of the result need not match either person's ratings
rating.groupby(len).min()

Unnamed: 0,Bedok,Changi,Queenstown,Toa Payoh
3,1,1,2,1
4,1,3,1,3
6,2,2,3,1
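
To check which rows end up in which group when grouping with a function, we can inspect the `groups` attribute of the groupby object. The sketch below uses fixed, made-up ratings (instead of random ones) so the result is reproducible.

```python
import pandas as pd

# Fixed toy ratings, invented for this sketch
rating = pd.DataFrame(
    {"Bedok": [1, 2, 1, 1], "Changi": [1, 2, 3, 3]},
    index=["Tan", "Dennis", "Jane", "Wong"],
)

# len is applied to each index label, so the group key is the name length
by_length = rating.groupby(len)

# groups maps each key to the index labels that fell into that group
for key, labels in by_length.groups.items():
    print(key, list(labels))

# Column-wise minimum within each name-length group
smallest = by_length.min()
print(smallest)
```

Any function of a single index label works the same way, for example `lambda name: name[0]` to group by first letter.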


#### Slicing DataFrames can be a complex topic. For more examples, see the Pandas docs (http://pandas.pydata.org/pandas-docs/stable/advanced.html?highlight=indexslice#using-slicers)


# Possible Answer:

## Exercise 2.5.1

###### a) Load cars.csv data into dataframe named cars:

In [None]:
# load cars.csv data into dataframe named cars
cars = pd.read_csv("cars.csv",index_col=None)

In [None]:
x, y, z=cars.MPG.iloc[[0,1,2]]
cars.loc[[0,1,2],['MPG']]=str(x), str(y), str(z)

###### b) Observe the three rows from the dataframe using head:

In [None]:
cars.head(3)

###### c) Check how many rows and columns using shape

In [None]:
cars.shape

###### d) Get 10 samples from the dataframe

In [None]:
cars.sample(10)

###### e) Check how many unique Car values are in the cars DataFrame:

In [None]:
# How many unique cars are in the cars DataFrame?
cars.Car.nunique()

###### f) Check how many unique Cylinders values are in the cars DataFrame:

In [None]:
# How many unique cylinder counts are in the cars DataFrame?
cars.Cylinders.nunique()

###### g) Check which unique countries appear in the cars DataFrame:

In [None]:
# Which countries are these cars from?
cars.Origin.unique()

###### h) Check the datatypes in the Cars DataFrame:

In [None]:
cars.dtypes

###### i) MPG is not recognised as a number. Convert the object values in MPG to float64:

In [None]:
# Convert MPG to float64

In [None]:
cars[["MPG"]] = cars[["MPG"]].apply(pd.to_numeric, errors = "coerce")
cars["MPG"].head()
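
The effect of `errors="coerce"` is easiest to see on a tiny Series. The values below are made up; the point of the sketch is that text which cannot be parsed as a number becomes `NaN` instead of raising an error.

```python
import pandas as pd

# Made-up values: two numeric strings and one that cannot be parsed
mpg = pd.Series(["18.0", "15.5", "n/a"])

# Unparseable entries are replaced by NaN rather than raising an error
converted = pd.to_numeric(mpg, errors="coerce")

print(converted)
print(converted.dtype)
```

After such a conversion it is worth counting the missing values, for example with `converted.isna().sum()`, to see how many entries failed to parse.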

###### j) check the datatype in cars dataframe again:

In [None]:
cars.dtypes

## Exercise 2.5.2

###### a) Find a car that is efficient (that can run 40 miles per gallon). List ALL OF THEM

In [None]:
cars.loc[cars.MPG > 40]

###### b) Find a car that is efficient (that can run 40 miles per gallon). List ONLY from Japan

In [None]:
cars.loc[(cars.MPG > 40) & (cars.Origin=='Japan')]

###### c) Find a car that is efficient (that can run 40 miles per gallon). List ONLY from those NOT from Japan

In [None]:
cars.loc[(cars.MPG > 40) & (cars.Origin!='Japan')]
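
The filters in a) to c) combine boolean masks. A minimal sketch on made-up rows shows the pieces separately; the car names and numbers are invented and are not taken from `cars.csv`.

```python
import pandas as pd

# Toy data, invented for this sketch
toy = pd.DataFrame({
    "Car": ["A", "B", "C"],
    "MPG": [43.1, 44.0, 25.0],
    "Origin": ["Japan", "Europe", "Japan"],
})

# Each comparison yields a boolean Series, one True/False per row
efficient = toy["MPG"] > 40
japanese = toy["Origin"] == "Japan"

# & is element-wise AND, ~ is element-wise NOT
both = toy.loc[efficient & japanese]
not_japan = toy.loc[efficient & ~japanese]

print(both)
print(not_japan)
```

When the comparisons are written inline, each one needs its own parentheses, because `&` binds more tightly than `>` or `==` in Python.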

##### *d) Self-Study (Google).  Find all cars with the keyword "Renault"*

In [None]:
cars[cars['Car'].str.contains("Renault")]

## Exercise 2.5.3

In [None]:
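# blank out a few cells so there are missing values to practise on (np is NumPy, imported earlier)
cars.iloc[:2, 1] = np.nan
cars.iloc[1:10, 2] = np.nan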

###### a) Check the first 10 records and find the NaN

In [None]:
cars.head(12)

###### b) Fill each NaN by propagating the next valid value backward

In [None]:
# bfill() is the current spelling of fillna(method='bfill')
cars.bfill().head(11)
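
Backward fill and forward fill are easy to mix up, so here is a small sketch on a made-up Series. Backward fill copies the next valid value into a gap, forward fill copies the previous one.

```python
import numpy as np
import pandas as pd

# Made-up Series with gaps at the start, middle and end
s = pd.Series([np.nan, 2.0, np.nan, 4.0, np.nan])

# Backward fill: each NaN takes the next valid value; a trailing NaN stays
back = s.bfill()

# Forward fill: each NaN takes the previous valid value; a leading NaN stays
forward = s.ffill()

print(pd.DataFrame({"original": s, "bfill": back, "ffill": forward}))
```

Neither method can fill a gap that has no valid value on the side it copies from, which is why one `NaN` survives in each result.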

## Exercise 2.5.4

The current data is not consistent: the Origin column mixes a continent (Europe) with countries (Japan, US). We want to make it consistent by adding a "Continent" column. Here is the dictionary:

In [None]:
cars=pd.read_csv('cars.csv')

In [None]:
continent = {
'Europe':'Europe',
'US':'America',
'Japan':'Asia'}

###### a) Use the dictionary above to map each value in Origin to its continent, and assign the result to a new column named cars.Continent 

In [None]:
cars['Continent'] = cars.loc[:,'Origin'].map(continent)
cars
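
One thing to watch with `map` and a dictionary: any value without a matching key becomes `NaN`. The sketch below adds a made-up origin, "Korea", that is not in the dictionary, purely to show this behaviour.

```python
import pandas as pd

# "Korea" is a made-up extra value with no entry in the dictionary
origin = pd.Series(["US", "Japan", "Europe", "Korea"])

continent = {"Europe": "Europe", "US": "America", "Japan": "Asia"}

# Each value is looked up in the dictionary; missing keys give NaN
mapped = origin.map(continent)

print(mapped)
print(mapped.isna().sum(), "value(s) had no match")
```

Checking for `NaN` after mapping is a quick way to catch spelling variants or categories the dictionary does not cover.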

###### b) Save the Data Frame as newcar.csv

In [None]:
cars.to_csv('newcar.csv')

## Exercise 2.5.4

###### c) Drop 'Origin' Field in cars dataframe

In [None]:
cars=pd.read_csv('newcar.csv', index_col=0)
cars.head(2)

In [None]:
cars.drop('Origin', axis = 1, inplace=True)
cars.head(2)

## Exercise 2.5.5

###### a) Sample 3 records from the dataframe

In [None]:
cars.sample(3)

###### b) Find the MPG mean grouped by Continent.  Which Continent has the lowest average MPG? (do not buy cars manufactured by this continent??) 

In [None]:
group_car = cars['MPG'].groupby(cars['Continent'])
group_car.mean()

###### c) Find the MPG mean grouped by Year (Model).  Which Year has the highest average MPG? (Buy cars manufactured in this Year??) 

In [None]:
group_car = cars['MPG'].groupby(cars['Model'])
group_car.mean()

## Exercise 2.5.6

###### d) Find the average MPG groupby Continent, and Year (Model)

In [None]:
group_car = cars['MPG'].groupby([cars['Continent'], cars['Model']])

In [None]:
group_car.mean()

In [None]:
# the code above groups by Series that are passed in explicitly, so the keys need not be columns of the data frame
# if the grouping keys are columns of the same data frame, there is a shorter way: pass the column names
# use numeric_only=True so that only numeric columns are averaged

######  e) Find the average of ALL numbers available in Dataframe, groupby Continent, and Year

In [None]:
cars.groupby(['Continent', 'Model']).mean(numeric_only=True)