In this new section, Data selection, we will learn advanced techniques for selecting data with pandas: how to select a subset of data, how to select multiple rows and columns from a dataset, how to sort a pandas DataFrame or Series, how to filter the rows of a pandas DataFrame, and how to apply multiple filters to a DataFrame. We will also look at how to use the axis parameter in pandas and at the uses of string methods in pandas. Finally, we will learn how to change the datatype of a pandas Series.

We will be using a real dataset from zillow.com, an online real estate marketplace that releases house price datasets as part of its research effort.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
%cd /content/drive/My Drive/Colab Notebooks

/content/drive/My Drive/Colab Notebooks


In [3]:
# Import the pandas library under its conventional alias
import pandas as pd

We will then read in our dataset. Since it is a CSV file, we will use the pandas read_csv function. We pass it the file name, with a comma as the separator, and store the resulting DataFrame in a variable named data.

In [4]:
data = pd.read_csv('data-zillow.csv', sep=',')
data.head()

Unnamed: 0,Date,RegionID,RegionName,State,Metro,County,SizeRank,Zhvi
0,2017-05-31,6181,New York,NY,New York,Queens,0,672400
1,2017-05-31,12447,Los Angeles,CA,Los Angeles-Long Beach-Anaheim,Los Angeles,1,629900
2,2017-05-31,17426,Chicago,IL,Chicago,Cook,2,222700
3,2017-05-31,13271,Philadelphia,PA,Philadelphia,Philadelphia,3,137300
4,2017-05-31,40326,Phoenix,AZ,Phoenix,Maricopa,4,211300


The DataFrame has been created, and calling data.head() on it displays the first five records. The output shows columns such as Date and several location fields, such as RegionName, State, Metro and County. The last column, Zhvi, is a Zillow term (the Zillow Home Value Index) and represents the typical house price of that particular region.
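If the CSV file is not at hand, the five rows printed above are enough to experiment with. The following is a minimal sketch, assuming we are happy with a stand-in: the name sample is hypothetical, the Unnamed: 0 column is left out, and the text is read from memory rather than from data-zillow.csv.

```python
import io
import pandas as pd

# Stand-in for data-zillow.csv: the five rows shown by data.head() above.
csv_text = """Date,RegionID,RegionName,State,Metro,County,SizeRank,Zhvi
2017-05-31,6181,New York,NY,New York,Queens,0,672400
2017-05-31,12447,Los Angeles,CA,Los Angeles-Long Beach-Anaheim,Los Angeles,1,629900
2017-05-31,17426,Chicago,IL,Chicago,Cook,2,222700
2017-05-31,13271,Philadelphia,PA,Philadelphia,Philadelphia,3,137300
2017-05-31,40326,Phoenix,AZ,Phoenix,Maricopa,4,211300
"""

# read_csv accepts a file-like object as well as a path; sep=',' is the default.
sample = pd.read_csv(io.StringIO(csv_text), sep=',')

print(sample.shape)   # (number of rows, number of columns)
print(sample.dtypes)  # column types inferred by pandas
```

Because sample has the same column names as data, the sorting calls in the rest of this section can be tried on it unchanged.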

**Sorting a pandas DataFrame**

In this part we will learn about the pandas sort_values method. We will use it to sort a pandas DataFrame in several ways and also learn how to sort a pandas Series object.

Now let's start with a simple type of sorting. We will use the pandas sort_values method for this. For example, imagine that we want to sort the data by the Metro column. We pass 'Metro' as an argument to the sort_values method and call the method on the DataFrame.

In [5]:
data.sort_values('Metro')

Unnamed: 0,Date,RegionID,RegionName,State,Metro,County,SizeRank,Zhvi
9851,2017-05-31,48458,Westport,WA,Aberdeen,Grays Harbor,9851,144600
4996,2017-05-31,36873,Elma,WA,Aberdeen,Grays Harbor,4996,175200
5090,2017-05-31,35514,Hoquiam,WA,Aberdeen,Grays Harbor,5090,95700
9401,2017-05-31,33215,Ocean Shores,WA,Aberdeen,Grays Harbor,9401,152400
9149,2017-05-31,18370,Grayland,WA,Aberdeen,Grays Harbor,9149,143900
...,...,...,...,...,...,...,...,...
10764,2017-05-31,35349,Fraser,CO,,Grand,10764,274500
10768,2017-05-31,17816,Dresser,WI,,Polk,10768,189900
10774,2017-05-31,34232,Tamworth,NH,,Carroll,10774,164100
10822,2017-05-31,18317,Goldsboro,MD,,Caroline,10822,175700
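In the output above, the rows with an empty Metro value appear at the very end. This matches the default of the na_position parameter of sort_values, which is 'last'. A small sketch of that behaviour follows, assuming a hypothetical three-row frame named towns built from values seen in the outputs above.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame: Fraser has no Metro value, as in the output above.
towns = pd.DataFrame({
    "RegionName": ["Fraser", "Elma", "Chicago"],
    "Metro": [np.nan, "Aberdeen", "Chicago"],
})

# Default behaviour (na_position='last'): the row without a Metro comes last.
missing_last = towns.sort_values("Metro")

# na_position='first' moves rows with a missing Metro to the top instead.
missing_first = towns.sort_values("Metro", na_position="first")

print(missing_last["RegionName"].tolist())
print(missing_first["RegionName"].tolist())
```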


As you may notice, by default the Metro column is sorted in ascending order. We can change the sort order by setting the ascending parameter to False.

In [6]:
sorted = data.sort_values('Metro', ascending=False)
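The cell above assigns the result to a variable because sort_values returns a new DataFrame by default and leaves the original untouched. The sketch below illustrates this on a hypothetical three-row frame named cities. It stores the result under the name descending rather than sorted, since sorted is also the name of a Python built-in function; that naming choice is ours, not part of the original notebook.

```python
import pandas as pd

# Hypothetical mini-frame using three cities from the dataset preview.
cities = pd.DataFrame({
    "RegionName": ["New York", "Chicago", "Phoenix"],
    "Metro": ["New York", "Chicago", "Phoenix"],
})

# ascending=False sorts Metro from Z to A and returns a new DataFrame.
descending = cities.sort_values("Metro", ascending=False)

print(descending["Metro"].tolist())  # sorted copy
print(cities["Metro"].tolist())      # original order is unchanged
```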

The ascending parameter is optional; when it is not passed, it defaults to True. Now we will look at how to sort data by more than one column. To do this, we pass the list of columns by which we want the data sorted to the by parameter of the sort_values method.

In [9]:
sorted = data.sort_values(by=['Metro', 'County'])
sorted.head()

Unnamed: 0,Date,RegionID,RegionName,State,Metro,County,SizeRank,Zhvi
2073,2017-05-31,30116,Aberdeen,WA,Aberdeen,Grays Harbor,2073,127800
4568,2017-05-31,56078,Montesano,WA,Aberdeen,Grays Harbor,4568,182000
4996,2017-05-31,36873,Elma,WA,Aberdeen,Grays Harbor,4996,175200
5090,2017-05-31,35514,Hoquiam,WA,Aberdeen,Grays Harbor,5090,95700
7108,2017-05-31,6275,Oakville,WA,Aberdeen,Grays Harbor,7108,186900
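Every row in the output above shares the same Metro and County, so the effect of the second sort key is hard to see. The sketch below uses a made-up four-row frame (the single-letter Metro and County values are placeholders, not real data) to show how the second column only breaks ties left by the first.

```python
import pandas as pd

# Placeholder values chosen so that each Metro has two different counties.
homes = pd.DataFrame({
    "Metro": ["B", "A", "B", "A"],
    "County": ["Y", "Z", "X", "W"],
})

# Rows are ordered by Metro first; County decides the order within each Metro.
by_metro_county = homes.sort_values(by=["Metro", "County"])

print(by_metro_county)
```

The original index labels are kept, so the left-hand column of the printed result shows where each row came from.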


The data has now been sorted by Metro first and then by County, that is, in the same order in which we passed them to the sort_values method. We can take the multiple-column sort further and introduce a mixed sort order. For example, we can sort by three columns: Metro, County and the price column, Zhvi.

In [10]:
sorted = data.sort_values(by=['Metro', 'County', 'Zhvi'], ascending=[True, True, False])
sorted.head()

Unnamed: 0,Date,RegionID,RegionName,State,Metro,County,SizeRank,Zhvi
7108,2017-05-31,6275,Oakville,WA,Aberdeen,Grays Harbor,7108,186900
4568,2017-05-31,56078,Montesano,WA,Aberdeen,Grays Harbor,4568,182000
4996,2017-05-31,36873,Elma,WA,Aberdeen,Grays Harbor,4996,175200
8420,2017-05-31,19269,McCleary,WA,Aberdeen,Grays Harbor,8420,170700
9401,2017-05-31,33215,Ocean Shores,WA,Aberdeen,Grays Harbor,9401,152400


Notice that we pass a list of three Boolean values to the ascending parameter. This sets the sort order to ascending for Metro and County, and to descending for the last column, Zhvi.
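The pairing between the by list and the ascending list can be checked by hand on a few rows. This is a minimal sketch, assuming a hypothetical frame named homes built from four rows that appear in the outputs above.

```python
import pandas as pd

# Four rows copied from the outputs above: three Aberdeen towns plus Chicago.
homes = pd.DataFrame({
    "RegionName": ["Elma", "Chicago", "Oakville", "Hoquiam"],
    "Metro": ["Aberdeen", "Chicago", "Aberdeen", "Aberdeen"],
    "County": ["Grays Harbor", "Cook", "Grays Harbor", "Grays Harbor"],
    "Zhvi": [175200, 222700, 186900, 95700],
})

# One Boolean per sort key, given in the same order as the keys in `by`.
ranked = homes.sort_values(
    by=["Metro", "County", "Zhvi"],
    ascending=[True, True, False],
)

print(ranked[["RegionName", "Zhvi"]])
```

Within the Aberdeen group the most expensive town comes first, while the Metro groups themselves stay in alphabetical order.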

Next we will see how to sort a Series object. First, let's create a Series by selecting the RegionID column from our dataset.

In [11]:
regions = data.RegionID
type(regions)

pandas.core.series.Series

In [12]:
# Now let's look at the original Series by using regions.head()
regions.head()

0     6181
1    12447
2    17426
3    13271
4    40326
Name: RegionID, dtype: int64

Now let's sort it by calling the sort_values method on it. Since a Series holds only one column of values, we do not need to pass any column name. Hence, the code to sort the data is regions.sort_values().head().

In [13]:
regions.sort_values().head()

3043    3301
4159    3304
4986    3305
1762    3310
3116    3312
Name: RegionID, dtype: int64
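Notice in the output above that each value keeps its original index label after sorting. The sketch below, which assumes a hypothetical Series named ids holding the five RegionID values printed by regions.head(), shows a descending sort and how sort_index brings back the original order.

```python
import pandas as pd

# The first five RegionID values shown by regions.head() above.
ids = pd.Series([6181, 12447, 17426, 13271, 40326], name="RegionID")

# A Series accepts the same ascending parameter as a DataFrame.
largest_first = ids.sort_values(ascending=False)
print(largest_first)

# The index labels travel with the values, so sorting by index
# restores the order the Series had before sorting.
restored = largest_first.sort_index()
print(restored.equals(ids))
```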