# <font color='#eb3483'> Filtering in Pandas </font>

Once we have read our data frame in and had a look around, we may want to start working with specific columns or rows, or data that only meets a certain criterion. We do this with filtering.   

Indices are incredibly useful, because they allow us to quickly and intuitively pick out specific data points. In this notebook, we're going to practice using indices to filter our dataframes.

The two most fundamental indexers are `loc` and `iloc` (that is, integer-loc), each followed by an identifier for the desired location in square brackets, for example `df.loc[6499]` or `df.iloc[0]`. Mastering `loc` and `iloc` early will stand you in good stead when using the rest of the Pandas data API.

There are two important things you should know about `iloc`:

1. It is reserved for purely position-based indexing: integers, lists or slices of integers, or a boolean array. If you call ```iloc``` with a label that is not an integer position (a string, for example), it will throw an error.
2. `iloc` does not interact with your assigned index at all - it only considers the row positions, starting at zero.  This is important to remember if your assigned index is integer-based.

In contrast, `loc` references only the assigned index of your dataframe without regard for row ordering.
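
To make the difference concrete, here is a minimal sketch on a made-up three-row frame whose integer index is deliberately out of order (the `toy` frame and its values are illustrative, not taken from the Airbnb file):

```python
import pandas as pd

# Toy frame with an integer index that is NOT 0, 1, 2 in order (made-up values).
toy = pd.DataFrame(
    {"price": [57.0, 46.0, 69.0]},
    index=pd.Index([2, 0, 1], name="room_id"),
)

by_position = toy.iloc[0]  # first row as stored: the one labelled 2
by_label = toy.loc[0]      # row whose index label is 0: the second row as stored

print(by_position["price"])  # 57.0
print(by_label["price"])     # 46.0
```

With an integer index, `iloc[0]` and `loc[0]` can therefore return different rows, which is why it pays to be explicit about which one you mean.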


In this notebook we will cover:

1. Selecting rows by their numerical position - ```iloc```
1. Selecting rows by their index - ```loc```
1. Selecting columns
1. Advanced filtering using:
 - ```mask``` and ```where```
 - multiple selections


In [1]:
import pandas as pd

We are going to use a dataset that has Airbnb listing information in Lisbon.

In [2]:
# Optional (Google Colab only): mount Drive to read the file from there
# from google.colab import drive
# drive.mount('/content/drive')

In [3]:
# df = pd.read_csv('/content/drive/MyDrive/Class Materials/Week 1 DS Fundamentals/W1D3 Data Wrangling/Classwork/airbnb.csv', index_col='room_id')

In [4]:
df = pd.read_csv('data/airbnb.csv', index_col='room_id') # indexing the df using room_id

<font color='#eb3483'> Exercise: </font> How large is this dataset? Have a look at the top few rows to familiarize yourself with it.

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 13232 entries, 6499 to 19400722
Data columns (total 8 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   host_id               13232 non-null  int64  
 1   room_type             13232 non-null  object 
 2   neighborhood          13232 non-null  object 
 3   reviews               13232 non-null  int64  
 4   overall_satisfaction  13222 non-null  float64
 5   accommodates          13232 non-null  int64  
 6   bedrooms              13232 non-null  float64
 7   price                 13232 non-null  float64
dtypes: float64(3), int64(3), object(2)
memory usage: 930.4+ KB


## <font color='#eb3483'> 1. Selecting Rows by their Position </font>

We use the `iloc` indexer to select specific rows of a dataframe **regardless of the index**. With `iloc`, we select rows by row number, starting at 0.

In [6]:
df.iloc[0] # using one square bracket returns it as a series

host_id                           14455
room_type               Entire home/apt
neighborhood                      Belém
reviews                               8
overall_satisfaction                5.0
accommodates                          2
bedrooms                            1.0
price                              57.0
Name: 6499, dtype: object

We can select multiple rows at once by passing in a list of the row numbers that we want:

In [7]:
df.iloc[[0,3,5]]

room_id,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
6499,14455,Entire home/apt,Belém,8,5.0,2,1.0,57.0
29248,125768,Entire home/apt,Santa Maria Maior,225,4.5,4,1.0,58.0
29720,128075,Entire home/apt,Estrela,14,5.0,16,9.0,1154.0


Or use slices like with arrays.

<font color='#eb3483'> Exercise: </font> Select rows 2 to 20.

In [8]:
df.iloc[1:20] # the end of a slice is excluded: positions 1 to 19, i.e. the 2nd to the 20th row

room_id,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
17031,66015,Entire home/apt,Alvalade,0,0.0,2,1.0,46.0
25659,107347,Entire home/apt,Santa Maria Maior,63,5.0,3,1.0,69.0
29248,125768,Entire home/apt,Santa Maria Maior,225,4.5,4,1.0,58.0
29396,126415,Entire home/apt,Santa Maria Maior,132,5.0,4,1.0,67.0
29720,128075,Entire home/apt,Estrela,14,5.0,16,9.0,1154.0
29872,128698,Entire home/apt,Alcântara,25,5.0,2,1.0,75.0
29891,128792,Entire home/apt,Misericórdia,28,5.0,3,1.0,49.0
29915,128890,Entire home/apt,Avenidas Novas,28,4.5,3,1.0,58.0
33312,144398,Entire home/apt,Misericórdia,24,4.5,4,1.0,66.0
33348,144484,Private room,Lumiar,2,0.0,6,1.0,46.0
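
Slices are an easy place to be off by one, so here is a small sketch of how they behave. It uses a five-row toy frame built from the first few `room_id` and `price` values shown above, and also shows that label-based slices with `loc` behave differently from `iloc` slices:

```python
import pandas as pd

toy = pd.DataFrame(
    {"price": [57.0, 46.0, 69.0, 58.0, 67.0]},
    index=pd.Index([6499, 17031, 25659, 29248, 29396], name="room_id"),
)

# iloc slices are half-open, like Python lists: positions 1 and 2 only.
by_position = toy.iloc[1:3]

# loc slices include BOTH endpoints: labels 17031 through 29248.
by_label = toy.loc[17031:29248]

print(len(by_position))  # 2
print(len(by_label))     # 3
```

So to get the 2nd to the 20th row by position, the slice has to end at 20, not 19.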


## <font color='#eb3483'> 2. Selecting Rows by their Index Value </font>

With `.loc` we can select rows based on their index value. Since we have set the dataframe index as the ```room_id```, we can select a specific room based on its id, for example, ```room_id == 6499```:

In [9]:
df.loc[6499]

host_id                           14455
room_type               Entire home/apt
neighborhood                      Belém
reviews                               8
overall_satisfaction                5.0
accommodates                          2
bedrooms                            1.0
price                              57.0
Name: 6499, dtype: object

Selecting an index value that doesn't exist raises a `KeyError`:

In [10]:
df.loc[5]

KeyError: 5
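
If some of the labels we ask for might be missing, there are a couple of ways to avoid the `KeyError`. The sketch below uses a two-row toy frame and a made-up `wanted` list; it is one possible approach, not the only one:

```python
import pandas as pd

toy = pd.DataFrame(
    {"price": [57.0, 46.0]},
    index=pd.Index([6499, 17031], name="room_id"),
)

wanted = [6499, 5]  # 5 is not a valid room_id in this frame

# Option 1: keep only the labels that actually exist before calling loc.
present = [label for label in wanted if label in toy.index]
found = toy.loc[present]

# Option 2: reindex keeps every requested label and fills missing rows with NaN.
padded = toy.reindex(wanted)

print(found.index.tolist())  # [6499]
print(len(padded))           # 2
```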

Same as with ```.iloc```, we can select multiple index values at once by passing these in as a list:

In [None]:
df.loc[[29872, 19188572, 4612503]]

room_id,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
29872,128698,Entire home/apt,Alcântara,25,5.0,2,1.0,75.0
19188572,134216988,Private room,Arroios,0,0.0,4,1.0,58.0
4612503,22078192,Entire home/apt,Santa Maria Maior,12,5.0,3,1.0,113.0


The ```.loc``` indexer also accepts a boolean list or Series (i.e. a sequence of ```True```s and ```False```s) and returns only the rows where the value is ```True```. For example, we can build a boolean Series from a condition on the number of bedrooms. The cell below tests for 40 or more bedrooms, which no Airbnb listing in this dataset satisfies, so the selection that follows comes back empty:

In [None]:
bed10 = df['bedrooms']>=40 # this creates a pandas Series of type bool (boolean)
bed10

room_id
6499        False
17031       False
25659       False
29248       False
29396       False
            ...  
19388006    False
19393935    False
19396300    False
19397373    False
19400722    False
Name: bedrooms, Length: 13232, dtype: bool

In [40]:
df.loc[bed10]

room_id,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price


Of course, we could build the condition and select the rows in a single line of code, here for listings with exactly 10 bedrooms:

```python
df.loc[df['bedrooms']==10]
```

Strictly speaking, we do not need to use ```.loc``` here. This will also work:

```python
df[df['bedrooms']==10]
```

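However, it is important to remember that ```.iloc``` **will not accept this boolean Series**: it selects purely by position and raises an error when the mask carries its own index. It does accept a plain boolean array of the right length, for example ```df.iloc[(df['bedrooms']==10).values]```. Check for yourself!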
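
Here is a small sketch of boolean selection with `.loc`, plain brackets and `.iloc` side by side, on a three-row toy frame that reuses `room_id` and `bedrooms` values from the listings above. To the best of our knowledge `.iloc` rejects a boolean Series but accepts the underlying boolean array, which is what the last lines try out:

```python
import pandas as pd

toy = pd.DataFrame(
    {"bedrooms": [1.0, 9.0, 1.0]},
    index=pd.Index([6499, 29720, 29872], name="room_id"),
)

is_large = toy["bedrooms"] > 3  # boolean Series aligned on room_id

via_loc = toy.loc[is_large]
via_brackets = toy[is_large]

# iloc works by position, so hand it the raw boolean values without the index.
via_iloc = toy.iloc[is_large.to_numpy()]

try:
    toy.iloc[is_large]  # boolean Series with an index: expected to be rejected
except (NotImplementedError, ValueError) as err:
    print(type(err).__name__)
```

All three successful selections return the same single row here (`room_id` 29720).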

## <font color='#eb3483'> 3. Column Selection </font>

We can select columns using dot notation **as long as the column names do not have spaces or non-standard characters in them** and do not clash with an existing dataframe attribute or method (such as `count` or `shape`). It is good practice to name your columns without such characters. This will save you time later! :)

In [None]:
df.room_type

room_id
6499        Entire home/apt
17031       Entire home/apt
25659       Entire home/apt
29248       Entire home/apt
29396       Entire home/apt
                 ...       
19388006    Entire home/apt
19393935    Entire home/apt
19396300    Entire home/apt
19397373    Entire home/apt
19400722    Entire home/apt
Name: room_type, Length: 13232, dtype: object

...is the same as doing...

In [None]:
df['room_type']

room_id
6499        Entire home/apt
17031       Entire home/apt
25659       Entire home/apt
29248       Entire home/apt
29396       Entire home/apt
                 ...       
19388006    Entire home/apt
19393935    Entire home/apt
19396300    Entire home/apt
19397373    Entire home/apt
19400722    Entire home/apt
Name: room_type, Length: 13232, dtype: object
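
Dot notation and square brackets stop being interchangeable as soon as a column name is awkward. The following sketch uses two hypothetical column names (`price per night` and `count`, neither of which exists in the Airbnb file) to show where brackets are needed:

```python
import pandas as pd

# Hypothetical column names chosen to trigger both problems.
toy = pd.DataFrame({"price per night": [57.0, 46.0], "count": [3, 1]})

nightly = toy["price per night"]  # a space in the name: brackets are the only option

count_column = toy["count"]       # brackets return the column
count_method = toy.count          # dot notation returns the DataFrame.count method

print(type(count_column).__name__)  # Series
print(callable(count_method))       # True
```

Square brackets always work, so they are the safer default in reusable code.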

Note that when we select one column like in the above example, we get a ```pd.Series``` object. We can use a list of column names to select multiple columns. This returns a ```pd.DataFrame``` object as expected:

In [None]:
df[["room_type", "price"]].head()

room_id,room_type,price
6499,Entire home/apt,57.0
17031,Entire home/apt,46.0
25659,Entire home/apt,69.0
29248,Entire home/apt,58.0
29396,Entire home/apt,67.0
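
The difference between a single label and a list of labels matters even for one column. A quick sketch on a two-row toy frame (values copied from listings shown in this notebook):

```python
import pandas as pd

toy = pd.DataFrame(
    {"room_type": ["Entire home/apt", "Private room"], "price": [57.0, 46.0]},
    index=pd.Index([6499, 33348], name="room_id"),
)

one_column = toy["price"]       # single label -> Series
one_column_df = toy[["price"]]  # list with one label -> DataFrame

print(type(one_column).__name__)     # Series
print(type(one_column_df).__name__)  # DataFrame
print(one_column_df.shape)           # (2, 1)
```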


We can also use ```.loc``` to simultaneously select specific rows and a subset of the columns.

<font color='#eb3483'> Exercise: </font> What are the room types and neighborhoods of the AirBnB listings with more than 300 reviews?

In [43]:
df2 = df.loc[df["reviews"] > 300, ["room_type", "neighborhood"]] # rows and columns in a single .loc call

df2

room_id,room_type,neighborhood
44043,Entire home/apt,Santa Maria Maior
176153,Entire home/apt,Misericórdia
299321,Entire home/apt,Santo António
317916,Private room,Estrela
447832,Entire home/apt,Misericórdia
555172,Entire home/apt,Misericórdia
763199,Private room,Olivais
965934,Entire home/apt,Santa Maria Maior
1085634,Entire home/apt,Misericórdia
1745355,Entire home/apt,Olivais


In [44]:
df2.shape

(10, 2)
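
As a sketch of selecting rows and columns at the same time, `.loc` takes the row selector first and the column selector second. The toy frame below reuses three listings from this notebook, and the threshold of 5 reviews is arbitrary:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "room_type": ["Entire home/apt", "Private room", "Entire home/apt"],
        "neighborhood": ["Belém", "Lumiar", "Estrela"],
        "reviews": [8, 2, 14],
    },
    index=pd.Index([6499, 33348, 29720], name="room_id"),
)

# Rows where reviews > 5, and only two of the columns, in one loc call.
popular = toy.loc[toy["reviews"] > 5, ["room_type", "neighborhood"]]

# Chaining two selections gives the same rows and columns here.
chained = toy[["room_type", "neighborhood"]][toy["reviews"] > 5]

print(popular.shape)            # (2, 2)
print(popular.equals(chained))  # True
```

Both forms read the same data, but the single `.loc` call states the intent in one step and is the form to prefer when you later want to assign to the selection.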

## <font color='#eb3483'> 4. Advanced Filtering </font>

### <font color='#eb3483'> Mask & Where </font>

The function `mask` allows us to "hide" or "deselect" parts of a dataframe that match a certain condition. Note that this is similar to how we use masks in NumPy.

In [None]:
df.mask(df.overall_satisfaction == 5.0).head()

# NaN means not a number
# hides the values where overall satisfaction is 5

room_id,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
6499,,,,,,,,
17031,66015.0,Entire home/apt,Alvalade,0.0,0.0,2.0,1.0,46.0
25659,,,,,,,,
29248,125768.0,Entire home/apt,Santa Maria Maior,225.0,4.5,4.0,1.0,58.0
29396,,,,,,,,


The rows that match the condition appear as `NaN`, which stands for **Not a Number**, a standard way of saying "*there is no relevant data here*". Integer columns are shown as floats because `NaN` is itself a float. Most pandas aggregations (such as `mean` or `sum`) skip NaNs by default.

In contrast, [where](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.where.html) hides those rows that don't match the condition. So ```where``` selects and ```mask``` deselects - they are opposites!

In [46]:
df.where(df.overall_satisfaction == 5.0).head()

# hides the values where the overall satisfaction is NOT 5

room_id,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
6499,14455.0,Entire home/apt,Belém,8.0,5.0,2.0,1.0,57.0
17031,,,,,,,,
25659,107347.0,Entire home/apt,Santa Maria Maior,63.0,5.0,3.0,1.0,69.0
29248,,,,,,,,
29396,126415.0,Entire home/apt,Santa Maria Maior,132.0,5.0,4.0,1.0,67.0
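
A compact sketch of `where` and `mask` on a three-row toy frame (values taken from the first three listings) shows that both keep the shape of the dataframe and only blank out values:

```python
import pandas as pd

toy = pd.DataFrame(
    {"reviews": [8, 0, 63], "overall_satisfaction": [5.0, 0.0, 5.0]},
    index=pd.Index([6499, 17031, 25659], name="room_id"),
)

is_top = toy["overall_satisfaction"] == 5.0

kept = toy.where(is_top)   # rows where the condition is False become NaN
hidden = toy.mask(is_top)  # rows where the condition is True become NaN

# Both keep the original shape; integer columns turn into floats to hold NaN.
print(kept.shape == toy.shape)  # True
print(kept["reviews"].dtype)    # float64

# Dropping the all-NaN rows leaves the same rows as plain boolean filtering.
same_rows = kept.dropna(how="all").index.equals(toy[is_top].index)
print(same_rows)                # True
```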


<font color='#eb3483'> Exercise: </font> How does ```df[df.overall_satisfaction == 5.0]``` differ from the above? 

<font color='#eb3483'> Exercise: </font> What does ```df[~(df.overall_satisfaction == 5.0)]``` do?

**Hint:** Compare the result of ```df.overall_satisfaction == 5.0``` and ```~(df.overall_satisfaction == 5.0)```.

In [36]:
df[df.overall_satisfaction == 5.0]

room_id,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
6499,14455,Entire home/apt,Belém,8,5.0,2,1.0,57.0
25659,107347,Entire home/apt,Santa Maria Maior,63,5.0,3,1.0,69.0
29396,126415,Entire home/apt,Santa Maria Maior,132,5.0,4,1.0,67.0
29720,128075,Entire home/apt,Estrela,14,5.0,16,9.0,1154.0
29872,128698,Entire home/apt,Alcântara,25,5.0,2,1.0,75.0
...,...,...,...,...,...,...,...,...
18997896,19063709,Entire home/apt,Avenidas Novas,3,5.0,4,0.0,62.0
19034170,62521369,Entire home/apt,Belém,5,5.0,4,1.0,85.0
19051322,132979089,Private room,Penha de França,3,5.0,3,1.0,29.0
19079169,2009620,Private room,Estrela,3,5.0,2,1.0,35.0


In [37]:
df[~(df.overall_satisfaction == 5.0)]

room_id,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
17031,66015,Entire home/apt,Alvalade,0,0.0,2,1.0,46.0
29248,125768,Entire home/apt,Santa Maria Maior,225,4.5,4,1.0,58.0
29915,128890,Entire home/apt,Avenidas Novas,28,4.5,3,1.0,58.0
33312,144398,Entire home/apt,Misericórdia,24,4.5,4,1.0,66.0
33348,144484,Private room,Lumiar,2,0.0,6,1.0,46.0
...,...,...,...,...,...,...,...,...
19388006,135915593,Entire home/apt,São Vicente,0,0.0,6,3.0,415.0
19393935,5376796,Entire home/apt,Santa Maria Maior,0,0.0,3,1.0,50.0
19396300,6115933,Entire home/apt,Santo António,0,0.0,6,4.0,138.0
19397373,97139334,Entire home/apt,São Vicente,0,0.0,4,1.0,56.0
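
One subtlety with `~` is what happens to missing values, and `overall_satisfaction` does have a few (13222 non-null out of 13232 rows). The sketch below uses a made-up three-element Series with hypothetical labels 1, 2, 3 to show where a missing score ends up:

```python
import numpy as np
import pandas as pd

scores = pd.Series([5.0, 4.5, np.nan], index=pd.Index([1, 2, 3], name="room_id"))

is_five = scores == 5.0  # NaN == 5.0 evaluates to False
not_five = ~is_five      # so the missing score lands on the "not 5" side

print(is_five.tolist())   # [True, False, False]
print(not_five.tolist())  # [False, True, True]

# To exclude missing scores as well, also require that a value is present.
rated_not_five = not_five & scores.notna()
print(rated_not_five.tolist())  # [False, True, False]
```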


### <font color='#eb3483'> Multiple Selection </font>

We can filter a dataframe based on multiple conditions. To select rows that match all of several conditions, we combine them with the element-wise AND operator, `&`, wrapping each condition in parentheses. For example, if we want those listings in Belém with more than 3 bedrooms:

In [38]:
df[(df.neighborhood == 'Belém') & (df.bedrooms > 3)].head()
# both conditions must hold; combine conditions with '&' (and) or '|' (or)

room_id,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
2127428,8810620,Entire home/apt,Belém,17,5.0,9,5.0,184.0
5737018,29744618,Entire home/apt,Belém,46,4.5,10,4.0,128.0
6884183,11926451,Entire home/apt,Belém,1,0.0,10,5.0,138.0
9522737,46064752,Entire home/apt,Belém,27,5.0,8,4.0,78.0
15272166,17263208,Entire home/apt,Belém,1,0.0,6,4.0,288.0


Similarly, we can select rows that match one condition OR the other with the `|` operator:

In [39]:
df[(df.neighborhood == "Belém") | (df.neighborhood == "Benfica")].head()

room_id,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
6499,14455,Entire home/apt,Belém,8,5.0,2,1.0,57.0
202654,992647,Entire home/apt,Belém,54,4.0,2,1.0,45.0
212915,1097919,Private room,Benfica,0,0.0,6,1.0,93.0
418945,2083563,Private room,Belém,2,0.0,2,1.0,127.0
472183,2341627,Entire home/apt,Belém,64,4.5,4,1.0,67.0
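
When several OR conditions test the same column, `isin` is a shorter alternative. The sketch below uses a four-row toy frame built from listings in this notebook and also repeats the AND example to show why the parentheses are needed:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "neighborhood": ["Belém", "Benfica", "Estrela", "Belém"],
        "bedrooms": [1.0, 1.0, 9.0, 5.0],
    },
    index=pd.Index([6499, 212915, 29720, 2127428], name="room_id"),
)

# isin is a compact replacement for several == comparisons joined with |.
either = toy[toy["neighborhood"].isin(["Belém", "Benfica"])]
same = toy[(toy["neighborhood"] == "Belém") | (toy["neighborhood"] == "Benfica")]

# Parentheses matter: & and | bind more tightly than == and >.
big_in_belem = toy[(toy["neighborhood"] == "Belém") & (toy["bedrooms"] > 3)]

print(either.equals(same))          # True
print(big_in_belem.index.tolist())  # [2127428]
```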


# <font color='#eb3483'> Filtering Pandas Exercises </font>

Let's pretend we are an Airbnb employee assigned to the Lisbon market. Our job is to help clients find their desired listing. We have a file named `airbnb.csv` that has information on all the listings we have available right now in the city. Start by importing pandas and loading the data in.

In [11]:
import pandas as pd

df = pd.read_csv('data/airbnb.csv', index_col='room_id')



### <font color='#eb3483'> Exercise 1 </font>

Alice is going to Lisbon for a week with her husband and 2 kids. They are looking for a full apartment with separate rooms for parents and children. Money is not an issue for them, but they are looking for a good place. This means they are only looking for places with more than 10 reviews and a score above 4. When we show Alice our listing selection we need to make sure we are sorting the listings from the best score to the worst one. In case some listings have the same score, we will have to sort them by the number of reviews (the more the better). We need to give her 3 alternatives.

In [12]:
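# your code goes here
# Reading of the brief (an interpretation): entire apartment, sleeps 4,
# at least 2 bedrooms, more than 10 reviews, score above 4.
family = (df.room_type == "Entire home/apt") & (df.accommodates >= 4) & (df.bedrooms >= 2)
quality = (df.overall_satisfaction > 4.0) & (df.reviews > 10)
overall = df[family & quality]

# best score first; ties broken by number of reviews (more is better)
overall = overall.sort_values(by = ["overall_satisfaction", "reviews"], ascending=False)

overall.head(3)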
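
Sorting by score with reviews as a tie-breaker comes down to passing two column names to `sort_values`. Here is a minimal sketch on four listings taken from earlier in this notebook:

```python
import pandas as pd

toy = pd.DataFrame(
    {"overall_satisfaction": [4.5, 5.0, 5.0, 4.5], "reviews": [225, 8, 132, 28]},
    index=pd.Index([29248, 6499, 29396, 29915], name="room_id"),
)

# Sort by score first; listings with equal scores are ordered by reviews.
ranked = toy.sort_values(
    by=["overall_satisfaction", "reviews"],
    ascending=[False, False],
)

print(ranked.index.tolist())  # [29396, 6499, 29248, 29915]
```

Passing a list to `ascending` lets each key have its own direction, for example best score first but cheapest price first.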


### <font color='#eb3483'> Exercise 2 </font>

Diana is going to spend 3 nights in Lisbon and she wants to meet new people. She has a budget of 50€. We need to provide her with the 10 cheapest listings, with a preference for shared rooms. We need to sort the rooms by score (descending).

In [13]:
# your code goes here
# Assumption: `price` is per night, so 50€ over 3 nights allows at most 50/3 per night.
diana = df[(df.price <= 50/3) & (df.room_type == "Shared room")]

# sort_values returns a new dataframe, so the result has to be kept
diana = diana.sort_values(by = "price", ascending=True).head(10) # the 10 cheapest

diana.sort_values(by = "overall_satisfaction", ascending=False) # ordered by score
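
A common pitfall in this kind of exercise is that `sort_values` returns a new dataframe and leaves the original untouched. The sketch below shows this on four of the shared rooms listed above, and uses `nsmallest` as one possible way to pick the cheapest rows before ordering them by score:

```python
import pandas as pd

toy = pd.DataFrame(
    {"overall_satisfaction": [4.0, 4.5, 4.0, 4.5], "price": [10.0, 15.0, 11.0, 13.0]},
    index=pd.Index([1179457, 4010418, 5610245, 7737202], name="room_id"),
)

toy.sort_values(by="price")     # result is discarded, toy is unchanged
unchanged = toy.index.tolist()  # still the original order

# Keep the result: the 3 cheapest listings, then ordered by score (descending).
cheapest = toy.nsmallest(3, "price")
by_score = cheapest.sort_values(by="overall_satisfaction", ascending=False)

print(unchanged)
print(by_score)
```

Assigning the sorted result back to a name (or chaining the calls) is what makes the ordering stick.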
