# <font color='#eb3483'> Data Prepping and Wrangling with Pandas </font>

Once we have read our data frame in and had a look around, we may want to start working with specific columns or rows, or with data that meets certain criteria.

Enter Pandas.

In this notebook we will cover:
- Selecting columns
- Renaming columns
- Filtering rows:
    - by their numerical position - iloc
    - by their index - loc
    - with [ ]
    - multiple selections
- Dropping rows or columns


In [1]:
import pandas as pd

We are going to use a dataset that has Airbnb listing information in New York.

In [2]:
df = pd.read_csv('data/NYairbnb.csv')

In [3]:
df.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,accommodates,...,bedrooms,beds,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,review_scores_rating
0,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75356,-73.98559,Entire home/apt,2,...,,1.0,150,30,48,11/4/19,0.34,3,308,4.7
1,3831,"Whole flr w/private bdrm, bath & kitchen(pls r...",4869,LisaRoxanne,Brooklyn,Bedford-Stuyvesant,40.68494,-73.95765,Entire home/apt,3,...,1.0,3.0,79,1,403,5/4/21,5.16,1,208,4.46
2,5121,BlissArtsSpace!,7356,Garon,Brooklyn,Bedford-Stuyvesant,40.68535,-73.95512,Private room,2,...,1.0,1.0,60,30,50,6/5/16,0.56,1,365,4.52
3,5136,"Spacious Brooklyn Duplex, Patio + Garden",7378,Rebecca,Brooklyn,Sunset Park,40.66265,-73.99454,Entire home/apt,4,...,2.0,2.0,175,21,1,1/2/14,0.01,1,134,5.0
4,5178,Large Furnished Room Near B'way,8967,Shunichi,Manhattan,Midtown,40.76457,-73.98317,Private room,2,...,1.0,1.0,61,2,474,9/25/20,3.61,1,246,4.19


In [4]:
df.columns

Index(['id', 'name', 'host_id', 'host_name', 'neighbourhood_group',
       'neighbourhood', 'latitude', 'longitude', 'room_type', 'accommodates',
       'bathrooms', 'bedrooms', 'beds', 'price', 'minimum_nights',
       'number_of_reviews', 'last_review', 'reviews_per_month',
       'calculated_host_listings_count', 'availability_365',
       'review_scores_rating'],
      dtype='object')

In [5]:
df.shape

(36230, 21)

# <font color='#eb3483'> Selecting Columns </font>

We can select columns using dot notation, as long as the column names don't contain spaces or other non-alphanumeric characters. This is why it is always good to name your columns without these. Saves time later :)

In [10]:
#To get 1 column

df.room_type

0        Entire home/apt
1        Entire home/apt
2           Private room
3        Entire home/apt
4           Private room
              ...       
36225       Private room
36226       Private room
36227       Private room
36228    Entire home/apt
36229       Private room
Name: room_type, Length: 36230, dtype: object

This is the same as selecting with square brackets, which we can also use if our names have spaces or non-alphanumeric characters.

In [11]:
df['room_type'] 

0        Entire home/apt
1        Entire home/apt
2           Private room
3        Entire home/apt
4           Private room
              ...       
36225       Private room
36226       Private room
36227       Private room
36228    Entire home/apt
36229       Private room
Name: room_type, Length: 36230, dtype: object
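A minimal sketch of the difference between the two notations, using a small made-up dataframe. The column name `room type` (with a space) is hypothetical and is not part of the Airbnb file:

```python
import pandas as pd

# Toy data; the "room type" column (with a space) is made up for illustration
toy = pd.DataFrame({
    "price": [150, 79, 60],
    "room type": ["Entire home/apt", "Entire home/apt", "Private room"],
})

# Dot notation and square brackets return the same Series for a simple name
same_column = toy.price.equals(toy["price"])

# A name containing a space can only be reached with square brackets
room_types = toy["room type"].tolist()

print(same_column)
print(room_types)
```

Square brackets always work, so they are the safer default when column names come from a file you do not control.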

When we select one column we receive a `pd.Series`. To retrieve a dataframe instead, we use double brackets:

In [12]:
df[['room_type']]

Unnamed: 0,room_type
0,Entire home/apt
1,Entire home/apt
2,Private room
3,Entire home/apt
4,Private room
...,...
36225,Private room
36226,Private room
36227,Private room
36228,Entire home/apt


To select multiple columns we also use double brackets, passing a list of column names (and we receive a dataframe).

In [13]:
df[["room_type", "price"]].head()

Unnamed: 0,room_type,price
0,Entire home/apt,150
1,Entire home/apt,79
2,Private room,60
3,Entire home/apt,175
4,Private room,61


# <font color='#eb3483'> Renaming Columns </font>


We can rename columns in several ways, for example to shorten the names or to change their capitalization.

There are maaaaany options [here](https://note.nkmk.me/en/python-pandas-dataframe-rename/), but a few common ones are shown below.

In [14]:
df = pd.read_csv('data/NYairbnb.csv')

In [15]:
df.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,accommodates,...,bedrooms,beds,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,review_scores_rating
0,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75356,-73.98559,Entire home/apt,2,...,,1.0,150,30,48,11/4/19,0.34,3,308,4.7
1,3831,"Whole flr w/private bdrm, bath & kitchen(pls r...",4869,LisaRoxanne,Brooklyn,Bedford-Stuyvesant,40.68494,-73.95765,Entire home/apt,3,...,1.0,3.0,79,1,403,5/4/21,5.16,1,208,4.46
2,5121,BlissArtsSpace!,7356,Garon,Brooklyn,Bedford-Stuyvesant,40.68535,-73.95512,Private room,2,...,1.0,1.0,60,30,50,6/5/16,0.56,1,365,4.52
3,5136,"Spacious Brooklyn Duplex, Patio + Garden",7378,Rebecca,Brooklyn,Sunset Park,40.66265,-73.99454,Entire home/apt,4,...,2.0,2.0,175,21,1,1/2/14,0.01,1,134,5.0
4,5178,Large Furnished Room Near B'way,8967,Shunichi,Manhattan,Midtown,40.76457,-73.98317,Private room,2,...,1.0,1.0,61,2,474,9/25/20,3.61,1,246,4.19


In [16]:
#select just a few columns
df = df[['id', 'host_id', 'neighbourhood_group', 'room_type', 'price', 'number_of_reviews']]
df.head()

Unnamed: 0,id,host_id,neighbourhood_group,room_type,price,number_of_reviews
0,2595,2845,Manhattan,Entire home/apt,150,48
1,3831,4869,Brooklyn,Entire home/apt,79,403
2,5121,7356,Brooklyn,Private room,60,50
3,5136,7378,Brooklyn,Entire home/apt,175,1
4,5178,8967,Manhattan,Private room,61,474


We can change the names of the columns by assigning a new list of names to `df.columns` (one name per column, in the same order). For example, we can rename the columns to make them a bit easier to type.

In [17]:
df.columns

Index(['id', 'host_id', 'neighbourhood_group', 'room_type', 'price',
       'number_of_reviews'],
      dtype='object')

In [18]:
df.columns = ['id', 'host_id', 'nhood_grp', 'room_type', 'price', 'num_reviews']
df.head()


Unnamed: 0,id,host_id,nhood_grp,room_type,price,num_reviews
0,2595,2845,Manhattan,Entire home/apt,150,48
1,3831,4869,Brooklyn,Entire home/apt,79,403
2,5121,7356,Brooklyn,Private room,60,50
3,5136,7378,Brooklyn,Entire home/apt,175,1
4,5178,8967,Manhattan,Private room,61,474


In [22]:
#We can also use the rename function; we assign the result back to df so the new names are kept
df = df.rename(columns={'price': 'price_usd', 'nhood_grp': 'neighborhood'})
df.head()


Unnamed: 0,id,host_id,neighborhood,room_type,price_usd,num_reviews
0,2595,2845,Manhattan,Entire home/apt,150,48
1,3831,4869,Brooklyn,Entire home/apt,79,403
2,5121,7356,Brooklyn,Private room,60,50
3,5136,7378,Brooklyn,Entire home/apt,175,1
4,5178,8967,Manhattan,Private room,61,474


We can change column names to upper or lower case by passing a function such as `str.upper` or `str.lower` to `rename`.

In [27]:
df.rename(columns=str.lower)

Unnamed: 0,id,host_id,neighborhood,room_type,price_usd,num_reviews
0,2595,2845,Manhattan,Entire home/apt,150,48
1,3831,4869,Brooklyn,Entire home/apt,79,403
2,5121,7356,Brooklyn,Private room,60,50
3,5136,7378,Brooklyn,Entire home/apt,175,1
4,5178,8967,Manhattan,Private room,61,474
...,...,...,...,...,...,...
36225,50164369,305240193,Manhattan,Private room,76,0
36226,50165159,305240193,Manhattan,Private room,58,0
36227,50165711,305240193,Manhattan,Private room,74,0
36228,50165885,107434423,Manhattan,Entire home/apt,281,0
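Note that `rename` returns a new dataframe and leaves the original untouched unless the result is assigned back. A small sketch with made-up data (the `toy` dataframe is only for illustration):

```python
import pandas as pd

toy = pd.DataFrame({"id": [2595, 3831], "price_usd": [150, 79]})

# rename applies the function to every column name and returns a new dataframe
upper = toy.rename(columns=str.upper)

upper_names = list(upper.columns)     # ['ID', 'PRICE_USD']
original_names = list(toy.columns)    # ['id', 'price_usd'], toy is unchanged

print(upper_names)
print(original_names)
```

To keep the change, assign the result back, for example `toy = toy.rename(columns=str.upper)`.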


# <font color='#eb3483'> Filtering rows </font>

### <font color='#eb3483'> Filtering rows by their position - iloc </font>

We use the `iloc` indexer to select specific rows of a Data Frame.

There are two things you should know about `iloc`. First, it is reserved for purely position-based indexing (integers only), so calling `iloc` with a label such as a string will throw an error. Second, `iloc` **does not interact with your index at all**, which is important to remember if your index is integer-based.

With `iloc` we select rows by their row number, starting at 0.

In [28]:
df.head()

Unnamed: 0,id,host_id,neighborhood,room_type,price_usd,num_reviews
0,2595,2845,Manhattan,Entire home/apt,150,48
1,3831,4869,Brooklyn,Entire home/apt,79,403
2,5121,7356,Brooklyn,Private room,60,50
3,5136,7378,Brooklyn,Entire home/apt,175,1
4,5178,8967,Manhattan,Private room,61,474


In [29]:
df.iloc[0] # using one square bracket returns it as a series

id                         2595
host_id                    2845
neighborhood          Manhattan
room_type       Entire home/apt
price_usd                   150
num_reviews                  48
Name: 0, dtype: object

In [30]:
type(df.iloc[0])

pandas.core.series.Series

Generally we would want to keep working with a dataframe - so we use double brackets `[[]]`

In [31]:
df.iloc[[0]]

Unnamed: 0,id,host_id,neighborhood,room_type,price_usd,num_reviews
0,2595,2845,Manhattan,Entire home/apt,150,48


We can select multiple rows at once:

In [32]:
df.iloc[[0,3,9]]

Unnamed: 0,id,host_id,neighborhood,room_type,price_usd,num_reviews
0,2595,2845,Manhattan,Entire home/apt,150,48
3,5136,7378,Brooklyn,Entire home/apt,175,1
9,6990,16800,Manhattan,Private room,62,233


Or use slices like with arrays:

In [33]:
#select the rows at positions 2 to 9 (the end of the slice, 10, is not included)
df.iloc[2:10]


Unnamed: 0,id,host_id,neighborhood,room_type,price_usd,num_reviews
2,5121,7356,Brooklyn,Private room,60,50
3,5136,7378,Brooklyn,Entire home/apt,175,1
4,5178,8967,Manhattan,Private room,61,474
5,5203,7490,Manhattan,Private room,75,118
6,5803,9744,Brooklyn,Private room,98,184
7,6848,15991,Brooklyn,Entire home/apt,89,181
8,6872,16104,Manhattan,Private room,65,0
9,6990,16800,Manhattan,Private room,62,233
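To see that `iloc` works on positions and ignores the index labels, here is a small sketch on a toy dataframe whose integer index does not start at 0 (the data is made up for illustration):

```python
import pandas as pd

# Toy data with integer index labels 10, 20, 30, 40
toy = pd.DataFrame(
    {"price_usd": [150, 79, 60, 175]},
    index=[10, 20, 30, 40],
)

first_row = toy.iloc[0]    # position 0, which carries the index label 10
middle = toy.iloc[1:3]     # positions 1 and 2; the end of the slice is excluded

print(first_row.name)      # the index label of the selected row
print(list(middle.index))  # the index labels of the sliced rows
```

Even though the labels are integers, `iloc[0]` returns the first row by position, not a row labelled 0.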


### <font color='#eb3483'> Filtering rows by their index value - loc </font>

With `.loc` we can select rows based on their index value.
`loc` is based purely on the index assigned to your dataframe, so the value can be a number but it can also be a label (remember changing our indexes last class).

Let's set the index of our df to host_id

In [34]:
df = df.set_index("host_id")

Now let's say we want to get the listings for a specific host.
If we know the host's ID (7378 in the example below), we can pull out (aka filter) any listings that belong to that host.

In [35]:
df.head()

Unnamed: 0_level_0,id,neighborhood,room_type,price_usd,num_reviews
host_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2845,2595,Manhattan,Entire home/apt,150,48
4869,3831,Brooklyn,Entire home/apt,79,403
7356,5121,Brooklyn,Private room,60,50
7378,5136,Brooklyn,Entire home/apt,175,1
8967,5178,Manhattan,Private room,61,474


In [36]:
df.loc[7378]

id                         5136
neighborhood           Brooklyn
room_type       Entire home/apt
price_usd                   175
num_reviews                   1
Name: 7378, dtype: object

Selecting an index value that doesn't exist will fail with a `KeyError`.

In [38]:
# df.loc[[5]]
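A hedged sketch of what happens with a missing label, using a toy dataframe indexed by made-up `host_id` values. It also shows one way to check for a label before selecting it:

```python
import pandas as pd

toy = pd.DataFrame(
    {"price_usd": [150, 79]},
    index=pd.Index([2845, 4869], name="host_id"),
)

try:
    toy.loc[[5]]          # 5 is not one of the host_id labels
    found = True
except KeyError:
    found = False         # loc raised a KeyError for the missing label

# Checking membership in the index first avoids the exception
exists = 5 in toy.index

print(found, exists)
```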

Same as with .iloc, we can select multiple values at once.

In [47]:
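df.loc[[8967, 305240193, 107434423]]  # selects the rows for these three hosts (result not stored)
df  # only the last expression in a cell is displayed, so the output is the full dataframe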

Unnamed: 0_level_0,id,neighborhood,room_type,price_usd,num_reviews
host_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2845,2595,Manhattan,Entire home/apt,150,48
4869,3831,Brooklyn,Entire home/apt,79,403
7356,5121,Brooklyn,Private room,60,50
7378,5136,Brooklyn,Entire home/apt,175,1
8967,5178,Manhattan,Private room,61,474
...,...,...,...,...,...
305240193,50164369,Manhattan,Private room,76,0
305240193,50165159,Manhattan,Private room,58,0
305240193,50165711,Manhattan,Private room,74,0
107434423,50165885,Manhattan,Entire home/apt,281,0
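Because one host can have several listings, the index is not unique, and `loc` returns every row that carries the requested label. A minimal sketch on a toy dataframe (the values mirror a few rows of the output above, but the dataframe itself is made up for illustration):

```python
import pandas as pd

# Toy listings; host 305240193 appears three times
toy = pd.DataFrame({
    "host_id": [8967, 305240193, 305240193, 305240193, 107434423],
    "price_usd": [61, 76, 58, 74, 281],
}).set_index("host_id")

one_host = toy.loc[[305240193]]          # all listings of a single host
two_hosts = toy.loc[[8967, 107434423]]   # listings of two different hosts

print(len(one_host))
print(two_hosts["price_usd"].tolist())
```

Passing the labels as a list keeps the result a dataframe, even when only one row matches.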


### <font color='#eb3483'> Knowledge check </font>


In [53]:
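# Select only the properties in Manhattan:
# reset the index and change it to neighborhood, then
# select the Manhattan rows using the loc function

df.loc[(df['neighborhood'] == "Manhattan")]  # boolean mask: Manhattan rows only (result not stored)
df.reset_index()  # returns a copy with host_id back as a column (result not stored)
df.set_index('neighborhood')  # only this last expression is displayed; df itself is unchanged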

Unnamed: 0_level_0,id,room_type,price_usd,num_reviews
neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Manhattan,2595,Entire home/apt,150,48
Brooklyn,3831,Entire home/apt,79,403
Brooklyn,5121,Private room,60,50
Brooklyn,5136,Entire home/apt,175,1
Manhattan,5178,Private room,61,474
...,...,...,...,...
Manhattan,50164369,Private room,76,0
Manhattan,50165159,Private room,58,0
Manhattan,50165711,Private room,74,0
Manhattan,50165885,Entire home/apt,281,0
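One possible way to carry out the three steps of the knowledge check is to chain them, since `reset_index` and `set_index` each return a new dataframe. This is a sketch on a toy dataframe with made-up rows, not the only valid solution:

```python
import pandas as pd

# Toy stand-in for the dataframe above, with host_id as the index
toy = pd.DataFrame({
    "host_id": [2845, 4869, 8967],
    "neighborhood": ["Manhattan", "Brooklyn", "Manhattan"],
    "price_usd": [150, 79, 61],
}).set_index("host_id")

# Step 1 and 2: move host_id back to a column, then index by neighborhood
by_nhood = toy.reset_index().set_index("neighborhood")

# Step 3: label-based selection on the new index
manhattan = by_nhood.loc[["Manhattan"]]

print(manhattan)
```

Resetting the index first keeps `host_id` as a regular column instead of discarding it when the new index is set.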


### <font color='#eb3483'> Filtering with [ ] </font>

Just as with most things in Python (and life), there is more than one way of doing things. We can use loc and iloc to filter, but we can also filter by using brackets `[ ]` to subset the data frame and create a specific "sub" dataframe.



For example, let's filter the dataframe again to see all listings in `Manhattan`:

In [54]:
df = pd.read_csv('data/NYairbnb.csv')

In [55]:
df.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,accommodates,...,bedrooms,beds,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,review_scores_rating
0,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75356,-73.98559,Entire home/apt,2,...,,1.0,150,30,48,11/4/19,0.34,3,308,4.7
1,3831,"Whole flr w/private bdrm, bath & kitchen(pls r...",4869,LisaRoxanne,Brooklyn,Bedford-Stuyvesant,40.68494,-73.95765,Entire home/apt,3,...,1.0,3.0,79,1,403,5/4/21,5.16,1,208,4.46
2,5121,BlissArtsSpace!,7356,Garon,Brooklyn,Bedford-Stuyvesant,40.68535,-73.95512,Private room,2,...,1.0,1.0,60,30,50,6/5/16,0.56,1,365,4.52
3,5136,"Spacious Brooklyn Duplex, Patio + Garden",7378,Rebecca,Brooklyn,Sunset Park,40.66265,-73.99454,Entire home/apt,4,...,2.0,2.0,175,21,1,1/2/14,0.01,1,134,5.0
4,5178,Large Furnished Room Near B'way,8967,Shunichi,Manhattan,Midtown,40.76457,-73.98317,Private room,2,...,1.0,1.0,61,2,474,9/25/20,3.61,1,246,4.19


In [56]:
df.shape

(36230, 21)

Filtering with brackets keeps only the rows where the condition is true, so the dataframe we get is smaller.

In [57]:
df[df.neighbourhood_group == 'Manhattan'].head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,accommodates,...,bedrooms,beds,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,review_scores_rating
0,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75356,-73.98559,Entire home/apt,2,...,,1.0,150,30,48,11/4/19,0.34,3,308,4.7
4,5178,Large Furnished Room Near B'way,8967,Shunichi,Manhattan,Midtown,40.76457,-73.98317,Private room,2,...,1.0,1.0,61,2,474,9/25/20,3.61,1,246,4.19
5,5203,Cozy Clean Guest Room - Family Apt,7490,MaryEllen,Manhattan,Upper West Side,40.8038,-73.96751,Private room,1,...,1.0,1.0,75,2,118,7/12/17,0.91,1,0,4.91
8,6872,Uptown Sanctuary w/ Private Bath (Month to Month),16104,Kae,Manhattan,East Harlem,40.80107,-73.94255,Private room,1,...,1.0,1.0,65,30,0,,,2,365,
9,6990,UES Beautiful Blue Room,16800,Cyn,Manhattan,East Harlem,40.78778,-73.94759,Private room,1,...,1.0,1.0,62,30,233,9/15/18,1.9,1,1,4.88


In [58]:
df[df.neighbourhood_group == 'Manhattan'].shape

(15904, 21)

We can select the inverse of a condition by putting `~` in front of it, or, for an equality test, by using `!=` instead of `==`.

For example, to select all listings that are not in Manhattan, we can use `!=`:

In [61]:
df[(df.neighbourhood_group !=  "Manhattan")].shape

(20326, 21)

In [62]:
df[(df.neighbourhood_group !=  "Manhattan")].head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,accommodates,...,bedrooms,beds,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,review_scores_rating
1,3831,"Whole flr w/private bdrm, bath & kitchen(pls r...",4869,LisaRoxanne,Brooklyn,Bedford-Stuyvesant,40.68494,-73.95765,Entire home/apt,3,...,1.0,3.0,79,1,403,5/4/21,5.16,1,208,4.46
2,5121,BlissArtsSpace!,7356,Garon,Brooklyn,Bedford-Stuyvesant,40.68535,-73.95512,Private room,2,...,1.0,1.0,60,30,50,6/5/16,0.56,1,365,4.52
3,5136,"Spacious Brooklyn Duplex, Patio + Garden",7378,Rebecca,Brooklyn,Sunset Park,40.66265,-73.99454,Entire home/apt,4,...,2.0,2.0,175,21,1,1/2/14,0.01,1,134,5.0
6,5803,"Lovely Room 1, Garden, Best Area, Legal rental",9744,Laurie,Brooklyn,South Slope,40.66801,-73.98784,Private room,2,...,1.0,0.0,98,4,184,5/15/21,1.39,3,329,4.71
7,6848,Only 2 stops to Manhattan studio,15991,Allen & Irina,Brooklyn,Williamsburg,40.70935,-73.95342,Entire home/apt,3,...,,1.0,89,30,181,3/9/20,1.29,1,203,4.56
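The `~` operator gives the same result by inverting the boolean mask. A small sketch on a toy dataframe (the rows are made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Brooklyn", "Queens", "Manhattan"],
    "price": [150, 79, 60, 61],
})

# A boolean Series: True where the listing is in Manhattan
is_manhattan = toy.neighbourhood_group == "Manhattan"

not_manhattan = toy[~is_manhattan]                          # invert the mask
same_result = toy[toy.neighbourhood_group != "Manhattan"]   # equivalent here

print(not_manhattan.equals(same_result))
```

`~` is handy when the condition is more complex than a single comparison, because it can negate any boolean mask.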


### <font color='#eb3483'> Multiple Selection </font>

We can filter a dataframe based on multiple conditions.

We can select rows that match multiple conditions by combining the conditions with `&`. Each condition must be wrapped in its own parentheses.

For example, if we want those listings in Manhattan with more than 3 bedrooms:

In [63]:
df.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,accommodates,...,bedrooms,beds,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,review_scores_rating
0,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75356,-73.98559,Entire home/apt,2,...,,1.0,150,30,48,11/4/19,0.34,3,308,4.7
1,3831,"Whole flr w/private bdrm, bath & kitchen(pls r...",4869,LisaRoxanne,Brooklyn,Bedford-Stuyvesant,40.68494,-73.95765,Entire home/apt,3,...,1.0,3.0,79,1,403,5/4/21,5.16,1,208,4.46
2,5121,BlissArtsSpace!,7356,Garon,Brooklyn,Bedford-Stuyvesant,40.68535,-73.95512,Private room,2,...,1.0,1.0,60,30,50,6/5/16,0.56,1,365,4.52
3,5136,"Spacious Brooklyn Duplex, Patio + Garden",7378,Rebecca,Brooklyn,Sunset Park,40.66265,-73.99454,Entire home/apt,4,...,2.0,2.0,175,21,1,1/2/14,0.01,1,134,5.0
4,5178,Large Furnished Room Near B'way,8967,Shunichi,Manhattan,Midtown,40.76457,-73.98317,Private room,2,...,1.0,1.0,61,2,474,9/25/20,3.61,1,246,4.19


In [64]:
df[(df.neighbourhood_group == 'Manhattan') & (df.bedrooms > 3)].head() # we use & for and.

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,accommodates,...,bedrooms,beds,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,review_scores_rating
38,16974,SpaHa Loft: Enormous and Bright,65837,Robin,Manhattan,East Harlem,40.8019,-73.93762,Entire home/apt,8,...,4.0,6.0,250,4,220,5/23/21,1.83,1,0,4.83
401,248871,4 Bedroom-2 Bath West Duplex Townhouse,605463,West Village,Manhattan,West Village,40.73081,-74.00275,Entire home/apt,8,...,4.0,4.0,630,7,154,3/1/20,1.94,3,350,4.56
411,250801,Heart & Soul of Greenwich Village,1314834,Rhona,Manhattan,Greenwich Village,40.72976,-74.00168,Private room,6,...,4.0,4.0,700,3,132,10/12/20,1.25,1,119,4.9
473,273190,6 Bedroom Landmark West Village Townhouse,605463,West Village,Manhattan,West Village,40.7327,-74.00267,Entire home/apt,12,...,6.0,6.0,900,5,28,7/1/19,0.26,3,247,4.58
1220,893413,Architecturally Stunning Former Synagogue!,4751930,Martin,Manhattan,East Village,40.72694,-73.98173,Entire home/apt,10,...,4.0,4.0,2500,30,28,5/24/21,0.35,1,74,4.79


In the same way, we can select rows that match one condition OR the other with the pipe (`|`).

In [65]:
df[(df.neighbourhood_group == 'Manhattan') | (df.neighbourhood_group == "Brooklyn")].head() # we use | for or.

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,accommodates,...,bedrooms,beds,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,review_scores_rating
0,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75356,-73.98559,Entire home/apt,2,...,,1.0,150,30,48,11/4/19,0.34,3,308,4.7
1,3831,"Whole flr w/private bdrm, bath & kitchen(pls r...",4869,LisaRoxanne,Brooklyn,Bedford-Stuyvesant,40.68494,-73.95765,Entire home/apt,3,...,1.0,3.0,79,1,403,5/4/21,5.16,1,208,4.46
2,5121,BlissArtsSpace!,7356,Garon,Brooklyn,Bedford-Stuyvesant,40.68535,-73.95512,Private room,2,...,1.0,1.0,60,30,50,6/5/16,0.56,1,365,4.52
3,5136,"Spacious Brooklyn Duplex, Patio + Garden",7378,Rebecca,Brooklyn,Sunset Park,40.66265,-73.99454,Entire home/apt,4,...,2.0,2.0,175,21,1,1/2/14,0.01,1,134,5.0
4,5178,Large Furnished Room Near B'way,8967,Shunichi,Manhattan,Midtown,40.76457,-73.98317,Private room,2,...,1.0,1.0,61,2,474,9/25/20,3.61,1,246,4.19
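`&` and `|` can also be mixed, as long as every comparison sits in its own parentheses. The sketch below uses a toy dataframe with made-up rows, and also shows `isin`, a pandas method that is a shorter way to write several OR-ed equality tests on one column:

```python
import pandas as pd

toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Brooklyn", "Queens", "Manhattan"],
    "bedrooms": [4.0, 1.0, 5.0, 1.0],
})

# (Manhattan OR Brooklyn) AND more than 3 bedrooms
in_two_boroughs = (toy.neighbourhood_group == "Manhattan") | (toy.neighbourhood_group == "Brooklyn")
big = toy[in_two_boroughs & (toy.bedrooms > 3)]

# isin replaces the chain of == comparisons joined with |
same = toy[toy.neighbourhood_group.isin(["Manhattan", "Brooklyn"]) & (toy.bedrooms > 3)]

print(big)
print(big.equals(same))
```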


In [66]:
#Play around with this

# <font color='#eb3483'> Dropping rows and columns </font>

To remove rows and columns from a DataFrame, we can use the function [drop](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html). By default `.drop` removes rows based on the index value.

Drop has two important arguments:
* `inplace`: with this argument, you can choose whether to transform the original DataFrame or have the drop function return a transformed copy. Its default value is False, i.e., the transformation is not applied to the original DataFrame. You'll see this argument in many functions that transform DataFrames. **Setting `inplace=True` is usually not recommended.**
* `axis`: with this argument, you choose whether to drop rows (axis=0) or columns (axis=1). The default behaviour is to drop rows. You'll see this argument in many functions that transform DataFrames.
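A minimal sketch of both arguments on a toy dataframe (the three rows are made up for illustration). It also uses the `columns=` keyword of `drop`, which is an alternative to `axis=1`:

```python
import pandas as pd

toy = pd.DataFrame({
    "id": [2595, 3831, 5121],
    "latitude": [40.75356, 40.68494, 40.68535],
    "price": [150, 79, 60],
})

without_row = toy.drop(1)                    # axis=0 by default: drops the row labelled 1
without_col = toy.drop("latitude", axis=1)   # axis=1: drops a column
same_col = toy.drop(columns="latitude")      # columns= is equivalent to axis=1

# inplace defaults to False, so toy still has all its rows and columns
print(toy.shape)
print(list(without_row.index))
print(list(without_col.columns))
```

Assigning the result to a new name, as done here, keeps the original dataframe available in case a step needs to be repeated.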

In [67]:
df = pd.read_csv("data/NYairbnb.csv")

In [69]:
df.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,accommodates,...,bedrooms,beds,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,review_scores_rating
0,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75356,-73.98559,Entire home/apt,2,...,,1.0,150,30,48,11/4/19,0.34,3,308,4.7
1,3831,"Whole flr w/private bdrm, bath & kitchen(pls r...",4869,LisaRoxanne,Brooklyn,Bedford-Stuyvesant,40.68494,-73.95765,Entire home/apt,3,...,1.0,3.0,79,1,403,5/4/21,5.16,1,208,4.46
2,5121,BlissArtsSpace!,7356,Garon,Brooklyn,Bedford-Stuyvesant,40.68535,-73.95512,Private room,2,...,1.0,1.0,60,30,50,6/5/16,0.56,1,365,4.52
3,5136,"Spacious Brooklyn Duplex, Patio + Garden",7378,Rebecca,Brooklyn,Sunset Park,40.66265,-73.99454,Entire home/apt,4,...,2.0,2.0,175,21,1,1/2/14,0.01,1,134,5.0
4,5178,Large Furnished Room Near B'way,8967,Shunichi,Manhattan,Midtown,40.76457,-73.98317,Private room,2,...,1.0,1.0,61,2,474,9/25/20,3.61,1,246,4.19


For example, we can remove the row with index 2

In [71]:
df1 = df.drop(2)
df1.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,accommodates,...,bedrooms,beds,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,review_scores_rating
0,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75356,-73.98559,Entire home/apt,2,...,,1.0,150,30,48,11/4/19,0.34,3,308,4.7
1,3831,"Whole flr w/private bdrm, bath & kitchen(pls r...",4869,LisaRoxanne,Brooklyn,Bedford-Stuyvesant,40.68494,-73.95765,Entire home/apt,3,...,1.0,3.0,79,1,403,5/4/21,5.16,1,208,4.46
3,5136,"Spacious Brooklyn Duplex, Patio + Garden",7378,Rebecca,Brooklyn,Sunset Park,40.66265,-73.99454,Entire home/apt,4,...,2.0,2.0,175,21,1,1/2/14,0.01,1,134,5.0
4,5178,Large Furnished Room Near B'way,8967,Shunichi,Manhattan,Midtown,40.76457,-73.98317,Private room,2,...,1.0,1.0,61,2,474,9/25/20,3.61,1,246,4.19
5,5203,Cozy Clean Guest Room - Family Apt,7490,MaryEllen,Manhattan,Upper West Side,40.8038,-73.96751,Private room,1,...,1.0,1.0,75,2,118,7/12/17,0.91,1,0,4.91


We can drop multiple rows by passing a list of index values (same as with `loc`).

In [72]:
df.drop([4, 5]).head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,accommodates,...,bedrooms,beds,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,review_scores_rating
0,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75356,-73.98559,Entire home/apt,2,...,,1.0,150,30,48,11/4/19,0.34,3,308,4.7
1,3831,"Whole flr w/private bdrm, bath & kitchen(pls r...",4869,LisaRoxanne,Brooklyn,Bedford-Stuyvesant,40.68494,-73.95765,Entire home/apt,3,...,1.0,3.0,79,1,403,5/4/21,5.16,1,208,4.46
2,5121,BlissArtsSpace!,7356,Garon,Brooklyn,Bedford-Stuyvesant,40.68535,-73.95512,Private room,2,...,1.0,1.0,60,30,50,6/5/16,0.56,1,365,4.52
3,5136,"Spacious Brooklyn Duplex, Patio + Garden",7378,Rebecca,Brooklyn,Sunset Park,40.66265,-73.99454,Entire home/apt,4,...,2.0,2.0,175,21,1,1/2/14,0.01,1,134,5.0
6,5803,"Lovely Room 1, Garden, Best Area, Legal rental",9744,Laurie,Brooklyn,South Slope,40.66801,-73.98784,Private room,2,...,1.0,0.0,98,4,184,5/15/21,1.39,3,329,4.71


If we use `axis=1` we remove columns (columns are the second axis on a dataframe)

In [73]:
df = df.drop(["latitude", "longitude"], axis=1)
df.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,room_type,accommodates,bathrooms,bedrooms,beds,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,review_scores_rating
0,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,Entire home/apt,2,,,1.0,150,30,48,11/4/19,0.34,3,308,4.7
1,3831,"Whole flr w/private bdrm, bath & kitchen(pls r...",4869,LisaRoxanne,Brooklyn,Bedford-Stuyvesant,Entire home/apt,3,,1.0,3.0,79,1,403,5/4/21,5.16,1,208,4.46
2,5121,BlissArtsSpace!,7356,Garon,Brooklyn,Bedford-Stuyvesant,Private room,2,,1.0,1.0,60,30,50,6/5/16,0.56,1,365,4.52
3,5136,"Spacious Brooklyn Duplex, Patio + Garden",7378,Rebecca,Brooklyn,Sunset Park,Entire home/apt,4,,2.0,2.0,175,21,1,1/2/14,0.01,1,134,5.0
4,5178,Large Furnished Room Near B'way,8967,Shunichi,Manhattan,Midtown,Private room,2,,1.0,1.0,61,2,474,9/25/20,3.61,1,246,4.19


In [74]:
#drop the host_id



# <font color='#eb3483'> GET SOME PRACTICE </font>

## Take 10 minutes and work through 1 or 2 of these problems to get a feel for doing the coding yourself.

It is going to be rough at first. And that's okay. You can copy paste and scroll up. You don't have to remember each command. It's all there, and if it isn't ... Google is your friend.

Step through the logic yourself - or in writing BEFORE you start coding!


# <font color='#eb3483'> Filtering Pandas Exercises </font>

Let's pretend we are an Airbnb employee assigned to the New York market. Our job is to help clients find their desired listing. We have a file named `NYairbnb.csv` that has information on all the listings we have available right now in the city. Start by importing pandas and loading our data in.

In [98]:
df = pd.read_csv("data/NYairbnb.csv")
df.head(3)

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,accommodates,...,bedrooms,beds,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,review_scores_rating
0,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75356,-73.98559,Entire home/apt,2,...,,1.0,150,30,48,11/4/19,0.34,3,308,4.7
1,3831,"Whole flr w/private bdrm, bath & kitchen(pls r...",4869,LisaRoxanne,Brooklyn,Bedford-Stuyvesant,40.68494,-73.95765,Entire home/apt,3,...,1.0,3.0,79,1,403,5/4/21,5.16,1,208,4.46
2,5121,BlissArtsSpace!,7356,Garon,Brooklyn,Bedford-Stuyvesant,40.68535,-73.95512,Private room,2,...,1.0,1.0,60,30,50,6/5/16,0.56,1,365,4.52


### <font color='#eb3483'> Exercise 1 </font>

Alice is going to New York for a week with her husband and 2 kids. They are looking for a full apartment with separate rooms for parents and children. Money is not an issue for them, but they are looking for a good place. This means they are only looking for places with more than 10 reviews and a score above 4. When we show Alice our listing selection, we need to make sure we are sorting the listings from the best score to the worst one. In case some listings have the same score, we will have to sort them by the number of reviews (the more the better). We need to give her 3 alternatives.

In [99]:
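# keep listings with 2 bedrooms, more than 10 reviews and a score above 4
selections = df[(df.bedrooms == 2)
                & (df.number_of_reviews > 10)
                & (df.review_scores_rating > 4)]

# best score first, then take the top 3
selections.sort_values(by='review_scores_rating', ascending=False).head(3)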

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,accommodates,...,bedrooms,beds,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,review_scores_rating
19186,28715332,TRIBECA/SOHO 2 BEDROOM LUXURY LOFT,122323589,Roberto,Manhattan,SoHo,40.72054,-74.00198,Entire home/apt,4,...,2.0,2.0,628,3,14,5/25/21,0.43,1,300,5.0
899,657198,"2BR gem in Cobble Hill, Brooklyn",414627,Charlotte,Brooklyn,Columbia St,40.68731,-74.0004,Entire home/apt,4,...,2.0,3.0,200,3,15,1/1/18,0.21,1,0,5.0
828,596442,"Modern, Bright, Spacious 2 Bed, 2 Bath",2949021,Daisy,Brooklyn,Clinton Hill,40.68307,-73.96392,Entire home/apt,4,...,2.0,4.0,170,30,12,4/28/21,0.13,1,276,5.0
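The cell above filters on bedrooms, reviews and score, and sorts by score only. A fuller sketch that also keeps only entire apartments and breaks ties by the number of reviews could look like the following. It runs on a toy dataframe with made-up values, so its result is not the answer for the real file:

```python
import pandas as pd

# Toy listings with made-up values, only to illustrate the logic
toy = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Entire home/apt", "Entire home/apt"],
    "bedrooms": [2.0, 2.0, 2.0, 3.0],
    "number_of_reviews": [14, 200, 40, 5],
    "review_scores_rating": [5.0, 5.0, 5.0, 4.9],
})

selections = toy[
    (toy.room_type == "Entire home/apt")
    & (toy.bedrooms == 2)
    & (toy.number_of_reviews > 10)
    & (toy.review_scores_rating > 4)
]

# Sort by score first, then break ties with the number of reviews
top = selections.sort_values(
    by=["review_scores_rating", "number_of_reviews"],
    ascending=[False, False],
).head(3)

print(top)
```

Passing lists to `by` and `ascending` lets each sort key have its own direction.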



### <font color='#eb3483'> Exercise 2 </font>

Rafi is going to spend 3 nights in New York and he wants to meet new people. He has a budget of $100. We need to provide him with the 10 cheapest listings, with a preference for shared rooms. We need to sort the rooms by score (descending).

In [None]:
#We'll share the solutions but try it out for yourself!