# <font color='#eb3483'> Data Prepping and Wrangling with Pandas </font>

Once we have read our dataframe in and had a look around, we may want to start working with specific columns or rows, or with data that meets certain criteria.

Enter Pandas!

In this notebook we will cover:
- Selecting columns
- Renaming columns
- Filtering rows:
    -  by their numerical position - iloc
    -  by their index - loc
    -  with [ ]
    -  multiple selections
-  Dropping rows or columns


In [None]:
import pandas as pd

We are going to use a dataset that has Airbnb listing information in New York.

In [None]:
df = pd.read_csv('data/NYairbnb.csv')

In [None]:
df.head()

In [None]:
df.columns

In [None]:
df.shape

# <font color='#eb3483'> Selecting Columns </font>

We can select columns using dot notation, as long as the column names don't contain spaces or other non-alphanumeric characters. This is why it is always good to name your columns without these. Saves time later :)

In [None]:
#To get 1 column

df.room_type

This is the same as selecting with square brackets, which also works when our column names contain spaces or non-alphanumeric characters.

In [None]:
df['room_type'] 

When we select one column we receive a pd.Series. To instead retrieve a dataframe we use double brackets ... 

In [None]:
df[['room_type']]

To select multiple columns we also use double brackets (and receive a dataframe)

In [None]:
df[["room_type", "price"]].head()
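To see the difference between single and double brackets without loading the Airbnb file, here is a minimal sketch on a small made-up frame. The name `toy` and its values are invented for illustration only.

```python
import pandas as pd

# Hypothetical toy data, not taken from NYairbnb.csv
toy = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt", "Shared room"],
    "price": [80, 150, 40],
})

single = toy["room_type"]            # single brackets -> pd.Series
double = toy[["room_type"]]          # double brackets -> one-column DataFrame
both = toy[["room_type", "price"]]   # list of names -> DataFrame

print(type(single).__name__)
print(type(double).__name__)
print(both.shape)
```

The inner brackets are just a Python list of column names, which is why the result stays a dataframe even when the list has a single entry.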

# <font color='#eb3483'> Renaming Columns </font>


We can rename columns in a few ways: by assigning a new list of names to `df.columns`, by using the `rename` method, or by applying a function (for example, one that changes the case) to every column name.

There are many options [here](https://note.nkmk.me/en/python-pandas-dataframe-rename/), but a few common ones are shown below.

In [None]:
df = pd.read_csv('data/NYairbnb.csv')

In [None]:
df.head()

In [None]:
#select just a few columns
df = df[['id', 'host_id', 'neighbourhood_group', 'room_type', 'price', 'number_of_reviews']]
df.head()

We can change the name of the columns by changing the column names list `df.columns`. For example, we can rename the columns to make it a bit easier to type.

In [None]:
df.columns

In [None]:
df.columns = ['id', 'host_id', 'nhood_grp', 'room_type', 'price', 'num_reviews']
df.head()


In [None]:
#We can also use the rename function 
df_new = df.rename(columns={'price': 'price_usd', 'nhood_grp': 'neighborhood'})
df_new.head()


We can also pass a function to `rename`, for example to change all column names to upper or lower case.

In [None]:
df.rename(columns=str.lower)
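Keep in mind that `rename` returns a new dataframe and leaves the original untouched unless we assign the result back. A small sketch, assuming a made-up frame `toy` with mixed-case column names:

```python
import pandas as pd

# Hypothetical toy data with mixed-case column names
toy = pd.DataFrame({"Host_ID": [1, 2], "Price": [80, 150]})

lowered = toy.rename(columns=str.lower)  # new frame, lower-case names
uppered = toy.rename(columns=str.upper)  # new frame, upper-case names

print(list(toy.columns))      # the original names are unchanged
print(list(lowered.columns))
print(list(uppered.columns))
```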

# <font color='#eb3483'> Filtering rows </font>

### <font color='#eb3483'> Filtering rows by their position - iloc </font>

We use the `iloc` indexer to select specific rows of a dataframe.

There are two things you should know about `iloc`. Firstly, it is reserved for purely position-based indexing (integers only). So if you ever call `iloc` with a non-integer index, it will raise an error. Secondly, `iloc` **does not interact with your index at all** -> important to remember if your index is integer-based.

With `iloc` we select rows by their row number, starting at 0.

In [None]:
df.head()

In [None]:
df.iloc[0] # using one square bracket returns it as a series

In [None]:
type(df.iloc[0])

Generally we would want to keep working with a dataframe - so we use double brackets `[[]]`

In [None]:
df.iloc[[0]]

We can select multiple rows at once:

In [None]:
df.iloc[[0,3,9]]

Or use slices like with arrays:

In [None]:
# select the rows at positions 2 to 9 (the end of the slice, 10, is excluded)
df.iloc[2:10]
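To make the "position, not index" point concrete, here is a small sketch on a made-up frame whose integer index labels deliberately differ from the row positions. The name `toy` and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data: the index labels (10, 20, ...) are not the positions
toy = pd.DataFrame(
    {"price": [80, 150, 40, 200]},
    index=[10, 20, 30, 40],
)

first = toy.iloc[0]         # first row by position (its label is 10) -> Series
subset = toy.iloc[[0, 2]]   # positions 0 and 2 -> DataFrame
sliced = toy.iloc[1:3]      # positions 1 and 2; the end position 3 is excluded

print(first["price"])
print(subset.index.tolist())
print(sliced.index.tolist())
```

Even though no row is labelled 0, `toy.iloc[0]` works, because `iloc` only counts positions.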


## <font color='#eb3483'> Filtering rows by their index value - loc </font>

With `.loc` we can select rows based on their index value.
`loc` is based purely on the assigned index for your dataframe - so it can be a number but it can also be a label (remember changing our indexes last class)

Let's set the index of our df to host_id

In [None]:
df = df.set_index("host_id")

Now let's say we want to get the listings for a specific host.
We know the host has the ID 7378, so we can pull out (aka filter) any listings that belong to that host.

In [None]:
df.head()

In [None]:
df.loc[7378]

Selecting an index value that doesn't exist will fail with a `KeyError`

In [None]:
df.loc[[5]]

Same as with .iloc, we can select multiple values at once.

In [None]:
df.loc[[8967, 305240193, 107434423]]
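Because an index such as `host_id` can contain repeated values, `loc` may return several rows for one label. The sketch below uses a made-up frame (the host IDs 101, 102 and 999 are invented) to show that, and to show the `KeyError` raised for a label that is not in the index.

```python
import pandas as pd

# Hypothetical toy data: host 101 has two listings
toy = pd.DataFrame(
    {"host_id": [101, 102, 101], "price": [80, 150, 95]}
).set_index("host_id")

host_rows = toy.loc[[101]]   # every row whose index label is 101

try:
    toy.loc[[999]]           # 999 is not in the index
    missing_label_failed = False
except KeyError:
    missing_label_failed = True

print(host_rows)
print(missing_label_failed)
```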

### <font color='#eb3483'> Knowledge check </font>


In [None]:
# Select only the properties in Manhattan:
# reset the index, set it to nhood_grp, then
# use the loc function to select the Manhattan rows




## <font color='#eb3483'> Filtering with [ ] </font>

Just as with most things in Python (and life), there is more than one way of doing things. We can use loc and iloc to filter, but we can also filter by using brackets `[ ]` and subset the dataframe to create a specific "sub"dataframe



For example, let's filter the dataframe again to see all listings in `Manhattan`:

In [None]:
df = pd.read_csv('data/NYairbnb.csv')

In [None]:
df.head()

In [None]:
df.shape

If we filter with brackets, the dataframe we get back is smaller: it only contains the rows where the condition is true

In [None]:
df[df.neighbourhood_group == 'Manhattan'].head()

In [None]:
df[df.neighbourhood_group == 'Manhattan'].shape

We can select the inverse of a condition if we put `~` in front of it.

For example, to select all listings that are not in Manhattan, we can do this:

In [None]:
df[~(df.neighbourhood_group ==  "Manhattan")].shape

In [None]:
df[~(df.neighbourhood_group ==  "Manhattan")].head()
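It can help to look at what sits inside the brackets: the comparison produces a boolean Series (often called a mask), and the brackets keep the rows where it is `True`. A minimal sketch on made-up data (the frame `toy` and its values are invented):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Brooklyn", "Manhattan", "Queens"],
    "price": [150, 90, 200, 70],
})

mask = toy.neighbourhood_group == "Manhattan"   # boolean Series, one value per row
in_manhattan = toy[mask]                        # rows where the mask is True
not_manhattan = toy[~mask]                      # ~ flips True and False
also_not = toy[toy.neighbourhood_group != "Manhattan"]  # same result using !=

print(mask.tolist())
print(in_manhattan.shape, not_manhattan.shape)
```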

### <font color='#eb3483'> Multiple Selection </font>

We can filter a dataframe based on multiple conditions.

We can select rows that match all of several conditions by combining the conditions with `&`. Each condition needs its own parentheses.

For example, if we want those listings in Manhattan with more than 3 bedrooms:

In [None]:
df.head()

In [None]:
df[(df.neighbourhood_group == 'Manhattan') & (df.bedrooms > 3)].head() # we use & for and.

In the same way, we can select rows that match one condition OR the other with the pipe (`|`)

In [None]:
df[(df.neighbourhood_group == 'Manhattan') | (df.neighbourhood_group == "Brooklyn")].head() # we use | for or.
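The parentheses around each condition matter, because `&` and `|` bind more tightly than `==` or `>` in Python. The sketch below uses a made-up frame (invented values, with a `bedrooms` column assumed as in the example above) and also shows `isin`, a pandas method that can replace a chain of `|` comparisons on the same column.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Brooklyn", "Manhattan", "Queens"],
    "bedrooms": [4, 2, 1, 5],
})

# AND: both conditions must hold for a row to be kept
big_manhattan = toy[(toy.neighbourhood_group == "Manhattan") & (toy.bedrooms > 3)]

# OR: either condition may hold
either = toy[(toy.neighbourhood_group == "Manhattan") | (toy.neighbourhood_group == "Brooklyn")]

# isin gives the same rows as the OR above
either_isin = toy[toy.neighbourhood_group.isin(["Manhattan", "Brooklyn"])]

print(big_manhattan)
print(either.equals(either_isin))
```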

In [None]:
#Play around with this

# <font color='#eb3483'> Dropping rows and columns </font>
To remove rows and columns from a DataFrame, we can use the function [drop](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html). By default `.drop` removes rows based on the index value.

Drop has two important arguments:
* `inplace`: with this argument, you can choose whether to modify the original DataFrame (`inplace=True`) or have the drop function return a transformed copy. Its default value is False, i.e. you don't apply the transformation to the original DataFrame. You'll see this argument in many functions that transform DataFrames. **Using `inplace=True` is usually not recommended**
* `axis`: with this argument, you choose whether to drop rows (axis=0) or columns (axis=1). The default behaviour is to drop rows. You'll see this argument in many functions that transform DataFrames.

In [None]:
df = pd.read_csv("data/NYairbnb.csv")

In [None]:
df.head()

For example, we can remove the row with index 2

In [None]:
df1 = df.drop(2)
df1.head()

We can drop multiple rows by passing a list of index values (same as with `loc` or `iloc`)

In [None]:
df.drop([4, 5]).head()

If we use `axis=1` we remove columns (columns are the second axis on a dataframe)

In [None]:
df = df.drop(["latitude", "longitude"], axis=1)
df.head()
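As a recap of how `drop` behaves, here is a small sketch on a made-up frame (the name `toy` and its values are invented). It also uses the `columns=` keyword, which pandas accepts as an alternative to passing `axis=1`.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "latitude": [40.7, 40.6, 40.8],
    "price": [80, 150, 40],
})

no_row = toy.drop(1)                      # drops the row with index label 1
no_col = toy.drop(columns=["latitude"])   # same as toy.drop(["latitude"], axis=1)

print(toy.shape)                # the original frame is unchanged
print(no_row.index.tolist())
print(list(no_col.columns))
```

Since `inplace` defaults to False, nothing happens to `toy` unless we assign the result back, as the cell above does with `df = df.drop(...)`.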

In [None]:
#drop the host_id



# <font color='#eb3483'> GET SOME PRACTICE </font>

## Take 10 minutes and work through 1 or 2 of these problems to get a feel for doing the coding yourself.

It is going to be rough at first. And that's okay. You can copy, paste and scroll up. You don't have to remember each command. It's all there - and if it isn't ... Google is your friend.

Step through the logic yourself - or in writing BEFORE you start coding!


# <font color='#eb3483'> Filtering Pandas Exercises </font>

Let's pretend we are an Airbnb employee assigned to the New York market. Our job is to help clients find their desired listing. We have a file named `NYairbnb.csv` that has information on all the listings we have available right now in the city. Start by importing pandas and loading our data in.

### <font color='#eb3483'> Exercise 1 </font>

Alice is going to New York for a week with her husband and 2 kids. They are looking for a full apartment with separate rooms for parents and children. Money is not an issue for them, but they are looking for a good place. This means they are only looking for places with more than 10 reviews and a score above 4. When we show Alice our listing selection, we need to make sure we sort the listings from the best score to the worst. If some listings have the same score, we will have to sort them by the number of reviews (the more the better). We need to give her 3 alternatives.


### <font color='#eb3483'> Exercise 2 </font>

Rafi is going to spend 3 nights in New York and he wants to meet new people. He has a budget of $100. We need to provide him with the 10 cheapest listings, with a preference for shared rooms. We need to sort the rooms by score (descending).

In [None]:
#We'll share the solutions but try it out for yourself!