 # <font color='#eb3483'>Pandas Functions  </font>

Pandas has way more functions than we can even hope to cover in class. Here are a few that might be helpful during your homework and class project.    
    
1. Isin
1. Working with different data types
1. Merge

In [1]:
import pandas as pd

## <font color='#eb3483'> 1. Isin </font>

We can check whether an element belongs to a Python list like this:

In [2]:
"potato" in ["potato", "tomato", "lettuce"]

True

We can use a similar approach with pandas dataframes using `.isin`. For example, if we want to select those listings where the neighborhood is in a specific list we can do it like this:

In [3]:
df = pd.read_csv("data/airbnb.csv", index_col="room_id").sort_index()
df.head()

room_id,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
6499,14455,Entire home/apt,Belém,8,5.0,2,1.0,57.0
17031,66015,Entire home/apt,Alvalade,0,0.0,2,1.0,46.0
25659,107347,Entire home/apt,Santa Maria Maior,63,5.0,3,1.0,69.0
29248,125768,Entire home/apt,Santa Maria Maior,225,4.5,4,1.0,58.0
29396,126415,Entire home/apt,Santa Maria Maior,132,5.0,4,1.0,67.0


In [4]:
favorite_neighbourhoods = ["Belém", "Parque das Nações"]

listings_i_like = df[df.neighborhood.isin(favorite_neighbourhoods)]

listings_i_like.head()

room_id,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
6499,14455,Entire home/apt,Belém,8,5.0,2,1.0,57.0
105962,106839925,Entire home/apt,Parque das Nações,29,5.0,4,1.0,69.0
129822,639944,Private room,Parque das Nações,50,4.5,2,1.0,55.0
137434,671930,Entire home/apt,Parque das Nações,0,0.0,6,2.0,138.0
184888,887175,Entire home/apt,Parque das Nações,7,4.5,6,2.0,115.0
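
The mask returned by `.isin` can also be negated with `~` to select the rows that are *not* in the list. Here is a minimal sketch on a small made-up dataframe (the `listings` data below is hypothetical, not the Airbnb file):

```python
import pandas as pd

# Hypothetical toy listings, only for illustration
listings = pd.DataFrame({
    "neighborhood": ["Belém", "Alvalade", "Estrela", "Parque das Nações"],
    "price": [57.0, 46.0, 80.0, 69.0],
})
favorites = ["Belém", "Parque das Nações"]

# .isin returns a boolean Series with one True/False per row
in_favorites = listings.neighborhood.isin(favorites)

liked = listings[in_favorites]        # rows whose neighborhood is in the list
not_liked = listings[~in_favorites]   # ~ flips the mask: everything else

print(liked)
print(not_liked)
```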


<hr>

### <font color='#eb3483'> Filtering based on datatypes </font>

In [5]:
df.dtypes

host_id                   int64
room_type                object
neighborhood             object
reviews                   int64
overall_satisfaction    float64
accommodates              int64
bedrooms                float64
price                   float64
dtype: object

We can use the method `select_dtypes` to select those columns that have specific types. 

For example, if we want to select only the columns that are floats, we can do:

In [6]:
df.select_dtypes(include=[float]).head()

room_id,overall_satisfaction,bedrooms,price
6499,5.0,1.0,57.0
17031,0.0,1.0,46.0
25659,5.0,1.0,69.0
29248,4.5,1.0,58.0
29396,5.0,1.0,67.0


We can also use the parameter `exclude` to filter excluding certain data types. 

For example, to exclude the columns that hold Python objects (strings are stored as objects), we can do:

In [7]:
df.select_dtypes(exclude=[object]).head()

room_id,host_id,reviews,overall_satisfaction,accommodates,bedrooms,price
6499,14455,8,5.0,2,1.0,57.0
17031,66015,0,0.0,2,1.0,46.0
25659,107347,63,5.0,3,1.0,69.0
29248,125768,225,4.5,4,1.0,58.0
29396,126415,132,5.0,4,1.0,67.0
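
`select_dtypes` also accepts the string `"number"`, which matches every numeric dtype (ints and floats) at once. A small sketch, assuming a made-up dataframe called `toy`:

```python
import pandas as pd

# Hypothetical toy data with one integer, one text and one float column
toy = pd.DataFrame({
    "host_id": [1, 2],
    "room_type": ["Private room", "Shared room"],
    "price": [55.0, 20.0],
})

numeric_cols = toy.select_dtypes(include="number")      # int and float columns
non_numeric_cols = toy.select_dtypes(exclude="number")  # everything else

print(numeric_cols.columns.tolist())
print(non_numeric_cols.columns.tolist())
```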


<hr>

## <font color='#eb3483'> 2. Working with Different Data Types </font>

Let's assume we have the following dataframe:

In [8]:
unprocessed_data = pd.DataFrame({
    "worked_hours": [5, 5.2, "sick", 8, 10],
    "work_date": ["2020-01-02", "208-01-02", "2020-01-04", "2020-01-05", "2020-01-06"],
    "employee": ["taryn morris/ix", "enock chemochek/ix", "enock chemochek/ix", "david adarkwah/ix", "julie tian/ix"],
    "customer_review": [5,4,0,2,4]
})

unprocessed_data

Unnamed: 0,worked_hours,work_date,employee,customer_review
0,5,2020-01-02,taryn morris/ix,5
1,5.2,208-01-02,enock chemochek/ix,4
2,sick,2020-01-04,enock chemochek/ix,0
3,8,2020-01-05,david adarkwah/ix,2
4,10,2020-01-06,julie tian/ix,4


Looking at the dataset, it is clear that `work_date` is a date, `worked_hours` is a number and `employee` is a string. However, because of errors in the data, pandas has stored `worked_hours` and `work_date` as generic objects (`object`), the same dtype it uses for strings. Only `customer_review` was read as a number.

In [9]:
unprocessed_data.dtypes

worked_hours       object
work_date          object
employee           object
customer_review     int64
dtype: object

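We can turn columns into numbers with `pd.to_numeric`.

In [18]:
# Left commented out on purpose: with the default errors="raise",
# this call raises a ValueError because "sick" is not a number.
# pd.to_numeric(unprocessed_data.worked_hours)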

By default, `to_numeric` fails if any value can't be converted. We can change the conversion strategy by specifying the error strategy (with the argument `errors`). For example, if we use `errors="coerce"` pandas will convert those values it can and will convert the rest to null values (NaN).

In [19]:
pd.to_numeric(unprocessed_data.worked_hours, errors="coerce")

0     5.0
1     5.2
2     NaN
3     8.0
4    10.0
Name: worked_hours, dtype: float64
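
After coercing, it is often useful to check which original values could not be converted. One possible way (a sketch, with a made-up `raw` series) is to look for rows that are null after the conversion but were not null before:

```python
import pandas as pd

# Hypothetical raw column: one bad string and one value that was already missing
raw = pd.Series([5, 5.2, "sick", 8, None], name="worked_hours")

converted = pd.to_numeric(raw, errors="coerce")

# Null after conversion but present before => the conversion failed there
failed = raw[converted.isna() & raw.notna()]

print(failed)
```

This separates values that were genuinely missing from values that were silently turned into NaN.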

### <font color='#eb3483'> Strings </font>

Columns that contain strings have a set of special methods, available under the `.str` accessor.

For example, if we want to capitalize the employee name, we can use `.str.capitalize()`. Note that it only upper-cases the first character of each string.

In [20]:
unprocessed_data.employee.str.capitalize().head()

0       Taryn morris/ix
1    Enock chemochek/ix
2    Enock chemochek/ix
3     David adarkwah/ix
4         Julie tian/ix
Name: employee, dtype: object

And if we want to replace the spaces in the name with a `_`, we can do so with `.str.replace()`:

In [24]:
unprocessed_data.employee = unprocessed_data.employee.str.replace(" ", "_").str.capitalize()
unprocessed_data.head()

Unnamed: 0,worked_hours,work_date,employee,customer_review
0,5,2020-01-02,Taryn_morris/ix,5
1,5.2,208-01-02,Enock_chemochek/ix,4
2,sick,2020-01-04,Enock_chemochek/ix,0
3,8,2020-01-05,David_adarkwah/ix,2
4,10,2020-01-06,Julie_tian/ix,4


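There are many string methods available. For example, we can use `.str.strip()` to remove the `/ix` at the end of the employee name. Keep in mind that `strip` treats its argument as a set of characters and removes any of them from both ends of the string, so it is not a true "remove this suffix" operation.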

In [29]:
unprocessed_data.employee.str.strip("/ix").head()

0       Taryn_morris
1    Enock_chemochek
2    Enock_chemochek
3     David_adarkwah
4         Julie_tian
Name: employee, dtype: object
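
Because `strip` works on a set of characters, it can remove more than intended when a name itself ends in one of those characters. The sketch below uses made-up names to show the difference, and uses an anchored regular expression with `.str.replace` as one possible alternative for removing only the suffix:

```python
import pandas as pd

# Hypothetical names; "felix" ends in characters that are also in "/ix"
names = pd.Series(["felix/ix", "taryn morris/ix"])

# strip removes any of "/", "i", "x" from both ends, so "felix" loses its "ix"
stripped = names.str.strip("/ix")

# An anchored regex removes exactly one trailing "/ix" and nothing else
suffix_removed = names.str.replace(r"/ix$", "", regex=True)

print(stripped.tolist())        # ['fel', 'taryn morris']
print(suffix_removed.tolist())  # ['felix', 'taryn morris']
```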

We can use `.str.cat` to concatenate strings. For example, if we want to set the index to `employee_work_date` we can do so as:

In [30]:
unprocessed_data = unprocessed_data.set_index("employee", drop=False)
unprocessed_data.index = unprocessed_data.index.str.cat(unprocessed_data.work_date.astype(str), sep="_")
unprocessed_data

employee,worked_hours,work_date,employee,customer_review
Taryn_morris/ix_2020-01-02,5,2020-01-02,Taryn_morris/ix,5
Enock_chemochek/ix_208-01-02,5.2,208-01-02,Enock_chemochek/ix,4
Enock_chemochek/ix_2020-01-04,sick,2020-01-04,Enock_chemochek/ix,0
David_adarkwah/ix_2020-01-05,8,2020-01-05,David_adarkwah/ix,2
Julie_tian/ix_2020-01-06,10,2020-01-06,Julie_tian/ix,4


### <font color='#eb3483'>  Datetimes </font>

In the same way, we can convert date strings to datetime objects (a data type specific to dates) with `pd.to_datetime`. With `errors="coerce"`, it turns invalid dates into nulls (NaT, "not a time").

In [31]:
pd.to_datetime(unprocessed_data.work_date, errors="coerce")

employee
Taryn_morris/ix_2020-01-02      2020-01-02
Enock_chemochek/ix_208-01-02           NaT
Enock_chemochek/ix_2020-01-04   2020-01-04
David_adarkwah/ix_2020-01-05    2020-01-05
Julie_tian/ix_2020-01-06        2020-01-06
Name: work_date, dtype: datetime64[ns]

This way we can transform the dates into actual date objects.

In [32]:
unprocessed_data["work_date"] = pd.to_datetime(unprocessed_data.work_date, errors="coerce")
unprocessed_data["worked_hours"] = pd.to_numeric(unprocessed_data.worked_hours, errors="coerce")

In [33]:
unprocessed_data

employee,worked_hours,work_date,employee,customer_review
Taryn_morris/ix_2020-01-02,5.0,2020-01-02,Taryn_morris/ix,5
Enock_chemochek/ix_208-01-02,5.2,NaT,Enock_chemochek/ix,4
Enock_chemochek/ix_2020-01-04,,2020-01-04,Enock_chemochek/ix,0
David_adarkwah/ix_2020-01-05,8.0,2020-01-05,David_adarkwah/ix,2
Julie_tian/ix_2020-01-06,10.0,2020-01-06,Julie_tian/ix,4


In [34]:
unprocessed_data.dtypes

worked_hours              float64
work_date          datetime64[ns]
employee                   object
customer_review             int64
dtype: object

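Datetime columns have additional functionality, similar to the `.str` methods for strings. We can access the datetime methods via the attribute `.dt`.

For example, we can find the day of the week with `.dt.dayofweek` (Monday is 0 and Sunday is 6). The result is a float column here because the missing date (NaT) becomes NaN.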

In [35]:
unprocessed_data.work_date.dt.dayofweek

employee
Taryn_morris/ix_2020-01-02       3.0
Enock_chemochek/ix_208-01-02     NaN
Enock_chemochek/ix_2020-01-04    5.0
David_adarkwah/ix_2020-01-05     6.0
Julie_tian/ix_2020-01-06         0.0
Name: work_date, dtype: float64
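
The `.dt` accessor exposes other components too. A short sketch on a made-up series of dates, showing `.dt.day_name()`, `.dt.month` and a weekend flag built from `.dt.dayofweek`:

```python
import pandas as pd

# Hypothetical dates; the middle one is invalid and becomes NaT
dates = pd.to_datetime(
    pd.Series(["2020-01-02", "not a date", "2020-01-04"]),
    errors="coerce",
)

day_names = dates.dt.day_name()        # weekday names, NaN where the date is NaT
months = dates.dt.month                # month number, NaN where the date is NaT
is_weekend = dates.dt.dayofweek >= 5   # Saturday=5, Sunday=6; NaT compares as False

print(day_names.tolist())
print(is_weekend.tolist())
```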

### <font color='#eb3483'>  Categories </font>

In [36]:
df = pd.read_csv("data/airbnb.csv", index_col="room_id").sort_index()

Pandas can store categorical variables in two ways: as regular objects (strings) or with the dedicated **category** dtype.

Why would we want to use categories?

 - They identify categorical columns explicitly, which helps later processing and other Python libraries.
 - Converting a string variable with only a few distinct values to a categorical variable saves memory.
 - By converting to a categorical we can specify an order on the categories.


In [37]:
df.dtypes

host_id                   int64
room_type                object
neighborhood             object
reviews                   int64
overall_satisfaction    float64
accommodates              int64
bedrooms                float64
price                   float64
dtype: object

In [38]:
df.memory_usage(index=True, deep=True)

Index                    105856
host_id                  105856
room_type                942129
neighborhood            1051870
reviews                  105856
overall_satisfaction     105856
accommodates             105856
bedrooms                 105856
price                    105856
dtype: int64

Woah, `neighborhood` and `room_type` are taking up a lot of memory: roughly nine to ten times as much as each numeric column.

Let's try encoding the room type and neighbourhood as categories.

In [39]:
df = df.astype({"room_type": "category", "neighborhood": "category"})

df.dtypes

host_id                    int64
room_type               category
neighborhood            category
reviews                    int64
overall_satisfaction     float64
accommodates               int64
bedrooms                 float64
price                    float64
dtype: object

In [40]:
df.describe(include='category')

Unnamed: 0,room_type,neighborhood
count,13232,13232
unique,3,24
top,Entire home/apt,Santa Maria Maior
freq,9764,2798


We get additional information when describing categories. On top of that, the two columns now take significantly less memory!

In [41]:
df.memory_usage(index=True, deep=True)

Index                   105856
host_id                 105856
room_type                13549
neighborhood             15715
reviews                 105856
overall_satisfaction    105856
accommodates            105856
bedrooms                105856
price                   105856
dtype: int64
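
The saving comes from storing each distinct string once and keeping only a small integer code per row. A minimal sketch on made-up data (the exact byte counts depend on the pandas version and platform, so none are shown):

```python
import pandas as pd

# Hypothetical column: 3 distinct strings repeated 1000 times each
values = ["Entire home/apt", "Private room", "Shared room"] * 1000

as_object = pd.Series(values, dtype=object)   # one Python string per row
as_category = as_object.astype("category")    # 3 categories + integer codes

object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(object_bytes, category_bytes)
print(as_category.cat.codes.head().tolist())  # the integer codes behind the labels
```

The fewer distinct values a column has relative to its length, the larger the saving tends to be.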

Pandas also works well for ordinal variables (variables that have a ranking). We can assign an order to a categorical variable. For example, we can set the variable `room_type` as an ordinal variable `(Shared Room < Private Room < Entire Home)`

In [42]:
df.room_type.head()

room_id
6499     Entire home/apt
17031    Entire home/apt
25659    Entire home/apt
29248    Entire home/apt
29396    Entire home/apt
Name: room_type, dtype: category
Categories (3, object): ['Entire home/apt', 'Private room', 'Shared room']

By default, a categorical column is not ordered.

In [43]:
df.room_type.cat.ordered  # let's look this attribute up in the help: click on it and press Shift+Tab

False

We can assign the order by doing `.cat.set_categories` and passing the list of levels

In [44]:
df.room_type = df.room_type.cat.set_categories(["Shared room","Private room", "Entire home/apt"], ordered=True)

In [45]:
df.room_type.cat.ordered

True

Now because we know the sorting order of the categories, we can filter the ordinal column:

In [46]:
df[df.room_type>"Shared room"].head()

room_id,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
6499,14455,Entire home/apt,Belém,8,5.0,2,1.0,57.0
17031,66015,Entire home/apt,Alvalade,0,0.0,2,1.0,46.0
25659,107347,Entire home/apt,Santa Maria Maior,63,5.0,3,1.0,69.0
29248,125768,Entire home/apt,Santa Maria Maior,225,4.5,4,1.0,58.0
29396,126415,Entire home/apt,Santa Maria Maior,132,5.0,4,1.0,67.0
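
Once a categorical is ordered, comparisons, sorting and `min`/`max` all follow the category order instead of alphabetical order. A small sketch with a made-up `sizes` series:

```python
import pandas as pd

# Hypothetical ordinal variable: small < medium < large
sizes = pd.Series(pd.Categorical(
    ["medium", "small", "large", "small"],
    categories=["small", "medium", "large"],
    ordered=True,
))

bigger_than_small = sizes > "small"   # comparison uses the category order
sorted_sizes = sizes.sort_values()    # small, small, medium, large
largest = sizes.max()                 # "large", not the alphabetical maximum

print(bigger_than_small.tolist())
print(sorted_sizes.tolist())
print(largest)
```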


When reading a csv, we can also specify dtypes so we can directly create category columns:

In [47]:
df = pd.read_csv("data/airbnb.csv", dtype={
                                        "room_type": "category",
                                        "neighborhood": "category"
})

In [48]:
df.dtypes

room_id                    int64
host_id                    int64
room_type               category
neighborhood            category
reviews                    int64
overall_satisfaction     float64
accommodates               int64
bedrooms                 float64
price                    float64
dtype: object
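
The `dtype` argument can be combined with `index_col`. The sketch below reads a made-up CSV from an in-memory string (via `io.StringIO`) so that it runs without any file on disk; the column names and values are only for illustration:

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = (
    "room_id,room_type,price\n"
    "1,Private room,55.0\n"
    "2,Shared room,20.0\n"
    "3,Private room,60.0\n"
)

toy = pd.read_csv(
    io.StringIO(csv_text),
    index_col="room_id",
    dtype={"room_type": "category"},  # parsed straight into a category column
)

print(toy.dtypes)
```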

<hr>

### <font color='#eb3483'> Rename </font>

`.rename` allows us to rename either the row index or the columns.

We can rename columns very easily this way.

In [49]:
df.head()

Unnamed: 0,room_id,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
0,6499,14455,Entire home/apt,Belém,8,5.0,2,1.0,57.0
1,17031,66015,Entire home/apt,Alvalade,0,0.0,2,1.0,46.0
2,25659,107347,Entire home/apt,Santa Maria Maior,63,5.0,3,1.0,69.0
3,29248,125768,Entire home/apt,Santa Maria Maior,225,4.5,4,1.0,58.0
4,29396,126415,Entire home/apt,Santa Maria Maior,132,5.0,4,1.0,67.0


In [50]:
df = df.rename(columns={"neighborhood": "neighbourhood", "reviews": "number_reviews"})
df.head()

Unnamed: 0,room_id,host_id,room_type,neighbourhood,number_reviews,overall_satisfaction,accommodates,bedrooms,price
0,6499,14455,Entire home/apt,Belém,8,5.0,2,1.0,57.0
1,17031,66015,Entire home/apt,Alvalade,0,0.0,2,1.0,46.0
2,25659,107347,Entire home/apt,Santa Maria Maior,63,5.0,3,1.0,69.0
3,29248,125768,Entire home/apt,Santa Maria Maior,225,4.5,4,1.0,58.0
4,29396,126415,Entire home/apt,Santa Maria Maior,132,5.0,4,1.0,67.0


We can also rename values in the index.

In [51]:
df_indexed_neighbourhood = df.set_index("neighbourhood")
df_indexed_neighbourhood.head()

neighbourhood,room_id,host_id,room_type,number_reviews,overall_satisfaction,accommodates,bedrooms,price
Belém,6499,14455,Entire home/apt,8,5.0,2,1.0,57.0
Alvalade,17031,66015,Entire home/apt,0,0.0,2,1.0,46.0
Santa Maria Maior,25659,107347,Entire home/apt,63,5.0,3,1.0,69.0
Santa Maria Maior,29248,125768,Entire home/apt,225,4.5,4,1.0,58.0
Santa Maria Maior,29396,126415,Entire home/apt,132,5.0,4,1.0,67.0


In [52]:
df_indexed_neighbourhood.rename({"Belém": "Belem", "Santa Maria Maior": "Saint Mary"}).head()

neighbourhood,room_id,host_id,room_type,number_reviews,overall_satisfaction,accommodates,bedrooms,price
Belem,6499,14455,Entire home/apt,8,5.0,2,1.0,57.0
Alvalade,17031,66015,Entire home/apt,0,0.0,2,1.0,46.0
Saint Mary,25659,107347,Entire home/apt,63,5.0,3,1.0,69.0
Saint Mary,29248,125768,Entire home/apt,225,4.5,4,1.0,58.0
Saint Mary,29396,126415,Entire home/apt,132,5.0,4,1.0,67.0
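
Besides dictionaries, `.rename` also accepts a function that is applied to every label, and `columns` and `index` can be given in the same call. A minimal sketch on a made-up one-row dataframe:

```python
import pandas as pd

# Hypothetical dataframe with untidy column names
toy = pd.DataFrame(
    {"Room Type": ["Private room"], "Price": [55.0]},
    index=["Belém"],
)

renamed = toy.rename(
    columns=lambda name: name.lower().replace(" ", "_"),  # applied to each column label
    index={"Belém": "Belem"},                             # dictionary for index labels
)

print(renamed)
print(toy.columns.tolist())  # the original is unchanged: rename returns a new dataframe
```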


<hr>

## <font color='#eb3483'> 3. Practicing Merge AGAIN </font>

`merge` allows us to join two dataframes based on one or more columns.

Now we are going to load an additional dataset that contains the population in every Lisbon district (taken from [here](https://www.citypopulation.de/php/portugal-lisboa.php))

In [53]:
df = pd.read_csv("data/airbnb.csv", index_col="room_id").sort_index()
populations = pd.read_csv("data/population_lisboa.csv")

In [54]:
populations.head()

Unnamed: 0,name,Population_2011
0,Alcochete,17569
1,Alcochete,12239
2,Samouco,3143
3,São Francisco,2187
4,Almada,17403


We will do a merge between the Airbnb listings and the populations.

merge works as follows:

```
pd.merge(left_dataframe, right_dataframe, how=join_type)
```

merge has several parameters:

 - `left_on` is the name of the join column in the left dataframe.
 - `right_on` is the name of the join column in the right dataframe.
 - `how` specifies the join type, that is, which rows are kept:
   - `outer` keeps all rows from both dataframes, whether or not the values in the join columns exist in both.
   - `inner` keeps only the rows whose values in the join columns exist in both dataframes. This is the default value.
   - `left` keeps all rows from the left dataframe, whether or not their join values exist in the right dataframe.
   - `right` is the opposite of `left`: it keeps all rows from the right dataframe.
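
To make the join types concrete, here is a small sketch with two made-up dataframes that share only the keys `b` and `c`. It counts how many rows each value of `how` returns:

```python
import pandas as pd

# Hypothetical dataframes: "a" exists only on the left, "d" only on the right
left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "y": [10, 20, 30]})

row_counts = {
    how: len(pd.merge(left, right, on="key", how=how))
    for how in ["inner", "left", "right", "outer"]
}

print(row_counts)  # {'inner': 2, 'left': 3, 'right': 3, 'outer': 4}
```

Rows without a match on the other side are kept by `left`, `right` and `outer` with NaN in the columns that come from the missing side.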

In [55]:
listings_with_pop = pd.merge(df, populations, left_on="neighborhood", right_on="name")

In [56]:
listings_with_pop.shape

(12191, 10)

In [57]:
listings_with_pop.head()

Unnamed: 0,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price,name,Population_2011
0,14455,Entire home/apt,Belém,8,5.0,2,1.0,57.0,Belém,16528
1,992647,Entire home/apt,Belém,54,4.0,2,1.0,45.0,Belém,16528
2,2083563,Private room,Belém,2,0.0,2,1.0,127.0,Belém,16528
3,2341627,Entire home/apt,Belém,64,4.5,4,1.0,67.0,Belém,16528
4,3168004,Entire home/apt,Belém,57,4.5,3,2.0,46.0,Belém,16528


Let's check how merge works.

We can see the original number of neighbourhoods by doing `unique()`

In [58]:
neighborhoud_list_all = df.neighborhood.unique()
neighborhoud_list_all

array(['Belém', 'Alvalade', 'Santa Maria Maior', 'Estrela', 'Alcântara',
       'Misericórdia', 'Avenidas Novas', 'Lumiar', 'São Vicente',
       'Campo de Ourique', 'Santo António', 'São Domingos de Benfica',
       'Parque das Nações', 'Penha de França', 'Arroios', 'Beato',
       'Campolide', 'Benfica', 'Areeiro', 'Ajuda', 'Carnide', 'Olivais',
       'Santa Clara', 'Marvila'], dtype=object)

In [59]:
len(neighborhoud_list_all)

24

We see there are 24 unique neighbourhoods.

Let's see how many neighbourhoods we get when we do the merge:

In [60]:
neighborhoud_list_merge = listings_with_pop.neighborhood.unique()

In [61]:
len(neighborhoud_list_merge)

23

We see there is a neighbourhood missing! Maybe it is missing from the population dataset?

In [62]:
set(neighborhoud_list_all) - set(neighborhoud_list_merge)

{'São Vicente'}

We see that *São Vicente* doesn't exist in the population dataset. By default, `merge` does an `inner` join, which removes the rows whose join columns don't match. In this example, *São Vicente* exists in the Airbnb `neighborhood` column but not in the `name` column of the populations dataframe.

In [63]:
listings_with_pop[listings_with_pop.neighborhood=="São Vicente"]

Unnamed: 0,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price,name,Population_2011


In this particular example, the Airbnb dataframe is the reference, and we are just adding extra data to it. So we would rather keep the listings in São Vicente even though we won't have population information for them. In this case we will do a left merge.

In [64]:
listings_with_pop = pd.merge(df, populations, left_on="neighborhood", right_on="name", how="left")

Now all the listings are kept. The ones in São Vicente simply have missing values in the population columns.

In [65]:
listings_with_pop[listings_with_pop.neighborhood=="São Vicente"].head()

Unnamed: 0,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price,name,Population_2011
21,270457,Entire home/apt,São Vicente,86,4.5,3,1.0,52.0,,
32,335614,Entire home/apt,São Vicente,157,4.5,4,1.0,93.0,,
35,480371,Entire home/apt,São Vicente,27,4.5,6,2.0,115.0,,
39,483396,Entire home/apt,São Vicente,26,4.5,4,1.0,87.0,,
46,183877,Entire home/apt,São Vicente,31,4.5,2,1.0,44.0,,
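
Instead of comparing lists of unique values by hand, `merge` can report where each row came from through its `indicator` parameter, which adds a `_merge` column. A sketch on made-up data (only the Belém population figure is taken from the output above; everything else is hypothetical):

```python
import pandas as pd

# Hypothetical miniature versions of the two dataframes
listings = pd.DataFrame({
    "neighborhood": ["Belém", "São Vicente", "Belém"],
    "price": [57.0, 52.0, 45.0],
})
populations = pd.DataFrame({"name": ["Belém"], "Population_2011": [16528]})

merged = pd.merge(
    listings, populations,
    left_on="neighborhood", right_on="name",
    how="left",
    indicator=True,   # adds a "_merge" column: "both" or "left_only" for a left join
)

# Neighbourhoods that found no match in the populations dataframe
unmatched = merged.loc[merged["_merge"] == "left_only", "neighborhood"].unique()

print(merged)
print(list(unmatched))
```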
