# Pandas Introduction

![image](../../images/pandas_logo.png)

**Pandas** is probably the most popular Python library for data manipulation and analysis. Pandas is designed to work with a wide range of data sources, including: `CSV`, `Excel`, `Parquet`, `Pickle`, `JSON`, and many more.

**NOTE**: Data Scientists / Data Professionals 💚 **Pandas**!

# Install and Use Pandas for the first time

`pandas` should be automatically installed as part of `anaconda`. Nonetheless, if for some reason it is missing, you can enter the following command in an `anaconda prompt` or a `terminal` to install `pandas`:

```bash
pip install pandas
```

You can verify that it is installed by entering the following code in a notebook cell:

```python
try:
    import pandas as pd
    print(f"The version of pandas is: {pd.__version__}")
except ImportError:
    print("pandas is not installed!")
```

It is extremely common to use `pd` as the alias for `pandas` in `Python`.

```python
import pandas as pd
```

# Core Objects of Pandas

There are two **core objects** of `pandas`:
- **DataFrame**: A table-like object that can be easily created and manipulated.
- **Series**: A 1-dimensional sequence of data values. If a **dataframe** is a table, then each column is a **Series**.

![image](../../images/df_series.png)

## DataFrame

In order to **create a dataframe**, you can use the `pd.DataFrame()` constructor.

Let's create the dataframe we see in the image above and assign it to a variable called `df`.

```python
df = pd.DataFrame({
    "Height": [1.60, 1.76, 1.95],
    "Weight": [60.0, 56.0, 85.0],
})

df
```

As you can see, we provide a dictionary of **column names** and **column values** to the `pd.DataFrame()` constructor. The **column names** are the **keys** of the dictionary, and the **column values** are the **values** of the dictionary. This is a very common way to create a dataframe.

The numbers on the far left of the dataframe are indices. The indices are automatically assigned by `pandas` and are also called **row labels**.

In this example, the indices are `0`, `1`, and `2`. The [`1`, `Weight`] entry contains the value `56.0`.

**NOTE**:
- Dataframes can contain as many columns and rows as you want. The entries in the dataframes are not limited to numbers, and they can be of any type.
- Dataframes are **mutable**. This means that you can change the values of the dataframe.
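The two notes above can be tried out on the small dataframe from earlier. The snippet below is only a minimal sketch: the `Name` column and its values are hypothetical, added purely to show a non-numeric column, and the `.loc` accessor it uses is covered in more detail in the indexing section of this lesson.

```python
import pandas as pd

df = pd.DataFrame({
    "Height": [1.60, 1.76, 1.95],
    "Weight": [60.0, 56.0, 85.0],
})

# Read the entry at row label 1, column "Weight"
entry = df.loc[1, "Weight"]
print(entry)

# Dataframes are mutable: overwrite that same entry
df.loc[1, "Weight"] = 58.0

# Entries are not limited to numbers (hypothetical "Name" column)
df["Name"] = ["Person A", "Person B", "Person C"]
print(df)
```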

**Exercise**

Given the following dictionary, can you create a dataframe that looks like the one below?

```python
dict_data = {
    "Name": ["Millie Bobby Brown", "Finn Wolfhard", "Natalie Dyer", "Kaley Cuoco", "Jim Parsons"],
    "Age": [18, 19, 27, 36, 49],
    "Height": [1.60, 1.80, 1.63, 1.68, 1.86],
    "TV Series": ["Stranger Things", "Stranger Things", "Stranger Things", "Big Bang Theory", "Big Bang Theory"],
}
```

|Name|Age|Height|TV Series|
|:-:|:-:|:-:|:-:|
|Millie Bobby Brown|18|1.60|Stranger Things|
|Finn Wolfhard|19|1.80|Stranger Things|
|Natalie Dyer|27|1.63|Stranger Things|
|Kaley Cuoco|36|1.68|Big Bang Theory|
|Jim Parsons|49|1.86|Big Bang Theory|

In [None]:
# [TODO]


Instead of using generic numbers as **row labels**, can we create a dataframe having row labels to be the names of the people in the dataframe?

The answer is **YES**, and it's very easy.

```python
df = pd.DataFrame({
    "Age": [18, 19, 27, 36, 49],
    "Height": [1.60, 1.80, 1.63, 1.68, 1.86],
    "TV Series": ["Stranger Things", "Stranger Things", "Stranger Things", "Big Bang Theory", "Big Bang Theory"],
}, index=["Millie Bobby Brown", "Finn Wolfhard", "Natalie Dyer", "Kaley Cuoco", "Jim Parsons"])

df
```

As you can see, we only have to provide the list of names as values of the `index` argument in the `pd.DataFrame()` constructor.

## Series

In order to **create a series**, you can use the `pd.Series()` constructor.

```python
height_series = pd.Series([1.60, 1.80, 1.63, 1.68, 1.86])
```

As mentioned above, a series can be considered as a column in a dataframe. 
- A series can contain as many entries as you want.
- A series also has an `index` that is automatically generated by `pandas` but can be changed as you wish.
- A series can be of any type.
- A series has one overall `name` that is used to identify the series.

Let's recreate height_series, but we'll give it a `name` of `Height` this time.

```python
height_series = pd.Series([1.60, 1.80, 1.63, 1.68, 1.86], name="Height")
```
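To make the `index` and `name` points more concrete, here is a small sketch that builds a series with custom index labels. It reuses three of the names and heights from the exercise data above; the variable `height_df` is just an illustrative name.

```python
import pandas as pd

height_series = pd.Series(
    [1.60, 1.80, 1.63],
    index=["Millie Bobby Brown", "Finn Wolfhard", "Natalie Dyer"],
    name="Height",
)

# The overall name of the series
print(height_series.name)

# Look up a value by its index label
print(height_series["Finn Wolfhard"])

# A named series turns into a one-column dataframe,
# and the series name becomes the column name
height_df = height_series.to_frame()
print(height_df.columns.tolist())
```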



# Read data from file

It's good that you know how to create a dataframe/series from scratch. In real life, you will probably be doing that only when you want to create some very small tests specific to your need. In this section, we'll learn how to use `pandas` to read data from a **CSV file** and store the data in a dataframe.

**CSV file** is a common file format used to store data. It is a simple text file that contains a list of comma separated values.

In the `data` folder, I have already provided a CSV file called `weatherAUS.csv`. We'll be using this data for the rest of the lesson.

![image](../../images/data_tree.png)

The **Weather in Australia** dataset is a public dataset. More information about the **Weather in Australia** dataset is available [here](https://www.kaggle.com/jsphyg/weather-dataset-rattle-package).

**NOTE**: **Kaggle** is probably the best source of learning for all **data enthusiasts**.

The data contains daily weather observations taken from various weather stations in Australia from `1st November 2007` to `25th June 2017`.

The data includes about `145k` rows of weather observation data. Each row represents a unique observation set, and each observation set contains the attributes shown in the table below.

|No|Attribute|Description|Type|
|-|-|-|-|
|1|Date|The date of observation|Datetime|
|2|Location|The common name of the location of the weather station|String|
|3|MinTemp|Minimum temperature in $^\circ C$|Float|
|4|MaxTemp|Maximum temperature in $^\circ C$|Float|
|5|Rainfall|Amount of rainfall for the day in $mm$|Float|
|6|Evaporation|Class A pan evaporation in $mm$ in the 24 hours to 9am|Float|
|7|Sunshine|Number of hours of bright sunshine in the day|Float|
|8|WindGustDir|Direction of the strongest wind gust in the 24 hours to midnight|String|
|9|WindGustSpeed|Speed in $km/hr$ of the strongest wind gust in the 24 hours to midnight|Float|
|10|WindDir9am|Direction of the wind at 9am|String|
|11|WindDir3pm|Direction of the wind at 3pm|String|
|12|WindSpeed9am|Wind speed in $km/hr$ averaged over 10 minutes prior to 9am|Float|
|13|WindSpeed3pm|Wind speed in $km/hr$ averaged over 10 minutes prior to 3pm|Float|
|14|Humidity9am|Humidity in percentage at 9am|Float|
|15|Humidity3pm|Humidity in percentage at 3pm|Float|
|16|Pressure9am|Atmospheric pressure in $hPa$ reduced to mean sea level at 9am|Float|
|17|Pressure3pm|Atmospheric pressure in $hPa$ reduced to mean sea level at 3pm|Float|
|18|Cloud9am|Fraction of sky obscured by cloud at 9am, measured in $oktas$ in a range 0 (sky clear) to 8 (sky completely overcast)|Float|
|19|Cloud3pm|Fraction of sky obscured by cloud at 3pm, measured in $oktas$ in a range 0 (sky clear) to 8 (sky completely overcast)|Float|
|20|Temp9am|Temperature in $^\circ C$ at 9am|Float|
|21|Temp3pm|Temperature in $^\circ C$ at 3pm|Float|
|22|RainToday|Boolean attribute. 'Yes' if precipitation (mm) in the 24 hours to 9am exceeds 1mm, otherwise 'No'|String|
|23|RainTomorrow|The target variable. Did it rain tomorrow? Boolean. 'Yes' if we predict rain for tomorrow, otherwise 'No'|String|

Let's read the data from the CSV file and play with the data.

```python
weather_data = pd.read_csv("../../data/weatherAUS.csv")
```

Similar to `np.ndarray`, `pd.DataFrame` also has a `shape` attribute that can be used to check the number of rows and columns in the dataframe.

```python
weather_data.shape
```

Thus, we can see that the dataframe has `145,460` rows and `23` columns.

In order to examine the data, we can use the `head()`, `tail()`, or `sample()` method.
- `head(n)` returns the first `n` rows of the dataframe.
- `tail(n)` returns the last `n` rows of the dataframe.
- `sample(n)` returns `n` random rows of the dataframe.
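The three methods can be compared side by side on a tiny dataframe. This is a sketch on made-up data (`small_df` and its `value` column are hypothetical); `random_state` is passed to `sample()` only so that the random rows are reproducible between runs.

```python
import pandas as pd

# A tiny dataframe with the values 0..9 in a single column
small_df = pd.DataFrame({"value": range(10)})

first_three = small_df.head(3)   # first 3 rows
last_two = small_df.tail(2)      # last 2 rows
random_four = small_df.sample(4, random_state=0)  # 4 random rows

print(first_three)
print(last_two)
print(random_four)
```

When `n` is omitted, `head()` and `tail()` return 5 rows, while `sample()` returns a single row.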

Let's view the first `5` rows of the `weather_data` dataframe.

```python
weather_data.head()
```

**Quick observation**:
- You can see that there are lots of `NaN` values in the dataframe. `NaN` stands for `Not a Number`, and it means that the value is not available. We'll learn how to deal with `NaN` values later.
- The dataframe is also **too wide** to be displayed on a single screen. That's why we see a `...` section in the middle of the dataframe.

**Exercise**

Can you get 10 random rows from the `weather_data` dataframe?

In [None]:
# [TODO]


Similar to `np.ndarray`, `pd.DataFrame` can also be transposed. This is useful when you want to display the dataframe in a more readable format.

Let's view the transposed version of the first 5 rows.

```python
weather_data.head().T
```

# Indexing, Selecting, and Assigning in Pandas

## Indexing

There are **several ways** to access a column in the dataframe.
- `.` operator: `df.column_name` returns a series with the values of the column `column_name`.
- `[]` (indexing) operator: `df["column_name"]` does the same thing as the above.

Let's access the `Location` column of the `weather_data` dataframe.

```python
weather_data.Location
weather_data["Location"]
```

Personally, I prefer to use the indexing syntax (`df["column_name"]`) because it is more readable and extremely clear that we are accessing a column from the dataframe. In addition, the indexing syntax can handle cases where the column name contains reserved characters like `spaces`.

For example, `df["column name"]` will work, but `df.column name` will not.

`pandas` has built-in accessor operators called `iloc` and `loc` for more advanced operations.
- `iloc` is **position-based** selection, while `loc` is **label-based** selection.
- `iloc` is short for `integer location` and is used to select rows and columns by their integer position, whereas `loc` selects them by their labels.
- Both `iloc` and `loc` are **row-first, column-second**.

Let's access the first row of the `weather_data` dataframe using `iloc`.

```python
weather_data.iloc[0]
```

Let's access the `MaxTemp` column of `weather_data` using the `iloc` accessor.

```python
weather_data.iloc[:, 3]
```

**Exercise**

Do you notice something similar to what we learnt in the previous lessons?

<font size="5">[TODO] 📖</font>

It's the slicing operator (`:`).

**Exercise**

Can you select the first 10 rows of the column `MinTemp` from the `weather_data` dataframe?

In [None]:
# [TODO]


Let's access the first row of the `weather_data` dataframe using `loc`.

```python
weather_data.loc[0]
```

It doesn't look that different from the `iloc` accessor, right? In fact, `iloc` is conceptually simpler than `loc` because it ignores the dataset's indices. When we're using `iloc`, we only need to specify the position of the row(s) and/or column(s) we want to access. `loc` is more flexible because it allows us to specify the label of the row(s) and/or column(s) we want to access.

For example, let's access the first 5 rows from the following columns (`MinTemp`, `MaxTemp`, `RainToday`, `RainTomorrow`) of the `weather_data` dataframe.

```python
weather_data.loc[:4, ["MinTemp", "MaxTemp", "RainToday", "RainTomorrow"]]
```


Do you notice anything different about `loc` and `iloc` in terms of the provided indices? 

With the default integer index, `iloc[:5]` is equivalent to `loc[:4]`. The end of a slice is inclusive in `loc`, while it is exclusive in `iloc`.
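The inclusive/exclusive difference is easiest to see on a tiny dataframe. The sketch below uses made-up `MinTemp` values and a hypothetical letter index (`a` to `f`) to show that `loc` slices by label, including the end label, while `iloc` slices by position like a Python list.

```python
import pandas as pd

toy = pd.DataFrame({"MinTemp": [13.4, 7.4, 12.9, 9.2, 17.5, 14.6]})

# Default integer index: positions and labels coincide
by_position = toy.iloc[:5]   # positions 0, 1, 2, 3, 4 (end excluded)
by_label = toy.loc[:4]       # labels 0, 1, 2, 3, 4 (end included)
print(by_position.equals(by_label))

# With non-integer labels the difference is easier to spot
relabelled = toy.copy()
relabelled.index = list("abcdef")

label_slice = relabelled.loc["a":"c"]   # includes the row labelled "c"
position_slice = relabelled.iloc[0:3]   # excludes position 3
print(label_slice)
print(position_slice)
```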

There is an even **more compact** way to select a subset dataframe from the original dataframe using the `[]` operator. Personally, I prefer to use this method as it is highly readable and easy to use.

Let's do the same using `[]` operator!

```python
weather_data[["MinTemp", "MaxTemp", "RainToday", "RainTomorrow"]].head()
```

As shown above, when creating a dataframe, we have the option to specify the `index` of the dataframe instead of using the automatically generated numbers. Thus, if we want to use the `Location` column as the `index` of our `weather_data` dataframe, we can use the `set_index()` method. Note that, by default, `set_index()` returns a new dataframe and leaves `weather_data` itself unchanged.

```python
weather_data.set_index("Location")
```

## Selecting

This will be extremely similar to how we filter data in a `numpy` array. Often in a project, we want to select a subset of the data based on certain conditions to perform some analysis in depth.

For example, let's select all records where the `Location` is `Sydney`.

```python
weather_data[weather_data["Location"] == "Sydney"]
```

As you can see, the result is a dataframe with only the records where the `Location` is `Sydney`. 
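Under the hood, the comparison inside the brackets produces a boolean Series (often called a mask), and the dataframe keeps only the rows where the mask is `True`. The sketch below shows this on made-up data (`toy` and its values are hypothetical). When combining conditions, each condition needs its own parentheses because `&` and `|` bind more tightly than comparisons such as `==` and `>`.

```python
import pandas as pd

toy = pd.DataFrame({
    "Location": ["Sydney", "Perth", "Sydney", "Darwin"],
    "Rainfall": [0.0, 3.2, 12.4, 0.4],
})

# The comparison returns one True/False value per row
mask = toy["Location"] == "Sydney"
print(mask.tolist())

# Indexing with the mask keeps only the rows where it is True
sydney_rows = toy[mask]

# Combining two conditions: wrap each one in parentheses
wet_sydney_rows = toy[(toy["Location"] == "Sydney") & (toy["Rainfall"] > 1.0)]
print(wet_sydney_rows)
```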

**Exercise**

How many rows are there where the `Location` is `Sydney` and the `RainTomorrow` is `Yes`?

**Hint**: We can use `&` (`ampersand`: on a high level, it's similar to `and`).

In [None]:
# [TODO]


**Exercise**

How many rows are there where the `Evaporation` is greater than `3` or `RainTomorrow` is `Yes`?

**Hint**: We can use `|` (`pipe`: on a high level, it's similar to `or`).

In [None]:
# [TODO]


**Exercise**

Can you select all records where the `Location` is either `Sydney`, `Melbourne`, or `Canberra`?

In [None]:
# [TODO]


The `|` operator works but it's quite cumbersome for that scenario. There's a better way to do the same thing by using the `pandas` `isin` method. `isin` lets you select data whose value **is in** a list of values.

```python
weather_data[weather_data["Location"].isin(["Sydney", "Melbourne", "Canberra"])]
```

Remember that we have seen lots of `NaN` values above. How do we filter out the records with missing values? We can use the `pandas` `isnull` method. The opposite of `isnull` is `notnull`. We can also use `~` (`tilde`) to invert the result.

```python
weather_data[weather_data["Evaporation"].isnull()]
weather_data[~weather_data["Evaporation"].isnull()]
```

**Exercise**

Can you check if the sum of rows of the 2 dataframes we got above is the same as the number of rows in the `weather_data` dataframe?

In [None]:
# [TODO]


## Assigning

Assigning data to a dataframe is very simple and very similar to how we add a new `key-value` pair to an existing dictionary.
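As a quick illustration of that dictionary-like syntax, here is a sketch on made-up data. The dataframe `toy` and the column names `TempRange` and `Source` are hypothetical and do not exist in the weather dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "MinTemp": [13.4, 7.4],
    "MaxTemp": [22.9, 25.1],
})

# New column computed from existing columns, like adding a new key to a dict
toy["TempRange"] = toy["MaxTemp"] - toy["MinTemp"]

# Assigning a single value fills the whole column with it
toy["Source"] = "toy example"

# Assigning to a subset of rows with loc and a boolean condition
toy.loc[toy["MinTemp"] < 10, "Source"] = "cold morning"

print(toy)
```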

**Exercise**

Can you create a column called `MinTemp > 5` to indicate whether the `MinTemp` is greater than 5 or not? The column will contain value "Yes" if the `MinTemp` is greater than 5, and "No" otherwise.

**Hint**: Use `np.where()`.

In [None]:
# [TODO]


# Summary Functions and Maps

## Summary Functions
`pandas` has functions like `describe()` and `info()` which provide a summary of the data. They can be used to give you an overview of the data by displaying relevant descriptive statistics of the data.

Let's see how `describe()` works!

Note that when the dataframe has too many columns that can't be easily displayed in a single screen, we can transpose the result to view it better.

```python
weather_data.describe().T
```

`describe()` gives you the following information.
- `count`: the number of `notnull` rows for that particular column.
- `mean`: the mean of the column.
- `std`: the standard deviation of the column.
- `min`: the minimum value of the column.
- `25%`: the 25th percentile of the column.
- `50%`: the 50th percentile of the column.
- `75%`: the 75th percentile of the column.
- `max`: the maximum value of the column.

If you use `describe()` as is, you will only get summary statistics for the **numerical** columns. If you want to get the summary statistics for all columns, you can use the `.describe(include="all")` method.

```python
weather_data.describe(include="all").T
```

It's actually extremely hard to read, so normally, we view the summary statistics of **numerical** and **categorical** columns separately.

To view the summary statistics of only **categorical** columns, we can use the `.describe(include="object")` method.
- `count`: the number of `notnull` rows for that particular column.
- `unique`: the number of unique values in the column.
- `top`: the most common value in the column.
- `freq`: the frequency of the most common value.

```python
weather_data.describe(include="object").T
```

Let's see how `info()` works!
- `Non-Null Count`: the number of `notnull` rows for that particular column.
- `Dtype`: the data types of the columns.

`info()` gives us an idea to see whether the `pd.read_csv()` method used before has inferred the data types of the columns correctly. We can easily assign the correct data types to each column while reading the data.

```python
weather_data.info()
```
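One way to assign data types while reading is to pass `dtype` and `parse_dates` to `pd.read_csv()`. The sketch below reads a tiny made-up CSV from an in-memory string instead of the real `weatherAUS.csv`, so the rows and the variable names `csv_text` and `sample` are purely illustrative.

```python
import io

import pandas as pd

# A tiny made-up CSV that mimics a few columns of the weather data
csv_text = """Date,Location,MinTemp,RainToday
2020-01-01,TownA,10.5,No
2020-01-02,TownA,8.0,No
2020-01-03,TownB,,Yes
"""

sample = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["Date"],            # read Date as datetime, not text
    dtype={
        "Location": "category",      # repeated labels fit a categorical type
        "MinTemp": "float64",
    },
)

print(sample.dtypes)
```

Running `sample.info()` afterwards is a quick way to confirm that each column ended up with the intended type.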

To view the data type of a single column, we can use the `.dtype` attribute.

```python
weather_data["Location"].dtype
```

To quickly view the data types of each column, we can use the `.dtypes` attribute.

```python
weather_data.dtypes
```

Regarding unique values in a column, we can use:
- `unique()` to get the list of unique values in the column of interest.
- `nunique()` to get the number of unique values in the column of interest.
- `value_counts()` to get the frequency of each unique value in the column of interest.

```python
weather_data["Location"].unique()
weather_data["Location"].nunique()
weather_data["Location"].value_counts()
```

## Maps

**Mapping** is a way to map one set of values to another set of values. This is one of the ways to transform the raw data into the desired format. This is something that data scientists do a lot.

There are lots of mapping methods in `pandas`, but you'll often use the `map()` method.

In data science, it's very common to standardise features to a common scale. One of the most common scalers is called the **Standard Scaler**, where we transform each data point of the feature into its `z-score`.

$z = \displaystyle \frac{x - \text{mean}}{\text{std}}$

Let's see how we normalise the `Rainfall` column using `map()` method! We'll also save the normalised data to a new column called `NormalisedRainfall`.

```python
mean_rainfall = weather_data["Rainfall"].mean()
std_rainfall = weather_data["Rainfall"].std()

weather_data["NormalisedRainfall"] = weather_data["Rainfall"].map(
    lambda x: (x - mean_rainfall) / std_rainfall
)
```

Notice the usage of the anonymous (`lambda`) function. The `lambda` function passed to `map()` expects a single value from the Series (a point value, in the above example), and returns a transformed version of that value. `map()` returns a new Series where all the values have been transformed by the anonymous function.

Let's look at the normalised column!

In fact, we don't even need to use `map()` to normalise the data. There are plenty of common mapping operations already built into `pandas`.

Let's do the same thing in a faster way!

```python
mean_rainfall = weather_data["Rainfall"].mean()
std_rainfall = weather_data["Rainfall"].std()

weather_data["NormalisedRainfall_v2"] = (weather_data["Rainfall"] - mean_rainfall)/std_rainfall
```

We can see that `NormalisedRainfall_v2` is exactly the same as `NormalisedRainfall`. In fact, `NormalisedRainfall_v2` is even calculated faster. Let's verify it by using the `%timeit` magic command.

```python
%timeit weather_data["NormalisedRainfall"] = weather_data["Rainfall"].map(lambda x: (x - mean_rainfall) / std_rainfall)
%timeit weather_data["NormalisedRainfall_v2"] = (weather_data["Rainfall"] - mean_rainfall)/std_rainfall
```

Despite this, the `map()` method is still very useful as it is a lot more flexible than the built-in operations. You can use `map()` to apply arbitrary Python logic (for example, conditional rules), which is often awkward to express with the built-in operations.
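As an example of conditional logic with `map()`, the sketch below labels made-up rainfall values. The function `label_rain`, the `1.0` threshold, and the label names are hypothetical choices for illustration. It also shows that `map()` accepts a dictionary, which is handy for recoding categories.

```python
import pandas as pd

rain = pd.Series([0.0, 0.6, 12.4, None], name="Rainfall")

def label_rain(x):
    # Missing values reach the function as NaN, so handle them first
    if pd.isna(x):
        return "Unknown"
    return "Wet" if x > 1.0 else "Dry"

labels = rain.map(label_rain)
print(labels.tolist())

# map() also accepts a dictionary of old value -> new value
answers = pd.Series(["Yes", "No", "Yes"])
encoded = answers.map({"Yes": 1, "No": 0})
print(encoded.tolist())
```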

There is another very common mapping method called `apply()`. It is similar to `map()`, but when it is called on a DataFrame it passes a whole column (`axis=0`) or a whole row (`axis=1`) to the function instead of a single value.

Let's say we want to apply the same standardisation to the following columns ("Rainfall", "Temp9am", "WindGustSpeed") in the `weather_data` dataframe!

```python
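# Select the columns we want to standardise
df = weather_data[["Rainfall", "Temp9am", "WindGustSpeed"]]

def standardise(column):
    # With axis=0, apply() passes each column to this function as a Series
    mean_column = column.mean()
    std_column = column.std()
    return (column - mean_column) / std_column

standardised_df = df.apply(standardise, axis=0)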
```

Let's compare `weather_data["NormalisedRainfall"]` and `standardised_df["Rainfall"]` to see that we have the same result!

To compare 2 `pd.Series`, we can use the `equals()` method. More about the `equals()` method is available [here](https://pandas.pydata.org/docs/reference/api/pandas.Series.equals.html).
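One thing to keep in mind is that `equals()` checks for exact equality, so two series that differ only by floating-point rounding are reported as different. The sketch below uses made-up values to show this, with `np.allclose()` as one possible tolerance-based alternative.

```python
import numpy as np
import pandas as pd

a = pd.Series([0.1 + 0.2, 1.0])
b = pd.Series([0.3, 1.0])

# Exact comparison: 0.1 + 0.2 is not exactly 0.3 in floating point
exact_match = a.equals(b)

# Tolerance-based comparison
close_match = np.allclose(a, b)

# equals() treats missing values in the same positions as equal
with_nan = pd.Series([1.0, np.nan])
nan_match = with_nan.equals(pd.Series([1.0, np.nan]))

print(exact_match, close_match, nan_match)
```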

# Grouping and Sorting

## Grouping
It is very common in data science that we need to aggregate the data based on some criteria and perform some calculations on grouped data. In order to do so, we can use the `groupby()` method.

How `groupby()` works is as follows:

![image](../../images/pd_groupby.png)

Let's find the total `Rainfall` for each `Location` in the `weather_data` dataframe! We'll assign the result to a variable called `total_rainfall_by_location`.

```python
total_rainfall_by_location = weather_data.groupby(["Location"])["Rainfall"].sum()
```

So the result is a `Series` with the `Location` as the index and the total `Rainfall` as the value. Often, we want to have the result as a dataframe instead of a `Series`. In order to do that, we can use the `reset_index()` method, which will reset the index of the result `Series` and allow us to have `Location` as another column in the result dataframe.

```python
total_rainfall_by_location = total_rainfall_by_location.reset_index()
```

We could combine everything in 1 single line as follows:

```python
total_rainfall_by_location = weather_data.groupby(["Location"])["Rainfall"].sum().reset_index()
```
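The same pattern can be checked by hand on a tiny dataframe. This sketch uses made-up values (`toy` is hypothetical) and also shows `as_index=False`, an alternative that keeps `Location` as a regular column without a separate `reset_index()` call.

```python
import pandas as pd

toy = pd.DataFrame({
    "Location": ["Sydney", "Perth", "Sydney", "Perth", "Darwin"],
    "Rainfall": [1.0, 0.5, 2.0, 1.5, 4.0],
})

# Result is a Series indexed by Location
as_series = toy.groupby("Location")["Rainfall"].sum()
print(as_series)

# Result is a dataframe with Location as a regular column
as_frame = toy.groupby("Location", as_index=False)["Rainfall"].sum()
print(as_frame)
```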

**Exercise**

1. Can you find the maximum `MinTemp` for each location in the `weather_data` dataframe?
1. Based on the result above, can you find the location with the highest `MinTemp`?

**Hint**: You can use the `max()` method to find the maximum value in a `Series`.

In [None]:
# [TODO]


There are many aggregate functions in `pandas`. The most common ones that you'll use are: `sum()`, `mean()`, `min()`, `max()`, `median()`. If you want to perform multiple aggregate functions on a single column, you can use the `agg()` method.

For example, we can write 1 single line of code to find the following information about the `Rainfall` column:
- Total `Rainfall` per `Location`.
- Average `Rainfall` per `Location`.
- Minimum `Rainfall` per `Location`.
- Maximum `Rainfall` per `Location`.

```python
result = weather_data.groupby(["Location"])["Rainfall"].agg(["sum", "mean", "min", "max"]).reset_index()
result
```

Notice that the column names are automatically generated by the `agg()` method. Usually, it's a good idea to rename the columns to make them more readable. This is because if we perform the same set of functions on another column (e.g. `MaxTemp`), the column names will be the same as the aggregated functions, and it will be hard to combine the results and make sense of the data.

We can use the `rename()` method to rename dataframe columns. 
- It's a very useful method as we can rename multiple columns at once by supplying a dictionary of old and new column names to the method. 
- We can choose to have the operation done `inplace` or `not inplace`.
    - `inplace` means we will be saving the new result in the original dataframe.
    - `not inplace` means we will be creating a new dataframe with the new result.

Let's change the column names `not inplace` first and understand what it does.

```python
result.rename(columns={
    "sum": "TotalRainfall",
    "mean": "AverageRainfall",
    "min": "MinRainfall",
    "max": "MaxRainfall"
}, inplace=False)
```

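If we inspect the `result` variable, we can see that it remains intact: `rename()` returned a new dataframe with the new column names, and the original was left untouched.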

Let's change the column names `inplace` this time!

```python
result.rename(columns={
    "sum": "TotalRainfall",
    "mean": "AverageRainfall",
    "min": "MinRainfall",
    "max": "MaxRainfall"
}, inplace=True)
```

We can see that `result` dataframe is now changed with the new column names.
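An alternative to renaming after the fact is to name the output columns directly inside `agg()` using keyword arguments (named aggregation). The sketch below uses made-up data; `toy` and `summary` are hypothetical names.

```python
import pandas as pd

toy = pd.DataFrame({
    "Location": ["Sydney", "Perth", "Sydney", "Perth", "Darwin"],
    "Rainfall": [1.0, 0.5, 2.0, 1.5, 4.0],
})

# Each keyword is the new column name, each value the aggregate function
summary = (
    toy.groupby("Location")["Rainfall"]
    .agg(TotalRainfall="sum", AverageRainfall="mean", MaxRainfall="max")
    .reset_index()
)

print(summary)
```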

In the example thus far, we have only grouped by 1 single column (e.g. `Location`). In fact, we can group by multiple columns as well.

Remember that we have created a column called `MinTemp > 5` above which has the value of `Yes` if the `MinTemp` is greater than 5 and `No` otherwise.

Let's find the total `Rainfall` for each `Location` and `MinTemp > 5` in the `weather_data` dataframe! We'll assign the result to a variable called `result`.

```python
result = weather_data.groupby(["Location", "MinTemp > 5"])["Rainfall"].sum().reset_index()
```

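This simple aggregation can already tell us **some interesting information about the data**. You can see that in any `Location`, the total `Rainfall` is higher when `MinTemp` is greater than 5 than when `MinTemp` is less than or equal to 5. Be careful with the interpretation, though: a total also depends on how many days fall into each group, so this alone does not show that rain is more likely on warmer days.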

I'm going to perform some operations on the `Date` column of the `weather_data` dataframe to extract the `Month` and `Year`. This is the reason why we need to be very careful when reading in `datetime` information: the `.dt` accessor used below only works on actual `datetime` values, so we want to ensure that the `Date` column is recognised as `datetime` instead of a string.

```python
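# The .dt accessor needs datetime values, so convert the Date column first
weather_data["Date"] = pd.to_datetime(weather_data["Date"])
weather_data["Month"] = weather_data["Date"].dt.month
weather_data["Year"] = weather_data["Date"].dt.year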
```

Working with `datetime` is fun but a bit tricky. Don't worry about it too much at the moment! 

[Here](https://www.dataquest.io/blog/datetime-in-pandas/) is a pretty good tutorial on how to start working with `datetime` in `pandas`.
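As a small taste, the sketch below converts a few made-up date strings with `pd.to_datetime()` and then extracts the month and year through the `.dt` accessor. The dataframe `dates` and its values are hypothetical.

```python
import pandas as pd

dates = pd.DataFrame({"Date": ["2008-12-01", "2009-01-15", "2017-06-25"]})

# Before conversion the column holds plain text
print(dates["Date"].dtype)

# Convert the text to real datetime values
dates["Date"] = pd.to_datetime(dates["Date"])
print(dates["Date"].dtype)

# The .dt accessor exposes the datetime components
dates["Month"] = dates["Date"].dt.month
dates["Year"] = dates["Date"].dt.year
print(dates)
```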

**Exercise**

Can you view 10 random samples `["Date", "Month", "Year"]` from the `weather_data` dataframe to see if we have performed the operations correctly?

In [None]:
# [TODO]


**Exercise**

Can you tell me which `Month` has the highest `Rainfall` in `Sydney`?

**Hint**: You need to filter for `Location == Sydney` first.

In [None]:
# [TODO]


**Exercise**

How many `Date`s are there for each `Year` for `Sydney`?

In [None]:
# [TODO]


We can see that there are some gaps in the data. For instance, `2008`, `2011`, `2012`, and `2013` don't contain a full year's worth of data. `2017` is an exception because we already know the data was only collected until `June 2017`.

**Exercise**

Can you tell me which `Year` has the highest amount of `Rainfall` in `Adelaide`?

In [None]:
# [TODO]


## Sorting

Sorting is a very common operation in data/programming. We can sort a `Series` or a `DataFrame` by using the `sort_values()` method.

How sorting works together with `groupby()` is as follows:

![image](../../images/pd_groupby_sort.png)

Let's revisit the `total_rainfall_by_location` dataframe we created earlier. Let's try to sort that dataframe by `Rainfall` in ascending order.

```python
total_rainfall_by_location = weather_data.groupby(["Location"])["Rainfall"].sum().reset_index()
total_rainfall_by_location = total_rainfall_by_location.sort_values(by="Rainfall", ascending=True)
```
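`sort_values()` can also sort by several columns at once, with a separate direction for each. The sketch below uses made-up data (`toy` is hypothetical). Note that sorting keeps the original row labels, so `reset_index(drop=True)` is one way to renumber the rows afterwards.

```python
import pandas as pd

toy = pd.DataFrame({
    "Location": ["Sydney", "Perth", "Darwin", "Perth"],
    "Rainfall": [3.0, 2.0, 4.0, 5.0],
})

# Sort by Location (A to Z), then by Rainfall (largest first) within a location
sorted_toy = toy.sort_values(by=["Location", "Rainfall"], ascending=[True, False])
print(sorted_toy)

# The original row labels travel with the rows; renumber them if needed
tidy = sorted_toy.reset_index(drop=True)
print(tidy)
```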

**Exercise**

Can you sort the `total_rainfall_by_location` dataframe by `Rainfall` in descending order?

In [None]:
# [TODO]


# Missing Values

As we have seen, there are lots of `NaN` values in the data. We have also learnt to use the `isnull()` and `notnull()` methods to check for missing values. 

**Dealing with missing values** is something that data professionals do all the time. There are many ways to deal with missing values.
- Replace the missing values with some value (e.g. `mean`, `median` of the numerical columns or "Unknown" for categorical columns).
- Drop the rows with the missing values.

We'll learn about 2 very useful methods for dealing with missing values here: `fillna()` and `dropna()`.
- `fillna()`, as the name suggests, will fill the missing values with a specified value.
- `dropna()`, as the name suggests, will drop the rows with missing values.

Let's see what happens if we drop all rows with at least 1 missing value.

```python
weather_data_droppedna = weather_data.dropna()

print(f"Original shape of data: {weather_data.shape}")
print(f"New shape of data: {weather_data_droppedna.shape}")
```

The data shrinks significantly (it becomes about 3 times smaller) because we have dropped all rows with at least 1 missing value. This is generally **NOT A GOOD IDEA** because we lose too much valuable information. Normally, we only drop rows with missing values if we are very sure that many other columns in those rows are missing as well.
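`dropna()` has parameters for dropping rows more selectively. The sketch below, on made-up data, compares the default behaviour with `subset` (only look at certain columns) and `thresh` (keep rows with at least that many non-missing values). The dataframe `toy` and the threshold of `2` are hypothetical choices.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "MinTemp": [13.4, np.nan, 12.9, np.nan],
    "Rainfall": [0.6, 0.0, np.nan, np.nan],
    "Evaporation": [np.nan, 4.2, 3.1, np.nan],
})

# Default: drop every row that has at least one missing value
drop_any = toy.dropna()

# Only drop rows where MinTemp is missing
drop_subset = toy.dropna(subset=["MinTemp"])

# Keep rows that have at least 2 non-missing values
drop_sparse = toy.dropna(thresh=2)

print(len(toy), len(drop_any), len(drop_subset), len(drop_sparse))
```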

**Exercise**

Can you tell me which column(s) have the most missing values?

In [None]:
# [TODO]


In [None]:
# [TODO]


**Exercise**

Let's choose column `Pressure9am` as an example and fill in the missing values using the `fillna()` method.

In order to decide which value to fill in, let's take a look at the summary statistics of that column.

Can you do that?

In [None]:
# [TODO]


We will choose to fill the missing values of `Pressure9am` with the `mean` of that column. Normally, I prefer to use the `median` since the `mean` is more easily skewed by outliers. Nonetheless, in this case, the `mean` and `median` are very close to each other, so we can use the `mean` here.

We also choose to perform the operations `inplace` which means that we will modify the original dataframe.

```python
mean_Pressure9am = weather_data["Pressure9am"].mean()

weather_data.fillna({
    "Pressure9am": mean_Pressure9am,
}, inplace=True)
```

We can verify that the missing values have been filled in by running the following code:

```python
weather_data["Pressure9am"].isna().sum()
```

# Combine Dataframes

We very often retrieve data from multiple sources. For each source, we have data stored in a separate dataframe. Thus, it is crucial for us to know how to combine the dataframes.

`pandas` has 3 core tools to perform this task: `concat()`, `merge()` and `join()`.
- `concat()` is used to stitch multiple dataframes together along an axis.
- `merge()` is used to combine two dataframes based on one or more common columns (or on their indices).
- `join()` is used to combine two dataframes based on their indices.

Most of what `join()` can do is the same as `merge()`, so I'll only demonstrate the `merge()` method here.

The `join()` method combines two dataframes on the **basis of their indices** whereas the `merge()` method allows us to specify columns in addition to the indices to join on.
- You can read more about the `join()` method [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html).
- You can read about the difference between `join()` and `merge()` [here](https://www.geeksforgeeks.org/what-is-the-difference-between-join-and-merge-in-pandas/).
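To illustrate the difference, the sketch below performs the same inner join twice on made-up data: once with `merge()` on a column, and once with `join()` after moving that column into the index. The dataframes `left` and `right` and the column name `key` are hypothetical.

```python
import pandas as pd

left = pd.DataFrame({"key": [1, 2, 3], "left_value": [10, 20, 30]})
right = pd.DataFrame({"key": [2, 3, 4], "right_value": [-20, -30, -40]})

# merge(): match rows on a regular column
merged = left.merge(right, on="key", how="inner")

# join(): match rows on the index, so move "key" into the index first
joined = left.set_index("key").join(right.set_index("key"), how="inner")

print(merged)
print(joined)
```

Both results contain the same matched rows; the only difference is that `key` is a column in `merged` and the index in `joined`.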

There are various forms of join:
1. `inner`: Keep only elements that exist in both dataframes.
1. `outer`: Keep all elements from all dataframes.
1. `left`: Keep all elements from the left dataframe.
1. `right`: Keep all elements from the right dataframe.

Visual representations of various join types:

|Type of Join|Image|
|:-:|:-:|
|Inner|![image](../../images/inner_join.png)|
|Outer|![image](../../images/outer_join.png)|
|Left|![image](../../images/left_join.png)|
|Right|![image](../../images/right_join.png)|

Since we're familiar with the different types of join now, we will then see how to combine dataframes with `pandas`.

First, we'll create two dataframes:

```python
left_df = pd.DataFrame({
    "common_col": [1, 2, 3, 4, 5, 6, 7, 8],
    "left_value": [10, 20, 30, 40, 50, 60, 70, 80],
})

right_df = pd.DataFrame({
    "common_col": [3, 4, 5, 6, 9, 10, 11],
    "right_value": [-30, -40, -50, -60, -90, -100, -110],
})
```

Let's take a look at the `left_df` and `right_df` dataframes.

`inner` join `left_df` and `right_df` on the `common_col` column.

```python
inner_join_df = left_df.merge(right_df, on="common_col", how="inner")
```

As expected, the `inner_join_df` dataframe contains **only common values that exist in both** `left_df` and `right_df` dataframes. 

`outer` join `left_df` and `right_df`.

```python
outer_join_df = left_df.merge(right_df, on="common_col", how="outer")
```

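As expected, the `outer_join_df` dataframe contains **all values that exist in either** the `left_df` or the `right_df` dataframe. Where a value exists in only one of them, the column coming from the other dataframe is filled with `NaN`.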

`left` join `left_df` and `right_df`.

```python
left_join_df = left_df.merge(right_df, on="common_col", how="left")
```

As expected, the `left_join_df` contains all rows from `left_df`. Rows whose `common_col` value also exists in `right_df` get the matching `right_value`, and the remaining rows get `NaN`.

`right` join `left_df` and `right_df`.

```python
right_join_df = left_df.merge(right_df, on="common_col", how="right")
```

Similarly, the `right_join_df` contains all rows from `right_df`. Rows whose `common_col` value also exists in `left_df` get the matching `left_value`, and the remaining rows get `NaN`.
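A handy way to check where each row of a join came from is the `indicator` parameter of `merge()`, which adds a `_merge` column with the values `left_only`, `right_only`, or `both`. The sketch below recreates the two dataframes from above so that it runs on its own; `outer_with_source` and `counts` are illustrative names.

```python
import pandas as pd

left_df = pd.DataFrame({
    "common_col": [1, 2, 3, 4, 5, 6, 7, 8],
    "left_value": [10, 20, 30, 40, 50, 60, 70, 80],
})

right_df = pd.DataFrame({
    "common_col": [3, 4, 5, 6, 9, 10, 11],
    "right_value": [-30, -40, -50, -60, -90, -100, -110],
})

# indicator=True adds a "_merge" column describing the origin of each row
outer_with_source = left_df.merge(
    right_df, on="common_col", how="outer", indicator=True
)

counts = outer_with_source["_merge"].value_counts()
print(outer_with_source)
print(counts)
```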

**Concatenation** is a bit different from the merging techniques that you saw above. With concatenation, we are stitching dataframes together along an axis - either horizontally or vertically.

Visually, this is what concatenation looks like.

|Type of concat|Axis|Image|
|:-:|:-:|:-:|
|`vertical`|`0` (`row axis`)|![image](../../images/concat_vertical_axis0.png)|
|`horizontal`|`1` (`column axis`)|![image](../../images/concat_horizontal_axis1.png)|

Let's see `concat()` in action!
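Here is a minimal sketch of both directions on made-up data. The dataframes `top`, `bottom`, and `side` are hypothetical; `ignore_index=True` is used in the vertical case so that the row labels are renumbered instead of repeating.

```python
import pandas as pd

top = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
bottom = pd.DataFrame({"A": [5, 6], "B": [7, 8]})
side = pd.DataFrame({"C": [9, 10]})

# Vertical concatenation (axis=0): stack rows on top of each other
vertical = pd.concat([top, bottom], axis=0, ignore_index=True)

# Horizontal concatenation (axis=1): place columns side by side,
# aligning rows on their index
horizontal = pd.concat([top, side], axis=1)

print(vertical)
print(horizontal)
```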