# pandas

About pandas:

- It is one of the most commonly used Python packages/libraries/modules for data science.
- It is Python's answer for making two-dimensional tables (à la Excel and SQL).
- It calls a table a "DataFrame".
- Pandas DataFrames are used by Python's other packages for statistical analysis, data manipulation, and data visualization.
- Pandas DataFrames can be exported as .csv and other files.

Some of the pandas syntax will differ from basic Python. I still have to look a lot of things up in pandas, if it's something I don't do very often. However, it is the tool for working with spreadsheets in Python, so you'll need to learn it at some point.<br><br>

#### <br>Why do we work with Jupyter Notebooks for data science?

Jupyter Notebooks allow us to view nicely formatted output (such as pandas DataFrames and data visualizations) directly below the code used to create the object. They also allow you to scroll through large DataFrames or images.

#### <br>NumPy arrays
This notebook is going to focus on the Python package pandas. However, pandas (like many other Python packages) is built on NumPy arrays. NumPy is another Python module, and NumPy arrays are multi-dimensional datasets typically made up entirely of numerical data. They allow for much faster calculations than basic Python objects such as lists. If you work with large numerical datasets, you will also want to look into the NumPy package. NumPy arrays do not have the features that many of us want to work with, such as column headers and the ability to easily work with non-numerical data; that's why pandas is so popular.
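As a small illustration of that difference, here is a sketch using made-up numbers (the `temps` array and the `toy_df` table are invented for this example and are not part of the workshop files). The NumPy array holds bare numbers, while the DataFrame adds column headers and mixes text with numbers:

```python
import numpy as np
import pandas as pd

# A NumPy array: numbers only, no column headers
temps = np.array([[18.0, 21.5], [25.1, 30.2]])
print(temps.mean())          # mean of all four numbers

# A pandas DataFrame: named columns, with text and numbers side by side
toy_df = pd.DataFrame({"month": ["aug", "sep"], "temp": [18.0, 25.1]})
print(toy_df)
print(toy_df["temp"].mean())  # mean of one named column
```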

# <br><br><br>1.0 Part One: pandas and the DataFrame object

#### First, where are the files we are working with today?

**If you are working on the cloud (for example Google Colab):** You will need to first run the line of code below. It will pull the files into your workspace. 

In [None]:
!wget https://raw.githubusercontent.com/nuitrcs/python_workshops_datarepo/refs/heads/main/forestfires.csv
!wget https://raw.githubusercontent.com/nuitrcs/python_workshops_datarepo/refs/heads/main/pigeonRacing.txt
!wget https://raw.githubusercontent.com/nuitrcs/python_workshops_datarepo/refs/heads/main/zoo.xlsx

The files should now appear in your filetree in the same directory where this notebook is located. Look for `forestfires.csv`, `pigeonRacing.txt`, and `zoo.xlsx`.
<br><br>**If you are working locally on your own computer:** The files are in the same repo and folder where this notebook is located. You should see them in your filetree. Look for `forestfires.csv`, `pigeonRacing.txt`, and `zoo.xlsx`.

## <br><br><br>1.0 Importing pandas

Because pandas is one of the most commonly used Python packages, it often gets imported as a shortened version of its actual name. This makes it quicker to type.

In [None]:
import pandas as pd

If you encounter an error, it may be because the Pandas package has not been installed yet. You can install by uncommenting and running the following code:

In [None]:
# uncomment the following line to install pandas
# !pip install pandas

Pandas comes with the Anaconda distribution of Python and is available on Google Colab.

We also want to check that our pandas version is one of the newer versions so that we are all working with the same commands.

In [None]:
pd.__version__

**If your version starts with a 1 instead of a 2 for that very first number,** you can uncomment the following cell, then run the cell to upgrade pandas to a newer version.

In [None]:
# uncomment the following line to upgrade pandas
# !pip install pandas --upgrade

## <br><br><br>1.1 Loading a csv file

We will use the function `pd.read_csv()`. As a reminder, when we use a function from an imported module, we first give the module's name, followed by a dot, followed by the function name.
<br><br>This will automatically create a **DataFrame** object, which we are saving as `df`. `df` is a common variable name for a DataFrame. You can open the file, define it as a Pandas DataFrame, assign it to a variable, and close the file in one line. (Already we're seeing the differences from basic Python).

In [None]:
df = pd.read_csv("forestfires.csv")

This is a dataset from forest fires in NE Portugal. I have included the dataset as a csv file in today's materials, but the data is available publicly at this site: https://archive.ics.uci.edu/ml/datasets/Forest+Fires

## <br><br><br>1.2 Viewing the DataFrame

In [None]:
df

<br>Take a minute to look at the data. The DataFrame will have a slightly different look on Colab and Jupyter, and on different versions of Jupyter.
<br><br>The number at the beginning of each row is called an **index**. The index was automatically assigned by pandas when the dataset was loaded. It was not in the original csv file. It is merely a series of consecutive numbers going down the rows. The rows were loaded in whatever order they were in the csv file.

If you are working in Google Colab, there is a new feature that lets you magically convert your DataFrame into an interactive table. We're NOT going to use that feature, though you can feel free to explore it on your own time. 

Let's discuss the columns together so we understand the dataset a bit more.

<br><br>There are ways to view pieces of the DataFrame. Try these to see what they do:

In [None]:
df.head()

In [None]:
df.head(10)

In [None]:
df.tail()

In [None]:
df.tail(2)

In [None]:
df.sample()

In [None]:
df.sample(6)

## <br><br><br>1.3 Loading other types of files

We can open a tab-separated file using the same function we used to open a csv. We just have to pass a second argument, a **keyword argument**, to tell it that the delimiter is a tab instead of the default (a comma). This dataset contains rankings of professional racing pigeons.

In [None]:
pigeon_df = pd.read_csv("pigeonRacing.txt", delimiter="\t")

In [None]:
pigeon_df.head()
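To see what the delimiter argument changes, here is a minimal sketch that reads the same tab-separated text twice. The text in `tsv_text` is made up for this example, and `io.StringIO` is only used so the example does not need a file on disk:

```python
import io
import pandas as pd

# Made-up tab-separated text standing in for a file on disk
tsv_text = "name\tspeed\nAda\t72.5\nBo\t68.1\n"

# Without the delimiter argument, pandas looks for commas and finds none,
# so each line is read as a single column
wrong = pd.read_csv(io.StringIO(tsv_text))
print(wrong.shape)

# With delimiter="\t", the two columns are split correctly
right = pd.read_csv(io.StringIO(tsv_text), delimiter="\t")
print(right.shape)
print(list(right.columns))
```

If a freshly loaded DataFrame has only one column, a wrong delimiter is a likely cause.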

<br><br>We will use a different function to open an Excel file. This file has information about animals and has two sheets within the Excel file. We will first load sheet 1 and then sheet 2. We have to pass the `read_excel()` function one extra argument to specify the sheet:

In [None]:
zoo_df = pd.read_excel("zoo.xlsx", sheet_name=0)

If you run into an error, you may not have the `openpyxl` package installed. It is an optional dependency that `pandas` needs in order to read `.xlsx` files. Uncomment and run the following code to install the package, then run the cell above again.

In [None]:
# uncomment the following line to install openpyxl
# !pip install openpyxl

In [None]:
zoo_df.head()

In [None]:
zoo_class_df = pd.read_excel("zoo.xlsx", sheet_name=1)

In [None]:
zoo_class_df.head()

## <br><br><br>1.4 Getting basic info about the DataFrame

You can use the `len()` function to find out how many rows are in a DataFrame object:

In [None]:
len(df)

<br>The `describe()` method will give you some very basic stats about each column in your DataFrame:

In [None]:
df.describe()

<br>The `shape` attribute will return the number of rows and columns as a tuple. An attribute gives us some stored data about an object - it is not a method function, so it does not get parentheses.

In [None]:
df.shape

You can even save the shape tuple as an object, in case you need to include it in any code:

In [None]:
df_shape = df.shape

In [None]:
print("Our DataFrame has " + str(df_shape[0]) + " rows and " + str(df_shape[1]) + " columns.")

<br>The `size` attribute will tell you the total number of elements in the DataFrame (size = rows x columns):

In [None]:
df.size
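Here is a short sketch of how `len()`, `shape`, and `size` relate to each other, using a small made-up table (`toy_df` is invented for this example):

```python
import pandas as pd

toy_df = pd.DataFrame({"month": ["aug", "sep", "oct"], "temp": [18.0, 25.1, 14.3]})

n_rows, n_cols = toy_df.shape           # unpack the (rows, columns) tuple
print(n_rows, n_cols)
print(len(toy_df) == n_rows)            # len() counts rows
print(toy_df.size == n_rows * n_cols)   # size is rows x columns
```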

<br>To return a list of the column names, you can start with the `columns` attribute:

In [None]:
df.columns

Hmm. That looks strange because it is a pandas object. You can make it into a list so that it is easier to work with:

In [None]:
column_names = list(df.columns)
print(column_names)

<br>To find out the data types of the data found in each column, use the `dtypes` attribute:

In [None]:
df.dtypes

*Notice how strings are represented in the data types.*

<br>To **transpose** a DataFrame (swap the rows and columns), you also use an attribute. Let's transpose `zoo_df`:

In [None]:
zoo_df.T

<br>Let's see if that changed our DataFrame object:

In [None]:
zoo_df

In [None]:
zoo_df_t = zoo_df.T
zoo_df_t

### <br><br>Exercise 1.4

Write code to create a list of column names from `zoo_df`:

Write code to return the data type for each column in `zoo_df`:

Write code to return two random rows from `zoo_df`.

# <br><br><br>2.0 Part Two: manipulating and saving dataframes

## <br>2.1 Renaming columns

Here's what our column names look like in the forest fire dataset:

In [None]:
df.head()

Four of the columns end in "\_code". Let's remove that part from the column names. We can use the `rename()` method. We need to pass the function a **dictionary**, with the `key: value` pair being: `old_column_name: new_column_name`

In [None]:
df.rename(columns = {"moisture_code": "moisture", 
                     "fuel_code": "fuel"})

In [None]:
df.head()

Uh-oh, the change didn't stick. We've encountered this before with strings, so we know the answer - reassign it to a variable.

In [None]:
df = df.rename(columns = {"moisture_code": "moisture", 
                          "fuel_code": "fuel"})

In [None]:
df.head()
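The same behavior can be seen on a tiny made-up table (`toy_df` and `renamed` are names invented for this sketch). `rename()` hands back a new DataFrame and leaves the original alone, which is why the result has to be assigned to a variable:

```python
import pandas as pd

toy_df = pd.DataFrame({"fuel_code": [86.2, 90.6], "temp": [8.2, 18.0]})

renamed = toy_df.rename(columns={"fuel_code": "fuel"})

print(list(toy_df.columns))    # the original still has the old name
print(list(renamed.columns))   # the returned DataFrame has the new name
```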

### <br><br>Exercise 2.1

Write code to remove "\_code" from the ends of the drought and initial_spread column names:

In [None]:
df.head()

## <br><br><br>2.2 Dropping rows and columns

Let's drop a single row from the DataFrame. How about row 2? You still have to assign the result back to `df` to make the change permanent:

In [None]:
df = df.drop(2)

In [None]:
df.head()

<br>The index numbers did not reset when we dropped a row. 2 is missing!

We can reset the index and pretend like 2 was never there. The `reset_index()` method takes a keyword argument, `drop`. If we don't pass `drop=True`, an extra column will get added to our DataFrame containing the old index numbers. Let's first reset the index without passing the argument, but we won't save that DataFrame:

In [None]:
df.reset_index()

You can see that new column `index` contains the original index positions. Now let's save a new version of our DataFrame, with the indexes reset, but without that new column:

In [None]:
df = df.reset_index(drop=True)

In [None]:
df.head()
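Here are the same three steps on a four-row made-up table, as a compact sketch (all of the variable names below are invented for this example):

```python
import pandas as pd

toy_df = pd.DataFrame({"day": ["mon", "tue", "wed", "thu"]})

dropped = toy_df.drop(2)                  # remove the row labelled 2
print(list(dropped.index))                # the label 2 is now missing

kept_old = dropped.reset_index()          # old labels are saved in a new "index" column
print(list(kept_old.columns))

clean = dropped.reset_index(drop=True)    # old labels are thrown away
print(list(clean.index))
print(list(clean.columns))
```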

<br><br><br>The `drop()` function defaults to dropping rows. If we want to drop a column, we need to add one more argument. `axis=1` is used in pandas to refer to columns as opposed to rows (`axis=0`). The `axis` argument is used elsewhere in pandas, too. Let's drop the "X" column:

In [None]:
df = df.drop("X", axis=1)

In [None]:
df.head()

### <br><br>Exercise 2.2

Write code to view the last 5 rows of the DataFrame:

Now write code to drop the very last row:

In [None]:
df.tail()

Write code to remove the "Y" column:

In [None]:
df.head()

## <br><br><br>2.3 Sorting a DataFrame

There are two methods for sorting your DataFrame.

If you want to sort by the index numbers, or if you want to sort by the column names (alphabetically), you use `sort_index`. It can take two arguments: the axis to sort by (row or column) and the order (ascending or not):

The default arguments are to sort by row index with 0 at the top, which is how we've already been viewing the data:

In [None]:
df.sort_index()

Let's try more arguments:

In [None]:
df.sort_index(ascending=False)

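Next, let's sort by the column names instead. Notice in the output that the ordering is case sensitive: uppercase names come before lowercase ones.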

In [None]:
df.sort_index(axis=1)

In [None]:
df.sort_index(axis=1, ascending=False)

<br><br><br>The second sort function, `sort_values()`, will sort the frame by the data in a column:

In [None]:
df.sort_values("area_burned")

In [None]:
df.sort_values("day")

### <br><br>Exercise 2.3

Write code to sort the DataFrame by the rain column, with the largest values at the top:

You can also sort on multiple values by passing the `sort_values` function a list of column names instead of a single name. If we want to first sort by day, then by area burned:

In [None]:
df.sort_values(["day", "area_burned"])

Write code to sort the DataFrame by month, day, and by temperature. As a bonus, use the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html) to figure out if you can sort by month and day in increasing alphabetical order, and decreasing temperature.

## <br><br><br>2.4 Saving your changed DataFrame

We've made a lot of changes to the forest fire dataset. Let's save it as a new csv file. First, we can decide what we're going to call the new file:

In [None]:
new_filename = "fire_changed.csv"

Next, we can use the `to_csv()` method function to save the new file:

In [None]:
df.to_csv(new_filename)
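One detail worth knowing: by default, `to_csv()` also writes the index as an unnamed first column, and passing `index=False` leaves it out. The sketch below uses a made-up table and calls `to_csv()` without a filename, which returns the csv text as a string instead of writing a file, so the two versions can be compared directly:

```python
import pandas as pd

toy_df = pd.DataFrame({"month": ["aug", "sep"], "rain": [0.0, 0.2]})

with_index = toy_df.to_csv()                # no filename: returns the csv text
without_index = toy_df.to_csv(index=False)  # same data, index column left out

print(with_index)
print(without_index)
```

Whether to keep the index depends on the data. If the index holds meaningful labels, you will probably want it in the file; if it is only the automatic row numbering, leaving it out keeps the file cleaner.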

### <br><br>Exercise 2.4

Create a new version of the dataframe called `df_rain` that holds the `df` sorted by the rain column with the smallest rain flow at the top. Also, make the DataFrame only contain month, day, and rain flow columns.

Run the next line of code to save a new filename for the rain dataframe:

In [None]:
rain_filename = "forestFiresRain.csv"

Write code to save the df_rain dataframe as a csv file:

## <br><br>2.5 Setting index labels

The bold numbers on the far left of each row were assigned to each row when the csv file was originally loaded into pandas.

In [None]:
df.head()

<br>You can, however, set one of your columns as the index labels:

In [None]:
new_df = df.set_index("ID")

In [None]:
new_df.head()

## <br><br><br>2.6 Cleaning data while loading

Several of the data cleaning tasks we've learned can be done when loading the file in as a pandas DataFrame.

#### Telling pandas to make a particular column the index with the `index_col` argument:

In [None]:
df = pd.read_csv("forestfires.csv", index_col="ID")
df.head()

#### Telling pandas to only load some columns with the `usecols` argument:

In [None]:
df = pd.read_csv("forestfires.csv", usecols=["month", "day", "temp", "rain", "area_burned"])
df.head()

#### Telling pandas the column names with the `names` argument:

In [None]:
df = pd.read_csv("forestfires.csv", names=["ID", "X", "Y", "Month", "Day", "Fuel Code", "Moisture Code", "Drought Code", "Initial Spread Code", "Temperature", "Humidity", "Wind", "Rain", "Area Burned (hectares)"])
df.head()

If you use `names`, it will treat your original header row as a row of data. To avoid that, you have to pass an additional argument - `header=0`.

In [None]:
df = pd.read_csv("forestfires.csv", header=0, names=["ID", "X", "Y", "Month", "Day", "Fuel Code", "Moisture Code", "Drought Code", "Initial Spread Code", "Temperature", "Humidity", "Wind", "Rain", "Area Burned (hectares)"])
df.head()
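The effect of `header=0` is easier to see on a two-row example. This sketch uses made-up csv text (held in `csv_text` and read through `io.StringIO` so no file is needed) and counts the rows each way:

```python
import io
import pandas as pd

csv_text = "X,temp\n7,8.2\n8,18.0\n"   # made-up data with a header row

# names alone: the original header row is kept as the first row of data
no_header_arg = pd.read_csv(io.StringIO(csv_text), names=["x_coord", "temperature"])
print(len(no_header_arg))
print(no_header_arg.iloc[0, 0])

# names plus header=0: the original header row is replaced by the new names
with_header_arg = pd.read_csv(io.StringIO(csv_text), header=0,
                              names=["x_coord", "temperature"])
print(len(with_header_arg))
```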

<br>And many more tricks [in this documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).

# <br><br><br>3.0 Part Three: selecting data

There are multiple ways to select data in pandas. You will need to learn all of the ways because you will see these techniques being used in other people's code and in answers to your pandas questions when you search online.
<br><br>We will cover:
- Selecting columns using DataFrame indexing
- Selecting rows based on a boolean condition using DataFrame indexing
- Selecting columns, rows, and individual data points with `loc` and `iloc`
- Selecting individual data points with `at` and `iat`
*You cannot select individual data points using indexing.*

We will work with the forest fires data again. Let's reload it with some data cleaning so that we're all working in the same version of the data frame. In this code, we will:
- rename the columns to remove the "_code" from the end of the column names
- remove the `X` and `Y` columns, since we won't be working with those
- set the index as the `ID` column, which contains unique identifiers

In [None]:
df = pd.read_csv("forestfires.csv",
                 names=["ID", "X", "Y", "month", "day", "fuel", "moisture", "drought", "initial_spread", "temp", "humidity", "wind", "rain", "area_burned"], 
                 header=0, 
                 index_col="ID", 
                 usecols=["ID", "month", "day", "fuel", "moisture", "drought", "initial_spread", "temp", "humidity", "wind", "rain", "area_burned"])
df.head()

## <br><br>3.1 Selecting columns using indexing

To create a DataFrame with only some columns, you use indexing, and you pass it a list of the columns that you want to include:

In [None]:
my_columns = ["month", "day", "area_burned"]
df[my_columns]

<br>OR you could just include the list inside the indexing. This creates two sets of square brackets, which looks a little silly, but it works!

In [None]:
df[["month", "day", "area_burned"]]

<br>If you want to return just one column as a DataFrame, you still use the list inside the index:

In [None]:
df[["month"]]

### <br><br>Exercise 3.1

Here's a reminder of what the DataFrame looks like:

In [None]:
df.head()

Write code to return the humidity, wind, and rain columns:

Write code to return the temp column:

## <br><br><br>3.2 pandas Series object
If you only index the column name, without putting it in a list, you get a different type of pandas object - the **Series** object.

In [None]:
df["month"]

<br>A Series object only returns the values from one column. It can be turned into a list, which is very convenient:

In [None]:
month_list = list(df["month"])
print(len(month_list))
print(type(month_list))
print(month_list[0:5])

<br>**A Series object is a one-dimensional object, while a DataFrame is a two-dimensional object. A Series can be turned into a list, while a DataFrame can be indexed based on row number, so they both have their uses.**
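To make the difference concrete, here is a small sketch on a made-up table (`toy_df` is invented for this example) that selects the same column both ways and checks what comes back:

```python
import pandas as pd

toy_df = pd.DataFrame({"month": ["aug", "sep"], "temp": [18.0, 25.1]})

as_series = toy_df["month"]      # single brackets: a Series
as_frame = toy_df[["month"]]     # double brackets: a DataFrame with one column

print(type(as_series).__name__, as_series.shape)   # one dimension
print(type(as_frame).__name__, as_frame.shape)     # two dimensions
```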

### <br><br>Exercise 3.2

Write code to return a list of data in the day column:

In [None]:
day_data = 

In [None]:
print(len(day_data))
print(type(day_data))

## <br><br><br>3.3 Selecting rows using indexing

If we want to return a DataFrame with only some **rows**, we can index a range. DataFrame indexing uses regular Python indexing, so we ask for the first item we want, and then a colon, and then we go one position past the last item we want. 

In [None]:
df[0:10]

<br>Because this indexing is referencing the **position of the row in the DataFrame**, not the **index number**, we can use negative indexing in either spot to count from the bottom of the DataFrame.

In [None]:
df[495:-12]

<br>If you only want a single row, you still need to use indexing with a `:`:

In [None]:
df[4:5]

### <br><br>Exercise 3.3

Write code to return row number 14137.

Write code to return the last row in our DataFrame.

In [None]:
df

## <br><br><br>3.4 Selecting data with a boolean

To return a DataFrame that only has rows that meet a certain condition, we use this syntax. The outer `df[]` lets Python know that you want the answer to be returned as a DataFrame, meaning you can return all the columns included in the output. Inside the indexing, we include our boolean statement, which usually means we need to index a particular column in the dataset to filter the data on.

In [None]:
df[df["month"] == "oct"]

<br>If you don't use the outer `df[]` the return is a Series object that returns the boolean value for each row based on the condition you set:

In [None]:
df["month"] == "oct"

In [None]:
df[df["temp"] > 30]
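It can help to split the filter into two steps. In this sketch on a made-up table, the boolean Series is saved under the name `mask` (a name chosen just for this example) and then used inside the outer brackets:

```python
import pandas as pd

toy_df = pd.DataFrame({"month": ["aug", "oct", "oct"], "temp": [31.0, 14.3, 16.8]})

mask = toy_df["month"] == "oct"   # Series of True/False, one value per row
print(list(mask))

october = toy_df[mask]            # keeps only the rows where mask is True
print(october)

print(mask.sum())                 # True counts as 1, so this counts the matching rows
```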

### <br><br>Exercise 3.4

Write code to return a DataFrame that only includes fires on Mondays:

Write code to return a DataFrame that only includes fires on days when rain was greater than 2:

## <br><br><br>3.5 Combining boolean indexing with column indexing

You can also combine a boolean with column indexing to return only some columns for your filtered data. Here I am returning only the wind, rain, and humidity columns for any rows with "aug" in the month column.

In [None]:
df[df["month"] == "aug"][["wind", "rain", "humidity"]]

### <br><br>Exercise 3.5

Write code to return all the fires with area_burned greater than 0. Return only columns for month and day.

<br><br><br>**Using the indexing method, we cannot refer to individual rows by index name or easily pull up individual cells in our DataFrame.**

In [None]:
df[14147]

## <br><br><br>3.6 pandas `loc`

The [`loc`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html) **attribute** allows us to call up certain rows and columns **based on their index names (or labels) or a boolean array**. The syntax is:

#### `df.loc[row, column]`

#### `df.loc[list of rows, list of columns]`

#### `df.loc[range of rows, range of columns]`

`loc` can take a row, a list of rows, or a range of rows, followed by a comma, and then a column, list of columns, or range of columns. <br><br>If you want all the rows or all the columns, you can use a `:`. <br><br>**The rows that we refer to here are the row names (index names) that are found in bold on the far left of our DataFrame.**

<br>To reference one cell:

In [None]:
df.loc[14171, "month"]

<br>All rows for one column:

In [None]:
df.loc[:, "month"]

<br>All columns for one row:

In [None]:
df.loc[14375, :]

### <br><br>Exercise 3.6

Write code to return all columns in row 14647:

What was the temperature on the day of fire 14647? Write code to return the data in the column `temp` for that row:

## <br><br>3.7 `loc` with a range and a list

We might expect this code to return all columns for the first 10 rows.

In [None]:
df.loc[0:10, :]

<br>**Oops! Unlike Python indexing, `loc` is referencing the rows by their index names, so we need to know those index names to use `loc`.**

In [None]:
first_row = 14140
df.loc[first_row:first_row + 10, :]

**Uh-oh, it gave us 11 rows! Again, `loc` works with index labels rather than positions, and a `loc` range includes both endpoints, so we don't ask for one past what we want.**

In [None]:
first_row = 14140
df.loc[first_row:first_row + 9, :]

<br><br>We can also ask for a range of columns, from left to right:

In [None]:
df.loc[first_row:first_row + 9, "temp":"rain"]

<br><br>Again, `loc` uses the column and row names, not their positions, so this will not work:

In [None]:
df.loc[0:10, 0:4]

<br><br>We can also pass a list of rows or columns:

In [None]:
df.loc[[14589, 14420, 14632], ["area_burned", "month", "day"]]

*Notice how the returned DataFrame used the same order given in the lists.*

### <br><br>Exercise 3.7

The last row in our DataFrame is indexed as `14651`.

Write code to return the last 5 rows in the DataFrame, and return only the columns "rain", "temp", "humidity", and "area_burned":

## <br><br>3.8 `loc` with a boolean

You can use a boolean to filter rows. The boolean is written the same way as we would write it without using `loc`:

In [None]:
df.loc[df["month"] == "apr", :]

<br><br>Here I use the same filter for the rows, but I only ask for three columns to be returned:

In [None]:
df.loc[df["month"] == "apr", ["fuel", "moisture", "drought"]]

### <br><br>Exercise 3.8

Write code to return all fires started on Sundays. Only return the columns "rain", "temp", and "drought".

## <br><br><br>3.9 pandas `iloc`

**While `loc` searches by row and column names, [`iloc`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html) searches only by the indexed positions in the DataFrame.**

Here, I'm asking for the top 10 rows and the first four columns:

In [None]:
df.iloc[0:10, 0:4]

<br>**Notice that `iloc` uses Python indexing!** When we ask for rows 0:10, it returns rows 0 to 9. Also notice that the index (the bold number on the left side of each row) does not count as a true column.

<br>Because `iloc` uses Python indexing, we can use negative numbers:

In [None]:
df.iloc[-10:, 4:-2]
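Here is a side-by-side sketch of the two slicing rules on a made-up table whose index labels (501 to 504) are invented for this example. A `loc` range goes by label and includes both endpoints, while an `iloc` range goes by position and stops one short, like regular Python indexing:

```python
import pandas as pd

toy_df = pd.DataFrame(
    {"month": ["aug", "sep", "oct", "nov"], "temp": [18.0, 25.1, 14.3, 9.9]},
    index=[501, 502, 503, 504],   # made-up ID labels
)

by_label = toy_df.loc[501:503, :]    # labels 501, 502, and 503
by_position = toy_df.iloc[0:3, :]    # positions 0, 1, and 2

print(len(by_label), len(by_position))
print(by_label.equals(by_position))  # both selections hold the same three rows
```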

### <br><br>Exercise 3.9

Use iloc to return the column "area_burned" for the very last row in the DataFrame:

Use iloc to write code to return the columns "rain" and "humidity" for the first 20 rows in our DataFrame:

## <br><br><br>3.10 pandas `at` and `iat`

If you are looking for the contents of only a single cell (called a **scalar**) in the DataFrame, you can use `loc` or `iloc`:

In [None]:
df.loc[14651, "month"]

In [None]:
df.iloc[-1, 0]

<br>However, there is another pair of pandas attributes designed to look up only a single cell. `at` will look up a single cell by row name and column name (like `loc`), and `iat` will look up a single cell by index position (like `iloc`).

Why does pandas have a separate way to look up a single cell? Because `at` and `iat` are very fast. If you write code to look up 10,000 single points in a DataFrame, it would be much faster to use `at` or `iat` than `loc` or `iloc`.

In [None]:
df.at[14651, "month"]

In [None]:
df.iat[-1, 0]

In Jupyter notebooks, there are built-in **magic commands** that let us perform some special tasks. Magic commands are preceded by `%`. For example, in the following magic commands, I will time the speed of two lines of code by running each 100,000 times.

In [None]:
%timeit -n 100000 df.loc[14651, "month"]
%timeit -n 100000 df.at[14651, "month"]
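Magic commands only work inside Jupyter/IPython. If you ever need a similar comparison in a plain Python script, one option is the `timeit` module from the standard library. This is only a rough sketch on a made-up two-row table, and the timings it prints will vary from machine to machine:

```python
import timeit
import pandas as pd

toy_df = pd.DataFrame({"month": ["aug", "sep"]}, index=[501, 502])

# Each lambda does one single-cell lookup; timeit runs it `number` times
# and returns the total time in seconds
loc_seconds = timeit.timeit(lambda: toy_df.loc[502, "month"], number=1000)
at_seconds = timeit.timeit(lambda: toy_df.at[502, "month"], number=1000)

print(f"loc: {loc_seconds:.4f} s, at: {at_seconds:.4f} s")
```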

<br><br>Just to reiterate, `at` and `iat` cannot be used with multiple rows or columns:

In [None]:
df.at[0, ["month", "day"]]

### <br><br>Exercise 3.10

Use `at` to write code to find out what the temperature was on the day of fire 14650:

Now use `iat` to find the same answer:

## <br><br><br>3.11 Searching for multiple conditionals in pandas

Let's say we want to search through the DataFrame for all fires in August with an area burned greater than 15 hectares. For each of these fires, we want to return only the columns for temperature and rain.

The conditional for only fires in August is:
<br>`df["month"] == "aug"`
<br><br>The conditional for fires that burned more than 15 hectares is:
<br>`df["area_burned"] > 15`

We might try to use Python operators (`and`, `or`, `not`):

In [None]:
df.loc[df["month"] == "aug" and df["area_burned"] > 15, ["temp", "rain"]]

#### <br><br>However, pandas uses the operators `&`, `|`, and `~` for and, or, and not, respectively. Pandas also requires you to include each conditional inside parentheses.

In [None]:
df.loc[(df["month"] == "aug") & (df["area_burned"] > 15), ["temp", "rain"]]

You can also combine conditionals in boolean indexing.

In [None]:
df[(df["month"] == "aug") & (df["area_burned"] > 15)][["temp", "rain"]]
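Here is a short sketch of all three pandas operators on a made-up table. In this version each conditional is saved to its own variable first (`is_aug` and `is_large` are names invented for this example), which is one way to keep long filters readable and means no extra parentheses are needed when combining them:

```python
import pandas as pd

toy_df = pd.DataFrame({
    "month": ["aug", "aug", "sep", "oct"],
    "area_burned": [20.0, 3.0, 40.0, 0.0],
})

is_aug = toy_df["month"] == "aug"
is_large = toy_df["area_burned"] > 15

print(len(toy_df[is_aug & is_large]))   # rows where both conditions are true
print(len(toy_df[is_aug | is_large]))   # rows where at least one condition is true
print(len(toy_df[~is_aug]))             # rows where the condition is NOT true
```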

### <br><br>Exercise 3.11

Use `loc` to return fires in the DataFrame that happened in either June or July. For each row, return all columns. If you want to see how months are written in the month column, you can use `df["month"].unique()`.

Has there ever been a fire in February that burned more than 20 hectares?:

# <br><br><br>4.0 Part Four: data aggregation

Data aggregation means taking many data points and reducing them to one number, whether it's a count, sum, mean, or other single statistic. Here are some DataFrame method functions:

- [`.count()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.count.html)
- [`.sum()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sum.html)
- [`.mean()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.mean.html)
- [`.median()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.median.html)
- [`.min()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.min.html)
- [`.max()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.max.html)
- [`.unique()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.unique.html)
- [`.nunique()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.nunique.html)
- [`.std()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.std.html)   #Standard deviation
- [`.var()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.var.html)   #Variance
- And more! https://pandas.pydata.org/docs/reference/series.html#computations-descriptive-stats

## <br>4.1 Aggregating across the whole dataframe and across columns
If you use a method function on the entire dataset, it will try its best to execute the method for all columns.

#### How many (non-missing) values are in each column?

In [None]:
df.count()

#### What is the smallest value in each column?

In [None]:
df.min()

#### What is the sum of all values in each column?

In [None]:
df.sum()

#### What are the unique values in each column?

In [None]:
df.unique()

<br>**Uh-oh.** Not all functions will work on the entire DataFrame. 
<br>Most of the time you are interested in only a subset of the data. You can aggregate the data in a single column:

#### <br>The `unique()` method function provides all unique values in a column, as a pandas object:

In [None]:
df["day"].unique()

This can easily be turned into a list:

In [None]:
list(df["day"].unique())

#### The `nunique()` method tells you how many unique values are present in a column:

In [None]:
df["month"].nunique()

#### Find the variance of one column.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.var.html for more info on the `var()` function.

In [None]:
df["temp"].var()

### <br><br>Exercise 4.1

Write code to find the mean humidity of the dataset:

Write code to find the coldest temperature in the dataset:

## <br><br>4.2 groupby

Often, you will want to calculate the statistics for a particular subgroup of a data column.

For example, let's say we want to ask: **Do more fires happen on certain days of the week?** This code will tell you the count for every column in the DataFrame except the column that you are using to group your data (i.e. "day").

In [None]:
df.groupby("day").count()

<br>We can make this easier to read by sorting the rows from lowest count to highest count. We can string multiple method functions onto our code, as long as they go in the correct logical order. We will pick any column to sort on, since they are all the same.

In [None]:
df.groupby("day").count().sort_values("month")

<br>**It looks like weekends have more fires than weekdays.**

<br><br><br><br>If you only want to see an aggregate total for one column in the DataFrame, you can add on the indexing technique we learned earlier. With this code I will ask, **What is the mean area burned on each day of the week?**

In [None]:
df.groupby("day")[["area_burned"]].mean()

<br>Again, we can also sort the data:

In [None]:
df.groupby("day")[["area_burned"]].mean().sort_values("area_burned")

<br>**So Saturday fires are the most destructive fires.**

<br>We can also add some other functions to our code, like `round`, to make it more readable:

In [None]:
df.groupby("day")[["area_burned"]].mean().round(2).sort_values("area_burned")

You can even calculate aggregate statistics for multiple columns at once:

In [None]:
df.groupby("day")[["area_burned", "temp", "rain"]].mean().round(2).sort_values("area_burned")
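Because the full dataset is too big to check by hand, it can help to run `groupby` on a table small enough to verify in your head. This sketch uses four made-up fires (the data and variable names are invented for this example):

```python
import pandas as pd

toy_df = pd.DataFrame({
    "day": ["sat", "sun", "sat", "mon"],
    "area_burned": [10.0, 2.0, 30.0, 4.0],
})

# Count the rows in each group: two fires on "sat", one each on "sun" and "mon"
fires_per_day = toy_df.groupby("day").count()
print(fires_per_day)

# Mean of one column per group: the "sat" mean is (10 + 30) / 2
mean_area = toy_df.groupby("day")[["area_burned"]].mean()
print(mean_area.sort_values("area_burned"))
```

Notice that the grouping column, `day`, becomes the index of the result.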

### <br><br>Exercise 4.2

**Which months had the most fires?** Write code to count how many fires happened in each month (as a bonus, sort the results):

**Which months have the largest fires?** Write code to calculate the mean area burned for fires in each month (as a bonus, round and sort the results):

Think of another interesting question you can ask with this dataset and write code to find the answer.

# <br><br><br>5.0 Part Five: basic plotting

Pandas allows you to make some basic plots without loading other packages. Plotting in pandas is good for exploring your data, and we'll focus on that today.
<br><br>Pandas is not good for making good-looking, high-quality data visualizations. The Python library for that is called matplotlib. We will not be covering how to make data visualizations for publication, since this is not a matplotlib workshop.
<br><br>Pandas' plotting capabilities are actually built on matplotlib, but in a much simpler format.

## <br><br>5.1 Histograms

One of the most common data exploration tasks you might do is to check the distributions of the columns in your dataset.

We will use the `hist()` method function on the `temp` column.

In [None]:
df["temp"].hist()

<br><br>Like I said, it's not pretty, but it tells us the story of our data.

By default, `hist()` will divide the data into 10 bins. We can change that by passing a keyword argument:

In [None]:
df["temp"].hist(bins=20)
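
<br>To see what the number of bins means without drawing anything, here is a small sketch that uses NumPy's `histogram` function on made-up temperatures (the values are assumptions for illustration, not the workshop data). It splits the range of the data into equal-width bins and returns the count in each bin along with the bin edges:

```python
import numpy as np
import pandas as pd

# Made-up temperatures for illustration only
temps = pd.Series([4.6, 8.2, 11.4, 14.6, 17.8, 18.0, 19.3, 21.2, 22.8, 25.1, 27.9, 33.3])

# Equal-width binning with 10 and with 20 bins
counts_10, edges_10 = np.histogram(temps.to_numpy(), bins=10)
counts_20, edges_20 = np.histogram(temps.to_numpy(), bins=20)

# N bins always have N + 1 edges, running from the minimum to the maximum value
print(len(counts_10), len(edges_10))
print(len(counts_20), len(edges_20))

# Every value lands in exactly one bin, so the counts add up to the number of rows
print(counts_10.sum(), counts_20.sum())
```

More bins show finer detail, but each bin holds fewer points, so the plot can look noisier.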

### <br><br>Exercise 5.1.0

Make a histogram of the `humidity` column. Specify that you want the data grouped into 15 bins.

<br><br><br>Another way to create a histogram in pandas is to do:

In [None]:
df.hist(column="temp")

<br>Both ways do the same thing. In the first, we select a single column (a Series) from our DataFrame and call `hist()` on it. In the second, we call `hist()` on the entire DataFrame and specify the column with an argument.

<br>If we don't specify a column, we will get histograms of all columns with numerical data:

In [None]:
df.hist()

<br>We can also ask for a list of columns:

In [None]:
df.hist(column=["temp", "humidity"])

### <br><br>Exercise 5.1.1

In one line of code, create histograms for the `moisture` and `drought` columns.

**<br><br><br>What other changes can we make to our histogram?**

Let's look at the documentation. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.hist.html

<br>We can get rid of the grid lines!

In [None]:
df.hist(column="temp", grid=False)

<br>We can also change the figure size. The `figsize` keyword argument takes a pair of numbers (a list or a tuple): width in inches and height in inches.
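
As a minimal sketch of `figsize` on a made-up table (the `toy` DataFrame and its values are invented for illustration), we can ask for a figure 8 inches wide and 4 inches tall and then check the size of the figure that was created. The `matplotlib.use("Agg")` line is only there so the sketch runs without a display; it is an assumption of this example and you would normally leave it out in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; normally not needed in a notebook

import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({
    "temp": [8.2, 18.0, 14.6, 22.8, 11.4, 25.1],
    "humidity": [51, 33, 45, 27, 60, 29],
})

# Width 8 inches, height 4 inches
axes = toy.hist(grid=False, figsize=(8, 4))

# DataFrame.hist returns an array of Axes; they all belong to one figure
fig = axes.flat[0].figure
print(fig.get_size_inches())
```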

### <br><br>Exercise 5.1.2

Remember this plot?

In [None]:
df.hist()

<br>Change the figure size in the next line of code until you can see the plots better:

In [None]:
df.hist(grid=False, figsize=[2,2])

## <br><br><br>5.2 Scatter plots to check for correlation

We can make a quick scatter plot to check for correlation between two columns. The syntax is slightly different from `hist()`: we call `df.plot.scatter()`, which requires two arguments, the column names for the x and y axes.

In [None]:
df.plot.scatter(x="temp", y="humidity")
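
<br>A scatter plot gives a visual impression of the relationship. If you also want a number to go with it, the pandas `corr()` method computes the Pearson correlation coefficient by default. This is a sketch on made-up values (not the workshop data):

```python
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({
    "temp": [10, 15, 20, 25, 30],
    "humidity": [80, 72, 61, 49, 42],
})

# Pearson correlation between two columns: a value between -1 and 1
r = toy["temp"].corr(toy["humidity"])
print(round(r, 3))
```

A value near -1 or 1 suggests a strong linear relationship, and a value near 0 suggests little or none. Here humidity falls steadily as temperature rises, so the coefficient is strongly negative.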

### <br><br>Exercise 5.2

Write code to create a scatter plot with the `moisture` column on the x-axis and the `drought` column on the y-axis.

Based on what we learned with `hist()`, can you add a grid to the scatter plot you just made and change the size so that it is a perfect square?

## <br><br><br>5.3 Slightly more complicated examples

### <br>5.3.0 Removing Outliers

Let's look at the correlation between humidity and area_burned:

In [None]:
df.plot.scatter(x="humidity", y="area_burned")

<br>We can see that a few outliers are obscuring any relationship. We can remove them with a boolean filter, but we need to look at the plot above and decide where to make the cutoff. Let's try keeping only the points where `area_burned` is below 400.

In [None]:
df_no_outliers = df[df["area_burned"] < 400]

In [None]:
df_no_outliers.plot.scatter(x="humidity", y="area_burned")

<br><br>There are a large number of points with 0 for area_burned. Let's remove those, and only look at days where the fires spread.

In [None]:
df_no_outliers = df[(df["area_burned"] > 0) & (df["area_burned"] < 400)]

In [None]:
df_no_outliers.plot.scatter(x="humidity", y="area_burned")
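
<br>The combined filter above can be broken into steps. This sketch uses a made-up table (values invented for illustration) and builds each condition as its own boolean Series before combining them with `&`. When the conditions are written inline, each one needs its own parentheses because `&` is evaluated before `>` and `<`.

```python
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({
    "humidity": [30, 45, 50, 60, 70, 25],
    "area_burned": [0.0, 12.5, 0.0, 746.3, 3.1, 1090.8],
})

# Each comparison produces a Series of True/False values, one per row
is_positive = toy["area_burned"] > 0
is_below_cutoff = toy["area_burned"] < 400

# & keeps only the rows where both conditions are True
kept = toy[is_positive & is_below_cutoff]

print(kept)
print(len(toy) - len(kept), "rows removed")
```

Printing how many rows were removed is a quick sanity check that the cutoff did what you intended.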

### <br><br><br>5.3.1 Plotting with categorical data

Let's say we want to see the relationship between month and humidity. We can try a scatter plot:

In [None]:
df.plot.scatter(x="month", y="humidity")

<br>That doesn't work! Let's try another type of plot - a bar plot:

In [None]:
df.plot.bar(x="month", y="humidity")

<br>Also not what we're looking for. It's plotting each data point individually.

What we really want to see is how the mean humidity changes each month. We can create a new DataFrame that only includes the data we need. We will group by month, select only the humidity column, and then find the mean.

In [None]:
hum_mean = df.groupby("month")["humidity"].mean()
hum_mean

<br>Now we have a nice series object that we can plot:

In [None]:
hum_mean.plot.bar()

<br>This is looking better, but the bars are in alphabetical order and we want them in calendar order. We're going to worry about that in just a minute.

### <br><br>Exercise 5.3

Create a bar graph that shows mean temperature grouped by month. First you'll need to create a new Series object with the means. Refer back to the humidity example directly above.

<br><br><br>Let's deal with the sorting issue! We can create a new column in our DataFrame that contains a numerical value for the month. We will use a handy pandas method called `map()`. It takes a dictionary in which each `key: value` pair represents `data_to_replace: new_data`.

In [None]:
df["month_num"] = df["month"].map({"jan": 1, "feb": 2, "mar": 3, "apr": 4,
                                   "may": 5, "jun": 6, "jul": 7, "aug": 8,
                                   "sep": 9, "oct": 10, "nov": 11, "dec": 12})

df.head()
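
<br>One thing to watch with `map()`: any value that has no matching key in the dictionary becomes `NaN`, with no error raised. A misspelled key can therefore silently drop a whole month from later calculations. This sketch on a made-up Series (the names `months` and `partial_lookup` are invented for illustration) shows the effect and a quick way to check for it:

```python
import pandas as pd

# Made-up data for illustration only
months = pd.Series(["jan", "feb", "dec", "feb"])

# "dec" is deliberately left out of the dictionary
partial_lookup = {"jan": 1, "feb": 2}

month_num = months.map(partial_lookup)
print(month_num)

# Count the values that did not find a match
print(month_num.isna().sum())
```

After mapping a real column, checking that `isna().sum()` is 0 confirms that every value was matched.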

<br>Now we will repeat the humidity plot we just did, but we will group by the new column.

In [None]:
hum_mean = df.groupby("month_num")["humidity"].mean()
hum_mean.plot.bar()
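
<br>Another possible way to get calendar order, shown here only as a sketch on invented numbers, is to keep the month names and reorder the grouped result with `reindex()`. The `toy` table and the `month_order` list are assumptions of this example, not part of the workshop files.

```python
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({
    "month": ["sep", "aug", "mar", "aug", "mar", "sep"],
    "humidity": [40, 50, 35, 60, 45, 50],
})

month_order = ["jan", "feb", "mar", "apr", "may", "jun",
               "jul", "aug", "sep", "oct", "nov", "dec"]

# groupby sorts the group labels alphabetically: aug, mar, sep
hum_mean_toy = toy.groupby("month")["humidity"].mean()

# Keep only the months that occur in the data, in calendar order
present = [m for m in month_order if m in hum_mean_toy.index]
hum_mean_ordered = hum_mean_toy.reindex(present)

print(hum_mean_ordered)
```

Calling `hum_mean_ordered.plot.bar()` would then draw the bars in calendar order with month names, rather than numbers, on the x-axis.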