## Sections:

1. [Pandas](#pandas)  
2. [Installing and importing Pandas](#installing-and-importing-pandas)  
3. [Series](#series)  
    3.1. [Creating a Series](#creating-series)  
    3.2. [Indexing and slicing](#indexing-and-slicing)  
    3.3. [Operations on Series](#operations-on-series)  
4. [Keyword arguments](#keyword-arguments)  
5. [DataFrame](#dataframe)  
    5.1. [Creating a DataFrame](#creating-a-dataframe)  
    5.2. [Indexing and slicing](#df-indexing-and-slicing)  
    5.3. [Adding rows and columns](#adding-rows-and-columns)  
    5.4. [Removing rows and columns](#removing-rows-and-columns)  

# 1. Pandas  <a id='pandas'></a>

Pandas is a Python library for data analysis. Pandas is particularly useful when working with tabular data (data organized in a table). It is also worth mentioning that the Pandas library relies heavily on the NumPy library - we will explore what that means later on.

# 2. Installing and importing Pandas  <a id='installing-and-importing-pandas'></a>

The Pandas package is not included with Python, and therefore we have to install it before we can import the "pandas" module. The easiest way to install Pandas is to use **pip** - the official **P**ackage **I**nstaller for **P**ython. In order to install the Pandas package using pip, we can run the following command in the terminal:

`pip install pandas`

However, since we are using a Jupyter notebook, we can also install Pandas by running the command in the cell below (which will run the above command in the terminal for us):

In [2]:
!pip install pandas

Collecting pandas
  Downloading pandas-1.4.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 11.7/11.7 MB 1.2 MB/s eta 0:00:00
Installing collected packages: pandas
Successfully installed pandas-1.4.4


Once we have installed the Pandas package, we can import the "pandas" module by running the following code:

In [1]:
import pandas as pd

The line above imports the "pandas" module and assigns it to the variable `pd` - this is a commonly used abbreviation for the "pandas" module.

# 3. Series  <a id='series'></a>

The Pandas library contains two main types of objects - a `Series` and a `DataFrame`. A `Series` is similar to a one-dimensional NumPy array - it holds an ordered collection of items of the same data type. In fact, the `Series` object uses a NumPy array to store data. The `DataFrame` object is used to represent a table which consists of multiple columns (each of which is a `Series`). In this section, we will cover the `Series` object.

## 3.1. Creating series <a id='creating-series'></a>

What distinguishes a `Series` from just a NumPy array is that a Series contains two NumPy arrays - one which represents the values in the `Series` and another which represents the index. The index can be composed of numbers or strings.

If we create a `Series` and just pass in a `list` of items, without specifying the index, a default index will be created from 0 to $n-1$, where $n$ is the number of items in the `Series`.

In [4]:
s = pd.Series([1, 2, 3, 4, 5])
s

0    1
1    2
2    3
3    4
4    5
dtype: int64

As can be seen above, the index of the `Series` is on the left `(0, 1, 2, 3, 4)`, while the values of the `Series` are on the right `(1, 2, 3, 4, 5)`. At the bottom we can see the data type (`dtype`) of the `Series`. You might be wondering why the values are displayed vertically. As mentioned before, this is because a `Series` is used to represent a column in a table - we will cover this in greater depth later on in the notebook. 
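To make the default index more concrete, here is a small sketch. The variable name `s_default` is made up for this illustration; the point is that the automatically created index runs from 0 to $n-1$.

```python
import pandas as pd

s_default = pd.Series([10, 20, 30])

# With no index given, pandas builds a RangeIndex running from 0 to n-1.
print(s_default.index)        # RangeIndex(start=0, stop=3, step=1)
print(list(s_default.index))  # [0, 1, 2]
print(len(s_default))         # 3
```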

Let's see what happens if we create a `Series` with the same values, but we pass in a custom index this time:

In [5]:
s = pd.Series([1, 2, 3, 4, 5], ["a", "b", "c", "d", "e"])
s

a    1
b    2
c    3
d    4
e    5
dtype: int64

As we can see above, the index now consists of: `a b c d e`. We can check the values of the index specifically by accessing the `index` attribute of a `Series`.

In [6]:
s.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

We see that the index does contain the five strings `'a'`, `'b'`, `'c'`, `'d'` and `'e'`. We can similarly access the values in a series via the `values` attribute:

In [7]:
s.values

array([1, 2, 3, 4, 5])

You may recognize the above output as a NumPy array. This is because the values in a `Series` are stored in a NumPy array:

In [8]:
type(s.values)

numpy.ndarray

The `Index` object also contains a NumPy array to store the values of the index:

In [9]:
s.index.values

array(['a', 'b', 'c', 'd', 'e'], dtype=object)
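Besides the `values` attribute, a `Series` and its `Index` also offer a `to_numpy()` method that returns the underlying data as a NumPy array. The sketch below uses a throwaway `Series` called `s_demo` (a name chosen only for this example) so that it can be run on its own.

```python
import numpy as np
import pandas as pd

s_demo = pd.Series([1, 2, 3], index=["a", "b", "c"])

values = s_demo.to_numpy()        # the values as a NumPy array
labels = s_demo.index.to_numpy()  # the index labels as a NumPy array

print(type(values))   # <class 'numpy.ndarray'>
print(values.sum())   # 6
print(list(labels))   # ['a', 'b', 'c']
```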

The `Series` object can also be thought of as an ordered dictionary. Therefore, we can also create a series from a dictionary:

In [10]:
data = {"a": 1, "b": 2, "c": 3}
new_series = pd.Series(data)
new_series

a    1
b    2
c    3
dtype: int64
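When we create a `Series` from a dictionary, we can additionally pass an index to choose which keys to keep and in what order. The sketch below (with the made-up names `data_demo` and `picked`) shows that a label missing from the dictionary gets the value `NaN`, which also turns the values into floats.

```python
import pandas as pd

data_demo = {"a": 1, "b": 2, "c": 3}

# Labels are looked up in the dictionary; "z" is not a key, so it becomes NaN.
picked = pd.Series(data_demo, index=["c", "a", "z"])
print(picked)

print(picked["c"])           # 3.0
print(pd.isna(picked["z"]))  # True
```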

## 3.2. Indexing and slicing <a id='indexing-and-slicing'></a>

As mentioned before, Pandas relies heavily on the NumPy library. It can be said that Pandas is built on top of NumPy. This means that if you are already familiar with NumPy, understanding Pandas will be easier since much of the functionality is similar.

For example, accessing items in a `Series` works much the same way as accessing items in a one-dimensional NumPy array. However, Pandas extends this idea to indices composed of strings as well. Therefore, we can access items in the `s` Series by their index like so:

In [11]:
print(s["a"])
print(s["b"])
print(s["c"])
print(s["d"])
print(s["e"])

1
2
3
4
5


This is exactly how you would access items in a dictionary by their key. However, since a `Series` is an ordered collection of data (just like an array), Pandas also allows us to perform slicing with string indices:

In [12]:
s["b":"d"]

b    2
c    3
d    4
dtype: int64

As can be seen above, slicing a `Series` returns another `Series` containing the items specified by the slice. This second `Series` is called a "view" of the original `Series`.
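One detail of label-based slicing is easy to miss: the end label is included in the result, whereas slicing by position stops before the end position, just like a Python list. The following sketch makes the difference explicit using the `loc` and `iloc` attributes of a `Series`; `s_demo` is a stand-in name for this example only.

```python
import pandas as pd

s_demo = pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"])

by_label = s_demo.loc["b":"d"]  # label slice: "d" is included
by_position = s_demo.iloc[1:3]  # position slice: stops before position 3

print(list(by_label))     # [2, 3, 4]
print(list(by_position))  # [2, 3]
```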

We can also pass a list of indices to select specific items from the original `Series`:

In [13]:
s[["a", "c", "e"]]

a    1
c    3
e    5
dtype: int64

And we can also use logical expressions to select items:

In [14]:
s[s < 4]

a    1
b    2
c    3
dtype: int64
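Logical expressions can be combined with `&` (and) and `|` (or). Here is a minimal sketch, assuming a fresh `Series` named `s_demo` with the original values 1 to 5; the names `middle` and `edges` are illustrative only.

```python
import pandas as pd

s_demo = pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"])

# Each condition needs its own parentheses, because & and | bind
# more tightly than the comparison operators.
middle = s_demo[(s_demo > 1) & (s_demo < 5)]
edges = s_demo[(s_demo < 2) | (s_demo > 4)]

print(middle.tolist())  # [2, 3, 4]
print(edges.tolist())   # [1, 5]
```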

Similarly to NumPy arrays, we can assign values to locations in a `Series` with the use of indexing and slicing:

In [15]:
s["b"] = 99
s

a     1
b    99
c     3
d     4
e     5
dtype: int64

In [16]:
s["b":"c"] = [20, 30]
s

a     1
b    20
c    30
d     4
e     5
dtype: int64

Of course, a `Series` with the default numerical index works analogously to a one-dimensional NumPy array:

In [17]:
s2 = pd.Series([1, 2, 3, 4, 5])
s2[0]

1

In [18]:
s2[1:4]

1    2
2    3
3    4
dtype: int64

In [19]:
s2[[0, 2, 4]]

0    1
2    3
4    5
dtype: int64

In [20]:
s2[s2 < 4]

0    1
1    2
2    3
dtype: int64

In [21]:
s2[1] = 99
s2

0     1
1    99
2     3
3     4
4     5
dtype: int64

In [22]:
s2[1:3] = [20, 30]
s2

0     1
1    20
2    30
3     4
4     5
dtype: int64
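The examples above behave like a NumPy array because the labels 0 to 4 happen to coincide with the positions. When an integer index is not in that default order, a lookup such as `s2[0]` goes by label, not by position. The sketch below uses a made-up `Series` called `s_int` to show how `loc` (labels) and `iloc` (positions) make the intent explicit.

```python
import pandas as pd

# Integer labels that deliberately do not match the positions.
s_int = pd.Series([10, 20, 30], index=[2, 1, 0])

print(s_int.loc[0])   # 30 -> the item whose label is 0
print(s_int.iloc[0])  # 10 -> the item at the first position
```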

## 3.3. Operations on series <a id='operations-on-series'></a>

Performing operations on `Series` works analogously to performing operations on one-dimensional NumPy arrays:

In [23]:
s

a     1
b    20
c    30
d     4
e     5
dtype: int64

In [24]:
s * 2

a     2
b    40
c    60
d     8
e    10
dtype: int64

In [25]:
s + [1, 2, 3, 4, 5]

a     2
b    22
c    33
d     8
e    10
dtype: int64

In [26]:
s * s

a      1
b    400
c    900
d     16
e     25
dtype: int64
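One difference from NumPy arrays is worth knowing about: when an operation involves two `Series`, Pandas matches the items by index label rather than by position. The sketch below uses two made-up `Series`, `left` and `right`, whose indices only partly overlap.

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=["a", "b", "c"])
right = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Labels present in only one Series produce NaN in the result.
total = left + right
print(total)

# The add() method accepts fill_value to treat missing items as 0 instead.
filled = left.add(right, fill_value=0)
print(filled.tolist())  # [1.0, 12.0, 23.0, 30.0]
```

Here `total` has the labels `a`, `b`, `c` and `d`, with `NaN` for `a` and `d`, 12 for `b` and 23 for `c`.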

In [27]:
print(s.min())
print(s.max())
print(s.sum())
print(s.mean())
print(s.std())
print(s.var())

1
30
60
12.0
12.469963913339926
155.5
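A note on `std()` and `var()`: Pandas computes the sample versions by default (dividing by $n-1$), whereas NumPy's `np.std()` and `np.var()` divide by $n$ by default. The sketch below, using a stand-in `Series` named `s_demo` with the same values as `s`, shows how the `ddof` argument switches between the two.

```python
import numpy as np
import pandas as pd

s_demo = pd.Series([1, 20, 30, 4, 5])

sample_std = s_demo.std()              # default ddof=1 (divide by n-1)
population_std = s_demo.std(ddof=0)    # divide by n
numpy_std = np.std(s_demo.to_numpy())  # NumPy default is ddof=0

print(s_demo.var())                             # 155.5
print(sample_std > population_std)              # True
print(np.isclose(population_std, numpy_std))    # True
```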


In [28]:
print(1 in s.values)
print(-999 in s.values)

True
False


In [29]:
print("a" in s.index)
print("a" in s)

print("z" in s.index)
print("z" in s)

True
True
False
False


In [30]:
for value in s.values:
    print(value)

1
20
30
4
5


In [31]:
for index_value in s.index:
    print(index_value)

a
b
c
d
e
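If we need the index and the value at the same time, the `items()` method of a `Series` yields them as pairs, much like `items()` on a dictionary. A minimal sketch, with `s_demo` and `pairs` as made-up names:

```python
import pandas as pd

s_demo = pd.Series([1, 20, 30], index=["a", "b", "c"])

pairs = []
for label, value in s_demo.items():
    print(label, value)
    pairs.append((label, value))
```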


# 4. Keyword arguments <a id='keyword-arguments'></a>

Before we proceed to discussing the `DataFrame` object, it is worth briefly covering what keyword arguments are. Until now, whenever passing some arguments to a function or method, you had to pass the arguments in the correct order. For example, when creating a `Series`, you had to remember that the data argument is first, while the index argument is second:

In [32]:
x = [1, 2, 3]
y = ["a", "b", "c"]

pd.Series(x, y)

a    1
b    2
c    3
dtype: int64

However, each parameter in a function has a name, and therefore you can pass arguments in any order, as long as you specify the correct parameter names. The names of parameters are usually intuitive. For example, the names of the data and index parameters are "data" and "index":

In [33]:
pd.Series(data=x, index=y)

a    1
b    2
c    3
dtype: int64

In [34]:
pd.Series(index=y, data=x)

a    1
b    2
c    3
dtype: int64

As we can see above, if we use keyword arguments in the form of `{name}={value}`, the order of arguments does not matter anymore. Keyword arguments can be used in all functions or methods, since every parameter in a function has a name (there is a small exception to this rule, but it is not important for now). We can check what keyword arguments `Series()` accepts by looking at the [official documentation page](https://pandas.pydata.org/docs/reference/api/pandas.Series.html).

There we can see that `Series()` has the following parameters: "data", "index", "dtype", "name", "copy" and "fastpath" (this list applies to the pandas version installed above, 1.4.4, and may differ in other versions). We don't have to worry about what all of those parameters do, but let's look at another example where we pass in the `dtype` argument, but we do not pass in the index.

In [35]:
pd.Series(data=x, dtype="float64")

0    1.0
1    2.0
2    3.0
dtype: float64

In the examples above, we stored the data in a variable called `x` and the index in a variable called `y`. However, often it is convenient to store this data in variables named identically to the parameters of a function or method to which they correspond.

In [36]:
data = [1, 2, 3]
index = ["a", "b", "c"]

pd.Series(data=data, index=index)

a    1
b    2
c    3
dtype: int64

The `data=data` and `index=index` may seem a bit strange at first, but all you have to remember is that on the left we specify the name of the keyword argument, and on the right we specify the value of the keyword argument.

Sometimes, you will see people mixing positional arguments and keyword arguments. For example, below we pass in the data as a positional argument, but the index and dtype are passed in as keyword arguments:

In [37]:
pd.Series(data, index=index, dtype="float64")

a    1.0
b    2.0
c    3.0
dtype: float64
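Keyword arguments work the same way for functions we write ourselves. The sketch below defines a hypothetical helper called `make_series` (it is not part of Pandas) whose parameters have default values, so any of them can be left out or passed by name in any order.

```python
import pandas as pd


def make_series(data, index=None, dtype="float64"):
    """Hypothetical helper: wraps pd.Series and defaults to float values."""
    return pd.Series(data=data, index=index, dtype=dtype)


a = make_series([1, 2, 3])                         # only the positional argument
b = make_series([1, 2, 3], index=["x", "y", "z"])  # positional + keyword
c = make_series(dtype="int64", data=[1, 2, 3])     # keywords in any order

print(a.dtype, b.dtype, c.dtype)  # float64 float64 int64
```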

# 5. DataFrame <a id='dataframe'></a>

A DataFrame is a two-dimensional, table-like data structure with rows and columns. The `DataFrame` is composed of a collection of `Series` objects, each of which represents a single column in the `DataFrame`. All of the `Series` share the same index, which is also the index of the `DataFrame`.

## 5.1. Creating a DataFrame <a id='creating-a-dataframe'></a>

There are many different ways to create a `DataFrame` in Pandas. One of the most common is to use a dictionary containing equal-length lists of values:

In [2]:
data = {
    "name": ["Coco", "Luna", "Teddy", "Nala"],
    "breed": ["Poodle", "Shih tzu", "Poodle", "Husky"],
    "weight": [16.4, 5.9, 18.7, 24.8]
}

df = pd.DataFrame(data)

df

|   | name | breed | weight |
|---|---|---|---|
| 0 | Coco | Poodle | 16.4 |
| 1 | Luna | Shih tzu | 5.9 |
| 2 | Teddy | Poodle | 18.7 |
| 3 | Nala | Husky | 24.8 |


As we can see above, a `DataFrame` can be represented as a table with named columns ("name", "breed", "weight") and rows containing an index (0, 1, 2, 3). The `DataFrame` above contains various information about  dogs - you might imagine a veterinary clinic having a table like that in their database. Of course, the table would have more columns containing information about the dogs, such as their age or the date of their last visit etc.
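A few attributes are handy for getting a quick overview of a `DataFrame`. The sketch below rebuilds the same table under the made-up name `dogs` so that it runs on its own, and then inspects its shape, column names and column data types.

```python
import pandas as pd

dogs = pd.DataFrame({
    "name": ["Coco", "Luna", "Teddy", "Nala"],
    "breed": ["Poodle", "Shih tzu", "Poodle", "Husky"],
    "weight": [16.4, 5.9, 18.7, 24.8],
})

print(dogs.shape)          # (4, 3) -> 4 rows, 3 columns
print(list(dogs.columns))  # ['name', 'breed', 'weight']
print(dogs.dtypes)         # one data type per column
```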

As mentioned before, each column in a `DataFrame` is in fact a `Series`. We can access a column in a `DataFrame` with the use of the column's name (in a dictionary-like fashion): 

In [4]:
df["breed"]

0      Poodle
1    Shih tzu
2      Poodle
3       Husky
Name: breed, dtype: object

As we can see, the code above returns a `Series` containing the dog breeds with the same index as the `DataFrame`. We can also pass in a `list` of column names to get a new `DataFrame` containing a subset of the columns:

In [5]:
column_names = ["breed", "weight"]
df[column_names]

|   | breed | weight |
|---|---|---|
| 0 | Poodle | 16.4 |
| 1 | Shih tzu | 5.9 |
| 2 | Poodle | 18.7 |
| 3 | Husky | 24.8 |
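The difference between a column name and a list of column names matters even when the list has only one entry. The sketch below (again using a stand-in `DataFrame` named `dogs`) shows that single brackets give a `Series`, while double brackets give a one-column `DataFrame`.

```python
import pandas as pd

dogs = pd.DataFrame({
    "name": ["Coco", "Luna", "Teddy", "Nala"],
    "breed": ["Poodle", "Shih tzu", "Poodle", "Husky"],
    "weight": [16.4, 5.9, 18.7, 24.8],
})

column = dogs["breed"]   # a column name -> Series
table = dogs[["breed"]]  # a list of column names -> DataFrame

print(type(column).__name__)  # Series
print(type(table).__name__)   # DataFrame
print(table.shape)            # (4, 1)
```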


It is also worth mentioning that just like with a `Series`, we can pass in an index when creating a `DataFrame`.  

In [6]:
index = ["a", "b", "c", "d"]

df = pd.DataFrame(data, index=index)

df

|   | name | breed | weight |
|---|---|---|---|
| a | Coco | Poodle | 16.4 |
| b | Luna | Shih tzu | 5.9 |
| c | Teddy | Poodle | 18.7 |
| d | Nala | Husky | 24.8 |


## 5.2. Indexing and slicing <a id='df-indexing-and-slicing'></a>

If we want a slice of a particular column, we can simply retrieve the corresponding `Series` object and then slice it accordingly. However, what about retrieving a slice of a `DataFrame` along the rows dimension? To do so, we could simply slice the `DataFrame` object, instead of the `Series` object:

In [7]:
df[1:3]

|   | name | breed | weight |
|---|---|---|---|
| b | Luna | Shih tzu | 5.9 |
| c | Teddy | Poodle | 18.7 |


More typically, if we want to retrieve a specific row or column, or a slice of rows and columns, we can use one of two attributes: `loc` or `iloc`.

The `loc` attribute allows for defining a slice primarily based on labels. For example, we can retrieve a specific row in a `DataFrame`, like so:

In [8]:
df.loc["a"]

name        Coco
breed     Poodle
weight      16.4
Name: a, dtype: object

The code above returns the data in the row with index "a" as a `Series`, where the index of the `Series` consists of the names of the columns in `df`. We can also retrieve more than one row by passing in a `list` of indices. In this case, the expression will return a `DataFrame` with the corresponding rows:

In [9]:
df.loc[["a", "c"]]

|   | name | breed | weight |
|---|---|---|---|
| a | Coco | Poodle | 16.4 |
| c | Teddy | Poodle | 18.7 |


We can also pass in a list with a single index to retrieve a single row as a `DataFrame`:

In [10]:
df.loc[["a"]]

|   | name | breed | weight |
|---|---|---|---|
| a | Coco | Poodle | 16.4 |


We can also retrieve a slice of rows:

In [11]:
df.loc["b":]

|   | name | breed | weight |
|---|---|---|---|
| b | Luna | Shih tzu | 5.9 |
| c | Teddy | Poodle | 18.7 |
| d | Nala | Husky | 24.8 |


The same rules apply for the indexing and slicing of columns using the `loc` attribute:

In [12]:
df.loc["b":, ["name", "weight"]]

|   | name | weight |
|---|---|---|
| b | Luna | 5.9 |
| c | Teddy | 18.7 |
| d | Nala | 24.8 |


In [13]:
df.loc[:, :"breed"]

|   | name | breed |
|---|---|---|
| a | Coco | Poodle |
| b | Luna | Shih tzu |
| c | Teddy | Poodle |
| d | Nala | Husky |


In [14]:
df.loc[["a", "b"], "weight"]

a    16.4
b     5.9
Name: weight, dtype: float64
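When both the row and the column are given as single labels, `loc` returns the single value stored in that cell. The sketch below uses a small stand-in `DataFrame` named `dogs`; it also shows the `at` attribute, which Pandas provides specifically for single-value access.

```python
import pandas as pd

dogs = pd.DataFrame(
    {"name": ["Coco", "Luna"], "weight": [16.4, 5.9]},
    index=["a", "b"],
)

luna_weight = dogs.loc["b", "weight"]  # row label first, then column name
same_value = dogs.at["b", "weight"]    # .at is meant for a single cell

print(luna_weight)  # 5.9
```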

As mentioned previously, we can also use the `iloc` attribute which works analogously to the `loc` attribute, but uses positions instead of labels. 

In [15]:
df.iloc[1:3]

|   | name | breed | weight |
|---|---|---|---|
| b | Luna | Shih tzu | 5.9 |
| c | Teddy | Poodle | 18.7 |


In [16]:
df.iloc[:, 1:]

|   | breed | weight |
|---|---|---|
| a | Poodle | 16.4 |
| b | Shih tzu | 5.9 |
| c | Poodle | 18.7 |
| d | Husky | 24.8 |


In [17]:
df.iloc[[1, 3], [0, 2]]

|   | name | weight |
|---|---|---|
| b | Luna | 5.9 |
| d | Nala | 24.8 |
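Two properties of `iloc` follow from the fact that it works with positions, as in ordinary Python sequences: slices exclude the end position, and negative positions count from the end. The sketch below compares this with `loc`, using a stand-in `DataFrame` named `dogs`.

```python
import pandas as pd

dogs = pd.DataFrame(
    {
        "name": ["Coco", "Luna", "Teddy", "Nala"],
        "weight": [16.4, 5.9, 18.7, 24.8],
    },
    index=["a", "b", "c", "d"],
)

by_position = dogs.iloc[1:3]  # positions 1 and 2 -> rows "b" and "c"
by_label = dogs.loc["b":"c"]  # labels "b" through "c", end label included
last_row = dogs.iloc[-1]      # the last row, as a Series

print(list(by_position.index))  # ['b', 'c']
print(list(by_label.index))     # ['b', 'c']
print(last_row["name"])         # Nala
```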


Aside from the `loc` and `iloc` attributes, we can use masks based on logical conditions, just as with NumPy arrays:

In [18]:
mask = df["breed"] == "Poodle"

df[mask]

|   | name | breed | weight |
|---|---|---|---|
| a | Coco | Poodle | 16.4 |
| c | Teddy | Poodle | 18.7 |
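Masks can be combined with `&` and `|`, and the `isin()` method of a `Series` builds a mask that checks membership in a list. The following is a minimal sketch on a stand-in `DataFrame` named `dogs`; the result names are illustrative only.

```python
import pandas as pd

dogs = pd.DataFrame(
    {
        "name": ["Coco", "Luna", "Teddy", "Nala"],
        "breed": ["Poodle", "Shih tzu", "Poodle", "Husky"],
        "weight": [16.4, 5.9, 18.7, 24.8],
    },
    index=["a", "b", "c", "d"],
)

# Both conditions must hold; each one is wrapped in parentheses.
heavy_poodles = dogs[(dogs["breed"] == "Poodle") & (dogs["weight"] > 17)]

# Rows whose breed is any of the listed values.
not_poodles = dogs[dogs["breed"].isin(["Husky", "Shih tzu"])]

print(list(heavy_poodles["name"]))  # ['Teddy']
print(list(not_poodles["name"]))    # ['Luna', 'Nala']
```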


We can also assign values to specific locations in a `DataFrame`:

In [19]:
df.loc[["a", "c"], "weight"] = [17.2, 19.1]
df[mask]

|   | name | breed | weight |
|---|---|---|---|
| a | Coco | Poodle | 17.2 |
| c | Teddy | Poodle | 19.1 |


We can also assign values based on a mask. However, in order to do so, it is recommended to use `.loc` together with a mask, like so:

In [20]:
df.loc[mask, "weight"] = [17.3, 19.2]
df[mask]

|   | name | breed | weight |
|---|---|---|---|
| a | Coco | Poodle | 17.3 |
| c | Teddy | Poodle | 19.2 |
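The reason for the recommendation is that `df.loc[mask, "weight"] = ...` is a single operation on `df` itself. A chained form such as `df[mask]["weight"] = ...` first creates an intermediate object, and the assignment typically does not reach the original (Pandas may also emit a warning). The sketch below shows the `.loc` form with a single value assigned to all selected rows, using a stand-in `DataFrame` named `dogs`.

```python
import pandas as pd

dogs = pd.DataFrame(
    {
        "breed": ["Poodle", "Shih tzu", "Poodle", "Husky"],
        "weight": [16.4, 5.9, 18.7, 24.8],
    },
    index=["a", "b", "c", "d"],
)

mask = dogs["breed"] == "Poodle"

# One indexing operation: select rows by mask and the column by name,
# then assign. A single value is applied to every selected row.
dogs.loc[mask, "weight"] = 20.0

print(dogs["weight"].tolist())  # [20.0, 5.9, 20.0, 24.8]
```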


## 5.3. Adding rows and columns <a id='adding-rows-and-columns'></a>

We can add rows to the `DataFrame` assigned to the variable `df` by creating a new `DataFrame` object and concatenating it with `df` using the `concat()` function: 

In [57]:
data = {
    "name": ["Milo", "Millie"],
    "breed": ["Beagle", "Scottish Terrier"],
    "weight": [9.4, 8.4]
}

index = ["e", "f"]

pd.concat([df, pd.DataFrame(data, index=index)])

|   | name | breed | weight |
|---|---|---|---|
| a | Coco | Poodle | 17.3 |
| b | Luna | Shih tzu | 5.9 |
| c | Teddy | Poodle | 19.2 |
| d | Nala | Husky | 24.8 |
| e | Milo | Beagle | 9.4 |
| f | Millie | Scottish Terrier | 8.4 |


The `concat()` function returns a new `DataFrame`, so the original `df` is unaffected unless we assign the result of `concat()` back to the variable `df`:

In [58]:
df

|   | name | breed | weight |
|---|---|---|---|
| a | Coco | Poodle | 17.3 |
| b | Luna | Shih tzu | 5.9 |
| c | Teddy | Poodle | 19.2 |
| d | Nala | Husky | 24.8 |


In [59]:
df = pd.concat([df, pd.DataFrame(data, index=index)])
df

|   | name | breed | weight |
|---|---|---|---|
| a | Coco | Poodle | 17.3 |
| b | Luna | Shih tzu | 5.9 |
| c | Teddy | Poodle | 19.2 |
| d | Nala | Husky | 24.8 |
| e | Milo | Beagle | 9.4 |
| f | Millie | Scottish Terrier | 8.4 |
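In the example above, the new rows were given their own index labels. If both `DataFrame` objects use the default integer index, `concat()` keeps the original labels, which leads to duplicates. Passing `ignore_index=True` renumbers the rows instead. A minimal sketch with two made-up tables, `first` and `second`:

```python
import pandas as pd

first = pd.DataFrame({"name": ["Coco", "Luna"], "weight": [16.4, 5.9]})
second = pd.DataFrame({"name": ["Milo"], "weight": [9.4]})

stacked = pd.concat([first, second])
renumbered = pd.concat([first, second], ignore_index=True)

print(list(stacked.index))     # [0, 1, 0] -> label 0 appears twice
print(list(renumbered.index))  # [0, 1, 2]
```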


We can add a new column to a `DataFrame` similarly to the way we would add a new key-value pair to a dictionary. More specifically, we can use a column name that does not yet exist and assign a value to it:

In [60]:
df["age"] = 3
df

|   | name | breed | weight | age |
|---|---|---|---|---|
| a | Coco | Poodle | 17.3 | 3 |
| b | Luna | Shih tzu | 5.9 | 3 |
| c | Teddy | Poodle | 19.2 | 3 |
| d | Nala | Husky | 24.8 | 3 |
| e | Milo | Beagle | 9.4 | 3 |
| f | Millie | Scottish Terrier | 8.4 | 3 |


Notice that the above statement modifies `df` directly (unlike the `concat()` function, which returns a new `DataFrame`). We can also assign different values to each row, like so:

In [61]:
df["age"] = [1, 2, 3, 4, 5, 6]
df

|   | name | breed | weight | age |
|---|---|---|---|---|
| a | Coco | Poodle | 17.3 | 1 |
| b | Luna | Shih tzu | 5.9 | 2 |
| c | Teddy | Poodle | 19.2 | 3 |
| d | Nala | Husky | 24.8 | 4 |
| e | Milo | Beagle | 9.4 | 5 |
| f | Millie | Scottish Terrier | 8.4 | 6 |


In the above case, we modified the existing "age" column by assigning new data to it. We can also make new columns by applying operations to existing columns:

In [62]:
df["weight_x2"] = df["weight"] * 2
df

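|   | name | breed | weight | age | weight_x2 |
|---|---|---|---|---|---|
| a | Coco | Poodle | 17.3 | 1 | 34.6 |
| b | Luna | Shih tzu | 5.9 | 2 | 11.8 |
| c | Teddy | Poodle | 19.2 | 3 | 38.4 |
| d | Nala | Husky | 24.8 | 4 | 49.6 |
| e | Milo | Beagle | 9.4 | 5 | 18.8 |
| f | Millie | Scottish Terrier | 8.4 | 6 | 16.8 |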


As we can see above, the values in the column "weight_x2" are equal to the values of the "weight" column multiplied by two.
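If we would rather not modify the original `DataFrame`, the `assign()` method returns a new `DataFrame` with the extra column and leaves the original untouched. The sketch below uses a small stand-in `DataFrame` named `dogs`.

```python
import pandas as pd

dogs = pd.DataFrame({"name": ["Coco", "Luna"], "weight": [16.4, 5.9]})

# The keyword name becomes the name of the new column.
with_double = dogs.assign(weight_x2=dogs["weight"] * 2)

print(list(with_double.columns))  # ['name', 'weight', 'weight_x2']
print(list(dogs.columns))         # ['name', 'weight'] -> unchanged
```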

## 5.4. Removing rows and columns <a id='removing-rows-and-columns'></a>

The `drop()` method of a `DataFrame` object allows us to remove rows or columns of data. We will use two of its parameters: "labels" and "axis". The "labels" argument is a single label or a list of labels (index values or column names) to remove from the `DataFrame`. The "axis" argument specifies whether the labels refer to rows or columns (`axis=0` means rows, whereas `axis=1` means columns).

In [63]:
df.drop(labels=["e", "f"], axis=0)

|   | name | breed | weight | age | weight_x2 |
|---|---|---|---|---|---|
| a | Coco | Poodle | 17.3 | 1 | 34.6 |
| b | Luna | Shih tzu | 5.9 | 2 | 11.8 |
| c | Teddy | Poodle | 19.2 | 3 | 38.4 |
| d | Nala | Husky | 24.8 | 4 | 49.6 |


As with `concat()`, most functions and methods in the Pandas library return a new `DataFrame` object rather than modifying the original. Therefore, if we do not re-assign the value returned by `drop()` to `df`, then `df` will remain unchanged.

In [64]:
df

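|   | name | breed | weight | age | weight_x2 |
|---|---|---|---|---|---|
| a | Coco | Poodle | 17.3 | 1 | 34.6 |
| b | Luna | Shih tzu | 5.9 | 2 | 11.8 |
| c | Teddy | Poodle | 19.2 | 3 | 38.4 |
| d | Nala | Husky | 24.8 | 4 | 49.6 |
| e | Milo | Beagle | 9.4 | 5 | 18.8 |
| f | Millie | Scottish Terrier | 8.4 | 6 | 16.8 |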


In [65]:
df = df.drop(["e", "f"], axis=0)
df

|   | name | breed | weight | age | weight_x2 |
|---|---|---|---|---|---|
| a | Coco | Poodle | 17.3 | 1 | 34.6 |
| b | Luna | Shih tzu | 5.9 | 2 | 11.8 |
| c | Teddy | Poodle | 19.2 | 3 | 38.4 |
| d | Nala | Husky | 24.8 | 4 | 49.6 |


We can remove columns with the `drop()` method in the following way: 

In [66]:
df.drop(labels=["weight_x2", "age"], axis=1)

|   | name | breed | weight |
|---|---|---|---|
| a | Coco | Poodle | 17.3 |
| b | Luna | Shih tzu | 5.9 |
| c | Teddy | Poodle | 19.2 |
| d | Nala | Husky | 24.8 |
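As an alternative to the "labels" and "axis" pair, `drop()` also accepts the keyword arguments "index" and "columns", which some people find easier to read. The sketch below removes one row and one column in a single call, using a stand-in `DataFrame` named `dogs`.

```python
import pandas as pd

dogs = pd.DataFrame(
    {
        "name": ["Coco", "Luna", "Teddy"],
        "weight": [16.4, 5.9, 18.7],
        "age": [1, 2, 3],
    },
    index=["a", "b", "c"],
)

# index=... names rows to remove, columns=... names columns to remove.
trimmed = dogs.drop(index=["b"], columns=["age"])

print(list(trimmed.index))    # ['a', 'c']
print(list(trimmed.columns))  # ['name', 'weight']
print(dogs.shape)             # (3, 3) -> the original is unchanged
```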


# 6. Operations on DataFrames <a id='operations-on-dataframes'></a>

Many of the operations that can be performed on a `Series` can also be performed on a `DataFrame`. Usually this means that an operation will be applied to each `Series`/column separately. So for example, `DataFrame.min()` returns the minimum value for each column.

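Before we test this out, let's add some columns containing numerical data to the dataframe stored in `df`. Note that the output below starts again from the `DataFrame` created at the beginning of section 5.1 (default integer index, original weights), not from the modified version at the end of section 5.4: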

In [4]:
df["age"] = [1, 2, 3, 4]
df["height"] = [1.02, 0.46, 0.87, 1.12]
df

|   | name | breed | weight | age | height |
|---|---|---|---|---|---|
| 0 | Coco | Poodle | 16.4 | 1 | 1.02 |
| 1 | Luna | Shih tzu | 5.9 | 2 | 0.46 |
| 2 | Teddy | Poodle | 18.7 | 3 | 0.87 |
| 3 | Nala | Husky | 24.8 | 4 | 1.12 |


Now let's select only the columns with numerical data:

In [5]:
column_names = ["weight", "age", "height"]
df_num = df[column_names]
df_num

|   | weight | age | height |
|---|---|---|---|
| 0 | 16.4 | 1 | 1.02 |
| 1 | 5.9 | 2 | 0.46 |
| 2 | 18.7 | 3 | 0.87 |
| 3 | 24.8 | 4 | 1.12 |
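Listing the numerical columns by hand works well for a small table. For larger tables, the `select_dtypes()` method can pick columns by data type instead. A minimal sketch on a stand-in `DataFrame` named `dogs`:

```python
import pandas as pd

dogs = pd.DataFrame({
    "name": ["Coco", "Luna"],
    "breed": ["Poodle", "Shih tzu"],
    "weight": [16.4, 5.9],
    "age": [1, 2],
})

# include="number" keeps integer and float columns and skips text columns.
numeric = dogs.select_dtypes(include="number")

print(list(numeric.columns))  # ['weight', 'age']
```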


Now let's see what happens when we call the `DataFrame.min()` method:

In [6]:
df_num.min()

weight    5.90
age       1.00
height    0.46
dtype: float64

As we can see, the `DataFrame.min()` method returned a `Series` whose index consists of the column names and whose values are the minimum values in each column.

The code cells below contain more examples of methods that can be used with a `DataFrame` object.

In [7]:
df_num.max()

weight    24.80
age        4.00
height     1.12
dtype: float64

In [8]:
df_num.sum()

weight    65.80
age       10.00
height     3.47
dtype: float64

In [9]:
df_num.mean()

weight    16.4500
age        2.5000
height     0.8675
dtype: float64

In [10]:
df_num.std()

weight    7.875913
age       1.290994
height    0.290445
dtype: float64

In [11]:
df_num.var()

weight    62.030000
age        1.666667
height     0.084358
dtype: float64