# Pandas

Pandas is a library for storing and manipulating tabular data, or data stored in rows and columns like a spreadsheet. Pandas is a huge library with many different functions and methods, so what follows is a brief introduction to the most important functions for data management.

```{seealso}
If you encounter any part of Pandas out in the wild that you don't see here, you can always refer to the [Pandas documentation](https://pandas.pydata.org/docs/user_guide/index.html#user-guide).
```

## DataFrames and Series

Instead of normal Python lists and dictionaries, Pandas stores data in its own specialized objects. The main one is a *DataFrame*, which is a lot like a spreadsheet with rows and columns.

You can create a DataFrame directly with the `DataFrame()` class in Pandas, but it's more likely that you'll read in a DataFrame from a CSV or spreadsheet file. First you must import the library, and it's a good idea to import the `numpy` library as well.

```{note}
NumPy is a Python library for efficiently handling arrays and matrices of numbers. Pandas uses it under the hood to run quickly. You usually won't need to use it directly, but it's good to have it installed to avoid any mysterious errors.
```

In [1]:
import pandas as pd
import numpy as np
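
As a minimal sketch of the first option, creating a DataFrame directly, you can pass a dictionary to `DataFrame()`. The variable name `cars` and the values below are made up for illustration; they are not a real dataset.

```python
import pandas as pd

# Hypothetical example data: each dictionary key becomes a column name,
# and each list holds that column's values (all lists must be the same length).
cars = pd.DataFrame(
    {
        "manufacturer": ["audi", "audi", "volkswagen"],
        "model": ["a4", "a4", "passat"],
        "cyl": [4, 6, 4],
    }
)
print(cars)
```

Pandas numbers the rows automatically, starting at 0, unless you tell it otherwise.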

Now you can use the `read_csv()` function to read in a comma-separated values (CSV) file. Pass the name of the file as a string in quotes. A bare file name is looked up in the same directory as your Jupyter notebook; if the file lives somewhere else, include a relative or full path to it. The `read_csv()` function will also accept a URL that points to a CSV file online.
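
If you want to try `read_csv()` without a file on disk, here is a small sketch that reads CSV text from memory instead. The `io.StringIO` wrapper and the names `csv_text` and `sample` are assumptions made for this illustration; with a real file you would pass its path or URL as a string.

```python
import io

import pandas as pd

# Stand-in for a CSV file: the first line holds the column names,
# and every following line is one row of comma-separated values.
csv_text = """manufacturer,model,displ,year,cyl
audi,a4,1.8,1999,4
audi,a4,2.0,2008,4
volkswagen,passat,2.8,1999,6
"""

# read_csv() accepts a file path, a URL, or a file-like object such as StringIO.
sample = pd.read_csv(io.StringIO(csv_text))
print(sample)
```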

For this example, we'll use the file `mpg.csv`, which comes from R's [ggplot2 library](https://www.rdocumentation.org/packages/ggplot2/versions/3.4.1/topics/mpg).

In [4]:
mpg = pd.read_csv("data/mpg.csv")
mpg

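| | manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | audi | a4 | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | compact |
| 1 | audi | a4 | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | compact |
| 2 | audi | a4 | 2.0 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact |
| 3 | audi | a4 | 2.0 | 2008 | 4 | auto(av) | f | 21 | 30 | p | compact |
| 4 | audi | a4 | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | compact |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 229 | volkswagen | passat | 2.0 | 2008 | 4 | auto(s6) | f | 19 | 28 | p | midsize |
| 230 | volkswagen | passat | 2.0 | 2008 | 4 | manual(m6) | f | 21 | 29 | p | midsize |
| 231 | volkswagen | passat | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | midsize |
| 232 | volkswagen | passat | 2.8 | 1999 | 6 | manual(m5) | f | 18 | 26 | p | midsize |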


```{note}
Jupyter nicely formats DataFrames as tables when you type the name of a variable containing a DataFrame. But if you use the `print()` function, it won't display as well.
```

You can get basic information about your DataFrame's columns using the `.info()` method.

In [5]:
mpg.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 234 entries, 0 to 233
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   manufacturer  234 non-null    object 
 1   model         234 non-null    object 
 2   displ         234 non-null    float64
 3   year          234 non-null    int64  
 4   cyl           234 non-null    int64  
 5   trans         234 non-null    object 
 6   drv           234 non-null    object 
 7   cty           234 non-null    int64  
 8   hwy           234 non-null    int64  
 9   fl            234 non-null    object 
 10  class         234 non-null    object 
dtypes: float64(1), int64(4), object(6)
memory usage: 20.2+ KB
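
If you only need one piece of what `.info()` reports, related attributes return it as a value you can reuse. The sketch below uses `.shape` and `.dtypes` on a small, made-up DataFrame (the name `cars` and its values are hypothetical).

```python
import pandas as pd

# Hypothetical data standing in for a larger DataFrame.
cars = pd.DataFrame(
    {
        "model": ["a4", "a4", "passat"],
        "displ": [1.8, 2.0, 2.8],
        "cyl": [4, 4, 6],
    }
)

# .shape is a (rows, columns) tuple.
n_rows, n_cols = cars.shape

# .dtypes is a Series that maps each column name to its data type.
column_types = cars.dtypes

print(n_rows, n_cols)
print(column_types)
```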


A *Series* is a lot like a Python list, and each column of a DataFrame is a Series. You can access the columns of a DataFrame with dot notation.

In [7]:
mpg.model

0          a4
1          a4
2          a4
3          a4
4          a4
        ...  
229    passat
230    passat
231    passat
232    passat
233    passat
Name: model, Length: 234, dtype: object
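
Dot notation only works when the column name is also a valid Python attribute name. As a sketch, the `class` column from this dataset can't be reached with a dot because `class` is a reserved word in Python, so square brackets with the column name as a string are the usual alternative. The small `cars` DataFrame below is a made-up stand-in.

```python
import pandas as pd

# Hypothetical stand-in with two of the column names used in this section.
cars = pd.DataFrame(
    {
        "model": ["a4", "a4", "passat"],
        "class": ["compact", "compact", "midsize"],
    }
)

# Dot notation works for names that are valid Python identifiers
# and don't clash with a keyword or an existing DataFrame attribute.
models = cars.model

# `cars.class` would be a SyntaxError, so use brackets and a string instead.
classes = cars["class"]
print(classes)
```

Bracket notation works for every column, so it is a safe default when you are unsure.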

You can also turn a list into a Series with the `Series()` class.

In [9]:
myseries = pd.Series([5, 6, 7, 8])
myseries

0    5
1    6
2    7
3    8
dtype: int64
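
To make the comparison with a list concrete, here is a short sketch of what a Series shares with a list and where it differs. The variable names `first`, `count`, `doubled`, and `as_list` are only for illustration.

```python
import pandas as pd

myseries = pd.Series([5, 6, 7, 8])

# Like a list, a Series has a length, and with the default index
# (0, 1, 2, ...) you can look up a value by its label.
first = myseries[0]
count = len(myseries)

# Unlike a list, arithmetic applies to every element at once.
doubled = myseries * 2

# Convert back to a plain Python list when you need one.
as_list = doubled.tolist()
print(first, count, as_list)
```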

## Selecting Rows and Columns

Once you have a DataFrame, you'll typically want to filter and select different rows or columns.

To filter specific rows, Pandas uses a bracket notation. It takes conditional statements that are similar to [Python conditions](/python.html#conditions).