Pandas is a Python library that provides high-level tools for organizing and processing data. It is intended to feel like the data utilities in other languages, such as `data.frame` in R. If you don't want to worry about writing for loops, splitting strings, and casting variables just to read in your data, Pandas can be extremely helpful. It is also compatible (in most cases) with the plotting libraries we will introduce later.

To start off, we have to introduce the concept of **modules** in Python. A module is a collection of code, such as functions and constants, that serves a specific purpose. For instance, the **math** module contains math functions that aren't a part of core Python. In order to use these functions, we have to import the module that contains them:

In [1]:
import math

Import statements usually appear at the top of a script or program, but the only rule for their location is that they must come before any code that uses the module. Now that we've imported the module, we can use its functions by specifying the module, then a period, then the name of the function:

In [2]:
x = 7.4

rounded_down = math.floor(x)
rounded_up = math.ceil(x)

print(rounded_down)
print(rounded_up)

7
8


The **math** module in particular has some very useful functions that we have tried to work around so far, like `sqrt()`. We have previously used `**0.5` instead to avoid introducing modules too early, but `sqrt()` can look much more natural:

In [3]:
x = 144

s1 = x**0.5
s2 = math.sqrt(x)

print(s1)
print(s2)

12.0
12.0


Just as with strings, Jupyter Notebook can list what a module offers when you press the *Tab* key after typing `math.`. Here are a few examples:

In [13]:
print(math.log10(10))
print(math.pi)
print(math.sin(0.5 * math.pi))
print(math.cos(2 * math.pi))

1.0
3.141592653589793
1.0
1.0
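
Outside of Tab completion, Python's built-in `dir()` function gives a similar listing from code. The following is a minimal sketch, not part of the original notebook; the variable name `public_names` is made up for this example, and the exact list depends on your Python version.

```python
import math

# dir() returns every name defined in the module; skip the internal ones
# that start with an underscore
public_names = [name for name in dir(math) if not name.startswith("_")]

print(len(public_names))   # how many functions and constants are available
print(public_names[:5])    # the first few, in alphabetical order
```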


Of course, you have to have these modules on your computer in order to import them, but most Python installations come with many modules. If you downloaded Anaconda (which I believe most people did), you can find the list of modules that came with it at this link: https://docs.anaconda.com/anaconda/packages/pkg-docs/. You just have to click on the installation that you used. If you want to get more control or download modules that aren't already installed, you can use the program `conda` to do so. We don't have enough time to cover that here, but there is plenty of documentation online.

Now we can get into Pandas. Start by importing it. I'm using a slight variation of the import statement here, because it is how you will see most people import Pandas:

In [14]:
import pandas as pd

This just means that we can type `pd` instead of `pandas` everywhere in the code, which saves some keystrokes.
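
The `as` form is not specific to Pandas. As a small illustration using the **math** module from earlier (the alias `m` is an arbitrary choice for this sketch, not a common convention), the alias is simply a second name for the same module:

```python
import math
import math as m  # "m" is an arbitrary alias chosen for this example

# Both names refer to the same module object
same_module = m is math
root = m.sqrt(144)

print(same_module)  # True
print(root)         # 12.0
```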

The first thing we need to do is read in some data. I'm using a made-up dataset I found online at https://people.sc.fsu.edu/~jburkardt/data/csv/biostats.csv in this example.

In my case, I placed the file `biostats.csv` in the same directory as my notebook, so I don't need to write its location. If the file is somewhere else, you need to provide the path to it, either relative to the notebook or as a full path.

In [54]:
df = pd.read_csv("biostats.csv")

#full path version
#df = pd.read_csv("/home/josh/tmp/biostats.csv")

This operation places an object called a **DataFrame** into the variable `df`. It contains all of the data in the file, parsed and (hopefully) correctly interpreted as the right type. If we print out the variable, we can see all of the data: 

In [56]:
print(df)

    Name Sex  Age  Height  Weight
0   Alex   M   41      74     170
1   Bert   M   42      68     166
2   Carl   M   32      70     155
3   Dave   M   39      72     167
4   Elly   F   30      66     124
5   Fran   F   33      66     115
6   Gwen   F   26      64     121
7   Hank   M   30      71     158
8   Ivan   M   53      72     175
9   Jake   M   32      69     143
10  Kate   F   47      69     139
11  Luke   M   34      72     163
12  Myra   F   23      62      98
13  Neil   M   36      75     160
14  Omar   M   38      70     145
15  Page   F   31      67     135
16  Quin   M   29      71     176
17  Ruth   F   28      65     131
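
To check which type Pandas inferred for each column, one option is the `dtypes` attribute. The sketch below is an illustration rather than part of the original notebook: it reads a small inline copy of the first four rows through `io.StringIO` so that it runs without the file. The names `csv_text` and `sample` are made up for this example.

```python
import io
import pandas as pd

# Inline stand-in for biostats.csv (first four rows only), so the example
# runs without the file on disk
csv_text = """Name,Sex,Age,Height,Weight
Alex,M,41,74,170
Bert,M,42,68,166
Carl,M,32,70,155
Dave,M,39,72,167
"""

sample = pd.read_csv(io.StringIO(csv_text))

# One entry per column: the type Pandas inferred while parsing
print(sample.dtypes)

# (number of rows, number of columns)
print(sample.shape)
```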


Inside Jupyter Notebook, we can also get a nicely formatted view of the data by just typing the variable name on the last line of a cell. This rendering is a notebook feature: in a plain script, a bare variable name displays nothing, so you would need `print(df)` instead. I will use this method for displaying data here, just because it looks better.

In [57]:
df

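    Name Sex  Age  Height  Weight
0   Alex   M   41      74     170
1   Bert   M   42      68     166
2   Carl   M   32      70     155
3   Dave   M   39      72     167
4   Elly   F   30      66     124
5   Fran   F   33      66     115
6   Gwen   F   26      64     121
7   Hank   M   30      71     158
8   Ivan   M   53      72     175
9   Jake   M   32      69     143
10  Kate   F   47      69     139
11  Luke   M   34      72     163
12  Myra   F   23      62      98
13  Neil   M   36      75     160
14  Omar   M   38      70     145
15  Page   F   31      67     135
16  Quin   M   29      71     176
17  Ruth   F   28      65     131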


DataFrames are built with some assumptions in mind. Every row is an "observation", in our case one person. Every column is a variable, like "Age" or "Height". If we take a single row or a single column out of our DataFrame, we get a one-dimensional object called a "Series". The names of these objects aren't terribly important, but knowing the actual names can help with debugging.

DataFrames can be indexed much like lists and strings. Keep in mind that we have two dimensions here (one for rows and one for columns), so it will take two indices to get one value from the DataFrame. We can start by just extracting a given column with the name we want:

In [48]:
col = df["Age"]
col

0     41
1     42
2     32
3     39
4     30
5     33
6     26
7     30
8     53
9     32
10    47
11    34
12    23
13    36
14    38
15    31
16    29
17    28
Name: Age, dtype: int64

You can see that the type of the data in this column (referred to as a `dtype`) is `int64`. Pandas figured out that all of the data in this column are integers, and cast them all correctly.

We use the `[]` operator to access the data in the DataFrame, but depending on what we put inside the brackets, we can access either rows or columns. Above, we put the string `"Age"` in the brackets. If we put a range of row positions instead, we can extract rows:

In [126]:
df[0:1]

   Name Sex  Age  Height  Weight
0  Alex   M   41      74     170


Pandas is picky about indexing here: it won't accept a single integer inside the brackets, because a lone value such as `df[0]` is treated as a column name. If we want just one row this way, we have to use a range inside the brackets, even if that range only covers one row.
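
The difference between a range and a single integer can be seen in a short sketch. It is an illustration rather than part of the original notebook, and it uses a small inline copy of the first four rows (the names `csv_text` and `sample` are made up here) so that it runs without the file.

```python
import io
import pandas as pd

# Inline stand-in for biostats.csv (first four rows only)
csv_text = """Name,Sex,Age,Height,Weight
Alex,M,41,74,170
Bert,M,42,68,166
Carl,M,32,70,155
Dave,M,39,72,167
"""
sample = pd.read_csv(io.StringIO(csv_text))

# A range selects rows by position, even when it covers only one row
first_row = sample[0:1]
print(first_row)

# A bare integer is looked up as a column label, and no column is named 0
try:
    sample[0]
except KeyError as err:
    print("KeyError:", err)
```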

There is an indexer called `iloc` that is recommended for accessing observations (rows) by position. It is used with square brackets rather than parentheses, and works like this:

In [128]:
df.iloc[0]

Name      Alex
Sex          M
Age         41
Height      74
Weight     170
Name: 0, dtype: object

We can store this row in a variable and then access its values by name, or skip the variable step and get the data we want right away:

In [None]:
my_row = df.iloc[0]
my_row.Age

#equivalent
#df.iloc[0].Age
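
Because a DataFrame has two dimensions, `iloc` also accepts a row position and a column position together, which returns a single value. The sketch below is an illustration on a small inline copy of the first four rows, not part of the original notebook; `csv_text` and `sample` are names made up for this example.

```python
import io
import pandas as pd

# Inline stand-in for biostats.csv (first four rows only)
csv_text = """Name,Sex,Age,Height,Weight
Alex,M,41,74,170
Bert,M,42,68,166
Carl,M,32,70,155
Dave,M,39,72,167
"""
sample = pd.read_csv(io.StringIO(csv_text))

row = sample.iloc[0]          # first row, returned as a Series
age = sample.iloc[0, 2]       # row 0, column 2 (the "Age" column)
first_two = sample.iloc[0:2]  # rows 0 and 1, returned as a DataFrame

print(type(row).__name__)     # Series
print(age)                    # 41
print(len(first_two))         # 2
```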

We can also access a column as an attribute of the DataFrame, using a period and the column name instead of brackets:

In [62]:
df.Age

0     41
1     42
2     32
3     39
4     30
5     33
6     26
7     30
8     53
9     32
10    47
11    34
12    23
13    36
14    38
15    31
16    29
17    28
Name: Age, dtype: int64
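
The two spellings return the same data, but the period form only works when the column name is a valid Python name. As a hedged illustration (the renamed column `"Height (in)"` is hypothetical and does not appear in the file used here, and `csv_text` and `sample` are made-up names for an inline copy of the first four rows):

```python
import io
import pandas as pd

# Inline stand-in for biostats.csv (first four rows only)
csv_text = """Name,Sex,Age,Height,Weight
Alex,M,41,74,170
Bert,M,42,68,166
Carl,M,32,70,155
Dave,M,39,72,167
"""
sample = pd.read_csv(io.StringIO(csv_text))

# Both spellings return the same Series
same = sample.Age.equals(sample["Age"])
print(same)  # True

# Hypothetical column name containing a space and parentheses:
# it cannot be written after a period, so brackets are required
renamed = sample.rename(columns={"Height": "Height (in)"})
heights = renamed["Height (in)"].tolist()
print(heights)  # [74, 68, 70, 72]
```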

We can get a little fancy and use this to filter our data according to some expression that returns a boolean value. First, we will ask, for every value in the age column, whether that value is higher than 30:

In [63]:
df.Age > 30

0      True
1      True
2      True
3      True
4     False
5      True
6     False
7     False
8      True
9      True
10     True
11     True
12    False
13     True
14     True
15     True
16    False
17    False
Name: Age, dtype: bool

You can see that each row now contains either `True` or `False`, depending on its value. If we place this result inside the brackets, we keep only the rows marked `True`:

In [64]:
df[df.Age > 30]

    Name Sex  Age  Height  Weight
0   Alex   M   41      74     170
1   Bert   M   42      68     166
2   Carl   M   32      70     155
3   Dave   M   39      72     167
5   Fran   F   33      66     115
8   Ivan   M   53      72     175
9   Jake   M   32      69     143
10  Kate   F   47      69     139
11  Luke   M   34      72     163
13  Neil   M   36      75     160
14  Omar   M   38      70     145
15  Page   F   31      67     135

Now we have only the rows where the age is greater than 30. You can filter on any expression you want, as long as it produces one boolean value per row. There's one key difference from ordinary Python here, though. You can't use the normal `and` and `or` operators to combine conditions. You have to use `&` for and, and `|` for or. These are Python's bitwise operators, which Pandas applies element by element. Many other programming languages use similar symbols for logical operations, so you might come across them elsewhere. Just be sure to wrap each condition in parentheses: `&` and `|` bind more tightly than comparisons like `<` and `>`, so leaving the parentheses out usually produces a very unhelpful error.

In [158]:
df[(df.Height > 70) & (df.Weight > 170)]
#df[(df.Age < 30) | (df.Age > 40)]

    Name Sex  Age  Height  Weight
8   Ivan   M   53      72     175
16  Quin   M   29      71     176
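
To make the operator rule concrete, here is a minimal sketch on a small inline copy of the first four rows. It is an illustration, not part of the original notebook; the names `csv_text`, `sample`, `tall` and `heavy` and the cutoff values are made up for this example.

```python
import io
import pandas as pd

# Inline stand-in for biostats.csv (first four rows only)
csv_text = """Name,Sex,Age,Height,Weight
Alex,M,41,74,170
Bert,M,42,68,166
Carl,M,32,70,155
Dave,M,39,72,167
"""
sample = pd.read_csv(io.StringIO(csv_text))

tall = sample.Height > 70    # one True/False per row
heavy = sample.Weight > 165

both = sample[tall & heavy]    # rows where both conditions hold
either = sample[tall | heavy]  # rows where at least one condition holds

print(both.Name.tolist())    # ['Alex', 'Dave']
print(either.Name.tolist())  # ['Alex', 'Bert', 'Dave']

# The plain Python keyword "and" tries to reduce each column of booleans
# to a single True or False, which Pandas refuses to do
try:
    sample[(sample.Height > 70) and (sample.Weight > 165)]
except ValueError as err:
    print("ValueError:", err)
```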


There is also a useful function called `isin()` that we can use to keep only the rows whose values appear in a given list:

In [109]:
desired_rows = df['Name'].isin(['Bert', 'Fran', 'Page'])
df[desired_rows]

    Name Sex  Age  Height  Weight
1   Bert   M   42      68     166
5   Fran   F   33      66     115
15  Page   F   31      67     135


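Now that we've seen a few ways to access data within a DataFrame, we can sidestep positional indexing altogether if we want. The `loc` indexer selects rows by label or by a boolean condition, and columns by name, in a single step. Say we want to find the height and age of the person named Carl: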

In [141]:
df.loc[df.Name == 'Carl', ['Height', 'Age']]

   Height  Age
2      70   32
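
The result above is still a DataFrame with one row. If a single number is wanted instead, one option is to ask `loc` for one column and then take the first matching entry. This is a sketch on a small inline copy of the first four rows, not part of the original notebook; `csv_text`, `sample`, `carl` and `carl_height` are names made up for this example.

```python
import io
import pandas as pd

# Inline stand-in for biostats.csv (first four rows only)
csv_text = """Name,Sex,Age,Height,Weight
Alex,M,41,74,170
Bert,M,42,68,166
Carl,M,32,70,155
Dave,M,39,72,167
"""
sample = pd.read_csv(io.StringIO(csv_text))

# A list of column names returns a DataFrame (here: one row, two columns)
carl = sample.loc[sample.Name == "Carl", ["Height", "Age"]]
print(carl)

# A single column name returns a Series; iloc[0] takes its first entry
carl_height = sample.loc[sample.Name == "Carl", "Height"].iloc[0]
print(carl_height)  # 70
```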


If you think all of these functions are confusing, then you are not alone. It will take time to get used to the different ways of accessing data. There is a very helpful cheat sheet available here: http://pandas.pydata.org/Pandas_Cheat_Sheet.pdf

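For a final example of the functionality of Pandas, we can view some built-in statistical information. The mean is only defined for numbers, so we pass `numeric_only=True` to skip the `Name` and `Sex` columns. Recent versions of Pandas raise an error without it:

In [87]:
df.mean(numeric_only=True)

Age        34.666667
Height     69.055556
Weight    146.722222
dtype: float64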

In [82]:
df.max()

Name      Ruth
Sex          M
Age         53
Height      75
Weight     176
dtype: object

In [83]:
df.min()

Name      Alex
Sex          F
Age         23
Height      62
Weight      98
dtype: object
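
The same functions also work on a single column, where they return one value rather than one value per column. The sketch below is an illustration on a small inline copy of the first four rows, so its numbers differ from the full-file results above; `csv_text`, `sample` and the result names are made up for this example.

```python
import io
import pandas as pd

# Inline stand-in for biostats.csv (first four rows only)
csv_text = """Name,Sex,Age,Height,Weight
Alex,M,41,74,170
Bert,M,42,68,166
Carl,M,32,70,155
Dave,M,39,72,167
"""
sample = pd.read_csv(io.StringIO(csv_text))

mean_height = sample.Height.mean()   # average of 74, 68, 70 and 72
max_weight = sample["Weight"].max()  # largest value in the column
youngest = sample.Age.min()          # smallest value in the column

print(mean_height)  # 71.0
print(max_weight)   # 170
print(youngest)     # 32
```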

Try returning all of the rows where the height is within one inch of the mean: