# Tutorial 4: Pandas Basics

This tutorial covers reading data files, DataFrames, and data selection.

Pandas is one of the most popular Python libraries for Data Science and Analytics. I like to say it’s the “SQL of Python.” Why? Because pandas helps you to manage two-dimensional data tables in Python. Of course, it has many more features. In this pandas tutorial series, I’ll show you the most important (that is, the most often used) things that you have to know as an Analyst or a Data Scientist. This is the first episode and we will start from the basics!

Import numpy and pandas to your Jupyter Notebook by running these two lines in a cell:

In [None]:
import numpy as np
import pandas as pd

> Note: It’s conventional to refer to `pandas` as `pd`. When you add `as pd` at the end of your import statement, Python binds the name `pd` to the `pandas` library, so from this point on every time you type `pd`, you are actually referring to `pandas`.

## How to open data files in pandas
You might have your data in .csv files or SQL tables. Maybe Excel files. Or .tsv files. Or something else. But the goal is the same in all cases. If you want to analyze that data using pandas, the first step will be to read it into a data structure that’s compatible with pandas.

## Loading a .csv file into a pandas DataFrame

Let’s load a .csv data file into pandas!
There is a function for it, called `read_csv()`.

In [None]:
titanic = pd.read_csv('../data/titanic-train.csv', delimiter=',')

The `titanic` variable now holds a nice 2D table. Well, this is a _pandas DataFrame_. When you print it, the numbers on the left are the index labels, and the column names at the top are picked up from the first row of our `titanic-train.csv` file.
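If you don’t have the data file at hand, here is a minimal sketch of the same idea. The three-row CSV text below is a made-up sample (not the real `titanic-train.csv`), and `io.StringIO` is used only so the example runs without a file on disk:

```python
import io
import pandas as pd

# Hypothetical sample standing in for titanic-train.csv
csv_text = """PassengerId,Survived,Pclass,Sex
1,0,3,male
2,1,1,female
3,1,3,female
"""

# read_csv accepts a file path or a file-like object
sample = pd.read_csv(io.StringIO(csv_text), delimiter=',')

print(sample.columns.tolist())  # column names come from the first row
print(sample.index.tolist())    # default index: 0, 1, 2
print(sample.shape)             # (number of rows, number of columns)
```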

## Saving a pandas DataFrame into a .csv

There is a method for it, called `to_csv()`. Passing `index=False` leaves the index column out of the file:

In [None]:
titanic.to_csv('../data/test.csv', index=False)
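To see what `index=False` changes, here is a small sketch on a made-up two-row DataFrame (the `sample` data is hypothetical). When `to_csv()` gets no file path, it returns the CSV text as a string, which makes the difference easy to inspect:

```python
import pandas as pd

# Hypothetical two-row table, not the real Titanic data
sample = pd.DataFrame({'PassengerId': [1, 2], 'Pclass': [3, 1]})

# With no path argument, to_csv returns the CSV text instead of writing a file
with_index = sample.to_csv()
without_index = sample.to_csv(index=False)

print(with_index)     # header starts with an empty field for the index column
print(without_index)  # header holds only the real column names
```

Without `index=False`, reading the file back later gives you an extra unnamed column, which is rarely what you want.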

## Selecting data from a dataframe in pandas
Let’s start with a few very basic data selection methods.

### Print the whole dataframe

In [None]:
titanic

### Print a sample of your dataframe

Sometimes, it’s handy not to print the whole dataframe and flood your screen with data. When a few lines are enough, you can print only the first 5 lines by typing:

In [None]:
titanic.head()

Or the last few lines by typing:

In [None]:
titanic.tail()
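Both methods also take an optional number of rows. A quick sketch on a made-up ten-row DataFrame (the `sample` data is hypothetical):

```python
import pandas as pd

# Hypothetical ten-row table
sample = pd.DataFrame({'PassengerId': list(range(1, 11))})

first_three = sample.head(3)  # first 3 rows instead of the default 5
last_two = sample.tail(2)     # last 2 rows instead of the default 5

print(first_three)
print(last_two)
```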

### Select specific columns of your dataframe

This one is a bit tricky! Let’s say you want to print the `PassengerId` and the `Pclass` columns only.
You should use this syntax:

In [None]:
titanic[['PassengerId', 'Pclass']]

Any guesses why we have to use double brackets? It seems a bit over-complicated, I admit, but maybe this will help you remember: the outer brackets tell pandas that you want to select columns, and the inner brackets are for the list of the column names (remember? Python lists go between square brackets).

By the way, if you change the order of the column names, the order of the returned columns will change, too:

In [None]:
titanic[['Pclass', 'PassengerId']]
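One way to convince yourself about the double brackets is to store the inner list in a variable first. This is only a sketch on a made-up three-row DataFrame; `sample` and `columns_to_keep` are hypothetical names:

```python
import pandas as pd

# Hypothetical three-row table
sample = pd.DataFrame({
    'PassengerId': [1, 2, 3],
    'Survived': [0, 1, 1],
    'Pclass': [3, 1, 3],
})

# The inner brackets are just a regular Python list...
columns_to_keep = ['PassengerId', 'Pclass']

# ...and the outer brackets do the column selection
subset = sample[columns_to_keep]

print(subset)
```

Written this way, `sample[columns_to_keep]` is exactly the same selection as `sample[['PassengerId', 'Pclass']]`.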

> Note: Sometimes (especially in predictive analytics projects), you want to get Series objects instead of DataFrames. You can get a Series by selecting only one column, either with single brackets (`titanic['Pclass']`) or with the attribute syntax:

In [None]:
titanic.Pclass
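If you are not sure which kind of object a selection gives you, `type()` will tell you. A small sketch on a made-up DataFrame (the `sample` data is hypothetical):

```python
import pandas as pd

# Hypothetical three-row table
sample = pd.DataFrame({'PassengerId': [1, 2, 3], 'Pclass': [3, 1, 3]})

as_attribute = sample.Pclass          # one column, attribute syntax
as_single_brackets = sample['Pclass']  # one column, single brackets
as_double_brackets = sample[['Pclass']]  # a list with one column name

print(type(as_attribute))        # a pandas Series
print(type(as_single_brackets))  # a pandas Series
print(type(as_double_brackets))  # a pandas DataFrame with one column
```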

### Filter for specific values in your dataframe

Let’s say you want to see a list of only the passengers who traveled in first class. In this case you have to filter for the `1` value in the `Pclass` column:

```Python
titanic[titanic.Pclass == 1]
```

Step 1) First, pandas evaluates the expression between the brackets for every row: is the `titanic.Pclass` column’s value `1` or not? The result is a Series of boolean values (`True` or `False`).

In [None]:
titanic.Pclass == 1

Step 2) Then from the `titanic` table, it returns every row where this value is `True` and leaves out every row where it’s `False`.

In [None]:
titanic[titanic.Pclass == 1]
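The two steps are easier to follow on a table small enough to check by eye. Here is a sketch with a made-up four-row DataFrame; `sample`, `mask` and `first_class` are hypothetical names:

```python
import pandas as pd

# Hypothetical four-row table
sample = pd.DataFrame({'PassengerId': [1, 2, 3, 4], 'Pclass': [3, 1, 3, 1]})

# Step 1: one True/False value per row
mask = sample.Pclass == 1
print(mask.tolist())  # [False, True, False, True]

# Step 2: keep only the rows where the mask is True
first_class = sample[mask]
print(first_class)
```

Note that the filtered rows keep their original index labels (`1` and `3` here); pandas does not renumber them.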

### Functions can be used after each other

It’s very important to understand that pandas’s logic is very linear (compared to SQL, for instance). Most pandas methods return a new DataFrame or Series, so after you apply one, you can apply another one directly on its result. In this case, the input of the latter function will always be the output of the previous function.

E.g. combine these two selection methods:

In [None]:
titanic.head()[['Pclass', 'PassengerId']]

Could you get the same result with a different chain of functions? Of course you can:

In [None]:
titanic[['Pclass', 'PassengerId']].head()
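You can check that the two chains really match with `.equals()`. The sketch below uses a made-up eight-row DataFrame (the `sample` data is hypothetical), and it also shows a case where the order of the steps does change the result:

```python
import pandas as pd

# Hypothetical eight-row table
sample = pd.DataFrame({
    'PassengerId': list(range(1, 9)),
    'Pclass': [3, 1, 3, 1, 3, 3, 1, 3],
})

# Picking rows and picking columns can be done in either order
rows_then_columns = sample.head()[['Pclass', 'PassengerId']]
columns_then_rows = sample[['Pclass', 'PassengerId']].head()
print(rows_then_columns.equals(columns_then_rows))  # True

# Filtering and head() are a different story: each step only sees
# the output of the previous one
filter_then_head = sample[sample.Pclass == 3].head(2)
top_two = sample.head(2)
head_then_filter = top_two[top_two.Pclass == 3]

print(filter_then_head['PassengerId'].tolist())  # [1, 3]
print(head_then_filter['PassengerId'].tolist())  # [1]
```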

## Test yourself!


Select the `PassengerId`, the `Pclass` and the `Sex` columns for the passengers who are from class 3. Print the first five rows only!

In [None]:
titanic[titanic.Pclass == 3][['PassengerId', 'Pclass', 'Sex']].head()

### More resources

https://chrisalbon.com/