# Introduction to Pandas

This Jupyter Notebook introduces the `pandas` library and shows how to use it for working with data.

## What is `pandas`?
Pandas is a popular Python library with many tools that make it easier to visualize, inspect, and slice data. You may find it helpful in this class.

In [2]:
import pandas as pd
# the above line of code imports the pandas library, and renames it `pd` so we can type it more easily

### Loading in Data

To load a .csv file, we'll use the following function:
```python 
pd.read_csv("some file path here")
```

In [3]:
titanic_data = pd.read_csv('titanic_dataset.csv')
# Replace the string above with the path to YOUR titanic csv
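
If you don't have the file handy yet, you can still try `pd.read_csv` on text held in memory. This is only a minimal sketch: the three rows are copied from the output shown later in this notebook (with a few columns left out), and `csv_text` and `sample` are made-up names for illustration.

```python
import io

import pandas as pd

# A tiny CSV typed inline (assumption: a cut-down version of the Titanic columns).
# io.StringIO lets read_csv treat the string as if it were a file.
csv_text = """PassengerId,Survived,Pclass,Sex,Age,Fare
1,0,3,male,22.0,7.25
2,1,1,female,38.0,71.2833
3,1,3,female,26.0,7.925
"""

sample = pd.read_csv(io.StringIO(csv_text))
print(sample)
```

The first line of the text becomes the column names, and each later line becomes one row.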

### Taking a look at data

In [4]:
titanic_data
# Run this cell and see what happens.

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


### Woah. What are we looking at?

The table above is called a `DataFrame`. It's one of the most important `pandas` objects to know.

Below I'll share some of the things you can do with `DataFrame` objects that would be more difficult to do without them.

### Handy `DataFrame` attributes

Let's look at three attributes that `DataFrame` objects have: `.columns`, `.shape` and `.dtypes`.

Copy each snippet below into its own cell and see what it does.

Example:

**`.columns`**

```python
titanic_data.columns
```

**`.shape`**

```python
titanic_data.shape
```

**`.dtypes`**

```python
titanic_data.dtypes
```

In [5]:
### Your code here

titanic_data.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
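
Here is a rough sketch of all three attributes on a toy `DataFrame`, so you can see what each one returns without scrolling through 891 rows. The frame `toy` is an assumption made up for this example, using three passengers from the output above.

```python
import pandas as pd

# Toy data (illustrative only): three passengers from the table shown earlier.
toy = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina", "Allen, Mr. William Henry"],
    "Age": [22.0, 26.0, 35.0],
    "Survived": [0, 1, 0],
})

print(list(toy.columns))  # the column labels, in order
print(toy.shape)          # a tuple: (number of rows, number of columns)
print(toy.dtypes)         # the data type of each column
```

Note that these are attributes, not methods, so there are no parentheses after them.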

### Handy DataFrame methods

#### Taking a look at the `.head()` and `.tail()`

The `.head()` and `.tail()` methods let you see the first or last few rows of a `DataFrame`.

Try running the following code in your own cell and see what gets produced:

**`.head()`**

```python
titanic_data.head()
```

**`.tail()`**

```python
titanic_data.tail()
```

In [6]:
### Your code here

titanic_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


### Handy DataFrame methods

#### `describe`, `count`, `min`, `max`, `std`, `corr`


Try running each of the methods above to see what happens.

Example:

```python
titanic_data.describe()
```

In [7]:
### Your code here

titanic_data.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292
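
In the `describe()` output above, `Age` has a count of 714 while the other columns have 891. That is because `count` only counts values that are not missing. The sketch below shows this, along with a few of the other methods, on a made-up frame called `toy`. One assumption to flag: `numeric_only=True` needs pandas 1.5 or later, and in recent versions `corr` may raise an error on text columns without it.

```python
import pandas as pd

# Toy data (illustrative only); None stands in for a missing age.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, None, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 8.05],
})

print(toy.count())                           # non-missing values per column
print(toy["Fare"].min(), toy["Fare"].max())  # smallest and largest fare
print(toy["Age"].std())                      # standard deviation, skipping the missing age
print(toy.corr(numeric_only=True))           # correlations between the numeric columns only
```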


### Indexing into a `DataFrame`

You may now be curious how to get a specific column of a `DataFrame` object. Try this:

```python

fare = titanic_data['Fare']

```

In [10]:
### Your code here

fare = titanic_data['Fare']
fare.head(25)

0      7.2500
1     71.2833
2      7.9250
3     53.1000
4      8.0500
5      8.4583
6     51.8625
7     21.0750
8     11.1333
9     30.0708
10    16.7000
11    26.5500
12     8.0500
13    31.2750
14     7.8542
15    16.0000
16    29.1250
17    13.0000
18    18.0000
19     7.2250
20    26.0000
21    13.0000
22     8.0292
23    35.5000
24    21.0750
Name: Fare, dtype: float64

#### What is the above object?

Try using the `type()` function on the `fare` variable above.

It should be a `Series`. This is the second important data structure to be aware of in the `pandas` library.

A `Series` is very much like a Python dictionary in that it has a number of `keys` and associated `values`. It is often used to represent one column or row of a `DataFrame`.
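
To make the dictionary comparison concrete, here is a small sketch. The `Series` named `fares` is made up for illustration, using the first three fares from the output above.

```python
import pandas as pd

# Illustrative Series: three fares with the default 0, 1, 2 labels.
fares = pd.Series([7.25, 71.2833, 7.925], name="Fare")

print(type(fares))         # the pandas Series class
print(list(fares.index))   # the "keys" (pandas calls this the index)
print(list(fares.values))  # the "values"
print(fares[1])            # look up a value by its key, much like a dict
```

When a `Series` comes from a `DataFrame` column, its keys are the row labels of that `DataFrame`.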

### Challenges:

Try to do each of the following:

1. Make a new DataFrame that is equal to only the first 125 rows of the data.

2. Make a new DataFrame that is equal to only the columns `Fare` and `Embarked` of the original dataset.

3. Make a new DataFrame that is equal to the first 25 rows and the `Sex` and `Survived` columns.
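
These challenges need two things we haven't used yet: taking rows by position and selecting several columns at once. Here is one possible sketch on a toy frame (the names `toy`, `first_two_rows`, `two_columns` and `both` are made up), so the challenges themselves are still yours to solve.

```python
import pandas as pd

# Toy data (illustrative only).
toy = pd.DataFrame({
    "Name": ["Braund", "Cumings", "Heikkinen", "Futrelle", "Allen"],
    "Pclass": [3, 1, 3, 1, 3],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
})

first_two_rows = toy.iloc[:2]               # rows by position: the first two
two_columns = toy[["Name", "Age"]]          # a list of names gives back a DataFrame
both = toy[["Name", "Pclass"]].iloc[:3]     # pick the columns, then the first three rows

print(both)
```

Notice the double square brackets in `toy[["Name", "Age"]]`: the inner pair is a Python list of column names.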


In [11]:
### Your code here



### Indexing by Conditions

Say you want only the rows of a `DataFrame` that meet some condition. You can start with something like the following:

In [14]:
# What does this do?
is_female = titanic_data['Sex'] == 'female'

In [15]:
# Let's inspect it.
is_female

0      False
1       True
2       True
3       True
4      False
       ...  
886    False
887     True
888     True
889    False
890    False
Name: Sex, Length: 891, dtype: bool

### What's going on?

The above code has produced a `Series` object containing only `True` or `False`. We call this a `Boolean Series` because the `Series` contains only `Boolean` (i.e., `True` or `False`) values.

Why is this useful?

Because we can do the following:

In [16]:
# Run this cell

only_female_passengers = titanic_data[is_female]

In [17]:
# Now let's inspect it. We'll use the same `.head()` from before. It also works with Series objects!
# Let's look at the first 10 rows
only_female_passengers.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7,G6,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.55,C103,S
14,15,0,3,"Vestrom, Miss. Hulda Amanda Adolfina",female,14.0,0,0,350406,7.8542,,S
15,16,1,2,"Hewlett, Mrs. (Mary D Kingcome)",female,55.0,0,0,248706,16.0,,S
18,19,0,3,"Vander Planke, Mrs. Julius (Emelia Maria Vande...",female,31.0,1,0,345763,18.0,,S
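
Here is the same idea shrunk down to four rows, which makes it easier to see which rows survive the filter. The frame `toy` is made up for this sketch.

```python
import pandas as pd

# Toy data (illustrative only).
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Fare": [7.25, 71.2833, 7.925, 8.05],
})

is_female = toy["Sex"] == "female"  # Boolean Series with one True/False per row
only_female = toy[is_female]        # keeps only the rows where the mask is True

print(is_female.sum())              # True counts as 1, so this counts the matching rows
print(only_female)
```

The filtered rows keep their original row labels, which is why the output above jumps from 3 to 8.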


## Multiple criteria: using `and`, `or` and `not` in conditions

Say you wanted passengers who were female, in Class 1, and over the age of 30. Here's how you could write that in one line of code:


#### Multiple `and` conditions

You use the `&` symbol to denote `and`.

```python

titanic_data[ (titanic_data['Sex'] == 'female') & (titanic_data['Pclass'] == 1) & (titanic_data['Age'] > 30) ]

```

In general, the syntax is:

```python

titanic_data[ (condition1) & (condition2) ... ]

```


#### Multiple `or` conditions

If you want to use an `or` condition, you simply use the `|` symbol:

```python

titanic_data[ (condition1) | (condition2) ... ]

```


#### Reversing a condition

Use the `~` symbol to reverse a condition.

For example, here's how you find all of the passengers that were **NOT** class 1.

```python

titanic_data[~(titanic_data['Pclass'] == 1)]

```
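
The three operators can be tried side by side on a toy frame. This is just a sketch with made-up data and variable names. Two things to keep in mind: Python's own `and`, `or` and `not` keywords raise an error on a `Series`, and each condition needs its own parentheses.

```python
import pandas as pd

# Toy data (illustrative only).
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "female"],
    "Pclass": [3, 1, 3, 1],
    "Age": [22.0, 38.0, 26.0, 35.0],
})

# Parentheses matter: & and | bind more tightly than == and <.
first_class_women = toy[(toy["Sex"] == "female") & (toy["Pclass"] == 1)]
young_or_first = toy[(toy["Age"] < 25) | (toy["Pclass"] == 1)]
not_first = toy[~(toy["Pclass"] == 1)]

print(first_class_women)
print(young_or_first)
print(not_first)
```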

#### You can also use `.isin` to check if a column is one of multiple values

```python

values_to_check = ['S', 'C']

subset = titanic_data[titanic_data['Embarked'].isin( values_to_check )]

```
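
As a quick sketch on made-up data, `.isin` gives the same Boolean Series you would get by chaining `==` checks with `|`, just with less typing. The frame `toy` and its values are assumptions for illustration.

```python
import pandas as pd

# Toy data (illustrative only); None stands in for a missing port.
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S", None]})

in_s_or_c = toy["Embarked"].isin(["S", "C"])
long_way = (toy["Embarked"] == "S") | (toy["Embarked"] == "C")

print(in_s_or_c.tolist())
print(toy[in_s_or_c])
```

Rows with a missing value do not match, so they are left out of the subset.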


### Challenges

Try the following:

1. Find the number of passengers who survived, paid a fare above 50, and were NOT in class 1.

In [18]:
### Your code here



## Resources

If you want to learn more about `pandas`, here are some resources I suggest:

0. [The official documentation on the two `DataFrame` and `Series` data structures](https://pandas.pydata.org/pandas-docs/stable/dsintro.html)
1. [A lot of examples of awesome, crazy ways to filter and slice a `DataFrame`](https://medium.com/dunder-data/selecting-subsets-of-data-in-pandas-39e811c81a0c)
2. [A list of the most common `pandas` functionality from the official documentation](https://pandas.pydata.org/pandas-docs/stable/10min.html)
3. [The official `pandas` documentation](https://pandas.pydata.org/pandas-docs/stable/index.html) -- click on one of the topics on the left hand side to navigate to it.