## DataFrame vs 2D Arrays

We have already seen one difference between DataFrames and NumPy arrays:
- NumPy arrays contain entries of the same type
- A Pandas DataFrame can contain columns of different data types
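
As a minimal sketch of this difference (the values and the names `arr` and `example` are made up for illustration), mixing numbers and strings in a NumPy array forces every entry to one common type, while each DataFrame column keeps its own:

```python
import numpy as np
import pandas as pd

# Mixing numbers and strings forces the whole array to a single (string) dtype.
arr = np.array([[1, 'a'], [2, 'b']])
print(arr.dtype)

# Each DataFrame column keeps its own dtype.
example = pd.DataFrame({'number': [1, 2], 'letter': ['a', 'b']})
print(example.dtypes)
```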

Another key difference between DataFrames and NumPy arrays is the index information they assume you'll use to query data.

|           |        | Ordered | Named |
| :--:      | :--:   | :--:    | :-:   |
| Array     | Rows   | ✔️       |       |
| Array     | Columns| ✔️       |       |
| DataFrame | Rows   | ✔️       | ✔️     |
| DataFrame | Columns|         | ✔️     |

This reliance on named indices makes it straightforward to work with Pandas data **relationally**, thinking of each row as a specific Record with named Fields. 

Let's see how it works.

### Indexing Rows

Because DataFrame rows are both ordered and named, they can be indexed by position (`df.iloc`), by name (`df.loc`), or by a slice between two names that relies on the row order. Column operations tend to be name-specific:

| Axis        | Ordered Index | Named Index      | Ordered Slice  | Named Slice               | Named, Ordered Slice    |
| :--:        | :--:          | :--:             | :--:           | :--:                      | :--:                    |
| **Rows**    | `df.iloc[0]`  | `df.loc['John']` | `df.iloc[0:2]` | `df.loc[['Jim', 'John']]` | `df.loc['Jim':'Jenny']` |
| **Columns** |               | `df['Q1']`       |                | `df[['Q1', 'Q4']]`        |                         |


To reduce typing, Pandas also offers shorthand for some of these operations:

| Axis        | Ordered Index | Named Index | Ordered Slice | Named Slice        | Named, Ordered Slice |
| :--:        | :--:          | :--:        | :--:          | :--:               | :--:                 |
| **Rows**    |               |             | `df[0:2]`     |                    | `df['Jim':'Jenny']`  |
| **Columns** |               | `df.Q1`     |               | `df[['Q1', 'Q4']]` |                      |

**Note**: Square brackets, not round brackets (a.k.a. parentheses), are used after `df.loc` and `df.iloc`.

In [3]:
import pandas as pd

In [20]:
df = pd.DataFrame({ 
    'Name': ['Nick', 'Jenn', 'Joe', "Mo", "Anni"],
    'Age': [31, 55, 25, 29, 38], 
    'Height': [2.9, 1.2, 1.2, 1.8, 1.6],
})
df

|      | Name | Age  | Height |
| :--: | :--: | :--: | :--:   |
| 0    | Nick | 31   | 2.9    |
| 1    | Jenn | 55   | 1.2    |
| 2    | Joe  | 25   | 1.2    |
| 3    | Mo   | 29   | 1.8    |
| 4    | Anni | 38   | 1.6    |


In [21]:
df.set_index("Name", inplace=True)

In [22]:
df

| Name | Age  | Height |
| :--: | :--: | :--:   |
| Nick | 31   | 2.9    |
| Jenn | 55   | 1.2    |
| Joe  | 25   | 1.2    |
| Mo   | 29   | 1.8    |
| Anni | 38   | 1.6    |
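
With `Name` as the index, the indexing styles from the tables above can be tried on this data. The following is a small sketch rather than a complete tour; it rebuilds the same DataFrame so it runs on its own, and the variable names on the left are only illustrative:

```python
import pandas as pd

# Same data as above, with 'Name' set as the row index.
df = pd.DataFrame({
    'Name': ['Nick', 'Jenn', 'Joe', 'Mo', 'Anni'],
    'Age': [31, 55, 25, 29, 38],
    'Height': [2.9, 1.2, 1.2, 1.8, 1.6],
}).set_index('Name')

first_row = df.iloc[0]            # by position: the row for 'Nick'
joe = df.loc['Joe']               # by name
first_two = df.iloc[0:2]          # positional slice: the end position is excluded
jenn_to_mo = df.loc['Jenn':'Mo']  # label slice: the end label is included
ages = df['Age']                  # a single column, returned as a Series
joe_age = df.loc['Joe', 'Age']    # row name and column name together
```

Note the asymmetry between the two slices: `iloc` follows the usual Python convention of excluding the end, while a `loc` slice includes the end label.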


## Exercise

1. Load the Titanic dataset (from the data folder) as a DataFrame.

2. Select the `age` column.

3. Get rows 10-16.

4. Select the first 5 rows of the `sex` column.

5. Select the `fare` column.

6. Select the `survived` and `age` columns.

7. Select the last 3 rows of the `alive` column.

## Summarizing / Aggregating Data in DataFrames

Pandas also supplies many different aggregation functions as methods:

```python
df.mean(numeric_only=True)  # mean of every numeric column
df['Column'].mean()         # mean of a single column
```
<br>

**Examples**: `mean`, `median`, `max`, `min`, `count`, `value_counts`, `unique`
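
Here is a short sketch of a few of these methods on a small, made-up DataFrame (the name `scores` and its columns are hypothetical, chosen only for illustration):

```python
import pandas as pd

# Hypothetical data, for illustration only.
scores = pd.DataFrame({
    'group': ['a', 'b', 'a', 'b'],
    'score': [1.0, 2.0, 3.0, 6.0],
})

mean_score = scores['score'].mean()             # average of one column
max_score = scores['score'].max()               # largest value
n_values = scores['score'].count()              # number of non-missing values
group_sizes = scores['group'].value_counts()    # how many rows per distinct value
groups = scores['group'].unique()               # the distinct values themselves
numeric_means = scores.mean(numeric_only=True)  # one mean per numeric column
```

The first three return a single number, while `value_counts`, `unique`, and the DataFrame-level `mean` return a collection of results.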

## Exercise

In [1]:
import pandas as pd

In [2]:
url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv'
df = pd.read_csv(url)
df.head()

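|      | survived | pclass | sex    | age  | sibsp | parch | fare    | embarked | class | who   | adult_male | deck | embark_town | alive | alone |
| :--: | :--:     | :--:   | :--:   | :--: | :--:  | :--:  | :--:    | :--:     | :--:  | :--:  | :--:       | :--: | :--:        | :--:  | :--:  |
| 0    | 0        | 3      | male   | 22.0 | 1     | 0     | 7.25    | S        | Third | man   | True       | NaN  | Southampton | no    | False |
| 1    | 1        | 1      | female | 38.0 | 1     | 0     | 71.2833 | C        | First | woman | False      | C    | Cherbourg   | yes   | False |
| 2    | 1        | 3      | female | 26.0 | 0     | 0     | 7.925   | S        | Third | woman | False      | NaN  | Southampton | yes   | True  |
| 3    | 1        | 1      | female | 35.0 | 1     | 0     | 53.1    | S        | First | woman | False      | C    | Southampton | yes   | False |
| 4    | 0        | 3      | male   | 35.0 | 0     | 0     | 8.05    | S        | Third | man   | True       | NaN  | Southampton | no    | True  |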


1. What is the mean ticket fare that the passengers paid on the Titanic?

2. How many passengers does this dataset contain?

3. How many men and women are in this dataset? (`value_counts()`)

4. What proportion of the passengers were alone on the Titanic?

5. How many different classes were on the Titanic?

6. How many passengers of each sex were traveling in each class?

## Querying Data via Logical Indexing

To select rows based on their values, Pandas supports logical (boolean) indexing, similar to NumPy. For example, to get all the rows of a DataFrame where `Column1` is positive:

```python
positive_rows = df['Column1'] > 0
df[positive_rows]
```
<br>

Often, this is done in a single line:

```python
df[df['Column1'] > 0]
```
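
Conditions can also be combined, negated, or paired with a column selection. The sketch below uses a small, made-up DataFrame (`data` and its values are hypothetical); note that each comparison needs its own parentheses when combined with `&` or `|`:

```python
import pandas as pd

# Hypothetical data, for illustration only.
data = pd.DataFrame({
    'Column1': [-2, 5, 0, 8],
    'Column2': ['x', 'y', 'x', 'x'],
})

mask = data['Column1'] > 0  # a boolean Series, one True/False per row
positive = data[mask]       # keeps only the rows where mask is True

# Combine conditions with & (and) or | (or); negate with ~.
positive_x = data[(data['Column1'] > 0) & (data['Column2'] == 'x')]
not_positive = data[~mask]

# Filter rows and pick a column in one step, then aggregate.
mean_x = data.loc[data['Column2'] == 'x', 'Column1'].mean()
```

Python's `and`, `or`, and `not` keywords do not work element-wise on a Series, which is why the `&`, `|`, and `~` operators are used here.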


## Exercise

Let's go back to the Titanic dataset and do some data querying.

1. Did the oldest passenger on the Titanic survive?

2. Where did the youngest passenger on the Titanic embark from?

3. How many passengers on the Titanic embarked from Cherbourg?

4. What is the mean ticket fare for the 1st class?

... What about the 2nd class?

... What about the 3rd class?

5. What was the average age of female passengers?

6. What percentage of the female passengers survived?

## Transforming Data

A transformation can be applied to every element of a column, or even to all columns of a DataFrame, in a single step. Here are several ways to do it:

NumPy-like operator syntax with broadcasting:
```python
df['Column1'] * 100
```
<br>

Function-style syntax:
```python
import numpy as np
np.sqrt(df['Column1'])
```
<br>

Methods on the column itself (here, string methods via the `.str` accessor):
```python
df['Column1'].str.upper()
```
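
Each of these expressions returns a new Series and leaves the DataFrame unchanged. To keep a result, assign it to a column name. Here is a minimal sketch on made-up data (`items` and its column names are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical data, for illustration only.
items = pd.DataFrame({
    'label': ['ab', 'cd', 'ef'],
    'value': [1.0, 4.0, 9.0],
})

# Assigning to a new column name stores the transformed values.
items['value_x100'] = items['value'] * 100         # operator with broadcasting
items['root'] = np.sqrt(items['value'])            # NumPy function
items['label_upper'] = items['label'].str.upper()  # string method via .str
items['is_large'] = items['value'] > 3             # comparisons give True/False columns
```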


## Exercise

1. Make a new column called `OnTitanic`, with all of the values set to `True`.

2. Make a new column called `isAdult`, with `True` values if they were 18 or older and `False` if not.

3. Get everyone's age if they were still alive today (hint: the Titanic sank in 1912).

4. Make a column called `not_survived`, the opposite of the `survived` column.

## Further reading

Pandas has great documentation, tutorials, and examples. You can get started [here](https://pandas.pydata.org/docs/getting_started/).