# Introduction to Pandas
## Overview
### What You'll Learn
In this section, you'll learn:
1. What Pandas and dataframes are
1. How to load data into them
1. How to work with and modify data in dataframes
1. How to filter and sort to gain insight into data

### Prerequisites
Before starting this section, you should have an understanding of:
1. [Basic Python](https://github.com/HackBinghamton/PythonWorkshop)

### Introduction
**Pandas** is a Python library that provides high-performance, easy-to-use data structures and data analysis tools. To dive in further, feel free to check out the [documentation](https://pandas.pydata.org/).

### Setup

***Make sure to run the code block below to set the section up!***

In [None]:
# Install requirements
!pip install pandas
!pip install requests

# Fetch the data
import requests
url = "https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv"
response = requests.get(url)
response.raise_for_status()  # Stop early if the download failed
raw_data = response.text

# Write it locally
with open("titanic.csv", "w") as data_file:
    data_file.write(raw_data)

## Getting Started and Loading Data 

If you ever decide to use Pandas on your personal device, make sure to install it with:

```
pip3 install pandas
```

Pandas is usually imported like so:

In [None]:
import pandas as pd

This means that we'll reference the Pandas library as `pd` in code. The `pd` alias is a widely used convention, so you'll see it in most Pandas examples and documentation.

Now that we have Pandas imported in our code, let's get some data. 

### Loading the Titanic Dataset

For this section, we'll be using a dataset that contains information about those aboard the Titanic. It includes things such as how much their fare was, which class of service they purchased, their sex, their age, etc. Let's see what we can do with it!

To load data from a comma-separated values (CSV) file (a standard format for collections of data), you'll use the following command, specifying the name of the file to load in the quotation marks:

In [None]:
dataframe = pd.read_csv("titanic.csv")

Cool, we now have our data all neatly packed into a Pandas "DataFrame", which is like a spreadsheet with column names and row labels. We can let Colab print it out nicely for us:

In [None]:
dataframe
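
To see what `pd.read_csv` does without depending on the downloaded file, here's a minimal sketch that loads a tiny CSV from a string instead. The three rows and the passenger names are made up for illustration, and only a few of the real file's columns are included.

```python
import io

import pandas as pd

# Hypothetical three-row sample; the values are invented for illustration.
csv_text = """Survived,Pclass,Name,Sex,Age,Fare
0,3,Passenger A,male,22.0,7.25
1,1,Passenger B,female,38.0,71.2833
1,3,Passenger C,female,26.0,7.925
"""

# read_csv accepts a file path or any file-like object, such as StringIO.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)          # (number of rows, number of columns)
print(list(sample.columns))  # column names come from the header line
print(sample.head(2))        # the first two rows
```

The same `.shape`, `.columns`, and `.head()` calls work on the full `dataframe` loaded above.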

Sweet! Now, let's figure out how to do things with this data.

## Data Viewing

Pandas provides many ways to view different aspects of a dataset.

For starters, we can use the `.describe()` method to get some basic statistical information about our dataset:

In [None]:
dataframe.describe()

For each numeric column, this reports the count, mean, standard deviation, minimum, quartiles, and maximum.
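
As a rough illustration of how to read that output, the sketch below runs `.describe()` on a small, made-up column of fares and then pulls out individual statistics by label. The numbers are hypothetical and chosen only to keep the arithmetic easy to check.

```python
import pandas as pd

# Hypothetical fares, not taken from the Titanic dataset.
sample = pd.DataFrame({"Fare": [10.0, 20.0, 30.0, 40.0]})

# describe() returns a new DataFrame whose rows are the statistics.
summary = sample.describe()
print(summary)

# Individual statistics can be looked up by row label and column name.
print(summary.loc["mean", "Fare"])  # 25.0
print(summary.loc["max", "Fare"])   # 40.0
```

Because the result is itself a DataFrame, anything you learn about selecting rows and columns also applies to it.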

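If we only want the mean of every numeric column, we can use the `.mean()` method. Passing `numeric_only=True` skips text columns such as `Name`, which recent versions of Pandas would otherwise raise an error on:

In [None]:
dataframe.mean(numeric_only=True)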

While these methods tell us some interesting information, like that the average fare was 32 pounds while the maximum was 512, **they don't tell us all that much**.

We're more interested in seeing how these factors correlate with one another. Some questions we might ask are:

* Were people who paid for higher-class service more likely to survive?
* Were children prioritized for rescue (i.e. were younger ages more likely to survive)?
* What was the average fare of a 2nd Class ticket?
* Were men and women charged different prices? What was the average cost of a 2nd Class ticket for a woman?

Let's take a look at a few tools that'll get you started.

## Group By (`.groupby()`)

`.groupby()` lets you reorganize your dataframe into groups, at which point we can view it with one of the methods from above. To see how this works, let's answer the third question from above:

In [None]:
dataframe.groupby(["Pclass"])

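Well, that didn't work as expected. No worries! Calling `.groupby()` on its own doesn't return a new DataFrame; it returns a GroupBy object that only records how the rows are grouped. To get a result we can display, we still have to aggregate it with one of the methods from before. Let's use `.mean()`, passing `numeric_only=True` so that text columns such as `Name` are skipped.

In [None]:
dataframe.groupby(["Pclass"]).mean(numeric_only=True)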

The averages are now broken down by passenger class: each row is a class, and each column holds the mean for that class. With this, we can now say that the average 2nd Class ticket cost roughly 20 pounds.
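
Here's a minimal sketch of the same idea on a small, made-up table, so the group averages are easy to verify by hand. It also shows that you can pick a single column before aggregating, which avoids the text-column issue entirely. The fares below are hypothetical.

```python
import pandas as pd

# Hypothetical passengers; two per class so the means are easy to check.
sample = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Fare": [80.0, 60.0, 20.0, 30.0, 8.0, 7.0],
})

# groupby() alone only describes the grouping; nothing is computed yet.
grouped = sample.groupby("Pclass")
print(type(grouped).__name__)  # DataFrameGroupBy

# Selecting one column and aggregating gives one value per group.
mean_fare = grouped["Fare"].mean()
print(mean_fare)

# Look up a single group's result by its label.
print(mean_fare.loc[2])  # 25.0
```

On the real data, `dataframe.groupby("Pclass")["Fare"].mean()` would give just the average fare per class.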

We can also use `.groupby` with multiple labels! Just use the format `dataframe.groupby(["Column1", "Column2", ...])`. With this sort of flexibility, we can answer our fourth question from above:

In [None]:
dataframe.groupby(["Sex", "Pclass"]).mean(numeric_only=True)

As we can see, women generally paid more for their tickets. In the case of 1st Class tickets, the average fare for women was considerably higher than the average fare for men!
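
Grouping by two columns produces a result labeled by both of them. The sketch below, using made-up fares, shows one way to look up a single combination and to pivot the result into a small table with `.unstack()`. Treat it as an illustration, not the only way to arrange the output.

```python
import pandas as pd

# Hypothetical fares, invented so the averages are easy to check by hand.
sample = pd.DataFrame({
    "Sex": ["female", "male", "female", "male", "female", "male"],
    "Pclass": [1, 1, 1, 2, 2, 2],
    "Fare": [100.0, 60.0, 120.0, 20.0, 22.0, 18.0],
})

# One mean per (Sex, Pclass) combination.
mean_fare = sample.groupby(["Sex", "Pclass"])["Fare"].mean()
print(mean_fare)

# A single combination is addressed with a tuple of labels.
print(mean_fare.loc[("female", 1)])  # 110.0

# unstack() moves Pclass into the columns, giving a Sex-by-Pclass table.
table = mean_fare.unstack("Pclass")
print(table)
```

The table form makes it easier to compare men and women within the same class at a glance.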

## Filter by Condition

Now, what if we want to look at such information, but only for certain subsets of the data? For example, if we were trying to answer our second question, we'd only want to look at passengers below a certain value in the Age column.

To do so, we can filter like so:

In [None]:
dataframe[dataframe["Age"] < 18].mean(numeric_only=True)

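This may look confusing at first, since it looks like we're indexing the dataframe with... the dataframe itself. What's happening is that `dataframe["Age"] < 18` produces a column of `True`/`False` values, one per row, and indexing the dataframe with that column keeps only the rows marked `True`. Follow this format, changing the column and condition to whatever you'd like to filter on, and you'll be fine.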
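
To make the filtering step less mysterious, here's a small sketch that splits it into two parts: building the `True`/`False` column first, then using it to select rows. The four passengers are made up. The last step combines two conditions with `&`, which requires parentheses around each condition.

```python
import pandas as pd

# Hypothetical passengers, invented for illustration.
sample = pd.DataFrame({
    "Age": [4.0, 17.0, 30.0, 45.0],
    "Survived": [1, 0, 1, 0],
    "Pclass": [3, 2, 1, 3],
})

# Step 1: the comparison yields one boolean per row.
is_minor = sample["Age"] < 18
print(is_minor.tolist())  # [True, True, False, False]

# Step 2: indexing with the booleans keeps only the True rows.
minors = sample[is_minor]
print(minors)

# Conditions can be combined with & (and) or | (or).
young_third_class = sample[(sample["Age"] < 18) & (sample["Pclass"] == 3)]
print(young_third_class)
```

Writing `~is_minor` flips the booleans, which is a handy way to select everyone the first filter left out.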

## Exercises

So, the output of the code block above includes the survival rate for those under 18: since the `Survived` column holds 1 for survivors and 0 otherwise, its mean is the fraction who survived. Given this, find the answer to our second question: Were children prioritized for rescue (i.e. were younger ages more likely to survive)?

In [None]:
# Your code here!

Now, answer our first question: Were people who paid for higher-class service more likely to survive?

In [None]:
# Your code here!