# Chapter 1.2: EDA First Look

Goal: practice the first pass of EDA and ask good questions about the data.

In [1]:
import pandas as pd
import numpy as np

### Topics:
- What is Pandas?
- Loading data in Pandas
- Displaying data in Pandas
- Subsetting data in Pandas

## What is Pandas?

Data is typically stored in a CSV file. You can see an example [here](https://raw.githubusercontent.com/datasciencedojo/datasets/refs/heads/master/titanic.csv).

How do we even work with this? For example, how could we find the average age of the passengers? Or all the passengers who survived? We need to load it into Python to accomplish this.

Pandas is a Python package built for manipulating data.

To read a CSV file, you use the Pandas command `.read_csv()` and tell it where the file is located.
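
As a minimal sketch of what `.read_csv()` does, the example below reads a tiny CSV that is made up for illustration (it is not the Titanic file). `io.StringIO` is used here only so the example runs without a file on disk; the name `toy_df` is likewise just an assumption for this sketch.

```python
import io
import pandas as pd

# A tiny CSV, made up for illustration (not the real Titanic file)
csv_text = """Name,Age,Survived
Alice,29,1
Bob,41,0
Carol,7,1
"""

# io.StringIO lets read_csv treat the string as if it were a file
toy_df = pd.read_csv(io.StringIO(csv_text))

print(toy_df.columns.tolist())  # ['Name', 'Age', 'Survived']
print(len(toy_df))              # 3
```

The first line of the CSV becomes the column names, and every following line becomes one row.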

### Basics of file storage

Files are stored in folders. You can access a folder by typing its name. For example, load the example image in the `images` folder.

To go *backwards* (i.e. to a "parent" folder), use `..`. For example, `../images` goes backwards one folder, then looks for the `images` folder. Load the example images in the `week02/images` folder.
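
To see how `..` is resolved, here is a small sketch using Python's standard `posixpath` module. The starting folder `week02/notebooks` is a hypothetical layout chosen for illustration; your own folders may be arranged differently.

```python
import posixpath

# Hypothetical layout: suppose we are working inside week02/notebooks
current_folder = "week02/notebooks"

# ".." steps back to week02, then we look for the images folder
relative_path = "../images"

# Join the two pieces and simplify the ".." away
resolved = posixpath.normpath(posixpath.join(current_folder, relative_path))
print(resolved)  # week02/images
```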

### Loading the file in Pandas

With that, we're ready to load the CSV file using Pandas `.read_csv()`. The CSV file we just looked at is stored here: `https://raw.githubusercontent.com/datasciencedojo/datasets/refs/heads/master/titanic.csv`. Let's load it using the `.read_csv()` command.

In [None]:
# Load the CSV file using Pandas .read_csv function
df = pd.read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/refs/heads/master/titanic.csv')

Now, how do we actually _look_ at the data? Try below.

In [None]:
# Look at the data


## Subsetting the data

Typically, we want to look at only *subsets* of the data. For example, who survived? Who died? How many children survived?

In Pandas, subsets are selected using square brackets `[]`. For example:

In [None]:
# Select the "Survived" column
df['Survived']

We only want the rows where `Survived` says the person survived.

In [None]:
# Find which rows have "Survived" equal to one (meaning they did survive)
df["Survived"] == 1

Note that this shows if a person survived or not, but it didn't actually select those people.

What we did above using `df['Survived'] == 1` is called creating a **mask**. A mask answers the question "which rows satisfy X?". In our case, it was "which rows are people who survived?".

To actually _select_ those rows, we put the mask inside square brackets `[]`.

In [None]:
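# Build the mask: True for each row where the person survived
survived_mask = df["Survived"] == 1

# Put the mask inside square brackets to keep only those rows
survived_df = df[survived_mask]

# Show the first five rows of the result
survived_df.head()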

In summary:
- Square brackets mean to select things according to the criteria inside them
- The criteria can be a column name, which selects the whole column
- The criteria can be a list of `True`/`False` values, telling it whether or not to select a row
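
The three points above can be sketched on a toy table. The data and the names `toy_df` and `expensive_mask` are made up for illustration, and the fare threshold of 50 is an arbitrary choice.

```python
import pandas as pd

# Toy data, made up for illustration
toy_df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carol", "Dan"],
    "Fare": [72.5, 8.05, 21.0, 151.55],
})

# A column name inside [] selects the whole column
fares = toy_df["Fare"]

# A comparison produces a mask: one True/False value per row
expensive_mask = toy_df["Fare"] > 50
print(expensive_mask.tolist())  # [True, False, False, True]

# A mask inside [] keeps only the rows marked True
expensive_df = toy_df[expensive_mask]
print(expensive_df["Name"].tolist())  # ['Alice', 'Dan']
```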

### Practice:
1. Select all people who died ("Survived" equals zero)
2. Select all females
3. Select all people under the age of 18
4. Select all people under the age of 18, and then select the people under 18 who survived

In [None]:
# 1. Select all people who died ("Survived" equals zero)



In [None]:
# 2. Select all females


In [None]:
# 3. Select all people under the age of 18


In [None]:
# 4. Select all people under the age of 18, then select the people under 18 who survived


In [None]:
# 4b. You can do #4 either in two separate steps, or in one single step, using "&" to combine two conditions,
# such as <condition 1> & <condition 2>. Try doing it in one step using &.


In [None]:
# 4c. You can use "&" to express "and". Similarly, you can use "|" (a vertical pipe, just above the Enter key
# on your keyboard) to express "or". Use this to select all people who are under 18 OR who survived.
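
One detail that often trips people up with `&` and `|`: each condition needs its own parentheses, because `&` and `|` are evaluated before comparisons like `==`. The sketch below shows the pattern on toy data made up for illustration, using different conditions from the practice questions.

```python
import pandas as pd

# Toy data, made up for illustration
toy_df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carol", "Dan"],
    "Embarked": ["S", "C", "S", "C"],
    "Pclass": [1, 3, 3, 1],
})

# "and": both conditions must be True for a row to be kept
both = toy_df[(toy_df["Embarked"] == "S") & (toy_df["Pclass"] == 1)]

# "or": a row is kept if at least one condition is True
either = toy_df[(toy_df["Embarked"] == "S") | (toy_df["Pclass"] == 1)]

print(both["Name"].tolist())    # ['Alice']
print(either["Name"].tolist())  # ['Alice', 'Carol', 'Dan']
```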


## Summarizing data

DataFrames can be summarized by calling a method on them, such as `df.my_function()`, or by reading an attribute. Here are some examples:

- `.mean()`
- `.max()`
- `.shape` (an attribute, so it takes no parentheses; similar to the shape of a matrix)
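
Here is a brief sketch of these three on a toy table. The numbers are made up for illustration, so the results say nothing about the real Titanic data.

```python
import pandas as pd

# Toy data, made up for illustration
toy_df = pd.DataFrame({
    "Age": [29, 41, 7, 15],
    "Fare": [72.5, 8.0, 21.0, 150.5],
})

# .mean() and .max() are methods, so they need parentheses
print(toy_df["Fare"].mean())  # 63.0
print(toy_df["Age"].max())    # 41

# .shape is an attribute (no parentheses): (number of rows, number of columns)
print(toy_df.shape)     # (4, 2)
print(toy_df.shape[0])  # 4
```

Since `.shape[0]` is the number of rows, it also gives a quick way to count how many rows a subset contains.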

In [None]:
# Compute the mean age


In [None]:
# Compute the max ticket price


In [None]:
# Use .shape to find out how many people on board were under the age of 18
