# Introduction to the Pandas Library

## What is Pandas?

Pandas is a powerful and widely-used open-source library in Python for data manipulation and analysis. It provides data structures and functions to efficiently handle structured data, such as tables or spreadsheets.

Below is an example of structured data, such as a table with columns and rows:

<div style="text-align:center">
    <img src="https://github.com/data-bootcamp-v4/prework_img/blob/main/column_row.png?raw=true" alt="Image" style="width:60%;">
</div>


This lesson gives a short introduction to Pandas so that we can compute descriptive statistics on tabular data (data with columns and rows).

## Key Components

The essential data structures in the Pandas library are Pandas DataFrames and Series:
- **DataFrames**
    - DataFrames are **two-dimensional tabular data** structures in Pandas, similar to a **spreadsheet table**.
    - Each **column** in a DataFrame represents a **variable**, while each **row** represents an **observation** or data point.
    - DataFrames allow us to store, manipulate, and analyze structured data efficiently.

The image below illustrates this: each column is a variable and each row is an observation. The whole table is called a `Pandas DataFrame`.

<div style="text-align:center">
    <img src="https://github.com/data-bootcamp-v4/prework_img/blob/main/variable_observation.png?raw=true" alt="Image" style="width:60%;">
</div>
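
As a minimal sketch of the idea above, a small DataFrame can be built from a dictionary in which each key becomes a column (variable) and each position in the lists becomes a row (observation). The names and values below are made up for illustration and are not taken from any real dataset.

```python
import pandas as pd

# Made-up sample data for illustration only.
passengers = pd.DataFrame(
    {
        "Name": ["Passenger A", "Passenger B", "Passenger C"],  # one variable per column
        "Age": [22, 38, 26],
        "Survived": [0, 1, 1],
    }
)

# Each row is one observation; each column is one variable.
print(passengers)
```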

- **Series**
    - Series are one-dimensional labeled arrays in Pandas.
    - A Series can **represent a single column (variable)** or **a single row** of a DataFrame.
    - Series can store various types of data, such as numbers, strings, or dates.

As we can see in the following image, the `Pandas DataFrame` is the whole table, and each row and each column is a `Pandas Series`.

<div style="text-align:center">
    <img src="https://github.com/data-bootcamp-v4/prework_img/blob/main/dataframe_series.png?raw=true" alt="Image" style="width:70%;">
</div>
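
The relationship between a DataFrame and its Series can be checked directly. The sketch below uses a made-up two-row table: selecting one column or one row gives back a Series in both cases.

```python
import pandas as pd

# Made-up sample data for illustration only.
passengers = pd.DataFrame(
    {"Name": ["Passenger A", "Passenger B"], "Age": [22, 38]}
)

age_column = passengers["Age"]  # one column: a Series indexed by the row labels
first_row = passengers.loc[0]   # one row: a Series indexed by the column labels

print(type(age_column))
print(type(first_row))
```

Note that the row comes back indexed by the column labels (`Name`, `Age`), while the column is indexed by the row labels.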


In Pandas, there are two types of labels used to access data in a DataFrame: row index labels and column labels.

- Row Index Labels:

    - Row index labels are the labels assigned to each row in a DataFrame.
    - They identify each row and allow us to access specific rows using their labels. Labels are usually unique, although Pandas does not require this.
    - Row index labels are not considered as a separate "column" in the DataFrame.

- Column Labels:

    - Column labels refer to the names assigned to each column in a DataFrame.
    - They serve as identifiers for the columns and allow us to access specific columns using their labels.
    - Column names are considered the column labels in a DataFrame.


<div style="text-align:center">
    <img src="https://github.com/data-bootcamp-v4/prework_img/blob/main/row_column_label_index.png?raw=true" alt="Image" style="width:70%;">
</div>


Let's look at the image above.

- Row Index Labels: In this example, the row index labels are automatically generated and shown as 0, 1, and 2. They are not considered a separate column.
- Column Labels: The column labels are "PassengerId", "Survived", "Pclass", "Name", and "Sex". These names represent the columns in the DataFrame.
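
Both kinds of labels can be inspected through the `index` and `columns` attributes. The sketch below uses a made-up table with a few of the column names from the image; the values are illustrative only.

```python
import pandas as pd

# Made-up sample data for illustration only.
passengers = pd.DataFrame(
    {"PassengerId": [1, 2, 3], "Survived": [0, 1, 1], "Pclass": [3, 1, 3]}
)

row_labels = list(passengers.index)       # generated automatically when no index is given
column_labels = list(passengers.columns)

print(row_labels)
print(column_labels)
print(passengers.shape)  # the row index is not counted as a column
```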

## Importing Pandas

To use Pandas, you first need to import the library. By convention, Pandas is imported with the alias `pd`.
Example:
```python
import pandas as pd
```

This imports Pandas and assigns it the alias "pd" for easier usage.

In [None]:
import pandas as pd

Aliases are commonly used in programming for convenience and brevity. When working with libraries, we use the dot (`.`) operator to access their predefined functions. Rather than repeatedly typing the complete library name, programmers often opt for a shorter alias. For example:
- Instead of writing `pandas.read_csv(...)`
- With the alias `pd`, we can write `pd.read_csv(...)`
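
To illustrate that an alias is only a second name for the same library, the short sketch below imports Pandas both ways and compares the two names.

```python
import pandas
import pandas as pd

# Both names refer to the same module object, so their functions are identical.
same_module = pd is pandas
same_function = pd.read_csv is pandas.read_csv

print(same_module, same_function)
```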

## Loading data

Pandas can read data from various sources, such as CSV files, Excel files, or databases.

To read data from a CSV file, use the `read_csv()` function provided by Pandas.

Syntax: 
```python
df = pd.read_csv('filename.csv')
```
Replace `'filename.csv'` with the actual path and name of your CSV file. This reads the file and stores its contents in a DataFrame called `df`.

Note: `read_csv()` also accepts a URL that points directly to a CSV file.

In [None]:
# We can read from an online URL
titanic_data = pd.read_csv('https://raw.githubusercontent.com/data-bootcamp-v4/prework_data/main/titanic.csv')
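
For experimenting without a file or an internet connection, `read_csv()` can also read from a file-like object. The sketch below wraps a made-up CSV string in `io.StringIO`; the column names and values are hypothetical stand-ins for a real file.

```python
from io import StringIO

import pandas as pd

# Made-up CSV text standing in for a real file (hypothetical values).
csv_text = """PassengerId,Name,Age
1,Passenger A,22
2,Passenger B,38
"""

# StringIO wraps the string in a file-like object that read_csv can consume.
sample_df = pd.read_csv(StringIO(csv_text))
print(sample_df)
```

The first line of the text is used as the column labels, and each following line becomes one row.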

## Displaying Data

- To get an overview of the data, you can display the first few rows using the `head()` method.

Example:
```python
data.head()
```
This displays the first five rows of the DataFrame `data`.

In [None]:
# Display the first few rows of the dataset
titanic_data.head()
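
`head()` also accepts the number of rows to show. The sketch below uses a made-up ten-row table; `tail()`, which is not covered above, is included as the counterpart that returns the last rows.

```python
import pandas as pd

# Made-up table with ten rows, values 0 to 9.
numbers = pd.DataFrame({"value": range(10)})

first_five = numbers.head()    # default: the first 5 rows
first_three = numbers.head(3)  # the first 3 rows
last_two = numbers.tail(2)     # the last 2 rows

print(first_three)
print(last_two)
```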

## Number of columns and rows

Use the `shape` attribute to get the number of rows and columns in the DataFrame:

```python
data.shape
```

`data.shape` is an attribute, not a method, so it is written without parentheses. It returns a tuple with two values: the first is the number of rows, and the second is the number of columns.

In [None]:
titanic_data.shape
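
Because `shape` is a plain tuple, it can be unpacked into two variables. The sketch below uses a made-up table with three rows and two columns.

```python
import pandas as pd

# Made-up sample data for illustration only.
passengers = pd.DataFrame(
    {"Name": ["Passenger A", "Passenger B", "Passenger C"], "Age": [22, 38, 26]}
)

# Unpack the (rows, columns) tuple into two separate variables.
n_rows, n_columns = passengers.shape
print(f"{n_rows} rows and {n_columns} columns")
```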

## Accessing Columns

You can access individual columns of a DataFrame using square brackets `[]`.

Example: 
```python
data['Age']
```

This retrieves the 'Age' column from the DataFrame `data` as a Series.

In [None]:
titanic_data["Age"]
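
Square brackets can also take a list of column names. As a rough sketch on made-up data: a single name returns a Series, while a list of names returns a smaller DataFrame.

```python
import pandas as pd

# Made-up sample data for illustration only.
passengers = pd.DataFrame(
    {
        "Name": ["Passenger A", "Passenger B"],
        "Age": [22, 38],
        "Fare": [7.25, 71.28],
    }
)

ages = passengers["Age"]              # single label: a Series
subset = passengers[["Name", "Age"]]  # list of labels: a DataFrame

print(type(ages))
print(type(subset))
```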

Please note that column names in pandas DataFrames are **case-sensitive**. When referencing or manipulating columns, you need to use the exact casing defined in the DataFrame. Selecting a column with the wrong casing (for example `data['age']` instead of `data['Age']`) raises a `KeyError`, and assigning to a miscased name silently creates a new column.
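
The sketch below shows case sensitivity on a made-up table: looking up `"age"` fails because the column is named `"Age"`.

```python
import pandas as pd

# Made-up sample data for illustration only.
passengers = pd.DataFrame({"Name": ["Passenger A", "Passenger B"], "Age": [22, 38]})

try:
    passengers["age"]  # wrong casing: the column is called "Age"
    lookup_failed = False
except KeyError:
    lookup_failed = True

print(lookup_failed)
print(list(passengers.columns))  # check the exact spelling of the column names
```

Printing `data.columns` is a quick way to confirm the exact spelling before selecting a column.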

## Series Operations

You can perform various operations on a Series, such as selecting specific values, applying mathematical calculations, or aggregating data.

Example: 
```python
data['Age'].mean()
```
This calculates the mean of the 'Age' column. Missing values are skipped by default.

In [None]:
titanic_data["Age"].mean()
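
Other aggregations work the same way as `mean()`. The sketch below uses a made-up Series with one missing value to show that, by default, missing entries are left out of the calculation.

```python
import pandas as pd

# Made-up ages for illustration; None becomes a missing value (NaN).
ages = pd.Series([20.0, 30.0, None, 40.0], name="Age")

mean_age = ages.mean()    # average of the three non-missing values
youngest = ages.min()
oldest = ages.max()
n_present = ages.count()  # number of non-missing values

print(mean_age, youngest, oldest, n_present)
```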

💡 Check for understanding: try it yourself! Look for other numerical variables, such as `Fare`, and calculate their mean.