# **Chapter 1:** Introduction to pandas

**Pandas** is a cornerstone Python library for data analysis and manipulation. It offers rich data structures and functions that make exploring and transforming data straightforward and efficient.

<img src=https://upload.wikimedia.org/wikipedia/commons/thumb/e/ed/Pandas_logo.svg/2560px-Pandas_logo.svg.png height=100>

At the heart of Pandas are its two primary data structures: Series and DataFrame, which provide the functionality to handle and analyze data in a way that is both fast and intuitive. 

Conceived by Wes McKinney in 2008, Pandas has become indispensable for data scientists for tasks ranging from simple data filtering and aggregation to more complex data transformations and analysis. Its seamless integration with other libraries like NumPy, Scikit-learn, and Matplotlib makes it a versatile tool in the data science toolkit, enabling analysts and researchers to draw insights from data with ease. 

### [1.1 Overview of Pandas](#11-overview-of-pandas)

Pandas users range from beginners in data science to seasoned analysts and researchers, all benefiting from its extensive functionality for handling and analyzing input data, regardless of its origin or format. 

The library's central feature is its powerful and flexible `DataFrame` object, which allows for sophisticated data manipulation and analysis.


#### Core Feature: DataFrame

The DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). It's akin to a spreadsheet or SQL table and is the most commonly used pandas object.

```python
import pandas as pd
df = pd.DataFrame()  # an empty DataFrame; data is usually passed as the first argument
```
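As a minimal sketch of the properties listed above, the example below builds a small DataFrame and then inspects its labels, its per-column types, and its ability to grow. The column names, row labels, and values are invented purely for illustration.

```python
import pandas as pd

# Illustrative data only; the names and numbers are made up for this sketch.
df = pd.DataFrame(
    {"item": ["apple", "bread", "milk"], "price": [0.5, 2.25, 1.1]},
    index=["a", "b", "c"],
)

# Labeled axes: row labels live in .index, column labels in .columns.
print(df.index.tolist())
print(df.columns.tolist())

# Potentially heterogeneous: each column keeps its own dtype.
print(df.dtypes)

# Size-mutable: a column can be added after the DataFrame is created.
df["under_one"] = df["price"] < 1.0
print(df.shape)
```

Here `df.shape` reports three rows and three columns after the new column is added, which is the "size-mutable" part of the definition in practice.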

#### Why Pandas?

Pandas simplifies data manipulation and analysis through its powerful data structures.

It provides:
- Fast and efficient DataFrame object for data manipulation with integrated indexing.
- Tools for reading and writing data between in-memory data structures and different file formats.
- Data alignment and integrated handling of missing data.
- Reshaping and pivoting of datasets.
- Label-based slicing, indexing, and subsetting of large datasets.
- Insertion and deletion of columns in data structures.
- Group by functionality for aggregation and transformations.
- High-performance merging and joining of data.
- Time Series functionality.
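
A few of the features listed above can be sketched in a handful of lines. The example is illustrative rather than exhaustive: the `sales` and `managers` tables, their column names, and all values are invented, and the CSV round trip goes through an in-memory buffer instead of a real file.

```python
import io

import pandas as pd

# Data alignment: arithmetic matches on index labels, not positions.
s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0], index=["b", "c"])
total = s1 + s2              # label "a" has no partner, so the result is NaN there
filled = total.fillna(0.0)   # one possible way to handle the missing value

# Group by: aggregate a column per group (hypothetical sales table).
sales = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "units": [3, 5, 4, 1],
})
units_per_region = sales.groupby("region")["units"].sum()

# Merging: join two tables on a shared key column.
managers = pd.DataFrame({
    "region": ["north", "south"],
    "manager": ["Kim", "Lee"],
})
merged = sales.merge(managers, on="region", how="left")

# Reading and writing: round-trip the table through CSV text in memory.
csv_text = sales.to_csv(index=False)
restored = pd.read_csv(io.StringIO(csv_text))

print(filled)
print(units_per_region)
print(merged)
print(restored)
```

A left merge keeps every row of `sales` in its original order and attaches the matching `manager` value to each one.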

Pandas is built on top of NumPy and is designed to integrate well with many other third-party libraries in a scientific computing environment.
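
The NumPy relationship can be seen directly, as in the short sketch below. The values and labels are arbitrary, chosen only so the results are easy to read.

```python
import numpy as np
import pandas as pd

# Arbitrary example values with string labels.
values = pd.Series([1.0, 4.0, 9.0], index=["x", "y", "z"])

# NumPy universal functions accept a Series and return a Series
# that keeps the original index labels.
roots = np.sqrt(values)

# to_numpy() returns the values as a plain NumPy array (labels are dropped).
arr = values.to_numpy()

print(roots)
print(type(arr).__name__, arr.dtype)
```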


### [1.2 Installation and Setup](#1.2-installation-and-setup)
<a name="1.2-installation-and-setup"></a>
Pandas can be installed using Python's package manager `pip`. 

If you have Python and pip on your system, you can install Pandas by running:

    pip install pandas

To verify the installation, you can import pandas and check its version with:

    import pandas as pd
    print(pd.__version__)

### [1.3 Pandas Data Structures](#1.3-pandas-data-structures)
<a name="1.3-pandas-data-structures"></a>
The two primary data structures of pandas, `Series` (1-dimensional) and `DataFrame` (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. 

For [R](https://www.r-project.org/) users, DataFrame provides everything that [R](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/data.frame)'s `data.frame` provides and much more.

<img src=../images/ue4_series-dataframe.png height=200>

#### Series

A Series is a one-dimensional array-like object containing a sequence of values and an associated array of data labels, called its index.

A simple Series is formed from only an array of data:

    series = pd.Series([4, 7, -5, 3])

    print(series)
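
To make the role of the index more concrete, the sketch below compares the default integer index with an explicit one. The labels `"d"`, `"b"`, `"a"`, `"c"` are arbitrary and serve only as an illustration.

```python
import pandas as pd

# Without an explicit index, pandas assigns integer labels 0..n-1.
series = pd.Series([4, 7, -5, 3])
print(series.index)   # RangeIndex(start=0, stop=4, step=1)

# An explicit index attaches a label to each value.
labeled = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

# Values can then be looked up by label instead of by position.
print(labeled["a"])   # -5

# Boolean filtering keeps the labels of the surviving values.
positive = labeled[labeled > 0]
print(positive)
```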

#### DataFrame

The DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). 

The DataFrame has both a row and column index; it can be thought of as a dict of Series all sharing the same index.

Creating a DataFrame is as simple as passing a dict of objects:

    data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
            'year': [2000, 2000, 2002, 2001, 2003, 2002],
            'pop': [1.5, 1.5, 3.6, 2.4, 2.9, 3.9]}
    dataframe = pd.DataFrame(data)

    dataframe

<img src=https://cdn-images-1.medium.com/v2/1*5zJ9tsVIRvxY83GsO8eyOw.png height=500>

**Source:** https://medium.com/dunder-data/the-pandas-dataframe-and-series-a7e7a5987492
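
The "dict of Series sharing the same index" view described earlier can be checked with a short sketch. It uses a shortened copy of the table from this section, so the row labels are the default integers 0 to 2.

```python
import pandas as pd

# A shortened version of the table built earlier in this section.
dataframe = pd.DataFrame({
    "state": ["Ohio", "Ohio", "Nevada"],
    "year": [2000, 2002, 2001],
    "pop": [1.5, 3.6, 2.4],
})

# Selecting one column returns a Series.
years = dataframe["year"]
print(type(years).__name__)

# That Series shares the DataFrame's row index.
print(years.index.equals(dataframe.index))

# Bracket access is used for "pop" because dataframe.pop is a
# DataFrame method, not the column.
population = dataframe["pop"]

# .loc selects by label: row label 2, column label "pop".
print(dataframe.loc[2, "pop"])
```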

Pandas DataFrames are versatile and powerful, capable of solving a wide array of data manipulation tasks.

---