# Introduction to pandas

## Data structures

pandas has two main data structures:

1. Series
2. DataFrame

Let's begin by looking at the Series object.

### Series
A pandas Series is a one-dimensional array of indexed data. It can be created from a list or array as follows:

```
s = pd.Series(data, index=index)
```

## Simple Example

In [1]:
import pandas as pd

simple_example = pd.Series([0.1, 0.2, 0.3, 0.4])
simple_example


0    0.1
1    0.2
2    0.3
3    0.4
dtype: float64

Note that the series is initialized with default indices starting from 0.

## Setting an alternative index

Pass a list of axis labels as the `index` argument.

In [2]:
# Note that the length of the index must equal the length of the data

simple_example_with_index = pd.Series(
    [0.1, 0.2, 0.3, 0.4],
    index=["A", "B", "C", "D"]
)
simple_example_with_index

A    0.1
B    0.2
C    0.3
D    0.4
dtype: float64

## Input Data

The data passed to a Series can take several forms:

1. a Python dict
2. an ndarray (or a list, as in the examples above)
3. a scalar value, which is repeated to match the length of the index
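
As a brief sketch of the input kinds listed above (the variable names are illustrative and not part of the original notebook):

```python
import numpy as np
import pandas as pd

# 1. From a Python dict: the keys become the index labels
from_dict = pd.Series({"a": 1, "b": 2})

# 2. From a NumPy ndarray: a default integer index is assigned
from_array = pd.Series(np.array([10, 20, 30]))

# 3. From a scalar: the value is repeated once per index label
from_scalar = pd.Series(5, index=["x", "y", "z"])

print(from_dict)
print(from_array)
print(from_scalar)
```

With a scalar, the `index` argument is what determines the length of the result.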


### Series as specialized dictionary
We can think of a pandas Series as a specialization of a Python dictionary.
A dictionary maps arbitrary keys to a set of arbitrary values,
while a Series maps typed keys to a set of typed values.

* This typing is important: just as the type-specific compiled code behind a NumPy array makes it more efficient than a Python list for certain operations, the type information of a Pandas Series makes it much more efficient than Python dictionaries for certain operations.
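
To make the typing concrete, here is a small sketch (the Series `s` is a made-up example): a Series carries one dtype for its values and one for its index, and arithmetic is applied to all values at once rather than in a Python-level loop.

```python
import pandas as pd

s = pd.Series([0.1, 0.2, 0.3], index=["A", "B", "C"])

print(s.dtype)        # dtype shared by all values
print(s.index.dtype)  # dtype of the index labels

# Vectorized arithmetic: one call operates on every value
doubled = s * 2
print(doubled)
```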

The Series-as-dictionary analogy can be made even clearer by constructing a Series object directly from a Python dictionary.

### Python dictionary example


In [3]:
metric1_dict = {
    "key1": 1,
    "key2": 2,
    "key3": 3,
}
metric1 = pd.Series(metric1_dict)
metric1


key1    1
key2    2
key3    3
dtype: int64

In [4]:
# values can be accessed by label, as with dictionary keys
metric1['key1']

1
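
The dictionary analogy extends beyond item access. The sketch below, which uses a fresh Series named `metric` for illustration, shows a few other dict-style operations that a Series supports:

```python
import pandas as pd

metric = pd.Series({"key1": 1, "key2": 2, "key3": 3})

print("key2" in metric)      # membership tests the index labels, like dict keys
print(list(metric.keys()))   # keys() returns the index
print(list(metric.items()))  # (label, value) pairs

metric["key4"] = 4           # assigning to a new label extends the Series

# get() returns a default instead of raising KeyError for a missing label
print(metric.get("missing", 0))
```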

### What happens if we pass a dictionary with an index?


In [5]:
dict_series_with_index = pd.Series(
    metric1_dict,
    index=["A", "B", "C"]
)
dict_series_with_index

A   NaN
B   NaN
C   NaN
dtype: float64

In [6]:
# This works: every index label matches a dict key
dict_series_with_index = pd.Series(
    metric1_dict,
    index=["key1", "key2", "key3"]
)
dict_series_with_index

key1    1
key2    2
key3    3
dtype: int64

In [7]:
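# What is happening? Labels are looked up among the dict keys:
# "key5" has no matching key, so it becomes NaN and the dtype changes to float64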
dict_series_with_index = pd.Series(
    metric1_dict,
    index=["key3", "key5", "key1"]
)
dict_series_with_index

key3    3.0
key5    NaN
key1    1.0
dtype: float64
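
The three cells above follow one rule: when both a dict and an `index` are passed, each index label is looked up among the dict keys, matches are kept in index order, and labels without a matching key are filled with `NaN`. A minimal sketch of that rule, comparing it with building the Series first and then calling `reindex` (the names `labels`, `with_index` and `reindexed` are illustrative):

```python
import pandas as pd

metric1_dict = {"key1": 1, "key2": 2, "key3": 3}
labels = ["key3", "key5", "key1"]

with_index = pd.Series(metric1_dict, index=labels)
reindexed = pd.Series(metric1_dict).reindex(labels)

print(with_index.equals(reindexed))  # both routes give the same Series
print(with_index.isna())             # only "key5" is missing
```

Because `NaN` is a floating-point value, the integer values are upcast to `float64` as soon as one label is missing.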

## DataFrame
The next fundamental structure in Pandas is the DataFrame. It is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. Like Series, DataFrame accepts many different kinds of input:

1. Dict of 1D ndarrays, lists, dicts, or Series
2. 2-D numpy.ndarray
3. Structured or record ndarray
4. A Series
5. Another DataFrame
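
A short sketch of a few of these input kinds follows. The column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# From a 2-D ndarray: column names are supplied separately
from_array = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["a", "b"])

# From a Series: the Series name becomes the single column name
from_series = pd.DataFrame(pd.Series([1.5, 2.5], name="value"))

# From a dict of Series: rows are aligned on the union of the index labels
from_dict = pd.DataFrame({
    "x": pd.Series([1, 2], index=["r1", "r2"]),
    "y": pd.Series([3, 4], index=["r2", "r3"]),
})

print(from_array)
print(from_series)
print(from_dict)
```

In the last case, labels present in only one of the Series produce `NaN` in the other column.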

## Example: a dict of lists / Series


In [8]:
example_df = pd.DataFrame(
    {
        "simple_example": simple_example,
        "simple_example2": [0.2, 0.3, 0.4, 0.5]
    }
)
example_df


   simple_example  simple_example2
0             0.1              0.2
1             0.2              0.3
2             0.3              0.4
3             0.4              0.5


# Choices for Indexing
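1. `.loc` -- primarily label based; raises `KeyError` when a label is not found
2. `.iloc` -- primarily integer-position based (counting from 0); raises `IndexError` when a requested position is out of bounds, except for slice indexers, which allow out-of-bounds positions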
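
The error behavior described above can be sketched as follows, using a small hypothetical DataFrame `df` with string row labels:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]}, index=["x", "y", "z"])

try:
    df.loc["missing"]   # label not present in the index
except KeyError as err:
    print("KeyError:", err)

try:
    df.iloc[10]         # position past the last row
except IndexError as err:
    print("IndexError:", err)

# Out-of-bounds slices do not raise; they return whatever rows exist
tail = df.iloc[1:10]
print(tail)
```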

In [9]:
# Example of .loc
# .loc works by filtering on the row labels, followed by the column labels
# A simple example
example_df.loc[0, "simple_example"]


0.1

In [10]:
# Referring to multiple rows/columns using a list
example_df.loc[[0, 3], ["simple_example", "simple_example2"]]

   simple_example  simple_example2
0             0.1              0.2
3             0.4              0.5


In [11]:
# Example of .iloc
# .iloc works by filtering on the integer positions of the rows, followed by the columns
# A simple example
example_df.iloc[0, 0]

0.1

In [12]:
# Referring to multiple rows/columns using a list
example_df.iloc[[0, 3], [0, 1]]

   simple_example  simple_example2
0             0.1              0.2
3             0.4              0.5


## Additional selections using indexing methods
In addition to single labels / positions and lists of them, `loc` / `iloc` also accept the following:

1. Index slices
2. Boolean arrays
3. Callables (that return valid output for indexing)

## Index slices

In [13]:
# Note that unlike Python slices, which exclude the end element,
# label slices in .loc include both the start and the end label.
# Slices in .iloc are positional and exclude the end, like Python slices.
example_df.loc[0:2, ["simple_example", "simple_example2"]]

   simple_example  simple_example2
0             0.1              0.2
1             0.2              0.3
2             0.3              0.4
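
Because the default index labels are also the integers 0 to 3, the same slice can be tried with both indexers. A small sketch on a stand-in DataFrame (the column name `value` is illustrative) shows that only the label-based slice includes its end point:

```python
import pandas as pd

df = pd.DataFrame({"value": [0.1, 0.2, 0.3, 0.4]})

by_label = df.loc[0:2]      # label slice: rows labeled 0, 1 and 2
by_position = df.iloc[0:2]  # positional slice: rows at positions 0 and 1 only

print(len(by_label), len(by_position))
```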


## Boolean arrays

In [14]:
# The length of each Boolean array must equal the length of the axis it filters
example_df.loc[
    [True, False, True, False],  # Show rows 0 and 2
    [False, True]  # Show the second column only
]

   simple_example2
0              0.2
2              0.4
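
In practice the Boolean array is usually computed from the data rather than typed by hand. A small sketch, rebuilding the example DataFrame so the block is self-contained (the name `mask` and the thresholds are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "simple_example": [0.1, 0.2, 0.3, 0.4],
    "simple_example2": [0.2, 0.3, 0.4, 0.5],
})

# Each comparison yields a Boolean Series; combine them with & (and), | (or), ~ (not).
# The parentheses are required because & binds more tightly than the comparisons.
mask = (df["simple_example"] > 0.1) & (df["simple_example2"] < 0.5)

selected = df.loc[mask, ["simple_example2"]]
print(selected)
```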


## Callable example
A callable is called with the DataFrame as its single argument and must return one of the acceptable indexer types above.

In [15]:
# The function below returns a Boolean Series
def select_simple_example_less_than_0_3(df):
    return df["simple_example"] < 0.3

select_simple_example_less_than_0_3(example_df)

0     True
1     True
2    False
3    False
Name: simple_example, dtype: bool

In [16]:
example_df.loc[select_simple_example_less_than_0_3]

   simple_example  simple_example2
0             0.1              0.2
1             0.2              0.3
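
Callables are most useful in method chains, where the intermediate DataFrame has no name to refer to. A sketch under that assumption, with a hypothetical `total` column added by `assign`:

```python
import pandas as pd

df = pd.DataFrame({
    "simple_example": [0.1, 0.2, 0.3, 0.4],
    "simple_example2": [0.2, 0.3, 0.4, 0.5],
})

result = (
    df.assign(total=lambda d: d["simple_example"] + d["simple_example2"])
      # the lambda below receives the DataFrame returned by assign(),
      # so it can refer to the new "total" column
      .loc[lambda d: d["total"] > 0.6, ["simple_example", "total"]]
)
print(result)
```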


### Computing approximate size of the DataFrame

In [17]:
# Displaying memory usage of a DataFrame
data = {'Name': ['John', 'Jane', 'Alice'],
        'Age': [25, 30, 28],
        'Salary': [50000, 60000, 55000]}

df = pd.DataFrame(data)
df.info()  # the last line of the summary reports the estimated memory usage

#### How is this computed?
The memory usage estimation provided by Pandas' `info()` method is based on the data types of the columns in the DataFrame. It calculates the memory required for storing the column data and any additional overhead required by Pandas to manage the DataFrame.

Here are some general guidelines for estimating memory usage based on data types:

- Numeric data types (e.g., int, float): The memory usage depends on the size of the data type. For example, an int64 column will use 8 bytes per element, and a float64 column will use 8 bytes per element.

- Boolean data type (bool): A bool column uses 1 byte per element. The values are not packed into bits, so 8 Boolean elements occupy 8 bytes.

- String data type (object): By default, only the pointer-sized reference to each string object is counted (8 bytes on 64-bit systems), so the estimate is a lower bound, which `info()` marks with a trailing `+`. With deep introspection (`memory_usage(deep=True)` or `info(memory_usage="deep")`), the memory of the string objects themselves, including their per-object overhead, is added.
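
The per-element sizes can be checked column by column. The sketch below uses a made-up four-column DataFrame and excludes the index so that only the column data is counted:

```python
import pandas as pd

df = pd.DataFrame({
    "ints": pd.Series([1, 2, 3], dtype="int64"),
    "floats": pd.Series([1.0, 2.0, 3.0], dtype="float64"),
    "flags": pd.Series([True, False, True], dtype="bool"),
    "names": pd.Series(["John", "Jane", "Alice"], dtype="object"),
})

# Shallow: object columns count only the references to the strings
shallow = df.memory_usage(index=False)

# Deep: object columns also count the Python string objects themselves
deep = df.memory_usage(index=False, deep=True)

print(shallow)
print(deep)
```

Only the `names` column differs between the two results, since the numeric and Boolean columns store their values directly.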

Keep in mind that these estimates are based on the assumption that the data types accurately represent the actual data in the DataFrame. If the data types are not optimized or if there are missing values, the memory usage estimate may not be entirely precise.

Additionally, the memory usage estimation may not account for certain optimizations or compression techniques used by Pandas, such as string interning or category data types. These techniques can reduce the memory usage compared to the naive estimates based solely on data types.
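
As an illustration of how much the category dtype can save, the sketch below compares a low-cardinality string column stored as `object` and as `category`. The data is made up, and the exact byte counts depend on the pandas and Python versions, so only the two totals are printed.

```python
import pandas as pd

# A low-cardinality string column: three distinct values repeated many times
cities = pd.Series(["London", "Paris", "Tokyo"] * 1000, dtype="object")
as_category = cities.astype("category")

object_bytes = cities.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(object_bytes, category_bytes)
```

The categorical version stores each distinct string once plus a small integer code per row, which is why it is smaller here.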

For a more precise measurement, use the `memory_usage()` method with the `deep=True` argument, as shown in the next cell. This reports the actual memory usage by inspecting the contents of the DataFrame, including the objects held in object columns, but it requires creating the DataFrame and accessing its data.


In [18]:
# Data frame memory size calculation exercise
memory_usage = df.memory_usage(deep=True).sum()
print("Memory usage of the DataFrame:", memory_usage, "bytes")

Memory usage of the DataFrame: 360 bytes
