# <span style="color:#130654; font-family: Helvetica; font-size: 200%; font-weight:700"> Pandas | <span style="font-size: 50%; font-weight:300">Data Structures</span></span>

To use pandas in Python, import it first with the following command:

In [1]:
# import pandas
import pandas as pd

# import other libraries here
import numpy as np

## <span style="color:#130654">Data Structures</span>

- Pandas provides 3 data structures that are <u>built on NumPy arrays</u> (which means they are fast).
- Each higher-dimensional data structure is a container of the lower-dimensional one.

|Data Structure |Dimension |Data Type   |Size      |Data     |
|:-------------:|----------|------------|----------|---------|
|**Series**     |1D        |<span style="color:red">Homogeneous</span> |<span style="color:red">Immutable</span>|Mutable  |
|**DataFrame**  |2D        |Heterogeneous|Mutable   |Mutable  |
|**Panel**      |3D        |Heterogeneous|Mutable   |Mutable  |

All pandas data structures are value mutable (their values can be changed), and all except Series are size mutable.

- Panel, container of:
    - DataFrame, container of:
        - Series

<div style="text-align:center">
    <img src="../img/pandas-data-structures.png" width="620" height="620"/>
</div>
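The container relationship can be checked directly. The following is a minimal sketch, assuming a small illustrative DataFrame (the column names `name` and `age` are made up for this example): selecting one column of a DataFrame yields a Series, and the values of that Series can be overwritten.

```python
import pandas as pd

# A small DataFrame whose columns have different types (heterogeneous).
df = pd.DataFrame({"name": ["Alex", "Bob"], "age": [10, 12]})

# Selecting a single column returns a Series (1D, one data type).
age = df["age"]
print(type(age).__name__)   # Series

# Values are mutable: work on a copy and overwrite the first value.
age_copy = age.copy()
age_copy.iloc[0] = 11
print(age_copy.tolist())    # [11, 12]
print(df["age"].tolist())   # [10, 12] (the original is untouched)
```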

## <span style="color:#130654">A. Series</span>

- Series is a one-dimensional labeled array.
- The axis labels are collectively called the index.
- A Series can be created from an `array`, a `dict`, or a `scalar`.

*Syntax:*
```python
pandas.Series( data, index, dtype, copy)
```

| Parameters | Details                                                      |
| :--------: | ------------------------------------------------------------ |
|  **data**  | Takes various forms like ndarray, list, constants            |
| **index**  | Index values must be hashable and the same length as data; they do not have to be unique. Default **np.arange(n)** if no index is passed. |
| **dtype**  | dtype is for data type. If None, data type will be inferred  |
|  **copy**  | Copy data. Default False                                     |
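As a rough illustration of the parameters in the table above, the sketch below passes `index`, `dtype`, and `copy` explicitly. The labels `x`, `y`, `z` and the variable names are assumptions chosen for this example only.

```python
import numpy as np
import pandas as pd

values = np.array([1, 2, 3])

# index supplies the labels; dtype asks for float storage instead of the
# integer type that would otherwise be inferred.
s = pd.Series(values, index=["x", "y", "z"], dtype="float64")
print(s.dtype)          # float64
print(list(s.index))    # ['x', 'y', 'z']

# copy=True gives the Series its own copy of the input array, so a later
# change to `values` does not show up in the Series.
s_copy = pd.Series(values, copy=True)
values[0] = 99
print(s_copy.iloc[0])   # 1
```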

#### <span style="color:#130654">Creating Series</span>

**1. Empty Series**

In [2]:
s = pd.Series()
print(s)

Series([], dtype: float64)




**2. Using ndarray**

In [3]:
data = np.array(['a','b','c','d'])

# without defining index
s_noindex = pd.Series(data)

# with defining index
index = [100, 101, 102, 103] # creating index variable
s_index = pd.Series(data, index=index) # assigning index variable to index parameter

print("Without index:")
print(s_noindex)
print("\n")
print("with index:")
print(s_index)
print("\n")
print("If no index is passed then by default index is assigned ranging from '0' to 'len(data)-1' ")

Without index:
0    a
1    b
2    c
3    d
dtype: object


with index:
100    a
101    b
102    c
103    d
dtype: object


If no index is passed then by default index is assigned ranging from '0' to 'len(data)-1' 


**3. Using dict**

In [4]:
data = {'a' : 0, 'b' : 1, 'c' : 2}

# without defining index
s_noindex = pd.Series(data, dtype="float64")

# with defining index
index = ['b','c','d','a']
s_index = pd.Series(data, index=index, dtype="float64")

print("Without index:")
print(s_noindex)
print("\n")
print("with index:")
print(s_index)

Without index:
a    0.0
b    1.0
c    2.0
dtype: float64


with index:
b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64


- <span style="color:#130654">Dictionary keys are used to construct index.</span>
- If no index is passed, the index follows the order of the keys in the dictionary.
- If an index label is not among the dictionary keys, its value is `NaN`, the missing-value marker (equivalent to *Null*).
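The missing labels described above can be located and removed afterwards. This is a small sketch, assuming the same dictionary as in the example; `isna()` and `dropna()` are standard Series methods.

```python
import pandas as pd

data = {"a": 0, "b": 1, "c": 2}

# 'd' is not a key in data, so its value becomes NaN.
s = pd.Series(data, index=["b", "c", "d", "a"])
print(s.isna().tolist())        # [False, False, True, False]
print(s.dtype)                  # float64

# dropna() removes the labels that had no matching key.
print(list(s.dropna().index))   # ['b', 'c', 'a']
```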

**4. Using scalar**

In [5]:
s = pd.Series(5, index=['a', 'b', 'c', 'd'])
print(s)

a    5
b    5
c    5
d    5
dtype: int64


#### <span style="color:#130654">Accessing / Retrieving values</span>

**1. By Indexing**

<div style="text-align:center"><img src="../img/python-slicing.png"/></div>

<span style="color:green; font-family: Helvetica;">
    Note: if the length of a series is n, positions in Python start at 0 and end at n-1.
</span>

- <span style="color:#130654">**0 = absolute first index**</span>
- A = initial indexing position
- N = last indexing position
- X = any indexing position
- <span style="color:#130654">**n-1 = absolute last index**</span>

|Indexing|Return|
|:------:|------|
|**series[X]**|The value at position X|
|**series[A:]**|All values from position A to n-1|
|**series[A:N]**|All values from position A to N-1|
|**series[:N]**|All values from position 0 to N-1|
|**series[-A:]**|The last A values|
|**series[:-N]**|All values except the last N|

*Examples:*

In [6]:
series = pd.Series(['a','b','c','d','e'])

In [7]:
# Return the value at position X (here X = 2)
series[2]

'c'

In [8]:
# Return all values from position A (2) to the end (n-1)
series[2:]

2    c
3    d
4    e
dtype: object

In [9]:
# Return values from position A (2) to position N-1 (3)
series[2:4]

2    c
3    d
dtype: object

In [10]:
# Return values from position 0 to position N-1 (2)
series[:3]

0    a
1    b
2    c
dtype: object

In [11]:
# Return the last A (2) values
series[-2:]

3    d
4    e
dtype: object

In [12]:
# Return all values except the last N (2)
series[:-2]

0    a
1    b
2    c
dtype: object
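The plain `series[...]` form above relies on the default integer index. When the index holds other labels, the same position-based patterns can be written with `.iloc`, which always selects by position. The sketch below assumes an illustrative index of `10, 20, ...` to show that the labels play no part.

```python
import pandas as pd

# Labels 10..50 are deliberately different from the positions 0..4.
series = pd.Series(["a", "b", "c", "d", "e"], index=[10, 20, 30, 40, 50])

print(series.iloc[2])              # c
print(series.iloc[2:].tolist())    # ['c', 'd', 'e']
print(series.iloc[2:4].tolist())   # ['c', 'd']
print(series.iloc[:3].tolist())    # ['a', 'b', 'c']
print(series.iloc[-2:].tolist())   # ['d', 'e']
print(series.iloc[:-2].tolist())   # ['a', 'b', 'c']
```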

**2. By Label**

A Series is like a fixed-size dict in that you can get and set values by index label.

In [13]:
series = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

In [14]:
#single value
series['c']

3

In [15]:
# multiple sequential / non sequential values
series[['a','c','e']]

a    1
c    3
e    5
dtype: int64

<span style="color:green; font-family: Helvetica;">Note: Always use an extra pair of "[]" when retrieving multiple values.</span>
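The dict-like behaviour also covers setting values and testing membership. The following sketch uses the `.loc` indexer for label access; the replacement value `30` and the missing label `z` are arbitrary choices for illustration.

```python
import pandas as pd

series = pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"])

# Set a value by label.
series.loc["c"] = 30

# Membership tests look at the index labels, as with dict keys.
print("c" in series)                      # True
print("z" in series)                      # False

# Read several labels at once (note the inner list).
print(series.loc[["a", "c"]].tolist())    # [1, 30]

# get() returns a default instead of raising for a missing label.
print(series.get("z", -1))                # -1
```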

## <span style="color:#130654">B. DataFrame</span>

- It is a two-dimensional data structure (data is aligned in tabular rows and columns).
- It is size and value mutable.
- Columns can be of different types.
- Both axes (rows and columns) are labeled.
- <span style="color:#130654">Arithmetic operations can be performed on rows and columns.</span>

*Syntax:*
```python
pandas.DataFrame( data, index, columns, dtype, copy)
```

| Parameters | Details                                                      |
| :--------: | ------------------------------------------------------------ |
|  **data**  | Takes various forms like ndarray, list, dict, Series, constants, or another DataFrame |
| **index**  | Row labels. Values must be hashable and the same length as data. Default **np.arange(n)** if no index is passed. |
|**columns**| Column labels. Default **np.arange(n)** if no column labels are passed. |
| **dtype**  | dtype is for data type. If None, data type will be inferred  |
|  **copy**  | Copy data. Default False                                     |
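A short sketch of how these parameters fit together, assuming a 3x2 NumPy array and made-up labels (`r1`..`r3`, `x`, `y`):

```python
import numpy as np
import pandas as pd

data = np.array([[1, 2], [3, 4], [5, 6]])

# index labels the rows, columns labels the columns, dtype sets the storage type.
df_labeled = pd.DataFrame(
    data,
    index=["r1", "r2", "r3"],
    columns=["x", "y"],
    dtype="float64",
)
print(df_labeled.shape)             # (3, 2)
print(df_labeled.loc["r2", "y"])    # 4.0

# Without index and columns, both default to 0, 1, ..., n-1.
df_default = pd.DataFrame(data)
print(list(df_default.index))       # [0, 1, 2]
print(list(df_default.columns))     # [0, 1]
```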

#### <span style="color:#130654">Creating DataFrame</span>


**1. Empty DataFrame**

To create an empty DataFrame, call the `DataFrame()` constructor with no arguments.

In [16]:
df = pd.DataFrame()
print(df)

Empty DataFrame
Columns: []
Index: []


**2. Using List**

In [17]:
# using simple list
simple_list = [1,2,3,4,5]
df_simple_list = pd.DataFrame(simple_list, columns=["Column 1"])
print("DataFrame using simple list:")
print(df_simple_list)

# using nested list
nest_list = [['Alex',10],['Bob',12],['Clarke',13]]
df_nest_list = pd.DataFrame(nest_list,columns=['Name','Age'])
print("\nDataFrame using nested list:")
print(df_nest_list)

DataFrame using simple list:
   Column 1
0         1
1         2
2         3
3         4
4         5

DataFrame using nested list:
     Name  Age
0    Alex   10
1     Bob   12
2  Clarke   13


**3. Using Dict of ndarrays / Lists**

In [18]:
dict_ndarray = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df_ndarray = pd.DataFrame(dict_ndarray)
df_ndarray

| |Name|Age|
|:----:|------|------|
|**0**|Tom|28|
|**1**|Jack|34|
|**2**|Steve|29|
|**3**|Ricky|42|


**4. Using List of Dicts**

In [19]:
list_dicts = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df_list_dicts = pd.DataFrame(list_dicts, columns=['a','b','c','d'])
df_list_dicts

| |a|b|c|d|
|:----:|------|------|------|------|
|**0**|1|2|NaN|NaN|
|**1**|5|10|20.0|NaN|


<span style="color:green; font-family: Helvetica;">
    <strong>Note</strong>:
        <ul>
            <li>Observe that NaN (Not a Number) fills the missing cells.</li>
            <li>If a column label is not a key in any of the dictionaries, the whole column is <strong>NaN</strong>.</li>
        </ul> 
</span>

**5. Using Dict of Series**

In [20]:
dict_series = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df_dict_series = pd.DataFrame(dict_series)
df_dict_series

| |one|two|
|:----:|------|------|
|**a**|1.0|1|
|**b**|2.0|2|
|**c**|3.0|3|
|**d**|NaN|4|


<span style="color:green; font-family: Helvetica;">
    <strong>Note</strong>: When a dataframe is created from Series of unequal length, its index is the union of the Series indexes, and <strong>NaN</strong> fills the positions where a shorter Series has no value.
</span>

#### <span style="color:#130654">Column Operations</span>

**1. Column selection**

- A column can be selected using `dataframe['<column_name>']` (omit the angle brackets).
- Use `dataframe.columns` to get the column names as an Index.

*Example:*

In [21]:
data = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
        'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']),
        'three' : pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])}

df = pd.DataFrame(data, dtype="int64")

In [22]:
# get column names as index
df.columns

Index(['one', 'two', 'three'], dtype='object')

In [23]:
# get first column
df['one']

a    1.0
b    2.0
c    3.0
d    NaN
e    NaN
Name: one, dtype: float64

In [24]:
# get multiple columns
df[['one', 'three']]

| |one|three|
|:----:|------|------|
|**a**|1.0|1|
|**b**|2.0|2|
|**c**|3.0|3|
|**d**|NaN|4|
|**e**|NaN|5|


**2. Column addition**

A new column can be created in a dataframe by adding `new data` or by applying operations on `existing columns`.

*Example:*

In [25]:
# creating new column by adding new data into dataframe
df['four'] = pd.Series([1, 2, 3, 4, 5, 6, 7], index=['a', 'b', 'c', 'd', 'e', 'f', 'g'])
df

| |one|two|three|four|
|:----:|------|------|------|------|
|**a**|1.0|1.0|1|1|
|**b**|2.0|2.0|2|2|
|**c**|3.0|3.0|3|3|
|**d**|NaN|4.0|4|4|
|**e**|NaN|NaN|5|5|


<span style="color:green; font-family: Helvetica;">
    <strong>Note</strong>: 
    When the added Series has index labels that are not in the dataframe's index, those extra labels are ignored and are not appended to the dataframe.
</span>

In [26]:
# creating new column by applying operations on existing columns
df['five'] = df['three']+df['four']

df

| |one|two|three|four|five|
|:----:|------|------|------|------|------|
|**a**|1.0|1.0|1|1|2|
|**b**|2.0|2.0|2|2|4|
|**c**|3.0|3.0|3|3|6|
|**d**|NaN|4.0|4|4|8|
|**e**|NaN|NaN|5|5|10|


#### Question: Why do some columns show float values while others show integers?
### *Hint*: <span style="color: Red; font-family: Helvetica; font-size: 125%; font-weight:700"> `NaN` is a float! </span> 
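The hint can be verified with a small experiment. This sketch compares an integer Series with and without a missing value; the last part uses pandas' nullable `Int64` dtype, which is one possible way to keep integers alongside missing values and is not used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

complete = pd.Series([1, 2, 3])
with_gap = pd.Series([1, 2, np.nan])

print(type(np.nan).__name__)   # float
print(complete.dtype)          # int64
print(with_gap.dtype)          # float64 (NaN forces the whole column to float)

# The nullable integer dtype "Int64" (capital I) stores missing values as <NA>
# and keeps the remaining values as integers.
nullable = pd.Series([1, 2, None], dtype="Int64")
print(nullable.dtype)          # Int64
print(nullable.isna().tolist())  # [False, False, True]
```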

**3. Column Deletion**

Columns can be removed from a dataframe using the following:

|Method|Syntax|
|:----:|------|
|**del**| `del dataframe["<column_name>"]`|
|**pop**| `dataframe.pop("<column_name>")`|
|**drop**| `dataframe.drop(columns=["<column_names>"])`|


<span style="color:#130654; font-family: Helvetica;"><strong>Difference:</strong></span> 
- `pop` returns the removed column, while `del` returns nothing.
- So `pop` can be used to take a column out of one dataframe and push it into another.
- `drop` is a more general pandas method that removes rows or columns, either by passing labels together with the corresponding axis or by passing index or column names directly.
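The difference between `pop` and `del` can be sketched as follows. The two dataframes `source` and `target` and their column names are hypothetical and exist only for this example.

```python
import pandas as pd

source = pd.DataFrame({"id": [1, 2, 3], "score": [0.5, 0.7, 0.9]})
target = pd.DataFrame({"id": [1, 2, 3]})

# pop removes 'score' from source and returns it as a Series,
# which can be assigned straight into another dataframe.
target["score"] = source.pop("score")
print(list(source.columns))   # ['id']
print(list(target.columns))   # ['id', 'score']

# del removes the column and gives nothing back.
del target["id"]
print(list(target.columns))   # ['score']
```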

*Syntax of drop():*
```python
DataFrame.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')
```

|Param|Details|
|:----:|------|
|**labels**| Index or column labels to drop.|
|**axis**| Whether to drop labels from the index (0 or ‘index’) or columns (1 or ‘columns’).|
|**index**| Alternative to specifying axis (labels, axis=0 is equivalent to index=labels).|
|**columns**| Alternative to specifying axis (labels, axis=1 is equivalent to columns=labels).|
|**level**| For MultiIndex, level from which the labels will be removed.|
|**inplace**| If False, return a copy. Otherwise, do operation inplace and return None.|
|**errors**| If ‘ignore’, suppress error and only existing labels are dropped.|
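A brief sketch of how `inplace` and `errors` change the behaviour of `drop()`, assuming a small throwaway dataframe named `df_demo`:

```python
import pandas as pd

df_demo = pd.DataFrame({"two": [1, 2], "three": [3, 4], "four": [5, 6]})

# Default (inplace=False): a new dataframe is returned, the original is unchanged.
trimmed = df_demo.drop(columns=["three"])
print(list(df_demo.columns))   # ['two', 'three', 'four']
print(list(trimmed.columns))   # ['two', 'four']

# errors='ignore': a label that does not exist is skipped instead of raising.
same = df_demo.drop(columns=["missing"], errors="ignore")
print(list(same.columns))      # ['two', 'three', 'four']

# inplace=True: the original is modified and the call returns None.
result = df_demo.drop(columns=["four"], inplace=True)
print(result)                  # None
print(list(df_demo.columns))   # ['two', 'three']
```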


*Example:*

In [27]:
# deleting first column in dataframe
del df['one']
df

| |two|three|four|five|
|:----:|------|------|------|------|
|**a**|1.0|1|1|2|
|**b**|2.0|2|2|4|
|**c**|3.0|3|3|6|
|**d**|4.0|4|4|8|
|**e**|NaN|5|5|10|


In [28]:
# removing last column using pop method
df.pop('five')
df

| |two|three|four|
|:----:|------|------|------|
|**a**|1.0|1|1|
|**b**|2.0|2|2|
|**c**|3.0|3|3|
|**d**|4.0|4|4|
|**e**|NaN|5|5|


In [29]:
# removing column using drop method
df.drop(labels=['three'], axis=1)

#or

df.drop(columns=['three'])

# both will give same result

| |two|four|
|:----:|------|------|
|**a**|1.0|1|
|**b**|2.0|2|
|**c**|3.0|3|
|**d**|4.0|4|
|**e**|NaN|5|


#### <span style="color:#130654">Row Operations</span>

**1. Row selection**

Rows in pandas can be selected in two ways:
1. by label using the `loc` indexer
2. by integer location using the `iloc` indexer

*Example:*

In [30]:
# selecting row using label
df.loc['a']

two      1.0
three    1.0
four     1.0
Name: a, dtype: float64

In [31]:
# selecting row using integer location
df.iloc[0]

two      1.0
three    1.0
four     1.0
Name: a, dtype: float64
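Both indexers also accept a list of rows or a row/column pair. The following is a minimal sketch on an illustrative dataframe `df_demo`, not the `df` built earlier.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"two": [1, 2, 3], "three": [4, 5, 6]},
    index=["a", "b", "c"],
)

# A list of labels (loc) or positions (iloc) returns a dataframe with those rows.
print(df_demo.loc[["a", "c"]].index.tolist())   # ['a', 'c']
print(df_demo.iloc[[0, 2]].index.tolist())      # ['a', 'c']

# A (row, column) pair returns a single cell.
print(df_demo.loc["b", "three"])   # 5
print(df_demo.iloc[1, 1])          # 5
```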

**2. Slice Rows**

- Rows in pandas can be sliced using the `:` operator
- Rows can be sliced using `index position` or `index label`
- Slicing by index position works like normal Python slicing (the end position is excluded), while slicing by index label includes both the start and the end label

*Example:*

In [32]:
# slicing using index position
df[1:4]

| |two|three|four|
|:----:|------|------|------|
|**b**|2.0|2|2|
|**c**|3.0|3|3|
|**d**|4.0|4|4|


In [33]:
# slicing using index label
df['b':'d']

| |two|three|four|
|:----:|------|------|------|
|**b**|2.0|2|2|
|**c**|3.0|3|3|
|**d**|4.0|4|4|
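The endpoint rule is easier to see when both slices use the "same" bounds. The sketch below, on an illustrative single-column dataframe, writes the slices with `iloc` and `loc` to make the intent explicit.

```python
import pandas as pd

df_demo = pd.DataFrame({"two": [1, 2, 3, 4, 5]}, index=list("abcde"))

# Position slice 1:3 stops before position 3, so it covers 'b' and 'c'.
by_position = df_demo.iloc[1:3]
print(by_position.index.tolist())   # ['b', 'c']

# Label slice 'b':'d' includes the end label, so it covers 'b', 'c' and 'd'.
by_label = df_demo.loc["b":"d"]
print(by_label.index.tolist())      # ['b', 'c', 'd']
```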


**3. Adding rows**

- Rows can be added to a dataframe using the `append()` method
- A single row can be appended
- Rows from another dataframe with the same fields/columns can be appended
- `append()` does not work in place; it returns a new dataframe, so assign the result back to keep it
- `append()` was deprecated in pandas 1.4 and removed in pandas 2.0, where `pandas.concat()` is the replacement

*Syntax:*
```python
DataFrame.append(other, ignore_index=False, verify_integrity=False, sort=False)
```

|Param|Details|
|:----:|------|
|**other**| The data to append.|
|**ignore_index**| If True, the resulting axis will be labeled 0, 1, …, n - 1.|
|**verify_integrity**|If True, raise ValueError on creating index with duplicates.|
|**sort**|Sort columns if the columns of self and other are not aligned.|

*Example:* Method 1

In [34]:
# Directly appending dictionary to dataframe
df.append({'two':5,'three':6}, ignore_index=True)

| |two|three|four|
|:----:|------|------|------|
|**0**|1.0|1.0|1.0|
|**1**|2.0|2.0|2.0|
|**2**|3.0|3.0|3.0|
|**3**|4.0|4.0|4.0|
|**4**|NaN|5.0|5.0|
|**5**|5.0|6.0|NaN|


*Example:* Method 2

In [35]:
# Creating dataframe and then appending

# creating another dataframe
data2 = {'two':[6],'three':[7],'four':[7]}
df2 = pd.DataFrame(data2)
df2

| |two|three|four|
|:----:|------|------|------|
|**0**|6|7|7|


In [36]:
# appending new dataframe with old dataframe
df.append(df2, ignore_index=True)

| |two|three|four|
|:----:|------|------|------|
|**0**|1.0|1|1|
|**1**|2.0|2|2|
|**2**|3.0|3|3|
|**3**|4.0|4|4|
|**4**|NaN|5|5|
|**5**|6.0|7|7|
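`DataFrame.append()` is no longer available from pandas 2.0 onward, so the two methods above fail on current versions. A rough equivalent using `pd.concat()` is sketched below; the dataframe names `df_old`, `new_row`, and `df_new` are illustrative and the data is shortened.

```python
import pandas as pd

df_old = pd.DataFrame({"two": [1.0, 2.0], "three": [1, 2], "four": [1, 2]})

# Method 1: wrap the dictionary in a list to get a one-row dataframe first.
new_row = pd.DataFrame([{"two": 5, "three": 6}])
with_row = pd.concat([df_old, new_row], ignore_index=True)
print(len(with_row))                       # 3
print(with_row["four"].isna().tolist())    # [False, False, True]

# Method 2: concatenate another dataframe that has the same columns.
df_new = pd.DataFrame({"two": [6], "three": [7], "four": [7]})
combined = pd.concat([df_old, df_new], ignore_index=True)
print(combined.index.tolist())             # [0, 1, 2]
print(combined["three"].tolist())          # [1, 2, 7]
```

Like `append()`, `pd.concat()` returns a new dataframe and leaves its inputs unchanged.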


**4. Deleting rows**

Rows in a pandas dataframe can be deleted using the `drop()` method.

*Example:*

In [37]:
# Removing rows using drop method
df.drop(labels=['a', 'c'], axis=0)

# OR

df.drop(index=['a', 'c'])

# OR

df.drop(index=df.iloc[[0, 2]].index, axis=0)

# each of these methods will give same result

| |two|three|four|
|:----:|------|------|------|
|**b**|2.0|2|2|
|**d**|4.0|4|4|
|**e**|NaN|5|5|


In the last method, `df.iloc[[0,2]]` selects the rows at positions 0 and 2 (labels 'a' and 'c'), and the `.index` attribute extracts their labels, which are then passed to the `index` parameter. It is clearer to store the labels in a variable than to pass the dataframe expression directly.
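Storing the labels in a variable, as suggested above, might look like the following sketch. The dataframe `df_demo` and the variable `labels_to_drop` are names chosen for this example.

```python
import pandas as pd

df_demo = pd.DataFrame({"two": [1, 2, 3, 4, 5]}, index=list("abcde"))

# Look up the labels at positions 0 and 2 once and keep them in a variable.
labels_to_drop = df_demo.index[[0, 2]]
print(labels_to_drop.tolist())     # ['a', 'c']

# drop() returns a new dataframe without those rows.
remaining = df_demo.drop(index=labels_to_drop)
print(remaining.index.tolist())    # ['b', 'd', 'e']
```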

## <span style="color:#130654">C. Panel</span>

### <span style="color: Red; font-family: Helvetica; font-size: 125%; font-weight:700"> Warning: </span> Panel was removed in 0.25.0.

- `Panel` is a 3D container of data
- The term *panel data* comes from econometrics
- The names for the 3 axes are intended to give some semantic meaning to describing operations involving panel data

*Syntax:*
```python
pandas.Panel(data, items, major_axis, minor_axis, dtype, copy)
```

|Param|Details|
|:----:|------|
|**data**| Data takes various forms like ndarray, series, map, lists, dict, constants and also another DataFrame.|
|<span style="color:red">**items**</span>| `axis 0`, each item corresponds to a DataFrame contained inside.|
|<span style="color:red">**major_axis**</span>| `axis 1`, it is the index (rows) of each of the DataFrames. |
|<span style="color:red">**minor_axis**</span>| `axis 2`, it is the columns of each of the DataFrames. |
|**dtype**| Data type of each column. |
|**copy**| Copy data. Default False. |
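Because `pandas.Panel` no longer exists, the syntax above cannot be run on current versions. One common substitute is a DataFrame with a MultiIndex on the rows, sketched below under that assumption: the outer index level plays the role of `items`, the inner level the role of `major_axis`, and the columns the role of `minor_axis`. The names `item1`, `item2`, `Item1`, `Item2`, `item`, and `row` are made up for this example.

```python
import pandas as pd

item1 = pd.DataFrame({"x": [1, 2], "y": [3, 4]}, index=["r1", "r2"])
item2 = pd.DataFrame({"x": [5, 6], "y": [7, 8]}, index=["r1", "r2"])

# Passing a dict to concat uses its keys as the outer level of a row MultiIndex.
stacked = pd.concat({"Item1": item1, "Item2": item2}, names=["item", "row"])
print(stacked.index.nlevels)                 # 2
print(stacked.shape)                         # (4, 2)

# Select one "item" (a 2D slice) or a single cell.
print(stacked.loc["Item2"].index.tolist())   # ['r1', 'r2']
print(stacked.loc[("Item2", "r1"), "y"])     # 7
```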