----------------
## What is pandas?
------------

- `pandas` (all lowercase) is a popular Python-based data analysis toolkit which can be imported using `import pandas as pd`. 

- It offers a wide range of utilities, from parsing multiple file formats to converting an entire data table into a NumPy array. This makes pandas a trusted ally in `data science` and `machine learning`.

- Similar to `NumPy`, pandas deals primarily with data in 1-D and 2-D arrays; however, pandas handles the two differently.
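
As a rough illustration of the two utilities mentioned above, the sketch below parses a small CSV held in an in-memory string and converts the resulting table to a NumPy array. The column names and values are made up for the example.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical CSV content, kept in memory so the example is self-contained.
csv_text = "x,y\n1,2.5\n3,4.5\n"

table = pd.read_csv(io.StringIO(csv_text))  # parse CSV text into a DataFrame
matrix = table.to_numpy()                   # convert the whole table to a 2-D ndarray

print(table.dtypes)
print(matrix.shape)
```

Because the two columns have different dtypes, `to_numpy()` has to pick one common dtype for the whole array.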

In [1]:
import numpy as np
import pandas as pd

#### Data structures

Here is a basic tenet to keep in mind: `data alignment` is intrinsic.

The link between `labels` and `data` will not be broken unless done so explicitly by you.
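
A minimal sketch of what this means in practice, using a hypothetical labeled Series of prices: sorting or filtering moves each label together with its value, so the pairing is never lost.

```python
import pandas as pd

# Hypothetical prices keyed by product label.
prices = pd.Series([3.0, 1.0, 2.0], index=["pear", "apple", "kiwi"])

# Sorting reorders values and labels together; each label keeps its value.
by_price = prices.sort_values()
print(by_price)

# Filtering keeps the labels of the rows that survive.
cheap = prices[prices < 2.5]
print(cheap.index.tolist())
```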

#### 1. Series
- In pandas, 1-D arrays are referred to as a `Series`.

- A `Series` is created through the `pd.Series` constructor, which has many optional arguments. The most common argument is `data`, which specifies the elements of the Series.

- A `Series` can hold any data type (integers, strings, floating point numbers, Python objects, etc.).

- The axis labels are collectively referred to as the `index`.

Here, `data` can be many different things:

- a Python dict
- an ndarray
- a scalar value (like 5)

The passed `index` is a list of axis labels. 

##### From ndarray

If data is an `ndarray`, index must be the same length as data. 

If no index is passed, one will be created having values [0, ..., len(data) - 1].

In [3]:
s = pd.Series(data=np.random.randn(5), index=["a", "b", "c", "d", "e"])

In [4]:
s

a    3.372493
b   -0.100405
c   -0.306372
d    0.886006
e   -1.266740
dtype: float64

In [4]:
s.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [5]:
pd.Series(np.random.randn(5))

0   -0.121653
1    1.542080
2    1.801043
3    0.553854
4    0.962291
dtype: float64

##### From dict

Series can be instantiated from dicts:

In [5]:
data = {"b": 1, "a": 0, "c": 2}

pd.Series(data)

b    1
a    0
c    2
dtype: int64

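When the data is a dict and no index is passed, the Series index follows the dict's insertion order. If an index is passed, the values matching its labels are pulled out of the dict, and labels with no matching key get `NaN`: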

In [6]:
pd.Series(data, index=["b", "c", "d", "a"])

b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64

`NaN` (not a number) is the standard `missing data marker` used in pandas.
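
A short sketch of how missing entries can be detected and handled, reusing the dict from the example above. The variable name `s_missing` is introduced here only for illustration.

```python
import pandas as pd

# Label "d" has no matching key in the dict, so its value is NaN.
s_missing = pd.Series({"b": 1, "a": 0, "c": 2}, index=["b", "c", "d", "a"])

print(s_missing.isna())       # boolean mask marking the missing entries
print(s_missing.dropna())     # copy without the missing entries
print(s_missing.fillna(0.0))  # copy with NaN replaced by 0.0
```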

##### From scalar value

If data is a scalar value, an index must be provided. The value will be repeated to match the length of index.

In [7]:
pd.Series(5.0, index=["a", "b", "c", "d", "e"])

a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64

`Series is ndarray-like`

- Series acts very similarly to an ndarray, and is a valid argument to most NumPy functions.
- However, operations such as `slicing` will also `slice the index`.

In [12]:
s

a    1.496840
b   -1.394502
c    3.936838
d    0.563542
e   -1.767142
dtype: float64

In [10]:
s.iloc[0]  # positional access; plain s[0] is deprecated on a label-indexed Series

1.4968401053714555

In [11]:
s[:3]

a    1.496840
b   -1.394502
c    3.936838
dtype: float64

In [9]:
s > s.median()

a     True
b    False
c     True
d    False
e    False
dtype: bool

In [10]:
s[s > s.median()]

a    1.496840
c    3.936838
dtype: float64

In [14]:
s

a    1.496840
b   -1.394502
c    3.936838
d    0.563542
e   -1.767142
dtype: float64

In [11]:
s.iloc[[4, 3, 1]]

e   -1.767142
d    0.563542
b   -1.394502
dtype: float64

In [12]:
np.exp(s)

a    1.030765
b    1.160134
c    0.271633
d    1.959788
e    0.834230
dtype: float64

Like a NumPy array, a pandas Series has a `dtype`.

In [13]:
s.dtype

dtype('float64')

While Series is ndarray-like, if you need an `actual ndarray`, use `Series.to_numpy()`:

In [14]:
s.to_numpy()

array([ 0.0303014 ,  0.14853526, -1.30330302,  0.67283636, -0.18124577])

`Series is dict-like`

- A `Series` is like a fixed-size `dict` in that you can get and set values by index label:

In [20]:
s["a"], s.iloc[0]

(1.4968401053714555, 1.4968401053714555)

In [22]:
s.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [23]:
"e" in s

True
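
The cells above only read values. The sketch below, on a small made-up Series named `s_dict`, also sets a value by label and shows `Series.get`, which returns a default instead of raising `KeyError` when a label is missing.

```python
import pandas as pd

s_dict = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])

s_dict["b"] = 20.0               # set a value by its label

missing = s_dict.get("z")        # label not present: returns None
fallback = s_dict.get("z", 0.0)  # label not present: returns the given default

try:
    s_dict["z"]                  # plain label lookup raises for a missing label
except KeyError:
    print("label 'z' is not in the index")
```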

##### Vectorized operations and label alignment with Series

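- When working with raw `NumPy` arrays, `looping` through value-by-value is usually not necessary.

- The same is true when working with `Series` in pandas. Operations between two Series also align automatically on index labels; a label present in only one operand yields `NaN` in the result.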


In [15]:
s + s

a    0.060603
b    0.297071
c   -2.606606
d    1.345673
e   -0.362492
dtype: float64

In [16]:
s * 2

a    0.060603
b    0.297071
c   -2.606606
d    1.345673
e   -0.362492
dtype: float64

In [17]:
np.exp(s)

a    1.030765
b    1.160134
c    0.271633
d    1.959788
e    0.834230
dtype: float64

In [27]:
# vectorize operations
vector1 = pd.Series([1, 2, 3, 4],  index=['a', 'b','c','d'])
vector2 = pd.Series([10,20,30,40], index=['a', 'b','c','d'])

In [28]:
vector1 + vector2

a    11
b    22
c    33
d    44
dtype: int64

In [30]:
vector3 = pd.Series([10,20,300,400], index=['a', 'b','e','f'])

In [31]:
vector1 + vector3

a    11.0
b    22.0
c     NaN
d     NaN
e     NaN
f     NaN
dtype: float64
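
If the `NaN` entries for non-matching labels are not wanted, two common options are sketched below: `Series.add` with `fill_value`, which treats a label missing from one operand as a given value, and `dropna`, which keeps only the labels present in both operands. The names `filled` and `common` are introduced here for illustration.

```python
import pandas as pd

vector1 = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])
vector3 = pd.Series([10, 20, 300, 400], index=["a", "b", "e", "f"])

# Treat a label missing from one operand as 0 instead of producing NaN.
filled = vector1.add(vector3, fill_value=0)
print(filled)

# Keep only the labels that exist in both operands.
common = (vector1 + vector3).dropna()
print(common)
```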

##### Name attribute

Series can also have a `name` attribute:

In [20]:
s = pd.Series(np.random.randn(5), name="something")
s

0    0.307229
1    2.070136
2    0.073581
3   -1.177062
4    2.211530
Name: something, dtype: float64

In [33]:
s2 = s.rename("different")
s2

0    0.307229
1    2.070136
2    0.073581
3   -1.177062
4    2.211530
Name: different, dtype: float64

In [29]:
id(s), id(s2)

(1342817567112, 1342817570568)

Note that `s` and `s2` refer to different objects.
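
A small sketch contrasting the two ways of changing a name, using fixed values instead of random ones: `rename` returns a new Series and leaves the original untouched, while assigning to the `name` attribute modifies the Series in place.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], name="something")

s2 = s.rename("different")  # new object; s keeps its original name
print(s.name, s2.name)
print(s2 is s)

s.name = "renamed"          # in-place change on the original object
print(s.name, s2.name)
```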

#### 2. DataFrame

- `DataFrame` is a 2-dimensional labeled data structure with columns of potentially different types. 

- You can think of it like a `spreadsheet` or SQL table, or a `dict` of Series objects. 

- It is generally the most commonly used pandas object. 

- Like Series, DataFrame accepts many different kinds of input:

    - Dict of 1D ndarrays, lists, dicts, or Series
    - 2-D numpy.ndarray
    - Structured or record ndarray
    - A Series
    - Another DataFrame

`From dict of Series or dicts`

- The resulting index will be the `union of the indexes` of the various Series. 

- If there are any `nested dicts`, these will first be converted to `Series`. 

- If `no columns` are passed, the columns will be the ordered list of `dict keys`.

In [32]:
d = {
    "one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
    "two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"]),
}

In [33]:
df = pd.DataFrame(d)
df

   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0


In [32]:
pd.DataFrame(d, index=["d", "b", "a"])

   one  two
d  NaN  4.0
b  2.0  2.0
a  1.0  1.0


In [36]:
pd.DataFrame(d, index=["d", "b", "a"], columns=["abc", "xyz"])

   abc  xyz
d  NaN  NaN
b  NaN  NaN
a  NaN  NaN


In [34]:
df.index

Index(['a', 'b', 'c', 'd'], dtype='object')

In [35]:
df.columns

Index(['one', 'two'], dtype='object')
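
The nested-dict case from the list above is not shown in the cells, so here is a minimal sketch with made-up values. Each inner dict behaves like a Series whose keys become row labels, and the row index is the union of those keys.

```python
import pandas as pd

# Hypothetical nested dict: outer keys become columns, inner keys become row labels.
nested = {
    "one": {"a": 1.0, "b": 2.0},
    "two": {"b": 20.0, "c": 30.0},
}

df_nested = pd.DataFrame(nested)
print(df_nested)  # rows a, b, c; cells with no value are NaN
```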

`From dict of ndarrays / lists`

- The `ndarrays` must all be the same length. 

- If an `index` is passed, it must also be the same length as the arrays.

- If no index is passed, the index will be `range(n)`, where `n` is the array length.

In [36]:
d = {"one": [1.0, 2.0, 3.0, 4.0], 
     "two": [4.0, 3.0, 2.0, 1.0]}

In [37]:
pd.DataFrame(d)

   one  two
0  1.0  4.0
1  2.0  3.0
2  3.0  2.0
3  4.0  1.0


In [38]:
pd.DataFrame(d, index=["a", "b", "c", "d"])

   one  two
a  1.0  4.0
b  2.0  3.0
c  3.0  2.0
d  4.0  1.0
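
To illustrate the length requirement, the sketch below deliberately passes mismatched inputs and catches the resulting `ValueError`. The data is made up, and the exact error text may vary between pandas versions.

```python
import pandas as pd

# Lists of different lengths (3 and 2) cannot form one table.
mismatched = {"one": [1.0, 2.0, 3.0], "two": [4.0, 3.0]}

length_error = None
try:
    pd.DataFrame(mismatched)
except ValueError as exc:
    length_error = exc
    print("ValueError:", exc)

# An index whose length differs from the arrays is rejected as well.
index_error = None
try:
    pd.DataFrame({"one": [1.0, 2.0]}, index=["a", "b", "c"])
except ValueError as exc:
    index_error = exc
    print("ValueError:", exc)
```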


`From a list of dicts`

In [39]:
data2 = [{"a": 1, "b": 2}, {"a": 5, "b": 10, "c": 20}]

In [40]:
pd.DataFrame(data2)

   a   b     c
0  1   2   NaN
1  5  10  20.0


In [41]:
pd.DataFrame(data2, index=["first", "second"])

        a   b     c
first   1   2   NaN
second  5  10  20.0


In [45]:
pd.DataFrame(data2, columns=["a", "b"])

   a   b
0  1   2
1  5  10


`From a dict of tuples`

In [42]:
pd.DataFrame(
    {
        ("a", "b"): {("A", "B"): 1, ("A", "C"): 2},
        ("a", "a"): {("A", "C"): 3, ("A", "B"): 4},
        ("a", "c"): {("A", "B"): 5, ("A", "C"): 6},
        ("b", "a"): {("A", "C"): 7, ("A", "B"): 8},
        ("b", "b"): {("A", "D"): 9, ("A", "B"): 10},
    }
)

       a              b
       b    a    c    a     b
A B  1.0  4.0  5.0  8.0  10.0
  C  2.0  3.0  6.0  7.0   NaN
  D  NaN  NaN  NaN  NaN   9.0
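
Passing tuples as keys produces a `MultiIndex` on both the rows and the columns. The sketch below uses a reduced version of the dict above (two columns only, named `df_multi` for illustration) to show how such a frame can be queried.

```python
import pandas as pd

df_multi = pd.DataFrame(
    {
        ("a", "b"): {("A", "B"): 1, ("A", "C"): 2},
        ("a", "a"): {("A", "C"): 3, ("A", "B"): 4},
    }
)

print(type(df_multi.columns).__name__)  # both axes are MultiIndex objects
print(type(df_multi.index).__name__)

# Select every column under the top-level column label "a".
print(df_multi["a"])

# Select a single cell with a full row tuple and a full column tuple.
cell = df_multi.loc[("A", "B"), ("a", "a")]
print(cell)
```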


#### Features and functionalities offered by pandas

In [16]:
data2 = [{"col1": 'DELIVEROO', "col2": 2, "col3": 3}, 
         {"col1": 5, "col2": 10, "col3": 20},
         {"col1": 5, "col2": 10, "col3": 20}
        ]

In [17]:
df = pd.DataFrame(data2)
df

        col1  col2  col3
0  DELIVEROO     2     3
1          5    10    20
2          5    10    20


In [11]:
df.loc[df["col1"] == "DELIVEROO", "col2"].sum()

2

In [18]:
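def sum_dev(df, col1, col2, condition):
    """Sum the values in `col2` over the rows where `col1` equals `condition`."""
    # The boolean mask selects the matching rows; .loc then picks the column to sum.
    res = df.loc[df[col1] == condition, col2].sum()
    return res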

In [19]:
sum_dev(df, 'col1', 'col2', 'DELIVEROO')

2
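
When the same conditional sum is needed for every distinct value of the key column, `groupby` can compute all of them in one pass. The sketch below is an alternative to calling `sum_dev` repeatedly, not a replacement for it; the `orders` frame, its column names, and the second vendor name are hypothetical.

```python
import pandas as pd

# Hypothetical data with a consistent string key column.
orders = pd.DataFrame(
    {
        "vendor": ["DELIVEROO", "UBER", "UBER"],
        "amount": [2, 10, 10],
    }
)

# One conditional sum, in the same style as sum_dev.
single = orders.loc[orders["vendor"] == "DELIVEROO", "amount"].sum()

# All conditional sums at once: one total per distinct vendor.
totals = orders.groupby("vendor")["amount"].sum()

print(single)
print(totals)
```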