In [1]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

# Session 3 - The making of Pandas data frames


<img src="img/company-logo.png" width=120 height=120 align="right">

Author: Prof. Manoel Gadi

Contact: manoelgadi@gmail.com

Teaching Web: http://mfalonso.pythonanywhere.com

Linkedin: https://www.linkedin.com/in/manoel-gadi-97821213/

Github: https://github.com/manoelgadi

Last revision: 27/October/2022


## A bit of Pandas history

Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

Pandas began in 2008 as a project to bring together a set of capabilities that no single tool offered at the time.

Its author, Wes McKinney, was working at the investment fund AQR Capital Management at the time. He found in Numpy and basic Python all the ingredients needed to put together a data structure as powerful as its rivals, the proprietary SAS .sas7bdat and MATLAB .mat data formats.



<img src="img/WesMcKinney.png" align="left">

Working in stock trading means BIG DATA - Variety, Velocity and Volume.

* Variety: Pandas uses independent Numpy arrays to allow variety.
* Velocity: Pandas is an in-memory data format, allowing much faster computation than early rivals.
* Volume: The operating system's disk swap lets pandas use both memory and disk to hold huge data sets, although swapping is much slower than working purely in memory.
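
To make the "variety" point concrete, here is a minimal sketch with made-up trade records (the column names and values are hypothetical). It shows that each column keeps its own data type and can be pulled out as its own 1D Numpy array:

```python
import numpy as np
import pandas as pd

# Hypothetical trade records: each column holds a different kind of data.
trades = pd.DataFrame({
    "ticker": ["AAA", "BBB", "CCC"],   # text
    "shares": [100, 250, 75],          # integers
    "price": [10.5, 20.25, 7.0],       # floats
})

# One dtype per column, not one dtype for the whole table.
print(trades.dtypes)

# Each column can be extracted as an independent 1D Numpy array.
prices = trades["price"].to_numpy()
shares = trades["shares"].to_numpy()
print(type(prices), prices.dtype, shares.dtype)
```

Because every column is stored separately, adding a text column next to a numeric one does not force the numbers to become text.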


What else comes with pandas?

* Data structures with labeled axes
* Time series functionalities
* Same data structures for both time-series and non-time series data
* Arithmetic operations based on labels
* Flexible handling of missing data
* Merge and relational operations (like in SQL)
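
Three of the features listed above can be sketched in a few lines. The data below is invented purely for illustration:

```python
import pandas as pd

# Arithmetic is aligned on labels, not on positions.
s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])
total = s1 + s2   # "a" and "d" have no partner, so they become NaN
print(total)

# Missing data can be filled (or dropped with .dropna()).
filled = total.fillna(0.0)
print(filled)

# Merge works like an SQL join on a shared key column.
left = pd.DataFrame({"key": ["x", "y", "z"], "left_value": [1, 2, 3]})
right = pd.DataFrame({"key": ["y", "z", "w"], "right_value": [20, 30, 40]})
joined = pd.merge(left, right, on="key", how="inner")
print(joined)
```

Only the keys present in both tables ("y" and "z") survive the inner join, just as they would in SQL.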

Today, it has grown into one of the most important Data Analysis libraries!

* Actively supported by a community from around the world
* Is the go-to library for high-performance, easy-to-use data structures and data analysis tools 
* Is a project sponsored by NumFOCUS, same as:


<img src="img/np_mat_jup.png" width=500 height=400 align="left">

---

# 2. Pandas

Now we have all the ingredients to create one of the most powerful data structures of all, the DATA FRAME.

For this we will need:

* 2 lists
* 2 dictionaries
* and many 1D numpy arrays
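
As a teaching sketch only (this is a mental model, not how pandas is implemented internally), the ingredients can be combined by hand. All names below (`row_labels`, `toy_loc`, `toy_iloc`, and so on) are hypothetical:

```python
import numpy as np

# Two lists: position -> label, one for rows and one for columns.
row_labels = ["row0", "row1", "row2"]
col_labels = ["A", "B"]

# Two dictionaries: label -> position, built from the lists.
row_pos = {label: i for i, label in enumerate(row_labels)}
col_pos = {label: i for i, label in enumerate(col_labels)}

# Many 1D arrays: one per column, each with its own dtype.
columns = [np.array([1.0, 2.0, 3.0]), np.array([10, 20, 30])]

def toy_loc(row_label, col_label):
    """Look a value up by labels, going through the dictionaries."""
    return columns[col_pos[col_label]][row_pos[row_label]]

def toy_iloc(i, j):
    """Look a value up by positions, going straight to the arrays."""
    return columns[j][i]

value_by_label = toy_loc("row1", "B")
value_by_position = toy_iloc(1, 1)
print(value_by_label, value_by_position)
```

Both lookups reach the same cell: one by name (the dictionary), the other by position (the list).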

## 2.1 Pandas Data Frame Structure

<img src="img/pandas_df_structure.png" align="left">

Let's first import pandas using the alias (or nickname) 'pd':

In [2]:
import pandas as pd
import numpy as np

In [3]:
df=pd.DataFrame(np.random.randn(10,4))
print(df)

          0         1         2         3
0  0.353322 -0.547099 -0.129439 -1.158755
1 -0.999245  0.251460 -0.684030  0.124800
2  0.543713 -0.142925  0.342517 -2.340069
3  0.603019  0.776082 -0.705805  1.172718
4  2.151902  1.337099 -2.238540 -0.155609
5  0.413653  0.214892 -0.143864 -1.185378
6 -0.064504  0.150199  0.629304  0.047826
7 -0.043980 -1.121524  0.199819  1.017386
8  1.096159 -0.029803 -0.234149  0.297689
9  0.270622 -0.192437 -0.553986  1.168953


In [4]:
df['A']

KeyError: 'A'
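
The `KeyError` above is expected: when no `columns` argument is given, pandas labels the columns with the integers 0, 1, 2, ... so there is no column called 'A' yet. A minimal sketch (the name `df_default` is used only for illustration):

```python
import numpy as np
import pandas as pd

df_default = pd.DataFrame(np.random.randn(10, 4))

# The default column labels are integers, not letters.
print(list(df_default.columns))

# So the first column is selected with the integer label 0.
first_col = df_default[0]

# Checking for a label before using it avoids the KeyError.
has_a = "A" in df_default.columns
print(has_a)
```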

In [5]:
df=pd.DataFrame(np.random.randn(10,4),columns=['A','B','C','D'])
print(df)

          A         B         C         D
0 -0.825523  0.401358  1.199114  2.092676
1  0.223923  0.484344  0.227367  1.722459
2  0.960186 -1.567375 -0.503148 -0.611427
3 -0.576443 -0.636377  0.480347  1.158419
4  1.148812  0.496597 -0.767413  0.278265
5  1.137837 -0.257678  2.012630 -0.488774
6 -0.636440  1.080663  1.004879  0.361474
7 -0.169457  1.206869  0.680214  1.178184
8  0.626454  0.077148 -1.098140  0.492865
9 -0.622009  0.920793  0.354062 -0.675026


In [6]:
df['A']

0   -0.825523
1    0.223923
2    0.960186
3   -0.576443
4    1.148812
5    1.137837
6   -0.636440
7   -0.169457
8    0.626454
9   -0.622009
Name: A, dtype: float64

In [7]:
df.iloc[:,0]

0   -0.825523
1    0.223923
2    0.960186
3   -0.576443
4    1.148812
5    1.137837
6   -0.636440
7   -0.169457
8    0.626454
9   -0.622009
Name: A, dtype: float64

In [8]:
df=pd.DataFrame(np.random.randn(10,4),columns=['A','B a l l','C','D'],index=['row0', 'row1', 'row2', 'row3', 'row4', 'row5', 
                                                                       'row6', 'row7', 'row8', 'row9'])
print(df)

             A   B a l l         C         D
row0 -2.042564 -0.056018  0.417207  0.003894
row1 -0.442620  0.857589  0.314043 -1.268257
row2 -1.504804 -0.027168 -0.930094  1.081655
row3 -1.235314 -0.336581  0.336642  0.093850
row4 -0.735665 -0.004192 -0.372772 -0.969991
row5  1.561050 -2.950871  0.193347  1.318407
row6 -0.995347 -0.520899 -0.538398 -0.177308
row7  0.530160 -0.762979  0.229581 -0.336207
row8 -1.195998  1.172738  0.139174  1.579593
row9 -0.718062  0.651051 -0.777075  0.342675


Selecting a column using its name, in other words, using the dictionary!

In [9]:
df.A

row0   -2.042564
row1   -0.442620
row2   -1.504804
row3   -1.235314
row4   -0.735665
row5    1.561050
row6   -0.995347
row7    0.530160
row8   -1.195998
row9   -0.718062
Name: A, dtype: float64

In [12]:
df["A"]

row0   -2.042564
row1   -0.442620
row2   -1.504804
row3   -1.235314
row4   -0.735665
row5    1.561050
row6   -0.995347
row7    0.530160
row8   -1.195998
row9   -0.718062
Name: A, dtype: float64

In [11]:
df["B a l l"]

row0   -0.056018
row1    0.857589
row2   -0.027168
row3   -0.336581
row4   -0.004192
row5   -2.950871
row6   -0.520899
row7   -0.762979
row8    1.172738
row9    0.651051
Name: B a l l, dtype: float64

In [13]:
df.loc[:,"B a l l"]

row0   -0.056018
row1    0.857589
row2   -0.027168
row3   -0.336581
row4   -0.004192
row5   -2.950871
row6   -0.520899
row7   -0.762979
row8    1.172738
row9    0.651051
Name: B a l l, dtype: float64

In [14]:
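# df.B a l l would be a SyntaxError: attribute access only works when the column name is a valid Python identifier.
df['B a l l'] # equivalent to df.loc[:,'B a l l'] - df.loc[rows,columns] - LOC is for accessing the DICTIONARY of labels.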

row0   -0.056018
row1    0.857589
row2   -0.027168
row3   -0.336581
row4   -0.004192
row5   -2.950871
row6   -0.520899
row7   -0.762979
row8    1.172738
row9    0.651051
Name: B a l l, dtype: float64
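
The three ways of selecting a column by name can be compared directly. This is a small sketch on a hypothetical frame called `demo`, so it does not touch `df`:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(np.random.randn(3, 2),
                    columns=["A", "B a l l"],
                    index=["row0", "row1", "row2"])

by_attribute = demo.A           # only for names that are valid identifiers
by_brackets = demo["A"]         # works for any column name
by_loc = demo.loc[:, "A"]       # explicit [rows, columns] form

# All three return the same Series.
same = by_attribute.equals(by_brackets) and by_brackets.equals(by_loc)
print(same)

# A name with spaces needs brackets or .loc.
spaced = demo["B a l l"]
```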

Selecting rows and columns with `.loc` (labels) and `.iloc` (positions, in other words, using the list!)

In [16]:
print(df)

             A   B a l l         C         D
row0 -2.042564 -0.056018  0.417207  0.003894
row1 -0.442620  0.857589  0.314043 -1.268257
row2 -1.504804 -0.027168 -0.930094  1.081655
row3 -1.235314 -0.336581  0.336642  0.093850
row4 -0.735665 -0.004192 -0.372772 -0.969991
row5  1.561050 -2.950871  0.193347  1.318407
row6 -0.995347 -0.520899 -0.538398 -0.177308
row7  0.530160 -0.762979  0.229581 -0.336207
row8 -1.195998  1.172738  0.139174  1.579593
row9 -0.718062  0.651051 -0.777075  0.342675


In [17]:
df.loc['row0',:]

A         -2.042564
B a l l   -0.056018
C          0.417207
D          0.003894
Name: row0, dtype: float64

In [23]:
df.iloc[3:-1,2:-1]

             C
row3  0.336642
row4 -0.372772
row5  0.193347
row6 -0.538398
row7  0.229581
row8  0.139174


In [24]:
df.iloc[1:4,1:3] # rows at positions 1 to 3 and columns at positions 1 to 2 (positions start at 0; the stop value is excluded)
#  - ILOC is for accessing the LIST of indexes.

       B a l l         C
row1  0.857589  0.314043
row2 -0.027168 -0.930094
row3 -0.336581  0.336642
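
One detail worth knowing when moving between the two accessors: a slice in `.iloc` excludes its stop position (like a Python list), while a slice in `.loc` includes its stop label. A minimal sketch on a hypothetical frame `demo`:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(np.arange(20).reshape(5, 4),
                    columns=["A", "B a l l", "C", "D"],
                    index=["row0", "row1", "row2", "row3", "row4"])

# Positions 1, 2, 3 for rows and 1, 2 for columns (stop is excluded).
by_position = demo.iloc[1:4, 1:3]

# Labels 'row1' through 'row3' and 'B a l l' through 'C' (stop is included).
by_label = demo.loc["row1":"row3", "B a l l":"C"]

print(by_position.equals(by_label))
```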


# Exercise: 
## Select the second row using its name 'row1' and the column name 'C', in other words, using the dictionary!

Selecting the second row using its position or index, in other words, using the list!

In [33]:
df

             A   B a l l         C         D
row0 -2.042564 -0.056018  0.417207  0.003894
row1 -0.442620  0.857589  0.314043 -1.268257
row2 -1.504804 -0.027168 -0.930094  1.081655
row3 -1.235314 -0.336581  0.336642  0.093850
row4 -0.735665 -0.004192 -0.372772 -0.969991
row5  1.561050 -2.950871  0.193347  1.318407
row6 -0.995347 -0.520899 -0.538398 -0.177308
row7  0.530160 -0.762979  0.229581 -0.336207
row8 -1.195998  1.172738  0.139174  1.579593
row9 -0.718062  0.651051 -0.777075  0.342675


In [34]:
df.iloc[1] # second row of the data frame, selected by its position

A         -0.442620
B a l l    0.857589
C          0.314043
D         -1.268257
Name: row1, dtype: float64

In [35]:
a = [0,1,2,3,4,5,6,7]

In [36]:
type(a)

list

In [37]:
a[0:3]

[0, 1, 2]

In [40]:
df.iloc[1:10]['A']

row1   -0.442620
row2   -1.504804
row3   -1.235314
row4   -0.735665
row5    1.561050
row6   -0.995347
row7    0.530160
row8   -1.195998
row9   -0.718062
Name: A, dtype: float64

In [41]:
df2 = df.iloc[1:10]
df2['A']

row1   -0.442620
row2   -1.504804
row3   -1.235314
row4   -0.735665
row5    1.561050
row6   -0.995347
row7    0.530160
row8   -1.195998
row9   -0.718062
Name: A, dtype: float64
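
The cells above chain two selections (first rows by position, then a column by name). For reading data this works fine. The same result can also be obtained in a single indexing call, which is the form the pandas documentation recommends when values are going to be assigned. A sketch on a hypothetical frame `demo`:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(np.random.randn(5, 2),
                    columns=["A", "B"],
                    index=["row0", "row1", "row2", "row3", "row4"])

# Chained: slice the rows, then pick the column.
chained = demo.iloc[1:4]["A"]

# Single call, positions only: translate the column name to its position.
col_pos = demo.columns.get_loc("A")
single_iloc = demo.iloc[1:4, col_pos]

# Single call, labels only: translate the row positions to their labels.
single_loc = demo.loc[demo.index[1:4], "A"]

print(chained.equals(single_iloc), chained.equals(single_loc))
```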

## 2.2 Slicing

__As .iloc accesses the data via the list, and as we are already specialists in lists ;-), we can use the power of slicing to do cool things:__

__Rows:__
* df.iloc[0] # first row of data frame
* df.iloc[1] # second row of data frame
* df.iloc[-1] # last row of data frame

__Columns:__
* df.iloc[:,0] # first column of data frame
* df.iloc[:,1] # second column of data frame
* df.iloc[:,-1] # last column of data frame
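
The selections listed above can be tried on a small hypothetical frame (`demo`, filled with the integers 0 to 11 so the results are easy to check by eye):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(np.arange(12).reshape(4, 3),
                    columns=["A", "B", "C"],
                    index=["row0", "row1", "row2", "row3"])

last_row = demo.iloc[-1]        # negative positions count from the end
last_col = demo.iloc[:, -1]     # ':' keeps all rows, -1 is the last column
first_two = demo.iloc[0:2]      # rows at positions 0 and 1 (stop excluded)
reversed_rows = demo.iloc[::-1] # a step of -1 walks the rows backwards

print(last_row)
print(last_col)
print(reversed_rows)
```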

In [42]:
a


[0, 1, 2, 3, 4, 5, 6, 7]

In [45]:
a[1:6:3]

[1, 4]

In [46]:
# Remember, double colon :: allows us to set the step we wish to select.
print(df.iloc[0::2]) #Starts from 0 and gets all the EVEN rows.

             A   B a l l         C         D
row0 -2.042564 -0.056018  0.417207  0.003894
row2 -1.504804 -0.027168 -0.930094  1.081655
row4 -0.735665 -0.004192 -0.372772 -0.969991
row6 -0.995347 -0.520899 -0.538398 -0.177308
row8 -1.195998  1.172738  0.139174  1.579593


In [47]:
print(df.iloc[1::2]) #Starts from 1 and gets all the ODD rows.

             A   B a l l         C         D
row1 -0.442620  0.857589  0.314043 -1.268257
row3 -1.235314 -0.336581  0.336642  0.093850
row5  1.561050 -2.950871  0.193347  1.318407
row7  0.530160 -0.762979  0.229581 -0.336207
row9 -0.718062  0.651051 -0.777075  0.342675


In [48]:
print(df.iloc[:,0::2]) #Starts from 0 and gets all the EVEN columns.

             A         C
row0 -2.042564  0.417207
row1 -0.442620  0.314043
row2 -1.504804 -0.930094
row3 -1.235314  0.336642
row4 -0.735665 -0.372772
row5  1.561050  0.193347
row6 -0.995347 -0.538398
row7  0.530160  0.229581
row8 -1.195998  0.139174
row9 -0.718062 -0.777075


In [49]:
print(df.iloc[:,1::2]) #Starts from 1 and gets all the ODD columns.

       B a l l         D
row0 -0.056018  0.003894
row1  0.857589 -1.268257
row2 -0.027168  1.081655
row3 -0.336581  0.093850
row4 -0.004192 -0.969991
row5 -2.950871  1.318407
row6 -0.520899 -0.177308
row7 -0.762979 -0.336207
row8  1.172738  1.579593
row9  0.651051  0.342675


## Exercise:

__Write a print statement that returns all the EVEN rows and all the ODD columns of df!__



Great job! From now on, we are ready to start doing serious things with Python using this powerful data structure that brings __VELOCITY, VOLUME AND VARIETY__ to the table.

## 2.3 Practicing Pandas via exercises

Where can I learn more about Pandas? https://github.com/ajcr/100-pandas-puzzles/blob/master/100-pandas-puzzles-with-solutions.ipynb

---
# EXTRA MATERIAL

## Numpy

<img src="img/scipy_ecosystem.png" width=500 height=400 align="left">

<img src="img/Numpy1.png" width=600 height=500 align="left">

<img src="img/Numpy2.png" width=600 height=500 align="left">

<img src="img/Numpy3.png" width=600 height=500 align="left">

Travis Oliphant is an American data scientist and businessman. He is a founder of the technology startup Anaconda. In addition, he is the primary creator of NumPy and a founding contributor to the SciPy package in the Python programming language.
ref: https://en.wikipedia.org/wiki/Travis_Oliphant

Numpy arrays can be 1D, 2D, 3D, ... N-dimensional. Let's have a look at 2D:


<img src="img/np_array.png" width=500 height=400 align="left">

* The number of dimensions is known as the rank of the array
* The shape of an array is a tuple of integers giving the size of the array along each dimension
* The index refers to the position of a value in the array. In Python, indexing starts at 0.
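
These three ideas map directly onto attributes of a Numpy array. A minimal sketch with a hypothetical 2D array called `grid`:

```python
import numpy as np

# A 2D array with 2 rows and 3 columns.
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

print(grid.ndim)    # number of dimensions (the rank): 2
print(grid.shape)   # size along each dimension: (2, 3)
print(grid[0, 0])   # first row, first column: 1
print(grid[1, 2])   # second row, third column: 6
```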


## 1.1 Practicing Numpy via exercises

Where can I learn more about Numpy? https://github.com/rougier/numpy-100/blob/master/100_Numpy_exercises.ipynb