# Notebook_11 - Introduction to pandas

Pandas is a data analysis library that makes data manipulation in Python easier. It has become quite popular in recent years, fueled by widespread interest in Data Science and Machine Learning.

To work with pandas, it must first be installed:

In [1]:
!pip install pandas

You should consider upgrading via the '/Library/Frameworks/Python.framework/Versions/3.9/bin/python3 -m pip install --upgrade pip' command.


If the installation succeeded, you should be able to see the installed pandas version.

In [4]:
!pip show pandas

Name: pandas
Version: 1.3.4
Summary: Powerful data structures for data analysis, time series, and statistics
Home-page: https://pandas.pydata.org
Author: The Pandas Development Team
Author-email: pandas-dev@python.org
License: BSD-3-Clause
Location: /Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages
Requires: numpy, python-dateutil, pytz
Required-by: seaborn, statsmodels


If you see a detailed message like the one above, everything is fine.

Let us begin by creating a simple DataFrame. A DataFrame is a two-dimensional data structure with labeled rows and columns that makes data easier to access and inspect.

In [7]:
import pandas as pd

df = pd.DataFrame(
{"a" : [4, 5, 6],
"b" : [7, 8, 9],
"c" : [10, 11, 12]},
index = [1, 2, 3])

print(df)

   a  b   c
1  4  7  10
2  5  8  11
3  6  9  12


That is all! We created our first DataFrame in pandas. `pd` is the conventional alias for pandas because it is short (similar to importing tkinter as tk). Each dictionary key names a column, and the list next to it holds that column's three values, one per row. You may have noticed that we used the ordinary `print` to show the result. This is a slightly different topic, but in Python a class can define special behavior for some operations, such as `+` and `-`. Similarly, a class can define how its objects are turned into text, so `print` can be called directly on a DataFrame for the sake of convenience.
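To make the remark about special behavior more concrete, here is a minimal sketch. `Vector2D` is a hypothetical toy class invented for this example; it only illustrates the general Python mechanism and says nothing about how pandas itself is implemented.

```python
class Vector2D:
    """Hypothetical toy class, used only to illustrate special methods."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __add__(self, other):
        # Called when two Vector2D objects are combined with `+`.
        return Vector2D(self.x + other.x, self.y + other.y)

    def __str__(self):
        # Called by print() and str() to build the displayed text.
        return f"Vector2D({self.x}, {self.y})"


total = Vector2D(1, 2) + Vector2D(3, 4)
print(total)  # prints: Vector2D(4, 6)
```

Because the class defines `__str__`, `print` knows how to display the object without any extra work from the caller.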

In [9]:
import pandas as pd

df = pd.DataFrame(
[[4, 7, 10],
[5, 8, 11],
[6, 9, 12]],
index=[1, 2, 3],
columns=['a', 'b', 'c'])

print(df)

   a  b   c
1  4  7  10
2  5  8  11
3  6  9  12


This is another way of creating the same DataFrame: pass the data as a list of rows and name the columns separately. This way may feel more natural, but that is not always the case. Also note that `df` is a common name for a DataFrame, so if you come across this name, a pandas DataFrame is probably in use.

Also, if you are familiar with the `numpy` library, that knowledge can be very handy, as the two libraries are commonly used together (not to mention that some parts of pandas use numpy).
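As a small sketch of how the two libraries fit together, a DataFrame can be built directly from a numpy array. The names `values` and `df_from_numpy` are chosen for this example only.

```python
import numpy as np
import pandas as pd

# A 3x3 numpy array holding the same numbers as the DataFrames above.
values = np.array([[4, 7, 10],
                   [5, 8, 11],
                   [6, 9, 12]])

# Row labels and column names are supplied separately.
df_from_numpy = pd.DataFrame(values, index=[1, 2, 3], columns=["a", "b", "c"])
print(df_from_numpy)
```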

One concept that plays a much bigger role in pandas than in everyday Python is the `NaN` ("not a number") value. The closest thing commonly used in plain Python is `None`. `NaN` acts as a void, something that does not exist. In real-world applications some data is often missing or corrupted, and in that case the affected values can be marked as `NaN`.
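The following sketch shows one way missing values might be handled. The temperature readings are made-up example data, and `readings` and `cleaned` are names chosen for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical temperature readings; the second one is missing.
readings = pd.DataFrame({"temperature": [21.5, np.nan, 19.0]})

print(readings.isna())                 # True where a value is missing
print(readings["temperature"].mean())  # missing values are skipped by default

cleaned = readings.dropna()            # drop rows that contain NaN
print(cleaned)
```

Whether to drop rows with missing values or to fill them in (for example with `fillna`) depends on the dataset, so this is only one possible approach.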

Another interesting thing about pandas is that its columns and rows can be treated as separate entities, which makes it easier to slice, extract, and otherwise select data.
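A brief sketch of such selections, using a small DataFrame like the one created earlier. The variable names are assumptions made for this example.

```python
import pandas as pd

df_small = pd.DataFrame(
    {"a": [4, 5, 6], "b": [7, 8, 9], "c": [10, 11, 12]},
    index=[1, 2, 3],
)

column_b = df_small["b"]        # one column, returned as a Series
row_2 = df_small.loc[2]         # one row, selected by its index label
first_two = df_small.iloc[0:2]  # first two rows, selected by position
cell = df_small.loc[3, "c"]     # single value: row label 3, column "c"

print(column_b, row_2, first_two, cell, sep="\n\n")
```

Note the difference between `loc`, which works with labels, and `iloc`, which works with integer positions.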

In [12]:
import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)

print(df)
print(df.dtypes)

     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object


First, we import numpy, as it works with pandas and can be useful when building data structures. Then we create a DataFrame filled with values of different data types. As we can see, the columns have different data types, yet they peacefully coexist. Also note that with this method of creating a DataFrame, a single value can be given for a column, and it will be spread to all rows. This can be quite useful when creating dummy DataFrames.
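Here is a short sketch of that spreading behavior with made-up placeholder data. The column names and values are hypothetical.

```python
import pandas as pd

# "status" is a single value that is repeated on every row.
dummy = pd.DataFrame({"id": [1, 2, 3], "status": "pending"})
print(dummy)

# If every column is a single value, pandas cannot infer the number of rows,
# so an index has to be passed explicitly.
all_scalar = pd.DataFrame({"x": 0, "y": "n/a"}, index=range(3))
print(all_scalar)
```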

For large datasets, one can use `head` and `tail` to view only the first or last rows. The number of rows is passed as an argument (5 by default).

In [14]:
import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)
print(df.head(2))
print(df.tail(2))

     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
     A          B    C  D      E    F
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo


Another very useful tool is `describe`:

In [20]:
import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)
print(df.describe())

         A    C    D
count  4.0  4.0  4.0
mean   1.0  1.0  3.0
std    0.0  0.0  0.0
min    1.0  1.0  3.0
25%    1.0  1.0  3.0
50%    1.0  1.0  3.0
75%    1.0  1.0  3.0
max    1.0  1.0  3.0


It shows the most commonly used summary statistics, computed over all rows of each numeric column of the DataFrame. This is why only columns A, C and D appear in the output.

It is also easy to show the index and the columns of the dataset:

In [23]:
import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)

print(df.index)
print(df.columns)

Int64Index([0, 1, 2, 3], dtype='int64')
Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')


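The connection between pandas and numpy goes both ways: it is easy to convert a DataFrame to a numpy array. Because the columns here hold different data types, the resulting array has the generic `object` dtype.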

In [26]:
import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)

arr = df.to_numpy()
print(repr(arr))  # repr keeps the array(...) wrapper and the dtype in the output

array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
      dtype=object)

It is also easy to transpose your data:

In [25]:
df.T

                     0                    1                    2                    3
A                  1.0                  1.0                  1.0                  1.0
B  2013-01-02 00:00:00  2013-01-02 00:00:00  2013-01-02 00:00:00  2013-01-02 00:00:00
C                  1.0                  1.0                  1.0                  1.0
D                    3                    3                    3                    3
E                 test                train                 test                train
F                  foo                  foo                  foo                  foo


You may see that Jupyter notebooks allow certain interactions that other environments do not, as in the example above, where the bare expression `df.T` is displayed without `print`. However, relying on this is not recommended: if a person does not have Jupyter and only has the source code, running that code as a plain script will not display anything.
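A more portable variant, sketched here with a small example DataFrame (the names `df_small` and `transposed` are chosen for illustration), stores the result and prints it explicitly:

```python
import pandas as pd

df_small = pd.DataFrame({"a": [4, 5, 6], "b": [7, 8, 9]}, index=[1, 2, 3])

transposed = df_small.T  # rows become columns and vice versa
print(transposed)        # an explicit print works in a notebook and in a script
```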

Lastly, pandas DataFrames offer very handy column access. Much like indexing a Python dictionary, put the column name in square brackets to access that column:

In [27]:
import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)

print(df["E"])

0     test
1    train
2     test
3    train
Name: E, dtype: category
Categories (2, object): ['test', 'train']
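
Square brackets can do a bit more than fetch a single column. The sketch below, using a small example DataFrame with names chosen for illustration, selects several columns at once, filters rows by a condition, and adds a new column.

```python
import pandas as pd

df_small = pd.DataFrame(
    {"a": [4, 5, 6], "b": [7, 8, 9], "c": [10, 11, 12]},
    index=[1, 2, 3],
)

subset = df_small[["a", "c"]]          # list of names: DataFrame with two columns
large_a = df_small[df_small["a"] > 4]  # boolean mask: rows where a > 4

# Assigning to a new name in square brackets creates a new column.
df_small["total"] = df_small["a"] + df_small["b"]

print(subset, large_a, df_small, sep="\n\n")
```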


## Tasks

1. Install pandas if it is not installed yet
2. Try all the code yourself and experiment with it
3. Visit the official pandas website https://pandas.pydata.org and find the documentation section
4. Create your own DataFrame and perform operations not listed in this notebook