# 10 minutes to pandas

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Basic data structures

1. Series: a one-dimensional labeled array
2. DataFrame: a two-dimensional labeled data structure whose columns can have different types

## Object creation

In [2]:
s = pd.Series([1, 3, 5, np.nan, 6, 8])


Note that np.nan can be passed directly when creating the Series; pandas stores it as a missing value.
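As a minimal sketch of what the missing value does to the Series (same data as the cell above; the checks themselves are my own illustration, not part of the original notes):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])

# NaN is a float, so the whole Series is stored as float64.
print(s.dtype)

# isna() flags the missing entry.
print(s.isna().sum())

# Reductions such as sum() and count() skip missing values by default.
print(s.sum())
print(s.count())
```

The integers are upcast to floats only because of the NaN; without it the Series would keep an integer dtype.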

In [5]:
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))

Two things to note about the creation of that DataFrame: first, pd.date_range with periods=6 generates six consecutive dates, because the default frequency is one day;
second, list("ABCD") splits the string into a list of its characters, which become the column names.
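A short sketch of both points, assuming the same start date as above. The name every_other and the freq="2D" variation are hypothetical additions for illustration only:

```python
import pandas as pd

# periods=6 asks for six timestamps; with no freq given, the step is one day.
dates = pd.date_range("20130101", periods=6)
print(dates[0], dates[-1])

# Changing freq changes the step, not the number of entries.
every_other = pd.date_range("20130101", periods=6, freq="2D")
print(every_other[-1])

# list() on a string yields one element per character.
columns = list("ABCD")
print(columns)
```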

In [10]:
df2 = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(3)), dtype="float32"),
        "D": np.array([3] * 3, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test"]),
        "F": "foo",
    }
)

Note that a column given as a single value is repeated to match the length of the other columns.
Here "A", "B", and "F" each have one value, while "C", "D", and "E" explicitly have three values;
those explicit lengths must agree, otherwise pandas raises an error.
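The behaviour can be sketched with a smaller frame. The names ok and mismatch_raised and the column "G" are hypothetical, introduced only for this example:

```python
import numpy as np
import pandas as pd

# Scalars "A" and "F" are repeated to match the three-element array "D".
ok = pd.DataFrame({"A": 1.0, "D": np.array([3] * 3, dtype="int32"), "F": "foo"})
print(ok.shape)

# Arrays of different lengths cannot be combined into one frame.
mismatch_raised = False
try:
    pd.DataFrame({"D": np.array([3] * 3), "G": np.array([1, 2])})
except ValueError as exc:
    mismatch_raised = True
    print("length mismatch:", exc)
```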

## Viewing data

In [12]:
df2.head()

     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo


In [13]:
df2.tail()

     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo


In [14]:
df.index

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [15]:
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

In [17]:
df.to_numpy()

array([[-0.41973302, -0.85518766,  0.36283998, -0.56778059],
       [ 0.01991387, -0.22005718, -1.15842391, -0.66948427],
       [ 0.89863202,  1.12973948, -0.04840166,  0.06998686],
       [ 1.88833696, -0.05119412, -0.62832233,  0.31305809],
       [-0.29634715, -0.75739921, -0.59443671,  0.00836387],
       [ 1.4055257 , -1.0462622 ,  1.43175515,  0.82122549]])

In [18]:
df2.to_numpy()

array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo']],
      dtype=object)

NumPy arrays have one dtype for the entire array while pandas DataFrames have one dtype per column. When you call DataFrame.to_numpy(), pandas will find the NumPy dtype that can hold all of the dtypes in the DataFrame. If the common data type is object, DataFrame.to_numpy() will require copying data.
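A minimal sketch of how the resulting NumPy dtype depends on the column dtypes. The three small frames are hypothetical examples, not data from these notes:

```python
import pandas as pd

# All columns share float64, so the array is float64 as well.
floats = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]})

# int64 and float64 columns: float64 can hold both.
mixed_numeric = pd.DataFrame({"x": [1, 2], "y": [0.5, 1.5]})

# Numbers and strings only fit together in an object array.
mixed_object = pd.DataFrame({"x": [1, 2], "label": ["a", "b"]})

print(floats.to_numpy().dtype)
print(mixed_numeric.to_numpy().dtype)
print(mixed_object.to_numpy().dtype)
```

This matches the two outputs above: df holds only floats, while df2 mixes floats, timestamps, integers, and strings and therefore falls back to object.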

In [21]:
type(df2.to_numpy()[0][4])

str

describe() shows a quick statistical summary of the data:

In [22]:
df.describe()

              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean   0.582721 -0.300060 -0.105832 -0.004105
std    0.956647  0.798349  0.917257  0.556393
min   -0.419733 -1.046262 -1.158424 -0.669484
25%   -0.217282 -0.830741 -0.619851 -0.423744
50%    0.459273 -0.488728 -0.321419  0.039175
75%    1.278802 -0.093410  0.260030  0.252290
max    1.888337  1.129739  1.431755  0.821225


Transposing data

In [23]:
df.T

   2013-01-01  2013-01-02  2013-01-03  2013-01-04  2013-01-05  2013-01-06
A   -0.419733    0.019914    0.898632    1.888337   -0.296347    1.405526
B   -0.855188   -0.220057    1.129739   -0.051194   -0.757399   -1.046262
C    0.362840   -1.158424   -0.048402   -0.628322   -0.594437    1.431755
D   -0.567781   -0.669484    0.069987    0.313058    0.008364    0.821225


Sort index:

In [24]:
df.sort_index()

                   A         B         C         D
2013-01-01 -0.419733 -0.855188  0.362840 -0.567781
2013-01-02  0.019914 -0.220057 -1.158424 -0.669484
2013-01-03  0.898632  1.129739 -0.048402  0.069987
2013-01-04  1.888337 -0.051194 -0.628322  0.313058
2013-01-05 -0.296347 -0.757399 -0.594437  0.008364
2013-01-06  1.405526 -1.046262  1.431755  0.821225


In [26]:
df.sort_index(axis=1, ascending=False)


                   D         C         B         A
2013-01-01 -0.567781  0.362840 -0.855188 -0.419733
2013-01-02 -0.669484 -1.158424 -0.220057  0.019914
2013-01-03  0.069987 -0.048402  1.129739  0.898632
2013-01-04  0.313058 -0.628322 -0.051194  1.888337
2013-01-05  0.008364 -0.594437 -0.757399 -0.296347
2013-01-06  0.821225  1.431755 -1.046262  1.405526


sort_values orders the DataFrame by the values of a specific column (or, with axis=1, by the values of a specific row):

In [27]:
df.sort_values(by="B")

                   A         B         C         D
2013-01-06  1.405526 -1.046262  1.431755  0.821225
2013-01-01 -0.419733 -0.855188  0.362840 -0.567781
2013-01-05 -0.296347 -0.757399 -0.594437  0.008364
2013-01-02  0.019914 -0.220057 -1.158424 -0.669484
2013-01-04  1.888337 -0.051194 -0.628322  0.313058
2013-01-03  0.898632  1.129739 -0.048402  0.069987
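A few sort_values variations, sketched on a small hypothetical frame (small, desc, multi, and by_row are names made up for this example) so the resulting order is easy to follow:

```python
import pandas as pd

small = pd.DataFrame(
    {"A": [2.0, 1.0, 2.0], "B": [0.5, 0.7, 0.1]},
    index=["x", "y", "z"],
)

# Descending order on a single column.
desc = small.sort_values(by="B", ascending=False)
print(list(desc.index))

# Sort by "A" first and break ties with "B".
multi = small.sort_values(by=["A", "B"])
print(list(multi.index))

# axis=1 reorders the columns by the values found in row "x".
by_row = small.sort_values(by="x", axis=1)
print(list(by_row.columns))
```

sort_values returns a new DataFrame each time; small itself keeps its original order.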


## Getitem ([])

In [28]:
df.A

2013-01-01   -0.419733
2013-01-02    0.019914
2013-01-03    0.898632
2013-01-04    1.888337
2013-01-05   -0.296347
2013-01-06    1.405526
Freq: D, Name: A, dtype: float64

In [29]:
df['A']

2013-01-01   -0.419733
2013-01-02    0.019914
2013-01-03    0.898632
2013-01-04    1.888337
2013-01-05   -0.296347
2013-01-06    1.405526
Freq: D, Name: A, dtype: float64

For a DataFrame, passing a slice (:) to [] selects the matching rows:

In [33]:
df[0:3]

                   A         B         C         D
2013-01-01 -0.419733 -0.855188  0.362840 -0.567781
2013-01-02  0.019914 -0.220057 -1.158424 -0.669484
2013-01-03  0.898632  1.129739 -0.048402  0.069987


In [32]:
df["20130102":"20130104"]

                   A         B         C         D
2013-01-02  0.019914 -0.220057 -1.158424 -0.669484
2013-01-03  0.898632  1.129739 -0.048402  0.069987
2013-01-04  1.888337 -0.051194 -0.628322  0.313058


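df[1:2] is different from df.loc[1:2]: with [], an integer slice selects rows by position and excludes the end point, whereas .loc interprets the slice as index labels and includes both end points. Since the index of df holds dates rather than integers, 1 and 2 are not labels in it.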
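The difference is easiest to see on a frame with integer labels. This is a sketch on a hypothetical frame whose labels start at 1, so positions and labels do not coincide:

```python
import pandas as pd

frame = pd.DataFrame({"value": ["a", "b", "c", "d"]}, index=[1, 2, 3, 4])

# [] with an integer slice works by position: only position 1, end excluded.
by_position = frame[1:2]
print(list(by_position.index))

# .loc works by label: labels 1 and 2, end included.
by_label = frame.loc[1:2]
print(list(by_label.index))
```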

## Selection by label

In [39]:
df.loc[dates[0]]

A   -0.419733
B   -0.855188
C    0.362840
D   -0.567781
Name: 2013-01-01 00:00:00, dtype: float64
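.loc also accepts a column selector as a second argument. The sketch below uses a hypothetical frame filled with np.arange instead of random numbers, so the selected values are predictable; row, cell, and block are illustrative names:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

# A single row label returns a Series indexed by the column names.
row = df.loc[dates[0]]
print(row.name, list(row.index))

# A row label plus a column label returns a single value.
cell = df.loc[dates[0], "A"]
print(cell)

# Label slices include both end points; a list picks specific columns.
block = df.loc["20130102":"20130104", ["A", "B"]]
print(block.shape)
```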