In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Creating a Pandas Series object

In [4]:
s = pd.Series([1,2,3,np.nan, 10])
print(s)

0     1.0
1     2.0
2     3.0
3     NaN
4    10.0
dtype: float64
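
The NaN entry is why the Series above is reported as float64 rather than an integer dtype. The following is a minimal sketch of that upcasting; the names `ints` and `with_nan` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Without missing values, pandas infers an integer dtype.
ints = pd.Series([1, 2, 3, 10])

# np.nan is a float, so the whole Series is upcast to float64.
with_nan = pd.Series([1, 2, 3, np.nan, 10])

print(ints.dtype)
print(with_nan.dtype)

# isna() flags the missing entries; count() ignores them.
print(with_nan.isna().sum())
print(with_nan.count())
```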


In [10]:
dates = pd.date_range(start="20170801", periods=10)
print(dates)

# Create a DataFrame of random values with a datetime index and labeled columns
df = pd.DataFrame(np.random.randn(10, 4), index=dates, columns=list("ABCD"))
df

DatetimeIndex(['2017-08-01', '2017-08-02', '2017-08-03', '2017-08-04',
               '2017-08-05', '2017-08-06', '2017-08-07', '2017-08-08',
               '2017-08-09', '2017-08-10'],
              dtype='datetime64[ns]', freq='D')


,A,B,C,D
2017-08-01,0.767187,-0.054625,1.590607,-0.791468
2017-08-02,0.3419,0.926485,-0.270759,0.732538
2017-08-03,-0.392693,-0.168523,-1.815978,-0.441589
2017-08-04,-0.416511,-0.33074,-0.817551,1.05703
2017-08-05,0.530754,-1.279534,0.798434,0.417356
2017-08-06,-0.702763,-0.303106,-1.542699,2.810941
2017-08-07,1.022777,0.228942,1.128929,0.308503
2017-08-08,1.84703,-1.330483,1.6364,-1.498886
2017-08-09,-0.73918,-1.866896,-0.062221,0.7395
2017-08-10,-0.941687,1.594801,-0.967864,-1.427584
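
Because `np.random.randn` draws new values on every run, the numbers above will differ each time the cell is executed. If reproducible values are wanted, one option is a seeded NumPy generator. This is only a sketch: the seed value 0 and the names `rng`, `seeded` and `again` are arbitrary choices, not part of the original notebook.

```python
import numpy as np
import pandas as pd

dates = pd.date_range(start="20170801", periods=10)

# A seeded generator produces the same values on every run.
rng = np.random.default_rng(seed=0)
seeded = pd.DataFrame(rng.standard_normal((10, 4)), index=dates, columns=list("ABCD"))

# Re-creating the generator with the same seed reproduces the frame exactly.
rng_again = np.random.default_rng(seed=0)
again = pd.DataFrame(rng_again.standard_normal((10, 4)), index=dates, columns=list("ABCD"))

print(seeded.equals(again))
```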


In [14]:
# Creating a dataframe from a dict object.
df2 = pd.DataFrame({"A":1.,
                   "B":pd.Timestamp('20170802'),
                   "C":pd.Series([1,2,3,np.nan]),
                   "D":[1,2,3,10],
                   "E":"String",
                   "F":np.asarray([1,2,3,4]),
                   "G":pd.Categorical(["test","train","test","train"])
                   })
print(df2.dtypes)
df2

A           float64
B    datetime64[ns]
C           float64
D             int64
E            object
F             int32
G          category
dtype: object


,A,B,C,D,E,F,G
0,1.0,2017-08-02,1.0,1,String,1,test
1,1.0,2017-08-02,2.0,2,String,2,train
2,1.0,2017-08-02,3.0,3,String,3,test
3,1.0,2017-08-02,,10,String,4,train
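
In the dict above, the scalar values ("A", "B" and "E") are repeated to match the length of the list-like values. A minimal sketch of that behaviour follows, using a smaller hypothetical frame named `small`; it also shows that list-like values of different lengths are rejected.

```python
import pandas as pd

# Scalars are broadcast to the length of the list-like column "D".
small = pd.DataFrame({
    "A": 1.0,
    "D": [1, 2, 3, 10],
    "E": "String",
})
print(small.shape)
print(small["A"].tolist())

# List-like values must all have the same length.
mismatch_raised = False
try:
    pd.DataFrame({"x": [1, 2], "y": [1, 2, 3]})
except ValueError as err:
    mismatch_raised = True
    print("Could not build the frame:", err)
```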


### Viewing the data
The first or last 5 rows can be viewed with the head() and tail() methods.

In [18]:
df.head()

,A,B,C,D
2017-08-01,0.767187,-0.054625,1.590607,-0.791468
2017-08-02,0.3419,0.926485,-0.270759,0.732538
2017-08-03,-0.392693,-0.168523,-1.815978,-0.441589
2017-08-04,-0.416511,-0.33074,-0.817551,1.05703
2017-08-05,0.530754,-1.279534,0.798434,0.417356


In [19]:
df.tail()

,A,B,C,D
2017-08-06,-0.702763,-0.303106,-1.542699,2.810941
2017-08-07,1.022777,0.228942,1.128929,0.308503
2017-08-08,1.84703,-1.330483,1.6364,-1.498886
2017-08-09,-0.73918,-1.866896,-0.062221,0.7395
2017-08-10,-0.941687,1.594801,-0.967864,-1.427584
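
Both methods also accept the number of rows to return. A short sketch, using a hypothetical frame `demo` filled with consecutive integers so the result is easy to check:

```python
import numpy as np
import pandas as pd

dates = pd.date_range(start="20170801", periods=10)
demo = pd.DataFrame(np.arange(40).reshape(10, 4), index=dates, columns=list("ABCD"))

# Pass n to override the default of 5 rows.
first_three = demo.head(3)
last_two = demo.tail(2)

print(first_three)
print(last_two)
```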


**describe()** shows a quick statistical summary of the data. By default it summarizes only numeric columns, so non-numeric columns are left out. E.g. in the output below, columns B, E and G of the dataframe df2 are skipped. Column C is kept even though it contains a NaN; its count of 3 covers only the non-missing values.

In [23]:
df2.describe()

,A,C,D,F
count,4.0,3.0,4.0,4.0
mean,1.0,2.0,4.0,2.5
std,0.0,1.0,4.082483,1.290994
min,1.0,1.0,1.0,1.0
25%,1.0,1.5,1.75,1.75
50%,1.0,2.0,2.5,2.5
75%,1.0,2.5,4.75,3.25
max,1.0,3.0,10.0,4.0
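
The skipped columns can still be summarized by passing the `include` parameter. The sketch below uses a reduced, hypothetical frame `demo` with one numeric, one string and one categorical column; for categorical data the summary reports count, unique, top and freq instead of mean and quantiles.

```python
import pandas as pd

demo = pd.DataFrame({
    "D": [1, 2, 3, 10],
    "E": "String",
    "G": pd.Categorical(["test", "train", "test", "train"]),
})

# Default behaviour: only the numeric column D is summarized.
numeric_summary = demo.describe()
print(numeric_summary)

# Ask explicitly for the categorical column.
categorical_summary = demo.describe(include=["category"])
print(categorical_summary)
```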


df.index gives the index used by the dataframe. It is an attribute, not a method, so it is written without parentheses. For df it returns a DatetimeIndex object; for df2, which was created without an explicit index, it returns the default RangeIndex object.

In [29]:
df.index

DatetimeIndex(['2017-08-01', '2017-08-02', '2017-08-03', '2017-08-04',
               '2017-08-05', '2017-08-06', '2017-08-07', '2017-08-08',
               '2017-08-09', '2017-08-10'],
              dtype='datetime64[ns]', freq='D')

In [30]:
df2.index

RangeIndex(start=0, stop=4, step=1)
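
The two index types can also be checked programmatically. A minimal sketch with two small hypothetical frames, `by_date` and `default`:

```python
import pandas as pd

dates = pd.date_range(start="20170801", periods=3)
by_date = pd.DataFrame({"A": [1, 2, 3]}, index=dates)
default = pd.DataFrame({"A": [1, 2, 3]})  # no index given

# index is an attribute, so there are no parentheses.
print(type(by_date.index).__name__)
print(type(default.index).__name__)

# The column labels are available the same way.
print(by_date.columns.tolist())
```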

The data can be viewed in different forms by changing the order in which the rows and columns are displayed.
`df.sort_index(axis=0)` sorts the rows by their index labels.
`df.sort_index(axis=1)` sorts the columns by their names.

In [37]:
df.sort_index(axis=1, ascending=False)

,D,C,B,A
2017-08-01,-0.791468,1.590607,-0.054625,0.767187
2017-08-02,0.732538,-0.270759,0.926485,0.3419
2017-08-03,-0.441589,-1.815978,-0.168523,-0.392693
2017-08-04,1.05703,-0.817551,-0.33074,-0.416511
2017-08-05,0.417356,0.798434,-1.279534,0.530754
2017-08-06,2.810941,-1.542699,-0.303106,-0.702763
2017-08-07,0.308503,1.128929,0.228942,1.022777
2017-08-08,-1.498886,1.6364,-1.330483,1.84703
2017-08-09,0.7395,-0.062221,-1.866896,-0.73918
2017-08-10,-1.427584,-0.967864,1.594801,-0.941687
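
The same call with `axis=0` reorders the rows instead. Note that `sort_index` returns a new dataframe and leaves the original untouched. A sketch with a hypothetical frame `demo`:

```python
import numpy as np
import pandas as pd

dates = pd.date_range(start="20170801", periods=4)
demo = pd.DataFrame(np.arange(16).reshape(4, 4), index=dates, columns=list("ABCD"))

# Sort the rows by index label, newest date first.
newest_first = demo.sort_index(axis=0, ascending=False)
print(newest_first)

# The original frame keeps its order.
print(demo.index[0])
```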


The dataframe can also be sorted by the values in a column rather than by its column headings or index labels.

In [39]:
df.sort_values(by="C")
# Column C is now completely sorted in ascending order, and the datetime index is no longer in date order.

,A,B,C,D
2017-08-03,-0.392693,-0.168523,-1.815978,-0.441589
2017-08-06,-0.702763,-0.303106,-1.542699,2.810941
2017-08-10,-0.941687,1.594801,-0.967864,-1.427584
2017-08-04,-0.416511,-0.33074,-0.817551,1.05703
2017-08-02,0.3419,0.926485,-0.270759,0.732538
2017-08-09,-0.73918,-1.866896,-0.062221,0.7395
2017-08-05,0.530754,-1.279534,0.798434,0.417356
2017-08-07,1.022777,0.228942,1.128929,0.308503
2017-08-01,0.767187,-0.054625,1.590607,-0.791468
2017-08-08,1.84703,-1.330483,1.6364,-1.498886
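
`sort_values` also accepts several columns and a sort direction for each. The sketch below uses a hypothetical frame `demo` with made-up columns `group` and `score`; it sorts by group ascending, then by score descending within each group.

```python
import pandas as pd

demo = pd.DataFrame({
    "group": ["b", "a", "b", "a"],
    "score": [2, 5, 9, 1],
})

# One ascending flag per column in `by`.
ranked = demo.sort_values(by=["group", "score"], ascending=[True, False])
print(ranked)

# The original row labels travel with the rows; reset them if a fresh 0..n-1 index is wanted.
renumbered = ranked.reset_index(drop=True)
print(renumbered)
```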


### Selection
While standard Python / Numpy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code, we recommend the optimized pandas data access methods, .at, .iat, .loc and .iloc. (The older .ix indexer is deprecated and was removed in pandas 1.0.)

**Selection by label**

In [41]:
# Getting slices of the data with pandas label-based indexing
# Get the rows from 20170803 through 20170806 (both endpoints included) for columns C and D
df.loc["20170803":"20170806", ['C', 'D']]

,C,D
2017-08-03,-1.815978,-0.441589
2017-08-04,-0.817551,1.05703
2017-08-05,0.798434,0.417356
2017-08-06,-1.542699,2.810941


In [46]:
# To get a scalar value you can use df.loc['20170802', 'A'], but df.at is faster for single-value access.
# df.at expects the exact index label, so pass a datetime object (here dates[0]) as the row key.
df.at[dates[0], 'B']

-0.054625332334633389
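
The scalar accessors can be compared side by side. This is a minimal sketch on a hypothetical frame `demo` with fixed values: `.loc` and `.at` look up by label, `.iat` by integer position, and `.at` can also be used to set a single value.

```python
import pandas as pd

dates = pd.date_range(start="20170801", periods=3)
demo = pd.DataFrame({"A": [0.1, 0.2, 0.3], "B": [1.0, 2.0, 3.0]}, index=dates)

# Three ways to read the same cell (second row, column A).
via_loc = demo.loc[pd.Timestamp("2017-08-02"), "A"]
via_at = demo.at[dates[1], "A"]
via_iat = demo.iat[1, 0]
print(via_loc, via_at, via_iat)

# .at also sets a single value in place.
demo.at[dates[0], "B"] = 10.0
print(demo)
```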

**Selection by position**

In [50]:
df.iloc[3:5,0:2]

,A,B
2017-08-04,-0.416511,-0.33074
2017-08-05,0.530754,-1.279534
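
Label slices with `.loc` include both endpoints, whereas position slices with `.iloc` follow the usual Python convention and exclude the stop position. A sketch of the difference, using a hypothetical frame `demo` of consecutive integers:

```python
import numpy as np
import pandas as pd

dates = pd.date_range(start="20170801", periods=10)
demo = pd.DataFrame(np.arange(40).reshape(10, 4), index=dates, columns=list("ABCD"))

# Label slice: 2017-08-03 through 2017-08-06, both included (4 rows).
by_label = demo.loc["20170803":"20170806", ["C", "D"]]

# Position slice: rows 2, 3, 4, 5 and columns 2, 3 (stop positions excluded).
by_position = demo.iloc[2:6, 2:4]

print(by_label.equals(by_position))
print(demo.iloc[3:5, 0:2].shape)
```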
