# Data Analysis with Python

### NumPy
NumPy, short for Numerical Python, has long been a cornerstone of numerical computing in Python. It provides the data structures, algorithms, and library glue needed for most scientific applications involving numerical data in Python.

### pandas
pandas provides high-level data structures and functions designed to make working with structured or tabular data fast, easy, and expressive. The primary object in pandas is the DataFrame, a tabular, column-oriented data structure with both row and column labels. pandas blends the high-performance, array-computing ideas of NumPy with the flexible data manipulation capabilities of spreadsheets and relational databases.

### matplotlib, seaborn, altair, plotly, bokeh, HoloViews, hvPlot and plotnine
* matplotlib is the most popular Python library for producing plots and other two-dimensional data visualizations.
* seaborn is built on matplotlib and provides a high-level interface for drawing attractive and informative statistical graphics.
* altair is a declarative statistical visualization library for Python, based on Vega and Vega-Lite.
* plotly.py is an interactive, open-source, browser-based graphing library for Python.
* Bokeh is a Python library for creating interactive visualizations for modern web browsers.
* HoloViews is a concise declarative interface that helps you build Bokeh plots. It is a separately maintained package that focuses on interaction with Jupyter notebooks and enables quick prototyping of figures for data analysis.
* hvPlot is a concise API that lets you plot in Bokeh with the pandas .plot() function and a wide variety of data containers. It is particularly convenient for working with data interactively and lets you quickly produce common types of plots.
* plotnine is a Python implementation of the Grammar of Graphics and follows the same ideas (and syntax) as ggplot2 in R.

In [2]:
import numpy as np

### The basic object is an N-dimensional array object, or `ndarray`. An `ndarray` is a generic multidimensional container for homogeneous data.

In [3]:
a = np.array(['ciao', 1, True, 1.0])

In [4]:
a

array(['ciao', '1', 'True', '1.0'], dtype='<U32')

In [5]:
type(a)

numpy.ndarray

In [7]:
type(a[1])

numpy.str_
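
The conversion to strings seen above is a consequence of the array holding a single element type. The following is a minimal sketch of how mixed inputs end up with one common dtype; the variable names are only illustrative, and `dtype=object` is shown as one possible way to keep the original Python objects.

```python
import numpy as np

# Mixed inputs are promoted to one common dtype when the array is built.
ints_and_floats = np.array([1, 2.5])      # integers are promoted to float
bools_and_ints = np.array([True, 2])      # booleans are promoted to integer
with_string = np.array(['ciao', 1, 1.0])  # everything becomes a string

print(ints_and_floats.dtype)       # float64
print(bools_and_ints.dtype.kind)   # i (a signed integer type)
print(with_string.dtype.kind)      # U (a unicode string type)

# dtype=object keeps the original Python objects instead,
# at the cost of losing fast vectorized arithmetic.
as_objects = np.array(['ciao', 1, True, 1.0], dtype=object)
print([type(x).__name__ for x in as_objects])  # ['str', 'int', 'bool', 'float']
```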

In [8]:
[1, 2] * 2

[1, 2, 1, 2]

In [9]:
np.array([1,2]) * 2

array([2, 4])

In [10]:
my_arr = np.arange(1000000)
my_list = list(range(1000000))

In [14]:
%time for _ in range(10): my_arr2 = my_arr * 2

CPU times: user 14.8 ms, sys: 768 µs, total: 15.6 ms
Wall time: 14.2 ms


In [15]:
%time for _ in range(10): my_list2 = [x * 2 for x in my_list]

CPU times: user 509 ms, sys: 102 ms, total: 611 ms
Wall time: 609 ms


In [17]:
data = np.random.randn(2, 3)
data

array([[-0.59335469,  0.25318989,  1.60473433],
       [ 0.60624753, -2.35757612, -0.57492737]])

In [18]:
data + 2

array([[ 1.40664531,  2.25318989,  3.60473433],
       [ 2.60624753, -0.35757612,  1.42507263]])

In [19]:
data * 2

array([[-1.18670937,  0.50637978,  3.20946865],
       [ 1.21249506, -4.71515224, -1.14985475]])

In [20]:
data + data

array([[-1.18670937,  0.50637978,  3.20946865],
       [ 1.21249506, -4.71515224, -1.14985475]])

In [22]:
data * data

array([[0.35206978, 0.06410512, 2.57517226],
       [0.36753607, 5.55816515, 0.33054148]])

In [23]:
data ** 2

array([[0.35206978, 0.06410512, 2.57517226],
       [0.36753607, 5.55816515, 0.33054148]])

In [26]:
data.shape

(2, 3)

In [27]:
data.T

array([[-0.59335469,  0.60624753],
       [ 0.25318989, -2.35757612],
       [ 1.60473433, -0.57492737]])

In [28]:
np.dot(data, data.T)

array([[ 2.99134717, -1.87923995],
       [-1.87923995,  6.2562427 ]])

In [29]:
v1 = np.random.randn(5, )
v2 = np.random.randn(5, )

v1 * v2

array([ 0.34421819,  0.74831629,  0.21677621, -0.46898346, -0.06478637])

In [30]:
v1

array([ 1.40394492, -0.76997643, -0.83029956, -0.31683306,  0.20146454])

In [31]:
v2

array([ 0.24517927, -0.97186909, -0.26108193,  1.48022259, -0.32157705])

In [32]:
np.dot(v1, v2)

0.7755408570646427

### Shape, indexing and slicing

In [34]:
a = np.arange(12)
a

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])

In [36]:
a[:3]

array([0, 1, 2])

In [37]:
a[3:-2]

array([3, 4, 5, 6, 7, 8, 9])

In [38]:
a[3:-2:2]

array([3, 5, 7, 9])

In [39]:
# i= 0    1    2    3    4    5    6    7    8    9   10   11
# ┌────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┐
# │  0 │  1 │  2 │  3 │  4 │  5 │  6 │  7 │  8 │  9 │ 10 │ 11 │
# └────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┘

In [40]:
a.shape

(12,)

In [42]:
a.ndim == len(a.shape)

True

In [46]:
b = a.reshape(3, 4)
b

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [44]:
b.shape

(3, 4)

In [45]:
# i= 0    0    0    0    1    1    1    1    2    2    2    2
# j= 0    1    2    3    0    1    2    3    0    1    2    3
# ┌────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┐
# │  0 │  1 │  2 │  3 │  4 │  5 │  6 │  7 │  8 │  9 │ 10 │ 11 │
# └────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┘

In [47]:
b[1, 2]

6

In [48]:
c = a.reshape((2, 3, 2))
c

array([[[ 0,  1],
        [ 2,  3],
        [ 4,  5]],

       [[ 6,  7],
        [ 8,  9],
        [10, 11]]])

In [49]:
# i= 0    0    0    0    0    0    1    1    1    1    1    1
# j= 0    0    1    1    2    2    0    0    1    1    2    2
# k= 0    1    0    1    0    1    0    1    0    1    0    1
# ┌────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┐
# │  0 │  1 │  2 │  3 │  4 │  5 │  6 │  7 │  8 │  9 │ 10 │ 11 │
# └────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┘

In [50]:
c[0, 2, 1]

5

In [51]:
b.T

array([[ 0,  4,  8],
       [ 1,  5,  9],
       [ 2,  6, 10],
       [ 3,  7, 11]])

### Boolean indexing

In [54]:
b

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [56]:
b[0][1]

1

In [57]:
b[0, 1]

1

In [58]:
p_array = [1, 2, 3, 4, 5]
np_array = np.array(p_array)

type(p_array), type(np_array)

(list, numpy.ndarray)

In [59]:
p_array[0], np_array[0]

(1, 1)

In [60]:
index = [0, 2]

In [61]:
np_array[index]

array([1, 3])

In [62]:
p_array[index]

TypeError: list indices must be integers or slices, not list

In [64]:
b

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [63]:
b[index]

array([[ 0,  1,  2,  3],
       [ 8,  9, 10, 11]])

In [65]:
b[index, ]

array([[ 0,  1,  2,  3],
       [ 8,  9, 10, 11]])

In [68]:
b[:, index]

array([[ 0,  2],
       [ 4,  6],
       [ 8, 10]])

In [83]:
names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
data = np.random.randn(7, 4)

data

array([[-0.75029113, -0.03548424, -1.1138726 ,  0.60603591],
       [ 0.04343974, -0.09864437,  0.62862964,  0.33960427],
       [ 0.2538986 ,  0.21248802, -1.25507894,  0.22712653],
       [-1.11445212, -0.54351758, -0.61925305,  0.58112509],
       [ 1.20636829,  0.04690619, -2.01574076,  0.88213192],
       [-0.9529896 ,  1.371248  , -2.30751686, -1.99880498],
       [-0.51508436, -0.12250376, -0.37250228,  1.34182932]])

In [70]:
names.reshape(7, 1)

array([['Bob'],
       ['Joe'],
       ['Will'],
       ['Bob'],
       ['Will'],
       ['Joe'],
       ['Joe']], dtype='<U4')

In [71]:
np.array([1,2,3,4,5]) + 5

array([ 6,  7,  8,  9, 10])

In [72]:
names == 'Bob'

array([ True, False, False,  True, False, False, False])

In [73]:
(names == 'Bob').reshape(7, 1)

array([[ True],
       [False],
       [False],
       [ True],
       [False],
       [False],
       [False]])

In [74]:
data[names=='Bob']

array([[-0.16059603, -0.83174139,  0.48274535, -0.52581674],
       [-0.06061175,  0.86683445,  0.3233126 , -1.24886462]])

In [75]:
data[names=='Bob', :2]

array([[-0.16059603, -0.83174139],
       [-0.06061175,  0.86683445]])

In [79]:
data

array([[-1.60596026e-01, -8.31741392e-01,  4.82745348e-01,
        -5.25816740e-01],
       [-1.04473660e+00, -2.51182686e-02,  4.99792182e-05,
        -7.93310020e-01],
       [ 4.87993242e-01, -2.93924855e-01,  5.28792689e-01,
        -1.21028083e+00],
       [-6.06117466e-02,  8.66834453e-01,  3.23312604e-01,
        -1.24886462e+00],
       [ 7.56613816e-01,  1.36979318e+00,  3.61532808e-01,
        -1.46351780e-01],
       [ 2.88971009e-02,  2.46014067e-01, -3.71147247e-01,
        -8.02942344e-01],
       [-1.13376428e+00, -7.56362729e-01, -4.27441384e-01,
         6.58211143e-01]])

In [78]:
data[data < 0]

array([-0.16059603, -0.83174139, -0.52581674, -1.0447366 , -0.02511827,
       -0.79331002, -0.29392486, -1.21028083, -0.06061175, -1.24886462,
       -0.14635178, -0.37114725, -0.80294234, -1.13376428, -0.75636273,
       -0.42744138])

In [80]:
data[data < 0] = 0.0

In [98]:
data

array([[-0.75029113, -0.03548424, -1.1138726 ,  0.60603591],
       [ 0.04343974, -0.09864437,  0.62862964,  0.33960427],
       [ 0.2538986 ,  0.21248802, -1.25507894,  0.22712653],
       [-1.11445212, -0.54351758, -0.61925305,  0.58112509],
       [ 1.20636829,  0.04690619, -2.01574076,  0.88213192],
       [-0.9529896 ,  1.371248  , -2.30751686, -1.99880498],
       [-0.51508436, -0.12250376, -0.37250228,  1.34182932]])

In [99]:
# print out the rows where first number < 0
data

array([[-0.75029113, -0.03548424, -1.1138726 ,  0.60603591],
       [ 0.04343974, -0.09864437,  0.62862964,  0.33960427],
       [ 0.2538986 ,  0.21248802, -1.25507894,  0.22712653],
       [-1.11445212, -0.54351758, -0.61925305,  0.58112509],
       [ 1.20636829,  0.04690619, -2.01574076,  0.88213192],
       [-0.9529896 ,  1.371248  , -2.30751686, -1.99880498],
       [-0.51508436, -0.12250376, -0.37250228,  1.34182932]])

In [101]:
data < 0

array([[ True,  True,  True, False],
       [False,  True, False, False],
       [False, False,  True, False],
       [ True,  True,  True, False],
       [False, False,  True, False],
       [ True, False,  True,  True],
       [ True,  True,  True, False]])

In [102]:
(data < 0)[:,0]

array([ True, False, False,  True, False,  True,  True])

In [103]:
data[(data < 0)[:,0]]

array([[-0.75029113, -0.03548424, -1.1138726 ,  0.60603591],
       [-1.11445212, -0.54351758, -0.61925305,  0.58112509],
       [-0.9529896 ,  1.371248  , -2.30751686, -1.99880498],
       [-0.51508436, -0.12250376, -0.37250228,  1.34182932]])

In [104]:
data[data[:,0]<0]

array([[-0.75029113, -0.03548424, -1.1138726 ,  0.60603591],
       [-1.11445212, -0.54351758, -0.61925305,  0.58112509],
       [-0.9529896 ,  1.371248  , -2.30751686, -1.99880498],
       [-0.51508436, -0.12250376, -0.37250228,  1.34182932]])

### Broadcasting

Arrays with different shapes cannot, in general, be added, subtracted, or otherwise combined element by element.

Broadcasting overcomes this by (virtually) duplicating the smaller array so that it has the same dimensionality and size as the larger array.

* if the array shapes have different lengths, left-pad the shorter shape with 1s;
* if two corresponding dimensions do not match and one of them is 1, make copies along that size-1 dimension;
* if two corresponding dimensions do not match and neither is 1, raise an error.
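
The three rules above can be sketched as a small function that works on shapes only. `broadcast_shape` is a hypothetical helper written for illustration, not part of NumPy; the last line cross-checks it against what NumPy actually does.

```python
import numpy as np

def broadcast_shape(shape_a, shape_b):
    """Return the broadcast result shape, or raise ValueError (illustrative helper)."""
    ndim = max(len(shape_a), len(shape_b))
    # Rule 1: left-pad the shorter shape with 1s.
    a = (1,) * (ndim - len(shape_a)) + tuple(shape_a)
    b = (1,) * (ndim - len(shape_b)) + tuple(shape_b)
    result = []
    for dim_a, dim_b in zip(a, b):
        if dim_a == dim_b:
            result.append(dim_a)
        elif dim_a == 1 or dim_b == 1:
            # Rule 2: the size-1 dimension is stretched to match the other one.
            result.append(dim_b if dim_a == 1 else dim_a)
        else:
            # Rule 3: the sizes differ and neither is 1.
            raise ValueError(f"cannot broadcast {shape_a} with {shape_b}")
    return tuple(result)

print(broadcast_shape((2, 5), (5,)))   # (2, 5)
print(broadcast_shape((3, 1), (5,)))   # (3, 5)

# Cross-check against NumPy itself.
assert broadcast_shape((3, 1), (5,)) == (np.zeros((3, 1)) * np.zeros(5)).shape
```

Calling `broadcast_shape((3,), (5,))` raises a `ValueError`, mirroring the error NumPy reports when shapes `(3,)` and `(5,)` are multiplied.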

In [88]:
np.array(1).ndim

0

In [89]:
np.array([1]).ndim

1

In [90]:
x = np.arange(5)

x * 5

array([ 0,  5, 10, 15, 20])

In [91]:
x

array([0, 1, 2, 3, 4])

In [93]:
np.full((5,), 5)

array([5, 5, 5, 5, 5])

In [94]:
x * np.full((5,), 5)

array([ 0,  5, 10, 15, 20])

In [105]:
x = np.arange(10).reshape(2, 5)
x

array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9]])

In [106]:
x * 5

array([[ 0,  5, 10, 15, 20],
       [25, 30, 35, 40, 45]])

In [107]:
np.full((2, 5), 5)

array([[5, 5, 5, 5, 5],
       [5, 5, 5, 5, 5]])

In [108]:
x * 5

array([[ 0,  5, 10, 15, 20],
       [25, 30, 35, 40, 45]])

In [109]:
x * np.full((5,), 5)

array([[ 0,  5, 10, 15, 20],
       [25, 30, 35, 40, 45]])

In [110]:
x = np.arange(3)
y = np.arange(3)

In [111]:
x * y

array([0, 1, 4])

In [112]:
x = np.arange(3)
y = np.arange(5)

In [113]:
x * y

ValueError: operands could not be broadcast together with shapes (3,) (5,) 

In [114]:
x

array([0, 1, 2])

In [115]:
y

array([0, 1, 2, 3, 4])

In [116]:
# product of every element of x with every element of y (outer product) 'by hand'
out = []
for i in x:
    for j in y:
        out.append(i * j)
out # we lost the information about dimensions

[0, 0, 0, 0, 0, 0, 1, 2, 3, 4, 0, 2, 4, 6, 8]

In [126]:
np.array(out).reshape(len(x), -1)

array([[0, 0, 0, 0, 0],
       [0, 1, 2, 3, 4],
       [0, 2, 4, 6, 8]])

In [119]:
x

array([0, 1, 2])

In [121]:
x.reshape((3,1))

array([[0],
       [1],
       [2]])

In [120]:
x[:,np.newaxis]

array([[0],
       [1],
       [2]])

In [122]:
x[:,np.newaxis] * y

array([[0, 0, 0, 0, 0],
       [0, 1, 2, 3, 4],
       [0, 2, 4, 6, 8]])

In [127]:
x = np.arange(10).reshape(2, 5)
y = np.arange(12).reshape(3, 4)


In [128]:
x

array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9]])

In [129]:
y

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [130]:
x * y

ValueError: operands could not be broadcast together with shapes (2,5) (3,4) 

In [131]:
y[:,:,np.newaxis, np.newaxis] * x

array([[[[ 0,  0,  0,  0,  0],
         [ 0,  0,  0,  0,  0]],

        [[ 0,  1,  2,  3,  4],
         [ 5,  6,  7,  8,  9]],

        [[ 0,  2,  4,  6,  8],
         [10, 12, 14, 16, 18]],

        [[ 0,  3,  6,  9, 12],
         [15, 18, 21, 24, 27]]],


       [[[ 0,  4,  8, 12, 16],
         [20, 24, 28, 32, 36]],

        [[ 0,  5, 10, 15, 20],
         [25, 30, 35, 40, 45]],

        [[ 0,  6, 12, 18, 24],
         [30, 36, 42, 48, 54]],

        [[ 0,  7, 14, 21, 28],
         [35, 42, 49, 56, 63]]],


       [[[ 0,  8, 16, 24, 32],
         [40, 48, 56, 64, 72]],

        [[ 0,  9, 18, 27, 36],
         [45, 54, 63, 72, 81]],

        [[ 0, 10, 20, 30, 40],
         [50, 60, 70, 80, 90]],

        [[ 0, 11, 22, 33, 44],
         [55, 66, 77, 88, 99]]]])

# Pandas

In [132]:
import pandas as pd

In [133]:
s = pd.Series([5, 6, 7, 8])
s

0    5
1    6
2    7
3    8
dtype: int64

In [138]:
s = pd.Series([5, 6, 7, 8], index=['a', 'b', 'c', 'd'])
s

a    5
b    6
c    7
d    8
dtype: int64

In [139]:
d = {'a': 5, 'b':6, 'c':7, 'd':8}
d

{'a': 5, 'b': 6, 'c': 7, 'd': 8}

In [137]:
pd.Series(d)

a    5
b    6
c    7
d    8
dtype: int64

In [140]:
s['a']

5

In [141]:
d['a']

5

In [142]:
s[['a', 'c']]

a    5
c    7
dtype: int64

In [143]:
d[['a', 'c']]

TypeError: unhashable type: 'list'

In [144]:
d > 6

TypeError: '>' not supported between instances of 'dict' and 'int'

In [147]:
s > 6

a    False
b    False
c     True
d     True
dtype: bool

In [148]:
s[s > 6]

c    7
d    8
dtype: int64

### A DataFrame can be thought of as a dictionary of Series sharing an index
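
As a minimal sketch of this idea, the example below builds a DataFrame from a dictionary of two Series whose labels only partly overlap. The names `col_x`, `col_y` and `table` and the values are made up for illustration.

```python
import pandas as pd

# Two Series with partially overlapping labels (made-up values).
col_x = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
col_y = pd.Series([10.0, 20.0], index=['a', 'b'])

# Each dict key becomes a column; rows are aligned on the union of the labels.
table = pd.DataFrame({'x': col_x, 'y': col_y})
print(table)

# Selecting a column gives back a Series that shares the frame's index.
print(table['y'].index.equals(table.index))  # True

# Label 'c' has no value in col_y, so that cell is filled with NaN.
print(pd.isna(table.loc['c', 'y']))          # True
```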

In [150]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002, 2003],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

data

{'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
 'year': [2000, 2001, 2002, 2001, 2002, 2003],
 'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

In [151]:
pd.DataFrame(data)

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Nevada,2003,3.2


In [152]:
df = pd.DataFrame(data, index=['first', 'second', 'third', 'fourth', 'fifth', 'sixth'])
df

Unnamed: 0,state,year,pop
first,Ohio,2000,1.5
second,Ohio,2001,1.7
third,Ohio,2002,3.6
fourth,Nevada,2001,2.4
fifth,Nevada,2002,2.9
sixth,Nevada,2003,3.2


In [155]:
df['year']

first     2000
second    2001
third     2002
fourth    2001
fifth     2002
sixth     2003
Name: year, dtype: int64

In [156]:
df.year

first     2000
second    2001
third     2002
fourth    2001
fifth     2002
sixth     2003
Name: year, dtype: int64

In [159]:
df.loc['third']

state    Ohio
year     2002
pop       3.6
Name: third, dtype: object

In [161]:
df.head(2)

Unnamed: 0,state,year,pop
first,Ohio,2000,1.5
second,Ohio,2001,1.7


In [162]:
df.columns

Index(['state', 'year', 'pop'], dtype='object')

In [163]:
df['debt'] = np.nan

In [164]:
df

Unnamed: 0,state,year,pop,debt
first,Ohio,2000,1.5,
second,Ohio,2001,1.7,
third,Ohio,2002,3.6,
fourth,Nevada,2001,2.4,
fifth,Nevada,2002,2.9,
sixth,Nevada,2003,3.2,


In [165]:
val = pd.Series([-1.2, -1.5, -1.7], index=['second', 'fourth', 'fifth'])
val

second   -1.2
fourth   -1.5
fifth    -1.7
dtype: float64

In [168]:
df['another_debt'] = val

In [169]:
df

Unnamed: 0,state,year,pop,debt,another_debt
first,Ohio,2000,1.5,,
second,Ohio,2001,1.7,-1.2,-1.2
third,Ohio,2002,3.6,,
fourth,Nevada,2001,2.4,-1.5,-1.5
fifth,Nevada,2002,2.9,-1.7,-1.7
sixth,Nevada,2003,3.2,,


In [172]:
df['state'] == 'Ohio'

first      True
second     True
third      True
fourth    False
fifth     False
sixth     False
Name: state, dtype: bool

In [173]:
df['eastern'] = df['state'] == 'Ohio'
df

Unnamed: 0,state,year,pop,debt,another_debt,eastern
first,Ohio,2000,1.5,,,True
second,Ohio,2001,1.7,-1.2,-1.2,True
third,Ohio,2002,3.6,,,True
fourth,Nevada,2001,2.4,-1.5,-1.5,False
fifth,Nevada,2002,2.9,-1.7,-1.7,False
sixth,Nevada,2003,3.2,,,False


In [174]:
df['new_column'] = pd.Series(np.arange(10))

In [175]:
df

Unnamed: 0,state,year,pop,debt,another_debt,eastern,new_column
first,Ohio,2000,1.5,,,True,
second,Ohio,2001,1.7,-1.2,-1.2,True,
third,Ohio,2002,3.6,,,True,
fourth,Nevada,2001,2.4,-1.5,-1.5,False,
fifth,Nevada,2002,2.9,-1.7,-1.7,False,
sixth,Nevada,2003,3.2,,,False,


In [177]:
del df['eastern']

In [178]:
df

Unnamed: 0,state,year,pop,debt,another_debt,new_column
first,Ohio,2000,1.5,,,
second,Ohio,2001,1.7,-1.2,-1.2,
third,Ohio,2002,3.6,,,
fourth,Nevada,2001,2.4,-1.5,-1.5,
fifth,Nevada,2002,2.9,-1.7,-1.7,
sixth,Nevada,2003,3.2,,,


In [181]:
df.values

array([['Ohio', 2000, 1.5, nan, nan, nan],
       ['Ohio', 2001, 1.7, -1.2, -1.2, nan],
       ['Ohio', 2002, 3.6, nan, nan, nan],
       ['Nevada', 2001, 2.4, -1.5, -1.5, nan],
       ['Nevada', 2002, 2.9, -1.7, -1.7, nan],
       ['Nevada', 2003, 3.2, nan, nan, nan]], dtype=object)

In [182]:
df.T

Unnamed: 0,first,second,third,fourth,fifth,sixth
state,Ohio,Ohio,Ohio,Nevada,Nevada,Nevada
year,2000,2001,2002,2001,2002,2003
pop,1.5,1.7,3.6,2.4,2.9,3.2
debt,,-1.2,,-1.5,-1.7,
another_debt,,-1.2,,-1.5,-1.7,
new_column,,,,,,


In [183]:
df

Unnamed: 0,state,year,pop,debt,another_debt,new_column
first,Ohio,2000,1.5,,,
second,Ohio,2001,1.7,-1.2,-1.2,
third,Ohio,2002,3.6,,,
fourth,Nevada,2001,2.4,-1.5,-1.5,
fifth,Nevada,2002,2.9,-1.7,-1.7,
sixth,Nevada,2003,3.2,,,


In [184]:
df.set_index('year')

Unnamed: 0_level_0,state,pop,debt,another_debt,new_column
year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2000,Ohio,1.5,,,
2001,Ohio,1.7,-1.2,-1.2,
2002,Ohio,3.6,,,
2001,Nevada,2.4,-1.5,-1.5,
2002,Nevada,2.9,-1.7,-1.7,
2003,Nevada,3.2,,,


In [191]:
idx = list(df.index)
np.random.shuffle(idx)

In [192]:
idx

['fifth', 'sixth', 'fourth', 'first', 'second', 'third']

In [193]:
df

Unnamed: 0,state,year,pop,debt,another_debt,new_column
first,Ohio,2000,1.5,,,
second,Ohio,2001,1.7,-1.2,-1.2,
third,Ohio,2002,3.6,,,
fourth,Nevada,2001,2.4,-1.5,-1.5,
fifth,Nevada,2002,2.9,-1.7,-1.7,
sixth,Nevada,2003,3.2,,,


In [194]:
df.reindex(idx)

Unnamed: 0,state,year,pop,debt,another_debt,new_column
fifth,Nevada,2002,2.9,-1.7,-1.7,
sixth,Nevada,2003,3.2,,,
fourth,Nevada,2001,2.4,-1.5,-1.5,
first,Ohio,2000,1.5,,,
second,Ohio,2001,1.7,-1.2,-1.2,
third,Ohio,2002,3.6,,,


In [196]:
idx.append('seventh')
idx

['fifth', 'sixth', 'fourth', 'first', 'second', 'third', 'seventh']

In [198]:
df = df.reindex(idx)
df

Unnamed: 0,state,year,pop,debt,another_debt,new_column
fifth,Nevada,2002.0,2.9,-1.7,-1.7,
sixth,Nevada,2003.0,3.2,,,
fourth,Nevada,2001.0,2.4,-1.5,-1.5,
first,Ohio,2000.0,1.5,,,
second,Ohio,2001.0,1.7,-1.2,-1.2,
third,Ohio,2002.0,3.6,,,
seventh,,,,,,


In [201]:
df.drop('seventh')

Unnamed: 0,state,year,pop,debt,another_debt,new_column
fifth,Nevada,2002.0,2.9,-1.7,-1.7,
sixth,Nevada,2003.0,3.2,,,
fourth,Nevada,2001.0,2.4,-1.5,-1.5,
first,Ohio,2000.0,1.5,,,
second,Ohio,2001.0,1.7,-1.2,-1.2,
third,Ohio,2002.0,3.6,,,


In [203]:
df.drop(['new_column', 'another_debt'], axis=1)

Unnamed: 0,state,year,pop,debt
fifth,Nevada,2002.0,2.9,-1.7
sixth,Nevada,2003.0,3.2,
fourth,Nevada,2001.0,2.4,-1.5
first,Ohio,2000.0,1.5,
second,Ohio,2001.0,1.7,-1.2
third,Ohio,2002.0,3.6,
seventh,,,,


In [205]:
df

Unnamed: 0,state,year,pop,debt,another_debt,new_column
fifth,Nevada,2002.0,2.9,-1.7,-1.7,
sixth,Nevada,2003.0,3.2,,,
fourth,Nevada,2001.0,2.4,-1.5,-1.5,
first,Ohio,2000.0,1.5,,,
second,Ohio,2001.0,1.7,-1.2,-1.2,
third,Ohio,2002.0,3.6,,,
seventh,,,,,,


In [204]:
df[:2]

Unnamed: 0,state,year,pop,debt,another_debt,new_column
fifth,Nevada,2002.0,2.9,-1.7,-1.7,
sixth,Nevada,2003.0,3.2,,,


In [207]:
df.iloc[:2]

Unnamed: 0,state,year,pop,debt,another_debt,new_column
fifth,Nevada,2002.0,2.9,-1.7,-1.7,
sixth,Nevada,2003.0,3.2,,,


In [208]:
df.iloc[:, :2]

Unnamed: 0,state,year
fifth,Nevada,2002.0
sixth,Nevada,2003.0
fourth,Nevada,2001.0
first,Ohio,2000.0
second,Ohio,2001.0
third,Ohio,2002.0
seventh,,


In [209]:
samples = ['sample_' + str(x) for x in range(1,6)]
variables = [chr(x) for x in range(97, 102)]

df2 = pd.DataFrame(np.random.rand(5,5), index=samples, columns=variables)

In [211]:
df2

Unnamed: 0,a,b,c,d,e
sample_1,0.174823,0.955389,0.239256,0.287502,0.348075
sample_2,0.367054,0.183705,0.319905,0.532188,0.852024
sample_3,0.871349,0.194397,0.180266,0.107815,0.563384
sample_4,0.711949,0.27681,0.784667,0.408875,0.596565
sample_5,0.22862,0.867314,0.708576,0.099386,0.688848


In [212]:
df2 > 2

Unnamed: 0,a,b,c,d,e
sample_1,False,False,False,False,False
sample_2,False,False,False,False,False
sample_3,False,False,False,False,False
sample_4,False,False,False,False,False
sample_5,False,False,False,False,False


In [215]:
df2[df2 > 0.5]

Unnamed: 0,a,b,c,d,e
sample_1,,0.955389,,,
sample_2,,,,0.532188,0.852024
sample_3,0.871349,,,,0.563384
sample_4,0.711949,,0.784667,,0.596565
sample_5,,0.867314,0.708576,,0.688848


In [216]:
df2[df2 > 0.5] = 1.0 

In [217]:
df2

Unnamed: 0,a,b,c,d,e
sample_1,0.174823,1.0,0.239256,0.287502,0.348075
sample_2,0.367054,0.183705,0.319905,1.0,1.0
sample_3,1.0,0.194397,0.180266,0.107815,1.0
sample_4,1.0,0.27681,1.0,0.408875,1.0
sample_5,0.22862,1.0,1.0,0.099386,1.0


In [221]:
df2.loc[['sample_2', 'sample_4'], ['b', 'd']]

Unnamed: 0,b,d
sample_2,0.183705,1.0
sample_4,0.27681,0.408875


# Arithmetics

In [222]:
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))

df1.loc[1, 'b'] = np.nan
df2.loc[2, 'c'] = np.nan


df1

Unnamed: 0,a,b,c,d
0,0.0,1.0,2.0,3.0
1,4.0,,6.0,7.0
2,8.0,9.0,10.0,11.0


In [223]:
df2

Unnamed: 0,a,b,c,d,e
0,0.0,1.0,2.0,3.0,4.0
1,5.0,6.0,7.0,8.0,9.0
2,10.0,11.0,,13.0,14.0
3,15.0,16.0,17.0,18.0,19.0


In [224]:
df1 + df2

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,
1,9.0,,13.0,15.0,
2,18.0,20.0,,24.0,
3,,,,,


In [226]:
df1.add(df2, fill_value=0)

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,4.0
1,9.0,6.0,13.0,15.0,9.0
2,18.0,20.0,10.0,24.0,14.0
3,15.0,16.0,17.0,18.0,19.0


In [227]:
1 / df1

Unnamed: 0,a,b,c,d
0,inf,1.0,0.5,0.333333
1,0.25,,0.166667,0.142857
2,0.125,0.111111,0.1,0.090909


In [229]:
df1.div(1) # df1 / 1

Unnamed: 0,a,b,c,d
0,0.0,1.0,2.0,3.0
1,4.0,,6.0,7.0
2,8.0,9.0,10.0,11.0


In [230]:
df1.rdiv(1, fill_value=0) # like 1 / df1, but NaN is replaced by 0 before dividing

Unnamed: 0,a,b,c,d
0,inf,1.0,0.5,0.333333
1,0.25,inf,0.166667,0.142857
2,0.125,0.111111,0.1,0.090909


### Dealing with missing values

In [231]:
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))

df1.loc[1, 'b'] = np.nan

In [232]:
df1

Unnamed: 0,a,b,c,d
0,0.0,1.0,2.0,3.0
1,4.0,,6.0,7.0
2,8.0,9.0,10.0,11.0


In [235]:
df1.mean()

a    4.0
b    5.0
c    6.0
d    7.0
dtype: float64

In [236]:
df1.fillna(df1.mean())

Unnamed: 0,a,b,c,d
0,0.0,1.0,2.0,3.0
1,4.0,5.0,6.0,7.0
2,8.0,9.0,10.0,11.0


In [237]:
df1.fillna({'b': 100})

Unnamed: 0,a,b,c,d
0,0.0,1.0,2.0,3.0
1,4.0,100.0,6.0,7.0
2,8.0,9.0,10.0,11.0


In [244]:
df1

Unnamed: 0,a,b,c,d
0,0.0,1.0,2.0,3.0
1,4.0,,6.0,7.0
2,8.0,9.0,10.0,11.0


In [245]:
df1.dropna(axis=1)

Unnamed: 0,a,c,d
0,0.0,2.0,3.0
1,4.0,6.0,7.0
2,8.0,10.0,11.0


In [247]:
df1.dropna(how='all')

Unnamed: 0,a,b,c,d
0,0.0,1.0,2.0,3.0
1,4.0,,6.0,7.0
2,8.0,9.0,10.0,11.0


In [256]:
df1.dropna(thresh=3) # keep rows with at least 3 non-NA values

Unnamed: 0,a,b,c,d
0,0.0,1.0,2.0,3.0
1,4.0,,6.0,7.0
2,8.0,9.0,10.0,11.0


In [254]:
df1.shape[1] * .8

3.2

In [257]:
df1

Unnamed: 0,a,b,c,d
0,0.0,1.0,2.0,3.0
1,4.0,,6.0,7.0
2,8.0,9.0,10.0,11.0


In [258]:
df1.mean()

a    4.0
b    5.0
c    6.0
d    7.0
dtype: float64

In [259]:
np.mean(df1)

a    4.0
b    5.0
c    6.0
d    7.0
dtype: float64

In [260]:
df2

Unnamed: 0,a,b,c,d,e
0,0.0,1.0,2.0,3.0,4.0
1,5.0,6.0,7.0,8.0,9.0
2,10.0,11.0,,13.0,14.0
3,15.0,16.0,17.0,18.0,19.0


In [267]:
df2.apply(lambda x : pd.Series([x.max(), x.min()]), axis=1)

Unnamed: 0,0,1
0,4.0,0.0
1,9.0,5.0
2,14.0,10.0
3,19.0,15.0


# Sorting

In [268]:
np.random.shuffle(samples)
np.random.shuffle(variables)

df2 = pd.DataFrame(np.random.rand(5,5), index=samples, columns=variables)
df2

Unnamed: 0,c,a,b,e,d
sample_3,0.479887,0.44027,0.525284,0.924421,0.872237
sample_5,0.222849,0.782444,0.476994,0.293677,0.621789
sample_4,0.526461,0.856491,0.090187,0.149782,0.684564
sample_2,0.818129,0.946389,0.735098,0.745828,0.935573
sample_1,0.518789,0.467026,0.673738,0.60651,0.158838


In [270]:
df2.sort_index(axis=1)

Unnamed: 0,a,b,c,d,e
sample_3,0.44027,0.525284,0.479887,0.872237,0.924421
sample_5,0.782444,0.476994,0.222849,0.621789,0.293677
sample_4,0.856491,0.090187,0.526461,0.684564,0.149782
sample_2,0.946389,0.735098,0.818129,0.935573,0.745828
sample_1,0.467026,0.673738,0.518789,0.158838,0.60651


In [272]:
df2

Unnamed: 0,c,a,b,e,d
sample_3,0.479887,0.44027,0.525284,0.924421,0.872237
sample_5,0.222849,0.782444,0.476994,0.293677,0.621789
sample_4,0.526461,0.856491,0.090187,0.149782,0.684564
sample_2,0.818129,0.946389,0.735098,0.745828,0.935573
sample_1,0.518789,0.467026,0.673738,0.60651,0.158838


In [273]:
df2.sort_values(by='c')

Unnamed: 0,c,a,b,e,d
sample_5,0.222849,0.782444,0.476994,0.293677,0.621789
sample_3,0.479887,0.44027,0.525284,0.924421,0.872237
sample_1,0.518789,0.467026,0.673738,0.60651,0.158838
sample_4,0.526461,0.856491,0.090187,0.149782,0.684564
sample_2,0.818129,0.946389,0.735098,0.745828,0.935573


In [274]:
df2.loc['sample_2','a'] = df2.loc['sample_4','a'] 
df2

Unnamed: 0,c,a,b,e,d
sample_3,0.479887,0.44027,0.525284,0.924421,0.872237
sample_5,0.222849,0.782444,0.476994,0.293677,0.621789
sample_4,0.526461,0.856491,0.090187,0.149782,0.684564
sample_2,0.818129,0.856491,0.735098,0.745828,0.935573
sample_1,0.518789,0.467026,0.673738,0.60651,0.158838


In [275]:
df2.sort_values(by='a')

Unnamed: 0,c,a,b,e,d
sample_3,0.479887,0.44027,0.525284,0.924421,0.872237
sample_1,0.518789,0.467026,0.673738,0.60651,0.158838
sample_5,0.222849,0.782444,0.476994,0.293677,0.621789
sample_4,0.526461,0.856491,0.090187,0.149782,0.684564
sample_2,0.818129,0.856491,0.735098,0.745828,0.935573


In [277]:
df2.sort_values(by=['a', 'b'], ascending=False)

Unnamed: 0,c,a,b,e,d
sample_2,0.818129,0.856491,0.735098,0.745828,0.935573
sample_4,0.526461,0.856491,0.090187,0.149782,0.684564
sample_5,0.222849,0.782444,0.476994,0.293677,0.621789
sample_1,0.518789,0.467026,0.673738,0.60651,0.158838
sample_3,0.479887,0.44027,0.525284,0.924421,0.872237


In [278]:
df.describe()

Unnamed: 0,year,pop,debt,another_debt,new_column
count,6.0,6.0,3.0,3.0,0.0
mean,2001.5,2.55,-1.466667,-1.466667,
std,1.048809,0.836062,0.251661,0.251661,
min,2000.0,1.5,-1.7,-1.7,
25%,2001.0,1.875,-1.6,-1.6,
50%,2001.5,2.65,-1.5,-1.5,
75%,2002.0,3.125,-1.35,-1.35,
max,2003.0,3.6,-1.2,-1.2,


# Pandas I/O

In [279]:
iris = pd.read_csv('https://datahub.io/machine-learning/iris/r/iris.csv')
iris

Unnamed: 0,sepallength,sepalwidth,petallength,petalwidth,class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica


In [288]:
iris = pd.read_excel('https://web.stanford.edu/~ashishg/msande111/excel/iris.xls', skiprows=1)
iris.drop(iris.columns[0], axis=1).drop(iris.columns[6:], axis=1)

Unnamed: 0,Sepal Length (cm),Sepal Width (cm),Petal Length (cm),Petal Width (cm),Class
0,7.0,3.2,4.7,1.4,Iris-versicolor
1,6.4,3.2,4.5,1.5,Iris-versicolor
2,6.9,3.1,4.9,1.5,Iris-versicolor
3,5.5,2.3,4.0,1.3,Iris-versicolor
4,6.5,2.8,4.6,1.5,Iris-versicolor
...,...,...,...,...,...
95,4.8,3.0,1.4,0.3,Iris-setosa
96,5.1,3.8,1.6,0.2,Iris-setosa
97,4.6,3.2,1.4,0.2,Iris-setosa
98,5.3,3.7,1.5,0.2,Iris-setosa


In [289]:
penguin = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv')
penguin

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,MALE
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,FEMALE
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,FEMALE
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,FEMALE
...,...,...,...,...,...,...,...
339,Gentoo,Biscoe,,,,,
340,Gentoo,Biscoe,46.8,14.3,215.0,4850.0,FEMALE
341,Gentoo,Biscoe,50.4,15.7,222.0,5750.0,MALE
342,Gentoo,Biscoe,45.2,14.8,212.0,5200.0,FEMALE


In [292]:
penguin.groupby(['species', 'sex']).mean(numeric_only=True).index

MultiIndex([(   'Adelie', 'FEMALE'),
            (   'Adelie',   'MALE'),
            ('Chinstrap', 'FEMALE'),
            ('Chinstrap',   'MALE'),
            (   'Gentoo', 'FEMALE'),
            (   'Gentoo',   'MALE')],
           names=['species', 'sex'])

In [293]:
penguin.groupby(['species', 'sex']).mean(numeric_only=True)

Unnamed: 0_level_0,Unnamed: 1_level_0,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
species,sex,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Adelie,FEMALE,37.257534,17.621918,187.794521,3368.835616
Adelie,MALE,40.390411,19.072603,192.410959,4043.493151
Chinstrap,FEMALE,46.573529,17.588235,191.735294,3527.205882
Chinstrap,MALE,51.094118,19.252941,199.911765,3938.970588
Gentoo,FEMALE,45.563793,14.237931,212.706897,4679.741379
Gentoo,MALE,49.47377,15.718033,221.540984,5484.836066


In [None]:
lambda x : np.mean(x['body_mass_g'] / np.mean(x['flipper_length_mm']))

In [297]:
penguin_series = penguin.groupby(['species', 'sex']).apply(
    lambda x : np.mean(x['body_mass_g'] / np.mean(x['flipper_length_mm']))
)

In [299]:
type(penguin_series)

pandas.core.series.Series

In [303]:
penguin_series

species    sex   
Adelie     FEMALE    17.938945
           MALE      21.014880
Chinstrap  FEMALE    18.396226
           MALE      19.703546
Gentoo     FEMALE    22.000892
           MALE      24.757659
dtype: float64

In [302]:
penguin_series.unstack('species')

species,Adelie,Chinstrap,Gentoo
sex,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
FEMALE,17.938945,18.396226,22.000892
MALE,21.01488,19.703546,24.757659


In [304]:
penguin

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,MALE
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,FEMALE
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,FEMALE
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,FEMALE
...,...,...,...,...,...,...,...
339,Gentoo,Biscoe,,,,,
340,Gentoo,Biscoe,46.8,14.3,215.0,4850.0,FEMALE
341,Gentoo,Biscoe,50.4,15.7,222.0,5750.0,MALE
342,Gentoo,Biscoe,45.2,14.8,212.0,5200.0,FEMALE


In [305]:
penguin.to_csv('my_file.csv', sep='\t')

### MultiIndex, reshape, pivot and melt

In [306]:
data = pd.Series(np.random.randn(9), index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'], [1, 2, 3, 1, 3, 1, 2, 2, 3]])
data

a  1    0.562602
   2   -0.416626
   3   -0.565867
b  1    0.878129
   3   -0.398984
c  1    1.488548
   2    1.216078
d  2   -2.282193
   3   -0.105749
dtype: float64

In [308]:
data.loc[:, 2]

a   -0.416626
c    1.216078
d   -2.282193
dtype: float64

In [313]:
data.unstack()

Unnamed: 0,1,2,3
a,0.562602,-0.416626,-0.565867
b,0.878129,,-0.398984
c,1.488548,1.216078,
d,,-2.282193,-0.105749


In [312]:
data.unstack().loc[:, 2]

a   -0.416626
b         NaN
c    1.216078
d   -2.282193
Name: 2, dtype: float64

In [314]:
data.unstack().stack()

a  1    0.562602
   2   -0.416626
   3   -0.565867
b  1    0.878129
   3   -0.398984
c  1    1.488548
   2    1.216078
d  2   -2.282193
   3   -0.105749
dtype: float64
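
`unstack` moves the innermost index level to the columns by default. As a minimal sketch, assuming a hypothetical Series `toy` with the same index as `data` but fixed values, the `level` and `fill_value` arguments control which level is moved and what is written for missing combinations.

```python
import pandas as pd

# Hypothetical toy data: same index shape as `data`, fixed values instead of random ones.
toy = pd.Series(
    range(9),
    index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
           [1, 2, 3, 1, 3, 1, 2, 2, 3]],
)

wide_inner = toy.unstack()            # inner level (1, 2, 3) becomes the columns
wide_outer = toy.unstack(level=0)     # outer level (a, b, c, d) becomes the columns
filled = toy.unstack(fill_value=-1)   # missing combinations get -1 instead of NaN
```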

In [315]:
frame = pd.DataFrame(np.arange(12).reshape((4, 3)), index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]], columns=[['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']])
frame

Unnamed: 0_level_0,Unnamed: 1_level_0,Ohio,Ohio,Colorado
Unnamed: 0_level_1,Unnamed: 1_level_1,Green,Red,Green
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


In [316]:
# DataFrame.sum(level=...) was removed in pandas 2.0; group by the row index level instead
frame.groupby(level=0).sum()

Unnamed: 0_level_0,Ohio,Ohio,Colorado
Unnamed: 0_level_1,Green,Red,Green
a,3,5,7
b,15,17,19


In [317]:
# Column-wise equivalent: transpose, group by the second column level, transpose back
frame.T.groupby(level=1).sum().T

Unnamed: 0,Unnamed: 1,Green,Red
a,1,2,1
a,2,8,4
b,1,14,7
b,2,20,10


In [318]:
data = pd.read_csv('/home/dataset/macrodata_long.csv', index_col=0)
data.head()

Unnamed: 0,date,item,value
0,1959-03-31 23:59:59.999999999,realgdp,2710.349
1,1959-03-31 23:59:59.999999999,infl,0.0
2,1959-03-31 23:59:59.999999999,unemp,5.8
3,1959-06-30 23:59:59.999999999,realgdp,2778.801
4,1959-06-30 23:59:59.999999999,infl,2.34


In [320]:
data.shape

(609, 3)

In [319]:
data['item'].unique()

array(['realgdp', 'infl', 'unemp'], dtype=object)

In [321]:
data.pivot(index='date', columns='item', values='value').head()

item,infl,realgdp,unemp
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1959-03-31 23:59:59.999999999,0.0,2710.349,5.8
1959-06-30 23:59:59.999999999,2.34,2778.801,5.1
1959-09-30 23:59:59.999999999,2.74,2775.488,5.3
1959-12-31 23:59:59.999999999,0.27,2785.204,5.6
1960-03-31 23:59:59.999999999,2.31,2847.699,5.2


In [322]:
data

Unnamed: 0,date,item,value
0,1959-03-31 23:59:59.999999999,realgdp,2710.349
1,1959-03-31 23:59:59.999999999,infl,0.000
2,1959-03-31 23:59:59.999999999,unemp,5.800
3,1959-06-30 23:59:59.999999999,realgdp,2778.801
4,1959-06-30 23:59:59.999999999,infl,2.340
...,...,...,...
604,2009-06-30 23:59:59.999999999,infl,3.370
605,2009-06-30 23:59:59.999999999,unemp,9.200
606,2009-09-30 23:59:59.999999999,realgdp,12990.341
607,2009-09-30 23:59:59.999999999,infl,3.560


In [323]:
data.set_index(['date', 'item']).head()

Unnamed: 0_level_0,Unnamed: 1_level_0,value
date,item,Unnamed: 2_level_1
1959-03-31 23:59:59.999999999,realgdp,2710.349
1959-03-31 23:59:59.999999999,infl,0.0
1959-03-31 23:59:59.999999999,unemp,5.8
1959-06-30 23:59:59.999999999,realgdp,2778.801
1959-06-30 23:59:59.999999999,infl,2.34


In [325]:
data.set_index(['date', 'item']).unstack('item').head()

Unnamed: 0_level_0,value,value,value
item,infl,realgdp,unemp
date,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
1959-03-31 23:59:59.999999999,0.0,2710.349,5.8
1959-06-30 23:59:59.999999999,2.34,2778.801,5.1
1959-09-30 23:59:59.999999999,2.74,2775.488,5.3
1959-12-31 23:59:59.999999999,0.27,2785.204,5.6
1960-03-31 23:59:59.999999999,2.31,2847.699,5.2
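
The two cells above suggest that `pivot` is a shortcut for `set_index` followed by `unstack`. A minimal sketch of that equivalence, using a hypothetical long-format frame `long_df` with made-up values in place of the macro data file:

```python
import pandas as pd

# Hypothetical long-format data (values are made up, not taken from macrodata_long.csv).
long_df = pd.DataFrame({
    'date': ['2000Q1'] * 3 + ['2000Q2'] * 3,
    'item': ['realgdp', 'infl', 'unemp'] * 2,
    'value': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# One call with keyword arguments...
wide_pivot = long_df.pivot(index='date', columns='item', values='value')

# ...or the two-step version; selecting 'value' first avoids the extra column level.
wide_unstack = long_df.set_index(['date', 'item'])['value'].unstack('item')
```

Note that `pivot` raises an error when an (index, columns) pair occurs more than once; `pivot_table` with an aggregation function is the usual alternative in that case.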


In [326]:
df = pd.DataFrame({'key': ['foo', 'bar', 'baz'], 'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
df

Unnamed: 0,key,A,B,C
0,foo,1,4,7
1,bar,2,5,8
2,baz,3,6,9


In [328]:
df.melt('key')

Unnamed: 0,key,variable,value
0,foo,A,1
1,bar,A,2
2,baz,A,3
3,foo,B,4
4,bar,B,5
5,baz,B,6
6,foo,C,7
7,bar,C,8
8,baz,C,9


In [329]:
df

Unnamed: 0,key,A,B,C
0,foo,1,4,7
1,bar,2,5,8
2,baz,3,6,9


In [330]:
df.set_index('key')

Unnamed: 0_level_0,A,B,C
key,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
foo,1,4,7
bar,2,5,8
baz,3,6,9


In [331]:
df.set_index('key').stack()

key   
foo  A    1
     B    4
     C    7
bar  A    2
     B    5
     C    8
baz  A    3
     B    6
     C    9
dtype: int64

In [332]:
df.set_index('key').stack().reset_index()

Unnamed: 0,key,level_1,0
0,foo,A,1
1,foo,B,4
2,foo,C,7
3,bar,A,2
4,bar,B,5
5,bar,C,8
6,baz,A,3
7,baz,B,6
8,baz,C,9
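
The `stack().reset_index()` result above carries the default column names `level_1` and `0`. The sketch below shows one way to get named columns from either route; the names `variable` and `value` are chosen here to match what `melt` produces by default.

```python
import pandas as pd

df = pd.DataFrame({'key': ['foo', 'bar', 'baz'], 'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

# melt: name the two generated columns explicitly
melted = df.melt(id_vars='key', var_name='variable', value_name='value')

# stack: name the index levels first, then name the value column on reset
stacked = (
    df.set_index('key')
      .stack()
      .rename_axis(['key', 'variable'])
      .reset_index(name='value')
)
```

Both frames hold the same nine rows; only the row order differs (`melt` goes column by column, `stack` goes row by row).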


### Merge and concat

In [333]:
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'data': range(3)})

df1

Unnamed: 0,key,data
0,b,0
1,b,1
2,a,2
3,c,3
4,a,4
5,a,5
6,b,6


In [334]:
df2

Unnamed: 0,key,data
0,a,0
1,b,1
2,d,2


In [340]:
pd.merge(df1, df2, on='key', how='right', suffixes=['_df1', '_df2'])

Unnamed: 0,key,data_df1,data_df2
0,a,2.0,0
1,a,4.0,0
2,a,5.0,0
3,b,0.0,1
4,b,1.0,1
5,b,6.0,1
6,d,,2
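
For comparison with the right join above, this sketch counts the rows produced by each `how` option on the same two frames and uses `indicator=True` to record where each row of an outer join came from.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'data': range(3)})

# Number of result rows for each join type
row_counts = {
    how: len(pd.merge(df1, df2, on='key', how=how, suffixes=('_df1', '_df2')))
    for how in ('inner', 'left', 'right', 'outer')
}

# indicator=True adds a `_merge` column: 'both', 'left_only', or 'right_only'
outer = pd.merge(df1, df2, on='key', how='outer',
                 suffixes=('_df1', '_df2'), indicator=True)
```

Key `c` exists only in `df1` and key `d` only in `df2`, which is why the left and right joins each keep one unmatched row and the outer join keeps both.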


In [341]:
s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
s3 = pd.Series([5, 6], index=['f', 'g'])

s1

a    0
b    1
dtype: int64

In [342]:
s2

c    2
d    3
e    4
dtype: int64

In [344]:
s3

f    5
g    6
dtype: int64

In [345]:
pd.concat([s1, s2, s3], axis=0)

a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: int64

In [346]:
pd.concat([s1, s2, s3], axis=1)

Unnamed: 0,0,1,2
a,0.0,,
b,1.0,,
c,,2.0,
d,,3.0,
e,,4.0,
f,,,5.0
g,,,6.0


In [347]:
s4 = pd.concat([s1, s3])
s4

a    0
b    1
f    5
g    6
dtype: int64

In [348]:
pd.concat([s1, s4], axis=1)

Unnamed: 0,0,1
a,0.0,0
b,1.0,1
f,,5
g,,6


In [351]:
pd.concat([s1, s4], axis=1, join='outer')

Unnamed: 0,0,1
a,0.0,0
b,1.0,1
f,,5
g,,6
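
`join='outer'` is the default, which is why the last two results are identical. A short sketch of the alternatives, with the column labels `one` and `two` chosen only for illustration:

```python
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
s3 = pd.Series([5, 6], index=['f', 'g'])
s4 = pd.concat([s1, s3])

# join='inner' keeps only the labels present in every input
inner = pd.concat([s1, s4], axis=1, join='inner')

# keys= replaces the default 0, 1, ... column labels
named = pd.concat([s1, s4], axis=1, keys=['one', 'two'])
```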


In [350]:
type(np.nan)

float
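
Because `np.nan` is a float, an integer column that gains missing values is converted to `float64`, as in the `0.0` and `1.0` entries above. A minimal sketch of that upcast, together with the nullable `Int64` extension dtype as one way to keep integers:

```python
import numpy as np
import pandas as pd

ints = pd.Series([0, 1], index=['a', 'b'])

# Reindexing introduces a missing label 'f', so the values become float64 with NaN
aligned = ints.reindex(['a', 'b', 'f'])

# The nullable Int64 dtype stores pd.NA instead and keeps the integer type
nullable = ints.astype('Int64').reindex(['a', 'b', 'f'])
```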