<h3> Introduction to Pandas </h3>
<h3> Introduction to pandas Data Structures </h3>
<ul>
    <li> <b>Series</b> </li>
    <li> <b>DataFrame</b> </li>
</ul><br>

In [1]:
import pandas as pd
import numpy as np

<h3> Series </h3>
<ul>
    <li>One-dimensional array-like object containing a sequence of values</li>
    <li>An associated array of data labels, called its <b>index</b></li>
</ul>

In [2]:
obj = pd.Series([4, 7, -5, 3])

In [3]:
obj

0    4
1    7
2   -5
3    3
dtype: int64

In [4]:
obj.values

array([ 4,  7, -5,  3])

In [5]:
obj.index

RangeIndex(start=0, stop=4, step=1)

In [6]:
list(obj.index)

[0, 1, 2, 3]

In [7]:
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

In [8]:
obj2

d    4
b    7
a   -5
c    3
dtype: int64

In [9]:
obj2['a']

-5

In [10]:
obj2[['a','b']]

a   -5
b    7
dtype: int64

<p>['a', 'b'] is interpreted as a list of index labels</p>
<br>
<p>NumPy-like operations on a pandas Series</p><br>

In [11]:
obj2[obj2>0]

d    4
b    7
c    3
dtype: int64

In [12]:
obj2 * 2

d     8
b    14
a   -10
c     6
dtype: int64

In [13]:
np.exp(obj2)

d      54.598150
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

In [14]:
'b' in obj2

True

In [15]:
4 in obj2

False

In [16]:
4 in obj2.values

True
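<p>The three membership tests above suggest that <b>in</b> checks index labels, much as a dict checks its keys. A minimal sketch of the difference; the variable names are illustrative and not part of the notebook:</p>

```python
import pandas as pd

s = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

label_found = 'b' in s                       # looks at the index labels
value_as_label = 4 in s                      # 4 is a value, not a label
value_found = 4 in s.values                  # search the underlying array instead
value_found_isin = bool(s.isin([4]).any())   # pandas-native search over values
```

<p>To test for a value rather than a label, search <b>s.values</b> or use <b>isin</b>.</p>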

<p>Create a Series from a dict:</p><br>

In [17]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

In [18]:
obj3 = pd.Series(sdata)

In [19]:
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

In [20]:
states = ['California', 'Ohio', 'Oregon', 'Texas']

In [21]:
obj4 = pd.Series(sdata, index=states)

In [22]:
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [23]:
obj4.isnull()

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [24]:
obj4.notnull()

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

In [25]:
obj3 + obj4

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64

In [26]:
obj4 + obj3

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64

<br><p>Both the Series object itself and its index have a name attribute, which integrates with other key areas of pandas functionality</p>

In [27]:
obj4.name = 'population'

In [28]:
obj4.index.name = 'state'

In [29]:
obj4

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

In [30]:
obj

0    4
1    7
2   -5
3    3
dtype: int64

<br><p>A Series’s index can be altered in-place by assignment:</p>

In [31]:
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']

In [32]:
obj

Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64
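<p>Assigning to <b>index</b> only works when the new labels match the Series in length. The sketch below checks that, and shows <b>rename</b> as an alternative that returns a new object; the label 'Robert' is made up for illustration:</p>

```python
import pandas as pd

s = pd.Series([4, 7, -5, 3])
s.index = ['Bob', 'Steve', 'Jeff', 'Ryan']    # same length: accepted

try:
    s.index = ['only', 'three', 'labels']      # wrong length: rejected
    length_error = None
except ValueError as exc:
    length_error = exc

# rename returns a new Series and leaves `s` untouched
renamed = s.rename(index={'Bob': 'Robert'})
```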

<br><h3>DataFrame</h3>
<ul>
    <li>Two-dimensional, size-mutable table with labeled axes (rows and columns)</li>
    <li>Rectangular table of data that contains an ordered collection of columns</li>
    <li>Each column can be a different value type</li>
    <li>The DataFrame has both a row and a column index; it can be thought of as a dict of Series all sharing the same index</li>
    <li>Under the hood, the data is stored as one or more two-dimensional blocks rather than a list, dict, or some other collection of one-dimensional arrays</li>
</ul><br>
    

<p>Constructing a DataFrame:</p>

In [33]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

In [34]:
frame = pd.DataFrame(data)

In [35]:
frame

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Nevada,2003,3.2


In [36]:
pd.DataFrame(data, columns=['year', 'state', 'pop'])

Unnamed: 0,year,state,pop
0,2000,Ohio,1.5
1,2001,Ohio,1.7
2,2002,Ohio,3.6
3,2001,Nevada,2.4
4,2002,Nevada,2.9
5,2003,Nevada,3.2


In [37]:
frame2 = pd.DataFrame(data, 
                      columns=['year', 'state', 'pop', 'debt'], 
                      index=['one', 'two', 'three', 'four', 'five', 'six'])

In [38]:
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,
five,2002,Nevada,2.9,
six,2003,Nevada,3.2,


In [39]:
frame2['state']

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object

<br>Rows can be retrieved by label with the special loc attribute (iloc does the same by integer position):

In [40]:
frame2.loc['one']

year     2000
state    Ohio
pop       1.5
debt      NaN
Name: one, dtype: object

In [41]:
frame2.loc[['one', 'four']]

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
four,2001,Nevada,2.4,
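<p>A small sketch contrasting label-based <b>loc</b> with position-based <b>iloc</b>, using a made-up three-row frame:</p>

```python
import pandas as pd

frame = pd.DataFrame({'year': [2000, 2001, 2002], 'pop': [1.5, 1.7, 3.6]},
                     index=['one', 'two', 'three'])

by_label = frame.loc['two']         # row whose index label is 'two'
by_position = frame.iloc[1]         # second row, whatever its label is
cell = frame.loc['three', 'pop']    # label-based row and column lookup
```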


In [42]:
frame2['debt'] = 16.5

<br>Columns can be modified by assignment:

In [43]:
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,16.5
two,2001,Ohio,1.7,16.5
three,2002,Ohio,3.6,16.5
four,2001,Nevada,2.4,16.5
five,2002,Nevada,2.9,16.5
six,2003,Nevada,3.2,16.5


<h4>Tips</h4>
<ul>
    <li> assigning lists or arrays to a column => the value’s length must match the length of the DataFrame </li>
    <li> assign a Series => its labels will be realigned exactly to the DataFrame’s index, inserting missing values in any holes </li>
</ul>

In [44]:
val = pd.Series([-1.2, -1.5, -1.7, 100000], index=['two', 'four', 'five', 'x'])

In [45]:
frame2['debt'] = val

In [46]:
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,-1.2
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,-1.5
five,2002,Nevada,2.9,-1.7
six,2003,Nevada,3.2,
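<p>Both assignment rules from the tips can be checked on a small frame. This is a sketch with made-up values, not the notebook's data:</p>

```python
import pandas as pd

frame = pd.DataFrame({'pop': [1.5, 1.7, 3.6]}, index=['one', 'two', 'three'])

# A list has no labels, so its length must equal the number of rows.
try:
    frame['debt'] = [1.0, 2.0]
    list_error = None
except ValueError as exc:
    list_error = exc

# A Series is aligned by label: 'two' matches, 'x' is dropped,
# and rows without a matching label get NaN.
frame['debt'] = pd.Series([-1.2, 99.0], index=['two', 'x'])
```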


<ul>
    <li>Assigning a column that doesn’t exist will create a new column. The del keyword will delete columns as with a dict.</li>
</ul>

In [47]:
frame2['eastern'] = frame2['state'] == 'Ohio'

In [48]:
frame2

Unnamed: 0,year,state,pop,debt,eastern
one,2000,Ohio,1.5,,True
two,2001,Ohio,1.7,-1.2,True
three,2002,Ohio,3.6,,True
four,2001,Nevada,2.4,-1.5,False
five,2002,Nevada,2.9,-1.7,False
six,2003,Nevada,3.2,,False


In [49]:
del frame2['eastern']

In [50]:
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,-1.2
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,-1.5
five,2002,Nevada,2.9,-1.7
six,2003,Nevada,3.2,


Another common form of data is a nested dict of dicts:

In [51]:
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
            'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

In [52]:
frame3 = pd.DataFrame(pop)

In [53]:
frame3

Unnamed: 0,Nevada,Ohio
2000,,1.5
2001,2.4,1.7
2002,2.9,3.6


In [54]:
frame3.T

Unnamed: 0,2000,2001,2002
Nevada,,2.4,2.9
Ohio,1.5,1.7,3.6


<h3>Possible data inputs to DataFrame constructor</h3>
<ul>
    <li><b>2D ndarray</b> : A matrix of data, passing optional row and column labels</li>
    <li><b>dict of arrays, lists, or tuples</b> : Each sequence becomes a column in the DataFrame; all sequences must be the same length</li>
    <li><b>NumPy structured/record array</b> : Treated as the “dict of arrays” case</li>
    <li><b>dict of Series</b> : Each value becomes a column; indexes from each Series are unioned together to form the result’s row index if no explicit index is passed</li>
    <li><b>dict of dicts</b> : Each inner dict becomes a column; keys are unioned to form the row index as in the “dict of Series” case</li>
    <li><b>List of dicts or Series </b> : Each item becomes a row in the DataFrame; union of dict keys or Series indexes become the DataFrame’s column labels</li>
    <li><b>List of lists or tuples</b> : Treated as the “2D ndarray” case</li>
    <li><b>Another DataFrame </b> : The DataFrame’s indexes are used unless different ones are passed</li>
    <li><b>NumPy MaskedArray</b> : Like the “2D ndarray” case except masked values become NA/missing in the DataFrame result</li>
</ul><br>
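<p>Three of the input types listed above, sketched with small made-up data so the resulting labels are easy to follow:</p>

```python
import numpy as np
import pandas as pd

# 2D ndarray with optional row and column labels
from_array = pd.DataFrame(np.arange(6).reshape(2, 3),
                          index=['r1', 'r2'], columns=['a', 'b', 'c'])

# List of dicts: each dict is a row, keys are unioned into the columns
from_records = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

# Dict of Series: the indexes are unioned into the row index
from_series = pd.DataFrame({'x': pd.Series([1, 2], index=['p', 'q']),
                            'y': pd.Series([3, 4], index=['q', 'r'])})
```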

<h3>Index Objects</h3>
<ul>
    <li>pandas’s Index objects are responsible for holding the axis labels and other metadata (like the axis name or names)</li>
</ul>

In [55]:
obj = pd.Series(range(3), index=['a', 'b', 'c'])

In [56]:
index = obj.index

In [57]:
index.difference

<bound method Index.difference of Index(['a', 'b', 'c'], dtype='object')>
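<p>The cell above only displays the bound method. A sketch of actually calling it, plus two other Index behaviors the notebook does not state explicitly: Index objects are immutable, and they support set-like operations.</p>

```python
import pandas as pd

index = pd.Series(range(3), index=['a', 'b', 'c']).index

try:
    index[1] = 'd'                      # Index objects cannot be modified
    mutation_error = None
except TypeError as exc:
    mutation_error = exc

only_here = index.difference(['b'])     # labels that are not in the argument
combined = index.union(['c', 'd'])      # set-like union of labels
has_b = 'b' in index
```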

<h3>Essential Functionality</h3>
<ul>
    <li>Reindexing</li>
    <li>Dropping Entries from an Axis</li>
    <li>Indexing, Selection, and Filtering</li>
    <li>Integer Indexes</li>
    <li>Arithmetic and Data Alignment</li>
    <li>Function Application and Mapping</li>
</ul><br>

In [58]:
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])

In [59]:
obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

In [60]:
obj2 = obj.reindex(['a', 'b', 'c', 'd'])

In [61]:
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
dtype: float64

In [62]:
obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

In [63]:
obj3

0      blue
2    purple
4    yellow
dtype: object

In [64]:
obj3.reindex(range(6), method='ffill')

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
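<p>Besides <b>method='ffill'</b>, reindex can fill backward or insert a constant. A sketch on the same data; the fill string 'none' is an arbitrary choice:</p>

```python
import pandas as pd

s = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

forward = s.reindex(range(6), method='ffill')    # carry the last value forward
backward = s.reindex(range(6), method='bfill')   # take the next value instead
constant = s.reindex(range(6), fill_value='none')
```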

In [65]:
frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'], 
                     columns=['Ohio', 'Texas', 'California'])

In [66]:
frame

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


In [67]:
frame2 = frame.reindex(['a', 'b', 'c', 'd'])

In [68]:
frame2

Unnamed: 0,Ohio,Texas,California
a,0.0,1.0,2.0
b,,,
c,3.0,4.0,5.0
d,6.0,7.0,8.0


The columns can be reindexed with the <b>columns</b> keyword:

In [69]:
states = ['Texas', 'Utah', 'California']

In [70]:
frame.reindex(columns=states)

Unnamed: 0,Texas,Utah,California
a,1,,2
c,4,,5
d,7,,8


<h4> Dropping Entries from an Axis </h4>

In [71]:
obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

In [72]:
obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [73]:
new_obj = obj.drop('c')

In [74]:
new_obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

In [75]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

In [76]:
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [77]:
data.drop('Colorado')

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Utah,8,9,10,11
New York,12,13,14,15


In [78]:
data.drop('two', axis='columns') # or axis=1

Unnamed: 0,one,three,four
Ohio,0,2,3
Colorado,4,6,7
Utah,8,10,11
New York,12,14,15


In [79]:
obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

<p>Mutating the object in place with <b>inplace=True</b></p>

In [80]:
obj.drop('c', inplace=True)

In [81]:
obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64
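<p>Without <b>inplace=True</b>, drop returns a new object and leaves the original alone. A sketch that also shows dropping several labels at once and the <b>errors='ignore'</b> option; the label 'zzz' is deliberately absent:</p>

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

smaller = obj.drop(['d', 'c'])      # returns a new Series
unchanged_length = len(obj)         # the original still has 5 entries

# errors='ignore' skips labels that are not present instead of raising
tolerant = obj.drop(['c', 'zzz'], errors='ignore')
```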

<h4>Indexing, Selection, and Filtering</h4>

In [82]:
obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

In [83]:
obj

a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

In [84]:
obj['b']

1.0

In [85]:
obj.iloc[1]  # positional access; plain obj[1] on a label index is deprecated in recent pandas

1.0

In [86]:
obj[2:4]

c    2.0
d    3.0
dtype: float64

In [87]:
obj[['b', 'a', 'd']]

b    1.0
a    0.0
d    3.0
dtype: float64

In [88]:
obj[obj<2]

a    0.0
b    1.0
dtype: float64

In [89]:
obj['a':'c']

a    0.0
b    1.0
c    2.0
dtype: float64

In [90]:
obj['b':'c']=12.0

In [91]:
obj

a     0.0
b    12.0
c    12.0
d     3.0
dtype: float64
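<p>Label slices include their end label, while positional slices follow the usual Python rule and exclude the stop. A short sketch of that difference:</p>

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

by_label = obj.loc['b':'c']     # label slice: both endpoints are included
by_position = obj.iloc[1:2]     # positional slice: the stop is excluded
```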

In [92]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

In [93]:
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [94]:
data['one']

Ohio         0
Colorado     4
Utah         8
New York    12
Name: one, dtype: int64

In [95]:
data[:2]

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7


In [96]:
data[data['three']>5]

Unnamed: 0,one,two,three,four
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


<p>Selection with loc and iloc</p>

In [97]:
data.loc['Colorado', ['two', 'three']]

two      5
three    6
Name: Colorado, dtype: int64

In [98]:
data.iloc[2]

one       8
two       9
three    10
four     11
Name: Utah, dtype: int64

In [99]:
data.iloc[[1,2], [3,0,1]]

Unnamed: 0,four,one,two
Colorado,7,4,5
Utah,11,8,9


In [100]:
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [101]:
ser = pd.Series(np.arange(3.))

In [102]:
ser

0    0.0
1    1.0
2    2.0
dtype: float64

In [103]:
ser2 = pd.Series(np.arange(3.), index=['a', 'b', 'c'])

In [104]:
ser2

a    0.0
b    1.0
c    2.0
dtype: float64

In [105]:
ser2.iloc[-1]  # positional access; plain ser2[-1] on a label index is deprecated in recent pandas

2.0

In [106]:
ser.loc[:1]

0    0.0
1    1.0
dtype: float64

<h4>Arithmetic and Data Alignment</h4>
<ul>
    <li>When arithmetic combines objects with different indexes, pandas aligns them by label: the result's index is the union of both indexes, and labels present on only one side get NaN.</li>
</ul>

In [107]:
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])

In [108]:
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

In [109]:
s1

a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64

In [110]:
s2

a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

In [111]:
s1 + s2

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64
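<p>If the NaN entries are unwanted, the method form of the operator accepts a <b>fill_value</b> that stands in for the absent side. A sketch on the same two Series:</p>

```python
import pandas as pd

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

plain = s1 + s2                      # NaN where a label is on one side only
filled = s1.add(s2, fill_value=0)    # treat the absent side as 0
```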

<p>In DataFrame, alignment is performed on both the rows and the columns:</p>

In [112]:
df1 = pd.DataFrame(np.arange(9.).reshape(3, 3), columns=list('bcd'), index= ['Ohio', 'Texas', 'Colorado'])

In [113]:
df2 = pd.DataFrame(np.arange(12.).reshape(4, 3), columns=list('bde'),
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [114]:
df1

Unnamed: 0,b,c,d
Ohio,0.0,1.0,2.0
Texas,3.0,4.0,5.0
Colorado,6.0,7.0,8.0


In [115]:
df2

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [116]:
df1 + df2

Unnamed: 0,b,c,d,e
Colorado,,,,
Ohio,3.0,,6.0,
Oregon,,,,
Texas,9.0,,12.0,
Utah,,,,


In [117]:
df1 = pd.DataFrame({'A': [1, 2]})

In [118]:
df2 = pd.DataFrame({'B': [3, 4]})

In [119]:
df1

Unnamed: 0,A
0,1
1,2


In [120]:
df2

Unnamed: 0,B
0,3
1,4


In [121]:
df1 - df2

Unnamed: 0,A,B
0,,
1,,


In [122]:
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))

In [123]:
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))

In [124]:
df1.add(df2, fill_value=1.0)

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,5.0
1,9.0,11.0,13.0,15.0,10.0
2,18.0,20.0,22.0,24.0,15.0
3,16.0,17.0,18.0,19.0,20.0


In [125]:
1/df1

Unnamed: 0,a,b,c,d
0,inf,1.0,0.5,0.333333
1,0.25,0.2,0.166667,0.142857
2,0.125,0.111111,0.1,0.090909


In [126]:
df1.rdiv(1)

Unnamed: 0,a,b,c,d
0,inf,1.0,0.5,0.333333
1,0.25,0.2,0.166667,0.142857
2,0.125,0.111111,0.1,0.090909


<p>Operations between DataFrame and Series</p>

In [127]:
frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [128]:
frame

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [129]:
frame.iloc[0]

b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64

In [130]:
series = frame.iloc[0]

In [131]:
series

b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64

In [132]:
type(series)

pandas.core.series.Series

In [133]:
frame - series

Unnamed: 0,b,d,e
Utah,0.0,0.0,0.0
Ohio,3.0,3.0,3.0
Texas,6.0,6.0,6.0
Oregon,9.0,9.0,9.0


In [134]:
series2 = pd.Series(range(3), index=['b', 'e', 'f'])

In [135]:
frame + series2

Unnamed: 0,b,d,e,f
Utah,0.0,,3.0,
Ohio,3.0,,6.0,
Texas,6.0,,9.0,
Oregon,9.0,,12.0,
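<p>By default a Series is matched against the DataFrame's columns and broadcast down the rows. To match on the row index instead, the arithmetic methods take an <b>axis</b> argument. A sketch with the same frame:</p>

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

row = frame.iloc[0]
col = frame['d']

by_row = frame - row                      # matches on column labels
by_col = frame.sub(col, axis='index')     # matches on row labels
```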


<h4>Function Application and Mapping</h4>

In [136]:
frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'), index= ['Utah', 'Ohio', 'Texas', 'Oregon'])

In [137]:
frame

Unnamed: 0,b,d,e
Utah,-1.77232,-0.907681,0.915981
Ohio,0.479492,-0.648829,-1.079367
Texas,0.338097,-1.189623,-0.683677
Oregon,-0.424193,1.147343,-0.480687


In [138]:
np.abs(frame)

Unnamed: 0,b,d,e
Utah,1.77232,0.907681,0.915981
Ohio,0.479492,0.648829,1.079367
Texas,0.338097,1.189623,0.683677
Oregon,0.424193,1.147343,0.480687


<p><b>apply</b> method : apply a function on one-dimensional arrays to each column or row </p>

In [139]:
def f(x):
    return x.max() - x.min()

In [140]:
frame.apply(f)

b    2.251812
d    2.336966
e    1.995348
dtype: float64

In [141]:
frame.apply(f, axis=1)

Utah      2.688301
Ohio      1.558859
Texas     1.527720
Oregon    1.628030
dtype: float64

In [142]:
def f(x):
    return pd.Series([x.min(), x.max()], index=['min', 'max'])

In [143]:
frame.apply(f)

Unnamed: 0,b,d,e
min,-1.77232,-1.189623,-1.079367
max,0.479492,1.147343,0.915981
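<p>Because the frame above is random, its numbers cannot be reproduced. The sketch below repeats the same calls on fixed, made-up values, and adds <b>Series.map</b> for an element-wise function; the helper name <b>spread</b> is illustrative:</p>

```python
import pandas as pd

frame = pd.DataFrame({'b': [1.0, -2.0, 3.5], 'd': [0.5, 4.0, -1.0]},
                     index=['Utah', 'Ohio', 'Texas'])

def spread(x):
    # x is a whole column or a whole row, passed in as a Series
    return x.max() - x.min()

per_column = frame.apply(spread)           # one result per column
per_row = frame.apply(spread, axis=1)      # one result per row

# Element-wise work on a single column goes through Series.map
formatted = frame['b'].map(lambda v: f'{v:.2f}')
```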


In [144]:
frame.sort_index()

Unnamed: 0,b,d,e
Ohio,0.479492,-0.648829,-1.079367
Oregon,-0.424193,1.147343,-0.480687
Texas,0.338097,-1.189623,-0.683677
Utah,-1.77232,-0.907681,0.915981


In [145]:
frame.sort_index(axis=1)

Unnamed: 0,b,d,e
Utah,-1.77232,-0.907681,0.915981
Ohio,0.479492,-0.648829,-1.079367
Texas,0.338097,-1.189623,-0.683677
Oregon,-0.424193,1.147343,-0.480687


In [146]:
obj = pd.Series([4, 7, -3, 2])

In [147]:
obj

0    4
1    7
2   -3
3    2
dtype: int64

In [148]:
obj.sort_values()

2   -3
3    2
0    4
1    7
dtype: int64

In [149]:
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])

In [150]:
obj.sort_values()

4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64

In [151]:
frame = pd.DataFrame({'b':[4, 7, -3, 2], 'a':[0, 1, 0, 1]})

In [152]:
frame

Unnamed: 0,b,a
0,4,0
1,7,1
2,-3,0
3,2,1


In [153]:
obj.rank(ascending=False, method='max')

0    2.0
1    NaN
2    1.0
3    NaN
4    4.0
5    3.0
dtype: float64
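<p>The two-column frame defined above can be sorted by several columns at once, and <b>rank</b> offers different tie-breaking rules. A sketch; the tied Series <b>s</b> is made up:</p>

```python
import pandas as pd

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
by_a_then_b = frame.sort_values(by=['a', 'b'])   # sort on a, break ties on b

s = pd.Series([7, -5, 7, 4])
average_rank = s.rank()                  # ties share the mean of their ranks
first_rank = s.rank(method='first')      # ties broken by order of appearance
```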

<h3>DataFrame vs Series </h3>

In [154]:
a = np.random.randn(4,5)

In [155]:
pd.DataFrame(a)

Unnamed: 0,0,1,2,3,4
0,0.867078,-0.466423,-0.798255,0.191517,-0.337475
1,-1.547859,0.202853,-2.063505,1.637027,0.217197
2,0.433628,-1.74079,0.463267,-0.459306,0.832866
3,-1.758418,-1.969648,-0.268768,-0.106296,-1.175101


pd.Series(a)  # raises an error: a Series needs one-dimensional data, and a is 4x5
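<p>A sketch of the distinction: two-dimensional data fits a DataFrame, while a Series accepts only one-dimensional input. The exception is caught broadly because its exact type has varied between pandas versions.</p>

```python
import numpy as np
import pandas as pd

a = np.arange(20.).reshape(4, 5)

frame = pd.DataFrame(a)         # 2D data fits a DataFrame
first_row = pd.Series(a[0])     # a single row is 1D, so a Series works
flat = pd.Series(a.ravel())     # or flatten the whole array first

try:
    pd.Series(a)
    series_error = None
except Exception as exc:        # exact exception type depends on the version
    series_error = exc
```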

<h3>Reading and Writing Data in Text Format</h3><br>

<h3>Parsing functions in pandas</h3>
<ul>
    <li><b>read_csv</b> : Load delimited data from a file, URL, or file-like object. Use comma as default delimiter</li>
    <li><b>read_table</b> : Load delimited data from a file, URL, or file-like object. Use tab ( '\t' ) as default delimiter</li>
    <li><b>read_fwf</b> : Read data in fixed-width column format (that is, no delimiters)</li>
    <li><b>read_clipboard</b> : Version of read_table that reads data from the clipboard. Useful for converting tables from web pages</li>
</ul><br>
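<p>The parsers also accept file-like objects, so the effect of the delimiter can be tried without touching the disk. A sketch using in-memory text; the sample strings are made up:</p>

```python
from io import StringIO
import pandas as pd

comma_text = "a,b,c\n1,2,3\n4,5,6\n"
tab_text = "a\tb\tc\n1\t2\t3\n4\t5\t6\n"

from_commas = pd.read_csv(StringIO(comma_text))         # default sep=','
from_tabs = pd.read_csv(StringIO(tab_text), sep='\t')   # explicit tab separator
wrong_sep = pd.read_csv(StringIO(tab_text))             # ends up as one wide column
```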

* *CSV* : Comma-separated values

In [156]:
!cat "files/ex1.csv"

a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo

In [157]:
df = pd.read_csv('files/ex1.csv')

In [158]:
df

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [159]:
df.dtypes

a           int64
b           int64
c           int64
d           int64
message    object
dtype: object

<ul>
    <li>Type inference is one of the more important features of these functions; that means you don’t have to specify which columns are numeric, integer, boolean, or string. Handling dates and other custom types requires a bit more effort, though.</li>
</ul>
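<p>A sketch of type inference on in-memory text. The column names are hypothetical, and <b>parse_dates</b> is one way to ask for the extra date handling mentioned above:</p>

```python
from io import StringIO
import pandas as pd

text = ("day,count,ratio,flag\n"
        "2020-01-01,1,0.5,True\n"
        "2020-01-02,2,1.5,False\n")

inferred = pd.read_csv(StringIO(text))                       # day stays as text
with_dates = pd.read_csv(StringIO(text), parse_dates=['day'])

day_is_date_by_default = pd.api.types.is_datetime64_any_dtype(inferred['day'])
day_is_date_when_asked = pd.api.types.is_datetime64_any_dtype(with_dates['day'])
```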

In [160]:
pd.read_table('files/ex1.csv')

Unnamed: 0,"a,b,c,d,message"
0,"1,2,3,4,hello"
1,"5,6,7,8,world"
2,"9,10,11,12,foo"


In [161]:
pd.read_table('files/ex1.csv' , sep=',')

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


<ul>
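    <li>When a file has no header row, you have a couple of options: let pandas assign default column names with <b>header=None</b>, or specify the names yourself with <b>names</b>. Because ex1.csv does have a header row, both options below read that row in as the first row of data</li>
    <li>You can also choose which column becomes the index with <b>index_col</b></li>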
</ul>

In [162]:
pd.read_csv('files/ex1.csv', header = None)

Unnamed: 0,0,1,2,3,4
0,a,b,c,d,message
1,1,2,3,4,hello
2,5,6,7,8,world
3,9,10,11,12,foo


In [163]:
pd.read_csv('files/ex1.csv', names=['m', 'n', 'o' , 'p', 'payam'])

Unnamed: 0,m,n,o,p,payam
0,a,b,c,d,message
1,1,2,3,4,hello
2,5,6,7,8,world
3,9,10,11,12,foo
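<p>In the outputs above, the file's own header line shows up as row 0. The sketch below uses made-up in-memory text to show the same options on data without a header, and <b>header=0</b> as a way to replace an existing header with your own names:</p>

```python
from io import StringIO
import pandas as pd

no_header = "1,2,3,hello\n5,6,7,world\n"
with_header = "a,b,c,message\n" + no_header

default_names = pd.read_csv(StringIO(no_header), header=None)
own_names = pd.read_csv(StringIO(no_header), names=['a', 'b', 'c', 'message'])

# header=0 marks the first line as a header, which `names` then replaces
replaced = pd.read_csv(StringIO(with_header), header=0,
                       names=['m', 'n', 'o', 'text'])
```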


In [164]:
names = ['a', 'b', 'c', 'd', 'message']

In [165]:
pd.read_csv('files/ex1.csv', names=names, index_col='message')

message,a,b,c,d
message,a,b,c,d
hello,1,2,3,4
world,5,6,7,8
foo,9,10,11,12


In [166]:
!cat 'files/ex2.csv'

key1,key2,value1,value2
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16


In [167]:
frame = pd.read_csv('files/ex2.csv', index_col=['key1', 'key2'])

In [168]:
frame

key1,key2,value1,value2
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16


In [169]:
frame.shape

(8, 2)

In [170]:
frame.loc['one']

key2,value1,value2
a,1,2
b,3,4
c,5,6
d,7,8


In [171]:
frame.loc['one'].loc['a']

value1    1
value2    2
Name: a, dtype: int64

<ul>
    <li>Handling missing values is an important and frequently nuanced part of the file parsing process. Missing data is usually either not present (empty string) or marked by some sentinel value. By default, pandas uses a set of commonly occurring sentinels, such as NA, -1.#IND, and NULL</li>
</ul>

In [172]:
!cat files/ex3.csv

something,a,b,c,d,message
one,1,2,3,4,NA
two,5,6,,8,world
three,9,10,11,12,foo


In [173]:
df = pd.read_csv('files/ex3.csv')

In [174]:
df

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo


In [175]:
df.isnull()

Unnamed: 0,something,a,b,c,d,message
0,False,False,False,False,False,True
1,False,False,False,True,False,False
2,False,False,False,False,False,False


<h4>pandas represents missing values as <b>NaN</b></h4>
<ul>
    <li><b>NaN</b> stands for "not a number"</li>
    <li><b>NaN</b> is a floating-point value</li>
    <li>NaN is not equal to NaN</li>
</ul>

In [176]:
type(df['c'][1])  # the missing entry in column c

numpy.float64

In [177]:
np.nan == np.nan

False

In [178]:
np.nan + 3

nan

In [179]:
np.nan > np.nan

False
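<p>Because NaN never compares equal to anything, an equality test cannot find missing values. A sketch of the comparison that fails and the check that works:</p>

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, np.nan, 3.0])

wrong = values == np.nan     # all False: NaN is not equal even to itself
right = values.isna()        # the supported way to find missing values
single = pd.isna(np.nan)     # works on scalars too
```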

<ul><li>The <b>na_values</b> option takes a list or set of additional values to treat as missing</li></ul>

In [180]:
df = pd.read_csv('files/ex3.csv', na_values=['NULL', 4])

In [181]:
df

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,,
1,two,5,6,,8.0,world
2,three,9,10,11.0,12.0,foo
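<p><b>na_values</b> also accepts a dict, so that each column gets its own sentinels. A sketch on the ex3.csv content shown earlier, read from memory; the choice of sentinels is arbitrary:</p>

```python
from io import StringIO
import pandas as pd

text = ("something,a,b,c,d,message\n"
        "one,1,2,3,4,NA\n"
        "two,5,6,,8,world\n"
        "three,9,10,11,12,foo\n")

# Per-column sentinels: 'foo' counts as missing only in `message`,
# 'two' only in `something`. The default sentinels such as NA still apply.
sentinels = {'message': ['foo'], 'something': ['two']}
df = pd.read_csv(StringIO(text), na_values=sentinels)
```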


<h3>Reading Text Files in Pieces</h3>
<ul>
    <li>When processing very large files or figuring out the right set of arguments to correctly process a large file, you may only want to read in a small piece of a file or iterate through smaller chunks of the file </li>
</ul>

In [182]:
df = pd.read_csv('files/ex4.csv')

In [183]:
df

Unnamed: 0,one,two,three,four,key
0,0.467976,-0.038649,-0.295344,-1.824726,L
1,-0.358893,1.404453,0.704965,-0.200638,B
2,-0.501840,0.659254,-0.421691,-0.057688,G
3,0.204886,1.074134,1.388361,-0.982404,R
4,0.354628,-0.133116,0.283763,-0.837063,Q
5,1.817480,0.742273,0.419395,-2.251035,Q
6,-0.776764,0.935518,-0.332872,-1.875641,U
7,-0.913135,1.530624,-0.572657,0.477252,K
8,0.358480,-0.497572,-0.367016,0.507702,S
9,-1.740877,-1.160417,-1.637830,2.172201,G


In [184]:
pd.read_csv('files/ex4.csv', nrows=5)

Unnamed: 0,one,two,three,four,key
0,0.467976,-0.038649,-0.295344,-1.824726,L
1,-0.358893,1.404453,0.704965,-0.200638,B
2,-0.50184,0.659254,-0.421691,-0.057688,G
3,0.204886,1.074134,1.388361,-0.982404,R
4,0.354628,-0.133116,0.283763,-0.837063,Q


<p><b>nrows</b> limits how many rows are read, and <b>skiprows</b> skips lines at the start of the file (the header line counts as one of them)</p>

In [185]:
pd.read_csv('files/ex4.csv', skiprows = 25, nrows=5)

Unnamed: 0,-0.564474180519,0.792832268133,0.747052842907,0.57167515791,I
0,1.759879,-0.515666,-0.230481,1.362317,S
1,0.126266,0.309281,0.38282,-0.239199,L
2,1.33436,-0.100152,-0.840731,-0.643967,6
3,-0.73762,0.278087,-0.053235,-0.950972,J
4,-1.148486,-0.986292,-0.144963,0.124362,Y


In [186]:
df.loc[24:30] # unlike list and NumPy slicing, .loc includes both endpoints: 24 <= index <= 30

Unnamed: 0,one,two,three,four,key
24,-0.564474,0.792832,0.747053,0.571675,I
25,1.759879,-0.515666,-0.230481,1.362317,S
26,0.126266,0.309281,0.38282,-0.239199,L
27,1.33436,-0.100152,-0.840731,-0.643967,6
28,-0.73762,0.278087,-0.053235,-0.950972,J
29,-1.148486,-0.986292,-0.144963,0.124362,Y
30,1.630594,0.243886,0.468368,1.258048,F


In [187]:
pd.read_csv('files/ex4.csv', skiprows = 25, nrows=5, names = ['one', 'two', 'three', 'four', 'key'])

Unnamed: 0,one,two,three,four,key
0,-0.564474,0.792832,0.747053,0.571675,I
1,1.759879,-0.515666,-0.230481,1.362317,S
2,0.126266,0.309281,0.38282,-0.239199,L
3,1.33436,-0.100152,-0.840731,-0.643967,6
4,-0.73762,0.278087,-0.053235,-0.950972,J


<p>To read a file in pieces, specify a <b>chunksize</b> as a number of rows</p>

In [188]:
chunker = pd.read_csv('files/ex4.csv', chunksize=100)

In [189]:
chunker

<pandas.io.parsers.TextFileReader at 0x7f310603f978>

<ul><li>With <b>chunksize</b>, read_csv returns a pd.io.parsers.TextFileReader, which is an iterator: step through it with <b>next</b> or a for loop</li></ul>

In [190]:
next(chunker)

Unnamed: 0,one,two,three,four,key
0,0.467976,-0.038649,-0.295344,-1.824726,L
1,-0.358893,1.404453,0.704965,-0.200638,B
2,-0.501840,0.659254,-0.421691,-0.057688,G
3,0.204886,1.074134,1.388361,-0.982404,R
4,0.354628,-0.133116,0.283763,-0.837063,Q
5,1.817480,0.742273,0.419395,-2.251035,Q
6,-0.776764,0.935518,-0.332872,-1.875641,U
7,-0.913135,1.530624,-0.572657,0.477252,K
8,0.358480,-0.497572,-0.367016,0.507702,S
9,-1.740877,-1.160417,-1.637830,2.172201,G


In [191]:
a2 = next(chunker)

In [192]:
a2

Unnamed: 0,one,two,three,four,key
100,-0.602748,-0.182504,-0.768164,1.260686,X
101,-0.039105,-0.069510,-1.358052,-0.292859,0
102,-0.345101,-0.179477,-0.904658,-0.071287,T
103,-0.759495,0.225916,2.235872,-0.004490,X
104,-1.686771,0.742276,0.485876,1.798507,Q
105,0.145697,0.805037,0.350362,-0.540235,O
106,2.487654,-0.147867,0.580688,-0.305696,S
107,0.380166,-0.592483,0.393693,-0.463217,M
108,0.963549,-1.550302,-0.944962,0.121546,K
109,-0.040385,-0.312649,-1.196604,1.075043,S


In [193]:
chunker = pd.read_csv('files/ex4.csv', chunksize=100)

In [194]:
tot = pd.Series([], dtype='float64')

In [195]:
for piece in chunker:
    tot = tot.add(piece['key'].value_counts(), fill_value=0)

In [196]:
tot

0    151.0
1    146.0
2    152.0
3    162.0
4    171.0
5    157.0
6    166.0
7    164.0
8    162.0
9    150.0
A    320.0
B    302.0
C    286.0
D    320.0
E    368.0
F    335.0
G    308.0
H    330.0
I    327.0
J    337.0
K    334.0
L    346.0
M    338.0
N    306.0
O    343.0
P    324.0
Q    340.0
R    318.0
S    308.0
T    304.0
U    326.0
V    328.0
W    305.0
X    364.0
Y    314.0
Z    288.0
dtype: float64
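<p>The same chunked aggregation can be tried without the data file. This sketch reads a tiny made-up table from memory in chunks of two rows and sorts the totals at the end:</p>

```python
from io import StringIO
import pandas as pd

text = "value,key\n1,A\n2,B\n3,A\n4,C\n5,A\n"   # made-up stand-in data

tot = pd.Series(dtype='float64')
for piece in pd.read_csv(StringIO(text), chunksize=2):
    # each piece is an ordinary DataFrame with at most 2 rows
    tot = tot.add(piece['key'].value_counts(), fill_value=0)

tot = tot.sort_values(ascending=False)
```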

<h3> Writing Data Out to Text Format </h3>

In [197]:
data = pd.read_csv('files/ex3.csv')

In [198]:
data.to_csv('out.csv')

In [199]:
!cat 'out.csv'

,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo


In [200]:
import sys

In [201]:
data.to_csv(sys.stdout, na_rep='NULL')

,something,a,b,c,d,message
0,one,1,2,3.0,4,NULL
1,two,5,6,NULL,8,world
2,three,9,10,11.0,12,foo


In [202]:
data.to_csv(sys.stdout, na_rep="nil", index=False)

something,a,b,c,d,message
one,1,2,3.0,4,nil
two,5,6,nil,8,world
three,9,10,11.0,12,foo
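<p>Other <b>to_csv</b> options include a different separator and a subset of columns. A sketch with made-up data; when no path is given, the method returns the text instead of writing a file:</p>

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'something': ['one', 'two'],
                     'a': [1, 5],
                     'c': [3.0, np.nan]})

# No path argument: to_csv returns the CSV text as a string
text = data.to_csv(sep='|', index=False, na_rep='NULL',
                   columns=['something', 'c'])
lines = text.splitlines()
```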


<h4>Exercise : reading csv file manually </h4>

In [203]:
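def read_csv(source: str) -> np.ndarray:
    # Split on line breaks only, so fields that contain spaces stay intact
    with open(source, 'r') as f:
        lines = f.read().splitlines()
    # Skip blank lines, then split each remaining line on commas
    return np.array([line.split(',') for line in lines if line])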

In [204]:
a = read_csv('files/ex3.csv')

In [205]:
a

array([['something', 'a', 'b', 'c', 'd', 'message'],
       ['one', '1', '2', '3', '4', 'NA'],
       ['two', '5', '6', '', '8', 'world'],
       ['three', '9', '10', '11', '12', 'foo']], dtype='<U9')

In [206]:
a.shape

(4, 6)

In [207]:
a[1:]

array([['one', '1', '2', '3', '4', 'NA'],
       ['two', '5', '6', '', '8', 'world'],
       ['three', '9', '10', '11', '12', 'foo']], dtype='<U9')

In [208]:
a

array([['something', 'a', 'b', 'c', 'd', 'message'],
       ['one', '1', '2', '3', '4', 'NA'],
       ['two', '5', '6', '', '8', 'world'],
       ['three', '9', '10', '11', '12', 'foo']], dtype='<U9')
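<p>Splitting on commas by hand breaks as soon as a field contains a quoted comma. A sketch comparing that approach with the standard library's <b>csv</b> module; the sample text is made up:</p>

```python
import csv
from io import StringIO

text = 'name,comment\nfoo,"hello, world"\nbar,plain\n'   # made-up sample

naive = [line.split(',') for line in text.splitlines()]   # splits inside quotes
proper = list(csv.reader(StringIO(text)))                 # honors the quoting
```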

### Data Cleaning and Preparation
* String Manipulation
* Handling Missing Data
* Data Transformation

In [209]:
from numpy import nan as Na

In [210]:
data = pd.Series([1, Na, 3.5, Na, 7])

In [211]:
data.isnull()

0    False
1     True
2    False
3     True
4    False
dtype: bool

In [212]:
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

#### This is equivalent to

In [213]:
data[data.notnull()]

0    1.0
2    3.5
4    7.0
dtype: float64

#### In DataFrames

In [214]:
data = pd.DataFrame([[1., 6.5, 3.], [1., Na, Na], [Na, Na, Na], [Na, 6.5, 3.]])

In [215]:
cleaned = data.dropna()

In [216]:
cleaned

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


In [217]:
data.dropna(how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


In [218]:
data[4] = Na

In [219]:
data

Unnamed: 0,0,1,2,4
0,1.0,6.5,3.0,
1,1.0,,,
2,,,,
3,,6.5,3.0,


In [220]:
data.dropna(axis=1, how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [221]:
df = pd.DataFrame(np.random.randn(7, 3))
df.iloc[:4, 1] = Na
df.iloc[:2, 2] = Na

In [222]:
df.dropna()

Unnamed: 0,0,1,2
4,-0.126109,-0.33328,-1.334259
5,-0.784273,0.383049,-0.655363
6,0.351592,0.514875,1.866167


In [223]:
df.dropna(thresh=2)

Unnamed: 0,0,1,2
2,0.142707,,-1.301921
3,0.263243,,1.411129
4,-0.126109,-0.33328,-1.334259
5,-0.784273,0.383049,-0.655363
6,0.351592,0.514875,1.866167
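The frame above is random, so here is a sketch of `thresh` on fixed, made-up values: it keeps rows with at least that many non-missing entries.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.0, np.nan, np.nan],    # 1 non-missing value
                   [2.0, np.nan, 5.0],       # 2 non-missing values
                   [3.0, 4.0, 6.0]])         # 3 non-missing values

any_missing = df.dropna()             # keeps only complete rows
at_least_two = df.dropna(thresh=2)    # keeps rows with >= 2 non-missing values
```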


### Filling In Missing Data

In [224]:
df.fillna(2)

Unnamed: 0,0,1,2
0,1.290571,2.0,2.0
1,-0.880534,2.0,2.0
2,0.142707,2.0,-1.301921
3,0.263243,2.0,1.411129
4,-0.126109,-0.33328,-1.334259
5,-0.784273,0.383049,-0.655363
6,0.351592,0.514875,1.866167


In [225]:
df.fillna({1:.5, 2:0}) # for each column

Unnamed: 0,0,1,2
0,1.290571,0.5,0.0
1,-0.880534,0.5,0.0
2,0.142707,0.5,-1.301921
3,0.263243,0.5,1.411129
4,-0.126109,-0.33328,-1.334259
5,-0.784273,0.383049,-0.655363
6,0.351592,0.514875,1.866167


In [226]:
df.fillna(df.mean())

Unnamed: 0,0,1,2
0,1.290571,0.188215,-0.002849
1,-0.880534,0.188215,-0.002849
2,0.142707,0.188215,-1.301921
3,0.263243,0.188215,1.411129
4,-0.126109,-0.33328,-1.334259
5,-0.784273,0.383049,-0.655363
6,0.351592,0.514875,1.866167


In [227]:
df.fillna(0, inplace=True)  # modifies df in place and returns None
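Besides a constant, gaps can be filled from neighboring values. A sketch on a made-up Series, using `ffill` with and without a `limit`:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

constant = s.fillna(0)             # returns a new Series; `s` is unchanged
carried = s.ffill()                # propagate the last valid value forward
carried_once = s.ffill(limit=1)    # fill at most one value per gap
```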

### Data Transformation

### Removing Duplicates

In [228]:
data = pd.DataFrame(
    {"k1": ["one", "two"] * 3 + ["two"], "k2": [1, 1, 2, 3, 3, 4, 4]}
)

In [229]:
data.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

In [230]:
data.drop_duplicates()

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4


In [231]:
data["v1"] = range(7)

In [232]:
data.drop_duplicates(["k1"])

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1


In [233]:
data.drop_duplicates(["k1", "k2"])

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1
2,one,2,2
3,two,3,3
4,one,3,4
5,two,4,5


* The *keep="last"* argument keeps the last occurrence and marks the earlier ones as duplicates


In [234]:
data.duplicated(["k1", "k2"], keep="last")

0    False
1    False
2    False
3    False
4    False
5     True
6    False
dtype: bool
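A sketch of the three `keep` settings side by side on a small made-up frame:

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one', 'two', 'two'], 'k2': [1, 4, 4]})

first = data.duplicated()               # default keep='first'
last = data.duplicated(keep='last')     # the last occurrence is the original
every = data.duplicated(keep=False)     # flag every member of a duplicate group
```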

#### Transforming Data Using a Function or Mapping

In [235]:
data = pd.DataFrame(
    {
        "food": [
            "bacon",
            "pulled pork",
            "bacon",
            "Pastrami",
            "corned beef",
            "Bacon",
            "pastrami",
            "honey ham",
            "nova lox",
        ],
        "ounces": [4, 3, 12, 6, 7.5, 8, 3, 5, 6],
    }
)

In [236]:
meat_to_animal = {
    "bacon": "pig",
    "pulled pork": "pig",
    "pastrami": "cow",
    "corned beef": "cow",
    "honey ham": "pig",
    "nova lox": "salmon",
}

In [237]:
lowercased = data["food"].str.lower()

In [238]:
lowercased.map(meat_to_animal)

0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object

In [239]:
data["animal"] = lowercased.map(meat_to_animal)
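`map` accepts a function as well as a dict, so the lowercasing step can be folded into one call. A sketch on a shortened, made-up list; 'tofu' is included to show what happens to a value with no mapping:

```python
import pandas as pd

food = pd.Series(['bacon', 'Pastrami', 'nova lox', 'tofu'])
meat_to_animal = {'bacon': 'pig', 'pastrami': 'cow', 'nova lox': 'salmon'}

via_dict = food.str.lower().map(meat_to_animal)
via_func = food.map(lambda x: meat_to_animal.get(x.lower()))
```

Values missing from the dict come back as missing values in both versions.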

#### Replacing Values

In [240]:
data = pd.Series([1.0, -999.0, 2.0, -999.0, -1000.0, 3.0])

In [241]:
data.replace(-999, Na)

0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

In [242]:
data.replace([-999, -1000], np.nan)

0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64

In [243]:
data.replace({-999: np.nan, -1000: 0})

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

#### Renaming Axis Indexes

In [244]:
data = pd.DataFrame(
    np.arange(12).reshape((3, 4)),
    index=["Ohio", "Colorado", "New York"],
    columns=["one", "two", "three", "four"],
)

In [245]:
def transform(x):
    return x[:4].upper()

In [246]:
data.index.map(transform)

Index(['OHIO', 'COLO', 'NEW '], dtype='object')

In [247]:
data.index = data.index.map(transform)

In [248]:
data.rename(index=str.title, columns=str.upper)

Unnamed: 0,ONE,TWO,THREE,FOUR
Ohio,0,1,2,3
Colo,4,5,6,7
New,8,9,10,11


In [249]:
data.rename(index={"OHIO": "INDIANA"}, columns={"three": "peekaboo"})

Unnamed: 0,one,two,peekaboo,four
INDIANA,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


#### Discretization and Binning

In [250]:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]

In [251]:
cats = pd.cut(ages, bins)

In [252]:
cats.codes

array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)

In [253]:
cats.categories

IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]]
              closed='right',
              dtype='interval[int64]')

In [254]:
pd.value_counts(cats)

(18, 25]     5
(35, 60]     3
(25, 35]     3
(60, 100]    1
dtype: int64

In [255]:
pd.cut(ages, [18, 26, 36, 61, 100], right=False) 

[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, interval[int64]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]

In [265]:
group_names = ["Youth", "YoungAdult", "MiddleAged", "Senior"]

In [266]:
pd.cut(ages, bins, labels=group_names)

[Youth, Youth, Youth, YoungAdult, Youth, ..., YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult]
Length: 12
Categories (4, object): [Youth < YoungAdult < MiddleAged < Senior]

In [281]:
data = np.random.randn(1000)
cats = pd.qcut(data, 4)

In [284]:
pd.value_counts(cats)

(-3.5789999999999997, -0.626]    250
(-0.626, 0.026]                  250
(0.026, 0.641]                   250
(0.641, 3.396]                   250
dtype: int64
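
`qcut` bins on sample quantiles, so an integer argument gives roughly equal-sized bins. It can also take explicit quantiles between 0 and 1. The sketch below assumes a seeded NumPy generator (the seed is arbitrary) so the run is repeatable, and the chosen quantiles are only an example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(12345)  # arbitrary seed, for repeatability only
data = rng.standard_normal(1000)

# Cut at the 10th, 50th, and 90th percentiles instead of equal quartiles.
cats = pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.0])

# Count members per bin, listed in bin order.
counts = pd.Series(cats).value_counts().sort_index()
print(counts.tolist())  # [100, 400, 400, 100]
```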

#### Detecting and Filtering Outliers

In [286]:
data = pd.DataFrame(np.random.randn(1000, 4))

In [287]:
data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,0.020428,-0.028721,0.018206,-0.030178
std,1.015222,0.982182,1.012815,1.000794
min,-2.924245,-3.320279,-2.89496,-2.841583
25%,-0.649611,-0.649357,-0.684554,-0.741586
50%,0.042227,-0.001331,-0.011572,-0.019379
75%,0.697068,0.644401,0.709733,0.656436
max,3.413568,2.432044,3.272521,3.157746


In [288]:
col = data[2]

In [289]:
col[np.abs(col) > 3]

73    3.272521
Name: 2, dtype: float64

In [293]:
data[(np.abs(data) > 3).any(axis=1)]

Unnamed: 0,0,1,2,3
73,2.265819,-0.165681,3.272521,1.029614
134,1.020441,1.953661,-0.965628,3.157746
490,3.413568,-0.343721,-0.469895,0.260454
730,-0.047572,-1.037958,-0.733609,3.060179
860,-0.202091,-3.320279,-1.518169,1.137886


In [294]:
data[np.abs(data) > 3] = np.sign(data) * 3
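
The capping step above can also be written with `DataFrame.clip`. The following sketch, which assumes a seeded generator in place of the notebook's random data, checks that both approaches give the same result.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability only
data = pd.DataFrame(rng.standard_normal((1000, 4)))

# Approach from the text: overwrite values outside [-3, 3] with -3 or 3.
capped = data.copy()
capped[np.abs(capped) > 3] = np.sign(capped) * 3

# Alternative: clip every value to the interval [-3, 3].
clipped = data.clip(lower=-3, upper=3)

same_result = capped.equals(clipped)
largest_abs = clipped.abs().max().max()
print(same_result, largest_abs <= 3)
```

`clip` returns a new object and leaves `data` untouched, whereas the boolean-mask assignment modifies the frame in place.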

#### Permutation and Random Sampling
* Permuting (randomly reordering) a Series or the rows in a DataFrame is easy to do using the `numpy.random.permutation` function. Calling `permutation` with the length of the axis you want to permute produces an array of integers indicating the new ordering:

In [295]:
df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))

In [296]:
sampler = np.random.permutation(5)

In [298]:
df.take(sampler)

Unnamed: 0,0,1,2,3
3,12,13,14,15
2,8,9,10,11
0,0,1,2,3
1,4,5,6,7
4,16,17,18,19


In [299]:
df.sample(n=3)

Unnamed: 0,0,1,2,3
1,4,5,6,7
3,12,13,14,15
2,8,9,10,11
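
A full shuffle of the rows can be written either way: with a permutation array passed to `take`, or with `sample(frac=1)`. This is a sketch under the assumption that a fixed seed is wanted for repeatability; the seed value and variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))

# Permute the integer row positions, then take rows in that order.
rng = np.random.default_rng(42)
sampler = rng.permutation(len(df))
shuffled_take = df.take(sampler)

# Sample all rows without replacement; random_state fixes the draw.
shuffled_sample = df.sample(frac=1, random_state=42)

# Both results contain every original row exactly once.
print(sorted(shuffled_take.index) == list(df.index))
print(sorted(shuffled_sample.index) == list(df.index))
```

The two calls use different random streams, so the row orders generally differ even with the same seed value.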


In [300]:
choices = pd.Series([5, 7, -1, 6, 4])

In [301]:
draws = choices.sample(n=10, replace=True)

In [302]:
draws

2   -1
2   -1
0    5
2   -1
0    5
2   -1
3    6
0    5
2   -1
1    7
dtype: int64

In [303]:
type(draws)

pandas.core.series.Series

### String Manipulation

In [304]:
val = "a,b, guido"

In [306]:
val.split(",")

['a', 'b', ' guido']

In [307]:
pieces = [x.strip() for x in val.split(",")]

In [308]:
first, second, third = pieces

In [309]:
first + "::" + second + "::" + third

'a::b::guido'

In [310]:
"::".join(pieces)

'a::b::guido'

In [311]:
"guido" in pieces

True

In [312]:
val.index(",")

1

In [313]:
val.find(":")

-1

In [314]:
val.index(";")

ValueError: substring not found

In [315]:
val.count(",")

2

In [316]:
val.replace(",", "::")

'a::b:: guido'

In [317]:
val.replace(",", "")

'ab guido'

<br>
<h3>Data Wrangling: Clean, Transform, Merge, Reshape</h3><br>

<h3>Combining and Merging Data Sets</h3>
<ul>
    <li><b>pandas.merge</b> : connects rows in DataFrames based on one or more keys. This will be familiar to users of SQL or other relational databases, as it implements database join operations</li>
    <li><b>pandas.concat</b> : glues or stacks together objects along an axis</li>
    <li><b>combine_first</b> : an instance method that splices together overlapping data, filling in missing values in one object with values from another</li>
</ul><br>
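
As a quick illustration of the last two entries in the list above, the sketch below stacks two Series with `pd.concat` and patches missing values with `combine_first`. The Series and their values are made up for this example.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([0, 1], index=["a", "b"])
s2 = pd.Series([2, 3], index=["c", "d"])

# concat stacks objects along an axis (rows by default).
stacked = pd.concat([s1, s2])
print(stacked.tolist())  # [0, 1, 2, 3]

a = pd.Series([np.nan, 2.5, np.nan], index=["x", "y", "z"])
b = pd.Series([1.0, 9.9, 3.0], index=["x", "y", "z"])

# combine_first keeps a's values and fills only its gaps from b.
patched = a.combine_first(b)
print(patched.tolist())  # [1.0, 2.5, 3.0]
```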

<h3>Database-style DataFrame Merges</h3>
<ul>
    <li>Merge or join operations combine data sets by linking rows using one or more keys. These operations are central to relational databases. The <b>merge</b> function in pandas is the main entry point for using these algorithms on your data.</li>
</ul><br>

In [256]:
df1 = pd.DataFrame(
    {"key": ["b", "b", "a", "c", "a", "a", "b"], "data1": range(7)}
)

df2 = pd.DataFrame({"key": ["a", "b", "d"], "data2": range(3)})

In [257]:
pd.merge(df1, df2)

Unnamed: 0,key,data1,data2
0,b,0,1
1,b,1,1
2,b,6,1
3,a,2,0
4,a,4,0
5,a,5,0


In [258]:
pd.merge(df1, df2, on='key')

Unnamed: 0,key,data1,data2
0,b,0,1
1,b,1,1
2,b,6,1
3,a,2,0
4,a,4,0
5,a,5,0


In [259]:
df3 = pd.DataFrame(
    {"lkey": ["b", "b", "a", "c", "a", "a", "b"], "data1": range(7)}
)

df4 = pd.DataFrame({"rkey": ["a", "b", "d"], "data2": range(3)})

In [260]:
pd.merge(df3, df4, left_on='lkey', right_on='rkey')

Unnamed: 0,lkey,data1,rkey,data2
0,b,0,b,1
1,b,1,b,1
2,b,6,b,1
3,a,2,a,0
4,a,4,a,0
5,a,5,a,0


In [261]:
pd.merge(df1, df2, how='outer')

Unnamed: 0,key,data1,data2
0,b,0.0,1.0
1,b,1.0,1.0
2,b,6.0,1.0
3,a,2.0,0.0
4,a,4.0,0.0
5,a,5.0,0.0
6,c,3.0,
7,d,,2.0
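
The `how` argument selects the join type. The sketch below runs the four standard options on the same `df1` and `df2` and records how many rows each produces; the `row_counts` dict is only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame(
    {"key": ["b", "b", "a", "c", "a", "a", "b"], "data1": range(7)}
)
df2 = pd.DataFrame({"key": ["a", "b", "d"], "data2": range(3)})

row_counts = {}
for how in ["inner", "left", "right", "outer"]:
    merged = pd.merge(df1, df2, on="key", how=how)
    row_counts[how] = len(merged)

print(row_counts)  # {'inner': 6, 'left': 7, 'right': 7, 'outer': 8}
```

The default `inner` join keeps only keys found in both frames (`a` and `b`). `left` adds the unmatched `c` row from `df1`, `right` adds the unmatched `d` row from `df2`, and `outer` keeps both.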


In [262]:
left = pd.DataFrame(
    {
        "key1": ["foo", "foo", "bar"],
        "key2": ["one", "two", "one"],
        "lval": [1, 2, 3],
    }
)

In [263]:
right = pd.DataFrame(
    {
        "key1": ["foo", "foo", "bar", "bar"],
        "key2": ["one", "one", "one", "two"],
        "rval": [4, 5, 6, 7],
    }
)

In [264]:
pd.merge(left, right, on=['key1', 'key2'], how='outer')

Unnamed: 0,key1,key2,lval,rval
0,foo,one,1.0,4.0
1,foo,one,1.0,5.0
2,foo,two,2.0,
3,bar,one,3.0,6.0
4,bar,two,,7.0
