# 5 Getting Started with pandas
pandas will be a major tool of interest throughout much of the rest of the book. It contains data structures and data manipulation tools designed to make data cleaning and analysis fast and convenient in Python. pandas is often used in tandem with numerical computing tools like NumPy and SciPy, analytical libraries like statsmodels and scikit-learn, and data visualization libraries like matplotlib. pandas adopts significant parts of NumPy's idiomatic style of array-based computing, especially array-based functions and a preference for data processing without for loops.

While pandas adopts many coding idioms from NumPy, the biggest difference is that pandas is **designed for working with tabular or heterogeneous data.** NumPy, by contrast, is best suited for working with **homogeneously typed numerical array data.**
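The contrast can be seen in a small sketch. The sample values and column names below are made up for illustration; the point is only that a NumPy array settles on one dtype for all elements, while a DataFrame keeps a separate dtype per column.

```python
import numpy as np
import pandas as pd

# A NumPy array holds a single dtype, so mixing numbers and text
# forces every element into one common type (a Unicode string dtype here).
arr = np.array([1, 2.5, "three"])
print(arr.dtype.kind)  # "U" stands for Unicode string

# A DataFrame stores each column separately, so every column keeps its own dtype.
df = pd.DataFrame({"count": [1, 2], "price": [2.5, 3.0], "label": ["a", "b"]})
print(df.dtypes)
```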

Since becoming an open source project in 2010, pandas has matured into a quite large library that's applicable in a broad set of real-world use cases. The developer community has grown to over 2,500 distinct contributors, who've been helping build the project as they used it to solve their day-to-day data problems. The vibrant pandas developer and user communities have been a key part of its success.


In [1]:
import pandas as pd

## 1 - Series
A Series is a one-dimensional array-like object containing a sequence of values of a single type (similar to the NumPy types) and an associated array of data labels, called its index. The simplest Series is formed from only an array of data:

In [2]:
obj = pd.Series([4, 7, -5, 3])

In [3]:
obj

0    4
1    7
2   -5
3    3
dtype: int64

In [4]:
obj.array

<NumpyExtensionArray>
[4, 7, -5, 3]
Length: 4, dtype: int64

In [5]:
obj.index

RangeIndex(start=0, stop=4, step=1)

In [6]:
obj2 = pd.Series([4,7,-5,3], index=["a","b","c","d"])


In [7]:
obj2

a    4
b    7
c   -5
d    3
dtype: int64

In [8]:
obj2.array

<NumpyExtensionArray>
[4, 7, -5, 3]
Length: 4, dtype: int64

In [9]:
obj2.index

Index(['a', 'b', 'c', 'd'], dtype='object')

In [10]:
obj2["a"]

4

In [11]:
obj2[["c","a","d"]]

c   -5
a    4
d    3
dtype: int64

In [12]:
obj2[obj2 > 0]

a    4
b    7
d    3
dtype: int64

In [13]:
obj3 = obj2[obj2 > 0]

In [14]:
obj3

a    4
b    7
d    3
dtype: int64

In [15]:
obj3["a"] = 10

In [16]:
obj2

a    4
b    7
c   -5
d    3
dtype: int64

In [17]:
obj2 * 2

a     8
b    14
c   -10
d     6
dtype: int64

In [18]:
"b" in obj2

True

In [19]:
"e" in obj2

False

In [20]:
sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}

obj4 = pd.Series(sdata)

obj4



Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

In [21]:
obj4.array

<NumpyExtensionArray>
[35000, 71000, 16000, 5000]
Length: 4, dtype: int64

In [22]:
obj4.to_dict()

{'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

In [23]:
type(obj4)

pandas.core.series.Series

In [24]:
states = ["California", "Ohio", "Oregon", "Texas"]

obj5 = pd.Series(sdata, index=states)

obj5

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [25]:
pd.isna(obj5)  # Flags the missing (NaN) entries

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [26]:
pd.notna(obj5)

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

In [27]:
obj4

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

In [28]:
obj5

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [29]:
obj4 + obj5

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64
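The result above has NaN for every label that appears in only one of the two Series, because arithmetic aligns on index labels first. A minimal sketch of one way to avoid those gaps, assuming it is acceptable to treat a missing label as 0 (the small Series here are illustrative, not the notebook's `obj4` and `obj5`):

```python
import pandas as pd

a = pd.Series({"Ohio": 35000, "Texas": 71000, "Utah": 5000})
b = pd.Series({"California": 1000, "Ohio": 35000, "Texas": 71000})

# The + operator aligns on labels; a label found in only one operand gives NaN.
plain = a + b

# Series.add with fill_value treats a label missing from one side as 0.
filled = a.add(b, fill_value=0)

print(plain)
print(filled)
```

Whether 0 is the right substitute depends on the data; for counts it is often reasonable, for measurements it may hide a real gap.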

In [30]:
obj4.name = "population"

In [31]:
obj4.index.name = "state"

In [32]:
obj4

state
Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
Name: population, dtype: int64

In [33]:
obj

0    4
1    7
2   -5
3    3
dtype: int64

In [34]:
obj.index = ["Bob", "Steve", "Jeff", "Ryan"]

In [35]:
obj

Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64

## 2 - DataFrame
A DataFrame represents a rectangular table of data and contains an ordered, named collection of columns, each of which can be a different value type (numeric, string, Boolean, etc.). The DataFrame has both a row and column index; it can be thought of as a dictionary of Series all sharing the same index.
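The "dictionary of Series sharing the same index" view can be checked directly. This is a small sketch with made-up values; the names `pop_ser` and `area_ser` are hypothetical.

```python
import pandas as pd

# Two Series whose labels only partly overlap (values are illustrative).
pop_ser = pd.Series({"Ohio": 11.8, "Texas": 29.5})
area_ser = pd.Series({"Texas": 268.6, "Utah": 84.9})

# Building a DataFrame from a dict of Series takes the union of the
# indexes; positions without data are filled with NaN.
table = pd.DataFrame({"pop": pop_ser, "area": area_ser})

# Each column comes back as a Series that shares the frame's row index.
col = table["pop"]
print(table)
print(col.index.equals(table.index))
```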

In [36]:
data = {"state": ["Ohio", "Ohio","Ohio","Nevada", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

In [37]:
frame = pd.DataFrame(data)

In [38]:
frame

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Nevada,2003,3.2


In [39]:
frame.head()  # Shows only the first five rows

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9


In [40]:
frame.tail()  # Shows only the last five rows

Unnamed: 0,state,year,pop
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Nevada,2003,3.2


In [41]:
pd.DataFrame(data, columns=["year", "state", "pop"])

Unnamed: 0,year,state,pop
0,2000,Ohio,1.5
1,2001,Ohio,1.7
2,2002,Ohio,3.6
3,2001,Nevada,2.4
4,2002,Nevada,2.9
5,2003,Nevada,3.2


In [42]:
frame2 = pd.DataFrame(data, columns=["year", "state", "pop", "debt"])

In [43]:
frame2

Unnamed: 0,year,state,pop,debt
0,2000,Ohio,1.5,
1,2001,Ohio,1.7,
2,2002,Ohio,3.6,
3,2001,Nevada,2.4,
4,2002,Nevada,2.9,
5,2003,Nevada,3.2,


In [44]:
frame2["state"]

0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
5    Nevada
Name: state, dtype: object

In [45]:
frame2.year

0    2000
1    2001
2    2002
3    2001
4    2002
5    2003
Name: year, dtype: int64

In [46]:
frame2

Unnamed: 0,year,state,pop,debt
0,2000,Ohio,1.5,
1,2001,Ohio,1.7,
2,2002,Ohio,3.6,
3,2001,Nevada,2.4,
4,2002,Nevada,2.9,
5,2003,Nevada,3.2,


In [47]:
frame2.loc[1]

year     2001
state    Ohio
pop       1.7
debt      NaN
Name: 1, dtype: object

In [48]:
frame2.iloc[2]

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: 2, dtype: object

In [49]:
frame2.debt = 16.5

In [50]:
frame2

Unnamed: 0,year,state,pop,debt
0,2000,Ohio,1.5,16.5
1,2001,Ohio,1.7,16.5
2,2002,Ohio,3.6,16.5
3,2001,Nevada,2.4,16.5
4,2002,Nevada,2.9,16.5
5,2003,Nevada,3.2,16.5


In [51]:
import numpy as np
frame2["debt"] += np.arange(6.)

In [52]:
frame2

Unnamed: 0,year,state,pop,debt
0,2000,Ohio,1.5,16.5
1,2001,Ohio,1.7,17.5
2,2002,Ohio,3.6,18.5
3,2001,Nevada,2.4,19.5
4,2002,Nevada,2.9,20.5
5,2003,Nevada,3.2,21.5


In [53]:
val = pd.Series([-1.2, -1.5, -1.7], index=[2,4,5])

In [54]:
val

2   -1.2
4   -1.5
5   -1.7
dtype: float64

In [55]:
frame2.debt += val

In [56]:
frame2

Unnamed: 0,year,state,pop,debt
0,2000,Ohio,1.5,
1,2001,Ohio,1.7,
2,2002,Ohio,3.6,17.3
3,2001,Nevada,2.4,
4,2002,Nevada,2.9,19.0
5,2003,Nevada,3.2,19.8


In [57]:
frame2["eastern"] = frame2["state"] == "Ohio"

In [58]:
frame2
# New columns cannot be created with the frame2.eastern dot attribute notation

Unnamed: 0,year,state,pop,debt,eastern
0,2000,Ohio,1.5,,True
1,2001,Ohio,1.7,,True
2,2002,Ohio,3.6,17.3,True
3,2001,Nevada,2.4,,False
4,2002,Nevada,2.9,19.0,False
5,2003,Nevada,3.2,19.8,False
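The note about dot-attribute notation can be verified with a short sketch. The two-row frame is illustrative; the warning is silenced here only to keep the output clean.

```python
import warnings
import pandas as pd

frame = pd.DataFrame({"state": ["Ohio", "Nevada"], "year": [2000, 2001]})

# Assigning to a new attribute name sets a plain Python attribute;
# pandas emits a UserWarning and no column is created.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    frame.eastern = frame["state"] == "Ohio"
created_by_attribute = "eastern" in frame.columns

# Bracket assignment is what actually creates the column.
frame["eastern"] = frame["state"] == "Ohio"
created_by_brackets = "eastern" in frame.columns

print(created_by_attribute, created_by_brackets)
```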


In [59]:
del frame2["eastern"]

In [60]:
frame2

Unnamed: 0,year,state,pop,debt
0,2000,Ohio,1.5,
1,2001,Ohio,1.7,
2,2002,Ohio,3.6,17.3
3,2001,Nevada,2.4,
4,2002,Nevada,2.9,19.0
5,2003,Nevada,3.2,19.8


In [61]:
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

In [62]:
populations = {"Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6}, 
                "Nevada": {2001: 2.4, 2002: 2.9}}

In [63]:
frame3 = pd.DataFrame(populations)

In [64]:
frame3

Unnamed: 0,Ohio,Nevada
2000,1.5,
2001,1.7,2.4
2002,3.6,2.9


In [65]:
frame3.T

Unnamed: 0,2000,2001,2002
Ohio,1.5,1.7,3.6
Nevada,,2.4,2.9


In [66]:
pd.DataFrame(populations, index=[2001,2002,2003])

Unnamed: 0,Ohio,Nevada
2001,1.7,2.4
2002,3.6,2.9
2003,,


In [67]:
pdata = {"Ohio": frame3["Ohio"][:-1],
        "Nevada": frame3["Nevada"][:2]}

In [68]:
pd.DataFrame(pdata)

Unnamed: 0,Ohio,Nevada
2000,1.5,
2001,1.7,2.4


In [69]:
frame3

Unnamed: 0,Ohio,Nevada
2000,1.5,
2001,1.7,2.4
2002,3.6,2.9


In [70]:
frame3.index.name = "year"
frame3.columns.name = "state"

In [71]:
frame3

state,Ohio,Nevada
year,,
2000,1.5,
2001,1.7,2.4
2002,3.6,2.9


In [72]:
frame3.to_numpy()

array([[1.5, nan],
       [1.7, 2.4],
       [3.6, 2.9]])

In [73]:
frame2.to_numpy()

array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, nan],
       [2002, 'Ohio', 3.6, 17.3],
       [2001, 'Nevada', 2.4, nan],
       [2002, 'Nevada', 2.9, 19.0],
       [2003, 'Nevada', 3.2, 19.8]], dtype=object)

## 3 - Index Objects
pandas’s Index objects are responsible for holding the axis labels (including a DataFrame's column names) and other metadata (like the axis name or names). Any array or other sequence of labels you use when constructing a Series or DataFrame is internally converted to an Index:

In [74]:
obj = pd.Series(np.arange(3), index=["a","b","c"])

In [75]:
index = obj.index

In [76]:
index

Index(['a', 'b', 'c'], dtype='object')

In [77]:
list(index[1:])

['b', 'c']

In [78]:
#index[1] = "d" # since indexes are immutable it will not work
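A runnable version of the commented-out line, wrapped in `try`/`except` so the error can be observed without stopping the notebook. The list comprehension at the end is just one possible way to get a modified copy.

```python
import pandas as pd

index = pd.Index(["a", "b", "c"])

# Item assignment on an Index raises TypeError because Index objects are immutable.
raised = False
try:
    index[1] = "d"
except TypeError as exc:
    raised = True
    print(f"TypeError: {exc}")

# To change a label, build a new Index; the original stays untouched.
new_index = pd.Index(["d" if label == "b" else label for label in index])
print(list(index), list(new_index))
```

Immutability is what makes it safe to share one Index object among several data structures.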

In [79]:
labels = pd.Index(np.arange(3))
labels

Index([0, 1, 2], dtype='int64')

In [80]:
obj2 = pd.Series([1.5,-2.5,0], index=labels)

In [81]:
obj2

0    1.5
1   -2.5
2    0.0
dtype: float64

In [82]:
obj2.index is labels

True

In [83]:
frame3

state,Ohio,Nevada
year,,
2000,1.5,
2001,1.7,2.4
2002,3.6,2.9


In [84]:
frame3.columns

Index(['Ohio', 'Nevada'], dtype='object', name='state')

In [85]:
"Ohio" in frame3.columns

True

In [86]:
frame3.index

Index([2000, 2001, 2002], dtype='int64', name='year')

In [87]:
2003 in frame3.index

False

In [88]:
pd.Index(["foo", "foo", "bar","bar"])

Index(['foo', 'foo', 'bar', 'bar'], dtype='object')

# 5.2 Essential Functionality

## 1 - Reindexing
An important method on pandas objects is reindex, which means to create a new object with the values rearranged to align with the new index. Consider an example:

In [89]:
obj = pd.Series([4.5,7.2,-5.3,3.6], index=["d", "b", "a", "c"])

In [90]:
obj.name = "awe"
obj.index.name = "index"
obj

index
d    4.5
b    7.2
a   -5.3
c    3.6
Name: awe, dtype: float64

In [91]:
obj2 = obj.reindex(["a","b","c","d","e"])

In [92]:
obj2

index
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
Name: awe, dtype: float64
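Two details of `reindex` are worth checking in isolation: it returns a new object, and the value used for labels that were not present can be chosen with `fill_value`. A minimal sketch with a shortened, illustrative Series:

```python
import pandas as pd

obj = pd.Series([4.5, 7.2, -5.3], index=["d", "b", "a"])

# reindex returns a new Series; labels absent from the original get NaN by default.
with_nan = obj.reindex(["a", "b", "c", "d"])

# fill_value substitutes a chosen value for the labels that were missing.
with_zero = obj.reindex(["a", "b", "c", "d"], fill_value=0.0)

print(with_nan)
print(with_zero)
print(list(obj.index))  # the original keeps its own order and labels
```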

In [93]:
obj3 = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])

In [94]:
obj3.reindex(np.arange(6), method="ffill")

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

In [95]:
frame = pd.DataFrame(np.arange(9).reshape((3,3)), index=["a","b","c"], columns=["Ohio","Texas","California"])

In [96]:
frame

Unnamed: 0,Ohio,Texas,California
a,0,1,2
b,3,4,5
c,6,7,8


In [97]:
frame2 = frame.reindex(index=["a","b","c","d"])

In [98]:
frame2

Unnamed: 0,Ohio,Texas,California
a,0.0,1.0,2.0
b,3.0,4.0,5.0
c,6.0,7.0,8.0
d,,,


In [99]:
states = ["Texas", "Utah", "California"]
frame.reindex(columns=states)

Unnamed: 0,Texas,Utah,California
a,1,,2
b,4,,5
c,7,,8


In [100]:
frame.reindex(states, axis="columns")

Unnamed: 0,Texas,Utah,California
a,1,,2
b,4,,5
c,7,,8


In [101]:
frame.loc[["a","b","c"],["California", "Texas"]]

Unnamed: 0,California,Texas
a,2,1
b,5,4
c,8,7


In [102]:
obj = pd.Series(np.arange(5.), index=["a","b","c", "d", "e"])

In [103]:
obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [104]:
new_obj = obj.drop("c")

In [105]:
obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [106]:
new_obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

In [107]:
data = pd.DataFrame(np.arange(16).reshape((4,4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two","three","four"])

In [108]:
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [109]:
data.drop(index=["Ohio", "Colorado"])

Unnamed: 0,one,two,three,four
Utah,8,9,10,11
New York,12,13,14,15


In [110]:
data.drop(columns=["one", "three"])

Unnamed: 0,two,four
Ohio,1,3
Colorado,5,7
Utah,9,11
New York,13,15


## 2 - Indexing, Selection, and Filtering
Series indexing (`obj[...]`) works analogously to NumPy array indexing, except you can use the Series's index values instead of only integers. Here are some examples of this:

In [111]:
obj = pd.Series(np.arange(4.), index=["a","b","c","d"])

In [112]:
obj.index.name = "index"
obj.name = "Standart numpy array"

In [113]:
obj

index
a    0.0
b    1.0
c    2.0
d    3.0
Name: Standart numpy array, dtype: float64

In [114]:
obj["b"]

1.0

In [115]:
obj.iloc[1]

1.0

In [116]:
obj[2:4]

index
c    2.0
d    3.0
Name: Standart numpy array, dtype: float64

In [117]:
obj[["b","a","d"]]

index
b    1.0
a    0.0
d    3.0
Name: Standart numpy array, dtype: float64

In [118]:
obj.iloc[[1,3]]

index
b    1.0
d    3.0
Name: Standart numpy array, dtype: float64

In [119]:
obj[obj < 2]

index
a    0.0
b    1.0
Name: Standart numpy array, dtype: float64

In [120]:
obj

index
a    0.0
b    1.0
c    2.0
d    3.0
Name: Standart numpy array, dtype: float64

In [121]:
list(obj)

[0.0, 1.0, 2.0, 3.0]

In [122]:
obj.loc[["b","a","d"]]

index
b    1.0
a    0.0
d    3.0
Name: Standart numpy array, dtype: float64

In [123]:
obj1 = pd.Series([1,2,3], index=[2,0,1])

In [124]:
obj2 = pd.Series([1,2,3], index=["a","b","c"])
obj1

2    1
0    2
1    3
dtype: int64

In [125]:
obj2

a    1
b    2
c    3
dtype: int64

In [126]:
obj1[[0,1,2]]

0    2
1    3
2    1
dtype: int64

In [127]:
obj2.iloc[[0,1,2]]

a    1
b    2
c    3
dtype: int64
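The last two cells show why plain `[]` is ambiguous when the index itself holds integers: the key could mean a label or a position. A small sketch (values chosen so the two readings give different answers) of how `loc` and `iloc` remove the ambiguity:

```python
import pandas as pd

ser = pd.Series([10, 20, 30], index=[2, 0, 1])

# loc always interprets the key as an index label.
by_label = ser.loc[0]      # label 0 holds 20

# iloc always interprets the key as an integer position.
by_position = ser.iloc[0]  # position 0 holds 10

print(by_label, by_position)
```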

In [128]:
index = ["Ohio", "Colorado", "Utah", "New York"]
column = ["one", "two","three","four"]
data = pd.DataFrame(np.arange(16).reshape((4,4)), index=index, columns=column)


In [129]:
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [130]:
data["two"]

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int64

In [131]:
data[["one","two"]]

Unnamed: 0,one,two
Ohio,0,1
Colorado,4,5
Utah,8,9
New York,12,13


In [132]:
data[:2]

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7


In [133]:
data[data["three"]>5]

Unnamed: 0,one,two,three,four
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [134]:
data < 5

Unnamed: 0,one,two,three,four
Ohio,True,True,True,True
Colorado,True,False,False,False
Utah,False,False,False,False
New York,False,False,False,False


In [135]:
data[data < 5] = 0

In [136]:
data

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


### Selection on DataFrame with loc and iloc

In [137]:
data

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [138]:
data.loc["Colorado"]

one      0
two      5
three    6
four     7
Name: Colorado, dtype: int64

In [139]:
data.iloc[1]

one      0
two      5
three    6
four     7
Name: Colorado, dtype: int64

In [140]:
data.loc[["Colorado", "New York"]]

Unnamed: 0,one,two,three,four
Colorado,0,5,6,7
New York,12,13,14,15


In [141]:
data.loc["Colorado",["two", "three"]]

two      5
three    6
Name: Colorado, dtype: int64

In [142]:
data.iloc[2]

one       8
two       9
three    10
four     11
Name: Utah, dtype: int64

In [143]:
data.iloc[[2,1], [1,2,3]]

Unnamed: 0,two,three,four
Utah,9,10,11
Colorado,5,6,7


In [144]:
data

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [145]:
data.iloc[2, [3,0,1]]

four    11
one      8
two      9
Name: Utah, dtype: int64

In [146]:
data.iloc[[1,2],[3,0,1]]

Unnamed: 0,four,one,two
Colorado,7,0,5
Utah,11,8,9


In [147]:
data.loc[:"Utah", "two"]

Ohio        0
Colorado    5
Utah        9
Name: two, dtype: int64
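Note that the `loc` slice above includes the row labeled `"Utah"`. A short sketch contrasting this with `iloc`, rebuilding the original 4x4 frame so the block runs on its own:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

# Label-based slices include the end label.
by_label = data.loc[:"Utah", "two"]

# Position-based slices exclude the end position, as in NumPy and Python lists.
by_position = data.iloc[:2, 1]

print(list(by_label.index))     # ['Ohio', 'Colorado', 'Utah']
print(list(by_position.index))  # ['Ohio', 'Colorado']
```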

In [148]:
data.loc[data["three"] > 3, ["one", "two", "three"]]

Unnamed: 0,one,two,three
Colorado,0,5,6
Utah,8,9,10
New York,12,13,14


In [149]:
data

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [150]:
data.loc[data.three >= 2]

Unnamed: 0,one,two,three,four
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


### Integer indexing pitfalls

In [151]:
ser = pd.Series(np.arange(3.))

In [152]:
ser

0    0.0
1    1.0
2    2.0
dtype: float64

In [153]:
ser2 = pd.Series(np.arange(3.), index=["a","b","c"])

In [154]:
ser2

a    0.0
b    1.0
c    2.0
dtype: float64

In [155]:
ser.iloc[-1]

2.0

In [156]:
ser2.iloc[-1]

2.0

In [157]:
data.loc[:, "one"] = 1

In [158]:
data

Unnamed: 0,one,two,three,four
Ohio,1,0,0,0
Colorado,1,5,6,7
Utah,1,9,10,11
New York,1,13,14,15


In [159]:
data.iloc[2] = 5

In [160]:
data

Unnamed: 0,one,two,three,four
Ohio,1,0,0,0
Colorado,1,5,6,7
Utah,5,5,5,5
New York,1,13,14,15


In [161]:
data.loc[data.four > 5] = 3

In [162]:
data

Unnamed: 0,one,two,three,four
Ohio,1,0,0,0
Colorado,3,3,3,3
Utah,5,5,5,5
New York,3,3,3,3


In [163]:
data

Unnamed: 0,one,two,three,four
Ohio,1,0,0,0
Colorado,3,3,3,3
Utah,5,5,5,5
New York,3,3,3,3


In [164]:
data.loc[data.three == 5, "three"] = 6

In [165]:
data

Unnamed: 0,one,two,three,four
Ohio,1,0,0,0
Colorado,3,3,3,3
Utah,5,5,6,5
New York,3,3,3,3


In [166]:
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=["a", "c", "d", "e"])

s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=["a", "c", "e", "f", "g"])



In [167]:
s1

a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64

In [168]:
s2

a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

In [169]:
s1 + s2

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

In [170]:
df1 = pd.DataFrame(np.arange(9.).reshape((3,3)), index=["Ohio", "Texas", "Colorado"], columns=["b","c","d"])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list("bde"), index=["Utah", "Ohio", "Texas", "Oregon"])



In [171]:
df1

Unnamed: 0,b,c,d
Ohio,0.0,1.0,2.0
Texas,3.0,4.0,5.0
Colorado,6.0,7.0,8.0


In [172]:
df2

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [173]:
df1 + df2  # Labels found in only one frame yield NaN, since NaN + anything is NaN

Unnamed: 0,b,c,d,e
Colorado,,,,
Ohio,3.0,,6.0,
Oregon,,,,
Texas,9.0,,12.0,
Utah,,,,


In [174]:
df1 = pd.DataFrame({"A": [1,2]})

In [175]:
df1

Unnamed: 0,A
0,1
1,2


In [176]:
df2 = pd.DataFrame({"B": [3,4]})

In [177]:
df2

Unnamed: 0,B
0,3
1,4


In [178]:
df1 + df2

Unnamed: 0,A,B
0,,
1,,


In [179]:
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list("abcd"))

df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list("abcde"))


In [180]:
df1

Unnamed: 0,a,b,c,d
0,0.0,1.0,2.0,3.0
1,4.0,5.0,6.0,7.0
2,8.0,9.0,10.0,11.0


In [181]:
df2

Unnamed: 0,a,b,c,d,e
0,0.0,1.0,2.0,3.0,4.0
1,5.0,6.0,7.0,8.0,9.0
2,10.0,11.0,12.0,13.0,14.0
3,15.0,16.0,17.0,18.0,19.0


In [182]:
df1 + df2

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,
1,9.0,11.0,13.0,15.0,
2,18.0,20.0,22.0,24.0,
3,,,,,


In [183]:
df1.add(df2, fill_value=0)

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,4.0
1,9.0,11.0,13.0,15.0,9.0
2,18.0,20.0,22.0,24.0,14.0
3,15.0,16.0,17.0,18.0,19.0


In [184]:
1 / df1

Unnamed: 0,a,b,c,d
0,inf,1.0,0.5,0.333333
1,0.25,0.2,0.166667,0.142857
2,0.125,0.111111,0.1,0.090909


In [185]:
df1.rdiv(1)

Unnamed: 0,a,b,c,d
0,inf,1.0,0.5,0.333333
1,0.25,0.2,0.166667,0.142857
2,0.125,0.111111,0.1,0.090909


In [186]:
df1.div(10)

Unnamed: 0,a,b,c,d
0,0.0,0.1,0.2,0.3
1,0.4,0.5,0.6,0.7
2,0.8,0.9,1.0,1.1


In [187]:
df1.reindex(index=df2.index,columns=df2.columns, fill_value=0) + df2

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,4.0
1,9.0,11.0,13.0,15.0,9.0
2,18.0,20.0,22.0,24.0,14.0
3,15.0,16.0,17.0,18.0,19.0


### Operations between DataFrame and Series

In [188]:
arr = np.arange(12).reshape((3,4))

In [189]:
arr

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [190]:
arr[0]

array([0, 1, 2, 3])

In [191]:
arr - arr[0]

array([[0, 0, 0, 0],
       [4, 4, 4, 4],
       [8, 8, 8, 8]])

In [192]:
frame = pd.DataFrame(np.arange(12.).reshape((4,3)), columns=list("bde"), index=["Utah", "Ohio", "Texas", "Oregon"])

In [193]:
frame

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [194]:
series = frame.iloc[0]

In [195]:
series

b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64

In [196]:
frame - series

Unnamed: 0,b,d,e
Utah,0.0,0.0,0.0
Ohio,3.0,3.0,3.0
Texas,6.0,6.0,6.0
Oregon,9.0,9.0,9.0


In [197]:
frame + series

Unnamed: 0,b,d,e
Utah,0.0,2.0,4.0
Ohio,3.0,5.0,7.0
Texas,6.0,8.0,10.0
Oregon,9.0,11.0,13.0


In [198]:
series2 = pd.Series(np.arange(3), index=["b","e","f"])

In [199]:
series2

b    0
e    1
f    2
dtype: int64

In [200]:
frame + series2

Unnamed: 0,b,d,e,f
Utah,0.0,,3.0,
Ohio,3.0,,6.0,
Texas,6.0,,9.0,
Oregon,9.0,,12.0,


In [201]:
frame

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [202]:
frame.add(series2)

Unnamed: 0,b,d,e,f
Utah,0.0,,3.0,
Ohio,3.0,,6.0,
Texas,6.0,,9.0,
Oregon,9.0,,12.0,


In [203]:
series3 = frame["d"]

In [204]:
series3

Utah       1.0
Ohio       4.0
Texas      7.0
Oregon    10.0
Name: d, dtype: float64

In [205]:
frame.sub(series3, axis="index")

Unnamed: 0,b,d,e
Utah,-1.0,0.0,1.0
Ohio,-1.0,0.0,1.0
Texas,-1.0,0.0,1.0
Oregon,-1.0,0.0,1.0


In [206]:
frame = pd.DataFrame(np.random.standard_normal((4,3)), columns=list("bde"), index=["Utah", "Ohio","Texas","Oregon"])
frame

Unnamed: 0,b,d,e
Utah,1.150201,1.451031,0.286362
Ohio,-0.879151,-2.080264,-0.181427
Texas,-0.8068,-1.056288,-0.461998
Oregon,1.652924,1.466159,-0.765236


In [207]:
np.abs(frame)

Unnamed: 0,b,d,e
Utah,1.150201,1.451031,0.286362
Ohio,0.879151,2.080264,0.181427
Texas,0.8068,1.056288,0.461998
Oregon,1.652924,1.466159,0.765236


In [208]:
def f1(x):
    return x.max() - x.min()

frame.apply(f1)

b    2.532074
d    3.546423
e    1.051598
dtype: float64

In [209]:
frame.apply(f1, axis="columns")

Utah      1.164668
Ohio      1.898838
Texas     0.594290
Oregon    2.418159
dtype: float64

In [210]:
def f2(x):
    return pd.Series([x.min(), x.max()], index=["min", "max"])

In [211]:
frame.apply(sum)

b    1.117174
d   -0.219362
e   -1.122298
dtype: float64

In [212]:
frame.sum()

b    1.117174
d   -0.219362
e   -1.122298
dtype: float64

In [213]:
def my_format(x):
    return f"{x:.2f}"

frame.map(my_format)

Unnamed: 0,b,d,e
Utah,1.15,1.45,0.29
Ohio,-0.88,-2.08,-0.18
Texas,-0.81,-1.06,-0.46
Oregon,1.65,1.47,-0.77


In [214]:
frame["e"].map(my_format)

Utah       0.29
Ohio      -0.18
Texas     -0.46
Oregon    -0.77
Name: e, dtype: object
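To summarize the difference between the calls used above: `apply` hands whole columns (or rows) to the function, while `Series.map` works element by element. The sketch below uses a small deterministic frame instead of random numbers so the results are reproducible; `min_max` is a hypothetical helper in the spirit of `f2`.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(6.).reshape((2, 3)), columns=list("bde"))

# apply passes each column (a Series) to the function, so a reduction
# yields one value per column.
spread = frame.apply(lambda col: col.max() - col.min())

# A function that returns a Series produces a DataFrame,
# with one column per input column.
def min_max(col):
    return pd.Series([col.min(), col.max()], index=["min", "max"])

summary = frame.apply(min_max)

# Series.map calls the function once per element and returns a Series
# of the same length.
formatted = frame["e"].map(lambda x: f"{x:.2f}")

print(spread)
print(summary)
print(formatted)
```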

In [215]:
obj = pd.Series(np.arange(4), index = ["d","a","b","c"])
obj

d    0
a    1
b    2
c    3
dtype: int64

In [216]:
obj.sort_index()

a    1
b    2
c    3
d    0
dtype: int64

In [217]:
frame = pd.DataFrame(np.arange(8).reshape(2,4), index=["three","one"], columns=["d","a","b","c"])
frame

Unnamed: 0,d,a,b,c
three,0,1,2,3
one,4,5,6,7


In [218]:
frame.sort_index()

Unnamed: 0,d,a,b,c
one,4,5,6,7
three,0,1,2,3


In [219]:
frame.sort_index(axis="columns")

Unnamed: 0,a,b,c,d
three,1,2,3,0
one,5,6,7,4


In [220]:
frame.sort_index(axis="columns", ascending=False)

Unnamed: 0,d,c,b,a
three,0,3,2,1
one,4,7,6,5


In [221]:
obj = pd.Series([4,7,-3,2])

In [222]:
obj.sort_values().index

Index([2, 3, 0, 1], dtype='int64')

In [223]:
obj = pd.Series([4,np.nan, 7, np.nan, -3, 2])
obj.sort_values()

4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64

In [224]:
obj.sort_values(na_position="first")

1    NaN
3    NaN
4   -3.0
5    2.0
0    4.0
2    7.0
dtype: float64

In [225]:
frame = pd.DataFrame({"b":[4,7,-3,2],
                        "a":[0,1,0,1]})


In [226]:
frame

Unnamed: 0,b,a
0,4,0
1,7,1
2,-3,0
3,2,1


In [227]:
frame.sort_values("b",ascending=False)

Unnamed: 0,b,a
1,7,1
0,4,0
3,2,1
2,-3,0


In [228]:
frame.sort_values(["a","b"])

Unnamed: 0,b,a
2,-3,0
0,4,0
3,2,1
1,7,1


In [229]:
obj = pd.Series([7,-5,7,4,2,0,4])

In [230]:
obj.rank()

0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

In [231]:
obj.rank(method="first")

0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64

In [232]:
obj.rank(ascending=False)

0    1.5
1    7.0
2    1.5
3    3.5
4    5.0
5    6.0
6    3.5
dtype: float64
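The `method` argument controls how ties are broken. A compact comparison of three of the available methods on the same Series, collected into one frame for easier reading (the column names are my own):

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# Compare how three tie-breaking rules rank the same values.
ranks = pd.DataFrame({
    "value": obj,
    "average": obj.rank(),              # default: mean rank within each tie group
    "min": obj.rank(method="min"),      # lowest rank in the tie group
    "dense": obj.rank(method="dense"),  # like "min", but ranks have no gaps
})
print(ranks)
```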

In [233]:
frame = pd.DataFrame({"b": [4.3, 7, -3, 2], "a": [0, 1, 0, 1],"c": [-2, 5, 8, -2.5]})



In [234]:
frame

Unnamed: 0,b,a,c
0,4.3,0,-2.0
1,7.0,1,5.0
2,-3.0,0,8.0
3,2.0,1,-2.5


In [235]:
frame.rank(axis="columns")

Unnamed: 0,b,a,c
0,3.0,2.0,1.0
1,3.0,1.0,2.0
2,1.0,2.0,3.0
3,3.0,2.0,1.0


In [236]:
obj = pd.Series(np.arange(5), index=["a","a","b","b","c"])

In [237]:
obj.index.is_unique

False

In [238]:
obj["a"]

a    0
a    1
dtype: int64

In [239]:
obj.loc["a"]

a    0
a    1
dtype: int64

In [240]:
obj["c"]

4

In [241]:
df = pd.DataFrame(np.random.standard_normal((5,3)), index=["a","a","b","b","c"])


In [242]:
df

Unnamed: 0,0,1,2
a,0.41324,-0.208738,0.244607
a,0.488427,-0.803299,-0.076609
b,-1.420499,0.258052,1.896288
b,-1.51704,0.994121,0.095048
c,0.526296,-1.696048,-0.153442


In [243]:
def modifyit(x):
    return f"{x:.2f}"

df.map(modifyit)

Unnamed: 0,0,1,2
a,0.41,-0.21,0.24
a,0.49,-0.8,-0.08
b,-1.42,0.26,1.9
b,-1.52,0.99,0.1
c,0.53,-1.7,-0.15


In [244]:
df.loc["b"]

Unnamed: 0,0,1,2
b,-1.420499,0.258052,1.896288
b,-1.51704,0.994121,0.095048


In [245]:
df.loc["c"]

0    0.526296
1   -1.696048
2   -0.153442
Name: c, dtype: float64

In [246]:
df = pd.DataFrame([[1.4, np.nan], [7.1,-4.5], [np.nan, np.nan],[0.75, -1.3]], 
                    index=["a","b","c","d"], columns=["one","two"])

In [247]:
df

Unnamed: 0,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.3


In [248]:
df.sum()

one    9.25
two   -5.80
dtype: float64

In [249]:
df.sum(axis="columns")

a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

In [250]:
df.sum(axis="index", skipna=False)

one   NaN
two   NaN
dtype: float64

In [251]:
df.sum(axis="columns", skipna=False)

a     NaN
b    2.60
c     NaN
d   -0.55
dtype: float64
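One consequence of the default `skipna=True` is that a row containing only NaN sums to 0.0, as row `c` did earlier. If an all-missing row should stay missing while partially missing rows are still summed, the `min_count` argument is one option. A minimal sketch on the same data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=["a", "b", "c", "d"], columns=["one", "two"])

# By default NaN is skipped, so an all-NaN row sums to 0.0.
default_sum = df.sum(axis="columns")

# min_count=1 requires at least one non-NaN value; otherwise the result is NaN.
strict_sum = df.sum(axis="columns", min_count=1)

print(default_sum["c"], strict_sum["c"])
```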

In [252]:
df.mean(axis="columns")

a    1.400
b    1.300
c      NaN
d   -0.275
dtype: float64

In [254]:
df.idxmax()

one    b
two    d
dtype: object

In [255]:
df

Unnamed: 0,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.3


In [256]:
df.cumsum(axis="index")

Unnamed: 0,one,two
a,1.4,
b,8.5,-4.5
c,,
d,9.25,-5.8


In [257]:
df.describe()

Unnamed: 0,one,two
count,3.0,2.0
mean,3.083333,-2.9
std,3.493685,2.262742
min,0.75,-4.5
25%,1.075,-3.7
50%,1.4,-2.9
75%,4.25,-2.1
max,7.1,-1.3


In [258]:
obj = pd.Series(["a","a","b","c"] * 4)

In [259]:
obj

0     a
1     a
2     b
3     c
4     a
5     a
6     b
7     c
8     a
9     a
10    b
11    c
12    a
13    a
14    b
15    c
dtype: object

In [260]:
obj.describe()

count     16
unique     3
top        a
freq       8
dtype: object

In [261]:
obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

In [262]:
uniques = obj.unique()

In [263]:
uniques

array(['c', 'a', 'd', 'b'], dtype=object)

In [264]:
obj

0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object

In [265]:
obj.value_counts(sort=False)

c    3
a    3
d    1
b    2
Name: count, dtype: int64

In [266]:
pd.Series(obj.to_numpy()).value_counts(sort=False)

c    3
a    3
d    1
b    2
Name: count, dtype: int64

In [267]:
obj

0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object

In [268]:
mask = obj.isin(["b","c"])

In [269]:
mask

0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool

In [270]:
obj[mask].index

Index([0, 5, 6, 7, 8], dtype='int64')

In [271]:
data = pd.DataFrame({"Qu1": [1, 3, 4, 3, 4],"Qu2": [2, 3, 1, 2, 3],"Qu3": [1, 5, 2, 4, 4]})
data

Unnamed: 0,Qu1,Qu2,Qu3
0,1,2,1
1,3,3,5
2,4,1,2
3,3,2,4
4,4,3,4


In [272]:
data.index

RangeIndex(start=0, stop=5, step=1)

In [273]:
data["Qu1"].value_counts().sort_index()

Qu1
1    1
3    2
4    2
Name: count, dtype: int64

In [274]:
result = data.apply(pd.Series.value_counts).fillna(0)

In [275]:
result

Unnamed: 0,Qu1,Qu2,Qu3
1,1.0,1.0,1.0
2,0.0,2.0,1.0
3,2.0,2.0,0.0
4,2.0,0.0,2.0
5,0.0,0.0,1.0


In [276]:
data = pd.DataFrame({"a":[1,1,1,2,2],"b":[0,0,1,0,0]})

In [277]:
data

Unnamed: 0,a,b
0,1,0
1,1,0
2,1,1
3,2,0
4,2,0


In [278]:
data.value_counts()

a  b
1  0    2
2  0    2
1  1    1
Name: count, dtype: int64