#**5.1 Introduction to pandas Data Structures**#
##To get started with pandas, you will need to get comfortable with its two workhorse data structures: Series and DataFrame. While they are not a universal solution for every problem, they provide a solid, easy-to-use basis for most applications.
##**Series**##
###A Series is a one-dimensional array-like object containing a sequence of values (of similar types to NumPy types) and an associated array of data labels, called its index.
###The simplest Series is formed from only an array of data:

In [None]:
import pandas as pd

obj = pd.Series([4, 7, -5, 3])
print(obj)
print(obj.values)
print(obj.index)

0    4
1    7
2   -5
3    3
dtype: int64
[ 4  7 -5  3]
RangeIndex(start=0, stop=4, step=1)


##*Often it will be desirable to create a Series with an index identifying each data point with a label:*##

In [None]:
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
print(obj2)
print(obj2.index)

d    4
b    7
a   -5
c    3
dtype: int64
Index(['d', 'b', 'a', 'c'], dtype='object')


##*Compared with NumPy arrays, you can use labels in the index when selecting single values or a set of values:*##

In [None]:
print(obj2["a"])
print(obj2["d"])
print(obj2[["c","a","d"]])


-5
4
c    3
a   -5
d    4
dtype: int64


##Here ['c', 'a', 'd'] is interpreted as a list of indices, even though it contains strings instead of integers.
##Using NumPy functions or NumPy-like operations, such as filtering with a boolean array, scalar multiplication, or applying math functions, will preserve the index-value link:

In [None]:
import numpy as np

print(obj2[obj2 > 0])
print(obj2 * 2)
print(np.exp(obj2))

d    4
b    7
c    3
dtype: int64
d     8
b    14
a   -10
c     6
dtype: int64
d      54.598150
b    1096.633158
a       0.006738
c      20.085537
dtype: float64


##Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping of index values to data values. It can be used in many contexts where you might use a dict:

In [None]:
"b" in obj2

True
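##As a rough sketch of this dict-like behavior (the label "z" and the default value 0 are made up for illustration), membership tests and `get` both consult the index labels rather than the data values:

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

# Membership tests look at the index labels, not the values.
has_b = "b" in obj2
has_7 = 7 in obj2  # 7 is a value, not a label, so this is False

# get returns a default instead of raising KeyError for a missing label.
# "z" is a hypothetical label that is not in the index.
missing = obj2.get("z", default=0)

# A Series can be converted back to a plain dict.
as_dict = obj2.to_dict()

print(has_b, has_7, missing)
print(as_dict)
```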

##Should you have data contained in a Python dict, you can create a Series from it by passing the dict:

In [None]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
print(obj3 )

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64


##When you are only passing a dict, the index in the resulting Series follows the dict’s key insertion order, as the output above shows (older pandas versions sorted the keys instead). You can override this by passing the dict keys in the order you want them to appear in the resulting Series:

In [None]:
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata,index = states)
print(obj4)

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64


##Here, three values found in sdata were placed in the appropriate locations, but since no value for 'California' was found, it appears as NaN (not a number), which is considered in pandas to mark missing or NA values. Since 'Utah' was not included in states, it is excluded from the resulting object.
##I will use the terms “missing” or “NA” interchangeably to refer to missing data. The isnull and notnull functions in pandas should be used to detect missing data:

In [None]:
print(pd.isnull(obj4))
print(pd.notnull(obj4))

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool
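##The same checks are also available as Series methods, under both the `isnull`/`notnull` and `isna`/`notna` names. A minimal sketch, using a shortened stand-in for obj4, that counts the missing entries and keeps only the present ones:

```python
import pandas as pd

# Shortened stand-in for obj4: 'California' has no entry in the dict.
obj4 = pd.Series({"Ohio": 35000, "Texas": 71000},
                 index=["California", "Ohio", "Texas"])

mask = obj4.isna()              # equivalent to pd.isnull(obj4)
n_missing = int(mask.sum())     # True counts as 1 when summed
present = obj4[obj4.notna()]    # boolean filtering drops the NaN row

print(n_missing)
print(present)
```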


##A useful Series feature for many applications is that it automatically aligns by index label in arithmetic operations:

In [None]:
print(obj3)
print(obj4)
print(obj3 + obj4)



Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64
California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64
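##If the NaN for a label that exists in only one operand is not what you want, the `add` method accepts a `fill_value`. The sketch below rebuilds obj3 and obj4 and treats a label missing from one side as 0; a position that is missing on both sides (here 'California') still comes out as NaN:

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj3 = pd.Series(sdata)
obj4 = pd.Series(sdata, index=["California", "Ohio", "Oregon", "Texas"])

# Plain + leaves NaN wherever either side is missing.
plain = obj3 + obj4

# add(..., fill_value=0) substitutes 0 when only one side is missing.
filled = obj3.add(obj4, fill_value=0)

print(plain)
print(filled)
```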


##Data alignment features will be addressed in more detail later. If you have experience with databases, you can think about this as being similar to a join operation.
##Both the Series object itself and its index have a name attribute, which integrates with other key areas of pandas functionality:

In [None]:
obj4.name = 'population'
obj4.index.name = 'state'
print(obj4)

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64


##A Series’s index can be altered in-place by assignment:

In [None]:

obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
print(obj)

Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64


#**DataFrame**#
##A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). The DataFrame has both a row and column index; it can be thought of as a dict of Series all sharing the same index. Under the hood, the data is stored as one or more two-dimensional blocks rather than a list, dict, or some other collection of one-dimensional arrays.
##There are many ways to construct a DataFrame, though one of the most common is from a dict of equal-length lists or NumPy arrays:

In [None]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

In [None]:
import pandas as pd

frame = pd.DataFrame(data)
print(frame)

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2


##If you specify a sequence of columns, the DataFrame’s columns will be arranged in that order:

In [None]:
pd.DataFrame(data, columns=['year', 'state', 'pop'])

   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
5  2003  Nevada  3.2


##If you pass a column that isn’t contained in the dict, it will appear with missing values in the result:

In [None]:
import pandas as pd

frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four',
                      'five', 'six'])



In [None]:
print(frame2)

       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN
six    2003  Nevada  3.2  NaN


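##A column in a DataFrame can be retrieved as a Series either by dict-like notation (frame2['state']) or by attribute (frame2.year). Passing a list of column names, as in the first cell below, returns a DataFrame instead: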

In [None]:
frame2[["state","pop"]]

        state  pop
one      Ohio  1.5
two      Ohio  1.7
three    Ohio  3.6
four   Nevada  2.4
five   Nevada  2.9
six    Nevada  3.2


In [None]:
frame2.year

one      2000
two      2001
three    2002
four     2001
five     2002
six      2003
Name: year, dtype: int64

##Rows can also be retrieved by label with the special loc attribute, or by integer position with iloc (much more on this later):

In [None]:
frame2.loc[["one","four"]]

      year   state  pop debt
one   2000    Ohio  1.5  NaN
four  2001  Nevada  2.4  NaN


##Columns can be modified by assignment. For example, the empty 'debt' column could be assigned a scalar value or an array of values:

In [None]:
frame2["debt"] = 16.5
print(frame2)

       year   state  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  1.7  16.5
three  2002    Ohio  3.6  16.5
four   2001  Nevada  2.4  16.5
five   2002  Nevada  2.9  16.5
six    2003  Nevada  3.2  16.5


In [None]:
import numpy as np

frame2["debt"] = np.arange(6.)

In [None]:
print(frame2)

       year   state  pop  debt
one    2000    Ohio  1.5   0.0
two    2001    Ohio  1.7   1.0
three  2002    Ohio  3.6   2.0
four   2001  Nevada  2.4   3.0
five   2002  Nevada  2.9   4.0
six    2003  Nevada  3.2   5.0


##When you are assigning lists or arrays to a column, the value’s length must match the length of the DataFrame. If you assign a Series, its labels will be realigned exactly to the DataFrame’s index, inserting missing values in any holes:

In [None]:
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
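##The cell above only defines val. As a minimal sketch of the assignment itself, here is the same Series assigned to a small stand-in frame (a one-column copy of the example, so frame2 is left untouched); labels missing from val become NaN:

```python
import pandas as pd

# Stand-in frame with the same row labels as frame2.
frame = pd.DataFrame({"pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]},
                     index=["one", "two", "three", "four", "five", "six"])

val = pd.Series([-1.2, -1.5, -1.7], index=["two", "four", "five"])

# Values are matched by label, not by position; unmatched rows get NaN.
frame["debt"] = val
print(frame)
```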

##Assigning a column that doesn’t exist will create a new column. The del keyword will delete columns as with a dict.
##As an example of del, I first add a new column of boolean values where the state column equals 'Ohio':

In [None]:
frame2['eastern'] = frame2.state == 'Ohio'
print(frame2)

       year   state  pop  debt  eastern
one    2000    Ohio  1.5   0.0     True
two    2001    Ohio  1.7   1.0     True
three  2002    Ohio  3.6   2.0     True
four   2001  Nevada  2.4   3.0    False
five   2002  Nevada  2.9   4.0    False
six    2003  Nevada  3.2   5.0    False


##The del keyword can then be used to remove this column:

In [None]:
del frame2['eastern']
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

##Another common form of data is a nested dict of dicts:

In [None]:
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

##If the nested dict is passed to the DataFrame, pandas will interpret the outer dict keys as the columns and the inner keys as the row indices:

In [None]:
frame3 = pd.DataFrame(pop)
print(frame3)

      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2000     NaN   1.5


##You can transpose the DataFrame (swap rows and columns) with similar syntax to a NumPy array:

In [None]:
print(frame3.T)

        2001  2002  2000
Nevada   2.4   2.9   NaN
Ohio     1.7   3.6   1.5


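##The keys in the inner dicts are combined to form the index in the result. Older pandas versions sorted them; the output above keeps them in order of first appearance (2001, 2002, 2000).
##This isn’t true if an explicit index is specified: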

In [None]:
pd.DataFrame(pop, index=[2001, 2002, 2003])

      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2003     NaN   NaN


##If the DataFrame’s columns are different dtypes, the dtype of the values array will be chosen to accommodate all of the columns:

In [None]:
frame2.values

array([[2000, 'Ohio', 1.5, 0.0],
       [2001, 'Ohio', 1.7, 1.0],
       [2002, 'Ohio', 3.6, 2.0],
       [2001, 'Nevada', 2.4, 3.0],
       [2002, 'Nevada', 2.9, 4.0],
       [2003, 'Nevada', 3.2, 5.0]], dtype=object)

#**Index Objects**#
##pandas’s Index objects are responsible for holding the axis labels and other metadata (like the axis name or names). Any array or other sequence of labels you use when constructing a Series or DataFrame is internally converted to an Index:

In [None]:
obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index
print(index)
print(index[1:])

Index(['a', 'b', 'c'], dtype='object')
Index(['b', 'c'], dtype='object')


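##Index objects are immutable and cannot be modified by the user; an assignment such as index[1] = 'd' raises a TypeError. Immutability makes it safer to share Index objects among data structures: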

In [None]:
labels = pd.Index(np.arange(3))
print(labels)

obj2 = pd.Series([1.5, -2.5, 0], index=labels)
print(obj2)

obj2.index is labels

Index([0, 1, 2], dtype='int64')
0    1.5
1   -2.5
2    0.0
dtype: float64


True
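##A short sketch of what immutability means in practice: item assignment on an Index fails with a TypeError, and the labels stay unchanged. The `mutated` flag is only a helper name for this illustration:

```python
import pandas as pd

index = pd.Index(["a", "b", "c"])

try:
    index[1] = "d"      # Index does not support item assignment
    mutated = True
except TypeError as exc:
    mutated = False
    print("TypeError:", exc)

print(index)
```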

##Unlike Python sets, a pandas Index can contain duplicate labels:

In [None]:
dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])
print(dup_labels)

Index(['foo', 'foo', 'bar', 'bar'], dtype='object')


#**Reindexing**#
##An important method on pandas objects is reindex, which means to create a new object with the data conformed to a new index. Consider an example:

In [None]:
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
print(obj)

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64


##Calling reindex on this Series rearranges the data according to the new index, introducing missing values if any index values were not already present:

In [None]:
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
print(obj2)

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64


##For ordered data like time series, it may be desirable to do some interpolation or filling of values when reindexing. The method option allows us to do this, using a method such as ffill, which forward-fills the values:

In [None]:
import pandas as pd

obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
print(obj3)

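# reindex returns a new object, so keep the result instead of discarding it
obj3_filled = obj3.reindex(range(6), method='ffill')
print(obj3_filled)

0      blue
2    purple
4    yellow
dtype: object
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object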

##With DataFrame, reindex can alter either the (row) index, columns, or both. When passed only a sequence, it reindexes the rows in the result:

In [None]:
import numpy as np

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])
print(frame)

frame2 = frame.reindex(['a', 'b', 'c', 'd'])
print(frame2)

   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8
   Ohio  Texas  California
a   0.0    1.0         2.0
b   NaN    NaN         NaN
c   3.0    4.0         5.0
d   6.0    7.0         8.0


##The columns can be reindexed with the columns keyword:

In [None]:
states = ['Texas', 'Utah', 'California']
frame.reindex(columns = states)

   Texas  Utah  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8


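##As we’ll explore in more detail, you can also select rows and columns by label with loc, and many users prefer to use it exclusively. Unlike reindex, loc in recent pandas versions raises a KeyError when a requested label is missing instead of inserting missing values.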
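##A minimal sketch of the difference, assuming a recent pandas release (1.0 or later, where loc no longer tolerates missing labels): loc works when every requested label exists, while reindex is the tool for introducing new ones. The `raised` flag is just a helper name for this illustration:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=["a", "c", "d"],
                     columns=["Ohio", "Texas", "California"])

# All labels exist, so loc can select and reorder rows and columns.
subset = frame.loc[["a", "d"], ["California", "Texas"]]

# 'b' is not in the index: loc raises, reindex fills with NaN.
try:
    frame.loc[["a", "b"]]
    raised = False
except KeyError:
    raised = True

filled = frame.reindex(["a", "b"])

print(subset)
print(raised)
print(filled)
```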

#**Dropping Entries from an Axis**#
##Dropping one or more entries from an axis is easy if you already have an index array or list without those entries. As that can require a bit of munging and set logic, the drop method will return a new object with the indicated value or values deleted from an axis:

In [None]:
obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
print(obj)
new_obj = obj.drop('c')
print(new_obj)
obj.drop(['d', 'c'])

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64
a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64


a    0.0
b    1.0
e    4.0
dtype: float64

##With DataFrame, index values can be deleted from either axis. To illustrate this, we first create an example DataFrame:

In [None]:
import pandas as pd
import numpy as np

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
print(data)

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


##Calling drop with a sequence of labels will drop values from the row labels (axis 0):

In [None]:
data.drop(['Colorado', 'Ohio'])

          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15


##You can drop values from the columns by passing axis=1 or axis='columns':

In [None]:
data.drop('two', axis=1)

          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15


In [None]:
data.drop(['two', 'four'], axis='columns')

          one  three
Ohio        0      2
Colorado    4      6
Utah        8     10
New York   12     14


##Many functions, like drop, which modify the size or shape of a Series or DataFrame, can manipulate an object in-place without returning a new object:

In [None]:
obj.drop('c', inplace=True)
obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

#**Indexing, Selection, and Filtering Series**#
##Series indexing (obj[...]) works analogously to NumPy array indexing, except you can use the Series’s index values instead of only integers. Here are some examples of this:

In [None]:
import pandas as pd
import numpy as np

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
print(obj)
print(obj['b'])

a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64
1.0


##Indexing into a DataFrame is for retrieving one or more columns either with a single value or sequence:

In [None]:
import pandas as pd
import numpy as np

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

print(data)
print(data['two'])
print(data[['three', 'one']])

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int64
          three  one
Ohio          2    0
Colorado      6    4
Utah         10    8
New York     14   12


##Indexing like this has a few special cases. First, slicing or selecting data with a boolean array:

In [None]:
print(data[:2])
print(data[data['three'] > 5])

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


##The row selection syntax `data[:2]` is provided as a convenience. Passing a single element or a list to the `[]` operator selects columns.
##Another use case is in indexing with a boolean DataFrame, such as one produced by a scalar comparison:

In [None]:
print(data < 5)
data[data < 5] = 0
print(data)

            one    two  three   four
Ohio       True   True   True   True
Colorado   True  False  False  False
Utah      False  False  False  False
New York  False  False  False  False
          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


#**Selection with loc and iloc**#
##For DataFrame label-indexing on the rows, I introduce the special indexing operators `loc` and `iloc`. They enable you to select a subset of the rows and columns from a DataFrame with NumPy-like notation using either axis labels (`loc`) or integers (`iloc`).
##As a preliminary example, let’s select a single row and multiple columns by label:

In [None]:
data.loc['Colorado', ['two', 'three',"one"]]

two      5
three    6
one      0
Name: Colorado, dtype: int64

##We’ll then perform some similar selections with integers using `iloc`:

In [None]:
print(data.iloc[2, [3, 0, 1,2]])
print(data.iloc[2])
data.iloc[[1, 2], [3, 0, 1]]

four     11
one       8
two       9
three    10
Name: Utah, dtype: int64
one       8
two       9
three    10
four     11
Name: Utah, dtype: int64


          four  one  two
Colorado     7    0    5
Utah        11    8    9


##Both indexing functions work with slices in addition to single labels or lists of labels:

In [None]:
print(data.loc[:'Utah', 'two'])
data.iloc[:, :3][data.three > 5]

Ohio        0
Colorado    5
Utah        9
Name: two, dtype: int64


          one  two  three
Colorado    0    5      6
Utah        8    9     10
New York   12   13     14


| Type | Notes |
|---|---|
| `df[val]` | Select single column or sequence of columns from the DataFrame; special case conveniences: boolean array (filter rows), slice (slice rows), or boolean DataFrame (set values based on some criterion) |
| `df.loc[val]` | Selects single row or subset of rows from the DataFrame by label |
| `df.loc[:, val]` | Selects single column or subset of columns by label |
| `df.loc[val1, val2]` | Select both rows and columns by label |
| `df.iloc[where]` | Selects single row or subset of rows from the DataFrame by integer position |
| `df.iloc[:, where]` | Selects single column or subset of columns by integer position |
| `df.iloc[where_i, where_j]` | Select both rows and columns by integer position |
| `df.at[label_i, label_j]` | Select a single scalar value by row and column label |
| `df.iat[i, j]` | Select a single scalar value by row and column position (integers) |
| `reindex` method | Select either rows or columns by labels |
| `get_value`, `set_value` methods | Select single value by row and column label |
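##To make a few rows of this summary concrete, the sketch below rebuilds the unmodified example frame and fetches the same cell and the same block once by label and once by integer position:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

# Single scalar: at uses labels, iat uses integer positions.
by_label = data.at["Utah", "three"]
by_position = data.iat[2, 2]

# Rectangular block: loc uses labels, iloc uses integer positions.
block_labels = data.loc[["Ohio", "Utah"], ["one", "four"]]
block_positions = data.iloc[[0, 2], [0, 3]]

print(by_label, by_position)
print(block_labels)
```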

#**Integer Indexes**#
##Working with pandas objects indexed by integers is something that often trips up new users due to some differences with indexing semantics on built-in Python data structures like lists and tuples. For example, you might not expect the following code to generate an error:

In [None]:
import pandas as pd
import numpy as np

ser = pd.Series(np.arange(3.))
print(ser)
print(ser[-1])
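##With an integer index, pandas treats -1 as a label, finds no such label, and raises a KeyError. A hedged sketch of the usual way around the ambiguity is to say explicitly whether you mean labels (`loc`) or positions (`iloc`); the `raised` flag is only a helper name for this illustration:

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.))

try:
    ser[-1]             # -1 is looked up as a label and is not found
    raised = False
except KeyError:
    raised = True

last = ser.iloc[-1]     # position-based: always the last element

# Slicing differs too: loc includes the end label, iloc excludes the end position.
by_label = ser.loc[:1]
by_position = ser.iloc[:1]

print(raised, last, len(by_label), len(by_position))
```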

#**Arithmetic and Data Alignment**#
##An important pandas feature for some applications is the behavior of arithmetic between objects with different indexes. When you are adding together objects, if any index pairs are not the same, the respective index in the result will be the union of the index pairs. For users with database experience, this is similar to an automatic outer join on the index labels. Let’s look at an example:

In [None]:
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])

s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1],
               index=['a', 'c', 'e', 'f', 'g'])

print(s1)
print(s2)

a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64
a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64


##Adding these together yields:

In [None]:
s1 + s2

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

##In the case of DataFrame, alignment is performed on both the rows and the columns:

In [None]:
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                   index=['Ohio', 'Texas', 'Colorado'])

df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])
print(df1)
print(df2)

            b    c    d
Ohio      0.0  1.0  2.0
Texas     3.0  4.0  5.0
Colorado  6.0  7.0  8.0
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0


##Adding these together returns a DataFrame whose index and columns are the unions of the ones in each DataFrame:

In [None]:
df1 + df2

            b   c     d   e
Colorado  NaN NaN   NaN NaN
Ohio      3.0 NaN   6.0 NaN
Oregon    NaN NaN   NaN NaN
Texas     9.0 NaN  12.0 NaN
Utah      NaN NaN   NaN NaN
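##Since labels found in only one operand produce NaN, it can help to supply a fill value. A minimal sketch using the DataFrame `add` method with `fill_value=0` on the same two frames; cells missing from both frames (such as Colorado/e) remain NaN:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list("bcd"),
                   index=["Ohio", "Texas", "Colorado"])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list("bde"),
                   index=["Utah", "Ohio", "Texas", "Oregon"])

# A cell present in only one frame is combined with 0 instead of NaN.
result = df1.add(df2, fill_value=0)
print(result)
```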


#**Function Application and Mapping**#
##NumPy ufuncs (element-wise array methods) also work with pandas objects:

In [None]:
import pandas as pd
import numpy as np

frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
print(frame)
np.abs(frame)

               b         d         e
Utah   -0.371530 -0.486436  1.226432
Ohio    1.135391 -0.749412 -0.029807
Texas  -0.586784 -0.273892  0.024211
Oregon -0.819214  0.012225 -1.110059


               b         d         e
Utah    0.371530  0.486436  1.226432
Ohio    1.135391  0.749412  0.029807
Texas   0.586784  0.273892  0.024211
Oregon  0.819214  0.012225  1.110059


##Another frequent operation is applying a function on one-dimensional arrays to each column or row. DataFrame’s apply method does exactly this:

In [None]:
f = lambda x: x.max() - x.min()
frame.apply(f)

b    1.954605
d    0.761637
e    2.336490
dtype: float64
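##By default apply calls the function once per column. As a sketch with deterministic data (np.arange instead of random numbers, so the results are reproducible), passing `axis='columns'` calls it once per row, and a function that returns a Series produces a DataFrame. The helper names `value_range` and `min_max` are made up for this example:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])

def value_range(x):
    """Spread between the largest and smallest value."""
    return x.max() - x.min()

per_column = frame.apply(value_range)                  # one result per column
per_row = frame.apply(value_range, axis="columns")     # one result per row

def min_max(x):
    """Return several statistics at once as a labeled Series."""
    return pd.Series([x.min(), x.max()], index=["min", "max"])

summary = frame.apply(min_max)                          # rows: min, max

print(per_column)
print(per_row)
print(summary)
```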

#**Sorting and Ranking**#
##Sorting a dataset by some criterion is another important built-in operation. To sort lexicographically by row or column index, use the sort_index method, which returns a new, sorted object:

In [None]:
obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
print(obj)
obj.sort_index()

d    0
a    1
b    2
c    3
dtype: int64


a    1
b    2
c    3
d    0
dtype: int64

##With a DataFrame, you can sort by index on either axis:

In [None]:
frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                     index=['three', 'one'],
                     columns=['d', 'a', 'b', 'c'])
print(frame)
frame.sort_index()



       d  a  b  c
three  0  1  2  3
one    4  5  6  7


       d  a  b  c
one    4  5  6  7
three  0  1  2  3


##The data is sorted in ascending order by default, but can be sorted in descending order, too:

In [None]:
frame.sort_index(axis=1, ascending=False)

       d  c  b  a
three  0  3  2  1
one    4  7  6  5


##To sort a Series by its values, use its sort_values method:

In [None]:
obj = pd.Series([4, 7, -3, 2])
print(obj)
obj.sort_values()

0    4
1    7
2   -3
3    2
dtype: int64


2   -3
3    2
0    4
1    7
dtype: int64

##Any missing values are sorted to the end of the Series by default:

In [None]:
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
print(obj)
obj.sort_values()

0    4.0
1    NaN
2    7.0
3    NaN
4   -3.0
5    2.0
dtype: float64


4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64

##When sorting a DataFrame, you can use the data in one or more columns as the sort keys. To do so, pass one or more column names to the by option of sort_values:

In [None]:
frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
print(frame)
frame.sort_values(by='a')

   b  a
0  4  0
1  7  1
2 -3  0
3  2  1


   b  a
0  4  0
2 -3  0
1  7  1
3  2  1
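##To sort by more than one column, pass a list of names to `by`; later columns break ties in earlier ones. A small sketch on the same frame, which also shows a descending sort:

```python
import pandas as pd

frame = pd.DataFrame({"b": [4, 7, -3, 2], "a": [0, 1, 0, 1]})

# Sort by 'a' first, then by 'b' within each group of equal 'a'.
by_a_then_b = frame.sort_values(by=["a", "b"])

# Descending sort on a single column.
b_descending = frame.sort_values(by="b", ascending=False)

print(by_a_then_b)
print(b_descending)
```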


##Ranking assigns ranks from one through the number of valid data points in an array.
##The rank methods for Series and DataFrame are the place to look; by default rank breaks ties by assigning each group the mean rank:

In [None]:
obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
obj.rank()

0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

##Ranks can also be assigned according to the order in which they’re observed in the data:

In [None]:
obj.rank(method='first')

0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64
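##Other tie-breaking rules are available through the same `method` argument. As a brief sketch on the same Series, 'max' gives every tied value the highest rank in its group (shown here in descending order), and 'dense' ranks without leaving gaps after ties:

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# Rank from largest to smallest; ties share the maximum rank of the group.
descending_max = obj.rank(ascending=False, method="max")

# Dense ranking: ranks increase by exactly 1 between distinct values.
dense = obj.rank(method="dense")

print(descending_max)
print(dense)
```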

#**Correlation and Covariance**#
##Some summary statistics, like correlation and covariance, are computed from pairs of arguments. Let’s consider some DataFrames of stock prices and volumes obtained from Yahoo! Finance using the add-on pandas-datareader package. If you don’t have it installed already, it can be obtained via conda or pip:

In [None]:
!pip install pandas-datareader
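##Downloading prices requires network access, so here is a minimal offline sketch of the pairwise statistics themselves. The three columns A, B, and C and all of their values are invented for illustration and are not real market data:

```python
import pandas as pd

# Made-up daily returns; B is exactly twice A, so they are perfectly correlated.
returns = pd.DataFrame({
    "A": [0.01, -0.02, 0.03, 0.00, 0.01],
    "B": [0.02, -0.04, 0.06, 0.00, 0.02],
    "C": [0.00, 0.01, -0.01, 0.02, 0.00],
})

# Series.corr and Series.cov work on a pair of aligned Series.
ab_corr = returns["A"].corr(returns["B"])
ab_cov = returns["A"].cov(returns["B"])

# DataFrame.corr returns the full pairwise correlation matrix.
corr_matrix = returns.corr()

# corrwith correlates every column with one Series.
with_a = returns.corrwith(returns["A"])

print(ab_corr, ab_cov)
print(corr_matrix)
print(with_a)
```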



#**Unique Values, Value Counts, and Membership**#

##Another class of related methods extracts information about the values contained in a one-dimensional Series. To illustrate these, consider this example:

In [None]:
import pandas as pd

obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
obj

0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object

##The first function is `unique`, which gives you an array of the unique values in a Series:

In [None]:
uniques = obj.unique()
uniques

array(['c', 'a', 'd', 'b'], dtype=object)

##The unique values are not necessarily returned in sorted order, but could be sorted after the fact if needed (`uniques.sort()`). Relatedly, `value_counts` computes a Series containing value frequencies:

In [None]:
obj.value_counts()

c    3
a    3
b    2
d    1
Name: count, dtype: int64

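##The Series is sorted by value in descending order as a convenience. value_counts is also available as a top-level pandas function that can be used with any array or sequence, although recent pandas releases deprecate `pd.value_counts` in favor of calling the method on a Series (for example, `pd.Series(obj.values).value_counts()`):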

In [None]:
pd.value_counts(obj.values, sort=True)

c    3
a    3
b    2
d    1
Name: count, dtype: int64
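##For the membership part of this topic, the `isin` method performs a vectorized set membership check and is handy for filtering a Series down to a subset of values. A minimal sketch on the same data:

```python
import pandas as pd

obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

# Boolean mask: True where the value is one of the listed items.
mask = obj.isin(["b", "c"])

# Use the mask to keep only the matching entries.
selected = obj[mask]

print(mask)
print(selected)
```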

##In some cases, you may want to compute a histogram on multiple related columns in a DataFrame. Here’s an example:

In [2]:
import pandas as pd

data = pd.DataFrame({
    'Qu1': [1, 3, 4, 3, 4],
    'Qu2': [2, 3, 1, 2, 3],
    'Qu3': [1, 5, 2, 4, 4]
})
data

   Qu1  Qu2  Qu3
0    1    2    1
1    3    3    5
2    4    1    2
3    3    2    4
4    4    3    4


##Passing pandas.value_counts to this DataFrame’s apply function gives:

In [4]:
result = data.apply(pd.value_counts).fillna(0)
result

   Qu1  Qu2  Qu3
1  1.0  1.0  1.0
2  0.0  2.0  1.0
3  2.0  2.0  0.0
4  2.0  0.0  2.0
5  0.0  0.0  1.0
