# Chapter 5.  Getting Started with pandas

* Pandas contains data structures and data manipulation tools designed to make data cleaning
and analysis fast and easy in Python. 
* Pandas is often used in tandem with numerical
computing tools like NumPy and SciPy, analytical libraries like statsmodels and
scikit-learn, and data visualization libraries like matplotlib. 
* Pandas adopts significant
parts of NumPy’s idiomatic style of array-based computing, especially array-based
functions and a preference for data processing without for loops.
* While pandas adopts many coding idioms from NumPy, the biggest difference is that
pandas is designed for working with tabular or heterogeneous data. 
* NumPy, by contrast, is best suited for working with homogeneous numerical array data.

In [None]:
import pandas as pd

<span style="color:red">Thus, whenever you see **pd.** in code, it’s referring to pandas.</span>

In [None]:
import numpy as np
np.random.seed(12345)
import matplotlib.pyplot as plt
plt.rc('figure', figsize=(10, 6))
PREVIOUS_MAX_ROWS = pd.options.display.max_rows
pd.options.display.max_rows = 20
np.set_printoptions(precision=4, suppress=True)

## 5.1  Introduction to pandas Data Structures

There are two core objects in pandas: the **DataFrame** and the **Series**.  
While they are not a universal solution for every problem, they provide a solid, easy-to-use basis for most applications.

## The pandas data structures: `DataFrame` and `Series`

A `DataFrame` is a **tabular data structure** (a two-dimensional object that holds labeled data) made up of rows and columns, akin to a spreadsheet, database table, or R's data.frame object. You can think of it as multiple Series objects that share the same index.

<img align="left" width=50% src="pic/pic_5_02.png">      
<br><br><br><br><br><br><br>
<br><br><br><br><br><br><br>
<br><br><br><br><br><br><br>
    
<img align="left" width=50% src="pic/pic_5_01.png">

### DataFrame

A DataFrame is a table.   

For example, consider the following simple DataFrame:

In [None]:
pd.DataFrame({'Yes': [50, 21], 'No': [131, 2]})

In this example, the "0, No" entry has the value of 131.   
The "0, Yes" entry has a value of 50, and so on.

DataFrame entries are not limited to integers.   
For instance, here's a DataFrame whose values are strings:

In [None]:
pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'], 'Sue': ['Pretty good.', 'Bland.']})

We are using the pd.DataFrame() constructor to generate these DataFrame objects.   
The syntax for declaring a new one is a dictionary whose keys are the column names (Bob and Sue in this example) and whose values are lists of entries.   
This is the standard way of constructing a new DataFrame, and the one you are most likely to encounter.

The dictionary-list constructor assigns values to the column labels, but just uses an ascending count from 0 (0, 1, 2, 3, ...) for the row labels.   
Sometimes this is OK, but oftentimes we will want to assign these labels ourselves.

The list of row labels used in a DataFrame is known as an Index.   
We can assign values to it by using an index parameter in our constructor:

In [None]:
pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'], 'Sue': ['Pretty good.', 'Bland.']},
             index=['Product A', 'Product B'])

### Series

A Series is a sequence of data values.   
If a DataFrame is a table, a Series is a list.   
And in fact you can create one with nothing more than a list:

In [None]:
pd.Series([1, 2, 3, 4, 5])

A Series is, in essence, a single column of a DataFrame.   
So you can assign row labels to the Series the same way as before, using an index parameter.   
However, a Series does not have a column name; it only has one overall name:

In [None]:
pd.Series([30, 35, 40], index=['2015 Sales', '2016 Sales', '2017 Sales'], name='Product A')

The Series and the DataFrame are intimately related.   
It's helpful to think of a DataFrame as actually being just a bunch of Series "glued together".
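
To make the "glued together" picture concrete, here is a minimal sketch that builds a DataFrame from two Series sharing the same index. The product names and sales figures are made-up illustration values, not data from this chapter.

```python
import pandas as pd

# Two Series that share the same row labels (illustrative values).
years = ['2015 Sales', '2016 Sales', '2017 Sales']
product_a = pd.Series([30, 35, 40], index=years, name='Product A')
product_b = pd.Series([10, 15, 25], index=years, name='Product B')

# Each Series becomes one column, labeled here with the Series name.
sales = pd.DataFrame({s.name: s for s in (product_a, product_b)})
print(sales)

# Selecting a column hands back a Series again.
col = sales['Product A']
print(type(col).__name__)
```

Going in the other direction, selecting a single column from the DataFrame recovers a Series with the same values and index as the one that went in.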

Exercise 5-1: Create a DataFrame `fruits` that looks like this:

<img align="left" src="pic/pic_5_10.png" width=30% >      

In [None]:
fruits = pd.DataFrame({'Apple': [30], 'Bananas': [21]})
fruits

Exercise 5-2: Create a DataFrame `fruit_sales` that matches the diagram below:

<img align="left" src="pic/pic_5_11.png" width=30% >      

Exercise 5-3: Create a variable `ingredients` with a Series that looks like:

<img align="left" src="pic/pic_5_12.png" width=30% >      

### Series

* A Series is a one-dimensional array-like object containing a sequence of values (of
similar types to NumPy types) and an associated array of data labels, called its *index*.
* The simplest Series is formed from only an array of data:

In [None]:
obj = pd.Series([4, 7, -5, 3])

In [None]:
obj

In [None]:
print(obj)

* The print representation of a Series displayed interactively shows the index on the
left and the values on the right. 
* Since we did not specify an index for the data, a
default one consisting of the integers 0 through N - 1 (where N is the length of the
data) is created. 
* You can get the array representation and index object of the Series via
its values and index attributes, respectively:

In [None]:
obj.values

In [None]:
obj.index  # like range(4)

* Often it will be desirable to create a Series with an index identifying each data point
with a label.

In [None]:
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

In [None]:
obj2

In [None]:
obj2.index

* Compared with NumPy arrays, you can use labels in the index when selecting single
values or a set of values.

In [None]:
obj2['a']

In [None]:
obj2['d'] = 6

In [None]:
obj2

In [None]:
obj2[['c', 'a', 'd']]

* Using NumPy functions or NumPy-like operations, such as filtering with a boolean
array, scalar multiplication, or applying math functions, will preserve the index-value
link.

In [None]:
obj2 > 0

In [None]:
obj2[obj2 > 0]

In [None]:
obj2 * 2

In [None]:
np.exp(obj2)

* Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping of index values to data values. 
* It can be used in many contexts where you might
use a dict.

In [None]:
'b' in obj2

In [None]:
'e' in obj2

* Should you have data contained in a Python dict, you can create a Series from it by
passing the dict.

In [None]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
sdata

In [None]:
obj3 = pd.Series(sdata)

In [None]:
obj3

* When you pass only a dict, the index of the resulting Series follows the order of the dict’s
keys (pandas versions before 0.23 sorted the keys instead). 
* You can override this by passing the dict keys in the order you
want them to appear in the resulting Series.

In [None]:
states = ['California', 'Ohio', 'Oregon', 'Texas']

In [None]:
obj4 = pd.Series(sdata, index=states)

In [None]:
obj4

* Here, three values found in sdata were placed in the appropriate locations, but since
no value for 'California' was found, it appears as *NaN* (not a number), which is considered in pandas to mark missing or NA values.   
* Since 'Utah' was not included in
states, it is excluded from the resulting object.

* I will use the terms “missing” or “NA” interchangeably to refer to missing data. 
* The **isnull** and **notnull** functions in pandas should be used to detect missing data.

In [None]:
pd.isnull(obj4)

In [None]:
pd.notnull(obj4)

In [None]:
obj4.isnull()

* I discuss working with missing data in more detail in Chapter 7.  


* A useful Series feature for many applications is that it automatically aligns by index
label in arithmetic operations.

In [None]:
obj3

In [None]:
obj4

In [None]:
obj3 + obj4

* Data alignment features will be addressed in more detail later. 
* If you have experience
with databases, you can think about this as being similar to a join operation.
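
A small sketch of the join analogy, using made-up state values: labels that appear in only one operand still show up in the result, but with a missing value, much like unmatched rows in an outer join.

```python
import pandas as pd

# Illustrative values; only 'Ohio' and 'Texas' appear in both Series.
left = pd.Series({'Ohio': 1.0, 'Texas': 2.0, 'Utah': 3.0})
right = pd.Series({'California': 10.0, 'Ohio': 20.0, 'Texas': 30.0})

# Addition aligns on index labels before computing.
total = left + right
print(total)

# Labels without a partner are kept but marked as missing.
unmatched = total[total.isnull()].index
print(sorted(unmatched))
```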


* Both the Series object itself and its index have a name attribute, which integrates with
other key areas of pandas functionality.


In [None]:
obj4.name = 'population'

In [None]:
obj4

The name attribute gives the Series object a name. When the Series is placed in a DataFrame, the corresponding column is labeled with that name.

In [None]:
pd.DataFrame(obj4)

In [None]:
pd.DataFrame(obj3)  # check what the column name becomes

In [None]:
obj4.index.name = 'state'

In [None]:
obj4

A Series’s index can be altered in-place by assignment.

In [None]:
obj

In [None]:
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']

In [None]:
obj

<img style="float: left;" src="pic/fig_00.png" width="400">

### DataFrame

* A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string,
boolean, etc.). 
* The DataFrame has both a row and column index; it can be thought of
as a dict of Series all sharing the same index. 
* Under the hood, the data is stored as one
or more two-dimensional blocks rather than a list, dict, or some other collection of
one-dimensional arrays. 
* The exact details of DataFrame’s internals are outside the
scope of this book.

<img style="float: left;" src="pic/pic_0_2.png">

While a DataFrame is physically two-dimensional, you can use it to
represent higher dimensional data in a tabular format using hierarchical indexing,
a subject we will discuss in Chapter 8 and an
ingredient in some of the more advanced data-handling features in pandas.

There are many ways to construct a DataFrame, though one of the most common is
from a dict of equal-length lists or NumPy arrays.

In [None]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

In [None]:
df = pd.DataFrame(data)

The resulting DataFrame will have its index assigned automatically, as with Series, and
the columns follow the order of the dict keys (pandas versions before 0.23 placed them in sorted order).

In [None]:
df

In [None]:
type(df)

In [None]:
type(obj4)

In [None]:
df.info()

If you are using the Jupyter notebook, pandas DataFrame objects will be displayed as
a more browser-friendly HTML table.


For large DataFrames, the **head** method selects only the first five rows.

In [None]:
df.head()

If you specify a sequence of columns, the DataFrame’s columns will be arranged in
that order.

In [None]:
pd.DataFrame(data, columns=['year', 'state', 'pop'])

If you pass a column that isn’t contained in the dict, it will appear with missing values
in the result.

In [None]:
df2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four',
                             'five', 'six'])

In [None]:
df2

In [None]:
df2.columns

A column in a DataFrame can be retrieved as a Series either by dict-like notation or
by attribute.

In [None]:
df2['state']

In [None]:
df2.year

In [None]:
type(df2.year)

<img style="float: left;" src="pic/pic_0_2.png">

Attribute-like access (e.g., df2.year) and tab completion of column names in IPython are provided as a convenience.

df2[column] works for any column name, but df2.column
only works when the column name is a valid Python variable
name.
<br><br><br>
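
The following sketch illustrates two situations where attribute access does not reach a column: a name that is not a valid identifier, and a name that collides with an existing DataFrame method. The frame and its 'total debt' column are made up for the illustration.

```python
import pandas as pd

frame = pd.DataFrame({'year': [2000, 2001],
                      'pop': [1.5, 1.7],
                      'total debt': [0.1, 0.2]})

# 'year' is a valid identifier with no name clash, so both forms agree.
same = frame.year.equals(frame['year'])

# 'pop' clashes with the DataFrame.pop method: attribute access returns the method.
pop_attr_is_method = callable(frame.pop)
pop_column = frame['pop']

# 'total debt' contains a space, so only bracket notation can reach it.
debt_column = frame['total debt']
```

For this reason, bracket notation is the safer default in code that has to work with arbitrary column names.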

Rows can be retrieved by name with the special loc attribute, or by position with iloc (much
more on this later).

In [None]:
df2.loc['three']

Columns can be modified by assignment. For example, the empty 'debt' column
could be assigned a scalar value or an array of values.

In [None]:
df2['debt'] = 16.5

In [None]:
df2

In [None]:
df2['debt'] = np.arange(6.)

In [None]:
df2

When you are assigning lists or arrays to a column, the value’s length must match the
length of the DataFrame.   
If you assign a Series, its labels will be realigned exactly to
the DataFrame’s index, inserting missing values in any holes.

In [None]:
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])

In [None]:
val

In [None]:
df2['debt'] = val

In [None]:
df2

Assigning a column that doesn’t exist will create a new column.   
The **del** keyword will
delete columns as with a dict.  

As an example of **del**, I first add a new column of boolean values where the state
column equals 'Ohio'.

In [None]:
df2['eastern'] = df2.state == 'Ohio'

In [None]:
df2

<img style="float: left;" src="pic/pic_0_1.png">

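<span style="color:red">New columns cannot be created with the df2.western syntax. The assignment below only sets a Python attribute on the object; it does not add a column.</span>

In [None]:
df2.western = df2.state == 'Ohio'  # sets an attribute, not a column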

In [None]:
df2

The **del** keyword can then be used to remove this column.

In [None]:
del df2['eastern']

In [None]:
df2.columns

In [None]:
df2

<img style="float: left;" src="pic/pic_0_1.png">

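<span style="color:red">In pandas versions without Copy-on-Write, the column returned from indexing a DataFrame is a view on the
underlying data, not a copy. Thus, any in-place modifications to the
Series will be reflected in the DataFrame. With Copy-on-Write (optional in pandas 2.x and documented as the default from pandas 3.0), such modifications no longer propagate. In either case, the column can be
explicitly copied with the Series’s copy method.</span>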

In [None]:
df2

In [None]:
aaa=df2['debt']

In [None]:
aaa

In [None]:
aaa['one']=1.5

In [None]:
aaa

In [None]:
df2

Now let's try using copy().

In [None]:
bbb=df2['pop'].copy()

In [None]:
bbb

In [None]:
bbb['one']=5

In [None]:
bbb

In [None]:
df2
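
Because whether a selected column shares memory with its parent DataFrame can depend on the pandas version and settings, an explicit copy is the unambiguous way to get independent data. A minimal sketch with made-up values:

```python
import pandas as pd

frame = pd.DataFrame({'pop': [1.5, 1.7, 3.6]}, index=['one', 'two', 'three'])

# An explicit copy owns its own data.
snapshot = frame['pop'].copy()
snapshot['one'] = 5.0

# The original DataFrame still holds the old value.
print(frame.loc['one', 'pop'])
print(snapshot['one'])
```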

Another common form of data is a nested dict of dicts.

In [None]:
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

In [None]:
pop

If the nested dict is passed to the DataFrame, pandas will interpret the outer dict keys
as the columns and the inner keys as the row indices.

In [None]:
df3 = pd.DataFrame(pop)

In [None]:
df3

You can transpose the DataFrame (swap rows and columns) with similar syntax to a
NumPy array.

In [None]:
df3.T

The keys in the inner dicts are combined to form the index in the result; whether the combined keys are also sorted depends on the pandas version.  
This isn’t true if an explicit index is specified.

In [None]:
pd.DataFrame(pop)

In [None]:
pd.__version__

<img style="float: left;" src="pic/pic_0_1.png">

<span style="color:red"> **If your pandas version is 0.23 or lower, you must update it.   
Close all running Jupyter notebooks, then open the Anaconda Prompt and run conda update pandas.**</span>

In [None]:
pd.DataFrame(pop,index=[2001,2002,2003])

Dicts of Series are treated in much the same way.

In [None]:
df3

In [None]:
pdata = {'Ohio': df3['Ohio'][:-1],
         'Nevada': df3['Nevada'][:2]}
pd.DataFrame(pdata)

<img style="float: left;" src="pic/pic_5_1.png" width="700">

If a DataFrame’s index and columns have their name attributes set, these will also be
displayed.

In [None]:
df3

In [None]:
df3.index.name = 'year'; df3.columns.name = 'state'

In [None]:
df3

As with Series, the values attribute returns the data contained in the DataFrame as a
two-dimensional ndarray.

In [None]:
df3.values

If the DataFrame’s columns are different dtypes, the dtype of the values array will be
chosen to accommodate all of the columns.

In [None]:
df2.values

### Index Objects

* pandas’s Index objects are responsible for holding the axis labels and other metadata
(like the axis name or names). 
* Any array or other sequence of labels you use when
constructing a Series or DataFrame is internally converted to an Index.

In [None]:
obj = pd.Series(range(3), index=['a', 'b', 'c'])

In [None]:
obj

In [None]:
idx = obj.index

In [None]:
idx

In [None]:
idx[1:]

Index objects are immutable and thus can’t be modified by the user.

In [None]:
idx[1] = 'd'  # TypeError

Immutability makes it safer to share Index objects among data structures.

In [None]:
labels = pd.Index(np.arange(3))

In [None]:
labels

In [None]:
obj2 = pd.DataFrame([1.5, -2.5, 0], index=labels)

In [None]:
obj2

In [None]:
obj2.index is labels
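
As a sketch of what immutability means in practice, the following catches the error raised by item assignment on an Index and then shows the supported alternative, which is to replace the whole index. The labels are arbitrary.

```python
import pandas as pd

ser = pd.Series(range(3), index=['a', 'b', 'c'])
idx = ser.index

try:
    idx[1] = 'd'          # item assignment is not supported on an Index
    mutated = True
except TypeError:
    mutated = False

# Replacing the whole index is allowed: it swaps in a new Index object.
ser.index = ['x', 'y', 'z']

print(mutated)
print(list(idx), list(ser.index))
```

The old Index object held in idx is unchanged, which is what makes it safe to share between several objects.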

<img style="float: left;" src="pic/pic_0_1.png">

<span style="color:red"> Some users will not often take advantage of the capabilities provided by indexes, but because some operations will yield results
containing indexed data, it’s important to understand how they
work.

In addition to being array-like, an Index also behaves like a fixed-size set.

In [None]:
df3

In [None]:
df3.columns

In [None]:
'Ohio' in df3.columns

In [None]:
2003 in df3.index

Unlike Python sets, a pandas Index can contain duplicate labels.

In [None]:
dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])

In [None]:
dup_labels

In [None]:
obj4 = pd.DataFrame([1.5, -2.5, 0, 5], index=dup_labels)
obj4

In [None]:
obj4.T

In [None]:
obj4.loc['foo']
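
Since an Index behaves like a fixed-size set, it also supports set-style operations. The sketch below uses a few of the Index methods for this (intersection, union, difference, isin) on two made-up label sets.

```python
import pandas as pd

left = pd.Index(['Ohio', 'Texas', 'Utah'])
right = pd.Index(['Texas', 'Utah', 'Oregon'])

common = left.intersection(right)      # labels present in both
either = left.union(right)             # labels present in at least one
only_left = left.difference(right)     # labels present only in left
membership = left.isin(['Texas', 'Nevada'])  # boolean array, one entry per label

print(common, either, only_left, membership, sep='\n')
```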

## 5.2  Essential Functionality

In [None]:
import pandas as pd
import numpy as np

This section will walk you through the fundamental mechanics of interacting with the
data contained in a Series or DataFrame. In the chapters to come, we will delve more
deeply into data analysis and manipulation topics using pandas. This book is not
intended to serve as exhaustive documentation for the pandas library; instead, we’ll
focus on the most important features, leaving the less common (i.e., more esoteric)
things for you to explore on your own.

### Reindexing

* An important method on pandas objects is **reindex**, which means to create a new
object with the data conformed to a new index. Consider an example.

In [None]:
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])

In [None]:
obj

Calling **reindex** on this Series rearranges the data according to the new index, introducing missing values if any index values were not already present.

In [None]:
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])

In [None]:
obj2

* For ordered data like time series, it may be desirable to do some interpolation or filling of values when reindexing. 
* The **method** option allows us to do this, using a
method such as **ffill**, which forward-fills the values.

In [None]:
obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

In [None]:
obj3

In [None]:
obj3.reindex(range(6), method='ffill')

* With DataFrame, **reindex** can alter either the (row) index, columns, or both. 
* When
passed only a sequence, it reindexes the rows in the result.

In [None]:
df = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])

In [None]:
df

In [None]:
df2 = df.reindex(['a', 'b', 'c', 'd'])

In [None]:
df2

The columns can be reindexed with the **columns** keyword.

In [None]:
states = ['Texas', 'Utah', 'California']

In [None]:
df.reindex(columns=states)

<img style="float: left;" src="pic/pic_5_2.png" width="700">

In [None]:
obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

In [None]:
obj3

In [None]:
obj3.reindex(range(6), method='bfill')

In [None]:
obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

In [None]:
obj3

In [None]:
obj3.reindex(range(6), fill_value='???')

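You can also select rows and columns by label with **loc**, but only for labels that already exist on the axis. Recent pandas versions raise a KeyError when a list passed to **loc** contains a missing label, so use **reindex** to introduce new labels.

The code below raises an error for that reason ('b' is not in the index of df and 'Utah' is not among its columns):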

In [None]:
#df.loc[['a', 'b', 'c', 'd'], states]
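
As a sketch of the safer route when some labels may be missing, the following uses reindex to introduce new labels and reserves loc for labels known to exist. The frame mirrors the reindexing example in this section; the names expanded and existing are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])
states = ['Texas', 'Utah', 'California']

# reindex accepts labels that do not exist yet and fills them with NaN.
expanded = frame.reindex(index=['a', 'b', 'c', 'd'], columns=states)

# loc is fine as long as every requested label already exists.
existing = frame.loc[['a', 'c'], ['Texas', 'California']]

print(expanded)
print(existing)
```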

### Dropping Entries from an Axis

* Dropping one or more entries from an axis is easy if you already have an index array
or list without those entries. 
* As that can require a bit of munging and set logic, the **drop** method will return a new object with the indicated value or values deleted from
an axis.

In [None]:
obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

In [None]:
obj

In [None]:
new_obj = obj.drop('c')

In [None]:
new_obj

In [None]:
obj

In [None]:
obj.drop(['d', 'c'])

In [None]:
obj

* With DataFrame, index values can be deleted from either axis. To illustrate this, we
first create an example DataFrame.

In [None]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

In [None]:
data

Calling **drop** with a sequence of labels will drop values from the row labels (axis 0).

In [None]:
data.drop(['Colorado', 'Ohio'])

In [None]:
data

* You can drop values from the columns by passing axis=1 or axis='columns'.

In [None]:
data.drop('two', axis=1)

In [None]:
data

In [None]:
data.drop(['two', 'four'], axis='columns')

In [None]:
data

Many functions, like **drop**, which modify the size or shape of a Series or DataFrame,
can manipulate an object in-place without returning a new object.

In [None]:
data.drop('Ohio', inplace=True)

In [None]:
data
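
The difference between the default behavior and inplace=True can be checked directly. This minimal sketch uses a small made-up frame and records the row labels after each call.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(6).reshape((3, 2)),
                     index=['Ohio', 'Colorado', 'Utah'],
                     columns=['one', 'two'])

# Default: drop returns a new object and leaves the original alone.
trimmed = frame.drop('Ohio')
rows_after_default = list(frame.index)

# inplace=True: the original is modified and the call returns None.
result = frame.drop('Utah', inplace=True)
rows_after_inplace = list(frame.index)

print(rows_after_default, rows_after_inplace, result)
```

Be careful with inplace=True, since the dropped data is gone from the original object.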

### Indexing, Selection, and Filtering

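Series indexing (obj[...]) works analogously to NumPy array indexing, except you
can use the Series’s index values instead of only integers. Recent pandas versions deprecate integer-position lookups such as obj[1] on a label-indexed Series in favor of obj.iloc[1]. Here are some examples of
bracket indexing: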

In [None]:
obj = pd.Series(np.arange(5,9), index=['a', 'b', 'c', 'd'])

In [None]:
obj

In [None]:
obj['b']

In [None]:
obj[1]

In [None]:
obj[2:4]

In [None]:
obj[['b', 'a', 'd']]

In [None]:
obj[[1, 3]]

In [None]:
obj[obj < 7]

### <span style="color:red"> Slicing with labels behaves differently than normal Python slicing in that the endpoint is inclusive.

In [None]:
obj['b':'c']

### <span style="color:red"> But slicing with index numbers behaves like normal Python slicing.

In [None]:
obj[1:3]

Setting using these methods modifies the corresponding section of the Series.

In [None]:
obj['b':'c'] = 15
obj
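
The two slicing rules can be written explicitly with loc and iloc, which removes any doubt about which rule applies. A minimal sketch using the same kind of Series as above:

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(5, 9), index=['a', 'b', 'c', 'd'])

by_label = ser.loc['b':'c']     # label slice: endpoint 'c' is included
by_position = ser.iloc[1:3]     # positional slice: position 3 is excluded

# Both select the entries labeled 'b' and 'c'.
print(by_label.equals(by_position))

# A label slice up to 'd' includes 'd' itself.
through_d = ser.loc['b':'d']
print(len(through_d))
```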

* Retrieving columns from a DataFrame:

In [None]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

In [None]:
data

In [None]:
data['two']

In [None]:
data[['three', 'one']]

* Indexing like this has a few special cases. First, slicing or selecting data with a boolean
array.

In [None]:
data

In [None]:
data[:2]

In [None]:
data[data['three'] > 5]

* The row selection syntax data[:2] is provided as a convenience. 
* Passing a single element or a list to the [ ] operator selects columns.


* Another use case is in indexing with a boolean DataFrame, such as one produced by a
scalar comparison:


In [None]:
data

In [None]:
data < 5

In [None]:
data[data < 5] = 0

In [None]:
data

#### Selection with loc and iloc

* For DataFrame label-indexing on the rows, I introduce the special indexing operators **loc** and **iloc**. 
* They enable you to select a subset of the rows and columns from a
DataFrame with NumPy-like notation using either axis labels (**loc**) or integers
(**iloc**).

As a preliminary example, let’s select a single row and multiple columns by label.

In [None]:
data

In [None]:
data.loc['Colorado', ['two', 'three']]

We’ll then perform some similar selections with integers using **iloc**.

In [None]:
data

In [None]:
data.iloc[2, [3, 0, 1]]

In [None]:
data.iloc[2]

In [None]:
data

In [None]:
data[2]  # KeyError: 2 is not a column label

In [None]:
data[2:3]

In [None]:
data[2:3].values  # a row slice is still a DataFrame, so values is 2-D

In [None]:
data[2:3].values.shape  # (1, 4)

In [None]:
type(data[2:3])  

In [None]:
data.iloc[2].values  # a single row is a Series, so values is 1-D

In [None]:
data.iloc[2].values.shape

In [None]:
type(data.iloc[2])

In [None]:
data.iloc[[1, 2], [3, 0, 1]]

Both indexing functions work with slices in addition to single labels or lists of labels.

In [None]:
data.loc[:'Utah', 'two']

In [None]:
data.iloc[:, :3][data.three > 5]

<img style="float: left;" src="pic/pic_5_3.png" width="700">

<img style="float: left;" src="pic/pic_5_4.png" width="700">

### Integer Indexes

* Working with pandas objects indexed by integers is something that often trips up
new users due to some differences with indexing semantics on built-in Python data
structures like lists and tuples. 
* For example, you might not expect the following code
to generate an error.

In [None]:
ser = pd.Series(np.arange(3.))

In [None]:
ser

In [None]:
ser[-1]  # KeyError: -1 is not one of the integer labels

On the other hand, with a non-integer index, there is no potential for ambiguity.

In [None]:
ser2 = pd.Series(np.arange(3.), index=['a', 'b', 'c'])

In [None]:
ser2[-1]

To keep things consistent, if you have an axis index containing integers, data selection
will always be label-oriented.   
For more precise handling, use **loc** (for labels) or **iloc**
(for integers):

In [None]:
ser

In [None]:
ser[:1]

In [None]:
ser.loc[:1]

In [None]:
ser2.loc[:'b']

In [None]:
ser.loc[1]

In [None]:
ser.iloc[:1]
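
With an integer index, the same slice expression means different things under loc and iloc. The following sketch makes the difference visible by comparing lengths, and shows iloc as the way to count from the end.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.))    # default integer index 0, 1, 2

label_slice = ser.loc[:1]         # labels 0 and 1, endpoint inclusive
position_slice = ser.iloc[:1]     # position 0 only
last_value = ser.iloc[-1]         # positional access from the end

print(len(label_slice), len(position_slice), last_value)
```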

### Arithmetic and Data Alignment

* An important pandas feature for some applications is the behavior of arithmetic
between objects with different indexes. 
* When you are adding together objects, if any
index pairs are not the same, the respective index in the result will be the union of the
index pairs. 
* For users with database experience, this is similar to an automatic outer
join on the index labels. Let’s look at an example:

In [None]:
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])

In [None]:
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1],
               index=['a', 'c', 'e', 'f', 'g'])

In [None]:
s1

In [None]:
s2

Adding these together yields:

In [None]:
s1 + s2

* The internal data alignment introduces missing values in the label locations that don’t
overlap.
* Missing values will then propagate in further arithmetic computations.


* In the case of DataFrame, alignment is performed on both the rows and the columns:

In [None]:
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                   index=['Ohio', 'Texas', 'Colorado'])

In [None]:
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [None]:
df1

In [None]:
df2

* Adding these together returns a DataFrame whose index and columns are the unions
of the ones in each DataFrame.

In [None]:
df1 + df2

* Since the 'c' and 'e' columns are not found in both DataFrame objects, they appear
as all missing in the result. 
* The same holds for the rows whose labels are not common
to both objects.


* If you add DataFrame objects with no column or row labels in common, the result
will contain all nulls:

In [None]:
df1 = pd.DataFrame({'A': [1, 2]})

In [None]:
df2 = pd.DataFrame({'B': [3, 4]})

In [None]:
df1

In [None]:
df2

In [None]:
df1 - df2

#### Arithmetic methods with fill values

* In arithmetic operations between differently indexed objects, you might want to fill
with a special value, like 0, when an axis label is found in one object but not the other.

In [None]:
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),
                   columns=list('abcd'))

In [None]:
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),
                   columns=list('abcde'))

In [None]:
df2.loc[1, 'b'] = np.nan

In [None]:
df1

In [None]:
df2

* Adding these together results in NA values in the locations that don’t overlap.

In [None]:
df1 + df2

In [None]:
df1

In [None]:
df2

Using the **add** method on df1, I pass df2 and an argument to **fill_value**.

In [None]:
df1.add(df2)

In [None]:
df1.add(df2, fill_value=0)
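
The same fill_value mechanism works for Series. In this minimal sketch with made-up values, a label that exists on only one side is treated as 0 on the other side instead of producing a missing value.

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0], index=['a', 'b'])
s2 = pd.Series([10.0, 20.0], index=['b', 'c'])

plain = s1 + s2                     # 'a' and 'c' have no partner, so they become NaN
filled = s1.add(s2, fill_value=0)   # the absent side is treated as 0

print(plain)
print(filled)
```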

<img style="float: left;" src="pic/pic_5_5.png" width="400">

* See Table 5-5 for a listing of Series and DataFrame methods for arithmetic. 
* Each of
them has a counterpart, starting with the letter r, that has arguments flipped. 
* So these
two statements are equivalent:

In [None]:
df1

In [None]:
1 / df1

In [None]:
df1.rdiv(1)

In [None]:
df3 = pd.DataFrame(np.arange(10,22,2).reshape((2, 3)))

In [None]:
df4 = pd.DataFrame(np.arange(6).reshape((2, 3)))

In [None]:
df3

In [None]:
df4

In [None]:
df3-df4

In [None]:
df3.sub(df4)

In [None]:
df4.rsub(df3)

Relatedly, when reindexing a Series or DataFrame, you can also specify a different fill
value.

In [None]:
df1

In [None]:
df2

In [None]:
df1.reindex(columns=df2.columns)

In [None]:
df1.reindex(columns=df2.columns, fill_value=0)

#### Operations between DataFrame and Series

* As with NumPy arrays of different dimensions, arithmetic between DataFrame and
Series is also defined. 
* First, as a motivating example, consider the difference between
a two-dimensional array and one of its rows.

In [None]:
arr = np.arange(12.).reshape((3, 4))

In [None]:
arr

In [None]:
arr[0]

In [None]:
arr - arr[0]

* When we subtract arr[0] from arr, the subtraction is performed once for each row.
* This is referred to as *broadcasting* and is explained in more detail as it relates to general
NumPy arrays in Appendix A. 
* Operations between a DataFrame and a Series are
similar.

In [None]:
df = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [None]:
series = df.iloc[0]

In [None]:
df

In [None]:
series

* By default, arithmetic between DataFrame and Series matches the index of the Series
on the DataFrame’s columns, broadcasting down the rows.

In [None]:
df - series

If an index value is not found in either the DataFrame’s columns or the Series’s index,
the objects will be reindexed to form the union.

In [None]:
series2 = pd.Series(range(3), index=['b', 'e', 'f'])

In [None]:
series2

In [None]:
df

In [None]:
df + series2

If you want to instead broadcast over the columns, matching on the rows, you have to
use one of the arithmetic methods.

In [None]:
series3 = df['d']

In [None]:
df

In [None]:
series3

In [None]:
df.sub(series3, axis='index')

In [None]:
df.sub(series3, axis=0)

In [None]:
df-series3
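
To connect this back to NumPy broadcasting, the sketch below checks that subtracting a column with axis='index' gives the same numbers as subtracting a column vector from the underlying array. It rebuilds the same frame as above; the names by_rows and expected are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
column_d = frame['d']

# Match on the row labels and broadcast across the columns.
by_rows = frame.sub(column_d, axis='index')

# Plain NumPy equivalent: subtract column 1 (kept 2-D) from every column.
arr = frame.values
expected = arr - arr[:, [1]]
print(np.array_equal(by_rows.values, expected))

# Without axis='index', the Series labels are matched against the column names,
# and since none of them match, every entry is missing.
default_result = frame - column_d
print(default_result.isnull().all().all())
```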

### Function Application and Mapping

NumPy ufuncs (element-wise array methods) also work with pandas objects.

In [None]:
df= pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [None]:
df

In [None]:
np.abs(df)

Another frequent operation is applying a function on one-dimensional arrays to each
column or row. DataFrame’s **apply** method does exactly this.

In [None]:
df

In [None]:
f = lambda x: x.max() - x.min()

In [None]:
def f(x):
    return x.max()-x.min()

In [None]:
df.apply(f)

* Here the function f, which computes the difference between the maximum and minimum
of a Series, is invoked once on each column in df. 
* The result is a Series having
the columns of df as its index.
* If you pass axis='columns' to apply, the function will be invoked once per row
instead.

In [None]:
df.apply(f, axis='columns')

* Many of the most common array statistics (like sum and mean) are DataFrame methods,
so using apply is not necessary.
* The function passed to apply need not return a scalar value; it can also return a Series
with multiple values.

In [None]:
def f(x):
    return pd.Series([x.min(), x.max()], index=['min', 'max'])

In [None]:
df.apply(f)

* Element-wise Python functions can be used, too. 
* Suppose you wanted to compute a
formatted string from each floating-point value in df. You can do this with applymap (deprecated since pandas 2.1 in favor of the equivalent DataFrame.map).

In [None]:
format = lambda x: '%.2f' % x

In [None]:
df.applymap(format)

The reason for the name **applymap** is that Series has a map method for applying an
element-wise function.

In [None]:
df['e'].map(format)
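
The relationship between the two levels can be sketched by applying Series.map to every column, which produces an element-wise result for the whole DataFrame without depending on the name of the DataFrame-level method in a given pandas version. The frame values and the helper two_decimals are made up for the illustration.

```python
import pandas as pd

frame = pd.DataFrame({'b': [0.5, -1.25], 'd': [2.0, 3.14159]},
                     index=['Utah', 'Ohio'])

def two_decimals(x):
    """Format one float with two decimal places."""
    return '%.2f' % x

# apply passes each column (a Series) to the lambda; Series.map works element-wise.
formatted = frame.apply(lambda col: col.map(two_decimals))
print(formatted)
```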

### Sorting and Ranking

* Sorting a dataset by some criterion is another important built-in operation. 
* To sort
lexicographically by row or column index, use the **sort_index** method, which returns
a new, sorted object.

In [None]:
obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])

In [None]:
obj

In [None]:
obj.sort_index()

* With a DataFrame, you can sort by index on either axis.

In [None]:
df = pd.DataFrame(np.arange(8).reshape((2, 4)),
                     index=['three', 'one'],
                     columns=['d', 'a', 'b', 'c'])

In [None]:
df

In [None]:
df.sort_index()

In [None]:
df.sort_index(axis=1)

* The data is sorted in ascending order by default, but can be sorted in descending
order, too.

In [None]:
df.sort_index(axis=1, ascending=False)

To sort a Series by its values, use its sort_values method.

In [None]:
obj = pd.Series([4, 7, -3, 2])

In [None]:
obj

In [None]:
obj.sort_values()

* Any missing values are sorted to the end of the Series by default.

In [None]:
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
obj

In [None]:
obj.sort_values()

In [None]:
obj.sort_values(ascending=False)

* When sorting a DataFrame, you can use the data in one or more columns as the sort
keys. 
* To do so, pass one or more column names to the by option of sort_values.

In [None]:
df = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

In [None]:
df

In [None]:
df.sort_values(by='b')

* To sort by multiple columns, pass a list of names.

In [None]:
df.sort_values(by=['a', 'b'])
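When sorting by several keys, `ascending` can also be given as a list with one entry per key. The sketch below sorts `a` ascending and `b` descending; `demo` is an illustrative name.

```python
import pandas as pd

demo = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

# Sort by 'a' ascending, then by 'b' descending within each value of 'a'
mixed = demo.sort_values(by=['a', 'b'], ascending=[True, False])
print(mixed)
```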

* *Ranking* assigns ranks from one through the number of valid data points in an array.

* The **rank** methods for Series and DataFrame are the place to look; by default **rank**
breaks ties by assigning each group the mean rank.

In [None]:
obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
obj

In [None]:
obj.sort_values()

In [None]:
obj.rank()

Ranks can also be assigned according to the order in which they’re observed in the
data.

In [None]:
obj.rank(method='first')

In [None]:
obj.rank(method='first').map(int)

* Here, the entries at labels 0 and 2 (both equal to 7) are ranked 6 and 7 rather than sharing the average rank 6.5, because label 0 precedes label 2 in the data.


* You can rank in descending order, too:

In [None]:
obj.rank(ascending=False)

In [None]:
# Assign tie values the maximum rank in the group
obj.rank(ascending=False, method='max')

In [None]:
# Assign tie values the maximum rank in the group
obj.rank(ascending=False, method='max').map(int)
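To compare the tie-breaking methods side by side, one option is to collect the result of each into a single DataFrame. This is a sketch; `values`, `methods`, and `comparison` are illustrative names.

```python
import pandas as pd

values = pd.Series([7, -5, 7, 4, 2, 0, 4])

# Tie-breaking methods accepted by rank
methods = ['average', 'min', 'max', 'first', 'dense']

# One column of ranks per method, aligned on the original labels
comparison = pd.DataFrame({m: values.rank(method=m) for m in methods})
comparison.insert(0, 'value', values)
print(comparison)
```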

<img style="float: left;" src="pic/pic_5_6.png" width="700">

DataFrame can compute ranks over the rows or the columns.

In [None]:
df = pd.DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],
                   'c': [-2, 5, 8, -2.5]})

In [None]:
df

In [None]:
df.rank(axis='columns')
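The axis argument is easy to mix up, so here is a small sketch contrasting the two directions on the same data (`demo` is an illustrative name). With `axis='columns'` each row is ranked on its own; with the default axis each column is ranked on its own.

```python
import pandas as pd

demo = pd.DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],
                     'c': [-2, 5, 8, -2.5]})

# Rank the values within each row (compare b, a, c against each other)
within_rows = demo.rank(axis='columns')

# Rank the values within each column (the default)
within_cols = demo.rank()
print(within_rows)
print(within_cols)
```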

### Axis Indexes with Duplicate Labels

* Up until now all of the examples we’ve looked at have had unique axis labels (index
values).
* While many pandas functions (like reindex) require that the labels be
unique, it’s not mandatory.
* Let’s consider a small Series with duplicate indices.

In [None]:
obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

In [None]:
obj

The index’s **is_unique** property can tell you whether its labels are unique or not.

In [None]:
obj.index.is_unique

* Data selection is one of the main things that behaves differently with duplicates.
* Indexing a label with multiple entries returns a Series, while single entries return a
scalar value.

In [None]:
obj['a']

In [None]:
obj['c']

* This can make your code more complicated, as the output type from indexing can
vary based on whether a label is repeated or not.
* The same logic extends to indexing rows in a DataFrame:

In [None]:
df = pd.DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'b'])

In [None]:
df

In [None]:
df.loc['b']
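One way to keep the output type predictable when labels may repeat is to index with a list of labels, which returns a Series (or DataFrame) regardless of how many entries match. A minimal sketch, using the illustrative name `dup`:

```python
import pandas as pd

dup = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

# A single label gives a scalar or a Series, depending on duplicates
single = dup['c']      # scalar, because 'c' appears once
repeated = dup['a']    # Series, because 'a' appears twice

# A list of labels always gives a Series
single_as_series = dup.loc[['c']]
repeated_as_series = dup.loc[['a']]
print(type(single).__name__, type(repeated).__name__)
print(len(single_as_series), len(repeated_as_series))
```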

## 5.3  Summarizing and Computing Descriptive Statistics

* pandas objects are equipped with a set of common mathematical and statistical methods. 
* Most of these fall into the category of reductions or summary statistics, methods
that extract a single value (like the sum or mean) from a Series or a Series of values
from the rows or columns of a DataFrame. 
* Compared with the similar methods
found on NumPy arrays, they have built-in handling for missing data. Consider a
small DataFrame.

In [None]:
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])

In [None]:
df

Calling DataFrame’s **sum** method returns a Series containing column sums.

In [None]:
df.sum()

Passing axis='columns' or axis=1 sums across the columns instead.

In [None]:
df.sum(axis='columns')

In [None]:
df.sum(axis=1)

* NA values are excluded by default. If an entire slice (a row or column in this case) is NA, **mean** returns NaN, whereas **sum** returns 0 in pandas 0.22 and later.
* Skipping NA values can be disabled with the **skipna** option.

In [None]:
df

In [None]:
df.mean(axis='columns', skipna=False)
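The NA handling described above can be checked directly. This sketch rebuilds the same small frame under the illustrative name `demo` and also shows the `min_count` argument of **sum**, which is one way to get NaN back for an all-NA row.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                     [np.nan, np.nan], [0.75, -1.3]],
                    index=['a', 'b', 'c', 'd'],
                    columns=['one', 'two'])

row_sum = demo.sum(axis='columns')    # all-NA row 'c' sums to 0.0
row_mean = demo.mean(axis='columns')  # all-NA row 'c' averages to NaN

# skipna=False propagates NaN from any missing value in the row
strict_mean = demo.mean(axis='columns', skipna=False)

# min_count=1 requires at least one valid value, else the result is NaN
sum_min1 = demo.sum(axis='columns', min_count=1)
print(pd.DataFrame({'sum': row_sum, 'mean': row_mean,
                    'strict_mean': strict_mean, 'sum_min1': sum_min1}))
```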

<img style="float: left;" src="pic/pic_5_7.png" width="500">

* Some methods, like **idxmin** and **idxmax**, return indirect statistics, such as the index value where the minimum or maximum value is attained.

In [None]:
df

In [None]:
df.idxmax()

Other methods are *accumulations*.

In [None]:
df

In [None]:
df.cumsum()
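A short sketch of both kinds of method on the same data, using the illustrative name `demo`. Note how **cumsum** leaves NaN in place at the missing positions but keeps accumulating past them.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                     [np.nan, np.nan], [0.75, -1.3]],
                    index=['a', 'b', 'c', 'd'],
                    columns=['one', 'two'])

# Label of the largest value in each column (NaN is skipped)
largest_at = demo.idxmax()

# Running total down each column
running = demo.cumsum()
print(largest_at)
print(running)
```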

* Another type of method is neither a reduction nor an accumulation.
* **describe** is one such example, producing multiple summary statistics in one shot:

In [None]:
df.describe()

On non-numeric data, **describe** produces alternative summary statistics.

In [None]:
obj = pd.Series(['a', 'a', 'b', 'c'] * 4)

In [None]:
obj

In [None]:
obj.describe()
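For a DataFrame that mixes numeric and non-numeric columns, **describe** summarizes only the numeric columns by default. Passing `include='all'` reports both kinds of statistics at once, with NaN where a statistic does not apply. A sketch on made-up data (`mixed` is a hypothetical name):

```python
import pandas as pd

# Hypothetical data, used only for illustration
mixed = pd.DataFrame({'key': ['a', 'a', 'b', 'c'],
                      'x': [1.0, 2.0, 3.0, 4.0]})

numeric_only = mixed.describe()             # only column 'x'
everything = mixed.describe(include='all')  # both columns
print(everything)
```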

<img style="float: left;" src="pic/pic_5_8.png" width="600">

### Correlation and Covariance

Omitted.

* Some summary statistics, like correlation and covariance, are computed from pairs of arguments.
* Let’s consider some DataFrames of stock prices and volumes obtained from Yahoo! Finance using the add-on pandas-datareader package.
* If you don’t have it installed already, it can be obtained via conda or pip.

<p style="font-family: Courier New; font-size: 1.15em;">conda install pandas-datareader</p>

In [None]:
price = pd.read_pickle('examples/yahoo_price.pkl')
volume = pd.read_pickle('examples/yahoo_volume.pkl')

* I use the **pandas_datareader** module to download some data for a few stock tickers.

In [None]:
import pandas_datareader.data as web

In [None]:
all_data = {ticker: web.get_data_yahoo(ticker)
            for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']}

In [None]:
all_data

In [None]:
price = pd.DataFrame({ticker: data['Adj Close']
                     for ticker, data in all_data.items()})

In [None]:
price

In [None]:
volume = pd.DataFrame({ticker: data['Volume']
                      for ticker, data in all_data.items()})

In [None]:
volume

* I now compute percent changes of the prices, a time series operation which will be
explored further in Chapter 11.

In [None]:
price.pct_change?

In [None]:
returns = price.pct_change()

In [None]:
returns.tail()

* The **corr** method of Series computes the correlation of the overlapping, non-NA,
aligned-by-index values in two Series. Relatedly, cov computes the covariance.

In [None]:
returns['MSFT'].corr(returns['IBM'])

In [None]:
returns['MSFT'].cov(returns['IBM'])

* Since MSFT is a valid Python attribute name, we can also select these columns using more
concise syntax.

In [None]:
returns.MSFT.corr(returns.IBM)

* DataFrame’s **corr** and **cov** methods, on the other hand, return a full correlation or
covariance matrix as a DataFrame, respectively.

In [None]:
returns.corr()

In [None]:
returns.cov()
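The same calculations can be tried without downloading anything. The sketch below uses a small, made-up price table (the tickers `AAA` and `BBB` and all numbers are invented for illustration) and cross-checks the pandas results against NumPy.

```python
import numpy as np
import pandas as pd

# Made-up prices for two hypothetical tickers
prices = pd.DataFrame({'AAA': [100.0, 102.0, 101.0, 105.0, 107.0],
                       'BBB': [50.0, 51.5, 50.5, 52.0, 53.5]})

rets = prices.pct_change()  # the first row is NaN by construction

# Pairwise statistics on two Series; NaN rows are dropped automatically
pair_corr = rets['AAA'].corr(rets['BBB'])
pair_cov = rets['AAA'].cov(rets['BBB'])

# Full matrices from the DataFrame methods
corr_matrix = rets.corr()
cov_matrix = rets.cov()

# Cross-check against NumPy on the rows without missing values
clean = rets.dropna()
np_corr = np.corrcoef(clean['AAA'], clean['BBB'])[0, 1]
np_cov = np.cov(clean['AAA'], clean['BBB'])[0, 1]
print(pair_corr, np_corr)
print(pair_cov, np_cov)
```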

* Using DataFrame’s **corrwith** method, you can compute pairwise correlations between a DataFrame’s columns or rows with another Series or DataFrame.
* Passing a Series returns a Series with the correlation value computed for each column.

In [None]:
returns.corrwith(returns.IBM)

* Passing a DataFrame computes the correlations of matching column names.
* Here I compute correlations of percent changes with volume.

In [None]:
returns.corrwith(volume)
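A small offline sketch of **corrwith**, again on invented numbers (`rets`, `vol`, and the tickers are hypothetical). It checks that passing a DataFrame gives the same value as correlating the matching columns one pair at a time.

```python
import pandas as pd

# Made-up returns and volumes for two hypothetical tickers
rets = pd.DataFrame({'AAA': [0.01, -0.02, 0.03, 0.00],
                     'BBB': [0.02, -0.01, 0.02, 0.01]})
vol = pd.DataFrame({'AAA': [10.0, 30.0, 20.0, 15.0],
                    'BBB': [5.0, 8.0, 6.0, 7.0]})

# Series argument: correlate every column of rets with that Series
with_series = rets.corrwith(rets['BBB'])

# DataFrame argument: correlate columns that share a name
with_frame = rets.corrwith(vol)

# The same number, computed one column pair at a time
manual_aaa = rets['AAA'].corr(vol['AAA'])
print(with_series)
print(with_frame)
```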

### Unique Values, Value Counts, and Membership

* Another class of related methods extracts information about the values contained in a
one-dimensional Series. 
* To illustrate these, consider this example.

In [None]:
obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

In [None]:
obj

The first function is **unique**, which gives you an array of the unique values in a Series.

In [None]:
uniques = obj.unique()

In [None]:
uniques

* The unique values are not necessarily returned in sorted order, but can be sorted after the fact if needed with uniques.sort(), which sorts the array in place.

In [None]:
uniques.sort()

In [None]:
uniques
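To make the ordering concrete: **unique** returns values in order of first appearance. The sketch below uses Python's built-in `sorted` to get a sorted copy while leaving the result of **unique** untouched; `letters` is an illustrative name.

```python
import pandas as pd

letters = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

# Values in the order in which they first appear
seen_order = letters.unique()

# A sorted copy as a plain list; seen_order itself is not modified
sorted_copy = sorted(seen_order)
print(list(seen_order), sorted_copy)
```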

* Relatedly, **value_counts** computes a Series
containing value frequencies:

In [None]:
obj.value_counts()

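* The resulting Series is sorted by count in descending order as a convenience.
* **value_counts** is also available as a top-level pandas function that can be used with any array or sequence (the top-level form is deprecated in pandas 2.1 and later in favor of the Series method).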

In [None]:
pd.value_counts(obj.values, sort=False)

In [None]:
pd.value_counts(obj.values)
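A sketch of the Series method on the same values, including relative frequencies via `normalize=True`. Wrapping a plain list in `pd.Series` is one way to count an arbitrary sequence without the top-level function; `letters` and `raw` are illustrative names.

```python
import pandas as pd

letters = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

counts = letters.value_counts()                # absolute counts
shares = letters.value_counts(normalize=True)  # fractions that sum to 1

# Counting a plain Python list by wrapping it in a Series first
raw = ['x', 'y', 'x']
raw_counts = pd.Series(raw).value_counts()
print(counts)
print(shares)
```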

The remainder is omitted.

* **isin** performs a vectorized set membership check and can be useful in filtering a
dataset down to a subset of values in a Series or column in a DataFrame.

In [None]:
obj

In [None]:
mask = obj.isin(['b', 'c'])

In [None]:
mask

In [None]:
obj[mask]
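The same pattern filters the rows of a DataFrame by the values in one column, and `~` inverts the mask to keep everything else. A sketch on made-up data (`orders` and its columns are hypothetical):

```python
import pandas as pd

# Hypothetical data, used only for illustration
orders = pd.DataFrame({'item': ['c', 'a', 'd', 'a', 'b'],
                       'qty': [1, 2, 3, 4, 5]})

# True where the item is one of the listed values
wanted = orders['item'].isin(['b', 'c'])

kept = orders[wanted]      # rows whose item is 'b' or 'c'
dropped = orders[~wanted]  # all remaining rows
print(kept)
print(dropped)
```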

* Related to **isin** is the **Index.get_indexer** method, which gives you an index array
from an array of possibly non-distinct values into another array of distinct values.

In [None]:
to_match = pd.Series(['c', 'a', 'b', 'b', 'c', 'a'])

In [None]:
to_match

In [None]:
unique_vals = pd.Series(['c', 'b', 'a'])

In [None]:
unique_vals

In [None]:
pd.Index(unique_vals).get_indexer(to_match)
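In other words, **get_indexer** reports, for each value, its position in the index of distinct values. Values that are not present are reported as -1, as the sketch below shows (`lookup` is an illustrative name and `'z'` is a deliberately missing value).

```python
import pandas as pd

lookup = pd.Index(['c', 'b', 'a'])

# Position of each value within lookup
positions = lookup.get_indexer(['c', 'a', 'b', 'b', 'c', 'a'])

# A value that is absent from lookup maps to -1
with_missing = lookup.get_indexer(['a', 'z'])
print(positions, with_missing)
```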

<img style="float: left;" src="pic/pic_5_9.png" width="700">