# SEP532 Artificial Intelligence: Theory and Practice
## Deep Learning Practice 
#### Prof. Ho-Jin Choi
#### School of Computing, KAIST

---

## Introduction to Pandas

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### Series and DataFrames
The two primary components of pandas are the `Series` and the `DataFrame`. A `Series` is essentially a single labeled column, and a `DataFrame` is a two-dimensional table made up of a collection of `Series` that share an index.
![Series and DataFrame](images/series-and-dataframe.png)
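
To make the relationship concrete, here is a minimal sketch that builds a `DataFrame` from two `Series` and then pulls one column back out. The names `apples`, `oranges`, and `purchases` and their values are made up for illustration only.

```python
import pandas as pd

# Hypothetical data, chosen only for illustration.
apples = pd.Series([3, 2, 0, 1], name='apples')
oranges = pd.Series([0, 3, 7, 2], name='oranges')

# Placing Series side by side (axis=1) yields a DataFrame;
# each Series name becomes a column label.
purchases = pd.concat([apples, oranges], axis=1)

# Selecting a single column gives back a Series.
column = purchases['apples']

print(type(purchases).__name__, type(column).__name__)
print(purchases.shape)
```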

#### Object creation
Creating a `Series` by passing a list of values, letting pandas create a default integer index:

In [None]:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s

Creating a `DataFrame` by passing a NumPy array, with a datetime index and labeled columns:

In [None]:
dates = pd.date_range('20200414', periods=6)
dates

In [None]:
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df

Creating a `DataFrame` by passing a dict of objects that can be converted into a series-like structure (scalars are broadcast to every row):

In [None]:
df2 = pd.DataFrame({
    'A': 1.0,
    'B': pd.Timestamp('20130102'),
    'C': pd.Series(1.0, index=range(0, 4)),
    'D': np.array([1.0, 2.0, 3.0, 4.0]),
    'E': 'foo',
})
df2

The columns of the resulting `DataFrame` have different dtypes.

In [None]:
df2.dtypes

### Viewing data
Here is how to view the top and bottom rows of the frame:

In [None]:
df.head()

In [None]:
df.tail(3)

Display the index or columns:

In [None]:
df.index

In [None]:
df.columns

`DataFrame.to_numpy()` gives a NumPy representation of the underlying data.

In [None]:
df.to_numpy()
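
A NumPy array holds a single dtype, so the result depends on the column dtypes of the frame. The sketch below, using two small hypothetical frames (`numeric` and `mixed`), compares an all-float frame with one that mixes floats and strings.

```python
import pandas as pd

# Hypothetical frames, used only to compare the resulting array dtypes.
numeric = pd.DataFrame({'x': [1.0, 2.0], 'y': [3.0, 4.0]})
mixed = pd.DataFrame({'x': [1.0, 2.0], 'label': ['a', 'b']})

numeric_array = numeric.to_numpy()
mixed_array = mixed.to_numpy()

print(numeric_array.dtype)  # float64: all columns share one dtype
print(mixed_array.dtype)    # object: fallback that can hold floats and strings
```

Converting a frame with mixed dtypes is therefore likely to be slower and more memory-hungry than converting a purely numeric one.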

`describe()` shows a quick statistical summary of your data:

In [None]:
df.describe()

Transposing your data:

In [None]:
df.T

Sorting by an axis:

In [None]:
df.sort_index(axis=1, ascending=False)

Sorting by values:

In [None]:
df.sort_values(by='B')

### Selection

#### Getting
Selecting a single column, which yields a `Series`:

In [None]:
df['A']

This is equivalent to `df.A` (attribute access only works when the column name is a valid Python identifier):

In [None]:
df.A

#### Selection by label
Getting a cross section using a label:

In [None]:
df.loc['20200414']

Selecting on a multi-axis by label:

In [None]:
df.loc[:, ['A', 'B']]

When using label slicing, both endpoints are *included*:

In [None]:
df.loc['20200414':'20200416', 'A':'B']

Reduction in the dimensions of the returned object:

In [None]:
df.loc['20200414', ['A', 'B']]

#### Selection by position
Select via the position of the passed integers:

In [None]:
df.iloc[3]

By integer slices, acting similar to NumPy/Python:

In [None]:
df.iloc[3:5, 0:2]
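
The two slicing styles treat the end point differently: `.loc` includes it, while `.iloc` follows the Python convention and excludes it. A small sketch on a hypothetical frame `demo`, built the same way as `df` but with deterministic values:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with predictable values (0..23).
idx = pd.date_range('20200414', periods=6)
demo = pd.DataFrame(np.arange(24).reshape(6, 4), index=idx, columns=list('ABCD'))

# Label slicing: both end points are included.
by_label = demo.loc['20200414':'20200416', 'A':'B']

# Position slicing: the stop position is excluded.
by_position = demo.iloc[0:2, 0:1]

print(by_label.shape)     # (3, 2)
print(by_position.shape)  # (2, 1)
```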

By lists of integer position locations:

In [None]:
df.iloc[[1, 2, 4], [0, 2]]

#### Boolean indexing
Using a single column's values to select rows:

In [None]:
df['A'] > 0

In [None]:
df[df['A'] > 0]

Selecting values from a `DataFrame` where a boolean condition is met (entries that fail the condition become `NaN`):

In [None]:
df > 0

In [None]:
df[df > 0]
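
The two forms of boolean indexing behave differently: a boolean `Series` filters rows, whereas a boolean `DataFrame` keeps the shape and masks individual cells. The sketch below uses a hypothetical frame `demo` and also shows one way to combine conditions with `&` (each condition needs its own parentheses).

```python
import pandas as pd

# Hypothetical data with a known sign pattern.
demo = pd.DataFrame({'A': [1.0, -2.0, 3.0], 'B': [-1.0, 5.0, 6.0]})

# Boolean Series: keeps only the rows where the condition holds.
rows = demo[demo['A'] > 0]

# Boolean DataFrame: same shape, cells failing the condition become NaN.
masked = demo[demo > 0]

# Combining conditions element-wise with & (use | for "or").
both = demo[(demo['A'] > 0) & (demo['B'] > 0)]

print(rows.index.tolist())  # [0, 2]
print(masked.shape)         # (3, 2)
print(both.index.tolist())  # [2]
```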

#### Setting
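Setting a new column automatically aligns the data by index. Here the new `Series` starts one day later than `df`, so the first row of `F` is `NaN` and the value for the last date, which is not in `df`, is dropped: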

In [None]:
df['F'] = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20200415', periods=6))
df
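
Index alignment is easier to see on a smaller example. In this sketch (the names `demo` and `shifted` are hypothetical), the assigned `Series` is offset by one day relative to the frame.

```python
import pandas as pd

# Hypothetical three-day frame.
idx = pd.date_range('20200414', periods=3)
demo = pd.DataFrame({'A': [10, 20, 30]}, index=idx)

# This Series starts one day later than the frame's index.
shifted = pd.Series([1, 2, 3], index=pd.date_range('20200415', periods=3))

# Assignment matches values by index label, not by position.
demo['F'] = shifted

print(demo)
# 2020-04-14 has no match in `shifted`, so F is NaN there;
# the value for 2020-04-17 is discarded because that date is not in `demo`.
```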

Setting values by label:

In [None]:
df.loc['20200414', 'A'] = 3
df

Setting values by position:

In [None]:
df.iloc[0, 1] = 5
df

Setting by assigning with a NumPy array:

In [None]:
df['D'] = np.array([1, 2, 3, 4, 5, 6])
df

### Operations

#### Statistics
Performing a descriptive statistic:

In [None]:
df.mean()

Same operation on the other axis:

In [None]:
df.mean(axis=1)
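
As a quick check of what the `axis` argument means, the sketch below computes both means on a hypothetical frame `demo` that contains one missing value. By default, `mean()` skips `NaN` entries.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in column A.
demo = pd.DataFrame({'A': [1.0, np.nan, 3.0], 'B': [4.0, 5.0, 6.0]})

# Default (axis=0): one mean per column, NaN ignored.
col_means = demo.mean()

# axis=1: one mean per row, computed across the columns.
row_means = demo.mean(axis=1)

print(col_means.tolist())  # [2.0, 5.0]
print(row_means.tolist())  # [2.5, 5.0, 4.5]
```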

#### Apply
Applying functions to the data:

In [None]:
df

In [None]:
df.apply(np.cumsum) # apply function to each column

In [None]:
df.apply(np.cumsum, axis=1) # apply function to each row
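
`apply()` also accepts a user-defined function. The following sketch, on a hypothetical frame `demo`, computes the range (max minus min) first per column and then per row.

```python
import pandas as pd

# Hypothetical data.
demo = pd.DataFrame({'A': [1, 4, 2], 'B': [10, 30, 20]})

# The function receives one column (a Series) at a time.
col_range = demo.apply(lambda col: col.max() - col.min())

# With axis=1 the function receives one row at a time.
row_range = demo.apply(lambda row: row.max() - row.min(), axis=1)

print(col_range.tolist())  # [3, 20]
print(row_range.tolist())  # [9, 26, 18]
```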

#### Histogramming

In [None]:
s = pd.Series(np.random.randint(0, 7, size=10))
s

In [None]:
s.value_counts()
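
`value_counts()` returns the distinct values as the index and their frequencies as the values, sorted by count in descending order. A small sketch with fixed, hypothetical data, including the `normalize=True` option for relative frequencies:

```python
import pandas as pd

# Hypothetical, deterministic data so the counts are easy to verify.
values = pd.Series([1, 1, 2, 3, 3, 3])

# Absolute counts, most frequent value first.
counts = values.value_counts()

# Relative frequencies that sum to 1.
shares = values.value_counts(normalize=True)

print(counts.index[0], counts.iloc[0])  # 3 3
print(shares.loc[3])                    # 0.5
```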

### Merge

#### Concat
pandas provides facilities for combining `Series` and `DataFrame` objects, with set logic for the indexes and relational-algebra functionality for join and merge operations.

Concatenating pandas objects together with `concat()`:

In [None]:
df = pd.DataFrame(np.random.randn(10, 4))
df

In [None]:
pieces = [df.iloc[:3], df.iloc[3:7], df.iloc[7:]]

In [None]:
pieces[0]

In [None]:
pieces[1]

In [None]:
pieces[2]

In [None]:
pd.concat(pieces)

In [None]:
pd.concat(pieces, axis=1)
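
The two calls differ in which axis is extended. With the default `axis=0` the pieces are stacked vertically; with `axis=1` they are placed side by side and aligned on the row index, so row labels missing from a piece are filled with `NaN`. A sketch on a small hypothetical frame `demo`:

```python
import numpy as np
import pandas as pd

# Hypothetical frame split into two pieces with disjoint row labels.
demo = pd.DataFrame(np.arange(12).reshape(6, 2), columns=['x', 'y'])
top, bottom = demo.iloc[:3], demo.iloc[3:]

# Default axis=0: rows are stacked, which restores the original frame.
stacked = pd.concat([top, bottom])

# axis=1: columns are added and rows are aligned by index label.
side_by_side = pd.concat([top, bottom], axis=1)

print(stacked.equals(demo))                   # True
print(side_by_side.shape)                     # (6, 4)
print(int(side_by_side.isna().sum().sum()))   # 12
```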

#### Join
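SQL-style joins with `merge()`. In this first example the key `foo` appears twice on each side, so every left row is paired with every matching right row: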

In [None]:
left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})

In [None]:
left

In [None]:
right

In [None]:
pd.merge(left, right, on='key')

Another example, this time with unique keys on both sides:

In [None]:
left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})

In [None]:
left

In [None]:
right

In [None]:
pd.merge(left, right, on='key')
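
The two merges above return different numbers of rows. The sketch below reproduces both cases under hypothetical names (`dup_left`, `uniq_left`, and so on) to make the row counts explicit.

```python
import pandas as pd

# Duplicate keys: 'foo' appears twice on each side.
dup_left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
dup_right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
many = pd.merge(dup_left, dup_right, on='key')

# Unique keys: each key appears once on each side.
uniq_left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
uniq_right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
one = pd.merge(uniq_left, uniq_right, on='key')

print(len(many))  # 4: every left 'foo' row pairs with every right 'foo' row
print(len(one))   # 2: one match per key
```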

### Grouping
By *group by* we are referring to a process involving one or more of the following steps:
- **Splitting** the data into groups based on some criteria
- **Applying** a function to each group independently
- **Combining** the results into a data structure
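
The three steps can be spelled out by hand to see what `groupby` does in a single call. This is only an illustrative sketch on a hypothetical frame `demo`, not a description of how pandas implements grouping internally.

```python
import pandas as pd

# Hypothetical data.
demo = pd.DataFrame({'key': ['a', 'b', 'a', 'b'], 'value': [1, 2, 3, 4]})

# Split: iterate over (group name, sub-frame) pairs.
groups = {name: group for name, group in demo.groupby('key')}

# Apply: run a function on each group independently.
sums = {name: group['value'].sum() for name, group in groups.items()}

# Combine: collect the per-group results into one Series.
manual = pd.Series(sums, name='value')

# The same result from a single groupby expression.
builtin = demo.groupby('key')['value'].sum()

print(manual.tolist())   # [4, 6]
print(builtin.tolist())  # [4, 6]
```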

In [None]:
df = pd.DataFrame({
   'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
   'C': np.random.randn(8),
   'D': np.random.randn(8)
})
df

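Grouping and then applying the `sum()` function to the resulting groups. The numeric columns are selected explicitly so that the string column `B` is not summed:

In [None]:
df.groupby('A')[['C', 'D']].sum()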

Grouping by multiple columns forms a hierarchical index, and again we can apply the `sum()` function.

In [None]:
df.groupby(['A', 'B']).sum()
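
A hierarchical index can be queried with a tuple of labels, or pivoted into columns with `unstack()`. A minimal sketch with fixed values on a hypothetical frame `demo`:

```python
import pandas as pd

# Hypothetical data with known sums per (A, B) pair.
demo = pd.DataFrame({
    'A': ['foo', 'foo', 'bar', 'bar'],
    'B': ['one', 'two', 'one', 'one'],
    'C': [1.0, 2.0, 3.0, 4.0],
})

# Grouping by two columns gives a Series with a two-level index.
grouped = demo.groupby(['A', 'B'])['C'].sum()

# Select one group with a tuple of labels, one per index level.
foo_one = grouped.loc[('foo', 'one')]

# Move the inner level 'B' into the columns; missing pairs become NaN.
wide = grouped.unstack('B')

print(grouped.index.nlevels)  # 2
print(foo_one)                # 1.0
print(wide.shape)             # (2, 2)
```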

### Plotting

In [None]:
ts = pd.Series(np.random.randn(1000), index=pd.date_range('2000-01-01', periods=1000))
ts = ts.cumsum()

ts.plot()

On a `DataFrame`, the `plot()` method is a convenience for plotting all of the columns with labels:

In [None]:
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=['A', 'B', 'C', 'D'])
df = df.cumsum()

df.plot()