# BigData, the Core of Industry 4.0

<div align='right'><font size=2 color='gray'>Data Processing Based Python @ <font color='blue'><a href='https://www.facebook.com/jskim.kr'>FB / jskim.kr</a></font>, [김진수](bigpycraft@gmail.com)</font></div>
<hr>

# Pandas Advanced 

> ### INDEX
> - Object Creation
> - Viewing Data
> - Selection
> - Missing Data
> - Operation

## 1. Object Creation

> - See the Data Structure section.
> - Passing a list of values creates a Series; pandas assigns a default integer index.

In [1]:
import numpy as np
import pandas as pd

In [2]:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
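
As a small sketch of the default integer index versus an explicit one (the labels `a`, `b`, `c` and the variable names are made up for illustration):

```python
import pandas as pd

# Without an index argument, pandas assigns the integer labels 0..n-1.
s_default = pd.Series([1, 3, 5])

# Hypothetical labels, chosen only for this illustration.
s_labeled = pd.Series([1, 3, 5], index=["a", "b", "c"])

print(list(s_default.index))   # [0, 1, 2]
print(s_labeled["b"])          # 3
```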

In [3]:
dates = pd.date_range("20220530", periods=6)
dates

DatetimeIndex(['2022-05-30', '2022-05-31', '2022-06-01', '2022-06-02',
               '2022-06-03', '2022-06-04'],
              dtype='datetime64[ns]', freq='D')

In [4]:
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
df

,A,B,C,D
2022-05-30,0.256777,0.519061,1.403401,0.42556
2022-05-31,0.860877,0.369131,0.51094,0.793585
2022-06-01,-0.044488,0.163224,0.371233,1.03993
2022-06-02,1.07795,-1.266518,0.680737,1.696298
2022-06-03,0.122883,-0.703422,0.431064,0.758467
2022-06-04,0.487377,-1.015252,0.366307,-0.340236


In [5]:
df2 = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20220530"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)
df2

,A,B,C,D,E,F
0,1.0,2022-05-30,1.0,3,test,foo
1,1.0,2022-05-30,1.0,3,train,foo
2,1.0,2022-05-30,1.0,3,test,foo
3,1.0,2022-05-30,1.0,3,train,foo


In [6]:
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

### Tab key: autocompletion
> df2.\<TAB> 

## 2. Viewing Data

In [7]:
df.head()

,A,B,C,D
2022-05-30,0.256777,0.519061,1.403401,0.42556
2022-05-31,0.860877,0.369131,0.51094,0.793585
2022-06-01,-0.044488,0.163224,0.371233,1.03993
2022-06-02,1.07795,-1.266518,0.680737,1.696298
2022-06-03,0.122883,-0.703422,0.431064,0.758467


In [8]:
df.tail()

,A,B,C,D
2022-05-31,0.860877,0.369131,0.51094,0.793585
2022-06-01,-0.044488,0.163224,0.371233,1.03993
2022-06-02,1.07795,-1.266518,0.680737,1.696298
2022-06-03,0.122883,-0.703422,0.431064,0.758467
2022-06-04,0.487377,-1.015252,0.366307,-0.340236


In [9]:
df.tail(3)

,A,B,C,D
2022-06-02,1.07795,-1.266518,0.680737,1.696298
2022-06-03,0.122883,-0.703422,0.431064,0.758467
2022-06-04,0.487377,-1.015252,0.366307,-0.340236


In [10]:
df.index

DatetimeIndex(['2022-05-30', '2022-05-31', '2022-06-01', '2022-06-02',
               '2022-06-03', '2022-06-04'],
              dtype='datetime64[ns]', freq='D')

In [11]:
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

In [12]:
df.values

array([[ 0.25677696,  0.51906142,  1.40340111,  0.42556013],
       [ 0.86087735,  0.36913071,  0.51093962,  0.79358504],
       [-0.04448814,  0.16322362,  0.37123263,  1.03992969],
       [ 1.07795022, -1.26651805,  0.68073719,  1.69629841],
       [ 0.12288343, -0.70342163,  0.43106402,  0.75846677],
       [ 0.48737726, -1.01525229,  0.36630729, -0.34023551]])

In [13]:
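# Note: without parentheses this evaluates to the bound method itself;
# call df.value_counts() to actually count the unique rows.
df.value_counts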

<bound method DataFrame.value_counts of                    A         B         C         D
2022-05-30  0.256777  0.519061  1.403401  0.425560
2022-05-31  0.860877  0.369131  0.510940  0.793585
2022-06-01 -0.044488  0.163224  0.371233  1.039930
2022-06-02  1.077950 -1.266518  0.680737  1.696298
2022-06-03  0.122883 -0.703422  0.431064  0.758467
2022-06-04  0.487377 -1.015252  0.366307 -0.340236>
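
The output above is the repr of a bound method, because `value_counts` was referenced without parentheses. A minimal sketch of the difference, using a small made-up frame (`demo` and its values are assumptions for illustration):

```python
import pandas as pd

# Small hypothetical frame with one repeated row.
demo = pd.DataFrame({"x": [1, 1, 2], "y": ["a", "a", "b"]})

method = demo.value_counts      # the bound method itself; nothing is counted
counts = demo.value_counts()    # Series indexed by the unique (x, y) rows

print(callable(method))         # True
print(counts.loc[(1, "a")])     # 2
```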

- describe(): summarizes basic descriptive statistics of the data

In [14]:
df.describe()

,A,B,C,D
count,6.0,6.0,6.0,6.0
mean,0.46023,-0.322296,0.62728,0.728934
std,0.436525,0.76664,0.397725,0.674199
min,-0.044488,-1.266518,0.366307,-0.340236
25%,0.156357,-0.937295,0.38619,0.508787
50%,0.372077,-0.270099,0.471002,0.776026
75%,0.767502,0.317654,0.638288,0.978344
max,1.07795,0.519061,1.403401,1.696298


In [15]:
# Transpose the data
df.T

,2022-05-30,2022-05-31,2022-06-01,2022-06-02,2022-06-03,2022-06-04
A,0.256777,0.860877,-0.044488,1.07795,0.122883,0.487377
B,0.519061,0.369131,0.163224,-1.266518,-0.703422,-1.015252
C,1.403401,0.51094,0.371233,0.680737,0.431064,0.366307
D,0.42556,0.793585,1.03993,1.696298,0.758467,-0.340236


In [16]:
df.sort_index(axis=1, ascending=False)

,D,C,B,A
2022-05-30,0.42556,1.403401,0.519061,0.256777
2022-05-31,0.793585,0.51094,0.369131,0.860877
2022-06-01,1.03993,0.371233,0.163224,-0.044488
2022-06-02,1.696298,0.680737,-1.266518,1.07795
2022-06-03,0.758467,0.431064,-0.703422,0.122883
2022-06-04,-0.340236,0.366307,-1.015252,0.487377


In [17]:
df.sort_values(by='B')

,A,B,C,D
2022-06-02,1.07795,-1.266518,0.680737,1.696298
2022-06-04,0.487377,-1.015252,0.366307,-0.340236
2022-06-03,0.122883,-0.703422,0.431064,0.758467
2022-06-01,-0.044488,0.163224,0.371233,1.03993
2022-05-31,0.860877,0.369131,0.51094,0.793585
2022-05-30,0.256777,0.519061,1.403401,0.42556


## 3. Selection

- Getting

In [18]:
df['A']

2022-05-30    0.256777
2022-05-31    0.860877
2022-06-01   -0.044488
2022-06-02    1.077950
2022-06-03    0.122883
2022-06-04    0.487377
Freq: D, Name: A, dtype: float64

In [19]:
df[0:3]

,A,B,C,D
2022-05-30,0.256777,0.519061,1.403401,0.42556
2022-05-31,0.860877,0.369131,0.51094,0.793585
2022-06-01,-0.044488,0.163224,0.371233,1.03993


In [20]:
df['20220530' :'20220603']

,A,B,C,D
2022-05-30,0.256777,0.519061,1.403401,0.42556
2022-05-31,0.860877,0.369131,0.51094,0.793585
2022-06-01,-0.044488,0.163224,0.371233,1.03993
2022-06-02,1.07795,-1.266518,0.680737,1.696298
2022-06-03,0.122883,-0.703422,0.431064,0.758467
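
The two slices above behave differently at the end point: a positional slice excludes the stop position, while a label slice includes both end labels. A minimal sketch with a made-up frame (`demo` is an assumption; `.iloc` and `.loc` are used here as the explicit forms of the two slicing styles):

```python
import numpy as np
import pandas as pd

idx = pd.date_range("20220530", periods=6)
demo = pd.DataFrame({"A": np.arange(6)}, index=idx)

by_position = demo.iloc[0:3]                    # rows 0, 1, 2: stop is excluded
by_label = demo.loc["20220530":"20220601"]      # both end labels are included

print(len(by_position), len(by_label))          # 3 3
print(by_label.index[-1])                       # 2022-06-01 00:00:00
```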


- Selection by Label

In [21]:
df.loc[dates[0]]

A    0.256777
B    0.519061
C    1.403401
D    0.425560
Name: 2022-05-30 00:00:00, dtype: float64

In [22]:
df.loc[:,['A','B']]

,A,B
2022-05-30,0.256777,0.519061
2022-05-31,0.860877,0.369131
2022-06-01,-0.044488,0.163224
2022-06-02,1.07795,-1.266518
2022-06-03,0.122883,-0.703422
2022-06-04,0.487377,-1.015252


In [23]:
df.loc['20220601':'20220603', ['A','B']]

,A,B
2022-06-01,-0.044488,0.163224
2022-06-02,1.07795,-1.266518
2022-06-03,0.122883,-0.703422


In [24]:
df.loc['20220601',['A','B']]

A   -0.044488
B    0.163224
Name: 2022-06-01 00:00:00, dtype: float64

In [25]:
df.loc[dates[0],'A']

0.25677695927120076

- Selection by Position

In [26]:
df.iloc[3]

A    1.077950
B   -1.266518
C    0.680737
D    1.696298
Name: 2022-06-02 00:00:00, dtype: float64

In [27]:
df.iloc[[1,2,4],[0,2]]

,A,C
2022-05-31,0.860877,0.51094
2022-06-01,-0.044488,0.371233
2022-06-03,0.122883,0.431064


In [28]:
df.iloc[1:3,:]

,A,B,C,D
2022-05-31,0.860877,0.369131,0.51094,0.793585
2022-06-01,-0.044488,0.163224,0.371233,1.03993


In [29]:
df.iloc[1,1]

0.36913070868083586

In [30]:
df.iat[1,1]     # Fast access to a scalar (equivalent to the iloc call above).

0.36913070868083586

In [31]:
# ? df.iat

- Boolean Indexing

In [32]:
df[df.A > 0]

,A,B,C,D
2022-05-30,0.256777,0.519061,1.403401,0.42556
2022-05-31,0.860877,0.369131,0.51094,0.793585
2022-06-02,1.07795,-1.266518,0.680737,1.696298
2022-06-03,0.122883,-0.703422,0.431064,0.758467
2022-06-04,0.487377,-1.015252,0.366307,-0.340236


In [33]:
df[df > 0]

,A,B,C,D
2022-05-30,0.256777,0.519061,1.403401,0.42556
2022-05-31,0.860877,0.369131,0.51094,0.793585
2022-06-01,,0.163224,0.371233,1.03993
2022-06-02,1.07795,,0.680737,1.696298
2022-06-03,0.122883,,0.431064,0.758467
2022-06-04,0.487377,,0.366307,
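
Boolean masks can also be combined. The sketch below assumes a small made-up frame (`demo`) and uses the element-wise operators `&` and `|`:

```python
import pandas as pd

demo = pd.DataFrame({"A": [-1, 2, 3], "B": [5, -6, 7]})

# Parentheses are required: & and | bind tighter than the comparisons.
both_positive = demo[(demo["A"] > 0) & (demo["B"] > 0)]
either_negative = demo[(demo["A"] < 0) | (demo["B"] < 0)]

print(both_positive.index.tolist())    # [2]
print(either_negative.index.tolist())  # [0, 1]
```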


In [34]:
df2 = df.copy()

In [35]:
df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']

In [36]:
df2

,A,B,C,D,E
2022-05-30,0.256777,0.519061,1.403401,0.42556,one
2022-05-31,0.860877,0.369131,0.51094,0.793585,one
2022-06-01,-0.044488,0.163224,0.371233,1.03993,two
2022-06-02,1.07795,-1.266518,0.680737,1.696298,three
2022-06-03,0.122883,-0.703422,0.431064,0.758467,four
2022-06-04,0.487377,-1.015252,0.366307,-0.340236,three


In [37]:
df2[df2['E'].isin(['two','four'])]

,A,B,C,D,E
2022-06-01,-0.044488,0.163224,0.371233,1.03993,two
2022-06-03,0.122883,-0.703422,0.431064,0.758467,four


- Setting

In [38]:
s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20220530', periods=6))
s1

2022-05-30    1
2022-05-31    2
2022-06-01    3
2022-06-02    4
2022-06-03    5
2022-06-04    6
Freq: D, dtype: int64

In [39]:
df['F'] = s1

In [40]:
df

,A,B,C,D,F
2022-05-30,0.256777,0.519061,1.403401,0.42556,1
2022-05-31,0.860877,0.369131,0.51094,0.793585,2
2022-06-01,-0.044488,0.163224,0.371233,1.03993,3
2022-06-02,1.07795,-1.266518,0.680737,1.696298,4
2022-06-03,0.122883,-0.703422,0.431064,0.758467,5
2022-06-04,0.487377,-1.015252,0.366307,-0.340236,6


In [41]:
# Set a value by label
df.at[dates[0],'A'] = 0
df

,A,B,C,D,F
2022-05-30,0.0,0.519061,1.403401,0.42556,1
2022-05-31,0.860877,0.369131,0.51094,0.793585,2
2022-06-01,-0.044488,0.163224,0.371233,1.03993,3
2022-06-02,1.07795,-1.266518,0.680737,1.696298,4
2022-06-03,0.122883,-0.703422,0.431064,0.758467,5
2022-06-04,0.487377,-1.015252,0.366307,-0.340236,6


In [42]:
# Set a value by position
df.iat[0,1] = 0
df

,A,B,C,D,F
2022-05-30,0.0,0.0,1.403401,0.42556,1
2022-05-31,0.860877,0.369131,0.51094,0.793585,2
2022-06-01,-0.044488,0.163224,0.371233,1.03993,3
2022-06-02,1.07795,-1.266518,0.680737,1.696298,4
2022-06-03,0.122883,-0.703422,0.431064,0.758467,5
2022-06-04,0.487377,-1.015252,0.366307,-0.340236,6


In [43]:
# Set values by assigning a NumPy array
df.loc[:,'D'] = np.array([5] * len(df))
df

,A,B,C,D,F
2022-05-30,0.0,0.0,1.403401,5,1
2022-05-31,0.860877,0.369131,0.51094,5,2
2022-06-01,-0.044488,0.163224,0.371233,5,3
2022-06-02,1.07795,-1.266518,0.680737,5,4
2022-06-03,0.122883,-0.703422,0.431064,5,5
2022-06-04,0.487377,-1.015252,0.366307,5,6


In [44]:
# Setting with a where operation
df2 = df.copy()
df2

,A,B,C,D,F
2022-05-30,0.0,0.0,1.403401,5,1
2022-05-31,0.860877,0.369131,0.51094,5,2
2022-06-01,-0.044488,0.163224,0.371233,5,3
2022-06-02,1.07795,-1.266518,0.680737,5,4
2022-06-03,0.122883,-0.703422,0.431064,5,5
2022-06-04,0.487377,-1.015252,0.366307,5,6


In [45]:
df2[df2 > 0] = -df2
df2

,A,B,C,D,F
2022-05-30,0.0,0.0,-1.403401,-5,-1
2022-05-31,-0.860877,-0.369131,-0.51094,-5,-2
2022-06-01,-0.044488,-0.163224,-0.371233,-5,-3
2022-06-02,-1.07795,-1.266518,-0.680737,-5,-4
2022-06-03,-0.122883,-0.703422,-0.431064,-5,-5
2022-06-04,-0.487377,-1.015252,-0.366307,-5,-6
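
The same conditional replacement can be written without in-place assignment. A minimal sketch, assuming a small made-up frame (`demo`), using `DataFrame.mask` and `DataFrame.where`:

```python
import pandas as pd

demo = pd.DataFrame({"A": [1.0, -2.0], "B": [-3.0, 4.0]})

# mask replaces the values where the condition is True.
negated = demo.mask(demo > 0, -demo)
# where keeps the values where the condition is True and replaces the rest.
same = demo.where(demo <= 0, -demo)

print(negated.equals(same))     # True
print(negated["A"].tolist())    # [-1.0, -2.0]
```

Both calls return a new DataFrame and leave `demo` untouched.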


## 4. Missing Data

> - Pandas primarily uses np.nan to represent missing data. 
<br/>By default, it is excluded from computations. 
<br/>See the Missing Data section.
> - Reindexing lets you change, add, or delete the index on a specified axis. 
<br/>Reindexing returns a copy of the data.
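
As a small illustration of these points, the sketch below reindexes a made-up frame (`demo` and its labels are assumptions): labels that are not requested are dropped, new labels are filled with NaN, and the original is left unchanged.

```python
import pandas as pd

demo = pd.DataFrame({"A": [1, 2, 3]}, index=["x", "y", "z"])

# Keep "x", drop "y" and "z", and add a new row "w" and a new column "E".
reindexed = demo.reindex(index=["x", "w"], columns=["A", "E"])

print(int(reindexed.isna().sum().sum()))   # 3 missing cells: (x,E), (w,A), (w,E)
print(demo.shape)                          # (3, 1): the original is unchanged
```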

In [46]:
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df1

,A,B,C,D,F,E
2022-05-30,0.0,0.0,1.403401,5,1,
2022-05-31,0.860877,0.369131,0.51094,5,2,
2022-06-01,-0.044488,0.163224,0.371233,5,3,
2022-06-02,1.07795,-1.266518,0.680737,5,4,


In [47]:
df1.loc[dates[0]:dates[1],'E'] = 1
df1

,A,B,C,D,F,E
2022-05-30,0.0,0.0,1.403401,5,1,1.0
2022-05-31,0.860877,0.369131,0.51094,5,2,1.0
2022-06-01,-0.044488,0.163224,0.371233,5,3,
2022-06-02,1.07795,-1.266518,0.680737,5,4,


- Handling rows with missing data: Drop / Fill

In [48]:
df1.dropna(how='any')

,A,B,C,D,F,E
2022-05-30,0.0,0.0,1.403401,5,1,1.0
2022-05-31,0.860877,0.369131,0.51094,5,2,1.0


In [49]:
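# fillna returns a new DataFrame; df1 itself is unchanged, as the output shows.
# Assign the result (df1 = df1.fillna(value=5)) to keep the filled values.
df1.fillna(value=5)
df1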

,A,B,C,D,F,E
2022-05-30,0.0,0.0,1.403401,5,1,1.0
2022-05-31,0.860877,0.369131,0.51094,5,2,1.0
2022-06-01,-0.044488,0.163224,0.371233,5,3,
2022-06-02,1.07795,-1.266518,0.680737,5,4,
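
Column E above still contains missing values because `fillna` returns a new object rather than modifying its input. A minimal sketch with a made-up frame (`demo` is an assumption):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"E": [1.0, np.nan]})

filled = demo.fillna(value=5)         # new DataFrame; demo still holds the NaN

print(int(demo["E"].isna().sum()))    # 1
print(filled["E"].tolist())           # [1.0, 5.0]
```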


In [50]:
pd.isna(df1)

,A,B,C,D,F,E
2022-05-30,False,False,False,False,False,False
2022-05-31,False,False,False,False,False,False
2022-06-01,False,False,False,False,False,True
2022-06-02,False,False,False,False,False,True


## 5. Operation

> Stats
> - Operations generally exclude missing data.
> - Performs descriptive statistics.
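
A quick sketch of that exclusion, using a made-up Series (`s_demo` is an assumption) and the `skipna` parameter of `mean`:

```python
import numpy as np
import pandas as pd

s_demo = pd.Series([1.0, np.nan, 3.0])

print(s_demo.mean())               # 2.0, the NaN is skipped
print(s_demo.mean(skipna=False))   # nan
print(s_demo.count())              # 2, only non-missing values are counted
```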

In [51]:
df

,A,B,C,D,F
2022-05-30,0.0,0.0,1.403401,5,1
2022-05-31,0.860877,0.369131,0.51094,5,2
2022-06-01,-0.044488,0.163224,0.371233,5,3
2022-06-02,1.07795,-1.266518,0.680737,5,4
2022-06-03,0.122883,-0.703422,0.431064,5,5
2022-06-04,0.487377,-1.015252,0.366307,5,6


In [52]:
df.mean()

A    0.417433
B   -0.408806
C    0.627280
D    5.000000
F    3.500000
dtype: float64

In [53]:
df.mean(1)

2022-05-30    1.480680
2022-05-31    1.748190
2022-06-01    1.697994
2022-06-02    1.898434
2022-06-03    1.970105
2022-06-04    2.167686
Freq: D, dtype: float64

In [54]:
# ?df.mean
help(df.mean)

Help on method mean in module pandas.core.generic:

mean(axis: 'int | None | lib.NoDefault' = <no_default>, skipna=True, level=None, numeric_only=None, **kwargs) method of pandas.core.frame.DataFrame instance
    Return the mean of the values over the requested axis.
    
    Parameters
    ----------
    axis : {index (0), columns (1)}
        Axis for the function to be applied on.
    skipna : bool, default True
        Exclude NA/null values when computing the result.
    level : int or level name, default None
        If the axis is a MultiIndex (hierarchical), count along a
        particular level, collapsing into a Series.
    numeric_only : bool, default None
        Include only float, int, boolean columns. If None, will attempt to use
        everything, then use only numeric data. Not implemented for Series.
    **kwargs
        Additional keyword arguments to be passed to the function.
    
    Returns
    -------
    Series or DataFrame (if level specified)



In [55]:
s = pd.Series([1,3,5,np.nan,6,8], index=dates).shift(2)
s

2022-05-30    NaN
2022-05-31    NaN
2022-06-01    1.0
2022-06-02    3.0
2022-06-03    5.0
2022-06-04    NaN
Freq: D, dtype: float64

In [56]:
df.sub(s, axis='index')

,A,B,C,D,F
2022-05-30,,,,,
2022-05-31,,,,,
2022-06-01,-1.044488,-0.836776,-0.628767,4.0,2.0
2022-06-02,-1.92205,-4.266518,-2.319263,2.0,1.0
2022-06-03,-4.877117,-5.703422,-4.568936,0.0,0.0
2022-06-04,,,,,
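
In the result above, the Series is aligned with the row labels and subtracted from every column; rows where the Series is NaN become NaN throughout. A minimal sketch with made-up data (`demo` and `offsets` are assumptions):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"A": [10.0, 20.0, 30.0], "B": [1.0, 2.0, 3.0]},
                    index=["r0", "r1", "r2"])
offsets = pd.Series([1.0, np.nan, 3.0], index=["r0", "r1", "r2"])

# axis="index" matches the Series labels against the row labels.
result = demo.sub(offsets, axis="index")

print(result.loc["r0"].tolist())        # [9.0, 0.0]
print(result.loc["r1"].isna().all())    # True
```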


In [57]:
df

,A,B,C,D,F
2022-05-30,0.0,0.0,1.403401,5,1
2022-05-31,0.860877,0.369131,0.51094,5,2
2022-06-01,-0.044488,0.163224,0.371233,5,3
2022-06-02,1.07795,-1.266518,0.680737,5,4
2022-06-03,0.122883,-0.703422,0.431064,5,5
2022-06-04,0.487377,-1.015252,0.366307,5,6


> Apply
> - Applies a function to the data.

In [58]:
df.apply(np.cumsum)

,A,B,C,D,F
2022-05-30,0.0,0.0,1.403401,5,1
2022-05-31,0.860877,0.369131,1.914341,10,3
2022-06-01,0.816389,0.532354,2.285573,15,6
2022-06-02,1.894339,-0.734164,2.966311,20,10
2022-06-03,2.017223,-1.437585,3.397375,25,15
2022-06-04,2.5046,-2.452838,3.763682,30,21


In [59]:
df.apply(lambda x: x.max() - x.min())

A    1.122438
B    1.635649
C    1.037094
D    0.000000
F    5.000000
dtype: float64
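
By default `apply` passes each column to the function. As a sketch with a made-up frame (`demo` is an assumption), `axis=1` passes each row instead:

```python
import pandas as pd

demo = pd.DataFrame({"A": [1, 4], "B": [3, 10]})

col_range = demo.apply(lambda x: x.max() - x.min())          # one value per column
row_range = demo.apply(lambda x: x.max() - x.min(), axis=1)  # one value per row

print(col_range.tolist())   # [3, 7]
print(row_range.tolist())   # [2, 6]
```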

> Histogramming
> - A histogram is a graphical representation of a frequency distribution table.
> - That is, it presents tabulated frequency counts as a chart.

In [60]:
s = pd.Series(np.random.randint(0, 7, size=10))
s

0    1
1    1
2    3
3    5
4    2
5    0
6    2
7    1
8    1
9    2
dtype: int32

In [61]:
s.value_counts()

1    4
2    3
3    1
5    1
0    1
dtype: int64
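
`value_counts` orders the result by frequency. The sketch below reuses the ten values printed above as a fixed list (so it does not depend on the random draw) and shows two variations: ordering by value and relative frequencies.

```python
import pandas as pd

s_demo = pd.Series([1, 1, 3, 5, 2, 0, 2, 1, 1, 2])

by_value = s_demo.value_counts().sort_index()    # ordered by value, not by frequency
shares = s_demo.value_counts(normalize=True)     # relative frequencies

print(by_value.index.tolist())   # [0, 1, 2, 3, 5]
print(by_value.tolist())         # [1, 4, 3, 1, 1]
print(shares.loc[1])             # 0.4
```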

> String Methods
> - Series is equipped with a set of string processing methods in the str attribute that make it easy to operate on each element of the array, as in the code below.
> - Note that pattern matching in str generally uses regular expressions by default (and in some cases always uses them).
> - See Vectorized String Methods for more details.
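
To illustrate the regular-expression default, here is a small sketch with made-up strings (`s_demo` is an assumption), using `str.contains` and its `regex` parameter:

```python
import pandas as pd

s_demo = pd.Series(["a.b", "axb"])

as_regex = s_demo.str.contains("a.b")                  # "." matches any character
as_literal = s_demo.str.contains("a.b", regex=False)   # "." is a literal dot

print(as_regex.tolist())     # [True, True]
print(as_literal.tolist())   # [True, False]
```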

In [62]:
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s

0       A
1       B
2       C
3    Aaba
4    Baca
5     NaN
6    CABA
7     dog
8     cat
dtype: object

In [63]:
s.str.lower()

0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object

In [64]:
s.str.upper()

0       A
1       B
2       C
3    AABA
4    BACA
5     NaN
6    CABA
7     DOG
8     CAT
dtype: object

In [65]:
s.str.swapcase()

0       a
1       b
2       c
3    aABA
4    bACA
5     NaN
6    caba
7     DOG
8     CAT
dtype: object

<hr>
<marquee><font size=3 color='brown'>The BigpyCraft find the information to design valuable society with Technology & Craft.</font></marquee>
<div align='right'><font size=2 color='gray'> &lt; The End &gt; </font></div>