### Pandas: Loading data into a DataFrame
read_csv(): loads a data file as a **DataFrame**

In [3]:
import pandas as pd
csv_test = pd.read_csv("./res/test_csv_file.csv")
csv_test

Unnamed: 0,ID,LAST_NAME,AGE
0,1,KIM,30
1,2,CHOI,25
2,3,LEE,41
3,4,PARK,19
4,5,LIM,36


### Pandas: Loading into a DataFrame and handling the header

In [7]:
# txt_test = pd.read_csv("./res/test_text_file.txt", sep="|", index_col=0) 
# txt_test = pd.read_csv("./res/test_text_file.txt", sep="|", index_col="id") # Error: column names are case-sensitive
txt_test = pd.read_csv("./res/test_text_file.txt", sep="|", index_col="ID") 
txt_test

ID,A,B,C,D
C1,1,2,3,4
C2,5,6,7,8
C3,1,3,5,7


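Row index = 0, 1, 2 → identifies each record and gives its position<br>
Column index = ID, A, B, C, D → ID already identifies each row, so the default row index is unnecessary<br>
pd.read_csv(path, sep=delimiter, **index_col**=position or name of the column to use as the row index)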
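
The two forms of `index_col` described above can be compared side by side. This is a minimal sketch: the in-memory string `raw` is a hypothetical stand-in for `test_text_file.txt`, assumed to have the same pipe-delimited layout.

```python
import io
import pandas as pd

# Hypothetical stand-in for the pipe-delimited text file
raw = "ID|A|B|C|D\nC1|1|2|3|4\nC2|5|6|7|8\nC3|1|3|5|7\n"

# index_col accepts a column position ...
by_position = pd.read_csv(io.StringIO(raw), sep="|", index_col=0)
# ... or a column name (matching is case-sensitive)
by_name = pd.read_csv(io.StringIO(raw), sep="|", index_col="ID")

print(by_position.equals(by_name))  # both calls build the same DataFrame
print(list(by_name.index))          # ID values are now the row labels
print(by_name.loc["C2", "B"])       # look up a cell by row label and column name
```

Once ID is the row index, rows can be addressed by their ID value rather than by position.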

In [17]:
# Assign a header to data that has none
text = pd.read_csv("./res/text_without_column_name.txt", sep="|", header=None, names=["ID", "A", "B", "C", "D"]) 
text

Unnamed: 0,ID,A,B,C,D
0,C1,1,2,3,4
1,C2,5,6,7,8
2,C3,1,3,5,7
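
To see why `header=None` matters, the sketch below reads the same headerless text twice. The string `raw` is a hypothetical stand-in for `text_without_column_name.txt`, assumed to contain only the three data rows.

```python
import io
import pandas as pd

# Hypothetical stand-in for a headerless, pipe-delimited file
raw = "C1|1|2|3|4\nC2|5|6|7|8\nC3|1|3|5|7\n"

# Default behaviour: the first data row is consumed as the header
wrong = pd.read_csv(io.StringIO(raw), sep="|")

# header=None keeps every row as data; names supplies the column labels
right = pd.read_csv(io.StringIO(raw), sep="|", header=None,
                    names=["ID", "A", "B", "C", "D"])

print(len(wrong), len(right))  # row counts differ by one
print(list(wrong.columns))     # the first record ended up as column names
print(list(right.columns))
```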


In [18]:
type(text)
text.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 5 columns):
ID    3 non-null object
A     3 non-null int64
B     3 non-null int64
C     3 non-null int64
D     3 non-null int64
dtypes: int64(4), object(1)
memory usage: 200.0+ bytes


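**[Important] info()**: the pandas function that summarizes a DataFrame (index, columns, non-null counts, dtypes, memory usage). In R, str() serves the same purpose. <br>
object: string → there is no separate string type here, so text columns are reported as object<br><br>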

### Pandas: Saving data

In [19]:
data = {"ID": ["A1", "A2", "A3"],
       "X1": [10, 20, 30],
       "X2": [1.1, 2.2, 3.3]}

In [21]:
df = pd.DataFrame(data, index=["a1", "a2", "a3"]) # specify the row index
df

Unnamed: 0,ID,X1,X2
a1,A1,10,1.1
a2,A2,20,2.2
a3,A3,30,3.3


In [24]:
# Add a row to the DataFrame (via the index)
df2 = df.reindex(["a1", "a2", "a3", "a4"])
df2 

Unnamed: 0,ID,X1,X2
a1,A1,10.0,1.1
a2,A2,20.0,2.2
a3,A3,30.0,3.3
a4,,,


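In pandas, NaN marks a missing value (no value present). In deep learning, NaN usually signals divergence (values reaching inf), which calls for adjusting the learning rate during training. A NaN can arise in three ways:
1. The data really does not exist
2. The user left the data out
3. The value is deliberately marked as missing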
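
Whatever the cause, missing values can be located and counted with `isna()`. The following is a minimal sketch on a small frame shaped like the one above; setting a cell to `np.nan` is one assumed way to illustrate case 3.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"X1": [10, 20, 30], "X2": [1.1, 2.2, 3.3]},
                  index=["a1", "a2", "a3"])

# Label a4 has no data, so every column gets NaN for that row
df2 = df.reindex(["a1", "a2", "a3", "a4"])
print(df2.isna())        # boolean mask of missing cells
print(df2.isna().sum())  # number of missing cells per column

# Case 3: deliberately mark an existing value as missing
df2.loc["a1", "X2"] = np.nan
print(df2["X2"].isna().sum())
```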

In [25]:
df2.to_csv("./res/dt2.csv")

![./asset/1.PNG](./asset/1.PNG)

In [31]:
df2.to_csv("./res/dt2_.csv", sep="?")

![./asset/2.PNG](./asset/2.PNG)

In [29]:
df2.to_csv("./res/dt2_.csv", sep=",") # sep="," is the default and can be omitted

In [33]:
df2.to_csv("./res/dt2.csv", sep=",", na_rep="NaN") # na_rep: text written in place of NaN

![./asset/3.PNG](./asset/3.PNG)
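
The effect of `na_rep` can also be checked without opening the file: when no path is given, `to_csv()` returns the CSV text as a string. The sketch below rebuilds the same small frame for illustration.

```python
import pandas as pd

df = pd.DataFrame({"ID": ["A1", "A2", "A3"],
                   "X1": [10, 20, 30],
                   "X2": [1.1, 2.2, 3.3]},
                  index=["a1", "a2", "a3"])
df2 = df.reindex(["a1", "a2", "a3", "a4"])  # row a4 is all NaN

# With no path argument, to_csv() returns the CSV text instead of writing a file
default_text = df2.to_csv()
marked_text = df2.to_csv(na_rep="NaN")

print(default_text)  # missing cells are written as empty fields
print(marked_text)   # missing cells are written as the text NaN
```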

### DataFrame attributes

In [38]:
import numpy as np
df1 = pd.DataFrame(np.arange(12).reshape(3, 4),
                  index=["r0", "r1", "r2"],
                  columns=["c0", "c1", "c2", "c3"]) 
# Generate 12 values, reshape them into a 3x4 array, then convert to a DataFrame
df1

Unnamed: 0,c0,c1,c2,c3
r0,0,1,2,3
r1,4,5,6,7
r2,8,9,10,11


In [37]:
df1.T

Unnamed: 0,r0,r1,r2
c0,0,4,8
c1,1,5,9
c2,2,6,10
c3,3,7,11


In [42]:
df1.axes

[Index(['r0', 'r1', 'r2'], dtype='object'),
 Index(['c0', 'c1', 'c2', 'c3'], dtype='object')]

In [43]:
df1.dtypes

c0    object
c1    object
c2    object
c3    object
dtype: object

In [44]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, r0 to r2
Data columns (total 4 columns):
c0    3 non-null object
c1    3 non-null object
c2    3 non-null object
c3    3 non-null object
dtypes: object(4)
memory usage: 120.0+ bytes


In [45]:
df1.shape

(3, 4)

In [46]:
df1.size # number of elements

12

### Pandas: Extracting data

In [116]:
df2 = pd.DataFrame({"c1": ["a", "a", "b", "b", "c"],
                   "V1": np.arange(5),
                   "V2": np.random.randn(5)},
                  index=["r0", "r1", "r2", "r3", "r4"]) 
df2

Unnamed: 0,c1,V1,V2
r0,a,0,0.903752
r1,a,1,1.121822
r2,b,2,1.368453
r3,b,3,-0.81872
r4,c,4,1.628203


In [53]:
df2.index

Index(['r0', 'r1', 'r2', 'r3', 'r4'], dtype='object')

In [61]:
df2.ix[2:]

.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
  """Entry point for launching an IPython kernel.


Unnamed: 0,c1,V1,V2
r2,b,2,0.680758
r3,b,3,0.639052
r4,c,4,-0.564201


In [62]:
df2.ix[2] # extract a specific row

c1           b
V1           2
V2    0.680758
Name: r2, dtype: object

In [63]:
df2.ix["r2"]

c1           b
V1           2
V2    0.680758
Name: r2, dtype: object

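With ix, both row index labels and integer positions can be used. As the warning above notes, ix is deprecated (it was removed in pandas 1.0): use .loc for labels and .iloc for positions.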
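
The deprecation warning recommends `.loc` and `.iloc`. Below is a minimal sketch of how the three `.ix` calls above would be written with them, using a frame built the same way as `df2` (the random values will differ from the outputs shown).

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({"c1": ["a", "a", "b", "b", "c"],
                    "V1": np.arange(5),
                    "V2": np.random.randn(5)},
                   index=["r0", "r1", "r2", "r3", "r4"])

print(df2.iloc[2:])   # replaces df2.ix[2:]: rows from position 2 onward
print(df2.iloc[2])    # replaces df2.ix[2]: the row at position 2, as a Series
print(df2.loc["r2"])  # replaces df2.ix["r2"]: the row labelled "r2"
```

Keeping labels and positions in separate accessors removes the ambiguity `.ix` had when the index itself contained integers.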

## Data analysis: Exploratory data analysis (EDA) and indexing

In [64]:
df2.head()

Unnamed: 0,c1,V1,V2
r0,a,0,-1.966041
r1,a,1,0.431947
r2,b,2,0.680758
r3,b,3,0.639052
r4,c,4,-0.564201


In [66]:
df2.head(3) # take the given number of rows from the top

Unnamed: 0,c1,V1,V2
r0,a,0,-1.966041
r1,a,1,0.431947
r2,b,2,0.680758


In [67]:
df2.tail(3) # take the given number of rows from the bottom

Unnamed: 0,c1,V1,V2
r2,b,2,0.680758
r3,b,3,0.639052
r4,c,4,-0.564201


In [70]:
df2.columns

Index(['c1', 'V1', 'V2'], dtype='object')

In [71]:
df2["V1"]

r0    0
r1    1
r2    2
r3    3
r4    4
Name: V1, dtype: int32

* To extract several columns, wrap the column names in a list ([])

In [72]:
df2[["V1", "V2"]]

Unnamed: 0,V1,V2
r0,0,-1.966041
r1,1,0.431947
r2,2,0.680758
r3,3,0.639052
r4,4,-0.564201


In [75]:
# Extracting a single column by name returns a Series
# Selecting with a list of column names returns a DataFrame
type(df2["V1"]) 

pandas.core.series.Series

In [76]:
type(df2[["V1"]]) # double brackets return a single column as a DataFrame

pandas.core.frame.DataFrame

### Pandas: Using reindex
* Labels that have no existing data are filled with NaN

In [117]:
newindex=["r0", "r1", "r2", "r5", "r6"]
df2_1 = df2.reindex(newindex, fill_value=1) # fill_value: value used instead of NaN for new labels
df2_1

Unnamed: 0,c1,V1,V2
r0,a,0,0.903752
r1,a,1,1.121822
r2,b,2,1.368453
r5,1,1,1.0
r6,1,1,1.0


In [118]:
df2_1.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, r0 to r6
Data columns (total 3 columns):
c1    5 non-null object
V1    5 non-null int32
V2    5 non-null float64
dtypes: float64(1), int32(1), object(1)
memory usage: 140.0+ bytes


In [99]:
df2_2 = df2.reindex(newindex, fill_value="missing") 
# Filling with a string changes the dtype of the affected columns to object
df2_2

Unnamed: 0,c1,V1,V2
r0,a,0,-1.96604
r1,a,1,0.431947
r2,b,2,0.680758
r5,missing,missing,missing
r6,missing,missing,missing


In [110]:
df2_2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, r0 to r6
Data columns (total 3 columns):
c1    5 non-null object
V1    5 non-null object
V2    5 non-null object
dtypes: object(3)
memory usage: 160.0+ bytes


### Using dates as the row index

In [105]:
dindex = pd.date_range("07/02/2019", periods=5, freq= "D") # month/day/year
dindex

DatetimeIndex(['2019-07-02', '2019-07-03', '2019-07-04', '2019-07-05',
               '2019-07-06'],
              dtype='datetime64[ns]', freq='D')
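
Because a string such as "07/02/2019" is read as month/day/year, it is easy to misread as 7 February. As a small sketch, the same range can be written with an ISO 8601 date (year-month-day), which avoids the ambiguity:

```python
import pandas as pd

us_style = pd.date_range("07/02/2019", periods=5, freq="D")   # month/day/year
iso_style = pd.date_range("2019-07-02", periods=5, freq="D")  # year-month-day

print(us_style.equals(iso_style))   # both describe the same five days
print(iso_style[0], iso_style[-1])  # first and last timestamps in the range
```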

In [119]:
df2 = pd.DataFrame({"c1": [1, 2, 3, 4, 5]}, index=dindex)
df2

Unnamed: 0,c1
2019-07-02,1
2019-07-03,2
2019-07-04,3
2019-07-05,4
2019-07-06,5


In [108]:
dindex2 = pd.date_range("06/30/2019", periods=10, freq="D")
dindex2

DatetimeIndex(['2019-06-30', '2019-07-01', '2019-07-02', '2019-07-03',
               '2019-07-04', '2019-07-05', '2019-07-06', '2019-07-07',
               '2019-07-08', '2019-07-09'],
              dtype='datetime64[ns]', freq='D')

In [120]:
df2.reindex(dindex2)

Unnamed: 0,c1
2019-06-30,
2019-07-01,
2019-07-02,1.0
2019-07-03,2.0
2019-07-04,3.0
2019-07-05,4.0
2019-07-06,5.0
2019-07-07,
2019-07-08,
2019-07-09,


In [121]:
df2 = pd.DataFrame({"c1": [1, 2, 3, 4, 5]}, index=dindex)
df2.reindex(dindex2, method="ffill") 
# ffill: forward fill
# Moving forward (↓), each new label takes the value of the nearest preceding original label

Unnamed: 0,c1
2019-06-30,
2019-07-01,
2019-07-02,1.0
2019-07-03,2.0
2019-07-04,3.0
2019-07-05,4.0
2019-07-06,5.0
2019-07-07,5.0
2019-07-08,5.0
2019-07-09,5.0


In [122]:
df2 = pd.DataFrame({"c1": [1, 2, 3, 4, 5]}, index=dindex)
df2.reindex(dindex2, method="bfill") # bfill: backward fill, each new label takes the value of the nearest following original label

Unnamed: 0,c1
2019-06-30,1.0
2019-07-01,1.0
2019-07-02,1.0
2019-07-03,2.0
2019-07-04,3.0
2019-07-05,4.0
2019-07-06,5.0
2019-07-07,
2019-07-08,
2019-07-09,
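
One detail worth checking is that `method=` in `reindex` compares index labels only, so it does not fill a NaN that was already in the data. The sketch below contrasts it with calling `ffill()` after reindexing; the NaN placed in the original data is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

dindex = pd.date_range("2019-07-02", periods=5, freq="D")
dindex2 = pd.date_range("2019-06-30", periods=10, freq="D")

# Assumed example data: the value for 2019-07-03 is already missing
df = pd.DataFrame({"c1": [1, np.nan, 3, 4, 5]}, index=dindex)

# Fills only labels that are new in dindex2, based on index order
by_label = df.reindex(dindex2, method="ffill")

# Reindexes first, then fills every NaN from the previous row's value
by_value = df.reindex(dindex2).ffill()

day = pd.Timestamp("2019-07-03")
print(by_label.loc[day, "c1"])  # the pre-existing NaN is kept
print(by_value.loc[day, "c1"])  # the pre-existing NaN is filled from 2019-07-02
```

Which form to use depends on whether gaps in the original data should stay visible after reindexing.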
