### Training Data Processing - Pandas Basics
---

### < Converting to a DataFrame >
1. Dictionary
    - Key: column name
    - Value: the data in that column
2. List
    - Each inner list is entered as a row
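As a minimal sketch of the two conversions above, the snippet below builds the same table once from a dictionary and once from a list of lists. The names `from_dict` and `from_list` and the sample values are made up for illustration.

```python
import pandas as pd

# Dictionary: each key becomes a column name, each value list fills that column.
from_dict = pd.DataFrame({'c0': [1, 2, 3], 'c1': [4, 5, 6]})

# List of lists: each inner list becomes one row; column names are passed separately.
from_list = pd.DataFrame([[1, 4], [2, 5], [3, 6]], columns=['c0', 'c1'])

print(from_dict.shape)               # (3, 2)
print(from_dict.equals(from_list))   # True
```

Both calls produce the same 3x2 table; only the orientation of the input differs.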

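### < Deleting Rows and Columns >
1. drop()
    - Delete rows: axis=0   (default: axis=0)
    - Delete columns: axis=1
2. To delete several rows or columns, pass their labels as a list
3. drop() returns a new object; add the inplace=True option to change the original object as well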
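The three points above can be sketched as follows. The DataFrame `scores` and its labels are hypothetical sample data, not part of the notebook.

```python
import pandas as pd

scores = pd.DataFrame(
    {'math': [90, 80, 70], 'art': [60, 50, 40], 'music': [30, 20, 10]},
    index=['a', 'b', 'c'],
)

rows_dropped = scores.drop(['a', 'c'])                # axis=0 is the default: rows
cols_dropped = scores.drop(['art', 'music'], axis=1)  # axis=1: columns

# The two calls above return new objects, so `scores` still has all three rows.
scores.drop('b', inplace=True)                        # now the original changes

print(list(rows_dropped.index))     # ['b']
print(list(cols_dropped.columns))   # ['math']
print(list(scores.index))           # ['a', 'c']
```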

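### < Selecting Data >
- Selecting rows
    1. loc: selects rows by index label. ```['a':'c'] -> 'a','b','c'``` are returned (the end label is included)
    2. iloc: uses integer positions. ```[3:7] -> 3,4,5,6``` are returned (the end position is excluded)
- Selecting columns
    1. One column: returns a Series. ```DataFrame['column name']``` or ```DataFrame.column_name``` (attribute access works only when the name is a valid Python identifier)
    2. n columns: returns a DataFrame. ```DataFrame[[col1, col2, ... ]]```
- Selecting elements
    1. Index labels: ```DataFrame.loc[row label, column label]```
    2. Integer positions: ```DataFrame.iloc[row number, column number]```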
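The difference between the two slice rules is easy to check on a small frame. This is a sketch with a hypothetical DataFrame `letters` whose index labels are 'a' through 'h'.

```python
import pandas as pd

letters = pd.DataFrame({'value': range(8)}, index=list('abcdefgh'))

by_label = letters.loc['a':'c']    # label slice: the end label 'c' is included
by_position = letters.iloc[3:7]    # position slice: the end position 7 is excluded

print(list(by_label.index))        # ['a', 'b', 'c']
print(list(by_position.index))     # ['d', 'e', 'f', 'g']

# Single element: by labels and by integer positions.
print(letters.loc['b', 'value'], letters.iloc[1, 0])   # 1 1
```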

In [1]:
import pandas as pd

## ▶ Creating a DataFrame

In [2]:
exam_data = {
    'name': ['No', 'Kim', 'Jung'],
    'AI-basic': [90, 90, 95],
    'DataStructure': [80, 97, 85],
    'English': [74, 60, 55],
    'WebPgm': [86, 100, 90],
}

In [3]:
df = pd.DataFrame(exam_data)
df

   name  AI-basic  DataStructure  English  WebPgm
0    No        90             80       74      86
1   Kim        90             97       60     100
2  Jung        95             85       55      90


In [4]:
type(df)

pandas.core.frame.DataFrame

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   name           3 non-null      object
 1   AI-basic       3 non-null      int64 
 2   DataStructure  3 non-null      int64 
 3   English        3 non-null      int64 
 4   WebPgm         3 non-null      int64 
dtypes: int64(4), object(1)
memory usage: 248.0+ bytes


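## ▶ Selecting Columns

### Selecting a single column
Returns the values as a Series; there is no column header, but the column name is kept as the Series name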

In [6]:
dataStructure = df['DataStructure']
type(dataStructure), dataStructure

(pandas.core.series.Series,
 0    80
 1    97
 2    85
 Name: DataStructure, dtype: int64)

In [7]:
english = df.English
english

0    74
1    60
2    55
Name: English, dtype: int64

### Selecting several columns: double brackets
Returned as a DataFrame

In [8]:
ai_web = df[['AI-basic', 'WebPgm']]  
# df['AI-basic', 'WebPgm']: with single brackets, the comma-separated names form one tuple key ('AI-basic', 'WebPgm'), which raises KeyError because no such column exists

In [9]:
type(ai_web), ai_web

(pandas.core.frame.DataFrame,
    AI-basic  WebPgm
 0        90      86
 1        90     100
 2        95      90)
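To see why the double brackets matter, the sketch below compares the two spellings on a small hypothetical frame `scores`. With single brackets, Python passes the tuple `('AI-basic', 'WebPgm')` as one key, and pandas looks for a single column with that label.

```python
import pandas as pd

scores = pd.DataFrame({'AI-basic': [90, 95], 'WebPgm': [86, 90]})

subset = scores[['AI-basic', 'WebPgm']]   # list of labels -> DataFrame

try:
    scores['AI-basic', 'WebPgm']          # same as scores[('AI-basic', 'WebPgm')]
    raised = False
except KeyError:
    raised = True                         # no column has that tuple as its label

print(type(subset).__name__, raised)      # DataFrame True
```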

## ▶ set_index
: Use a specific column as the new index

In [10]:
df

   name  AI-basic  DataStructure  English  WebPgm
0    No        90             80       74      86
1   Kim        90             97       60     100
2  Jung        95             85       55      90


In [11]:
df.set_index('name', inplace=True)
df  # the name column is now the index

      AI-basic  DataStructure  English  WebPgm
name
No          90             80       74      86
Kim         90             97       60     100
Jung        95             85       55      90


## ▶ Selecting Elements

### Selecting a single element of df

In [12]:
point1 = df.loc['No', 'English']
type(point1), point1

(numpy.int64, 74)

In [13]:
point2 = df.iloc[0, 2]
point2

74

### Selecting several elements of df

In [14]:
points = df.loc['No', ['English', 'WebPgm']]
type(points), points

(pandas.core.series.Series,
 English    74
 WebPgm     86
 Name: No, dtype: int64)

In [15]:
points2 = df.iloc[:2, 2:]
type(points2), points2

(pandas.core.frame.DataFrame,
       English  WebPgm
 name                 
 No         74      86
 Kim        60     100)

In [16]:
# Both lines fetch the same values: try both loc and iloc
points3 = df.loc[['No', 'Jung'], ['DataStructure', 'WebPgm']]
points4 = df.iloc[[0, -1], [1, -1]]

points3, points4

(      DataStructure  WebPgm
 name                       
 No               80      86
 Jung             85      90,
       DataStructure  WebPgm
 name                       
 No               80      86
 Jung             85      90)

## ▶ reindex

In [17]:
df

      AI-basic  DataStructure  English  WebPgm
name
No          90             80       74      86
Kim         90             97       60     100
Jung        95             85       55      90


In [18]:
test = df.reindex(["No", "Kang", "Kim", "Jung"])
test

      AI-basic  DataStructure  English  WebPgm
name
No        90.0           80.0     74.0    86.0
Kang       NaN            NaN      NaN     NaN
Kim       90.0           97.0     60.0   100.0
Jung      95.0           85.0     55.0    90.0


In [19]:
test1 = df.reindex(["No", "Kim", "Jung", "Lee"], fill_value=50)
test1

      AI-basic  DataStructure  English  WebPgm
name
No          90             80       74      86
Kim         90             97       60     100
Jung        95             85       55      90
Lee         50             50       50      50


## ▶ Adding Rows and Columns by Assignment
### Adding a column
- If only a single value is given, the same value is entered in every row

In [20]:
df

      AI-basic  DataStructure  English  WebPgm
name
No          90             80       74      86
Kim         90             97       60     100
Jung        95             85       55      90


In [21]:
df['ML'] = [80, 100, 100]
df

      AI-basic  DataStructure  English  WebPgm   ML
name
No          90             80       74      86   80
Kim         90             97       60     100  100
Jung        95             85       55      90  100


In [22]:
df['DL'] = 97
df

      AI-basic  DataStructure  English  WebPgm   ML  DL
name
No          90             80       74      86   80  97
Kim         90             97       60     100  100  97
Jung        95             85       55      90  100  97


### Adding a row: use loc
- df.loc[new row label] = new value

In [23]:
df.loc[3] = 0
df

      AI-basic  DataStructure  English  WebPgm   ML  DL
name
No          90             80       74      86   80  97
Kim         90             97       60     100  100  97
Jung        95             85       55      90  100  97
3            0              0        0       0    0   0


In [24]:
df.loc['Lee'] = [30, 40, 55, 70, 4, 70]
df

      AI-basic  DataStructure  English  WebPgm   ML  DL
name
No          90             80       74      86   80  97
Kim         90             97       60     100  100  97
Jung        95             85       55      90  100  97
3            0              0        0       0    0   0
Lee         30             40       55      70    4  70


In [25]:
df.iloc[3, 1] = 40   # change the value of one element
df

      AI-basic  DataStructure  English  WebPgm   ML  DL
name
No          90             80       74      86   80  97
Kim         90             97       60     100  100  97
Jung        95             85       55      90  100  97
3            0             40        0       0    0   0
Lee         30             40       55      70    4  70


### Adding a new row by copying an existing row

In [26]:
df.loc['Park'] = df.iloc[1]
df

      AI-basic  DataStructure  English  WebPgm   ML  DL
name
No          90             80       74      86   80  97
Kim         90             97       60     100  100  97
Jung        95             85       55      90  100  97
3            0             40        0       0    0   0
Lee         30             40       55      70    4  70
Park        90             97       60     100  100  97


## ▶ Changing Element Values

In [27]:
df1 = df.copy()

In [28]:
df1.iloc[0, 0]

90

In [29]:
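df1.iloc[0][0]  # chained indexing: works for reading here, but integer-position access on a label-indexed Series is deprecated in recent pandas and may raise an error later. Prefer df1.iloc[0, 0].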

90
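The two access styles above can be compared side by side. This is a sketch on a hypothetical frame `frame`; the chained form is written with an explicit second `.iloc` so that it does not depend on the deprecated integer fallback of `Series[...]`.

```python
import pandas as pd

frame = pd.DataFrame({'x': [1, 2], 'y': [3, 4]}, index=['r0', 'r1'])

direct = frame.iloc[0, 0]          # one indexing call on the DataFrame
chained = frame.iloc[0].iloc[0]    # row Series first, then positional access

# A single-call assignment always targets `frame` itself.
frame.iloc[0, 0] = 99

print(direct, chained, frame.iloc[0, 0])   # 1 1 99
```

For assignment in particular, the single-call form is the safer habit, because a chained expression may end up writing to a temporary object instead of the original.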

In [30]:
df1.iloc[0, 1:4] = [100, 100, 100]
df1

      AI-basic  DataStructure  English  WebPgm   ML  DL
name
No          90            100      100     100   80  97
Kim         90             97       60     100  100  97
Jung        95             85       55      90  100  97
3            0             40        0       0    0   0
Lee         30             40       55      70    4  70
Park        90             97       60     100  100  97


In [31]:
df1.iloc[[0, 1], 1:4] = [[30, 30, 30], [0, 0, 0]]
df1

      AI-basic  DataStructure  English  WebPgm   ML  DL
name
No          90             30       30      30   80  97
Kim         90              0        0       0  100  97
Jung        95             85       55      90  100  97
3            0             40        0       0    0   0
Lee         30             40       55      70    4  70
Park        90             97       60     100  100  97


In [32]:
df2 = df.T
df2

name           No  Kim  Jung   3  Lee  Park
AI-basic       90   90    95   0   30    90
DataStructure  80   97    85  40   40    97
English        74   60    55   0   55    60
WebPgm         86  100    90   0   70   100
ML             80  100   100   0    4   100
DL             97   97    97   0   70    97


## ▶ Modifying Data: Miscellaneous

In [33]:
dict_data = {'c0': [1, 32, 92],
             'c1': [7, 88, 54],
             'c2': [-7, 2, 50],
            }

In [34]:
df = pd.DataFrame(dict_data, index=['r0', 'r1', 'r2'])
df

    c0  c1  c2
r0   1   7  -7
r1  32  88   2
r2  92  54  50


### reindex(): with and without fill_value

In [35]:
new_index = ['r0', 'r1', 'r2', 'r3', 'r4']
ndf = df.reindex(new_index, fill_value=777)
ndf

     c0   c1   c2
r0    1    7   -7
r1   32   88    2
r2   92   54   50
r3  777  777  777
r4  777  777  777


In [36]:
ndf2 = df.reindex(new_index)
ndf2  # without fill_value the new rows are filled with NaN, so the integer columns are converted to float

      c0    c1    c2
r0   1.0   7.0  -7.0
r1  32.0  88.0   2.0
r2  92.0  54.0  50.0
r3   NaN   NaN   NaN
r4   NaN   NaN   NaN
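The dtype change can be confirmed directly. The sketch below uses a one-column hypothetical frame `base`: an integer `fill_value` keeps the column as integers, while the default fill of NaN forces a float column, since NaN is a floating-point value.

```python
import pandas as pd

base = pd.DataFrame({'c0': [1, 32, 92]}, index=['r0', 'r1', 'r2'])
new_index = ['r0', 'r1', 'r2', 'r3']

filled = base.reindex(new_index, fill_value=0)   # integer fill: dtype stays integer
with_nan = base.reindex(new_index)               # NaN fill: dtype becomes float

print(filled['c0'].dtype, with_nan['c0'].dtype)  # int64 float64
print(with_nan['c0'].isna().sum())               # 1
```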


### sort_index() and the ascending option

In [37]:
sort_df = df.sort_index(ascending=False)
df, sort_df

(    c0  c1  c2
 r0   1   7  -7
 r1  32  88   2
 r2  92  54  50,
     c0  c1  c2
 r2  92  54  50
 r1  32  88   2
 r0   1   7  -7)

### sort_values() and the ascending option

In [38]:
sort_value_df = df.sort_values(by='c1', ascending=False)  # sort in descending order by c1
sort_value_df

    c0  c1  c2
r1  32  88   2
r2  92  54  50
r0   1   7  -7


## ▶ Shape mismatch from `[3][:2]`: caused by list slicing rather than the pandas version. Not recommended

In [39]:
df1.loc[[3][:2]] = [10, 100]
df1

ValueError: shape mismatch: value array of shape (2,)  could not be broadcast to indexing result of shape (1,6)

In [43]:
df1.loc[[3][:2]]  # [3][:2] slices the Python list [3], which is still [3], so the entire row is selected. That is why the assignment above fails.

      AI-basic  DataStructure  English  WebPgm  ML  DL
name
3            0             40        0       0   0   0
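If the intent was to overwrite only the first two values of row `3`, one way to do it is to name the target columns in the same `loc` call, so that the number of values matches the selection. This is a sketch on a smaller hypothetical frame `grades`, not the notebook's `df1`, and it assumes that this was the intent of the failing cell.

```python
import pandas as pd

grades = pd.DataFrame(
    {'AI-basic': [90, 0], 'DataStructure': [97, 40], 'English': [60, 0]},
    index=['Kim', 3],
)

print([3][:2])   # [3]  (slicing a one-element list leaves it unchanged)

# Two columns selected, two values assigned: the shapes now match.
grades.loc[3, ['AI-basic', 'DataStructure']] = [10, 100]
# Position-based equivalent: grades.iloc[1, :2] = [10, 100]

print(grades.loc[3].tolist())   # [10, 100, 0]
```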
