# Combining Data and Selecting Subsets

### Main Topics

1. Combining data
2. Selecting subsets with index and columns
3. Selecting observations with conditions

<br>

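### Goals
1. Combine multiple datasets using an appropriate method.
2. Select a subset of the data using variable names and similar identifiers.
3. Select a subset of observations using conditions suited to the topic.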


<br>
<hr>
<br>


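## 1. Combining Data

### 1.1. Row-wise combination of identically structured data with concat( )

When several datasets share the same structure and differ only in period or product, combine them with pandas *concat()*.  
Inside the function, `axis=0` stacks rows (appends one dataset below another), and `axis=1` combines columns side by side.  
`axis=0` is the default and can be omitted.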
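The difference between the two `axis` values is easiest to see on tiny data. The following is a minimal sketch with made-up DataFrames (`df_a` and `df_b` are illustrative names, not part of the course data):

```python
import pandas as pd

# Two tiny DataFrames with the same columns (made-up values)
df_a = pd.DataFrame({"id": [1, 2], "price": [100, 200]})
df_b = pd.DataFrame({"id": [3, 4], "price": [300, 400]})

rows = pd.concat([df_a, df_b])           # axis=0 (default): stack rows
cols = pd.concat([df_a, df_b], axis=1)   # axis=1: place side by side

print(rows.shape)  # (4, 2)
print(cols.shape)  # (2, 4)
```

Note that the row-wise result keeps each input's original index (0, 1, 0, 1), which is why the index is reset later in this section.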

In [None]:
import pandas as pd

In [None]:
# Row-wise combination
    ## Source: Ministry of Land, Infrastructure and Transport, actual transaction prices (http://rtdown.molit.go.kr/)
df_apt1 = pd.read_csv('data/아파트(매매)__실거래가_20210902153616.csv', skiprows=15, encoding='CP949')
df_apt1

In [None]:
df_apt2 = pd.read_csv('data/아파트(매매)__실거래가_20210902153636.csv', skiprows=15, encoding='CP949')
df_apt2

In [None]:
df_apt3 = pd.read_csv('data/아파트(매매)__실거래가_20210902153655.csv', skiprows=15, encoding='CP949')
df_apt3

In [11]:
df_apt = pd.concat([df_apt1, df_apt2, df_apt3])
df_apt

Unnamed: 0,시군구,번지,본번,부번,단지명,전용면적(㎡),계약년월,계약일,거래금액(만원),층,건축년도,도로명,해제사유발생일
0,서울특별시 강남구 개포동,1282,1282,0,개포래미안포레스트,59.9200,202108,21,199500,6,2020,개포로 264,
1,서울특별시 강남구 개포동,185,185,0,개포주공 7단지,83.7000,202108,20,280000,2,1983,개포로 516,
2,서울특별시 강남구 개포동,138,138,0,디에이치아너힐즈,59.8732,202108,17,233000,4,2019,삼성로 11,
3,서울특별시 강남구 개포동,1280,1280,0,래미안블레스티지,59.9670,202108,14,227000,10,2019,선릉로 8,
4,서울특별시 강남구 개포동,12,12,0,성원대치2단지아파트,49.8600,202108,1,169000,10,1992,개포로109길 9,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
71,서울특별시 송파구 풍납동,220-2,220,2,신동아파밀리에,59.3600,202108,1,85900,8,1993,풍성로6길 15,
72,서울특별시 송파구 풍납동,510,510,0,신성노바빌아파트,59.7800,202108,16,98500,15,2000,한가람로 468,
73,서울특별시 송파구 풍납동,401-1,401,1,쌍용,84.8500,202108,11,143000,15,1994,올림픽로47길 12,
74,서울특별시 송파구 풍납동,508,508,0,한강극동,84.7600,202108,25,111000,11,1995,토성로 38-6,


<br>

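> In a **DataFrame**, the **index**, which corresponds to the row labels, plays an important role.  
For example, you can inspect the index as shown below, and you can also select observations by specifying a particular index value. 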

In [None]:
df_apt.index

In [None]:
# Select the observation(s) with index 0
df_apt.loc[0]

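Because each source keeps its original index, several observations share index **0** after the combination.  
After combining rows or sorting, the index needs to be reassigned or reset. 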

In [None]:
# Reset the index with reset_index()
    ## drop=True discards the old index; drop=False (the default) keeps it as a new column
df_apt = df_apt.reset_index(drop=True)
df_apt

In [None]:
# Select the index 0 observation again
df_apt.loc[0]
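As an alternative to calling `reset_index(drop=True)` afterwards, `pd.concat()` accepts `ignore_index=True`, which renumbers the rows during the combination. A minimal sketch on made-up data:

```python
import pandas as pd

df_a = pd.DataFrame({"x": [1, 2]})
df_b = pd.DataFrame({"x": [3, 4]})

kept = pd.concat([df_a, df_b])                       # index: 0, 1, 0, 1
fresh = pd.concat([df_a, df_b], ignore_index=True)   # index: 0, 1, 2, 3

print(kept.loc[0])    # two rows share label 0, so a DataFrame comes back
print(fresh.loc[0])   # exactly one row, returned as a Series
```

Both approaches give the same final index; which one to use is mostly a matter of taste.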

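#### [Exercise]  Combining data and resetting the index

Source: [Seoul subway boardings and alightings by line and station](http://data.seoul.go.kr/dataList/OA-12914/S/1/datasetView.do)

1. Check the three data files in the `data` folder whose names start with `CARD_SUBWAY_MONTH_`.  
    


2. Load each file from step 1 and combine them row-wise with pd.concat() (use encoding='CP949').


3. Reset the index.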



In [17]:
df_card1 = pd.read_csv("./data/CARD_SUBWAY_MONTH_201907.csv", encoding="cp949")
df_card2 = pd.read_csv("./data/CARD_SUBWAY_MONTH_202007.csv", encoding="cp949")
df_card3 = pd.read_csv("./data/CARD_SUBWAY_MONTH_202107.csv", encoding="cp949")

In [20]:
df_card = pd.concat( [df_card1, df_card2, df_card3]  ).reset_index(drop=True)

In [None]:
df_card

#### [Note] Processing multiple files with glob and a for loop

*glob()* from the **glob** library makes it easy to collect the paths of multiple data files.

In [21]:
# Build the list of target files
from glob import glob
file_list  = glob('data/apt/*.csv')
file_list

['data/apt\\아파트(매매)__실거래가_20220930082940.csv',
 'data/apt\\아파트(매매)__실거래가_20220930083006.csv',
 'data/apt\\아파트(매매)__실거래가_20220930083033.csv',
 'data/apt\\아파트(매매)__실거래가_20220930083102.csv',
 'data/apt\\아파트(매매)__실거래가_20220930083124.csv',
 'data/apt\\아파트(매매)__실거래가_20220930083151.csv',
 'data/apt\\아파트(매매)__실거래가_20220930083220.csv',
 'data/apt\\아파트(매매)__실거래가_20220930083248.csv']

In [22]:
# Loop over the paths with for
a = list()
for path_ in file_list:
    a.append(pd.read_csv(path_, skiprows=15, encoding='CP949'))
a

[               시군구      번지   본번  부번               단지명   전용면적(㎡)    계약년월  계약일  \
 0    서울특별시 강남구 개포동      12   12   0           삼익대청아파트   39.5300  202208    8   
 1    서울특별시 강남구 개포동      12   12   0        성원대치2단지아파트   49.8600  202208   10   
 2    서울특별시 강남구 개포동      12   12   0        성원대치2단지아파트   49.8600  202208   31   
 3    서울특별시 강남구 논현동    58-2   58   2            마일스디오빌   36.2900  202208    6   
 4    서울특별시 강남구 논현동    58-2   58   2            마일스디오빌   36.2900  202208    8   
 5    서울특별시 강남구 논현동    58-2   58   2            마일스디오빌   36.2900  202208   12   
 6    서울특별시 강남구 논현동    58-2   58   2            마일스디오빌   36.2900  202208   12   
 7    서울특별시 강남구 논현동    58-2   58   2            마일스디오빌   36.2900  202208   27   
 8    서울특별시 강남구 논현동    58-2   58   2            마일스디오빌   36.2900  202208   27   
 9    서울특별시 강남구 논현동   221-7  221   7        한양수자인어반게이트   16.2000  202208    4   
 10   서울특별시 강남구 대치동     503  503   0             개포우성1   84.8100  202208   30   
 11   서울특별시 강남구 대치동     888 

In [23]:
# Final step: combine the loaded apartment DataFrames
df_apt = pd.concat(a, axis=0).reset_index(drop=True)
df_apt

Unnamed: 0,시군구,번지,본번,부번,단지명,전용면적(㎡),계약년월,계약일,거래금액(만원),층,건축년도,도로명,해제사유발생일,거래유형,중개사소재지
0,서울특별시 강남구 개포동,12,12,0,삼익대청아파트,39.53,202208,8,119000,12,1992,개포로109길 21,,중개거래,서울 강남구
1,서울특별시 강남구 개포동,12,12,0,성원대치2단지아파트,49.86,202208,10,133000,1,1992,개포로109길 9,,중개거래,서울 강남구
2,서울특별시 강남구 개포동,12,12,0,성원대치2단지아파트,49.86,202208,31,136000,14,1992,개포로109길 9,,중개거래,"서울 강남구, 서울 서초구"
3,서울특별시 강남구 논현동,58-2,58,2,마일스디오빌,36.29,202208,6,38500,11,2004,학동로 165,,중개거래,"서울 강남구, 서울 서초구"
4,서울특별시 강남구 논현동,58-2,58,2,마일스디오빌,36.29,202208,8,38000,14,2004,학동로 165,,중개거래,서울 강남구
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
614,서울특별시 강남구 자곡동,619,619,0,엘에이치강남힐스테이트,51.99,202201,8,103000,7,2015,자곡로3길 21,20220502.0,중개거래,서울 강남구
615,서울특별시 강남구 자곡동,619,619,0,엘에이치강남힐스테이트,51.96,202201,17,118000,11,2015,자곡로3길 21,,중개거래,서울 강남구
616,서울특별시 강남구 청담동,76-12,76,12,라테라스청담,18.62,202201,17,34000,8,2018,학동로81길 9,,중개거래,서울 강남구
617,서울특별시 강남구 청담동,10,10,0,삼환아파트101동,84.91,202201,17,170000,2,1999,학동로77길 49,,중개거래,서울 강남구


In [28]:
import glob

dfs = [pd.read_csv(path_, skiprows=15, encoding='CP949') for path_ in glob.glob('data/apt/*.csv') ]
pd.concat(dfs).reset_index(drop=True)

Unnamed: 0,시군구,번지,본번,부번,단지명,전용면적(㎡),계약년월,계약일,거래금액(만원),층,건축년도,도로명,해제사유발생일,거래유형,중개사소재지
0,서울특별시 강남구 개포동,12,12,0,삼익대청아파트,39.53,202208,8,119000,12,1992,개포로109길 21,,중개거래,서울 강남구
1,서울특별시 강남구 개포동,12,12,0,성원대치2단지아파트,49.86,202208,10,133000,1,1992,개포로109길 9,,중개거래,서울 강남구
2,서울특별시 강남구 개포동,12,12,0,성원대치2단지아파트,49.86,202208,31,136000,14,1992,개포로109길 9,,중개거래,"서울 강남구, 서울 서초구"
3,서울특별시 강남구 논현동,58-2,58,2,마일스디오빌,36.29,202208,6,38500,11,2004,학동로 165,,중개거래,"서울 강남구, 서울 서초구"
4,서울특별시 강남구 논현동,58-2,58,2,마일스디오빌,36.29,202208,8,38000,14,2004,학동로 165,,중개거래,서울 강남구
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
614,서울특별시 강남구 자곡동,619,619,0,엘에이치강남힐스테이트,51.99,202201,8,103000,7,2015,자곡로3길 21,20220502.0,중개거래,서울 강남구
615,서울특별시 강남구 자곡동,619,619,0,엘에이치강남힐스테이트,51.96,202201,17,118000,11,2015,자곡로3길 21,,중개거래,서울 강남구
616,서울특별시 강남구 청담동,76-12,76,12,라테라스청담,18.62,202201,17,34000,8,2018,학동로81길 9,,중개거래,서울 강남구
617,서울특별시 강남구 청담동,10,10,0,삼환아파트101동,84.91,202201,17,170000,2,1999,학동로77길 49,,중개거래,서울 강남구
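The pattern above can be tried without the course files. The sketch below writes three small CSV files into a temporary folder (the file names and columns are hypothetical stand-ins for `data/apt/*.csv`) and then combines them. It also sorts the paths, because `glob()` does not promise any particular order.

```python
import os
import tempfile
from glob import glob

import pandas as pd

# Create three small CSV files in a temporary folder (stand-ins for data/apt/*.csv)
tmp_dir = tempfile.mkdtemp()
for month in ["202201", "202202", "202203"]:
    pd.DataFrame({"month": [month], "count": [1]}).to_csv(
        os.path.join(tmp_dir, f"sample_{month}.csv"), index=False
    )

# glob() does not guarantee any order, so sort the paths for a reproducible result
file_list = sorted(glob(os.path.join(tmp_dir, "*.csv")))

# Read every file and stack the rows with a fresh 0..n-1 index
df_all = pd.concat(
    [pd.read_csv(path_) for path_ in file_list], ignore_index=True
)
print(df_all)
```

For the real files, the `skiprows=15` and `encoding='CP949'` arguments used in the cells above would still be needed inside `pd.read_csv()`.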


<br>

### 1.2. Key-based combination with merge() 

Use *merge()* to combine data on key variables, like JOIN in SQL or VLOOKUP() in Excel.

In [45]:
# Load the example data
df_left  = pd.read_csv('data/data_left.csv')
df_right = pd.read_csv('data/data_right.csv')

In [46]:
df_left

Unnamed: 0,product_id,category,sales,name
0,P001,A,100,Food
1,P002,B,300,Beverage
2,P003,,100,
3,P005,A,200,Food


In [31]:
df_right

Unnamed: 0,category,name,manager_id
0,A,Food,E009
1,B,Beverage,E009
2,C,Industrial,E010


<br>

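> When combining data by key, the default is to return only the observations that have a partner, that is, a matching key on both sides. SQL calls this an **inner join**.  

The `how=` option of *merge()* specifies the join method:

+ `inner`: inner join. Includes only observations whose keys match.
+ `left`:  left join. Includes the inner join result plus the unmatched observations from the left data.
+ `right`: right join. Includes the inner join result plus the unmatched observations from the right data.
+ `outer`: full outer join. Includes the inner join result plus all unmatched observations from both sides.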
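To see how the four options differ in row count, the sketch below merges two made-up tables (`left` and `right` are illustrative, not the course files) and uses `indicator=True`, which adds a `_merge` column reporting whether each row was found on both sides or only one.

```python
import pandas as pd

# Hypothetical mini tables; only the key column is shared
left = pd.DataFrame({"category": ["A", "B", "D"], "sales": [100, 300, 50]})
right = pd.DataFrame(
    {"category": ["A", "B", "C"], "manager_id": ["E009", "E009", "E010"]}
)

# Row counts: inner 2, left 3, right 3, outer 4
for how in ["inner", "left", "right", "outer"]:
    merged = pd.merge(left, right, how=how, on="category")
    print(how, len(merged))

# indicator=True adds a _merge column showing where each row came from
checked = pd.merge(left, right, how="outer", on="category", indicator=True)
print(checked[["category", "_merge"]])
```

Checking `_merge` after an outer join is a quick way to find keys that exist in only one of the two tables.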

In [47]:
# Combine with merge()
pd.merge(df_left, df_right, how='inner', on=['category', 'name'])

Unnamed: 0,product_id,category,sales,name,manager_id
0,P001,A,100,Food,E009
1,P005,A,200,Food,E009
2,P002,B,300,Beverage,E009


In [34]:
# left join (both shared columns are used as keys, so a single name column is kept)
pd.merge(df_left, df_right, how='left', on=['category', 'name'])

Unnamed: 0,product_id,category,sales,name,manager_id
0,P001,A,100,Food,E009
1,P002,B,300,Beverage,E009
2,P003,,100,,
3,P005,A,200,Food,E009


In [35]:
# right join (both shared columns are used as keys, so a single name column is kept)
pd.merge(df_left, df_right, how='right', on=['category', 'name'])

Unnamed: 0,product_id,category,sales,name,manager_id
0,P001,A,100.0,Food,E009
1,P005,A,200.0,Food,E009
2,P002,B,300.0,Beverage,E009
3,,C,,Industrial,E010


In [33]:
# full outer join (both shared columns are used as keys, so a single name column is kept)
pd.merge(df_left, df_right, how='outer', on=['category', 'name'])

Unnamed: 0,product_id,category,sales,name,manager_id
0,P001,A,100.0,Food,E009
1,P005,A,200.0,Food,E009
2,P002,B,300.0,Beverage,E009
3,P003,,100.0,,
4,,C,,Industrial,E010


#### [Exercise]  Combining data 

1. Load `production.csv` from the `data` folder and save it as **df_pd**.

2. Load `weather.csv` from the `data` folder and save it as **df_wt**.

3. Combine the two datasets on `date`.



In [39]:
df_pd = pd.read_csv("./data/production.csv")
df_pd

Unnamed: 0,date,factory,line,capacity,production,defective
0,2021-01-04,A,1,1000.0,979,3
1,2021-01-04,A,2,1000.0,948,3
2,2021-01-04,A,3,1000.0,962,4
3,2021-01-04,B,4,1500.0,1473,3
4,2021-01-04,B,5,1500.0,1462,5
...,...,...,...,...,...,...
2236,2021-12-31,B,5,1500.0,1457,6
2237,2021-12-31,C,6,2000.0,1987,8
2238,2021-12-31,C,7,2000.0,2025,8
2239,2021-12-31,C,8,2000.0,2034,7


In [40]:
df_wt = pd.read_csv("./data/weather.csv")
df_wt

Unnamed: 0,date,temp_high,temp_low,hum,rain
0,2021-01-01,1.6,-9.8,64.0,
1,2021-01-02,-1.4,-8.4,38.5,
2,2021-01-03,-2.0,-9.1,45.0,
3,2021-01-04,0.3,-8.4,51.4,0.0
4,2021-01-05,-2.1,-9.9,52.8,0.0
...,...,...,...,...,...
360,2021-12-27,-3.9,-12.9,60.9,0.0
361,2021-12-28,-0.9,-8.5,73.8,
362,2021-12-29,5.9,-3.8,72.9,0.2
363,2021-12-30,0.2,-6.8,48.5,0.0


In [38]:
pd.merge(df_pd, df_wt, on='date')

Unnamed: 0,date,factory,line,capacity,production,defective,temp_high,temp_low,hum,rain
0,2021-01-04,A,1,1000.0,979,3,0.3,-8.4,51.4,0.0
1,2021-01-04,A,2,1000.0,948,3,0.3,-8.4,51.4,0.0
2,2021-01-04,A,3,1000.0,962,4,0.3,-8.4,51.4,0.0
3,2021-01-04,B,4,1500.0,1473,3,0.3,-8.4,51.4,0.0
4,2021-01-04,B,5,1500.0,1462,5,0.3,-8.4,51.4,0.0
...,...,...,...,...,...,...,...,...,...,...
2236,2021-12-31,B,5,1500.0,1457,6,-3.9,-8.8,35.9,
2237,2021-12-31,C,6,2000.0,1987,8,-3.9,-8.8,35.9,
2238,2021-12-31,C,7,2000.0,2025,8,-3.9,-8.8,35.9,
2239,2021-12-31,C,8,2000.0,2034,7,-3.9,-8.8,35.9,


<br>
<hr>
<br>


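## 2. Selecting Subsets of Data

In typical business data analysis, you select and use only the part of the data that is relevant to your own work, such as a topic, period, site, product, or process.  
Apart from extracting data with SQL, you also reselect subsets in Python to suit each step of the analysis.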

<br> 

In [48]:
# Load the example data
import pandas as pd
df_ins = pd.read_csv('data/insurance.csv')
df_ins.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


<br>

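### 2.1. Selecting a variable with a dot (.)

Type a period (.) after a DataFrame and press `Tab` to see the variable names along with the DataFrame's methods.  
The dot is the simplest way to select a variable, and the selected variable is returned as a **Series**.  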

In [59]:
testList = [1,2,3,"안녕"]
testList[2]

3

In [54]:
# Select a single variable with .
df_ins.bmi.mean()

30.66339686098655
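Dot access is convenient, but it only works for names that are valid Python identifiers and that do not collide with an existing DataFrame attribute. A minimal sketch on a made-up DataFrame illustrates both limits:

```python
import pandas as pd

df = pd.DataFrame({"math score": [72, 69], "count": [1, 2], "bmi": [27.9, 33.77]})

print(df.bmi.mean())          # plain name: dot access works
print(df["math score"])       # name with a space: brackets are required
print(callable(df.count))     # True: df.count is the DataFrame method, not the column
print(df["count"].tolist())   # [1, 2]: brackets return the column
```

When in doubt, bracket notation is the safer choice because it works for every column name.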

<br>


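### 2.2. Selecting subsets with square brackets

Attach square brackets to a DataFrame and either give row positions with a slice (`:`) or put a variable name in quotes (`''`) to select part of the data.  
Pass a list of variable names to select several variables at once.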

In [60]:
# Select observations
df_ins[0:3]

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462


In [61]:
# Select one variable
df_ins['age']

0       19
1       18
2       28
3       33
4       32
        ..
1333    50
1334    18
1335    18
1336    21
1337    61
Name: age, Length: 1338, dtype: int64

In [62]:
# Select multiple variables with a list
df_ins[ ['age','smoker','charges'] ]

Unnamed: 0,age,smoker,charges
0,19,yes,16884.92400
1,18,no,1725.55230
2,28,no,4449.46200
3,33,no,21984.47061
4,32,no,3866.85520
...,...,...,...
1333,50,no,10600.54830
1334,18,no,2205.98080
1335,18,no,1629.83350
1336,21,no,2007.94500


In [63]:
# Brackets can be chained
df_ins[0:5][['age','smoker','charges']]

Unnamed: 0,age,smoker,charges
0,19,yes,16884.924
1,18,no,1725.5523
2,28,no,4449.462
3,33,no,21984.47061
4,32,no,3866.8552


<br>

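#### [Exercise]  

1. Run the command below to create the df_subway data. 

2. Check the variable names with the .columns attribute.

3. Use a slice to select the 11th through 15th observations.

4. Select the three variables '사용일자', '역명', and '하차총승객수'.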



    

In [64]:
df_subway = pd.read_csv('./data/CARD_SUBWAY_MONTH_202107.csv', encoding='CP949')
df_subway

Unnamed: 0,사용일자,노선명,역명,승차총승객수,하차총승객수,등록일자
0,20210701,중앙선,상봉(시외버스터미널),6102,6098,20210704
1,20210701,중앙선,망우,7706,7633,20210704
2,20210701,중앙선,양원,1987,2052,20210704
3,20210701,중앙선,구리,13576,13748,20210704
4,20210701,중앙선,도농,10145,9146,20210704
...,...,...,...,...,...,...
18627,20210731,경원선,청량리(서울시립대입구),11320,13138,20210803
18628,20210731,경원선,외대앞,4261,4279,20210803
18629,20210731,경원선,신이문,4666,4313,20210803
18630,20210731,2호선,용두(동대문구청),1292,1364,20210803


In [65]:
df_subway.columns

Index(['사용일자', '노선명', '역명', '승차총승객수', '하차총승객수', '등록일자'], dtype='object')

In [66]:
df_subway[10:15]

Unnamed: 0,사용일자,노선명,역명,승차총승객수,하차총승객수,등록일자
10,20210701,중앙선,양수,2264,2310,20210704
11,20210701,중앙선,신원,199,189,20210704
12,20210701,중앙선,국수,769,753,20210704
13,20210701,중앙선,아신,542,544,20210704
14,20210701,7호선,면목,15526,15035,20210704


In [67]:
df_subway[['사용일자', '역명', '하차총승객수']]

Unnamed: 0,사용일자,역명,하차총승객수
0,20210701,상봉(시외버스터미널),6098
1,20210701,망우,7633
2,20210701,양원,2052
3,20210701,구리,13748
4,20210701,도농,9146
...,...,...,...
18627,20210731,청량리(서울시립대입구),13138
18628,20210731,외대앞,4279
18629,20210731,신이문,4313
18630,20210731,용두(동대문구청),1364


<br>

### 2.3. Selecting observations and variables with loc and iloc

loc selects part of the data by row labels (index) and column names, while iloc uses integer row and column positions.  
Both accept lists `[ ]` and slices `:`.
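The two selectors can return the same rows through different rules. The sketch below uses a made-up DataFrame with a non-default index to show that a loc slice includes its end label, while an iloc slice excludes its end position:

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [19, 18, 28, 33], "smoker": ["yes", "no", "no", "no"]},
    index=[101, 102, 103, 104],
)

by_label = df.loc[101:103]     # labels 101, 102, 103 (end label included)
by_position = df.iloc[0:3]     # positions 0, 1, 2 (end position excluded)
print(by_label.equals(by_position))   # True

print(df.loc[102, "smoker"])   # row label 102, column name 'smoker'
print(df.iloc[1, 1])           # second row, second column: the same cell
```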



In [68]:
# For practice, copy the original data and select a subset
df_ins2 = df_ins.copy()[0:10]
df_ins2

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552
5,31,female,25.74,0,no,southeast,3756.6216
6,46,female,33.44,1,no,southeast,8240.5896
7,37,female,27.74,3,no,northwest,7281.5056
8,37,male,29.83,2,no,northeast,6406.4107
9,60,female,25.84,0,no,northwest,28923.13692


In [69]:
# For practice, assign a custom index
df_ins2['idx'] = list(range(101, 111))
df_ins2.set_index('idx', inplace=True)
df_ins2

Unnamed: 0_level_0,age,sex,bmi,children,smoker,region,charges
idx,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
101,19,female,27.9,0,yes,southwest,16884.924
102,18,male,33.77,1,no,southeast,1725.5523
103,28,male,33.0,3,no,southeast,4449.462
104,33,male,22.705,0,no,northwest,21984.47061
105,32,male,28.88,0,no,northwest,3866.8552
106,31,female,25.74,0,no,southeast,3756.6216
107,46,female,33.44,1,no,southeast,8240.5896
108,37,female,27.74,3,no,northwest,7281.5056
109,37,male,29.83,2,no,northeast,6406.4107
110,60,female,25.84,0,no,northwest,28923.13692


<br> 

### 2.3.1. Selecting subsets with loc

loc uses the index labels and column names exactly as they are displayed.

In [70]:
df_ins2.loc[101]

age                19
sex            female
bmi              27.9
children            0
smoker            yes
region      southwest
charges     16884.924
Name: 101, dtype: object

In [72]:
df_ins2.loc[[101, 103, 107]]

Unnamed: 0_level_0,age,sex,bmi,children,smoker,region,charges
idx,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
101,19,female,27.9,0,yes,southwest,16884.924
103,28,male,33.0,3,no,southeast,4449.462
107,46,female,33.44,1,no,southeast,8240.5896


In [73]:
df_ins2.loc[101:103]

Unnamed: 0_level_0,age,sex,bmi,children,smoker,region,charges
idx,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
101,19,female,27.9,0,yes,southwest,16884.924
102,18,male,33.77,1,no,southeast,1725.5523
103,28,male,33.0,3,no,southeast,4449.462


In [75]:
df_ins2.loc[101:103, 'smoker']

idx
101    yes
102     no
103     no
Name: smoker, dtype: object

In [84]:
# A list of variable names also works
df_ins2.loc[101:103, ['smoker','region']]

Unnamed: 0_level_0,smoker,region
idx,Unnamed: 1_level_1,Unnamed: 2_level_1
101,yes,southwest
102,no,southeast
103,no,southeast


In [82]:
# Variable names can also be sliced with :
df_ins2.loc[101:103, 'smoker':'charges']

Unnamed: 0_level_0,smoker,region,charges
idx,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
101,yes,southwest,16884.924
102,no,southeast,1725.5523
103,no,southeast,4449.462


In [83]:
# Use : to select all observations
df_ins2.loc[:, 'smoker':'charges']

Unnamed: 0_level_0,smoker,region,charges
idx,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
101,yes,southwest,16884.924
102,no,southeast,1725.5523
103,no,southeast,4449.462
104,no,northwest,21984.47061
105,no,northwest,3866.8552
106,no,southeast,3756.6216
107,no,southeast,8240.5896
108,no,northwest,7281.5056
109,no,northeast,6406.4107
110,no,northwest,28923.13692


In [74]:
# The second argument is read as a column label, so 105 raises KeyError
df_ins2.loc[101:103,105]

KeyError: 105

In [80]:
df_sub1 = df_ins2.loc[101:103]
df_sub2 = df_ins2.loc[[105]]
pd.concat([df_sub1, df_sub2])

Unnamed: 0_level_0,age,sex,bmi,children,smoker,region,charges
idx,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
101,19,female,27.9,0,yes,southwest,16884.924
102,18,male,33.77,1,no,southeast,1725.5523
103,28,male,33.0,3,no,southeast,4449.462
105,32,male,28.88,0,no,northwest,3866.8552
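The same non-contiguous selection can also be written as a single `loc` call by passing one list of row labels. This is a minimal sketch on a made-up DataFrame that mimics the 101 to 105 index used above:

```python
import pandas as pd

df = pd.DataFrame({"age": [19, 18, 28, 33, 32]}, index=[101, 102, 103, 104, 105])

# Build one list of labels: the range 101..103 plus the single label 105
labels = list(range(101, 104)) + [105]
picked = df.loc[labels]

# Same rows as concatenating the two separate selections
same = pd.concat([df.loc[101:103], df.loc[[105]]])
print(picked.equals(same))   # True
```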


<br> 

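### 2.3.2. Selecting subsets with iloc

iloc ignores labels and uses integer positions. Lists and slices work as with loc, except that an iloc slice excludes its end position while a loc slice includes its end label.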

In [85]:
df_ins2

Unnamed: 0_level_0,age,sex,bmi,children,smoker,region,charges
idx,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
101,19,female,27.9,0,yes,southwest,16884.924
102,18,male,33.77,1,no,southeast,1725.5523
103,28,male,33.0,3,no,southeast,4449.462
104,33,male,22.705,0,no,northwest,21984.47061
105,32,male,28.88,0,no,northwest,3866.8552
106,31,female,25.74,0,no,southeast,3756.6216
107,46,female,33.44,1,no,southeast,8240.5896
108,37,female,27.74,3,no,northwest,7281.5056
109,37,male,29.83,2,no,northeast,6406.4107
110,60,female,25.84,0,no,northwest,28923.13692


In [86]:
df_ins2.iloc[0:3, [0,3,4]]

Unnamed: 0_level_0,age,children,smoker
idx,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
101,19,0,yes
102,18,1,no
103,28,3,no


#### [Exercise] 

1. In df_pr, check the 'Weight' of the observation with index 3.
2. In df_pr, select 'Age' through 'Exercise' for index 11 to 15.
3. In df_pr, select the first through fifth observations and the fifth through tenth variables.

In [87]:
df_pr = pd.read_csv('data/PulseRates.csv')
df_pr.head()

Unnamed: 0,Height,Weight,Age,Gender,Smokes,Alcohol,Exercise,Ran,Pulse1,Pulse2,Year
0,173,57.0,18,2,2,1,2,2,86.0,88.0,93
1,179,58.0,19,2,2,1,2,1,82.0,150.0,93
2,167,62.0,18,2,2,1,1,1,96.0,176.0,93
3,195,84.0,18,1,2,1,1,2,71.0,73.0,93
4,173,64.0,18,2,2,1,3,2,90.0,88.0,93


In [88]:
df_pr.loc[3, 'Weight']

84.0

In [89]:
df_pr.loc[11:15, 'Age':'Exercise']

Unnamed: 0,Age,Gender,Smokes,Alcohol,Exercise
11,19,1,2,2,3
12,22,1,1,1,2
13,18,1,2,1,1
14,18,1,2,1,2
15,22,1,2,1,3


In [90]:
df_pr.iloc[0:5, 4:10]

Unnamed: 0,Smokes,Alcohol,Exercise,Ran,Pulse1,Pulse2
0,2,1,2,2,86.0,88.0
1,2,1,2,1,82.0,150.0
2,2,1,1,1,96.0,176.0
3,2,1,1,2,71.0,73.0
4,2,1,3,2,90.0,88.0


### 2.4. Selecting multiple variables with functions 



In [94]:
# Select variables by name pattern with the filter() method
df_ins.filter(regex='ch')
    ## regex: regular expression
    ## '^s': names/text starting with 's'
    

Unnamed: 0,children,charges
0,0,16884.92400
1,1,1725.55230
2,3,4449.46200
3,0,21984.47061
4,0,3866.85520
...,...,...
1333,3,10600.54830
1334,0,2205.98080
1335,0,1629.83350
1336,0,2007.94500
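`filter()` offers three ways to match column names: `regex`, `like` (plain substring), and `items` (exact names). The sketch below applies them to an empty DataFrame that only carries the insurance column names, so no data file is needed:

```python
import pandas as pd

df = pd.DataFrame(
    columns=["age", "sex", "bmi", "children", "smoker", "region", "charges"]
)

print(list(df.filter(regex="ch").columns))    # ['children', 'charges']
print(list(df.filter(regex="^s").columns))    # ['sex', 'smoker']
print(list(df.filter(regex="s$").columns))    # ['charges']
print(list(df.filter(like="ch").columns))     # ['children', 'charges']
print(list(df.filter(items=["age", "bmi"]).columns))  # ['age', 'bmi']
```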


In [95]:
# Check the variable types
df_ins.dtypes
    ## int/float: numeric
    ## object: string

age           int64
sex          object
bmi         float64
children      int64
smoker       object
region       object
charges     float64
dtype: object

In [96]:
# Select numeric variables only
df_ins.select_dtypes(include='number')

Unnamed: 0,age,bmi,children,charges
0,19,27.900,0,16884.92400
1,18,33.770,1,1725.55230
2,28,33.000,3,4449.46200
3,33,22.705,0,21984.47061
4,32,28.880,0,3866.85520
...,...,...,...,...
1333,50,30.970,3,10600.54830
1334,18,31.920,0,2205.98080
1335,18,36.850,0,1629.83350
1336,21,25.800,0,2007.94500


In [97]:
# Select string variables only
df_ins.select_dtypes(include='object')

Unnamed: 0,sex,smoker,region
0,female,yes,southwest
1,male,no,southeast
2,male,no,southeast
3,male,no,northwest
4,male,no,northwest
...,...,...,...
1333,male,no,northwest
1334,female,no,northeast
1335,female,no,southeast
1336,female,no,southwest
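`select_dtypes()` also takes an `exclude` argument. Excluding `'number'` is another way to get the non-numeric variables, and it does not depend on how the text columns are stored. A minimal sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [19, 18], "sex": ["female", "male"], "bmi": [27.9, 33.77]}
)

numeric = df.select_dtypes(include="number")
non_numeric = df.select_dtypes(exclude="number")

print(list(numeric.columns))      # ['age', 'bmi']
print(list(non_numeric.columns))  # ['sex']
```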


<br>

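#### [Exercise] Using the Student performance data

1. Select only the numeric variables from df_sp.
2. Select only the non-numeric variables from df_sp.
3. Select only the variables from df_sp whose names contain 'score'.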


In [98]:
df_sp = pd.read_csv('data/StudentsPerformance.csv')
df_sp.head()

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44
4,male,group C,some college,standard,none,76,78,75


In [99]:
df_sp.select_dtypes(include='number')

Unnamed: 0,math score,reading score,writing score
0,72,72,74
1,69,90,88
2,90,95,93
3,47,57,44
4,76,78,75
...,...,...,...
995,88,99,95
996,62,55,55
997,59,71,65
998,68,78,77


In [100]:
df_sp.select_dtypes(include='object')

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course
0,female,group B,bachelor's degree,standard,none
1,female,group C,some college,standard,completed
2,female,group B,master's degree,standard,none
3,male,group A,associate's degree,free/reduced,none
4,male,group C,some college,standard,none
...,...,...,...,...,...
995,female,group E,master's degree,standard,completed
996,male,group C,high school,free/reduced,none
997,female,group C,high school,free/reduced,completed
998,female,group D,some college,standard,completed


In [101]:
df_sp.filter(regex='score')

Unnamed: 0,math score,reading score,writing score
0,72,72,74
1,69,90,88
2,90,95,93
3,47,57,44
4,76,78,75
...,...,...,...
995,88,99,95
996,62,55,55
997,59,71,65
998,68,78,77


<br>

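### 2.5. Selecting observations with conditions

As with a WHERE clause in SQL or a Filter in Excel, conditions are often used to select part of the data.  
Put a condition inside [ ] or .loc[ ] to select only the observations that satisfy it.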

In [103]:
# Step 1: define the condition (the result is True/False)
    # a Series of bool type
df_ins['age'] < 30

0        True
1        True
2        True
3       False
4       False
        ...  
1333    False
1334     True
1335     True
1336     True
1337    False
Name: age, Length: 1338, dtype: bool

In [105]:
# Step 2: select observations with .loc[] and the condition
df_ins.loc[ df_ins['age'] < 30 ]

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
10,25,male,26.220,0,no,northeast,2721.32080
12,23,male,34.400,0,no,southwest,1826.84300
...,...,...,...,...,...,...,...
1328,23,female,24.225,2,no,northeast,22395.74424
1331,23,female,33.400,0,no,southwest,10795.93733
1334,18,female,31.920,0,no,northeast,2205.98080
1335,18,female,36.850,0,no,southeast,1629.83350


In [108]:
(df_ins['age'] < 30) & (df_ins['sex'] == 'female')

0        True
1       False
2       False
3       False
4       False
        ...  
1333    False
1334     True
1335     True
1336     True
1337    False
Length: 1338, dtype: bool

In [107]:
 (df_ins['sex'] == 'female')

0        True
1       False
2       False
3       False
4       False
        ...  
1333    False
1334     True
1335     True
1336     True
1337     True
Name: sex, Length: 1338, dtype: bool

In [109]:
# Combine conditions with & and |
df_ins[(df_ins['age'] < 30) & (df_ins['sex'] == 'female')]

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
31,18,female,26.315,0,no,northeast,2198.18985
32,19,female,28.600,5,no,southwest,4687.79700
40,24,female,26.600,0,no,northeast,3046.06200
46,18,female,38.665,2,no,northeast,3393.35635
...,...,...,...,...,...,...,...
1328,23,female,24.225,2,no,northeast,22395.74424
1331,23,female,33.400,0,no,southwest,10795.93733
1334,18,female,31.920,0,no,northeast,2205.98080
1335,18,female,36.850,0,no,southeast,1629.83350


In [110]:
df_ins[(df_ins['age'] < 30) | (df_ins['sex'] == 'female')]

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.90,0,yes,southwest,16884.9240
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.00,3,no,southeast,4449.4620
5,31,female,25.74,0,no,southeast,3756.6216
6,46,female,33.44,1,no,southeast,8240.5896
...,...,...,...,...,...,...,...
1332,52,female,44.70,3,no,southwest,11411.6850
1334,18,female,31.92,0,no,northeast,2205.9808
1335,18,female,36.85,0,no,southeast,1629.8335
1336,21,female,25.80,0,no,southwest,2007.9450


In [116]:
cond_bmi =  (df_ins['bmi'] > 25) & (df_ins['bmi'] < 30)
df_ins[ cond_bmi ]  

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
4,32,male,28.880,0,no,northwest,3866.85520
5,31,female,25.740,0,no,southeast,3756.62160
7,37,female,27.740,3,no,northwest,7281.50560
8,37,male,29.830,2,no,northeast,6406.41070
...,...,...,...,...,...,...,...
1321,62,male,26.695,0,yes,northeast,28101.33305
1324,31,male,25.935,1,no,northwest,4239.89265
1330,57,female,25.740,2,no,southeast,12629.16560
1336,21,female,25.800,0,no,southwest,2007.94500
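Each condition in the cells above is wrapped in parentheses, and that is required rather than stylistic: in Python, `&` and `|` bind more tightly than `<` and `==`. The sketch below, on made-up data, stores each condition in a variable first, which avoids the issue and keeps the filter readable:

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [19, 31, 25, 46], "sex": ["female", "female", "male", "female"]}
)

cond_age = df["age"] < 30
cond_sex = df["sex"] == "female"

both = df[cond_age & cond_sex]      # row 0 only
either = df[cond_age | cond_sex]    # rows 0, 1, 2, 3

# Without parentheses, & binds tighter than < and ==, so
# df["age"] < 30 & df["sex"] == "female"
# is parsed as df["age"] < (30 & df["sex"]) == "female", which is not the intended test.
print(both)
print(either)
```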


<br> 

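> Business data in particular has many categorized or grouped variables, and many people each manage only some of those categories, groups, or levels.  
*isin()* tests membership, that is, whether each value belongs to the categories you care about.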

In [117]:
# Check the levels of the variable region and pick the levels of interest
df_ins['region'].unique()

array(['southwest', 'southeast', 'northwest', 'northeast'], dtype=object)

In [118]:
# Select observations at specific levels with isin()
cond1 = df_ins['region'].isin(['southeast','northwest'])
cond1

0       False
1        True
2        True
3        True
4        True
        ...  
1333     True
1334    False
1335     True
1336    False
1337     True
Name: region, Length: 1338, dtype: bool

In [120]:
# The same condition written with == and |
cond2 = (df_ins['region'] == 'southeast') | (df_ins['region'] == 'northwest')

In [121]:
df_ins[cond2]

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.880,0,no,northwest,3866.85520
5,31,female,25.740,0,no,southeast,3756.62160
...,...,...,...,...,...,...,...
1327,51,male,30.030,1,no,southeast,9377.90470
1330,57,female,25.740,2,no,southeast,12629.16560
1333,50,male,30.970,3,no,northwest,10600.54830
1335,18,female,36.850,0,no,southeast,1629.83350
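That the `isin()` form and the chained `==` form give the same mask can be checked directly. This minimal sketch uses a made-up Series holding the four region levels, and also shows `~` for selecting everything outside the list:

```python
import pandas as pd

region = pd.Series(
    ["southwest", "southeast", "northwest", "northeast"], name="region"
)

cond_isin = region.isin(["southeast", "northwest"])
cond_or = (region == "southeast") | (region == "northwest")

print(cond_isin.equals(cond_or))    # True
print(region[~cond_isin].tolist())  # ['southwest', 'northeast']
```

With more than two or three levels, `isin()` stays short while the `|` chain grows with every added level.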


<br>

#### [Exercise]

1. Select the observations in df_sp whose math score is 90 or higher.
2. Select the observations in df_sp whose race/ethnicity is 'group D' or 'group E' (use isin()).
3. Select the observations that satisfy both 1 and 2. 

In [122]:
df_sp.head()

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44
4,male,group C,some college,standard,none,76,78,75


In [123]:
cond_math = df_sp['math score'] >= 90
df_sp[cond_math]

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
2,female,group B,master's degree,standard,none,90,95,93
34,male,group E,some college,standard,none,97,87,82
104,male,group C,some college,standard,completed,98,86,90
114,female,group E,bachelor's degree,standard,completed,99,100,100
121,male,group B,associate's degree,standard,completed,91,89,92
149,male,group E,associate's degree,free/reduced,completed,100,100,93
165,female,group C,bachelor's degree,standard,completed,96,100,100
171,male,group E,some high school,standard,none,94,88,78
179,female,group D,some high school,standard,completed,97,100,100
233,male,group E,some high school,standard,none,92,87,78


In [124]:
cond_race = df_sp['race/ethnicity'].isin(['group D', 'group E'])
df_sp[cond_race]

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
8,male,group D,high school,free/reduced,completed,64,64,67
11,male,group D,associate's degree,standard,none,40,52,43
20,male,group D,high school,standard,none,66,69,63
22,male,group D,some college,standard,none,44,54,53
24,male,group D,bachelor's degree,free/reduced,completed,74,71,80
...,...,...,...,...,...,...,...,...
992,female,group D,associate's degree,free/reduced,none,55,76,76
993,female,group D,bachelor's degree,free/reduced,none,62,72,74
995,female,group E,master's degree,standard,completed,88,99,95
998,female,group D,some college,standard,completed,68,78,77


In [125]:
df_sp[cond_math & cond_race]

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
34,male,group E,some college,standard,none,97,87,82
114,female,group E,bachelor's degree,standard,completed,99,100,100
149,male,group E,associate's degree,free/reduced,completed,100,100,93
171,male,group E,some high school,standard,none,94,88,78
179,female,group D,some high school,standard,completed,97,100,100
233,male,group E,some high school,standard,none,92,87,78
263,female,group E,high school,standard,none,99,93,90
286,male,group E,associate's degree,standard,completed,97,82,88
299,male,group D,associate's degree,free/reduced,none,90,87,75
306,male,group E,some college,standard,completed,99,87,81


In [137]:
df_sp[ 
    (df_sp['math score'] >= 90) &
    df_sp['race/ethnicity'].isin(['group D', 'group E'])
]

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
34,male,group E,some college,standard,none,97,87,82
114,female,group E,bachelor's degree,standard,completed,99,100,100
149,male,group E,associate's degree,free/reduced,completed,100,100,93
171,male,group E,some high school,standard,none,94,88,78
179,female,group D,some high school,standard,completed,97,100,100
233,male,group E,some high school,standard,none,92,87,78
263,female,group E,high school,standard,none,99,93,90
286,male,group E,associate's degree,standard,completed,97,82,88
299,male,group D,associate's degree,free/reduced,none,90,87,75
306,male,group E,some college,standard,completed,99,87,81


#### [Note] Using the str methods of a Series
On a string Series (a single variable), the `.str` methods select observations that contain a specific word or match a specific pattern.

In [127]:
df_sp

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44
4,male,group C,some college,standard,none,76,78,75
...,...,...,...,...,...,...,...,...
995,female,group E,master's degree,standard,completed,88,99,95
996,male,group C,high school,free/reduced,none,62,55,55
997,female,group C,high school,free/reduced,completed,59,71,65
998,female,group D,some college,standard,completed,68,78,77


In [126]:
df_sp['parental level of education'].str.startswith('b')

0       True
1      False
2      False
3      False
4      False
       ...  
995    False
996    False
997    False
998    False
999    False
Name: parental level of education, Length: 1000, dtype: bool

In [128]:
df_sp['parental level of education'].str.endswith('college')

0      False
1       True
2      False
3      False
4       True
       ...  
995    False
996    False
997    False
998     True
999     True
Name: parental level of education, Length: 1000, dtype: bool

In [130]:
df_sp[  df_sp['parental level of education'].str.contains('degree') ] 

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
0,female,group B,bachelor's degree,standard,none,72,72,74
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44
5,female,group B,associate's degree,standard,none,71,83,78
10,male,group C,associate's degree,standard,none,58,54,52
...,...,...,...,...,...,...,...,...
979,female,group C,associate's degree,standard,none,91,95,94
986,female,group C,associate's degree,standard,none,40,59,51
992,female,group D,associate's degree,free/reduced,none,55,76,76
993,female,group D,bachelor's degree,free/reduced,none,62,72,74
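
The `str` methods also take options that matter on real data, such as ignoring case or handling missing values. The following is a minimal sketch on a small made-up Series (the name `edu` and its values are illustrative assumptions, not part of df_sp):

```python
import pandas as pd

# Hypothetical toy data (not df_sp); one missing value is included on purpose
edu = pd.Series(["bachelor's degree", "some college", None, "Master's Degree"])

# case=False ignores upper/lower case;
# na=False treats missing values as non-matches so the mask stays boolean
mask_degree = edu.str.contains('degree', case=False, na=False)

# By default the pattern is a regular expression:
# here, values that start with "some" or end with "college"
mask_regex = edu.str.contains(r'^some|college$', na=False)

print(edu[mask_degree])
print(edu[mask_regex])
```

Without `na=False`, a missing value would produce a missing entry in the mask, and using such a mask for row selection can raise an error.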


<br>

#### [Note] Using the Series between method
On a numeric Series (a single variable), *between()* selects observations within a given range. Both endpoints are included by default.

In [131]:
df_sp['math score'].between(80, 89.9)

0      False
1      False
2      False
3      False
4      False
       ...  
995     True
996    False
997    False
998    False
999    False
Name: math score, Length: 1000, dtype: bool

In [132]:
# Which endpoints are included can be specified
    # 'both' (default), 'neither', 'left', 'right'
df_sp[df_sp['math score'].between(80, 90, inclusive='left')] 

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
6,female,group B,some college,standard,completed,88,95,92
16,male,group C,high school,standard,none,88,89,86
35,male,group E,associate's degree,standard,completed,81,81,79
49,male,group C,high school,standard,completed,82,84,82
53,male,group D,high school,standard,none,88,78,75
...,...,...,...,...,...,...,...,...
970,female,group D,bachelor's degree,standard,none,89,100,100
981,male,group D,some high school,standard,none,81,78,78
987,male,group E,some high school,standard,completed,81,75,76
990,male,group E,high school,free/reduced,completed,86,81,75
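
To make the effect of `inclusive` concrete, here is a small sketch with made-up scores that sit exactly on the boundaries (the Series `scores` is an assumption for illustration; the string values of `inclusive` assume pandas 1.3 or later):

```python
import pandas as pd

# Hypothetical scores chosen to sit exactly on the boundaries 80 and 90
scores = pd.Series([79, 80, 85, 90, 91])

both = scores.between(80, 90)                          # 80 <= x <= 90 (default)
left = scores.between(80, 90, inclusive='left')        # 80 <= x <  90
right = scores.between(80, 90, inclusive='right')      # 80 <  x <= 90
neither = scores.between(80, 90, inclusive='neither')  # 80 <  x <  90

# The 'left' variant is equivalent to combining two comparisons with &
manual = (scores >= 80) & (scores < 90)

print(pd.DataFrame({'score': scores, 'both': both, 'left': left,
                    'right': right, 'neither': neither}))
```

For integer scores, `between(80, 90, inclusive='left')` selects the same rows as `between(80, 89.9)`, but it states the intent more directly and still works if the scores have decimals.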


<br>

#### [Note] Negation with ~ (flipping True/False)
Prefixing a bool Series (True/False) with **~** flips True and False.

In [133]:
cond1 = df_sp['math score'].between(80, 90, inclusive='left')
cond1

0      False
1      False
2      False
3      False
4      False
       ...  
995     True
996    False
997    False
998    False
999    False
Name: math score, Length: 1000, dtype: bool

In [134]:
~cond1

0       True
1       True
2       True
3       True
4       True
       ...  
995    False
996     True
997     True
998     True
999     True
Name: math score, Length: 1000, dtype: bool

In [135]:
df_sp[~cond1]

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44
4,male,group C,some college,standard,none,76,78,75
...,...,...,...,...,...,...,...,...
994,male,group A,high school,standard,none,63,63,62
996,male,group C,high school,free/reduced,none,62,55,55
997,female,group C,high school,free/reduced,completed,59,71,65
998,female,group D,some college,standard,completed,68,78,77
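
When **~** is combined with `&` or `|`, operator precedence matters: `~` applies only to the Series directly after it unless parentheses are used. A minimal sketch on made-up data (the DataFrame `df` and the condition names are illustrative assumptions):

```python
import pandas as pd

# Hypothetical toy data (not df_sp)
df = pd.DataFrame({
    'gender': ['female', 'male', 'female', 'male'],
    'math score': [72, 85, 90, 88],
})

cond_80s = df['math score'].between(80, 90, inclusive='left')
cond_male = df['gender'] == 'male'

# Parentheses are needed to negate the combined condition as a whole
not_male_80s = df[~(cond_80s & cond_male)]

# De Morgan's law: ~(A & B) selects the same rows as (~A) | (~B)
same = df[~cond_80s | ~cond_male]

print(not_male_80s)
```

Writing `~cond_80s & cond_male` without parentheses would instead select males whose score is outside the 80s, which is a different question.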


<br>

### 2.6. Selecting a subset of observations with functions


In [138]:
# head() and tail() (a cell displays only its last expression, so only the tail() result is shown)
df_ins.head()
df_ins.tail()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
1333,50,male,30.97,3,no,northwest,10600.5483
1334,18,female,31.92,0,no,northeast,2205.9808
1335,18,female,36.85,0,no,southeast,1629.8335
1336,21,female,25.8,0,no,southwest,2007.945
1337,61,female,29.07,0,yes,northwest,29141.3603


In [145]:
# Using sample()
#df_ins.sample(frac=0.01)
df_ins.sample(n=10)

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
215,41,female,37.1,2,no,southwest,7371.772
746,34,male,27.0,2,no,southwest,11737.84884
916,43,female,26.885,0,yes,northwest,21774.32215
1301,62,male,30.875,3,yes,northwest,46718.16325
829,39,male,21.85,1,no,northwest,6117.4945
238,19,male,29.07,0,yes,northwest,17352.6803
1235,26,male,31.065,0,no,northwest,2699.56835
1038,22,male,28.88,0,no,northeast,2250.8352
1291,19,male,34.9,0,yes,southwest,34828.654
791,19,male,27.6,0,no,southwest,1252.407
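
Because `sample()` draws rows at random, the result changes on every run. If a reproducible sample is needed, the `random_state` parameter can be fixed. A minimal sketch, assuming a small made-up DataFrame in place of df_ins:

```python
import pandas as pd

# Hypothetical toy data standing in for df_ins (20 rows)
df = pd.DataFrame({'age': range(20, 40), 'charges': range(1000, 3000, 100)})

# The same random_state yields the same rows on every run
s1 = df.sample(n=5, random_state=0)
s2 = df.sample(n=5, random_state=0)

# frac draws a proportion of the rows instead of a fixed count
s_frac = df.sample(frac=0.25, random_state=0)

print(s1.equals(s2))   # the two samples are identical
print(len(s_frac))     # 25% of 20 rows
```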


In [146]:
# Select the top/bottom observations with nlargest() and nsmallest()
df_ins.nlargest(10, 'charges')

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
543,54,female,47.41,0,yes,southeast,63770.42801
1300,45,male,30.36,0,yes,southeast,62592.87309
1230,52,male,34.485,3,yes,northwest,60021.39897
577,31,female,38.095,1,yes,northeast,58571.07448
819,33,female,35.53,0,yes,northwest,55135.40209
1146,60,male,32.8,0,yes,southwest,52590.82939
34,28,male,36.4,1,yes,southwest,51194.55914
1241,64,male,36.96,2,yes,southeast,49577.6624
1062,59,male,41.14,1,yes,southeast,48970.2476
488,44,female,38.06,0,yes,southeast,48885.13561


In [147]:
df_ins.nsmallest(10, 'charges')

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
940,18,male,23.21,0,no,southeast,1121.8739
808,18,male,30.14,0,no,southeast,1131.5066
1244,18,male,33.33,0,no,southeast,1135.9407
663,18,male,33.66,0,no,southeast,1136.3994
22,18,male,34.1,0,no,southeast,1137.011
194,18,male,34.43,0,no,southeast,1137.4697
866,18,male,37.29,0,no,southeast,1141.4451
781,18,male,41.14,0,no,southeast,1146.7966
442,18,male,43.01,0,no,southeast,1149.3959
1317,18,male,53.13,0,no,southeast,1163.4627
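
When several rows share the value at the cut-off, `nlargest()` and `nsmallest()` keep only the first of them by default. The `keep` parameter changes this. A minimal sketch on made-up data with ties (the DataFrame `df` is an illustrative assumption):

```python
import pandas as pd

# Hypothetical toy data with ties in 'score'
df = pd.DataFrame({'name': list('abcde'), 'score': [90, 100, 90, 70, 100]})

top3 = df.nlargest(3, 'score')                   # keep='first': only one of the two 90s
top3_all = df.nlargest(3, 'score', keep='all')   # keeps every row tied at the cut-off

# Sorting in descending order and taking the head gives the same top scores
top3_sorted = df.sort_values('score', ascending=False).head(3)

print(top3)
print(top3_all)
```

With `keep='all'` the result can contain more rows than requested, which is often what is wanted when ranking people with equal scores.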


<br>

#### [Exercise]

1. Select the top 20 students by math score in df_sp
2. Select the bottom 10 students by writing score in df_sp


In [150]:
df_sp.sample(n=10)

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
674,female,group D,high school,standard,completed,69,77,78
252,female,group B,some high school,standard,completed,60,70,70
237,female,group D,some high school,standard,completed,64,60,74
765,female,group B,high school,standard,none,74,72,72
110,female,group D,associate's degree,free/reduced,completed,77,89,98
192,female,group B,some high school,standard,none,62,64,66
489,male,group A,associate's degree,free/reduced,completed,79,82,82
250,male,group A,some high school,standard,completed,47,49,49
243,male,group E,some college,standard,none,59,51,43
204,male,group C,some college,standard,none,59,41,42


In [151]:
df_sp.nlargest(20, 'math score')

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
149,male,group E,associate's degree,free/reduced,completed,100,100,93
451,female,group E,some college,standard,none,100,92,97
458,female,group E,bachelor's degree,standard,none,100,100,100
623,male,group A,some college,standard,completed,100,96,86
625,male,group D,some college,standard,completed,100,97,99
916,male,group E,bachelor's degree,standard,completed,100,100,100
962,female,group E,associate's degree,standard,none,100,100,100
114,female,group E,bachelor's degree,standard,completed,99,100,100
263,female,group E,high school,standard,none,99,93,90
306,male,group E,some college,standard,completed,99,87,81


In [152]:
df_sp.nsmallest(10, 'writing score')

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
59,female,group C,some high school,free/reduced,none,0,17,10
596,male,group B,high school,free/reduced,none,30,24,15
327,male,group A,some college,free/reduced,none,28,23,19
76,male,group E,some high school,standard,none,30,26,22
980,female,group B,high school,free/reduced,none,8,24,23
211,male,group C,some college,free/reduced,none,35,28,27
338,female,group B,some high school,free/reduced,none,24,38,27
896,male,group B,high school,free/reduced,none,36,29,27
17,female,group B,some high school,free/reduced,none,18,32,28
601,female,group C,high school,standard,none,29,29,30


<br>

### 2.7. Removing duplicates

Use `drop_duplicates()` to build a list of unique rows (or unique combinations of the selected columns) with duplicates removed.

In [157]:
df_ins.drop_duplicates()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830
1334,18,female,31.920,0,no,northeast,2205.98080
1335,18,female,36.850,0,no,southeast,1629.83350
1336,21,female,25.800,0,no,southwest,2007.94500


In [154]:
df_ins[['sex','region']].drop_duplicates()

Unnamed: 0,sex,region
0,female,southwest
1,male,southeast
3,male,northwest
5,female,southeast
7,female,northwest
8,male,northeast
12,male,southwest
16,female,northeast


In [155]:
df_ins[['sex','region']].value_counts()

sex     region   
male    southeast    189
female  southeast    175
        northwest    164
male    northeast    163
        southwest    163
female  southwest    162
        northeast    161
male    northwest    161
dtype: int64
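
Selecting columns first, as above, drops the other columns from the result. To check duplicates on some columns while keeping every column, `drop_duplicates()` accepts `subset` and `keep`. A minimal sketch on made-up data (the DataFrame `df` is an illustrative assumption):

```python
import pandas as pd

# Hypothetical toy data (not df_ins)
df = pd.DataFrame({
    'sex': ['female', 'male', 'female', 'male'],
    'region': ['southwest', 'southeast', 'southwest', 'southeast'],
    'charges': [100, 200, 300, 400],
})

# subset= checks duplicates on the listed columns but keeps all columns
first = df.drop_duplicates(subset=['sex', 'region'])              # keeps rows 0 and 1
last = df.drop_duplicates(subset=['sex', 'region'], keep='last')  # keeps rows 2 and 3

# duplicated() returns the boolean mask that drop_duplicates() acts on
dup_mask = df.duplicated(subset=['sex', 'region'])

print(first)
print(last)
```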

### 2.8. Sorting observations

Use `sort_values()` to sort observations.

In [158]:
# Sort the data by age
df_ins.sort_values('age')

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
1248,18,female,39.820,0,no,southeast,1633.96180
482,18,female,31.350,0,no,southeast,1622.18850
492,18,female,25.080,0,no,northeast,2196.47320
525,18,female,33.880,0,no,southeast,11482.63485
529,18,male,25.460,0,no,northeast,1708.00140
...,...,...,...,...,...,...,...
398,64,male,25.600,2,no,southwest,14988.43200
335,64,male,34.500,0,no,southwest,13822.80300
378,64,female,30.115,3,no,northwest,16455.70785
1265,64,male,23.760,0,yes,southeast,26926.51440


In [159]:
# The original data is unaffected
df_ins

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830
1334,18,female,31.920,0,no,northeast,2205.98080
1335,18,female,36.850,0,no,southeast,1629.83350
1336,21,female,25.800,0,no,southwest,2007.94500


In [160]:
# Sort the original data (reassign the sorted result)
df_ins = df_ins.sort_values('age')
df_ins.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
1248,18,female,39.82,0,no,southeast,1633.9618
482,18,female,31.35,0,no,southeast,1622.1885
492,18,female,25.08,0,no,northeast,2196.4732
525,18,female,33.88,0,no,southeast,11482.63485
529,18,male,25.46,0,no,northeast,1708.0014


In [161]:
# Sort in descending order
df_ins = df_ins.sort_values('age', ascending=False)
df_ins.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
635,64,male,38.19,0,no,northeast,14410.9321
1051,64,male,26.41,0,no,northeast,14394.5579
603,64,female,39.05,3,no,southeast,16085.1275
752,64,male,37.905,0,no,northwest,14210.53595
768,64,female,39.7,0,no,southwest,14319.031


In [162]:
# Multiple sort keys
df_ins.sort_values(['age', 'charges'], ascending=[True, False])

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
803,18,female,42.240,0,yes,southeast,38792.68560
759,18,male,38.170,0,yes,southeast,36307.79830
161,18,female,36.850,0,yes,southeast,36149.48350
623,18,male,33.535,0,yes,northeast,34617.84065
57,18,male,31.680,2,yes,southeast,34303.16720
...,...,...,...,...,...,...,...
768,64,female,39.700,0,no,southwest,14319.03100
801,64,female,35.970,0,no,southeast,14313.84630
752,64,male,37.905,0,no,northwest,14210.53595
534,64,male,40.480,0,no,southeast,13831.11520


In [163]:
# Sort by index
df_ins = df_ins.sort_index()
df_ins.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552
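
After `sort_values()` the original index labels travel with their rows, which is why `sort_index()` can restore the original order. If the sorted order should instead become the new row numbering, the index can be reset. A minimal sketch on made-up data (the DataFrame `df` is an illustrative assumption; `ignore_index` assumes pandas 1.0 or later):

```python
import pandas as pd

# Hypothetical toy data (not df_ins)
df = pd.DataFrame({'age': [30, 18, 64], 'charges': [3.0, 1.0, 2.0]})

sorted_df = df.sort_values('age')               # index stays 1, 0, 2
renumbered = sorted_df.reset_index(drop=True)   # index becomes 0, 1, 2

# ignore_index=True sorts and renumbers in a single call
one_step = df.sort_values('age', ignore_index=True)

print(sorted_df)
print(renumbered)
```

Once the index has been renumbered, `sort_index()` can no longer recover the original order, so reset the index only when the old labels are not needed.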


<br>

#### [Exercise] Using the df_sp data

1. Build a list of the unique 'gender', 'lunch' combinations (duplicates removed)   
2. Sort all observations in descending order of 'math score', then 'reading score'

In [None]:
df_sp[['gender', 'lunch']].drop_duplicates()

In [None]:
df_sp.sort_values(['math score', 'reading score'], ascending=[False, False])

#### End of script