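# Combining Data and Selecting Subsets

### Topics

1. Combining data
2. Selecting subsets with index and columns
3. Selecting observations with conditions

<br>

### Objectives
1. Combine multiple datasets using an appropriate method.
2. Select a subset of the data using variable names and similar labels.
3. Select a subset of observations using conditions suited to the topic.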


<br>
<hr>
<br>

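## 1. Working with the DataFrame Format

pandas is built around **DataFrame**, its format for storing data  
With the various pandas functions you can load or save data, and also carry out the preprocessing and aggregation needed during analysis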


In [None]:
# Import the library
import pandas as pd

<br>

In a DataFrame, each column (variable) is stored as a Series  
Series methods and DataFrame methods need to be distinguished  

To build a DataFrame yourself, use *DataFrame( )* with a dictionary

In [None]:
# Create a DataFrame from a dictionary
df_own = pd.DataFrame({ 0 : ['A', 'B', 'C', 'D'],
                       'SECOND': [7,6,5,8], 
                       'THIRD' : pd.date_range('2022-12-05', periods=4, freq='W-SUN')}) # weekly on Sundays; freq='W-MON' would be every Monday
df_own

In [None]:
df_own[0:1]
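
The Series/DataFrame distinction mentioned above can be sketched as follows. This is a minimal illustration on toy data; `df_toy` and its columns are hypothetical names, not part of the course files.

```python
import pandas as pd

# Toy data (hypothetical, for illustration only)
df_toy = pd.DataFrame({'name': ['A', 'B', 'C'], 'score': [7, 6, 5]})

# Selecting one column returns a Series
col = df_toy['score']
print(type(df_toy).__name__)  # DataFrame
print(type(col).__name__)     # Series

# unique() is a Series method; it is called on a single column
uniq = col.unique()

# shape is available on both, but describes different things
print(df_toy.shape)  # (rows, columns)
print(col.shape)     # (rows,)
```

When a method is missing on one of the two types, checking whether the object is a DataFrame or a Series is usually the first step.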

<br>
<hr>
<br>

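## 2. Combining Data

### 2.1. Row-wise concatenation of identically structured data with concat( )

When several datasets share the same structure and differ only in period or product, combine them with pandas *concat()*  
Inside the function, `axis=0` performs row-wise concatenation (stacking downward), and `axis=1` performs column-wise concatenation  
`axis=0` is the default and can be omitted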
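
A minimal sketch of the two `axis` settings, assuming two small toy DataFrames (`df_a` and `df_b` are hypothetical names):

```python
import pandas as pd

# Two toy DataFrames with the same structure (hypothetical data)
df_a = pd.DataFrame({'id': [1, 2], 'price': [100, 200]})
df_b = pd.DataFrame({'id': [3, 4], 'price': [300, 400]})

# axis=0 (default): stack rows downward
rows = pd.concat([df_a, df_b])

# axis=1: place the DataFrames side by side, aligned on the index
cols = pd.concat([df_a, df_b], axis=1)

print(rows.shape)  # (4, 2)
print(cols.shape)  # (2, 4)
```

Note that `axis=1` aligns rows by index label, so it is only a plain side-by-side paste when both inputs share the same index.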

In [None]:
# Row-wise concatenation
    ## Source: Ministry of Land, Infrastructure and Transport, actual transaction prices (http://rtdown.molit.go.kr/)
df_apt1 = pd.read_csv('data/아파트(매매)__실거래가_20210902153616.csv', skiprows=15, encoding='CP949')
df_apt1

In [None]:
df_apt2 = pd.read_csv('data/아파트(매매)__실거래가_20210902153636.csv', skiprows=15, encoding='CP949')
df_apt2

In [None]:
df_apt3 = pd.read_csv('data/아파트(매매)__실거래가_20210902153655.csv', skiprows=15, encoding='CP949')
df_apt3

In [None]:
df_apt = pd.concat([df_apt1, df_apt2, df_apt3], join='inner')
df_apt

<br>

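> In a **DataFrame**, the **index**, which corresponds to the row number, plays an important role.  
For example, you can inspect the index as shown below, and you can also select observations by specifying a particular index value.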

In [None]:
df_apt.index

In [None]:
# Select the observation(s) at index 0
df_apt.loc[0]

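Because each DataFrame keeps its original index after concatenation, index **0** now refers to more than one observation  
After row-wise concatenation or sorting, the index needs to be reassigned or reset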

In [None]:
# Reset the index with reset_index()
    ## drop=True discards the old index; drop=False (the default) keeps it as a new column
df_apt = df_apt.reset_index(drop=True)
df_apt

In [None]:
# Select the observation at index 0 again
df_apt.loc[0]
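
The duplicate-index effect and its fix can be reproduced on toy data. The sketch below also uses the `ignore_index=True` option of `pd.concat()`, which is an alternative to calling `reset_index(drop=True)` afterwards; the DataFrames here are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical)
df_a = pd.DataFrame({'price': [100, 200]})
df_b = pd.DataFrame({'price': [300, 400]})

# Plain concat keeps each input's index, so label 0 appears twice
stacked = pd.concat([df_a, df_b])
dup = stacked.loc[0]   # a DataFrame with two rows
print(len(dup))        # 2

# ignore_index=True assigns a fresh 0..n-1 index during concatenation
fresh = pd.concat([df_a, df_b], ignore_index=True)
one = fresh.loc[0]     # a single row, returned as a Series
print(fresh.index.tolist())  # [0, 1, 2, 3]
```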

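#### [Exercise] Combining data and resetting the index

Source: [Seoul subway ridership (boardings and alightings) by line and station](http://data.seoul.go.kr/dataList/OA-12914/S/1/datasetView.do)

1. Check the three files in the `data` folder whose names start with `CARD_SUBWAY_MONTH_`

2. Load each file from step 1 into its own variable and combine them row-wise with pd.concat() (use encoding='CP949')

3. Reset the index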



In [None]:
pd.read_csv('/Users/jhpark/Downloads/CARD_SUBWAY_MONTH_202305.csv', index_col=False)

In [None]:
df_card1 = pd.read_csv("./data/CARD_SUBWAY_MONTH_201907.csv", encoding='cp949')
df_card2 = pd.read_csv("./data/CARD_SUBWAY_MONTH_202007.csv", encoding='cp949')
df_card3 = pd.read_csv("./data/CARD_SUBWAY_MONTH_202107.csv", encoding='cp949')
df_card1

In [None]:
pd.concat([df_card1, df_card2, df_card3]).reset_index(drop=True)

#### [Reference] Handling multiple files with glob and a for loop

With *glob()* from the **glob** library, multiple file paths can be handled easily

In [None]:
# Build the list of target files
from glob import glob
file_list  = glob('data/apt/*.csv')
file_list

In [None]:
# Loop over the files with for
a = list()
for e in file_list:
    a.append(pd.read_csv(e, skiprows=15, encoding='CP949'))
a

In [None]:
# Final step: combine the apartment files and reset the index
df_apt_all = pd.concat(a, axis=0).reset_index(drop=True)
df_apt_all

In [None]:

df_list = [ pd.read_csv(e, skiprows=15, encoding='CP949') for e in glob('data/apt/*.csv') ]
pd.concat(df_list).reset_index(drop=True)
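
The glob pattern above can be tried end to end without the course files. The sketch below writes two small CSV files to a temporary folder, reads them back with `glob()`, and records each row's source file; the file names and columns are hypothetical. Sorting the paths is a design choice here, since `glob()` does not guarantee any order.

```python
import tempfile
from glob import glob
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Write two toy CSV files (hypothetical names and contents)
    for month, n in [('202101', 10), ('202102', 20)]:
        toy = pd.DataFrame({'month': [month], 'n_trades': [n]})
        toy.to_csv(Path(tmp) / f'toy_{month}.csv', index=False)

    # Collect matching paths in a stable order
    paths = sorted(glob(str(Path(tmp) / 'toy_*.csv')))

    # Read each file and remember which file every row came from
    frames = [pd.read_csv(p).assign(source=Path(p).name) for p in paths]
    combined = pd.concat(frames, ignore_index=True)

print(combined)
```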

<br>

### 2.2. Key-based joins with merge()

Use *merge()* to combine data on a KEY variable, as with JOIN in SQL or VLOOKUP() in Excel

In [None]:
# Load the example data
df_left  = pd.read_csv('data/data_left.csv')
df_right = pd.read_csv('data/data_right.csv')

In [None]:
df_left

In [None]:
df_right

<br>

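> When combining data on a key, the default is to return only the observations whose key has a match in both datasets. SQL calls this an **inner join**.  

Use the `how=` option of *merge()* to choose one of the following join types

+ `inner`: inner join. Includes only observations whose keys match
+ `left`:  left join. The inner join result plus unmatched observations from the left data
+ `right`: right join. The inner join result plus unmatched observations from the right data
+ `outer`: full outer join. The inner join result plus all unmatched observations from both sides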

In [None]:
# Join with merge()
pd.merge(df_left, df_right, how='inner', on='category')

In [None]:
# left join
pd.merge(df_left, df_right, how='left', on='category')

In [None]:
# right join
pd.merge(df_left, df_right, how='right', on='category')

In [None]:
# full outer join
pd.merge(df_left, df_right, how='outer', on='category')
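
The four `how=` options can be compared on toy data where the outcome is easy to check by hand. In this sketch `left` and `right` are hypothetical DataFrames sharing two of their three keys; `indicator=True` adds a `_merge` column showing where each row came from.

```python
import pandas as pd

# Toy data (hypothetical): keys 'b' and 'c' exist on both sides
left = pd.DataFrame({'category': ['a', 'b', 'c'], 'x': [1, 2, 3]})
right = pd.DataFrame({'category': ['b', 'c', 'd'], 'y': [20, 30, 40]})

# Number of rows returned by each join type
counts = {}
for how in ['inner', 'left', 'right', 'outer']:
    counts[how] = len(pd.merge(left, right, how=how, on='category'))
print(counts)  # {'inner': 2, 'left': 3, 'right': 3, 'outer': 4}

# indicator=True labels each row as left_only, right_only, or both
outer = pd.merge(left, right, how='outer', on='category', indicator=True)
print(outer)
```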

<br>
<hr>
<br>


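## 3. Selecting Subsets of Data

In typical business data analysis, you select and use only the part of the data relevant to your own work, such as a topic, period, site, product, or process  
Separately from extracting data with SQL, you select and use subsets again in Python to fit each step of the analysis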

<br> 

In [105]:
# Load the example data
import pandas as pd
df_ins = pd.read_csv('data/insurance.csv')
df_ins.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


<br>

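### 3.1. Selecting a variable with .

Type a period (.) after the DataFrame and press `Tab` to see the variable names along with the DataFrame's methods  
. is the simplest way to select a variable, and the selected variable is returned as a **Series**  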

In [None]:
# Select a single variable with .
df_ins.age
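
Dot access has limits worth knowing. The sketch below, on hypothetical toy data, shows two cases where square brackets are needed instead: a column name containing a space, and a column name that collides with an existing DataFrame method.

```python
import pandas as pd

# Toy data (hypothetical column names chosen to show the limits)
df_toy = pd.DataFrame({'age': [19, 18],
                       'math score': [72, 69],
                       'count': [1, 2]})

a = df_toy.age             # works: simple name, no conflict
b = df_toy['math score']   # a name with a space requires brackets
c = df_toy['count']        # 'count' is also a DataFrame method name

# Attribute access returns the method, not the column
is_method = callable(df_toy.count)
print(is_method)  # True
```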

<br>


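### 3.2. Selecting subsets with square brackets

Attach square brackets to a DataFrame and either give row positions with a slice (:) or put a variable name in quotes ('') to select part of the data  
Wrap several variable names in a list to select multiple variables at once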

In [None]:
# Select observations
df_ins[0:3]

In [None]:
# Select one variable
df_ins['age']

In [None]:
# Select multiple variables with a list
df_ins[ ['age','smoker','charges'] ]

In [None]:
# Brackets can be applied in succession
df_tmp = df_ins[0:5]
df_tmp[['age','smoker','charges']]

<br>

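#### [Exercise]

1. Run the command below to create the df_subway data

2. Check the variable names with the .columns attribute

3. Use a slice to select the 11th through 15th observations

4. Select the three variables '사용일자' (date of use), '역명' (station name), and '하차총승객수' (total alighting passengers)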



    

In [None]:
df_subway = pd.read_csv('./data/CARD_SUBWAY_MONTH_202107.csv', encoding='CP949')
df_subway

In [None]:
df_subway.columns

In [None]:
df_subway[10:15]

In [None]:
df_subway[['사용일자', '역명', '하차총승객수']]

<br>

### 3.3. Selecting observations/variables with loc and iloc

loc selects part of the data by row label (index) and column name (column), while iloc uses integer row and column positions  
Both support lists [ ] and slices :



In [None]:
# For practice, copy the original data and take a subset
df_ins2 = df_ins.copy()[0:10]
df_ins2

In [None]:
# For practice, assign a separate index
df_ins2['idx'] = list(range(101, 111))
df_ins2.set_index('idx', inplace=True)
df_ins2

<br> 

### 3.3.1. Selecting subsets with loc

loc uses the index and column labels as actually displayed

In [None]:
df_ins2.loc[101]

In [None]:
df_ins2.loc[[101, 103]]

In [None]:
df_ins2.loc[101:103]

In [None]:
df_ins2.loc[101:103, 'smoker']

In [None]:
# A list of variable names can be used
df_ins2.loc[101:103, ['smoker','region']]

In [None]:
# A slice of variable names can be used
df_ins2.loc[101:103, 'smoker':'charges']

In [None]:
# Use : to select all observations
df_ins2.loc[:, 'smoker':'charges']

<br> 

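### 3.3.2. Selecting subsets with iloc

iloc uses integer positions regardless of labels; lists and slices work as with loc, except that an iloc slice excludes its end position while a loc slice includes its end label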

In [None]:
df_ins2.iloc[0:3, [0,3,4]]

In [None]:
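# Note: [] with an integer slice selects rows by position, not by index label,
# so on the 10-row df_ins2 this returns an empty DataFrame
df_ins2[101:103]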
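
The difference between label-based and position-based slicing can be checked on a small stand-in for `df_ins2`. This is a sketch on hypothetical toy data with the same 101 to 110 index.

```python
import pandas as pd

# Toy stand-in for df_ins2: 10 rows labeled 101..110 (hypothetical data)
df_toy = pd.DataFrame({'v': range(10)}, index=range(101, 111))

# loc slices by label and includes the end label
by_label = df_toy.loc[101:103]
print(len(by_label))  # 3

# iloc slices by position and excludes the end position
by_pos = df_toy.iloc[0:3]
print(len(by_pos))    # 3

# Plain [] with an integer slice is positional as well
plain = df_toy[101:103]
print(len(plain))     # 0
```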

#### [Exercise]

1. In df_pr, check the 'Weight' of the observation at index '3'
2. In df_pr, select 'Age' through 'Exercise' for index '11~15'
3. In df_pr, select the first through fifth observations and the fifth through tenth variables

In [None]:
df_pr = pd.read_csv('data/PulseRates.csv')
df_pr.head()

In [None]:
df_pr.loc[3,'Weight']

In [None]:
df_pr.loc[11:15, 'Age':'Exercise']

In [None]:
df_pr.iloc[0:5 , 4:10]

In [None]:
df_pr.loc[0]

### 3.4. Selecting multiple variables with functions



In [None]:
# Select variables by name pattern with the filter() method
df_ins.filter(regex='n$')
    ## regex: regular expression
    ## 'n$' matches names ending in 'n'; '^s' matches names starting with 's'; a plain keyword matches names containing it

    

In [None]:
# Check the variable types
df_ins.dtypes
    ## int/float: numeric
    ## object: string

In [None]:
# Select numeric variables only
df_ins.select_dtypes(include='number')

In [None]:
# Select string variables only
df_ins.select_dtypes(include='object')
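
A few more variations of `filter()` and `select_dtypes()`, sketched on hypothetical toy data: `like=` for a plain substring match, `regex=` with anchors, and `exclude=` as the complement of `include=`.

```python
import pandas as pd

# Toy data (hypothetical)
df_toy = pd.DataFrame({'gender': ['female', 'male'],
                       'math score': [72, 47],
                       'reading score': [72, 57],
                       'lunch': ['standard', 'free/reduced']})

# like= keeps columns whose name contains the given text
by_like = df_toy.filter(like='score')

# regex= with ^ (starts with) and $ (ends with)
by_start = df_toy.filter(regex='^m')
by_end = df_toy.filter(regex='r$')

# exclude= drops the given types instead of keeping them
non_numeric = df_toy.select_dtypes(exclude='number')

print(by_like.columns.tolist())      # ['math score', 'reading score']
print(non_numeric.columns.tolist())  # ['gender', 'lunch']
```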

<br>

#### [Exercise] Using the Student performance data

1. Select only the numeric variables from df_sp
2. Select only the string variables from df_sp
3. Select only the variables in df_sp whose names contain 'score'


In [96]:
df_sp = pd.read_csv('data/StudentsPerformance.csv')
df_sp.head()

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44
4,male,group C,some college,standard,none,76,78,75


In [None]:
df_sp.select_dtypes(include='number')

In [None]:
df_sp.select_dtypes(include='object')

In [97]:
df_sp.filter(regex='e.*l.*e')

Unnamed: 0,parental level of education
0,bachelor's degree
1,some college
2,master's degree
3,associate's degree
4,some college
...,...
995,master's degree
996,high school
997,high school
998,some college


<br>

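### 3.5. Selecting observations with conditions

Conditions are often used to select part of a dataset, as with the WHERE clause in SQL or Filter in Excel  
Put a condition inside [ ] or .loc[ ] to select only the observations that satisfy it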

In [98]:
# Step 1: set up the condition (the result is True/False)
    # a bool Series
df_ins['age'] < 30

0        True
1        True
2        True
3       False
4       False
        ...  
1333    False
1334     True
1335     True
1336     True
1337    False
Name: age, Length: 1338, dtype: bool

In [99]:
# Step 2: select observations with [] and the condition
df_ins[df_ins['age'] < 30]

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
10,25,male,26.220,0,no,northeast,2721.32080
12,23,male,34.400,0,no,southwest,1826.84300
...,...,...,...,...,...,...,...
1328,23,female,24.225,2,no,northeast,22395.74424
1331,23,female,33.400,0,no,southwest,10795.93733
1334,18,female,31.920,0,no,northeast,2205.98080
1335,18,female,36.850,0,no,southeast,1629.83350


In [107]:
(df_ins['age'] < 30) & (df_ins['sex'] == 'female')

0        True
1       False
2       False
3       False
4       False
        ...  
1333    False
1334     True
1335     True
1336     True
1337    False
Length: 1338, dtype: bool

In [100]:
# Combine conditions with & and |
df_ins[(df_ins['age'] < 30) & (df_ins['sex'] == 'female')]

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
31,18,female,26.315,0,no,northeast,2198.18985
32,19,female,28.600,5,no,southwest,4687.79700
40,24,female,26.600,0,no,northeast,3046.06200
46,18,female,38.665,2,no,northeast,3393.35635
...,...,...,...,...,...,...,...
1328,23,female,24.225,2,no,northeast,22395.74424
1331,23,female,33.400,0,no,southwest,10795.93733
1334,18,female,31.920,0,no,northeast,2205.98080
1335,18,female,36.850,0,no,southeast,1629.83350


In [108]:
df_ins[(df_ins['age'] < 30) | (df_ins['sex'] == 'female')]

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.90,0,yes,southwest,16884.9240
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.00,3,no,southeast,4449.4620
5,31,female,25.74,0,no,southeast,3756.6216
6,46,female,33.44,1,no,southeast,8240.5896
...,...,...,...,...,...,...,...
1332,52,female,44.70,3,no,southwest,11411.6850
1334,18,female,31.92,0,no,northeast,2205.9808
1335,18,female,36.85,0,no,southeast,1629.8335
1336,21,female,25.80,0,no,southwest,2007.9450
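
Two details of condition-based selection are sketched below on hypothetical toy data. Each comparison is wrapped in parentheses because `&` and `|` bind more tightly than `<` or `==` in Python, and `.loc[ ]` accepts a condition together with a list of columns.

```python
import pandas as pd

# Toy data (hypothetical)
df_toy = pd.DataFrame({'age': [19, 45, 28, 33],
                       'sex': ['female', 'female', 'male', 'male'],
                       'charges': [100.0, 200.0, 300.0, 400.0]})

# Parentheses around each comparison are required when using & or |
cond = (df_toy['age'] < 30) & (df_toy['sex'] == 'female')

# .loc[condition, columns] filters rows and picks columns in one step
picked = df_toy.loc[cond, ['age', 'charges']]

# | keeps rows that satisfy at least one condition
either = df_toy[(df_toy['age'] < 30) | (df_toy['sex'] == 'female')]

print(picked)
print(either.index.tolist())  # [0, 1, 2]
```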


<br> 

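> Business data in particular contains many categorized or grouped variables, and often many people each manage only some of those categories, groups, or levels.  
With *isin()* you can test whether each value belongs to the categories you care about.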

In [109]:
# Check the levels of the variable region and choose the levels of interest
df_ins['region'].unique()

array(['southwest', 'southeast', 'northwest', 'northeast'], dtype=object)

In [110]:
# Select observations at specific levels with isin()
cond1 = df_ins['region'].isin(['southeast','northwest'])
cond1

0       False
1        True
2        True
3        True
4        True
        ...  
1333     True
1334    False
1335     True
1336    False
1337     True
Name: region, Length: 1338, dtype: bool

In [111]:
df_ins[cond1]

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.880,0,no,northwest,3866.85520
5,31,female,25.740,0,no,southeast,3756.62160
...,...,...,...,...,...,...,...
1327,51,male,30.030,1,no,southeast,9377.90470
1330,57,female,25.740,2,no,southeast,12629.16560
1333,50,male,30.970,3,no,northwest,10600.54830
1335,18,female,36.850,0,no,southeast,1629.83350


<br>

#### [Exercise]

1. Select the observations in df_sp with a math score of 90 or higher
2. Select the observations in df_sp whose race/ethnicity is 'group D' or 'group E' (use isin())
3. Select the observations that satisfy both 1. and 2.

In [None]:
df_sp.head()

In [112]:
cond_math = df_sp['math score'] >= 90
df_sp[cond_math]

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
2,female,group B,master's degree,standard,none,90,95,93
34,male,group E,some college,standard,none,97,87,82
104,male,group C,some college,standard,completed,98,86,90
114,female,group E,bachelor's degree,standard,completed,99,100,100
121,male,group B,associate's degree,standard,completed,91,89,92
149,male,group E,associate's degree,free/reduced,completed,100,100,93
165,female,group C,bachelor's degree,standard,completed,96,100,100
171,male,group E,some high school,standard,none,94,88,78
179,female,group D,some high school,standard,completed,97,100,100
233,male,group E,some high school,standard,none,92,87,78


In [113]:
cond_race = df_sp['race/ethnicity'].isin(['group D', 'group E'])
df_sp[cond_race]

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
8,male,group D,high school,free/reduced,completed,64,64,67
11,male,group D,associate's degree,standard,none,40,52,43
20,male,group D,high school,standard,none,66,69,63
22,male,group D,some college,standard,none,44,54,53
24,male,group D,bachelor's degree,free/reduced,completed,74,71,80
...,...,...,...,...,...,...,...,...
992,female,group D,associate's degree,free/reduced,none,55,76,76
993,female,group D,bachelor's degree,free/reduced,none,62,72,74
995,female,group E,master's degree,standard,completed,88,99,95
998,female,group D,some college,standard,completed,68,78,77


In [114]:
df_sp[cond_math & cond_race]

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
34,male,group E,some college,standard,none,97,87,82
114,female,group E,bachelor's degree,standard,completed,99,100,100
149,male,group E,associate's degree,free/reduced,completed,100,100,93
171,male,group E,some high school,standard,none,94,88,78
179,female,group D,some high school,standard,completed,97,100,100
233,male,group E,some high school,standard,none,92,87,78
263,female,group E,high school,standard,none,99,93,90
286,male,group E,associate's degree,standard,completed,97,82,88
299,male,group D,associate's degree,free/reduced,none,90,87,75
306,male,group E,some college,standard,completed,99,87,81


#### [Reference] Using the str methods of a Series
On a string Series (a single variable), the str accessor methods can select observations that contain a given word or match a given pattern

In [116]:
df_sp[df_sp['parental level of education'].str.startswith('b')]

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
0,female,group B,bachelor's degree,standard,none,72,72,74
24,male,group D,bachelor's degree,free/reduced,completed,74,71,80
27,female,group C,bachelor's degree,standard,none,67,69,75
60,male,group E,bachelor's degree,free/reduced,completed,79,74,72
77,male,group A,bachelor's degree,standard,completed,80,78,81
...,...,...,...,...,...,...,...,...
916,male,group E,bachelor's degree,standard,completed,100,100,100
933,male,group C,bachelor's degree,free/reduced,completed,70,75,74
969,female,group B,bachelor's degree,standard,none,75,84,80
970,female,group D,bachelor's degree,standard,none,89,100,100


In [117]:
df_sp['parental level of education'].str.endswith('college')

0      False
1       True
2      False
3      False
4       True
       ...  
995    False
996    False
997    False
998     True
999     True
Name: parental level of education, Length: 1000, dtype: bool

In [118]:
df_sp['parental level of education'].str.contains('degree')

0       True
1      False
2       True
3       True
4      False
       ...  
995     True
996    False
997    False
998    False
999    False
Name: parental level of education, Length: 1000, dtype: bool
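
Real data often has missing values or inconsistent capitalization, which affects the str methods. The sketch below, on a hypothetical Series, uses the `case=` and `na=` options of `str.contains()` so that the result is a clean True/False mask.

```python
import pandas as pd

# Toy Series with a missing value and mixed capitalization (hypothetical)
s = pd.Series(["bachelor's degree", 'some college', None, "Master's Degree"])

# case=False ignores capitalization; na=False treats missing values as no match
mask = s.str.contains('degree', case=False, na=False)
print(mask.tolist())  # [True, False, False, True]

# The mask can be used directly for selection
matched = s[mask]
```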

<br>

#### [Reference] Using the between method of a Series
On a numeric Series (a single variable), *between()* selects observations within a given range

In [119]:
df_sp['math score'].between(80, 89.9)

0      False
1      False
2      False
3      False
4      False
       ...  
995     True
996    False
997    False
998    False
999    False
Name: math score, Length: 1000, dtype: bool

In [120]:
# You can specify whether each boundary is included
    # 'both', 'left', 'right', 'neither'
df_sp[ df_sp['math score'].between(80, 90, inclusive='left') ] 

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
6,female,group B,some college,standard,completed,88,95,92
16,male,group C,high school,standard,none,88,89,86
35,male,group E,associate's degree,standard,completed,81,81,79
49,male,group C,high school,standard,completed,82,84,82
53,male,group D,high school,standard,none,88,78,75
...,...,...,...,...,...,...,...,...
970,female,group D,bachelor's degree,standard,none,89,100,100
981,male,group D,some high school,standard,none,81,78,78
987,male,group E,some high school,standard,completed,81,75,76
990,male,group E,high school,free/reduced,completed,86,81,75
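
The `inclusive=` option is easiest to read next to the equivalent comparison. A minimal sketch on a hypothetical Series (string values for `inclusive=` assume a reasonably recent pandas version):

```python
import pandas as pd

# Toy scores around the boundaries 80 and 90 (hypothetical)
s = pd.Series([79, 80, 85, 90, 91])

# inclusive='left': 80 is included, 90 is not
left = s.between(80, 90, inclusive='left')

# The same condition written with comparison operators
manual = (s >= 80) & (s < 90)

# Default inclusive='both': both boundaries are included
both = s.between(80, 90)

print(left.tolist())  # [False, True, True, False, False]
print(both.tolist())  # [False, True, True, True, False]
```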


<br>

#### [Reference] Negation with ~ (flipping True/False)
Put **~** in front of a bool Series (True/False) to flip True and False

In [121]:
cond1 = df_sp['math score'].between(80, 90, inclusive='left')
cond1

0      False
1      False
2      False
3      False
4      False
       ...  
995     True
996    False
997    False
998    False
999    False
Name: math score, Length: 1000, dtype: bool

In [122]:
~cond1

0       True
1       True
2       True
3       True
4       True
       ...  
995    False
996     True
997     True
998     True
999     True
Name: math score, Length: 1000, dtype: bool

In [125]:
df_sp[~cond1]['math score'].unique()

array([ 72,  69,  90,  47,  76,  71,  40,  64,  38,  58,  65,  78,  50,
        18,  46,  54,  66,  44,  74,  73,  67,  70,  62,  63,  56,  97,
        75,  57,  55,  53,  59,  77,  33,  52,   0,  79,  39,  45,  60,
        61,  41,  49,  30,  42,  27,  43,  68,  98,  51,  99,  91,  22,
       100,  96,  94,  48,  35,  34,  92,  37,  28,  24,  26,  95,  36,
        29,  32,  93,  19,  23,   8], dtype=int64)

<br>

### 3.6. Selecting a subset of observations with functions


In [None]:
# head() and tail()
df_ins.head()
df_ins.tail()

In [133]:
# Using sample()
#df_ins.sample(frac=0.005)
df_ins.sample(n=5)

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
1333,50,male,30.97,3,no,northwest,10600.5483
363,21,female,26.4,1,no,southwest,2597.779
816,24,female,24.225,0,no,northwest,2842.76075
550,63,male,30.8,0,no,southwest,13390.559
642,61,male,33.915,0,no,northeast,13143.86485


In [137]:
# Select top/bottom observations with nlargest() and nsmallest()
df_ins.nlargest(50, 'charges')

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
543,54,female,47.41,0,yes,southeast,63770.42801
1300,45,male,30.36,0,yes,southeast,62592.87309
1230,52,male,34.485,3,yes,northwest,60021.39897
577,31,female,38.095,1,yes,northeast,58571.07448
819,33,female,35.53,0,yes,northwest,55135.40209
1146,60,male,32.8,0,yes,southwest,52590.82939
34,28,male,36.4,1,yes,southwest,51194.55914
1241,64,male,36.96,2,yes,southeast,49577.6624
1062,59,male,41.14,1,yes,southeast,48970.2476
488,44,female,38.06,0,yes,southeast,48885.13561


In [135]:
df_ins.nsmallest(10, 'charges')

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
940,18,male,23.21,0,no,southeast,1121.8739
808,18,male,30.14,0,no,southeast,1131.5066
1244,18,male,33.33,0,no,southeast,1135.9407
663,18,male,33.66,0,no,southeast,1136.3994
22,18,male,34.1,0,no,southeast,1137.011
194,18,male,34.43,0,no,southeast,1137.4697
866,18,male,37.29,0,no,southeast,1141.4451
781,18,male,41.14,0,no,southeast,1146.7966
442,18,male,43.01,0,no,southeast,1149.3959
1317,18,male,53.13,0,no,southeast,1163.4627
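
`nlargest()` gives the same rows as sorting in descending order and taking the head. A minimal sketch on hypothetical toy data with distinct values:

```python
import pandas as pd

# Toy data (hypothetical, all charges distinct)
df_toy = pd.DataFrame({'name': list('abcde'),
                       'charges': [50, 10, 40, 20, 30]})

# Top 2 by charges
top2 = df_toy.nlargest(2, 'charges')

# Equivalent: sort in descending order, then take the first 2 rows
same = df_toy.sort_values('charges', ascending=False).head(2)

# Bottom 2 by charges
bottom2 = df_toy.nsmallest(2, 'charges')

print(top2['name'].tolist())     # ['a', 'c']
print(bottom2['name'].tolist())  # ['b', 'd']
```

When values are tied at the cutoff, the `keep=` option of `nlargest()` and `nsmallest()` controls which tied rows are returned.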


<br>

#### [Exercise]

1. Select the top 20 students by math score in df_sp
2. Select the bottom 10 students by writing score in df_sp


In [141]:
df_sp.sample(n=10)

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
366,male,group C,high school,standard,completed,69,58,53
409,male,group D,associate's degree,standard,completed,87,84,85
636,female,group B,high school,free/reduced,completed,67,80,81
329,female,group B,some high school,standard,none,41,55,51
626,male,group B,associate's degree,free/reduced,completed,69,70,63
841,male,group C,some high school,standard,none,64,58,51
16,male,group C,high school,standard,none,88,89,86
443,female,group B,associate's degree,standard,none,73,83,76
290,male,group C,associate's degree,standard,none,76,70,68
701,female,group B,some high school,standard,none,57,67,72


In [142]:
df_sp.nlargest(20, 'math score')

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
149,male,group E,associate's degree,free/reduced,completed,100,100,93
451,female,group E,some college,standard,none,100,92,97
458,female,group E,bachelor's degree,standard,none,100,100,100
623,male,group A,some college,standard,completed,100,96,86
625,male,group D,some college,standard,completed,100,97,99
916,male,group E,bachelor's degree,standard,completed,100,100,100
962,female,group E,associate's degree,standard,none,100,100,100
114,female,group E,bachelor's degree,standard,completed,99,100,100
263,female,group E,high school,standard,none,99,93,90
306,male,group E,some college,standard,completed,99,87,81


In [144]:
df_sp.nsmallest(10, 'writing score')

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
59,female,group C,some high school,free/reduced,none,0,17,10
596,male,group B,high school,free/reduced,none,30,24,15
327,male,group A,some college,free/reduced,none,28,23,19
76,male,group E,some high school,standard,none,30,26,22
980,female,group B,high school,free/reduced,none,8,24,23
211,male,group C,some college,free/reduced,none,35,28,27
338,female,group B,some high school,free/reduced,none,24,38,27
896,male,group B,high school,free/reduced,none,36,29,27
17,female,group B,some high school,free/reduced,none,18,32,28
601,female,group C,high school,standard,none,29,29,30


<br>

### 3.7. Removing duplicates

`drop_duplicates()` produces a list with duplicate values removed

In [145]:
df_ins[['sex','region']].drop_duplicates()

Unnamed: 0,sex,region
0,female,southwest
1,male,southeast
3,male,northwest
5,female,southeast
7,female,northwest
8,male,northeast
12,male,southwest
16,female,northeast
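
`drop_duplicates()` can also keep the full rows while judging duplicates on selected columns only. The sketch below uses the `subset=` and `keep=` options on hypothetical toy data.

```python
import pandas as pd

# Toy data (hypothetical): rows 0 and 2 share the same sex/region pair
df_toy = pd.DataFrame({'sex': ['f', 'm', 'f', 'm'],
                       'region': ['sw', 'se', 'sw', 'nw'],
                       'age': [19, 18, 40, 33]})

# Unique combinations of the two columns
pairs = df_toy[['sex', 'region']].drop_duplicates()

# Keep whole rows, checking duplicates on the subset only (first occurrence kept)
first_rows = df_toy.drop_duplicates(subset=['sex', 'region'])

# keep='last' keeps the last occurrence instead
last_rows = df_toy.drop_duplicates(subset=['sex', 'region'], keep='last')

print(len(pairs))                 # 3
print(first_rows['age'].tolist()) # [19, 18, 33]
print(last_rows['age'].tolist())  # [18, 40, 33]
```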


### 3.8. Sorting observations

`sort_values()` sorts the observations

In [146]:
# Sort the data by age
df_ins.sort_values('age')

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
1248,18,female,39.820,0,no,southeast,1633.96180
482,18,female,31.350,0,no,southeast,1622.18850
492,18,female,25.080,0,no,northeast,2196.47320
525,18,female,33.880,0,no,southeast,11482.63485
529,18,male,25.460,0,no,northeast,1708.00140
...,...,...,...,...,...,...,...
398,64,male,25.600,2,no,southwest,14988.43200
335,64,male,34.500,0,no,southwest,13822.80300
378,64,female,30.115,3,no,northwest,16455.70785
1265,64,male,23.760,0,yes,southeast,26926.51440


In [147]:
# The original data is unaffected
df_ins.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


In [148]:
# Sort the original data (reassign the result)
df_ins = df_ins.sort_values('age')
df_ins.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
1248,18,female,39.82,0,no,southeast,1633.9618
482,18,female,31.35,0,no,southeast,1622.1885
492,18,female,25.08,0,no,northeast,2196.4732
525,18,female,33.88,0,no,southeast,11482.63485
529,18,male,25.46,0,no,northeast,1708.0014


In [150]:
# Descending order
df_ins = df_ins.sort_values('age', ascending=False)
df_ins

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
635,64,male,38.190,0,no,northeast,14410.93210
890,64,female,26.885,0,yes,northwest,29330.98315
1051,64,male,26.410,0,no,northeast,14394.55790
420,64,male,33.880,0,yes,southeast,46889.26120
418,64,male,39.160,1,no,southeast,14418.28040
...,...,...,...,...,...,...,...
648,18,male,28.500,0,no,northeast,1712.22700
663,18,male,33.660,0,no,southeast,1136.39940
1282,18,female,21.660,0,yes,northeast,14283.45940
710,18,male,35.200,1,no,southeast,1727.54000


In [151]:
# Multiple sort keys
df_ins.sort_values(['age', 'charges'], ascending=[True, False])

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
803,18,female,42.240,0,yes,southeast,38792.68560
759,18,male,38.170,0,yes,southeast,36307.79830
161,18,female,36.850,0,yes,southeast,36149.48350
623,18,male,33.535,0,yes,northeast,34617.84065
57,18,male,31.680,2,yes,southeast,34303.16720
...,...,...,...,...,...,...,...
768,64,female,39.700,0,no,southwest,14319.03100
801,64,female,35.970,0,no,southeast,14313.84630
752,64,male,37.905,0,no,northwest,14210.53595
534,64,male,40.480,0,no,southeast,13831.11520


In [152]:
# Sort by index
df_ins = df_ins.sort_index()
df_ins.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552
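
The sorting behavior above can be summarized on hypothetical toy data: `sort_values()` returns a new object and carries the original index labels along, `sort_index()` restores the original order, and `reset_index(drop=True)` renumbers the sorted rows.

```python
import pandas as pd

# Toy data (hypothetical)
df_toy = pd.DataFrame({'age': [30, 18, 25], 'charges': [1.0, 2.0, 3.0]})

# sort_values() returns a sorted copy; index labels travel with the rows
sorted_df = df_toy.sort_values('age')
print(sorted_df.index.tolist())  # [1, 2, 0]
print(df_toy.index.tolist())     # [0, 1, 2]

# sort_index() brings back the original order
restored = sorted_df.sort_index()

# reset_index(drop=True) renumbers the rows in their sorted order
renumbered = sorted_df.reset_index(drop=True)
print(renumbered['age'].tolist())  # [18, 25, 30]
```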


<br>


#### [Exercise] Using the df_sp data

1. Print all observations sorted by 'math score' and 'reading score' in descending order


In [175]:
df_sp.groupby('gender').count()

Unnamed: 0_level_0,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
gender,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
female,518,518,518,518,518,518,518
male,482,482,482,482,482,482,482


In [159]:
df_sp['parental level of education'].unique()

array(["bachelor's degree", 'some college', "master's degree",
       "associate's degree", 'high school', 'some high school'],
      dtype=object)

In [172]:
from pandas.api.types import CategoricalDtype

# Specify the desired order as a list.
cat_type = CategoricalDtype(categories=['high school', 'some high school' ,'some college', "associate's degree",  "bachelor's degree", "master's degree"], ordered=True)

# Convert the 'parental level of education' column to ordered categorical data.
df_sp['parental level of education'] = df_sp['parental level of education'].astype(cat_type)

# Now sort.
df_sp = df_sp.sort_values('parental level of education')
df_sp.iloc[[0, 100, 200, 300,400, 500, 600, 700, 800, 900]]

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
746,male,group D,high school,standard,none,69,75,71
858,male,group B,high school,standard,completed,52,49,46
399,male,group D,some high school,standard,none,60,59,54
591,male,group A,some high school,standard,none,71,62,50
424,male,group B,some college,free/reduced,none,41,39,34
769,male,group A,some college,free/reduced,none,58,60,57
777,female,group C,some college,free/reduced,none,35,44,43
659,male,group D,associate's degree,standard,none,90,87,85
36,female,group D,associate's degree,standard,none,74,81,83
913,female,group C,bachelor's degree,free/reduced,completed,47,62,66
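
The effect of an ordered categorical type on sorting can be seen on a small example. In this sketch the Series and the size categories are hypothetical; plain strings sort alphabetically, while the categorical version sorts in the declared order.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Toy Series (hypothetical)
s = pd.Series(['large', 'small', 'medium', 'small'])

# Plain strings sort alphabetically
alpha = s.sort_values().tolist()
print(alpha)    # ['large', 'medium', 'small', 'small']

# An ordered categorical type sorts in the order given in categories
size_type = CategoricalDtype(categories=['small', 'medium', 'large'], ordered=True)
ordered = s.astype(size_type).sort_values().tolist()
print(ordered)  # ['small', 'small', 'medium', 'large']
```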


In [160]:
df_sp.sort_values('parental level of education')

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
669,male,group D,associate's degree,standard,completed,81,72,77
457,male,group D,associate's degree,free/reduced,none,53,54,48
454,female,group C,associate's degree,free/reduced,none,53,61,62
452,female,group C,associate's degree,free/reduced,none,65,77,74
173,female,group C,associate's degree,standard,none,63,67,70
...,...,...,...,...,...,...,...,...
513,female,group B,some high school,standard,completed,54,61,62
515,female,group C,some high school,standard,completed,76,87,85
265,male,group D,some high school,free/reduced,none,59,42,41
853,male,group E,some high school,standard,none,82,67,61


#### End of script