# Learning Pandas
1. A module for data analysis
- A Python library used for data analysis and data processing
- Very widely used
- Built on top of NumPy

**Topics covered**
1. Learning the basics
- Creating a DataFrame from the contents of an existing file
- Handling duplicate data
- Handling missing values
- Merging two DataFrames: row-wise or column-wise?

**Main objects**
1. Series
> A. The building block of a DataFrame <br> B. Consists of a single column (one column of a DataFrame)

2. DataFrame
> A. An Excel-like structure with a horizontal and a vertical axis <br> B. Made up of Series

In [1]:
import numpy as np
import pandas as pd

In [2]:
!pip show pandas

Name: pandas
Version: 0.23.4
Summary: Powerful data structures for data analysis, time series, and statistics
Home-page: http://pandas.pydata.org
Author: None
Author-email: None
License: BSD
Location: c:\users\playdata\anaconda3\lib\site-packages
Requires: python-dateutil, pytz, numpy
Required-by: seaborn, pandas-datareader, odo, fix-yahoo-finance


Learning the basics

In [3]:
s = pd.Series([1,2,3])
s

0    1
1    2
2    3
dtype: int64

In [4]:
type(s)

pandas.core.series.Series

In [6]:
s.values

array([1, 2, 3], dtype=int64)

In [7]:
s.index

RangeIndex(start=0, stop=3, step=1)

In [8]:
# With a missing value (NaN) present, the dtype of the Series becomes float64
s = pd.Series([1,np.nan,3])
s

0    1.0
1    NaN
2    3.0
dtype: float64
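The dtype change above can be checked directly. The following is a minimal sketch with the same kind of small Series as in the cell; the variable names are chosen here for illustration only.

```python
import numpy as np
import pandas as pd

ints = pd.Series([1, 2, 3])            # only integers: integer dtype
with_nan = pd.Series([1, np.nan, 3])   # NaN is a float, so the values are upcast

print(ints.dtype)                # an integer dtype
print(with_nan.dtype)            # float64
print(with_nan.isna().sum())     # number of missing values
```

NaN is a floating-point value, so a NumPy-backed integer column cannot hold it without being converted to float.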

In [9]:
# Generate consecutive dates automatically, starting from a given date
# Note: the start date is written in yyyymmdd format
datas = pd.date_range('20181210', periods=6)
datas

DatetimeIndex(['2018-12-10', '2018-12-11', '2018-12-12', '2018-12-13',
               '2018-12-14', '2018-12-15'],
              dtype='datetime64[ns]', freq='D')

In [10]:
# Create a DataFrame: 6 rows, 4 columns
df = pd.DataFrame(np.random.randn(6,4), index=datas,columns=['A','B','C','D'])
df

Unnamed: 0,A,B,C,D
2018-12-10,-0.478695,-0.697765,2.092173,-0.663293
2018-12-11,0.994703,-1.177106,-0.074732,0.584394
2018-12-12,0.945091,-0.077789,1.176366,1.057915
2018-12-13,0.418631,0.057429,0.221677,-0.810361
2018-12-14,-2.108476,0.866042,-2.104554,-0.205641
2018-12-15,-0.236502,-0.775598,1.191881,-1.316389


In [11]:
df.values

array([[-0.47869491, -0.69776526,  2.09217312, -0.6632931 ],
       [ 0.99470316, -1.17710569, -0.07473174,  0.58439447],
       [ 0.9450912 , -0.07778875,  1.17636571,  1.05791522],
       [ 0.41863082,  0.05742912,  0.22167664, -0.81036095],
       [-2.10847581,  0.86604228, -2.10455395, -0.20564107],
       [-0.23650237, -0.77559836,  1.19188143, -1.31638858]])

In [12]:
df.index

DatetimeIndex(['2018-12-10', '2018-12-11', '2018-12-12', '2018-12-13',
               '2018-12-14', '2018-12-15'],
              dtype='datetime64[ns]', freq='D')

In [13]:
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

In [14]:
# Print a summary of the DataFrame (index, columns, non-null counts, dtypes)
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 6 entries, 2018-12-10 to 2018-12-15
Freq: D
Data columns (total 4 columns):
A    6 non-null float64
B    6 non-null float64
C    6 non-null float64
D    6 non-null float64
dtypes: float64(4)
memory usage: 240.0 bytes


In [15]:
df.describe()

Unnamed: 0,A,B,C,D
count,6.0,6.0,6.0,6.0
mean,-0.077541,-0.300798,0.417135,-0.225562
std,1.162115,0.733554,1.457496,0.897468
min,-2.108476,-1.177106,-2.104554,-1.316389
25%,-0.418147,-0.75614,-0.00063,-0.773594
50%,0.091064,-0.387777,0.699021,-0.434467
75%,0.813476,0.023625,1.188003,0.386886
max,0.994703,0.866042,2.092173,1.057915


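<font color = 'blue'> Summary: <br>
How to create a Series, <br>
How to generate dates automatically from a yyyymmdd start date and a number of periods<br> How to create a DataFrame object<br> How to set the index and columns<br>How to inspect a DataFrame's information </font>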

In [16]:
# Sort in ascending order by the values of column B
# Ascending: ascending=True
# Descending: ascending=False
df.sort_values(by='B',ascending=True)

Unnamed: 0,A,B,C,D
2018-12-11,0.994703,-1.177106,-0.074732,0.584394
2018-12-15,-0.236502,-0.775598,1.191881,-1.316389
2018-12-10,-0.478695,-0.697765,2.092173,-0.663293
2018-12-12,0.945091,-0.077789,1.176366,1.057915
2018-12-13,0.418631,0.057429,0.221677,-0.810361
2018-12-14,-2.108476,0.866042,-2.104554,-0.205641


In [17]:
df.sort_values(by='B',ascending=False)

Unnamed: 0,A,B,C,D
2018-12-14,-2.108476,0.866042,-2.104554,-0.205641
2018-12-13,0.418631,0.057429,0.221677,-0.810361
2018-12-12,0.945091,-0.077789,1.176366,1.057915
2018-12-10,-0.478695,-0.697765,2.092173,-0.663293
2018-12-15,-0.236502,-0.775598,1.191881,-1.316389
2018-12-11,0.994703,-1.177106,-0.074732,0.584394


In [18]:
df['A']

2018-12-10   -0.478695
2018-12-11    0.994703
2018-12-12    0.945091
2018-12-13    0.418631
2018-12-14   -2.108476
2018-12-15   -0.236502
Freq: D, Name: A, dtype: float64

In [19]:
type(df['A'])

pandas.core.series.Series

In [20]:
# Slice the rows of the DataFrame by position
# [0:3]: rows 0, 1, 2
df[0:3]

Unnamed: 0,A,B,C,D
2018-12-10,-0.478695,-0.697765,2.092173,-0.663293
2018-12-11,0.994703,-1.177106,-0.074732,0.584394
2018-12-12,0.945091,-0.077789,1.176366,1.057915


In [21]:
df

Unnamed: 0,A,B,C,D
2018-12-10,-0.478695,-0.697765,2.092173,-0.663293
2018-12-11,0.994703,-1.177106,-0.074732,0.584394
2018-12-12,0.945091,-0.077789,1.176366,1.057915
2018-12-13,0.418631,0.057429,0.221677,-0.810361
2018-12-14,-2.108476,0.866042,-2.104554,-0.205641
2018-12-15,-0.236502,-0.775598,1.191881,-1.316389


In [22]:
# Slice by index label; unlike slicing by row number, the end label is included
df['2018-12-10':'2018-12-13'] # same as df[0:4]

Unnamed: 0,A,B,C,D
2018-12-10,-0.478695,-0.697765,2.092173,-0.663293
2018-12-11,0.994703,-1.177106,-0.074732,0.584394
2018-12-12,0.945091,-0.077789,1.176366,1.057915
2018-12-13,0.418631,0.057429,0.221677,-0.810361


<font color = 'blue'> Summary: <br>
Returning some rows of a DataFrame by index position or by index label </font>

**loc**: label-based selection and slicing of data <br>
loc[index, columns]
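As a minimal sketch of how label-based and position-based slicing differ at the end point, the example below builds a small frame with the same date index as the notebook. The integer values are hypothetical and are used instead of random numbers so the result is reproducible.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20181210", periods=6)
frame = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

# Label slice: the end label '2018-12-12' is included.
by_label = frame.loc["2018-12-10":"2018-12-12", ["A", "B"]]

# Position slice: the end position 3 is excluded, so rows 0, 1, 2 are returned.
by_position = frame.iloc[0:3, 0:2]

print(by_label.equals(by_position))
```

Both expressions select the same three rows and two columns; only the way the boundaries are written differs.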

In [23]:
df

Unnamed: 0,A,B,C,D
2018-12-10,-0.478695,-0.697765,2.092173,-0.663293
2018-12-11,0.994703,-1.177106,-0.074732,0.584394
2018-12-12,0.945091,-0.077789,1.176366,1.057915
2018-12-13,0.418631,0.057429,0.221677,-0.810361
2018-12-14,-2.108476,0.866042,-2.104554,-0.205641
2018-12-15,-0.236502,-0.775598,1.191881,-1.316389


In [24]:
df.loc[:,['A','B']]

Unnamed: 0,A,B
2018-12-10,-0.478695,-0.697765
2018-12-11,0.994703,-1.177106
2018-12-12,0.945091,-0.077789
2018-12-13,0.418631,0.057429
2018-12-14,-2.108476,0.866042
2018-12-15,-0.236502,-0.775598


In [25]:
df.loc['2018-12-12',['A','B']]

A    0.945091
B   -0.077789
Name: 2018-12-12 00:00:00, dtype: float64

In [26]:
type(df.loc['2018-12-12',['A','B']])

pandas.core.series.Series

In [27]:
# Select by the integer positions of rows and columns
df.iloc[[0,1,2],[0,1]]

Unnamed: 0,A,B
2018-12-10,-0.478695,-0.697765
2018-12-11,0.994703,-1.177106
2018-12-12,0.945091,-0.077789


In [28]:
df.loc['2018-12-10':'2018-12-12',['A','B']]

Unnamed: 0,A,B
2018-12-10,-0.478695,-0.697765
2018-12-11,0.994703,-1.177106
2018-12-12,0.945091,-0.077789


**iloc**: unlike loc, selects data by the integer positions of rows and columns

In [None]:
df

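Question: return the value of column B for December 12. The return type depends on the kind of indexers passed, not on whether loc or iloc is used <br>
loc with two scalars, df.loc['2018-12-12','B']: numpy.float64 <br>
iloc with two lists, df.iloc[[2],[1]]: DataFrame <br>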
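The following sketch illustrates how the return type follows from the indexers. It uses hypothetical integer data and a `pd.Timestamp` label (an assumption made here to keep the lookup unambiguous; the notebook passes a date string).

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20181210", periods=6)
frame = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))
day = pd.Timestamp("2018-12-12")

scalar = frame.loc[day, "B"]            # scalar row, scalar column -> single value
row = frame.loc[day, ["A", "B"]]        # scalar row, list of columns -> Series
cell_frame = frame.iloc[[2], [1]]       # list of rows, list of columns -> DataFrame
scalar_by_position = frame.iloc[2, 1]   # iloc with two scalars also gives a single value

print(type(scalar).__name__, type(row).__name__, type(cell_frame).__name__)
```

Each list indexer keeps its axis in the result, while each scalar indexer removes one, and this holds for both `loc` and `iloc`.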

In [29]:
df.loc['2018-12-12','B']

-0.07778874713811153

In [30]:
type(df.loc['2018-12-12','B'])

numpy.float64

In [31]:
df.iloc[[2],[1]]

Unnamed: 0,B
2018-12-12,-0.077789


In [32]:
type(df.iloc[[2],[1]])

pandas.core.frame.DataFrame

**Selecting data with a condition**

In [33]:
df

Unnamed: 0,A,B,C,D
2018-12-10,-0.478695,-0.697765,2.092173,-0.663293
2018-12-11,0.994703,-1.177106,-0.074732,0.584394
2018-12-12,0.945091,-0.077789,1.176366,1.057915
2018-12-13,0.418631,0.057429,0.221677,-0.810361
2018-12-14,-2.108476,0.866042,-2.104554,-0.205641
2018-12-15,-0.236502,-0.775598,1.191881,-1.316389


In [34]:
df[df.A > 0]

Unnamed: 0,A,B,C,D
2018-12-11,0.994703,-1.177106,-0.074732,0.584394
2018-12-12,0.945091,-0.077789,1.176366,1.057915
2018-12-13,0.418631,0.057429,0.221677,-0.810361


In [35]:
# Copy the DataFrame into a new, independent object and assign it
df2 = df.copy()
df2

Unnamed: 0,A,B,C,D
2018-12-10,-0.478695,-0.697765,2.092173,-0.663293
2018-12-11,0.994703,-1.177106,-0.074732,0.584394
2018-12-12,0.945091,-0.077789,1.176366,1.057915
2018-12-13,0.418631,0.057429,0.221677,-0.810361
2018-12-14,-2.108476,0.866042,-2.104554,-0.205641
2018-12-15,-0.236502,-0.775598,1.191881,-1.316389


In [36]:
df2.iloc[0,0] = 0

In [37]:
df2

Unnamed: 0,A,B,C,D
2018-12-10,0.0,-0.697765,2.092173,-0.663293
2018-12-11,0.994703,-1.177106,-0.074732,0.584394
2018-12-12,0.945091,-0.077789,1.176366,1.057915
2018-12-13,0.418631,0.057429,0.221677,-0.810361
2018-12-14,-2.108476,0.866042,-2.104554,-0.205641
2018-12-15,-0.236502,-0.775598,1.191881,-1.316389


In [38]:
df

Unnamed: 0,A,B,C,D
2018-12-10,-0.478695,-0.697765,2.092173,-0.663293
2018-12-11,0.994703,-1.177106,-0.074732,0.584394
2018-12-12,0.945091,-0.077789,1.176366,1.057915
2018-12-13,0.418631,0.057429,0.221677,-0.810361
2018-12-14,-2.108476,0.866042,-2.104554,-0.205641
2018-12-15,-0.236502,-0.775598,1.191881,-1.316389


In [39]:
df3 = df  # df3 is a second name for the same object (a reference is assigned, no copy is made)
df3.iloc[0,0]=10

In [40]:
df3

Unnamed: 0,A,B,C,D
2018-12-10,10.0,-0.697765,2.092173,-0.663293
2018-12-11,0.994703,-1.177106,-0.074732,0.584394
2018-12-12,0.945091,-0.077789,1.176366,1.057915
2018-12-13,0.418631,0.057429,0.221677,-0.810361
2018-12-14,-2.108476,0.866042,-2.104554,-0.205641
2018-12-15,-0.236502,-0.775598,1.191881,-1.316389


In [41]:
df

Unnamed: 0,A,B,C,D
2018-12-10,10.0,-0.697765,2.092173,-0.663293
2018-12-11,0.994703,-1.177106,-0.074732,0.584394
2018-12-12,0.945091,-0.077789,1.176366,1.057915
2018-12-13,0.418631,0.057429,0.221677,-0.810361
2018-12-14,-2.108476,0.866042,-2.104554,-0.205641
2018-12-15,-0.236502,-0.775598,1.191881,-1.316389
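The difference between `copy()` and plain assignment shown in the cells above can be summarized in a short sketch. The data and variable names are hypothetical.

```python
import pandas as pd

original = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]})

alias = original               # plain assignment: a second name for the same object
independent = original.copy()  # copy(): a new object with its own data

alias.iloc[0, 0] = 10.0        # visible through `original`
independent.iloc[0, 0] = 0.0   # not visible through `original`

print(alias is original)        # same object
print(independent is original)  # different object
print(original.iloc[0, 0])
```

Use `copy()` whenever the original data must stay untouched while a modified version is prepared.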


In [42]:
# Add a new column to an existing DataFrame
df2['E'] = [1,2,3,4,5,np.nan]
df2

Unnamed: 0,A,B,C,D,E
2018-12-10,0.0,-0.697765,2.092173,-0.663293,1.0
2018-12-11,0.994703,-1.177106,-0.074732,0.584394,2.0
2018-12-12,0.945091,-0.077789,1.176366,1.057915,3.0
2018-12-13,0.418631,0.057429,0.221677,-0.810361,4.0
2018-12-14,-2.108476,0.866042,-2.104554,-0.205641,5.0
2018-12-15,-0.236502,-0.775598,1.191881,-1.316389,


In [None]:
del df2['E']  # removes column 'E'; this cell was not run, because the next cell still uses 'E'
df2

In [43]:
# Check, for each value in the column, whether it is one of the given values
df2['E'].isin([1,3,5,6])

2018-12-10     True
2018-12-11    False
2018-12-12     True
2018-12-13    False
2018-12-14     True
2018-12-15    False
Freq: D, Name: E, dtype: bool

In [44]:
df['A']

2018-12-10    10.000000
2018-12-11     0.994703
2018-12-12     0.945091
2018-12-13     0.418631
2018-12-14    -2.108476
2018-12-15    -0.236502
Freq: D, Name: A, dtype: float64

In [45]:
df['A'].max()

10.0

In [46]:
df['A'].min()

-2.1084758050358534

In [47]:
# Maximum minus minimum of column A
df['A'].max() - df['A'].min()

12.108475805035853

In [48]:
# Maximum minus minimum of every column
df.max() - df.min()

A    12.108476
B     2.043148
C     4.196727
D     2.374304
dtype: float64

In [49]:
# apply() is the DataFrame method that accepts a user-defined function or a lambda expression
# Max minus min: the function receives each column in turn and computes the difference
df.apply(lambda x : x.max()-x.min())

A    12.108476
B     2.043148
C     4.196727
D     2.374304
dtype: float64

In [50]:
# Check how often apply() calls the function by printing its argument: once per column
df.apply(lambda x : print(x))

2018-12-10    10.000000
2018-12-11     0.994703
2018-12-12     0.945091
2018-12-13     0.418631
2018-12-14    -2.108476
2018-12-15    -0.236502
Freq: D, Name: A, dtype: float64
2018-12-10   -0.697765
2018-12-11   -1.177106
2018-12-12   -0.077789
2018-12-13    0.057429
2018-12-14    0.866042
2018-12-15   -0.775598
Freq: D, Name: B, dtype: float64
2018-12-10    2.092173
2018-12-11   -0.074732
2018-12-12    1.176366
2018-12-13    0.221677
2018-12-14   -2.104554
2018-12-15    1.191881
Freq: D, Name: C, dtype: float64
2018-12-10   -0.663293
2018-12-11    0.584394
2018-12-12    1.057915
2018-12-13   -0.810361
2018-12-14   -0.205641
2018-12-15   -1.316389
Freq: D, Name: D, dtype: float64


A    None
B    None
C    None
D    None
dtype: object
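To make the calling pattern of `apply()` explicit, this sketch records the name of every Series passed to the function. The helper `value_range` and the data are hypothetical. Recent pandas versions call the function once per column (or once per row with `axis=1`); some older releases could call it an extra time on the first column.

```python
import pandas as pd

frame = pd.DataFrame({"A": [1, 5, 3], "B": [2, 2, 8]})
calls = []

def value_range(values):
    """Record which Series was received, then return max minus min."""
    calls.append(values.name)
    return values.max() - values.min()

per_column = frame.apply(value_range)        # default axis=0: one Series per column
column_calls = list(calls)

calls.clear()
per_row = frame.apply(value_range, axis=1)   # axis=1: one Series per row
row_calls = list(calls)

print(column_calls, per_column.tolist())
print(row_calls, per_row.tolist())
```

With the default `axis=0` the Series names are the column labels; with `axis=1` they are the row labels.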

**Creating a DataFrame from the contents of an existing file**

In [51]:
# Read a file from another folder and create a DataFrame from it
# Location of this notebook: C:\0.ITStudy\9.Pandas\step01_
# Location of the external file (friends.csv): C:\0.ITStudy\0.dataSet\friends.csv
df = pd.read_csv('../0.dataSet/friends.csv')

In [52]:
df

Unnamed: 0,이름,나이,직업,hobby
0,신동엽,20,연예인,music
1,유재석,41,교수,art
2,김새롬,18,학생,study
3,이영자,45,상담사,talk
4,강호동,38,연예인,talk


In [53]:
df['이름']

0    신동엽
1    유재석
2    김새롬
3    이영자
4    강호동
Name: 이름, dtype: object

In [54]:
# Change the name 이영자 to 이영순
df.iloc[3,0]='이영순'
df

Unnamed: 0,이름,나이,직업,hobby
0,신동엽,20,연예인,music
1,유재석,41,교수,art
2,김새롬,18,학생,study
3,이영순,45,상담사,talk
4,강호동,38,연예인,talk


In [55]:
# Create a DataFrame from a file whose fields are separated by tabs
df = pd.read_csv('../0.dataSet/friendsTab.txt',delimiter='\t')
df

Unnamed: 0,이름,나이,직업,hobby
0,신동엽,20,연예인,music
1,유재석,41,교수,art
2,김새롬,18,학생,study
3,이영자,45,상담사,talk
4,강호동,38,연예인,talk


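**Caveat when the raw data has no header row to use as column names** <br>
When creating the DataFrame, use a parameter to prevent the first line of data from being used as the header. <br>
Step 1: the 신동엽 row is used as the header -> column names are then assigned -> that row is lost <br>
Step 2: read the file without losing the 신동엽 row <br>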
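The effect of the header handling can be reproduced without any file on disk. The sketch below feeds a two-line, tab-separated string through `io.StringIO`; the string is a hypothetical stand-in for a header-less file such as friendsTabNoHead.txt.

```python
import io
import pandas as pd

# Hypothetical header-less, tab-separated content.
raw = "신동엽\t20\t연예인\tmusic\n유재석\t41\t교수\tart\n"

# Default: the first line becomes the header, so one data row is lost.
default_read = pd.read_csv(io.StringIO(raw), delimiter="\t")

# header=None: every line is data, columns are numbered 0..3.
no_header = pd.read_csv(io.StringIO(raw), delimiter="\t", header=None)

# names=...: every line is data and the columns get names at read time.
named = pd.read_csv(io.StringIO(raw), delimiter="\t",
                    names=["name", "age", "job", "hobby"])

print(len(default_read), len(no_header), len(named))
print(list(default_read.columns))
```

Passing `names=` alone is enough for a header-less file, because pandas then treats the first line as data.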


In [56]:
# Step 1: read a file that has no header row with the default settings
df = pd.read_csv('../0.dataSet/friendsTabNoHead.txt',delimiter='\t')
df

Unnamed: 0,신동엽,20,연예인,music
0,유재석,41,교수,art
1,김새롬,18,학생,study
2,이영자,45,상담사,talk
3,강호동,38,연예인,talk


In [57]:
df.columns

Index(['신동엽', '20', '연예인', 'music'], dtype='object')

In [58]:
df.columns = ['name','age','job','hobby']
df

Unnamed: 0,name,age,job,hobby
0,유재석,41,교수,art
1,김새롬,18,학생,study
2,이영자,45,상담사,talk
3,강호동,38,연예인,talk


In [59]:
# Step 2: header=None keeps the first line as data
df = pd.read_csv('../0.dataSet/friendsTabNoHead.txt',delimiter='\t', header=None)
df

Unnamed: 0,0,1,2,3
0,신동엽,20,연예인,music
1,유재석,41,교수,art
2,김새롬,18,학생,study
3,이영자,45,상담사,talk
4,강호동,38,연예인,talk


In [60]:
df.columns = ['name','age','job','hobby']
df

Unnamed: 0,name,age,job,hobby
0,신동엽,20,연예인,music
1,유재석,41,교수,art
2,김새롬,18,학생,study
3,이영자,45,상담사,talk
4,강호동,38,연예인,talk


In [61]:
# Step 3: column names can be supplied when the DataFrame is created
df = pd.read_csv('../0.dataSet/friendsTabNoHead.txt',delimiter='\t', names = ['name','age','job','hobby'])
df

Unnamed: 0,name,age,job,hobby
0,신동엽,20,연예인,music
1,유재석,41,교수,art
2,김새롬,18,학생,study
3,이영자,45,상담사,talk
4,강호동,38,연예인,talk


**Creating and processing a DataFrame from Python data types**

In [62]:
friend_dict_list = [{'name': '신동엽', 'age': 20, 'job': '연예인', 'hobby':'music'},
                     {'name': '유재석', 'age': 41, 'job': '교수', 'hobby':'art'},
                     {'name': '김새롬', 'age': 18, 'job': '학생', 'hobby':'study'},
                     {'name': '이영자', 'age' : 45, 'job': '상담사', 'hobby' : 'talk'},
                     {'name' :  '강호동', 'age' : 38, 'job' : '연예인', 'hobby' : 'talk'}]

In [63]:
friend_dict_list

[{'name': '신동엽', 'age': 20, 'job': '연예인', 'hobby': 'music'},
 {'name': '유재석', 'age': 41, 'job': '교수', 'hobby': 'art'},
 {'name': '김새롬', 'age': 18, 'job': '학생', 'hobby': 'study'},
 {'name': '이영자', 'age': 45, 'job': '상담사', 'hobby': 'talk'},
 {'name': '강호동', 'age': 38, 'job': '연예인', 'hobby': 'talk'}]

In [64]:
type(friend_dict_list)

list

In [70]:
friend_dict_list[0]

{'name': '신동엽', 'age': 20, 'job': '연예인', 'hobby': 'music'}

In [72]:
friend_dict_list[0]['name']

'신동엽'

In [73]:
type(friend_dict_list[0]['name'])

str

In [67]:
df = pd.DataFrame(friend_dict_list)
df

Unnamed: 0,age,hobby,job,name
0,20,music,연예인,신동엽
1,41,art,교수,유재석
2,18,study,학생,김새롬
3,45,talk,상담사,이영자
4,38,talk,연예인,강호동


In [74]:
df.name

0    신동엽
1    유재석
2    김새롬
3    이영자
4    강호동
Name: name, dtype: object

In [75]:
type(df.name)

pandas.core.series.Series

In [76]:
df.head()

Unnamed: 0,age,hobby,job,name
0,20,music,연예인,신동엽
1,41,art,교수,유재석
2,18,study,학생,김새롬
3,45,talk,상담사,이영자
4,38,talk,연예인,강호동


In [77]:
df = df[['name','age','job','hobby']]
df

Unnamed: 0,name,age,job,hobby
0,신동엽,20,연예인,music
1,유재석,41,교수,art
2,김새롬,18,학생,study
3,이영자,45,상담사,talk
4,강호동,38,연예인,talk


In [78]:
# Create df2 as a DataFrame from the plain Python list friend_dict_list
# The column order must be name, age, job, hobby
# Can the column order be set at creation time? Yes, with columns=
df2 = pd.DataFrame(friend_dict_list, columns = ['name','age','job','hobby'], index = ['영','일','이','삼','사'])
df2

Unnamed: 0,name,age,job,hobby
영,신동엽,20,연예인,music
일,유재석,41,교수,art
이,김새롬,18,학생,study
삼,이영자,45,상담사,talk
사,강호동,38,연예인,talk


**Writing the cleaned DataFrame to a CSV file**

In [79]:
df2.to_csv('../0.dataSet/f1_out.csv', index=False)

In [80]:
# Write f2_out.csv without the column names
df2.to_csv('../0.dataSet/f2_out.csv', index=False, header=False)
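What the `index` and `header` options change is easiest to see when `to_csv()` is called without a path, in which case it returns the CSV text as a string. A minimal sketch with hypothetical data:

```python
import pandas as pd

frame = pd.DataFrame({"name": ["a", "b"], "age": [20, 41]}, index=["x", "y"])

with_all = frame.to_csv()                                   # index and header written
no_index = frame.to_csv(index=False)                        # header only
no_index_no_header = frame.to_csv(index=False, header=False)  # data rows only

print(with_all.splitlines())
print(no_index.splitlines())
print(no_index_no_header.splitlines())
```

A file written without a header has to be read back with `header=None` or `names=...`, otherwise its first data row is taken as the header.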

**Creating a DataFrame from a list**

In [81]:
friends = [['신동엽',20,'연예인','music'], 
           ['유재석',41,'교수','art'], 
           ['김새롬',18,'학생','study'], 
           ['이영자',45,'상담사','talk'], 
           ['강호동',38,'연예인','talk']]

df = pd.DataFrame.from_records(friends)
df

Unnamed: 0,0,1,2,3
0,신동엽,20,연예인,music
1,유재석,41,교수,art
2,김새롬,18,학생,study
3,이영자,45,상담사,talk
4,강호동,38,연예인,talk


In [82]:
df.columns=['name','age','job','hobby']
df

Unnamed: 0,name,age,job,hobby
0,신동엽,20,연예인,music
1,유재석,41,교수,art
2,김새롬,18,학생,study
3,이영자,45,상담사,talk
4,강호동,38,연예인,talk


In [83]:
df['salary']=0
df

Unnamed: 0,name,age,job,hobby,salary
0,신동엽,20,연예인,music,0
1,유재석,41,교수,art,0
2,김새롬,18,학생,study,0
3,이영자,45,상담사,talk,0
4,강호동,38,연예인,talk,0


In [84]:
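# Set salary to 'no' when job is '학생' (student) and to 'yes' otherwise
# np.where()
'''
    Assign 'no' or 'yes' to the salary column: df['salary'] = 'no' or 'yes'
    Condition on job (student or not a student): df['job'] != '학생'
    Conditional (ternary) operator:
        syntax in many programming languages: condition ? value_if_true : value_if_false
        np.where(condition, value_if_true, value_if_false) applies the same idea element-wise
'''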
df['salary']=np.where(df['job']!='학생','yes','no')
df

Unnamed: 0,name,age,job,hobby,salary
0,신동엽,20,연예인,music,yes
1,유재석,41,교수,art,yes
2,김새롬,18,학생,study,no
3,이영자,45,상담사,talk,yes
4,강호동,38,연예인,talk,yes
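As a small sketch of the same idea, `np.where` can be compared with Python's own conditional expression applied value by value. The data is hypothetical.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"job": ["학생", "교수", "학생"]})

# Vectorized: the condition is evaluated for the whole column at once.
frame["salary"] = np.where(frame["job"] != "학생", "yes", "no")

# Equivalent element-wise form with Python's conditional expression.
by_loop = ["yes" if job != "학생" else "no" for job in frame["job"]]

print(frame["salary"].tolist())
print(by_loop)
```

Both forms give the same values; the vectorized one avoids an explicit Python loop.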


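**Handling duplicate data** <br>
- Step 1: test with rows that are 100% identical
- Step 2: confirm that rows differing only in the age column are not duplicates
- Step 3: duplicates can be detected and dropped based on selected columns only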
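The three steps can be sketched on a tiny hypothetical frame before running them on the friends data:

```python
import pandas as pd

frame = pd.DataFrame({"name": ["a", "b", "a", "a"],
                      "age":  [20, 41, 20, 21]})

# Steps 1 and 2: a row counts as a duplicate only if ALL columns match an earlier row.
full_dupes = frame.duplicated()

# Step 3: restrict the comparison to selected columns.
name_dupes = frame.duplicated(subset=["name"])

# keep='last' retains the last occurrence of each name.
keep_last = frame.drop_duplicates(subset=["name"], keep="last")

print(full_dupes.tolist())
print(name_dupes.tolist())
print(keep_last.index.tolist())
```

Row 3 differs from row 0 in age only, so it is flagged only when the comparison is limited to the name column.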


In [87]:
# Step 1
friend_dict_list = [{'name': '신동엽', 'age': 20, 'job': '연예인', 'hobby':'music'},
                     {'name': '유재석', 'age': 41, 'job': '교수', 'hobby':'art'},
                     {'name': '김새롬', 'age': 18, 'job': '학생', 'hobby':'study'},
                     {'name': '이영자', 'age' : 45, 'job': '상담사', 'hobby' : 'talk'},
                     {'name' :  '강호동', 'age' : 38, 'job' : '연예인', 'hobby' : 'talk'},
                    {'name': '신동엽', 'age': 20, 'job': '연예인', 'hobby':'music'} ]

df = pd.DataFrame(friend_dict_list)
df = df[['name', 'age', 'job', 'hobby']]
df

Unnamed: 0,name,age,job,hobby
0,신동엽,20,연예인,music
1,유재석,41,교수,art
2,김새롬,18,학생,study
3,이영자,45,상담사,talk
4,강호동,38,연예인,talk
5,신동엽,20,연예인,music


In [88]:
df.duplicated()  # flag duplicated rows

0    False
1    False
2    False
3    False
4    False
5     True
dtype: bool

In [89]:
df = df.drop_duplicates()
df

Unnamed: 0,name,age,job,hobby
0,신동엽,20,연예인,music
1,유재석,41,교수,art
2,김새롬,18,학생,study
3,이영자,45,상담사,talk
4,강호동,38,연예인,talk


In [90]:
# Step 2
friend_dict_list = [{'name': '신동엽', 'age': 20, 'job': '연예인', 'hobby':'music'},
                     {'name': '유재석', 'age': 41, 'job': '교수', 'hobby':'art'},
                     {'name': '김새롬', 'age': 18, 'job': '학생', 'hobby':'study'},
                     {'name': '이영자', 'age' : 45, 'job': '상담사', 'hobby' : 'talk'},
                     {'name' :  '강호동', 'age' : 38, 'job' : '연예인', 'hobby' : 'talk'},
                    {'name': '신동엽', 'age': 21, 'job': '연예인', 'hobby':'music'} ]

df = pd.DataFrame(friend_dict_list)
df = df[['name', 'age', 'job', 'hobby']]
df

Unnamed: 0,name,age,job,hobby
0,신동엽,20,연예인,music
1,유재석,41,교수,art
2,김새롬,18,학생,study
3,이영자,45,상담사,talk
4,강호동,38,연예인,talk
5,신동엽,21,연예인,music


In [91]:
df.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
dtype: bool

In [92]:
# Step 3: detect and drop duplicates based on selected columns only
df.drop_duplicates(['name'], keep='last')

Unnamed: 0,name,age,job,hobby
1,유재석,41,교수,art
2,김새롬,18,학생,study
3,이영자,45,상담사,talk
4,강호동,38,연예인,talk
5,신동엽,21,연예인,music


In [93]:
df

Unnamed: 0,name,age,job,hobby
0,신동엽,20,연예인,music
1,유재석,41,교수,art
2,김새롬,18,학생,study
3,이영자,45,상담사,talk
4,강호동,38,연예인,talk
5,신동엽,21,연예인,music


In [94]:
# inplace=True: modifies the original DataFrame instead of returning a new one
df.drop_duplicates(['name'], keep='last', inplace=True)

In [95]:
df

Unnamed: 0,name,age,job,hobby
1,유재석,41,교수,art
2,김새롬,18,학생,study
3,이영자,45,상담사,talk
4,강호동,38,연예인,talk
5,신동엽,21,연예인,music


**NaN**

In [96]:
friend_dict_list = [{'name': '신동엽', 'age': 20, 'job': '연예인', 'hobby':'music'},
                     {'name': '유재석', 'age': 41, 'job': '교수', 'hobby':'art'},
                     {'name': '김새롬', 'age': 18, 'job': '학생', 'hobby':'study'},
                     {'name': '이영자', 'age' : 45, 'job': '상담사', 'hobby' : 'talk'},
                     {'name' :  '강호동', 'age' : 38, 'job' : '연예인', 'hobby' : 'talk'},
                    {'name': '신동엽', 'age': None, 'job': '연예인', 'hobby':'music'} ]

df = pd.DataFrame(friend_dict_list)
df = df[['name', 'age', 'job', 'hobby']]

In [97]:
df

Unnamed: 0,name,age,job,hobby
0,신동엽,20.0,연예인,music
1,유재석,41.0,교수,art
2,김새롬,18.0,학생,study
3,이영자,45.0,상담사,talk
4,강호동,38.0,연예인,talk
5,신동엽,,연예인,music


In [98]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 4 columns):
name     6 non-null object
age      5 non-null float64
job      6 non-null object
hobby    6 non-null object
dtypes: float64(1), object(3)
memory usage: 272.0+ bytes


In [99]:
# Check for missing values: True only where a value is missing
df.isna()

Unnamed: 0,name,age,job,hobby
0,False,False,False,False
1,False,False,False,False
2,False,False,False,False
3,False,False,False,False
4,False,False,False,False
5,False,True,False,False


In [100]:
df.isnull()

Unnamed: 0,name,age,job,hobby
0,False,False,False,False
1,False,False,False,False
2,False,False,False,False
3,False,False,False,False
4,False,False,False,False
5,False,True,False,False


In [102]:
# Sum and mean of age over all friends (NaN is skipped by default)
df['age'].sum()
#np.sum(df['age'])

162.0

In [103]:
df['age'].mean()

32.4

In [104]:
'''
Replace NaN with 0 and recompute the mean.
The zero now counts as a sixth value, so the mean drops from 32.4 to 27.0.
'''
df['age'] = df['age'].fillna(0)
df['age'].mean()
#df.loc[df['age'].isnull(),'age']=0
#df['age'].mean()

27.0

In [105]:
df

Unnamed: 0,name,age,job,hobby
0,신동엽,20.0,연예인,music
1,유재석,41.0,교수,art
2,김새롬,18.0,학생,study
3,이영자,45.0,상담사,talk
4,강호동,38.0,연예인,talk
5,신동엽,0.0,연예인,music
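How the fill value affects the mean can be checked on the age values alone. This sketch uses the six ages from the notebook and compares three choices:

```python
import numpy as np
import pandas as pd

ages = pd.Series([20, 41, 18, 45, 38, np.nan])

skipped = ages.mean()                           # NaN ignored: 162 / 5
zero_filled = ages.fillna(0).mean()             # 0 counted as a value: 162 / 6
mean_filled = ages.fillna(ages.mean()).mean()   # filling with the mean keeps the mean

print(skipped, zero_filled, mean_filled)
```

Filling with 0 is only appropriate when 0 is a meaningful value for the column; for an age it pulls the average down.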


In [107]:
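# Fill the missing age of the second 신동엽 with the mean age of his job group
'''
1. This example assumes the missing value is still present (NaN, not the 0 filled in above)
2. Once the missing value has been filled, the mean is computed over all rows
'''
df['age'] = df['age'].fillna(df.groupby('job')['age'].transform('mean'))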

In [108]:
df

Unnamed: 0,name,age,job,hobby
0,신동엽,20.0,연예인,music
1,유재석,41.0,교수,art
2,김새롬,18.0,학생,study
3,이영자,45.0,상담사,talk
4,강호동,38.0,연예인,talk
5,신동엽,29.0,연예인,music
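The group-wise fill relies on `transform('mean')` returning one value per original row, aligned with the frame's index. A minimal sketch with a hypothetical subset of the data:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"job": ["연예인", "교수", "연예인", "연예인"],
                      "age": [20.0, 41.0, 38.0, np.nan]})

# One mean per row: each row receives the mean of its own job group (NaN is skipped).
group_mean = frame.groupby("job")["age"].transform("mean")

# fillna with a Series fills each missing value from the matching index position.
frame["age"] = frame["age"].fillna(group_mean)

print(group_mean.tolist())
print(frame["age"].tolist())
```

Only the missing entry is replaced; existing ages are left unchanged.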


In [112]:
df.groupby('job')

<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x000001EB601B0080>

In [110]:
df.groupby('job')['age']

<pandas.core.groupby.groupby.SeriesGroupBy object at 0x000001EB601B0358>

In [111]:
df.describe()

Unnamed: 0,age
count,6.0
mean,31.833333
std,11.267949
min,18.0
25%,22.25
50%,33.5
75%,40.25
max,45.0


**Handling unique values in duplicated data** <br>

1. The equivalent of DISTINCT in a database
- Listing the distinct job values of the emp table in an Oracle DB: select distinct job from emp;

In [113]:
friend_dict_list = [{'name': '신동엽', 'age': 20, 'job': '연예인', 'hobby':'music'},
                     {'name': '유재석', 'age': 41, 'job': '교수', 'hobby':'art'},
                     {'name': '김새롬', 'age': 18, 'job': '학생', 'hobby':'study'},
                     {'name': '이영자', 'age' : 45, 'job': '상담사', 'hobby' : 'talk'},
                     {'name' : '강호동', 'age' : 38, 'job' : '연예인', 'hobby' : 'talk'},
                     {'name': '신동엽', 'age': None, 'job': '연예인', 'hobby':'music'},
                     {'name': '고현정', 'age': 44, 'job': '가수', 'hobby':'music'},
                     {'name': '박민영', 'age': 22, 'job': '학생', 'hobby':'art'},
                     {'name': '박서준', 'age': 18, 'job': '학생', 'hobby':'study'},
                     {'name': '박보검', 'age' : 45, 'job': '상담사', 'hobby' : 'talk'},
                     {'name' : '이효리', 'age' : 28, 'job' : '교수', 'hobby' : 'talk'},
                    {'name': '이상순', 'age': 29, 'job': '주부', 'hobby':'music'}]

df = pd.DataFrame(friend_dict_list)
df = df[['name', 'age', 'job', 'hobby']]

In [114]:
df

Unnamed: 0,name,age,job,hobby
0,신동엽,20.0,연예인,music
1,유재석,41.0,교수,art
2,김새롬,18.0,학생,study
3,이영자,45.0,상담사,talk
4,강호동,38.0,연예인,talk
5,신동엽,,연예인,music
6,고현정,44.0,가수,music
7,박민영,22.0,학생,art
8,박서준,18.0,학생,study
9,박보검,45.0,상담사,talk
10,이효리,28.0,교수,talk
11,이상순,29.0,주부,music


In [115]:
print(df.job)
#same as df['job']

0     연예인
1      교수
2      학생
3     상담사
4     연예인
5     연예인
6      가수
7      학생
8      학생
9     상담사
10     교수
11     주부
Name: job, dtype: object


In [120]:
type(df.job)

pandas.core.series.Series

In [121]:
df.job.unique()

array(['연예인', '교수', '학생', '상담사', '가수', '주부'], dtype=object)

In [122]:
type(df.job.unique())

numpy.ndarray

In [123]:
df.job.value_counts()

연예인    3
학생     3
상담사    2
교수     2
가수     1
주부     1
Name: job, dtype: int64

In [124]:
type(df.job.value_counts())

pandas.core.series.Series

e.g. for hobby: list the distinct values -> count them -> count the people per hobby

In [125]:
df.hobby.unique()

array(['music', 'art', 'study', 'talk'], dtype=object)

In [126]:
len(df['hobby'].unique())

4

In [127]:
df.hobby.value_counts()

talk     4
music    4
art      2
study    2
Name: hobby, dtype: int64

**Merging two DataFrames - row-wise**

In [128]:
l1 = [{'name': '이효리', 'job': "교수"},
      {'name': '이상순', 'job': "학생"},
      {'name': '박보검', 'job': "개발자"}]

l2 = [{'name': '신동엽', 'job': "치과의사"},
      {'name': '이영자', 'job': "농부"},
      {'name': '정찬우', 'job': "연예인"}]
         
df1 = pd.DataFrame(l1, columns = ['name', 'job'])
df2 = pd.DataFrame(l2, columns = ['name', 'job'])

In [129]:
l1

[{'name': '이효리', 'job': '교수'},
 {'name': '이상순', 'job': '학생'},
 {'name': '박보검', 'job': '개발자'}]

In [130]:
df1

Unnamed: 0,name,job
0,이효리,교수
1,이상순,학생
2,박보검,개발자


In [131]:
type(l1)

list

In [132]:
type(df1)

pandas.core.frame.DataFrame

In [133]:
df1

Unnamed: 0,name,job
0,이효리,교수
1,이상순,학생
2,박보검,개발자


In [134]:
df2

Unnamed: 0,name,job
0,신동엽,치과의사
1,이영자,농부
2,정찬우,연예인


In [135]:
df3 = [df1, df2]
df3

[  name  job
 0  이효리   교수
 1  이상순   학생
 2  박보검  개발자,   name   job
 0  신동엽  치과의사
 1  이영자    농부
 2  정찬우   연예인]

In [136]:
df4 = pd.concat(df3)
df4

Unnamed: 0,name,job
0,이효리,교수
1,이상순,학생
2,박보검,개발자
0,신동엽,치과의사
1,이영자,농부
2,정찬우,연예인


In [137]:
df5 = pd.concat(df3, ignore_index=True)
df5

Unnamed: 0,name,job
0,이효리,교수
1,이상순,학생
2,박보검,개발자
3,신동엽,치과의사
4,이영자,농부
5,정찬우,연예인


**Merging two DataFrames - column-wise**

In [138]:
l1 = [{'name': '이효리', 'job': "교수"},
      {'name': '이상순', 'job': "학생"},
      {'name': '박보검', 'job': "개발자"}]

l2 = [{'name': '신동엽', 'job': "치과의사"},
      {'name': '이영자', 'job': "농부"},
      {'name': '정찬우', 'job': "연예인"}]
         
df1 = pd.DataFrame(l1, columns = ['name', 'job'])
df2 = pd.DataFrame(l2, columns = ['name', 'job'])

In [139]:
df3 = [df1,df2]
df3

[  name  job
 0  이효리   교수
 1  이상순   학생
 2  박보검  개발자,   name   job
 0  신동엽  치과의사
 1  이영자    농부
 2  정찬우   연예인]

In [None]:
df4 = pd.concat(df3, axis=1,ignore_index=True)
df4

In [None]:
# Reorder the columns to 0, 3, 2, 1
df4 = df4[[0,3,2,1]]
df4

In [None]:
# Rename the columns
df4.columns = ['name','job','name','job']
df4

In [None]:
# Select the first two columns by position
dataOne = df4.iloc[:,[0,1]]
dataOne

In [None]:
dataTwo = df4.iloc[:,[2,3]]
dataTwo

In [None]:
dataTotal = [dataOne, dataTwo]
dataTotal

In [None]:
dataFinal = pd.concat(dataTotal, ignore_index=True)
dataFinal
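The cells above were left unexecuted, so here is a minimal sketch of what column-wise concatenation produces. The frames are hypothetical stand-ins for df1 and df2.

```python
import pandas as pd

left = pd.DataFrame({"name": ["a", "b"], "job": ["x", "y"]})
right = pd.DataFrame({"name": ["c", "d"], "job": ["z", "w"]})

# axis=1 places the frames side by side, aligned on the row index.
side_by_side = pd.concat([left, right], axis=1)

# With axis=1, ignore_index renumbers the COLUMN labels, not the rows.
numbered = pd.concat([left, right], axis=1, ignore_index=True)

print(list(side_by_side.columns))
print(list(numbered.columns))
```

Duplicate column names are allowed but make `frame['name']` return several columns, which is why the notebook selects by position with `iloc` afterwards.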

### Review questions

In [223]:
friend_dict_list = [{'이름': '나학생1', '중간고사점수': 95, '기말고사점수': 85},
         {'이름': '나학생2', '중간고사점수': 85, '기말고사점수': 80},
         {'이름': '나학생3', '중간고사점수': 10, '기말고사점수': 30}]

df = pd.DataFrame(friend_dict_list, columns = ['이름', '중간고사점수', '기말고사점수'])
df

Unnamed: 0,이름,중간고사점수,기말고사점수
0,나학생1,95,85
1,나학생2,85,80
2,나학생3,10,30


**Exercise 1: add '총점' (total) and '평점' (average) columns using only the students' midterm and final exam scores**

In [224]:
df['총점'] = df['중간고사점수']+df['기말고사점수']
df['평점'] = (df['총점']/2).astype(int)
df

Unnamed: 0,이름,중간고사점수,기말고사점수,총점,평점
0,나학생1,95,85,180,90
1,나학생2,85,80,165,82
2,나학생3,10,30,40,20


**Exercise 2: a DataFrame can also be processed with loops. Print the average values using a for loop**

In [203]:
for v in df:
    print(v)

이름
중간고사점수
기말고사점수
총점
평점


In [204]:
for key, value in df.items():  # iteritems() was removed in pandas 2.0; items() behaves the same
    print(key, ' ', value)

이름   0    나학생1
1    나학생2
2    나학생3
Name: 이름, dtype: object
중간고사점수   0    95
1    85
2    10
Name: 중간고사점수, dtype: int64
기말고사점수   0    85
1    80
2    30
Name: 기말고사점수, dtype: int64
총점   0    180
1    165
2     40
Name: 총점, dtype: int64
평점   0    90
1    82
2    20
Name: 평점, dtype: int32


In [205]:
for key, value in df.iterrows():
    print(key, ' ', value)

0   이름        나학생1
중간고사점수      95
기말고사점수      85
총점         180
평점          90
Name: 0, dtype: object
1   이름        나학생2
중간고사점수      85
기말고사점수      80
총점         165
평점          82
Name: 1, dtype: object
2   이름        나학생3
중간고사점수      10
기말고사점수      30
총점          40
평점          20
Name: 2, dtype: object


In [208]:
for avg in df.평점:
    print(avg)

90
82
20


**Exercise 3: add a grade ('등급') column: A for 90 or above, B for 80 or above, F otherwise. Implement it in Python with conditional statements**

In [225]:
grade = []
for avg in df['평점']:
    if avg >= 90:
        grade.append('A')
    elif avg >= 80:
        grade.append('B')
    else:
        grade.append('F')
df['등급'] = grade
df

Unnamed: 0,이름,중간고사점수,기말고사점수,총점,평점,등급
0,나학생1,95,85,180,90,A
1,나학생2,85,80,165,82,B
2,나학생3,10,30,40,20,F


**Exercise 4: add an '이수여부' (pass/fail) column based on the grade: 'fail' when the grade is F, 'pass' otherwise. Implement it with a lambda**

In [None]:
'''
if v != 'F':
    return 'pass'
else:
    return 'fail'
'''

In [229]:
test = lambda v : 'pass' if v != 'F' else 'fail'
test('A')

'pass'

In [219]:
test('F')

'fail'

In [227]:
df['이수여부'] = df['등급'].apply(lambda v : 'pass' if v != 'F' else 'fail')
df

Unnamed: 0,이름,중간고사점수,기말고사점수,총점,평점,등급,이수여부
0,나학생1,95,85,180,90,A,pass
1,나학생2,85,80,165,82,B,pass
2,나학생3,10,30,40,20,F,fail


**Exercise 5: use apply to extract only the year from year-month-day values**

In [230]:
date_list = [{'yyyy-mm-dd': '2000-06-27'},
         {'yyyy-mm-dd': '2002-09-24'},
         {'yyyy-mm-dd': '2018-12-20'}]

df = pd.DataFrame(date_list, columns = ['yyyy-mm-dd'])

df

Unnamed: 0,yyyy-mm-dd
0,2000-06-27
1,2002-09-24
2,2018-12-20


In [236]:
df['yyyy-mm-dd']

0    2000-06-27
1    2002-09-24
2    2018-12-20
Name: yyyy-mm-dd, dtype: object

In [234]:
type(df['yyyy-mm-dd'])

pandas.core.series.Series

In [235]:
'2000-06-27'.split('-')

['2000', '06', '27']

In [237]:
for v in df['yyyy-mm-dd']:
    print(type(v))

<class 'str'>
<class 'str'>
<class 'str'>


In [239]:
for v in df['yyyy-mm-dd']:
    print(v.split('-')[0])

2000
2002
2018


In [242]:
'''
    Define a user-defined function -> call it from apply()
    Call syntax: apply(function_name)
'''
def getYear(row):
    return row.split('-')[0]

df['year'] = df['yyyy-mm-dd'].apply(getYear)
df

Unnamed: 0,yyyy-mm-dd,year
0,2000-06-27,2000
1,2002-09-24,2002
2,2018-12-20,2018


In [245]:
#check the type of year element: string. Hence convert into integer
type(df.year[0])

str
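Because `split` returns strings, the year has to be converted before it can be used in arithmetic. The sketch below shows two possible ways to obtain integer years from the same date strings: converting the split result, or parsing the dates with `pd.to_datetime`, which is an alternative to the `apply` approach of the exercise.

```python
import pandas as pd

dates = pd.Series(["2000-06-27", "2002-09-24", "2018-12-20"])

# String route: take the part before the first '-' and convert it to int.
years_from_split = dates.str.split("-").str[0].astype(int)

# Datetime route: parse the strings, then read the year component.
years_from_datetime = pd.to_datetime(dates).dt.year

print(years_from_split.tolist())
print(years_from_datetime.tolist())
```

The datetime route also validates the input: a malformed date raises an error instead of silently producing a wrong year.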

**Extra exercise: add an 'age' column**

In [247]:
def getAge(year, currentYear):
    return currentYear - int(year)
df['age'] = df['year'].apply(getAge, currentYear = 2018)
df

Unnamed: 0,yyyy-mm-dd,year,age
0,2000-06-27,2000,18
1,2002-09-24,2002,16
2,2018-12-20,2018,0


**Extra exercise: add an 'info' column containing '내 나이는 NN 살' ('I am NN years old')**

In [250]:
def getInfo(age, prefix, suffix):
    return prefix + str(age) + suffix
df['info'] = df['age'].apply(getInfo, prefix = '내 나이는 ', suffix = ' 살')
df

Unnamed: 0,yyyy-mm-dd,year,age,info
0,2000-06-27,2000,18,내 나이는 18 살
1,2002-09-24,2002,16,내 나이는 16 살
2,2018-12-20,2018,0,내 나이는 0 살


**group by exercises**

In [251]:
student_list = [{'name': 'John', 'major': "Computer Science", 'gender': "male"},
                {'name': 'Nate', 'major': "Computer Science", 'gender': "male"},
                {'name': 'Abraham', 'major': "Physics", 'gender': "male"},
                {'name': 'Brian', 'major': "Psychology", 'gender': "male"},
                {'name': 'Janny', 'major': "Economics", 'gender': "female"},
                {'name': 'Yuna', 'major': "Economics", 'gender': "female"},
                {'name': 'Jeniffer', 'major': "Computer Science", 'gender': "female"},
                {'name': 'Edward', 'major': "Computer Science", 'gender': "male"},
                {'name': 'Zara', 'major': "Psychology", 'gender': "female"},
                {'name': 'Wendy', 'major': "Economics", 'gender': "female"},
                {'name': 'Sera', 'major': "Psychology", 'gender': "female"}
         ]
df = pd.DataFrame(student_list, columns = ['name', 'major', 'gender'])
df

Unnamed: 0,name,major,gender
0,John,Computer Science,male
1,Nate,Computer Science,male
2,Abraham,Physics,male
3,Brian,Psychology,male
4,Janny,Economics,female
5,Yuna,Economics,female
6,Jeniffer,Computer Science,female
7,Edward,Computer Science,male
8,Zara,Psychology,female
9,Wendy,Economics,female
10,Sera,Psychology,female


In [252]:
groupby_major = df.groupby('major')
groupby_major

<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x000001EB607417F0>

In [253]:
# Attribute that maps each group (major) to the index labels of its rows
groupby_major.groups

{'Computer Science': Int64Index([0, 1, 6, 7], dtype='int64'),
 'Economics': Int64Index([4, 5, 9], dtype='int64'),
 'Physics': Int64Index([2], dtype='int64'),
 'Psychology': Int64Index([3, 8, 10], dtype='int64')}
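
`.groups` only returns the index labels of each group. To get the rows themselves, `get_group` can be used. The sketch below uses a hypothetical four-row `students` frame rather than the full student list, so that it runs on its own.

```python
import pandas as pd

# Hypothetical reduced data set with the same columns as the notebook's df.
students = pd.DataFrame({
    'name': ['John', 'Abraham', 'Janny', 'Yuna'],
    'major': ['Computer Science', 'Physics', 'Economics', 'Economics'],
    'gender': ['male', 'male', 'female', 'female'],
})
by_major = students.groupby('major')

# .groups maps each group key to the index labels of its rows.
economics_labels = list(by_major.groups['Economics'])

# .get_group returns the matching rows as a DataFrame.
economics_rows = by_major.get_group('Economics')
print(economics_labels)
print(economics_rows)
```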

In [254]:
groupby_major.name

<pandas.core.groupby.groupby.SeriesGroupBy object at 0x000001EB60741048>

In [259]:
# Each column name can be used as an attribute of the groupby object
# Attribute names available here: name / major / gender
groupby_major.gender

<pandas.core.groupby.groupby.SeriesGroupBy object at 0x000001EB607466D8>
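
Attribute access such as `groupby_major.gender` returns a `SeriesGroupBy`, which prints only as an object address until something is computed on it. Bracket selection (`['gender']`) selects the same column and also works when a column name collides with an existing method name. A minimal sketch, assuming a hypothetical four-row `students` frame:

```python
import pandas as pd

# Hypothetical reduced data set with the same columns as the notebook's df.
students = pd.DataFrame({
    'name': ['John', 'Jeniffer', 'Janny', 'Yuna'],
    'major': ['Computer Science', 'Computer Science', 'Economics', 'Economics'],
    'gender': ['male', 'female', 'female', 'female'],
})
by_major = students.groupby('major')

# Bracket selection; for this column name it matches by_major.gender.
# nunique() counts the distinct gender values inside each major.
genders_per_major = by_major['gender'].nunique()
print(genders_per_major)
```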

In [255]:
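'''
    name  variable: the variable name is arbitrary;
                    it holds the key of the current group (here, the major)
    group variable: the variable name is arbitrary;
                    it holds that group's data as a DataFrame:
                    the index labels and column values of the rows in the group
'''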
for name, group in groupby_major:
    print(name + ':' + str(len(group)))

Computer Science:4
Economics:3
Physics:1
Psychology:3


In [256]:
for n, g in groupby_major:
    print(n + ':' + str(len(g)))

Computer Science:4
Economics:3
Physics:1
Psychology:3


In [257]:
for name, group in groupby_major:
    print(name + ':' + str(len(group)))
    print(group)
    print()

Computer Science:4
       name             major  gender
0      John  Computer Science    male
1      Nate  Computer Science    male
6  Jeniffer  Computer Science  female
7    Edward  Computer Science    male

Economics:3
    name      major  gender
4  Janny  Economics  female
5   Yuna  Economics  female
9  Wendy  Economics  female

Physics:1
      name    major gender
2  Abraham  Physics   male

Psychology:3
     name       major  gender
3   Brian  Psychology    male
8    Zara  Psychology  female
10   Sera  Psychology  female
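
Because each iteration yields a `(key, DataFrame)` pair, the loop can also collect values instead of printing them. The following sketch builds a dictionary of names per major. The `students` frame is a hypothetical reduced data set, not the notebook's full list.

```python
import pandas as pd

# Hypothetical reduced data set with the same columns as the notebook's df.
students = pd.DataFrame({
    'name': ['John', 'Nate', 'Janny'],
    'major': ['Computer Science', 'Computer Science', 'Economics'],
    'gender': ['male', 'male', 'female'],
})

# Each group is a DataFrame, so its 'name' column can be turned into a list.
names_by_major = {
    major: group['name'].tolist()
    for major, group in students.groupby('major')
}
print(names_by_major)
```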



In [261]:
groupby_major.size()

major
Computer Science    4
Economics           3
Physics             1
Psychology          3
dtype: int64
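
`size()` counts every row in a group, while `count()` counts only non-missing values in each column. With the data above the two give the same numbers because nothing is missing. The sketch below introduces a missing `gender` value in a hypothetical three-row frame to show the difference.

```python
import pandas as pd

# Hypothetical data with one missing gender value (None).
students = pd.DataFrame({
    'name': ['John', 'Nate', 'Janny'],
    'major': ['Computer Science', 'Computer Science', 'Economics'],
    'gender': ['male', None, 'female'],
})
by_major = students.groupby('major')

# size(): number of rows per group, missing values included.
rows_per_major = by_major.size()

# count(): number of non-missing values per group for the selected column.
known_gender_per_major = by_major['gender'].count()

print(rows_per_major)
print(known_gender_per_major)
```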

Turning the groupby result back into a DataFrame

In [262]:
# The group keys become the index, so there is no integer index,
# and the 'count' header sits on a different row from 'major'
# Tidy this up with DataFrame.reset_index()
df2 = pd.DataFrame({'count': groupby_major.size()})
df2

Unnamed: 0_level_0,count
major,Unnamed: 1_level_1
Computer Science,4
Economics,3
Physics,1
Psychology,3


In [263]:
df3 = pd.DataFrame({'count': groupby_major.size()}).reset_index()
df3

Unnamed: 0,major,count
0,Computer Science,4
1,Economics,3
2,Physics,1
3,Psychology,3
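
Since `size()` returns a Series, the same table can probably be produced in one step with `Series.reset_index(name=...)`, which turns the index into a column and names the values column at the same time. A minimal sketch on a hypothetical three-row frame:

```python
import pandas as pd

# Hypothetical reduced data set; only 'major' matters for the count.
students = pd.DataFrame({
    'name': ['John', 'Nate', 'Janny'],
    'major': ['Computer Science', 'Computer Science', 'Economics'],
})

# size() gives a Series indexed by major; reset_index(name=...) makes it
# a DataFrame with a 'major' column and a 'count' column.
counts = students.groupby('major').size().reset_index(name='count')
print(counts)
```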


In [264]:
df

Unnamed: 0,name,major,gender
0,John,Computer Science,male
1,Nate,Computer Science,male
2,Abraham,Physics,male
3,Brian,Psychology,male
4,Janny,Economics,female
5,Yuna,Economics,female
6,Jeniffer,Computer Science,female
7,Edward,Computer Science,male
8,Zara,Psychology,female
9,Wendy,Economics,female
10,Sera,Psychology,female


Final exercise: <br>
1. Group the data by gender <br>
2. Print each group with its name, major, and gender columns <br>
3. Create a new DataFrame holding the head count per gender (in a column named count)

In [265]:
groupby_gender = df.groupby('gender')
groupby_gender

<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x000001EB60757160>

In [268]:
groupby_gender.name

<pandas.core.groupby.groupby.SeriesGroupBy object at 0x000001EB606EFBA8>

In [269]:
groupby_gender.major

<pandas.core.groupby.groupby.SeriesGroupBy object at 0x000001EB606EF588>

In [270]:
groupby_gender.gender

<pandas.core.groupby.groupby.SeriesGroupBy object at 0x000001EB606EFBE0>

In [276]:
for name, group in groupby_gender:
    print(name + ':' + str(len(group)))
    print(group)
    print()

female:6
        name             major  gender
4      Janny         Economics  female
5       Yuna         Economics  female
6   Jeniffer  Computer Science  female
8       Zara        Psychology  female
9      Wendy         Economics  female
10      Sera        Psychology  female

male:5
      name             major gender
0     John  Computer Science   male
1     Nate  Computer Science   male
2  Abraham           Physics   male
3    Brian        Psychology   male
7   Edward  Computer Science   male



In [277]:
df5 = pd.DataFrame({'count': groupby_gender.size()}).reset_index()
df5

Unnamed: 0,gender,count
0,female,6
1,male,5
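
The two groupings used in this section can also be combined by passing a list of columns to `groupby`. The sketch below counts students per gender and major and spreads the majors into columns with `unstack`. It is an illustration on a hypothetical four-row frame, not part of the original exercise.

```python
import pandas as pd

# Hypothetical reduced data set with the same columns as the notebook's df.
students = pd.DataFrame({
    'name': ['John', 'Abraham', 'Janny', 'Jeniffer'],
    'major': ['Computer Science', 'Physics', 'Economics', 'Computer Science'],
    'gender': ['male', 'male', 'female', 'female'],
})

# One row per gender, one column per major; absent combinations become 0.
table = students.groupby(['gender', 'major']).size().unstack(fill_value=0)
print(table)
```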
