# **1. Pandas**
Pandas is a Python library for data analysis that makes it easy to process and analyze tabular data as well as data in other shapes. Its central data structure is the DataFrame, which is designed for working with table-like data.

In [359]:
!pip install pandas



# **2. Series and DataFrame**

### 2-1. Series
A Series is a one-dimensional, array-like structure that represents a single column. Each element consists of an index label and a value, and the values are stored on top of a NumPy ndarray. A Series can hold many data types, including integers, floats, and strings.

In [360]:
import pandas as pd

In [361]:
idx = ['김사과', '반하나', '오렌지', '이메론', '배애리']
data = [67, 75, 90, 62, 98]

print(pd.Series(data))

print(pd.Series(data, idx)) # pd.Series(data, index)

0    67
1    75
2    90
3    62
4    98
dtype: int64
김사과    67
반하나    75
오렌지    90
이메론    62
배애리    98
dtype: int64


In [362]:
se1 = pd.Series(data, idx)
print(se1)
print(se1.index)
print(se1.values)

김사과    67
반하나    75
오렌지    90
이메론    62
배애리    98
dtype: int64
Index(['김사과', '반하나', '오렌지', '이메론', '배애리'], dtype='object')
[67 75 90 62 98]
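
The index labels are what make a Series more than a plain array. As a minimal sketch (the variable names `scores`, `by_label`, `by_position`, and `above_70` are illustrative and not part of the original notebook), the same value can be reached by label or by position, and a boolean mask keeps the matching labels:

```python
import pandas as pd

# Illustrative Series reusing the labels and scores from the cells above
scores = pd.Series([67, 75, 90, 62, 98],
                   index=['김사과', '반하나', '오렌지', '이메론', '배애리'])

by_label = scores['오렌지']       # label-based lookup
by_position = scores.iloc[2]     # position-based lookup (third element)
above_70 = scores[scores > 70]   # boolean mask; matching labels are preserved

print(by_label, by_position)
print(above_70)
```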


### 2-2. DataFrame
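The DataFrame is the most important and powerful data structure in the Pandas library and is used for two-dimensional, table-shaped data. Each element is identified by an index (row label) and a column, and holds a value. A DataFrame is made of rows and columns, and each column can have its own data type. The values are stored on top of NumPy ndarrays.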

In [363]:
data = [[67, 93, 91],
        [75, 68, 96],
        [87, 81, 82],
        [62, 70, 75],
        [98, 56, 87]]

idx = ['김사과', '반하나', '오렌지', '이메론', '배애리']
col = ['국어', '영어', '수학']

In [364]:
pd.DataFrame(data)

Unnamed: 0,0,1,2
0,67,93,91
1,75,68,96
2,87,81,82
3,62,70,75
4,98,56,87


In [365]:
pd.DataFrame(data, idx)

Unnamed: 0,0,1,2
김사과,67,93,91
반하나,75,68,96
오렌지,87,81,82
이메론,62,70,75
배애리,98,56,87


In [366]:
pd.DataFrame(data, idx, col)

Unnamed: 0,국어,영어,수학
김사과,67,93,91
반하나,75,68,96
오렌지,87,81,82
이메론,62,70,75
배애리,98,56,87


In [367]:
df = pd.DataFrame(index=idx, columns=col, data=data)
df

Unnamed: 0,국어,영어,수학
김사과,67,93,91
반하나,75,68,96
오렌지,87,81,82
이메론,62,70,75
배애리,98,56,87


In [368]:
print(df.index)
print(df.columns)
print(df.values)

Index(['김사과', '반하나', '오렌지', '이메론', '배애리'], dtype='object')
Index(['국어', '영어', '수학'], dtype='object')
[[67 93 91]
 [75 68 96]
 [87 81 82]
 [62 70 75]
 [98 56 87]]


In [369]:
dic = {
    '국어':[67, 75, 76, 62, 98],
    '영어':[93, 68, 81, 70, 56],
    '수학':[91, 96, 82, 75, 87]
}

df = pd.DataFrame(data=dic, index=idx)
df

Unnamed: 0,국어,영어,수학
김사과,67,93,91
반하나,75,68,96
오렌지,76,81,82
이메론,62,70,75
배애리,98,56,87


# **3. Reading a CSV File**
CSV stands for Comma-Separated Values. It is a file format that stores data as plain text, with one record per line and commas between the fields.
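
To see what the format looks like without needing a file on disk, here is a small sketch that parses CSV text held in a string. The `csv_text` content and the name `sample` are made up for illustration; only `pd.read_csv` and `io.StringIO` are real APIs. Note that an empty field is read as a missing value (NaN).

```python
import io
import pandas as pd

# Hypothetical CSV content: a header line, then one record per line
csv_text = """name,group,height
지민,방탄소년단,174.0
태연,소녀시대,
"""

# read_csv accepts any file-like object, so StringIO stands in for a real file
sample = pd.read_csv(io.StringIO(csv_text))

print(sample)
print(sample.dtypes)
```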

In [370]:
# /content/drive/MyDrive/컴퓨터비전 시즌3/3. 데이터분석/Data/idol.csv
df = pd.read_csv('/content/drive/MyDrive/컴퓨터비전/3. 데이터 분석/Data/idol.csv')
df

Unnamed: 0,이름,그룹,소속사,성별,생년월일,키,혈액형,브랜드평판지수
0,지민,방탄소년단,빅히트,남자,1995-10-13,174.0,A,6267302
1,정국,방탄소년단,빅히트,남자,1997-09-01,179.0,A,5805844
2,민지,뉴진스,어도어,여자,2004-05-07,169.0,A,4437081
3,하니,뉴진스,어도어,여자,2004-10-06,161.7,O,4161153
4,뷔,방탄소년단,빅히트,남자,1995-12-30,179.0,AB,3470048
5,다니엘,뉴진스,어도어,여자,2005-04-11,165.0,,2341271
6,혜인,뉴진스,어도어,여자,2008-04-21,170.0,O,2301785
7,지수,블랙핑크,와이지,여자,1995-01-03,162.0,A,2227460
8,해린,뉴진스,어도어,여자,2006-05-15,164.5,B,2173376
9,태연,소녀시대,에스엠,여자,1989-03-09,,A,2079866


In [371]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [372]:
type(df)

# **4. Basic DataFrame Information**

In [373]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 8 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   이름       20 non-null     object 
 1   그룹       20 non-null     object 
 2   소속사      19 non-null     object 
 3   성별       20 non-null     object 
 4   생년월일     20 non-null     object 
 5   키        19 non-null     float64
 6   혈액형      19 non-null     object 
 7   브랜드평판지수  20 non-null     int64  
dtypes: float64(1), int64(1), object(6)
memory usage: 1.4+ KB


In [374]:
df.columns

Index(['이름', '그룹', '소속사', '성별', '생년월일', '키', '혈액형', '브랜드평판지수'], dtype='object')

In [375]:
# Rename the columns
new_columns = ['name', 'group', 'company', 'gender', 'birthday', 'height', 'blood', 'brand']
df.columns = new_columns
print(df.columns)

Index(['name', 'group', 'company', 'gender', 'birthday', 'height', 'blood',
       'brand'],
      dtype='object')


In [376]:
df

Unnamed: 0,name,group,company,gender,birthday,height,blood,brand
0,지민,방탄소년단,빅히트,남자,1995-10-13,174.0,A,6267302
1,정국,방탄소년단,빅히트,남자,1997-09-01,179.0,A,5805844
2,민지,뉴진스,어도어,여자,2004-05-07,169.0,A,4437081
3,하니,뉴진스,어도어,여자,2004-10-06,161.7,O,4161153
4,뷔,방탄소년단,빅히트,남자,1995-12-30,179.0,AB,3470048
5,다니엘,뉴진스,어도어,여자,2005-04-11,165.0,,2341271
6,혜인,뉴진스,어도어,여자,2008-04-21,170.0,O,2301785
7,지수,블랙핑크,와이지,여자,1995-01-03,162.0,A,2227460
8,해린,뉴진스,어도어,여자,2006-05-15,164.5,B,2173376
9,태연,소녀시대,에스엠,여자,1989-03-09,,A,2079866


In [377]:
# describe(): returns summary statistics for the numeric columns
df.describe()

Unnamed: 0,height,brand
count,19.0,20.0
mean,170.536842,2700190.0
std,7.225204,1381919.0
min,161.0,1680587.0
25%,164.75,1887423.0
50%,168.0,2074682.0
75%,179.0,2623465.0
max,182.0,6267302.0


In [378]:
df.describe(include=object) # top: the most frequent value, freq: how often it occurs

Unnamed: 0,name,group,company,gender,birthday,blood
count,20,20,19,20,20,19
unique,20,6,5,2,20,4
top,지민,방탄소년단,빅히트,여자,1995-10-13,A
freq,1,5,7,13,1,11


In [379]:
# View a chosen number of rows
df.head() # shows the first 5 rows

Unnamed: 0,name,group,company,gender,birthday,height,blood,brand
0,지민,방탄소년단,빅히트,남자,1995-10-13,174.0,A,6267302
1,정국,방탄소년단,빅히트,남자,1997-09-01,179.0,A,5805844
2,민지,뉴진스,어도어,여자,2004-05-07,169.0,A,4437081
3,하니,뉴진스,어도어,여자,2004-10-06,161.7,O,4161153
4,뷔,방탄소년단,빅히트,남자,1995-12-30,179.0,AB,3470048


In [380]:
df.head(3)

Unnamed: 0,name,group,company,gender,birthday,height,blood,brand
0,지민,방탄소년단,빅히트,남자,1995-10-13,174.0,A,6267302
1,정국,방탄소년단,빅히트,남자,1997-09-01,179.0,A,5805844
2,민지,뉴진스,어도어,여자,2004-05-07,169.0,A,4437081


In [381]:
df.tail() # shows the last 5 rows

Unnamed: 0,name,group,company,gender,birthday,height,blood,brand
15,윤아,소녀시대,에스엠,여자,1990-05-30,168.0,B,1885297
16,조이,레드벨벳,빅히트,여자,1996-09-03,168.0,A,1830514
17,슬기,레드벨벳,빅히트,여자,1994-02-10,161.0,A,1741767
18,강다니엘,워너원,,남자,1996-12-10,182.0,A,1706444
19,진,방탄소년단,빅히트,남자,1992-12-04,179.0,O,1680587


In [382]:
df.tail(2)

Unnamed: 0,name,group,company,gender,birthday,height,blood,brand
18,강다니엘,워너원,,남자,1996-12-10,182.0,A,1706444
19,진,방탄소년단,빅히트,남자,1992-12-04,179.0,O,1680587


In [383]:
# Sorting
df.sort_index() # sort by index, ascending (the default)

Unnamed: 0,name,group,company,gender,birthday,height,blood,brand
0,지민,방탄소년단,빅히트,남자,1995-10-13,174.0,A,6267302
1,정국,방탄소년단,빅히트,남자,1997-09-01,179.0,A,5805844
2,민지,뉴진스,어도어,여자,2004-05-07,169.0,A,4437081
3,하니,뉴진스,어도어,여자,2004-10-06,161.7,O,4161153
4,뷔,방탄소년단,빅히트,남자,1995-12-30,179.0,AB,3470048
5,다니엘,뉴진스,어도어,여자,2005-04-11,165.0,,2341271
6,혜인,뉴진스,어도어,여자,2008-04-21,170.0,O,2301785
7,지수,블랙핑크,와이지,여자,1995-01-03,162.0,A,2227460
8,해린,뉴진스,어도어,여자,2006-05-15,164.5,B,2173376
9,태연,소녀시대,에스엠,여자,1989-03-09,,A,2079866


In [384]:
df.sort_index(ascending=False) # sort by index, descending

Unnamed: 0,name,group,company,gender,birthday,height,blood,brand
19,진,방탄소년단,빅히트,남자,1992-12-04,179.0,O,1680587
18,강다니엘,워너원,,남자,1996-12-10,182.0,A,1706444
17,슬기,레드벨벳,빅히트,여자,1994-02-10,161.0,A,1741767
16,조이,레드벨벳,빅히트,여자,1996-09-03,168.0,A,1830514
15,윤아,소녀시대,에스엠,여자,1990-05-30,168.0,B,1885297
14,로제,블랙핑크,와이지,여자,1997-02-11,168.0,B,1888132
13,리사,블랙핑크,와이지,여자,1997-03-27,167.0,A,1912800
12,옹성우,워너원,판타지오,남자,1995-08-25,179.0,A,1954327
11,제니,블랙핑크,와이지,여자,1996-01-16,163.0,B,2069250
10,RM,방탄소년단,빅히트,남자,1994-09-12,181.0,A,2069499


In [385]:
df.sort_values(by='height') # sort by height, ascending

Unnamed: 0,name,group,company,gender,birthday,height,blood,brand
17,슬기,레드벨벳,빅히트,여자,1994-02-10,161.0,A,1741767
3,하니,뉴진스,어도어,여자,2004-10-06,161.7,O,4161153
7,지수,블랙핑크,와이지,여자,1995-01-03,162.0,A,2227460
11,제니,블랙핑크,와이지,여자,1996-01-16,163.0,B,2069250
8,해린,뉴진스,어도어,여자,2006-05-15,164.5,B,2173376
5,다니엘,뉴진스,어도어,여자,2005-04-11,165.0,,2341271
13,리사,블랙핑크,와이지,여자,1997-03-27,167.0,A,1912800
16,조이,레드벨벳,빅히트,여자,1996-09-03,168.0,A,1830514
15,윤아,소녀시대,에스엠,여자,1990-05-30,168.0,B,1885297
14,로제,블랙핑크,와이지,여자,1997-02-11,168.0,B,1888132


In [386]:
df.sort_values(by='height', ascending=False) # sort by height, descending

Unnamed: 0,name,group,company,gender,birthday,height,blood,brand
18,강다니엘,워너원,,남자,1996-12-10,182.0,A,1706444
10,RM,방탄소년단,빅히트,남자,1994-09-12,181.0,A,2069499
1,정국,방탄소년단,빅히트,남자,1997-09-01,179.0,A,5805844
12,옹성우,워너원,판타지오,남자,1995-08-25,179.0,A,1954327
19,진,방탄소년단,빅히트,남자,1992-12-04,179.0,O,1680587
4,뷔,방탄소년단,빅히트,남자,1995-12-30,179.0,AB,3470048
0,지민,방탄소년단,빅히트,남자,1995-10-13,174.0,A,6267302
6,혜인,뉴진스,어도어,여자,2008-04-21,170.0,O,2301785
2,민지,뉴진스,어도어,여자,2004-05-07,169.0,A,4437081
14,로제,블랙핑크,와이지,여자,1997-02-11,168.0,B,1888132


In [387]:
df.sort_values(by='height', ascending=False, na_position='first') # put NaN rows at the top

Unnamed: 0,name,group,company,gender,birthday,height,blood,brand
9,태연,소녀시대,에스엠,여자,1989-03-09,,A,2079866
18,강다니엘,워너원,,남자,1996-12-10,182.0,A,1706444
10,RM,방탄소년단,빅히트,남자,1994-09-12,181.0,A,2069499
1,정국,방탄소년단,빅히트,남자,1997-09-01,179.0,A,5805844
12,옹성우,워너원,판타지오,남자,1995-08-25,179.0,A,1954327
19,진,방탄소년단,빅히트,남자,1992-12-04,179.0,O,1680587
4,뷔,방탄소년단,빅히트,남자,1995-12-30,179.0,AB,3470048
0,지민,방탄소년단,빅히트,남자,1995-10-13,174.0,A,6267302
6,혜인,뉴진스,어도어,여자,2008-04-21,170.0,O,2301785
2,민지,뉴진스,어도어,여자,2004-05-07,169.0,A,4437081


In [388]:
# Primary sort: height (ascending); secondary sort: brand (descending)
df.sort_values(by=['height', 'brand'], ascending=[True, False])

Unnamed: 0,name,group,company,gender,birthday,height,blood,brand
17,슬기,레드벨벳,빅히트,여자,1994-02-10,161.0,A,1741767
3,하니,뉴진스,어도어,여자,2004-10-06,161.7,O,4161153
7,지수,블랙핑크,와이지,여자,1995-01-03,162.0,A,2227460
11,제니,블랙핑크,와이지,여자,1996-01-16,163.0,B,2069250
8,해린,뉴진스,어도어,여자,2006-05-15,164.5,B,2173376
5,다니엘,뉴진스,어도어,여자,2005-04-11,165.0,,2341271
13,리사,블랙핑크,와이지,여자,1997-03-27,167.0,A,1912800
14,로제,블랙핑크,와이지,여자,1997-02-11,168.0,B,1888132
15,윤아,소녀시대,에스엠,여자,1990-05-30,168.0,B,1885297
16,조이,레드벨벳,빅히트,여자,1996-09-03,168.0,A,1830514


# **5. Working with Data**

In [389]:
df.head()

Unnamed: 0,name,group,company,gender,birthday,height,blood,brand
0,지민,방탄소년단,빅히트,남자,1995-10-13,174.0,A,6267302
1,정국,방탄소년단,빅히트,남자,1997-09-01,179.0,A,5805844
2,민지,뉴진스,어도어,여자,2004-05-07,169.0,A,4437081
3,하니,뉴진스,어도어,여자,2004-10-06,161.7,O,4161153
4,뷔,방탄소년단,빅히트,남자,1995-12-30,179.0,AB,3470048


In [390]:
df['blood']

Unnamed: 0,blood
0,A
1,A
2,A
3,O
4,AB
5,
6,O
7,A
8,B
9,A


In [391]:
type(df['blood'])

In [392]:
df.blood

Unnamed: 0,blood
0,A
1,A
2,A
3,O
4,AB
5,
6,O
7,A
8,B
9,A


In [393]:
df.head(3)

Unnamed: 0,name,group,company,gender,birthday,height,blood,brand
0,지민,방탄소년단,빅히트,남자,1995-10-13,174.0,A,6267302
1,정국,방탄소년단,빅히트,남자,1997-09-01,179.0,A,5805844
2,민지,뉴진스,어도어,여자,2004-05-07,169.0,A,4437081


In [394]:
df[:3]

Unnamed: 0,name,group,company,gender,birthday,height,blood,brand
0,지민,방탄소년단,빅히트,남자,1995-10-13,174.0,A,6267302
1,정국,방탄소년단,빅히트,남자,1997-09-01,179.0,A,5805844
2,민지,뉴진스,어도어,여자,2004-05-07,169.0,A,4437081


In [395]:
# loc: label-based indexing (row labels and column names)
df.loc[:, 'name'] # same as df['name']

Unnamed: 0,name
0,지민
1,정국
2,민지
3,하니
4,뷔
5,다니엘
6,혜인
7,지수
8,해린
9,태연


In [396]:
df.loc[2:5, 'name'] # label slices include the end label (5)

Unnamed: 0,name
2,민지
3,하니
4,뷔
5,다니엘


In [397]:
df.loc[2:5, ['name', 'gender', 'height']]

Unnamed: 0,name,gender,height
2,민지,여자,169.0
3,하니,여자,161.7
4,뷔,남자,179.0
5,다니엘,여자,165.0


In [398]:
df.loc[[2,5], ['name', 'gender', 'height']]

Unnamed: 0,name,gender,height
2,민지,여자,169.0
5,다니엘,여자,165.0


In [399]:
df.loc[[2,5], 'name':'height']

Unnamed: 0,name,group,company,gender,birthday,height
2,민지,뉴진스,어도어,여자,2004-05-07,169.0
5,다니엘,뉴진스,어도어,여자,2005-04-11,165.0


In [400]:
# iloc: integer position-based indexing
df.iloc[:, 0]

Unnamed: 0,name
0,지민
1,정국
2,민지
3,하니
4,뷔
5,다니엘
6,혜인
7,지수
8,해린
9,태연


In [401]:
df.iloc[:, [0, 2]]

Unnamed: 0,name,company
0,지민,빅히트
1,정국,빅히트
2,민지,어도어
3,하니,어도어
4,뷔,빅히트
5,다니엘,어도어
6,혜인,어도어
7,지수,와이지
8,해린,어도어
9,태연,에스엠


In [402]:
df.iloc[2:5, [0, 2]] # row position 5 is not included

Unnamed: 0,name,company
2,민지,어도어
3,하니,어도어
4,뷔,빅히트


In [403]:
df.iloc[2:5, 0:2] # column position 2 is not included

Unnamed: 0,name,group
2,민지,뉴진스
3,하니,뉴진스
4,뷔,방탄소년단
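
With the default 0..n-1 index, labels and positions coincide, which hides the difference between `loc` and `iloc`. The sketch below uses a hypothetical frame `people` with labels 10, 20, 30, 40 (an assumption for illustration, not data from the notebook) so the two diverge:

```python
import pandas as pd

# Hypothetical frame whose row labels differ from the row positions
people = pd.DataFrame(
    {'name': ['지민', '정국', '민지', '하니'],
     'height': [174.0, 179.0, 169.0, 161.7]},
    index=[10, 20, 30, 40],
)

by_label = people.loc[10:30, 'name']   # labels 10, 20, 30: the end label is included
by_position = people.iloc[0:2, 0]      # positions 0 and 1: the end position is excluded

print(by_label.tolist())
print(by_position.tolist())
```

Keeping this distinction in mind matters after filtering or sorting, because the row labels then no longer match the row positions.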


In [404]:
df['height'] >= 180

Unnamed: 0,height
0,False
1,False
2,False
3,False
4,False
5,False
6,False
7,False
8,False
9,False


In [405]:
df[df['height'] >= 180]

Unnamed: 0,name,group,company,gender,birthday,height,blood,brand
10,RM,방탄소년단,빅히트,남자,1994-09-12,181.0,A,2069499
18,강다니엘,워너원,,남자,1996-12-10,182.0,A,1706444


In [406]:
df[df['height'] >= 180]['name']

Unnamed: 0,name
10,RM
18,강다니엘


In [407]:
df['name'][df['height'] >= 180]

Unnamed: 0,name
10,RM
18,강다니엘


In [408]:
df[df['height'] >= 180][['name', 'gender', 'height']]

Unnamed: 0,name,gender,height
10,RM,남자,181.0
18,강다니엘,남자,182.0


In [409]:
# Print the name, gender, height, and brand of idols who are at least 170 cm tall
# (using loc)
df.loc[df['height'] >= 170, ['name', 'gender', 'height', 'brand']]

Unnamed: 0,name,gender,height,brand
0,지민,남자,174.0,6267302
1,정국,남자,179.0,5805844
4,뷔,남자,179.0,3470048
6,혜인,여자,170.0,2301785
10,RM,남자,181.0,2069499
12,옹성우,남자,179.0,1954327
18,강다니엘,남자,182.0,1706444
19,진,남자,179.0,1680587


In [410]:
df

Unnamed: 0,name,group,company,gender,birthday,height,blood,brand
0,지민,방탄소년단,빅히트,남자,1995-10-13,174.0,A,6267302
1,정국,방탄소년단,빅히트,남자,1997-09-01,179.0,A,5805844
2,민지,뉴진스,어도어,여자,2004-05-07,169.0,A,4437081
3,하니,뉴진스,어도어,여자,2004-10-06,161.7,O,4161153
4,뷔,방탄소년단,빅히트,남자,1995-12-30,179.0,AB,3470048
5,다니엘,뉴진스,어도어,여자,2005-04-11,165.0,,2341271
6,혜인,뉴진스,어도어,여자,2008-04-21,170.0,O,2301785
7,지수,블랙핑크,와이지,여자,1995-01-03,162.0,A,2227460
8,해린,뉴진스,어도어,여자,2006-05-15,164.5,B,2173376
9,태연,소녀시대,에스엠,여자,1989-03-09,,A,2079866


In [411]:
# isin(): returns a boolean mask that is True where the value appears in the given list
company = ['빅히트', '어도어']
df['company'].isin(company)

Unnamed: 0,company
0,True
1,True
2,True
3,True
4,True
5,True
6,True
7,False
8,True
9,False


In [412]:
df[df['company'].isin(company)] # df.loc[df['company'].isin(company), :]

Unnamed: 0,name,group,company,gender,birthday,height,blood,brand
0,지민,방탄소년단,빅히트,남자,1995-10-13,174.0,A,6267302
1,정국,방탄소년단,빅히트,남자,1997-09-01,179.0,A,5805844
2,민지,뉴진스,어도어,여자,2004-05-07,169.0,A,4437081
3,하니,뉴진스,어도어,여자,2004-10-06,161.7,O,4161153
4,뷔,방탄소년단,빅히트,남자,1995-12-30,179.0,AB,3470048
5,다니엘,뉴진스,어도어,여자,2005-04-11,165.0,,2341271
6,혜인,뉴진스,어도어,여자,2008-04-21,170.0,O,2301785
8,해린,뉴진스,어도어,여자,2006-05-15,164.5,B,2173376
10,RM,방탄소년단,빅히트,남자,1994-09-12,181.0,A,2069499
16,조이,레드벨벳,빅히트,여자,1996-09-03,168.0,A,1830514


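# **6. Missing Values (Null, NaN, None)**
A missing value is a value that is absent from the data. Pandas usually represents it as NaN (Not a Number) or None.

1. NaN
The missing-value marker for numeric data. NaN is itself a float, so a numeric column that contains it is stored with a float dtype.
2. None
Python's null object, which can appear as a missing value in non-numeric (object dtype) columns. Note that `read_csv` fills empty fields with NaN even in object columns, and `isna()` treats NaN and None alike.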
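
The following sketch illustrates the two markers with made-up Series (`nums` and `objs` are illustrative names). It shows that `None` placed among floats is converted to NaN, that an object Series can keep `None`, and that `isna()` detects both. It also shows why `== np.nan` cannot be used to find missing values.

```python
import numpy as np
import pandas as pd

nums = pd.Series([1.0, None, 3.0])                 # None becomes NaN; dtype is float64
objs = pd.Series(['A', None, 'B'], dtype=object)   # None is kept in an object Series

print(nums.dtype)
print(nums.isna().tolist())
print(objs.isna().tolist())

# NaN never compares equal to anything, including itself, so use isna() instead
print(np.nan == np.nan)
```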

In [413]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   name      20 non-null     object 
 1   group     20 non-null     object 
 2   company   19 non-null     object 
 3   gender    20 non-null     object 
 4   birthday  20 non-null     object 
 5   height    19 non-null     float64
 6   blood     19 non-null     object 
 7   brand     20 non-null     int64  
dtypes: float64(1), int64(1), object(6)
memory usage: 1.4+ KB


In [414]:
df

Unnamed: 0,name,group,company,gender,birthday,height,blood,brand
0,지민,방탄소년단,빅히트,남자,1995-10-13,174.0,A,6267302
1,정국,방탄소년단,빅히트,남자,1997-09-01,179.0,A,5805844
2,민지,뉴진스,어도어,여자,2004-05-07,169.0,A,4437081
3,하니,뉴진스,어도어,여자,2004-10-06,161.7,O,4161153
4,뷔,방탄소년단,빅히트,남자,1995-12-30,179.0,AB,3470048
5,다니엘,뉴진스,어도어,여자,2005-04-11,165.0,,2341271
6,혜인,뉴진스,어도어,여자,2008-04-21,170.0,O,2301785
7,지수,블랙핑크,와이지,여자,1995-01-03,162.0,A,2227460
8,해린,뉴진스,어도어,여자,2006-05-15,164.5,B,2173376
9,태연,소녀시대,에스엠,여자,1989-03-09,,A,2079866


In [415]:
df.isna()

Unnamed: 0,name,group,company,gender,birthday,height,blood,brand
0,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False
5,False,False,False,False,False,False,True,False
6,False,False,False,False,False,False,False,False
7,False,False,False,False,False,False,False,False
8,False,False,False,False,False,False,False,False
9,False,False,False,False,False,True,False,False


In [416]:
df.isnull()

Unnamed: 0,name,group,company,gender,birthday,height,blood,brand
0,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False
5,False,False,False,False,False,False,True,False
6,False,False,False,False,False,False,False,False
7,False,False,False,False,False,False,False,False
8,False,False,False,False,False,False,False,False
9,False,False,False,False,False,True,False,False


In [417]:
df['height'].isna()

Unnamed: 0,height
0,False
1,False
2,False
3,False
4,False
5,False
6,False
7,False
8,False
9,True


In [418]:
df[df['height'].isna()]

Unnamed: 0,name,group,company,gender,birthday,height,blood,brand
9,태연,소녀시대,에스엠,여자,1989-03-09,,A,2079866


In [419]:
df[df['height'].isna()]['name']

Unnamed: 0,name
9,태연


In [420]:
df[df['height'].notnull()]

Unnamed: 0,name,group,company,gender,birthday,height,blood,brand
0,지민,방탄소년단,빅히트,남자,1995-10-13,174.0,A,6267302
1,정국,방탄소년단,빅히트,남자,1997-09-01,179.0,A,5805844
2,민지,뉴진스,어도어,여자,2004-05-07,169.0,A,4437081
3,하니,뉴진스,어도어,여자,2004-10-06,161.7,O,4161153
4,뷔,방탄소년단,빅히트,남자,1995-12-30,179.0,AB,3470048
5,다니엘,뉴진스,어도어,여자,2005-04-11,165.0,,2341271
6,혜인,뉴진스,어도어,여자,2008-04-21,170.0,O,2301785
7,지수,블랙핑크,와이지,여자,1995-01-03,162.0,A,2227460
8,해린,뉴진스,어도어,여자,2006-05-15,164.5,B,2173376
10,RM,방탄소년단,빅히트,남자,1994-09-12,181.0,A,2069499


In [421]:
df[~df['height'].isnull()]

Unnamed: 0,name,group,company,gender,birthday,height,blood,brand
0,지민,방탄소년단,빅히트,남자,1995-10-13,174.0,A,6267302
1,정국,방탄소년단,빅히트,남자,1997-09-01,179.0,A,5805844
2,민지,뉴진스,어도어,여자,2004-05-07,169.0,A,4437081
3,하니,뉴진스,어도어,여자,2004-10-06,161.7,O,4161153
4,뷔,방탄소년단,빅히트,남자,1995-12-30,179.0,AB,3470048
5,다니엘,뉴진스,어도어,여자,2005-04-11,165.0,,2341271
6,혜인,뉴진스,어도어,여자,2008-04-21,170.0,O,2301785
7,지수,블랙핑크,와이지,여자,1995-01-03,162.0,A,2227460
8,해린,뉴진스,어도어,여자,2006-05-15,164.5,B,2173376
10,RM,방탄소년단,빅히트,남자,1994-09-12,181.0,A,2069499


In [422]:
# Print the name, company, group, and gender of idols whose company is not missing
# (using loc)
df.loc[df['company'].notnull(), ['name', 'company', 'group', 'gender']]

Unnamed: 0,name,company,group,gender
0,지민,빅히트,방탄소년단,남자
1,정국,빅히트,방탄소년단,남자
2,민지,어도어,뉴진스,여자
3,하니,어도어,뉴진스,여자
4,뷔,빅히트,방탄소년단,남자
5,다니엘,어도어,뉴진스,여자
6,혜인,어도어,뉴진스,여자
7,지수,와이지,블랙핑크,여자
8,해린,어도어,뉴진스,여자
9,태연,에스엠,소녀시대,여자


In [423]:
df_copy = df.copy()
df_copy

Unnamed: 0,name,group,company,gender,birthday,height,blood,brand
0,지민,방탄소년단,빅히트,남자,1995-10-13,174.0,A,6267302
1,정국,방탄소년단,빅히트,남자,1997-09-01,179.0,A,5805844
2,민지,뉴진스,어도어,여자,2004-05-07,169.0,A,4437081
3,하니,뉴진스,어도어,여자,2004-10-06,161.7,O,4161153
4,뷔,방탄소년단,빅히트,남자,1995-12-30,179.0,AB,3470048
5,다니엘,뉴진스,어도어,여자,2005-04-11,165.0,,2341271
6,혜인,뉴진스,어도어,여자,2008-04-21,170.0,O,2301785
7,지수,블랙핑크,와이지,여자,1995-01-03,162.0,A,2227460
8,해린,뉴진스,어도어,여자,2006-05-15,164.5,B,2173376
9,태연,소녀시대,에스엠,여자,1989-03-09,,A,2079866


In [424]:
df_copy['height']

Unnamed: 0,height
0,174.0
1,179.0
2,169.0
3,161.7
4,179.0
5,165.0
6,170.0
7,162.0
8,164.5
9,


In [425]:
# fillna(): fills missing values
df_copy['height'].fillna(0) # returns a new Series; assign it back (df_copy['height'] = ...) to keep the change

Unnamed: 0,height
0,174.0
1,179.0
2,169.0
3,161.7
4,179.0
5,165.0
6,170.0
7,162.0
8,164.5
9,0.0


In [426]:
df_copy['height']

Unnamed: 0,height
0,174.0
1,179.0
2,169.0
3,161.7
4,179.0
5,165.0
6,170.0
7,162.0
8,164.5
9,
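
As the output above shows, calling `fillna()` on its own leaves the original column untouched. A minimal sketch of that behavior, using a made-up Series `heights`:

```python
import pandas as pd

heights = pd.Series([174.0, None, 169.0])

filled = heights.fillna(0)       # a new Series; heights still has its NaN

print(heights.isna().sum())      # missing values remaining in the original
print(filled.tolist())

# To keep the result, assign it back
heights = heights.fillna(heights.mean())
print(heights.tolist())
```

Assigning the result back is generally preferred over `inplace=True` on a selected column, which recent pandas versions warn about because the selection may be a temporary copy.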


In [427]:
height = df_copy['height'].mean()
height

170.53684210526316

In [428]:
# df_copy['height'].fillna(height, inplace=True)
df_copy['height'] = df_copy['height'].fillna(height)
df_copy['height']

Unnamed: 0,height
0,174.0
1,179.0
2,169.0
3,161.7
4,179.0
5,165.0
6,170.0
7,162.0
8,164.5
9,170.536842


In [429]:
df_copy = df.copy()
df_copy['height']

Unnamed: 0,height
0,174.0
1,179.0
2,169.0
3,161.7
4,179.0
5,165.0
6,170.0
7,162.0
8,164.5
9,


In [430]:
height = df_copy['height'].median()
height

168.0

In [431]:
df_copy['height'] = df_copy['height'].fillna(height)
df_copy['height']

Unnamed: 0,height
0,174.0
1,179.0
2,169.0
3,161.7
4,179.0
5,165.0
6,170.0
7,162.0
8,164.5
9,168.0


In [432]:
df_copy = df.copy()
df_copy

Unnamed: 0,name,group,company,gender,birthday,height,blood,brand
0,지민,방탄소년단,빅히트,남자,1995-10-13,174.0,A,6267302
1,정국,방탄소년단,빅히트,남자,1997-09-01,179.0,A,5805844
2,민지,뉴진스,어도어,여자,2004-05-07,169.0,A,4437081
3,하니,뉴진스,어도어,여자,2004-10-06,161.7,O,4161153
4,뷔,방탄소년단,빅히트,남자,1995-12-30,179.0,AB,3470048
5,다니엘,뉴진스,어도어,여자,2005-04-11,165.0,,2341271
6,혜인,뉴진스,어도어,여자,2008-04-21,170.0,O,2301785
7,지수,블랙핑크,와이지,여자,1995-01-03,162.0,A,2227460
8,해린,뉴진스,어도어,여자,2006-05-15,164.5,B,2173376
9,태연,소녀시대,에스엠,여자,1989-03-09,,A,2079866


In [433]:
# dropna(): removes rows or columns that contain missing values
df_copy.dropna() # drops every row with at least one missing value. Default is axis=0 (drop rows)

Unnamed: 0,name,group,company,gender,birthday,height,blood,brand
0,지민,방탄소년단,빅히트,남자,1995-10-13,174.0,A,6267302
1,정국,방탄소년단,빅히트,남자,1997-09-01,179.0,A,5805844
2,민지,뉴진스,어도어,여자,2004-05-07,169.0,A,4437081
3,하니,뉴진스,어도어,여자,2004-10-06,161.7,O,4161153
4,뷔,방탄소년단,빅히트,남자,1995-12-30,179.0,AB,3470048
6,혜인,뉴진스,어도어,여자,2008-04-21,170.0,O,2301785
7,지수,블랙핑크,와이지,여자,1995-01-03,162.0,A,2227460
8,해린,뉴진스,어도어,여자,2006-05-15,164.5,B,2173376
10,RM,방탄소년단,빅히트,남자,1994-09-12,181.0,A,2069499
11,제니,블랙핑크,와이지,여자,1996-01-16,163.0,B,2069250


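### ※ axis
1. NumPy grew out of mathematical array operations, so `axis` refers to a dimension of the array's shape.
axis=0 runs down the rows (the vertical direction); axis=1 runs across the columns (the horizontal direction).
2. Pandas follows the same convention but names the two axes explicitly as **rows** (axis=0, `'index'`) and **columns** (axis=1, `'columns'`).
For methods that remove things, such as `dropna()` and `drop()`, axis=0 removes rows and axis=1 removes columns. For reductions such as `sum()` and `mean()`, axis=0 collapses the rows and returns one value per column, while axis=1 collapses the columns and returns one value per row.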
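
The two readings of `axis` are easier to see side by side. This sketch uses a made-up 3x2 frame `toy` (not from the notebook) to contrast a reduction with a removal:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'a': [1.0, 2.0, np.nan],
                    'b': [10.0, 20.0, 30.0]})

col_sums = toy.sum(axis=0)       # collapse the rows: one value per column
row_sums = toy.sum(axis=1)       # collapse the columns: one value per row

rows_kept = toy.dropna(axis=0)   # remove rows that contain NaN
cols_kept = toy.dropna(axis=1)   # remove columns that contain NaN

print(col_sums.tolist(), row_sums.tolist())
print(rows_kept.shape, list(cols_kept.columns))
```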

In [434]:
df_copy.dropna(axis=1) # removes every column that contains a missing value

Unnamed: 0,name,group,gender,birthday,brand
0,지민,방탄소년단,남자,1995-10-13,6267302
1,정국,방탄소년단,남자,1997-09-01,5805844
2,민지,뉴진스,여자,2004-05-07,4437081
3,하니,뉴진스,여자,2004-10-06,4161153
4,뷔,방탄소년단,남자,1995-12-30,3470048
5,다니엘,뉴진스,여자,2005-04-11,2341271
6,혜인,뉴진스,여자,2008-04-21,2301785
7,지수,블랙핑크,여자,1995-01-03,2227460
8,해린,뉴진스,여자,2006-05-15,2173376
9,태연,소녀시대,여자,1989-03-09,2079866


# **7. Adding and Removing Rows and Columns**

In [435]:
df_copy = df.copy()
df_copy

Unnamed: 0,name,group,company,gender,birthday,height,blood,brand
0,지민,방탄소년단,빅히트,남자,1995-10-13,174.0,A,6267302
1,정국,방탄소년단,빅히트,남자,1997-09-01,179.0,A,5805844
2,민지,뉴진스,어도어,여자,2004-05-07,169.0,A,4437081
3,하니,뉴진스,어도어,여자,2004-10-06,161.7,O,4161153
4,뷔,방탄소년단,빅히트,남자,1995-12-30,179.0,AB,3470048
5,다니엘,뉴진스,어도어,여자,2005-04-11,165.0,,2341271
6,혜인,뉴진스,어도어,여자,2008-04-21,170.0,O,2301785
7,지수,블랙핑크,와이지,여자,1995-01-03,162.0,A,2227460
8,해린,뉴진스,어도어,여자,2006-05-15,164.5,B,2173376
9,태연,소녀시대,에스엠,여자,1989-03-09,,A,2079866


In [436]:
dic = {
    'name': '김사과',
    'group': '과수원',
    'company': '애플',
    'gender': '여자',
    'birthday': '2000-01-01',
    'height': 160.0,
    'blood': 'A',
    'brand': 1234567
}

In [437]:
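# concat(): concatenates DataFrames; the default axis=0 stacks rows
# ignore_index=True renumbers the result 0..n-1; without it the new row would keep its own label (0) and duplicate an existing index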
df = pd.concat([df_copy, pd.DataFrame(dic, index=[0])], ignore_index=True)
df

Unnamed: 0,name,group,company,gender,birthday,height,blood,brand
0,지민,방탄소년단,빅히트,남자,1995-10-13,174.0,A,6267302
1,정국,방탄소년단,빅히트,남자,1997-09-01,179.0,A,5805844
2,민지,뉴진스,어도어,여자,2004-05-07,169.0,A,4437081
3,하니,뉴진스,어도어,여자,2004-10-06,161.7,O,4161153
4,뷔,방탄소년단,빅히트,남자,1995-12-30,179.0,AB,3470048
5,다니엘,뉴진스,어도어,여자,2005-04-11,165.0,,2341271
6,혜인,뉴진스,어도어,여자,2008-04-21,170.0,O,2301785
7,지수,블랙핑크,와이지,여자,1995-01-03,162.0,A,2227460
8,해린,뉴진스,어도어,여자,2006-05-15,164.5,B,2173376
9,태연,소녀시대,에스엠,여자,1989-03-09,,A,2079866
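
The effect of `ignore_index` can be checked on a tiny example. In this sketch the frames `base` and `new_row` are made up for illustration; the point is only how the resulting index differs:

```python
import pandas as pd

base = pd.DataFrame({'name': ['지민', '정국']})            # index 0, 1
new_row = pd.DataFrame({'name': ['김사과']}, index=[0])    # index 0

kept = pd.concat([base, new_row])                          # original labels are kept
renumbered = pd.concat([base, new_row], ignore_index=True) # labels are rebuilt

print(kept.index.tolist())
print(renumbered.index.tolist())
```

A duplicated label is not an error in itself, but `kept.loc[0]` would then return two rows, which is rarely what is wanted.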


In [438]:
# Add a 'nation' column and set it to '대한민국' for every row
df['nation'] = '대한민국'
df.head()

Unnamed: 0,name,group,company,gender,birthday,height,blood,brand,nation
0,지민,방탄소년단,빅히트,남자,1995-10-13,174.0,A,6267302,대한민국
1,정국,방탄소년단,빅히트,남자,1997-09-01,179.0,A,5805844,대한민국
2,민지,뉴진스,어도어,여자,2004-05-07,169.0,A,4437081,대한민국
3,하니,뉴진스,어도어,여자,2004-10-06,161.7,O,4161153,대한민국
4,뷔,방탄소년단,빅히트,남자,1995-12-30,179.0,AB,3470048,대한민국


In [439]:
df.tail()

Unnamed: 0,name,group,company,gender,birthday,height,blood,brand,nation
16,조이,레드벨벳,빅히트,여자,1996-09-03,168.0,A,1830514,대한민국
17,슬기,레드벨벳,빅히트,여자,1994-02-10,161.0,A,1741767,대한민국
18,강다니엘,워너원,,남자,1996-12-10,182.0,A,1706444,대한민국
19,진,방탄소년단,빅히트,남자,1992-12-04,179.0,O,1680587,대한민국
20,김사과,과수원,애플,여자,2000-01-01,160.0,A,1234567,대한민국


In [440]:
# Change the nation of '김사과' to '미국' (using loc)
df.loc[df['name'] == '김사과', 'nation'] = '미국'
df.tail()

Unnamed: 0,name,group,company,gender,birthday,height,blood,brand,nation
16,조이,레드벨벳,빅히트,여자,1996-09-03,168.0,A,1830514,대한민국
17,슬기,레드벨벳,빅히트,여자,1994-02-10,161.0,A,1741767,대한민국
18,강다니엘,워너원,,남자,1996-12-10,182.0,A,1706444,대한민국
19,진,방탄소년단,빅히트,남자,1992-12-04,179.0,O,1680587,대한민국
20,김사과,과수원,애플,여자,2000-01-01,160.0,A,1234567,미국


In [441]:
# Removing rows
df.drop(20, axis=0) # 0: rows, 1: columns. Returns a new DataFrame; df itself is unchanged

Unnamed: 0,name,group,company,gender,birthday,height,blood,brand,nation
0,지민,방탄소년단,빅히트,남자,1995-10-13,174.0,A,6267302,대한민국
1,정국,방탄소년단,빅히트,남자,1997-09-01,179.0,A,5805844,대한민국
2,민지,뉴진스,어도어,여자,2004-05-07,169.0,A,4437081,대한민국
3,하니,뉴진스,어도어,여자,2004-10-06,161.7,O,4161153,대한민국
4,뷔,방탄소년단,빅히트,남자,1995-12-30,179.0,AB,3470048,대한민국
5,다니엘,뉴진스,어도어,여자,2005-04-11,165.0,,2341271,대한민국
6,혜인,뉴진스,어도어,여자,2008-04-21,170.0,O,2301785,대한민국
7,지수,블랙핑크,와이지,여자,1995-01-03,162.0,A,2227460,대한민국
8,해린,뉴진스,어도어,여자,2006-05-15,164.5,B,2173376,대한민국
9,태연,소녀시대,에스엠,여자,1989-03-09,,A,2079866,대한민국


In [442]:
df.tail()

Unnamed: 0,name,group,company,gender,birthday,height,blood,brand,nation
16,조이,레드벨벳,빅히트,여자,1996-09-03,168.0,A,1830514,대한민국
17,슬기,레드벨벳,빅히트,여자,1994-02-10,161.0,A,1741767,대한민국
18,강다니엘,워너원,,남자,1996-12-10,182.0,A,1706444,대한민국
19,진,방탄소년단,빅히트,남자,1992-12-04,179.0,O,1680587,대한민국
20,김사과,과수원,애플,여자,2000-01-01,160.0,A,1234567,미국


In [443]:
df.drop([1, 3, 5, 7, 20], axis=0)

Unnamed: 0,name,group,company,gender,birthday,height,blood,brand,nation
0,지민,방탄소년단,빅히트,남자,1995-10-13,174.0,A,6267302,대한민국
2,민지,뉴진스,어도어,여자,2004-05-07,169.0,A,4437081,대한민국
4,뷔,방탄소년단,빅히트,남자,1995-12-30,179.0,AB,3470048,대한민국
6,혜인,뉴진스,어도어,여자,2008-04-21,170.0,O,2301785,대한민국
8,해린,뉴진스,어도어,여자,2006-05-15,164.5,B,2173376,대한민국
9,태연,소녀시대,에스엠,여자,1989-03-09,,A,2079866,대한민국
10,RM,방탄소년단,빅히트,남자,1994-09-12,181.0,A,2069499,대한민국
11,제니,블랙핑크,와이지,여자,1996-01-16,163.0,B,2069250,대한민국
12,옹성우,워너원,판타지오,남자,1995-08-25,179.0,A,1954327,대한민국
13,리사,블랙핑크,와이지,여자,1997-03-27,167.0,A,1912800,대한민국


In [444]:
# Removing columns
df.drop('nation', axis=1)

Unnamed: 0,name,group,company,gender,birthday,height,blood,brand
0,지민,방탄소년단,빅히트,남자,1995-10-13,174.0,A,6267302
1,정국,방탄소년단,빅히트,남자,1997-09-01,179.0,A,5805844
2,민지,뉴진스,어도어,여자,2004-05-07,169.0,A,4437081
3,하니,뉴진스,어도어,여자,2004-10-06,161.7,O,4161153
4,뷔,방탄소년단,빅히트,남자,1995-12-30,179.0,AB,3470048
5,다니엘,뉴진스,어도어,여자,2005-04-11,165.0,,2341271
6,혜인,뉴진스,어도어,여자,2008-04-21,170.0,O,2301785
7,지수,블랙핑크,와이지,여자,1995-01-03,162.0,A,2227460
8,해린,뉴진스,어도어,여자,2006-05-15,164.5,B,2173376
9,태연,소녀시대,에스엠,여자,1989-03-09,,A,2079866


In [445]:
df.drop(['group', 'nation'], axis=1)

Unnamed: 0,name,company,gender,birthday,height,blood,brand
0,지민,빅히트,남자,1995-10-13,174.0,A,6267302
1,정국,빅히트,남자,1997-09-01,179.0,A,5805844
2,민지,어도어,여자,2004-05-07,169.0,A,4437081
3,하니,어도어,여자,2004-10-06,161.7,O,4161153
4,뷔,빅히트,남자,1995-12-30,179.0,AB,3470048
5,다니엘,어도어,여자,2005-04-11,165.0,,2341271
6,혜인,어도어,여자,2008-04-21,170.0,O,2301785
7,지수,와이지,여자,1995-01-03,162.0,A,2227460
8,해린,어도어,여자,2006-05-15,164.5,B,2173376
9,태연,에스엠,여자,1989-03-09,,A,2079866


# **8. Statistical Functions**

In [446]:
df.describe()

Unnamed: 0,height,brand
count,20.0,21.0
mean,170.01,2630399.0
std,7.416688,1384378.0
min,160.0,1234567.0
25%,164.125,1885297.0
50%,168.0,2069499.0
75%,179.0,2341271.0
max,182.0,6267302.0


In [447]:
df['height'].sum() # sum

3400.2

In [448]:
df['height'].count() # number of values; NaN is not counted

20

In [449]:
df['height'].mean() # mean

170.01

In [450]:
df['height'].median() # median

168.0

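> The mean is the sum of all values divided by the number of values. It is the theoretical center each value would take if the total were shared out evenly. The median is the value in the middle once the data is sorted by size. It depends only on the order of the values, not on how large they are. When the data is evenly distributed, the mean and median are similar or equal. When the data contains outliers, however, the mean is pulled toward them and can be distorted, while the median stays comparatively stable.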
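
A short sketch of that difference, using two made-up Series that are identical except for one extreme value (for example, a height mistyped as 1800):

```python
import pandas as pd

heights = pd.Series([160.0, 165.0, 170.0, 175.0, 180.0])
with_outlier = pd.Series([160.0, 165.0, 170.0, 175.0, 1800.0])  # one mistyped value

print(heights.mean(), heights.median())            # 170.0 170.0
print(with_outlier.mean(), with_outlier.median())  # 494.0 170.0
```

The single outlier moves the mean from 170 to 494, while the median does not change.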

In [451]:
df['height'].max() # maximum

182.0

In [452]:
df['height'].min() # minimum

160.0

In [453]:
df['height'].var() # variance

55.007263157894734

In [454]:
df['height'].std() # standard deviation

7.416688152935563

<img src='https://blog.kakaocdn.net/dn/lSDKp/btsLh5aU1e8/lbiMCS1ur5ItObK5lSbrQK/img.png'>

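> Variance and standard deviation are measures of dispersion: they describe how far the data spreads out from the mean. Variance is the average of the squared distances between each value and the mean. The standard deviation is the square root of the variance. Because variance is expressed in squared units, taking the square root converts it back to the same units as the original data. Note that pandas' `var()` and `std()` compute the sample versions by default (`ddof=1`), dividing by n - 1 rather than n.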
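
The definition can be checked by hand on a small made-up Series. In this sketch `ddof=0` gives the population variance (divide by n), while the pandas default `ddof=1` gives the sample variance (divide by n - 1):

```python
import pandas as pd

values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Manual population variance: mean of the squared distances from the mean
squared_dist = (values - values.mean()) ** 2
manual_var = squared_dist.sum() / len(values)

print(manual_var)            # 4.0
print(values.var(ddof=0))    # population variance
print(values.std(ddof=0))    # population standard deviation
print(values.var())          # sample variance (ddof=1, the pandas default)
```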

# **9. Grouping**

In [455]:
df = pd.read_csv('/content/drive/MyDrive/컴퓨터비전/3. 데이터 분석/Data/idol.csv')
df

Unnamed: 0,이름,그룹,소속사,성별,생년월일,키,혈액형,브랜드평판지수
0,지민,방탄소년단,빅히트,남자,1995-10-13,174.0,A,6267302
1,정국,방탄소년단,빅히트,남자,1997-09-01,179.0,A,5805844
2,민지,뉴진스,어도어,여자,2004-05-07,169.0,A,4437081
3,하니,뉴진스,어도어,여자,2004-10-06,161.7,O,4161153
4,뷔,방탄소년단,빅히트,남자,1995-12-30,179.0,AB,3470048
5,다니엘,뉴진스,어도어,여자,2005-04-11,165.0,,2341271
6,혜인,뉴진스,어도어,여자,2008-04-21,170.0,O,2301785
7,지수,블랙핑크,와이지,여자,1995-01-03,162.0,A,2227460
8,해린,뉴진스,어도어,여자,2006-05-15,164.5,B,2173376
9,태연,소녀시대,에스엠,여자,1989-03-09,,A,2079866


In [456]:
new_columns = ['name', 'group', 'company', 'gender', 'birthday', 'height', 'blood', 'brand']
df.columns = new_columns
print(df.columns)

Index(['name', 'group', 'company', 'gender', 'birthday', 'height', 'blood',
       'brand'],
      dtype='object')


In [457]:
df

Unnamed: 0,name,group,company,gender,birthday,height,blood,brand
0,지민,방탄소년단,빅히트,남자,1995-10-13,174.0,A,6267302
1,정국,방탄소년단,빅히트,남자,1997-09-01,179.0,A,5805844
2,민지,뉴진스,어도어,여자,2004-05-07,169.0,A,4437081
3,하니,뉴진스,어도어,여자,2004-10-06,161.7,O,4161153
4,뷔,방탄소년단,빅히트,남자,1995-12-30,179.0,AB,3470048
5,다니엘,뉴진스,어도어,여자,2005-04-11,165.0,,2341271
6,혜인,뉴진스,어도어,여자,2008-04-21,170.0,O,2301785
7,지수,블랙핑크,와이지,여자,1995-01-03,162.0,A,2227460
8,해린,뉴진스,어도어,여자,2006-05-15,164.5,B,2173376
9,태연,소녀시대,에스엠,여자,1989-03-09,,A,2079866


In [458]:
# groupby(): groups rows by key so that each group can be analyzed separately
df.groupby('group')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fc12ee144c0>

In [459]:
# Once the rows are grouped, statistical (aggregate) functions can be applied
df.groupby('group').count()

Unnamed: 0_level_0,name,company,gender,birthday,height,blood,brand
group,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
뉴진스,5,5,5,5,5,4,5
레드벨벳,2,2,2,2,2,2,2
방탄소년단,5,5,5,5,5,5,5
블랙핑크,4,4,4,4,4,4,4
소녀시대,2,2,2,2,1,2,2
워너원,2,1,2,2,2,2,2


In [460]:
df.groupby('group').mean(numeric_only=True)

Unnamed: 0_level_0,height,brand
group,Unnamed: 1_level_1,Unnamed: 2_level_1
뉴진스,166.04,3082933.2
레드벨벳,164.5,1786140.5
방탄소년단,178.4,3858656.0
블랙핑크,165.0,2024410.5
소녀시대,168.0,1982581.5
워너원,180.5,1830385.5


In [461]:
df.groupby('group').sum(numeric_only=True)

Unnamed: 0_level_0,height,brand
group,Unnamed: 1_level_1,Unnamed: 2_level_1
뉴진스,830.2,15414666
레드벨벳,329.0,3572281
방탄소년단,892.0,19293280
블랙핑크,660.0,8097642
소녀시대,168.0,3965163
워너원,361.0,3660771


In [462]:
df.groupby('gender').mean(numeric_only=True)

Unnamed: 0_level_0,height,brand
gender,Unnamed: 1_level_1,Unnamed: 2_level_1
남자,179.0,3279150.0
여자,165.6,2388442.0


In [463]:
df.groupby('gender')[['height', 'brand']].mean()

Unnamed: 0_level_0,height,brand
gender,Unnamed: 1_level_1,Unnamed: 2_level_1
남자,179.0,3279150.0
여자,165.6,2388442.0


In [464]:
df.groupby(['blood', 'gender'])['height'].mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,height
blood,gender,Unnamed: 2_level_1
A,남자,179.0
A,여자,165.4
AB,남자,179.0
B,여자,165.875
O,남자,179.0
O,여자,165.85
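
When several statistics are needed at once, `agg()` with named aggregations is one option. This is a sketch on a small illustrative frame `toy`; the output column names `members`, `mean_height`, and `max_height` are chosen here for illustration and do not appear in the notebook.

```python
import pandas as pd

# Illustrative subset of the idol data
toy = pd.DataFrame({
    'name': ['민지', '하니', '지민', '정국'],
    'group': ['뉴진스', '뉴진스', '방탄소년단', '방탄소년단'],
    'height': [169.0, 161.7, 174.0, 179.0],
})

# Each keyword is an output column: (source column, aggregation function)
summary = toy.groupby('group').agg(
    members=('name', 'count'),
    mean_height=('height', 'mean'),
    max_height=('height', 'max'),
)

print(summary)
```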


# **10. Removing Duplicates**

In [465]:
df['blood']

Unnamed: 0,blood
0,A
1,A
2,A
3,O
4,AB
5,
6,O
7,A
8,B
9,A


In [466]:
# drop_duplicates(): removes duplicated values, keeping the first occurrence by default
df['blood'].drop_duplicates()

Unnamed: 0,blood
0,A
3,O
4,AB
5,
8,B


In [467]:
# keep='last': keep the last occurrence of each value instead of the first
df['blood'].drop_duplicates(keep='last')

Unnamed: 0,blood
4,AB
5,
15,B
18,A
19,O


In [468]:
# value_counts(): returns how many times each value occurs in the column. NaN is excluded by default
df['blood'].value_counts()

Unnamed: 0_level_0,count
blood,Unnamed: 1_level_1
A,11
B,4
O,3
AB,1


In [469]:
df['company'].value_counts()

Unnamed: 0_level_0,count
company,Unnamed: 1_level_1
빅히트,7
어도어,5
와이지,4
에스엠,2
판타지오,1


In [470]:
df['company'].value_counts(dropna=False)

Unnamed: 0_level_0,count
company,Unnamed: 1_level_1
빅히트,7
어도어,5
와이지,4
에스엠,2
판타지오,1
,1
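
To make the effect of `keep` and `dropna` easier to see, here is a minimal sketch on a hypothetical seven-element Series (the values are invented for illustration). Note that `drop_duplicates()` treats repeated NaN values as duplicates of each other.

```python
import numpy as np
import pandas as pd

# Hypothetical blood-type column with repeats and missing values.
blood = pd.Series(['A', 'A', 'O', np.nan, 'O', 'B', np.nan])

first = blood.drop_duplicates()             # keeps the first occurrence of each value
last = blood.drop_duplicates(keep='last')   # keeps the last occurrence of each value
print(first.index.tolist())
print(last.index.tolist())

# dropna=False adds a row for the missing values to the counts.
counts = blood.value_counts(dropna=False)
print(counts)
```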


# **11. Combining DataFrames**

In [471]:
df1 = pd.read_csv('/content/drive/MyDrive/컴퓨터비전/3. 데이터 분석/Data/idol.csv')
df2 = pd.read_csv('/content/drive/MyDrive/컴퓨터비전/3. 데이터 분석/Data/idol2.csv')

In [472]:
df1

Unnamed: 0,이름,그룹,소속사,성별,생년월일,키,혈액형,브랜드평판지수
0,지민,방탄소년단,빅히트,남자,1995-10-13,174.0,A,6267302
1,정국,방탄소년단,빅히트,남자,1997-09-01,179.0,A,5805844
2,민지,뉴진스,어도어,여자,2004-05-07,169.0,A,4437081
3,하니,뉴진스,어도어,여자,2004-10-06,161.7,O,4161153
4,뷔,방탄소년단,빅히트,남자,1995-12-30,179.0,AB,3470048
5,다니엘,뉴진스,어도어,여자,2005-04-11,165.0,,2341271
6,혜인,뉴진스,어도어,여자,2008-04-21,170.0,O,2301785
7,지수,블랙핑크,와이지,여자,1995-01-03,162.0,A,2227460
8,해린,뉴진스,어도어,여자,2006-05-15,164.5,B,2173376
9,태연,소녀시대,에스엠,여자,1989-03-09,,A,2079866


In [473]:
df2

Unnamed: 0,이름,연봉,가족수
0,지민,3000,3
1,정국,3500,3
2,민지,3200,4
3,하니,3050,4
4,뷔,4300,3
5,다니엘,2900,5
6,혜인,3400,6
7,지수,4500,5
8,해린,4200,4
9,태연,4300,4


In [474]:
df_copy = df1.copy()

In [475]:
pd.concat([df1, df_copy]) # axis=0 (default): stack the rows vertically

Unnamed: 0,이름,그룹,소속사,성별,생년월일,키,혈액형,브랜드평판지수
0,지민,방탄소년단,빅히트,남자,1995-10-13,174.0,A,6267302
1,정국,방탄소년단,빅히트,남자,1997-09-01,179.0,A,5805844
2,민지,뉴진스,어도어,여자,2004-05-07,169.0,A,4437081
3,하니,뉴진스,어도어,여자,2004-10-06,161.7,O,4161153
4,뷔,방탄소년단,빅히트,남자,1995-12-30,179.0,AB,3470048
5,다니엘,뉴진스,어도어,여자,2005-04-11,165.0,,2341271
6,혜인,뉴진스,어도어,여자,2008-04-21,170.0,O,2301785
7,지수,블랙핑크,와이지,여자,1995-01-03,162.0,A,2227460
8,해린,뉴진스,어도어,여자,2006-05-15,164.5,B,2173376
9,태연,소녀시대,에스엠,여자,1989-03-09,,A,2079866


In [476]:
df_concat = pd.concat([df1, df_copy])
df_concat

Unnamed: 0,이름,그룹,소속사,성별,생년월일,키,혈액형,브랜드평판지수
0,지민,방탄소년단,빅히트,남자,1995-10-13,174.0,A,6267302
1,정국,방탄소년단,빅히트,남자,1997-09-01,179.0,A,5805844
2,민지,뉴진스,어도어,여자,2004-05-07,169.0,A,4437081
3,하니,뉴진스,어도어,여자,2004-10-06,161.7,O,4161153
4,뷔,방탄소년단,빅히트,남자,1995-12-30,179.0,AB,3470048
5,다니엘,뉴진스,어도어,여자,2005-04-11,165.0,,2341271
6,혜인,뉴진스,어도어,여자,2008-04-21,170.0,O,2301785
7,지수,블랙핑크,와이지,여자,1995-01-03,162.0,A,2227460
8,해린,뉴진스,어도어,여자,2006-05-15,164.5,B,2173376
9,태연,소녀시대,에스엠,여자,1989-03-09,,A,2079866


In [477]:
# reset_index(): replace the index with a fresh 0..n-1 index
# drop=True prevents the old index from being added as a column
df_concat.reset_index(drop=True)

Unnamed: 0,이름,그룹,소속사,성별,생년월일,키,혈액형,브랜드평판지수
0,지민,방탄소년단,빅히트,남자,1995-10-13,174.0,A,6267302
1,정국,방탄소년단,빅히트,남자,1997-09-01,179.0,A,5805844
2,민지,뉴진스,어도어,여자,2004-05-07,169.0,A,4437081
3,하니,뉴진스,어도어,여자,2004-10-06,161.7,O,4161153
4,뷔,방탄소년단,빅히트,남자,1995-12-30,179.0,AB,3470048
5,다니엘,뉴진스,어도어,여자,2005-04-11,165.0,,2341271
6,혜인,뉴진스,어도어,여자,2008-04-21,170.0,O,2301785
7,지수,블랙핑크,와이지,여자,1995-01-03,162.0,A,2227460
8,해린,뉴진스,어도어,여자,2006-05-15,164.5,B,2173376
9,태연,소녀시대,에스엠,여자,1989-03-09,,A,2079866


In [478]:
pd.concat([df1, df2], axis=1) # rows are matched by index label

Unnamed: 0,이름,그룹,소속사,성별,생년월일,키,혈액형,브랜드평판지수,이름.1,연봉,가족수
0,지민,방탄소년단,빅히트,남자,1995-10-13,174.0,A,6267302,지민,3000,3
1,정국,방탄소년단,빅히트,남자,1997-09-01,179.0,A,5805844,정국,3500,3
2,민지,뉴진스,어도어,여자,2004-05-07,169.0,A,4437081,민지,3200,4
3,하니,뉴진스,어도어,여자,2004-10-06,161.7,O,4161153,하니,3050,4
4,뷔,방탄소년단,빅히트,남자,1995-12-30,179.0,AB,3470048,뷔,4300,3
5,다니엘,뉴진스,어도어,여자,2005-04-11,165.0,,2341271,다니엘,2900,5
6,혜인,뉴진스,어도어,여자,2008-04-21,170.0,O,2301785,혜인,3400,6
7,지수,블랙핑크,와이지,여자,1995-01-03,162.0,A,2227460,지수,4500,5
8,해린,뉴진스,어도어,여자,2006-05-15,164.5,B,2173376,해린,4200,4
9,태연,소녀시대,에스엠,여자,1989-03-09,,A,2079866,태연,4300,4


In [479]:
df3 = df2.drop([1, 3, 5, 7])
df3

Unnamed: 0,이름,연봉,가족수
0,지민,3000,3
2,민지,3200,4
4,뷔,4300,3
6,혜인,3400,6
8,해린,4200,4
9,태연,4300,4
10,RM,3700,3
11,제니,3850,5
12,옹성우,3900,4
13,리사,4100,3


In [480]:
pd.concat([df1, df3], axis=1)

Unnamed: 0,이름,그룹,소속사,성별,생년월일,키,혈액형,브랜드평판지수,이름.1,연봉,가족수
0,지민,방탄소년단,빅히트,남자,1995-10-13,174.0,A,6267302,지민,3000.0,3.0
1,정국,방탄소년단,빅히트,남자,1997-09-01,179.0,A,5805844,,,
2,민지,뉴진스,어도어,여자,2004-05-07,169.0,A,4437081,민지,3200.0,4.0
3,하니,뉴진스,어도어,여자,2004-10-06,161.7,O,4161153,,,
4,뷔,방탄소년단,빅히트,남자,1995-12-30,179.0,AB,3470048,뷔,4300.0,3.0
5,다니엘,뉴진스,어도어,여자,2005-04-11,165.0,,2341271,,,
6,혜인,뉴진스,어도어,여자,2008-04-21,170.0,O,2301785,혜인,3400.0,6.0
7,지수,블랙핑크,와이지,여자,1995-01-03,162.0,A,2227460,,,
8,해린,뉴진스,어도어,여자,2006-05-15,164.5,B,2173376,해린,4200.0,4.0
9,태연,소녀시대,에스엠,여자,1989-03-09,,A,2079866,태연,4300.0,4.0


In [481]:
df_right = df2.drop([1, 3, 5, 7, 9], axis=0)
df_right

Unnamed: 0,이름,연봉,가족수
0,지민,3000,3
2,민지,3200,4
4,뷔,4300,3
6,혜인,3400,6
8,해린,4200,4
10,RM,3700,3
11,제니,3850,5
12,옹성우,3900,4
13,리사,4100,3
14,로제,4150,3


In [482]:
df_right = df_right.reset_index(drop=True)
df_right

Unnamed: 0,이름,연봉,가족수
0,지민,3000,3
1,민지,3200,4
2,뷔,4300,3
3,혜인,3400,6
4,해린,4200,4
5,RM,3700,3
6,제니,3850,5
7,옹성우,3900,4
8,리사,4100,3
9,로제,4150,3


In [483]:
dic = {
    "이름":"김사과",
    "연봉":9000,
    "가족수":10
}

In [484]:
df_right = pd.concat([df_right, pd.DataFrame(dic, index=[0])], ignore_index=True)
df_right

Unnamed: 0,이름,연봉,가족수
0,지민,3000,3
1,민지,3200,4
2,뷔,4300,3
3,혜인,3400,6
4,해린,4200,4
5,RM,3700,3
6,제니,3850,5
7,옹성우,3900,4
8,리사,4100,3
9,로제,4150,3


In [485]:
pd.concat([df1, df_right], axis=1)

Unnamed: 0,이름,그룹,소속사,성별,생년월일,키,혈액형,브랜드평판지수,이름.1,연봉,가족수
0,지민,방탄소년단,빅히트,남자,1995-10-13,174.0,A,6267302,지민,3000.0,3.0
1,정국,방탄소년단,빅히트,남자,1997-09-01,179.0,A,5805844,민지,3200.0,4.0
2,민지,뉴진스,어도어,여자,2004-05-07,169.0,A,4437081,뷔,4300.0,3.0
3,하니,뉴진스,어도어,여자,2004-10-06,161.7,O,4161153,혜인,3400.0,6.0
4,뷔,방탄소년단,빅히트,남자,1995-12-30,179.0,AB,3470048,해린,4200.0,4.0
5,다니엘,뉴진스,어도어,여자,2005-04-11,165.0,,2341271,RM,3700.0,3.0
6,혜인,뉴진스,어도어,여자,2008-04-21,170.0,O,2301785,제니,3850.0,5.0
7,지수,블랙핑크,와이지,여자,1995-01-03,162.0,A,2227460,옹성우,3900.0,4.0
8,해린,뉴진스,어도어,여자,2006-05-15,164.5,B,2173376,리사,4100.0,3.0
9,태연,소녀시대,에스엠,여자,1989-03-09,,A,2079866,로제,4150.0,3.0
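
The last result pairs each row of `df1` with an unrelated person because `concat(axis=1)` matches rows by index label, not by name. The sketch below illustrates the same behavior on hypothetical frames (`top`, `bottom`, and `extra` are made-up names and values) and also shows `ignore_index=True` as a one-step alternative to `reset_index(drop=True)`.

```python
import pandas as pd

# Hypothetical frames for illustration only.
top = pd.DataFrame({'name': ['a', 'b'], 'score': [1, 2]})
bottom = pd.DataFrame({'name': ['c', 'd'], 'score': [3, 4]})

stacked = pd.concat([top, bottom])                         # index repeats: 0, 1, 0, 1
renumbered = pd.concat([top, bottom], ignore_index=True)   # index is 0, 1, 2, 3
print(stacked.index.tolist(), renumbered.index.tolist())

# axis=1 aligns on index labels; a label missing on one side yields NaN.
extra = pd.DataFrame({'bonus': [10, 20]}, index=[1, 2])
side = pd.concat([top, extra], axis=1)
print(side)
```

When rows should be matched on a column value such as a name, `merge()` is the better fit.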


In [486]:
# merge(): combine two DataFrames on a key column (ideally a unique identifier)
# pd.merge(left_df, right_df, on='key column', how='join type')
# how: left, right, inner (default), outer, cross
pd.merge(df1, df_right, on='이름', how='left')

Unnamed: 0,이름,그룹,소속사,성별,생년월일,키,혈액형,브랜드평판지수,연봉,가족수
0,지민,방탄소년단,빅히트,남자,1995-10-13,174.0,A,6267302,3000.0,3.0
1,정국,방탄소년단,빅히트,남자,1997-09-01,179.0,A,5805844,,
2,민지,뉴진스,어도어,여자,2004-05-07,169.0,A,4437081,3200.0,4.0
3,하니,뉴진스,어도어,여자,2004-10-06,161.7,O,4161153,,
4,뷔,방탄소년단,빅히트,남자,1995-12-30,179.0,AB,3470048,4300.0,3.0
5,다니엘,뉴진스,어도어,여자,2005-04-11,165.0,,2341271,,
6,혜인,뉴진스,어도어,여자,2008-04-21,170.0,O,2301785,3400.0,6.0
7,지수,블랙핑크,와이지,여자,1995-01-03,162.0,A,2227460,,
8,해린,뉴진스,어도어,여자,2006-05-15,164.5,B,2173376,4200.0,4.0
9,태연,소녀시대,에스엠,여자,1989-03-09,,A,2079866,,


In [487]:
pd.merge(df1, df_right, on='이름', how='right')

Unnamed: 0,이름,그룹,소속사,성별,생년월일,키,혈액형,브랜드평판지수,연봉,가족수
0,지민,방탄소년단,빅히트,남자,1995-10-13,174.0,A,6267302.0,3000,3
1,민지,뉴진스,어도어,여자,2004-05-07,169.0,A,4437081.0,3200,4
2,뷔,방탄소년단,빅히트,남자,1995-12-30,179.0,AB,3470048.0,4300,3
3,혜인,뉴진스,어도어,여자,2008-04-21,170.0,O,2301785.0,3400,6
4,해린,뉴진스,어도어,여자,2006-05-15,164.5,B,2173376.0,4200,4
5,RM,방탄소년단,빅히트,남자,1994-09-12,181.0,A,2069499.0,3700,3
6,제니,블랙핑크,와이지,여자,1996-01-16,163.0,B,2069250.0,3850,5
7,옹성우,워너원,판타지오,남자,1995-08-25,179.0,A,1954327.0,3900,4
8,리사,블랙핑크,와이지,여자,1997-03-27,167.0,A,1912800.0,4100,3
9,로제,블랙핑크,와이지,여자,1997-02-11,168.0,B,1888132.0,4150,3


In [488]:
pd.merge(df1, df_right, on='이름', how='inner')

Unnamed: 0,이름,그룹,소속사,성별,생년월일,키,혈액형,브랜드평판지수,연봉,가족수
0,지민,방탄소년단,빅히트,남자,1995-10-13,174.0,A,6267302,3000,3
1,민지,뉴진스,어도어,여자,2004-05-07,169.0,A,4437081,3200,4
2,뷔,방탄소년단,빅히트,남자,1995-12-30,179.0,AB,3470048,4300,3
3,혜인,뉴진스,어도어,여자,2008-04-21,170.0,O,2301785,3400,6
4,해린,뉴진스,어도어,여자,2006-05-15,164.5,B,2173376,4200,4
5,RM,방탄소년단,빅히트,남자,1994-09-12,181.0,A,2069499,3700,3
6,제니,블랙핑크,와이지,여자,1996-01-16,163.0,B,2069250,3850,5
7,옹성우,워너원,판타지오,남자,1995-08-25,179.0,A,1954327,3900,4
8,리사,블랙핑크,와이지,여자,1997-03-27,167.0,A,1912800,4100,3
9,로제,블랙핑크,와이지,여자,1997-02-11,168.0,B,1888132,4150,3


In [489]:
pd.merge(df1, df_right, how='cross')

Unnamed: 0,이름_x,그룹,소속사,성별,생년월일,키,혈액형,브랜드평판지수,이름_y,연봉,가족수
0,지민,방탄소년단,빅히트,남자,1995-10-13,174.0,A,6267302,지민,3000,3
1,지민,방탄소년단,빅히트,남자,1995-10-13,174.0,A,6267302,민지,3200,4
2,지민,방탄소년단,빅히트,남자,1995-10-13,174.0,A,6267302,뷔,4300,3
3,지민,방탄소년단,빅히트,남자,1995-10-13,174.0,A,6267302,혜인,3400,6
4,지민,방탄소년단,빅히트,남자,1995-10-13,174.0,A,6267302,해린,4200,4
...,...,...,...,...,...,...,...,...,...,...,...
315,진,방탄소년단,빅히트,남자,1992-12-04,179.0,O,1680587,조이,3500,3
316,진,방탄소년단,빅히트,남자,1992-12-04,179.0,O,1680587,슬기,3200,4
317,진,방탄소년단,빅히트,남자,1992-12-04,179.0,O,1680587,강다니엘,3050,4
318,진,방탄소년단,빅히트,남자,1992-12-04,179.0,O,1680587,진,4300,3


In [490]:
df_right.columns = ['성함', '연봉', '가족수']
df_right

Unnamed: 0,성함,연봉,가족수
0,지민,3000,3
1,민지,3200,4
2,뷔,4300,3
3,혜인,3400,6
4,해린,4200,4
5,RM,3700,3
6,제니,3850,5
7,옹성우,3900,4
8,리사,4100,3
9,로제,4150,3


In [491]:
# pd.merge(df1, df_right, on='이름', how='left')  # raises KeyError: df_right no longer has a '이름' column

In [492]:
pd.merge(df1, df_right, left_on='이름', right_on='성함', how='left')

Unnamed: 0,이름,그룹,소속사,성별,생년월일,키,혈액형,브랜드평판지수,성함,연봉,가족수
0,지민,방탄소년단,빅히트,남자,1995-10-13,174.0,A,6267302,지민,3000.0,3.0
1,정국,방탄소년단,빅히트,남자,1997-09-01,179.0,A,5805844,,,
2,민지,뉴진스,어도어,여자,2004-05-07,169.0,A,4437081,민지,3200.0,4.0
3,하니,뉴진스,어도어,여자,2004-10-06,161.7,O,4161153,,,
4,뷔,방탄소년단,빅히트,남자,1995-12-30,179.0,AB,3470048,뷔,4300.0,3.0
5,다니엘,뉴진스,어도어,여자,2005-04-11,165.0,,2341271,,,
6,혜인,뉴진스,어도어,여자,2008-04-21,170.0,O,2301785,혜인,3400.0,6.0
7,지수,블랙핑크,와이지,여자,1995-01-03,162.0,A,2227460,,,
8,해린,뉴진스,어도어,여자,2006-05-15,164.5,B,2173376,해린,4200.0,4.0
9,태연,소녀시대,에스엠,여자,1989-03-09,,A,2079866,,,
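
The cells above cover `left`, `right`, `inner`, and `cross`. As a complement, here is a minimal sketch of `how='outer'` combined with `indicator=True`, which adds a `_merge` column reporting where each row came from. The `left` and `right` frames are hypothetical.

```python
import pandas as pd

# Hypothetical frames that share only some keys.
left = pd.DataFrame({'name': ['a', 'b', 'c'], 'group': ['x', 'y', 'z']})
right = pd.DataFrame({'name': ['b', 'c', 'd'], 'salary': [100, 200, 300]})

# An outer join keeps every key from both sides; unmatched cells become NaN.
outer = pd.merge(left, right, on='name', how='outer', indicator=True)
print(outer)

# An inner join keeps only the keys present on both sides.
inner = pd.merge(left, right, on='name', how='inner')
print(inner)
```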


# **12. Ranking**

In [493]:
# rank(): rank the values of a DataFrame or Series. Ascending by default (the smallest value gets rank 1)
df1['브랜드순위'] = df1['브랜드평판지수'].rank()
df1

Unnamed: 0,이름,그룹,소속사,성별,생년월일,키,혈액형,브랜드평판지수,브랜드순위
0,지민,방탄소년단,빅히트,남자,1995-10-13,174.0,A,6267302,20.0
1,정국,방탄소년단,빅히트,남자,1997-09-01,179.0,A,5805844,19.0
2,민지,뉴진스,어도어,여자,2004-05-07,169.0,A,4437081,18.0
3,하니,뉴진스,어도어,여자,2004-10-06,161.7,O,4161153,17.0
4,뷔,방탄소년단,빅히트,남자,1995-12-30,179.0,AB,3470048,16.0
5,다니엘,뉴진스,어도어,여자,2005-04-11,165.0,,2341271,15.0
6,혜인,뉴진스,어도어,여자,2008-04-21,170.0,O,2301785,14.0
7,지수,블랙핑크,와이지,여자,1995-01-03,162.0,A,2227460,13.0
8,해린,뉴진스,어도어,여자,2006-05-15,164.5,B,2173376,12.0
9,태연,소녀시대,에스엠,여자,1989-03-09,,A,2079866,11.0


In [494]:
df1['브랜드순위'] = df1['브랜드평판지수'].rank(ascending=False)
df1

Unnamed: 0,이름,그룹,소속사,성별,생년월일,키,혈액형,브랜드평판지수,브랜드순위
0,지민,방탄소년단,빅히트,남자,1995-10-13,174.0,A,6267302,1.0
1,정국,방탄소년단,빅히트,남자,1997-09-01,179.0,A,5805844,2.0
2,민지,뉴진스,어도어,여자,2004-05-07,169.0,A,4437081,3.0
3,하니,뉴진스,어도어,여자,2004-10-06,161.7,O,4161153,4.0
4,뷔,방탄소년단,빅히트,남자,1995-12-30,179.0,AB,3470048,5.0
5,다니엘,뉴진스,어도어,여자,2005-04-11,165.0,,2341271,6.0
6,혜인,뉴진스,어도어,여자,2008-04-21,170.0,O,2301785,7.0
7,지수,블랙핑크,와이지,여자,1995-01-03,162.0,A,2227460,8.0
8,해린,뉴진스,어도어,여자,2006-05-15,164.5,B,2173376,9.0
9,태연,소녀시대,에스엠,여자,1989-03-09,,A,2079866,10.0


In [495]:
# astype(): change the dtype of a column
df1['브랜드순위'] = df1['브랜드순위'].astype(int)
df1

Unnamed: 0,이름,그룹,소속사,성별,생년월일,키,혈액형,브랜드평판지수,브랜드순위
0,지민,방탄소년단,빅히트,남자,1995-10-13,174.0,A,6267302,1
1,정국,방탄소년단,빅히트,남자,1997-09-01,179.0,A,5805844,2
2,민지,뉴진스,어도어,여자,2004-05-07,169.0,A,4437081,3
3,하니,뉴진스,어도어,여자,2004-10-06,161.7,O,4161153,4
4,뷔,방탄소년단,빅히트,남자,1995-12-30,179.0,AB,3470048,5
5,다니엘,뉴진스,어도어,여자,2005-04-11,165.0,,2341271,6
6,혜인,뉴진스,어도어,여자,2008-04-21,170.0,O,2301785,7
7,지수,블랙핑크,와이지,여자,1995-01-03,162.0,A,2227460,8
8,해린,뉴진스,어도어,여자,2006-05-15,164.5,B,2173376,9
9,태연,소녀시대,에스엠,여자,1989-03-09,,A,2079866,10


In [496]:
df1['브랜드순위'].dtypes

dtype('int64')
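
`rank()` returns floats because tied values share the average of their ranks by default, so casting with `astype(int)` would silently truncate a rank such as 1.5. The following sketch, using made-up scores, compares the default with `method='min'` and `method='dense'`, which always produce whole-number ranks.

```python
import pandas as pd

# Hypothetical scores containing a tie.
scores = pd.Series([90, 80, 90, 70])

avg_rank = scores.rank(ascending=False)                                  # ties share the mean rank
min_rank = scores.rank(ascending=False, method='min').astype(int)       # ties share the lowest rank
dense_rank = scores.rank(ascending=False, method='dense').astype(int)   # no gaps after ties

print(avg_rank.tolist())
print(min_rank.tolist())
print(dense_rank.tolist())
```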

# **13. Working with Datetime Types**

In [497]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   name      20 non-null     object 
 1   group     20 non-null     object 
 2   company   19 non-null     object 
 3   gender    20 non-null     object 
 4   birthday  20 non-null     object 
 5   height    19 non-null     float64
 6   blood     19 non-null     object 
 7   brand     20 non-null     int64  
dtypes: float64(1), int64(1), object(6)
memory usage: 1.4+ KB


In [498]:
df['birthday']

Unnamed: 0,birthday
0,1995-10-13
1,1997-09-01
2,2004-05-07
3,2004-10-06
4,1995-12-30
5,2005-04-11
6,2008-04-21
7,1995-01-03
8,2006-05-15
9,1989-03-09


In [499]:
# to_datetime(): convert an object (string) column to the datetime type
df['birthday'] = pd.to_datetime(df['birthday'])
print(type(df['birthday']))
print(df['birthday'].dtypes)

<class 'pandas.core.series.Series'>
datetime64[ns]


In [500]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   name      20 non-null     object        
 1   group     20 non-null     object        
 2   company   19 non-null     object        
 3   gender    20 non-null     object        
 4   birthday  20 non-null     datetime64[ns]
 5   height    19 non-null     float64       
 6   blood     19 non-null     object        
 7   brand     20 non-null     int64         
dtypes: datetime64[ns](1), float64(1), int64(1), object(5)
memory usage: 1.4+ KB


In [501]:
df['birthday'].dt.year

Unnamed: 0,birthday
0,1995
1,1997
2,2004
3,2004
4,1995
5,2005
6,2008
7,1995
8,2006
9,1989


In [502]:
df['birthday'].dt.month

Unnamed: 0,birthday
0,10
1,9
2,5
3,10
4,12
5,4
6,4
7,1
8,5
9,3


In [503]:
df['birthday'].dt.day

Unnamed: 0,birthday
0,13
1,1
2,7
3,6
4,30
5,11
6,21
7,3
8,15
9,9


In [504]:
df['birthday'].dt.hour

Unnamed: 0,birthday
0,0
1,0
2,0
3,0
4,0
5,0
6,0
7,0
8,0
9,0


In [505]:
df['birthday'].dt.minute

Unnamed: 0,birthday
0,0
1,0
2,0
3,0
4,0
5,0
6,0
7,0
8,0
9,0


In [506]:
df['birthday'].dt.second

Unnamed: 0,birthday
0,0
1,0
2,0
3,0
4,0
5,0
6,0
7,0
8,0
9,0


In [507]:
df['birthday'].dt.dayofweek # day of the week: 0 (Monday) to 6 (Sunday)

Unnamed: 0,birthday
0,4
1,0
2,4
3,2
4,5
5,0
6,0
7,1
8,0
9,3


In [508]:
df['birthday'].dt.isocalendar().week

Unnamed: 0,week
0,41
1,36
2,19
3,41
4,52
5,15
6,17
7,1
8,20
9,10
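
As a compact recap of the `.dt` accessor, the sketch below converts two of the birthdays shown above and reads a few fields from them. The `errors='coerce'` call and the `'not a date'` string are additions for illustration; they show one way to handle unparseable input, assuming that silently producing `NaT` is acceptable for the task.

```python
import pandas as pd

birthdays = pd.to_datetime(pd.Series(['1995-10-13', '2004-05-07']))
print(birthdays.dt.year.tolist())
print(birthdays.dt.day_name().tolist())   # weekday names instead of 0..6 codes

# errors='coerce' converts strings that cannot be parsed into NaT instead of raising.
mixed = pd.to_datetime(pd.Series(['1995-10-13', 'not a date']), errors='coerce')
print(mixed.isna().tolist())
```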


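# **14. Using apply**
The pandas apply() method passes the data in a DataFrame or Series to a user-defined or built-in function to compute or transform values. On a Series it calls the function once per element; on a DataFrame it processes the data one column (axis=0) or one row (axis=1) at a time.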

In [509]:
df.head()

Unnamed: 0,name,group,company,gender,birthday,height,blood,brand
0,지민,방탄소년단,빅히트,남자,1995-10-13,174.0,A,6267302
1,정국,방탄소년단,빅히트,남자,1997-09-01,179.0,A,5805844
2,민지,뉴진스,어도어,여자,2004-05-07,169.0,A,4437081
3,하니,뉴진스,어도어,여자,2004-10-06,161.7,O,4161153
4,뷔,방탄소년단,빅히트,남자,1995-12-30,179.0,AB,3470048


In [510]:
# Convert gender to 1 for '남자' (male) and 0 for '여자' (female), using loc
df.loc[df['gender'] == '남자', 'gender'] = 1
df.loc[df['gender'] == '여자', 'gender'] = 0

In [511]:
df.head()

Unnamed: 0,name,group,company,gender,birthday,height,blood,brand
0,지민,방탄소년단,빅히트,1,1995-10-13,174.0,A,6267302
1,정국,방탄소년단,빅히트,1,1997-09-01,179.0,A,5805844
2,민지,뉴진스,어도어,0,2004-05-07,169.0,A,4437081
3,하니,뉴진스,어도어,0,2004-10-06,161.7,O,4161153
4,뷔,방탄소년단,빅히트,1,1995-12-30,179.0,AB,3470048


In [512]:
df = pd.read_csv('/content/drive/MyDrive/컴퓨터비전/3. 데이터 분석/Data/idol.csv')
df

Unnamed: 0,이름,그룹,소속사,성별,생년월일,키,혈액형,브랜드평판지수
0,지민,방탄소년단,빅히트,남자,1995-10-13,174.0,A,6267302
1,정국,방탄소년단,빅히트,남자,1997-09-01,179.0,A,5805844
2,민지,뉴진스,어도어,여자,2004-05-07,169.0,A,4437081
3,하니,뉴진스,어도어,여자,2004-10-06,161.7,O,4161153
4,뷔,방탄소년단,빅히트,남자,1995-12-30,179.0,AB,3470048
5,다니엘,뉴진스,어도어,여자,2005-04-11,165.0,,2341271
6,혜인,뉴진스,어도어,여자,2008-04-21,170.0,O,2301785
7,지수,블랙핑크,와이지,여자,1995-01-03,162.0,A,2227460
8,해린,뉴진스,어도어,여자,2006-05-15,164.5,B,2173376
9,태연,소녀시대,에스엠,여자,1989-03-09,,A,2079866


In [513]:
def male_or_female(x):
    if x == '남자':
        return 1
    elif x == '여자':
        return 0
    else:
        return None

In [514]:
male_or_female('남자')

1

In [515]:
male_or_female('여자')

0

In [516]:
df['성별'].apply(male_or_female)

Unnamed: 0,성별
0,1
1,1
2,0
3,0
4,1
5,0
6,0
7,0
8,0
9,0


In [517]:
df['성별'].apply(lambda x: 1 if x == '남자' else 0)

Unnamed: 0,성별
0,1
1,1
2,0
3,0
4,1
5,0
6,0
7,0
8,0
9,0


In [518]:
df['new성별'] = df['성별'].apply(lambda x: 1 if x == '남자' else 0)
df.head()

Unnamed: 0,이름,그룹,소속사,성별,생년월일,키,혈액형,브랜드평판지수,new성별
0,지민,방탄소년단,빅히트,남자,1995-10-13,174.0,A,6267302,1
1,정국,방탄소년단,빅히트,남자,1997-09-01,179.0,A,5805844,1
2,민지,뉴진스,어도어,여자,2004-05-07,169.0,A,4437081,0
3,하니,뉴진스,어도어,여자,2004-10-06,161.7,O,4161153,0
4,뷔,방탄소년단,빅히트,남자,1995-12-30,179.0,AB,3470048,1
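
The examples above apply a function to a single column. A row-wise call is sketched below: with `axis=1` the function receives each row as a Series, so it can combine several columns. The `people` frame, the `describe` helper, and the 170 cm threshold are all hypothetical.

```python
import pandas as pd

# Hypothetical data for illustration.
people = pd.DataFrame({'name': ['a', 'b'], 'height': [174.0, 161.7]})

def describe(row):
    # row is a Series holding one row's values, keyed by column name.
    size = 'tall' if row['height'] >= 170 else 'short'
    return f"{row['name']}:{size}"

people['label'] = people.apply(describe, axis=1)
print(people)
```

Row-wise `apply` runs a Python function for every row, so for large frames a vectorized expression is usually faster when one is available.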


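# **15. Using map**
The pandas map() method is available on Series objects and applies a function or a mapping (such as a dict) to each element to produce new values. Because it visits every element, it is useful for recoding or transforming a column. Values that have no entry in a dict mapping become NaN.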

In [519]:
df = pd.read_csv('/content/drive/MyDrive/컴퓨터비전/3. 데이터 분석/Data/idol.csv')
df

Unnamed: 0,이름,그룹,소속사,성별,생년월일,키,혈액형,브랜드평판지수
0,지민,방탄소년단,빅히트,남자,1995-10-13,174.0,A,6267302
1,정국,방탄소년단,빅히트,남자,1997-09-01,179.0,A,5805844
2,민지,뉴진스,어도어,여자,2004-05-07,169.0,A,4437081
3,하니,뉴진스,어도어,여자,2004-10-06,161.7,O,4161153
4,뷔,방탄소년단,빅히트,남자,1995-12-30,179.0,AB,3470048
5,다니엘,뉴진스,어도어,여자,2005-04-11,165.0,,2341271
6,혜인,뉴진스,어도어,여자,2008-04-21,170.0,O,2301785
7,지수,블랙핑크,와이지,여자,1995-01-03,162.0,A,2227460
8,해린,뉴진스,어도어,여자,2006-05-15,164.5,B,2173376
9,태연,소녀시대,에스엠,여자,1989-03-09,,A,2079866


In [520]:
map_gender = {'남자':1, '여자':0}

In [521]:
df['성별'].map(map_gender)

Unnamed: 0,성별
0,1
1,1
2,0
3,0
4,1
5,0
6,0
7,0
8,0
9,0


In [522]:
df['New성별'] = df['성별'].map(map_gender)
df.head()

Unnamed: 0,이름,그룹,소속사,성별,생년월일,키,혈액형,브랜드평판지수,New성별
0,지민,방탄소년단,빅히트,남자,1995-10-13,174.0,A,6267302,1
1,정국,방탄소년단,빅히트,남자,1997-09-01,179.0,A,5805844,1
2,민지,뉴진스,어도어,여자,2004-05-07,169.0,A,4437081,0
3,하니,뉴진스,어도어,여자,2004-10-06,161.7,O,4161153,0
4,뷔,방탄소년단,빅히트,남자,1995-12-30,179.0,AB,3470048,1
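
One detail worth checking when mapping with a dict is what happens to values that are not among its keys. The sketch below reuses the `{'남자': 1, '여자': 0}` mapping from above on a hypothetical Series that contains an extra value; the `-1` fill value is an arbitrary choice for illustration.

```python
import pandas as pd

# Hypothetical column with a value that the mapping does not cover.
gender = pd.Series(['남자', '여자', 'unknown'])

codes = gender.map({'남자': 1, '여자': 0})   # 'unknown' has no key, so it becomes NaN
print(codes.tolist())

# Replace the unmapped entries explicitly, then restore an integer dtype.
filled = codes.fillna(-1).astype(int)
print(filled.tolist())
```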


# **16. DataFrame Arithmetic**

In [523]:
df = pd.DataFrame({
    '파이썬':[60, 70, 80, 90, 95],
    '데이터분석':[40, 60, 70, 55, 87],
    '머신러닝딥러닝':[35, 40, 30, 70, 55]
})

In [524]:
df

Unnamed: 0,파이썬,데이터분석,머신러닝딥러닝
0,60,40,35
1,70,60,40
2,80,70,30
3,90,55,70
4,95,87,55


In [525]:
df['파이썬'].dtypes

dtype('int64')

In [526]:
type(df['파이썬'])

In [527]:
df['파이썬'] + df['데이터분석'] + df['머신러닝딥러닝']

Unnamed: 0,0
0,135
1,170
2,180
3,215
4,237


In [528]:
df['총점'] = df['파이썬'] + df['데이터분석'] + df['머신러닝딥러닝']
df['평균'] = df['총점'] / 3
df

Unnamed: 0,파이썬,데이터분석,머신러닝딥러닝,총점,평균
0,60,40,35,135,45.0
1,70,60,40,170,56.666667
2,80,70,30,180,60.0
3,90,55,70,215,71.666667
4,95,87,55,237,79.0
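
Adding the columns by hand works, but the divisor has to be updated whenever a subject is added. A sketch of the same calculation with row-wise `sum(axis=1)` and `mean(axis=1)` follows; the `scores` frame and its English column names are stand-ins for the table above.

```python
import pandas as pd

# Hypothetical scores; the first two rows mirror the table above.
scores = pd.DataFrame({'python': [60, 70], 'analysis': [40, 60], 'ml': [35, 40]})
subjects = ['python', 'analysis', 'ml']

# axis=1 aggregates across the columns of each row.
scores['total'] = scores[subjects].sum(axis=1)
scores['mean'] = scores[subjects].mean(axis=1)
print(scores)
```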


# **17. select_dtypes**

In [529]:
df = pd.read_csv('/content/drive/MyDrive/컴퓨터비전/3. 데이터 분석/Data/idol.csv')
df

Unnamed: 0,이름,그룹,소속사,성별,생년월일,키,혈액형,브랜드평판지수
0,지민,방탄소년단,빅히트,남자,1995-10-13,174.0,A,6267302
1,정국,방탄소년단,빅히트,남자,1997-09-01,179.0,A,5805844
2,민지,뉴진스,어도어,여자,2004-05-07,169.0,A,4437081
3,하니,뉴진스,어도어,여자,2004-10-06,161.7,O,4161153
4,뷔,방탄소년단,빅히트,남자,1995-12-30,179.0,AB,3470048
5,다니엘,뉴진스,어도어,여자,2005-04-11,165.0,,2341271
6,혜인,뉴진스,어도어,여자,2008-04-21,170.0,O,2301785
7,지수,블랙핑크,와이지,여자,1995-01-03,162.0,A,2227460
8,해린,뉴진스,어도어,여자,2006-05-15,164.5,B,2173376
9,태연,소녀시대,에스엠,여자,1989-03-09,,A,2079866


In [530]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 8 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   이름       20 non-null     object 
 1   그룹       20 non-null     object 
 2   소속사      19 non-null     object 
 3   성별       20 non-null     object 
 4   생년월일     20 non-null     object 
 5   키        19 non-null     float64
 6   혈액형      19 non-null     object 
 7   브랜드평판지수  20 non-null     int64  
dtypes: float64(1), int64(1), object(6)
memory usage: 1.4+ KB


In [531]:
df.select_dtypes(include='object') # only the object-dtype (string) columns

Unnamed: 0,이름,그룹,소속사,성별,생년월일,혈액형
0,지민,방탄소년단,빅히트,남자,1995-10-13,A
1,정국,방탄소년단,빅히트,남자,1997-09-01,A
2,민지,뉴진스,어도어,여자,2004-05-07,A
3,하니,뉴진스,어도어,여자,2004-10-06,O
4,뷔,방탄소년단,빅히트,남자,1995-12-30,AB
5,다니엘,뉴진스,어도어,여자,2005-04-11,
6,혜인,뉴진스,어도어,여자,2008-04-21,O
7,지수,블랙핑크,와이지,여자,1995-01-03,A
8,해린,뉴진스,어도어,여자,2006-05-15,B
9,태연,소녀시대,에스엠,여자,1989-03-09,A


In [532]:
df.select_dtypes(exclude='object') # every column except the object-dtype (string) ones

Unnamed: 0,키,브랜드평판지수
0,174.0,6267302
1,179.0,5805844
2,169.0,4437081
3,161.7,4161153
4,179.0,3470048
5,165.0,2341271
6,170.0,2301785
7,162.0,2227460
8,164.5,2173376
9,,2079866


In [533]:
# Add 10 to the non-string columns only
df.select_dtypes(exclude='object') + 10

Unnamed: 0,키,브랜드평판지수
0,184.0,6267312
1,189.0,5805854
2,179.0,4437091
3,171.7,4161163
4,189.0,3470058
5,175.0,2341281
6,180.0,2301795
7,172.0,2227470
8,174.5,2173386
9,,2079876


In [534]:
# Store only the names of the string columns in a variable and display them
str_cols = df.select_dtypes(include='object').columns
str_cols

Index(['이름', '그룹', '소속사', '성별', '생년월일', '혈액형'], dtype='object')

In [535]:
df[str_cols]

Unnamed: 0,이름,그룹,소속사,성별,생년월일,혈액형
0,지민,방탄소년단,빅히트,남자,1995-10-13,A
1,정국,방탄소년단,빅히트,남자,1997-09-01,A
2,민지,뉴진스,어도어,여자,2004-05-07,A
3,하니,뉴진스,어도어,여자,2004-10-06,O
4,뷔,방탄소년단,빅히트,남자,1995-12-30,AB
5,다니엘,뉴진스,어도어,여자,2005-04-11,
6,혜인,뉴진스,어도어,여자,2008-04-21,O
7,지수,블랙핑크,와이지,여자,1995-01-03,A
8,해린,뉴진스,어도어,여자,2006-05-15,B
9,태연,소녀시대,에스엠,여자,1989-03-09,A
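
The same pattern also works in the other direction: `include='number'` selects the integer and float columns directly, without relying on how text columns are stored. The sketch below uses a hypothetical `df_demo` frame and writes the modified values back into it.

```python
import pandas as pd

# Hypothetical frame with one text column and two numeric columns.
df_demo = pd.DataFrame({'name': ['a', 'b'], 'height': [174.0, 179.0], 'brand': [10, 20]})

# 'number' matches every integer and float dtype.
num_cols = df_demo.select_dtypes(include='number').columns
print(num_cols.tolist())

# Update only the numeric columns in place.
df_demo[num_cols] = df_demo[num_cols] + 10
print(df_demo)
```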


# **18. get_dummies**
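get_dummies() converts categorical data in pandas into a one-hot encoded form.

> One-hot encoding turns each category into its own column, putting 1 where a row belongs to that category and 0 elsewhere. For example, if the data consists of strings such as "Red", "Green", and "Blue", most models cannot use it directly; categorical data has to be converted to numbers before a model can compute with it. One-hot encoding performs that conversion without imposing any order or magnitude on the categories.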

In [536]:
blood_map = {'A':0, 'B':1, 'AB':2, 'O':3}
df['혈액형_code'] = df['혈액형'].map(blood_map) # label encoding
df.head()

Unnamed: 0,이름,그룹,소속사,성별,생년월일,키,혈액형,브랜드평판지수,혈액형_code
0,지민,방탄소년단,빅히트,남자,1995-10-13,174.0,A,6267302,0.0
1,정국,방탄소년단,빅히트,남자,1997-09-01,179.0,A,5805844,0.0
2,민지,뉴진스,어도어,여자,2004-05-07,169.0,A,4437081,0.0
3,하니,뉴진스,어도어,여자,2004-10-06,161.7,O,4161153,3.0
4,뷔,방탄소년단,빅히트,남자,1995-12-30,179.0,AB,3470048,2.0


In [537]:
pd.get_dummies(df['혈액형'])

Unnamed: 0,A,AB,B,O
0,True,False,False,False
1,True,False,False,False
2,True,False,False,False
3,False,False,False,True
4,False,True,False,False
5,False,False,False,False
6,False,False,False,True
7,True,False,False,False
8,False,False,True,False
9,True,False,False,False


In [538]:
df = pd.get_dummies(df, columns=['혈액형'])
df

Unnamed: 0,이름,그룹,소속사,성별,생년월일,키,브랜드평판지수,혈액형_code,혈액형_A,혈액형_AB,혈액형_B,혈액형_O
0,지민,방탄소년단,빅히트,남자,1995-10-13,174.0,6267302,0.0,True,False,False,False
1,정국,방탄소년단,빅히트,남자,1997-09-01,179.0,5805844,0.0,True,False,False,False
2,민지,뉴진스,어도어,여자,2004-05-07,169.0,4437081,0.0,True,False,False,False
3,하니,뉴진스,어도어,여자,2004-10-06,161.7,4161153,3.0,False,False,False,True
4,뷔,방탄소년단,빅히트,남자,1995-12-30,179.0,3470048,2.0,False,True,False,False
5,다니엘,뉴진스,어도어,여자,2005-04-11,165.0,2341271,,False,False,False,False
6,혜인,뉴진스,어도어,여자,2008-04-21,170.0,2301785,3.0,False,False,False,True
7,지수,블랙핑크,와이지,여자,1995-01-03,162.0,2227460,0.0,True,False,False,False
8,해린,뉴진스,어도어,여자,2006-05-15,164.5,2173376,1.0,False,False,True,False
9,태연,소녀시대,에스엠,여자,1989-03-09,,2079866,0.0,True,False,False,False


In [539]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   이름        20 non-null     object 
 1   그룹        20 non-null     object 
 2   소속사       19 non-null     object 
 3   성별        20 non-null     object 
 4   생년월일      20 non-null     object 
 5   키         19 non-null     float64
 6   브랜드평판지수   20 non-null     int64  
 7   혈액형_code  19 non-null     float64
 8   혈액형_A     20 non-null     bool   
 9   혈액형_AB    20 non-null     bool   
 10  혈액형_B     20 non-null     bool   
 11  혈액형_O     20 non-null     bool   
dtypes: bool(4), float64(2), int64(1), object(5)
memory usage: 1.5+ KB
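
In the output above the dummy columns are `bool`, and the row with a missing blood type is all `False`. If 0/1 integers or an explicit missing-value column are preferred, the `dtype` and `dummy_na` parameters of `get_dummies()` can be used, as in this sketch on a hypothetical Series.

```python
import numpy as np
import pandas as pd

# Hypothetical blood-type column with one missing value.
blood = pd.Series(['A', 'O', np.nan, 'AB'], name='blood')

# dtype=int yields 0/1 instead of False/True; dummy_na=True adds a column for NaN.
onehot = pd.get_dummies(blood, dtype=int, dummy_na=True)
print(onehot)
```

With `dummy_na=True` every row has exactly one 1, including the row whose value is missing.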
