## Data Handling and Preprocessing

## 07. Encoding Data

<img src = "https://images.unsplash.com/photo-1533237264985-ee62f6d342bb?ixlib=rb-4.0.3&ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&auto=format&fit=crop&w=1469&q=80" width=80% align="center"/>

<div align="right">Photo: <a href="https://unsplash.com/@element5digital?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Element5 Digital</a> on <a href="https://unsplash.com/ko/%EC%82%AC%EC%A7%84/LTyDj7u_TU4?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
</div>
  
  

## 0. Loading the Data
- ### Data description
    1. preprocessing_07.csv : data with outlier handling completed in the previous exercise 
> - MovieId : (int) movie ID <br>
> - ImdbId : (int) ID managed by the IMDb database<br>
> - TmdbId : (int) ID managed by the TMDB database<br>
> - Title : (object) movie title <br> 
> - Year : (int) production year <br> 
> - Genres : (object) movie genres; multiple genres are separated by '|' <br>
> - UserId : (int) user ID <br>
> - Rating : (float) movie rating <br>
> - Gender : (object) gender, M/F <br>
> - Age : (int) age<br>
> - Occupation : (object) occupation<br>


In [1]:
# Import libraries
import pandas as pd

import warnings
warnings.filterwarnings(action='ignore')

In [29]:
# Load the data
df = pd.read_csv("./data/preprocessing_07.csv")

In [30]:
df.head()

Unnamed: 0,MovieId,ImdbId,TmdbId,Title,Year,Genres,UserId,Rating,Gender,Age,Occupation
0,1,114709,862,Toy Story,1995,Adventure|Animation|Children|Comedy|Fantasy,1,4.0,F,2,K-12 student
1,1,114709,862,Toy Story,1995,Adventure|Animation|Children|Comedy|Fantasy,5,4.0,M,30,writer
2,1,114709,862,Toy Story,1995,Adventure|Animation|Children|Comedy|Fantasy,7,4.5,M,39,academic/educator
3,1,114709,862,Toy Story,1995,Adventure|Animation|Children|Comedy|Fantasy,15,2.5,M,29,executive/managerial
4,1,114709,862,Toy Story,1995,Adventure|Animation|Children|Comedy|Fantasy,17,4.5,M,52,academic/educator


---

### 1. Label Encoding 
Label encoding converts categorical data to numbers by assigning a unique integer to each category. <br> 
For example, the categories "male" and "female" can be converted to 0 and 1, respectively.<br>
You can use the LabelEncoder class from the sklearn library, or convert the values directly with the pandas map method.

#### 1-1. Using the sklearn library
- #### Label encoding the 'Occupation' column

In [31]:
df_le = df.copy()

In [32]:
# Import the library
from sklearn.preprocessing import LabelEncoder

# Create a LabelEncoder instance
le = LabelEncoder()

# Label-encode the data in the 'Occupation' column
le_occupation = le.fit_transform(df['Occupation'])


In [33]:
le_occupation

array([ 0, 20,  1, ..., 11, 11,  7])

In [7]:
# Check the classes behind the label encoding (the position of each class is its code)
le.classes_

array(['K-12 student', 'academic/educator', 'artist', 'clerical/admin',
       'college/grad student', 'customer service', 'doctor/health care',
       'executive/managerial', 'farmer', 'homemaker', 'lawyer', 'other',
       'programmer', 'retired', 'sales/marketing', 'scientist',
       'self-employed', 'technician/engineer', 'tradesman/craftsman',
       'unemployed', 'writer'], dtype=object)
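
The position of each label in `le.classes_` is the integer code it receives. The classes are sorted, and uppercase letters sort before lowercase ones, which is why 'K-12 student' is encoded as 0. The following is a minimal sketch of that relationship using a small made-up list of occupations (not the real dataset); the names `occupations`, `le_demo`, and `mapping` are hypothetical and only for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Toy data for illustration only (not taken from preprocessing_07.csv)
occupations = ["writer", "K-12 student", "artist", "writer"]

le_demo = LabelEncoder()
codes = le_demo.fit_transform(occupations)

# classes_ is sorted; the index of each label is its integer code
mapping = {label: code for code, label in enumerate(le_demo.classes_.tolist())}
print(mapping)

# inverse_transform turns the codes back into the original labels
restored = le_demo.inverse_transform(codes)
print(list(restored))
```

Keeping the fitted encoder around is what makes it possible to decode the numbers later or to apply the same codes to new data with `transform`.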

In [34]:
df_le['Occupation_Le'] = le_occupation

In [35]:
df_le.head(3)

Unnamed: 0,MovieId,ImdbId,TmdbId,Title,Year,Genres,UserId,Rating,Gender,Age,Occupation,Occupation_Le
0,1,114709,862,Toy Story,1995,Adventure|Animation|Children|Comedy|Fantasy,1,4.0,F,2,K-12 student,0
1,1,114709,862,Toy Story,1995,Adventure|Animation|Children|Comedy|Fantasy,5,4.0,M,30,writer,20
2,1,114709,862,Toy Story,1995,Adventure|Animation|Children|Comedy|Fantasy,7,4.5,M,39,academic/educator,1


#### 1-2. Label encoding with the pandas map() method
- #### Label encoding the 'Occupation' column

In [36]:
# Map each value of the 'Occupation' column to an integer label
label_mapping = { "other" : 0 , 
                  "academic/educator" : 1, 
                  "artist" : 2, 
                  "clerical/admin" : 3, 
                  "college/grad student" : 4, 
                  "customer service" : 5, 
                  "doctor/health care" : 6, 
                  "executive/managerial" : 7, 
                  "farmer" : 8, 
                  "homemaker" : 9, 
                  "K-12 student" : 10,
                  "lawyer" : 11, 
                  "programmer" : 12, 
                  "retired" : 13, 
                  "sales/marketing" : 14, 
                  "scientist" : 15, 
                  "self-employed" : 16, 
                  "technician/engineer" : 17, 
                  "tradesman/craftsman" : 18, 
                  "unemployed" : 19, 
                  "writer" : 20 }

In [37]:
# Replace the values using the pandas map() method
df_le['Occupation_Map'] = df['Occupation'].map(label_mapping)

In [38]:
df_le.head(3)

Unnamed: 0,MovieId,ImdbId,TmdbId,Title,Year,Genres,UserId,Rating,Gender,Age,Occupation,Occupation_Le,Occupation_Map
0,1,114709,862,Toy Story,1995,Adventure|Animation|Children|Comedy|Fantasy,1,4.0,F,2,K-12 student,0,10
1,1,114709,862,Toy Story,1995,Adventure|Animation|Children|Comedy|Fantasy,5,4.0,M,30,writer,20,20
2,1,114709,862,Toy Story,1995,Adventure|Animation|Children|Comedy|Fantasy,7,4.5,M,39,academic/educator,1,1
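
Unlike LabelEncoder, a hand-written dictionary can miss a category. `map()` does not raise an error in that case; any value that is not a key of the dictionary becomes NaN. The sketch below uses toy data (the 'pilot' label and the names `s`, `demo_mapping`, `encoded`, and `unmapped` are made up for illustration) to show one way to check for such values.

```python
import pandas as pd

# Toy data for illustration only; "pilot" is deliberately missing from the mapping
s = pd.Series(["writer", "artist", "pilot"])
demo_mapping = {"artist": 2, "writer": 20}

encoded = s.map(demo_mapping)

# Labels that are not keys of the dictionary become NaN
unmapped = s[encoded.isna()].unique().tolist()
print(encoded.tolist())
print(unmapped)
```

Running a check like this after `map()` helps catch typos in the dictionary keys before the encoded column is used.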


---

### 2. Dummy Variable Creation
This method represents each category in binary form. <br> 
Each value is converted into its own binary vector, with 1 at the position of that value and 0 everywhere else.<br>

In [13]:
df['Genres'].head()

0    Adventure|Animation|Children|Comedy|Fantasy
1    Adventure|Animation|Children|Comedy|Fantasy
2    Adventure|Animation|Children|Comedy|Fantasy
3    Adventure|Animation|Children|Comedy|Fantasy
4    Adventure|Animation|Children|Comedy|Fantasy
Name: Genres, dtype: object

#### 🔖 Each row of the 'Genres' column holds multiple values, so we prepare the column as follows before creating dummy variables.

- ####  Split the 'Genres' column on the delimiter ('|').


In [14]:
df['Genres'] = df['Genres'].str.split('|')

In [15]:
df['Genres']

0         [Adventure, Animation, Children, Comedy, Fantasy]
1         [Adventure, Animation, Children, Comedy, Fantasy]
2         [Adventure, Animation, Children, Comedy, Fantasy]
3         [Adventure, Animation, Children, Comedy, Fantasy]
4         [Adventure, Animation, Children, Comedy, Fantasy]
                                ...                        
100818                 [Action, Animation, Comedy, Fantasy]
100819                         [Animation, Comedy, Fantasy]
100820                                              [Drama]
100821                                  [Action, Animation]
100822                                             [Comedy]
Name: Genres, Length: 100823, dtype: object

- #### Use the pandas explode() method to put each value on its own row.<br>

The pandas explode method takes a column whose cells hold list-like values and "explodes" it, turning each element into a separate row. 
The values of the other columns, and the index label, are repeated for every new row.
> df.explode('column_name')
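
As a minimal sketch of the split-then-explode step, the example below uses a two-row toy DataFrame (made up for illustration, not the real dataset). It also shows that the index labels repeat after `explode()`; the `exploded_reset` name is hypothetical.

```python
import pandas as pd

# Toy data for illustration only
demo = pd.DataFrame({
    "Title": ["Toy Story", "Flint"],
    "Genres": ["Adventure|Comedy", "Drama"],
})

# Turn each '|'-separated string into a list, then give each element its own row
demo["Genres"] = demo["Genres"].str.split("|")
exploded = demo.explode("Genres")
print(exploded)

# Index labels repeat after explode(); reset them if unique labels are needed
exploded_reset = exploded.reset_index(drop=True)
```

The repeated index is harmless for the steps in this section, but it matters if rows are later selected by label with `loc`.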

In [16]:
df = df.explode('Genres')

In [17]:
df.head(10)

Unnamed: 0,MovieId,ImdbId,TmdbId,Title,Year,Genres,UserId,Rating,Gender,Age,Occupation
0,1,114709,862,Toy Story,1995,Adventure,1,4.0,F,2,K-12 student
0,1,114709,862,Toy Story,1995,Animation,1,4.0,F,2,K-12 student
0,1,114709,862,Toy Story,1995,Children,1,4.0,F,2,K-12 student
0,1,114709,862,Toy Story,1995,Comedy,1,4.0,F,2,K-12 student
0,1,114709,862,Toy Story,1995,Fantasy,1,4.0,F,2,K-12 student
1,1,114709,862,Toy Story,1995,Adventure,5,4.0,M,30,writer
1,1,114709,862,Toy Story,1995,Animation,5,4.0,M,30,writer
1,1,114709,862,Toy Story,1995,Children,5,4.0,M,30,writer
1,1,114709,862,Toy Story,1995,Comedy,5,4.0,M,30,writer
1,1,114709,862,Toy Story,1995,Fantasy,5,4.0,M,30,writer


#### 2-1. Creating dummy variables with pandas get_dummies()
The pandas get_dummies() function performs one-hot encoding. 

[Basic usage]
> pd.get_dummies(df, columns=['column_1', 'column_2'])

[Options]
> - prefix: the prefix added to the names of the one-hot encoded columns. 
> - prefix_sep: the separator placed between the prefix and the original category value. The default is '_'.
> - columns: a list of the columns to one-hot encode.
> - drop_first: whether to drop the first category. <br>
  &emsp; &emsp; &emsp; &emsp; If True, no column is created for the first category. The default is False.
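
The options above can be sketched on a tiny toy DataFrame (made up for illustration; `demo`, `full`, and `reduced` are hypothetical names). Depending on the pandas version, the dummy values are displayed as 0/1 or as True/False, so the example only inspects the column names.

```python
import pandas as pd

# Toy data for illustration only
demo = pd.DataFrame({"Genres": ["Action", "Comedy", "Drama", "Comedy"]})

# prefix and prefix_sep control the new column names: <prefix><prefix_sep><category>
full = pd.get_dummies(demo, columns=["Genres"], prefix="Genre", prefix_sep="_")
print(full.columns.tolist())

# drop_first=True omits the column for the first category ("Action" here)
reduced = pd.get_dummies(demo, columns=["Genres"], prefix="Genre", drop_first=True)
print(reduced.columns.tolist())
```

With `drop_first=True`, a row whose remaining dummy columns are all 0 belongs to the dropped first category.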

- #### Let's create dummy variables from the 'Genres' column with get_dummies().

In [18]:
pd.get_dummies(df, columns=['Genres'])

Unnamed: 0,MovieId,ImdbId,TmdbId,Title,Year,UserId,Rating,Gender,Age,Occupation,...,Genres_Film-Noir,Genres_Horror,Genres_IMAX,Genres_Musical,Genres_Mystery,Genres_Romance,Genres_Sci-Fi,Genres_Thriller,Genres_War,Genres_Western
0,1,114709,862,Toy Story,1995,1,4.0,F,2,K-12 student,...,0,0,0,0,0,0,0,0,0,0
0,1,114709,862,Toy Story,1995,1,4.0,F,2,K-12 student,...,0,0,0,0,0,0,0,0,0,0
0,1,114709,862,Toy Story,1995,1,4.0,F,2,K-12 student,...,0,0,0,0,0,0,0,0,0,0
0,1,114709,862,Toy Story,1995,1,4.0,F,2,K-12 student,...,0,0,0,0,0,0,0,0,0,0
0,1,114709,862,Toy Story,1995,1,4.0,F,2,K-12 student,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
100819,193583,5914996,445030,No Game No Life: Zero,2017,184,3.5,F,33,other,...,0,0,0,0,0,0,0,0,0,0
100820,193585,6397426,479308,Flint,2017,184,3.5,F,33,other,...,0,0,0,0,0,0,0,0,0,0
100821,193587,8391976,483455,Bungo Stray Dogs: Dead Apple,2018,184,3.5,F,33,other,...,0,0,0,0,0,0,0,0,0,0
100821,193587,8391976,483455,Bungo Stray Dogs: Dead Apple,2018,184,3.5,F,33,other,...,0,0,0,0,0,0,0,0,0,0


In [19]:
pd.get_dummies(df['Genres'])

Unnamed: 0,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
100819,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
100820,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
100821,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
100821,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


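- #### Let's try the 'prefix' and 'drop_first' options.
Because prefix_sep defaults to '_', passing prefix='Genre_' produces column names with a double underscore, such as 'Genre__Action'.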

In [20]:
pd.get_dummies(df['Genres'], prefix='Genre_')

Unnamed: 0,Genre__(no genres listed),Genre__Action,Genre__Adventure,Genre__Animation,Genre__Children,Genre__Comedy,Genre__Crime,Genre__Documentary,Genre__Drama,Genre__Fantasy,Genre__Film-Noir,Genre__Horror,Genre__IMAX,Genre__Musical,Genre__Mystery,Genre__Romance,Genre__Sci-Fi,Genre__Thriller,Genre__War,Genre__Western
0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
100819,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
100820,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
100821,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
100821,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [21]:
pd.get_dummies(df['Genres'], drop_first=True)

Unnamed: 0,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
100819,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
100820,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
100821,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
100821,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


- #### Set drop_first to True for the 'Genres' column, use 'Genre' as the prefix so that the new dummy columns are named 'Genre_...', and update the DataFrame 'df' with the result.

In [22]:
df = pd.get_dummies(df, columns=['Genres'], prefix='Genre', drop_first=True)

In [23]:
df.head(5)

Unnamed: 0,MovieId,ImdbId,TmdbId,Title,Year,UserId,Rating,Gender,Age,Occupation,...,Genre_Film-Noir,Genre_Horror,Genre_IMAX,Genre_Musical,Genre_Mystery,Genre_Romance,Genre_Sci-Fi,Genre_Thriller,Genre_War,Genre_Western
0,1,114709,862,Toy Story,1995,1,4.0,F,2,K-12 student,...,0,0,0,0,0,0,0,0,0,0
0,1,114709,862,Toy Story,1995,1,4.0,F,2,K-12 student,...,0,0,0,0,0,0,0,0,0,0
0,1,114709,862,Toy Story,1995,1,4.0,F,2,K-12 student,...,0,0,0,0,0,0,0,0,0,0
0,1,114709,862,Toy Story,1995,1,4.0,F,2,K-12 student,...,0,0,0,0,0,0,0,0,0,0
0,1,114709,862,Toy Story,1995,1,4.0,F,2,K-12 student,...,0,0,0,0,0,0,0,0,0,0


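- #### The rows created by explode() on the 'Genres' column are identical apart from the new dummy columns, so group by all the other columns as keys and sum the dummy columns to collapse the duplicates into one row per rating.  
The pandas groupby() and agg() methods provide grouping and aggregation for a DataFrame. <br>
> - 'groupby()' groups the DataFrame by the values of the specified columns so that operations can be applied per group. <br>
> - 'agg()' applies one or more aggregation functions to the grouped data and returns the result for each group.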
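
The following is a minimal sketch of this collapse step on toy data (made up for illustration): one rating of 'Toy Story' that was exploded into two genre rows, plus one single-genre rating. Summing the dummy columns within each group merges the exploded rows back into one.

```python
import pandas as pd

# Toy data for illustration only: the first two rows come from one exploded rating
demo = pd.DataFrame({
    "Title": ["Toy Story", "Toy Story", "Flint"],
    "UserId": [1, 1, 184],
    "Genre_Adventure": [1, 0, 0],
    "Genre_Comedy": [0, 1, 0],
    "Genre_Drama": [0, 0, 1],
})

# Group by the non-dummy columns and sum the dummy columns within each group
collapsed = demo.groupby(["Title", "UserId"]).agg("sum").reset_index()
print(collapsed)
```

Note that `groupby()` sorts the result by the group keys by default, so the row order can differ from the input.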

In [24]:
df.columns

Index(['MovieId', 'ImdbId', 'TmdbId', 'Title', 'Year', 'UserId', 'Rating',
       'Gender', 'Age', 'Occupation', 'Genre_Action', 'Genre_Adventure',
       'Genre_Animation', 'Genre_Children', 'Genre_Comedy', 'Genre_Crime',
       'Genre_Documentary', 'Genre_Drama', 'Genre_Fantasy', 'Genre_Film-Noir',
       'Genre_Horror', 'Genre_IMAX', 'Genre_Musical', 'Genre_Mystery',
       'Genre_Romance', 'Genre_Sci-Fi', 'Genre_Thriller', 'Genre_War',
       'Genre_Western'],
      dtype='object')

In [25]:
df = df.groupby(['MovieId', 'ImdbId', 'TmdbId', 'Title', 'Year', 'UserId', 'Rating',
       'Gender', 'Age', 'Occupation']).agg('sum').reset_index()

In [26]:
df.iloc[:10, 11:]

Unnamed: 0,Genre_Adventure,Genre_Animation,Genre_Children,Genre_Comedy,Genre_Crime,Genre_Documentary,Genre_Drama,Genre_Fantasy,Genre_Film-Noir,Genre_Horror,Genre_IMAX,Genre_Musical,Genre_Mystery,Genre_Romance,Genre_Sci-Fi,Genre_Thriller,Genre_War,Genre_Western
0,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
1,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
2,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
3,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
4,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
5,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
6,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
7,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
8,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
9,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
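
As an alternative sketch (not the approach used in this notebook), pandas also offers `Series.str.get_dummies`, which splits a delimited string column and builds the 0/1 columns in one call, without the explode and groupby steps. The example uses toy data, and the names `genre_dummies` and `result` are hypothetical. Unlike the result above, it keeps a column for every genre because there is no drop_first option here.

```python
import pandas as pd

# Toy data for illustration only
demo = pd.DataFrame({
    "Title": ["Toy Story", "Flint"],
    "Genres": ["Adventure|Comedy", "Drama"],
})

# Split on '|' and build one 0/1 column per distinct genre, then add the prefix
genre_dummies = demo["Genres"].str.get_dummies(sep="|").add_prefix("Genre_")

# Attach the dummy columns to the remaining columns
result = pd.concat([demo.drop(columns="Genres"), genre_dummies], axis=1)
print(result)
```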


---

## Exercises

#### Q1. Label-encode the 'Gender' column of the DataFrame 'df' and apply the result directly to df['Gender'].
> Map F (female) to 0 and M (male) to 1.

In [27]:
# Write your code here.
# Import the library


# Create a LabelEncoder instance


# Transform the values


# Replace the column with the transformed data



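#### Q2. Using the options below, convert the 'Occupation' column of the DataFrame 'df' into dummy variables and apply the result to 'df'.
> Use 'job' as the prefix of the new column names and '_' as the separator.

📌 The column names are controlled with the 'prefix' and 'prefix_sep' options.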


In [28]:
# Write your code here.





---