https://ysyblog.tistory.com/71?category=1144778

## LabelEncoder
* sklearn.preprocessing.LabelEncoder
* Attributes
    - classes_ : ndarray of shape (n_classes,) Holds the label for each class.

* Methods
    - fit(y): Fit label encoder.
    - fit_transform(y) : Fit label encoder and return encoded labels.
    - get_params([deep]) : Get parameters for this estimator.
    - inverse_transform(y) : Transform labels back to original encoding.
    - set_params(**params): Set the parameters of this estimator.
    - transform(y) : Transform labels to normalized encoding.

### Textbook example

In [1]:
from sklearn.preprocessing import LabelEncoder

items = ['TV', '냉장고', '전자렌지', '컴퓨터', '선풍기', '선풍기', '믹서', '믹서']

# Create a LabelEncoder object, then label-encode with fit() and transform()
encoder = LabelEncoder()
encoder.fit(items)
labels = encoder.transform(items)
print('Encoded values: ', labels)

Encoded values:  [0 1 4 5 3 3 2 2]


In [4]:
print('Encoded classes: ', encoder.classes_)
print('Decoded original values: ', encoder.inverse_transform([4,5,2,0,1,1,3,3]))

Encoded classes:  ['TV' '냉장고' '믹서' '선풍기' '전자렌지' '컴퓨터']
Decoded original values:  ['전자렌지' '컴퓨터' '믹서' 'TV' '냉장고' '냉장고' '선풍기' '선풍기']
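
As a small sketch of how `classes_` relates to the integer codes: each label's position in the sorted `classes_` array is its code, so a lookup table can be built with `enumerate`. The names `label_to_code` and `unseen_ok`, and the label '세탁기', are illustrative and not part of the textbook example.

```python
from sklearn.preprocessing import LabelEncoder

items = ['TV', '냉장고', '전자렌지', '컴퓨터', '선풍기', '선풍기', '믹서', '믹서']

encoder = LabelEncoder()
encoder.fit(items)

# classes_ is sorted, and each label's index in classes_ is its integer code.
label_to_code = {label: code for code, label in enumerate(encoder.classes_)}
print(label_to_code)

# transform() only accepts labels seen during fit(); anything else raises ValueError.
try:
    encoder.transform(['세탁기'])  # hypothetical label that was never fitted
    unseen_ok = True
except ValueError:
    unseen_ok = False
print('unseen label accepted:', unseen_ok)
```

Because the codes follow sort order, they carry no meaning of their own; 컴퓨터 being 5 and TV being 0 does not imply any ranking.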


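### Preprocessing the Titanic CSV file
- After inspecting the data:
- Keep only Survived, Sex, Age, and Embarked; drop every other column
- (Name, Parch, Cabin, and the rest are not needed here, so they are dropped)
- Use Sex, Age, and Embarked as the features,
- and Survived as the label for supervised learning.
- The data needs preprocessing: it contains missing values that must be filled, and string columns that must be encoded.
- Fill missing Age values with the mean and missing Embarked values with S, encode with LabelEncoder, and display the result as a DataFrame.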

> 1. Load the CSV file with pandas

In [42]:
import pandas as pd 
#titanic = pd.read_csv("./csv/train.csv")
titanic = pd.read_csv("/Users/a123123/apps/ml/datasets/train.csv")
titanic_2 = titanic.copy()
titanic_2.head(2)


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C


> 2. Keep only the required columns and drop the rest

In [43]:
titanic_2.drop(["PassengerId", 'Pclass', 'Name', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin'], axis=1, inplace=True)
# Alternatively, select the columns by position with iloc.
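
The iloc alternative mentioned above can be sketched as follows. This is a minimal illustration on a hypothetical three-row frame (`sample`) that only borrows the Titanic column names; `keep`, `by_name`, and `by_position` are names introduced here.

```python
import pandas as pd

# Tiny stand-in for the Titanic data (hypothetical values, same column names).
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
    "Embarked": ["S", "C", "S"],
})

keep = ["Survived", "Sex", "Age", "Embarked"]

# Option A: select the columns to keep by name (returns a new DataFrame).
by_name = sample[keep].copy()

# Option B: select by position with iloc; positions depend on the column order.
positions = [sample.columns.get_loc(col) for col in keep]
by_position = sample.iloc[:, positions].copy()

print(by_name.equals(by_position))
```

Listing the columns to keep tends to be less fragile than listing the columns to drop, since a misspelled name fails immediately instead of silently leaving an extra column behind.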

In [44]:
titanic_2.head(2)

Unnamed: 0,Survived,Sex,Age,Embarked
0,0,male,22.0,S
1,1,female,38.0,C


> 3. Handle missing values with fillna() (mean)


In [45]:
titanic_2.isna().sum()
# or... titanic_2.info()

Survived      0
Sex           0
Age         177
Embarked      2
dtype: int64

In [46]:
titanic_2["Age"] = titanic_2["Age"].fillna(titanic_2["Age"].mean())

In [47]:
titanic_2["Embarked"] = titanic_2["Embarked"].fillna("S")

In [48]:
titanic_2.isna().sum()

Survived    0
Sex         0
Age         0
Embarked    0
dtype: int64
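
The fill values above are the column mean for Age and a hard-coded "S" for Embarked. As a hedged sketch, the categorical fill value can also be derived from the data with `mode()`. The frame below is a hypothetical miniature, not the real dataset.

```python
import pandas as pd

# Hypothetical miniature frame with the same kinds of gaps as the Titanic data.
sample = pd.DataFrame({
    "Age": [22.0, 38.0, None, 30.0],
    "Embarked": ["S", "C", None, "S"],
})

# Numeric column: mean() skips NaN, so it is computed from the 3 known ages.
age_mean = sample["Age"].mean()
sample["Age"] = sample["Age"].fillna(age_mean)

# Categorical column: mode() returns the most frequent value(s); take the first.
most_common_port = sample["Embarked"].mode()[0]
sample["Embarked"] = sample["Embarked"].fillna(most_common_port)

print(sample)
print(sample.isna().sum())
```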

> 4. Encode with LabelEncoder

In [54]:
titanic_2.info()
# int and float columns support arithmetic, but object columns do not, so they must be encoded.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Sex       891 non-null    object 
 2   Age       891 non-null    float64
 3   Embarked  891 non-null    object 
dtypes: float64(1), int64(1), object(2)
memory usage: 28.0+ KB


In [63]:
# Encode Sex and Embarked

from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
titanic_2["Sex"] = encoder.fit_transform(titanic_2["Sex"])
titanic_2["Embarked"] = encoder.fit_transform(titanic_2["Embarked"])

In [64]:
titanic_2

Unnamed: 0,Survived,Sex,Age,Embarked
0,0,1,22.000000,2
1,1,0,38.000000,0
2,1,0,26.000000,2
3,1,0,35.000000,2
4,0,1,35.000000,2
...,...,...,...,...
886,0,1,27.000000,2
887,1,0,19.000000,2
888,0,0,29.699118,2
889,1,1,26.000000,0
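
One thing to keep in mind with the cell above: calling `fit_transform` again refits the encoder, so a single shared encoder ends up holding only the classes of the last column it saw. A possible sketch that keeps one encoder per column, so every column can still be decoded later, is shown below. The `encoders` dict and the miniature `sample` frame are assumptions made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical miniature frame standing in for titanic_2.
sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
})

# Keep one fitted encoder per column so each mapping stays available.
encoders = {}
for col in ["Sex", "Embarked"]:
    encoders[col] = LabelEncoder()
    sample[col] = encoders[col].fit_transform(sample[col])

print(sample)
print(encoders["Sex"].classes_)        # ['female' 'male']
print(encoders["Embarked"].classes_)   # ['C' 'Q' 'S']

# Decode a column back to its original strings.
decoded_sex = encoders["Sex"].inverse_transform(sample["Sex"])
print(decoded_sex)
```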


## OneHotEncoder
* sklearn.preprocessing.OneHotEncoder
* Methods
    - fit(X[, y]) : Fit OneHotEncoder to X.
    - fit_transform(X[, y]) : Fit OneHotEncoder to X, then transform X.
    - get_feature_names([input_features]) : DEPRECATED: get_feature_names is deprecated in 1.0 and will be removed in 1.2.
    - get_feature_names_out([input_features]) : Get output feature names for transformation.
    - get_params([deep]) : Get parameters for this estimator.
    - inverse_transform(X) : Convert the data back to the original representation.
    - set_params(**params) : Set the parameters of this estimator.
    - transform(X) : Transform X using one-hot encoding.

### Textbook example

In [79]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import numpy as np

items = ['TV', '냉장고', '전자렌지', '컴퓨터', '선풍기', '선풍기', '믹서', '믹서']

# First convert the strings to integers with LabelEncoder
# (not always required; it seems to be done here to get an array)
encoder = LabelEncoder()
encoder.fit(items)
labels = encoder.transform(items)

# Reshape to 2-D
labels = labels.reshape(-1, 1)
# Alternative without LabelEncoder:
# items = np.array(items).reshape(-1, 1)
# data = one_encoder.fit_transform(items)
labels

array([[0],
       [1],
       [4],
       [5],
       [3],
       [3],
       [2],
       [2]])

In [80]:
# Apply one-hot encoding
oh_encoder = OneHotEncoder()

oh_encoder.fit(labels)
oh_labels = oh_encoder.transform(labels)

print('One-hot encoded data')
print(oh_labels.toarray())
print('One-hot encoded data shape')
print(oh_labels.shape)

One-hot encoded data
[[1. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 0. 0.]
 [0. 0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0.]]
One-hot encoded data shape
(8, 6)
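
The LabelEncoder step is not strictly needed, since OneHotEncoder can encode string categories directly. The sketch below assumes only that the input is reshaped to 2-D first; `items_2d`, `oh_matrix`, and `restored` are illustrative names.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

items = ['TV', '냉장고', '전자렌지', '컴퓨터', '선풍기', '선풍기', '믹서', '믹서']

# OneHotEncoder expects a 2-D input of shape (n_samples, n_features).
items_2d = np.array(items).reshape(-1, 1)

oh_encoder = OneHotEncoder()
oh_matrix = oh_encoder.fit_transform(items_2d)  # sparse matrix by default

print(oh_encoder.categories_[0])   # sorted categories, one per output column
print(oh_matrix.toarray())
print(oh_matrix.shape)

# inverse_transform maps one-hot rows back to the original strings.
restored = oh_encoder.inverse_transform(oh_matrix)
print(restored.ravel())
```

The column order follows the sorted categories, so the resulting matrix matches the one produced via LabelEncoder.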


In [83]:
from sklearn.preprocessing import OneHotEncoder
df = pd.read_csv("./csv/train.csv")
one_encoder = OneHotEncoder()
# Select with double brackets so the input is 2-D: (n_samples, 1)
trans_data = one_encoder.fit_transform(df[['Sex']])
# trans_data.toarray()

### Using pandas get_dummies (returns a DataFrame)
- https://devuna.tistory.com/67
- https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html

- pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)
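
A brief sketch of what a few of these parameters do, using a hypothetical four-row `ports` frame. Passing `dtype=int` is a choice made here so the output is 0/1 regardless of pandas version, since newer versions return booleans by default.

```python
import pandas as pd

# Hypothetical miniature column standing in for Embarked.
ports = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# dtype=int forces 0/1 output (newer pandas versions default to booleans).
full = pd.get_dummies(ports, dtype=int)

# drop_first=True drops the first category (C), leaving k-1 columns.
reduced = pd.get_dummies(ports, drop_first=True, dtype=int)

# prefix / prefix_sep control how the new column names are built.
renamed = pd.get_dummies(ports, prefix="port", prefix_sep=".", dtype=int)

print(full.columns.tolist())
print(reduced.columns.tolist())
print(renamed.columns.tolist())
```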

In [88]:
import pandas as pd 

df = pd.read_csv("./csv/train.csv")
df.drop(["PassengerId", 'Pclass', 'Name', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin'], axis=1, inplace=True)
df["Age"] = df["Age"].fillna(df["Age"].mean())
df["Embarked"] = df["Embarked"].fillna("S")


In [91]:
# Method 1
data = df.loc[:, ["Embarked"]]  # select only the Embarked column
data.head()


Unnamed: 0,Embarked
0,S
1,C
2,S
3,S
4,S


In [92]:
dummy_em = pd.get_dummies(data)
dummy_em

Unnamed: 0,Embarked_C,Embarked_Q,Embarked_S
0,0,0,1
1,1,0,0
2,0,0,1
3,0,0,1
4,0,0,1
...,...,...,...
886,0,0,1
887,0,0,1
888,0,0,1
889,1,0,0


In [95]:
# Method 2
dummy_data = pd.get_dummies(df[['Sex', 'Embarked']])
dummy_data

Unnamed: 0,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S
0,0,1,0,0,1
1,1,0,1,0,0
2,1,0,0,0,1
3,1,0,0,0,1
4,0,1,0,0,1
...,...,...,...,...,...
886,0,1,0,0,1
887,1,0,0,0,1
888,1,0,0,0,1
889,0,1,1,0,0


In [97]:
# Concatenate the dummy columns onto the original data
dumy_df = pd.concat([df, dummy_data], axis=1)
dumy_df

Unnamed: 0,Survived,Sex,Age,Embarked,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S
0,0,male,22.000000,S,0,1,0,0,1
1,1,female,38.000000,C,1,0,1,0,0
2,1,female,26.000000,S,1,0,0,0,1
3,1,female,35.000000,S,1,0,0,0,1
4,0,male,35.000000,S,0,1,0,0,1
...,...,...,...,...,...,...,...,...,...
886,0,male,27.000000,S,0,1,0,0,1
887,1,female,19.000000,S,1,0,0,0,1
888,0,female,29.699118,S,1,0,0,0,1
889,1,male,26.000000,C,0,1,1,0,0


In [98]:
# Sex and Embarked have served their purpose, so drop them
dumy_df.drop(columns=['Sex', 'Embarked'], inplace=True)
dumy_df

Unnamed: 0,Survived,Age,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S
0,0,22.000000,0,1,0,0,1
1,1,38.000000,1,0,1,0,0
2,1,26.000000,1,0,0,0,1
3,1,35.000000,1,0,0,0,1
4,0,35.000000,0,1,0,0,1
...,...,...,...,...,...,...,...
886,0,27.000000,0,1,0,0,1
887,1,19.000000,1,0,0,0,1
888,0,29.699118,1,0,0,0,1
889,1,26.000000,0,1,1,0,0
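
The encode, concat, and drop steps can also be collapsed into one call by passing `columns=` to `get_dummies`. The sketch below uses a hypothetical three-row frame; `dtype=int` is again an assumption to keep the output as 0/1.

```python
import pandas as pd

# Hypothetical miniature frame with the same columns as df above.
sample = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
    "Embarked": ["S", "C", "S"],
})

# columns= encodes the listed columns and removes the originals in one step.
encoded = pd.get_dummies(sample, columns=["Sex", "Embarked"], dtype=int)
print(encoded)
```

Only categories that actually occur produce a column, which is why this small sample has no `Embarked_Q`.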


In [None]:
# Next, normalize (i.e., scale) the Age column from above so its values fall within a fixed range.
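
The notes stop before the scaling step, so the following is only one possible sketch of it, using scikit-learn's `MinMaxScaler` and `StandardScaler` on a handful of hypothetical ages. Which scaler the author intended is not stated.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical ages standing in for the Age column.
sample = pd.DataFrame({"Age": [22.0, 38.0, 26.0, 35.0, 54.0]})

# Min-max scaling maps the column onto the range [0, 1].
minmax = MinMaxScaler()
sample["Age_minmax"] = minmax.fit_transform(sample[["Age"]]).ravel()

# Standardization rescales to zero mean and unit variance.
standard = StandardScaler()
sample["Age_standard"] = standard.fit_transform(sample[["Age"]]).ravel()

print(sample)
```

Like the encoders, scalers expect a 2-D input, hence the double brackets around `"Age"`.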