### Steps to Analyze a Dataset: Using the Iris Data

In [1]:
from sklearn.datasets import load_iris

In [2]:
# load the data
data = load_iris()

In [3]:
# check the available keys
data.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

Problem to solve with iris_data: multi-class classification
* A classification model needs to be built

In [4]:
# inspect the dataset

# feature and label arrays, with their shapes
x_data = data.data
x_data_shape = data.data.shape
y_data = data.target
y_data_shape = data.target.shape
# feature and label names
x_data_columns = data.feature_names
y_data_columns = data.target_names
filename = data.filename

print(f"x_data_shape : {x_data_shape}")
print(f"y_data_shape : {y_data_shape}")
print(f"x_data_columns : {x_data_columns}")
print(f"x_data_columns.shape : {len(x_data_columns)}")
print(f"y_data_columns : {y_data_columns}")
print(f"y_data_columns.shape : {len(y_data_columns)}")

x_data_shape : (150, 4)
y_data_shape : (150,)
x_data_columns : ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
x_data_columns.shape : 4
y_data_columns : ['setosa' 'versicolor' 'virginica']
y_data_columns.shape : 3


Converting the Iris data to a DataFrame
* Makes the data easier to explore and manipulate
* Allows pre-processing with a wide range of data analysis techniques

In [5]:
import pandas as pd

x_iris_data = pd.DataFrame(x_data, columns=x_data_columns)
y_iris_data = pd.DataFrame(y_data, columns=['label'])
iris_data = pd.concat([x_iris_data, y_iris_data], axis=1)

iris_data.sample(5)

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | label |
|---|---|---|---|---|---|
| 90 | 5.5 | 2.6 | 4.4 | 1.2 | 1 |
| 19 | 5.1 | 3.8 | 1.5 | 0.3 | 0 |
| 108 | 6.7 | 2.5 | 5.8 | 1.8 | 2 |
| 89 | 5.5 | 2.5 | 4.0 | 1.3 | 1 |
| 74 | 6.4 | 2.9 | 4.3 | 1.3 | 1 |
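As an alternative to building the DataFrame by hand, the `frame` key listed in `data.keys()` can be populated directly. The sketch below assumes a scikit-learn version that supports `as_frame=True`; the names `iris` and `iris_frame` are introduced here for illustration only, and the rename to `label` simply mirrors the column name used in this notebook.

```python
from sklearn.datasets import load_iris

# as_frame=True asks scikit-learn to return pandas objects instead of NumPy arrays
iris = load_iris(as_frame=True)

# iris.frame holds the four feature columns plus a 'target' column;
# rename it to 'label' to match the column name used in this notebook
iris_frame = iris.frame.rename(columns={"target": "label"})

print(iris_frame.shape)
print(iris_frame.columns.tolist())
```

This should give the same 150 rows and 5 columns as the manual `pd.concat` approach, without handling the feature and label arrays separately.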


In [6]:
iris_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
 4   label              150 non-null    int32  
dtypes: float64(4), int32(1)
memory usage: 5.4 KB


In [7]:
iris_data.label.value_counts()

label
0    50
1    50
2    50
Name: count, dtype: int64

In [8]:
iris_data.nunique()

# the features are numerical data, not categorical data
# the label is categorical data

sepal length (cm)    35
sepal width (cm)     23
petal length (cm)    43
petal width (cm)     22
label                 3
dtype: int64


### Preprocess the Data to Build a Model

1) Feature Analysis
* Identify the categorical and numerical features: separate them so that each group can be encoded appropriately
* Encode the separated categorical features: One-Hot-Encoding _vs_ Label_Encoding
* If Label_Encoding is used, scaling is also required
* Scale the numerical features: puts every feature on a comparable scale (for example, the same standard deviation)
* Concatenate the categorical and numerical features

2) Label Analysis
* Encode the label: One-Hot-Encoding _vs_ Label_Encoding
* Label_Encoding is used when the order of the label values matters; otherwise One_Hot_Encoding is generally used
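The difference between the two encodings is easier to see on a small example. This is a minimal sketch, not the notebook's own code: the `species` array is a hypothetical toy input, and `LabelEncoder` stands in for Label_Encoding while `LabelBinarizer` stands in for One-Hot-Encoding.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, LabelBinarizer

# Hypothetical toy labels, used only for illustration
species = np.array(["setosa", "virginica", "versicolor", "setosa"])

# Label encoding: one integer per class (classes are sorted alphabetically)
label_encoded = LabelEncoder().fit_transform(species)

# One-hot encoding: one binary column per class, in the same sorted order
one_hot = LabelBinarizer().fit_transform(species)

print(label_encoded)  # [0 2 1 0]
print(one_hot)        # rows: [1 0 0], [0 0 1], [0 1 0], [1 0 0]
```

The integer codes imply an ordering (0 < 1 < 2) that a model may pick up on, which is why label encoding suits ordered labels, while one-hot encoding treats every class as equally distant from the others.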

In [9]:
# Normalize the numerical features: rescale them to the range [-1, 1]
# To do so, first check the summary statistics of each feature
iris_data.describe()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | label |
|---|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.057333 | 3.758 | 1.199333 | 1.0 |
| std | 0.828066 | 0.435866 | 1.765298 | 0.762238 | 0.819232 |
| min | 4.3 | 2.0 | 1.0 | 0.1 | 0.0 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 | 0.0 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 | 1.0 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 | 2.0 |
| max | 7.9 | 4.4 | 6.9 | 2.5 | 2.0 |
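The `min` and `max` rows are all that min-max scaling needs. Rescaling to a target range [a, b] uses x' = a + (x - min) * (b - a) / (max - min); for a sepal length of 5.1 with min 4.3 and max 7.9, this gives -1 + 0.8 * 2 / 3.6 ≈ -0.5556. The sketch below applies the formula by hand and compares it with `MinMaxScaler`; the variable names (`x_manual`, `x_sklearn`, `a`, `b`) are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler

x = load_iris().data

# Min-max scaling to [a, b]: x' = a + (x - min) * (b - a) / (max - min)
a, b = -1.0, 1.0
x_min, x_max = x.min(axis=0), x.max(axis=0)   # per-feature minimum and maximum
x_manual = a + (x - x_min) * (b - a) / (x_max - x_min)

# The same transformation through scikit-learn
x_sklearn = MinMaxScaler(feature_range=(a, b)).fit_transform(x)

print(np.allclose(x_manual, x_sklearn))
print(x_manual[0])
```

After scaling, each feature's smallest value maps to -1 and its largest to 1, so features measured on different ranges become directly comparable.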


In [12]:
# Scale the feature data with min-max scaling
from sklearn.preprocessing import MinMaxScaler, LabelBinarizer

# x_data scaling: rescale each feature to [-1, 1]
scaler = MinMaxScaler(feature_range=(-1, 1))
x_iris_data_scaled = pd.DataFrame(scaler.fit_transform(x_iris_data), columns=x_data_columns)

# y_data encoding: one-hot encode the labels (an encoding step, not scaling)
le = LabelBinarizer()
y_iris_data_scaled = pd.DataFrame(le.fit_transform(y_iris_data), columns=y_data_columns)

# combine the scaled features with the one-hot encoded labels
iris_data_scaled = pd.concat([x_iris_data_scaled, y_iris_data_scaled], axis=1)

print(iris_data_scaled)
print(x_iris_data_scaled)
print(y_iris_data_scaled)

     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0            -0.555556          0.250000          -0.864407         -0.916667   
1            -0.666667         -0.166667          -0.864407         -0.916667   
2            -0.777778          0.000000          -0.898305         -0.916667   
3            -0.833333         -0.083333          -0.830508         -0.916667   
4            -0.611111          0.333333          -0.864407         -0.916667   
..                 ...               ...                ...               ...   
145           0.333333         -0.166667           0.423729          0.833333   
146           0.111111         -0.583333           0.355932          0.500000   
147           0.222222         -0.166667           0.423729          0.583333   
148           0.055556          0.166667           0.491525          0.833333   
149          -0.111111         -0.166667           0.389831          0.416667   

     setosa  versicolor  vi

Splitting the Data for Training

In [13]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    x_iris_data_scaled, y_iris_data_scaled, test_size=0.2, random_state=42)

print(f"x_train.shape : {x_train.shape}")
print(f"y_train.shape : {y_train.shape}")
print(f"x_test.shape : {x_test.shape}")
print(f"y_test.shape : {y_test.shape}")

x_train.shape : (120, 4)
y_train.shape : (120, 3)
x_test.shape : (30, 4)
y_test.shape : (30, 3)
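One point worth considering: the scaler above is fitted on all 150 rows before the split, so the test rows influence the learned min and max. A common alternative is to split first and fit the scaler on the training rows only. The sketch below illustrates that order and then fits a classifier; `LogisticRegression`, the `stratify=y` argument, and all variable names are assumptions made for this example, not choices taken from the notebook.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

x, y = load_iris(return_X_y=True)

# Split before scaling so the test set does not influence the scaler;
# stratify=y keeps the three classes balanced in both subsets
x_tr, x_te, y_tr, y_te = train_test_split(
    x, y, test_size=0.2, random_state=42, stratify=y)

scaler = MinMaxScaler(feature_range=(-1, 1))
x_tr_scaled = scaler.fit_transform(x_tr)  # learn min/max from the training rows only
x_te_scaled = scaler.transform(x_te)      # reuse the training min/max on the test rows

# LogisticRegression is an assumed stand-in model; it accepts integer labels directly
model = LogisticRegression(max_iter=1000)
model.fit(x_tr_scaled, y_tr)

print(f"test accuracy : {model.score(x_te_scaled, y_te):.3f}")
```

With this order, test values may fall slightly outside [-1, 1], which is expected because the scaler has never seen them.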
