## Data Description
---
[Glass Identification](https://archive.ics.uci.edu/ml/datasets/glass+identification)

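This dataset dates from 1987 and comes from forensic science, where it was used to build rules for classifying the type of a glass fragment found at a crime scene.
Each record gives the refractive index and the content of each constituent element for a glass sample of a known type.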

## Variable Description
---

| No. | Column | Description |
|:---|:---|:---|
| 1 | ID | Sample ID, 1 to 214 |
| 2 | RI | Refractive index |
| 3 | Na | Sodium |
| 4 | Mg | Magnesium |
| 5 | Al | Aluminum |
| 6 | Si | Silicon |
| 7 | K | Potassium |
| 8 | Ca | Calcium |
| 9 | Ba | Barium |
| 10 | Fe | Iron |
| 11 | Type of Glass | Glass type (target) |

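Target classes
- 1: Building window glass, float processed
- 2: Building window glass, not float processed
- 3: Vehicle window glass, float processed
- 4: Vehicle window glass, not float processed (no samples of this class appear in the data)
- 5: Containers
- 6: Tableware
- 7: Headlamps

*The float process is a method of making flat glass: molten glass is poured onto a bath of molten metal (typically tin), where it spreads into a sheet of uniform thickness.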
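
When reporting results, it can help to turn the numeric class codes into readable names. The following is a minimal sketch; the names `GLASS_TYPES` and `glass_type_name` are hypothetical and not part of the original notebook, and the labels simply restate the class list above.

```python
# Hypothetical lookup table: class code -> readable name (from the class list above).
GLASS_TYPES = {
    1: "building window, float processed",
    2: "building window, not float processed",
    3: "vehicle window, float processed",
    4: "vehicle window, not float processed",
    5: "container",
    6: "tableware",
    7: "headlamp",
}


def glass_type_name(code):
    """Return the readable name for a class code, or 'unknown' if it is not listed."""
    return GLASS_TYPES.get(int(code), "unknown")


print(glass_type_name(7))
```

A plain dictionary keeps the mapping easy to reuse, for example as tick labels in a plot.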

In [0]:
import pandas as pd

# The file has no header row, so load it with header=None.
df = pd.read_csv('https://raw.githubusercontent.com/joongyang/Machine-Learning-by-Examples/master/GlassIdentificationData.data', delimiter=',', header=None)
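
As an alternative to renaming columns afterwards, `pd.read_csv` accepts a `names` argument together with `header=None`. The sketch below assumes three rows copied from the `df.head()` output in this notebook and uses the element symbols from the variable table as column names; the short name `Type` for the target is a hypothetical choice, and the notebook itself uses `X1` to `X9` and `Y`.

```python
import io

import pandas as pd

# Three rows copied from the df.head() output shown in this notebook.
sample = io.StringIO(
    "1,1.52101,13.64,4.49,1.1,71.78,0.06,8.75,0.0,0.0,1\n"
    "2,1.51761,13.89,3.6,1.36,72.73,0.48,7.83,0.0,0.0,1\n"
    "3,1.51618,13.53,3.55,1.54,72.99,0.39,7.78,0.0,0.0,1\n"
)

# Column names follow the variable table; "Type" is a hypothetical short name.
columns = ["ID", "RI", "Na", "Mg", "Al", "Si", "K", "Ca", "Ba", "Fe", "Type"]

# header=None tells pandas the first line is data; names supplies the labels.
sample_df = pd.read_csv(sample, header=None, names=columns)
print(sample_df.dtypes)
```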

In [2]:
df.head()

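|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
| 0 | 1 | 1.52101 | 13.64 | 4.49 | 1.1 | 71.78 | 0.06 | 8.75 | 0.0 | 0.0 | 1 |
| 1 | 2 | 1.51761 | 13.89 | 3.6 | 1.36 | 72.73 | 0.48 | 7.83 | 0.0 | 0.0 | 1 |
| 2 | 3 | 1.51618 | 13.53 | 3.55 | 1.54 | 72.99 | 0.39 | 7.78 | 0.0 | 0.0 | 1 |
| 3 | 4 | 1.51766 | 13.21 | 3.69 | 1.29 | 72.61 | 0.57 | 8.22 | 0.0 | 0.0 | 1 |
| 4 | 5 | 1.51742 | 13.27 | 3.62 | 1.24 | 73.08 | 0.55 | 8.07 | 0.0 | 0.0 | 1 |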


In [3]:
df.tail()

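|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
| 209 | 210 | 1.51623 | 14.14 | 0.0 | 2.88 | 72.61 | 0.08 | 9.18 | 1.06 | 0.0 | 7 |
| 210 | 211 | 1.51685 | 14.92 | 0.0 | 1.99 | 73.06 | 0.0 | 8.4 | 1.59 | 0.0 | 7 |
| 211 | 212 | 1.52065 | 14.36 | 0.0 | 2.02 | 73.42 | 0.0 | 8.44 | 1.64 | 0.0 | 7 |
| 212 | 213 | 1.51651 | 14.38 | 0.0 | 1.94 | 73.61 | 0.0 | 8.48 | 1.57 | 0.0 | 7 |
| 213 | 214 | 1.51711 | 14.23 | 0.0 | 2.08 | 73.36 | 0.0 | 8.62 | 1.67 | 0.0 | 7 |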


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 214 entries, 0 to 213
Data columns (total 11 columns):
0     214 non-null int64
1     214 non-null float64
2     214 non-null float64
3     214 non-null float64
4     214 non-null float64
5     214 non-null float64
6     214 non-null float64
7     214 non-null float64
8     214 non-null float64
9     214 non-null float64
10    214 non-null int64
dtypes: float64(9), int64(2)
memory usage: 18.5 KB


In [5]:
# Assign column names to make the analysis easier.
df.columns = ['ID','X1','X2','X3','X4','X5','X6','X7','X8','X9','Y']

# Declare the target as a categorical variable.
df['Y'] = pd.Categorical(df['Y'])

df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 214 entries, 0 to 213
Data columns (total 11 columns):
ID    214 non-null int64
X1    214 non-null float64
X2    214 non-null float64
X3    214 non-null float64
X4    214 non-null float64
X5    214 non-null float64
X6    214 non-null float64
X7    214 non-null float64
X8    214 non-null float64
X9    214 non-null float64
Y     214 non-null category
dtypes: category(1), float64(9), int64(1)
memory usage: 17.3 KB


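|   | ID | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | Y |
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
| 0 | 1 | 1.52101 | 13.64 | 4.49 | 1.1 | 71.78 | 0.06 | 8.75 | 0.0 | 0.0 | 1 |
| 1 | 2 | 1.51761 | 13.89 | 3.6 | 1.36 | 72.73 | 0.48 | 7.83 | 0.0 | 0.0 | 1 |
| 2 | 3 | 1.51618 | 13.53 | 3.55 | 1.54 | 72.99 | 0.39 | 7.78 | 0.0 | 0.0 | 1 |
| 3 | 4 | 1.51766 | 13.21 | 3.69 | 1.29 | 72.61 | 0.57 | 8.22 | 0.0 | 0.0 | 1 |
| 4 | 5 | 1.51742 | 13.27 | 3.62 | 1.24 | 73.08 | 0.55 | 8.07 | 0.0 | 0.0 | 1 |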


In [0]:
# Drop the serial number column.
df = df.drop(columns=['ID'])
df.head()

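|   | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | Y |
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
| 0 | 1.52101 | 13.64 | 4.49 | 1.1 | 71.78 | 0.06 | 8.75 | 0.0 | 0.0 | 1 |
| 1 | 1.51761 | 13.89 | 3.6 | 1.36 | 72.73 | 0.48 | 7.83 | 0.0 | 0.0 | 1 |
| 2 | 1.51618 | 13.53 | 3.55 | 1.54 | 72.99 | 0.39 | 7.78 | 0.0 | 0.0 | 1 |
| 3 | 1.51766 | 13.21 | 3.69 | 1.29 | 72.61 | 0.57 | 8.22 | 0.0 | 0.0 | 1 |
| 4 | 1.51742 | 13.27 | 3.62 | 1.24 | 73.08 | 0.55 | 8.07 | 0.0 | 0.0 | 1 |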


In [0]:
df.describe()

|   | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 |
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
| count | 214.0 | 214.0 | 214.0 | 214.0 | 214.0 | 214.0 | 214.0 | 214.0 | 214.0 |
| mean | 1.518365 | 13.40785 | 2.684533 | 1.444907 | 72.650935 | 0.497056 | 8.956963 | 0.175047 | 0.057009 |
| std | 0.003037 | 0.816604 | 1.442408 | 0.49927 | 0.774546 | 0.652192 | 1.423153 | 0.497219 | 0.097439 |
| min | 1.51115 | 10.73 | 0.0 | 0.29 | 69.81 | 0.0 | 5.43 | 0.0 | 0.0 |
| 25% | 1.516523 | 12.9075 | 2.115 | 1.19 | 72.28 | 0.1225 | 8.24 | 0.0 | 0.0 |
| 50% | 1.51768 | 13.3 | 3.48 | 1.36 | 72.79 | 0.555 | 8.6 | 0.0 | 0.0 |
| 75% | 1.519157 | 13.825 | 3.6 | 1.63 | 73.0875 | 0.61 | 9.1725 | 0.0 | 0.1 |
| max | 1.53393 | 17.38 | 4.49 | 3.5 | 75.41 | 6.21 | 16.19 | 3.15 | 0.51 |


In [0]:
# Separate the features from the target.
X = df.iloc[:,:-1]
Y = df['Y']

In [0]:
# Pass the target to the stratify parameter so that the training
# and test sets contain each class in the same proportion
# as the full dataset.
from sklearn.model_selection import train_test_split
X_tr, X_ts, Y_tr, Y_ts = train_test_split(X, Y, test_size=0.4, stratify=Y, random_state=201911)
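
The effect of `stratify` can be checked by comparing class proportions before and after the split. The sketch below uses synthetic, imbalanced labels rather than the glass data; the names `y_demo`, `X_demo`, and `proportions` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 70 of class 1, 20 of class 2, 10 of class 3.
y_demo = np.array([1] * 70 + [2] * 20 + [3] * 10)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature

X_tr_demo, X_ts_demo, y_tr_demo, y_ts_demo = train_test_split(
    X_demo, y_demo, test_size=0.4, stratify=y_demo, random_state=0
)


def proportions(labels):
    """Return {class: share of samples}, rounded to two decimals."""
    values, counts = np.unique(labels, return_counts=True)
    shares = (counts / counts.sum()).round(2)
    return dict(zip(values.tolist(), shares.tolist()))


print("all  :", proportions(y_demo))
print("train:", proportions(y_tr_demo))
print("test :", proportions(y_ts_demo))
```

Stratification matters most for small classes: without it, a rare class can end up almost entirely in one of the two subsets.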

Train a random forest model in the simplest way, without normalization or cross-validation.

In [9]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()  # random_state is not set, so results vary between runs

clf.fit(X_tr, Y_tr)



RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [10]:
clf.score(X_ts, Y_ts)  # score() returns the mean accuracy by default.

0.9069767441860465
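
The classifier above was created with `random_state=None`, so each run grows a different forest and the accuracy changes from run to run. The following is a minimal sketch of fixing the seed to get repeatable scores; it assumes synthetic data from `make_classification` rather than the glass data, and `fit_and_score` is a hypothetical helper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 200 samples, 9 features, 3 classes.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=9, n_informative=5, n_classes=3, random_state=0
)
X_tr_demo, X_ts_demo, y_tr_demo, y_ts_demo = train_test_split(
    X_demo, y_demo, test_size=0.4, stratify=y_demo, random_state=0
)


def fit_and_score(seed):
    """Train a small forest with a fixed seed and return its test accuracy."""
    model = RandomForestClassifier(n_estimators=10, random_state=seed)
    model.fit(X_tr_demo, y_tr_demo)
    return model.score(X_ts_demo, y_ts_demo)


score_a = fit_and_score(42)
score_b = fit_and_score(42)
print(score_a, score_b, score_a == score_b)
```

With the same seed and the same data, both fits produce the same forest, so the two scores are identical.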

In [0]:
# Predict (classify) the test data with the trained model.
pred = clf.predict(X_ts)

In [0]:
# Evaluate the performance of the trained classifier.
from sklearn.metrics import confusion_matrix
confusion = confusion_matrix(Y_ts, pred)
print("Confusion matrix:\n{}".format(confusion))

Confusion matrix:
[[23  2  1  0  0  0]
 [14 16  1  1  0  0]
 [ 1  2  1  0  0  0]
 [ 0  0  0  5  0  1]
 [ 0  2  0  0  3  0]
 [ 2  2  0  0  0  9]]
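
Accuracy can be read directly off the confusion matrix: the diagonal holds the correctly classified samples, and the sum of all entries is the test set size. The sketch below copies the matrix printed above and works through that arithmetic. The result, 57/86, is about 0.663, which does not match the 0.907 returned by `clf.score` earlier; one plausible explanation is that the cells were run at different times with a differently seeded forest.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true, columns: predicted).
confusion = np.array([
    [23,  2, 1, 0, 0, 0],
    [14, 16, 1, 1, 0, 0],
    [ 1,  2, 1, 0, 0, 0],
    [ 0,  0, 0, 5, 0, 1],
    [ 0,  2, 0, 0, 3, 0],
    [ 2,  2, 0, 0, 0, 9],
])

correct = int(np.trace(confusion))  # sum of the diagonal
total = int(confusion.sum())        # number of test samples
accuracy = correct / total

print(correct, total, round(accuracy, 4))  # 57 86 0.6628
```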


In [0]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Class 4 does not occur in the data, so the matrix has six rows and columns.
labels = ['1', '2', '3', '5', '6', '7']
ticks = np.arange(0.5, len(labels), 1)

plt.figure(figsize=(8, 8))
sns.heatmap(confusion, annot=True, fmt='d', cmap='YlGnBu')
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.xticks(ticks, labels, size=10)
plt.yticks(ticks, labels, size=10)

[Figure: heatmap of the confusion matrix, with "Predicted label" on the x-axis and "True label" on the y-axis]

In [0]:
from sklearn.metrics import classification_report

print(f'Performance \n {classification_report(Y_ts, pred)}')

Performance
               precision    recall  f1-score   support

           1       0.57      0.88      0.70        26
           2       0.67      0.50      0.57        32
           3       0.33      0.25      0.29         4
           5       0.83      0.83      0.83         6
           6       1.00      0.60      0.75         5
           7       0.90      0.69      0.78        13

   micro avg       0.66      0.66      0.66        86
   macro avg       0.72      0.63      0.65        86
weighted avg       0.69      0.66      0.66        86
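
The per-class figures in this report follow from the confusion matrix: precision divides each diagonal entry by its column sum, recall divides it by its row sum, and support is the row sum. The sketch below recomputes them with NumPy from the matrix printed earlier in this notebook; the variable names are hypothetical.

```python
import numpy as np

# Confusion matrix from this notebook (rows: true class, columns: predicted class).
confusion = np.array([
    [23,  2, 1, 0, 0, 0],
    [14, 16, 1, 1, 0, 0],
    [ 1,  2, 1, 0, 0, 0],
    [ 0,  0, 0, 5, 0, 1],
    [ 0,  2, 0, 0, 3, 0],
    [ 2,  2, 0, 0, 0, 9],
])
class_labels = [1, 2, 3, 5, 6, 7]  # class 4 is absent from the data

diag = np.diag(confusion)
support = confusion.sum(axis=1)            # true samples per class
precision = diag / confusion.sum(axis=0)   # correct / predicted as that class
recall = diag / support                    # correct / actually that class
f1 = 2 * precision * recall / (precision + recall)

for label, p, r, f, s in zip(class_labels, precision, recall, f1, support):
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f:.2f} support={s}")

# Macro average: unweighted mean; weighted average: mean weighted by support.
macro_precision = precision.mean()
weighted_precision = np.average(precision, weights=support)
print(f"macro={macro_precision:.2f} weighted={weighted_precision:.2f}")
```

The gap between the macro and weighted averages reflects class imbalance: classes 1 and 2 hold 58 of the 86 test samples and dominate the weighted figure.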

