# Random Forest

# Top 10 algorithms
- Supervised learning
    - Linear regression
    - Logistic regression
    - k-NN (k-nearest neighbors)
    - Naive Bayes
    - Decision tree
    - Random forest
    - XGBoost
    - LightGBM
- Unsupervised learning
    - k-means
    - PCA
- Selection criteria
    - Versatility
    - Speed
    - Predictive power
    - Hyperparameter tuning
    - Visualization
    - Interpretability

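# Essential machine learning libraries
- NumPy: Numerical Python
    - The standard Python library for numerical computation
    - Essential for most scientific computing that involves data structures, algorithms, and numerical data
    - ndarray object

- Pandas: high-level data structures designed to make working with structured or tabular data fast, easy, and expressive
    - The standard data-processing library in data science
    - The de facto standard for data handling
    - Series and DataFrame are its main data structures

- Matplotlib: visualization library
    - A Python library for plotting graphs and 2D data

- Seaborn: a library offering a wide range of plot types
    - Built on top of Matplotlib

- SciPy
    - A collection of packages for fundamental problems in scientific computing
    - scipy.stats: a module containing the most commonly used statistical tools

- scikit-learn: the core machine learning library
    - Classification: SVM, nearest neighbors, random forest, logistic regression, etc.
    - Regression: lasso, ridge regression, etc.
    - Clustering: k-means, etc.
    - Dimensionality reduction: PCA, feature selection, matrix factorization, etc.
    - Model selection: grid search, cross-validation, metrics
    - Preprocessing: feature extraction, normalization, etc.

- statsmodels: a Python statistical analysis package that provides R-style regression models
    - Regression models: linear regression
    - Analysis of variance (ANOVA)
    - Time series analysis: AR, ARMA, ARIMA, etc.
    - Visualization of statistical model results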

# Ensemble learning

In [1]:
import numpy as np
import pandas as pd

# Voting classifier
from sklearn.ensemble import VotingClassifier
# Base learners used for voting
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
# Train/test split
from sklearn.model_selection import train_test_split
# Accuracy metric
from sklearn.metrics import accuracy_score
# Suppress FutureWarning messages
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
# Speed up autocompletion (IPython magic)
%config Completer.use_jedi = False
# Dataset loader
from sklearn.datasets import load_breast_cancer

In [2]:
# Store the dataset in a variable
cancer = load_breast_cancer()

In [3]:
print(cancer['DESCR'])

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        worst/largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 0 is Mean Radi

In [4]:
cancer.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

In [5]:
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df.head(3)

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758


In [6]:
df.shape

(569, 30)

In [7]:
logistic_regression = LogisticRegression()
knn = KNeighborsClassifier(n_neighbors=5)

voting_model = VotingClassifier(estimators=[
    ('LogisticRegression', logistic_regression), ('KNN', knn)],
                                voting='soft', n_jobs=-1)
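
With `voting='soft'`, the ensemble averages the class probabilities predicted by each estimator and picks the class with the highest mean. With `voting='hard'`, it takes a majority vote over the predicted labels. The difference can be sketched for a single sample in NumPy. The three probability vectors are made-up values for illustration, not outputs of the models above.

```python
import numpy as np

# Hypothetical class probabilities [P(class 0), P(class 1)] from three classifiers
probas = np.array([
    [0.90, 0.10],
    [0.40, 0.60],
    [0.45, 0.55],
])

# Soft voting: average the probabilities, then take the most probable class
soft_vote = int(np.argmax(probas.mean(axis=0)))

# Hard voting: each classifier votes for its own top class, majority wins
labels = probas.argmax(axis=1)
hard_vote = int(np.bincount(labels).argmax())

print(soft_vote, hard_vote)
```

In this example one very confident classifier outweighs two mildly confident ones under soft voting, while hard voting follows the two-to-one majority.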

In [8]:
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.2, random_state=156)

In [9]:
voting_model.fit(X_train, y_train)
pred = voting_model.predict(X_test)

In [10]:
print('Voting classifier accuracy : {:.3f}'.format(accuracy_score(y_test, pred)))

Voting classifier accuracy : 0.947


In [11]:
# Train, predict with, and evaluate each individual model
classifier = [logistic_regression, knn]

for model in classifier:
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    model_name = model.__class__.__name__
    print('{} accuracy : {:.3f}'.format(model_name, accuracy_score(y_test, pred)))

LogisticRegression accuracy : 0.939
KNeighborsClassifier accuracy : 0.904


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
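
The warning above suggests either raising `max_iter` or scaling the data. One possible way to do both is to put a `StandardScaler` in front of `LogisticRegression` in a pipeline. This is a sketch, not the notebook's original code: the `max_iter=1000` cap and the name `scaled_lr` are assumptions, and the resulting accuracy will differ from the figures printed above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.2, random_state=156)

# Standardise the features before fitting; max_iter=1000 is an assumed, generous cap
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_train, y_train)

# Pipeline.score reports accuracy for a classifier
scaled_accuracy = scaled_lr.score(X_test, y_test)
print('Scaled LogisticRegression accuracy : {:.3f}'.format(scaled_accuracy))
```

Distance-based models such as `KNeighborsClassifier` are also sensitive to feature scale, so the same pipeline pattern may be worth trying there.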


# Random forest (RandomForest)

In [12]:
from sklearn.model_selection import cross_validate
from sklearn.ensemble import RandomForestClassifier

In [13]:
wine = pd.read_csv(
    'https://raw.githubusercontent.com/rickiepark/hg-mldl/master/wine.csv')

In [14]:
wine.keys()

Index(['alcohol', 'sugar', 'pH', 'class'], dtype='object')

In [15]:
wine

Unnamed: 0,alcohol,sugar,pH,class
0,9.4,1.9,3.51,0.0
1,9.8,2.6,3.20,0.0
2,9.8,2.3,3.26,0.0
3,9.8,1.9,3.16,0.0
4,9.4,1.9,3.51,0.0
...,...,...,...,...
6492,11.2,1.6,3.27,1.0
6493,9.6,8.0,3.15,1.0
6494,9.4,1.2,2.99,1.0
6495,12.8,1.1,3.34,1.0


In [16]:
data = wine[['alcohol', 'sugar', 'pH']].to_numpy()
target = wine['class'].to_numpy()

In [17]:
X_train, X_test, y_train, y_test = train_test_split(
    data, target, test_size=0.2, random_state=42)

In [18]:
rf = RandomForestClassifier(n_jobs=-1, random_state=42)
scores = cross_validate(rf, X_train, y_train, return_train_score=True)

print(np.mean(scores['train_score']), np.mean(scores['test_score']))

0.9973541965122431 0.8905151032797809


In [19]:
rf.fit(X_train, y_train)

print(rf.feature_importances_)

[0.23167441 0.50039841 0.26792718]
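
`feature_importances_` is returned in the same order as the input columns, and the values are normalised to sum to 1. A small sketch that attaches the column names used to build `data` to the values printed above and ranks them. The names `importances` and `ranked` are only illustrative.

```python
import pandas as pd

# Importances printed above, in the column order used to build `data`
importances = pd.Series(
    [0.23167441, 0.50039841, 0.26792718],
    index=['alcohol', 'sugar', 'pH'],
)

# Sort so the most influential feature comes first
ranked = importances.sort_values(ascending=False)
print(ranked)
```

Read this way, `sugar` carries roughly half of the total importance in this model.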


In [20]:
# OOB (out-of-bag): samples left out of a tree's bootstrap sample
rf = RandomForestClassifier(oob_score=True, n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)

print(rf.oob_score_)

0.8934000384837406
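
To see why out-of-bag samples exist at all, the following sketch draws a single bootstrap sample with NumPy and measures how many rows were never picked. The sample size and seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 10_000

# Draw one bootstrap sample: n indices chosen with replacement
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)

# Rows never drawn are the out-of-bag (OOB) rows for that tree
in_bag = np.zeros(n_samples, dtype=bool)
in_bag[bootstrap_idx] = True
oob_fraction = 1.0 - in_bag.mean()

# The fraction should land near 1/e (roughly 0.37)
print(round(oob_fraction, 3))
```

Because each tree leaves out about a third of the training rows, those rows can act as a built-in validation set, which is what `oob_score_` reports.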


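# Hands-on example: used car price prediction

1. Algorithm: random forest
2. Dataset: overseas used car sales dataset
3. About the dataset: dependent variable (selling_price), independent variables (the remaining columns)
    - A dataset collected from used car sales records
4. Problem type: regression
5. Evaluation metric: RMSE
6. Model: RandomForestRegressor
7. Libraries used: 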
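
RMSE (root mean squared error) is the square root of the mean of the squared differences between actual and predicted values, so it is expressed in the same unit as the target. A minimal NumPy sketch, where the helper name `rmse` and the sample numbers are made up for illustration:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up values purely for illustration
example_rmse = rmse([3, 5, 7], [2, 5, 9])
print(example_rmse)
```

Squaring penalises large errors more heavily than small ones, which is worth keeping in mind when a few cars are far more expensive than the rest.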

In [21]:
car = pd.read_csv(
    'https://media.githubusercontent.com/media/musthave-ML10/data_source/main/car.csv')

In [22]:
df = pd.DataFrame(car)

In [23]:
df.head(3)

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner,mileage,engine,max_power,torque,seats
0,Maruti Swift Dzire VDI,2014,450000,145500,Diesel,Individual,Manual,First Owner,23.4 kmpl,1248 CC,74 bhp,190Nm@ 2000rpm,5.0
1,Skoda Rapid 1.5 TDI Ambition,2014,370000,120000,Diesel,Individual,Manual,Second Owner,21.14 kmpl,1498 CC,103.52 bhp,250Nm@ 1500-2500rpm,5.0
2,Honda City 2017-2020 EXi,2006,158000,140000,Petrol,Individual,Manual,Third Owner,17.7 kmpl,1497 CC,78 bhp,"12.7@ 2,700(kgm@ rpm)",5.0


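1. Feature overview
    - name: car model
    - year: model year
    - selling_price: selling price (dependent variable)
    - km_driven: distance driven
    - fuel: fuel type
    - seller_type: seller type
    - transmission: transmission type
    - owner: ownership history
    - mileage: fuel efficiency (kmpl or km/kg)
    - engine: engine displacement (CC)
    - max_power: maximum power (brake horsepower, bhp)
    - torque: torque
    - seats: number of seats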

In [24]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8128 entries, 0 to 8127
Data columns (total 13 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   name           8128 non-null   object 
 1   year           8128 non-null   int64  
 2   selling_price  8128 non-null   int64  
 3   km_driven      8128 non-null   int64  
 4   fuel           8128 non-null   object 
 5   seller_type    8128 non-null   object 
 6   transmission   8128 non-null   object 
 7   owner          8128 non-null   object 
 8   mileage        7907 non-null   object 
 9   engine         7907 non-null   object 
 10  max_power      7913 non-null   object 
 11  torque         7906 non-null   object 
 12  seats          7907 non-null   float64
dtypes: float64(1), int64(3), object(9)
memory usage: 825.6+ KB


In [25]:
df.isnull().sum()

name               0
year               0
selling_price      0
km_driven          0
fuel               0
seller_type        0
transmission       0
owner              0
mileage          221
engine           221
max_power        215
torque           222
seats            221
dtype: int64

## Preprocessing: text data
- split(): splits a string
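
A small sketch of how `Series.str.split(expand=True)` behaves on values shaped like the `engine` column. The sample strings and the column names `value` and `unit` are illustrative assumptions.

```python
import pandas as pd

# Sample values in the same "<number> <unit>" shape as the engine column
engine = pd.Series(['1248 CC', '1498 CC', '1497 CC'])

# expand=True returns one column per whitespace-separated token
parts = engine.str.split(expand=True)
parts.columns = ['value', 'unit']

# The numeric token is still a string until it is cast
parts['value'] = parts['value'].astype(float)
print(parts)
```

Without `expand=True` the result would be a single column holding a list per row, which cannot be assigned to two DataFrame columns directly.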

In [26]:
# Split the number from the unit string
df[['engine', 'engine_unit']] = df['engine'].str.split(expand=True)

In [27]:
# Convert the dtype
df['engine'] = df['engine'].astype(float)

In [28]:
# Check missing values
df['engine'].isnull().sum()

221

In [29]:
# Drop the unit column
df = df.drop('engine_unit', axis=1)

In [30]:
# Replace missing values with the mean
df.engine = df.engine.fillna(df.engine.mean())

In [31]:
df['engine'].isnull().sum()

0

In [32]:
df[['mileage', 'mileage_unit']] = df['mileage'].str.split(expand=True)

In [33]:
df['mileage'] = df['mileage'].astype(float)

In [34]:
df.mileage_unit.unique()

array(['kmpl', 'km/kg', nan], dtype=object)

In [35]:
df.fuel.unique()

array(['Diesel', 'Petrol', 'LPG', 'CNG'], dtype=object)

- There are four fuel types
- Mileage for different fuels must be put on a common scale
- Convert by dividing mileage by the fuel price
- Prices as of 2022
    - Diesel : 1.405
    - Petrol : 1.048
    - LPG : 0.939
    - CNG : 2.76

In [36]:
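def mile(x):
    # Divide mileage by the fuel price so all fuels share one scale
    if x['fuel'] == 'Petrol':
        return x.mileage / 1.048
    elif x['fuel'] == 'Diesel':
        return x.mileage / 1.405
    elif x['fuel'] == 'LPG':
        # Note: the price list above gives 0.939 for LPG; 3.54 is what this run used
        return x.mileage / 3.54
    else:  # CNG
        return x.mileage / 2.76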

In [37]:
df.mileage = df.apply(mile, axis=1)
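
`df.apply(mile, axis=1)` calls the function once per row. An equivalent column-wise form can be sketched with `Series.map`. The divisors are copied from `mile()` as written, the frame `cars` is a tiny stand-in, and the column name `mileage_per_price` is an assumption. The first two rows use the values of rows 0 and 2 of the dataset and the CNG row is made up.

```python
import pandas as pd

# Divisors copied from mile() above
fuel_price = {'Petrol': 1.048, 'Diesel': 1.405, 'LPG': 3.54, 'CNG': 2.76}

# Tiny stand-in frame with the same two columns
cars = pd.DataFrame({
    'fuel': ['Diesel', 'Petrol', 'CNG'],
    'mileage': [23.4, 17.7, 26.6],
})

# Look up each row's price, then divide column-wise (no Python-level loop)
cars['mileage_per_price'] = cars['mileage'] / cars['fuel'].map(fuel_price)
print(cars)
```

Keeping the prices in one dictionary also makes it easier to spot and correct a single value later.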

In [38]:
df = df.drop('mileage_unit', axis=1)

In [39]:
df

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner,mileage,engine,max_power,torque,seats
0,Maruti Swift Dzire VDI,2014,450000,145500,Diesel,Individual,Manual,First Owner,16.654804,1248.0,74 bhp,190Nm@ 2000rpm,5.0
1,Skoda Rapid 1.5 TDI Ambition,2014,370000,120000,Diesel,Individual,Manual,Second Owner,15.046263,1498.0,103.52 bhp,250Nm@ 1500-2500rpm,5.0
2,Honda City 2017-2020 EXi,2006,158000,140000,Petrol,Individual,Manual,Third Owner,16.889313,1497.0,78 bhp,"12.7@ 2,700(kgm@ rpm)",5.0
3,Hyundai i20 Sportz Diesel,2010,225000,127000,Diesel,Individual,Manual,First Owner,16.370107,1396.0,90 bhp,22.4 kgm at 1750-2750rpm,5.0
4,Maruti Swift VXI BSIII,2007,130000,120000,Petrol,Individual,Manual,First Owner,15.362595,1298.0,88.2 bhp,"11.5@ 4,500(kgm@ rpm)",5.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
8123,Hyundai i20 Magna,2013,320000,110000,Petrol,Individual,Manual,First Owner,17.652672,1197.0,82.85 bhp,113.7Nm@ 4000rpm,5.0
8124,Hyundai Verna CRDi SX,2007,135000,119000,Diesel,Individual,Manual,Fourth & Above Owner,11.957295,1493.0,110 bhp,"24@ 1,900-2,750(kgm@ rpm)",5.0
8125,Maruti Swift Dzire ZDi,2009,382000,120000,Diesel,Individual,Manual,First Owner,13.736655,1248.0,73.9 bhp,190Nm@ 2000rpm,5.0
8126,Tata Indigo CR4,2013,290000,25000,Diesel,Individual,Manual,First Owner,16.775801,1396.0,70 bhp,140Nm@ 1800-3000rpm,5.0


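### torque
- Extract only the leading number and convert it to a numeric type
- Scale the units to Nm

In [40]:
def split_num(x):
    # Return the leading numeric part of a string, e.g. '190Nm@ 2000rpm' -> '190'
    x = str(x)
    cut = len(x)  # default: the whole string is numeric
    for i, j in enumerate(x):
        if j not in '0123456789.':
            cut = i
            break
    return x[:cut]

In [57]:
# Apply element-wise; passing the whole Series would parse its printed form
df.torque = df.torque.apply(split_num)

In [58]:
df.torque.head()

0     190
1     250
2    12.7
3    22.4
4    11.5
Name: torque, dtype: object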
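
The unit-scaling step for torque is listed above but not implemented. One possible sketch is shown below. It assumes that any torque string containing "kgm" is expressed in kgf·m and that everything else is already in N·m, and it uses the standard factor 1 kgf·m = 9.80665 N·m. The helper name `torque_to_nm` is hypothetical.

```python
import re

KGM_TO_NM = 9.80665  # 1 kgf·m expressed in N·m

def torque_to_nm(text):
    """Parse the leading number of a torque string and return it in N·m."""
    text = str(text)
    match = re.match(r'\s*(\d+(?:\.\d+)?)', text)
    if match is None:
        return float('nan')  # no leading number (for example a missing value)
    value = float(match.group(1))
    # Assumption: any string mentioning "kgm" is in kgf·m; everything else is N·m
    if 'kgm' in text.lower():
        value *= KGM_TO_NM
    return value

print(torque_to_nm('190Nm@ 2000rpm'))
print(torque_to_nm('12.7@ 2,700(kgm@ rpm)'))
```

It could be applied with `df.torque.apply(torque_to_nm)`, which would also leave the column as floats rather than strings.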

### max_power

In [41]:
df[['max_power', 'max_power_unit']] = df['max_power'].str.split(expand=True)

In [42]:
df['max_power'].head()

0        74
1    103.52
2        78
3        90
4      88.2
Name: max_power, dtype: object

In [43]:
df.max_power.unique()

array(['74', '103.52', '78', '90', '88.2', '81.86', '57.5', '37', '67.1',
       '68.1', '108.45', '60', '73.9', nan, '67', '82', '88.5', '46.3',
       '88.73', '64.1', '98.6', '88.8', '83.81', '83.1', '47.3', '73.8',
       '34.2', '35', '81.83', '40.3', '121.3', '138.03', '160.77',
       '117.3', '116.3', '83.14', '67.05', '168.5', '100', '120.7',
       '98.63', '175.56', '103.25', '171.5', '100.6', '174.33', '187.74',
       '170', '78.9', '88.76', '86.8', '108.495', '108.62', '93.7',
       '103.6', '98.59', '189', '67.04', '68.05', '58.2', '82.85',
       '81.80', '73', '120', '94.68', '160', '65', '155', '69.01',
       '126.32', '138.1', '83.8', '126.2', '98.96', '62.1', '86.7', '188',
       '214.56', '177', '280', '148.31', '254.79', '190', '177.46', '204',
       '141', '117.6', '241.4', '282', '150', '147.5', '108.5', '103.5',
       '183', '181.04', '157.7', '164.7', '91.1', '400', '68', '75',
       '85.8', '87.2', '53', '118', '103.2', '83', '84', '58.16',
       '147.

In [44]:
df.max_power = df.max_power.astype(float)

ValueError: could not convert string to float: 'bhp'

In [None]:
def isFloat(x):
    # Return x as a float, or NaN when it is not numeric (e.g. the bare string 'bhp')
    try:
        num = float(x)
        return num
    except ValueError:
        return np.nan

In [45]:
# Run the cell above first so that isFloat is defined
df.max_power = df.max_power.apply(isFloat)
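
As an alternative to a hand-written helper, pandas offers `pd.to_numeric` with `errors='coerce'`, which converts anything non-numeric to NaN. A sketch on made-up values that mirror the column (numbers, a bare unit string, and a missing entry):

```python
import pandas as pd

# Made-up values mirroring the column: numbers, a bare unit string, a missing entry
max_power = pd.Series(['74', '103.52', 'bhp', None])

# errors='coerce' turns anything that is not a number into NaN instead of raising
max_power_num = pd.to_numeric(max_power, errors='coerce')
print(max_power_num)
```

On the real data this would presumably be written as `df.max_power = pd.to_numeric(df.max_power, errors='coerce')`.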

In [46]:
df.max_power_unit.unique()

array(['bhp', nan, None], dtype=object)

In [47]:
df.drop('max_power_unit', axis=1, inplace=True)

In [48]:
df.name = df.name.str.split(expand=True)[0]

In [49]:
df.name.unique()

array(['Maruti', 'Skoda', 'Honda', 'Hyundai', 'Toyota', 'Ford', 'Renault',
       'Mahindra', 'Tata', 'Chevrolet', 'Fiat', 'Datsun', 'Jeep',
       'Mercedes-Benz', 'Mitsubishi', 'Audi', 'Volkswagen', 'BMW',
       'Nissan', 'Lexus', 'Jaguar', 'Land', 'MG', 'Volvo', 'Daewoo',
       'Kia', 'Force', 'Ambassador', 'Ashok', 'Isuzu', 'Opel', 'Peugeot'],
      dtype=object)

## Preprocessing: missing values and dummy-variable conversion

In [50]:
df.isna().sum()

name               0
year               0
selling_price      0
km_driven          0
fuel               0
seller_type        0
transmission       0
owner              0
mileage          221
engine             0
max_power        215
torque           222
seats            221
dtype: int64

In [51]:
df.dropna(inplace=True)

In [52]:
season = pd.DataFrame({'season': ['spring', 'summer', 'fall', 'winter', np.nan]})
season

Unnamed: 0,season
0,spring
1,summer
2,fall
3,winter
4,


In [53]:
pd.get_dummies(season.season, dummy_na=True)

Unnamed: 0,fall,spring,summer,winter,NaN
0,0,1,0,0,0
1,0,0,1,0,0
2,1,0,0,0,0
3,0,0,0,1,0
4,0,0,0,0,1


In [54]:
df.head(1)

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner,mileage,engine,max_power,torque,seats
0,Maruti,2014,450000,145500,Diesel,Individual,Manual,First Owner,16.654804,1248.0,74,190Nm@ 2000rpm,5.0


In [55]:
# Pass the DataFrame (df); `data` is the NumPy array from the wine example
df = pd.get_dummies(df, columns=['name', 'fuel', 'seller_type', 'transmission',
                                 'owner'])

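- One-hot encoding: an encoding technique that turns each category of a categorical variable into its own column filled with 1 or 0
- drop_first parameter of pd.get_dummies(): drops the first category, which is implied when all remaining dummy columns are 0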
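
The effect of `drop_first` can be sketched on the same four seasons used earlier. `dtype=int` is passed only so the result prints as 0/1 regardless of the pandas version.

```python
import pandas as pd

season = pd.Series(['spring', 'summer', 'fall', 'winter'], name='season')

# drop_first=True removes the first level in sorted order ('fall' here)
dummies = pd.get_dummies(season, drop_first=True, dtype=int)
print(dummies)
```

The row for `fall` is then all zeros, so no information is lost while one redundant column is removed.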

In [56]:
X_train, X_test, y_train, y_test = train_test_split(
    df.drop('selling_price', axis=1), df.selling_price,
    test_size=0.2, random_state=100)