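### Finding the optimal regularization weight
- model: one model per station, identical to the model object that stored the OLS fit.
- For each of the 1000 station models, compute r_squared and r_squared_adj, then pick the weight that gives the highest mean.
- This differs slightly from how we compute the final r_squared, but we do it to gain insight into a good weight.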
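
The two metrics are linked through the sample size and the number of predictors. As a minimal sketch (the helper name `adjusted_r2` and the sample numbers are illustrative, not taken from the notebook), adjusted R² can be derived from R² as shown here. When a station has few rows relative to the number of predictors, the adjusted value can fall below zero, which is consistent with the notebook setting negative values to 0 before averaging.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R² for a model with an intercept.

    n_obs: number of observations; n_predictors: number of regressors
    excluding the intercept. Both are hypothetical inputs here.
    """
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Plenty of data per predictor: the adjustment is small
print(round(adjusted_r2(0.85, n_obs=100, n_predictors=10), 3))
# Few observations per predictor: the adjusted value turns negative
print(round(adjusted_r2(0.14, n_obs=30, n_predictors=10), 3))
```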

In [29]:
len(models)

1000

In [30]:
# Fit each per-station OLS model once and collect rsquared and rsquared_adj
rsquared2 = []
rsquared_adj2 = []
for model in models:
    result = model.fit()
    rsquared2.append(round(result.rsquared, 3))
    rsquared_adj2.append(round(result.rsquared_adj, 3))

# Build the rsquared DataFrame
# Cleanup: NaN and ±inf -> 0, negatives -> 0, values above 1 -> 1
rsquared_df2 = pd.DataFrame({'rsquared_adj': rsquared_adj2, 'rsquared': rsquared2})
rsquared_df2 = rsquared_df2.replace([np.inf, -np.inf], np.nan).fillna(0)
rsquared_df2 = rsquared_df2.clip(lower=0, upper=1)
rsquared_df2

Unnamed: 0,rsquared_adj,rsquared
0,0.834,0.850
1,0.842,0.857
2,0.802,0.821
3,0.872,0.886
4,0.644,0.682
...,...,...
995,0.087,0.342
996,0.000,0.140
997,0.618,0.716
998,0.298,0.489


- Relationship between rsquared_df and rsquared_df2

In [31]:
# Without regularization (median of the per-model rsquared values)
rsquared_df2['rsquared'].median()

0.4545

In [32]:
# Return the mean rsquared and rsquared_adj over `models` for each regularization weight (alpha).
# These rsquared and rsquared_adj values are per-model values, so they may be computed differently from our actual model's score.
# Use the function below only to gain insight into a good weight.
def weight_regularization(models,L1_wt=0):
    df_mean=pd.DataFrame()
    for n in np.arange(0.01,0.1,0.01).tolist():
        results_fr_rsquared=[]
        results_fr_rsquared_adj=[]
        rs_mean=[]
        rs_adj_mean=[]
        for model in models:
            # Regularized fit with alpha weight = n,
            # starting from the unregularized OLS estimates
            results_fu = model.fit()
            results_fr = model.fit_regularized(L1_wt=L1_wt, alpha=n, start_params=results_fu.params)
            # Wrap the regularized params in OLSResults to read rsquared and rsquared_adj
            results_fr_fit = sm.regression.linear_model.OLSResults(model, results_fr.params, model.normalized_cov_params)
            results_fr_rsquared.append(results_fr_fit.rsquared)
            results_fr_rsquared_adj.append(results_fr_fit.rsquared_adj)

        results_fr_rsquared=[round(rsquared,3) for rsquared in results_fr_rsquared]
        results_fr_rsquared_adj=[round(rsquared,3) for rsquared in results_fr_rsquared_adj]

        # Per-model DataFrame
        # Cleanup: NaN and ±inf -> 0, negatives -> 0, values above 1 -> 1
        results_fr_df=pd.DataFrame({'rsquared':results_fr_rsquared,'rsquared_adj':results_fr_rsquared_adj})
        results_fr_df=results_fr_df.fillna(0)
        results_fr_df[np.isinf]=0
        results_fr_df[results_fr_df<0]=0
        results_fr_df[results_fr_df>1]=1
        # Mean over all models for this alpha
        rs_mean.append(results_fr_df['rsquared'].mean())
        rs_adj_mean.append(results_fr_df['rsquared_adj'].mean())
        # Append one row per alpha
        df=pd.DataFrame({'alpha':n,'rs_mean':rs_mean,'rs_adj_mean':rs_adj_mean})
        df_mean=pd.concat([df_mean,df])
    # Rows with the highest mean rsquared and the highest mean rsquared_adj
    max_of_rs_mean=df_mean[df_mean['rs_mean']==df_mean['rs_mean'].max()]
    max_of_rs_adj_mean=df_mean[df_mean['rs_adj_mean']==df_mean['rs_adj_mean'].max()]
    return df_mean,max_of_rs_mean,max_of_rs_adj_mean
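
For reference, statsmodels documents the elastic-net objective minimized by `fit_regularized` as `0.5*RSS/n + alpha*((1 - L1_wt)*|params|_2^2/2 + L1_wt*|params|_1)`. The pure-Python sketch below evaluates only the penalty term, to show how `L1_wt` moves between ridge (0) and lasso (1). The helper name `elastic_net_penalty` and the coefficients are illustrative.

```python
def elastic_net_penalty(params, alpha, L1_wt):
    """Penalty term alpha*((1 - L1_wt)*sum(b^2)/2 + L1_wt*sum(|b|))."""
    l2 = sum(b * b for b in params)
    l1 = sum(abs(b) for b in params)
    return alpha * ((1 - L1_wt) * l2 / 2 + L1_wt * l1)

coefs = [0.5, -2.0, 0.0]  # hypothetical coefficients
for L1_wt in (0, 0.5, 1):
    # L1_wt=0 uses only the squared (ridge) term, L1_wt=1 only the absolute (lasso) term
    print(L1_wt, elastic_net_penalty(coefs, alpha=0.01, L1_wt=L1_wt))
```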

In [33]:
# L1_wt=0: ridge model
# Best alpha: 0.01
df_mean,max_of_rs_mean,max_of_rs_adj_mean=weight_regularization(models,0)
df_mean

Unnamed: 0,alpha,rs_mean,rs_adj_mean
0,0.01,0.450035,0.320521
0,0.02,0.433337,0.303265
0,0.03,0.420071,0.289972
0,0.04,0.408901,0.279176
0,0.05,0.399168,0.270077
0,0.06,0.390545,0.262209
0,0.07,0.382795,0.255332
0,0.08,0.375714,0.24918
0,0.09,0.369193,0.243547


In [34]:
max_of_rs_mean

Unnamed: 0,alpha,rs_mean,rs_adj_mean
0,0.01,0.450035,0.320521


In [35]:
max_of_rs_adj_mean

Unnamed: 0,alpha,rs_mean,rs_adj_mean
0,0.01,0.450035,0.320521


In [36]:
# L1_wt=1: Lasso model
# Best alpha: 0.01
df_mean,max_of_rs_mean,max_of_rs_adj_mean=weight_regularization(models,1)
df_mean

Unnamed: 0,alpha,rs_mean,rs_adj_mean
0,0.01,0.315667,0.226215
0,0.02,0.255392,0.18492
0,0.03,0.220809,0.164038
0,0.04,0.198891,0.149341
0,0.05,0.183248,0.138847
0,0.06,0.170921,0.130729
0,0.07,0.161452,0.123906
0,0.08,0.153266,0.117693
0,0.09,0.146676,0.11224


In [37]:
max_of_rs_mean

Unnamed: 0,alpha,rs_mean,rs_adj_mean
0,0.01,0.315667,0.226215


In [38]:
max_of_rs_adj_mean

Unnamed: 0,alpha,rs_mean,rs_adj_mean
0,0.01,0.315667,0.226215


In [39]:
# L1_wt=0.5: Elastic Net model
# Best alpha: 0.01
df_mean,max_of_rs_mean,max_of_rs_adj_mean=weight_regularization(models,0.5)
df_mean

Unnamed: 0,alpha,rs_mean,rs_adj_mean
0,0.01,0.360899,0.256567
0,0.02,0.30683,0.2174
0,0.03,0.272111,0.193112
0,0.04,0.246643,0.176879
0,0.05,0.227071,0.16557
0,0.06,0.212234,0.156182
0,0.07,0.200842,0.148668
0,0.08,0.191346,0.142468
0,0.09,0.183001,0.136972


In [40]:
max_of_rs_mean

Unnamed: 0,alpha,rs_mean,rs_adj_mean
0,0.01,0.360899,0.256567


In [41]:
max_of_rs_adj_mean

Unnamed: 0,alpha,rs_mean,rs_adj_mean
0,0.01,0.360899,0.256567


#### Conclusion
- Ridge regularization with alpha: 0.01 fits best.
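
The comparison behind this conclusion can be reproduced from the three result tables above. The sketch below copies the best row of each table by hand (the dictionary layout is illustrative). In all three cases the best alpha is 0.01, the lower edge of the searched grid, so it may be worth checking smaller alpha values as well.

```python
# Best (alpha, mean R², mean adjusted R²) per penalty type, copied from the tables above
best_rows = {
    'ridge (L1_wt=0)': (0.01, 0.450035, 0.320521),
    'lasso (L1_wt=1)': (0.01, 0.315667, 0.226215),
    'elastic net (L1_wt=0.5)': (0.01, 0.360899, 0.256567),
}

# Pick the penalty with the highest mean R² (index 1 of each tuple)
best_penalty = max(best_rows, key=lambda name: best_rows[name][1])
best_alpha, best_rs_mean, best_rs_adj_mean = best_rows[best_penalty]
print(best_penalty, best_alpha, best_rs_mean, best_rs_adj_mean)
```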

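### Running the model with regularization applied
- Regularization used: ridge, alpha: 0.01
- result = model.fit_regularized(alpha=0.01, L1_wt=0)
- To show only what changed for regularization, steps such as data loading, adding variables, and scaling are omitted.
- The ```ols_validation``` function returns the models only because they are needed to find the optimal regularization weight shown above. The return value can be dropped when actually applying regularization.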

In [10]:
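# Imports assumed from the omitted setup cells
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from tqdm.notebook import tqdm

# split: for validation experiments
# Takes the train + validation data and returns the rows of `num` randomly chosen stations
def split(df_train, num, seed):
    
    test_frame = pd.DataFrame(columns=df_train.columns)
    np.random.seed(seed)

    for i in np.random.choice(df_train['station_code'].unique(), num, replace=False):
        df1 = df_train[df_train['station_code'] == i]
        test_frame = pd.concat([test_frame, df1])
    return test_frame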


# Train/validation split for validation experiments
# Splits df_train (or test_frame) into train_df (training) and validation_df (validation)
# train always contains the de-duplicated rows (one row per unique combination of `cate`)
# Passing a different random_state (seed) changes the split (an effect similar to K-fold)
def make_train_validation(dataframe, cate, test_size, seed):
    train_df = pd.DataFrame(columns=dataframe.columns)
    validation_df = pd.DataFrame(columns=dataframe.columns)
    total = tqdm(dataframe['station_code'].unique())
    print('make_train_validation running....')
    for i in total:
        df1 = dataframe[dataframe['station_code'] == i]
        # Required train rows: the first row of each unique combination of `cate`
        nec_train = df1.drop_duplicates(subset=cate)
        # Remaining rows, to be split into train and validation
        # (dropped by index label; the original lookup relied on the index matching `id`)
        train_validation = df1.drop(nec_train.index)
        # If every row is required (all category combinations are unique),
        if len(nec_train) == len(df1):
            # put everything into train_df and nothing into validation (following feedback from the Ph.D. advisor)
            train_df = pd.concat([train_df, nec_train])

        # If at most one row remains, it goes straight into validation
        elif len(train_validation) <= 1:
            train_df = pd.concat([train_df, nec_train])
            validation_df = pd.concat([validation_df, train_validation])
        # Otherwise split the remainder into extra train rows and validation rows
        else:
            extra_train, validation_part = train_test_split(
                train_validation, test_size=test_size, random_state=seed)
            train_a = pd.concat([nec_train, extra_train])
            train_df = pd.concat([train_df, train_a])
            validation_df = pd.concat([validation_df, validation_part])
    return train_df, validation_df

# Validation model
# If a station_code has no rows in validation_df, only fit on train and skip prediction for it
def ols_validation(train_df, validation_df, var, cate):
    
    total = tqdm(train_df['station_code'].unique())
    columns = train_df.columns
    df_tr = pd.DataFrame(columns=columns)
    df_te = pd.DataFrame(columns=columns)
    df_tr['yhat'] = 999
    df_te['yhat'] = 999
    cate_c = [f"C({name})" for name in cate]
    # Formula: real-valued variables plus categorical variables wrapped in C()
    formula = 'scale_ride18 ~ ' + '+'.join(var + cate_c)
    print('ols_validation running....')
    models=[]
    for i in total:
        # .copy() so that adding the yhat column does not write into a slice of the input
        train_ols = train_df[train_df['station_code'] == i].copy()
        validation_ols = validation_df[validation_df['station_code'] == i].copy()

        model = sm.OLS.from_formula(formula, data=train_ols)
        models.append(model)
        # Fit with ridge regularization (alpha=0.01, L1_wt=0)
        result = model.fit_regularized(alpha=0.01, L1_wt=0)
        # In-sample predictions, stored with the training rows
        train_ols['yhat'] = result.predict(train_ols)
        df_tr = pd.concat([df_tr, train_ols])

        # Predict on validation only when this station has validation rows
        if len(validation_ols) > 0:
            validation_ols['yhat'] = result.predict(validation_ols[var+cate])
            df_te = pd.concat([df_te, validation_ols])
    return df_tr, df_te, models

# Compute R-squared
# Takes the seed used for the train/validation split so that each score can be traced back to its seed.
# Returns a DataFrame
def get_rsquared(df_tr, df_te, seed):

    df_tr['residual'] = df_tr['scale_ride18'] - df_tr['yhat']
    df_tr['explained'] = df_tr['yhat'] - np.mean(df_tr['yhat'])
    df_tr['total'] = df_tr['scale_ride18'] - np.mean(df_tr['scale_ride18'])

    df_te['residual'] = df_te['scale_ride18'] - df_te['yhat']
    df_te['explained'] = df_te['yhat'] - np.mean(df_te['yhat'])
    df_te['total'] = df_te['scale_ride18'] - np.mean(df_te['scale_ride18'])

    train_ess = np.sum((df_tr['explained'] ** 2))
    train_rss = np.sum((df_tr['residual'] ** 2))
    train_tss = np.sum((df_tr['total'] ** 2))
    test_ess = np.sum((df_te['explained'] ** 2))
    test_rss = np.sum((df_te['residual'] ** 2))
    test_tss = np.sum((df_te['total'] ** 2))

    rsquared = {'seed': [f'{seed}'],
             'train_rsquared_1': [1-train_rss/train_tss],
             'train_rsquared_2': [train_ess/train_tss],
             'validation_rsquared_1': [1-test_rss/test_tss],
             'validation_rsquared_2': [test_ess/test_tss],
             'train_ESS' : [round(train_ess)],
             'train_RSS' : [round(train_rss)],
             'train_TSS' : [round(train_tss)],
             'validation_ESS' : [round(test_ess)],
             'validation_RSS' : [round(test_rss)],
             'validation_TSS' : [round(test_tss)],
             'train_RMSE' : [np.sqrt(((df_tr['scale_ride18'] - df_tr['yhat']) ** 2).mean())],
             'validation_RMSE' : [np.sqrt(((df_te['scale_ride18'] - df_te['yhat']) ** 2).mean())],
               }
    print(f'seed : {seed} done')
    return pd.DataFrame(rsquared)
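
`get_rsquared` reports two versions of R²: `1 - RSS/TSS` and `ESS/TSS`. The two coincide for an unregularized least-squares fit with an intercept evaluated on its own training data. They generally differ once the coefficients are shrunk or the model is scored on held-out data, which is presumably why both are tracked. A minimal pure-Python sketch with made-up numbers (one predictor, slope shrunk by hand as a stand-in for a penalty; the helper name `r2_both` is illustrative):

```python
def r2_both(y, yhat):
    """Return (1 - RSS/TSS, ESS/TSS) using the same definitions as get_rsquared."""
    mean_y = sum(y) / len(y)
    mean_yhat = sum(yhat) / len(yhat)
    rss = sum((a - b) ** 2 for a, b in zip(y, yhat))
    ess = sum((b - mean_yhat) ** 2 for b in yhat)
    tss = sum((a - mean_y) ** 2 for a in y)
    return 1 - rss / tss, ess / tss

# Made-up data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1]

# Closed-form simple linear regression (least squares with intercept)
mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sum((a - mean_x) ** 2 for a in x)
intercept = mean_y - slope * mean_x

ols_fit = [intercept + slope * a for a in x]
# Shrink the slope by 20% while keeping the fit centered on the data means
shrunk_fit = [mean_y + 0.8 * slope * (a - mean_x) for a in x]

print(r2_both(y, ols_fit))     # the two values agree up to floating-point rounding
print(r2_both(y, shrunk_fit))  # the two values differ
```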

In [11]:
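# Run everything several times in one go
# Runs make_train_validation -> ols_validation -> get_rsquared (defined above) once per seed
# dataframe = the data before it is split into train and validation
# seeds = list of seeds (the OLS validation is run once per seed)
# test_size = validation size; preferably keep it fixed (0.3 is used below)
# var and cate are read from the notebook's global variables
# Returns rsquared_df with one row per seed (train/validation R-squared, ESS, RSS, TSS, RMSE) and the models of the last seed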

def validations(dataframe, seeds, test_size):
    rsquared_df = pd.DataFrame()
    for seed in seeds:
        train_df, validation_df = make_train_validation(dataframe, cate, test_size, seed)
        df_tr, df_te, models = ols_validation(train_df, validation_df, var, cate)
        rsquared = get_rsquared(df_tr, df_te, seed)
        rsquared_df = pd.concat([rsquared_df, rsquared])
    return rsquared_df, models
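
The idea behind `make_train_validation` is that the first row of every distinct category combination is always kept in train, presumably so that each categorical level seen at validation time was also seen during fitting. Below is a simplified pure-Python sketch of that idea for a single station. The function name `split_keep_levels`, the toy rows, and the use of `random.Random` instead of `train_test_split` are assumptions for illustration, and the notebook's special cases for stations with at most one remaining row are left out.

```python
import random

def split_keep_levels(rows, key_cols, test_size, seed):
    """Keep the first row of each category combination in train, split the rest at random."""
    seen, required, rest = set(), [], []
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        if key in seen:
            rest.append(row)
        else:
            seen.add(key)
            required.append(row)
    rng = random.Random(seed)
    rng.shuffle(rest)
    n_val = int(round(len(rest) * test_size))
    return required + rest[n_val:], rest[:n_val]

# Toy rows for one station (hypothetical values)
rows = [{'id': i, 'weekday': i % 3, 'holiday': i % 2} for i in range(20)]
train, validation = split_keep_levels(rows, ['weekday', 'holiday'], test_size=0.3, seed=300)

# Every category combination in validation also appears in train
train_keys = {(r['weekday'], r['holiday']) for r in train}
val_keys = {(r['weekday'], r['holiday']) for r in validation}
print(len(train), len(validation), val_keys <= train_keys)
```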

In [13]:
# Build the top 1000 stations
# Of the 3563 stations in total, the top 1000 account for about 76% of all data
dataframe_1000 = make_top_station(1000)

The top 1000 stations, out of 3563 stations in total, account for 75.9 % of the data.



In [28]:
# Change the variables here and re-run the validation
var_total = ['scale_ride6','scale_ride7','scale_ride8','scale_ride9', 'scale_ride10','scale_ride11',
       'scale_off6','scale_off7','scale_off8','scale_off9','scale_off10','scale_off11',
       'scale_temperature','scale_precipitation','scale_bus_interval',
       'scale_ride67','scale_ride89','scale_ride1011','scale_off67','scale_off89', 'scale_off1011',
        'scale_ride_sum','scale_off_sum','scale_bus_route_id_sum','scale_bus_route_id_all_sum']

# 'scale_ride67','scale_ride89','scale_ride1011','scale_off67','scale_off89', 'scale_off1011' 
# are columns summed over two-hour windows.

# Real-valued variables to use in the validation
var = ['scale_temperature','scale_precipitation','scale_bus_interval',
       'scale_ride_sum','scale_off_sum','scale_bus_route_id_sum','scale_bus_route_id_all_sum',
       'scale_ride67','scale_ride89','scale_ride1011','scale_off67','scale_off89', 'scale_off1011']
# 'scale_ride6','scale_ride7','scale_ride8','scale_ride9', 'scale_ride10','scale_ride11',
#        'scale_off6','scale_off7','scale_off8','scale_off9','scale_off10','scale_off11',

# Categorical variables to use in the validation
cate = ['bus_route_id','in_','out', 'weekend', 'weekday', 'holiday', 'typhoon'] # ,

seeds = [300, 20, 30, 40, 50] # Varying the random seed gives an effect similar to K-fold
test_size = 0.3 # Can be changed, but 0.3 is recommended: it yields roughly a 7.5:2.5 train:validation ratio
rsquared_df,models = validations(dataframe_1000, seeds, test_size)
rsquared_df

make_train_validation running....
ols_validation running....
seed : 300 done
make_train_validation running....
ols_validation running....
seed : 20 done
make_train_validation running....
ols_validation running....
seed : 30 done
make_train_validation running....
ols_validation running....
seed : 40 done
make_train_validation running....
ols_validation running....
seed : 50 done

Unnamed: 0,seed,train_rsquared_1,train_rsquared_2,validation_rsquared_1,validation_rsquared_2,train_ESS,train_RSS,train_TSS,validation_ESS,validation_RSS,validation_TSS,train_RMSE,validation_RMSE
0,300,0.763522,0.721794,0.695598,0.669562,153520.0,50297.0,212693.0,55879.0,25404.0,83457.0,0.517154,0.694804
0,20,0.757149,0.715091,0.715326,0.637089,157155.0,53371.0,219769.0,48690.0,21757.0,76426.0,0.532723,0.642989
0,30,0.756112,0.715082,0.707211,0.715455,154175.0,52583.0,215605.0,57639.0,23588.0,80563.0,0.528777,0.669504
0,40,0.764694,0.72004,0.683358,0.691374,155430.0,50794.0,215863.0,55545.0,25439.0,80340.0,0.519701,0.695276
0,50,0.753288,0.710419,0.732093,0.635398,156148.0,54227.0,219797.0,48559.0,20474.0,76423.0,0.536976,0.623752
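
To summarize the five runs, the train and validation R² values from the table above can be averaged across seeds. The sketch below copies the `train_rsquared_1` and `validation_rsquared_1` columns by hand and uses only the standard library. The variable names are illustrative.

```python
from statistics import mean, stdev

# Columns copied from the result table above (seeds 300, 20, 30, 40, 50)
train_r2 = [0.763522, 0.757149, 0.756112, 0.764694, 0.753288]
validation_r2 = [0.695598, 0.715326, 0.707211, 0.683358, 0.732093]

train_mean = mean(train_r2)
validation_mean = mean(validation_r2)
print(f"train R²: mean={train_mean:.3f}, sd={stdev(train_r2):.3f}")
print(f"validation R²: mean={validation_mean:.3f}, sd={stdev(validation_r2):.3f}")
# The gap between train and validation gives a rough sense of overfitting
print(f"gap (train - validation): {train_mean - validation_mean:.3f}")
```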
