### Sample notebook for multiple linear regression (MLR) for Abalone data 
This notebook walks through an example procedure for multiple regression analysis of the abalone data.  

Data: https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/  
Abalone data (modified: some values are replaced by N/A):

| Name | Data type | Unit | Description |
|---|---|---|---|
| Sex | nominal | -- | M, F, and I (infant) |
| Length | continuous | mm | Longest shell measurement |
| Diameter | continuous | mm | Perpendicular to length |
| Height | continuous | mm | With meat in shell |
| Whole weight | continuous | grams | Whole abalone |
| Shucked weight | continuous | grams | Weight of meat |
| Viscera weight | continuous | grams | Gut weight (after bleeding) |
| Shell weight | continuous | grams | After being dried |
| Rings | integer | -- | +1.5 gives the age in years |

#### Import libraries  

In [None]:
import sys
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import preprocessing
import statsmodels.api as sm
import statsmodels.formula.api as smf

#### Parameters

In [None]:
%config InlineBackend.figure_formats = {'png', 'retina'} #high-reso images
plt.rcParams['font.family'] = 'Yu Mincho' # for Japanese in graph (Win10)

# To show all rows and columns in the results 
pd.options.display.max_rows = 999
pd.options.display.max_columns = 999

#### Step 1. Collect possible explanatory variables  
Include, as exhaustively as possible, every factor that may affect the objective variable as an explanatory variable.  

##### Check & read CSV file, replace column labels if needed, etc.  
Inspect the CSV file before reading it, and rename the column labels if needed.  
If the CSV file is encoded in Shift-JIS, pass encoding='shift-jis' to pd.read_csv().  

In [None]:
csv_in = 'abalone_modified.csv'
df_all = pd.read_csv(csv_in, delimiter=',', skiprows=0, header=None)
# no header in csv, so set columns explicitly
df_all.columns = ['sex', 'len', 'd', 'h', 'w_all', 'w_meat', 'w_gut', 'w_shell', 'ring']
print(df_all.shape)
print(df_all.info())
display(df_all.head())
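
As a minimal sketch of the encoding note above, the snippet below reads a small Shift-JIS encoded CSV from memory. The byte string, the column names, and `df_demo` are hypothetical stand-ins, not the real abalone file.

```python
import io
import pandas as pd

# Hypothetical two-row file encoded in Shift-JIS (not the real abalone CSV)
raw = 'オス,0.455,15\nメス,0.530,9\n'.encode('shift-jis')

# These bytes are not valid UTF-8, so the encoding is passed explicitly
df_demo = pd.read_csv(io.BytesIO(raw), header=None, encoding='shift-jis')
df_demo.columns = ['sex', 'len', 'ring']  # no header row, so name columns explicitly
print(df_demo)
```

For a file on disk, the same `encoding` argument would be passed together with the file path.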

##### Check numerical / category variables if needed  
Inspect the numerical and categorical columns.  

In [None]:
display(df_all.describe())
display(df_all.describe(exclude='number'))

##### See rows including missing data  
Display the rows that contain missing values.  

In [None]:
display(df_all[df_all.isnull().any(axis=1)])

##### Delete rows including missing data, or fill missing data  
Delete the rows that contain missing values (or fill in the missing values instead).  

In [None]:
df_all = df_all.dropna().reset_index(drop=True)
#df_all = df_all.fillna(df_all.mean(numeric_only=True)) # alternative: fill numeric gaps with column means ('sex' is left as-is)
print(df_all.shape)
display(df_all.head())
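
The two options can be compared on a toy table. This is only a sketch: `df_demo` and its values are hypothetical, and filling with the column mean is one of several possible imputation choices.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with one missing number and one missing category
df_demo = pd.DataFrame({
    'sex': ['M', 'F', None, 'I'],
    'len': [0.40, np.nan, 0.50, 0.60],
    'ring': [10, 8, 9, 7],
})

# Option 1: drop every row that has at least one missing value
df_dropped = df_demo.dropna().reset_index(drop=True)

# Option 2: fill numeric gaps with the column mean; 'sex' stays missing
df_filled = df_demo.fillna(df_demo.mean(numeric_only=True))

print(df_dropped)
print(df_filled)
```

Note that mean filling does not touch categorical columns, so rows with a missing `sex` would still need to be dropped or handled separately.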

##### Separate explanatory variables and objective variable  
Split the data into explanatory variables and the objective variable.  

In [None]:
X_all_org = df_all.loc[:, 'sex':'w_shell']  # explanatory variables
#X_all_org = df_all.drop(columns='ring')  # alternative way
y = df_all['ring']  # objective variable
print('X_all_org:', X_all_org.shape)
display(X_all_org.head())
print('y:', y.shape)
print(y.head())

##### Apply get_dummies()  
Convert the categorical variable into dummy variables.  

In [None]:
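# dtype=float keeps the dummies numeric; recent pandas versions default to bool, which sm.OLS may reject
X_all = pd.get_dummies(X_all_org, drop_first=True, dtype=float)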
print('X_all:', X_all.shape)
display(X_all.head())
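
A small sketch of what `drop_first=True` does, on a hypothetical toy frame (`df_demo` is not part of the notebook data). The first category in sorted order (`F`) becomes the baseline and gets no column of its own.

```python
import pandas as pd

# Toy frame (hypothetical values) with one categorical and one numeric column
df_demo = pd.DataFrame({'sex': ['M', 'F', 'I', 'M'],
                        'len': [0.40, 0.50, 0.30, 0.60]})

# 'sex' expands to sex_I and sex_M; a row with both equal to 0 means sex == 'F'
dummies = pd.get_dummies(df_demo, drop_first=True, dtype=float)
print(dummies)
```

Dropping one level avoids a perfectly collinear set of dummy columns once a constant term is added to the regression.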

#### Step 2. Scatter plots and correlation coefficients between all combinations of explanatory variables  
Draw scatter plots for all pairs of variables, and compute the correlation coefficients as well.  

##### All-by-all Pearson correlation coefficients  
Pearson correlation coefficients for every pair of explanatory variables.  

In [None]:
corr_all = X_all.corr(method='pearson')
display(corr_all)

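##### Pick up explanatory variable pairs with a large absolute correlation coefficient  
Print the pairs of explanatory variables whose correlation coefficient is large in absolute value.  

The cell below is a rather tricky one-pass method: it keeps only the upper triangle of the correlation matrix (each pair once, diagonal excluded), sorts the pairs by absolute correlation, and prints those whose absolute value exceeds th_corr.  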

In [None]:
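th_corr = 0.3
# Boolean mask selecting the strict upper triangle (each pair once, no diagonal)
keep = np.triu(np.ones(corr_all.shape), k=1).astype('bool').flatten()
# stack() flattens the matrix row by row, so the mask lines up with its entries
triu = corr_all.stack()[keep]
# Sort pairs by absolute correlation, largest first
triu_sorted = triu[ np.abs(triu).sort_values(ascending=False).index ]
print(triu_sorted[ (triu_sorted < -th_corr) | (triu_sorted > th_corr) ])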
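
For readers who find the masking approach hard to follow, a more explicit alternative is sketched here. The helper name `strong_pairs` and the toy data are hypothetical. It loops over column pairs with `itertools.combinations` instead of masking the stacked matrix.

```python
from itertools import combinations
import pandas as pd

def strong_pairs(corr, threshold):
    """List variable pairs with |r| > threshold, strongest first."""
    rows = [(a, b, corr.loc[a, b]) for a, b in combinations(corr.columns, 2)]
    pairs = pd.DataFrame(rows, columns=['var1', 'var2', 'r'])
    pairs = pairs[pairs['r'].abs() > threshold]
    # Sort by absolute correlation, strongest first
    order = pairs['r'].abs().sort_values(ascending=False).index
    return pairs.loc[order].reset_index(drop=True)

# Toy data (hypothetical): b rises with a, c falls with a, d is uncorrelated with the rest
df_demo = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 5, 8, 10],
    'c': [5, 4, 3, 2, 1],
    'd': [1, -1, 0, -1, 1],
})
print(strong_pairs(df_demo.corr(method='pearson'), threshold=0.3))
```

Applied to the notebook data, the call would be `strong_pairs(corr_all, th_corr)`. The explicit loop is slower for very wide tables but easier to check by eye.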

##### All-by-all scatter plots of explanatory variables  
Scatter plots for all pairs of variables, drawn here with seaborn instead of matplotlib.  

In [None]:
sns.pairplot(X_all)
plt.show()

##### Heatmap  
A heatmap of the correlation matrix is another option.  

In [None]:
plt.figure(figsize=(10,10))
sns.heatmap(corr_all,annot=True,fmt='.2f',cmap='bwr')
plt.show()

#### Step 3. MLR calculation using all variables  
Fit a linear multiple regression using all explanatory variables, without standardization.  

In [None]:
X_all_c = sm.add_constant(X_all)
model = sm.OLS(y, X_all_c)
results = model.fit()
print(results.summary())
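
To make the role of the constant term concrete, here is a NumPy-only sketch of the same least-squares fit on synthetic data. The data, the true coefficients, and the variable names are assumptions made for illustration. It does not reproduce the statsmodels summary, only the coefficient estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))  # two synthetic explanatory variables
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=n)

# Prepend a column of ones for the intercept, as sm.add_constant does by default
X_c = np.column_stack([np.ones(n), X])

# Least-squares solution of y ~ X_c @ beta
beta, *_ = np.linalg.lstsq(X_c, y, rcond=None)
print(beta)  # close to the true values [1, 2, -3]
```

Without the column of ones, the fitted plane would be forced through the origin, which is rarely appropriate for unstandardized data.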

#### Step 4. Check R2 and Adjusted R2 to see whether MLR is appropriate for this data  
Look at the coefficient of determination (R2) and the adjusted R2 to judge whether fitting a linear model is appropriate in the first place.  

In [None]:
print('R2:', results.rsquared)
print('Adj R2:', results.rsquared_adj)
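
For reference, both quantities can be computed by hand. The sketch below uses the standard definitions for a model with an intercept. The helper name `r2_scores` and the sample values are hypothetical.

```python
import numpy as np

def r2_scores(y, y_pred, n_features):
    """Return (R2, adjusted R2) for a model with an intercept and n_features predictors."""
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y.size
    ss_res = np.sum((y - y_pred) ** 2)    # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    # The adjustment penalizes R2 for the number of explanatory variables
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)
    return r2, adj_r2

# Hypothetical observed and predicted values
y_obs = [3.0, 5.0, 7.0, 9.0, 11.0]
y_hat = [3.5, 4.5, 7.0, 9.5, 10.5]
r2, adj_r2 = r2_scores(y_obs, y_hat, n_features=1)
print(r2, adj_r2)
```

Adjusted R2 is never larger than R2, and the gap widens as more explanatory variables are added relative to the number of observations.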

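The fit is moderate: the linear model explains a fair share of the variance in ring, but a sizable part remains unexplained.  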

#### Step 5. Stat. test for MLR equation  
Test the regression equation as a whole (does the fitted MLR equation explain the objective variable?).  

In [None]:
print('p-value (F-statistic):', results.f_pvalue)

The p-value of the F-test is very small, so the MLR equation as a whole is considered statistically significant.  
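
The overall F-test can also be reconstructed from R2. This is a sketch under the usual OLS assumptions for a model with an intercept. The helper name `f_test_from_r2` and the input numbers are hypothetical, and it assumes SciPy is installed (statsmodels depends on it).

```python
from scipy import stats

def f_test_from_r2(r2, n_obs, n_features):
    """Overall F statistic and p-value for an OLS fit with an intercept."""
    df_model = n_features
    df_resid = n_obs - n_features - 1
    f_value = (r2 / df_model) / ((1.0 - r2) / df_resid)
    p_value = stats.f.sf(f_value, df_model, df_resid)  # upper-tail probability
    return f_value, p_value

# Hypothetical numbers: R2 = 0.5 from 100 observations and 4 explanatory variables
f_value, p_value = f_test_from_r2(0.5, n_obs=100, n_features=4)
print(f_value, p_value)
```

The null hypothesis of this test is that all slope coefficients are zero, so a small p-value only says that at least one explanatory variable is useful, not which one.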

#### Step 6-0. Standardization of variables   
Standardize all explanatory variables and the objective variable.

In [None]:
# NOTE: after scaling, X_scaled and y_scaled are ndarray, not DataFrame.
X_scaled = preprocessing.scale(X_all)
dfX_scaled = pd.DataFrame(X_scaled, columns=X_all.columns)
y_scaled = preprocessing.scale(y)
dfy_scaled = pd.Series(y_scaled, name=y.name)
# No constant is added: all variables are centered, so the intercept is 0
model = sm.OLS(dfy_scaled, dfX_scaled)
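
Standardization here means subtracting the column mean and dividing by the population standard deviation (ddof=0), which is the convention `preprocessing.scale` follows. A pandas-only sketch with a hypothetical helper `zscore` and toy values:

```python
import numpy as np
import pandas as pd

def zscore(df):
    """Standardize each column to mean 0 and population standard deviation 1."""
    return (df - df.mean()) / df.std(ddof=0)

# Toy columns on very different scales (hypothetical values)
df_demo = pd.DataFrame({'len': [0.2, 0.4, 0.6],
                        'w_all': [100.0, 300.0, 500.0]})
df_z = zscore(df_demo)
print(df_z)
```

After standardization the two columns are identical, which shows that the original units no longer influence the size of the regression coefficients.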

#### Step 6-1. Ridge regression    
L1_wt=0.0 gives a pure L2 (ridge) penalty.  

In [None]:
results_ridge = model.fit_regularized(L1_wt=0.0, alpha=0.1)
print(results_ridge.params)

#### Step 6-2. LASSO regression    
L1_wt=1.0 gives a pure L1 (LASSO) penalty.

In [None]:
results_lasso = model.fit_regularized(L1_wt=1.0, alpha=0.1)
print(results_lasso.params)

#### Step 6-3. Elastic Net regression    
**Elastic Net**

In [None]:
results_elastic = model.fit_regularized(L1_wt=0.5, alpha=0.1)
print(results_elastic.params)
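
As a hedged summary based on the statsmodels documentation for `OLS.fit_regularized`, the three fits minimize the same penalized objective. The symbols are notation introduced for this note: $n$ is the number of observations, $\beta$ the coefficient vector, $\mathrm{RSS}$ the residual sum of squares, $\alpha$ is `alpha`, and $w$ is `L1_wt`.

$$
\frac{1}{2n}\,\mathrm{RSS}(\beta) \;+\; \alpha\left(\frac{1-w}{2}\,\lVert\beta\rVert_2^2 \;+\; w\,\lVert\beta\rVert_1\right)
$$

Because $\alpha$ multiplies the whole penalty, a larger value shrinks all coefficients more strongly regardless of $w$.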

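#### Step 7. Path plot to see which explanatory variables keep large coefficients as alpha varies  
Explanatory variables with coefficients far from 0 have a larger effect on the objective variable.  
Vary alpha and examine which explanatory variables have large coefficients.  
Explanatory variables whose coefficients do not shrink to 0 even when alpha is increased can be judged to be important for the objective variable.  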

In [None]:
alphas = 10 ** np.linspace(-3, 3, num=50)
print(alphas)

##### Ridge

In [None]:
ret = []
for al in alphas:
    r = model.fit_regularized(L1_wt=0.0, alpha=al)
    ret.append(r.params)

df_ret = pd.DataFrame(ret, columns=dfX_scaled.columns, index=alphas)
display(df_ret.head())  # debug
df_ret.plot(figsize=(7,7))
plt.title('Path plot for Ridge')
plt.xlabel('Alpha')
plt.ylabel('Coefficients')
plt.xscale("log")
plt.grid(True)

##### LASSO

In [None]:
ret = []
for al in alphas:
    r = model.fit_regularized(L1_wt=1.0, alpha=al)
    ret.append(r.params)

df_ret = pd.DataFrame(ret, columns=dfX_scaled.columns, index=alphas)
display(df_ret.head())  # debug
df_ret.plot(figsize=(7,7))
plt.title('Path plot for LASSO')
plt.xlabel('Alpha')
plt.ylabel('Coefficients')
plt.xscale("log")
plt.grid(True)

##### Elastic Net

In [None]:
ret = []
for al in alphas:
    r = model.fit_regularized(L1_wt=0.5, alpha=al)
    ret.append(r.params)

df_ret = pd.DataFrame(ret, columns=dfX_scaled.columns, index=alphas)
display(df_ret.head())  # debug
df_ret.plot(figsize=(7,7))
plt.title('Path plot for Elastic Net (L1_wt=0.5)')
plt.xlabel('Alpha')
plt.ylabel('Coefficients')
plt.xscale("log")
plt.grid(True)