
# Multiple Linear Regression


## Multiple linear regression (multivariable regression)

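- Multiple linear regression is a statistical technique that estimates the effect of two or more independent variables on a single dependent variable.
- The regression equation is a first-degree polynomial: an intercept plus a sum of first-order terms, one per independent variable.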
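
Written out, the regression equation described above is $y = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + \varepsilon$. The following is a minimal sketch of estimating such an equation by ordinary least squares with NumPy. The synthetic data, the variable names, and the "true" coefficients are assumptions made up for illustration and are not part of the example dataset.

```python
import numpy as np

# Synthetic data (assumed for illustration): y = 1.0 + 2.0*x1 - 3.0*x2 + noise
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))
true_beta = np.array([1.0, 2.0, -3.0])      # [intercept, beta_1, beta_2]

# Prepend a column of ones so the first coefficient acts as the intercept
design = np.column_stack([np.ones(n), X])
y = design @ true_beta + rng.normal(scale=0.1, size=n)

# Ordinary least squares: find beta minimizing ||y - design @ beta||^2
beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
print(beta_hat)
```

With low noise, the estimates should land close to the assumed coefficients, which is a quick sanity check that the design matrix was built correctly.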



## Points to check in a multiple linear regression



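<b>A. Does the data satisfy the underlying assumptions?</b>
- Check that the relationship between the independent variables and the dependent variable is linear, and that the errors are independent, homoscedastic (constant variance), and normally distributed.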


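<b>B. Is the regression model statistically significant?</b>
- If the p-value of the F-statistic reported by the regression is below 0.05, the regression equation as a whole can be considered statistically significant at the 5% level.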
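
As a rough sketch, the overall F-test can be read directly from a fitted statsmodels result through its `fvalue` and `f_pvalue` attributes. The DataFrame below is synthetic and its column names (`x1`, `x2`, `y`) are assumptions for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (assumed): y depends linearly on x1 and x2
rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = 3.0 + 2.0 * df["x1"] - 1.0 * df["x2"] + rng.normal(size=n)

result = smf.ols("y ~ x1 + x2", data=df).fit()

# Overall significance: H0 says every slope coefficient is zero
alpha = 0.05
model_is_significant = result.f_pvalue < alpha
print(f"F = {result.fvalue:.1f}, p = {result.f_pvalue:.3g}")
print("Model significant at 5%:", model_is_significant)
```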

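<b>C. How much of the data does the model explain?</b>
- Check the adjusted coefficient of determination (adjusted R^2), which penalizes R^2 for the number of independent variables.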
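
The adjusted R^2 is commonly defined as $1 - (1 - R^2)\frac{n - 1}{n - k - 1}$, where $n$ is the number of observations and $k$ the number of predictors. The helper below is a hypothetical function written for illustration; the inputs are the R-squared, No. Observations, and Df Model values shown in the regression summary of this example.

```python
def adjusted_r2(r2: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Values read off the summary: R-squared 0.654, 21613 observations, Df Model 17
adj = adjusted_r2(r2=0.654, n_obs=21613, n_predictors=17)
print(round(adj, 3))
```

With this many observations the penalty is tiny, which is why the summary reports the same value for R-squared and Adj. R-squared to three decimals.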

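<b>D. Are the regression coefficients in the model significant?</b>
- As in simple (single-predictor) regression, a regression coefficient can be considered statistically significant if the p-value of its t-statistic is below 0.05. <br>In multiple regression, however, the significance of every coefficient should be tested before the regression equation is interpreted. <br>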
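
A minimal sketch of this per-coefficient check reads the t-test p-values from `result.pvalues` and lists the terms that fail the 5% threshold. The data is synthetic; the column `noise` is an assumed predictor with no real effect, included only so that there is something the test may flag.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (assumed): y depends on x1 only; "noise" is unrelated to y
rng = np.random.default_rng(7)
n = 300
df = pd.DataFrame({"x1": rng.normal(size=n), "noise": rng.normal(size=n)})
df["y"] = 1.0 + 2.0 * df["x1"] + rng.normal(size=n)

result = smf.ols("y ~ x1 + noise", data=df).fit()

# One t-test p-value per term, indexed by term name
alpha = 0.05
pvalues = result.pvalues
not_significant = pvalues[pvalues >= alpha].index.tolist()

print(pvalues.round(4))
print("Not significant at 5%:", not_significant)
```

In the house price summary of this example, the same check would single out `sqft_lot`, whose reported p-value is 0.925.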










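<b>E. Does the model fit the data well?</b>
- Draw a scatter plot of the model residuals against the dependent variable (usually its fitted values) and run regression diagnostics to judge the fit.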



### 3. Dummy variables

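<b>A. Converting categorical variables</b>
- Regression works on numeric (continuous) variables, so categorical data must be re-encoded before it can enter the model.

- A dummy variable takes only the values 0 or 1 and indicates whether an observation has a given characteristic.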
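
The conversion can be sketched with `pandas.get_dummies` on a small toy frame. The prices below are made up for illustration; the two category labels mirror those of the `waterfront` column in the example. The `drop_first=True` variant is shown as an option because, when the model includes an intercept, a category with k levels needs only k-1 dummy columns.

```python
import pandas as pd

# Toy data (assumed values) with one categorical column
toy = pd.DataFrame({
    "price": [200000.0, 500000.0, 180000.0, 1200000.0],
    "waterfront": ["standard", "standard", "standard", "river_view"],
})

# One 0/1 column per category level
full = pd.get_dummies(toy, columns=["waterfront"], dtype=int)

# Drop the first level (alphabetically) to keep k-1 columns;
# the dropped level becomes the baseline absorbed by the intercept
reduced = pd.get_dummies(toy, columns=["waterfront"], drop_first=True, dtype=int)

print(full)
print(reduced)
```

Passing `dtype=int` keeps the 0/1 representation regardless of the pandas version, since newer versions default to boolean dummy columns.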




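<b>[Example]</b>

Using the kc_house_data dataset, set price as the dependent variable and the 15 columns that remain after removing date and id as the independent variables. Fit a multiple linear regression and interpret the estimated model.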

In [1]:
import pandas as pd 
import numpy as np 
house = pd.read_csv('../data/kc_house_data.csv')
house.head()

| | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7129300520 | 20141013T000000 | 221900.0 | 3 | 1.0 | 1180 | 5650 | 1.0 | standard | 0 | 3 | 7 | 1180 | 0 | 1955 | 0 | 1340 | 5650 |
| 1 | 6414100192 | 20141209T000000 | 538000.0 | 3 | 2.25 | 2570 | 7242 | 2.0 | standard | 0 | 3 | 7 | 2170 | 400 | 1951 | 1991 | 1690 | 7639 |
| 2 | 5631500400 | 20150225T000000 | 180000.0 | 2 | 1.0 | 770 | 10000 | 1.0 | standard | 0 | 3 | 6 | 770 | 0 | 1933 | 0 | 2720 | 8062 |
| 3 | 2487200875 | 20141209T000000 | 604000.0 | 4 | 3.0 | 1960 | 5000 | 1.0 | standard | 0 | 5 | 7 | 1050 | 910 | 1965 | 0 | 1360 | 5000 |
| 4 | 1954400510 | 20150218T000000 | 510000.0 | 3 | 2.0 | 1680 | 8080 | 1.0 | standard | 0 | 3 | 8 | 1680 | 0 | 1987 | 0 | 1800 | 7503 |


In [2]:
house.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             21613 non-null  int64  
 1   date           21613 non-null  object 
 2   price          21613 non-null  float64
 3   bedrooms       21613 non-null  int64  
 4   bathrooms      21613 non-null  float64
 5   sqft_living    21613 non-null  int64  
 6   sqft_lot       21613 non-null  int64  
 7   floors         21613 non-null  float64
 8   waterfront     21613 non-null  object 
 9   view           21613 non-null  int64  
 10  condition      21613 non-null  int64  
 11  grade          21613 non-null  int64  
 12  sqft_above     21613 non-null  int64  
 13  sqft_basement  21613 non-null  int64  
 14  yr_built       21613 non-null  int64  
 15  yr_renovated   21613 non-null  int64  
 16  sqft_living15  21613 non-null  int64  
 17  sqft_lot15     21613 non-null  int64  
dtypes: flo

In [3]:
house = house.drop(['id','date'],axis=1)
house.head()

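| | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 221900.0 | 3 | 1.0 | 1180 | 5650 | 1.0 | standard | 0 | 3 | 7 | 1180 | 0 | 1955 | 0 | 1340 | 5650 |
| 1 | 538000.0 | 3 | 2.25 | 2570 | 7242 | 2.0 | standard | 0 | 3 | 7 | 2170 | 400 | 1951 | 1991 | 1690 | 7639 |
| 2 | 180000.0 | 2 | 1.0 | 770 | 10000 | 1.0 | standard | 0 | 3 | 6 | 770 | 0 | 1933 | 0 | 2720 | 8062 |
| 3 | 604000.0 | 4 | 3.0 | 1960 | 5000 | 1.0 | standard | 0 | 5 | 7 | 1050 | 910 | 1965 | 0 | 1360 | 5000 |
| 4 | 510000.0 | 3 | 2.0 | 1680 | 8080 | 1.0 | standard | 0 | 3 | 8 | 1680 | 0 | 1987 | 0 | 1800 | 7503 |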


In [5]:
pd.get_dummies(house,columns=['waterfront'])

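| | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | sqft_living15 | sqft_lot15 | waterfront_river_view | waterfront_standard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 221900.0 | 3 | 1.00 | 1180 | 5650 | 1.0 | 0 | 3 | 7 | 1180 | 0 | 1955 | 0 | 1340 | 5650 | 0 | 1 |
| 1 | 538000.0 | 3 | 2.25 | 2570 | 7242 | 2.0 | 0 | 3 | 7 | 2170 | 400 | 1951 | 1991 | 1690 | 7639 | 0 | 1 |
| 2 | 180000.0 | 2 | 1.00 | 770 | 10000 | 1.0 | 0 | 3 | 6 | 770 | 0 | 1933 | 0 | 2720 | 8062 | 0 | 1 |
| 3 | 604000.0 | 4 | 3.00 | 1960 | 5000 | 1.0 | 0 | 5 | 7 | 1050 | 910 | 1965 | 0 | 1360 | 5000 | 0 | 1 |
| 4 | 510000.0 | 3 | 2.00 | 1680 | 8080 | 1.0 | 0 | 3 | 8 | 1680 | 0 | 1987 | 0 | 1800 | 7503 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 21608 | 360000.0 | 3 | 2.50 | 1530 | 1131 | 3.0 | 0 | 3 | 8 | 1530 | 0 | 2009 | 0 | 1530 | 1509 | 0 | 1 |
| 21609 | 400000.0 | 4 | 2.50 | 2310 | 5813 | 2.0 | 0 | 3 | 8 | 2310 | 0 | 2014 | 0 | 1830 | 7200 | 0 | 1 |
| 21610 | 402101.0 | 2 | 0.75 | 1020 | 1350 | 2.0 | 0 | 3 | 7 | 1020 | 0 | 2009 | 0 | 1020 | 2007 | 0 | 1 |
| 21611 | 400000.0 | 3 | 2.50 | 1600 | 2388 | 2.0 | 0 | 3 | 8 | 1600 | 0 | 2004 | 0 | 1410 | 1287 | 0 | 1 |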


In [6]:
house = pd.get_dummies(house,columns=['waterfront'])

In [7]:
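# Build the formula string "price ~ x1 + x2 + ..." from every column except price
ols_str = 'price ~  '
for i in house.columns.drop('price'): 
    ols_str =  ols_str + i +" + " 

# The loop leaves a trailing " + " that still has to be stripped
print(ols_str)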

price ~  bedrooms + bathrooms + sqft_living + sqft_lot + floors + view + condition + grade + sqft_above + sqft_basement + yr_built + yr_renovated + sqft_living15 + sqft_lot15 + waterfront_river_view + waterfront_standard + 


In [8]:
ols_str[:-3]

'price ~  bedrooms + bathrooms + sqft_living + sqft_lot + floors + view + condition + grade + sqft_above + sqft_basement + yr_built + yr_renovated + sqft_living15 + sqft_lot15 + waterfront_river_view + waterfront_standard'

In [9]:
ols_str = ols_str[:-3]
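
As an alternative to appending `" + "` in a loop and slicing off the last three characters, the same kind of formula string can be assembled with `str.join`, which never produces a trailing separator. The short column list below is a hypothetical stand-in for `house.columns`, used so that the sketch is self-contained.

```python
import pandas as pd

# Hypothetical stand-in for house.columns (only a few columns, for brevity)
columns = pd.Index(["price", "bedrooms", "bathrooms", "sqft_living"])

# Every column except the dependent variable becomes a predictor
predictors = columns.drop("price")

# join() puts " + " between terms only, so nothing needs trimming afterwards
formula = "price ~ " + " + ".join(predictors)
print(formula)
```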

In [37]:
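import statsmodels.api as sm
import statsmodels.formula.api as smf

# Fit an ordinary least squares model from the formula string built above
model = smf.ols(formula = ols_str, data = house)
result = model.fit()
result.summary()

## An error can occur if a categorical variable is added to the model as-is
## It needs to be converted to dummy variables first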

| | | | |
|---|---|---|---|
| Dep. Variable: | price | R-squared: | 0.654 |
| Model: | OLS | Adj. R-squared: | 0.654 |
| Method: | Least Squares | F-statistic: | 2401.0 |
| Date: | Fri, 19 Nov 2021 | Prob (F-statistic): | 0.0 |
| Time: | 07:37:51 | Log-Likelihood: | -296140.0 |
| No. Observations: | 21613 | AIC: | 592300.0 |
| Df Residuals: | 21595 | BIC: | 592500.0 |
| Df Model: | 17 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 5.673e+06 | 1.15e+05 | 49.344 | 0.000 | 5.45e+06 | 5.9e+06 |
| waterfront[T.standard] | -5.786e+05 | 1.86e+04 | -31.053 | 0.000 | -6.15e+05 | -5.42e+05 |
| bedrooms | -3.907e+04 | 2026.990 | -19.277 | 0.000 | -4.3e+04 | -3.51e+04 |
| bathrooms | 4.5e+04 | 3497.863 | 12.865 | 0.000 | 3.81e+04 | 5.19e+04 |
| sqft_living | 109.2180 | 2.434 | 44.866 | 0.000 | 104.447 | 113.990 |
| sqft_lot | -0.0048 | 0.051 | -0.094 | 0.925 | -0.105 | 0.096 |
| floors | 2.608e+04 | 3797.448 | 6.867 | 0.000 | 1.86e+04 | 3.35e+04 |
| view | 4.328e+04 | 2272.680 | 19.045 | 0.000 | 3.88e+04 | 4.77e+04 |
| grade | 1.201e+05 | 2254.694 | 53.270 | 0.000 | 1.16e+05 | 1.25e+05 |

| | | | |
|---|---|---|---|
| Omnibus: | 16350.766 | Durbin-Watson: | 1.979 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 1202226.985 |
| Skew: | 3.034 | Prob(JB): | 0.0 |
| Kurtosis: | 39.03 | Cond. No. | 9.73e+19 |
