# Prediction

## Correlation
* Two variables are related to each other

### Scatterplot
* Visualizes the relationship between two continuous variables
* Each dot represents one observation

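### Covariance
* Quantifies the relationship between two continuous variables
* Positive when the two variables change in the same direction, negative when they change in opposite directions
* The stronger the tendency to vary together, the larger the absolute value (the magnitude also depends on the units of the variables)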
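
The sign and scale behaviour described above can be checked on a few numbers. This is a minimal sketch on made-up toy data (not `cars.csv`); the helper name `covariance` is hypothetical.

```python
import numpy as np

# Hypothetical toy data (not from cars.csv): y tends to rise with x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

def covariance(a, b):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.sum((a - a.mean()) * (b - b.mean())) / (len(a) - 1)

cov_same = covariance(x, y)          # positive: x and y move in the same direction
cov_opposite = covariance(x, -y)     # negative: flipping y reverses the direction
cov_scaled = covariance(x, 10 * y)   # rescaling y by 10 multiplies the covariance by 10
print(cov_same, cov_opposite, cov_scaled)
```

Because rescaling a variable changes the covariance, its magnitude is hard to compare across data sets, which is what the correlation coefficient addresses.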

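### Correlation Coefficient
* Covariance divided by the product of the two variables' standard deviations
* Always in the range -1 to +1

* +1: the variables move in perfectly the same direction
* 0: no linear relationship
* -1: the variables move in perfectly opposite directions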
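
As a rough illustration of the definition, the coefficient can be computed directly from the covariance and the standard deviations. The data are made up and the helper name `pearson_r` is hypothetical; `np.corrcoef` is used only as a cross-check.

```python
import numpy as np

# Hypothetical toy data (not from cars.csv)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

def pearson_r(a, b):
    """Covariance divided by the product of the sample standard deviations."""
    cov = np.cov(a, b, ddof=1)[0, 1]
    return cov / (np.std(a, ddof=1) * np.std(b, ddof=1))

r = pearson_r(x, y)
r_scaled = pearson_r(x, 10 * y)       # unchanged when y is rescaled
r_perfect = pearson_r(x, 2 * x + 1)   # exact increasing line: +1
r_inverse = pearson_r(x, -3 * x)      # exact decreasing line: -1
print(r, r_scaled, r_perfect, r_inverse)
```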

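### Spurious Correlation
* A correlation that shows up in the data even though the two variables are not actually related
* More likely to appear when the data set is small
* Remedy: check the confidence interval of the correlation coefficient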
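
The effect of sample size can be seen in a small simulation. This is a sketch under assumptions: the two variables are drawn independently from a normal distribution, and the function name, the threshold of 0.5, and the sample sizes are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

def share_of_large_r(n, trials=2000, threshold=0.5):
    """Fraction of trials in which two independent samples of size n show |r| > threshold."""
    hits = 0
    for _ in range(trials):
        x = rng.normal(size=n)
        y = rng.normal(size=n)  # drawn independently of x: the true correlation is 0
        r = np.corrcoef(x, y)[0, 1]
        hits += int(abs(r) > threshold)
    return hits / trials

small = share_of_large_r(n=5)    # tiny samples: large |r| appears often by chance
large = share_of_large_r(n=100)  # larger samples: large |r| almost never appears by chance
print(small, large)
```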

In [2]:
import pandas as pd

In [3]:
cars = pd.read_csv('./cars.csv')

In [4]:
cars.head()

Unnamed: 0.1,Unnamed: 0,speed,dist
0,1,4,2
1,2,4,10
2,3,7,4
3,4,7,22
4,5,8,16


### Checking the correlation coefficient

In [6]:
from scipy.stats import pearsonr

In [8]:
pearsonr(cars['speed'], cars['dist'])

(0.8068949006892105, 1.4898364962950763e-12)

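Interpretation
* r ≈ 0.807 is a strong positive correlation, and the p-value (about 1.5e-12) is far below 0.05, so a correlation this large would be extremely unlikely if the true correlation were zero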
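
To see where such a small p-value comes from, the correlation can be converted into a t statistic with n - 2 degrees of freedom, which is the usual test of r against zero. A minimal sketch, assuming n = 50 (the number of observations reported in the regression summary later in this notebook):

```python
import math

r = 0.8068949006892105  # correlation reported by pearsonr above
n = 50                  # assumed number of rows in the cars data

# t statistic for testing r against 0, with n - 2 degrees of freedom
t = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)
print(round(t, 3))  # prints 9.464
```

A t value above 9 with 48 degrees of freedom lies far out in the tail of the distribution, which is why the p-value is so small.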

In [18]:
from sklearn.utils import resample

In [None]:
df = resample(cars)

In [19]:
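cors = []                                                # start with an empty list
for _ in range(10000):                                   # repeat 10,000 times
    df = resample(cars)                                  # bootstrap resample (rows drawn with replacement)
    res = pearsonr(df['speed'], df['dist'])              # compute the correlation coefficient
    cors.append(res[0])                                  # append the coefficient; [0] is the coefficient, [1] is the p-value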

In [20]:
import numpy

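In [21]:
numpy.quantile(cors, [0.025, 0.975]) # 95% confidence interval of the correlation coefficient
# Interpretation: with 95% confidence, the correlation coefficient (point estimate 0.8069) does not fall below about 0.6998

array([0.69981158, 0.74387962])

The output above was recorded with the second quantile typed as 0.0975 instead of 0.975, so 0.7439 is the 9.75th percentile rather than the upper bound; only the lower bound (0.6998, the 2.5th percentile) is a valid interval endpoint.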
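
The resampling loop and the percentile step can be wrapped into one function. The following is a sketch using only NumPy rather than `sklearn.utils.resample`; the function name `bootstrap_corr_ci` and the synthetic `speed`/`dist` arrays are hypothetical stand-ins, since `cars.csv` is not reproduced here.

```python
import numpy as np

def bootstrap_corr_ci(x, y, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the Pearson correlation."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(x)
    cors = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)  # draw n row indices with replacement
        cors[i] = np.corrcoef(x[idx], y[idx])[0, 1]
    alpha = 1 - level
    lower, upper = np.quantile(cors, [alpha / 2, 1 - alpha / 2])
    return lower, upper

# Hypothetical synthetic data standing in for cars.csv
rng = np.random.default_rng(1)
speed = rng.uniform(4, 25, size=50)
dist = 4 * speed + rng.normal(scale=15, size=50)

lower, upper = bootstrap_corr_ci(speed, dist)
print(lower, upper)
```

Deriving both quantiles from a single `level` argument avoids typing the two tail probabilities by hand.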

### Pearson correlation coefficient

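### Spearman correlation coefficient
* Pearson's correlation coefficient computed on the ranks instead of the actual variable values
* Indicates whether the rank of one variable rises as the rank of the other rises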
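
The "Pearson on ranks" description can be verified directly. A small sketch on made-up data that is monotonic but strongly non-linear; `rankdata` from `scipy.stats` supplies the ranks.

```python
import numpy as np
from scipy.stats import pearsonr, rankdata, spearmanr

# Hypothetical toy data: y always increases with x, but not along a line
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.exp(x)

rho_from_ranks = pearsonr(rankdata(x), rankdata(y))[0]  # Pearson on the ranks
rho_scipy = spearmanr(x, y)[0]                          # Spearman directly
r_raw = pearsonr(x, y)[0]                               # Pearson on the raw values

print(rho_from_ranks, rho_scipy, r_raw)
```

The two Spearman values agree and equal 1 because the ordering is perfectly preserved, while Pearson on the raw values stays below 1 because the relationship is not linear.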

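### Kendall correlation coefficient
* Quantifies the relationship between ranks, like Spearman's coefficient, but is computed differently
* Used when the data set is small and has many ties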
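
One way to see how the calculation differs is to count concordant and discordant pairs by hand. This sketch implements the simplest variant (tau-a, with no tie correction); the function name and the data are hypothetical. To my knowledge `scipy.stats.kendalltau` defaults to the tie-corrected tau-b, which matches tau-a only when there are no ties.

```python
from itertools import combinations

def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            concordant += 1   # both variables order the pair the same way
        elif s < 0:
            discordant += 1   # the variables order the pair in opposite ways
        # s == 0 is a tie in x or y; the pair counts toward neither
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical ranks with two swapped neighbours
x = [1, 2, 3, 4, 5]
y = [1, 3, 2, 5, 4]
tau = kendall_tau_a(x, y)
print(tau)  # 8 concordant and 2 discordant pairs out of 10
```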

In [25]:
liar = pd.read_csv('./liar.csv')

In [26]:
liar.head()

Unnamed: 0,Creativity,Position,Novice
0,53,1,0
1,36,3,1
2,31,4,0
3,43,2,0
4,30,4,1


In [27]:
from scipy.stats import spearmanr, kendalltau

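Creativity and ranking in the lying contest (Position) are negatively correlated: higher creativity goes with a numerically lower (better) rank, so the more creative a person is, the better they lie.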

In [28]:
spearmanr(liar['Creativity'], liar['Position'])

SpearmanrResult(correlation=-0.37321838128767815, pvalue=0.0017204168895658578)

In [29]:
kendalltau(liar['Creativity'], liar['Position'])

KendalltauResult(correlation=-0.3002413080651747, pvalue=0.001258802279346817)

In [30]:
pearsonr(liar['Creativity'], liar['Position'])

(-0.30603143483570205, 0.01114802877289378)

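## Regression
* Broadest sense: predicting y from x
* Intermediate sense: the case where y is continuous (when y is categorical, the task is classification)
* Narrow sense: linear regression (regression using a linear model)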
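
For the narrow sense with one predictor, the least-squares line has a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the means. A minimal sketch on synthetic data (the true line 2 + 3x and the noise level are assumptions made for illustration):

```python
import numpy as np

# Hypothetical synthetic data: y is roughly 2 + 3x plus noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 + 3 * x + rng.normal(scale=1.0, size=x.size)

# Closed-form least-squares estimates for y = intercept + slope * x
slope = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
intercept = y.mean() - slope * x.mean()

y_hat = intercept + slope * x  # predictions from the fitted line
print(intercept, slope)
```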

### Regression on the cars data

In [33]:
from statsmodels.formula.api import ols

In [34]:
res = ols('speed ~ dist', data = cars).fit()

In [35]:
res.summary()

Dep. Variable:,speed,R-squared:,0.651
Model:,OLS,Adj. R-squared:,0.644
Method:,Least Squares,F-statistic:,89.57
Date:,"Mon, 16 Sep 2019",Prob (F-statistic):,1.49e-12
Time:,13:51:44,Log-Likelihood:,-127.39
No. Observations:,50,AIC:,258.8
Df Residuals:,48,BIC:,262.6
Df Model:,1,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,8.2839,0.874,9.474,0.000,6.526,10.042
dist,0.1656,0.017,9.464,0.000,0.130,0.201

Omnibus:,0.72,Durbin-Watson:,1.195
Prob(Omnibus):,0.698,Jarque-Bera (JB):,0.827
Skew:,-0.207,Prob(JB):,0.661
Kurtosis:,2.526,Cond. No.,98.0
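
With a single predictor, several numbers in this summary follow from the correlation computed earlier: R-squared is r squared, the F-statistic is the square of the slope's t value, and adjusted R-squared applies a penalty for the number of predictors. A quick check using only values printed in this notebook (the t value is rounded to three decimals in the summary, so the F comparison is approximate):

```python
r = 0.8068949006892105  # pearsonr(cars['speed'], cars['dist'])[0] from earlier
t_dist = 9.464          # t value of the dist coefficient in the summary above
n, p = 50, 1            # observations and number of predictors

r_squared = r**2
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - p - 1)
f_stat = t_dist**2

print(round(r_squared, 3), round(adj_r_squared, 3), round(f_stat, 2))
# prints 0.651 0.644 89.57
```

The same link explains why Prob (F-statistic) in the summary (1.49e-12) equals the p-value returned by `pearsonr`.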


In [37]:
child = pd.read_csv('./child.csv')

In [38]:
child.head()

Unnamed: 0,Aggression,Television,Computer_Games,Sibling_Aggression,Diet,Parenting_Style
0,0.37416,0.172671,0.141907,-0.328216,-0.110303,-0.279034
1,0.771153,-0.032872,0.709918,0.576837,-0.02299,-1.248167
2,-0.097728,-0.07446,-0.390141,-0.217184,0.280301,-0.328063
3,0.015935,-0.004427,-0.40808,0.046223,-0.263479,-1.005119
4,-0.275385,-0.675239,-0.277778,-0.891045,0.226581,0.489478


In [40]:
res = ols('Aggression ~ Computer_Games', child).fit()

In [41]:
res.summary()

Dep. Variable:,Aggression,R-squared:,0.035
Model:,OLS,Adj. R-squared:,0.033
Method:,Least Squares,F-statistic:,23.9
Date:,"Mon, 16 Sep 2019",Prob (F-statistic):,1.27e-06
Time:,14:58:42,Log-Likelihood:,-172.63
No. Observations:,666,AIC:,349.3
Df Residuals:,664,BIC:,358.3
Df Model:,1,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-0.0068,0.012,-0.560,0.576,-0.031,0.017
Computer_Games,0.1742,0.036,4.889,0.000,0.104,0.244

Omnibus:,25.478,Durbin-Watson:,1.929
Prob(Omnibus):,0.0,Jarque-Bera (JB):,66.334
Skew:,-0.011,Prob(JB):,3.94e-15
Kurtosis:,4.546,Cond. No.,2.93


In [42]:
res2 = ols('Aggression ~ Television', child).fit()

In [43]:
res2.summary()

Dep. Variable:,Aggression,R-squared:,0.025
Model:,OLS,Adj. R-squared:,0.024
Method:,Least Squares,F-statistic:,17.11
Date:,"Mon, 16 Sep 2019",Prob (F-statistic):,3.98e-05
Time:,15:12:22,Log-Likelihood:,-175.93
No. Observations:,666,AIC:,355.9
Df Residuals:,664,BIC:,364.9
Df Model:,1,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-0.0005,0.012,-0.041,0.967,-0.025,0.024
Television,0.1634,0.040,4.137,0.000,0.086,0.241

Omnibus:,24.471,Durbin-Watson:,1.931
Prob(Omnibus):,0.0,Jarque-Bera (JB):,58.038
Skew:,0.108,Prob(JB):,2.5e-13
Kurtosis:,4.43,Cond. No.,3.23


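Interpretation:

* If both ends of a coefficient's 95% confidence interval are positive (the interval excludes 0), its p-value is below 0.05
* R-squared: 3.5% of the variance in children's aggression is explained by computer games
* F-statistic: Prob (F-statistic) < 0.05 means the model as a whole is statistically significant, i.e. there is enough data to distinguish it from a model with no predictors
* AIC, BIC: smaller is better (the Computer_Games model is better)
* Adjusted R-squared: larger is better (the Computer_Games model is better)
* If AIC/BIC and adjusted R-squared disagree, collect more data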
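
AIC and BIC can be recomputed from the log-likelihood printed in each summary, which makes it clearer what "smaller is better" is measuring: a lack-of-fit term plus a penalty for the number of parameters. A sketch assuming the standard definitions, with k counting the intercept as a parameter; the helper names `aic` and `bic` are hypothetical.

```python
import math

def aic(log_likelihood, k):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    """Bayesian information criterion: k ln n - 2 ln L."""
    return k * math.log(n) - 2 * log_likelihood

n = 666  # No. Observations in the summaries above
models = {
    "Computer_Games": {"llf": -172.63, "k": 2},  # intercept + one slope
    "Television": {"llf": -175.93, "k": 2},
}
results = {
    name: (round(aic(m["llf"], m["k"]), 1), round(bic(m["llf"], m["k"], n), 1))
    for name, m in models.items()
}
print(results)
```

The results reproduce the AIC and BIC values shown in the two summaries (349.3 / 358.3 and 355.9 / 364.9).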

In [57]:
res3 = ols('Aggression ~ Television + Computer_Games', child).fit()
res3.summary()

Dep. Variable:,Aggression,R-squared:,0.051
Model:,OLS,Adj. R-squared:,0.049
Method:,Least Squares,F-statistic:,17.99
Date:,"Mon, 16 Sep 2019",Prob (F-statistic):,2.45e-08
Time:,15:30:26,Log-Likelihood:,-166.81
No. Observations:,666,AIC:,339.6
Df Residuals:,663,BIC:,353.1
Df Model:,2,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-0.0029,0.012,-0.237,0.813,-0.027,0.021
Television,0.1353,0.040,3.420,0.001,0.058,0.213
Computer_Games,0.1539,0.036,4.293,0.000,0.083,0.224

Omnibus:,24.166,Durbin-Watson:,1.934
Prob(Omnibus):,0.0,Jarque-Bera (JB):,57.964
Skew:,0.091,Prob(JB):,2.59e-13
Kurtosis:,4.434,Cond. No.,3.42
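
The selection criteria from the interpretation notes can be applied to all three fitted models at once. The sketch below only restates values copied from the three summaries in this notebook; the dictionary layout is an illustrative choice.

```python
# Values copied from the three regression summaries above
models = {
    "Computer_Games": {"adj_r2": 0.033, "aic": 349.3, "bic": 358.3},
    "Television": {"adj_r2": 0.024, "aic": 355.9, "bic": 364.9},
    "Television + Computer_Games": {"adj_r2": 0.049, "aic": 339.6, "bic": 353.1},
}

best_by_adj_r2 = max(models, key=lambda name: models[name]["adj_r2"])  # larger is better
best_by_aic = min(models, key=lambda name: models[name]["aic"])        # smaller is better
best_by_bic = min(models, key=lambda name: models[name]["bic"])        # smaller is better

print(best_by_adj_r2, best_by_aic, best_by_bic, sep="\n")
```

Here all three criteria agree on the two-predictor model, so the "collect more data" rule for conflicting criteria does not come into play.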
