# Correlation Between the Opening Goal and the Match Result

## Basic Concept
* Hypothesis: whether a team scores the opening goal affects the match result

## Data Used
* Opening-goal records extracted from the 2013-2017 goal records
* Each match is viewed from the Home team's perspective

## Data Extraction

In [2]:
import db_conn
import pandas as pd
import numpy as np
import copy
import scipy.stats as st

##### goal_records Table

In [3]:
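# Caveat (assuming MySQL): goal_time and score_team_id are not aggregated, so the query relies on the
# server returning them from the same row as MIN(id). With ONLY_FULL_GROUP_BY disabled, the server may pick any row of the group.
sql = """SELECT CONCAT(year, "-", division, "-", LPAD(match_id, 3, "0")) as match_id, opening_goal, goal_time, score_team_id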
FROM (
SELECT year, division, goal.match_id, min(id) as opening_goal, (half_type - 1) *45 + play_time as goal_time,
  (SELECT team_id FROM team_info WHERE team_name = score_team) score_team_id
FROM goal_records as goal
GROUP BY year, division, match_id) as t"""
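
The query treats the goal with the smallest `id` in each match as the opening goal. As a rough cross-check of that selection logic outside the database, the sketch below picks the first goal per match in plain Python. The sample rows and the helper name `opening_goals` are hypothetical; the field names simply mirror the `goal_records` columns used in the query.

```python
# Hypothetical sample rows mirroring the goal_records columns used in the query.
goal_rows = [
    {"id": 2, "year": 2013, "division": 1, "match_id": 1, "half_type": 1, "play_time": 29, "score_team": "A"},
    {"id": 3, "year": 2013, "division": 1, "match_id": 1, "half_type": 2, "play_time": 10, "score_team": "B"},
    {"id": 6, "year": 2013, "division": 1, "match_id": 2, "half_type": 1, "play_time": 4, "score_team": "C"},
]


def opening_goals(rows):
    """Return one record per match: the goal with the smallest id."""
    first = {}
    for row in rows:
        key = (row["year"], row["division"], row["match_id"])
        if key not in first or row["id"] < first[key]["id"]:
            first[key] = row

    result = []
    for (year, division, match_id), row in sorted(first.items()):
        result.append({
            "game_id": f"{year}-{division}-{match_id:03d}",
            "opening_goal": row["id"],
            # Same minute formula as the SQL: second-half goals are offset by 45
            "goal_time": (row["half_type"] - 1) * 45 + row["play_time"],
            "score_team": row["score_team"],
        })
    return result


opening = opening_goals(goal_rows)
for record in opening:
    print(record)
```

Taking every field from the single row that holds the minimum `id` keeps `goal_time` and `score_team` consistent with `opening_goal`, which is the property the SQL version depends on.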

In [4]:
# Run the SQL query
opening_goal_list = db_conn.select_query(sql)

In [5]:
columns = ['match_id', 'opening_goal', 'goal_time', 'score_team_id']

data_source = [[item[key] for key in columns] for item in opening_goal_list]
opening_goal_list_pd = pd.DataFrame(data_source, columns=columns)
opening_goal_list_pd.columns = ['game_id', 'opening_goal', 'goal_time', 'score_team_id']

opening_goal_list_pd.head()

Unnamed: 0,game_id,opening_goal,goal_time,score_team_id
0,2013-1-001,2,29,10
1,2013-1-002,6,4,5
2,2013-1-003,9,28,23
3,2013-1-004,10,9,13
4,2013-1-006,13,2,7


##### game_records Table

In [6]:
sql = """SELECT game_id, home_team_id, away_team_id, winning_team FROM game_records"""

In [7]:
# Run the SQL query
winning_team_list = db_conn.select_query(sql)

In [8]:
columns = ['game_id', 'home_team_id', 'away_team_id', 'winning_team']

data_source = [[item[key] for key in columns] for item in winning_team_list]
winning_team_list_pd = pd.DataFrame(data_source, columns=columns)

winning_team_list_pd.head()

Unnamed: 0,game_id,home_team_id,away_team_id,winning_team
0,2013-1-001,10,25,0
1,2013-1-002,19,5,19
2,2013-1-003,21,23,23
3,2013-1-004,12,13,13
4,2013-1-005,20,2,0


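## Data Preprocessing

* remain_time: fraction of the match remaining after the opening goal (relative to 90 minutes)
* opening_team_flag: whether the Home team scored the opening goal
* winning_team_flag: match result from the Home team's perspective (1: win, 0: draw, -1: loss)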
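
To make the three derived columns concrete, here is a minimal sketch that computes them for a single match from the Home perspective. The function name `derive_features` is hypothetical; the input values are the first two matches shown in the table previews above.

```python
def derive_features(goal_time, score_team_id, home_team_id, winning_team):
    """Compute the derived columns for one match, seen from the Home side."""
    remain_time = (90 - goal_time) / 90
    opening_team_flag = 1 if score_team_id == home_team_id else 0
    if winning_team == 0:            # 0 encodes a draw in game_records
        winning_team_flag = 0
    elif winning_team == home_team_id:
        winning_team_flag = 1
    else:
        winning_team_flag = -1
    return remain_time, opening_team_flag, winning_team_flag


# 2013-1-001: opening goal at minute 29 by team 10 (home), match drawn
first_match = derive_features(goal_time=29, score_team_id=10, home_team_id=10, winning_team=0)
# 2013-1-002: opening goal at minute 4 by team 5 (away), home team 19 won
second_match = derive_features(goal_time=4, score_team_id=5, home_team_id=19, winning_team=19)

print(first_match)
print(second_match)
```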

In [9]:
home_opening_pd = pd.merge(opening_goal_list_pd, winning_team_list_pd, on='game_id')
home_opening_pd = pd.DataFrame(home_opening_pd, columns = ['game_id', 'opening_goal', 'goal_time', 'score_team_id', 'home_team_id', 'away_team_id', 'winning_team', 'remain_time', 'opening_team_flag', 'winning_team_flag'])
away_opening_pd = copy.deepcopy(home_opening_pd)

home_opening_pd.opening_team_flag = (home_opening_pd.score_team_id == home_opening_pd.home_team_id) * 1
away_opening_pd.opening_team_flag = (away_opening_pd.score_team_id == away_opening_pd.away_team_id) * 1

# Win 1, draw 0, loss -1
home_opening_pd.winning_team_flag = np.where(home_opening_pd.winning_team == 0, 0, np.where(home_opening_pd.winning_team == home_opening_pd.home_team_id, 1, -1))
away_opening_pd.winning_team_flag = np.where(away_opening_pd.winning_team == 0, 0, np.where(away_opening_pd.winning_team == away_opening_pd.away_team_id, 1, -1))

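# goal_time can exceed 90 for stoppage-time goals (max 95 in describe()), so remain_time can be slightly negative
home_opening_pd.remain_time = (90 - home_opening_pd.goal_time) / 90
away_opening_pd.remain_time = (90 - away_opening_pd.goal_time) / 90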

home_opening_pd.head(), away_opening_pd.head()

home_opening_pd.describe()

Unnamed: 0,opening_goal,goal_time,score_team_id,home_team_id,away_team_id,winning_team,remain_time,opening_team_flag,winning_team_flag
count,1934.0,1934.0,1934.0,1934.0,1934.0,1934.0,1934.0,1934.0,1934.0
mean,2713.576525,34.661324,12.824199,12.746122,12.734747,10.382627,0.614874,0.534126,0.067218
std,1576.204712,24.719504,7.380089,7.497002,7.397058,8.477683,0.274661,0.498963,0.886544
min,2.0,0.0,1.0,1.0,1.0,0.0,-0.055556,0.0,-1.0
25%,1353.5,13.0,7.0,6.0,6.0,2.0,0.422222,0.0,-1.0
50%,2677.0,30.0,12.0,12.0,12.0,10.0,0.666667,1.0,0.0
75%,4083.25,52.0,20.0,20.0,19.0,18.0,0.855556,1.0,1.0
max,5461.0,95.0,25.0,25.0,25.0,25.0,1.0,1.0,1.0


## 통계치 분석

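### Statistical probability of winning after scoring the opening goal
* Reference paper: EVALUATION OF GOALS SCORED IN TOP RANKING SOCCER MATCHES: GREEK "SUPERLEAGUE" 2006-07

### Basic statistics (HOME perspective)
* Share of won matches in which the team scored the opening goal: 86.2%
* Share of opening-goal matches that were won: 69.2%
 - [Conditional probability](http://benpark.tistory.com/308) (the probability that event B occurs given that event A has occurred)
 - P(B|A) = P(A∩B) / P(A)
 - Here: the probability of winning (event B) given that the team scored the opening goal (event A)
 - P(A) = opening-goal matches / all matches = 1033 / 1934
 - P(B) = wins / all matches = 829 / 1934
 - P(A∩B) = matches with an opening goal and a win / all matches = 715 / 1934
 - P(B|A) = P(A∩B) / P(A) = (715 / 1934) / (1033 / 1934) = 715 / 1033 ≈ 0.692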
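
The conditional probabilities listed above can be checked with a few lines of arithmetic. This is only a sketch using the Home counts quoted in the list; the variable names are illustrative.

```python
total_matches = 1934
opening_goal_matches = 1033   # event A: Home scored the opening goal
wins = 829                    # event B: Home won
opening_goal_and_win = 715    # A and B together

p_a = opening_goal_matches / total_matches
p_a_and_b = opening_goal_and_win / total_matches

p_win_given_opening = p_a_and_b / p_a              # P(B|A)
p_opening_given_win = opening_goal_and_win / wins  # P(A|B)

print(round(p_win_given_opening, 3), round(p_opening_given_win, 3))
# 0.692 0.862
```

The two ratios answer different questions: P(B|A) conditions on scoring first, while P(A|B) conditions on having won, which is why they differ (69.2% versus 86.2%).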

In [14]:
home_statics = [{'opening_flag': sum(home_opening_pd[home_opening_pd.winning_team_flag == 1].opening_team_flag == 1), 'winning_flag': sum(home_opening_pd.winning_team_flag == 1), 'opening_goal_flag': sum(home_opening_pd.opening_team_flag == 1), 'total': len(home_opening_pd), 'ratio1': 0, 'ratio2': 0},
            {'opening_flag': sum(home_opening_pd[home_opening_pd.winning_team_flag == 0].opening_team_flag == 1), 'winning_flag': sum(home_opening_pd.winning_team_flag == 0), 'opening_goal_flag': sum(home_opening_pd.opening_team_flag == 1), 'total': len(home_opening_pd), 'ratio1': 0, 'ratio2': 0},
            {'opening_flag': sum(home_opening_pd[home_opening_pd.winning_team_flag == -1].opening_team_flag == 1), 'winning_flag': sum(home_opening_pd.winning_team_flag == -1), 'opening_goal_flag': sum(home_opening_pd.opening_team_flag == 1), 'total': len(home_opening_pd), 'ratio1': 0, 'ratio2': 0}]
home_statics_pd = pd.DataFrame(home_statics, columns=['opening_flag', 'winning_flag', 'opening_goal_flag', 'total', 'ratio1', 'ratio2'], index=['승', '무', '패'])
home_statics_pd.ratio1 = home_statics_pd.opening_flag / home_statics_pd.winning_flag
home_statics_pd.ratio2 = home_statics_pd.opening_flag / home_statics_pd.opening_goal_flag
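# Labels are kept in Korean because later cells select columns by these names.
# Index: 승 = win, 무 = draw, 패 = loss
# Columns: opening goals within each result, matches per result, matches with an opening goal,
#          all matches, ratio (opening goal given result), ratio (result given opening goal)
home_statics_pd.columns=['승/무/패 중 선제골 횟수', '승/무/패 경기 수', '선제골 경기 수', '전체 경기 수', '비율(승리 중 선제골)', '비율(선제골 중 승리)']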
home_statics_pd

Unnamed: 0,승/무/패 중 선제골 횟수,승/무/패 경기 수,선제골 경기 수,전체 경기 수,비율(승리 중 선제골),비율(선제골 중 승리)
승,715,829,1033,1934,0.862485,0.692159
무,194,406,1033,1934,0.477833,0.187803
패,124,699,1033,1934,0.177396,0.120039


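### Basic statistics (AWAY perspective)
* Share of won matches in which the team scored the opening goal: 82.2%
* Share of opening-goal matches that were won: 63.8%
 - P(A) = opening-goal matches / all matches = 901 / 1934
 - P(B) = wins / all matches = 699 / 1934
 - P(A∩B) = matches with an opening goal and a win / all matches = 575 / 1934
 - P(B|A) = P(A∩B) / P(A) = (575 / 1934) / (901 / 1934) = 575 / 901 ≈ 0.638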

In [15]:
away_statics = [{'opening_flag': sum(away_opening_pd[away_opening_pd.winning_team_flag == 1].opening_team_flag == 1), 'winning_flag': sum(away_opening_pd.winning_team_flag == 1), 'opening_goal_flag': sum(away_opening_pd.opening_team_flag == 1), 'total': len(away_opening_pd), 'ratio1': 0, 'ratio2': 0},
            {'opening_flag': sum(away_opening_pd[away_opening_pd.winning_team_flag == 0].opening_team_flag == 1), 'winning_flag': sum(away_opening_pd.winning_team_flag == 0), 'opening_goal_flag': sum(away_opening_pd.opening_team_flag == 1), 'total': len(away_opening_pd), 'ratio1': 0, 'ratio2': 0},
            {'opening_flag': sum(away_opening_pd[away_opening_pd.winning_team_flag == -1].opening_team_flag == 1), 'winning_flag': sum(away_opening_pd.winning_team_flag == -1), 'opening_goal_flag': sum(away_opening_pd.opening_team_flag == 1), 'total': len(away_opening_pd), 'ratio1': 0, 'ratio2': 0}]
away_statics_pd = pd.DataFrame(away_statics, columns=['opening_flag', 'winning_flag', 'opening_goal_flag', 'total', 'ratio1', 'ratio2'], index=['승', '무', '패'])
away_statics_pd.ratio1 = away_statics_pd.opening_flag / away_statics_pd.winning_flag
away_statics_pd.ratio2 = away_statics_pd.opening_flag / away_statics_pd.opening_goal_flag
away_statics_pd.columns=['승/무/패 중 선제골 횟수', '승/무/패 경기 수', '선제골 경기 수', '전체 경기 수', '비율(승리 중 선제골)', '비율(선제골 중 승리)']
away_statics_pd

Unnamed: 0,승/무/패 중 선제골 횟수,승/무/패 경기 수,선제골 경기 수,전체 경기 수,비율(승리 중 선제골),비율(선제골 중 승리)
승,575,699,901,1934,0.822604,0.63818
무,212,406,901,1934,0.522167,0.235294
패,114,829,901,1934,0.137515,0.126526


### Basic statistics (combined)
* Share of won matches in which the team scored the opening goal: 84.4%
* Share of opening-goal matches that were won: 66.7%

In [16]:
total_statics_pd = home_statics_pd[['승/무/패 중 선제골 횟수', '승/무/패 경기 수', '선제골 경기 수', '전체 경기 수']] + away_statics_pd[['승/무/패 중 선제골 횟수', '승/무/패 경기 수', '선제골 경기 수', '전체 경기 수']]
total_statics_pd = pd.DataFrame(total_statics_pd, columns=['승/무/패 중 선제골 횟수', '승/무/패 경기 수', '선제골 경기 수', '전체 경기 수', 'ratio1', 'ratio2'], index=['승', '무', '패'])
total_statics_pd.ratio1 = total_statics_pd['승/무/패 중 선제골 횟수'] / total_statics_pd['승/무/패 경기 수']
total_statics_pd.ratio2 = total_statics_pd['승/무/패 중 선제골 횟수'] / total_statics_pd['선제골 경기 수']
total_statics_pd.columns=['승/무/패 중 선제골 횟수', '승/무/패 경기 수', '선제골 경기 수', '전체 경기 수', '비율(승리 중 선제골)', '비율(선제골 중 승리)']
total_statics_pd

Unnamed: 0,승/무/패 중 선제골 횟수,승/무/패 경기 수,선제골 경기 수,전체 경기 수,비율(승리 중 선제골),비율(선제골 중 승리)
승,1290,1528,1934,3868,0.844241,0.667011
무,406,812,1934,3868,0.5,0.209928
패,238,1528,1934,3868,0.155759,0.123061


## Testing independence between the variables with the chi-square test

### Data preparation
* Observed data

In [12]:
temp_pd = copy.deepcopy(total_statics_pd)
temp_pd.columns=['opening_flag', 'winning_flag', 'opening_goal_flag', 'total', 'ratio1', 'ratio2']
temp_pd.index=['win', 'draw', 'lose']

joint_prob_pd = pd.DataFrame(temp_pd[['opening_flag']], columns = ['opening_flag', 'non_opening_flag', 'marginal_freq', 'opening', 'non_opening', 'marginal_prob'], index=['win', 'draw', 'lose', 'margianl_prob'])
joint_prob_pd.marginal_freq[:3] = temp_pd.winning_flag[:3]
joint_prob_pd.non_opening_flag = joint_prob_pd.marginal_freq - joint_prob_pd.opening_flag
joint_prob_pd['opening_flag'][3:] = sum(joint_prob_pd['opening_flag'][:3])
joint_prob_pd['non_opening_flag'][3:] = sum(joint_prob_pd['non_opening_flag'][:3])
joint_prob_pd.marginal_freq = joint_prob_pd.opening_flag + joint_prob_pd.non_opening_flag

total_freq = joint_prob_pd.marginal_freq[3:].values[0]
joint_prob_pd.opening[3:] = joint_prob_pd.opening_flag[3:] / total_freq
joint_prob_pd.non_opening[3:] = joint_prob_pd.non_opening_flag[3:] / total_freq
joint_prob_pd.marginal_prob = joint_prob_pd.marginal_freq / total_freq

joint_prob_pd.opening[:3] = joint_prob_pd.marginal_prob[:3] * joint_prob_pd.opening[3:].values[0]
joint_prob_pd.non_opening[:3] = joint_prob_pd.marginal_prob[:3] * joint_prob_pd.non_opening[3:].values[0]

joint_prob_pd

Unnamed: 0,opening_flag,non_opening_flag,marginal_freq,opening,non_opening,marginal_prob
win,1290.0,238.0,1528.0,0.197518,0.197518,0.395036
draw,406.0,406.0,812.0,0.104964,0.104964,0.209928
lose,238.0,1290.0,1528.0,0.197518,0.197518,0.395036
margianl_prob,1934.0,1934.0,3868.0,0.5,0.5,1.0


* Expected data: the expected counts are computed from the marginal proportions of the observed data

In [13]:
expected_prob_pd = pd.DataFrame(joint_prob_pd[['opening', 'non_opening', 'marginal_prob']])
expected_prob_pd = expected_prob_pd * total_freq

expected_prob_pd

Unnamed: 0,opening,non_opening,marginal_prob
win,764.0,764.0,1528.0
draw,406.0,406.0,812.0
lose,764.0,764.0,1528.0
margianl_prob,1934.0,1934.0,3868.0
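
The expected counts above follow from the usual independence rule: expected cell count = row total × column total / grand total. A minimal sketch of that arithmetic, using the observed counts from the combined table (the dictionary layout is just for illustration):

```python
# Observed counts: (opening goal, no opening goal) per match result
observed = {
    "win": (1290, 238),
    "draw": (406, 406),
    "lose": (238, 1290),
}

column_totals = [sum(column) for column in zip(*observed.values())]
grand_total = sum(column_totals)

# Expected count under independence: row total * column total / grand total
expected = {
    outcome: tuple(sum(row) * column_total / grand_total for column_total in column_totals)
    for outcome, row in observed.items()
}

for outcome, counts in expected.items():
    print(outcome, counts)
```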


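### Chi-square test: test of independence
* When observations are cross-classified by several factors, this test checks whether those factors are associated, i.e. whether the observed counts depend on the combination of factor levels
* References
    * http://elearning.kocw.net/KOCW/document/2013/koreasejong/HongSungsik4/10.pdf
    * https://m.blog.naver.com/PostView.nhn?blogId=leerider&logNo=100189714605&proxyReferer=https%3A%2F%2Fwww.google.co.kr%2F
    * http://hamelg.blogspot.kr/2015/11/python-for-data-analysis-part-25-chi.html

#### Hypotheses
* Null hypothesis (H0): the match result does not differ by whether the team scored the opening goal
* Alternative hypothesis (H1): the match result differs by whether the team scored the opening goal

#### Running the chi-square test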

In [14]:
print(joint_prob_pd[:3][['opening_flag', 'non_opening_flag']].values)
result = st.chi2_contingency(joint_prob_pd[:3][['opening_flag', 'non_opening_flag']].values)

result

[[1290.  238.]
 [ 406.  406.]
 [ 238. 1290.]]


(1448.565445026178, 0.0, 2, array([[764., 764.],
        [406., 406.],
        [764., 764.]]))

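#### Chi-square test result
* χ² ≈ 1448.57, the p-value is far below 0.01 (reported as 0.0), and the degrees of freedom are 2
* Because the p-value is below the 0.01 significance level (99% confidence level), the null hypothesis is rejected
* **The match result differs depending on whether the team scored the opening goal**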
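
One caveat about the combined table: it counts every match twice, once from the Home view and once from the Away view, so its 3868 rows are not independent observations. The sketch below recomputes the Pearson statistic by hand for the combined table and, as a comparison, for the Home-view counts alone (taken from the HOME statistics table earlier, with the second column obtained as matches per result minus opening-goal matches). The helper `pearson_chi_square` is illustrative, not part of the notebook. For 2 degrees of freedom the chi-square survival function is exactly exp(-x/2), so no statistics library is needed.

```python
import math


def pearson_chi_square(table):
    """Return (statistic, degrees of freedom) for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Rows: win, draw, lose; columns: opening goal, no opening goal
combined = [[1290, 238], [406, 406], [238, 1290]]  # Home + Away views, 3868 rows
home_only = [[715, 114], [194, 212], [124, 575]]   # Home view only, 1934 matches

combined_stat, combined_dof = pearson_chi_square(combined)
home_stat, home_dof = pearson_chi_square(home_only)

# Survival function of the chi-square distribution, valid for dof == 2 only
combined_p = math.exp(-combined_stat / 2)
home_p = math.exp(-home_stat / 2)

print(round(combined_stat, 2), combined_dof, combined_p)
print(round(home_stat, 2), home_dof, home_p)
```

With these counts the Home-only statistic comes out at roughly half the combined value. It is still far beyond 9.21, the 0.01 critical value for 2 degrees of freedom, so the conclusion does not depend on the double counting.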

In [15]:
print('Win, Draw')
print(st.chi2_contingency(joint_prob_pd[:2][['opening_flag', 'non_opening_flag']]))

print('Draw, Lose')
print(st.chi2_contingency(joint_prob_pd[1:3][['opening_flag', 'non_opening_flag']]))

print('Win, Lose')
print(st.chi2_contingency(joint_prob_pd[0::2][['opening_flag', 'non_opening_flag']]))

Win, Draw
(313.2749650544937, 4.2248129312710823e-70, 1, array([[1107.47350427,  420.52649573],
       [ 588.52649573,  223.47350427]]))
Draw, Lose
(313.27496505449363, 4.2248129312712046e-70, 1, array([[ 223.47350427,  588.52649573],
       [ 420.52649573, 1107.47350427]]))
Win, Lose
(1445.8128272251308, 0.0, 1, array([[764., 764.],
       [764., 764.]]))


#### Additional test (match result recoded as win vs. non-win)

In [16]:
a = pd.DataFrame(joint_prob_pd[['opening_flag', 'non_opening_flag']], index=['win', 'draw', 'lose', 'non_win'])
a['opening_flag'][3:] = sum(a['opening_flag'][1:3])
a['non_opening_flag'][3:] = sum(a['non_opening_flag'][1:3])
a.iloc[[0,3]]

Unnamed: 0,opening_flag,non_opening_flag
win,1290.0,238.0
non_win,644.0,1696.0


In [17]:
st.chi2_contingency(a.iloc[[0,3]].values)

(1194.9581230142749, 7.603990602257219e-262, 1, array([[ 764.,  764.],
        [1170., 1170.]]))

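#### Additional analysis
* The match result is recoded from win/draw/loss to win vs. non-win
* χ² ≈ 1194.96, the p-value is below 0.01, and there is 1 degree of freedom, so the null hypothesis is rejected here as well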
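
For a 2x2 table, `chi2_contingency` applies Yates' continuity correction by default, which is why the reported statistic is slightly smaller than the plain Pearson value. A minimal sketch that reproduces the corrected statistic by hand from the win vs. non-win counts shown above (the function name is illustrative):

```python
def chi_square_2x2_yates(table):
    """Chi-square statistic for a 2x2 table with Yates' continuity correction."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            # Shrink each deviation by 0.5 (never past zero) before squaring
            deviation = max(abs(observed - expected) - 0.5, 0.0)
            statistic += deviation ** 2 / expected
    return statistic


# Rows: win, non-win; columns: opening goal, no opening goal
win_vs_non_win = [[1290, 238], [644, 1696]]
yates_stat = chi_square_2x2_yates(win_vs_non_win)
print(round(yates_stat, 2))
```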

## Reference Material

#### Chi-square goodness-of-fit test
* Tests whether the observed values follow a given theory or theoretical distribution
* References: https://youtu.be/MIaEyLcRvKw, http://hamelg.blogspot.kr/2015/11/python-for-data-analysis-part-25-chi.html

In [18]:
import scipy.stats as st

observed = np.reshape(joint_prob_pd[:3][['opening_flag', 'non_opening_flag']].values, -1).tolist()
expected = np.reshape(expected_prob_pd[:3][['opening', 'non_opening']].values, -1).tolist()

observed[2:4], expected

([406.0, 406.0], [764.0, 764.0, 406.0, 406.0, 764.0, 764.0])

In [19]:
st.chisquare(f_obs= observed,   # Array of observed counts
                f_exp= expected)


Power_divergenceResult(statistic=1448.565445026178, pvalue=4.1222600200887e-311)

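##### Goodness-of-fit test result
* The p-value is below 0.05, so the null hypothesis is rejected
* **The match result differs depending on whether the team scored the opening goal**
* Note: `st.chisquare` uses k - 1 = 5 degrees of freedom for the six cells by default, whereas the independence test has 2, so only the statistic (1448.57) matches `chi2_contingency`; passing `ddof=3` would align the degrees of freedom

##### Chi-square tests with the result recoded as win, draw, and loss indicators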

In [20]:
a = pd.DataFrame(joint_prob_pd[['opening_flag', 'non_opening_flag']], index=['win', 'draw', 'lose', 'non_win'])
a['opening_flag'][3:] = sum(a['opening_flag'][1:3])
a['non_opening_flag'][3:] = sum(a['non_opening_flag'][1:3])
a.iloc[[0,3]]
st.chi2_contingency(a.iloc[[0,3]].values)

(1194.9581230142749, 7.603990602257219e-262, 1, array([[ 764.,  764.],
        [1170., 1170.]]))

In [21]:
a = pd.DataFrame(joint_prob_pd[['opening_flag', 'non_opening_flag']], index=['win', 'draw', 'lose', 'non_draw'])
a['opening_flag'][3:] = sum(a['opening_flag'][::2])
a['non_opening_flag'][3:] = sum(a['non_opening_flag'][::2])
a.iloc[[1,3]]
st.chi2_contingency(a.iloc[[1,3]].values)

(0.0, 1.0, 1, array([[ 406.,  406.],
        [1528., 1528.]]))

In [22]:
a = pd.DataFrame(joint_prob_pd[['opening_flag', 'non_opening_flag']], index=['win', 'draw', 'lose', 'non_lose'])
a['opening_flag'][3:] = sum(a['opening_flag'][0:2])
a['non_opening_flag'][3:] = sum(a['non_opening_flag'][0:2])
a.iloc[[2,3]]
st.chi2_contingency(a.iloc[[2,3]].values)

(1194.9581230142749, 7.603990602257219e-262, 1, array([[ 764.,  764.],
        [1170., 1170.]]))

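##### Logistic regression
* A simple trial only; more meaningful results are expected once other variables are added to the model
* Even so, the F-measure with a single variable is higher than expected (average F1 of 0.79 on the test set), which suggests the opening goal has a clear effect on winning
* Reference: http://nbviewer.jupyter.org/gist/justmarkham/6d5c061ca5aee67c4316471f8c2ae976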

In [23]:
# Data sampling (train set: 2013-2016, test set: 2017)
# Variables: home/away, opening goal flag, remaining-time ratio, victory (win vs. draw/loss)
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
logit_data = pd.concat([home_opening_pd, away_opening_pd], ignore_index=True)[['game_id', 'score_team_id', 'home_team_id', 'opening_team_flag', 'remain_time', 'winning_team_flag']]
logit_data.winning_team_flag = np.where(logit_data.winning_team_flag == 1, 1, 0)
# Keep only the year part of game_id (e.g. '2013-1-001' -> 2013)
logit_data.game_id = logit_data.game_id.str.split('-').str.get(0)
logit_data.game_id = pd.to_numeric(logit_data.game_id)
# location: 1 for rows taken from the Home view, 0 for rows taken from the Away view
logit_data.score_team_id = np.where(logit_data.score_team_id == logit_data.home_team_id, np.where(logit_data.opening_team_flag == 1, 1, 0), np.where(logit_data.opening_team_flag == 0, 1, 0))
logit_data = logit_data.drop(['home_team_id'], axis=1)
logit_data.columns = ['year', 'location', 'opening_goal', 'remain_time', 'victory']

train_set = logit_data[logit_data.year < 2017]
test_set = logit_data[logit_data.year == 2017]

logit_data.head()

Unnamed: 0,year,location,opening_goal,remain_time,victory
0,2013,1,1,0.677778,0
1,2013,1,0,0.955556,1
2,2013,1,0,0.688889,0
3,2013,1,0,0.9,0
4,2013,1,1,0.977778,0


In [24]:
# location and remain_time had a negligible effect, so they are left out of the model

from patsy import dmatrices

y_train, x_train = dmatrices('victory ~ opening_goal', train_set, return_type='dataframe')
y_test, x_test = dmatrices('victory ~ opening_goal', test_set, return_type='dataframe')
y_train = np.ravel(y_train)
y_test = np.ravel(y_test)

from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model = model.fit(x_train, y_train)

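# Coefficients of the logistic regression
# Note: x_train already contains patsy's 'Intercept' column and LogisticRegression fits its own
# intercept_ as well (fit_intercept=True), so the effective intercept is model.intercept_ plus the value below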
list(zip(x_train.columns, np.transpose(model.coef_)))

[('Intercept', array([-0.94174735])), ('opening_goal', array([2.56348138]))]
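
A logistic regression coefficient is a change in log-odds, so exponentiating it gives an odds ratio. The sketch below applies that to the `opening_goal` coefficient printed above. It deliberately stops at the odds ratio: converting to a probability would need the model's effective intercept, which includes `model.intercept_` and is not shown in the output.

```python
import math

opening_goal_coef = 2.56348138  # 'opening_goal' coefficient from the fitted model above

# exp(coefficient) = multiplicative change in the odds of winning
# when opening_goal goes from 0 to 1
odds_ratio = math.exp(opening_goal_coef)
print(round(odds_ratio, 1))
```

Under this fit, scoring the opening goal multiplies the estimated odds of winning by roughly 13. The model uses the default L2 penalty (`C=1.0`), so the coefficient is slightly shrunk compared with an unregularized fit.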

In [25]:
# Accuracy on the training set
model.score(x_train, y_train)

0.7678227360308285

In [26]:
model.sparsify()

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [27]:
# Predict on the test set

from sklearn import metrics

predicted = model.predict(x_test)
predicted_probs = model.predict_proba(x_test)[:, 1]

# Accuracy and ROC AUC on the test set
metrics.accuracy_score(y_test, predicted), metrics.roc_auc_score(y_test, predicted_probs)

(0.7891246684350133, 0.8053879310344828)
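
Both numbers can be reproduced from the test-set confusion counts. The sketch assumes tn = 341, fp = 123, fn = 36 and tp = 254, where tp is inferred as the 290 actual wins in the classification report minus the 36 false negatives. Because the model has a single binary predictor it produces only two distinct scores, so the ROC curve has one interior point and the area under it reduces to (1 + TPR - FPR) / 2.

```python
# Confusion counts on the 2017 test set (tp inferred as 290 actual wins - 36 false negatives)
tn, fp, fn, tp = 341, 123, 36, 254

accuracy = (tp + tn) / (tp + tn + fp + fn)

true_positive_rate = tp / (tp + fn)
false_positive_rate = fp / (fp + tn)

# Area under a ROC curve with the single interior point (FPR, TPR)
auc = (1 + true_positive_rate - false_positive_rate) / 2

print(accuracy, auc)
```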

In [28]:
confusion_mt = metrics.confusion_matrix(y_test, predicted)
tn, fp, fn, tp = confusion_mt.ravel()
# Rows: predicted class, columns: actual class
pd.DataFrame([{'Real Positive': tp, 'Real Negative': fp}, {'Real Positive': fn, 'Real Negative': tn}], columns=['Real Positive', 'Real Negative'], index=['Predicted Positive', 'Predicted Negative'])

Unnamed: 0,Real Positive,Real Negative
Predicted Positive,254,123
Predicted Negative,36,341


In [29]:
# Precision, recall, and F1-score per class
print(metrics.classification_report(y_test, predicted))

             precision    recall  f1-score   support

        0.0       0.90      0.73      0.81       464
        1.0       0.67      0.88      0.76       290

avg / total       0.82      0.79      0.79       754



##### F-Measure

|           |          |               Actual                |                                    |
| :-------: | :------: | :---------------------------------: | :--------------------------------: |
|           |          |              Positive               |              Negative              |
| Predicted | Positive |            True Positive            | False Positive<br />(Type I error) |
|           | Negative | False Negative<br />(Type II error) |           True Negative            |


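* Precision: of the cases predicted Positive, the share that is actually Positive (emphasizes how often a prediction hits)
$$\text{Precision} = \frac{tp}{tp + fp}$$


* Recall: of the actual Positives, the share that was correctly predicted (emphasizes the positives that were missed)
$$\text{Recall} = \frac{tp}{tp + fn}$$


* F1-score: the harmonic mean of Precision and Recall, summarizing the model's accuracy in a single number
$$\text{F1-score} = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$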
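
The three formulas can be applied directly to the test-set counts to reproduce the classification report. This sketch again assumes tn = 341, fp = 123, fn = 36 and tp = 254 (290 actual wins minus 36 false negatives); the helper name is illustrative.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) for one class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


tn, fp, fn, tp = 341, 123, 36, 254

# Class 1 (win) treated as Positive
win_scores = precision_recall_f1(tp=tp, fp=fp, fn=fn)
# Class 0 (draw/loss) treated as Positive: the roles of the counts swap
non_win_scores = precision_recall_f1(tp=tn, fp=fn, fn=fp)

print([round(value, 2) for value in win_scores])
print([round(value, 2) for value in non_win_scores])
```

Rounded to two decimals, these match the per-class rows of the report: 0.67 / 0.88 / 0.76 for wins and 0.90 / 0.73 / 0.81 for non-wins.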


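* Overall, the analysis indicates that scoring the opening goal makes a win more likely.

* Probability of winning after scoring the opening goal at home: 69.2%
* Probability of winning after scoring the opening goal away: 63.8%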
