# Predicting the World Cup Champion with Machine Learning 🏆

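## Project Outline

- Fetch the group-stage and knockout-stage match results of the current World Cup
- Fetch each team's match statistics for the current World Cup
- Fetch each team's historical World Cup statistics
- Make a prediction with an artificial neural network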

## Data Sources

- Sohu Sports live [statistics](http://data.2018.sohu.com/) for the 2018 World Cup in Russia.
- The external data may become unavailable if the API changes.
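
Because the API may change or go offline, it can help to check a response before building a DataFrame from it. The sketch below is only an illustration: `extract_result` is a hypothetical helper that is not part of the original notebook, and it assumes the records sit in a list under a top-level `result` key, as in the schedule response used in this notebook.

```python
import json


def extract_result(raw_text):
    """Return the list of records stored under the 'result' key of a JSON response.

    Hypothetical helper: raises ValueError with a readable message when the
    text is not valid JSON or no longer has the expected shape.
    """
    try:
        payload = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc

    if not isinstance(payload, dict) or "result" not in payload:
        raise ValueError("response has no 'result' key; the API may have changed")

    result = payload["result"]
    if not isinstance(result, list) or not result:
        raise ValueError("'result' is empty or not a list")
    return result


# Usage with a small hand-written payload (not real API output)
sample = '{"result": [{"home_team_name": "A", "home_team_score": "1"}]}'
records = extract_result(sample)
print(records[0]["home_team_name"])
```

In the notebook, the same check could be applied to the `.text` of a `requests` response before it is converted to a DataFrame.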

## Fetching the Group-Stage and Knockout-Stage Match Results

- Data URL: http://data.2018.sohu.com/game-schedule.html?index=3

### Parsing the JSON Data


In [1]:
import requests

# Request the match schedule from the API (the response body is JSON)
play_raw = requests.get("http://api.data.2018.sohu.com/api/schedule/time")

In [2]:
import json

play_json = json.loads(play_raw.text)

In [3]:
play_json['result'][0]

{'aggregate_partner_game': None,
 'coverage': None,
 'game_date': '2018-06-14 23:00:00',
 'game_type': '1',
 'game_type_description': 'group',
 'gamecode': '13245886',
 'gamecode_global': '13245886',
 'home_team_alias': '',
 'home_team_flag': '',
 'home_team_global_id': '0',
 'home_team_group': 'A',
 'home_team_id': '4694',
 'home_team_name': '俄罗斯',
 'home_team_outcome': '俄罗斯',
 'home_team_score': '5',
 'home_team_shootout_goals': None,
 'home_team_stars': '0',
 'id': '1',
 'judge_id': '0',
 'links_live': '126052',
 'links_schedule': "<a target=_blank href='http://sports.sohu.com/2018wcrusvsksa/'>战报</a>",
 'links_tv': None,
 'local_date': '2018-06-14 15:00:00',
 'manul_vs': 'http://sports.sohu.com/2018wcrusvsksa/',
 'match_number': '1',
 'original_week': '0',
 'stadium_global_id': '949',
 'stadium_id': '949',
 'stadium_name': '卢日尼基体育场',
 'status': 'closed',
 'status_id': '4',
 'tba': '',
 'venue_id': '949',
 'visiting_team_alias': '',
 'visiting_team_flag': '',
 'visiting_team_global_i

In [4]:
import pandas as pd

play_df = pd.read_json(json.dumps(play_json['result']))

In [5]:
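# Keep team names and scores; .iloc[:-1] drops the last row, presumably the final, which had not been played yet.
# .copy() avoids pandas' SettingWithCopyWarning when the 'results' column is added later.
play_score = play_df[['home_team_name', 'visiting_team_name', 'home_team_score', 'visiting_team_score']].iloc[:-1].copy()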
play_score.tail()

|  | home_team_name | visiting_team_name | home_team_score | visiting_team_score |
| --- | --- | --- | --- | --- |
| 58 | 瑞典 | 英格兰 | 0 | 2 |
| 59 | 俄罗斯 | 克罗地亚 | 5 | 6 |
| 60 | 法国 | 比利时 | 1 | 0 |
| 61 | 克罗地亚 | 英格兰 | 2 | 1 |
| 62 | 比利时 | 英格兰 | 2 | 0 |


### Labeling Each Match by Its Score

In [6]:
# Labels describe the result from the home team's point of view:
# '胜利' = win, '平局' = draw, '失败' = loss
play_score.loc[play_score['home_team_score'] > play_score['visiting_team_score'], 'results'] = '胜利'
play_score.loc[play_score['home_team_score'] == play_score['visiting_team_score'], 'results'] = '平局'
play_score.loc[play_score['home_team_score'] < play_score['visiting_team_score'], 'results'] = '失败'

In [7]:
play_score.head()

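|  | home_team_name | visiting_team_name | home_team_score | visiting_team_score | results |
| --- | --- | --- | --- | --- | --- |
| 0 | 俄罗斯 | 沙特阿拉伯 | 5 | 0 | 胜利 |
| 1 | 埃及 | 乌拉圭 | 0 | 1 | 失败 |
| 2 | 摩洛哥 | 伊朗 | 0 | 1 | 失败 |
| 3 | 葡萄牙 | 西班牙 | 3 | 3 | 平局 |
| 4 | 法国 | 澳大利亚 | 2 | 1 | 胜利 |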
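
The three-way labeling can also be written as a single vectorized expression. The following is a minimal sketch on made-up scores, using `numpy.select` instead of three separate `.loc` assignments; the DataFrame `scores` and its values are illustrative and do not come from the API.

```python
import numpy as np
import pandas as pd

# Toy scores, not real match data
scores = pd.DataFrame({
    "home_team_score": [5, 0, 3],
    "visiting_team_score": [0, 1, 3],
})

home = scores["home_team_score"]
away = scores["visiting_team_score"]

# Conditions are checked in order; the default covers the remaining case (a draw)
scores["results"] = np.select(
    [home > away, home < away],
    ["胜利", "失败"],
    default="平局",
)
print(scores["results"].tolist())
```

Either form produces the same labels; the single expression simply makes it harder to leave a row unlabeled.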


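## Fetching Each Team's Match Statistics for the Current World Cup

These statistics include matches won, matches lost, matches played, goals scored, and goals conceded. They serve as indicators of a team's overall strength.

- Data URL: http://data.2018.sohu.com/

### Overall Win/Loss Record of Each National Team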

In [8]:
team_raw = requests.get("http://api.data.2018.sohu.com/api/scores/index")

In [9]:
team_json = json.loads(team_raw.text)

team_json['result'][0]

{'alias': '',
 'away_losses': None,
 'away_ties': None,
 'away_wins': None,
 'flag': '',
 'games_played': '3',
 'goals_against': '4',
 'goals_for': '8',
 'group': 'A',
 'home_losses': None,
 'home_ties': None,
 'home_wins': None,
 'id': '1',
 'losses': '1',
 'name_cn': '俄罗斯',
 'place': '2',
 'points': '6',
 'points_per_game': None,
 'r1': None,
 'r2': None,
 'r3': None,
 'r4': None,
 'r5': None,
 'team_global_id': None,
 'team_id': '4694',
 'ties': '0',
 'winning_percentage': '0.667',
 'wins': '2'}

In [10]:
team_df = pd.read_json(json.dumps(team_json['result']))
team_df = team_df[['name_cn', 'wins', 'losses', 'ties', 'points', 'goals_for', 'goals_against']]
team_df_reindex = pd.DataFrame(team_df).set_index('name_cn')
team_df_reindex.head()

| name_cn | wins | losses | ties | points | goals_for | goals_against |
| --- | --- | --- | --- | --- | --- | --- |
| 俄罗斯 | 2 | 1 | 0 | 6 | 8 | 4 |
| 乌拉圭 | 3 | 0 | 0 | 9 | 5 | 0 |
| 埃及 | 0 | 3 | 0 | 0 | 2 | 6 |
| 沙特阿拉伯 | 1 | 2 | 0 | 3 | 2 | 7 |
| 葡萄牙 | 1 | 0 | 2 | 5 | 5 | 4 |


### Detailed Scoring Statistics of Each National Team

- Data URL: http://data.2018.sohu.com/list.html?type=team&category=2

In [11]:
goal_raw = requests.get("http://api.data.2018.sohu.com/api/rank/team?category_id=2")

In [12]:
goal_json = json.loads(goal_raw.text)

goal_json['result']['list'][0]

{'ball_possession': '387',
 'cards': '15',
 'conceded0_15': '0',
 'conceded16_30': '0',
 'conceded31_45': '0',
 'conceded46_60': '0',
 'conceded61_75': '0',
 'conceded76_90': None,
 'corner_kicks': '40',
 'crosses': None,
 'duelstacklesuccessful': '244',
 'duelstackletotal': '473',
 'fouls': '111',
 'free_kicks': '121',
 'games_played': '7',
 'goals': '21',
 'goals_footed': '18',
 'goals_headed': '2',
 'id': '18',
 'last_form': 'LWWWWWW',
 'losses': '1',
 'name_cn': '克罗地亚',
 'offsides': '9',
 'opponent_goals': '14',
 'red_cards': '0',
 'scored0_15': '0',
 'scored16_30': '0',
 'scored31_45': '1',
 'scored46_60': '2',
 'scored61_75': '1',
 'scored76_90': '3',
 'shots': '80',
 'shots_blocked': '27',
 'shots_off_goal': '52',
 'shots_on_goal': '28',
 'team_id': '4715',
 'ties': '0',
 'touches_passes': '3964',
 'wins': '6',
 'yellow_cards': '15'}

In [13]:
goal_df = pd.read_json(json.dumps(goal_json['result']['list']))
goal_df = goal_df[['name_cn', 'games_played', 'goals', 'opponent_goals', 'shots', 'shots_on_goal', 
                   'fouls', 'offsides', 'touches_passes', 'free_kicks', 'corner_kicks', 'duelstackletotal',
                   'yellow_cards', 'red_cards']]
goal_df_reindex = pd.DataFrame(goal_df).set_index('name_cn')
goal_df_reindex.head()

| name_cn | games_played | goals | opponent_goals | shots | shots_on_goal | fouls | offsides | touches_passes | free_kicks | corner_kicks | duelstackletotal | yellow_cards | red_cards |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 克罗地亚 | 7 | 21 | 14 | 80 | 28 | 111 | 9 | 3964 | 121 | 40 | 473 | 15 | 0 |
| 俄罗斯 | 5 | 18 | 14 | 35 | 18 | 95 | 7 | 1958 | 60 | 26 | 357 | 7 | 0 |
| 英格兰 | 7 | 16 | 11 | 65 | 27 | 67 | 15 | 4001 | 113 | 39 | 461 | 8 | 0 |
| 比利时 | 7 | 16 | 6 | 75 | 39 | 99 | 8 | 3810 | 86 | 39 | 489 | 11 | 0 |
| 法国 | 7 | 14 | 6 | 66 | 31 | 90 | 3 | 3250 | 109 | 21 | 437 | 12 | 0 |


## Fetching Each Team's Historical World Cup Statistics

- Data URL: http://data.2018.sohu.com/team-map.html

In [14]:
team_history_raw = requests.get("http://api.data.2018.sohu.com/api/team/list")

In [15]:
team_history_json = json.loads(team_history_raw.text)

team_history_json['result'][1]

{'alias': '',
 'arabic': None,
 'clothing_a': '0',
 'clothing_b': '0',
 'create_time': '1900',
 'crowns': '2',
 'display_name': 'Uruguay',
 'flag': '',
 'global_rank': '14',
 'group': 'A',
 'history_country_id': '0',
 'id': '34',
 'join_time': '1923',
 'links_video': None,
 'location': None,
 'name': None,
 'name_cn': '乌拉圭',
 'presents': '12',
 'sign': '0',
 'team_global_id': '4725',
 'team_id': '4725',
 'wined': '0'}

### World Cup Appearances and Rankings of Each Country, Indexed by Country Name

In [16]:
team_history_df = pd.read_json(json.dumps(team_history_json['result']))
team_history_df = team_history_df[['name_cn', 'global_rank', 'crowns', 'presents']]
team_history_df_reindex = pd.DataFrame(team_history_df).set_index('name_cn')
team_history_df_reindex.head()

| name_cn | global_rank | crowns | presents |
| --- | --- | --- | --- |
| 俄罗斯 | 70 | 0 | 10 |
| 乌拉圭 | 14 | 2 | 12 |
| 埃及 | 45 | 0 | 2 |
| 沙特阿拉伯 | 67 | 0 | 4 |
| 葡萄牙 | 4 | 0 | 6 |


## Merging the Data

Merge `team_df_reindex`, `goal_df_reindex`, and `team_history_df_reindex` column-wise, aligned on the team-name index.

In [17]:
team_merge = pd.concat([team_df_reindex, goal_df_reindex, team_history_df_reindex], axis=1)

team_merge.head()

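|  | wins | losses | ties | points | goals_for | goals_against | games_played | goals | opponent_goals | shots | ... | offsides | touches_passes | free_kicks | corner_kicks | duelstackletotal | yellow_cards | red_cards | global_rank | crowns | presents |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 丹麦 | 1 | 0 | 2 | 5 | 2 | 1 | 4 | 5 | 5 | 32 | ... | 5 | 1789 | 38 | 18 | 216 | 6 | 0 | 12 | 0 | 4 |
| 乌拉圭 | 3 | 0 | 0 | 9 | 5 | 0 | 5 | 7 | 3 | 44 | ... | 2 | 2268 | 78 | 18 | 361 | 3 | 0 | 14 | 2 | 12 |
| 伊朗 | 1 | 1 | 1 | 4 | 2 | 2 | 3 | 2 | 2 | 17 | ... | 3 | 684 | 48 | 5 | 180 | 7 | 0 | 37 | 0 | 4 |
| 俄罗斯 | 2 | 1 | 0 | 6 | 8 | 4 | 5 | 18 | 14 | 35 | ... | 7 | 1958 | 60 | 26 | 357 | 7 | 0 | 70 | 0 | 10 |
| 克罗地亚 | 3 | 0 | 0 | 9 | 7 | 1 | 7 | 21 | 14 | 80 | ... | 9 | 3964 | 121 | 40 | 473 | 15 | 0 | 20 | 0 | 4 |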
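
`pd.concat(..., axis=1)` matches rows by index label rather than by position, which is why the three tables can be combined even though their rows are ordered differently. A minimal sketch with made-up teams and values shows the alignment, and what happens when a team is missing from one table:

```python
import pandas as pd

# Toy frames: same teams in a different row order, and team "C" only on the left
left = pd.DataFrame({"wins": [2, 3, 1]}, index=["A", "B", "C"])
right = pd.DataFrame({"goals": [7, 5]}, index=["B", "A"])

# Rows are matched by index label; labels missing on one side are filled with NaN
merged = pd.concat([left, right], axis=1)
print(merged)
```

Here team `A` receives `goals == 5` even though it is listed second in `right`, and team `C` gets `NaN` for `goals`. A quick `merged.isna().sum()` after merging is one way to spot teams whose names do not match across sources.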


In [18]:
home_team_df = team_merge.reindex(play_score['home_team_name'])
visiting_team_df = team_merge.reindex(play_score['visiting_team_name'])

In [19]:
home_visiting_team_df = pd.concat([home_team_df.reset_index(), visiting_team_df.reset_index()], axis=1)
home_visiting_team_df.head()

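|  | home_team_name | wins | losses | ties | points | goals_for | goals_against | games_played | goals | opponent_goals | ... | offsides | touches_passes | free_kicks | corner_kicks | duelstackletotal | yellow_cards | red_cards | global_rank | crowns | presents |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 俄罗斯 | 2 | 1 | 0 | 6 | 8 | 4 | 5 | 18 | 14 | ... | 4 | 1802 | 54 | 13 | 216 | 1 | 0 | 67 | 0 | 4 |
| 1 | 埃及 | 0 | 3 | 0 | 0 | 2 | 6 | 3 | 2 | 6 | ... | 2 | 2268 | 78 | 18 | 361 | 3 | 0 | 14 | 2 | 12 |
| 2 | 摩洛哥 | 0 | 2 | 1 | 1 | 2 | 4 | 3 | 2 | 4 | ... | 3 | 684 | 48 | 5 | 180 | 7 | 0 | 37 | 0 | 4 |
| 3 | 葡萄牙 | 1 | 0 | 2 | 5 | 5 | 4 | 4 | 6 | 6 | ... | 6 | 3426 | 67 | 24 | 232 | 2 | 0 | 10 | 1 | 14 |
| 4 | 法国 | 2 | 0 | 1 | 7 | 3 | 1 | 7 | 14 | 6 | ... | 3 | 1522 | 35 | 14 | 180 | 7 | 0 | 36 | 0 | 4 |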
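
Concatenating the home and visiting tables side by side leaves every statistic with two identically named columns (for example, two `wins` columns). One possible way to keep them apart, shown here on toy data, is to prefix the column names before concatenating. The `home_` and `visiting_` prefixes and all values are assumptions for illustration; the original notebook keeps the duplicate names.

```python
import pandas as pd

# Toy per-team statistics (made-up values)
stats = pd.DataFrame(
    {"wins": [2, 0, 3], "goals": [8, 2, 5]},
    index=["A", "B", "C"],
)
# Toy fixtures: one row per match
matches = pd.DataFrame({"home": ["A", "C"], "visiting": ["B", "A"]})

# reindex repeats a team's row once for every match it appears in
home_stats = stats.reindex(matches["home"]).reset_index(drop=True).add_prefix("home_")
visiting_stats = stats.reindex(matches["visiting"]).reset_index(drop=True).add_prefix("visiting_")

# One row per match, with unambiguous column names
features = pd.concat([home_stats, visiting_stats], axis=1)
print(features.columns.tolist())
```

With unique names, a column such as `features["visiting_wins"]` selects a single Series instead of a two-column DataFrame.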


In [20]:
play_score_new = pd.concat([home_visiting_team_df, play_score.iloc[:, -1:]], axis=1).drop(['home_team_name', 'visiting_team_name'], axis=1)
play_score_new.head()

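|  | wins | losses | ties | points | goals_for | goals_against | games_played | goals | opponent_goals | shots | ... | touches_passes | free_kicks | corner_kicks | duelstackletotal | yellow_cards | red_cards | global_rank | crowns | presents | results |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2 | 1 | 0 | 6 | 8 | 4 | 5 | 18 | 14 | 35 | ... | 1802 | 54 | 13 | 216 | 1 | 0 | 67 | 0 | 4 | 胜利 |
| 1 | 0 | 3 | 0 | 0 | 2 | 6 | 3 | 2 | 6 | 21 | ... | 2268 | 78 | 18 | 361 | 3 | 0 | 14 | 2 | 12 | 失败 |
| 2 | 0 | 2 | 1 | 1 | 2 | 4 | 3 | 2 | 4 | 28 | ... | 684 | 48 | 5 | 180 | 7 | 0 | 37 | 0 | 4 | 失败 |
| 3 | 1 | 0 | 2 | 5 | 5 | 4 | 4 | 6 | 6 | 33 | ... | 3426 | 67 | 24 | 232 | 2 | 0 | 10 | 1 | 14 | 平局 |
| 4 | 2 | 0 | 1 | 7 | 3 | 1 | 7 | 14 | 6 | 66 | ... | 1522 | 35 | 14 | 180 | 7 | 0 | 36 | 0 | 4 | 胜利 |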



### Normalizing the Data

Min-max normalization is a linear transformation of the raw data that maps each value into the range `[0, 1]`:

$$\hat x=\frac{x-x_{min}}{x_{max}-x_{min}}$$

In [21]:
play_score_temp = play_score_new.iloc[:, :-1]
play_score_normal = (play_score_temp - play_score_temp.min()) / (play_score_temp.max() - play_score_temp.min())
play_score_normal = pd.concat([play_score_normal, play_score_new.iloc[:, -1]], axis=1)
play_score_normal.head()

|  | wins | losses | ties | points | goals_for | goals_against | games_played | goals | opponent_goals | shots | ... | touches_passes | free_kicks | corner_kicks | duelstackletotal | yellow_cards | red_cards | global_rank | crowns | presents | results |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.666667 | 0.333333 | 0.0 | 0.666667 | 0.857143 | 0.363636 | 0.5 | 0.842105 | 1.0 | 0.285714 | ... | 0.337052 | 0.302083 | 0.216216 | 0.170213 | 0.0 | 0.0 | 0.956522 | 0.0 | 0.2 | 胜利 |
| 1 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.545455 | 0.0 | 0.0 | 0.333333 | 0.063492 | ... | 0.47754 | 0.552083 | 0.351351 | 0.610942 | 0.142857 | 0.0 | 0.188406 | 0.4 | 0.6 | 失败 |
| 2 | 0.0 | 0.666667 | 0.5 | 0.111111 | 0.0 | 0.363636 | 0.0 | 0.0 | 0.166667 | 0.174603 | ... | 0.0 | 0.239583 | 0.0 | 0.06079 | 0.428571 | 0.0 | 0.521739 | 0.0 | 0.2 | 失败 |
| 3 | 0.333333 | 0.0 | 1.0 | 0.555556 | 0.428571 | 0.363636 | 0.25 | 0.210526 | 0.333333 | 0.253968 | ... | 0.826651 | 0.4375 | 0.513514 | 0.218845 | 0.071429 | 0.0 | 0.130435 | 0.2 | 0.7 | 平局 |
| 4 | 0.666667 | 0.0 | 0.5 | 0.777778 | 0.142857 | 0.090909 | 1.0 | 0.631579 | 0.333333 | 0.777778 | ... | 0.252638 | 0.104167 | 0.243243 | 0.06079 | 0.428571 | 0.0 | 0.507246 | 0.0 | 0.2 | 胜利 |
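
As a worked example of the min-max formula, the sketch below normalizes a tiny made-up table. It also guards against a column whose values are all equal: there the range is zero and the plain formula would evaluate 0/0 and yield `NaN`. Replacing a zero range with 1 is one possible convention, assumed here for illustration; it is not something the original notebook does.

```python
import pandas as pd

# Toy feature table; 'red_cards' is constant on purpose
df = pd.DataFrame({"goals": [2, 6, 10], "red_cards": [0, 0, 0]})

col_min = df.min()
col_range = df.max() - col_min

# A constant column has range 0; dividing by 1 instead maps all its values to 0
safe_range = col_range.replace(0, 1)
normalized = (df - col_min) / safe_range
print(normalized)
```

For `goals`, the values 2, 6, and 10 become 0.0, 0.5, and 1.0, since (6 - 2) / (10 - 2) = 0.5.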


## Making a Prediction with an Artificial Neural Network

In [22]:
X = play_score_normal.iloc[:, :-1]  # features
y = play_score_normal.iloc[:, -1]  # target

In [23]:
from sklearn.neural_network import MLPClassifier

# Define the artificial neural network classifier
model = MLPClassifier(max_iter=1000)

In [24]:
from sklearn.model_selection import cross_val_score

# Cross-validate to assess the model's reliability
cvs = cross_val_score(model, X, y, cv=5)
cvs

array([ 0.71428571,  0.76923077,  0.61538462,  0.5       ,  0.63636364])

In [25]:
import numpy as np

# Mean of the cross-validation scores
np.mean(cvs)

0.64705294705294703
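
With only about sixty matches and `random_state` left unset, the cross-validation scores above can change from run to run. The following sketch shows one way to make such an evaluation repeatable, and to refit the scaler inside every fold by wrapping it in a pipeline. It runs on synthetic data from `make_classification`, not on the match data, and the variable names (`X_demo`, `y_demo`, `pipeline`, `folds`) are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic 3-class data standing in for the match features (not real data)
X_demo, y_demo = make_classification(
    n_samples=60, n_features=10, n_informative=5, n_classes=3, random_state=0
)

# The scaler is refit on the training part of every fold
pipeline = make_pipeline(MinMaxScaler(), MLPClassifier(max_iter=1000, random_state=0))

# Shuffled, stratified folds with a fixed seed keep the class mix similar across folds
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, X_demo, y_demo, cv=folds)
print(scores.mean())
```

The accuracy printed here says nothing about the World Cup model; the point is only the evaluation setup.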

In [26]:
model.fit(X, y)  # train the model

MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=1000, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)

### Extracting the Feature Data of the Two Finalists

In [27]:
# Take the two finalists' features (France first, then Croatia)
final_team = pd.concat([home_team_df.loc['法国'].iloc[0], home_team_df.loc['克罗地亚'].iloc[0]])
final_team

wins                   2
losses                 0
ties                   1
points                 7
goals_for              3
goals_against          1
games_played           7
goals                 14
opponent_goals         6
shots                 66
shots_on_goal         31
fouls                 90
offsides               3
touches_passes      3250
free_kicks           109
corner_kicks          21
duelstackletotal     437
yellow_cards          12
red_cards              0
global_rank            7
crowns                 1
presents              14
wins                   3
losses                 0
ties                   0
points                 9
goals_for              7
goals_against          1
games_played           7
goals                 21
opponent_goals        14
shots                 80
shots_on_goal         28
fouls                111
offsides               9
touches_passes      3964
free_kicks           121
corner_kicks          40
duelstackletotal     473
yellow_cards          15


In [28]:
# Normalize with the minimum and maximum of the training data
final_team_normal = (final_team - play_score_temp.min()) / (play_score_temp.max() - play_score_temp.min())
final_team_normal

wins                0.666667
losses              0.000000
ties                0.500000
points              0.777778
goals_for           0.142857
goals_against       0.090909
games_played        1.000000
goals               0.631579
opponent_goals      0.333333
shots               0.777778
shots_on_goal       0.771429
fouls               0.743902
offsides            0.200000
touches_passes      0.773591
free_kicks          0.875000
corner_kicks        0.432432
duelstackletotal    0.841945
yellow_cards        0.785714
red_cards           0.000000
global_rank         0.086957
crowns              0.200000
presents            0.700000
wins                1.000000
losses              0.000000
ties                0.000000
points              1.000000
goals_for           0.714286
goals_against       0.090909
games_played        1.000000
goals               1.000000
opponent_goals      1.000000
shots               1.000000
shots_on_goal       0.685714
fouls               1.000000
offsides      

### Predicting the Champion [France 🇫🇷 vs. Croatia 🇭🇷]

In [30]:
model.predict(np.atleast_2d(final_team_normal))  # predict

array(['胜利'], 
      dtype='<U2')

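The predicted label is '胜利' (a win for the home side). France occupies the home position in the feature vector, so the model predicts France as the champion.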
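
The label-to-team reading can be made explicit in code. The helper below, `winner_from_label`, is hypothetical and not part of the original notebook; it assumes the labeling convention used earlier, in which every label describes the result of the team in the home position.

```python
def winner_from_label(label, home_team, visiting_team):
    """Translate a predicted result label into a team name (hypothetical helper).

    Labels describe the home team's result:
    '胜利' = win, '失败' = loss, '平局' = draw.
    """
    if label == "胜利":
        return home_team
    if label == "失败":
        return visiting_team
    if label == "平局":
        return None  # a draw does not name a winner
    raise ValueError(f"unknown label: {label!r}")


print(winner_from_label("胜利", "法国", "克罗地亚"))
```

Since a final cannot end in a draw, a predicted '平局' would need separate handling; returning `None` here just makes that case visible.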