<a href="https://colab.research.google.com/github/monda00/horse-race-notebook/blob/master/predict_show_neural_network.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ニューラルネットワークで予測

学習データの作成から予測の考察までやってみる。

- 単純に１行ずつ学習
- 各レースのデータ数を揃えて１レースまとめて学習

特徴量

|分類	|項目 |
|---|---|
|馬情報	|馬番 |
| |枠番 |
| |年齢 |
| |性別 |
| |体重（現在） | 
| |体重（前走との差分） |
| |負担重量 |
| 当日レース情報 |レース場 |
| |出走馬数 |
| |コース距離 |
| |周回方向 |
| |コースタイプ（ダ/芝/障） |
| |天気 |
| |馬場状態 |
| |開始時間帯 |
| |時期 |
|同馬の過去レース情報（×5走分）	|オッズ |
| |人気 |
| |順位 |
| |タイム（秒） |
| |前走からの経過日数 |
| |コース距離 |
| |コースタイプ（ダ/芝/障） |
| |天気 |
| |馬場状態 |

# 概要

- ライブラリ・データ読み込み
- 前処理
- 学習
- 予測
- 考察

## 参考

- [データ収集からディープラーニングまで全て行って競馬の予測をしてみた](https://qiita.com/kami634/items/55e49dad76396d808bf5#%E5%8F%96%E5%BE%97%E3%81%97%E3%81%9Furl%E3%82%92%E3%82%82%E3%81%A8%E3%81%ABhtml%E3%82%92%E5%BE%97%E3%82%8B)
- [競馬の予測をガチでやってみた](http://stockedge.hatenablog.com/entry/2016/01/03/103428)
- [ディープラーニングさえあれば、競馬で回収率100%を超えられる](https://qiita.com/yossymura/items/334a8f3ef85bff081913)
- [競馬予想AIを作る 〜ニューラルネットワークによる相対評価データセットの取り扱い例〜](https://cocon-corporation.com/cocontoco/horseraceprediction_ai/)

# ライブラリ・データ読み込み

In [1]:
import numpy as np
import pandas as pd
import datetime
from tqdm import tqdm
import collections

from tensorflow import keras
from tensorflow.keras.layers import Input, Dense, Dropout, BatchNormalization
from tensorflow.keras.models import Model

from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split

In [2]:
DATA_PATH = '/content/drive/My Drive/data/horse-race/'

In [3]:
df = pd.read_csv(DATA_PATH + 'train_nn.csv')

  interactivity=interactivity, compiler=compiler, result=result)


In [4]:
df.head()

Unnamed: 0,race_id,horse_number,frame_number,age,gen,weight,weight_diff,burden_weight,place,race_horse_number,distance,clockwise,field_type,field_condition,weather,time_hour,season,one_before_odd,one_before_popular,one_before_rank,one_before_time,one_before_elapsed_day,one_before_distance,one_before_field_type,one_before_field_condition,one_before_weather,two_before_odd,two_before_popular,two_before_rank,two_before_time,two_before_elapsed_day,two_before_distance,two_before_field_type,two_before_field_condition,two_before_weather,three_before_odd,three_before_popular,three_before_rank,three_before_time,three_before_elapsed_day,three_before_distance,three_before_field_type,three_before_field_condition,three_before_weather,four_before_odd,four_before_popular,four_before_rank,four_before_time,four_before_elapsed_day,four_before_distance,four_before_field_type,four_before_field_condition,four_before_weather,five_before_odd,five_before_popular,five_before_rank,five_before_time,five_before_elapsed_day,five_before_distance,five_before_field_type,five_before_field_condition,five_before_weather,show,date
0,201945010102,1.0,1.0,7.0,牝,448,0,54.0,川崎,10.0,1400.0,左,ダ,良,晴,11.0,winter,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0.0,2019/1/1
1,201945010102,2.0,2.0,7.0,牡,464,7,56.0,川崎,10.0,1400.0,左,ダ,良,晴,11.0,winter,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,1.0,2019/1/1
2,201945010102,3.0,3.0,7.0,牝,464,4,54.0,川崎,10.0,1400.0,左,ダ,良,晴,11.0,winter,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,1.0,2019/1/1
3,201945010102,4.0,4.0,6.0,牡,449,7,55.0,川崎,10.0,1400.0,左,ダ,良,晴,11.0,winter,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0.0,2019/1/1
4,201945010102,5.0,5.0,7.0,牡,502,1,56.0,川崎,10.0,1400.0,左,ダ,良,晴,11.0,winter,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,1.0,2019/1/1


# 前処理

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 407340 entries, 0 to 407339
Data columns (total 64 columns):
 #   Column                        Non-Null Count   Dtype  
---  ------                        --------------   -----  
 0   race_id                       407340 non-null  int64  
 1   horse_number                  253636 non-null  float64
 2   frame_number                  253636 non-null  float64
 3   age                           253636 non-null  float64
 4   gen                           253636 non-null  object 
 5   weight                        253636 non-null  object 
 6   weight_diff                   253636 non-null  object 
 7   burden_weight                 253636 non-null  float64
 8   place                         253636 non-null  object 
 9   race_horse_number             253636 non-null  float64
 10  distance                      253636 non-null  float64
 11  clockwise                     253636 non-null  object 
 12  field_type                    253636 non-nul

## 日付削除

日付はソートのために利用するだけのため削除

In [7]:
df = df.drop('date', axis=1)

## 欠損値

0埋めする。

In [9]:
df.isnull().sum().sum()

9529648

In [19]:
df = df.fillna(0)

過去レースの情報がない馬も結構いるかもしれない。

23752個のデータで過去レースの情報がない。
数値と文字列の0が混在している。

In [11]:
past_race_columns_base = ['odd', 'popular', 'rank', 'time', 'elapsed_day', 'distance', 'field_type', 'field_condition', 'weather']
past_race_num = ['one', 'two', 'three', 'four', 'five']
past_race_columns = []
for n in past_race_num:
  for c in past_race_columns_base:
    past_race_columns.append('{}_before_{}'.format(n, c))

## weight

計測不能が混じっている。

In [23]:
len(df[df['weight'] == '計不'])

44

In [24]:
df[df['weight'] == '計不'] = 0

In [25]:
df['weight'] = df['weight'].astype('int64')

## weight diff

型を変換する

In [26]:
df['weight_diff'] = df['weight_diff'].astype('int64')

## Label Encoding

In [28]:
categorical_cols = ['gen', 'place', 'clockwise', 'field_type', 'field_condition', 'weather', 'season']
categorical_cols_past_base = ['field_type', 'field_condition', 'weather']

In [29]:
categorical_cols_past = []
for n in past_race_num:
  for c in categorical_cols_past_base:
    categorical_cols_past.append('{}_before_{}'.format(n, c))

In [30]:
categorical_cols = categorical_cols + categorical_cols_past

In [42]:
df[categorical_cols] = df[categorical_cols].astype(str)

In [45]:
for c in categorical_cols:
  le = LabelEncoder()
  le.fit(df[c])
  df[c] = le.transform(df[c])

In [46]:
df

Unnamed: 0,race_id,horse_number,frame_number,age,gen,weight,weight_diff,burden_weight,place,race_horse_number,distance,clockwise,field_type,field_condition,weather,time_hour,season,one_before_odd,one_before_popular,one_before_rank,one_before_time,one_before_elapsed_day,one_before_distance,one_before_field_type,one_before_field_condition,one_before_weather,two_before_odd,two_before_popular,two_before_rank,two_before_time,two_before_elapsed_day,two_before_distance,two_before_field_type,two_before_field_condition,two_before_weather,three_before_odd,three_before_popular,three_before_rank,three_before_time,three_before_elapsed_day,three_before_distance,three_before_field_type,three_before_field_condition,three_before_weather,four_before_odd,four_before_popular,four_before_rank,four_before_time,four_before_elapsed_day,four_before_distance,four_before_field_type,four_before_field_condition,four_before_weather,five_before_odd,five_before_popular,five_before_rank,five_before_time,five_before_elapsed_day,five_before_distance,five_before_field_type,five_before_field_condition,five_before_weather,show
0,201945010102,1.0,1.0,7.0,2,448,0,54.0,11,10.0,1400.0,4,1,3,3,11.0,4,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,1,1,1,0.0,0.0,0.0,0.0,0.0,0.0,1,1,1,0.0,0.0,0.0,0.0,0.0,0.0,1,1,1,0.0,0.0,0.0,0.0,0.0,0.0,1,1,1,0.0
1,201945010102,2.0,2.0,7.0,3,464,7,56.0,11,10.0,1400.0,4,1,3,3,11.0,4,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,1,1,1,0.0,0.0,0.0,0.0,0.0,0.0,1,1,1,0.0,0.0,0.0,0.0,0.0,0.0,1,1,1,0.0,0.0,0.0,0.0,0.0,0.0,1,1,1,1.0
2,201945010102,3.0,3.0,7.0,2,464,4,54.0,11,10.0,1400.0,4,1,3,3,11.0,4,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,1,1,1,0.0,0.0,0.0,0.0,0.0,0.0,1,1,1,0.0,0.0,0.0,0.0,0.0,0.0,1,1,1,0.0,0.0,0.0,0.0,0.0,0.0,1,1,1,1.0
3,201945010102,4.0,4.0,6.0,3,449,7,55.0,11,10.0,1400.0,4,1,3,3,11.0,4,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,1,1,1,0.0,0.0,0.0,0.0,0.0,0.0,1,1,1,0.0,0.0,0.0,0.0,0.0,0.0,1,1,1,0.0,0.0,0.0,0.0,0.0,0.0,1,1,1,0.0
4,201945010102,5.0,5.0,7.0,3,502,1,56.0,11,10.0,1400.0,4,1,3,3,11.0,4,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,1,1,1,0.0,0.0,0.0,0.0,0.0,0.0,1,1,1,0.0,0.0,0.0,0.0,0.0,0.0,1,1,1,0.0,0.0,0.0,0.0,0.0,0.0,1,1,1,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
407335,202048060911,0.0,0.0,0.0,0,0,0,0.0,0,0.0,0.0,0,0,0,0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0.0
407336,202048060911,0.0,0.0,0.0,0,0,0,0.0,0,0.0,0.0,0,0,0,0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0.0
407337,202048060911,0.0,0.0,0.0,0,0,0,0.0,0,0.0,0.0,0,0,0,0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0.0
407338,202048060911,0.0,0.0,0.0,0,0,0,0.0,0,0.0,0.0,0,0,0,0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0.0


# 学習

## 学習データと検証データに分割

In [21]:
train_df, test_df = train_test_split(df, test_size=53584, shuffle=False)

In [22]:
test_df

Unnamed: 0,race_id,horse_number,frame_number,age,gen,weight,weight_diff,burden_weight,place,race_horse_number,distance,clockwise,field_type,field_condition,weather,time_hour,season,one_before_odd,one_before_popular,one_before_rank,one_before_time,one_before_elapsed_day,one_before_distance,one_before_field_type,one_before_field_condition,one_before_weather,two_before_odd,two_before_popular,two_before_rank,two_before_time,two_before_elapsed_day,two_before_distance,two_before_field_type,two_before_field_condition,two_before_weather,three_before_odd,three_before_popular,three_before_rank,three_before_time,three_before_elapsed_day,three_before_distance,three_before_field_type,three_before_field_condition,three_before_weather,four_before_odd,four_before_popular,four_before_rank,four_before_time,four_before_elapsed_day,four_before_distance,four_before_field_type,four_before_field_condition,four_before_weather,five_before_odd,five_before_popular,five_before_rank,five_before_time,five_before_elapsed_day,five_before_distance,five_before_field_type,five_before_field_condition,five_before_weather,show
200047,202048031304,2,2.0,4,2,528,0,56.0,5,9,1400,0,0,1,2,12,1,4.4,2.0,5,291.0,14,1400,1,3,3,1.3,1.0,1,691.0,44,1400,1,4,3,1.6,1.0,2,991.0,58,1400,1,2,3,6.3,3.0,8,294.0,293,1600,2,3,3,2.2,1.0,4,586.0,348,1400,1,2,3,1
200048,202048031304,8,8.0,4,1,477,5,54.0,5,9,1400,0,0,1,2,12,1,2.6,2.0,2,391.0,15,1400,1,3,3,2.2,2.0,2,990.0,44,1400,1,4,3,1.5,1.0,1,91.0,71,1400,1,4,3,3.3,2.0,2,1005.0,58,1600,1,3,3,106.8,12.0,12,111.0,215,1700,1,3,3,1
200049,202048031304,1,1.0,4,1,441,1,54.0,5,9,1400,0,0,1,2,12,1,3.5,2.0,2,891.0,18,1400,1,3,3,15.2,4.0,6,692.0,30,1400,1,3,3,1.4,1.0,1,385.0,70,1300,1,1,2,7.2,3.0,5,293.0,44,1400,1,4,3,107.5,14.0,6,524.0,194,2000,2,4,1,1
200050,202048031304,9,8.0,4,2,464,-4,56.0,5,9,1400,0,0,1,2,12,1,25.7,6.0,5,92.0,15,1400,1,3,3,23.0,7.0,2,291.0,30,1400,1,3,3,15.0,4.0,3,491.0,44,1400,1,4,3,13.7,4.0,4,792.0,56,1400,1,3,4,106.0,10.0,2,193.0,192,1400,1,4,5,0
200051,202048031304,5,5.0,4,1,463,-2,52.0,5,9,1400,0,0,1,2,12,1,5.4,2.0,3,191.0,15,1400,1,3,3,1.9,1.0,4,991.0,30,1400,1,3,3,6.5,3.0,4,292.0,44,1400,1,4,3,3.9,2.0,2,291.0,71,1400,1,4,3,1.7,1.0,1,291.0,58,1400,1,2,3,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
253631,202048060911,10,8.0,5,1,435,-3,54.0,5,10,1600,0,0,2,2,17,2,27.8,6.0,1,90.0,11,1400,1,3,3,187.2,14.0,13,506.0,61,1600,1,3,4,122.6,13.0,13,875.0,40,1200,1,2,3,279.6,13.0,14,307.0,83,1600,1,2,3,204.5,11.0,12,276.0,123,1200,1,3,3,0
253632,202048060911,1,1.0,6,1,509,0,51.0,5,10,1600,0,0,2,2,17,2,4.0,3.0,1,1002.0,25,1600,1,3,4,17.0,5.0,6,903.0,321,1600,1,3,3,5.7,2.0,9,592.0,348,1400,1,4,5,1.8,1.0,1,489.0,364,1400,1,4,4,5.5,3.0,1,105.0,404,1600,1,4,3,0
253633,202048060911,4,4.0,5,2,485,-5,56.0,5,10,1600,0,0,2,2,17,2,1.8,1.0,1,989.0,11,1400,1,3,3,49.9,14.0,16,375.0,122,1200,1,3,3,22.4,8.0,10,515.0,374,1800,1,3,3,14.6,7.0,5,313.0,394,1800,1,3,3,6.8,4.0,8,813.0,416,1800,1,3,3,0
253634,202048060911,9,8.0,4,1,403,-2,55.0,5,10,1600,0,0,2,2,17,2,3.4,2.0,1,188.0,34,1400,1,3,4,5.8,3.0,1,417.0,14,1800,1,3,4,2.8,2.0,3,590.0,46,1400,1,3,3,1.8,1.0,3,91.0,60,1400,1,3,3,2.5,1.0,4,890.0,88,1400,1,2,4,0


In [23]:
X_train = train_df.drop(['race_id', 'show'], axis=1).values
y_train = train_df['show'].values
X_test = test_df.drop(['race_id', 'show'], axis=1).values
y_test = test_df['show'].values  

In [24]:
train_race_id_counter = collections.Counter(list(train_df['race_id'].values))
test_race_id_counter = collections.Counter(list(test_df['race_id'].values))
train_query = list(train_race_id_counter.values())
test_query = list(test_race_id_counter.values())

## 正規化

In [25]:
sc = StandardScaler()
sc.fit(X_train)

StandardScaler(copy=True, with_mean=True, with_std=True)

In [26]:
X_train_sc = sc.transform(X_train)
X_test_sc = sc.transform(X_test)

## モデルの定義

In [27]:
def define_model(input_shape):
  inp = Input(shape=input_shape)
  x = Dense(300, activation='relu')(inp)
  x = Dropout(0.2)(x)
  x = Dense(150, activation='relu')(x)
  x = Dropout(0.2)(x)
  x = Dense(50, activation='relu')(x)
  x = Dropout(0.2)(x)
  x = Dense(1, activation='sigmoid')(x)

  model = Model(inp, x)
  model.compile(optimizer='adam', loss='mean_squared_error')

  return model

## 学習

In [28]:
model = define_model((X_train_sc.shape[1],))

In [29]:
epoch = 50
batch = 4

In [30]:
model.fit(X_train_sc, y_train, epochs=epoch, batch_size=batch)

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50


<tensorflow.python.keras.callbacks.History at 0x7f6914ac46d8>

In [31]:
model.save_weights(DATA_PATH + 'param.hdf5')

# 予測

In [32]:
pred = model.predict(X_test_sc)

In [33]:
pred

array([[0.55354965],
       [0.51395077],
       [0.4867122 ],
       ...,
       [0.5785478 ],
       [0.5286216 ],
       [0.20870417]], dtype=float32)

レースごとに予測された確率が最も高い馬が３位以内に入っている確率を算出する。

In [34]:
def calc_prob(predict):
  stack_q = 0
  correct = 0
  for query in test_query:
    ind = np.argmax(predict[stack_q:stack_q+query])
    stack_q += query
    if test_df.iloc[ind]['show'] == 1:
      correct += 1

  print('score is', correct / len(test_query))

In [35]:
calc_prob(pred)

score is 0.6303526448362721


# 考察

## 単純なニューラルネットワークモデル

- 0.6406
