<a href="https://colab.research.google.com/github/monda00/horse-race-notebook/blob/master/predict_show_neural_network.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Prediction with a Neural Network

This notebook goes through the whole flow, from building the training data to examining the predictions.

# Overview

- Loading libraries and data
- Data shaping
- Preprocessing
- Training
- Prediction
- Discussion

## References

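- [データ収集からディープラーニングまで全て行って競馬の予測をしてみた](https://qiita.com/kami634/items/55e49dad76396d808bf5#%E5%8F%96%E5%BE%97%E3%81%97%E3%81%9Furl%E3%82%92%E3%82%82%E3%81%A8%E3%81%ABhtml%E3%82%92%E5%BE%97%E3%82%8B) (Predicting horse races end to end, from data collection to deep learning)
- [競馬の予測をガチでやってみた](http://stockedge.hatenablog.com/entry/2016/01/03/103428) (A serious attempt at horse race prediction)
- [ディープラーニングさえあれば、競馬で回収率100%を超えられる](https://qiita.com/yossymura/items/334a8f3ef85bff081913) (With deep learning alone, you can exceed a 100% return rate in horse racing)
- [競馬予想AIを作る 〜ニューラルネットワークによる相対評価データセットの取り扱い例〜](https://cocon-corporation.com/cocontoco/horseraceprediction_ai/) (Building a horse race prediction AI: an example of handling relative-evaluation datasets with a neural network)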

# Loading libraries and data

In [1]:
import numpy as np
import pandas as pd
import re
import collections
import datetime
from tqdm import tqdm

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

In [2]:
DATA_PATH = '/content/drive/My Drive/data/horse-race/'

In [3]:
df = pd.read_csv(DATA_PATH + 'train_raw.csv')
df = df.sort_values(by=['race_date', 'race_id', 'rank'])
df.reset_index(inplace=True, drop=True)

In [4]:
df.head()

Unnamed: 0,agari,age,frame_number,horse_number,horse_weight,jockey,name,popular,race_date,race_id,race_name,rank,time,weight,win,show,clockwise,distance,field_condition,field_type,place,race_round,start_time,weather
0,39.5,牝7,3.0,3,464(+4),藤本現暉,リコーアペルタ,2.0,2019/1/1,201945010102,C3七　八,1,1:32.5,54.0,3.6,1,左,1400,良,ダ,川崎,2R,11:50,晴
1,39.8,牡7,5.0,5,502(+1),加藤和博,ミラクルツッキー,1.0,2019/1/1,201945010102,C3七　八,2,1:32.5,56.0,2.0,1,左,1400,良,ダ,川崎,2R,11:50,晴
2,40.3,牡7,2.0,2,464(+7),瀧川寿希,ロジレガシー,3.0,2019/1/1,201945010102,C3七　八,3,1:32.8,56.0,5.9,1,左,1400,良,ダ,川崎,2R,11:50,晴
3,40.1,牝7,7.0,8,399(+3),岡村健司,プチプチ,8.0,2019/1/1,201945010102,C3七　八,4,1:33.5,54.0,22.1,0,左,1400,良,ダ,川崎,2R,11:50,晴
4,41.1,牝4,8.0,10,452(+32),伊藤裕人,スエヒロドラ,4.0,2019/1/1,201945010102,C3七　八,5,1:33.8,54.0,10.3,0,左,1400,良,ダ,川崎,2R,11:50,晴


# Data shaping

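Reshape the raw data into the features listed below.

The horse's body weight and its change from the previous race become known around the Thursday just before the race.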

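|Category |Feature |
|---|---|
|Horse information |Horse number |
| |Frame number |
| |Age |
| |Sex |
| |Body weight (current) |
| |Body weight (change from previous race) |
| |Carried weight |
|Race-day information |Racecourse |
| |Number of runners |
| |Course distance |
| |Direction (left- or right-handed) |
| |Course type (ダ/芝/障: dirt/turf/jump) |
| |Weather |
| |Track condition |
| |Start hour |
| |Season |
|Same horse's past races (×5 races) |Odds |
| |Popularity |
| |Finishing position |
| |Time (seconds) |
| |Days elapsed since the race |
| |Course distance |
| |Course type (ダ/芝/障: dirt/turf/jump) |
| |Weather |
| |Track condition |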
## Creating the columns

In [5]:
id_column = ['race_id']
horse_columns = ['horse_number', 'frame_number', 'age', 'gen', 'weight', 'weight_diff', 'burden_weight']
race_columns = ['place', 'race_horse_number', 'distance', 'clockwise', 'field_type', 'field_condition', 'weather', 'time_hour', 'season']
past_race_columns_base = ['odd', 'popular', 'rank', 'time', 'elapsed_day', 'distance', 'field_type', 'field_condition', 'weather']

Create the columns for the past five races.

In [6]:
past_race_num = ['one', 'two', 'three', 'four', 'five']

In [7]:
past_race_columns = []
for n in past_race_num:
  for c in past_race_columns_base:
    past_race_columns.append('{}_before_{}'.format(n, c))

In [8]:
columns = id_column + horse_columns + race_columns + past_race_columns
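
As a quick sanity check on the naming scheme, the nested loop can also be written as a list comprehension. This is only a sketch that repeats the two lists defined above so it runs on its own.

```python
past_race_num = ['one', 'two', 'three', 'four', 'five']
past_race_columns_base = ['odd', 'popular', 'rank', 'time', 'elapsed_day',
                          'distance', 'field_type', 'field_condition', 'weather']

# Same order as the nested loop: all features of race 'one', then 'two', ...
past_race_columns = [f'{n}_before_{c}'
                     for n in past_race_num
                     for c in past_race_columns_base]

print(len(past_race_columns))                       # 45 (5 races x 9 features)
print(past_race_columns[0], past_race_columns[-1])  # one_before_odd five_before_weather
```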

## Creating the new DataFrame

In [9]:
train_df = pd.DataFrame(columns=columns)

In [10]:
input_columns = ['race_id', 'horse_number', 'frame_number', 'place', 'distance', 'clockwise', 'field_type', 'field_condition', 'weather']
train_df[input_columns] = df[input_columns]

### Age and sex

In [11]:
gen = []
age = []
for age_v in df['age']:
  # '牝7' -> sex '牝', age 7; \d+ keeps two-digit ages such as '牝10' intact
  m = re.search(r'(\D+)(\d+)', age_v)
  gen.append(m.group(1))
  age.append(int(m.group(2)))

In [12]:
train_df['age'] = age
train_df['gen'] = gen

### Carried weight

In [13]:
train_df['burden_weight'] = df['weight']

### Start hour

In [14]:
time_hour = []
for i in range(len(df)):
  start_time = df.iloc[i]['start_time']
  time_hour.append(int(re.search(r'(.*):(.*)', start_time).group(1)))

In [15]:
train_df['time_hour'] = time_hour

### Season

In [16]:
season = []
for i in range(len(df)):
  race_date = df.iloc[i]['race_date']
  race_month = int(re.search(r'\/.+?\/', race_date).group().replace('/', ''))
  if 3 <= race_month <= 5:
    season.append('spring')
  elif 6 <= race_month <= 8:
    season.append('summer')
  elif 9 <= race_month <= 11:
    season.append('autumn')
  else:
    season.append('winter')

In [17]:
train_df['season'] = season
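
The month-to-season rule in the cell above (March to May, June to August, September to November, otherwise winter) can also be expressed as a lookup table. The following is a sketch on toy dates in the same `YYYY/M/D` format as `race_date`; `MONTH_TO_SEASON` is an illustrative name, not something defined in this notebook.

```python
import pandas as pd

# Toy dates in the same format as df['race_date']
dates = pd.Series(['2019/1/1', '2019/4/15', '2019/7/20', '2019/10/5', '2019/12/31'])

# The month is the middle field of 'YYYY/M/D'
months = dates.str.split('/').str[1].astype(int)

MONTH_TO_SEASON = {
    3: 'spring', 4: 'spring', 5: 'spring',
    6: 'summer', 7: 'summer', 8: 'summer',
    9: 'autumn', 10: 'autumn', 11: 'autumn',
    12: 'winter', 1: 'winter', 2: 'winter',
}
season = months.map(MONTH_TO_SEASON)
print(season.tolist())
```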

### Number of runners

In [18]:
race_horse_number_counter = list(collections.Counter(list(df['race_id'].values)).values())

In [19]:
race_horse_number = []
for n in race_horse_number_counter:
  for _ in range(n):
    race_horse_number.append(n)

In [20]:
train_df['race_horse_number'] = race_horse_number
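
The `Counter` approach above relies on all rows of a race being contiguous, which holds here because `df` was sorted by `race_date` and `race_id`. A variant that does not depend on row order is a group-wise `transform`. The sketch below uses a toy frame, not the real data.

```python
import pandas as pd

# Toy frame: race 101 has three runners, race 102 has two
toy = pd.DataFrame({
    'race_id': [101, 101, 101, 102, 102],
    'horse_number': [1, 2, 3, 1, 2],
})

# transform('size') broadcasts each group's row count back onto its rows
toy['race_horse_number'] = toy.groupby('race_id')['race_id'].transform('size')
print(toy)
```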

### Body weight and weight change

In [21]:
weight = []
weight_diff = []
for horse_weight in df['horse_weight']:
  if horse_weight == '計不':
    # '計不' marks a horse whose weight could not be measured; keep the marker as-is
    weight.append('計不')
    weight_diff.append('計不')
  else:
    # '464(+4)' -> weight 464, weight_diff '+4' (the sign is kept, so this stays a string)
    m = re.search(r'(.*)\((.*?)\)', horse_weight)
    weight.append(int(m.group(1)))
    weight_diff.append(m.group(2))

In [22]:
train_df['weight'] = weight
train_df['weight_diff'] = weight_diff
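
Because the cell above keeps `'計不'` as a string, the `weight` and `weight_diff` columns end up with mixed types. If numeric columns are preferred, one option is to extract both numbers with a single pattern and let unmatched rows become `NaN`. This is a sketch on sample values. It assumes the only formats are `NNN(+N)`, `NNN(-N)`, `NNN(N)` and the `'計不'` marker, which has not been checked against the full dataset.

```python
import pandas as pd

# Sample values in the format of df['horse_weight']
sample = pd.Series(['464(+4)', '502(+1)', '452(+32)', '470(-6)', '計不'])

# Named groups become the column names; rows that do not match give NaN
parts = sample.str.extract(r'^(?P<weight>\d+)\((?P<weight_diff>[+-]?\d+)\)$')
parts = parts.astype(float)
print(parts)
```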

### Past race data

In [23]:
train_df[past_race_columns] = 0

In [None]:
for i in tqdm(range(len(df))):
  horse_name = df.iloc[i]['name']
  race_date = datetime.datetime.strptime(df.iloc[i]['race_date'], "%Y/%m/%d")
  past_num = 0

  horse_df = df.iloc[:i].query('name == "{}"'.format(horse_name))
  for j in range(len(horse_df)-1, -1, -1):
    race_date_before = datetime.datetime.strptime(horse_df.iloc[j]['race_date'], "%Y/%m/%d")
    train_df.loc[i, '{}_before_{}'.format(past_race_num[past_num], 'odd')] = horse_df.iloc[j]['win']
    train_df.loc[i, '{}_before_{}'.format(past_race_num[past_num], 'popular')] = horse_df.iloc[j]['popular']
    train_df.loc[i, '{}_before_{}'.format(past_race_num[past_num], 'rank')] = horse_df.iloc[j]['rank']
    train_df.loc[i, '{}_before_{}'.format(past_race_num[past_num], 'time')] = horse_df.iloc[j]['time']
    train_df.loc[i, '{}_before_{}'.format(past_race_num[past_num], 'elapsed_day')] = abs(race_date - race_date_before).days
    train_df.loc[i, '{}_before_{}'.format(past_race_num[past_num], 'distance')] = horse_df.iloc[j]['distance']
    train_df.loc[i, '{}_before_{}'.format(past_race_num[past_num], 'field_type')] = horse_df.iloc[j]['field_type']
    train_df.loc[i, '{}_before_{}'.format(past_race_num[past_num], 'field_condition')] = horse_df.iloc[j]['field_condition']
    train_df.loc[i, '{}_before_{}'.format(past_race_num[past_num], 'weather')] = horse_df.iloc[j]['weather']
    past_num += 1
    
    if past_num >= len(past_race_num):  # stop once all five past-race slots are filled
      break

  6%|▌         | 15472/253636 [04:34<2:41:17, 24.61it/s]
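
The row-by-row loop rescans all earlier rows for every horse, which is why the progress bar estimates hours. A possibly faster alternative is to shift each column within its horse group. The sketch below runs on a toy frame with only `rank` and `elapsed_day` and two past races; it assumes rows are already sorted by date, as `df` is, and it leaves missing history as `NaN` rather than `0`.

```python
import datetime

import pandas as pd

toy = pd.DataFrame({
    'name': ['A', 'B', 'A', 'B', 'A'],
    'race_date': ['2019/1/1', '2019/1/1', '2019/2/1', '2019/2/10', '2019/3/1'],
    'rank': [1, 2, 3, 1, 2],
})
# Parse the dates the same way the notebook does
toy['race_date'] = pd.to_datetime(
    toy['race_date'].map(lambda s: datetime.datetime.strptime(s, '%Y/%m/%d')))

for k, label in enumerate(['one', 'two'], start=1):
    by_horse = toy.groupby('name')
    # shift(k) looks k races back within the same horse
    toy[f'{label}_before_rank'] = by_horse['rank'].shift(k)
    # Days between the current race and the k-th previous race
    toy[f'{label}_before_elapsed_day'] = (
        toy['race_date'] - by_horse['race_date'].shift(k)).dt.days

print(toy)
```

Extending the inner list to all five labels and the remaining base columns would follow the same pattern; a final `fillna(0)` would mimic the zero default used above.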

In [None]:
train_df

In [None]:
train_df.to_csv(DATA_PATH + 'train_nn.csv', index=False)