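# Processing the IC network data

## What this notebook does
- Prepares the nationwide IC (interchange) network table for release to contest participants

## Input
- Nationwide IC network table: `Input_processed_data/road_master/220303-doronet_ic.csv`

## Output
- Processed nationwide IC network table: `Input_processed_data/road_master/icnet_all.csv`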

In [1]:
import pandas as pd

In [2]:
# data directory
PROCESSED_DATA_DIR = '../../Input_processed_data'

In [3]:
df_icnet = pd.read_csv(
    f'{PROCESSED_DATA_DIR}/road_master/220303-doronet_ic.csv',
    dtype={'start_code': str, 'end_code': str, 'road_code': str}
)
df_icnet.head()

Unnamed: 0,rosen_code,road_code,road_name,direction,start_code,end_code,billing_No,start_name,end_name,start_lat,start_lng,end_lat,end_lng,distance,start_degree,end_degree
0,109,1010,【E1】東名高速道路,0,1010001,1010004,1,東京,東京本線,35.62455,139.626661,35.5911,139.576578,5.86,4,4
1,109,1010,【E1】東名高速道路,0,1010004,1010006,6,東京本線,東名川崎,35.5911,139.576578,35.58415,139.573103,0.83,4,4
2,109,1010,【E1】東名高速道路,0,1010006,1010016,11,東名川崎,横浜青葉,35.58415,139.573103,35.54295,139.540811,5.43,4,4
3,109,1010,【E1】東名高速道路,0,1010016,1010018,16,横浜青葉,横浜青葉ＪＣＴ,35.54295,139.540811,35.54295,139.540811,0.0,4,6
4,109,1010,【E1】東名高速道路,0,1010018,1010011,21,横浜青葉ＪＣＴ,港北ＰＡ,35.54295,139.540811,35.53066,139.533192,1.53,6,4


In [4]:
df_icnet.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5624 entries, 0 to 5623
Data columns (total 16 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   rosen_code    5624 non-null   int64  
 1   road_code     5624 non-null   object 
 2   road_name     5624 non-null   object 
 3   direction     5624 non-null   int64  
 4   start_code    5624 non-null   object 
 5   end_code      5624 non-null   object 
 6   billing_No    5624 non-null   int64  
 7   start_name    5624 non-null   object 
 8   end_name      5624 non-null   object 
 9   start_lat     5588 non-null   float64
 10  start_lng     5588 non-null   float64
 11  end_lat       5588 non-null   float64
 12  end_lng       5588 non-null   float64
 13  distance      5624 non-null   float64
 14  start_degree  5624 non-null   int64  
 15  end_degree    5624 non-null   int64  
dtypes: float64(5), int64(5), object(6)
memory usage: 703.1+ KB


In [5]:
COLUMNS = ['start_code', 'end_code', 'start_name', 'end_name', 'road_code', 'direction', 'distance']

df = df_icnet.loc[:, COLUMNS]
df.head()

Unnamed: 0,start_code,end_code,start_name,end_name,road_code,direction,distance
0,1010001,1010004,東京,東京本線,1010,0,5.86
1,1010004,1010006,東京本線,東名川崎,1010,0,0.83
2,1010006,1010016,東名川崎,横浜青葉,1010,0,5.43
3,1010016,1010018,横浜青葉,横浜青葉ＪＣＴ,1010,0,0.0
4,1010018,1010011,横浜青葉ＪＣＴ,港北ＰＡ,1010,0,1.53


In [6]:
# Replace the numeric codes in the direction column with labels (0 -> 下り "down line", 1 -> 上り "up line")
df = df.replace({'direction': {0: '下り', 1: '上り'}})
df.head()

Unnamed: 0,start_code,end_code,start_name,end_name,road_code,direction,distance
0,1010001,1010004,東京,東京本線,1010,下り,5.86
1,1010004,1010006,東京本線,東名川崎,1010,下り,0.83
2,1010006,1010016,東名川崎,横浜青葉,1010,下り,5.43
3,1010016,1010018,横浜青葉,横浜青葉ＪＣＴ,1010,下り,0.0
4,1010018,1010011,横浜青葉ＪＣＴ,港北ＰＡ,1010,下り,1.53
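
`DataFrame.replace` leaves any value that is not in the mapping unchanged, so an unexpected direction code would pass through silently. The following is a minimal sketch of one way to detect such codes, assuming only 0 and 1 are valid; the toy frame and the names `toy`, `DIRECTION_LABELS`, and `unmapped` are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy data; the value 2 stands in for an unexpected direction code.
toy = pd.DataFrame({'direction': [0, 1, 0, 2]})

# Same mapping as the cell above: 0 -> 下り (down line), 1 -> 上り (up line).
DIRECTION_LABELS = {0: '下り', 1: '上り'}

# Series.map yields NaN for keys that are missing from the dict,
# so unmapped codes can be found with isna().
labels = toy['direction'].map(DIRECTION_LABELS)
unmapped = toy.loc[labels.isna(), 'direction'].unique().tolist()
print(unmapped)
```

An empty `unmapped` list would indicate that every code in the column has a label.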


In [7]:
OUTPUT_FILE = f'{PROCESSED_DATA_DIR}/road_master/icnet_all.csv'

In [15]:
# df.to_csv(OUTPUT_FILE, index=False)

In [16]:
! head -n5 "$OUTPUT_FILE"

start_code,end_code,start_name,end_name,road_code,direction,distance
1010001,1010004,東京,東京本線,1010,下り,5.86
1010004,1010006,東京本線,東名川崎,1010,下り,0.83
1010006,1010016,東名川崎,横浜青葉,1010,下り,5.43
1010016,1010018,横浜青葉,横浜青葉ＪＣＴ,1010,下り,0.0


In [17]:
! tail -n5 "$OUTPUT_FILE"

1800001,9134010,練馬,谷原交差点,9134,上り,1.16
9020080,9020060,はわい,大栄東伯,X409,下り,14.03
9020060,9020080,大栄東伯,はわい,X409,上り,14.03
9132020,1100006,高井戸１丁目,調布,X507,下り,7.16
1100006,9132020,調布,高井戸１丁目,X507,上り,7.16


In [18]:
pd.read_csv(OUTPUT_FILE).info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5624 entries, 0 to 5623
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   start_code  5624 non-null   object 
 1   end_code    5624 non-null   object 
 2   start_name  5624 non-null   object 
 3   end_name    5624 non-null   object 
 4   road_code   5624 non-null   object 
 5   direction   5624 non-null   object 
 6   distance    5624 non-null   float64
dtypes: float64(1), object(6)
memory usage: 307.7+ KB


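# Duplicate records issue (2024-01-17)

A total of 16 duplicated records were found. All of them arose because the route code (`rosen_code`) and detail number (`billing_No`) columns, which exist in the road-structure master data, were dropped for the contest release. For example, the duplicate that SIGNATE pointed out corresponds to two distinct records in the master data, as shown in the listing below.

Apart from the route code and the detail number, every column matches, and there seems to be no need to distinguish records by these two columns in this contest, so removing the duplicates appears to be safe.

However, I do not remember the meaning and definition of each column, and I cannot find a file that documents the format, so please let me know if you have anything to add or any concerns.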
```
      rosen_code road_code     road_name  direction start_code end_code  \
2950         423      1850  【E1A】新東名高速道路          1    1010033  1850110   
2955         454      1850  【E1A】新東名高速道路          1    1010033  1850110   

      billing_No start_name end_name  start_lat   start_lng   end_lat  \
2950           6     伊勢原ＪＣＴ      厚木南   35.40869  139.314272  35.40315   
2955           1     伊勢原ＪＣＴ      厚木南   35.40869  139.314272  35.40315   

         end_lng  distance  start_degree  end_degree  
2950  139.359189      4.12             8           4  
2955  139.359189      4.12             8           4  

```
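
The mechanism described above can be reproduced on a small scale. The sketch below rebuilds the two master rows from the listing (only a subset of their columns, with values copied from it) and shows that they become exact duplicates once `rosen_code` and `billing_No` are dropped. The names `master` and `released` are hypothetical.

```python
import pandas as pd

# Two master rows copied from the listing above (subset of columns only).
master = pd.DataFrame({
    'rosen_code': [423, 454],
    'billing_No': [6, 1],
    'road_code': ['1850', '1850'],
    'start_code': ['1010033', '1010033'],
    'end_code': ['1850110', '1850110'],
    'direction': [1, 1],
    'distance': [4.12, 4.12],
})

# With rosen_code and billing_No present, the two rows are distinct.
print(master.duplicated().sum())    # 0

# Dropping those two columns makes the second row an exact copy of the first.
released = master.drop(columns=['rosen_code', 'billing_No'])
print(released.duplicated().sum())  # 1
```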

In [30]:
df.loc[df.duplicated()].shape

(16, 7)

In [18]:
df.loc[df.duplicated()].head()

Unnamed: 0,start_code,end_code,start_name,end_name,road_code,direction,distance
133,1010031,1010033,厚木,伊勢原ＪＣＴ,1010,下り,4.79
134,1010033,1010031,伊勢原ＪＣＴ,厚木,1010,上り,4.79
135,1010033,1010036,伊勢原ＪＣＴ,秦野中井,1010,下り,9.45
136,1010036,1010033,秦野中井,伊勢原ＪＣＴ,1010,上り,9.45
137,1010061,1010063,御殿場,御殿場ＪＣＴ,1010,下り,4.28


In [29]:
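# For each duplicated row in the released table, print the matching records
# from the master table so that the differing columns can be compared.
for _, row in df.loc[df.duplicated()].iterrows():
    s_code, e_code = row.start_code, row.end_code
    print(df_icnet.loc[(df_icnet.start_code == s_code) & (df_icnet.end_code == e_code)])
    print(f'\n{"-"*20}\n')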

     rosen_code road_code   road_name  direction start_code end_code  \
10          109      1010  【E1】東名高速道路          0    1010031  1010033   
133         452      1010  【E1】東名高速道路          0    1010031  1010033   

     billing_No start_name end_name  start_lat   start_lng   end_lat  \
10           51         厚木   伊勢原ＪＣＴ   35.41797  139.365825  35.40869   
133           1         厚木   伊勢原ＪＣＴ   35.41797  139.365825  35.40869   

        end_lng  distance  start_degree  end_degree  
10   139.314272      4.79             6           8  
133  139.314272      4.79             6           8  

--------------------

     rosen_code road_code   road_name  direction start_code end_code  \
122         109      1010  【E1】東名高速道路          1    1010033  1010031   
134         452      1010  【E1】東名高速道路          1    1010033  1010031   

     billing_No start_name end_name  start_lat   start_lng   end_lat  \
122         276     伊勢原ＪＣＴ       厚木   35.40869  139.314272  35.41797   
134           1     

In [21]:
df_icnet.loc[(df_icnet.start_code == '1010031') & (df_icnet.end_code =='1010033')]

Unnamed: 0,rosen_code,road_code,road_name,direction,start_code,end_code,billing_No,start_name,end_name,start_lat,start_lng,end_lat,end_lng,distance,start_degree,end_degree
10,109,1010,【E1】東名高速道路,0,1010031,1010033,51,厚木,伊勢原ＪＣＴ,35.41797,139.365825,35.40869,139.314272,4.79,6,8
133,452,1010,【E1】東名高速道路,0,1010031,1010033,1,厚木,伊勢原ＪＣＴ,35.41797,139.365825,35.40869,139.314272,4.79,6,8


In [19]:
df.loc[(df.start_code == '1010031') & (df.end_code =='1010033')]

Unnamed: 0,start_code,end_code,start_name,end_name,road_code,direction,distance
10,1010031,1010033,厚木,伊勢原ＪＣＴ,1010,下り,4.79
133,1010031,1010033,厚木,伊勢原ＪＣＴ,1010,下り,4.79
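
As an alternative to looping over the duplicated rows, every member of each duplicate group can be pulled from the master table in one step. This is a sketch on a hypothetical miniature table (`toy_master`) whose first two rows reuse values from the output above; `KEY_COLUMNS` is an assumed subset of the released columns, not the full `COLUMNS` list.

```python
import pandas as pd

# Hypothetical miniature of the master table:
# rows 0 and 1 differ only in rosen_code and billing_No.
toy_master = pd.DataFrame({
    'rosen_code': [109, 452, 109],
    'billing_No': [51, 1, 1],
    'start_code': ['1010031', '1010031', '1010001'],
    'end_code': ['1010033', '1010033', '1010004'],
    'distance': [4.79, 4.79, 5.86],
})

# Assumed subset of the columns kept in the released file.
KEY_COLUMNS = ['start_code', 'end_code', 'distance']

# keep=False marks every member of a duplicate group,
# not only the second and later occurrences.
mask = toy_master.duplicated(subset=KEY_COLUMNS, keep=False)
groups = toy_master.loc[mask].sort_values(KEY_COLUMNS)
print(groups)
```

Sorting by the key columns places the members of each group next to each other, which makes the differing `rosen_code` and `billing_No` values easy to compare.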


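# Removing duplicates from the nationwide road-structure data (2024-01-19)
This work was done directly under the `Competition2` directory, so the paths differ from those used above.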

In [2]:
df_all = pd.read_csv(
    '../SIGNATE納品データ/road_all.csv',
    dtype={'start_code': str, 'end_code': str, 'road_code': str}
)
df_all.head()

Unnamed: 0,start_code,end_code,start_name,end_name,road_code,direction,distance
0,1010001,1010004,東京,東京本線,1010,下り,5.86
1,1010004,1010006,東京本線,東名川崎,1010,下り,0.83
2,1010006,1010016,東名川崎,横浜青葉,1010,下り,5.43
3,1010016,1010018,横浜青葉,横浜青葉ＪＣＴ,1010,下り,0.0
4,1010018,1010011,横浜青葉ＪＣＴ,港北ＰＡ,1010,下り,1.53


In [5]:
df_all.loc[df_all.duplicated()]

Unnamed: 0,start_code,end_code,start_name,end_name,road_code,direction,distance
133,1010031,1010033,厚木,伊勢原ＪＣＴ,1010,下り,4.79
134,1010033,1010031,伊勢原ＪＣＴ,厚木,1010,上り,4.79
135,1010033,1010036,伊勢原ＪＣＴ,秦野中井,1010,下り,9.45
136,1010036,1010033,秦野中井,伊勢原ＪＣＴ,1010,上り,9.45
137,1010061,1010063,御殿場,御殿場ＪＣＴ,1010,下り,4.28
138,1010063,1010061,御殿場ＪＣＴ,御殿場,1010,上り,4.28
139,1010063,1010066,御殿場ＪＣＴ,駒門ＰＡ,1010,下り,2.32
140,1010066,1010063,駒門ＰＡ,御殿場ＪＣＴ,1010,上り,2.32
2952,1010033,1850120,伊勢原ＪＣＴ,伊勢原大山,1850,下り,2.2
2953,1850120,1010033,伊勢原大山,伊勢原ＪＣＴ,1850,上り,2.2


In [9]:
df_all = df_all.drop_duplicates().reset_index(drop=True)
print(df_all.shape)
df_all.head()

(5608, 7)


Unnamed: 0,start_code,end_code,start_name,end_name,road_code,direction,distance
0,1010001,1010004,東京,東京本線,1010,下り,5.86
1,1010004,1010006,東京本線,東名川崎,1010,下り,0.83
2,1010006,1010016,東名川崎,横浜青葉,1010,下り,5.43
3,1010016,1010018,横浜青葉,横浜青葉ＪＣＴ,1010,下り,0.0
4,1010018,1010011,横浜青葉ＪＣＴ,港北ＰＡ,1010,下り,1.53


In [10]:
OUTPUT_FILE = '../SIGNATE納品データ/road_all.csv'

In [13]:
# df_all.to_csv(OUTPUT_FILE, index=False)

In [14]:
! head -n5 "$OUTPUT_FILE"

start_code,end_code,start_name,end_name,road_code,direction,distance
1010001,1010004,東京,東京本線,1010,下り,5.86
1010004,1010006,東京本線,東名川崎,1010,下り,0.83
1010006,1010016,東名川崎,横浜青葉,1010,下り,5.43
1010016,1010018,横浜青葉,横浜青葉ＪＣＴ,1010,下り,0.0


In [15]:
! tail -n5 "$OUTPUT_FILE"

1800001,9134010,練馬,谷原交差点,9134,上り,1.16
9020080,9020060,はわい,大栄東伯,X409,下り,14.03
9020060,9020080,大栄東伯,はわい,X409,上り,14.03
9132020,1100006,高井戸１丁目,調布,X507,下り,7.16
1100006,9132020,調布,高井戸１丁目,X507,上り,7.16


In [16]:
pd.read_csv(OUTPUT_FILE).info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5608 entries, 0 to 5607
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   start_code  5608 non-null   object 
 1   end_code    5608 non-null   object 
 2   start_name  5608 non-null   object 
 3   end_name    5608 non-null   object 
 4   road_code   5608 non-null   object 
 5   direction   5608 non-null   object 
 6   distance    5608 non-null   float64
dtypes: float64(1), object(6)
memory usage: 306.8+ KB
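
The row counts above are consistent with the 16 duplicates found earlier (5624 - 16 = 5608). A check of this kind could be wrapped in a small helper; the sketch below is one possible form, run on hypothetical toy data. The function name `drop_exact_duplicates` and the frame `toy` are assumptions and do not appear in the original notebook.

```python
import pandas as pd


def drop_exact_duplicates(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: drop exact duplicate rows and verify the result."""
    n_dup = int(df.duplicated().sum())
    out = df.drop_duplicates().reset_index(drop=True)
    # Each row flagged by duplicated() removes exactly one record.
    assert len(out) == len(df) - n_dup
    # No exact duplicates may remain after the drop.
    assert not out.duplicated().any()
    return out


# Toy data: the first two rows are identical.
toy = pd.DataFrame({
    'start_code': ['1010031', '1010031', '1010033'],
    'end_code': ['1010033', '1010033', '1010031'],
    'direction': ['下り', '下り', '上り'],
    'distance': [4.79, 4.79, 4.79],
})
deduped = drop_exact_duplicates(toy)
print(deduped.shape)  # (2, 4)
```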
