## Chapter 1

### 1.1 Data Translation

As you might have noticed from a quick look at the dataset, some columns are in Japanese, so we need to translate them. For that task we are provided with the file `CAPSULE_TEXT_Translation.xlsx`. You will also need the file `prefecture.txt`, which I include in this repo.

Let's start by defining some paths and file names:

In [1]:
import os
import pandas as pd

input_dir = "../datasets/Ponpare/data"
output_dir = "../datasets/Ponpare/data_translated"
documentation_dir = "../datasets/Ponpare/data/documentation"

translate_fname = "CAPSULE_TEXT_Translation.xlsx"
prefecture_fname = "prefecture.txt"

In [2]:
# skip the first five rows of the sheet so that the table header is read as column names
translate_df = pd.read_excel(os.path.join(documentation_dir, translate_fname), skiprows=5)

translate_df.head()

| | Unnamed: 0 | Frequency | CAPSULE_TEXT | English Translation | Unnamed: 4 | Frequency.1 | CAPSULE_TEXT.1 | English Translation.1 |
|---|---|---|---|---|---|---|---|---|
| 0 | NaN | 5930 | 宅配 | Delivery service | NaN | 5930.0 | 宅配 | Delivery service |
| 1 | NaN | 3713 | グルメ | Food | NaN | 3713.0 | グルメ | Food |
| 2 | NaN | 1976 | ホテル | Hotel | NaN | 3629.0 | ホテル・旅館 | Hotel and Japanese hotel |
| 3 | NaN | 1461 | ヘアサロン | Hair salon | NaN | 1461.0 | ヘアサロン | Hair salon |
| 4 | NaN | 1375 | 旅館 | Japanese hotel | NaN | 1110.0 | リラクゼーション | Relaxation |
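
The `.1` suffixes in the output above are added by pandas: the sheet holds two tables side by side with identical headers, and pandas renames repeated column names so that each one stays unique. Here is a minimal sketch of that behaviour. It uses `read_csv` on an in-memory string (so it runs without the Excel file), and `raw` and `toy_df` are made-up names for this illustration only.

```python
import io

import pandas as pd

# Toy stand-in for the spreadsheet: two tables side by side with the same headers.
raw = (
    "Frequency,CAPSULE_TEXT,English Translation,"
    "Frequency,CAPSULE_TEXT,English Translation\n"
    "5930,宅配,Delivery service,5930,宅配,Delivery service\n"
    "1976,ホテル,Hotel,3629,ホテル・旅館,Hotel and Japanese hotel\n"
)

toy_df = pd.read_csv(io.StringIO(raw))

# The second occurrence of each repeated header gets a ".1" suffix.
print(list(toy_df.columns))
```

This is why the next cell can locate both tables by searching the column names for `CAPSULE` and `English` and taking the first and second match.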


In [3]:
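# Grab the capsule and English columns. The sheet holds two tables side by side,
# so each name matches twice: first for capsule_text, then for genre_name
caps_col_idx = [i for i,c in enumerate(translate_df.columns) if 'CAPSULE' in c]
engl_col_idx = [i for i,c in enumerate(translate_df.columns) if 'English' in c]

# .copy() gives us independent dataframes that are safe to rename and filter
capsule_text_df = translate_df.iloc[:, [caps_col_idx[0],engl_col_idx[0]]].copy()
capsule_text_df.columns = ['capsule_text', 'english_translation']

genre_name_df = translate_df.iloc[:, [caps_col_idx[1],engl_col_idx[1]]].copy()
genre_name_df.columns = ['genre_name', 'english_translation']
# the genre table has fewer rows than the capsule table, so drop the empty padding rows
genre_name_df = genre_name_df[~genre_name_df.genre_name.isna()]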

In [4]:
capsule_text_df.head()

| | capsule_text | english_translation |
|---|---|---|
| 0 | 宅配 | Delivery service |
| 1 | グルメ | Food |
| 2 | ホテル | Hotel |
| 3 | ヘアサロン | Hair salon |
| 4 | 旅館 | Japanese hotel |


In [5]:
genre_name_df.head()

| | genre_name | english_translation |
|---|---|---|
| 0 | 宅配 | Delivery service |
| 1 | グルメ | Food |
| 2 | ホテル・旅館 | Hotel and Japanese hotel |
| 3 | ヘアサロン | Hair salon |
| 4 | リラクゼーション | Relaxation |


In [6]:
# create capsule_text and genre_name dictionaries (let's all thank python3, literal strings are unicode by default)
capsule_text_dict = dict(zip(capsule_text_df.capsule_text, capsule_text_df.english_translation))
genre_name_dict = dict(zip(genre_name_df.genre_name, genre_name_df.english_translation))

print(capsule_text_dict)

{'宅配': 'Delivery service', 'グルメ': 'Food', 'ホテル': 'Hotel', 'ヘアサロン': 'Hair salon', '旅館': 'Japanese hotel', 'リラクゼーション': 'Relaxation', 'その他': 'Other', 'エステ': 'Spa', 'レジャー': 'Leisure', 'レッスン': 'Lesson', 'ネイル・アイ': 'Nail and eye salon', 'ギフトカード': 'Gift card', 'ペンション': 'Resort inn', '民宿': 'Japanse guest house', '健康・医療': 'Health and medical', 'WEBサービス': 'Web service', 'ビューティー': 'Beauty', '貸別荘': 'Vacation rental', 'ロッジ': 'Lodge', '通学レッスン': 'Class', '通信講座': 'Correspondence course', 'ゲストハウス': 'Guest house', '公共の宿': 'Public hotel', 'イベント': 'Event', 'ビューティ': 'Beauty'}
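
Note that the dictionary above maps two different spellings (`ビューティー` and `ビューティ`) to the same word, `Beauty`. If you want to inspect a translation dictionary before using it, something like the sketch below might help. It is not part of the original pipeline: `sample_dict` is a small excerpt of the dictionary above, and `observed_values` is a hypothetical list standing in for the values of a dataset column.

```python
from collections import defaultdict

# Small excerpt of the capsule_text dictionary printed above.
sample_dict = {
    '宅配': 'Delivery service',
    'グルメ': 'Food',
    'ビューティー': 'Beauty',
    'ビューティ': 'Beauty',
}

# 1) Find English labels that several Japanese keys collapse into.
keys_by_label = defaultdict(list)
for japanese, english in sample_dict.items():
    keys_by_label[english].append(japanese)
merged_labels = {en: jp for en, jp in keys_by_label.items() if len(jp) > 1}
print(merged_labels)

# 2) Find values that would stay untranslated (hypothetical column values).
observed_values = ['宅配', 'グルメ', 'ビューティ', 'エステ']
missing = sorted(set(observed_values) - set(sample_dict))
print(missing)
```

With the excerpt used here, `エステ` is reported as missing only because it was left out of `sample_dict` on purpose. The full dictionary does contain it.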


In [7]:
# create prefecture dictionary for region/area translation
prefecture_dict = {}
prefecture_path = os.path.join(input_dir, prefecture_fname)
# each line holds "<japanese name>,<romanized name>"; the file is assumed to be UTF-8 encoded
with open(prefecture_path, "r", encoding="utf-8") as f:
    for line in f:
        fields = line.rstrip().split(",")
        prefecture_dict[fields[0]] = fields[1]
print(prefecture_dict)

{'関西': 'kansai', '関東': 'kanto', '九州・沖縄': 'kyusyu', '四国': 'shikoku', '中国': 'tyugoku', '東海': 'tokai', '東北': 'tohoku', '北海道': 'hokkaido', '北信越': 'hokushinetu', 'キタ': 'osaka_kita', 'ミナミ他': 'osaka_minami', '愛知': 'aichi', '愛媛': 'ehime', '茨城': 'ibaraki', '横浜': 'kanagawa_yokohama', '岡山': 'okayama', '沖縄': 'okinawa', '岩手': 'iwate', '岐阜': 'gihu', '宮崎': 'miyazaki', '宮城': 'miyagi', '京都': 'kyoto', '銀座・新橋・東京・上野': 'tokyo_ginza', '熊本': 'kumamoto', '群馬': 'gunma', '恵比寿・目黒・品川': 'tokyo_ebisu', '広島': 'hiroshima', '香川': 'kagawa', '高知': 'kouchi', '佐賀': 'saga', '埼玉': 'saitama', '三重': 'mie', '山形': 'yamagata', '山口': 'yamaguchi', '山梨': 'yamanashi', '滋賀': 'shiga', '鹿児島': 'kagoshima', '秋田': 'akita', '渋谷・青山・自由が丘': 'tokyo_shibuya', '新潟': 'niigata', '新宿・高田馬場・中野・吉祥寺': 'tokyo_shinjuku', '青森': 'aomori', '静岡': 'shizuoka', '石川': 'ishikawa', '赤坂・六本木・麻布': 'tokyo_akasaka', '千葉': 'chiba', '川崎・湘南・箱根他': 'kanagawa_kawasaki', '大分': 'ooita', '池袋・神楽坂・赤羽': 'tokyo_ikebukuro', '長崎': 'nagasaki', '長野': 'nagano', '鳥取': 'tottori', '島根': 's
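
If you prefer not to split the lines by hand, the same dictionary could be built with the standard `csv` module. The sketch below is only one possible variant: `load_mapping` is a hypothetical helper, the UTF-8 encoding is an assumption about `prefecture.txt`, and the demo writes a tiny temporary file instead of reading the real one.

```python
import csv
import os
import tempfile


def load_mapping(path, encoding="utf-8"):
    """Read a two-column comma-separated file into a dict (hypothetical helper)."""
    mapping = {}
    with open(path, "r", encoding=encoding, newline="") as f:
        for row in csv.reader(f):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            mapping[row[0].strip()] = row[1].strip()
    return mapping


# Demo on a temporary file with three entries taken from the output above.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "prefecture_demo.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("関西,kansai\n関東,kanto\n\n北海道,hokkaido\n")
    demo_dict = load_mapping(demo_path)

print(demo_dict)
```

Passing the encoding explicitly avoids depending on the platform default, which matters for a file full of Japanese characters.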

In [8]:
csv_files = []
for _,_,files in os.walk(input_dir):
    for file in files:
        if file.endswith(".csv"):
            csv_files.append(file)

print(csv_files)

['user_list.csv', 'coupon_list_train.csv', 'coupon_detail_train.csv', 'coupon_list_test.csv', 'prefecture_locations.csv', 'coupon_area_train.csv', 'coupon_visit_train.csv', 'coupon_area_test.csv', 'sample_submission.csv']
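
One thing to keep in mind: `os.walk` also descends into subdirectories, and the loop above keeps only the file names. A `.csv` file inside a subdirectory such as `documentation` would therefore end up in the list without its path. That does not happen with the files listed above, but if you want to be strict about it, a non-recursive listing is an option. A minimal sketch follows, where `list_csv_files` is a hypothetical helper and the demo runs on a temporary directory:

```python
import tempfile
from pathlib import Path


def list_csv_files(directory):
    """Return the sorted names of the .csv files directly inside `directory`."""
    return sorted(p.name for p in Path(directory).glob("*.csv"))


# Demo: only the top-level CSV is returned, the nested one is ignored.
with tempfile.TemporaryDirectory() as tmp_dir:
    root = Path(tmp_dir)
    (root / "user_list.csv").write_text("a,b\n", encoding="utf-8")
    (root / "prefecture.txt").write_text("x,y\n", encoding="utf-8")
    (root / "documentation").mkdir()
    (root / "documentation" / "notes.csv").write_text("a\n", encoding="utf-8")
    found = list_csv_files(root)

print(found)
```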


In [9]:
# manually map each column to translate to the name of the dictionary that translates it
replace_cols = {
    'capsule_text':'capsule_text_dict',
    'genre_name':'genre_name_dict',
    'pref_name':'prefecture_dict',
    'large_area_name':'prefecture_dict',
    'ken_name':'prefecture_dict',
    'small_area_name':'prefecture_dict'
    }

In [10]:
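# Translate
csv_files = [c for c in csv_files if c not in ['prefecture_locations.csv','sample_submission.csv']]

# resolve the dictionary names stored in replace_cols without resorting to eval
translation_dicts = {
    'capsule_text_dict': capsule_text_dict,
    'genre_name_dict': genre_name_dict,
    'prefecture_dict': prefecture_dict,
    }

os.makedirs(output_dir, exist_ok=True)  # make sure the output directory exists

for f in csv_files:
    print("INFO: translating {} into {}".format(os.path.join(input_dir,f), os.path.join(output_dir,f)))
    tmp_df = pd.read_csv(os.path.join(input_dir,f))
    tmp_df.columns = [c.lower() for c in tmp_df.columns]

    for col in tmp_df.columns:
        if col in replace_cols:
            replace_dict = translation_dicts[replace_cols[col]]
            # assign the result back: an in-place replace on a column selection may not update tmp_df
            tmp_df[col] = tmp_df[col].replace(replace_dict)

    tmp_df.to_csv(os.path.join(output_dir,f), index=False)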

INFO: translating ../datasets/Ponpare/data/user_list.csv into ../datasets/Ponpare/data_translated/user_list.csv
INFO: translating ../datasets/Ponpare/data/coupon_list_train.csv into ../datasets/Ponpare/data_translated/coupon_list_train.csv
INFO: translating ../datasets/Ponpare/data/coupon_detail_train.csv into ../datasets/Ponpare/data_translated/coupon_detail_train.csv
INFO: translating ../datasets/Ponpare/data/coupon_list_test.csv into ../datasets/Ponpare/data_translated/coupon_list_test.csv
INFO: translating ../datasets/Ponpare/data/coupon_area_train.csv into ../datasets/Ponpare/data_translated/coupon_area_train.csv
INFO: translating ../datasets/Ponpare/data/coupon_visit_train.csv into ../datasets/Ponpare/data_translated/coupon_visit_train.csv
INFO: translating ../datasets/Ponpare/data/coupon_area_test.csv into ../datasets/Ponpare/data_translated/coupon_area_test.csv
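
A short note on why `replace` is a convenient choice for this step: values that are missing from the dictionary are left untouched, whereas `Series.map` would turn them into `NaN`. The sketch below illustrates the difference on a toy series. The value `未知` is a made-up entry with no translation, and `sample_dict` is an excerpt of the genre dictionary.

```python
import pandas as pd

# Excerpt of the genre dictionary; '未知' below is deliberately not covered.
sample_dict = {'グルメ': 'Food', 'ホテル・旅館': 'Hotel and Japanese hotel'}
genre = pd.Series(['グルメ', 'ホテル・旅館', '未知'])

replaced = genre.replace(sample_dict)  # unknown values are kept as they are
mapped = genre.map(sample_dict)        # unknown values become NaN

print(replaced.tolist())
print(mapped.isna().tolist())
```

So with `replace` an incomplete dictionary leaves some Japanese strings in the output rather than silently creating missing values.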


After running this code, the translated data is stored in `../datasets/Ponpare/data_translated/`. We are now ready to split the data into train/validation/test and start with the feature engineering.
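
Before moving on, you may want to check that no Japanese text is left in the translated columns. The sketch below is one rough way to do it and is not part of the original notebook: `find_untranslated` is a hypothetical helper, and the regular expression only covers hiragana, katakana and the common kanji block, which I assume is enough for this dataset.

```python
import re

# Hiragana (U+3040-U+309F), katakana (U+30A0-U+30FF) and common kanji (U+4E00-U+9FFF).
JAPANESE_RE = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")


def find_untranslated(values):
    """Return the sorted unique strings that still contain Japanese characters."""
    return sorted(
        {v for v in values if isinstance(v, str) and JAPANESE_RE.search(v)}
    )


# Demo with a few made-up column values, including a missing one.
leftovers = find_untranslated(['Food', '宅配', 'kanto', 'ホテル', None, 'Food'])
print(leftovers)
```

You could call it on the values of each column listed in `replace_cols`, for instance with `find_untranslated(df['genre_name'].unique())` on a translated dataframe `df`.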