<a href="https://colab.research.google.com/github/kyochanpy/Kaggle_Indoor_Location_Navigation/blob/main/create_dataset/create_wifi_beacon_for_nn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Overview

[The result dataset is here.](https://www.kaggle.com/kokitanisaka/unified-ds-wifi-and-beacon)<br>
<br>
In this notebook, I show one way to make a dataset, Wi-Fi features and Beacon features. <br>
And also, I tried to utilize timegap of Wi-Fi and Beacon from the nearest waypoints. <br>
<br>
The fundamental idea of this dataset is, make samples based on waypoints. <br>
So the number of samples is same as number of waypoints. <br>
Which is much less than [this dataset](https://www.kaggle.com/kokitanisaka/indoorunifiedwifids).<br>
<br>
We can have similar result only with Wi-Fi features in [this dataset](https://www.kaggle.com/kokitanisaka/unified-ds-wifi-and-beacon) as [this dataset](https://www.kaggle.com/kokitanisaka/indoorunifiedwifids).<br>
As it has less samples, the training speed is much faster.<br>
<br>
With beacon features, I wasn't able to achieve a better result. <br>
So if you are interested in it, feel free to do some experiments. <br>

## Attention
Not all the samples can have beacon features. Because in some paths, there's no beacon signals are observed.<br>
we need to take it into account when we train a model. <br>


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [8]:
import pandas as pd
import numpy as np
import glob
import re
import types
import shutil
def imports():
    for name, val in globals().items():
        # module imports
        if isinstance(val, types.ModuleType):
            yield name, val
        # functions / callables
        if hasattr(val, '__call__'):
            yield name, val
np.seterr(divide='ignore', invalid='ignore')
noglobal = lambda fn: types.FunctionType(fn.__code__, dict(imports()))
import multiprocessing
from multiprocessing import Pool

In [3]:
num_cores = multiprocessing.cpu_count()

base_path = '/content/drive/MyDrive'

In [4]:
# get target buildings
sample_submission = pd.read_csv(f'{base_path}/sample_submission.csv')
sample_submission = sample_submission["site_path_timestamp"].apply(lambda x: pd.Series(x.split("_")))
sample_submission.columns = ['site', 'path', 'timestamp']
target_buildings = sorted(sample_submission['site'].value_counts().index.tolist())

In [20]:
# pull out all the buildings actually used in the test set, given current method we don't need the other ones
ssubm = pd.read_csv('/content/drive/MyDrive/sample_submission.csv')

# only 24 of the total buildings are used in the test set, 
# this allows us to greatly reduce the intial size of the dataset

ssubm_df = ssubm["site_path_timestamp"].apply(lambda x: pd.Series(x.split("_")))
used_buildings = sorted(ssubm_df[0].value_counts().index.tolist())

# dictionary used to map the floor codes to the values used in the submission file. 
floor_map = {"B2":-2, "B1":-1, "F1":0, "F2": 1, "F3":2, "F4":3, "F5":4, "F6":5, "F7":6,"F8":7, "F9":8,
             "1F":0, "2F":1, "3F":2, "4F":3, "5F":4, "6F":5, "7F":6, "8F": 7, "9F":8}

## Options
These features determin how to take features from the original txt files.<br>
<br>
We take Wi-Fi or beacon features within specific timespan from waypoints. <br>
I assume that if the time gap is too much, Wi-Fi signals or beacon signals are not trustworthy. <br>
I set it to 3000ms, but you can try other numbers. <br>
<br>
And we can determin how many signals to take into the result dataset. <br>
Actually beacon doesn't have much samples in the original txt files, or even doesn't have it. <br>
You can try other numbers as well. <br>

In [5]:
# options 
NUM_TAKING_BEACONS = 10
NUM_TAKING_WIFIS = 50
TIMEGAP_THRESHOLD = 10 # ms

In [6]:
# constants
FLOOR_DIR = {"B2": -2, "B1": -1, "F1": 0, "F2": 1, "F3": 2, "F4": 3, "F5": 4, "F6": 5, "F7": 6, "F8": 7, "F9": 8,
             "1F": 0, "2F": 1, "3F": 2, "4F": 3, "5F": 4, "6F": 5, "7F": 6, "8F": 7, "9F": 8}

In [7]:
# utils
@noglobal
def split_into_each_beacons(s):
    matches = re.finditer("TYPE_BEACON", s)
    matches_positions = [match.start() for match in matches]
    split_idx = [0] + [matches_positions[i]-14 for i in range(1, len(matches_positions))] + [len(s)]
    return [s[split_idx[i]:split_idx[i+1]] for i in range(len(split_idx)-1)]

@noglobal
def extract_waypoint_beacon(path_file):
    TIME = 0

    WAYPOINT_X = 2
    WAYPOINT_Y = 3
    BEACON_DISTANCE = 7
    BEACON_MAC = 8
    
    waypoints = []
    beacons = []
    wifis = []

    with open(path_file, encoding="utf-8") as f:
        text = f.readlines()
        for i, line in enumerate(text):
            type_count = line.count('TYPE_BEACON')
            if type_count > 1:
                lines = split_into_each_beacons(line)
            else:
                lines = [line]

            for l in lines:
                tmp = l.strip().split()
                
                if tmp[1] == "TYPE_WAYPOINT":
                    #1578462618392	TYPE_WAYPOINT	230.03738	153.49635
                    waypoints.append([int(tmp[TIME]), tmp[1], float(tmp[WAYPOINT_X]), float(tmp[WAYPOINT_Y])])

                elif tmp[1] == "TYPE_WIFI":
                    #1578483067644	TYPE_WIFI	da39a3ee5e6b4b0d3255bfef95601890afd80709	2253c6a0d0f7277737aa8e86e0484be805124806	-51	2437	1578483066126
                    try:
                        wifis.append([int(tmp[TIME]), tmp[1], tmp[2], tmp[3], 
                                     int(tmp[4]), int(tmp[5]), int(tmp[6]), 0])
                    except:
                        print(tmp)
                        raise
                    
                elif tmp[1] == "TYPE_BEACON":
                    #1578462618698	TYPE_BEACON	uuid	major	minor	-56	-58	1.2902861669921697	mac	1578462618698, timediff
                    try:
                        if len(tmp) >= 10:                       
                            second_time = tmp[9]
                        else:
                            second_time = tmp[TIME]
                        
                        beacons.append([int(tmp[TIME]), tmp[1], tmp[2], tmp[3], tmp[4], int(tmp[5]), int(tmp[6]), 
                                        float(tmp[BEACON_DISTANCE]), tmp[BEACON_MAC], second_time, 0])
                    except:
                        print(tmp)
                        raise

    return waypoints, sorted(beacons, key=lambda x: float(x[BEACON_DISTANCE])), wifis

@noglobal
def append_timediff(waypoint, beacons, timediffindex=10):
    TIME = 0
    
    to_be_removed = []
    for i, beacon in enumerate(beacons):
        try:
            beacons[i][timediffindex] = abs(waypoint[TIME] - beacon[TIME])
        except:
            to_be_removed.append(i)
            print(f'error:{beacon}')
            raise
    for i in to_be_removed:
        del beacons[i]
        
    return beacons

@noglobal
def make_item_wifi(target_building, floor_val, path_val, waypoint, wifis, TIMEGAP_THRESHOLD, NUM_TAKING_WIFIS):
    WAYPOINT_X = 2
    WAYPOINT_Y = 3

    sorted_by_nearest = [x for x in wifis if x[7] <= TIMEGAP_THRESHOLD]
    sorted_by_nearest = sorted(sorted_by_nearest, key=lambda x: (abs(x[4]), x[7]))[:NUM_TAKING_WIFIS]

    item = [target_building, floor_val, path_val, waypoint[WAYPOINT_X], waypoint[WAYPOINT_Y]]
    for beacon in sorted_by_nearest: 
        item.extend([beacon[3],
                    beacon[4],
                    beacon[7]])

    if len(sorted_by_nearest) < NUM_TAKING_WIFIS:
        for i in range(NUM_TAKING_WIFIS-len(sorted_by_nearest)):
            item.extend(['-', -999, TIMEGAP_THRESHOLD])    

    return item

@noglobal
def append_beacon(target_building, floor_val, path_val, item, waypoint, beacons, TIMEGAP_THRESHOLD, NUM_TAKING_BEACONS):
    BEACON_DISTANCE = 7
    BEACON_MAC = 8
    BEACON_TIMEDIFF = 10
    WAYPOINT_X = 2
    WAYPOINT_Y = 3

    sorted_by_nearest_beacons = [x for x in beacons if x[BEACON_TIMEDIFF] <= TIMEGAP_THRESHOLD]
    sorted_by_nearest_beacons = sorted(sorted_by_nearest_beacons, key=lambda x: x[BEACON_DISTANCE])[:NUM_TAKING_BEACONS]

    #item = [target_building, floor_val, path_val, waypoint[WAYPOINT_X], waypoint[WAYPOINT_Y]]
    for beacon in sorted_by_nearest_beacons: # select from the nearest beacons
        item.extend([beacon[BEACON_MAC],
                    beacon[BEACON_DISTANCE],
                    beacon[BEACON_TIMEDIFF]])

    if len(sorted_by_nearest_beacons) < NUM_TAKING_BEACONS:
        for i in range(NUM_TAKING_BEACONS-len(sorted_by_nearest_beacons)):
            item.extend(['-', -99, TIMEGAP_THRESHOLD])

    return item

@noglobal
def yield_columns(NUM_TAKING_WIFIS, NUM_TAKING_BEACONS):
    columns = []
    for i in range(NUM_TAKING_WIFIS):
        columns.append(f'wifi_bssid_{i}')
        columns.append(f'wifi_rssi_{i}')
        columns.append(f'wifi_timegap_{i}')

    for i in range(NUM_TAKING_BEACONS):
        columns.append(f'beacon_macaddress_{i}')
        columns.append(f'beacon_distance_{i}')
        columns.append(f'beacon_timegap_{i}')

    return columns

In [12]:
def create_data_per_building(target_building):
    floors = sorted(glob.glob(f'{base_path}/fixed_train/train/{target_building}/*'))
    
    items = []
    
    for floor in floors:
        print(floor)
        
        floor_val = floor.split('/')[-1]
        floor_val = FLOOR_DIR[floor_val]
        
        paths = sorted(glob.glob(f'{floor}/*.txt'))
        
        for path_file in paths:
            path_val = path_file.split('/')[-1].replace('.txt', '')
            
            waypoints, beacons, wifis = extract_waypoint_beacon(path_file)

            for waypoint in waypoints:
                wifis = append_timediff(waypoint, wifis, 7)
                beacons = append_timediff(waypoint, beacons, 10)
                
                item = make_item_wifi(target_building, floor_val, path_val, waypoint, wifis, TIMEGAP_THRESHOLD, NUM_TAKING_WIFIS)
                item = append_beacon(target_building, floor_val, path_val, item, waypoint, beacons, TIMEGAP_THRESHOLD, NUM_TAKING_BEACONS)
            
                items.append(item)
                
                
    items = pd.DataFrame(items, columns=['site', 'floor', 'path', 'x', 'y'] + yield_columns(NUM_TAKING_WIFIS, NUM_TAKING_BEACONS))
    train_all_list.append(items)
    items.to_csv(f'{target_building}_train.csv')
    shutil.move(f'{target_building}_train.csv', '/content/drive/MyDrive/nn_wifi_50_beacon_10_10')

In [13]:
train_all_list = []

In [14]:
# make files for train set
with Pool(num_cores) as pool:
    pool.map(create_data_per_building, [t for t in target_buildings])  

/content/drive/MyDrive/fixed_train/train/5a0546857ecc773753327266/B1
/content/drive/MyDrive/fixed_train/train/5d27096c03f801723c31e5e0/B1
/content/drive/MyDrive/fixed_train/train/5d27096c03f801723c31e5e0/F1
/content/drive/MyDrive/fixed_train/train/5a0546857ecc773753327266/F1
/content/drive/MyDrive/fixed_train/train/5d27096c03f801723c31e5e0/F2
/content/drive/MyDrive/fixed_train/train/5d27096c03f801723c31e5e0/F3
/content/drive/MyDrive/fixed_train/train/5a0546857ecc773753327266/F2
/content/drive/MyDrive/fixed_train/train/5d27096c03f801723c31e5e0/F4
/content/drive/MyDrive/fixed_train/train/5d27096c03f801723c31e5e0/F5
/content/drive/MyDrive/fixed_train/train/5d27096c03f801723c31e5e0/F6
/content/drive/MyDrive/fixed_train/train/5a0546857ecc773753327266/F3
/content/drive/MyDrive/fixed_train/train/5d27097f03f801723c320d97/B1
/content/drive/MyDrive/fixed_train/train/5d27097f03f801723c320d97/B2
/content/drive/MyDrive/fixed_train/train/5d27097f03f801723c320d97/F1
/content/drive/MyDrive/fixed_train

In [23]:
train_all_list = []
for building in used_buildings:
    df = pd.read_csv(f'/content/drive/MyDrive/nn_wifi_50_beacon_10_10/{building}_train.csv')
    train_all_list.append(df)

In [24]:
train_all = pd.concat(train_all_list)
train_all.to_pickle('train_all.pkl')
shutil.move('train_all.pkl', '/content/drive/MyDrive/nn_wifi_50_beacon_10_10')

'/content/drive/MyDrive/nn_wifi_50_beacon_10_10/train_all.pkl'

In [27]:
# make file for test set

paths = sorted(glob.glob(f'{base_path}/fixed_train/test/*'))

items = []

for i, path_file in enumerate(paths):
    path_val = path_file.split('/')[-1].replace('.txt', '')

    print(f'{i}:{path_file}')
    
    _, beacons, wifis = extract_waypoint_beacon(path_file)

    targets = sample_submission[sample_submission['path'] == path_val]
    targets['timestamp'] = targets['timestamp'].astype(int)
    targets.loc[:,'type'] = 'TYPE_WAYPOINT'
    targets.loc[:,'x'] = 0
    targets.loc[:,'y'] = 0
    waypoints_to_predict = targets[['timestamp', 'type', 'x', 'y']].values.tolist()

    target_building = targets.iloc[0, 0]
    
    for waypoint in waypoints_to_predict:
        wifis = append_timediff(waypoint, wifis, 7)
        beacons = append_timediff(waypoint, beacons, 10)

        timed = [str(waypoint[0]).zfill(13)]
        item = make_item_wifi(target_building, 0, path_val, waypoint, wifis, TIMEGAP_THRESHOLD, NUM_TAKING_WIFIS)
        timed.extend(item)
        timed = append_beacon(target_building, 0, path_val, timed, waypoint, beacons, TIMEGAP_THRESHOLD, NUM_TAKING_BEACONS)
        
        items.append(timed)

items = pd.DataFrame(items, columns=['timestamp', 'site', 'floor', 'path', 'x', 'y'] + yield_columns(NUM_TAKING_WIFIS, NUM_TAKING_BEACONS))

items.to_csv(f'test.csv')

0:/content/drive/MyDrive/fixed_train/test/00ff0c9a71cc37a2ebdd0f05.txt


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  from ipykernel import kernelapp as app
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[key] = _infer_fill_value(value)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  isetter(loc, value)


1:/content/drive/MyDrive/fixed_train/test/01c41f1aeba5c48c2c4dd568.txt
2:/content/drive/MyDrive/fixed_train/test/030b3d94de8acae7c936563d.txt
3:/content/drive/MyDrive/fixed_train/test/0389421238a7e2839701df0f.txt
4:/content/drive/MyDrive/fixed_train/test/04029880763600640a0cf42c.txt
5:/content/drive/MyDrive/fixed_train/test/0412d582bb8a2c89400a1ffb.txt
6:/content/drive/MyDrive/fixed_train/test/046cfa46be49fc10834815c6.txt
7:/content/drive/MyDrive/fixed_train/test/049bb468e7e166e9d6370002.txt
8:/content/drive/MyDrive/fixed_train/test/04b259d70f2b503f2af14c35.txt
9:/content/drive/MyDrive/fixed_train/test/053526f9012ca715313120cd.txt
10:/content/drive/MyDrive/fixed_train/test/055255f16b549ed22893c0c3.txt
11:/content/drive/MyDrive/fixed_train/test/05a6d4cdf3d1eb9b5671f71b.txt
12:/content/drive/MyDrive/fixed_train/test/05d052dde78384b0c543d89c.txt
13:/content/drive/MyDrive/fixed_train/test/06832e9b9ca24ff7fe906aee.txt
14:/content/drive/MyDrive/fixed_train/test/06882da3694b7160c0f105f5.txt
1

In [28]:
shutil.move('test.csv', '/content/drive/MyDrive/nn_wifi_50_beacon_10_10')

'/content/drive/MyDrive/nn_wifi_50_beacon_10_10/test.csv'

In [29]:
test_all = pd.read_csv('/content/drive/MyDrive/nn_wifi_50_beacon_10_10/test.csv')
test_all.to_pickle('test_all.pkl')
shutil.move('test_all.pkl', '/content/drive/MyDrive/nn_wifi_50_beacon_10_10')

'/content/drive/MyDrive/nn_wifi_50_beacon_10_10/test_all.pkl'