# <center>Project: Indoor Localization.</center>


Data: uploaded in Moodle
- training set (offline)
- separate test set (online)

Citation for the data: Thomas King, Stephan Kopf, Thomas Haenselmann, Christian Lubberger, Wolfgang Effelsberg, mannheim/compass, https://doi.org/10.15783/C7F30Q , Date: 20060913

## Introduction 
Recall the ML project workflow:

1. Problem understanding
2. Data collection, data cleaning, and exploration
3. Feature selection and feature engineering
4. Training and model selection
5. Evaluation and deployment

## Problem Description
- Indoor localization differs from outdoor localization because
    - there is often no line of sight, and many factors distort the signal
    - higher precision is required
- Examples of Applications:  
    - Logistics, Customer Support (saving time shopping), Hospitals (Patients, Device Location tracking)
- Variety of positioning technologies
    - WiFi based approaches: Received Signal Strength (RSS), angle, time of arrival, time difference of arrival
    - Bluetooth 
    - Radio Frequency
    - ...
- Challenges: 
    - poor accuracy
    - high computational complexity 
    - costs of hardware

Task: Search for a survey paper and list some approaches present in the literature.

Task: Formulate the problem description, the challenges, and the ML formulation.

## Our project: Localization based on Received Signal Strength (RSS), using k-NN

In this project, we will
  - understand and clean the data,
  - organize it in a structure suitable for analysis,
  - examine its statistical properties,
  - use a k-NN model to predict the location of new data points, and
  - tune hyperparameters and select a model.
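
To make the k-NN idea concrete, here is a minimal sketch that estimates a position by averaging the coordinates of the k training fingerprints closest in signal space. The function name `knn_locate`, the toy fingerprints, and the choice of Euclidean distance with an unweighted average are illustrative assumptions, not the prescribed solution.

```python
import numpy as np

def knn_locate(train_signals, train_positions, query, k=3):
    """Average the positions of the k training fingerprints nearest to `query`."""
    train_signals = np.asarray(train_signals, dtype=float)
    train_positions = np.asarray(train_positions, dtype=float)
    # Euclidean distance in signal space (one column per access point)
    dists = np.linalg.norm(train_signals - np.asarray(query, dtype=float), axis=1)
    nearest = np.argsort(dists)[:k]
    return train_positions[nearest].mean(axis=0)

# Toy data: 4 reference locations, 2 access points (all values are made up)
signals = [[-40, -70], [-45, -65], [-70, -40], [-65, -45]]
positions = [[0, 0], [1, 0], [10, 0], [9, 0]]
estimate = knn_locate(signals, positions, query=[-42, -68], k=2)
```

In the real project, each row of the signal matrix would hold the signal strengths of the selected access points at one location, and k would be chosen by model selection.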

In [2]:
# Write your imports here

# import pandas as pd
# import numpy as np
# import re
# import matplotlib.pyplot as plt 
# import seaborn as sns 

# import statsmodels.api as sm

# import warnings
# warnings.filterwarnings('ignore')


## Understanding the data
Data: Thomas King, Stephan Kopf, Thomas Haenselmann, Christian Lubberger, Wolfgang Effelsberg, mannheim/compass, https://doi.org/10.15783/C7F30Q , Date: 20060913

Data Documentation, Visual inspection, Checks
Read the documentation of the data and perform a visual inspection (plain text editor). Describe the data and check that the file is as expected. Consider, for example:

- Which types of lines are present, and how many of each are there? Does the file fit the documentation?
- Which columns are there, what do they represent, and which data types do they have?
- Note: Always keep the documentation of the data in mind, return to it if needed!

Example: 
- There are comment lines, starting with #. Recall that we expect 146,080 lines in the file (166 locations × 8 angles × 110 recordings). Check this.
- What do you notice about the time format?
- Is id always the same?
- Is degree as described in the documentation?
- Does every line have the same MACs?
- What do the signal strength values look like?
- ...

> Continue by documenting what you notice about the data
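
A first check of this kind can be sketched as follows. The helper `count_line_types` is a hypothetical name, and the sketch assumes that the expected count of 166 × 8 × 110 refers to the non-comment lines; verify this against the documentation.

```python
def count_line_types(lines):
    """Return (number of comment lines, number of data lines), ignoring blank lines."""
    n_comments = 0
    n_data = 0
    for line in lines:
        if not line.strip():
            continue
        if line.startswith("#"):
            n_comments += 1
        else:
            n_data += 1
    return n_comments, n_data

# Number of lines expected from the documentation
expected = 166 * 8 * 110

# Tiny stand-in for the real file (made-up content)
sample = ["# a comment line\n", "t=1;id=a\n", "t=2;id=a\n"]
n_comments, n_data = count_line_types(sample)
```

For the real file, pass the open file handle instead of `sample` and compare `n_data` with `expected`.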

### Ideas on how to structure the data. 
Task: Consider the two following obvious choices of how to store the data in a structured way and give at least a pro and a con for each:
1. Transfer each line of the input file to one row of the data frame.
2. Store one signal per row. This implies that each line of the input file turns into multiple rows of the data frame.

Adopt the second approach. 

- Write a function that can process a line of the file into a matrix, and then apply it to each line.
- Create a DataFrame by concatenating the list of processed lines, stacking them. Define columns=["time", "scanMac", "posX", "posY", "posZ", "orientation", "mac", "signal", "channel", "type"]
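
The stacking step can be sketched with two hand-written processed lines. The toy values and the convention that a line without any signal is represented by `None` are assumptions made for illustration.

```python
import pandas as pd

columns = ["time", "scanMac", "posX", "posY", "posZ", "orientation",
           "mac", "signal", "channel", "type"]

# One "matrix" (list of rows) per input line; all values are toy data
line1 = [["1", "dev", "0.0", "0.0", "0.0", "0.0", "ap1", "-38", "2437000000", "3"],
         ["1", "dev", "0.0", "0.0", "0.0", "0.0", "ap2", "-54", "2462000000", "3"]]
line2 = [["2", "dev", "1.0", "0.0", "0.0", "45.0", "ap1", "-40", "2437000000", "3"]]
processed = [line1, None, line2]  # None: a line that contained no signal (assumption)

# Flatten the list of matrices into one long list of rows, skipping empty results
rows = [row for matrix in processed if matrix is not None for row in matrix]
toy = pd.DataFrame(rows, columns=columns)
```

At this point every column still holds strings; the conversion to numeric types is a separate cleaning step.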

I recommend getting familiar with regular expressions: https://de.wikipedia.org/wiki/Regul%C3%A4rer_Ausdruck

Example: for the first line, you should have the matrix

In [3]:

# [['1139643118744', '00:02:2D:21:0F:33', '0.0', '0.0', '0.0', '0.0',
#         '00:14:bf:b1:97:8a', '-38', '2437000000', '3'],
#        ['1139643118744', '00:02:2D:21:0F:33', '0.0', '0.0', '0.0', '0.0',
#         '00:0f:a3:39:e1:c0', '-54', '2462000000', '3'],
#        ['1139643118744', '00:02:2D:21:0F:33', '0.0', '0.0', '0.0', '0.0',
#         '00:14:bf:b1:97:90', '-56', '2427000000', '3'],
#        ['1139643118744', '00:02:2D:21:0F:33', '0.0', '0.0', '0.0', '0.0',
#         '00:14:bf:3b:c7:c6', '-67', '2432000000', '3'],
#        ['1139643118744', '00:02:2D:21:0F:33', '0.0', '0.0', '0.0', '0.0',
#         '00:14:bf:b1:97:81', '-66', '2422000000', '3'],
#        ['1139643118744', '00:02:2D:21:0F:33', '0.0', '0.0', '0.0', '0.0',
#         '00:14:bf:b1:97:8d', '-70', '2442000000', '3'],
#        ['1139643118744', '00:02:2D:21:0F:33', '0.0', '0.0', '0.0', '0.0',
#         '00:0f:a3:39:e0:4b', '-79', '2462000000', '3'],
#        ['1139643118744', '00:02:2D:21:0F:33', '0.0', '0.0', '0.0', '0.0',
#         '00:0f:a3:39:dd:cd', '-73', '2412000000', '3'],
#        ['1139643118744', '00:02:2D:21:0F:33', '0.0', '0.0', '0.0', '0.0',
#         '00:0f:a3:39:e2:10', '-83', '2437000000', '3'],
#        ['1139643118744', '00:02:2D:21:0F:33', '0.0', '0.0', '0.0', '0.0',
#         '02:00:42:55:31:00', '-85', '2457000000', '1']]
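
One possible way to obtain such a matrix is sketched below. It assumes that a raw line has the form `t=...;id=...;pos=x,y,z;degree=...;mac=signal,channel,type;...`, which you should confirm against the file and its documentation. The name `process_line_sketch` and the shortened sample line are illustrative.

```python
import re

def process_line_sketch(line):
    """Turn one raw line into a list of rows, one row per detected signal."""
    tokens = re.split(r"[;=,]", line.strip().rstrip(";"))
    # 10 tokens cover t, id, pos (x, y, z) and degree; anything beyond is signals
    if len(tokens) <= 10:
        return None  # line without any measured signal
    fixed = [tokens[i] for i in (1, 3, 5, 6, 7, 9)]
    readings = tokens[10:]
    # Each reading consists of 4 tokens: mac, signal, channel, type
    return [fixed + readings[i:i + 4] for i in range(0, len(readings), 4)]

# Sample line built from the first two rows of the matrix above (assumed format)
sample = ("t=1139643118744;id=00:02:2D:21:0F:33;pos=0.0,0.0,0.0;degree=0.0;"
          "00:14:bf:b1:97:8a=-38,2437000000,3;00:0f:a3:39:e1:c0=-54,2462000000,3")
rows = process_line_sketch(sample)
```

Splitting on `;`, `=` and `,` leaves the MAC addresses intact because they only contain colons.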

### Cleaning and formatting the data

- Convert the numeric variables from string to numeric.
- Look at the orientation and notice that the values are not always exact multiples of 45 degrees. Write a function that rounds the orientation to multiples of 45 degrees and add a column with the rounded angles. Keep the old values and check that the conversion was done properly. Note that 358 should be converted to 0.
- Plot the ECDF of the orientation column before and after the conversion (from statsmodels.api, use distributions.ECDF).
- Keep only access points (abbrev. AP), i.e. remove ad-hoc devices, and then remove the type column.
- Format the time column.

As mentioned, keep the original columns for now as well.
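
The cleaning steps can be sketched on a two-row toy DataFrame. The helper name `round_orientation_sketch` and the toy values are assumptions. The sketch treats the timestamp as milliseconds since the Unix epoch and type 3 as an access point, so check both against the documentation.

```python
import numpy as np
import pandas as pd

def round_orientation_sketch(angles):
    """Round angles (in degrees) to the nearest multiple of 45, mapping 360 to 0."""
    rounded = np.round(np.asarray(angles, dtype=float) / 45.0) * 45.0
    return rounded.astype(int) % 360

toy = pd.DataFrame({
    "time": ["1139643118358", "1139643118358"],
    "orientation": ["358.0", "44.7"],
    "signal": ["-38", "-85"],
    "type": ["3", "1"],
})

# Strings -> numbers
for col in ["time", "orientation", "signal", "type"]:
    toy[col] = pd.to_numeric(toy[col])

# Keep the raw orientation and add the rounded one
toy["angle"] = round_orientation_sketch(toy["orientation"])

# Keep access points only (type 3), then drop the now-constant column
toy = toy[toy["type"] == 3].drop(columns="type")

# Keep the raw timestamp and add a readable one
toy["rawTime"] = toy["time"]
toy["time"] = pd.to_datetime(toy["time"], unit="ms")
```

The modulo operation is what maps 358 to 0 rather than 360.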

After these steps, the first lines of your data frame should look like:

![](step1.png)
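
For the ECDF comparison, it helps to know what the step function contains. The sketch below computes the ECDF points with NumPy only; `ecdf_points` is a hypothetical helper and the orientation values are made up. The statsmodels ECDF mentioned above can be used for the actual plot.

```python
import numpy as np

def ecdf_points(values):
    """Return the sorted values x and the cumulative proportions y = P(X <= x)."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, x.size + 1) / x.size
    return x, y

raw = [0.2, 44.7, 45.1, 89.9, 358.0]   # made-up orientations
rounded = [0, 45, 45, 90, 0]           # the same values after rounding

x_raw, y_raw = ecdf_points(raw)
x_rounded, y_rounded = ecdf_points(rounded)

# After rounding, the ECDF jumps only at a few distinct positions
n_steps_rounded = np.unique(x_rounded).size
```

Before rounding, the ECDF shows small clusters of jumps around each multiple of 45; after rounding, each cluster collapses into a single jump.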


### Univariate Analysis

- Are there other columns that you can eliminate?
- Can you identify the 6 APs that are mentioned in the documentation?
- Was the same device used for taking the measurements?
- ...

To answer these questions, review unique values and counts, and look at measures of central tendency and spread for the numerical variables, ...

You should be able to argue for a reduction to 7 APs and for deleting the channel and scanMac columns.
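
The kind of evidence needed for these arguments can be sketched on toy data. The column values and the count threshold below are arbitrary assumptions chosen for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "mac": ["ap1"] * 5 + ["ap2"] * 4 + ["noise"] * 1,
    "scanMac": ["dev"] * 10,
    "channel": [1] * 5 + [6] * 4 + [11],
})

# Measurements per MAC, sorted in descending order
mac_counts = toy["mac"].value_counts()

# MACs with enough measurements (the threshold 4 is an arbitrary toy choice)
frequent = mac_counts[mac_counts >= 4].index.tolist()

# A single unique value means the same scanning device was used throughout
n_scan_devices = toy["scanMac"].nunique()

# If every MAC uses exactly one channel, channel adds no information beyond mac
channels_per_mac = toy.groupby("mac")["channel"].nunique()
```

On the real data, a clear gap in `mac_counts` between frequently and rarely seen addresses is what supports keeping only a subset of MACs.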

## Organize your code


Write a function that summarizes the processing of the data. Write it in such a way that you can apply it to the online data set as well. Check that you obtain the same result as before, ensuring that none of the steps were forgotten.

You can use the pickle format to save a DataFrame, which preserves the data types of the columns. The expected result is available in Moodle as a pickle file.
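
A minimal round trip shows that the column types survive pickling. The file name `toy.pkl` and the toy values are illustrative.

```python
import os
import tempfile
import pandas as pd

toy = pd.DataFrame({
    "time": pd.to_datetime([1139643118358], unit="ms"),
    "signal": [-38],
    "mac": ["00:14:bf:b1:97:8a"],
})

# Write to a temporary directory so that the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.pkl")
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

same_dtypes = (restored.dtypes == toy.dtypes).all()
```

A CSV round trip, in contrast, would return the time column as plain strings unless it is parsed again.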

In [5]:
def processline(x):
    ...

def round_orientation(angles):
    ...


In [6]:
def read_data(filename="../ips_data/offline.final.trace.txt",
              submacs=["00:0f:a3:39:e1:c0", "00:0f:a3:39:dd:cd", "00:14:bf:b1:97:8a",
                       "00:14:bf:3b:c7:c6", "00:14:bf:b1:97:90", "00:14:bf:b1:97:8d", "00:14:bf:b1:97:81"]):
    with open(filename, "r") as file:
        lines = [line for line in file if not line.startswith("#")]

    # Process each line and store the results in a list
    ...
    # Create a DataFrame by concatenating the list of processed lines
    ...
    # Make numeric columns numeric
    ...
    # Add a new column 'angle' with the rounded orientations to the 'offline' DataFrame
    ...
    # Skip rows with type != 3, keeping only access points (remove devices in ad-hoc mode)
    ...
    # Remove the 'type' column
    ...
    # Format time
    ...
    # Remove the z coordinate: all measurements are taken on the same floor
    ...
    # Remove the 'scanMac' column
    ...
    # Filter the DataFrame to keep only rows with the selected MAC addresses
    ...
    # Create posXY, a location identifier
    ...

    return offline

In [7]:
# Read offline data using the function call
offline = read_data()

In [8]:
offline.head()

|   | time | posX | posY | orientation | mac | signal | angle | rawTime | posXY |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2006-02-11 07:31:58.358 | 0.0 | 0.0 | 0.0 | 00:14:bf:b1:97:8a | -38 | 0 | 1139643118358 | 0.0-0.0 |
| 1 | 2006-02-11 07:31:58.358 | 0.0 | 0.0 | 0.0 | 00:14:bf:b1:97:90 | -56 | 0 | 1139643118358 | 0.0-0.0 |
| 2 | 2006-02-11 07:31:58.358 | 0.0 | 0.0 | 0.0 | 00:0f:a3:39:e1:c0 | -53 | 0 | 1139643118358 | 0.0-0.0 |
| 3 | 2006-02-11 07:31:58.358 | 0.0 | 0.0 | 0.0 | 00:14:bf:b1:97:8d | -65 | 0 | 1139643118358 | 0.0-0.0 |
| 4 | 2006-02-11 07:31:58.358 | 0.0 | 0.0 | 0.0 | 00:14:bf:b1:97:81 | -65 | 0 | 1139643118358 | 0.0-0.0 |
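
The posXY identifier shown in the output can be built by joining the two coordinates as strings. This is a sketch on toy coordinates; the separator `-` is inferred from the value `0.0-0.0` in the output.

```python
import pandas as pd

toy = pd.DataFrame({"posX": [0.0, 2.0], "posY": [0.0, 13.0]})

# One string per location, e.g. "0.0-0.0"
toy["posXY"] = toy["posX"].astype(str) + "-" + toy["posY"].astype(str)

# The number of distinct identifiers equals the number of distinct locations
n_locations = toy["posXY"].nunique()
```

On the offline data, the number of distinct identifiers can then be compared with the 166 locations stated in the documentation.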


# Try to get the same dataframe as uploaded in moodle

In [10]:
df = pd.read_pickle('df.pkl')
df.head()

|   | time | posX | posY | orientation | mac | signal | angle | rawTime | posXY |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2006-02-11 07:31:58.358 | 0.0 | 0.0 | 0.0 | 00:14:bf:b1:97:8a | -38 | 0 | 1139643118358 | 0.0-0.0 |
| 1 | 2006-02-11 07:31:58.358 | 0.0 | 0.0 | 0.0 | 00:14:bf:b1:97:90 | -56 | 0 | 1139643118358 | 0.0-0.0 |
| 2 | 2006-02-11 07:31:58.358 | 0.0 | 0.0 | 0.0 | 00:0f:a3:39:e1:c0 | -53 | 0 | 1139643118358 | 0.0-0.0 |
| 3 | 2006-02-11 07:31:58.358 | 0.0 | 0.0 | 0.0 | 00:14:bf:b1:97:8d | -65 | 0 | 1139643118358 | 0.0-0.0 |
| 4 | 2006-02-11 07:31:58.358 | 0.0 | 0.0 | 0.0 | 00:14:bf:b1:97:81 | -65 | 0 | 1139643118358 | 0.0-0.0 |


In [11]:
print("Check if they are equal: " , offline.equals(df))

Check if they are equal:  True
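
If the comparison returns False, `equals` does not say where the two frames differ. A sketch of how `pd.testing.assert_frame_equal` can help locate the first difference, using two toy frames:

```python
import pandas as pd

expected = pd.DataFrame({"signal": [-38, -56], "angle": [0, 0]})
mine = pd.DataFrame({"signal": [-38, -56], "angle": [0, 45]})

try:
    pd.testing.assert_frame_equal(mine, expected)
    report = "identical"
except AssertionError as err:
    # The message names the first differing column and shows both sets of values
    report = str(err)
```

Typical causes of a mismatch are a differing column dtype, a differing column order, or an index that was not reset after filtering.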


## EDA

We have actually already done some EDA. For example, we explored

   - the orientation, from which we derived the angle column
   - the MAC addresses:
     - we reduced the data to 7 potential APs, based on the number of measurements
     - According to the documentation, the access points consist of 5 Linksys/Cisco routers and one Lancom L-54g router. We can look up these MAC addresses (http://coffer.com/mac_find/): vendor addresses that begin with 00:14:bf belong to Linksys devices, those beginning with 00:0f:a3 belong to Alpha Networks, and Lancom devices start with 00:a0:57. Of the 7 MAC addresses we kept, 5 begin with 00:14:bf and 2 begin with 00:0f:a3; none begins with 00:a0:57. This does not match the documentation.
  
Until now, the goal was to clean and structure the data.

What we will do next:

- Explore the data further, concentrating on the properties of the response variable, signal strength.

Before designing a model, we have to know how the signals behave:
 Does the signal strength behave similarly at all locations, or do the location, orientation, and access point influence its distribution?
 We also want to characterize the relationship between the signal strength and the distance from the device to the access point. What affects this relationship?
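
A first step toward answering these questions is to summarize the signal strength per location, angle, and access point. The sketch below uses made-up values; the column names follow the processed DataFrame.

```python
import pandas as pd

toy = pd.DataFrame({
    "posXY": ["0.0-0.0"] * 4 + ["1.0-0.0"] * 4,
    "angle": [0, 0, 45, 45] * 2,
    "mac": ["ap1"] * 8,
    "signal": [-38, -40, -50, -52, -60, -62, -61, -63],
})

# Mean, spread, and number of measurements per (location, angle, access point)
summary = (toy.groupby(["posXY", "angle", "mac"])["signal"]
              .agg(["mean", "std", "count"])
              .reset_index())
```

If the means or standard deviations differ noticeably between groups, the location, orientation, or access point influences the signal distribution, and the model design should take that into account.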