# Preliminary Phase

In this phase, we analyze the shape of the data set to understand whether any data preparation is needed.

Importing required libraries and reading the data from file.

In [None]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
df = pd.read_csv('../dataset/raw_data.csv')

Understanding the format of data.

In [None]:
df

In [None]:
df.isna().any()

The data set has 5 columns and 13,584 rows, and does not contain null values.
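The same figures can also be read programmatically rather than from the rendered table. A minimal sketch, using a small made-up frame in place of `raw_data.csv`; the column names are the ones used in this notebook, while all values are invented for illustration:

```python
import pandas as pd

# Tiny stand-in for the raw data; every value here is invented.
sample = pd.DataFrame({
    'name': ['beacon', 'beacon', 'beacon'],
    'locationStatus': ['inside', 'inside', 'inside'],
    'timestamp': [1000, 2000, 3000],
    'rssiOne': [-80, -75, -90],
    'rssiTwo': [-80, -75, -90],
})

n_rows, n_cols = sample.shape            # (rows, columns)
nulls_per_column = sample.isna().sum()   # missing values counted per column

print(n_rows, n_cols)
print(nulls_per_column)
```

Unlike `isna().any()`, `isna().sum()` also reports how many values are missing in each column, which helps when deciding how to handle them.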

## Data Preparation

### Timestamp

The timestamp needs to be transformed from milliseconds to a standard format that can be used as the index. Specifically, we format it as hours, minutes, seconds, and fractional seconds (hh:mm:ss.ffffff, where `%f` yields six digits of microseconds).

In [None]:
df['Time'] = pd.to_datetime(df['timestamp'], unit='ms').dt.strftime('%H:%M:%S.%f')
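To make the conversion concrete, here is a small worked example on two hypothetical millisecond values (not taken from the data set). It assumes the timestamps count milliseconds from the Unix epoch, which is what `unit='ms'` implies:

```python
import pandas as pd

# Hypothetical timestamps in milliseconds since the epoch.
ms = pd.Series([3_723_004, 3_724_500])

as_datetime = pd.to_datetime(ms, unit='ms')        # datetime64 values
as_text = as_datetime.dt.strftime('%H:%M:%S.%f')   # plain strings, date dropped

print(as_text.tolist())
```

Note that `strftime` returns strings and discards the date part, so the resulting column is no longer a datetime type.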

### Data cleaning

We remove the columns that are not required for our scope: the name of the beacon, the location status, and the timestamp in milliseconds.

In [None]:
df = df.drop(columns=['name', 'locationStatus', 'timestamp'])
df

In [None]:
df.describe()

Exploring the data set, we notice that the two columns containing the RSSI registered by each smartphone seem to have the same values. We perform some further investigation to check that.

In [None]:
df['Diff'] = df['rssiOne'] - df['rssiTwo']
df.loc[df['Diff'] != 0]
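The same comparison can be expressed without a helper column. A minimal sketch on invented values, assuming both columns share the same index and contain no missing values:

```python
import pandas as pd

# Invented readings standing in for the two smartphone columns.
sample = pd.DataFrame({
    'rssiOne': [-80, -75, -90],
    'rssiTwo': [-80, -75, -90],
})

# Number of rows where the two readings disagree.
n_mismatches = int((sample['rssiOne'] != sample['rssiTwo']).sum())

# True only if every element matches and the dtypes are the same.
identical = sample['rssiOne'].equals(sample['rssiTwo'])

print(n_mismatches, identical)
```

A mismatch count of zero gives a single number to check, instead of inspecting an empty filtered table.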

The investigation confirms that there are no differences between the values registered by the two smartphones. Therefore, we can drop one column and reshape the data set to enhance readability and usability.

In [None]:
df = df.drop(columns=['rssiTwo', 'Diff'])
df.rename(columns={'rssiOne':'rssi'}, inplace=True)
df

## Plotting

We create some basic plots to further understand the data set.

In [None]:
histCount = df.hist(column='rssi', figsize=(20, 12))
for ax in histCount.flatten():
    ax.set_xlabel('RSSI')
    ax.set_ylabel('Count')

In [None]:
df.describe()

The table and the figure show that most of the received signals lie at the lower end of the range, i.e., the majority of readings have low power intensity. The maximum signal strength observed is −55 dBm (decibel-milliwatts), which is considered high enough for most real-world applications. Moreover, the third quartile (75th percentile) is −78 dBm, indicating acceptable coverage.
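In the `describe()` output, the `75%` row is the third quartile: the value below which three quarters of the readings fall. A small sketch on invented RSSI values (not the notebook's data) showing how that row relates to `quantile`:

```python
import pandas as pd

# Invented RSSI readings in dBm, for illustration only.
rssi = pd.Series([-95, -90, -88, -85, -80, -78, -70, -55], name='rssi')

summary = rssi.describe()
q3 = rssi.quantile(0.75)   # third quartile, linear interpolation by default

print(summary['75%'], q3, summary['max'])
```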

We turn the time column into a timestamp type that the library in use can handle, to improve readability.

In [None]:
df['Time'] = df['Time'].apply(pd.Timestamp)
df.plot(x='Time',xlabel='Time', y='rssi', ylabel='RSSI', figsize=(20, 12))

The plot shows strong fluctuation in the signal, revealing the presence of a lot of noise, as expected in an indoor environment. Therefore, we need to reduce the interference using techniques to smooth the signal. Several methods can be used, and the optimal solution depends heavily on the specific context and the degree of noise present.
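As one illustration of such smoothing (not necessarily the method adopted for this data set), a centered moving average and an exponentially weighted mean can be sketched as follows. The synthetic signal, the window of 5 samples, and `alpha=0.3` are arbitrary assumptions chosen for the example:

```python
import numpy as np
import pandas as pd

# Synthetic noisy RSSI-like signal; not the notebook's data.
rng = np.random.default_rng(0)
raw = pd.Series(-80 + rng.normal(0, 5, size=200), name='rssi')

# Moving average over 5 samples; min_periods=1 avoids NaN at the edges.
moving_avg = raw.rolling(window=5, center=True, min_periods=1).mean()

# Exponentially weighted mean; a smaller alpha means heavier smoothing.
ewm_avg = raw.ewm(alpha=0.3).mean()

print(raw.std(), moving_avg.std(), ewm_avg.std())
```

A wider window or a smaller `alpha` removes more noise but makes the smoothed signal slower to follow genuine changes, so both parameters would need tuning against the real data.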

Saving the cleaned dataframe into a *.csv* file.

In [None]:
df.to_csv('../dataset/clean_data.csv', index=False)