# Call Dropping Prediction

Call drops are of great interest to the telecommunications industry (telco). Identifying the users affected by them helps with planning future cell towers. Based on the radio frequency (RF) signal strength, which is generated via a physical model, ML-based methods can identify which factors capture the signal strength distribution across a region. Variable importance can then be analyzed to assess the key variables and help with infrastructure planning.

Given data over the Los Angeles region, we are interested in predicting the signal reception quality, which is defined as `max_rf_signal_strength_dbm` in the dataset. Its value is used to classify data points into three categories:

0. Good signal, 
1. Drop,
2. No signal.
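
As a rough illustration of how such a labeling could work, the sketch below maps signal strength to the three categories. The cut-off value `DROP_THRESHOLD_DBM` and the helper `label_signal` are hypothetical: the notebook does not state the actual thresholds. Treating a missing value as "no signal" is also an assumption, although it matches the sample row shown later, where an empty signal strength carries label 2.

```python
import numpy as np
import pandas as pd

# Hypothetical cut-off, for illustration only: the real boundary between
# "good signal" and "drop" is not given in this notebook.
DROP_THRESHOLD_DBM = -100.0


def label_signal(strength_dbm: pd.Series) -> pd.Series:
    """Map signal strength (dBm) to 0 = good signal, 1 = drop, 2 = no signal."""
    labels = np.select(
        # Conditions are checked in order; the first match wins.
        [strength_dbm.isna(), strength_dbm < DROP_THRESHOLD_DBM],
        [2, 1],
        default=0,
    )
    return pd.Series(labels, index=strength_dbm.index, name="label")


example = pd.Series([-62.4, -110.0, np.nan], name="max_rf_signal_strength_dbm")
labels = label_signal(example)
```

With a different threshold the split between categories 0 and 1 changes, so the value should come from the data provider rather than from this sketch.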

## Setup

On top of *HeavyDB*, the following packages are used.

* [heavyai](https://heavyai.readthedocs.io/en/latest/): interact with HeavyDB
* [pandas](https://pandas.pydata.org): tabular data structure
* [scikit-learn](https://scikit-learn.org/stable/): machine learning

They are all available on PyPI or conda-forge (recommended), depending on your preferred method of installation.

In [1]:
import heavyai
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn import pipeline
from sklearn import set_config

set_config(display="diagram")

In [2]:
# NBVAL_IGNORE_OUTPUT
import importlib_metadata
importlib_metadata.version('heavyai'), importlib_metadata.version('geopandas'), importlib_metadata.version('contextily')

('1.0', '0.9.0', '1.2.0')

## Data preparation

To train our model/classifier, we have access to measurements over LA.

First, connect to HeavyDB. This connection will be used to load and read data and to execute commands on the database.

In [3]:
def create_connection():
    con = heavyai.connect(user="admin", password="HyperInteractive", host="localhost", dbname="heavyai")
    return con

con = create_connection()

### Importing data to HeavyDB

If the data is not yet present in the database, there are convenient functions to load a pandas DataFrame into HeavyDB.

Assuming our data is in CSV files, it can be loaded into a DataFrame with pandas:

In [4]:
rf_prop = pd.read_csv("data/la_rf_prop_v19").drop('Unnamed: 0', axis=1)
rf_prop.shape

(3061768, 9)

In [5]:
rf_prop.head(1)

Unnamed: 0,geo,elevation_amsl_meters,rf_source_id,max_rf_signal_strength_dbm,row_id,source_distance,nearest_cell,nearest_source_distance,label
0,POINT (-118.096999972173 33.6369999747896),,,,536680,0.0,38633.0,4731.014691,2


In [6]:
con.load_table("la_rf_prop_v19", rf_prop)

### Reading data from HeavyDB

Once the data is in the database, it can be accessed or worked on using all the capabilities of HeavyDB.

In [7]:
con.get_tables()

['omnisci_states',
 'omnisci_counties',
 'omnisci_countries',
 'la_rf_prop_v19',
 'la_rf_prop_v11',
 'lidar',
 'cell_towers']

In [8]:
con.get_table_details('la_rf_prop_v19')

[ColumnDetails(name='geo', type='STR', nullable=True, precision=0, scale=0, comp_param=32, encoding='DICT', is_array=False),
 ColumnDetails(name='elevation_amsl_meters', type='DOUBLE', nullable=True, precision=0, scale=0, comp_param=0, encoding='NONE', is_array=False),
 ColumnDetails(name='rf_source_id', type='DOUBLE', nullable=True, precision=0, scale=0, comp_param=0, encoding='NONE', is_array=False),
 ColumnDetails(name='max_rf_signal_strength_dbm', type='DOUBLE', nullable=True, precision=0, scale=0, comp_param=0, encoding='NONE', is_array=False),
 ColumnDetails(name='row_id', type='BIGINT', nullable=True, precision=0, scale=0, comp_param=0, encoding='NONE', is_array=False),
 ColumnDetails(name='source_distance', type='DOUBLE', nullable=True, precision=0, scale=0, comp_param=0, encoding='NONE', is_array=False),
 ColumnDetails(name='nearest_cell', type='DOUBLE', nullable=True, precision=0, scale=0, comp_param=0, encoding='NONE', is_array=False),
 ColumnDetails(name='nearest_source_dis

In [9]:
pd.read_sql("SELECT * FROM la_rf_prop_v19 limit 10", con).head(1)



Unnamed: 0,geo,elevation_amsl_meters,rf_source_id,max_rf_signal_strength_dbm,row_id,source_distance,nearest_cell,nearest_source_distance,label
0,POINT (-118.096999972173 33.6369999747896),,,,536680,0.0,38633.0,4731.014691,2


The data is conveniently read into a pandas DataFrame, which allows powerful analysis.

`heavyai` also provides a more powerful method, `select_ipc`, which uses Arrow as a transport layer. In addition, `sample_ratio` can be used to sample only a fraction of the table. This helps avoid accidentally pulling very large tables.

In [10]:
con.select_ipc(
    "SELECT * FROM la_rf_prop_v19"
    " WHERE sample_ratio((SELECT 100000 / CAST(count(*) AS FLOAT) FROM la_rf_prop_v19))"
).head(1)

Unnamed: 0,geo,elevation_amsl_meters,rf_source_id,max_rf_signal_strength_dbm,row_id,source_distance,nearest_cell,nearest_source_distance,label
0,POINT (-118.096999972173 33.6369999747896),,,,536680,0.0,38633.0,4731.014691,2
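
The `sample_ratio` filter above takes a fraction, which the query computes by dividing a target row count by the table size. A minimal sketch of a helper that builds this query for any table and row target follows. The function `sample_query` is hypothetical and not part of `heavyai`; it only assembles the SQL string used in the cell above.

```python
def sample_query(table: str, n_rows: int) -> str:
    """Build a query that samples roughly `n_rows` rows from `table`."""
    if n_rows <= 0:
        raise ValueError("n_rows must be positive")
    # The scalar subquery turns the row target into a fraction of the table.
    ratio = f"(SELECT {n_rows} / CAST(COUNT(*) AS FLOAT) FROM {table})"
    return f"SELECT * FROM {table} WHERE SAMPLE_RATIO({ratio})"


query = sample_query("la_rf_prop_v19", 100_000)
# The resulting string can be passed to con.select_ipc(query).
```

Because the table name is interpolated directly into the SQL text, this helper should only be called with trusted table names.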


## Predicting signal quality

First, we want to see whether the distance from an observer to the closest repeater can be used to predict signal reception quality.

In [11]:
rf_prop['distance'] = rf_prop['source_distance'] + rf_prop['nearest_source_distance']

In [12]:
X_train, X_test, y_train, y_test = train_test_split(
    rf_prop[['distance']], rf_prop['label'], test_size=0.8
)

In [13]:
pipe = pipeline.Pipeline(
    [('preprocess', preprocessing.StandardScaler()),
     ('classifier', RandomForestClassifier(n_estimators=1, max_depth=2))]
)
pipe.fit(X_train, y_train)

In [14]:
pipe.score(X_test, y_test)

0.8527607612429907
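
An accuracy figure like this is easier to interpret next to a trivial baseline, since a classifier that always predicts the most common label can already score well when the classes are imbalanced. The sketch below shows one way to compute such a baseline with scikit-learn's `DummyClassifier`. It runs on synthetic stand-in data with assumed class proportions, not on the LA dataset, so its score says nothing about the real data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: one feature and imbalanced labels 0/1/2.
# The class proportions are assumptions made for this illustration.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10_000, size=(1_000, 1))
y_demo = rng.choice([0, 1, 2], size=1_000, p=[0.8, 0.15, 0.05])

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.8, random_state=0
)

# Always predicts the most frequent training label, ignoring the features.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(Xd_train, yd_train)
baseline_score = baseline.score(Xd_test, yd_test)
```

Running the same baseline on `X_train`, `y_train`, `X_test` and `y_test` from the cells above would show how much of the reported score comes from the distance feature itself.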

## Improved Model

To try to improve the quality of our classifier, we now use the signal strength an observer gets from the top 3 cell towers, in addition to the distance to the closest cell tower.

In [15]:
rf_prop = pd.read_csv("data/la_rf_prop_v11").drop('Unnamed: 0', axis=1)
con.load_table("la_rf_prop_v11", rf_prop)
rf_prop.shape

(4967970, 12)

In [16]:
rf_prop = rf_prop.loc[
    (rf_prop['temp_row_id'] == 1)
    & (pd.notna(rf_prop['first_signal']))
    & (pd.notna(rf_prop['second_signal']))
]
rf_prop.reset_index(drop=True, inplace=True)
rf_prop.head(1)

Unnamed: 0,row_id,x,y,elevation_amsl_meters,rf_source_id,terrain_bin_id,rf_signal_strength_dbm,rf_source_distance_meters,temp_row_id,first_signal,second_signal,label
0,,-118.236595,33.7341,4.573941,101278,930242,-62.41758,450.071,1,-62.418934,-62.421417,0


In [17]:
X_train, X_test, y_train, y_test = train_test_split(
    rf_prop[['rf_signal_strength_dbm','rf_source_distance_meters','first_signal','second_signal']],
    rf_prop['label'], test_size=0.3
)

The same classifier architecture is used:

In [18]:
pipe.fit(X_train, y_train)

In [19]:
pipe.score(X_test, y_test)

0.9870469178111242
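
The introduction mentions analyzing variable importance to assess key variables. One possible way to do this with a fitted random forest pipeline is to read its `feature_importances_` attribute, as sketched below. The data here is synthetic and the label rule is invented for the example, so the resulting ranking is illustrative only. The sketch also uses 10 trees instead of the single tree used above, which is an assumption made to get a more stable ranking.

```python
import numpy as np
import pandas as pd
from sklearn import pipeline, preprocessing
from sklearn.ensemble import RandomForestClassifier

features = ["rf_signal_strength_dbm", "rf_source_distance_meters",
            "first_signal", "second_signal"]

# Synthetic stand-in data; the label depends on the first column only.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(500, len(features))), columns=features)
y_demo = (X_demo["rf_signal_strength_dbm"] > 0).astype(int)

demo_pipe = pipeline.Pipeline(
    [("preprocess", preprocessing.StandardScaler()),
     ("classifier", RandomForestClassifier(n_estimators=10, max_depth=2,
                                           random_state=0))]
)
demo_pipe.fit(X_demo, y_demo)

# Impurity-based importances of the fitted forest, one value per feature.
importances = pd.Series(
    demo_pipe.named_steps["classifier"].feature_importances_,
    index=features,
).sort_values(ascending=False)
```

The same attribute is available on the notebook's own pipeline through `pipe.named_steps['classifier']` once it has been fitted.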

## Conclusion

We've shown an end-to-end workflow combining physical modeling with data science; a similar approach can be used in many other contexts where complex physical models would benefit from calibration, speed or both.