### Predicting the Radio Column

#### Overview

In this challenge, you are tasked with building a predictive model that forecasts the **radio** field based on the available features: **range, samples, changeable, created, updated, averageSignal, and lonlat**. Your model’s performance will be evaluated based on the **macro average recall score** across the classes in the target variable.

#### Task Objectives

1.	Data Exploration and Preprocessing:
*   Analyze the provided training dataset to understand the relationships between features.
*   Investigate the distributions and potential correlations of features such as range, samples, averageSignal, etc.
*   Explore the time-related fields (created, updated) and the geographical field (lonlat) for possible feature engineering.
*   Clean the data by handling missing values, outliers, or inconsistencies.
*   Tip: all string-typed columns can be transformed into a form that an ML model can use (see the [DatetimeFeatures Documentation](https://feature-engine.trainindata.com/en/latest/api_doc/datetime/DatetimeFeatures.html)).
2.	Modeling & Feature Engineering:
*   Design and implement feature engineering strategies to extract additional signal from fields like created/updated and lonlat (see [Feature Engine](https://feature-engine.trainindata.com/en/latest/) for tips).
*   Optionally reduce the feature set to only the relevant features with the [Boruta](https://github.com/scikit-learn-contrib/boruta_py?tab=readme-ov-file#description) algorithm.
*   Select and train a model (or an ensemble of models) that can accurately predict the radio column. Split the data into training and test sets ([train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html#sklearn.model_selection.train_test_split)) for model training and evaluation.
*   Explore different algorithms and tuning techniques to optimize the **recall** of each class, so that the macro average is as high as possible. The [GridSearchCV Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV) might be helpful here.
3.	Evaluation Metric:
*   The submitted model will be scored using the **macro average [recall](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html) score**.
*   Focus not only on overall accuracy but also on ensuring that **recall is high across all classes** ([recall_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html)), especially where the class distribution is imbalanced. Consider balancing the data ([Imbalanced Data Problem Statement](https://imbalanced-learn.org/stable/introduction.html#problem-statement-regarding-imbalanced-data-sets)) to get the best result.
4.	Inference:
*   In addition to the initial training data provided, you will be given a separate dataset of 10,000 samples on which you will generate predictions.
*   As your solution, please provide a **solution.parquet** file with a single column, "radio".
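
The tuning and evaluation objectives above can be sketched as follows. This is only an illustration under stated assumptions: the data is synthetic and imbalanced by construction (it is not the challenge dataset), the random forest and its small parameter grid are arbitrary choices, and `class_weight="balanced"` is just one of several ways to address imbalance. The key point is that `GridSearchCV` is asked to optimize `scoring="recall_macro"`, which matches the challenge metric.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced three-class toy data (an assumption for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 4))
y = np.where(X[:, 0] > 1.0, "CDMA", np.where(X[:, 1] > 0, "LTE", "GSM"))

# Stratify so every class is represented in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Tune hyperparameters against macro-average recall, the challenge metric.
search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    scoring="recall_macro",
    cv=3,
)
search.fit(X_train, y_train)

# Score the refitted best model on the held-out split.
test_recall = recall_score(y_test, search.predict(X_test), average="macro")
print(search.best_params_, round(test_recall, 3))
```

On the real data, the numeric array `X` would be replaced by the engineered features and `y` by the `radio` column.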


#### Data Description

*	radio: The target variable which is to be predicted. Network type. One of the strings GSM, UMTS, LTE or CDMA.
*	range: Estimate of cell range, in meters.
*	samples: Total number of measurements assigned to the cell tower.
*	changeable: Indicates whether the coordinates of the cell tower are exact or approximate.
    *   changeable=1: the GPS position of the cell tower has been calculated from all available measurements.
    *   changeable=0: the GPS position of the cell tower is precise; no measurements have been used to calculate it.
*	created: The first time when the cell tower was seen.
*	updated: The last time the cell tower was seen and updated.
*	averageSignal: Average signal strength from all assigned measurements for the cell, either in dBm or as defined in TS 27.007 8.5; both are accepted.
*	lonlat: Geolocation information that might be used to incorporate spatial analysis.
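
The `lonlat` values in this dataset are strings in degrees-minutes-seconds notation (for example `18°54'2"E 47°27'32"N`), so they need converting to numbers before any spatial analysis. Below is a minimal sketch of one way to do that. The function name `parse_lonlat` is hypothetical, and the sketch assumes that western longitudes, which appear in the data with negative components as well as a `W` suffix (for example `-8°-28'-31"W`), should map to a single negative decimal value.

```python
import re

# One coordinate: degrees, minutes, seconds, then a hemisphere letter.
_DMS = re.compile(r"(-?\d+)°(-?\d+)'(-?\d+(?:\.\d+)?)\"([NSEW])")

def parse_lonlat(text):
    """Convert a DMS string into (longitude, latitude) in decimal degrees."""
    lon = lat = None
    for deg, minutes, seconds, hemi in _DMS.findall(text):
        # Assumption: the sign comes from the hemisphere letter alone,
        # so any minus signs on the components are ignored.
        value = abs(int(deg)) + abs(int(minutes)) / 60 + abs(float(seconds)) / 3600
        if hemi in "SW":
            value = -value
        if hemi in "EW":
            lon = value
        else:
            lat = value
    if lon is None or lat is None:
        raise ValueError(f"could not parse lonlat value: {text!r}")
    return lon, lat

print(parse_lonlat('18°54\'2"E 47°27\'32"N'))
print(parse_lonlat('-8°-28\'-31"W 41°27\'45"N'))
```

Applied to a DataFrame column, the result could be split into two numeric columns, for example with `df["lonlat"].apply(parse_lonlat)`.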



In [22]:
import pandas as pd

train_data = pd.read_parquet('dataset.parquet')

train_data.head()

Unnamed: 0,radio,range,samples,changeable,created,updated,averageSignal,lonlat
0,GSM,2798,50,1,2016-10-09 11:09:15,2024-12-07 17:47:04,-84,"18°54'2""E 47°27'32""N"
1,LTE,1000,13,0,2022-07-13 01:29:08,2024-08-19 14:18:06,-8,"26°6'24""E 44°27'38""N"
2,UMTS,1000,1,1,2024-05-10 01:29:14,2024-05-10 01:29:14,-61,"5°47'14""E 51°54'34""N"
3,LTE,1000,6,0,2023-12-16 15:23:35,2023-12-17 19:39:04,-48,"5°34'6""E 46°40'32""N"
4,LTE,1000,6,0,2024-11-06 10:56:47,2024-11-06 19:11:37,-59,"5°25'37""E 45°20'32""N"


In [23]:
production_data = pd.read_parquet('production_data.parquet')

production_data.head()

Unnamed: 0,range,samples,changeable,created,updated,averageSignal,lonlat
0,1000,22,0,2024-08-31 01:49:05,2024-11-05 07:54:38,-67,"19°9'45""E 47°32'15""N"
1,2469,50,1,2018-01-17 14:20:37,2025-01-26 19:13:34,-67,"5°35'12""E 46°11'48""N"
2,1000,3,1,2023-11-12 06:13:35,2023-11-12 19:10:04,-11,"1°52'26""E 49°6'4""N"
3,1613,26,1,2023-06-21 11:14:10,2024-12-13 01:02:02,-82,"-8°-28'-31""W 41°27'45""N"
4,1000,1,1,2024-03-08 00:17:51,2024-03-08 00:17:51,-98,"2°58'17""E 50°23'33""N"


In [24]:
production_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   range          10000 non-null  int64 
 1   samples        10000 non-null  int64 
 2   changeable     10000 non-null  int32 
 3   created        10000 non-null  object
 4   updated        10000 non-null  object
 5   averageSignal  10000 non-null  int32 
 6   lonlat         10000 non-null  object
dtypes: int32(2), int64(2), object(3)
memory usage: 468.9+ KB
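
The `info()` output shows `created` and `updated` stored as `object` (string) columns, which most models cannot consume directly. The sketch below converts them with plain pandas and derives a few numeric features. It is a stand-in for illustration, not the feature-engine `DatetimeFeatures` transformer mentioned in the tip, and the helper name `add_time_features` and the chosen feature names are assumptions.

```python
import pandas as pd

def add_time_features(df):
    """Replace the string timestamp columns with numeric features."""
    out = df.copy()
    created = pd.to_datetime(out["created"])
    updated = pd.to_datetime(out["updated"])
    out["created_year"] = created.dt.year
    out["created_month"] = created.dt.month
    out["updated_year"] = updated.dt.year
    # How long the tower was observed, in fractional days.
    out["lifetime_days"] = (updated - created).dt.total_seconds() / 86400
    return out.drop(columns=["created", "updated"])

# Two timestamp pairs copied from the training sample shown above.
sample = pd.DataFrame({
    "created": ["2016-10-09 11:09:15", "2024-05-10 01:29:14"],
    "updated": ["2024-12-07 17:47:04", "2024-05-10 01:29:14"],
})
features = add_time_features(sample)
print(features)
```

The same function would need to be applied to both the training data and `production_data` so that the model sees identical columns at inference time.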


### Creating a dummy classifier

In [25]:
# sample a small fraction of the data to speed up this example
train_data = train_data.sample(frac=0.001, random_state=0)

In [26]:
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score

# split data into training and testing sets
data_train, data_test = train_test_split(train_data, test_size=0.2, random_state=0)

# separate the features from the target so 'radio' is not passed in as an input
features_train = data_train.drop(columns='radio')
features_test = data_test.drop(columns='radio')
target_train = data_train['radio']
target_test = data_test['radio']

# create a dummy classifier that always predicts the most frequent class
dummy = DummyClassifier(strategy='most_frequent')

# "train" the model
dummy.fit(features_train, target_train)

# make predictions
target_pred = dummy.predict(features_test)

print("Recall:", recall_score(target_test, target_pred, average="macro"))



Recall: 0.25
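
The score of 0.25 follows from how macro-average recall is defined: recall is computed per class and the per-class values are averaged with equal weight. A classifier that always predicts one class gets recall 1 for that class and 0 for the others, so with four classes present it scores 1/4. The pure-Python sketch below illustrates this on made-up labels. The helper `macro_recall` is hypothetical and simplified: it averages only over classes that occur in `y_true`, which matches `recall_score(..., average="macro")` when every predicted label also occurs in `y_true`.

```python
from collections import Counter

def macro_recall(y_true, y_pred):
    """Unweighted mean of per-class recall over the classes in y_true."""
    classes = sorted(set(y_true))
    support = Counter(y_true)  # number of true samples per class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct per class
    return sum(hits[c] / support[c] for c in classes) / len(classes)

# Made-up, imbalanced labels: LTE dominates.
y_true = ["LTE"] * 6 + ["GSM"] * 2 + ["UMTS", "CDMA"]
y_pred = ["LTE"] * len(y_true)  # what a most-frequent baseline predicts

score = macro_recall(y_true, y_pred)
print(score)  # 0.25, although plain accuracy would be 0.6
```

This is why the metric rewards models that recover the rare classes, rather than ones that only do well on the majority class.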


#### Final inference with production data

In [27]:
# make inference on production data
inference_pred = dummy.predict(production_data)

# store the predictions in a parquet file
inference_pred = pd.DataFrame({'radio': inference_pred})

inference_pred.to_parquet('solution.parquet')
inference_pred.head()


Unnamed: 0,radio
0,LTE
1,LTE
2,LTE
3,LTE
4,LTE
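
Before sending the file, it may be worth checking it against the stated requirements: a single column named "radio", 10,000 rows, and only the four network types listed in the data description. The helper below is a hypothetical sketch of such a check, not part of the official scoring, and it operates on a DataFrame rather than reading the parquet file.

```python
import pandas as pd

# The four network types listed in the data description.
ALLOWED = {"GSM", "UMTS", "LTE", "CDMA"}

def check_submission(df, expected_rows=10_000):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if list(df.columns) != ["radio"]:
        problems.append(f"expected a single 'radio' column, got {list(df.columns)}")
    if len(df) != expected_rows:
        problems.append(f"expected {expected_rows} rows, got {len(df)}")
    if "radio" in df.columns:
        unexpected = set(df["radio"].unique()) - ALLOWED
        if unexpected:
            problems.append(f"unexpected labels: {sorted(unexpected)}")
    return problems

# Stand-in for the predictions frame built in the cell above.
demo = pd.DataFrame({"radio": ["LTE"] * 10_000})
print(check_submission(demo))
```

In the notebook, the call would be `check_submission(inference_pred)`.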


#### Please send solution.parquet for final exercise scoring