# Data Mining Project
The topic is **predictive maintenance**: instead of periodically checking the state of the machinery (twice a year, for example), we use data to estimate which machines are more likely to fail in a given period. This saves money and time and makes maintenance more efficient.

## Project context

The project field is **telecommunications**: the dataset concerns the prediction of **equipment faults** (split into categories) for equipment installed on radio communication sites. 
Sites are distributed across the national territory.

Each site is composed of:

1. **Tower**: responsible for the signal transmission. Each tower can mount a different cell type (CELL_TYPE_X), which is the class of the radio transmitter.
2. **Shelter**: small "room" that contains the equipment needed for the correct functioning of the tower and the radio transmitter cell. This equipment includes the **air conditioning system**, which keeps a constant working temperature inside the shelter and prevents circuits and servers from overheating.
    
**We want to predict a fault of the cooling system (shelter) in the next 14 days**. 
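
As a minimal sketch of how this target could be expressed, assume that the column `aircon_sum_target_next14d` (it appears in the dataset header further down) counts air-conditioning faults in the following 14 days. Under that assumption a binary label is simply "count greater than zero". The toy rows and the `will_fail` column name are hypothetical.

```python
import pandas as pd

# Toy rows; the values are made up for illustration only.
toy = pd.DataFrame(
    {
        "SITE_ID": [146, 146, 1251],
        "DATE": ["2019-04-10", "2019-04-11", "2020-01-30"],
        # Assumed meaning: number of air-conditioning faults in the next 14 days.
        "aircon_sum_target_next14d": [0, 2, 1],
    }
)

# Hypothetical binary label: 1 if at least one fault occurs in the window, else 0.
toy["will_fail"] = (toy["aircon_sum_target_next14d"] > 0).astype(int)
print(toy)
```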

The split has been performed site-wise (the actual meaning still has to be checked: it was quite confusing during the presentation).
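
One possible reading of "site-wise" is that all samples of a given site end up on the same side of the split, so that no `SITE_ID` appears in both parts. The sketch below illustrates that interpretation only. The function name `split_by_site`, the 80/20 ratio and the seed are assumptions, not the procedure actually used to build the dataset.

```python
import random


def split_by_site(site_ids, test_fraction=0.2, seed=0):
    """Return (train_sites, test_sites) as disjoint sets of site identifiers."""
    unique_sites = sorted(set(site_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_sites)
    n_test = max(1, round(len(unique_sites) * test_fraction))
    test_sites = set(unique_sites[:n_test])
    train_sites = set(unique_sites[n_test:])
    return train_sites, test_sites


# One entry per sample: the same site appears on several dates.
sample_site_ids = [146, 146, 146, 300, 300, 512, 512, 777, 1251, 1251]
train_sites, test_sites = split_by_site(sample_site_ids)

# Each sample follows its site, so a site never straddles the two parts.
train_rows = [i for i, s in enumerate(sample_site_ids) if s in train_sites]
test_rows = [i for i, s in enumerate(sample_site_ids) if s in test_sites]
print(len(train_rows), len(test_rows))
```

Splitting on rows instead would let samples of the same site, taken on consecutive days, leak between the two parts.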

## Features description

1. **SITE_ID**: identifier of the site
2. **DATE**: reference date of the sample
3. **N_TRANSPORTED_SITES**: static feature of the site. It indicates the **number of transmissions that depend on the current site**. For example, a site (SITE_ID) can have N_TRANSPORTED_SITES equal to 10. This means that if the current site fails, we lose 10 (or 11?) transmissions overall.
4. **CELL_TYPE_X**: type of the transmission cell. X is a placeholder for the real cell type (CELL_TYPE_Macro, CELL_TYPE_micro, CELL_TYPE_Mobil, etc.). Cell types differ, for example, in emitted power and frequency range.
5. **GEOGRAPHIC_CLUSTER_K_x**: geographical cluster of the site based on its location, for example sites in different regions (Lombardia, Piemonte, etc.).
6. **cat_sum_alarms_prevXd**: 9 possible categories of alarms. "cat" is a placeholder for the real category name and X is a placeholder for the number of days. For example, "temperature_sum_alarms_prev14d" is the number of "temperature" alarms in the previous 14 days.
7. **cat_mean/max/min_persistance_prevXd**: statistics about the duration of the "cat" alarm (remember, it's a placeholder) in the previous X days. Longer alarms may be more relevant and/or problematic, while very short alarms might be false positives.
8. **skew_cat_alarms_prev14d**: skewness of each alarm type
9. **kurt_cat_alarms_prev14d**: kurtosis of each alarm type
10. **mean/max/min_w_prevXd**: statistics of different weather conditions in the previous X days (again, X is a placeholder). "w" is a placeholder for the weather variable, for example "temperature", "rain_mm", etc.
11. **mean/max/min_w_f_nextXd**: same as above, but for the future: the statistics cover the next X days instead of the previous X days.
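
The naming scheme in the list above is regular enough to be decoded automatically. The following sketch uses only the standard `re` module to split a column name into family, category, statistic and window length. The family labels (`alarm_count`, `weather_past`, and so on) and the helper `describe_column` are hypothetical names introduced for illustration.

```python
import re

# One pattern per feature family; the forecast pattern is tried before the
# generic "prev" weather pattern, although the two cannot match the same name.
PATTERNS = [
    ("alarm_count", re.compile(r"^(?P<cat>.+)_sum_alarms_prev(?P<days>\d+)d$")),
    ("alarm_persistance", re.compile(r"^(?P<cat>.+)_(?P<stat>mean|max|min)_persistance_prev(?P<days>\d+)d$")),
    ("alarm_shape", re.compile(r"^(?P<stat>skew|kurt)_(?P<cat>.+)_alarms_prev(?P<days>\d+)d$")),
    ("weather_forecast", re.compile(r"^(?P<stat>mean|max|min)_(?P<cat>.+)_f_next(?P<days>\d+)d$")),
    ("weather_past", re.compile(r"^(?P<stat>mean|max|min)_(?P<cat>.+)_prev(?P<days>\d+)d$")),
]


def describe_column(name):
    """Return (family, category, statistic, days), or None if nothing matches."""
    for family, pattern in PATTERNS:
        match = pattern.match(name)
        if match:
            groups = match.groupdict()
            # Alarm counts carry no explicit statistic in the name: report "sum".
            return family, groups["cat"], groups.get("stat", "sum"), int(groups["days"])
    return None


for column in ["temperature_sum_alarms_prev14d", "mean_rain_mm_f_next7d", "SITE_ID"]:
    print(column, "->", describe_column(column))
```

Columns that do not follow the scheme, such as `SITE_ID` or `DATE`, return `None` and can be handled separately.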


## Prediction
Prediction is based on **weighted recall**. We have to compute the probability that a given site will fail in the next 14 days, and this probability needs to be weighted by the number of transported sites. This part still needs to be checked in more detail; it is not that useful right now, so it is only touched on here.
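
The exact formula is not given in these notes. As a rough sketch of what a weighted recall could look like, assume that every actual fault counts proportionally to the `N_TRANSPORTED_SITES` of its site: the metric is then the weight of the faults that were caught divided by the total weight of all actual faults. Whether the site itself should be added to the weight is left open here. The function name and the example values are hypothetical.

```python
def weighted_recall(y_true, y_pred, weights):
    """Recall in which every actual fault counts proportionally to its weight."""
    caught = sum(w for t, p, w in zip(y_true, y_pred, weights) if t == 1 and p == 1)
    total = sum(w for t, w in zip(y_true, weights) if t == 1)
    return caught / total if total else 0.0


# Made-up example: three real faults, one of them on a heavily depended-on site.
y_true = [1, 1, 0, 1]
y_pred = [1, 0, 0, 0]                # e.g. predicted probability above a chosen threshold
n_transported_sites = [10, 1, 3, 1]  # assumed weights (N_TRANSPORTED_SITES)

plain = weighted_recall(y_true, y_pred, [1] * len(y_true))
weighted = weighted_recall(y_true, y_pred, n_transported_sites)
print(plain, weighted)
```

In this toy case only one fault out of three is caught, but it is the one with the largest weight, so the weighted value is much higher than the plain recall.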

## Showing Dataset

In [1]:
import pandas as pd

In [3]:
dataset = pd.read_csv('./dataset/train.csv')
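
The `Unnamed: 0` column in the output below suggests that the CSV was saved together with its row index, and `DATE` is read as plain text by default. A possible variant of the load step, assuming the first CSV column really is that saved index, is sketched here on a small in-memory sample so that it runs without the real file.

```python
import io

import pandas as pd

# Tiny stand-in for ./dataset/train.csv (only a few of the real columns).
sample_csv = io.StringIO(
    ",SITE_ID,DATE,N_TRANSPORTED_SITES\n"
    "0,146,2019-04-10,3.0\n"
    "1,146,2019-04-11,3.0\n"
)

# index_col=0 reuses the saved index instead of creating "Unnamed: 0";
# parse_dates converts DATE from text to datetime values.
sample = pd.read_csv(sample_csv, index_col=0, parse_dates=["DATE"])
print(sample.dtypes)
```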

As the following dataframe shows, we have **136 features**. Basically, the features listed above (in the **Features description** section) are "expanded" over the different categories.

For example we have "temperature_sum_alarms_prev7d" or "power_sum_alarms_prev7d", which are the same "feature" but for different categories of alarm. 
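
To look at one such "feature" across all its categories, the columns can be selected by their common suffix. This is a small sketch with `DataFrame.filter` on a toy frame; the values are made up and only a handful of the real column names are used.

```python
import pandas as pd

# Toy frame with a few of the real column names; values are made up.
toy = pd.DataFrame(
    {
        "SITE_ID": [146, 1251],
        "temperature_sum_alarms_prev7d": [0.0, 2.0],
        "power_sum_alarms_prev7d": [1.0, 0.0],
        "power_sum_alarms_prev3d": [0.0, 0.0],
    }
)

# Keep every alarm category of the same "feature" (7-day alarm count).
alarms_7d = toy.filter(regex=r"_sum_alarms_prev7d$")
print(alarms_7d.columns.tolist())
```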

In [6]:
pd.set_option('display.max_columns', None)
dataset

Unnamed: 0,SITE_ID,DATE,CELL_TYPE_Macro,CELL_TYPE_Mobil,CELL_TYPE_TRP,CELL_TYPE_Tx site,CELL_TYPE_micro,N_TRANSPORTED_SITES,GEOGRAPHIC_CLUSTER_K_0,GEOGRAPHIC_CLUSTER_K_1,GEOGRAPHIC_CLUSTER_K_2,GEOGRAPHIC_CLUSTER_K_3,GEOGRAPHIC_CLUSTER_K_4,GEOGRAPHIC_CLUSTER_K_5,GEOGRAPHIC_CLUSTER_K_6,GEOGRAPHIC_CLUSTER_K_7,GEOGRAPHIC_CLUSTER_K_8,GEOGRAPHIC_CLUSTER_K_9,aircon_sum_wo_prev7d,aircon_sum_wo_prev14d,aircon_sum_target_next14d,mean_temperature_prev7d,max_temperature_prev7d,min_temperature_prev7d,mean_temperature_prev3d,max_temperature_prev3d,min_temperature_prev3d,mean_rain_mm_prev7d,max_rain_mm_prev7d,min_rain_mm_prev7d,mean_rain_mm_prev3d,max_rain_mm_prev3d,min_rain_mm_prev3d,mean_humidity_prev7d,max_humidity_prev7d,min_humidity_prev7d,mean_humidity_prev3d,max_humidity_prev3d,min_humidity_prev3d,mean_wind_speed_prev7d,max_wind_speed_prev7d,min_wind_speed_prev7d,mean_wind_speed_prev3d,max_wind_speed_prev3d,min_wind_speed_prev3d,mean_pressure_prev7d,max_pressure_prev7d,min_pressure_prev7d,mean_pressure_prev3d,max_pressure_prev3d,min_pressure_prev3d,mean_temperature_f_next14d,max_temperature_f_next14d,min_temperature_f_next14d,mean_temperature_f_next7d,max_temperature_f_next7d,min_temperature_f_next7d,mean_rain_mm_f_next14d,max_rain_mm_f_next14d,min_rain_mm_f_next14d,mean_rain_mm_f_next7d,max_rain_mm_f_next7d,min_rain_mm_f_next7d,mean_humidity_f_next14d,max_humidity_f_next14d,min_humidity_f_next14d,mean_humidity_f_next7d,max_humidity_f_next7d,min_humidity_f_next7d,mean_wind_speed_f_next14d,max_wind_speed_f_next14d,min_wind_speed_f_next14d,mean_wind_speed_f_next7d,max_wind_speed_f_next7d,min_wind_speed_f_next7d,mean_pressure_f_next14d,max_pressure_f_next14d,min_pressure_f_next14d,mean_pressure_f_next7d,max_pressure_f_next7d,min_pressure_f_next7d,equipment_sum_alarms_prev14d,fire/smoke_sum_alarms_prev14d,ge_sum_alarms_prev14d,power_sum_alarms_prev14d,temperature_sum_alarms_prev14d,equipment_sum_alarms_prev7d,fire/smoke_sum_alarms_prev7d,ge_sum_alarms_prev7d,power_sum_alarms_prev7d,temperature_sum_alarms_prev7d,equipment_sum_alarms_prev3d,fire/smoke_sum_alarms_prev3d,ge_sum_alarms_prev3d,power_sum_alarms_prev3d,temperature_sum_alarms_prev3d,equipment_max_persistance_prev7d,equipment_mean_persistance_prev7d,equipment_min_persistance_prev7d,fire/smoke_max_persistance_prev7d,fire/smoke_mean_persistance_prev7d,fire/smoke_min_persistance_prev7d,ge_max_persistance_prev7d,ge_mean_persistance_prev7d,ge_min_persistance_prev7d,power_max_persistance_prev7d,power_mean_persistance_prev7d,power_min_persistance_prev7d,temperature_max_persistance_prev7d,temperature_mean_persistance_prev7d,temperature_min_persistance_prev7d,equipment_max_persistance_prev3d,equipment_mean_persistance_prev3d,equipment_min_persistance_prev3d,fire/smoke_max_persistance_prev3d,fire/smoke_mean_persistance_prev3d,fire/smoke_min_persistance_prev3d,ge_max_persistance_prev3d,ge_mean_persistance_prev3d,ge_min_persistance_prev3d,power_max_persistance_prev3d,power_mean_persistance_prev3d,power_min_persistance_prev3d,temperature_max_persistance_prev3d,temperature_mean_persistance_prev3d,temperature_min_persistance_prev3d,skew_equipment_alarms_prev14d,skew_fire/smoke_alarms_prev14d,skew_ge_alarms_prev14d,skew_power_alarms_prev14d,skew_temperature_alarms_prev14d,kurt_equipment_alarms_prev14d,kurt_fire/smoke_alarms_prev14d,kurt_ge_alarms_prev14d,kurt_power_alarms_prev14d,kurt_temperature_alarms_prev14d
0,146,2019-04-10,1,0,0,0,0,3.0,0,0,0,0,0,0,0,0,1,0,0.0,0.0,0,10.29,14.0,6.0,12.00,14.0,9.0,1.33,8.5,0.0,3.10,8.5,0.3,62.71,81.0,45.0,70.67,81.0,58.0,11.43,16.0,5.0,8.00,12.0,5.0,1013.00,1022.0,1006.0,1008.00,1010.0,1006.0,8.00,12.0,5.0,6.14,9.0,5.0,4.79,19.6,0.1,6.41,19.6,0.2,74.29,89.0,58.0,80.00,89.0,63.0,12.86,17.0,5.0,13.86,17.0,10.0,1020.71,1028.0,1009.0,1016.14,1020.0,1009.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.212308,-1.212308,-1.212308,-1.212308,-1.212308
1,146,2019-04-11,1,0,0,0,0,3.0,0,0,0,0,0,0,0,0,1,0,0.0,0.0,0,11.71,16.0,9.0,13.00,16.0,9.0,1.90,8.5,0.0,4.27,8.5,0.3,66.43,81.0,51.0,75.00,81.0,71.0,11.57,16.0,5.0,8.00,12.0,5.0,1010.86,1017.0,1006.0,1007.00,1008.0,1006.0,8.50,16.0,5.0,5.86,8.0,5.0,3.52,12.5,0.1,4.63,12.5,0.2,72.93,89.0,58.0,79.86,89.0,63.0,12.57,17.0,5.0,13.29,17.0,10.0,1021.71,1028.0,1013.0,1018.71,1027.0,1013.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.212308,-1.212308,-1.212308,-1.212308,-1.212308
2,146,2019-04-12,1,0,0,0,0,3.0,0,0,0,0,0,0,0,0,1,0,0.0,0.0,0,11.57,16.0,9.0,13.00,16.0,9.0,4.70,19.6,0.0,7.97,19.6,0.3,71.71,88.0,58.0,77.33,88.0,71.0,11.71,16.0,5.0,10.67,15.0,5.0,1009.71,1015.0,1006.0,1007.33,1009.0,1006.0,9.36,20.0,5.0,6.29,11.0,5.0,3.63,12.5,0.1,5.61,12.5,0.2,72.00,89.0,58.0,78.43,89.0,63.0,12.29,17.0,5.0,13.57,17.0,10.0,1022.14,1028.0,1016.0,1020.57,1027.0,1016.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.212308,-1.212308,-1.212308,-1.212308,-1.212308
3,146,2019-04-13,1,0,0,0,0,3.0,0,0,0,0,0,0,0,0,1,0,0.0,0.0,0,11.29,16.0,8.0,11.00,16.0,8.0,4.77,19.6,0.0,8.03,19.6,0.5,74.29,88.0,58.0,80.33,88.0,71.0,11.43,16.0,5.0,13.33,15.0,12.0,1009.43,1013.0,1006.0,1009.67,1013.0,1007.0,10.57,22.0,5.0,6.86,11.0,5.0,3.40,12.5,0.1,5.27,12.5,0.2,70.07,89.0,58.0,75.14,89.0,63.0,12.14,17.0,5.0,13.14,17.0,9.0,1021.86,1028.0,1012.0,1022.29,1028.0,1017.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.212308,-1.212308,-1.212308,-1.212308,-1.212308
4,146,2019-04-14,1,0,0,0,0,3.0,0,0,0,0,0,0,0,0,1,0,0.0,0.0,0,10.57,16.0,5.0,7.33,9.0,5.0,5.39,19.6,0.3,8.13,19.6,0.5,77.29,88.0,58.0,86.00,88.0,82.0,10.86,15.0,5.0,13.33,15.0,12.0,1009.86,1016.0,1006.0,1012.67,1016.0,1009.0,11.50,22.0,5.0,7.86,12.0,5.0,3.49,12.5,0.1,5.23,12.5,0.2,69.07,89.0,58.0,72.86,89.0,63.0,11.64,17.0,5.0,11.71,17.0,5.0,1021.29,1028.0,1009.0,1023.00,1028.0,1018.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.212308,-1.212308,-1.212308,-1.212308,-1.212308
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
621295,1251,2020-01-30,1,0,0,0,0,7.0,0,0,0,0,0,0,0,0,0,1,0.0,0.0,0,4.00,7.0,1.0,5.33,6.0,5.0,3.57,18.2,0.0,8.33,18.2,1.2,77.71,94.0,60.0,86.00,94.0,81.0,8.14,15.0,5.0,11.00,15.0,6.0,1021.57,1034.0,1010.0,1013.00,1018.0,1010.0,5.71,11.0,1.0,6.71,11.0,2.0,6.04,29.1,0.0,8.37,29.1,0.0,71.50,87.0,45.0,76.43,87.0,53.0,16.36,42.0,5.0,20.29,42.0,11.0,1016.86,1031.0,1002.0,1012.29,1022.0,1002.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.212308,-1.212308,-1.212308,-1.210000,-1.212308
621296,1251,2020-01-31,1,0,0,0,0,7.0,0,0,0,0,0,0,0,0,0,1,0.0,0.0,0,4.29,7.0,1.0,4.67,6.0,3.0,3.89,18.2,0.0,3.00,5.6,1.2,77.86,94.0,60.0,82.33,83.0,81.0,8.57,15.0,5.0,12.33,15.0,10.0,1019.29,1030.0,1010.0,1013.00,1018.0,1010.0,5.79,11.0,1.0,6.57,11.0,2.0,6.68,29.1,0.0,8.21,29.1,0.0,72.07,87.0,45.0,76.57,87.0,53.0,15.93,42.0,5.0,19.57,42.0,10.0,1017.00,1031.0,1002.0,1013.57,1024.0,1002.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.212308,-1.212308,-1.212308,-1.210000,-1.212308
621297,1251,2020-02-01,1,0,0,0,0,7.0,0,0,0,0,0,0,0,0,0,1,0.0,0.0,0,4.71,7.0,3.0,4.00,5.0,3.0,4.04,18.2,0.0,2.97,5.6,1.1,79.14,94.0,60.0,79.67,83.0,75.0,9.86,15.0,5.0,12.33,15.0,10.0,1017.14,1025.0,1010.0,1014.33,1018.0,1010.0,5.64,11.0,1.0,5.43,11.0,1.0,6.68,29.1,0.0,8.21,29.1,0.0,71.57,87.0,45.0,72.14,87.0,53.0,15.79,42.0,5.0,18.71,42.0,5.0,1017.86,1031.0,1002.0,1015.71,1031.0,1002.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.212308,-1.212308,-1.212308,-1.210000,-1.212308
621298,1251,2020-02-02,1,0,0,0,0,7.0,0,0,0,0,0,0,0,0,0,1,0.0,0.0,0,5.57,9.0,3.0,5.33,9.0,3.0,4.04,18.2,0.0,1.10,2.2,0.0,82.86,94.0,75.0,81.33,86.0,75.0,10.71,15.0,6.0,12.00,15.0,10.0,1015.86,1023.0,1010.0,1016.33,1018.0,1015.0,5.43,9.0,1.0,4.29,9.0,1.0,6.04,29.1,0.0,6.94,29.1,0.0,71.21,87.0,45.0,67.57,87.0,45.0,15.29,42.0,5.0,17.57,42.0,5.0,1019.00,1031.0,1002.0,1017.86,1031.0,1002.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.212308,-1.212308,-1.212308,-1.210000,-1.212308
