# Efficient determination of zero-crossings in noisy real-life time series
## Advanced Data Science Capstone Project
### Model training.
In this notebook, the compiled models are trained. The functions presented here are called each time new data are received and the models need to be re-fitted.

First, all necessary libraries are imported.



In [None]:
# Set the path to [Zero_crossings_in_time_series]_import_libraries_python.ipynb here.
try:
  %run /content/[Zero_crossings_in_time_series]_import_libraries_python.ipynb
except Exception:
  pass  # continue even if the helper notebook could not be run

Then, the model compiled in the previous step is fitted. If the simple linear regression model is used (`model_context=0`), only the features and the values of the objective function are needed, given as two numpy arrays `data1` and `data2`, respectively. If the Spark models are used (`model_context=1`), all the data must be given in a single dataframe `data1`; `data2` is not used in this case, so, e.g., `data2=None` can be passed as the input parameter to this function. Finally, if DNNs are used (any other value of `model_context`), `data1` and `data2` are again two numpy arrays of the respective dimensions. The network is trained for at most 500 epochs with batches of size N/10, where N is the number of samples in the training set. To avoid overfitting, EarlyStopping halts the training when the training loss has not improved for 5 consecutive epochs.

In [None]:
#############################################
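def fit_model(model, data1, data2, model_context = 0):
    if model_context==0:
      # Scikit-learn estimator: fitted in place on features and targets.
      model.fit(data1, data2)
    elif model_context==1:
      # Spark: fit() returns the fitted model; data2 is ignored.
      model = model.fit(data1)
    else: 
      # DNN: at most 500 epochs, batch size N/10 (but never below 1 sample).
      n_epochs = 500
      batch_size = max(1, data1.shape[0]//10)
      es = EarlyStopping(monitor='loss', mode='min', verbose=0, patience=5)
      model.fit(data1, data2, epochs=n_epochs, batch_size=batch_size, verbose=0, callbacks=[es])
    return model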
#############################################
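The early-stopping rule described above can be illustrated without Keras. The following is a minimal sketch of the patience logic only, not Keras's actual implementation; the function name `stopping_epoch` and the loss values are hypothetical and chosen purely for illustration.

```python
def stopping_epoch(losses, patience=5):
    """Return the 0-based epoch at which training would halt, or None.

    Simplified model of the patience rule: training halts once the loss
    has failed to drop below its best value for `patience` epochs in a row.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(losses):
        if loss < best:
            # Improvement: remember the new best loss and reset the counter.
            best = loss
            wait = 0
        else:
            # No improvement over the best loss seen so far.
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Hypothetical per-epoch training losses.
losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.70, 0.73, 0.74, 0.60]
print(stopping_epoch(losses))  # → 7
```

In this example the best loss (0.7) is reached at epoch 2, the next five epochs do not improve on it, and training halts at epoch 7, so the later value 0.60 is never reached. This is the trade-off that the `patience` parameter controls.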


The `predict_model` function returns the values predicted at the time steps given in the `data` input parameter. If `model_context=1`, i.e., Spark is used, `data` is a Spark dataframe of the polynomial features produced at the previous stages. Otherwise, `data` is a numpy array of the polynomial features produced at the previous stages.

In [None]:
#############################################
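def predict_model(model,data,model_context = 0):
  if model_context==1:
    # Spark: transform() adds a `prediction` column, which is collected into a Python list.
    df_predicted = model.transform(data)
    df_predicted.createOrReplaceTempView("df_predicted")
    y_predicted = spark.sql("select * from df_predicted").rdd.map(lambda row: row.prediction).collect()
  else: 
    # Scikit-learn or DNN model: predict() is called directly on the numpy array.
    y_predicted = model.predict(data)
  return y_predicted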
#############################################


`find_degrees` is probably the most complicated part of this step. It returns the optimal polynomial degree for the selected model, to be used in the feature engineering. If `model_context` is different from 1, `data1` and `data2` are two numpy arrays containing the time vector $t$ and the observed values $g(t)$, respectively. If `model_context = 1`, `data1` is a Spark dataframe and `data2` can be any other object, e.g., `None`. `N_train` and `N_predict` are the numbers of samples used for training and evaluation, respectively (`N_train+N_predict` should be equal to the number of samples in `data1`).

For each degree from 1 to 30, the respective model is compiled and fitted using the first `N_train` values from `data1` (and from `data2`, where applicable). The last `N_predict` values are then predicted by the fitted model, and the R2 value on them is calculated with Scikit-learn's `r2_score`. The degree that maximizes the R2 value is selected.
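
For reference, the R2 value (coefficient of determination) used as the selection criterion can be written as follows. The notation is introduced here only for illustration: $y_i$ are the observed values and $\hat{y}_i$ the predicted values at the last `N_predict` time steps, and $\bar{y}$ is the mean of those observed values.

$$R^2 = 1 - \frac{\sum_{i}\left(y_i - \hat{y}_i\right)^2}{\sum_{i}\left(y_i - \bar{y}\right)^2}$$

A perfect prediction gives $R^2 = 1$, while a model that predicts worse than the constant $\bar{y}$ gives a negative value, so maximizing $R^2$ over the degrees favors the model with the smallest squared error on the evaluation points.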

In [None]:
#############################################
def find_degrees(data1,data2,N_train, N_predict, model_context):
    max_degree = 30  # degrees 1..30 are tested
    R2_values_test = np.zeros(max_degree)
    # Observed values g(t): column `x` of the Spark dataframe, or data2 otherwise.
    if model_context==1:
      data1.createOrReplaceTempView("data1")
      y_real = np.array(spark.sql("select * from data1").rdd.map(lambda row: row.x).collect())
    else:
      y_real = data2
    for deg in range(1,max_degree+1):
      # Only the DNN (model_context=2) receives the degree as its input shape.
      if model_context==2:
        inp_shape = deg
      else:
        inp_shape=0
      train_t,predict_t = prepare_features(data1,N_train,N_predict,deg,model_context)
      model = compile_model(model_context,inp_shape)
      if model_context==1:
        model = fit_model(model,train_t,None,model_context)  
      else:
        model = fit_model(model,train_t,data2[0:N_train],model_context)
      y_predicted = predict_model(model,predict_t,model_context)
      # Flatten the predictions to a 1-D array before scoring.
      y_predicted=np.array(y_predicted).reshape(len(y_predicted))
      # Score on the last N_predict samples, which were not used for fitting.
      R2_values_test[deg-1] = r2_score(y_real[N_train:N_train+N_predict], y_predicted)
    # argmax is 0-based, whereas the degrees start at 1.
    degrees = np.argmax(R2_values_test)+1 
    #print(R2_values_test)
    print('Optimal degree number is: '+str(degrees)+'\nR2-test with '+str(degrees)+' degrees is '+str(R2_values_test[degrees-1]))
    return degrees
#############################################
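
The selection loop can be tried in isolation with a small stand-in. The sketch below is not the notebook's pipeline: it replaces `prepare_features`, `compile_model`, `fit_model` and `predict_model` with NumPy's `np.polyfit` and `np.polyval`, computes R2 by hand, and limits the search to 5 degrees. The names `r2`, `select_degree` and the noise-free quadratic test signal are assumptions made for illustration.

```python
import numpy as np


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot


def select_degree(t, g, n_train, n_predict, max_degree=5):
    """Fit on the first n_train samples, score on the next n_predict samples."""
    scores = np.empty(max_degree)
    for deg in range(1, max_degree + 1):
        # Stand-in for the compile/fit steps: ordinary least-squares polynomial.
        coeffs = np.polyfit(t[:n_train], g[:n_train], deg)
        # Stand-in for the predict step on the held-out tail of the series.
        g_hat = np.polyval(coeffs, t[n_train:n_train + n_predict])
        scores[deg - 1] = r2(g[n_train:n_train + n_predict], g_hat)
    # argmax is 0-based, whereas the degrees start at 1.
    return int(np.argmax(scores)) + 1, scores


# Hypothetical noise-free quadratic signal sampled at 60 time steps.
t = np.linspace(0.0, 1.0, 60)
g = 1.0 - 3.0 * t + 2.0 * t ** 2
best_degree, scores = select_degree(t, g, n_train=50, n_predict=10)
print(best_degree, scores)
```

For this signal a straight line extrapolates poorly to the last 10 samples, whereas any degree of at least 2 can reproduce the curve, so the R2 value of degree 1 is clearly lower than the others. With noisy real-life data the differences between degrees are less clear-cut, which is why the evaluation is done on samples that were not used for fitting.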