# Efficient determination of zero-crossings in noisy real-life time series
## Advanced Data Science Capstone Project
### Feature creation.
In this notebook, the features are created from the array of the time variable $t$. The functions presented here are called during the simulation each time the machine learning or deep learning model is re-fitted.



First, let us import all necessary libraries. 

In [None]:
#Here, the path to the file [Zero_crossings_in_time_series]_import_libraries_python.ipynb should be indicated.


Next, let us prepare the features. Here, `data` is either a NumPy vector or a Spark dataframe, depending on the `model_context` variable: if `model_context=1`, Spark dataframes are used; otherwise, standard NumPy arrays are used.
`N_train` and `N_predict` are the numbers of observations in the training and prediction sets, respectively, while `degrees` is the maximum degree of the polynomial features. The function returns two sets of observations (for training and prediction, respectively) in the same format as the input `data`.

First, the whole input is transformed into polynomial features using PySpark or Scikit-learn. Then, the polynomial features are scaled into the interval $[0,1]$. This normalization is required because otherwise the features of higher and lower degrees would differ by many orders of magnitude and so have a very different impact on the result: e.g., if `degrees = 10` and $t_1 = 10$, then $t_1^{10} = 10^{10} = 10000000000 \gg t_1^1 = 10$. On the other hand, if $t_2 = 0.1$, then $t_2^{10} = 10^{-10} = 0.0000000001 \ll t_2^1 = 0.1$.
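
To make the difference in magnitude concrete, here is a minimal sketch in plain Python. It deliberately does not use the Scikit-learn or PySpark calls from `prepare_features`; the helper names `poly_features` and `min_max_scale` and the sample values are illustrative assumptions, not part of the project code.

```python
# Hypothetical helpers, written in plain Python so the arithmetic is easy to follow.

def poly_features(ts, degrees):
    """Return one row [t, t**2, ..., t**degrees] per time value (no bias column)."""
    return [[t ** d for d in range(1, degrees + 1)] for t in ts]


def min_max_scale(rows):
    """Scale every column to [0, 1] using (v - min) / (max - min)."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    # Assumption: a constant column is mapped to 0.0 to avoid division by zero.
    return [
        [(v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(row, lo, hi)]
        for row in rows
    ]


ts = [0.1, 1.0, 10.0]
raw = poly_features(ts, degrees=10)
scaled = min_max_scale(raw)

# Raw features: degree 1 versus degree 10 for the largest and smallest t.
print("t = 10 :", raw[-1][0], raw[-1][-1])
print("t = 0.1:", raw[0][0], raw[0][-1])

# Scaled features: every column now spans the same interval.
for column in zip(*scaled):
    print(min(column), max(column))
```

For `t = 10` the raw degree-10 feature is nine orders of magnitude larger than the degree-1 feature, whereas after scaling every column spans the same interval $[0,1]$.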

In [1]:
def prepare_features(data, N_train, N_predict, degrees, model_context=0):
  if model_context == 1:
    #Spark branch: expand the vector column "t" into polynomial terms
    polyExpansion = PolynomialExpansion(degree=degrees, inputCol="t", outputCol="t_poly")
    polyDF = polyExpansion.transform(data)
    #Scale the polynomial terms and keep only the scaled "features" column
    scaledDF = sparkScaler(inputCol="t_poly", outputCol="features").fit(polyDF).transform(polyDF)
    scaledDF = scaledDF.drop('t_poly')
    scaledDF.createOrReplaceTempView("scaledDF")
    #Training set: the first N_train rows (limit without "order by" relies on the existing row order)
    train_t = spark.sql("select * from scaledDF limit "+str(N_train))
    #Prediction set: the N_predict rows with the largest t, returned in ascending order of t
    predict_t = spark.sql("select * from (select * from scaledDF order by t desc limit "+str(N_predict)+") order by t")
  else:
    #NumPy branch (any other model_context value): the same two steps with Scikit-learn.
    #The original code repeated these two lines for model_context==0 and for all other values.
    t_poly = PolynomialFeatures(degrees, include_bias=False).fit_transform(data[:, np.newaxis])
    t_scaled = MinMaxScaler().fit_transform(t_poly)
    #t_scaled[:,0] = np.ones(t_scaled.shape[0])
    #Training set: the first N_train rows; prediction set: the next N_predict rows
    train_t = t_scaled[0:N_train]
    predict_t = t_scaled[N_train:N_train+N_predict]
  return train_t, predict_t

The following cell shows an example of the feature engineering result.

In [None]:
print("Just an example of the features engineering result")
#Create any time array and define the sizes of the training and prediction sets, N_1 and N_2:
t = np.arange(0.1,1.01,0.2)
N = len(t)
N_1 = int(N*0.6)
N_2 = N-N_1
print(N,N_1,N_2)
#Just an example using degrees = 2:
degrees = 2
#Prepare the features using simple, not Spark, data:
train_t,predict_t = prepare_features(t,N_1,N_2,degrees,0)
print(train_t)
print(predict_t)

#Prepare the features for Spark data:
#First, transform the data into a Spark dataframe 
#(in practice, it is done automatically by the provide_new_data procedure):
t_list = []
y_list = []
for t_val in t:
  t_list.append(Vectors.dense(t_val))
  y_list.append(np.nan)
#The values of the dependent variable x are set to NaN just for simplicity,
#since they are not required at this step (they are set automatically by
#the provide_new_data procedure):
t_df = spark.createDataFrame(sc.parallelize(zip(t_list,y_list)),["t","x"])
train_t,predict_t = prepare_features(t_df,N_1,N_2,degrees,1)
train_t.show()
predict_t.show()
train_t.createOrReplaceTempView("train_t")
predict_t.createOrReplaceTempView("predict_t")
display(spark.sql("select features from train_t").rdd.map(lambda row: row.features.toArray().tolist()).collect())
display(spark.sql("select features from predict_t").rdd.map(lambda row: row.features.toArray().tolist()).collect())

Just an example of the features engineering result
5 3 2
[[0.   0.  ]
 [0.25 0.1 ]
 [0.5  0.3 ]]
[[0.75 0.6 ]
 [1.   1.  ]]
+--------------------+---+--------------------+
|                   t|  x|            features|
+--------------------+---+--------------------+
|               [0.1]|NaN|           (2,[],[])|
|[0.30000000000000...|NaN|[0.25,0.099999999...|
|[0.5000000000000001]|NaN|[0.50000000000000...|
+--------------------+---+--------------------+

+--------------------+---+--------------------+
|                   t|  x|            features|
+--------------------+---+--------------------+
|[0.7000000000000001]|NaN|[0.75,0.599999999...|
|[0.9000000000000001]|NaN|           [1.0,1.0]|
+--------------------+---+--------------------+



[[0.0, 0.0],
 [0.25, 0.09999999999999998],
 [0.5000000000000001, 0.30000000000000004]]

[[0.75, 0.5999999999999999], [1.0, 1.0]]
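
The numbers above can be checked by hand with the min-max formula $(v - v_{min})/(v_{max} - v_{min})$ applied to each power of $t$ separately. The following is a minimal sketch under the assumption that the time values are exactly `0.1, 0.3, 0.5, 0.7, 0.9` (the array produced by `np.arange(0.1,1.01,0.2)`, up to floating-point error); it uses plain Python rather than the project functions.

```python
# Hand check of the scaled features for degrees = 2 (illustrative, not project code).
ts = [0.1, 0.3, 0.5, 0.7, 0.9]
degrees = 2

scaled = []
for t in ts:
    row = []
    for d in range(1, degrees + 1):
        # t**d is increasing for t > 0, so the column minimum and maximum
        # are attained at the smallest and largest t.
        lo, hi = min(ts) ** d, max(ts) ** d
        row.append((t ** d - lo) / (hi - lo))
    scaled.append(row)

# Rounded only for display, to hide floating-point noise.
for row in scaled:
    print([round(v, 10) for v in row])
```

For instance, for $t = 0.3$ the first feature is $(0.3 - 0.1)/(0.9 - 0.1) = 0.25$ and the second is $(0.09 - 0.01)/(0.81 - 0.01) = 0.1$, which agrees with the second row of the training set up to floating-point rounding.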