**Timeseries kinds and applications**
___
- data that changes over time
    - e.g., atmospheric changes, demographic information, financial data, voice wave forms
    - consists of datapoints and a timestamp for each datapoint
- in machine learning, the way data changes over time often reveals useful patterns
- a machine learning pipeline
    - feature extraction
    - model fitting
    - prediction and validation
___

In [None]:
#Plotting a time series (I)

#In this exercise, you'll practice plotting the values of two time
#series without the time component.

#Two DataFrames, data and data2 are available in your workspace.

#Unless otherwise noted, assume that all required packages are loaded
#with their common aliases throughout this course.

#Note: This course assumes some familiarity with time series data,
#as well as how to use them in data analytics pipelines. For an
#introduction to time series, we recommend the Introduction to Time
#Series Analysis in Python and Visualizing Time Series Data with Python
#courses.

# Print the first 5 rows of data
#print(data.head())

#################################################
#<script.py> output:
#    symbol  data_values
#    0        214.009998
#    1        214.379993
#    2        210.969995
#    3        210.580000
#    4        211.980005
#################################################

# Print the first 5 rows of data2
#print(data2.head())

#################################################
#<script.py> output:
#       data_values
#    0    -0.006928
#    1    -0.007929
#    2    -0.008900
#    3    -0.009815
#    4    -0.010653
#################################################

# Plot the time series in each dataset
#fig, axs = plt.subplots(2, 1, figsize=(5, 10))
#data.iloc[:1000].plot(y="data_values", ax=axs[0])
#data2.iloc[:1000].plot(y="data_values", ax=axs[1])
#plt.show()

![_images/15.1.svg](_images/15.1.svg)

In [None]:
#Plotting a time series (II)

#You'll now plot both the datasets again, but with the included time
#stamps for each (stored in the column called "time"). Let's see if
#this gives you some more context for understanding each time series
#data.

# Plot the time series in each dataset
#fig, axs = plt.subplots(2, 1, figsize=(5, 10))
#data.iloc[:1000].plot(x="time", y="data_values", ax=axs[0])
#data2.iloc[:1000].plot(x="time", y="data_values", ax=axs[1])
#plt.show()

![_images/15.2.svg](_images/15.2.svg)
As you can now see, each time series has a very different sampling
frequency (the number of samples per unit of time). The first is daily
stock market data, and the second is an audio waveform.
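
The sampling frequency can also be estimated from the timestamps themselves rather than read off a plot. The sketch below is a minimal illustration on synthetic timestamps; the helper `estimate_sampling_frequency` and the example values (22050 Hz audio, one sample per day) are assumptions for illustration and are not part of the course workspace.

```python
import numpy as np

def estimate_sampling_frequency(timestamps_s):
    """Estimate samples per second from timestamps given in seconds."""
    timestamps_s = np.asarray(timestamps_s, dtype=float)
    # The median spacing is robust to an occasional missing sample
    median_period = np.median(np.diff(timestamps_s))
    return 1.0 / median_period

# Hypothetical audio-like series: 2205 samples at 22050 Hz
audio_times = np.arange(2205) / 22050
# Hypothetical daily series: one sample every 86400 seconds
daily_times = np.arange(10) * 86400.0

audio_sfreq = estimate_sampling_frequency(audio_times)
daily_sfreq = estimate_sampling_frequency(daily_times)
print(audio_sfreq, daily_sfreq)
```

The two estimates differ by roughly nine orders of magnitude, which is why the two plots look so different once the time axis is included.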

**Machine learning basics**
___
- always begin by looking at your data
- scikit-learn expects input data to be 2-dimensional
    - shape: (samples, features)
___
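
As a small illustration of the (samples, features) requirement, the sketch below reshapes 1-D arrays into the 2-D layout. The values are made up for the example; only NumPy is assumed.

```python
import numpy as np

# A single feature measured on five samples, stored as a 1-D array
feature = np.array([4.7, 4.5, 4.9, 4.0, 4.6])
print(feature.shape)

# reshape(-1, 1) turns it into one column: 5 samples, 1 feature
X = feature.reshape(-1, 1)
print(X.shape)

# Several 1-D features can be stacked as the columns of one 2-D array
second_feature = np.array([1.4, 1.5, 1.5, 1.3, 1.5])
X_two = np.column_stack([feature, second_feature])
print(X_two.shape)
```

The `-1` asks NumPy to infer the number of rows, so the same call works for any number of samples.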

In [None]:
#Fitting a simple model: classification

#In this exercise, you'll use the iris dataset (representing petal
#characteristics of a number of flowers) to practice using the
#scikit-learn API to fit a classification model. You can see a sample
#plot of the data below.

#Note: This course assumes some familiarity with Machine Learning
#and scikit-learn. For an introduction to scikit-learn, we recommend
#the Supervised Learning with Scikit-Learn and Preprocessing for
#Machine Learning in Python courses.

![_images/15.3.svg](_images/15.3.svg)

In [None]:
# Print the first 5 rows for inspection
#print(data.head())

#################################################
#<script.py> output:
#        sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
#    50                7.0               3.2                4.7               1.4
#    51                6.4               3.2                4.5               1.5
#    52                6.9               3.1                4.9               1.5
#    53                5.5               2.3                4.0               1.3
#    54                6.5               2.8                4.6               1.5
#
#        target
#    50       1
#    51       1
#    52       1
#    53       1
#    54       1
#################################################

#from sklearn.svm import LinearSVC

# Construct data for the model
#X = data[['petal length (cm)', 'petal width (cm)']]
#y = data[['target']]

# Fit the model
#model = LinearSVC()
#model.fit(X, y)

#################################################
#You've successfully fit a classifier to predict flower type!

In [None]:
#Predicting using a classification model

#Now that you have fit your classifier, let's use it to predict the
#type of flower (or class) for some newly-collected flowers.

#Information about petal width and length for several new flowers is
#stored in the variable targets. Using the classifier you fit, you'll
#predict the type of each flower.

# Create input array
#X_predict = targets[['petal length (cm)', 'petal width (cm)']]

# Predict with the model
#predictions = model.predict(X_predict)
#print(predictions)

# Visualize predictions and actual values
#plt.scatter(X_predict['petal length (cm)'], X_predict['petal width (cm)'],
#            c=predictions, cmap=plt.cm.coolwarm)
#plt.title("Predicted class values")
#plt.show()

#################################################
#<script.py> output:
#    [2 2 2 1 1 2 2 2 2 1 2 1 1 2 1 1 2 1 2 2]
#################################################
#Note that your predictions are all integers, each representing
#a datapoint's predicted class.

![_images/15.4.svg](_images/15.4.svg)

In [None]:
#Fitting a simple model: regression

#In this exercise, you'll practice fitting a regression model using
#data from the Boston housing market. A DataFrame called boston is
#available in your workspace. It contains many variables of data
#(stored as columns). Can you find a relationship between the
#following two variables?

#"AGE": proportion of owner-occupied units built prior to 1940
#"RM" : average number of rooms per dwelling

![_images/15.5.svg](_images/15.5.svg)

In [None]:
#from sklearn import linear_model

# Prepare input and output DataFrames
#X = boston[['AGE']]
#y = boston[['RM']]

# Fit the model
#model = linear_model.LinearRegression()
#model.fit(X, y)

#################################################
#In regression, the output of your model is an array of continuous
#numbers, not class identities.

In [None]:
#Predicting using a regression model

#Now that you've fit a model with the Boston housing data, let's see
#what predictions it generates on some new data. You can investigate
#the underlying relationship that the model has found between inputs
#and outputs by feeding in a range of numbers as inputs and seeing
#what the model predicts for each input.

#A 1-D array new_inputs consisting of 100 "new" values for "AGE"
#(proportion of owner-occupied units built prior to 1940) is
#available in your workspace along with the model you fit in the
#previous exercise.

# Generate predictions with the model using those inputs
#predictions = model.predict(new_inputs.reshape(-1, 1))

# Visualize the inputs and predicted values
#plt.scatter(new_inputs, predictions, color='r', s=3)
#plt.xlabel('inputs')
#plt.ylabel('predictions')
#plt.show()

![_images/15.6.svg](_images/15.6.svg)
Here the red line shows the relationship that your model found. As
the proportion of pre-1940s houses gets larger, the average number
of rooms gets slightly lower.
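
The direction of such a relationship can be read from the slope of the fitted line. The sketch below uses synthetic stand-ins for "AGE" and "RM" (not the Boston data) and swaps in `np.polyfit` for the model: a degree-1 fit is an ordinary least-squares line, which is what `LinearRegression` fits for a single input. The coefficients used to generate the data are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for "AGE" (0-100) and "RM"; not the Boston data
age = rng.uniform(0, 100, size=200)
rooms = 6.5 - 0.01 * age + rng.normal(scale=0.3, size=200)

# Degree-1 least-squares fit: rooms is modeled as slope * age + intercept
slope, intercept = np.polyfit(age, rooms, deg=1)

# Predictions over a range of inputs trace out the fitted line
new_inputs = np.linspace(0, 100, 100)
predictions = slope * new_inputs + intercept

print(f"slope = {slope:.4f}, intercept = {intercept:.2f}")
```

A negative slope corresponds to a line that falls from left to right. With scikit-learn, the same two numbers are available as `model.coef_` and `model.intercept_` after fitting.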

**Machine learning and time series data**
___
- using audio data of heart sounds to detect who has a heart condition
- using New York Stock Exchange data to detect patterns in historical records that allow us to predict the value of companies in the future
___

In [None]:
#Inspecting the classification data

#In these final exercises of this chapter, you'll explore the two
#datasets you'll use in this course.

#The first is a collection of heartbeat sounds. Hearts normally have
#a predictable sound pattern as they beat, but some disorders can
#cause the heart to beat abnormally. This dataset contains a training
#set with labels for each type of heartbeat, and a testing set with no
#labels. You'll use the testing set to validate your models.

#As you have labeled data, this dataset is ideal for classification.
#In fact, it was originally offered as a part of a public Kaggle
#competition. https://www.kaggle.com/kinguistics/heartbeat-sounds

#import librosa as lr
#from glob import glob

# List all the wav files in the folder
#audio_files = glob(data_dir + '/*.wav')

# Read in the first audio file, create the time array
#audio, sfreq = lr.load(audio_files[0])
#time = np.arange(0, len(audio)) / sfreq

# Plot audio over time
#fig, ax = plt.subplots()
#ax.plot(time, audio)
#ax.set(xlabel='Time (s)', ylabel='Sound Amplitude')
#plt.show()

![_images/15.7.svg](_images/15.7.svg)
There are several seconds of heartbeat sounds in here, though note
that most of this time is silence. A common procedure in machine
learning is to separate the datapoints where a lot is happening
from the ones where little is.
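
One simple way to make that separation is to threshold a smoothed version of the absolute amplitude. The sketch below is only an illustration on a synthetic signal; the sampling frequency, burst positions, window length, and threshold are all assumed values, not settings from the course.

```python
import numpy as np

sfreq = 1000  # assumed sampling frequency in Hz
time = np.arange(2 * sfreq) / sfreq  # two seconds of samples

# Synthetic signal: silence with two short 50 Hz bursts
audio = np.zeros_like(time)
for start in (0.3, 1.2):
    burst = (time >= start) & (time < start + 0.1)
    audio[burst] = np.sin(2 * np.pi * 50 * time[burst])

# Smooth the absolute amplitude over 50 ms so that zero crossings
# inside a burst still count as active
window = int(0.05 * sfreq)
envelope = np.convolve(np.abs(audio), np.ones(window) / window, mode="same")

threshold = 0.1  # hypothetical threshold, chosen for this synthetic signal
is_active = envelope > threshold
print(f"{is_active.mean():.0%} of samples are active")
```

The boolean mask `is_active` can then be used to index the signal, for example `audio[is_active]`, so that later steps only see the parts where something is happening.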

In [None]:
#Inspecting the regression data

#The next dataset contains information about company market value
#over several years. This is one of the most popular kinds
#of time series data used for regression. If you can model the value
#of a company as it changes over time, you can make predictions about
#where that company will be in the future. This dataset was also
#originally provided as part of a public Kaggle competition.
#https://www.kaggle.com/dgawlik/nyse

#In this exercise, you'll plot the time series for a number of
#companies to get an understanding of how they are (or aren't)
#related to one another.

# Read in the data
#data = pd.read_csv('prices.csv', index_col=0)

# Convert the index of the DataFrame to datetime
#data.index = pd.to_datetime(data.index)
#print(data.head())

# Loop through each column, plot its values over time
#fig, ax = plt.subplots()
#for column in data.columns:
#    data[column].plot(ax=ax, label=column)
#ax.legend()
#plt.show()

#################################################
#<script.py> output:
#                      AAPL  FB       NFLX          V        XOM
#    time
#    2010-01-04  214.009998 NaN  53.479999  88.139999  69.150002
#    2010-01-05  214.379993 NaN  51.510001  87.129997  69.419998
#    2010-01-06  210.969995 NaN  53.319999  85.959999  70.019997
#    2010-01-07  210.580000 NaN  52.400001  86.760002  69.800003
#    2010-01-08  211.980005 NaN  53.300002  87.000000  69.519997
#################################################

![_images/15.8.svg](_images/15.8.svg)
Note that each company's value is sometimes correlated with others,
and sometimes not. Also note there are a lot of 'jumps' in there -
what effect do you think these jumps would have on a predictive model?

**Classifying a time series**
___
- always visualize raw data before fitting models
- start with summary statistics
___

In [None]:
#Many repetitions of sounds

#In this exercise, you'll start with perhaps the simplest
#classification technique: averaging across dimensions of a dataset
#and visually inspecting the result.

#You'll use the heartbeat data described in the last chapter. Some
#recordings are normal heartbeat activity, while others are abnormal
#activity. Let's see if you can spot the difference.

#Two DataFrames, normal and abnormal, each with the shape of
#(n_times_points, n_audio_files) containing the audio for several
#heartbeats are available in your workspace. Also, the sampling
#frequency is loaded into a variable called sfreq. A convenience
#plotting function show_plot_and_make_titles() is also available in
#your workspace.

#fig, axs = plt.subplots(3, 2, figsize=(15, 7), sharex=True, sharey=True)

# Calculate the time array
#time = np.arange(normal.shape[0]) / sfreq

# Stack the normal/abnormal audio so you can loop and plot
#stacked_audio = np.hstack([normal, abnormal]).T

# Loop through each audio file / ax object and plot
# .T.ravel() transposes the array, then unravels it into a 1-D vector for looping
#for iaudio, ax in zip(stacked_audio, axs.T.ravel()):
#    ax.plot(time, iaudio)
#show_plot_and_make_titles()

![_images/15.9.svg](_images/15.9.svg)
As you can see, there is a lot of variability in the raw data. Let's
see if you can average out some of that noise to notice a difference.

In [None]:
#Invariance in time

#While you should always start by visualizing your raw data, this is
#often uninformative when it comes to discriminating between two
#classes of data points. Data is usually noisy or exhibits complex
#patterns that aren't discoverable by the naked eye.

#Another common technique to find simple differences between two
#sets of data is to average across multiple instances of the same
#class. This may remove noise and reveal underlying patterns (or,
#it may not).

#In this exercise, you'll average across many instances of each
#class of heartbeat sound.

#The two DataFrames (normal and abnormal) and the time array (time)
#from the previous exercise are available in your workspace.

# Average across the audio files of each DataFrame
#mean_normal = np.mean(normal, axis=1)
#mean_abnormal = np.mean(abnormal, axis=1)

# Plot each average over time
#fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 3), sharey=True)
#ax1.plot(time, mean_normal)
#ax1.set(title="Normal Data")
#ax2.plot(time, mean_abnormal)
#ax2.set(title="Abnormal Data")
#plt.show()

![_images/15.10.svg](_images/15.10.svg)
Do you see a noticeable difference between the two? Maybe, but it's
quite noisy. Let's see how you can dig into the data a bit further.
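
Why averaging can help, and when it does not, is easier to see on synthetic data. The sketch below assumes every recording contains the same pattern at the same moments in time plus independent noise; the pattern, noise level, and array sizes are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

n_times, n_files = 500, 50
time = np.linspace(0, 1, n_times)

# Hypothetical shared pattern buried in each noisy recording
pattern = np.sin(2 * np.pi * 3 * time)
noise = rng.normal(scale=2.0, size=(n_times, n_files))
recordings = pattern[:, np.newaxis] + noise

# Average across recordings (axis=1), as in the exercise above
mean_recording = recordings.mean(axis=1)

# Residual noise in one recording versus in the average
single_error = np.std(recordings[:, 0] - pattern)
mean_error = np.std(mean_recording - pattern)
print(f"single: {single_error:.2f}, averaged: {mean_error:.2f}")
```

The improvement depends on the pattern being aligned in time across recordings. Heartbeats in different files do not start at the same moment, which is one plausible reason the averaged heartbeat plots remain hard to interpret.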

In [None]:
#Build a classification model

#While eye-balling differences is a useful way to gain an intuition
#for the data, let's see if you can operationalize things with a
#model. In this exercise, you will use each repetition as a
#datapoint, and each moment in time as a feature to fit a classifier
#that attempts to predict abnormal vs. normal heartbeats using only
#the raw data.

#We've split the two DataFrames (normal and abnormal) into X_train,
#X_test, y_train, and y_test.

#from sklearn.svm import LinearSVC

# Initialize and fit the model
#model = LinearSVC()
#model.fit(X_train, y_train)

# Generate predictions and score them manually
#predictions = model.predict(X_test)
#print(sum(predictions == y_test.squeeze()) / len(y_test))

#################################################
#<script.py> output:
#    0.555555555556
#################################################
#Note that your predictions didn't do so well. That's because the
#features you're using as inputs to the model (raw data) aren't very
#good at differentiating classes. Next, you'll explore how to calculate
#some more complex features that may improve the results.

**Improving features for classification**
___
- The auditory envelope
    - smooth the data to calculate the auditory envelope
    - related to the amount of audio energy present at each moment in time
- smoothing over time
    - instead of averaging over all of time, we compute a local average within a sliding window
    - this is called smoothing your timeseries
    - it removes short-term noise, while retaining the general pattern
___
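
The two steps in the notes above, rectifying and then taking a local average, can be sketched on a synthetic signal as follows. The sampling frequency, tone, and loudness curve are assumptions made up for the example; the 50-sample window is simply one possible choice.

```python
import numpy as np
import pandas as pd

sfreq = 1000  # assumed sampling frequency in Hz
time = np.arange(sfreq) / sfreq

# Synthetic "beat": a 100 Hz tone whose loudness rises and falls once
loudness = np.exp(-((time - 0.5) ** 2) / (2 * 0.05 ** 2))
audio = pd.Series(loudness * np.sin(2 * np.pi * 100 * time), index=time)

# Rectify: positive and negative swings both count as energy
audio_rectified = audio.abs()

# Smooth: local average over a trailing 50-sample window
envelope = audio_rectified.rolling(50).mean()

# Time at which the envelope peaks (NaN values at the start are skipped)
peak_time = envelope.idxmax()
print(peak_time)
```

Because the default rolling window looks backwards, the envelope lags the true loudness by about half a window. Passing `center=True` to `rolling` removes that lag at the cost of missing values at both ends.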

In [None]:
#Calculating the envelope of sound
#One of the ways you can improve the features available to your
#model is to remove some of the noise present in the data. In audio
#data, a common way to do this is to rectify the data and then smooth
#it so that the total amount of sound energy over time is more
#distinguishable. You'll do this in the current exercise.

#A heartbeat file is available in the variable audio.

# Plot the raw data first
#audio.plot(figsize=(10, 5))
#plt.show()

![_images/15.11.svg](_images/15.11.svg)

In [None]:
# Rectify the audio signal
#audio_rectified = audio.apply(np.abs)

# Plot the result
#audio_rectified.plot(figsize=(10, 5))
#plt.show()

![_images/15.12.svg](_images/15.12.svg)

In [None]:
# Smooth by applying a rolling mean
#audio_rectified_smooth = audio_rectified.rolling(50).mean()

# Plot the result
#audio_rectified_smooth.plot(figsize=(10, 5))
#plt.show()

![_images/15.13.svg](_images/15.13.svg)
By calculating the envelope of each sound and smoothing it, you've
eliminated much of the noise and have a cleaner signal to tell you
when a heartbeat is happening.

In [None]:
#Calculating features from the envelope

#Now that you've removed some of the noisier fluctuations in the
#audio, let's see if this improves your ability to classify.

#audio_rectified_smooth from the previous exercise is available in
#your workspace.

# Calculate stats
#means = np.mean(audio_rectified_smooth, axis=0)
#stds = np.std(audio_rectified_smooth, axis=0)
#maxs = np.max(audio_rectified_smooth, axis=0)

# Create the X and y arrays
#X = np.column_stack([means, stds, maxs])
#y = labels.reshape([-1, 1])

# Fit the model and score on testing data
#from sklearn.model_selection import cross_val_score
#percent_score = cross_val_score(model, X, y, cv=5)
#print(np.mean(percent_score))

#################################################
#<script.py> output:
#    0.716666666667
#################################################
#This model is both simpler (only 3 features) and more understandable
#(features are simple summary statistics of the data).

In [None]:
#Derivative features: The tempogram

#One benefit of cleaning up your data is that it lets you compute
#more sophisticated features. For example, the envelope calculation
#you performed is a common technique in computing tempo and rhythm
#features. In this exercise, you'll use librosa to compute some
#tempo and rhythm features for heartbeat data, and fit a model once
#more.

#Note that librosa functions tend to only operate on numpy arrays
#instead of DataFrames, so we'll access our Pandas data as a Numpy
#array with the .values attribute.

# Calculate the tempo of the sounds
#tempos = []
#for col, i_audio in audio.items():
#    tempos.append(lr.beat.tempo(y=i_audio.values, sr=sfreq, hop_length=2**6, aggregate=None))

# Convert the list to an array so you can manipulate it more easily
#tempos = np.array(tempos)

# Calculate statistics of each tempo
#tempos_mean = tempos.mean(axis=-1)
#tempos_std = tempos.std(axis=-1)
#tempos_max = tempos.max(axis=-1)

# Create the X and y arrays
#X = np.column_stack([means, stds, maxs, tempos_mean, tempos_std, tempos_max])
#y = labels.reshape([-1, 1])

# Fit the model and score on testing data
#percent_score = cross_val_score(model, X, y, cv=5)
#print(np.mean(percent_score))

#################################################
#<script.py> output:
#    0.533333333333
#################################################
#Note that your predictive power may not have gone up (because this
#dataset is quite small), but you now have a richer feature
#representation of audio that your model can use!

**The spectrogram**
___
- Fourier transform
    - timeseries data can be described as a combination of quickly-changing and slowly-changing things
    - at each moment in time, we can describe the relative presence of fast- and slow-moving components
    - this converts a single timeseries into an array that describes the timeseries as a combination of oscillations
- the spectrogram is the squared magnitude of the short-time Fourier transform (STFT)
- spectral feature engineering
    - each timeseries has a different spectral pattern
    - we can calculate these spectral patterns by analyzing the spectrogram to describe where most of the energy is at each moment in time
        - **spectral bandwidth**
        - **spectral centroids**
___
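
To make the relationship between the short-time Fourier transform and the spectrogram concrete, the sketch below computes one by hand with NumPy. It is a simplified illustration, not a reproduction of librosa's `stft` (for instance, it does not pad the signal); the function name `spectrogram`, the Hann window, and the synthetic 256 Hz tone are assumptions chosen for the example.

```python
import numpy as np

def spectrogram(signal, n_fft=128, hop_length=16):
    """Squared magnitude of a short-time Fourier transform.

    Returns an array of shape (n_fft // 2 + 1, n_frames).
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop_length
    # Cut the signal into overlapping frames and taper each one
    frames = np.stack([
        signal[i * hop_length : i * hop_length + n_fft] * window
        for i in range(n_frames)
    ])
    # FFT of each frame, then transpose to (frequency, time)
    return np.abs(np.fft.rfft(frames, axis=1)).T ** 2

sfreq = 2048  # assumed sampling frequency in Hz
time = np.arange(sfreq) / sfreq
# Synthetic signal: a steady 256 Hz tone
signal = np.sin(2 * np.pi * 256 * time)

spec = spectrogram(signal)
freqs = np.fft.rfftfreq(128, d=1 / sfreq)
# Frequency bin that carries the most energy on average
peak_freq = freqs[spec.mean(axis=1).argmax()]
print(spec.shape, peak_freq)
```

Each column of `spec` describes one short stretch of time and each row one frequency, so a steady tone shows up as a horizontal band at its frequency.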

In [None]:
#Spectrograms of heartbeat audio

#Spectral engineering is one of the most common techniques in
#machine learning for time series data. The first step in this
#process is to calculate a spectrogram of sound. This describes what
#spectral content (e.g., low and high pitches) is present in the
#sound over time. In this exercise, you'll calculate a spectrogram
#of a heartbeat audio file.

#We've loaded a single heartbeat sound in the variable audio.

# Import the functions you'll use for the STFT
#from librosa.core import stft

# Prepare the STFT
#HOP_LENGTH = 2**4
#spec = stft(audio, hop_length=HOP_LENGTH, n_fft=2**7)

#from librosa.core import amplitude_to_db
#from librosa.display import specshow

# Convert the magnitudes into decibels (np.abs discards the phase of the complex STFT)
#spec_db = amplitude_to_db(np.abs(spec))

# Compare the raw audio to the spectrogram of the audio
#fig, axs = plt.subplots(2, 1, figsize=(10, 10), sharex=True)
#axs[0].plot(time, audio)
#specshow(spec_db, sr=sfreq, x_axis='time', y_axis='hz', hop_length=HOP_LENGTH)

![_images/15.14.svg](_images/15.14.svg)
Do you notice that the heartbeats come in pairs, as seen by the
vertical lines in the spectrogram?

In [None]:
#Engineering spectral features

#As you can probably tell, there is a lot more information in a
#spectrogram compared to a raw audio file. By computing the spectral
#features, you have a much better idea of what's going on. As such,
#there are all kinds of spectral features that you can compute using
#the spectrogram as a base. In this exercise, you'll look at a few
#of these features.

#The spectrogram spec from the previous exercise is available in your
#workspace.

#import librosa as lr

# Calculate the spectral centroid and bandwidth for the spectrogram
#bandwidths = lr.feature.spectral_bandwidth(S=spec)[0]
#centroids = lr.feature.spectral_centroid(S=spec)[0]

#from librosa.core import amplitude_to_db
#from librosa.display import specshow

# Convert spectrogram to decibels for visualization
#spec_db = amplitude_to_db(spec)

# Display these features on top of the spectrogram
#fig, ax = plt.subplots(figsize=(10, 5))
#ax = specshow(spec_db, x_axis='time', y_axis='hz', hop_length=HOP_LENGTH)
#ax.plot(times_spec, centroids)
#ax.fill_between(times_spec, centroids - bandwidths / 2, centroids + bandwidths / 2, alpha=.5)
#ax.set(ylim=[None, 6000])
#plt.show()

![_images/15.15.svg](_images/15.15.svg)
As you can see, the spectral centroid and bandwidth characterize the
spectral content in each sound over time. They give us a summary of
the spectral content that we can use in a classifier.
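
For intuition about what these two features measure, the sketch below computes them by hand as a weighted mean and a weighted standard deviation of frequency. It follows the usual definitions and is meant as an illustration, not as a reproduction of librosa's implementation; the function name and the tiny two-frame spectrogram are made up for the example.

```python
import numpy as np

def spectral_centroid_and_bandwidth(spec, freqs):
    """Per-frame centroid and bandwidth of a magnitude spectrogram.

    spec has shape (n_freqs, n_frames); freqs has shape (n_freqs,).
    Frames whose magnitudes sum to zero are not handled here.
    """
    # Normalize each frame so its magnitudes sum to 1
    weights = spec / spec.sum(axis=0, keepdims=True)
    # Centroid: magnitude-weighted mean frequency
    centroid = (freqs[:, np.newaxis] * weights).sum(axis=0)
    # Bandwidth: magnitude-weighted spread around the centroid
    deviation = freqs[:, np.newaxis] - centroid[np.newaxis, :]
    bandwidth = np.sqrt((weights * deviation ** 2).sum(axis=0))
    return centroid, bandwidth

freqs = np.array([0.0, 100.0, 200.0, 300.0, 400.0])
# Frame 0: all energy at 200 Hz. Frame 1: split between 100 and 300 Hz.
spec = np.array([
    [0.0, 0.0],
    [0.0, 1.0],
    [1.0, 0.0],
    [0.0, 1.0],
    [0.0, 0.0],
])

centroid, bandwidth = spectral_centroid_and_bandwidth(spec, freqs)
print(centroid)
print(bandwidth)
```

Both frames have the same centroid of 200 Hz, but only the second has a non-zero bandwidth (100 Hz), which shows why the two features carry different information.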

In [None]:
#Combining many features in a classifier

#You've spent this lesson engineering many features from the audio
#data - some contain information about how the audio changes in time,
#others contain information about the spectral content that is
#present.

#The beauty of machine learning is that it can handle all of these
#features at the same time. If there is different information present
#in each feature, it should improve the classifier's ability to
#distinguish the types of audio. Note that this often requires more
#advanced techniques such as regularization, which we'll cover in
#the next chapter.

#For the final exercise in the chapter, we've loaded many of the
#features that you calculated before. Combine all of them into an
#array that can be fed into the classifier, and see how it does.

# Loop through each spectrogram
#bandwidths = []
#centroids = []

#for spec in spectrograms:
#    # Calculate the mean spectral bandwidth
#    this_mean_bandwidth = np.mean(lr.feature.spectral_bandwidth(S=spec))
#    # Calculate the mean spectral centroid
#    this_mean_centroid = np.mean(lr.feature.spectral_centroid(S=spec))
#    # Collect the values
#    bandwidths.append(this_mean_bandwidth)
#    centroids.append(this_mean_centroid)

# Create the X and y arrays
#X = np.column_stack([means, stds, maxs, tempo_mean, tempo_max, tempo_std, bandwidths, centroids])
#y = labels.reshape([-1, 1])

# Fit the model and score on testing data
#percent_score = cross_val_score(model, X, y, cv=5)
#print(np.mean(percent_score))

#################################################
#<script.py> output:
#    0.483333333333
#################################################
#You calculated many different features of the audio, and combined
#them under the assumption that they provide independent
#information that can be used in classification. You may have noticed
#that the accuracy of your models varied a lot when using different
#sets of features. This chapter was focused on creating new "features"
#from raw data and not obtaining the best accuracy. To improve the
#accuracy, you want to find the right features that provide relevant
#information and also build models on much larger data.