**Timeseries kinds and applications**
___
- data that changes over time
    - e.g., atmospheric changes, demographic information, financial data, voice wave forms
    - consists of data points and a timestamp for each data point
- in machine learning, the way data changes over time can reveal useful patterns
- a machine learning pipeline
    - feature extraction
    - model fitting
    - prediction and validation
___
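
The three pipeline stages listed above can be sketched end to end on synthetic data. This is only a minimal illustration, not the course's own code: the noise signals, the two summary features, and the helper name `extract_features` are all assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Synthetic stand-in data: 200 short time series per class.
# Class 0 is quiet noise, class 1 is louder noise (an assumption for the demo).
n_per_class, n_timepoints = 200, 100
quiet = rng.normal(0, 1.0, size=(n_per_class, n_timepoints))
loud = rng.normal(0, 3.0, size=(n_per_class, n_timepoints))
signals = np.vstack([quiet, loud])
labels = np.array([0] * n_per_class + [1] * n_per_class)


def extract_features(series):
    """Hypothetical feature extractor: summarize each series with two numbers."""
    return np.column_stack([series.std(axis=1), np.abs(series).mean(axis=1)])


# 1. Feature extraction -> array of shape (samples, features)
X = extract_features(signals)

# 2. Model fitting on a training split
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=0, stratify=labels
)
model = LinearSVC()
model.fit(X_train, y_train)

# 3. Prediction and validation on held-out data
predictions = model.predict(X_test)
accuracy = (predictions == y_test).mean()
print(f"Held-out accuracy: {accuracy:.2f}")
```

Validating on a held-out split, rather than on the training data, is what makes the accuracy figure a fair estimate of how the model behaves on unseen series.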

In [None]:
#Plotting a time series (I)

#In this exercise, you'll practice plotting the values of two time
#series without the time component.

#Two DataFrames, data and data2, are available in your workspace.

#Unless otherwise noted, assume that all required packages are loaded
#with their common aliases throughout this course.

#Note: This course assumes some familiarity with time series data,
#as well as how to use them in data analytics pipelines. For an
#introduction to time series, we recommend the Introduction to Time
#Series Analysis in Python and Visualizing Time Series Data with Python
#courses.

# Print the first 5 rows of data
#print(data.head())

#################################################
#<script.py> output:
#    symbol  data_values
#    0        214.009998
#    1        214.379993
#    2        210.969995
#    3        210.580000
#    4        211.980005
#################################################

# Print the first 5 rows of data2
#print(data2.head())

#################################################
#<script.py> output:
#       data_values
#    0    -0.006928
#    1    -0.007929
#    2    -0.008900
#    3    -0.009815
#    4    -0.010653
#################################################

# Plot the time series in each dataset
#fig, axs = plt.subplots(2, 1, figsize=(5, 10))
#data.iloc[:1000].plot(y="data_values", ax=axs[0])
#data2.iloc[:1000].plot(y="data_values", ax=axs[1])
#plt.show()

![_images/15.1.svg](_images/15.1.svg)

In [None]:
#Plotting a time series (II)

#You'll now plot both datasets again, but with the included
#timestamps for each (stored in the column called "time"). Let's see if
#this gives you some more context for understanding each time
#series.

# Plot the time series in each dataset
#fig, axs = plt.subplots(2, 1, figsize=(5, 10))
#data.iloc[:1000].plot(x="time", y="data_values", ax=axs[0])
#data2.iloc[:1000].plot(x="time", y="data_values", ax=axs[1])
#plt.show()

![_images/15.2.svg](_images/15.2.svg)
As you can now see, each time series has a very different sampling
frequency (the number of samples per unit of time). The first is daily
stock market data, and the second is an audio waveform.
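
To put a number on that difference, the sampling frequency can be estimated from the timestamps themselves. The sketch below uses made-up timestamps, since the workspace DataFrames are not available here; the helper name `sampling_frequency_hz` and the 22050 Hz audio rate are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def sampling_frequency_hz(seconds):
    """Estimate samples per second from timestamps given in seconds."""
    intervals = np.diff(np.asarray(seconds, dtype=float))
    # The median ignores occasional longer gaps such as weekends.
    return 1.0 / np.median(intervals)


# Hypothetical daily stock timestamps (business days only).
stock_times = pd.date_range("2010-01-04", periods=6, freq="B")
stock_seconds = (stock_times - stock_times[0]).total_seconds()

# Hypothetical audio timestamps, assuming 22050 samples per second.
audio_seconds = np.arange(6) / 22050

stock_hz = sampling_frequency_hz(stock_seconds)
audio_hz = sampling_frequency_hz(audio_seconds)
print(f"stock: {stock_hz:.2e} Hz, audio: {audio_hz:.0f} Hz")
```

Under these assumptions the two series differ in sampling frequency by roughly nine orders of magnitude, which is why plotting them against time changes the picture so much.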

**Machine learning basics**
___
- always begin by looking at your data
- scikit-learn data needs to be 2 dimensional
    - (samples, features)
___
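
The `(samples, features)` requirement is easiest to see with array shapes. The following is a small sketch with NumPy; the variable names are chosen for the example, and the values are taken from the petal columns printed in the iris exercise.

```python
import numpy as np

# A single feature measured on four samples: shape (4,), which is 1-D.
petal_length = np.array([4.7, 4.5, 4.9, 4.0])

# reshape(-1, 1) turns it into one column: shape (4, 1) = (samples, features).
X_single = petal_length.reshape(-1, 1)

# Two features are stacked as columns: shape (4, 2).
petal_width = np.array([1.4, 1.5, 1.5, 1.3])
X_double = np.column_stack([petal_length, petal_width])

print(petal_length.shape, X_single.shape, X_double.shape)
# (4,) (4, 1) (4, 2)
```

Selecting columns from a DataFrame with a list, as in `data[['petal length (cm)']]`, produces the same 2-D layout without an explicit reshape.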

In [None]:
#Fitting a simple model: classification

#In this exercise, you'll use the iris dataset (representing petal
#characteristics of a number of flowers) to practice using the
#scikit-learn API to fit a classification model. You can see a sample
#plot of the data below.

#Note: This course assumes some familiarity with Machine Learning
#and scikit-learn. For an introduction to scikit-learn, we recommend
#the Supervised Learning with Scikit-Learn and Preprocessing for
#Machine Learning in Python courses.

![_images/15.3.svg](_images/15.3.svg)

In [None]:
# Print the first 5 rows for inspection
#print(data.head())

#################################################
#<script.py> output:
#        sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
#    50                7.0               3.2                4.7               1.4
#    51                6.4               3.2                4.5               1.5
#    52                6.9               3.1                4.9               1.5
#    53                5.5               2.3                4.0               1.3
#    54                6.5               2.8                4.6               1.5
#
#        target
#    50       1
#    51       1
#    52       1
#    53       1
#    54       1
#################################################

#from sklearn.svm import LinearSVC

# Construct data for the model
#X = data[['petal length (cm)', 'petal width (cm)']]
#y = data['target']  # 1-D labels; a single-column DataFrame triggers a DataConversionWarning

# Fit the model
#model = LinearSVC()
#model.fit(X, y)

#################################################
#You've successfully fit a classifier to predict flower type!

In [None]:
#Predicting using a classification model

#Now that you have fit your classifier, let's use it to predict the
#type of flower (or class) for some newly-collected flowers.

#Information about petal width and length for several new flowers is
#stored in the variable targets. Using the classifier you fit, you'll
#predict the type of each flower.

# Create input array
#X_predict = targets[['petal length (cm)', 'petal width (cm)']]

# Predict with the model
#predictions = model.predict(X_predict)
#print(predictions)

# Visualize the predicted class of each new flower
#plt.scatter(X_predict['petal length (cm)'], X_predict['petal width (cm)'],
#            c=predictions, cmap=plt.cm.coolwarm)
#plt.title("Predicted class values")
#plt.show()

#################################################
#<script.py> output:
#    [2 2 2 1 1 2 2 2 2 1 2 1 1 2 1 1 2 1 2 2]
#################################################
#Note that your predictions are all integers, each representing
#a datapoint's predicted class.

![_images/15.4.svg](_images/15.4.svg)
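
The two classification exercises rely on the workspace variables `data` and `targets`. A self-contained version might look like the sketch below. It assumes that `data` is the iris dataset restricted to classes 1 and 2 (suggested by the printed rows and predictions, but not confirmed), and the "new" flowers are invented values.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC

# Load iris as a DataFrame and keep classes 1 and 2 only
# (an assumption about how the workspace `data` was built).
iris = load_iris(as_frame=True).frame
data = iris[iris["target"] != 0]

features = ["petal length (cm)", "petal width (cm)"]
X = data[features]   # 2-D: (samples, features)
y = data["target"]   # 1-D labels

# A larger max_iter gives the solver more room to converge.
model = LinearSVC(max_iter=10000)
model.fit(X, y)

# Hypothetical newly collected flowers (values invented for the example).
targets = pd.DataFrame(
    {"petal length (cm)": [4.0, 6.0], "petal width (cm)": [1.2, 2.3]}
)
predictions = model.predict(targets[features])
print(predictions)
```

Keeping the same column names and order for fitting and predicting avoids silently feeding the model features in the wrong positions.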

In [None]:
#Fitting a simple model: regression

#In this exercise, you'll practice fitting a regression model using
#data from the Boston housing market. A DataFrame called boston is
#available in your workspace. It contains many variables of data
#(stored as columns). Can you find a relationship between the
#following two variables?

#"AGE": proportion of owner-occupied units built prior to 1940
#"RM" : average number of rooms per dwelling

![_images/15.5.svg](_images/15.5.svg)

In [None]:
#from sklearn import linear_model

# Prepare input and output DataFrames
#X = boston[['AGE']]
#y = boston[['RM']]

# Fit the model
#model = linear_model.LinearRegression()
#model.fit(X, y)

#################################################
#In regression, the model outputs an array of continuous values
#rather than class labels.

In [None]:
#Predicting using a regression model

#Now that you've fit a model with the Boston housing data, let's see
#what predictions it generates on some new data. You can investigate
#the underlying relationship that the model has found between inputs
#and outputs by feeding in a range of numbers as inputs and seeing
#what the model predicts for each input.

#A 1-D array new_inputs consisting of 100 "new" values for "AGE"
#(proportion of owner-occupied units built prior to 1940) is
#available in your workspace along with the model you fit in the
#previous exercise.

# Generate predictions with the model using those inputs
#predictions = model.predict(new_inputs.reshape(-1, 1))

# Visualize the inputs and predicted values
#plt.scatter(new_inputs, predictions, color='r', s=3)
#plt.xlabel('inputs')
#plt.ylabel('predictions')
#plt.show()

![_images/15.6.svg](_images/15.6.svg)
Here the red line shows the relationship that your model found. As
the proportion of pre-1940s houses gets larger, the average number
of rooms gets slightly lower.
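
Recent scikit-learn releases no longer ship the Boston housing dataset, so the sketch below reproduces the fit-then-predict workflow on synthetic data instead. The slope, intercept, and noise level are made up for illustration and are not estimates from the real data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-ins for the "AGE" and "RM" columns (values are invented).
age = rng.uniform(0, 100, size=200)
rooms = 6.5 - 0.005 * age + rng.normal(0, 0.1, size=200)

# scikit-learn expects X with shape (samples, features), hence the reshape.
model = LinearRegression()
model.fit(age.reshape(-1, 1), rooms)

# Feed in an evenly spaced range of inputs to trace out the fitted line.
new_inputs = np.linspace(0, 100, 100)
predictions = model.predict(new_inputs.reshape(-1, 1))

print(f"slope: {model.coef_[0]:.4f}, intercept: {model.intercept_:.2f}")
```

Because the model is linear, the predictions for an evenly spaced input range fall on a straight line whose slope is `model.coef_[0]`.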

**Machine learning and time series data**
___
- using audio data of heart sounds to detect who has a heart condition
- using New York Stock Exchange data to detect patterns in historical records that allow us to predict the value of companies in the future
___

In [None]:
#Inspecting the classification data

#In these final exercises of this chapter, you'll explore the two
#datasets you'll use in this course.

#The first is a collection of heartbeat sounds. Hearts normally have
#a predictable sound pattern as they beat, but some disorders can
#cause the heart to beat abnormally. This dataset contains a training
#set with labels for each type of heartbeat, and a testing set with no
#labels. You'll use the testing set to validate your models.

#As you have labeled data, this dataset is ideal for classification.
#In fact, it was originally offered as a part of a public Kaggle
#competition. https://www.kaggle.com/kinguistics/heartbeat-sounds

#import librosa as lr
#from glob import glob

# List all the wav files in the folder
#audio_files = glob(data_dir + '/*.wav')

# Read in the first audio file, create the time array
#audio, sfreq = lr.load(audio_files[0])
#time = np.arange(0, len(audio)) / sfreq

# Plot audio over time
#fig, ax = plt.subplots()
#ax.plot(time, audio)
#ax.set(xlabel='Time (s)', ylabel='Sound Amplitude')
#plt.show()

![_images/15.7.svg](_images/15.7.svg)
There are several seconds of heartbeat sounds here, though note
that most of this time is silence. A common procedure in machine
learning is to separate the segments where something is happening
from the ones where nothing is.
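
One simple way to perform that separation is to threshold a smoothed amplitude envelope. The sketch below runs on a synthetic signal rather than the heartbeat recordings; the sampling rate, window length, and threshold are assumptions that would need tuning on real audio.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "recording": 3 seconds of faint noise with two short bursts.
sfreq = 1000
time = np.arange(0, 3 * sfreq) / sfreq
audio = rng.normal(0, 0.01, size=time.size)
for start in (0.5, 1.5):
    burst = (time >= start) & (time < start + 0.2)
    audio[burst] += np.sin(2 * np.pi * 50 * time[burst])

# Amplitude envelope: rectify the signal, then smooth with a moving average.
window = 50
envelope = np.convolve(np.abs(audio), np.ones(window) / window, mode="same")

# Samples whose envelope exceeds the threshold are treated as "active".
threshold = 0.1
is_active = envelope > threshold
print(f"Active fraction: {is_active.mean():.2f}")
```

Rectifying before smoothing matters: averaging the raw waveform would cancel positive and negative swings and hide the bursts.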

In [None]:
#Inspecting the regression data

#The next dataset contains information about company market value
#over several years. This is one of the most popular kinds
#of time series data used for regression. If you can model the value
#of a company as it changes over time, you can make predictions about
#where that company will be in the future. This dataset was also
#originally provided as part of a public Kaggle competition.
#https://www.kaggle.com/dgawlik/nyse

#In this exercise, you'll plot the time series for a number of
#companies to get an understanding of how they are (or aren't)
#related to one another.

# Read in the data
#data = pd.read_csv('prices.csv', index_col=0)

# Convert the index of the DataFrame to datetime
#data.index = pd.to_datetime(data.index)
#print(data.head())

# Loop through each column, plot its values over time
#fig, ax = plt.subplots()
#for column in data.columns:
#    data[column].plot(ax=ax, label=column)
#ax.legend()
#plt.show()

#################################################
#<script.py> output:
#                      AAPL  FB       NFLX          V        XOM
#    time
#    2010-01-04  214.009998 NaN  53.479999  88.139999  69.150002
#    2010-01-05  214.379993 NaN  51.510001  87.129997  69.419998
#    2010-01-06  210.969995 NaN  53.319999  85.959999  70.019997
#    2010-01-07  210.580000 NaN  52.400001  86.760002  69.800003
#    2010-01-08  211.980005 NaN  53.300002  87.000000  69.519997
#################################################

![_images/15.8.svg](_images/15.8.svg)
Note that each company's value is sometimes correlated with others,
and sometimes not. Also note there are a lot of 'jumps' in there -
what effect do you think these jumps would have on a predictive model?
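
To make such jumps concrete, one option is to flag unusually large day-to-day percentage changes. The sketch below uses a synthetic price series with a single artificial drop (a stock split is one possible real-world cause); the column name, the 30% threshold, and the data are all hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic daily prices: a slow random walk around 100.
dates = pd.date_range("2010-01-04", periods=100, freq="B")
prices = 100 + np.cumsum(rng.normal(0, 0.5, size=100))

# Introduce one artificial jump: every price from day 50 on is divided by 7.
prices[50:] /= 7
series = pd.Series(prices, index=dates, name="HYPOTHETICAL")

# Day-to-day percentage change; the first value is NaN by construction.
returns = series.pct_change()

# Flag days where the price moved by more than 30% (an arbitrary threshold).
jumps = returns[returns.abs() > 0.3]
print(jumps)
```

A model trained on raw prices would see such a day as a huge change in value, so flagged points are usually inspected, and corrected or smoothed where appropriate, before fitting.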