# 1. Practical Machine Learning Tutorial with Python Introduction

What you will need for this tutorial series: install numpy, matplotlib, pandas, sklearn and their dependencies

pip install numpy

pip install scipy

pip install scikit-learn

pip install matplotlib

pip install pandas
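Once the installs finish, a quick check like the one below can confirm that everything is importable. This is only a minimal sketch: `missing_packages` is a hypothetical helper name, not part of any library, and note that scikit-learn is imported under the name `sklearn`.

```python
from importlib.util import find_spec

# Import names for the packages installed above
# (scikit-learn is imported as "sklearn").
REQUIRED = ["numpy", "scipy", "sklearn", "matplotlib", "pandas"]


def missing_packages(names):
    """Return the import names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Still need to install:", ", ".join(missing))
    else:
        print("All tutorial dependencies are importable.")
```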

Hello girls and guys, welcome to an in-depth and practical machine learning course.

The objective of this course is to give you a holistic understanding of machine learning, covering theory, application, and inner workings of supervised, unsupervised, and deep learning algorithms.

In this series, we'll be covering linear regression, K Nearest Neighbors, Support Vector Machines (SVM), flat clustering, hierarchical clustering, and neural networks.

For each major algorithm that we cover, we will discuss the high-level intuitions of the algorithm and how it is logically meant to work. Next, we'll apply the algorithms in code using real-world data sets along with a module, such as Scikit-Learn. Finally, we'll be diving into the inner workings of each of the algorithms by recreating them in code, from scratch, ourselves, including all of the math involved. This should give you a complete understanding of exactly how the algorithms work, how they can be tweaked, what their advantages are, and what their disadvantages are.

In order to follow along with the series, I suggest you have at the very least a basic understanding of Python. If you do not, I suggest you at least follow the Python 3 Basics tutorial until the module installation with pip tutorial. If you have a basic understanding of Python, and the willingness to learn/ask questions, you will be able to follow along here with no issues. Most of the machine learning algorithms are actually quite simple, since they need to be in order to scale to large datasets. Math involved is typically linear algebra, but I will do my best to still explain all of the math. If you are confused/lost/curious about anything, ask in the comments section on YouTube, the community here, or by emailing me. You will also need Scikit-Learn and Pandas installed, along with others that we'll grab along the way.

Machine learning was defined in 1959 by Arthur Samuel as the "field of study that gives computers the ability to learn without being explicitly programmed." This means imbuing machines with knowledge without hard-coding it. From what I have personally found, people outside the programming community mainly believe machine intelligence is hard-coded, completely unaware of the reality of the field. One of the largest challenges I had with machine learning was the abundance of material on the learning part. You can find formulas, charts, equations, and a bunch of theory on the topic of machine learning, but very little on the actual "machine" part, where you actually program the machine and run the algorithms on real data. This is mainly due to the history. In the 50s, machines were quite weak, and in very short supply, which remained very much the case for half a century. Machine learning was relegated to being mainly theoretical and rarely actually employed. The Support Vector Machine (SVM), for example, was created by Vladimir Vapnik in the Soviet Union in 1963, but largely went unnoticed until the 90s, when Vapnik was scooped out of the Soviet Union and brought to the United States by Bell Labs. The neural network was conceived in the 1940s, but computers at the time were nowhere near powerful enough to run them well, and have not been until relatively recent times.

The "idea" of machine learning has come in and out of favor a few times through history, each time leaving people thinking it was merely a fad. It is really only very recently that we've been able to put much of machine learning to any decent test. Nowadays, you can spin up and rent a $100,000 GPU cluster for a few dollars an hour, the stuff of PhD student dreams just 10 years ago. Machine learning got another uptick in the mid-2000s and has been on the rise ever since, also benefiting in general from Moore's Law. Beyond this, there are ample resources out there to help you on your journey with machine learning, like this tutorial. You can just do a Google search on the topic and find more than enough information to keep you busy for a while.

This is so much so that we now have modules and APIs at our disposal, and you can engage in machine learning very easily with almost no knowledge at all of how it works. With the defaults from Scikit-learn, you can get 90-95 percent accuracy on many tasks right out of the gate. Machine learning is a lot like a car: you do not need to know much about how it works in order to get an incredible amount of utility from it. If you want to push the limits on performance and efficiency, however, you need to dig in under the hood, which is more how this course is geared. If you are just looking for a quick tutorial for employing machine learning on data, I already have a simple classification example tutorial and a simple clustering (unsupervised machine learning) example that you can check out.

Despite the apparent age and maturity of machine learning, I would say there's no better time than now to learn it, since you can actually use it. Machines are quite powerful; the one you are working on can probably do most of this series quickly. Data is also very plentiful lately.

The first topic we'll be covering is Regression, which is where we'll pick up in the next tutorial. Make sure you have Python 3 installed, along with Pandas and Scikit-Learn.

# 2. Regression Intro and Data

Along with those tutorial-wide imports, we're also going to be making use of Quandl here, which you may need to separately install, with:

pip install quandl

I will note this again in the first part of the code: the Quandl module used to be imported with an upper-case Q, but is now imported with a lower-case q. In the video and sample code, it is upper-cased.

To begin, what is regression in terms of us using it with machine learning? The goal is to take continuous data, find the equation that best fits the data, and be able to forecast out a specific value. With simple linear regression, you are just simply doing this by creating a best fit line:

![image.png](attachment:image.png)

From here, we can use the equation of that line to forecast out into the future, where the 'date' is the x-axis, what the price will be.
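To make the best-fit-line idea concrete, here is a minimal from-scratch sketch using the standard least-squares formulas for slope and intercept. The data points and the `fit_line` helper are made up for illustration; they are not taken from the stock data used later.

```python
def fit_line(xs, ys):
    """Least-squares slope m and intercept b for the line y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = (how x and y vary together) / (how x varies on its own)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b


# Made-up data: day number on the x-axis, price on the y-axis.
days = [1, 2, 3, 4, 5]
prices = [10.0, 12.0, 13.0, 15.0, 16.0]

m, b = fit_line(days, prices)

# Forecasting just means plugging a future x into the line's equation.
forecast_day = 8
forecast_price = m * forecast_day + b
print(f"price = {m:.2f} * day + {b:.2f}")
print(f"forecast for day {forecast_day}: {forecast_price:.2f}")
```

The further the forecast day sits from the observed days, the more the result depends on the assumption that the trend stays linear.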

A popular use with regression is to predict stock prices. This is done because we are considering the fluidity of price over time, and attempting to forecast the next fluid price in the future using a continuous dataset.

Regression is a form of supervised machine learning, which is where the scientist teaches the machine by showing it features and then showing it what the correct answer is, over and over, to teach the machine. Once the machine is taught, the scientist will usually "test" the machine on some unseen data, where the scientist still knows what the correct answer is, but the machine doesn't. The machine's answers are compared to the known answers, and the machine's accuracy can be measured. If the accuracy is high enough, the scientist may consider actually employing the algorithm in the real world.
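The teach-then-test loop described above can be sketched in plain Python. This is an illustrative sketch only: the data is made up, the "machine" is a simple least-squares line, and accuracy is measured with the coefficient of determination (R squared), which is one common choice rather than the only one.

```python
def fit_line(xs, ys):
    """Train: least-squares slope and intercept from features xs and labels ys."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    return m, mean_y - m * mean_x


def r_squared(actual, predicted):
    """Score: 1.0 is a perfect match between predictions and known answers."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Made-up, roughly linear data: ten (feature, label) pairs.
xs = list(range(10))
ys = [1.1, 2.8, 5.0, 7.3, 8.9, 11.2, 12.7, 15.1, 17.0, 18.9]

# Show the machine the first 80% (features AND answers)...
split = int(len(xs) * 0.8)
train_x, train_y = xs[:split], ys[:split]
# ...and hold back the rest, where only we know the answers.
test_x, test_y = xs[split:], ys[split:]

m, b = fit_line(train_x, train_y)
predictions = [m * x + b for x in test_x]
score = r_squared(test_y, predictions)
print(f"accuracy (R^2) on unseen data: {score:.3f}")
```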

Since regression is so popularly used with stock prices, we can start there with an example. To begin, we need data. Sometimes the data is easy to acquire, and sometimes you have to go out and scrape it together, like what we did in an older tutorial series using machine learning with stock fundamentals for investing. In our case, we're able to at least start with simple stock price and volume information from Quandl. To begin, we'll start with data that grabs the stock price for Alphabet (previously Google), with the ticker of GOOGL:

In [1]:
import pandas as pd
import quandl

df = quandl.get('WIKI/GOOGL')

print(df.head())

              Open    High     Low    Close      Volume  Ex-Dividend  \
Date                                                                   
2004-08-19  100.01  104.06   95.96  100.335  44659000.0          0.0   
2004-08-20  101.01  109.08  100.50  108.310  22834300.0          0.0   
2004-08-23  110.76  113.48  109.05  109.400  18256100.0          0.0   
2004-08-24  111.24  111.60  103.57  104.870  15247300.0          0.0   
2004-08-25  104.76  108.00  103.88  106.000   9188600.0          0.0   

            Split Ratio  Adj. Open  Adj. High   Adj. Low  Adj. Close  \
Date                                                                   
2004-08-19          1.0  50.159839  52.191109  48.128568   50.322842   
2004-08-20          1.0  50.661387  54.708881  50.405597   54.322689   
2004-08-23          1.0  55.551482  56.915693  54.693835   54.869377   
2004-08-24          1.0  55.792225  55.972783  51.945350   52.597363   
2004-08-25          1.0  52.542193  54.167209  52.100830   53.164113   

Awesome, off to a good start, we have the data, but maybe a bit much. To reference the intro, there exists an entire machine learning category that aims to reduce the amount of input that we process. In our case, we have quite a few columns, many are redundant, a couple don't really change. We can most likely agree that having both the regular columns and adjusted columns is redundant. Adjusted columns are the most ideal ones. Regular columns here are prices on the day, but stocks have things called stock splits, where suddenly 1 share becomes something like 2 shares, thus the value of a share is halved, but the value of the company has not halved. Adjusted columns are adjusted for stock splits over time, which makes them more reliable for doing analysis.
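Here is a small sketch of why a split distorts raw prices and how an adjustment removes the distortion. The prices, the split date, and the `split_adjust` helper are all hypothetical; this handles a single split only and is not meant to reproduce how the data provider computes its adjusted columns.

```python
def split_adjust(prices, split_index, ratio):
    """Divide every price before the split by the split ratio.

    prices      -- raw closing prices, oldest first
    split_index -- position of the first price quoted after the split
    ratio       -- new shares per old share (2.0 for a 2-for-1 split)
    """
    return [p / ratio if i < split_index else p for i, p in enumerate(prices)]


# Hypothetical raw closes with a 2-for-1 split taking effect at index 3.
raw_close = [100.0, 102.0, 104.0, 52.5, 53.0]

# The raw series appears to crash from 104.0 to 52.5, yet the company's value
# did not change. After adjustment the series is continuous.
adjusted_close = split_adjust(raw_close, split_index=3, ratio=2.0)
print(adjusted_close)  # [50.0, 51.0, 52.0, 52.5, 53.0]
```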

Thus, let's go ahead and pare down our original dataframe a bit:

Each column represents a different feature; for example, the 'High', 'Low' and 'Close' prices are all features. It's important to use meaningful features in a project. For example, you might be looking for pattern recognition in equity prices, but that doesn't necessarily mean you need all of the fields, or columns, above when you could choose just one or two. So it becomes necessary to look into the relationships between some of these columns. Features are also known as attributes, inputs, or predictor variables, and they are used to arrive at a stock price prediction (the prediction targets are called labels, or predicted or output variables).

Deep Learning and some of the other algorithms explore the relationships between some of these attributes on their own, but Regression does not.

In [2]:
df = df[['Adj. Open', 'Adj. High', 'Adj. Low', 'Adj. Close', 'Adj. Volume']]

Now we just have the adjusted columns, and the volume column. A couple major points to make here. Many people talk about or hear about machine learning as if it is some sort of dark art that somehow generates value from nothing. Machine learning can highlight value if it is there, but it has to actually be there. You need meaningful data. So how do you know if you have meaningful data? My best suggestion is to just simply use your brain. Think about it. Are historical prices indicative of future prices? Some people think so, but this has been continually disproven over time. What about historical patterns? This has a bit more merit when taken to the extremes (which machine learning can help with), but is overall fairly weak. What about the relationship between price changes and volume over time, along with historical patterns? Probably a bit better. So, as you can already see, it is not the case that the more data the merrier, but we instead want to use useful data. At the same time, raw data sometimes should be transformed.

Consider daily volatility, such as with the high minus low % change? How about daily percent change? Would you consider data that is simply the Open, High, Low, Close or data that is the Close, Spread/Volatility, %change daily to be better? I would expect the latter to be more ideal. The former is all very similar data points. The latter is created based on the identical data from the former, but it brings far more valuable information to the table.

Thus, not all of the data you have is useful, and sometimes you need to do further manipulation on your data to make it even more valuable before feeding it through a machine learning algorithm. Let's go ahead and transform our data next:

In [3]:
# creating a new attribute, HL_PCT: the percent spread between the adjusted high and the adjusted close
df['HL_PCT'] = (df['Adj. High'] - df['Adj. Close']) / df['Adj. Close'] * 100.0

I went ahead and recorded the video version of this without realizing my mistake: the code above computes high minus close, divided by close. I meant to do High - Low, divided by the low. Feel free to fix that if you like.

This creates a new column that is the % spread based on the closing price, which is our crude measure of volatility. Next, we'll do daily percent change:

In [4]:
# creating a daily percentage change variable
df['PCT_change'] = (df['Adj. Close'] - df['Adj. Open']) / df['Adj. Open'] * 100.0

Now we will define a new dataframe as:

In [5]:
df = df[['Adj. Close', 'HL_PCT', 'PCT_change', 'Adj. Volume']]
print(df.head())

            Adj. Close    HL_PCT  PCT_change  Adj. Volume
Date                                                     
2004-08-19   50.322842  3.712563    0.324968   44659000.0
2004-08-20   54.322689  0.710922    7.227007   22834300.0
2004-08-23   54.869377  3.729433   -1.227880   18256100.0
2004-08-24   52.597363  6.417469   -5.726357   15247300.0
2004-08-25   53.164113  1.886792    1.183658    9188600.0
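As a sanity check, the two derived columns can be reproduced by hand for a single day. The sketch below plugs in the 2004-08-19 adjusted prices printed in the first `df.head()` output; the variable names are just for illustration.

```python
# Adjusted prices for 2004-08-19, copied from the first df.head() output.
adj_open = 50.159839
adj_high = 52.191109
adj_close = 50.322842

# Same formulas as the HL_PCT and PCT_change cells above.
hl_pct = (adj_high - adj_close) / adj_close * 100.0
pct_change = (adj_close - adj_open) / adj_open * 100.0

print(f"HL_PCT     = {hl_pct:.4f}")
print(f"PCT_change = {pct_change:.4f}")
```

Both values should land very close to the first row of the table above; tiny differences in the last digits are expected because the printed prices are themselves rounded.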


It becomes useful to examine relationships between 'High' and 'Low', or 'Open' and 'Close' rather than just simply listing all the price data for each. It's useful to look at whether they have a positive or negative relationship and if so, how much they move together.
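One common way to put a number on "how much they move together" is the Pearson correlation coefficient, which ranges from -1 (perfectly opposite) to +1 (perfectly in step). The sketch below computes it from scratch for the five 'Adj. High' and 'Adj. Low' values shown in the first `df.head()` output. The `pearson` helper is written here for illustration, and five rows are far too few to draw real conclusions from.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# First five rows of 'Adj. High' and 'Adj. Low' from the df.head() output.
adj_high = [52.191109, 54.708881, 56.915693, 55.972783, 54.167209]
adj_low = [48.128568, 50.405597, 54.693835, 51.945350, 52.100830]

r = pearson(adj_high, adj_low)
print(f"correlation between Adj. High and Adj. Low: {r:.3f}")
```

A value near +1 here is another way of seeing why keeping both columns as separate features adds little new information.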

In [6]:
df.tail()

            Adj. Close    HL_PCT  PCT_change  Adj. Volume
Date                                                     
2018-03-21     1094.00  1.343693    0.130884    1990515.0
2018-03-22     1053.15  2.921711   -2.487014    3418154.0
2018-03-23     1026.55  3.918952   -2.360729    2413517.0
2018-03-26     1054.09  0.491419    0.332191    3272409.0
2018-03-27     1006.94  5.720301   -5.353887    2940957.0


# 3. Regression - Features and Labels

In the previous section we were determining if the 'Adj. Close' column would act as a feature or a label (input, or output). All we know here is that at some point in the future we will be using Price as the output label, so it may in fact be the 'Adj. Close'. The hope here is that we've grabbed data, decided on the valuable data, created some new valuable data through manipulation, and now we're ready to actually begin the machine learning process with regression. First, we're going to need a few more imports. All imports now:

In [15]:
import quandl, math
import numpy as np
import pandas as pd
# sklearn.cross_validation was removed from newer scikit-learn releases; sklearn.model_selection replaces it
from sklearn import preprocessing, svm
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

An important note is if we decided to use 'Adj. Close' as a label, the data would be incredibly biased because both the 'HL_PCT' and 'PCT_change' variables could not even be calculated until the 'Adj. Close' was reported at the end of the day, thereby leading to data bleed. So, in this case, it's more of a feature than a label.

We'll be using the numpy module to convert data to numpy arrays, which is what Scikit-learn wants. We will talk more on preprocessing and cross_validation when we get to them in the code, but preprocessing is the module used to do some cleaning/scaling of data prior to machine learning, and cross_validation is used in the testing stages (in current Scikit-learn releases that functionality lives in sklearn.model_selection, which is why the import above uses it). Finally, we're also importing the LinearRegression algorithm as well as svm from Scikit-learn, which we'll be using as our machine learning algorithms to demonstrate results.

At this point, we've got data that we think is useful. How does the actual machine learning thing work? With supervised learning, you have features and labels. The features are the descriptive attributes, and the label is what you're attempting to predict or forecast. Another common example with regression might be to try to predict the dollar value of an insurance policy premium for someone. The company may collect your age, past driving infractions, public criminal record, and your credit score for example. The company will use past customers, taking this data, and feeding in the amount of the "ideal premium" that they think should have been given to that customer, or they will use the one they actually used if they thought it was a profitable amount.

Thus, for training the machine learning classifier, the features are customer attributes, and the label is the premium associated with those attributes.

In our case, what are the features and what is the label? We're trying to predict the price, so is price the label? If so, what are the features? When it comes to forecasting out the price, our label, the thing we're hoping to predict, is actually the future price. As such, our features are actually: current price, high minus low percent, and the percent change volatility. The price that is the label shall be the price at some determined point in the future. Let's go ahead and add a few new lines of code:

In [16]:
forecast_col = 'Adj. Close'
df.fillna(value=-99999, inplace=True)

Because ML algorithms can't work with NA or NaN values, we use the fillna() method to replace them with something. An alternative is to remove the affected column or rows entirely, but discarding data like that isn't ideal when working with ML algorithms. We pass value=-99999 so that the filled-in entries act as a sort of outlier.

Next we will extrapolate, or forecast outwards which can be expressed as follows:

In [23]:
forecast_out = int(math.ceil(0.10 * len(df)))

math.ceil() rounds any number up to the next whole number (the ceiling). len(df) itself is always a whole number, but 0.10 * len(df) usually is not: if it came out as a fraction such as 0.2, math.ceil() would round it up to 1. In this particular case we are predicting outward by 10% of the length of the total dataframe.
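The rounding behaviour is easy to check in isolation. In the sketch below, `forecast_horizon` is a hypothetical helper that wraps the same expression as the cell above, and the row counts are made-up examples rather than the real length of the dataframe.

```python
import math


def forecast_horizon(n_rows, fraction=0.10):
    """Number of rows to forecast out: a fraction of the data, rounded up."""
    return int(math.ceil(fraction * n_rows))


# Hypothetical dataset lengths.
print(forecast_horizon(2))      # 0.2   rounds up to 1
print(forecast_horizon(42))     # 4.2   rounds up to 5
print(forecast_horizon(3424))   # 342.4 rounds up to 343
```

In Python 3, math.ceil() already returns an int, so the outer int() is harmless but not strictly needed.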

Here, we define the forecasting column, then we fill any NaN data with -99999. You have a few choices here regarding how to handle missing data. You can't just pass a NaN (Not a Number) datapoint to a machine learning classifier; you have to handle it. One popular option is to replace missing data with -99,999. With many machine learning classifiers, this will just be recognized and treated as an outlier feature. You can also just drop all feature/label sets that contain missing data, but then you're maybe leaving a lot of data out.
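The two options can be compared side by side on a toy dataframe. The column names and numbers below are made up for illustration; `fillna` and `dropna` are the same pandas methods used in this tutorial.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value (made-up numbers).
toy = pd.DataFrame({
    "price": [10.0, 11.0, np.nan, 13.0],
    "volume": [100.0, 120.0, 90.0, 110.0],
})

# Option 1: replace missing values with an extreme sentinel value.
# Every row is kept, and the gap shows up as an obvious outlier.
filled = toy.fillna(value=-99999)

# Option 2: drop every row that contains a missing value.
# The data stays clean, but the whole row (including its volume) is lost.
dropped = toy.dropna()

print(filled)
print(dropped)
```

Without `inplace=True`, both calls return new dataframes and leave `toy` untouched, which makes it easy to compare the two results.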

In the real world, many data sets are very messy. Most stock price/volume data is pretty clean, rarely with missing data, but many datasets will have a lot of missing data. I've seen datasets where the majority of the rows contain some missing bit of info. You don't necessarily want to forfeit all of that great data, plus, if your sample data has holes, you can probably bet your real-world use-case will also have holes. You need to train, test, and go live all on the same data and characteristics of that data.

Finally, we define what we want to forecast out. In many cases, such as in the case of trying to predict a client's premium for insurance, you just want one number, for the "right now", but, with forecasting, you want to forecast out a certain number of datapoints. We're saying we want to forecast out 10% of the entire length of the dataset. Thus, if our data is 100 days of stock prices, we want to be able to predict the price 10 days out into the future. Choose whatever you like. If you are just trying to predict tomorrow's price, then you would just do 1 day out, and the forecast would be just one day out. If you predict 10 days out, we can actually generate a forecast for every day, for the next week and a half.

In our case, we've decided the features are a bunch of the current values, and the label shall be the price, in the future, where the future is 10% of the entire length of the dataset out. We'll assume all current columns are our features, so we'll add a new column with a simple pandas operation. Next we want to create the label column as follows:

In [24]:
df['label'] = df[forecast_col].shift(-forecast_out)
df.dropna(inplace=True)

Now we have the data that comprises our features and labels. Next, we need to do some preprocessing and final steps before actually running everything, which is what we will be focusing on in the next tutorial.
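To see exactly what the shift does, it helps to run it on a frame small enough to read at a glance. The prices below are made up, and `forecast_out` is set to 2 purely for illustration.

```python
import pandas as pd

# Toy frame with made-up prices.
toy = pd.DataFrame({"Adj. Close": [10.0, 11.0, 12.0, 13.0, 14.0]})
forecast_out = 2  # look two rows ahead in this toy example

# A negative shift moves the column UP, so each row's label is the
# price from forecast_out rows later.
toy["label"] = toy["Adj. Close"].shift(-forecast_out)
print(toy)

# The last forecast_out rows have no future price yet, so their label
# is NaN, and dropna() removes them.
trainable = toy.dropna()
print(trainable)
```

This is also why the labeled dataframe ends earlier than the raw data: the most recent rows have no future price available to use as a label.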

In [25]:
print(df.head())

            Adj. Close    HL_PCT  PCT_change  Adj. Volume       label
Date                                                                 
2004-08-19   50.322842  3.712563    0.324968   44659000.0  213.825057
2004-08-20   54.322689  0.710922    7.227007   22834300.0  216.688898
2004-08-23   54.869377  3.729433   -1.227880   18256100.0  216.132179
2004-08-24   52.597363  6.417469   -5.726357   15247300.0  212.977441
2004-08-25   53.164113  1.886792    1.183658    9188600.0  214.005615


In [26]:
print(df.tail())

            Adj. Close    HL_PCT  PCT_change  Adj. Volume    label
Date                                                              
2016-09-22      815.95  0.381151    0.734568    1759290.0  1177.37
2016-09-23      814.96  0.250319   -0.022082    1411673.0  1182.22
2016-09-26      802.65  0.925684   -0.885382    1472732.0  1181.59
2016-09-27      810.73  0.340434    1.109961    1367271.0  1119.20
2016-09-28      810.06  0.023455    0.743707    1470280.0  1068.76


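Try adjusting the outward forecast percentage of the overall dataset to 1%. Note that the cell below only recomputes forecast_out; the label column is not rebuilt, so the printed labels are unchanged from before. To see the effect of the new horizon, the label column has to be recreated from the data as it was before dropna() was called.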

In [29]:
forecast_out = int(math.ceil(0.01 * len(df)))
print(df.head())

            Adj. Close    HL_PCT  PCT_change  Adj. Volume       label
Date                                                                 
2004-08-19   50.322842  3.712563    0.324968   44659000.0  213.825057
2004-08-20   54.322689  0.710922    7.227007   22834300.0  216.688898
2004-08-23   54.869377  3.729433   -1.227880   18256100.0  216.132179
2004-08-24   52.597363  6.417469   -5.726357   15247300.0  212.977441
2004-08-25   53.164113  1.886792    1.183658    9188600.0  214.005615
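One way to experiment with different horizons is to wrap the labeling steps in a function that always starts from the unlabeled data. The sketch below assumes a hypothetical helper named `add_label` and a made-up toy price series; it is not the tutorial's own code.

```python
import math

import pandas as pd


def add_label(frame, forecast_col, fraction):
    """Return (labeled copy of frame, forecast_out) for the given fraction."""
    forecast_out = int(math.ceil(fraction * len(frame)))
    labeled = frame.copy()  # leave the caller's frame untouched
    labeled["label"] = labeled[forecast_col].shift(-forecast_out)
    return labeled.dropna(), forecast_out


# 205 made-up prices: 100.0, 101.0, ..., 304.0
toy = pd.DataFrame({"Adj. Close": [float(p) for p in range(100, 305)]})

ten_pct, out_10 = add_label(toy, "Adj. Close", 0.10)
one_pct, out_1 = add_label(toy, "Adj. Close", 0.01)

print(out_10, len(ten_pct), ten_pct["label"].iloc[0])
print(out_1, len(one_pct), one_pct["label"].iloc[0])
```

Because each call works on a fresh copy, the two horizons can be compared directly: a shorter horizon keeps more rows and pairs each day with a nearer future price.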
