In contrast to classification, a regression problem in machine learning aims to predict a numerical, quantitative result from data. For example, a navigation app estimating your travel time is solving a regression problem. Like classification, regression can be seen as filling in a missing column of a table, but this time the column holds numerical values. We again start with the Iris dataset as an example.

In [6]:
import pandas as pd

df = pd.read_csv("iris-mv.csv")
df = df.ffill()  # fill each missing value with the last valid value above it
df

scikit-learn cannot do calculations on strings, so the "Species" column must be converted into something it can work with. For this problem, the best approach is to convert the species with so-called one-hot encoding, as follows:

In [7]:
df = pd.get_dummies(df, drop_first=True, dtype=int)  # dtype=int gives 0/1 columns rather than booleans
df

Unnamed: 0,SepalLength,SepalWidth,PetalLength,PetalWidth,Species_Iris-versicolor,Species_Iris-virginica
0,5.1,3.5,1.4,0.2,0,0
1,4.9,3.0,1.4,0.2,0,0
2,4.7,3.0,1.3,0.2,0,0
3,4.6,3.1,1.5,0.2,0,0
4,5.0,3.6,1.4,0.2,0,0
...,...,...,...,...,...,...
130,6.8,3.2,5.9,2.3,0,1
131,6.7,3.3,5.7,2.5,0,1
132,6.3,2.5,5.0,1.9,0,1
133,6.5,3.0,5.2,2.0,0,1

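To see why one indicator column can be dropped without losing information, here is a minimal sketch on a made-up three-row frame (the `toy` data and the variable names are illustrative assumptions, not part of the Iris file). It compares the full encoding with the `drop_first=True` variant:

```python
import pandas as pd

# Made-up toy data: one row per species, plus one numeric column.
toy = pd.DataFrame({
    "PetalWidth": [0.2, 1.3, 2.0],
    "Species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

# Full one-hot encoding: one indicator column per species (three here).
full = pd.get_dummies(toy, dtype=int)

# drop_first=True drops the first category in sorted order (Iris-setosa);
# a row with 0 in both remaining columns must therefore be Iris-setosa.
reduced = pd.get_dummies(toy, drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
print(reduced)
```

The reduced frame keeps only the versicolor and virginica indicators, and the first row has 0 in both of them.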

In the Iris output above, the Species column has been replaced by two columns that hold only 0 or 1. Because drop_first=True drops the first category, the number of new columns is always the number of species minus one; a row with 0 in both columns belongs to the dropped species, Iris-setosa. The dataset can now be split into a training set and a test set. Say the sepal length of a certain flower is missing and we want to predict it:

In [8]:
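from sklearn.model_selection import train_test_split

X = df.drop(['SepalLength'], axis=1)  # features: every column except the target
y = df['SepalLength']                 # target: the value to predict
X_train, X_test, y_train, y_test = train_test_split(X, y)  # default: 75% train, 25% test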
X_train

Unnamed: 0,SepalWidth,PetalLength,PetalWidth,Species_Iris-versicolor,Species_Iris-virginica
6,3.4,1.4,0.3,0,0
4,3.6,1.4,0.2,0,0
126,3.1,4.8,2.1,0,1
51,3.3,4.5,1.6,1,0
56,3.0,4.2,1.5,1,0
...,...,...,...,...,...
89,2.8,3.0,1.3,1,0
10,3.7,1.5,0.2,0,0
54,2.7,3.9,1.4,1,0
79,2.3,4.4,1.3,1,0

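Because train_test_split shuffles the rows, the training rows and every error value later in this section change from run to run. The following is a minimal sketch of making the split repeatable, using made-up data; the names `X_demo`, `y_demo` and the seed value 0 are assumptions for illustration only:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, two feature columns and a numeric target.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"a": rng.normal(size=100), "b": rng.normal(size=100)})
y_demo = 2.0 * X_demo["a"] + rng.normal(scale=0.1, size=100)

# test_size=0.25 is the default written out; random_state fixes the shuffle.
split_1 = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)
split_2 = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

X_tr, X_te, y_tr, y_te = split_1
print(len(X_tr), len(X_te))                 # 75 training rows, 25 test rows
print(X_tr.index.equals(split_2[0].index))  # True: same seed, same rows
```

Passing the same random_state to the Iris split would make the notebook's results reproducible, though the specific numbers would differ from the outputs shown here.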

A common way to judge the accuracy of a regression model is a metric called mean squared error (MSE). It is the mean of the squared differences between the predicted values and the actual values. The smaller the number, the more accurate the model. For example:

In [17]:
from sklearn.metrics import mean_squared_error
a = [1.0,2.2,1.5,3.4]
b = [1.2,2.3,0.9,3.5]
mean_squared_error(a,b)

0.10499999999999998
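
The same number can be reproduced by hand, which makes the definition concrete. This is a sketch using NumPy instead of scikit-learn; treating `a` as the actual values and `b` as the predictions is an assumption (the metric is symmetric, so the result does not depend on it), and the root mean squared error is added only as a common companion figure:

```python
import numpy as np

a = np.array([1.0, 2.2, 1.5, 3.4])  # assumed: actual values
b = np.array([1.2, 2.3, 0.9, 3.5])  # assumed: predicted values

errors = a - b          # difference for each pair
squared = errors ** 2   # square each difference
mse = squared.mean()    # average of the squared differences
rmse = np.sqrt(mse)     # square root of the MSE, in the units of the target

print(mse)
print(rmse)
```

The squared differences are 0.04, 0.01, 0.36 and 0.01, so their mean is 0.105 up to floating-point rounding, matching the scikit-learn result.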

The code for training and testing a regression model is very similar to that for classification:

In [13]:
from sklearn.linear_model import LinearRegression
reg = LinearRegression().fit(X_train,y_train)
y_pred = reg.predict(X_test)
mean_squared_error(y_test, y_pred)

0.20129511462236382

Another algorithm is the decision tree:

In [14]:
from sklearn.tree import DecisionTreeRegressor
reg = DecisionTreeRegressor().fit(X_train,y_train)
y_pred = reg.predict(X_test)
mean_squared_error(y_test, y_pred)

0.32132352941176473

Random forest:

In [15]:
from sklearn.ensemble import RandomForestRegressor
reg = RandomForestRegressor().fit(X_train,y_train)
y_pred = reg.predict(X_test)
mean_squared_error(y_test, y_pred)

0.19348134534830533

Support vector machine:

In [19]:
from sklearn.svm import SVR
reg = SVR().fit(X_train,y_train)
y_pred = reg.predict(X_test)
mean_squared_error(y_test, y_pred)

0.21332158702911225

Gaussian process:

In [20]:
from sklearn.gaussian_process import GaussianProcessRegressor
reg = GaussianProcessRegressor().fit(X_train,y_train)
y_pred = reg.predict(X_test)
mean_squared_error(y_test, y_pred)

20.22483027971451
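
The five cells above all follow the same fit, predict, score pattern, so they can be written as one loop. The sketch below assumes made-up numeric data in place of "iris-mv.csv" (the names `X_demo`, `y_demo`, `models` and `scores` are illustrative), so its numbers are not comparable to the outputs above:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Made-up data: four numeric features and a noisy linear target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 5.0 + X_demo @ np.array([1.0, -0.5, 0.3, 0.0]) + rng.normal(scale=0.2, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# The same five regressors as above, all with default settings.
models = {
    "linear regression": LinearRegression(),
    "decision tree": DecisionTreeRegressor(random_state=0),
    "random forest": RandomForestRegressor(random_state=0),
    "support vector machine": SVR(),
    "Gaussian process": GaussianProcessRegressor(),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)                          # train on the training rows
    y_pred = model.predict(X_te)                   # predict the held-out rows
    scores[name] = mean_squared_error(y_te, y_pred)

# Print the models from lowest to highest test error.
for name, mse in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: {mse:.3f}")
```

Scoring every model on the same split keeps the comparison fair, since each one sees identical training and test rows.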

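Each of these algorithms also has parameters that can be tuned to a specific problem or to achieve even better accuracy. The large error of the Gaussian process above, for instance, likely reflects its default settings rather than the method itself. Read the documentation for the details.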
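
One possible way to tune such parameters is a cross-validated grid search. The sketch below is an illustration under assumptions, not a recommended recipe: the data is made up, and the grid values for `max_depth` and `min_samples_leaf` are arbitrary choices for a decision tree:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Made-up data: four numeric features and a noisy linear target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 5.0 + X_demo @ np.array([1.0, -0.5, 0.3, 0.0]) + rng.normal(scale=0.2, size=200)

# Arbitrary candidate values; every combination is tried.
param_grid = {
    "max_depth": [2, 4, 8, None],
    "min_samples_leaf": [1, 5, 10],
}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_squared_error",  # scikit-learn maximizes scores, so MSE is negated
    cv=5,                              # 5-fold cross-validation
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(-search.best_score_)  # cross-validated MSE of the best combination
```

In a real workflow the search would be fitted on the training set only, with the test set kept aside for the final mean squared error.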

Exercise: Read the "housing.csv" data file into a dataframe. Create a model for predicting house prices.