In contrast to classification, a regression problem in machine learning aims to predict a numerical, quantitative result from data. For example, a navigation app that estimates your travel time is solving a regression problem. Like classification, regression can also be seen as filling in a missing column in a table. We again start with the Iris dataset as an example.

In [1]:
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
df = pd.read_csv("iris.csv")
df

Unnamed: 0,SepalLength,SepalWidth,PetalLength,PetalWidth,Species
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.6,3.1,1.5,0.2,Iris-setosa
3,5.0,3.6,1.4,0.2,Iris-setosa
4,5.4,3.9,1.7,0.4,Iris-setosa
...,...,...,...,...,...
145,6.5,3.0,5.5,1.8,Iris-virginica
146,7.7,2.6,6.9,2.3,Iris-virginica
147,6.0,2.2,5.0,1.5,Iris-virginica
148,6.9,3.2,5.7,2.3,Iris-virginica


Let's say a sample of an Iris setosa flower is collected and its dimensions are measured, but for some reason its petal width is missing. The petal width can be estimated using a machine learning algorithm. The terms used in regression are similar to those in classification. For example, the data and the target are defined as follows:

In [2]:
setosa = df[df["Species"] == "Iris-setosa"]  # select only rows with setosa species
X = setosa.drop(["PetalWidth", "Species"], axis=1)  # data: the remaining measurements
y = setosa["PetalWidth"]  # target: the column to be predicted

Unnamed: 0,SepalLength,SepalWidth,PetalLength
29,5.2,4.1,1.5
31,4.9,3.1,1.5
36,5.1,3.4,1.5
135,4.7,3.2,1.3
40,4.8,3.0,1.4
12,5.7,4.4,1.5
4,5.4,3.9,1.7
43,5.3,3.7,1.5
8,5.4,3.7,1.5
14,5.1,3.5,1.4


The first algorithm used is called linear regression. The code for training is

In [3]:
from sklearn.linear_model import LinearRegression
reg = LinearRegression().fit(X,y)

And the code for predicting, for example, the petal width of an Iris setosa flower with sepal length 5 cm, sepal width 3 cm, and petal length 1 cm is therefore

In [4]:
reg.predict([[5,3,1]])

array([0.15066439])
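
To make the prediction step less of a black box, here is a minimal sketch of what a fitted linear regression computes: an intercept plus a weighted sum of the input features. It assumes that the copy of the Iris data bundled with scikit-learn (`load_iris`) is an acceptable stand-in for "iris.csv"; the names `X_demo`, `y_demo`, `model`, and `sample` are only illustrative, and the numbers may differ slightly from the output above if the two copies of the data differ.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression

# Assumption: scikit-learn's bundled Iris data stands in for iris.csv.
iris = load_iris()
setosa = iris.data[iris.target == 0]  # target 0 is Iris setosa
X_demo = setosa[:, :3]                # sepal length, sepal width, petal length
y_demo = setosa[:, 3]                 # petal width

model = LinearRegression().fit(X_demo, y_demo)

sample = np.array([5.0, 3.0, 1.0])

# A linear model predicts: intercept + sum(coefficient * feature value)
manual = model.intercept_ + model.coef_ @ sample
library = model.predict(sample.reshape(1, -1))[0]

print("coefficients:", model.coef_)
print("intercept:   ", model.intercept_)
print("manual vs predict():", manual, library)
```

The two printed predictions should agree up to floating-point rounding, since `predict` evaluates the same formula.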

Similarly, the code for using two other regression algorithms, the decision tree and the random forest, is

In [5]:
reg = DecisionTreeRegressor().fit(X,y)
reg.predict([[5,3,1]])

array([0.2])
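
A decision tree regressor does not fit a formula. It routes a sample to a leaf and returns the value stored there, which is the mean of the training targets that ended up in that leaf. The sketch below checks this on scikit-learn's bundled Iris data, which is assumed here to be a stand-in for "iris.csv"; the variable names are illustrative, and `random_state=0` is added only to make the run repeatable.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeRegressor

# Assumption: scikit-learn's bundled Iris data stands in for iris.csv.
iris = load_iris()
setosa = iris.data[iris.target == 0]  # target 0 is Iris setosa
X_demo, y_demo = setosa[:, :3], setosa[:, 3]

tree = DecisionTreeRegressor(random_state=0).fit(X_demo, y_demo)

sample = np.array([[5.0, 3.0, 1.0]])
prediction = tree.predict(sample)[0]

# apply() returns the index of the leaf that the sample falls into
leaf_id = tree.apply(sample)[0]
# tree_.value stores the mean training target of each node
leaf_value = tree.tree_.value[leaf_id].ravel()[0]

print("prediction:", prediction)
print("value stored in leaf:", leaf_value)
print("training target range:", y_demo.min(), "to", y_demo.max())
```

One consequence is that a tree's prediction always stays within the range of the target values seen during training, whereas a linear model can extrapolate beyond it.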

In [6]:
reg = RandomForestRegressor().fit(X,y)
reg.predict([[5,3,1]])



array([0.21333333])
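
A random forest regressor combines many decision trees, each trained on a random resample of the data, and its prediction is the average of the individual trees' predictions. The following sketch illustrates this, again assuming that scikit-learn's bundled Iris data can stand in for "iris.csv"; `n_estimators=100` and `random_state=0` are set explicitly only to make the example repeatable.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestRegressor

# Assumption: scikit-learn's bundled Iris data stands in for iris.csv.
iris = load_iris()
setosa = iris.data[iris.target == 0]  # target 0 is Iris setosa
X_demo, y_demo = setosa[:, :3], setosa[:, 3]

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_demo, y_demo)

sample = np.array([[5.0, 3.0, 1.0]])

# Ask every tree in the forest for its own prediction
per_tree = np.array([t.predict(sample)[0] for t in forest.estimators_])
forest_prediction = forest.predict(sample)[0]

print("mean of the trees:", per_tree.mean())
print("forest.predict(): ", forest_prediction)
```

Because the result is an average over many trees, it need not coincide with any single measured petal width, and it changes slightly from run to run unless `random_state` is fixed.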

Exercise: Read the "housing.csv" data file into a dataframe. Create a model for predicting house prices.