In [7]:
from azureml import Workspace

ws = Workspace()
ds = ws.datasets['autos.csv']
df = ds.to_dataframe()

In [99]:
df.head()

Unnamed: 0,symboling,normalized-losses,make-id,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
3,2,164.0,2,0,0,1,2,1,0,99.8,...,109,0,3.19,3.4,10.0,102.0,5500.0,24,30,13950.0
4,2,164.0,2,0,0,1,2,2,0,99.4,...,136,0,3.19,3.4,8.0,115.0,5500.0,18,22,17450.0
6,1,158.0,2,0,0,1,2,1,0,105.8,...,136,0,3.19,3.4,8.5,110.0,5500.0,19,25,17710.0
8,1,158.0,2,0,1,1,2,1,0,105.8,...,131,0,3.13,3.4,8.3,140.0,5500.0,17,20,23875.0
10,2,192.0,3,0,0,0,2,0,0,101.2,...,108,0,3.5,2.8,8.8,101.0,5800.0,23,29,16430.0


# Solving Autos Dataset with Sklearn in Jupyter Notebooks

A lovely aspect of Notebooks is that you can use Markdown cells to explain what the code is doing, rather than relying on code comments. There are several benefits to doing so:

<ul>
<li>Markdown allows for richer text formatting, like <em>italics</em>, <strong>bold</strong>, <code>inline code</code>, hyperlinks, and headers.</li>
<li>Markdown cells automatically word wrap whereas code cells do not. Code comments typically use explicit line breaks for formatting, but that's not necessary in Markdown.</li>
<li>Using Markdown cells makes it easier to run the Notebook as a slide show.</li>
<li>Markdown cells help you remove lengthy comments from the code, making the code easier to scan.</li>
</ul>

## Install packages using pip or conda
Because the code in your notebook likely uses some Python packages, you need to make sure the Notebook environment contains those packages. You can do this directly within the notebook in a code block that contains the appropriate pip or conda commands prefixed by !:

```
!pip install <package-name>
```

This notebook requires numpy, matplotlib, pandas, and scikit-learn (imported as sklearn). These packages are already included in Azure Notebooks, so the following commands are included mainly to note the dependencies clearly.

In [6]:
!pip install numpy
!pip install matplotlib
!pip install pandas
!pip install sklearn

Collecting sklearn
  Downloading https://files.pythonhosted.org/packages/1e/7a/dbb3be0ce9bd5c8b7e3d87328e79063f8b263b2b1bfa4774cb1147bfcd3f/sklearn-0.0.tar.gz
Building wheels for collected packages: sklearn
  Running setup.py bdist_wheel for sklearn ... done
  Stored in directory: /home/nbuser/.cache/pip/wheels/76/03/bb/589d421d27431bcd2c6da284d5f2286c8e3b2ea3cf1594c074
Successfully built sklearn
Installing collected packages: sklearn
Successfully installed sklearn-0.0


In this example we're using numpy and pandas.

In [8]:
import numpy as np
import pandas as pd

In [101]:
X = df.drop('price', axis=1)
y = df['price']

Next, split the dataset into a training set (two thirds) and a test set (one third). We don't need to do any feature scaling, because ordinary least-squares linear regression gives the same predictions whatever units each feature is measured in.

In [102]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)

## Fit the model to the training set
"Fitting" the model to the training set means finding the linear function that best describes the relationship between the independent variables and the dependent variable. With a single feature this is the best-fit line through the plotted training points; with several features, as here, it is the best-fit hyperplane.

The regressor's fit method creates that model, which algebraically has the form y = b0 + b1*x1 + ... + bn*xn, where b1 through bn are the coefficients, one per feature (available through regressor.coef_), and b0 is the intercept, the predicted value when every feature is 0 (available through regressor.intercept_).

In [103]:
from sklearn.linear_model import LinearRegression

regressor = LinearRegression()    # This object is the regressor, that does the regression
regressor.fit(X_train, y_train)   # Provide training data so the machine can learn to predict using a learned model.

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
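To make coef_ and intercept_ concrete, here is a minimal sketch on synthetic data rather than the autos dataset. The data is generated from a known formula, y = 3*x1 - 2*x2 + 5, so we can check what the fit recovers. The names rng, X_syn, y_syn, and demo_regressor are illustrative assumptions, not part of this notebook's workflow.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): y = 3*x1 - 2*x2 + 5, no noise
rng = np.random.RandomState(0)
X_syn = rng.rand(50, 2)
y_syn = 3 * X_syn[:, 0] - 2 * X_syn[:, 1] + 5

demo_regressor = LinearRegression()
demo_regressor.fit(X_syn, y_syn)

# With noise-free data the fit recovers the generating values, up to rounding
print(demo_regressor.coef_)       # one coefficient per feature
print(demo_regressor.intercept_)  # the constant term b0
```

On real data such as the autos set, the coefficients are estimates rather than exact values, and their sizes depend on the units of each feature.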

## Did it work?

Of course not! Life is hard. Linear regression can't handle strings!

## Did it work? (try 2)

Of course not! Life is hard. Linear regression can't handle missing values!

<img src='https://3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com/wp-content/uploads/2017/03/How-to-Handle-Missing-Values-with-Python.jpg' />

## Did it work? (try 3)

Yep! Let's try to use our model.

## Predict the results
With the regressor in hand, we can predict the test set results using its predict method. That method takes a matrix of independent variables, one row per observation for which you want a prediction.

Because the fitted regressor is fully described by coef_ and intercept_, a prediction for a row x is the dot product of coef_ and x, plus intercept_. (Indeed, predicting for a row of all zeros returns intercept_.)
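The relationship between predict, coef_, and intercept_ can be checked directly. The sketch below uses hypothetical toy data with three features (X_toy, y_toy, and model are illustrative names, not objects from this notebook) and compares a manual dot product against predict.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data with three features (not the autos dataset)
rng = np.random.RandomState(1)
X_toy = rng.rand(30, 3)
y_toy = X_toy @ np.array([1.5, -0.5, 2.0]) + 4.0 + rng.normal(scale=0.1, size=30)

model = LinearRegression().fit(X_toy, y_toy)

# predict is equivalent to a dot product with coef_ plus intercept_
manual = X_toy @ model.coef_ + model.intercept_
print(np.allclose(manual, model.predict(X_toy)))

# A row of all zeros predicts the intercept; predict expects a 2-D array
zero_row = np.zeros((1, 3))
print(np.isclose(model.predict(zero_row)[0], model.intercept_))
```

Note that predict expects a two-dimensional input (rows of observations), so a single observation has to be passed as a one-row array.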

In the code, y_test (from when we split the set) contains the real observations. y_pred, assigned here, contains the predictions for the same X_test inputs. It's not expected that the test or training points exactly fit the regression; the regression is trying to find a model that we can use to make predictions from new observations of the independent variables.

In [105]:
y_pred = regressor.predict(X_test)
print(y_pred)

[ 10056.29912166   8018.70911357   6790.46282267   8279.28060009
  10151.96697477   6784.57558414  16791.1603156   17381.60255689
   7512.75273163   6713.41369343   8988.49361003   7787.21532477
  13982.15671974  18020.91304936  15055.91295732  13093.5517696
  20281.60696508  10844.20982215   3701.15495557   7245.26640824
   7283.47321733   7237.39328578  13887.10936938   6181.24970911
   6811.40003444   9668.23787983   8353.81592633  17375.58013654
   5400.80743904  19999.23248481   6990.58111126   8590.93616396
  10415.52146723  26053.22000726  11724.65361727  14631.75684384
  18049.59326559   9772.36681982   7315.86287716  19447.19660953
   6457.60938984   6927.78786544  13318.94945659   6868.034466
  17171.88348642  11990.87950167  13062.0911398   16468.10608007
  15354.48083444   7626.22826924   7557.28599222  19999.23248481
  10510.63062495  15247.25950487   7408.55931931   8279.43309828
  16663.316801  ]


It's interesting to think that many of the "predictions" we use in daily life, like weather forecasts, come from models of some sort working with various data sets. Those models are much more complicated than what's shown here, but the idea is the same.

Knowing how predictions work helps us understand that the actual observations we collect in the moment will always be somewhat off from the predictions: the predictions fit the model exactly, whereas the observations typically won't.

Of course, such systems feed new observations back into the dataset to continually improve the model, meaning that predictions should get more accurate over time.

The challenge is determining what data to actually use. For example, with weather, how far back in time do you go? How have weather patterns been changing decade by decade? In any case, something like weather prediction works hour by hour, day by day, on quantities such as temperature, precipitation, wind, and cloud cover. Radar and other observations are fed into the model, and the predictions are reduced to mathematics.

In [107]:
from sklearn.metrics import r2_score

r2_score(y_test, y_pred) 

0.8139354578291087
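The score above is the coefficient of determination, R²: one minus the ratio of the residual sum of squares to the total sum of squares. A score of 1 means perfect predictions, and 0 means no better than always predicting the mean. As a rough sketch, with made-up values rather than the autos predictions, the same number can be computed by hand:

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up values for illustration only
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.5, 5.5, 7.0, 8.0])

ss_res = np.sum((y_true_demo - y_pred_demo) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)   # total sum of squares
r2_manual = 1 - ss_res / ss_tot

# Both values should agree
print(r2_manual, r2_score(y_true_demo, y_pred_demo))
```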

## Handling string features

The dtypes attribute lists all the features along with their corresponding types. Note that anything listed as object is typically a string.

In [12]:
df.dtypes

symboling              int64
normalized-losses    float64
make-id                int64
fuel-type             object
aspiration            object
num-of-doors          object
body-style            object
drive-wheels          object
engine-location       object
wheel-base           float64
length               float64
width                float64
height               float64
curb-weight            int64
engine-type           object
num-of-cylinders      object
engine-size            int64
fuel-system           object
bore                 float64
stroke               float64
compression-ratio    float64
horsepower           float64
peak-rpm             float64
city-mpg               int64
highway-mpg            int64
price                float64
dtype: object

Pandas has a helpful select_dtypes method, which we can use to build a new dataframe containing only the object columns.

In [108]:
obj_df = df.select_dtypes(include=['object']).copy()
obj_df.head()

3
4
6
8
10


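We can easily extract factors from the string columns. The factorize method is very helpful here: it replaces each distinct string with an integer code. Note that in many cases you will need more advanced encoding, because integer codes imply an ordering between categories that a linear model will treat as meaningful.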

In [109]:
columns = list(obj_df.columns.values)

for col in columns:
    df[col] = df[col].factorize()[0]
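To see what factorize returns, here is a small sketch on a hypothetical column whose values mimic the body-style feature. It also shows one common alternative, one-hot encoding with pd.get_dummies, which this notebook does not use. Be aware that factorize encodes missing values as -1 by default, so they no longer appear as NaN afterwards.

```python
import pandas as pd

# Hypothetical column for illustration; values mimic the body-style feature
styles = pd.Series(['sedan', 'hatchback', 'sedan', 'wagon'], name='body-style')

codes, uniques = styles.factorize()
print(codes)    # integer code per row, numbered in order of first appearance
print(uniques)  # the distinct labels, indexed by code

# One-hot alternative: one indicator column per category, no implied order
one_hot = pd.get_dummies(styles, prefix='body-style')
print(one_hot)
```

One-hot encoding avoids the artificial ordering of integer codes at the cost of more columns, which matters mainly for linear models; tree-based models are usually less sensitive to it.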

In [110]:
df.head()

Unnamed: 0,symboling,normalized-losses,make-id,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
3,2,164.0,2,0,0,1,2,1,0,99.8,...,109,0,3.19,3.4,10.0,102.0,5500.0,24,30,13950.0
4,2,164.0,2,0,0,1,2,2,0,99.4,...,136,0,3.19,3.4,8.0,115.0,5500.0,18,22,17450.0
6,1,158.0,2,0,0,1,2,1,0,105.8,...,136,0,3.19,3.4,8.5,110.0,5500.0,19,25,17710.0
8,1,158.0,2,0,1,1,2,1,0,105.8,...,131,0,3.13,3.4,8.3,140.0,5500.0,17,20,23875.0
10,2,192.0,3,0,0,0,2,0,0,101.2,...,108,0,3.5,2.8,8.8,101.0,5800.0,23,29,16430.0


## Handling missing values

The simplest option is to drop every row that contains a missing value.

In [111]:
df.dropna(inplace=True)
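Dropping rows throws away data. One alternative, shown here only as a sketch and not used in this notebook, is to fill the gaps with a per-column statistic. The example assumes scikit-learn 0.20 or later, where SimpleImputer is available, and uses a hypothetical two-column frame named demo.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical frame with gaps, for illustration only
demo = pd.DataFrame({'horsepower': [102.0, np.nan, 110.0],
                     'peak-rpm': [5500.0, 5500.0, np.nan]})

# Replace each NaN with the median of its column
imputer = SimpleImputer(strategy='median')
filled = pd.DataFrame(imputer.fit_transform(demo), columns=demo.columns)
print(filled)
```

In a real workflow the imputer would be fit on the training split only and then applied to the test split, so that no information from the test set leaks into the model.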

## What about a method that can handle missing values?

In [112]:
from sklearn.ensemble import RandomForestRegressor

# Train a shallow random forest on the same training split
regr = RandomForestRegressor(max_depth=2, random_state=0, n_estimators=100)
regr = regr.fit(X_train, y_train)

In [113]:
print(regr.feature_importances_)

[  0.00000000e+00   0.00000000e+00   0.00000000e+00   1.98284033e-03
   2.13071129e-03   0.00000000e+00   1.13071900e-04   0.00000000e+00
   0.00000000e+00   1.95503695e-03   2.43840993e-02   6.66350065e-02
   8.45116922e-04   4.19087217e-01   0.00000000e+00   8.27942898e-03
   6.29080297e-02   1.88050281e-03   6.59173893e-04   2.66393781e-03
   2.41219578e-03   9.01296480e-03   5.31128256e-03   7.02252467e-02
   3.19514137e-01]
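A bare array of importances is hard to read. The sketch below pairs each importance with its column name and sorts them. It uses synthetic data (X_rf, y_rf, and forest are illustrative names) in which only one column, signal, drives the target; with the notebook's own objects, the same pattern would use regr.feature_importances_ and X_train.columns.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (an assumption): only 'signal' drives the target
rng = np.random.RandomState(0)
X_rf = pd.DataFrame({'noise_a': rng.rand(200),
                     'signal': rng.rand(200),
                     'noise_b': rng.rand(200)})
y_rf = 10 * X_rf['signal']

forest = RandomForestRegressor(max_depth=2, random_state=0, n_estimators=100)
forest.fit(X_rf, y_rf)

# Pair each importance with its column name and sort, largest first
importances = pd.Series(forest.feature_importances_, index=X_rf.columns)
print(importances.sort_values(ascending=False))
```

The importances sum to 1, so each value can be read as that feature's share of the total reduction in error across the trees.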


In [114]:
y_pred_rf = regr.predict(X_test)
print(y_pred_rf)

[  9608.86488835   7954.81637273   7289.71930179   9608.86488835
   7896.59748716   7675.23895293  17404.05444586  17368.14460105
  12938.11659286   7424.62652701  10014.77191179   7732.78465136
  10197.90657104  17268.68975989  16614.17518786  16595.73766802
  27273.60772897   9782.26948604   7289.71930179   7289.71930179
   7289.71930179   7289.71930179  10197.90657104   7767.37320956
   7289.71930179  14082.98427302   7313.46841795  17406.15801637
   7289.71930179  18452.59915327   7289.71930179   9608.86488835
   9554.74982785  28956.43320006   7289.71930179  14718.52335628
  17371.4934111    9633.09000416   7289.71930179  18147.11204486
   7289.71930179   9819.676678    14910.29897524   7289.71930179
  17368.14460105  17313.37123157  13125.25963552  17194.91248894
  16614.17518786   7289.71930179   7289.71930179  18452.59915327
  10228.17543965  16614.17518786   7289.71930179   8532.50495397
  17516.96615263]


In [115]:
from sklearn.metrics import r2_score

r2_score(y_test, y_pred_rf) 

0.81974534463918203