The textbook's instructions are written for R. I reproduce them in Python as best I can.

# Preface

From the textbook, p. 333:
>In the lab, a classification tree was applied to the `Carseats` data set after converting `Sales` into a qualitative response variable. Now we will seek to predict `Sales` using regression trees and related approaches, treating the response as a quantitative variable.

In [50]:
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeRegressor


sns.set()
%matplotlib inline

In [4]:
carseats = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Carseats.csv')
carseats.head(3)

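| | Sales | CompPrice | Income | Advertising | Population | Price | ShelveLoc | Age | Education | Urban | US |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9.5 | 138 | 73 | 11 | 276 | 120 | Bad | 42 | 17 | Yes | Yes |
| 1 | 11.22 | 111 | 48 | 16 | 260 | 83 | Good | 65 | 10 | Yes | Yes |
| 2 | 10.06 | 113 | 35 | 10 | 269 | 80 | Medium | 59 | 12 | Yes | Yes |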


Columns:
1. `Sales` &mdash; unit sales (in thousands) at each location.
1. `CompPrice` &mdash; price charged by competitor at each location.
1. `Income` &mdash; community income level (in thousands of dollars).
1. `Advertising` &mdash; local advertising budget for company at each location (in thousands of dollars).
1. `Population` &mdash; population size in region (in thousands).
1. `Price` &mdash; price company charges for car seats at each site.
1. `ShelveLoc` &mdash; a factor with levels Bad, Good and Medium indicating the quality of the shelving location for the car seats at each site.
1. `Age` &mdash; average age of the local population.
1. `Education` &mdash; education level at each location.
1. `Urban` &mdash; a factor with levels No and Yes to indicate whether the store is in an urban or rural location.
1. `US` &mdash; a factor with levels No and Yes to indicate whether the store is in the US or not.

`ShelveLoc`, `Urban`, and `US` are categorical variables. Unlike R's `tree`, the tree estimators in `sklearn` accept only numeric inputs, so these columns must be converted to dummy variables.

In [11]:
x = pd.get_dummies(carseats.drop('Sales', axis='columns'), drop_first=True)
y = carseats.Sales
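To make the effect of `drop_first=True` concrete, here is a minimal sketch on a toy frame. The frame `toy` and its values are made up for illustration and are not part of the Carseats data.

```python
import pandas as pd

# Toy frame standing in for a few Carseats-like columns (values are made up).
toy = pd.DataFrame({
    'Price': [120, 83, 80],
    'ShelveLoc': ['Bad', 'Good', 'Medium'],
    'Urban': ['Yes', 'No', 'Yes'],
})

# drop_first=True drops the first level of each factor ('Bad' and 'No' here),
# which then acts as the implicit baseline category.
toy_dummies = pd.get_dummies(toy, drop_first=True)
print(list(toy_dummies.columns))
```

Numeric columns pass through unchanged; each factor with $k$ levels contributes $k - 1$ indicator columns.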

# (a)

From the textbook, p. 333:
> Split the data set into a training set and a test set.

In [12]:
x_train, x_test, y_train, y_test = train_test_split(x, y)
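The split above is unseeded, so every run of that cell produces a different partition and therefore different test errors downstream. A minimal sketch of a repeatable split, assuming a fixed `random_state` (the value `0` and the arrays `x_demo`, `y_demo` are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Small synthetic arrays, used only to demonstrate the behaviour.
x_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

# Two calls with the same seed return the same partition.
split_1 = train_test_split(x_demo, y_demo, test_size=0.25, random_state=0)
split_2 = train_test_split(x_demo, y_demo, test_size=0.25, random_state=0)

same_split = all(np.array_equal(a, b) for a, b in zip(split_1, split_2))
print(same_split)
print(split_1[0].shape, split_1[1].shape)
```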

# (b)

From the textbook, p. 333:
> Fit a regression tree to the training set. Plot the tree, and interpret the results. What test MSE do you obtain?

In [29]:
model_b = DecisionTreeRegressor()
model_b.fit(x_train, y_train)
y_pred = model_b.predict(x_test)
mse_b = ((y_test - y_pred)**2).mean()
print(f'MSE = {mse_b}\nmax_depth = {model_b.tree_.max_depth}')

MSE = 5.228434
max_depth = 19
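The exercise also asks for a plot of the tree, which the cell above does not produce. A tree of depth 19 is unreadable when drawn in full, so one option is to inspect a shallow tree instead. The sketch below uses `sklearn.tree.export_text` on synthetic data; the feature names `Price` and `Noise` and the data-generating formula are assumptions for illustration only. `sklearn.tree.plot_tree` is the graphical counterpart.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)

# Synthetic stand-in: the response depends on the first column only.
x_demo = rng.uniform(50, 150, size=(200, 2))
y_demo = 15 - 0.08 * x_demo[:, 0] + rng.normal(0, 0.5, size=200)

# Limit the depth so that the printed rules stay readable.
shallow_tree = DecisionTreeRegressor(max_depth=2, random_state=0)
shallow_tree.fit(x_demo, y_demo)

# Each line is a split rule; leaves show the predicted value.
tree_text = export_text(shallow_tree, feature_names=['Price', 'Noise'])
print(tree_text)
```

Reading the rules from the top gives the interpretation the exercise asks for: the variable in the first split is the one that reduces the squared error the most.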


# (c)

From the textbook, p. 333:
> Use cross-validation in order to determine the optimal level of tree complexity. Does pruning the tree improve the test MSE?

In [26]:
parameters = {'max_depth' : np.arange(1, 100)}
model_c = GridSearchCV(DecisionTreeRegressor(), parameters)
model_c.fit(x_train, y_train)
y_pred = model_c.predict(x_test)
mse_c = ((y_test - y_pred)**2).mean()
print(mse_c)
model_c.best_params_

4.607652359859251


{'max_depth': 8}

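It is difficult to tell whether pruning improves the test MSE: every time I re-run the above cell, it gives different results, sometimes better and sometimes worse than the unpruned tree. In the run shown, the test MSE fell from 5.23 to 4.61. I first suspected the choice of folds, but with the default `cv` setting `GridSearchCV` uses unshuffled K-fold splits, which are deterministic. The variation more likely comes from `DecisionTreeRegressor` itself: without a fixed `random_state` it breaks ties between equally good splits at random.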
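Limiting `max_depth` restricts how far the tree grows rather than pruning it. The closer analogue of the textbook's cost-complexity pruning in `sklearn` is the `ccp_alpha` parameter. Below is a minimal sketch on synthetic data, with `random_state` fixed so that repeated runs agree; the data, the seed, and the thinning of the alpha grid are all assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 10, size=(300, 3))
y_demo = np.sin(x_demo[:, 0]) + 0.1 * x_demo[:, 1] + rng.normal(0, 0.3, size=300)

# Candidate alphas come from the pruning path of a fully grown tree.
full_tree = DecisionTreeRegressor(random_state=0)
path = full_tree.cost_complexity_pruning_path(x_demo, y_demo)
# Clip tiny negative rounding errors and keep every fifth value to thin the grid.
alphas = np.unique(np.clip(path.ccp_alphas, 0, None))[::5]

# Cross-validate over alpha, scoring by (negated) MSE instead of the default R^2.
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    {'ccp_alpha': alphas},
    scoring='neg_mean_squared_error',
    cv=5,
)
search.fit(x_demo, y_demo)

best_alpha = search.best_params_['ccp_alpha']
pruned_leaves = search.best_estimator_.get_n_leaves()
full_leaves = full_tree.fit(x_demo, y_demo).get_n_leaves()
print(best_alpha, pruned_leaves, full_leaves)
```

Comparing the leaf counts shows how much of the full tree the selected alpha removes.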

# (d)

From the textbook, p. 333:
> Use the bagging approach in order to analyze this data. What test MSE do you obtain? Use the `importance()` function to determine which variables are most important.

`BaggingRegressor` in `sklearn` is a general-purpose bagging meta-estimator, so it has no `feature_importances_` attribute of its own. I average the importances of the individual trees manually.

In [49]:
n = 20
model_d = BaggingRegressor(DecisionTreeRegressor(), n_estimators=n)
model_d.fit(x_train, y_train)
y_pred = model_d.predict(x_test)
mse_d = ((y_test - y_pred)**2).mean()
print(f'MSE = {mse_d}')

feature_importances = np.mean([
    tree.feature_importances_ for tree in model_d.estimators_
], axis=0)
pd.DataFrame({'Importance' : feature_importances}
             , index=x.columns
             ).sort_values('Importance', ascending=False)

MSE = 2.8418795974999993


| | Importance |
|---|---|
| ShelveLoc_Good | 0.259215 |
| Price | 0.253496 |
| Age | 0.10633 |
| CompPrice | 0.099939 |
| Advertising | 0.082851 |
| Income | 0.06639 |
| ShelveLoc_Medium | 0.056888 |
| Population | 0.037501 |
| Education | 0.025865 |
| US_Yes | 0.00688 |


The importances vary with each re-run. `ShelveLoc_Good` and `Price` compete for the first place.
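One way to see how stable a ranking is, sketched below, is to report the spread across trees alongside the mean. The helper `bagging_importances` is hypothetical (not part of `sklearn`), and it assumes the default `max_features=1.0`, so that every tree sees all columns in the same order. The data and feature names are synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor


def bagging_importances(model, feature_names):
    """Mean and standard deviation of per-tree importances (hypothetical helper)."""
    per_tree = np.array([tree.feature_importances_ for tree in model.estimators_])
    table = pd.DataFrame(
        {'mean': per_tree.mean(axis=0), 'std': per_tree.std(axis=0)},
        index=feature_names,
    )
    return table.sort_values('mean', ascending=False)


rng = np.random.default_rng(0)
x_demo = rng.normal(size=(200, 3))
# Only the first column carries signal; the other two are noise.
y_demo = 3 * x_demo[:, 0] + rng.normal(0, 0.5, size=200)

bagger = BaggingRegressor(DecisionTreeRegressor(), n_estimators=50, random_state=0)
bagger.fit(x_demo, y_demo)

importances = bagging_importances(bagger, ['signal', 'noise_1', 'noise_2'])
print(importances)
```

When two means differ by less than their standard deviations, their order can be expected to flip between runs.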

# (e)

From the textbook, p. 334:
> Use random forests to analyze this data. What test MSE do you obtain? Use the `importance()` function to determine which variables are most important. Describe the effect of $m$, the number of variables considered at each split, on the error rate obtained.

In [60]:
model_e = RandomForestRegressor()
model_e.fit(x_train, y_train)

pd.DataFrame({'Importance' : model_e.feature_importances_}
             , index=x.columns
            ).sort_values('Importance', ascending=False)

| | Importance |
|---|---|
| Price | 0.281706 |
| ShelveLoc_Good | 0.260531 |
| CompPrice | 0.107302 |
| Age | 0.086523 |
| Advertising | 0.078342 |
| Income | 0.05838 |
| ShelveLoc_Medium | 0.04971 |
| Population | 0.037895 |
| Education | 0.027795 |
| US_Yes | 0.00689 |


This time, `Price` is reliably higher in importance than `ShelveLoc_Good`.
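The exercise also asks for the test MSE of the random forest, which the cell above does not print. A minimal sketch of that computation on synthetic data, using `sklearn.metrics.mean_squared_error` next to the manual formula used elsewhere in this notebook (the data and seeds are assumptions for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
x_demo = rng.normal(size=(300, 4))
y_demo = 2 * x_demo[:, 0] - x_demo[:, 1] + rng.normal(0, 0.5, size=300)
x_tr, x_te, y_tr, y_te = train_test_split(x_demo, y_demo, random_state=0)

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(x_tr, y_tr)
pred = forest.predict(x_te)

# The library function and the manual formula compute the same quantity.
mse_sklearn = mean_squared_error(y_te, pred)
mse_manual = ((y_te - pred) ** 2).mean()
print(mse_sklearn, mse_manual)
```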

In [55]:
p = x_train.shape[1]
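# int() keeps max_features an integer count of predictors rather than a fraction
for m in [1, 2, int(round(np.sqrt(p))), p//2, p]:
    model_e2 = RandomForestRegressor(max_features=m)
    model_e2.fit(x_train, y_train)
    # Predict with the model fitted for this m, not with model_e from the previous cell
    y_pred = model_e2.predict(x_test)
    mse_e2 = ((y_test - y_pred)**2).mean()
    print(mse_e2)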

4.720805975599999
4.091199948800002
3.4343433033999973
2.8992118589999993
2.602508439799999


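The test MSE decreases as $m$, the number of predictors considered at each split, increases. In this run the lowest error is obtained at $m = p$, where the random forest reduces to bagging.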
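Because each forest is itself random, part of the difference between two values of $m$ can be seed noise. A minimal sketch of the same comparison with `random_state` fixed and the results keyed by $m$ is shown below. The data are synthetic, with three informative predictors out of eight, so the trend it prints need not match the one observed on Carseats.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
x_demo = rng.normal(size=(400, 8))
y_demo = (2 * x_demo[:, 0] - x_demo[:, 1] + 0.5 * x_demo[:, 2]
          + rng.normal(0, 0.5, size=400))
x_tr, x_te, y_tr, y_te = train_test_split(x_demo, y_demo, random_state=0)

# Test MSE for each number of predictors considered per split.
results = {}
for m in (1, 2, 4, 8):
    forest = RandomForestRegressor(n_estimators=100, max_features=m, random_state=0)
    forest.fit(x_tr, y_tr)
    results[m] = mean_squared_error(y_te, forest.predict(x_te))

for m, mse in results.items():
    print(f'm = {m}: test MSE = {mse:.3f}')
```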