# Using Jupyter Notebooks
:label:`sec_jupyter`


This section describes how to edit and run the code
in each section of this book
using the Jupyter Notebook. Make sure you have
installed Jupyter and downloaded the
code as described in
:ref:`chap_installation`.
If you want to know more about Jupyter, see the excellent tutorial in
its [documentation](https://jupyter.readthedocs.io/en/latest/).


## Editing and Running the Code Locally

Suppose that the local path of the book's code is `xx/yy/d2l-en/`. Use the shell to change the directory to this path (`cd xx/yy/d2l-en`) and run the command `jupyter notebook`. If your browser does not open the page automatically, open http://localhost:8888 and you will see the Jupyter interface and all the folders containing the code of the book, as shown in :numref:`fig_jupyter00`.

![The folders containing the code of this book.](https://github.com/d2l-ai/d2l-en-colab/blob/master/img/jupyter00.png?raw=1)
:width:`600px`
:label:`fig_jupyter00`


You can access the notebook files by clicking on the folder displayed on the webpage.
They usually have the suffix ".ipynb".
For the sake of brevity, we create a temporary "test.ipynb" file.
The content displayed after you click it is
shown in :numref:`fig_jupyter01`.
This notebook includes a markdown cell and a code cell. The content in the markdown cell includes "This Is a Title" and "This is text.".
The code cell contains two lines of Python code.

![Markdown and code cells in the "test.ipynb" file.](https://github.com/d2l-ai/d2l-en-colab/blob/master/img/jupyter01.png?raw=1)
:width:`600px`
:label:`fig_jupyter01`


Double click on the markdown cell to enter edit mode.
Add a new text string "Hello world." at the end of the cell, as shown in :numref:`fig_jupyter02`.

![Edit the markdown cell.](https://github.com/d2l-ai/d2l-en-colab/blob/master/img/jupyter02.png?raw=1)
:width:`600px`
:label:`fig_jupyter02`


As demonstrated in :numref:`fig_jupyter03`,
click "Cell" $\rightarrow$ "Run Cells" in the menu bar to run the edited cell.

![Run the cell.](https://github.com/d2l-ai/d2l-en-colab/blob/master/img/jupyter03.png?raw=1)
:width:`600px`
:label:`fig_jupyter03`

After running, the markdown cell is shown in :numref:`fig_jupyter04`.

![The markdown cell after running.](https://github.com/d2l-ai/d2l-en-colab/blob/master/img/jupyter04.png?raw=1)
:width:`600px`
:label:`fig_jupyter04`


Next, click on the code cell. After the last line of code, add a line that multiplies the elements by 2, as shown in :numref:`fig_jupyter05`.

![Edit the code cell.](https://github.com/d2l-ai/d2l-en-colab/blob/master/img/jupyter05.png?raw=1)
:width:`600px`
:label:`fig_jupyter05`


You can also run the cell with a shortcut ("Ctrl + Enter" by default) and obtain the output shown in :numref:`fig_jupyter06`.

![Run the code cell to obtain the output.](https://github.com/d2l-ai/d2l-en-colab/blob/master/img/jupyter06.png?raw=1)
:width:`600px`
:label:`fig_jupyter06`


When a notebook contains more cells, we can click "Kernel" $\rightarrow$ "Restart & Run All" in the menu bar to run all the cells in the entire notebook. By clicking "Help" $\rightarrow$ "Edit Keyboard Shortcuts" in the menu bar, you can edit the shortcuts according to your preferences.

## Advanced Options

Beyond local editing, two things are quite important: editing the notebooks in the markdown format and running Jupyter remotely.
The latter matters when we want to run the code on a faster server.
The former matters since Jupyter's native ipynb format stores a lot of auxiliary data that is
irrelevant to the content,
mostly related to how and where the code is run.
This is confusing for Git, making
reviewing contributions very difficult.
Fortunately there is an alternative---native editing in the markdown format.

### Markdown Files in Jupyter

If you wish to contribute to the content of this book, you need to modify the
source file (md file, not ipynb file) on GitHub.
Using the notedown plugin we
can modify notebooks in the md format directly in Jupyter.


First, install the notedown plugin, run the Jupyter Notebook, and load the plugin:

```
pip install d2l-notedown  # You may need to uninstall the original notedown.
jupyter notebook --NotebookApp.contents_manager_class='notedown.NotedownContentsManager'
```

You may also turn on the notedown plugin by default whenever you run the Jupyter Notebook.
First, generate a Jupyter Notebook configuration file (if it has already been generated, you can skip this step).

```
jupyter notebook --generate-config
```

Then, add the following line to the end of the Jupyter Notebook configuration file (for Linux or macOS, usually in the path `~/.jupyter/jupyter_notebook_config.py`):

```
c.NotebookApp.contents_manager_class = 'notedown.NotedownContentsManager'
```

After that, you only need to run the `jupyter notebook` command to turn on the notedown plugin by default.

### Running Jupyter Notebooks on a Remote Server

Sometimes, you may want to run Jupyter notebooks on a remote server and access them through a browser on your local computer. If Linux or macOS is installed on your local machine (Windows can also support this function through third-party software such as PuTTY), you can use port forwarding:

```
ssh myserver -L 8888:localhost:8888
```

Here `myserver` is the address of the remote server.
We can then open http://localhost:8888 locally to reach the Jupyter server running on `myserver`. We will detail how to run Jupyter notebooks on AWS instances
later in this appendix.
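
Before opening the browser, it can help to confirm that the forwarded port actually accepts connections. The following is a minimal sketch using only the Python standard library; the helper name `port_is_open` is a hypothetical name chosen for illustration, and the default port 8888 simply matches the forwarding command shown earlier.

```python
import socket

def port_is_open(host="localhost", port=8888, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False

print("Jupyter reachable on localhost:8888:", port_is_open())
```

A result of `False` usually means that either the SSH tunnel is not running or no process is listening on the remote port.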

### Timing

We can use the `ExecuteTime` plugin to time the execution of each code cell in Jupyter notebooks.
Use the following commands to install the plugin:

```
pip install jupyter_contrib_nbextensions
jupyter contrib nbextension install --user
jupyter nbextension enable execute_time/ExecuteTime
```
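
When the extension is not installed, a piece of code can also be timed by hand. The following is a minimal sketch using only the Python standard library; the helper `time_call` and the toy workload are illustrative assumptions and not part of the book's code.

```python
import time

def time_call(fn, *args, repeats=3):
    """Return (best wall-clock seconds over `repeats` runs, last result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args)
        best = min(best, time.perf_counter() - start)
    return best, result

# Toy workload: sum of squares of the integers 0..99,999.
elapsed, total = time_call(lambda n: sum(i * i for i in range(n)), 100_000)
print(f"best of 3 runs: {elapsed:.6f} s, result = {total}")
```

Inside a notebook, the built-in IPython magics `%timeit` and `%%time` serve the same purpose without any helper function.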

## Summary

* Using the Jupyter Notebook tool, we can edit, run, and contribute to each section of the book.
* We can run Jupyter notebooks on remote servers using port forwarding.


## Exercises

1. Edit and run the code in this book with the Jupyter Notebook on your local machine.
1. Edit and run the code in this book with the Jupyter Notebook *remotely* via port forwarding.
1. Compare the running time of the operations $\mathbf{A}^\top \mathbf{B}$ and $\mathbf{A} \mathbf{B}$ for two square matrices in $\mathbb{R}^{1024 \times 1024}$. Which one is faster?


[Discussions](https://discuss.d2l.ai/t/421)


Q1: K-Fold Cross Validation for Multiple Linear Regression (Least Square Error Fit)  
Download the dataset regarding USA House Price Prediction from the following link:  
https://drive.google.com/file/d/1O_NwpJT-8xGfU_-3llUl2sgPu0xllOrX/view?usp=sharing  
Load the dataset and implement 5-fold cross validation for multiple linear regression
(using least square error fit).  
Steps:  
a) Divide the dataset into input features (all columns except price) and output variable  
(price)  
b) Scale the values of input features.  
c) Divide input and output features into five folds.  
d) Run five iterations; in each iteration, use one fold as the test set and the remaining
four folds as the training set. Find the beta (𝛽) matrix, predicted values, and R2_score
for each iteration using least square error fit.  
e) Use the best value of (𝛽) matrix (for which R2_score is maximum), to train t

In [None]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import KFold, train_test_split
from sklearn.metrics import r2_score
file_path = "/content/USA_Housing.csv"  # adjust to where the CSV is stored
df = pd.read_csv(file_path)

X = df.drop("Price", axis=1).values
y = df["Price"].values.reshape(-1, 1)

print(X)
print(y)

[[7.95454586e+04 5.68286132e+00 7.00918814e+00 4.09000000e+00
  2.30868005e+04]
 [7.92486424e+04 6.00289981e+00 6.73082102e+00 3.09000000e+00
  4.01730722e+04]
 [6.12870672e+04 5.86588984e+00 8.51272743e+00 5.13000000e+00
  3.68821594e+04]
 ...
 [6.33906869e+04 7.25059062e+00 4.80508098e+00 2.13000000e+00
  3.32661455e+04]
 [6.80013312e+04 5.53438842e+00 7.13014386e+00 5.44000000e+00
  4.26256202e+04]
 [6.55105818e+04 5.99230531e+00 6.79233610e+00 4.07000000e+00
  4.65012838e+04]]
[[1059033.558]
 [1505890.915]
 [1058987.988]
 ...
 [1030729.583]
 [1198656.872]
 [1298950.48 ]]


In [None]:
scaler = StandardScaler()
X = scaler.fit_transform(X)
print(X)

[[ 1.02865969 -0.29692705  0.02127433  0.08806222 -1.31759867]
 [ 1.00080775  0.02590164 -0.25550611 -0.72230146  0.40399945]
 [-0.68462915 -0.11230283  1.5162435   0.93084045  0.07240989]
 ...
 [-0.48723454  1.28447022 -2.17026949 -1.50025059 -0.29193658]
 [-0.05459152 -0.44669439  0.14154061  1.18205319  0.65111608]
 [-0.28831272  0.01521477 -0.19434166  0.07185495  1.04162464]]


In [None]:
kf = KFold(n_splits=5, shuffle=True, random_state=42)
r2_scores, betas = [], []

for train_idx, test_idx in kf.split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    X_train_bias = np.c_[np.ones((X_train.shape[0], 1)), X_train]
    X_test_bias = np.c_[np.ones((X_test.shape[0], 1)), X_test]
    beta = np.linalg.inv(X_train_bias.T @ X_train_bias) @ X_train_bias.T @ y_train
    y_pred = X_test_bias @ beta
    r2 = r2_score(y_test, y_pred)
    r2_scores.append(r2)
    betas.append(beta)
print("betas",betas)
print("R2 score",r2_scores)

betas [array([[1232002.6748241 ],
       [ 230745.94073479],
       [ 163243.27314515],
       [ 120309.77397759],
       [   3011.45976111],
       [ 151552.63069359]]), array([[1232037.85755946],
       [ 229081.97914235],
       [ 165882.1605634 ],
       [ 121536.57475055],
       [   2092.4478622 ],
       [ 150874.99274586]]), array([[1231951.92563846],
       [ 230224.50511001],
       [ 162766.17455493],
       [ 121022.77324577],
       [   1247.16258975],
       [ 150234.77720419]]), array([[1232751.46486511],
       [ 229500.10043209],
       [ 165212.07110924],
       [ 122839.9376815 ],
       [   3063.71699324],
       [ 150917.88484984]]), array([[1.23161736e+06],
       [2.30225051e+05],
       [1.63956839e+05],
       [1.21115120e+05],
       [7.83467170e+02],
       [1.50662447e+05]])]
R2 score [0.9179971706985147, 0.9145677884802818, 0.9116116385364478, 0.9193091764960816, 0.9243869413350316]
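
The five fold scores printed above are often summarized by their mean and spread, in addition to picking the best fold. The following is a small sketch using only the standard library; the scores are copied from the output above, and the summary itself is an illustrative addition, not part of the exercise statement.

```python
from statistics import mean, stdev

# R2 scores copied from the five-fold output above.
r2_scores = [0.9179971706985147, 0.9145677884802818, 0.9116116385364478,
             0.9193091764960816, 0.9243869413350316]

# Index of the fold with the highest score (0-based).
best_fold = max(range(len(r2_scores)), key=r2_scores.__getitem__)

print(f"mean R2 = {mean(r2_scores):.4f}, sample std = {stdev(r2_scores):.4f}")
print("best fold (0-based):", best_fold)
```

A small spread across folds suggests that the fit does not depend strongly on which rows end up in the test fold.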


In [None]:
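# Fold with the highest R2 score (recorded for reference).
best_idx = np.argmax(r2_scores)
best_beta = betas[best_idx]
# Note: best_beta is not used below. The coefficients are refit by least
# squares on a fresh 70/30 split, and final_beta is what gets printed.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Prepend a column of ones so that the first coefficient is the intercept.
X_train_bias = np.c_[np.ones((X_train.shape[0], 1)), X_train]
X_test_bias = np.c_[np.ones((X_test.shape[0], 1)), X_test]
# Normal equation: beta = (X^T X)^{-1} X^T y
final_beta = np.linalg.inv(X_train_bias.T @ X_train_bias) @ X_train_bias.T @ y_train
y_pred_final = X_test_bias @ final_beta
final_r2 = r2_score(y_test, y_pred_final)
print("Final R2 on 70-30 split:", final_r2)
print("Best Beta Matrix:\n", final_beta)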

Final R2 on 70-30 split: 0.9146818498916266
Best Beta Matrix:
 [[1231278.63687691]
 [ 230464.52520478]
 [ 164159.19982569]
 [ 120514.71328324]
 [   2913.62424674]
 [ 151019.35865134]]


Concept of Validation set for Multiple Linear Regression (Gradient Descent  
Optimization)  
Consider the same dataset of Q1, rather than dividing the dataset into five folds, divide the
dataset into training set (56%), validation set (14%), and test set (30%).  
Consider four different values of learning rate i.e. {0.001,0.01,0.1,1}. Compute the values of
regression coefficients for each value of learning rate after 1000 iterations.  
For each set of regression coefficients,

In [None]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# --------------------------
# Load dataset
# --------------------------
file_path = "/content/USA_Housing.csv"  # adjust to where the CSV is stored
df = pd.read_csv(file_path)

# Features and target
X_all = df.drop("Price", axis=1).values
y_all = df["Price"].values.reshape(-1, 1)
X_remain, X_test, y_remain, y_test = train_test_split(
    X_all, y_all, test_size=0.30, random_state=42, shuffle=True)
X_train, X_val, y_train, y_val = train_test_split(
    X_remain, y_remain, test_size=0.2, random_state=42, shuffle=True)

print("Train size:", X_train.shape[0])
print("Validation size:", X_val.shape[0])
print("Test size:", X_test.shape[0])
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)
def add_bias(X):
    return np.c_[np.ones((X.shape[0], 1)), X]

Xtr = add_bias(X_train_scaled)
Xval = add_bias(X_val_scaled)
Xte = add_bias(X_test_scaled)

ytr = y_train
yval = y_val
yte = y_test

def gradient_descent(X, y, lr=0.01, iters=1000):
    m, n = X.shape
    beta = np.zeros((n, 1))  # initialize coefficients
    for i in range(iters):
        preds = X @ beta
        error = preds - y
        grad = (2/m) * (X.T @ error)
        beta = beta - lr * grad
    return beta
learning_rates = [0.001, 0.01, 0.1, 1]
results = []

for lr in learning_rates:
    beta = gradient_descent(Xtr, ytr, lr=lr, iters=1000)
    yval_pred = Xval @ beta
    ytest_pred = Xte @ beta
    r2_val = r2_score(yval, yval_pred)
    r2_test = r2_score(yte, ytest_pred)

    results.append({
        "learning_rate": lr,
        "beta": beta.flatten(),
        "R2_validation": r2_val,
        "R2_test": r2_test
    })
for res in results:
    print("\nLearning Rate:", res["learning_rate"])
    print("Beta coefficients:", res["beta"])
    print("Validation R²:", res["R2_validation"])
    print("Test R²:", res["R2_test"])


Train size: 2800
Validation size: 700
Test size: 1500

Learning Rate: 0.001
Beta coefficients: [1065976.38830794  201488.20579361  140561.40223826   97763.79716349
   20752.2271235   130367.9841764 ]
Validation R²: 0.6873855137112391
Test R²: 0.6523293554451501

Learning Rate: 0.01
Beta coefficients: [1232434.57365088  234562.95835223  162415.89218475  121759.16106366
    2815.31550729  151577.64115977]
Validation R²: 0.9097996244888865
Test R²: 0.9147569623458182

Learning Rate: 0.1
Beta coefficients: [1232434.57572502  234562.99454568  162415.95445045  121760.31153933
    2814.16200824  151577.59032902]
Validation R²: 0.9097995626742028
Test R²: 0.9147570103083724

Learning Rate: 1
Beta coefficients: [ 1.93782797e+272 -3.65591444e+285 -1.11301648e+285 -6.12761090e+286
 -6.14426004e+286  3.45630746e+285]
Validation R²: -inf
Test R²: -inf
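
The blow-up at learning rate 1 is consistent with the stability condition for gradient descent on a mean squared error objective. With the update $\beta \leftarrow \beta - \eta \frac{2}{m} X^\top (X\beta - y)$, the error shrinks only if $\eta < 1/\lambda_{\max}$, where $\lambda_{\max}$ is the largest eigenvalue of $X^\top X / m$. For standardized features this matrix is a correlation matrix, whose largest eigenvalue is at least 1, so $\eta = 1$ sits at or beyond the boundary. The following is a minimal sketch of this effect on a toy problem with two features; the correlation value 0.3 and the target coefficients are hypothetical and are not taken from the housing data.

```python
def max_stable_lr(r):
    """Upper bound on the learning rate for two standardized features with correlation r.

    The matrix X^T X / m is [[1, r], [r, 1]] with eigenvalues 1 + r and 1 - r,
    so gradient descent contracts only if lr < 1 / (1 + |r|).
    """
    return 1.0 / (1.0 + abs(r))

def gd_error(lr, r, steps=100):
    """Run gradient descent on a noiseless two-feature MSE; return the final max error."""
    gram = [[1.0, r], [r, 1.0]]   # X^T X / m for standardized features
    target = [3.0, -2.0]          # hypothetical true coefficients
    beta = [0.0, 0.0]
    for _ in range(steps):
        err = [beta[0] - target[0], beta[1] - target[1]]
        # Gradient of the MSE: 2 * (X^T X / m) @ (beta - target)
        grad = [2 * (gram[0][0] * err[0] + gram[0][1] * err[1]),
                2 * (gram[1][0] * err[0] + gram[1][1] * err[1])]
        beta = [beta[0] - lr * grad[0], beta[1] - lr * grad[1]]
    return max(abs(beta[0] - target[0]), abs(beta[1] - target[1]))

print("stability bound:", max_stable_lr(0.3))
for lr in (0.01, 0.1, 1.0):
    print(f"lr = {lr}: final error = {gd_error(lr, 0.3):.3e}")
```

The sketch uses only 100 steps so that the diverging run stays within floating-point range; with 1000 steps, as in the cell above, the values overflow.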




Download the dataset regarding Car Price Prediction from the following link:  
https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data  
1. Load the dataset with following column names ["symboling", "normalized_losses",  
"make", "fuel_type", "aspiration","num_doors", "body_style", "drive_wheels",  
"engine_location", "wheel_base", "length", "width", "height", "curb_weight",  
"engine_type", "num_cylinders", "engine_size", "fuel_system", "bore", "stroke",  
"compression_ratio", "horsepower", "peak_rpm", "city_mpg", "highway_mpg", "price"]  
and replace all ? values with NaN  
2. Replace all NaN values with central tendency imputation. Drop the rows with NaN  
values in price column  
3. There are 10 columns in the dataset with non-numeric values. Convert these values to  
numeric values using following scheme:  
(i) For "num_doors" and "num_cylinders": convert number words to figures,  
e.g., two to 2  
(ii) For "body_style", "drive_wheels": use dummy encoding scheme  
(iii) For "make", "aspiration", "engine_location", "fuel_type": use label encoding  
scheme  
(iv) For "fuel_system": replace values containing the string pfi with 1 and all other values with 0.  
(v) For "engine_type": replace values containing the string ohc with 1 and all other values with 0.  
4. Divide the dataset into input features (all columns except price) and output variable  
(price). Scale all input features.  
5. Train a linear regressor on 70% of data (using inbuilt linear regression function of  
Python) and test its performance on remaining 30% of data.  
6. Reduce the dimensionality of the feature set using inbuilt PCA decomposition and then  
again train a linear regressor on 70% of reduced data (using inbuilt linear regres

---

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.decomposition import PCA
from sklearn.metrics import r2_score

# 1. Load dataset
col_names = ["symboling", "normalized_losses", "make", "fuel_type", "aspiration",
             "num_doors", "body_style", "drive_wheels", "engine_location",
             "wheel_base", "length", "width", "height", "curb_weight",
             "engine_type", "num_cylinders", "engine_size", "fuel_system",
             "bore", "stroke", "compression_ratio", "horsepower", "peak_rpm",
             "city_mpg", "highway_mpg", "price"]

df = pd.read_csv("imports-85.data.txt", names=col_names)
df.replace("?", np.nan, inplace=True)
df

Unnamed: 0,symboling,normalized_losses,make,fuel_type,aspiration,num_doors,body_style,drive_wheels,engine_location,wheel_base,...,engine_size,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price
0,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.40,10.0,102,5500,24,30,13950
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.40,8.0,115,5500,18,22,17450
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
200,-1,95,volvo,gas,std,four,sedan,rwd,front,109.1,...,141,mpfi,3.78,3.15,9.5,114,5400,23,28,16845
201,-1,95,volvo,gas,turbo,four,sedan,rwd,front,109.1,...,141,mpfi,3.78,3.15,8.7,160,5300,19,25,19045
202,-1,95,volvo,gas,std,four,sedan,rwd,front,109.1,...,173,mpfi,3.58,2.87,8.8,134,5500,18,23,21485
203,-1,95,volvo,diesel,turbo,four,sedan,rwd,front,109.1,...,145,idi,3.01,3.40,23.0,106,4800,26,27,22470


In [None]:
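# 2. Handle missing values: mode for object (string) columns, mean for numeric ones
# Note: columns that contained "?" were read as strings, so they are object dtype
# and receive mode imputation here.
df = df.apply(lambda col: col.fillna(col.mode()[0]) if col.dtypes=="object" else col.fillna(col.mean()))
# Note: missing prices were already filled by the line above, so this dropna
# removes nothing; to follow step 2 literally, drop these rows before imputing.
df = df.dropna(subset=["price"])
df["price"] = pd.to_numeric(df["price"])
df.isnull().sum()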

Unnamed: 0,0
symboling,0
normalized_losses,0
make,0
fuel_type,0
aspiration,0
num_doors,0
body_style,0
drive_wheels,0
engine_location,0
wheel_base,0


In [None]:
# 3. Encode categorical columns
# (i) Convert words to numbers
door_map = {"two": 2, "four": 4}
cyl_map = {"two": 2, "three": 3, "four": 4, "five": 5, "six": 6,
           "eight": 8, "twelve": 12}
df["num_doors"] = df["num_doors"].map(door_map)
df["num_cylinders"] = df["num_cylinders"].map(cyl_map)
df


Unnamed: 0,symboling,normalized_losses,make,fuel_type,aspiration,num_doors,body_style,drive_wheels,engine_location,wheel_base,...,engine_size,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price
0,3,161,alfa-romero,gas,std,2,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,161,alfa-romero,gas,std,2,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,161,alfa-romero,gas,std,2,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164,audi,gas,std,4,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.40,10.0,102,5500,24,30,13950
4,2,164,audi,gas,std,4,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.40,8.0,115,5500,18,22,17450
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
200,-1,95,volvo,gas,std,4,sedan,rwd,front,109.1,...,141,mpfi,3.78,3.15,9.5,114,5400,23,28,16845
201,-1,95,volvo,gas,turbo,4,sedan,rwd,front,109.1,...,141,mpfi,3.78,3.15,8.7,160,5300,19,25,19045
202,-1,95,volvo,gas,std,4,sedan,rwd,front,109.1,...,173,mpfi,3.58,2.87,8.8,134,5500,18,23,21485
203,-1,95,volvo,diesel,turbo,4,sedan,rwd,front,109.1,...,145,idi,3.01,3.40,23.0,106,4800,26,27,22470


In [None]:
# (ii) Dummy encoding (one-hot with the first level of each column dropped)
df = pd.get_dummies(df, columns=["body_style", "drive_wheels"], drop_first=True)
df

Unnamed: 0,symboling,normalized_losses,make,fuel_type,aspiration,num_doors,engine_location,wheel_base,length,width,...,peak_rpm,city_mpg,highway_mpg,price,body_style_hardtop,body_style_hatchback,body_style_sedan,body_style_wagon,drive_wheels_fwd,drive_wheels_rwd
0,3,161,alfa-romero,gas,std,2,front,88.6,168.8,64.1,...,5000,21,27,13495,False,False,False,False,False,True
1,3,161,alfa-romero,gas,std,2,front,88.6,168.8,64.1,...,5000,21,27,16500,False,False,False,False,False,True
2,1,161,alfa-romero,gas,std,2,front,94.5,171.2,65.5,...,5000,19,26,16500,False,True,False,False,False,True
3,2,164,audi,gas,std,4,front,99.8,176.6,66.2,...,5500,24,30,13950,False,False,True,False,True,False
4,2,164,audi,gas,std,4,front,99.4,176.6,66.4,...,5500,18,22,17450,False,False,True,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
200,-1,95,volvo,gas,std,4,front,109.1,188.8,68.9,...,5400,23,28,16845,False,False,True,False,False,True
201,-1,95,volvo,gas,turbo,4,front,109.1,188.8,68.8,...,5300,19,25,19045,False,False,True,False,False,True
202,-1,95,volvo,gas,std,4,front,109.1,188.8,68.9,...,5500,18,23,21485,False,False,True,False,False,True
203,-1,95,volvo,diesel,turbo,4,front,109.1,188.8,68.9,...,4800,26,27,22470,False,False,True,False,False,True


In [None]:
# (iii) Label encoding
label_cols = ["make", "aspiration", "engine_location", "fuel_type"]
le = LabelEncoder()
for col in label_cols:
    df[col] = le.fit_transform(df[col])
df

Unnamed: 0,symboling,normalized_losses,make,fuel_type,aspiration,num_doors,engine_location,wheel_base,length,width,...,peak_rpm,city_mpg,highway_mpg,price,body_style_hardtop,body_style_hatchback,body_style_sedan,body_style_wagon,drive_wheels_fwd,drive_wheels_rwd
0,3,161,0,1,0,2,0,88.6,168.8,64.1,...,5000,21,27,13495,False,False,False,False,False,True
1,3,161,0,1,0,2,0,88.6,168.8,64.1,...,5000,21,27,16500,False,False,False,False,False,True
2,1,161,0,1,0,2,0,94.5,171.2,65.5,...,5000,19,26,16500,False,True,False,False,False,True
3,2,164,1,1,0,4,0,99.8,176.6,66.2,...,5500,24,30,13950,False,False,True,False,True,False
4,2,164,1,1,0,4,0,99.4,176.6,66.4,...,5500,18,22,17450,False,False,True,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
200,-1,95,21,1,0,4,0,109.1,188.8,68.9,...,5400,23,28,16845,False,False,True,False,False,True
201,-1,95,21,1,1,4,0,109.1,188.8,68.8,...,5300,19,25,19045,False,False,True,False,False,True
202,-1,95,21,1,0,4,0,109.1,188.8,68.9,...,5500,18,23,21485,False,False,True,False,False,True
203,-1,95,21,0,1,4,0,109.1,188.8,68.9,...,4800,26,27,22470,False,False,True,False,False,True
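
To make the integer codes in the table above easier to read, the following is a minimal pure-Python sketch of what label encoding does: distinct values are sorted and each is replaced by its position in that order. The helper `label_encode` is a hypothetical stand-in written for illustration and is not scikit-learn's implementation.

```python
def label_encode(values):
    """Map each distinct value to its rank in sorted order; return (codes, classes)."""
    classes = sorted(set(values))
    index = {c: i for i, c in enumerate(classes)}
    return [index[v] for v in values], classes

codes, classes = label_encode(["gas", "diesel", "gas", "gas"])
print(codes, classes)
```

This matches the table, where "diesel" becomes 0 and "gas" becomes 1. Note that such codes impose an arbitrary order on unordered categories, which a linear model then treats as numeric.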


In [None]:
# (iv) Fuel system: pfi -> 1 else 0
df["fuel_system"] = df["fuel_system"].apply(lambda x: 1 if "pfi" in x else 0)
df["fuel_system"]

Unnamed: 0,fuel_system
0,1
1,1
2,1
3,1
4,1
...,...
200,1
201,1
202,1
203,0


In [None]:
# (v) Engine type: ohc -> 1 else 0
df["engine_type"] = df["engine_type"].apply(lambda x: 1 if "ohc" in x else 0)
df["engine_type"]

Unnamed: 0,engine_type
0,1
1,1
2,1
3,1
4,1
...,...
200,1
201,1
202,1
203,1


In [None]:
# 4. Split features and target
X = df.drop("price", axis=1)
y = df["price"]

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)

In [None]:
# 5. Linear Regression (original features)
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
print("R² score (original features):", r2_score(y_test, y_pred))

R² score (original features): 0.7895045576733848


In [None]:
# 6. PCA + Regression
pca = PCA(0.95)  # keep 95% variance
X_pca = pca.fit_transform(X_scaled)

Xp_train, Xp_test, yp_train, yp_test = train_test_split(X_pca, y, test_size=0.3, random_state=42)

lr_pca = LinearRegression()
lr_pca.fit(Xp_train, yp_train)
yp_pred = lr_pca.predict(Xp_test)
print("R² score (PCA features):", r2_score(yp_test, yp_pred))
print("Number of PCA components used:", pca.n_components_)

R² score (PCA features): 0.7478420860380317
Number of PCA components used: 16
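
Passing a fraction such as 0.95 to PCA asks for enough components to explain that share of the total variance. The following is a minimal sketch of the selection idea in pure Python; the helper `components_for_variance` and the toy eigenvalues are hypothetical and are not taken from the car price data, and the sketch mirrors the idea rather than scikit-learn's exact implementation.

```python
def components_for_variance(eigenvalues, threshold=0.95):
    """Smallest k whose top-k eigenvalues explain at least `threshold` of the total variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running / total >= threshold:
            return k
    return len(ordered)

# Hypothetical covariance eigenvalues, NOT computed from the car price data.
toy_eigenvalues = [4.0, 2.5, 1.5, 1.0, 0.6, 0.3, 0.1]
print(components_for_variance(toy_eigenvalues))
```

Because the discarded components still carry some variance, a regressor trained on the reduced features can score somewhat lower than one trained on all features, as in the two R² values above.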
