# Intro to Scientific Computing

## Jupyter

You are running a [Jupyter](https://jupyter.org/) Notebook. This is an interactive development environment where we can selectively run "cells". Cells can run code, provide documentation, display plots, and more. When you highlight a cell and press Shift+Enter, the cell "runs". Be careful, though: you can run cells out of order, so what a variable holds depends on the order in which you actually ran the cells, not on their order on the page!

## [numpy](https://numpy.org/)

`numpy` is an "array programming library". It allows us to perform mathematical operations efficiently on vectors, matrices, and higher-dimensional arrays. The fundamental component of `numpy` is the `array`. Remember that in machine learning we want an `X` feature matrix (the inputs to our model) and a `y` vector of ground truth. Generally, we will want to use `numpy` to construct `X` and `y`.

In [None]:
# Common convention is to rename numpy `np`.
import numpy as np

In [None]:
X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
X

In [None]:
X.shape

In [None]:
y = np.array([1, 2, 3])
y

In [None]:
y.shape

You can perform many linear algebra operations with `numpy`:

In [None]:
# Scalar operations
y * 2 + 1

In [None]:
# Elementwise multiplication
X * X

In [None]:
# Matrix multiplication
X @ X
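
The difference between `*` and `@` is easier to see on a matrix small enough to check by hand. Here is a minimal sketch using a made-up 2x2 matrix `A` (not part of the notebook's data):

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])

# Elementwise: each entry is multiplied by the entry in the same position.
elementwise = A * A

# Matrix product: each output entry is a row of the left matrix
# dotted with a column of the right matrix.
# Top-left entry: 1*1 + 2*3 = 7
matmul = A @ A

print(elementwise)
print(matmul)
```

The two results only agree in special cases (for example, diagonal matrices), so it is worth double checking which operator you meant.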

In [None]:
# Multiplication of vectors and matrices

In [None]:
# Create a column vector (a 2-D array with shape (3, 1)) by using `reshape`
beta = np.array([0.1, 0.2, 0.3]).reshape(-1, 1)
beta.shape

In [None]:
y_pred = X.dot(beta)

In [None]:
y_pred
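
The shape of the result depends on whether the vector is 1-D or a 2-D column. The sketch below repeats the same `X` and coefficients as above; the names `beta_1d`, `beta_col`, `pred_1d`, and `pred_col` are just for illustration:

```python
import numpy as np

X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

beta_1d = np.array([0.1, 0.2, 0.3])   # shape (3,)
beta_col = beta_1d.reshape(-1, 1)     # shape (3, 1)

pred_1d = X @ beta_1d     # 1-D result, shape (3,)
pred_col = X @ beta_col   # 2-D column result, shape (3, 1)

print(pred_1d.shape, pred_col.shape)

# For 2-D inputs, `X.dot(beta)` and `X @ beta` compute the same product.
print(np.allclose(X.dot(beta_col), pred_col))
```

Each prediction is a row of `X` dotted with the coefficients; the first one is `1*0.1 + 2*0.2 + 3*0.3 = 1.4`.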

## [pandas](https://pandas.pydata.org/)

`pandas` is a data analysis library built on top of `numpy`. The fundamental component of `pandas` is the `DataFrame`, which is similar to an Excel spreadsheet. You can alternatively think of it as a numpy matrix where each row is a data point and each column has a name. `pandas` also has functionality for reading and writing data in different formats and for plotting data.

I grabbed an actual housing dataset from [Kaggle](https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/overview). The data consists of information about houses in Ames, Iowa, including the sale price of each house. The dataset is called `train.csv`, and it's located in the `data` folder one level above this notebook's folder. There is also a file called `data_description.txt` that contains information about the data. We can read that data from disk into a pandas `DataFrame`.

In [None]:
# Common convention is to rename pandas `pd`.
import pandas as pd

In [None]:
# Read the CSV file into a pandas dataframe
housing_data = pd.read_csv("../data/train.csv")
# Display the first 5 rows of the dataframe.
housing_data.head()

An individual row or column of a `DataFrame` is a `Series`. A `Series` is kind of like a one-dimensional `DataFrame`.

In [None]:
# Select the row at integer position 3 (the fourth row, since positions start at 0)
housing_data.iloc[3]

In [None]:
# Select the SalePrice column
housing_data["SalePrice"]

In [None]:
# Select multiple columns
housing_data[["SalePrice", "LotArea"]]
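
To see these selection styles side by side without loading the CSV, here is a small sketch on a toy `DataFrame`. The values and the index labels `"a"`, `"b"`, `"c"` are made up; only the column names are borrowed from the housing data:

```python
import pandas as pd

# Made-up values; only the column names come from the housing dataset.
toy = pd.DataFrame(
    {"SalePrice": [200000, 150000, 310000], "LotArea": [8000, 6500, 12000]},
    index=["a", "b", "c"],
)

row_by_position = toy.iloc[0]            # first row, by integer position
row_by_label = toy.loc["a"]              # same row, by index label
column = toy["SalePrice"]                # single name -> Series
subset = toy[["SalePrice", "LotArea"]]   # list of names -> DataFrame

print(type(row_by_position).__name__)
print(type(column).__name__)
print(type(subset).__name__)
```

`iloc` always counts positions starting from 0, while `loc` uses whatever labels the index holds.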

Similar to `numpy`, we can perform mathematical operations on the DataFrame.

In [None]:
housing_data["SalePrice"] * 2 + housing_data["LotArea"]

We can add and remove columns:

In [None]:
housing_data["double_sale_price"] = housing_data["SalePrice"] * 2

In [None]:
housing_data.columns

In [None]:
"Street" in housing_data.columns

In [None]:
housing_data = housing_data.drop(columns=["Street"])
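
The reason for assigning the result back to `housing_data` is that, by default, `drop` returns a new `DataFrame` and leaves the original alone. A minimal sketch on a made-up two-row `DataFrame`:

```python
import pandas as pd

# Made-up values for illustration.
toy = pd.DataFrame({"SalePrice": [200000, 150000], "Street": ["Pave", "Grvl"]})

dropped = toy.drop(columns=["Street"])

# The original still has the column; only the returned copy lacks it.
print("Street" in toy.columns)
print("Street" in dropped.columns)
```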

In [None]:
"Street" in housing_data.columns

We can also convert DataFrames and Series to numpy arrays:

In [None]:
# Convert DataFrame to 2D array (i.e. a matrix)
housing_data[["SalePrice", "LotArea"]].values

In [None]:
# Convert Series to a 1D array (i.e. a vector)
housing_data["SalePrice"].values
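
pandas also provides a `to_numpy()` method, which its documentation recommends over `.values` for this conversion. A small sketch on made-up data showing the resulting shapes:

```python
import pandas as pd

# Made-up values; only the column names come from the housing dataset.
toy = pd.DataFrame(
    {"SalePrice": [200000, 150000, 310000], "LotArea": [8000, 6500, 12000]}
)

matrix = toy[["SalePrice", "LotArea"]].to_numpy()   # 2-D array
vector = toy["SalePrice"].to_numpy()                # 1-D array

print(matrix.shape)
print(vector.shape)
```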

## [scikit-learn](https://scikit-learn.org/stable/)

`scikit-learn` is a library for performing machine learning in Python. This will be the primary library that we use this semester for training ML models. Recall that our constant goal is to construct the `X` feature matrix and the `y` vector of ground truth. In `scikit-learn`, you "fit" (aka train) your model with `X` and `y` as inputs. `X` and `y` can be numpy arrays, pandas DataFrames, or lists.

The `scikit-learn` process is to instantiate a model, `fit` it on `X` and `y`, and then `predict` using `X`.

Just like in the Week 1 slides, let's build a model to predict the sale price using the square footage of the house. There are different columns for each floor's square footage in our dataset, so we will add these all together to get a single column containing the total square footage:

In [None]:
# Reload clean housing data because we previously manipulated it.
housing_data = pd.read_csv("../data/train.csv")

In [None]:
housing_data["total_area"] = (
    housing_data["1stFlrSF"] 
    + housing_data["2ndFlrSF"] 
    + housing_data["TotalBsmtSF"]
)

We'll now generate a plot of the square footage versus the sale price. We'll use [matplotlib](https://matplotlib.org/) to generate the plot. This is a pretty confusing library to use, but it's the standard plotting library in Python. We'll write a function for generating this plot because we will make this plot again later in the notebook.

In [None]:
# Common convention is to rename pyplot plt
# Most matplotlib commands come from pyplot.
import matplotlib.pyplot as plt
from matplotlib import ticker

# This cryptic line below is a "magic" command in jupyter to make the
# matplotlib plots high resolution
%config InlineBackend.figure_format = "retina"

In [None]:
# These commands adjust various font sizes in the matplotlib plots.
plt.rcParams["xtick.labelsize"] = 14
plt.rcParams["ytick.labelsize"] = 14
plt.rcParams["axes.labelsize"] = 16
plt.rcParams["axes.titlesize"] = 18

In [None]:
def plot_area_vs_price(housing_data):
    fig, ax = plt.subplots()
    ax.scatter(
        x=housing_data["total_area"], 
        y=housing_data["SalePrice"], 
        alpha=0.25, 
        edgecolors="none"
    )
    # Show prices in thousands of dollars, e.g. 250000 -> "$250K".
    formatter = ticker.FuncFormatter(lambda x, pos: f"${int(x/1000):,}K")
    ax.yaxis.set_major_formatter(formatter)
    ax.set_ylabel("Sale Price")
    ax.set_xlabel("Total Area (Square Feet)")
    return ax

In [None]:
ax = plot_area_vs_price(housing_data)
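
The tick formatting inside `plot_area_vs_price` can be checked on its own, without drawing anything. The sketch below wraps the same formatting rule in a named function (`thousands_formatter` is a name made up for this example) and applies it to a few sample tick values:

```python
from matplotlib import ticker


def thousands_formatter(x, pos):
    """Format a dollar amount in thousands, e.g. 250000 -> '$250K'."""
    return f"${int(x / 1000):,}K"


# matplotlib calls the formatter with a tick value and a tick position.
formatter = ticker.FuncFormatter(thousands_formatter)
labels = [formatter(value, pos) for pos, value in enumerate([0, 250000, 1500000])]
print(labels)
```

The `:,` in the format string adds a thousands separator, so very expensive houses show up as, for example, `$1,500K`.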

Now, let's fit a linear model where the input is just the Total Area. Remember, the `scikit-learn` workflow is to create the model, and then fit it on `X` and `y`.

In [None]:
# While the name of the package is scikit-learn, you import it as sklearn.
from sklearn.linear_model import LinearRegression

In [None]:
# All scikit-learn models are classes, and we must instantiate them.
# Different models take different model-specific arguments. 
# Here, we make sure that we fit a y-intercept/bias term.
model = LinearRegression(fit_intercept=True)

In [None]:
# Construct our X and y
X = housing_data[["total_area"]]
y = housing_data["SalePrice"]

Note: scikit-learn expects `X` to be a 2D matrix (rows are samples, columns are features). Even though we only have a single feature, we selected the column from the DataFrame using double brackets. This ensures a 2D DataFrame is returned rather than a 1D Series. `y`, on the other hand, was selected using single brackets, so it is a 1D Series.

In [None]:
print(f"X: type={type(X)}, shape={X.shape}")
print(f"y: type={type(y)}, shape={y.shape}")
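
If you ever start from a 1D array instead of a DataFrame, `reshape(-1, 1)` gives the same single-feature 2D layout. A minimal sketch with made-up areas (`X_2d`, `x_1d`, and `X_from_1d` are illustrative names):

```python
import numpy as np
import pandas as pd

# Made-up areas for illustration.
toy = pd.DataFrame({"total_area": [1500, 2400, 3100]})

X_2d = toy[["total_area"]]   # double brackets -> DataFrame, shape (3, 1)
x_1d = toy["total_area"]     # single brackets -> Series, shape (3,)

# Turn the 1D values into one column with as many rows as needed.
X_from_1d = x_1d.to_numpy().reshape(-1, 1)

print(X_2d.shape, x_1d.shape, X_from_1d.shape)
```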

Anyway, let's fit our model.

In [None]:
model = model.fit(X, y)

And let's generate predictions with the model on the same dataset that we fit it with.

In [None]:
predictions = model.predict(X)
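
A fitted `LinearRegression` stores what it learned in `coef_` (one slope per feature) and `intercept_`. The sketch below uses synthetic data that follows `price = 100 * area + 50000` exactly, so we know what the fit should recover; the data and the names `area`, `price`, and `toy_model` are made up for this example:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data lying exactly on the line price = 100 * area + 50_000.
area = np.array([[1000.0], [1500.0], [2000.0], [2500.0]])
price = 100.0 * area[:, 0] + 50_000.0

toy_model = LinearRegression(fit_intercept=True).fit(area, price)
print(toy_model.coef_, toy_model.intercept_)

# `predict` is just intercept + slope * feature for a linear model.
manual = toy_model.intercept_ + toy_model.coef_[0] * area[:, 0]
print(np.allclose(manual, toy_model.predict(area)))
```

On the real housing data the points do not lie exactly on a line, so the learned slope and intercept are the best-fitting values rather than exact ones.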

We can plot our predictions alongside our original plot.

In [None]:
ax = plot_area_vs_price(housing_data)
ax.plot(X.values[:, 0], predictions, color="red", label="model prediction")
ax.legend()
# Ending the cell with None keeps Jupyter from echoing the legend object.
None

We can also plot a comparison between the predicted sale price and the actual sale price.

In [None]:
def comparison_plot(actual, predictions):

    fig, ax = plt.subplots(figsize=(8, 8))
    ax.scatter(actual, predictions)
    ax.set_xlabel("Actual")
    ax.set_ylabel("Predicted")
    ax.set_title("Home Price Predictions")

    # Show prices in thousands of dollars on both axes, e.g. 250000 -> "$250K".
    formatter = ticker.FuncFormatter(lambda x, pos: f"${int(x/1000):,}K")
    ax.xaxis.set_major_formatter(formatter)
    ax.yaxis.set_major_formatter(formatter)

    # Make the axis limits equal so that the figure
    # is perfectly square.
    xmin, xmax = ax.get_xlim()
    ymin, ymax = ax.get_ylim()

    lim = (min(xmin, ymin), max(xmax, ymax))
    ax.set_xlim(lim)
    ax.set_ylim(lim)

    # Plot a 1:1 line to show where a perfect model's points would lie.
    line = np.linspace(lim[0], lim[1], 201)
    ax.plot(line, line, color="red", linestyle="dashed")
    return ax

ax = comparison_plot(y, predictions)

Lastly, let's calculate the $R^{2}$ for the model:

In [None]:
from sklearn.metrics import r2_score

In [None]:
R2 = r2_score(y, predictions)
print(f"R^2 = {R2:4.3f}")
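
For reference, $R^{2}$ is defined as $1 - SS_{res} / SS_{tot}$, where $SS_{res}$ is the sum of squared prediction errors and $SS_{tot}$ is the sum of squared deviations of the actual values from their mean. Here is a sketch that computes it by hand on a few made-up numbers and compares against `r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up actual values and predictions for illustration.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 9.5])

ss_res = np.sum((y_true - y_hat) ** 2)           # 0.25 + 0.25 + 0 + 0.25 = 0.75
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # 9 + 1 + 1 + 9 = 20
r2_manual = 1.0 - ss_res / ss_tot

print(f"manual R^2  = {r2_manual:.4f}")
print(f"sklearn R^2 = {r2_score(y_true, y_hat):.4f}")
```

An $R^{2}$ of 1 means the predictions match the actual values exactly, while 0 means the model does no better than always predicting the mean.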