# Test 3 — Reference Notebook

This is Miles' reference notebook for test #3 in CSC630 Machine Learning.

-----

### Boilerplate Code

Common Import Block

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import math
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import mean_squared_error, accuracy_score
from sklearn.model_selection import train_test_split
%matplotlib inline
plt.style.use("fivethirtyeight")

Reading basic CSV files

In [None]:
df = pd.read_csv("...")

Creating a dataframe from raw data

In [None]:
#                  2d array                  labels
hd = pd.DataFrame(raw_dataset.data, columns=raw_dataset['feature_names'])
hd['...'] = raw_dataset.target
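
As a concrete sketch of the pattern above, the snippet below builds a DataFrame from scikit-learn's bundled iris dataset. The choice of `load_iris` and the column name `species` are assumptions made for illustration; any dataset object exposing `.data`, `feature_names`, and `.target` should work the same way.

```python
import pandas as pd
from sklearn.datasets import load_iris

# load_iris() returns an object with .data (2D array), feature_names, and .target
raw_dataset = load_iris()

# Feature matrix becomes the body of the frame; feature names become the column labels
iris_df = pd.DataFrame(raw_dataset.data, columns=raw_dataset["feature_names"])

# "species" is a column name chosen for this example, not something the dataset dictates
iris_df["species"] = raw_dataset.target

print(iris_df.shape)
```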

### Data Processing

`df.shape` returns a `(rows, columns)` tuple

In [None]:
df.shape

For the first real look, always call `df.head()`

In [None]:
df.head()

`df.describe()` is also very helpful

In [None]:
df.describe()

A useful, if blunt, data-cleaning tool is `df.dropna()`, which drops every row that contains a missing value

In [None]:
df.dropna(inplace=True)
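
Because a bare `dropna()` discards a row when any column is missing, it can throw away more data than intended. The sketch below, using a made-up toy frame, contrasts it with the `subset` argument, which only checks the listed columns.

```python
import numpy as np
import pandas as pd

# Toy data invented for this example: "notes" is mostly empty, "mpg" has one gap
toy = pd.DataFrame({
    "mpg": [21.0, np.nan, 30.5, 18.2],
    "notes": [None, None, "turbo", None],
})

all_clean = toy.dropna()                # drops a row if ANY column is missing
mpg_clean = toy.dropna(subset=["mpg"])  # drops a row only if "mpg" is missing

print(len(all_clean), len(mpg_clean))
```

Here the bare call keeps a single row, while the `subset` version keeps three.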

Sometimes column names come back with leading whitespace, typically because the CSV has spaces after its delimiters. If that's the case, this one-liner can help:

In [None]:
df.rename(columns={column: column.strip() for column in df.columns}, inplace=True)
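
Two alternatives that may be worth knowing, sketched on an inline CSV string invented for this example: stripping the names through the `.str` accessor, or asking `read_csv` to skip the spaces after each delimiter in the first place.

```python
import io
import pandas as pd

# A small CSV with a space after every comma (made up for illustration)
csv_text = "mpg, cylinders, weight\n21.0, 6, 2620\n22.8, 4, 2320\n"

# Option 1: read as-is, then strip the column names afterwards
messy = pd.read_csv(io.StringIO(csv_text))
messy.columns = messy.columns.str.strip()

# Option 2: let the parser skip whitespace that follows a delimiter
clean = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)

print(list(messy.columns), list(clean.columns))
```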

Vectorized column arithmetic makes simple feature engineering easy

In [None]:
df['mpg_reciprocal'] = df['mpg'] ** -1

Filtering rows with a boolean mask allows the creation of sub-datasets

In [None]:
df_sliced = df[df["..."] == "..."]
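
Masks can also be combined. The sketch below uses a toy frame with invented column names; note that each condition needs its own parentheses and that `&` / `|` replace Python's `and` / `or`.

```python
import pandas as pd

# Toy data invented for this example
cars = pd.DataFrame({
    "origin": ["usa", "japan", "usa", "europe"],
    "mpg": [18.0, 31.0, 25.0, 27.0],
})

usa_only = cars[cars["origin"] == "usa"]                          # single condition
efficient_usa = cars[(cars["origin"] == "usa") & (cars["mpg"] > 20)]  # both must hold
imports = cars[cars["origin"].isin(["japan", "europe"])]          # membership test

print(len(usa_only), len(efficient_usa), len(imports))
```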

Columns can be dropped, too

In [None]:
df.drop(columns="column name", inplace=True) # without columns= (or axis=1), drop() looks for a row label

### NumPy Utilities

Ranged data

In [None]:
np.arange(start, stop, step) # includes start, excludes stop

Random data

In [None]:
np.random.sample(size) # uniform floats in the half-open interval [0, 1)
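
For reproducible random data, a seeded generator is one option. This is a sketch using NumPy's `default_rng`; the seed value and the array sizes are arbitrary choices for the example.

```python
import numpy as np

# A seeded generator produces the same numbers on every run
rng = np.random.default_rng(seed=0)

uniform = rng.random(5)                          # floats in [0, 1)
noise = rng.normal(loc=0.0, scale=1.0, size=5)   # standard normal samples
dice = rng.integers(low=1, high=7, size=5)       # integers 1..6 (high is exclusive)

print(uniform.shape, noise.shape, dice.shape)
```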

### Plotting

Remember that `matplotlib.pyplot` plots can be overlaid: successive plotting calls draw onto the same axes until `plt.show()` is called.

Remember to always add a colorbar (if applicable), label axes, and name the plot!

In [None]:
plt.colorbar()
plt.title("PCA transformations and error of a multiclass \nlogistic regression on the sepal dataset")
plt.xlabel("X label")
plt.ylabel("Y label")
# plt.xlim(left, right)
# plt.ylim(bottom, top)
plt.show()

Basic scatterplots, histograms, and line plots (respectively)

In [None]:
plt.scatter(X, Y, c=optional_color_variable, marker='*', alpha=0.2, cmap="brg", s=30)
plt.hist(X, bins=[1,2,3])
plt.plot(X, Y)
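
Putting the overlay and labeling advice together, here is a minimal sketch on synthetic data (the data, colormap, and label text are all invented for the example). It uses the `fig, ax` interface, which behaves like the `plt.*` calls above but makes it explicit which axes each layer lands on.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data: a noisy line, made up for this example
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 * x + rng.normal(scale=2.0, size=x.size)

fig, ax = plt.subplots()

# Layer 1: scatter, colored by the y value
points = ax.scatter(x, y, c=y, cmap="viridis", s=30, label="noisy samples")
# Layer 2: the underlying line, drawn on the same axes
ax.plot(x, 2 * x, color="black", label="y = 2x")

fig.colorbar(points, ax=ax, label="y value")
ax.set_title("Scatter and line on the same axes")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
```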

The *emperor of all plots*, however, is the Seaborn pairplot!

In [None]:
# Be careful; can sometimes take a long time to run!
sns.pairplot(df)

# Alternatively, if you only need one or two Y values:
sns.pairplot(df, y_vars=['MEDV'], x_vars=[key for key in df.keys() if key not in ["MEDV"]])

### Supervised Learning Utilities

Always start with a nice train-test split!

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df.drop(columns="..."), df["..."], test_size=0.25) # drop(columns=...) removes the target column from the features
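
Two optional arguments are often worth adding: `random_state` to make the split repeatable and `stratify` to keep class proportions similar in both halves. The sketch below uses a toy frame with invented column names.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data invented for this example: 20 rows, balanced binary label
toy = pd.DataFrame({
    "feature_a": np.arange(20),
    "feature_b": np.arange(20) * 2,
    "label": [0, 1] * 10,
})

X = toy.drop(columns="label")   # features only
y = toy["label"]                # target only

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,     # 25% of rows go to the test set
    random_state=42,    # arbitrary seed so the split is repeatable
    stratify=y,         # preserve the 0/1 ratio in both sets
)

print(X_train.shape, X_test.shape)
```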

To check accuracy on a **logistic regression**, use the following:

In [None]:
accuracy_score(y_true, y_pred)
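# For a single-feature model with continuous output: predict() needs a 2D (n_samples, 1) array; round to 0/1
accuracy_score(y_true, np.round(model.predict(np.arange(0, 10, 10./200).reshape(-1, 1))))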
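
For reference, accuracy is just the fraction of predictions that match the true labels. The sketch below fits a classifier on made-up one-feature data and checks `accuracy_score` against that definition; in practice the score should be computed on held-out test data rather than the training set used here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Made-up data: one feature, class 1 whenever x > 5
X = np.arange(0, 10, 0.5).reshape(-1, 1)   # 2D: (n_samples, 1)
y = (X.ravel() > 5).astype(int)

model = LogisticRegression().fit(X, y)
y_pred = model.predict(X)                  # already 0/1 labels, no rounding needed

acc = accuracy_score(y, y_pred)
manual = (y == y_pred).mean()              # fraction of matching labels

print(acc, manual)
```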

To evaluate a **linear regression**, use the following:

In [None]:
from sklearn.metrics import r2_score
mean_squared_error(y_true, y_pred)
# or, better:
r2_score(y_true, y_pred)
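
A small worked example may help relate the two metrics. The numbers below are invented; MSE is in squared target units (so RMSE is often easier to read), while R² is unitless, with 1.0 meaning a perfect fit.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Invented values for illustration
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 9.0])

mse = mean_squared_error(y_true, y_pred)   # mean of squared errors: (0.25 + 0 + 0.25 + 0) / 4
rmse = np.sqrt(mse)                        # back in the target's own units
r2 = r2_score(y_true, y_pred)              # 1 - SS_res / SS_tot = 1 - 0.5 / 20

print(mse, rmse, r2)
```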

To perform simple one-hot encoding, use the following function:

In [None]:
df_dummies = pd.get_dummies(df[['...']], prefix=["descriptor"])
df = pd.concat([df, df_dummies], axis=1)
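
Here is the same pattern on a toy frame (column names invented for the example). One difference from the cell above, flagged as a design choice: the original categorical column is dropped before concatenating, since most models cannot use the raw strings anyway.

```python
import pandas as pd

# Toy data invented for this example
cars = pd.DataFrame({
    "origin": ["usa", "japan", "usa"],
    "mpg": [18.0, 31.0, 25.0],
})

# One indicator column per distinct value, named "<prefix>_<value>"
dummies = pd.get_dummies(cars[["origin"]], prefix=["origin"])

# Replace the string column with its indicator columns
encoded = pd.concat([cars.drop(columns="origin"), dummies], axis=1)

print(list(encoded.columns))
```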

DataFrame `.apply(function)` calls the function once per column by default (`axis=0`); pass `axis=1` to call it once per row instead.

In [None]:
df.apply(function, axis=1)

To apply a function to each element of a single column, use `.map()`:

In [None]:
df["..."].map(function)
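
The difference between the two axes of `.apply()` and the element-wise `.map()` is easiest to see on a tiny frame. The data and lambdas below are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0 (default): the function receives each COLUMN as a Series
col_range = toy.apply(lambda col: col.max() - col.min())

# axis=1: the function receives each ROW as a Series
row_total = toy.apply(lambda row: row["a"] + row["b"], axis=1)

# Series.map: the function receives each ELEMENT of one column
doubled = toy["a"].map(lambda value: value * 2)

print(col_range.tolist(), row_total.tolist(), doubled.tolist())
```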

### Supervised Learning Models

Linear Regression

In [None]:
model = LinearRegression()
model.fit(X, y)
model.predict(X) # --> y

Logistic Regression

In [None]:
model = LogisticRegression()
model.fit(X, y)
model.predict(X) # --> y

### Dr. Z's Magic Function

In [None]:
# Thank you, Dr. Z!

def scatter_with_decision(original_x, original_y, original_z, model, rules=None):
    """ Create a scatter plot for 2-dimensional input data, as well as the decision 
    boundary for the given logistic regression model. 
    
    parameters:
        original_x, original_y, original_z: numpy arrays
            the data for the two input dimensions (x and y) and output (z, with values 0 or 1)
        model: sklearn.linear_model.LogisticRegression
            the already-fit model
        rules: List[(index, function)]
            A collection of functions defining how to turn the original 
            columns into your engineered columns.  The index is either `0` or `1` 
            to indicate that the rule is applied to column 0 or 1, or `2` if 
            the rule uses both columns.
            Some examples:
                if you want `original_x**2`, your `rules` should contain the tuple `(0, lambda x: x**2)`.  
                if you want `original_y**3`, your `rules` should contain the tuple `(1, lambda x: x**3)`.
                if you want `original_x * original_y`, your `rules` should contain the tuple `(2, lambda x, y: x*y)`.
    returns:
        the Figure and Axes objects produced (in order to add more to it if you want, 
            e.g. title and axis labels)
    """
    fig = plt.figure()
    ax = fig.add_subplot(111)
    
    ### create the decision surface
    x = np.arange(original_x.min(), original_x.max(), 0.1)
    y = np.arange(original_y.min(), original_y.max(), 0.1)
    xx, yy = np.meshgrid(x, y)                       # this is its xy-coordinate grid

    ### We need to "ravel" the grid's matrices to make them one long column
    grid_as_columns = [xx.ravel(), yy.ravel()]
    if rules:
        for i, rule in rules:
            if i < 2:
                # this rule uses only one input column
                grid_as_columns.append(rule(grid_as_columns[i]))    
            else:
                # this rule uses both input columns
                grid_as_columns.append(rule(grid_as_columns[0], grid_as_columns[1]))
    dataset_cols = np.array(grid_as_columns).T       # one row per grid point; columns are x, y, then one per rule

    ### Now we can feed them into the prediction function and reshape it back to the grid
    zz_col = model.predict_proba(dataset_cols).T[0]
    zz = zz_col.reshape(xx.shape)                    # finally, we have the z-coordinates for each grid point

    # make the plots
    ax.contour(xx, yy, zz, levels=[.5], colors=['c'])
    ax.scatter(original_x, original_y, c=original_z)
    return fig, ax
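
A usage sketch for the function above, on synthetic data invented for this example. The key assumption, read off the function body, is that the model must have been fit on columns in the order `[x, y, rule outputs...]`, because that is the order in which the function rebuilds the grid before calling `predict_proba`. The final call is left commented out since it relies on `scatter_with_decision` being defined in the previous cell.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data with a parabola-shaped class boundary (made up for illustration)
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 200)
y = rng.uniform(-3, 3, 200)
z = (y > x**2 - 2).astype(int)

# One engineered feature: x squared, built from column 0
rules = [(0, lambda col: col**2)]

# Training columns in the same order the function will construct them: x, y, x**2
X_engineered = np.column_stack([x, y, x**2])
model = LogisticRegression().fit(X_engineered, z)

# With scatter_with_decision from the cell above in scope:
# fig, ax = scatter_with_decision(x, y, z, model, rules=rules)
# ax.set_title("Decision boundary with an x**2 feature")
```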