# Test 3 — Reference Notebook

This is Miles' reference notebook for test #3 in CSC630 Machine Learning.

-----

### Boilerplate Code

Common Import Block

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import math
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import mean_squared_error, accuracy_score
from sklearn.model_selection import train_test_split
%matplotlib inline
plt.style.use("fivethirtyeight")

Reading basic CSV files

In [None]:
df = pd.read_csv("...")

Creating a dataframe from raw data

In [None]:
#                  2d array                  labels
hd = pd.DataFrame(raw_dataset.data, columns=raw_dataset['feature_names'])
hd['...'] = raw_dataset.target
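
As a concrete sketch of the pattern above, here it is with scikit-learn's bundled iris data standing in for `raw_dataset`. The choice of dataset and the column name `species` are assumptions for illustration only.

```python
import pandas as pd
from sklearn.datasets import load_iris

raw_dataset = load_iris()  # Bunch object with .data, .target, .feature_names

# The 2D feature array becomes the rows; feature_names become the column labels
hd = pd.DataFrame(raw_dataset.data, columns=raw_dataset.feature_names)
hd["species"] = raw_dataset.target  # hypothetical name for the target column

print(hd.shape)
```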

### Data Processing

`df.shape` returns a `(rows, columns)` tuple, i.e. (samples, features plus any target column)

In [None]:
df.shape

For the first real look, always call `df.head()`

In [None]:
df.head()

`df.describe()` is also very helpful: it gives the count, mean, standard deviation, min, quartiles, and max of each numeric column

In [None]:
df.describe()

A useful (but blunt) data cleaning feature is `df.dropna()`, which by default drops every row that contains at least one missing value

In [None]:
df.dropna(inplace=True)
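
Because `dropna()` can throw away more rows than intended, it may help to compare it against a narrower call. The sketch below uses a made-up four-row frame (all values are hypothetical) and the `subset` parameter to require only one column to be present.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in two different columns
df = pd.DataFrame({
    "mpg": [18.0, np.nan, 24.0, 31.0],
    "weight": [3504.0, 3693.0, np.nan, 2130.0],
})

all_complete = df.dropna()             # drops any row with a NaN anywhere
mpg_known = df.dropna(subset=["mpg"])  # only requires mpg to be present

print(len(df), len(all_complete), len(mpg_known))
```

Neither call uses `inplace=True` here, so `df` itself is left untouched and the row counts can be compared side by side.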

Sometimes column names come back with leading or trailing whitespace after a CSV read (usually because the header row has spaces around its delimiters). If that's the case, this one-liner can help:

In [None]:
df.rename(columns={column: column.strip() for column in df.columns}, inplace=True)
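
To see the problem and the fix in one place, here is a small sketch that reads an in-memory CSV. The CSV text and its column names are hypothetical.

```python
import io
import pandas as pd

# Hypothetical CSV text with a space after every comma
csv_text = "mpg, cylinders, weight\n18.0, 8, 3504\n15.0, 8, 3693\n"

df = pd.read_csv(io.StringIO(csv_text))
before = list(df.columns)  # second and third names keep their leading space

df.rename(columns={column: column.strip() for column in df.columns}, inplace=True)
after = list(df.columns)

print(before)
print(after)
```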

Broadcasting applies an operation to every row of a column at once, which makes simple feature engineering a one-liner

In [None]:
df['mpg_reciprocal'] = df['mpg'] ** -1
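
A minimal sketch of what that line does, on a hypothetical three-row frame. The `check` column is only there to show that column-with-column arithmetic is also element-wise.

```python
import pandas as pd

df = pd.DataFrame({"mpg": [10.0, 20.0, 40.0]})  # hypothetical values

# The scalar exponent is broadcast across the whole column
df["mpg_reciprocal"] = df["mpg"] ** -1
# Column-with-column arithmetic pairs values up row by row
df["check"] = df["mpg"] * df["mpg_reciprocal"]

print(df)
```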

Filtering rows with a boolean mask allows the creation of sub-datasets

In [None]:
df_sliced = df[df["..."] == "..."]
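
The comparison inside the brackets produces a boolean Series, and indexing with it keeps only the `True` rows. A small sketch with hypothetical column names and values:

```python
import pandas as pd

df = pd.DataFrame({
    "origin": ["usa", "japan", "usa", "europe"],
    "mpg": [18.0, 31.0, 26.0, 24.0],
})

mask = df["origin"] == "usa"  # boolean Series, one entry per row
df_usa = df[mask]

# Combine conditions with & or |, and parenthesize each one
df_usa_efficient = df[(df["origin"] == "usa") & (df["mpg"] > 20)]

print(df_usa_efficient)
```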

Columns can be dropped, too

In [None]:
# Without columns= (or axis=1), drop() looks for a row label instead
df.drop(columns="column name", inplace=True)
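
One thing to watch: `drop()` treats a bare label as a row label unless `columns=` or `axis=1` is given. The sketch below, on a hypothetical two-column frame, shows both the working call and the failure.

```python
import pandas as pd

df = pd.DataFrame({"mpg": [18.0, 31.0], "name": ["a", "b"]})

trimmed = df.drop(columns="name")  # returns a new frame; df is unchanged

try:
    df.drop("name")  # axis defaults to rows, so this searches the index
    raised = False
except KeyError:
    raised = True

print(list(trimmed.columns), raised)
```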

### Plotting

Remember that `matplotlib.pyplot` plots can be overlaid: successive plotting calls in one cell draw on the same axes.

Remember to always add a colorbar (if applicable), label axes, and name the plot!

In [None]:
plt.colorbar()
plt.title("PCA transformations and error of a multiclass \nlogistic regression on the sepal dataset")
plt.xlabel("X label")
plt.ylabel("Y label")
plt.show()
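
Putting the overlay and labelling advice together, here is a rough sketch with made-up data. It uses the explicit `fig, ax` interface rather than bare `plt.` calls, which is a stylistic assumption and not something these notes require.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 * x + rng.normal(0, 2, size=x.size)  # hypothetical noisy data

fig, ax = plt.subplots()
points = ax.scatter(x, y, c=y, cmap="brg", alpha=0.6)  # first layer
ax.plot(x, 2 * x, color="black")                       # second layer, same axes
fig.colorbar(points, ax=ax, label="y value")
ax.set_title("Scatter with an overlaid line")
ax.set_xlabel("X label")
ax.set_ylabel("Y label")
```

In a notebook the figure renders at the end of the cell; in a script, finish with `plt.show()`.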

Basic scatterplots, histograms, and line plots (respectively)

In [None]:
plt.scatter(X, Y, c=optional_color_variable, marker='*', alpha=0.2, cmap="brg", s=30)
plt.hist(X, bins=[1,2,3])
plt.plot(X, Y)

The *emperor of all plots*, however, is the Seaborn pairplot!

In [None]:
# Be careful; can sometimes take a long time to run!
sns.pairplot(df)

# Alternatively, if you only need one or two Y values:
sns.pairplot(df, y_vars=['MEDV'], x_vars=[key for key in df.keys() if key not in ["MEDV"]])

### Supervised Learning Utilities

Always start with a nice train-test split!

In [None]:
# drop(columns=...) removes the target column from the features
X_train, X_test, y_train, y_test = train_test_split(df.drop(columns="..."), df["..."], test_size=0.25)
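
A fuller sketch of the split on a made-up frame. The column names, the `random_state` value, and the use of `stratify` are assumptions added for illustration; `stratify` mainly matters for classification targets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical frame: 100 rows, two features, one binary target
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x1": rng.normal(size=100),
    "x2": rng.normal(size=100),
    "label": np.repeat([0, 1], 50),
})

X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="label"),  # features only
    df["label"],               # target
    test_size=0.25,
    random_state=42,           # fixed seed so the split is reproducible
    stratify=df["label"],      # keep the class ratio similar in both parts
)

print(X_train.shape, X_test.shape)
```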

To check accuracy on a **logistic regression**, use the following:

In [None]:
accuracy_score(y_true, y_pred)
# predict() takes a 2D array and already returns class labels, so no rounding is needed
accuracy_score(y_test, model.predict(X_test))
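
Accuracy is just the fraction of predictions that match the true labels. A tiny worked example with hypothetical labels, computed by hand and with `accuracy_score`:

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([0, 1, 1, 0])
y_pred = np.array([0, 1, 0, 0])  # one mistake out of four

by_hand = (y_true == y_pred).mean()
from_sklearn = accuracy_score(y_true, y_pred)

print(by_hand, from_sklearn)
```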

To measure error on a **linear regression** (accuracy does not apply to continuous targets), use the following:

In [None]:
from sklearn.metrics import r2_score  # not in the common import block
mean_squared_error(y_true, y_pred)
# or, better (scale-free; 1.0 is a perfect fit):
r2_score(y_true, y_pred)
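
To make the two metrics less of a black box, here is a sketch that computes both by hand with NumPy and compares them to scikit-learn's functions. The four data points are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

residuals = y_true - y_pred
mse_by_hand = np.mean(residuals ** 2)
# R^2 = 1 - (residual sum of squares) / (total sum of squares around the mean)
r2_by_hand = 1 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)

mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

print(mse, r2)
```

MSE is in squared units of the target, so it is hard to compare across datasets; R² is unitless, which is one reason to prefer it as a quick summary.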

### Supervised Learning Models

Linear Regression

In [None]:
model = LinearRegression()
model.fit(X, y)
model.predict(X) # --> y
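
An end-to-end sketch on synthetic data, assuming a true relationship of `y = 3x + 2` plus noise (the data and constants are made up). It shows the 2D shape that `fit` and `predict` expect and where the learned parameters live.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))              # must be 2D: (n_samples, n_features)
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 0.5, 200)  # hypothetical line plus noise

model = LinearRegression()
model.fit(X, y)

slope = model.coef_[0]
intercept = model.intercept_
prediction = model.predict(np.array([[4.0]]))      # one new sample, still 2D

print(slope, intercept, prediction)
```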

Logistic Regression

In [None]:
model = LogisticRegression()
model.fit(X, y)
model.predict(X) # --> y
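
The same workflow for classification, sketched on two made-up 1D clusters. Unlike the linear model, `predict` here returns class labels directly, and `predict_proba` gives the underlying probabilities.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Two hypothetical, well-separated 1D clusters
X = np.concatenate([rng.normal(-2, 1, 100), rng.normal(2, 1, 100)]).reshape(-1, 1)
y = np.repeat([0, 1], 100)

model = LogisticRegression()
model.fit(X, y)

labels = model.predict(X)               # hard class labels (0 or 1)
probabilities = model.predict_proba(X)  # one column per class; each row sums to 1
train_accuracy = model.score(X, y)      # mean accuracy on the given data

print(train_accuracy)
```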