# Sarah Gets a Diamond - Minimal Starter Code

This is the same code that is in `DiamondStarter.ipynb`. However, I have removed all text, explanations, and unnecessary steps. If you have not already gone through `DiamondStarter.ipynb` carefully, you should start there.

A minimal notebook can be easier to work with when you are actually building models, so you may want to create something like this whenever you work with starter code. To duplicate a file in JupyterHub, right-click the file and select "Duplicate". You can then open the duplicated starter code and cut out unnecessary cells using the scissors icon on the toolbar.

Feel free to work in the `DiamondStarter.ipynb` notebook instead. This notebook is purely an optional resource.

## Importing

In [None]:
import numpy as np
import pandas as pd
from math import *
import statsmodels.formula.api as smf

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

import seaborn as sns
from matplotlib import pyplot as plt
import matplotlib as mpl

In [None]:
# This is a comment. Anything in a "code cell" that is preceded by a "#" is a comment
# and it will not be interpreted as code to be run when you run the cell.
# This sets some nicer defaults for plotting.
# This must be run in a separate cell from importing matplotlib due to a bug.
params = {'legend.fontsize': 'large',
          'figure.figsize': (11.0, 11.0),
          'axes.labelsize': 'x-large',
          'axes.titlesize':'xx-large',
          'xtick.labelsize':'large',
          'ytick.labelsize':'large'}
mpl.rcParams.update(params)

# This makes it so that the pandas dataframes don't get truncated horizontally.
pd.options.display.max_columns = 200

## Load and clean the data

In [None]:
df_train = pd.read_csv("train.csv", index_col="ID")
df_test = pd.read_csv("test.csv", index_col="ID")

In [None]:
df_train.head()

In [None]:
df_train.shape

In [None]:
df_test.shape

In [None]:
def summarize_dataframe(df):
    """Summarize a dataframe, and report missing values."""
    missing_values = pd.concat([pd.DataFrame(df.columns, columns=['Variable Name']), 
                      pd.DataFrame(df.dtypes.values.reshape([-1,1]), columns=['Data Type']),
                      pd.DataFrame(df.isnull().sum().values, columns=['Missing Values']), 
                      pd.DataFrame([df[name].nunique() for name in df.columns], columns=['Unique Values'])], 
                     axis=1).set_index('Variable Name')
    with pd.option_context("display.max_rows", 1000):
        display(pd.concat([missing_values, df.describe(include='all').transpose()], axis=1).fillna(""))

In [None]:
summarize_dataframe(df_train)

In [None]:
summarize_dataframe(df_test)

In [None]:
df_train['Price']

## Prepare the data

In [None]:
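# Use a raw string so that "\$" is passed to the regex engine unchanged
# (a plain '[\$,]' triggers an invalid-escape warning on newer Python versions).
df_train['Price_numeric'] = df_train['Price'].replace(to_replace=r'[\$,]', value='', regex=True).astype(float)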
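
As a minimal sketch of what this cleaning step does, assuming the `Price` strings look like `$5,169` (the sample values below are made up for illustration and are not taken from `train.csv`):

```python
import pandas as pd

# Hypothetical price strings in the "$1,234" style the regex is written for.
prices = pd.Series(["$5,169", "$3,470", "$12,791"])

# Remove dollar signs and thousands separators, then convert to float.
prices_numeric = prices.replace(to_replace=r"[\$,]", value="", regex=True).astype(float)

print(prices_numeric.tolist())  # [5169.0, 3470.0, 12791.0]
```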

In [None]:
summarize_dataframe(df_train)

## Split into `smaller_train` and `validation` Data Sets

In [None]:
df_smaller_train, df_validation = train_test_split(df_train, test_size=.25, random_state=201)
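
To see what `test_size=.25` and `random_state` do, here is a small sketch on a hypothetical 100-row frame (`toy` is a stand-in, not the assignment data). A quarter of the rows go to the validation set, and fixing `random_state` makes the split reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 100-row frame standing in for df_train.
toy = pd.DataFrame({"x": range(100)})

toy_train, toy_val = train_test_split(toy, test_size=.25, random_state=201)
print(toy_train.shape, toy_val.shape)  # (75, 1) (25, 1)

# Re-running with the same random_state selects the same validation rows.
again_train, again_val = train_test_split(toy, test_size=.25, random_state=201)
print(toy_val.index.equals(again_val.index))  # True
```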

In [None]:
summarize_dataframe(df_smaller_train)

In [None]:
summarize_dataframe(df_validation)

## Advanced Regressions

### Additive Model

In [None]:
lm_1 = smf.ols(formula='Price_numeric ~ Q("Carat Weight")', data=df_smaller_train).fit()
lm_1.summary()

In [None]:
lm_1_predictions = lm_1.predict(df_validation)

In [None]:
lm_1_predictions

In [None]:
mean_absolute_error(df_validation["Price_numeric"], lm_1_predictions)
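
For reference, the mean absolute error is the average of the absolute differences between actual and predicted values, so it is expressed in the same units as the target (dollars here). A minimal sketch with made-up numbers, computed by hand with NumPy rather than with `mean_absolute_error`:

```python
import numpy as np

# Hypothetical actual and predicted prices.
actual = np.array([100.0, 200.0, 300.0])
predicted = np.array([110.0, 190.0, 330.0])

# Absolute errors are 10, 10, and 30, so their mean is 50 / 3.
mae = np.mean(np.abs(actual - predicted))
print(mae)
```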

### Multiplicative Model

In [None]:
lm_2 = smf.ols(formula='np.log(Price_numeric) ~ Q("Carat Weight")', data=df_smaller_train).fit()
lm_2.summary()

In [None]:
lm_2_predictions = lm_2.predict(df_validation)

In [None]:
lm_2_predictions

In [None]:
np.exp(lm_2_predictions)

In [None]:
mean_absolute_error(df_validation["Price_numeric"], np.exp(lm_2_predictions))
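
Because `lm_2` is fit to `np.log(Price_numeric)`, its predictions are on the log scale, and `np.exp` maps them back to dollars before they are compared with the actual prices. A small sketch with hypothetical values standing in for the model output:

```python
import numpy as np

# Hypothetical log-scale predictions, standing in for lm_2.predict(df_validation).
log_predictions = np.array([7.0, 8.5, 10.0])

# exp undoes the log, returning predictions in the original price units.
price_predictions = np.exp(log_predictions)
print(price_predictions.round(2))
```

Comparing the raw log-scale values against `Price_numeric` would mix units, which is why the error is computed on `np.exp(lm_2_predictions)`.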

### Log-Log Model

In [None]:
lm_3 = smf.ols(formula='np.log(Price_numeric) ~ np.log(Q("Carat Weight"))', data=df_smaller_train).fit()
lm_3.summary()

In [None]:
lm_3_predictions = lm_3.predict(df_validation)

In [None]:
np.exp(lm_3_predictions)

In [None]:
mean_absolute_error(df_validation["Price_numeric"], np.exp(lm_3_predictions))

### Models with Multiple Independent Variables

In [None]:
lm_4 = smf.ols(formula='np.log(Price_numeric) ~ Cut + np.log(Q("Carat Weight"))', data=df_smaller_train).fit()
lm_4.summary()

In [None]:
df_smaller_train['Cut'].value_counts()

In [None]:
lm_4_predictions = lm_4.predict(df_validation)

In [None]:
mean_absolute_error(df_validation["Price_numeric"], np.exp(lm_4_predictions))

### Model with Interactions

In [None]:
lm_5 = smf.ols(formula='np.log(Price_numeric) ~ Cut + np.log(Q("Carat Weight")) + Cut*np.log(Q("Carat Weight"))', data=df_smaller_train).fit()
lm_5.summary()

In [None]:
lm_5_predictions = lm_5.predict(df_validation)

In [None]:
mean_absolute_error(df_validation["Price_numeric"], np.exp(lm_5_predictions))

### ADVANCED: Segmenting Variables

In [None]:
lm_6 = smf.ols(formula='np.log(Price_numeric) ~ Q("Carat Weight") + np.maximum(Q("Carat Weight") - 1, 0) + np.maximum(Q("Carat Weight") - 2, 0)', data=df_smaller_train).fit()
lm_6.summary()

In [None]:
lm_6_predictions = lm_6.predict(df_validation)

In [None]:
mean_absolute_error(df_validation["Price_numeric"], np.exp(lm_6_predictions))
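
The `np.maximum(... - k, 0)` terms in `lm_6` act as hinge features: each one is zero until the carat weight passes the breakpoint `k` and then grows linearly, which lets the fitted slope change at 1 and 2 carats. A sketch on hypothetical carat weights (the names `hinge_at_1` and `hinge_at_2` are illustrative only):

```python
import numpy as np

# Hypothetical carat weights on either side of the two breakpoints.
carat = np.array([0.5, 1.0, 1.5, 2.5])

# Zero at or below the breakpoint, then the distance above it.
hinge_at_1 = np.maximum(carat - 1, 0)
hinge_at_2 = np.maximum(carat - 2, 0)

print(hinge_at_1)  # [0.  0.  0.5 1.5]
print(hinge_at_2)  # [0.  0.  0.  0.5]
```

With coefficients `b1`, `b2`, and `b3` on the three terms, the slope of log price with respect to carat weight is `b1` below 1 carat, `b1 + b2` between 1 and 2 carats, and `b1 + b2 + b3` above 2 carats.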

## Your Turn

## Submitting Final Predictions

In [None]:
test_predictions = lm_1.predict(df_test)

In [None]:
test_predictions.to_csv("DiamondSubmission.csv", header=["Price"])
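
The cells above submit `lm_1`, which predicts price directly. If you instead submit one of the log-price models, the predictions would presumably need the same `np.exp` back-transform used on the validation set before being written out. A sketch with hypothetical predictions, writing to an in-memory buffer rather than to a real file:

```python
import io

import numpy as np
import pandas as pd

# Hypothetical log-scale predictions indexed by ID,
# standing in for something like lm_3.predict(df_test).
log_preds = pd.Series(np.log([1000.0, 2500.0]), index=pd.Index([1, 2], name="ID"))

# Convert back to dollars before saving.
dollar_preds = np.exp(log_preds)

# Write in the same layout as the submission cell: an ID column and a Price column.
buffer = io.StringIO()
dollar_preds.to_csv(buffer, header=["Price"])
print(buffer.getvalue())
```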