# Capstone Two - Modeling

## Overview of the data science problem
The purpose of this data science project is to build a pricing model for diamonds.

Using a dataset of historical diamond prices as the business case, I examine how price relates to the other features of a diamond, so that future prices can be predicted more reliably than from historical price alone. The public also has little insight into what a fair price for a diamond is, and in particular which diamonds tend to command a higher price.

This project aims to build a predictive model for diamond price based on a number of features, or properties, of industry-standard diamonds. The model will be used to guide diamond pricing and future investment plans.

#### **Goal**: 

* Build two to three different models and identify the best one. 
    * Fit models with a training dataset
    * Review model outcomes and iterate over additional models as needed
    * Select the final model that performs best for this project

## Importing relevant libraries

In [24]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning tools
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import statsmodels.api as sm

# Set seaborn style
sns.set_style('whitegrid')

# Suppress FutureWarning and UserWarning messages
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=UserWarning)

# Load the data

In [5]:
explored_data = "Explored_data_copy"
df = pd.read_csv(explored_data)
df.head()

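| | weight | quality | color | clarity | depth | flat surface (top) | price | length | width |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326.0 | 3.95 | 3.98 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326.0 | 3.89 | 3.84 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327.0 | 4.05 | 4.07 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334.0 | 4.2 | 4.23 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335.0 | 4.34 | 4.35 |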


# Preprocessing

In [6]:
# Print the data type of each column
df.dtypes

weight                float64
quality                object
color                  object
clarity                object
depth                 float64
flat surface (top)    float64
price                 float64
length                float64
width                 float64
dtype: object

### Create dummy or indicator features for categorical variables

In [33]:
# Get only object columns
df_cats = df.select_dtypes('object')

# Separate the features from the target (price)
X = df.drop(columns=['price'])
y = df['price']

# Get n-1 dummies for all categorical variables
X = pd.get_dummies(X, columns=df_cats.columns, drop_first=True, prefix='C')
X.head()

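| | weight | depth | flat surface (top) | length | width | C_Good | C_Ideal | C_Premium | C_Very Good | C_E | ... | C_H | C_I | C_J | C_IF | C_SI1 | C_SI2 | C_VS1 | C_VS2 | C_VVS1 | C_VVS2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | 61.5 | 55.0 | 3.95 | 3.98 | 0 | 1 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 0.21 | 59.8 | 61.0 | 3.89 | 3.84 | 0 | 0 | 1 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0.23 | 56.9 | 65.0 | 4.05 | 4.07 | 1 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 0.29 | 62.4 | 58.0 | 4.2 | 4.23 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 0.31 | 63.3 | 58.0 | 4.34 | 4.35 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |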
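To make the n-1 encoding concrete, here is a minimal sketch on a hypothetical three-row frame. The column names mirror the dataset, but the frame `toy` and its values are made up for illustration. Unlike the cell above, it omits `prefix`, so each dummy keeps the name of the column it came from; that is an alternative naming choice, not what this notebook does.

```python
import pandas as pd

# Hypothetical toy frame: column names mirror the dataset, values are illustrative only.
toy = pd.DataFrame({
    "weight": [0.23, 0.21, 0.29],
    "quality": ["Ideal", "Premium", "Good"],
    "color": ["E", "E", "I"],
})

# drop_first=True drops the first (alphabetically sorted) level of each
# categorical column; that level becomes the baseline category.
# Without `prefix`, the source column name is used as the dummy prefix.
toy_dummies = pd.get_dummies(toy, columns=["quality", "color"], drop_first=True)
print(list(toy_dummies.columns))
```

Here `Good` and `E` are the dropped baseline levels, so a row with zeros in every `quality_*` column is a `Good` diamond.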


# Fit models with a training dataset

### Split data into testing and training subsamples

In [34]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=1)
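A quick way to confirm that a split produced four distinct, correctly sized pieces is to inspect their shapes. The sketch below uses small hypothetical arrays (`X_demo`, `y_demo`) rather than the diamond data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical demo arrays: 10 rows, 2 features.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# train_test_split returns train/test features first, then train/test targets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

# With test_size=0.2, 2 of the 10 rows are held out for testing.
print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)
```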

### Standardize the magnitude of numeric features using a scaler

In [35]:
# View range of data
X_train.describe()

| | weight | depth | flat surface (top) | length | width | C_Good | C_Ideal | C_Premium | C_Very Good | C_E | ... | C_H | C_I | C_J | C_IF | C_SI1 | C_SI2 | C_VS1 | C_VS2 | C_VVS1 | C_VVS2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 42720.0 | 42720.0 | 42720.0 | 42720.0 | 42720.0 | 42720.0 | 42720.0 | 42720.0 | 42720.0 | 42720.0 | ... | 42720.0 | 42720.0 | 42720.0 | 42720.0 | 42720.0 | 42720.0 | 42720.0 | 42720.0 | 42720.0 | 42720.0 |
| mean | 0.784666 | 61.754841 | 57.447528 | 5.704533 | 5.708236 | 0.091386 | 0.401779 | 0.254096 | 0.223596 | 0.18324 | ... | 0.152949 | 0.100562 | 0.051896 | 0.033521 | 0.240988 | 0.168375 | 0.151264 | 0.229354 | 0.068165 | 0.095131 |
| std | 0.45883 | 1.426725 | 2.237177 | 1.101816 | 1.130244 | 0.28816 | 0.490263 | 0.435357 | 0.416659 | 0.386867 | ... | 0.359943 | 0.300751 | 0.22182 | 0.179994 | 0.427688 | 0.374204 | 0.35831 | 0.420422 | 0.252031 | 0.293399 |
| min | 0.2 | 43.0 | 43.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.4 | 61.075 | 56.0 | 4.7 | 4.71 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 0.7 | 61.8 | 57.0 | 5.68 | 5.69 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 1.03 | 62.5 | 59.0 | 6.52 | 6.52 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 4.13 | 79.0 | 95.0 | 10.02 | 58.9 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


In [36]:
# Initialize scaler
scaler = StandardScaler()

# Fit the scaler on the training data and transform it
X_train_scaled = scaler.fit_transform(X_train)

# Apply (but not fit) scaler to test data
X_test_scaled = scaler.transform(X_test)

In [37]:
X_train_scaled

array([[ 0.75264831, -0.17862149, -0.64704076, ..., -0.54553876,
         3.69734089, -0.32424135],
       [ 0.90521205,  0.17183548,  0.24695331, ..., -0.54553876,
        -0.27046465, -0.32424135],
       [-1.03452696, -0.24871288, -0.20004373, ...,  1.83305032,
        -0.27046465, -0.32424135],
       ...,
       [ 0.27316226,  0.66247524,  2.48193848, ...,  1.83305032,
        -0.27046465, -0.32424135],
       [ 0.86162241, -0.0384387 , -0.64704076, ..., -0.54553876,
        -0.27046465, -0.32424135],
       [-0.94734768,  0.10174409, -0.64704076, ..., -0.54553876,
        -0.27046465, -0.32424135]])

In [38]:
X_test_scaled

array([[-0.77298912, -0.24871288, -1.0940378 , ..., -0.54553876,
         3.69734089, -0.32424135],
       [-0.18452897,  0.59238385, -0.20004373, ..., -0.54553876,
        -0.27046465, -0.32424135],
       [-0.9691425 ,  0.38210967,  1.14094738, ...,  1.83305032,
        -0.27046465, -0.32424135],
       ...,
       [-0.92555286,  0.24192688, -0.64704076, ..., -0.54553876,
        -0.27046465, -0.32424135],
       [-0.81657876, -0.45898706, -0.64704076, ...,  1.83305032,
        -0.27046465, -0.32424135],
       [-0.18452897, -1.01971822,  0.24695331, ..., -0.54553876,
        -0.27046465, -0.32424135]])
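The cells above standardize every column, including the 0/1 dummy columns. If only the continuous features should be scaled, one option is scikit-learn's `ColumnTransformer`. The sketch below is an alternative under that assumption, not the approach used in this notebook; the frame `demo` and the list `numeric_cols` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical frame with two continuous columns and one dummy column.
demo = pd.DataFrame({
    "weight": [0.23, 0.21, 0.29, 0.31],
    "depth": [61.5, 59.8, 62.4, 63.3],
    "C_Ideal": [1, 0, 0, 0],
})
numeric_cols = ["weight", "depth"]  # assumed list of continuous features

# Scale only the continuous columns; pass the dummy columns through unchanged.
scale_numeric = ColumnTransformer(
    transformers=[("num", StandardScaler(), numeric_cols)],
    remainder="passthrough",
)

# The output is a NumPy array: scaled columns first, passthrough columns after.
demo_scaled = scale_numeric.fit_transform(demo)
print(np.round(demo_scaled, 3))
```

As with the plain scaler, the transformer would be fitted on the training data only and then applied to the test data with `transform`.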

## Regression Analysis

Ordinary least squares (OLS) is one of the most common methods for estimating the coefficients of a linear regression equation. The OLS assumptions listed below should be checked before relying on its estimates.

### Assumptions
* Linearity
* No endogeneity (the features are uncorrelated with the error term)
* Normality and homoscedasticity of the errors
* No autocorrelation of the errors
* No multicollinearity

### Multiple Linear Regression

In [39]:
# OLS Model


# Review model outcomes
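
When comparing candidate models, the same metrics can be computed on the held-out test set for each one. The sketch below shows three common regression metrics from `sklearn.metrics`; the arrays `y_true` and `y_pred` hold made-up values for illustration and are not results from this project.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual and predicted prices (illustrative values only).
y_true = np.array([326.0, 334.0, 335.0, 400.0])
y_pred = np.array([330.0, 330.0, 340.0, 390.0])

# Mean absolute error: average size of the errors, in price units.
mae = mean_absolute_error(y_true, y_pred)

# Root mean squared error: penalizes large errors more heavily than MAE.
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

# R-squared: share of the variance in price explained by the predictions.
r2 = r2_score(y_true, y_pred)

print(f"MAE={mae:.2f}  RMSE={rmse:.2f}  R2={r2:.3f}")
```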