<a id='top'></a>
# Diamond Price - A Linear Regression Model
### Table of Contents
| Simple Linear Regression                                                                        |  Multiple Linear Regression                                                      |
| ------------------------------------------------------------------------------------------------|:---------------------------------------------------------------------------------|
| [Introduction](#intro)                                                                          |   [Introduction](#multi)                                                         |
| [Step 1](#step1) - Importing the libraries needed to process the data and the regression model  |   [Step 1](#stepm1) - Read In a New Dataframe                                    |
| [Step 2](#step2) - Importing the data to a dataframe                                            |   [Step 2](#stepm2) - Data Preparation                                           |
| [Step 3](#step3) - Exploring the data frame                                                     |   [One-hot Encoding](#one-hot) - converting categorical data to numerical values | 
| [Step 4](#step4) - Assigning X and y - Feature and Target                                       |   [Step 3](#stepm3) - Assigning X and y - Feature and Target                     |
| [Step 5](#step5) - Dividing the data Into two sets - Train and Test                             |   [Step 4](#stepm4) - Dividing the Data Into Sets - Train and Test               |
| [Step 6](#step6) - Training the Linear Regression Model                                         |   [Step 5](#stepm5) - Training the Linear Regression Model                       |  
| [Step 7](#step7) - Predicting the Diamond price                                                 |   [Step 6](#stepm6) - Predictions and Performance Evaluation                     |
| [Step 8](#step8) - Evaluating the performance of the model                                      |   [Final Results](#finalm)                                                       |
| [Step 9](#step9) - Graphing the data and regression line                                        |                                                                                  |
| [Results](#results)                                                                             |                                                                                  |
| [References](#references)                                                                       |                                                                                  |

<a id='intro'></a>
## Diamonds

This is a regression model built to predict a diamond's price based on different characteristics of a diamond. 

It starts with a simple linear regression model. The model uses a single independent variable (in this case the carat size) to predict a dependent variable (the price of the diamond). This seems to make sense: the larger the carat size, the more expensive the diamond should be. 

The data, provided by Kaggle, is a single comma-separated values (csv) file with the following characteristics:

- A data frame with 53940 rows and 10 variables:
- price: price in US dollars (\$326--\$18,823)
- carat: weight of the diamond (0.2--5.01)
- cut: quality of the cut (Fair, Good, Very Good, Premium, Ideal)
- color: diamond colour, from J (worst) to D (best)
- clarity: a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
- x: length in mm (0--10.74)
- y: width in mm (0--58.9)
- z: depth in mm (0--31.8)
- depth: total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43--79)
- table: width of top of diamond relative to widest point (43--95)

A further description and the download link can be found in the [references](#references) section of this notebook.
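
As a small worked example of the depth formula in the list above, the sketch below computes the depth percentage from three measurements. The values are hypothetical and are not taken from the dataset, and the factor of 100 is an assumption made so that the result lands on the same percentage scale as the stated 43--79 range.

```python
# Hypothetical measurements in millimetres (not taken from the dataset).
x, y, z = 3.95, 3.98, 2.43

# depth = z / mean(x, y) = 2 * z / (x + y), scaled by 100 to express a percentage.
depth_pct = 100 * 2 * z / (x + y)
print(f"depth: {depth_pct:.1f}%")
```

A value near 61% sits well inside the range listed for the depth column.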


Back to [TOC](#top)

<a id='step1'></a>
## Step 1 - Importing the libraries needed to process the data and the regression model

- pandas - Python library used to process data. We will put the data into a pandas dataframe to prepare it for our regression model.
- numpy - library for numerical computing. We use it to calculate the averages drawn as reference lines on the plots.
- matplotlib - library used to generate plots in Python. We will generate a scatter plot and the regression line using this library.
- sklearn - the import name of scikit-learn, a machine learning library for Python. It contains the functions and classes to:
    - Split the data into train and test sets
    - Fit a linear regression model
    - Calculate metrics to evaluate the model


Go back to [TOC](#top)

In [None]:
# Step 1: Importing libraries 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics

<a id='step2'></a>
## Step 2 - Importing the data to a dataframe
Referring to the documentation for the data and reviewing the csv file are important for understanding the format of the data.

- The variable Diamond points to the dataframe created by reading the csv into memory using pandas.


Go back to [TOC](#top)


In [None]:
# Step 2: reading in the csv and creating the DataFrame 

Diamond = pd.read_csv('diamonds.csv')


<a id='step3'></a>
## Step 3 - Exploring the data frame
Exploring the data frame we created will ensure our data is complete and usable for our model. We can extract some information about the data by using the following methods:


- .info() - prints information about the data: 
    - number of columns
    - column labels
    - data types **
    - range index - how many rows of data
    
A good way to get some statistics on the data we are using:
- .describe() - quick summary statistics of the numeric columns 
    - count - number of non-empty values
    - mean - average
    - std - standard deviation
    - min - minimum value
    - 25% - 25th percentile
    - 50% - 50th percentile (median)
    - 75% - 75th percentile
    - max - maximum value

- .head() - small snapshot of the data. Returns the first 5 rows if a number is not specified.
- .isna() - flags any missing values, which pandas denotes as NaN.
- .corr() - correlation matrix - shows how closely related the columns are. We use this table to find a good correlation between our independent and dependent variables (X and y). 

** As shown in the output of .info(), the data contains three types: int64, float64 and object. Since regression can only use numeric data, we should be aware that cut, color and clarity cannot be used as is. More on this later in the notebook. 


Go back to [TOC](#top)

In [None]:
# printing .info() and .describe to view info on the dataframe
print("Printing some information about the data using .info():\n")
Diamond.info()
print("\n The data description using .describe():")
Diamond.describe()


A small snapshot of the data using the .head() method
- .head() - method in pandas used to return the first few rows of the data. If a number is not specified, it returns the first 5 rows. We will use this again later for data comparison. Notice that cut, color and clarity are categorical (non-numeric) features.

In [None]:
print("Printing 5 rows of the data:")
Diamond.head()

We want to make sure this data is complete and does not have any missing values. The .isna() method flags missing values, which pandas denotes as NaN, and .sum() counts them per column. All zeros in the right column mean there is data in every row of each column.

In [None]:
# looking for NaN - missing data
Diamond.isna().sum()

Finding the correlation matrix - Since we only want to do a simple regression, we want to find the independent variable that is most closely related to the price of the diamond. The .corr() method shows the relationship between each pair of columns in the dataframe. The closer the absolute value is to 1, the more closely the columns are related. Only numeric columns are shown in this matrix. We will look at the categorical (non-numeric) data later in part 2.

In [None]:
# Finding the cross-correlation matrix - Looking for two correlated values. 
print(Diamond.corr(numeric_only = True))

### The correlation matrix above shows that price is closely related to carat, which indicates this may be a good relationship to use for our regression model
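
To make the numbers in a correlation matrix less of a black box, here is a minimal sketch that computes the Pearson correlation by hand and compares it with what `.corr()` returns (Pearson is the pandas default). The toy values and the helper name `pearson` are hypothetical and only illustrate the calculation.

```python
import math
import pandas as pd

# Hypothetical toy values, only to illustrate the calculation.
toy = pd.DataFrame({
    "carat": [0.3, 0.5, 0.7, 1.0, 1.5],
    "price": [400, 1200, 2500, 5000, 11000],
})

def pearson(a, b):
    """Pearson correlation: co-variation divided by the product of the spreads."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)

manual_r = pearson(toy["carat"].tolist(), toy["price"].tolist())
pandas_r = toy.corr(numeric_only=True).loc["carat", "price"]
print(manual_r, pandas_r)
```

Both values agree, and a result close to 1 means the two columns rise together almost linearly.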

<a id='step4'></a>
## Step 4 - Assigning X and y - Feature and Target 
We need to assign the independent and dependent variables. X is independent and y is dependent. We are going to use X to predict y.   
X = carat, the only column in the feature matrix   
y = price, the target variable 



Go back to [TOC](#top)

In [None]:
# Step 4: Assigning our feature (X) and target variable (y)
#Using the carat of the diamond (X) to predict the price (y)
X = Diamond[['carat']]
y = Diamond['price']

<a id='step5'></a>
## Step 5 - Dividing the Data Into Sets - Train and Test
We want to take our data and divide it into two sets, train and test. We will use our train set to train the model and use the test set afterwards. 70% of the data will be used to train the model, with 30% used as the test data set. Using the random_state option sets a seed so that we can recreate the same split and the same results. The shuffle option shuffles the data before it is split, so that the order of the rows in the file does not bias either set. train_test_split() is a function of scikit-learn.
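
The effect of `test_size` and `random_state` is easier to see on a tiny array. The sketch below uses ten hypothetical samples (the names `X_toy` and `y_toy` are not part of the notebook) and splits them twice with the same seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten hypothetical samples so the 70/30 split is easy to count.
X_toy = np.arange(10).reshape(-1, 1)
y_toy = np.arange(10) * 100

split_a = train_test_split(X_toy, y_toy, test_size=0.3, random_state=10, shuffle=True)
split_b = train_test_split(X_toy, y_toy, test_size=0.3, random_state=10, shuffle=True)

X_tr, X_te, y_tr, y_te = split_a
print(len(X_tr), len(X_te))  # number of training rows, number of test rows

# The same seed reproduces the same split.
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
```

Changing `random_state` to another integer would give a different, but equally reproducible, split.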



Go back to [TOC](#top)

In [None]:

# Step 5: Dividing the dataset into test and train data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10, shuffle=True)

<a id='step6'></a>
## Step 6 - Training the Linear Regression Model
We train the model using the LinearRegression class from scikit-learn and its .fit() method. We are using the training set of data.
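
After fitting, a simple linear regression is fully described by one slope and one intercept, which scikit-learn exposes as `coef_` and `intercept_`. The sketch below fits hypothetical points that lie exactly on a known line, so the recovered values can be checked by eye; the data and the name `toy_model` are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical points that lie exactly on price = 4000 * carat - 1000.
carat_toy = np.array([[0.5], [1.0], [1.5], [2.0]])
price_toy = np.array([1000.0, 3000.0, 5000.0, 7000.0])

toy_model = LinearRegression().fit(carat_toy, price_toy)
slope = toy_model.coef_[0]        # change in price per additional carat
intercept = toy_model.intercept_  # predicted price when carat is 0
print(f"slope = {slope:.1f}, intercept = {intercept:.1f}")
```

The same two attributes can be read from the fitted model in this notebook to see the line it learned.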



Go back to [TOC](#top)

In [None]:
# Step 6: Selecting the linear regression method from the scikit-learn library
model = LinearRegression().fit(X_train, y_train)

<a id='step7'></a>
## Step 7 - Predicting the Diamond price

Using the .predict() from scikit-learn, we are predicting y based on the X_train and X_test datasets. This will allow us to calculate the performance of the model for both training and test data. 



Go back to [TOC](#top)

In [None]:
# Step 7: Predicting the diamonds price from training and test data
y_prediction_train = model.predict(X_train)
y_prediction_test = model.predict(X_test)

<a id='step8'></a>
## Step 8 - Evaluating the performance of the model
After predicting the diamond's price in step 7, we want to see how our model performs. We are going to use the **R2** score and the **Mean Absolute Error (MAE)** as indications of how well our simple linear regression model can predict the price of a diamond using only the carat size.

-  The R2 score is the proportion of the variance in the dependent variable that the independent variable explains. In other words, how well did the carat size predict the price of the diamond? The closer the score is to 1, the better the model fits the data. If the R2 score is less than .9, we will consider using more features to predict y.   
-  MAE will show us how "off" our model is, in the same units as the price. Because diamond prices can vary even at the same carat size, MAE was chosen because it is less sensitive to outliers than squared-error metrics. The lower the MAE, the closer the predictions are to the actual values.
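
Both metrics can be computed by hand in a few lines, which may help when interpreting the numbers. This is a sketch with hypothetical actual and predicted prices; the formulas are the standard definitions that `metrics.mean_absolute_error` and `metrics.r2_score` implement.

```python
# Hypothetical actual and predicted prices.
actual = [500.0, 1500.0, 3000.0, 5000.0]
predicted = [700.0, 1400.0, 2600.0, 5300.0]

mean_actual = sum(actual) / len(actual)

# MAE: average absolute difference between actual and predicted values.
mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# R2: 1 - (residual sum of squares / total sum of squares).
ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
ss_tot = sum((a - mean_actual) ** 2 for a in actual)
r2 = 1 - ss_res / ss_tot

print(f"MAE = {mae:.2f}, R2 = {r2:.4f}")
```

Here the predictions miss by 250 on average, while R2 compares the squared misses with how much the actual prices vary around their own mean.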



Go back to [TOC](#top)

In [None]:
# Step 8:  Evaluation
# Evaluating the trained model on training data. R2 and MAE calculations are from scikit-learn

print(f"The average carat size of diamonds in data {Diamond['carat'].mean()}")
print(f"The average price of diamonds in the data {Diamond['price'].mean()}")

print ("R2 score on train data= ",metrics.r2_score(y_train,y_prediction_train))
print ("R2 score on test data= ",metrics.r2_score(y_test,y_prediction_test))
print("MAE on train data= " , metrics.mean_absolute_error(y_train, y_prediction_train))
print("MAE on test data = " , metrics.mean_absolute_error(y_test, y_prediction_test))

<a id='step9'></a>
## Step 9 - Graphing the data and regression line
We will use matplotlib to produce a scatter plot of both sets of data and also plot the regression line for both the test and train data sets. Reference lines for the average carat size (X) and price (y) were added as an observation point.
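
One thing worth knowing when reading the reference lines: for the data a least-squares line was fit on, the line passes through the point (mean of X, mean of y) by construction, while on the test set it only does so approximately. The sketch below checks this with the closed-form formulas for a single feature; the values are hypothetical.

```python
# Hypothetical carat sizes and prices.
xs = [0.3, 0.5, 0.9, 1.2, 2.0]
ys = [450.0, 1100.0, 3900.0, 6200.0, 15000.0]

mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)

# Closed-form least-squares slope and intercept for one feature.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

# Evaluating the fitted line at mean_x returns mean_y.
prediction_at_mean = slope * mean_x + intercept
print(prediction_at_mean, mean_y)
```

Because of this property, the crossing point on the training plot says where the line is anchored, not how accurate it is.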



Go back to [TOC](#top)

In [None]:
plt.scatter(X_test, y_test, color='b', label='Actual Data')
plt.plot(X_test, y_prediction_test, color='r', label='Regression Line')
plt.axvline(x=np.nanmean(X_test),color='c',linestyle='--', label ='Avg Carat Size \'X test\'')
plt.axhline(y=np.nanmean(y_test),color='g',linestyle='--', label ='Avg price \'y test\'')
plt.xlabel('Carat Size')
plt.ylabel('Price')
plt.legend()
plt.title('Linear Regression Test Data')
plt.show()

plt.scatter(X_train, y_train, color='b', label='Training Data')
plt.plot(X_train, y_prediction_train, color='r', label='Regression Line')
plt.axvline(x=np.nanmean(X_train),color='c',linestyle='--', label ='Avg Carat Size \'X train\'')
plt.axhline(y=np.nanmean(y_train),color='g',linestyle='--', label ='Avg price \'y train\'')
plt.xlabel('Carat Size')
plt.ylabel('Price')
plt.legend()
plt.title('Linear Regression Training Data')
plt.show()

For easy reference, we create a dataframe using the actual values of price and carat size from the data and the predicted value from the regression model. This is an easy way to visually examine the accuracy of the model. As mentioned in step 3, we will use .head() to print out the first 25 rows of the data. Alternatively, we can use .iloc[] in pandas to select a range of rows to print out. The code box below has commented instructions for using .iloc[].
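
As a quick illustration of how `.head()` and `.iloc[]` relate, the sketch below builds a hypothetical 60-row dataframe and selects the first 25 rows and then the next 25. The column values are made up for the example.

```python
import pandas as pd

# Hypothetical frame with 60 rows.
toy = pd.DataFrame({"Actual": range(60), "Predicted": range(0, 120, 2)})

first_25 = toy.head(25)    # rows at positions 0..24
next_25 = toy.iloc[25:50]  # rows at positions 25..49 (the end is exclusive)

print(first_25.equals(toy.iloc[:25]))
print(len(next_25), next_25.index[0], next_25.index[-1])
```

`.iloc[]` selects by position, so it works the same way on a shuffled test set whose index labels are no longer in order.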

In [None]:

#put the data into a dataframe to compare actual and predicted values, print 25 rows
#comparison_df = pd.DataFrame({"Actual":y_test, "Predicted":y_prediction_test}) 

comparison_df = pd.DataFrame({"Carat": X_test['carat'], "Actual y":y_test, "Predicted y":y_prediction_test}) 
comparison_df.head(25)

# comment out the line above and uncomment the line below to print different sections of the dataframe. Adjust the range for the section of data to print. 
#print (comparison_df.iloc[25:50])

<a id='results'></a>
# Results 
Our results above indicate only a fairly accurate model when the price is predicted solely from the carat size. The R2 score suggests a reasonable fit but, as noted above, the score is less than .9. The MAE indicates our predictions are off by about 1000 on average. We are in the ballpark, but there is more work to be done.

On the training plot, the regression line passes through the intersection of the mean of X and the mean of y. This is a property of a least-squares fit rather than a measure of accuracy: the intersection point represents the predicted value of y when X is at its average.

Looking at the actual and predicted results, we can see that some predicted values are within twenty dollars of the actual price and others are off by thousands. Interpretation of the MAE is relative to the data. Diamonds in our dataset have an average price of \$3932, so being off by about 1000 isn't a terrific result.

The combination of an R2 score < .9 and a somewhat high MAE leads us to perform a multiple linear regression in part 2 to see if the model is more accurate when including other features. We will compare the R2 and MAE from both models and see if our model is more accurate when using a larger feature matrix.



Go back to [TOC](#top)

<a id='multi'></a>
## Introduction - Multiple Linear Regression
Since our model above had some larger variances in the results (an R2 score < .9 and an MAE of about 1000), can we make the model more accurate? One way to try is to include more features in the data and see whether they affect the accuracy of the model. We will use multiple linear regression and add variables which we believe affect the price of a diamond. 

Read more about how a diamond is graded and ultimately priced: https://www.gemsociety.org/article/a-consumers-guide-to-gem-grading/




To keep the dataframes separate, we will import the data into a different dataframe called M_diamond. We will also have to prepare the data to include the categorical features cut, color and clarity, as these seem important when deciding the price of a diamond.



Go back to [TOC](#top)

<a id='stepm1'></a>
## Step 1 - Read In a New Dataframe


Go back to [TOC](#top)

In [None]:

# Reading the csv into a new DataFrame and printing the first few rows.

M_diamond = pd.read_csv('diamonds.csv')
M_diamond.head()


The column 'Unnamed: 0' isn't necessary, as it appears to be just a row index saved in the file. The dataframe already has its own index to identify the rows, so we drop this column from the dataframe as we do not want to use it as a feature.
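
A column named 'Unnamed: 0' typically appears when a csv was saved with its row index, which leaves the first header cell empty. The sketch below reproduces this with a hypothetical two-row csv held in memory and shows two ways to handle it: dropping the column after reading, or telling `read_csv` to use it as the index via `index_col=0`. The second option is an alternative for illustration, not what this notebook does.

```python
import io
import pandas as pd

# Hypothetical two-row csv whose first header cell is empty, which is how a
# saved row index typically appears in a file.
csv_text = ",carat,price\n0,0.23,326\n1,0.21,326\n"

# Option 1: read everything, then drop the extra column.
dropped = pd.read_csv(io.StringIO(csv_text)).drop(columns="Unnamed: 0")

# Option 2: treat the first column as the row index while reading.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(dropped.columns), list(indexed.columns))
```

Either way, the row counter stays out of the feature matrix.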

In [None]:
# drop the Unnamed: 0 column
M_diamond.drop('Unnamed: 0', axis=1,inplace=True)

<a id='stepm2'></a>
## Step 2 - Data Preparation

This time we need to prepare the data by converting the categorical features into numerical values. In order to do this, we will extract the unique values from each of the columns (color and clarity) and create an additional column in the dataframe for each unique value. After the columns are created, for each diamond, we will populate the appropriate cell with a boolean (True or False) to indicate which of the characteristics the diamond has. Once this is completed, we will replace the True and False values with the integers 1 and 0 respectively.

For example, take the color of a diamond. Each diamond is graded against a scale which contains seven colors (E,I,J,H,F,G,D). We will expand the dataframe to include one column for each of the color grades. If a diamond is graded an 'E' in color, the dataframe will contain a 'True' in the new 'E' column and 'False' for the other columns (I,J,H,F,G,D). This ensures we have accounted for the color with a boolean value. We will apply the same concept for clarity by expanding the dataframe with an additional eight columns (one for each clarity grade). The True and False values will then be replaced by the integers 1 and 0 respectively.

The cut rating of a diamond is ordinal, which means its values form a hierarchical scale. We therefore replace each cut rating directly with a number on that scale. Color and clarity are also graded from worst to best (see the data description above), but in this notebook they are treated as unordered categories and one-hot encoded.
The cut scale will be replaced as:
- Fair = 0
- Good = 1
- Very Good = 2
- Premium = 3
- Ideal = 4
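
The two encodings described above can be sketched on a three-row toy dataframe. The rows are hypothetical; `.map()` is used here in place of `.replace()` for the ordinal scale, and `dtype=int` is passed to `get_dummies` so the new columns come out as 0/1 directly. Both choices are assumptions made to keep the example short.

```python
import pandas as pd

# Hypothetical three-row frame using grade labels from the data description.
toy = pd.DataFrame({
    "cut": ["Fair", "Ideal", "Good"],
    "color": ["E", "J", "E"],
})

# One-hot encoding: one 0/1 column per color grade present in the data.
color_dummies = pd.get_dummies(toy["color"], prefix="color", dtype=int)

# Ordinal encoding: map each cut grade to its position on the scale.
cut_scale = {"Fair": 0, "Good": 1, "Very Good": 2, "Premium": 3, "Ideal": 4}
toy["cut"] = toy["cut"].map(cut_scale)

encoded = pd.concat([toy.drop(columns="color"), color_dummies], axis=1)
print(encoded)
```

The cut column keeps its order in a single number, while each color grade gets its own column.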



Go back to [TOC](#top)


In [None]:
# looking to identify the columns which contain categorical features. 
M_diamond.select_dtypes('object').columns

In the code box below, we iterate through the dataframe columns. The for loop controls the iteration over each column. The if statement prints the column name and the unique values in the column if the column's datatype is 'object'.

This gives us the unique names we need to add the columns for color and clarity. It also shows the unique values of the rating scale for cut.

In [None]:
for col in M_diamond:
    if M_diamond[col].dtypes=='object':
        print(f'{col} : {M_diamond[col].unique()}')

<a id='one-hot'></a>
## One-hot Encoding
The code box below uses one-hot encoding to convert each of the unique values we identified above into its own boolean (True or False) column. This technique uses the 'get_dummies' function from pandas and stores the result in a new dataframe called 'dummies'. As noted above, we will perform this for the 'color' and 'clarity' columns.

Multicollinearity can occur when using one-hot encoding. This means that two or more of the new independent variables we are creating are highly correlated with one another. This condition makes it difficult to identify the effect of each variable on the dependent variable, because the variables are too closely related. With dummy variables, the dummy variable trap occurs when one dummy variable can be predicted from the others. To avoid it, we drop one of the dummy variables: the 'drop_first=True' option below drops the first level of each variable. Compare the output of 'dummies.dtypes' below to the unique object values from above. One unique value was dropped from 'color' and one from 'clarity'.
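
The dummy variable trap can be seen directly on a small example. In the sketch below (hypothetical color grades, with `dtype=int` assumed for readable 0/1 output), every row of the full set of dummy columns sums to 1, so any one column can be reconstructed from the others.

```python
import pandas as pd

# Hypothetical color grades.
colors = pd.DataFrame({"color": ["D", "E", "F", "E", "D"]})

full = pd.get_dummies(colors, dtype=int)                      # color_D, color_E, color_F
reduced = pd.get_dummies(colors, drop_first=True, dtype=int)  # color_E, color_F

# Every row of the full set sums to 1, so any one column equals
# 1 minus the sum of the others: the dummy variable trap.
row_sums = full.sum(axis=1)
print(row_sums.tolist())

# The dropped column can still be recovered from the remaining ones.
recovered_d = 1 - reduced.sum(axis=1)
print(recovered_d.tolist(), full["color_D"].tolist())
```

No information is lost by dropping the first level: a row with zeros in all remaining columns is the dropped grade.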


Go back to [TOC](#top)

In [None]:
dummies = pd.get_dummies(M_diamond[['color','clarity']],drop_first=True)
print(dummies.dtypes)
print(f"A few lines of the new dataframe \n {dummies.head()}")

From the output above, we can see that 'get_dummies' created the new columns. Depending on the pandas version, their values may be boolean rather than numeric: a value is 'True' if the diamond in the row has that particular color or clarity grade and 'False' otherwise. The next code box uses the .astype() method to convert the columns to integers, so that 'True' becomes '1' and 'False' becomes '0'. This is the final step for the color and clarity columns. Observe the output from the box below: all of the values are now '1' or '0'.

In [None]:
# Cast the dummy columns to integers: True becomes 1 and False becomes 0
dummies = dummies.astype(int)

dummies

Now that the dummies dataframe is complete for color and clarity, we will now concatenate the dummies df with our M_diamond dataframe. Also drop the categorical 'color' and 'clarity' columns

In [None]:
# Concat the two dataframes together as noted in above Markdown box. 
M_diamond = pd.concat([M_diamond,dummies],axis=1)
M_diamond.drop(['color','clarity'],axis=1,inplace=True)
M_diamond

The final column to fix is to replace the cut column with a numeric scale. 
We first look at the unique values in the dataframe

In [None]:
M_diamond.cut.unique()

Next we use .replace() to replace the values as follows:
- Fair = 0
- Good = 1
- Very Good = 2
- Premium = 3
- Ideal = 4

We print the info for the dataframe to confirm all columns are now numeric

In [None]:
# replace the values in cut with a numeric scale and display the data type for each column
# astype(int) makes the integer dtype explicit after the replacement
M_diamond['cut'] = M_diamond['cut'].replace({'Fair':0,'Good':1,'Very Good':2,'Premium':3,'Ideal':4}).astype(int)
M_diamond.info()

In [None]:
# Finding the cross-correlation matrix - Looking for additional  correlated values. 
print(M_diamond.corr(numeric_only = True))

<a id='stepm3'></a>
## Step 3 - Assigning X and y - Feature and Target
We need to reassign the independent and dependent variables. X is independent and y is dependent. We are going to use X to predict y.   
X = every column except price. This enlarges the feature matrix compared with the previous simple regression.   
y = price, the target variable 



Go back to [TOC](#top)

In [None]:
# Step 3: Separating the data into features and labels

X = M_diamond.drop('price',axis=1) # Independent variable Drop price and keep everything else
y = M_diamond['price'] # dependent variable
X.head()

<a id='stepm4'></a>
## Step 4 - Dividing the Data Into Sets - Train and Test
We want to take our data and divide it into two sets, train and test. We will use our train set to train the model and use the test set afterwards. 70% of the data will be used to train the model, with 30% used as the test data set. Using the random_state option sets a seed so that we can recreate the same split and the same results. The shuffle option shuffles the data before it is split, so that the order of the rows in the file does not bias either set. train_test_split() is a function of scikit-learn.



Go back to [TOC](#top)

In [None]:

# Step 4: Dividing the dataset into test and train data
X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y, test_size=0.3, random_state=10, shuffle=True)
X_train2.head()

<a id='stepm5'></a>
## Step 5 - Training the Linear Regression Model
We train the model using the LinearRegression class from scikit-learn and its .fit() method. We are using the training set of data.



Go back to [TOC](#top)

In [None]:
# Step 5: Selecting the linear regression method from the scikit-learn library
model = LinearRegression().fit(X_train2, y_train2)



<a id='stepm6'></a>
## Step 6 - Predictions and Performance Evaluation
This time we combine the price predictions and the MAE and R2 calculations in a single code box to evaluate the performance. A comparison dataframe is also created to allow for an easy comparison with the results from the simple regression above.



Go back to [TOC](#top)

In [None]:
# Step 6: Validation
# Evaluating the trained model on training and test data
# MAE is in the same units as the target (price), so it is read relative to the average price printed below.
# For R2, the best possible score is 1, which is obtained when the predicted values are the same as the actual values.
print(f"The average carat size of diamonds in data {M_diamond['carat'].mean():.4f}")
print(f"The average price of diamonds in the data {M_diamond['price'].mean():.2f}")

# Generate the predictions
y_prediction_train2 = model.predict(X_train2)
y_prediction_test2 = model.predict(X_test2)

# Evaluating the trained model on both data sets
print("MAE on multiple regression train data= " , metrics.mean_absolute_error(y_train2, y_prediction_train2))
print("MAE on multiple regression test data = " , metrics.mean_absolute_error(y_test2, y_prediction_test2))
print ("R2 score on multiple regression train data= ",metrics.r2_score(y_train2,y_prediction_train2))
print ("R2 score on multiple regression test data= ",metrics.r2_score(y_test2,y_prediction_test2))


#put the data into a dataframe to compare actual and predicted values, print 25 rows
comparison_df2 = pd.DataFrame({"Carat":X_test2["carat"],"Actual":y_test2, "Predicted":y_prediction_test2})
comparison_df2.head(25)

<a id='finalm'></a>
# Final Results
Looking at the results from the code box below, one can see the R2 scores for the multiple linear regression model were above .9 and the MAE was lower, which indicates smaller prediction errors when using the multiple linear regression model. Overall, the multiple linear regression model was more accurate in predicting the price of a diamond than the simple linear regression. This model aligns with how the diamond industry grades and subsequently prices its diamonds (see the diamond article in the [references](#references) section). The carat, cut, color and clarity all affect the price of the diamond. 

Two bar charts were created using matplotlib. The values and labels were put into two separate lists. Titles were added and the bar chart was plotted for each value. Please see the comments in the code box for additional Python syntax detail.



Go back to [TOC](#top)

In [None]:
# R2 Score and MAE using metrics from scikit-learn - round to 4 decimal places.
r2SimpleTrain = round(metrics.r2_score(y_train,y_prediction_train),4)
r2SimpleTest = round(metrics.r2_score(y_test,y_prediction_test),4)
r2MultiTrain =round(metrics.r2_score(y_train2,y_prediction_train2),4)
r2MultiTest = round(metrics.r2_score(y_test2,y_prediction_test2),4)

maeSimpleTrain = round(metrics.mean_absolute_error(y_train, y_prediction_train),4)
maeSimpleTest = round(metrics.mean_absolute_error(y_test, y_prediction_test),4)
maeMultiTrain = round(metrics.mean_absolute_error(y_train2, y_prediction_train2),4)
maeMultiTest = round(metrics.mean_absolute_error(y_test2, y_prediction_test2),4)

#Print the scores out 
print (f"The R2 simple linear regression on training data : {r2SimpleTrain}")
print (f"The R2 multiple regression on training data : {r2MultiTrain}")
print('\n')
print (f"The R2 simple linear regression on test data : {r2SimpleTest}")
print (f"The R2 multiple linear regression on test data : {r2MultiTest}")
print('\n')
print (f"The MAE simple linear regression on training data : {maeSimpleTrain}")
print (f"The MAE multiple linear regression on training data : {maeMultiTrain}")
print('\n')
print (f"The MAE simple linear regression on test data : {maeSimpleTest}")
print (f"The MAE multiple linear regression on test data : {maeMultiTest}")

## Create two bar graphs to compare R2 and MAE scores from simple to multi linear regression
# Put all values and labels into lists
r2Values=[r2SimpleTrain, r2MultiTrain, r2SimpleTest, r2MultiTest]
r2ValueLabels = ['R2 Simp Train','R2 Mult Train',  'R2 Simp Test','R2 Mult Test']
maeLabels=['MAE Simp Train', 'MAE Mult Train', 'MAE Simp Test','MAE Mult Test']
maeValues =[maeSimpleTrain, maeMultiTrain, maeSimpleTest, maeMultiTest]


# Create the first bar chart for R2
bars = plt.bar(r2ValueLabels, r2Values, width=0.4)
bars[0].set_color('green')
bars[1].set_color('green')
bars[2].set_color('blue')
bars[3].set_color('blue')
# Loop to put the centered data values on top of the bars
for i in range(len(r2ValueLabels)):
    plt.text(i,r2Values[i],r2Values[i], ha='center')
#Titles and labels for chart    
plt.suptitle('R2 test results - Simple vs. Multiple Linear Regression')
plt.title("Higher values reflect higher accuracy")
plt.xlabel('R2 Results')
plt.show()

# Create the second chart for MAE
bars = plt.bar(maeLabels, maeValues, width=0.4)
bars[0].set_color('green')
bars[1].set_color('green')
bars[2].set_color('blue')
bars[3].set_color('blue')
# Loop to put the centered data values on top of the bars
for i in range(len(maeLabels)):
    plt.text(i,maeValues[i],maeValues[i], ha='center')
#Titles and labels for chart
plt.suptitle('MAE test results - Simple vs. Multiple Linear Regression')
plt.title("Lower values reflect smaller prediction errors")
plt.xlabel('MAE Results')
plt.show() 

<a id='references'></a>
### References and sources
Data Source :   
https://www.kaggle.com/datasets/swatikhedekar/price-prediction-of-diamond/data

Basic Regression test sample:   
https://www.educative.io/blog/machine-learning-regression-models-with-python

Code to prepare categorical data:   
https://www.kaggle.com/code/amirulabdlatib/diamond-price-prediction 

Matplotlib plotting:   
https://www.geeksforgeeks.org/bar-plot-in-matplotlib/   
https://www.geeksforgeeks.org/plot-a-horizontal-line-in-matplotlib/   
https://bobbyhadz.com/blog/matplotlib-add-average-line-to-plot#:~:text=Use%20the%20pyplot.,data%20coordinates%20as%20a%20parameter   
https://www.geeksforgeeks.org/adding-value-labels-on-a-matplotlib-bar-chart/


R2 :   
https://statisticsbyjim.com/regression/interpret-r-squared-regression/

Multicollinearity:   
https://www.analyticsvidhya.com/blog/2020/03/what-is-multicollinearity/

Diamond Pricing:   
https://www.gemsociety.org/article/a-consumers-guide-to-gem-grading/


Go back to [TOC](#top)
