# The X-files problem

This problem is for you to show your modeling abilities.

## Context

The limousine comes to a full stop. As the driver gets out to open the door, you take a deep breath and get inside. You enter 10 Downing Street and are conducted to the usual meeting room. Inside you find the Prime Minister, accompanied by a fat, tall man and a short, deformed one with long ears and an even longer nose.

>  <font color= #2B65EC> **Prime Minister**: Ah! You’re here! Great! Let me introduce my guests. This is Fidelious, Minister of Magic, and Krenk, the owner of the Gringotts Wizarding Bank. <br> </font>
> <font color= green> **You**: Uhhh, ma’am, is this a joke?  <br> </font>
> <font color= gray>  **Fidelious**: Not at all, but don’t worry, don’t sweat the details, tomorrow you won’t remember anything. Security measures, you see. <br> </font>
> <font color= orange> **Krenk**: Let’s move things along. I don’t like to be exposed to Muggles.<br> </font>
> <font color= green> **You**: What... <br> </font>
>  <font color= #2B65EC> **Prime Minister**: Our friends here seem to have run into a bit of an issue, see, some diamonds seem to have been stolen. Problem is, the only person... goblin, to have seen them is our distinguished guest, Krenk. <br> </font>
> <font color= gray>  **Fidelious**: And while the Ministry completely believes Krenk as to the diamonds’ worth, we need another person to validate his claim. Safety policies, you see. <br> </font>
>  <font color= #2B65EC> **Prime Minister**: So, since you’re the best data scientist in our country, I thought you could help. Mr. Krenk will provide you with the characteristics of the missing diamonds and we need you to create a model to value them. <br> </font>
> <font color= green> **You**: But I’m not a lapidarist.<br> </font>
>  <font color= #2B65EC> **Prime Minister**: Which is why we’re providing you with a huge dataset, containing characteristics and valuations for tens of thousands of diamonds. Now, get working. <br> </font>

Huge? Tens of thousands? You think. And I thought I was the clueless one here.

## Data

The file that contains the information can be downloaded from: [Diamonds Dataset](https://github.com/jeasusav10/IM-Automation/blob/main/Lapidarist%20Problem/diamonds_data.csv)

## Code

In general terms, the goal was to <font color= red> **create a prediction model** </font>.

The code uses **TensorFlow** to create the model, **Pandas** for data manipulation and analysis, and **Matplotlib/Seaborn** for data visualization.

### Import Data

The first step was to import the data and the required libraries.

In [1]:
#Install tensorflow - uncomment if required
#!pip install tensorflow 

In [2]:
#Import libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers.experimental import preprocessing

In [3]:
#Import data
path = 'diamonds_data.csv'
df = pd.read_csv(path)

In [4]:
#Last five rows
df.tail()

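| | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 53925 | 0.72 | Ideal | D | SI1 | 60.8 | 57.0 | 2757 | 5.75 | 5.76 | 3.5 |
| 53926 | 0.72 | Good | D | SI1 | 63.1 | 55.0 | 2757 | 5.69 | 5.75 | 3.61 |
| 53927 | 0.7 | Very Good | D | SI1 | 62.8 | 60.0 | 2757 | 5.66 | 5.68 | 3.56 |
| 53928 | 0.86 | Premium | H | SI2 | 61.0 | 58.0 | 2757 | 6.15 | 6.12 | 3.74 |
| 53929 | 0.75 | Ideal | D | SI2 | 62.2 | 55.0 | 2757 | 5.83 | 5.87 | 3.64 |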


### Data Overview

It was necessary to check which parameters have *Null/NaN values*, in order to replace the *damaged* entries.

In [5]:
#General information of the parameters
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53930 entries, 0 to 53929
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   carat    53930 non-null  float64
 1   cut      53930 non-null  object 
 2   color    53930 non-null  object 
 3   clarity  53930 non-null  object 
 4   depth    53930 non-null  float64
 5   table    53930 non-null  float64
 6   price    53930 non-null  int64  
 7   x        53930 non-null  float64
 8   y        53930 non-null  float64
 9   z        53930 non-null  float64
dtypes: float64(6), int64(1), object(3)
memory usage: 4.1+ MB


In this case, the dataset did not contain null values.

However, the algorithms for **prediction models** require numeric data. In the dataset, there were three non-numeric parameters (**cut, color, clarity**). 

So, the first step was to encode these parameters as *dummy* variables, with one 0/1 column per category.

In [6]:
#Generate dummies
df = pd.get_dummies(df, prefix='', prefix_sep='')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53930 entries, 0 to 53929
Data columns (total 27 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   carat      53930 non-null  float64
 1   depth      53930 non-null  float64
 2   table      53930 non-null  float64
 3   price      53930 non-null  int64  
 4   x          53930 non-null  float64
 5   y          53930 non-null  float64
 6   z          53930 non-null  float64
 7   Fair       53930 non-null  uint8  
 8   Good       53930 non-null  uint8  
 9   Ideal      53930 non-null  uint8  
 10  Premium    53930 non-null  uint8  
 11  Very Good  53930 non-null  uint8  
 12  D          53930 non-null  uint8  
 13  E          53930 non-null  uint8  
 14  F          53930 non-null  uint8  
 15  G          53930 non-null  uint8  
 16  H          53930 non-null  uint8  
 17  I          53930 non-null  uint8  
 18  J          53930 non-null  uint8  
 19  I1         53930 non-null  uint8  
 20  IF    
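
To make the effect of `pd.get_dummies` concrete, here is a minimal sketch on a three-row toy frame. The values in `toy` are made up for illustration and are not rows from the dataset. Each distinct category becomes its own indicator column, while numeric columns pass through unchanged.

```python
import pandas as pd

# Toy frame (illustrative values only, not rows from diamonds_data.csv)
toy = pd.DataFrame({
    "carat": [0.30, 0.72, 1.01],
    "cut": ["Ideal", "Good", "Ideal"],
})

# Same call as in the notebook: an empty prefix keeps the raw category names
encoded = pd.get_dummies(toy, prefix='', prefix_sep='')

# Numeric columns come first, followed by one column per category
print(encoded.columns.tolist())
```

Note that an empty prefix only works safely when category names do not collide across columns, which holds for **cut**, **color**, and **clarity** here.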

### Trainset & Testset

The next step was to split the data into a training set and a test set. The dataset is fairly large (**>50,000 entries**), so the usual 80/20 rule is not really necessary. In cases like this, a larger share of the data can go to training; therefore, **the chosen partition was 90/10** for the trainset and testset, respectively.

In [8]:
#Split in train/test sets
trainset = df.sample(frac=0.9, random_state=0)
testset = df.drop(trainset.index)
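
A quick sanity check for this kind of split is that the two sets are disjoint and together cover every row. The sketch below runs that check on a small synthetic frame; `toy`, `toy_train`, and `toy_test` are hypothetical stand-ins for the notebook's variables.

```python
import pandas as pd

# Synthetic stand-in for df (100 rows, illustrative only)
toy = pd.DataFrame({"price": range(100)})

# Sample 90% of the rows for training; the remaining rows form the test set
toy_train = toy.sample(frac=0.9, random_state=0)
toy_test = toy.drop(toy_train.index)

# The sets must not share rows and must cover the whole frame
no_overlap = toy_train.index.intersection(toy_test.index).empty
full_cover = len(toy_train) + len(toy_test) == len(toy)
print(len(toy_train), len(toy_test), no_overlap, full_cover)
```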

### Features & Labels

It was also necessary to define the output: **price**.

It is important to remember:
- Features = Inputs
- Labels = Outputs

In [9]:
#Define labels and features for both sets
train_features = trainset.copy()
test_features = testset.copy()

train_labels = train_features.pop('price')
test_labels = test_features.pop('price')

### Normalization

Normalization was required because the parameters have noticeably different ranges. This can cause problems for the algorithm, because features with larger values may end up weighted more heavily than the others.

For example, **depth** values are in the tens (mean ≈ 61.7), whereas **carat** averages below 1 (mean ≈ 0.80).

In [12]:
#Verify if normalization required
train_features.describe().transpose()[['mean', 'std']]

| | mean | std |
|---|---|---|
| carat | 0.797333 | 0.474182 |
| depth | 61.749031 | 1.433777 |
| table | 57.457851 | 2.236612 |
| x | 5.729558 | 1.121658 |
| y | 5.73286 | 1.139139 |
| z | 3.537737 | 0.706939 |
| Fair | 0.030224 | 0.171206 |
| Good | 0.090941 | 0.287528 |
| Ideal | 0.400416 | 0.489988 |
| Premium | 0.254692 | 0.435693 |


In [18]:
#Normalization (layer)
normalizer = preprocessing.Normalization()
normalizer.adapt(np.array(train_features))

In [19]:
#Show the per-feature means that adapt() computed and stored in the normalization layer
normalizer.mean.numpy()

array([7.9733318e-01, 6.1749031e+01, 5.7457851e+01, 5.7295585e+00,
       5.7328601e+00, 3.5377374e+00, 3.0224364e-02, 9.0940930e-02,
       4.0041617e-01, 2.5469229e-01, 2.2372623e-01, 1.2621300e-01,
       1.8124318e-01, 1.7658694e-01, 2.0909822e-01, 1.5439768e-01,
       1.0048005e-01, 5.1980961e-02, 1.3948122e-02, 3.3500217e-02,
       2.4224818e-01, 1.7020005e-01, 1.5040073e-01, 2.2722872e-01,
       6.7989372e-02, 9.4484620e-02], dtype=float32)

When the layer is called it returns the input data, with each feature independently normalized.
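
As a rough NumPy sketch of what this amounts to, assuming the layer standardizes each column with the mean and variance learned during `adapt` (the `features` array below is made up for illustration):

```python
import numpy as np

# Two toy features on very different scales (illustrative values)
features = np.array([
    [0.3, 61.0],
    [0.7, 62.5],
    [1.1, 60.0],
    [0.9, 63.5],
])

# Per-column statistics, analogous to what adapt() stores
mean = features.mean(axis=0)
std = features.std(axis=0)  # population std (ddof=0); an assumption of this sketch

# Standardize each feature independently
normalized = (features - mean) / std

# After standardization each column has mean ~0 and std ~1
print(normalized.mean(axis=0).round(6))
print(normalized.std(axis=0).round(6))
```

After this transformation both columns live on the same scale, so neither dominates purely because of its units.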

### Model (DNN)

The idea was to build a Deep Neural Network with a multivariate linear regression as its output layer.
This means that the goal is to predict a single output (**price**) from several input features.

To understand the procedure, the next code lines show the representation for:
- Linear Regression: One variable
- Linear Regression: Multiple variables 
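
As a minimal NumPy sketch of the two forms: with one variable the prediction is `w * x + b`, and with several variables it is the dot product of a weight vector with the feature vector, plus a bias. The weights and inputs below are arbitrary illustrative numbers, not fitted values.

```python
import numpy as np

# One variable: y = w * x + b
w, b = 2.0, 1.0
x = 3.0
y_single = w * x + b

# Multiple variables: y = w1*x1 + ... + wn*xn + b
weights = np.array([2.0, -1.0, 0.5])
inputs = np.array([3.0, 4.0, 2.0])
y_multi = float(weights @ inputs + b)

print(y_single, y_multi)
```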