# Transform color to pH value 

Hello there! 👋 If you are curious about how to build, step by step, an ML model that maps a color to a pH value, you are in the right place! I will guide you through this journey as best I can and explain the intricacies and thought process behind the solution.

### ML process

Before getting our hands dirty, we need to understand the type of problem we are solving. For starters, pH is an indicator of how acidic or basic a mixture is, and it can be measured in a myriad of ways, including electronic sensors and litmus paper. The latter takes advantage of chemical reactions that dye the paper a color, which can then be compared with a known pH-color scale:

![pH-scale](https://img.freepik.com/premium-vector/ph-scale_79145-851.jpg?w=996)

Given the last paragraph, which realm of machine learning (supervised or unsupervised learning) fits our problem better? Let me give you a hint: you are working with known labels. If you thought of supervised learning, you guessed it! To be specific, we are going to treat this as a regression problem. In other words, we will try to find the mathematical formula that best represents the relationship between color and pH values.
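
To make "finding a formula" concrete, here is a minimal sketch that fits a straight line to a handful of points with NumPy. The numbers are made up for illustration (they are not taken from the dataset), `predict_ph` is a hypothetical helper, and a real color-to-pH mapping is unlikely to be this simple.

```python
import numpy as np

# Made-up (feature, pH) pairs that lie exactly on pH = 14 * x.
x = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
ph = 14.0 * x

# Least-squares fit of a degree-1 polynomial: ph ~ slope * x + intercept.
slope, intercept = np.polyfit(x, ph, deg=1)


def predict_ph(feature):
    """Apply the fitted formula to a new feature value."""
    return slope * feature + intercept


print(round(predict_ph(0.5), 2))
```

The regressors used later in this notebook do the same job, only with more flexible formulas than a straight line.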

Now that we know the ML approach for our problem, the methodology is the following:

```
preprocessing -> data_normalization -> data_split -> model_training -> model_testing
```

**No more talk, let's get into it!**

In [38]:
#importing the packages.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import colorsys

We first need to load our dataset, which you can download from [here](https://www.kaggle.com/datasets/robjan/ph-recognition). Then you should complete it with `dataset/extension_data.csv`, which has some complementary training values that were manually tagged from different image repositories found on Google (worth the effort 😉).

I added this complementary data because the original model struggled to accurately predict the pH values of real-life litmus paper measurements.

In [None]:
#reading the dataset.
ds_base = pd.read_csv('dataset/ph_data.csv')
ds_extension = pd.read_csv('dataset/extension_data.csv')
ds = pd.concat([ds_base, ds_extension])
ds


Now let us understand how an image is built. An image is a matrix of RGB pixels, and the RGB color space can be represented as a three-dimensional cube:

![wikimedia-commons-rgb-image](https://s4f4q3q3.rocketcdn.me/wp-content/uploads/2021/11/RGB_color_solid_cube.png)
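
As a rough illustration of that matrix view, the sketch below builds a tiny made-up image as a NumPy array and reduces it to a single color. Taking the per-channel mean is only an assumption about how a "principal color" could be obtained; the dataset used here already ships one color per sample.

```python
import numpy as np

# A made-up 2x2 image: each pixel is an (R, G, B) triple in the 0-255 range.
image = np.array([
    [[255, 0, 0], [255, 0, 0]],
    [[200, 0, 0], [250, 0, 0]],
], dtype=np.uint8)

print(image.shape)  # (height, width, channels)

# Assumption: use the per-channel mean as a stand-in for the principal color.
mean_color = image.reshape(-1, 3).mean(axis=0)
print(mean_color)
```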

Now let's assume that we have the principal color of a few images, which happen to be red, green, and purple, or 0, 7, and 14 on the pH scale respectively. If we try to find a line that connects those dots, we end up with the following:

![rgb-color=ph-mapping-path](../images/rgb_path.png)

What would happen if we added blue and yellow to the mix? Could we come up with a mathematical formula that solves this regression problem? The answer is probably yes, with some difficulty, but there is another way to represent the color space that is more useful for our problem: HSV (hue, saturation, value).
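
To get a feel for why a different color space can help, the sketch below converts five idealized colors to HSV with the standard-library `colorsys` module and prints their hue. The colors are textbook primaries, not measurements from the dataset, and their order assumes yellow sits between red and green and blue between green and purple, as on typical pH charts.

```python
import colorsys

# Idealized RGB colors (0-1 range), listed from the acidic end to the basic end.
colors_by_ph = [
    ("red", (1.0, 0.0, 0.0)),
    ("yellow", (1.0, 1.0, 0.0)),
    ("green", (0.0, 1.0, 0.0)),
    ("blue", (0.0, 0.0, 1.0)),
    ("purple", (0.5, 0.0, 0.5)),
]

hues = []
for name, rgb in colors_by_ph:
    hue, saturation, value = colorsys.rgb_to_hsv(*rgb)
    hues.append(hue)
    print(f"{name:>6}: hue={hue:.3f}")
```

For these idealized colors the hue grows steadily from red to purple, so a single number already orders them along the scale. Keep in mind that hue is circular: a real red can land near 0 or near 1, so measured data will be less tidy than this.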

In [40]:
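# Transform color space from RGB to HSV.
# Scale the 0-255 channels to the 0-1 range expected by colorsys.
ds[["blue", "green", "red"]] = ds[["blue", "green", "red"]] / 255
# Note: the result is stored back in the same columns, so from here on
# "red" holds hue, "green" holds saturation, and "blue" holds value.
ds[["red", "green", "blue"]] = ds[["red", "green", "blue"]].apply(lambda x: pd.Series(colorsys.rgb_to_hsv(*x), index=['red', 'green', 'blue']), axis=1)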
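
Because the cell above writes the HSV values back into the `red`, `green`, and `blue` columns, it is easy to lose track of what each column holds. As an alternative sketch, the hypothetical helper `add_hsv_columns` below keeps the RGB columns and adds explicitly named `hue`, `saturation`, and `value` columns. It assumes 0-255 inputs and is shown on made-up rows; the rest of this notebook keeps using the original column names.

```python
import colorsys

import pandas as pd


def add_hsv_columns(df):
    """Return a copy of df with hue/saturation/value columns appended.

    Assumes df has "red", "green" and "blue" columns in the 0-255 range.
    """
    hsv = [
        colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
        for r, g, b in zip(df["red"], df["green"], df["blue"])
    ]
    hsv_df = pd.DataFrame(hsv, columns=["hue", "saturation", "value"], index=df.index)
    return pd.concat([df, hsv_df], axis=1)


# Two made-up rows: pure red and pure green.
toy = pd.DataFrame({"red": [255, 0], "green": [0, 255], "blue": [0, 0], "label": [0, 7]})
result = add_hsv_columns(toy)
print(result)
```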

In [None]:
ds["blue"].corr(ds["label"])

In [None]:
ds["red"].corr(ds["label"])

In [None]:
ds["green"].corr(ds["label"])
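
The three cells above each compute the Pearson correlation between one column and the label. As a small sketch on made-up data, the same check can be done for every feature column in one call with `DataFrame.corrwith`. Note that Pearson correlation only captures linear relationships, so a low value does not by itself prove a feature is useless.

```python
import pandas as pd

# Made-up frame: "a" rises with the label and "b" falls with it.
toy = pd.DataFrame({
    "a": [0.0, 0.1, 0.2, 0.3],
    "b": [0.9, 0.6, 0.3, 0.0],
    "label": [0, 1, 2, 3],
})

# Pearson correlation of every feature column with the label, in one call.
correlations = toy.drop(columns="label").corrwith(toy["label"])
print(correlations)
```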

In [5]:
# Keep only hue (stored in "red"), value (stored in "blue"), and the label.
ds = ds[["red", "blue", "label"]]
ds

| | red | blue | label |
|---|---|---|---|
| 0 | 0.992647 | 0.905882 | 0 |
| 1 | 0.037383 | 0.980392 | 1 |
| 2 | 0.097095 | 1.000000 | 2 |
| 3 | 0.130901 | 1.000000 | 3 |
| 4 | 0.168468 | 0.874510 | 4 |
| ... | ... | ... | ... |
| 648 | 0.564677 | 0.788235 | 10 |
| 649 | 0.661359 | 0.796078 | 11 |
| 650 | 0.647383 | 0.662745 | 12 |
| 651 | 0.718137 | 0.678431 | 13 |


In [None]:
# Check label balance: count how many samples each pH value has.
ds_plot = ds.label.value_counts()
ds_plot.plot(kind = 'bar')
plt.xticks(rotation=45)
plt.xlabel('pH Scale')
plt.ylabel('Count')
plt.savefig("count_plot.png", dpi = 1000, transparent = True)

In [31]:
X = ds.iloc[:, :-1].columns
X
# y = ds.iloc[:, -1].columns

Index(['blue', 'green', 'red'], dtype='object')

In [43]:
#selecting X (features) and y (target) for regression.
X = ds.iloc[:, :-1].values
y = ds.iloc[:, -1].values

In [25]:
print(X)

[[0.91 0.88 0.99]
 [0.98 0.86 0.04]
 [1.   0.85 0.1 ]
 ...
 [0.66 0.72 0.65]
 [0.68 0.79 0.72]
 [0.51 0.98 0.76]]


In [44]:
#reshaping y for regression.
y = y.reshape(len(y),1)
print(y)

[[ 0]
 [ 1]
 [ 2]
 [ 3]
 [ 4]
 [ 5]
 [ 6]
 [ 7]
 [ 8]
 [ 9]
 [10]
 [11]
 [12]
 [13]
 [14]
 ...

In [45]:
#scaling the independent variable.
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X = sc_X.fit_transform(X)
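
Standardization replaces each feature value `x` with `(x - mean) / std`. The cell above learns `mean` and `std` from the whole dataset before it is split; a common alternative, sketched below with plain NumPy on made-up numbers, is to learn them from the training rows only and reuse them for the test rows, so that no test statistics leak into training. This is a sketch of the idea, not a replacement for the cell above.

```python
import numpy as np

# Made-up feature matrix: 4 training samples and 1 test sample, 2 features each.
X_train_toy = np.array([[0.0, 10.0], [1.0, 20.0], [2.0, 30.0], [3.0, 40.0]])
X_test_toy = np.array([[1.5, 25.0]])

# Learn the statistics from the training rows only.
mean = X_train_toy.mean(axis=0)
std = X_train_toy.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# Apply the same shift and scale to both splits.
X_train_scaled = (X_train_toy - mean) / std
X_test_scaled = (X_test_toy - mean) / std

print(X_train_scaled.mean(axis=0))
print(X_test_scaled)
```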

In [14]:
print(X)

[[ 0.87521907  0.32902404  2.06118638]
 [ 1.35465935  0.15836282 -1.43351586]
 [ 1.48082785  0.15145226 -1.21506905]
 ...
 [-0.68927028 -0.72288193  0.79808476]
 [-0.58833548 -0.28138474  1.05692981]
 [-1.64815085  0.96854817  1.22312427]]


In [15]:
print(y)

[[ 0]
 [ 1]
 [ 2]
 [ 3]
 [ 4]
 [ 5]
 [ 6]
 [ 7]
 [ 8]
 [ 9]
 [10]
 [11]
 [12]
 [13]
 [14]
 ...

In [46]:
#splitting the dataset into test and training.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)

In [47]:
#importing the packages.
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.metrics import r2_score

SEED = 1

#variable initiation.
rf = RandomForestRegressor(random_state = SEED)
dt = DecisionTreeRegressor(random_state = 0)
svr = SVR(kernel = 'rbf')
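#note: the `normalize` argument was removed in scikit-learn 1.2; on newer versions drop it (X is already standardized above).
ridge = Ridge(alpha=0.1, normalize=True)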
lr = LinearRegression()

#storing in an array.
regressors = [('Multiple Linear Regression', lr),
              ('Ridge Regression', ridge),
              ('SVM', svr),
              ('Decision Tree', dt),
              ('Random Forest', rf)]

In [48]:
for reg_name, reg in regressors:

    #fit the model; ravel() passes y as the 1-D array scikit-learn expects.
    reg.fit(X_train, y_train.ravel())

    #predicting test set results.
    y_pred = reg.predict(X_test)

    #side-by-side view of predictions and true values (kept for inspection, not printed).
    np.set_printoptions(precision=2)
    comparison = np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1)

    #printing r2_score.
    print('{:s} : {:f}'.format(reg_name, r2_score(y_test, y_pred)))

Multiple Linear Regression : 0.558000
Ridge Regression : 0.554113
SVM : 0.902776
Decision Tree : 0.935763
Random Forest : 0.952574
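
The scores above are R² values (coefficient of determination): 1.0 means perfect predictions, and 0.0 means the model does no better than always predicting the mean of the targets. As a minimal sketch, the hypothetical helper `r2` below computes the same quantity by hand on made-up numbers.

```python
import numpy as np


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    ss_res = np.sum((y_true - y_pred) ** 2)         # error left after predicting
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of the mean-only baseline
    return 1.0 - ss_res / ss_tot


y_true_toy = [0, 7, 14]
perfect = r2(y_true_toy, [0, 7, 14])   # predictions match the targets exactly
baseline = r2(y_true_toy, [7, 7, 7])   # always predicting the mean
print(perfect, baseline)
```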


In [49]:
#sklearn.externals.joblib was removed in scikit-learn 0.23; import joblib directly.
import joblib
joblib.dump(rf, 'model.pkl')

['model.pkl']
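
The saved `model.pkl` contains only the random forest, so whoever loads it also needs the fitted `sc_X` scaler to prepare new inputs. One way around that, sketched below on made-up data, is to bundle the scaler and the regressor in a scikit-learn pipeline and save that single object. The file name `toy_pipeline.pkl`, the toy data, and the small `n_estimators` value are assumptions for illustration only.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up training data: two features in [0, 1] and a pH-like target.
rng = np.random.default_rng(1)
X_toy = rng.random((60, 2))
y_toy = 14.0 * X_toy[:, 0]

# Bundle scaling and the regressor so both are saved and restored together.
pipeline = make_pipeline(
    StandardScaler(),
    RandomForestRegressor(n_estimators=20, random_state=1),
)
pipeline.fit(X_toy, y_toy)

# Hypothetical file name, written to a temporary folder.
path = os.path.join(tempfile.mkdtemp(), "toy_pipeline.pkl")
joblib.dump(pipeline, path)

# Later (for example in another script): load and predict on unscaled features.
restored = joblib.load(path)
new_sample = np.array([[0.5, 0.5]])
print(restored.predict(new_sample))
```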