<a href="https://colab.research.google.com/github/henthornlab/ProcessAnalytics/blob/master/RODataKmeans.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **K-Means Algorithm**
David B. Henthorn, Dept. of Chemical Engineering,
Rose-Hulman Institute of Technology

<img style="float: right;" src="https://raw.githubusercontent.com/henthornlab/ProcessAnalytics/master/RHITlogo.png">

Python example of the k-means algorithm on a reverse osmosis dataset.

In [0]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

We use scikit-learn for its implementation of the k-means algorithm. More details are at: https://scikit-learn.org/

In [0]:
from sklearn.cluster import KMeans
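
For intuition about what `KMeans` does, here is a minimal NumPy sketch of the basic iteration (often called Lloyd's algorithm): assign each point to its nearest center, then move each center to the mean of its points. The helper name `lloyd_kmeans` and the synthetic two-blob data are assumptions made for illustration. scikit-learn's implementation is more elaborate (for example, it defaults to k-means++ initialization), so this is not a stand-in for it.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd iterations; hypothetical helper for illustration only."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: index of the nearest center for every point.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points.
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                new_centers[j] = members.mean(axis=0)
        # Stop once the centers no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Two well-separated synthetic blobs (made-up data, not the RO dataset).
rng = np.random.default_rng(1)
X_demo = np.vstack([
    rng.normal(0.0, 0.1, size=(20, 2)),
    rng.normal(5.0, 0.1, size=(20, 2)),
])
demo_labels, demo_centers = lloyd_kmeans(X_demo, k=2)
print(demo_centers)
```

Which blob receives label 0 depends on the random starting centers, so only the grouping is meaningful, not the label values.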

Load the reverse osmosis dataset from my GitHub repository.

In [0]:
df = pd.read_excel('https://raw.githubusercontent.com/henthornlab/ProcessAnalytics/master/ROSampleRuns.xlsx')
df.tail()

,Pressure (psig),Feed Conc. (ppm),Permeate Conc. (ppm),Concentrate Conc. (ppm),Feed Flow (l/min),Concentrate Flow (l/min),Permeate Flow (l/min),Recycle Flow (l/min)
302,49.875069,318.2612,2.857253,352.559692,6.652651,5.959144,0.677282,-0.000658
303,49.846458,317.427612,2.857253,352.202393,6.619621,5.959144,0.669492,-0.000658
304,49.8703,318.14209,2.857253,352.559692,6.561592,5.959144,0.671403,-0.000658
305,49.94659,317.189392,2.857253,352.559692,6.618747,5.959144,0.685123,-0.000658
306,49.884609,318.02301,2.864405,352.202393,6.679479,5.958572,0.687755,-0.000515


Let's plot it and take a look at the results. Specifically, let's plot permeate flow rate as a function of transmembrane pressure.

In [0]:
fig = px.scatter(df, x = 'Pressure (psig)', y = 'Permeate Flow (l/min)')
fig.update_layout(
    title="Reverse Osmosis Performance",
    xaxis_title='Transmembrane Pressure (psig)',
    yaxis_title="Permeate Flow Rate (lpm)",
    font=dict(
        family="Arial",
        size=16
    )
)
fig.show()

We will use five clusters. For information on why we chose that number of clusters, see the notebook on the "Elbow Method".

In [0]:
kmeans = KMeans(n_clusters=5, random_state=0).fit(df)
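
The elbow-method notebook is not reproduced here. As a rough reminder of the idea, the sketch below fits k-means for several values of k and records `inertia_`, the within-cluster sum of squared distances. It runs on made-up blob data from `make_blobs`, not on the reverse osmosis dataset, and the range of k values is an arbitrary choice.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Made-up data: three tight blobs (not the reverse osmosis dataset).
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 0], [0, 10]],
    cluster_std=0.5,
    random_state=0,
)

# Fit k-means for k = 1..8 and record the within-cluster sum of squares.
k_values = list(range(1, 9))
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    inertias.append(model.inertia_)

for k, inertia in zip(k_values, inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```

Plotting `inertias` against `k_values` typically shows a sharp drop followed by a flatter tail; the bend between them is the "elbow" used to pick the number of clusters.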

Add a new column to the dataframe that records which cluster each data point belongs to.

In [0]:
df['Cluster'] = kmeans.labels_.astype(int)
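
One thing to watch in a notebook: once `Cluster` has been written into `df`, re-running the earlier `fit(df)` cell would feed the label column to k-means as an extra feature. A possible guard, sketched here on a small made-up dataframe, is to freeze the list of feature columns before adding any labels. The helper `fit_and_label` and the demo values are hypothetical and not part of the original notebook.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Made-up stand-in for the RO dataframe (two columns only).
demo = pd.DataFrame({
    "Pressure (psig)": [50.0, 50.2, 49.9, 85.1, 85.4, 84.8],
    "Permeate Flow (l/min)": [0.67, 0.68, 0.67, 1.17, 1.18, 1.17],
})

# Freeze the feature list before any label column is added.
feature_cols = list(demo.columns)

def fit_and_label(frame):
    """Hypothetical helper: cluster on feature_cols only, then attach labels."""
    model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(frame[feature_cols])
    frame["Cluster"] = model.labels_.astype(int)
    return model

first = fit_and_label(demo)
second = fit_and_label(demo)  # safe to re-run: "Cluster" is never a feature
print(demo)
```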

Plot the data, color-coding each point by the cluster it belongs to.

In [0]:
fig2 = px.scatter(
    df, x = 'Pressure (psig)',
    y = 'Permeate Flow (l/min)',
    color = 'Cluster',
    color_continuous_scale=px.colors.sequential.Plotly3
)

fig2.update_layout(
    title="Reverse Osmosis Performance",
    xaxis_title='Transmembrane Pressure (psig)',
    yaxis_title="Permeate Flow Rate (lpm)",
    font=dict(
        family="Arial",
        size=16
    )
)
fig2.show()

Let's take a look at the centers of the clusters. There are five centers, each with eight coordinates (one per column of the dataset).

In [0]:
print(kmeans.cluster_centers_)

[[ 8.53949852e+01  3.18047518e+02  2.44709948e+00  3.64438821e+02
   9.00101516e+00  7.84547734e+00  1.17720289e+00 -6.57989411e-04]
 [ 6.97017852e+01  3.19054326e+02  2.56200221e+00  3.61481109e+02
   8.01613570e+00  7.07948681e+00  9.63421904e-01 -6.57989411e-04]
 [ 4.99770647e+01  3.17770308e+02  2.79927546e+00  3.53085451e+02
   6.61634004e+00  5.96984868e+00  6.75585043e-01 -6.54500306e-04]
 [ 1.01031699e+02  3.19029933e+02  2.35261051e+00  3.69492386e+02
   9.93003902e+00  8.55889704e+00  1.40217582e+00 -6.57989411e-04]
 [ 9.12856293e+01  3.19427319e+02  2.29596713e+00  3.67371747e+02
   9.38930639e+00  8.13167095e+00  1.28228517e+00 -6.57989411e-04]]
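
The coordinates above span several orders of magnitude, from hundreds (the concentrations in ppm) down to about 1e-4 (recycle flow). Since k-means works with Euclidean distance, columns with a large numeric spread carry most of the weight in the cluster assignment. Whether that is desirable depends on the analysis. If equal weighting is wanted, one common variation is to standardize the columns first and map the centers back to the original units afterwards. The sketch below shows the mechanics on made-up data; it is not what this notebook does.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up data: one large-magnitude column and one small-magnitude column.
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.normal(320.0, 1.0, size=100),  # e.g. a concentration in ppm
    rng.normal(1.0, 0.3, size=100),    # e.g. a flow rate in l/min
])

# Rescale every column to zero mean and unit variance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_demo)

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)

# Convert the centers from scaled coordinates back to the original units.
centers_original = scaler.inverse_transform(model.cluster_centers_)
print(centers_original)
```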


Create a new dataframe that holds the cluster centers.

In [0]:
clusterCenters = pd.DataFrame(kmeans.cluster_centers_, columns=df.columns[0:8])
clusterCenters.head()

,Pressure (psig),Feed Conc. (ppm),Permeate Conc. (ppm),Concentrate Conc. (ppm),Feed Flow (l/min),Concentrate Flow (l/min),Permeate Flow (l/min),Recycle Flow (l/min)
0,85.394985,318.047518,2.447099,364.438821,9.001015,7.845477,1.177203,-0.000658
1,69.701785,319.054326,2.562002,361.481109,8.016136,7.079487,0.963422,-0.000658
2,49.977065,317.770308,2.799275,353.085451,6.61634,5.969849,0.675585,-0.000655
3,101.031699,319.029933,2.352611,369.492386,9.930039,8.558897,1.402176,-0.000658
4,91.285629,319.427319,2.295967,367.371747,9.389306,8.131671,1.282285,-0.000658
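
The cluster numbers are arbitrary: in the table above, cluster 2 has the lowest pressure and cluster 3 the highest. If labels that increase with pressure would be easier to read, they can be remapped with a double `argsort`. The sketch below uses the pressure values rounded from the table; the `example_labels` array is made up for illustration.

```python
import numpy as np

# Pressure coordinate of each center, rounded from the table above.
center_pressure = np.array([85.39, 69.70, 49.98, 101.03, 91.29])

# rank[j] is the position of cluster j when centers are sorted by pressure.
rank = np.argsort(np.argsort(center_pressure))
print(rank)  # [2 1 0 4 3]

# Hypothetical labels for a handful of points, remapped to ordered labels.
example_labels = np.array([2, 2, 1, 0, 4, 3])
ordered_labels = rank[example_labels]
print(ordered_labels)  # [0 0 1 2 3 4]
```

After remapping, label 0 is the lowest-pressure cluster and label 4 the highest, so a sequential color scale follows the pressure ordering.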


Let's make a combined plot that includes the data points along with the cluster centers. Note that the marker size of each cluster center is arbitrary; the large, translucent markers are there solely to guide the eye.

In [0]:
fig3 = go.Figure()

# Add traces
fig3.add_trace(go.Scatter(x=df['Pressure (psig)'], y=df['Permeate Flow (l/min)'],
                    mode='markers',
                    name='Data'))
fig3.add_trace(go.Scatter(x=clusterCenters['Pressure (psig)'], y=clusterCenters['Permeate Flow (l/min)'],
                    mode='markers',
                    name='Cluster Centers', marker_size = 100, marker_color='rgba(0, 0, 128, .1)'))
fig3.update_layout(
    title="Reverse Osmosis Performance",
    xaxis_title='Transmembrane Pressure (psig)',
    yaxis_title="Permeate Flow Rate (lpm)",
    font=dict(
        family="Arial",
        size=16
    )
)
fig3.show()

The previous plots focused on the very predictable relationship between permeate flow rate and pressure. Let's look at some less predictable behavior: feed concentration versus permeate concentration.

In [0]:
fig4 = px.scatter(df, x = 'Permeate Conc. (ppm)', y = 'Feed Conc. (ppm)',
                  color = 'Cluster',
                  color_continuous_scale=px.colors.sequential.Plotly3
)
fig4.update_layout(
    title="Reverse Osmosis Performance",
    xaxis_title='Permeate Conc. (ppm)',
    yaxis_title="Feed Conc. (ppm)",
    font=dict(
        family="Arial",
        size=16
    )
)
fig4.show()

In [0]:
fig5 = go.Figure()

# Add traces
fig5.add_trace(go.Scatter(x=df['Permeate Conc. (ppm)'], y=df['Feed Conc. (ppm)'],
                    mode='markers',
                    name='Data'))
fig5.add_trace(go.Scatter(x=clusterCenters['Permeate Conc. (ppm)'], y=clusterCenters['Feed Conc. (ppm)'],
                    mode='markers',
                    name='Cluster Centers', marker_size = 100, marker_color='rgba(0, 0, 128, .1)'))
fig5.update_layout(
    title="Reverse Osmosis Performance",
    xaxis_title='Permeate Conc. (ppm)',
    yaxis_title='Feed Conc. (ppm)',
    font=dict(
        family="Arial",
        size=16
    )
)
fig5.show()