# Welcome to my CHE210 Module - Lesson 1
## In this module, you will learn the following skills:
- How to import data directly from the UO Lab Historian
- How to manipulate dataframes using pandas
- How to perform a linear regression
- How to use linear regression values to calculate a parameter of interest
- How to plot data

# Lab Scenario for Analysis
This entire module will be using real data from the Spring 2024 Agitated Tank group. One of their goals was to calculate the overall heat transfer coefficient at different agitation speeds!

Based on the theory behind this project, a linear regression is needed to find the overall heat transfer coefficient. The equation we will fit is:

$$ \ln\left( \frac{T_{initial} - T_{inlet}}{T(t) - T_{inlet}} \right) = \frac{m_{inlet}}{M_{tank}} \left( \frac{K_2-1}{K_{2}}\right) t $$

Rearranging this equation and making some substitutions gives the following equation for the overall heat transfer coefficient, where $m$ is the slope obtained from the regression:

$$ U_{o} = \left( \frac{m_{inlet} C_{p, inlet water}}{A_{s,coil}} \right) \ln\left( \frac{1}{1-\frac{M_{tank}}{m_{inlet}} m} \right) $$
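
One way to see how the expression for $U_o$ follows from the linearized equation is sketched below. It assumes that $K_2 = \exp\left( \frac{U_o A_{s,coil}}{m_{inlet} C_{p, inlet water}} \right)$, which is the definition that makes the two equations consistent; the module does not state it explicitly. Here $m$ is the slope of the regression line.

$$ m = \frac{m_{inlet}}{M_{tank}} \left( \frac{K_2-1}{K_{2}}\right) \quad \Rightarrow \quad \frac{M_{tank}}{m_{inlet}} m = 1 - \frac{1}{K_2} \quad \Rightarrow \quad K_2 = \frac{1}{1-\frac{M_{tank}}{m_{inlet}} m} $$

$$ U_{o} = \left( \frac{m_{inlet} C_{p, inlet water}}{A_{s,coil}} \right) \ln K_2 = \left( \frac{m_{inlet} C_{p, inlet water}}{A_{s,coil}} \right) \ln\left( \frac{1}{1-\frac{M_{tank}}{m_{inlet}} m} \right) $$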

# Section 1: Importing & Manipulating Data Directly from the Historian

First, you will need to specify what data you want to collect. Similar to how you would navigate the UO Lab Data Retrieval website, type out the dates and times you are interested in. Make sure to include the interval you want for the data!

The code for the data importation was sourced from Dr. Dave's GitHub: https://github.com/henthornlab/ChEDataCollection/blob/Main/clients/client_python_basic.py 

In [31]:
startdate = '2024-04-02'
enddate = '2024-04-02'

starttime = '16:21'
endtime = '16:45'

area = '250'

interval = '10s'

In [32]:
data_URL = 'https://uolab.rose-hulman.edu/csv?starttime=' + starttime + '&endtime=' + endtime + '&startdate=' + \
startdate + '&enddate=' + enddate + '&area=' + area + '&interval=' + interval

Above, the first code block simply has the information you want to grab for analysis. 

The second code block involves actually taking the information you specified and feeding it into the URL connected to the UO Lab Historian. From this, you are getting a file that has all the values we need for data collection.
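
The same URL can also be assembled from a dictionary of parameters, which some readers find easier to check than a long string concatenation. This is only a sketch using Python's standard library: the parameter names and endpoint are copied from the cell above, and it assumes the Historian accepts percent-encoded values (`urlencode` writes the `:` in the times as `%3A`), which has not been verified against the server.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = 'https://uolab.rose-hulman.edu/csv'  # endpoint taken from the cell above

# Same values as the first code block, collected in one place
params = {
    'starttime': '16:21',
    'endtime': '16:45',
    'startdate': '2024-04-02',
    'enddate': '2024-04-02',
    'area': '250',
    'interval': '10s',
}

# urlencode joins the pairs with '&' and escapes special characters
data_URL = BASE_URL + '?' + urlencode(params)
print(data_URL)

# Round-trip check: decode the query string back into its parts
parsed = parse_qs(urlsplit(data_URL).query)
print(parsed['starttime'], parsed['interval'])
```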

### Now, let's import a software library called "pandas." Pandas is useful for data analysis and manipulation. 

We will create a dataframe to house all of the information from the Historian. The URL provides the data as a comma-separated values (csv) file, so we read it with pd.read_csv().

After you have imported the file, test out df.head() and df.tail() to see portions of the data and become familiar with the layout of the dataframe.

In [33]:
# Import libraries
import pandas as pd

In [34]:
df = pd.read_csv(data_URL)
df.head()

Unnamed: 0,Timestamp,TE250-R01/AI1/OUT.CV °C,TE250-R02/AI1/OUT.CV °C,TE250-R03/AI1/OUT.CV °C,TE250-R04/AI1/OUT.CV °C
0,16:21:00,61.170197,119.489288,64.230347,63.79776
1,16:21:10,61.795044,119.585419,64.662933,64.26239
2,16:21:20,62.03537,116.252899,65.143585,64.598846
3,16:21:30,61.763,104.701233,65.399933,64.662933
4,16:21:40,61.346436,99.974823,65.431976,64.63089
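
To practice df.head() and df.tail() without a connection to the Historian, the sketch below reads the five rows shown above from an in-memory string instead of the URL. The variable names `csv_text` and `sample` are made up for this example; the numbers are copied from the output above.

```python
import io
import pandas as pd

# First five rows of the Historian output, typed in as csv text
csv_text = """Timestamp,TE250-R01/AI1/OUT.CV °C,TE250-R02/AI1/OUT.CV °C,TE250-R03/AI1/OUT.CV °C,TE250-R04/AI1/OUT.CV °C
16:21:00,61.170197,119.489288,64.230347,63.79776
16:21:10,61.795044,119.585419,64.662933,64.26239
16:21:20,62.03537,116.252899,65.143585,64.598846
16:21:30,61.763,104.701233,65.399933,64.662933
16:21:40,61.346436,99.974823,65.431976,64.63089
"""

# pd.read_csv accepts a file-like object as well as a URL
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)     # (rows, columns)
print(sample.head(2))   # first two rows
print(sample.tail(2))   # last two rows
print(sample.dtypes)    # data type of each column
```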


### To Delete or Drop Columns from a Dataframe:

For the agitated tank, it turns out that we don't need the first temperature column (TE250-R01) for our analysis. To make our lives easier, let's get rid of the column entirely so we don't get confused.

When making large manipulations to your dataframe, make sure you write "df = " before your chosen manipulation. This ensures that you overwrite your previous dataframe. To check that your change was made, use df.head().

Deciding which columns are necessary requires understanding your unit's P&ID! This lab group used their P&ID to determine that the TE250-R01 column was not necessary.

In [35]:
df = df.drop(columns='TE250-R01/AI1/OUT.CV °C')
df.head()

Unnamed: 0,Timestamp,TE250-R02/AI1/OUT.CV °C,TE250-R03/AI1/OUT.CV °C,TE250-R04/AI1/OUT.CV °C
0,16:21:00,119.489288,64.230347,63.79776
1,16:21:10,119.585419,64.662933,64.26239
2,16:21:20,116.252899,65.143585,64.598846
3,16:21:30,104.701233,65.399933,64.662933
4,16:21:40,99.974823,65.431976,64.63089
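
The reason for writing "df = " can be seen on a small made-up dataframe: drop() returns a new dataframe and leaves the original untouched unless the result is assigned back. The column names `A`, `B`, `C` and the name `demo` are hypothetical.

```python
import pandas as pd

demo = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})  # toy data

# Without assignment the result is thrown away and demo is unchanged
demo.drop(columns='A')
cols_without_assignment = list(demo.columns)
print(cols_without_assignment)

# With assignment the dataframe is overwritten by the shortened version
demo = demo.drop(columns='A')
cols_with_assignment = list(demo.columns)
print(cols_with_assignment)
```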


### Renaming Columns 

Now that we have shortened it down to only the necessary columns, we should rename them so they are easier to reference later in the program.

Renaming columns by assigning to df.columns and adding columns (which we will use later) are scenarios where it is not necessary to write "df =", because the assignment changes the dataframe directly.

In [36]:
df.columns = ['Timestamp','Coil_Inlet','Tank_Temp_Outside','Tank_Temp_Inside']
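
Assigning a full list to df.columns relies on the columns being in the expected order. A possible alternative, sketched below on an empty dataframe that has the same headers, is df.rename() with a mapping from old names to new names. Unlike assigning to df.columns, rename() returns a new dataframe, so the result has to be assigned back. The names `demo` and `new_names` are made up for this example.

```python
import pandas as pd

# Empty dataframe with the same headers as the Historian data after the drop
demo = pd.DataFrame(columns=['Timestamp',
                             'TE250-R02/AI1/OUT.CV °C',
                             'TE250-R03/AI1/OUT.CV °C',
                             'TE250-R04/AI1/OUT.CV °C'])

# Map each old name to its new name; columns not listed keep their name
new_names = {
    'TE250-R02/AI1/OUT.CV °C': 'Coil_Inlet',
    'TE250-R03/AI1/OUT.CV °C': 'Tank_Temp_Outside',
    'TE250-R04/AI1/OUT.CV °C': 'Tank_Temp_Inside',
}

demo = demo.rename(columns=new_names)
print(list(demo.columns))
```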

### Adding Additional Columns

To meet the objective, we need to create a column that represents the average temperature of the tank. 

We can do that by simply referencing the two columns and inputting the appropriate manipulations.

In [37]:
df['Avg_Tank_Temp'] = (df['Tank_Temp_Outside']+df['Tank_Temp_Inside'])/2
df.head()

Unnamed: 0,Timestamp,Coil_Inlet,Tank_Temp_Outside,Tank_Temp_Inside,Avg_Tank_Temp
0,16:21:00,119.489288,64.230347,63.79776,64.014053
1,16:21:10,119.585419,64.662933,64.26239,64.462662
2,16:21:20,116.252899,65.143585,64.598846,64.871216
3,16:21:30,104.701233,65.399933,64.662933,65.031433
4,16:21:40,99.974823,65.431976,64.63089,65.031433


### Querying Dataframes

Querying a dataframe means that we are looking for the rows where a certain condition is true. You give the program a condition, and it searches through your dataframe and returns the entries where that condition holds.

For the Agitated Tank, the group wanted to find the first point where the average temperature of the tank reaches about 60 °C; the query below uses 61 °C as the cutoff. This first point will correspond to the "0" point of their data collection.

In [38]:
df.query("Avg_Tank_Temp < 61")

Unnamed: 0,Timestamp,Coil_Inlet,Tank_Temp_Outside,Tank_Temp_Inside,Avg_Tank_Temp
12,16:23:00,15.876770,61.378479,60.160828,60.769653
13,16:23:10,15.460205,60.785675,59.584045,60.184860
14,16:23:20,15.123749,60.176849,58.975220,59.576035
15,16:23:30,14.819336,59.584045,58.398438,58.991241
16,16:23:40,14.627075,59.023285,57.837677,58.430481
...,...,...,...,...,...
140,16:44:20,13.072968,23.599243,22.910309,23.254776
141,16:44:30,13.072968,23.487091,22.798157,23.142624
142,16:44:40,13.072968,23.374939,22.669983,23.022461
143,16:44:50,13.072968,23.246765,22.557831,22.902298


### Slicing the Data

Now we know that point 12 in the dataframe is where we need to start our data analysis. Let's create a new dataframe that only includes the values we want to analyze.

This process is called slicing, and it uses the iloc[] indexer. df.iloc[12::] says: start at position 12 in the dataframe and go all the way to the end (12:: is equivalent to 12:).

In [39]:
df_data = df.iloc[12::]
df_data.head()

Unnamed: 0,Timestamp,Coil_Inlet,Tank_Temp_Outside,Tank_Temp_Inside,Avg_Tank_Temp
12,16:23:00,15.87677,61.378479,60.160828,60.769653
13,16:23:10,15.460205,60.785675,59.584045,60.18486
14,16:23:20,15.123749,60.176849,58.97522,59.576035
15,16:23:30,14.819336,59.584045,58.398438,58.991241
16,16:23:40,14.627075,59.023285,57.837677,58.430481


# Section 2: Preparing the Dataframe for the Linear Regression & Plotting

Now that we have created the main dataframe that we are going to be working with, we need to start preparing it for the regression. 

This will require more fine-tuning adjustment and variable creation.

From the linearized equation, we need to define some of the variables:
* Inlet Temperature = average of all the coil inlet temperatures
* Initial Temperature = first entry in the average tank temperature column

We can then create a new column in the dataframe that holds the $ \frac{T_{initial} - T_{inlet}}{T(t) - T_{inlet}} $ term.

Note: A red SettingWithCopyWarning will appear when you add columns to the sliced data. It is a warning, not an error: pandas is pointing out that df_data was created as a slice of df. The changes are still made to df_data.

In [40]:
T_inlet = df_data['Coil_Inlet'].mean()

In [41]:
T_initial = df_data['Avg_Tank_Temp'].iloc[0]

In [42]:
df_data['Term'] = (T_initial-T_inlet)/(df_data['Avg_Tank_Temp']-T_inlet)



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
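
One common way to avoid this warning is to make the slice an independent copy with .copy() before adding columns to it. The sketch below shows the idea on toy data; the `Doubled` column is made up for illustration, and the notebook cells in this module were run without .copy().

```python
import pandas as pd

df = pd.DataFrame({'Avg_Tank_Temp': [64.0, 62.5, 60.8, 60.2, 59.6]})  # toy data

# .copy() makes df_data its own dataframe rather than a slice of df
df_data = df.iloc[2:].copy()

# Adding a column now changes df_data only and raises no warning
df_data['Doubled'] = df_data['Avg_Tank_Temp'] * 2

print(df_data)
print('Doubled' in df.columns)   # the original dataframe is untouched
```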



### Creating a Column for the Time Elapsed

Creating this column is more difficult than you would initially think. The timestamp is given in a datetime format. However, we want to designate the first entry of the dataframe as time "0" and increase by 10 seconds with each subsequent entry.

First, let's create a column called NumEntries that starts from 0 and goes the length of our dataframe. 

Once that has been done, create an additional column (Elapsed_Time) that multiplies all of the NumEntries by the necessary time interval (10s).

In [43]:
df_data['NumEntries'] = range(0,len(df_data))






In [44]:
df_data['Elapsed_Time'] = df_data['NumEntries']*10
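
The counting approach above relies on every row being exactly 10 s apart. As a cross-check, the elapsed time can also be computed from the timestamps themselves. This sketch assumes the Timestamp column holds 'HH:MM:SS' strings that all fall on the same day, as in the tables shown earlier; if the Historian returns full dates, pd.to_datetime would be the tool to try instead.

```python
import pandas as pd

# Toy excerpt of the Timestamp column
demo = pd.DataFrame({'Timestamp': ['16:23:00', '16:23:10', '16:23:20', '16:23:30']})

# Convert the clock strings to time durations since midnight
clock = pd.to_timedelta(demo['Timestamp'])

# Subtract the first entry so it becomes time zero, then express in seconds
demo['Elapsed_Time'] = (clock - clock.iloc[0]).dt.total_seconds()
print(demo)
```

A missing or repeated reading would show up here as an uneven step in Elapsed_Time, which the counting method would hide.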






# Section 3: Performing the Linear Regression & Plotting

To create a linear regression & plot it, we will need to import three different packages:

* numpy
* sklearn
* plotly

Note: These packages are not included in the initial Jupyter Notebook that we have created, so you will need to install the packages. 

* Click on the tri-stack icon in the left margin of DataSpell
* Search for: 
    * numpy 
    * scikit-learn 
    * plotly
* Click the install button on the right side of the screen to download each package
    * Clicking install in the left panel can produce errors saying the package cannot be downloaded.

In [45]:
import numpy as np
from sklearn import linear_model

### Converting Dataframe Columns into Numpy Arrays

Unfortunately, sklearn does not like the format of dataframes. That means we need to convert them into numpy arrays that have the same shape. 

sklearn expects X to be a 2D array with one row per sample, i.e. shape (n, 1) for a single feature, where n is the number of rows in df_data. Applying np.column_stack to the Elapsed_Time column gives an array of shape (1, n), which is the wrong orientation.
* To fix this, just transpose the result with .T.

Use the np.log() function to take the natural log of the Term values from the dataframe. That creates a Series, but we want an array, so convert it using .to_numpy(). The y array can stay one-dimensional with length n.

In [46]:
X = np.column_stack(df_data['Elapsed_Time']).T

In [47]:
y = np.log(df_data['Term'])
y = y.to_numpy()
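
The (n, 1) layout that the notebook obtains with np.column_stack plus a transpose can also be reached more directly. The sketch below shows two alternatives on toy data; it is offered as an option, not as the method used in this module.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'Elapsed_Time': [0, 10, 20, 30]})  # toy data

# Option 1: convert to a 1D array, then reshape to one column
# (-1 lets NumPy work out the number of rows)
X_reshape = demo['Elapsed_Time'].to_numpy().reshape(-1, 1)

# Option 2: double brackets select a one-column dataframe, which converts to 2D
X_frame = demo[['Elapsed_Time']].to_numpy()

print(X_reshape.shape, X_frame.shape)
print(np.array_equal(X_reshape, X_frame))
```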

### The Regression

To create the regression object, call linear_model.LinearRegression().

Then reg.fit(X,y) says:
* Fit a line to the X and y arrays that we created.

Using the names X and y makes it easier to differentiate between our independent and dependent variables.
* You can name them whatever you want, though, as long as you are consistent!

In [48]:
reg = linear_model.LinearRegression()
reg.fit(X,y)

### Obtaining the Results

You can get the $ R^2 $ value for your regression, the slope, and the intercept with very simple code.
* reg.score() = $ R^2 $
* reg.coef_ = slope
* reg.intercept_ = intercept

In [49]:
reg.score(X,y)

0.9999634359592735

In [50]:
reg.coef_

array([0.0012288])

In [51]:
reg.intercept_

0.007345110496613683
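
As a cross-check on values like these, the slope, intercept, and $ R^2 $ of a straight-line fit can be computed with NumPy alone. The sketch below uses synthetic data (a made-up slope of 0.0012 and intercept of 0.007 with a small alternating disturbance), not the lab measurements, so its numbers will differ from the ones above.

```python
import numpy as np

# Synthetic, nearly linear data (hypothetical values, not from the lab)
t = np.arange(0, 100, 10, dtype=float)          # elapsed time [s]
noise = 1e-4 * (-1.0) ** np.arange(len(t))      # small alternating disturbance
y = 0.0012 * t + 0.007 + noise

# Degree-1 polynomial fit returns [slope, intercept]
slope, intercept = np.polyfit(t, y, 1)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
y_hat = slope * t + intercept
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(slope, intercept, r_squared)
```

Note that reg.coef_ is returned as an array with one entry per feature, so reg.coef_[0] gives the slope as a plain number.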

### Plot the Regression

One of the most popular plotting libraries is plotly. You should have already installed the plotly package as mentioned earlier.

Let's import both the express and graph_objects modules.

In [52]:
import plotly.express as px
import plotly.graph_objects as go

Ideally, we want to plot both the original data and the predicted values by the regression. We performed the log when we made the arrays.

Simply add a column to the dataframe that uses those values from the "y" array.

In [53]:
df_data['Ln'] = y






We will also need to create a column for the predicted values obtained from the regression. To do this, use the reg.coef_, X array, and reg.intercept_ values that we looked at earlier.

In [54]:
df_data['Predicted'] = reg.coef_*X+reg.intercept_






### To plot:
* Define the figure (fig)
* Specify the type of plot (px.scatter)
* Identify the dataframe (df_data) & the columns you want to plot

### Adding additional plot features:
* 'your figure name'.add_trace
* Add the plot type you want (go.Scatter)
* Specify the columns of the dataframe associated with x and y
* Specify the way you want the data to display (lines)
* Name the trendline (Linear Fit)
* For changing axes labels: see code

In [55]:
fig = px.scatter(df_data, x = 'Elapsed_Time', y = 'Ln')

fig.add_trace(
    go.Scatter(x=df_data['Elapsed_Time'], y=df_data['Predicted'],
               mode = 'lines',
               name = 'Linear Fit')
)

fig.update_layout(
    xaxis_title='Elapsed Time [s]',
    yaxis_title='Natural Log Term'
)

fig.show()

Fun fact: The plots created with plotly are interactive! You can click and drag over a section of the graph and it will zoom in for you. To go back to the original graph, just double click on the graph.

Below, I have included a new column that puts the elapsed time in terms of minutes so that it can be more easily understood by viewers.

In [56]:
df_data['Time_Min'] = df_data['Elapsed_Time']/60






In [57]:
fig2 = px.scatter(df_data, x = 'Time_Min', y = 'Ln')

fig2.add_trace(
    go.Scatter(x=df_data['Time_Min'], y=df_data['Predicted'],
               mode = 'lines',
               name = 'Linear Fit',
               fillcolor='rgba(0,0,0,0)')
)

fig2.update_layout(
    xaxis_title='Elapsed Time [min]',
    yaxis_title='Natural Log Term'
)

fig2.show()

# Section 4: Additional Calculations for the Overall Heat Transfer Coefficient

Looking back at our equation for the overall heat transfer coefficient, you can see that there are still many variables that have not been defined. 

The rest depend on your knowledge of the system itself. Therefore, it will require you to look at documentation available for the unit. The lab group defined the following variables.

In [58]:
Vdot_in = 1155              # cubic inches/min
rho = 0.03629               # lbm/ cubic inches
Tank_Vol = 10165.41         # cubic inches
OutCoil = 145.4214          # cubic inches
Baffles = 70                # cubic inches
OutDiam = 0.625             # in
LenCoil = 474               # in
Cp = 18.25                  # btu/ lbmol C
MW = 18.02                  # lbm/ lbmol

### Creating the Equations for Necessary Values

In [59]:
V = Tank_Vol - OutCoil - Baffles        # cubic inches
mdot_in = Vdot_in*rho*(1/60)            # lbm/s
MassWater = V*rho                       # lbm
SurfArea = np.pi*OutDiam*LenCoil        # square inches

In [60]:
Uo = (mdot_in*Cp/SurfArea)*(-1)*np.log(1-(MassWater/mdot_in)*reg.coef_)*(1/MW*1055.06*pow(39.37,2))   # W/(m^2 C)

# 1055.06 is the conversion from btu to joules
# pow(39.37, 2) is the conversion from square inches to square meters (39.37 in per m)

print(Uo)

[1253.41364075]
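
To make the unit handling easier to follow, the same calculation can be wrapped in a function with named conversion factors. This is a sketch: the function name `overall_U` and its argument names are made up, the inputs are the values defined by the lab group above, and the slope is the rounded value printed by reg.coef_, so the result only matches the printed value approximately.

```python
import math

def overall_U(slope, mdot_in, mass_water, surf_area, cp_molar, mw):
    """Overall heat transfer coefficient in W/(m^2 C).

    slope      : regression slope [1/s]
    mdot_in    : inlet mass flow [lbm/s]
    mass_water : mass of water in the tank [lbm]
    surf_area  : coil surface area [in^2]
    cp_molar   : heat capacity [btu/(lbmol C)]
    mw         : molecular weight [lbm/lbmol]
    """
    BTU_TO_J = 1055.06   # joules per btu
    IN_PER_M = 39.37     # inches per meter

    x = (mass_water / mdot_in) * slope
    if x >= 1:
        raise ValueError("slope too large: the logarithm is undefined")

    # -ln(1 - x) is the same as ln(1 / (1 - x))
    u_btu = (mdot_in * (cp_molar / mw) / surf_area) * (-math.log(1 - x))  # btu/(s in^2 C)
    return u_btu * BTU_TO_J * IN_PER_M ** 2

# Inputs taken from the cells above
mdot_in = 1155 * 0.03629 / 60                       # lbm/s
mass_water = (10165.41 - 145.4214 - 70) * 0.03629   # lbm
surf_area = math.pi * 0.625 * 474                   # square inches

Uo = overall_U(0.0012288, mdot_in, mass_water, surf_area, cp_molar=18.25, mw=18.02)
print(Uo)
```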


### There you have it! You just did most of the calculations (minus uncertainty analysis) for the agitated tank project! 

### See the next lesson to learn how to parse through units with many more columns!