# Python for Chemical Reaction Engineering Part 3 - Data Structure Essentials
[![Open In Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/neagan01/eagan_public/blob/master/Python_for_CRE_Part_3_-_Data_Structure_Essentials.ipynb)

Data in Python can be stored and manipulated in several different formats, and the differences between them can be tricky to handle at first.

## Basic data types

The three basic types of data:
- **Scalars** are single values
- **Strings** are sequences of characters
- **Data structures** are sets of scalars and strings

### Scalars

This is a single value and can fall under a few different *types*:

- **int** refers to an integer. **Examples:** 1, 42, 1571
- **float** is a floating point decimal approximation of a value. This is often what we work with. **Examples:** 1.0, 3.14159, 20.11111
- **complex** refers to complex numbers that have an imaginary part. **Examples:** 4.2 + 6i. *Note: Python displays a 'j' instead of an 'i'*
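
As a quick check of these types, the sketch below assigns one value of each kind and inspects it with the built-in `type` function. The variable names are illustrative only.

```python
# Illustrative variable names; the values come from the examples above.
n_int = 42
n_float = 3.14159
n_complex = 4.2 + 6j   # Python writes the imaginary unit as j

for value in (n_int, n_float, n_complex):
    print(value, 'is of type', type(value).__name__)

# Mixing an int with a float produces a float.
mixed = n_int + n_float
print(type(mixed).__name__)
```

The last line shows why floats are what we usually end up working with: as soon as one float enters a calculation, the result is a float.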

### Strings

**str** refers to a 'string' of characters. These are declared in Python with single, double, or triple quotes. Often it will not matter which convention you use, but there are specific exceptions. Technically a string is not a 'scalar,' but for simplicity it is grouped with the basic types here. **Examples:** 'A', "Penguin", '''Kevin the Penguin'''

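Another data type that is sort of an integer and sort of a string is a Boolean or **bool**. This is a binary entity which can be either 'True' or 'False.' In Python we can also note that True == 1 and False == 0, and that empty containers such as [] are treated as False in a condition (although [] == False is itself False).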
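
The relationship between Booleans, integers, and empty containers is easy to get wrong, so here is a minimal sketch that checks each statement directly. The list `names` is just an illustration of the three quoting styles.

```python
# bool is a subclass of int, so True and False compare equal to 1 and 0.
print(True == 1)       # True
print(False == 0)      # True
print(True + True)     # 2

# An empty list is not equal to False, but it is "falsy" when converted to bool.
empty = []
print(empty == False)  # False
print(bool(empty))     # False

# Single, double, and triple quotes all create ordinary str objects.
names = ['A', "Penguin", '''Kevin the Penguin''']
all_str = all(isinstance(name, str) for name in names)
print(all_str)         # True
```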

### Data structures

Oftentimes we will work with collections of scalars or strings. There are ***many*** ways to organize and manipulate data, all with major or subtle pros and cons. We will only use a few:

- **Lists** are the default data structure. They are built into the basic programming of Python. A list is **indexed**, meaning that each element has an assigned index value, **ordered**, meaning that the precise order of these indices is important, and **mutable**, meaning that the contents can be changed after creation. These can contain scalars or strings or both. It is also possible to add or remove entries with certain commands.
- **Tuples** are like lists, but they are **immutable**, meaning they cannot be altered after creation; they can only be overwritten.
- **Dictionaries** are like lists but are **not indexed by position**. Instead they include **keys** that can be used to access complex data with words rather than indices. A simple usage could be to create an atomic masses dictionary. Accessing the 'rhenium' entry could return 186.207.
- **Numpy arrays** are mathematical arrays like those used by default in MATLAB and are **what we do most of our math with.** These are very similar to lists but are better suited for mathematical operations. They can have **multiple dimensions** and therefore also represent matrices. To use a numpy array, you have to first import the numpy package.
- **Pandas dataframes** are sort of a hybrid between dictionaries and numpy arrays. They allow for easy, fast math with large sets of indexed data, and the data can be easily manipulated. We might create, for example, a dataframe containing columns of reaction times, conversions, concentrations, and selectivities. We can use these to quickly find the selectivity of a specific species at a specific time or even the selectivity of a species at a certain conversion.
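
The atomic masses dictionary mentioned in the list can be sketched as follows. The dictionary name, the choice of elements, and the ethanol calculation are illustrative assumptions, not part of the original notebook.

```python
# Atomic masses in g/mol; the name and the selection of elements are illustrative.
atomic_masses = {
    'hydrogen': 1.008,
    'carbon': 12.011,
    'oxygen': 15.999,
    'rhenium': 186.207,
}

# Look up a value by key instead of by position.
print(atomic_masses['rhenium'])

# Molar mass of ethanol, C2H6O, assembled from the entries.
M_ethanol = (2*atomic_masses['carbon']
             + 6*atomic_masses['hydrogen']
             + atomic_masses['oxygen'])
print('{:.3f}'.format(M_ethanol))
```

Looking values up by name keeps a calculation like this readable, which is the main reason to prefer a dictionary over a list here.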


## Working with data

### Defining and displaying values

When a variable is first defined, the convention is simple: write the variable name, an equals sign, and then the value. To set x equal to 4 type:

In [1]:
x = 4

If you then end a code block with simply the name of the variable, it will print by default. If it is not the last line and you want to print the output, you can do so with the 'print' command:

In [2]:
print(x)

4


You can add a line break with '\n'. Doing this more than once will add more than one line. By default, each new 'print' command starts on the next line, so adding '\n' to the beginning adds a blank row between the previous statement and the current one:

In [3]:
print('\n',x,'\n\n',x,'\n\n\n',x)


 4 

 4 


 4


You can combine this with text like so:

In [6]:
print('\n The value of x is',x)


 The value of x is 4


To control the number of decimals, you can write it like this:

In [16]:
print('\n The value of x with 4 decimals is','{:.4f}'.format(x))


 The value of x with 4 decimals is 4.0000
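
The same format specification also works inside an f-string, which many people find easier to read. The sketch below compares the two styles; the rate constant `k` is a made-up value for illustration only.

```python
x = 4
k = 0.0012345   # hypothetical rate constant, for illustration only

# str.format and f-strings accept the same format specifications.
fixed_old = '{:.4f}'.format(x)
fixed_new = f'{x:.4f}'
print(fixed_old, fixed_new)   # 4.0000 4.0000

# Scientific notation is often handy for very small or very large values.
sci = f'{k:.2e}'
print(sci)                    # 1.23e-03
```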


We can now either redefine this variable or generate new ones from it.

In [17]:
x = 42
y = 2*x+1
print('\n x is now',x,'and y is now',y)


 x is now 42 and y is now 85


We define strings using quotation marks. When we add them together we get a new string where everything is stuck together:

In [21]:
string_1 = 'Chemical'
string_2 = 'Engineering'
string_3 = string_1 + string_2
print('\n',string_1,'+',string_2,'=',string_3)


 Chemical + Engineering = ChemicalEngineering


Note that a space was not automatically created. We would need to add a ' ' string in between to do so.
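
Two possible ways of getting the space, sketched with the same strings as above (the names `spaced` and `joined` are illustrative):

```python
string_1 = 'Chemical'
string_2 = 'Engineering'

# Option 1: add a space string between the two pieces.
spaced = string_1 + ' ' + string_2

# Option 2: join a list of pieces with a chosen separator.
joined = ' '.join([string_1, string_2])

print(spaced)
print(joined)
```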

### Working with lists and tuples

First we will talk about creating these. We will make one version of each that just has [10,20,30,40].

In [39]:
x = 4
x_list = [10,20,30,40]
x_tuple = (10,20,30,40)

Notice that a list uses square brackets and a tuple uses round brackets.

We can determine the type of any data using the "type" command:

In [42]:
print('\n x is a',type(x))
print('\n x_list is a',type(x_list))
print('\n x_tuple is a',type(x_tuple))


 x is a <class 'int'>

 x_list is a <class 'list'>

 x_tuple is a <class 'tuple'>


Now it is important to understand **indexing**. The **Python convention is to start with index 0 rather than 1**, in contrast to some other languages such as MATLAB. To find the first element, use square brackets and select element 0. The second would be element 1. The final element is index -1 regardless of length, and the nth-to-last is -n:

In [38]:
print('\n The first element of',x_list,' is',x_list[0])
print('\n The second element of',x_tuple,' is',x_tuple[1])
print('\n The final element of',x_tuple,' is',x_tuple[-1])
print('\n The second-to-last element of',x_tuple,' is',x_tuple[-2])


 The first element of [10, 20, 30, 40]  is 10

 The second element of (10, 20, 30, 40)  is 20

 The final element of (10, 20, 30, 40)  is 40

 The second-to-last element of (10, 20, 30, 40)  is 30


The value of an element can be reassigned in a list but not in a tuple. For example, the third element (index 2) of the list can be changed to 100 as follows:

In [40]:
x_list[2] = 100
print('\n Now x_list is',x_list)


 Now x_list is [10, 20, 100, 40]


The same command for the tuple will not work.
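
A minimal sketch of what happens when the assignment is tried on a tuple. Python raises a `TypeError`, which is caught here so that the cell keeps running:

```python
x_tuple = (10, 20, 30, 40)

try:
    x_tuple[2] = 100          # item assignment is not allowed for tuples
    assignment_failed = False
except TypeError as err:
    assignment_failed = True
    print('TypeError:', err)

# "Overwriting" means defining a new tuple under the same name.
x_tuple = (10, 20, 100, 40)
print(x_tuple)   # (10, 20, 100, 40)
```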

To add an element to a list, you can use **append**. Python sometimes uses **dot notation** to access **methods** like this. For example, we can add a fifth element equal to 45 to the list as follows:

In [41]:
x_list.append(45)
print('\n Now x_list is',x_list)


 Now x_list is [10, 20, 100, 40, 45]


You can also insert elements. For example, we can insert a 4 at the third position (index 2), which shifts the later elements one place to the right:

In [43]:
x_list.insert(2,4)
print('\n Now x_list is',x_list)


 Now x_list is [10, 20, 4, 100, 40, 45]


If you add two lists together, they are simply joined end to end (concatenated). For example, adding list_1=[1,2,3] to list_2=[40,50,60] could give either [1,2,3,40,50,60] or [40,50,60,1,2,3] depending on the order:

In [44]:
list_1 = [1,2,3]
list_2 = [40,50,60]
print('\n list_1 + list2 =',list_1+list_2)
print('\n list_2 + list1 =',list_2+list_1)


 list_1 + list2 = [1, 2, 3, 40, 50, 60]

 list_2 + list1 = [40, 50, 60, 1, 2, 3]


Notice that no math was done in any of this. We could have used strings instead. For example:

In [45]:
list_strings_1 = ['ethanol','1-butanol','1-hexanol']
list_strings_2 = ['1-octanol','1-decanol']
print('\n list_strings_1 =',list_strings_1)
print('\n list_strings_2 =',list_strings_2)
print('\n list_strings_1 + list_strings_2 =',list_strings_1+list_strings_2)


 list_strings_1 = ['ethanol', '1-butanol', '1-hexanol']

 list_strings_2 = ['1-octanol', '1-decanol']

 list_strings_1 + list_strings_2 = ['ethanol', '1-butanol', '1-hexanol', '1-octanol', '1-decanol']


The length of a list can be determined with the **len()** function:

In [46]:
print('\n The length of ',x_list,' is',len(x_list))


 The length of  [10, 20, 4, 100, 40, 45]  is 6


### Working with numpy arrays

Many Python-based numerical evaluations involve numpy arrays. To use them, you first have to import numpy. It is conventional to import it 'as np' so that you can write 'np' instead of 'numpy' when you want to use it. This is done as follows:

In [48]:
import numpy as np

Once imported, you can define a numpy array from a list. Two ways of getting to the same numpy array containing [10,20,30,40] are as follows:

In [49]:
x_list = [10,20,30,40]
array_1 = np.array(x_list)
array_2 = np.array([10,20,30,40])
print('Method 1 gives:',array_1)
print('Method 2 gives:',array_2)

Method 1 gives: [10 20 30 40]
Method 2 gives: [10 20 30 40]


It is important to note, however, that these are *neither* row *nor* column vectors as written. They are purely one-dimensional: their shape is (4,), not (1,4) or (4,1). This can cause problems when doing math. To specify an array as a **row vector**, add another set of brackets:

In [64]:
array_1_row = np.array([[10,20,30,40]])
print('A row vector (shape 1x4) of the array is:\n', array_1_row)

A row vector (shape 1x4) of the array is:
 [[10 20 30 40]]


A **column vector** is most easily created by taking the **transpose** of a row vector using the **.T** attribute:

In [54]:
array_1_column = np.array([[10,20,30,40]]).T
print('A column vector (shape 4x1) of the array is:\n', array_1_column)

A column vector (shape 4x1) of the array is:
 [[10]
 [20]
 [30]
 [40]]


These can all be distinguished by their **shapes**:

In [55]:
print('The shape of ',array_1,'is ',np.shape(array_1))
print('The shape of ',array_1_row,'is ',np.shape(array_1_row))
print('The shape of ',array_1_column,'is ',np.shape(array_1_column))

The shape of  [10 20 30 40] is  (4,)
The shape of  [[10 20 30 40]] is  (1, 4)
The shape of  [[10]
 [20]
 [30]
 [40]] is  (4, 1)
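
The distinction between these shapes matters once arithmetic is involved. The sketch below, with illustrative variable names, shows how adding a one-dimensional array to a column vector does not give four sums but a 4x4 table, because NumPy broadcasts the two shapes against each other.

```python
import numpy as np

flat = np.array([10, 20, 30, 40])        # shape (4,)
column = np.array([[10, 20, 30, 40]]).T  # shape (4, 1)

# Same shapes: ordinary element-wise addition.
same = flat + flat
print(same.shape)      # (4,)

# Mixed shapes: NumPy broadcasts the operands into a 4x4 table of sums.
mixed = flat + column
print(mixed.shape)     # (4, 4)
print(mixed)
```

No error is raised in the mixed case, so checking shapes before doing math is a useful habit.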


These can be easily interconverted using "**reshape**:"

In [56]:
print('Original array: ',array_1)
print('Horizontal array:', array_1.reshape(1,-1))
print('Vertical array:', array_1.reshape(-1,1))
print('1D array (same as original):', array_1.reshape(-1,))

Original array:  [10 20 30 40]
Horizontal array: [[10 20 30 40]]
Vertical array: [[10]
 [20]
 [30]
 [40]]
1D array (same as original): [10 20 30 40]


Array elements can be referenced with square brackets in the same way as with lists. Elements can also be changed in the same way.
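
A short sketch of indexing and reassignment on arrays. The names `array_demo` and `grid` are illustrative, so the arrays defined earlier are left untouched.

```python
import numpy as np

array_demo = np.array([10, 20, 30, 40])   # illustrative name

first = array_demo[0]     # index 0 is the first element
last = array_demo[-1]     # index -1 is the last element
print(first, last)

# Reassign the third element (index 2), exactly as with a list.
array_demo[2] = 100
print(array_demo)

# Two-dimensional arrays take one index per dimension: [row, column].
grid = np.array([[1, 2, 3], [4, 5, 6]])
print(grid[1, 2])   # 6
```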

The key difference now is how math works.

**All elements can be multiplied by a single value** via:

In [65]:
print('2*array_1 =',2*array_1)

2*array_1 = [20 40 60 80]


**Element-wise addition** can be done with a simple '+':

In [66]:
array_3 = np.array([1,2,3,4])
array_4 = array_1+array_3
print(array_1,'+',array_3,'=',array_4)

[10 20 30 40] + [1 2 3 4] = [11 22 33 44]


Element-wise subtraction is the same.

**Element-wise multiplication** can be done with a simple '*':

In [67]:
array_5 = array_1*array_3
print(array_1,'*',array_3,'=',array_5)

[10 20 30 40] * [1 2 3 4] = [ 10  40  90 160]


A **matrix dot product** uses the **np.dot** function:

In [68]:
array_6 = np.dot(array_1,array_1)
print(array_1_row,'\ndot\n',array_1_column,'=',array_6)

[[10 20 30 40]] 
dot
 [[10]
 [20]
 [30]
 [40]] = 3000


A matrix can be created either with **np.matrix()** or simply with **np.array()** but with two dimensions:

In [69]:
matrix_version = np.matrix(([1,2,3],[4,5,6],[7,8,9]))
array_version = np.array(([1,2,3],[4,5,6],[7,8,9]))
print('2D matrix:\n',matrix_version)
print('\n2D array:\n',array_version)

2D matrix:
 [[1 2 3]
 [4 5 6]
 [7 8 9]]

2D array:
 [[1 2 3]
 [4 5 6]
 [7 8 9]]


Usually an array is better to work with, and the NumPy documentation itself recommends regular arrays over **np.matrix**. The main difference is what happens with default operations, such as multiplication. With numpy arrays, multiplication accessed via "*" is element-wise. With matrices, "*" performs matrix multiplication instead.

Note that you don't need matrices to do linear algebra operations, however. np.dot(x,y) takes the dot product of x and y, while np.linalg.multi_dot([x,y,z]) takes the dot product of x, y, and z in that order.

In [70]:
print('Matrix product:\n',matrix_version*matrix_version,'\n')
print('Array product:\n',array_version*array_version)

Matrix product:
 [[ 30  36  42]
 [ 66  81  96]
 [102 126 150]] 

Array product:
 [[ 1  4  9]
 [16 25 36]
 [49 64 81]]
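
The same comparison can be made with plain arrays only. The sketch below is one possible way to do it, with illustrative names `A` and `v`. It also uses the `@` operator, which is not used elsewhere in this notebook and performs matrix multiplication on arrays.

```python
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
v = np.array([1, 0, -1])

elementwise = A * A            # element-by-element product
matrix_product = np.dot(A, A)  # matrix product
same_product = A @ A           # the @ operator gives the same matrix product

# Chained product of A, A, and v evaluated in one call.
chained = np.linalg.multi_dot([A, A, v])

print(elementwise)
print(matrix_product)
print(chained)
```

Because `np.dot` and `@` already cover matrix products, there is rarely a need to switch to `np.matrix` just to do linear algebra.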


The **number of dimensions** in an array can be found with **.ndim**:

In [71]:
print('array_1 has',array_1.ndim,'dimensions')
print('matrix_version has',matrix_version.ndim,'dimensions')
print('array_version has',array_version.ndim,'dimensions')

array_1 has 1 dimensions
matrix_version has 2 dimensions
array_version has 2 dimensions


The **number of elements** can be read from **.size**:

In [72]:
print('array_1 has',array_1.size,'elements')
print('matrix_version has',matrix_version.size,'elements')
print('array_version has',array_version.size,'elements')

array_1 has 4 elements
matrix_version has 9 elements
array_version has 9 elements


Shapes can also be accessed as they were previously:

In [73]:
print('The shape of array_1 is',np.shape(array_1))
print('The shape of matrix_version is',np.shape(matrix_version))
print('The shape of array_version is',np.shape(array_version))

The shape of array_1 is (4,)
The shape of matrix_version is (3, 3)
The shape of array_version is (3, 3)


Some **very convenient arrays** that can be generated include **arrays of all ones with np.ones(dim)**, **arrays of all zeros with np.zeros(dim)**, and **arrays with evenly spaced values with np.linspace(val1,val2,num=#elements)**:

In [76]:
print('\n',np.ones([3,4]))
print('\n',np.zeros([3,4]))
print('\n',np.linspace(1,5,num=9))


 [[1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]]

 [[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]

 [1.  1.5 2.  2.5 3.  3.5 4.  4.5 5. ]


You can also append to numpy arrays like with lists, but it is slightly different. To add another element, you need to redefine your array using np.append(). For example, a fifth element of '6' can be added to array_1 as follows:

In [77]:
array_1 = np.append(array_1,6)
print('\n array_1 is now',array_1)


 array_1 is now [10 20 30 40  6]


Multiple arrays can be combined with **np.concatenate()**:

In [78]:
print('\n Concatenating',array_1,'with',array_2,'gives',np.concatenate((array_1,array_2)))


 Concatenating [10 20 30 40  6] with [10 20 30 40] gives [10 20 30 40  6 10 20 30 40]


Notice the double parentheses; this is important. You are passing one argument (a tuple containing two arrays), not two separate array arguments. **This is an easy mistake to make**.
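
To see what the mistake looks like, the sketch below calls the function both ways with two small illustrative arrays. The exact error message may vary between NumPy versions, so both `TypeError` and `ValueError` are caught.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Correct: a single tuple holding both arrays.
combined = np.concatenate((a, b))
print(combined)   # [1 2 3 4 5 6]

# Incorrect: b lands in the slot meant for the axis argument.
try:
    np.concatenate(a, b)
    failed = False
except (TypeError, ValueError) as err:
    failed = True
    print(type(err).__name__, ':', err)
```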

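Multiple **rows** of the **same size** can be **stacked** on top of each other with **np.vstack**, and **columns** of the same size can be placed side by side with **np.hstack**. Using **np.hstack** on rows instead joins them end to end into one longer row: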

In [79]:
print(array_1_row,'\non\n',2*array_1_row,'\nmakes\n',np.vstack((array_1_row,2*array_1_row)))
print('\n',array_1_row,'\nnext to\n',2*array_1_row,'\nmakes\n',np.hstack((array_1_row,2*array_1_row)))
print('\n',array_1_column,'\nnext to\n',2*array_1_column,'\nmakes\n',np.hstack((array_1_column,2*array_1_column)))

[[10 20 30 40]] 
on
 [[20 40 60 80]] 
makes
 [[10 20 30 40]
 [20 40 60 80]]

 [[10 20 30 40]] 
next to
 [[20 40 60 80]] 
makes
 [[10 20 30 40 20 40 60 80]]

 [[10]
 [20]
 [30]
 [40]] 
next to
 [[20]
 [40]
 [60]
 [80]] 
makes
 [[10 20]
 [20 40]
 [30 60]
 [40 80]]


Some final things to cover include **simple math on arrays** and **statistics**. You can add all elements together with **.sum()**, find the max value with **.max()**, find the min value with **.min()**, and find the mean with **.mean()**.

In [80]:
print('\n array_1.sum() =',array_1.sum())
print('\n array_1.max() =',array_1.max())
print('\n array_1.min() =',array_1.min())
print('\n array_1.mean() =',array_1.mean())


 array_1.sum() = 106

 array_1.max() = 40

 array_1.min() = 6

 array_1.mean() = 21.2


Access an entire row or column with ':' instead of a specific element. For example:

In [85]:
x_2D = np.array(([1,2,3],[8,1,1]))
x_2D_row_first = x_2D[0,:]
x_2D_column_last = x_2D[:,-1]
print('\n x_2D is\n',x_2D) 
print('\n The first row in x_2D is',x_2D_row_first)
print('\n The last column in x_2D is',x_2D_column_last)


 x_2D is
 [[1 2 3]
 [8 1 1]]

 The first row in x_2D is [1 2 3]

 The last column in x_2D is [3 1]


Change a numpy array to a list with .tolist():

In [86]:
numpy_ones = np.ones(6)
list_ones = numpy_ones.tolist()
print('\n numpy_ones type is',type(numpy_ones))
print('\n list_ones type is',type(list_ones))


 numpy_ones type is <class 'numpy.ndarray'>

 list_ones type is <class 'list'>


## Pandas brief intro

Pandas dataframes can take some time to learn but can be very useful in organizing data. The internet is your friend here, but an example of the utility can be seen by evaluating the following cells in series. These have not been annotated and are therefore up to you to interpret.

In [None]:
import pandas as pd

df_1 = pd.DataFrame(columns=['Time (s)','C_A (mol/L)'])
display(df_1)

In [None]:
df_1['Time (s)'] = [0,1,2,4,8]
display(df_1)

In [None]:
df_1['C_A (mol/L)'] = [10,8,6,2,0]
display(df_1)

In [None]:
C_A0 = 10
C_A = df_1['C_A (mol/L)'].to_numpy()
X_A = (C_A0-C_A)/C_A0
df_1['X_A'] = X_A
display(df_1)

In [None]:
df_2 = pd.DataFrame(columns=['Some stuff','Some other stuff'],data=[[1,2],[9,1]])
display(df_2)

with pd.ExcelWriter('Output.xlsx') as writer:  
    df_1.to_excel(writer, sheet_name='Some random stuff')
    df_2.to_excel(writer, sheet_name='Super random stuff')

In [None]:
df_3 = pd.read_excel('Output.xlsx',sheet_name='Super random stuff')
display(df_3)

In [None]:
df_4 = df_3.drop(columns=['Unnamed: 0'])
display(df_4)
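
As one possible starting point for interpreting these cells, the sketch below rebuilds the time and concentration table in a single step and then looks up values by condition rather than by index. The names `df`, `row_at_4s`, `X_at_4s`, and `t_half` are illustrative and not part of the cells above.

```python
import pandas as pd

# Same illustrative data as in the cells above, built from a dictionary.
df = pd.DataFrame({'Time (s)': [0, 1, 2, 4, 8],
                   'C_A (mol/L)': [10, 8, 6, 2, 0]})

# Conversion from the initial concentration, computed for the whole column at once.
C_A0 = 10
df['X_A'] = (C_A0 - df['C_A (mol/L)'])/C_A0

# Boolean mask: keep only the row where the time equals 4 s.
row_at_4s = df[df['Time (s)'] == 4]
X_at_4s = row_at_4s['X_A'].iloc[0]
print(X_at_4s)

# First tabulated time at which the conversion reaches at least 0.5.
t_half = df.loc[df['X_A'] >= 0.5, 'Time (s)'].iloc[0]
print(t_half)
```

Selecting rows with a condition like this is what makes a dataframe convenient for questions such as "what was the conversion at a given time?"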