# Outline for Python Scripting for Biochemists
This workshop is based on [Python Scripting for Computational Molecular Sciences](http://education.molssi.org/python_scripting_cms/).

## Modules
1. Setup
2. Introduction with calculations using enzyme kinetics
3. File parsing for data sets for a promiscuous enzyme with multiple substrates
4. Plotting and linear curve fitting for protein concentration calibration curves using numpy
5. Plotting and nonlinear curve fitting for saturable single substrate enzyme kinetics using numpy
6. Writing functions
7. Running code from the Linux command line
8. Testing code with pytest
9. Version control with git 
10. Sharing code
11. Introduction to pandas using data sets derived from the PDB
12. Plotting UV-Visible spectra using matplotlib, including making interactive spectra with IPython magic commands (`%`)

## Meeting Notes: BASIL team on Monday, 25 Jan 2021
<!--- ![BASIL Team Interview](BASIL_Interview_Monday_25Jan2021.jpg)-->

Date: Monday January 25, 2021
Interview with BASIL Team members: Bonnie Hall, Julia Koeppe, Jose Tormos Melendez, Cassidy Terrell

### Details
I told them that I was creating a workshop based on Ashley's Python Scripting for CMS workshops and that I wanted to focus on things biochemists need to learn.
1. Manipulating a PDB file
    1. Using Biopython
    1. What data would you want to extract from a PDB file?
    1. They might want to measure a distance
1. Enzyme kinetics from a plate reader
    1. lots of data
    1. merging multiple data sets
    1. converting raw data to derived data
        1. students don't understand that the raw data are a time course of absorbance values at a fixed wavelength
        1. subtracting a negative control
    1. look over the data and decide to reduce or increase enzyme concentration
1. Tools that would be useful
    1. LaTeX equations training
    1. Greek letters
    1. Quadratic equations
    1. Equilibrium calculations: e.g., is this ionizable group protonated or deprotonated at a given pH?
    1. pH profiles of activities for enzymes with amino acid substitutions
        1. Does the activity change?
        1. Does the pH optimum change?
        1. Can you extract a pK<sub>a</sub> value from the data?
1. More sophisticated ideas
    1. Ask students to use Python to reproduce a plot given the data
    1. Predictive modeling and clustering
    1. K nearest neighbor
    1. Heat mapping data
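
The equilibrium question listed under "Tools that would be useful" (is an ionizable group protonated at a given pH?) could be sketched with the Henderson-Hasselbalch relationship. This is a minimal illustration, not workshop material: the function name `fraction_protonated` and the pK<sub>a</sub> of 6.0 are assumptions chosen for the example.

```python
def fraction_protonated(pH, pKa):
    """Return the fraction of an ionizable group in its protonated (HA) form."""
    # Henderson-Hasselbalch: pH = pKa + log10([A-]/[HA])
    ratio = 10 ** (pH - pKa)      # [A-]/[HA]
    return 1.0 / (1.0 + ratio)    # [HA] / ([HA] + [A-])


# Assumed pKa of 6.0, for illustration only
for pH in (5.0, 6.0, 7.0):
    print(f"pH {pH}: fraction protonated = {fraction_protonated(pH, 6.0):.3f}")
```

When the pH equals the pK<sub>a</sub>, the function returns 0.5, which is a quick sanity check students can do by hand.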
    
### Emerging ideas
- Students need to learn things they will use
- A toolbox of resources would be nice
    - Greek letters
    - LaTeX equation training
    - Common calculations: pH, pKa, M-M, quadratic equations
- Plate readers produce lots of data. Tools for merging, manipulating, and analyzing these data files would be useful
    - enzyme assays
    - protein assays
    - immunoassays
- Some faculty will want more advanced tools for their research students
    - Heat mapping (what is that?)
    - K nearest neighbor analysis
    - Predictive modeling and clustering
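
The plate-reader step noted above (turning a raw absorbance time course into a rate after subtracting a negative control) might look like the following sketch. The absorbance values are made up for illustration, and using `numpy.polyfit` for the straight-line fit is one possible choice, not a decision recorded in these notes.

```python
import numpy as np

# Made-up time course at a fixed wavelength (illustration only)
time_s = np.array([0, 10, 20, 30, 40, 50], dtype=float)
sample_abs = np.array([0.100, 0.152, 0.204, 0.256, 0.308, 0.360])
control_abs = np.array([0.100, 0.102, 0.104, 0.106, 0.108, 0.110])

# Subtract the negative control point by point
corrected = sample_abs - control_abs

# Fit a straight line; the slope is the rate in absorbance units per second
slope, intercept = np.polyfit(time_s, corrected, 1)
print(f"rate = {slope:.4f} absorbance units per second")
```

Plotting `corrected` against `time_s` before trusting the slope would let students judge whether the trace is linear or whether the enzyme concentration needs to change.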


## Meeting Notes: Phil Ortiz on Wednesday, 27 Jan 2021
<!--- ![Phil Ortiz Interview](Phil_Ortiz_Interview_Wednesday_27Jan2021.jpg)-->

Date: Wednesday January 27, 2021
Interview with Phil Ortiz

### General ideas
1. He is not certain that students need any coding experience. 
1. He thinks they definitely need to know how to use a spreadsheet to analyze and plot data.
    1. He strongly favors having students learn to build their own spreadsheets
    1. He learned on Lotus 1-2-3 and Sigma Plot
1. He is a strong advocate of quantitative skills for the students
1. He mentioned a number of challenging calculations
    1. Michaelis-Menten
    1. Henderson-Hasselbalch: pH, pK<sub>a</sub>
1. Computation could be handy for some more advanced topics
    1. Structure-function relationships: how does changing pH change the function of a protein?
    1. Hemoglobin would be a great example for this
1. Plots are important and difficult to generate
    1. Hydropathy plots
    1. Plot of helical wheels with colors to indicate charge and polarity (even better in 3D)
    1. It might be useful to go through a biochem textbook and look at the plots that are there

### Emerging Ideas
- Use of existing software (Excel) is one level of computing skill that should be reinforced
- Some tools for computation would be handy
- It would be worth reviewing all the plots in Tansey's text to see which ones can be done in Excel and which need Python or something more.


## Meeting Notes: Ashley Ringer McDonald on Friday, 29 Jan 2021
Date: 29 January 2021
### General Ideas
1. When considering the Python Scripting for CMS, different is better
    1. The file parsing lesson could focus on a larger PDB file
        1. Take a PDB file and write a new file with only the title, ATOM and HETATM records
        1. Selectively remove ANISOU entries
        1. Selectively extract only x, y, z coordinates
        1. Look for specific ligands
    1. The pandas library could be introduced
    1. Regular expression matching might be more important
1. Use the SciPy library for curve-fitting. It has better syntax than NumPy
1. Do we have any need for symbolic math? I told her no, since I don't know what to do with it
1. She recommended reviewing the MolSSI [Python for Data Analysis workshop](https://education.molssi.org/python-data-analysis/)
1. She recommended getting workshops onto venues sooner rather than later
    1. ASBMB Teaching for the Molecular Biosciences
    1. Better luck at regional and specialty meetings
        1. ACS regional (NERM)
        1. Rochester ACS section
        1. local institutions
    1. Summer research programs
    1. NIH Biophysics training grant
    1. Teach the teachers workshop for biochemists
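
The PDB file-parsing ideas above (keep only TITLE, ATOM and HETATM records, drop ANISOU, pull out x, y, z, and look for ligands) can be sketched in plain Python. This assumes the standard fixed-column PDB layout; the function names and the sample lines are hypothetical, and `make_record` exists only to build correctly aligned sample data.

```python
def make_record(record, serial, name, resname, chain, resseq, x, y, z):
    """Build a fixed-width, PDB-style coordinate line (sample data only)."""
    return (f"{record:<6s}{serial:5d} {name:<4s} {resname:>3s} {chain}"
            f"{resseq:4d}{'':4s}{x:8.3f}{y:8.3f}{z:8.3f}")


def filter_pdb_lines(lines, keep=("TITLE", "ATOM", "HETATM")):
    """Keep only lines whose record name (columns 1-6) is in `keep`."""
    return [line for line in lines if line[:6].strip() in keep]


def atom_coordinates(lines):
    """Extract (x, y, z) from ATOM/HETATM records (columns 31-54)."""
    coords = []
    for line in lines:
        if line[:6].strip() in ("ATOM", "HETATM"):
            coords.append((float(line[30:38]),
                           float(line[38:46]),
                           float(line[46:54])))
    return coords


def ligand_names(lines):
    """Collect residue names (columns 18-20) found in HETATM records."""
    return {line[17:20].strip() for line in lines if line.startswith("HETATM")}


# Made-up lines standing in for a real PDB file
sample = [
    "TITLE     HYPOTHETICAL EXAMPLE STRUCTURE",
    "REMARK   1 MADE-UP LINES FOR ILLUSTRATION",
    make_record("ATOM", 1, "N", "MET", "A", 1, 27.340, 24.430, 2.614),
    "ANISOU    1  N   MET A   1     1000   1000   1000      0      0      0",
    make_record("HETATM", 2, "ZN", "ZN", "A", 201, 10.000, -5.250, 3.125),
    "END",
]

kept = filter_pdb_lines(sample)    # ANISOU, REMARK and END are dropped
coords = atom_coordinates(kept)
ligands = ligand_names(kept)
print(len(kept), coords, ligands)
```

With a real file, `sample` would be replaced by the lines read from disk, and `kept` could be written to a new file so the original stays untouched.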
    
### Emerging Ideas
- Start working on workshop planning
- Don't be afraid to take a different path than the Python Scripting for CMS. Different is better.
- Consider other libraries: Pandas, SciPy
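
As a rough sketch of the SciPy curve-fitting suggestion, the example below fits the Michaelis-Menten equation with `scipy.optimize.curve_fit`. The data are synthetic and noise-free, generated from assumed parameters (V<sub>max</sub> = 100, K<sub>m</sub> = 4), so the fit should simply recover them; the function name and starting guesses are illustrative choices.

```python
import numpy as np
from scipy.optimize import curve_fit


def michaelis_menten(s, vmax, km):
    """Initial rate as a function of substrate concentration."""
    return vmax * s / (km + s)


# Synthetic data from assumed parameters (illustration only)
substrate = np.array([0.5, 1.0, 2.0, 5.0, 10.0, 20.0, 50.0])
rate = michaelis_menten(substrate, 100.0, 4.0)

# Starting guesses: largest observed rate and the median substrate concentration
initial_guess = [rate.max(), np.median(substrate)]
popt, pcov = curve_fit(michaelis_menten, substrate, rate, p0=initial_guess)
vmax_fit, km_fit = popt
print(f"Vmax = {vmax_fit:.2f}, Km = {km_fit:.2f}")
```

With real, noisy data, the square roots of the diagonal of `pcov` give approximate standard errors for the fitted parameters.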

## Meeting Notes: Wally Novak on Monday, 1 Feb 2021
Date: Monday 1 Feb 2021
### General Ideas
1. The conversation focused a lot on pedagogy - how and what we want to teach. We had a good discussion about creating tools that will support hypothesis generation based on 
    1. sequence comparison
    1. structure comparison
1. The sequence comparison should go well beyond what can be done in BLAST and Pfam
    1. There are so many identical sequences it is difficult to pull useful data from those sites any more.
    1. It is more powerful from the command line, where parameters are more readily controlled.
    1. Cluster analysis is powerful and possible
1. Another challenge is correlating sequence and structural data
    1. Identifying how closely residues from crystals align. This is especially interesting in comparing cases like subtilisin and trypsin.
    1. Follow these comparisons with sequence alignments
    1. Learn to write Python code to look for patterns of residues in the entire PDB
1. Emphasis on two questions when approaching programming:
    1. How it works (not so important for BMB students and faculty)
    1. What it can do for you (very important)
1. He also mentioned learning about and using Biopython


### Emerging Ideas
- Need to consider correlating sequence and structural alignments
- How much command line instruction do we want to provide?
- Should we consider free-standing Python or focus only on Jupyter notebooks?
- Overall we must emphasize "what can it do for me?"

## Meeting Notes: Charlie Weiss on Monday, 1 Feb 2021

### General Ideas
Our conversation focused mainly on libraries and skills. Charlie is an organic and inorganic chemist.
1. Libraries
    1. Numpy is powerful but basic
    1. Pandas has more features and datatypes (e.g. dataframes) for data importing. His students love pandas.
    1. SciPy is for scientific computing. I need to read the chapter in his book.
        1. built on numpy
        1. basic
        1. good for integration
        1. Signal processing
        1. Fourier transform and inverse Fourier transform
    1. Visualization libraries
        1. Matplotlib is powerful but basic and verbose
        1. Seaborn is built on matplotlib. You can make a complex plot with 1-2 lines. See Charlie's book.
    1. Biopython
        1. stable and pretty old
        1. many for loops required
        1. good for bringing in data repeatedly or with similar patterns
    1. scikit-bio
        1. in flux - still listed as in beta
        1. you can use it to poke around at data
        1. It will convert DNA sequences to RNA and protein sequences
    1. scikit-learn
    1. nmrglue
1. I mentioned my desire to bring in 3D visualization.
    1. He was not aware of any molecular viewers for Jupyter notebooks.
    1. He said that the Jupyter developers are very friendly and responsive.
    1. He recommended two conferences
        1. JupyterCon
        1. Annual SciPy conference in Austin, TX
            1. Some presentations
            1. Lots of impromptu discussions
            1. Birds of a feather sessions were really helpful
1. Skills
    1. Dealing with large datasets that can't be managed on a local computer
    1. Out-of-core methods (programs that must be run on a supercomputer?)

### Emerging Ideas
- Some of these libraries should be included in the workshop: numpy, scipy, pandas
- There are additional libraries that should be part of followup materials
- I need to get to JupyterCon and the SciPy meeting in Austin, TX

## Meeting Notes: Dan Dries on Tuesday, 9 Feb 2021

### General Ideas

### Emerging Ideas

## Meeting Notes: Joe Provost
### General Ideas

### Emerging Ideas

## Meeting Notes: Bonnie Hall on Wednesday, 4 Feb 2021
### General Ideas
1. Start with simple linear regression
    1. Use a common library - matplotlib, numpy or scipy
    1. Base it on a common target - Bradford protein assay or SDS-PAGE mobility
    1. Take data from Excel -> CSV -> Jupyter/Python
    1. At some point we may want to recruit someone to translate this to R
    1. Fit the curve (slope, intercept, R-squared), then graph the curve
    1. Include error
        1. In replicate points
        1. For the curve itself - where is it most and least reliable?
1. Predictive modeling
    1. Classification
        1. K-nearest neighbor
        1. Many data points
        1. Sort the information by proximity
    1. Examples
        1. Protein sequence
        1. Protein function, e.g., hydrolases with substrate results
1. Kinetics calculations using output from a plate reader
    1. Absorbance vs. time to get an initial slope
        1. Use this to estimate substrate consumption
        1. Is it linear? If not, how should the experiment be changed?
    1. Use this to calculate the activity
    1. An exercise to handle tedious math
1. Information from the PDB
1. Data wrangling using Pandas (including the cheat sheet Bonnie provided)
    1. Take data from a .csv file but don't modify the original file
    1. Run code that does calculations and analysis and output to a new file
    1. Search for useful points: minima, maxima, null values
    1. Standardize the names of ligands that are the same but appear under different names
    1. One possible data source is metabolomics raw data
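
The pandas data-wrangling steps above could be sketched as follows. The CSV text, column names, and synonym table are all made up for illustration; `io.StringIO` stands in for real files so the example runs without touching disk.

```python
import io

import pandas as pd

# Made-up CSV content standing in for a raw data file
csv_text = """ligand,activity
ATP,1.2
adenosine triphosphate,1.5
GTP,
NADH,0.4
"""
raw = pd.read_csv(io.StringIO(csv_text))

# Work on a copy so the original data are never modified
clean = raw.copy()

# Give ligands that are the same but named differently one common name
synonyms = {"adenosine triphosphate": "ATP"}
clean["ligand"] = clean["ligand"].replace(synonyms)

# Search for useful points: null values, maximum, minimum
n_missing = int(clean["activity"].isna().sum())
max_row = clean.loc[clean["activity"].idxmax()]
min_row = clean.loc[clean["activity"].idxmin()]
print(n_missing, max_row["ligand"], min_row["ligand"])

# Write the result to a new destination (a file path would also work here)
output = io.StringIO()
clean.to_csv(output, index=False)
```

In practice `pd.read_csv` would take the path of the exported plate-reader or metabolomics file, and `to_csv` would take a new file name so the raw file is preserved.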

### Emerging Ideas
- Simple is good. Simple linear regression will be well received and make them comfortable.
- Very interested in data wrangling with pandas
- Put useful tools in their hands that they and their students will use
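
A simple linear regression of the kind discussed here, such as a protein assay calibration curve, might be sketched with NumPy as shown below. The concentrations and absorbances are made-up numbers, the units are assumed, and `numpy.polyfit` is one of several reasonable choices among the libraries mentioned.

```python
import numpy as np

# Made-up calibration data (illustration only; units assumed to be mg/mL)
conc = np.array([0.0, 0.25, 0.5, 1.0, 1.5, 2.0])
absorbance = np.array([0.02, 0.15, 0.27, 0.53, 0.77, 1.03])

# Fit the line: absorbance = slope * conc + intercept
slope, intercept = np.polyfit(conc, absorbance, 1)

# R-squared from the residual and total sums of squares
predicted = slope * conc + intercept
ss_res = np.sum((absorbance - predicted) ** 2)
ss_tot = np.sum((absorbance - absorbance.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# Use the fitted line to estimate the concentration of an unknown sample
unknown_abs = 0.40
unknown_conc = (unknown_abs - intercept) / slope
print(f"slope={slope:.3f}, intercept={intercept:.3f}, R^2={r_squared:.4f}")
print(f"estimated concentration = {unknown_conc:.3f}")
```

Graphing the data points with the fitted line, for example with matplotlib, would be the natural next step after the fit.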