# Big Data & Quantum Mechanics
 

## Overview of Medford Group research

<img src="images/medford_group_bg.png">



## About Prof. AJ Medford

- Started as a professor (and started this VIP course) in Spring 2017.
- Experience in developing and contributing to several open-source software packages (CatMAP, ElectroLens, TAPSolver, SPARC, AMPTorch)
- Instructor for "Data Analytics for Chemical Engineers" and Numerical Methods.
- Interest in applying data science techniques to problems in quantum chemistry and physics.

### Co-instructor: Prof. Phanish Suryanarayana

- Lead developer of SPARC DFT code
- Expert in electronic structure theory and numerical methods

## Introductions

We will go around the class and introduce ourselves to everyone. When it is your turn to speak, tell everyone your *preferred name, major, and something **boring** about yourself*.

## How does VIP work?

The premise of VIP is teams working on projects. Much like a real-world engineering team, individual members work on different aspects of the project. Team members range from sophomores through graduate students, from first-time participants to students who have been involved for four or more semesters. Some students take the course for one credit, and others take it for two credits; naturally, the bar will be higher for those taking it for two credits.

## How is VIP graded?

You will receive a grade for the course based on three criteria:

- Documentation (33.3%): Based on biweekly updates of progress on tasks.
- Personal Accomplishments (33.3%): Based on how well you achieve your research goals.
- Teamwork and Participation (33.3%): Peer evaluations will be used to establish how well you work on a team.

Grading process:
- Bi-weekly on Thursday: Submit "bi-weekly update" and literature review to Canvas. Complete peer grading (instructions in syllabus).

- Midterm: Submit personal accomplishment documentation to Canvas. Complete peer evaluations. Complete peer grading. Grade is *advisory*.

- Final: Identical to midterm, but grade is *final*

The following deliverables are expected at the midterm and final evaluations. Note that the "personal accomplishment" documentation will be graded using a combination of peer grading and instructor grading, so you will also need to complete the peer grading at each point.

- Deliverables: 
    - Compiled bi-weekly update
    - Personal accomplishment documentation
    - Peer grading
    - Peer evaluation

See [syllabus](https://github.com/medford-group/bdqm-vip/blob/master/syllabus.md) for more details.

## VIP is not like a regular course 

Regular courses have a clear direction:

<img src="images/mario.png">

VIP lets you choose your own adventure:

<img src="images/zelda.jpeg">


## Group Communication:

- Slack group used for all communication, join using [this link](https://join.slack.com/t/medfordgroup/shared_invite/enQtNzI2NTcxNTMyNzIxLTdlNDQ3OTE0YjcxOWYyNTQxNDZiZjFjNmIwMzRlYWVhYWY5NWY2MjFkYTAwNzA1YTM1ZDM4ZTAzMDRlNGI1ZDE)
    - vip: all VIP students should join. Official announcements will be posted here.
    - training: discussion related to training project.
    - big_data: channel for "Big Data" sub-team (see below)
    - quantum_mechanics: channel for "Quantum Mechanics" sub-team (see below)
    - general: channel for general discussions with the whole group.
 

## Team Structure: New Students

- Complete a training project involving DFT adsorption energy calculation and neural network training.
- Training project goals document [available here](https://github.com/medford-group/bdqm-vip/blob/master/project_descriptions/training.md).

New students will join the "Training" sub-team. You will work individually to complete all the tasks, but should meet regularly with your teammates and graduate advisor to touch base on progress, ask questions, and get help with anything you are stuck on.

## Team Structure: Returning Students and OMSCS Students

- Returning students will join one of the two "sub-teams" described below, or work on "independent study" projects.
- OMSCS students can decide whether to complete the training exercises or join a sub-team (or both)

### Sub-teams
- There are two main sub-teams, each of which will function as a small research group advised by a graduate mentor
    * **Big Data** - This sub-team will broadly focus on the ["Open Catalyst Project"](https://opencatalystproject.org/) (OCP) benchmark datasets with the goal of creating, testing, and improving machine-learning models on these benchmarks.
    * **Quantum Mechanics** - This sub-team will broadly focus on improving, benchmarking, and applying the suite of [SPARC](https://github.com/SPARC-X/) codes for performing density functional theory and other quantum-mechanical simulations.
- Students should work on self-defined individual tasks within the scope of a sub-team
    * Each student should have a clearly-defined task that they are working independently on. This task should be self-determined and within the scope of the broader goal of the sub-team based on consultations with other sub-team members and the graduate student advisor.
    * Students may work together on a given task or direction, but should also have individual goals.
    * Each student should create a "goals document" similar to the [training project goals](https://github.com/medford-group/bdqm-vip/blob/master/project_descriptions/training.md) document. This should be uploaded to Canvas by the second week of class, and ideally added to the `project_descriptions` folder of this GitHub repo with a file name in the format `lastname-project_short_title.md`.
    * Students should regularly communicate with their sub-team to (1) coordinate progress on individual tasks to work toward the larger goal of the team, (2) ask for and provide assistance to other sub-team members, and (3) seek advice from and provide updates to the sub-team graduate mentor.
    * It is fully expected that the goals of research tasks change throughout the semester. The goals document can be updated at any time up to 2 weeks prior to the end of the semester. Students should revise goals as needed to ensure they are achievable.


### Independent Study Projects

These projects are self-defined projects that are outside the scope of the two main sub-teams, but are within the scope of the Medford research group. These projects should be defined through direct discussions with Prof. Medford at least one month prior to the start of the semester. The projects will generally be associated with initiating new research projects or wrapping up ongoing projects from prior semesters. Independent study projects are typically high risk and/or require significant prior experience, so they will generally be reserved for more senior and returning students.

## Details on Sub-teams

Additional details and broader context for the overarching goals of each sub-team, along with possible ideas of individual tasks, are provided below.

### Big Data

Adsorption energies of molecules on solid surfaces are extremely challenging to measure experimentally, but they can be computed with DFT at a significant computational cost. The ["Open Catalyst Project"](https://opencatalystproject.org/) has curated two datasets and associated baseline machine-learning models for predicting adsorption energies. The "OC20" dataset is an extremely large dataset including a broad class of materials, while the OC22 dataset is somewhat smaller and focuses on oxide materials (a materials class that is particularly challenging to simulate). There are a number of specific tasks and a leaderboard for each dataset with state-of-the-art (SOTA) models created by major computer science companies including Meta and Microsoft as well as numerous academic groups. The goal of this sub-team is to build, test, and improve machine-learning models for this important benchmark dataset. Some examples of project directions could include:

#### Developing AMPTorch and GMP models
The [AMPTorch code](https://github.com/ulissigroup/amptorch) is co-developed by the Medford and Ulissi groups, and includes a unique featurization scheme called "Gaussian Multipole Features". These features have [shown promising performance for OC20](https://arxiv.org/abs/2102.02390), especially for smaller datasets. They have far fewer fitted parameters than more complex models, so they may not be competitive on very large datasets, but they retain advantages in speed and in performance on small datasets. Some specific tasks that could be investigated include:

* Optimizing AMPTorch+GMP models through hyperparameters of the features and neural net architectures to improve performance.
* Creating and testing AMPTorch+GMP models for the OC22 benchmark.
* Implementing the infrastructure needed to place AMPTorch+GMP models onto the official Open Catalyst leaderboard.
* Improvements in computational efficiency and memory management of the AMPTorch code (especially parallelizing training to multiple GPU nodes).
* Testing how GMP features work with more complex machine-learning model architectures such as graph convolutional neural networks.

#### Testing state-of-the-art models
The Open Catalyst leaderboard is constantly evolving with new state-of-the-art models that utilize cutting-edge concepts like graph neural networks and transformers. In principle, these models may be capable of accelerating many research tasks within the Medford group. However, installing these models on Georgia Tech infrastructure and applying them to new problems outside the OCP benchmarks often requires significant software troubleshooting. Some specific tasks to be investigated include:

* Installing SOTA models on PACE and repeating the benchmark tests for subsets of the OC20 dataset to check the time required for inference.
* Applying SOTA models to new adsorption predictions that are of interest to the Medford group.
* Creating infrastructure for ["fine tuning"](https://iopscience.iop.org/article/10.1088/2632-2153/ac8fe0) of models to accelerate geometry optimizations. 

#### Creating general data infrastructure
There are some tasks associated with training atomistic machine-learning models that are agnostic to the details of the models being used or trained. Developing software and infrastructure to help with these tasks will be broadly beneficial for many of the other tasks mentioned above. Some specific examples include:

* Infrastructure for automating and accelerating the optimization of hyperparameters for neural network models and atomistic features.
* Algorithms for stratified sub-sampling of massive datasets to create more representative sub-samples that facilitate faster model optimization.
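
As one illustration of the stratified sub-sampling idea above, here is a minimal sketch using only the Python standard library. The function name `stratified_subsample`, the toy dataset, and the choice of "material class" as the stratum label are all hypothetical; a real project would need to decide what defines a stratum (composition, adsorbate, surface type, and so on).

```python
import random
from collections import defaultdict


def stratified_subsample(items, key, fraction, seed=0):
    """Return a sub-sample that keeps roughly `fraction` of each stratum.

    `key` maps an item to its stratum label (labels must be sortable).
    """
    rng = random.Random(seed)  # fixed seed so the sub-sample is reproducible

    # Group the items by stratum label.
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)

    sample = []
    for label in sorted(strata):
        members = strata[label]
        # Keep at least one member so rare strata are never dropped entirely.
        n_keep = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, n_keep))
    return sample


# Hypothetical toy dataset: (structure id, material class) pairs.
dataset = [(i, "metal") for i in range(90)] + [(i, "oxide") for i in range(90, 100)]
subset = stratified_subsample(dataset, key=lambda row: row[1], fraction=0.2)
print(len(subset), "of", len(dataset), "structures kept")
```

Sampling within each stratum, rather than from the pooled dataset, keeps the class proportions of the sub-sample close to those of the full set, so rare classes are not lost by chance.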

### Quantum Mechanics

Quantum mechanical simulations are critical for understanding materials, and DFT is especially impactful owing to its tradeoffs in speed/scalability and accuracy. However, even DFT has limitations in terms of system size owing to the computational scaling that increases approximately as $\mathcal{O}(N^3)$ where $N$ is the number of electrons. One route to overcoming this limitation is computational parallelization, allowing much larger systems to be studied by using more computational power. The [SPARC DFT code](https://arxiv.org/abs/2005.10431), developed by Prof. Suryanarayana at Georgia Tech, is able to achieve faster wall times than any other code in the limit of large numbers of processors, making it an ideal platform for implementation of new methods and studying realistic systems. However, there are many outstanding challenges in improving and applying the code. Some examples of project directions include:
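
To get a feel for the cubic scaling mentioned above, the short sketch below estimates how the cost grows with system size, assuming the cost is exactly proportional to $N^3$. The function `relative_cost` is a hypothetical helper for illustration only; real codes deviate from this idealized behavior because of prefactors, parallel efficiency, and the choice of algorithm.

```python
def relative_cost(n_electrons, n_reference, exponent=3):
    """Cost relative to a reference system, assuming cost scales as N**exponent."""
    return (n_electrons / n_reference) ** exponent


# How much more expensive is a system with 2x, 10x, or 100x as many electrons?
for factor in (2, 10, 100):
    print(f"{factor}x more electrons -> ~{relative_cost(factor, 1):,.0f}x the cost")
```

Under this assumption, doubling the number of electrons makes the calculation about eight times more expensive.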

#### Improving the user friendliness of the SPARC code
The SPARC code has [documentation](https://github.com/SPARC-X/SPARC/tree/master/doc) and a [Python API](https://github.com/SPARC-X/sparc-dft-api), but there is room for improvement in both areas. Creating new examples and tutorials would also be useful. Some specific tasks include:

* Automatically linking the SPARC documentation to the Python API (e.g. have doc strings for functions that automatically update from the SPARC docs).
* Streamlining the Python API code to be better organized and easier to maintain.
* Creating a Jupyter Book with specific tutorials on how to utilize various features of SPARC.

#### Developing pseudopotentials and associated infrastructure
Pseudopotentials (PSPs) are a critical part of DFT calculations, and the SPARC code requires a specific type of "norm conserving" PSPs. The Suryanarayana and Medford groups have worked together to [apply evolutionary algorithms](https://arxiv.org/abs/2209.09806) to help accelerate the process of pseudopotential development, leading to a new library of "SPMS" PSPs. However, these PSPs are focused on a specific exchange-correlation functional type (GGA) and could be tweaked in many ways. Moreover, an improved system for organizing and disseminating these PSPs would be impactful. Some specific tasks include:

* Learning to apply the genetic algorithm and using it to optimize other PSP formats.
* Creating a website to help visualize and organize PSP libraries (similar to [this example](http://www.pseudo-dojo.org/)).
* Systematic testing of the speed and accuracy of different types/libraries of PSPs for various tasks.

#### Applying and benchmarking methods in SPARC
There is an ever-evolving list of new exchange-correlation functionals and other methods that are being implemented in SPARC. These new methods enable the study of various material and chemical phenomena, although benchmarking is often required to establish accuracy. Some specific tasks include:

* Applying SPARC to study a specific material or chemical problem of interest.
* Comparing the speed and accuracy of different methods within SPARC for a specific problem.

## Standard meeting format

Subsequent meetings will follow one of three formats. We will start each meeting virtually in this Zoom room regardless of the format.

* Flipped class meetings: A lecture video will be posted at least 2 days prior to the lecture. All training students should plan on watching the video prior to the class meeting time. The class meeting time will then be used for discussion of the lecture materials (training group), or as a time to meet with your sub-group. The main lecture will be used to briefly discuss logistics before breaking into sub-groups. Sub-groups where all (or some) of the students are on campus may elect to meet in person or in a hybrid mode based on the preference of group members.


* Update meetings: For midterm and final updates, each team will post a 10-15 minute update presentation to Canvas, and each student will be assigned 3 update presentations to watch and provide peer reviews before class. During the class time, all students are expected to be present, and we will go through each group to field questions and discuss their work. Any remaining time will be used for sub-group meetings. 


* Workshops: No official lecture video to watch beforehand. The entire lecture will be used as unstructured time to work on projects and interact with mentors and instructors.

**Note:** If no member of your sub-team is present for the synchronous lecture, then everyone from the group will lose 1/2 point (out of 5) from the documentation grade. If you cannot attend during your update please confirm with an instructor at least 24 hours ahead of time.

## Lecture schedule and syllabus

The course [syllabus](https://github.com/medford-group/bdqm-vip/blob/master/syllabus.md) is available via GitHub, and includes a list of all the lecture topics and dates. Lecture videos will be posted to the "Media Gallery" on the course Canvas page.

## Week 1 Assignment

- Join the [Slack channel](https://join.slack.com/t/medfordgroup/shared_invite/enQtNzI2NTcxNTMyNzIxLTdlNDQ3OTE0YjcxOWYyNTQxNDZiZjFjNmIwMzRlYWVhYWY5NWY2MjFkYTAwNzA1YTM1ZDM4ZTAzMDRlNGI1ZDE)
- Start discussion for selecting your group or sub-team project
- Install necessary software following instructions below.

## Software Installation:

### Install Anaconda3

We'll be using Python 3 and Jupyter notebooks extensively in this class. The easiest way to get both is to install Anaconda3: go to the Anaconda website below and follow the download and installation prompts, making sure to pick the installer for your operating system.

https://www.anaconda.com/distribution/
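
Once the installation finishes, a quick check like the sketch below can confirm that Python 3 is active and that key packages are visible. The helper `missing_packages` is hypothetical, and the package list (`jupyter_core`, `numpy`) is an assumption about what a default Anaconda install provides; adjust it to whatever the course ends up requiring.

```python
import importlib.util
import sys


def missing_packages(names):
    """Return the subset of `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if sys.version_info.major < 3:
    print("Python 3 is required; found", sys.version.split()[0])
else:
    # Assumed package list; not an official course requirement.
    missing = missing_packages(["jupyter_core", "numpy"])
    print("Python", sys.version.split()[0], "- missing packages:", missing or "none")
```

Save this as a `.py` file and run it with the `python` command from the Anaconda prompt or a terminal; an empty "missing packages" list suggests the environment is ready.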

### Ensure you can access a linux/unix prompt

#### Windows users:
Please install the Windows Subsystem for Linux (WSL) with Ubuntu using these instructions:

https://docs.microsoft.com/en-us/windows/wsl/install-win10

#### Mac users:
Be sure you can open a [terminal](https://www.youtube.com/watch?v=zw7Nd67_aFw)