In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = '16'

# Physics and Data Science

**Author:** Pierre de Buyl  
http://pdebuyl.be  

**Data Science Leuven** 11 February 2020


**License:** [CC-BY](https://creativecommons.org/licenses/by/4.0/)

## Who am I?

- Physicist at the [Institute for Theoretical Physics](https://fys.kuleuven.be/itf/) KU Leuven
  - Statistical physics
  - Nonlinear dynamics
  - Computational science
- Co-organizer of EuroSciPy 2012 and 2013
- Open science / open source contributor [@pdebuyl](https://github.com/pdebuyl) on GitHub


## Two parts

1. What can Data Science do for Physics?
2. What can Physics do for Data Science?

Physics and computing (in general) have a strong shared history.

## Data Science?

### Greater Data Science categories by Donoho
1. Data Gathering, Preparation, and Exploration
2. Data Representation and Transformation
3. Computing with Data
4. Data Modeling
5. Data Visualization and Presentation
6. Science about Data Science

**Focus** on computing and modeling

## Data Science for Physics

- Detection of particles in particle accelerators
- LHC summary:
  - 30 PB per year
  - Real-time filtering of collision events
  - Worldwide grid for data analysis
- In preparation: deep learning pipeline for event filtering

![Collision at CERN](cern_detection_figure.png)

**Credit:** [Daniel Helmborg](https://towardsdatascience.com/particle-tracking-at-cern-with-machine-learning-4cb6b255613c) on Medium.

## Data Science for Physics

- Machine learning for materials science
- Tune interatomic potentials without human *a priori* assumptions
- Large-scale data analysis for optimizing photovoltaic cells

## Physics for Data Science: general ideas

- Laws of physics
- Modeling
- Information theory

## Physics for Data Science: Nonlinear dynamics and chaos

- The butterfly effect exists *but* is often misrepresented.
- Quick summary: a small change can have a large consequence.
- Example with the logistic map
$$ x \to 4\ x \ (1-x)$$

In [None]:
def logistic_map(x):
    return 4*x*(1-x)

In [None]:
data_1 = []
data_2 = []
x1 = 0.45678998765478
x2 = x1 + 1e-7  # tiny perturbation of the initial condition
for i in range(35):
    x1 = logistic_map(x1)
    x2 = logistic_map(x2)
    data_1.append(x1)
    data_2.append(x2)


In [None]:
plt.plot(data_1, marker='o', label='reference start')
plt.plot(data_2, marker='o', label='perturbed start')
plt.xlabel('iteration')
plt.legend()

### Why the example?

- Sometimes unintuitive patterns have a simple explanation.
- In 1961, Lorenz discovered chaos when he re-entered values from a printout to restart a computer simulation. The results diverged unexpectedly because the printout showed fewer digits than the computer used internally.
- Many real-life problems exhibit chaotic behavior: population dynamics, economics, biochemical networks, ...
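
The sensitivity to initial conditions can be quantified. The following is a minimal sketch, not part of the original talk code: it estimates the Lyapunov exponent of the logistic map, that is, the average exponential rate at which nearby trajectories separate, by averaging $\ln|f'(x)|$ along a trajectory. The name `lyapunov_estimate`, the trajectory length and the discarded transient are illustrative choices. For this map the known value is $\ln 2$, so a small difference roughly doubles at every iteration until it reaches the size of the interval.

```python
import numpy as np

def logistic_map(x):
    return 4*x*(1-x)

def lyapunov_estimate(x0, n_steps=10000, n_transient=100):
    """Average of ln|f'(x)| along a trajectory, with f'(x) = 4 - 8x."""
    x = x0
    # Discard the first iterations so the average is not tied to x0
    for _ in range(n_transient):
        x = logistic_map(x)
    total = 0.0
    for _ in range(n_steps):
        total += np.log(abs(4 - 8*x))
        x = logistic_map(x)
    return total / n_steps

lyap = lyapunov_estimate(0.45678998765478)
print("estimate:", lyap, " ln 2:", np.log(2))
```

A positive exponent is the usual signature of chaos: with an initial difference of $10^{-7}$ and a doubling at each step, the two trajectories become unrelated after about 23 iterations.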

## Physics for Data Science: information theory

- Thermodynamics applies to information
- Entropy measures the quantity of information

![](xkcd_936_password.png)

**Credit:** [xkcd](https://xkcd.com/936/)
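
As a rough illustration of entropy as a quantity of information, the sketch below computes the Shannon entropy $H = -\sum_i p_i \log_2 p_i$ in bits. The function names are hypothetical, and the password example assumes words drawn uniformly and independently from a list of 2048 words (11 bits per word), which matches the figure quoted in the comic.

```python
import math

def shannon_entropy_bits(probabilities):
    """Shannon entropy H = -sum(p * log2(p)), in bits."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def uniform_choice_entropy_bits(pool_size, n_choices):
    """Entropy of n independent uniform draws from a pool of pool_size items."""
    return n_choices * math.log2(pool_size)

coin_bits = shannon_entropy_bits([0.5, 0.5])            # fair coin
passphrase_bits = uniform_choice_entropy_bits(2048, 4)  # four random words
print(coin_bits, passphrase_bits)
```

Each extra bit doubles the number of equally likely possibilities, which is why a longer passphrase of random common words can be harder to guess than a short, complicated password.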

<video controls src="gas_entropy.mp4" />

**Credit:** [@AndrewM_Webb](https://mobile.twitter.com/AndrewM_Webb/status/1182340203253514247)

## Physics for Data Science: information theory

- Minimal energy to erase a bit (Landauer's principle): $k_B T \ln 2$ at temperature $T$
- Relation between data and the physical world
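
To give a sense of scale, the sketch below evaluates the Landauer bound $k_B T \ln 2$ at room temperature. The function name `landauer_limit` and the choice of 300 K are illustrative assumptions; the Boltzmann constant is the exact SI value.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant in J/K (exact in the SI since 2019)

def landauer_limit(temperature):
    """Minimal energy in joules to erase one bit at the given temperature in kelvin."""
    return K_B * temperature * math.log(2)

energy = landauer_limit(300.0)
print(f"{energy:.3e} J per bit at 300 K")
```

The result is of order $10^{-21}$ J per bit, many orders of magnitude below what current hardware dissipates per operation.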

## Physics for Data Science: stochastic optimization

- Finding good parameters is hard!
- Objective: minimize a cost function.
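
One reason the search is hard is that a cost function can have several minima. As a small sketch, assuming the quartic cost $x^4 + 5x - 10x^2$ used in the cells below, the stationary points can be located with `np.roots` applied to the derivative $4x^3 - 20x + 5$ and then classified with the second derivative. The variable names are illustrative.

```python
import numpy as np

def cost_function(x):
    return x**4 + 5*x - 10*x**2

# Stationary points are the roots of the derivative 4x^3 - 20x + 5
stationary = np.sort(np.roots([4, 0, -20, 5]).real)

# The second derivative 12x^2 - 20 is positive at a minimum
minima = stationary[12*stationary**2 - 20 > 0]
for x_min in minima:
    print(f"minimum at x = {x_min:.3f}, cost = {cost_function(x_min):.2f}")

global_min = minima[np.argmin(cost_function(minima))]
```

There are two minima, one on each side of the origin, and the deeper one sits at negative $x$. A plain gradient method started at positive $x$ slides into the shallower minimum and stays there.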

In [None]:
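def cost_function(x):
    # Quartic cost with a local minimum near x = 2.1 and a global minimum near x = -2.35
    return x**4 + 5*x - 10*x**2

def cost_derivative(x):
    # Analytical derivative of cost_function
    return 4*x**3 + 5 - 20*x

def update(x, step=0.005):
    # One gradient descent step: move against the derivative
    derivative = cost_derivative(x)
    return x - derivative*step

# Grid of x values used to plot the cost function
xr = np.linspace(-4, 4)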

In [None]:
x = 3.5
data = [x]
for i in range(100):
    x = update(x)
    data.append(x)
data = np.array(data)

In [None]:
plt.plot(xr, cost_function(xr), label='cost function')
plt.scatter(data, cost_function(data), color='orange', label='gradient method')
plt.ylim(-40, 40)
plt.legend()

In [None]:
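def update_stoch(x, step=0.01):
    # Gradient step plus Gaussian noise scaled by sqrt(step), as for Brownian motion
    derivative = cost_derivative(x)
    return x - derivative*step + 4*np.random.normal()*np.sqrt(step)

x = 3.5
data_stoch = [x]
for i in range(10000):
    x = update_stoch(x)
    if i % 10 == 0:
        # Keep every 10th point to thin the trajectory
        data_stoch.append(x)
data_stoch = np.array(data_stoch)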

In [None]:
plt.plot(xr, cost_function(xr), label='cost function')
plt.scatter(data, cost_function(data), color='orange', label='gradient method')
plt.scatter(data_stoch, cost_function(data_stoch), color='green', label='stochastic gradient method')
plt.ylim(-40, 50)
plt.legend()

In [None]:
plt.plot(data_stoch)

## Physics for Data Science: stochastic processes

- Stochastic gradient method: "a random walk in a force field"
- Relation to Brownian motion, a cornerstone of statistical physics
- Monte Carlo methods → Markov Chain Monte Carlo (MCMC)
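
To make the link to MCMC concrete, here is a minimal sketch of a Metropolis sampler that treats the quartic cost from the optimization example as an energy and samples from the Boltzmann weight $e^{-\mathrm{cost}(x)/T}$. It is not code from the talk: the function name `metropolis`, the temperature, the proposal width and the seed are all assumptions chosen for illustration.

```python
import numpy as np

def cost_function(x):
    return x**4 + 5*x - 10*x**2

def metropolis(x0, n_steps=20000, temperature=5.0, proposal_width=0.5, seed=0):
    """Sample x with probability proportional to exp(-cost_function(x)/temperature)."""
    rng = np.random.default_rng(seed)
    x = x0
    samples = np.empty(n_steps)
    n_accepted = 0
    for i in range(n_steps):
        # Propose a random-walk move
        x_new = x + proposal_width * rng.normal()
        delta = cost_function(x_new) - cost_function(x)
        # Always accept downhill moves, accept uphill moves with Boltzmann probability
        if delta <= 0 or rng.random() < np.exp(-delta / temperature):
            x = x_new
            n_accepted += 1
        samples[i] = x
    return samples, n_accepted / n_steps

samples, acceptance = metropolis(3.5)
print(f"acceptance rate: {acceptance:.2f}")
print(f"mean cost of samples: {cost_function(samples).mean():.2f}")
```

Like the stochastic gradient method, the chain performs a random walk biased toward low cost, but it needs no derivative: only cost differences enter the acceptance rule.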

## Summary

- Physicists use Data Science
- Ideas from statistical physics can be useful to Data Science:
  - Nonlinear dynamics and chaos
  - Information theory
  - Random walks and Brownian motion

## Perspectives

- Beyond black boxes
- Physics should adopt good programming practices more widely