## In this notebook, you'll use Logistic Regression for the Ising model. 

It accompanies Chapter 5 of the book (4 of 5).

Copyright: Viviana Acquaviva (2023); see also other data credits below.
Modifications by Julieta Gruszko (2025)

License: [BSD-3-clause](https://opensource.org/license/bsd-3-clause/)

In [None]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import pickle
from matplotlib import cm

In [None]:
from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import cross_val_predict, cross_val_score, cross_validate, train_test_split

from sklearn.model_selection import KFold, StratifiedKFold

from sklearn import metrics

### First, let's take a look at those sigmoids!

In [None]:
x = np.linspace(-10,10,100)

In [None]:
z = 2*x + 5 #Linear bit

Let's say that the probability that something will happen is called $\pi$. 

The logistic model assumes that

$\log\left(\frac{\pi}{1-\pi}\right) = z$

We can now solve for $\pi$:

In [None]:
pi = 1/(1 + np.exp(-z))
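
To see where this expression comes from, exponentiate both sides of the log-odds equation to get $\frac{\pi}{1-\pi} = e^{z}$, then rearrange to $\pi = \frac{e^{z}}{1+e^{z}} = \frac{1}{1+e^{-z}}$. The short sketch below is only a numerical sanity check of that algebra; the names `x_demo`, `z_demo`, `pi_demo` and `z_back` are illustrative and not part of the original notebook.

```python
import numpy as np

# Illustrative grid, restricted to the range shown in the plot
x_demo = np.linspace(-7, 3, 101)
z_demo = 2 * x_demo + 5                    # same linear model as above

# Sigmoid: the solution of log(pi / (1 - pi)) = z
pi_demo = 1 / (1 + np.exp(-z_demo))

# Applying the log-odds to pi should give z back
z_back = np.log(pi_demo / (1 - pi_demo))
print(np.allclose(z_back, z_demo, atol=1e-6))
```

For very large |z| the round trip loses precision because $1-\pi$ underflows toward 0, which is why the check uses a modest range.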

In [None]:
plt.plot(x, pi)

plt.xlim(-7,3);

plt.title('Hello, I am a sigmoid!')

plt.xlabel('x', fontsize=14)

plt.ylabel(r'$\pi$', fontsize=14);  # raw string avoids the invalid "\p" escape warning

Questions:
    
- Where does $\pi$ = 0.5 occur? 

- What happens if the slope of the linear model is negative?

### We can now see an example from Mehta et al 2018:

["A high-bias, low-variance introduction to Machine Learning for physicists"](https://arxiv.org/abs/1803.08823).

(Thank you to Pankaj Mehta and David Schwab)!

We are trying to use a logistic regression model to predict whether a material is in an ordered or disordered phase, based on its spin configuration. In an ordered phase, the spins are aligned. The representation is a 2D lattice, so our features are the spin states of each element in the lattice. The physical model, known as the Ising model, predicts that the transition depends on temperature and, for a finite-size lattice, is smeared around a critical temperature $T_c$.

The training data is composed of 160,000 Monte Carlo simulations in a range of temperatures, and their labels.

Possible applications of this formalism involve predicting the critical temperature for more complex systems.

Reading in the data might take a little while.

In [None]:
#This is gratefully borrowed with permission from the notebooks maintained by P. Mehta.

######### LOAD DATA
# The data consists of 16*10000 samples taken in T=np.arange(0.25,4.0001,0.25):
data_file_name = '../Data/Ising2DFM_reSample_L40_T=All.pkl'
# The labels are obtained from the following file:
label_file_name = '../Data/Ising2DFM_reSample_L40_T=All_labels.pkl'


#DATA
with open(data_file_name, 'rb') as pickle_file:
    data = pickle.load(pickle_file) # pickle reads the file and returns the Python object (1D array, compressed bits)

data = np.unpackbits(data).reshape(-1, 1600) # Decompress array and reshape for convenience
data=data.astype('int')
data[np.where(data==0)]=-1 # map 0 state to -1 (Ising variable can take values +/-1)

#LABELS (convention is 1 for ordered states and 0 for disordered states)
with open(label_file_name, 'rb') as pickle_file:
    labels = pickle.load(pickle_file) # pickle reads the file and returns the Python object (here just a 1D array with the binary labels)

In [None]:
data.shape

In [None]:
np.unique(labels)
#labels: 1 = ordered or near-critical
#labels: 0 = disordered

Check the label distribution. Are the classes balanced or imbalanced? Do the data need to be shuffled?
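
One possible way to answer both questions is sketched below, using `np.unique` with `return_counts=True` for the class balance and `np.diff` to see whether the labels come in contiguous blocks. The array `demo_labels` is a small made-up stand-in, not the real data; in the notebook you would pass `labels` instead.

```python
import numpy as np

# Made-up stand-in for `labels`: a block of ones followed by a block of zeros
demo_labels = np.repeat([1, 0], [6, 4])

# Class balance: how many instances carry each label
values, counts = np.unique(demo_labels, return_counts=True)
fractions = counts / counts.sum()
print(dict(zip(values.tolist(), fractions.tolist())))

# Ordering: count how often the label changes from one instance to the next.
# A very small number means the classes are stored in blocks, so an unshuffled
# K-fold split could produce folds that contain a single class.
n_switches = np.count_nonzero(np.diff(demo_labels))
print(n_switches)
```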

#### We can take a look at a few examples:

In [None]:
#H/T: https://stackoverflow.com/questions/16834861/create-own-colormap-using-matplotlib-and-plot-color-scale

cmap = matplotlib.colors.ListedColormap(["aquamarine","navy"], name='from_list', N=None)

# Pass figsize to subplots directly; a separate plt.figure() call would create an extra empty figure
fig, axarr = plt.subplots(nrows=1, ncols=3, figsize=(15,8))
axarr[0].imshow(data[0].reshape(40,40), cmap = cmap) #first object has label "1"
axarr[1].imshow(data[80000].reshape(40,40), cmap = cmap) #from documentation, this is critical-ish (between 60,000 and 90,000)
axarr[2].imshow(data[100000].reshape(40,40), cmap = cmap) #disordered
for i in range(3):
    axarr[i].set_xticks([0,20,40]);

### Let's pick a random selection to speed up the computations.

In [None]:
np.random.seed(10)

sel = np.random.choice(data.shape[0], 16000, replace = False)

In [None]:
seldata = data[sel,:]

In [None]:
sellabels = labels[sel]

In [None]:
plt.scatter(np.arange(seldata.shape[0]),sellabels); #The random selection also has the advantage of reshuffling the data!

How many features are we using (this is our largest feature space yet!)?

### And now time for the logistic regression model.

In [None]:
model = LogisticRegression(max_iter = 1000) #This uses a numerical method to find the minimum of the loss function

In [None]:
model.get_params() #Note that (unlike in linear regression) regularization is the norm!

In [None]:
model

Using cross validation, as usual, train the model and report the results.
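
As a rough sketch of what this step can look like, the example below runs `cross_validate` with several scoring metrics on a small synthetic data set. `X_demo` and `y_demo` are made-up stand-ins (random ±1 "spins" with a simple labelling rule), not the Ising data; in the notebook you would use `seldata` and `sellabels`. The choice of `StratifiedKFold` and of these three metrics is an assumption for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

rng = np.random.default_rng(0)

# Made-up stand-in data: 400 samples of 25 spins (+1/-1),
# labelled 1 when the first five spins sum to a positive value
X_demo = rng.choice([-1, 1], size=(400, 25))
y_demo = (X_demo[:, :5].sum(axis=1) > 0).astype(int)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=10)
results = cross_validate(
    LogisticRegression(max_iter=1000),
    X_demo,
    y_demo,
    cv=cv,
    scoring=["accuracy", "precision", "recall"],
    return_train_score=True,
)

# Report mean and spread across folds for train and test scores
for name in ["accuracy", "precision", "recall"]:
    test = results[f"test_{name}"]
    train = results[f"train_{name}"]
    print(f"{name}: test {test.mean():.3f} +/- {test.std():.3f}, train {train.mean():.3f}")
```

Comparing train and test scores side by side is one way to spot overfitting, which a single test accuracy number would hide.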

In [None]:
# code to train model and get results

What metric is being reported? Is this enough information? What other information might we want?

### Do your own grid search to optimize the regularization parameter C. 

Check log-spaced values of C between 1E-3 and 1E3 (in other words, C = {1E-3, 1E-2, ..., 1E3}). There is no need for cross-validation this time around; think of this as a preliminary exploratory phase, not as reporting our final results.

Note that our data is already very regular (feature values are -1/1), so we are not doing any scaling.

Does regularization make a noticeable improvement to this model's performance?
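
A minimal sketch of such a manual search is shown below, assuming a single train/test split made with `train_test_split`. The data (`X_demo`, `y_demo`) are made-up stand-ins rather than the Ising samples, and the 75/25 split is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Made-up stand-in data (see the labelling rule in the comment below)
X_demo = rng.choice([-1, 1], size=(400, 25))
y_demo = (X_demo[:, :5].sum(axis=1) > 0).astype(int)  # 1 if the first five spins sum > 0

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=10, stratify=y_demo
)

# C = 1E-3, 1E-2, ..., 1E3 (smaller C means stronger regularization)
C_values = np.logspace(-3, 3, 7)

scores = []
for C in C_values:
    clf = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    scores.append((C, clf.score(X_train, y_train), clf.score(X_test, y_test)))

for C, train_acc, test_acc in scores:
    print(f"C = {C:g}: train accuracy {train_acc:.3f}, test accuracy {test_acc:.3f}")
```

A large gap between train and test accuracy at high C, shrinking as C decreases, would suggest that regularization is helping.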

In [None]:
#Test each value of C and report the results.


### Now let's generate labels in order to check predictions.

For those classifiers that are solving a regression problem under the hood, there is the handy "predict_proba" method.

In [None]:
model = LogisticRegression(C=1.0, max_iter=1000)

# Define the splitter once so that both calls use the same folds
cv = KFold(n_splits=5, shuffle=True, random_state=10)

ypred = cross_val_predict(model, seldata, sellabels, cv=cv)

ypred_prob = cross_val_predict(model, seldata, sellabels, cv=cv, method='predict_proba')

The output of predict_proba gives the probability of belonging to the disordered (label 0) or ordered (label 1) phase, in that order (these add up to 1, as they should). The simple classifier output is the class whose probability is greater than 0.5. We can look at this to convince ourselves:

In [None]:
np.column_stack([ypred_prob, ypred])

### Plot an ROC curve to check the performance
Is 0.5 really the best threshold to set? Maybe or maybe not! It depends on your application. We can get a more complete picture of the performance using an ROC curve. 
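
To make the role of the threshold concrete, here is a small sketch that converts probabilities into labels at three different thresholds and computes the true and false positive rates by hand. The arrays `y_true_demo` and `p_ordered_demo` are made-up numbers, and `rates` is a hypothetical helper; with the real data you would use `sellabels` and `ypred_prob[:, 1]`.

```python
import numpy as np

# Made-up true labels and predicted probabilities of the "ordered" class
y_true_demo = np.array([0, 0, 0, 1, 1, 1, 0, 1])
p_ordered_demo = np.array([0.1, 0.4, 0.6, 0.3, 0.7, 0.9, 0.2, 0.55])

def rates(y_true, p_ordered, threshold):
    """Return (TPR, FPR) when predicting label 1 for p >= threshold."""
    y_pred = (p_ordered >= threshold).astype(int)
    tpr = np.mean(y_pred[y_true == 1] == 1)  # fraction of true 1s that are recovered
    fpr = np.mean(y_pred[y_true == 0] == 1)  # fraction of true 0s flagged as 1
    return tpr, fpr

for threshold in (0.25, 0.5, 0.65):
    tpr, fpr = rates(y_true_demo, p_ordered_demo, threshold)
    print(f"threshold {threshold}: TPR = {tpr:.2f}, FPR = {fpr:.2f}")
```

Lowering the threshold catches more of the true positives at the cost of more false positives. Each threshold corresponds to one point on the ROC curve.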


In [None]:
from sklearn.metrics import roc_curve

In [None]:
fpr, tpr, thresholds = roc_curve(sellabels, ypred_prob[:, 1])

plt.plot(fpr, tpr, label='Logistic Regression ROC Curve')
plt.xlabel('False Positive Rate (FPR)')
plt.ylabel('True Positive Rate (TPR)')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend()
plt.grid(True)
plt.show()

### We can plot a few examples to see how our classifier is doing. 

In [None]:
fig, axarr = plt.subplots(nrows=1, ncols=8, figsize=(15,5))
for i in range(8):
    axarr[i].imshow(seldata[i].reshape(40,40), cmap = cmap) 
    axarr[i].set_xlabel('True label:'+str(sellabels[i])+'\n'+'Pred label:'+str(ypred[i]))
    axarr[i].set_yticks([])
    axarr[i].set_xticks([])

Unfortunately, two of these instances are misclassified by our logistic regression classifier. At least visually, this is understandable.

Let's take a look at the corresponding probabilities:

In [None]:
ypred_prob[:8]

How confident is the model about its choice for the first 8 instances? Does this raise any concerns?

### Some analysis:
The conclusion is that the main indicator of disorder in this model is the lack of consistency between spin alignments, which our classifier does not model well. This is a tricky problem because many algorithms decide based on the value of each individual feature; for many of them, it is hard to represent correlations among features as an indicator.
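
The sketch below illustrates this point under simplifying assumptions. It uses a hypothetical linear model with random, unfitted weights `w` and an arbitrary intercept `b`, and compares an all-up lattice, an all-down lattice, and a random lattice. Both uniform lattices are perfectly ordered, yet their linear scores always sum to `2*b`, so a model that is linear in the individual spins cannot score both of them strongly as "ordered" unless the intercept does the work. A feature built from products of neighboring spins (here the hypothetical `neighbor_alignment`) treats the two identically.

```python
import numpy as np

rng = np.random.default_rng(0)
side = 40

# Hypothetical linear model: random weights and an arbitrary intercept (not fitted)
w = rng.normal(size=side * side)
b = 0.3

all_up = np.ones(side * side)
all_down = -all_up
random_spins = rng.choice([-1, 1], size=side * side)

def logit(spins):
    """Linear score z = w . s + b used by logistic regression."""
    return w @ spins + b

def neighbor_alignment(spins):
    """Mean product of horizontally adjacent spins: 1 if fully aligned, near 0 if random."""
    lattice = spins.reshape(side, side)
    return np.mean(lattice[:, :-1] * lattice[:, 1:])

# Flipping every spin flips the sign of w . s, so the two scores sum to 2*b
logit_sum = logit(all_up) + logit(all_down)
print(logit_sum, 2 * b)

for name, spins in [("all up", all_up), ("all down", all_down), ("random", random_spins)]:
    print(name, neighbor_alignment(spins))
```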


### Improving the model:
One way to improve the performance may be to add engineered features that combine the behavior of neighboring spins. Pooling is one approach to do this: you add new features that are calculated by reducing the dimensionality of the lattice from 40x40 to 20x20 by combining adjacent cells and rounding the average spin (e.g. a cell with 4 pixels with spins -1, 1, 1, 1 would be assigned 1, a cell with spins -1, -1, 1, 1 would be assigned 0, a cell with -1, -1, -1, 1 would be assigned -1). 

Add 4-pixel pooling features to the data, and then repeat the fit with logistic regression. Does this give an improved result?
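
For reference, here is one possible vectorized sketch of 2x2 pooling, shown on a tiny 4x4 lattice rather than the real data. The helper `pool2x2` and the array `demo` are hypothetical names. Instead of averaging and rounding, this version takes the sign of each block sum, which yields the same -1/0/1 assignment described above. This is a deliberate substitution: `np.round` rounds halves to the nearest even number, so `np.round(0.5)` gives 0 rather than 1.

```python
import numpy as np

def pool2x2(flat, side):
    """Pool each 2x2 block of a (n_samples, side*side) spin array into -1, 0, or 1."""
    n = flat.shape[0]
    # Axes after reshape: (sample, block row, row within block, block column, column within block)
    blocks = flat.reshape(n, side // 2, 2, side // 2, 2)
    block_sums = blocks.sum(axis=(2, 4))       # possible values: -4, -2, 0, 2, 4
    return np.sign(block_sums).reshape(n, -1)  # majority spin, 0 for a tie

# Tiny made-up example: one 4x4 lattice, flattened the same way as the data
demo = np.array([[-1,  1, -1, -1],
                 [ 1,  1,  1,  1],
                 [-1, -1,  1,  1],
                 [-1,  1,  1,  1]]).reshape(1, 16)

pooled = pool2x2(demo, side=4)
print(pooled)

# Append the pooled features as new columns next to the original spins
combined = np.concatenate([demo, pooled], axis=1)
print(combined.shape)
```

On the notebook's data the equivalent call would be `pool2x2(seldata, side=40)`, giving 400 extra columns per instance.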



In [None]:
# Here an example of selecting from a data point, just to help you figure out how to do this
inst = seldata[3].reshape(40, 40)
inst[0:2, 0:2]


In [None]:
# here's a framework for your pooling steps. Save the result as a new set of features for each instance.
# Note, this is a very slow way to do this! On my laptop, it took about 30 seconds to run. 
# It's not very pythonic to have all those loops, and we pay the cost in processing time since the code isn't optimizing the processes at all.
# On the other hand, it's nice and easy to understand. 
# If you want to write something that runs more quickly, this will help: https://numpy.org/devdocs/reference/generated/numpy.lib.stride_tricks.sliding_window_view.html
pooledfeat = []
for idx in range(seldata.shape[0]):  # separate index name so the loop variable is not overwritten
    inst = seldata[idx].reshape(40, 40)
    pooledinst = []
    for i in range(0, inst.shape[0], 2):
        for j in range(0, inst.shape[1], 2):
            #average and round the values in 4 pixels, append the results to pooledinst
    #append each pooledinst to pooledfeat

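# the feature array comes out with some nearly-but-not-quite 0 values because of numerical precision issues, this fixes that problem
# (compare the absolute value: without np.abs, every -1 would also be set to 0)
pooledfeat = np.array(pooledfeat)
pooledfeat = np.where(np.abs(pooledfeat) < 1E-6, 0, pooledfeat)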

In [None]:
# add the new features to the old ones as new columns using np.concatenate

In [None]:
# train logistic regression again with your previous best value of C and check the results

How did the model perform with the added features? 

Draw the ROC curve for the new model, comparing it to your original model. What do you see?

In [None]:
# ROC curve code