In this section, we explore a simple code gloss on the stories that in-school data tell. How do they reinforce existing stories? What stories do they hide? As with all the code sections, or, for that matter, as with (quite literally) all communication, ideas or positions are erased, foregrounded, backgrounded, and subject to bias. These are simplistic digital explorations of the nature of complex social interactions. That does not make them less worthy of study or comprehension; ignoring dominant modes of power does not make them go away.

As with all the code in the book, we have to start by `import`-ing the appropriate add-ons to the Python language (called "libraries") so that we can use them as we program. Here, we need: various math functions (basic arithmetic is included by default); the ability to generate pseudorandom numbers (that is, "effectively random" numbers); the `numpy` library, which is at the core of most data analysis in Python; and a statistics (and machine learning) package called `scikit-learn`.

In [None]:
from random import randint,choice,sample
from math import ceil, floor, log
import numpy as np
from sklearn.linear_model import LinearRegression
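
A note on "pseudorandom": the numbers look random, but they come from a deterministic recipe that starts at a "seed". The sketch below is only an illustration (the seed value 42 and the variable names are arbitrary choices, not part of the book's code); it shows that restarting from the same seed reproduces the same "random" numbers.

```python
import random
import numpy as np

# Python's built-in generator: same seed, same sequence
random.seed(42)
first = [random.randint(1, 6) for _ in range(5)]
random.seed(42)
second = [random.randint(1, 6) for _ in range(5)]
print(first == second)  # True

# numpy keeps its own generator, with its own seed
np.random.seed(42)
a = np.random.normal(50, 25, 3)
np.random.seed(42)
b = np.random.normal(50, 25, 3)
print(np.array_equal(a, b))  # True
```

Without a fixed seed, every run of the cells in this section produces different data, so your graphs and numbers will not match the ones printed here.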

If we want to print out our very simple graph, we need to handle that ourselves. The following "cell" in this "notebook" defines a helper that generates random data and functions that print a textual graph of a list of values.

In [None]:
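# this generates a list of `n` random numbers (defaulting to 100)
# drawn from a normal distribution with a given mean (def. to 50)
# and standard deviation (def. to 25).
# note: `max` is not used below, so values are not capped at 100
# (a few may even fall below 0).
def generate_simple(n=100, mean=50, sd=25, max=100):
  return list(np.random.normal(mean, sd, n))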

# this function makes a readable picture in characters
# (a variable with 1 or more characters is a `string`)
# given a graph that is a grid: a list of rows, where each row
# is a list of single characters. for instance, the grid
# [[" ","*"],["*"," "]] has one point in each of its two rows.
# the first row in the list is the bottom of the picture.
def graph_to_string(a_graph):
  output = "+\n"
  # draw the last row first so that larger values appear higher up
  # (reversed() leaves the caller's list untouched)
  for row in reversed(a_graph):
    output += "|{}\n".format("".join(row))
  output += "+" + "-"*len(a_graph[0])
  return output

# if we are drawing our own graphs, we need to scale
# them to the right size. we don't want all the points
# to be on top of each other, say.
def normalize_datum(y, x, bucket_size_y, bucket_size_x):
  return (floor(y / bucket_size_y), floor(x / bucket_size_x))

graph_size_x = 20 # this is an arbitrary number that i thought looked ok!
graph_size_y = ceil(graph_size_x / 2) # graphs often are wider than tall
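def plot_simple(ys, xs=None):
  # if no (or mismatched) x values are given, use 0, 1, 2, ...
  if (xs is None) or (len(ys) != len(xs)):
    xs = range(len(ys))
  bucket_size_y = max(ys) / (float(graph_size_y) - 1)
  bucket_size_x = len(xs) / (float(graph_size_x) - 1)
  graph = [[" " for x in range(graph_size_x)] for y in range(graph_size_y)]
  for i in range(len(xs)):
    dn = normalize_datum(ys[i],xs[i],bucket_size_y,bucket_size_x)
    # keep points on the grid: a negative y (possible with random
    # normal data) would otherwise wrap around to the top row
    row = min(max(dn[0], 0), graph_size_y - 1)
    col = min(max(dn[1], 0), graph_size_x - 1)
    graph[row][col] = "*"
  return graph_to_string(graph)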



Now that the preliminaries are out of the way, let's turn to the data science itself. Unfortunately, there's no easy way to explain linear regression without a lot more information – indeed, this is one of those situations in which a little knowledge is more dangerous than none (**CITE**). That said, people use linear regression every single day across thousands of professions. A linear regression is, per [Wikipedia](https://en.wikipedia.org/wiki/Linear_regression), an approach for modeling the relationship between two variables. In our case, the two variables are the order in which events happen (x) and a randomly generated value for each event (y). We will plot those data, draw a line through them, and see whether a linear regression produces a "meaningfully predictive" model of that relationship (be it strong or weak, positive or negative).

In [None]:
max_events = 10

simple_data = generate_simple(max_events) ## generate random data
print(plot_simple(simple_data)) ## plot those data

X=np.array(range(len(simple_data))).reshape(-1,1) ## put those data in the right format
y=simple_data 
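model = LinearRegression().fit(X,y) ##  perform our linear regression
# show how we did: `score` is R^2 (1.0 would be a perfect fit), `coef` is
# the slope of the line, and `intercept` is where the line crosses x=0
print(f"linear_model: score: {model.score(X, y):.2f}, coef: {model.coef_[0]:.2f}, intercept: {model.intercept_:.2f}")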

pred = model.predict(X) # draw the best fit line
print(plot_simple(pred))

+
|                 *  
|                    
|                    
|                    
|     *         *    
|                    
|                    
|*  *   *   * *      
| *                  
|         *          
+--------------------
linear_model: score: 0.36, coef: 6.36, intercept: 17.59
+
|                 *  
|               *    
|             *      
|           *        
|       * *          
|     *              
|   *                
|**                  
|                    
|                    
+--------------------
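
The three numbers printed between the two graphs can be reproduced with plain arithmetic. The sketch below is a minimal illustration, assuming a small set of made-up values (`x`, `y`, and every number in them are hypothetical, not taken from the run above). It computes the slope, the intercept, and the score (R²) of an ordinary least-squares line by hand; these are the same quantities that `LinearRegression` reports as `coef_[0]`, `intercept_`, and `score(...)`. As a cross-check it uses `np.polyfit`, a different numpy routine that fits the same kind of line.

```python
import numpy as np

# Hypothetical values, made up for illustration only
y = np.array([20.0, 35.0, 30.0, 55.0, 50.0, 70.0])
x = np.arange(len(y), dtype=float)  # event order: 0, 1, 2, ...

# slope: how x and y vary together, divided by how much x varies
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
# intercept: the line must pass through (mean of x, mean of y)
intercept = y.mean() - slope * x.mean()

# score (R^2): 1 minus (variation the line misses / total variation)
predicted = slope * x + intercept
ss_res = np.sum((y - predicted) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"by hand: score: {r_squared:.2f}, coef: {slope:.2f}, intercept: {intercept:.2f}")

# cross-check: polyfit with degree 1 returns (slope, intercept)
check_slope, check_intercept = np.polyfit(x, y, 1)
print(f"polyfit: coef: {check_slope:.2f}, intercept: {check_intercept:.2f}")
```

Nothing in this arithmetic knows what the numbers mean. The line is fit the same way whether `y` holds grades, lateness, or noise.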


Now that we have seen how linear regression is performed, we can play a simple game. The algorithm generates random data and tells a story about them, and you say whether that story is right or wrong.

The obvious first modification (on your part) would be to input your real-world data. We provide a list (called `variables`) of possible data you could input; these are variables that real schools use every day. It may be shocking, but the data that real schools and corporations use are often this simple (with the addition of your identifying information).



In [None]:
print("I will tell you a story about your data, then you tell me if it is correct.")
variables = ["current grade","actions per minute","things turned in","lateness","attendance","other browser windows open"]
salient_variables = sample(variables,4)
real_answers = ["My mom was sick.","My dad helped me.","My job made me come in, so I could not get any sleep."]

X=np.array(range(max_events)).reshape(-1,1)
y=generate_simple(max_events)
model = LinearRegression().fit(X,y)
m = model.coef_[0]
if m < -1:
  judgment = "going down."
elif m > 1:
  judgment = "going up."
else:
  judgment = "staying the same."
print(f"Your '{choice(salient_variables)}' is {judgment}")
print("What is the real story?")
print("Working off this data: ")
for col in salient_variables:
  print(f"\t{col}")
print("Some stories I might tell are:")
for col in real_answers:
  print(f"\t{col}")
print("Why can't those data tell that story?")


I will tell you a story about your data, then you tell me if it is correct.
Your 'current grade' is going up.
What is the real story?
Working off this data: 
	current grade
	lateness
	other browser windows open
	things turned in
Some stories I might tell are:
	My mom was sick.
	My dad helped me.
	My job made me come in, so I could not get any sleep.
Why can't those data tell that story?
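
The modification suggested earlier, swapping the random data for your own, might look something like the sketch below. It is only an illustration under stated assumptions: the numbers in `things_turned_in` are invented, and `tell_story` is a hypothetical helper (not part of the book's code) that repeats the game's steps and its thresholds of -1 and 1.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical numbers, invented for this sketch: assignments turned in
# during each of ten weeks. Replace them with your own records.
things_turned_in = [5, 5, 4, 5, 3, 2, 3, 1, 2, 0]

def tell_story(values, name):
    # same steps as the game: fit a line through the values in order,
    # then read a "judgment" off the slope of that line
    X = np.array(range(len(values))).reshape(-1, 1)
    m = LinearRegression().fit(X, values).coef_[0]
    if m < -1:
        judgment = "going down."
    elif m > 1:
        judgment = "going up."
    else:
        judgment = "staying the same."
    return f"Your '{name}' is {judgment}"

print(tell_story(things_turned_in, "things turned in"))
# prints: Your 'things turned in' is staying the same.
```

The student in this made-up example goes from five assignments a week to none, yet the story is "staying the same." The slope is about -0.5 assignments per week, and the thresholds were chosen for random values on a scale of roughly 0 to 100. The story depends on choices like these as much as on the data.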


Real lives are hard to contain in a limited set of variables; they make the world much *messier* and, well, the story less straightforward. It is easier to get mad at a kid because their homework is late. It is harder to be mad if you know that it was late because their mom was sick. Is there a set of variables that makes these analyses humane? How do we interpret them humanely?