### Objective: Let's make a LinearRegression object!

Needs:
- inputs
- methods
- a class to wrap it all together!


First, let's demonstrate what data objects there are.

In [None]:
obj_str = "2"
obj_int = 2
obj_float = 2.0
obj_bool = 2>1

Here's what the types look like.

In [None]:
obj_list = [obj_str,obj_int,obj_float,obj_bool]

for object_ in obj_list:
    print(type(object_))

### Numeric objects:
Integers and floats obey the usual math operations:

In [None]:
obj_int*3,\
obj_int +2,\
obj_float**3

Notice the difference above between the float result and the integer results. Python will still say the two values are equal:

In [None]:
obj_int==obj_float

but if you add them then it will become a float:

In [None]:
obj_int + obj_float

__Notice:__ In many languages (C or Java, for example), you must __declare__ variable types. __Python worked out the types automatically__ on the fly.

This makes it easy to define objects, but it can also be a pain if Python ends up with a type you did not intend.

As long as you remember how Python does it, it can be quite useful. If you forget, it can lead to big problems. If Python raises an error, it will point you to your mistake, but the worst mistakes are __ones you don't know you are making__.

__Make sure you know what each thing is supposed to be!__
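
Here is a small sketch of the kind of silent mistake I mean. The names `raw`, `doubled_text`, and `doubled_number` are made up for this illustration; the point is that a number stored as a string still "multiplies" without any error, just not numerically.

```python
# Hypothetical example: a value read from a file often arrives as a string.
raw = "2"

doubled_text = raw * 2          # string repetition, no error raised
doubled_number = int(raw) * 2   # convert to int first to get arithmetic

print(doubled_text, type(doubled_text))
print(doubled_number, type(doubled_number))

# isinstance() is one way to check what an object is before using it.
print(isinstance(raw, str), isinstance(raw, int))
```

Checking with `type()` or `isinstance()` before doing math is one way to catch this early.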

### Strings

Strings, unlike numeric objects, have "length."

In [None]:
len(obj_str)

In [None]:
len(obj_int)

Above I tried to call `len()` on an `int` object, but integers do not have a length, so Python raises a `TypeError`.

Here I can call "len()" on a string and it gives me the number of characters.

In [None]:
example_string = "Do Re Mi Fa So La Ti Do"
len(example_string)

Like other languages, I can "index" strings.

In [None]:
example_string[0], \
example_string[:3], \
example_string[0::3],\
example_string[:-3],

__Important:__ Python indexes from __zero__, not one, so the first element `D` is `example_string[0]` and not `example_string[1]`.

__Question: What do you think the third one did?__

Note that in the last one I used "negative indexing"; I told it to give me everything except the last three characters. This is handy for when you do not know how long the string is.
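
The general slice pattern is `string[start:stop:step]`, where `stop` is not included. Here is a minimal sketch on a made-up string (`letters` is not part of the notebook), so you can compare each slice with its result:

```python
letters = "abcdefgh"   # hypothetical string, just for illustration

first = letters[0]                # 'a'      -> single character at index 0
first_three = letters[:3]         # 'abc'    -> indices 0, 1, 2
every_other = letters[::2]        # 'aceg'   -> start to end in steps of 2
drop_last_three = letters[:-3]    # 'abcde'  -> everything except the last 3
last_three = letters[-3:]         # 'fgh'    -> only the last 3

print(first, first_three, every_other, drop_last_three, last_three)
```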

Weirdly, you can apply some "math" operations to strings in Python to do different things:
- "+" will "concatenate" (join) strings.
- "*" will repeat a string.

In [None]:
obj_str+"3 dog" , obj_str*10 

### Booleans
These are binary objects: True or False.

Handy in loops and functions when you only want something to happen if a condition is met.

For example, here's a "while loop" that trims a string until it's 3 characters long:

In [None]:
len(example_string)>3

In [None]:
while len(example_string)>3:
    example_string = example_string[:-1]
    print(example_string)
    
example_string

What did this just do?
- I just re-assigned the object "example_string" each time with a version of it without the last character.

But the "condition" is actually an object itself:

In [None]:
is_it_more_than_three = len(example_string)>3
is_it_more_than_three

Weird little handy thing: if you combine them with numeric objects, they behave like the numbers zero (`False`) and one (`True`).

In [None]:
is_it_more_than_three + 5,\
is_it_more_than_three*2

In [None]:
1/(True+2)
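
One place this comes in handy is counting how often a condition is true. A minimal sketch, assuming a made-up list `notes` (not an object from the notebook):

```python
# Hypothetical list of notes, just for illustration.
notes = ["Do", "Re", "Mi", "Fa", "So", "La", "Ti", "Do"]

count_do = 0
for note in notes:
    count_do += (note == "Do")   # adds 1 when True, 0 when False

share_do = count_do / len(notes)  # fraction of notes equal to "Do"

print(count_do, share_do)
```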

### Lists and Dictionaries
- A list is an __ordered__ array: you look things up by position.
- A dictionary is a __mapping__: you look things up by key, not by position.

In [None]:
obj_list

In [None]:
obj_list[2]

The third object (index 2) is __always__ the float.

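A dictionary is not indexed by position (it remembers insertion order in Python 3.7+, but `obj_dict[2]` would look for a key equal to `2`), and the mapping keeps relationships between keys and objects consistent.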

In [None]:
obj_dict = {"string":obj_str,
            "integer":obj_int,
            "float":obj_float,
            "boolean":False,
            "something_else":"something"}

In [None]:
if obj_dict["boolean"]:
    print("yay!")
else:
    print("too bad")

So passing the key "boolean" retrieves whatever the dictionary relates to "boolean", just as passing "float" would retrieve the float.

This is how you "look things up" in the dictionary.
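
A short sketch of a few other common ways to work with keys. The dictionary here is a trimmed copy with literal values, and the key "complex" is deliberately missing; both are assumptions for the illustration.

```python
# Trimmed, self-contained copy of the dictionary idea above.
obj_dict = {"string": "2", "integer": 2, "float": 2.0, "boolean": False}

value = obj_dict["float"]                        # direct lookup by key
has_key = "complex" in obj_dict                  # check before looking up
fallback = obj_dict.get("complex", "not found")  # lookup with a default

# Loop over every key/value pair.
for key, val in obj_dict.items():
    print(key, "->", val)
```

Looking up a missing key with square brackets raises a `KeyError`, which is why `in` and `.get()` are useful.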

### Another dictionary example:

Suppose you had data about three individuals. For each individual, you have a description and some anthropometric data. How do you represent all of this data in one Python object?

A common way is using a __nested dictionary__:

In [None]:
Babies = {"Bart Harley Jarvis":{
                "Description" : "Underbite, flat back of the head",
                "Weight Percentile" : 50,
                "Height Percentile" : 80 },
          "Michael Patrick Porkins":{
                "Description" : "Button nose, apple cheeks",
                "Weight Percentile" : 99,
                "Height Percentile" : 10},
          "Taffy Lee Fubbins" : {
                "Description" : "Tuna can",
                "Weight Percentile" : 90,
                "Height Percentile" : 10}}

The `.keys()` method accesses the first layer:

In [None]:
Babies.keys()

And we can use those keys to access the next dictionary level:

In [None]:
Babies['Bart Harley Jarvis']

In [None]:
Babies['Bart Harley Jarvis']['Height Percentile']

This kind of data hierarchy is common in __JSON files__, which are a common way to store data.
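
To make the JSON connection concrete, here is a minimal sketch using Python's standard `json` module. It converts a one-entry version of the nested dictionary to JSON text and back; the variable names `babies_small`, `as_text`, and `restored` are made up for this example.

```python
import json

# One entry from the nested dictionary, kept small for illustration.
babies_small = {
    "Bart Harley Jarvis": {
        "Description": "Underbite, flat back of the head",
        "Weight Percentile": 50,
        "Height Percentile": 80,
    }
}

as_text = json.dumps(babies_small, indent=2)  # dictionary -> JSON string
restored = json.loads(as_text)                # JSON string -> dictionary

print(as_text)
print(restored["Bart Harley Jarvis"]["Height Percentile"])
```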

### Towards Data

Dictionaries can refer not just to individual objects but to lists.

To get to our linear regression example, let's assume that we have two lists of data $x$ and $y$.

How do we calculate the linear regression parameter $\beta$ in the equation $y = \beta x$?

The OLS estimator is the value of $\beta$ that minimizes the sum of squared errors:
$$ SSE(\beta) = \sum_i (y_i - \beta x_i)^2 $$
Now we will:
1. Write a function that does that.
2. Write a class that combines everything together!

In [None]:
x = [ 0.72238169,  0.81319053,  1.02818518, -0.13406947, -0.32687184,
       -0.8436763 , -0.11656874,  1.26557628, -1.30864275, -1.11902229]
e = [-0.02545513,  0.2013095 ,  0.15369068,  0.77728519,  0.39257324,
       -0.04470027, -1.02603586,  0.21550981,  0.23245853, -0.06602041]

y = [] # Start an empty list.

beta = 2 

for i in range(len(x)):
    y += [x[i]*beta + e[i]]

In [None]:
data = {"y":y,"x":x}

data["y"]

In [None]:
y = []  # start fresh; re-running the loop on the old list would append ten more values
for i in range(len(x)):
    y += [x[i]*beta + e[i]]

Some `for` loop magic here.
- The function "`range`" makes a series of integers from `0` up to, but not including, whatever you put in.
    - `range(len(x))` just made a series of integers from `0` to `len(x) - 1`
    - This visits each index of the list in turn, which is very useful for going through an array.
- Putting `a += b` gives the same result as writing `a = a + b`
    - Since `y` and `[x[i]*beta + e[i]]` are both lists, we are adding lists together
    - So this adds an element to the list for each `i`!
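
A tiny sketch of both points, with made-up names (`indices`, `squares`, `squares_append`). It also shows `.append()`, which is another common way to add one element to a list.

```python
indices = list(range(3))   # range(3) covers 0, 1, 2 (3 itself is excluded)

squares = []
for i in range(3):
    squares += [i ** 2]          # add a one-element list to the list

squares_append = []
for i in range(3):
    squares_append.append(i ** 2)  # same result using .append()

print(indices, squares, squares_append)
```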

#### An aside: the magic of "list comprehension":
Sometimes you will see me write:

`y = [x[i]*beta + e[i] for i in range(len(x))]`

Instead of 

`y = []`

`for i in range(len(x)):`

`    y += [x[i]*beta + e[i]]`

Both produce the same list, but the first one is shorter to type and is called a __list comprehension__.

"List comprehensions" are for loops written in one line whose output is always a list.

Structure in one line:
`[i + 2 for i in x]`

In [None]:
[ i + 2        # Thing to do
 for i in x]  # normal for loop language.

Notice that the output is a list. This makes the `y=[]` step unnecessary.

In [None]:
y =  [x[i]*beta + e[i]          # assign list to variable "y", add values together
      for i in range(len(x))]   # regular for loop

You can also put in `if` conditions:

In [None]:
y_mod = [x[i]*beta + e[i]            # assign list to variable "y_mod", add values together
         for i in range(len(x))      # regular for loop
         if x[i]>1]                  # a lone "if" goes after the for loop: keep only these i

y_mod

`if else` goes before `for` but after operation:

In [None]:
y_mod = [x[i]*beta + e[i]         # assign list to variable "y_mod", add values together
         if x[i]>1                # only does the operation if this is true
         else -99                 # otherwise it returns this
         for i in range(len(x))]  # regular for loop

y_mod
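
As a side note, when you are walking through two lists in step, the built-in `zip` lets you skip the index entirely. This is a sketch on small made-up lists (`x_demo`, `e_demo`, `beta_demo` are not the notebook's data), showing that both styles give the same list:

```python
# Hypothetical small data, just for illustration.
x_demo = [1.0, 2.0, 3.0]
e_demo = [0.5, -0.5, 0.0]
beta_demo = 2

# Index-based comprehension, as used above.
by_index = [x_demo[i] * beta_demo + e_demo[i] for i in range(len(x_demo))]

# zip pairs up the elements: (1.0, 0.5), (2.0, -0.5), (3.0, 0.0)
by_zip = [xi * beta_demo + ei for xi, ei in zip(x_demo, e_demo)]

print(by_index, by_zip)
```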

Now we have data. If we wanted to do a linear regression, we could just try and find which value of $\beta$ minimizes the sum of squared error between these two variables.

First, we need an SSE function:

In [None]:
def sse(beta):
    sse = 0
    
    for i in range(len(x)):
        sse = sse + (data['y'][i] - beta*data['x'][i])**2 
    return sse

This is our first function. Let's go line by line:

In [None]:
def sse(beta): # every function starts with this line: def name_of_function(input):
    sse = 0    # I'm going to calculate SSE, so let me 
               # first start by defining a variable which equals zero
    
    for i in range(len(x)): # for each index in the x vector
        sse += (y[i] - beta*x[i])**2  # take the difference between the y at element i 
                                      # and beta times the x at element i, square it, and add it on
    return sse # if you want an output, you have to tell the function what needs to come out.

Now let's input some values and see which one gives us the smallest SSE:

In [None]:
sse(1),sse(2),sse(3)

Looks like 2 could be the winner...
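
Trying three values by hand only goes so far. Here is a sketch of the same idea over a finer list of candidates. To keep it self-contained it uses made-up data (`x_demo`, `y_demo`) and a helper called `sse_demo`, so the numbers are not the notebook's results.

```python
# Hypothetical demo data (not the notebook's x and y).
x_demo = [1.0, 2.0, 3.0]
y_demo = [2.1, 3.9, 6.2]

def sse_demo(beta):
    total = 0
    for i in range(len(x_demo)):
        total += (y_demo[i] - beta * x_demo[i]) ** 2
    return total

# Candidate values 1.0, 1.1, ..., 3.0
candidates = [1 + 0.1 * k for k in range(21)]

# Keep whichever candidate has the smallest SSE seen so far.
best_beta = candidates[0]
for beta in candidates:
    if sse_demo(beta) < sse_demo(best_beta):
        best_beta = beta

print(best_beta, sse_demo(best_beta))
```

This is the same "try values and keep the best" logic that the class below wraps into a method.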

### Getting Classy

"Classes" are the __ultimate__ Python object in that they can hold all the above information in one object!

In essence, a class bundles a bunch of variables and functions into one object. Let's make a class that contains:
- the data we are estimating.
- the functions we want to use with the data.

In [None]:
class LinearRegression:
    def __init__(self, x,y):
        self.indep_var = x
        self.dep_var = y
    
    def sse(self,beta):
        sse_val = 0
        for i in range(len(self.indep_var)):
            sse_val += (self.dep_var[i] - beta*self.indep_var[i])**2
        return sse_val
    
    def estimate(self,betagrid):
        sse_vals =[]
        for beta in betagrid:
            sse_vals += [self.sse(beta)]
        
        the_min = min(sse_vals)
        
        for i in range(len(sse_vals)):
            if sse_vals[i] == the_min:
                return betagrid[i]

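#### The `__init__`
This "initializes" the class. Here we ask it for two arguments, `x` and `y`, and then assign them to two "attributes", `indep_var` and `dep_var`.

A class needs an `__init__` method whenever you want to pass in data when you create the object; a class without one can still be created, just without arguments.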

In [None]:
class LinearRegression:

    def __init__(self, x,y):
        self.indep_var = x
        self.dep_var = y

In [None]:
lm_obj = LinearRegression(data['x'],data['y'])
lm_obj.indep_var

#### The method
Now we can give it our SSE method. Each method in a class takes `self` as its first argument; `self` stands in for the particular object you created, so the method can reach the attributes set in `__init__`.

In [None]:
class LinearRegression:

    def __init__(self, x,y):
        self.indep_var = x
        self.dep_var = y
    
    def sse(self,beta):
        sse_val = 0
        for i in range(len(self.indep_var)):
            sse_val += (self.dep_var[i] - beta*self.indep_var[i])**2
        return sse_val

In [None]:
lm_obj = LinearRegression(data['x'],data['y'])
lm_obj.sse(2)
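
To see what `self` is doing, here is a sketch with two objects built from the same class but different made-up data (`first` and `second` are hypothetical names). Each call to `.sse()` uses the data stored on its own object.

```python
class LinearRegression:
    def __init__(self, x, y):
        self.indep_var = x
        self.dep_var = y

    def sse(self, beta):
        sse_val = 0
        for i in range(len(self.indep_var)):
            sse_val += (self.dep_var[i] - beta * self.indep_var[i]) ** 2
        return sse_val

# Hypothetical data: y is exactly 2*x in the first object, 3*x in the second.
first = LinearRegression([1, 2, 3], [2, 4, 6])
second = LinearRegression([1, 2, 3], [3, 6, 9])

print(first.sse(2))    # uses first's data
print(second.sse(2))   # uses second's data

# Calling the method on an object passes that object in as "self".
same_thing = LinearRegression.sse(first, 2)
```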

Finally, let's give it an "estimate" method to find the $\beta$ that makes sse as small as possible.

In [None]:
    def estimate(self,betagrid):
        sse_vals =[]
        for beta in betagrid:
            sse_vals += [self.sse(beta)]
        
        the_min = min(sse_vals)
        
        for i in range(len(sse_vals)):
            if sse_vals[i] == the_min:
                return betagrid[i]

In [None]:
class LinearRegression:
    def __init__(self, x,y):
        self.indep_var = x
        self.dep_var = y
    
    def sse(self,beta):
        sse_val = 0
        for i in range(len(self.indep_var)):
            sse_val += (self.dep_var[i] - beta*self.indep_var[i])**2
        return sse_val
    
    def estimate(self,betagrid):
        sse_vals =[]
        for beta in betagrid:
            sse_vals += [self.sse(beta)]
        
        the_min = min(sse_vals)
        
        for i in range(len(sse_vals)):
            if sse_vals[i] == the_min:
                return betagrid[i]

In [None]:
lm_obj = LinearRegression(data['x'],data['y'])
grid = [0,1,2,3,4]
lm_obj.estimate(grid)

Given that grid, it determined that $\beta=2$ gives the lowest SSE value.
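
Keep in mind that a grid search can only return a value that is on the grid. For this one-variable model without an intercept, setting the derivative of the SSE to zero gives $\hat{\beta} = \sum_i x_i y_i / \sum_i x_i^2$, which makes a handy cross-check. The sketch below uses made-up data (`x_demo`, `y_demo`) and a helper `sse_demo`, so the numbers are illustrative rather than the notebook's results.

```python
# Hypothetical demo data (not the notebook's x and y).
x_demo = [1.0, 2.0, 3.0]
y_demo = [2.1, 3.9, 6.2]

def sse_demo(beta):
    return sum((yi - beta * xi) ** 2 for xi, yi in zip(x_demo, y_demo))

# Grid search: pick the grid value with the smallest SSE.
grid = [0, 1, 2, 3, 4]
grid_beta = min(grid, key=sse_demo)

# Closed-form minimizer for y = beta * x with no intercept.
closed_form_beta = (sum(xi * yi for xi, yi in zip(x_demo, y_demo))
                    / sum(xi ** 2 for xi in x_demo))

print(grid_beta, sse_demo(grid_beta))
print(closed_form_beta, sse_demo(closed_form_beta))
```

A finer grid moves the grid answer closer to the closed-form one.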

__WE DID IT!__
