### Objective: Let's make a LinearRegression object!

Needs:
- inputs
- methods
- a class to wrap it all together!


First, let's look at the basic data objects Python provides.

In [1]:
obj_str = "2"
obj_int = 2
obj_float = 2.0
obj_bool = 2>1

Here's what the types look like.

In [4]:
obj_list = [obj_str,obj_int,obj_float,obj_bool]

for obj in obj_list:  # "obj" avoids shadowing Python's built-in "object"
    print(type(obj))

<class 'str'>
<class 'int'>
<class 'float'>
<class 'bool'>


### Numeric objects:
Integers and floats obey the usual math operations:

In [5]:
obj_int*3,\
obj_int +2,\
obj_float**3

(6, 4, 8.0)

Notice the difference above between the float result (8.0) and the integer results (6 and 4). Python treats an integer and a float that hold the same value as equal:

In [6]:
obj_int==obj_float

True

But if you add them, the result is a float:

In [7]:
obj_int + obj_float

4.0

__Notice:__ In many languages (statically typed ones such as C or Java), you must __declare__ variable types. __Python inferred them automatically__ on the fly.

This makes it easy to define objects, but also can be a pain if Python infers the wrong thing.

As long as you remember how Python does it, it can be quite useful. If you forget, it can lead to big problems. If Python raises an error, at least you find out about the mistake, but the worst mistakes are __ones you don't know you are making__.

__Make sure you know what each thing is supposed to be!__
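A minimal sketch of one such silent mistake, using hypothetical names (`quantity_text`, `quantity`) that are not part of this notebook: a number that arrives as text can be "multiplied" without any error, but the result is not arithmetic.

```python
# Hypothetical example: a number that arrived as text (e.g. read from a file).
quantity_text = "2"

# No error is raised here, but this repeats the string instead of multiplying.
silent_mistake = quantity_text * 3

# Checking the type first, then converting explicitly, avoids the surprise.
if isinstance(quantity_text, str):
    quantity = int(quantity_text)
else:
    quantity = quantity_text

print(silent_mistake)  # 222 (a string, not a number)
print(quantity * 3)    # 6
```
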

### Strings

Strings, unlike numeric objects, have "length."

In [8]:
len(obj_str)

1

In [9]:
len(obj_int)

TypeError: object of type 'int' has no len()

Above I tried to call a function on an "int" object, but this object does not have length.

Here I can call "len()" on a string and it gives me the number of characters.

In [10]:
example_string = "Do Re Mi Fa So La Ti Do"
len(example_string)

23

As in other languages, you can "index" strings.

In [26]:
example_string[3], \
example_string[:3], \
example_string[-2::-1],\
example_string[:-3],

('R', 'Do ', 'D iT aL oS aF iM eR oD', 'Do Re Mi Fa So La Ti')

__Important:__ Python indexes from __zero__ not one, so the first element, "D", is "example_string[0]"

__Question: What do you think the third one did?__

Note that in the last one I used "negative indexing"; I told it to give me everything except the last three characters. This is handy for when you do not know how long the string is.
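To make the slicing rules easier to check by eye, here is a small sketch on a hypothetical string (`digits`, not used elsewhere in this notebook) in which every character is its own index.

```python
# Hypothetical string chosen so that each character shows its own index.
digits = "0123456789"

first = digits[0]              # indexing starts at zero
first_three = digits[:3]       # characters at indexes 0, 1 and 2
drop_last_three = digits[:-3]  # everything except the final three characters
last_three = digits[-3:]       # only the final three characters

print(first, first_three, drop_last_three, last_three)
# 0 012 0123456 789
```
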

Weirdly, you can apply some "math" operations to strings in Python to do different things:
- "+" will "concatenate" two strings.
- "*" will repeat a string a given number of times.

In [29]:
obj_str+"3 dog" , obj_str*10 

('23 dog', '2222222222')

### Booleans
These are essentially binary objects: True or False.

They are handy when you only want something to happen if a condition is met.

For example, here's a "while loop" that trims a string until it's 3 characters long:

In [34]:
while len(example_string)>3:
    example_string = example_string[:-1]
    print(example_string)
    
example_string

Do Re Mi Fa So La Ti D
Do Re Mi Fa So La Ti 
Do Re Mi Fa So La Ti
Do Re Mi Fa So La T
Do Re Mi Fa So La 
Do Re Mi Fa So La
Do Re Mi Fa So L
Do Re Mi Fa So 
Do Re Mi Fa So
Do Re Mi Fa S
Do Re Mi Fa 
Do Re Mi Fa
Do Re Mi F
Do Re Mi 
Do Re Mi
Do Re M
Do Re 
Do Re
Do R
Do 


'Do '

What did this just do?
- I just re-assigned the object "example_string" each time with a version of it without the last character.

But the "condition" is actually an object itself:

In [35]:
is_it_more_than_three = len(example_string)>3
is_it_more_than_three

False

Weird little handy thing: if you combine Booleans with numeric objects, they behave like binary numbers, with True acting as one and False acting as zero.

In [36]:
is_it_more_than_three + 5,\
is_it_more_than_three*2

(5, 0)

In [37]:
1/is_it_more_than_three

ZeroDivisionError: division by zero
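This zero-or-one behavior is useful for counting. The sketch below uses hypothetical measurements (`values`) to count how many entries satisfy a condition; the names are illustrative only.

```python
# Hypothetical measurements; not data from this notebook.
values = [0.5, -1.2, 3.0, -0.1, 2.2]

# Each comparison yields True or False; sum() treats them as 1 and 0.
is_positive = [v > 0 for v in values]
n_positive = sum(is_positive)
share_positive = n_positive / len(values)

print(is_positive)     # [True, False, True, False, True]
print(n_positive)      # 3
print(share_positive)  # 0.6
```
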

### Lists and Dictionaries
- A list is an __ordered__ sequence: you look items up by position.
- A dictionary is a __mapping__: you look items up by key rather than by position.

In [38]:
obj_list

['2', 2, 2.0, True]

In [39]:
obj_list[2]

2.0

The third object (index 2) is __always__ the float.

A dictionary is not accessed by position (although since Python 3.7 it does remember insertion order); instead, the mapping keeps the relationship between each key and its value consistent.

In [44]:
obj_dict = {"string":obj_str,"integer":obj_int,"float":obj_float,"boolean":obj_bool,2:"integer","bool":True}

In [45]:
if obj_dict["bool"]:
    print("yay!")

yay!


So passing the key "bool" retrieves whatever the dictionary relates to "bool" (here, True). In the same way, `obj_dict["float"]` retrieves whatever the dictionary relates to "float."

This is how you "look things up" in the dictionary.
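Looking up a key that does not exist raises a `KeyError`. A minimal sketch of two common ways around that, using a small hypothetical dictionary (`prices`) rather than `obj_dict`:

```python
# A small hypothetical dictionary, separate from obj_dict above.
prices = {"apple": 0.5, "bread": 2.25}

# Square brackets look up a key and raise KeyError if it is missing.
apple_price = prices["apple"]

# The "in" operator checks whether a key exists before you look it up.
has_milk = "milk" in prices

# .get() returns a default (None unless you supply one) instead of raising.
milk_price = prices.get("milk", 0.0)

print(apple_price, has_milk, milk_price)  # 0.5 False 0.0
```
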

### Another dictionary example:

Suppose you had data about three individuals. For each individual, you have a description and some anthropometric data. How do you represent all of this data in one Python object?

A common way is using a __nested dictionary__:

In [46]:
Babies = {"Bart Harley Jarvis":{
                "Description" : "Underbite, flat back of the head",
                "Weight Percentile" : 50,
                "Height Percentile" : 80 },
          "Michael Patrick Porkins":{
                "Description" : "Button nose, apple cheeks",
                "Weight Percentile" : 99,
                "Height Percentile" : 10},
          "Taffy Lee Fubbins" : {
                "Description" : "Tuna can",
                "Weight Percentile" : 90,
                "Height Percentile" : 10}}

The `.keys()` method accesses the first layer:

In [51]:
Babies.keys()

dict_keys(['Bart Harley Jarvis', 'Michael Patrick Porkins', 'Taffy Lee Fubbins'])

And we can use those keys to access the next dictionary level:

In [52]:
Babies['Bart Harley Jarvis']

{'Description': 'Underbite, flat back of the head',
 'Weight Percentile': 50,
 'Height Percentile': 80}

In [53]:
Babies['Bart Harley Jarvis']['Height Percentile']

80

This kind of data hierarchy is common in __JSON files__, which are a common way to store data.
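As a rough illustration of that connection, the standard-library `json` module can convert a nested dictionary to JSON text and back. The record below is a hypothetical placeholder shaped like the Babies dictionary, not real data.

```python
import json

# Hypothetical nested record, shaped like the Babies dictionary above.
record = {"Example Baby": {"Description": "Placeholder", "Height Percentile": 42}}

# dumps() turns the dictionary into JSON text; loads() parses it back.
as_text = json.dumps(record, indent=2)
round_trip = json.loads(as_text)

print(as_text)
print(round_trip["Example Baby"]["Height Percentile"])  # 42
```
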

### Towards Data

Dictionaries can refer not just to individual objects but to lists.

To get to our linear regression example, let's assume that we have two lists of data $x$ and $y$.

How do we calculate the linear regression parameter $\beta$ in the equation $y = \beta x$?

The ordinary least squares (OLS) estimator is the value of $\beta$ that minimizes the sum of squared errors (SSE):
$$ SSE(\beta) = \sum_i (y_i - \beta x_i)^2 $$
Now we will:
1. Write a function that does that.
2. Write a class that combines everything together!

In [54]:
x = [ 0.72238169,  0.81319053,  1.02818518, -0.13406947, -0.32687184,
       -0.8436763 , -0.11656874,  1.26557628, -1.30864275, -1.11902229]
e = [-0.02545513,  0.2013095 ,  0.15369068,  0.77728519,  0.39257324,
       -0.04470027, -1.02603586,  0.21550981,  0.23245853, -0.06602041]

y = [] # Start an empty list.

beta = 2 

for i in range(len(x)): # "range(n)" yields the integers 0, 1, ..., n-1 (n itself is excluded).
                        # This is going to go over each index of the lists.
    y += [x[i]*beta + e[i]] # += adds whatever is on the right side to the left side.
                            # Because they are list objects, it concatenates them.

In [55]:
y

[1.4193082499999998,
 1.82769056,
 2.21006104,
 0.5091462499999999,
 -0.26117044000000006,
 -1.7320528700000002,
 -1.2591733399999998,
 2.74666237,
 -2.3848269699999998,
 -2.30406499]
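The same kind of list can also be built without tracking indexes. The sketch below is one alternative, shown on toy inputs (`x_toy`, `e_toy`, `beta_toy` are illustrative names and values, not the data above): `zip` pairs up the elements and a list comprehension collects the results.

```python
# Toy inputs for illustration (not the data used in this notebook).
x_toy = [1.0, 2.0, 3.0]
e_toy = [0.5, -0.5, 0.0]
beta_toy = 2

# zip() pairs up the elements; the comprehension builds the list in one pass.
y_toy = [beta_toy * x_i + e_i for x_i, e_i in zip(x_toy, e_toy)]

print(y_toy)  # [2.5, 3.5, 6.0]
```
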

In [56]:
data = {"y":y,"x":x}

Now we have data. If we wanted to do a linear regression, we could just try and find which value of $\beta$ minimizes the sum of squared error between these two variables.

First, we need an SSE function:

In [57]:
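def sse(beta):
    # Sum of squared errors for a candidate beta, read from the global "data" dictionary.
    total = 0  # a separate name, so the function "sse" is not shadowed inside itself
    for i in range(len(data['x'])):
        total += (data['y'][i] - beta*data['x'][i])**2
    return total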

In [58]:
sse(1),sse(2),sse(3)

(10.180663510029161, 1.9826681599664406, 9.098506318986834)

Looks like 2 could be the winner...

### Getting Classy

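"Classes" are the __ultimate__ Python object in that they can hold all the above information, both the data and the functions that act on it, in one object! Here is the finished class; the sections below build it up piece by piece.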

In [67]:
class LinearRegression:
    def __init__(self, x,y):
        self.indep_var = x
        self.dep_var = y
    
    def sse(self,beta):
        sse_val = 0
        for i in range(len(self.indep_var)):
            sse_val += (self.dep_var[i] - beta*self.indep_var[i])**2
        return sse_val
    
    def estimate(self,betagrid):
        sse_vals =[]
        for beta in betagrid:
            sse_vals += [self.sse(beta)]
        
        the_min = min(sse_vals)
        
        for i in range(len(sse_vals)):
            if sse_vals[i] == the_min:
                return betagrid[i]

#### The `__init__` method
This "initializes" each new object created from the class. Here we ask it for two arguments, "x" and "y", and then assign them to two "attributes," indep_var and dep_var.


In [59]:
class LinearRegression:

    def __init__(self, x,y):
        self.indep_var = x
        self.dep_var = y

In [60]:
lm_obj = LinearRegression(data['x'],data['y'])
lm_obj.indep_var[:2]

[0.72238169, 0.81319053]

#### The method
Now we can give it our SSE method. Every method in a class takes "self" as its first argument. It stands in for the particular object you created, so the method can reach the attributes that were set in `__init__`.

In [62]:
class LinearRegression:

    def __init__(self, x,y):
        self.indep_var = x
        self.dep_var = y
    
    def sse(self,beta):
        sse_val = 0
        for i in range(len(self.indep_var)):
            sse_val += (self.dep_var[i] - beta*self.indep_var[i])**2
        return sse_val

In [63]:
lm_obj = LinearRegression(data['x'],data['y'])
lm_obj.sse(1.4)

5.063805348914101
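To see what "self" refers to, here is a small sketch that creates two objects from the same class using toy data (illustrative values only). Each object keeps its own attributes, so the same method call gives a different answer for each.

```python
class LinearRegression:
    def __init__(self, x, y):
        self.indep_var = x
        self.dep_var = y

    def sse(self, beta):
        sse_val = 0
        for i in range(len(self.indep_var)):
            sse_val += (self.dep_var[i] - beta * self.indep_var[i]) ** 2
        return sse_val

# Two objects built from toy data (illustrative values only).
fit_a = LinearRegression([1, 2, 3], [2, 4, 6])  # y is exactly 2 * x
fit_b = LinearRegression([1, 2, 3], [3, 6, 9])  # y is exactly 3 * x

# Inside sse(), "self" is whichever object the method was called on.
print(fit_a.sse(2), fit_b.sse(2))  # 0 14
```
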

Finally, let's give it an "estimate" method to find the $\beta$ that makes sse as small as possible.

In [None]:
class LinearRegression:
    def __init__(self, x,y):
        self.indep_var = x
        self.dep_var = y
    
    def sse(self,beta):
        sse_val = 0
        for i in range(len(self.indep_var)):
            sse_val += (self.dep_var[i] - beta*self.indep_var[i])**2
        return sse_val
    
    def estimate(self,betagrid):
        sse_vals =[]
        for beta in betagrid:
            sse_vals += [self.sse(beta)]
        
        the_min = min(sse_vals)
        
        for i in range(len(sse_vals)):
            if sse_vals[i] == the_min:
                return betagrid[i]
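The grid search can also be written more compactly. This is only a sketch of an alternative, written as standalone functions on toy data rather than as the class method above: `min` with a `key` function returns the grid value whose SSE is smallest (the first such value if several tie).

```python
def sse(beta, x, y):
    # Sum of squared errors for one candidate beta.
    return sum((y_i - beta * x_i) ** 2 for x_i, y_i in zip(x, y))

def estimate(betagrid, x, y):
    # min() with key= compares the grid values by their SSE.
    return min(betagrid, key=lambda beta: sse(beta, x, y))

# Toy data for illustration: y is exactly 2 * x.
x_toy = [1, 2, 3]
y_toy = [2, 4, 6]

print(estimate([0, 1, 2, 3, 4], x_toy, y_toy))  # 2
```
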

In [68]:
lm_obj = LinearRegression(data['x'],data['y'])
lm_obj.estimate([0,1,2,3,4])

2

Given that grid, it determined that $\beta=2$ gives the lowest SSE value.
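A grid search can only return values that are on the grid. For this one-parameter model there is also an exact answer: setting the derivative $-2\sum_i x_i (y_i - \beta x_i)$ to zero gives $\hat\beta = \sum_i x_i y_i / \sum_i x_i^2$. The sketch below applies that formula to toy data; `closed_form_beta` is a hypothetical helper name, not part of the class above.

```python
def closed_form_beta(x, y):
    """Slope that minimizes sum((y_i - beta * x_i)**2), with no intercept."""
    numerator = sum(x_i * y_i for x_i, y_i in zip(x, y))
    denominator = sum(x_i ** 2 for x_i in x)
    return numerator / denominator

# Toy data for illustration (not the notebook's data).
x_toy = [1.0, 2.0, 3.0]
y_toy = [2.5, 3.5, 6.0]

beta_hat = closed_form_beta(x_toy, y_toy)
print(beta_hat)  # close to 2, but not restricted to grid values
```
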

__WE DID IT!__
