# Week 1: Intro to Jupyter Notebooks and ML Models
**For Codeology's Music Generation Project Fall 2019 by [Jennifer Xiao](mailto:jenniferxiao@berkeley.edu) and [Alma Pineda](mailto:almapineda@berkeley.edu)**

**With help from Codeology's ML workshop Jupyter notebook authored by [Calvin Chen](mailto:chencalvin99@berkeley.edu), [Micah Harrison](mailto:mharrison08@berkeley.edu), and [Sai Kapuluru](mailto:saikapuluru@berkeley.edu).**

### Table Of Contents
* [Jupyter Notebook guide](#intro)
    * [Python Review](#python)
* [Machine Learning and modeling](#machine)  
    * [Linear Regression](#linear)
    * [Ordinary Least Squares](#ols)
    * [Example Model](#making_model)


<a id='intro'></a>
## Jupyter Notebook How To:


Below is a code cell. Code cells allow you to enter and run code. Run a code cell by pressing Shift-Enter or by clicking the run button in the toolbar above:

In [11]:

message = "Hello Python world!"
print(message)

Hello Python world!


### Shortcuts 

Alt-Enter runs the current cell and inserts a new one below.

Ctrl-Enter runs the current cell and enters command mode.

Shift-Enter runs the current cell and moves you to the next one.


### Restarting the kernel
The kernel maintains the state of a notebook's computations and variables. You can reset this state (e.g. if the kernel gets stuck on a computation or the connection drops) by restarting the kernel. Do this by going up to the toolbar: Kernel -> Restart.... Once you restart your kernel, you will need to rerun all the cells you had previously run.

You can also rerun lots of cells at once by clicking on a cell of code and going to the toolbar: Cell -> Run All Above.

<a id='python'></a>
## Helpful Basic Python How To:

### Variables
We can use variables to hold values. These variables can hold different types of values. You can use type() to see a variable's type.

In [3]:
x = 1.0
type(x)

float

In [12]:
x = "Codeology is da bomb"
type(x)

str

In [13]:
x

'Codeology is da bomb'

In [9]:
w = 3 
y = 8
z = w + y 
print(z)

11


### Dictionaries
Think of a dictionary as a collection of key: value pairs, with the requirement that the keys are unique (within one dictionary). You can use a key to retrieve the value associated with it.

In [14]:
params = {"parameter1" : 1.0,
          "parameter2" : 2.0,
          "parameter3" : 3.0,}

print(type(params))
print(params)
print(params["parameter1"])
print(params["parameter2"])
print(params["parameter3"])

<class 'dict'>
{'parameter1': 1.0, 'parameter2': 2.0, 'parameter3': 3.0}
1.0
2.0
3.0
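
As a small sketch of a few other common dictionary operations (the dictionary name, keys, and values here are made up for illustration), you can add or update entries by assignment, check for a key with `in`, and use `get()` to supply a default when a key is missing:

```python
# Hypothetical example dictionary; the keys and values are illustrative only.
settings = {"tempo": 120, "key": "C"}

# Add a new key, or overwrite an existing one, by assignment.
settings["volume"] = 0.8
settings["tempo"] = 90

# `in` checks whether a key exists (it does not search the values).
has_key = "key" in settings
has_pitch = "pitch" in settings

# get() returns a default instead of raising KeyError for a missing key.
pitch = settings.get("pitch", "unknown")

print(settings)
print(has_key, has_pitch, pitch)
```

Because keys are unique, assigning to an existing key replaces its value rather than creating a second entry.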


### Lists
Lists are the most commonly used data structure. Think of a list as a sequence of data enclosed in square brackets, with items separated by commas. Each item can be accessed by its index value.

Lists are declared by assigning `[]` or `list()` to a variable:

In [15]:
a = []
print(type(a))

<class 'list'>


In [16]:
x = ['machine learning', 'data science', 'berkeley']

### Indexing
In Python, indexing starts from 0. So the list x, which has three elements, has 'machine learning' at index 0, 'data science' at index 1, and 'berkeley' at index 2.

In [19]:
x[0]

'machine learning'

In [20]:
x[0] = "ai buzzword!"
x[0]

'ai buzzword!'

In [21]:
# negative indices count backwards from the end of the list
x[-1]

'berkeley'

In [22]:
# a list inside of a list
y = [x]
y

[['ai buzzword!', 'data science', 'berkeley']]

### Slicing 

Get part of a list by giving the index of the first element you want and the index just past the last one. The end index is not included, so `num[0:4]` below returns the elements at indices 0 through 3.

In [24]:

num = [0,1,2,3,4,5,6,7,8,9]
print(num[0:4])
print(num[4:])

[0, 1, 2, 3]
[4, 5, 6, 7, 8, 9]
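
Here is a short sketch of a few more slicing patterns on the same kind of list (the variable names are illustrative). Omitted bounds default to the ends of the list, negative indices count from the end, and an optional third value sets the step:

```python
num = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

first_three = num[:3]      # start defaults to 0; index 3 is excluded
last_three = num[-3:]      # a negative start counts from the end
every_other = num[::2]     # step of 2
reversed_num = num[::-1]   # a negative step walks the list backwards

print(first_three)
print(last_three)
print(every_other)
print(reversed_num)
```

Slicing always builds a new list, so `num` itself is unchanged after these lines.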


### Built-in List Functions

In [25]:
#Find the length
print(len(num))

#Find the min value
print(min(num))

#Find the max value
print(max(num))

#Concatenate two lists 
x = [1,2,3] + [5,4,7]
print(x)

#Check if an element is in a list
print(4 in x)

10
0
9
[1, 2, 3, 5, 4, 7]
True


### For Loops

In [26]:
for x in range(4):
    print(x)

0
1
2
3


In [28]:

#iterate through the key value pairs from the dictionary we defined earlier
for key, value in params.items():
    print(key + " = " + str(value))

parameter1 = 1.0
parameter2 = 2.0
parameter3 = 3.0


### Functions 

In [30]:
def square(x):
    return x*x 

# You can return multiple  values
def powers(x):
    return x ** 2, x ** 3, x ** 4

#You don't have to have a return value
def split_up_string(x):
    for s in x:
        print(s)
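
To see these functions in action, here is a short sketch that redefines them (so the cell runs on its own) and then calls each one. The variable names `result`, `sq`, `cube`, `fourth`, and `nothing` are just illustrative:

```python
def square(x):
    return x * x

def powers(x):
    return x ** 2, x ** 3, x ** 4

def split_up_string(x):
    for s in x:
        print(s)

result = square(4)

# Multiple return values come back as a tuple, which can be unpacked.
sq, cube, fourth = powers(2)

# A function with no return statement returns None.
nothing = split_up_string("abc")  # prints each character on its own line

print(result, sq, cube, fourth, nothing)
```

Unpacking only works when the number of names on the left matches the number of values returned.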

<a id='machine'></a>
## Intro to ML

### Models

**Basic Ideas**

Models describe classes of things and relationships between things. For example, you can have a model to describe spam vs not-spam emails, cats vs dogs, female vs male. Or you could have a model describe the relationship between mass, acceleration and force in Newton's Second Law. 

Once you have a model, you can use it to make predictions. Maybe you want to create an email spam-classifier. Or maybe you want to predict someone's weight based on their gender and height.

BUT, there is one thing you should always remember. Just because you can make a model and use it to accurately predict the value of one variable based on the value of the other, it doesn't mean that one causes the other. You've probably heard this before: **correlation != causation**. A classic example is the relationship between ice cream sales and murder rates. Turns out, when ice cream sales rise, so do murder rates. Does this mean ice cream *causes* people to commit murder? Or get murdered? Nope!

***One of the most basic models is one that describes a linear relationship between two things. So below we are going to use the linear model to understand the fundamental ideas behind modeling in ML.***


<a id='linear'></a>
### Linear Regression

This is the fancy term used to describe the method behind making a linear model. 

**Simple linear regression** is a special case of linear regression in which you only have one explanatory variable. As the name suggests, it models the relationship as a *line*. You may be familiar with the slope-intercept form of a line, and that's exactly how the linear model looks!

$$y = mx+b$$

$y$ is the **response** or **dependent** variable we're trying to predict.
$x$ is an **explanatory** or **independent** variable used to predict $y$. 

If we were trying to create a model that uses height to predict weight, $y$ would represent weight, while $x$ represents height. Using known $x$'s, we want to accurately predict $y$ by using the right $m$ and $b$.

**But, given a bunch of x's and y's, how do we know what the right m and b are?**

In high school labs, you probably plugged these values into Excel and created a **Best Fit Line**, and now we can unpack the magic that happens behind the scenes from a machine learning perspective.


The line of best fit is the line that "fits" the data the best. But creating the "best" or most accurate model involves defining a *loss function*. The **loss function** is a function that measures how far off our model's estimated values are from the true values. We want our model to be as accurate as possible, which means we want to minimize the error our model makes in predicting values. In other words, we want to minimize the loss between the points we have and the line we are creating for the model.

<img src='https://www.cs.toronto.edu/~frossard/post/linear_regression/lreg.jpg' width=400>







<a id='ols'></a>
### Ordinary Least Squares

In linear regression, we use **ordinary least squares (OLS)**, which minimizes the sum of squared residuals. A **residual** is the difference between the predicted value and the observed value for a given $x$. For a given observation $(x_i, y_i)$, the residual $e_i$ is calculated as:

$$ \underbrace{e_{i}}_{error} = \underbrace{y_i}_{actual} - \underbrace{\hat{y_i}}_{predicted} = y_i - mx_i - b$$

**Question**: Can you think of why we would want to *square* the residuals and sum them instead of just minimizing their sum?

Since we want to minimize the **residual sum of squares (RSS)**, what we're actually going to minimize is this:

$$\textit{RSS} = \sum_{i=1}^n {e_i}^2 = \sum_{i=1}^n (y_i - mx_i - b)^2$$
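
The toy example below (made-up numbers, plain Python) is one way to explore the question above. It computes the residuals and the RSS for two candidate lines; the helper names `residuals` and `rss` are ours, not part of any library. Notice how the plain sum of residuals can be zero even for a line that fits badly, because positive and negative errors cancel:

```python
# Made-up data that lies exactly on the line y = 2x.
xs = [1, 2, 3, 4]
ys = [2, 4, 6, 8]

def residuals(m, b):
    """Residual e_i = y_i - (m * x_i + b) for every data point."""
    return [y - (m * x + b) for x, y in zip(xs, ys)]

def rss(m, b):
    """Residual sum of squares for the line y = m * x + b."""
    return sum(e ** 2 for e in residuals(m, b))

# A flat line through the mean of y: clearly a poor fit.
flat = residuals(0, 5)
print(flat, sum(flat), rss(0, 5))

# The line the data was generated from: every residual is zero.
true_line = residuals(2, 0)
print(true_line, sum(true_line), rss(2, 0))
```

Squaring makes every error count as a positive amount, so the RSS can only reach zero when the line passes through every point.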

By minimizing this function, we can solve for the slope $m$ and the intercept $b$. The actual calculations for deriving the formulas that define these coefficients require a bit of calculus, so we'll skip that part for now, but if you want to look into it more on your own you can check out [this link](http://seismo.berkeley.edu/~kirchner/eps_120/Toolkits/Toolkit_10.pdf)! For now, we'll just tell you that $m$ and $b$ can be solved as:

$$\begin{aligned}
\hat{b}&=\bar {y}-\hat{m}\,{\bar{x}},\\
\hat{m}&=\frac{\sum _{i=1}^{n}(x_{i}-\bar{x})(y_{i}-\bar {y})}{\sum _{i=1}^{n}(x_{i}-\bar{x})^2}\\
\end{aligned}$$

This is pretty complicated! Luckily, you don't need to know any of this to make a linear model, but this is here for reference if you're interested in the math behind what we'll be getting into today. 
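
If you'd like to see the formulas above as code, here is a minimal sketch using NumPy on made-up data (the numbers and variable names are illustrative assumptions, not part of the mpg dataset). It computes $\hat{m}$ and $\hat{b}$ directly and compares them with `np.polyfit`, which also performs a least-squares line fit when the degree is 1:

```python
import numpy as np

# Made-up data for illustration, roughly following y = 2x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.0])

x_bar, y_bar = x.mean(), y.mean()

# Slope: sum of (x_i - x_bar)(y_i - y_bar) over sum of (x_i - x_bar)^2.
m_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
# Intercept: y_bar - m_hat * x_bar.
b_hat = y_bar - m_hat * x_bar

# np.polyfit returns coefficients from highest degree to lowest,
# so for degree 1 that is (slope, intercept).
m_np, b_np = np.polyfit(x, y, 1)

print(m_hat, b_hat)
print(m_np, b_np)
```

The two pairs of numbers should agree up to floating-point error, since both approaches minimize the same RSS.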

<a id='making_model'></a>
### Example Model

Run the cell below to load the dataset we chose to work with!

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
import random
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from plotting import overfittingDemo
from scipy.optimize import curve_fit


mpg = pd.read_csv("./data/mpg.csv", index_col="name") # load mpg dataset
mpg = mpg.loc[mpg["horsepower"] != '?'].astype(int) # remove rows with missing horsepower values
mpg_train, mpg_test = train_test_split(mpg, test_size = .2, random_state = 0) # split into training set and test set
mpg_train, mpg_validation = train_test_split(mpg_train, test_size = .5, random_state = 0) # split the training set in half into training and validation sets
mpg_train.head()

Here we've chosen the mpg dataset, which tells us various attributes of different cars, including a car's make and model, miles per gallon, number of cylinders, weight, and more! We're going to be trying to see which features affect a car's mpg, and our goal is to create a model that accurately predicts mpg given other attributes of the car.

You'll notice that we separated the mpg data into separate dataframes: mpg_train and mpg_test, plus an mpg_validation set split off from the training data. We'll get into why in a later part of today's lecture, but for now, make sure to do all of your analysis and model creation on the mpg_train dataset!
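
To get a feel for what the splitting calls in the loading cell do, here is a small sketch on a made-up 100-row dataframe (the dataframe `df` and its columns are hypothetical stand-ins, not the mpg data). It mirrors the same two `train_test_split` calls and checks the resulting sizes:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real dataset: 100 rows of made-up numbers.
df = pd.DataFrame({"x": np.arange(100), "y": np.arange(100) * 2})

# Hold out 20% of the rows as a test set.
train, test = train_test_split(df, test_size=0.2, random_state=0)
# Split the remaining rows in half into training and validation sets.
train, validation = train_test_split(train, test_size=0.5, random_state=0)

print(len(train), len(validation), len(test))

# Each original row lands in exactly one of the three splits.
all_indices = sorted(list(train.index) + list(validation.index) + list(test.index))
print(all_indices == list(range(100)))
```

Fixing `random_state` makes the shuffle repeatable, so everyone running the notebook gets the same split.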

Try making some scatter plots of different variables as your x and mpg as your y using the mpg_train dataset below!

Hint: Hitting Shift-Tab with the cursor on the name of a function will bring up helpful documentation about how to use the function.

In [2]:
mpg_train.plot.scatter("weight", "mpg") # weight on the x-axis, mpg on the y-axis


sklearn's linear_model module makes it really easy to make linear models! There are many different types of linear models implemented in the linear_model module, which you can take a look at [here](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model) if you're interested.

Today we'll be using LinearRegression, which we've imported for you in the cell below. Try reading the documentation to figure out what the fit() function expects as input to correctly fit our model to the mpg_train data!

Hint: if you want to select a subset of columns from a dataframe, pass in a list of column names, like df[['col1', 'col2']]
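
The sketch below, on a made-up dataframe with illustrative column names, shows why the double brackets in the hint matter. Selecting with a single name gives a 1-dimensional Series, while selecting with a list of names gives a 2-dimensional DataFrame, which is the samples-by-features shape scikit-learn models generally expect for their inputs:

```python
import pandas as pd

# Made-up frame; the column names are illustrative.
df = pd.DataFrame({"col1": [1, 2, 3], "col2": [4, 5, 6], "col3": [7, 8, 9]})

single = df["col1"]              # one name -> Series (1-dimensional)
one_col = df[["col1"]]           # list with one name -> DataFrame (2-dimensional)
two_cols = df[["col1", "col2"]]  # list with two names -> DataFrame with two columns

print(type(single).__name__, single.shape)
print(type(one_col).__name__, one_col.shape)
print(type(two_cols).__name__, two_cols.shape)
```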

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
# Initialize our linear regression model
linear_model = LinearRegression()

X = mpg_train[['displacement']]
Y = mpg_train[['mpg']]

# TODO: Fit the model to the data


# TODO: extract the coefficient and the intercept
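coef = ...  # replace ... with the fitted model's coefficient
intercept = ...  # replace ... with the fitted model's intercept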

You might notice that, while `intercept_` holds a single value, `coef_` returns an array. This is because you can choose to fit your model to multiple explanatory variables (hence the list of column names used to select `X`). When you define multiple explanatory variables, `coef_` will contain a separate coefficient for each explanatory variable you chose! You'll be able to explore that in a bit, but for now let's take a look at what our linear model looks like relative to our original data.
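
As a rough illustration of the point about multiple explanatory variables, here is a sketch on made-up data (not the mpg dataset) generated exactly from $y = 2x_1 + 3x_2 + 1$. The array names and numbers are assumptions chosen so the expected coefficients are easy to recognize:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data generated exactly from y = 2*x1 + 3*x2 + 1.
X = np.array([[1, 2], [2, 1], [3, 5], [4, 3], [5, 8]], dtype=float)
y = 2 * X[:, 0] + 3 * X[:, 1] + 1

model = LinearRegression()
model.fit(X, y)

print(model.coef_)       # one coefficient per column of X
print(model.intercept_)  # a single number when y is 1-dimensional
```

Here `y` is a 1-dimensional array. If you instead pass `y` as a 2-dimensional column, as with `mpg_train[['mpg']]`, scikit-learn returns `coef_` as a 2-dimensional array and `intercept_` as a one-element array.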

We've provided the skeleton for a helper function called `overlay_simple_linear_model`. Try to fill out the function so that it plots a scatterplot with the linear model overlaid on top.

*Hint:* If you press `tab` after a `[object].` or `[package].`, Jupyter will show you a list of valid functions defined for that object type or package.

In [4]:
def overlay_simple_linear_model(data, x_name, y_name, linear_model):
    """
    This function plots a simple linear model on top of the scatterplot of the data it was fit to.
    
    data(DataFrame): e.g. mpg_train
    x_name(string): the name of the column representing the predictor variable
    y_name(string): the name of the column representing the dependent/response variable
    linear_model(LinearRegression): a model that has already been fit to the x_name and y_name columns
    
    returns None but outputs linear model overlaid on scatterplot
    """
    
    x = np.arange(max(data[x_name])).reshape((-1, 1)) # an array of integers between 0 and the maximum value of the x_name column
    y = linear_model.____ # replace ___ with correct function 
    
    
    data.plot.scatter(...) # scatter plot of x_name vs. y_name
    
    plt.plot(x, y, color='red')
    plt.title("Linear Model vs. Data: " + x_name + " vs. " + y_name)
    plt.show()

In [5]:
# If you wrote the function above correctly, this should produce a scatterplot with a line through it
overlay_simple_linear_model(mpg_train, ..., "mpg", linear_model)


## Congratulations! You've learned the very basics of linear regression. Next week we will continue to build on these ideas!