### Intro

Machine learning can be divided into the following categories:

1. Supervised Learning
2. Unsupervised Learning

Supervised Learning is where the data is labeled and the program learns to predict the output from the input data. For instance, a supervised learning algorithm for credit card fraud detection would take as input a set of recorded transactions, each labeled as fraudulent or not. For each new transaction, the program would then predict whether it is fraudulent.

### Supervised learning 
Supervised learning problems can be further grouped into regression and classification problems.

#### Regression:
In regression problems, we are trying to predict a continuous-valued output. Examples are:
- What is the housing price in New York?
- What is the value of cryptocurrencies?

#### Classification:
In classification problems, we are trying to predict one of a discrete set of values. Examples are:
- Is this a picture of a human or a picture of an AI?
- Is this email spam?

### Unsupervised Learning 
Unsupervised learning is a type of machine learning where the program learns the inherent structure of the data based on unlabeled examples.

Clustering is a common unsupervised machine learning approach that finds patterns and structures in unlabeled data by grouping them into clusters.

Some examples:
- Social networks clustering topics in their news feed
- Consumer sites clustering users for recommendations
- Search engines grouping similar objects into one cluster



### Scikit-learn 
Scikit-learn is a Python library that provides many unsupervised and supervised learning algorithms. It’s built on NumPy and SciPy, and it works well alongside tools you might already be familiar with, like pandas and Matplotlib!

The functionality that scikit-learn provides includes:
- Regression, including Linear Regression
- Classification, including K-Nearest Neighbors and Logistic Regression (a classifier, despite its name)
- Clustering, including K-Means and K-Means++
- Model selection
- Preprocessing, including Min-Max Normalization

### Scikit-Learn Cheatsheet
#### Linear Regression
Import and create the model:
```
from sklearn.linear_model import LinearRegression
your_model = LinearRegression()
```
Fit:

`your_model.fit(x_training_data, y_training_data)`

- `.coef_`: contains the coefficients
- `.intercept_`: contains the intercept

Predict:

`predictions = your_model.predict(your_x_data)`

`.score()`: returns the coefficient of determination R²
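
To make the R² value less abstract, here is a minimal sketch of how the coefficient of determination can be computed by hand. The function name `r_squared` and the sample values are hypothetical and used only for illustration; this is not scikit-learn's own code.

```python
def r_squared(y_true, y_predicted):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    y_mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_predicted))
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical target values for illustration
y_true = [3, 5, 7, 9]

perfect = r_squared(y_true, [3, 5, 7, 9])   # every prediction is exact
baseline = r_squared(y_true, [6, 6, 6, 6])  # always predicting the mean of y_true

print(perfect)   # 1.0
print(baseline)  # 0.0
```

A score of 1 means the predictions match the data exactly, while a score of 0 means the model does no better than always predicting the mean.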

#### Naive Bayes
Import and create the model:
```
from sklearn.naive_bayes import MultinomialNB
your_model = MultinomialNB()
```
Fit:

`your_model.fit(x_training_data, y_training_data)`

Predict:
```
# Returns a list of predicted classes - one prediction for every data point
predictions = your_model.predict(your_x_data)

# For every data point, returns a list of probabilities of each class
probabilities = your_model.predict_proba(your_x_data)
```

#### K-Nearest Neighbors
Import and create the model:
```
from sklearn.neighbors import KNeighborsClassifier
your_model = KNeighborsClassifier()
```
Fit:

`your_model.fit(x_training_data, y_training_data)`

Predict:
```
# Returns a list of predicted classes - one prediction for every data point
predictions = your_model.predict(your_x_data)

# For every data point, returns a list of probabilities of each class
probabilities = your_model.predict_proba(your_x_data)
```

#### K-Means
Import and create the model:
```
from sklearn.cluster import KMeans
your_model = KMeans(n_clusters=4, init='random')
```
- `n_clusters`: number of clusters to form and number of centroids to generate
- `init`: method for initialization
  - `k-means++`: K-Means++ [default]
  - `random`: K-Means
- `random_state`: the seed used by the random number generator [optional]

Fit:

`your_model.fit(x_training_data)`

Predict:

`predictions = your_model.predict(your_x_data)`

#### Validating the Model
Import and print accuracy, recall, precision, and F1 score:
```
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

print(accuracy_score(true_labels, guesses))
print(recall_score(true_labels, guesses))
print(precision_score(true_labels, guesses))
print(f1_score(true_labels, guesses))
```
Import and print the confusion matrix:
```
from sklearn.metrics import confusion_matrix
print(confusion_matrix(true_labels, guesses))
```
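
To show what these four scores measure, the following sketch computes them by hand from the cells of a binary confusion matrix. It assumes binary labels where 1 is the positive class; the label lists are made up for illustration.

```python
# Hypothetical labels for illustration: 1 = positive class, 0 = negative class
true_labels = [1, 0, 1, 1, 0, 0, 1, 0]
guesses     = [1, 0, 0, 1, 1, 0, 1, 0]

pairs = list(zip(true_labels, guesses))

# The four cells of a binary confusion matrix
tp = sum(1 for t, g in pairs if t == 1 and g == 1)  # true positives
tn = sum(1 for t, g in pairs if t == 0 and g == 0)  # true negatives
fp = sum(1 for t, g in pairs if t == 0 and g == 1)  # false positives
fn = sum(1 for t, g in pairs if t == 1 and g == 0)  # false negatives

accuracy = (tp + tn) / len(pairs)   # share of all guesses that were right
precision = tp / (tp + fp)          # share of predicted positives that were right
recall = tp / (tp + fn)             # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(accuracy, precision, recall, f1)  # 0.75 0.75 0.75 0.75
```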
#### Training Sets and Test Sets
```
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, test_size=0.2)
```
- `train_size`: the proportion of the dataset to include in the train split
- `test_size`: the proportion of the dataset to include in the test split
- `random_state`: the seed used by the random number generator [optional]
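
As a small usage sketch, the split can be tried on toy data. The data and the `random_state` value below are arbitrary choices for illustration.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with one feature each, plus a binary label
x = [[i] for i in range(10)]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# Fixing random_state makes the shuffle, and therefore the split, repeatable
x_train, x_test, y_train, y_test = train_test_split(
    x, y, train_size=0.8, test_size=0.2, random_state=6
)

print(len(x_train), len(x_test))  # 8 2
```

Every sample ends up in exactly one of the two sets, so the model can be scored on data it never saw during fitting.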

### Distance between 2 points
Points are usually represented as lists of numbers, one number per dimension.
The distance between two points is only defined when both points have the same number of dimensions.
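
The distance functions in this section index both points by position, so they assume equal lengths. A minimal guard, using a hypothetical helper name `check_same_dimensions`, could be sketched as:

```python
def check_same_dimensions(pt1, pt2):
    """Raise ValueError unless both points have the same number of dimensions."""
    if len(pt1) != len(pt2):
        raise ValueError(f"dimension mismatch: {len(pt1)} vs {len(pt2)}")

check_same_dimensions([1, 2], [4, 0])  # same length, passes silently

try:
    check_same_dimensions([1, 2], [4, 0, 3])
except ValueError as error:
    print(error)  # dimension mismatch: 2 vs 3
```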

#### 1. Euclidean Distance
To find the Euclidean distance between two points, we first calculate the squared difference in each dimension. If we add up all of these squared differences and take the square root, we’ve computed the Euclidean distance.

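```
def euclidean_distance(pt1, pt2):
  distance = 0
  for i in range(len(pt1)):
    # add the squared difference in each dimension
    distance += (pt1[i] - pt2[i])**2
  # the square root of the summed squares is the Euclidean distance
  distance = distance**0.5
  return distance
```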

#### 2. Manhattan Distance
Sum the absolute value of the difference between each dimension. It’s called Manhattan distance because it’s similar to how you might navigate when walking city blocks. If you’ve ever wondered “how many blocks will it take me to get from point A to point B”, you’ve computed the Manhattan distance.

```
def manhattan_distance(pt1, pt2):
  distance = 0
  for i in range(len(pt1)):
    # add the absolute difference in each dimension
    distance += abs(pt1[i] - pt2[i])
  return distance
```

#### 3. Hamming Distance
Hamming Distance is another slightly different variation on the distance formula. Instead of finding the difference of each dimension, Hamming distance only cares about whether the dimensions are exactly equal. When finding the Hamming distance between two points, add one for every dimension that has different values.

Hamming distance is used in spell checking algorithms. For example, the Hamming distance between the word “there” and the typo “thete” is one. Each letter is a dimension, and each dimension has the same value except for one.

```
def hamming_distance(pt1, pt2):
  distance = 0
  for i in range(len(pt1)):
    # count every dimension whose values differ
    if pt1[i] != pt2[i]:
      distance += 1
  return distance
```

### SciPy Distances
Now that you’ve written these three distance formulas yourself, let’s look at how to compute them with Python’s SciPy library. The functions live in the `scipy.spatial.distance` module:

- Euclidean Distance: `.euclidean()`
- Manhattan Distance: `.cityblock()`
- Hamming Distance: `.hamming()`

There are a few noteworthy details to talk about:
- First, the scipy implementation of Manhattan distance is called cityblock(). Remember, computing Manhattan distance is like asking how many blocks away you are from a point.
- Second, the scipy implementation of Hamming distance will always return a number between 0 and 1. Rather than summing the number of differences in dimensions, this implementation sums those differences and then divides by the total number of dimensions. For example, in your implementation, the Hamming distance between [1, 2, 3] and [7, 2, -10] would be 2. In scipy’s version, it would be 2/3.
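
A short sketch of the three SciPy calls, using the same two points as the Hamming example above. The variable names are illustrative, and the last line shows one way to recover the raw count from SciPy's fraction.

```python
from scipy.spatial import distance

pt1 = [1, 2, 3]
pt2 = [7, 2, -10]

euclid = distance.euclidean(pt1, pt2)          # sqrt(6**2 + 0**2 + 13**2)
manhattan = distance.cityblock(pt1, pt2)       # 6 + 0 + 13
hamming_fraction = distance.hamming(pt1, pt2)  # 2 of 3 dimensions differ

# Multiply by the number of dimensions to get the count of differing dimensions
hamming_count = round(hamming_fraction * len(pt1))

print(euclid, manhattan, hamming_fraction, hamming_count)
```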

## Regression vs Classification

**Regression** is used to predict outputs that are continuous. The outputs are quantities that can be flexibly determined based on the inputs of the model rather than being confined to a set of possible labels.

For example:
- Predict the height of a potted plant from the amount of rainfall
- Predict salary based on someone’s age and availability of high-speed internet
- Predict a car’s MPG (miles per gallon) based on size and model year

Linear regression is one of the most popular regression algorithms. It is often underrated because of its relative simplicity. In a business setting, it could be used to predict the likelihood that a customer will churn or the revenue a customer will generate. More complex models may fit this data better, at the cost of losing simplicity.

**Classification** is used to predict a discrete label. The outputs fall under a finite set of possible outcomes. Many situations have only two possible outcomes. This is called binary classification (True/False, 0 or 1, Hotdog / not Hotdog).

For example:
- Predict whether an email is spam or not
- Predict whether it will rain or not
- Predict whether a user is a power user or a casual user

There are also two other common types of classification: **multi-class classification** and **multi-label classification**.

Multi-class classification follows the same idea as binary classification, except that instead of two possible outcomes, there are three or more.

For example:
- Predict whether a photo contains a pear, apple, or peach
- Predict what letter of the alphabet a handwritten character is
- Predict whether a piece of fruit is small, medium, or large

In multi-label classification, each example can be assigned several labels at once rather than exactly one. This is useful for customer segmentation, image categorization, and sentiment analysis for understanding text. To perform these classifications, we use models like Naive Bayes, K-Nearest Neighbors, SVMs, as well as various deep learning models.
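
The difference between multi-class and multi-label targets is easiest to see in how the labels are stored. The sketch below uses made-up fruit photos and a hypothetical helper `to_indicator`; a 0/1 indicator row per example is one common way to encode multi-label targets.

```python
classes = ["pear", "apple", "peach"]

# Multi-class: each photo gets exactly one label out of three or more
multi_class_labels = ["apple", "peach", "pear"]

# Multi-label: each photo may carry any subset of the labels, including none
multi_label_labels = [{"apple", "pear"}, set(), {"peach"}]

def to_indicator(label_set, classes):
    """Encode a set of labels as a 0/1 list, one position per known class."""
    return [1 if c in label_set else 0 for c in classes]

indicator_rows = [to_indicator(s, classes) for s in multi_label_labels]
print(indicator_rows)  # [[1, 1, 0], [0, 0, 0], [0, 0, 1]]
```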

### Linear Regression
The purpose of machine learning is often to create a model that explains some real-world data, so that we can predict what may happen next, with different inputs.

The simplest model that we can fit to data is a line. When we are trying to find a line that fits a set of data best, we are performing Linear Regression.

We often want to find lines to fit data, so that we can predict unknowns. For example:
- The market price of a house vs. the square footage of a house. Can we predict how much a house will sell for, given its size?
- The tax rate of a country vs. its GDP. Can we predict taxation based on a country’s GDP?
- The amount of chips left in the bag vs. number of chips taken. Can we predict how much longer this bag of chips will last, given how much people at this party have been eating?

#### Points and Lines
Given a handful of data points, such as the monthly revenue of a lemonade stand, you can probably make a rough estimate of the next data point without thinking too hard about it. For our program to make the same kind of guess, we have to determine what a line through those data points would look like.

A line is determined by its slope and its intercept. In other words, for each point (x, y) on a line we can say:

**y = mx + b**

where m is the slope and b is the intercept. y is the value on the y-axis that corresponds to a given x on the x-axis.

The slope is a measure of how steep the line is, while the intercept is a measure of where the line hits the y-axis.

When we perform Linear Regression, the goal is to get the “best” m and b for our data.

```
# change m and b to the best fit by trial and error:

import matplotlib.pyplot as plt
months = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
revenue = [52, 74, 79, 95, 115, 110, 129, 126, 147, 146, 156, 184]

# slope:
m = 5
# intercept:
b = 30
# predicted revenue for each month on the line y = mx + b
y = [(m * x) + b for x in months]

plt.plot(months, revenue, "x")
plt.plot(months, y)
plt.show()
```

#### Loss
When we think about how we can assign a slope and intercept to fit a set of points, we have to define what the best fit is.

For each data point, we calculate loss, a number that measures how bad the model’s (in this case, the line’s) prediction was. You may have seen this being referred to as error.

We can think about loss as the squared vertical distance from the point to the line. We use the squared distance (instead of just the distance) so that points above and below the line both contribute to total loss in the same way:

```
# We have three points, (1, 5), (2, 1), and (3, 3).
# We are trying to find the line that produces the lowest loss.

x = [1, 2, 3]
y = [5, 1, 3]

# y = x
m1 = 1
b1 = 0
y_predicted1 = [(m1 * x_val) + b1 for x_val in x]

# y = 0.5x + 1
m2 = 0.5
b2 = 1
y_predicted2 = [(m2 * x_val) + b2 for x_val in x]

# total loss = sum of squared differences between actual and predicted y
total_loss1 = 0
for i in range(len(y)):
  total_loss1 += (y[i] - y_predicted1[i])**2

total_loss2 = 0
for i in range(len(y)):
  total_loss2 += (y[i] - y_predicted2[i])**2

print(total_loss1)  # 17
print(total_loss2)  # 13.5, so the second line fits better
```

#### Gradient Descent for `Intercept`
As we try to minimize loss, we take each parameter we are changing, and move it as long as we are decreasing loss. 
The process by which we do this is called gradient descent. We move in the direction that decreases our loss the most.

To find the b gradient:
- we find the sum of y_value - (m*x_value + b) for all the y_values and x_values we have
- and then we multiply the sum by a factor of -2/N. N is the number of points we have.

#### Gradient Descent for `Slope`

To find the m gradient:
- we find the sum of x_value * (y_value - (m*x_value + b)) for all the y_values and x_values we have
- and then we multiply the sum by a factor of -2/N. N is the number of points we have.
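
Written out as formulas, the two gradients look like this. The block assumes the loss being minimized is the mean of the squared errors over all N points; that 1/N factor is inferred from the -2/N in the steps above rather than stated explicitly in the text.

$$
L(m, b) = \frac{1}{N} \sum_{i=1}^{N} \bigl(y_i - (m x_i + b)\bigr)^2
$$

$$
\frac{\partial L}{\partial b} = -\frac{2}{N} \sum_{i=1}^{N} \bigl(y_i - (m x_i + b)\bigr)
\qquad
\frac{\partial L}{\partial m} = -\frac{2}{N} \sum_{i=1}^{N} x_i \bigl(y_i - (m x_i + b)\bigr)
$$

Both follow from the chain rule: differentiating the squared term gives the factor of 2, and the inner derivative of $-(m x_i + b)$ is $-1$ with respect to b and $-x_i$ with respect to m.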

Once we have a way to calculate both the m gradient and the b gradient, we’ll be able to follow both of those gradients downwards to the point of lowest loss for both the m value and the b value. Then, we’ll have the best m and the best b to fit our data!

```
def get_gradient_at_b(x, y, b, m):
  # gradient of the loss with respect to the intercept b
  N = len(x)
  diff = 0
  for i in range(N):
    x_val = x[i]
    y_val = y[i]
    diff += (y_val - ((m * x_val) + b))
  b_gradient = -(2/N) * diff
  return b_gradient

def get_gradient_at_m(x, y, b, m):
  # gradient of the loss with respect to the slope m
  N = len(x)
  diff = 0
  for i in range(N):
    x_val = x[i]
    y_val = y[i]
    diff += x_val * (y_val - ((m * x_val) + b))
  m_gradient = -(2/N) * diff
  return m_gradient
```


Now that we know how to calculate the gradient, we want to take a “step” in that direction. However, it’s important to think about whether that step is too big or too small. We don’t want to overshoot the minimum error!

We can scale the size of the step by multiplying the gradient by a learning rate.
To find a new b value, we would say:

`new_b = current_b - (learning_rate * b_gradient)`

where current_b is our guess for what the b value is, b_gradient is the gradient of the loss curve at our current guess, and learning_rate is proportional to the size of the step we want to take.

The implications of a large or small learning rate are covered in the Learning Rate section; for now, let’s use a fairly small value.

```
# Move the parameter values against the gradient (downhill) at a learning rate of 0.01
def step_gradient(x, y, b_current, m_current):
  # pass b and m by keyword so they cannot be swapped by accident
  b_gradient = get_gradient_at_b(x, y, b=b_current, m=m_current)
  m_gradient = get_gradient_at_m(x, y, b=b_current, m=m_current)
  b = b_current - (0.01 * b_gradient)
  m = m_current - (0.01 * m_gradient)
  return b, m
```

#### Convergence
How do we know when we should stop changing the parameters m and b? How will we know when our program has learned enough?

To answer this, we have to define convergence. Convergence is when the loss stops changing (or changes very slowly) when parameters are changed.
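
One way to turn this definition into a stopping rule is to compare the loss before and after each step and stop once the change is tiny. The sketch below is only an illustration: the names `mean_squared_loss` and `descend_until_converged`, the tolerance, and the iteration cap are all assumptions, and it uses the mean of squared errors as the loss.

```python
def mean_squared_loss(x, y, b, m):
    """Mean of the squared differences between actual and predicted y."""
    N = len(x)
    return sum((y[i] - (m * x[i] + b)) ** 2 for i in range(N)) / N

def descend_until_converged(x, y, learning_rate=0.01, tolerance=1e-6,
                            max_iterations=100000):
    """Run gradient descent until the loss changes by less than `tolerance`."""
    b, m = 0.0, 0.0
    N = len(x)
    previous_loss = mean_squared_loss(x, y, b, m)
    for iteration in range(1, max_iterations + 1):
        b_gradient = -2 / N * sum(y[i] - (m * x[i] + b) for i in range(N))
        m_gradient = -2 / N * sum(x[i] * (y[i] - (m * x[i] + b)) for i in range(N))
        b -= learning_rate * b_gradient
        m -= learning_rate * m_gradient
        loss = mean_squared_loss(x, y, b, m)
        # Converged: the latest step barely changed the loss
        if abs(previous_loss - loss) < tolerance:
            return b, m, iteration
        previous_loss = loss
    return b, m, max_iterations

months = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
revenue = [52, 74, 79, 95, 115, 110, 129, 126, 147, 146, 156, 184]

b, m, iterations_used = descend_until_converged(months, revenue)
print(b, m, iterations_used)
```

The iteration cap is a safety net: if the learning rate is too large, the loss never settles and the tolerance check alone would loop forever.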

#### Learning Rate
We want our program to be able to iteratively learn what the best m and b values are. So for each m and b pair that we guess, we want to move them in the direction of the gradients we’ve calculated. But how far do we move in that direction?

We have to choose a learning rate, which will determine how far down the loss curve we go.

A small learning rate will take a long time to converge — you might run out of time or cycles before getting an answer. A large learning rate might skip over the best value. It might never converge! Oh no!

Finding the absolute best learning rate is not necessary for training a model. You just have to find a learning rate large enough that gradient descent converges with the efficiency you need, and not so large that convergence never happens.
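
The effect of the learning rate can be seen by running the same number of steps with different rates and comparing the final loss. This is a rough sketch: the helper name `loss_after_descent` and the three rates are arbitrary choices for illustration, and the loss is the mean of squared errors.

```python
def loss_after_descent(x, y, learning_rate, num_iterations):
    """Run gradient descent from b = 0, m = 0 and return the final mean squared loss."""
    b, m = 0.0, 0.0
    N = len(x)
    for _ in range(num_iterations):
        b_gradient = -2 / N * sum(y[i] - (m * x[i] + b) for i in range(N))
        m_gradient = -2 / N * sum(x[i] * (y[i] - (m * x[i] + b)) for i in range(N))
        b, m = b - learning_rate * b_gradient, m - learning_rate * m_gradient
    return sum((y[i] - (m * x[i] + b)) ** 2 for i in range(N)) / N

months = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
revenue = [52, 74, 79, 95, 115, 110, 129, 126, 147, 146, 156, 184]

starting_loss = loss_after_descent(months, revenue, 0.01, 0)  # no steps taken
too_small = loss_after_descent(months, revenue, 0.0001, 50)   # improves, but slowly
reasonable = loss_after_descent(months, revenue, 0.01, 50)    # improves quickly
too_large = loss_after_descent(months, revenue, 0.05, 50)     # overshoots and blows up

print(starting_loss, too_small, reasonable, too_large)
```

On this data, the smallest rate lowers the loss only a little in 50 steps, the middle rate lowers it a lot, and the largest rate ends with a loss far above where it started because every step overshoots the minimum.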

```
import matplotlib.pyplot as plt

def get_gradient_at_b(x, y, b, m):
  N = len(x)
  diff = 0
  for i in range(N):
    x_val = x[i]
    y_val = y[i]
    diff += (y_val - ((m * x_val) + b))
  b_gradient = -(2/N) * diff
  return b_gradient

def get_gradient_at_m(x, y, b, m):
  N = len(x)
  diff = 0
  for i in range(N):
    x_val = x[i]
    y_val = y[i]
    diff += x_val * (y_val - ((m * x_val) + b))
  m_gradient = -(2/N) * diff
  return m_gradient

# take one step downhill, scaled by the learning rate
def step_gradient(b_current, m_current, x, y, learning_rate):
  b_gradient = get_gradient_at_b(x, y, b_current, m_current)
  m_gradient = get_gradient_at_m(x, y, b_current, m_current)
  b = b_current - (learning_rate * b_gradient)
  m = m_current - (learning_rate * m_gradient)
  return [b, m]

# repeat the step num_iterations times, starting from b = 0 and m = 0
def gradient_descent(x, y, learning_rate, num_iterations):
  b = 0
  m = 0
  for i in range(num_iterations):
    b, m = step_gradient(b, m, x, y, learning_rate)
  return b, m

months = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
revenue = [52, 74, 79, 95, 115, 110, 129, 126, 147, 146, 156, 184]

b, m = gradient_descent(months, revenue, 0.01, 1000)

# predicted revenue for each month on the fitted line
y = [m*x + b for x in months]

plt.plot(months, revenue, "o")
plt.plot(months, y)

plt.show()
```




#### Scikit-Learn
Congratulations! You’ve now built a linear regression algorithm from scratch.

Luckily, we don’t have to do this every time we want to use linear regression. We can use Python’s scikit-learn library. Scikit-learn, or sklearn, is used specifically for Machine Learning. Inside the `linear_model` module, there is a `LinearRegression` class we can use:

`from sklearn.linear_model import LinearRegression`

You can first create a LinearRegression model, and then fit it to your x and y data:
```
line_fitter = LinearRegression()
line_fitter.fit(X, y)
```
The `.fit()` method gives the model two attributes that are useful to us:

- `line_fitter.coef_`, which contains the slope (as an array with one coefficient per input feature)
- `line_fitter.intercept_`, which contains the intercept

We can also use the `.predict()` method to pass in x-values and receive the y-values that this line would predict:

`y_predicted = line_fitter.predict(X)`

Note: scikit-learn’s `LinearRegression` solves for the best-fit line directly using ordinary least squares rather than gradient descent, so there is no `num_iterations` or `learning_rate` to set!

```
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import numpy as np

temperature = np.array(range(60, 100, 2))
# scikit-learn expects a 2D array of features: one row per sample
temperature = temperature.reshape(-1, 1)
sales = [65, 58, 46, 45, 44, 42, 40, 40, 36, 38, 38, 28, 30, 22, 27, 25, 25, 20, 15, 5]

# Create an sklearn linear regression model
line_fitter = LinearRegression()

# Fit the model to temperature and sales
line_fitter.fit(temperature, sales)

# Predicted sales values for each temperature
sales_predict = line_fitter.predict(temperature)

plt.plot(temperature, sales, 'o')
plt.plot(temperature, sales_predict)
plt.show()
```
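
As a cross-check on any fitted line, the slope and intercept that minimize squared loss for a single input variable can also be computed directly, without iterating. The sketch below uses the standard closed-form least-squares formulas; the function name `least_squares_line` is hypothetical, and this is not a copy of scikit-learn's internals.

```python
def least_squares_line(x, y):
    """Return (b, m) for the line y = m*x + b that minimizes squared loss."""
    N = len(x)
    x_mean = sum(x) / N
    y_mean = sum(y) / N
    # slope = how x and y vary together, divided by how much x varies
    sxy = sum((x[i] - x_mean) * (y[i] - y_mean) for i in range(N))
    sxx = sum((x[i] - x_mean) ** 2 for i in range(N))
    m = sxy / sxx
    # the best-fit line always passes through the point (x_mean, y_mean)
    b = y_mean - m * x_mean
    return b, m

months = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
revenue = [52, 74, 79, 95, 115, 110, 129, 126, 147, 146, 156, 184]

b, m = least_squares_line(months, revenue)
print(b, m)
```

Running gradient descent for more iterations should bring its b and m closer and closer to these values.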