## Data Modeling

### Clustering

My nephew wants me to program a video game for him called 'Monster Family', and even provided a description of its characters:

<img src='https://courses.edx.org/asset-v1:Microsoft+DAT210x+4T2016+type@asset+block@Haadi.jpg'
alt='' style='height:300px;'/>

Although he's young, you can tell he's particular about details. To make the game to his liking, I needed to know which monsters were real family members and which weren't. If he had added just one more column holding that detail (FamilyA, FamilyB, etc.), I would have been set:

<img src='pic/data-modeling-clustering-1.png'
alt='' style='height:200px;'/>

This monster 'dataset' is similar to real-world data in that it comes loaded with observational features, but isn't labeled. The one question I want a direct answer to isn't included as a feature. If there were a way to automatically group similar samples based solely on their features, we could then use that knowledge to guide us towards actionable intelligence. That way exists, and it's called unsupervised clustering.

#### Similarity

Since the goal of clustering is the grouping of similar records, you first have to define what similarity means. How would you define monster similarity?

<img src='pic/data-modeling-clustering-2.png'
alt='' style='height:200px;'/>

One way you could group them is by attack power. Perhaps the monsters who deal the most damage to the player belong to one family, the weaker monsters belong to another, and the rest are grouped as a family too, as shown above.

<img src='pic/data-modeling-clustering-3.png'
alt='' style='height:200px;'/>

Another grouping would be by weakness. It seems a lot of monsters share a water-based weakness. All of these might belong to the same family. The remaining monsters might each belong to a separate family, or might actually be members of a sporadic family. There are many other ways you could group them as well, such as by size, by name (e.g. monster vs snake), etc.

Without a generalizable way to group the samples, deterministic computers can't cluster your data. What's needed is a systematic means of measuring the **overall** similarity between your samples. Let's discuss how that's accomplished in the next section.

#### How Does k-Means Work?

Clustering groups samples that are similar within the same cluster. The more similar the samples belonging to a cluster group are (and conversely, the more dissimilar samples in separate groups), the better the clustering algorithm has performed. Since clustering is an unsupervised algorithm, this similarity metric must be measured automatically and based solely on your data.

The implementation details and definition of *similarity* are what differentiate the many clustering algorithms. The **K-Means** approach is to iteratively separate your samples into a user-specified number of "K" cluster groups of roughly equal variance. Cluster groups are defined by their geometric cluster center, a single point referred to as the centroid. On their own, *centroid* and *cluster* are sometimes used interchangeably; but when used together, a cluster is a set of similar samples, and a centroid is just the mean feature-position of all samples assigned to the cluster.

The centroids are not records in your dataset; however, they do 'exist' within your dataset's feature space. This is important because it allows a meaningful distance measure to be calculated between the centroids and your samples. Every sample in your dataset is assigned to the centroid nearest to it, so if you have a sample that is 10 units away from ClusterA's centroid and 100 units away from ClusterB's, the sample is assigned to ClusterA.
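To make the nearest-centroid rule concrete, here is a minimal NumPy sketch. The sample and centroid coordinates are made up for illustration, and the Euclidean distance used is the one discussed later in this section.

```python
import numpy as np

# Hypothetical 2-feature samples and two centroids, for illustration only
samples = np.array([[1.0, 2.0], [9.0, 8.0], [2.0, 1.0]])
centroids = np.array([[1.5, 1.5],    # "ClusterA"
                      [9.0, 9.0]])   # "ClusterB"

# Euclidean distance from every sample to every centroid,
# giving an array shaped (n_samples, n_centroids)
distances = np.linalg.norm(
    samples[:, np.newaxis, :] - centroids[np.newaxis, :, :], axis=2)

# Each sample is assigned to the index of its nearest centroid
assignments = distances.argmin(axis=1)
print(assignments)
```

The first and third samples land in ClusterA (index 0) and the second in ClusterB (index 1), because those are the centroids they sit closest to.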

In the case of continuous features, calculating the distance is straightforward. But when you have categorical features, such as 'Cookies n Cream' vs 'Mango' ice cream flavors, you'll have to creatively come up with other methods. SciKit-Learn's K-Means implementation only natively supports numeric feature types, so we'll leave the discussion on how to do clustering with categorical features to the Dive Deeper section.

#### The K-Means Algorithm

K-Means starts by placing a user-specified number of "K" cluster centers in your feature space. There are many techniques for choosing the first centroid placement, and your results will vary depending on the one you select! The simplest is to use the positions of some random samples as the centroids' starting spots.

Each cluster then takes ownership of the samples nearest to its centroid, and every sample can only be assigned to a single cluster. 'Nearest' is a value that has to be evaluated, and in SciKit-Learn it is defined as the multivariate, n-dimensional Euclidean distance between the sample and the centroid. After this, the centroid location is updated to be the mean value of all samples assigned to it. This mean value is calculated by feature, so the centroid position ends up being an n-length vector within your feature space.

The assignment and update steps repeat until there are no more changes in either, at which point the algorithm has converged. K-Means always converges, and it is very fast at doing so. But it does not always converge to the global minimum...
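The assignment / update cycle described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not SciKit-Learn's implementation: the function name `kmeans_sketch`, its parameters, and the toy data are all hypothetical, and it uses the simplest seeding (random samples as starting centroids).

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Plain K-Means: random-sample seeding, then assign/update until stable."""
    rng = np.random.default_rng(seed)
    # Seed the centroids with k distinct samples picked at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: nearest centroid by Euclidean distance
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the per-feature mean of its samples
        # (a centroid that lost all of its samples is left where it is)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # no more changes: converged
        centroids = new_centroids
    return labels, centroids

# Toy data: two well-separated groups of three samples each
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centroids = kmeans_sketch(X, k=2)
```

On data this cleanly separated, the loop settles on the two obvious groups within a few iterations; on messier data, different seeds can end in different local minima.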

The technical explanation for what K-Means does is that it minimizes the within-cluster inertia, or **sum of squared errors** between each sample and its respective centroid. As mentioned, the initial centroid assignment affects the results. Two runs of K-Means might produce different outcomes, but the quality of their cluster assignments is ranked by looking at which run has the smallest overall inertia.
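As a rough sketch of how runs can be ranked by inertia, the snippet below fits K-Means several times from different random seedings and keeps the run with the lowest `inertia_`. The blob data is synthetic and the choice of five runs is arbitrary; the last line recomputes the sum of squared errors by hand to show what the attribute measures.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical data: three loose blobs in 2-D
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0, 5, 10)])

best = None
for seed in range(5):
    # n_init=1 so each fit is a single run from one random seeding
    run = KMeans(n_clusters=3, init='random', n_init=1, random_state=seed).fit(X)
    if best is None or run.inertia_ < best.inertia_:
        best = run

# Sum of squared distances from each sample to its assigned centroid
manual = ((X - best.cluster_centers_[best.labels_]) ** 2).sum()
```

In practice you rarely need this loop, since the `n_init` parameter asks SciKit-Learn to do the same repeat-and-keep-best procedure internally.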

#### When Should I Use K-Means?

Clustering is a natural action we do even as children, by arranging similarly shaped blocks and colors. K-Means clustering is best suited to cases where you have a good idea of the number of distinct clusters your unlabeled dataset should be segmented into. Generally, the output of K-Means is used in two ways: to separate your unlabeled data into K groups, which is the clear use case, or to find and use the resulting centroids.

##### Separate Your Data

Astronomers use clustering to group different star types, classes of planets, and galaxies. Biologists use it to group every living thing by species, genus, and kingdom. In business, clustering is used to segment likely and unlikely prospects, for location assignment, factor endowment, and the assignment and deployment of remote services.

##### Centroid Usage

Besides divvying up samples, clustering can also provide a layer of abstraction, by directing attention to the cluster and its attributes rather than to each sample. In the climate change case study from the previous module, you saw how climate divisions were used as a cluster abstraction over individual ground stations for various mentioned reasons. Another example of centroid usage would be a company looking for ideal locations to open a limited number of branches, based on the location of their customers.

You can use the centroid to 'compress' your data. By referring to the centroid rather than the data sample, the number of unique values is reduced, which speeds up the execution of other algorithms. Isomap, for instance, uses a nearest neighbors algorithm to calculate the distance from the record you want to transform to every sample in the training dataset. By using the record-to-cluster distance approximation in place of the individual record-to-sample distances, you can cut the number of distance calculations substantially, since there are far fewer clusters than records.
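One way to picture this compression is to swap every sample for the centroid of its cluster. The sketch below assumes synthetic uniform data and an arbitrary K of 4; it only shows the reduction in distinct values, not the downstream speedup.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))   # hypothetical samples

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Swap every sample for the centroid of the cluster it was assigned to
compressed = kmeans.cluster_centers_[kmeans.labels_]

n_before = len(np.unique(X, axis=0))           # distinct rows before
n_after = len(np.unique(compressed, axis=0))   # at most 4 afterwards
```

The compressed array has the same shape as the original, but it contains no more distinct rows than there are clusters.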

#### SciKit-Learn and K-Means

It's very simple to get up and running with K-Means in SKLearn. Given a dataframe df, you can compute its labels and centroids as follows:

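```python
from sklearn.cluster import KMeans

# df is assumed to be a dataframe holding only numeric features
kmeans = KMeans(n_clusters=5)
kmeans.fit(df)
# In a notebook, fit() echoes the estimator and its parameters
# (the exact list depends on your SciKit-Learn version), e.g.:
# KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=5, n_init=10, n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001, verbose=0)

labels = kmeans.predict(df)          # cluster index assigned to each sample
centroids = kmeans.cluster_centers_  # array shaped [n_clusters, n_features]
```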

The most important factor for you to focus on is n_clusters, the "**K**" number of clusters you want K-Means to place for you. Also experiment with different initialization methods, including rolling your own and passing in the positions as an NDArray shaped as `[n_clusters, n_features]`. We've included more details for you on that in the Dive Deeper section.
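As a small sketch of rolling your own initialization, the `init` parameter accepts an array of starting positions. The toy data and hand-picked seed positions below are assumptions chosen so the outcome is easy to follow.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: two well-separated groups of three samples each
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

# Hand-picked starting centroids, shaped [n_clusters, n_features]
seeds = np.array([[0.0, 0.0], [10.0, 10.0]])

# With an explicit init array a single run suffices, hence n_init=1
kmeans = KMeans(n_clusters=2, init=seeds, n_init=1).fit(X)
```

Because the starting positions are fixed, repeated fits on the same data start from the same place, which also makes the result repeatable.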

#### K-Means Gotchas!

It's easy to understand the K-Means algorithm, and it's extremely fast to execute. So fast that it's often run several times over, as you saw earlier. Since each successive run isn't dependent on the results of earlier runs, the execution process lends itself to parallelization, with each centroid seeding trial being run independently. If the clustering job at hand is still taking too long, SciKit-Learn's [MiniBatchKMeans](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html) further optimizes the process for you.
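`MiniBatchKMeans` is used much like `KMeans`. The sketch below runs it on synthetic blob data; the `batch_size` and `n_init` values are arbitrary illustrative choices, not recommendations.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
# Hypothetical data: three blobs of 1000 samples each
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(1000, 2)) for c in (0, 5, 10)])

# Centroids are updated from small random batches rather than the full dataset
mbk = MiniBatchKMeans(n_clusters=3, batch_size=256, n_init=3, random_state=0).fit(X)

labels = mbk.labels_
centroids = mbk.cluster_centers_
```

The trade-off is speed for a somewhat less precise result, so it is worth comparing `inertia_` against a regular `KMeans` fit when the dataset size allows it.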

Considering how basic an algorithm it is, K-Means performs pretty well, and its implementation is the basis for a few more advanced clustering algorithms, such as learning vector quantization and Gaussian mixture models. Having a solid understanding of K-Means will help you understand those better when you study them.

K-Means is only really suitable when you have a good estimate of the number of clusters that exist in your unlabeled data. There are many estimation techniques for approximating the correct number of clusters, but you'll have to get that number before running K-Means. Even if you do have the right number of clusters selected, the result produced by K-Means can vary depending on the initial centroid placement. So if you need the same results produced each time, your centroid seeding technique also needs to be able to reliably produce the same placement given the same data. Because the centroid seed placement has so much of an effect on your clustering outcome, you have to be careful: it is possible to end up with centroids that have only a single sample assigned to them, or in the worst case, no samples at all.
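A minimal sketch of both points, assuming synthetic blob data: fixing `random_state` makes the seeding repeatable, and counting the samples per cluster is a quick way to spot a centroid that ended up with almost nothing assigned to it.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical data: three loose blobs in 2-D
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0, 5, 10)])

# Fixing random_state fixes the centroid seeding, so repeated fits agree
km_a = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
km_b = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
same = np.allclose(km_a.cluster_centers_, km_b.cluster_centers_)

# Samples per cluster; a count of 0 or 1 flags a centroid worth inspecting
sizes = np.bincount(km_a.labels_, minlength=3)
```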

Two other key characteristics of K-Means are that it assumes your samples are length normalized, and as such, is sensitive to feature scaling. It also assumes that your clusters are roughly spherical and similar in size; this way, the nearest centroid is always the correct assignment.
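The sensitivity to feature scaling can be demonstrated with a contrived example. In the sketch below, one feature carries the real grouping but has small values, while a second feature is pure noise in much larger units. The data, the `agreement` helper, and the use of `StandardScaler` are all illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
group = np.repeat([0, 1], 100)   # the "true" grouping, used only for checking
X = np.column_stack([
    group + rng.normal(scale=0.1, size=200),   # informative, small units
    rng.normal(scale=1000.0, size=200),        # noise, large units
])

raw_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
scaled_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(X))

def agreement(labels):
    # Cluster ids are arbitrary, so score both possible matchings
    match = (labels == group).mean()
    return max(match, 1 - match)

print(agreement(raw_labels), agreement(scaled_labels))
```

Without scaling, the large-unit noise feature dominates the Euclidean distance and the clusters follow it instead of the informative feature; after scaling, both features contribute on comparable terms.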

#### Knowledge Checks

##### Review Question 1

*Only one of the following statements is true. Which one is it?*

+ **It's possible for samples from two different clusters to be more similar to one another than to their intra-cluster neighbors, if the two clusters are large and located near one another**
+ Real world data typically comes labeled
+ Unsupervised clustering aims to group your samples based on their labels
+ Centroids are records that live in your dataset and share the same feature space so that a meaningful distance can be calculated between them and your samples

##### Review Question 2

*Once again, only a single one of the following statements is correct. Do you know which one it is?*

+ It's possible for a sample to be assigned to two clusters, but only if it's equidistant from both clusters.
+ The K-Means algorithm scans your dataset to detect clusters using an iterative assignment / update cycle. The algorithm returns the number of clusters found, as well as their centroid position.
+ As a clustering algorithm, K-Means is really only useful for grouping your samples
+ **K-Means assumes your features are either length normalized, or that their length encodes a specific meaning.**

*Answer*

**Incorrect**: 

Wrong! You have to specify how many clusters exist in your data. Given that number, K-Means will attempt to find *that* many clusters in your data. But the responsibility of specifying the number of clusters is yours, not the algorithm's.

**Explanation**

A sample can only have a single cluster assignment. Also, you are the one responsible for specifying the number of clusters. K-Means won't tell you the number of clusters in your data. Besides assigning a cluster to your samples, there are many uses for the centroid locations. Review the reading please. Since K-Means cluster assignment depends on a Euclidean length metric, your features have to either be length normalized, or have appropriate units for the algorithm to perform properly.

#### Assignment 1

##### Lab Assignment 1

Many U.S. cities, the U.S. federal government, and even other cities and governments abroad have started subscribing to an Open Data policy, because some data should be transparent and available to everyone to use and republish freely, without restrictions from copyright, patents, or other mechanisms of control. After reading their [terms of use](http://www.cityofchicago.org/city/en/narr/foia/data_disclaimer.html), in this lab you'll be exploring the City of Chicago's Crime data set, which is part of their Open Data initiative.

1. Start by navigating over to the [City of Chicago's Crimes dataset](https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2) exploration page. It lists crimes from 2001 to the present, but you'll only be targeting Gambling. The city's website itself has hundreds of other datasets you can browse and do machine learning on.
2. Open up the /Module5/**assignment1.py** starter code, and follow the directions to acquire the dataset and properly set it up.
3. Fill out the **doKMeans** method to find and plot **seven clusters** and print out their centroids. These could be places a police officer investigates to check for on-going illegal activities.
4. Re-run your assignment a few times over, looking at your printed and plotted results. Then answer the following questions.

*Note: If Pandas complains about your data, you can use dropna() on any row that has nans in it.*

##### Lab Questions

2 points possible (graded)

You'll notice that the cluster assignments are pretty accurate. Most of them should be spot-on, dead-center. Only one cluster might have been assigned to outliers. Given the results, answer the following questions to the best of your ability:

*Did your centroid locations change after you limited the date range to +2011?*

+ Their locations are completely different
+ **They move slightly...**
+ Not at all

*What about during successive runs of your assignment? Did any centroid locations change there?*

+ All clusters have moved, and the cluster arrangement isn't anything like it was before
+ **All clusters have moved but only slightly, and the centroid arrangement still has the same shape for the most part**
+ The clusters did not really move at all, or if they did, it wasn't noticeable
+ The cluster centroids are identical according to the print statement output

#### Assignment 2

##### Lab Assignment 2

The spirit of data science includes exploration, traversing the unknown, and applying a deep understanding of the challenge you're facing. In an academic setting, it's hard to duplicate these tasks, but this lab will attempt to take a few steps away from the traditional, textbook, "plug the equation in" pattern, so you can get a taste of what analyzing data in the real world is all about.

After the September 11 attacks, a series of secret regulations, laws, and processes were enacted, perhaps to better protect the citizens of the United States. These processes continued through President Bush's term and were renewed and strengthened during the Obama administration. Then, on May 24, 2006, the United States Foreign Intelligence Surveillance Court (FISC) made a fundamental shift in its approach to Section 215 of the Patriot Act, permitting the FBI to compel production of "business records" relevant to terrorism investigations, which are shared with the NSA. The court now defined as *business* records the entirety of a telephone company's call database, also known as Call Detail Records (**CDR** or *metadata*).

News of this came to public light after an ex-NSA contractor leaked the information, and a few more questions were raised when it was further discovered that not just the call records of suspected terrorists were being collected in bulk... but perhaps those of Americans as a whole. After all, if you know someone who knows someone who knows someone, your private records are relevant to a terrorism investigation. The White House quickly reassured the public in [a press release](http://www.cbsnews.com/news/obama-nobody-is-listening-to-your-telephone-calls/) that "Nobody is listening to your telephone calls," since, "that's not what this program is about." The public was greatly relieved.

The questions you'll be exploring in this lab assignment using K-Means are: exactly how useful is telephone metadata? It must have some use, otherwise the government wouldn't have invested however many millions they did into secretly collecting it from phone carriers. Also, what kind of intelligence can you extract from CDR metadata besides its face value?

You will be using a sample CDR dataset generated for 10 people living in the Dallas, Texas metroplex area. Your task will be to attempt to do what many researchers [have already](http://journals.plos.org/plosone/article/asset?unique&id=info:doi/10.1371/journal.pone.0020814.s001) successfully done - partly de-anonymize the CDR data. People generally behave in predictable manners, moving from home to work with a few errands in between. With enough call data, given a few K-locations of interest, K-Means should be able to isolate rather easily the geolocations where a person spends the most of their time.

Note: to safeguard from doxing people, the CDR dataset you'll be using for this assignment was generated using the tools available in the Dive Deeper section. CDRs are at least supposed to be protected by privacy laws, and are the basis for proprietary revenue calculations. In reality, there are quite a few public CDRs out there. Much information can be discerned from them such as social networks, criminal acts, and believe it or not, even the spread of diseases as was demonstrated by [Flowminder Foundation paper on Ebola](http://www.worldpop.org.uk/ebola/Flowminder-Mobility-Data-21.08.14.pdf). 

1. Open up the starter code in /Module5/*assignment2.py* and *read* through it all. It's long, so make sure you understand everything that is being asked of you before proceeding.
2. Load up the CDR dataset from /Module5/Datasets/**CDR.csv**. Do your due diligence to make sure it's been loaded correctly and all the features and rows match up.
3. Pick the first unique user in the list to examine. Follow the steps in the assignment file to approximate where the user lives.
4. Once you have a (**Latitude**, **Longitude**) coordinate pair, drop them into Google Maps. Just do a search for the "{Lat, Lon}". So if your centroid is located at Longitude = **-96.949246** and Latitude = **32.953856**, then do a maps search for "[32.953856, -96.949246](https://www.google.com/maps/place/32%C2%B057'13.9%22N+96%C2%B056'57.3%22W/@32.953856,-96.950343,18z/data=!3m1!4b1!4m5!3m4!1s0x0:0x0!8m2!3d32.953856!4d-96.949246)".
5. Answer the questions below.

#### Assignment 3

##### Lab Assignment 3

Continuing on with the previous lab, this time you'll validate your results by comparing the user's weekday activity to their weekend activity. To get started, use the starter code in /Module5/**assignment3.py**.

1. Load up the same CDR dataset into a dataframe, and extract the unique "**In**" phone numbers. You don't have to save it as a Python list this time, and can keep it as an NDArray. The previous lab had you convert to a list just so you'd have the experience doing it.
2. Create a new slice, once again for the first unique number in the CDR. Instead of limiting it to weekend-only entries, index it so that the slice only contains weekday entries, **Mon-Fri**, that occur any time before 5pm.
3. Run K-Means on the data with K=4. Plot the cellphone towers the user connected to, and then plot the cluster centers using a different marker and color.
4. Answer the questions below.

##### Lab Questions

Answer the following questions given the data you just recorded, for K=4, with CallTime less than **5pm** (that is, "*17:00:00*") and the call's day-of-week being a weekday.

The user's home location will likely be near the centroid with the second most attached samples. Does your approximated home location from this map coincide with the home approximation from the previous lab?

+ Yes, they are exactly the same
+ **Yes, they match, but there is a slight difference**
+ No, you can tell that they should match; however, their locations are *very* different
+ No, they are completely different

Given the indexed time range, and the times people usually receive / make calls, the cluster with the most samples is likely to be the user's work location. What is the phone number of the user who works at the US Post Office near Cockrell Hill Rd?

+ 463-847-2273
+ 206-862-7935
+ **289-436-5987**
+ 155-941-0755
+ 368-808-9071

Run your assignment with K=3. Look at the code that gets the mean CallTime value for the cluster with the least amount of samples assigned to it (the cluster we suspect corresponds to the user transiting to work). What hour is the average CallTime value of that cluster closest to?

+ 5am
+ 6am
+ 7am
+ **8am**
+ 9am
+ 10am
+ 11am
 
**CallTime Clarification**

Your calculated average calltime should be on a *per-cluster* basis. Recall, each user in your dataset has 3 clusters, and you're only interested in the cluster-per-user with the fewest # of samples--that is, the least number of `.labels_`. So you should have 10 clusters total, each with a certain # of samples, and you want to calculate the average time per cluster.

#### Assignment 4

##### Lab Assignment 4

Feature scaling was first discussed within one of the PCA lab assignments, but this lab will really familiarize you with it. You will be making use of the **Wholesale Customers** dataset, hosted by UCI's Machine Learning Repository. *Unsupervised* clustering scans your features and then groups your samples based on them. Therefore you should have a solid understanding of what each of your features is, which ones you should remove, and how to scale them in order for the 'blind' clustering to perform correctly and do what you want it to do.

1. Visit the [UCI dataset page](https://archive.ics.uci.edu/ml/datasets/Wholesale+customers) and read all the content available, so you become accustomed to the dataset. Then, load up the starter code from Module5/**assignment4.py** and as usual, read through it in its entirety as well.
2. The first thing that needs to be answered is, what is it you'd like to accomplish by clustering this dataset? There are a couple of potential questions you could ask given the data, and the one you choose will drive how you manipulate your dataset. Are you interested in which products people buy together, so that you can place them near one another in your store, or recommend the paired product when shopping online? Perhaps you're more interested in which products people are spending the most money on? Or maybe your interest just lies in identifying what individual products people are buying. All of this must be considered.
3. For the purposes of this lab, you'll assume you're interested in overall customer behavior rather than channel or region specific behavior, so you'll drop those two fields from the dataset. If you were a large wholesaler with branches all over the nation, you'd want to keep those fields in so that you can apply the process outlined in the assignment to particular areas and vertical markets.
4. Complete the assignment and answer the questions below.

##### Lab Question

Which of SciKit-Learn's preprocessors causes the principal components to spread out as much as possible in an arrangement unlike the others?

+ StandardScaler
+ MinMaxScaler
+ **Normalize**
+ Scale
+ No scaling necessary

**Explanation**

Normalizing has a slight 'correlating' effect, since each sample's features get scaled by the overall sample's magnitude. This causes it to behave in a manner unique compared to the other scalers, which act on a per-feature basis. The result is visibly discernible, and each sample's feature-values become their unitized contribution to the sample's overall magnitude.
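The per-sample versus per-feature distinction can be seen on a tiny array. This sketch uses SciKit-Learn's `Normalizer` and `StandardScaler` classes on made-up values; it is only meant to show the direction in which each one operates.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# Hypothetical samples: the second row is exactly twice the first
X = np.array([[3.0, 4.0], [6.0, 8.0], [1.0, 0.0]])

# Per-sample: each row is divided by its own L2 length
row_scaled = Normalizer().fit_transform(X)

# Per-feature: each column is shifted to mean 0 and scaled to unit variance
col_scaled = StandardScaler().fit_transform(X)
```

After normalizing, the first two rows become identical because they point in the same direction and only differed in magnitude; the per-feature scaler keeps them distinct.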

### Splitting Data 

#### Supervised and Unsupervised Learning

So far, you've only covered unsupervised machine learning algorithms that run on unlabeled data. Even in the few example cases where we labeled the data for visualization purposes, those labels were removed before feeding the data into PCA, Isomap, and K-Means, and only added back in at the end when the data was graphed. You've taken a look at two types of dimensionality reduction methods that aim to extract a simplified version of the nature of your dataset. And you've also experimented with a clustering algorithm that ranks similarity and attempts to minimize within-cluster differences between the centroid and samples. This is a good time to document these algorithms, along with their use cases, in your **course map**. You should take note of the strengths and weaknesses of each algorithm, as well as sample use cases for when you should use each one.

Every single machine learning class, or *estimator* as SciKit-Learn calls them, implements the `.fit()` method as you've seen. This will continue to hold true for the supervised ones as well. The unsupervised estimators also allowed you to make use of the following methods:

+ `.transform()` : without changing the number of samples, alters the value of each existing feature by changing its units
+ `.predict()` : only with clustering, you could predict the label of the specified sample

For the rest of this course, you're going to focus on supervised learning algorithms. The main difference between supervised and unsupervised learning is that with supervised learning, you actually guide the machine's *hand* at choosing the right answers. By showing the computer examples of what you want it to do, instead of just asking it to tell you something interesting about your data, the computer's responsibility shifts to deriving a set of rules that when applied to raw data, has a decent chance of choosing the answers you've trained it to. What you'll now see is that supervised learning estimators implement a slightly different set of distinct methods:

+ `.predict()` : After training your machine learning model, you can predict the labels of new and never seen samples
+ `.predict_proba()` : For some estimators, you can further see what the probability of the new sample belonging to each label is
+ `.score()`: The ability to score how well your model fit the training data
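The supervised methods listed above can be tried out on a toy problem. The sketch below uses `KNeighborsClassifier` purely as an example estimator, with made-up one-feature data where the label is 1 for values above 4; any classifier exposing these methods would serve equally well.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training data: one feature, label is 1 when the value is > 4
X = [[0], [1], [2], [3], [4], [5], [6], [7], [8], [9]]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

model = KNeighborsClassifier(n_neighbors=3).fit(X, y)

new_samples = [[1.5], [8.5]]
pred = model.predict(new_samples)          # predicted label per sample
proba = model.predict_proba(new_samples)   # one probability column per label
acc = model.score(X, y)                    # mean accuracy on the data passed in
```

Note that `.score()` here is evaluated on the same data the model was trained on, so it says little about how the model would handle samples it has never seen.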

### Overfitting & Scoring Results

While training your machine with supervised learning, it's important to track how well it's performing. Consider the spam email identification example we presented in The Possibilities section from The Big Picture module: Imagine you had 100 spam emails and 100 regular emails and you trained an estimator to differentiate between the two. And it did so *perfectly*. What comes next? With such an exact estimator that 100% reliably labels emails, you'd probably want to integrate it into your email client. Do you see a problem here?

Every email that you quizzed your estimator on, it has actually seen before during its training! You were telling it, this email is spam, this email isn't spam, etc. What you've essentially done is create a basic model that simply regurgitates the label you've *already given* for any specific email. That's not machine learning. In fact, that's no more remarkable than opening up a text file, saving something inside of it, and then being amazed when the exact same text is found within the text file upon reopening. This is called **overfitting**. Your goal with machine learning is to create a generalizable algorithm that can be applied to data it hasn't seen yet and still do the task you've trained it to do. In our case, that means properly classifying the spam status of an email.

To make this possible, of course you'll still need to fit your estimator; but when it comes to *testing* it, that part will have to be done with data the estimator has never seen before. In other words, all of your training **transformation and modeling** needs to be done using just your training data, without ever seeing your testing data. This will be your way of validating the true accuracy of your model. This is doable by splitting your training data into two portions. One part will actually be used for the training as usual, but the other part of the data is retained and used during testing only. How much data should you hold back? If you hold back too much data then your algorithm's performance is going to suffer, since you didn't train it well. If you don't hold back enough, you won't have a statistically sufficient number of 'quiz' questions to gauge your machine learning model with.
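The "transformation and modeling on training data only" rule can be sketched with PCA. The data here is random and the 75/25 hold-back is done by hand with a shuffled index, purely for illustration; the point is that `.fit()` only ever sees the training rows.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))   # hypothetical feature matrix

# Shuffle the row order, then hold back the last 25 rows for testing
order = rng.permutation(len(X))
X_train, X_test = X[order[:75]], X[order[75:]]

# The transformation is learned from the training rows only
pca = PCA(n_components=2).fit(X_train)

X_train_2d = pca.transform(X_train)
X_test_2d = pca.transform(X_test)   # testing rows are transformed, never fit
```

Fitting PCA on the full dataset before splitting would let information about the testing rows leak into the transformation, which defeats the purpose of holding them back.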

#### SciKit-Learn Implementation

SciKit-Learn helps you split your data:

```python
from sklearn.model_selection import train_test_split

data = [0,1,2,3,4,5,6,7,8,9]    # input dataframe samples
labels = [0,0,0,0,0,1,1,1,1,1]  # the function we're training is " > 4"
data_train, data_test, label_train, label_test = train_test_split(data, labels, test_size=0.5, random_state=7)
print('data_train:  ', data_train)
print('data_test:   ', data_test)
print('label_train: ', label_train)
print('label_test:  ', label_test)
```

```
data_train:   [9, 7, 3, 6, 4]
data_test:    [8, 5, 0, 2, 1]
label_train:  [1, 1, 0, 1, 0]
label_test:   [1, 1, 0, 0, 0]
```


Notice how your data is held separately from your labels. This is important! It's difficult enough for your machines to figure out how to mathematically get from your raw data to your labels without the two being inter-jumbled. We've used df as a shorthand for dataframe but in the future we will start using a more conventional X for our data, and y for our answer labels.

Unless you specify otherwise, it'll hold back 25% of your data in the validation or testing set, and 75% will stay in the original training set. Each time you run `train_test_split()` it holds back a randomly shuffled selection of your data, so one thing you'll start noticing is that successive trials of your algorithm may actually produce slightly different accuracy levels. This is normal, so do not be alarmed. If you absolutely need the results to come back identically, such as if you're doing a demo, then you can pass in an optional random_state variable to make the shuffle, and therefore the split, reproducible.
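A quick way to see both behaviors is to split a small list without specifying `test_size`, then repeat the split with the same `random_state`. The 20-element data below is made up for the demonstration.

```python
from sklearn.model_selection import train_test_split

data = list(range(20))
labels = [int(d > 9) for d in data]

# No test_size given: a quarter of the samples are held back
d_train, d_test, l_train, l_test = train_test_split(data, labels, random_state=7)

# Same random_state, same split
d_train2, d_test2, l_train2, l_test2 = train_test_split(data, labels, random_state=7)

print(len(d_train), len(d_test))
print(d_test == d_test2)
```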

After you've trained your model against the **training** data (`data_train`, `label_train`), the next step is testing it. You'll use the `.predict()` method of your model, passing in the **testing** data (`data_test`) to create an array of predictions. And then you'll gauge its accuracy against the true `label_test` answers. SciKit-Learn also has a method to help you do that:

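```python
from sklearn.metrics import accuracy_score

# my_model stands in for any estimator you have already fit on the training data
predictions = my_model.predict(data_test)

accuracy_score(label_test, predictions)                   # fraction of correct predictions
accuracy_score(label_test, predictions, normalize=False)  # number of correct predictions
```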
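To see the two forms of `accuracy_score` side by side without needing a trained model, the sketch below uses a hand-written list of predictions with one deliberate mistake. The prediction values are made up for illustration.

```python
from sklearn.metrics import accuracy_score

label_test = [1, 1, 0, 0, 0]
predictions = [1, 0, 0, 0, 0]   # hypothetical model output with one mistake

# Default: the fraction of predictions that match the true labels (4 of 5)
fraction = accuracy_score(label_test, predictions)

# normalize=False: the number of matching predictions instead
count = accuracy_score(label_test, predictions, normalize=False)
```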

We covered a lot of important material here that you'll want to make sure gets added to your Evaluating phase on the **course map**. You'll only know if your models are performing well after evaluating them, so being able to reliably grade your models is a precursor to driving intelligence from your raw data.

#### Knowledge Checks

##### Review Question 1

One of the following statements is actually a lie. Select the statement that is inaccurate from the list below:

+ **Overfitting means your machine learning algorithm is performing at 100% and has been over trained correct**
+ In supervised learning, you provide the algorithm the correct answers while training it
+ If you split off too many samples for testing, your training is going to suffer as a consequence
+ There shouldn't be an overlap in your training and testing dataset, because your algorithm already has the answers to the training data

##### Review Question 2

Which of these demonstrates the proper order of operations?

+ Load Data, Encode Data / Wrangle Data, KMeans the Data, Split Data
+ **Load Data, Encode Data / Wrangle Data, Split Data, Fit PCA with Training Data correct**
+ Load Data, Encode Data / Wrangle Data, PCA the Data, Split Data
+ Load Data, Encode Data / Wrangle Data, Split Data, Fit Isomap with Testing Data
+ Load Data, Encode Data / Wrangle Data, Isomap the Data, Split Data
 
##### Review Question 3

Overfitting is best described as:

+ **Your machine learning model not generalizing well against new data correct**
+ What happens when you don't split your data too much
+ Training your machine learning algorithms until they have a high level of accuracy
+ Working out more than necessary

### K-Nearest Neighbors

#### What is K-Nearest Neighbors

In this section, you're going to explore the K-Nearest Neighbors classifier. K-Neighbors and K-Means are similarly named, so people sometimes get the two confused, but they are actually different. K-Means is an unsupervised clustering algorithm, whereas K-Neighbors is a supervised classification algorithm. If clustering is the process of separating your samples into groups, then classification is the process of assigning samples **into** those groups: given a set of groups, take a set of samples and mark each sample as being a member of a group. Each group is the correct answer, label, or *classification* of the sample.

The K-Nearest Neighbors, or K-Neighbors classifier, is one of the simplest machine learning algorithms. Due to its simplicity, machine learning dabblers tend to start their journey by building out this classifier themselves. We urge you to do the same if you'd like to go that route.

The thought process behind K-Neighbors is that almost all information can be modeled on a continuous basis if you just "zoom in" close enough. We observe this in real life from the atomic to the cosmic scales. In atoms, you'll never encounter a proton resting in the clouds where electrons orbit. Instead you'll only find other electrons. Similarly, you won't encounter stray electrons hanging out in the nucleus where protons and neutrons are packed. So if your purpose were to name the type of a particle and you knew its location, you could infer its type based on that. If you look at a neighborhood on Zillow or RedFin and examine house prices, houses near one another tend to have similar sale values. Just by knowing the prices of neighboring houses, you have a good way of discerning what an unidentified house might sell for(1). If you train your eyes on the stars far away, those belonging to the same galaxy appear closer to one another than those belonging to different galaxies, and so on.

(1) Note: This example is actually an example of nearest neighbors regression instead of nearest neighbors classification, but it follows the same principle.

#### How Does K-Neighbors Work?

You've actually already seen the major portion of the K-Neighbor algorithm in action as an interim step in Isomap's process. Isomap used K-Neighbors in an unsupervised way to build a neighborhood map. The K-Neighbors classifier takes that a step further by applying more logic to actually label new and never-before-seen records.

K-Nearest Neighbors works by first simply storing all of your training data samples.

Then in the future, when you attempt to check the classification of a new, never-before-seen sample, it finds the nearest "K" number of samples to it from within your training data. You must have numeric features in order for 'nearest' to be meaningful. There are other methods you can use for categorical features. For example, you can use bag of words to vectorize your data. Even so, you may want to experiment with other distance measures, such as cosine similarity. But at the end of the day, SciKit-Learn's K-Nearest Neighbors only supports numeric features, so you'll have to do whatever has to be done to get your data into that format before proceeding. The distance will be measured as a standard Euclidean distance.

With the nearest neighbors found, K-Neighbors looks at their classes and takes a mode vote to assign a label to the new data point. Further extensions of K-Neighbors can take into account the distance to the samples to weigh their voting power. For each new prediction or classification, the algorithm has to again find the nearest neighbors to that sample in order to call a vote for it. This process is where the majority of the time is spent, so instead of using brute force to search the training data as if it were stored in a list, tree structures are used to optimize the search times. Due to this, the number of classes in your dataset doesn't have a bearing on its execution speed; only the number of records in your training data set does.
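If you'd like to try building the classifier yourself, the store, search, and vote steps can be sketched in a few lines of plain Python. This is a simplified illustration, not SciKit-Learn's implementation: `knn_predict` is a hypothetical helper, and it uses the brute-force list search rather than a tree structure.

```python
import math
from collections import Counter

def knn_predict(train_samples, train_labels, new_sample, k=3):
    """Classify new_sample by a majority vote of its k nearest training samples."""
    # Brute-force search: measure the Euclidean distance to every stored sample
    distances = [
        (math.dist(sample, new_sample), label)
        for sample, label in zip(train_samples, train_labels)
    ]
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Mode vote: the most common class among the k nearest neighbors wins
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# "Training" is nothing more than keeping these two lists around
X_train = [[0], [1], [2], [3]]
y_train = [0, 0, 1, 1]

print(knn_predict(X_train, y_train, [1.1]))  # 0
print(knn_predict(X_train, y_train, [2.6]))  # 1
```

Every call scans all of the training samples again, which is why the search step dominates the running time as your training set grows.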

##### Decision Boundaries

A distinguishing feature of supervised classification algorithms is their decision boundaries, or more generally, their n-dimensional decision surface. These boundaries are somewhat similar to the event horizon of a black hole: whenever performing classification, there exists a threshold or region that, once crossed, results in your sample being assigned that class.

The decision surface isn't always spherical. In fact, it can take many different shapes depending on the algorithm that generated it. Some decision boundaries are linear and take the form of a hyperplane. Others are twisted and contorted and look like strung-up Swiss cheese. In the following labs, you'll dedicate some time to becoming familiar with the decision surface options for each classification method you cover.

For K-Neighbors, generally the higher your "K" value, the smoother and less jittery your decision surface becomes. Higher K values also result in your model providing probabilistic information about the ratio of samples per each class. There is a tradeoff though, as higher K values mean the algorithm is less sensitive to local fluctuations since farther samples are taken into account. This causes it to only model the overall classification function without much attention to detail, and increases the computational complexity of the classification.

<img src='https://courses.edx.org/asset-v1:Microsoft+DAT210x+4T2016+type@asset+block@DecisionSurface.mp4' style='height: 350px;'>

##### SciKit-Learn and K-Neighbors

One way to understand supervised machine learning is to imagine your computer solving a complicated series of math equations, completely devoid of operators. `5 + 5 = 10` is easy to solve, but what if you had to solve the following:

+  8     4     2     6   
+ 12   108    36   1/2

Not easy anymore, is it? Supervised machine learning essentially places the operators in the equation in order to make it work:

+ (8  +   4)   /    2   =    6
+ 12  = (108   +   36)  ^ (1/2)

It's hard enough coming up with the operators without the added burden of having to figure out which column is the *solution* the algorithm is supposed to be guided towards. This is why classification algorithms need you to separate out your answers or **labels** from the rest of your feature columns. When you download labeled datasets from the internet or even make your own, you'll see a classification column that tells you the class of each sample. This column will be right in your feature space. It's your responsibility to splice it out and drop the column once you've finished loading your dataset, but before any transformations, modeling, or test / train splitting:

```python
# Process:
import pandas as pd

# Load a dataset into a dataframe
X = pd.read_csv('data.set', index_col=0)

# Do basic wrangling, but no transformations
# ...

# Immediately copy out the classification / label / class / answer column
y = X['classification'].copy()
X.drop(labels=['classification'], inplace=True, axis=1)

# Feature scaling as necessary
# ...

# Machine Learning
# ...

# Evaluation
# ...
```

SciKit-Learn's KNeighbors classifier interface is similar to the other machine learning algorithms you've seen so far, but there are a few notes to be aware of. As mentioned, when you fit your model, you now need to supply a second classification array in addition to your (now) unlabeled samples. The labeling vector should be an array of shape `[n_samples]`, and it should contain the classification or label for each training sample. Alternatively, if you're predicting multiple outputs, your array shape should be `[n_samples, n_outputs]` and again, each array value would contain the respective classification.

The KNeighborsClassifier class constructor takes in a few arguments, most optional:

+ **n_neighbors** The number of neighbors to consider. Keep it odd when doing binary classification, particularly when you use uniform weighting.
+ **weights** How to count the votes from the neighbors; does everyone get an equal vote, a weighted vote, or something else?
+ **algorithm** You can select an optimization method for searching through your training data set to find the nearest neighbors.

And here is the algorithm in action:

```python
# From now on, you only train on a "portion" of your dset:
import pandas as pd
X_train = pd.DataFrame([ [0], [1], [2], [3] ])
y_train = [0, 0, 1, 1]

from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

# You can pass in a dframe or an ndarray
print(model.predict([[1.1]]))
print(model.predict_proba([[0.9]]))
```

```
[0]
[[ 0.66666667  0.33333333]]
```


##### Gotchas!

By now, you're already familiar with the highlights of K-Neighbors. It's easy to apply and make sense of because it mimics our own understanding of reality. Even its implementation is straightforward, no curve balls or tricks to it. Unlike other algorithms, training is immediate because it simply stores all of your training data. This has the added benefit that if you were to come into contact with more labeled data in the future, you could introduce it into your model without having to do heavy computations to rebuild it. Because of this, K-Neighbors is sometimes referred to as a memory based, 'lazy' machine learning algorithm, as it delays doing calculations until you start classifying. 

K-Neighbors being the first supervised learning model you've encountered in this course, you might not be aware of how the other models behave. But keep in the back of your mind that K-Neighbors is particularly useful when no other model fits your data well, as it is a non-parametric approach to classification: it makes no assumptions about the shape of your data. So for example, you don't have to worry about whether your data is linearly separable.

One caution point to keep in mind while using K-Neighbors is that your data needs to be measurable. If there is no metric for discerning distance between your samples, K-Neighbors cannot help you. As with all algorithms dependent on distance measures, it is also sensitive to feature scaling. K-Neighbors is also sensitive to perturbations and the local structure of your dataset, particularly at lower "K" values.
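The sensitivity to feature scaling is easy to demonstrate. In this small sketch, the helper names and the two-feature data are made up for illustration. The same sample gets a different nearest neighbor once both features are rescaled to a comparable range:

```python
import math

def nearest_label(train, labels, sample):
    """Return the label of the single closest training sample (K=1)."""
    distances = [math.dist(row, sample) for row in train]
    return labels[distances.index(min(distances))]

def min_max_scale(rows, lows, highs):
    """Rescale each feature to the 0-1 range using the given minimums and maximums."""
    return [[(v - lo) / (hi - lo) for v, lo, hi in zip(row, lows, highs)]
            for row in rows]

# Hypothetical features: [height in metres, weight in grams]
train = [[1.0, 5000.0], [2.0, 5100.0]]
labels = ['small', 'tall']
sample = [[1.9, 5030.0]]

# Raw units: the gram column dominates the distance calculation
print(nearest_label(train, labels, sample[0]))                # small

lows = [min(col) for col in zip(*train)]
highs = [max(col) for col in zip(*train)]
scaled_train = min_max_scale(train, lows, highs)
scaled_sample = min_max_scale(sample, lows, highs)

# Scaled: both features now contribute comparably
print(nearest_label(scaled_train, labels, scaled_sample[0]))  # tall
```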

On the other hand, with large "K" values, you have to be more cautious of the overall class distribution of your samples. If 30% of your dataset is labeled **A** and 70% is labeled **B**, with high enough "K" values, you might experience K-Neighbors unjustly giving preference to **B** labeling, even in those localities of your dataset that should be properly classified as **A**.
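Here is a toy illustration of that effect, using a made-up one-dimensional dataset with the same 30% / 70% split and a hypothetical `knn_predict` helper. The sample sits right inside the **A** cluster, yet a large enough "K" pulls in so many **B** neighbors that the vote flips:

```python
import math
from collections import Counter

def knn_predict(train, labels, sample, k):
    """Majority vote among the k training samples closest to `sample`."""
    ranked = sorted(zip(train, labels), key=lambda pair: math.dist(pair[0], sample))
    return Counter(label for _, label in ranked[:k]).most_common(1)[0][0]

# Hypothetical 1-D data: 3 'A' samples clustered near 0, 7 'B' samples near 10
train = [[0.0], [0.5], [1.0]] + [[9.0 + 0.2 * i] for i in range(7)]
labels = ['A'] * 3 + ['B'] * 7

sample = [0.4]  # sits squarely inside the 'A' cluster
print(knn_predict(train, labels, sample, k=3))  # A
print(knn_predict(train, labels, sample, k=9))  # B
```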

#### Knowledge Checks

##### Review Question 1

Classification is the process of...

+ Looking for groups of samples based only on their features
+ Labeling samples based on their neighbors
+ **Identifying the group membership of samples correct**
+ Grouping similar samples, and then assigning a label or class to them

**Explanation**
+ Looking for groups of samples based on their features is called clustering. 
+ Labeling samples based on their neighbors is a type of classification, called K-Neighbors classification. But there are other types of classification as well that behave differently. 
+ Classification really is the process of assigning groups to samples.


##### Review Question 2

The main similarity between K-Neighbors and K-Means is...

+ **They both use distance functions to tackle the problem of group assignment correct**
+ They both have the same, non-linear, decision boundary
+ They both have a K in their names
+ They both are classification algorithms that aim to assign a label to your samples

#### Assignment 5

##### Lab Assignment 5

Remember that wheat dataset you used while exploring visualizations? It's about to make a comeback! While learning the many classification algorithms we're going to cover in the next few sections, it's a good idea to have a 'benchmark dataset' to come back to, so you can compare the performance and accuracy of other algorithms.

1. Start by looking through the starter code /Module5/**assignment5.py** and /Module5/Datasets/**wheat.data**
2. Complete the assignment except for the bonus instruction.
3. Try experimenting with other feature scaling methods, in addition to normalize(), to see how they affect the decision boundary.
4. Then, answer the following questions.

##### Lab Questions 1

Please enter a numeric value (e.g. 0, 1, 10.5, etc) which correctly answers the question(s) below:

What is the accuracy score of your KNeighbors Classifier when K=9 (Enter as a decimal)?

*Answer*: **0.871428571429**

**Explanation**

Follow the steps in the starter code file. Each 'TODO' should be accomplishable with just 1-4 lines of code. The accuracy value you should be getting if you set the random state properly is **0.871428571429.**

##### Lab Questions 2

Decrease K by 1 and record the new accuracy score. Keep doing this until you get down to, and including, K=1. Concerning the scores you saw:

+ It always decreased as K decreased
+ It always increased as K increased
+ I did not see the same accuracy reading again
+ **I eventually got to the same accuracy reading, but overfit my data**

Congratulations on training your computer to identify wheat kernels! As you know, PCA throws away some of your data. Yet, you were able to get the high accuracy level you got in this lab by applying KNeighbors to just **two** principal components! If you're ready for a bonus experiment, remove both the PCA code as well as the visualization code from the lab. Run the KNeighbors Classifier on your entire X_train dataset and see how it performs compared to the PCA-only version you just completed above. Does it perform better? Or worse?

You can also try properly encoding the wheat_type series as a dummy feature, spanning three columns. If you attempt that, be sure to adjust your `.predict()` and `.score()` methods to fit.

#### Assignment 6

##### Lab Assignment 6

In this assignment, you'll flex your understanding of Isomap and KNeighbors, as well as practice splitting your data for testing and evaluation by taking your Module4/**assignment4.py** lab to the next level. If you haven't been able to complete module four's labs or haven't fully understood them, take a moment to re-do them all before proceeding.

This assignment was engineered to be truer to the life of a data scientist by being more challenging than previous ones, so do not be disheartened. If data explorers only needed to drop their observations into black-box algorithms without investing time to toggle parameters, and experiment and understand what those algorithms were truly doing to their data, they wouldn't be valued as much.

In module four's fourth lab assignment, you explored using isomap, an indispensable tool to have while working with non-linear datasets. Your goal this time is to train the KNeighborsClassifier to identify what direction a face is pointing towards: either up, down, left, or right.

<img src='https://courses.edx.org/asset-v1:Microsoft+DAT210x+4T2016+type@asset+block@Facing.png'>

This data takes the form of image samples that have been transformed either using PCA to reduce their linear dimensionality, or isomap to non-linearly do similar. Start by reviewing your lab work in the Module4/**assignment4.py** file before opening up the /Module5/**assignment6.py** starter code. You will need access to the **face_data.mat** file from Module four, as well as the new Module5/**face_labels.csv** file.

1. Add in the Module4/assignment4.py code responsible for: loading up the .mat file, properly rotating its images, and storing the whole thing into a Pandas dataframe object.
2. Load your face_labels.csv classifications file into a dataframe. Make sure your dataframe and your .csv file align properly and start from the same values! This classification dataframe only has a single column in it, so create a series (a slice) that selects only that column and save it as label.
3. Do your train_test_split just as directed in the reading. Set random_state=7 as documented. Your variables should be: data_train, data_test, label_train, and label_test.
4. Fill out the code for PCA, Isomap, and KNeighborsClassifier. Both PCA and Isomap should be reducing your training data's dimensionality down to 2D. You're free to experiment with different K values for KNeighborsClassifier.
5. Calculate the accuracy of your model against the test dataset / test labels using .score() and print it out.
6. Answer the questions below:

##### Lab Question 1

Please enter a numeric value (e.g. 0, 1, 10.5, etc) that correctly answers the question(s) below:

Enter the accuracy value reported for your KNeighbors model, after doing a test/train split (test_size = 15%, random_state = 7) and using ISOMAP (5Neighbors, 2Components) to transform your data:

*Answer*: **0.961904761905**

**Explanation**

Refer to the previous labs to set up your dataset. Then, start by splitting your data according to the instructions. Fit isomap against the training data, then use the trained isomap model to transform both test+train data. Finally, fit your KNeighbors model against the training data and use it to calculate the accuracy score against your test data. The value should be: **0.961904761905**

##### Lab Question 2

Only one of the following setups is ideal if you plan on using SciKit-Learn's KNeighbors classifier to predict the label of your samples after transforming them. Which is it?

+ Fit and transform your data using PCA or Isomap. Split your data. Then fit the KNeighbors model against the training data and labels. Then predict the class of your testing data.
+ Fit and transform your data using PCA or Isomap. Then fit the KNeighbors model against your data and labels. Then split your data and predict the class of your testing data.
+ Use preprocessing to scale your training and testing data. Split your data. Fit and transform your training data using PCA or Isomap, and fit the KNeighbors model against the training data and labels. Then predict the class of your testing data.
+ **Split your data. Fit any desired preprocessors, such as scaling and / or PCA and Isomap on your training data, and apply the transformations to both training and testing data. Fit KNeighbors against the training data and labels. Then predict the class of your testing data.**
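The key idea in the correct setup is that a preprocessor learns its parameters from the training data only, and those same parameters are then applied to the testing data. A minimal sketch of that pattern, using a hand-written min-max scaler as a stand-in for any SciKit-Learn preprocessor (the function names and the data here are hypothetical):

```python
def fit_min_max(rows):
    """Learn per-feature minimums and maximums from the given rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]

def transform(rows, lows, highs):
    """Apply previously learned minimums and maximums to any rows."""
    return [[(v - lo) / (hi - lo) for v, lo, hi in zip(row, lows, highs)]
            for row in rows]

# Hypothetical, already-split data
data_train = [[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]]
data_test = [[2.0, 60.0]]

lows, highs = fit_min_max(data_train)             # fit on training data only
train_scaled = transform(data_train, lows, highs)
test_scaled = transform(data_test, lows, highs)   # reuse the training fit

print(train_scaled)  # [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
print(test_scaled)   # [[0.25, 1.25]]
```

Notice that the test sample lands outside the 0 to 1 range on its second feature. That is expected: the testing data never influenced the fit, just as truly new data wouldn't.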

#### Assignment 7

##### Lab Assignment 7

Breast cancer usually starts from an uncontrolled growth of the cells that make up the milk-producing ducts. While fairly uncommon in men (less than 0.1% experience it), according to BreastCancer.org, one in eight women (12%) end up developing a malignant form of breast cancer over the course of their lifetime. These invasive cells form tumors that destroy nearby tissue, can spread to other parts of the body, and if not duly addressed, may result in death. To put things into perspective, in the U.S., roughly [600 women die per year](http://www.cdc.gov/reproductivehealth/MaternalInfantHealth/Pregnancy-relatedMortality.htm) due to pregnancy-related complications... yet over [40,000 die per year](http://www.breastcancer.org/symptoms/understand_bc/statistics) due to breast cancer.

Breast cancer doesn't develop overnight and, like any other cancer, can be treated extremely effectively if detected in its earlier stages. Part of understanding cancer is knowing that not all irregular cell growths are malignant; some are benign, or non-dangerous, non-cancerous growths. A benign tumor does not mean the mass doesn't increase in size, but only means it does not pose a threat to nearby tissue, nor is it likely to spread to other parts of the body. The mass simply stays wherever it's growing. Benign tumors are actually pretty common, such as moles and some warts. Being able to properly assess whether a tumor is actually benign and ignorable, or malignant and alarming, is therefore of importance, and is also a problem that might be solvable through data and machine learning.

In this lab, you'll be using the [Breast Cancer Wisconsin Original]() data set, provided courtesy of UCI's Machine Learning Repository. A copy of the dataset is located at Module5/Datasets/**breast-cancer-wisconsin.data**. Here are the column names, which you can read more details about on the dataset's information page: `['sample', 'thickness', 'size', 'shape', 'adhesion', 'epithelial', 'nuclei', 'chromatin', 'nucleoli', 'mitoses', 'status']`.

1. Open up the starter code located in Module5/assignment7.py, and as usual, read through it entirely.
2. Load up and clean up the dataset, and follow the written directions to split your data, do feature scaling since the features use different units, and then implement PCA and IsoMap so you can test the performance of both, as the technique used to reduce the dimensionality of the dataset down to two variables.
3. Train KNeighborsClassifier on the 2D projected training dataset, then score KNeighborsClassifier on the 2D projected testing dataset.
4. Finally, plot the decision boundary for visual confirmation.

##### Lab Question 1

Code up everything as instructed in the assignment. Experiment with various SKLearn preprocessing scaler classes, such as: MaxAbsScaler(), MinMaxScaler(), StandardScaler(), Normalizer(), RobustScaler(), and of course no scaling at all.

Overall, which produced the best result (highest accuracy when scoring against testing data)?

+ Normalizer()
+ **MinMaxScaler()**
+ RobustScaler()
+ StandardScaler()

##### Lab Question 2

It's important to always keep the objective of the problem you're solving in mind. In this case, your goal is to come up with a way to classify tumor growths as benign or malignant, based off of a handful of features. This is so that a simple test can be administered to see if further action need be taken when a tumor is discovered.

There are two types of errors this classification can make, and they are NOT equal. The first is a false positive. This would be the algorithm errantly classifying a benign tumor as malignant, which would then prompt doctors to investigate it further, perhaps even schedule a surgery to have it removed. It would be wasteful monetarily and in terms of resources, but not much more than that.

The other type of error would be a false negative. This would be the algorithm incorrectly classifying a dangerous, malignant tumor as benign. If that were to occur, the tumor would be given time to progress into later, more serious stages, and could potentially spread to other parts of the body. A much more dangerous situation to be in.

The KNeighbors classifier in SciKit-Learn gives you the ability to specify weights when initializing the object. By default, these weights are set to 'uniform', so every "K" neighbor has an even vote. It also allows you to specify 'distance', where the votes are scaled inversely proportionally to their distance from the sample being classified (1/d). Lastly, it allows you to specify a user-definable function.
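To make the difference between an even vote and a 1/d vote concrete, here is a rough pure-Python sketch. It is not SciKit-Learn's code: `weighted_vote`, `inverse_distance`, and the toy data are hypothetical, and the weight functions simply take a list of distances and return an equally sized list of weights.

```python
import math
from collections import defaultdict

def inverse_distance(distances):
    """Weight function: takes a list of distances, returns equally sized weights."""
    # Assumes no distance is exactly zero (an exact match would divide by zero)
    return [1.0 / d for d in distances]

def weighted_vote(train, labels, sample, k, weight_fn):
    """Sum each class's neighbor weights and return the heaviest class."""
    ranked = sorted(zip(train, labels), key=lambda pair: math.dist(pair[0], sample))[:k]
    weights = weight_fn([math.dist(row, sample) for row, _ in ranked])
    totals = defaultdict(float)
    for (_, label), weight in zip(ranked, weights):
        totals[label] += weight
    return max(totals, key=totals.get)

# Hypothetical 1-D data: one very close 'malignant' sample, two farther 'benign' ones
train = [[1.0], [3.0], [3.5]]
labels = ['malignant', 'benign', 'benign']
sample = [1.2]

uniform = lambda distances: [1.0] * len(distances)
print(weighted_vote(train, labels, sample, 3, uniform))           # benign
print(weighted_vote(train, labels, sample, 3, inverse_distance))  # malignant
```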

The problem is, the UDF takes in as parameters only a vector of distances and expects an equally sized vector of weights. This doesn't allow you to use a different metric on a per-class basis to weigh your samples and address the undesirability of false negatives over false positives. It's WAY more acceptable to errantly classify a benign tumor as malignant and have it removed than to incorrectly leave a malignant tumor, believing it to be benign, and then have the patient progress to full-blown cancer.

One workaround for this would be to program your own KNeighbors classifier. Another would be to "bake" the information into your dataset by taking advantage of the fact that KNeighbors is sensitive to the distribution of your variables. For example, randomly reducing the ratio of benign samples compared to malignant samples in your training set.

Between the two provided SciKit-Learn options for weighing, which one performed better on this dataset overall, given the many 'K' permutations you experimented with?

+ Uniform, because each data point should contribute to the classification equally
+ Uniform, because the dataset has an intrinsic clustering of sample values
+ **Distance, because each data point should contribute to the classification weighted by distance**
+ Distance, because this dataset only has a few samples, so weighing nearer samples is very important


### Regression

#### Linear Regression

Some time last night, you probably made a couple of decisions. Before talking about what those decisions ended up being, let's take a look at some practical features that probably influenced them:

1. Are there better cooks than you in the house? Do you even know how to cook?
2. Do you live near well-ranked and affordable restaurants?
3. How much spending money do you have on hand?
4. Are there any decent leftovers in the fridge?
5. How badly do you hate doing the dishes?
6. What ingredients do you have at home?
7. How is the current weather outside?
8. How hungry are you right now?

Armed with the answers to the above features, you are ready to make a few decisions, such as:

+ Should I eat out tonight, or cook at home?
+ Should I cook something new, or heat leftovers up?
+ Should I cook tonight, or ask someone else in the house to?

These questions are all examples of categorical decisions you can calculate with a supervised classification algorithm. Such algorithms derive weights for the contribution each feature has to determining the overall outcome. You can either eat out at a restaurant or eat at home, but you can't eat out and eat at home simultaneously; only a single decision at a time.

<img src="https://courses.edx.org/asset-v1:Microsoft+DAT210x+4T2016+type@asset+block@DecisionBoundary.png" style="height: 300px;"/>

<img src="https://courses.edx.org/asset-v1:Microsoft+DAT210x+4T2016+type@asset+block@Regression.png" style="height: 300px;"/>

The main difference between classification and regression algorithms is that regression aims to compute a continuous output, but the goal of classification is to predict a discrete, categorical output. Using classification, samples get labeled depending on a decision boundary test that separates your data into a range of space. With regression, a continuous value output is calculated from a best fit curve function that runs through your data. In the special case of linear regression, the curve is restricted such that it is linear. Given the features listed above, a regression algorithm would enable you to calculate continuous values like:

+ How far are you willing to drive to eat out?
+ How much money can you save by cooking at home?
+ How much time are you willing to invest cooking at home?

Effectively predicting the future, known as *extrapolating*, or identifying a trend in your existing data, known as *interpolating*, requires there be a statistically significant, linear correlation between your features. Without a decent correlation, linear regression isn't able to benefit you. Check the Visualization module's Higher Dimensionality 'imshow' section to see methods of discerning correlation.

You may also have heard the phrase, [correlation doesn't imply causation](https://en.wikipedia.org/wiki/Correlation_does_not_imply_causation). This is an easy trap to fall into when using regression. You can build a linear regression model that fits a relationship between university students' GPAs and their first job's annual salary. But simply having a high GPA doesn't cause someone to have a high paying job, although there is probably some correlation between the two.

#### How Does Linear Regression Work?

Everything that simple linear regression does can be explained with this single figure:

<img src="https://courses.edx.org/asset-v1:Microsoft+DAT210x+4T2016+type@asset+block@LinRegExplained.png" style="">

The *simple* part just means that you're only relating a single independent feature "x" to the dependent feature or output "y". You will see that upgrading this to a full-blown, multivariate linear regression is just an exercise of adding in additional coefficients, so let's first walk through the simple case.

You probably recall a certain equation from school that looked like this: `y = mx + b`. This is the basic equation of linear regression and is the formula for the green line above, however we're going to alter the variable naming convention slightly, just so that it matches the SciKit-Learn documentation: `y = w0 + w1x`.  All that we've done so far is change **m** to **w1**, and **b** to **w0**. The w's stand for `weight coefficients`, our currently unknown parameters for calculating y given x. From the diagram it's clear that w0 corresponds to the y-intercept, the offset between the green line and the x-axis at x = 0. As for w1, that is the slope: the quotient of the change of your dependent variable y and the change of your independent variable x. That's pretty much it! Linear regression is all about computing a scalar output as a linear combination of weights multiplied by independent features.

##### So How Do We Find the Weights?

SciKit-Learn uses a technique called ordinary least squares to compute the weight coefficients and intercept needed to solve for the best fitting line that goes through your samples. In the figure above, each of the black dots represents one of your samples and of course the green line is your least squares, best fitting line. The red lines represent distances between the true, observed values of your samples and the least squares line we're hoping to calculate. Stated differently, these distances are the errors between the approximate solution and the actual values. Ordinary least squares works by minimizing the *sum of the squares* of all these red line errors to compute the best fitting line. If you're wondering why the sum of squares is used instead of the sum of the absolute values, or even just the regular sum, a note about that is included in the Dive Deeper section.
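For the simple, single-feature case, the least squares weights have a well-known closed form, so you can compute them by hand. The sketch below is an illustration rather than SciKit-Learn's implementation; the function names and sample values are made up:

```python
def fit_simple_ols(xs, ys):
    """Return (w0, w1) minimizing the sum of squared errors for y = w0 + w1*x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x
    w1 = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
          / sum((x - x_mean) ** 2 for x in xs))
    w0 = y_mean - w1 * x_mean   # the fitted line passes through (x_mean, y_mean)
    return w0, w1

def sum_squared_error(xs, ys, w0, w1):
    """Total of the squared 'red line' lengths for a candidate line."""
    return sum((y - (w0 + w1 * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical samples scattered around y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]

w0, w1 = fit_simple_ols(xs, ys)
print(w0, w1)

# The fitted line beats a nearby guess (w0=1, w1=2) on squared error
print(sum_squared_error(xs, ys, w0, w1) < sum_squared_error(xs, ys, 1.0, 2.0))
```

Try nudging `w0` or `w1` away from the fitted values and recomputing the error; it can only go up.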

Once you have the equation, you can use it to calculate an expected value for feature y, given that you have feature x. If the x values you plug into the equation happen to lie within the x-domain boundary of those samples you trained your regression with, then this is called interpolation or even approximation, because you do have the actual observed values of y for the data in that range. When you use the function to calculate a y-value outside the bounds of your training data's x-domain boundary, that is called extrapolation.

##### Multivariate Linear Regression

To take it to the next level, more than just one variable has to be considered. To do this, SciKit-Learn just tacks on more weight terms, multiplied by the additional features. So:

$$y = w_{0} + w_{1} x$$

Turns into:

$$y = w_{0} + w_{1} x_{1} + w_{2} x_{2} + \dots + w_{n} x_{n}$$

Where n is the number of features examined. PCA works similarly, in a way, to linear regression, but instead of taking one variable as dependent on the rest, PCA makes no such assumption and considers all variables equally. As a result, it attempts to minimize the perpendicular distance between the points and the line itself, which can span multiple features / dimensions. Ordinary least squares, and by extension linear regression in its multivariate form, instead minimizes the sum of squared distances measured along *just* the dependent variable's axis. A very cool graphical explanation of the differences between linear regression and PCA has been included in the Dive Deeper section.
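As a rough sketch of the multivariate equation, the snippet below recovers `w0`, `w1`, and `w2` for two made-up features using NumPy's general-purpose least squares solver, `np.linalg.lstsq`. The data is generated from known weights so you can see them come back out; the solver is a stand-in chosen for illustration rather than a claim about SciKit-Learn's internals.

```python
import numpy as np

# Hypothetical samples with two features each (columns x1 and x2)
X = np.array([
    [1.0, 2.0],
    [2.0, 1.0],
    [3.0, 4.0],
    [4.0, 3.0],
    [5.0, 5.0],
])

# Targets built from known weights: y = 1 + 2*x1 + 3*x2
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# Prepend a column of ones so the first weight acts as the intercept w0
A = np.column_stack([np.ones(len(X)), X])

# Solve for the weights that minimize the sum of squared errors
weights, *_ = np.linalg.lstsq(A, y, rcond=None)
w0, w1, w2 = weights

print(w0, w1, w2)
```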

#### When To Use It

##### When Should I Use Linear Regression?

Linear regression is widely used in all disciplines for forecasting upcoming feature values by extrapolating the regression line, and for estimating current feature values by interpolating the regression line over existing data. It is an extremely well-understood and interpretable technique that runs very fast, and produces reasonable results as long as you do not extrapolate too far away from your training data.

One of the main advantages of linear regression over other machine learning algorithms is that even though it's a supervised learning technique, it doesn't force you to fine tune a bunch of parameters to get it working. You can literally just dump your data into it and let it produce its results.

You can use linear regression if your features display a significant correlation. The stronger the correlation, that is, the closer it is to +1 or -1, the more accurate the linear regression model for your data will be. The questions linear regression helps you answer are which independent feature inputs relate to the dependent feature output variable, and the degree of that relationship.
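One quick way to check for that correlation before modeling is to compute the Pearson correlation coefficient with `np.corrcoef`. The sketch below uses invented values: one target that closely tracks the feature, and one that doesn't.

```python
import numpy as np

# Hypothetical feature and two hypothetical targets
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y_strong = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.0])  # rises steadily with x
y_weak = np.array([5.0, 1.0, 4.0, 2.0, 6.0, 3.0])      # no clear trend

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the correlation
r_strong = np.corrcoef(x, y_strong)[0, 1]
r_weak = np.corrcoef(x, y_weak)[0, 1]

print(f"strong: {r_strong:.3f}, weak: {r_weak:.3f}")
```

A value near +1 or -1, like the first one, suggests a line will describe the relationship well. A value near 0, like the second, suggests it won't.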

In business, linear regression is often used to forecast sales. By finding a correlation between time and the number of sales, a company can predict its near-term future revenue, which will then help it budget accordingly. Linear regression can also be used to assess risk. Before issuing a loan, most banks will consider many features or aspects of their customers, and run a regression to see if it is worthwhile for them to lend the money, or if their return on investment is insignificant or even negative for that matter.

In the sciences, geologists train linear regression against historic records to calculate the rate of glacier snow melting, and can use it to extrapolate how long it'll take for it all to disappear. Oil engineers do the same while calculating how much is potentially left. When measuring experimental results, chemists use linear regression to empirically calculate and validate concentrations and expected reactions. And of course there are many more uses.

#### SciKit-Learn and Linear Regression

SciKit-Learn's LinearRegression class implements the expected methods found in the rest of their supervised learning classes, including: `fit()`, `predict()`, and `score()`. As for its outputs, the attributes you're interested in are:

+ **intercept\_** the scalar constant offset value
+ **coef\_** an array of weights, one per input feature, which will act as a scaling factor

To model a dataset using linear regression, use the following code:
```python
from sklearn import linear_model
model = linear_model.LinearRegression()
model.fit(X_train, y_train)
```

From here on out, you should separate your data into a training and testing set before fitting your predictor models, as reflected in the code above. `X_train` should be a two-dimensional [`n_samples`, `n_features`] array, where each row is a sample and each column is a feature. If you are fitting against a single feature, shape it as [`n_samples`, `1`], for example with `.reshape(-1, 1)`, since recent versions of SciKit-Learn reject a one-dimensional `[n_samples]` input. Passing in more than one feature column fits a single multivariate regression with one weight per feature.

Your `y_train` target can also be a single dimensional `[n_samples]` list of expected values, or it can be an [`n_samples`, `n_targets`] array if you are using linear regression to compute more than one feature simultaneously. When doing that, your `.intercept_` will also be an array, one value per target.
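Here is a small sketch of fitting a single feature and inspecting the resulting attributes. The data is made up and lies exactly on `y = 1 + 2x`, and the reshape step assumes a recent SciKit-Learn release that expects a two-dimensional feature array.

```python
import numpy as np
from sklearn import linear_model

# Hypothetical single-feature data lying exactly on y = 1 + 2x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x

# Reshape the feature into a [n_samples, 1] column before fitting
X = x.reshape(-1, 1)

model = linear_model.LinearRegression()
model.fit(X, y)

print(model.intercept_)  # scalar offset, w0
print(model.coef_)       # array with one weight per feature, here just w1
```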

Lastly, you'll want to see how well your model performed. The LinearRegression class comes with a convenience `.score()` method that returns the $R^2$ coefficient, and the calculation for that is included with the SciKit-Learn API documentation. This coefficient communicates how much of the variation in your output is explained by your input values. $R^2$ is beneficial because unlike the sum of squared distances, it is normalized so that the number of observations sampled does not affect it. The larger the $R^2$ score, the better the model fits your data, `1.0` being the maximum value achievable for a perfect match.

```python
import numpy as np

model.score(X_test, y_test)  # R^2 coefficient on the test set
np.sum((model.predict(X_test) - y_test) ** 2)  # sum of squared errors, for comparison
```
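If you'd like to see where the $R^2$ number comes from, the sketch below computes it by hand as one minus the ratio of the residual sum of squares to the total sum of squares, and compares it against `.score()`. The training and testing values are invented for illustration.

```python
import numpy as np
from sklearn import linear_model

# Hypothetical training and testing samples
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y_train = np.array([2.2, 3.9, 6.1, 8.0, 9.8, 12.1])
X_test = np.array([[1.5], [3.5], [5.5]])
y_test = np.array([3.0, 7.1, 10.9])

model = linear_model.LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# R^2 = 1 - (sum of squared errors) / (total variation around the mean)
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, model.score(X_test, y_test))
```

Because the squared errors are divided by the total variation, the result doesn't grow just because you tested on more samples.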

One thing to keep in mind is that the $R^2$ coefficient measured on your training data never decreases as you add more features to your linear regression, even if those features don't correlate well with your dependent feature's values. Due to this, be selective about which features you use and keep just the subset of the most promising ones; otherwise you risk overfitting.
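You can watch this happen with a small experiment. The sketch below, which uses randomly generated data with a fixed seed, starts with one genuinely useful feature and then tacks on columns of pure noise, recording the $R^2$ score on the training data each time.

```python
import numpy as np
from sklearn import linear_model

rng = np.random.default_rng(0)
n_samples = 30

# One genuinely useful feature, plus ten columns of pure noise
x_useful = rng.normal(size=(n_samples, 1))
y = 3.0 * x_useful[:, 0] + rng.normal(scale=0.5, size=n_samples)
noise_features = rng.normal(size=(n_samples, 10))

# Refit with 0, 1, 2, ... 10 noise columns appended and score on the training data
scores = []
for k in range(11):
    X = np.hstack([x_useful, noise_features[:, :k]])
    model = linear_model.LinearRegression().fit(X, y)
    scores.append(model.score(X, y))

print([round(s, 4) for s in scores])
```

The training score creeps upward even though the added columns carry no real information, which is why a score on held-out test data is the more trustworthy number.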

Finally, linear regression works with continuous data, as well as categorical data once numerically encoded. If you do end up using categorical data and have multiple dummy boolean columns you want to calculate, for example IceCream\_Vanilla, IceCream\_Chocolate, and IceCream\_CookiesNCream, then you should calculate all three of these target regression lines simultaneously. Just increase the number of `targets`, or your training labels dimension, and then each regression calculation will have its own offset stored in your `.intercept_` array attribute, and the `.coef_` attribute will become an array of arrays, one per target. In SciKit-Learn, this is called Multi-Output Linear Regression.
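To see what the multi-output attributes look like, here is a minimal sketch with two made-up numeric targets fitted at once. The targets are generic stand-ins built from known weights rather than the ice cream columns mentioned above.

```python
import numpy as np
from sklearn import linear_model

# Hypothetical samples with two features each
X = np.array([
    [0.0, 1.0],
    [1.0, 0.0],
    [2.0, 1.0],
    [3.0, 3.0],
    [4.0, 2.0],
])

# Two targets fitted at once: one depends on x1 only, the other on x2 only
Y = np.column_stack([
    1.0 + 2.0 * X[:, 0],
    -1.0 + 0.5 * X[:, 1],
])

model = linear_model.LinearRegression()
model.fit(X, Y)

print(model.intercept_.shape)  # one offset per target
print(model.coef_.shape)       # one row of weights per target
```

Each row of `coef_` lines up with one column of `Y`, so you can read off every target's regression line separately.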

#### Linear Regression Gotchas!

Linear regression is a very powerful technique if used correctly. With just a few instances of well correlated samples, linear regression can capture the underlying pattern in your dataset, making its use of your data very efficient. This applies even more so to smaller datasets.

All of the math and theory associated with linear regression is well understood, which means by modeling with it, even the multi-output version of it supported by SciKit-Learn, the results remain easily interpretable.

Linear regression isn't all roses and cherries. For one, as the name hints, it only works with linear data, by identifying linear relationships between your continuous output and your independent input features. Sometimes there simply isn't a linear correlation, and in these cases the regression completely fails. Luckily, it's pretty easy to detect this by looking at the resulting $R^2$ coefficient, or by visually plotting your data.
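As a quick sketch of that detection step, the snippet below fits one made-up dataset that really is linear and another that follows a curve, then compares their $R^2$ scores.

```python
import numpy as np
from sklearn import linear_model

# Hypothetical feature values, symmetric around zero
x = np.linspace(-3.0, 3.0, 21)
X = x.reshape(-1, 1)

y_linear = 2.0 * x + 1.0  # a genuinely linear relationship
y_curved = x ** 2         # a clear pattern, but not a linear one

r2_linear = linear_model.LinearRegression().fit(X, y_linear).score(X, y_linear)
r2_curved = linear_model.LinearRegression().fit(X, y_curved).score(X, y_curved)

print(f"linear: {r2_linear:.3f}, curved: {r2_curved:.3f}")
```

The curved data has an obvious pattern when plotted, yet its score is about as low as it gets, because no straight line can follow it.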

Under the hood, linear regression examines the relationship between the mean value of your output variable and your input variables. So if you're trying to model stock market security price as a function of the date, linear regression will only factor in the average stock price taken at different date intervals. If you wanted to know what the highest and lowest values were for any date interval, linear regression wouldn't be able to provide that.

Of all this, the major thing to watch out for while using linear regression (even more so than its sensitivity to outliers) is that it assumes your observations are independent of one another. So if your dataset has multiple observations, the model assumes that the values of one sample have nothing to do with the values of another sample. This is often not the case. In the stock market example above, it is often observed that one company's stock price fluctuations have a ripple effect on other companies in the same market.

The last thing to watch out for with linear regression is that the further you extrapolate from the range of your training data, the less reliable the results of the regression become. Keep these thoughts in mind while using linear regression!

#### Knowledge Checks

##### Review Question 1

Which of the following statements makes the most sense?

+ Linear regression helps you answer categorical questions
+ **Linear regression can work with categorical features as inputs**
+ Linear regression uses the same calculation that PCA does to get the shortest distance to a line / hyperplane
+ Linear regression allows you to extrapolate data more accurately than it allows you to interpolate it
 
##### Review Question 2

The difference between the actual, observed y-value of your sample and the predicted y-value from the linear regression line is called?

+ A standard deviation
+ **An error**
+ A slope
+ Δy (Delta y)
+ A weight