# Machine learning

1. ## Supervised Machine Learning

**Supervised Machine Learning**

Supervised learning, as the name indicates, involves a supervisor acting as a teacher. We train the machine using well-labelled data, meaning each training example is already tagged with the correct answer. The machine is then given a new set of examples, and the supervised learning algorithm uses what it learned from the labelled training data to predict the correct outcome for them.

<b>Types of Supervised Learning</b>
    
Supervised learning is classified into two categories of algorithms: 

1. **Regression**: A regression problem is when the output variable is a real value, such as “dollars” or “weight”.
2. **Classification**: A classification problem is when the output variable is a category, such as “red” or “blue”, or “disease” or “no disease”.


1. **Regression**
    
Regression is a type of supervised learning that is used to predict continuous values, such as house prices or stock prices. Regression algorithms learn a function that maps from the input features to the output value.

Some common regression algorithms include:

* Linear Regression
* Polynomial Regression
* Support Vector Machine Regression
* Decision Tree Regression
* Random Forest Regression
* K-Nearest Neighbors (KNN) Regression

1. <b>Linear Regression and Multiple Linear Regression</b>

**Linear Regression**
One example of a Data Model that we will be using is:

**Simple Linear Regression**

Simple Linear Regression is a method to help us understand the relationship between two variables:

+ The predictor/independent variable (X)
+ The response/dependent variable that we want to predict (Y)

The result of Linear Regression is a linear function that predicts the response (dependent) variable as a function of the predictor (independent) variable.
![image.png](attachment:image.png)
**Linear Function**

$$\hat{Y} = a + bX$$

+ *a* is the intercept of the regression line: the value of Y when X is 0.
+ *b* is the slope of the regression line: the amount by which Y changes when X increases by 1 unit.

In [6]:
# Let's take a dataset for regression
import pandas as pd
path = 'C:/Users/SB INFO/Documents/Data Sciene/Courses/IBM_Data_Science/c7_data_analysis/automobileEDA.CSV'
df = pd.read_csv(path)
df.head()
# Load the linear regression module:
from sklearn.linear_model import LinearRegression
# Create a regression object:
lm = LinearRegression()
lm
# For this example, we want to look at how highway-mpg can help us predict car price. Using simple linear regression,
# we will create a linear function with "highway-mpg" as the predictor variable and "price" as the response variable.
X = df[['highway-mpg']]
Y = df['price']

# Fit the linear model using highway-mpg:
lm.fit(X, Y)
# We can output a prediction
Yhat = lm.predict(X)
print("predicted values:", Yhat[0:5])

# If we want to know the parameters of the linear function
print("intercept/a: ", lm.intercept_)
print("slope/b: ", lm.coef_)

predicted values: [16236.50464347 16236.50464347 17058.23802179 13771.3045085
 20345.17153508]
intercept/a:  38423.305858157386
slope/b:  [-821.73337832]


In [None]:
normalization

**Multiple Linear Regression**

What if we want to predict a value using more than one variable?

If we want to use more variables in our model to predict car price, we can use Multiple Linear Regression. Multiple Linear Regression is very similar to Simple Linear Regression, but this method is used to explain the relationship between one continuous response (dependent) variable and two or more predictor (independent) variables. Most real-world regression models involve multiple predictors. We will illustrate the structure using four predictor variables, but the results generalize to any number of predictors:
![image.png](attachment:image.png)
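
Assuming the figure above shows the usual form, the four-predictor model can be written as below. The symbols extend the simple linear function used earlier and are an assumption about the figure: $a$ is the intercept and $b_1, \dots, b_4$ are the coefficients of the predictors $X_1, \dots, X_4$.

$$\hat{Y} = a + b_1 X_1 + b_2 X_2 + b_3 X_3 + b_4 X_4$$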

In [11]:
#What if we want to predict a price using more than one variable?
Z = df[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']]
Y=df['price']
lm.fit(Z,Y)
Mhat=lm.predict(Z)
print("predicted values:",Mhat[:5])


predicted values: [13699.11161184 13699.11161184 19051.65470233 10620.36193015
 15521.31420211]


**2. Polynomial Regression**

A polynomial regression model is a machine learning model that can capture non-linear relationships between variables by fitting a non-linear regression line, which may not be possible with simple linear regression. It is used when linear regression models may not adequately capture the complexity of the relationship.

*Polynomial regression is used when:*

+ The data is correlated but the relationship doesn't look linear
+ The linear regression model fails to capture the points in the data
+ The linear regression model fails to adequately represent the underlying trend
![image.png](attachment:image.png)
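
A minimal sketch of polynomial regression with scikit-learn, assuming synthetic data with a quadratic trend (the data and the variable names are made up for illustration and are not from the automobile dataset). It expands `x` into polynomial features and fits an ordinary linear model on them, then compares the fit with a plain straight line.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (an assumption for illustration): a quadratic trend plus small noise
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 1.0 + 2.0 * x[:, 0] + 0.5 * x[:, 0] ** 2 + rng.normal(scale=0.2, size=50)

# A straight line cannot follow the curvature
linear_r2 = LinearRegression().fit(x, y).score(x, y)

# Expand x into [1, x, x^2] and fit an ordinary linear model on those features
poly = PolynomialFeatures(degree=2)
x_poly = poly.fit_transform(x)
poly_r2 = LinearRegression().fit(x_poly, y).score(x_poly, y)

print("R^2 linear:    ", round(linear_r2, 3))
print("R^2 polynomial:", round(poly_r2, 3))
```

The model is still linear in its coefficients; only the input features are non-linear in `x`.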

# Overfitting, Underfitting and Model Selection
When a model performs very well on the training data but its performance drops significantly on the test set, we call it an **overfitting model**. On the other hand, if the model performs poorly on both the training set and the test set, we call it an **underfitting model**.
![image-5.png](attachment:image-5.png)

Consider the following function. We assume the training points come from a polynomial function plus some noise. The goal of model selection is to determine the order of the polynomial to provide the best estimate of the function y(x).
![image-3.png](attachment:image-3.png)
If we try and fit the function with a linear function, the line is not complex enough to fit the data. As a result, there are many errors. This is called underfitting, where the model is too simple to fit the data.
![image-4.png](attachment:image-4.png)
This is an example of the eighth order polynomial used to fit the data. We see the model does well at fitting the data and estimating the function, even at the inflection points. 
![image.png](attachment:image.png)
Increasing it to a 16th order polynomial, the model does extremely well at tracking the training points, but performs poorly at estimating the function.
![image-2.png](attachment:image-2.png) ***This is especially apparent where there is little training data.*** The estimated function oscillates, not tracking the function. This is called overfitting, where the model is too flexible and fits the noise rather than the function.

<!-- Let's see how the R^2 changes on the test data for different order polynomials and then plot the results  -->
![image-7.png](attachment:image-7.png)
![image-8.png](attachment:image-8.png)

**Underfitting**

The following figure shows underfitting: the model (linear regression) is too simple to capture the true relationship in the data. On the training set the model has *high bias/low accuracy*, but because its performance on the test data is similar, we can say that the model has *low variance/good precision*.
![image-10.png](attachment:image-10.png)

*Reasons*
1. Too little training data
2. Too few features
3. Too much regularization

**Overfitting**


However, if you train the model too much or add too many features to it, you may overfit your model, resulting in low bias but high variance (i.e. the bias-variance tradeoff). In this scenario, the statistical model fits too closely against its training data, rendering it unable to generalize well to new data points.
![image-11.png](attachment:image-11.png)
This leads to overfitting. In this example we have *low bias/high accuracy* on the training data, but performance on the test data drops, which means *high variance/low precision*.
Overfitting occurs when a machine learning model learns to fit too closely to the training dataset instead of generalizing.

*Reasons*
1. Training data size: The training data is too small and doesn't contain enough samples to accurately represent all input data values
2. Data quality: The training data is dirty or contains noise
3. Model variance: The model has low bias but high variance
4. Model complexity: The model is too complex for the data
5. High dimensionality: The data has a large number of features, which can introduce complexity and sparsity

**NOTE**: Whether we overfit or underfit depends on the model we apply to our dataset. If the model is too simple for the data (for example, linear regression on a non-linear relationship), it cannot capture the true function and performs poorly even on the training data, which leads to *underfitting*. Whether the training set is big, small or average, a model that cannot capture the true function will underfit.

The same goes for overfitting. If the model is too complex *(for example, a higher-order polynomial regression)*, it learns to fit the training dataset too closely, giving high accuracy on the training data instead of generalizing. Whether the training set is big or small, a model that fits the training data too closely will overfit.



In [None]:
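# How to detect overfitting: compare the test R^2 across polynomial orders.
# Assumes x_train, x_test, y_train, y_test come from an earlier train/test split of df.
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

lr = LinearRegression()
Rsqu_test = []

order = [1, 2, 3, 4]
for n in order:
    pr = PolynomialFeatures(degree=n)

    # Fit the polynomial transform on the training data, then reuse it on the test data
    x_train_pr = pr.fit_transform(x_train[['horsepower']])
    x_test_pr = pr.transform(x_test[['horsepower']])

    lr.fit(x_train_pr, y_train)

    # R^2 on the held-out test data
    Rsqu_test.append(lr.score(x_test_pr, y_test))

plt.plot(order, Rsqu_test)
plt.xlabel('order')
plt.ylabel('R^2')
plt.title('R^2 Using Test Data')
plt.text(3, 0.75, 'Maximum R^2 ')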

The inability of a machine learning model (like linear regression) to capture the true relationship is called bias.
![image.png](attachment:image.png)
In machine learning, the difference in fit between datasets is called variance.
![image-2.png](attachment:image-2.png)
https://www.youtube.com/watch?v=EuBBz3bI-aA

![image-3.png](attachment:image-3.png)
This is again the underfitting case: the linear regression model is too simple to capture the true relationship, so it has high bias/low accuracy on the training set, while its similar performance on the test data shows low variance/good precision.

We have learned about overfitting, its possible causes and how to detect it. Now let's talk about how to avoid it.

We need to find a sweet spot between overfitting and underfitting. Three commonly used methods for finding the sweet spot are:
1. Regularization
2. Boosting
3. Bagging

Here we will talk about regularization.

***Regularization*** is a collection of training techniques that can help avoid overfitting in machine learning models. Regularization makes models simpler, which helps prevent overfitting; too much regularization, however, can cause underfitting. It prevents models from learning something too complex and fitting too much to the training data's noise.

Regularization works by adding a penalty on the size of the model's parameters, which reduces the influence of less important features. For example, L1 regularization, also known as Lasso, shrinks parameters towards 0 and can set some of them exactly to 0, effectively reducing the number of features.
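
A small sketch of the Lasso behaviour described above, assuming synthetic data in which only the first two of five features influence the target (the data and the `alpha` value are illustrative choices, not from the notebook's dataset).

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data (illustrative assumption): only the first two of five features matter
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)  # alpha controls the strength of the L1 penalty

print("OLS coefficients:  ", np.round(ols.coef_, 3))
print("Lasso coefficients:", np.round(lasso.coef_, 3))
```

With a penalty this strong, the coefficients of the three irrelevant features should come out as exactly zero, while the two informative coefficients are shrunk but kept.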

## Ridge Regression
Ridge regression—also known as L2 regularization—is one of several types of regularization for linear regression models. Regularization is a statistical method to reduce errors caused by overfitting on training data. Ridge regression specifically corrects for multicollinearity in regression analysis.

Ridge regression regularizes the model's coefficients using the hyperparameter alpha. It can be used to reduce standard errors and avoid overfitting when using a regression model.

This is useful when developing machine learning models that have a large number of parameters, particularly if those parameters also have high weights.
Let's look at an example to see what ridge regression is.
![image.png](attachment:image.png)
In the above example we see only two training data points, to which we applied a linear regression model. The model fits the training data too well: the sum of the squared residuals on the training set is small (here 0), but the sum of the squared residuals on the test data is large, so the model has high variance. We can say that the linear regression line is overfit to the training data.

The main idea behind ridge regression is to find a new line that doesn't fit the training data as well. In other words, we introduce a small amount of bias into how the new line is fit to the data, and in return for that small amount of bias we get a significant drop in variance.
![image-2.png](attachment:image-2.png)

Now let's see how it works.

When least squares determines the parameters in this equation: ![image-3.png](attachment:image-3.png) it minimizes the **sum of the squared residuals**.

In contrast, when ridge regression determines the parameters of this equation, it minimizes the **sum of the squared residuals** + ![image-5.png](attachment:image-5.png)
![image-4.png](attachment:image-4.png)
https://www.youtube.com/watch?v=Q81RR3yKn30&t=185s
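
A minimal sketch of the effect of the ridge penalty, assuming a small synthetic dataset (the data and the list of `alpha` values are illustrative). In scikit-learn's `Ridge`, `alpha` is the weight on the squared-coefficient penalty that is added to the sum of squared residuals.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data (illustrative assumption): 30 samples, 3 features
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))
y = X @ np.array([4.0, -3.0, 2.0]) + rng.normal(scale=1.0, size=30)

alphas = [0.01, 1.0, 10.0, 100.0]
coef_norms = []
for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X, y)
    # Size of the coefficient vector: shrinks as the penalty grows
    coef_norms.append(float(np.linalg.norm(model.coef_)))
    print(f"alpha={alpha:>6}: coefficients={np.round(model.coef_, 3)}")
```

A larger `alpha` pulls the coefficients closer to zero, which is the extra bias that ridge regression trades for lower variance.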

## Grid search

Grid search is a hyperparameter tuning technique that finds the best combination of hyperparameters for a model. Hyperparameters are variables that the model doesn't learn, but rather the user sets before training. Grid search builds a grid of all combinations of the candidate values for each hyperparameter, trains and scores the model for each combination, and selects the combination that gives the best score.
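
A short sketch of grid search with scikit-learn's `GridSearchCV`, assuming we tune the `alpha` of a `Ridge` model on synthetic data (the data, the candidate values and the choice of 4 folds are illustrative assumptions).

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic data (illustrative assumption)
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.5, 0.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=100)

# Candidate values for the hyperparameter; each one is scored with 4-fold cross-validation
param_grid = {"alpha": [0.001, 0.1, 1, 10, 100]}
search = GridSearchCV(Ridge(), param_grid, cv=4, scoring="r2")
search.fit(X, y)

print("best alpha:", search.best_params_["alpha"])
print("best mean cross-validated R^2:", round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` holds a model refit on all the data with the winning hyperparameters.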

### Evaluating Supervised Learning Models

Evaluating supervised learning models is an important step in ensuring that the model is accurate and generalizable. There are a number of different metrics that can be used to evaluate supervised learning models, but some of the most common ones include:

**For Regression**

1. **Mean Squared Error (MSE)**: MSE measures the average squared difference between the predicted values and the actual values. Lower MSE values indicate better model performance.
![image-2.png](attachment:image-2.png)
2. **Root Mean Squared Error (RMSE)**: RMSE is the square root of MSE, representing the standard deviation of the prediction errors. Similar to MSE, lower RMSE values indicate better model performance.
![image-3.png](attachment:image-3.png)
3. **Mean Absolute Error (MAE)**: MAE measures the average absolute difference between the predicted values and the actual values. It is less sensitive to outliers compared to MSE or RMSE.
![image.png](attachment:image.png)
4. **R-squared (Coefficient of Determination)**: R-squared measures the proportion of the variance in the target variable that is explained by the model. Higher R-squared values indicate better model fit.
![image-4.png](attachment:image-4.png)
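
The four regression metrics above can be computed as in this sketch, which uses four made-up actual and predicted values so the numbers can be checked by hand.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual and predicted values (illustrative assumption)
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mse = mean_squared_error(y_true, y_pred)   # mean of squared errors
rmse = np.sqrt(mse)                        # same units as the target
mae = mean_absolute_error(y_true, y_pred)  # mean of absolute errors
r2 = r2_score(y_true, y_pred)              # 1 - SS_res / SS_tot

print(f"MSE={mse:.3f}  RMSE={rmse:.3f}  MAE={mae:.3f}  R^2={r2:.3f}")
# MSE=0.375  RMSE=0.612  MAE=0.500  R^2=0.949
```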

2. **Classification**

Classification is a type of supervised learning that is used to predict categorical values, such as whether a customer will churn or not, whether an email is spam or not, or whether a medical image shows a tumor or not. Classification algorithms learn a function that maps from the input features to a probability distribution over the output classes.

Some common classification algorithms include:

* Logistic Regression
* Support Vector Machines
* Decision Trees
* Random Forests
* Naive Bayes

**Evaluating Supervised Learning Models**

**For Classification**

1. **Accuracy**: Accuracy is the percentage of predictions that the model makes correctly. It is calculated by dividing the number of correct predictions by the total number of predictions.
2. **Precision**: Precision is the percentage of positive predictions that the model makes that are actually correct. It is calculated by dividing the number of true positives by the total number of positive predictions.
3. **Recall**: Recall is the percentage of all positive examples that the model correctly identifies. It is calculated by dividing the number of true positives by the total number of positive examples.
4. **F1 score**: The F1 score combines precision and recall into a single number. It is calculated by taking the harmonic mean of precision and recall.
![image-3.png](attachment:image-3.png)
5. **Confusion matrix**: A confusion matrix is a table that shows the number of predictions for each class, along with the actual class labels. It can be used to visualize the performance of the model and identify areas where the model is struggling.
6. **Jaccard Index**: The Jaccard index is defined as the size of the intersection divided by the size of the union of two label sets.
![image-2.png](attachment:image-2.png)
7. **Log Loss**: Logarithmic loss (also known as Log loss) measures the performance of a classifier where the predicted output is a probability value between 0 and 1. 
![image-4.png](attachment:image-4.png)
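
A sketch of the classification metrics above using `sklearn.metrics`, on eight made-up labels. The predicted probabilities `y_prob` are hypothetical and are only needed for log loss; thresholding them at 0.5 gives `y_pred`.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             jaccard_score, log_loss, precision_score,
                             recall_score)

# Made-up labels (illustrative assumption): 3 TP, 3 TN, 1 FP, 1 FN
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
# Hypothetical predicted probabilities of class 1
y_prob = [0.9, 0.2, 0.4, 0.8, 0.1, 0.7, 0.6, 0.3]

acc = accuracy_score(y_true, y_pred)     # (TP + TN) / total
prec = precision_score(y_true, y_pred)   # TP / (TP + FP)
rec = recall_score(y_true, y_pred)       # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)            # harmonic mean of precision and recall
jac = jaccard_score(y_true, y_pred)      # TP / (TP + FP + FN)
cm = confusion_matrix(y_true, y_pred)    # rows: actual class, columns: predicted class
ll = log_loss(y_true, y_prob)            # lower is better

print(f"accuracy={acc}  precision={prec}  recall={rec}  F1={f1:.2f}  Jaccard={jac:.2f}")
print(cm)
print(f"log loss={ll:.3f}")
```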

## Logistic Regression
Logistic regression is a classification algorithm for categorical variables.
Logistic regression is a statistical and machine learning technique for classifying records of a dataset based on the values of the input fields.
In logistic regression, we use one or more independent variables to predict an outcome, such as churn, which we call the dependent variable, representing whether or not customers will stop using the service.
Logistic regression is analogous to linear regression but tries to predict a categorical or discrete target field instead of a numeric one.
In linear regression, we might try to predict a continuous value such as the price of a house, the blood pressure of a patient, or the fuel consumption of a car. In logistic regression, we predict a binary variable such as yes/no, true/false, successful/not successful, pregnant/not pregnant, and so on, all of which can be coded as zero or one.
In logistic regression, independent variables should be continuous. If they are categorical, they should be dummy or indicator coded, which means we have to transform them to numeric values.
For example, logistic regression can be used to predict whether a patient has a given disease, such as diabetes, based on observed characteristics of that patient such as weight, height, blood pressure, and the results of various blood tests.

Here are four situations in which logistic regression is a good candidate:
1. When the target field in your data is categorical, or specifically binary.
2. When you need the probability of your prediction. Logistic regression returns a probability score between zero and one for a given sample of data. It predicts the probability for that sample, and we map the cases to a discrete class based on that probability.
3. When you need a linear decision boundary. The decision boundary of logistic regression is a line, a plane or a hyperplane.
4. When you need to understand the impact of a feature.

Why linear regression is not used:
1. Probability issue. If we apply linear regression to classification data and get a predicted value (say 0.3), we can define a threshold (say 0.5); since 0.3 is less than 0.5, we predict class zero. However, we do not know the probability that the sample belongs to class zero.
2. The predicted value can be too high or too low.
   For example, if we use linear regression to calculate the class of a point, it returns an unbounded number such as three or negative two. We then have to apply a threshold, for example 0.5. Notice that with a step function, no matter how big the value is, as long as it is greater than 0.5 the outcome is simply one, and vice versa. In other words, there is no difference between a customer who has a value of one or 1,000; the outcome would be one.

We need a method that gives us the probability of belonging to a class and whose value is always between 0 and 1.
 https://www.coursera.org/learn/machine-learning-with-python/lecture/sJC9q/logistic-regression-vs-linear-regression
 ![image.png](attachment:image.png)
 ![image-2.png](attachment:image-2.png)
When Theta transpose x gets very big, the e to the power of minus Theta transpose x in the denominator of the fraction becomes almost 0, and the value of the sigmoid function gets closer to 1. If Theta transpose x is very small (a large negative number), the sigmoid function gets closer to 0. When the outcome of the sigmoid function gets closer to 1, the probability of y equals 1 given x goes up. In contrast, when the sigmoid value is closer to 0, the probability of y equals 1 given x is very small.
  ![image-3.png](attachment:image-3.png)
  ![image-4.png](attachment:image-4.png)
Here, 0.7 is the probability that the sample belongs to class 1.
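
A minimal sketch of the sigmoid function described above. The parameter vector `theta` and the sample `x` are hypothetical values chosen for illustration; the first entry of `x` is 1 so that the first entry of `theta` acts as the intercept.

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters and sample (illustrative assumption)
theta = np.array([-1.0, 0.8, 0.5])
x = np.array([1.0, 2.0, 0.5])

p_class_1 = sigmoid(theta @ x)            # P(y=1 | x), here theta @ x = 0.85
predicted_class = int(p_class_1 >= 0.5)   # threshold the probability at 0.5

print("sigmoid(-10), sigmoid(0), sigmoid(10):", sigmoid(np.array([-10.0, 0.0, 10.0])))
print("P(y=1|x) =", round(float(p_class_1), 3), "-> class", predicted_class)
```

The probability of class 0 is simply `1 - p_class_1`.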

## SVM
SVM works by mapping data into a higher-dimensional space in such a way that linearly inseparable data becomes linearly separable.

SVM is a supervised algorithm that classifies cases using a separator. It involves two steps:
1. Mapping data into a higher-dimensional space
2. Finding a separator
![image-4.png](attachment:image-4.png)

Mapping data into a higher-dimensional space is called kernelling. The mathematical function used for the transformation is known as the kernel function, and can be of different types, such as linear, polynomial, Radial Basis Function (RBF), and sigmoid.![image-3.png](attachment:image-3.png)
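
A small sketch of why the kernel matters, assuming scikit-learn's `make_circles` toy dataset (two concentric rings, which no straight line can separate). It compares a linear kernel with an RBF kernel using `SVC`; the dataset parameters are illustrative choices.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Toy data (illustrative assumption): an inner ring of one class inside an outer ring of the other
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Training accuracy with a linear separator versus an RBF kernel
linear_acc = SVC(kernel="linear").fit(X, y).score(X, y)
rbf_acc = SVC(kernel="rbf").fit(X, y).score(X, y)

print("linear kernel accuracy:", round(linear_acc, 3))
print("RBF kernel accuracy:   ", round(rbf_acc, 3))
```

The RBF kernel implicitly maps the points into a space where the two rings can be split by a hyperplane, so it should score far higher here than the linear kernel.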

The best hyperplane is the one that represents the largest separation, or margin, between the two classes. So the goal is to choose a hyperplane with as big a margin as possible.

Examples closest to the hyperplane are support vectors. It is intuitive that only support vectors matter for achieving our goal, and thus other training examples can be ignored. We try to find the hyperplane in such a way that it has the maximum distance to the support vectors.![image-2.png](attachment:image-2.png)

The hyperplane is learned from training data using an optimization procedure that maximizes the margin. Like many other problems, this optimization problem can also be solved by gradient descent.

The two main advantages of support vector machines are:
1. They are accurate in high-dimensional spaces.
2. They use a subset of training points in the decision function, called support vectors, so they are also memory efficient.

The disadvantages of support vector machines include:
1. The algorithm is prone to overfitting if the number of features is much greater than the number of samples.
2. SVMs do not directly provide probability estimates, which are desirable in most classification problems.
3. SVMs are not very efficient computationally if your dataset is very big, such as when you have more than 1,000 rows.

SVMs can be used for a variety of tasks, such as ***text classification, image classification, spam detection, handwriting identification, gene expression analysis, face detection, and anomaly detection.*** 

**Note**: For extra information, check out https://www.geeksforgeeks.org/support-vector-machine-algorithm/


## Decision Trees
A Decision Tree is a type of classification approach that can predict the class of a case, for example, DrugA or DrugB.
The basic intuition behind decision trees is to map out all possible decision paths in the form of a tree.
![image.png](attachment:image.png)
A Decision tree is a tree-like structure that represents a set of decisions and their possible consequences. Each internal node in the tree represents a decision/test, and each branch represents an outcome of that decision/test. The leaves of the tree represent the final decisions or predictions.
![image-2.png](attachment:image-2.png)
Decision trees are created by recursively partitioning the data into smaller and smaller subsets. At each partition, the data is split based on a specific feature, and the split is made in a way that maximizes the information gain.

Decision tree algorithm:
1. Choose an attribute from the dataset.
2. Calculate the significance of the attribute in splitting the data.
3. Split the data based on the value of the best attribute.
4. Go to each branch and repeat from step 2 for the remaining attributes.

How do we determine the best attribute for splitting the data? It is determined by the entropy of the nodes that the split produces. Entropy measures the randomness or uncertainty in the data; the entropy of a node is the amount of information disorder in that node.
The lower the entropy, the less mixed the class distribution and the purer the node.![image-3.png](attachment:image-3.png)

We try each attribute in the dataset, calculate the entropy of each node that splitting on it produces, and then choose the attribute with the highest information gain. ![image-4.png](attachment:image-4.png)
Information gain is the information that increases the level of certainty after splitting. It is the entropy of the tree before the split minus the weighted entropy of the tree after the split. As entropy, or the amount of randomness, decreases, the information gain, or amount of certainty, increases, and vice versa.
![image-5.png](attachment:image-5.png)
We choose the tree with the higher information gain after splitting, which here is the sex attribute. So, we select the sex attribute as the first splitter.
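
Entropy and information gain can be sketched as below. The class counts are hypothetical (14 patients, 9 on drug B and 5 on drug A, split by a two-valued attribute) and are chosen only for illustration; they are not read from the figures above.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(parent, children):
    """Entropy before the split minus the weighted entropy after it."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical node: 9 patients respond to drug B, 5 to drug A
parent = ["B"] * 9 + ["A"] * 5
# Hypothetical split on a two-valued attribute (e.g. sex)
female = ["B"] * 3 + ["A"] * 4
male = ["B"] * 6 + ["A"] * 1

gain = information_gain(parent, [female, male])
print("entropy before split:", round(entropy(parent), 3))
print("information gain:    ", round(gain, 3))
```

Repeating `information_gain` for every candidate attribute and picking the largest value is the selection rule described above.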

Decision trees are versatile tools with a wide range of applications in machine learning:

1. Classification: Making predictions about categorical results, like if an email is spam or not.
2. Regression: The estimation of continuous values; for example, feature-based home price prediction.
3. Feature Selection: Feature selection lowers dimensionality and boosts model performance by determining which features are most pertinent to a certain job.

Decision trees are used in data analytics and machine learning to break down complex data into more manageable parts. They are often used in these fields for prediction analysis, data classification, and regression.

There are two types of decision trees in machine learning:
1. Classification trees: Determine whether an event happened or didn't happen, usually involving a "yes" or "no" outcome
2. Regression trees: Predict continuous values based on previous data or information sources 

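Independent variables can be categorical or numerical. Note that scikit-learn's decision tree implementation expects numeric input, so categorical features must be encoded as numbers first.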

## KNN
The k-nearest neighbors (KNN) algorithm is a non-parametric, supervised learning classifier, which uses proximity to make classifications or predictions about the grouping of an individual data point. The K-NN algorithm works by finding the K nearest neighbors to a given data point based on a distance metric, such as Euclidean distance.
![image.png](attachment:image.png)
The K-Nearest Neighbors algorithm is a classification algorithm that takes a bunch of labeled points and uses them to learn how to label other points.
* This algorithm classifies cases based on their similarity to other cases.
* In K-Nearest Neighbors, data points that are near each other are said to be neighbors.
* K-Nearest Neighbors is based on this paradigm: similar cases with the same class labels are near each other.
* Thus, the distance between two cases is a measure of their dissimilarity.

There are different ways to calculate the similarity or conversely, the distance or dissimilarity of two data points. For example, this can be done using Euclidean distance.

K-Nearest Neighbors algorithm:
1. Pick a value for K.
2. Calculate the distance from the new (held-out) case to each of the cases in the dataset.
3. Search for the K observations in the training data that are nearest to the measurements of the unknown data point.
4. Predict the response of the unknown data point using the most popular response value from the K nearest neighbors.

How can we calculate the similarity between two data points? 

We can use a specific type of Minkowski distance, the Euclidean distance, to calculate the distance between two data points. The same distance metric works for two features or for multidimensional vectors. Of course, we have to normalize our feature set to get an accurate dissimilarity measure.
![image-2.png](attachment:image-2.png)
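
A minimal sketch of the Minkowski distance, of which the Euclidean distance is the special case `p=2`. The two customers and their (age, income) values are hypothetical.

```python
import numpy as np

def minkowski_distance(a, b, p=2):
    """Minkowski distance between two points; p=2 is Euclidean, p=1 is Manhattan."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float((np.abs(a - b) ** p).sum() ** (1.0 / p))

# Hypothetical customers described by (age, income in thousands)
customer_1 = [34, 190]
customer_2 = [30, 200]

print("Euclidean:", round(minkowski_distance(customer_1, customer_2), 3))       # sqrt(4^2 + 10^2)
print("Manhattan:", round(minkowski_distance(customer_1, customer_2, p=1), 3))  # 4 + 10
```

Without normalization, the income difference dominates the result simply because income is measured on a larger scale than age.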

K in K-Nearest Neighbors is the number of nearest neighbors to examine. It is supposed to be specified by the user. So, how do we choose the right K?
A low value of K, such as K equals one, produces a highly complex model, which might result in overfitting. This means the prediction process is not generalized enough to be used for out-of-sample cases. If we choose a very high value of K, such as K equals 20, then the model becomes overly generalized.

**Solution**: The general solution is to reserve a part of your data for testing the accuracy of the model. Once you've done so, start with K equals one, use the training part for modeling, and calculate the accuracy of prediction using all samples in your test set. Repeat this process while increasing K and see which K is best for your model.
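
The procedure above can be sketched as follows, assuming scikit-learn's built-in iris dataset as a stand-in (the dataset, the 80/20 split and the range of K values are illustrative choices). Features are standardized first, since KNN relies on distances.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Stand-in dataset (illustrative assumption)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)

# Normalize using statistics from the training part only
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# Train one model per K and record its accuracy on the held-out test set
accuracies = {}
for k in range(1, 11):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train_s, y_train)
    accuracies[k] = knn.score(X_test_s, y_test)

best_k = max(accuracies, key=accuracies.get)
print("accuracy per K:", {k: round(a, 3) for k, a in accuracies.items()})
print("best K:", best_k)
```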


The K-NN algorithm is a versatile and widely used machine learning algorithm, valued mainly for its simplicity and ease of implementation. It can also handle both numerical and categorical data.

### **Applications of Supervised learning**
    
Supervised learning can be used to solve a wide variety of problems, including:

1. **Spam filtering**: Supervised learning algorithms can be trained to identify and classify spam emails based on their content, helping users avoid unwanted messages.
2. **Image classification**: Supervised learning can automatically classify images into different categories, such as animals, objects, or scenes, facilitating tasks like image search, content moderation, and image-based product recommendations.
3. **Medical diagnosis**: Supervised learning can assist in medical diagnosis by analyzing patient data, such as medical images, test results, and patient history, to identify patterns that suggest specific diseases or conditions.
4. **Fraud detection**: Supervised learning models can analyze financial transactions and identify patterns that indicate fraudulent activity, helping financial institutions prevent fraud and protect their customers.
5. **Natural language processing (NLP)**: Supervised learning plays a crucial role in NLP tasks, including sentiment analysis, machine translation, and text summarization, enabling machines to understand and process human language effectively.

2. ## Unsupervised Machine Learning

Unsupervised learning is a type of machine learning that learns from unlabeled data. This means that the data does not have any pre-existing labels or categories. The goal of unsupervised learning is to discover patterns and relationships in the data without any explicit guidance.

## Model Evaluation and Refinement

Model evaluation tells us how our model performs in the real world. In-sample evaluation tells us how well our model fits the data already given to train it. It does not give us an estimate of how well the trained model can predict new data.

*SOLUTION*: The solution is to split our data up and use the in-sample data, or training data, to train the model. The rest of the data, called test data, is used as out-of-sample data. This data is then used to approximate how the model performs in the real world. Separating data into training and testing sets is an important part of model evaluation.

We use the test data to get an idea how our model will perform in the real world.

We use training set to build a model and discover predictive relationships. We then use a testing set to evaluate model performance. When we have completed testing our model, we should use all the data to train the model.
![image.png](attachment:image.png)
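
A minimal sketch of a train/test split with scikit-learn's `train_test_split`, assuming synthetic data and a 70/30 split (both are illustrative choices).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (illustrative assumption): a noisy straight line
rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(scale=1.0, size=100)

# Hold out 30% of the rows as out-of-sample test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, y_train)
train_r2 = model.score(X_train, y_train)  # in-sample fit
test_r2 = model.score(X_test, y_test)     # estimate of real-world performance

print("rows in train/test:", len(X_train), len(X_test))
print("R^2 train:", round(train_r2, 3), " R^2 test:", round(test_r2, 3))
```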

### Generalization
Generalization error is a measure of how well our model does at predicting previously unseen data. The error we obtain using our testing data is an approximation of this error. This figure shows the distribution of the actual values in red compared to the predicted values from a linear regression in blue. We see the distributions are somewhat similar. If we generate the same plot using the test data, we see the distributions are relatively different. The difference is due to generalization error and represents what we see in the real world.

Using a lot of data for training gives us an accurate means of determining how well our model will perform in the real world, but because few data points are left for testing, the precision of that performance estimate will be low.
![image.png](attachment:image.png)
If we use fewer data points to train the model and more to test it, the estimate of generalization performance will be less accurate, but it will have good precision.
![image-2.png](attachment:image-2.png)

### Cross-Validation 
To overcome this problem, we use cross-validation, one of the most common out-of-sample evaluation methods. In this method, the dataset is split into k equal groups. Each group is referred to as a fold.
 
![image.png](attachment:image.png)
Some of the folds can be used as a training set which we use to train the model, and the remaining parts are used as a test set which we use to test the model.

For example, we can use three folds for training, then use one fold for testing. This is repeated until each partition has been used for both training and testing.
![image-2.png](attachment:image-2.png)
At the end, we use the average results as the estimate of out-of-sample error. The evaluation metric depends on the model.
![image-3.png](attachment:image-3.png)
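
The fold rotation described above can be written out by hand. The sketch below is a minimal illustration, assuming synthetic data, k = 4, a straight-line model, and mean squared error as the metric (all of these are choices made for the example, not taken from the notebook).

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (hypothetical): a noisy linear relationship.
x = rng.uniform(0, 10, size=100)
y = 1.0 + 0.5 * x + rng.normal(0, 1.0, size=100)

k = 4
# Shuffle the indices once, then cut them into k equal-sized folds.
folds = np.array_split(rng.permutation(len(x)), k)

fold_mse = []
for i in range(k):
    # Fold i is the test set; the other k-1 folds form the training set.
    test_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])

    # Fit Yhat = a + b*X on the training folds only.
    b, a = np.polyfit(x[train_idx], y[train_idx], deg=1)

    # Score the fitted line on the held-out fold.
    pred = a + b * x[test_idx]
    fold_mse.append(np.mean((y[test_idx] - pred) ** 2))

# The average over folds is the estimate of out-of-sample error.
cv_mse = float(np.mean(fold_mse))
print("MSE per fold:", np.round(fold_mse, 3))
print("cross-validated MSE:", round(cv_mse, 3))
```

In practice, scikit-learn's `cross_val_score` from `sklearn.model_selection` wraps the same procedure. The manual loop is shown here only to make each step visible.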


### Residual Plot
A good way to visualize the variance of the data is to use a residual plot.

What is a residual?

The difference between the observed value (y) and the predicted value (Yhat) is called the residual (e). When we look at a regression plot, the residual is the distance from the data point to the fitted regression line.

So what is a residual plot?

A residual plot is a graph that shows the residuals on the vertical y-axis and the independent variable on the horizontal x-axis.

What do we pay attention to when looking at a residual plot?

We look at the spread of the residuals:

- If the points in a residual plot are randomly spread out around the x-axis, then a linear model is appropriate for the data.

Why is that? Randomly spread-out residuals mean that the variance is constant, and thus the linear model is a good fit for this data.
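
As a rough numerical companion to the residual plot, the sketch below computes residuals for a straight-line fit on two synthetic datasets (both hypothetical): one that really is linear and one that is curved. The `curvature` helper is an ad hoc check invented for this example, not a standard diagnostic.

```python
import numpy as np

rng = np.random.default_rng(2)

x = np.linspace(0, 10, 100)
# Hypothetical data: one truly linear target, one quadratic target.
y_linear = 2.0 + 1.5 * x + rng.normal(0, 1.0, size=x.size)
y_curved = 2.0 + 0.5 * x ** 2 + rng.normal(0, 1.0, size=x.size)


def linear_residuals(x_vals, y_vals):
    """Residuals e = y - Yhat for a least-squares straight-line fit."""
    b, a = np.polyfit(x_vals, y_vals, deg=1)
    return y_vals - (a + b * x_vals)


def curvature(x_vals, residuals):
    """Ad hoc check: |correlation| between the residuals and centered x squared.

    Values near 0 suggest no leftover curved pattern; values near 1
    suggest the straight line is missing a curve in the data.
    """
    return abs(np.corrcoef((x_vals - x_vals.mean()) ** 2, residuals)[0, 1])


res_linear = linear_residuals(x, y_linear)
res_curved = linear_residuals(x, y_curved)

print("leftover pattern, linear data:", round(curvature(x, res_linear), 3))
print("leftover pattern, curved data:", round(curvature(x, res_curved), 3))
```

To draw the actual residual plot, scatter `x` against the residuals, for example with matplotlib's `plt.scatter(x, res_linear)` and a reference line from `plt.axhline(0)`. For the curved data the points trace a U shape around the x-axis instead of a random band.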

**Unsupervised Machine Learning**

## K-Means Clustering

Customer segmentation is the practice of partitioning a customer base into groups of individuals that have similar characteristics. It is a significant strategy, as it allows the business to target specific groups of customers, so as to more effectively allocate marketing resources. For example, one group might contain customers who are high profit and low risk.
The important requirement is to use the available data to understand and identify how customers are similar to each other.

**Clustering** means finding clusters in a dataset without supervision. So what is a cluster? A cluster is a group of data points or objects in a dataset that are similar to other objects in the group and dissimilar to data points in other clusters.

*Difference between classification and clustering*

Classification algorithms predict categorical class labels. This means assigning instances to predefined classes such as defaulted or not defaulted. In clustering, however, the data is unlabeled and the process is unsupervised.

Generally clustering can be used for one of the following purposes: 
* exploratory data analysis.
* summary generation or reducing the scale.
* outlier detection, especially for fraud detection or noise removal.
* finding duplicates in datasets, or as a pre-processing step for prediction, for other data mining tasks, or as part of a complex system.


Some common groups of clustering algorithms:

1. Partition-based clustering is a group of clustering algorithms that produce sphere-like clusters, such as K-Means, K-Medians, or Fuzzy c-Means. These algorithms are relatively efficient and are used for medium and large sized databases.
2. Hierarchical clustering algorithms produce trees of clusters, such as agglomerative and divisive algorithms. This group of algorithms is very intuitive and is generally good for use with small datasets.
3. Density-based clustering algorithms produce arbitrarily shaped clusters. They are especially good when dealing with spatial clusters or when there is noise in your dataset. One example is the DBSCAN algorithm.
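
The contrast between sphere-like and arbitrarily shaped clusters can be illustrated with scikit-learn. This is a sketch under assumptions: the two-moons toy dataset, the `eps` and `min_samples` values, and the adjusted Rand index as the comparison score are all choices made for this example. True labels are used only to score the result, since real clustering data has none.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-moons: clusters that are clearly not sphere-like.
X, true_labels = make_moons(n_samples=300, noise=0.05, random_state=0)

# Partition-based: K-Means assigns each point to the nearest of 2 centroids.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Density-based: DBSCAN grows clusters through dense neighbourhoods.
# eps and min_samples are assumed values that suit this toy dataset.
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Adjusted Rand index: 1.0 means the clustering matches the true grouping.
kmeans_ari = adjusted_rand_score(true_labels, kmeans_labels)
dbscan_ari = adjusted_rand_score(true_labels, dbscan_labels)
print(f"K-Means ARI: {kmeans_ari:.2f}")
print(f"DBSCAN ARI:  {dbscan_ari:.2f}")
```

K-Means tends to cut the moons with a straight boundary, while DBSCAN follows each curved shape. On compact, well-separated blobs the comparison would look very different, so this is not a general ranking of the two algorithms.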

Some real-world applications of k-means:

* Customer segmentation
* Understand what the visitors of a website are trying to accomplish
* Pattern recognition
* Machine learning
* Data compression

In [1]:
pip install pillow


