# What is Machine Learning?

## Machine Learning is a process whereby:
- Computers are given the ability to learn and make decisions from data
- Without being explicitly programmed

### Examples:
1. Learning to predict whether an email is spam or not, given its content and sender.
2. Learning to cluster books into different categories.


# Unsupervised Learning:

### The process of uncovering hidden patterns and structures from unlabeled data.

### e.g.:
1. A business may wish to group its customers into distinct categories (clustering), based on their purchasing behavior, without knowing in advance what these categories are.
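
As a rough illustration of clustering, here is a minimal sketch of k-means in one dimension. k-means is only one common clustering approach and is not named in these notes; the spending figures, the starting centroids, and the `kmeans_1d` helper are all invented for this example.

```python
# Minimal 1-D k-means sketch (pure Python, no libraries).
# The spending figures below are invented purely for illustration.

def kmeans_1d(values, centroids, n_iterations=10):
    """Group values around the nearest centroid and update the centroids."""
    centroids = list(centroids)
    labels = [0] * len(values)
    for _ in range(n_iterations):
        # Assignment step: each value joins the cluster with the closest centroid.
        labels = [
            min(range(len(centroids)), key=lambda c: abs(v - centroids[c]))
            for v in values
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [v for v, label in zip(values, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = sum(members) / len(members)
    return labels, centroids


monthly_spend = [12.0, 15.0, 14.0, 95.0, 102.0, 99.0]  # hypothetical customers
labels, centroids = kmeans_1d(monthly_spend, centroids=[0.0, 50.0])
print(labels)
print(centroids)
```

No labels are supplied anywhere: the two groups (low spenders and high spenders) emerge from the data alone, which is what makes this unsupervised.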


# Supervised Learning:

- Where the values to be predicted are already known
- A model is built with the aim of accurately predicting the values of previously unseen data
- Supervised learning uses features to predict the values of a target variable

### e.g.:
- Predicting a basketball player's position based on their points per game

## There are two types of supervised learning:
- **Classification**: Used to predict a label or category of an observation
    - **e.g.**: We can predict whether a bank transaction is fraudulent or not, as there are two possible outcomes: a fraudulent transaction or a non-fraudulent transaction. This is an example of binary classification.
  
- **Regression**: Used to predict continuous values
    - **e.g.**: A model can use features such as the number of bedrooms and the size of the property to predict the target value, which is the price of the property.


# Naming Conventions

- Feature = predictor variable = independent variable
- Target variable = dependent variable = response variable

## Some Requirements Before Using Supervised Learning

- **Requirements**:
    - No missing values
    - Data in numeric format
    - Data stored in a pandas DataFrame or NumPy array

- Perform Exploratory Data Analysis (EDA) first to ensure the data is in the correct format.
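
A quick way to check these requirements during EDA might look like the sketch below. It assumes pandas is installed; the DataFrame contents and column names are made up for illustration, and dropping rows plus one-hot encoding is just one possible clean-up, not the only one.

```python
import pandas as pd

# Hypothetical toy data; the column names are invented for illustration.
df = pd.DataFrame(
    {
        "total_day_charge": [45.1, 27.5, None, 50.9],
        "plan": ["basic", "premium", "basic", "premium"],
        "churn": [0, 1, 0, 1],
    }
)

# Requirement 1: no missing values. Count them per column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Requirement 2: numeric data. List the columns that are not numeric.
non_numeric = [
    col for col in df.columns if not pd.api.types.is_numeric_dtype(df[col])
]
print(non_numeric)

# One possible clean-up: drop incomplete rows and one-hot encode text columns.
clean = pd.get_dummies(df.dropna(), columns=non_numeric, dtype=int)
print(clean.shape)
```

After the clean-up, `clean` has no missing values and only numeric columns, so it can be passed to a scikit-learn model.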


# Scikit-learn Syntax

```python
from sklearn.module import Model
# "module" and "Model" are placeholders: we import a model, which is a type of algorithm
# for our supervised learning problem, from the relevant scikit-learn module.
# e.g. the k-Nearest Neighbors model uses the distance between observations to predict labels or values.

model = Model()
# We create a variable named "model" and instantiate the Model.

model.fit(X, y)
# Fitting is where the model learns patterns about the features and the target variable.
# We fit the model to X (an array of our features) and y (an array of our target variable values).

predictions = model.predict(X_new)
# We then use the model's predict method, passing in six new observations (X_new).

print(predictions)
# Example result, one predicted label per new observation:
# array([0, 0, 0, 1, 0, 0])
```






# The Classification Challenge

## Classifying Labels for Unseen Data:

### There are 4 steps:
1. Build a model, build a classifier.
2. The model learns from the labeled data we pass to it.
3. Pass unlabeled data to the model as input.
4. The model predicts labels for the unseen data.

- As the classifier learns from the labeled data, we call this the training data.
    - Labeled data = training data.


# k-Nearest Neighbors:

The idea of this algorithm is to predict the label of any data point by:
- Looking at the k closest labeled data points
- Taking a majority vote on what label the unlabeled observation should have

In doing so, k-NN also creates a decision boundary between the classes.
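
To make the voting idea concrete, here is a from-scratch sketch in plain Python. It is not how scikit-learn implements k-NN internally; the `knn_predict` helper, the Euclidean distance choice, and the toy data are assumptions made for illustration.

```python
from collections import Counter
import math


def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest neighbors."""
    # Euclidean distance from x_new to every labeled point.
    distances = [
        (math.dist(x, x_new), label) for x, label in zip(X_train, y_train)
    ]
    # Keep the labels of the k closest points.
    nearest_labels = [label for _, label in sorted(distances, key=lambda d: d[0])[:k]]
    # The most common label among the neighbors wins.
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical toy data: two features per observation, labels 0 and 1.
X_train = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (6.5, 7.0), (7.0, 6.0)]
y_train = [0, 0, 0, 1, 1, 1]

print(knn_predict(X_train, y_train, (1.2, 1.4)))
print(knn_predict(X_train, y_train, (6.4, 6.2)))
```

The first new point sits among the label-0 observations and the second among the label-1 observations, so each takes the label of its surrounding neighbors.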


## Using Scikit-learn to Fit k-NN Model

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# churn_df is assumed to be a pandas DataFrame that has already been loaded.
# Split the data into X, a 2D array of our features...
X = churn_df[["total_day_charge", "total_eve_charge"]].values

# ...and y, a 1D array of our target values.
y = churn_df["churn"].values

# Scikit-learn requires the features in an array where each column is a feature
# and each row is a different observation. Similarly, the target needs to be a
# single column with the same number of observations as the feature data.
print(X.shape, y.shape)

knn = KNeighborsClassifier(n_neighbors=15)
knn.fit(X, y)

# New, unlabeled observations: a 2D array with one row per observation and the
# same two features as X. The values below are placeholders for illustration.
X_new = np.array([[50.0, 20.0], [25.0, 15.0]])

# Pass the unlabeled data to the fitted model to predict its labels.
predictions = knn.predict(X_new)
print(f"Predictions: {predictions}")
```


# Measuring Model Performance

- In classification, accuracy is a commonly used metric.
- **Accuracy** = number of correct predictions / total observations.
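
The formula can be sketched in a few lines of plain Python. The `accuracy` helper and the label lists are hypothetical and only serve to show the arithmetic.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels for five held-out observations.
y_test = [0, 1, 0, 0, 1]
y_pred = [0, 1, 1, 0, 1]
print(accuracy(y_test, y_pred))  # 4 of 5 predictions are correct
```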

### How Do We Measure Accuracy?

- We could compute accuracy on the data used to fit the classifier, but that would not indicate how well the model generalizes to unseen data.
- Instead, split the data into a training set and a testing set.
- Fit the model on the training set.
- Calculate the accuracy of the model against the testing set.


# Train / Test Split

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Commonly, we use 20-30% of the data for the test set (here, test_size=0.3).
# The random_state argument sets the seed for the random number generator that splits the data.
# Best practice is for the split to reflect the proportion of labels in the data:
# if churn occurs in 10% of the observations, we want 10% of the labels in both
# the training and testing sets to represent churn. Setting stratify=y achieves this.
# train_test_split returns 4 arrays: training data, test data, training labels, and test labels.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=21, stratify=y
)

knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(X_train, y_train)

# To check accuracy, we call the score method on X_test and y_test.
print(knn.score(X_test, y_test))
```


# Model Complexity

- Larger k = less complex model = can cause underfitting.
- Smaller k = more complex model = can lead to overfitting (a complex model can model noise in the training data rather than reflecting general trends).

## Model Complexity and Over/Underfitting

We can also interpret k using the model complexity curve. For the k-NN model, we can calculate the accuracy on the training and test sets using incremental k values and plot the results.
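
The effect of k on complexity can be seen even without scikit-learn. The sketch below uses a hand-written `knn_predict` helper and a hypothetical one-feature dataset with one deliberately noisy label; it scores only on the training data, so it illustrates the training side of the complexity curve.

```python
from collections import Counter
import math


def knn_predict(X_train, y_train, x_new, k):
    """Majority vote among the k training points closest to x_new."""
    ranked = sorted(zip(X_train, y_train), key=lambda pair: math.dist(pair[0], x_new))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def training_accuracy(X, y, k):
    """Accuracy when the model is scored on the same data it was fit on."""
    hits = sum(knn_predict(X, y, x, k) == label for x, label in zip(X, y))
    return hits / len(y)


# Hypothetical 1-feature data with one "noisy" label (the point at 2.5).
X = [(1.0,), (2.0,), (2.5,), (3.0,), (7.0,), (8.0,), (9.0,)]
y = [0, 0, 1, 0, 1, 1, 1]

print(training_accuracy(X, y, k=1))
print(training_accuracy(X, y, k=3))
print(training_accuracy(X, y, k=7))
```

With k=1 every training point is its own nearest neighbor, so the training accuracy is perfect even though one label is noise, which is the overfitting case. With k equal to the size of the dataset, every prediction is simply the overall majority class, which is the underfitting case.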


# Model Complexity and Accuracy Evaluation

We will evaluate the model's performance using different values of `k` (the number of neighbors) and plot the accuracy for both the training and test sets.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Create two empty dictionaries to store our train and test accuracies,
# and an array containing the range of k values (1 to 25).
train_accuracies = {}
test_accuracies = {}
neighbours = np.arange(1, 26)

# Repeat our previous workflow, building one model per number of neighbours.
for neighbour in neighbours:
    knn = KNeighborsClassifier(n_neighbors=neighbour)
    knn.fit(X_train, y_train)
    train_accuracies[neighbour] = knn.score(X_train, y_train)
    test_accuracies[neighbour] = knn.score(X_test, y_test)

# After the for loop, we can plot the results.
```


## Model Complexity and Accuracy Visualization

We can visualize how the accuracy of the k-NN model changes with varying numbers of neighbors. Below is the code to plot both the training and testing accuracies for different values of `k`:

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
plt.title("KNN: Varying Number of Neighbours")
# Convert the dict values to lists so matplotlib can plot them against the k values.
plt.plot(neighbours, list(train_accuracies.values()), label="Training Accuracy")
plt.plot(neighbours, list(test_accuracies.values()), label="Testing Accuracy")
plt.legend()
plt.xlabel("Number of Neighbours")
plt.ylabel("Accuracy")
plt.show()
```


# Introduction to Regression

- In regression tasks, the target variable typically has continuous values, such as a country's GDP or the price of a house.

## The Basics of Linear Regression

- We want to fit a line to the data. In two dimensions, this takes the form y = ax + b.
- Simple linear regression uses one feature:
    - y = target
    - x = single feature
    - a, b = parameters (coefficients) of the model: the slope and the intercept

## How Do We Choose Values for a and b?

- Define an error function for any given line.
- Then choose the line that minimizes this function.
- Error function = loss function = cost function

For example, we want the line to be as close to the observations as possible, so we minimize the vertical distance between the fit and the data. For each observation, we calculate the vertical distance between it and the line; this distance is called a *residual*. We could try to minimize the sum of the residuals, but then positive residuals would cancel out negative ones. To avoid this, we square the residuals. Adding up all the squared residuals gives the residual sum of squares (RSS):

$$\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

where $y_i$ is the observed target value and $\hat{y}_i = a x_i + b$ is the value the line predicts for observation $i$.

This type of linear regression is called ordinary least squares (OLS), where we aim to minimize the RSS.
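
As a worked illustration of minimizing the RSS, the sketch below fits a one-feature line using the standard closed-form least-squares solution. The helper names and the data points are hypothetical, and this is a plain-Python sketch rather than what scikit-learn does under the hood.

```python
def fit_simple_ols(x, y):
    """Return slope a and intercept b that minimize the RSS for y = a*x + b."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Closed-form least-squares solution for a single feature.
    a = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sum(
        (xi - x_mean) ** 2 for xi in x
    )
    b = y_mean - a * x_mean
    return a, b


def rss(x, y, a, b):
    """Residual sum of squares for the line y = a*x + b."""
    return sum((yi - (a * xi + b)) ** 2 for xi, yi in zip(x, y))


# Hypothetical data, roughly following y = 2x + 1.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.1, 4.9, 7.2, 8.8]

a, b = fit_simple_ols(x, y)
print(a, b, rss(x, y, a, b))

# Any other line should have an RSS at least as large as the fitted one.
print(rss(x, y, 2.0, 1.0) >= rss(x, y, a, b))
```

The final comparison shows that the line y = 2x + 1, although close, has a larger RSS than the fitted line, which is the sense in which OLS picks the best line.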


## Linear Regression in Higher Dimensions

When we have two features, x1 and x2, and one target, y, the fitted model takes the form y = a1x1 + a2x2 + b.

- To fit a linear regression model here, we need to specify 3 variables: a1, a2, and b (the intercept).
- In higher dimensions:
    - This is known as multiple regression.
    - We must specify a coefficient for each feature, plus the intercept b.
- For multiple linear regression models, scikit-learn expects one variable for the feature values and one for the target values.

## R-squared

- The default metric for linear regression is R-squared.
- It quantifies the amount of variance in the target variable that is explained by the features.
- Values typically range from 0 to 1, with 1 meaning the features completely explain the target's variance. (On held-out data, R-squared can be negative when the model fits worse than always predicting the mean.)
- To compute R-squared in scikit-learn, we call the model's `score` method and pass it the test features and targets.
- Another way to assess a regression model's performance is to take the mean of the residual sum of squares. This is known as the *mean squared error* (MSE).
    - MSE is measured in units of the target variable, squared.
    - e.g. if a model is predicting a dollar value, MSE will be in dollars squared.
    - Taking the square root gives the root mean squared error (RMSE), which is in the same units as the target. To calculate it, we import `root_mean_squared_error` from `sklearn.metrics` (available in scikit-learn 1.4 and later).
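
To see how these metrics relate to one another, here is a plain-Python sketch that computes all three from the same residuals. The `regression_metrics` helper and the price values are hypothetical; in practice the scikit-learn functions would normally be used instead.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R-squared, MSE, and RMSE for paired true and predicted values."""
    n = len(y_true)
    y_mean = sum(y_true) / n
    rss = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained variation
    tss = sum((t - y_mean) ** 2 for t in y_true)  # total variation around the mean
    r_squared = 1 - rss / tss
    mse = rss / n
    rmse = math.sqrt(mse)
    return r_squared, mse, rmse


# Hypothetical house prices in thousands of dollars.
y_true = [200.0, 250.0, 300.0, 350.0]
y_pred = [210.0, 240.0, 310.0, 340.0]

r_squared, mse, rmse = regression_metrics(y_true, y_pred)
print(r_squared, mse, rmse)
```

Every prediction here is off by 10 (thousand dollars), so the RMSE comes out as 10 in the target's own units, while the MSE of 100 is in squared units and is harder to interpret directly.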