# Machine Learning with Python

## Brief history of Machine Learning

Machine learning (ML), a subset of artificial intelligence (AI), has a history that reaches back to the dawn of computing. The first mathematical model of an artificial neuron was proposed by McCulloch and Pitts in 1943, laying the groundwork for ML. Turing, Minsky, and McCarthy were among the early pioneers of AI. ML gained traction in the 1950s with Samuel's checkers-playing programme and Rosenblatt's perceptron.

Research in the 1970s and 1980s focused on expert systems, rule-based systems, and learning procedures for neural networks. In the 1990s, with the invention of Support Vector Machines and the popularisation of data mining, ML shifted towards data-driven techniques. In the 2010s, deep learning techniques such as convolutional and recurrent neural networks enabled rapid advances in AI.

Today, machine learning is a fast-growing field with applications across industries, which underlines its significance within artificial intelligence and its potential to change many aspects of our lives.


## Definition of Machine Learning
Machine learning is a subset of artificial intelligence that allows computers to learn from data and make predictions or decisions without being explicitly programmed.

## Importance and Applications of Machine Learning in Daily Life
Machine learning has a wide range of applications in our daily lives, including:
- Email spam filtering
- Product recommendations on e-commerce websites
- Personalized content on streaming platforms
- Voice recognition and assistants
- Fraud detection in financial transactions

# Types of Machine Learning

## Supervised Learning
Supervised learning is a type of machine learning where the algorithm is trained on a labeled dataset, which includes both input features and the correct output. There are two main types of supervised learning:

>### Classification
Classification is the process of predicting a categorical output or class. Examples include spam detection, image recognition, and medical diagnosis.

>### Regression
Regression involves predicting a continuous output or quantity. Examples include housing price prediction, stock market analysis, and sales forecasting.

<b>Personalized ads on platforms like Instagram and Facebook are examples of machine learning in action. Specifically, they can be considered as applications of supervised learning, particularly classification and regression tasks.</b>

## Unsupervised Learning
Unsupervised learning is a type of machine learning where the algorithm is trained on an unlabeled dataset, meaning it doesn't have access to the correct output. There are two main types of unsupervised learning:

>### Clustering
Clustering is the process of grouping similar data points based on their features. Examples include customer segmentation, social network analysis, and anomaly detection.

>### Dimensionality Reduction
Dimensionality reduction involves reducing the number of features in the dataset while preserving its structure. Examples include Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE).
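
As a rough illustration of dimensionality reduction, the sketch below applies PCA to the four-feature Iris dataset that ships with scikit-learn (the dataset is described later in this notebook) and keeps two components. The variable names `X_iris`, `pca`, and `X_reduced` are chosen for this example only.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Load the four Iris measurements (the labels are not needed for PCA)
X_iris = load_iris().data

# Project the data onto its first two principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_iris)

print("Original shape:", X_iris.shape)
print("Reduced shape:", X_reduced.shape)
print("Share of variance kept by each component:", pca.explained_variance_ratio_)
```

The explained variance ratio gives a sense of how much of the dataset's structure survives the reduction from four features to two.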

## Reinforcement Learning
Reinforcement learning is a type of machine learning where an agent learns to make decisions by interacting with an environment and receiving feedback in the form of rewards or penalties. Examples include robotics, game playing, and autonomous vehicles.
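
The reward-and-feedback loop can be sketched with a very small toy problem. The example below is not how systems such as game-playing agents are built; it is a minimal epsilon-greedy sketch in which a hypothetical environment offers three actions with hidden average rewards (`true_rewards` is invented for illustration), and the agent learns from noisy feedback which action pays best.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical environment: three actions with hidden average rewards
true_rewards = np.array([0.2, 0.5, 0.8])

estimates = np.zeros(3)   # the agent's current estimate of each action's reward
counts = np.zeros(3)      # how often each action has been tried
epsilon = 0.1             # fraction of steps spent exploring at random

for step in range(2000):
    # Explore occasionally, otherwise exploit the best-looking action
    if rng.random() < epsilon:
        action = int(rng.integers(3))
    else:
        action = int(np.argmax(estimates))

    # Feedback from the environment: a noisy reward for the chosen action
    reward = rng.normal(true_rewards[action], 0.1)

    # Move the estimate towards the observed reward (running average)
    counts[action] += 1
    estimates[action] += (reward - estimates[action]) / counts[action]

print("Estimated rewards:", estimates.round(2))
print("Preferred action:", int(np.argmax(estimates)))
```

The agent is never told which action is best; it discovers this only through the rewards it receives.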

A well-known example of reinforcement learning is AlphaGo: in October 2015, the distributed version of AlphaGo defeated the European Go champion Fan Hui. You can find more about it here:
>https://www.deepmind.com/research/highlighted-research/alphago

>https://en.wikipedia.org/wiki/AlphaGo#:~:text=Match%20against%20Fan%20Hui,-Main%20article%3A%20AlphaGo&text=In%20October%202015%2C%20the%20distributed,full%2Dsized%20board%20without%20handicap.


# Basic Terminology

>## Data: Features and Labels
- **Features**: The independent variables or input data used to train a machine learning model.
- **Labels**: The dependent variables or output data that the model tries to predict.

>## Model
A model is an algorithm that learns from the input data (features) to make predictions or decisions.

>## Training
Training is the process of teaching a model using the input data (features) and the corresponding output data (labels).

>## Prediction
Prediction is the process of making educated guesses or estimates based on the trained model and new, unseen input data.
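
The four terms above can be seen together in a few lines. This is a minimal sketch on made-up data: the six "students", their hours studied and slept, and the pass/fail labels are all hypothetical, and a decision tree is used only because it needs no tuning here.

```python
from sklearn.tree import DecisionTreeClassifier

# Features: [hours studied, hours slept] for six hypothetical students
features = [[1, 4], [2, 5], [3, 6], [6, 7], [7, 8], [8, 6]]
# Labels: 0 = failed the exam, 1 = passed
labels = [0, 0, 0, 1, 1, 1]

# Model: the algorithm that will learn from the data
clf = DecisionTreeClassifier(random_state=0)

# Training: show the model the features together with their labels
clf.fit(features, labels)

# Prediction: ask the trained model about new, unseen students
new_students = [[2, 6], [7, 7]]
print(clf.predict(new_students))
```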

### Linear Regression

Linear regression is a simple machine learning algorithm used to model the relationship between a target variable (output) and one or more input features. It assumes that the relationship between the target and input features is linear. It's commonly used for predicting numerical values.

In [None]:
import numpy as np
from sklearn.linear_model import LinearRegression

# Generate house sizes (in square meters) and prices
house_sizes = np.arange(20, 70, 1).reshape(-1, 1)

# Prices follow a simple linear relationship with the house sizes:
# the base price is 490, and the price increases by 10 for every
# additional square meter, i.e. price = 490 + (10 * house_size).
prices = 490 + house_sizes * 10

# Create and train the linear regression model
model = LinearRegression()
model.fit(house_sizes, prices)

# Predict the price of a house with an area of 80 square meters
predicted_price = model.predict([[80]])
print("Predicted price of an 80 sqm house:", predicted_price[0][0])
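
Because the prices were generated from a known formula (price = 490 + 10 * house_size), one way to check what the model has learned is to inspect its fitted parameters. The sketch below rebuilds the same synthetic data under new names (`sizes`, `reg`) so that it runs on its own.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same synthetic data as in the cell above: price = 490 + 10 * size
sizes = np.arange(20, 70, 1).reshape(-1, 1)
prices = 490 + sizes * 10

reg = LinearRegression().fit(sizes, prices)

# The fitted slope and intercept should match the values used to build the data
slope = reg.coef_[0][0]
intercept = reg.intercept_[0]
print("Learned slope:", round(slope, 2))
print("Learned intercept:", round(intercept, 2))
```

On real data the relationship is noisy, so the learned slope and intercept would only approximate the underlying trend.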

### Decision Tree Classifier
A Decision Tree Classifier is a type of machine learning model used for making decisions or predictions based on data. 
In layman's terms, it can be thought of as a flowchart or a tree-like structure that helps in making decisions by following a 
series of simple rules or questions.

In [None]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn import tree

# Load the Iris dataset
iris = load_iris()
iris_df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                       columns=iris['feature_names'] + ['target'])

# Prepare the features (X) and target (y) arrays
X, y = iris_df.iloc[:, :4].values, iris_df.iloc[:, -1].values

# Split the dataset into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=18)

# Train a decision tree classifier on the training set
dt = tree.DecisionTreeClassifier()

dt.fit(X_train, y_train)

# Make a prediction on a randomly chosen sample from the dataset
random_sample = iris_df.sample(1)
random_sample_x = random_sample.iloc[0, :4].values

# Predict the class of the random sample
prediction = dt.predict(random_sample_x.reshape(1, -1))
print(prediction)

The output is the predicted class label for the randomly chosen iris flower sample. The class label is an integer that represents one of the three iris species in the dataset:

- 0: Iris setosa
- 1: Iris versicolor
- 2: Iris virginica

In this case, the output `[1.]` indicates that the model predicts the randomly selected iris flower to be of the species Iris versicolor. Because the code uses a random sample, the output may differ each time the code is executed, and the model may predict a different class for another random sample.
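
To see the flowchart-like rules that a decision tree learns, scikit-learn's `export_text` can print them. The sketch below trains a separate, deliberately shallow tree (`small_tree`, with `max_depth=2` chosen only to keep the printout short), so its rules are not the same as those of the `dt` model above.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris_data = load_iris()

# A shallow tree keeps the printed rules short; max_depth=2 is an illustrative choice
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
small_tree.fit(iris_data.data, iris_data.target)

# Print the learned questions as indented if/else rules
rules = export_text(small_tree, feature_names=list(iris_data.feature_names))
print(rules)
```

Each indented line is one question about a feature, and each leaf shows the class that the tree predicts for samples reaching it.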

The `np.c_[]` expression concatenates the feature data (`iris['data']`) and the target data (`iris['target']`) column-wise. This means that the target column is added as the last column after the feature columns. The result is a two-dimensional NumPy array that is used as the data for the DataFrame.
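
A tiny example may make the column-wise concatenation easier to picture. The arrays below are made up for illustration.

```python
import numpy as np

# Three hypothetical samples with two features each
features = np.array([[5.1, 3.5],
                     [4.9, 3.0],
                     [6.2, 2.9]])
# One label per sample
target = np.array([0, 0, 1])

# np.c_ places the 1-D target next to the 2-D feature array as an extra column
combined = np.c_[features, target]
print(combined.shape)
print(combined)
```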

##### What is the Iris dataset?
The Iris dataset, also known as Fisher's Iris dataset, is a classic and widely used dataset in the field of machine learning and statistics. The dataset consists of 150 samples of iris flowers, equally divided into three classes representing three different species of iris: Iris setosa, Iris versicolor, and Iris virginica. For each sample, there are four features (or attributes) measured in centimeters:

1. Sepal length
2. Sepal width
3. Petal length
4. Petal width

The main objective when working with the Iris dataset is to develop a classification model that can accurately predict the species of an iris flower based on its sepal and petal measurements. Due to its simplicity and clear structure, the Iris dataset is often used as an introductory dataset for teaching machine learning techniques and classification algorithms.

##### Splitting the dataset
Splitting the dataset into separate training and testing sets is essential for evaluating the performance and generalization ability of a machine learning model. By training the model on one subset of the data and testing it on another unseen subset, we can estimate how well the model will perform on new, unseen data.

The choice of splitting ratio, such as 80-20, 70-30, or 60-40, depends on the size of the dataset and the problem at hand. There is no strict rule for choosing the exact ratio, but some general guidelines can be followed:

- If the dataset is large enough, an 80-20 or 70-30 split can provide a good balance between the amount of data used for training the model and the amount reserved for testing its performance.
- If the dataset is relatively small, a higher proportion of the data should be allocated to the training set to ensure that the model has enough examples to learn from. In such cases, a 90-10 or 85-15 split might be more appropriate.
- When dealing with imbalanced datasets or situations where the model's performance on a specific class is of utmost importance, techniques like stratified sampling or cross-validation can be used to ensure a better distribution of classes in both training and testing sets.
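
Stratified sampling, mentioned in the last guideline, is available through the `stratify` argument of `train_test_split`. The sketch below uses the Iris data with an 80-20 split; the names `X_tr`, `X_te`, `y_tr`, and `y_te` are chosen so that they do not clash with the variables used elsewhere in this notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_all, y_all = load_iris(return_X_y=True)

# stratify=y_all keeps the class proportions the same in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.2, random_state=18, stratify=y_all
)

print("Training samples per class:", np.bincount(y_tr))
print("Testing samples per class:", np.bincount(y_te))
```

Without `stratify`, the per-class counts in the test set can drift away from the proportions in the full dataset, which matters most when some classes are rare.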

### Logistic regression

Logistic regression is a way to predict whether something belongs to one group or another based on certain features. Imagine you have a basket of apples and oranges, and you want to identify the fruit type based on its color and size. Logistic regression helps you find the best way to separate the apples from the oranges using these features.

In simple terms, logistic regression takes information about an item (like its color and size) and calculates the probability of it belonging to a specific group (e.g., apple or orange). It does this by learning from examples where the right answer is already known. Once the model is trained, it can be used to predict the group membership of new, unseen items.
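
For the two-group case, the probability calculation can be sketched directly: the model computes a weighted sum of the features and passes it through the sigmoid function. The weights, bias, and fruit measurements below are hypothetical values picked for illustration, not the result of any training.

```python
import numpy as np

def sigmoid(z):
    """Squash any real-valued score into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights for two features (colour score, size in cm) and a bias
weights = np.array([1.5, -0.4])
bias = 0.2

fruit = np.array([0.9, 7.0])            # one hypothetical fruit
score = np.dot(weights, fruit) + bias   # weighted sum of the features
probability = sigmoid(score)            # probability of the "apple" group

print("Score:", round(float(score), 3))
print("P(apple):", round(float(probability), 3))
```

Training a logistic regression model amounts to finding the weights and bias that make these probabilities agree with the known answers as closely as possible.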

In [None]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Load the Iris dataset
iris = load_iris()
iris_df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                       columns=iris['feature_names'] + ['target'])

# Train a logistic regression model on the training set
# (X_train and y_train come from the split in the decision tree example)
logreg = LogisticRegression(max_iter=300, random_state=0)
logreg.fit(X_train, y_train)

# Select a random sample from the dataset
random_sample = iris_df.sample(1)
random_sample_x = random_sample.iloc[0, :4].values

# Predict the class of the random sample using logistic regression
pred_log = logreg.predict(random_sample_x.reshape(1, -1))
print(pred_log)

# Predict the class probabilities of the random sample using logistic regression
pred_proba = logreg.predict_proba(random_sample_x.reshape(1, -1))
print(pred_proba)

`[2.]`: This is the predicted class label for the randomly chosen iris flower sample. 

The second output is a two-dimensional array containing the predicted class probabilities for the random sample. Each value in the array is the probability, according to the logistic regression model, that the sample belongs to a specific class. The values in each row sum to 1.

In this example, the class probabilities are:
- Iris setosa (class 0): 0.00070041
- Iris versicolor (class 1): 0.49433337
- Iris virginica (class 2): 0.50496622

The model predicts that the random sample has a very low probability of being Iris setosa, a probability just under 0.5 of being Iris versicolor, and, by a narrow margin, the highest probability of being Iris virginica. As expected, the class with the highest probability (Iris virginica) corresponds to the predicted class label `[2.]`.
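
The link between the probabilities and the predicted label can be checked in code. This sketch trains a fresh model on the full Iris data (an assumption made only to keep the example self-contained) and picks the class with the highest probability for the first five samples.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X_all, y_all = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=300).fit(X_all, y_all)

# Class probabilities for the first five samples (one row per sample)
proba = clf.predict_proba(X_all[:5])

# Picking the most probable class in each row reproduces predict()
labels_from_proba = clf.classes_[np.argmax(proba, axis=1)]

print("Row sums:", proba.sum(axis=1))
print("Labels from probabilities:", labels_from_proba)
print("Labels from predict():    ", clf.predict(X_all[:5]))
```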

## Accuracy

In [None]:
from sklearn.metrics import accuracy_score

# Predict class labels for the test and train sets
pred_dt = dt.predict(X_test)
pred_dt_train = dt.predict(X_train)
pred_logreg_train = logreg.predict(X_train)
pred_logreg = logreg.predict(X_test)

# Calculate the accuracy scores for the decision tree and logistic regression models
dt_acc = accuracy_score(y_test, pred_dt)
logreg_acc = accuracy_score(y_test, pred_logreg)

# Print the accuracy scores
print(f"DT accuracy: {dt_acc}\nLogistic Regression accuracy: {logreg_acc}")

Accuracy score is a commonly used metric to evaluate the performance of classification models.

An accuracy score of 0.95 means that the model correctly predicts the class labels for 95% of the samples in the test set. A higher accuracy score indicates better performance, but it's important to keep in mind that accuracy can be misleading in cases where the dataset is imbalanced or when the model's performance on specific classes is more important. In such cases, other evaluation metrics like precision, recall, or F1 score might be more appropriate.

Recall (sensitivity or true positive rate) measures the proportion of actual positive instances that were correctly predicted by the model. A high recall indicates that the model is good at identifying positive instances.

Precision (positive predictive value) measures the proportion of true positive instances among the instances predicted as positive by the model. A high precision indicates that the model is good at not labeling negative instances as positive.

F1 score is the harmonic mean of precision and recall, and it balances both metrics. It's a good single-number summary of a model's performance, especially when dealing with imbalanced datasets or when both false positives and false negatives are important to consider.
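
The three definitions can be worked through by hand on a small binary example. The labels below are hypothetical; the manual results are compared with scikit-learn's functions as a sanity check.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical binary labels: 1 = positive, 0 = negative
true_labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
pred_labels = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(true_labels, pred_labels))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives

manual_precision = tp / (tp + fp)   # share of predicted positives that are correct
manual_recall = tp / (tp + fn)      # share of actual positives that were found
manual_f1 = 2 * manual_precision * manual_recall / (manual_precision + manual_recall)

print("Manual: ", manual_precision, manual_recall, manual_f1)
print("sklearn:", precision_score(true_labels, pred_labels),
      recall_score(true_labels, pred_labels),
      f1_score(true_labels, pred_labels))
```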

In [None]:
from sklearn.metrics import recall_score, precision_score, f1_score, classification_report

# Calculate recall, precision, and F1 score for the decision tree model
recall = recall_score(y_test, pred_dt, average="macro")
prec_score = precision_score(y_test, pred_dt, average="macro")
f1 = f1_score(y_test, pred_dt, average="macro")

# Print the recall, precision, and F1 score
print(f"DT precision, recall, and F1: {prec_score, recall, f1}")

# Print the classification report for the decision tree model
print("----Decision Tree----")
print(classification_report(y_test, pred_dt))

This code snippet calculates the precision, recall, and F1 score for the decision tree model on the test set and then prints a summary of these classification metrics using the classification_report function.

In [None]:
print("Logistic Regression")
print(classification_report(y_test, pred_logreg))

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

y_pred = dt.predict(X_test)
cm = confusion_matrix(y_test, y_pred)

print("Decision Tree")
cmd = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=iris.target_names)
cmd.plot()

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Load the Iris dataset
iris = load_iris()
iris_df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                       columns=iris['feature_names'] + ['target'])

# Prepare the features (X) and target (y) arrays
X, y = iris_df.iloc[:, :4].values, iris_df.iloc[:, -1].values

# Split the dataset into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=18)

# Train a logistic regression model on the training set
logreg = LogisticRegression(max_iter=300, random_state=0)
logreg.fit(X_train, y_train)

# Predict the labels for the test set
y_pred = logreg.predict(X_test)

# Compute the confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Plot the confusion matrix using ConfusionMatrixDisplay
print("Logistic Regression")
cmd = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=iris.target_names)
cmd.plot()

The confusion matrix provides a visual representation of the model's performance, making it easier to see where the model is making correct predictions and where it is making mistakes. The diagonal elements in the confusion matrix represent the number of correct predictions, while the off-diagonal elements represent the number of incorrect predictions. A perfect model would have all its predictions on the diagonal, resulting in a confusion matrix with zero values in all off-diagonal elements.
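
Since the diagonal holds the correct predictions, accuracy can be read straight off a confusion matrix. The sketch below uses a small set of made-up three-class labels to show this.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

# Hypothetical true and predicted labels for nine samples
true_labels = [0, 0, 1, 1, 1, 2, 2, 2, 2]
pred_labels = [0, 0, 1, 2, 1, 2, 2, 1, 2]

# Rows are the true classes, columns are the predicted classes
cm_demo = confusion_matrix(true_labels, pred_labels)
print(cm_demo)

correct = np.trace(cm_demo)   # sum of the diagonal = correct predictions
total = cm_demo.sum()         # all predictions
print("Accuracy from the matrix:", correct / total)
print("accuracy_score:          ", accuracy_score(true_labels, pred_labels))
```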

# Task 1

Create a model using the Logistic Regression classifier on the diabetes.csv data. Use the Outcome column (which indicates whether the patient has diabetes) as the target variable. To do that, complete the following steps:

- a) Create train and test splits. The train split should contain 80% of the data, and the test split 20%.
- b) Fit the logistic regression model to the training data.

Hint: Explore the data first.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Read the dataset
diabetes_data = pd.read_csv('diabetes.csv')

# Separate features and target
X = diabetes_data.drop('Outcome', axis=1)
y = diabetes_data['Outcome']

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and fit the logistic regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Check the model's accuracy on the test set
acc = model.score(X_test, y_test)
print('Accuracy:', acc)

# Task 2
Use the logistic regression model that you created from diabetes.csv in Task 1 to answer the following questions:

- a) Provide prediction for the patients with diabetes
- b) Evaluate your model using Confusion Matrix
- c) Evaluate your model using accuracy, precision, and recall.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

# Load the diabetes dataset
diabetes_df = pd.read_csv('diabetes.csv')

# Preprocess the data
X = diabetes_df.drop(columns=['Outcome']).values
y = diabetes_df['Outcome'].values

# Split the data into training (80%) and testing (20%) sets, as in Task 1
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a logistic regression model on the training set
logreg = LogisticRegression(max_iter=1000, random_state=0)
logreg.fit(X_train, y_train)

# Predict the diabetes outcome for the patients in the test set
pred = logreg.predict(X_test)
print(pred)

# Evaluate the model using confusion matrix
cm = confusion_matrix(y_test, pred)
print(cm)

# Evaluate the model using accuracy, precision, and recall
acc = accuracy_score(y_test, pred)
prec = precision_score(y_test, pred)
rec = recall_score(y_test, pred)
print('Accuracy:', acc)
print('Precision:', prec)
print('Recall:', rec)

## Cross validation
Cross-validation is a technique used to evaluate the performance of a machine learning model by testing its ability to make predictions on new, unseen data. It helps to ensure that the model is not just memorizing the training data but is actually learning to generalize from it.

The main idea behind cross-validation is to split the available data into several smaller sets, called "folds." The model is then trained on some of these folds and tested on the remaining fold(s). This process is repeated multiple times, with different folds being used for testing each time. Finally, the results from each round are combined to provide an overall estimate of the model's performance.

By doing this, cross-validation gives a better idea of how well the model will perform on new data, while still making the most of the available data for training. It also helps to detect overfitting, which is when the model becomes too specialized in the training data and performs poorly on new, unseen data.
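
The fold-by-fold procedure can be written out as an explicit loop. This is a minimal sketch using scikit-learn's `KFold` on the Iris data; the shuffling and the choice of logistic regression are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X_all, y_all = load_iris(return_X_y=True)

# Shuffle before splitting because the Iris samples are ordered by class
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, test_idx in kfold.split(X_all):
    fold_model = LogisticRegression(max_iter=300)
    # Train on four folds ...
    fold_model.fit(X_all[train_idx], y_all[train_idx])
    # ... and measure accuracy on the remaining fold
    fold_scores.append(fold_model.score(X_all[test_idx], y_all[test_idx]))

print("Accuracy per fold:", np.round(fold_scores, 3))
print("Mean accuracy:", round(float(np.mean(fold_scores)), 3))
```

Every sample is used for testing exactly once, and the mean of the fold scores serves as the overall estimate.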

In [None]:
from sklearn import tree
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Reload the Iris features and labels (X and y were overwritten by the diabetes tasks)
X, y = load_iris(return_X_y=True)

# Create the decision tree and logistic regression models
dt = tree.DecisionTreeClassifier()
logreg = LogisticRegression(max_iter=300, random_state=18)

# Perform 5-fold cross-validation on the decision tree model
dt_cv = cross_val_score(dt, X, y, scoring="accuracy", cv=5)
print("Decision Tree accuracy scores for each fold:", dt_cv)

# Perform 5-fold cross-validation on the logistic regression model
logreg_cv = cross_val_score(logreg, X, y, scoring="accuracy", cv=5)
print("\nLogistic Regression accuracy scores for each fold:", logreg_cv)

The code snippet performs 5-fold cross-validation for both the decision tree and logistic regression models on the iris dataset. It calculates the accuracy score for each fold and then prints the scores for both models.

# Task 3
Implement 5-fold cross-validation with LogisticRegression on the digits dataset, which scikit-learn provides through load_digits. You can randomly select your target variable.
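
One possible starting point is sketched below. It assumes the dataset's built-in digit label is used as the target, and it adds a `StandardScaler` step in a pipeline, which is not part of the task statement, to help the solver converge.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 8x8 digit images flattened to 64 features; the target is the digit (0-9)
X_digits, y_digits = load_digits(return_X_y=True)

# Scaling inside the pipeline is an added assumption, refitted on each training fold
digits_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5-fold cross-validation, reporting accuracy for each fold
digits_cv = cross_val_score(digits_model, X_digits, y_digits, scoring="accuracy", cv=5)
print("Accuracy scores for each fold:", digits_cv)
print("Mean accuracy:", digits_cv.mean())
```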

# Clustering

K-means clustering is a popular clustering algorithm that aims to partition the data into K clusters, where each data point belongs to the cluster with the nearest mean (centroid). The algorithm iteratively refines the centroids until they are stable, meaning the data points' assignments to clusters don't change anymore.
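
The assign-then-update loop described above can be sketched in plain NumPy. The data here are two made-up, well-separated blobs, and the initial centroids are simply one point taken from each blob, which is a simplification compared with the initialisation that library implementations use.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: two well-separated blobs of 2-D points
points = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.3, size=(20, 2)),
    rng.normal(loc=[5.0, 5.0], scale=0.3, size=(20, 2)),
])

# Start from one point of each blob as the initial centroids (a simplification)
centroids = points[[0, 20]].copy()

for iteration in range(100):
    # Assignment step: each point joins the cluster with the nearest centroid
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    assignments = distances.argmin(axis=1)

    # Update step: each centroid moves to the mean of its assigned points
    new_centroids = np.array([points[assignments == k].mean(axis=0) for k in range(2)])

    # Stop once the centroids no longer move
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print("Final centroids:\n", centroids.round(2))
print("Cluster sizes:", np.bincount(assignments))
```

This sketch does not handle clusters that end up empty; scikit-learn's `KMeans` takes care of that case and of choosing good starting centroids.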

### Basic concept

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Generate some random data
np.random.seed(42)
X = np.random.rand(50, 2)

# Perform k-means clustering with k=3
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)

# Plot the clusters
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap='viridis')
plt.show()

### A more complex example

In [None]:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Load the dataset
df = pd.read_csv("datasets/countries.csv")

# Preprocess data
scaler = StandardScaler()
X = scaler.fit_transform(df[['latitude', 'longitude']])

# K-Means clustering
n_clusters = 7
kmeans = KMeans(n_clusters=n_clusters, random_state=42)
df['Cluster'] = kmeans.fit_predict(X)

# Display clusters
for i in range(n_clusters):
    print(f"Cluster {i}:")
    print(df[df['Cluster'] == i]['country'].tolist())
    print()


In [None]:
import matplotlib.pyplot as plt

# Create a scatter plot with different colors for each cluster
colors = plt.get_cmap('viridis', n_clusters)
fig, ax = plt.subplots(figsize=(12, 8))

for i in range(n_clusters):
    cluster_data = df[df['Cluster'] == i]
    ax.scatter(cluster_data['longitude'], cluster_data['latitude'], c=[colors(i)], label=f'Cluster {i}', alpha=0.6, edgecolors='w', s=100)

ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude')
ax.set_title('Country Clusters')
ax.legend()
plt.show()