I am a Master's student in Data Science based in the UAE.
- Data Analysis
- Business Analytics
- Data Visualization
- Machine Learning Algorithms
- Deep Learning
Python, R, Google Analytics 4, Tableau, Power BI, Jira, SQL*Plus, MySQL, SAP
Here is a summary of all the details of my Machine Learning project so far!
In this unit, I delved into the evolving landscape of machine learning and its significance in deriving insights from complex data. Through hands-on experience and theoretical exploration, I developed an understanding of how algorithms can be trained to recognize patterns, make predictions, and support data-driven strategies. Machine learning emerged not just as a technical skillset, but as a practical tool for solving real-world problems, transforming raw information into actionable intelligence across diverse business contexts.
- Understand the fundamentals of machine learning concepts and algorithms
- Learn how to apply machine learning techniques to real-world datasets
- Develop the ability to critically evaluate model performance and outcomes
- Engage in self-reflection on the ethical and practical implications of machine learning
The activities, paired with self-reflection, really made me aware of where I stood in terms of my personal development.
Learned Objectives:
- Apply and critically appraise machine learning techniques to real-world problems, particularly where technical risk and uncertainty are involved.
EDA Tutorial Missing Values Identified:
- horsepower has 6 missing values.
Here’s what we found in the numerical columns:
Skewness:
mpg 0.457092
cylinders 0.508109
displacement 0.701669
horsepower 1.087326
weight 0.519586
acceleration 0.291587
model year 0.019688
origin 0.915185
Highest skewness:
- horsepower: 1.09 (right-skewed)
- origin
- displacement
Lowest skewness (almost symmetric):
- model year
Kurtosis:
mpg -0.515993
cylinders -1.398199
displacement -0.778317
horsepower 0.696947
weight -0.809259
acceleration 0.444234
model year -1.167446
origin -0.841885
Most peaked (high kurtosis):
- horsepower
- acceleration
Light tails (platykurtic):
- cylinders
- model year
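The notebook itself isn't reproduced here, so the following is a minimal sketch of how these checks could be run with pandas; the file name is an assumption, and the columns are those of the Auto MPG data described above.
import pandas as pd

# assumed file name for the Auto MPG dataset used in the EDA tutorial
df = pd.read_csv("auto-mpg.csv")

# count missing values per column (horsepower showed 6 missing entries)
print(df.isnull().sum())

# skewness and kurtosis of the numerical columns
numeric = df.select_dtypes(include="number")
print(numeric.skew())
print(numeric.kurtosis())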
Learned Objectives:
- Understand how changes in data points impact correlation and regression.
Pearson's Correlation
The following results were observed when we changed the variable values.
No Noise
Code:
Output:
When we eliminate the noise, we are looking at a perfect one-to-one relationship in which every point falls neatly into place, forming a straight line, and the Pearson correlation hits its maximum of 1.0000.
Less Noise
Code:
Output:
When we add a bit of noise, that perfect line starts to wobble slightly. The connection is still very strong, and the correlation stays high, around 0.9 to 0.99.
High Noise
Code:
Output:
When we throw in a lot of noise, things start to fall apart. The link between the two variables weakens, the correlation drops, and the scatterplot looks more like a cloud than a line, making it much harder to spot any clear trend.
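The original code and output screenshots are not reproduced above, so here is a minimal sketch of the kind of experiment described; it assumes only NumPy, and the slope, intercept, and noise levels are illustrative rather than the exact values used in the tutorial.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y_true = 2 * x + 1

# three versions of y: exact, slightly noisy, and very noisy
versions = {
    "no noise": y_true,
    "less noise": y_true + rng.normal(0, 0.5, x.size),
    "high noise": y_true + rng.normal(0, 10, x.size),
}

for label, y in versions.items():
    # Pearson correlation between x and each version of y
    r = np.corrcoef(x, y)[0, 1]
    print(f"{label}: r = {r:.4f}")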
Linear Regression
When looking at linear regression, the way the data behaves makes all the difference. If there is a clear trend and the numbers don't vary too much, the line fits nicely and the connection between variables is easy to see. Even when adding a bit of noise, the pattern still holds; it is just a little messier. The more randomness you throw in, the harder it becomes to spot any solid link: the data spreads out and the line no longer captures what is going on. Outliers, even just one or two unusual values, can throw everything off and completely shift the direction of the line. And if there is no actual relationship between the variables, the line pretty much flattens out, because there is nothing meaningful to predict.
Code:
Output:
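As a rough illustration of the behaviour described above (not the original tutorial code), the sketch below fits a straight line with scipy.stats.linregress and shows how a single outlier can pull the slope and weaken the fit; the data values are made up.
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(1)
x = np.arange(20, dtype=float)
y = 3 * x + 2 + rng.normal(0, 1, x.size)  # clear trend with mild noise

fit = linregress(x, y)
print(f"clean data:   slope = {fit.slope:.2f}, r = {fit.rvalue:.3f}")

# one extreme value is enough to shift the line and lower the correlation
y_outlier = y.copy()
y_outlier[-1] += 80
fit_out = linregress(x, y_outlier)
print(f"with outlier: slope = {fit_out.slope:.2f}, r = {fit_out.rvalue:.3f}")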
Predict Future Values
If you tweak the data, the whole pattern shifts, so the line the model draws will change too. With more consistent values, predictions usually improve. But if we try to predict far beyond the original data range, the model becomes less reliable, because it is essentially estimating without solid reference points.
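A minimal sketch of that point, again with made-up numbers: a prediction inside the range of the observed data rests on solid reference points, while a prediction far beyond it only assumes the trend continues.
import numpy as np

rng = np.random.default_rng(2)
x = np.arange(1, 11, dtype=float)
y = 5 * x + 10 + rng.normal(0, 2, x.size)

# fit a straight line y = m*x + b to the observed range (x from 1 to 10)
m, b = np.polyfit(x, y, 1)

print("prediction at x = 7 (inside the data range):", m * 7 + b)
print("prediction at x = 50 (far outside the data range):", m * 50 + b)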
Multiple Linear Regression
Code:
import pandas
from sklearn import linear_model
df = pandas.read_csv("cars.csv")
X = df[['Weight', 'Volume']]
y = df['CO2']
regr = linear_model.LinearRegression()
regr.fit(X, y)
#predict the CO2 emission of a car where the weight is 2300kg, and the volume is 1300cm3:
predictedCO2 = regr.predict([[2300, 1300]])
print(predictedCO2)
#print the regression coefficients for weight and volume
print(regr.coef_)
#predict the CO2 emission when the weight increases to 3300kg
predictedCO2 = regr.predict([[3300, 1300]])
print(predictedCO2)
Polynomial Regression
import numpy
import matplotlib.pyplot as plt
x = [1,2,3,5,6,7,8,9,10,12,13,14,15,16,18,19,21,22]
y = [100,90,80,60,60,55,60,65,70,70,75,76,78,79,90,99,99,100]
#NumPy has a method that lets us make a polynomial model
mymodel = numpy.poly1d(numpy.polyfit(x, y, 3))
#specify how the line will display, we start at position 1, and end at position 22
myline = numpy.linspace(1, 22, 100)
plt.scatter(x, y)
plt.plot(myline, mymodel(myline))
plt.show()
Output:
import numpy
from sklearn.metrics import r2_score
x = [1,2,3,5,6,7,8,9,10,12,13,14,15,16,18,19,21,22]
y = [100,90,80,60,60,55,60,65,70,70,75,76,78,79,90,99,99,100]
mymodel = numpy.poly1d(numpy.polyfit(x, y, 3))
print(r2_score(y, mymodel(x)))
Output:
import numpy
from sklearn.metrics import r2_score
x = [1,2,3,5,6,7,8,9,10,12,13,14,15,16,18,19,21,22]
y = [100,90,80,60,60,55,60,65,70,70,75,76,78,79,90,99,99,100]
mymodel = numpy.poly1d(numpy.polyfit(x, y, 3))
speed = mymodel(17)
print(speed)
Output:
Learned Objectives:
- Understand the applicability and challenges associated with different datasets for the use of machine learning algorithms.
Correlation Heatmap:
Linear Regression:
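The heatmap and regression figures themselves are not reproduced here; below is a minimal sketch of how such outputs might be generated with pandas, seaborn, and scikit-learn. The file and column names are placeholders, not the dataset actually used in this activity.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

df = pd.read_csv("dataset.csv")  # placeholder file name

# correlation heatmap of the numeric columns
corr = df.select_dtypes(include="number").corr()
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()

# simple linear regression between two placeholder columns
X = df[["feature"]]
y = df["target"]
model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)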
Learned Objectives:
- Clustering and how it relates to the underlying algorithm logic
Watching the K-Means animation helped me grasp how the algorithm works beyond just the steps.
The first animation clearly demonstrated how K-Means clustering works step by step. It starts with randomly placing centroids, then repeatedly assigns points to the nearest centroid and updates the centroid positions based on the average location of those assigned points. I noticed that when centroids were placed too far from the main data distribution, they often ended up with very few or even no points assigned to them. In contrast, centroids that began closer to the centre of the data resulted in more balanced clusters and smoother convergence.
Convergence in K-Means occurs when the centroids no longer move significantly between iterations, meaning the cluster assignments have stabilised and further updates do not change the outcome. The animation made this process easy to visualise, as the centroids gradually shifted less and less with each step until they finally settled.
In the second animation, using the “Uniform Points” option, I manually selected the initial centroid positions. Regardless of the starting points, the algorithm consistently produced evenly spaced clusters. This showed that when data is uniformly distributed, K-Means tends to converge reliably, as there are no natural groupings that could mislead the algorithm.
Overall, the animations reinforced how important both initial centroid placement and data distribution are for K-Means to generate meaningful and consistent results and provided a clear view of how the algorithm reaches convergence during clustering. K-Means doesn’t consider the shape or density of clusters, it only focuses on distance to the centre. Hence, highlighting if the data isn’t roughly circular or evenly distributed, the results might not actually reflect the real structure. That’s something I’ll be more aware of when deciding whether or not K-Means is the right approach for a dataset.
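To connect the animation back to code, here is a minimal sketch of the same loop it visualises: assign each point to its nearest centroid, recompute the centroids, and stop once they barely move. It uses NumPy with made-up 2D data rather than the animation's dataset.
import numpy as np

rng = np.random.default_rng(3)
# made-up 2D points forming two loose blobs
points = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

k = 2
# random initial centroids drawn from the data
centroids = points[rng.choice(len(points), k, replace=False)]

for _ in range(100):
    # assignment step: each point goes to its nearest centroid
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # update step: move each centroid to the mean of its assigned points
    new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
    # convergence: stop once the centroids barely move between iterations
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print("final centroids:", centroids)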
Jaccard coefficient:
- (Jack, Mary) = 3 / 7 = 0.429
- (Jack, Jim) = 5 / 7 = 0.714
- (Jim, Mary) = 1 / 7 = 0.143
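The attribute sets behind the Jack, Mary, and Jim values are not reproduced here, so the sets in the sketch below are hypothetical; it simply shows how the coefficient itself, the size of the intersection over the size of the union, is computed.
def jaccard(a, b):
    # Jaccard coefficient: size of the intersection over size of the union
    return len(a & b) / len(a | b)

# hypothetical attribute sets, not the ones from the exercise
jack = {"a", "b", "c", "d"}
mary = {"c", "d", "e"}
jim = {"b", "c", "d", "f"}

print("Jack-Mary:", round(jaccard(jack, mary), 3))
print("Jack-Jim:", round(jaccard(jack, jim), 3))
print("Jim-Mary:", round(jaccard(jim, mary), 3))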
Simple Perceptron:
import numpy as np
inputs = np.array([45, 25])
# Check the type of the inputs
type(inputs)
# check the value at index position 0
inputs[0]
# creating the weights as Numpy array
weights = np.array([0.7, 0.1])
# Check the value at index 0
weights[0]
def sum_func(inputs, weights):
    # weighted sum of the inputs
    return inputs.dot(weights)
# for weights = [0.7, 0.1]
s_prob1 = sum_func(inputs, weights)
s_prob1
def step_function(total):
    # step activation: output 1 once the weighted sum reaches the threshold of 1
    if total >= 1:
        print('The Sum Function is greater than or equal to 1')
        return 1
    else:
        print('The Sum Function is NOT greater')
        return 0
step_function(s_prob1)
Output:
Gradient Descent Cost Function: The following code is the result of changing the number of iterations and the learning rate. It was observed that the cost increases from one iteration to the next when the learning rate is raised too far, because each oversized step overshoots the minimum instead of settling into it.
import numpy as np
def gradient_descent(x, y):
    m_curr = b_curr = 0
    iterations = 10
    n = len(x)
    learning_rate = 0.08
    for i in range(iterations):
        y_predicted = m_curr * x + b_curr
        # mean squared error cost for the current line
        cost = (1/n) * sum([val**2 for val in (y - y_predicted)])
        # partial derivatives of the cost with respect to m and b
        md = -(2/n) * sum(x * (y - y_predicted))
        bd = -(2/n) * sum(y - y_predicted)
        # step both parameters against the gradient
        m_curr = m_curr - learning_rate * md
        b_curr = b_curr - learning_rate * bd
        print("m {}, b {}, cost {} iteration {}".format(m_curr, b_curr, cost, i))
x = np.array([1, 2, 3, 4, 5])
y = np.array([5, 7, 9, 11, 13])
gradient_descent(x, y)
Output:
Model Performance Measurement Code:
from sklearn.metrics import confusion_matrix
tn, fp, fn, tp = confusion_matrix([0, 1, 0, 1], [1, 1, 1, 0]).ravel()
(tn, fp, fn, tp)
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
X, y = make_classification(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y,
random_state=0)
clf = SVC(random_state=0)
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
cm = confusion_matrix(y_test, predictions, labels=clf.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
display_labels=clf.classes_)
disp.plot()
plt.show()
Output:
from sklearn.metrics import f1_score
y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]
print(f"Macro f1 score: {f1_score(y_true, y_pred, average='macro')}")
print(f"Micro F1: {f1_score(y_true, y_pred, average='micro')}")
print(f"Weighted Average F1: {f1_score(y_true, y_pred, average='weighted')}")
print(f"F1 No Average: {f1_score(y_true, y_pred, average=None)}")
y_true = [0, 0, 0, 0, 0, 0]
y_pred = [0, 0, 0, 0, 0, 0]
f1_score(y_true, y_pred, zero_division=1)
# multilabel classification
y_true = [[0, 0, 0], [1, 1, 1], [0, 1, 1]]
y_pred = [[0, 0, 0], [1, 1, 1], [1, 1, 0]]
print(f"F1 No Average: {f1_score(y_true, y_pred, average=None)}")
Output:
from sklearn.metrics import accuracy_score
y_pred = [0, 2, 1, 3]
y_true = [0, 1, 2, 3]
accuracy_score(y_true, y_pred)
Output:
from sklearn.metrics import precision_score
y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]
precision_score(y_true, y_pred, average='macro')
Output:
from sklearn.metrics import recall_score
y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]
recall_score(y_true, y_pred, average='macro')
Output:
from sklearn.metrics import classification_report
y_true = [0, 1, 2, 2, 2]
y_pred = [0, 0, 2, 2, 1]
target_names = ['class 0', 'class 1', 'class 2']
print(classification_report(y_true, y_pred, target_names=target_names))
Output:
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
X, y = load_breast_cancer(return_X_y=True)
clf = LogisticRegression(solver="liblinear", random_state=0).fit(X, y)
roc_auc_score(y, clf.predict_proba(X)[:, 1])
Output:
#multiclass case
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
clf = LogisticRegression(solver="liblinear").fit(X, y)
roc_auc_score(y, clf.predict_proba(X), multi_class='ovr')
Output:
import numpy as np
import matplotlib.pyplot as plt
from itertools import cycle
from sklearn import svm, datasets
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import roc_auc_score
# Import some data to play with
iris = datasets.load_iris()
X = iris.data
y = iris.target
# Binarize the output
y = label_binarize(y, classes=[0, 1, 2])
n_classes = y.shape[1]
# Add noisy features to make the problem harder
random_state = np.random.RandomState(0)
n_samples, n_features = X.shape
X = np.c_[X, random_state.randn(n_samples, 200 * n_features)]
# shuffle and split training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
# Learn to predict each class against the other
classifier = OneVsRestClassifier(
svm.SVC(kernel="linear", probability=True, random_state=random_state)
)
y_score = classifier.fit(X_train, y_train).decision_function(X_test)
# Compute ROC curve and ROC area for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
roc_auc[i] = auc(fpr[i], tpr[i])
# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])
plt.figure()
lw = 2
plt.plot(
fpr[2],
tpr[2],
color="darkorange",
lw=lw,
label="ROC curve (area = %0.2f)" % roc_auc[2],
)
plt.plot([0, 1], [0, 1], color="navy", lw=lw, linestyle="--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic example")
plt.legend(loc="lower right")
plt.show()
Output:
from sklearn.metrics import log_loss
log_loss(["spam", "ham", "ham", "spam"], [[.1, .9], [.9, .1], [.8, .2], [.35, .65]])
Output:
from sklearn.metrics import mean_squared_error
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
mean_squared_error(y_true, y_pred)
Output (MSE):
from sklearn.metrics import mean_absolute_error
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
mean_absolute_error(y_true, y_pred)
Output (MAE):
from sklearn.metrics import r2_score
r2_score(y_true, y_pred)
Output (R-squared):