**Problem 1: Consider a multivariate dataset with a large number of dimensions. In such scenarios, it is often essential to perform operations like normalization, matrix manipulation, and data transformations.**

Using the NumPy module, complete the following tasks:

1. Data Generation: Generate a matrix X of size 100x10 where each entry is a random float drawn from a standard normal distribution. Print the number of rows and columns in the array. 



2. Normalization: Normalize the columns of X so that each column has a mean of 0 and a standard deviation of 1. This is often referred to as Z-score normalization.
    Note - For normalization, you may want to use the following formula: Z = (X - μ) / σ, where X is the original data, μ is the mean, and σ is the standard deviation. 



3. Covariance Matrix: Compute the covariance matrix of the normalized dataset.
    Note - Covariance can be computed using the np.cov() function (documentation: https://numpy.org/doc/stable/reference/generated/numpy.cov.html).



4. Eigendecomposition: Perform eigendecomposition on the covariance matrix to find the eigenvalues and eigenvectors.
    Note - Eigendecomposition can be achieved with np.linalg.eig() (documentation: https://numpy.org/doc/stable/reference/generated/numpy.linalg.eig.html).




In [1]:
import numpy as np

# Task 1: Data Generation
X = np.random.randn(100, 10)
print("Shape of X:", X.shape)
shape_array = np.array([X.shape[0], X.shape[1]])
print("Number of rows and columns:", shape_array)

# Task 2: Normalization
X_mean = np.mean(X, axis=0)
X_std = np.std(X, axis=0)
X_normalized = (X - X_mean) / X_std
print("Normalized Matrix (X_normalized):")
print(X_normalized)

# Task 3: Covariance Matrix
cov_matrix = np.cov(X_normalized, rowvar=False)
print("Covariance Matrix:")
print(cov_matrix)

# Task 4: Eigendecomposition
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)
print("Eigenvalues:")
print(eigenvalues)

print("Eigenvectors:")
print(eigenvectors)


Shape of X: (100, 10)
Number of rows and columns: [100  10]
Normalized Matrix (X_normalized):
[[ 8.05417744e-01 -8.09456722e-01 -8.28066060e-01  5.17055960e-01
   1.34691133e-01 -8.55870037e-02  2.55087912e-01  2.34389827e-03
   8.23031837e-01 -1.28412206e+00]
 [-1.00722148e+00 -8.89328193e-01 -2.28931076e+00  4.90404290e-01
  -4.42279443e-02 -4.46753814e-01 -5.63134641e-01  2.68017787e-01
   8.92500333e-02  2.63291348e-01]
 [-2.03081943e+00  1.19191723e+00  2.70810579e-01 -2.79760026e-01
  -5.65840831e-01 -1.57924731e-01 -8.30657478e-01 -2.65326775e-01
  -1.38433066e-02 -1.39341614e+00]
 [-6.93223737e-01 -7.37787723e-01  8.71934276e-04  8.80316040e-01
  -5.08277693e-01 -1.51528268e+00  2.86043106e+00 -5.93398100e-01
   5.91586481e-01  7.26346669e-01]
 [ 3.26910551e-01  4.77645258e-01 -2.28170531e+00  2.49171447e+00
  -4.94759497e-01 -6.05628581e-01 -7.12458714e-01  1.37572838e-01
  -1.35654648e+00  8.43362184e-02]
 [-5.20695289e-01  8.10184113e-01 -2.62706434e-01  1.13011763e-01
  -1.
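
The four tasks above can be sanity-checked numerically. The following is a minimal sketch, not part of the original solution: the seeded generator and the use of `np.linalg.eigh` (a variant of `eig` for symmetric matrices) are assumptions made for illustration. It also shows a detail that is easy to miss: `np.std` defaults to `ddof=0` while `np.cov` defaults to `ddof=1`, so the diagonal of the covariance matrix of the normalized data is n / (n - 1) rather than exactly 1.

```python
import numpy as np

# Seeded generator so the check is repeatable (an assumption, not in the original)
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))

# Z-score normalization per column, as in Task 2 (np.std defaults to ddof=0)
X_normalized = (X - X.mean(axis=0)) / X.std(axis=0)
assert np.allclose(X_normalized.mean(axis=0), 0.0)
assert np.allclose(X_normalized.std(axis=0), 1.0)

# np.cov defaults to ddof=1, so each diagonal entry equals n / (n - 1)
n = X_normalized.shape[0]
cov_matrix = np.cov(X_normalized, rowvar=False)
assert np.allclose(np.diag(cov_matrix), n / (n - 1))

# eigh assumes a symmetric matrix and returns real eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

# The decomposition should reproduce the covariance matrix: V diag(w) V^T
reconstructed = eigenvectors @ np.diag(eigenvalues) @ eigenvectors.T
assert np.allclose(reconstructed, cov_matrix)
```

If all assertions pass silently, the normalization, covariance, and eigendecomposition steps are mutually consistent.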

**Problem 2: Student Performance**

Write a Python function named student_performance_analysis for analyzing student performance based on the following specifications:

The function should accept a 2-dimensional numpy array student_scores as input. Each row in this array represents a different student, and each column represents the scores they received in different exams.

Compute the following statistics and return them in a dictionary:

1. 'average_student_scores': A 1-dimensional numpy array containing the average exam score for each student.


2. 'average_exam_scores': A 1-dimensional numpy array containing the average score for each exam.


3. 'highest_student_average': The highest average exam score amongst all the students.


4. 'lowest_student_average': The lowest average exam score amongst all the students.


5. 'highest_exam_score': The highest score in each exam.


6. 'lowest_exam_score': The lowest score in each exam.


You should make use of numpy functions and array operations to accomplish these tasks.

For your submission, include the Python function. Use the given set of test cases to demonstrate your function usage.


In [2]:
import numpy as np

def student_performance_analysis(student_scores):
    # Compute statistics
    average_student_scores = np.mean(student_scores, axis=1)
    average_exam_scores = np.mean(student_scores, axis=0)
    highest_student_average = np.max(average_student_scores)
    lowest_student_average = np.min(average_student_scores)
    highest_exam_score = np.max(student_scores, axis=0)
    lowest_exam_score = np.min(student_scores, axis=0)

    # Create and return result dictionary
    result = {
        'average_student_scores': average_student_scores,
        'average_exam_scores': average_exam_scores,
        'highest_student_average': highest_student_average,
        'lowest_student_average': lowest_student_average,
        'highest_exam_score': highest_exam_score,
        'lowest_exam_score': lowest_exam_score
    }

    return result


# Test case 1
scores1 = np.array([
    [85, 90, 78],
    [88, 92, 96],
    [78, 76, 88],
    [94, 88, 92]
])
result1 = student_performance_analysis(scores1)
print("Performance Analysis - Test case 1:")
print(result1)
print()

# Test case 2
scores2 = np.array([
    [85, 90, 78, 88, 72],
    [88, 92, 96, 78, 94],
    [78, 76, 88, 90, 86],
    [94, 88, 92, 84, 90]
])
result2 = student_performance_analysis(scores2)
print("Performance Analysis - Test case 2:")
print(result2)


Performance Analysis - Test case 1:
{'average_student_scores': array([84.33333333, 92.        , 80.66666667, 91.33333333]), 'average_exam_scores': array([86.25, 86.5 , 88.5 ]), 'highest_student_average': 92.0, 'lowest_student_average': 80.66666666666667, 'highest_exam_score': array([94, 92, 96]), 'lowest_exam_score': array([78, 76, 78])}

Performance Analysis - Test case 2:
{'average_student_scores': array([82.6, 89.6, 83.6, 89.6]), 'average_exam_scores': array([86.25, 86.5 , 88.5 , 85.  , 85.5 ]), 'highest_student_average': 89.6, 'lowest_student_average': 82.6, 'highest_exam_score': array([94, 92, 96, 90, 94]), 'lowest_exam_score': array([78, 76, 78, 78, 72])}
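
The statistics above depend on choosing the right `axis`. As a small sketch on the data from test case 1 (the variable names here are illustrative, not part of the required function), `axis=1` collapses the exam columns to give one value per student, and `axis=0` collapses the student rows to give one value per exam. `np.argmax` and `np.argmin` can additionally report which student or exam holds an extreme value.

```python
import numpy as np

scores = np.array([
    [85, 90, 78],
    [88, 92, 96],
    [78, 76, 88],
    [94, 88, 92],
])

student_means = scores.mean(axis=1)  # one value per student (row)
exam_means = scores.mean(axis=0)     # one value per exam (column)

best_student = int(np.argmax(student_means))     # row index of the highest average
worst_student = int(np.argmin(student_means))    # row index of the lowest average
top_scorer_per_exam = np.argmax(scores, axis=0)  # row index of the top score in each exam

print(student_means.shape, exam_means.shape)
print(best_student, worst_student, top_scorer_per_exam)
```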


**Problem 3: Distance between points**

Write a Python function to calculate the distance between points using NumPy.

1. Write a Python function calculate_distance that calculates and returns the Euclidean distance between a set of points in n-dimensional space.

    The function should be named calculate_distance and it should take two arguments:

    1. point_A: A 1D numpy array containing the coordinates of point A.
    2. points_B: A 2D numpy array where each row contains the coordinates of one point in set B.


The calculate_distance function should return a 1D numpy array where each element is the Euclidean distance between point_A and the corresponding row in points_B.

Use the following formula to calculate Euclidean distance between two points:

$$d(p,q) = d(q,p) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \ldots + (p_n - q_n)^2}$$

Note: You should use numpy to perform the calculation in a vectorized way, without needing to explicitly write a loop in Python.

2. Write a function nearest_point that takes the same arguments as calculate_distance and returns the coordinates of the point in points_B that is closest to point_A. If there is a tie for the closest point, return any one of them. You may use calculate_distance to calculate distances.


In [12]:
import numpy as np

def calculate_distance(point_A, points_B):
    differences = points_B - point_A  # Broadcasting subtracts point_A from every row of points_B
    distances = np.sqrt(np.sum(differences ** 2, axis=1))  # Row-wise Euclidean distance
    return distances  # Keep floats: casting to int64 would truncate non-integer distances

def nearest_point(point_A, points_B):
    # Calculate distances
    distances = calculate_distance(point_A, points_B)

    # Find the index of the smallest distance
    min_index = np.argmin(distances)
    print("Index of the smallest distance:", min_index)

    # Return the corresponding point from points_B
    nearest = points_B[min_index]
    print("Corresponding point from points_B:", nearest)

    return nearest

# Test Case
point_A = np.array([2, 3, 4, 5])

point_B = np.array([
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
    [17, 18, 19, 20],
    [21, 22, 23, 24],
    [25, 26, 27, 28],
    [29, 30, 31, 32],
    [33, 34, 35, 36],
    [37, 38, 39, 40]
])

print("Distances: ", calculate_distance(point_A, point_B))
print("Nearest point to A: ", nearest_point(point_A, point_B))


Distances:  [ 2.  6. 14. 22. 30. 38. 46. 54. 62. 70.]
Index of the smallest distance: 0
Corresponding point from points_B: [1 2 3 4]
Nearest point to A:  [1 2 3 4]
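
Euclidean distances are generally not whole numbers, so they are best kept as floats. The sketch below uses two hypothetical points (not from the test case above) to illustrate this: broadcasting computes all differences at once, `np.linalg.norm` with `axis=1` gives the same row-wise result, and truncating the distances to integers can make `np.argmin` select the wrong row.

```python
import numpy as np

point_A = np.array([2, 3, 4, 5])
# Hypothetical points, chosen so that one distance is not a whole number
points_B = np.array([
    [4, 5, 4, 5],  # distance sqrt(8), about 2.83
    [1, 2, 3, 4],  # distance exactly 2
])

# Broadcasting subtracts point_A (shape (4,)) from every row of points_B (shape (2, 4))
differences = points_B - point_A
distances = np.sqrt(np.sum(differences ** 2, axis=1))

# np.linalg.norm with axis=1 computes the same row-wise Euclidean norm
assert np.allclose(distances, np.linalg.norm(differences, axis=1))

# With float distances, row 1 is nearest; after truncation both distances become 2
nearest_float = int(np.argmin(distances))
nearest_truncated = int(np.argmin(distances.astype("int64")))
print(distances, nearest_float, nearest_truncated)
```

Here `nearest_float` is 1 (the true nearest row), while `nearest_truncated` is 0, because both truncated distances tie at 2 and `np.argmin` returns the first minimum.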


**Problem 4: Closest Points**

Write a Python function to find the k closest points to a given point using NumPy.

1. Write a Python function k_closest_points that finds the k closest points in n-dimensional space to a specified point.

    The function should be named k_closest_points and it should take three arguments:

    1. point_A: A 1D numpy array containing the coordinates of point A.
    2. points_B: A 2D numpy array where each row contains the coordinates of one point in set B.
    3. k: The number of closest points to find.

    The k_closest_points function should return closest_points, a 2D numpy array containing the k closest points to point_A from points_B. Each row of closest_points should contain the coordinates of one of the k closest points.

    You may use the calculate_distance function from the previous question to calculate distances.

    You are not allowed to use any external libraries except for Numpy and Python's standard library.

    Note - Use the numpy function argsort, which returns the indices that would sort an array along the given axis (documentation: https://numpy.org/doc/stable/reference/generated/numpy.argsort.html).





2. Write a function closest_labels that takes the same arguments as k_closest_points, plus an additional argument labels_B.  labels_B  is a 1D numpy array where each element is a label associated with the corresponding row in points_B. The closest_labels function should return a 1D numpy array containing the labels associated with the k closest points.

In [14]:
import numpy as np

# Same computation as calculate_distance in the previous question,
# redefined here so that this cell runs on its own
def calculate_distance(point_A, points_B):
    return np.sqrt(np.sum((points_B - point_A) ** 2, axis=1))

def k_closest_points(point_A, points_B, k):
    distances = calculate_distance(point_A, points_B)

    # Indices of the distances sorted in ascending order; keep the first k
    closest_indices = np.argsort(distances)[:k]
    return points_B[closest_indices]

def closest_labels(point_A, points_B, labels_B, k):
    distances = calculate_distance(point_A, points_B)
    closest_indices = np.argsort(distances)[:k]

    # Return the corresponding k closest labels from labels_B
    return labels_B[closest_indices]

# Test case
point_A = np.array([2, 3, 4, 5])

points_B = np.array([
    [1, 2, 3, 4], 
    [5, 6, 7, 8], 
    [9, 10, 11, 12], 
    [13, 14, 15, 16],
    [17, 18, 19, 20],
    [21, 22, 23, 24],
    [25, 26, 27, 28],
    [29, 30, 31, 32],
    [33, 34, 35, 36],
    [37, 38, 39, 40]
])

labels_B = np.array([0, 0, 1, 1, 0, 1, 0, 1, 0])

k = 3

print("K closest points to A:", k_closest_points(point_A, points_B, k))
print("Labels of K closest points to A:", closest_labels(point_A, points_B, labels_B, k))


K closest points to A: [[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]
Labels of K closest points to A: [0 0 1]
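
When only the k smallest distances are needed, a full sort is not strictly required. As an alternative to `np.argsort`, the following sketch uses `np.argpartition`; the helper name `k_closest_indices` and the sample distances are hypothetical, and the helper assumes 1 <= k <= len(distances).

```python
import numpy as np

def k_closest_indices(distances, k):
    # argpartition moves the k smallest entries into the first k slots (in arbitrary order)
    candidates = np.argpartition(distances, k - 1)[:k]
    # Sort only those k candidates so the result is ordered nearest-first
    return candidates[np.argsort(distances[candidates])]

distances = np.array([9.0, 2.0, 7.5, 0.5, 4.0, 6.0])  # hypothetical distances
idx = k_closest_indices(distances, 3)
print(idx)
```

For distinct distances this returns the same indices as `np.argsort(distances)[:k]`; the difference only matters for large arrays, where partitioning avoids sorting every element.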


**Problem 5: k-Nearest Neighbors**

Write a Python function to implement the "k-Nearest Neighbors" algorithm using NumPy.



1. Write a Python function that uses the k-Nearest Neighbors (k-NN) algorithm to classify a given set of test data based on a given set of training data and labels.

    The function should be named knn_classifier and it should take four arguments:

    1. train_data: A 2D numpy array where each row is a separate data point in the training dataset.
    2. train_labels: A 1D numpy array where each element is the class label of the corresponding row in train_data.
    3. test_data: A 2D numpy array where each row is a separate data point in the test dataset. The function should predict and return class labels for these data points.
    4. k: The number of nearest neighbors to be considered.
    
    The knn_classifier function should return test_labels, a 1D numpy array where each element is the predicted class label of the corresponding row in test_data.

    Use the Euclidean distance to calculate the distance between data points. The class of each test point is determined by a vote among the classes of its k closest training points (in terms of Euclidean distance). If there's a tie in the voting process, break the tie by choosing the class with the closest representative among the k nearest neighbors.

    You may use the k_closest_points and closest_labels functions from the previous question in your implementation.

    You are not allowed to use any external libraries except for Numpy and Python's standard library.



2. Write a function accuracy_score that calculates and returns the accuracy of the knn_classifier function's predictions. The function should take two parameters:

    1. true_labels: The actual labels of the test data.
    2. predicted_labels: The labels predicted by the knn_classifier function.

    The accuracy score is the proportion of correct predictions out of the total number of predictions.

In [15]:
import numpy as np

def knn_classifier(train_data, train_labels, test_data, k):
    predicted_labels = []

    for test_point in test_data:
        # Get labels of k closest points in train_data to the current test point
        distances = np.linalg.norm(train_data - test_point, axis=1)  # Calculate Euclidean distances
        indices = np.argsort(distances)[:k]  # Get indices of k closest points
        closest_labels = train_labels[indices]  # Get labels of k closest points
        unique_labels, counts = np.unique(closest_labels, return_counts=True)
        max_count = np.max(counts)

        # Find the most common label among the k closest ones
        most_common_labels = unique_labels[counts == max_count]
        
        if len(most_common_labels) == 1:
            predicted_label = most_common_labels[0]
        else:
            closest_distances = distances[indices]
            min_distance = np.min(closest_distances[closest_labels == most_common_labels[0]])
            closest_label = most_common_labels[0]
            
            for label in most_common_labels[1:]:
                distance = np.min(closest_distances[closest_labels == label])
                if distance < min_distance:
                    min_distance = distance
                    closest_label = label
            
            predicted_label = closest_label
        
        predicted_labels.append(predicted_label)

    #return the predicted labels for test data 
    return np.array(predicted_labels)

def accuracy_score(true_labels, predicted_labels):
    correct_predictions = np.sum(true_labels == predicted_labels)
    total_predictions = len(true_labels)
    accuracy = correct_predictions / total_predictions
    return accuracy

# Example usage
train_data = np.array([
    [1, 2, 3, 4, 5], 
    [6, 7, 8, 9, 10], 
    [11, 12, 13, 14, 15], 
    [16, 17, 18, 19, 20],
    [21, 22, 23, 24, 25],
    [26, 27, 28, 29, 30],
    [31, 32, 33, 34, 35],
    [36, 37, 38, 39, 40],
    [41, 42, 43, 44, 45],
    [46, 47, 48, 49, 50],
    [51, 52, 53, 54, 55],
    [56, 57, 58, 59, 60],
    [61, 62, 63, 64, 65],
    [66, 67, 68, 69, 70],
    [71, 72, 73, 74, 75]
])
train_labels = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0])
test_data = np.array([
    [4, 5, 6, 7, 8], 
    [9, 10, 11, 12, 13], 
    [14, 15, 16, 17, 18],
    [19, 20, 21, 22, 23],
    [24, 25, 26, 27, 28],
    [29, 30, 31, 32, 33],
    [34, 35, 36, 37, 38]
])
k = 3

predicted_labels = knn_classifier(train_data, train_labels, test_data, k)
print("Predicted labels: ", predicted_labels)

true_labels = np.array([0, 0, 1, 1, 1, 0, 0])
accuracy = accuracy_score(true_labels, predicted_labels)
print("Accuracy: ", accuracy)


Predicted labels:  [0 1 0 1 0 1 0]
Accuracy:  0.42857142857142855
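
With k = 3 and two classes, the vote in the example above can never tie, so the tie-breaking rule is not exercised. The sketch below is one possible variant, not the solution above: the function name `knn_predict_vectorized` is hypothetical, and it computes all test-to-train distances at once via broadcasting. It rebuilds the same data with `np.arange` and then runs k = 2, where every vote ties and the nearest neighbor's class should win.

```python
import numpy as np

def knn_predict_vectorized(train_data, train_labels, test_data, k):
    # Pairwise distances of shape (n_test, n_train), computed via broadcasting
    diffs = test_data[:, None, :] - train_data[None, :, :]
    dists = np.sqrt(np.sum(diffs ** 2, axis=2))
    # Indices of the k nearest training points for each test row, nearest first
    nearest = np.argsort(dists, axis=1, kind="stable")[:, :k]
    neighbor_labels = train_labels[nearest]

    predictions = []
    for row in neighbor_labels:
        labels, counts = np.unique(row, return_counts=True)
        tied = labels[counts == counts.max()]
        # row is ordered nearest-first, so the first label that belongs to a
        # top-voted class is the one with the closest representative
        predictions.append(next(label for label in row if label in tied))
    return np.array(predictions)

# Same data as the example above, generated compactly
train_data = np.arange(1, 76).reshape(15, 5)
train_labels = np.arange(15) % 2
test_data = np.arange(4, 9) + 5 * np.arange(7)[:, None]
true_labels = np.array([0, 0, 1, 1, 1, 0, 0])

pred_k3 = knn_predict_vectorized(train_data, train_labels, test_data, 3)
pred_k2 = knn_predict_vectorized(train_data, train_labels, test_data, 2)  # every vote ties
accuracy_k3 = float(np.mean(pred_k3 == true_labels))
print(pred_k3, pred_k2, accuracy_k3)
```

The broadcasted difference array has shape (n_test, n_train, n_features), so this approach trades memory for fewer Python-level loops; for large datasets the per-point loop may be the safer choice.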
