<p style="text-align:center">
    <a href="https://www.ict.mahidol.ac.th/en/" target="_blank">
    <img src="https://www3.ict.mahidol.ac.th/ICTSurveysV2/Content/image/MUICT2.png" width="400" alt="Faculty of ICT">
    </a>
</p>

# Tutorial 03: Numerical Computation for Data Science

NumPy, short for Numerical Python, is an open-source Python library essential for numerical computing. It's used extensively for array manipulation, linear algebra, Fourier transforms, and working with matrices. NumPy's core data structure is the `ndarray` (n-dimensional array), which comes with a rich set of functions simplifying array operations. Because data science often deals with large datasets where speed and efficient resource use are critical, arrays are fundamental, and NumPy provides optimized array handling. 

## This tutorial aims to teach you how to:

* Use basic operations and datatypes of NumPy.
* Perform NumPy array operations on 1D arrays.
* Use various mathematical functions as well as linspace.
* Perform basic operations on 2D arrays.

## Exercise 01: Introduction to NumPy

NumPy is typically imported as `np`. NumPy arrays have a fixed size and a homogeneous data type. Python lists can be converted to NumPy arrays after importing the library.

In [None]:
import numpy as np

### Creating NumPy Arrays (ndarray)

The core of NumPy is the ndarray object (n-dimensional array). Let's see how to create them:

In [None]:
# From a Python list:
my_list = [1, 2, 3, 4, 5]
my_array = np.array(my_list)
print("Array from list:", my_array)
print(type(my_array))

In [None]:
# From a Python tuple:
my_tuple = (6, 7, 8, 9, 10)
my_array_from_tuple = np.array(my_tuple)
print("Array from tuple:", my_array_from_tuple)

In [None]:
# Creating specific arrays:
zeros_array = np.zeros(5)  # Array of zeros
print("\nZeros array:", zeros_array)

In [None]:
ones_array = np.ones((2, 3)) # 2x3 array of ones
print("Ones array:\n", ones_array)

In [None]:
arange_array = np.arange(0, 10, 2) # Array with values from 0 to 10 (exclusive), with a step of 2
print("Arange array:", arange_array)

In [None]:
linspace_array = np.linspace(0, 1, 5) # Array with 5 evenly spaced values between 0 and 1 (inclusive)
print("Linspace array:", linspace_array)

### Type (dtype)
NumPy arrays are homogeneous, meaning they store elements of the same data type. The dtype attribute tells us the data type of the array elements.

In [None]:
int_array = np.array([1, 2, 3])
print("\nInteger array type:", int_array.dtype)

In [None]:
float_array = np.array([1.0, 2.5, 3.7])
print("Float array type:", float_array.dtype)

In [None]:
string_array = np.array(['a', 'b', 'c'])
print("String array type:", string_array.dtype)

In [None]:
# Explicitly setting the data type:
int8_array = np.array([1, 2, 3], dtype=np.int8)
print("Int8 array type:", int8_array.dtype)
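
Because an array holds a single dtype, NumPy has to pick one type when the input mixes integers and floats. The sketch below illustrates that upcasting, plus explicit conversion with `astype`; the variable names are illustrative and not part of the tutorial.

```python
import numpy as np

# Mixing ints and floats: NumPy picks a dtype that can hold every element.
mixed = np.array([1, 2.5, 3])
print("Mixed list dtype:", mixed.dtype)

# astype returns a new array with the requested dtype.
as_float = np.array([1, 2, 3]).astype(np.float64)
print("Converted array:", as_float, as_float.dtype)

# A smaller dtype uses less memory per element.
small = np.array([1, 2, 3], dtype=np.int8)
print("Bytes per element:", small.itemsize, "| total bytes:", small.nbytes)
```

Choosing a compact dtype such as `int8` only makes sense when every value fits in its range.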

### Assigning Values
You can assign values to individual elements or slices of an array:

In [None]:
my_array = np.array([10, 20, 30, 40, 50])
print("\nOriginal array:", my_array)

In [None]:
my_array[0] = 100 # Assign to index 0
print("Array after assigning to index 0:", my_array)

In [None]:
my_array[-1]=500
print("Array after assigning to last index:", my_array)

### Slicing
Slicing allows you to extract portions of an array:

In [None]:
my_array = np.array([10, 20, 30, 40, 50])

slice1 = my_array[1:4] # Elements from index 1 (inclusive) to 4 (exclusive)
print("\nSlice 1:", slice1)

In [None]:
slice2 = my_array[:3] # Elements from the beginning up to index 3 (exclusive)
print("Slice 2:", slice2)

In [None]:
slice3 = my_array[2:] # Elements from index 2 (inclusive) to the end
print("Slice 3:", slice3)

In [None]:
slice4 = my_array[:] # All the elements in the array
print("Slice 4:", slice4)

A powerful feature of NumPy is the ability to assign a single value to an entire slice (broadcasting):

In [None]:
my_array = np.array([10, 20, 30, 40, 50])
print("\nOriginal array:", my_array)

In [None]:
my_array[1:4] = 99 # Assign 99 to elements at indices 1, 2, and 3
print("Array after assigning to slice:", my_array)

Important Note about Slices:

In [None]:
#Slices are views, not copies. Modifying a slice modifies the original array:
my_array = np.array([10, 20, 30, 40, 50])
my_slice = my_array[1:4]
my_slice[:] = 0
print("Original Array after slice modification:", my_array)

In [None]:
#To make a copy, use .copy():
my_array = np.array([10, 20, 30, 40, 50])
my_slice = my_array[1:4].copy()
my_slice[:] = 0
print("Original Array after copy modification:", my_array)
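
One way to check whether two arrays share the same underlying data is `np.shares_memory`. The following is a small sketch with illustrative variable names:

```python
import numpy as np

base = np.array([10, 20, 30, 40, 50])
view = base[1:4]            # basic slicing returns a view
copied = base[1:4].copy()   # .copy() allocates new memory

print("View shares memory:", np.shares_memory(base, view))
print("Copy shares memory:", np.shares_memory(base, copied))
print("View's parent is base:", view.base is base)
```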

**Summary**: This exercise covered the basics of NumPy arrays: creation, data types (`dtype`), assigning values, slicing, and the difference between views and copies. These are fundamental concepts for working with numerical data efficiently in Python for data science.

## Exercise 02: 1D Array Operations

Fundamental operations on one-dimensional (1D) NumPy arrays are essential for various data science tasks.

### Creating NumPy 1D Arrays
Let's start by creating some sample 1D arrays:

In [None]:
arr1 = np.array([1, 2, 3, 4, 5])
arr2 = np.array([6, 7, 8, 9, 10])

print("Array 1:", arr1)
print("Array 2:", arr2)

### Array Addition
Adding two NumPy arrays performs element-wise addition:

In [None]:
arr_sum = arr1 + arr2
print("\nArray Addition (arr1 + arr2):", arr_sum)

In [None]:
# You can also use the np.add() function:
arr_sum_np = np.add(arr1,arr2)
print("Array Addition using np.add(arr1,arr2):", arr_sum_np)

### Array Multiplication
Multiplying two NumPy arrays performs element-wise multiplication:

In [None]:
arr_product = arr1 * arr2
print("\nArray Multiplication (arr1 * arr2):", arr_product)

In [None]:
#You can also use the np.multiply() function:
arr_product_np = np.multiply(arr1,arr2)
print("Array Multiplication using np.multiply(arr1,arr2):", arr_product_np)

### Product of Two NumPy Arrays (Element-wise)
As shown above, the `*` operator performs element-wise multiplication. This is different from the dot product.

In [None]:
#Example demonstrating element-wise multiplication again:
arr_element_wise = arr1 * arr2
print("\nElement-wise Product (arr1 * arr2):", arr_element_wise)

### Dot Product
The dot product is a different type of multiplication. For 1D arrays, it calculates the sum of the element-wise products.

In [None]:
dot_product = np.dot(arr1, arr2)  # Most common way
print("\nDot Product (np.dot(arr1, arr2)):", dot_product)

In [None]:
dot_product_at = arr1 @ arr2 # Another way to calculate dot product
print("Dot Product using @ operator (arr1 @ arr2):", dot_product_at)

In [None]:
#Manual Calculation to demonstrate the dot product
manual_dot_product = sum(arr1[i] * arr2[i] for i in range(len(arr1)))
print("Manual Dot Product:", manual_dot_product)
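
In symbols, the dot product of two 1D arrays is $a \cdot b = \sum_i a_i b_i$. The sketch below redefines the two sample arrays so it runs on its own and checks that summing the element-wise product gives the same number as `np.dot`:

```python
import numpy as np

arr1 = np.array([1, 2, 3, 4, 5])
arr2 = np.array([6, 7, 8, 9, 10])

elementwise = arr1 * arr2            # products of matching elements
total = elementwise.sum()            # 6 + 14 + 24 + 36 + 50
print("Sum of element-wise products:", total)
print("np.dot result:", np.dot(arr1, arr2))
```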

### Adding a Constant to a NumPy Array
You can add a constant value to each element of a NumPy array:

In [None]:
constant = 5
arr_plus_constant = arr1 + constant
print("\nArray + Constant (arr1 + 5):", arr_plus_constant)

In [None]:
#You can also use the np.add() function
arr_plus_constant_np = np.add(arr1, constant)
print("Array + Constant using np.add(arr1, 5):", arr_plus_constant_np)

### Important Considerations: Broadcasting

Broadcasting is NumPy's way of handling operations between arrays of different shapes (under certain rules). In the case of adding a constant to an array, the constant is "broadcast" to all elements of the array.

In [None]:
# Example of broadcasting with a scalar (constant):
arr = np.array([1,2,3])
scalar = 10
result = arr + scalar  # the scalar is broadcast to each element of the array
print("\nBroadcasting Example:", result)
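
Broadcasting also applies between arrays, not only between an array and a scalar. As a hedged illustration (the arrays here are made up for the example), a 1D row or a 2D column can be combined with a 2D array when their shapes are compatible:

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])      # shape (2, 3)
row = np.array([10, 20, 30])        # shape (3,)
col = np.array([[100], [200]])      # shape (2, 1)

row_result = matrix + row   # row is added to every row of matrix
col_result = matrix + col   # col is added to every column of matrix

print("matrix + row:\n", row_result)
print("matrix + col:\n", col_result)
```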

**Summary**: This exercise covered essential operations on 1D NumPy arrays, including addition, multiplication (element-wise and dot product), and adding a constant. These operations are fundamental building blocks for more complex data analysis and manipulation tasks in data science.

## Exercise 03: Mathematical Functions and Linspace
This exercise explores essential mathematical functions in NumPy and introduces the linspace function, demonstrating its use in data science, particularly for plotting.

In [None]:
import numpy as np 
import matplotlib.pyplot as plt  # For plotting (make sure matplotlib is installed: pip install matplotlib)

### Essential Mathematical Functions
NumPy provides a wide range of mathematical functions that are crucial for data science. Here are some commonly used ones:

In [None]:
arr = np.array([0, 1, 2, 3, 4, 5])

# Trigonometric functions:
sin_arr = np.sin(arr)
cos_arr = np.cos(arr)
tan_arr = np.tan(arr)

print("Original array:", arr)
print("\nSine values:", sin_arr)
print("Cosine values:", cos_arr)
print("Tangent values:", tan_arr)

In [None]:
# Exponential and logarithmic functions:
exp_arr = np.exp(arr)  # e^x
log_arr = np.log(arr + 1)  # Natural logarithm (avoiding log(0))
log2_arr = np.log2(arr+1) # Base 2 logarithm
log10_arr = np.log10(arr+1) # Base 10 logarithm

print("\nExponential values:", exp_arr)
print("Natural logarithm values:", log_arr)
print("Base 2 logarithm values:", log2_arr)
print("Base 10 logarithm values:", log10_arr)

In [None]:
# Other useful functions:
sqrt_arr = np.sqrt(arr)  # Square root
abs_arr = np.abs(np.array([-1, -2, 3, -4])) # Absolute value
power_arr = np.power(arr, 2) #arr to the power of 2

print("\nSquare root values:", sqrt_arr)
print("Absolute values:", abs_arr)
print("Power values:", power_arr)

### Linspace: Creating Evenly Spaced Values
The linspace function is incredibly useful for generating evenly spaced numbers over a specified interval. This is particularly important for creating data points for plotting and numerical analysis.

`np.linspace(start, stop, num=50, endpoint=True, retstep=False, dtype=None, axis=0)`
* `start`: The starting value of the sequence.
* `stop`: The end value of the sequence, unless endpoint is set to False. In that case, the sequence consists of all but the last of num + 1 evenly spaced samples, so that stop is excluded. Note that the step size changes when endpoint is False.
* `num`: Number of samples to generate. Default is 50. Must be non-negative.
* `endpoint`: If True, stop is the last sample. Otherwise, it is not included. Default is True.
* `retstep`: If True, returns (samples, step), where step is the spacing between the samples.
* `dtype`: The type of the output array. If dtype is not given, infer the data type from the other input arguments.
* `axis`: The axis in the result to store the samples. Only relevant if start or stop are array-like. By default (0), the samples will be along a new axis inserted at the beginning. Use -1 to get an axis at the end.
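
The note about the step size can be checked numerically. With `endpoint=True` the spacing is `(stop - start) / (num - 1)`; with `endpoint=False` it is `(stop - start) / num`. A minimal sketch using `retstep=True` to read the step back:

```python
import numpy as np

start, stop, num = 0, 10, 10

# retstep=True returns (samples, step); only the step is needed here.
_, step_incl = np.linspace(start, stop, num, retstep=True)
_, step_excl = np.linspace(start, stop, num, endpoint=False, retstep=True)

print("endpoint=True :", step_incl, "expected", (stop - start) / (num - 1))
print("endpoint=False:", step_excl, "expected", (stop - start) / num)
```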

In [None]:
# Example: Create 10 evenly spaced numbers between 0 and 10 (inclusive)
x = np.linspace(0, 10, 10)
print("\nLinspace example:", x)

In [None]:
x_retstep = np.linspace(0, 10, 10, retstep=True)
print("\nLinspace example with retstep:", x_retstep)

In [None]:
x_endpoint_false = np.linspace(0, 10, 10, endpoint = False)
print("\nLinspace example with endpoint False:", x_endpoint_false)

### Plotting the sin() Function using Linspace
Let's use linspace to generate x-values and then plot the sine function:

In [None]:
x = np.linspace(0, 2 * np.pi, 100)  # Create 100 points between 0 and 2*pi
y = np.sin(x)

In [None]:
# Plotting
plt.figure(figsize=(8, 6))  # Adjust figure size for better visualization
plt.plot(x, y)
plt.title("Plot of sin(x)")
plt.xlabel("x")
plt.ylabel("sin(x)")
plt.grid(True) # Add a grid for better readability
plt.show()

In [None]:
# Example of plotting other functions
x = np.linspace(-5, 5, 100)
y = x**2 #y=x^2

plt.figure(figsize=(8,6))
plt.plot(x,y)
plt.title("Plot of x^2")
plt.xlabel("x")
plt.ylabel("x^2")
plt.grid(True)
plt.show()

**Summary**: This exercise covered essential mathematical functions in NumPy and demonstrated the power of linspace for generating evenly spaced values, particularly in the context of plotting. These tools are fundamental for data analysis, visualization, and various other data science tasks.

## Exercise 04: 2D Array Operations
This exercise focuses on operations with two-dimensional (2D) NumPy arrays, which are essential for representing matrices and tabular data in data science.

In [None]:
#import numpy as np

### Creating and Displaying 2D NumPy Arrays
Creating a 2D array from a list of lists:

In [None]:
list_of_lists = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
arr_2d = np.array(list_of_lists)
print("2D Array:\n", arr_2d)

Creating a 2D array using other methods:

In [None]:
zeros_2d = np.zeros((3, 4))  # 3 rows, 4 columns of zeros
ones_2d = np.ones((2, 2), dtype=int) # 2x2 array of ones, integer type
eye_2d = np.eye(4) # 4x4 Identity matrix

print("\nZeros 2D Array:\n", zeros_2d)
print("Ones 2D Array:\n", ones_2d)
print("Identity 2D Array:\n", eye_2d)

### Accessing Elements of a 2D NumPy Array

In [None]:
# Accessing elements using indexing (row, column):

print("\nElement at row 0, column 1:", arr_2d[0, 1])  # Output: 2
print("Element at row 1, column 2:", arr_2d[1, 2])  # Output: 6
print("Last element of the last row:", arr_2d[-1, -1]) # Output 9

In [None]:
# Accessing entire rows and columns:

print("\nFirst row:", arr_2d[0, :])  # All elements of the first row
print("Second column:", arr_2d[:, 1])  # All elements of the second column

In [None]:
print("Last row:", arr_2d[-1,:]) # All elements of the last row
print("Last column:", arr_2d[:,-1]) # All elements of the last column

In [None]:
# Slicing 2D arrays:
print("\nSubmatrix (rows 0-1, cols 1-2):\n", arr_2d[0:2, 1:3])

### Basic Operations on 2D NumPy Arrays

In [None]:
arr1_2d = np.array([[1, 2], [3, 4]])
arr2_2d = np.array([[5, 6], [7, 8]])

In [None]:
# Element-wise addition:
arr_sum_2d = arr1_2d + arr2_2d
print("\nElement-wise Addition:\n", arr_sum_2d)

In [None]:
# Element-wise multiplication:
arr_product_2d = arr1_2d * arr2_2d
print("\nElement-wise Multiplication:\n", arr_product_2d)

In [None]:
# Matrix multiplication (dot product):
matrix_mult = np.dot(arr1_2d, arr2_2d)
print("\nMatrix Multiplication (Dot Product):\n", matrix_mult)
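
For 2D arrays the `@` operator gives the same result as `np.dot`. The short sketch below, which redefines the two matrices so it is self-contained, also shows that matrix multiplication depends on the order of the operands:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

ab = a @ b
ba = b @ a

print("a @ b:\n", ab)
print("Same as np.dot:", np.array_equal(ab, np.dot(a, b)))
print("a @ b equals b @ a:", np.array_equal(ab, ba))
```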

In [None]:
# Matrix Transpose
print("\nTranspose of arr1_2d:\n", arr1_2d.T) # or np.transpose(arr1_2d)

In [None]:
# Scalar multiplication (multiplying by a constant):
scalar = 3
arr_scalar_mult = arr1_2d * scalar
print("\nScalar Multiplication:\n", arr_scalar_mult)

In [None]:
# Sum of all elements
print("\nSum of all elements in arr1_2d:", np.sum(arr1_2d))

In [None]:
# Sum of elements by column (axis=0)
print("\nSum of each column in arr1_2d:", np.sum(arr1_2d, axis=0))

In [None]:
# Sum of elements by row (axis=1)
print("\nSum of each row in arr1_2d:", np.sum(arr1_2d, axis=1))

In [None]:
# Mean of all elements
print("\nMean of all elements in arr1_2d:", np.mean(arr1_2d))

In [None]:
# Mean of elements by column (axis=0)
print("\nMean of each column in arr1_2d:", np.mean(arr1_2d, axis=0))

In [None]:
# Mean of elements by row (axis=1)
print("\nMean of each row in arr1_2d:", np.mean(arr1_2d, axis=1))

### Applications in Data Science
2D arrays are fundamental for:

*   Representing datasets (rows as samples, columns as features).
*   Image processing (images as matrices of pixel values).
*   Linear algebra operations in machine learning algorithms.

In [None]:
# Example: Representing a simple dataset:

data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])  # Example data
print("\nExample Dataset:\n", data)
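
With rows as samples and columns as features, a common preprocessing step is to subtract each feature's mean. The sketch below is one way to do this with `axis=0` and broadcasting; it is an illustration, not a step required by the exercise.

```python
import numpy as np

data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

feature_means = data.mean(axis=0)   # one mean per column (feature)
centered = data - feature_means     # broadcast the means across all rows

print("Feature means:", feature_means)
print("Centered data:\n", centered)
```

After centering, every column has a mean of zero.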

**Summary**: This exercise covered creating, accessing, and performing basic operations on 2D NumPy arrays. These skills are essential for working with tabular data, images, and implementing machine learning algorithms in data science.

## Exercise 05: SciPy for Data Science
SciPy (Scientific Python) builds on NumPy and provides a wide range of scientific computing tools crucial for data science. This exercise covers frequently used SciPy functionality with practical examples.

In [None]:
import numpy as np
import scipy.stats as stats
import scipy.optimize as optimize
import scipy.interpolate as interpolate
import scipy.signal as signal
import scipy.linalg as linalg
import scipy.integrate as integrate
import matplotlib.pyplot as plt

### Statistical Analysis (`scipy.stats`)
Probability Distributions:

In [None]:
# Normal distribution
x = np.linspace(-5, 5, 100)
norm_pdf = stats.norm.pdf(x, loc=0, scale=1) # Probability Density Function
norm_cdf = stats.norm.cdf(x, loc=0, scale=1) # Cumulative Distribution Function

plt.plot(x, norm_pdf, label='PDF')
plt.plot(x, norm_cdf, label='CDF')
plt.title('Normal Distribution')
plt.legend()
plt.show()
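
Beyond plotting, the CDF answers probability questions directly, and `ppf` (the inverse CDF) returns quantiles. A brief sketch for the standard normal distribution, with illustrative variable names:

```python
from scipy import stats

# Probability that a standard normal value falls within one standard deviation of the mean.
within_one_sd = stats.norm.cdf(1) - stats.norm.cdf(-1)

# Quantile below which 97.5% of the distribution lies.
z_975 = stats.norm.ppf(0.975)

print("P(-1 < Z < 1):", within_one_sd)
print("97.5% quantile:", z_975)
```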

In [None]:
# T-test (Independent samples):

group1 = np.random.normal(10, 2, 50)
group2 = np.random.normal(12, 2.5, 60)
t_statistic, p_value = stats.ttest_ind(group1, group2)
print("\nT-test:")
print("T-statistic:", t_statistic)
print("P-value:", p_value)
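
Because the samples above are random, the printed numbers change on every run. The sketch below uses a seeded generator for reproducibility (the seed value is arbitrary) and also runs Welch's t-test via `equal_var=False`, which does not assume the two groups have equal variances:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)   # arbitrary seed, chosen only for reproducibility
group1 = rng.normal(10, 2, 50)
group2 = rng.normal(12, 2.5, 60)

student = stats.ttest_ind(group1, group2)                  # assumes equal variances
welch = stats.ttest_ind(group1, group2, equal_var=False)   # Welch's t-test

print("Student's t-test p-value:", student.pvalue)
print("Welch's t-test p-value:  ", welch.pvalue)
```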

### Optimization (`scipy.optimize`)

In [None]:
# Minimizing a function:
def f(x):
    return x**2 + 5*np.sin(x)

result = optimize.minimize(f, x0=0) # x0 is the initial guess
print("\nOptimization Result:")
print(result)
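
The returned object bundles several fields. A minimal sketch of reading the most useful ones, repeating the function definition so the block runs on its own:

```python
import numpy as np
from scipy import optimize

def f(x):
    return x**2 + 5*np.sin(x)

result = optimize.minimize(f, x0=0)

print("Converged:", result.success)      # True if the optimizer reports success
print("Minimizer x:", result.x)          # array holding the location of the minimum
print("Minimum value f(x):", result.fun)
```

At the minimum the derivative `2*x + 5*cos(x)` should be close to zero, which is a quick sanity check on `result.x`.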

In [None]:
# Curve fitting:
x_data = np.linspace(-5, 5, 100)
y_data = 2*x_data**3 + 3*x_data**2 - 5*x_data + 2 + np.random.normal(0, 5, 100) #add noise
def func(x, a, b, c, d):
    return a*x**3 + b*x**2 + c*x + d

popt, pcov = optimize.curve_fit(func, x_data, y_data)
print("\nCurve Fitting Parameters:", popt)

plt.plot(x_data, y_data, 'o', label='data')
plt.plot(x_data, func(x_data, *popt), '-', label='fit')
plt.legend()
plt.show()
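
The second return value, `pcov`, is the estimated covariance matrix of the fitted parameters. A common way to turn it into one-standard-deviation uncertainties is to take the square root of its diagonal. The sketch below repeats the fit with a seeded generator (the seed is arbitrary) and skips the plot:

```python
import numpy as np
from scipy import optimize

rng = np.random.default_rng(seed=0)   # arbitrary seed for reproducible noise
x_data = np.linspace(-5, 5, 100)
y_data = 2*x_data**3 + 3*x_data**2 - 5*x_data + 2 + rng.normal(0, 5, 100)

def func(x, a, b, c, d):
    return a*x**3 + b*x**2 + c*x + d

popt, pcov = optimize.curve_fit(func, x_data, y_data)
perr = np.sqrt(np.diag(pcov))         # standard errors of a, b, c, d

for name, value, err in zip("abcd", popt, perr):
    print(f"{name} = {value:.3f} +/- {err:.3f}")
```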

### Interpolation (`scipy.interpolate`)

In [None]:
# 1D Interpolation:

x = np.linspace(0, 10, 10)
y = np.sin(x)
f = interpolate.interp1d(x, y, kind='cubic') #cubic interpolation
x_new = np.linspace(0, 10, 50)
plt.plot(x, y, 'o', label='data')
plt.plot(x_new, f(x_new), '-', label='cubic')
plt.legend()
plt.show()
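
To see what the cubic option buys, one can compare the interpolated values with the true function. This sketch measures the largest absolute error for linear and cubic interpolation on the same sample points; the variable names are illustrative.

```python
import numpy as np
from scipy import interpolate

x = np.linspace(0, 10, 10)
y = np.sin(x)
x_new = np.linspace(0, 10, 50)

linear = interpolate.interp1d(x, y, kind='linear')
cubic = interpolate.interp1d(x, y, kind='cubic')

# Largest deviation from the true sine values on the dense grid.
linear_err = np.max(np.abs(linear(x_new) - np.sin(x_new)))
cubic_err = np.max(np.abs(cubic(x_new) - np.sin(x_new)))

print("Max error (linear):", linear_err)
print("Max error (cubic): ", cubic_err)
```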

### Signal Processing (`scipy.signal`)

In [None]:
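# Fourier Transform (computed here with NumPy's FFT routines):
t = np.linspace(0, 1, 1000, endpoint=False)  # 1 second sampled at 1000 Hz
sig = np.sin(2*np.pi*10*t) + np.sin(2*np.pi*20*t)  # 10 Hz and 20 Hz components
fft_sig = np.fft.fft(sig)
freq = np.fft.fftfreq(t.size, d=t[1] - t[0])  # pass the sample spacing so frequencies are in Hz

plt.plot(freq, np.abs(fft_sig))
plt.title("Fourier Transform")
plt.xlabel("Frequency (Hz)")
plt.ylabel("Magnitude")
plt.show()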
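
The dominant frequencies can also be read off numerically instead of from the plot. The sketch below uses `np.fft.rfft` and `np.fft.rfftfreq`, which keep only the non-negative frequencies of a real signal, and passes the sample spacing so the result is in Hz:

```python
import numpy as np

n = 1000
t = np.linspace(0, 1, n, endpoint=False)   # 1 second sampled at 1000 Hz
sig = np.sin(2*np.pi*10*t) + np.sin(2*np.pi*20*t)

spectrum = np.abs(np.fft.rfft(sig))        # magnitudes for non-negative frequencies
freqs = np.fft.rfftfreq(n, d=1/n)          # frequency of each bin in Hz

# Frequencies of the two largest peaks, in ascending order.
top_two = np.sort(freqs[np.argsort(spectrum)[-2:]])
print("Dominant frequencies (Hz):", top_two)
```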

### Linear Algebra (`scipy.linalg`)

In [None]:
# Solving linear systems:

A = np.array([[1, 2], [3, 4]])
b = np.array([5, 6])
x = linalg.solve(A, b)
print("\nSolution to linear system:", x)
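
A quick way to confirm the solution is to substitute it back into the system and check that `A @ x` reproduces `b`:

```python
import numpy as np
from scipy import linalg

A = np.array([[1, 2], [3, 4]])
b = np.array([5, 6])

x = linalg.solve(A, b)
print("Solution:", x)
print("A @ x reproduces b:", np.allclose(A @ x, b))
```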

In [None]:
# Eigenvalues and eigenvectors:
eigenvalues, eigenvectors = linalg.eig(A)
print("\nEigenvalues:", eigenvalues)
print("Eigenvectors:\n", eigenvectors)
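
Each column of `eigenvectors` is an eigenvector `v` satisfying `A @ v = lambda * v` for the matching eigenvalue. The sketch below checks this for all columns at once; note that `linalg.eig` returns the eigenvalues as complex numbers even when they are real.

```python
import numpy as np
from scipy import linalg

A = np.array([[1, 2], [3, 4]])
eigenvalues, eigenvectors = linalg.eig(A)

lhs = A @ eigenvectors             # A applied to every eigenvector (column)
rhs = eigenvectors * eigenvalues   # each column scaled by its eigenvalue

print("A @ v equals lambda * v:", np.allclose(lhs, rhs))
print("Sum of eigenvalues (equals the trace of A):", eigenvalues.sum().real)
```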

### Integration (`scipy.integrate`)


In [None]:
# Numerical integration:
result = integrate.quad(lambda x: np.exp(-x**2), -np.inf, np.inf) #integrate gaussian function
print("\nNumerical Integration (Gaussian):", result)
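
`integrate.quad` returns a tuple: the estimate of the integral and an estimate of the absolute error. For this Gaussian integral the exact answer is known to be $\sqrt{\pi}$, which makes a convenient check:

```python
import numpy as np
from scipy import integrate

value, abs_error = integrate.quad(lambda x: np.exp(-x**2), -np.inf, np.inf)

print("Estimated integral:", value)
print("Estimated absolute error:", abs_error)
print("Exact value sqrt(pi):", np.sqrt(np.pi))
```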

**Summary**: This exercise provided an overview of commonly used SciPy modules and functions in data science. These tools are invaluable for statistical analysis, optimization, interpolation, signal processing, linear algebra, and numerical integration.

## Exercise 06: Let's see what we can do.
This exercise demonstrates how to analyze real-world data in CSV format using Pandas, NumPy, and SciPy. We will perform data loading, cleaning, exploration, statistical analysis, and simple visualization.

In [None]:
import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt # For plots
import seaborn as sns # For better plots

### Data Loading with Pandas

The student performance dataset `files/StudentsPerformance.csv` has been provided. Load it into a dataframe.

In [None]:
try:
    df = pd.read_csv('files/StudentsPerformance.csv')
except FileNotFoundError:
    print("Please upload the StudentsPerformance.csv file or place it in the same directory as the notebook")
    raise #stop execution

In [None]:
# Display the first few rows:
print("First 5 rows of the dataset:")
print(df.head())

In [None]:
# Display dataset info
print("\nDataset info:")
df.info()

### Data Cleaning and Preprocessing (Pandas)

In [None]:
# Check for missing values:
print("\nMissing values:")
print(df.isnull().sum())

This dataset appears to have no missing values. If there were any, we could handle them with methods like `df.fillna()` or `df.dropna()`.

### Exploratory Data Analysis (EDA) with Pandas and Matplotlib/Seaborn

In [None]:
# Descriptive statistics:
print("\nDescriptive statistics:")
print(df.describe())

In [None]:
# Distribution of scores:
plt.figure(figsize=(12, 6))
plt.subplot(1, 3, 1)
sns.histplot(df['math score'], kde=True)
plt.title('Math Score Distribution')
plt.subplot(1, 3, 2)
sns.histplot(df['reading score'], kde=True)
plt.title('Reading Score Distribution')
plt.subplot(1, 3, 3)
sns.histplot(df['writing score'], kde=True)
plt.title('Writing Score Distribution')
plt.tight_layout()
plt.show()

In [None]:
# Relationship between scores:
sns.pairplot(df[['math score', 'reading score', 'writing score']])
plt.suptitle("Pairplot of Scores", y=1.02)
plt.show()

In [None]:
# Scores by gender
plt.figure(figsize=(10,6))
sns.boxplot(x='gender', y='math score', data=df)
plt.title("Math Score by Gender")
plt.show()

### NumPy for Numerical Operations

In [None]:
# Convert scores to NumPy arrays:
math_scores = df['math score'].values
reading_scores = df['reading score'].values
writing_scores = df['writing score'].values

In [None]:
# Calculate mean, median, standard deviation:
print("\nMath Score Statistics:")
print("Mean:", np.mean(math_scores))
print("Median:", np.median(math_scores))
print("Standard Deviation:", np.std(math_scores))
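
The standard deviation printed here can differ slightly from the `std` row of `df.describe()`: `np.std` divides by N by default (`ddof=0`), while pandas reports the sample standard deviation, which divides by N - 1. The sketch below shows the difference on a small made-up array rather than on the dataset:

```python
import numpy as np

scores = np.array([2, 4, 4, 4, 5, 5, 7, 9])   # hypothetical values for illustration

population_std = np.std(scores)            # divides by N (ddof=0)
sample_std = np.std(scores, ddof=1)        # divides by N - 1

print("Population std (ddof=0):", population_std)
print("Sample std (ddof=1):    ", sample_std)
```

Passing `ddof=1` to `np.std` gives the value that matches pandas.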

In [None]:
# Correlation between math and reading scores:
correlation = np.corrcoef(math_scores, reading_scores)[0, 1]
print("\nCorrelation between math and reading scores:", correlation)

### SciPy for Statistical Testing

In [None]:
# Independent samples t-test (comparing math scores of male and female students):
male_math_scores = df[df['gender'] == 'male']['math score'].values
female_math_scores = df[df['gender'] == 'female']['math score'].values

t_statistic, p_value = stats.ttest_ind(male_math_scores, female_math_scores)
print("\nT-test (Math scores by gender):")
print("T-statistic:", t_statistic)
print("P-value:", p_value)

In [None]:
# One-way ANOVA (comparing scores across different parental levels of education):
groups = df.groupby('parental level of education')['math score'].apply(list)
f_statistic, p_value = stats.f_oneway(*groups)

print("\nOne-way ANOVA (Math scores by parental education):")
print("F-statistic:", f_statistic)
print("P-value:", p_value)

**Summary**: This exercise demonstrated how to use Pandas, NumPy, and SciPy to analyze a real-world dataset. We covered data loading, cleaning, EDA, numerical operations, and statistical testing. These skills are fundamental for any data science project. 

<p style="text-align:center;">That's it! Congratulations! <br> 
    Let's now work on your lab assignment.</p>