# Big Idea Module 9
- Key concepts: vectorization, matrix operations, and the linear model

## The PyData Ecosystem
The PyData ecosystem is a collection of libraries used to handle, analyze, and visualize data in Python. It includes libraries like NumPy, pandas, and Matplotlib.

Example: importing a few essential libraries from the PyData ecosystem. These libraries are the backbone of data science workflows in Python.

In [None]:
# NumPy: the foundation for handling numerical data. You can think of it as the engine behind more advanced libraries like pandas.
import numpy as np
# pandas: builds on NumPy to handle tabular data efficiently.
import pandas as pd
# Matplotlib: used for visualizing data; it works directly with NumPy arrays.
import matplotlib.pyplot as plt


## NumPy
Why is NumPy faster than Python lists? Python lists are flexible, but a list stores references to separate Python objects rather than the values themselves. A NumPy array stores values of a single type in a contiguous block of memory, which allows fast access and manipulation. NumPy also runs its array operations in optimized C code that works on that raw memory, instead of in the Python interpreter.

- Because it can manipulate large datasets efficiently, NumPy serves as the computational backbone for high-performance tasks like matrix operations and linear modeling.

In this example, NumPy arrays give a clear performance boost over Python lists when applying the same operation. The size of the gap depends on the array size and the machine.

In [19]:
import numpy as np
import time

# Create large NumPy array and Python list
arr = np.arange(1000000)
lst = list(range(1000000))

# NumPy vectorized operation
start_time = time.perf_counter()  # perf_counter is a high-resolution clock intended for timing
arr = arr + 1
print(f"NumPy vectorized operation took: {time.perf_counter() - start_time:.6f} seconds")

# Python list comprehension
start_time = time.perf_counter()
lst = [x + 1 for x in lst]
print(f"Python list comprehension took: {time.perf_counter() - start_time:.6f} seconds")

NumPy vectorized operation took: 0.001146 seconds
Python list comprehension took: 0.047512 seconds
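
The storage difference described above can be inspected directly. The following is a minimal sketch, assuming an explicit 64-bit integer dtype and a smaller array than in the timing example; both choices are for illustration only, and the reported list sizes are specific to the Python implementation in use.

```python
import sys
import numpy as np

# Assumption: an explicit int64 dtype so that each element takes exactly 8 bytes.
arr = np.arange(1000, dtype=np.int64)
lst = list(range(1000))

# A NumPy array holds raw values of one type in a single buffer.
print("dtype:", arr.dtype)
print("bytes per element:", arr.itemsize)
print("total data bytes:", arr.nbytes)
print("C-contiguous:", arr.flags["C_CONTIGUOUS"])

# A list holds references; every element is a full Python object of its own.
print("list container bytes:", sys.getsizeof(lst))
print("bytes for one int object:", sys.getsizeof(lst[-1]))
```

The list container size counts only the references, so the integer objects they point to come on top of that figure.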


## Vectorization
Vectorization is one of the main reasons NumPy is so powerful. It lets us apply an operation to an entire array at once, without writing an explicit Python loop, which makes the code both faster and more concise.


In [11]:
arr = np.array([1, 2, 3, 4, 5])
# Without vectorization
arr_squared = [x**2 for x in arr]

# With vectorization
arr_squared_np = arr ** 2

print("Without vectorization:", arr_squared)
print("With vectorization:", arr_squared_np)

Without vectorization: [1, 4, 9, 16, 25]
With vectorization: [ 1  4  9 16 25]


This code squares each element of an array in two ways. The vectorized version, arr ** 2, avoids the explicit Python loop and returns a NumPy array rather than a list.
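
Vectorization also applies to operations that combine several arrays. The sketch below uses made-up prices and quantities (hypothetical data, not from the original notebook) to compare an explicit loop with a single vectorized expression, and shows a universal function, np.sqrt, applied element-wise.

```python
import numpy as np

# Hypothetical example data: unit prices and quantities for four items.
prices = np.array([2.5, 4.0, 1.25, 10.0])
quantities = np.array([4, 1, 8, 2])

# Loop version: one multiplication per Python iteration.
totals_loop = []
for p, q in zip(prices, quantities):
    totals_loop.append(p * q)

# Vectorized version: one expression over the whole arrays.
totals_vec = prices * quantities

# A universal function (ufunc) such as np.sqrt is also applied element-wise.
roots = np.sqrt(np.array([1.0, 4.0, 9.0]))

print("Loop:      ", totals_loop)
print("Vectorized:", totals_vec)
print("Same values:", np.allclose(totals_loop, totals_vec))
print("Square roots:", roots)
```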

## Reshaping Arrays in NumPy
In data science, it’s common to reshape data into different formats, for example turning a one-dimensional array into a matrix or a higher-dimensional array. NumPy’s reshape rearranges the same values into a new shape, as long as the total number of elements stays the same. This is essential for handling real-world data that arrives in different formats.

- Reshaping lets us put data into the structure an analysis needs.

- It is especially useful before matrix operations, like those in machine learning models, which expect inputs with specific shapes.


In [20]:
# Reshaping a 1D array into a 3x3 matrix
arr = np.arange(9)
print("Original 1D array:", arr)

arr_reshaped = arr.reshape(3, 3)
print("Reshaped into 3x3 matrix:\n", arr_reshaped)

Original 1D array: [0 1 2 3 4 5 6 7 8]
Reshaped into 3x3 matrix:
 [[0 1 2]
 [3 4 5]
 [6 7 8]]
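
A few further properties of reshape are worth seeing in code. This is a minimal sketch, with an array size chosen here for illustration: it lets NumPy infer one dimension with -1, checks that the reshaped array shares memory with the original, and shows the error raised when the element counts do not match.

```python
import numpy as np

arr = np.arange(12)

# -1 asks NumPy to infer that dimension from the total number of elements.
three_rows = arr.reshape(3, -1)    # 12 elements / 3 rows -> 4 columns
column = arr.reshape(-1, 1)        # a single column with 12 rows

print("three_rows shape:", three_rows.shape)
print("column shape:", column.shape)

# For a contiguous array like this one, reshape returns a view on the same data.
print("Shares memory with arr:", np.shares_memory(arr, three_rows))

# The element count must match, otherwise NumPy raises a ValueError.
reshape_failed = False
try:
    arr.reshape(5, 5)
except ValueError as err:
    reshape_failed = True
    print("Cannot reshape:", err)
```

Because the result here is a view, modifying three_rows also changes arr; call .copy() when an independent array is needed.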


## Boolean Indexing
Boolean indexing lets us filter an array based on a condition. The condition produces an array of True/False values, and indexing with that array keeps only the elements where the value is True. This is especially useful when cleaning data or selecting specific rows or columns.

Boolean indexing is like filtering rows in a table. The example here keeps the values in the array that are greater than 5. This kind of operation is invaluable when we need to work with only part of the data, such as picking out outliers or missing values.

In [21]:
# Boolean indexing example: Select values greater than 5
arr = np.arange(10)
print("Original array:", arr)

filtered_arr = arr[arr > 5]
print("Values greater than 5:", filtered_arr)
# Boolean indexing selects the elements of an array that meet a condition.
# It filters data quickly and intuitively, without needing loops or complex logic.

Original array: [0 1 2 3 4 5 6 7 8 9]
Values greater than 5: [6 7 8 9]
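
The same mechanism extends to combined conditions, missing values, and assignment. The following sketch uses hypothetical readings with NaN standing in for missing data; the variable names are illustrative and not from the original notebook.

```python
import numpy as np

arr = np.arange(10)

# The comparison itself produces a Boolean array (the mask).
mask = arr > 5
print("Mask:", mask)

# Combine conditions with & (and), | (or), ~ (not); each condition needs parentheses.
even_above_2 = arr[(arr > 2) & (arr % 2 == 0)]
print("Even values above 2:", even_above_2)

# Hypothetical measurements with missing entries encoded as NaN.
readings = np.array([4.2, np.nan, 5.1, np.nan, 3.9])
valid = readings[~np.isnan(readings)]
print("Non-missing readings:", valid)

# A mask can also be used for assignment: cap every value above 7 at 7.
capped = arr.copy()
capped[capped > 7] = 7
print("Capped:", capped)
```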


## Matrix Operations in NumPy 
- Matrix operations, like matrix multiplication and transposition, are fundamental in data science, especially when working with linear models or machine learning algorithms.

Matrix multiplication is used to combine different sets of data or apply weights in linear models. It requires the number of columns in the first matrix to equal the number of rows in the second: a 3x2 matrix times a 2x3 matrix gives a 3x3 matrix.

The transpose operation swaps rows and columns, which is useful in many mathematical and machine learning tasks.

In [16]:
# Creating matrices A and B
A = np.array([[1, 2], [3, 4], [5, 6]])
B = np.array([[7, 8, 9], [10, 11, 12]])

# Matrix multiplication
C = A @ B
print("Matrix Multiplication (A @ B):\n", C)
# Matrix multiplication is at the core of many machine learning algorithms, and NumPy makes it easy with the @ operator.

# Transpose of A
A_T = A.T
print("Transpose of A:\n", A_T)
# The transpose of a matrix switches its rows and columns. It’s used frequently in data transformations.

Matrix Multiplication (A @ B):
 [[ 27  30  33]
 [ 61  68  75]
 [ 95 106 117]]
Transpose of A:
 [[1 3 5]
 [2 4 6]]


Matrix operations allow us to transform data in useful ways.

Matrix Multiplication: Matrix multiplication is a fundamental operation in linear algebra. It’s used in everything from data transformations to neural networks. In this case, we use the @ operator, which for NumPy arrays is shorthand for np.matmul.

Transpose: The transpose operation swaps the rows and columns of a matrix, so the 3x2 matrix A becomes a 2x3 matrix. It’s useful when you need to change the orientation of your data, especially when preparing data for models or visualizations.
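
The shape rules behind these operations can be checked with the same A and B. This is a small sketch rather than part of the original notebook: it prints the result shapes for both multiplication orders, shows the error for incompatible shapes, and verifies the standard identity that transposing a product reverses the order of its factors.

```python
import numpy as np

A = np.array([[1, 2], [3, 4], [5, 6]])        # shape (3, 2)
B = np.array([[7, 8, 9], [10, 11, 12]])       # shape (2, 3)

# Inner dimensions must match: (3, 2) @ (2, 3) -> (3, 3), (2, 3) @ (3, 2) -> (2, 2).
print("A @ B shape:", (A @ B).shape)
print("B @ A shape:", (B @ A).shape)

# (3, 2) @ (3, 2) has mismatched inner dimensions, so NumPy raises a ValueError.
mismatch_raised = False
try:
    A @ A
except ValueError as err:
    mismatch_raised = True
    print("Shape mismatch:", err)

# Transposing a product reverses the order of the factors: (A @ B).T == B.T @ A.T
print("Identity holds:", np.array_equal((A @ B).T, B.T @ A.T))
```

Note that A @ B and B @ A have different shapes here, a reminder that matrix multiplication is not commutative.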

## Building a Linear Model Using Matrix Operations
Now that we understand basic matrix operations, let’s see how we can use them to build a simple linear model. We’ll use a dataset that measures caffeine intake, hours of sleep, and test scores.

- In data science, matrix operations allow us to build linear models. Here, we’ll predict an output (test score) based on input variables (caffeine intake, sleep hours) using matrix multiplication.

Linear models are some of the simplest but most powerful tools in data science. They’re often the first step in predictive modeling, helping us understand relationships between variables.

- A linear model predicts an output based on input features using an equation like: y = intercept + slope_1 * feature_1 + slope_2 * feature_2
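
To connect this equation to matrix notation, the sketch below evaluates it for a single observation in two ways: term by term, and as a dot product in which a leading 1 multiplies the intercept. The coefficient and feature values are illustrative assumptions, not estimates from data.

```python
import numpy as np

# Illustrative values (assumptions for this sketch, not fitted to data).
intercept = 50.0
slope_caffeine = 0.2
slope_sleep = 2.5

caffeine, sleep = 100.0, 5.0   # one hypothetical observation

# The equation written out term by term.
y_scalar = intercept + slope_caffeine * caffeine + slope_sleep * sleep

# The same calculation as a dot product: a leading 1 multiplies the intercept.
x_row = np.array([1.0, caffeine, sleep])
beta = np.array([intercept, slope_caffeine, slope_sleep])
y_dot = x_row @ beta

print("Equation:   ", y_scalar)
print("Dot product:", y_dot)
```

Stacking one such row per observation gives a matrix, and a single matrix multiplication then produces every prediction at once.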

In [22]:
# Input data (caffeine, sleep) and test scores
X = np.array([[100, 5], [200, 4], [150, 6]])  # Input: Caffeine, Sleep
y = np.array([75, 82, 89])  # Output: Test scores

# Adding an intercept column (1's) to X
X_design = np.hstack([np.ones((X.shape[0], 1)), X])

# Coefficients: intercept, caffeine, and sleep coefficients
beta = np.array([50, 0.2, 2.5])

# Prediction using matrix multiplication (linear model)
y_pred = X_design @ beta
print("Predicted Test Scores:", y_pred)

Predicted Test Scores: [ 82.5 100.   95. ]


Linear Model: This linear model predicts test scores based on caffeine intake and hours of sleep. By stacking a column of ones onto X, we add an intercept to the model, which accounts for a baseline prediction.

- In this example, we’ve built a basic linear model using matrix multiplication. We add an intercept term, then apply the weights (the beta values) to make predictions. The beta values are hard-coded in the example rather than estimated from this data, which is why the predictions differ from the observed scores.

Matrix Multiplication in Modeling: In the model, a single matrix multiplication, X_design @ beta, combines each row of input data with our coefficients (the intercept and slopes).
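
The example above uses hard-coded coefficients. As a hedged extension that the original notebook does not include, the sketch below measures how far those predictions are from the observed scores and then estimates the coefficients by least squares with np.linalg.lstsq. The names beta_manual and beta_hat are introduced here for illustration.

```python
import numpy as np

# Same data as in the example above.
X = np.array([[100, 5], [200, 4], [150, 6]], dtype=float)
y = np.array([75, 82, 89], dtype=float)
X_design = np.hstack([np.ones((X.shape[0], 1)), X])

# Residuals of the hard-coded coefficients from the example.
beta_manual = np.array([50, 0.2, 2.5])
residuals_manual = y - X_design @ beta_manual
print("Residuals with hard-coded beta:", residuals_manual)
print("Mean squared error:", np.mean(residuals_manual ** 2))

# Least-squares estimate: the beta minimizing the sum of squared residuals.
beta_hat, _, rank, _ = np.linalg.lstsq(X_design, y, rcond=None)
print("Estimated beta:", beta_hat)
print("Predictions with estimated beta:", X_design @ beta_hat)
```

With three observations and three coefficients, the least-squares fit passes through every point exactly, so it says nothing about how well the model would generalize. A real analysis would use many more observations than coefficients.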