# NumPy, SciPy, Scikit-learn
NumPy, SciPy, and scikit-learn are three fundamental libraries in the Python ecosystem, especially for data science, machine learning, and scientific computing. They are closely related and often used together. Here's how they are connected:

1. **NumPy**
   - **Core Library:** NumPy is the foundational package for numerical computing in Python. It provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays.
   - **Core Data Structures:** Scikit-learn heavily relies on NumPy arrays (especially the ndarray data structure) as the standard format for input data. When you pass data into scikit-learn algorithms, it is usually in the form of a NumPy array.
   - **Numerical Operations:** Many of the computations within scikit-learn, such as linear algebra operations, are performed using NumPy's efficient, low-level implementations.

2. **SciPy**
   - **Advanced Scientific Functions:** SciPy builds on NumPy and provides additional functionality for scientific and technical computing. It includes modules for optimization, integration, interpolation, eigenvalue problems, algebraic equations, and others.
   - **Scikit-learn Dependency:** Scikit-learn utilizes functions from SciPy for more advanced mathematical operations that go beyond the basic capabilities of NumPy. For instance, tasks like optimization (e.g., scipy.optimize), sparse matrix computations, and some statistical functions are sourced from SciPy.
   - **Sparse Matrices:** SciPy’s sparse matrix functionality is essential for handling large, sparse datasets in machine learning, which is particularly useful in tasks like natural language processing or large-scale data analysis.

3. **Scikit-learn**
   - **Machine Learning Library:** Scikit-learn is a high-level library built on top of NumPy and SciPy. It provides simple and efficient tools for data mining and data analysis, including implementations of various machine learning algorithms (e.g., classification, regression, clustering), preprocessing tools, model selection methods, and more.
   - **Interface with NumPy/SciPy:** Scikit-learn uses NumPy arrays as inputs and outputs for its algorithms, making it seamless to integrate with other NumPy or SciPy-based code. For example, after preprocessing data with SciPy or performing numerical operations with NumPy, you can directly pass the processed data to a scikit-learn model.
   - **Dependency:** Scikit-learn directly depends on both NumPy and SciPy, meaning it requires these libraries to be installed to function. This dependency ensures that scikit-learn can leverage the optimized performance and rich functionality of NumPy and SciPy.

Summary:

- NumPy provides the basic array data structures and mathematical operations.
- SciPy builds on NumPy, offering more advanced mathematical tools and utilities.
- Scikit-learn builds on both NumPy and SciPy, providing high-level implementations of machine learning algorithms and data processing tools.



# NumPy

NumPy is one of the fundamental packages for scientific computing in Python. It contains functionality for multidimensional arrays, high-level mathematical functions such as linear algebra operations and the Fourier transform, and pseudorandom number generators.
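As a rough illustration of the three capabilities just mentioned (linear algebra, the Fourier transform, and pseudorandom number generation), here is a minimal sketch. The matrix, signal, and seed are made-up values chosen only so the results are easy to check by hand.

```python
import numpy as np

# Linear algebra: solve the system A @ x = b, i.e.
#   3x + y = 9
#   x + 2y = 8
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = np.linalg.solve(A, b)
print("solution:", x)  # x = 2, y = 3

# Fourier transform: a constant signal has all of its energy
# in the zero-frequency component.
signal = np.ones(4)
spectrum = np.fft.fft(signal)
print("spectrum:", spectrum)

# Pseudorandom numbers: a seeded generator gives reproducible draws.
rng = np.random.default_rng(seed=0)
samples = rng.normal(loc=0.0, scale=1.0, size=5)
print("samples shape:", samples.shape)
```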

In scikit-learn, the NumPy array is the fundamental data structure. scikit-learn takes in data in the form of NumPy arrays (or, for sparse data, SciPy sparse matrices), so any other data you're using generally has to be converted to a NumPy array first.

Key features of NumPy include:

1. **Arrays:** At the core of the NumPy package is the **ndarray** object, which encapsulates n-dimensional arrays of homogeneous data types. These arrays are more **efficient** than Python lists for numerical operations.

2. **Indexing and Slicing:** NumPy provides powerful indexing and slicing capabilities for accessing and manipulating data within arrays. This makes it easy to extract subsets of data or modify specific elements.

3. **Broadcasting:** NumPy allows for operations between arrays of different shapes and sizes through a mechanism called broadcasting. This makes it easy to perform element-wise operations on arrays of different shapes without the need for explicit looping or reshaping.

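4. **Parallelization:** NumPy operations run as vectorized loops in compiled code rather than in the Python interpreter. Most element-wise operations execute on a single thread, but some routines, notably linear algebra operations backed by optimized BLAS/LAPACK libraries, can take advantage of multiple cores on modern hardware.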
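The indexing, slicing, and broadcasting features listed above can be sketched as follows. The array names and values are illustrative only.

```python
import numpy as np

# A 3x4 array: rows are [0..3], [4..7], [8..11]
grid = np.arange(12).reshape(3, 4)

# Indexing and slicing
element = grid[1, 2]           # single element at row 1, column 2
first_column = grid[:, 0]      # every row, column 0
sub_block = grid[:2, 1:3]      # rows 0-1, columns 1-2
large_values = grid[grid > 8]  # boolean mask selects matching elements

# Broadcasting: a (3, 4) array plus a (4,) array.
# The 1D array is applied to every row without an explicit loop.
column_offsets = np.array([10, 20, 30, 40])
shifted = grid + column_offsets

# Vectorized arithmetic replaces an explicit Python loop
squares_loop = [v * v for v in range(5)]
squares_vec = np.arange(5) ** 2

print(element, first_column, large_values)
print(sub_block)
print(shifted)
```

Broadcasting works here because the trailing dimension of `grid` (4) matches the length of `column_offsets`.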

NumPy is a foundational library in the Python data science ecosystem and is often used in conjunction with other libraries like Pandas, Matplotlib, and scikit-learn for tasks such as data manipulation, analysis, and visualization.

### ndarrays

```python
import numpy as np

arr_1d = np.array([1, 2, 3])
arr_2d = np.array([[1, 2, 3], [4, 5, 6]])
arr_3d = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
```

Some key points about ndarray and its attributes:

1. **Axis:**
  - In NumPy, arrays can have one or more dimensions, and each dimension is referred to as an "axis."

  - Many NumPy functions allow operations to be performed along a specified axis. Common operations include `sum`, `mean`, `min`, and `max`. The `axis` parameter specifies the dimension that the operation collapses. For example, summing a 2D array along axis 0 adds the rows together and yields one sum per column; summing along axis 1 adds the columns together and yields one sum per row.

2. **Shape:**
   - The "shape" of an ndarray is a tuple giving the size of the array along each axis. For example, a 1-dimensional array might have a shape like `(5,)`, indicating it has 5 elements along a single axis. A 2-dimensional array might have a shape like `(3, 4)`, indicating it has 3 rows and 4 columns.

```python
import numpy as np

arr_1d = np.array([1, 2, 3])
arr_2d = np.array([[1, 2, 3], [4, 5, 6]])

print("shape of arr_1d:", arr_1d.shape)
print("shape of arr_2d:", arr_2d.shape)
```

```
shape of arr_1d: (3,)
shape of arr_2d: (2, 3)
```
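To make the role of the `axis` parameter concrete, here is a small sketch. It reuses the 2D array from above; the 3D array is an extra illustration that is not part of the original notes.

```python
import numpy as np

arr_2d = np.array([[1, 2, 3], [4, 5, 6]])  # shape (2, 3)

# axis=0 collapses the rows: one result per column
col_sums = arr_2d.sum(axis=0)
# axis=1 collapses the columns: one result per row
row_sums = arr_2d.sum(axis=1)
# no axis: reduce over every element
total = arr_2d.sum()

print("sum along axis 0:", col_sums)
print("sum along axis 1:", row_sums)
print("total:", total)

# For higher dimensions, the reduced axis simply disappears from the shape.
arr_3d = np.arange(24).reshape(2, 3, 4)
print(arr_3d.sum(axis=0).shape)   # axis 0 removed, leaving (3, 4)
print(arr_3d.mean(axis=2).shape)  # axis 2 removed, leaving (2, 3)
```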


### Data Types of ndarrays

Understanding data types in NumPy is crucial as it allows us to control how data is stored in memory and how operations are performed on that data. NumPy provides a rich set of fixed-size data types that are stored more compactly than the built-in Python types.

Here are some key data types in NumPy:

1. **int8, int16, int32, int64**: Signed integers with 8, 16, 32, or 64 bits of precision, respectively.

2. **uint8, uint16, uint32, uint64**: Unsigned integers with 8, 16, 32, or 64 bits of precision, respectively.

3. **float16, float32, float64**: Floating-point numbers with 16, 32, or 64 bits of precision, respectively.

4. **complex64, complex128**: Complex numbers with 64 or 128 bits of precision, where the real and imaginary parts are represented by 32 or 64-bit floating-point numbers.

5. **bool**: Boolean type storing True or False values.

6. **object**: A generic object data type.

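7. **bytes_** (formerly **string_**): Fixed-width byte-string data type. The `string_` alias was removed in NumPy 2.0.

8. **str_** (formerly **unicode_**): Fixed-width Unicode string data type. The `unicode_` alias was removed in NumPy 2.0.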

**You can specify the data type when creating a NumPy array using the `dtype` parameter.** For example:

```python
import numpy as np
# define the data type during array creation
arr1 = np.array([1, 2, 3], dtype=np.int32)
arr2 = np.array([1, 2, 3], dtype='int32')
```
You can convert an array to a different data type after it has been created using the `astype` method, which returns a new array and leaves the original unchanged.
```python
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
float_arr = arr.astype(np.float32)
print(arr.dtype,float_arr.dtype)
```
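The following sketch illustrates why the choice of data type matters for memory use and for the result of a conversion. The array names and values are hypothetical.

```python
import numpy as np

# Same number of elements, very different memory footprints
small = np.zeros(1000, dtype=np.int8)
big = np.zeros(1000, dtype=np.int64)
print(small.itemsize, small.nbytes)  # bytes per element, total bytes
print(big.itemsize, big.nbytes)

# Each integer type has a fixed representable range
int8_info = np.iinfo(np.int8)
print(int8_info.min, int8_info.max)

# Converting floats to integers truncates toward zero
prices = np.array([1.9, -2.7, 3.5])
truncated = prices.astype(np.int32)
print(truncated)

# astype returned a new array; the original keeps its dtype
print(prices.dtype, truncated.dtype)
```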

# SciPy 

SciPy is a collection of functions for scientific computing in Python. It provides, among other functionality, advanced linear algebra routines, mathematical function optimization, signal processing, special mathematical functions, and statistical distributions. scikit-learn draws from SciPy’s collection of functions for implementing its algorithms.
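A brief sketch of a few of these areas is shown below. The objective function and input values are toy examples picked so the answers are known in advance.

```python
from scipy import optimize, special, stats

# Optimization: find the minimum of f(x) = (x - 3)^2 + 1, which lies at x = 3
result = optimize.minimize_scalar(lambda x: (x - 3.0) ** 2 + 1.0)
print("minimizer:", result.x)

# Statistical distributions: the standard normal CDF at 0 is exactly one half
p = stats.norm.cdf(0.0)
print("P(Z <= 0):", p)

# Special functions: gamma(5) equals 4! = 24
g = special.gamma(5.0)
print("gamma(5):", g)
```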

One of its key components is scipy.sparse, which provides efficient storage and manipulation of sparse matrices. Sparse matrices are matrices that are predominantly filled with zeros and contain very few non-zero elements.

In many machine learning applications, especially when dealing with high-dimensional data (like text data in natural language processing), the data can be sparse. Using dense matrices (where all elements, including zeros, are stored in memory) for such data is often inefficient in terms of both memory and computational power. This is where sparse matrices come into play.


`scipy.sparse` provides sparse matrices, which are another data representation used in scikit-learn. Sparse matrices are used whenever we want to store a 2D array that contains mostly zeros:

```python
import numpy as np
from scipy import sparse

# Create a dense 2D NumPy array: np.eye(4) is the 4x4 identity matrix,
# with ones on the diagonal and zeros everywhere else.
# In this dense format every element, including the zeros, is stored in memory.
eye = np.eye(4)
print("NumPy array:\n{}".format(eye))
```

```
NumPy array:
[[1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]]
```


```python
# Convert the NumPy array to a SciPy sparse matrix in Compressed Sparse Row (CSR) format.
# Only the non-zero entries are stored: CSR keeps their values, their column indices,
# and one pointer per row marking where that row's entries begin.
# Here that means 4 stored values instead of all 16 elements.
sparse_matrix = sparse.csr_matrix(eye)
print("\nSciPy sparse CSR matrix:\n{}".format(sparse_matrix))
```

```
SciPy sparse CSR matrix:
  (0, 0)	1.0
  (1, 1)	1.0
  (2, 2)	1.0
  (3, 3)	1.0
```
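To see what CSR actually stores, the sketch below inspects the three underlying arrays of a small matrix. The matrix itself is a made-up example; `data`, `indices`, and `indptr` are the attribute names SciPy uses for CSR matrices.

```python
import numpy as np
from scipy import sparse

dense = np.array([[0, 0, 3],
                  [4, 0, 0],
                  [0, 5, 6]])
csr = sparse.csr_matrix(dense)

print("data:   ", csr.data)     # the non-zero values, row by row
print("indices:", csr.indices)  # column index of each stored value
print("indptr: ", csr.indptr)   # row i occupies data[indptr[i]:indptr[i + 1]]
print("nnz:    ", csr.nnz)      # number of stored entries

# Recover the non-zero values of row 2 from the raw arrays
start, end = csr.indptr[2], csr.indptr[3]
row_2_values = csr.data[start:end]
row_2_columns = csr.indices[start:end]
print(row_2_columns, row_2_values)
```

Because each row is a contiguous slice of `data`, row slicing and matrix-vector products are efficient in this format.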


For large datasets it is usually not possible to create dense representations of sparse data (as they would not fit into memory), so we need to create sparse representations directly. Here is a way to create the same sparse matrix as before, using the COO (coordinate) format:

```python
# Create the sparse matrix directly in the coordinate (COO) format.
# data contains the non-zero values (in this case, all ones).
# row_indices and col_indices specify the row and column position of each non-zero element.
# coo_matrix constructs the sparse matrix from these three arrays.
data = np.ones(4)
row_indices = np.arange(4)
col_indices = np.arange(4)
eye_coo = sparse.coo_matrix((data, (row_indices, col_indices)))
print("COO representation:\n{}".format(eye_coo))
```

```
COO representation:
  (0, 0)	1.0
  (1, 1)	1.0
  (2, 2)	1.0
  (3, 3)	1.0
```
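COO is convenient for building a matrix, while CSR is usually preferred for arithmetic and row slicing, so a common pattern is to build in COO and convert. The sketch below assumes a small hypothetical matrix and passes an explicit `shape`, since otherwise the shape is inferred from the largest indices.

```python
import numpy as np
from scipy import sparse

values = np.array([1.0, 2.0, 3.0])
rows = np.array([0, 1, 3])
cols = np.array([0, 2, 1])

# An explicit shape keeps empty trailing rows and columns
coo = sparse.coo_matrix((values, (rows, cols)), shape=(4, 5))
print(coo.shape, coo.nnz)

# Convert to CSR for efficient row access and products
csr = coo.tocsr()
row_1 = csr[1].toarray()
product = csr @ np.ones(5)  # each entry is the sum of one row
print(row_1)
print(product)

# Convert back to a dense array (only sensible for small matrices)
dense = coo.toarray()
print(dense)
```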


### Why Use Sparse Matrices?

**Memory Efficiency:** Sparse matrices save a significant amount of memory when dealing with large datasets that are mostly zeros. Instead of storing all the zeros, you only store the non-zero elements and their positions.

**Computational Efficiency:** Operations on sparse matrices are often faster because the algorithms can skip over the zero elements, focusing only on the non-zero data.
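The memory argument can be checked with a quick sketch. It compares a dense identity matrix with its CSR counterpart; the size `n = 1000` is an arbitrary choice, and the exact byte count of the sparse version depends on the index dtype SciPy picks on your platform.

```python
import numpy as np
from scipy import sparse

n = 1000
dense = np.eye(n)                            # stores all n * n float64 values
sparse_eye = sparse.eye(n, format="csr")     # built directly, never densified

dense_bytes = dense.nbytes
# A CSR matrix is backed by three arrays: values, column indices, row pointers
sparse_bytes = (sparse_eye.data.nbytes
                + sparse_eye.indices.nbytes
                + sparse_eye.indptr.nbytes)

print("dense bytes: ", dense_bytes)
print("sparse bytes:", sparse_bytes)

# Both representations give the same matrix-vector product
v = np.arange(n, dtype=float)
same_result = np.allclose(dense @ v, sparse_eye @ v)
print("same product:", same_result)
```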

### Use in Scikit-learn

Scikit-learn leverages sparse matrices to handle large datasets efficiently, particularly in tasks like text classification where feature vectors can have thousands or even millions of dimensions, with most of them being zero. By using sparse matrices, scikit-learn can perform machine learning tasks without running into memory issues, making it possible to work with large-scale data.
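As a minimal sketch of this workflow, the example below turns a handful of made-up sentences into a sparse bag-of-words matrix and fits a classifier on it directly. The documents and labels are hypothetical toy data, and `CountVectorizer` with `LogisticRegression` is just one possible pairing, not the only way to do text classification.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus: 1 = about cooking, 0 = about programming
docs = [
    "bake the bread in a hot oven",
    "compile the code and run the tests",
    "simmer the soup and season with salt",
    "debug the function and fix the bug",
]
labels = [1, 0, 1, 0]

# fit_transform returns a SciPy sparse matrix: one row per document,
# one column per vocabulary word, mostly zeros
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
print("is sparse:", sparse.issparse(X))
print("shape:", X.shape, "stored entries:", X.nnz)

# The estimator accepts the sparse matrix without converting it to dense
clf = LogisticRegression()
clf.fit(X, labels)

# New text must go through the same fitted vectorizer
new_X = vectorizer.transform(["season the bread with salt"])
pred = clf.predict(new_X)
print("prediction:", pred)
```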