## NumPy

NumPy is one of the fundamental packages for scientific computing in Python.

It contains functionality for:
- multidimensional arrays
- high-level mathematical functions, such as linear algebra operations and the Fourier transform
- pseudorandom number generators
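
As a minimal sketch of the areas listed above, the snippet below solves a small linear system, takes a Fourier transform, and draws pseudorandom numbers. The matrix `A`, the vector `b`, the constant signal, and the seed are made-up values chosen only for illustration.

```python
import numpy as np

# Linear algebra: solve A @ v = b for v
A = np.array([[2.0, 0.0], [0.0, 4.0]])
b = np.array([2.0, 8.0])
v = np.linalg.solve(A, b)
print("solution:", v)

# Fourier transform of a constant signal: only the first bin is nonzero
signal = np.ones(4)
spectrum = np.fft.fft(signal)
print("spectrum:", spectrum)

# Pseudorandom numbers from a seeded generator (reproducible across runs)
rng = np.random.default_rng(0)
samples = rng.normal(size=3)
print("samples shape:", samples.shape)
```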

In **scikit-learn**, the NumPy array is the fundamental data structure.  
**scikit-learn** takes in data in the form of NumPy arrays, so any data you use has to be converted to a NumPy array.  
The core functionality of NumPy is the `ndarray` class, a multidimensional (n-dimensional) array.  
All elements of the array must be of the same type.

A NumPy array looks like this:

In [37]:
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6]])
print("x:\n{}".format(x))

x:
[[1 2 3]
 [4 5 6]]
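
The rule that all elements share one type can be checked directly. The sketch below inspects the array's shape, number of dimensions, and element type, then shows what happens when integers and a float are mixed. The `mixed` array is a made-up example; the exact integer dtype of `x` depends on the platform, so it is printed rather than assumed.

```python
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6]])
print("shape:", x.shape)   # (rows, columns)
print("ndim:", x.ndim)     # number of dimensions
print("dtype:", x.dtype)   # one element type for the whole array

# Mixing integers with a float upcasts every element to float
mixed = np.array([1, 2, 3.5])
print("mixed dtype:", mixed.dtype)
```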


## SciPy

SciPy is a collection of functions for scientific computing in Python.  
It contains functionality for:
- advanced linear algebra routines
- mathematical function optimization
- signal processing
- special mathematical functions
- statistical distributions
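
To make the list above concrete, here is a small sketch with one call from each area. The inputs (the quadratic, the 2x2 matrix `M`, the short sequences) are made-up values for illustration, and each call is only one of many routines in its submodule.

```python
import numpy as np
from scipy import linalg, optimize, signal, special, stats

# Linear algebra: determinant of a 2x2 matrix
M = np.array([[1.0, 2.0], [3.0, 4.0]])
det = linalg.det(M)
print("det:", det)

# Optimization: minimize the quadratic f(t) = (t - 2)^2
result = optimize.minimize_scalar(lambda t: (t - 2.0) ** 2)
print("minimizer:", result.x)

# Signal processing: convolve two short sequences
smoothed = signal.convolve([1, 2, 3], [1, 1])
print("convolution:", smoothed)

# Special functions: gamma(5) equals 4! = 24
gamma_5 = special.gamma(5)
print("gamma(5):", gamma_5)

# Statistical distributions: CDF of the standard normal at 0
cdf_0 = stats.norm.cdf(0.0)
print("normal CDF at 0:", cdf_0)
```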

**scikit-learn** draws from _SciPy_'s collection of functions to implement its algorithms.  
The most important part of SciPy for us is `scipy.sparse`, which provides _sparse matrices_.  
**Sparse matrices** are another representation used for data in **scikit-learn**.  
**Sparse matrices** are used whenever we want to store a 2D array that contains mostly zeros.

In [42]:
from scipy import sparse

# Create a 2D NumPy array with a diagonal of ones, and zeros everywhere else
eye = np.eye(4)
print("NumPy array:\n{}".format(eye))

NumPy array:
[[1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]]


In [44]:
# Convert the NumPy array to a SciPy sparse matrix in CSR format
# Only the nonzero entries are stored
sparse_matrix = sparse.csr_matrix(eye)
print("\nSciPy sparse CSR matrix:\n{}".format(sparse_matrix))


SciPy sparse CSR matrix:
  (0, 0)	1.0
  (1, 1)	1.0
  (2, 2)	1.0
  (3, 3)	1.0
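
To see what "only the nonzero entries are stored" means in practice, the sketch below looks at the three flat arrays behind a CSR matrix. It rebuilds the same 4x4 identity so that it runs on its own.

```python
import numpy as np
from scipy import sparse

eye = np.eye(4)
sparse_matrix = sparse.csr_matrix(eye)

# CSR keeps three flat arrays instead of the full 4x4 grid
print("data:   ", sparse_matrix.data)     # the nonzero values
print("indices:", sparse_matrix.indices)  # column index of each value
print("indptr: ", sparse_matrix.indptr)   # where each row starts in data

print("stored entries:", sparse_matrix.nnz, "of", eye.size)

# Convert back to a dense NumPy array when it is small enough
dense_again = sparse_matrix.toarray()
```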


Usually it is not possible to create dense representations of sparse data (as they would not fit into memory), so we need to create sparse representations directly. Here is a way to create the same sparse matrix as before, using the **COO** format:

In [47]:
data = np.ones(4)
row_indices = np.arange(4)
col_indices = np.arange(4)
eye_coo = sparse.coo_matrix((data, (row_indices, col_indices)))
print("COO representation:\n{}".format(eye_coo))

COO representation:
  (0, 0)	1.0
  (1, 1)	1.0
  (2, 2)	1.0
  (3, 3)	1.0
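
The memory argument becomes clearer at a larger size. The sketch below builds a diagonal matrix directly in COO format without ever creating the dense array, then converts it to CSR for arithmetic. The size `n` is an arbitrary choice for illustration, and passing `shape` explicitly is optional here but avoids relying on the inferred dimensions.

```python
import numpy as np
from scipy import sparse

n = 100_000  # a dense n x n float64 array would need about 80 GB

# Build the diagonal directly in COO format; shape is given explicitly
data = np.ones(n)
idx = np.arange(n)
big_coo = sparse.coo_matrix((data, (idx, idx)), shape=(n, n))

# COO is convenient for construction; CSR is better suited to arithmetic
big_csr = big_coo.tocsr()
print("shape:", big_csr.shape)
print("stored entries:", big_csr.nnz)

# Matrix-vector product without ever creating the dense array
v = np.arange(n, dtype=float)
product = big_csr @ v
```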


## matplotlib    
matplotlib is the primary scientific plotting library in Python.  
It provides functions for making publication-quality visualizations such as _line charts_, _histograms_, _scatter plots_, and so on.  
In the Jupyter Notebook, you can show figures directly in the browser using the `%matplotlib notebook` and `%matplotlib inline` commands.  
We recommend using `%matplotlib notebook`, which provides an interactive environment.

In [52]:
%matplotlib notebook
import matplotlib.pyplot as plt

# Generate 100 evenly spaced numbers from -10 to 10
x = np.linspace(-10, 10, 100)
# Create a second array using sine
y = np.sin(x)
# The plot function makes a line chart of one array against another
plt.plot(x, y, marker="x")

[<matplotlib.lines.Line2D at 0x1f8763ee330>]
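
The `%matplotlib` commands only exist inside Jupyter. As a sketch of how the same plot could be produced in a plain script, the snippet below selects the non-interactive `Agg` backend and writes the figure as a PNG into an in-memory buffer. The axis labels and the use of a buffer instead of a file are choices made for this illustration.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no window or notebook needed
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(-10, 10, 100)
y = np.sin(x)

# Object-oriented interface: create the figure and axes explicitly
fig, ax = plt.subplots()
lines = ax.plot(x, y, marker="x")
ax.set_xlabel("x")
ax.set_ylabel("sin(x)")

# Write the PNG into an in-memory buffer; pass a filename to write to disk
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
print("PNG size in bytes:", buffer.getbuffer().nbytes)
```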

## pandas     
pandas is a Python library for data wrangling and analysis.  
It is built around a data structure called the DataFrame, which is modeled after the R DataFrame.  
In other words, a DataFrame is a table, similar to an Excel spreadsheet.  
**pandas** provides a great range of methods to modify and operate on this table, in particular SQL-like queries and joins of tables.  
In contrast to NumPy, which requires all entries in an array to be of the same type, pandas allows each column to have a separate type.  
**pandas** can also ingest data from a variety of file formats and databases, such as SQL, Excel files, and CSV files.

The following example illustrates the creation of a DataFrame using a dictionary:

In [55]:
import pandas as pd

# create a simple dataset of people
data = {'Name': ["John", "Anna", "Peter", "Linda"],
        'Location': ["New York", "Paris", "Berlin", "London"],
        'Age': [24, 13, 53, 33]
        }
data_pandas = pd.DataFrame(data)

# IPython.display allows "pretty printing" of dataframes
# in the Jupyter notebook
from IPython.display import display
display(data_pandas)

|   | Name  | Location | Age |
|---|-------|----------|-----|
| 0 | John  | New York | 24  |
| 1 | Anna  | Paris    | 13  |
| 2 | Peter | Berlin   | 53  |
| 3 | Linda | London   | 33  |
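
The point that each column can have its own type can be checked with `dtypes`. The sketch below rebuilds the same table so that it runs on its own, then shows what happens when the whole table is converted to a single NumPy array, which is the form **scikit-learn** expects.

```python
import pandas as pd

data = {"Name": ["John", "Anna", "Peter", "Linda"],
        "Location": ["New York", "Paris", "Berlin", "London"],
        "Age": [24, 13, 53, 33]}
data_pandas = pd.DataFrame(data)

# Each column keeps its own type: two text columns and an integer column
print(data_pandas.dtypes)

# Converting the whole table to one NumPy array forces a common type
as_array = data_pandas.to_numpy()
print("array dtype:", as_array.dtype)

# A single numeric column converts cleanly
ages = data_pandas["Age"].to_numpy()
print("ages:", ages)
```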


In [57]:
# Select all rows where the Age column is greater than 30
display(data_pandas[data_pandas.Age > 30])

|   | Name  | Location | Age |
|---|-------|----------|-----|
| 2 | Peter | Berlin   | 53  |
| 3 | Linda | London   | 33  |
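
The SQL-like joins and file ingestion mentioned earlier can be sketched in a few lines. The `countries` lookup table and its CSV text are hypothetical data invented for this example. The CSV is read from an in-memory string so that no file is needed.

```python
import io

import pandas as pd

people = pd.DataFrame({"Name": ["John", "Anna", "Peter", "Linda"],
                       "Location": ["New York", "Paris", "Berlin", "London"]})

# Hypothetical lookup table, read from CSV text held in memory
csv_text = "Location,Country\nNew York,USA\nParis,France\nBerlin,Germany\n"
countries = pd.read_csv(io.StringIO(csv_text))

# Inner join on the shared column, similar to SQL's JOIN ... USING (Location)
joined = people.merge(countries, on="Location", how="inner")
print(joined)
```

Because this is an inner join and the lookup table has no entry for London, Linda's row does not appear in the result; `how="left"` would keep it with a missing `Country`.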


### Print versions

In [60]:
import sys
print("Python version: {}".format(sys.version))

import pandas as pd
print("pandas version: {}".format(pd.__version__))

import matplotlib
print("matplotlib version: {}".format(matplotlib.__version__))

import numpy as np
print("NumPy version: {}".format(np.__version__))

import scipy as sp
print("SciPy version: {}".format(sp.__version__))

import IPython
print("IPython version: {}".format(IPython.__version__))

import sklearn
print("scikit-learn version: {}".format(sklearn.__version__))

Python version: 3.12.7 | packaged by Anaconda, Inc. | (main, Oct  4 2024, 13:17:27) [MSC v.1929 64 bit (AMD64)]
pandas version: 2.2.2
matplotlib version: 3.9.2
NumPy version: 1.26.4
SciPy version: 1.13.1
IPython version: 8.27.0
scikit-learn version: 1.5.1
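
A more compact way to collect the same information, assuming Python 3.8 or later, is the standard library's `importlib.metadata`, which reads versions without importing each package. The helper name `installed_versions` is hypothetical, and the names in `packages` are distribution names as used by `pip`, not import names.

```python
from importlib import metadata

packages = ["pandas", "matplotlib", "numpy", "scipy", "ipython", "scikit-learn"]


def installed_versions(names):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(packages).items():
    print("{} version: {}".format(name, version or "not installed"))
```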
