
# NumPy

## Notes

* NumPy, which stands for Numerical Python, is foundational for numerical computing in Python.
* It is designed for scientific computation and is used extensively for data analysis because it handles large, multi-dimensional arrays and matrices efficiently.
* It serves as the basis for many other Python data science libraries, including Pandas, due to its speed and efficiency in numerical computations.
* A few NumPy use cases specific to data analysis:
    * Array operations
    * Linear algebra
    * Statistical functions
    * Random number generation

## Importance

* NumPy is central to data analysis in Python.
* It provides a way to store data in a structured format, making it easier to organize, access, and manipulate.
* The Pandas library is built on top of NumPy.

For more information on NumPy, check out the official documentation [here](https://numpy.org/doc/1.26/).


## Array Operations

### Notes
* Creation: **`import numpy as np; np.array([1, 2, 3])`**
* Basic Operations:  `+`, `-`, `/`, `*` performed element-wise
* Slicing: **`array[1:3]`**
* Boolean Indexing: **`array[array > 0]`**
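
The four operations above can be sketched together as follows. This is a minimal illustration; the array name `years` and its values are made up for the example.

```python
import numpy as np

# Hypothetical data: years of experience for five job listings
years = np.array([0, 2, 5, 3, 8])

# Element-wise arithmetic: each element is transformed independently
months = years * 12          # array([ 0, 24, 60, 36, 96])
plus_one = years + 1         # array([1, 3, 6, 4, 9])

# Slicing: elements at index 1 and 2 (the stop index is excluded)
middle = years[1:3]          # array([2, 5])

# Boolean indexing: keep only the elements where the condition is True
positive = years[years > 0]  # array([2, 5, 3, 8])
```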

### Examples

Let's start by comparing the speed of Python's built-in `statistics` module with NumPy on a fictional list of 1,000,000 salaries.

In [15]:
# create a list of 1,000,000 salaries ranging from 50,000 to 100,000
import random

salary_list = [random.randint(50000, 100000) for _ in range(1_000_000)]

In [16]:
%%timeit
import statistics


statistics.median(salary_list)

447 ms ± 132 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [17]:
%%timeit
import numpy as np

np.median(salary_list)

60.9 ms ± 2.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
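
Outside a notebook, the same comparison can be sketched with the standard-library `timeit` module. The variable names and the smaller list size here are assumptions chosen to keep the run short, and the timings will differ by machine. Converting the list to an array once up front separates the conversion cost from the cost of the median itself.

```python
import random
import statistics
import timeit

import numpy as np

# Smaller hypothetical sample than the 1,000,000 used above, to keep this quick
salaries = [random.randint(50_000, 100_000) for _ in range(100_000)]
salaries_arr = np.array(salaries)  # convert once, outside the timed code

# Time three repetitions of each approach
t_list = timeit.timeit(lambda: statistics.median(salaries), number=3)
t_numpy_from_list = timeit.timeit(lambda: np.median(salaries), number=3)
t_numpy_array = timeit.timeit(lambda: np.median(salaries_arr), number=3)

print(f"statistics.median on list: {t_list:.4f} s")
print(f"np.median on list:         {t_numpy_from_list:.4f} s")
print(f"np.median on array:        {t_numpy_array:.4f} s")

# Both approaches agree on the result
assert statistics.median(salaries) == np.median(salaries_arr)
```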


## NumPy Array

In [18]:
import numpy as np

my_array = np.array([1,2,3,4])

my_array.mean()

2.5

## NaN


### Notes

* Generate NaN values using `np.nan`
* The `np.nan` value is used in NumPy (and by extension, Pandas) to represent missing or undefined data.
* It is helpful because it:
    * Handles missing data.
    * Helps with computations, since operations on it return `np.nan` instead of raising errors.
    * Helps filter out or fill in missing data using methods that we'll use often in the `pandas` library, like `dropna()`, `fillna()`, `isna()`, or `notna()`.
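
A minimal sketch of this behavior in plain NumPy, using made-up salary values. The variable names are hypothetical, and `np.nan_to_num` is shown only as one possible way to fill a missing value.

```python
import numpy as np

# Hypothetical salaries with one missing value
salaries = np.array([50000, 60000, np.nan, 80000])

# NaN propagates through arithmetic instead of raising an error
raised = salaries * 1.1

# NaN is not equal to itself, so use np.isnan to detect it
missing = np.isnan(salaries)        # array([False, False,  True, False])

# Regular aggregations return nan; the nan-prefixed versions skip NaN
plain_mean = np.mean(salaries)      # nan
safe_mean = np.nanmean(salaries)    # about 63333.33

# Filter out the missing entries with boolean indexing
present = salaries[~missing]        # array([50000., 60000., 80000.])

# Or fill them in, here with the mean of the values that are present
filled = np.nan_to_num(salaries, nan=safe_mean)
```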

### Examples

Below are a few examples of what you can do.

In [27]:
job_title = np.array(['Data Analyst', 'Data Scientist', 'Data Engineer', 'Machine Learning Engineer'])

# Variant with a missing value (it also needs a fifth entry in job_title).
# np.nan is used instead of 0 because a base salary can't be zero:
# base_salaries = np.array([50000, 60000, 70000, 80000, np.nan])
base_salaries = np.array([50000, 60000, 70000, 80000])
# Variant with a missing value. np.nan is used instead of None because
# arithmetic on None raises an error:
# bonus_rates = np.array([.05, .1, .08, .12, np.nan])
bonus_rates = np.array([.05, .1, .08, .12])
total_salaries = base_salaries * (1 + bonus_rates)

total_salaries

array([52500., 66000., 75600., 89600.])

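In [24]:
# run with the commented-out arrays above, which include np.nan: the mean is nan
np.mean(total_salaries)

nan

In [25]:
# np.nanmean ignores NaN values when computing the mean
np.nanmean(total_salaries)

70925.0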

In [29]:
np.max(total_salaries)

89600.00000000001

## Where

### Notes

* `np.where` checks the elements of an array against a condition and assigns one value where the condition is true and another where it is false.
* Syntax: `np.where(condition, x, y)`.
* It's commonly used to conditionally replace array elements.
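
A minimal sketch of conditional replacement with `np.where`, using made-up values. The three-argument form `np.where(condition, x, y)` picks from `x` where the condition is true and from `y` elsewhere; called with only a condition, it returns the indices where the condition holds.

```python
import numpy as np

# Hypothetical salaries
salaries = np.array([45000, 60000, 52000, 90000])

# Three-argument form: use 55000 where the condition is True,
# otherwise keep the original value
floored = np.where(salaries < 55000, 55000, salaries)
# array([55000, 60000, 55000, 90000])

# x and y can also be labels rather than numbers
bands = np.where(salaries >= 60000, "high", "standard")
# array(['standard', 'high', 'standard', 'high'], dtype='<U8')

# One-argument form: returns a tuple of index arrays where the condition is True
(indices,) = np.where(salaries < 55000)
# array([0, 2])
```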

### Example

We're going to replace all values in `base_salaries` that are less than 60,000 with 60,000 (to apply a minimum salary threshold) and store the result in `salary_array`. We'll use this syntax: `np.where(condition, x, y)`. Where `condition` is true, the result takes the value `x`; otherwise it takes `y`.

In [30]:
salary_array = np.where(base_salaries < 60000, 60000, base_salaries)

In [31]:
salary_array

array([60000, 60000, 70000, 80000])

## Random Sampling

### Notes
* Generate random numbers or samples.
* `np.random.normal` - draws random samples from a normal (Gaussian) distribution.
    * Arguments:
        * `loc`: The mean (`μ`) of the normal distribution.
        * `scale`: The standard deviation (`σ`) of the normal distribution, representing the dispersion from the mean.
        * `size`: The number of random samples to draw. In the example below it is set to match the number of salaries.
    * Syntax: `np.random.normal(loc=0.0, scale=1.0, size=None)`
* A few other random sampling functions:
    * `np.random.rand`
    * `np.random.randn`
    * `np.random.randint`
    * `np.random.random`
    * `np.random.uniform`
    * `np.random.binomial`
    * `np.random.poisson`
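
A short sketch of a few of these functions, with made-up parameters. `np.random.seed` is used here only to make the draws repeatable and is not part of the notes above. NumPy's documentation recommends the newer `np.random.default_rng()` Generator for new code; the legacy functions are kept here to match the list.

```python
import numpy as np

np.random.seed(42)  # assumption: fixed seed so repeated runs give the same draws

# 5 samples from a normal distribution with mean 0 and standard deviation 5000
noise = np.random.normal(loc=0.0, scale=5000, size=5)

# 5 integers from 50000 (inclusive) to 100000 (exclusive)
ints = np.random.randint(50000, 100000, size=5)

# 5 floats drawn uniformly from [0.0, 1.0)
unit = np.random.rand(5)

# 5 floats drawn uniformly between 0.05 and 0.15
rates = np.random.uniform(0.05, 0.15, size=5)

print(noise)
print(ints)
print(unit)
print(rates)
```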

### Example

Let's add some random noise to the `salary_array` to simulate salary variations. We can generate random values from a normal distribution with a mean of 0 and a standard deviation of 5000, then add these values to the salaries.

**Why?** This can be used to simulate salary data for job postings if actual salary data isn't available, for instance, in modeling or simulation scenarios.

In [33]:
noise = np.random.normal(0, 5000, salary_array.size)

salary_array_with_noise = salary_array + noise
salary_array_with_noise

array([49901.77395778, 53943.15352971, 74199.59840628, 77425.67731777])