### 343.1 Intro to Data Analytics & Statistics
#### Learning Objectives
- Describe fundamental concepts in Data Analytics (DA) & statistics, including descriptive & inferential statistics
- Explain what DA is & key steps in DA process
    - DA Process:
        - Identifying the question
        - Collecting raw data
        - Cleaning data
        - Analyzing data
        - Interpreting results
- Define statistical methods to analyze data, including measures of central tendency, dispersion, & hypothesis testing
- Summarize four types of DA
    - Descriptive analytics: tells what happened
    - Diagnostic analytics: tells why something happened
    - Predictive analytics: tells what will likely happen in the future
    - Prescriptive analytics: tells how to act or make it happen
- Explain how DA can be applied to drive business decisions & strategy
    - Understand past performance & inform decision-making for future actions
    - Validate course of action before committing to it, saving valuable time & resources
    - Identify opportunities for growth, anticipate potential risks, & gain insights into customer behavior patterns
- Explain Linear Regression
- Recognize role of Data Engineers (DEs) & Data Analysts (DAs)
    - Primary responsibility is to transform raw data into actionable insights
        - identify goals
        - prioritize needs
        - shape strategies
        - build databases & dashboards
        - ensure data quality
        - use best practices
        - maintain relevant documentation
    - DE - collaborate closely with other departments to extract meaningful info from data
    - DA - identify specific problem to be solved or question that requires an answer
    - Steps to obtain insights & solutions
        - Define the question
        - Collect the data
        - Clean the data
        - Analyze the data
        - Create visualizations
        - Share findings
- Describe Statistical terms
- Explain Statistical types & benefits
    - Exploratory Data Analysis (EDA) - focuses on discovering new features in data
    - Confirmatory Data Analysis (CDA) - focuses on confirming or falsifying existing hypotheses
- Identify Data Wrangling concept (also known as data munging)
    - an iterative four-step process that involves data exploration (Discovery), transformation, validation, & making data available for credible & meaningful analysis (Publishing)
- Define Linear Relationships
- Discuss Correlation concept

- DA is the process of gathering, systematically examining, cleansing, transforming, analyzing, mining, & interpreting data to discover meaningful patterns, insights, & trends
    - Analysis - detailed examination of elements or structure of something (inferences based on historical data)
    - Analytics - the systematic computational analysis of data or statistics (predicting future performance)
- Major approaches:
    - Data Mining - the analysis of large quantities of data to extract previously unknown, interesting patterns of data, unusual data, & its dependencies
    - BI
    - Statistical Analysis
    - Predictive Analytics
    - Text Analytics
    

### Summary
- DA is a process of obtaining large, unstructured data from various sources, and converting that data into info that is useful for:
    - Answering questions
    - Testing hypotheses
    - Making decisions
    - Disproving theories
- There are four primary types of DA:
    - Descriptive Analytics, which helps decode "what happened"
    - Diagnostic Analytics, which helps us understand "why it happened"
    - Predictive Analytics, which analyzes historical data & trends to suggest "what will happen next"
    - Prescriptive Analytics, which prescribes "what should be done next"

***
### Intro to Statistics Analysis & Basic Statistics for DA
- Statistics - a branch of mathematics that deals with the collection, analysis, interpretation, and presentation of numerical or quantitative data
    - Examples include:
        - Calculations - average income, average age, highest-paid professions
        - Clinical Trials - analyze data from production of vaccines to ensure safety & efficacy
        - Economic & Finance - analyze market trends, consumer behavior, & preferences, enabling businesses to make informed decisions
        - Climate Studies - analyze climate data, model environmental changes, & assess impact of human activities on environment
        - Machine Learning & Artificial Intelligence - statistical methods form the foundation of many machine learning algorithms used in data analysis, pattern recognition, & predictive modeling
    - Why learn / benefits of statistics
        - ensure data is interpreted correctly & that apparent relationships are actually significant or meaningful
        - assist in gaining insight into business operations
        - data preprocessing & feature engineering
        - visualizing numbers to understand patterns & trends existing in quantitative data
        - serve as a base for several ML algorithms such as logistic regression
    - Statistical Analysis is the application of statistical methods to a sample of data in order to develop an understanding of what it represents
        - Sample: a representative selection drawn from a total population
        - Population: a discrete group (people or things) that can be identified by at least one common characteristic for purposes of data collection & analysis
            - Example: the population may be all people in a state who have a driver's license; a sample (a part or subset of this population) could be male drivers over the age of 50
    - Types:
        - Descriptive - summarizing info about the sample / describes the characteristics of data
            - Example: Population, Frequency of variable
        - Inferential - making inferences or generalizations about the broader population based on a sample of its data
            - Example: Grade, Percentile
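
The sample vs. population distinction above can be sketched with NumPy. The driver ages are hypothetical, and the 1.96 multiplier assumes an approximately normal sampling distribution, so treat this as an illustration rather than a full inferential procedure.

```python
import numpy as np

# Hypothetical population: ages of all licensed drivers in a state (18 to 90)
population = np.arange(18, 91)

# Draw a random sample of 20 drivers without replacement (seeded for repeatability)
rng = np.random.default_rng(seed=0)
sample = rng.choice(population, size=20, replace=False)

# Descriptive statistics: summarize the sample itself
sample_mean = sample.mean()
sample_std = sample.std(ddof=1)  # ddof=1 gives the sample standard deviation

# Inferential statistics: estimate the population mean from the sample
std_error = sample_std / np.sqrt(sample.size)
ci_low = sample_mean - 1.96 * std_error
ci_high = sample_mean + 1.96 * std_error

print("Sample mean:", sample_mean)
print("Approximate 95% interval for population mean:", (ci_low, ci_high))
print("True population mean:", population.mean())
```

The descriptive part only describes the 20 sampled values; the inferential part uses them to say something about all drivers.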

### Summary
- Statistical Analysis
    - Statistics is the study of collection, analysis, interpretation, presentation, & organization of data
    - In data analysis, two main statistical methodologies are used:
        - Descriptive Statistics - data from the entire population or a sample is summarized with numerical descriptors such as:
            - Mean, Standard Deviation for Continuous Data
            - Frequency, Percentage for Categorical Data
        - Inferential Statistics - uses patterns in the sample data to draw inferences about the represented population while accounting for randomness
            - answering yes/no questions about the data (hypothesis testing)
            - estimating numerical characteristics of the data (estimation)
            - describing associations within the data (correlation)
            - modeling relationships with the data (regression analysis)
    - Statistics, combined with Data Mining, helps in better decision-making
        - Providing measures & methodologies necessary for data mining
        - Identifying patterns that help identify differences between random noise & significant findings
    - Inferential Statistics is a powerful tool for making predictions or inferences about populations based on sample data
        - Regression is the statistical method used to determine the strength & nature of the relationship between the independent & dependent variables
            - Multiple Linear Regression is an extension of Simple Linear Regression, where multiple independent variables are used to predict the dependent variable

***
## 343.2 - Numpy
- NumPy arrays can be created or initialized using various functions:
    - np.array() - takes a sequence (list, tuple, range) as its argument
    - np.zeros()
    - np.ones()
    - np.arange() - returns an ndarray, args(start, stop, step, dtype)
    - np.linspace()
    - np.eye()
    - np.random.rand() (& other functions in the np.random module)
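
A minimal sketch of the creation functions listed above that are not demonstrated in the cells below. The shapes and the seed are arbitrary choices, and `np.random.default_rng` is used here as one way to get repeatable random values.

```python
import numpy as np

zeros = np.zeros((2, 3))             # 2x3 array filled with 0.0
ones = np.ones(4, dtype=int)         # four integer 1s
evenly = np.linspace(0, 1, 5)        # 5 evenly spaced values from 0 to 1 (inclusive)
identity = np.eye(3)                 # 3x3 identity matrix
rng = np.random.default_rng(seed=0)  # seeded random number generator
uniform = rng.random(3)              # 3 floats in the interval [0, 1)

print(zeros)
print(ones)
print(evenly)
print(identity)
print(uniform)
```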

In [1]:
import numpy as np

In [None]:
# Creating a one-dimensional array w/ 5 elements
a = np.array([1, 2, 3, 4, 5])
print(a)
print(np.array) # output: <built-in function array>

[1 2 3 4 5]
<built-in function array>


### Using arange()

In [7]:
my_array = np.arange(1, 30)
print(my_array)

# Parameters: start at 10, stop before 20 (stop is exclusive), step by 2
x = np.arange(10, 20, 2)
print(x)

y = np.arange(-3, 10, 4, dtype=int)
y2 = np.arange(-3, 10, 4, dtype=float)
print(y)
print(y2)

[ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 25 26 27 28 29]
[10 12 14 16 18]
[-3  1  5  9]
[-3.  1.  5.  9.]


### Array Attributes
- ndim - number of dimensions of the array
- shape - shape of array, tuple of integers indicating size of array in each dimension
- size - total number of elements of array
- dtype - object describing the type of elements in array: int32, int16, float64, etc

In [9]:
print(a.ndim)
print(a.shape)
print(a.size)
print(a.dtype)

1
(5,)
5
int64
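
The same attributes are more informative on a multi-dimensional array. A short sketch, where the array `grid` and its `int32` dtype are chosen only for illustration:

```python
import numpy as np

# 2 rows x 3 columns of 32-bit integers
grid = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int32)

print(grid.ndim)      # number of dimensions
print(grid.shape)     # (rows, columns)
print(grid.size)      # total number of elements
print(grid.dtype)     # element type
print(grid.itemsize)  # bytes per element
print(grid.nbytes)    # total bytes = size * itemsize
```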


### dtype - Data Type Objects
- Refers to an object that specifies the type of data stored in an array. Defines how data is stored in memory
- It describes the following aspects of data:
    - Type of data (integer, float, Python object, etc)
    - Size of data (number of bytes)
    - Byte order of data (little-endian or big-endian)
    - If data type is sub-array, what is its shape & data type?

In [None]:
# Creates data type object for integer with 8 bits of precision
# dtype = np.dtype(np.int8)
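
The aspects of a dtype listed above can be inspected directly on a `np.dtype` object. A small sketch, with variable names chosen for illustration:

```python
import numpy as np

# Type and size: 8-bit signed integer
int8_type = np.dtype(np.int8)
print(int8_type.name)      # type name
print(int8_type.kind)      # 'i' marks a signed integer
print(int8_type.itemsize)  # size in bytes

# Byte order: '>' requests big-endian, '<' little-endian
big_endian = np.dtype(">i4")
print(big_endian.itemsize)

# Sub-array dtype: each element is itself a 2x3 block of int32
sub = np.dtype((np.int32, (2, 3)))
print(sub.shape)  # shape of the sub-array
print(sub.base)   # underlying element type
```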

### Common Data Types
- int: Integer types (int8, int16, int32, int64)
- uint: Unsigned integer (uint8, uint16, uint32, uint64)
- float: Floating-point types (float16, float32, float64)
- complex: Complex types (complex64, complex128)
- bool: Boolean type (bool)
- object: Object type (generic Python object)

In [10]:
# Creating array with different data types
arr_int32 = np.array([1, 2, 3], dtype=np.int32)
arr_float64 = np.array([1.0, 2.0, 3.0], dtype=np.float64)
arr_complex = np.array([1 + 2j, 3 + 4j], dtype=np.complex128)

# Checking dtype attribute
print(arr_int32.dtype)
print(arr_float64.dtype)
print(arr_complex.dtype)

int32
float64
complex128


### Multi-dimensional Array
- by passing a nested list or tuple as argument

In [22]:
# Creating a 2-dimensional array with 2 rows and 3 columns
md_array = np.array([[1, 2, 3], [4, 5, 6]])
print(md_array[1, 2])
print(md_array[:])

# Creating a 2-dimensional array with 3 rows and 2 columns
md_array2 = np.array([[1, 2], [3, 4], [5, 6]])
print(f"3 rows & 2 cols:\n {md_array2}")

6
[[1 2 3]
 [4 5 6]]
3 rows & 2 cols:
 [[1 2]
 [3 4]
 [5 6]]


### Random Numbers
- Useful for generating random data for visualization
- Accessed through the `np.random` module, which is available after `import numpy as np`
- Most commonly used functions:
    - `np.random.rand()`: Generates random #s uniformly distributed between 0 and 1
    - `np.random.randn()`: Generates random #s from a standard normal distribution with mean 0 and standard deviation 1
    - `np.random.randint()`: Generates random integers within a specified range (low inclusive, high exclusive)
    - `np.random.choice()`: Generates random samples from a given sequence
    - `np.random.shuffle()`: Shuffles a sequence in place
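
A minimal sketch of the functions listed above. The seed, sizes, and sample values are arbitrary; seeding only makes the run repeatable.

```python
import numpy as np

np.random.seed(42)  # fix the seed so repeated runs give the same values

uniform = np.random.rand(3)             # 3 floats uniformly distributed in [0, 1)
normal = np.random.randn(2, 2)          # 2x2 draws from the standard normal distribution
dice = np.random.randint(1, 7, size=5)  # 5 integers from 1 to 6 (7 is excluded)

colors = np.array(["red", "green", "blue"])
pick = np.random.choice(colors, size=2)  # 2 random samples from the sequence

deck = np.arange(5)
np.random.shuffle(deck)  # shuffles in place and returns None

print(uniform)
print(normal)
print(dice)
print(pick)
print(deck)
```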

### Most Common Numpy Operations:
- Indexing and slicing
- Arithmetic operations
- Aggregate functions
- Broadcasting
- Reshaping and Transposing
- Linear Algebra operations
- Boolean operations
- Sorting
- Masking

In [28]:
# Indexing: Access a specific element in array
print(a) # output: [1 2 3 4 5]
# Return element at index 2
element = a[2]
print(element)

# Slicing: Access a range of elements in array
a_slice = a[1:4]
print(a_slice)

# Modifying elements using indexing
a[2] = 7
print(a) # output: [1 2 7 4 5]

# Modifying elements using slicing
a[1:4] = np.array([6, 8, 9])
print(a) # output: [1 6 8 9 5]

# Modify back to original [1 2 3 4 5]
a[1:4] = np.array([2, 3, 4])
print(a)

[1 2 3 4 5]
3
[2 3 4]
[1 2 7 4 5]
[1 6 8 9 5]
[1 2 3 4 5]


### Arithmetic Operations
- Element-wise arithmetic operations with scalar values (1 and 2)

In [29]:
# Element-wise sum with scalar 1
sum_a = a + 1

# Element-wise product with scalar 2
prod_a = a * 2

# Element-wise division with scalar 2
div_a = a / 2

print(sum_a)
print(prod_a)
print(div_a)

[2 3 4 5 6]
[ 2  4  6  8 10]
[0.5 1.  1.5 2.  2.5]


### Aggregate Functions
- Reduce an array to summary values such as sum, mean, min, & max

In [30]:
# Sum of all elements
a_sum = np.sum(a)

# Mean of all elements
a_mean = np.mean(a)

# Min value
a_min = np.min(a)

# Max value
a_max = np.max(a)

print(a_sum, a_mean, a_min, a_max, sep=',')

15,3.0,1,5
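
On a multi-dimensional array the same aggregate functions accept an `axis` argument. A short sketch, where the array `table` is made up for illustration:

```python
import numpy as np

table = np.array([[1, 2, 3], [4, 5, 6]])

total = np.sum(table)              # all elements
col_sums = np.sum(table, axis=0)   # collapse rows: one sum per column
row_sums = np.sum(table, axis=1)   # collapse columns: one sum per row
col_means = np.mean(table, axis=0)
row_max = np.max(table, axis=1)

print(total)
print(col_sums)
print(row_sums)
print(col_means)
print(row_max)
```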


### Broadcasting
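- Broadcasting lets NumPy apply an operation between arrays of different shapes by stretching the smaller operand; here the scalar 2 is applied to every element, doubling the values in the array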

In [31]:
# Create 2D array
two_d_array = np.array([[1, 2, 3], [4, 5, 6]])

# Multiply array by scalar value
result = two_d_array * 2
print(result)

[[ 2  4  6]
 [ 8 10 12]]
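
Broadcasting also applies between arrays of different shapes, not only array and scalar. A minimal sketch, with `row` and `col` as illustrative names:

```python
import numpy as np

two_d = np.array([[1, 2, 3], [4, 5, 6]])  # shape (2, 3)

# A 1D array of length 3 is stretched across both rows
row = np.array([10, 20, 30])              # shape (3,)
plus_row = two_d + row

# A column of shape (2, 1) is stretched across all three columns
col = np.array([[100], [200]])            # shape (2, 1)
plus_col = two_d + col

print(plus_row)
print(plus_col)
```

Shapes are compared from the trailing dimension; each pair of dimensions must be equal or one of them must be 1.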


### Reshaping and Transposing
- T attribute to transpose matrix, swapping rows & columns
- Use reshape function to reshape transposed matrix

In [34]:
print(two_d_array) # output: [[1 2 3],[4 5 6]]
# Transpose matrix using .T attribute
transposed_matrix = two_d_array.T

# Reshape to 3x2; the transposed matrix is already 3x2, so the result is unchanged
reshaped_matrix = transposed_matrix.reshape(3, 2)

print("Original Matrix: \n", two_d_array)
print("Transposed Matrix: \n", transposed_matrix)
print("Reshaped Matrix: \n", reshaped_matrix)

[[1 2 3]
 [4 5 6]]
Original Matrix: 
 [[1 2 3]
 [4 5 6]]
Transposed Matrix: 
 [[1 4]
 [2 5]
 [3 6]]
Reshaped Matrix: 
 [[1 4]
 [2 5]
 [3 6]]
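
Transposing and reshaping to the same shape are not the same operation. A short sketch of the difference, using the same values as the matrix above:

```python
import numpy as np

matrix = np.array([[1, 2, 3], [4, 5, 6]])

# Transpose swaps rows and columns
transposed = matrix.T

# Reshape keeps the row-major element order and only changes the dimensions
reshaped = matrix.reshape(3, 2)

# -1 lets NumPy infer the length; here it flattens to 1D
flat = matrix.reshape(-1)

print(transposed)
print(reshaped)
print(flat)
```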


### Linear Algebra Operations
- Common operations:
    - Vector (one dimension) Addition
    - Matrix (two dimension) Addition
    - Vector (one dimension) Dot Product
    - Matrix (two dimension) Multiplication
    - Transpose of a Matrix

In [39]:
# Vector addition
c = np.array([1, 2, 3])
d = np.array([4, 5, 6])
e = c + d
print(e)

# Matrix addition
C = np.array([[1, 2], [3, 4]])
D = np.array([[5, 6], [7, 8]])
E = C + D
print(E)

# Vector Dot Product - np.dot() function: sum of the products of corresponding elements
e_dot = np.dot(c, d)
print("Dot function:\n", e_dot)

# Matrix Multiplication - each result entry is the dot product of a row of the first matrix and a column of the second
E_dot = np.dot(C, D)
print("Matrix Dot Product:\n", E_dot)

# Transpose Matrix - swapping its rows and columns
E_transpose = np.transpose(C)
print("Transpose Matrix:\n", E_transpose)


[5 7 9]
[[ 6  8]
 [10 12]]
Dot function:
 32
Matrix Dot Product:
 [[19 22]
 [43 50]]
Transpose Matrix:
 [[1 3]
 [2 4]]
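
The `@` operator and the `np.linalg` module cover the same ground and a little more. A minimal sketch reusing the matrix values above; the vector `b` is made up for illustration.

```python
import numpy as np

C = np.array([[1, 2], [3, 4]])
D = np.array([[5, 6], [7, 8]])

# @ is the matrix multiplication operator; for 2D arrays it matches np.dot
product = C @ D

# Determinant and inverse of a square matrix
det_C = np.linalg.det(C)
inv_C = np.linalg.inv(C)

# Solve the linear system C x = b for x
b = np.array([5, 11])
x = np.linalg.solve(C, b)

print(product)
print(det_C)
print(inv_C)
print(x)
```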


### Boolean Operations
- Element-Wise comparison
- Logical AND
- Logical OR
- Logical NOT
- Array-Wise comparison

In [46]:
# Compare each element of one array with the corresponding element of other array using == operator
cw = np.array([1, 2, 3, 4, 5])
dw = np.array([2, 2, 3, 3, 5])
ew = cw == dw
print(ew)

# Logical AND
a_logical = np.array([True, True, False, False])
b_logical = np.array([True, False, True, False])
c_logical = np.logical_and(a_logical, b_logical)
print(c_logical)

# Logical OR
d_logical = np.logical_or(a_logical, b_logical)
print(d_logical)

# Logical NOT
not_log = np.logical_not(b_logical)
print(not_log)

# Array-Wise - check whether two arrays have the same shape & all elements equal (returns a single bool)
aw = np.array_equal(cw, dw)
print(aw)


[False  True  True False  True]
[ True False False False]
[ True  True  True False]
[False  True False  True]
False
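
The logical functions also have operator forms (`&`, `|`, `~`) that combine comparisons directly. A short sketch using the same `cw` and `dw` values as above:

```python
import numpy as np

cw = np.array([1, 2, 3, 4, 5])
dw = np.array([2, 2, 3, 3, 5])

# Parentheses are required: & and | bind tighter than comparisons
in_range = (cw > 1) & (cw < 5)   # logical AND
at_edges = (cw == 1) | (cw == 5) # logical OR
outside = ~in_range              # logical NOT

# Reduce an element-wise comparison to a single bool
any_equal = np.any(cw == dw)
all_equal = np.all(cw == dw)

print(in_range)
print(at_edges)
print(outside)
print(any_equal, all_equal)
```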


### Sorting

In [47]:
a_sort = np.array([3, 1, 4, 1, 5, 9, 2, 6, 5, 3])
b_sort = np.sort(a_sort)
print(b_sort)

[1 1 2 3 3 4 5 5 6 9]
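
A few common variations on sorting, sketched with the same values. `kind="stable"` is chosen here so that ties keep their original order.

```python
import numpy as np

a_sort = np.array([3, 1, 4, 1, 5, 9, 2, 6, 5, 3])

# Descending order: sort ascending, then reverse
descending = np.sort(a_sort)[::-1]

# argsort returns the indices that would sort the array
order = np.argsort(a_sort, kind="stable")

# Sort each row of a 2D array independently
grid = np.array([[3, 1, 2], [9, 7, 8]])
rows_sorted = np.sort(grid, axis=1)

print(descending)
print(order)
print(a_sort[order])
print(rows_sorted)
```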


### Masking
- A Boolean mask compares each element in the array with a condition. The result is a Boolean array that is True where the condition is satisfied & False where it is not. Index the array with the mask to select only the elements where the mask is True, & store them in a new variable

In [None]:
mask = a_sort > 3 # a_sort = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
b_mask = a_sort[mask]
print(b_mask)

[4 5 9 6 5]
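
Masks can also combine conditions, locate positions, and assign values. A minimal sketch on the same array; the threshold values are arbitrary.

```python
import numpy as np

a_sort = np.array([3, 1, 4, 1, 5, 9, 2, 6, 5, 3])

# Combine conditions: greater than 3 AND even
even_and_big = a_sort[(a_sort > 3) & (a_sort % 2 == 0)]

# np.where with one argument returns the indices where the condition is True
positions = np.where(a_sort > 3)[0]

# np.where with three arguments picks per element: keep the value or use 0
kept_or_zero = np.where(a_sort > 3, a_sort, 0)

# Assign through a mask (on a copy, so the original stays intact)
capped = a_sort.copy()
capped[capped > 5] = 5

print(even_and_big)
print(positions)
print(kept_or_zero)
print(capped)
```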


### Searching and Counting
- Check if a value is present in an array
- argmin() and argmax() - find index of min and max values in array

In [None]:
# Searching
search_arr = np.array([33, 2, 3, 56, 58, 96, 4, 6, 9])
x = 3
print(f"Is {x} in the array?")
print(x in search_arr)
print("The index of the min value:\n", np.argmin(search_arr))
print("The index of the max value:\n", np.argmax(search_arr))

Is 3 in the array?
True
The index of the min value:
 1
The index of the max value:
 5


In [54]:
# Count non-zero elements in 1D array
arr_1d = np.array([0, 5, 0, 8, 0, 3, 7, 0])
nonzero_count_1d = np.count_nonzero(arr_1d)

print("1D Array:\n", arr_1d)
print("Number of non-zero elements:\n", nonzero_count_1d)

# Count non-zero elements in 2D array
arr_2d = np.array([[0, 2, 0, 4], [5, 0, 0, 8], [0, 0, 3, 0]])
nonzero_count_2d = np.count_nonzero(arr_2d)
print("2D Array:")
print(arr_2d)
print("Number of non-zero elements:\n", nonzero_count_2d)

# Unique Element or Value in 1D array
unique_elements_1d = np.unique(arr_1d)
print("Unique elements in 1D:\n", unique_elements_1d)

# Unique Element or Value in 2D array
unique_elements_2d = np.unique(arr_2d)
print("Unique elements in 2D:\n", unique_elements_2d)


1D Array:
 [0 5 0 8 0 3 7 0]
Number of non-zero elements:
 4
2D Array:
[[0 2 0 4]
 [5 0 0 8]
 [0 0 3 0]]
Number of non-zero elements:
 5
Unique elements in 1D:
 [0 3 5 7 8]
Unique elements in 2D:
 [0 2 3 4 5 8]
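
Counting can also be done per value or per axis. A short sketch reusing the arrays above; `target` is an illustrative name.

```python
import numpy as np

arr_1d = np.array([0, 5, 0, 8, 0, 3, 7, 0])
arr_2d = np.array([[0, 2, 0, 4], [5, 0, 0, 8], [0, 0, 3, 0]])

# Unique values together with how often each occurs
values, counts = np.unique(arr_1d, return_counts=True)

# Non-zero counts per column (axis=0) and per row (axis=1)
nonzero_per_col = np.count_nonzero(arr_2d, axis=0)
nonzero_per_row = np.count_nonzero(arr_2d, axis=1)

# Find the index of a specific value
target = 8
index_of_target = np.where(arr_1d == target)[0]

print(values, counts)
print(nonzero_per_col)
print(nonzero_per_row)
print(index_of_target)
```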


### Manipulation Functions
- Help perform changes to array elements
- reshape() - changes dimensions of array without changing its values
- concatenate() - joins arrays along an existing axis (row-wise or column-wise); shapes must match except along the joining axis

In [55]:
arr1 = np.arange(4)
print("Elements of an array1:\n", arr1)
arr2 = np.arange(4, 8)
print("Elements of an array2:\n", arr2)
res1 = arr1.reshape(2, 2)
print("Reshaped array with 2x2 dimensions:\n", res1)
res2 = arr2.reshape(2, 2)
print("Reshaped array with 2x2 dimensions res2:\n", res2)
print("Concatenation two arrays:\n")
concat = np.concatenate((arr1, arr2), axis=0)
print(concat)

Elements of an array1:
 [0 1 2 3]
Elements of an array2:
 [4 5 6 7]
Reshaped array with 2x2 dimensions:
 [[0 1]
 [2 3]]
Reshaped array with 2x2 dimensions res2:
 [[4 5]
 [6 7]]
Concatenation two arrays:

[0 1 2 3 4 5 6 7]
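
The cell above concatenates the 1D arrays. Joining the reshaped 2x2 arrays shows the row-wise vs. column-wise difference; a minimal sketch with the same values:

```python
import numpy as np

res1 = np.arange(4).reshape(2, 2)     # [[0 1], [2 3]]
res2 = np.arange(4, 8).reshape(2, 2)  # [[4 5], [6 7]]

# axis=0 stacks rows: result has shape (4, 2)
row_wise = np.concatenate((res1, res2), axis=0)

# axis=1 stacks columns: result has shape (2, 4)
col_wise = np.concatenate((res1, res2), axis=1)

print(row_wise)
print(col_wise)
```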
