# Probability and Basic Statistics

### Descriptor: Skill Level
1. Able to select an appropriate probability analysis based on the nature of the data and business knowledge.
2. Able to understand and apply advanced probability methods, such as Bayes' theorem, random number generation, the central limit theorem, etc.
3. Able to choose the correct hypothesis testing and confidence interval approaches for a given business problem.
4. Able to define and perform estimation using techniques such as maximum likelihood estimation (MLE) and least squares.


## 1. Appropriate Probability Analysis


## 2. Advanced Probability Methods


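### Bayes' Theorem
Describes the probability of an event given prior knowledge of conditions related to that event:

$$P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}$$

- P(A): prior probability of A, before B is observed
- P(B|A): likelihood of observing B when A holds
- P(B): marginal probability of B (the evidence)
- P(A|B): posterior probability of A, after B is observed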
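
As a worked example of the theorem, the sketch below computes a posterior from a prior and two conditional probabilities. The function name `posterior` and the numbers (a 1% prior, a 95% chance of B when A holds, a 5% chance of B when A does not hold) are illustrative assumptions, not values from this notebook.

```python
def posterior(prior, likelihood, false_positive_rate):
    """Return P(A|B) given P(A), P(B|A) and P(B|not A)."""
    # Law of total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    # Bayes' theorem: P(A|B) = P(B|A)P(A) / P(B)
    return likelihood * prior / evidence


# Hypothetical inputs chosen only for illustration
p = posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
print(f"P(A|B) = {p:.3f}")  # prints "P(A|B) = 0.161"
```

Even with a 95% likelihood, the posterior stays low because the prior is small, which is the usual reason to apply the theorem rather than read P(B|A) as if it were P(A|B).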

### Random Number Generation (RNG)
True randomness is based on naturally occurring entropy, such as atmospheric noise or thermal noise.

Hardware (HRNG)
- Based on a physical or mechanical process: dice rolling, coin flipping, etc.

Pseudo (PRNG)
- Based on an initial value called the seed. The same seed reproduces the same sequence of numbers.

In [29]:
import random
# random.seed(1)  # uncomment to get the same numbers on every run
print("randint ", random.randint(1, 100))
print("randrange ", random.randrange(100))
print("random ", random.random())

randint  58
randrange  60
random  0.651592972722763
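
The effect of the seed can be checked directly. This is a minimal sketch that uses `random.Random` instances so the global generator is left untouched; the helper name `draw` and the seed values are arbitrary choices for illustration.

```python
import random


def draw(seed, n=3):
    """Draw n integers in [1, 100] from a generator initialised with `seed`."""
    rng = random.Random(seed)  # independent generator, global state is not modified
    return [rng.randint(1, 100) for _ in range(n)]


first = draw(seed=1)
second = draw(seed=1)   # same seed, so the same sequence is expected
other = draw(seed=2)    # different seed, generally a different sequence

print("same seed, same numbers:", first == second)
print("seed 1:", first)
print("seed 2:", other)
```

Fixing the seed is useful for reproducible experiments and tests; it should not be relied on where unpredictability matters.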


### Central Limit Theorem
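The CLT states that the distribution of the sample mean $\bar{x}$ is approximately normal, $\bar{x} \sim N(\mu, \frac{\sigma^2}{n})$, that is, with mean $\mu$ and standard deviation $\frac{\sigma}{\sqrt{n}}$, where $\mu$ and $\sigma$ are the population mean and standard deviation and $n$ is the sample size. The approximation improves as the sample size increases.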
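
A small simulation can make the statement concrete. The sketch below assumes an Exponential(1) population (mean 1, standard deviation 1, strongly skewed) and arbitrary choices of `n = 50` and 2000 repeated samples; it compares the spread of the sample means with the $\sigma / \sqrt{n}$ predicted by the CLT.

```python
import random
import statistics

rng = random.Random(0)   # fixed seed so the run is repeatable
n = 50                   # size of each sample
num_samples = 2000       # number of repeated samples

# Population: Exponential(1), which has mean 1 and standard deviation 1
sample_means = [
    statistics.fmean(rng.expovariate(1.0) for _ in range(n))
    for _ in range(num_samples)
]

mean_of_means = statistics.fmean(sample_means)
std_of_means = statistics.stdev(sample_means)
expected_std = 1.0 / n ** 0.5   # sigma / sqrt(n) from the CLT

print(f"mean of sample means: {mean_of_means:.3f} (population mean 1.0)")
print(f"std of sample means:  {std_of_means:.3f} (CLT predicts {expected_std:.3f})")
```

Increasing `n` should shrink the spread of the sample means and bring their histogram closer to a bell shape, even though the underlying population is skewed.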

## 3. Hypothesis Testing and Confidence Interval

Hypothesis testing is implemented in the drift analysis for the Proxy Well Model (PWM), where we check whether data drift and/or model drift has occurred.

In [1]:
from evidently.dashboard import Dashboard
from evidently.dashboard.tabs import DataDriftTab, NumTargetDriftTab, RegressionPerformanceTab
from evidently.pipeline.column_mapping import ColumnMapping
from evidently.model_profile import Profile
from evidently.model_profile.sections import DataDriftProfileSection, NumTargetDriftProfileSection
from evidently.model_profile.sections import RegressionPerformanceProfileSection
import json

In [2]:
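import numpy as np
from sklearn.datasets import fetch_california_housing

# Load the California housing data as a DataFrame and rename the target column
data = fetch_california_housing(as_frame=True)
housing_data = data.frame
housing_data.rename(columns={'MedHouseVal': 'target'}, inplace=True)
# Simulate model predictions by adding Gaussian noise (std 5) to the true target
housing_data['prediction'] = housing_data['target'].values + np.random.normal(0, 5, housing_data.shape[0])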

In [11]:
housing_data

| Unnamed: 0 | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.023810 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.971880 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.802260 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20635 | 1.5603 | 25.0 | 5.045455 | 1.133333 | 845.0 | 2.560606 | 39.48 | -121.09 | 0.781 |
| 20636 | 2.5568 | 18.0 | 6.114035 | 1.315789 | 356.0 | 3.122807 | 39.49 | -121.21 | 0.771 |
| 20637 | 1.7000 | 17.0 | 5.205543 | 1.120092 | 1007.0 | 2.325635 | 39.43 | -121.22 | 0.923 |
| 20638 | 1.8672 | 18.0 | 5.329513 | 1.171920 | 741.0 | 2.123209 | 39.43 | -121.32 | 0.847 |


In [5]:
reference_data = housing_data[:int(0.7*len(housing_data))]
current_data = housing_data[int(0.7*len(housing_data)):]

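# The original cell read the target name from run.data.params["variable"], but `run` is
# not defined in this notebook; the target column was renamed to 'target' above.
column_mapping = ColumnMapping(target='target', prediction='prediction')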

In [10]:
dashboards = dict(
    data_drift=Dashboard(tabs=[DataDriftTab()]),
    target_drift=Dashboard(tabs=[NumTargetDriftTab()]),
    model_drift=Dashboard(tabs=[RegressionPerformanceTab(verbose_level=0)]),
)

profiles = dict(
    data_drift=Profile([DataDriftProfileSection()]),
    target_drift=Profile([NumTargetDriftProfileSection()]),
    model_drift=Profile([RegressionPerformanceProfileSection()]),
)

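# Keep only the analyses whose dashboard can be calculated on this data;
# a failing analysis prints its error message and is skipped.
sections = []
for d, dashboard in dashboards.items():
    try:
        dashboard.calculate(reference_data, current_data, column_mapping=column_mapping)
        sections.append(profiles[d])
    except Exception as e:
        print(e)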

# Run data, target and model drift analysis using Evidently AI
reports = []
for sect in sections:
    sect.calculate(reference_data, current_data, column_mapping=column_mapping)
    report = sect.json()
    temp_json = json.loads(report)
    reports.append(temp_json)

Widget [Regression Model Performance Report.] requires 'target' and 'prediction' columns
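
To connect drift detection back to hypothesis testing and confidence intervals, here is a minimal standard-library sketch of a large-sample two-sample z-test on the mean of one feature, with a confidence interval for the difference. Evidently selects and runs its own statistical tests internally; the block below is not that implementation. The function name `two_sample_z_test`, the synthetic data, and the 0.05 significance level are assumptions for illustration.

```python
import math
import random
import statistics
from statistics import NormalDist


def two_sample_z_test(reference, current, alpha=0.05):
    """Large-sample z-test of H0: both samples come from populations with equal means.

    Returns the z statistic, the two-sided p-value and a (1 - alpha)
    confidence interval for mean(current) - mean(reference).
    """
    diff = statistics.fmean(current) - statistics.fmean(reference)
    # Standard error of the difference between two independent sample means
    se = math.sqrt(
        statistics.variance(reference) / len(reference)
        + statistics.variance(current) / len(current)
    )
    z = diff / se
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    ci = (diff - z_crit * se, diff + z_crit * se)
    return z, p_value, ci


# Synthetic feature values: the "current" batch has its mean shifted by 0.3
rng = random.Random(42)
reference = [rng.gauss(2.0, 1.0) for _ in range(1000)]
shifted = [rng.gauss(2.3, 1.0) for _ in range(1000)]

z, p_value, ci = two_sample_z_test(reference, shifted)
drift_detected = p_value < 0.05

print(f"z = {z:.2f}, p-value = {p_value:.2g}")
print(f"95% CI for the mean shift: ({ci[0]:.3f}, {ci[1]:.3f})")
print("drift detected:", drift_detected)
```

A test on the mean only catches shifts in location. Distribution-level tests are the better fit when the shape of a feature can change while its mean stays the same.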


## 4. Maximum Likelihood Estimation (MLE)
What is MLE? A method of estimating the parameters of a probability distribution by maximizing the likelihood function, so that the observed data are most probable under the assumed model.
Why MLE? To make inferences about the population from a sample.
Where is it implemented?
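
As an illustration of the definition, the sketch below fits a normal distribution by MLE. For a normal model the maximizers have a closed form: the sample mean, and the standard deviation computed with a divisor of n rather than n - 1. The synthetic data (true mean 5, true standard deviation 2) and the helper name `normal_log_likelihood` are assumptions for illustration and are not taken from the PWM project.

```python
import math
import random
import statistics


def normal_log_likelihood(data, mu, sigma):
    """Log-likelihood of i.i.d. data under a Normal(mu, sigma) model."""
    n = len(data)
    return (
        -n * math.log(sigma * math.sqrt(2.0 * math.pi))
        - sum((x - mu) ** 2 for x in data) / (2.0 * sigma ** 2)
    )


# Synthetic sample from Normal(5, 2)
rng = random.Random(7)
data = [rng.gauss(5.0, 2.0) for _ in range(500)]

# Closed-form maximum likelihood estimates for the normal model
mu_hat = statistics.fmean(data)
sigma_hat = statistics.pstdev(data)   # divides by n, which is the MLE

best = normal_log_likelihood(data, mu_hat, sigma_hat)
# Moving either parameter away from its estimate lowers the log-likelihood
worse_mu = normal_log_likelihood(data, mu_hat + 0.1, sigma_hat)
worse_sigma = normal_log_likelihood(data, mu_hat, sigma_hat * 1.1)

print(f"mu_hat = {mu_hat:.3f}, sigma_hat = {sigma_hat:.3f}")
print("log-likelihood at the MLE is the largest:", best > worse_mu and best > worse_sigma)
```

When no closed form exists, the same log-likelihood function would typically be handed to a numerical optimizer instead.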