# DATA 605 Homework 2
Author: Kevin Havis

In [2]:
from scipy.stats import binom, poisson, hypergeom, geom, randint, expon
import math
import matplotlib.pyplot as plt
import numpy as np

## Problem 1

### 1. Bayesian

A new credit scoring system has been developed to predict the likelihood of loan defaults. The system has a 90% sensitivity, meaning that it correctly identifies 90% of those who will default on their loans. It also has a 95% specificity, meaning that it correctly identifies 95% of those who will not default. The default rate among borrowers is 2%.

Given these prevalence, sensitivity, and specificity estimates, what is the probability that a borrower flagged by the system as likely to default will actually default? If the average loss per defaulted loan is $200,000 and the cost to run the credit scoring test on each borrower is $500, what is the total first-year cost for evaluating 10,000 borrowers?

The probability that a borrower flagged by the system actually defaults, $P(D^+|T^+)$, is the *Positive Predictive Value* (PPV), which is commonly written as:

$$
PPV = P(D^+|T^+) = \frac{\text{prevalence} \times \text{sensitivity}}{(\text{prevalence} \times \text{sensitivity}) + [(1-\text{prevalence}) \times (1-\text{specificity})]}
$$

Given we know the prevalence, sensitivity, and specificity, we can calculate this probability to be approximately 0.2687.


In [3]:
def calculate_ppv(prior=0.02, sensitivity=0.9, specificity=0.95):
    """Bayesian calculation of positive predictive value given prior/prevalence, sensitivity, and specificity"""

    return (prior * sensitivity) / (
        (prior * sensitivity) + ((1 - prior) * (1 - specificity))
    )


calculate_ppv()

0.2686567164179103
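
As a sanity check on the formula, the same figure can be recovered by counting borrowers directly. The sketch below assumes a hypothetical cohort of 10,000 borrowers (the cohort size cancels out, so any size gives the same ratio); the variable names are illustrative and not part of the original notebook.

```python
# Natural-frequency check of the PPV; the cohort size is an illustrative assumption
cohort = 10_000
prior, sensitivity, specificity = 0.02, 0.90, 0.95

defaulters = cohort * prior                            # borrowers who will default
non_defaulters = cohort - defaulters                   # borrowers who will not
true_positives = defaulters * sensitivity              # defaulters correctly flagged
false_positives = non_defaulters * (1 - specificity)   # non-defaulters wrongly flagged

ppv = true_positives / (true_positives + false_positives)
print(f"Flagged borrowers: {true_positives + false_positives:.0f}")
print(f"PPV: {ppv:.4f}")
```

Roughly 180 of the 670 flagged borrowers are true defaulters. The PPV is low despite the high sensitivity because, at a 2% base rate, false positives outnumber true positives.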

To further understand the cost of running this system, and assuming no negative impact from false positives, we can use the sensitivity (how often the system correctly flags a borrower who will default) to calculate the expected value per borrower, $E(x)$, of running this system as:

$$
E(x) = (\$200,000 * prior * sensitivity) - \$500
$$

In this way, we are framing saving the value of the defaulted loans as earning the same value, times the probability of correctly detecting a potential default (sensitivity times the prior), less the cost of running the check.

We will compute this for the expected number of borrowers, $n$, to evaluate in a given year.

In [4]:
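def calculate_system_cost(
    prior=0.02, sensitivity=0.9, value=200000, overhead=500, n=10000
):
    """Calculate the expected net value of screening n borrowers.

    Each borrower contributes value * prior * sensitivity (the expected loss
    avoided by correctly flagging a default) minus the per-test overhead.
    """
    return ((value * prior * sensitivity) - overhead) * n


calculate_system_cost()

31000000.0

With an expected net value of \$31,000,000 over the first year for the 10,000 borrowers evaluated (after \$5,000,000 in testing costs), we would certainly want to implement this system.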

### 2. Binomial

The probability that a stock will pay a dividend in any given quarter is 0.7. What is the probability that the stock pays dividends exactly 6 times in 8 quarters? What is the probability that it pays dividends 6 or more times? What is the probability that it pays dividends fewer than 6 times? What is the expected number of dividend payments over 8 quarters? What is the standard deviation?

The probability of a stock paying dividends in exactly six quarters, $P(k)$, out of eight total quarters, $n$, can be calculated using the binomial distribution.

$$
P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}
$$

In [5]:
def stock_dividends(p=0.7, k=6, n=8):
    return (
        ((math.factorial(n)) / (math.factorial(n - k) * math.factorial(k)))
        * (p**k)
        * ((1 - p) ** (n - k))
    )


stock_dividends()

0.29647547999999996

We can see the probability of getting dividends in exactly six of the eight quarters is 0.2965.

We will use the `scipy` implementation of binomial probabilities for the additional questions.

In [6]:
def more_stock_dividends(p=0.7, n=8, k=6):
    binom_dist = binom(n=n, p=p)

    # 6 or more times: P(X >= k) = P(X > k - 1), via the survival function
    prob_6_or_more = binom_dist.sf(k - 1)
    print(f"Probability of {k} or more dividends in {n} quarters: {prob_6_or_more:.4f}")

    # Fewer than 6: P(X < k) = P(X <= k - 1), via the CDF
    prob_less_6 = binom_dist.cdf(k - 1)
    print(f"Probability of fewer than {k} dividends in {n} quarters: {prob_less_6:.4f}")

    # Mean dividends in eight quarters
    prob_mean = binom_dist.mean()
    print(f"Expected (mean) number of dividends in {n} quarters: {prob_mean}")

    # Standard deviation
    prob_stddev = binom_dist.std()
    print(f"Standard deviation of dividends in {n} quarters: {prob_stddev}")


more_stock_dividends()


Probability of 6 or more dividends in 8 quarters: 0.5518
Probability of fewer than 6 dividends in 8 quarters: 0.4482
Expected (mean) number of dividends in 8 quarters: 5.6
Standard deviation of dividends in 8 quarters: 1.2961481396815722
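
A quick cross-check on the tail probabilities, sketched with only the standard library: the events "six or more" and "fewer than six" are complementary, so their probabilities must sum to 1. The helper name `binom_pmf` is an illustrative choice.

```python
from math import comb


def binom_pmf(k, n=8, p=0.7):
    """Binomial probability of exactly k successes in n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# range() excludes its stop value, so the upper tail must run to n + 1
p_six_or_more = sum(binom_pmf(k) for k in range(6, 8 + 1))
p_fewer_than_six = sum(binom_pmf(k) for k in range(0, 6))

print(f"P(X >= 6) = {p_six_or_more:.4f}")
print(f"P(X < 6)  = {p_fewer_than_six:.4f}")
print(f"Sum       = {p_six_or_more + p_fewer_than_six:.4f}")
```

If the two tails do not sum to 1, an endpoint has been dropped from one of the ranges.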


### 3. Poisson

A financial analyst notices that there are an average of 12 trading days each month when a certain stock’s price increases by more than 2%. What is the probability that exactly 4 such days occur in a given month? What is the probability that more than 12 such days occur in a given month? How many such days would you expect in a 6-month period? What is the standard deviation of the number of such days? If an investment strategy requires at least 70 days of such price increases in a year for profitability, what is the percent utilization and what are your recommendations?

Our first instinct when considering counts of events over time should be the Poisson distribution. The Poisson distribution arises as a limiting case of the binomial distribution (many trials, each with a small probability of success) and gives the probability that a random count of events $X$ equals a specific value $k$ within a fixed time period $t$.

The Poisson probability function allows us to simplify the model such that we can concern ourselves only with the average number of events in the given time period, $\lambda$, and the number of events whose probability we want, $k$.

$$
P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}
$$
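
To make the formula concrete, it can be evaluated directly with the standard library. This is only a sketch, with $\lambda = 12$ taken from the problem statement; the function name `poisson_pmf` is an illustrative choice.

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with mean lam."""
    return lam**k * math.exp(-lam) / math.factorial(k)


lam = 12  # average qualifying days per month, from the problem statement
p_exactly_4 = poisson_pmf(4, lam)
# P(X > 12) is the complement of P(X <= 12)
p_more_than_12 = 1 - sum(poisson_pmf(k, lam) for k in range(0, 13))

print(f"P(X = 4)  = {p_exactly_4:.4f}")
print(f"P(X > 12) = {p_more_than_12:.4f}")
```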

In [7]:
def price_fluctuations(l=12):
    # l is lambda, the average number of qualifying days per month
    print(f"Lambda (average events per month): {l}")

    # Monthly distribution
    poisson_dist = poisson(mu=l)

    # Exactly four days in a month
    prob_exactly_4 = poisson_dist.pmf(4)
    print(f"Probability of exactly 4 price fluctuations: {prob_exactly_4:.4f}")

    # More than twelve days in a month: P(X > 12) = 1 - P(X <= 12)
    prob_more_than_12 = 1 - poisson_dist.cdf(12)
    print(f"Probability of more than 12 price fluctuations: {prob_more_than_12:.4f}")

    # Expected count over six months
    days_with_increase = l * 6
    print(f"Expected number of days with an increase in 6 months: {days_with_increase}")

    # Standard deviation of the monthly count, sqrt(lambda)
    std_dev = poisson_dist.std()
    print(f"Standard deviation of monthly price fluctuations: {std_dev:.4f}")

    # At least 70 days per year: rescale lambda to a 12-month window
    n_days = 70
    annual_dist = poisson(mu=l * 12)
    prob_at_least_70_per_year = annual_dist.sf(n_days - 1)  # P(X >= 70)
    print(
        f"Probability of at least {n_days} price fluctuations per year: {prob_at_least_70_per_year:.4f}"
    )


price_fluctuations()

Lambda (average events per month): 12
Probability of exactly 4 price fluctuations: 0.0053
Probability of more than 12 price fluctuations: 0.4240
Expected number of days with an increase in 6 months: 72
Standard deviation of monthly price fluctuations: 3.4641
Probability of at least 70 price fluctuations per year: 1.0000


### 4. Hypergeometric
A hedge fund has a portfolio of 25 stocks, with 15 categorized as high-risk and 10 as low-risk. The fund manager randomly selects 7 stocks to closely monitor. If the manager selected 5 high-risk stocks and 2 low-risk stocks, what is the probability of selecting exactly 5 high-risk stocks if the selection was random? How many high-risk and low-risk stocks would you expect to be selected?

We can use `scipy.stats` to construct our hypergeometric distribution, which has parameters $M$ for total choices, $n$ for "success" choices (high-risk stocks in this case), and $N$ for the number we are randomly choosing from $M$ *without replacement*.

We will use the probability mass function to determine the probability of choosing exactly five high-risk stocks, as well as the expected number of high-risk stocks, given seven random selections.

In [8]:
def high_risk_stocks(x, M=25, n=15, N=7):
    hyper = hypergeom(M, n, N)
    p = hyper.pmf(x)
    print(f"Probability of selecting {x} high risk stocks: {p}")
    print(f"Mean high risk stocks: {hyper.mean()}")


# Probability of selecting exactly 5
high_risk_stocks(5)

Probability of selecting 5 high risk stocks: 0.2811212814645309
Mean high risk stocks: 4.2


### 5. Geometric

The probability that a bond defaults in any given year is 0.5%. A portfolio manager holds this bond for 10 years. What is the probability that the bond will default during this period? What is the probability that it will default in the next 15 years? What is the expected number of years before the bond defaults? If the bond has already survived 10 years, what is the probability that it will default in the next 2 years?

We will cynically frame this problem as a default being a success, and will build our geometric distribution accordingly. Since we want the probability of default as time goes on, we will use the cumulative distribution function of the geometric distribution. This tells us the probability of default *within* the given timeframe.

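The last part of the question asks for the probability of default in the 11th or 12th year *given* that ten years have passed without a default. This is a conditional probability, so the difference in the CDF between the 12th and the 10th year must be divided by the probability of surviving the first ten years. Because the geometric distribution is memoryless, the result equals the unconditional probability of default within any two-year window.

In [9]:
def bond_default(p=0.005, t_1=10, t_2=15, t_3=12):
    geometric = geom(p)
    p_t = geometric.cdf(t_1)
    print(f"Probability of default in {t_1} years : {p_t}")

    p_t2 = geometric.cdf(t_2)
    print(f"Probability of default in {t_2} years : {p_t2}")

    # Expected number of years until default, 1 / p
    print(f"Expected years until default : {geometric.mean():.1f}")

    # P(default by year t_3 | survived t_1 years)
    #   = [F(t_3) - F(t_1)] / [1 - F(t_1)]
    p_t3 = (geometric.cdf(t_3) - geometric.cdf(t_1)) / geometric.sf(t_1)
    print(f"Probability of default in next two years given ten have passed: {p_t3:.6f}")


bond_default()


Probability of default in 10 years : 0.04888986953422811
Probability of default in 15 years : 0.0724310311816721
Expected years until default : 200.0
Probability of default in next two years given ten have passed: 0.009975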
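
The memoryless property of the geometric distribution offers a useful check on any conditional calculation of this kind. The sketch below, using only the standard library and an illustrative helper name, compares the conditional probability of default in years 11 and 12, given survival through year 10, with the unconditional probability of default within two years.

```python
def geom_cdf(t, p=0.005):
    """P(first default occurs in year t or earlier)."""
    return 1 - (1 - p) ** t


# Conditional: P(10 < X <= 12) / P(X > 10)
conditional = (geom_cdf(12) - geom_cdf(10)) / (1 - geom_cdf(10))
# Unconditional two-year default probability
fresh_two_years = geom_cdf(2)

print(f"P(default in years 11-12 | survived 10) = {conditional:.6f}")
print(f"P(default within 2 years)               = {fresh_two_years:.6f}")
```

The two values agree, which is what memorylessness predicts: having survived ten years tells us nothing about the next two.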


### 6. Poisson

A high-frequency trading algorithm experiences a system failure about once every 1500 trading hours. What is the probability that the algorithm will experience more than two failures in 1500 hours? What is the expected number of failures?

Using the Poisson distribution, we define its single parameter $\lambda$ as the expected number of failures in the window of interest. The failure rate is $1/1500$ per hour, so over a 1500-hour window $\lambda = 1500 \times \frac{1}{1500} = 1$, and we construct the distribution from there.

We can determine the expected number of failures, and the probability of exceeding two, using the mean and the cumulative distribution function respectively.

In [10]:
def algo(k=1, t=1500, window=1500):
    # Expected failures in the window: (k / t) failures per hour * window hours
    l = k * window / t

    poisson_dist = poisson(l)

    print(
        f"Probability of more than two failures in {window} hours: {1 - poisson_dist.cdf(2):.4f}"
    )
    print(f"Expected number of failures in {window} hours: {poisson_dist.mean()}")


algo()

Probability of more than two failures in 1500 hours: 0.0803
Expected number of failures in 1500 hours: 1.0


### 7. Uniform Distribution
An investor is trying to time the market and is monitoring a stock that they believe has an equal chance of reaching a target price between 20 and 60 days. What is the probability that the stock will reach the target price in more than 40 days? If it hasn’t reached the target price by day 40, what is the probability that it will reach it in the next 10 days? What is the expected time for the stock to reach the target price?

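If the stock has an equal chance of reaching the target price at any time in the range, we can model the time to target with a continuous uniform distribution on $[20, 60]$. Using `scipy.stats.uniform`, we set `loc` to the lower bound (20) and `scale` to the width of the range (40). Note that `scipy.stats.randint` models a discrete uniform and excludes its upper bound, so `randint(20, 60)` would omit day 60.

We then examine the distribution to determine the probabilities of observing the target price within that time frame. The second probability is conditional on the target not having been reached by day 40, so we divide by $P(X > 40)$.

In [11]:
from scipy.stats import uniform


def target_price(low=20, high=60):
    # Continuous uniform on [low, high]; scale is the width of the interval
    uni = uniform(loc=low, scale=high - low)

    print(f"Probability of target price in 40+ days : {uni.sf(40)}")
    # P(40 < X <= 50 | X > 40) = [F(50) - F(40)] / [1 - F(40)]
    print(
        f"Probability of target price in next 10 days, given 40 have passed : {(uni.cdf(50) - uni.cdf(40)) / uni.sf(40)}"
    )
    print(f"Mean number of days to hit target price : {uni.mean()}")


target_price()


Probability of target price in 40+ days : 0.5
Probability of target price in next 10 days, given 40 have passed : 0.5
Mean number of days to hit target price : 40.0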


### 8. Exponential Distribution
A financial model estimates that the lifetime of a successful start-up before it either goes public or fails follows an exponential distribution with an expected value of 8 years. What is the expected time until the start-up either goes public or fails? What is the standard deviation? What is the probability that the start-up will go public or fail after 6 years? Given that the start-up has survived for 6 years, what is the probability that it will go public or fail in the next 2 years?

Considering an exponential distribution only has one parameter, the rate of an event $\lambda$, we simply need to calculate it and construct our distribution. We can then calculate the probabilities of all the described events occurring.

In [12]:
def start_up(mu=8):
    l = 1 / mu  # rate parameter lambda

    # scipy parameterizes the exponential by scale = 1 / lambda
    expo = expon(scale=(1 / l))

    print(f"Expected time for a startup to go public or fail: {expo.mean()} years")
    print(f"Expected standard deviation: {expo.std()} years")
    # P(X > 6) is the survival function, not the density at 6
    print(f"Probability of going public or failing after 6 years: {expo.sf(6):.4f}")
    # P(X <= 8 | X > 6) = [F(8) - F(6)] / [1 - F(6)]
    print(
        f"Probability of going public or failing in next two years after 6: {(expo.cdf(8) - expo.cdf(6)) / expo.sf(6):.4f}"
    )


start_up()


Expected time for a startup to go public or fail: 8.0 years
Expected standard deviation: 8.0 years
Probability of going public or failing after 6 years: 0.4724
Probability of going public or failing in next two years after 6: 0.2212


## Problem 2

### 1. Product Selection
A company produces 5 different types of green pens and 7 different types of red pens. The marketing team needs to create a new promotional package that includes 5 pens. How many different ways can the package be created if it contains fewer than 2 green pens?

If we are required to build a package that contains *fewer than two* green pens, that means we can have a maximum of one green pen in a package. This leaves us two cases: a package of only red pens, or a package of one green pen and four red pens.

Assuming that each pen type is distinct, we must now count the *possibilities* in each of our two cases. We can do this using the binomial coefficient formula:

$$\binom{n}{k} = \frac{n!}{k!(n-k)!}$$

In [13]:
def binom_coeff(n, k):
    return math.factorial(n) / (math.factorial(k) * math.factorial(n - k))


def pen_packing():
    # Case 1 One green pen
    green_1 = binom_coeff(5, 1)
    red_1 = binom_coeff(7, 4)
    total_1 = green_1 * red_1

    # Case 2 No green pen
    red_2 = binom_coeff(7, 5)
    total_2 = red_2

    total_possibilities = total_1 + total_2

    print(f"Total possible combinations of red and green pens: {total_possibilities}")


pen_packing()


Total possible combinations of red and green pens: 196.0


### 2. Team Formation for a Project

A project committee is being formed within a company that includes 14 senior managers and 13 junior managers. How many ways can a project team of 5 members be formed if at least 4 of the members must be junior managers?

Similar to the last problem, we can use the binomial coefficient to calculate this. We have the same number of cases in this problem; given at least 4 members must be junior, we either have one or zero senior managers on the 5 member committee.

In [14]:
def project_committee():
    seniors = 14
    juniors = 13

    # Case 1 One senior
    senior_1 = binom_coeff(seniors, 1)
    junior_1 = binom_coeff(juniors, 4)
    total_1 = senior_1 * junior_1

    # Case 2 No senior
    junior_2 = binom_coeff(juniors, 5)
    total_2 = junior_2

    total_possibilities = total_1 + total_2

    print(f"Total possible combinations of committee members: {total_possibilities}")


project_committee()


Total possible combinations of committee members: 11297.0


### 3. Marketing Campaign Outcomes

A marketing campaign involves three stages: first, a customer is sent 5 email offers; second, the customer is targeted with 2 different online ads; and third, the customer is presented with 3 personalized product recommendations. If the email offers, online ads, and product recommendations are selected randomly, how many different possible outcomes are there for the entire campaign?

This is an application of the multiplication principle (rule of product) rather than a permutation: we multiply the number of options at each stage. In this case, $5 \times 2 \times 3 = 30$, meaning we have 30 potential outcomes for a customer through the entire campaign.
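
The count can be confirmed by brute-force enumeration. The sketch below uses placeholder labels for the offers, ads, and recommendations (the labels are invented for illustration) and assumes one option is chosen at each stage.

```python
from itertools import product

# Placeholder labels; only the number of options per stage matters
emails = [f"email_{i}" for i in range(1, 6)]          # 5 email offers
ads = [f"ad_{i}" for i in range(1, 3)]                # 2 online ads
recommendations = [f"rec_{i}" for i in range(1, 4)]   # 3 recommendations

# Every (email, ad, recommendation) triple is one campaign outcome
outcomes = list(product(emails, ads, recommendations))
print(len(outcomes))  # 5 * 2 * 3 = 30
```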

### 4. Product Defect Probability

A quality control team draws 3 products from a batch of size N without replacement. What is the probability that at least one of the products drawn is defective if the defect rate is known to be consistent? 

Given we are sampling without replacement, we cannot use the binomial distribution to calculate this. We must instead use the hypergeometric distribution, which has parameters:

-  $N$, number of products in the batch
-  $k$ number of total defects expected in the batch
-  $n$ number of products we sample
-  $x$ the number of defects in our sample

Note that we could calculate $k = p \times N$ as we know the consistent rate $p$, but we will leave as $k$ for simpler expression.

In this scenario, $n=3$ and $x\ge 1$, with $N$ and $k$ left symbolic. Using these as parameters to the generic hypergeometric formula, we get:


$$
H(x \ge 1: N, 3, k) = H(x=1: N, 3, k) + H(x=2: N, 3, k) + H(x=3: N, 3, k) = 1 - H(x=0: N, 3, k)
$$

where each

$$
H(x: N, n, k) = \frac{\binom{k}{x}\binom{N-k}{n-x}}{\binom{N}{n}}
$$
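
For a concrete feel, the expression can be evaluated for specific values. The batch size $N = 100$ and defect count $k = 5$ below are assumed purely for illustration, since the problem leaves both symbolic; the function name is likewise hypothetical.

```python
from math import comb


def prob_at_least_one_defect(N, k, n=3):
    """P(x >= 1) when drawing n items without replacement from N, of which k are defective."""
    # Complement of drawing zero defectives: C(N - k, n) / C(N, n)
    return 1 - comb(N - k, n) / comb(N, n)


# Illustrative values only; the problem does not specify N or k
p = prob_at_least_one_defect(N=100, k=5)
print(f"P(at least one defective) = {p:.4f}")
```

Computing the complement needs one term instead of three, and it agrees with summing the cases $x = 1, 2, 3$.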

### 5. Business Strategy Choices

A business strategist is choosing potential projects to invest in, focusing on 17 high-risk, high-reward projects and 14 low-risk, steady-return projects.

#### Step 1: How many different combinations of 5 projects can the strategist select? 

We can use our earlier binomial coefficient function to calculate the combination of 5 projects given 17 high risk and 14 low risk.

With a portfolio of five projects, we have six cases to calculate combinations for: five high risk and no low risk, four high risk and one low, and so on down to no high risk and five low.

Thus we will define $k$ as the number of high risk selected, and $n-k$ as the number of low risk. We will sum the combinations over every value of $k$ from 5 down to 0. As a check, the total must equal $\binom{31}{5} = 169911$, the number of ways to choose any 5 of the 31 projects.

In [None]:
def portfolio(n=5, k=5, high=17, low=14):
    """Count n-project selections containing at most k high-risk projects."""
    outcomes = []

    # Loop over k, k-1, ..., 0 high-risk projects; the rest are low risk
    for high_count in range(k, -1, -1):
        outcomes.append(
            binom_coeff(high, high_count)  # High risk
            * binom_coeff(low, (n - high_count))  # Low risk
        )

    print(f"Total possible outcomes: {sum(outcomes)}")


portfolio()

Total possible outcomes: 169911.0


#### Step 2: How many different combinations of 5 projects can the strategist select if they want at least one low-risk project? 

We can solve this the same way as before, but we exclude the case of five high-risk projects, since we always want at least one low-risk project. Capping $k$ at 4 leaves five cases ($k = 4, 3, 2, 1, 0$).

In [21]:
portfolio(k=4)

Total possible outcomes: 163723.0


### 6. Event Scheduling

A business conference needs to schedule 9 different keynote sessions from three different industries: technology, finance, and healthcare. There are 4 potential technology sessions, 104 finance sessions, and 17 healthcare sessions to choose from. How many different schedules can be made? Express your answer in scientific notation rounding to the hundredths place.

### 7. Book Selection for Corporate Training

An HR manager needs to create a reading list for a corporate leadership training program, which includes 13 books in total. The books are categorized into 6 novels, 6 business case studies, 7 leadership theory books, and 5 strategy books.

#### Step 1: If the manager wants to include no more than 4 strategy books, how many different reading schedules are possible? Express your answer in scientific notation rounding to the hundredths place. 

#### Step 2: If the manager wants to include all 6 business case studies, how many different reading schedules are possible? Express your answer in scientific notation rounding to the hundredths place. 

### 8. Product Arrangement

A retailer is arranging 10 products on a display shelf. There are 5 different electronic gadgets and 5 different accessories. What is the probability that all the gadgets are placed together and all the accessories are placed together on the shelf? Express your answer as a fraction or a decimal number rounded to four decimal places.
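
One way to approach this, sketched under the assumption that all $10!$ orderings of the products are equally likely: treat the gadgets as one block and the accessories as another, giving 2 block orders, with $5!$ internal orderings for each block.

```python
from fractions import Fraction
from math import factorial

# Favorable: 2 block orders * 5! gadget orders * 5! accessory orders
favorable = 2 * factorial(5) * factorial(5)
total = factorial(10)  # all orderings of 10 distinct products

probability = Fraction(favorable, total)
print(probability, f"= {float(probability):.4f}")
```

Under these assumptions the probability reduces to $1/126$, or about 0.0079.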

### 9. Expected Value of a Business Deal

A company is evaluating a deal where they either gain $4 for every successful contract or lose $16 for every unsuccessful contract. A "successful" contract is defined as drawing a queen or lower from a standard deck of cards. (Aces are considered the highest card in the deck.)

#### Step 1: Find the expected value of the deal. Round your answer to two decimal places. Losses must be expressed as negative values. 

#### Step 2: If the company enters into this deal 833 times, how much would they expect to win or lose? Round your answer to two decimal places. Losses must be expressed as negative values. 
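
A sketch of the calculation, assuming a single card is drawn per contract and that, with aces high, "queen or lower" covers the 11 ranks from 2 through queen (44 of the 52 cards). Variable names are illustrative.

```python
from fractions import Fraction

winning_cards = 11 * 4              # ranks 2..Q in four suits (assumes aces high)
p_win = Fraction(winning_cards, 52)
p_lose = 1 - p_win                  # kings and aces: 8 cards

gain, loss = 4, -16
expected_value = gain * p_win + loss * p_lose

print(f"Expected value per deal: {float(expected_value):.2f}")
print(f"Expected total over 833 deals: {float(833 * expected_value):.2f}")
```

Under these assumptions the expected value is $12/13 \approx 0.92$ per deal, or about 768.92 over 833 deals.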

## Problem 3

### 1. Supply Chain Risk Assessment

Let $X_1, X_2, ..., X_n$ represent the lead times (in days) for the delivery of key components from $n=5$ different suppliers. Each lead time is uniformly distributed across a range of 1 to $k=20$ days, reflecting the uncertainty in delivery times. Let $Y$ denote the minimum delivery time among all suppliers. Understanding the distribution of $Y$ is crucial for assessing the earliest possible time you can begin production. Determine the distribution of $Y$ to better manage your supply chain and minimize downtime.
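
One possible approach, sketched under the assumptions that the lead times are independent and take integer values from 1 to $k$: since $P(Y \ge y) = \left(\frac{k - y + 1}{k}\right)^n$, the probability mass function is $P(Y = y) = \frac{(k-y+1)^n - (k-y)^n}{k^n}$. The helper name `min_pmf` is illustrative.

```python
from fractions import Fraction


def min_pmf(y, n=5, k=20):
    """P(Y = y), where Y is the minimum of n independent uniforms on 1..k."""
    # P(Y >= y) - P(Y >= y + 1), written over the common denominator k^n
    return Fraction((k - y + 1) ** n - (k - y) ** n, k**n)


pmf = {y: min_pmf(y) for y in range(1, 21)}
print(f"P(Y = 1) = {float(pmf[1]):.4f}")
print(f"Total probability = {sum(pmf.values())}")
```

The terms telescope, so the probabilities sum to exactly 1, and most of the mass sits at small values of $y$.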

### 2. Maintenance Planning for Critical Equipment

Your organization owns a critical piece of equipment, such as a high-capacity photocopier (for a law firm) or an MRI machine (for a healthcare provider). The manufacturer estimates the expected lifetime of this equipment to be 8 years, meaning that, on average, you expect one failure every 8 years. It's essential to understand the likelihood of failure over time to plan for maintenance and replacements.

#### a. Geometric Model

Calculate the probability that the machine will not fail for the first 6 years. Also, provide the expected value and standard deviation. This model assumes each year the machine either fails or does not, independently of previous years.

#### b. Exponential Model

Calculate the probability that the machine will not fail for the first 6 years. Provide the expected value and standard deviation, modeling the time to failure as a continuous process.

#### c. Binomial Model

Calculate the probability that the machine will not fail during the first 6 years, given that it is expected to fail once every 8 years. Provide the expected value and standard deviation, assuming a fixed number of trials (years) with a constant failure probability each year.

#### d. Poisson Model

Calculate the probability that the machine will not fail during the first 6 years, modeling the failure events as a Poisson process. Provide the expected value and standard deviation.
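
The four models can be compared side by side. The sketch below assumes an annual failure probability of $p = 1/8$ for the discrete models and a rate of $\lambda = 1/8$ failures per year for the continuous ones; it uses closed-form expressions from the standard library rather than `scipy`, and the dictionary layout is an illustrative choice.

```python
import math

p = 1 / 8      # assumed annual failure probability (discrete models)
rate = 1 / 8   # assumed failures per year (continuous models)
years = 6

models = {
    # name: (P(no failure in 6 years), expected value, standard deviation)
    "geometric": ((1 - p) ** years, 1 / p, math.sqrt(1 - p) / p),
    "exponential": (math.exp(-rate * years), 1 / rate, 1 / rate),
    "binomial": ((1 - p) ** years, years * p, math.sqrt(years * p * (1 - p))),
    "poisson": (math.exp(-rate * years), rate * years, math.sqrt(rate * years)),
}

for name, (p_survive, mean, sd) in models.items():
    print(f"{name:<12} P(no failure) = {p_survive:.4f}  mean = {mean:.4f}  sd = {sd:.4f}")
```

Note the different units: for the geometric and exponential models the mean and standard deviation describe the time to failure in years, while for the binomial and Poisson models they describe the number of failures within the 6-year window.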

## Problem 4

### 1. Scenario

You are managing two independent servers in a data center. The time until the next failure for each server follows an exponential distribution with different rates:

- Server A has a failure rate of $\lambda_A = 0.5$ failures per hour
- Server B has a failure rate of $\lambda_B = 0.3$ failures per hour

**Question**: What is the distribution of the total time until both servers have failed at least once? Use the moment generating function (MGF) to find the distribution of the sum of the times to failure.
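
A sketch of the MGF argument, taking the question at its word and treating the quantity of interest as the sum $T = X_A + X_B$ of the two independent failure times (the symbol $T$ is introduced here for illustration):

$$
M_{X_A}(t) = \frac{\lambda_A}{\lambda_A - t}, \qquad M_{X_B}(t) = \frac{\lambda_B}{\lambda_B - t}
$$

$$
M_T(t) = M_{X_A}(t)\,M_{X_B}(t) = \frac{0.5}{0.5 - t}\cdot\frac{0.3}{0.3 - t}, \qquad t < 0.3
$$

Because the two rates differ, this is not the MGF of a gamma distribution. It is the MGF of a hypoexponential distribution, with density

$$
f_T(x) = \frac{\lambda_A \lambda_B}{\lambda_A - \lambda_B}\left(e^{-\lambda_B x} - e^{-\lambda_A x}\right) = 0.75\left(e^{-0.3x} - e^{-0.5x}\right), \qquad x \ge 0
$$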

### 2. Sum of Independent Normally Distributed Random Variables

An investment firm is analyzing the returns of two independent assets, Asset X and Asset Y. The returns on these assets are normally distributed:

- Asset X: $X \sim N(\mu_X = 5\%, \sigma^2_X = 4\%)$
- Asset Y: $Y \sim N(\mu_Y = 7\%, \sigma^2_Y = 9\%)$

**Question**: Find the distribution of the combined return of the portfolio consisting of these two assets using the moment generating function (MGF).
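
A sketch of the MGF argument, assuming the combined return is the simple sum $Z = X + Y$ (the symbol $Z$ is introduced here) and working in percentage units:

$$
M_X(t) = e^{5t + \frac{4t^2}{2}}, \qquad M_Y(t) = e^{7t + \frac{9t^2}{2}}
$$

$$
M_Z(t) = M_X(t)\,M_Y(t) = e^{12t + \frac{13t^2}{2}}
$$

This is the MGF of a normal distribution, so under these assumptions $Z \sim N(\mu_Z = 12\%, \sigma^2_Z = 13\%)$.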


### 3. Scenario

A call center receives calls independently from two different regions. The number of calls received from Region A and Region B in an hour follows a Poisson distribution:

- Region A: $X_A \sim Poisson(\lambda_A = 3)$
- Region B: $X_B \sim Poisson(\lambda_B = 5)$

**Question**: Find the distribution of the total number of calls received from both regions in an hour using the moment generating function (MGF).
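
A sketch of the MGF argument for the total number of calls $X_A + X_B$, using the independence of the two regions:

$$
M_{X_A}(t) = e^{3(e^t - 1)}, \qquad M_{X_B}(t) = e^{5(e^t - 1)}
$$

$$
M_{X_A + X_B}(t) = M_{X_A}(t)\,M_{X_B}(t) = e^{8(e^t - 1)}
$$

This is the MGF of a Poisson distribution with $\lambda = 8$, so $X_A + X_B \sim Poisson(8)$.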


## Problem 5

### 1. Customer Retention and Churn Analysis

Scenario: A telecommunications company wants to model the behavior of its customers regarding their likelihood to stay with the company (retention) or leave for a competitor (churn). The company segments its customers into three states:

- State 1: Active customers who are satisfied and likely to stay (Retention state).
- State 2: Customers who are considering leaving (At-risk state).
- State 3: Customers who have left (Churn state).

The company has historical data showing the following monthly transition probabilities:

- From State 1 (Retention): 80% stay in State 1, 15% move to State 2, and 5% move to State 3.
- From State 2 (At-risk): 30% return to State 1, 50% stay in State 2, and 20% move to State 3.
- From State 3 (Churn): 100% stay in State 3.

The company wants to analyze the long-term behavior of its customer base.

**Question**:
- (a) Construct the transition matrix for this Markov Chain
- (b) If a customer starts as satisfied (State 1), what is the probability that they will eventually churn (move to State 3)?
- (c) Determine the steady-state distribution of this Markov Chain. What percentage of customers can the company expect to be in each state in the long run?
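
One way to explore parts (a) through (c) numerically, as a sketch using only the standard library: write the transition matrix with rows as the current state, then push a starting distribution forward month by month. The helper name `step` and the choice of months to print are illustrative.

```python
# Rows are the current state, columns the next state (order: 1, 2, 3)
P = [
    [0.80, 0.15, 0.05],  # State 1: Retention
    [0.30, 0.50, 0.20],  # State 2: At-risk
    [0.00, 0.00, 1.00],  # State 3: Churn (absorbing)
]


def step(dist, matrix):
    """Advance a row-vector distribution by one month: dist @ matrix."""
    size = len(matrix)
    return [sum(dist[i] * matrix[i][j] for i in range(size)) for j in range(size)]


dist = [1.0, 0.0, 0.0]  # customer starts in State 1
for month in range(1, 601):
    dist = step(dist, P)
    if month in (12, 60, 600):
        print(f"Month {month:>3}: " + ", ".join(f"{x:.4f}" for x in dist))
```

Because State 3 is absorbing and reachable from both other states, the probability mass drains into it over time: the eventual churn probability is 1, and the long-run distribution is $(0, 0, 1)$.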


### 2. Inventory Management in a Warehouse

Scenario: A warehouse tracks the inventory levels of a particular product using a Markov Chain model. The inventory levels are categorized into three states:

- State 1: High inventory (More than 100 units in stock)
- State 2: Medium inventory (Between 50 and 100 units in stock)
- State 3: Low inventory (Less than 50 units in stock)

The warehouse has the following transition probabilities for inventory levels from one month to the next:

- From State 1 (High): 70% stay in State 1, 25% move to State 2, and 5% move to State 3
- From State 2 (Medium): 20% move to State 1, 50% stay in State 2, and 30% move to State 3.
- From State 3 (Low): 10% move to State 1, 40% move to State 2, and 50% stay in State 3.

The warehouse management wants to optimize its restocking strategy by understanding the long-term distribution of inventory levels.

**Question**:
- (a) Construct the transition matrix for this Markov Chain
- (b) If the warehouse starts with a high inventory level (State 1), what is the probability that it will eventually end up in a low inventory level (State 3)? 
- (c) Determine the steady-state distribution of this Markov Chain. What is the long-term expected proportion of time that the warehouse will spend in each inventory state?
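
A sketch of one way to approximate the steady state, using power iteration with only the standard library. The function name `stationary` and the iteration count are illustrative choices; solving $\pi P = \pi$ directly as a linear system would work equally well.

```python
# Rows are the current state, columns the next state (order: High, Medium, Low)
P = [
    [0.70, 0.25, 0.05],  # State 1: High
    [0.20, 0.50, 0.30],  # State 2: Medium
    [0.10, 0.40, 0.50],  # State 3: Low
]


def stationary(matrix, iterations=500):
    """Approximate the steady state by repeatedly applying dist @ matrix."""
    size = len(matrix)
    dist = [1.0] + [0.0] * (size - 1)  # start in State 1; the limit does not depend on this
    for _ in range(iterations):
        dist = [sum(dist[i] * matrix[i][j] for i in range(size)) for j in range(size)]
    return dist


pi = stationary(P)
print(", ".join(f"{x:.4f}" for x in pi))
```

The iteration settles at approximately $(0.3467, 0.3867, 0.2667)$, which matches the exact solution $\left(\frac{26}{75}, \frac{29}{75}, \frac{20}{75}\right)$. Every state can reach every other state, so a warehouse starting in State 1 reaches State 3 eventually with probability 1.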