# Probability and Random Variables

- Why we need inferential statistics
- The nature of probability
- Random variables 
- Discrete versus continuous random variables
- Mean and variance for a random variable
- Excel and Python demonstrations using an example of a random variable

Inferential statistics is a technique to draw inferences about populations using samples.

Let’s use an example to help you understand. Suppose Amazon’s Quality Control department wants to know the proportion of the company’s products in its warehouses that are defective. Instead of inspecting every individual product, which would be impractical at that scale, the team can inspect a small sample of 1,000 products. It can then find the defect rate (i.e., the proportion of defective products) for the sample and use it to infer the defect rate for all the products in the warehouses.

This process of deriving insights or drawing inferences from sample data is called inferential statistics. Situations like the one above arise all the time in big companies like Amazon and Flipkart, among others.
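The defect-rate example can be sketched as a small simulation. This is only an illustration: the warehouse size (100,000 products) and the true defect rate (2%) are assumed figures, not values from the example.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population: 100,000 products, each defective with probability 2%.
# Both numbers are assumptions made for this sketch.
TRUE_DEFECT_RATE = 0.02
population = [rng.random() < TRUE_DEFECT_RATE for _ in range(100_000)]

# Inspect a simple random sample of 1,000 products instead of all of them.
sample = rng.sample(population, 1_000)

# True counts as 1 and False as 0, so sum() counts the defective products.
sample_defect_rate = sum(sample) / len(sample)
population_defect_rate = sum(population) / len(population)

print(f"Sample defect rate:     {sample_defect_rate:.3f}")
print(f"Population defect rate: {population_defect_rate:.3f}")
```

The two rates will generally be close but not identical; that gap is the sampling error that inferential statistics tries to quantify.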

Inferential statistics finds applications in many areas. Here are a few examples of such areas:

- Consumers’ reactions to product pricing
- Quality check for defective products
- Success rate of a movie at the box office

Population data may be expensive to collect, or may not be available at all. In such cases, we treat the available data as a sample and draw inferences about future data. Thus, based on the sample, we try to predict to some degree what will happen in the future.

### Basics of Probability




# Random Variables

- A random variable is one whose values represent the outcomes of a random experiment.
- It is determined by specifying its possible values and their associated probabilities.
- A random variable is denoted with a capital letter (typically, X, Y, Z, etc.), and specific values are denoted with lowercase letters (e.g., X = x or X ≤ x).

When two fair coins are tossed, the possible outcomes are {HH, HT, TH, TT}. If X is the number of heads, its probability distribution P(X = x) is:

- P(X = 0) = 1/4
- P(X = 1) = 1/2
- P(X = 2) = 1/4
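The probabilities above can be checked by enumerating the four outcomes. The sketch below assumes two fair coins, so that every outcome is equally likely.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All equally likely outcomes of two fair coin tosses: HH, HT, TH, TT.
outcomes = ["".join(tosses) for tosses in product("HT", repeat=2)]

# X = number of heads in each outcome.
counts = Counter(outcome.count("H") for outcome in outcomes)

# P(X = x) = (number of outcomes with x heads) / (total number of outcomes).
pmf = {x: Fraction(n, len(outcomes)) for x, n in sorted(counts.items())}

for x, p in pmf.items():
    print(f"P(X = {x}) = {p}")
```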

Random variables are of two types:

1. Discrete Random Variable 
    - They take values from a fixed, countable set of possible outcomes.
    - Each outcome has an associated probability.
    - Examples: number of heads in two tosses, the creditworthiness of a loan applicant.

2. Continuous Random Variable 
    - They can take any value within a range.
    - Examples: the future value of a stock, a customer’s purchase in my store tomorrow, the reliability of an automobile part.

![3a1ec0fa-c762-41fc-877c-2c0c82103e42-UMD_DSBA_2.5.3_Img01-01.jpg](attachment:3a1ec0fa-c762-41fc-877c-2c0c82103e42-UMD_DSBA_2.5.3_Img01-01.jpg)

## <font color = 'maroon'>Discrete Random Variable</font>

### <font color = 'blue'>1. Mean of Discrete Random Variables</font>

We will use an example to see how the mean helps in making useful business decisions.

A property firm is deciding between two locations, A and B, for the construction of its new hotel. The two locations fall under two separate projects: Project A and Project B.

Since the two locations are in the development stage, there is uncertainty regarding the amount of profit the firm can make from the projects in the next five years.

The probabilities of the profit estimates for both projects are given below:

![affe215c-baeb-48aa-8605-87fc79e2af8b-Example-Comparing%20two%20projects.png](attachment:affe215c-baeb-48aa-8605-87fc79e2af8b-Example-Comparing%20two%20projects.png)

We can represent this in the form of a probability distribution as shown here:

![75caeb02-cd4d-4284-aa83-c6b5766082f1-Comparing%20Project%20A%20and%20Project%20B.png](attachment:75caeb02-cd4d-4284-aa83-c6b5766082f1-Comparing%20Project%20A%20and%20Project%20B.png)

We can compute the expected return for both projects using the formula below.

![dd16c4f9-4d33-4aa6-a727-3255db61aa53-Expected%20Value.png](attachment:dd16c4f9-4d33-4aa6-a727-3255db61aa53-Expected%20Value.png)

**Expected return for Project A**

$$
\begin{aligned}
E(X) &= 81\%(-10) + 18\%(90) + 1\%(190) \\
&= (0.81)(-10) + (0.18)(90) + (0.01)(190) \\
&= 10 \text{ million dollars}
\end{aligned}
$$

**Expected return for Project B**

$$
\begin{aligned}
E(Y) &= 90\%(-10) + 10\%(190) \\
&= (0.90)(-10) + (0.10)(190) \\
&= 10 \text{ million dollars}
\end{aligned}
$$
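The same arithmetic can be written as a short Python function. The helper name `expected_value` and the dictionary layout are choices made for this sketch; the profit figures and probabilities are the ones used in the calculation above.

```python
def expected_value(distribution):
    """Probability-weighted average of a discrete distribution.

    `distribution` maps each possible outcome to its probability.
    """
    return sum(value * prob for value, prob in distribution.items())


# Profit in millions of dollars -> probability, from the worked example.
project_a = {-10: 0.81, 90: 0.18, 190: 0.01}
project_b = {-10: 0.90, 190: 0.10}

print(f"E(X) for Project A: {expected_value(project_a):.1f} million dollars")
print(f"E(Y) for Project B: {expected_value(project_b):.1f} million dollars")
```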

Because the expected value of both projects is the same, we are still unsure which project the property firm should invest in. We will look at several interpretations of the expected value next.

**A Probability-Weighted Average**

The expected value is a summary figure: each possible value of the random variable is multiplied by its probability, and the products are added up, as calculated above. It averages over all possibilities.

 

**A Long-Run Average**

Notice that the expected value, 10 million dollars, is not one of the possible outcomes for Project A or Project B. The expected value need not coincide with any single outcome; it indicates what the average return would be in the long run. If the company could invest in many independent projects like either of these, its average return per project would approach 10 million dollars. Hence, it is a long-run average.
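The long-run interpretation can be illustrated with a simulation. The sketch below assumes the firm could repeat Project B many times independently, which is a thought experiment rather than something the example states; the function name `average_return` is hypothetical.

```python
import random

# Project B from the example: profit in millions of dollars, with probabilities.
VALUES = [-10, 190]
PROBS = [0.90, 0.10]


def average_return(n_projects, rng):
    """Average profit over `n_projects` independent draws from Project B."""
    draws = rng.choices(VALUES, weights=PROBS, k=n_projects)
    return sum(draws) / n_projects


rng = random.Random(0)  # fixed seed so the run is repeatable
for n in (10, 1_000, 100_000):
    print(f"average over {n:>7,} projects: {average_return(n, rng):6.2f}")
```

With only a handful of projects the average can be far from 10, but it tends to settle near 10 as the number of repetitions grows.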

**A Fair Value of a Gamble**

The expected value can also be read as the fair price of a gamble. Project B is a gamble in which the firm loses 10 million dollars with a 90% chance and gains 190 million dollars with a 10% chance. A fair price for taking that gamble is its expected value, 10 million dollars.

### <font color = 'blue'>2. Variance of a Discrete Random Variable</font>

We will find the variance for the two projects and learn how it can be interpreted for business decision-making.

The variance and the standard deviation are measures of risk. The variance of a discrete random variable X measures the variability of its outcomes about the mean, weighted by their probabilities. The standard deviation is the square root of the variance.

Variability of return for Project A:

$$SD(X) = \sqrt{(-10-10)^2(0.81) + (90-10)^2(0.18) + (190-10)^2(0.01)} = \sqrt{1800} \approx 42.4$$

Variability of return for Project B:

$$SD(Y) = \sqrt{(-10-10)^2(0.90) + (190-10)^2(0.10)} = \sqrt{3600} = 60$$

Since the standard deviation (and hence the variance) of Project B is greater than that of Project A, Project B involves greater risk than Project A.
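The risk comparison can be reproduced in a few lines of Python. The helper names `mean` and `variance` are choices made for this sketch; the numbers are the profit estimates (in millions of dollars) and probabilities from the example.

```python
from math import sqrt


def mean(distribution):
    """Expected value of a discrete distribution given as {outcome: probability}."""
    return sum(value * prob for value, prob in distribution.items())


def variance(distribution):
    """Probability-weighted average of squared deviations from the mean."""
    mu = mean(distribution)
    return sum(prob * (value - mu) ** 2 for value, prob in distribution.items())


project_a = {-10: 0.81, 90: 0.18, 190: 0.01}
project_b = {-10: 0.90, 190: 0.10}

for name, dist in (("A", project_a), ("B", project_b)):
    var = variance(dist)
    print(f"Project {name}: variance = {var:.0f}, standard deviation = {sqrt(var):.1f}")
```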

Consider another example:

![18e70ce9-9241-4fba-b81e-df504b4db137-Project%20X%20and%20Project%20Y.png](attachment:18e70ce9-9241-4fba-b81e-df504b4db137-Project%20X%20and%20Project%20Y.png)

Let’s look at the probability distributions for both projects. At the higher end of its distribution, Project Y can make 60,000 dollars, and at the lower end 30,000 dollars. Project X, however, can make only 50,000 dollars at the higher end and 30,000 dollars at the lower end. This means it is better to invest in Project Y than in Project X.

The image below is an example of a decision tree that can help you choose from among three investment options: stocks, cryptocurrency, and certificates of deposit (CDs). You can try to make a rational decision based on the expected payoffs from and risks involved in each option. This is how expected values and variance are applied in real-world problems.

![0fa762b5-00c2-40f6-8f1b-5296c80abf9e-Session%20Summary.png](attachment:0fa762b5-00c2-40f6-8f1b-5296c80abf9e-Session%20Summary.png)

## <font color = 'maroon'>Continuous Random Variable</font>

Here are the two characteristics of continuous random variables:

1. They take any real value in a described range.
2. They are measured on a continuum.
 

Here are a few examples of continuous random variables:

1. Price of a stock
2. Total money spent by customers in a store
3. A company’s market share
4. A retailer’s annual sales
5. Waiting time for service at a check-out

In the case of continuous random variables, we define the density of occurrence of outcomes within a range. This is expressed by the probability density function (PDF). You can use the PDF to compute the probability that a continuous random variable falls within a given range: that probability is the area under the PDF over the range.
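To make the idea of "probability as area under the PDF" concrete, here is a minimal sketch. It assumes a normal density with mean 0 and standard deviation 1 purely for illustration; the function names are hypothetical.

```python
from statistics import NormalDist

# Assumed distribution for this sketch: normal with mean 0 and standard deviation 1.
X = NormalDist(mu=0.0, sigma=1.0)


def area_under_pdf(dist, a, b, steps=1_000):
    """Approximate the area under dist.pdf between a and b with the midpoint rule."""
    width = (b - a) / steps
    return sum(dist.pdf(a + (i + 0.5) * width) for i in range(steps)) * width


def prob_between(dist, a, b):
    """P(a <= X <= b) computed exactly from the cumulative distribution function."""
    return dist.cdf(b) - dist.cdf(a)


print(f"Area under the PDF on [-1, 1]: {area_under_pdf(X, -1, 1):.4f}")
print(f"P(-1 <= X <= 1) from the CDF:  {prob_between(X, -1, 1):.4f}")
```

Both numbers agree because the cumulative distribution function is the running area under the PDF.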

Let’s take an example to help you understand this better.

The chart below shows the closing prices of the Dow Jones Industrial Average (DJIA), one of the most prominent stock market indices, from March 2002 to February 2003.

![fbe4ac2c-4e10-4dbf-82b7-1aacdc9e63cf-UMDDSBA-B1-2.5.2%20PL03%20OG.png](attachment:fbe4ac2c-4e10-4dbf-82b7-1aacdc9e63cf-UMDDSBA-B1-2.5.2%20PL03%20OG.png)

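One metric for understanding how the index is performing is the daily percentage return, i.e., the percentage change in the closing price from one trading day to the next.

After computing the daily returns, you can create a histogram with the daily returns on the x-axis and their frequency on the y-axis.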

![c52828db-7a43-4eca-af6e-c6d4099b9c5e-UMDDSBA-B1-2.5.2%20PL04%20OG.png](attachment:c52828db-7a43-4eca-af6e-c6d4099b9c5e-UMDDSBA-B1-2.5.2%20PL04%20OG.png)

This graph approximates the probability density function (PDF) of the daily percentage return.

Here, the daily percentage return is a continuous random variable. Using the PDF graph, we can easily determine the probability of, say, getting a positive or a negative daily return on the DJIA.
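The steps from closing prices to a histogram of daily returns can be sketched as follows. The closing prices here are made-up numbers, not actual DJIA data, and the text histogram stands in for a plotted chart.

```python
from collections import Counter
from math import floor

# Hypothetical closing prices, one per trading day (NOT actual DJIA data).
closes = [8000.0, 8080.0, 8040.0, 8120.0, 8120.0, 7990.0, 8070.0, 8150.0]

# Daily percentage return: change from the previous close, as a percent of it.
returns = [(today - prev) / prev * 100 for prev, today in zip(closes, closes[1:])]

# Histogram with bins one percentage point wide: [-2%, -1%), [-1%, 0%), ...
bins = Counter(floor(r) for r in returns)
for b in sorted(bins):
    print(f"[{b:>2}%, {b + 1:>2}%): {'#' * bins[b]}")

# Empirical estimate of the probability of a positive daily return.
share_positive = sum(r > 0 for r in returns) / len(returns)
print(f"Share of days with a positive return: {share_positive:.2f}")
```

With a full year of real prices, the share of positive days would serve as an estimate of the probability of a positive daily return.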

## Normal Distribution

1. What is Normal Distribution
2. Examples of Normal Distributions by changing mean and standard deviation.
3. Empirical Rule
4. Z-Score and its use, with a university cut-off example

## Central Limit Theorem

1. Population and Sample
    - Population vs sample
    - Sampling Frame
    - Census
    
Suppose you’re working as a data analyst for a startup that focuses on instant delivery services. You want to find the average number of times urban people went to a mall in the past year.

Obviously, you cannot possibly ask every person how many times they visited a mall last year (we are talking about millions of people). That would be a costly and tedious process. How can you reduce the time and money spent on finding a reliable estimate for this number?

This is where sampling comes in. It involves taking a small sample of people from the larger population and checking whether the metrics obtained from it can be somehow extrapolated to the entire population (potentially with a small margin of error).

**What is the Central Limit Theorem?**

1. Take the dataset.
2. Draw multiple samples with replacement from the dataset. Each sample can contain, say, 30, 50 or 100 data points.
3. Compute the mean of each sample drawn in the previous step.
4. Plot the 'Sampling Distribution of the Sample Mean' as a histogram.
5. For sufficiently large samples, the sample means approximately follow a `Normal Distribution`, whatever the shape of the original data (provided its variance is finite). This is the `Central Limit Theorem`.
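The steps above can be sketched as a simulation. The dataset is a hypothetical, strongly right-skewed one (exponential values with mean 5), chosen to show that the sample means behave roughly normally even when the data do not; the function name `sample_means` is an assumption of this sketch.

```python
import random
import statistics

rng = random.Random(7)  # fixed seed so the run is repeatable

# Step 1: a hypothetical right-skewed dataset (exponential values with mean 5).
dataset = [rng.expovariate(1 / 5) for _ in range(10_000)]


def sample_means(data, sample_size, n_samples, rng):
    """Steps 2 and 3: means of samples drawn with replacement from `data`."""
    return [
        statistics.fmean(rng.choices(data, k=sample_size))
        for _ in range(n_samples)
    ]


means = sample_means(dataset, sample_size=50, n_samples=2_000, rng=rng)

# Step 4 (numeric stand-in for the histogram): centre and spread of the means.
pop_mean = statistics.fmean(dataset)
pop_sd = statistics.pstdev(dataset)
centre = statistics.fmean(means)
spread = statistics.stdev(means)

# For a normal distribution, about 95% of values lie within 2 SDs of the mean.
within_2sd = sum(abs(m - centre) <= 2 * spread for m in means) / len(means)

print(f"dataset mean = {pop_mean:.2f}, mean of sample means = {centre:.2f}")
print(f"dataset SD / sqrt(50) = {pop_sd / 50 ** 0.5:.2f}, SD of sample means = {spread:.2f}")
print(f"share of sample means within 2 SDs: {within_2sd:.3f}")
```

The sample means cluster around the dataset mean with a spread close to the dataset's standard deviation divided by the square root of the sample size.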

**Example Use Case of Central Limit Theorem in Industry**

Suppose we have an Uber trips dataset and want to estimate the time taken to reach a destination when a user books a ride. The dataset is usually small, so we draw many samples from it (with replacement) and plot the distribution of the sample means, which will be approximately normal. From that curve, we can estimate the average time for a ride.

We use the empirical rule on the normal curve to obtain a 95% confidence interval. As a rule of thumb, each sample should contain at least 30 data points (n >= 30).
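A 95% interval of this kind can be sketched as follows. Note two assumptions: the trip durations are simulated values, not real Uber data, and the sketch uses the standard-error formula (sample standard deviation divided by the square root of n) rather than repeated resampling to get the spread of the sample mean.

```python
import random
import statistics
from math import sqrt

rng = random.Random(3)  # fixed seed so the run is repeatable

# Hypothetical trip durations in minutes (simulated, NOT real Uber data).
trip_minutes = [rng.gammavariate(4.0, 5.0) for _ in range(200)]

n = len(trip_minutes)
sample_mean = statistics.fmean(trip_minutes)

# Standard error: estimated standard deviation of the sample mean.
standard_error = statistics.stdev(trip_minutes) / sqrt(n)

# Empirical rule: about 95% of a normal distribution lies within 2 SDs of its mean.
lower = sample_mean - 2 * standard_error
upper = sample_mean + 2 * standard_error

print(f"n = {n}, sample mean = {sample_mean:.1f} minutes")
print(f"approximate 95% interval: ({lower:.1f}, {upper:.1f}) minutes")
```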

### Sampling Methods

1. Simple random sampling
2. Stratified sampling 
3. Systematic sampling
4. Cluster sampling 
5. Judgment sampling
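Three of these methods can be sketched with the standard library. This is a rough illustration under stated assumptions: the sampling frame of 100 people and the `tier` field are hypothetical, the function names are made up for this sketch, and cluster and judgment sampling are not shown.

```python
import random

rng = random.Random(11)  # fixed seed so the run is repeatable

# Hypothetical sampling frame: 100 people, each tagged with a city tier.
frame = [{"id": i, "tier": "metro" if i % 4 == 0 else "non-metro"} for i in range(100)]


def simple_random(frame, n, rng):
    """Simple random sampling: every unit has the same chance of selection."""
    return rng.sample(frame, n)


def systematic(frame, n, rng):
    """Systematic sampling: random start, then every k-th unit in the frame."""
    k = len(frame) // n
    start = rng.randrange(k)
    return frame[start::k][:n]


def stratified(frame, key, fraction, rng):
    """Stratified sampling: draw the same fraction from each stratum."""
    strata = {}
    for unit in frame:
        strata.setdefault(unit[key], []).append(unit)
    picked = []
    for units in strata.values():
        picked.extend(rng.sample(units, round(len(units) * fraction)))
    return picked


print(len(simple_random(frame, 10, rng)), "units by simple random sampling")
print([u["id"] for u in systematic(frame, 10, rng)], "ids by systematic sampling")
print(len(stratified(frame, "tier", 0.2, rng)), "units by stratified sampling")
```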