# Probability

Probability can roughly be described as "the percentage chance of an event or sequence of events occurring".
If you think about a coin flip intuitively, there's a 50% chance of getting heads and a 50% chance of getting tails. This is because there are only two possible outcomes, and each outcome is equally likely.
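The 50% figure can be checked empirically. The following is a minimal sketch, assuming a fair coin modelled with Python's `random` module; the seed and the number of flips are arbitrary choices made only for illustration.

```python
import random

# Seeded generator so the run is repeatable; the seed value is arbitrary.
rng = random.Random(42)

num_flips = 10_000
flips = [rng.choice(["heads", "tails"]) for _ in range(num_flips)]

# Empirical probability: favourable outcomes divided by total outcomes.
heads_probability = flips.count("heads") / num_flips
print(heads_probability)
```

With this many flips the printed value should land close to, but rarely exactly on, 0.5.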


## Calculating probability ##

In [None]:
total_countries = flags.shape[0]

orange_probability = flags[flags["orange"] == 1].shape[0] / total_countries
stripe_probability = flags[flags["stripes"] > 1].shape[0] / total_countries


# This function takes in the number of trials, and the number of rolls per trial.
# Then it conducts each trial, and records the proportion of rolls that came up one.
# `roll()` is assumed to be defined elsewhere and to return an integer from 1 to 6.
def probability_of_one(num_trials, num_rolls):
    probabilities = []
    for i in range(num_trials):
        die_rolls = [roll() for _ in range(num_rolls)]
        one_prob = len([d for d in die_rolls if d == 1]) / num_rolls
        probabilities.append(one_prob)
    return probabilities

## Conjunctive probabilities ##
A conjunctive probability involves a sequence of events. For example, we may want to find the probability that the first flip is heads and the second flip is heads, and so on.

Each event in this sequence is independent, as the outcome of one flip won't have an impact on the outcome of any other flip. All we have to do to compute the probability of this sequence is multiply the individual probabilities of each event together. For five flips, this is .5 * .5 * .5 * .5 * .5, which equals .03125, giving us a 3.125% chance that all 5 coin flips result in heads.
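The multiplication rule can be cross-checked by listing every possible sequence of five flips and counting the ones that are all heads. This is only an illustrative sketch; the variable names are not from the original notes.

```python
from itertools import product

# Every possible sequence of 5 flips: 2 ** 5 = 32 equally likely outcomes.
sequences = list(product("HT", repeat=5))

# Only one of those sequences is heads on every flip.
all_heads = [s for s in sequences if all(flip == "H" for flip in s)]

five_heads_probability = len(all_heads) / len(sequences)
print(five_heads_probability)
```

Counting gives the same value as multiplying .5 by itself five times.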



In [None]:
five_heads = .5 ** 5 
ten_heads =  .5 ** 10
hundred_heads = .5 ** 100

## Dependent probabilities ##

In [None]:
# Remember that whether a flag has red in it or not is in the `red` column.
number_red_flags = flags[flags["red"] == 1].shape[0]
number_countries = flags.shape[0]

three_red = 1

# Each pick removes one flag from the pool, so both counts shrink by one after every pick.
for i in range(3):
    three_red = three_red * ((number_red_flags - i) / (number_countries - i))

## Disjunctive probability ##
Here we want to know the probability of some event occurring or another event occurring. Let's say we're rolling a six-sided die. The probability of rolling a 2 is 1/6.
What if we want to know the probability of rolling a 2 or a 3? We can simply add the probabilities, because the two events are mutually exclusive: a single roll can't come up both 2 and 3. Thus, the probability is 1/6 + 1/6, or 1/3.
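As a small sketch of the die example, the probability can be computed two ways: by counting the matching faces directly, and by adding the two individual probabilities. `Fraction` is used here only to keep the arithmetic exact.

```python
from fractions import Fraction

faces = [1, 2, 3, 4, 5, 6]

# Direct count: faces that are a 2 or a 3, out of all faces.
two_or_three = Fraction(len([f for f in faces if f in (2, 3)]), len(faces))

# Adding the two individual probabilities gives the same answer.
added = Fraction(1, 6) + Fraction(1, 6)

print(two_or_three, added)
```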


In [None]:
start = 1
end = 18000

# What are the odds of getting a number evenly divisible by 100, with no remainder? (ie 100, 200, 300, etc).
hundred_prob = len([i for i in range(start,end+1) if i%100==0]) /end

# What are the odds of getting a number evenly divisible by 70, with no remainder? (ie 70, 140, 210, etc). 
seventy_prob = len([i for i in range(start,end+1) if i%70==0]) /end



## Disjunctive dependent probabilities ##
When the events are not mutually exclusive (two conditions):

What if two traits can occur together? For example, some of the cars are both red and convertibles. If we don't account for this overlap, we count those cars twice and end up with an inflated probability.
Let's say that half of our 10 cars are red, half are convertibles, and 3 cars are both red and convertibles. Our probability for red or convertible then comes out to (1/2 + 1/2) - 3/10. We subtract 3/10 to account for the cars we double counted when we computed (1/2 + 1/2). This gives us a .7 probability of a car being a convertible or red.
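The car example can be sketched with Python sets. The lot below is hypothetical: 10 cars numbered 0 to 9, of which 5 are red, 5 are convertibles, and 3 are both, which is what the fractions 1/2, 1/2 and 3/10 imply.

```python
# Hypothetical lot of 10 cars, identified by the numbers 0 to 9.
all_cars = set(range(10))
red = {0, 1, 2, 3, 4}            # 5 red cars
convertible = {2, 3, 4, 5, 6}    # 5 convertibles; cars 2, 3 and 4 are both

p_red = len(red) / len(all_cars)
p_convertible = len(convertible) / len(all_cars)
p_both = len(red & convertible) / len(all_cars)

# Add the two probabilities, then subtract the overlap once.
red_or_convertible = p_red + p_convertible - p_both

# Cross-check by counting the union directly.
direct = len(red | convertible) / len(all_cars)
print(red_or_convertible, direct)
```

Both routes count 7 of the 10 cars, because the set union never counts a car twice.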

In [None]:
stripes_or_bars = None
red_or_orange = None
total_flags = flags.shape[0]

red_flags = flags[flags["red"] == 1].shape[0]
orange_flags = flags[flags["orange"] == 1].shape[0]
red_and_orange_flags = flags[(flags["orange"] == 1) & (flags["red"] ==1)].shape[0]

red_or_orange = (red_flags/total_flags) + (orange_flags/total_flags) - (red_and_orange_flags/total_flags)


stripes = flags[flags["stripes"] > 0].shape[0]
bars = flags[flags["bars"] > 0].shape[0]
stripes_and_bars = flags[(flags["stripes"] >0) & (flags["bars"] >0)].shape[0]

stripes_or_bars = (stripes/total_flags + bars/total_flags) - stripes_and_bars/total_flags

## Disjunctive probabilities with multiple conditions ##
When the events are not mutually exclusive (n conditions):

For example, we might want all cars that are red or convertibles or have a top speed of 130mph. One easy way to solve cases like this is to first find everything that doesn't match any of our criteria, then subtract that probability from 1.
Let's say 2 of our 10 vehicles are blue sport utility vehicles with a 110mph top speed, so they match none of the criteria. We would get a 1 - .2 or .8 probability for red or convertible or 130mph top speed.
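A minimal sketch of this "subtract the non-matches" approach follows. The list of 10 cars is invented for illustration; only the two blue SUVs with a 110mph top speed come from the example above.

```python
# Hypothetical lot of 10 cars as (color, body style, top speed in mph).
cars = [
    ("red", "sedan", 110), ("red", "convertible", 130),
    ("blue", "convertible", 110), ("blue", "sedan", 130),
    ("red", "suv", 110), ("green", "convertible", 120),
    ("red", "convertible", 110), ("black", "suv", 130),
    ("blue", "suv", 110), ("blue", "suv", 110),
]

def matches_any(car):
    """True if the car is red, or a convertible, or has a 130mph top speed."""
    color, body, top_speed = car
    return color == "red" or body == "convertible" or top_speed == 130

# Count the cars that match none of the three criteria...
no_match = [car for car in cars if not matches_any(car)]

# ...and subtract their share from 1.
p_any = 1 - len(no_match) / len(cars)
print(p_any)
```

This avoids working out every pairwise and three-way overlap between the criteria.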



In [None]:
heads_or = None
all_three_tails = (1/2 * 1/2 * 1/2)
heads_or = 1 - all_three_tails

## Statistical significance:

Statistical significance is a measure of whether your research findings are likely to reflect a real effect rather than chance. More specifically, it asks how likely a result like yours would be if only random variation were at work.
You usually set a significance level beforehand as the threshold for deciding whether a result counts as significant. After conducting the experiment, you check your result against the significance level to make that decision.
A common significance level is .05. This means that, if chance alone were responsible, a result at least this extreme would occur 5% of the time or less.
In order to test for significance, we compare our result ratio with the mean ratios.
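One way to make this concrete is a small simulation. The sketch below is an illustration only, not the method used elsewhere in these notes: it assumes a hypothetical experiment that saw 70 heads in 100 coin flips, and estimates how often a fair coin produces a result at least that extreme. The seed and the number of simulations are arbitrary.

```python
import random

rng = random.Random(0)  # arbitrary seed, only for repeatability

observed_heads = 70      # hypothetical result: 70 heads in 100 flips
num_flips = 100
num_simulations = 2_000
significance_level = 0.05

# Simulate what a fair coin produces by chance alone.
simulated_heads = [
    sum(rng.random() < 0.5 for _ in range(num_flips))
    for _ in range(num_simulations)
]

# Share of chance-only results at least as extreme as the observed one.
p_value = sum(h >= observed_heads for h in simulated_heads) / num_simulations

print(p_value, p_value <= significance_level)
```

If the share is at or below the significance level, the observed result would be treated as statistically significant.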



## Probability Distributions:

The pmf function in SciPy is an implementation of the mathematical probability mass function. The pmf will give us the probability of each k in our outcome_counts list occurring.
A binomial distribution only needs two parameters. A parameter is the statistical term for a number that summarizes data for the entire population. For a binomial distribution, the parameters are:
* N, the total number of events,
* p, the probability of the outcome we're interested in seeing.
The SciPy function binom.pmf matches this and takes in the following parameters:
* k: the outcome count, or list of outcome counts, to evaluate,
* n: the total number of events,
* p: the probability of the outcome we're interested in seeing.
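For reference, the binomial probability mass function can be written out with the standard library alone. This is a sketch of the textbook formula rather than SciPy's implementation, and `binomial_pmf` is a name chosen here for illustration.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent events."""
    return math.comb(n, k) * (p ** k) * ((1 - p) ** (n - k))

n, p = 30, 0.39
outcome_probs = [binomial_pmf(k, n, p) for k in range(n + 1)]

# The probabilities of all possible outcomes should add up to 1.
print(sum(outcome_probs))
```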


In [None]:
import matplotlib.pyplot as plt
from numpy import linspace
from scipy.stats import binom

# Create a range of numbers from 0 to 30, with 31 elements (each number has one entry).
outcome_counts = linspace(0,30,31)

outcome_probs = binom.pmf(outcome_counts,30,0.39)
plt.bar(outcome_counts,outcome_probs)
plt.show()

## Expected value of Probability Distribution (mean):

The average result we would expect if we took many samples. For a binomial distribution, the most likely result of a single sample is close to it. To compute this, we just multiply N by p.
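As a worked instance, using the values N = 30 and p = 0.39 that appear in the plotting cells of these notes:

```latex
\mu = N p = 30 \times 0.39 = 11.7
```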

## Standard Deviation of a Probability Distribution:

How much the actual values will vary from the mean when we take a sample.
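For a binomial distribution the standard deviation has a shortcut formula, the square root of N * p * (1 - p). The sketch below checks that shortcut against the definition, assuming the same N = 30 and p = 0.39 used in the plotting cells; the variable names are illustrative.

```python
import math

N, p = 30, 0.39

# Binomial probability of each outcome count k from 0 to N.
pmf = [math.comb(N, k) * p ** k * (1 - p) ** (N - k) for k in range(N + 1)]

# Mean and standard deviation computed directly from their definitions.
mean = sum(k * prob for k, prob in enumerate(pmf))
stdev = math.sqrt(sum((k - mean) ** 2 * prob for k, prob in enumerate(pmf)))

# Shortcut formula for a binomial distribution.
shortcut = math.sqrt(N * p * (1 - p))
print(stdev, shortcut)
```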

## Cumulative distribution function:

So far, we've looked at the probability that single values of k will occur. What we can look at instead is the probability that k or less will occur.
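In symbols, with N events and probability p as defined for the binomial distribution above, the probability of seeing k or fewer is the running sum of the individual probabilities:

```latex
F(k) = P(X \le k) = \sum_{i=0}^{k} \binom{N}{i} \, p^{i} (1 - p)^{N - i}
```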


In [None]:
from numpy import linspace  # NumPy's linspace; recent SciPy releases no longer provide the alias.
from scipy.stats import binom

# Create a range of numbers from 0 to 30, with 31 elements (each number has one entry).
outcome_counts = linspace(0,30,31)

# Create the cumulative binomial probabilities, one for each entry in outcome_counts.
dist = binom.cdf(outcome_counts,30,0.39)



## Z Scores:

The number of standard deviations a value is away from the mean.
These z-scores can then be used to find the percentage of values to the left and right of the value we're looking at. This is because every normal distribution has the same properties when it comes to what percentage of the data is within a certain number of standard deviations of the mean. You can look these up in a standard normal table. About 68% of the data is within 1 standard deviation of the mean, 95% is within 2, and 99.7% is within 3.
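These percentages can be reproduced with the standard library's `statistics.NormalDist`. The sketch below is illustrative: `z_score` and `share_within` are names chosen here, and the value, mean and standard deviation passed to `z_score` are hypothetical.

```python
from statistics import NormalDist

def z_score(value, mean, stdev):
    """Number of standard deviations that value lies from the mean."""
    return (value - mean) / stdev

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(z):
    """Share of a normal distribution within z standard deviations of the mean."""
    return standard_normal.cdf(z) - standard_normal.cdf(-z)

# Hypothetical value of 16 from a distribution with mean 11.7 and stdev 2.67.
print(z_score(16, 11.7, 2.67))
print(share_within(1), share_within(2), share_within(3))
```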

## Chi Squared Test

The chi-squared test enables us to quantify the difference between sets of observed and expected categorical values. We can calculate χ², the chi-squared value, by squaring the difference between each observed and expected value, dividing by the expected value, and adding up the results.

What we really want to find is one number that can tell us how much all of our observed counts deviate from all of their expected counterparts. This will let us figure out if our difference in counts is statistically significant. We can get one step closer to this by squaring the top term in our difference formula: 

$$\frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$

Chi-squared values for the same sized effect increase as sample size increases, but the chance of getting a high chi-squared value decreases as the sample gets larger.
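The calculation can be sketched by hand before reaching for a library. The observed and expected counts below are the ones used in the example cell of this section; the rest is plain Python.

```python
# Observed and expected counts, taken from the example cell in this section.
observed_list = [27816, 3124, 1039, 311, 271]
expected_list = [26146.5, 3939.9, 944.3, 260.5, 1269.8]

# Square each difference, scale it by the expected count, and add them all up.
chi_squared = sum(
    (observed - expected) ** 2 / expected
    for observed, expected in zip(observed_list, expected_list)
)
print(chi_squared)
```

The last category contributes the most, since its observed count is far below its expected count.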



In [None]:
from scipy.stats import chisquare

observed_list = [27816, 3124, 1039, 311, 271]
expected_list = [26146.5, 3939.9, 944.3, 260.5, 1269.8]

chisquare_value, race_pvalue = chisquare(observed_list, expected_list)

# Multiple category chi-squared tests (with crosstab)

import pandas
from scipy.stats import chi2_contingency

# `income` is assumed to be a DataFrame loaded earlier, with "sex" and "race" columns.
chisq_value, pvalue_gender_race, df, expected = chi2_contingency(pandas.crosstab(income["sex"], [income["race"]]))

## Crosstabs:

The crosstab function returns a table that shows frequency counts for two or more columns, a bit like a big group by. Here's how you could use the pandas.crosstab function:


In [None]:
import pandas
table = pandas.crosstab(income["sex"], [income["high_income"]])
print(table)
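To see what such a table is counting, the same tally can be sketched with the standard library. The rows below are hypothetical stand-ins for the `sex` and `high_income` columns, and `Counter` is used here only to illustrate the idea, not as a replacement for pandas.

```python
from collections import Counter

# Hypothetical rows of (sex, high_income) pairs standing in for two columns.
rows = [
    ("Male", 1), ("Male", 0), ("Female", 0),
    ("Female", 1), ("Male", 0), ("Female", 0),
]

# Count how often each combination of the two values appears.
counts = Counter(rows)

for (sex, high_income), count in sorted(counts.items()):
    print(sex, high_income, count)
```

Each printed line corresponds to one cell of the cross table.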

## Contingency:

To generate the expected values, we can use the scipy.stats.chi2_contingency function. The function takes in a cross table of observed counts, and returns the chi-squared value, the p-value, the degrees of freedom, and the expected frequencies.
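The expected frequencies come from the row and column totals: each expected cell is its row total times its column total, divided by the grand total. A minimal sketch with a hypothetical 2x2 table of observed counts:

```python
# Hypothetical 2x2 table of observed counts (rows: groups, columns: outcomes).
observed = [
    [30, 10],
    [20, 40],
]

row_totals = [sum(row) for row in observed]
column_totals = [sum(column) for column in zip(*observed)]
grand_total = sum(row_totals)

# Expected count for each cell if rows and columns were unrelated.
expected = [
    [row_total * column_total / grand_total for column_total in column_totals]
    for row_total in row_totals
]
print(expected)
```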


## Correlations:

Correlations tell us how closely related two columns are. We'll be using the r value, also called Pearson's correlation coefficient, which measures how closely two sequences of numbers are correlated.

An r value falls between -1 and 1. 

The value tells us whether two columns are positively correlated, not correlated, or negatively correlated. 
- The closer to 1 the r value is, the stronger the positive correlation between the two columns. 
- The closer to -1 the r value is, the stronger the negative correlation (i.e., the more "opposite" the columns are). 
- The closer to 0, the weaker the correlation. 

In general, r values above .25 or below -.25 are enough to qualify a correlation as interesting. An r value isn't perfect, and doesn't prove that there's a correlation, just the possibility of one.

To really assess whether or not a correlation exists, we need to look at the data using a scatterplot to see its "shape." 

NOTE: We can use the pandas.DataFrame.corr() method to find correlations between columns in a dataframe. The method returns a new dataframe where the index for each column and row is the name of a column in the original data set.
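To show what the r value measures, here is a sketch that computes Pearson's correlation coefficient from its definition using only the standard library. The function name `pearson_r` and the sample data are illustrative assumptions.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

# Hypothetical data: ys rises with xs, zs falls as xs rises.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
zs = [10, 8, 6, 4, 2]

print(pearson_r(xs, ys), pearson_r(xs, zs))
```

Perfectly linear data sits at the two ends of the range, while real data usually falls somewhere in between.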


In [None]:
from scipy.stats import pearsonr

# `nba_stats` is assumed to be a DataFrame of player statistics loaded earlier.
# The pearsonr function will find the correlation between two columns of data.
# It returns the r value and the p value.  We'll learn more about p values later on.
r, p_value = pearsonr(nba_stats["fga"], nba_stats["pts"])
# As we can see, this is a very high positive r value - it's close to 1.
print(r)

# These two columns are much less correlated.
r, p_value = pearsonr(nba_stats["trb"], nba_stats["ast"])
# We get a much lower, but still positive, r value.
print(r)

r_fta_pts, p_value = pearsonr(nba_stats["fta"], nba_stats["pts"])
print(r_fta_pts)

r_stl_pf, p_value = pearsonr(nba_stats["stl"], nba_stats["pf"])
print(r_stl_pf)

## Probability Distribution Example

In [None]:
## 3. Bikesharing distribution ##

import pandas
bikes = pandas.read_csv("data/bike_rental_day.csv")

number_of_days = len(bikes)
prob_over_5000 = bikes[bikes["cnt"] >5000 ].shape[0]/number_of_days

## 4. Computing the distribution ##

import math

# Each item in this list represents one k, starting from 0 and going up to and including 30.
outcome_counts = list(range(31))

def find_combinations(N,k):
    # Calculate the numerator of our formula.
    numerator = math.factorial(N)
    # Calculate the denominator.
    denominator = math.factorial(k) * math.factorial(N - k)
    # Divide them to get the final value.
    return numerator / denominator

p = .39
q = .61
N = 30

def single_probability(N,k, p, q):
   
    return (p ** k) * (q ** (N-k))

outcome_probs=[]

for i in outcome_counts:
    
    probability = single_probability(N,i,p,q)
    combinations = find_combinations(N,i)
    
    outcome_probs.append(probability * combinations)


## 5. Plotting the distribution ##

import matplotlib.pyplot as plt

# The most likely number of days is between 10 and 15.
plt.bar(outcome_counts, outcome_probs)
plt.show()

## 6. Simplifying the computation ##

import scipy
from numpy import linspace  # NumPy's linspace; recent SciPy releases no longer provide the alias.
from scipy.stats import binom
import pandas as pd
import matplotlib.pyplot as plt

# Create a range of numbers from 0 to 30, with 31 elements (each number has one entry).
outcome_counts = linspace(0,30,31)

outcome_probs = binom.pmf(outcome_counts,30,0.39)
plt.bar(outcome_counts,outcome_probs)
plt.show()

## 8. Computing the mean of a probability distribution ##

dist_mean = None

dist_mean = 30 * 0.39

## 9. Computing the standard deviation ##

dist_stdev = None

dist_stdev = (30 * 0.39 * 0.61) ** (1/2)

## 10. A different plot ##

outcome_counts = linspace(0,10,11)
outcome_probs = binom.pmf(outcome_counts,10,0.39)
plt.bar(outcome_counts,outcome_probs)
plt.show()

outcome_counts = linspace(0,100,101)
outcome_probs = binom.pmf(outcome_counts,100,0.39)
plt.bar(outcome_counts,outcome_probs)
plt.show()

## 11. The normal distribution ##

import numpy as np

# Create a range of numbers from 0 to 100, with 101 elements (each number has one entry).
outcome_counts = np.linspace(0,100,101)

# Create a probability mass function along the outcome_counts.
outcome_probs = binom.pmf(outcome_counts,100,0.39)

# Plot a line, not a bar chart.
plt.plot(outcome_counts, outcome_probs)
plt.show()

## 12. Cumulative density function ##

outcome_counts = linspace(0,30,31)

outcome_probs = binom.cdf(outcome_counts,30,0.39)

plt.plot(outcome_counts, outcome_probs)

## 14. Faster way to calculate likelihood ##

left_16 = None
right_16 = None

left_16 = binom.cdf(16,30,0.39)
right_16 = 1 - left_16