# Before you start:

* Read the README.md file
* Comment as much as you can and use the resources (README.md file)
* Happy learning!

In this exercise, we will generate random numbers from the continuous distributions we learned in the lesson. There are two ways to generate random numbers:

1. Using the NumPy library
1. Using the SciPy library

Use either or both of the libraries in this exercise.
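
As a rough illustration of the two routes, the sketch below draws from the same normal distribution with each library. The mean, standard deviation, sample size, and seed are arbitrary values chosen for the example, not values from the lesson.

```python
import numpy as np
from scipy.stats import norm

# NumPy: a seeded Generator makes the draws reproducible.
rng = np.random.default_rng(seed=42)  # arbitrary seed
numpy_sample = rng.normal(loc=0, scale=1, size=5)

# SciPy: the distribution object exposes rvs() with the same loc/scale idea.
scipy_sample = norm.rvs(loc=0, scale=1, size=5, random_state=42)

print(numpy_sample)
print(scipy_sample)
```

The two samples are not expected to match, even with the same seed value, because each library builds its own random generator from it.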

In [None]:
from platform import python_version
print(python_version())

In [None]:
# Import Required Libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import math
from scipy.stats import norm

## Uniform Distribution

To generate uniform random numbers between any two given values using SciPy, we can use either the following code or the code that we discussed in class:

In [None]:
from scipy.stats import uniform
# Draw 10 numbers from the standard uniform distribution on [0, 1]
x = uniform.rvs(size=10)
a = 2
b = 3
# Rescale from [0, 1] to [a, b]
randoms = a + (b-a)*x
print(randoms)
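
An equivalent route, sketched here as an alternative and not necessarily the version discussed in class, is to let SciPy do the rescaling through its `loc` and `scale` parameters. The name `randoms_direct` is made up for this example.

```python
from scipy.stats import uniform

a = 2
b = 3

# loc moves the lower bound to a; scale stretches the interval width to b - a.
randoms_direct = uniform.rvs(loc=a, scale=b - a, size=10)
print(randoms_direct)
```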

**Your task:**

1. Based on the code above, write a function that generates uniformly distributed random numbers. There are several requirements for your function:
    * It should accept 3 parameters: 
        * `bottom` - the lower boundary of the generated numbers
        * `ceiling` - the upper boundary of the generated numbers
        * `count` - how many numbers to generate
    * It should return an array of uniformly distributed random numbers

1. Call your function with 2 sets of params below:
    * bottom=10, ceiling=15, count=100
    * bottom=10, ceiling=60, count=1,000

1. Plot the uniform distributions generated above using histograms, where the x axis is the value and the y axis is the count. Use 10 bins for each histogram.

You can check the expected output [here](https://drive.google.com/file/d/1uSelMUT-aSspJcDbfXpswZv9A5ChlaEL/view?usp=sharing)

In [None]:
# your code here

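def uniform_random(bottom, ceiling, count):
    # Draw `count` numbers from the standard uniform distribution on [0, 1]
    x = uniform.rvs(size=count)
    # Rescale them to the interval [bottom, ceiling]
    randoms = bottom + (ceiling-bottom)*x
    return randoms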

randoms_1 = uniform_random(10, 15, 100)
randoms_2 = uniform_random(10, 60, 1000)

In [None]:
fig, [ax1, ax2] = plt.subplots(1,2, figsize = (10,5))

ax1.hist(randoms_1,bins=10)
ax1.set_title("Plot 1 with first set of parameters")

ax2.hist(randoms_2,bins=10)
ax2.set_title("Plot 2 with second set of parameters")

plt.show()

How are the two distributions different?

In [None]:
# your answer below
# Plot 2 looks more evenly distributed than Plot 1, but that is because it contains more values (1,000 vs. 100) for the same 10 bins.
# The ranges also differ: Plot 1 covers 10 to 15, while Plot 2 covers 10 to 60.
# If we increase the number of bins for Plot 2, it will most likely look like Plot 1.
# I tried using 100 bins for Plot 2 and it looked very much like Plot 1.
# Paolo: yes!
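
One way to put a number on this observation is to compare how much the bin counts vary relative to their average. The sketch below is only an illustration: the helper `relative_bin_spread` and the seed are made up for this example, and it uses NumPy's generator where the exercise uses `uniform.rvs`.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

def relative_bin_spread(sample, bins):
    """Standard deviation of the histogram bin counts divided by their mean."""
    counts, _ = np.histogram(sample, bins=bins)
    return counts.std() / counts.mean()

small = rng.uniform(10, 15, size=100)
large = rng.uniform(10, 60, size=1000)

print("100 values, 10 bins:  ", relative_bin_spread(small, bins=10))
print("1000 values, 10 bins: ", relative_bin_spread(large, bins=10))
print("1000 values, 100 bins:", relative_bin_spread(large, bins=100))
```

With about 10 values per bin on average, both the first and the third case should look comparably ragged, which is consistent with the 100-bin experiment described above.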

## Normal Distribution

1. As in the Uniform Distribution challenge, write a function that generates normally distributed random numbers.
1. Generate 1,000 normally distributed numbers with a mean of 10 and a standard deviation of 1.
1. Generate 1,000 normally distributed numbers with a mean of 10 and a standard deviation of 50.
1. Plot the distributions of the data generated.

You can check the expected output [here](https://drive.google.com/file/d/1ULdYD411SqkrlR9CqJJ7H8_Rt5T2GjLe/view?usp=sharing)

In [None]:
# your code here

import scipy.stats

def normaldist_random(mean, stdev, size):
    # loc is the mean and scale is the standard deviation
    return scipy.stats.norm.rvs(loc=mean, scale=stdev, size=size)

plot1 = normaldist_random(10, 1, 1000)
plot2 = normaldist_random(10, 50, 1000)


fig, [ax1, ax2] = plt.subplots(1,2, figsize = (10,5))

ax1.hist(plot1,bins=50)
ax1.set_title("Average of 10 and standard deviation of 1")

ax2.hist(plot2,bins=50)
ax2.set_title("Average of 10 and standard deviation of 50")

plt.show()

How are the two distributions different?

In [None]:
# your answer below
# They look similar because both were generated as normal distributions.
# Paolo: yes, but the standard deviation is different, no?
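
The difference in spread is easy to miss in the plots because each histogram rescales its own x axis. A minimal sketch, with arbitrary `random_state` values and variable names made up for this example, prints the summary statistics side by side:

```python
import numpy as np
from scipy.stats import norm

# Same mean, different standard deviations (the values from the task above).
narrow = norm.rvs(loc=10, scale=1, size=1000, random_state=1)
wide = norm.rvs(loc=10, scale=50, size=1000, random_state=2)

for label, sample in [("std=1", narrow), ("std=50", wide)]:
    print(label,
          "| mean:", round(sample.mean(), 2),
          "| std:", round(sample.std(ddof=1), 2),
          "| min:", round(sample.min(), 1),
          "| max:", round(sample.max(), 1))
```

Passing `sharex=True` to `plt.subplots` is another way to make the difference visible, since both histograms would then share one x axis.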

## Normal Distribution of Real Data

In this challenge we are going to take a look at real data. We will use the `vehicles.csv` file for this exercise.

First import `vehicles.csv` from [here](https://drive.google.com/file/d/1bNZgaQ-_Z9i3foO-OeB89x7kXJxm8xcC/view?usp=sharing), place it in the data folder and load it.


In [None]:
#your code here
vehicles = pd.read_csv("data/vehicles.csv") 
vehicles.head()

Then plot the histograms for the following variables:
1. Fuel Barrels/Year

In [None]:
# your code here
# Histogram

vehicles.hist(column='Fuel Barrels/Year', bins=50)
plt.show()

In [None]:
# https://machinelearningmastery.com/a-gentle-introduction-to-normality-tests-in-python/
# Another popular plot for checking the distribution of a data sample is the quantile-quantile plot, 
# Q-Q plot, or QQ plot for short.

# q-q plot
from statsmodels.graphics.gofplots import qqplot
qqplot(vehicles['Fuel Barrels/Year'], line='s')
plt.show()

In [None]:
# Shapiro-Wilk Test
from scipy.stats import shapiro

# normality test
stat, p = shapiro(vehicles['Fuel Barrels/Year'])
print('Statistics=%.3f, p=%.3f' % (stat, p))
# interpret
alpha = 0.05
if p > alpha:
    print('Sample looks Gaussian (fail to reject H0)')
else:
    print('Sample does not look Gaussian (reject H0)')
# Paolo: great work, although in this case (more than 5000 samples) the p-value may not be accurate
# (check shapiro function description)
# Same comments for similar tests below.
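
Following up on the sample-size remark, two possible workarounds are sketched below: a different normality test (`scipy.stats.normaltest`, the D'Agostino-Pearson test) and Shapiro-Wilk on a random subsample of fewer than 5000 rows. The data here is synthetic and right-skewed, a hypothetical stand-in for a large column; it is not the vehicles data, and the subsample size of 4000 is an arbitrary choice.

```python
import numpy as np
from scipy.stats import normaltest, shapiro

rng = np.random.default_rng(7)  # arbitrary seed

# Hypothetical stand-in for a large, right-skewed column (NOT the real dataset).
large_sample = rng.gamma(shape=4.0, scale=5.0, size=30_000)

# D'Agostino-Pearson test: based on the sample skewness and kurtosis.
stat, p = normaltest(large_sample)
print(f"normaltest: statistic={stat:.1f}, p={p:.3g}")

# Shapiro-Wilk on a random subsample that stays below 5000 observations.
subsample = rng.choice(large_sample, size=4000, replace=False)
w, p_sub = shapiro(subsample)
print(f"shapiro on subsample: W={w:.3f}, p={p_sub:.3g}")
```

With a real column, `large_sample` would be replaced by something like `vehicles['Fuel Barrels/Year'].to_numpy()`. With tens of thousands of rows, almost any small deviation from normality leads to rejection, so the Q-Q plot remains a useful companion to the test.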

2. CO2 Emission Grams/Mile 

In [None]:
# your code here
# Histogram
vehicles.hist(column='CO2 Emission Grams/Mile', bins=50)
plt.show()


In [None]:
# q-q plot
from statsmodels.graphics.gofplots import qqplot
qqplot(vehicles['CO2 Emission Grams/Mile'], line='s')
plt.show()

In [None]:
# Shapiro-Wilk Test
from scipy.stats import shapiro

# normality test
stat, p = shapiro(vehicles['CO2 Emission Grams/Mile'])
print('Statistics=%.3f, p=%.3f' % (stat, p))
# interpret
alpha = 0.05
if p > alpha:
    print('Sample looks Gaussian (fail to reject H0)')
else:
    print('Sample does not look Gaussian (reject H0)')

3. Combined MPG

In [None]:
# your code here
# Histogram
vehicles.hist(column='Combined MPG', bins=50)
plt.show()

In [None]:
# q-q plot
from statsmodels.graphics.gofplots import qqplot
qqplot(vehicles['Combined MPG'], line='s')
plt.show()

In [None]:
# Shapiro-Wilk Test
from scipy.stats import shapiro, anderson

# normality test
stat, p = shapiro(vehicles['Combined MPG'])
print('Statistics=%.3f, p=%.3f' % (stat, p))
# interpret
alpha = 0.05
if p > alpha:
    print('Sample looks Gaussian (fail to reject H0)')
else:
    print('Sample does not look Gaussian (reject H0)')

Which one(s) of the variables are nearly normally distributed? How do you know?

In [None]:
# your answer here
# Above, I conducted both visual normality checks and statistical normality tests to check whether the data is normally distributed.

# I plotted both histograms and Q-Q plots for each variable because
# histograms alone were not clear enough to draw conclusions.

# I also ran the Shapiro-Wilk test on each variable, and it rejected normality for all 3.

# So, based on both the visual checks and the statistical tests, none of the 3 variables is normally distributed.

# Reference: https://machinelearningmastery.com/a-gentle-introduction-to-normality-tests-in-python/
#Paolo: great work!

## Exponential Distribution

1. Using `numpy.random.exponential`, create a function that returns a list of exponentially distributed numbers with a mean of 10.

1. Use the function to generate two number sequences with sizes of 1 and 100.

1. Plot the distributions as histograms with 100 bins.

You can check the expected output [here](https://drive.google.com/file/d/1pybmhXeeG5Wzb69wfFv2J8JyR6t44mRi/view?usp=sharing)

In [None]:
# your code here
from scipy.stats import expon
import scipy.stats

def exponen_random(mean, size):
    return np.random.exponential(mean, size=size)

plot_expo1 = exponen_random(10,1)
plot_expo2 = exponen_random(10,100)

fig, [ax1, ax2] = plt.subplots(1,2, figsize = (10,5))

ax1.hist(plot_expo1,bins=100)
ax1.set_title("Exponential sequence with size 1")

ax2.hist(plot_expo2,bins=100)
ax2.set_title("Exponential sequence with size 100")

plt.show()

How are the two distributions different?

In [None]:
# your answer here

# I am not sure if the size 1 plot is correct.

# The two histograms look very different because of the sample sizes.
#Paolo:yes!
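
On the doubt about the size 1 plot: a sample with a single value can only fill one bin, so a lone bar is what one would expect. The sketch below, with an arbitrary seed and an extra size of 10,000 added purely for illustration, shows how the sample mean settles near 10 as the sample grows.

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary seed

# Sample mean for increasing sample sizes; 10_000 is an extra size added here.
means = {}
for size in (1, 100, 10_000):
    sample = rng.exponential(scale=10, size=size)
    means[size] = sample.mean()
    print(f"size={size:>6}: sample mean = {means[size]:.2f}")
```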

## Exponential Distribution of Real Data

Suppose that the amount of time one spends in a bank is exponentially distributed with mean as 10 minutes (i.e. λ = 1/10). What is the probability that a customer will spend less than fifteen minutes in the bank? 

Write Python code to solve this problem.
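
As a worked check, assuming the standard exponential CDF $F(x) = 1 - e^{-\lambda x}$ with $\lambda = 1/10$:

$$P(X < 15) = F(15) = 1 - e^{-15/10} = 1 - e^{-1.5} \approx 0.777$$

The probability of spending more than 15 minutes is the complement, $e^{-1.5} \approx 0.223$.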

In [None]:
# First I am plotting to see how the amount of time one spends in a bank 
# is exponentially distributed with mean as 10 minutes (i.e. λ = 1/10)

# pdf(x, loc=0, scale=1)
# here scale is 1/lambda, i.e. the mean (10 minutes)
# loc shifts the distribution; it is 0 here

# lambd = 0.1
x = np.arange(0, 100, 1)
y = expon.pdf(x, 0, 10)
fig, ax = plt.subplots(1, 1)
ax.plot(x,y)


In [None]:
# The probability that a customer will spend less than fifteen minutes in the bank

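# Hint: This is the same as P(X < 15), the exponential CDF evaluated at 15


lambd = 1/10  # rate for a mean of 10 minutes
p_less_than_15 = 1 - np.exp(-lambd * 15)  # CDF: 1 - exp(-lambda * x)
print(p_less_than_15)  # about 0.777

# Plot the CDF and mark x = 15
x = np.arange(0, 100, 1)
y = 1 - np.exp(-lambd * x)
fig, ax = plt.subplots(1, 1)
ax.plot(x,y)
ax.axvline(15, linestyle='--')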

What is the probability that the customer will spend more than 15 minutes?

In [None]:
# your answer here

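lambd = 1/10  # rate for a mean of 10 minutes
p_more_than_15 = np.exp(-lambd * 15)  # P(X > 15) = 1 - CDF(15)
print(p_more_than_15)  # about 0.223

# Plot P(X > x) and mark x = 15
x = np.arange(0, 100, 1)
y = np.exp(-lambd * x)
fig, ax = plt.subplots(1, 1)
ax.plot(x,y)
ax.axvline(15, linestyle='--')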


In [None]:
#Paolo: what is  your answer to the question though?

## Central Limit Theorem

A delivery company needs an average of 35 minutes to deliver a package, with a standard deviation of 8 minutes. Suppose that in one day, they deliver 200 packages.

**Hint**: `stats.norm.cdf` can help you find the answers.

#### Step 1: What is the probability that the mean delivery time today is between 30 and 35 minutes?

In [None]:
# your code here

#cdf(x, loc=0, scale=1)
# The location (`loc`) keyword specifies the mean.
# The scale (`scale`) keyword specifies the standard deviation.

population_mean = 35
population_stdev = 8

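sample_size = 200
sample_mean = population_mean  # expected value of the sample mean
# By the CLT, the sample mean is approximately normal with this standard error
standard_error = population_stdev/np.sqrt(sample_size)
print(standard_error)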

# probability that the mean delivery time today is between 30 and 35 minutes 
norm.cdf(35, population_mean, standard_error ) - norm.cdf(30, population_mean, standard_error)
#Paolo: fantastic work!

#### Step 2: What is the probability that in total, it takes more than 115 hours to deliver all 200 packages?

In [None]:
# your code here

# Mean delivery time per package (in minutes) if all 200 packages take 115 hours in total
sample_mean = (115*60)/200
print(sample_mean)


# probability that in total, it takes more than 115 hours to deliver all 200 packages
1 - norm.cdf(sample_mean, population_mean, standard_error) 
#Paolo: fantastic work!
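
Both answers can be sanity-checked with a simulation. The exercise does not say how individual delivery times are distributed, so the sketch below assumes a gamma distribution with mean 35 and standard deviation 8. That choice, the seed, and the number of simulated days are all assumptions made for illustration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(11)  # arbitrary seed
mean, stdev, n_packages, n_days = 35, 8, 200, 20_000

# ASSUMPTION: individual delivery times follow a gamma distribution
# whose shape and scale are chosen to match mean=35 and stdev=8.
shape = (mean / stdev) ** 2
scale = stdev ** 2 / mean
daily_means = rng.gamma(shape, scale, size=(n_days, n_packages)).mean(axis=1)

# Simulated probabilities for Step 1 and Step 2.
sim_step1 = np.mean((daily_means > 30) & (daily_means < 35))
sim_step2 = np.mean(daily_means * n_packages > 115 * 60)

# CLT approximation, as in the cells above.
se = stdev / np.sqrt(n_packages)
clt_step1 = norm.cdf(35, mean, se) - norm.cdf(30, mean, se)
clt_step2 = 1 - norm.cdf(115 * 60 / n_packages, mean, se)

print(f"Step 1: simulated {sim_step1:.3f}, CLT {clt_step1:.3f}")
print(f"Step 2: simulated {sim_step2:.3f}, CLT {clt_step2:.3f}")
```

If the simulated and CLT values are close, the normal approximation for the mean of 200 deliveries looks reasonable, at least under the assumed gamma distribution.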