<a href="https://colab.research.google.com/github/hewp84/ENGR390/blob/main/Lab_Normal_Distribution.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ENGR 390 Lab Practice: Modeling data using Normal Distribution

**Objective:**

The objective of this lab practice is to introduce you to probability distributions, specifically continuous probability distributions. You will learn how to fit a probability distribution to a dataset and visualize it using Python in the Google Colab environment.

**Background:**

In statistics, a probability distribution describes how the values of a random variable are distributed. Probability distributions can be classified into two main types: discrete and continuous. In this lab practice, we will focus on continuous probability distributions.

A continuous probability distribution is characterized by a continuous random variable, which can take any value within a certain range. Examples of continuous probability distributions include the normal distribution, exponential distribution, and uniform distribution.
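For a continuous random variable, probabilities are assigned to intervals rather than to single values. The sketch below illustrates this with SciPy's `uniform` distribution on [0, 1]; the interval endpoints are arbitrary values chosen for illustration.

```python
from scipy.stats import uniform

# Standard uniform distribution on [0, 1] (SciPy defaults: loc=0, scale=1)
a, b = 0.2, 0.5  # arbitrary interval endpoints for illustration

# P(a < X < b) is the area under the density between a and b
p_interval = uniform.cdf(b) - uniform.cdf(a)
print("P(0.2 < X < 0.5) =", p_interval)

# A single exact value corresponds to an interval of zero width,
# so its probability is zero for a continuous variable
p_point = uniform.cdf(a) - uniform.cdf(a)
print("P(X = 0.2) =", p_point)
```

The same `cdf` difference is used later in this lab to compute interval probabilities for the normal distribution.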

**Normal Distribution:**
The normal distribution, also known as the Gaussian distribution, is one of the most widely encountered probability distributions in statistics. It is characterized by a bell-shaped curve that is symmetric around the mean. The mean, median, and mode of a normal distribution are equal, and the distribution is defined by two parameters: the mean (μ) and the standard deviation (σ).

![taken from wikipedia](https://upload.wikimedia.org/wikipedia/commons/thumb/7/74/Normal_Distribution_PDF.svg/1200px-Normal_Distribution_PDF.svg.png)
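The bell-shaped curve is given by the density f(x) = 1 / (σ√(2π)) · exp(−(x − μ)² / (2σ²)). As a minimal sketch, the code below evaluates this formula by hand and compares it with `scipy.stats.norm.pdf`; the values of `mu_demo` and `sigma_demo` are hypothetical and chosen only for illustration.

```python
import numpy as np
from scipy.stats import norm

mu_demo, sigma_demo = 0.0, 1.0      # hypothetical parameters for illustration
x_demo = np.linspace(-3, 3, 7)      # evaluation points: -3, -2, ..., 3

# Density formula written out term by term
pdf_manual = np.exp(-((x_demo - mu_demo) ** 2) / (2 * sigma_demo ** 2)) / (
    sigma_demo * np.sqrt(2 * np.pi)
)

# Same density evaluated by SciPy
pdf_scipy = norm.pdf(x_demo, mu_demo, sigma_demo)

print("Formulas agree:", np.allclose(pdf_manual, pdf_scipy))
print("Peak is at x =", x_demo[np.argmax(pdf_manual)])
```

The peak sits at the mean, and the values on either side of it mirror each other, which reflects the symmetry described above.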

## Procedure

### Step 0: Import essential libraries

In [None]:
#Importing Libraries
import pandas as pd
from scipy.stats import norm
import matplotlib.pyplot as plt
import numpy as np

### Step 1: Upload Dataset

* Download or prepare a dataset in CSV format containing numerical data.
* Upload the dataset to Google Colab using the file upload feature.



In [None]:
# Step 1: Upload CSV file and load data
# This lab assumes the uploaded CSV file is named 'bmi_data.csv' (the name used in Step 2)
from google.colab import files
uploaded = files.upload()

### Step 2: Load and Explore the Dataset

* Use the pandas library to load the dataset into a DataFrame.
* Explore the dataset to understand its structure and characteristics.
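Beyond `head()`, a few other pandas calls are commonly used to explore a dataset. The sketch below uses a small made-up DataFrame (`demo`, with hypothetical column names `category` and `bmi`) in place of the uploaded file, so it can run on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the uploaded CSV; names and values are made up
demo = pd.DataFrame({
    "category": ["low", "high", "high", "low", "high"],
    "bmi": [21.4, 31.2, np.nan, 24.8, 29.5],
})

print(demo.shape)        # (number of rows, number of columns)
print(demo.dtypes)       # data type of each column
print(demo.describe())   # summary statistics for numeric columns

# Count missing values per column; these may need handling before fitting
missing = demo.isna().sum()
print(missing)
```

With your own data, replace `demo` with the DataFrame returned by `pd.read_csv`.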

In [None]:
# Step 2: Load the data
data = pd.read_csv('bmi_data.csv') # Make sure you change the name of the file when using a different csv file
data.head()

### Step 3: Extract and filter the data to analyze

* Choose the column you want to work with and analyze
* Filter your data, if needed, using boolean masks
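Boolean masks can also be combined. As a minimal sketch, assuming a made-up DataFrame `demo` with hypothetical columns `category` and `bmi`, the code below keeps only the rows that satisfy two conditions and then extracts the numeric column to analyze.

```python
import pandas as pd

# Hypothetical data for illustration; names and values are made up
demo = pd.DataFrame({
    "category": ["low", "high", "high", "low", "high"],
    "bmi": [21.4, 31.2, 28.9, 24.8, 33.0],
})

# Combine conditions with & (and) or | (or); each condition needs parentheses
mask = (demo["category"] == "high") & (demo["bmi"] >= 30)

# Apply the mask and select the column of interest in one step
selected = demo.loc[mask, "bmi"]
print(selected)
```

The resulting Series can be used in place of the unfiltered column in the later steps.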

In [None]:
# Step 3: Extract the data from a column
column_name = data.columns[4]  # Index is zero-based: use the column number minus 1. Index 4 selects the 5th column
column_data = data[column_name]

Boolean masking:

In [None]:
#Step 3a (Optional): Filtering data using boolean masks

# Define your boolean mask based on the values in the string column
# For example, let's assume you have a column named 'category' and you want to filter rows where 'category' is 'high'
boolean_mask = data['category'] == 'high'

# Apply the boolean mask to filter the dataset
filtered_data = data[boolean_mask]

### Step 4: Fit a Probability Distribution

* Choose a column from the dataset to analyze.
* Fit a probability distribution (e.g., normal distribution) to the selected data column using suitable statistical libraries.
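For the normal distribution, `norm.fit` returns maximum likelihood estimates: the sample mean and the standard deviation computed with a divisor of n (not n − 1). The sketch below checks this on a small made-up sample; the values in `sample` are hypothetical.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical sample for illustration
sample = np.array([22.1, 24.5, 27.3, 30.8, 26.4, 23.9, 28.7])

# Maximum likelihood fit of a normal distribution
mu_fit, std_fit = norm.fit(sample)

print("Fitted mean:", mu_fit, "| sample mean:", sample.mean())
print("Fitted std: ", std_fit, "| std with ddof=0:", sample.std(ddof=0))
print("Sample std with ddof=1 (slightly larger):", sample.std(ddof=1))
```

For large datasets the difference between the two standard deviations is negligible, but it can matter for small samples.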

In [None]:
# Step 4: Fit a probability distribution
# For example, fitting a normal distribution
mu, std = norm.fit(column_data)

# You will need these variables (mu, std) in Step 6

### Step 5: Visualize the Fitted Distribution

* Generate points from the fitted distribution.
* Plot the histogram of the data and the fitted distribution.
* Check that your data are reasonably well described by a normal distribution. Otherwise, pick a different distribution or analyze a different variable.
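Besides comparing the histogram and the curve by eye, one possible numerical check is the correlation coefficient from a normal probability (Q-Q) plot, which `scipy.stats.probplot` computes. This is only a sketch of one option, not a required part of the lab; both samples below are randomly generated for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two synthetic samples: one normal, one strongly right-skewed
normal_sample = rng.normal(loc=27, scale=4, size=500)
skewed_sample = rng.exponential(scale=4, size=500)

# probplot returns (quantiles), (slope, intercept, r);
# r close to 1 means the sample quantiles line up with normal quantiles
_, (_, _, r_normal) = stats.probplot(normal_sample, dist="norm")
_, (_, _, r_skewed) = stats.probplot(skewed_sample, dist="norm")

print("r for normal sample:", round(r_normal, 4))
print("r for skewed sample:", round(r_skewed, 4))
```

A noticeably lower `r` suggests that the normal distribution may not describe the data well.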

In [None]:
# Step 5: Generate points from the fitted distribution
xmin = column_data.min()
xmax = column_data.max()
x = np.linspace(xmin, xmax, len(column_data))
y = norm.pdf(x, mu, std)

In [None]:
# Step 5a: Plot the data and the fitted distribution
plt.hist(column_data, bins=20, density=True, alpha=0.6, color='g', label='Data (r.v. X)') # Plot histogram of data
plt.plot(x, y, 'r', linewidth=2, label='Fitted Distribution') # Plot fitted distribution
plt.legend() # Uses the label given to each plot element
plt.xlabel('Name of the random variable X')
plt.ylabel('Probability Density')
plt.title('Fitted Probability Distribution')
plt.show()

# If the red curve is completely different from the green area, change distribution or random variable

### Step 6: Interpretation and Analysis

* Interpret the results of the fitted distribution.
* Analyze the characteristics of the dataset and the fitted distribution.
* Discuss any insights or observations derived from the analysis.

In this section, you formulate your own research questions and answer them with code.


### Example:

**BMI introduction**

BMI is a measurement of a person's leanness or corpulence based on their height and weight, and is intended to quantify tissue mass. It is widely used as a general indicator of whether a person has a healthy body weight for their height. Specifically, the BMI value is used to categorize a person as underweight, normal weight, overweight, or obese, depending on the range in which the value falls. These ranges vary based on factors such as region and age, and are sometimes further divided into subcategories such as severely underweight or very severely obese. Being overweight or underweight can have significant health effects, so while BMI is an imperfect measure of healthy body weight, it is a useful indicator of whether any additional testing or action is required. Refer to the table below for the BMI categories used in this example.
![from NCBI website](https://www.ncbi.nlm.nih.gov/books/NBK551660/bin/bmi__WHO.jpg)

What is the probability of someone being found in the 'Obese class I' weight status?

In [None]:
# Obese class I covers the range 30 < X < 34.9

# Recall mean and std calculated in Step 4
# P(X<34.9)
x1 = 34.9    # Change this value to suit your research question
p1 = norm.cdf(x1, mu, std)
print(p1)

# P(X<30)
x2 = 30      # Change this value to suit your research question
p2 = norm.cdf(x2, mu, std)
print(p2)

# P(30 < X < 34.9)
prob = p1 - p2
print('P(30 < X < 34.9)= ',prob)

Calculate the BMI interval that encompasses the lower 10% of the sample population based on the inverse normal distribution.

In [None]:
# Now performing inverse operation using normal distribution.
# We have the probability and we are looking for the X value.

prob_1 = 0.10 # Change this value to suit your research question
x_1 = norm.ppf(prob_1, mu, std)

print("BMI interval such that the probability is", prob_1, "is between 0 and ", x_1)

Try it yourself: Come up with different questions and answer them by reusing the code above.
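Two further question types that follow the same pattern are an upper-tail probability and a central interval. The sketch below shows both, assuming hypothetical values for `mu_demo` and `std_demo`; in your own analysis, use the `mu` and `std` obtained in Step 4 instead.

```python
from scipy.stats import norm

mu_demo, std_demo = 27.0, 4.0   # hypothetical parameters for illustration

# Upper tail: P(X > 30), using the survival function sf = 1 - cdf
p_above = norm.sf(30, mu_demo, std_demo)
print("P(X > 30) =", p_above)

# Central 90% interval: cut off 5% in each tail with the inverse CDF
lower = norm.ppf(0.05, mu_demo, std_demo)
upper = norm.ppf(0.95, mu_demo, std_demo)
print("Central 90% of values lie between", lower, "and", upper)

# Sanity check: the probability between the two bounds is 0.90
coverage = norm.cdf(upper, mu_demo, std_demo) - norm.cdf(lower, mu_demo, std_demo)
print("Coverage =", coverage)
```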

In [None]:
#Write your code below:
