
# Lesson 33: Probability

---

### Liver disease Dataset

**Context**

The number of patients with liver disease has been increasing continuously because of excessive consumption of alcohol, inhalation of harmful gases, and intake of contaminated food, pickles and drugs.



Here we have a dataset of patients who were tested for liver disease: 416 patients tested positive and 167 tested negative. The data was collected from the North East of Andhra Pradesh, India. We need to find the probability of a patient having the disease. We also need to find the probability of a patient having the disease given that the patient is

- a juvenile i.e. a person whose age is less than 18 years.

- an adult i.e. a person whose age is greater than or equal to 18 years but less than 50 years.

- an elderly i.e. a person whose age is at least 50 years.




Use these patient records to determine which patients have liver disease and which ones do not.

---

#### Data Description

The dataset contains the following columns (or features):

1. `Age`: Age of the patient. Any patient whose age exceeded 89 is listed as being of age "90".

2. `Gender`: Gender of the patient

3. `Total_Bilirubin`: Total Bilirubin

4. `Direct_Bilirubin`: Direct Bilirubin

5. `Alkaline_Phosphotase`: Alkaline Phosphatase

6. `Alamine_Aminotransferase`: Alanine Aminotransferase

7. `Aspartate_Aminotransferase`: Aspartate Aminotransferase

8. `Total_Protiens`: Total Proteins

9. `Albumin`: Albumin

10. `Albumin_and_Globulin_Ratio`: Ratio of Albumin to Globulin

11. `Dataset`: Whether a patient has the liver disease or not.
 `1` means a patient has the liver disease and `2` means a patient does not have the liver disease.

Link: https://www.kaggle.com/datasets/uciml/indian-liver-patient-records/data

#### Acknowledgements

This dataset was downloaded from the UCI ML Repository:

Lichman, M. (2013). [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/ILPD+(Indian+Liver+Patient+Dataset)). Irvine, CA: University of California, School of Information and Computer Science.

---

#### Activity 1:  What is Probability?

*Probability of an outcome or an event is defined as the ratio of the number of favourable outcomes to the total number of possible outcomes.*

Consider throwing a die.

It has six sides labelled as 1, 2, 3, 4, 5 and 6. So whenever you throw a die you will get either 1, 2, 3, 4, 5 or 6.

Suppose you are playing a game in which you need to get $6$ when you roll a die to win the game. So $6$ is your favourable outcome to win the game. Hence, the probability of getting $6$ is **one over six**, i.e. $\frac{1}{6}$, because there is only **one** favourable outcome (i.e. getting $6$ on the die) out of six possible outcomes.

Mathematically, the probability of an outcome or an event $E$ is given by $$P(E) = \frac{n(E)}{n(S)}$$

or

$$P(E) = \frac{\text{Number of favourable outcomes}}{\text{Total number of possible outcomes}}$$

where

- $E$ is a set containing favourable outcomes

- $n(E)$ is the number of items contained in the set $E$

- $S$ is a set of all possible outcomes which is also known as **sample space**

- $n(S)$ is the number of items contained in the set $S$

In the game of getting $6$ to win the game, the set of favourable outcome(s) is $E = \{6\}$ and the set of possible outcomes is $S = \{1, 2, 3, 4, 5, 6\}$

Hence, the probability of the outcome of getting $6$ is $\frac{1}{6}$

Now suppose a player can take out their pawn only if they get an even number. In this case, the set of favourable outcomes becomes $E = \{2, 4, 6\}$

So the probability of taking out the pawn or in other words, the probability of the outcome of getting either 2 or 4 or 6 is $\frac{3}{6} = \frac{1}{2}$ because there are three items in the set $E$.

**Note:**

- Probability is just a measure of finding out which events are more likely to occur. It does not mean that the highly likely events will definitely occur. The probability of an event which definitely will occur is 1. Similarly, the probability of an event which will definitely NOT occur is 0.

- The sum of the probability of occurrence of an event and the probability of that event not occurring is always 1.
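
The definition above can be sketched in a few lines of Python. This is a minimal illustration rather than part of the original lesson: the helper name `probability` is hypothetical, and it assumes that every outcome in the sample space is equally likely.

```python
from fractions import Fraction

def probability(event, sample_space):
    """Return P(E) = n(E) / n(S), assuming equally likely outcomes."""
    sample_space = set(sample_space)
    event = set(event) & sample_space  # ignore outcomes that cannot occur
    return Fraction(len(event), len(sample_space))

S = {1, 2, 3, 4, 5, 6}               # sample space of a single die roll
p_six = probability({6}, S)          # favourable outcome: getting 6
p_even = probability({2, 4, 6}, S)   # favourable outcomes: even numbers
p_not_six = probability(S - {6}, S)  # the event "not getting 6"

print(p_six, p_even, p_six + p_not_six)
```

Using `Fraction` keeps the results exact, so the last value illustrates the second note: the probability of an event and the probability of it not occurring add up to 1.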



---

#### Activity 2: Data Preparation

Let's prepare the dataset for analysis by:

- Treating the null values (if there are any)

- Renaming the `Dataset` column with the name `Disease`, the `Alkaline_Phosphotase` column with the name `Alkaline_Phosphatase` and `Alamine_Aminotransferase` column with the name `Alanine_Aminotransferase`

- Labelling each patient as a juvenile, an adult or an elderly based on their age

- Encoding (or converting) the `Male` and `Female` values for the `Gender` column to the numeric values, i.e., `0` and `1`

In [None]:
#Mounting the google drive
from google.colab import drive
drive.mount('/content/drive/')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


In [None]:
# Import the modules, read the dataset and create a Pandas DataFrame.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
liver_df = pd.read_csv("/content/drive/MyDrive/datasets/indian_liver_patient.csv")
liver_df.head()

Unnamed: 0,Age,Gender,Total_Bilirubin,Direct_Bilirubin,Alkaline_Phosphotase,Alamine_Aminotransferase,Aspartate_Aminotransferase,Total_Protiens,Albumin,Albumin_and_Globulin_Ratio,Dataset
0,65,Female,0.7,0.1,187,16,18,6.8,3.3,0.9,1
1,62,Male,10.9,5.5,699,64,100,7.5,3.2,0.74,1
2,62,Male,7.3,4.1,490,60,68,7.0,3.3,0.89,1
3,58,Male,1.0,0.4,182,14,20,6.8,3.4,1.0,1
4,72,Male,3.9,2.0,195,27,59,7.3,2.4,0.4,1


In [None]:
#Get the dataset information.
liver_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 583 entries, 0 to 582
Data columns (total 11 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Age                         583 non-null    int64  
 1   Gender                      583 non-null    object 
 2   Total_Bilirubin             583 non-null    float64
 3   Direct_Bilirubin            583 non-null    float64
 4   Alkaline_Phosphotase        583 non-null    int64  
 5   Alamine_Aminotransferase    583 non-null    int64  
 6   Aspartate_Aminotransferase  583 non-null    int64  
 7   Total_Protiens              583 non-null    float64
 8   Albumin                     583 non-null    float64
 9   Albumin_and_Globulin_Ratio  579 non-null    float64
 10  Dataset                     583 non-null    int64  
dtypes: float64(5), int64(5), object(1)
memory usage: 50.2+ KB


In [None]:
#Check for the null values.
liver_df.isnull().sum()

Unnamed: 0,0
Age,0
Gender,0
Total_Bilirubin,0
Direct_Bilirubin,0
Alkaline_Phosphotase,0
Alamine_Aminotransferase,0
Aspartate_Aminotransferase,0
Total_Protiens,0
Albumin,0
Albumin_and_Globulin_Ratio,4


In [None]:
# Treat the missing values by filling them with the median of the column.
median_ratio = liver_df["Albumin_and_Globulin_Ratio"].median()
liver_df["Albumin_and_Globulin_Ratio"] = liver_df["Albumin_and_Globulin_Ratio"].fillna(median_ratio)

In [None]:
# Check for the null values again.
liver_df["Albumin_and_Globulin_Ratio"].isnull().sum()

0

**Renaming a Column**

To rename a column in a Pandas DataFrame, use the `rename()` function. It requires a dictionary as an input to the `columns` parameter. The dictionary must contain the old column names and new column names as a key-value pair respectively.

**Syntax:** `dataframe.rename(columns=dictionary_of_new_and_old_names)`

where `dictionary_of_new_and_old_names` is a dictionary containing old column names and new column names as a key-value pair respectively.
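
As a small illustration of the syntax above, the sketch below renames one column of a toy DataFrame. The column names `old_a`, `old_b` and `new_a` are made up for this example. Note that `rename()` returns a new DataFrame and leaves the original untouched unless `inplace=True` is passed.

```python
import pandas as pd

# Toy DataFrame with hypothetical column names, used only for illustration.
toy_df = pd.DataFrame({"old_a": [1, 2], "old_b": [3, 4]})

# Keys are the old column names, values are the new column names.
renamed_df = toy_df.rename(columns={"old_a": "new_a"})

print(list(toy_df.columns))      # the original DataFrame is unchanged
print(list(renamed_df.columns))  # only the mapped column has a new name
```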

In [None]:
# Rename the 'Dataset' column with the name 'Disease'. Also, rename the 'Alkaline_Phosphotase' column with 'Alkaline_Phosphatase'
# Also, rename the 'Alamine_Aminotransferase' column with 'Alanine_Aminotransferase'
liver_df.rename(columns={
   "Dataset" : "Disease",
   "Alkaline_Phosphotase" : "Alkaline_Phosphatase",
   "Alamine_Aminotransferase" : "Alanine_Aminotransferase"
},inplace=True)


In [None]:
liver_df.columns

Index(['Age', 'Gender', 'Total_Bilirubin', 'Direct_Bilirubin',
       'Alkaline_Phosphatase', 'Alanine_Aminotransferase',
       'Aspartate_Aminotransferase', 'Total_Protiens', 'Albumin',
       'Albumin_and_Globulin_Ratio', 'Disease'],
      dtype='object')

**Labelling Age Group**

We need to label each patient as

- a juvenile i.e. a person whose age is less than 18 years.

- an adult i.e. a person whose age is greater than or equal to 18 years but less than 50 years.

- an elderly i.e. a person whose age is at least 50 years.

so that later we can find out the probability of a patient having the liver disease given that he/she is

- a juvenile

- an adult

- an elderly

In [None]:
#Activity: labelling
def label_a(ages):
    """Return a Series of age-group labels: 1 = juvenile, 2 = adult, 3 = elderly."""
    labels = []
    for age in ages:
        if age < 18:
            labels.append(1)
        elif age < 50:
            labels.append(2)
        else:
            labels.append(3)
    return pd.Series(labels)


In [None]:
label_a([5,8,90,8,34,76,44])

Unnamed: 0,0
0,1
1,1
2,3
3,1
4,2
5,3
6,2


In [None]:
label_a(liver_df["Age"])

In [None]:
# Add a new column to the 'liver_df' DataFrame and name it as Age_group

liver_df["Age_group"] = label_a(liver_df["Age"])

In [None]:
liver_df.columns

Index(['Age', 'Gender', 'Total_Bilirubin', 'Direct_Bilirubin',
       'Alkaline_Phosphatase', 'Alanine_Aminotransferase',
       'Aspartate_Aminotransferase', 'Total_Protiens', 'Albumin',
       'Albumin_and_Globulin_Ratio', 'Disease', 'Age_group'],
      dtype='object')

In [None]:
liver_df["Age_group"].value_counts()

Unnamed: 0_level_0,count
Age_group,Unnamed: 1_level_1
2,328
3,230
1,25


**Encoding Gender**

Encode (or convert) the `Male` and `Female` values for the `Gender` column to the numeric values, i.e., `0` and `1`. For this, we can use the `replace()` function.

In [None]:
# Find the number of male and female patients before encoding.
liver_df["Gender"].value_counts()

Unnamed: 0_level_0,count
Gender,Unnamed: 1_level_1
Male,441
Female,142


In [None]:
# Encode the 'Male' & 'Female' values for the 'Gender' column to the numeric values, i.e. 0 and 1.
# Assign the result back to the column instead of calling replace() with inplace=True
# on a column selection, which pandas warns will stop working in pandas 3.0.
liver_df["Gender"] = liver_df["Gender"].replace({
    "Male" : 0,
    "Female" : 1
}).astype(int)


In [None]:
# Find the number of male and female patients after encoding.
liver_df["Gender"].value_counts()

Unnamed: 0_level_0,count
Gender,Unnamed: 1_level_1
0,441
1,142


In [None]:
liver_df.head()

Unnamed: 0,Age,Gender,Total_Bilirubin,Direct_Bilirubin,Alkaline_Phosphatase,Alanine_Aminotransferase,Aspartate_Aminotransferase,Total_Protiens,Albumin,Albumin_and_Globulin_Ratio,Disease,Age_group
0,65,1,0.7,0.1,187,16,18,6.8,3.3,0.9,1,3
1,62,0,10.9,5.5,699,64,100,7.5,3.2,0.74,1,3
2,62,0,7.3,4.1,490,60,68,7.0,3.3,0.89,1,3
3,58,0,1.0,0.4,182,14,20,6.8,3.4,1.0,1,3
4,72,0,3.9,2.0,195,27,59,7.3,2.4,0.4,1,3


In [None]:
liver_df["Age_group"].value_counts()[1]

25

In [None]:
a = liver_df["Age_group"].value_counts()
a

Unnamed: 0_level_0,count
Age_group,Unnamed: 1_level_1
2,328
3,230
1,25


In [None]:
liver_df["Age_group"].value_counts()[2]

In [None]:
liver_df.shape[0]

583

In [None]:
liver_df["Age_group"].value_counts()[1]/liver_df.shape[0]

0.04288164665523156

---

#### Activity 3: Computing Probabilities

Let's answer the following questions:

1. What is the probability that a patient is a juvenile?

2. What is the probability that a patient is an adult?

3. What is the probability that a patient is an elderly?

4. What is the probability that a patient has the liver disease given that the patient is a juvenile?

5. What is the probability that a patient has the liver disease given that the patient is an adult?

6. What is the probability that a patient has the liver disease given that the patient is an elderly?

In [None]:
#Find the number of patients who are juveniles, adults and elderlies.
liver_df["Age_group"].value_counts()

Unnamed: 0_level_0,count
Age_group,Unnamed: 1_level_1
2,328
3,230
1,25


Juveniles are very few in number.

In [None]:
# What is the probability that a patient is a juvenile? What is the probability that a patient is an adult?
# What is the probability that a patient is an elderly?

age_counts = liver_df["Age_group"].value_counts()
total_patients = liver_df.shape[0]

pj = age_counts[1] / total_patients
pa = age_counts[2] / total_patients
pe = age_counts[3] / total_patients

print(pj, pa, pe)

0.04288164665523156 0.5626072041166381 0.39451114922813035


As the counts suggest, the probability that a patient in this dataset is an adult is the greatest amongst the three age groups.

The sum of the above three probabilities should be one.

In [None]:
#Calculate the sum of the above three probabilities.
pj + pa + pe

1.0


The remaining three probabilities

- What is the probability that a patient has the liver disease given that the patient is a juvenile?

- What is the probability that a patient has the liver disease given that the patient is an adult?

- What is the probability that a patient has the liver disease given that the patient is an elderly?

are **conditional probabilities** because they have a condition involved in them.

- In the first case, the condition is that a patient is a juvenile.

- In the second case, the condition is that a patient is an adult.

- In the third case, the condition is that a patient is an elderly.

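So to solve the first case, we restrict our attention to the juveniles and find the fraction of them having the disease. If $D$ is the set of patients having the disease and $J$ is the set of juveniles, then $$P(D \mid J) = \frac{n(D \cap J)}{n(J)}$$ Multiplying this conditional probability by the probability that a patient is a juvenile gives a different quantity: the **joint probability** $P(D \cap J) = P(D \mid J) \, P(J)$, i.e. the probability that a patient is a juvenile and has the disease.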
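
The relationship between a conditional probability and the corresponding joint probability can be sketched on toy data. The records below are made up for illustration and are not taken from the liver dataset. The variable names are likewise hypothetical.

```python
import pandas as pd

# Hypothetical toy records: Age_group 1 = juvenile, 2 = adult;
# Disease 1 = has the disease, 2 = does not have the disease.
toy_df = pd.DataFrame({
    "Age_group": [1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
    "Disease":   [1, 2, 2, 1, 1, 1, 1, 2, 1, 1],
})

is_juvenile = toy_df["Age_group"] == 1
has_disease = toy_df["Disease"] == 1

p_juvenile = is_juvenile.mean()                  # P(J): share of juveniles among all patients
p_joint = (is_juvenile & has_disease).mean()     # P(D and J): juvenile AND diseased, among all patients
p_disease_given_juvenile = p_joint / p_juvenile  # P(D | J) = P(D and J) / P(J)

# The same conditional probability, computed within the juvenile group only.
p_direct = has_disease[is_juvenile].mean()

print(p_juvenile, p_joint, p_disease_given_juvenile, p_direct)
```

The conditional probability is measured against the juveniles only, whereas the joint probability is measured against all patients, so the joint value is never larger than the conditional one.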





In [None]:
young_df = liver_df[liver_df['Age_group'] == 1]
young_df.shape

(25, 12)

In [None]:
young_df["Disease"].value_counts()

Unnamed: 0_level_0,count
Disease,Unnamed: 1_level_1
2,13
1,12


In [None]:
# Find the probability that a patient has the liver disease given that they are a juvenile.
# Also, find the probability that a patient doesn't have the liver disease given that they are a juvenile.
jdf = liver_df[liver_df['Age_group'] == 1]
jd = sum(jdf['Disease'] == 1)/jdf.shape[0]
jnd = sum(jdf['Disease'] == 2)/jdf.shape[0]

# Conditional probabilities: computed within the juvenile group only.
print("Probability that a patient has the liver disease given that they are a juvenile:", jd)
print("Probability that a patient DOES NOT have the liver disease given that they are a juvenile:", jnd)

# Joint probabilities: conditional probability times the probability of being a juvenile.
print("\nProbability that a patient is a juvenile AND has the liver disease:", pj*jd)
print("Probability that a patient is a juvenile AND does NOT have the liver disease:", pj*jnd)

Probability that a patient has the liver disease given that they are a juvenile: 0.48
Probability that a patient DOES NOT have the liver disease given that they are a juvenile: 0.52

Probability that a patient is a juvenile AND has the liver disease: 0.020583190394511147
Probability that a patient is a juvenile AND does NOT have the liver disease: 0.022298456260720412


In [None]:
#  Find the probability that a patient has the liver disease given that they are an adult.
# Also, find the probability that a patient doesn't have the liver disease given that they are an adult.

adf = liver_df[liver_df['Age_group'] == 2]
ad = sum(adf['Disease'] == 1)/adf.shape[0]
a_nd = sum(adf['Disease'] == 2)/adf.shape[0]

# Conditional probabilities: computed within the adult group only.
print("Probability that a patient has the liver disease given that they are an adult:", ad)
print("Probability that a patient DOES NOT have the liver disease given that they are an adult:", a_nd)

# Joint probabilities: conditional probability times the probability of being an adult.
print("\nProbability that a patient is an adult AND has the liver disease:", pa*ad)
print("Probability that a patient is an adult AND does NOT have the liver disease:", pa*a_nd)

Probability that a patient has the liver disease given that they are an adult: 0.698170731707317
Probability that a patient DOES NOT have the liver disease given that they are an adult: 0.3018292682926829

Probability that a patient is an adult AND has the liver disease: 0.3927958833619211
Probability that a patient is an adult AND does NOT have the liver disease: 0.169811320754717


In [None]:
# Find the probability that a patient has the liver disease given that they are an elderly.
# Also, find the probability that a patient doesn't have the liver disease given that they are an elderly.
edf = liver_df[liver_df['Age_group'] == 3]
ed = sum(edf['Disease'] == 1)/edf.shape[0]
e_nd = sum(edf['Disease'] == 2)/edf.shape[0]

# Conditional probabilities: computed within the elderly group only.
print("Probability that a patient has the liver disease given that they are an elderly:", ed)
print("Probability that a patient DOES NOT have the liver disease given that they are an elderly:", e_nd)

# Joint probabilities: conditional probability times the probability of being an elderly.
print("\nProbability that a patient is an elderly AND has the liver disease:", pe*ed)
print("Probability that a patient is an elderly AND does NOT have the liver disease:", pe*e_nd)

Probability that a patient has the liver disease given that they are an elderly: 0.7608695652173914
Probability that a patient DOES NOT have the liver disease given that they are an elderly: 0.2391304347826087

Probability that a patient is an elderly AND has the liver disease: 0.30017152658662094
Probability that a patient is an elderly AND does NOT have the liver disease: 0.09433962264150944


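You can also collect the cases of interest, e.g. the juvenile patients having the disease, using the `&` operator and then calculate the probabilities. Dividing their count by the total number of patients gives a joint probability; dividing it by the number of juveniles gives the conditional probability.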

In [None]:
# Find the joint probability that a patient is a juvenile and does NOT have the liver disease.
liver_df.loc[(liver_df['Age_group'] == 1) & (liver_df['Disease'] == 2)].shape[0]/ liver_df.shape[0]

0.022298456260720412

In [None]:
# Find two more joint probabilities: an adult having the disease and an elderly NOT having the disease.
print(liver_df.loc[(liver_df['Age_group'] == 2) & (liver_df['Disease'] == 1)].shape[0]/ liver_df.shape[0])
print(liver_df.loc[(liver_df['Age_group'] == 3) & (liver_df['Disease'] == 2)].shape[0]/ liver_df.shape[0])
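
All of these proportions can also be tabulated at once with `pd.crosstab()`. The sketch below uses made-up toy records rather than the liver dataset, and it is offered as one possible approach, not as the method this lesson follows. With `normalize="index"` every row sums to 1, so each row holds the conditional probabilities for one age group. With `normalize="all"` the whole table sums to 1, so each cell is a joint probability.

```python
import pandas as pd

# Hypothetical toy records: Age_group 1 = juvenile, 2 = adult;
# Disease 1 = has the disease, 2 = does not have the disease.
toy_df = pd.DataFrame({
    "Age_group": [1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
    "Disease":   [1, 2, 2, 1, 1, 1, 1, 2, 1, 1],
})

# Row g, column d holds P(Disease = d | Age_group = g).
conditional = pd.crosstab(toy_df["Age_group"], toy_df["Disease"], normalize="index")

# Row g, column d holds P(Age_group = g and Disease = d).
joint = pd.crosstab(toy_df["Age_group"], toy_df["Disease"], normalize="all")

print(conditional)
print(joint)
```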

---