 
# 1. Overview of the project:

## LIPID PANEL
<font color = red>A lipid panel is a common blood test that healthcare providers use to monitor and screen for your risk of cardiovascular disease. The panel includes four measurements of your cholesterol levels and a measurement of your triglycerides.</font>

## Problem Statement:
<font color = blue>To identify abnormalities and the potential risk of cardiovascular disease in patients by analyzing their lipid profile tests, and to train a model on the resulting dataset that predicts the risk for future patients so that necessary measures can be taken beforehand.</font>

## Five tests in a lipid panel:
A lipid panel measures five different types of lipids from a blood sample, including:
1. **Total cholesterol**: This is your overall cholesterol level — the combination of LDL-C, VLDL-C and HDL-C.
2. **Low-density lipoprotein (LDL) cholesterol**: This is the type of cholesterol that’s known as “bad cholesterol.” It can collect in your blood vessels and increase your risk of cardiovascular disease.
3. **Very low-density lipoprotein (VLDL) cholesterol**: This is a type of cholesterol that’s usually present in very low amounts when the blood sample is a fasting sample, since it mostly comes from food you’ve recently eaten. An increase in this type of cholesterol in a fasting sample may be a sign of abnormal lipid metabolism.
4. **High-density lipoprotein (HDL) cholesterol**: This is the type of cholesterol that’s known as “good cholesterol.” It helps decrease the buildup of LDL in your blood vessels.
5. **Triglycerides**: This is a type of fat from the food we eat. Excess amounts of triglycerides in your blood are associated with cardiovascular disease and pancreatic inflammation.

## Normal lipid panel results:
The optimal levels ***(measured in milligrams per deciliter of blood, mg/dL)*** for the four standard tests in a lipid panel are as follows:
1. **Total cholesterol**: Below 200 mg/dL.
2. **High-density lipoprotein (HDL) cholesterol**: Above 60 mg/dL.
3. **Low-density lipoprotein (LDL) cholesterol**: Below 100 mg/dL (For people who have diabetes: Below 70 mg/dL).
4. **Triglycerides**: Below 150 mg/dL.
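
The thresholds listed above can be expressed as a small lookup. The following is only a sketch: the dictionary layout and the function name `flag_non_optimal` are assumptions made for illustration and are not part of the project code.

```python
# Optimal thresholds (mg/dL) taken from the list above.
# The dictionary layout and function name are illustrative assumptions.
OPTIMAL_LIMITS = {
    "total_cholesterol": ("below", 200),
    "hdl": ("above", 60),
    "ldl": ("below", 100),
    "triglycerides": ("below", 150),
}


def flag_non_optimal(results, diabetic=False):
    """Return the names of measurements that fall outside the optimal range."""
    limits = dict(OPTIMAL_LIMITS)
    if diabetic:
        # Stricter LDL target for people who have diabetes.
        limits["ldl"] = ("below", 70)

    flagged = []
    for name, (direction, limit) in limits.items():
        value = results[name]
        is_optimal = value < limit if direction == "below" else value > limit
        if not is_optimal:
            flagged.append(name)
    return flagged


# Example values in mg/dL.
sample = {"total_cholesterol": 170.4, "hdl": 41.1, "ldl": 96.3, "triglycerides": 132.6}
print(flag_non_optimal(sample))                 # ['hdl']
print(flag_non_optimal(sample, diabetic=True))  # ['hdl', 'ldl']
```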

# 2. Generating Dataset
***Based on the above information the dataset is to be formulated from the raw data collected.***

## Approach for creating the dataset:
We have to determine the risk factor for cardiovascular diseases from the main 4 features ***(Total cholesterol,
High-density lipoprotein (HDL) cholesterol,
Low-density lipoprotein (LDL) cholesterol,
Triglycerides)***.<br><br>
**<font color='red'>BINARY CLASSIFICATION</font> is a suitable approach for creating a dataset that meets these requirements, by adding a class variable (column).**<br>
***In this column, 1 marks a patient as high risk, whereas 0 means the patient is below the borderline for the risk factor.***

### Creating two different class columns:
All 4 lipid features need to be taken into consideration to determine the risk factor. However, there are two types of cholesterol, i.e. ***good cholesterol*** and ***bad cholesterol***, and ***triglycerides*** measure the amount of fat, so they should be considered separately.<br>
Instead of creating a class column for each of the 4 features, it is better to use ratios of the features, since a ratio gives a single clear borderline for the risk factor while taking several features into account.<br><br>
Research into the cholesterol ratios commonly used for lipid tests showed that two ratios are prominent in predicting the **<font color='blue'>cardiovascular disease risk factors</font>**.<br>
These ratios are as follows:
1. **Triglyceride/HDL ratio**: A triglyceride/HDL ratio of **2** or less is considered ideal; **4** is high and **6** or greater is considered too high.
2. **Total Cholesterol/HDL ratio**: Ideally, the ratio should be below **4**. The lower this number is, the healthier a person’s cholesterol levels are.<br><br>
**<font color='red'>Taking these two ratios into consideration, two separate class columns have to be created, and two separate machine learning models will be built, one for each of these columns.</font>**
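
To make the two ratios concrete, here is a minimal sketch that computes them for a single patient. The function names are hypothetical, and the labels used for values that fall between the quoted anchor points (2, 4 and 6) are an assumption, since the text does not define them.

```python
def tg_hdl_category(triglycerides, hdl):
    """Place the triglyceride/HDL ratio into the bands quoted above.

    The text gives 2 or less as ideal, 4 as high and 6 or more as too high;
    how values between those anchor points are labelled is an assumption.
    """
    ratio = triglycerides / hdl
    if ratio <= 2:
        return ratio, "ideal"
    if ratio < 4:
        return ratio, "above ideal"
    if ratio < 6:
        return ratio, "high"
    return ratio, "too high"


def tc_hdl_is_ideal(total_cholesterol, hdl):
    """Total cholesterol/HDL ratio: ideally below 4."""
    return total_cholesterol / hdl < 4


ratio, label = tg_hdl_category(triglycerides=361.2, hdl=42.3)
print(round(ratio, 1), label)        # 8.5 too high
print(tc_hdl_is_ideal(173.0, 42.3))  # False
```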

### Getting started with dataset generation

#### Import the necessary libraries

In [1]:
import pandas as pd
import matplotlib.pyplot as plt

#### Read the raw data into a dataframe.

In [2]:
lipid_data = pd.read_csv("Book1.csv")

#### Read the first 5 rows to get an overview of our data. 

In [3]:
lipid_data.head()

Unnamed: 0,Name,Age,Sex,Test,Total Cholesterol Level,HDL Cholesterol,Triglyceridies,LDL Cholesterol,VLDL Cholesterol,TC/HDLC Ratio,LDC/HDLC Ratio
0,Bhimrao Nimbalkar,56.0,M,LIPID,170.4,41.1,132.6,96.3,27.0,4.1,2.343066
1,Uttam Prakash,53.0,M,LIPID,249.5,53.3,195.4,133.6,39.0,4.7,2.506567
2,Satish Magdum,58.0,M,LIPID,173.0,42.3,361.2,89.9,72.0,4.1,2.125296
3,Santosh Dayama,60.0,F,LIPID,230.1,48.1,140.36,121.35,28.0,4.8,2.522869
4,Vidya Bade,51.0,F,LIPID,169.5,38.6,241.5,83.0,48.0,4.4,2.150259


#### Look at the datatypes and handle missing data.

In [4]:
lipid_data.info()  # 114 rows in total; 'Age', 'VLDL Cholesterol' and 'TC/HDLC Ratio' contain nulls. The datatypes are as expected.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 114 entries, 0 to 113
Data columns (total 11 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Name                     114 non-null    object 
 1   Age                      112 non-null    float64
 2   Sex                      114 non-null    object 
 3   Test                     114 non-null    object 
 4   Total Cholesterol Level  114 non-null    float64
 5   HDL Cholesterol          114 non-null    float64
 6   Triglyceridies           114 non-null    float64
 7   LDL Cholesterol          114 non-null    float64
 8   VLDL Cholesterol         113 non-null    float64
 9   TC/HDLC Ratio            113 non-null    float64
 10  LDC/HDLC Ratio           114 non-null    float64
dtypes: float64(8), object(3)
memory usage: 9.9+ KB


In [5]:
lipid_data.isnull().sum()

Name                       0
Age                        2
Sex                        0
Test                       0
Total Cholesterol Level    0
HDL Cholesterol            0
Triglyceridies             0
LDL Cholesterol            0
VLDL Cholesterol           1
TC/HDLC Ratio              1
LDC/HDLC Ratio             0
dtype: int64

**From the above results we have found 3 columns that have missing values, i.e. <font color='brown'>['Age'], ['VLDL Cholesterol'] and ['TC/HDLC Ratio']</font>. Of these, only the ['TC/HDLC Ratio'] column can be filled in reliably, since it is the ratio of two values that are present in the data. So let's take a look at that first.**

In [6]:
lipid_data.loc[lipid_data['TC/HDLC Ratio'].isnull()]

Unnamed: 0,Name,Age,Sex,Test,Total Cholesterol Level,HDL Cholesterol,Triglyceridies,LDL Cholesterol,VLDL Cholesterol,TC/HDLC Ratio,LDC/HDLC Ratio
67,Sangita Kambale,48.0,F,LIPID,149.2,35.45,129.36,66.6,26.0,,1.878702


In [7]:
# 'TC/HDLC Ratio' is Total Cholesterol Level divided by HDL Cholesterol,
# so compute it from those two columns and insert it in place of the null value.

missing_ratio = lipid_data['TC/HDLC Ratio'].isnull()
Total_Cholesterol_Level = lipid_data.loc[missing_ratio,'Total Cholesterol Level'].values
HDL_Cholesterol = lipid_data.loc[missing_ratio,'HDL Cholesterol'].values
lipid_data.loc[missing_ratio,'TC/HDLC Ratio'] = Total_Cholesterol_Level/HDL_Cholesterol
# Let's take a look at the row now.
lipid_data.loc[lipid_data['Name']== 'Sangita Kambale']

Unnamed: 0,Name,Age,Sex,Test,Total Cholesterol Level,HDL Cholesterol,Triglyceridies,LDL Cholesterol,VLDL Cholesterol,TC/HDLC Ratio,LDC/HDLC Ratio
67,Sangita Kambale,48.0,F,LIPID,149.2,35.45,129.36,66.6,26.0,4.208745,1.878702
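
An alternative way to fill a derived column is to compute the ratio for every row and use it only where the stored value is missing. This is a sketch on a small illustrative frame (two rows copied from the outputs above), not a replacement for the cell above.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; the values mirror two rows shown above.
df = pd.DataFrame({
    "Total Cholesterol Level": [170.4, 149.2],
    "HDL Cholesterol": [41.1, 35.45],
    "TC/HDLC Ratio": [4.1, np.nan],
})

# Compute the ratio for every row, then use it only where the value is missing.
computed = df["Total Cholesterol Level"] / df["HDL Cholesterol"]
df["TC/HDLC Ratio"] = df["TC/HDLC Ratio"].fillna(computed)
print(df)
```

Existing values such as 4.1 are left untouched, so the rounding used in the raw data is preserved for rows that already had a ratio.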


**<font color='brown'>We will deal with the two remaining columns that have missing data (['Age'] and ['VLDL Cholesterol']) later, as they are not that significant for us now.</font>**

#### Drop the columns that are unnecessary.
We don't need the ['Test','VLDL Cholesterol','LDC/HDLC Ratio'] columns as of now, so let's drop these columns.

In [8]:
lipid_data.drop(['Test','VLDL Cholesterol','LDC/HDLC Ratio'],axis=1,inplace=True)
lipid_data.head()

Unnamed: 0,Name,Age,Sex,Total Cholesterol Level,HDL Cholesterol,Triglyceridies,LDL Cholesterol,TC/HDLC Ratio
0,Bhimrao Nimbalkar,56.0,M,170.4,41.1,132.6,96.3,4.1
1,Uttam Prakash,53.0,M,249.5,53.3,195.4,133.6,4.7
2,Satish Magdum,58.0,M,173.0,42.3,361.2,89.9,4.1
3,Santosh Dayama,60.0,F,230.1,48.1,140.36,121.35,4.8
4,Vidya Bade,51.0,F,169.5,38.6,241.5,83.0,4.4


#### Converting Text data into numerical categorical data.
The 'Sex' column contains text data, so let's convert it into numerical data. Pandas has a built-in function for this called **pd.get_dummies**. Let's make use of that.

In [9]:
dummy = pd.get_dummies(lipid_data['Sex'])
dummy

Unnamed: 0,F,F.1,M
0,0,0,1
1,0,0,1
2,0,0,1
3,1,0,0
4,1,0,0
...,...,...,...
109,0,0,1
110,0,0,1
111,0,0,1
112,1,0,0


**There is an issue with the text data: it shows three categories instead of two, and two of them look like 'F'. Let's map the values to proper labels first and then take a look at the issue.**
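
One way to see why a third category appears is to inspect the raw strings directly. The sketch below uses hypothetical values; the trailing space in `"F "` is an assumed cause and has not been confirmed from `Book1.csv`.

```python
import pandas as pd

# Hypothetical raw values; the trailing space in "F " is an assumed cause,
# not something confirmed from Book1.csv.
sex = pd.Series(["M", "F", "F ", "M"])

# unique() on the raw strings exposes differences that a rendered table hides.
print(sex.unique().tolist())  # ['M', 'F', 'F ']

# Stripping whitespace before mapping avoids unmatched keys, which
# Series.map would otherwise turn into NaN.
cleaned = sex.str.strip().map({"M": "Male", "F": "Female"})
print(cleaned.value_counts().sort_index().to_dict())  # {'Female': 2, 'Male': 2}
```

Without the `str.strip()` step, `map` returns NaN for any value that is not a key of the dictionary, which is how a stray variant turns into a missing value.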

In [10]:
lipid_data['Sex'] = lipid_data['Sex'].map({'M':'Male','F':'Female'})
lipid_data.head()

Unnamed: 0,Name,Age,Sex,Total Cholesterol Level,HDL Cholesterol,Triglyceridies,LDL Cholesterol,TC/HDLC Ratio
0,Bhimrao Nimbalkar,56.0,Male,170.4,41.1,132.6,96.3,4.1
1,Uttam Prakash,53.0,Male,249.5,53.3,195.4,133.6,4.7
2,Satish Magdum,58.0,Male,173.0,42.3,361.2,89.9,4.1
3,Santosh Dayama,60.0,Female,230.1,48.1,140.36,121.35,4.8
4,Vidya Bade,51.0,Female,169.5,38.6,241.5,83.0,4.4


In [11]:
# Now let's count the non-null values in the 'Sex' column after mapping.
lipid_data['Sex'].count()

113

In [12]:
# It shows a total of 113 entries whereas our total is 114, so one value is missing:
# map() returns NaN for any value that is not a key in the mapping. Let's inspect it.
lipid_data.loc[lipid_data['Sex'].isnull()]

Unnamed: 0,Name,Age,Sex,Total Cholesterol Level,HDL Cholesterol,Triglyceridies,LDL Cholesterol,TC/HDLC Ratio
61,Anushka Gaikwad,30.0,,129.1,34.78,81.65,78.0,3.7


In [13]:
# The extra category in the earlier dummy output looked like 'F', and the name also suggests the patient is female. So let's fill in the value.
lipid_data = lipid_data.fillna({'Sex':'Female'})
# Let's get the categorical columns now.
dummy = pd.get_dummies(lipid_data['Sex'])
dummy

Unnamed: 0,Female,Male
0,0,1
1,0,1
2,0,1
3,1,0
4,1,0
...,...,...
109,0,1
110,0,1
111,0,1
112,1,0


In [14]:
# Now let's add these columns to our dataframe and drop the previous 'Sex' column.
lipid_data = pd.concat((lipid_data,dummy),axis=1)
lipid_data.drop('Sex',axis=1,inplace=True)

#### Generating the required class columns
<font color='brown'>We have to create 2 class columns as follows:</font><br>
1. **Risk Factor 1**: This represents the risk factor of patients having high <font color='purple'>TC/HDLC Ratio.</font>
2. **Risk Factor 2**: This represents the risk factor of patients having high <font color='brown'>Triglyceridies/HDL Cholesterol</font> ratio.

In [15]:
# Risk Factor 1 column.
lipid_data['Risk Factor 1'] = 1  # Set the value of all records to 1.

# Set the value to 0 for patients whose ratio falls below the borderline of 4.0.
lipid_data.loc[lipid_data['TC/HDLC Ratio']<4.0,'Risk Factor 1'] = 0
lipid_data.head()

Unnamed: 0,Name,Age,Total Cholesterol Level,HDL Cholesterol,Triglyceridies,LDL Cholesterol,TC/HDLC Ratio,Female,Male,Risk Factor 1
0,Bhimrao Nimbalkar,56.0,170.4,41.1,132.6,96.3,4.1,0,1,1
1,Uttam Prakash,53.0,249.5,53.3,195.4,133.6,4.7,0,1,1
2,Satish Magdum,58.0,173.0,42.3,361.2,89.9,4.1,0,1,1
3,Santosh Dayama,60.0,230.1,48.1,140.36,121.35,4.8,1,0,1
4,Vidya Bade,51.0,169.5,38.6,241.5,83.0,4.4,1,0,1


In [16]:
#Take a look at the count of the 1's and 0's.
lipid_data['Risk Factor 1'].value_counts()

1    60
0    54
Name: Risk Factor 1, dtype: int64

In [17]:
#Risk Factor 2 column.

# First we need to generate the 'Triglyceridies'/'HDL Cholesterol' ratio column. Let's do that.
lipid_data['Trigly/HDL'] = lipid_data['Triglyceridies']/lipid_data['HDL Cholesterol']
lipid_data['Trigly/HDL'] = lipid_data['Trigly/HDL'].round(1)

# Now perform the same steps as in the previous cell with the borderline value set to 3.5.
lipid_data['Risk Factor 2'] = 1
lipid_data.loc[lipid_data['Trigly/HDL']<3.5,'Risk Factor 2'] = 0
lipid_data['Risk Factor 2'].value_counts()

0    70
1    44
Name: Risk Factor 2, dtype: int64
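
Both class columns follow the same pattern, so the two cells above could be folded into one helper. This is a sketch only: `add_risk_flag` is a hypothetical name and the demo frame holds made-up rows. Note one difference from the cells above: a missing ratio would be labelled 0 here, whereas the default-to-1 approach above would label it 1.

```python
import pandas as pd


def add_risk_flag(df, ratio_col, flag_col, borderline):
    """Add a 0/1 column: 1 when the ratio is at or above the borderline."""
    out = df.copy()
    out[flag_col] = (out[ratio_col] >= borderline).astype(int)
    return out


# Made-up demo rows, including values that sit exactly on each borderline.
demo = pd.DataFrame({"TC/HDLC Ratio": [4.1, 3.7, 4.0],
                     "Trigly/HDL": [3.2, 8.5, 3.5]})
demo = add_risk_flag(demo, "TC/HDLC Ratio", "Risk Factor 1", 4.0)
demo = add_risk_flag(demo, "Trigly/HDL", "Risk Factor 2", 3.5)
print(demo)
```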

#### Rearranging the position of columns.

In [18]:
lipid_data.head()

Unnamed: 0,Name,Age,Total Cholesterol Level,HDL Cholesterol,Triglyceridies,LDL Cholesterol,TC/HDLC Ratio,Female,Male,Risk Factor 1,Trigly/HDL,Risk Factor 2
0,Bhimrao Nimbalkar,56.0,170.4,41.1,132.6,96.3,4.1,0,1,1,3.2,0
1,Uttam Prakash,53.0,249.5,53.3,195.4,133.6,4.7,0,1,1,3.7,1
2,Satish Magdum,58.0,173.0,42.3,361.2,89.9,4.1,0,1,1,8.5,1
3,Santosh Dayama,60.0,230.1,48.1,140.36,121.35,4.8,1,0,1,2.9,0
4,Vidya Bade,51.0,169.5,38.6,241.5,83.0,4.4,1,0,1,6.3,1


All the **features** should be on the **left side** and the **class variables (columns)** should be on the **right**. Pandas has a built-in method, **.reindex**, for this, but instead of <font color ='brown'>passing the names of all the columns as an argument</font> we will rearrange the columns by slicing the list of column names.

In [19]:
# First, let's drop the 'Name' column.
lipid_data.drop('Name',axis=1,inplace=True)

# Save the column names in a list.
cols = list(lipid_data.columns)

# Rearrange by slicing and indexing: Age, the Female/Male dummies, the lipid values and both ratios, then the two class columns.
lipid_data = lipid_data[[cols[0]]+cols[6:8]+cols[1:6]+[cols[9]]+cols[8:11:2]]
lipid_data.head()

Unnamed: 0,Age,Female,Male,Total Cholesterol Level,HDL Cholesterol,Triglyceridies,LDL Cholesterol,TC/HDLC Ratio,Trigly/HDL,Risk Factor 1,Risk Factor 2
0,56.0,0,1,170.4,41.1,132.6,96.3,4.1,3.2,1,0
1,53.0,0,1,249.5,53.3,195.4,133.6,4.7,3.7,1,1
2,58.0,0,1,173.0,42.3,361.2,89.9,4.1,8.5,1,1
3,60.0,1,0,230.1,48.1,140.36,121.35,4.8,2.9,1,0
4,51.0,1,0,169.5,38.6,241.5,83.0,4.4,6.3,1,1
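
Positional slicing breaks if a column is added or removed earlier in the notebook. A name-based variant is sketched below on a toy frame; it keeps the features in their existing relative order and only moves the class columns to the end, so it does not reproduce the exact ordering chosen above.

```python
import pandas as pd

# Toy frame with the same kind of layout: class columns mixed with features.
df = pd.DataFrame({
    "Age": [56.0], "TC/HDLC Ratio": [4.1], "Female": [0], "Male": [1],
    "Risk Factor 1": [1], "Trigly/HDL": [3.2], "Risk Factor 2": [0],
})

class_cols = ["Risk Factor 1", "Risk Factor 2"]
feature_cols = [c for c in df.columns if c not in class_cols]

# Features keep their relative order; class columns go last.
df = df[feature_cols + class_cols]
print(list(df.columns))
```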


#### Saving the final dataset.
Now let's save the dataset in CSV format.

In [20]:
lipid_data.to_csv("lipid_data.csv",index=False)
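
The `index=False` argument matters when the file is read back later. The sketch below uses an in-memory buffer and a made-up two-row frame to show the difference; it does not touch `lipid_data.csv`.

```python
import io

import pandas as pd

df = pd.DataFrame({"Age": [56.0, 53.0], "Risk Factor 1": [1, 1]})

# Without the index, the round trip gives back the same columns.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)
print(list(restored.columns))    # ['Age', 'Risk Factor 1']

# With the index written out, read_csv sees an extra unnamed first column.
buffer = io.StringIO()
df.to_csv(buffer, index=True)
buffer.seek(0)
with_index = pd.read_csv(buffer)
print(list(with_index.columns))  # ['Unnamed: 0', 'Age', 'Risk Factor 1']
```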