# Cirrhosis Data Analysis
This notebook analyzes the Cirrhosis dataset, which contains clinical data on primary biliary cirrhosis (PBC) of the liver. The dataset includes variables such as age, sex, presence of ascites, and serum bilirubin levels. We perform initial data cleaning, followed by Exploratory Data Analysis (EDA).

In [None]:
# Importing necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Reading the dataset
# Note: Replace the file path with the actual path to the 'Cirrhosis - Cirrhosis.csv' file
cirrhosis_data = pd.read_csv('Cirrhosis - Cirrhosis.csv')
# Displaying the first few rows of the dataset
cirrhosis_data.head()

In [None]:
# Checking for missing values in the dataset
missing_values = cirrhosis_data.isnull().sum()
missing_values

## Data Cleaning
Checking for missing values shows that the 'triglicerides' column has 30 missing values and the 'copper' column has 2. Both columns are numeric, so we fill the gaps with each column's median, which is less sensitive to outliers than the mean.
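
Before settling on a fill strategy, it can help to look at the share of missing values per column and at how far the mean and median sit apart. The sketch below does this on a small made-up frame; the `toy` values are purely illustrative and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for two lab columns; not real patient values.
toy = pd.DataFrame({
    "triglicerides": [88.0, 102.0, np.nan, 95.0, 432.0, np.nan],
    "copper": [54.0, 210.0, 64.0, np.nan, 73.0, 58.0],
})

# Count and percentage of missing values per column.
missing_summary = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": (toy.isnull().mean() * 100).round(1),
})
print(missing_summary)

# Compare mean and median (NaN values are skipped by default).
# A large gap between the two hints at skew, where the median
# is usually the more robust fill value.
print(toy.agg(["mean", "median"]))
```

The same two steps could be run on `cirrhosis_data` directly once the CSV is loaded.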

In [None]:
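# Handling missing values by replacing them with the median value of the respective columns
# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which newer pandas versions warn about and may not apply to the original DataFrame
cirrhosis_data['triglicerides'] = cirrhosis_data['triglicerides'].fillna(cirrhosis_data['triglicerides'].median())
cirrhosis_data['copper'] = cirrhosis_data['copper'].fillna(cirrhosis_data['copper'].median())
# Confirming that there are no more missing values
cirrhosis_data.isnull().sum()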

## Exploratory Data Analysis (EDA)
With the missing values handled, we move on to the Exploratory Data Analysis. We visualize the data to understand how individual variables are distributed and how they relate to one another.
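
One simple way to look at relationships between numeric variables is a pairwise correlation matrix. The sketch below is a minimal illustration on made-up values; the column names mirror ones used in this notebook, but the numbers in `toy` are hypothetical.

```python
import pandas as pd

# Hypothetical toy values; column names mirror the notebook, numbers are made up.
toy = pd.DataFrame({
    "age": [21464, 20617, 25594, 19994, 13918, 24201],
    "copper": [156.0, 54.0, 210.0, 64.0, 143.0, 50.0],
    "triglicerides": [172.0, 88.0, 55.0, 92.0, 72.0, 63.0],
})

# Pairwise Pearson correlation between the numeric columns.
corr = toy.corr()
print(corr.round(2))
```

The resulting matrix can be passed to `sns.heatmap(corr, annot=True)` for a visual overview.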

In [None]:
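# Pie chart for the 'status' variable
# value_counts() orders by frequency, so sort by status code to keep the counts
# aligned with the labels (assumes the codes sort as censored, liver tx, death)
status_counts = cirrhosis_data['status'].value_counts().sort_index()
labels = ['Censored', 'Censored due to liver tx', 'Death']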
plt.figure(figsize=(8, 6))
plt.pie(status_counts, labels=labels, autopct='%1.1f%%', startangle=90)
plt.title('Distribution of Status')
plt.show()

In [None]:
# Histogram for the 'age' variable
plt.figure(figsize=(8, 6))
plt.hist(cirrhosis_data['age'], bins=20, color='blue', edgecolor='black')
plt.title('Distribution of Age in Days')
plt.xlabel('Age in Days')
plt.ylabel('Frequency')
plt.show()
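
Ages recorded in days are hard to read at a glance, so converting them to years may make the histogram easier to interpret. A minimal sketch, assuming 365.25 days per year and using made-up ages in a hypothetical `age_days` series:

```python
import pandas as pd

# Hypothetical ages in days; illustrative values only.
age_days = pd.Series([21464, 20617, 25594, 19994], name="age")

# Convert days to years, using 365.25 days per year to account for leap years.
age_years = (age_days / 365.25).round(1)
print(age_years.tolist())
```

Applied to the notebook's data, the same expression would be `cirrhosis_data['age'] / 365.25`.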

## Summary
The dataset contains clinical data related to primary biliary cirrhosis (PBC) of the liver. We filled the missing values in the 'triglicerides' and 'copper' columns with each column's median, then examined the distributions of the 'status' and 'age' variables. The cleaned dataset is now ready for further analysis by the modeling team.