# How can you help here?

The company wants to know:

* Which variables are significant in predicting the reason for hospitalization for different regions

* How well variables such as viral load, smoking status, and severity level explain the hospitalization charges


# Column Profiling

* **Age**: This is an integer indicating the age of the primary beneficiary (excluding those above 64 years, since they are generally covered by the government).

* **Sex**: This is the policyholder's gender, either male or female.

* **Viral Load**: Viral load refers to the amount of virus in an infected person's blood

* **Severity Level**: This is an integer indicating how severe the patient's condition is

* **Smoker**: This is yes or no depending on whether the insured regularly smokes tobacco.

* **Region**: This is the beneficiary's place of residence in Delhi, divided into four geographic regions - northeast, southeast, southwest, or northwest

* **Hospitalization charges**: Individual medical costs billed to health insurance

# Concepts Used

* Graphical and Non-Graphical Analysis
* 2-sample t-test: testing for difference across populations
* ANOVA
* Chi-square
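
As a rough illustration of how the three tests listed above map onto `scipy.stats` calls, here is a minimal sketch. It uses only the ten rows that `df.head()` and `df.tail()` display in this notebook, so the resulting statistics say nothing about the full dataset; the point is the shape of each call. The names `demo`, `charges`, and the choice of Welch's variant (`equal_var=False`) are assumptions made for this example, not the notebook's own analysis.

```python
import pandas as pd
from scipy.stats import ttest_ind, f_oneway, chi2_contingency

# Ten rows as displayed by df.head() and df.tail(); far too few for real inference.
rows = [
    (19, "female", "yes", "southwest", 9.3, 0, 42212),
    (18, "male", "no", "southeast", 11.26, 1, 4314),
    (28, "male", "no", "southeast", 11.0, 3, 11124),
    (33, "male", "no", "northwest", 7.57, 0, 54961),
    (32, "male", "no", "northwest", 9.63, 0, 9667),
    (50, "male", "no", "northwest", 10.32, 3, 26501),
    (18, "female", "no", "northeast", 10.64, 0, 5515),
    (18, "female", "no", "southeast", 12.28, 0, 4075),
    (21, "female", "no", "southwest", 8.6, 0, 5020),
    (61, "female", "yes", "northwest", 9.69, 0, 72853),
]
cols = ["age", "sex", "smoker", "region", "viral load",
        "severity level", "hospitalization charges"]
demo = pd.DataFrame(rows, columns=cols)
charges = "hospitalization charges"

# 2-sample t-test: do smokers and non-smokers differ in mean charges?
# equal_var=False selects Welch's test (an assumption made for this sketch).
t_stat, t_p = ttest_ind(
    demo.loc[demo["smoker"] == "yes", charges],
    demo.loc[demo["smoker"] == "no", charges],
    equal_var=False,
)

# One-way ANOVA: do mean charges differ across the four regions?
region_groups = [g[charges].to_numpy() for _, g in demo.groupby("region")]
f_stat, f_p = f_oneway(*region_groups)

# Chi-square test of independence between two categorical columns.
contingency = pd.crosstab(demo["sex"], demo["smoker"])
chi2, chi_p, dof, expected = chi2_contingency(contingency)

print(f"t-test p={t_p:.3f}, ANOVA p={f_p:.3f}, chi-square p={chi_p:.3f}")
```

On the full data the same calls would take columns of `df` instead of `demo`, ideally after checking normality and equal variances with `shapiro` and `levene`.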

## Importing the libraries

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
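# The style was named "seaborn-whitegrid" before matplotlib 3.6; that name was removed in 3.8.
plt.style.use("seaborn-v0_8-whitegrid")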
import datetime as dt
import statsmodels.api as sm
from scipy.stats import shapiro,f_oneway,levene,ttest_ind,chi2_contingency
import warnings
warnings.filterwarnings('ignore')

## Reading the dataset

In [3]:
df=pd.read_csv('apollo_hospital.csv')
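
The `Unnamed: 0` column that appears in the outputs of this notebook simply repeats the row index, which usually happens when a CSV was saved together with its index. A minimal sketch, assuming that column carries no other information, of absorbing it at load time with `index_col=0`. The inline CSV reuses the first rows shown by `df.head()`; `sample_csv` and `df_sample` are illustrative names, and the column renaming is an optional convenience rather than something the notebook does.

```python
import io
import pandas as pd

# First rows as displayed by df.head(); the leading unnamed column is the saved index.
sample_csv = io.StringIO(
    ",age,sex,smoker,region,viral load,severity level,hospitalization charges\n"
    "0,19,female,yes,southwest,9.3,0,42212\n"
    "1,18,male,no,southeast,11.26,1,4314\n"
    "2,28,male,no,southeast,11.0,3,11124\n"
    "3,33,male,no,northwest,7.57,0,54961\n"
    "4,32,male,no,northwest,9.63,0,9667\n"
)

# index_col=0 uses the first CSV column as the row index instead of
# loading it as a separate "Unnamed: 0" column.
df_sample = pd.read_csv(sample_csv, index_col=0)

# Optional: names without spaces are easier to use in later column access.
df_sample.columns = df_sample.columns.str.replace(" ", "_")

print(df_sample.columns.tolist())
print(df_sample.shape)
```

With the real file, the same effect would come from passing `index_col=0` to the `pd.read_csv('apollo_hospital.csv')` call, or from dropping the column afterwards with `df.drop(columns="Unnamed: 0")`.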

## Checking the data for null values and basic structure

In [5]:
df.head()

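| | Unnamed: 0 | age | sex | smoker | region | viral load | severity level | hospitalization charges |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 19 | female | yes | southwest | 9.3 | 0 | 42212 |
| 1 | 1 | 18 | male | no | southeast | 11.26 | 1 | 4314 |
| 2 | 2 | 28 | male | no | southeast | 11.0 | 3 | 11124 |
| 3 | 3 | 33 | male | no | northwest | 7.57 | 0 | 54961 |
| 4 | 4 | 32 | male | no | northwest | 9.63 | 0 | 9667 |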


In [6]:
df.tail()

| | Unnamed: 0 | age | sex | smoker | region | viral load | severity level | hospitalization charges |
|---|---|---|---|---|---|---|---|---|
| 1333 | 1333 | 50 | male | no | northwest | 10.32 | 3 | 26501 |
| 1334 | 1334 | 18 | female | no | northeast | 10.64 | 0 | 5515 |
| 1335 | 1335 | 18 | female | no | southeast | 12.28 | 0 | 4075 |
| 1336 | 1336 | 21 | female | no | southwest | 8.6 | 0 | 5020 |
| 1337 | 1337 | 61 | female | yes | northwest | 9.69 | 0 | 72853 |


In [7]:
df.shape

(1338, 8)

In [8]:
df.isnull().any()

Unnamed: 0                 False
age                        False
sex                        False
smoker                     False
region                     False
viral load                 False
severity level             False
hospitalization charges    False
dtype: bool
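
`isnull().any()` only reports whether a column has at least one missing value. When a count is more useful, `isnull().sum()` gives the number of missing values per column, and `duplicated().sum()` counts fully repeated rows. A minimal sketch on a toy frame; the `NaN` below is injected on purpose for illustration and does not exist in the real dataset, and the name `toy` is hypothetical.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame; the NaN is artificial and NOT part of the real data.
toy = pd.DataFrame({
    "age": [19, 18, 28],
    "viral load": [9.3, np.nan, 11.0],
    "smoker": ["yes", "no", "no"],
})

null_counts = toy.isnull().sum()        # missing values per column
null_share = toy.isnull().mean() * 100  # the same, as a percentage of rows
dup_rows = toy.duplicated().sum()       # number of fully duplicated rows

print(null_counts)
print(null_share.round(1))
print("duplicated rows:", dup_rows)
```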

In [9]:
df.describe()

| | Unnamed: 0 | age | viral load | severity level | hospitalization charges |
|---|---|---|---|---|---|
| count | 1338.0 | 1338.0 | 1338.0 | 1338.0 | 1338.0 |
| mean | 668.5 | 39.207025 | 10.221233 | 1.094918 | 33176.058296 |
| std | 386.391641 | 14.04996 | 2.032796 | 1.205493 | 30275.029296 |
| min | 0.0 | 18.0 | 5.32 | 0.0 | 2805.0 |
| 25% | 334.25 | 27.0 | 8.7625 | 0.0 | 11851.0 |
| 50% | 668.5 | 39.0 | 10.13 | 1.0 | 23455.0 |
| 75% | 1002.75 | 51.0 | 11.5675 | 2.0 | 41599.5 |
| max | 1337.0 | 64.0 | 17.71 | 5.0 | 159426.0 |


### Insights 
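* The dataset contains no null values
* Data is available for 1,338 patients
* Ages range from 18 to 64
* The average patient age is about 39 years
* Viral load averages about 10.2, with the middle 50% of patients between roughly 8.8 and 11.6
* Most patients have a low severity level: the median is 1 and 75% of patients are at level 2 or below, while observed levels run from 0 to 5
* Hospitalization charges range from ₹2,805 to ₹159,426; the median (₹23,455) sits well below the mean (about ₹33,176), which suggests a right-skewed distribution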
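
The insights above can be backed by a few direct computations instead of being read off `describe()`. The helper below is a minimal sketch: `summarize` is a hypothetical function written for this example, and it is demonstrated only on the five rows shown by `df.head()`, so its numbers differ from those of the full dataset.

```python
import pandas as pd

def summarize(frame: pd.DataFrame) -> dict:
    """Return a few figures that support the insights (illustrative helper)."""
    viral = frame["viral load"]
    charges = frame["hospitalization charges"]
    return {
        # Interquartile range: where the middle 50% of viral loads fall.
        "viral_load_iqr": (viral.quantile(0.25), viral.quantile(0.75)),
        # Share of patients at each severity level, ordered by level.
        "severity_share": frame["severity level"]
        .value_counts(normalize=True)
        .sort_index(),
        # Minimum and maximum charges.
        "charges_range": (charges.min(), charges.max()),
        # Positive skewness indicates a long right tail.
        "charges_skew": charges.skew(),
    }

# The five rows displayed by df.head(); not representative of all 1,338 patients.
head_rows = pd.DataFrame({
    "viral load": [9.3, 11.26, 11.0, 7.57, 9.63],
    "severity level": [0, 1, 3, 0, 0],
    "hospitalization charges": [42212, 4314, 11124, 54961, 9667],
})

summary = summarize(head_rows)
print(summary["viral_load_iqr"])
print(summary["severity_share"])
print(summary["charges_range"])
```

Calling `summarize(df)` on the loaded DataFrame would produce the corresponding figures for the whole dataset.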