In [1]:
import pandas as pd 
import seaborn as sns 
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv('idsp.csv', encoding='ISO-8859-1')


In [3]:
df.head()

Unnamed: 0,year,week,outbreak_starting_date,reporting_date,state,district,disease_illness_name,status,cases,deaths,unit,note
0,2025,16,15-04-2025,15-04-2025,Andhra Pradesh,Kakinada,Acute Diarrheal Disease,Reported,22,0,"cases in absolute number, deaths in absolute n...",
1,2025,16,15-04-2025,17-04-2025,Assam,Biswanath,Chickenpox,Reported,1,1,"cases in absolute number, deaths in absolute n...",
2,2025,16,19-04-2025,20-04-2025,Assam,Dhemaji,Food Poisoning,Reported,16,0,"cases in absolute number, deaths in absolute n...",
3,2025,16,19-04-2025,19-04-2025,Bihar,Gopalganj,Fever with Rash,Reported,5,0,"cases in absolute number, deaths in absolute n...",
4,2025,16,12-04-2025,15-04-2025,Bihar,Madhubani,Acute Diarrheal Disease,Reported,21,0,"cases in absolute number, deaths in absolute n...",


In [4]:
df.columns

Index(['year', 'week', 'outbreak_starting_date', 'reporting_date', 'state',
       'district', 'disease_illness_name', 'status', 'cases', 'deaths', 'unit',
       'note'],
      dtype='object')

In [6]:
df.shape

(6474, 12)

In [7]:
df['unit'].value_counts()

unit
cases in absolute number, deaths in absolute number    6474
Name: count, dtype: int64

In [8]:
df['note'].value_counts()

note
cases: Cases reported from Hyderbagh, Nanded Corporation, District Nanded                                                       1
cases: Cases reported from Islampura Kinwat Council area, Taluk Kinwat, District Nanded                                         1
cases: Cases reported from Village Gillesugur Camp, PHC Gillesugur, Taluk Raichur, District Raichur                             1
cases: Cases reported from Village Gillesugur Camp, PHC/CHC Gillesugur, Taluk Raichur, District Raichur                         1
cases: Cases were reported from Village Äì Malayampattu, district Thiruvannamalai.                                             1
cases: Cases reported from Village Malayampattu, HSC Mullipattu, PHC Malayampattu, Block Thacthur, District Thiruvannamalai.    1
Name: count, dtype: int64

In [5]:
df.isnull().sum()

year                         0
week                         0
outbreak_starting_date       0
reporting_date            1019
state                        0
district                     1
disease_illness_name         5
status                       0
cases                        0
deaths                       0
unit                         0
note                      6468
dtype: int64

In [10]:
df.shape

(6474, 12)

In [12]:
df.drop(columns=['note'], inplace=True)


In [13]:
df.isnull().sum()

year                         0
week                         0
outbreak_starting_date       0
reporting_date            1019
state                        0
district                     1
disease_illness_name         5
status                       0
cases                        0
deaths                       0
unit                         0
dtype: int64

In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6474 entries, 0 to 6473
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   year                    6474 non-null   int64 
 1   week                    6474 non-null   int64 
 2   outbreak_starting_date  6474 non-null   object
 3   reporting_date          5455 non-null   object
 4   state                   6474 non-null   object
 5   district                6473 non-null   object
 6   disease_illness_name    6469 non-null   object
 7   status                  6474 non-null   object
 8   cases                   6474 non-null   int64 
 9   deaths                  6474 non-null   int64 
dtypes: int64(4), object(6)
memory usage: 505.9+ KB


In [15]:
df['unit'].value_counts()

unit
cases in absolute number, deaths in absolute number    6474
Name: count, dtype: int64

In [16]:
df.drop(columns=['unit'], inplace=True)


### 🔍 What is `outbreak_starting_date`?

- `outbreak_starting_date` is the **earliest known date** on which a **disease outbreak** began in a specific **district** or **state**. It is separate from `reporting_date`, which records when the outbreak was reported.
- It marks the **beginning of an outbreak episode** tracked by **IDSP** (Integrated Disease Surveillance Programme, India).

---

### 🏥 What is an "Outbreak"?

- An **outbreak** is the occurrence of **more cases than expected** of a disease in a particular time and place.
  - Example: 20+ cases of Dengue in one village in a week.

---

### 📆 Why is `outbreak_starting_date` important?

#### ✅ (a) Temporal Tracking
- Helps track **when** the outbreak began.
- Useful for building **timelines** of disease spread.

#### ✅ (b) Early Response Window
- Critical for **public health teams** to analyze **response time**.
- Helps detect **delays between outbreak and action**.
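
As a rough sketch of how that delay could be measured, the snippet below subtracts `outbreak_starting_date` from `reporting_date` on three rows copied from `df.head()`. The `sample` frame and the `reporting_lag_days` column are illustrative names, not part of the dataset, and the sketch assumes every date uses the `dd-mm-yyyy` layout seen in the preview.

```python
import pandas as pd

# Three rows copied from df.head(); dates are dd-mm-yyyy strings.
sample = pd.DataFrame({
    "district": ["Kakinada", "Biswanath", "Madhubani"],
    "outbreak_starting_date": ["15-04-2025", "15-04-2025", "12-04-2025"],
    "reporting_date": ["15-04-2025", "17-04-2025", "15-04-2025"],
})

# Parse with an explicit format so day and month cannot be swapped.
for col in ["outbreak_starting_date", "reporting_date"]:
    sample[col] = pd.to_datetime(sample[col], format="%d-%m-%Y")

# Days between the start of the outbreak and the report (illustrative column name).
sample["reporting_lag_days"] = (
    sample["reporting_date"] - sample["outbreak_starting_date"]
).dt.days

print(sample[["district", "reporting_lag_days"]])
```

On the full data the same subtraction would give missing values wherever `reporting_date` is null.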

#### ✅ (c) Data Filtering & Analysis
- You can:
  - Filter by **month or season** to study seasonal diseases (e.g., Dengue, Malaria).
  - Visualize **trends over years**.
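
A minimal sketch of both ideas, assuming the column has already been parsed to `datetime`. The rows in `sample` are made up for illustration, and `MONSOON_MONTHS` (June to September) is an assumed definition of the season, not something the dataset specifies.

```python
import pandas as pd

# Made-up rows for illustration only.
sample = pd.DataFrame({
    "outbreak_starting_date": pd.to_datetime(
        ["15-04-2024", "19-07-2024", "02-09-2025"], format="%d-%m-%Y"
    ),
    "disease_illness_name": ["Chickenpox", "Dengue", "Dengue"],
})

# Assumption: treat June to September as the monsoon season.
MONSOON_MONTHS = [6, 7, 8, 9]

# Keep only outbreaks that started in the chosen months.
in_season = sample["outbreak_starting_date"].dt.month.isin(MONSOON_MONTHS)
monsoon = sample[in_season]

# Count outbreaks per calendar year as a basis for a yearly trend plot.
per_year = sample.groupby(sample["outbreak_starting_date"].dt.year).size()

print(monsoon)
print(per_year)
```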

---

### 🧾 Example:

| year | week | outbreak_starting_date | state   | disease   | cases | deaths |
|------|------|-------------------------|---------|-----------|-------|--------|
| 2022 | 15   | 2022-04-08              | Bihar   | Cholera   | 24    | 1      |
| 2022 | 16   | 2022-04-12              | Gujarat | Measles   | 17    | 0      |

Here, the outbreak of **Cholera** in Bihar started on **April 8, 2022**, during **week 15** of the year.

---

### 🛠️ Tips for Use:
- Convert to `datetime` (the dates are stored as `dd-mm-yyyy`, so pass `dayfirst=True`):
```python
df['outbreak_starting_date'] = pd.to_datetime(df['outbreak_starting_date'], dayfirst=True)
```


In [18]:
# Dates look like 15-04-2025 (dd-mm-yyyy), so parse day-first to avoid swapping day and month.
df['outbreak_starting_date'] = pd.to_datetime(df['outbreak_starting_date'], dayfirst=True)


In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6474 entries, 0 to 6473
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   year                    6474 non-null   int64         
 1   week                    6474 non-null   int64         
 2   outbreak_starting_date  6474 non-null   datetime64[ns]
 3   reporting_date          5455 non-null   object        
 4   state                   6474 non-null   object        
 5   district                6473 non-null   object        
 6   disease_illness_name    6469 non-null   object        
 7   status                  6474 non-null   object        
 8   cases                   6474 non-null   int64         
 9   deaths                  6474 non-null   int64         
dtypes: datetime64[ns](1), int64(4), object(5)
memory usage: 505.9+ KB


How to fill `reporting_date`: group by `state` and `district`, compute the difference in days between `outbreak_starting_date` and `reporting_date`, take the mode of that difference, and add that many days to `outbreak_starting_date`. I think this is the best approach for filling this column.

Any values that are still null after that, I'll drop.
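
The plan above could be sketched roughly as follows. `fill_reporting_date` and `modal_lag` are hypothetical helper names, the rows in `sample` are made up, and the sketch assumes both date columns are already `datetime64`. When a group has several equally common lags, it takes the smallest one.

```python
import pandas as pd

def modal_lag(lags: pd.Series) -> float:
    """Most common lag in a group, or NaN if the group has no known lag."""
    modes = lags.dropna().mode()
    return modes.iloc[0] if not modes.empty else float("nan")

def fill_reporting_date(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: impute reporting_date from the modal lag per (state, district)."""
    out = df.copy()
    # Lag in days between outbreak start and report; NaN where reporting_date is missing.
    lag = (out["reporting_date"] - out["outbreak_starting_date"]).dt.days
    # Broadcast each group's modal lag back onto its rows.
    group_lag = lag.groupby([out["state"], out["district"]]).transform(modal_lag)
    # Fill only rows that are missing a date and belong to a group with a known lag.
    missing = out["reporting_date"].isna() & group_lag.notna()
    out.loc[missing, "reporting_date"] = (
        out.loc[missing, "outbreak_starting_date"]
        + pd.to_timedelta(group_lag[missing], unit="D")
    )
    # Rows that could not be filled are dropped.
    return out.dropna(subset=["reporting_date"])

# Made-up rows: district X has lags of 2, 2 and 1 days; district Y has none.
sample = pd.DataFrame({
    "state": ["A", "A", "A", "A", "B"],
    "district": ["X", "X", "X", "X", "Y"],
    "outbreak_starting_date": pd.to_datetime(
        ["01-04-2025", "05-04-2025", "10-04-2025", "20-04-2025", "01-04-2025"],
        format="%d-%m-%Y",
    ),
    "reporting_date": pd.to_datetime(
        ["03-04-2025", "07-04-2025", "11-04-2025", None, None],
        format="%d-%m-%Y",
    ),
})

filled = fill_reporting_date(sample)
print(filled)
```

In this toy data the missing date in district X is filled with the start date plus 2 days, and the row in district Y is dropped because that group has no observed lag.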

In [21]:
# Same dd-mm-yyyy layout as outbreak_starting_date, so parse day-first here too.
df['reporting_date'] = pd.to_datetime(df['reporting_date'], dayfirst=True)


In [22]:
df['state'].value_counts()

state
Kerala                                      843
Maharashtra                                 609
Karnataka                                   587
Madhya Pradesh                              580
Odisha                                      462
Tamil Nadu                                  393
Assam                                       378
Jharkhand                                   355
Chhattisgarh                                312
Uttar Pradesh                               304
Gujarat                                     254
Jammu and Kashmir                           230
Bihar                                       225
West Bengal                                 166
Andhra Pradesh                              144
Meghalaya                                   113
Punjab                                       73
Rajasthan                                    60
Uttarakhand                                  57
Haryana                                      55
Arunachal Pradesh                 

In [23]:
df['district'].value_counts()

district
Ernakulam            125
Palakkad             117
Malappuram           100
Kannur                78
Thrissur              67
                    ... 
Gautam Budh Nagar      1
Pathankot              1
Botad                  1
Pakke Kessang          1
Jagtial                1
Name: count, Length: 707, dtype: int64

In [24]:
df['disease_illness_name'].value_counts()

disease_illness_name
Acute Diarrheal Disease           1967
Food Poisoning                     812
Dengue                             509
Chickenpox                         439
Hepatitis A                        385
                                  ... 
Leptospirosis and Scrub Typhus       1
Fever with Altered sensandium        1
Tetanus                              1
Fever of Unknown Cause               1
Malaria (Plasmodium Vivax)           1
Name: count, Length: 127, dtype: int64

This column will need some work.
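
One possible first step, offered only as a guess at what that work involves, is to normalise spacing and capitalisation so that differently typed versions of the same label are counted together. The values in `names` are made up for illustration, and spelling variants would still need a manual mapping.

```python
import pandas as pd

# Made-up labels showing the kind of inconsistency this step targets.
names = pd.Series([
    "Acute Diarrheal Disease",
    " acute  diarrheal disease",
    "Dengue",
    None,
])

# Trim the ends, collapse repeated spaces, and use one capitalisation style.
clean = (
    names.str.strip()
    .str.replace(r"\s+", " ", regex=True)
    .str.title()
)

# Missing names stay missing and are left out of the counts.
counts = clean.value_counts()
print(counts)
```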

In [27]:
df['status'].value_counts()

status
Reported in Same Week      4733
Reported Late               958
Reported                    665
Reported in same week        70
Reported late                46
Previous Week Follow up       2
Name: count, dtype: int64

In [29]:
df['deaths'].value_counts()

deaths
0     5672
1      650
2       96
3       30
4        8
6        5
7        4
5        3
16       1
73       1
66       1
41       1
9        1
13       1
Name: count, dtype: int64

In [30]:
df['cases'].value_counts()

cases
1      344
5      277
10     230
6      228
12     219
      ... 
265      1
496      1
251      1
186      1
253      1
Name: count, Length: 281, dtype: int64