# **PART 1: COVID-19 Data Wrangling and Analysis**

#### Step 1: Import required libraries.


In [1]:
import pandas as pd
import numpy as np

#### Step 2: Import the datasets.

We have two different datasets. We will import them and perform data wrangling to obtain clean data.<br>
The clean datasets will be used for further analysis and visualization.

#### `Dataset 1: INDIA Dataset`

In [2]:
df_ind = pd.read_csv('covid_19_india.csv')
print("Data read into pandas dataframe.")

Data read into pandas dataframe.


In [3]:
df_ind.head()

Unnamed: 0,Sno,Date,Time,State/UnionTerritory,ConfirmedIndianNational,ConfirmedForeignNational,Cured,Deaths,Confirmed
0,1,2020-01-30,6:00 PM,Kerala,1,0,0,0,1
1,2,2020-01-31,6:00 PM,Kerala,1,0,0,0,1
2,3,2020-02-01,6:00 PM,Kerala,2,0,0,0,2
3,4,2020-02-02,6:00 PM,Kerala,3,0,0,0,3
4,5,2020-02-03,6:00 PM,Kerala,3,0,0,0,3


Let's check the shape of the dataset.

In [4]:
df_ind.shape

(18110, 9)

In [5]:
df_ind = df_ind.rename(columns= {'Sno': 'Serial Number'})

Now check the data type of each column and confirm that it is appropriate for the data it holds.

In [6]:
df_ind.dtypes

Serial Number                int64
Date                        object
Time                        object
State/UnionTerritory        object
ConfirmedIndianNational     object
ConfirmedForeignNational    object
Cured                        int64
Deaths                       int64
Confirmed                    int64
dtype: object

**The datatype of the Date and Time columns is 'object'. Let's convert them to 'datetime' format using the pandas function `pd.to_datetime`.**

In [7]:
df_ind['Date'] = pd.to_datetime(df_ind['Date'])
df_ind['Time'] = pd.to_datetime(df_ind['Time'])
df_ind.dtypes


Serial Number                        int64
Date                        datetime64[ns]
Time                        datetime64[ns]
State/UnionTerritory                object
ConfirmedIndianNational             object
ConfirmedForeignNational            object
Cured                                int64
Deaths                               int64
Confirmed                            int64
dtype: object
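The conversion above lets pandas infer the time format. As a minimal sketch, assuming the `Time` values all follow the `6:00 PM` pattern seen in `df_ind.head()`, an explicit format string can be passed instead. The sample values below are hypothetical stand-ins, not rows from the dataset.

```python
import pandas as pd

# Hypothetical sample mirroring the "6:00 PM" style shown in df_ind.head()
times = pd.Series(["6:00 PM", "10:30 AM", "not a time"])

# An explicit format avoids per-element format inference;
# errors="coerce" turns unparseable entries into NaT instead of raising.
parsed = pd.to_datetime(times, format="%I:%M %p", errors="coerce")

# Keep only the time-of-day part; the date part is just a placeholder
time_of_day = parsed.dt.time
print(parsed.isna().sum(), "entries could not be parsed")
```

Counting the `NaT` entries afterwards is a quick way to see whether any value deviates from the assumed pattern.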

Let's check the info of our `df_ind` dataset and check for any missing values.

In [8]:
df_ind.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18110 entries, 0 to 18109
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   Serial Number             18110 non-null  int64         
 1   Date                      18110 non-null  datetime64[ns]
 2   Time                      18110 non-null  datetime64[ns]
 3   State/UnionTerritory      18110 non-null  object        
 4   ConfirmedIndianNational   18110 non-null  object        
 5   ConfirmedForeignNational  18110 non-null  object        
 6   Cured                     18110 non-null  int64         
 7   Deaths                    18110 non-null  int64         
 8   Confirmed                 18110 non-null  int64         
dtypes: datetime64[ns](2), int64(4), object(3)
memory usage: 1.2+ MB


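`info()` reports 18110 non-null entries for every column, so `df_ind` has no missing values, and `Date` and `Time` are now `datetime64[ns]`. Note that `ConfirmedIndianNational` and `ConfirmedForeignNational` are still of `object` dtype, which suggests they may hold non-numeric entries; they are left unchanged here.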
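One way to inspect an `object` column that is expected to hold counts is to coerce it to numbers and look at what fails. This is a sketch on hypothetical values; the `"-"` placeholder is an assumption for illustration and is not taken from the dataset.

```python
import pandas as pd

# Hypothetical values; the "-" placeholder is an assumption
col = pd.Series(["1", "2", "-", "3"], name="ConfirmedIndianNational")

# Non-numeric strings become NaN with errors="coerce"
as_number = pd.to_numeric(col, errors="coerce")

# Entries that were present but could not be converted
bad_entries = col[as_number.isna() & col.notna()]
print(bad_entries.tolist())
```

Running the same check on `df_ind['ConfirmedIndianNational']` would show whether the column can safely be converted to a numeric dtype.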

#### `Dataset 2: STATE-WISE Dataset`

In [9]:
df_state = pd.read_csv('covid_vaccine_statewise.csv')
print("Statewise Vaccination data imported.")

Statewise Vaccination data imported.


In [10]:
df_state.head()

Unnamed: 0,Updated On,State,Total Doses Administered,Sessions,Sites,First Dose Administered,Second Dose Administered,Male (Doses Administered),Female (Doses Administered),Transgender (Doses Administered),...,18-44 Years (Doses Administered),45-60 Years (Doses Administered),60+ Years (Doses Administered),18-44 Years(Individuals Vaccinated),45-60 Years(Individuals Vaccinated),60+ Years(Individuals Vaccinated),Male(Individuals Vaccinated),Female(Individuals Vaccinated),Transgender(Individuals Vaccinated),Total Individuals Vaccinated
0,16/01/2021,India,48276.0,3455.0,2957.0,48276.0,0.0,,,,...,,,,,,,23757.0,24517.0,2.0,48276.0
1,17/01/2021,India,58604.0,8532.0,4954.0,58604.0,0.0,,,,...,,,,,,,27348.0,31252.0,4.0,58604.0
2,18/01/2021,India,99449.0,13611.0,6583.0,99449.0,0.0,,,,...,,,,,,,41361.0,58083.0,5.0,99449.0
3,19/01/2021,India,195525.0,17855.0,7951.0,195525.0,0.0,,,,...,,,,,,,81901.0,113613.0,11.0,195525.0
4,20/01/2021,India,251280.0,25472.0,10504.0,251280.0,0.0,,,,...,,,,,,,98111.0,153145.0,24.0,251280.0


In [11]:
df_state.shape

(7845, 24)

In [12]:
df_state.columns

Index(['Updated On', 'State', 'Total Doses Administered', 'Sessions',
       ' Sites ', 'First Dose Administered', 'Second Dose Administered',
       'Male (Doses Administered)', 'Female (Doses Administered)',
       'Transgender (Doses Administered)', ' Covaxin (Doses Administered)',
       'CoviShield (Doses Administered)', 'Sputnik V (Doses Administered)',
       'AEFI', '18-44 Years (Doses Administered)',
       '45-60 Years (Doses Administered)', '60+ Years (Doses Administered)',
       '18-44 Years(Individuals Vaccinated)',
       '45-60 Years(Individuals Vaccinated)',
       '60+ Years(Individuals Vaccinated)', 'Male(Individuals Vaccinated)',
       'Female(Individuals Vaccinated)', 'Transgender(Individuals Vaccinated)',
       'Total Individuals Vaccinated'],
      dtype='object')

**There are NaN values in our df_state dataset. Let's check the number of missing values in each column before deciding how to replace them.**

In [13]:
missing_data = df_state.isnull()
missing_data.head()

Unnamed: 0,Updated On,State,Total Doses Administered,Sessions,Sites,First Dose Administered,Second Dose Administered,Male (Doses Administered),Female (Doses Administered),Transgender (Doses Administered),...,18-44 Years (Doses Administered),45-60 Years (Doses Administered),60+ Years (Doses Administered),18-44 Years(Individuals Vaccinated),45-60 Years(Individuals Vaccinated),60+ Years(Individuals Vaccinated),Male(Individuals Vaccinated),Female(Individuals Vaccinated),Transgender(Individuals Vaccinated),Total Individuals Vaccinated
0,False,False,False,False,False,False,False,True,True,True,...,True,True,True,True,True,True,False,False,False,False
1,False,False,False,False,False,False,False,True,True,True,...,True,True,True,True,True,True,False,False,False,False
2,False,False,False,False,False,False,False,True,True,True,...,True,True,True,True,True,True,False,False,False,False
3,False,False,False,False,False,False,False,True,True,True,...,True,True,True,True,True,True,False,False,False,False
4,False,False,False,False,False,False,False,True,True,True,...,True,True,True,True,True,True,False,False,False,False


In [14]:
for column in missing_data.columns.values.tolist():
    print(missing_data[column].value_counts())
    print(" ")
    

Updated On
False    7845
Name: count, dtype: int64
 
State
False    7845
Name: count, dtype: int64
 
Total Doses Administered
False    7621
True      224
Name: count, dtype: int64
 
Sessions
False    7621
True      224
Name: count, dtype: int64
 
 Sites 
False    7621
True      224
Name: count, dtype: int64
 
First Dose Administered
False    7621
True      224
Name: count, dtype: int64
 
Second Dose Administered
False    7621
True      224
Name: count, dtype: int64
 
Male (Doses Administered)
False    7461
True      384
Name: count, dtype: int64
 
Female (Doses Administered)
False    7461
True      384
Name: count, dtype: int64
 
Transgender (Doses Administered)
False    7461
True      384
Name: count, dtype: int64
 
 Covaxin (Doses Administered)
False    7621
True      224
Name: count, dtype: int64
 
CoviShield (Doses Administered)
False    7621
True      224
Name: count, dtype: int64
 
Sputnik V (Doses Administered)
True     4850
False    2995
Name: count, dtype: int64
 
AEFI
False  
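The per-column loop above can also be condensed into a single call. A minimal sketch on a small hypothetical frame standing in for `df_state`:

```python
import numpy as np
import pandas as pd

# Small hypothetical frame standing in for df_state
df = pd.DataFrame({
    "State": ["India", "India", "Kerala"],
    "Sessions": [3455.0, np.nan, 120.0],
    "AEFI": [np.nan, np.nan, 2.0],
})

# isnull() gives a boolean frame; sum() counts the True values per column
missing_counts = df.isnull().sum()

# Show only the columns that actually have missing values
print(missing_counts[missing_counts > 0])
```

This yields one number per column instead of a `value_counts()` block for each.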

**Several columns contain NaN values, and no single number is an obviously correct replacement for them. We can either replace all NaN values with "0" or estimate them with the `interpolate` function on the df_state dataset.**

_Using interpolate()_

`interpolate()` fills each missing value with a linear estimate between the neighboring valid data points in the same column, which gives a smoother series than a constant fill. It is particularly useful for time series data.
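As a small worked example with made-up numbers, the two gaps between 10 and 40 are filled at equal steps, while a NaN that has no valid value before it is left as it is under the default settings.

```python
import numpy as np
import pandas as pd

# Made-up series: one leading NaN and a two-value gap
s = pd.Series([np.nan, 10.0, np.nan, np.nan, 40.0])

# method="linear" is the default; rows are treated as equally spaced
filled = s.interpolate()
print(filled.tolist())
```

The gap becomes 20.0 and 30.0, and the first entry stays NaN.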

In [15]:
df_state = df_state.interpolate()
df_state.shape


(7845, 24)
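If the rows of different states are stacked one after another, as the `head()` output suggests, interpolating the whole frame can blend the last values of one state into the first rows of the next. The sketch below uses hypothetical state names and numbers to compare a global fill with a per-state fill via `groupby`; it is an alternative to consider, not what the cell above does.

```python
import numpy as np
import pandas as pd

# Hypothetical stacked data: state B starts with a missing value
df = pd.DataFrame({
    "State": ["A", "A", "A", "B", "B", "B"],
    "Total Doses Administered": [10.0, 20.0, 30.0, np.nan, 2000.0, 3000.0],
})
col = "Total Doses Administered"

# Global interpolation: B's first row is estimated from A's last row
global_fill = df[col].interpolate()

# Per-state interpolation: each state is filled using only its own rows
per_state = df.groupby("State")[col].transform(lambda s: s.interpolate())

print(global_fill.iloc[3], per_state.iloc[3])
```

In the global version the missing value for state B becomes the midpoint between 30 and 2000, a number that belongs to neither state. The per-state version leaves it as NaN, to be handled explicitly.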

In [16]:
df_state.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7845 entries, 0 to 7844
Data columns (total 24 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   Updated On                           7845 non-null   object 
 1   State                                7845 non-null   object 
 2   Total Doses Administered             7845 non-null   float64
 3   Sessions                             7845 non-null   float64
 4    Sites                               7845 non-null   float64
 5   First Dose Administered              7845 non-null   float64
 6   Second Dose Administered             7845 non-null   float64
 7   Male (Doses Administered)            7685 non-null   float64
 8   Female (Doses Administered)          7685 non-null   float64
 9   Transgender (Doses Administered)     7685 non-null   float64
 10   Covaxin (Doses Administered)        7845 non-null   float64
 11  CoviShield (Doses Administered

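**We still have some missing values in our dataset. The default linear interpolation does not fill NaN values that come before the first valid entry in a column, as in the first rows of `Male (Doses Administered)`. It is better to either drop the rows containing these missing values or replace them with "0".**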
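The two options behave differently: dropping removes whole rows, including their valid columns, while filling keeps every row but records a zero where the value was unknown. A sketch on a tiny frame follows; the totals are copied from the `head()` output above and the `Male` values are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "State": ["India", "India", "India"],
    # Hypothetical values; the first one is missing
    "Male (Doses Administered)": [np.nan, 500.0, 900.0],
    "Total Doses Administered": [48276.0, 58604.0, 99449.0],
})

dropped = df.dropna()        # removes any row that contains a NaN
zero_filled = df.fillna(0)   # keeps all rows, NaN becomes 0

print(dropped.shape, zero_filled.shape)
```

Filling with 0 keeps the first day's total in the data, at the cost of a zero that should not be read as a measured value.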

In [17]:
df_state = df_state.fillna(0)
df_state.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7845 entries, 0 to 7844
Data columns (total 24 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   Updated On                           7845 non-null   object 
 1   State                                7845 non-null   object 
 2   Total Doses Administered             7845 non-null   float64
 3   Sessions                             7845 non-null   float64
 4    Sites                               7845 non-null   float64
 5   First Dose Administered              7845 non-null   float64
 6   Second Dose Administered             7845 non-null   float64
 7   Male (Doses Administered)            7845 non-null   float64
 8   Female (Doses Administered)          7845 non-null   float64
 9   Transgender (Doses Administered)     7845 non-null   float64
 10   Covaxin (Doses Administered)        7845 non-null   float64
 11  CoviShield (Doses Administered

### **Now we have clean datasets.**

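Missing values have been handled, and the `Date` and `Time` columns of `df_ind` are stored as datetimes. Note that `Updated On` in `df_state` is still of `object` dtype. Now let's save these cleaned datasets for further analysis.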
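Should `Updated On` be needed as a date later, it could be parsed in the same way as `Date`. The values in `df_state.head()` look like day/month/year, so this sketch assumes that format for the whole column. The third sample date is hypothetical and included to show that day and month are not swapped.

```python
import pandas as pd

# First two values are from df_state.head(); the third is hypothetical
dates = pd.Series(["16/01/2021", "17/01/2021", "01/02/2021"], name="Updated On")

# Explicit day-first format, so "01/02/2021" is read as 1 February
parsed = pd.to_datetime(dates, format="%d/%m/%Y")
print(parsed.dt.month.tolist())
```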

In [20]:
df_state.to_csv('covid_vaccine_statewise_CLEAN_DATA.csv')

In [21]:
df_ind.to_csv('covid_19_india_CLEAN_DATA.csv')
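By default `to_csv` also writes the row index, which comes back as an `Unnamed: 0` column when the file is read again. The sketch below shows the difference that `index=False` makes, using an in-memory buffer in place of a file and a one-row frame as a stand-in.

```python
import io
import pandas as pd

df = pd.DataFrame({"State": ["India"], "Sessions": [3455.0]})

# Default: the index is written as an extra, unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)
print(cols_without)
```

Whether to keep the index depends on how the cleaned files are read later; passing `index_col=0` to `read_csv` is another way to handle it.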

## <h3 align="center"> **Data Analysis performed by LOVISH GARLANI** </h3>
## <h4 align="center"> **LinkedIn: www.linkedin.com/in/lovish-garlani** </h4>