# Title: Exploratory Data Analysis (EDA) with Data Cleaning on a Custom Dataset

<b>Objective</b>: In this assignment, you will create a custom dataset similar to a COVID-19 dataset, perform data cleaning, and then conduct Exploratory Data Analysis (EDA). The aim is to uncover insights from the data through various EDA techniques, including descriptive statistics, data visualization, and identifying patterns or anomalies.

In [3]:
import pandas as pd 

Note: `pd.cut` can be used to create a new categorical column from a numeric one.
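The note above refers to `pd.cut`, which bins a numeric column into labelled categories. A minimal sketch follows, assuming a hypothetical `Vaccination_Level` column; the bin edges, labels, and sample values are illustrative and not part of the assignment.

```python
import pandas as pd

# Small illustrative frame; the values are made up for this example.
sample = pd.DataFrame({"Vaccination_Rate": [6, 45, 4, 90, 50]})

# Bin edges and labels are assumptions chosen for illustration.
bins = [0, 33, 66, 100]
labels = ["Low", "Medium", "High"]

# include_lowest=True keeps a rate of exactly 0 inside the first bin.
sample["Vaccination_Level"] = pd.cut(
    sample["Vaccination_Rate"], bins=bins, labels=labels, include_lowest=True
)
print(sample)
```

The resulting column has a categorical dtype, so it can be passed straight to `groupby` or `value_counts`.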


<b>Instructions</b>:
- Create a Custom Dataset:
    - Create a dataset that mimics a COVID-19 dataset with at least 100 records and the following features:
        - Date: The date of the recorded data.
        - Country: The country where the data is recorded.
        - Total_Cases: The total number of COVID-19 cases reported on that date.
        - New_Cases: The number of new COVID-19 cases reported on that date.
        - Total_Deaths: The total number of deaths reported due to COVID-19 on that date.
        - New_Deaths: The number of new deaths reported on that date.
        - Total_Recoveries: The total number of recoveries reported on that date.
        - New_Recoveries: The number of new recoveries reported on that date.
        - Active_Cases: The number of currently active COVID-19 cases.
        - Vaccination_Rate: The percentage of the population vaccinated as of that date.
 

## Step 1: Creating the dataset


In [7]:
import pandas as pd
import numpy as np 
import random

In [8]:
from datetime import timedelta, datetime

In [9]:
np.random.seed(42)  # seeds NumPy's global generator only; Python's `random` module (used below) has its own separate state
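`np.random.seed` does not affect Python's built-in `random` module, which is what generates the records later in this notebook. A minimal sketch of seeding both, assuming reproducible draws are wanted; the variable names `first_draw` and `second_draw` are illustrative.

```python
import random
import numpy as np

np.random.seed(42)  # seeds NumPy's global generator only
random.seed(42)     # seeds Python's built-in random module

first_draw = [random.randint(100, 500) for _ in range(3)]

# Re-seeding the built-in module reproduces exactly the same draws.
random.seed(42)
second_draw = [random.randint(100, 500) for _ in range(3)]

print(first_draw == second_draw)
```

Without the `random.seed` call, every run of the notebook produces a different dataset.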


In [10]:
# Generate dates
start_date = datetime.strptime('2022-01-01', '%Y-%m-%d')  # strptime parses the string into a datetime; the time part defaults to 00:00:00
end_date = datetime.strptime('2022-04-10', '%Y-%m-%d')  # same '%Y-%m-%d' (year-month-day) format as start_date
date_range = pd.date_range(start=start_date, end=end_date).to_list()
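As an alternative sketch, `pd.date_range` also accepts ISO-formatted strings directly, so the `strptime` step is optional. The explicit `freq="D"` below is the default and is spelled out only for clarity.

```python
import pandas as pd

# ISO-formatted strings are parsed directly by pd.date_range.
date_range = pd.date_range(start="2022-01-01", end="2022-04-10", freq="D")

# Both endpoints are included, giving 100 daily timestamps.
print(len(date_range))
print(date_range[0], date_range[-1])
```

With 10 countries per date, 100 dates yield the 1,000 rows that appear in the DataFrame later on.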

In [11]:
date_range

[Timestamp('2022-01-01 00:00:00'),
 Timestamp('2022-01-02 00:00:00'),
 Timestamp('2022-01-03 00:00:00'),
 Timestamp('2022-01-04 00:00:00'),
 Timestamp('2022-01-05 00:00:00'),
 Timestamp('2022-01-06 00:00:00'),
 Timestamp('2022-01-07 00:00:00'),
 Timestamp('2022-01-08 00:00:00'),
 Timestamp('2022-01-09 00:00:00'),
 Timestamp('2022-01-10 00:00:00'),
 Timestamp('2022-01-11 00:00:00'),
 Timestamp('2022-01-12 00:00:00'),
 Timestamp('2022-01-13 00:00:00'),
 Timestamp('2022-01-14 00:00:00'),
 Timestamp('2022-01-15 00:00:00'),
 Timestamp('2022-01-16 00:00:00'),
 Timestamp('2022-01-17 00:00:00'),
 Timestamp('2022-01-18 00:00:00'),
 Timestamp('2022-01-19 00:00:00'),
 Timestamp('2022-01-20 00:00:00'),
 Timestamp('2022-01-21 00:00:00'),
 Timestamp('2022-01-22 00:00:00'),
 Timestamp('2022-01-23 00:00:00'),
 Timestamp('2022-01-24 00:00:00'),
 Timestamp('2022-01-25 00:00:00'),
 Timestamp('2022-01-26 00:00:00'),
 Timestamp('2022-01-27 00:00:00'),
 Timestamp('2022-01-28 00:00:00'),
 Timestamp('2022-01-

In [12]:
# create a list of countries
countries_list = [ "Italy", "Spain", "United States", "Australia", "Canada", "Mexico", "Nepal", "India", "France", "China"]
countries_list 

['Italy',
 'Spain',
 'United States',
 'Australia',
 'Canada',
 'Mexico',
 'Nepal',
 'India',
 'France',
 'China']

In [13]:
# Loop over every date and country to generate one record per (date, country) pair.
# Each field is drawn independently, so the totals are not cumulative
# and Active_Cases can come out negative.
data = []
for date in date_range:
    for country in countries_list:
        Total_Cases = random.randint(100, 500)
        New_Cases = random.randint(200, 600)
        Total_Deaths = random.randint(80, 800)
        New_Deaths = random.randint(70, 700)
        Total_Recoveries = random.randint(170, 900)
        New_Recoveries = random.randint(160, 290)
        Active_Cases = Total_Cases - (Total_Recoveries + Total_Deaths)
        Vaccination_Rate = random.randint(0, 100)  # percentage of the population
        data.append([date, country, Total_Cases, New_Cases, Total_Deaths, New_Deaths,
                     Total_Recoveries, New_Recoveries, Active_Cases, Vaccination_Rate])
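Because the loop above draws every field independently, `Total_Recoveries + Total_Deaths` usually exceeds `Total_Cases`. If an internally consistent dataset is preferred, one option is to draw only the daily figures and derive the totals with cumulative sums. The following is a sketch under assumed value ranges and ratios, not the method used in this notebook; `consistent_df`, the shortened country list, and the recovery/death fractions are all hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # local generator, seeded for reproducibility
dates = pd.date_range("2022-01-01", "2022-04-10")
countries = ["Italy", "Spain", "Nepal"]  # shortened list for the example

frames = []
for country in countries:
    n = len(dates)
    new_cases = rng.integers(200, 601, size=n)  # upper bound is exclusive
    # Daily recoveries and deaths are fractions of that day's new cases
    # (assumed ratios), so cumulative totals never exceed total cases.
    new_recoveries = (new_cases * rng.uniform(0.3, 0.6, size=n)).astype(int)
    new_deaths = (new_cases * rng.uniform(0.0, 0.05, size=n)).astype(int)
    frames.append(pd.DataFrame({
        "Date": dates,
        "Country": country,
        "New_Cases": new_cases,
        "New_Deaths": new_deaths,
        "New_Recoveries": new_recoveries,
        "Total_Cases": new_cases.cumsum(),
        "Total_Deaths": new_deaths.cumsum(),
        "Total_Recoveries": new_recoveries.cumsum(),
    }))

consistent_df = pd.concat(frames, ignore_index=True)
consistent_df["Active_Cases"] = (
    consistent_df["Total_Cases"]
    - consistent_df["Total_Recoveries"]
    - consistent_df["Total_Deaths"]
)
print(consistent_df.head())
```

Keeping the independent draws is also a reasonable choice for this assignment, since the implausible negative values give the data-cleaning step something to detect.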

        
        
        


In [14]:
data

[[Timestamp('2022-01-01 00:00:00'),
  'Italy',
  183,
  528,
  427,
  526,
  740,
  245,
  -984,
  6],
 [Timestamp('2022-01-01 00:00:00'),
  'Spain',
  141,
  254,
  407,
  106,
  218,
  208,
  -484,
  45],
 [Timestamp('2022-01-01 00:00:00'),
  'United States',
  159,
  390,
  664,
  621,
  458,
  202,
  -963,
  4],
 [Timestamp('2022-01-01 00:00:00'),
  'Australia',
  153,
  477,
  131,
  462,
  508,
  254,
  -486,
  14],
 [Timestamp('2022-01-01 00:00:00'),
  'Canada',
  369,
  370,
  698,
  417,
  261,
  201,
  -590,
  34],
 [Timestamp('2022-01-01 00:00:00'),
  'Mexico',
  373,
  328,
  494,
  631,
  382,
  167,
  -503,
  90],
 [Timestamp('2022-01-01 00:00:00'),
  'Nepal',
  453,
  595,
  762,
  294,
  365,
  283,
  -674,
  9],
 [Timestamp('2022-01-01 00:00:00'),
  'India',
  265,
  523,
  763,
  606,
  852,
  239,
  -1350,
  50],
 [Timestamp('2022-01-01 00:00:00'),
  'France',
  293,
  481,
  329,
  229,
  658,
  273,
  -694,
  87],
 [Timestamp('2022-01-01 00:00:00'),
  'China',
  38

In [15]:
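# Note: " Total_Recoveries" and " New_Recoveries " carry stray spaces in their labels; they are kept here exactly as generated.
df = pd.DataFrame(data, columns=["date", "country", "Total_Cases", "New_Cases", "Total_Deaths", "New_Deaths", " Total_Recoveries", " New_Recoveries ", "Active_Cases", "Vaccination_Rate"])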
df

Unnamed: 0,date,country,Total_Cases,New_Cases,Total_Deaths,New_Deaths,Total_Recoveries,New_Recoveries,Active_Cases,Vaccination_Rate
0,2022-01-01,Italy,183,528,427,526,740,245,-984,6
1,2022-01-01,Spain,141,254,407,106,218,208,-484,45
2,2022-01-01,United States,159,390,664,621,458,202,-963,4
3,2022-01-01,Australia,153,477,131,462,508,254,-486,14
4,2022-01-01,Canada,369,370,698,417,261,201,-590,34
...,...,...,...,...,...,...,...,...,...,...
995,2022-04-10,Mexico,396,551,771,271,377,248,-752,78
996,2022-04-10,Nepal,254,480,579,154,256,190,-581,53
997,2022-04-10,India,234,494,556,203,821,280,-1143,21
998,2022-04-10,France,435,395,510,458,776,199,-851,36
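The column list passed to `pd.DataFrame` above includes leading and trailing spaces in `" Total_Recoveries"` and `" New_Recoveries "`, so `df["Total_Recoveries"]` would raise a `KeyError`. A minimal sketch of normalising the labels, using a tiny hypothetical frame named `demo` rather than the full dataset:

```python
import pandas as pd

# Tiny frame reproducing the padded column names from the cell above.
demo = pd.DataFrame(
    [[740, 245], [218, 208]],
    columns=[" Total_Recoveries", " New_Recoveries "],
)

# Strip surrounding whitespace from every column label.
demo.columns = demo.columns.str.strip()
print(list(demo.columns))
```

The same one-line assignment works on `df` and fits naturally as an early data-cleaning step.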


In [16]:
pd.set_option('display.max_rows', None)  # None removes the row limit, so every row is displayed

In [17]:
df

Unnamed: 0,date,country,Total_Cases,New_Cases,Total_Deaths,New_Deaths,Total_Recoveries,New_Recoveries,Active_Cases,Vaccination_Rate
0,2022-01-01,Italy,183,528,427,526,740,245,-984,6
1,2022-01-01,Spain,141,254,407,106,218,208,-484,45
2,2022-01-01,United States,159,390,664,621,458,202,-963,4
3,2022-01-01,Australia,153,477,131,462,508,254,-486,14
4,2022-01-01,Canada,369,370,698,417,261,201,-590,34
5,2022-01-01,Mexico,373,328,494,631,382,167,-503,90
6,2022-01-01,Nepal,453,595,762,294,365,283,-674,9
7,2022-01-01,India,265,523,763,606,852,239,-1350,50
8,2022-01-01,France,293,481,329,229,658,273,-694,87
9,2022-01-01,China,386,234,285,580,278,284,-177,51


In [18]:
pd.reset_option('display.max_rows')  # restore the default limit; long frames again show only the first and last 5 rows

In [19]:
df

Unnamed: 0,date,country,Total_Cases,New_Cases,Total_Deaths,New_Deaths,Total_Recoveries,New_Recoveries,Active_Cases,Vaccination_Rate
0,2022-01-01,Italy,183,528,427,526,740,245,-984,6
1,2022-01-01,Spain,141,254,407,106,218,208,-484,45
2,2022-01-01,United States,159,390,664,621,458,202,-963,4
3,2022-01-01,Australia,153,477,131,462,508,254,-486,14
4,2022-01-01,Canada,369,370,698,417,261,201,-590,34
...,...,...,...,...,...,...,...,...,...,...
995,2022-04-10,Mexico,396,551,771,271,377,248,-752,78
996,2022-04-10,Nepal,254,480,579,154,256,190,-581,53
997,2022-04-10,India,234,494,556,203,821,280,-1143,21
998,2022-04-10,France,435,395,510,458,776,199,-851,36


In [20]:
with pd.option_context('display.max_rows', None):  # temporarily show all rows; the previous setting is restored when the block exits
    print(df)

          date        country  Total_Cases  New_Cases  Total_Deaths  \
0   2022-01-01          Italy          183        528           427   
1   2022-01-01          Spain          141        254           407   
2   2022-01-01  United States          159        390           664   
3   2022-01-01      Australia          153        477           131   
4   2022-01-01         Canada          369        370           698   
5   2022-01-01         Mexico          373        328           494   
6   2022-01-01          Nepal          453        595           762   
7   2022-01-01          India          265        523           763   
8   2022-01-01         France          293        481           329   
9   2022-01-01          China          386        234           285   
10  2022-01-02          Italy          269        414           189   
11  2022-01-02          Spain          436        207           401   
12  2022-01-02  United States          198        556           534   
13  20
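The cells above change `display.max_rows` both globally and temporarily. The sketch below checks that `pd.option_context` restores the previous value on exit, which is why it is usually the safer choice for a one-off full printout. The variable names are illustrative.

```python
import pandas as pd

before = pd.get_option("display.max_rows")

# Inside the block the row limit is lifted; it is restored on exit.
with pd.option_context("display.max_rows", None):
    inside = pd.get_option("display.max_rows")

after = pd.get_option("display.max_rows")
print(inside, after == before)
```

By contrast, `pd.set_option` stays in effect for the rest of the session until `pd.reset_option` is called.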

In [41]:
pd.set_option('display.max_rows', 200)
print(df)

          date        country  Total_Cases  New_Cases  Total_Deaths  \
0   2022-01-01          Italy          183        528           427   
1   2022-01-01          Spain          141        254           407   
2   2022-01-01  United States          159        390           664   
3   2022-01-01      Australia          153        477           131   
4   2022-01-01         Canada          369        370           698   
..         ...            ...          ...        ...           ...   
995 2022-04-10         Mexico          396        551           771   
996 2022-04-10          Nepal          254        480           579   
997 2022-04-10          India          234        494           556   
998 2022-04-10         France          435        395           510   
999 2022-04-10          China          265        288            84   

     New_Deaths   Total_Recoveries   New_Recoveries   Active_Cases  \
0           526                740               245          -984   
1      