In [1]:
from urllib.request import urlretrieve
import pandas as pd

In [2]:
italy_covid_url = 'https://gist.githubusercontent.com/aakashns/f6a004fa20c84fec53262f9a8bfee775/raw/f309558b1cf5103424cef58e2ecb8704dcd4d74c/italy-covid-daywise.csv'

urlretrieve(italy_covid_url, 'italy-covid-daywise.csv')  # Download the CSV file at the URL and save it locally

('italy-covid-daywise.csv', <http.client.HTTPMessage at 0x1d17e76f110>)

In [3]:
covid_df = pd.read_csv('italy-covid-daywise.csv')  # Read the CSV file into a pandas DataFrame

In [4]:
type(covid_df)  # Check the type of the object returned by read_csv

pandas.core.frame.DataFrame

In [5]:
covid_df  # Printing the values stored by pandas

Unnamed: 0,date,new_cases,new_deaths,new_tests
0,2019-12-31,0.0,0.0,
1,2020-01-01,0.0,0.0,
2,2020-01-02,0.0,0.0,
3,2020-01-03,0.0,0.0,
4,2020-01-04,0.0,0.0,
...,...,...,...,...
243,2020-08-30,1444.0,1.0,53541.0
244,2020-08-31,1365.0,4.0,42583.0
245,2020-09-01,996.0,6.0,54395.0
246,2020-09-02,975.0,8.0,


We can view some basic information about the data frame using the **.info** method.

covid_df.info()

It appears that each column contains values of a specific data type. For the numeric columns, you can view
some statistical information like mean, standard deviation, minimum/maximum values and the number of non-empty
values using the **.describe** method.

In [6]:
covid_df.describe()

Unnamed: 0,new_cases,new_deaths,new_tests
count,248.0,248.0,135.0
mean,1094.818548,143.133065,31699.674074
std,1554.508002,227.105538,11622.209757
min,-148.0,-31.0,7841.0
25%,123.0,3.0,25259.0
50%,342.0,17.0,29545.0
75%,1371.75,175.25,37711.0
max,6557.0,971.0,95273.0


The **.columns** property contains the list of columns within the data frame.

In [7]:
covid_df.columns

Index(['date', 'new_cases', 'new_deaths', 'new_tests'], dtype='object')

You can also retrieve the number of rows and columns in the data frame using the **.shape** property.

In [8]:
covid_df.shape

(248, 4)

In [9]:
covid_df.at[246,'new_cases']

np.float64(975.0)

You see **np.float64(975.0)** even though NumPy was never imported because Pandas is built on NumPy,
and element access (with .at[], .iloc[], etc.) may return NumPy scalar types such as np.float64.

If you want to work with a native Python float, use:
**float(covid_df.at[246, 'new_cases'])**

## Why NumPy is used (especially by Pandas):
NumPy (short for Numerical Python) is a powerful library that provides:

* Efficient multidimensional arrays (ndarray)
* Fast mathematical operations on these arrays
* Built-in support for linear algebra, statistics, random sampling, etc.

In short, Pandas depends on NumPy for:
* Speed
* Memory efficiency
* Mathematical functionality
* Data type consistency

In [10]:
float(covid_df.at[246, 'new_cases'])

975.0
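
The behaviour described above can be checked on a tiny stand-in frame. This is a minimal sketch: `demo_df` and its two values are hypothetical and only mimic the `new_cases` column.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame standing in for covid_df
demo_df = pd.DataFrame({"new_cases": [975.0, 1326.0]})

value = demo_df.at[0, "new_cases"]
print(type(value))                      # a NumPy scalar type, not the built-in float
print(isinstance(value, np.floating))   # it belongs to NumPy's floating-point family
print(isinstance(value, float))         # np.float64 also subclasses Python's float

plain = float(value)        # explicit conversion to the built-in type
also_plain = value.item()   # NumPy's own conversion to a Python scalar
print(type(plain), type(also_plain))
```

Because np.float64 behaves like a Python float in arithmetic, the conversion mostly matters for display or when passing values to code that checks for exact built-in types.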

Instead of using the indexing notation [], Pandas also allows accessing columns as properties of the data frame using **"." (dot)** notation.
However, this method only works for columns whose names do not contain spaces or special characters.

In [11]:
covid_df.new_cases

0         0.0
1         0.0
2         0.0
3         0.0
4         0.0
        ...  
243    1444.0
244    1365.0
245     996.0
246     975.0
247    1326.0
Name: new_cases, Length: 248, dtype: float64
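
To illustrate the limitation of dot notation, here is a small sketch. The frame `demo_df` and the column name `"new tests"` (with a space) are made up for this example and are not part of the data set.

```python
import pandas as pd

# Hypothetical frame; "new tests" deliberately contains a space
demo_df = pd.DataFrame({
    "new_cases": [0.0, 975.0],
    "new tests": [float("nan"), 54395.0],
})

by_dot = demo_df.new_cases          # works: the name is a valid Python identifier
by_bracket = demo_df["new_cases"]   # equivalent bracket access
print(by_dot.equals(by_bracket))

# demo_df.new tests would be a syntax error, so brackets are required here
spaced = demo_df["new tests"]
print(spaced)
```

Bracket notation works for every column name, so it is the safer choice in code that has to handle arbitrary data.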

You can also pass a list of columns within the indexing notation **[]** to access a subset of the data frame with just the
given columns.

In [12]:
cases_df = covid_df[['date', 'new_cases']]

In [13]:
cases_df

Unnamed: 0,date,new_cases
0,2019-12-31,0.0
1,2020-01-01,0.0
2,2020-01-02,0.0
3,2020-01-03,0.0
4,2020-01-04,0.0
...,...,...
243,2020-08-30,1444.0
244,2020-08-31,1365.0
245,2020-09-01,996.0
246,2020-09-02,975.0


To access a specific row of data, use the **.loc** indexer.

In [14]:
covid_df.loc[243]

date          2020-08-30
new_cases         1444.0
new_deaths           1.0
new_tests        53541.0
Name: 243, dtype: object

In [15]:
type(covid_df.loc[243])

pandas.core.series.Series
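
Note that **.loc** looks rows up by index label, while **.iloc** looks them up by position. The two agree on a fresh data frame but differ once rows have been filtered out. A minimal sketch, using a hypothetical three-row frame whose labels and values are chosen for illustration:

```python
import pandas as pd

# Hypothetical frame with explicit row labels 68, 69, 70
demo_df = pd.DataFrame({"new_cases": [1247.0, 120.0, 1492.0]}, index=[68, 69, 70])

filtered = demo_df[demo_df.new_cases >= 1000]   # keeps the rows labelled 68 and 70

by_label = filtered.loc[70]      # the row whose label is 70
by_position = filtered.iloc[1]   # the second row, whatever its label
print(by_label.equals(by_position))

try:
    filtered.loc[1]              # there is no row labelled 1
except KeyError:
    print("label 1 does not exist in the filtered frame")
```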

To view the first and the last few rows of data, we can use the **.head()** and **.tail()** methods.

In [16]:
covid_df.head(5)  # To get the first 5 rows of data

Unnamed: 0,date,new_cases,new_deaths,new_tests
0,2019-12-31,0.0,0.0,
1,2020-01-01,0.0,0.0,
2,2020-01-02,0.0,0.0,
3,2020-01-03,0.0,0.0,
4,2020-01-04,0.0,0.0,


In [17]:
covid_df.tail(4)   # To get the last 4 rows of data

Unnamed: 0,date,new_cases,new_deaths,new_tests
244,2020-08-31,1365.0,4.0,42583.0
245,2020-09-01,996.0,6.0,54395.0
246,2020-09-02,975.0,8.0,
247,2020-09-03,1326.0,6.0,


The distinction between *0* and *NaN* is subtle but important. In this data set, *NaN* indicates that daily test numbers
were not reported on specific dates. In fact, Italy started reporting daily tests from April 19, 2020; by that time, 935310
tests had already been conducted.
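
The practical difference between *0* and *NaN* shows up in aggregations. The sketch below uses a short hypothetical series (`tests`, with illustrative values) rather than the real column.

```python
import numpy as np
import pandas as pd

# Hypothetical series: two unreported days followed by two reported days
tests = pd.Series([np.nan, np.nan, 7841.0, 28095.0])

print(tests.isna().sum())        # number of missing entries
print(tests.sum())               # NaN values are skipped by default
print(tests.mean())              # mean over the reported days only
print(tests.fillna(0).mean())    # treating missing as 0 drags the mean down
```

Replacing NaN with 0 would claim that no tests were conducted on those days, which is a different statement from "not reported".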

We can find the first index that doesn't contain a *NaN* value by using the **first_valid_index()** method of a series.

In [18]:
covid_df.new_tests.first_valid_index()

111
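
As a small sketch of how this method behaves, the hypothetical series below uses made-up labels and values. It also shows the companion method `last_valid_index()` and the result for a series with no valid entries.

```python
import numpy as np
import pandas as pd

# Hypothetical series; the labels 109..112 only mimic row labels
tests = pd.Series([np.nan, np.nan, 7841.0, np.nan], index=[109, 110, 111, 112])

first = tests.first_valid_index()   # label of the first non-NaN entry
last = tests.last_valid_index()     # label of the last non-NaN entry
print(first, last)

# With no valid values at all, the method returns None
empty = pd.Series([np.nan, np.nan]).first_valid_index()
print(empty)
```

The returned value is an index label, so it can be passed directly to **.loc**.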

To confirm that the values of new_tests change from NaN to actual numbers around this index, **pass a range of labels to .loc[]**:

In [19]:
covid_df.loc[108 : 113]

Unnamed: 0,date,new_cases,new_deaths,new_tests
108,2020-04-17,3786.0,525.0,
109,2020-04-18,3493.0,575.0,
110,2020-04-19,3491.0,480.0,
111,2020-04-20,3047.0,433.0,7841.0
112,2020-04-21,2256.0,454.0,28095.0
113,2020-04-22,2729.0,534.0,44248.0
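
Notice that the slice `108 : 113` returned the row labelled 113 as well: unlike regular Python slicing, a label slice in **.loc** includes both endpoints. A minimal sketch on a hypothetical four-row frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with row labels 109..112
demo_df = pd.DataFrame(
    {"new_tests": [np.nan, np.nan, 7841.0, 28095.0]},
    index=[109, 110, 111, 112],
)

label_slice = demo_df.loc[110:112]    # labels 110, 111 and 112 (end included)
position_slice = demo_df.iloc[1:3]    # positions 1 and 2 only (end excluded)

print(len(label_slice), len(position_slice))
```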


Use the **.sample()** method to retrieve a random sample of rows from the data frame.

In [20]:
covid_df.sample(10)

Unnamed: 0,date,new_cases,new_deaths,new_tests
221,2020-08-08,552.0,3.0,26631.0
20,2020-01-20,0.0,0.0,
209,2020-07-27,254.0,5.0,19374.0
226,2020-08-13,476.0,10.0,25629.0
204,2020-07-22,128.0,15.0,29288.0
232,2020-08-19,401.0,5.0,41290.0
100,2020-04-09,3836.0,540.0,
214,2020-08-01,379.0,9.0,31905.0
43,2020-02-12,0.0,0.0,
0,2019-12-31,0.0,0.0,
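
Each call to **.sample()** returns different rows. If a repeatable sample is needed, the `random_state` parameter can be fixed. A short sketch on a hypothetical ten-row frame; the seed value 42 is arbitrary:

```python
import pandas as pd

# Hypothetical frame with ten rows
demo_df = pd.DataFrame({"new_cases": range(10)})

first = demo_df.sample(3, random_state=42)    # fixed seed
second = demo_df.sample(3, random_state=42)   # same seed, same rows

print(first.equals(second))
print(first.index.tolist())   # sampled rows keep their original index labels
```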


# Analyzing Data from the data frames

#### Q1: What is the total number of reported cases and deaths related to covid-19 in Italy?
-> *Similar to NumPy arrays, a Pandas series supports the* **sum** *method to answer these questions.*

In [21]:
total_cases = covid_df.new_cases.sum()
total_deaths = covid_df.new_deaths.sum()

In [22]:
print(f"The total number of new cases in Italy during covid-19 is: {int(total_cases)} \nThe total number of deaths in Italy during covid-19 is: {int(total_deaths)}")

The total number of new cases in Italy during covid-19 is: 271515 
The total number of deaths in Italy during covid-19 is: 35497


#### Q2: What is the overall death rate in Italy during covid-19?

In [23]:
death_rate = total_deaths / total_cases

In [24]:
print(f"The overall death rate in Italy during covid-19 is: {death_rate:.2%}")  # .2% converts the fraction to a percentage

The overall death rate in Italy during covid-19 is: 13.07%


#### Q3: What is the overall number of tests conducted, given that a total of 935310 tests were conducted before daily test numbers were reported?

In [25]:
total_tests = 935310 + covid_df.new_tests.sum()

In [26]:
print(f"The total number of tests which were conducted in Italy during covid-19 is: {int(total_tests)}")

The total number of tests which were conducted in Italy during covid-19 is: 5214766


#### Q4: What fraction of the tests returned a positive result?

In [27]:
# Use total_tests so the denominator covers the same period as total_cases
positive_rate = total_cases / total_tests

In [28]:
print(f"The fraction of tests that returned a positive result is: {positive_rate:.2%}")

The fraction of tests that returned a positive result is: 5.21%


#### Q5: Print all the rows of data that have 1,000 or more cases.

In [29]:
high_new_cases = covid_df.new_cases >= 1000

In [30]:
high_new_cases

0      False
1      False
2      False
3      False
4      False
       ...  
243     True
244     True
245    False
246    False
247     True
Name: new_cases, Length: 248, dtype: bool

In [31]:
high_cases = covid_df[high_new_cases]

In [32]:
print(f"The rows of data which had 1000 or more new cases in a day are: \n{high_cases}")

The rows of data which had 1000 or more new cases in a day are: 
           date  new_cases  new_deaths  new_tests
68   2020-03-08     1247.0        36.0        NaN
69   2020-03-09     1492.0       133.0        NaN
70   2020-03-10     1797.0        98.0        NaN
72   2020-03-12     2313.0       196.0        NaN
73   2020-03-13     2651.0       189.0        NaN
..          ...        ...         ...        ...
241  2020-08-28     1409.0         5.0    65135.0
242  2020-08-29     1460.0         9.0    64294.0
243  2020-08-30     1444.0         1.0    53541.0
244  2020-08-31     1365.0         4.0    42583.0
247  2020-09-03     1326.0         6.0        NaN

[72 rows x 4 columns]
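
Since the boolean series treats True as 1 and False as 0, it can also be used to count matching rows without building the filtered data frame. A minimal sketch on a hypothetical four-row frame:

```python
import pandas as pd

# Hypothetical frame; two of the four values reach the threshold
demo_df = pd.DataFrame({"new_cases": [0.0, 1247.0, 996.0, 1492.0]})

mask = demo_df.new_cases >= 1000   # boolean series, one entry per row

count = int(mask.sum())    # number of rows where the condition holds
share = mask.mean()        # fraction of rows where the condition holds
print(count, share)
print(count == len(demo_df[mask]))   # matches the length of the filtered frame
```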


We could also combine these steps into a single line:

In [33]:
print(f"The rows of data which had 1000 or more cases in a day are:\n{covid_df[covid_df.new_cases >= 1000]}")

The rows of data which had 1000 or more cases in a day are:
           date  new_cases  new_deaths  new_tests
68   2020-03-08     1247.0        36.0        NaN
69   2020-03-09     1492.0       133.0        NaN
70   2020-03-10     1797.0        98.0        NaN
72   2020-03-12     2313.0       196.0        NaN
73   2020-03-13     2651.0       189.0        NaN
..          ...        ...         ...        ...
241  2020-08-28     1409.0         5.0    65135.0
242  2020-08-29     1460.0         9.0    64294.0
243  2020-08-30     1444.0         1.0    53541.0
244  2020-08-31     1365.0         4.0    42583.0
247  2020-09-03     1326.0         6.0        NaN

[72 rows x 4 columns]


The data frame contains 72 rows, but only the first 5 and the last 5 rows are displayed by default
in Jupyter, for brevity. To view all the rows, we can modify some display options.

In [38]:
from IPython.display import display

# Within this block, DataFrames are displayed with up to 100 rows
with pd.option_context('display.max_rows', 100):
    display(covid_df[covid_df.new_cases >= 1000])

## How this works:

**1. from IPython.display import display**

*This imports the display() function, which is used in Jupyter Notebooks 
(and IPython environments) to nicely format and show objects like DataFrames, HTML, etc.*

**2. with pd.option_context('display.max_rows', 100):**

*That means:
Inside this with block, Pandas will show up to 100 rows when displaying a DataFrame.
Once the block is done, Pandas will return to its previous setting.*

**3. covid_df[covid_df.new_cases >= 1000]**

*This is a boolean filter on the DataFrame covid_df. It selects only the rows where the 
new_cases column has values greater than or equal to 1000.*
*It returns a subset of the original DataFrame that matches this condition.*
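
The temporary nature of `pd.option_context` can be verified by reading the option before, inside and after the block. This is a small sketch; the variable names are illustrative, and the value seen outside the block depends on your current settings.

```python
import pandas as pd

before = pd.get_option("display.max_rows")   # current setting

with pd.option_context("display.max_rows", 100):
    inside = pd.get_option("display.max_rows")   # overridden inside the block

after = pd.get_option("display.max_rows")    # restored once the block exits

print(before, inside, after)
```

To change the setting for the whole session instead, `pd.set_option("display.max_rows", 100)` can be used, and `pd.reset_option("display.max_rows")` restores the default.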