In [1]:
from urllib.request import urlretrieve
import pandas as pd

In [None]:
italy_covid_url = 'https://gist.githubusercontent.com/aakashns/f6a004fa20c84fec53262f9a8bfee775/raw/f309558b1cf5103424cef58e2ecb8704dcd4d74c/italy-covid-daywise.csv'

urlretrieve(italy_covid_url, 'italy-covid-daywise.csv')  # Download the CSV file at the URL and save it locally

In [None]:
covid_df = pd.read_csv('italy-covid-daywise.csv')  # Read the CSV file into a pandas DataFrame

In [None]:
type(covid_df)  # Check the type of the object returned by read_csv

In [None]:
covid_df  # Display the contents of the data frame

We can view some basic information about the data frame using the **.info** method.

In [None]:
covid_df.info()

It appears that each column contains values of a specific data type. For the numeric columns, you can view
some statistical information like the mean, standard deviation, minimum/maximum values and the number of non-empty
values using the **.describe** method.

In [None]:
covid_df.describe()

The **.columns** property contains the list of columns in the data frame.

In [None]:
covid_df.columns

You can also retrieve the number of rows and columns in the data frame using the **.shape** property.

In [None]:
covid_df.shape
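
As a small illustration of the two properties above, here is a sketch on a hypothetical three-row frame (`toy_df` and its values are made up, not taken from the Italy data set). It shows that **.shape** is a `(rows, columns)` tuple and that **.columns** can be converted to a plain list of names.

```python
import pandas as pd

# Hypothetical miniature frame; the values are made up for illustration.
toy_df = pd.DataFrame({
    'date': ['2020-01-01', '2020-01-02', '2020-01-03'],
    'new_cases': [0.0, 5.0, 12.0],
})

n_rows, n_cols = toy_df.shape          # shape is a (rows, columns) tuple
column_names = list(toy_df.columns)    # columns is an Index; list() gives plain names

print(n_rows, n_cols)
print(column_names)
```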

In [None]:
covid_df.at[246,'new_cases']

The result is displayed as **np.float64(975.0)** even though NumPy was never imported, because Pandas uses NumPy internally,
and element access with .at[], .iloc[], etc. can return NumPy scalar types such as np.float64.

If you wish to work with a native Python float instead, use:
**float(covid_df.at[246, 'new_cases'])**

## Why NumPy is used (especially by Pandas):
NumPy (short for Numerical Python) is a powerful library that provides:

* Efficient multidimensional arrays (ndarray)
* Fast mathematical operations on these arrays
* Built-in support for linear algebra, statistics, random sampling, etc.

In short, Pandas depends on NumPy for:
* Speed
* Memory efficiency
* Mathematical functionality
* Data type consistency

In [None]:
float(covid_df.at[246, 'new_cases'])
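
The conversion above can be checked on a tiny frame. This is a minimal sketch, assuming a made-up `toy_df` with a float column; NumPy is imported here only to inspect the type.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a float column; the values are made up.
toy_df = pd.DataFrame({'new_cases': [975.0, 1020.0]})

value = toy_df.at[0, 'new_cases']   # element access returns a NumPy scalar
native = float(value)               # convert to a built-in Python float

print(type(value))
print(type(native))
```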

Instead of using the indexing notation **[]**, Pandas also allows accessing columns as properties of the data frame using **"." (dot)** notation.
However, this only works for columns whose names do not contain spaces or special characters and do not clash with an existing DataFrame attribute or method.

In [None]:
covid_df.new_cases
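
To see where dot notation stops working, here is a sketch on a hypothetical frame. The column names `'new tests'` and `'count'` are invented for this example: one contains a space, the other shares its name with a DataFrame method.

```python
import pandas as pd

# Hypothetical frame; column names chosen to show the limits of dot notation.
toy_df = pd.DataFrame({
    'new_cases': [1.0, 2.0],
    'new tests': [10.0, 20.0],   # name contains a space
    'count': [7, 8],             # same name as the DataFrame.count method
})

cases = toy_df.new_cases         # dot notation works for a simple name
tests = toy_df['new tests']      # brackets are required because of the space
count_attr = toy_df.count        # this is the method, not the column
count_col = toy_df['count']      # brackets reach the column

print(callable(count_attr))
print(count_col.tolist())
```

Bracket notation works in every case, so it is the safer choice when column names are not known in advance.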

You can also pass a list of columns within the indexing notation **[]** to access a subset of the data frame with just the
given columns.

In [None]:
cases_df = covid_df[['date', 'new_cases']]

In [None]:
cases_df

To access a specific row of data, use the **.loc** indexer.

In [None]:
covid_df.loc[243]

In [None]:
type(covid_df.loc[243])
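
A single row selected with **.loc** comes back as a Series whose index holds the column names. The sketch below shows this on a hypothetical `toy_df` with made-up values.

```python
import pandas as pd

# Hypothetical frame; the values are made up for illustration.
toy_df = pd.DataFrame({
    'date': ['2020-01-01', '2020-01-02', '2020-01-03'],
    'new_cases': [0.0, 5.0, 12.0],
})

row = toy_df.loc[1]              # select the row labelled 1

print(type(row))
print(list(row.index))           # the column names become the Series index
print(row['new_cases'])          # individual fields are accessed by column name
```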

To view the first or last few rows of data, we can use the **.head()** and **.tail()** methods.

In [None]:
covid_df.head(5)  # To get the first 5 rows of data

In [None]:
covid_df.tail(4)   # To get the last 4 rows of data

The distinction between *0* and *NaN* is subtle but important. In this data set, *NaN* indicates that the daily test numbers
were not reported on specific dates, whereas *0* would be a reported count of zero. In fact, Italy started reporting daily tests on April 19, 2020; by that time, 935310
tests had already been conducted.

We can find the first index that doesn't contain a *NaN* value by using the **first_valid_index()** method of a series.

In [None]:
covid_df.new_tests.first_valid_index()
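
The difference between a missing value and a zero can be seen in a small sketch. The frame below is hypothetical (made-up values): the first two days have no reported tests, and the last day reports zero tests.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: NaN means "not reported", 0.0 means "reported as zero".
toy_df = pd.DataFrame({
    'new_cases': [3.0, 0.0, 8.0, 4.0],
    'new_tests': [np.nan, np.nan, 100.0, 0.0],
})

first_reported = toy_df.new_tests.first_valid_index()   # first label with a non-NaN value
missing_days = int(toy_df.new_tests.isna().sum())       # count of NaN entries
zero_days = int((toy_df.new_tests == 0).sum())          # count of true zeros
total = toy_df.new_tests.sum()                          # sum() skips NaN by default

print(first_reported, missing_days, zero_days, total)
```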

To check that the values of new_tests change from *NaN* to actual floating-point numbers around that index, **pass a slice (a range of row labels) to .loc[]**.

In [None]:
covid_df.loc[108 : 113]
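
One detail worth knowing when slicing with **.loc** is that both end labels are included, unlike ordinary Python slicing. A minimal sketch on a hypothetical ten-row frame, with **.iloc** shown for contrast:

```python
import pandas as pd

# Hypothetical frame with the default integer index 0..9.
toy_df = pd.DataFrame({'new_tests': [float(i) for i in range(10)]})

by_label = toy_df.loc[3:6]       # label-based: includes both 3 and 6
by_position = toy_df.iloc[3:6]   # position-based: excludes position 6

print(list(by_label.index))
print(list(by_position.index))
```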

Use the **.sample()** method to retrieve a random sample of rows from the data frame.

In [None]:
covid_df.sample(10)
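
Each call to **.sample()** normally returns different rows. If a repeatable sample is needed, the `random_state` parameter can be set, as in this sketch on a hypothetical frame (the seed value 42 is arbitrary).

```python
import pandas as pd

# Hypothetical frame with 20 rows of made-up values.
toy_df = pd.DataFrame({'new_cases': range(20)})

# Passing the same random_state yields the same rows on every call.
sample_a = toy_df.sample(5, random_state=42)
sample_b = toy_df.sample(5, random_state=42)

print(sample_a.equals(sample_b))
```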

# Analyzing Data from the data frames

#### Q1: What is the total number of reported cases and deaths related to COVID-19 in Italy?
-> *Similar to NumPy arrays, a Pandas series supports the* **sum** *method to answer these questions*

In [None]:
total_cases = covid_df.new_cases.sum()
total_deaths = covid_df.new_deaths.sum()

In [None]:
print(f"The total number of new cases in Italy during covid-19 is: {int(total_cases)} \nThe total number of deaths in Italy during covid-19 is: {int(total_deaths)}")

#### Q2: What is the overall death rate in Italy during COVID-19?

In [None]:
death_rate = total_deaths / total_cases

In [None]:
print(f"The overall death rate in Italy during covid-19 is: {death_rate:.2%}")  # .2% multiplies the fraction by 100 and appends %
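
A ratio such as deaths divided by cases is a fraction between 0 and 1, so how it is formatted matters. The sketch below uses made-up numbers (25 and 200) to contrast the `f` and `%` format specifiers.

```python
# Made-up numbers purely for illustration.
rate = 25 / 200                   # a fraction: 0.125

as_fraction = f"{rate:.4f}"       # plain fraction with 4 decimals
as_percent = f"{rate:.2%}"        # multiplied by 100, with a % sign appended

print(as_fraction)                # 0.1250
print(as_percent)                 # 12.50%
```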

#### Q3: What is the overall number of tests conducted, given that a total of 935310 tests were conducted before daily test numbers were reported?

In [None]:
total_tests = 935310 + covid_df.new_tests.sum()

In [None]:
print(f"The total number of tests conducted in Italy during covid-19 is: {int(total_tests)}")

#### Q4: What fraction of the tests returned a positive result?

In [None]:
positive_tests = total_cases / total_tests  # Divide by all tests, including those conducted before daily reporting began

In [None]:
print(f"The fraction of tests that returned a positive result is: {positive_tests:.2%}")

#### Q5: Print all the rows of data that have 1,000 or more cases.

In [None]:
high_new_cases = covid_df.new_cases >= 1000

In [None]:
high_new_cases

In [None]:
high_cases = covid_df[high_new_cases]

In [None]:
print(f"The rows of data which had 1000 or more new cases in a day are: \n{high_cases}")

We could also combine these steps into a single line:

In [None]:
print(f"The rows of data which had 1000 or more cases in a day are:\n{covid_df[covid_df.new_cases >= 1000]}")
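
The filtering above happens in two steps: the comparison produces a boolean series, and indexing with that series keeps only the rows marked `True`. A minimal sketch on a hypothetical frame with made-up values:

```python
import pandas as pd

# Hypothetical frame; the values are made up for illustration.
toy_df = pd.DataFrame({
    'date': ['2020-03-01', '2020-03-02', '2020-03-03', '2020-03-04'],
    'new_cases': [400.0, 1000.0, 1500.0, 999.0],
})

mask = toy_df.new_cases >= 1000   # boolean Series, one entry per row
filtered = toy_df[mask]           # keeps only the rows where mask is True

print(mask.tolist())
print(list(filtered.index))       # original row labels are preserved
```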

The filtered data frame contains 72 rows, but by default Jupyter displays only the first 5 and the last 5 rows
for brevity. To go through all the rows, we can modify some display options.

In [None]:
from IPython.display import display

# Temporarily allow up to 100 rows to be displayed inside this block
with pd.option_context('display.max_rows', 100):
    display(covid_df[covid_df.new_cases >= 1000])

## How this works:

**1. from IPython.display import display**

*This imports the display() function, which is used in Jupyter Notebooks 
(and IPython environments) to nicely format and show objects like DataFrames, HTML, etc.*

**2. with pd.option_context('display.max_rows', 100):**

*That means:
Inside this with block, Pandas will show up to 100 rows when displaying a DataFrame.
Once the block is done, Pandas will return to its previous setting.*

**3. covid_df[covid_df.new_cases >= 1000]**

*This is a boolean filter on the DataFrame covid_df. It selects only the rows where the 
new_cases column has values greater than or equal to 1000.*
*Which will return a subset of the original DataFrame that matches this condition.*
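
The temporary nature of `pd.option_context` described in step 2 can be verified directly. This sketch reads the option before, inside, and after the `with` block using `pd.get_option`. It does not depend on any data set.

```python
import pandas as pd

before = pd.get_option('display.max_rows')       # current setting

with pd.option_context('display.max_rows', 100):
    inside = pd.get_option('display.max_rows')   # overridden inside the block

after = pd.get_option('display.max_rows')        # restored once the block exits

print(before, inside, after)
```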

#### Q6: Determine the days when the ratio of cases reported to tests conducted is higher than the overall positive rate (**positive_tests**)

In [None]:
positive_tests

In [None]:
ratio_df = covid_df[(covid_df.new_cases / covid_df.new_tests) > positive_tests]

In [None]:
print(f"The days when the ratio of cases reported to tests conducted is higher than the overall positive rate are: \n{ratio_df}")
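
Dividing one column by another works element by element. Rows where the denominator is *NaN* produce a *NaN* ratio, and any comparison with *NaN* is `False`, so those rows drop out of the filter. A sketch with a hypothetical frame and a made-up threshold (`overall_rate` is an assumption for this example, not the value computed above):

```python
import numpy as np
import pandas as pd

# Hypothetical frame; the first day has no reported test count.
toy_df = pd.DataFrame({
    'new_cases': [10.0, 5.0, 30.0, 2.0],
    'new_tests': [np.nan, 100.0, 100.0, 100.0],
})
overall_rate = 0.10   # made-up threshold for illustration

daily_ratio = toy_df.new_cases / toy_df.new_tests   # element-wise division
above = toy_df[daily_ratio > overall_rate]          # NaN > x is False, so row 0 is excluded

print(daily_ratio.tolist())
print(list(above.index))
```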