# Numpy Basics

This tutorial covers the following topics:

- Working with numerical data in Python
- Going from Python lists to Numpy arrays
- Multi-dimensional Numpy arrays and their benefits
- Array operations, broadcasting, indexing, and slicing
- Working with CSV data files using Numpy

In data analysis, "data" typically refers to numerical data, e.g., stock prices, sales figures, sensor measurements, sports scores, database tables, etc. The [Numpy](https://numpy.org) library provides specialized data structures, functions, and other tools for numerical computing in Python. Let's work through an example to see why and how to use Numpy for working with numerical data.


> Suppose we want to use climate data like the temperature, rainfall, and humidity to determine if a region is well suited for growing apples. A simple approach for doing this would be to formulate the relationship between the annual yield of apples (tons per hectare) and the climatic conditions like the average temperature (in degrees Fahrenheit), rainfall (in  millimeters) & average relative humidity (in percentage) as a linear equation.
>
> `yield_of_apples = w1 * temperature + w2 * rainfall + w3 * humidity`

We're expressing the yield of apples as a weighted sum of the temperature, rainfall, and humidity. This equation is an approximation since the actual relationship may not necessarily be linear, and there may be other factors involved. But a simple linear model like this often works well in practice.

Based on some statistical analysis of historical data, we might come up with reasonable values for the weights `w1`, `w2`, and `w3`. Here's an example set of values:

In [None]:
w1, w2, w3 = 0.3, 0.2, 0.5

Given some climate data for a region, we can now predict the yield of apples. Here's some sample data:

<img src="https://i.imgur.com/TXPBiqv.png" style="width:360px;">

To begin, we can define some variables to record climate data for a region.

In [None]:
kanto_temp = 73
kanto_rainfall = 67
kanto_humidity = 43

We can now substitute these variables into the linear equation to predict the yield of apples.

In [None]:
kanto_yield_apples = kanto_temp * w1 + kanto_rainfall * w2 + kanto_humidity * w3
kanto_yield_apples

In [None]:
print("The expected yield of apples in Kanto region is {} tons per hectare.".format(kanto_yield_apples))

To make it slightly easier to perform the above computation for multiple regions, we can represent the climate data for each region as a vector, i.e., a list of numbers.

In [None]:
kanto = [73, 67, 43]
johto = [91, 88, 64]
hoenn = [87, 134, 58]
sinnoh = [102, 43, 37]
unova = [69, 96, 70]

The three numbers in each vector represent the temperature, rainfall, and humidity data, respectively. 

We can also represent the set of weights used in the formula as a vector.

In [None]:
weights = [w1, w2, w3]

We can now write a function `crop_yield` to calculate the yield of apples (or any other crop) given the climate data and the respective weights.

In [None]:
def crop_yield(region, weights):
    result = 0
    for x, w in zip(region, weights):
        result += x * w
    return result

In [None]:
crop_yield(kanto, weights)

In [None]:
crop_yield(johto, weights)

In [None]:
crop_yield(unova, weights)

## Going from Python lists to Numpy arrays


The calculation performed by `crop_yield` (element-wise multiplication of two vectors, followed by a sum of the results) is also called the *dot product*. Learn more about the dot product here: https://www.khanacademy.org/math/linear-algebra/vectors-and-spaces/dot-cross-products/v/vector-dot-product-and-vector-length .

The Numpy library provides a built-in function to compute the dot product of two vectors. However, we must first convert the lists into Numpy arrays.

Let's install the Numpy library using the `pip` package manager.

In [None]:
!pip install numpy --upgrade --quiet

Next, let's import the `numpy` module. It's common practice to import numpy with the alias `np`.

In [None]:
import numpy as np

We can now use the `np.array` function to create Numpy arrays.

In [None]:
kanto = np.array([73, 67, 43])

In [None]:
kanto

In [None]:
weights = np.array([w1, w2, w3])

In [None]:
weights

Numpy arrays have the type `ndarray`.

In [None]:
type(kanto)

In [None]:
type(weights)

Just like lists, Numpy arrays support the indexing notation `[]`.

In [None]:
weights[0]

In [None]:
kanto[2]

## Operating on Numpy arrays

We can now compute the dot product of the two vectors using the `np.dot` function.

In [None]:
np.dot(kanto, weights)

We can achieve the same result with low-level operations supported by Numpy arrays: performing an element-wise multiplication and calculating the resulting numbers' sum.

In [None]:
(kanto * weights).sum()

The `*` operator performs an element-wise multiplication of two arrays if they have the same size. The `sum` method calculates the sum of numbers in an array.

In [None]:
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])

In [None]:
arr1 * arr2

In [None]:
arr2.sum()

## Benefits of using Numpy arrays

Numpy arrays offer the following benefits over Python lists for operating on numerical data:

- **Ease of use**: You can write small, concise, and intuitive mathematical expressions like `(kanto * weights).sum()` rather than using loops & custom functions like `crop_yield`.
- **Performance**: Numpy operations and functions are implemented internally in compiled C code, which makes them much faster than equivalent Python statements & loops that are interpreted at runtime.

Here's a comparison of dot products performed using Python loops vs. Numpy arrays on two vectors with a million elements each.

In [None]:
# Python lists
arr1 = list(range(1000000))
arr2 = list(range(1000000, 2000000))

# Numpy arrays
arr1_np = np.array(arr1)
arr2_np = np.array(arr2)

In [None]:
%%time
result = 0
for x1, x2 in zip(arr1, arr2):
    result += x1*x2
result

In [None]:
%%time
np.dot(arr1_np, arr2_np)

As you can see, `np.dot` is far faster than the `for` loop, often by a factor of around 100 at this size, though the exact speedup depends on your machine and Numpy version. This makes Numpy especially useful when working with large datasets containing tens of thousands or millions of data points.
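
A single `%%time` run can be noisy. As a rough sketch, the comparison can also be made with the standard `timeit` module, taking the best of several repeats. The vector length `n` and the helper name `loop_dot` are illustrative choices, not part of the original notebook.

```python
import timeit
import numpy as np

n = 100_000  # smaller than the tutorial's one million, to keep the run short
list_a = list(range(n))
list_b = list(range(n, 2 * n))
arr_a = np.array(list_a, dtype=np.int64)
arr_b = np.array(list_b, dtype=np.int64)

def loop_dot(xs, ys):
    """Dot product computed with a plain Python loop."""
    total = 0
    for x, y in zip(xs, ys):
        total += x * y
    return total

# Both approaches should agree before their speed is compared
loop_result = loop_dot(list_a, list_b)
numpy_result = int(np.dot(arr_a, arr_b))

# Best of several repeats is less noisy than a single measurement
loop_time = min(timeit.repeat(lambda: loop_dot(list_a, list_b), number=1, repeat=5))
numpy_time = min(timeit.repeat(lambda: np.dot(arr_a, arr_b), number=1, repeat=5))
print(f"loop: {loop_time:.5f}s, numpy: {numpy_time:.5f}s, speedup: {loop_time / numpy_time:.0f}x")
```

The printed speedup will vary from run to run and machine to machine, so treat it as an order-of-magnitude indication.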

## Multi-dimensional Numpy arrays 

We can now go one step further and represent the climate data for all the regions using a single 2-dimensional Numpy array.

In [None]:
climate_data = np.array([[73, 67, 43],
                         [91, 88, 64],
                         [87, 134, 58],
                         [102, 43, 37],
                         [69, 96, 70]])

In [None]:
climate_data

If you've taken a linear algebra class in high school, you may recognize the above 2-d array as a matrix with five rows and three columns. Each row represents one region, and the columns represent temperature, rainfall, and humidity, respectively.

Numpy arrays can have any number of dimensions and different lengths along each dimension. We can inspect the length along each dimension using the `.shape` property of an array.

<img src="https://fgnt.github.io/python_crashkurs_doc/_images/numpy_array_t.png" width="420">



In [None]:
# 2D array (matrix)
climate_data.shape

In [None]:
weights

In [None]:
# 1D array (vector)
weights.shape

In [None]:
# 3D array 
arr3 = np.array([
    [[11, 12, 13], 
     [13, 14, 15]], 
    [[15, 16, 17], 
     [17, 18, 19.5]]])

In [None]:
arr3.shape

All the elements in a numpy array have the same data type. You can check the data type of an array using the `.dtype` property.

In [None]:
weights.dtype

In [None]:
climate_data.dtype

If an array contains even a single floating point number, all the other elements are also converted to floats.

In [None]:
arr3.dtype
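
To see this conversion in isolation, here is a small sketch with made-up arrays. It also shows `astype`, which converts in the other direction, and the `dtype` argument of `np.array`.

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed = np.array([1, 2, 3.5])        # one float upcasts every element
print(ints.dtype, mixed.dtype)

# Explicit conversion: astype returns a new array and truncates toward zero
truncated = mixed.astype(np.int64)
print(truncated)

# A dtype can also be requested when the array is created
as_float = np.array([1, 2, 3], dtype=np.float64)
print(as_float.dtype)
```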

We can now compute the predicted yields of apples in all the regions, using a single matrix multiplication between `climate_data` (a 5x3 matrix) and `weights` (a vector of length 3). Here's what it looks like visually:

<img src="https://i.imgur.com/LJ2WKSI.png" width="240">

You can learn about matrices and matrix multiplication by watching the first 3-4 videos of this playlist: https://www.youtube.com/watch?v=xyAuNHPsq-g&list=PLFD0EB975BA0CC1E0&index=1 .

We can use the `np.matmul` function or the `@` operator to perform matrix multiplication.

In [None]:
np.matmul(climate_data, weights)

In [None]:
climate_data @ weights
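
One way to convince yourself of what the matrix multiplication does here is to compare it with one dot product per row. The sketch below repeats the arrays defined earlier so that it runs on its own; the names `row_by_row` and `all_at_once` are illustrative.

```python
import numpy as np

climate_data = np.array([[73, 67, 43],
                         [91, 88, 64],
                         [87, 134, 58],
                         [102, 43, 37],
                         [69, 96, 70]])
weights = np.array([0.3, 0.2, 0.5])

# One dot product per row (region), written out explicitly
row_by_row = np.array([np.dot(row, weights) for row in climate_data])

# The same computation as a single matrix-vector product
all_at_once = climate_data @ weights

print(all_at_once)
print(np.allclose(row_by_row, all_at_once))
```

The result has shape `(5,)`: one predicted yield per region.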

## Working with CSV data files

Numpy also provides helper functions for reading from & writing to files. `climate.txt` contains 10,000 climate measurements (temperature, rainfall & humidity) in the following format:


```
temperature,rainfall,humidity
25.00,76.00,99.00
39.00,65.00,70.00
59.00,45.00,77.00
84.00,63.00,38.00
66.00,50.00,52.00
41.00,94.00,77.00
91.00,57.00,96.00
49.00,96.00,99.00
67.00,20.00,28.00
...
```

This format of storing data is known as *comma-separated values* or CSV. 

> **CSVs**: A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. A CSV file typically stores tabular data (numbers and text) in plain text, in which case each line will have the same number of fields. (Wikipedia)


To read this file into a numpy array, we can use the `genfromtxt` function.

In [None]:
climate_data = np.genfromtxt('climate.txt', delimiter=',', skip_header=1)

In [None]:
climate_data

In [None]:
climate_data.shape
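
If you don't have `climate.txt` at hand, `genfromtxt` can also read from a file-like object. The sketch below uses an in-memory `io.StringIO` with a few hypothetical rows (one of them with a missing value) to show how the header is skipped and how an empty field becomes `nan`.

```python
import io
import numpy as np

# Hypothetical stand-in for the first lines of climate.txt
sample_csv = io.StringIO(
    "temperature,rainfall,humidity\n"
    "25.00,76.00,99.00\n"
    "39.00,65.00,70.00\n"
    "59.00,,77.00\n"          # a missing rainfall value
)

sample = np.genfromtxt(sample_csv, delimiter=',', skip_header=1)
print(sample.shape)
print(np.isnan(sample).sum())   # number of missing entries
```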

We can now perform a matrix multiplication using the `@` operator to predict the yield of apples for the entire dataset using a given set of weights.

In [None]:
weights = np.array([0.3, 0.2, 0.5])

In [None]:
yields = climate_data @ weights

In [None]:
yields

In [None]:
yields.shape

Let's add the `yields` to `climate_data` as a fourth column using the [`np.concatenate`](https://numpy.org/doc/stable/reference/generated/numpy.concatenate.html) function.

In [None]:
climate_results = np.concatenate((climate_data, yields.reshape(10000, 1)), axis=1)

In [None]:
climate_results

There are a couple of subtleties here:

* Since we wish to add new columns, we pass the argument `axis=1` to `np.concatenate`. The `axis` argument specifies the dimension for concatenation.

*  The arrays should have the same number of dimensions, and the same length along each except the dimension used for concatenation. We use the [`np.reshape`](https://numpy.org/doc/stable/reference/generated/numpy.reshape.html) function to change the shape of `yields` from `(10000,)` to `(10000,1)`.

Here's a visual explanation of `np.concatenate` along `axis=1` (can you guess what `axis=0` results in?):

<img src="https://www.w3resource.com/w3r_images/python-numpy-image-exercise-58.png" width="300">

The best way to understand what a Numpy function does is to experiment with it and read the documentation to learn about its arguments & return values. Use the cells below to experiment with `np.concatenate` and `np.reshape`.
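
As a starting point for that experimentation, here is a minimal sketch with small made-up arrays. It uses `reshape(-1, 1)`, where `-1` asks Numpy to infer the row count, so the length does not have to be hard-coded as in `reshape(10000, 1)`.

```python
import numpy as np

data = np.array([[1, 2, 3],
                 [4, 5, 6]])            # shape (2, 3)
extra = np.array([10, 20])              # shape (2,)

# Turn the 1-D array into a column, then append it as a new column
as_column = extra.reshape(-1, 1)        # shape (2, 1)
wider = np.concatenate((data, as_column), axis=1)

# axis=0 appends rows instead; the column counts must match
new_row = np.array([[7, 8, 9]])         # shape (1, 3)
taller = np.concatenate((data, new_row), axis=0)

print(wider.shape, taller.shape)
```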

Let's write the final results from our computation above back to a file using the `np.savetxt` function.

In [None]:
climate_results

In [None]:
np.savetxt('climate_results.txt', 
           climate_results, 
           fmt='%.2f', 
           delimiter=',',
           header='temperature,rainfall,humidity,yield_apples', 
           comments='')

The results are written back in the CSV format to the file `climate_results.txt`. 

```
temperature,rainfall,humidity,yield_apples
25.00,76.00,99.00,72.20
39.00,65.00,70.00,59.70
59.00,45.00,77.00,65.20
84.00,63.00,38.00,56.80
...
```



Numpy provides hundreds of functions for performing operations on arrays. Here are some commonly used functions:


* Mathematics: `np.sum`, `np.exp`, `np.round`, arithmetic operators 
* Array manipulation: `np.reshape`, `np.stack`, `np.concatenate`, `np.split`
* Linear Algebra: `np.matmul`, `np.dot`, `np.transpose`, `np.linalg.eigvals`
* Statistics: `np.mean`, `np.median`, `np.std`, `np.max`

> **How to find the function you need?** The easiest way to find the right function for a specific operation or use-case is to do a web search. For instance, searching for "How to join numpy arrays" leads to [this tutorial on array concatenation](https://cmdlinetips.com/2018/04/how-to-concatenate-arrays-in-numpy/). 

You can find a full list of array functions here: https://numpy.org/doc/stable/reference/routines.html

## Arithmetic operations, broadcasting and comparison

Numpy arrays support arithmetic operators like `+`, `-`, `*`, etc. You can perform an arithmetic operation with a single number (also called scalar) or with another array of the same shape. Operators make it easy to write mathematical expressions with multi-dimensional arrays.

In [None]:
arr2 = np.array([[1, 2, 3, 4], 
                 [5, 6, 7, 8], 
                 [9, 1, 2, 3]])

In [None]:
arr3 = np.array([[11, 12, 13, 14], 
                 [15, 16, 17, 18], 
                 [19, 11, 12, 13]])

In [None]:
# Adding a scalar
arr2 + 3

In [None]:
# Element-wise subtraction
arr3 - arr2

In [None]:
# Division by scalar
arr2 / 2

In [None]:
# Element-wise multiplication
arr2 * arr3

In [None]:
# Modulus with scalar
arr2 % 4

### Array Broadcasting

Numpy arrays also support *broadcasting*, allowing arithmetic operations between two arrays with different numbers of dimensions but compatible shapes. Let's look at an example to see how it works.

In [None]:
arr2 = np.array([[1, 2, 3, 4], 
                 [5, 6, 7, 8], 
                 [9, 1, 2, 3]])

In [None]:
arr2.shape

In [None]:
arr4 = np.array([4, 5, 6, 7])

In [None]:
arr4.shape

In [None]:
arr2 + arr4

When the expression `arr2 + arr4` is evaluated, `arr4` (which has the shape `(4,)`) is replicated three times to match the shape `(3, 4)` of `arr2`. Numpy performs the replication without actually creating three copies of the lower-dimensional array, thus improving performance and using less memory.

<img src="https://jakevdp.github.io/PythonDataScienceHandbook/figures/02.05-broadcasting.png" width="360">

Broadcasting only works if one of the arrays can be replicated to match the other array's shape.

In [None]:
arr5 = np.array([7, 8])

In [None]:
arr5.shape

In [None]:
arr2 + arr5

In the above example, even if `arr5` is replicated three times, it will not match the shape of `arr2`. Hence `arr2 + arr5` cannot be evaluated successfully. Learn more about broadcasting here: https://numpy.org/doc/stable/user/basics.broadcasting.html .
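
The rule behind this is that shapes are compared from the last dimension backwards, and two dimensions are compatible when they are equal or one of them is 1. The sketch below illustrates it with three small arrays; `row`, `col`, and `bad` are illustrative names.

```python
import numpy as np

arr2 = np.array([[1, 2, 3, 4],
                 [5, 6, 7, 8],
                 [9, 1, 2, 3]])          # shape (3, 4)

row = np.array([4, 5, 6, 7])             # shape (4,): matches the last dimension
col = np.array([[10], [20], [30]])       # shape (3, 1): stretched across the columns
bad = np.array([7, 8])                   # shape (2,): 2 is neither 4 nor 1

print((arr2 + row).shape)
print((arr2 + col).shape)

try:
    arr2 + bad
except ValueError as err:
    error_message = str(err)
    print("Broadcast failed:", error_message)
```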

### Array Comparison

Numpy arrays also support comparison operations like `==`, `!=`, `>` etc. The result is an array of booleans.

In [None]:
arr1 = np.array([[1, 2, 3], [3, 4, 5]])
arr2 = np.array([[2, 2, 3], [1, 2, 5]])

In [None]:
arr1 == arr2

In [None]:
arr1 != arr2

In [None]:
arr1 >= arr2

In [None]:
arr1 < arr2

Array comparison is frequently used to count the number of equal elements in two arrays using the `sum` method. Remember that `True` evaluates to `1` and `False` evaluates to `0` when booleans are used in arithmetic operations.

In [None]:
(arr1 == arr2).sum()

## Array indexing and slicing

Numpy extends Python's list indexing notation using `[]` to multiple dimensions in an intuitive fashion. You can provide a comma-separated list of indices or ranges to select a specific element or a subarray (also called a slice) from a Numpy array.

In [None]:
arr3 = np.array([
    [[11, 12, 13, 14], 
     [13, 14, 15, 19]], 
    
    [[15, 16, 17, 21], 
     [63, 92, 36, 18]], 
    
    [[98, 32, 81, 23],      
     [17, 18, 19.5, 43]]])

In [None]:
arr3.shape

In [None]:
# Single element
arr3[1, 1, 2]

In [None]:
# Subarray using ranges
arr3[1:, 0:1, :2]

In [None]:
# Mixing indices and ranges
arr3[1:, 1, 3]

In [None]:
# Mixing indices and ranges
arr3[1:, 1, :3]

In [None]:
# Using fewer indices
arr3[1]

In [None]:
# Using fewer indices
arr3[:2, 1]

However, we cannot use more indices than the array has dimensions, e.g., `arr3[1,3,2,1]`. This raises `IndexError: too many indices for array: array is 3-dimensional, but 4 were indexed`.
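
A useful habit when experimenting is to look at the shape of each result: an integer index removes a dimension, while a range keeps it. The sketch below uses a hypothetical array built with `np.arange` so that the values are easy to predict.

```python
import numpy as np

arr = np.arange(24).reshape(2, 3, 4)   # hypothetical 3-D array with shape (2, 3, 4)

print(arr[1, 2, 3])          # three integers: a single element
print(arr[1].shape)          # one integer: the remaining 2-D block
print(arr[:, 1].shape)       # a range and an integer: one dimension removed
print(arr[:, 1:2].shape)     # ranges only: all three dimensions kept
print(arr[1:, 0, :2].shape)  # mixing ranges and an integer
```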

The notation and its results can seem confusing at first, so take your time to experiment and become comfortable with it. Use the cells below to try out some examples of array indexing and slicing, with different combinations of indices and ranges. Here are some more examples demonstrated visually:

<img src="https://scipy-lectures.org/_images/numpy_indexing.png" width="360">

## Other ways of creating Numpy arrays

Numpy also provides some handy functions to create arrays of desired shapes with fixed or random values. Check out the [official documentation](https://numpy.org/doc/stable/reference/routines.array-creation.html) or use the `help` function to learn more.

In [None]:
# All zeros
np.zeros((3, 2))

In [None]:
# All ones
np.ones([2, 2, 3])

In [None]:
# Identity matrix
np.eye(3)

In [None]:
# Random vector
np.random.rand(5)

In [None]:
# Random matrix
np.random.randn(2, 3) # rand vs. randn - what's the difference?

In [None]:
# Fixed value
np.full([2, 3], 42)

In [None]:
# Range with start, end and step
np.arange(10, 90, 3)

In [None]:
# Equally spaced numbers in a range
np.linspace(3, 27, 9)
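
On the `rand` vs. `randn` question: `rand` draws from a uniform distribution on `[0, 1)`, while `randn` draws from the standard normal distribution (mean 0, standard deviation 1). The sketch below checks this empirically with a fixed seed, and also contrasts `arange` (step size) with `linspace` (number of points). The sample size is an arbitrary choice.

```python
import numpy as np

np.random.seed(0)                      # fixed seed so the run is repeatable
uniform = np.random.rand(10_000)       # uniform on [0, 1)
normal = np.random.randn(10_000)       # standard normal: mean 0, std 1

print(uniform.min(), uniform.max())
print(normal.mean(), normal.std())

# arange takes a step size; linspace takes the number of points (end included)
steps = np.arange(3, 28, 3)
points = np.linspace(3, 27, 9)
print(steps, points)
```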

## Exercises

**Q1: Write a NumPy program to compute the covariance matrix of two given arrays.**

Example:

Original array1:
``[0 1 2]``

Original array2:
``[2 1 0]``

Covariance matrix of the said arrays:
```python
[[ 1. -1.]
[-1. 1.]]
```

In [None]:
x = np.array([0, 1, 2])
y = np.array([2, 1, 0])
# ENTER CODE HERE

**Q2: Write a NumPy program to get the minimum and maximum value of a given array along the second axis.**

Example:

Original array:
```python 
[[0 1]
[2 3]]
```
Maximum value along the second axis:
``[1 3]``
Minimum value along the second axis:
``[0 2]``

In [None]:
x = np.arange(4).reshape((2, 2))
# ENTER YOUR CODE HERE

**Q3: Write a Python program to count number of occurrences of each value in a given array of non-negative integers**


Example:

Original array:
``[0, 1, 6, 1, 4, 1, 2, 2, 7]``

Number of occurrences of each value in array:
``[1 3 2 0 1 0 1 1]``

In [None]:
array1 = [0, 1, 6, 1, 4, 1, 2, 2, 7] 
# ENTER YOUR CODE HERE

**Q4: Compute the min-by-max for each row for given 2d numpy array.**

Example:

Original array:
```python
[[9 9 4]
 [8 8 1]
 [5 3 6]
 [3 3 3]
 [2 1 9]]
 ```

Output:

`[ 0.44444444,  0.125     ,  0.5       ,  1.        ,  0.11111111]`

In [None]:
np.random.seed(100)
a = np.random.randint(1,10, [5,3])
# ENTER YOUR CODE HERE

## Summary and Further Reading

With this, we complete our discussion of numerical computing with Numpy. We've covered the following topics in this tutorial:

- Going from Python lists to Numpy arrays
- Operating on Numpy arrays
- Benefits of using Numpy arrays over lists
- Multi-dimensional Numpy arrays
- Working with CSV data files
- Arithmetic operations and broadcasting
- Array indexing and slicing
- Other ways of creating Numpy arrays


Check out the following resources for learning more about Numpy:

- Official tutorial: https://numpy.org/devdocs/user/quickstart.html
- Numpy tutorial on W3Schools: https://www.w3schools.com/python/numpy_intro.asp
- Advanced Numpy (exploring the internals): http://scipy-lectures.org/advanced/advanced_numpy/index.html

# Pandas Basics

![](https://i.imgur.com/zfxLzEv.png)

This tutorial covers the following topics:

- Reading a CSV file into a Pandas data frame
- Retrieving data from Pandas data frames
- Querying, sorting, and analyzing data
- Basic plotting using line and bar charts
- Writing data frames to CSV files

## Reading a CSV file using Pandas

[Pandas](https://pandas.pydata.org/) is a popular Python library used for working with tabular data (similar to the data stored in a spreadsheet). Pandas provides helper functions to read data from various file formats like CSV, Excel spreadsheets, HTML tables, JSON, SQL, and more. In this exercise, we will work with the file `italy-covid-daywise.csv`, which contains day-wise Covid-19 data for Italy in the following format:

```
date,new_cases,new_deaths,new_tests
2020-04-21,2256.0,454.0,28095.0
2020-04-22,2729.0,534.0,44248.0
2020-04-23,3370.0,437.0,37083.0
2020-04-24,2646.0,464.0,95273.0
2020-04-25,3021.0,420.0,38676.0
2020-04-26,2357.0,415.0,24113.0
2020-04-27,2324.0,260.0,26678.0
2020-04-28,1739.0,333.0,37554.0
...
```

This format of storing data is known as *comma-separated values* or CSV. 

> **CSVs**: A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. A CSV file typically stores tabular data (numbers and text) in plain text, in which case each line will have the same number of fields. (Wikipedia)

To read the file, we can use the `read_csv` function from Pandas. First, let's install the library. The cell below installs `pandas-profiling`, which pulls in Pandas as a dependency.

In [None]:
#restart the kernel after installation
!pip install pandas-profiling --upgrade --quiet

We can now import the `pandas` module. As a convention, it is imported with the alias `pd`.

In [None]:
import pandas as pd

In [None]:
covid_df = pd.read_csv('italy-covid-daywise.csv')

Data from the file is read and stored in a `DataFrame` object - one of the core data structures in Pandas for storing and working with tabular data. We typically use the `_df` suffix in the variable names for dataframes.

In [None]:
type(covid_df)

In [None]:
covid_df

Here's what we can tell by looking at the dataframe:

- The file provides three day-wise counts for COVID-19 in Italy
- The metrics reported are new cases, deaths, and tests
- Data is provided for 248 days: from Dec 31, 2019, to Sep 3, 2020

Keep in mind that these are officially reported numbers. The actual number of cases & deaths may be higher, as not all cases are diagnosed. 

We can view some basic information about the data frame using the `.info` method.

In [None]:
covid_df.info()

It appears that each column contains values of a specific data type. You can view statistical information for numerical columns (mean, standard deviation, minimum/maximum values, and the number of non-empty values) using the `.describe` method.

In [None]:
covid_df.describe()

The `columns` property contains the list of columns within the data frame.

In [None]:
covid_df.columns

You can also retrieve the number of rows and columns in the data frame using the `.shape` property.

In [None]:
covid_df.shape

Here's a summary of the functions & methods we've looked at so far:

* `pd.read_csv` - Read data from a CSV file into a Pandas `DataFrame` object
* `.info()` - View basic information about rows, columns & data types
* `.describe()` - View statistical information about numeric columns
* `.columns` - Get the list of column names
* `.shape` - Get the number of rows & columns as a tuple


## Retrieving data from a data frame

The first thing you might want to do is retrieve data from this data frame, e.g., the counts of a specific day or the list of values in a particular column. To do this, it might help to understand the internal representation of data in a data frame. Conceptually, you can think of a dataframe as a dictionary of lists: keys are column names, and values are lists/arrays containing data for the respective columns. 

In [None]:
# Pandas format is similar to this
covid_data_dict = {
    'date':       ['2020-08-30', '2020-08-31', '2020-09-01', '2020-09-02', '2020-09-03'],
    'new_cases':  [1444, 1365, 996, 975, 1326],
    'new_deaths': [1, 4, 6, 8, 6],
    'new_tests': [53541, 42583, 54395, None, None]
}

Representing data in the above format has a few benefits:

* All values in a column typically have the same type, so it's more efficient to store them in a single array.
* Retrieving the values for a particular row simply requires extracting the elements at a given index from each column array.
* The representation is more compact (column names are recorded only once) compared to other formats that use a dictionary for each row of data (see the example below).

In [None]:
# Pandas format is not similar to this
covid_data_list = [
    {'date': '2020-08-30', 'new_cases': 1444, 'new_deaths': 1, 'new_tests': 53541},
    {'date': '2020-08-31', 'new_cases': 1365, 'new_deaths': 4, 'new_tests': 42583},
    {'date': '2020-09-01', 'new_cases': 996, 'new_deaths': 6, 'new_tests': 54395},
    {'date': '2020-09-02', 'new_cases': 975, 'new_deaths': 8 },
    {'date': '2020-09-03', 'new_cases': 1326, 'new_deaths': 6},
]
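
The dictionary-of-lists analogy can be made concrete: `pd.DataFrame` accepts such a dictionary directly. The sketch below repeats the dictionary from above so that it runs on its own; `sample_df` is an illustrative name.

```python
import pandas as pd

covid_data_dict = {
    'date':       ['2020-08-30', '2020-08-31', '2020-09-01', '2020-09-02', '2020-09-03'],
    'new_cases':  [1444, 1365, 996, 975, 1326],
    'new_deaths': [1, 4, 6, 8, 6],
    'new_tests':  [53541, 42583, 54395, None, None],
}

# Keys become column names, lists become columns
sample_df = pd.DataFrame(covid_data_dict)
print(sample_df)
print(sample_df.dtypes)
```

Note how the `None` entries show up as missing values in the `new_tests` column.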

With the dictionary of lists analogy in mind, you can now guess how to retrieve data from a data frame. For example, we can get a list of values from a specific column using the `[]` indexing notation.

In [None]:
covid_data_dict['new_cases']

In [None]:
covid_df['new_cases']

Each column is represented using a data structure called `Series`, which is essentially a numpy array with some extra methods and properties.

In [None]:
type(covid_df['new_cases'])

Like arrays, you can retrieve a specific value from a series using the indexing notation `[]`.

In [None]:
covid_df['new_cases'][246]

In [None]:
covid_df['new_tests'][240]

Pandas also provides the `.at` accessor to retrieve the element at a specific row & column directly.

In [None]:
covid_df.at[246, 'new_cases']

In [None]:
covid_df.at[240, 'new_tests']

Instead of using the indexing notation `[]`, Pandas also allows accessing columns as properties of the dataframe using the `.` notation. However, this method only works for columns whose names do not contain spaces or special characters.

In [None]:
covid_df.new_cases
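
The sketch below illustrates the limits of the `.` notation with a small made-up data frame. Besides names containing spaces, a column whose name matches an existing data frame method (here `count`) is also shadowed by that method, so the `[]` notation is the safer general-purpose choice.

```python
import pandas as pd

df = pd.DataFrame({
    'new_cases': [10, 20],
    'new tests': [100, 200],   # hypothetical column name containing a space
    'count':     [1, 2],       # hypothetical name that clashes with a DataFrame method
})

print(df.new_cases.tolist())        # attribute access works for a simple name
print(df['new tests'].tolist())     # a name with a space needs brackets
print(callable(df.count))           # df.count is the method, not the column
print(df['count'].tolist())         # brackets always reach the column
```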

Further, you can also pass a list of columns within the indexing notation `[]` to access a subset of the data frame with just the given columns.

In [None]:
cases_df = covid_df[['date', 'new_cases']]
cases_df

The new data frame `cases_df` contains just the selected columns. Don't rely on it sharing memory with `covid_df`: selecting a list of columns generally returns a copy, and whether a change to one data frame shows up in the other depends on the operation and the Pandas version. Pandas avoids unnecessary copying where it can, so creating new data frames by operating on existing ones remains fast even with thousands or millions of rows.

Sometimes you might need a full copy of the data frame, in which case you can use the `copy` method.

In [None]:
covid_df_copy = covid_df.copy()

The data within `covid_df_copy` is completely separate from `covid_df`, and changing values inside one of them will not affect the other.
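
A quick way to confirm this independence is to change a value in the copy and check the original. The sketch uses a tiny made-up data frame; `original` and `independent` are illustrative names.

```python
import pandas as pd

original = pd.DataFrame({'date': ['2020-09-02', '2020-09-03'],
                         'new_cases': [975, 1326]})

independent = original.copy()          # deep copy by default
independent.at[0, 'new_cases'] = 0     # change the copy only

print(original.at[0, 'new_cases'])     # the original value is untouched
print(independent.at[0, 'new_cases'])
```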

To access a specific row of data, Pandas provides the `.loc` indexer.

In [None]:
covid_df

In [None]:
covid_df.loc[243]

Each retrieved row is also a `Series` object.

In [None]:
type(covid_df.loc[243])

We can use the `.head` and `.tail` methods to view the first or last few rows of data.

In [None]:
covid_df.head(5)

In [None]:
covid_df.tail(4)

Notice above that while the first few values in the `new_cases` and `new_deaths` columns are `0`, the corresponding values within the `new_tests` column are `NaN`. That is because the CSV file does not contain any data for the `new_tests` column for specific dates (you can verify this by looking into the file). These values may be missing or unknown.

In [None]:
covid_df.at[0, 'new_tests']

In [None]:
type(covid_df.at[0, 'new_tests'])

The distinction between `0` and `NaN` is subtle but important. In this dataset, it represents that daily test numbers were not reported on specific dates. Italy started reporting daily tests on Apr 19, 2020. 935,310 tests had already been conducted before Apr 19. 

We can find the first index that doesn't contain a `NaN` value using a column's `first_valid_index` method.

In [None]:
covid_df.new_tests.first_valid_index()

Let's look at a few rows before and after this index to verify that the values change from `NaN` to actual numbers. We can do this by passing a range to `loc`.

In [None]:
covid_df.loc[108:113]
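
The difference between `0` and `NaN` also matters for calculations. The sketch below uses a small series of made-up values to show how missing entries are counted, located, and skipped by default in aggregations.

```python
import numpy as np
import pandas as pd

tests = pd.Series([np.nan, np.nan, 7841.0, 0.0, 5300.0])  # made-up values

print(tests.isna().sum())            # count of missing entries
print(tests.first_valid_index())     # index of the first non-missing entry
print(tests.sum())                   # sum skips NaN by default
print(tests.sum(skipna=False))       # NaN propagates when skipping is disabled
print(tests.mean())                  # mean over the three valid entries only
```

The `0.0` entry counts as a valid observation in the mean, while the `NaN` entries do not.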

We can use the `.sample` method to retrieve a random sample of rows from the data frame.

In [None]:
covid_df.sample(10)

Notice that even though we have taken a random sample, each row's original index is preserved - this is a useful property of data frames.



Here's a summary of the functions & methods we looked at in this section:

- `covid_df['new_cases']` - Retrieving columns as a `Series` using the column name
- `new_cases[243]` - Retrieving values from a `Series` using an index
- `covid_df.at[243, 'new_cases']` - Retrieving a single value from a data frame
- `covid_df.copy()` - Creating a deep copy of a data frame
- `covid_df.loc[243]` - Retrieving a row or range of rows of data from the data frame
- `head`, `tail`, and `sample` - Retrieving multiple rows of data from the data frame
- `covid_df.new_tests.first_valid_index` - Finding the first non-empty index in a series



## Analyzing data from data frames

Let's try to answer some questions about our data.

**Q: What are the total number of reported cases and deaths related to Covid-19 in Italy?**

Similar to Numpy arrays, a Pandas series supports the `sum` method to answer these questions.

In [None]:
total_cases = covid_df.new_cases.sum()
total_deaths = covid_df.new_deaths.sum()

In [None]:
print('The number of reported cases is {} and the number of reported deaths is {}.'.format(int(total_cases), int(total_deaths)))

**Q: What is the overall death rate (ratio of reported deaths to reported cases)?**

In [None]:
death_rate = covid_df.new_deaths.sum() / covid_df.new_cases.sum()

In [None]:
print("The overall reported death rate in Italy is {:.2f} %.".format(death_rate*100))

**Q: What is the overall number of tests conducted? A total of 935310 tests were conducted before daily test numbers were reported.**


In [None]:
initial_tests = 935310
total_tests = initial_tests + covid_df.new_tests.sum()

In [None]:
total_tests

**Q: What fraction of tests returned a positive result?**

In [None]:
positive_rate = total_cases / total_tests

In [None]:
print('{:.2f}% of tests in Italy led to a positive diagnosis.'.format(positive_rate*100))

## Querying and sorting rows

Let's say we only want to look at the days which had more than 1000 reported cases. We can use a boolean expression to check which rows satisfy this criterion.

In [None]:
high_new_cases = covid_df.new_cases > 1000

In [None]:
high_new_cases

The boolean expression returns a series containing `True` and `False` boolean values. You can use this series to select a subset of rows from the original dataframe, corresponding to the `True` values in the series.

In [None]:
covid_df[high_new_cases]

We can write this succinctly on a single line by passing the boolean expression as an index to the data frame.

In [None]:
high_cases_df = covid_df[covid_df.new_cases > 1000]

In [None]:
high_cases_df

The data frame contains 72 rows, but only the first & last five rows are displayed by default with Jupyter for brevity. We can change some display options to view all the rows.

In [None]:
from IPython.display import display
with pd.option_context('display.max_rows', 100):
    display(covid_df[covid_df.new_cases > 1000])
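
Boolean series can also be combined before they are used for selection. The sketch below, with made-up numbers, uses `&` (and) and `|` (or); each condition needs its own parentheses because these operators bind more tightly than comparisons.

```python
import pandas as pd

df = pd.DataFrame({
    'new_cases':  [500, 1500, 2500, 800],
    'new_deaths': [5, 40, 10, 30],
})   # made-up numbers

mask = df.new_cases > 1000
print(mask.tolist())

# Rows satisfying both conditions, and rows satisfying at least one
both = df[(df.new_cases > 1000) & (df.new_deaths > 20)]
either = df[(df.new_cases > 1000) | (df.new_deaths > 20)]
print(both.index.tolist(), either.index.tolist())
```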

We can also formulate more complex queries that involve multiple columns. As an example, let's try to determine the days when the ratio of cases reported to tests conducted is higher than the overall `positive_rate`.

In [None]:
positive_rate

In [None]:
high_ratio_df = covid_df[covid_df.new_cases / covid_df.new_tests > positive_rate]

In [None]:
high_ratio_df

The result of performing an operation on two columns is a new series.

In [None]:
covid_df.new_cases / covid_df.new_tests

We can use this series to add a new column to the data frame.

In [None]:
covid_df['positive_rate'] = covid_df.new_cases / covid_df.new_tests

In [None]:
covid_df

However, keep in mind that sometimes it takes a few days to get the results for a test, so we can't compare the number of new cases with the number of tests conducted on the same day. Any inference based on this `positive_rate` column is likely to be incorrect. It's essential to watch out for such subtle relationships that are often not conveyed within the CSV file and require some external context. It's always a good idea to read through the documentation provided with the dataset or ask for more information.

For now, let's remove the `positive_rate` column using the `drop` method.

In [None]:
covid_df.drop(columns=['positive_rate'], inplace=True)

Can you figure out the purpose of the `inplace` argument?
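
One way to investigate this question is to try both variants on a small throwaway data frame. The sketch below uses a hypothetical `toy_df` and compares what `drop` returns with and without `inplace=True`.

```python
import pandas as pd

# Hypothetical toy data frame with a column we want to remove.
toy_df = pd.DataFrame({'new_cases': [1, 2], 'positive_rate': [0.1, 0.2]})

# Without inplace=True, drop returns a new data frame and leaves the original untouched.
returned = toy_df.drop(columns=['positive_rate'])
columns_after_plain_drop = list(toy_df.columns)

# With inplace=True, the original is modified and the call returns None.
result = toy_df.drop(columns=['positive_rate'], inplace=True)
columns_after_inplace_drop = list(toy_df.columns)

print(columns_after_plain_drop, columns_after_inplace_drop, result)
```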

### Sorting rows using column values

The rows can also be sorted by a specific column using `.sort_values`. Let's sort to identify the days with the highest number of cases, then chain it with the `head` method to list just the first ten results.

In [None]:
covid_df.sort_values('new_cases', ascending=False).head(10)

It looks like the last two weeks of March had the highest number of daily cases. Let's compare this to the days where the highest number of deaths were recorded.

In [None]:
covid_df.sort_values('new_deaths', ascending=False).head(10)

It appears that daily deaths hit a peak just about a week after the peak in daily new cases.
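
The sorting calls above can be sketched on a toy data frame. The name `toy_df` and its values are made up; the point is that `sort_values` returns a sorted copy whose rows keep their original index labels.

```python
import pandas as pd

# Hypothetical toy data.
toy_df = pd.DataFrame({'new_cases': [300, 4500, 1200, 80]})

# Largest values first, then keep the top two rows.
top_two = toy_df.sort_values('new_cases', ascending=False).head(2)

# The default is ascending order, so this gives the two smallest values.
bottom_two = toy_df.sort_values('new_cases').head(2)

print(top_two)
print(bottom_two)
```

Because the index labels survive the sort, a label seen in the sorted output can be used with `.loc` on the original data frame to inspect neighboring rows.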

Let's also look at the days with the smallest number of cases. We might expect to see the first few days of the year on this list.

In [None]:
covid_df.sort_values('new_cases').head(10)

It seems like the count of new cases on Jun 20, 2020, was `-148`, a negative number! Not something we might have expected, but that's the nature of real-world data. It could be a data entry error, or the government may have issued a correction to account for miscounting in the past. Can you dig through news articles online and figure out why the number was negative?

Let's look at some days before and after Jun 20, 2020.

In [None]:
covid_df.loc[169:175]

For now, let's assume this was indeed a data entry error. We can use one of the following approaches for dealing with the missing or faulty value:

1. Replace it with `0`.
2. Replace it with the average of the entire column.
3. Replace it with the average of the values on the previous & next date.
4. Discard the row entirely.

Which approach you pick requires some context about the data and the problem. In this case, since we are dealing with data ordered by date, we can go ahead with the third approach.
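
The four approaches listed above can be sketched on a short, made-up series. The names and values below are hypothetical, and for the second approach this sketch averages the remaining values (excluding the faulty one), which is one possible reading of "the average of the entire column".

```python
import pandas as pd

# Hypothetical daily counts with one faulty entry at label 2.
cases = pd.Series([250.0, 300.0, -148.0, 200.0, 280.0])
bad = 2

# 1. Replace the faulty value with 0.
with_zero = cases.copy()
with_zero.at[bad] = 0

# 2. Replace it with the mean of the other values (assumption: the faulty value is excluded).
with_mean = cases.copy()
with_mean.at[bad] = cases.drop(bad).mean()

# 3. Replace it with the average of the previous and next values.
with_neighbours = cases.copy()
with_neighbours.at[bad] = (cases.at[bad - 1] + cases.at[bad + 1]) / 2

# 4. Discard the entry entirely.
without_row = cases.drop(bad)

print(with_zero.at[bad], with_mean.at[bad], with_neighbours.at[bad], len(without_row))
```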

You can use the `.at` accessor to modify a specific value within the data frame.

In [None]:
covid_df.at[172, 'new_cases'] = (covid_df.at[171, 'new_cases'] + covid_df.at[173, 'new_cases'])/2

Here's a summary of the functions & methods we looked at in this section:

- `covid_df.new_cases.sum()` - Computing the sum of values in a column or series
- `covid_df[covid_df.new_cases > 1000]` - Querying a subset of rows satisfying the chosen criteria using boolean expressions
- `covid_df['positive_rate'] = covid_df.new_cases / covid_df.new_tests` - Adding new columns by combining data from existing columns
- `covid_df.drop(columns=['positive_rate'], inplace=True)` - Removing one or more columns from the data frame
- `sort_values` - Sorting the rows of a data frame using column values
- `covid_df.at[172, 'new_cases'] = ...` - Replacing a value within the data frame

## Exercises

Try the following exercises to become familiar with Pandas data frames and practice your skills. In this assignment, we're going to analyze and operate on data from a CSV file.



In [None]:
#Load the CSV file in the Pandas data frame
countries_df = # ENTER YOUR CODE HERE

**Q1: How many countries does the dataframe contain?**

*Hint: Use the `.shape` attribute.*

In [None]:
num_countries = # ENTER YOUR CODE HERE

In [None]:
print('There are {} countries in the dataset'.format(num_countries))

**Q2: Retrieve a list of continents from the dataframe.**

*Hint: Use the `.unique` method of a series.*

In [None]:
continents =  # ENTER YOUR CODE HERE

In [None]:
continents

**Q3: What is the total population of all the countries listed in this dataset?**

In [None]:
total_population = # ENTER YOUR CODE HERE

In [None]:
print('The total population is {}.'.format(int(total_population)))

In [None]:
jovian.commit(project='pandas-practice-assignment', environment=None)

**Q4: Create a dataframe containing 10 countries with the highest population.**

*Hint: Chain the `sort_values` and `head` methods.*

In [None]:
most_populous_df = # ENTER YOUR CODE HERE

In [None]:
most_populous_df

**Q5: Add a new column in `countries_df` to record the overall GDP per country (product of population & per capita GDP).**



In [None]:
countries_df['gdp'] = # ENTER YOUR CODE HERE

In [None]:
countries_df

**Q6: Count the number of countries for which the `total_tests` data is missing.**

*Hint: Use the `.isna` method.*

In [None]:
total_tests_missing = # ENTER YOUR CODE HERE

In [None]:
print("The data for total tests is missing for {} countries.".format(int(total_tests_missing)))

## Summary and Further Reading


We've covered the following topics in this tutorial:

- Reading a CSV file into a Pandas data frame
- Retrieving data from Pandas data frames
- Querying, sorting, and analyzing data
- Merging, grouping, and aggregation of data
- Extracting useful information from dates
- Basic plotting using line and bar charts
- Writing data frames to CSV files


Check out the following resources to learn more about Pandas:

* User guide for Pandas: https://pandas.pydata.org/docs/user_guide/index.html
* Python for Data Analysis (book by Wes McKinney - creator of Pandas): https://www.oreilly.com/library/view/python-for-data/9781491957653/

# Scikit-learn Basics

The purpose of this guide is to illustrate some of the main features that scikit-learn provides. It assumes a very basic working knowledge of machine learning practices (model fitting, predicting, cross-validation, etc.).

[`Scikit-learn`](https://scikit-learn.org/stable/getting_started.html) is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection, model evaluation, and many other utilities.

In [None]:
!pip install -U scikit-learn 

## Fitting and Predicting: estimator basics

Scikit-learn provides dozens of built-in machine learning algorithms and models, called estimators. Each estimator can be fitted to some data using its `fit` method.

Here is a simple example where we fit a `RandomForestClassifier` to some very basic data:
```python
>>> from sklearn.ensemble import RandomForestClassifier
>>> clf = RandomForestClassifier(random_state=0)
>>> X = [[ 1,  2,  3],  # 2 samples, 3 features
...      [11, 12, 13]]
>>> y = [0, 1]  # classes of each sample
>>> clf.fit(X, y)
RandomForestClassifier(random_state=0)
```
The `fit` method generally accepts 2 inputs:

- The samples matrix (or design matrix) `X`. The size of `X` is typically `(n_samples, n_features)`, which means that samples are represented as rows and features are represented as columns.

- The target values `y`, which are real numbers for regression tasks, or integers for classification (or any other discrete set of values). For unsupervised learning tasks, `y` does not need to be specified. `y` is usually a 1D array where the `i`th entry corresponds to the target of the `i`th sample (row) of `X`.

Both `X` and `y` are usually expected to be numpy arrays or equivalent array-like data types, though some estimators work with other formats such as sparse matrices.
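
As a small illustration of the shapes described above, the sketch below rebuilds the same toy data as numpy arrays. The names `X_arr`, `y_arr`, and `clf_demo` are hypothetical and chosen to avoid clashing with the variables used elsewhere in this guide.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Same toy data as in the example above, stored as numpy arrays.
X_arr = np.array([[1, 2, 3],
                  [11, 12, 13]])   # shape (n_samples, n_features)
y_arr = np.array([0, 1])           # one target per row of X_arr

n_samples, n_features = X_arr.shape

clf_demo = RandomForestClassifier(random_state=0)
clf_demo.fit(X_arr, y_arr)

# After fitting, the estimator records the number of features and the classes it saw.
print(n_samples, n_features, clf_demo.n_features_in_, clf_demo.classes_)
```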

Once the estimator is fitted, it can be used for predicting target values of new data. You don’t need to re-train the estimator:
```python
>>> clf.predict(X)  # predict classes of the training data
array([0, 1])
>>> clf.predict([[4, 5, 6], [14, 15, 16]])  # predict classes of new data
array([0, 1])
```

### Exercise 1
For this exercise we will use the famous [Iris dataset](https://archive.ics.uci.edu/ml/datasets/Iris/).

Store the dataframe in the variable `iris_df` and split the features and labels using [`.iloc`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html). Store the feature dataframe in `X` and the labels in `y`.
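
As a reminder of how positional slicing with `.iloc` behaves, here is a minimal sketch on a made-up data frame. The name `toy_df` and its columns are hypothetical; this is not the Iris data.

```python
import pandas as pd

# Hypothetical toy data: two feature columns followed by a label column.
toy_df = pd.DataFrame({
    'length': [5.1, 4.9, 6.3],
    'width': [3.5, 3.0, 3.3],
    'target': [0, 0, 2],
})

features = toy_df.iloc[:, :-1]   # all rows, every column except the last
labels = toy_df.iloc[:, -1:]     # all rows, last column only, kept as a DataFrame
labels_1d = toy_df.iloc[:, -1]   # same column selected as a Series

print(features.shape, labels.shape, labels_1d.shape)
```

Slicing with `-1:` keeps a two-dimensional result, while indexing with a plain `-1` returns a one-dimensional series.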

In [None]:
import numpy as np
import pandas as pd
from sklearn import datasets

In [None]:
iris = datasets.load_iris()

iris_df = pd.DataFrame(
    data=np.c_[iris['data'], iris['target']],
    columns=iris['feature_names'] + ['target']
) # ENTER YOUR CODE HERE

In [None]:
assert isinstance(iris_df, pd.DataFrame)
assert X.shape == (150,4)
assert y.shape == (150,1)

Use the [`RandomForestClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html?highlight=randomforest#sklearn.ensemble.RandomForestClassifier) class to fit the dataset. [`train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html#sklearn.model_selection.train_test_split) is used to split the dataset into training and testing examples. Use the training examples to fit the model and predict on the test examples. Store the prediction results in `y_pred`.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y.values.ravel(), random_state=0)
# ENTER YOUR CODE HERE

In [None]:
assert isinstance(y_pred, np.ndarray)
assert len(y_pred) == 38
from sklearn.metrics import accuracy_score
assert np.round(accuracy_score(y_test, y_pred), 2) == 0.97
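
The expected length of 38 in the check above follows from the default behavior of `train_test_split`, which holds out 25% of the samples (rounded up) when no size is given. The sketch below shows this on placeholder arrays with 150 rows; `X_demo` and `y_demo` are hypothetical and are not the Iris data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with 150 samples and 2 features.
X_demo = np.arange(300).reshape(150, 2)
y_demo = np.arange(150) % 3

# With no test_size or train_size given, 25% of the samples go to the test set.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

print(len(X_tr), len(X_te))
```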

## Model Evaluation

Fitting a model to some data does not entail that it will predict well on unseen data. This needs to be directly evaluated. We have just seen the `train_test_split` helper that splits a dataset into train and test sets, but `scikit-learn` provides many other tools for model evaluation, in particular for [`cross-validation`](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation).

Here we briefly show how to perform a 5-fold cross-validation procedure, using the `cross_validate` helper. Note that it is also possible to manually iterate over the folds, use different data splitting strategies, and use custom scoring functions.

```python
>>> from sklearn.datasets import make_regression
>>> from sklearn.linear_model import LinearRegression
>>> from sklearn.model_selection import cross_validate
...
>>> X, y = make_regression(n_samples=1000, random_state=0)
>>> lr = LinearRegression()
...
>>> result = cross_validate(lr, X, y)  # defaults to 5-fold CV
>>> result['test_score']  # r_squared score is high because dataset is easy
array([1., 1., 1., 1., 1.])
```
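
As a rough sketch of the manual alternative mentioned above, the folds can be iterated over with `KFold` and scored with a metric of your choice. The dataset parameters, the choice of mean absolute error, and the `_demo` names are assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

# Small synthetic regression problem with a little noise (illustrative settings).
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=1.0, random_state=0)

# A different splitting strategy: shuffled 5-fold splits.
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_errors = []
for train_idx, test_idx in kf.split(X_demo):
    # Fit a fresh model on the training part of this fold.
    model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    # Score the held-out part with a custom metric.
    predictions = model.predict(X_demo[test_idx])
    fold_errors.append(mean_absolute_error(y_demo[test_idx], predictions))

print(fold_errors)
```

Iterating manually takes more code than `cross_validate`, but it gives full control over what happens inside each fold.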

### Exercise 2

In this exercise, we will do a cross validation experiment on our iris dataset using [`LogisticRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html?highlight=logisticregression#sklearn.linear_model.LogisticRegression) model.

The `LogisticRegression` (LR) model has a number of hyperparameters that can be tuned to achieve higher accuracy. In this exercise, you are free to use any [Model Selection](https://scikit-learn.org/stable/modules/classes.html?highlight=model_selection#module-sklearn.model_selection) classes to find the best parameters for the LR model.

Store the predictions from the classifier on `X_test` in the variable `y_pred`
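
As one possible illustration of a model selection class, the sketch below runs `GridSearchCV` on synthetic data rather than on the Iris dataset. The grid of `C` values and the `_demo` names are assumptions chosen for the example, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification data (illustrative only).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# Hypothetical grid over the regularization strength C.
param_grid = {'C': [0.01, 0.1, 1, 10]}

search = GridSearchCV(
    LogisticRegression(solver='liblinear', random_state=0),
    param_grid,
    cv=5,  # 5-fold cross-validation for every candidate
)
search.fit(X_tr, y_tr)

# By default the search refits the best candidate on the full training set,
# so it can be used for prediction directly.
pred_demo = search.predict(X_te)

print(search.best_params_, search.best_score_)
```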

In [None]:
from sklearn.linear_model import LogisticRegression

estimator = LogisticRegression(solver='liblinear', random_state=0)

# ENTER YOUR CODE HERE

In [None]:
from sklearn import metrics

assert np.round(metrics.accuracy_score(y_pred=y_pred, y_true=y_test), 2) == 0.89