# Unit 2
---




1. [Introducing Pandas](#section1)
2. [Reading files](#section2)
3. [Selecting data](#section3)
4. [Conditional selection](#section4)








<a id='section1'></a>

## 1. Introducing Pandas
---

<div>
<img src="images/pandas.JPG" width="400"/>
</div>



[Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/)



To begin, we need to import pandas.
By convention it is imported under the alias `pd`, so whenever you see `pd`, it refers to pandas.

In [2]:
import numpy as np
import pandas as pd

Pandas is a popular Python library for working with tabular data (similar to the data stored in a spreadsheet).

There are two main data structures used by pandas:
- Series: like a vector or a list
- DataFrame: equivalent to a table

Each column in a pandas DataFrame is a pandas Series. We will mainly be looking at the DataFrame.
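As a minimal sketch of the relationship between the two structures, the example below builds a tiny table and pulls out one column. The table, its column names, and the variable names `toy_df` and `doses` are made up for illustration and are not part of the vaccination dataset.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy_df = pd.DataFrame({
    "country": ["Israel", "Denmark", "Andorra"],
    "doses": [100, 50, 5],
})

doses = toy_df["doses"]   # selecting a single column

print(type(toy_df))       # the table is a DataFrame
print(type(doses))        # a single column is a Series
print(doses.tolist())     # a Series can be converted to a plain list
```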

We can easily create a pandas DataFrame by reading a .csv file.


<a id='section2'></a>

## 2. Reading files
---

<div>
<img src="images/reading.PNG" width="400"/>
</div>


We will read the whole file at once using pandas.
Sometimes you might want to read a file line by line and process each line separately. That is possible too; see for example [here](https://www.geeksforgeeks.org/read-a-file-line-by-line-in-python/).

We will read [data on COVID-19 vaccinations](https://github.com/owid/covid-19-data/tree/master/public/data/vaccinations)

To do that, I retrieved the URL of the raw data file:

<div>
<img src="images/raw.png" width="800"/>
</div>


In [6]:
url = 'https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/vaccinations/vaccinations.csv'
vacc_df = pd.read_csv(url)

`read_csv` accepts dozens of optional parameters. See the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html#pandas.read_csv).

For example, `sep='\t'` is used for tab-delimited files, and `usecols` reads only specific columns.
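The sketch below shows the `sep` and `usecols` parameters in action. To keep it runnable without a file or network access, it reads from an in-memory string through `io.StringIO`; the text content and the name `small_df` are assumptions made for this example.

```python
import io
import pandas as pd

# Hypothetical tab-delimited text standing in for a file on disk
tsv_text = (
    "location\tdate\ttotal_vaccinations\n"
    "Israel\t2021-02-05\t5399405\n"
    "Denmark\t2020-12-27\t6648\n"
)

small_df = pd.read_csv(
    io.StringIO(tsv_text),   # read_csv accepts a path, a URL, or a file-like object
    sep="\t",                # fields are separated by tabs instead of commas
    usecols=["location", "total_vaccinations"],   # skip the 'date' column
)
print(small_df)
```

Reading only the columns you need can noticeably reduce memory use on wide files.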

In [None]:
type(vacc_df)

View the shape of the DataFrame:

In [None]:
vacc_df.shape

In [None]:
vacc_df

View basic information:

In [None]:
vacc_df.info()

In [None]:
vacc_df.columns

View the first few rows:

In [None]:
vacc_df.head()

`What do you think the 'tail' method does? Try it out!`

`What happens if we just type vacc_df, without a head or a tail?` 

A summary of the functions so far:

* `pd.read_csv` - Read data from a CSV file into a Pandas `DataFrame` object
* `.info()` - View basic information about rows, columns & data types
* `.describe()` - View statistical information about numeric columns
* `.columns` - Get the list of column names
* `.shape` - Get the number of rows & columns as a tuple
* `.head()`, `.tail()` - View the first or last few rows
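`.describe()` appears in the summary but has not been run yet, so here is a short sketch of it on a made-up table (the data and the names `toy_df` and `summary` are assumptions for this example). The same calls work on `vacc_df`.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy_df = pd.DataFrame({
    "location": ["A", "B", "C", "D"],
    "total_vaccinations": [10.0, 20.0, 30.0, 40.0],
})

summary = toy_df.describe()   # by default only numeric columns are summarised
print(summary)                # count, mean, std, min, quartiles, max

print(toy_df.tail(2))         # the last two rows
```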


<a id='section3'></a>

## 3. Selecting data
---

![](https://i.imgur.com/zfxLzEv.png)

In [None]:
# pandas does NOT store data like this (a list of row dictionaries)
covid_data_list = [
    {'date': '2020-08-30', 'new_cases': 1444, 'new_deaths': 1, 'new_tests': 53541},
    {'date': '2020-08-31', 'new_cases': 1365, 'new_deaths': 4, 'new_tests': 42583},
    {'date': '2020-09-01', 'new_cases': 996, 'new_deaths': 6, 'new_tests': 54395},
    {'date': '2020-09-02', 'new_cases': 975, 'new_deaths': 8 },
    {'date': '2020-09-03', 'new_cases': 1326, 'new_deaths': 6},
]

In [None]:
# pandas stores data in a way similar to this (a dictionary of column lists)
covid_data_dict = {
    'date':       ['2020-08-30', '2020-08-31', '2020-09-01', '2020-09-02', '2020-09-03'],
    'new_cases':  [1444, 1365, 996, 975, 1326],
    'new_deaths': [1, 4, 6, 8, 6],
    'new_tests': [53541, 42583, 54395, None, None]
}
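
To make the link between the dictionary of lists and a table concrete, the sketch below passes the same dictionary to `pd.DataFrame`. The variable name `covid_df` is an assumption for this example.

```python
import pandas as pd

covid_data_dict = {
    'date':       ['2020-08-30', '2020-08-31', '2020-09-01', '2020-09-02', '2020-09-03'],
    'new_cases':  [1444, 1365, 996, 975, 1326],
    'new_deaths': [1, 4, 6, 8, 6],
    'new_tests':  [53541, 42583, 54395, None, None],
}

# Each key becomes a column name, each list becomes that column's values
covid_df = pd.DataFrame(covid_data_dict)
print(covid_df)

# The two None entries show up as missing values
print(covid_df['new_tests'].isna().sum())
```

`pd.DataFrame` also accepts a list of row dictionaries and converts it to the same column-oriented layout, filling keys that are absent from a row with missing values.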

#### The index of a dataframe doesn't have to be numeric

In [None]:
df = pd.DataFrame({'age':[30, 2, 12, 4, 32, 33, 69],
                   'color':['blue', 'green', 'red', 'white', 'gray', 'black', 'red'],
                   'food':['Steak', 'Lamb', 'Mango', 'Apple', 'Cheese', 'Melon', 'Beans'],
                   'height':[165, 70, 120, 80, 180, 172, 150],
                   'score':[4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],
                   'state':['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']
                   },
                  index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean', 'Christina', 'Cornelia'])
df

In our case it just happens to be numeric:

In [None]:
vacc_df.head()

In [None]:
vacc_df.location
#is the same as this:
#vacc_df['location']


Note: the `.` notation works only for columns whose names are valid Python identifiers (no spaces or special characters) and do not clash with an existing DataFrame attribute or method.
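A small sketch of where the `.` notation breaks down. The column names below are made up for illustration: one contains a space, and one (`count`) happens to share its name with a DataFrame method.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy_df = pd.DataFrame({
    "location": ["Israel", "Denmark"],
    "total vaccinations": [5399405, 35790],   # name contains a space
    "count": [1, 2],                           # same name as the DataFrame.count method
})

print(toy_df.location)                # works: 'location' is a valid identifier
print(toy_df["total vaccinations"])   # brackets are required for this name
print(callable(toy_df.count))         # True: the dot reaches the method, not the column
print(toy_df["count"])                # brackets always return the column
```

Bracket notation works for every column name, which is why it is the safer default in scripts.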

What data type is `vacc_df.location`? (A list? A Series? A DataFrame?)

In [None]:
#retrieve a specific cell
vacc_df.location[0]

In [None]:
#retrieve two columns
vacc_df[['location','date']]

#### Selecting subsets of rows and columns

`.loc` - selects subsets of rows and columns by label only. Allowed inputs are:

- A single label, e.g. 5 or 'a' (note that 5 is interpreted as a label of the index, and never as an integer position along the index).

- A list or array of labels, e.g. ['a', 'b', 'c'].

- A slice object with labels, e.g. 'a':'f'.

`.iloc` - selects subsets of rows and columns by integer location only
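The difference between labels and positions is easiest to see on a DataFrame whose index is not numeric. The sketch below uses a shortened version of the name-indexed table from earlier in this section; the variable names are assumptions for this example.

```python
import pandas as pd

people = pd.DataFrame(
    {"age": [30, 2, 12], "state": ["NY", "TX", "FL"]},
    index=["Jane", "Nick", "Aaron"],
)

by_label = people.loc["Nick", "age"]              # row label, column label
by_position = people.iloc[1, 0]                   # second row, first column
label_slice = people.loc["Jane":"Nick"]           # slice of labels
picked = people.loc[["Aaron", "Jane"], ["state"]] # lists of labels

print(by_label, by_position)   # both refer to the same cell
print(label_slice)
print(picked)
```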

In [None]:
#retrieve three rows
vacc_df.loc[[2, 33, 300]]

The `:` operator:

- when used alone, it means "everything"

- it is also used to indicate a slice of values


In [None]:
#retrieve (slice) three rows
vacc_df.loc[2:4]

How would you select the number of people vaccinated in Andorra in the first row where Andorra appears?

One way to do that is `iloc`.

Select specific rows and/or columns using `iloc` (integer-position-based location).

(Note that a notebook cell displays only the value of its last expression, so only the result of the last line will show.)

In [None]:
# Columns:
vacc_df.iloc[:,0] # first column of data frame  
vacc_df.iloc[:,1] # second column of data frame  
vacc_df.iloc[:,-1] # last column of data frame
#Rows and columns
vacc_df.iloc[0:5] # first five rows of dataframe
vacc_df.iloc[:, 0:2] # first two columns of data frame with all rows
vacc_df.iloc[[0,3,6,24], [0,5,6]] # 1st, 4th, 7th, 25th row + 1st 6th 7th columns.

---
---

Now you try.

How would you select:

a. the first five rows?

b. the first two columns, all rows?

c. the 1st and 3rd rows and the 2nd and 4th columns?

---
---

Select by column names using loc (label based location):

In [None]:
vacc_df.loc[:,['location', 'total_vaccinations']]

In [None]:
vacc_df.loc[0:4, ['location', 'total_vaccinations']]

The semantics are similar to `iloc`, but note:

- `iloc` excludes the end of a slice: `df.iloc[0:1000]` returns the rows at positions 0...999.
- `loc` includes the end of a slice: the row labelled with the end value is returned too.
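The endpoint rule can be checked on a tiny table whose index labels deliberately differ from the row positions. The data and variable names below are made up for this sketch.

```python
import pandas as pd

# Index labels (10, 20, ...) are chosen so they cannot be confused with positions
toy_df = pd.DataFrame({"value": ["a", "b", "c", "d"]}, index=[10, 20, 30, 40])

by_position = toy_df.iloc[0:2]   # positions 0 and 1; position 2 is excluded
by_label = toy_df.loc[10:30]     # labels 10, 20 and 30; the end label is included

print(by_position)
print(by_label)
```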

You try it! What is the difference between:

> vacc_df.iloc[0:5]

> vacc_df.loc[0:5]

---
Note: plain indexing operators, like the ones used with dictionaries (e.g. `vacc_df['location']`), also work in pandas. For more advanced selections, though, it is better to get used to `loc` and `iloc`.

---

<a id='section4'></a>

## 4. Conditional selection




In [None]:
vacc_df.loc[:,'location'] == 'Israel'

This creates a Series of True/False values (a boolean mask).
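Here is a minimal sketch of such a comparison on a made-up table, so the result is small enough to read in full. The data and the name `mask` are assumptions for this example.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy_df = pd.DataFrame({
    "location": ["Israel", "Denmark", "Israel"],
    "total_vaccinations": [100, 50, 120],
})

mask = toy_df["location"] == "Israel"   # element-wise comparison

print(mask.tolist())   # [True, False, True]
print(mask.dtype)      # bool
print(mask.sum())      # True counts as 1, so this is the number of matching rows
```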

We can pass this Series to the DataFrame to select only the matching rows:

In [None]:
vacc_df[vacc_df.loc[:,'location'] == 'Israel']

Another way:

In [None]:
vacc_df.loc[vacc_df.location == 'Israel']

Select two countries:

In [81]:
two_countries = vacc_df.loc[(vacc_df.location == 'Israel') | (vacc_df.location == 'Denmark')]
two_countries

| | location | iso_code | date | total_vaccinations | people_vaccinated | people_fully_vaccinated | daily_vaccinations_raw | daily_vaccinations | total_vaccinations_per_hundred | people_vaccinated_per_hundred | people_fully_vaccinated_per_hundred | daily_vaccinations_per_million |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 594 | Denmark | DNK | 2020-12-27 | 6648.0 | 6648.0 | | | | 0.11 | 0.11 | | |
| 595 | Denmark | DNK | 2020-12-28 | 8654.0 | 8654.0 | | 2006.0 | 2006.0 | 0.15 | 0.15 | | 346.0 |
| 596 | Denmark | DNK | 2020-12-29 | 17786.0 | 17786.0 | | 9132.0 | 5569.0 | 0.31 | 0.31 | | 961.0 |
| 597 | Denmark | DNK | 2020-12-30 | 29447.0 | 29447.0 | | 11661.0 | 7600.0 | 0.51 | 0.51 | | 1312.0 |
| 598 | Denmark | DNK | 2020-12-31 | 35790.0 | 35790.0 | | 6343.0 | 7286.0 | 0.62 | 0.62 | | 1258.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1219 | Israel | ISR | 2021-02-02 | 5102109.0 | 3239999.0 | 1862110.0 | 100142.0 | 133672.0 | 58.95 | 37.43 | 21.51 | 15444.0 |
| 1220 | Israel | ISR | 2021-02-03 | 5209744.0 | 3299271.0 | 1910473.0 | 107635.0 | 120353.0 | 60.19 | 38.12 | 22.07 | 13905.0 |
| 1221 | Israel | ISR | 2021-02-04 | 5337320.0 | 3371222.0 | 1966098.0 | 127576.0 | 108784.0 | 61.66 | 38.95 | 22.71 | 12568.0 |
| 1222 | Israel | ISR | 2021-02-05 | 5399405.0 | 3405387.0 | 1994018.0 | 62085.0 | 103270.0 | 62.38 | 39.34 | 23.04 | 11931.0 |
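
Two details of the two-country selection are worth a sketch. First, each comparison needs its own parentheses, because `|` binds more tightly than `==` in Python. Second, `Series.isin` is an alternative (not used in the cell above) that scales better when the list of countries grows. The toy data and variable names are assumptions for this example.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy_df = pd.DataFrame({
    "location": ["Israel", "Denmark", "Andorra", "Israel"],
    "total_vaccinations": [100, 50, 5, 120],
})

# Combine two conditions with | (element-wise "or"); note the parentheses
either = toy_df.loc[(toy_df.location == "Israel") | (toy_df.location == "Denmark")]

# Alternative: test membership in a list of values
same = toy_df.loc[toy_df.location.isin(["Israel", "Denmark"])]

print(either)
print(either.equals(same))   # both selections contain the same rows
```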


Only the index values of these rows:

In [None]:
two_countries.index.values

The first of these index values:

In [None]:
two_countries.index.values[0]

Remove the world data:

In [None]:
vacc_df_noWorld = vacc_df.loc[vacc_df.location != 'World']
vacc_df_noWorld.tail()

Find the country with the maximum number of vaccinations:

In [99]:
max_vacc = vacc_df_noWorld.total_vaccinations.max()
max_vacc

39037964.0

In [100]:
vacc_df_noWorld.loc[vacc_df_noWorld.total_vaccinations == max_vacc]

| | location | iso_code | date | total_vaccinations | people_vaccinated | people_fully_vaccinated | daily_vaccinations_raw | daily_vaccinations | total_vaccinations_per_hundred | people_vaccinated_per_hundred | people_fully_vaccinated_per_hundred | daily_vaccinations_per_million |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2404 | United States | USA | 2021-02-06 | 39037964.0 | 30250964.0 | 8317180.0 | 2218752.0 | 1351437.0 | 11.67 | 9.05 | 2.49 | 4041.0 |
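
As a possible shortcut to the two-step approach above, `Series.idxmax` returns the index label of the maximum directly, and that label can be passed to `loc`. This is an alternative technique, sketched here on made-up data with hypothetical variable names.

```python
import pandas as pd

# Hypothetical toy data with one missing value, only for illustration
toy_df = pd.DataFrame({
    "location": ["A", "B", "C"],
    "total_vaccinations": [10.0, None, 30.0],
})

# idxmax returns the index label of the largest value; missing values are skipped
top_label = toy_df["total_vaccinations"].idxmax()

top_row = toy_df.loc[top_label]   # the full row for that label, as a Series
print(top_label)
print(top_row)
```

One difference to keep in mind: `idxmax` returns only the first label when several rows tie for the maximum, whereas comparing against `max()` returns every tied row.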


----
#### Your turn:

Select the number of vaccinations in Israel on 2021-02-06 (hint: use `&`).

Find all the countries with more than 3,000,000 vaccinations.