# Unit 2
---




1. [Introducing Pandas](#section1)
2. [Reading files](#section2)
3. [Selecting data](#section3)
4. [Describe the data](#section4)
5. [Conditional selection](#section5)









<a id='section1'></a>

## 1. Introducing Pandas
---

<div>
<img src="images/pandas.JPG" width="400"/>
</div>



[Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/)



To begin, we need to import pandas.
By convention it is imported under the alias `pd`, so whenever you see `pd`, it refers to pandas.

In [1]:
import numpy as np
import pandas as pd

Pandas is a popular Python library for working with tabular data (similar to the data stored in a spreadsheet).

There are two main data structures used by pandas:
- Series: equivalent to a vector or a list
- DataFrame: equivalent to a table

Each column in a pandas Dataframe is a pandas Series data structure. We will mainly be looking at the Dataframe.
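To make the relationship between the two structures concrete, here is a minimal sketch. The names `ages` and `people` and the values in them are made up for illustration.

```python
import pandas as pd

# A Series: a single labelled column of values
ages = pd.Series([30, 2, 12], name="age")

# A DataFrame: a table whose columns are Series
people = pd.DataFrame({"age": [30, 2, 12], "state": ["NY", "TX", "FL"]})

print(type(ages))           # a pandas Series
print(type(people))         # a pandas DataFrame
print(type(people["age"]))  # selecting one column gives a Series again
```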

We can easily create a Pandas Dataframe by reading a .csv file


<a id='section2'></a>


## 2. Reading files
---

<div>
<img src="images/reading.PNG" width="400"/>
</div>


We will read the whole file at once using Pandas.
Sometimes you might want to read a file line by line and process each line separately. That is possible too; see, for example, [here](https://www.geeksforgeeks.org/read-a-file-line-by-line-in-python/).

We will read data on [COVID-19 vaccinations](https://github.com/owid/covid-19-data/tree/master/public/data/vaccinations)

To do that, I retrieved the URL of the raw data.

Click on "Raw" either here:

<div>
<img src="images/raw.png" width="800"/>
</div>

or here:

<div>
<img src="images/unit1_raw2.jpg" width="800"/>
</div>

and retrieve the link:

<div>
<img src="images/unit1_raw3.jpg" width="800"/>
</div>



In [2]:
url = 'https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/vaccinations/vaccinations.csv'
vacc_df = pd.read_csv(url)

`read_csv` has dozens of optional parameters. See the
[documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html#pandas.read_csv).

For example, `sep='\t'` is used for tab-delimited files, and `usecols` reads only specific columns.
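Here is a small sketch of those two options, `sep` and `usecols`. To keep it runnable without a file on disk, it reads from an in-memory string through `io.StringIO`; the text in `tsv_text` and the name `small_df` are stand-ins chosen for this example.

```python
import io
import pandas as pd

# Hypothetical tab-delimited text standing in for a real file
tsv_text = (
    "location\tdate\tdaily_vaccinations\n"
    "Israel\t2021-07-20\t10060\n"
    "Denmark\t2020-12-18\t1\n"
)

small_df = pd.read_csv(
    io.StringIO(tsv_text),
    sep="\t",                                    # columns are separated by tabs
    usecols=["location", "daily_vaccinations"],  # skip the 'date' column
)
print(small_df)
```

With a real file you would pass its path or URL instead of the `io.StringIO` object.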

In [3]:
type(vacc_df)

pandas.core.frame.DataFrame

view the shape of the dataframe:

In [4]:
vacc_df.shape

(86936, 16)

view basic information:

In [5]:
vacc_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 86936 entries, 0 to 86935
Data columns (total 16 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   location                             86936 non-null  object 
 1   iso_code                             86936 non-null  object 
 2   date                                 86936 non-null  object 
 3   total_vaccinations                   46986 non-null  float64
 4   people_vaccinated                    44755 non-null  float64
 5   people_fully_vaccinated              42027 non-null  float64
 6   total_boosters                       18202 non-null  float64
 7   daily_vaccinations_raw               39169 non-null  float64
 8   daily_vaccinations                   86659 non-null  float64
 9   total_vaccinations_per_hundred       46986 non-null  float64
 10  people_vaccinated_per_hundred        44755 non-null  float64
 11  people_fully_vaccinated_per_

In [6]:
vacc_df.columns

Index(['location', 'iso_code', 'date', 'total_vaccinations',
       'people_vaccinated', 'people_fully_vaccinated', 'total_boosters',
       'daily_vaccinations_raw', 'daily_vaccinations',
       'total_vaccinations_per_hundred', 'people_vaccinated_per_hundred',
       'people_fully_vaccinated_per_hundred', 'total_boosters_per_hundred',
       'daily_vaccinations_per_million', 'daily_people_vaccinated',
       'daily_people_vaccinated_per_hundred'],
      dtype='object')

View the first few rows:

In [7]:
vacc_df.head()

Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,total_boosters,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,total_boosters_per_hundred,daily_vaccinations_per_million,daily_people_vaccinated,daily_people_vaccinated_per_hundred
0,Afghanistan,AFG,2021-02-22,0.0,0.0,,,,,0.0,0.0,,,,,
1,Afghanistan,AFG,2021-02-23,,,,,,1367.0,,,,,34.0,1367.0,0.003
2,Afghanistan,AFG,2021-02-24,,,,,,1367.0,,,,,34.0,1367.0,0.003
3,Afghanistan,AFG,2021-02-25,,,,,,1367.0,,,,,34.0,1367.0,0.003
4,Afghanistan,AFG,2021-02-26,,,,,,1367.0,,,,,34.0,1367.0,0.003


What do you think the `tail` method does? Try it out!

What happens if we just type `vacc_df`, without `head` or `tail`?

In [53]:
vacc_df.tail()

Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million
35158,Zimbabwe,ZWE,2021-07-17,1771434.0,1132045.0,639389.0,39694.0,41958.0,11.92,7.62,4.3,2823.0
35159,Zimbabwe,ZWE,2021-07-18,1785533.0,1144379.0,641154.0,14099.0,42019.0,12.01,7.7,4.31,2827.0
35160,Zimbabwe,ZWE,2021-07-19,1827638.0,1184435.0,643203.0,42105.0,42253.0,12.3,7.97,4.33,2843.0
35161,Zimbabwe,ZWE,2021-07-20,1897337.0,1247494.0,649843.0,69699.0,45971.0,12.77,8.39,4.37,3093.0
35162,Zimbabwe,ZWE,2021-07-21,1949472.0,1292642.0,656830.0,52135.0,47976.0,13.12,8.7,4.42,3228.0


---
A summary of the functions so far:

>* `pd.read_csv` - Read data from a CSV file into a Pandas `DataFrame` object
>* `.info()` - View basic information about rows, columns & data types
>* `.columns` - Get the list of column names
>* `.shape` - Get the number of rows & columns as a tuple
>* `.head()`, `.tail()` - View the beginning/end of the file



<a id='section3'></a>

---
## 3. Selecting data

![](https://i.imgur.com/zfxLzEv.png)

##### A pandas dataframe is organized like a dictionary of columns, not like a list of rows

A list (one dictionary per row):

In [10]:

covid_data_list = [
    {'date': '2020-08-30', 'new_cases': 1444, 'new_deaths': 1, 'new_tests': 53541},
    {'date': '2020-08-31', 'new_cases': 1365, 'new_deaths': 4, 'new_tests': 42583},
    {'date': '2020-09-01', 'new_cases': 996, 'new_deaths': 6, 'new_tests': 54395},
    {'date': '2020-09-02', 'new_cases': 975, 'new_deaths': 8 },
    {'date': '2020-09-03', 'new_cases': 1326, 'new_deaths': 6},
]
covid_data_list

[{'date': '2020-08-30',
  'new_cases': 1444,
  'new_deaths': 1,
  'new_tests': 53541},
 {'date': '2020-08-31',
  'new_cases': 1365,
  'new_deaths': 4,
  'new_tests': 42583},
 {'date': '2020-09-01', 'new_cases': 996, 'new_deaths': 6, 'new_tests': 54395},
 {'date': '2020-09-02', 'new_cases': 975, 'new_deaths': 8},
 {'date': '2020-09-03', 'new_cases': 1326, 'new_deaths': 6}]

A dictionary (one list per column):

In [11]:
covid_data_dict = {
    'date':       ['2020-08-30', '2020-08-31', '2020-09-01', '2020-09-02', '2020-09-03'],
    'new_cases':  [1444, 1365, 996, 975, 1326],
    'new_deaths': [1, 4, 6, 8, 6],
    'new_tests': [53541, 42583, 54395, None, None]
}
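As a quick sketch of why the dictionary layout matters, the dictionary above can be handed directly to `pd.DataFrame`: each key becomes a column name and each list becomes that column's values. The name `covid_df` is introduced here only for illustration.

```python
import pandas as pd

covid_data_dict = {
    'date':       ['2020-08-30', '2020-08-31', '2020-09-01', '2020-09-02', '2020-09-03'],
    'new_cases':  [1444, 1365, 996, 975, 1326],
    'new_deaths': [1, 4, 6, 8, 6],
    'new_tests':  [53541, 42583, 54395, None, None]
}

# Keys become column names, lists become column values
covid_df = pd.DataFrame(covid_data_dict)
print(covid_df)

# Just as with a dictionary, a key retrieves one column
print(covid_df['new_cases'])
```

The `None` entries in `new_tests` show up as missing values in the dataframe.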

#### The index of a dataframe doesn't have to be numeric

In [12]:
df = pd.DataFrame({'age':[30, 2, 12, 4, 32, 33, 69],
                   'color':['blue', 'green', 'red', 'white', 'gray', 'black', 'red'],
                   'food':['Steak', 'Lamb', 'Mango', 'Apple', 'Cheese', 'Melon', 'Beans'],
                   'height':[165, 70, 120, 80, 180, 172, 150],
                   'score':[4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],
                   'state':['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']
                   },
                  index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean', 'Christina', 'Cornelia'])
df

Unnamed: 0,age,color,food,height,score,state
Jane,30,blue,Steak,165,4.6,NY
Nick,2,green,Lamb,70,8.3,TX
Aaron,12,red,Mango,120,9.0,FL
Penelope,4,white,Apple,80,3.3,AL
Dean,32,gray,Cheese,180,1.8,AK
Christina,33,black,Melon,172,9.5,TX
Cornelia,69,red,Beans,150,2.2,TX


In our file, the index is numeric:

In [13]:
vacc_df.head()

Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,total_boosters,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,total_boosters_per_hundred,daily_vaccinations_per_million,daily_people_vaccinated,daily_people_vaccinated_per_hundred
0,Afghanistan,AFG,2021-02-22,0.0,0.0,,,,,0.0,0.0,,,,,
1,Afghanistan,AFG,2021-02-23,,,,,,1367.0,,,,,34.0,1367.0,0.003
2,Afghanistan,AFG,2021-02-24,,,,,,1367.0,,,,,34.0,1367.0,0.003
3,Afghanistan,AFG,2021-02-25,,,,,,1367.0,,,,,34.0,1367.0,0.003
4,Afghanistan,AFG,2021-02-26,,,,,,1367.0,,,,,34.0,1367.0,0.003


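Return a single column as a Series:

Note: the `.` notation works only for columns whose names are valid Python identifiers (no spaces or special characters) and do not clash with an existing DataFrame attribute or method. The `[]` notation always works.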

In [16]:
vacc_df.location
vacc_df['location']

0        Afghanistan
1        Afghanistan
2        Afghanistan
3        Afghanistan
4        Afghanistan
            ...     
86931       Zimbabwe
86932       Zimbabwe
86933       Zimbabwe
86934       Zimbabwe
86935       Zimbabwe
Name: location, Length: 86936, dtype: object

In [17]:
type(vacc_df.location)

pandas.core.series.Series
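The limits of the `.` notation are easier to see on a small example. This sketch uses a made-up dataframe `demo` with one column name that contains a space and one, `count`, that happens to share its name with a DataFrame method.

```python
import pandas as pd

demo = pd.DataFrame({
    "location": ["Israel", "Denmark"],
    "daily vaccinations": [10060, 1],  # the name contains a space
    "count": [5, 7],                   # the name clashes with DataFrame.count
})

print(demo.location)               # works: a plain identifier
print(demo["daily vaccinations"])  # brackets are required here
print(demo["count"])               # brackets return the column
print(demo.count)                  # dot notation returns the method, not the column
```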

return a single column as a dataframe:

In [18]:
vacc_df[['location']]

Unnamed: 0,location
0,Afghanistan
1,Afghanistan
2,Afghanistan
3,Afghanistan
4,Afghanistan
...,...
86931,Zimbabwe
86932,Zimbabwe
86933,Zimbabwe
86934,Zimbabwe


retrieve a specific cell

In [19]:
vacc_df.location[600]

'Africa'

retrieve two columns

In [20]:
vacc_df[['location','date']]

Unnamed: 0,location,date
0,Afghanistan,2021-02-22
1,Afghanistan,2021-02-23
2,Afghanistan,2021-02-24
3,Afghanistan,2021-02-25
4,Afghanistan,2021-02-26
...,...,...
86931,Zimbabwe,2022-03-01
86932,Zimbabwe,2022-03-02
86933,Zimbabwe,2022-03-03
86934,Zimbabwe,2022-03-04


#### Selecting subsets of rows and columns

One way to do that is `.iloc`.

`.iloc` - selects subsets of rows and columns by integer location only

In [38]:
vacc_df.iloc[0]  #first row as a series
vacc_df.iloc[0:1]  #first row as a dataframe
vacc_df.iloc[-1] #last row as a series
vacc_df.iloc[-1:] #last row as a dataframe

Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,total_boosters,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,total_boosters_per_hundred,daily_vaccinations_per_million,daily_people_vaccinated,daily_people_vaccinated_per_hundred
86935,Zimbabwe,ZWE,2022-03-05,7936145.0,4377373.0,3410340.0,148432.0,5524.0,8458.0,52.58,29.0,22.6,0.98,560.0,2776.0,0.018


The : operator 

 - when used alone it means "everything"

- also used to indicate a ***slice*** of values


In [39]:
vacc_df.iloc[2:4] # third and fourth rows (positions 2 and 3)
vacc_df.iloc[[-1,2,22]] #a few specific rows

# Columns:
vacc_df.iloc[:,0] # first column of data frame  
vacc_df.iloc[:,1] # second column of data frame  
vacc_df.iloc[:,-1] # last column of data frame

#Rows and columns
vacc_df.iloc[0:5] # first five rows of dataframe
vacc_df.iloc[:, 0:2] # first two columns of data frame with all rows
vacc_df.iloc[[0,3,6,24], [0,5,6]] # 1st, 4th, 7th, 25th row + 1st 6th 7th columns.

Unnamed: 0,location,people_fully_vaccinated,total_boosters
0,Afghanistan,,
3,Afghanistan,,
6,Afghanistan,,
24,Afghanistan,,


What if I want to select the 'daily_vaccinations' column, but I don't remember the column number?

Use `.loc`

`.loc` - selects subsets of rows and columns by label only. Allowed inputs are:

- A single label, e.g. 5 or 'a', (note that 5 is interpreted as a label of the index, and never as an integer position along the index).

- A list or array of labels, e.g. ['a', 'b', 'c'].

- A slice object with labels, e.g. 'a':'f'.
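The three kinds of input listed above can be tried on a dataframe with a non-numeric index. This is a minimal sketch on a shortened version of the names-indexed dataframe from earlier; the variable names are chosen for this example.

```python
import pandas as pd

people = pd.DataFrame({'age': [30, 2, 12, 4],
                       'state': ['NY', 'TX', 'FL', 'AL']},
                      index=['Jane', 'Nick', 'Aaron', 'Penelope'])

one_row = people.loc['Nick']               # a single label -> one row as a Series
some_rows = people.loc[['Jane', 'Aaron']]  # a list of labels -> a dataframe
row_slice = people.loc['Nick':'Penelope']  # a label slice, both ends included
one_cell = people.loc['Aaron', 'age']      # row label + column label -> one value

print(row_slice)
print(one_cell)
```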

In [40]:
vacc_df.loc[2:3,['daily_vaccinations','date']]

Unnamed: 0,daily_vaccinations,date
2,1367.0,2021-02-24
3,1367.0,2021-02-25


I'm missing the location. Let's add it. 

In [41]:
vacc_df.loc[0:3,['location','daily_vaccinations','date']]

Unnamed: 0,location,daily_vaccinations,date
0,Afghanistan,,2021-02-22
1,Afghanistan,1367.0,2021-02-23
2,Afghanistan,1367.0,2021-02-24
3,Afghanistan,1367.0,2021-02-25


The semantics are similar to `iloc`, but note how slices treat the end point:

- `iloc` excludes the end of the slice: `df.iloc[0:1000]` returns the rows at positions 0...999
- `loc` includes the end of the slice: `df.loc[0:1000]` returns the rows labelled 0...1000
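A short sketch on a made-up six-row dataframe shows the difference in slice end points. It also shows that positions and labels stop coinciding once rows have been filtered out.

```python
import pandas as pd

numbers = pd.DataFrame({'value': [10, 20, 30, 40, 50, 60]})  # default index 0..5

by_position = numbers.iloc[0:3]  # positions 0, 1, 2
by_label = numbers.loc[0:3]      # labels 0, 1, 2, 3
print(len(by_position), len(by_label))

# After filtering, the remaining rows keep their original labels (3, 4, 5)
filtered = numbers[numbers['value'] > 30]
print(filtered.iloc[0])  # first remaining row, by position
print(filtered.loc[3])   # the same row, by label
```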

You try it! What is the difference between:

> vacc_df.iloc[0:5]

> vacc_df.loc[0:5]

In [24]:
 vacc_df.iloc[0:5]

Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million
0,Afghanistan,AFG,2021-02-22,0.0,0.0,,,,0.0,0.0,,
1,Afghanistan,AFG,2021-02-23,,,,,1367.0,,,,35.0
2,Afghanistan,AFG,2021-02-24,,,,,1367.0,,,,35.0
3,Afghanistan,AFG,2021-02-25,,,,,1367.0,,,,35.0
4,Afghanistan,AFG,2021-02-26,,,,,1367.0,,,,35.0


In [25]:
vacc_df.loc[0:5]

Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million
0,Afghanistan,AFG,2021-02-22,0.0,0.0,,,,0.0,0.0,,
1,Afghanistan,AFG,2021-02-23,,,,,1367.0,,,,35.0
2,Afghanistan,AFG,2021-02-24,,,,,1367.0,,,,35.0
3,Afghanistan,AFG,2021-02-25,,,,,1367.0,,,,35.0
4,Afghanistan,AFG,2021-02-26,,,,,1367.0,,,,35.0
5,Afghanistan,AFG,2021-02-27,,,,,1367.0,,,,35.0


---
---

Now you:

How would you select:

a. the first five rows?

b. the first two columns, all rows?

c. the 1st and 3rd rows and the 2nd and 4th columns?

---


In [80]:
vacc_df.iloc[:, 0:2]


Unnamed: 0,location,iso_code
0,Afghanistan,AFG
1,Afghanistan,AFG
2,Afghanistan,AFG
3,Afghanistan,AFG
4,Afghanistan,AFG
...,...,...
35158,Zimbabwe,ZWE
35159,Zimbabwe,ZWE
35160,Zimbabwe,ZWE
35161,Zimbabwe,ZWE


---
A summary of the functions in this unit:

>* `.iloc` - selects rows and columns by integer location
>* `.loc` - selects rows and columns by label location



Note: the indexing operators that work on dictionaries (square brackets with a key) also work in pandas. For more advanced selections, however, it is better to get used to `loc` and `iloc`.

---

<a id='section4'></a>

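## 4. Describe the data

`.describe()` reports summary statistics (count, mean, standard deviation, minimum, quartiles and maximum) for the numeric columns: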

In [49]:
vacc_df.describe()

Unnamed: 0,total_vaccinations,people_vaccinated,people_fully_vaccinated,total_boosters,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,total_boosters_per_hundred,daily_vaccinations_per_million,daily_people_vaccinated,daily_people_vaccinated_per_hundred
count,46986.0,44755.0,42027.0,18202.0,39169.0,86659.0,46986.0,44755.0,42027.0,18202.0,86659.0,85355.0,85355.0
mean,167055700.0,85458370.0,67193760.0,18849330.0,1124142.0,512292.4,74.578996,38.725218,33.077047,13.063171,3337.479535,211245.7,0.149874
std,765400400.0,392974300.0,323889200.0,90052820.0,4280085.0,2779368.0,64.948555,28.964507,27.837182,17.794349,3895.063352,1172379.0,0.245209
min,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,648811.2,410028.0,285498.0,2978.0,6495.0,1121.0,12.99,9.255,5.29,0.01,702.0,445.0,0.024
50%,4766018.0,2862765.0,2214907.0,496477.0,38856.0,9646.0,61.05,38.01,28.03,3.59,2232.0,4016.0,0.073
75%,29095070.0,17029830.0,13502900.0,4003726.0,263207.0,65838.0,126.08,65.63,58.835,21.82,4818.0,26826.0,0.192
max,10881330000.0,4986748000.0,4414217000.0,1429801000.0,54906390.0,43536980.0,336.16,124.65,121.53,89.99,117497.0,21419050.0,11.75


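A note on `e`: in scientific notation, `e` means "times 10 to the power of", and it is always followed by another number, which is the value of the exponent. For example, the value `1.074307e+10` that appears in some of the tables in this notebook is 1.074307 × 10¹⁰, that is, 10,743,070,000.

10e+1 = 10 × 10¹ = 100

10e+2 = 10 × 10² = 1000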
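The `e` notation is also ordinary Python syntax for floating-point numbers, so the arithmetic can be checked directly. A small sketch (the variable name `value` is chosen for this example):

```python
# The number before e is multiplied by 10 raised to the number after e
print(1e2)     # 1 * 10**2
print(10e1)    # 10 * 10**1
print(10e2)    # 10 * 10**2
print(1.5e3)   # 1.5 * 10**3

# Formatting a large value with thousands separators and no decimals
value = 1.074307e+10
print(f"{value:,.0f}")
```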

describe categorical data:

In [43]:
vacc_df[['location']].describe()

Unnamed: 0,location
count,86936
unique,235
top,High income
freq,460


`High income` is not a country; it is an aggregate. The `location` column contains a few of these, including `Africa`, `World` and some others.
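One possible way to set such aggregate rows aside is `.isin()` combined with `~` (logical "not"). The sketch below uses a made-up five-row dataframe, and the `aggregates` list contains only names that appear in this notebook; for the real file you would need to check which aggregate names are actually present.

```python
import pandas as pd

toy = pd.DataFrame({
    'location': ['Israel', 'World', 'Africa', 'Upper middle income', 'Denmark'],
    'iso_code': ['ISR', 'OWID_WRL', 'OWID_AFR', 'OWID_UMC', 'DNK'],
})

# Assumption: an incomplete, hand-written list of aggregate names
aggregates = ['World', 'Africa', 'Upper middle income']

# isin() marks rows whose location is in the list; ~ inverts the marks
countries_only = toy[~toy['location'].isin(aggregates)]
print(countries_only)
```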

<a id='section5'></a>

## 5. Conditional selection




In [None]:
vacc_df.loc[:,'location'] == 'Israel'

0        False
1        False
2        False
3        False
4        False
         ...  
86931    False
86932    False
86933    False
86934    False
86935    False
Name: location, Length: 86936, dtype: bool

This creates a Series of `True`/`False` values, one per row.

We can pass this Series to the dataframe to select only the rows for which it is `True`:

In [None]:
vacc_df[vacc_df.loc[:,'location'] == 'Israel']

Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,total_boosters,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,total_boosters_per_hundred,daily_vaccinations_per_million,daily_people_vaccinated,daily_people_vaccinated_per_hundred
85467,World,OWID_WRL,2020-12-02,0.000000e+00,0.000000e+00,,,0.0,0.0,0.00,0.00,,,0.0,0.0,0.000
85468,World,OWID_WRL,2020-12-03,0.000000e+00,0.000000e+00,,,0.0,0.0,0.00,0.00,,,0.0,0.0,0.000
85469,World,OWID_WRL,2020-12-04,1.000000e+00,1.000000e+00,,,0.0,0.0,0.00,0.00,,,0.0,0.0,0.000
85470,World,OWID_WRL,2020-12-05,1.000000e+00,1.000000e+00,,,0.0,0.0,0.00,0.00,,,0.0,0.0,0.000
85471,World,OWID_WRL,2020-12-06,1.000000e+00,1.000000e+00,,,0.0,0.0,0.00,0.00,,,0.0,0.0,0.000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
85922,World,OWID_WRL,2022-03-02,1.074307e+10,4.975229e+09,4.393554e+09,1.403893e+09,24458817.0,22845824.0,136.42,63.18,55.79,17.83,2901.0,6233023.0,0.079
85923,World,OWID_WRL,2022-03-03,1.075692e+10,4.976451e+09,4.399052e+09,1.407711e+09,25001449.0,21837931.0,136.60,63.19,55.86,17.88,2773.0,5654484.0,0.072
85924,World,OWID_WRL,2022-03-04,1.076896e+10,4.978811e+09,4.402673e+09,1.410391e+09,23182834.0,20749352.0,136.75,63.22,55.91,17.91,2635.0,5204433.0,0.066
85925,World,OWID_WRL,2022-03-05,1.078704e+10,4.982569e+09,4.408999e+09,1.413665e+09,21699233.0,19892930.0,136.98,63.27,55.99,17.95,2526.0,4665358.0,0.059


Another way:

In [31]:
vacc_df.loc[vacc_df.location == 'Israel']

Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million
15458,Israel,ISR,2020-12-19,60.0,60.0,,,,0.00,0.00,,
15459,Israel,ISR,2020-12-20,7434.0,7434.0,,7374.0,7374.0,0.09,0.09,,852.0
15460,Israel,ISR,2020-12-21,32318.0,32318.0,,24884.0,16129.0,0.37,0.37,,1863.0
15461,Israel,ISR,2020-12-22,76933.0,76933.0,,44615.0,25624.0,0.89,0.89,,2960.0
15462,Israel,ISR,2020-12-23,139765.0,139765.0,,62832.0,34926.0,1.61,1.61,,4035.0
...,...,...,...,...,...,...,...,...,...,...,...,...
15669,Israel,ISR,2021-07-18,10980188.0,5744452.0,5235736.0,11665.0,7985.0,126.86,66.37,60.49,923.0
15670,Israel,ISR,2021-07-19,10994393.0,5747537.0,5246856.0,14205.0,9183.0,127.02,66.40,60.62,1061.0
15671,Israel,ISR,2021-07-20,11008624.0,5750067.0,5258557.0,14231.0,10060.0,127.19,66.43,60.75,1162.0
15672,Israel,ISR,2021-07-21,11022474.0,5752297.0,5270177.0,13850.0,10620.0,127.35,66.46,60.89,1227.0


Select two countries:

In [82]:
two_countries = vacc_df.loc[(vacc_df.location == 'Israel') | (vacc_df.location == 'Denmark')]
two_countries.head()

Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million
8047,Denmark,DNK,2020-12-17,1.0,1.0,,,,0.0,0.0,,
8048,Denmark,DNK,2020-12-18,2.0,2.0,,1.0,1.0,0.0,0.0,,0.0
8049,Denmark,DNK,2020-12-19,3.0,3.0,,1.0,1.0,0.0,0.0,,0.0
8050,Denmark,DNK,2020-12-20,,,,,1.0,,,,0.0
8051,Denmark,DNK,2020-12-21,4.0,4.0,,,1.0,0.0,0.0,,0.0
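Conditions are combined with `|` (or), `&` (and) and `~` (not), and each condition needs its own parentheses because these operators bind more tightly than `==`. Here is a minimal sketch on a made-up four-row dataframe; the values are for illustration only.

```python
import pandas as pd

toy = pd.DataFrame({
    'location': ['Israel', 'Israel', 'Denmark', 'World'],
    'date': ['2021-07-19', '2021-07-20', '2021-07-20', '2021-07-20'],
    'daily_vaccinations': [9183.0, 10060.0, 100.0, 30874061.0],  # toy values
})

is_israel = toy['location'] == 'Israel'
is_denmark = toy['location'] == 'Denmark'

either = toy[is_israel | is_denmark]                               # Israel OR Denmark
same_with_isin = toy[toy['location'].isin(['Israel', 'Denmark'])]  # equivalent shortcut
both = toy[is_israel & (toy['date'] == '2021-07-20')]              # Israel AND that date
not_world = toy[~(toy['location'] == 'World')]                     # NOT World

print(both)
```

Storing each condition in a named variable, as done here, is optional but tends to make longer selections easier to read.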


Only the index labels of the selected rows:

In [33]:
two_countries.index.values

array([ 8047,  8048,  8049,  8050,  8051,  8052,  8053,  8054,  8055,
        8056,  8057,  8058,  8059,  8060,  8061,  8062,  8063,  8064,
        8065,  8066,  8067,  8068,  8069,  8070,  8071,  8072,  8073,
        8074,  8075,  8076,  8077,  8078,  8079,  8080,  8081,  8082,
        8083,  8084,  8085,  8086,  8087,  8088,  8089,  8090,  8091,
        8092,  8093,  8094,  8095,  8096,  8097,  8098,  8099,  8100,
        8101,  8102,  8103,  8104,  8105,  8106,  8107,  8108,  8109,
        8110,  8111,  8112,  8113,  8114,  8115,  8116,  8117,  8118,
        8119,  8120,  8121,  8122,  8123,  8124,  8125,  8126,  8127,
        8128,  8129,  8130,  8131,  8132,  8133,  8134,  8135,  8136,
        8137,  8138,  8139,  8140,  8141,  8142,  8143,  8144,  8145,
        8146,  8147,  8148,  8149,  8150,  8151,  8152,  8153,  8154,
        8155,  8156,  8157,  8158,  8159,  8160,  8161,  8162,  8163,
        8164,  8165,  8166,  8167,  8168,  8169,  8170,  8171,  8172,
        8173,  8174,

The first index label:

In [34]:
two_countries.index.values[0]

8047

How many rows are there for the two countries?

In [81]:
len(two_countries)

433

In [36]:
two_countries.count()

location                               433
iso_code                               433
date                                   433
total_vaccinations                     376
people_vaccinated                      431
people_fully_vaccinated                399
daily_vaccinations_raw                 371
daily_vaccinations                     431
total_vaccinations_per_hundred         376
people_vaccinated_per_hundred          431
people_fully_vaccinated_per_hundred    399
daily_vaccinations_per_million         431
dtype: int64
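The two results above differ because `len()` counts rows, while `.count()` counts only the non-missing values in each column. A small sketch with made-up values:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'location': ['Denmark', 'Denmark', 'Denmark'],
    'total_vaccinations': [1.0, np.nan, 4.0],  # one missing value
})

print(len(toy))     # number of rows, missing values or not
print(toy.count())  # non-missing values per column
print(toy['total_vaccinations'].isna().sum())  # number of missing values
```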

Near the end of the file we have some world data.

Use `str.contains` if you're not sure what this location is called:

In [86]:
vacc_df[vacc_df['location'].str.contains('Worl')]

Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million
34612,World,OWID_WRL,2020-12-02,0.000000e+00,0.000000e+00,,,,0.00,0.00,,
34613,World,OWID_WRL,2020-12-03,0.000000e+00,0.000000e+00,,0.0,0.0,0.00,0.00,,0.0
34614,World,OWID_WRL,2020-12-04,1.000000e+00,1.000000e+00,,1.0,0.0,0.00,0.00,,0.0
34615,World,OWID_WRL,2020-12-05,1.000000e+00,1.000000e+00,,0.0,0.0,0.00,0.00,,0.0
34616,World,OWID_WRL,2020-12-06,1.000000e+00,1.000000e+00,,0.0,0.0,0.00,0.00,,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...
34840,World,OWID_WRL,2021-07-18,3.662774e+09,2.055135e+09,1.015653e+09,19296764.0,30735269.0,46.99,26.37,13.03,3943.0
34841,World,OWID_WRL,2021-07-19,3.695078e+09,2.067338e+09,1.025288e+09,32304332.0,31101056.0,47.40,26.52,13.15,3990.0
34842,World,OWID_WRL,2021-07-20,3.727059e+09,2.079488e+09,1.033198e+09,31980155.0,30874061.0,47.81,26.68,13.25,3961.0
34843,World,OWID_WRL,2021-07-21,3.760632e+09,2.087375e+09,1.043255e+09,33573399.0,30413169.0,48.25,26.78,13.38,3902.0
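`str.contains` takes a few optional arguments that are handy when you are unsure of the exact spelling. The sketch below, on a made-up Series, uses `case=False` to ignore upper/lower case and `na=False` to treat missing values as "no match".

```python
import pandas as pd

locations = pd.Series(['World', 'Israel', None, 'Upper middle income'])

# Case-insensitive search; missing values count as False
mask = locations.str.contains('worl', case=False, na=False)
print(mask)
print(locations[mask])
```

By default the search pattern is treated as a regular expression; passing `regex=False` searches for the literal text instead.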


Remove the world data:

In [38]:
vacc_df_noWorld = vacc_df.loc[vacc_df.location != 'World']
vacc_df_noWorld.tail()

Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million
35158,Zimbabwe,ZWE,2021-07-17,1771434.0,1132045.0,639389.0,39694.0,41958.0,11.92,7.62,4.3,2823.0
35159,Zimbabwe,ZWE,2021-07-18,1785533.0,1144379.0,641154.0,14099.0,42019.0,12.01,7.7,4.31,2827.0
35160,Zimbabwe,ZWE,2021-07-19,1827638.0,1184435.0,643203.0,42105.0,42253.0,12.3,7.97,4.33,2843.0
35161,Zimbabwe,ZWE,2021-07-20,1897337.0,1247494.0,649843.0,69699.0,45971.0,12.77,8.39,4.37,3093.0
35162,Zimbabwe,ZWE,2021-07-21,1949472.0,1292642.0,656830.0,52135.0,47976.0,13.12,8.7,4.42,3228.0


Find the location with the maximum total vaccinations:

In [87]:
max_vacc = vacc_df_noWorld['total_vaccinations'].max()
max_vacc

2383659444.0

In [40]:
vacc_df_noWorld.loc[vacc_df_noWorld.total_vaccinations == max_vacc]

Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million
1950,Asia,OWID_ASI,2021-07-22,2383659000.0,1246089000.0,461410827.0,19637719.0,19995909.0,51.37,26.86,9.94,4310.0
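An alternative to the two steps above is `.idxmax()`, which returns the index label of the row holding the maximum. This is a sketch on a made-up three-row dataframe, not a replacement for the approach shown.

```python
import pandas as pd

toy = pd.DataFrame({
    'location': ['Israel', 'Denmark', 'Asia'],
    'total_vaccinations': [11008624.0, 4.0, 2383659444.0],
})

# Index label of the row with the largest total_vaccinations
row_label = toy['total_vaccinations'].idxmax()

print(row_label)
print(toy.loc[row_label, 'location'])
```

Note that if several rows share the maximum, `.idxmax()` reports only the first one, whereas comparing against `max()` returns all of them.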


What do you think this function does?

In [41]:
vacc_df_noWorld.total_vaccinations.mean()

36918988.861587614

----
#### Your turn:

Select the number of daily vaccinations in Israel on 2021-07-20 (hint: use `&`)

In [88]:
vacc_df.loc[(vacc_df.location == 'Israel') & (vacc_df.date == '2021-07-20')]

Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million
15671,Israel,ISR,2021-07-20,11008624.0,5750067.0,5258557.0,14231.0,10060.0,127.19,66.43,60.75,1162.0


Find all the rows with more than 300000 daily vaccinations:

In [43]:
vacc_df_noWorld.loc[vacc_df_noWorld.daily_vaccinations > 300000]

Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million
224,Africa,OWID_AFR,2021-03-27,1.030312e+07,6962741.0,3606163.0,325837.0,301268.0,0.77,0.52,0.27,225.0
225,Africa,OWID_AFR,2021-03-28,1.056197e+07,7218927.0,3608826.0,258849.0,335169.0,0.79,0.54,0.27,250.0
226,Africa,OWID_AFR,2021-03-29,1.077488e+07,7305693.0,3743035.0,212915.0,348122.0,0.80,0.54,0.28,260.0
227,Africa,OWID_AFR,2021-03-30,1.097167e+07,7387045.0,3870510.0,196785.0,355345.0,0.82,0.55,0.29,265.0
267,Africa,OWID_AFR,2021-05-09,2.119979e+07,16063514.0,5518752.0,729471.0,351787.0,1.58,1.20,0.41,262.0
...,...,...,...,...,...,...,...,...,...,...,...,...
33674,Upper middle income,OWID_UMC,2021-07-18,1.928636e+09,943605954.0,379618270.0,11998743.0,16733054.0,72.65,35.54,14.30,6303.0
33675,Upper middle income,OWID_UMC,2021-07-19,1.943552e+09,945911507.0,381584744.0,14915465.0,16626595.0,73.21,35.63,14.37,6263.0
33676,Upper middle income,OWID_UMC,2021-07-20,1.961319e+09,950175010.0,384027832.0,17767193.0,16728652.0,73.88,35.79,14.47,6301.0
33677,Upper middle income,OWID_UMC,2021-07-21,1.980014e+09,953111895.0,386240484.0,18694961.0,16468485.0,74.58,35.90,14.55,6203.0


---
Summary of the functions in this unit:

>* `.describe()` - View statistical information about the data
>* `.index.values` - the index labels of the selected rows, as an array
>* `.str.contains` - marks the values of a string column that contain a given substring
>* `.max()` - maximum value
>* `.mean()` - average value
>* `.count()` - the number of non-missing values in each column
>* `len()` - the number of rows in the dataframe