# Unit 3 - missing values
---

1. Find rows with missing values
2. Remove missing values using dropna()  
3. Fill missing values using fillna()
4. Fill missing values using interpolate()
5. A note on slicing - copy()
6. GroupBy()





In [4]:
import pandas as pd
import numpy as np

In [5]:
url = 'https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/vaccinations/vaccinations.csv'
vacc_df = pd.read_csv(url)

<a id='section1'></a>

`null` / `na` - no value

`NaN` - **N**ot **a** **N**umber - the value is missing. pandas skips this value by default in calculations such as `.mean()`
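
A minimal sketch of what "ignored" means here, using a hypothetical three-value series rather than the vaccinations data. The `skipna` parameter of `.mean()` controls whether missing values are skipped:

```python
import numpy as np
import pandas as pd

# Hypothetical toy series: three values, one of them missing
s = pd.Series([10.0, np.nan, 30.0])

mean_skipping_nan = s.mean()            # NaN is skipped: (10 + 30) / 2
mean_with_nan = s.mean(skipna=False)    # NaN propagates into the result

print(mean_skipping_nan)  # 20.0
print(mean_with_nan)      # nan
```

Note that the mean is taken over the two existing values, not over three, which is why skipping NaN is not the same as treating it as 0.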


### 1. Find rows with missing values

In [6]:
vacc_df.isnull().sum()

location                                   0
iso_code                                   0
date                                       0
total_vaccinations                      9699
people_vaccinated                      10474
people_fully_vaccinated                13309
daily_vaccinations_raw                 11887
daily_vaccinations                       240
total_vaccinations_per_hundred          9699
people_vaccinated_per_hundred          10474
people_fully_vaccinated_per_hundred    13309
daily_vaccinations_per_million           240
dtype: int64

`isnull()` is available both as a DataFrame/Series method and as a top-level pandas function, so you can call it on the dataframe or through `pd`:

In [7]:
pd.isnull(vacc_df).sum()

location                                   0
iso_code                                   0
date                                       0
total_vaccinations                      9699
people_vaccinated                      10474
people_fully_vaccinated                13309
daily_vaccinations_raw                 11887
daily_vaccinations                       240
total_vaccinations_per_hundred          9699
people_vaccinated_per_hundred          10474
people_fully_vaccinated_per_hundred    13309
daily_vaccinations_per_million           240
dtype: int64

In [8]:
vacc_df['daily_vaccinations'].notnull().sum()

24090

In [9]:
vacc_df['daily_vaccinations'].isnull().sum()

240

`isnan` is a NumPy function. It works on numeric columns only; on a column of strings it raises a `TypeError`:

In [10]:
np.isnan(vacc_df['daily_vaccinations']).sum()

240
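
The cells above count missing values. To display the rows themselves, the boolean result of `isnull()` can be used as a filter. This is a sketch on a hypothetical toy dataframe (the column names `total` and `daily` are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the vaccinations file
toy = pd.DataFrame({
    "location": ["A", "A", "B"],
    "total": [1.0, np.nan, 3.0],
    "daily": [np.nan, 5.0, 6.0],
})

# Rows where one specific column is missing
missing_total = toy[toy["total"].isnull()]

# Rows where at least one column is missing
missing_any = toy[toy.isnull().any(axis=1)]

print(missing_total)
print(missing_any)
```

The same pattern should work on `vacc_df`, for example `vacc_df[vacc_df['daily_vaccinations'].isnull()]`.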

### 2. Remove missing values using dropna() 

##### Look at Zimbabwe for example. Zimbabwe contains missing values:

In [11]:
zimbabwe = vacc_df.loc[vacc_df.location == 'Zimbabwe']
#zimbabwe.head(10)

In [None]:
zimbabwe['total_vaccinations'].isnull().sum()

In [None]:
zimbabwe['total_vaccinations'].notnull().sum()

##### `count()` returns the number of non-missing values per column, so we can see which columns have gaps:

In [12]:
zimbabwe.count()

location                               106
iso_code                               106
date                                   106
total_vaccinations                     103
people_vaccinated                      103
people_fully_vaccinated                 74
daily_vaccinations_raw                 101
daily_vaccinations                     105
total_vaccinations_per_hundred         103
people_vaccinated_per_hundred          103
people_fully_vaccinated_per_hundred     74
daily_vaccinations_per_million         105
dtype: int64

##### Remove all rows that contain one or more missing values: 

In [13]:
zimbabwe.dropna()

Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million
24256,Zimbabwe,ZWE,2021-03-22,43574.0,43294.0,280.0,845.0,845.0,0.29,0.29,0.00,57.0
24257,Zimbabwe,ZWE,2021-03-23,45197.0,44135.0,1062.0,1623.0,807.0,0.30,0.30,0.01,54.0
24258,Zimbabwe,ZWE,2021-03-24,51893.0,49404.0,2489.0,6696.0,1755.0,0.35,0.33,0.02,118.0
24259,Zimbabwe,ZWE,2021-03-25,58987.0,54892.0,4095.0,7094.0,2712.0,0.40,0.37,0.03,182.0
24260,Zimbabwe,ZWE,2021-03-26,67662.0,61093.0,6569.0,8675.0,3711.0,0.46,0.41,0.04,250.0
...,...,...,...,...,...,...,...,...,...,...,...,...
24325,Zimbabwe,ZWE,2021-05-30,1011973.0,670755.0,341218.0,9508.0,14420.0,6.81,4.51,2.30,970.0
24326,Zimbabwe,ZWE,2021-05-31,1020078.0,675678.0,344400.0,8105.0,15022.0,6.86,4.55,2.32,1011.0
24327,Zimbabwe,ZWE,2021-06-01,1031281.0,678003.0,353278.0,11203.0,14756.0,6.94,4.56,2.38,993.0
24328,Zimbabwe,ZWE,2021-06-02,1040214.0,682242.0,357972.0,8933.0,14739.0,7.00,4.59,2.41,992.0


Note: `dropna()`, like most pandas methods, returns a new DataFrame (a copy of the original with the changes applied) and leaves the original untouched. `zimbabwe` itself still contains the missing values:

In [14]:
zimbabwe.head()

Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million
24224,Zimbabwe,ZWE,2021-02-18,0.0,0.0,,,,0.0,0.0,,
24225,Zimbabwe,ZWE,2021-02-19,,,,,328.0,,,,22.0
24226,Zimbabwe,ZWE,2021-02-20,,,,,328.0,,,,22.0
24227,Zimbabwe,ZWE,2021-02-21,,,,,328.0,,,,22.0
24228,Zimbabwe,ZWE,2021-02-22,1314.0,1314.0,,,328.0,0.01,0.01,,22.0


To keep the result, assign it to a variable:

In [15]:
zimbabwe2 = zimbabwe.dropna()
zimbabwe2

Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million
24256,Zimbabwe,ZWE,2021-03-22,43574.0,43294.0,280.0,845.0,845.0,0.29,0.29,0.00,57.0
24257,Zimbabwe,ZWE,2021-03-23,45197.0,44135.0,1062.0,1623.0,807.0,0.30,0.30,0.01,54.0
24258,Zimbabwe,ZWE,2021-03-24,51893.0,49404.0,2489.0,6696.0,1755.0,0.35,0.33,0.02,118.0
24259,Zimbabwe,ZWE,2021-03-25,58987.0,54892.0,4095.0,7094.0,2712.0,0.40,0.37,0.03,182.0
24260,Zimbabwe,ZWE,2021-03-26,67662.0,61093.0,6569.0,8675.0,3711.0,0.46,0.41,0.04,250.0
...,...,...,...,...,...,...,...,...,...,...,...,...
24325,Zimbabwe,ZWE,2021-05-30,1011973.0,670755.0,341218.0,9508.0,14420.0,6.81,4.51,2.30,970.0
24326,Zimbabwe,ZWE,2021-05-31,1020078.0,675678.0,344400.0,8105.0,15022.0,6.86,4.55,2.32,1011.0
24327,Zimbabwe,ZWE,2021-06-01,1031281.0,678003.0,353278.0,11203.0,14756.0,6.94,4.56,2.38,993.0
24328,Zimbabwe,ZWE,2021-06-02,1040214.0,682242.0,357972.0,8933.0,14739.0,7.00,4.59,2.41,992.0


##### Remove rows with missing values in a specific column, using `subset`

In [16]:
zimbabwe.dropna(subset = ['total_vaccinations'])

Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million
24224,Zimbabwe,ZWE,2021-02-18,0.0,0.0,,,,0.00,0.00,,
24228,Zimbabwe,ZWE,2021-02-22,1314.0,1314.0,,,328.0,0.01,0.01,,22.0
24229,Zimbabwe,ZWE,2021-02-23,4041.0,4041.0,,2727.0,808.0,0.03,0.03,,54.0
24230,Zimbabwe,ZWE,2021-02-24,7872.0,7872.0,,3831.0,1312.0,0.05,0.05,,88.0
24231,Zimbabwe,ZWE,2021-02-25,11007.0,11007.0,,3135.0,1572.0,0.07,0.07,,106.0
...,...,...,...,...,...,...,...,...,...,...,...,...
24325,Zimbabwe,ZWE,2021-05-30,1011973.0,670755.0,341218.0,9508.0,14420.0,6.81,4.51,2.30,970.0
24326,Zimbabwe,ZWE,2021-05-31,1020078.0,675678.0,344400.0,8105.0,15022.0,6.86,4.55,2.32,1011.0
24327,Zimbabwe,ZWE,2021-06-01,1031281.0,678003.0,353278.0,11203.0,14756.0,6.94,4.56,2.38,993.0
24328,Zimbabwe,ZWE,2021-06-02,1040214.0,682242.0,357972.0,8933.0,14739.0,7.00,4.59,2.41,992.0


For several columns (a row is dropped if any of the listed columns is missing):

In [17]:
zimbabwe.dropna(subset = ['total_vaccinations', 'daily_vaccinations_per_million']).head()

Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million
24228,Zimbabwe,ZWE,2021-02-22,1314.0,1314.0,,,328.0,0.01,0.01,,22.0
24229,Zimbabwe,ZWE,2021-02-23,4041.0,4041.0,,2727.0,808.0,0.03,0.03,,54.0
24230,Zimbabwe,ZWE,2021-02-24,7872.0,7872.0,,3831.0,1312.0,0.05,0.05,,88.0
24231,Zimbabwe,ZWE,2021-02-25,11007.0,11007.0,,3135.0,1572.0,0.07,0.07,,106.0
24232,Zimbabwe,ZWE,2021-02-26,12579.0,12579.0,,1572.0,1750.0,0.08,0.08,,118.0


---
>A summary of the functions so far:
>
>* `.isnull()` - returns `True` for every missing value (add `.sum()` to count them per column)
>* `.notnull()` - returns `True` for every non-missing value
>* `.dropna()` - removes rows with missing values according to parameters:
>    * `.dropna()` (default) - drops a row if at least one column has NaN
>    * `.dropna(subset = ['column_name'])` - drops rows that have missing values in the listed columns
>    * `.dropna(how='all')` - drops a row only if all of its columns are NaN
>    * `.dropna(thresh = k)` - keeps only rows with at least k non-null values (k=3 means a row must contain at least 3 non-null values to be kept)
>    * `.dropna(axis=1)` - drops columns instead of rows
>

See the documentation [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html).
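
The `dropna()` parameters in the summary can be compared side by side on a small example. This is a sketch on a hypothetical toy dataframe (columns `a`, `b`, `c` are made up), chosen so that each variant keeps a different set of rows:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 0 is complete, row 2 is entirely missing
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [1.0, 2.0, np.nan, 4.0],
    "c": [1.0, np.nan, np.nan, np.nan],
})

any_missing = toy.dropna()              # keeps only complete rows
all_missing = toy.dropna(how="all")     # drops only the all-NaN row
at_least_two = toy.dropna(thresh=2)     # keeps rows with >= 2 non-null values
by_column = toy.dropna(subset=["b"])    # looks at column b only

# Drop the empty row first, then drop every column that still has a NaN
complete_columns = toy.dropna(how="all").dropna(axis=1)

print(list(any_missing.index))         # [0]
print(list(all_missing.index))         # [0, 1, 3]
print(list(at_least_two.index))        # [0, 3]
print(list(by_column.index))           # [0, 1, 3]
print(list(complete_columns.columns))  # ['b']
```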

---


### 3. Fill missing values using fillna()

Use `.fillna()` to fill missing dataframe values with:
* Whatever value you choose
* Mean, median, mode

Replace all NaNs with 0s

In [18]:
vacc_df.fillna(0, inplace = False )
vacc_df

Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million
0,Afghanistan,AFG,2021-02-22,0.0,0.0,,,,0.00,0.00,,
1,Afghanistan,AFG,2021-02-23,,,,,1367.0,,,,35.0
2,Afghanistan,AFG,2021-02-24,,,,,1367.0,,,,35.0
3,Afghanistan,AFG,2021-02-25,,,,,1367.0,,,,35.0
4,Afghanistan,AFG,2021-02-26,,,,,1367.0,,,,35.0
...,...,...,...,...,...,...,...,...,...,...,...,...
24325,Zimbabwe,ZWE,2021-05-30,1011973.0,670755.0,341218.0,9508.0,14420.0,6.81,4.51,2.30,970.0
24326,Zimbabwe,ZWE,2021-05-31,1020078.0,675678.0,344400.0,8105.0,15022.0,6.86,4.55,2.32,1011.0
24327,Zimbabwe,ZWE,2021-06-01,1031281.0,678003.0,353278.0,11203.0,14756.0,6.94,4.56,2.38,993.0
24328,Zimbabwe,ZWE,2021-06-02,1040214.0,682242.0,357972.0,8933.0,14739.0,7.00,4.59,2.41,992.0


>`inplace = False` is the default. This doesn't change the vacc_df dataframe. 
>
>To change it you need:
>
>`vacc_df.fillna(0 , inplace = True)`
>
>or to assign:
>
>`vacc_df = vacc_df.fillna(0)`
>
>But we won't do that! This is where some **business understanding** comes in: it's not a good idea to fill a column like `total_vaccinations` with 0s. 
>
>See what happens:

In [19]:
vacc_df.fillna(0).head(10)

Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million
0,Afghanistan,AFG,2021-02-22,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Afghanistan,AFG,2021-02-23,0.0,0.0,0.0,0.0,1367.0,0.0,0.0,0.0,35.0
2,Afghanistan,AFG,2021-02-24,0.0,0.0,0.0,0.0,1367.0,0.0,0.0,0.0,35.0
3,Afghanistan,AFG,2021-02-25,0.0,0.0,0.0,0.0,1367.0,0.0,0.0,0.0,35.0
4,Afghanistan,AFG,2021-02-26,0.0,0.0,0.0,0.0,1367.0,0.0,0.0,0.0,35.0
5,Afghanistan,AFG,2021-02-27,0.0,0.0,0.0,0.0,1367.0,0.0,0.0,0.0,35.0
6,Afghanistan,AFG,2021-02-28,8200.0,8200.0,0.0,0.0,1367.0,0.02,0.02,0.0,35.0
7,Afghanistan,AFG,2021-03-01,0.0,0.0,0.0,0.0,1580.0,0.0,0.0,0.0,41.0
8,Afghanistan,AFG,2021-03-02,0.0,0.0,0.0,0.0,1794.0,0.0,0.0,0.0,46.0
9,Afghanistan,AFG,2021-03-03,0.0,0.0,0.0,0.0,2008.0,0.0,0.0,0.0,52.0


So we'll use 0s only for the `daily_vaccinations` column, and perhaps for some other columns (which?)

In [21]:
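# Assign the result back; inplace=True on a selected column may not update vacc_df in recent pandas versions
vacc_df['daily_vaccinations'] = vacc_df['daily_vaccinations'].fillna(0)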

Check some of the data to see that it worked:

In [22]:
vacc_df.iloc[0:3,[0,2,7]]

Unnamed: 0,location,date,daily_vaccinations
0,Afghanistan,2021-02-22,0.0
1,Afghanistan,2021-02-23,1367.0
2,Afghanistan,2021-02-24,1367.0


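Other options - filling with a measure of central tendency (to be used instead of the 0 fill above, since after that fill no missing values remain in the column):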

In [55]:
# Using median
vacc_df['daily_vaccinations'] = vacc_df['daily_vaccinations'].fillna(vacc_df['daily_vaccinations'].median())

# Using mean
#vacc_df['daily_vaccinations'] = vacc_df['daily_vaccinations'].fillna(vacc_df['daily_vaccinations'].mean())

# Using mode (mode() returns a Series, so take its first value)
#vacc_df['daily_vaccinations'] = vacc_df['daily_vaccinations'].fillna(vacc_df['daily_vaccinations'].mode().iloc[0])
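
One detail about the mode option: `Series.mode()` returns a Series (there can be several modes), not a single number. When a Series is passed to `fillna()`, values are matched by index label, so only missing values whose label also appears in the mode Series get filled. A sketch on a hypothetical toy series:

```python
import numpy as np
import pandas as pd

# Hypothetical toy series with two missing values at the end
s = pd.Series([2.0, 2.0, 7.0, np.nan, np.nan])

modes = s.mode()                     # a Series holding the mode(s); here only 2.0
misaligned = s.fillna(modes)         # matched by index label: labels 3 and 4 stay NaN
filled = s.fillna(modes.iloc[0])     # first mode as a scalar: every NaN is filled

print(misaligned.isnull().sum())  # 2
print(filled.tolist())            # [2.0, 2.0, 7.0, 2.0, 2.0]
```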


What about `total_vaccinations`? - there are some `NaN`s there as well:

In [23]:
vacc_df.iloc[52:62,[0,2,3]]

Unnamed: 0,location,date,total_vaccinations
52,Afghanistan,2021-04-15,
53,Afghanistan,2021-04-16,
54,Afghanistan,2021-04-17,
55,Afghanistan,2021-04-18,
56,Afghanistan,2021-04-19,
57,Afghanistan,2021-04-20,
58,Afghanistan,2021-04-21,
59,Afghanistan,2021-04-22,240000.0
60,Afghanistan,2021-04-23,
61,Afghanistan,2021-04-24,


For `total_vaccinations` we'll use `ffill` (forward fill), which fills each missing value with the last non-missing value that occurs before it.

Yes, `bfill` exists as well. It does what you think it does :-)

In [24]:
# .ffill() is equivalent to fillna(method='ffill'), which is deprecated in recent pandas versions
vacc_df[['total_vaccinations']].ffill()[52:62]
#vacc_df['total_vaccinations'][52:62]

Unnamed: 0,total_vaccinations
52,120000.0
53,120000.0
54,120000.0
55,120000.0
56,120000.0
57,120000.0
58,120000.0
59,240000.0
60,240000.0
61,240000.0


The first value for some country might be NaN. A plain forward fill would then copy the last value of the previous country in the file into it.

Business understanding: this isn't good enough! We need to fill within each country separately.

Use `groupby()` so that the forward fill restarts for every country (`groupby()` is more advanced and we will return to it shortly).

We will create a new column here, `newTotal`, so we can see the difference from `total_vaccinations`.


In [25]:
vacc_df['newTotal'] = vacc_df.groupby('location')['total_vaccinations'].ffill()
vacc_df.iloc[52:62,[0,2,3,12]]

Unnamed: 0,location,date,total_vaccinations,newTotal
52,Afghanistan,2021-04-15,,120000.0
53,Afghanistan,2021-04-16,,120000.0
54,Afghanistan,2021-04-17,,120000.0
55,Afghanistan,2021-04-18,,120000.0
56,Afghanistan,2021-04-19,,120000.0
57,Afghanistan,2021-04-20,,120000.0
58,Afghanistan,2021-04-21,,120000.0
59,Afghanistan,2021-04-22,240000.0,240000.0
60,Afghanistan,2021-04-23,,240000.0
61,Afghanistan,2021-04-24,,240000.0
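
The difference between a plain forward fill and a per-country forward fill only shows at the boundary between two countries, which the rows above do not include. A sketch on a hypothetical two-location toy dataframe makes it visible:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: location B starts with a missing value
toy = pd.DataFrame({
    "location": ["A", "A", "A", "B", "B", "B"],
    "total": [10.0, np.nan, 30.0, np.nan, 5.0, np.nan],
})

# Plain forward fill: the last value of A leaks into the first row of B
plain = toy["total"].ffill()

# Per-group forward fill: the fill restarts for each location
per_group = toy.groupby("location")["total"].ffill()

print(plain.tolist())      # [10.0, 10.0, 30.0, 30.0, 5.0, 5.0]
print(per_group.tolist())  # [10.0, 10.0, 30.0, nan, 5.0, 5.0]
```

The first row of B stays NaN in the per-group version because there is no earlier value for B to carry forward.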


### 4. Fill missing values using interpolate()

In [26]:
vacc_df['newTotal2'] = vacc_df['total_vaccinations'].interpolate(method ='linear') 
vacc_df.iloc[52:62,[0,2,3,12, 13]]

Unnamed: 0,location,date,total_vaccinations,newTotal,newTotal2
52,Afghanistan,2021-04-15,,120000.0,184000.0
53,Afghanistan,2021-04-16,,120000.0,192000.0
54,Afghanistan,2021-04-17,,120000.0,200000.0
55,Afghanistan,2021-04-18,,120000.0,208000.0
56,Afghanistan,2021-04-19,,120000.0,216000.0
57,Afghanistan,2021-04-20,,120000.0,224000.0
58,Afghanistan,2021-04-21,,120000.0,232000.0
59,Afghanistan,2021-04-22,240000.0,240000.0,240000.0
60,Afghanistan,2021-04-23,,240000.0,253921.157895
61,Afghanistan,2021-04-24,,240000.0,267842.315789
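
In the table above, `newTotal` (forward fill) stays flat until the next known value, while `newTotal2` (linear interpolation) climbs in equal steps between the known values. A sketch on a hypothetical four-value series shows the same contrast in isolation:

```python
import numpy as np
import pandas as pd

# Hypothetical toy series: two known values with a gap of two in between
s = pd.Series([100.0, np.nan, np.nan, 400.0])

forward = s.ffill()                        # repeats the last known value
linear = s.interpolate(method="linear")    # equal steps between known values

print(forward.tolist())  # [100.0, 100.0, 100.0, 400.0]
print(linear.tolist())   # [100.0, 200.0, 300.0, 400.0]
```

For a cumulative quantity such as `total_vaccinations`, linear interpolation may be the more plausible guess, but like a plain forward fill, `interpolate()` on the whole column does not know where one country ends and the next begins.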


---
>A summary of the functions so far:
>
>* `.fillna()` - fill missing values according to parameters:
>    * `.fillna(k)` - with value k, returns a new dataframe
>    * `.fillna(k, inplace = True)` - with value k, into the existing dataframe
>    * `.ffill()` (or `.fillna(method='ffill')` in older pandas versions) - fill with the last non-missing value that occurs before it
>    * `.bfill()` (or `.fillna(method='bfill')` in older pandas versions) - fill with the first non-missing value that occurs after it
>* `interpolate` - fill using some interpolation technique
>
>See the documentation:
>
>* [Missing data handling documentation](https://pandas-docs.github.io/pandas-docs-travis/reference/frame.html#missing-data-handling)
---

### 5. A note on slicing

Slicing is taking only part of a dataframe. For example - the slice we named zimbabwe:

In [27]:
zimbabwe = vacc_df.loc[vacc_df.location == 'Zimbabwe']

When we change data in a slice, pandas cannot always tell whether we are changing a copy or the ORIGINAL dataframe, so the change may or may not reach the original. This causes a warning to appear:

In [28]:
zimbabwe.fillna(0, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().fillna(


The warning will disappear if you rerun the command, but it can still be scary. The best way to avoid it is to create a `copy` of the dataframe:

In [29]:
zimbabwe = vacc_df.loc[vacc_df.location == 'Zimbabwe'].copy()
zimbabwe.fillna(0, inplace=True)

This works fine, no warnings. But - this won't change the original dataframe (which might be a good thing, if you didn't plan to change it, or a bad thing, if you did)
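
If the goal is to change the original dataframe, one common option is to skip the intermediate slice and assign through `.loc` on the original. This is a sketch on a hypothetical toy dataframe, not the notebook's own approach:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "location": ["A", "B", "B"],
    "daily": [1.0, np.nan, 3.0],
})

# Boolean mask selecting the rows we want to modify
is_b = toy["location"] == "B"

# Assign through .loc on the original dataframe, so the original itself is updated
toy.loc[is_b, "daily"] = toy.loc[is_b, "daily"].fillna(0)

print(toy["daily"].tolist())  # [1.0, 0.0, 3.0]
```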

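What about later changes to the original dataframe? A regular (deep) copy will not change.
If you do want your copy to follow them, use a shallow copy (`deep=False`), which shares its data with the original. Note that with Copy-on-Write enabled (the default from pandas 3.0), a shallow copy no longer reflects such changes.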

In [30]:
small_example = pd.Series([1, 2], index=["a", "b"])
small_example

a    1
b    2
dtype: int64

In [31]:
my_deep_copy = small_example.copy()
my_deep_copy

a    1
b    2
dtype: int64

In [32]:
my_shallow_copy = small_example.copy(deep=False)
my_shallow_copy

a    1
b    2
dtype: int64

Make a change to the original series - where will it appear?

In [33]:
small_example.iloc[0] = -100
small_example

a   -100
b      2
dtype: int64

In [34]:
my_deep_copy

a    1
b    2
dtype: int64

In [35]:
my_shallow_copy

a   -100
b      2
dtype: int64

### 6. Groupby()

#### Group according to something + some columns + some summary statistic

The `mean` of `daily_vaccinations` according to `location`:


In [38]:
vacc_df.groupby('location')[['daily_vaccinations']].mean()

Unnamed: 0_level_0,daily_vaccinations
location,Unnamed: 1_level_1
Afghanistan,6.037238e+03
Africa,2.250192e+05
Albania,5.352462e+03
Algeria,3.139545e+03
Andorra,2.737717e+02
...,...
Wallis and Futuna,1.107042e+02
World,1.071848e+07
Yemen,4.072381e+03
Zambia,2.953380e+03


The same, but for two columns (though, as we said, the mean of `total_vaccinations` makes little business sense):

In [39]:
vacc_df.groupby('location')[['daily_vaccinations', 'total_vaccinations']].mean()

Unnamed: 0_level_0,daily_vaccinations,total_vaccinations
location,Unnamed: 1_level_1,Unnamed: 2_level_1
Afghanistan,6.037238e+03,3.715074e+05
Africa,2.250192e+05,1.181322e+07
Albania,5.352462e+03,3.100638e+05
Algeria,3.139545e+03,2.501000e+04
Andorra,2.737717e+02,1.358505e+04
...,...,...
Wallis and Futuna,1.107042e+02,5.303818e+03
World,1.071848e+07,5.399040e+08
Yemen,4.072381e+03,6.131250e+04
Zambia,2.953380e+03,7.530355e+04


Still the same, but using a lambda function

In [40]:
vacc_df.groupby('location')[['daily_vaccinations', 'total_vaccinations']].apply(lambda x: x.mean())

Unnamed: 0_level_0,daily_vaccinations,total_vaccinations
location,Unnamed: 1_level_1,Unnamed: 2_level_1
Afghanistan,6.037238e+03,3.715074e+05
Africa,2.250192e+05,1.181322e+07
Albania,5.352462e+03,3.100638e+05
Algeria,3.139545e+03,2.501000e+04
Andorra,2.737717e+02,1.358505e+04
...,...,...
Wallis and Futuna,1.107042e+02,5.303818e+03
World,1.071848e+07,5.399040e+08
Yemen,4.072381e+03,6.131250e+04
Zambia,2.953380e+03,7.530355e+04


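`fillna()` is not an aggregation function, so the result is different: one row per original row rather than one row per group. Here each group's missing values are filled with that group's sum: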

In [42]:
vacc_df.groupby('location')[['daily_vaccinations']].apply(lambda x: x.fillna(x.sum()))

Unnamed: 0_level_0,Unnamed: 1_level_0,daily_vaccinations
location,Unnamed: 1_level_1,Unnamed: 2_level_1
Afghanistan,0,0.0
Afghanistan,1,1367.0
Afghanistan,2,1367.0
Afghanistan,3,1367.0
Afghanistan,4,1367.0
...,...,...
Zimbabwe,24325,14420.0
Zimbabwe,24326,15022.0
Zimbabwe,24327,14756.0
Zimbabwe,24328,14739.0


Similarly for two columns, this time filling with each group's mean:

In [43]:
vacc_df.groupby('location')[['daily_vaccinations', 'total_vaccinations']].apply(lambda x: x.fillna(x.mean()))

Unnamed: 0_level_0,Unnamed: 1_level_0,daily_vaccinations,total_vaccinations
location,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Afghanistan,0,0.0,0.000000e+00
Afghanistan,1,1367.0,3.715074e+05
Afghanistan,2,1367.0,3.715074e+05
Afghanistan,3,1367.0,3.715074e+05
Afghanistan,4,1367.0,3.715074e+05
...,...,...,...
Zimbabwe,24325,14420.0,1.011973e+06
Zimbabwe,24326,15022.0,1.020078e+06
Zimbabwe,24327,14756.0,1.031281e+06
Zimbabwe,24328,14739.0,1.040214e+06
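
An alternative to `apply` with a lambda for this kind of per-group fill is `transform`, which returns one value per original row and so lines up with the dataframe's index. This is a sketch on a hypothetical toy dataframe, offered as one possible approach rather than the notebook's method:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "location": ["A", "A", "A", "B", "B"],
    "daily": [10.0, np.nan, 30.0, 4.0, np.nan],
})

# Aggregation: one row per group
per_location = toy.groupby("location")["daily"].mean()

# Transform: the group mean repeated for every row of that group
group_mean = toy.groupby("location")["daily"].transform("mean")

# Fill each missing value with the mean of its own group
toy["daily_filled"] = toy["daily"].fillna(group_mean)

print(per_location.tolist())         # [20.0, 4.0]
print(toy["daily_filled"].tolist())  # [10.0, 20.0, 30.0, 4.0, 4.0]
```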


---
>A summary:
>
>* `.copy()` - creates an independent (deep) copy of the dataframe or slice
>
>* `.copy(deep=False)` - creates a shallow copy; updates to the original dataframe will show in the copy
>
>* `.groupby()` - group according to the columns specified
---