
# Introduction to Pandas

In this section of the course we will learn how to use pandas for data analysis. You can think of pandas as a programmable and far more powerful version of Excel. Go through the notebooks in this order:

* Introduction to Pandas
* Series
* DataFrames
* Missing Data
* Summary Functions and Aggregation (GroupBy)
* Combining Data - Merging, Joining, and Concatenating
* Operations
* Data Input and Output
___

In [None]:
# Importing required libraries and fixing options
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

pd.options.display.max_columns = None
pd.options.display.max_rows = None

%matplotlib inline

## Let's explore some COVID-19 data available online using basic **pandas functions**:
[Reference](https://ourworldindata.org/coronavirus/country/indonesia)

In [None]:
# This step is only needed on Google Colab, to upload a file from your local machine
# The file is copied into the Colab runtime's working directory (it is not read from Google Drive)
from google.colab import files
uploaded = files.upload()

Saving covid-data1.csv to covid-data1.csv


In [None]:
# Importing the data from file
covid_df = pd.read_csv('covid-data1.csv')


# Let's see a sample of the data
covid_df.head() # top 5 rows

Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,total_cases_per_million,new_cases_per_million,new_cases_smoothed_per_million,total_deaths_per_million,new_deaths_per_million,new_deaths_smoothed_per_million,reproduction_rate,icu_patients,icu_patients_per_million,hosp_patients,hosp_patients_per_million,weekly_icu_admissions,weekly_icu_admissions_per_million,weekly_hosp_admissions,weekly_hosp_admissions_per_million,new_tests,total_tests,total_tests_per_thousand,new_tests_per_thousand,new_tests_smoothed,new_tests_smoothed_per_thousand,positive_rate,tests_per_case,tests_units,total_vaccinations,people_vaccinated,people_fully_vaccinated,total_boosters,new_vaccinations,new_vaccinations_smoothed,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,total_boosters_per_hundred,new_vaccinations_smoothed_per_million,stringency_index,population,population_density,median_age,aged_65_older,aged_70_older,gdp_per_capita,extreme_poverty,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index,excess_mortality
0,AFG,Asia,Afghanistan,2020-02-24,5.0,5.0,,,,,0.126,0.126,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,8.33,39835428.0,54.422,18.6,2.581,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.511,
1,AFG,Asia,Afghanistan,2020-02-25,5.0,0.0,,,,,0.126,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,8.33,39835428.0,54.422,18.6,2.581,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.511,
2,AFG,Asia,Afghanistan,2020-02-26,5.0,0.0,,,,,0.126,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,8.33,39835428.0,54.422,18.6,2.581,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.511,
3,AFG,Asia,Afghanistan,2020-02-27,5.0,0.0,,,,,0.126,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,8.33,39835428.0,54.422,18.6,2.581,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.511,
4,AFG,Asia,Afghanistan,2020-02-28,5.0,0.0,,,,,0.126,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,8.33,39835428.0,54.422,18.6,2.581,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.511,


In [None]:
covid_df.tail() #last 5 rows

Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,total_cases_per_million,new_cases_per_million,new_cases_smoothed_per_million,total_deaths_per_million,new_deaths_per_million,new_deaths_smoothed_per_million,reproduction_rate,icu_patients,icu_patients_per_million,hosp_patients,hosp_patients_per_million,weekly_icu_admissions,weekly_icu_admissions_per_million,weekly_hosp_admissions,weekly_hosp_admissions_per_million,new_tests,total_tests,total_tests_per_thousand,new_tests_per_thousand,new_tests_smoothed,new_tests_smoothed_per_thousand,positive_rate,tests_per_case,tests_units,total_vaccinations,people_vaccinated,people_fully_vaccinated,total_boosters,new_vaccinations,new_vaccinations_smoothed,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,total_boosters_per_hundred,new_vaccinations_smoothed_per_million,stringency_index,population,population_density,median_age,aged_65_older,aged_70_older,gdp_per_capita,extreme_poverty,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index,excess_mortality
116762,ZWE,Africa,Zimbabwe,2021-09-10,126163.0,107.0,118.857,4532.0,11.0,10.714,8359.5,7.09,7.875,300.288,0.729,0.71,0.59,,,,,,,,,2411.0,1180285.0,78.205,0.16,4185.0,0.277,0.028,35.2,tests performed,4656448.0,2824296.0,1832152.0,,54428.0,40605.0,30.85,18.71,12.14,,2690.0,,15092171.0,42.729,19.6,2.822,1.882,1899.775,21.4,307.846,1.82,1.6,30.7,36.791,1.7,61.49,0.571,
116763,ZWE,Africa,Zimbabwe,2021-09-11,126220.0,57.0,113.571,4536.0,4.0,10.0,8363.277,3.777,7.525,300.553,0.265,0.663,0.6,,,,,,,,,2166.0,1182451.0,78.349,0.144,4108.0,0.272,0.028,36.2,tests performed,4708905.0,2844848.0,1864057.0,,52457.0,44094.0,31.2,18.85,12.35,,2922.0,,15092171.0,42.729,19.6,2.822,1.882,1899.775,21.4,307.846,1.82,1.6,30.7,36.791,1.7,61.49,0.571,
116764,ZWE,Africa,Zimbabwe,2021-09-12,126269.0,49.0,102.714,4538.0,2.0,8.0,8366.523,3.247,6.806,300.686,0.133,0.53,0.6,,,,,,,,,2035.0,1184486.0,78.483,0.135,3978.0,0.264,0.026,38.7,tests performed,,,,,,42719.0,,,,,2831.0,,15092171.0,42.729,19.6,2.822,1.882,1899.775,21.4,307.846,1.82,1.6,30.7,36.791,1.7,61.49,0.571,
116765,ZWE,Africa,Zimbabwe,2021-09-13,126399.0,130.0,104.0,4543.0,5.0,7.143,8375.137,8.614,6.891,301.017,0.331,0.473,,,,,,,,,,,,,,,,,,,4752356.0,2856655.0,1895701.0,,,41369.0,31.49,18.93,12.56,,2741.0,,15092171.0,42.729,19.6,2.822,1.882,1899.775,21.4,307.846,1.82,1.6,30.7,36.791,1.7,61.49,0.571,
116766,ZWE,Africa,Zimbabwe,2021-09-14,126817.0,418.0,145.857,4550.0,7.0,6.714,8402.833,27.696,9.664,301.481,0.464,0.445,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,15092171.0,42.729,19.6,2.822,1.882,1899.775,21.4,307.846,1.82,1.6,30.7,36.791,1.7,61.49,0.571,


In [None]:
covid_df.sample(10) #Sample 10 rows randomly selected from the data

Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,total_cases_per_million,new_cases_per_million,new_cases_smoothed_per_million,total_deaths_per_million,new_deaths_per_million,new_deaths_smoothed_per_million,reproduction_rate,icu_patients,icu_patients_per_million,hosp_patients,hosp_patients_per_million,weekly_icu_admissions,weekly_icu_admissions_per_million,weekly_hosp_admissions,weekly_hosp_admissions_per_million,new_tests,total_tests,total_tests_per_thousand,new_tests_per_thousand,new_tests_smoothed,new_tests_smoothed_per_thousand,positive_rate,tests_per_case,tests_units,total_vaccinations,people_vaccinated,people_fully_vaccinated,total_boosters,new_vaccinations,new_vaccinations_smoothed,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,total_boosters_per_hundred,new_vaccinations_smoothed_per_million,stringency_index,population,population_density,median_age,aged_65_older,aged_70_older,gdp_per_capita,extreme_poverty,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index,excess_mortality
31731,EGY,Africa,Egypt,2021-08-04,284472.0,57.0,49.143,16550.0,10.0,6.143,2728.53,0.547,0.471,158.74,0.096,0.059,1.03,,,,,,,,,,,,,,,,,,,,,,,30634.0,,,,,294.0,43.52,104258300.0,97.999,25.3,5.159,2.891,10550.206,1.3,525.432,17.31,0.2,50.1,89.827,1.6,71.99,0.707,
47590,HUN,Europe,Hungary,2021-05-11,792879.0,493.0,1148.857,28792.0,99.0,106.714,82298.699,51.172,119.248,2988.532,10.276,11.077,0.65,,,3282.0,340.663,,,,,6513.0,5094469.0,528.792,0.676,15309.0,1.589,0.075,13.3,tests performed,,,,,,81210.0,,,,,8429.0,66.67,9634162.0,108.043,43.4,18.577,11.976,26777.561,0.5,278.296,7.55,26.8,34.8,,7.02,76.88,0.854,
61358,LIE,Europe,Liechtenstein,2020-08-05,89.0,0.0,0.143,1.0,0.0,0.0,2326.554,0.0,3.734,26.141,0.0,0.0,,,,,,,,,,28.0,3349.0,87.546,0.732,25.0,0.654,0.0,,tests performed,,,,,,,,,,,,,38254.0,237.012,,,,,,,7.77,,,,2.397,82.49,0.919,
61935,LTU,Europe,Lithuania,2020-08-18,2467.0,37.0,27.143,64.0,0.0,0.0,917.147,13.755,10.091,23.793,0.0,0.0,1.27,,,,,,,,,4268.0,539894.0,200.714,1.587,3527.0,1.311,0.01,100.0,tests performed,,,,,,,,,,,,28.7,2689862.0,45.135,43.5,19.002,13.778,29524.265,0.7,342.989,3.67,21.3,38.0,,6.56,75.93,0.882,
6026,OWID_ASI,,Asia,2021-06-15,53789601.0,135673.0,149132.714,755623.0,3686.0,4909.143,11494.338,28.992,31.868,161.47,0.788,1.049,,,,,,,,,,,,,,,,,,,1436419000.0,987735147.0,126274765.0,,29316878.0,24309365.0,30.69,21.11,2.7,,5195.0,,4679661000.0,,,,,,,,,,,,,,,
13645,BOL,South America,Bolivia,2021-08-21,486394.0,1210.0,680.429,18296.0,40.0,20.571,41105.099,102.257,57.503,1546.193,3.38,1.738,0.92,,,,,,,,,5528.0,2228828.0,188.358,0.467,6751.0,0.571,0.101,9.9,tests performed,5319711.0,3044011.0,2275700.0,,49248.0,42747.0,44.96,25.72,19.23,,3613.0,56.48,11832940.0,10.202,25.4,6.704,4.393,6885.829,7.1,204.299,6.89,,,25.383,1.1,71.51,0.718,
44076,GIN,Africa,Guinea,2020-11-28,13039.0,0.0,34.429,76.0,0.0,0.143,966.05,0.0,2.551,5.631,0.0,0.011,1.03,,,,,,,,,,,,,,,,,,,,,,,,,,,,,52.78,13497240.0,51.755,19.0,3.135,1.733,1998.926,35.3,336.717,2.42,,,17.45,0.3,61.6,0.477,
54738,JOR,Asia,Jordan,2020-11-27,207601.0,4580.0,4752.286,2570.0,61.0,64.857,20216.239,446.002,462.779,250.267,5.94,6.316,0.92,,,,,,,,,26257.0,2487561.0,242.239,2.557,24543.0,2.39,0.194,5.2,tests performed,,,,,,,,,,,,81.48,10269020.0,109.285,23.2,3.81,2.361,8337.49,0.1,208.257,11.75,,,,1.4,74.53,0.729,
76682,NGA,Africa,Nigeria,2020-05-30,9855.0,553.0,332.714,273.0,12.0,7.429,46.618,2.616,1.574,1.291,0.057,0.035,1.19,,,,,,,,,2099.0,60825.0,0.288,0.01,2500.0,0.012,0.133,7.5,samples tested,,,,,,,,,,,,84.26,211400700.0,209.588,18.1,2.751,1.447,5338.454,,181.013,2.42,0.6,10.8,41.949,,54.69,0.539,
8174,BHS,North America,Bahamas,2021-02-06,8256.0,0.0,11.714,176.0,0.0,0.0,20800.476,0.0,29.513,443.421,0.0,0.0,1.05,,,,,,,,,,,,,,,,,,,,,,,,,,,,,68.98,396914.0,39.497,34.3,8.996,5.2,27717.847,,235.954,13.17,3.1,20.4,,2.9,73.92,0.814,


In [None]:
covid_df['location'].unique()

array(['Afghanistan', 'Africa', 'Albania', 'Algeria', 'Andorra', 'Angola',
       'Anguilla', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba',
       'Asia', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain',
       'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin',
       'Bermuda', 'Bhutan', 'Bolivia', 'Bonaire Sint Eustatius and Saba',
       'Bosnia and Herzegovina', 'Botswana', 'Brazil',
       'British Virgin Islands', 'Brunei', 'Bulgaria', 'Burkina Faso',
       'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde',
       'Cayman Islands', 'Central African Republic', 'Chad', 'Chile',
       'China', 'Colombia', 'Comoros', 'Congo', 'Cook Islands',
       'Costa Rica', "Cote d'Ivoire", 'Croatia', 'Cuba', 'Curacao',
       'Cyprus', 'Czechia', 'Democratic Republic of Congo', 'Denmark',
       'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt',
       'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia',
       'Eswatini', 'Ethi

In [None]:
covid_df.shape

(116767, 62)

In [None]:
interested_countries_list = ['India','Indonesia']
covid_df1 = covid_df[covid_df['location'].isin(interested_countries_list)]

In [None]:
covid_df1.head()

Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,total_cases_per_million,new_cases_per_million,new_cases_smoothed_per_million,total_deaths_per_million,new_deaths_per_million,new_deaths_smoothed_per_million,reproduction_rate,icu_patients,icu_patients_per_million,hosp_patients,hosp_patients_per_million,weekly_icu_admissions,weekly_icu_admissions_per_million,weekly_hosp_admissions,weekly_hosp_admissions_per_million,new_tests,total_tests,total_tests_per_thousand,new_tests_per_thousand,new_tests_smoothed,new_tests_smoothed_per_thousand,positive_rate,tests_per_case,tests_units,total_vaccinations,people_vaccinated,people_fully_vaccinated,total_boosters,new_vaccinations,new_vaccinations_smoothed,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,total_boosters_per_hundred,new_vaccinations_smoothed_per_million,stringency_index,population,population_density,median_age,aged_65_older,aged_70_older,gdp_per_capita,extreme_poverty,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index,excess_mortality
48284,IND,Asia,India,2020-01-30,1.0,1.0,,,,,0.001,0.001,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,10.19,1393409000.0,450.419,28.2,5.989,3.414,6426.674,21.2,282.28,10.39,1.9,20.6,59.55,0.53,69.66,0.645,
48285,IND,Asia,India,2020-01-31,1.0,0.0,,,,,0.001,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,10.19,1393409000.0,450.419,28.2,5.989,3.414,6426.674,21.2,282.28,10.39,1.9,20.6,59.55,0.53,69.66,0.645,
48286,IND,Asia,India,2020-02-01,1.0,0.0,,,,,0.001,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,10.19,1393409000.0,450.419,28.2,5.989,3.414,6426.674,21.2,282.28,10.39,1.9,20.6,59.55,0.53,69.66,0.645,
48287,IND,Asia,India,2020-02-02,2.0,1.0,,,,,0.001,0.001,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,10.19,1393409000.0,450.419,28.2,5.989,3.414,6426.674,21.2,282.28,10.39,1.9,20.6,59.55,0.53,69.66,0.645,
48288,IND,Asia,India,2020-02-03,3.0,1.0,,,,,0.002,0.001,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,10.19,1393409000.0,450.419,28.2,5.989,3.414,6426.674,21.2,282.28,10.39,1.9,20.6,59.55,0.53,69.66,0.645,


In [None]:
covid_df[covid_df['location'] == 'Indonesia'].sample(5)

Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,total_cases_per_million,new_cases_per_million,new_cases_smoothed_per_million,total_deaths_per_million,new_deaths_per_million,new_deaths_smoothed_per_million,reproduction_rate,icu_patients,icu_patients_per_million,hosp_patients,hosp_patients_per_million,weekly_icu_admissions,weekly_icu_admissions_per_million,weekly_hosp_admissions,weekly_hosp_admissions_per_million,new_tests,total_tests,total_tests_per_thousand,new_tests_per_thousand,new_tests_smoothed,new_tests_smoothed_per_thousand,positive_rate,tests_per_case,tests_units,total_vaccinations,people_vaccinated,people_fully_vaccinated,total_boosters,new_vaccinations,new_vaccinations_smoothed,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,total_boosters_per_hundred,new_vaccinations_smoothed_per_million,stringency_index,population,population_density,median_age,aged_65_older,aged_70_older,gdp_per_capita,extreme_poverty,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index,excess_mortality
49002,IDN,Asia,Indonesia,2020-07-04,62142.0,1447.0,1332.857,3089.0,53.0,52.714,224.857,5.236,4.823,11.177,0.192,0.191,1.16,,,,,,,,,,529669.0,1.917,,11443.0,0.041,0.116,8.6,people tested,,,,,,,,,,,,62.5,276361788.0,145.725,29.3,5.319,3.053,11188.744,5.7,342.864,6.32,2.8,76.1,64.204,1.04,71.72,0.718,
48934,IDN,Asia,Indonesia,2020-04-27,9096.0,214.0,333.714,765.0,22.0,25.0,32.913,0.774,1.208,2.768,0.08,0.09,1.07,,,,,,,,,2435.0,59409.0,0.215,0.009,2237.0,0.008,0.149,6.7,people tested,,,,,,,,,,,,80.09,276361788.0,145.725,29.3,5.319,3.053,11188.744,5.7,342.864,6.32,2.8,76.1,64.204,1.04,71.72,0.718,
49239,IDN,Asia,Indonesia,2021-02-26,1322866.0,8232.0,8509.571,35786.0,268.0,233.429,4786.718,29.787,30.791,129.49,0.97,0.845,0.86,,,,,,,,,39766.0,7141629.0,25.842,0.144,43227.0,0.156,0.197,5.1,people tested,2449451.0,1583581.0,865870.0,,133786.0,82443.0,0.89,0.57,0.31,,298.0,65.28,276361788.0,145.725,29.3,5.319,3.053,11188.744,5.7,342.864,6.32,2.8,76.1,64.204,1.04,71.72,0.718,
49310,IDN,Asia,Indonesia,2021-05-08,1709762.0,6130.0,5268.857,46842.0,179.0,170.0,6186.68,22.181,19.065,169.495,0.648,0.615,0.99,,,,,,,,,44705.0,10175419.0,36.819,0.162,44611.0,0.161,0.118,8.5,people tested,,,,,,244222.0,,,,,884.0,68.98,276361788.0,145.725,29.3,5.319,3.053,11188.744,5.7,342.864,6.32,2.8,76.1,64.204,1.04,71.72,0.718,
48921,IDN,Asia,Indonesia,2020-04-14,4839.0,282.0,300.143,459.0,60.0,34.0,17.51,1.02,1.086,1.661,0.217,0.123,1.35,,,,,,,,,4237.0,31628.0,0.114,0.015,2599.0,0.009,0.115,8.7,people tested,,,,,,,,,,,,71.76,276361788.0,145.725,29.3,5.319,3.053,11188.744,5.7,342.864,6.32,2.8,76.1,64.204,1.04,71.72,0.718,
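
Both filters used so far work the same way: an expression such as `isin(...)` or `== 'Indonesia'` produces a boolean Series (a mask), and indexing the DataFrame with that mask keeps only the rows where it is True. A minimal sketch on a small made-up frame (the name `rows` and its values are invented for illustration, not taken from the COVID file):

```python
import pandas as pd

rows = pd.DataFrame({'location': ['India', 'Indonesia', 'Japan', 'India'],
                     'new_cases': [1.0, 2.0, 3.0, 4.0]})

# A boolean Series: True where the location is one of the listed countries
mask = rows['location'].isin(['India', 'Indonesia'])

# Indexing with the mask keeps only the True rows (original index labels are preserved)
selected = rows[mask]

# Masks can be combined with & (and), | (or), ~ (not); each condition needs parentheses
combined = rows[mask & (rows['new_cases'] > 1)]

print(selected)
print(combined)
```

Note that the filtered frames keep their original index labels, which is why `covid_df1` above starts at index 48284 rather than 0.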


In [None]:
# df.to_csv('covid_data_output.csv',header=True,index=False)

In [None]:
covid_df1.describe()

Unnamed: 0,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,total_cases_per_million,new_cases_per_million,new_cases_smoothed_per_million,total_deaths_per_million,new_deaths_per_million,new_deaths_smoothed_per_million,reproduction_rate,icu_patients,icu_patients_per_million,hosp_patients,hosp_patients_per_million,weekly_icu_admissions,weekly_icu_admissions_per_million,weekly_hosp_admissions,weekly_hosp_admissions_per_million,new_tests,total_tests,total_tests_per_thousand,new_tests_per_thousand,new_tests_smoothed,new_tests_smoothed_per_thousand,positive_rate,tests_per_case,total_vaccinations,people_vaccinated,people_fully_vaccinated,total_boosters,new_vaccinations,new_vaccinations_smoothed,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,total_boosters_per_hundred,new_vaccinations_smoothed_per_million,stringency_index,population,population_density,median_age,aged_65_older,aged_70_older,gdp_per_capita,extreme_poverty,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index,excess_mortality
count,1156.0,1156.0,1146.0,1106.0,1106.0,1146.0,1156.0,1156.0,1146.0,1106.0,1106.0,1146.0,1093.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,919.0,986.0,986.0,919.0,1084.0,1084.0,1083.0,1083.0,431.0,431.0,396.0,0.0,389.0,485.0,431.0,431.0,396.0,0.0,485.0,1148.0,1156.0,1156.0,1156.0,1156.0,1156.0,1156.0,1156.0,1156.0,1156.0,1156.0,1156.0,1156.0,1156.0,1156.0,1156.0,0.0
mean,6216842.0,32431.635813,32630.973951,95842.739602,527.045208,507.206813,5943.142958,33.749507,33.953388,114.189867,0.74391,0.714928,1.100128,,,,,,,,,594918.3,101577800.0,80.392494,0.472452,519150.8,0.433565,0.112364,16.818098,142062000.0,110365200.0,34498250.0,,2011055.0,1727924.0,15.267425,11.274826,4.345152,,1909.721649,68.513258,850346300.0,302.289218,28.734775,5.663273,3.238497,8741.797997,13.664533,311.733467,8.411332,2.337543,47.581834,61.812585,0.777941,70.661488,0.68049,
std,9545185.0,63255.400727,63128.877807,122051.622,826.329089,792.360976,6822.849907,49.068959,48.731353,114.341749,1.120586,1.085204,0.259964,,,,,,,,,702977.5,149639400.0,103.726664,0.476926,666874.2,0.448337,0.075804,15.367124,186746900.0,146216200.0,41797570.0,,2396145.0,2002560.0,14.069265,10.477594,3.889263,,1476.333379,15.775909,558551200.0,152.354531,0.550027,0.335017,0.180509,2381.152695,7.750383,30.293497,2.035101,0.450022,27.751372,2.327115,0.255013,1.030051,0.036502,
min,1.0,-1858.0,0.0,1.0,-39.0,0.0,0.001,-1.333,0.0,0.001,-0.028,0.0,0.61,,,,,,,,,25.0,1230.0,0.004,0.0,179.0,0.001,0.014,2.4,0.0,0.0,5468.0,,5162.0,11823.0,0.0,0.0,0.0,,43.0,10.19,276361800.0,145.725,28.2,5.319,3.053,6426.674,5.7,282.28,6.32,1.9,20.6,59.55,0.53,69.66,0.645,
25%,148858.8,2946.5,3250.82175,8509.5,94.25,93.4645,337.16875,6.78375,7.20725,19.554,0.18175,0.145,0.94,,,,,,,,,30980.5,2486416.0,5.20375,0.0915,30035.5,0.094,0.043,6.55,15342260.0,10928610.0,6796893.0,,387563.0,275354.0,2.665,2.005,0.895,,762.0,62.5,276361800.0,145.725,28.2,5.319,3.053,6426.674,5.7,282.28,6.32,1.9,20.6,59.55,0.53,69.66,0.645,
50%,1529596.0,9094.0,9412.857,43353.5,198.0,182.357,4110.2155,19.106,19.267,91.2685,0.384,0.3735,1.07,,,,,,,,,143737.0,11668310.0,32.631,0.252,103567.5,0.186,0.107,9.3,58495880.0,43471430.0,15987890.0,,1105807.0,823774.0,11.48,8.6,3.35,,1534.0,68.98,1393409000.0,450.419,28.2,5.989,3.414,6426.674,21.2,282.28,10.39,1.9,20.6,59.55,0.53,69.66,0.645,
75%,9439470.0,38637.75,39138.536,142921.25,574.75,565.9285,7836.0835,35.9775,35.59075,161.276,0.70575,0.69225,1.2,,,,,,,,,1072930.0,178166600.0,127.86375,0.77,1014655.0,0.72825,0.153,23.3,191280900.0,149626400.0,43331750.0,,2918836.0,2458290.0,24.685,18.395,6.345,,3009.0,74.54,1393409000.0,450.419,29.3,5.989,3.414,11188.744,21.2,342.864,10.39,2.8,76.1,64.204,1.04,71.72,0.718,
max,33316760.0,414188.0,391232.0,443497.0,7374.0,4190.0,23910.248,297.248,280.773,504.466,7.487,6.474,2.28,,,,,,,,,3740296.0,544445000.0,390.729,2.684,3080396.0,2.211,0.424,71.1,750284600.0,570046700.0,180237900.0,,16777690.0,9340631.0,53.85,40.91,15.32,,6703.0,100.0,1393409000.0,450.419,29.3,5.989,3.414,11188.744,21.2,342.864,10.39,2.8,76.1,64.204,1.04,71.72,0.718,


In [None]:
covid_df1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1156 entries, 48284 to 49439
Data columns (total 62 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   iso_code                               1156 non-null   object 
 1   continent                              1156 non-null   object 
 2   location                               1156 non-null   object 
 3   date                                   1156 non-null   object 
 4   total_cases                            1156 non-null   float64
 5   new_cases                              1156 non-null   float64
 6   new_cases_smoothed                     1146 non-null   float64
 7   total_deaths                           1106 non-null   float64
 8   new_deaths                             1106 non-null   float64
 9   new_deaths_smoothed                    1146 non-null   float64
 10  total_cases_per_million                1156 non-null   float64
 11 
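
The `info()` output above lists `date` with dtype `object`, which means the dates are stored as plain strings. If date arithmetic or date-based filtering is needed, one option is to convert the column with `pd.to_datetime`. A hedged sketch on a tiny made-up frame (the name `frame` and its values are illustrative only):

```python
import pandas as pd

frame = pd.DataFrame({'date': ['2020-02-24', '2020-02-25', '2021-09-14'],
                      'new_cases': [5.0, 0.0, 418.0]})

# Parse the ISO-formatted strings into real timestamps
frame['date'] = pd.to_datetime(frame['date'], format='%Y-%m-%d')

# Datetime columns can be compared against date strings directly
recent = frame[frame['date'] >= '2021-01-01']

# The .dt accessor exposes date parts, here used as a grouping key
by_year = frame.groupby(frame['date'].dt.year)['new_cases'].sum()

print(frame.dtypes)
print(recent)
print(by_year)
```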

In [None]:
covid_df1['total_cases'].mean()

6216841.846020761

In [None]:
covid_df1.shape

(1156, 62)

In [None]:
covid_df1['location'].value_counts()

India        594
Indonesia    562
Name: location, dtype: int64

## Missing Data

Let's see a few convenient methods to deal with Missing Data in pandas:

In [None]:
df = pd.DataFrame({'A':[1,2,np.nan],
                  'B':[5,np.nan,np.nan],
                  'C':[1,2,3]})

df

Unnamed: 0,A,B,C
0,1.0,5.0,1
1,2.0,,2
2,,,3


In [None]:
df.dropna()

Unnamed: 0,A,B,C
0,1.0,5.0,1


In [None]:
df.dropna(axis=1)

Unnamed: 0,C
0,1
1,2
2,3


In [None]:
df.dropna(thresh=2, axis=1)

Unnamed: 0,A,C
0,1.0,1
1,2.0,2
2,,3
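
The `thresh` argument is easy to misread: it is the minimum number of non-missing values a row or column must contain in order to be kept. A minimal sketch that rebuilds the same toy frame under a different name (`toy` is our own name, chosen so that `df` is not overwritten):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'A': [1, 2, np.nan],
                    'B': [5, np.nan, np.nan],
                    'C': [1, 2, 3]})

# Non-missing values per column: this is the number thresh is compared against
non_null_per_column = toy.notna().sum()

# Keep columns with at least 2 non-missing values (B has only 1, so it is dropped)
kept_columns = toy.dropna(thresh=2, axis=1)

# The same rule applied to rows: row 2 has a single non-missing value and is dropped
kept_rows = toy.dropna(thresh=2)

print(non_null_per_column)
print(kept_columns.columns.tolist())
print(kept_rows.index.tolist())
```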


In [None]:
df.fillna(value='FILL VALUE')

Unnamed: 0,A,B,C
0,1,5,1
1,2,FILL VALUE,2
2,FILL VALUE,FILL VALUE,3


In [None]:
# Imputation: replace the missing values in A with the column mean
df['A'].fillna(value=df['A'].mean())

0    1.0
1    2.0
2    1.5
Name: A, dtype: float64
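
When different columns need different fill values, `fillna` also accepts a dictionary that maps column names to values. The sketch below assumes we want the mean for `A` and zero for `B`; that choice is only for illustration, and `toy` is again our own name for a copy of the frame above:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'A': [1, 2, np.nan],
                    'B': [5, np.nan, np.nan],
                    'C': [1, 2, 3]})

# One fill value per column: the mean for A, a constant 0 for B
filled = toy.fillna({'A': toy['A'].mean(), 'B': 0})

# fillna returns a new DataFrame; the original still contains its missing values
remaining_in_original = int(toy.isna().sum().sum())

print(filled)
print(remaining_in_original)
```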

In [None]:
# Number of missing values in each column
df.isna().sum()

A    1
B    2
C    0
dtype: int64

In [None]:
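# Let's impute on the covid data. Assign the result back to the column:
# calling fillna(..., inplace=True) on a selected column is unreliable in recent pandas versions
covid_df['new_deaths'] = covid_df['new_deaths'].fillna(0)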

In [None]:
covid_df.head()

Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,total_cases_per_million,new_cases_per_million,new_cases_smoothed_per_million,total_deaths_per_million,new_deaths_per_million,new_deaths_smoothed_per_million,reproduction_rate,icu_patients,icu_patients_per_million,hosp_patients,hosp_patients_per_million,weekly_icu_admissions,weekly_icu_admissions_per_million,weekly_hosp_admissions,weekly_hosp_admissions_per_million,new_tests,total_tests,total_tests_per_thousand,new_tests_per_thousand,new_tests_smoothed,new_tests_smoothed_per_thousand,positive_rate,tests_per_case,tests_units,total_vaccinations,people_vaccinated,people_fully_vaccinated,total_boosters,new_vaccinations,new_vaccinations_smoothed,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,total_boosters_per_hundred,new_vaccinations_smoothed_per_million,stringency_index,population,population_density,median_age,aged_65_older,aged_70_older,gdp_per_capita,extreme_poverty,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index,excess_mortality
0,AFG,Asia,Afghanistan,2020-02-24,5.0,5.0,,,0.0,,0.126,0.126,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,8.33,39835428.0,54.422,18.6,2.581,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.511,
1,AFG,Asia,Afghanistan,2020-02-25,5.0,0.0,,,0.0,,0.126,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,8.33,39835428.0,54.422,18.6,2.581,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.511,
2,AFG,Asia,Afghanistan,2020-02-26,5.0,0.0,,,0.0,,0.126,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,8.33,39835428.0,54.422,18.6,2.581,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.511,
3,AFG,Asia,Afghanistan,2020-02-27,5.0,0.0,,,0.0,,0.126,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,8.33,39835428.0,54.422,18.6,2.581,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.511,
4,AFG,Asia,Afghanistan,2020-02-28,5.0,0.0,,,0.0,,0.126,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,8.33,39835428.0,54.422,18.6,2.581,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.511,


## Data Aggregation

In [None]:
# Create dataframe
data = {'Company':['GOJEK','GOJEK','TOKO','TOKO','FB','FB'],
       'Person':['Ayub','Amri','Calvin','Addie','Becca','Sara'],
       'Sales':[200,1200,340,124,243,350],
       'Margin':[40,40,34,100,56,60]}

In [None]:
df = pd.DataFrame(data)
df

Unnamed: 0,Company,Person,Sales,Margin
0,GOJEK,Ayub,200,40
1,GOJEK,Amri,1200,40
2,TOKO,Calvin,340,34
3,TOKO,Addie,124,100
4,FB,Becca,243,56
5,FB,Sara,350,60


In [None]:
df.describe()

Unnamed: 0,Sales,Margin
count,6.0,6.0
mean,409.5,55.0
std,396.581265,24.256958
min,124.0,34.0
25%,210.75,40.0
50%,291.5,48.0
75%,347.5,59.0
max,1200.0,100.0


**Now you can use the `.groupby()` method to group rows together based on the values in a column.**


For instance, let's group by Company. This creates a *DataFrameGroupBy* object:

In [None]:
df.groupby('Company')
# SQL analogy: SELECT ... FROM df GROUP BY Company

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f83871c7110>
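
The object printed above holds no results yet. It only records how the rows are split, and the work happens when an aggregation is requested. A small sketch of this split-apply-combine idea, using a reduced copy of the data (the names `sales` and `grouped` are ours):

```python
import pandas as pd

sales = pd.DataFrame({'Company': ['GOJEK', 'GOJEK', 'TOKO', 'TOKO', 'FB', 'FB'],
                      'Sales': [200, 1200, 340, 124, 243, 350]})

grouped = sales.groupby('Company')

# Split: iterating yields (group key, sub-DataFrame) pairs
for company, rows in grouped:
    print(company, rows['Sales'].tolist())

# Apply + combine: one aggregated value per group, indexed by the group key
total_sales = grouped['Sales'].sum()
print(total_sales)
```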

In [None]:
# You can save this object as a new variable:
by_comp = df.groupby("Company")

In [None]:
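# And then call aggregate methods off the object
# (numeric_only=True skips the text column 'Person', which newer pandas versions refuse to average)
by_comp.mean(numeric_only=True)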

Unnamed: 0_level_0,Sales,Margin
Company,Unnamed: 1_level_1,Unnamed: 2_level_1
FB,296.5,58.0
GOJEK,700.0,40.0
TOKO,232.0,67.0


In [None]:
# In one step (numeric_only=True skips the text column 'Person'):
df.groupby('Company').mean(numeric_only=True)

Unnamed: 0_level_0,Sales,Margin
Company,Unnamed: 1_level_1,Unnamed: 2_level_1
FB,296.5,58.0
GOJEK,700.0,40.0
TOKO,232.0,67.0


**More examples of aggregate methods in pandas:**

In [None]:
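# Select the numeric columns first: a standard deviation is undefined for the text column 'Person'
by_comp[['Sales', 'Margin']].std()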

Unnamed: 0_level_0,Sales,Margin
Company,Unnamed: 1_level_1,Unnamed: 2_level_1
FB,75.660426,2.828427
GOJEK,707.106781,0.0
TOKO,152.735065,46.669048


In [None]:
by_comp.min()

Unnamed: 0_level_0,Person,Sales,Margin
Company,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
FB,Becca,243,56
GOJEK,Amri,200,40
TOKO,Addie,124,34


In [None]:
by_comp.max()

Unnamed: 0_level_0,Person,Sales,Margin
Company,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
FB,Sara,350,60
GOJEK,Ayub,1200,40
TOKO,Calvin,340,100


In [None]:
by_comp.count()

Unnamed: 0_level_0,Person,Sales,Margin
Company,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
FB,2,2,2
GOJEK,2,2,2
TOKO,2,2,2


In [None]:
by_comp.describe()

Unnamed: 0_level_0,Sales,Sales,Sales,Sales,Sales,Sales,Sales,Sales,Margin,Margin,Margin,Margin,Margin,Margin,Margin,Margin
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max
Company,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2
FB,2.0,296.5,75.660426,243.0,269.75,296.5,323.25,350.0,2.0,58.0,2.828427,56.0,57.0,58.0,59.0,60.0
GOJEK,2.0,700.0,707.106781,200.0,450.0,700.0,950.0,1200.0,2.0,40.0,0.0,40.0,40.0,40.0,40.0,40.0
TOKO,2.0,232.0,152.735065,124.0,178.0,232.0,286.0,340.0,2.0,67.0,46.669048,34.0,50.5,67.0,83.5,100.0


In [None]:
by_comp.describe().transpose()

Unnamed: 0,Company,FB,GOJEK,TOKO
Sales,count,2.0,2.0,2.0
Sales,mean,296.5,700.0,232.0
Sales,std,75.660426,707.106781,152.735065
Sales,min,243.0,200.0,124.0
Sales,25%,269.75,450.0,178.0
Sales,50%,296.5,700.0,232.0
Sales,75%,323.25,950.0,286.0
Sales,max,350.0,1200.0,340.0
Margin,count,2.0,2.0,2.0
Margin,mean,58.0,40.0,67.0


In [None]:
by_comp.describe().transpose()['GOJEK']

Sales   count       2.000000
        mean      700.000000
        std       707.106781
        min       200.000000
        25%       450.000000
        50%       700.000000
        75%       950.000000
        max      1200.000000
Margin  count       2.000000
        mean       40.000000
        std         0.000000
        min        40.000000
        25%        40.000000
        50%        40.000000
        75%        40.000000
        max        40.000000
Name: GOJEK, dtype: float64

In [None]:
# Let's try some aggregation on our COVID data:
covid_df.groupby('location')['new_cases'].aggregate(['count','mean','median']).sort_values('location')

Unnamed: 0_level_0,count,mean,median
location,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Afghanistan,569,270.966608,95.0
Africa,580,13935.263793,10917.5
Albania,555,285.461261,132.0
Algeria,568,353.042254,250.5
Andorra,562,26.866548,13.0
Angola,544,93.836397,71.5
Anguilla,0,,
Antigua and Barbuda,551,4.181488,0.0
Argentina,561,9322.367201,7693.0
Armenia,563,443.699822,276.0
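
When several columns each need their own statistic, named aggregation keeps the output column names readable. A sketch on a small made-up frame (the name `cases`, the locations `X` and `Y`, and all values are invented for illustration, not taken from the COVID file):

```python
import pandas as pd

cases = pd.DataFrame({
    'location': ['X', 'X', 'Y', 'Y', 'Y'],
    'new_cases': [10, 30, 5, 0, 7],
    'new_deaths': [1, 3, 0, 0, 2],
})

# Each keyword is an output column: name=(input column, aggregation function)
summary = cases.groupby('location').agg(
    days=('new_cases', 'count'),
    mean_cases=('new_cases', 'mean'),
    total_deaths=('new_deaths', 'sum'),
)

print(summary)
```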


## Merging, Joining, and Concatenating

There are 3 main ways of combining DataFrames together: Merging, Joining and Concatenating. In this section we will discuss these 3 methods with examples.

In [None]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                        'B': ['B0', 'B1', 'B2', 'B3'],
                        'C': ['C0', 'C1', 'C2', 'C3'],
                        'D': ['D0', 'D1', 'D2', 'D3']},
                        index=[0, 1, 2, 3])

df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                        'B': ['B4', 'B5', 'B6', 'B7'],
                        'C': ['C4', 'C5', 'C6', 'C7'],
                        'D': ['D4', 'D5', 'D6', 'D7']},
                         index=[4, 5, 6, 7]) 

df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                        'B': ['B8', 'B9', 'B10', 'B11'],
                        'C': ['C8', 'C9', 'C10', 'C11'],
                        'D': ['D8', 'D9', 'D10', 'D11']},
                        index=[8, 9, 10, 11])

In [None]:
df1

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3


In [None]:
df2

Unnamed: 0,A,B,C,D
4,A4,B4,C4,D4
5,A5,B5,C5,D5
6,A6,B6,C6,D6
7,A7,B7,C7,D7


In [None]:
df3

Unnamed: 0,A,B,C,D
8,A8,B8,C8,D8
9,A9,B9,C9,D9
10,A10,B10,C10,D10
11,A11,B11,C11,D11


### Concatenation
Concatenation glues DataFrames together along an axis. The frames do not need identical shapes: pandas aligns them on the other axis and fills any gaps with NaN, so the result is cleanest when the labels on that other axis match. Pass a list of DataFrames to `pd.concat`:

In [None]:
# Similar to UNION ALL of 2 or more tables in SQL (rows are stacked and duplicates are kept)
pd.concat([df1,df2,df3])

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3
4,A4,B4,C4,D4
5,A5,B5,C5,D5
6,A6,B6,C6,D6
7,A7,B7,C7,D7
8,A8,B8,C8,D8
9,A9,B9,C9,D9


In [None]:
pd.concat([df1,df2,df3],axis=1)

Unnamed: 0,A,B,C,D,A.1,B.1,C.1,D.1,A.2,B.2,C.2,D.2
0,A0,B0,C0,D0,,,,,,,,
1,A1,B1,C1,D1,,,,,,,,
2,A2,B2,C2,D2,,,,,,,,
3,A3,B3,C3,D3,,,,,,,,
4,,,,,A4,B4,C4,D4,,,,
5,,,,,A5,B5,C5,D5,,,,
6,,,,,A6,B6,C6,D6,,,,
7,,,,,A7,B7,C7,D7,,,,
8,,,,,,,,,A8,B8,C8,D8
9,,,,,,,,,A9,B9,C9,D9


In [None]:
df2.reset_index()

Unnamed: 0,index,A,B,C,D
0,4,A4,B4,C4,D4
1,5,A5,B5,C5,D5
2,6,A6,B6,C6,D6
3,7,A7,B7,C7,D7


In [None]:
pd.concat([df1,df2.reset_index(drop=True),df3.reset_index(drop=True)],axis=1)

Unnamed: 0,A,B,C,D,A.1,B.1,C.1,D.1,A.2,B.2,C.2,D.2
0,A0,B0,C0,D0,A4,B4,C4,D4,A8,B8,C8,D8
1,A1,B1,C1,D1,A5,B5,C5,D5,A9,B9,C9,D9
2,A2,B2,C2,D2,A6,B6,C6,D6,A10,B10,C10,D10
3,A3,B3,C3,D3,A7,B7,C7,D7,A11,B11,C11,D11
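The alignment behaviour shown above can be seen in isolation with two small frames. This is a minimal sketch, assuming hypothetical frames `top` and `bottom` that share only column `B`; `ignore_index` and `join` are standard `pd.concat` parameters.

```python
import pandas as pd

# Hypothetical frames for illustration: only column 'B' is common to both
top = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
bottom = pd.DataFrame({'B': ['B2', 'B3'], 'C': ['C2', 'C3']})

# Default join='outer': union of the columns, gaps filled with NaN.
# ignore_index=True discards the old labels and renumbers rows 0..n-1.
stacked_outer = pd.concat([top, bottom], ignore_index=True)

# join='inner': keep only the columns present in every frame
stacked_inner = pd.concat([top, bottom], ignore_index=True, join='inner')

print(stacked_outer)
print(stacked_inner)
```

`ignore_index=True` is an alternative to calling `reset_index(drop=True)` on the result when the original row labels carry no meaning.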


### Merging

The **merge** function lets you combine DataFrames on one or more key columns, using logic similar to joining tables in SQL.

In [None]:
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})

right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

In [None]:
left

Unnamed: 0,key,A,B
0,K0,A0,B0
1,K1,A1,B1
2,K2,A2,B2
3,K3,A3,B3


In [None]:
right

Unnamed: 0,key,C,D
0,K0,C0,D0
1,K1,C1,D1
2,K2,C2,D2
3,K3,C3,D3


In [None]:
pd.merge(left,right,how='inner',on='key')

Unnamed: 0,key,A,B,C,D
0,K0,A0,B0,C0,D0
1,K1,A1,B1,C1,D1
2,K2,A2,B2,C2,D2
3,K3,A3,B3,C3,D3
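In the example above every key appears exactly once on each side, so the result has one row per key. When a key is duplicated, the merge silently produces one row per matching pair. The sketch below, using hypothetical frames `left_demo` and `right_demo`, shows how the `validate` parameter of `pd.merge` can be used to catch this; it is one possible safeguard, not the only one.

```python
import pandas as pd

# Hypothetical frames: 'K1' appears twice on the right-hand side
left_demo = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
                          'A': ['A0', 'A1', 'A2']})
right_demo = pd.DataFrame({'key': ['K0', 'K1', 'K1'],
                           'C': ['C0', 'C1', 'C2']})

# Without validation, the duplicated 'K1' yields two output rows
merged = pd.merge(left_demo, right_demo, how='inner', on='key')

# validate='one_to_one' raises MergeError if keys are not unique on both sides
try:
    pd.merge(left_demo, right_demo, how='inner', on='key',
             validate='one_to_one')
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True

print(merged)
print(duplicate_detected)
```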


A more complicated example, merging on two key columns:

In [None]:
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})

right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

In [None]:
pd.merge(left, right, on=['key1', 'key2'])

Unnamed: 0,key1,key2,A,B,C,D
0,K0,K0,A0,B0,C0,D0
1,K1,K0,A2,B2,C1,D1
2,K1,K0,A2,B2,C2,D2


In [None]:
pd.merge(left, right, how='outer', on=['key1', 'key2'])

Unnamed: 0,key1,key2,A,B,C,D
0,K0,K0,A0,B0,C0,D0
1,K0,K1,A1,B1,,
2,K1,K0,A2,B2,C1,D1
3,K1,K0,A2,B2,C2,D2
4,K2,K1,A3,B3,,
5,K2,K0,,,C3,D3


In [None]:
pd.merge(left, right, how='right', on=['key1', 'key2'])

Unnamed: 0,key1,key2,A,B,C,D
0,K0,K0,A0,B0,C0,D0
1,K1,K0,A2,B2,C1,D1
2,K1,K0,A2,B2,C2,D2
3,K2,K0,,,C3,D3


In [None]:
pd.merge(left, right, how='left', on=['key1', 'key2'])

Unnamed: 0,key1,key2,A,B,C,D
0,K0,K0,A0,B0,C0,D0
1,K0,K1,A1,B1,,
2,K1,K0,A2,B2,C1,D1
3,K1,K0,A2,B2,C2,D2
4,K2,K1,A3,B3,,
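To see which side each row of an inner, outer, left, or right merge came from, `pd.merge` accepts `indicator=True`, which adds a `_merge` column. The sketch below rebuilds the same data under the hypothetical names `left_demo` and `right_demo` so that it runs on its own.

```python
import pandas as pd

# Same values as the two-key example, under hypothetical names
left_demo = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                          'key2': ['K0', 'K1', 'K0', 'K1'],
                          'A': ['A0', 'A1', 'A2', 'A3']})
right_demo = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                           'key2': ['K0', 'K0', 'K0', 'K0'],
                           'C': ['C0', 'C1', 'C2', 'C3']})

# indicator=True labels each row 'both', 'left_only' or 'right_only'
outer = pd.merge(left_demo, right_demo, how='outer',
                 on=['key1', 'key2'], indicator=True)

print(outer)
print(outer['_merge'].value_counts())
```

Rows marked `both` are what an inner merge keeps; a left merge adds the `left_only` rows, and a right merge adds the `right_only` rows.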


In [None]:
pd.merge(left, right, how='inner', left_on = 'key1', right_on = 'key2')

Unnamed: 0,key1_x,key2_x,A,B,key1_y,key2_y,C,D
0,K0,K0,A0,B0,K0,K0,C0,D0
1,K0,K0,A0,B0,K1,K0,C1,D1
2,K0,K0,A0,B0,K1,K0,C2,D2
3,K0,K0,A0,B0,K2,K0,C3,D3
4,K0,K1,A1,B1,K0,K0,C0,D0
5,K0,K1,A1,B1,K1,K0,C1,D1
6,K0,K1,A1,B1,K1,K0,C2,D2
7,K0,K1,A1,B1,K2,K0,C3,D3
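When `left_on` and `right_on` name different columns, any column label that exists in both frames receives the default suffixes `_x` and `_y`, as in the output above. A minimal sketch of overriding them with the `suffixes` parameter, using hypothetical frame names and suffix strings:

```python
import pandas as pd

# Same values as the two-key example, under hypothetical names
left_demo = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                          'key2': ['K0', 'K1', 'K0', 'K1'],
                          'A': ['A0', 'A1', 'A2', 'A3']})
right_demo = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                           'key2': ['K0', 'K0', 'K0', 'K0'],
                           'C': ['C0', 'C1', 'C2', 'C3']})

# Match left.key1 against right.key2; overlapping labels get readable suffixes
renamed = pd.merge(left_demo, right_demo, how='inner',
                   left_on='key1', right_on='key2',
                   suffixes=('_left', '_right'))

print(renamed.columns.tolist())
print(len(renamed))
```

The two left rows with `key1 == 'K0'` each match all four right rows (whose `key2` is always `'K0'`), which is why the result has eight rows.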


### Joining
Joining is a convenient method for combining the columns of two potentially differently-indexed DataFrames into a single result DataFrame. `join` aligns rows on the index and performs a left join by default.

In [None]:
left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2']},
                    index=['K0', 'K1', 'K2'])

right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
                      'D': ['D0', 'D2', 'D3']},
                     index=['K0', 'K2', 'K3'])

In [None]:
left.join(right)

Unnamed: 0,A,B,C,D
K0,A0,B0,C0,D0
K1,A1,B1,,
K2,A2,B2,C2,D2


In [None]:
left.join(right, how='outer')

Unnamed: 0,A,B,C,D
K0,A0,B0,C0,D0
K1,A1,B1,,
K2,A2,B2,C2,D2
K3,,,C3,D3
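A join on the index can also be written as a merge with `left_index=True` and `right_index=True`. The sketch below, using hypothetical names `left_demo` and `right_demo` for the same data, checks that the two spellings give the same result for the default left join.

```python
import pandas as pd

# Same values as the join example, under hypothetical names
left_demo = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                          'B': ['B0', 'B1', 'B2']},
                         index=['K0', 'K1', 'K2'])
right_demo = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
                           'D': ['D0', 'D2', 'D3']},
                          index=['K0', 'K2', 'K3'])

# join defaults to how='left' and aligns on the index
joined = left_demo.join(right_demo)

# The equivalent merge, spelled out explicitly
merged = pd.merge(left_demo, right_demo, how='left',
                  left_index=True, right_index=True)

same = joined.equals(merged)
print(same)
```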


## Operations on Pandas Columns
* Addition, Subtraction, etc.
* sort_values(), sort_index()
* Dropping columns
* Applying functions to Pandas DataFrames (Map and Apply)

### Column Operations

In [None]:
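# .copy() gives an independent frame, so adding columns later does not
# trigger SettingWithCopyWarning
temp_df = covid_df1[['location', 'date', 'new_cases']].tail(10).copy()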
temp_df

Unnamed: 0,location,date,new_cases
49430,Indonesia,2021-09-05,5403.0
49431,Indonesia,2021-09-06,4413.0
49432,Indonesia,2021-09-07,7201.0
49433,Indonesia,2021-09-08,6731.0
49434,Indonesia,2021-09-09,5990.0
49435,Indonesia,2021-09-10,5376.0
49436,Indonesia,2021-09-11,5001.0
49437,Indonesia,2021-09-12,3779.0
49438,Indonesia,2021-09-13,2577.0
49439,Indonesia,2021-09-14,4128.0


In [None]:
temp_df['new_cases_added'] = temp_df['new_cases'] + temp_df['new_cases']
temp_df

Unnamed: 0,location,date,new_cases,new_cases_added
49430,Indonesia,2021-09-05,5403.0,10806.0
49431,Indonesia,2021-09-06,4413.0,8826.0
49432,Indonesia,2021-09-07,7201.0,14402.0
49433,Indonesia,2021-09-08,6731.0,13462.0
49434,Indonesia,2021-09-09,5990.0,11980.0
49435,Indonesia,2021-09-10,5376.0,10752.0
49436,Indonesia,2021-09-11,5001.0,10002.0
49437,Indonesia,2021-09-12,3779.0,7558.0
49438,Indonesia,2021-09-13,2577.0,5154.0
49439,Indonesia,2021-09-14,4128.0,8256.0


In [None]:
temp_df['new_cases_twice'] = temp_df['new_cases'] * 2
temp_df

Unnamed: 0,location,date,new_cases,new_cases_added,new_cases_twice
49430,Indonesia,2021-09-05,5403.0,10806.0,10806.0
49431,Indonesia,2021-09-06,4413.0,8826.0,8826.0
49432,Indonesia,2021-09-07,7201.0,14402.0,14402.0
49433,Indonesia,2021-09-08,6731.0,13462.0,13462.0
49434,Indonesia,2021-09-09,5990.0,11980.0,11980.0
49435,Indonesia,2021-09-10,5376.0,10752.0,10752.0
49436,Indonesia,2021-09-11,5001.0,10002.0,10002.0
49437,Indonesia,2021-09-12,3779.0,7558.0,7558.0
49438,Indonesia,2021-09-13,2577.0,5154.0,5154.0
49439,Indonesia,2021-09-14,4128.0,8256.0,8256.0
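Column arithmetic is vectorised: the operation is applied to every row at once and the result is a new Series. As a sketch of an alternative to assigning columns one by one, `DataFrame.assign` returns a new frame and leaves the original untouched. The frame `cases` below is hypothetical and reuses the first three `new_cases` values from the table above.

```python
import pandas as pd

# Hypothetical frame built from the first three rows shown above
cases = pd.DataFrame({'date': ['2021-09-05', '2021-09-06', '2021-09-07'],
                      'new_cases': [5403.0, 4413.0, 7201.0]})

# assign() returns a copy with the extra columns; 'cases' is not modified
with_extra = cases.assign(
    new_cases_twice=cases['new_cases'] * 2,        # multiplication by a scalar
    change_from_prev=cases['new_cases'].diff(),    # row minus previous row
)

print(with_extra)
```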


In [None]:
temp_df.sort_values(by = ['new_cases','date'],ascending=False, inplace=True)

In [None]:
temp_df

Unnamed: 0,location,date,new_cases,new_cases_added,new_cases_twice
49432,Indonesia,2021-09-07,7201.0,14402.0,14402.0
49433,Indonesia,2021-09-08,6731.0,13462.0,13462.0
49434,Indonesia,2021-09-09,5990.0,11980.0,11980.0
49430,Indonesia,2021-09-05,5403.0,10806.0,10806.0
49435,Indonesia,2021-09-10,5376.0,10752.0,10752.0
49436,Indonesia,2021-09-11,5001.0,10002.0,10002.0
49431,Indonesia,2021-09-06,4413.0,8826.0,8826.0
49439,Indonesia,2021-09-14,4128.0,8256.0,8256.0
49437,Indonesia,2021-09-12,3779.0,7558.0,7558.0
49438,Indonesia,2021-09-13,2577.0,5154.0,5154.0


In [None]:
temp_df.sort_index(inplace=True)
temp_df

Unnamed: 0,location,date,new_cases,new_cases_added,new_cases_twice
49430,Indonesia,2021-09-05,5403.0,10806.0,10806.0
49431,Indonesia,2021-09-06,4413.0,8826.0,8826.0
49432,Indonesia,2021-09-07,7201.0,14402.0,14402.0
49433,Indonesia,2021-09-08,6731.0,13462.0,13462.0
49434,Indonesia,2021-09-09,5990.0,11980.0,11980.0
49435,Indonesia,2021-09-10,5376.0,10752.0,10752.0
49436,Indonesia,2021-09-11,5001.0,10002.0,10002.0
49437,Indonesia,2021-09-12,3779.0,7558.0,7558.0
49438,Indonesia,2021-09-13,2577.0,5154.0,5154.0
49439,Indonesia,2021-09-14,4128.0,8256.0,8256.0
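`sort_values` also accepts one `ascending` flag per sort column, and without `inplace=True` both sort methods return a new frame. A minimal sketch with a hypothetical frame `df_demo`:

```python
import pandas as pd

# Hypothetical frame with a tie on new_cases
df_demo = pd.DataFrame({'location': ['X', 'Y', 'X', 'Y'],
                        'new_cases': [10, 30, 30, 20]},
                       index=[10, 11, 12, 13])

# new_cases descending, ties broken by location ascending
by_cases = df_demo.sort_values(by=['new_cases', 'location'],
                               ascending=[False, True])

# sort_index restores the original row order
restored = by_cases.sort_index()

print(by_cases)
print(restored)
```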


In [None]:
temp_df.drop(['new_cases_added','new_cases_twice'],axis = 1)

Unnamed: 0,location,date,new_cases
49430,Indonesia,2021-09-05,5403.0
49431,Indonesia,2021-09-06,4413.0
49432,Indonesia,2021-09-07,7201.0
49433,Indonesia,2021-09-08,6731.0
49434,Indonesia,2021-09-09,5990.0
49435,Indonesia,2021-09-10,5376.0
49436,Indonesia,2021-09-11,5001.0
49437,Indonesia,2021-09-12,3779.0
49438,Indonesia,2021-09-13,2577.0
49439,Indonesia,2021-09-14,4128.0


In [None]:
temp_df

Unnamed: 0,location,date,new_cases,new_cases_added,new_cases_twice
49430,Indonesia,2021-09-05,5403.0,10806.0,10806.0
49431,Indonesia,2021-09-06,4413.0,8826.0,8826.0
49432,Indonesia,2021-09-07,7201.0,14402.0,14402.0
49433,Indonesia,2021-09-08,6731.0,13462.0,13462.0
49434,Indonesia,2021-09-09,5990.0,11980.0,11980.0
49435,Indonesia,2021-09-10,5376.0,10752.0,10752.0
49436,Indonesia,2021-09-11,5001.0,10002.0,10002.0
49437,Indonesia,2021-09-12,3779.0,7558.0,7558.0
49438,Indonesia,2021-09-13,2577.0,5154.0,5154.0
49439,Indonesia,2021-09-14,4128.0,8256.0,8256.0


In [None]:
temp_df.drop(columns = ['new_cases_added','new_cases_twice'],inplace = True)
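The `drop` calls above differ in one respect: without `inplace=True`, `drop` returns a new DataFrame and leaves the original unchanged, which is why `temp_df` still showed the extra columns after the first call. A small sketch with a hypothetical frame `df_demo`, which also shows the `errors='ignore'` option for labels that may not exist:

```python
import pandas as pd

# Hypothetical frame for illustration
df_demo = pd.DataFrame({'new_cases': [1.0, 2.0],
                        'new_cases_twice': [2.0, 4.0]})

# Returns a new frame; df_demo keeps both columns
trimmed = df_demo.drop(columns=['new_cases_twice'])

# errors='ignore' skips labels that are not present instead of raising KeyError
safe = df_demo.drop(columns=['no_such_column'], errors='ignore')

print(trimmed.columns.tolist())
print(df_demo.columns.tolist())
```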

### Applying functions to Pandas DataFrames
[Reference Link for Map and Apply](https://towardsdatascience.com/introduction-to-pandas-apply-applymap-and-map-5d3e044e93ff)

In [None]:
temp_df

Unnamed: 0,location,date,new_cases
49430,Indonesia,2021-09-05,5403.0
49431,Indonesia,2021-09-06,4413.0
49432,Indonesia,2021-09-07,7201.0
49433,Indonesia,2021-09-08,6731.0
49434,Indonesia,2021-09-09,5990.0
49435,Indonesia,2021-09-10,5376.0
49436,Indonesia,2021-09-11,5001.0
49437,Indonesia,2021-09-12,3779.0
49438,Indonesia,2021-09-13,2577.0
49439,Indonesia,2021-09-14,4128.0


In [None]:
# NumPy function (np.sqrt) applied to each value of the column
temp_df['sqrt_new_cases'] = temp_df['new_cases'].apply(np.sqrt)
temp_df

Unnamed: 0,location,date,new_cases,sqrt_new_cases
49430,Indonesia,2021-09-05,5403.0,73.505102
49431,Indonesia,2021-09-06,4413.0,66.430415
49432,Indonesia,2021-09-07,7201.0,84.858706
49433,Indonesia,2021-09-08,6731.0,82.042672
49434,Indonesia,2021-09-09,5990.0,77.39509
49435,Indonesia,2021-09-10,5376.0,73.321211
49436,Indonesia,2021-09-11,5001.0,70.717749
49437,Indonesia,2021-09-12,3779.0,61.473572
49438,Indonesia,2021-09-13,2577.0,50.764161
49439,Indonesia,2021-09-14,4128.0,64.249514
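Because `np.sqrt` is a NumPy universal function, it can also be called directly on the Series, which avoids going through `apply`. A minimal sketch, using a hypothetical Series with three values from the table above, showing that both routes agree:

```python
import numpy as np
import pandas as pd

# Hypothetical Series built from three of the new_cases values above
new_cases = pd.Series([5403.0, 4413.0, 7201.0])

via_apply = new_cases.apply(np.sqrt)   # function passed to apply
vectorised = np.sqrt(new_cases)        # ufunc called on the whole Series

print(np.allclose(via_apply, vectorised))
print(vectorised)
```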


Create a column 'new_cases_category' which shows:
* 40k cases or fewer               -- 'Low'
* More than 40k, up to 50k cases   -- 'Medium'
* More than 50k cases              -- 'High'

In [None]:
# UDF - User defined function
def category_fn(number_of_cases):
    # Thresholds are raw case counts: 40,000 and 50,000
    if number_of_cases <= 40000:
        return 'Low'
    elif number_of_cases > 50000:
        return 'High'
    else:
        return 'Medium'

In [None]:
category_fn(45000)

'Medium'

In [None]:
temp_df['new_cases_category'] = temp_df['new_cases'].apply(category_fn)
temp_df

Unnamed: 0,location,date,new_cases,sqrt_new_cases,new_cases_category
49430,Indonesia,2021-09-05,5403.0,73.505102,Low
49431,Indonesia,2021-09-06,4413.0,66.430415,Low
49432,Indonesia,2021-09-07,7201.0,84.858706,Low
49433,Indonesia,2021-09-08,6731.0,82.042672,Low
49434,Indonesia,2021-09-09,5990.0,77.39509,Low
49435,Indonesia,2021-09-10,5376.0,73.321211,Low
49436,Indonesia,2021-09-11,5001.0,70.717749,Low
49437,Indonesia,2021-09-12,3779.0,61.473572,Low
49438,Indonesia,2021-09-13,2577.0,50.764161,Low
49439,Indonesia,2021-09-14,4128.0,64.249514,Low


In [None]:
temp_df['new_cases_category1'] = temp_df['new_cases'].map(category_fn)
temp_df

Unnamed: 0,location,date,new_cases,sqrt_new_cases,new_cases_category,new_cases_category1
49430,Indonesia,2021-09-05,5403.0,73.505102,Low,Low
49431,Indonesia,2021-09-06,4413.0,66.430415,Low,Low
49432,Indonesia,2021-09-07,7201.0,84.858706,Low,Low
49433,Indonesia,2021-09-08,6731.0,82.042672,Low,Low
49434,Indonesia,2021-09-09,5990.0,77.39509,Low,Low
49435,Indonesia,2021-09-10,5376.0,73.321211,Low,Low
49436,Indonesia,2021-09-11,5001.0,70.717749,Low,Low
49437,Indonesia,2021-09-12,3779.0,61.473572,Low,Low
49438,Indonesia,2021-09-13,2577.0,50.764161,Low,Low
49439,Indonesia,2021-09-14,4128.0,64.249514,Low,Low
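For threshold-based categories like these, `pd.cut` is a vectorised alternative to applying a Python function row by row. The sketch below is one way to express the same rule; the sample values are hypothetical and chosen to hit both boundaries, and `category_fn` is repeated so the block runs on its own.

```python
import numpy as np
import pandas as pd

def category_fn(number_of_cases):
    # Same rule as the UDF above
    if number_of_cases <= 40000:
        return 'Low'
    elif number_of_cases > 50000:
        return 'High'
    else:
        return 'Medium'

# Hypothetical values, including the 40k and 50k boundaries
new_cases = pd.Series([5403.0, 40000.0, 45000.0, 50000.0, 50001.0])

via_apply = new_cases.apply(category_fn)

# Bins are right-inclusive by default: (-inf, 40000], (40000, 50000], (50000, inf)
via_cut = pd.cut(new_cases,
                 bins=[-np.inf, 40000, 50000, np.inf],
                 labels=['Low', 'Medium', 'High'])

print(via_apply.tolist())
print(via_cut.astype(str).tolist())
```

Note that `pd.cut` returns a categorical column, whereas `apply` and `map` with this function return plain strings.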


#### Comparing map, applymap and apply: **Context Matters**

[Reference Link](https://stackoverflow.com/questions/19798153/difference-between-map-applymap-and-apply-methods-in-pandas)

**First major difference: DEFINITION**

* `map` is defined on Series ONLY
* `applymap` is defined on DataFrames ONLY (pandas 2.1 renamed it to `DataFrame.map` and deprecated the `applymap` name)
* `apply` is defined on BOTH

**Second major difference: INPUT ARGUMENT**
* `map` accepts dicts, Series, or callable
* `applymap` and `apply` accept callables only

**Third major difference: BEHAVIOR**

* `map` is elementwise for Series
* `applymap` is elementwise for DataFrames
* `apply` also works elementwise but is suited to more complex operations and aggregation. The behaviour and return value depend on the function.

**Fourth major difference (the most important one): USE CASE**

* `map` is meant for mapping values from one domain to another, so it is optimised for performance (e.g., `df['A'].map({1:'a', 2:'b', 3:'c'})`)
* `applymap` is good for elementwise transformations across multiple rows/columns (e.g., `df[['A', 'B', 'C']].applymap(str.strip)`)
* `apply` is for applying any function that cannot be vectorised (e.g., `df['sentences'].apply(nltk.sent_tokenize)`)
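The three use cases can be sketched on a small hypothetical frame `df_demo`. Because `applymap` is deprecated from pandas 2.1 onward, the elementwise step below is written as a column-wise `Series.map` inside `apply`, which is an assumption made here for version safety rather than the form used in the list above.

```python
import pandas as pd

# Hypothetical frame for illustration
df_demo = pd.DataFrame({'A': [1, 2, 4],
                        'B': [' x ', 'y ', ' z']})

# map with a dict: values without a matching key (here 4) become NaN
mapped = df_demo['A'].map({1: 'a', 2: 'b', 3: 'c'})

# Elementwise transformation of every cell in the selected columns
stripped = df_demo[['B']].apply(lambda col: col.map(str.strip))

# apply with a reducing function: one value per column
spread = df_demo[['A']].apply(lambda col: col.max() - col.min())

print(mapped.tolist())
print(stripped['B'].tolist())
print(spread)
```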

&nbsp;

**Summarizing:**
<img src="https://i.stack.imgur.com/IZys3.png">

> **Footnotes:**
1. `map` when passed a dictionary/Series will map elements based on the keys in that dictionary/Series. Missing values will be recorded as NaN in the output.
2. `applymap` in more recent versions has been optimised for some operations. You will find `applymap` slightly faster than `apply` in some cases. My suggestion is to test them both and use whatever works better.
3. `map` is optimised for elementwise mappings and transformation. Operations that involve dictionaries or Series will enable pandas to use faster code paths for better performance.
4. `Series.apply` returns a scalar for aggregating operations, Series otherwise. Similarly for `DataFrame.apply`. Note that `apply` also has fastpaths when called with certain NumPy functions such as `mean`, `sum`, etc.