
# Using NumPy

Once you've installed NumPy you can import it as a library:

In [None]:
import numpy as np
import pandas as pd

## List and Array Performance Comparison

Before we jump in, let's compare how long the same computation takes on a Python list and on a NumPy array.

In [None]:
# Use the same number of elements in both containers so the timing comparison is fair
l = list(range(100000))
a = np.arange(100000)

# Print only the first few elements; printing everything floods the notebook output
print(l[:5])
print(a[:5])

[0, 1, 2, 3, 4]
[0 1 2 3 4]

In [None]:
%time np.sum(a ** 2)

CPU times: user 1.02 ms, sys: 1 µs, total: 1.02 ms
Wall time: 1.17 ms


333328333350000

In [None]:
%time sum([x ** 2 for x in l])

CPU times: user 31.5 ms, sys: 3.99 ms, total: 35.5 ms
Wall time: 37.4 ms


333328333350000
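
The `%time` magic only works inside IPython or Jupyter. As a rough sketch of the same comparison in plain Python, the standard-library `timeit` module can be used. The element count of 100,000 and the variable names are choices made for this example, and the timings will vary from machine to machine.

```python
import timeit

import numpy as np

n = 100_000  # same element count for both containers
values_list = list(range(n))
values_array = np.arange(n, dtype=np.int64)  # explicit int64 avoids overflow on platforms with a 32-bit default

# Both approaches compute the same sum of squares.
list_result = sum(x ** 2 for x in values_list)
array_result = int(np.sum(values_array ** 2))
print(list_result == array_result)

# Run each approach 5 times per measurement, take 3 measurements, keep the best one.
list_time = min(timeit.repeat(lambda: sum(x ** 2 for x in values_list), number=5, repeat=3))
array_time = min(timeit.repeat(lambda: np.sum(values_array ** 2), number=5, repeat=3))

print(f"list:  {list_time:.4f} s for 5 runs")
print(f"array: {array_time:.4f} s for 5 runs")
```

Taking the minimum of several repetitions reduces the effect of background activity on the measurement.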

---
NumPy has many built-in functions and capabilities. We won't cover them all; instead we will focus on some of the most important aspects of NumPy: vectors, arrays, matrices, and number generation. Let's start by discussing arrays.

## Numpy Arrays

NumPy arrays are the main way we will use NumPy throughout the course. NumPy arrays essentially come in two flavors: vectors and matrices. Vectors are strictly 1-D arrays and matrices are 2-D (note that a matrix can still have only one row or one column).
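
To make the distinction concrete, here is a small sketch (the variable names are chosen for this example) showing that a matrix with a single row or column is still 2-D, while a vector is 1-D:

```python
import numpy as np

vector = np.array([1, 2, 3])             # 1-D: shape (3,)
row_matrix = np.array([[1, 2, 3]])       # 2-D with one row: shape (1, 3)
col_matrix = np.array([[1], [2], [3]])   # 2-D with one column: shape (3, 1)

for name, arr in [("vector", vector), ("row_matrix", row_matrix), ("col_matrix", col_matrix)]:
    print(name, arr.ndim, arr.shape)
```

The number of nested bracket levels in the input determines the number of dimensions.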

Let's begin our introduction by exploring how to create NumPy arrays.

In [None]:
my_list = [1,2,3]
my_list

[1, 2, 3]

In [None]:
np.array(my_list)

array([1, 2, 3])

In [None]:
my_matrix = [[1,2,3],[4,5,6],[7,8,9]]
my_matrix

[[1, 2, 3], [4, 5, 6], [7, 8, 9]]

In [None]:
my_matrix * 2

[[1, 2, 3], [4, 5, 6], [7, 8, 9], [1, 2, 3], [4, 5, 6], [7, 8, 9]]

In [None]:
np.array(my_matrix)

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [None]:
np.array(my_matrix)*2

array([[ 2,  4,  6],
       [ 8, 10, 12],
       [14, 16, 18]])

In [None]:
np.array(my_matrix).dtype

dtype('int64')

In [None]:
c = np.array([1, 2.0, 3])
print(c)
print(c.dtype)

[1. 2. 3.]
float64
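
The `float64` result above appears because a NumPy array holds a single dtype for all of its elements, so mixing integers and floats promotes everything to float. The following sketch illustrates this, along with two ways of controlling the dtype explicitly; the variable names are only for illustration.

```python
import numpy as np

ints = np.array([1, 2, 3])                          # all integers -> integer dtype
mixed = np.array([1, 2.0, 3])                       # one float promotes every element to float64
forced = np.array([1, 2, 3], dtype=np.float64)      # dtype requested at creation time
truncated = np.array([1.7, 2.2]).astype(np.int64)   # astype to an integer dtype drops the fractional part

print(ints.dtype, mixed.dtype, forced.dtype)
print(truncated)  # [1 2]
```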


In [None]:
np.random.rand(2)

array([0.56224426, 0.61661102])

In [None]:
np.random.rand(5,5)

array([[0.83147161, 0.95775041, 0.5643127 , 0.45690319, 0.35891455],
       [0.38705312, 0.45252009, 0.72974705, 0.75727892, 0.20278781],
       [0.6194628 , 0.74993029, 0.5754149 , 0.99134575, 0.4435996 ],
       [0.31671341, 0.86795413, 0.91041619, 0.72520115, 0.97736045],
       [0.52666752, 0.91548616, 0.06966299, 0.52856098, 0.45108406]])
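
The random values shown above will differ on every run. If reproducible numbers are needed, one option is a seeded generator. This is a sketch that assumes NumPy 1.17 or newer, where `np.random.default_rng` is available; the seed value 42 is arbitrary.

```python
import numpy as np

# Two generators created with the same seed produce the same sequence.
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)

uniform_a = rng_a.random((5, 5))          # floats in [0, 1), comparable to np.random.rand(5, 5)
uniform_b = rng_b.random((5, 5))
ints_a = rng_a.integers(0, 100, size=5)   # integers in [0, 100), comparable to np.random.randint(0, 100, 5)
ints_b = rng_b.integers(0, 100, size=5)

print(np.array_equal(uniform_a, uniform_b))  # True
print(np.array_equal(ints_a, ints_b))        # True
```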

---
## Array Attributes and Methods

Let's discuss some useful attributes and methods of an array.

In [None]:
arr = np.arange(25)
ranarr = np.random.randint(0,100, 2)
ranarr

array([65, 55])

In [None]:
arr

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24])

In [None]:
# ranarr was already displayed above; its values change every time the cell that creates it is re-run
ranarr

In [None]:
# Reshape 12 elements into a 3-D array of shape (2, 3, 2); the dimension sizes must multiply to 12
arr = np.arange(12)
print(arr)
arr.reshape(2,3,2)

[ 0  1  2  3  4  5  6  7  8  9 10 11]


array([[[ 0,  1],
        [ 2,  3],
        [ 4,  5]],

       [[ 6,  7],
        [ 8,  9],
        [10, 11]]])

In [None]:
ranarr = np.random.randint(0,100,5)
ranarr

array([66, 43, 31, 99, 20])

In [None]:
ranarr.max()

99

In [None]:
np.max(ranarr)

99

In [None]:
ranarr.argmax()

3

In [None]:
ranarr.min()

20

In [None]:
ranarr.argmin()

4

---
## Dimensions and Shapes

`ndim`, `size`, and `shape` are attributes of an array, not methods, so they are accessed without parentheses:

- `shape` returns a tuple with the length of each dimension
- `ndim` returns the number of dimensions
- `size` returns the total number of elements in the array
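
The three attributes are related to each other. As a small sketch (the array here is made up for illustration):

```python
import numpy as np

B = np.arange(12).reshape(2, 2, 3)

print(B.shape)  # (2, 2, 3)
print(B.ndim)   # 3
print(B.size)   # 12

# ndim is the length of the shape tuple; size is the product of its entries.
print(B.ndim == len(B.shape))            # True
print(B.size == int(np.prod(B.shape)))   # True

# Passing -1 asks reshape to infer that dimension from the total size.
flat = B.reshape(-1)
print(flat.shape)  # (12,)
```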

In [None]:
A = np.array([
    [1, 2, 3],
    [4, 5, 6]
])
A

array([[1, 2, 3],
       [4, 5, 6]])

In [None]:
A.shape

(2, 3)

In [None]:
A.ndim

2

In [None]:
A.size

6

In [None]:
B = np.array([
    [
        [12, 11, 10],
        [9, 8, 7],
    ],
    [
        [6, 5, 4],
        [3, 2, 1]
    ]
])

In [None]:
B

array([[[12, 11, 10],
        [ 9,  8,  7]],

       [[ 6,  5,  4],
        [ 3,  2,  1]]])

In [None]:
B.shape

(2, 2, 3)

In [None]:
B.ndim

3

In [None]:
B.size

12

# NumPy Indexing and Selection

## Bracket Indexing and Selection

In [None]:
arr = np.arange(0,11)

In [None]:
arr

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

In [None]:
#Get a value at an index
arr[8]

8

In [None]:
#Get values in a range
arr[1:5]

array([1, 2, 3, 4])

In [None]:
#Get the last four values
arr[-4:]

array([ 7,  8,  9, 10])

In [None]:
#Get the first five values
arr[:5]

array([0, 1, 2, 3, 4])

In [None]:
arr[0:]

In [None]:
arr[1:3]

In [None]:
arr[1:-1]

In [None]:
print(arr)
arr[::5]

[ 0  1  2  3  4  5  6  7  8  9 10]


array([ 0,  5, 10])

In [None]:
#Setting a value with index range (Broadcasting)
arr[0:5]=100

#Show
arr

array([100, 100, 100, 100, 100,   5,   6,   7,   8,   9,  10])
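
One consequence of slice assignment worth knowing: basic slicing returns a view of the original data, not a copy, so writing to the slice changes the original array. A minimal sketch of that behavior, with names chosen for this example:

```python
import numpy as np

arr = np.arange(0, 11)

view = arr[0:5]          # basic slicing returns a view onto arr's data
view[:] = 100            # writing through the view changes arr as well

copy = arr[5:8].copy()   # an explicit copy owns its own data
copy[:] = -1             # arr is not affected

print(arr)                           # [100 100 100 100 100   5   6   7   8   9  10]
print(np.shares_memory(arr, view))   # True
print(np.shares_memory(arr, copy))   # False
```

Use `.copy()` whenever the original array must stay unchanged.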

## Indexing a 2D array (matrices)

The general format is:
- **arr_2d[row][col]** or
- **arr_2d[row,col]**.

The comma notation is usually clearer, and it indexes in a single step instead of first creating an intermediate row.

As with Python lists, array indices start at 0.

In [None]:
arr_2d = np.array(([5,10,15],[20,25,30],[35,40,45]))

#Show
arr_2d

array([[ 5, 10, 15],
       [20, 25, 30],
       [35, 40, 45]])

In [None]:
#Indexing row
arr_2d[1]

array([20, 25, 30])

In [None]:
# Format is arr_2d[row][col] or arr_2d[row,col]

# Getting individual element value
arr_2d[1][0]

20

In [None]:
# Getting individual element value
arr_2d[1,0]

20

In [None]:
# 2D array slicing

# (2, 2) block from the top-right corner: rows 0-1, columns 1-2
arr_2d[:2,1:]

In [None]:
# Bottom row
arr_2d[2]

In [None]:
# Bottom row again, with an explicit slice over all columns
arr_2d[2,:]
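
For reference, here is a self-contained sketch of these 2-D slices with their results, plus a column selection. The array is recreated so the block runs on its own, and the variable names are only for illustration.

```python
import numpy as np

arr_2d = np.array([[5, 10, 15], [20, 25, 30], [35, 40, 45]])

top_right = arr_2d[:2, 1:]   # rows 0-1, columns 1-2
bottom_row = arr_2d[2, :]    # same values as arr_2d[2]
middle_col = arr_2d[:, 1]    # every row, column 1

print(top_right)    # [[10 15]
                    #  [25 30]]
print(bottom_row)   # [35 40 45]
print(middle_col)   # [10 25 40]
```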

## Selection

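Let's briefly go over how to use brackets to select elements based on comparison operators. A comparison produces a boolean array, and indexing with that array keeps only the elements where it is `True`.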

In [None]:
arr = np.arange(1,11)
arr

In [None]:
arr > 4

In [None]:
bool_arr = arr>4

In [None]:
bool_arr

In [None]:
arr[bool_arr]

In [None]:
arr[arr<6]

In [None]:
x = 2
arr[arr>x]
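
Conditions can also be combined. As a sketch: NumPy uses the element-wise operators `&` (and), `|` (or), and `~` (not) for boolean arrays, and each comparison needs its own parentheses because these operators bind more tightly than `>` and `<`.

```python
import numpy as np

arr = np.arange(1, 11)

between = arr[(arr > 4) & (arr < 9)]    # both conditions must hold
outside = arr[(arr < 3) | (arr > 8)]    # either condition may hold
not_big = arr[~(arr > 4)]               # negation of a condition

print(between)   # [5 6 7 8]
print(outside)   # [ 1  2  9 10]
print(not_big)   # [1 2 3 4]
```

Python's `and`, `or`, and `not` keywords do not work element-wise on arrays and raise an error when applied to an array with more than one element.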

# NumPy Operations

In [None]:
arr = np.arange(0,10)

In [None]:
arr

In [None]:
arr + arr

In [None]:
arr * arr

In [None]:
arr - arr

In [None]:
arr.sum()

In [None]:
arr.mean()

In [None]:
A = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
])

In [None]:
A.sum()

In [None]:
A.mean()

In [None]:
A.sum(axis=1)

In [None]:
A.sum(axis=0)
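
The `axis` argument names the dimension that gets collapsed. A short sketch with the same matrix, recreated here so the block is self-contained:

```python
import numpy as np

A = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
])

total = A.sum()             # all elements
col_sums = A.sum(axis=0)    # collapse the rows: one sum per column
row_sums = A.sum(axis=1)    # collapse the columns: one sum per row

print(total)      # 45
print(col_sums)   # [12 15 18]
print(row_sums)   # [ 6 15 24]
print(A.mean())   # 5.0
```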

In [None]:
B = np.array([
    [6, 5],
    [4, 3],
    [2, 1]
])

In [None]:
A.dot(B), A @ B

In [None]:
B.T
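
Matrix multiplication requires the inner dimensions to match: a (3, 3) matrix times a (3, 2) matrix gives a (3, 2) result. The sketch below recreates `A` and `B` and checks the shapes; it is an illustration, not part of the original notebook run.

```python
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])   # shape (3, 3)
B = np.array([[6, 5], [4, 3], [2, 1]])            # shape (3, 2)

product = A @ B            # same result as A.dot(B)
print(product.shape)       # (3, 2)
print(product)             # [[20 14]
                           #  [56 41]
                           #  [92 68]]

# Transposing swaps the axes, so B.T has shape (2, 3).
print(B.T.shape)           # (2, 3)

# Note: * is element-wise multiplication, not the matrix product.
print((A * A)[0])          # [1 4 9]
```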

# Introduction to Pandas

In this section of the course we will learn how to use pandas for data analysis. Think of pandas as an extremely powerful version of Excel, with a lot more features. You should go through the notebooks in this order:

* Introduction to Pandas
* Series
* DataFrames
* Summary Functions and Aggregation (GroupBy)
* Combining Data - Merging, Joining, and Concatenating (Optional)
* Operations
* Data Input and Output
___

In [None]:
# Importing required libraries and fixing options
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

pd.options.display.max_columns = None
pd.options.display.max_rows = None

%matplotlib inline

## Let's explore some COVID-19 data available online using basic **pandas functions**:
[Reference](https://ourworldindata.org/coronavirus/country/indonesia)

In [None]:
# Load the data from the Our World in Data URL
covid_df = pd.read_csv('https://covid.ourworldindata.org/data/owid-covid-data.csv')


# Let's see a sample of the data
covid_df.head(10) # first 10 rows

Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,total_cases_per_million,new_cases_per_million,new_cases_smoothed_per_million,total_deaths_per_million,new_deaths_per_million,new_deaths_smoothed_per_million,reproduction_rate,icu_patients,icu_patients_per_million,hosp_patients,hosp_patients_per_million,weekly_icu_admissions,weekly_icu_admissions_per_million,weekly_hosp_admissions,weekly_hosp_admissions_per_million,total_tests,new_tests,total_tests_per_thousand,new_tests_per_thousand,new_tests_smoothed,new_tests_smoothed_per_thousand,positive_rate,tests_per_case,tests_units,total_vaccinations,people_vaccinated,people_fully_vaccinated,total_boosters,new_vaccinations,new_vaccinations_smoothed,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,total_boosters_per_hundred,new_vaccinations_smoothed_per_million,new_people_vaccinated_smoothed,new_people_vaccinated_smoothed_per_hundred,stringency_index,population_density,median_age,aged_65_older,aged_70_older,gdp_per_capita,extreme_poverty,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index,population,excess_mortality_cumulative_absolute,excess_mortality_cumulative,excess_mortality,excess_mortality_cumulative_per_million
0,AFG,Asia,Afghanistan,2020-01-05,,0.0,,,0.0,,,0.0,,,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,54.422,18.6,2.581,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.511,41128772.0,,,,
1,AFG,Asia,Afghanistan,2020-01-06,,0.0,,,0.0,,,0.0,,,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,54.422,18.6,2.581,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.511,41128772.0,,,,
2,AFG,Asia,Afghanistan,2020-01-07,,0.0,,,0.0,,,0.0,,,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,54.422,18.6,2.581,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.511,41128772.0,,,,
3,AFG,Asia,Afghanistan,2020-01-08,,0.0,,,0.0,,,0.0,,,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,54.422,18.6,2.581,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.511,41128772.0,,,,
4,AFG,Asia,Afghanistan,2020-01-09,,0.0,,,0.0,,,0.0,,,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,54.422,18.6,2.581,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.511,41128772.0,,,,
5,AFG,Asia,Afghanistan,2020-01-10,,0.0,0.0,,0.0,0.0,,0.0,0.0,,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,54.422,18.6,2.581,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.511,41128772.0,,,,
6,AFG,Asia,Afghanistan,2020-01-11,,0.0,0.0,,0.0,0.0,,0.0,0.0,,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,54.422,18.6,2.581,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.511,41128772.0,,,,
7,AFG,Asia,Afghanistan,2020-01-12,,0.0,0.0,,0.0,0.0,,0.0,0.0,,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,54.422,18.6,2.581,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.511,41128772.0,,,,
8,AFG,Asia,Afghanistan,2020-01-13,,0.0,0.0,,0.0,0.0,,0.0,0.0,,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,54.422,18.6,2.581,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.511,41128772.0,,,,
9,AFG,Asia,Afghanistan,2020-01-14,,0.0,0.0,,0.0,0.0,,0.0,0.0,,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,54.422,18.6,2.581,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.511,41128772.0,,,,


In [None]:
covid_df.tail(10) #last 10 rows

Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,total_cases_per_million,new_cases_per_million,new_cases_smoothed_per_million,total_deaths_per_million,new_deaths_per_million,new_deaths_smoothed_per_million,reproduction_rate,icu_patients,icu_patients_per_million,hosp_patients,hosp_patients_per_million,weekly_icu_admissions,weekly_icu_admissions_per_million,weekly_hosp_admissions,weekly_hosp_admissions_per_million,total_tests,new_tests,total_tests_per_thousand,new_tests_per_thousand,new_tests_smoothed,new_tests_smoothed_per_thousand,positive_rate,tests_per_case,tests_units,total_vaccinations,people_vaccinated,people_fully_vaccinated,total_boosters,new_vaccinations,new_vaccinations_smoothed,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,total_boosters_per_hundred,new_vaccinations_smoothed_per_million,new_people_vaccinated_smoothed,new_people_vaccinated_smoothed_per_hundred,stringency_index,population_density,median_age,aged_65_older,aged_70_older,gdp_per_capita,extreme_poverty,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index,population,excess_mortality_cumulative_absolute,excess_mortality_cumulative,excess_mortality,excess_mortality_cumulative_per_million
400990,ZWE,Africa,Zimbabwe,2024-05-03,266362.0,0.0,0.429,5740.0,0.0,0.0,16320.662,0.0,0.026,351.704,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,42.729,19.6,2.822,1.882,1899.775,21.4,307.846,1.82,1.6,30.7,36.791,1.7,61.49,0.571,16320539.0,,,,
400991,ZWE,Africa,Zimbabwe,2024-05-04,266362.0,0.0,0.429,5740.0,0.0,0.0,16320.662,0.0,0.026,351.704,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,42.729,19.6,2.822,1.882,1899.775,21.4,307.846,1.82,1.6,30.7,36.791,1.7,61.49,0.571,16320539.0,,,,
400992,ZWE,Africa,Zimbabwe,2024-05-05,266362.0,0.0,0.0,5740.0,0.0,0.0,16320.662,0.0,0.0,351.704,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,42.729,19.6,2.822,1.882,1899.775,21.4,307.846,1.82,1.6,30.7,36.791,1.7,61.49,0.571,16320539.0,,,,
400993,ZWE,Africa,Zimbabwe,2024-05-06,266362.0,0.0,0.0,5740.0,0.0,0.0,16320.662,0.0,0.0,351.704,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,42.729,19.6,2.822,1.882,1899.775,21.4,307.846,1.82,1.6,30.7,36.791,1.7,61.49,0.571,16320539.0,,,,
400994,ZWE,Africa,Zimbabwe,2024-05-07,266362.0,0.0,0.0,5740.0,0.0,0.0,16320.662,0.0,0.0,351.704,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,42.729,19.6,2.822,1.882,1899.775,21.4,307.846,1.82,1.6,30.7,36.791,1.7,61.49,0.571,16320539.0,,,,
400995,ZWE,Africa,Zimbabwe,2024-05-08,266362.0,0.0,0.0,5740.0,0.0,0.0,16320.662,0.0,0.0,351.704,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,42.729,19.6,2.822,1.882,1899.775,21.4,307.846,1.82,1.6,30.7,36.791,1.7,61.49,0.571,16320539.0,,,,
400996,ZWE,Africa,Zimbabwe,2024-05-09,266362.0,0.0,0.0,5740.0,0.0,0.0,16320.662,0.0,0.0,351.704,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,42.729,19.6,2.822,1.882,1899.775,21.4,307.846,1.82,1.6,30.7,36.791,1.7,61.49,0.571,16320539.0,,,,
400997,ZWE,Africa,Zimbabwe,2024-05-10,266362.0,0.0,0.0,5740.0,0.0,0.0,16320.662,0.0,0.0,351.704,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,42.729,19.6,2.822,1.882,1899.775,21.4,307.846,1.82,1.6,30.7,36.791,1.7,61.49,0.571,16320539.0,,,,
400998,ZWE,Africa,Zimbabwe,2024-05-11,266362.0,0.0,0.0,5740.0,0.0,0.0,16320.662,0.0,0.0,351.704,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,42.729,19.6,2.822,1.882,1899.775,21.4,307.846,1.82,1.6,30.7,36.791,1.7,61.49,0.571,16320539.0,,,,
400999,ZWE,Africa,Zimbabwe,2024-05-12,266362.0,0.0,0.0,5740.0,0.0,0.0,16320.662,0.0,0.0,351.704,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,42.729,19.6,2.822,1.882,1899.775,21.4,307.846,1.82,1.6,30.7,36.791,1.7,61.49,0.571,16320539.0,,,,


In [None]:
covid_df.sample(10, random_state=111) #Sample 10 rows randomly selected from the data

Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,total_cases_per_million,new_cases_per_million,new_cases_smoothed_per_million,total_deaths_per_million,new_deaths_per_million,new_deaths_smoothed_per_million,reproduction_rate,icu_patients,icu_patients_per_million,hosp_patients,hosp_patients_per_million,weekly_icu_admissions,weekly_icu_admissions_per_million,weekly_hosp_admissions,weekly_hosp_admissions_per_million,total_tests,new_tests,total_tests_per_thousand,new_tests_per_thousand,new_tests_smoothed,new_tests_smoothed_per_thousand,positive_rate,tests_per_case,tests_units,total_vaccinations,people_vaccinated,people_fully_vaccinated,total_boosters,new_vaccinations,new_vaccinations_smoothed,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,total_boosters_per_hundred,new_vaccinations_smoothed_per_million,new_people_vaccinated_smoothed,new_people_vaccinated_smoothed_per_hundred,stringency_index,population_density,median_age,aged_65_older,aged_70_older,gdp_per_capita,extreme_poverty,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index,population,excess_mortality_cumulative_absolute,excess_mortality_cumulative,excess_mortality,excess_mortality_cumulative_per_million
284294,POL,Europe,Poland,2020-03-06,,0.0,0.0,,0.0,0.0,,0.0,0.0,,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,11.11,124.027,41.8,16.763,10.202,27216.445,,227.331,5.91,23.3,33.1,,6.62,78.73,0.88,39857144.0,,,,
326655,SVK,Europe,Slovakia,2023-10-03,1867525.0,0.0,14.857,21167.0,0.0,0.0,330918.737,0.0,2.633,3750.717,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,113.128,41.2,15.07,9.167,30155.152,0.7,287.959,7.29,23.1,37.7,,5.82,77.54,0.86,5643455.0,,,,
52952,BGR,Europe,Bulgaria,2021-04-06,352259.0,0.0,3498.429,13507.0,0.0,122.429,51940.628,0.0,515.844,1991.609,0.0,18.052,0.9,,,,,,,,,2189411.0,19244.0,317.957,2.795,15912.0,2.311,0.1655,6.0,tests performed,519635.0,414790.0,104845.0,,9586.0,8701.0,7.66,6.12,1.55,,1283.0,7020.0,0.104,53.7,65.18,44.7,20.801,13.272,18563.307,1.5,424.688,5.81,30.1,44.4,,7.454,75.05,0.816,6781955.0,,,,
174425,JAM,North America,Jamaica,2023-05-28,154965.0,39.0,5.571,3540.0,2.0,0.286,54808.653,13.794,1.971,1252.042,0.707,0.101,,,,,,,,,,,,,,,,,,,,,,,,28.0,,,,,10.0,23.0,0.001,,266.879,31.4,9.684,6.39,8193.571,,206.537,11.28,5.3,28.6,66.425,1.7,74.47,0.734,2827382.0,,,,
296805,RWA,Africa,Rwanda,2023-12-17,133208.0,0.0,0.0,1468.0,0.0,0.0,9669.078,0.0,0.0,106.557,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,494.869,20.3,2.974,1.642,1854.211,56.0,191.375,4.28,4.7,21.0,4.617,,69.02,0.543,13776702.0,,,,
269827,OMN,Asia,Oman,2024-02-07,399449.0,0.0,0.0,4628.0,0.0,0.0,87286.454,0.0,0.0,1011.297,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,14.98,30.7,2.355,1.53,37960.709,,266.342,12.61,0.5,15.6,97.4,1.6,77.86,0.813,4576300.0,,,,
218762,MTQ,North America,Martinique,2020-08-17,336.0,0.0,0.0,16.0,0.0,0.0,914.256,0.0,0.0,43.536,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,45.7,,12.543,,,,,,,,,82.54,,367512.0,,,,
12641,AIA,North America,Anguilla,2024-02-24,3904.0,0.0,0.0,12.0,0.0,0.0,245890.282,0.0,0.0,755.81,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,81.88,,15877.0,,,,
103228,GNQ,Africa,Equatorial Guinea,2020-02-25,,0.0,0.0,,0.0,0.0,,0.0,0.0,,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,45.194,22.4,2.846,1.752,22604.873,,202.812,7.78,,,24.64,2.1,58.74,0.592,1674916.0,,,,
158026,HUN,Europe,Hungary,2022-02-05,1508358.0,0.0,15271.571,41229.0,0.0,58.143,151330.59,0.0,1532.167,4136.424,0.0,5.833,1.05,,,,,,,,,,,,,37630.0,3.875,0.4119,2.4,tests performed,,,,,,14915.0,,,,,1496.0,1904.0,0.019,28.57,108.043,43.4,18.577,11.976,26777.561,0.5,278.296,7.55,26.8,34.8,,7.02,76.88,0.854,9967304.0,,,,


In [None]:
covid_df['location'].value_counts()

location
Malaysia                            1602
Asia                                1602
Europe                              1602
Czechia                             1602
High income                         1602
India                               1602
Upper middle income                 1602
European Union                      1602
World                               1602
Lithuania                           1602
Lower middle income                 1602
Greece                              1601
Bulgaria                            1601
Italy                               1600
North America                       1600
Canada                              1600
Bangladesh                          1599
Netherlands                         1599
Estonia                             1598
New Zealand                         1595
Oceania                             1595
Argentina                           1594
Mexico                              1594
Thailand                            1591
Portuga

In [None]:
covid_df.shape

(401000, 67)

In [None]:
interested_countries_list = ['India','Indonesia']

covid_df1 = covid_df[
    covid_df['location'].isin(interested_countries_list)
]
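
The filter above works in two steps: `isin` builds a boolean Series, and indexing the DataFrame with it keeps the rows marked `True`. Here is a sketch on a small made-up DataFrame (the data and the name `toy` are hypothetical), so it runs without downloading anything:

```python
import pandas as pd

# Hypothetical toy data standing in for the COVID dataset.
toy = pd.DataFrame({
    "location": ["India", "Indonesia", "Italy", "India"],
    "new_cases": [10, 20, 30, 40],
})

mask = toy["location"].isin(["India", "Indonesia"])   # boolean Series, one value per row
subset = toy[mask]                                    # keeps only the rows where mask is True

print(mask.tolist())               # [True, True, False, True]
print(subset["new_cases"].sum())   # 70
```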

In [None]:
covid_df1.head()

Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,total_cases_per_million,new_cases_per_million,new_cases_smoothed_per_million,total_deaths_per_million,new_deaths_per_million,new_deaths_smoothed_per_million,reproduction_rate,icu_patients,icu_patients_per_million,hosp_patients,hosp_patients_per_million,weekly_icu_admissions,weekly_icu_admissions_per_million,weekly_hosp_admissions,weekly_hosp_admissions_per_million,total_tests,new_tests,total_tests_per_thousand,new_tests_per_thousand,new_tests_smoothed,new_tests_smoothed_per_thousand,positive_rate,tests_per_case,tests_units,total_vaccinations,people_vaccinated,people_fully_vaccinated,total_boosters,new_vaccinations,new_vaccinations_smoothed,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,total_boosters_per_hundred,new_vaccinations_smoothed_per_million,new_people_vaccinated_smoothed,new_people_vaccinated_smoothed_per_hundred,stringency_index,population_density,median_age,aged_65_older,aged_70_older,gdp_per_capita,extreme_poverty,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index,population,excess_mortality_cumulative_absolute,excess_mortality_cumulative,excess_mortality,excess_mortality_cumulative_per_million
160444,IND,Asia,India,2020-01-05,,0.0,,,0.0,,,0.0,,,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,450.419,28.2,5.989,3.414,6426.674,21.2,282.28,10.39,1.9,20.6,59.55,0.53,69.66,0.645,1417173000.0,,,,
160445,IND,Asia,India,2020-01-06,,0.0,,,0.0,,,0.0,,,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,450.419,28.2,5.989,3.414,6426.674,21.2,282.28,10.39,1.9,20.6,59.55,0.53,69.66,0.645,1417173000.0,,,,
160446,IND,Asia,India,2020-01-07,,0.0,,,0.0,,,0.0,,,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,450.419,28.2,5.989,3.414,6426.674,21.2,282.28,10.39,1.9,20.6,59.55,0.53,69.66,0.645,1417173000.0,,,,
160447,IND,Asia,India,2020-01-08,,0.0,,,0.0,,,0.0,,,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,450.419,28.2,5.989,3.414,6426.674,21.2,282.28,10.39,1.9,20.6,59.55,0.53,69.66,0.645,1417173000.0,,,,
160448,IND,Asia,India,2020-01-09,,0.0,,,0.0,,,0.0,,,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,450.419,28.2,5.989,3.414,6426.674,21.2,282.28,10.39,1.9,20.6,59.55,0.53,69.66,0.645,1417173000.0,,,,


In [None]:
covid_df1["location"].unique()

array(['India', 'Indonesia'], dtype=object)

In [None]:
covid_df1.head()

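# Column selection, method 1: a list of column names returns a DataFrame
slicing_1 = covid_df1[["iso_code", "continent"]].head()

# Method 2: a single column name in brackets returns a Series
slicing_2 = covid_df1["iso_code"].head()

# Method 3: attribute access, equivalent to method 2 for simple column names
slicing_3 = covid_df1.iso_code.head()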

display(slicing_1)
display(slicing_2)
display(slicing_3)

Unnamed: 0,iso_code,continent
160444,IND,Asia
160445,IND,Asia
160446,IND,Asia
160447,IND,Asia
160448,IND,Asia


160444    IND
160445    IND
160446    IND
160447    IND
160448    IND
Name: iso_code, dtype: object

160444    IND
160445    IND
160446    IND
160447    IND
160448    IND
Name: iso_code, dtype: object
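
The three selection styles differ in what they return and in when they work. The following sketch uses a small made-up DataFrame (the column named `count` is chosen deliberately to show a name clash):

```python
import pandas as pd

# Hypothetical toy data; "count" clashes with the DataFrame.count method.
toy = pd.DataFrame({"iso_code": ["IND", "IDN"], "count": [1, 2]})

as_frame = toy[["iso_code"]]   # list of names -> DataFrame
as_series = toy["iso_code"]    # single name -> Series
via_attr = toy.iso_code        # attribute access -> the same Series

print(type(as_frame).__name__)      # DataFrame
print(type(as_series).__name__)     # Series
print(as_series.equals(via_attr))   # True

# Attribute access reaches the method, not the column, when the names clash:
print(callable(toy.count))          # True
print(toy["count"].tolist())        # [1, 2]
```

Bracket notation is therefore the safer default; attribute access is a convenience for simple, non-clashing names.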

In [None]:
covid_df.iloc[: , :2].head()

Unnamed: 0,iso_code,continent
0,AFG,Asia
1,AFG,Asia
2,AFG,Asia
3,AFG,Asia
4,AFG,Asia
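
`iloc` selects by integer position, while `loc` selects by label. A sketch of the difference on a made-up DataFrame whose index labels are hypothetical and deliberately do not start at 0:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "iso_code": ["IND", "IND", "IDN"],
        "continent": ["Asia", "Asia", "Asia"],
        "new_cases": [1, 2, 3],
    },
    index=[160444, 160445, 163000],  # hypothetical row labels
)

first_two_cols = toy.iloc[:, :2]     # all rows, columns at positions 0 and 1
by_position = toy.iloc[0:2]          # rows at positions 0 and 1 (end excluded)
by_label = toy.loc[160444:163000]    # rows labelled 160444 through 163000 (end included)

print(list(first_two_cols.columns))  # ['iso_code', 'continent']
print(len(by_position))              # 2
print(len(by_label))                 # 3
```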


In [None]:
covid_df[covid_df.location == 'Indonesia'].sample(5)

Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,total_cases_per_million,new_cases_per_million,new_cases_smoothed_per_million,total_deaths_per_million,new_deaths_per_million,new_deaths_smoothed_per_million,reproduction_rate,icu_patients,icu_patients_per_million,hosp_patients,hosp_patients_per_million,weekly_icu_admissions,weekly_icu_admissions_per_million,weekly_hosp_admissions,weekly_hosp_admissions_per_million,total_tests,new_tests,total_tests_per_thousand,new_tests_per_thousand,new_tests_smoothed,new_tests_smoothed_per_thousand,positive_rate,tests_per_case,tests_units,total_vaccinations,people_vaccinated,people_fully_vaccinated,total_boosters,new_vaccinations,new_vaccinations_smoothed,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,total_boosters_per_hundred,new_vaccinations_smoothed_per_million,new_people_vaccinated_smoothed,new_people_vaccinated_smoothed_per_hundred,stringency_index,population_density,median_age,aged_65_older,aged_70_older,gdp_per_capita,extreme_poverty,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index,population,excess_mortality_cumulative_absolute,excess_mortality_cumulative,excess_mortality,excess_mortality_cumulative_per_million
162835,IDN,Asia,Indonesia,2022-03-04,5539394.0,0.0,48841.286,148073.0,0.0,244.0,20106.595,0.0,177.281,537.467,0.0,0.886,0.74,,,,,,,,,57117140.0,193981.0,208.645,0.709,213616.0,0.78,0.1578,6.3,people tested,351058900.0,191631442.0,146554884.0,11942963.0,1414860.0,971776.0,127.43,69.56,53.2,4.33,3527.0,157190.0,0.057,60.47,145.725,29.3,5.319,3.053,11188.744,5.7,342.864,6.32,2.8,76.1,64.204,1.04,71.72,0.718,275501344.0,,,,
162802,IDN,Asia,Indonesia,2022-01-30,4343185.0,56807.0,8115.286,144303.0,83.0,11.857,15764.66,206.195,29.456,523.783,0.301,0.043,2.29,,,,,,,,,48343446.0,205803.0,176.595,0.752,222605.0,0.813,0.0365,27.4,people tested,,184557715.0,128005763.0,,,1314110.0,,66.99,46.46,,4770.0,489483.0,0.178,60.85,145.725,29.3,5.319,3.053,11188.744,5.7,342.864,6.32,2.8,76.1,64.204,1.04,71.72,0.718,275501344.0,,,,
163304,IDN,Asia,Indonesia,2023-06-16,6810119.0,0.0,226.0,161821.0,0.0,4.571,24719.005,0.0,0.82,587.369,0.0,0.017,,,,,,,,,,,,,,,,,,,,,,,,4452.0,,,,,16.0,239.0,0.0,,145.725,29.3,5.319,3.053,11188.744,5.7,342.864,6.32,2.8,76.1,64.204,1.04,71.72,0.718,275501344.0,,,,
163581,IDN,Asia,Indonesia,2024-03-19,6828884.0,0.0,4.0,162056.0,0.0,0.0,24787.117,0.0,0.015,588.222,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,145.725,29.3,5.319,3.053,11188.744,5.7,342.864,6.32,2.8,76.1,64.204,1.04,71.72,0.718,275501344.0,,,,
163471,IDN,Asia,Indonesia,2023-11-30,6813429.0,0.0,0.0,161918.0,0.0,0.0,24731.019,0.0,0.0,587.721,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,145.725,29.3,5.319,3.053,11188.744,5.7,342.864,6.32,2.8,76.1,64.204,1.04,71.72,0.718,275501344.0,,,,


In [None]:
# Write the full DataFrame to a CSV file, keeping the header row and omitting the index
covid_df.to_csv('covid_data_output.csv', header=True, index=False)
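
The `index=False` argument matters when the file is read back. As a sketch using an in-memory buffer instead of a file (the toy data is hypothetical): without it, the index is written as an extra first column with an empty header, which `read_csv` then names `Unnamed: 0`.

```python
from io import StringIO

import pandas as pd

df = pd.DataFrame({"location": ["India", "Indonesia"], "new_cases": [10, 20]})

# index=False: only the real columns are written.
without_index = pd.read_csv(StringIO(df.to_csv(index=False)))

# Default (index=True): the index becomes an extra, unnamed first column.
with_index = pd.read_csv(StringIO(df.to_csv()))

print(list(without_index.columns))  # ['location', 'new_cases']
print(list(with_index.columns))     # ['Unnamed: 0', 'location', 'new_cases']
```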

In [None]:
covid_df1.describe(include="all")

Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,total_cases_per_million,new_cases_per_million,new_cases_smoothed_per_million,total_deaths_per_million,new_deaths_per_million,new_deaths_smoothed_per_million,reproduction_rate,icu_patients,icu_patients_per_million,hosp_patients,hosp_patients_per_million,weekly_icu_admissions,weekly_icu_admissions_per_million,weekly_hosp_admissions,weekly_hosp_admissions_per_million,total_tests,new_tests,total_tests_per_thousand,new_tests_per_thousand,new_tests_smoothed,new_tests_smoothed_per_thousand,positive_rate,tests_per_case,tests_units,total_vaccinations,people_vaccinated,people_fully_vaccinated,total_boosters,new_vaccinations,new_vaccinations_smoothed,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,total_boosters_per_hundred,new_vaccinations_smoothed_per_million,new_people_vaccinated_smoothed,new_people_vaccinated_smoothed_per_hundred,stringency_index,population_density,median_age,aged_65_older,aged_70_older,gdp_per_capita,extreme_poverty,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index,population,excess_mortality_cumulative_absolute,excess_mortality_cumulative,excess_mortality,excess_mortality_cumulative_per_million
count,3192,3192,3192,3192,3089.0,3180.0,3170.0,3040.0,3179.0,3169.0,3089.0,3180.0,3170.0,3040.0,3179.0,3169.0,2047.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1181.0,1164.0,1181.0,1164.0,1189.0,1189.0,1181.0,1181.0,1203,1633.0,1676.0,1646.0,943.0,1555.0,2270.0,1633.0,1676.0,1646.0,943.0,2270.0,2270.0,2270.0,2184.0,3192.0,3192.0,3192.0,3192.0,3192.0,3192.0,3192.0,3192.0,3192.0,3192.0,3192.0,3192.0,3192.0,3192.0,3192.0,0.0,0.0,0.0,0.0
unique,2,1,2,1602,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
top,IND,Asia,India,2020-01-05,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,samples tested,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
freq,1602,3192,1602,2,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,831,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
mean,,,,,17771560.0,16310.54,16361.790988,250412.838816,218.830135,219.517968,18788.085344,17.788714,17.844685,339.663321,0.303483,0.304446,1.036766,,,,,,,,,259622700.0,764886.3,209.703279,0.673671,764142.0,0.66979,0.063634,91.182218,,1230772000.0,608485900.0,517807100.0,156084200.0,1570860.0,1169859.0,100.804225,51.816844,43.267053,12.236394,1402.842291,542875.0,0.064586,56.362601,298.644733,28.747932,5.655259,3.234179,8798.757741,13.479135,312.45812,8.36265,2.348308,48.245677,61.868252,0.784041,70.686128,0.681363,848483200.0,,,,
std,,,,,18192040.0,128912.0,46388.354569,200547.036844,1509.741831,534.140774,11446.417969,111.212659,38.720757,201.906051,2.072225,0.732027,0.337476,,,,,,,,,289471300.0,1217877.0,190.677049,0.831601,732170.0,0.461496,0.066816,178.40239,,939519500.0,438144400.0,417531300.0,91757630.0,2430290.0,1933790.0,60.237354,26.997616,26.547572,6.017369,1833.972756,1088594.0,0.103232,20.726648,152.369793,0.550082,0.33505,0.180527,2381.391228,7.751159,30.296532,2.035304,0.450067,27.754152,2.327348,0.255038,1.030154,0.036505,570921300.0,,,,
min,,,,,2.0,0.0,0.0,2.0,0.0,0.0,0.001,0.0,0.0,0.001,0.0,0.0,0.34,,,,,,,,,6500.0,157.0,0.005,0.0,1121.0,0.001,0.0009,3.3,,0.0,0.0,5468.0,3294338.0,0.0,0.0,0.0,0.0,0.0,0.23,0.0,0.0,0.0,0.0,145.725,28.2,5.319,3.053,6426.674,5.7,282.28,6.32,1.9,20.6,59.55,0.53,69.66,0.645,275501300.0,,,,
25%,,,,,4129020.0,0.0,172.286,122111.0,0.0,2.286,6805.253,0.0,0.182,131.273,0.0,0.003,0.85,,,,,,,,,20846130.0,156104.0,49.82,0.331,160090.0,0.324,0.0138,10.1,,203166800.0,139737500.0,74268110.0,37458810.0,2309.5,4452.0,35.89,26.935,13.2175,6.95,16.0,239.0,0.0,37.04,145.725,28.2,5.319,3.053,6426.674,5.7,282.28,6.32,1.9,20.6,59.55,0.53,69.66,0.645,275501300.0,,,,
50%,,,,,6813429.0,0.0,1804.714,161284.0,0.0,24.0,23436.793,0.0,3.428,374.458,0.0,0.044,0.99,,,,,,,,,107728100.0,569358.5,143.062,0.6385,617041.0,0.642,0.0377,26.5,,1576006000.0,837806200.0,626521800.0,222828800.0,510024.0,205397.0,131.97,70.055,58.41,15.96,308.5,35674.5,0.009,61.57,450.419,28.2,5.989,3.414,6426.674,21.2,282.28,10.39,1.9,20.6,59.55,0.53,69.66,0.645,1417173000.0,,,,
75%,,,,,43019450.0,0.0,10667.857,521685.0,0.0,185.286,30355.821,0.0,17.56,564.883,0.0,0.331,1.18,,,,,,,,,511184500.0,1168970.0,363.17,0.862,1153561.0,0.849,0.0991,72.5,,2206429000.0,1027375000.0,951894500.0,227339200.0,2024684.0,1442309.0,155.7,72.5,67.17,16.04,2310.0,507533.2,0.086,71.59,450.419,29.3,5.989,3.414,11188.744,21.2,342.864,10.39,2.8,76.1,64.204,1.04,71.72,0.718,1417173000.0,,,,


In [None]:
covid_df1.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3192 entries, 160444 to 163635
Data columns (total 67 columns):
 #   Column                                      Non-Null Count  Dtype  
---  ------                                      --------------  -----  
 0   iso_code                                    3192 non-null   object 
 1   continent                                   3192 non-null   object 
 2   location                                    3192 non-null   object 
 3   date                                        3192 non-null   object 
 4   total_cases                                 3089 non-null   float64
 5   new_cases                                   3180 non-null   float64
 6   new_cases_smoothed                          3170 non-null   float64
 7   total_deaths                                3040 non-null   float64
 8   new_deaths                                  3179 non-null   float64
 9   new_deaths_smoothed                         3169 non-null   float64
 10  total_case

In [None]:
covid_df1['total_cases'].mean()

In [None]:
covid_df1.shape

In [None]:
covid_df1['location'].value_counts()

## Data Aggregation

In [None]:
# Create dataframe
data = {
    'Company':['SAMSUNG','SAMSUNG','SKILVUL','SKILVUL','FB','FB'],
    'Person':['Ayub','Amri','Calvin','Addie','Becca','Sara'],
    'Sales':[200,1200,340,124,243,350],
    'Margin':[40,40,34,100,56,60]
    }

In [None]:
df = pd.DataFrame(data)
df

Unnamed: 0,Company,Person,Sales,Margin
0,SAMSUNG,Ayub,200,40
1,SAMSUNG,Amri,1200,40
2,SKILVUL,Calvin,340,34
3,SKILVUL,Addie,124,100
4,FB,Becca,243,56
5,FB,Sara,350,60


In [None]:
df.describe(include="all")

Unnamed: 0,Company,Person,Sales,Margin
count,6,6,6.0,6.0
unique,3,6,,
top,SAMSUNG,Ayub,,
freq,2,1,,
mean,,,409.5,55.0
std,,,396.581265,24.256958
min,,,124.0,34.0
25%,,,210.75,40.0
50%,,,291.5,48.0
75%,,,347.5,59.0


**Now you can use the `.groupby()` method to group rows together based on the values in a column.**


For instance, let's group by Company. This creates a *DataFrameGroupBy* object:

In [None]:
df.groupby('Company')
# Roughly analogous to SQL: SELECT ... FROM df GROUP BY Company

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7eb424f8f1c0>
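The object printed above holds only the grouping; nothing is aggregated until you ask for it. A minimal sketch of how to look inside one, assuming a frame with the same data as `df` (rebuilt here under the illustrative name `sales_df` so the block runs by itself):

```python
import pandas as pd

# Same data as the `df` built above, recreated so this block is self-contained.
sales_df = pd.DataFrame({
    'Company': ['SAMSUNG', 'SAMSUNG', 'SKILVUL', 'SKILVUL', 'FB', 'FB'],
    'Person': ['Ayub', 'Amri', 'Calvin', 'Addie', 'Becca', 'Sara'],
    'Sales': [200, 1200, 340, 124, 243, 350],
    'Margin': [40, 40, 34, 100, 56, 60],
})

grouped = sales_df.groupby('Company')

# Number of distinct groups found in the 'Company' column.
print(grouped.ngroups)

# Row labels that belong to each group.
group_rows = {name: list(idx) for name, idx in grouped.groups.items()}
print(group_rows)

# Rows of a single group, returned as an ordinary DataFrame.
samsung_rows = grouped.get_group('SAMSUNG')
print(samsung_rows)
```

Inspecting `groups` or calling `get_group` is a quick way to confirm the grouping is what you expect before running an aggregate.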

In [None]:
# You can save this object as a new variable:
by_comp = df.groupby('Company')
print(by_comp)

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7eb424f8cca0>


In [None]:
# And then call aggregate methods off the object:
display(df)
by_comp['Sales'].max()

Unnamed: 0,Company,Person,Sales,Margin
0,SAMSUNG,Ayub,200,40
1,SAMSUNG,Amri,1200,40
2,SKILVUL,Calvin,340,34
3,SKILVUL,Addie,124,100
4,FB,Becca,243,56
5,FB,Sara,350,60


Company
FB          350
SAMSUNG    1200
SKILVUL     340
Name: Sales, dtype: int64

In [None]:
# In one step:
df.groupby('Company')['Sales'].std()

Company
FB          75.660426
SAMSUNG    707.106781
SKILVUL    152.735065
Name: Sales, dtype: float64

**More examples of aggregate methods in pandas:**

In [None]:
by_comp['Sales'].std()

In [None]:
by_comp.min()

In [None]:
by_comp.max()

In [None]:
by_comp.count()

In [None]:
by_comp.describe()

In [None]:
display(df.groupby('Company')['Sales'].aggregate(['mean','sum','max','min']))

print()

df.groupby('Company')['Sales'].aggregate(['mean','sum','max','min']).transpose()

Company,mean,sum,max,min
FB,296.5,593,350,243
SAMSUNG,700.0,1400,1200,200
SKILVUL,232.0,464,340,124





Company,FB,SAMSUNG,SKILVUL
mean,296.5,700.0,232.0
sum,593.0,1400.0,464.0
max,350.0,1200.0,340.0
min,243.0,200.0,124.0
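Passing a list to `aggregate` applies every statistic to one column. When different columns need different statistics, named aggregation is one option. The sketch below assumes the same data as `df`; the output column names (`total_sales`, `avg_margin`, `top_sale`) are made up for illustration.

```python
import pandas as pd

# Same data as the `df` built above, recreated so this block is self-contained.
sales_df = pd.DataFrame({
    'Company': ['SAMSUNG', 'SAMSUNG', 'SKILVUL', 'SKILVUL', 'FB', 'FB'],
    'Person': ['Ayub', 'Amri', 'Calvin', 'Addie', 'Becca', 'Sara'],
    'Sales': [200, 1200, 340, 124, 243, 350],
    'Margin': [40, 40, 34, 100, 56, 60],
})

# Each keyword is an output column name; the tuple is (source column, function).
summary = sales_df.groupby('Company').agg(
    total_sales=('Sales', 'sum'),
    avg_margin=('Margin', 'mean'),
    top_sale=('Sales', 'max'),
)
print(summary)
```

The result has one row per company and flat, readable column names, which avoids the multi-level header that `describe()` produces.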


In [None]:
by_comp.describe().transpose()

Unnamed: 0,Company,FB,SAMSUNG,SKILVUL
Sales,count,2.0,2.0,2.0
Sales,mean,296.5,700.0,232.0
Sales,std,75.660426,707.106781,152.735065
Sales,min,243.0,200.0,124.0
Sales,25%,269.75,450.0,178.0
Sales,50%,296.5,700.0,232.0
Sales,75%,323.25,950.0,286.0
Sales,max,350.0,1200.0,340.0
Margin,count,2.0,2.0,2.0
Margin,mean,58.0,40.0,67.0


In [None]:
by_comp.describe().transpose()['SKILVUL']

In [None]:
# Let's try some aggregation on our COVID data:
covid_df.groupby(['continent','location'])['new_cases'].aggregate(['count','mean','median']).sort_values('location')

## Operations on Pandas Columns
* Addition, Subtraction, etc.
* sort_values(), sort_index()
* Dropping columns
* Applying functions to Pandas Dataframes (Map and Apply)

### Column Operations

In [None]:
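# .copy() gives temp_df its own data, so the column assignments below
# never touch covid_df1 or trigger SettingWithCopyWarning
temp_df = covid_df1[['location', 'date', 'new_cases']].tail(10).copy()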
temp_df

In [None]:
temp_df['new_cases_added'] = temp_df['new_cases'] + temp_df['new_cases']
temp_df

In [None]:
temp_df['new_cases_twice'] = temp_df['new_cases'] *2

In [None]:
temp_df.sort_values(by = ['new_cases','date'],ascending=[False,True], inplace=True)

In [None]:
temp_df

In [None]:
temp_df.sort_index(inplace=True)
temp_df

In [None]:
# Without inplace=True, drop() returns a new DataFrame and leaves temp_df unchanged
temp_df.drop(['new_cases_added','new_cases_twice'],axis = 1)

In [None]:
temp_df

In [None]:
# With inplace=True, the columns are removed from temp_df itself
temp_df.drop(columns = ['new_cases_added','new_cases_twice'],inplace = True)

### Applying functions to Pandas dataframes
[Reference Link for Map and Apply](https://towardsdatascience.com/introduction-to-pandas-apply-applymap-and-map-5d3e044e93ff)

In [None]:
temp_df

In [None]:
# Built-in NumPy function (np.sqrt), applied to each value in the column
temp_df['sqrt_new_cases'] = temp_df['new_cases'].apply(np.sqrt)
temp_df

Create a column named 'new_cases_category' that shows:
* Up to 40k cases            -- 'Low'
* Above 40k, up to 50k cases -- 'Medium'
* Above 50k cases            -- 'High'

In [None]:
# UDF - User-defined function
def category_fn(number_of_cases):
    """Label a case count as 'Low', 'Medium', or 'High'."""
    if number_of_cases <= 40000:
        return 'Low'
    elif number_of_cases > 50000:
        return 'High'
    else:
        return 'Medium'

In [None]:
category_fn(55000)

In [None]:
temp_df['new_cases_category'] = temp_df['new_cases'].apply(category_fn)
temp_df

In [None]:
temp_df['new_cases_category1'] = temp_df['new_cases'].map(category_fn)
temp_df
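The same bucketing can also be done without a Python-level function call per row. The sketch below uses `pd.cut` on a hypothetical Series of case counts (not the COVID data) and checks that it agrees with a copy of `category_fn`; the bin edges are taken from the thresholds above.

```python
import numpy as np
import pandas as pd

# Copy of the UDF above, repeated so this block runs by itself.
def category_fn(number_of_cases):
    if number_of_cases <= 40000:
        return 'Low'
    elif number_of_cases > 50000:
        return 'High'
    else:
        return 'Medium'

# Hypothetical case counts, chosen to hit every bucket and both boundaries.
cases = pd.Series([12000, 40000, 40001, 50000, 50001, 75000], name='new_cases')

# Bins are right-closed by default: (-inf, 40000], (40000, 50000], (50000, inf)
categories = pd.cut(
    cases,
    bins=[-np.inf, 40000, 50000, np.inf],
    labels=['Low', 'Medium', 'High'],
)

print(categories.tolist())
# ['Low', 'Low', 'Medium', 'Medium', 'High', 'High']
print(cases.map(category_fn).tolist() == categories.tolist())
# True
```

One difference to keep in mind: `pd.cut` leaves missing values as NaN, whereas `category_fn` labels NaN as 'Medium' because both of its comparisons evaluate to False.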

#### Comparing map, applymap and apply: **Context Matters**

[Reference Link](https://stackoverflow.com/questions/19798153/difference-between-map-applymap-and-apply-methods-in-pandas)

**First major difference: DEFINITION**

* `map` is defined on Series ONLY
* `applymap` is defined on DataFrames ONLY (pandas 2.1 renamed it to `DataFrame.map` and deprecated the `applymap` name)
* `apply` is defined on BOTH

**Second major difference: INPUT ARGUMENT**
* `map` accepts dicts, Series, or callable
* `applymap` and `apply` accept callables only

**Third major difference: BEHAVIOR**

* `map` is elementwise for Series
* `applymap` is elementwise for DataFrames
* `apply` also works elementwise but is suited to more complex operations and aggregation. The behaviour and return value depend on the function.

**Fourth major difference (the most important one): USE CASE**

* `map` is meant for mapping values from one domain to another, so is optimised for performance (e.g., `df['A'].map({1:'a', 2:'b', 3:'c'})`)
* `applymap` is good for elementwise transformations across multiple rows/columns (e.g., `df[['A', 'B', 'C']].applymap(str.strip)`)
* `apply` is for applying any function that cannot be vectorised (e.g., `df['sentences'].apply(nltk.sent_tokenize)`)

&nbsp;

**Summarizing:**
<img src="https://i.stack.imgur.com/IZys3.png">

> **Footnotes:**
1. `map` when passed a dictionary/Series will map elements based on the keys in that dictionary/Series. Missing values will be recorded as NaN in the output.
2. `applymap` in more recent versions has been optimised for some operations. You will find `applymap` slightly faster than apply in some cases. My suggestion is to test them both and use whatever works better.
3. `map` is optimised for elementwise mappings and transformation. Operations that involve dictionaries or Series will enable pandas to use faster code paths for better performance.
4. `Series.apply` returns a scalar for aggregating operations, Series otherwise. Similarly for `DataFrame.apply`. Note that `apply` also has fastpaths when called with certain NumPy functions such as `mean`, `sum`, etc.
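The differences listed above can be seen side by side in a small sketch. The frame `demo` and its columns are hypothetical and exist only for illustration. Because pandas 2.1 renamed `DataFrame.applymap` to `DataFrame.map`, the block picks whichever of the two is available.

```python
import pandas as pd

# Hypothetical frame used only for illustration.
demo = pd.DataFrame({
    'grade': [1, 2, 3, 4],
    'name': [' ann ', 'bob ', ' cy', 'dee'],
})

# Series.map with a dict: values missing from the dict (here 4) become NaN.
letters = demo['grade'].map({1: 'a', 2: 'b', 3: 'c'})
print(letters.tolist())

# Series.apply with a callable: the function runs once per element.
name_lengths = demo['name'].apply(lambda s: len(s.strip()))
print(name_lengths.tolist())

# Elementwise over a whole DataFrame: DataFrame.map on pandas 2.1+,
# DataFrame.applymap on older versions.
elementwise = demo.map if hasattr(demo, 'map') else demo.applymap
as_text = elementwise(lambda v: str(v).strip())
print(as_text)

# DataFrame.apply with an aggregating function: one value per column.
col_max = demo[['grade']].apply(lambda col: col.max())
print(col_max)
```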

## Merging, Joining, and Concatenating (Optional)

There are 3 main ways of combining DataFrames together: Merging, Joining and Concatenating. In this section we will discuss these 3 methods with examples.

In [None]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                        'B': ['B0', 'B1', 'B2', 'B3'],
                        'C': ['C0', 'C1', 'C2', 'C3'],
                        'D': ['D0', 'D1', 'D2', 'D3']},
                        index=[0, 1, 2, 3])

df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                        'B': ['B4', 'B5', 'B6', 'B7'],
                        'C': ['C4', 'C5', 'C6', 'C7'],
                        'D': ['D4', 'D5', 'D6', 'D7']},
                         index=[4, 5, 6, 7])

df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                        'B': ['B8', 'B9', 'B10', 'B11'],
                        'C': ['C8', 'C9', 'C10', 'C11'],
                        'D': ['D8', 'D9', 'D10', 'D11']},
                        index=[8, 9, 10, 11])

In [None]:
df1

In [None]:
df2

In [None]:
df3

### Concatenation
Concatenation glues DataFrames together along an axis. The labels on the other axis should match; where they do not, pandas fills the gaps with NaN. Use `pd.concat` and pass in a list of DataFrames to concatenate:

In [None]:
# Similar to UNION ALL of 2 or more tables in SQL (duplicate rows are kept)
pd.concat([df1,df2,df3])

In [None]:
pd.concat([df1,df2,df3],axis=1)

In [None]:
df2.reset_index()

In [None]:
pd.concat([df1,df2.reset_index(drop=True),df3.reset_index(drop=True)],axis=1)
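A few `pd.concat` options are worth seeing on frames whose columns do not fully match. The sketch below uses two small hypothetical frames (`top` and `bottom`), not the `df1`/`df2`/`df3` defined above.

```python
import pandas as pd

# Hypothetical frames that share column 'A' but differ in the second column.
top = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
bottom = pd.DataFrame({'A': ['A2', 'A3'], 'C': ['C2', 'C3']})

# Default join='outer' keeps every column and fills the gaps with NaN;
# ignore_index=True renumbers the rows 0..n-1.
stacked = pd.concat([top, bottom], ignore_index=True)
print(stacked)

# join='inner' keeps only the columns present in all inputs.
shared = pd.concat([top, bottom], join='inner', ignore_index=True)
print(shared)

# keys= labels each input, producing a two-level row index.
labelled = pd.concat([top, bottom], keys=['top', 'bottom'])
print(labelled.loc['bottom'])
```

Using `keys=` keeps track of which input each row came from, which helps when the original indexes overlap.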

### Merging

The **merge** function combines DataFrames using logic similar to SQL joins: rows are matched on one or more key columns.

In [None]:
df1 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                    'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3']})

df2 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']})

In [None]:
pd.merge(df1,df2,how='inner',on='key')

A more complicated example, with two key columns:

In [None]:
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})

right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

In [None]:
pd.merge(left, right, on=['key1', 'key2'])

In [None]:
pd.merge(left, right, how='outer', on=['key1', 'key2'])

In [None]:
pd.merge(left, right, how='right', on=['key1', 'key2'])

In [None]:
pd.merge(left, right, how='left', on=['key1', 'key2'])

In [None]:
pd.merge(left, right, how='inner', left_on = 'key1', right_on = 'key2')
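To see how the `how` options differ, it can help to count the rows each one returns and to ask pandas where each row came from. The sketch below rebuilds the two-key `left` and `right` frames from above and uses the `indicator=True` option of `pd.merge`; the variable names `row_counts` and `audit` are illustrative.

```python
import pandas as pd

# Same two-key frames as above, recreated so this block is self-contained.
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})

right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

# Row count produced by each join type on the same pair of keys.
row_counts = {
    how: len(pd.merge(left, right, how=how, on=['key1', 'key2']))
    for how in ['inner', 'left', 'right', 'outer']
}
print(row_counts)
# {'inner': 3, 'left': 5, 'right': 4, 'outer': 6}

# indicator=True adds a '_merge' column: 'both', 'left_only', or 'right_only'.
audit = pd.merge(left, right, how='outer', on=['key1', 'key2'], indicator=True)
print(audit[['key1', 'key2', '_merge']])
```

The inner join returns three rows rather than two because the key pair (K1, K0) appears twice in `right`, so the single matching row in `left` is repeated.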

### Joining
Joining is a convenient method for combining the columns of two potentially differently-indexed DataFrames into a single result DataFrame. By default, `join` matches rows on the index and performs a left join.

In [None]:
left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2']},
                    index=['K0', 'K1', 'K2'])

right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
                      'D': ['D0', 'D2', 'D3']},
                     index=['K0', 'K2', 'K3'])

In [None]:
left.join(right)

In [None]:
left.join(right, how='outer')
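`join` can be read as a shortcut for merging on the index. The sketch below rebuilds the `left` and `right` frames from above and compares `left.join(right)` with the equivalent `pd.merge` call; the names `joined`, `merged`, and `outer_joined` are illustrative.

```python
import pandas as pd

# Same index-keyed frames as above, recreated so this block is self-contained.
left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2']},
                    index=['K0', 'K1', 'K2'])

right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
                      'D': ['D0', 'D2', 'D3']},
                     index=['K0', 'K2', 'K3'])

# Default join: keep every row of `left`, match on the index.
joined = left.join(right)

# The same operation written with pd.merge.
merged = pd.merge(left, right, how='left', left_index=True, right_index=True)
print(joined.equals(merged))

# K1 exists only in `left`, so its C and D values are missing.
print(joined.loc['K1'])

# An outer join keeps the index labels from both frames.
outer_joined = left.join(right, how='outer')
print(outer_joined.index.tolist())
```

`join` is the shorter spelling when the keys already live in the index; `merge` is the more general tool when they live in columns.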