# World Population
Data source: [Wikipedia](https://fr.wikipedia.org/wiki/Liste_des_pays_par_population)

![](https://upload.wikimedia.org/wikipedia/commons/thumb/8/8b/Population_balls_%28narrow%29.png/800px-Population_balls_%28narrow%29.png)

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

## 1. Get Data

**Paths & Config**

In [2]:
URL = "https://fr.wikipedia.org/wiki/Liste_des_pays_par_population"
DATA = "demographie.csv"

**Retrieve the page's HTML source**

In [3]:
# Retrieve the page source
source = requests.get(URL)
# Parse it into a BeautifulSoup object
soup = BeautifulSoup(source.text, "html.parser")
# Extract the first table with class "wikitable"
table = soup.find("table", {"class": "wikitable"})
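
`soup.find` returns `None` when no element matches, and in that case the failure would only show up in a later cell. The following is a minimal sketch of a guarded lookup. The helper name `first_wikitable` and the inline `SAMPLE_HTML` snippet are assumptions made for illustration; they stand in for the downloaded page and are not part of the original notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page (not the real Wikipedia HTML)
SAMPLE_HTML = """
<html><body>
<table class="infobox"><tr><td>ignored</td></tr></table>
<table class="wikitable sortable">
  <tr><th>Rang</th><th>Pays</th></tr>
  <tr><td>1</td><td>Chine</td></tr>
</table>
</body></html>
"""


def first_wikitable(html):
    """Return the first <table> whose class list contains 'wikitable'."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="wikitable")
    if table is None:
        # Fail early with a clear message instead of passing None downstream
        raise ValueError("no table with class 'wikitable' found")
    return table


table = first_wikitable(SAMPLE_HTML)
headers = [th.get_text(strip=True) for th in table.find_all("th")]
print(headers)  # ['Rang', 'Pays']
```

The class filter matches any table whose class list includes `wikitable`, so a table declared as `wikitable sortable` is still found while the `infobox` table is skipped.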

**Convert into Pandas DataFrame**

In [4]:
from io import StringIO

# Wrap the HTML in StringIO: recent pandas versions deprecate literal HTML strings
data = pd.read_html(StringIO(str(table)))[0]
data.head()

|   | Rang | Pays ou territoire | Population en 2021 (projections de l'ONU de 2019)[2] |
| --- | --- | --- | --- |
| 0 | - | Monde | 7 874 966 000 |
| 1 | 1 | Chine[a] | 1 444 216 000 |
| 2 | 2 | Inde | 1 393 409 000 |
| 3 | 3 | États-Unis[b] | 332 915 000 |
| 4 | 4 | Indonésie | 276 362 000 |


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 236 entries, 0 to 235
Data columns (total 3 columns):
 #   Column                                                Non-Null Count  Dtype 
---  ------                                                --------------  ----- 
 0   Rang                                                  236 non-null    object
 1   Pays ou territoire                                    236 non-null    object
 2   Population en 2021 (projections de l'ONU de 2019)[2]  236 non-null    object
dtypes: object(3)
memory usage: 5.7+ KB


## 2. Clean Data

In [6]:
# Rename columns
data.columns = ["Rang", "Pays", "Population"]
data.head()

|   | Rang | Pays | Population |
| --- | --- | --- | --- |
| 0 | - | Monde | 7 874 966 000 |
| 1 | 1 | Chine[a] | 1 444 216 000 |
| 2 | 2 | Inde | 1 393 409 000 |
| 3 | 3 | États-Unis[b] | 332 915 000 |
| 4 | 4 | Indonésie | 276 362 000 |


In [7]:
# Remove the first row (the world total, not a country)
data.drop([0], axis=0, inplace=True)
data.head()

|   | Rang | Pays | Population |
| --- | --- | --- | --- |
| 1 | 1 | Chine[a] | 1 444 216 000 |
| 2 | 2 | Inde | 1 393 409 000 |
| 3 | 3 | États-Unis[b] | 332 915 000 |
| 4 | 4 | Indonésie | 276 362 000 |
| 5 | 5 | Pakistan | 225 200 000 |
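
The `data.info()` output above reports `Rang` as `object`, and dropping the world row does not change that dtype. If a numeric rank is wanted, one possible approach is sketched below. It assumes that some rows may still carry a non-numeric placeholder such as `-`; the sample values are hypothetical and not taken from the scraped table.

```python
import pandas as pd

# Hypothetical sample: "-" stands for a row without a rank
rang = pd.Series(["1", "2", "-", "3"])

# Unparseable values become missing, then the nullable Int64 dtype
# keeps the remaining ranks as integers rather than floats
rang_num = pd.to_numeric(rang, errors="coerce").astype("Int64")
print(rang_num.tolist())
```

The nullable `Int64` dtype is used here because a plain `int64` column cannot hold missing values.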


In [8]:
# Remove footnote markers such as [a]
data["Pays"] = data["Pays"].apply(lambda x: x.split("[")[0])
data.head()

|   | Rang | Pays | Population |
| --- | --- | --- | --- |
| 1 | 1 | Chine | 1 444 216 000 |
| 2 | 2 | Inde | 1 393 409 000 |
| 3 | 3 | États-Unis | 332 915 000 |
| 4 | 4 | Indonésie | 276 362 000 |
| 5 | 5 | Pakistan | 225 200 000 |
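
As an alternative to splitting on the first `[`, the markers can be removed with a vectorized regular expression. This is only a sketch: the first three values are copied from the table above, while `Exemple[c] [12]` is a hypothetical entry added to show several markers on one name.

```python
import pandas as pd

# Three values from the table above plus one hypothetical multi-marker case
pays = pd.Series(["Chine[a]", "Inde", "États-Unis[b]", "Exemple[c] [12]"])

# Remove every bracketed marker, then trim leftover whitespace
clean = pays.str.replace(r"\[[^\]]*\]", "", regex=True).str.strip()
print(clean.tolist())  # ['Chine', 'Inde', 'États-Unis', 'Exemple']
```

Unlike the `lambda`, which calls `x.split` and would raise on a missing value, the `.str` accessor propagates missing values unchanged.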


In [9]:
# Convert Population into int
def num_only(x):
    # Keep decimal digits only; this drops the space thousands separators
    digits = [char for char in x if char.isdecimal()]
    return int("".join(digits))


data["Population"] = data["Population"].apply(num_only)
data.head()

|   | Rang | Pays | Population |
| --- | --- | --- | --- |
| 1 | 1 | Chine | 1444216000 |
| 2 | 2 | Inde | 1393409000 |
| 3 | 3 | États-Unis | 332915000 |
| 4 | 4 | Indonésie | 276362000 |
| 5 | 5 | Pakistan | 225200000 |
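
The same conversion can be sketched without a Python-level loop by stripping every non-digit character in one vectorized call. The sample below is an assumption for illustration: it reuses three population strings from the table and writes one of them with non-breaking spaces (`\u00a0`), a separator that scraped pages often contain.

```python
import pandas as pd

# Sample strings; the second one uses non-breaking spaces as separators
population = pd.Series(["1 444 216 000", "332\u00a0915\u00a0000", "225 200 000"])

# \D matches any non-digit character, so both kinds of spaces are removed
as_int = population.str.replace(r"\D", "", regex=True).astype("int64")
print(as_int.tolist())  # [1444216000, 332915000, 225200000]
```

Requesting `int64` explicitly keeps the resulting dtype predictable.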


## 3. Save Data

In [10]:
# Save the cleaned data to CSV
data.to_csv(DATA, index=False)
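
A quick way to check the saved file is to read it back and compare it with what was written. The sketch below assumes a small hypothetical frame with the same three columns and writes to a temporary directory, so it does not touch the real `demographie.csv`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical three-row sample shaped like the cleaned table
data = pd.DataFrame({
    "Rang": ["1", "2", "3"],
    "Pays": ["Chine", "Inde", "États-Unis"],
    "Population": [1444216000, 1393409000, 332915000],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demographie.csv")
    # index=False keeps the row labels out of the file
    data.to_csv(path, index=False, encoding="utf-8")
    reloaded = pd.read_csv(path, encoding="utf-8")

# Column dtypes are inferred again when the file is read back
print(reloaded.dtypes)
```

Because `read_csv` infers dtypes from the file contents, the `Rang` strings in this sample come back as integers, while the accented country names survive the round trip unchanged.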