# Imports

In [58]:
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
import scipy.stats as st
#from pandas_profiling import ProfileReport
import missingno as msno

In [18]:
ls -lh ./data

ls: cannot access '.csv': No such file or directory
./data:
EdStatsCountry.csv                         EdStatsData.csv:Zone.Identifier
EdStatsCountry.csv:Zone.Identifier         EdStatsFootNote.csv
EdStatsCountry-Series.csv                  EdStatsFootNote.csv:Zone.Identifier
EdStatsCountry-Series.csv:Zone.Identifier  EdStatsSeries.csv
EdStatsData.csv                            EdStatsSeries.csv:Zone.Identifier


# Initial situation and objectives


## Initial situation

### Context


You are a Data Scientist at an EdTech start-up named academy, which offers online training content for a **high-school and university-level** audience.

Mark, your manager, has invited you to a meeting to present the company's **international expansion project**. He gives you a first exploratory analysis mission: **determine whether the World Bank's education data can inform the expansion project**.

### Questions raised during the meeting

Here are the questions Mark would like to explore, as you noted them during the meeting:

- Which countries have strong customer potential for our services?
- For each of these countries, how will this customer potential evolve?
- In which countries should the company operate first?

### Post-meeting email


#### Source data description

"The World Bank data are available at the following address: https://datacatalog.worldbank.org/dataset/education-statistics

Or as a direct download from this [link](https://s3-eu-west-1.amazonaws.com/static.oc-static.com/prod/courses/files/Parcours_data_scientist/Projet+-+Donn%C3%A9es+%C3%A9ducatives/Projet+Python_Dataset_Edstats_csv.zip)."

> The World Bank EdStats All Indicator Query holds over 4,000 internationally comparable indicators that describe education access, progression, completion, literacy, teachers, population, and expenditures. The indicators cover the education cycle from pre-primary to vocational and tertiary education. The query also holds learning outcome data from international and regional learning assessments (e.g. PISA, TIMSS, PIRLS), equity data from household surveys, and projection/attainment data to 2050. For further information, please visit the EdStats website.

"I'll let you look at the home page that describes the dataset. In short, the World Bank's “EdStats All Indicator Query” lists 4,000 international indicators describing access to education, graduation, and information about teachers, education-related expenditure... You will find more information [on this site](http://datatopics.worldbank.org/education/)"

#### Additional requests

For the pre-analysis, could you:

- Validate the quality of this dataset (does it contain many missing or duplicated values?)
- Describe the information contained in the dataset (number of columns? number of rows?)
- Select the information that seems relevant to the business question (which columns contain information that may help answer the company's question?)
- Determine orders of magnitude for the standard statistical indicators across the world's geographic zones and countries (mean/median/standard deviation by country and by continent or geographic block)

Your work will allow us to determine whether this dataset can inform decisions about opening in new countries. We will share your analysis with the board, so please take care over the presentation and illustrate it with relevant, readable charts!

### Objective

"Determine whether this dataset can inform decisions about opening in new countries. We will share your analysis with the board, so please take care over the presentation and illustrate it with relevant, readable charts"

# Exploration

## Main DFs

### data (ED_Stats)

In [81]:
data  = pd.read_csv('./data/EdStatsData.csv')
data.shape

(886930, 70)

In [82]:
data.dtypes

Country Name       object
Country Code       object
Indicator Name     object
Indicator Code     object
1970              float64
                   ...   
2085              float64
2090              float64
2095              float64
2100              float64
Unnamed: 69       float64
Length: 70, dtype: object

In [27]:
data.head()

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1970,1971,1972,1973,1974,1975,...,2060,2065,2070,2075,2080,2085,2090,2095,2100,Unnamed: 69
0,Arab World,ARB,"Adjusted net enrolment rate, lower secondary, ...",UIS.NERA.2,,,,,,,...,,,,,,,,,,
1,Arab World,ARB,"Adjusted net enrolment rate, lower secondary, ...",UIS.NERA.2.F,,,,,,,...,,,,,,,,,,
2,Arab World,ARB,"Adjusted net enrolment rate, lower secondary, ...",UIS.NERA.2.GPI,,,,,,,...,,,,,,,,,,
3,Arab World,ARB,"Adjusted net enrolment rate, lower secondary, ...",UIS.NERA.2.M,,,,,,,...,,,,,,,,,,
4,Arab World,ARB,"Adjusted net enrolment rate, primary, both sex...",SE.PRM.TENR,54.822121,54.894138,56.209438,57.267109,57.991138,59.36554,...,,,,,,,,,,


In [101]:
# look for missing values
data_na = data.isnull().sum()
data_na

Country Name           0
Country Code           0
Indicator Name         0
Indicator Code         0
1970              814642
                   ...  
2085              835494
2090              835494
2095              835494
2100              835494
Unnamed: 69       886930
Length: 70, dtype: int64

In [34]:
data.isnull().mean()

Country Name      0.000000
Country Code      0.000000
Indicator Name    0.000000
Indicator Code    0.000000
1970              0.918496
                    ...   
2085              0.942007
2090              0.942007
2095              0.942007
2100              0.942007
Unnamed: 69       1.000000
Length: 70, dtype: float64
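The output above shows that `Unnamed: 69` is missing in 100% of rows, so it carries no information. A minimal sketch of how such columns could be dropped before further analysis is given below. The helper name `drop_empty_columns` and the toy frame are illustrative assumptions, not part of the original notebook; on the real data the call would be `drop_empty_columns(data)`.

```python
import numpy as np
import pandas as pd


def drop_empty_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without the columns that contain only NaN."""
    return df.dropna(axis="columns", how="all")


# Toy frame standing in for `data` (values are made up for illustration).
toy = pd.DataFrame({
    "Country Code": ["ARB", "ZWE"],
    "1970": [54.8, np.nan],
    "Unnamed: 69": [np.nan, np.nan],
})

toy_clean = drop_empty_columns(toy)
missing_share = toy_clean.isnull().mean()  # share of missing values per column
print(list(toy_clean.columns))
print(missing_share)
```

Using `how="all"` only removes columns that are entirely empty; sparsely filled year columns are kept so that the decision about them can be made separately.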

In [31]:
# variables with missing values
data_na[data_na > 0]

1970           814642
1971           851393
1972           851311
1973           851385
1974           851200
                ...  
2085           835494
2090           835494
2095           835494
2100           835494
Unnamed: 69    886930
Length: 66, dtype: int64

In [36]:
# look for duplicates
data.duplicated().sum()

0

In [44]:
data.duplicated(subset=['Country Name']).sum()

886688

In [45]:
data.duplicated(subset=['Country Code']).sum()

886688

In [46]:
data.duplicated(subset=['Indicator Name']).sum()

883265
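The three single-column counts above are expected rather than alarming: 886,930 − 886,688 = 242 distinct countries and 886,930 − 883,265 = 3,665 distinct indicator names, and 242 × 3,665 = 886,930, which is consistent with one row per (country, indicator) pair. A more informative duplicate check is therefore on the pair of keys. The sketch below illustrates this on a toy frame; the helper `count_duplicate_keys` is a hypothetical name introduced here.

```python
import pandas as pd


def count_duplicate_keys(df: pd.DataFrame, keys: list) -> int:
    """Count rows that repeat an earlier row on the given key columns."""
    return int(df.duplicated(subset=keys).sum())


# Toy frame standing in for `data`: 2 countries x 2 indicators.
toy = pd.DataFrame({
    "Country Code": ["ARB", "ARB", "ZWE", "ZWE"],
    "Indicator Code": ["SE.PRM.TENR", "UIS.NERA.2", "SE.PRM.TENR", "UIS.NERA.2"],
})

n_countries = toy["Country Code"].nunique()
n_indicators = toy["Indicator Code"].nunique()
dup_pairs = count_duplicate_keys(toy, ["Country Code", "Indicator Code"])

print(n_countries, n_indicators, dup_pairs)
```

On the real data the equivalent call would be `count_duplicate_keys(data, ["Country Code", "Indicator Code"])`; a result of zero would confirm that the table is a complete country × indicator grid.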

In [49]:
data

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1970,1971,1972,1973,1974,1975,...,2060,2065,2070,2075,2080,2085,2090,2095,2100,Unnamed: 69
0,Arab World,ARB,"Adjusted net enrolment rate, lower secondary, ...",UIS.NERA.2,,,,,,,...,,,,,,,,,,
1,Arab World,ARB,"Adjusted net enrolment rate, lower secondary, ...",UIS.NERA.2.F,,,,,,,...,,,,,,,,,,
2,Arab World,ARB,"Adjusted net enrolment rate, lower secondary, ...",UIS.NERA.2.GPI,,,,,,,...,,,,,,,,,,
3,Arab World,ARB,"Adjusted net enrolment rate, lower secondary, ...",UIS.NERA.2.M,,,,,,,...,,,,,,,,,,
4,Arab World,ARB,"Adjusted net enrolment rate, primary, both sex...",SE.PRM.TENR,54.822121,54.894138,56.209438,57.267109,57.991138,59.36554,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886925,Zimbabwe,ZWE,"Youth illiterate population, 15-24 years, male...",UIS.LP.AG15T24.M,,,,,,,...,,,,,,,,,,
886926,Zimbabwe,ZWE,"Youth literacy rate, population 15-24 years, b...",SE.ADT.1524.LT.ZS,,,,,,,...,,,,,,,,,,
886927,Zimbabwe,ZWE,"Youth literacy rate, population 15-24 years, f...",SE.ADT.1524.LT.FE.ZS,,,,,,,...,,,,,,,,,,
886928,Zimbabwe,ZWE,"Youth literacy rate, population 15-24 years, g...",SE.ADT.1524.LT.FM.ZS,,,,,,,...,,,,,,,,,,


In [1]:
data_na.dropna()

In [8]:
data.shape[0]

886930

In [9]:
sns.heatmap(data.isnull(), cbar=False)

<AxesSubplot:>

In [None]:
msno.matrix(data)

<AxesSubplot:>

In [10]:
# detail: rows with no value for 1970
data.loc[data['1970'].isnull()]

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1970,1971,1972,1973,1974,1975,...,2060,2065,2070,2075,2080,2085,2090,2095,2100,Unnamed: 69
0,Arab World,ARB,"Adjusted net enrolment rate, lower secondary, ...",UIS.NERA.2,,,,,,,...,,,,,,,,,,
1,Arab World,ARB,"Adjusted net enrolment rate, lower secondary, ...",UIS.NERA.2.F,,,,,,,...,,,,,,,,,,
2,Arab World,ARB,"Adjusted net enrolment rate, lower secondary, ...",UIS.NERA.2.GPI,,,,,,,...,,,,,,,,,,
3,Arab World,ARB,"Adjusted net enrolment rate, lower secondary, ...",UIS.NERA.2.M,,,,,,,...,,,,,,,,,,
8,Arab World,ARB,"Adjusted net enrolment rate, upper secondary, ...",UIS.NERA.3,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886925,Zimbabwe,ZWE,"Youth illiterate population, 15-24 years, male...",UIS.LP.AG15T24.M,,,,,,,...,,,,,,,,,,
886926,Zimbabwe,ZWE,"Youth literacy rate, population 15-24 years, b...",SE.ADT.1524.LT.ZS,,,,,,,...,,,,,,,,,,
886927,Zimbabwe,ZWE,"Youth literacy rate, population 15-24 years, f...",SE.ADT.1524.LT.FE.ZS,,,,,,,...,,,,,,,,,,
886928,Zimbabwe,ZWE,"Youth literacy rate, population 15-24 years, g...",SE.ADT.1524.LT.FM.ZS,,,,,,,...,,,,,,,,,,


In [None]:
#profile = ProfileReport(data, minimal=True)
#profile.to_file(output_file="./PP_reports/EDS_stats_df.html")

### country

In [83]:
country  = pd.read_csv('./data/EdStatsCountry.csv')
country.shape

(241, 32)

In [84]:
country.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 241 entries, 0 to 240
Data columns (total 32 columns):
 #   Column                                             Non-Null Count  Dtype  
---  ------                                             --------------  -----  
 0   Country Code                                       241 non-null    object 
 1   Short Name                                         241 non-null    object 
 2   Table Name                                         241 non-null    object 
 3   Long Name                                          241 non-null    object 
 4   2-alpha code                                       238 non-null    object 
 5   Currency Unit                                      215 non-null    object 
 6   Special Notes                                      145 non-null    object 
 7   Region                                             214 non-null    object 
 8   Income Group                                       214 non-null    object 
 9   WB-2 code 

In [14]:
country.head()

Unnamed: 0,Country Code,Short Name,Table Name,Long Name,2-alpha code,Currency Unit,Special Notes,Region,Income Group,WB-2 code,...,IMF data dissemination standard,Latest population census,Latest household survey,Source of most recent Income and expenditure data,Vital registration complete,Latest agricultural census,Latest industrial data,Latest trade data,Latest water withdrawal data,Unnamed: 31
0,ABW,Aruba,Aruba,Aruba,AW,Aruban florin,SNA data for 2000-2011 are updated from offici...,Latin America & Caribbean,High income: nonOECD,AW,...,,2010,,,Yes,,,2012.0,,
1,AFG,Afghanistan,Afghanistan,Islamic State of Afghanistan,AF,Afghan afghani,Fiscal year end: March 20; reporting period fo...,South Asia,Low income,AF,...,General Data Dissemination System (GDDS),1979,"Multiple Indicator Cluster Survey (MICS), 2010/11","Integrated household survey (IHS), 2008",,2013/14,,2012.0,2000.0,
2,AGO,Angola,Angola,People's Republic of Angola,AO,Angolan kwanza,"April 2013 database update: Based on IMF data,...",Sub-Saharan Africa,Upper middle income,AO,...,General Data Dissemination System (GDDS),1970,"Malaria Indicator Survey (MIS), 2011","Integrated household survey (IHS), 2008",,2015,,,2005.0,
3,ALB,Albania,Albania,Republic of Albania,AL,Albanian lek,,Europe & Central Asia,Upper middle income,AL,...,General Data Dissemination System (GDDS),2011,"Demographic and Health Survey (DHS), 2008/09",Living Standards Measurement Study Survey (LSM...,Yes,2012,2010.0,2012.0,2006.0,
4,AND,Andorra,Andorra,Principality of Andorra,AD,Euro,,Europe & Central Asia,High income: nonOECD,AD,...,,2011. Population figures compiled from adminis...,,,Yes,,,2006.0,,


In [55]:
country.duplicated().sum()

0

In [102]:
# look for missing values
country_na = country.isnull().sum()
country_na

Country Code                                           0
Short Name                                             0
Table Name                                             0
Long Name                                              0
2-alpha code                                           3
Currency Unit                                         26
Special Notes                                         96
Region                                                27
Income Group                                          27
WB-2 code                                              1
National accounts base year                           36
National accounts reference year                     209
SNA price valuation                                   44
Lending category                                      97
Other groups                                         183
System of National Accounts                           26
Alternative conversion factor                        194
PPP survey year                

In [53]:
country.isnull().mean()

Country Code                                         0.000000
Short Name                                           0.000000
Table Name                                           0.000000
Long Name                                            0.000000
2-alpha code                                         0.012448
Currency Unit                                        0.107884
Special Notes                                        0.398340
Region                                               0.112033
Income Group                                         0.112033
WB-2 code                                            0.004149
National accounts base year                          0.149378
National accounts reference year                     0.867220
SNA price valuation                                  0.182573
Lending category                                     0.402490
Other groups                                         0.759336
System of National Accounts                          0.107884
Alternat

In [76]:
# variables with missing values
country_na[country_na > 0]

2-alpha code                                           3
Currency Unit                                         26
Special Notes                                         96
Region                                                27
Income Group                                          27
WB-2 code                                              1
National accounts base year                           36
National accounts reference year                     209
SNA price valuation                                   44
Lending category                                      97
Other groups                                         183
System of National Accounts                           26
Alternative conversion factor                        194
PPP survey year                                       96
Balance of Payments Manual in use                     60
External debt Reporting status                       117
System of trade                                       41
Government Accounting concept  

In [16]:
#profile_EDSSC = ProfileReport(country, minimal=True)
#profile_EDSSC.to_file(output_file="./PP_reports/EDS_statscountry_df.html")

## Secondary DFs

### country_series

In [94]:
country_series = pd.read_csv('./data/EdStatsCountry-Series.csv')
country_series.shape

(613, 4)

In [18]:
country_series.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 613 entries, 0 to 612
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   CountryCode  613 non-null    object 
 1   SeriesCode   613 non-null    object 
 2   DESCRIPTION  613 non-null    object 
 3   Unnamed: 3   0 non-null      float64
dtypes: float64(1), object(3)
memory usage: 19.3+ KB


In [19]:
country_series.head()

Unnamed: 0,CountryCode,SeriesCode,DESCRIPTION,Unnamed: 3
0,ABW,SP.POP.TOTL,Data sources : United Nations World Population...,
1,ABW,SP.POP.GROW,Data sources: United Nations World Population ...,
2,AFG,SP.POP.GROW,Data sources: United Nations World Population ...,
3,AFG,NY.GDP.PCAP.PP.CD,Estimates are based on regression.,
4,AFG,SP.POP.TOTL,Data sources : United Nations World Population...,


In [21]:
#profile_EDSCS = ProfileReport(country_series, minimal=True)
#profile_EDSCS.to_file(output_file="./PP_reports/EDS_countryseries_df.html")

In [103]:
# look for missing values
country_series_na = country_series.isnull().sum()
country_series_na

CountryCode      0
SeriesCode       0
DESCRIPTION      0
Unnamed: 3     613
dtype: int64

### foot_note

In [96]:
foot_note = pd.read_csv('./data/EdStatsFootNote.csv')
foot_note.shape

(643638, 5)

In [23]:
foot_note.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 643638 entries, 0 to 643637
Data columns (total 5 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   CountryCode  643638 non-null  object 
 1   SeriesCode   643638 non-null  object 
 2   Year         643638 non-null  object 
 3   DESCRIPTION  643638 non-null  object 
 4   Unnamed: 4   0 non-null       float64
dtypes: float64(1), object(4)
memory usage: 24.6+ MB
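The `Year` column is stored as `object`, and the sample rows shown for this table use labels such as `YR2001`. A minimal sketch of a conversion to integers follows, assuming every label embeds a four-digit year; the helper name `parse_year` and the toy series are illustrative, and labels without a year become missing values instead of raising an error.

```python
import pandas as pd


def parse_year(labels: pd.Series) -> pd.Series:
    """Turn labels such as 'YR2001' into the integer 2001 (NA if no year found)."""
    digits = labels.str.extract(r"(\d{4})", expand=False)
    return pd.to_numeric(digits, errors="coerce").astype("Int64")


# Toy series standing in for foot_note['Year']; the last label is deliberately invalid.
toy_years = pd.Series(["YR2001", "YR2005", "n/a"])
years = parse_year(toy_years)
print(years)
```

The nullable `Int64` dtype is used so that unparsable labels can coexist with integer years; whether the real column contains such labels would need to be checked.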


In [24]:
foot_note.head()

Unnamed: 0,CountryCode,SeriesCode,Year,DESCRIPTION,Unnamed: 4
0,ABW,SE.PRE.ENRL.FE,YR2001,Country estimation.,
1,ABW,SE.TER.TCHR.FE,YR2005,Country estimation.,
2,ABW,SE.PRE.TCHR.FE,YR2000,Country estimation.,
3,ABW,SE.SEC.ENRL.GC,YR2004,Country estimation.,
4,ABW,SE.PRE.TCHR,YR2006,Country estimation.,


In [25]:
foot_note.SeriesCode.head(30)

0           SE.PRE.ENRL.FE
1           SE.TER.TCHR.FE
2           SE.PRE.TCHR.FE
3           SE.SEC.ENRL.GC
4              SE.PRE.TCHR
5              SE.PRE.NENR
6        SE.SEC.ENRL.VO.FE
7           SE.SEC.ENRL.GC
8           SE.PRM.TCHR.FE
9        SE.PRE.TCHR.FE.ZS
10             SE.PRE.ENRL
11          SE.PRE.NENR.FE
12    SE.SEC.ENRL.VO.FE.ZS
13          SE.SEC.TCHR.FE
14    SE.SEC.ENRL.VO.FE.ZS
15          SE.SEC.ENRL.VO
16       SE.PRE.TCHR.FE.ZS
17          SE.PRE.ENRL.FE
18          SE.PRE.NENR.MA
19          SE.SEC.TCHR.FE
20             SE.PRE.ENRL
21             SE.PRE.TCHR
22             SE.PRE.NENR
23          SE.PRM.TCHR.FE
24          SE.PRE.TCHR.FE
25             SE.PRE.ENRL
26          SE.SEC.TCHR.FE
27          SE.SEC.ENRL.GC
28          SE.PRM.TCHR.FE
29       SE.SEC.ENRL.VO.FE
Name: SeriesCode, dtype: object

In [26]:
type(foot_note.SeriesCode)

pandas.core.series.Series

In [28]:
#profile_EDSSfoot = ProfileReport(foot_note, minimal=True)
#profile_EDSSfoot.to_file(output_file="./PP_reports/EDS_statsfootnote_df.html")

In [104]:
# look for missing values
foot_note_na = foot_note.isnull().sum()
foot_note_na

CountryCode         0
SeriesCode          0
Year                0
DESCRIPTION         0
Unnamed: 4     643638
dtype: int64

### series

In [98]:
series = pd.read_csv("./data/EdStatsSeries.csv")
series.shape

(3665, 21)

In [99]:
series.head()

Unnamed: 0,Series Code,Topic,Indicator Name,Short definition,Long definition,Unit of measure,Periodicity,Base Period,Other notes,Aggregation method,...,Notes from original source,General comments,Source,Statistical concept and methodology,Development relevance,Related source links,Other web links,Related indicators,License Type,Unnamed: 20
0,BAR.NOED.1519.FE.ZS,Attainment,Barro-Lee: Percentage of female population age...,Percentage of female population age 15-19 with...,Percentage of female population age 15-19 with...,,,,,,...,,,Robert J. Barro and Jong-Wha Lee: http://www.b...,,,,,,,
1,BAR.NOED.1519.ZS,Attainment,Barro-Lee: Percentage of population age 15-19 ...,Percentage of population age 15-19 with no edu...,Percentage of population age 15-19 with no edu...,,,,,,...,,,Robert J. Barro and Jong-Wha Lee: http://www.b...,,,,,,,
2,BAR.NOED.15UP.FE.ZS,Attainment,Barro-Lee: Percentage of female population age...,Percentage of female population age 15+ with n...,Percentage of female population age 15+ with n...,,,,,,...,,,Robert J. Barro and Jong-Wha Lee: http://www.b...,,,,,,,
3,BAR.NOED.15UP.ZS,Attainment,Barro-Lee: Percentage of population age 15+ wi...,Percentage of population age 15+ with no educa...,Percentage of population age 15+ with no educa...,,,,,,...,,,Robert J. Barro and Jong-Wha Lee: http://www.b...,,,,,,,
4,BAR.NOED.2024.FE.ZS,Attainment,Barro-Lee: Percentage of female population age...,Percentage of female population age 20-24 with...,Percentage of female population age 20-24 with...,,,,,,...,,,Robert J. Barro and Jong-Wha Lee: http://www.b...,,,,,,,


In [105]:
# look for missing values
series_na = series.isnull().sum()
series_na

Series Code                               0
Topic                                     0
Indicator Name                            0
Short definition                       1509
Long definition                           0
Unit of measure                        3665
Periodicity                            3566
Base Period                            3351
Other notes                            3113
Aggregation method                     3618
Limitations and exceptions             3651
Notes from original source             3665
General comments                       3651
Source                                    0
Statistical concept and methodology    3642
Development relevance                  3662
Related source links                   3450
Other web links                        3665
Related indicators                     3665
License Type                           3665
Unnamed: 20                            3665
dtype: int64

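# Answers to the questions

## Dataset quality (does it contain many missing or duplicated values?)

- many missing values: the year columns of `data` are more than 90% empty (91.8% for 1970, 94.2% for 2085 to 2100), and the trailing `Unnamed: NN` column of each file is entirely empty
- no fully duplicated rows in `data` (`data.duplicated().sum()` returns 0)
- mistyped columns (dates stored as "object")

### Missing values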

In [107]:
data_na

Country Name           0
Country Code           0
Indicator Name         0
Indicator Code         0
1970              814642
                   ...  
2085              835494
2090              835494
2095              835494
2100              835494
Unnamed: 69       886930
Length: 70, dtype: int64

In [108]:
country_na

Country Code                                           0
Short Name                                             0
Table Name                                             0
Long Name                                              0
2-alpha code                                           3
Currency Unit                                         26
Special Notes                                         96
Region                                                27
Income Group                                          27
WB-2 code                                              1
National accounts base year                           36
National accounts reference year                     209
SNA price valuation                                   44
Lending category                                      97
Other groups                                         183
System of National Accounts                           26
Alternative conversion factor                        194
PPP survey year                

In [109]:
country_series_na

CountryCode      0
SeriesCode       0
DESCRIPTION      0
Unnamed: 3     613
dtype: int64

In [110]:
foot_note_na

CountryCode         0
SeriesCode          0
Year                0
DESCRIPTION         0
Unnamed: 4     643638
dtype: int64

In [111]:
series_na

Series Code                               0
Topic                                     0
Indicator Name                            0
Short definition                       1509
Long definition                           0
Unit of measure                        3665
Periodicity                            3566
Base Period                            3351
Other notes                            3113
Aggregation method                     3618
Limitations and exceptions             3651
Notes from original source             3665
General comments                       3651
Source                                    0
Statistical concept and methodology    3642
Development relevance                  3662
Related source links                   3450
Other web links                        3665
Related indicators                     3665
License Type                           3665
Unnamed: 20                            3665
dtype: int64

## Dataset description (number of columns? number of rows?)

In [100]:
print(f'data : {data.shape[0]} rows, {data.shape[1]} columns')
print(f'country : {country.shape[0]} rows, {country.shape[1]} columns')
print(f'country_series : {country_series.shape[0]} rows, {country_series.shape[1]} columns')
print(f'foot_note : {foot_note.shape[0]} rows, {foot_note.shape[1]} columns')
print(f'series : {series.shape[0]} rows, {series.shape[1]} columns')

data : 886930 rows, 70 columns
country : 241 rows, 32 columns
country_series : 613 rows, 4 columns
foot_note : 643638 rows, 5 columns
series : 3665 rows, 21 columns


## Information relevant to the initial problem

### Which countries have strong customer potential for our services?

Leads for key points; look for [indicators](https://datatopics.worldbank.org/education/indicators) covering:


- countries with middle / high incomes:
        country['Income Group']
        
- enrolment in secondary education (high school):
        data[data['Indicator Code'] == "SE.SEC.ENRL"]
        
- gross enrolment, tertiary (university):
        data[data['Indicator Code'] == "SE.TER.ENRR"]

- completion



- school life expectancy, primary to secondary:
        data[data['Indicator Code'] == "SE.SCH.LIFE"]


- Expenditure on education as % of total government expenditure (%):
        data[data['Indicator Code'] == "UIS.XSPENDP.23.FDPUB.FNS"]
        data[data['Indicator Code'] == "UIS.XSPENDP.56.FDPUB.FNS"]


- Government expenditure on education as % of GDP (%) (indicator code still to be identified):
        data[data['Indicator Code'] == ""]
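The first lead in the list above (middle / high income countries) can be sketched as a simple filter on `Income Group`. This is only one possible reading of "middle / high": the sketch keeps every row with a known income group other than `Low income`. The helper name `middle_high_income` is hypothetical, and the `ARB` row without an income group is an assumption made to illustrate how rows with a missing value are excluded.

```python
import pandas as pd


def middle_high_income(country: pd.DataFrame) -> pd.DataFrame:
    """Keep rows whose Income Group is known and is not 'Low income'."""
    known = country.dropna(subset=["Income Group"])
    return known[known["Income Group"] != "Low income"]


# Toy frame standing in for `country`; the ARB row is an assumed example
# of a row with no income group.
toy_country = pd.DataFrame({
    "Country Code": ["ABW", "AFG", "AGO", "ARB"],
    "Income Group": ["High income: nonOECD", "Low income", "Upper middle income", None],
})

candidates = middle_high_income(toy_country)
print(candidates["Country Code"].tolist())
```

Excluding by label rather than matching on a pattern avoids depending on the exact spelling of the high-income categories.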


#### Exploration

In [126]:
#data[data['Indicator Code'].str.contains("SE.SEC.ENRL")]
data[data['Indicator Code'] == "SE.SEC.ENRL"]

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1970,1971,1972,1973,1974,1975,...,2060,2065,2070,2075,2080,2085,2090,2095,2100,Unnamed: 69
1191,Arab World,ARB,"Enrolment in secondary education, both sexes (...",SE.SEC.ENRL,4842861.5,4981843.5,5270417.5,5593058.5,5938865.5,6437610.0,...,,,,,,,,,,
4856,East Asia & Pacific,EAS,"Enrolment in secondary education, both sexes (...",SE.SEC.ENRL,43519952.0,50267036.0,56118968.0,61798564.0,61542244.0,64871984.0,...,,,,,,,,,,
8521,East Asia & Pacific (excluding high income),EAP,"Enrolment in secondary education, both sexes (...",SE.SEC.ENRL,30872386.0,37713208.0,43312420.0,48635540.0,48004968.0,50954704.0,...,,,,,,,,,,
12186,Euro area,EMU,"Enrolment in secondary education, both sexes (...",SE.SEC.ENRL,25840982.0,26196552.0,27031316.0,28144684.0,29213564.0,29875074.0,...,,,,,,,,,,
15851,Europe & Central Asia,ECS,"Enrolment in secondary education, both sexes (...",SE.SEC.ENRL,75103024.0,75992608.0,77399336.0,78980016.0,80360736.0,81036688.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
869796,Virgin Islands (U.S.),VIR,"Enrolment in secondary education, both sexes (...",SE.SEC.ENRL,,6500.0,7100.0,9291.0,9340.0,9160.0,...,,,,,,,,,,
873461,West Bank and Gaza,PSE,"Enrolment in secondary education, both sexes (...",SE.SEC.ENRL,,,,,,,...,,,,,,,,,,
877126,"Yemen, Rep.",YEM,"Enrolment in secondary education, both sexes (...",SE.SEC.ENRL,,,,,,,...,,,,,,,,,,
880791,Zambia,ZMB,"Enrolment in secondary education, both sexes (...",SE.SEC.ENRL,56182.0,60235.0,64695.0,65494.0,70812.0,77672.0,...,,,,,,,,,,
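The wide layout above (one column per year) is awkward for plotting or grouping. A minimal sketch of reshaping one indicator into a long table (one row per country and year) is given below. The helper name `indicator_long` and the toy frame are illustrative assumptions; year columns are detected as column names made only of digits, which excludes `Unnamed: 69`.

```python
import numpy as np
import pandas as pd

ID_COLS = ["Country Name", "Country Code", "Indicator Name", "Indicator Code"]


def indicator_long(df: pd.DataFrame, code: str) -> pd.DataFrame:
    """Return the rows of one indicator as (country, Year, Value) records."""
    subset = df[df["Indicator Code"] == code]
    year_cols = [c for c in subset.columns if c.isdigit()]
    long = subset.melt(id_vars=ID_COLS, value_vars=year_cols,
                       var_name="Year", value_name="Value")
    long = long.dropna(subset=["Value"])       # keep observed values only
    long["Year"] = long["Year"].astype(int)    # '1970' -> 1970
    return long.reset_index(drop=True)


# Toy frame standing in for `data` (indicator names shortened for illustration).
toy = pd.DataFrame({
    "Country Name": ["Virgin Islands (U.S.)", "Zambia", "Zambia"],
    "Country Code": ["VIR", "ZMB", "ZMB"],
    "Indicator Name": ["Enrolment in secondary education",
                       "Enrolment in secondary education",
                       "Population, total"],
    "Indicator Code": ["SE.SEC.ENRL", "SE.SEC.ENRL", "SP.POP.TOTL"],
    "1970": [np.nan, 56182.0, np.nan],
    "1971": [6500.0, 60235.0, np.nan],
    "Unnamed: 69": [np.nan, np.nan, np.nan],
})

long = indicator_long(toy, "SE.SEC.ENRL")
print(long[["Country Code", "Year", "Value"]])
```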


### For each of these countries, how will this customer potential evolve?

- total population
        data[data['Indicator Code'] == "SP.POP.TOTL"]
- population growth
        data[data['Indicator Code'] == "SP.POP.GROW"]
        
        
- 
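
Because most year columns are sparsely filled, one hedged way to look at the evolution of an indicator is to take the last observed value per row and a relative change between two chosen years. The sketch below uses hypothetical helper names (`last_observed`, `relative_change`) and made-up numbers; it does not replace the projections contained in the dataset.

```python
import numpy as np
import pandas as pd


def last_observed(df: pd.DataFrame) -> pd.Series:
    """Most recent non-missing value across the year columns, per row."""
    year_cols = [c for c in df.columns if c.isdigit()]
    return df[year_cols].ffill(axis=1).iloc[:, -1]


def relative_change(df: pd.DataFrame, start: str, end: str) -> pd.Series:
    """Relative change between two year columns (NaN if either is missing)."""
    return (df[end] - df[start]) / df[start]


# Toy frame with made-up values; country codes are placeholders.
toy_pop = pd.DataFrame({
    "Country Code": ["AAA", "BBB"],
    "2000": [100.0, 50.0],
    "2010": [120.0, np.nan],
    "2020": [np.nan, 75.0],
})

latest = last_observed(toy_pop)
growth = relative_change(toy_pop, "2000", "2010")
print(latest.tolist())
print(growth.tolist())
```

The forward fill assumes the year columns are already in chronological order, as they are in the file shown earlier.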

### In which countries should the company operate first?

## Orders of magnitude of the standard statistical indicators for the world's geographic zones and countries (mean/median/standard deviation by country and by continent or geographic block)
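
A minimal sketch of how these statistics could be computed per geographic block: join the indicator values with the `Region` column of `country` on `Country Code`, then aggregate. The helper name `stats_by_region` and all numeric values are illustrative. The sketch drops rows without a `Region`, on the assumption (to be verified) that these are aggregates such as "Arab World" and not individual countries.

```python
import numpy as np
import pandas as pd


def stats_by_region(values: pd.DataFrame, country: pd.DataFrame, year: str) -> pd.DataFrame:
    """Mean, median and standard deviation of one year column per Region."""
    merged = values.merge(country[["Country Code", "Region"]],
                          on="Country Code", how="inner")
    # Rows without a Region are assumed to be aggregates and are excluded.
    merged = merged.dropna(subset=["Region", year])
    return merged.groupby("Region")[year].agg(["mean", "median", "std"])


# Toy frames with made-up indicator values.
toy_values = pd.DataFrame({
    "Country Code": ["ALB", "AND", "AFG", "ARB"],
    "2010": [10.0, 20.0, 30.0, 100.0],
})
toy_country = pd.DataFrame({
    "Country Code": ["ALB", "AND", "AFG", "ARB"],
    "Region": ["Europe & Central Asia", "Europe & Central Asia", "South Asia", None],
})

summary = stats_by_region(toy_values, toy_country, "2010")
print(summary)
```

The standard deviation is undefined for a region with a single observation, so it appears as NaN for that region; this will also happen on the real data wherever an indicator is reported by only one country in a block.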