# **Part I: Data Preprocessing**

## Goal: Get familiar with the World Development Indicators dataset and reshape it into panel data.
## Panel data, also known as longitudinal data or cross-sectional time series data, combines a cross-sectional dimension and a time series dimension: the same units (here, countries) are observed over several time periods.
### What does panel data look like?

![](https://ai-studio-static-online.cdn.bcebos.com/d0129c6f110d4f49a62ed9529a6eee1906d0eb75f6464f049ab0d7573d5ab3d6)


# **World Bank World Development Indicators WDI dataset**
    
>The World Development Indicators is a compilation of relevant, high-quality, and internationally comparable statistics about global development and the fight against poverty. The database contains 1,400 time series indicators for 217 economies and more than 40 country groups, with data for many indicators going back more than 50 years.


>https://datatopics.worldbank.org/world-development-indicators/

# Q1 How to read the CSV file and name the dataframe as df?

In [1]:
import pandas as pd

df = pd.read_csv('/home/aistudio/data/data259984/WDIData.csv')

### If the file format is xlsx, please use read_excel().

df

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2014,2015,2016,2017,2018,2019,2020,2021,2022,Unnamed: 67
0,Africa Eastern and Southern,AFE,Access to clean fuels and technologies for coo...,EG.CFT.ACCS.ZS,,,,,,,...,17.392349,17.892005,18.359993,18.795151,19.295176,19.788156,20.279599,20.773627,,
1,Africa Eastern and Southern,AFE,Access to clean fuels and technologies for coo...,EG.CFT.ACCS.RU.ZS,,,,,,,...,6.720331,7.015917,7.281390,7.513673,7.809566,8.075889,8.366010,8.684137,,
2,Africa Eastern and Southern,AFE,Access to clean fuels and technologies for coo...,EG.CFT.ACCS.UR.ZS,,,,,,,...,38.184152,38.543180,38.801719,39.039014,39.323186,39.643848,39.894830,40.213891,,
3,Africa Eastern and Southern,AFE,Access to electricity (% of population),EG.ELC.ACCS.ZS,,,,,,,...,31.859257,33.903515,38.851444,40.197332,43.028332,44.389773,46.268621,48.103609,,
4,Africa Eastern and Southern,AFE,"Access to electricity, rural (% of rural popul...",EG.ELC.ACCS.RU.ZS,,,,,,,...,17.623956,16.516633,24.594474,25.389297,27.041743,29.138285,30.998687,32.772690,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
393143,Zimbabwe,ZWE,Women who believe a husband is justified in be...,SG.VAW.REFU.ZS,,,,,,,...,,14.500000,,,,,,,,
393144,Zimbabwe,ZWE,Women who were first married by age 15 (% of w...,SP.M15.2024.FE.ZS,,,,,,,...,,3.700000,,,,5.400000,,,,
393145,Zimbabwe,ZWE,Women who were first married by age 18 (% of w...,SP.M18.2024.FE.ZS,,,,,,,...,,32.400000,,,,33.700000,,,,
393146,Zimbabwe,ZWE,Women's share of population ages 15+ living wi...,SH.DYN.AIDS.FE.ZS,,,,,,,...,59.400000,59.500000,59.700000,59.900000,60.100000,60.300000,60.500000,60.700000,,


In [2]:
### Check the first or last few rows.
## Check the first 5 rows.
df.head(5)

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2014,2015,2016,2017,2018,2019,2020,2021,2022,Unnamed: 67
0,Africa Eastern and Southern,AFE,Access to clean fuels and technologies for coo...,EG.CFT.ACCS.ZS,,,,,,,...,17.392349,17.892005,18.359993,18.795151,19.295176,19.788156,20.279599,20.773627,,
1,Africa Eastern and Southern,AFE,Access to clean fuels and technologies for coo...,EG.CFT.ACCS.RU.ZS,,,,,,,...,6.720331,7.015917,7.28139,7.513673,7.809566,8.075889,8.36601,8.684137,,
2,Africa Eastern and Southern,AFE,Access to clean fuels and technologies for coo...,EG.CFT.ACCS.UR.ZS,,,,,,,...,38.184152,38.54318,38.801719,39.039014,39.323186,39.643848,39.89483,40.213891,,
3,Africa Eastern and Southern,AFE,Access to electricity (% of population),EG.ELC.ACCS.ZS,,,,,,,...,31.859257,33.903515,38.851444,40.197332,43.028332,44.389773,46.268621,48.103609,,
4,Africa Eastern and Southern,AFE,"Access to electricity, rural (% of rural popul...",EG.ELC.ACCS.RU.ZS,,,,,,,...,17.623956,16.516633,24.594474,25.389297,27.041743,29.138285,30.998687,32.77269,,


In [3]:
## Check the last 3 rows.
df.tail(3)

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2014,2015,2016,2017,2018,2019,2020,2021,2022,Unnamed: 67
393145,Zimbabwe,ZWE,Women who were first married by age 18 (% of w...,SP.M18.2024.FE.ZS,,,,,,,...,,32.4,,,,33.7,,,,
393146,Zimbabwe,ZWE,Women's share of population ages 15+ living wi...,SH.DYN.AIDS.FE.ZS,,,,,,,...,59.4,59.5,59.7,59.9,60.1,60.3,60.5,60.7,,
393147,Zimbabwe,ZWE,Young people (ages 15-24) newly infected with HIV,SH.HIV.INCD.YG,,,,,,,...,19000.0,17000.0,15000.0,13000.0,10000.0,8600.0,7700.0,6800.0,,


# Q2 What are the data types of the columns in the dataframe?

## In a Pandas DataFrame, there are several data types that you can encounter. The main types of data include:

### 1. Integer: Whole numbers, which can be of different sizes (e.g., int32, int64).
### 2. Float: Decimal numbers (e.g., float32, float64).
### 3. Object: Used for strings or mixed types. This is the default type for text data.
### 4. Boolean: Represents True or False values.
### 5. Datetime: Used for date and time data (e.g., datetime64).
### 6. Timedelta: Represents differences in time (duration).
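The six kinds of data listed above can be seen side by side in a small example. This is only a sketch: the frame `toy` and all of its values are made up for illustration and are not part of the WDI data.

```python
import pandas as pd

# Toy frame with one column per kind of data (all values are made up).
toy = pd.DataFrame(
    {
        "count": [1, 2, 3],             # integer
        "share": [0.5, 1.5, 2.5],       # float
        "name": ["a", "b", "c"],        # text (shown as object in this notebook)
        "flag": [True, False, True],    # boolean
        "date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"]),  # datetime
    }
)
# Subtracting two datetime columns gives a timedelta column.
toy["elapsed"] = toy["date"] - toy["date"].iloc[0]

print(toy.dtypes)
```

The exact dtype names printed can vary slightly between pandas versions, so the sketch does not list an expected output.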

In [4]:
df.dtypes

Country Name       object
Country Code       object
Indicator Name     object
Indicator Code     object
1960              float64
                   ...   
2019              float64
2020              float64
2021              float64
2022              float64
Unnamed: 67       float64
Length: 68, dtype: object

### We may want to drop the last column, "Unnamed: 67", later.
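One common way for an "Unnamed: N" column to appear is a trailing comma at the end of every line of the CSV file. Whether that is the cause in WDIData.csv is an assumption here. The sketch below reproduces the effect with a made-up three-line CSV held in memory.

```python
import io
import pandas as pd

# Made-up CSV text in which every line ends with a trailing comma.
raw = (
    "Country Name,2020,2021,\n"
    "A,1.0,2.0,\n"
    "B,3.0,4.0,\n"
)
toy = pd.read_csv(io.StringIO(raw))
print(toy.columns.tolist())

# Drop every column that contains no data at all.
toy_clean = toy.dropna(axis=1, how="all")
print(toy_clean.columns.tolist())
```

Note that `dropna(axis=1, how="all")` would also remove a real year column if it happened to be completely empty, so dropping the column by name is the more explicit choice.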

# Q3 How many columns and rows?

In [5]:
df.shape
### There are 393148 rows and 68 columns in this dataset.

(393148, 68)

# Q4 What are the country names in this dataset?

In [6]:
df["Country Name"].unique()

array(['Africa Eastern and Southern', 'Africa Western and Central',
       'Arab World', 'Caribbean small states',
       'Central Europe and the Baltics', 'Early-demographic dividend',
       'East Asia & Pacific',
       'East Asia & Pacific (excluding high income)',
       'East Asia & Pacific (IDA & IBRD countries)', 'Euro area',
       'Europe & Central Asia',
       'Europe & Central Asia (excluding high income)',
       'Europe & Central Asia (IDA & IBRD countries)', 'European Union',
       'Fragile and conflict affected situations',
       'Heavily indebted poor countries (HIPC)', 'High income',
       'IBRD only', 'IDA & IBRD total', 'IDA blend', 'IDA only',
       'IDA total', 'Late-demographic dividend',
       'Latin America & Caribbean',
       'Latin America & Caribbean (excluding high income)',
       'Latin America & the Caribbean (IDA & IBRD countries)',
       'Least developed countries: UN classification',
       'Low & middle income', 'Low income', 'Lower middle in

# Q5 How many indicators and countries?

In [7]:
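## Use nunique() to get the number of unique values.
df["Country Name"].nunique()
### There are 266 unique entries in Country Name. This is more than the 217 economies because the column also includes regional and income-group aggregates such as 'Arab World' and 'High income'.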

266

In [8]:
df["Indicator Name"].nunique()
### WDI dataset provides 1478 indicators.

1478
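The two counts are consistent with the shape found in Q3: 266 × 1478 = 393148, so every entry in Country Name has one row per indicator. A quick way to check this kind of structure is `value_counts()`, sketched here on a made-up frame (`toy` is illustrative only).

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Country Name": ["A", "A", "B", "B", "C"],
        "Indicator Name": ["x", "y", "x", "y", "x"],
    }
)

# Number of distinct countries.
n_countries = toy["Country Name"].nunique()

# Number of rows for each country; unequal counts reveal missing indicator rows.
rows_per_country = toy["Country Name"].value_counts()

print(n_countries)
print(rows_per_country)
```

In this toy frame, country "C" has only one row, which `value_counts()` makes visible immediately.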

# Q6 How to drop and keep certain columns?

In [9]:
df.head()

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2014,2015,2016,2017,2018,2019,2020,2021,2022,Unnamed: 67
0,Africa Eastern and Southern,AFE,Access to clean fuels and technologies for coo...,EG.CFT.ACCS.ZS,,,,,,,...,17.392349,17.892005,18.359993,18.795151,19.295176,19.788156,20.279599,20.773627,,
1,Africa Eastern and Southern,AFE,Access to clean fuels and technologies for coo...,EG.CFT.ACCS.RU.ZS,,,,,,,...,6.720331,7.015917,7.28139,7.513673,7.809566,8.075889,8.36601,8.684137,,
2,Africa Eastern and Southern,AFE,Access to clean fuels and technologies for coo...,EG.CFT.ACCS.UR.ZS,,,,,,,...,38.184152,38.54318,38.801719,39.039014,39.323186,39.643848,39.89483,40.213891,,
3,Africa Eastern and Southern,AFE,Access to electricity (% of population),EG.ELC.ACCS.ZS,,,,,,,...,31.859257,33.903515,38.851444,40.197332,43.028332,44.389773,46.268621,48.103609,,
4,Africa Eastern and Southern,AFE,"Access to electricity, rural (% of rural popul...",EG.ELC.ACCS.RU.ZS,,,,,,,...,17.623956,16.516633,24.594474,25.389297,27.041743,29.138285,30.998687,32.77269,,


In [10]:
### Drop the last column
df_drop = df.drop(columns=["Unnamed: 67"])
df_drop.head(3)

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022
0,Africa Eastern and Southern,AFE,Access to clean fuels and technologies for coo...,EG.CFT.ACCS.ZS,,,,,,,...,16.914625,17.392349,17.892005,18.359993,18.795151,19.295176,19.788156,20.279599,20.773627,
1,Africa Eastern and Southern,AFE,Access to clean fuels and technologies for coo...,EG.CFT.ACCS.RU.ZS,,,,,,,...,6.473301,6.720331,7.015917,7.28139,7.513673,7.809566,8.075889,8.36601,8.684137,
2,Africa Eastern and Southern,AFE,Access to clean fuels and technologies for coo...,EG.CFT.ACCS.UR.ZS,,,,,,,...,37.870347,38.184152,38.54318,38.801719,39.039014,39.323186,39.643848,39.89483,40.213891,


In [11]:
## Keep certain columns
### For instance, we want to keep the following three columns.
df_col = df[["Country Name", "Indicator Name", "2020"]]
df_col.head()

Unnamed: 0,Country Name,Indicator Name,2020
0,Africa Eastern and Southern,Access to clean fuels and technologies for coo...,20.279599
1,Africa Eastern and Southern,Access to clean fuels and technologies for coo...,8.36601
2,Africa Eastern and Southern,Access to clean fuels and technologies for coo...,39.89483
3,Africa Eastern and Southern,Access to electricity (% of population),46.268621
4,Africa Eastern and Southern,"Access to electricity, rural (% of rural popul...",30.998687
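Two variations on dropping and keeping columns may be useful. The sketch below uses a made-up frame; the column name "not a column" is deliberately absent, to show what `errors="ignore"` does.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Country Name": ["A", "B"],
        "Indicator Name": ["x", "x"],
        "2019": [1.0, 2.0],
        "2020": [3.0, 4.0],
        "Unnamed: 4": [None, None],
    }
)

# errors="ignore" skips labels that are not present instead of raising a KeyError.
toy_drop = toy.drop(columns=["Unnamed: 4", "not a column"], errors="ignore")

# Keep the identifier columns plus every column whose name is a four-digit year.
year_cols = [c for c in toy.columns if c.isdigit() and len(c) == 4]
toy_keep = toy[["Country Name", "Indicator Name"] + year_cols]

print(toy_drop.columns.tolist())
print(toy_keep.columns.tolist())
```

Building the list of year columns programmatically avoids typing out every year by hand.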


# Q7 How to drop and keep certain rows?

In [12]:
## Delete specific rows based on the index
### In Pandas, an index holds the row labels of a DataFrame or Series. Labels are not required to be unique, but the default index (0, 1, 2, ...) is.
### In this dataset, the index ranges from 0 to 393147. We can drop the first row by specifying index[0].
df_drop0 = df.drop(df.index[0])
df_drop0

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2014,2015,2016,2017,2018,2019,2020,2021,2022,Unnamed: 67
1,Africa Eastern and Southern,AFE,Access to clean fuels and technologies for coo...,EG.CFT.ACCS.RU.ZS,,,,,,,...,6.720331,7.015917,7.281390,7.513673,7.809566,8.075889,8.366010,8.684137,,
2,Africa Eastern and Southern,AFE,Access to clean fuels and technologies for coo...,EG.CFT.ACCS.UR.ZS,,,,,,,...,38.184152,38.543180,38.801719,39.039014,39.323186,39.643848,39.894830,40.213891,,
3,Africa Eastern and Southern,AFE,Access to electricity (% of population),EG.ELC.ACCS.ZS,,,,,,,...,31.859257,33.903515,38.851444,40.197332,43.028332,44.389773,46.268621,48.103609,,
4,Africa Eastern and Southern,AFE,"Access to electricity, rural (% of rural popul...",EG.ELC.ACCS.RU.ZS,,,,,,,...,17.623956,16.516633,24.594474,25.389297,27.041743,29.138285,30.998687,32.772690,,
5,Africa Eastern and Southern,AFE,"Access to electricity, urban (% of urban popul...",EG.ELC.ACCS.UR.ZS,,,,,,,...,65.998898,67.022332,68.907404,70.663096,71.565376,72.611685,74.129923,75.559174,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
393143,Zimbabwe,ZWE,Women who believe a husband is justified in be...,SG.VAW.REFU.ZS,,,,,,,...,,14.500000,,,,,,,,
393144,Zimbabwe,ZWE,Women who were first married by age 15 (% of w...,SP.M15.2024.FE.ZS,,,,,,,...,,3.700000,,,,5.400000,,,,
393145,Zimbabwe,ZWE,Women who were first married by age 18 (% of w...,SP.M18.2024.FE.ZS,,,,,,,...,,32.400000,,,,33.700000,,,,
393146,Zimbabwe,ZWE,Women's share of population ages 15+ living wi...,SH.DYN.AIDS.FE.ZS,,,,,,,...,59.400000,59.500000,59.700000,59.900000,60.100000,60.300000,60.500000,60.700000,,


In [13]:
## Keep specific rows based on certain conditions
### Choose the country named United States
df_US = df[df["Country Name"] == "United States"]

df_US

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2014,2015,2016,2017,2018,2019,2020,2021,2022,Unnamed: 67
376890,United States,USA,Access to clean fuels and technologies for coo...,EG.CFT.ACCS.ZS,,,,,,,...,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,,
376891,United States,USA,Access to clean fuels and technologies for coo...,EG.CFT.ACCS.RU.ZS,,,,,,,...,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,,
376892,United States,USA,Access to clean fuels and technologies for coo...,EG.CFT.ACCS.UR.ZS,,,,,,,...,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,,
376893,United States,USA,Access to electricity (% of population),EG.ELC.ACCS.ZS,,,,,,,...,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,,
376894,United States,USA,"Access to electricity, rural (% of rural popul...",EG.ELC.ACCS.RU.ZS,,,,,,,...,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
378363,United States,USA,Women who believe a husband is justified in be...,SG.VAW.REFU.ZS,,,,,,,...,,,,,,,,,,
378364,United States,USA,Women who were first married by age 15 (% of w...,SP.M15.2024.FE.ZS,,,,,,,...,,,,,,,,,,
378365,United States,USA,Women who were first married by age 18 (% of w...,SP.M18.2024.FE.ZS,,,,,,,...,,,,,,,,,,
378366,United States,USA,Women's share of population ages 15+ living wi...,SH.DYN.AIDS.FE.ZS,,,,,,,...,22.8,22.7,22.5,22.4,22.3,22.3,22.2,22.1,,


In [14]:
## How about two conditions? Use "&" to add more conditions.
df_US_gdp = df[
    (df["Country Name"] == "United States")
    & (df["Indicator Name"] == "GDP per capita (current US$)")
]

df_US_gdp

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2014,2015,2016,2017,2018,2019,2020,2021,2022,Unnamed: 67
377370,United States,USA,GDP per capita (current US$),NY.GDP.PCAP.CD,3007.123445,3066.562869,3243.843078,3374.515171,3573.941185,3827.52711,...,55123.849787,56762.729452,57866.744934,59907.754261,62823.309438,65120.394663,63528.634303,70219.472454,76398.591742,
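Besides `==` and `&`, a few other building blocks for row conditions come up often: `isin()` for matching several values, `|` for "or", and `~` for negation. The sketch below shows them on a made-up frame.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Country Name": ["A", "A", "B", "C"],
        "Indicator Name": ["x", "y", "x", "x"],
        "2020": [1.0, 2.0, 3.0, 4.0],
    }
)

# isin() keeps rows whose value matches any item in the list.
toy_ab = toy[toy["Country Name"].isin(["A", "B"])]

# "|" means "or"; each condition needs its own parentheses.
toy_or = toy[(toy["Country Name"] == "C") | (toy["Indicator Name"] == "y")]

# "~" negates a condition.
toy_not_a = toy[~(toy["Country Name"] == "A")]

print(toy_ab)
print(toy_or)
print(toy_not_a)
```

The parentheses around each condition matter because `&` and `|` bind more tightly than `==` in Python.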


# Q8 How to reshape data?
## Pandas provides multiple methods for reshaping data, such as melt(), pivot_table(), stack(), and unstack().
### https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html

## 1) melt(). This function is used to transform or reshape data from a wide format to a long format. It essentially unpivots the DataFrame, converting columns into rows.
### https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.melt.html#pandas.DataFrame.melt

### Key parameters:
#### id_vars: A list or tuple of column names to use as identifier variables. These columns will remain as columns in the resulting DataFrame.
#### value_vars: A list or tuple of column names to unpivot. These columns will be converted into a single column in the resulting DataFrame.
#### var_name: The name to use for the column that contains the variable names (default is 'variable').
#### value_name: The name to use for the column that contains the values (default is 'value').
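Before applying melt() to the WDI data, it may help to see these parameters on a tiny made-up frame. In this sketch `value_vars` is left out, in which case every column not listed in `id_vars` is unpivoted.

```python
import pandas as pd

# Wide format: one column per year (values are made up).
wide = pd.DataFrame(
    {
        "Country Name": ["A", "B"],
        "2019": [1.0, 2.0],
        "2020": [3.0, 4.0],
    }
)

# Long format: the year columns become rows.
long = wide.melt(id_vars=["Country Name"], var_name="Year", value_name="value")
print(long)
```

The two year columns turn into four rows, one for each combination of country and year, and the former column names end up as strings in the new `Year` column.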

In [15]:
df_gdp_melt = (
    df_US_gdp.drop(columns=["Unnamed: 67"])
    .melt(
        id_vars=["Country Name", "Country Code", "Indicator Name", "Indicator Code"],
        value_name="GDP per capita (current US$)",
        var_name="Year",
    )
    .drop(columns=["Country Code", "Indicator Code"])
)
df_gdp_melt

### After melt(), df_gdp_melt is in long (panel) format: one row per Country Name and Year, with the GDP value in its own column.

Unnamed: 0,Country Name,Indicator Name,Year,GDP per capita (current US$)
0,United States,GDP per capita (current US$),1960,3007.123445
1,United States,GDP per capita (current US$),1961,3066.562869
2,United States,GDP per capita (current US$),1962,3243.843078
3,United States,GDP per capita (current US$),1963,3374.515171
4,United States,GDP per capita (current US$),1964,3573.941185
...,...,...,...,...
58,United States,GDP per capita (current US$),2018,62823.309438
59,United States,GDP per capita (current US$),2019,65120.394663
60,United States,GDP per capita (current US$),2020,63528.634303
61,United States,GDP per capita (current US$),2021,70219.472454


### In the last example, we used df_US_gdp, which contains only GDP per capita (current US$) for the United States. It could become more complicated if we add more variables. For instance, we can use df_US, which contains all the variables provided by the WDI.

In [16]:
df_US

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2014,2015,2016,2017,2018,2019,2020,2021,2022,Unnamed: 67
376890,United States,USA,Access to clean fuels and technologies for coo...,EG.CFT.ACCS.ZS,,,,,,,...,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,,
376891,United States,USA,Access to clean fuels and technologies for coo...,EG.CFT.ACCS.RU.ZS,,,,,,,...,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,,
376892,United States,USA,Access to clean fuels and technologies for coo...,EG.CFT.ACCS.UR.ZS,,,,,,,...,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,,
376893,United States,USA,Access to electricity (% of population),EG.ELC.ACCS.ZS,,,,,,,...,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,,
376894,United States,USA,"Access to electricity, rural (% of rural popul...",EG.ELC.ACCS.RU.ZS,,,,,,,...,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
378363,United States,USA,Women who believe a husband is justified in be...,SG.VAW.REFU.ZS,,,,,,,...,,,,,,,,,,
378364,United States,USA,Women who were first married by age 15 (% of w...,SP.M15.2024.FE.ZS,,,,,,,...,,,,,,,,,,
378365,United States,USA,Women who were first married by age 18 (% of w...,SP.M18.2024.FE.ZS,,,,,,,...,,,,,,,,,,
378366,United States,USA,Women's share of population ages 15+ living wi...,SH.DYN.AIDS.FE.ZS,,,,,,,...,22.8,22.7,22.5,22.4,22.3,22.3,22.2,22.1,,


In [17]:
df_melt = (
    df_US.drop(columns=["Unnamed: 67"])
    .melt(
        id_vars=["Country Name", "Country Code", "Indicator Name", "Indicator Code"],
        var_name="Year",
    )
    .drop(columns=["Country Code","Indicator Code"])
)

### df_melt is not yet the panel data we want: the indicators are stacked in rows, but we want each Indicator Name as its own column.
df_melt

Unnamed: 0,Country Name,Indicator Name,Year,value
0,United States,Access to clean fuels and technologies for coo...,1960,
1,United States,Access to clean fuels and technologies for coo...,1960,
2,United States,Access to clean fuels and technologies for coo...,1960,
3,United States,Access to electricity (% of population),1960,
4,United States,"Access to electricity, rural (% of rural popul...",1960,
...,...,...,...,...
93109,United States,Women who believe a husband is justified in be...,2022,
93110,United States,Women who were first married by age 15 (% of w...,2022,
93111,United States,Women who were first married by age 18 (% of w...,2022,
93112,United States,Women's share of population ages 15+ living wi...,2022,


## 2) pivot_table(). This function is used to create a pivot table from a DataFrame. It allows you to summarize and aggregate data based on one or more columns, providing insights into the relationships between different variables.
### https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html#pandas.pivot_table

### Key parameters:

#### data: The DataFrame to be used for creating the pivot table.
#### values: The column(s) to aggregate.
#### index: The column(s) to be used as the index of the resulting pivot table.
#### columns: The column(s) to be used as the columns of the resulting pivot table.
#### aggfunc: The aggregation function(s) to apply to the values (default is 'mean'). It can be a single function, a list of functions, or a dictionary mapping columns to functions.
#### fill_value: The value to replace missing values with (default is None).
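One parameter not listed above, `dropna` (default `True`), is worth knowing about: with the default, columns whose entries are all NaN are left out of the result. The sketch below shows the difference on a made-up long frame in which indicator "y" has no data. This may be part of why the missing-value count later in this section lists 1165 entries rather than one per indicator, although that is an assumption.

```python
import pandas as pd

long = pd.DataFrame(
    {
        "Country Name": ["A", "A", "A", "A"],
        "Year": ["2019", "2019", "2020", "2020"],
        "Indicator Name": ["x", "y", "x", "y"],
        "value": [1.0, None, 3.0, None],
    }
)

# Defaults: aggfunc="mean" and dropna=True, so the all-NaN indicator "y" is left out.
wide = long.pivot_table(
    values="value", index=["Country Name", "Year"], columns="Indicator Name"
)

# dropna=False keeps "y" as an all-NaN column.
wide_all = long.pivot_table(
    values="value",
    index=["Country Name", "Year"],
    columns="Indicator Name",
    dropna=False,
)

print(wide.columns.tolist())
print(wide_all.columns.tolist())
```

Because each combination of country, year and indicator appears only once here, the default mean simply returns the single value.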

In [18]:
df_pivottable = df_melt.pivot_table(
    values="value",
    index=["Country Name", "Year"],
    columns="Indicator Name",
)

df_pivottable.head()

### Now each indicator is listed as its own column. However, the table looks strange because Country Name and Year form a multilevel index instead of ordinary columns.

Unnamed: 0_level_0,Indicator Name,Access to clean fuels and technologies for cooking (% of population),"Access to clean fuels and technologies for cooking, rural (% of rural population)","Access to clean fuels and technologies for cooking, urban (% of urban population)",Access to electricity (% of population),"Access to electricity, rural (% of rural population)","Access to electricity, urban (% of urban population)",Account ownership at a financial institution or with a mobile-money-service provider (% of population ages 15+),"Account ownership at a financial institution or with a mobile-money-service provider, female (% of population ages 15+)","Account ownership at a financial institution or with a mobile-money-service provider, male (% of population ages 15+)","Account ownership at a financial institution or with a mobile-money-service provider, older adults (% of population ages 25+)",...,"Vulnerable employment, female (% of female employment) (modeled ILO estimate)","Vulnerable employment, male (% of male employment) (modeled ILO estimate)","Vulnerable employment, total (% of total employment) (modeled ILO estimate)","Wage and salaried workers, female (% of female employment) (modeled ILO estimate)","Wage and salaried workers, male (% of male employment) (modeled ILO estimate)","Wage and salaried workers, total (% of total employment) (modeled ILO estimate)","Water productivity, total (constant 2015 US$ GDP per cubic meter of total freshwater withdrawal)",Women Business and the Law Index Score (scale 1-100),Women's share of population ages 15+ living with HIV (%),Young people (ages 15-24) newly infected with HIV
Country Name,Year,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
United States,1960,,,,,,,,,,,...,,,,,,,,,,
United States,1961,,,,,,,,,,,...,,,,,,,,,,
United States,1962,,,,,,,,,,,...,,,,,,,,,,
United States,1963,,,,,,,,,,,...,,,,,,,,,,
United States,1964,,,,,,,,,,,...,,,,,,,,,,


In [27]:
# How do we get rid of the multilevel index after using pivot_table()?

# reset_index() turns the index levels (Country Name, Year) back into ordinary columns.
WDI_US_0 = df_pivottable.reset_index()

WDI_US_0

Indicator Name,Country Name,Year,Access to clean fuels and technologies for cooking (% of population),"Access to clean fuels and technologies for cooking, rural (% of rural population)","Access to clean fuels and technologies for cooking, urban (% of urban population)",Access to electricity (% of population),"Access to electricity, rural (% of rural population)","Access to electricity, urban (% of urban population)",Account ownership at a financial institution or with a mobile-money-service provider (% of population ages 15+),"Account ownership at a financial institution or with a mobile-money-service provider, female (% of population ages 15+)",...,"Vulnerable employment, female (% of female employment) (modeled ILO estimate)","Vulnerable employment, male (% of male employment) (modeled ILO estimate)","Vulnerable employment, total (% of total employment) (modeled ILO estimate)","Wage and salaried workers, female (% of female employment) (modeled ILO estimate)","Wage and salaried workers, male (% of male employment) (modeled ILO estimate)","Wage and salaried workers, total (% of total employment) (modeled ILO estimate)","Water productivity, total (constant 2015 US$ GDP per cubic meter of total freshwater withdrawal)",Women Business and the Law Index Score (scale 1-100),Women's share of population ages 15+ living with HIV (%),Young people (ages 15-24) newly infected with HIV
0,United States,1960,,,,,,,,,...,,,,,,,,,,
1,United States,1961,,,,,,,,,...,,,,,,,,,,
2,United States,1962,,,,,,,,,...,,,,,,,,,,
3,United States,1963,,,,,,,,,...,,,,,,,,,,
4,United States,1964,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
58,United States,2018,100.0,100.0,100.0,100.0,100.0,100.0,,,...,3.536253,4.583925,4.099700,94.82776,92.76611,93.71900,43.839207,91.25,22.3,6400.0
59,United States,2019,100.0,100.0,100.0,100.0,100.0,100.0,,,...,3.474749,4.418259,3.981149,94.91877,93.03202,93.90617,44.845071,91.25,22.3,6100.0
60,United States,2020,100.0,100.0,100.0,100.0,100.0,100.0,,,...,3.735224,4.354936,4.068956,94.53311,92.97375,93.69337,43.603848,91.25,22.2,
61,United States,2021,100.0,100.0,100.0,100.0,100.0,100.0,94.95,96.79,...,3.895916,4.605415,4.276413,94.29491,92.61697,93.39504,,91.25,22.1,


In [28]:
# After reset_index(), the column axis still carries the name 'Indicator Name', which shows up as a stray label above the row index. rename_axis() replaces that name with an empty string.
WDI_US = WDI_US_0.rename_axis("", axis=1)

WDI_US.head()

# Now, WDI_US is the panel data we wanted!

Unnamed: 0,Country Name,Year,Access to clean fuels and technologies for cooking (% of population),"Access to clean fuels and technologies for cooking, rural (% of rural population)","Access to clean fuels and technologies for cooking, urban (% of urban population)",Access to electricity (% of population),"Access to electricity, rural (% of rural population)","Access to electricity, urban (% of urban population)",Account ownership at a financial institution or with a mobile-money-service provider (% of population ages 15+),"Account ownership at a financial institution or with a mobile-money-service provider, female (% of population ages 15+)",...,"Vulnerable employment, female (% of female employment) (modeled ILO estimate)","Vulnerable employment, male (% of male employment) (modeled ILO estimate)","Vulnerable employment, total (% of total employment) (modeled ILO estimate)","Wage and salaried workers, female (% of female employment) (modeled ILO estimate)","Wage and salaried workers, male (% of male employment) (modeled ILO estimate)","Wage and salaried workers, total (% of total employment) (modeled ILO estimate)","Water productivity, total (constant 2015 US$ GDP per cubic meter of total freshwater withdrawal)",Women Business and the Law Index Score (scale 1-100),Women's share of population ages 15+ living with HIV (%),Young people (ages 15-24) newly infected with HIV
0,United States,1960,,,,,,,,,...,,,,,,,,,,
1,United States,1961,,,,,,,,,...,,,,,,,,,,
2,United States,1962,,,,,,,,,...,,,,,,,,,,
3,United States,1963,,,,,,,,,...,,,,,,,,,,
4,United States,1964,,,,,,,,,...,,,,,,,,,,
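The two clean-up steps can also be chained. The sketch below repeats them on a made-up frame and passes `None` instead of an empty string to `rename_axis`, which removes the column-axis name entirely. This is an alternative to the call used above, not a correction of it.

```python
import pandas as pd

long = pd.DataFrame(
    {
        "Country Name": ["A", "A", "A", "A"],
        "Year": ["2019", "2019", "2020", "2020"],
        "Indicator Name": ["x", "y", "x", "y"],
        "value": [1.0, 2.0, 3.0, 4.0],
    }
)
wide = long.pivot_table(
    values="value", index=["Country Name", "Year"], columns="Indicator Name"
)
# The column axis is named after the column that was pivoted.
print(wide.columns.name)

# reset_index() moves the index levels into columns;
# rename_axis(None, axis=1) clears the column-axis name.
flat = wide.reset_index().rename_axis(None, axis=1)
print(flat.columns.name)
print(flat.columns.tolist())
```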


In [20]:
### Recheck the data types: the column 'Year' is of type object because the year labels came from column names, which are strings.
### Strings cannot be used in arithmetic, and they sort alphabetically rather than numerically.
WDI_US.dtypes


Country Name                                                                                         object
Year                                                                                                 object
Access to clean fuels and technologies for cooking (% of population)                                float64
Access to clean fuels and technologies for cooking, rural (% of rural population)                   float64
Access to clean fuels and technologies for cooking, urban (% of urban population)                   float64
                                                                                                     ...   
Wage and salaried workers, total (% of total employment) (modeled ILO estimate)                     float64
Water productivity, total (constant 2015 US$ GDP per cubic meter of total freshwater withdrawal)    float64
Women Business and the Law Index Score (scale 1-100)                                                float64
Women's share of population

In [29]:
### So, the 'Year' column needs to be converted to an integer.
WDI_US["Year"] = WDI_US['Year'].astype(int)

WDI_US.dtypes


Country Name                                                                                         object
Year                                                                                                  int64
Access to clean fuels and technologies for cooking (% of population)                                float64
Access to clean fuels and technologies for cooking, rural (% of rural population)                   float64
Access to clean fuels and technologies for cooking, urban (% of urban population)                   float64
                                                                                                     ...   
Wage and salaried workers, total (% of total employment) (modeled ILO estimate)                     float64
Water productivity, total (constant 2015 US$ GDP per cubic meter of total freshwater withdrawal)    float64
Women Business and the Law Index Score (scale 1-100)                                                float64
Women's share of population
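`astype(int)` works here because every year label is a clean integer string. If a column might contain entries that are not numbers, `pd.to_numeric` with `errors="coerce"` is a more forgiving option. The sketch below uses made-up Series to show both, along with the difference between sorting text and sorting numbers.

```python
import pandas as pd

years = pd.Series(["2019", "2020", "2021"], name="Year")

# astype(int) works when every entry is a clean integer string.
years_int = years.astype(int)

# errors="coerce" turns entries that cannot be parsed into NaN instead of raising.
messy = pd.Series(["2019", "n/a", "2021"], name="Year")
messy_num = pd.to_numeric(messy, errors="coerce")

# Strings sort character by character; numbers sort by value.
as_text = sorted(["9", "10", "2"])
as_numbers = sorted([9, 10, 2])

print(years_int.tolist())
print(messy_num.tolist())
print(as_text, as_numbers)
```

Because NaN is a float, the coerced Series is stored as floats rather than integers.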

# Q9 How to export data to csv?

In [33]:
WDI_US.to_csv(
    '/home/aistudio/data/WDI_US.csv', index=False
)

### By setting index=False, we drop the index from the CSV file. You can set it to True if you want to keep the index.
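To see what the `index` argument changes, the sketch below writes a made-up frame to in-memory buffers instead of files, so nothing is saved to disk.

```python
import io
import pandas as pd

toy = pd.DataFrame({"Country Name": ["A", "B"], "Year": [2019, 2020]})

# Write to in-memory buffers so the sketch has no side effects.
with_index = io.StringIO()
toy.to_csv(with_index)                  # index=True is the default
without_index = io.StringIO()
toy.to_csv(without_index, index=False)

print(with_index.getvalue())
print(without_index.getvalue())

# Reading the index=True text back gives an extra, unnamed first column.
reread = pd.read_csv(io.StringIO(with_index.getvalue()))
print(reread.columns.tolist())
```

With the default index (0, 1, 2, ...), the saved index carries no information, which is why `index=False` is usually the cleaner choice.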

# Q10 How many missing values for each variable?
### isna(): This method checks each element in the WDI_US DataFrame for missing values (NaN). It returns a DataFrame of the same shape, where each element is either True (if the original value is NaN) or False (if the original value is not NaN).
### sum(): After checking for missing values, this method aggregates the results. By default, it sums along the columns (i.e., it counts the number of True values for each column). The result is a Series where the index is the column names of WDI_US, and the values are the counts of missing values (NaN) in each column.
### sort_values(ascending=True): This sorts the resulting Series by the count of missing values in ascending order. This means that columns with fewer missing values will appear first, and columns with more missing values will appear later.
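The three steps described above can be tried on a small made-up frame first. The sketch also adds `isna().mean()`, which gives the share of missing values per column instead of the count.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Year": [2019, 2020, 2021],
        "x": [1.0, None, None],
        "y": [None, 2.0, 3.0],
    }
)

# Count of missing values per column, smallest first.
missing_count = toy.isna().sum().sort_values(ascending=True)

# Fraction of rows that are missing in each column (the mean of True/False values).
missing_share = toy.isna().mean()

print(missing_count)
print(missing_share)
```

The share is often easier to compare across datasets with different numbers of rows.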

In [30]:
WDI_US.isna()

Unnamed: 0,Country Name,Year,Access to clean fuels and technologies for cooking (% of population),"Access to clean fuels and technologies for cooking, rural (% of rural population)","Access to clean fuels and technologies for cooking, urban (% of urban population)",Access to electricity (% of population),"Access to electricity, rural (% of rural population)","Access to electricity, urban (% of urban population)",Account ownership at a financial institution or with a mobile-money-service provider (% of population ages 15+),"Account ownership at a financial institution or with a mobile-money-service provider, female (% of population ages 15+)",...,"Vulnerable employment, female (% of female employment) (modeled ILO estimate)","Vulnerable employment, male (% of male employment) (modeled ILO estimate)","Vulnerable employment, total (% of total employment) (modeled ILO estimate)","Wage and salaried workers, female (% of female employment) (modeled ILO estimate)","Wage and salaried workers, male (% of male employment) (modeled ILO estimate)","Wage and salaried workers, total (% of total employment) (modeled ILO estimate)","Water productivity, total (constant 2015 US$ GDP per cubic meter of total freshwater withdrawal)",Women Business and the Law Index Score (scale 1-100),Women's share of population ages 15+ living with HIV (%),Young people (ages 15-24) newly infected with HIV
0,False,False,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
1,False,False,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
2,False,False,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
3,False,False,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
4,False,False,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
58,False,False,False,False,False,False,False,False,True,True,...,False,False,False,False,False,False,False,False,False,False
59,False,False,False,False,False,False,False,False,True,True,...,False,False,False,False,False,False,False,False,False,False
60,False,False,False,False,False,False,False,False,True,True,...,False,False,False,False,False,False,False,False,False,True
61,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,False,False,True


In [31]:
WDI_US.isna().sum()


Country Name                                                                                         0
Year                                                                                                 0
Access to clean fuels and technologies for cooking (% of population)                                41
Access to clean fuels and technologies for cooking, rural (% of rural population)                   41
Access to clean fuels and technologies for cooking, urban (% of urban population)                   41
                                                                                                    ..
Wage and salaried workers, total (% of total employment) (modeled ILO estimate)                     32
Water productivity, total (constant 2015 US$ GDP per cubic meter of total freshwater withdrawal)    22
Women Business and the Law Index Score (scale 1-100)                                                10
Women's share of population ages 15+ living with HIV (%)                

In [32]:
isna_data = WDI_US.isna().sum().sort_values(ascending=True)
isna_data


Population ages 70-74, male (% of male population)                           0
Population ages 65 and above, total                                          0
Population ages 65-69, female (% of female population)                       0
Population ages 65-69, male (% of male population)                           0
Population ages 70-74, female (% of female population)                       0
                                                                            ..
Trained teachers in secondary education (% of total teachers)               62
Number of surgical procedures (per 100,000 population)                      62
Trained teachers in upper secondary education, male (% of male teachers)    62
Trained teachers in primary education, male (% of male teachers)            62
Trained teachers in primary education, female (% of female teachers)        62
Length: 1165, dtype: int64

# We now have the data for the United States with the variables provided by the WDI dataset, reshaped into panel format, along with the number of missing values for each variable.
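The steps in this part can be collected into one function so that they can be repeated for another country. This is a sketch under assumptions: `reshape_country` and `ID_COLS` are hypothetical names introduced here, and the input frame is a made-up miniature with the same layout as the WDI file.

```python
import pandas as pd

# Identifier columns of the WDI layout used in this notebook.
ID_COLS = ["Country Name", "Country Code", "Indicator Name", "Indicator Code"]


def reshape_country(df, country):
    """Hypothetical helper: wide WDI-style rows for one country -> panel data."""
    # Keep only columns whose names are years; this also leaves out "Unnamed: N".
    year_cols = [c for c in df.columns if c.isdigit()]
    subset = df.loc[df["Country Name"] == country, ID_COLS + year_cols]

    # Wide to long: one row per indicator and year.
    long = subset.melt(id_vars=ID_COLS, var_name="Year")

    # Long to panel: one row per country and year, one column per indicator.
    panel = (
        long.pivot_table(
            values="value", index=["Country Name", "Year"], columns="Indicator Name"
        )
        .reset_index()
        .rename_axis(None, axis=1)
    )
    panel["Year"] = panel["Year"].astype(int)
    return panel


# Made-up miniature of the WDI layout.
toy = pd.DataFrame(
    {
        "Country Name": ["A", "A", "B"],
        "Country Code": ["AAA", "AAA", "BBB"],
        "Indicator Name": ["x", "y", "x"],
        "Indicator Code": ["X", "Y", "X"],
        "1960": [1.0, 2.0, 9.0],
        "1961": [3.0, None, 9.0],
        "Unnamed: 6": [None, None, None],
    }
)
panel_a = reshape_country(toy, "A")
print(panel_a)
```

With the real data, a call such as `reshape_country(df, "United States")` would be expected to follow the same steps as the cells above, though this has not been run against the full file.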