# **World Bank World Development Indicators (WDI) dataset**
    
>The World Development Indicators is a compilation of relevant, high-quality, and internationally comparable statistics about global development and the fight against poverty. The database contains 1,400 time series indicators for 217 economies and more than 40 country groups, with data for many indicators going back more than 50 years.

>https://datatopics.worldbank.org/world-development-indicators/

# Q1 How to read the CSV file into a dataframe named df?

In [None]:
import pandas as pd
from pathlib import Path

df = pd.read_csv(
    Path.home()
    / "OneDrive"
    / "Rawdata"
    / "World Bank World Development Index"
    / "WDI_csv"
    / "WDIData.csv",
    na_values="..",
)


### On Windows, a full path looks like, for example: 'C:/Users/XXX/Desktop/df.csv'
### If your file is xlsx, use pd.read_excel() instead
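To see what `na_values=".."` does without needing the full WDI download, here is a minimal sketch that reads a tiny in-memory CSV. The country names, indicator name, and numbers are made-up placeholders, not WDI data; only the wide layout and the `..` missing-value marker follow the cell above.

```python
import io

import pandas as pd

# Hypothetical two-row extract in a WDI-like wide layout.
# All names and values are placeholders; ".." marks a missing observation.
csv_text = (
    "Country Name,Country Code,Indicator Name,Indicator Code,1996,2020\n"
    "Country A,AAA,Indicator X,IND.X,..,1.5\n"
    "Country B,BBB,Indicator X,IND.X,2.5,..\n"
)

# na_values=".." converts every ".." cell to NaN, so the year columns
# are parsed as floats instead of strings.
sample = pd.read_csv(io.StringIO(csv_text), na_values="..")

print(sample.dtypes)
print(sample.isna().sum())
```

Without `na_values`, the year columns in this sketch would be read as text, because `..` cannot be parsed as a number.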

In [None]:
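### Check the first or last few rows
### (a notebook cell only displays its last expression, so run these one at a time to see each output)
df
df.head()
df.tail()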

# Q2 What are the data types of the columns in the dataframe?

In [None]:
df.dtypes
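`df.dtypes` reports one dtype per column, while `type(df)` reports the class of the object itself. The sketch below shows both on a small placeholder dataframe (the column values are invented for illustration).

```python
import pandas as pd

# Placeholder dataframe with one text column and one numeric column.
toy = pd.DataFrame({"Country Name": ["Country A", "Country B"], "2020": [1.0, 2.0]})

print(type(toy))   # the class of the container
print(toy.dtypes)  # one dtype per column
toy.info()         # dtypes plus non-null counts in one summary
```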

# Q3 How many columns and rows?

In [None]:
df.shape

# Q4 What are the country names in this dataset?

In [None]:
df["Country Name"].unique()

# Q5 How many indicators and countries?

In [None]:
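### Number of unique countries (this count includes the country groups)
df["Country Name"].nunique()

### Number of unique indicators
df["Indicator Name"].nunique()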
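Both counts can also be obtained in one call, since `DataFrame.nunique()` works column by column. A minimal sketch on placeholder data follows. Keep in mind that the WDI file lists country groups alongside individual economies, so a count on `Country Name` covers both.

```python
import pandas as pd

# Placeholder long-style table: 2 countries x 2 indicators.
toy = pd.DataFrame(
    {
        "Country Name": ["Country A", "Country A", "Country B", "Country B"],
        "Indicator Name": ["Indicator X", "Indicator Y", "Indicator X", "Indicator Y"],
    }
)

n_countries = toy["Country Name"].nunique()
n_indicators = toy["Indicator Name"].nunique()

# The same counts in a single call, returned as a Series.
counts = toy[["Country Name", "Indicator Name"]].nunique()
print(counts)
```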

# Q6 How to drop and keep certain columns?

In [None]:
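### Drop the last column (an empty trailing column; its name, here "Unnamed: 67",
### depends on how many year columns your download contains, so check df.columns first)
df = df.drop(columns=["Unnamed: 67"])
df.head(3)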

In [None]:
### Keep only certain columns
df_col = df[["Country Name", "Indicator Name", "1996", "2020"]]
df_col
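Because the name of the trailing empty column can change between downloads, one option is to drop every column whose name starts with `Unnamed` rather than hard-coding the number. This is a sketch on placeholder data, not a required step.

```python
import pandas as pd

# Placeholder table with an empty trailing column, as a stray trailing
# comma in a CSV would produce.
toy = pd.DataFrame(
    {
        "Country Name": ["Country A"],
        "1996": [1.0],
        "2020": [2.0],
        "Unnamed: 3": [None],
    }
)

# Collect the auto-generated column names, then drop them in one call.
unnamed = [col for col in toy.columns if col.startswith("Unnamed")]
cleaned = toy.drop(columns=unnamed)

# Keeping columns is just selection with a list of names.
kept = cleaned[["Country Name", "2020"]]
print(kept)
```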

# Q7 How to drop and keep certain rows?

In [None]:
### Drop the first row by index
df_drop0 = df.drop(df.index[0])

In [None]:
### Keep rows based on conditions
df_US = df[df["Country Name"] == "United States"]

In [None]:
df_US_hiv = df[
    (df["Country Name"] == "United States")
    & (df["Indicator Name"] == "Young people (ages 15-24) newly infected with HIV")
]

df_US_hiv.head()
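The filters above rely on boolean masks. Each comparison must be wrapped in parentheses because `&` binds more tightly than `==` in Python. The sketch below repeats the pattern on placeholder data and adds `isin()` for selecting several values at once.

```python
import pandas as pd

# Placeholder table; names and values are invented.
toy = pd.DataFrame(
    {
        "Country Name": ["Country A", "Country A", "Country B"],
        "Indicator Name": ["Indicator X", "Indicator Y", "Indicator X"],
        "2020": [1.0, 2.0, 3.0],
    }
)

# Combine two conditions; each one sits in its own parentheses.
mask = (toy["Country Name"] == "Country A") & (toy["Indicator Name"] == "Indicator X")
subset = toy[mask]

# Keep rows whose country is in a list of values.
several = toy[toy["Country Name"].isin(["Country A", "Country B"])]

# Drop the first row by its index label.
dropped_first = toy.drop(toy.index[0])
print(subset)
```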

# Q8 How to reshape data?
## Pandas provides several methods for reshaping data, such as melt(), pivot_table(), stack(), and unstack().
### https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html

## 1) melt(). This function is used to transform or reshape data from a wide format to a long format. It essentially unpivots the DataFrame, converting columns into rows.
### https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.melt.html#pandas.DataFrame.melt

### Key parameters:
#### id_vars: A list or tuple of column names to use as identifier variables. These columns will remain as columns in the resulting DataFrame.
#### value_vars: A list or tuple of column names to unpivot. These columns will be converted into a single column in the resulting DataFrame.
#### var_name: The name to use for the column that contains the variable names (default is 'variable').
#### value_name: The name to use for the column that contains the values (default is 'value').
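The four parameters above can be seen together in a minimal sketch. The table is a placeholder with two year columns; the names `wide` and `long_df` are chosen only for this illustration.

```python
import pandas as pd

# Wide format: one column per year.
wide = pd.DataFrame(
    {
        "Country Name": ["Country A", "Country B"],
        "1996": [1.0, 3.0],
        "2020": [2.0, 4.0],
    }
)

# Long format: one row per (country, year) pair.
long_df = wide.melt(
    id_vars=["Country Name"],     # stays as an identifier column
    value_vars=["1996", "2020"],  # these columns are unpivoted
    var_name="Year",              # holds the former column names
    value_name="Value",           # holds the former cell values
)
print(long_df)
```

If `value_vars` is omitted, every column not listed in `id_vars` is unpivoted, which is what the cells in this section rely on.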

In [None]:
# df_US

In [None]:
df_hiv_melt = df_US_hiv.drop(columns="Indicator Code").melt(
    id_vars=["Country Name", "Country Code", "Indicator Name"],
    var_name="Year",  # a scalar string; recent pandas versions reject a list here
    value_name="Young people (ages 15-24) newly infected with HIV",
)

df_hiv_melt.head()

In [None]:
df_melt = df_US.drop(columns="Indicator Code").melt(
    id_vars=["Country Name", "Country Code", "Indicator Name"], var_name="Year"
)

df_melt.head()

## 2) pivot_table(). This function is used to create a pivot table from a DataFrame. It allows you to summarize and aggregate data based on one or more columns, providing insights into the relationships between different variables.
### https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html#pandas.pivot_table

### Key parameters:

#### data: The DataFrame to be used for creating the pivot table.
#### values: The column(s) to aggregate.
#### index: The column(s) to be used as the index of the resulting pivot table.
#### columns: The column(s) to be used as the columns of the resulting pivot table.
#### aggfunc: The aggregation function(s) to apply to the values. It can be a single function, a list of functions, or a dictionary mapping columns to functions.
#### fill_value: The value to replace missing values with (default is None).
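As a small illustration of these parameters, the sketch below pivots a placeholder long table so that each indicator becomes its own column. `aggfunc="mean"` is written out explicitly; it is also the default, and when every index/column combination occurs once the mean is simply that value.

```python
import pandas as pd

# Placeholder long table: one row per (country, year, indicator).
long_df = pd.DataFrame(
    {
        "Country Name": ["Country A"] * 4,
        "Year": ["1996", "1996", "2020", "2020"],
        "Indicator Name": ["Indicator X", "Indicator Y", "Indicator X", "Indicator Y"],
        "value": [1.0, 2.0, 3.0, 4.0],
    }
)

wide = long_df.pivot_table(
    values="value",                  # cells of the result
    index=["Country Name", "Year"],  # one row per country-year
    columns="Indicator Name",        # one column per indicator
    aggfunc="mean",                  # how duplicates would be combined
)
print(wide)
```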

In [None]:
df_pivottable = df_melt.pivot_table(
    values="value",
    index=["Country Name", "Country Code", "Year"],
    columns="Indicator Name",
)

df_pivottable.head()

In [None]:
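# reset_index() turns the index levels (Country Name, Country Code, Year) back into ordinary columns
# rename_axis("", axis=1) blanks the leftover "Indicator Name" label on the column axis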
WDI_US = df_pivottable.reset_index().rename_axis("", axis=1)
WDI_US.head()
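To make the effect of the two calls visible, this sketch checks the column-axis name before and after on placeholder data. It passes `None` instead of `""` to `rename_axis`, which removes the label entirely instead of replacing it with an empty string; either choice gives a flat table when displayed.

```python
import pandas as pd

# Placeholder long table, pivoted the same way as in the cells above.
long_df = pd.DataFrame(
    {
        "Country Name": ["Country A"] * 4,
        "Year": ["1996", "1996", "2020", "2020"],
        "Indicator Name": ["Indicator X", "Indicator Y", "Indicator X", "Indicator Y"],
        "value": [1.0, 2.0, 3.0, 4.0],
    }
)
pivoted = long_df.pivot_table(
    values="value", index=["Country Name", "Year"], columns="Indicator Name"
)
print(pivoted.columns.name)  # the column axis still carries a label

# Move the index levels back into columns and remove the axis label.
flat = pivoted.reset_index().rename_axis(None, axis=1)
print(flat)
```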

# Q9 How to export data to csv or xlsx?

In [None]:
WDI_US.to_csv(
    Path.home() / "OneDrive" / "2024" / "Big Data Analysis" / "WDI_US.csv", index=False
)

### For xlsx, use WDI_US.to_excel() with a path ending in .xlsx
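A quick way to check an export is to read the file back. The sketch below writes a placeholder dataframe to a temporary folder (so it does not touch your own files) and reloads it. `index=False` keeps the row index out of the file, so no extra unnamed column appears on re-import.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Placeholder dataframe with one missing value.
toy = pd.DataFrame({"Country Name": ["Country A", "Country B"], "Indicator X": [1.0, None]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "toy.csv"  # hypothetical file name
    toy.to_csv(out_path, index=False)
    round_trip = pd.read_csv(out_path)

print(round_trip)
```

Missing values are written as empty cells and come back as NaN. Writing xlsx with `to_excel()` works the same way but requires an Excel engine such as openpyxl to be installed.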

In [None]:
WDI_US.head()

# Q10 How many missing values for each variable?

In [None]:
isna_data = WDI_US.isna().sum().sort_values(ascending=True)
isna_data

In [None]:
count_data = WDI_US.count().sort_values(ascending=False)
count_data
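`isna().sum()` and `count()` are two views of the same information: for every column, missing plus non-missing equals the number of rows. A minimal sketch on placeholder data, which also adds the share of missing values via `isna().mean()`:

```python
import pandas as pd

# Placeholder table with a few gaps.
toy = pd.DataFrame(
    {
        "Indicator X": [1.0, None, 3.0],
        "Indicator Y": [None, None, 1.0],
    }
)

missing = toy.isna().sum()        # missing values per column
present = toy.count()             # non-missing values per column
share_missing = toy.isna().mean() # fraction of rows that are missing

print(pd.DataFrame({"missing": missing, "present": present, "share": share_missing}))
```

The share is often easier to compare across indicators than the raw count, since it does not depend on the number of years in the table.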