A data science project analyzing global economic performance using the Penn World Table dataset.
The project explores GDP per capita trends, country clustering, statistical testing, and future predictions using machine learning.
This project aims to:
- Analyze global GDP per capita trends (2000–2018)
- Identify economic groupings of countries using clustering
- Compare developing vs developed economies statistically
- Predict future GDP per capita trends using regression
- Visualize global economic patterns
Source: Penn World Table (PWT 10.01)
The dataset contains macroeconomic indicators for countries worldwide.
Key variables used:
| Variable | Description |
|---|---|
| `country` | Country name |
| `year` | Year of observation |
| `rgdpo` | Real GDP (output-side) |
| `pop` | Population |
| `gdp_per_capita` | Calculated as `rgdpo / pop` |
The analysis focuses on the 2000–2018 time range.
Data preprocessing steps included:
- Filtering rows to the 2000–2018 range
- Removing missing values
- Removing rows with population = 0
- Creating a new feature: `gdp_per_capita = rgdpo / pop`

This metric allows comparison of economic productivity per person across countries.
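The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the function name `prepare_gdp_data` is hypothetical, the toy rows are made-up numbers rather than PWT values, and missing values are assumed to matter only in `rgdpo` and `pop`.

```python
import numpy as np
import pandas as pd


def prepare_gdp_data(df: pd.DataFrame, start: int = 2000, end: int = 2018) -> pd.DataFrame:
    """Filter to the study window, drop unusable rows, and add gdp_per_capita."""
    cols = ["country", "year", "rgdpo", "pop"]
    # Keep only the years in the analysis window (inclusive on both ends).
    out = df.loc[df["year"].between(start, end), cols]
    # Drop rows where either input to the ratio is missing (assumed scope).
    out = out.dropna(subset=["rgdpo", "pop"])
    # Drop zero (or negative) populations to avoid division by zero.
    out = out[out["pop"] > 0].copy()
    out["gdp_per_capita"] = out["rgdpo"] / out["pop"]
    return out.reset_index(drop=True)


# Toy rows (made-up numbers, not PWT values) that exercise each filter.
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B", "C"],
    "year": [1999, 2000, 2005, 2010, 2018],
    "rgdpo": [100.0, 120.0, np.nan, 300.0, 50.0],
    "pop": [10.0, 10.0, 5.0, 0.0, 2.0],
})
clean = prepare_gdp_data(toy)
print(clean)
```

With the real dataset, the input would presumably come from something like `pd.read_csv("pwt1001.csv")`, assuming the file uses the column names listed in the table above.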
Descriptive statistics were used to understand global economic patterns.
Metrics calculated:
- Mean GDP per capita
- Median GDP per capita
- Mode GDP per capita
- Quartiles (25%, 50%, 75%)
These provide insights into global standards of living and economic inequality.
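As a rough sketch of how these metrics could be computed with pandas (the helper name `describe_gdp` is hypothetical, and the toy values are illustrative only): because GDP per capita is continuous, the mode is taken after rounding, which is an assumption and may differ from how the project computes it.

```python
import pandas as pd


def describe_gdp(values: pd.Series, decimals: int = 0) -> dict:
    """Return mean, median, mode, and quartiles for a GDP per capita series."""
    quartiles = values.quantile([0.25, 0.50, 0.75])
    # Continuous values rarely repeat exactly, so round before taking the mode.
    # If several values tie, pandas returns them sorted and we keep the smallest.
    mode = values.round(decimals).mode().iloc[0]
    return {
        "mean": float(values.mean()),
        "median": float(values.median()),
        "mode": float(mode),
        "q25": float(quartiles.loc[0.25]),
        "q50": float(quartiles.loc[0.50]),
        "q75": float(quartiles.loc[0.75]),
    }


# Illustrative values only (not taken from PWT).
sample = pd.Series([1000.0, 2000.0, 2000.0, 4000.0, 11000.0])
summary = describe_gdp(sample)
print(summary)
```

A mean well above the median, as in this toy sample, is the usual sign of a right-skewed distribution.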
To identify economic groups, K-Means clustering was applied.
Steps:
- Calculate average GDP per capita per country
- Standardize values using StandardScaler
- Apply K-Means clustering (k = 4)
Clusters represent different economic categories such as:
- Low-income economies
- Emerging economies
- Upper-middle-income economies
- High-income economies
Visualization example:
- Scatter plot of GDP per capita vs population
- Countries grouped by cluster
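The clustering steps could look roughly like the sketch below. The function name `cluster_countries`, the `n_init` and `random_state` settings, and the relabeling step are assumptions added for illustration; the toy countries and values are invented.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def cluster_countries(df: pd.DataFrame, k: int = 4, seed: int = 42) -> pd.DataFrame:
    """Cluster countries on their average GDP per capita."""
    # Step 1: one row per country with its mean GDP per capita.
    avg = df.groupby("country", as_index=False)["gdp_per_capita"].mean()
    # Step 2: standardize to zero mean and unit variance.
    scaled = StandardScaler().fit_transform(avg[["gdp_per_capita"]])
    # Step 3: K-Means with a fixed seed so repeated runs give the same result.
    model = KMeans(n_clusters=k, n_init=10, random_state=seed)
    avg["cluster"] = model.fit_predict(scaled)
    # K-Means labels are arbitrary, so relabel: 0 = poorest cluster, k-1 = richest.
    order = avg.groupby("cluster")["gdp_per_capita"].mean().sort_values().index
    avg["cluster"] = avg["cluster"].map({old: new for new, old in enumerate(order)})
    return avg


# Invented data: eight countries in four well-separated income tiers, two years each.
levels = {"L1": 1000, "L2": 1200, "E1": 8000, "E2": 9000,
          "U1": 20000, "U2": 22000, "H1": 50000, "H2": 52000}
rows = [
    {"country": name, "year": year, "gdp_per_capita": level * factor}
    for name, level in levels.items()
    for year, factor in [(2000, 0.9), (2018, 1.1)]
]
result = cluster_countries(pd.DataFrame(rows))
print(result.sort_values("gdp_per_capita"))
```

Relabeling by cluster mean is a convenience for reading plots and tables; it does not change which countries are grouped together.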
The project tests whether GDP per capita distributions differ between economic groups.
Shapiro–Wilk Test
Used to check whether the GDP distribution is normal.
Result:
- GDP per capita data is not normally distributed.
Mann–Whitney U Test
Used instead of a t-test because the data is non-normal.
This test compares GDP distributions between developing and developed country clusters.
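A minimal sketch of this two-step testing procedure with SciPy is shown below. The function name `compare_groups`, the 0.05 significance level, and the two-sided alternative are assumptions; the samples are synthetic log-normal draws, not PWT data.

```python
import numpy as np
from scipy import stats


def compare_groups(developing, developed, alpha: float = 0.05) -> dict:
    """Check normality with Shapiro-Wilk, then compare groups with Mann-Whitney U."""
    _, p_shapiro_a = stats.shapiro(developing)
    _, p_shapiro_b = stats.shapiro(developed)
    # Mann-Whitney U makes no normality assumption, so it is safe either way.
    u_stat, p_mwu = stats.mannwhitneyu(developing, developed, alternative="two-sided")
    return {
        "shapiro_p_developing": float(p_shapiro_a),
        "shapiro_p_developed": float(p_shapiro_b),
        "both_look_normal": bool(p_shapiro_a > alpha and p_shapiro_b > alpha),
        "mannwhitney_u": float(u_stat),
        "mannwhitney_p": float(p_mwu),
    }


# Synthetic, right-skewed samples standing in for the two country groups.
rng = np.random.default_rng(0)
developing = rng.lognormal(mean=8.0, sigma=1.0, size=100)
developed = rng.lognormal(mean=10.5, sigma=0.4, size=60)
outcome = compare_groups(developing, developed)
print(outcome)
```

A small Shapiro–Wilk p-value is evidence against normality, and a small Mann–Whitney p-value suggests that values in one group tend to be larger than in the other.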
Year-over-year GDP growth was calculated using:
GDP Growth = (GDP_t − GDP_(t−1)) / GDP_(t−1)
The average global GDP growth was then visualized over time to observe long-term economic trends.
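The growth formula maps directly onto pandas' `pct_change`. The sketch below is one possible implementation: the helper name `add_yoy_growth` is hypothetical, the formula is applied to the `gdp_per_capita` column by default (an assumption, since it could equally be applied to `rgdpo`), and each country is assumed to have consecutive yearly rows.

```python
import pandas as pd


def add_yoy_growth(df: pd.DataFrame, value_col: str = "gdp_per_capita") -> pd.DataFrame:
    """Add (x_t - x_{t-1}) / x_{t-1}, computed separately for each country."""
    out = df.sort_values(["country", "year"]).copy()
    # pct_change compares each row with the previous row in the same country,
    # so the first year of every country has no growth value (NaN).
    out["gdp_growth"] = out.groupby("country")[value_col].pct_change()
    return out


# Made-up values chosen so the growth rates are easy to check by hand.
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "B"],
    "year": [2000, 2001, 2002, 2000, 2001, 2002],
    "gdp_per_capita": [100.0, 110.0, 121.0, 200.0, 190.0, 209.0],
})
growth = add_yoy_growth(toy)
# Unweighted average across countries for each year.
global_growth = growth.groupby("year")["gdp_growth"].mean()
print(global_growth)
```

Note that this global average weights every country equally; weighting by population or GDP would give a different series.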
Countries were ranked by average GDP per capita.
The project identifies the top 10 economies by average GDP per capita over 2000–2018.
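One way to produce such a ranking, sketched with a hypothetical helper `top_economies` and invented values:

```python
import pandas as pd


def top_economies(df: pd.DataFrame, n: int = 10) -> pd.Series:
    """Return the n countries with the highest mean GDP per capita."""
    return df.groupby("country")["gdp_per_capita"].mean().nlargest(n)


# Invented values for three placeholder countries.
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B", "C", "C"],
    "year": [2000, 2018, 2000, 2018, 2000, 2018],
    "gdp_per_capita": [10.0, 20.0, 40.0, 60.0, 5.0, 5.0],
})
ranking = top_economies(toy, n=2)
print(ranking)
```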
Example country comparison:
- United States
- Germany
- Egypt
- India
A Linear Regression model was used to predict GDP per capita trends.
Example prediction:
Finland GDP per Capita (2000–2030)
Model workflow:
- Train model using historical data (2000–2018)
- Use `year` as the predictor
- Predict GDP per capita through 2030
- Compare actual vs predicted values
This demonstrates a basic economic forecasting approach.
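The workflow above might be sketched as follows for a single country. The function name `forecast_gdp` is hypothetical, and the input series is a synthetic, perfectly linear trend rather than real data for Finland or any other country.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def forecast_gdp(history: pd.DataFrame, last_year: int = 2030) -> pd.DataFrame:
    """Fit gdp_per_capita ~ year and predict from the first observed year to last_year."""
    X = history[["year"]].to_numpy()
    y = history["gdp_per_capita"].to_numpy()
    model = LinearRegression().fit(X, y)
    years = np.arange(int(history["year"].min()), last_year + 1)
    predicted = model.predict(years.reshape(-1, 1))
    forecast = pd.DataFrame({"year": years, "predicted": predicted})
    # Left-join the actuals so observed and predicted values sit side by side;
    # years beyond the training window have no actual value (NaN).
    return forecast.merge(history[["year", "gdp_per_capita"]], on="year", how="left")


# Synthetic series: starts at 30,000 and rises by exactly 500 per year.
train_years = np.arange(2000, 2019)
history = pd.DataFrame({
    "year": train_years,
    "gdp_per_capita": 30000.0 + 500.0 * (train_years - 2000),
})
forecast = forecast_gdp(history)
print(forecast.tail())
```

Because the only predictor is `year`, the model simply extends a straight line, which is why it cannot anticipate recessions or other shocks.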
The project includes:
- GDP clustering scatter plots
- Global GDP growth over time
- Country GDP comparisons
- GDP prediction graphs
Libraries used:
- Matplotlib
- Seaborn
| Technology | Purpose |
|---|---|
| Python | Programming language |
| Pandas | Data manipulation |
| NumPy | Numerical computing |
| Matplotlib | Data visualization |
| Seaborn | Statistical visualization |
| Scikit-learn | Machine learning |
| SciPy | Statistical testing |
git clone https://github.com/yourusername/gdp-analysis-project.git
cd gdp-analysis-project
pip install pandas numpy matplotlib seaborn scikit-learn scipy
python gdp_analysis.py

Make sure the dataset file `pwt1001.csv` is located in the project directory.
- GDP per capita varies widely between countries.
- K-Means clustering on average GDP per capita separates countries into four distinct income groups.
- GDP distributions are non-normal, requiring non-parametric tests.
- Linear regression can approximate long-term economic trends, though it cannot capture complex economic shocks.
Potential extensions:
- Add more economic indicators (inflation, unemployment, education)
- Use time series models (ARIMA, Prophet, LSTM)
- Build an interactive dashboard (Plotly / Streamlit)
- Apply multi-feature clustering for deeper economic classification
Olga Chitembo
This project is part of my data science and analytics portfolio, demonstrating skills in:
- Data analysis
- Statistical testing
- Machine learning
- Economic data interpretation
- Data visualization