
# Exploring the Relationship Between Internet Infrastructure and Foreign Direct Investment (FDI)

## Introduction

Foreign Direct Investment (FDI) is an important driver of economic development, bringing in capital, technology, and expertise. In today’s digital economy, internet infrastructure may play a key role in attracting foreign investments. This analysis aims to examine the relationship between secure internet servers per million people and FDI as a percentage of GDP across different countries in 2022.

## Hypothesis

The hypothesis is that countries with a higher number of secure internet servers per million people receive more FDI as a percentage of GDP. The assumption is that better internet infrastructure enhances business operations, security, and digital transactions, making a country more attractive for foreign investors.

## Data Sources
I use two datasets from the World Bank:

Foreign Direct Investment (FDI) as a % of GDP (1960–2023) - Contains yearly FDI inflows as a percentage of GDP.

Source: [World Bank](https://data.worldbank.org/indicator/BX.KLT.DINV.WD.GD.ZS)

Internet Infrastructure: Secure Internet Servers per 1M People (2010–2023) - Measures secure internet servers per million people.

Source: [World Bank](https://databank.worldbank.org/reports.aspx?source=2&series=IT.NET.SECR.P6&country=#)

For this study, I focus on data from 2022 and merge these datasets based on Country Code.

## Data Cleaning and Processing

We start by loading the datasets and keeping only the relevant columns.

In [None]:
import pandas as pd

# Load Foreign Direct Investment (FDI) dataset
fdi_df = pd.read_csv("/content/fdi.csv", skiprows=4)

# Load Internet Infrastructure dataset
internet_df = pd.read_csv("/content/internet.csv")

In [None]:
# Select relevant columns; .copy() returns an independent DataFrame, so the
# modifications below do not trigger pandas' SettingWithCopyWarning
fdi_df = fdi_df[['Country Name', 'Country Code', '2022']].copy()
fdi_df = fdi_df.rename(columns={'2022': 'FDI as % of GDP (2022)'})

internet_df = internet_df[['Country Name', 'Country Code', '2022 [YR2022]']].copy()
internet_df = internet_df.rename(columns={'2022 [YR2022]': 'Secure Internet Servers per 1M people (2022)'})
# Convert the server counts to numbers; entries that cannot be parsed become NaN
internet_df['Secure Internet Servers per 1M people (2022)'] = pd.to_numeric(internet_df['Secure Internet Servers per 1M people (2022)'], errors='coerce')
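
The `errors='coerce'` argument matters because the server counts are read as text when the export contains non-numeric placeholders. The sketch below illustrates the effect on a toy frame; the `..` placeholder, the country codes, and the values are assumptions for illustration and are not taken from the actual file.

```python
import pandas as pd

# Toy stand-in for the export; codes, values, and the ".." placeholder are hypothetical
raw = pd.DataFrame({
    "Country Code": ["AAA", "BBB", "CCC"],
    "2022 [YR2022]": ["1621.47", "..", "36.54"],
})

# Strings that cannot be parsed as numbers are turned into NaN instead of raising
servers = pd.to_numeric(raw["2022 [YR2022]"], errors="coerce")

print(servers)
print("Missing after conversion:", servers.isna().sum())
```

Rows that become NaN here are the ones that a later `dropna` call would remove.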

## Merging Datasets
Now that the data are clean, we merge the two datasets on Country Code and drop rows with missing values, so that every remaining row has both indicators.

In [None]:
# Merge datasets on 'Country Code'
merged_df = pd.merge(fdi_df, internet_df, on='Country Code', how='inner')

# Drop rows with missing values
merged_df.dropna(inplace=True)
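
An inner merge followed by `dropna` silently discards rows, so it can be useful to check how many countries are lost at each step. The following is a minimal sketch on toy data (the frames `left` and `right` and their values are hypothetical), using the `indicator` and `validate` options of `pd.merge`.

```python
import pandas as pd

# Hypothetical toy frames standing in for the FDI and internet datasets
left = pd.DataFrame({"Country Code": ["AAA", "BBB", "CCC"], "fdi": [1.2, None, 3.4]})
right = pd.DataFrame({"Country Code": ["AAA", "BBB", "DDD"], "servers": [100.0, 200.0, 300.0]})

# An outer merge with indicator=True records where each row came from;
# validate="one_to_one" raises an error if a country code is duplicated
audit = pd.merge(left, right, on="Country Code", how="outer",
                 indicator=True, validate="one_to_one")
match_counts = audit["_merge"].value_counts()
print(match_counts)

# Rows present in both sources and complete in every column
complete = audit[audit["_merge"] == "both"].dropna()
print(f"{len(complete)} of {len(audit)} rows kept")
```

The same check could be run on `fdi_df` and `internet_df` before the inner merge to see which countries drop out and why.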

## Explore the Data
Let’s take a look at the first few rows of our merged dataset:

In [None]:
print(merged_df.head())

                Country Name_x Country Code  FDI as % of GDP (2022)  \
0                        Aruba          ABW                7.567072   
1  Africa Eastern and Southern          AFE                1.695914   
3   Africa Western and Central          AFW                1.204459   
4                       Angola          AGO               -6.320564   
5                      Albania          ALB                7.579342   

                Country Name_y  Secure Internet Servers per 1M people (2022)  
0                        Aruba                                   1621.470506  
1  Africa Eastern and Southern                                   1332.877133  
3   Africa Western and Central                                     36.735969  
4                       Angola                                     36.537083  
5                      Albania                                   1107.395392  
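
The preview includes rows such as "Africa Eastern and Southern" (AFE) and "Africa Western and Central" (AFW), which are World Bank regional aggregates rather than individual countries. One possible way to restrict the analysis to countries is to filter on a set of aggregate codes, as sketched below. The set `AGGREGATE_CODES` is an illustrative and incomplete assumption; a complete list would need to come from the World Bank's country metadata.

```python
import pandas as pd

# Illustrative, incomplete set of World Bank aggregate codes (an assumption)
AGGREGATE_CODES = {"AFE", "AFW", "ARB", "EUU", "HIC", "LIC", "WLD"}

# Sample rows copied from the preview above
sample = pd.DataFrame({
    "Country Code": ["ABW", "AFE", "AFW", "AGO", "ALB"],
    "FDI as % of GDP (2022)": [7.567072, 1.695914, 1.204459, -6.320564, 7.579342],
})

# Keep only rows whose code is not a known aggregate
countries_only = sample[~sample["Country Code"].isin(AGGREGATE_CODES)]
print(countries_only)
```

Applied to `merged_df`, the same filter would keep regional and income-group totals from being counted as data points alongside individual countries.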


## Data Visualization
We visualize the relationship between Secure Internet Servers per 1M People and FDI as % of GDP using a scatter plot.
To better understand country-specific patterns, we use Plotly to create an interactive scatter plot where users can hover over data points to see country names.

In [None]:
import plotly.express as px
import pandas as pd

# Filter data within a reasonable range for better visibility
filtered_df = merged_df[(merged_df['FDI as % of GDP (2022)'] >= 0) &
                        (merged_df['FDI as % of GDP (2022)'] <= 50) &
                        (merged_df['Secure Internet Servers per 1M people (2022)'] <= 50000)]  # Adjust threshold

# Create interactive scatter plot
fig = px.scatter(filtered_df, x='Secure Internet Servers per 1M people (2022)',
                 y='FDI as % of GDP (2022)',
                 hover_name='Country Name_x',
                 title='Relationship Between Internet Infrastructure and FDI',
                 labels={
                     'Secure Internet Servers per 1M people (2022)': 'Secure Internet Servers per 1M People (2022)',
                     'FDI as % of GDP (2022)': 'FDI as % of GDP (2022)'
                 })

fig.show()

## Observations
At first glance, the scatter plot does **not** indicate a strong correlation.

Most data points cluster near the lower end of the x-axis (low-to-moderate internet infrastructure) and show a wide spread of FDI values. There is no clear upward or downward trend, so more secure servers do not appear to go hand in hand with higher FDI. A few high-FDI outliers sit at low server counts, which further weakens any potential correlation.

To quantify this, we can do some **correlation and regression analysis**.


In [None]:
import scipy.stats as stats

# Compute Pearson correlation
correlation, p_value = stats.pearsonr(filtered_df['Secure Internet Servers per 1M people (2022)'],
                                     filtered_df['FDI as % of GDP (2022)'])

print(f"Pearson correlation coefficient: {correlation:.4f}")
print(f"P-value: {p_value:.4f}")

Pearson correlation coefficient: 0.0296
P-value: 0.6872


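The correlation coefficient is close to 0, so there is essentially no linear relationship between the two variables in this sample. The p-value (0.687) is well above 0.05, so the correlation is not statistically significant: we cannot reject the null hypothesis of no linear correlation.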
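
A simple linear regression is a natural companion to the correlation coefficient, because it also reports the estimated slope. The sketch below uses `scipy.stats.linregress` on made-up numbers (the lists `x` and `y` are hypothetical, not the notebook's data) and shows that, with a single predictor, the regression's `rvalue` and `pvalue` agree with `pearsonr`.

```python
from scipy import stats

# Hypothetical toy data, for illustration only
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 6.8, 9.0]

# Ordinary least squares fit of y on x
result = stats.linregress(x, y)
# Pearson correlation on the same data, for comparison
r, p = stats.pearsonr(x, y)

print(f"slope: {result.slope:.4f}, intercept: {result.intercept:.4f}")
print(f"regression r: {result.rvalue:.4f}, Pearson r: {r:.4f}")
print(f"regression p-value: {result.pvalue:.4f}, Pearson p-value: {p:.4f}")
```

Passing the two columns of `filtered_df` instead of `x` and `y` would give the slope of FDI on server counts for the global sample.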

## Expanding the Analysis: Regional Perspectives

While the initial analysis found no significant correlation between internet infrastructure and FDI across all countries, it's possible that this relationship varies by region or income level.

Hovering over the points in the previous scatter plot shows that many of the countries with strong internet infrastructure but no correspondingly higher FDI, such as Belgium, Canada, and New Zealand, are developed economies. What about developing countries?

By segmenting the data, we can uncover nuanced patterns that may be obscured in a global analysis. I am choosing Latin American countries as a case study.



## Filtering the Dataset

In [None]:
# Filter dataset for only Latin American and Caribbean countries
latin_america_countries = [
    "ARG", "BOL", "BRA", "CHL", "COL", "CRI", "CUB", "DOM", "ECU", "SLV",
    "GTM", "HND", "JAM", "MEX", "NIC", "PAN", "PRY", "PER", "URY", "VEN"
]

latin_america_df = merged_df[merged_df["Country Code"].isin(latin_america_countries)]

## Data Visualization

In [None]:
# Create interactive scatter plot
fig = px.scatter(
    latin_america_df,
    x='Secure Internet Servers per 1M people (2022)',
    y='FDI as % of GDP (2022)',
    hover_name='Country Name_x',
    title='Interactive Relationship Between Internet Infrastructure and FDI in Latin America',
    labels={
        'Secure Internet Servers per 1M people (2022)': 'Secure Internet Servers per 1M People (2022)',
        'FDI as % of GDP (2022)': 'FDI as % of GDP (2022)'
    }
)

fig.show()

Unlike the global plot, which showed essentially no correlation, the Latin America plot hints at a possible positive relationship between internet infrastructure and FDI.

In [None]:
# Compute Pearson correlation for Latin America
correlation_latam, p_value_latam = stats.pearsonr(
    latin_america_df['Secure Internet Servers per 1M people (2022)'],
    latin_america_df['FDI as % of GDP (2022)']
)

print(f"Pearson correlation coefficient: {correlation_latam:.4f}")
print(f"P-value: {p_value_latam:.4f}")

Pearson correlation coefficient: 0.3960
P-value: 0.1038


Pearson correlation coefficient: 0.396 → Moderate positive correlation (stronger than the global result).

P-value: 0.104 → Not statistically significant (p > 0.05), so we cannot confirm this correlation with confidence. The small sample of 18 countries likely contributes, since small samples have little power to detect a moderate correlation.
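
One way to see how much uncertainty a sample of 18 leaves is an approximate confidence interval for the correlation. The sketch below uses the Fisher z-transformation, which assumes roughly bivariate normal data; the helper `pearson_confidence_interval` is a hypothetical name introduced here, not part of the original analysis.

```python
import math
from statistics import NormalDist

def pearson_confidence_interval(r, n, confidence=0.95):
    """Approximate confidence interval for a Pearson r via the Fisher z-transformation."""
    if n <= 3:
        raise ValueError("n must be greater than 3")
    z = math.atanh(r)                      # map r onto the z scale
    se = 1.0 / math.sqrt(n - 3)            # approximate standard error of z
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    # Build the interval on the z scale, then map back to the r scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

# Values reported above: r = 0.3960 with 18 countries
low, high = pearson_confidence_interval(0.3960, 18)
print(f"95% CI for r: [{low:.3f}, {high:.3f}]")
```

Under these assumptions the interval runs from slightly below zero to about 0.73, so the data are compatible with anything from no relationship to a strong positive one.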

## Conclusion

The study set out to test whether internet infrastructure is a strong determinant of FDI.

On a global scale, the data do not support this relationship. Other factors such as political stability, market size, and investment policies likely play a larger role in attracting foreign investment.

It may also be that developed countries have larger and more diversified economies, so digital readiness is not a key driver of investment there.

In contrast, for developing economies, better internet infrastructure could be a competitive advantage in attracting foreign investors, as digital accessibility may be a key consideration in emerging markets. The Latin America results point in this direction, but the correlation was not statistically significant, so this remains a tentative interpretation.
