# **Project Name**    - World Bank Education (EDA)



##### **Project Type**    - EDA
##### **Contribution**    - Individual

# **Project Summary** -


In this project, we analyze the World Bank Education Statistics (EdStats) dataset to uncover insights and trends in global education metrics. This exploratory data analysis (EDA) focuses on understanding the data distribution, identifying outliers, spotting missing values, and investigating correlations among educational, economic, and demographic variables. Key objectives include examining regional differences in education access, literacy rates, funding patterns, and other factors that affect educational outcomes. The visualizations, summary statistics, and insights from this analysis aim to support data-driven decisions and highlight areas for improvement in educational support across regions.

# **GitHub Link -**

https://github.com/kush-agra-soni/4_world_bank_edu_eda.git

# **Problem Statement**


This World Bank education EDA project addresses the critical issue of unequal access to quality education worldwide. Education outcomes vary significantly across regions and are influenced by factors like literacy rates, funding, enrollment, and demographics. To support effective policies and resource allocation, governments, NGOs, and educational institutions need a clearer picture of how these factors interact and affect educational access globally.

The main goals of this project are to:

* Analyze trends in literacy rates across countries and regions to highlight disparities.
* Investigate how education spending correlates with enrollment rates and other educational outcomes.
* Examine demographic factors influencing educational accessibility, identifying key drivers and barriers.

This exploratory data analysis will generate insights to inform education-focused stakeholders, guiding them toward data-backed decisions to address inequalities and improve educational resources and outcomes.

#### **Define Your Business Objective?**

The business objective for this project is to provide actionable insights into global educational disparities that can support data-driven decision-making for improving educational accessibility and quality. By analyzing the World Bank EdStats dataset, the objective is to:

* Enable Stakeholders – Equip governments, NGOs, and educational institutions with data insights to make informed policy decisions.
* Identify Key Factors – Reveal the factors that most significantly impact literacy rates, enrollment, and funding efficacy across regions.
* Target Resource Allocation – Support efficient allocation of resources by identifying regions and demographics most in need of educational investment.
* Improve Educational Outcomes – Ultimately, contribute to initiatives aimed at reducing educational inequalities and enhancing access to quality education globally.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objs as go
import missingno as msno
from sklearn.preprocessing import StandardScaler
from scipy import stats
import plotly.io as pio

### Dataset Loading

In [None]:
# GitHub raw URLs for your datasets
base_url = "https://raw.githubusercontent.com/kush-agra-soni/4_world_bank_edu_eda/refs/heads/main/"
country_series_url = f"{base_url}EdStatsCountry-Series.csv"
country_url = f"{base_url}EdStatsCountry.csv"
data_url = f"{base_url}EdStatsData.csv"
footnote_url = f"{base_url}EdStatsFootNote.csv"
series_url = f"{base_url}EdStatsSeries.csv"

# Load the CSV files into DataFrames
country_series_df = pd.read_csv(country_series_url)
country_df = pd.read_csv(country_url)
data_df = pd.read_csv(data_url)
footnote_df = pd.read_csv(footnote_url)
series_df = pd.read_csv(series_url)

### Dataset First View

In [None]:
country_series_df.head(1)

In [None]:
country_df.head(1)

In [None]:
data_df.head(1)

In [None]:
footnote_df.head(1)

In [None]:
series_df.head(1)

In [None]:
# Rename columns to ensure consistency
country_df = country_df.rename(columns={"Country Code": "CountryCode"})
data_df = data_df.rename(columns={"Country Code": "CountryCode", "Indicator Code": "SeriesCode"})
series_df = series_df.rename(columns={"Series Code": "SeriesCode"})

# Merge datasets into data_df (the main dataset) with 'left' join to avoid row loss
merged_df = pd.merge(data_df, country_series_df, how='left', on=['CountryCode', 'SeriesCode'])
merged_df = pd.merge(merged_df, country_df, how='left', on='CountryCode')
merged_df = pd.merge(merged_df, footnote_df, how='left', on=['CountryCode', 'SeriesCode'])
merged_df = pd.merge(merged_df, series_df, how='left', on='SeriesCode')

# Remove any duplicate rows if they were introduced by the merge
merged_df = merged_df.drop_duplicates()

# Handle missing values:
# Fill missing values for numerical columns with the mean of each column
for column in merged_df.select_dtypes(include=['float64']).columns:
    mean_value = merged_df[column].mean()
    merged_df[column] = merged_df[column].fillna(mean_value)  # Direct assignment

# Fill missing values for categorical columns with the mode of each column
for column in merged_df.select_dtypes(include=['object']).columns:
    mode_value = merged_df[column].mode()[0]
    merged_df[column] = merged_df[column].fillna(mode_value)  # Direct assignment

# Save the final dataset as a CSV file
merged_df.to_csv("main_dataset.csv", index=False)
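
A left join keeps every row of the main table, but it can still multiply rows when the lookup table repeats a key. The sketch below, on toy stand-in tables with illustrative values (not the real EdStats files), shows how the `validate` argument of `pd.merge` can surface that situation instead of relying on `drop_duplicates()` afterwards.

```python
import pandas as pd

# Toy stand-ins for the real tables (codes and values are illustrative only)
data = pd.DataFrame({
    "CountryCode": ["ARB", "ARB", "USA"],
    "SeriesCode": ["SE.PRM.ENRR", "SP.POP.TOTL", "SE.PRM.ENRR"],
    "2010": [95.1, 3.5e8, 101.2],
})
country = pd.DataFrame({
    "CountryCode": ["ARB", "USA"],
    "Region": [None, "North America"],
})

# validate="many_to_one" raises MergeError if CountryCode repeats in `country`
merged = pd.merge(data, country, how="left", on="CountryCode", validate="many_to_one")
rows_preserved = len(merged) == len(data)

# A lookup table with a repeated key would multiply rows if left unchecked
footnotes = pd.DataFrame({
    "CountryCode": ["ARB", "ARB"],
    "SeriesCode": ["SE.PRM.ENRR", "SE.PRM.ENRR"],
    "Note": ["Estimate", "Revised"],
})
try:
    pd.merge(data, footnotes, how="left",
             on=["CountryCode", "SeriesCode"], validate="many_to_one")
    duplicate_keys_detected = False
except pd.errors.MergeError:
    duplicate_keys_detected = True

print(rows_preserved, duplicate_keys_detected)
```

If the check fails on the real tables, one option is to aggregate the lookup table to one row per key before merging.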

### Dataset Rows & Columns count

In [None]:
# Load the dataset
merged_df = pd.read_csv("main_dataset.csv")

# Get the number of rows and columns
num_rows, num_columns = merged_df.shape

# Print the number of rows and columns
print(f"Number of rows: {num_rows}")
print(f"Number of columns: {num_columns}")

### Dataset Information

In [None]:
# Dataset info
merged_df.info()

#### Duplicate Values

In [None]:
# Check for duplicate rows
duplicate_rows = merged_df.duplicated().sum()

# Print the number of duplicate rows
print(f"Number of duplicate rows: {duplicate_rows}")

### What did you know about your dataset?

The dataset consists of 354,282 rows and 52 columns, containing data for various countries (identified by `CountryCode`) across years (1970-2014) and indicators (identified by `SeriesCode`). It includes numerical data for each year, as well as metadata columns like `Currency Unit`, `Region`, and `Income Group`.

Key points to consider:
- **Missing Values**: Some columns (`Currency Unit`, `Region`, `Income Group`) have missing data.
- **Data Types**: Year columns are numerical (float64), while others are categorical (object).
- **Duplicates**: The dataset may contain duplicate rows that need to be removed.
- **Outliers**: Potential outliers in numerical data need to be identified.
- **Usage**: Ideal for time-series analysis, country comparisons, and forecasting, with the possibility of using advanced techniques like feature engineering.

The dataset requires cleaning (handling missing data, removing duplicates) before any analysis or modeling.
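
To decide what cleaning is needed, it can help to quantify missing values and duplicates first. This is a minimal sketch on a toy frame with made-up values. On the real data it would need to run on the raw tables, before the mean and mode imputation in the merge cell fills the gaps.

```python
import numpy as np
import pandas as pd

# Toy frame with illustrative values only
df = pd.DataFrame({
    "CountryCode": ["ARB", "USA", "IND", "BRA"],
    "Region": [np.nan, "North America", "South Asia", np.nan],
    "1970": [np.nan, 1.0, np.nan, np.nan],
    "2014": [2.0, 3.0, 4.0, 5.0],
})

# Share of missing values per column, largest first
missing_share = df.isna().mean().sort_values(ascending=False)

# Number of fully duplicated rows
duplicate_rows = int(df.duplicated().sum())

print(missing_share)
print(f"Duplicate rows: {duplicate_rows}")
```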

## ***2. Understanding Your Variables***

In [None]:
# Display dataset columns

print("Dataset Columns:")
print(merged_df.columns)

In [None]:
# Display descriptive statistics of numerical columns

print("Dataset Describe:")
print(merged_df.describe())

### Variables Description

- **CountryCode**: Unique code for each country (Categorical).
- **SeriesCode**: Unique code for each data series (Categorical).
- **1970 - 2014**: Yearly data for various indicators (Numerical).
- **Currency Unit**: Currency used in each country (Categorical).
- **Region**: Geographical region of the country (Categorical).
- **Income Group**: Income classification of countries (Categorical).
- **Topic**: Broad category of the indicator (Categorical).
- **Indicator Name**: Description of the measured indicator (Categorical).

**Key Insights**:
- **Categorical Variables**: Country, series, currency, region, income, topic, indicator.
- **Numerical Variables**: Yearly data (1970-2014) representing economic or social metrics.

### Check Unique Values for each variable.

In [None]:
for column in merged_df.columns:
    unique_values = merged_df[column].nunique()  # Count of unique values
    sample_values = merged_df[column].dropna().unique()[:1]  # Sample of unique values (only one)

    # Display output in a concise format
    print(f"Unique values for {column}:")
    print(f"Count of unique values: {unique_values}")
    print(f"Sample of unique values: {sample_values}")
    print("-" * 50)

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# The datasets were already wrangled with a Python script and Excel (Power Query).
# The cleaned datasets are then merged on their common key columns.

# Five datasets of education statistics were prepared for analysis.
# Because of GitHub's 100 MB file limit, the initial wrangling was done with Power Query and other tools.

# Step 1: Column filtering
#   * Removed unnecessary columns from the datasets to focus on relevant data:
#   * Dropped 'Year' and 'DESCRIPTION' from EdStatsFootNote.
#   * Dropped redundant columns from EdStatsCountry, such as 'SNA price valuation', 'Alternative conversion factor', etc.

# Step 2: Data cleaning
#   * Addressed missing data in EdStatsData for the year columns 1970 to 2014.
#   * Applied row-wise mean imputation to handle gaps in the time-series data:
#       Define the year columns from 1970 to 2014:
#         columns_1970_2014 = [str(year) for year in range(1970, 2015)]
#       Impute missing values with each row's mean:
#         df[columns_1970_2014] = df[columns_1970_2014].apply(lambda row: row.fillna(row.mean()), axis=1)
#   * Rounded all numerical data in these columns to one decimal place:
#         df[columns_1970_2014] = df[columns_1970_2014].round(1)

# Step 3: Missing data identification
#   * Analyzed columns with many missing values and retained only those with significant data.
#   * Dropped columns deemed irrelevant or unusable based on data completeness and relevance to the analysis.

### What all manipulations have you done and insights you found?

**Manipulations Performed:**

1. **Column Filtering:**
   - Removed irrelevant columns across datasets to focus on meaningful information.
   - For example:
     - In *EdStatsFootNote*, removed `Year` and `DESCRIPTION`.
     - In *EdStatsCountry*, removed columns like `SNA price valuation`, `Alternative conversion factor`, and other rarely used metadata fields.

2. **Handling Missing Data:**
   - Addressed extensive missing data in the *EdStatsData* dataset (time-series data from 1970 to 2014):
     - Applied row-wise mean imputation to fill missing values.
     - Filled each row's gaps with that row's own mean, which keeps values on the indicator's scale but flattens variation within the row.
   - Rounded all numerical data in the columns (1970–2014) to one decimal place for consistency.

3. **Data Cleansing:**
   - Identified and dropped columns with excessive missing values or limited relevance across datasets.
   - Ensured data integrity and readability by resolving formatting inconsistencies.

4. **Data Output:**
   - Saved cleaned datasets as individual `.csv` files for future merging and analysis.
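
The row-wise mean imputation and rounding described above can be sketched as follows. The frame, its values, and the shortened year range are illustrative assumptions. The linear-interpolation line is included only for comparison and is not part of the original wrangling.

```python
import numpy as np
import pandas as pd

# Toy time series with gaps (illustrative values, shortened year range)
year_cols = [str(year) for year in range(2010, 2015)]
df = pd.DataFrame(
    [[10.0, np.nan, 14.0, np.nan, 18.0],
     [np.nan, np.nan, np.nan, np.nan, np.nan]],
    columns=year_cols,
)

# Row-wise mean imputation: each gap takes the mean of its own row.
# Rows with no observed values have a NaN mean and therefore stay NaN.
row_means = df[year_cols].mean(axis=1)
imputed = df[year_cols].apply(lambda col: col.fillna(row_means)).round(1)

# For comparison only: linear interpolation along each row keeps the trend
interpolated = df[year_cols].interpolate(axis=1).round(1)

print(imputed)
print(interpolated)
```

In the first row the imputed series becomes flat between observations, while the interpolated one rises evenly, which is worth keeping in mind when reading trends from heavily imputed rows.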

---
**Insights Found:**

1. **Time-Series Gaps:**
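   - Significant gaps in time-series data were observed between 1970 and 2014 in *EdStatsData*. These gaps were filled row-wise so that each series stays on its own scale. Because mean imputation flattens year-to-year variation, trends in heavily imputed rows should be read with caution.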

2. **Dataset Connections:**
   - Common columns such as `CountryCode` and `SeriesCode` were identified as keys for potential merging of datasets.

3. **Usability of Datasets:**
   - Certain datasets (e.g., *EdStatsFootNote* and *EdStatsCountry-Series*) had fields with repetitive or redundant information, making them less valuable for immediate analysis.
   - Key datasets for analysis include *EdStatsData*, *EdStatsCountry*, and *EdStatsSeries* due to their wealth of structured data.

4. **Relevance of Columns:**
   - Columns such as `Indicator Name` in *EdStatsData* and `Topic` in *EdStatsSeries* provide essential context for understanding the data, while others (e.g., `Short Name` in *EdStatsCountry*) were less critical.


## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1. Average Value by Year (Time Series Analysis)

In [None]:
# Load the dataset
main_dataset = pd.read_csv('main_dataset.csv')

# Calculate the average for each year across all countries
years = [str(year) for year in range(1970, 2015)]
avg_values = main_dataset[years].mean()

plt.figure(figsize=(14, 6))
plt.plot(years, avg_values, marker='o', linestyle='-', color='b')
plt.title('Average Value by Year (1970-2014)')
plt.xlabel('Year')
plt.ylabel('Average Value')
plt.xticks(rotation=45)
plt.grid(True)
plt.show()

##### 1. Why did you pick the specific chart?

This line chart with data points is an effective choice to visualize trends over time. It allows us to easily identify patterns, fluctuations, and the overall direction of the average value.

##### 2. What is/are the insight(s) found from the chart?

1. Overall Trend: There's a clear upward trend in the average value from 1970 to 2014, suggesting a general increase over the period.
2. Fluctuations: While the overall trend is upward, there are periods of slower growth and even a slight decline around the 1980s.
3. Sharp Increase: A significant increase in the average value is observed in the late 2000s and early 2010s.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from this chart can be valuable for businesses in several ways:

* Strategic Planning: Understanding the historical trend can help businesses forecast future trends and make informed decisions about resource allocation, investment, and product development.
* Identifying Opportunities: The sharp increase in the late 2000s and early 2010s could be a signal of a new market opportunity or a change in consumer behavior. Businesses can capitalize on these opportunities by adjusting their strategies accordingly.
* Risk Management: Identifying periods of slower growth or decline can help businesses prepare for potential challenges and develop contingency plans.
* Performance Evaluation: The chart can be used to benchmark performance against historical trends and identify areas for improvement.
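
The average in Chart 1 pools every indicator in the table, so percentages and raw counts end up in one mean. A minimal sketch, using toy data with hypothetical country codes, indicator names, and values, of how the pooled mean compares with a per-indicator mean:

```python
import pandas as pd

years = ["2012", "2013", "2014"]

# Toy rows: two indicators on very different scales (all values hypothetical)
toy = pd.DataFrame({
    "CountryCode": ["AAA", "BBB", "AAA", "BBB"],
    "Indicator Name": ["Literacy rate (%)", "Literacy rate (%)",
                       "Population, total", "Population, total"],
    "2012": [80.0, 90.0, 1_000_000.0, 3_000_000.0],
    "2013": [81.0, 91.0, 1_010_000.0, 3_030_000.0],
    "2014": [82.0, 92.0, 1_020_000.0, 3_060_000.0],
})

# Pooled mean across all rows: dominated by the large-scale indicator
pooled = toy[years].mean()

# Mean per indicator: each series keeps its own unit
per_indicator = toy.groupby("Indicator Name")[years].mean()

print(pooled)
print(per_indicator)
```

Grouping by `Indicator Name` before averaging keeps each line interpretable in its own unit.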

#### Chart - 2. Top 10 Countries with Highest Values for a Specific Year (2005)

In [None]:
# Rank rows by their 2005 value and keep the ten largest
top_10_2005 = main_dataset[['CountryCode', '2005']].sort_values(by='2005', ascending=False).head(10)

plt.figure(figsize=(10, 6))
plt.barh(top_10_2005['CountryCode'], top_10_2005['2005'], color='teal')
plt.title('Top 10 Countries with Highest Values in 2005')
plt.xlabel('Value')
plt.ylabel('Country Code')
plt.show()

##### 1. Why did you pick the specific chart?

A horizontal bar chart is an effective way to compare discrete categories (in this case, countries) and their corresponding values. It's particularly useful for visualizing rankings and highlighting differences between categories.

##### 2. What is/are the insight(s) found from the chart?

* Dominance of WLD: The code "WLD" (likely the World aggregate) has significantly higher values than the other two codes. These codes are aggregates rather than individual countries, and because each row is a country-indicator pair, the same code can appear more than once in the top 10.
* Similar Values for HIC and OED: The codes "HIC" (High-Income Countries) and "OED" (OECD Countries) have relatively similar values.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The specific impact would depend on the context of the data. However, some potential business implications could be:

- Market Analysis: If the data represents economic indicators, it could help businesses identify potential markets with higher growth potential.
- Resource Allocation: Understanding the relative values of different regions can guide companies in allocating resources effectively.
- Risk Assessment: Identifying regions with lower values might help businesses assess potential risks and uncertainties.

Additional Considerations:

- Data Source and Units: Knowing the source of the data and the units of measurement would provide a more accurate interpretation.
- Country Definitions: A clear understanding of the countries included in each category (WLD, HIC, OED) is crucial.
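
Since aggregate codes such as WLD, HIC, and OED dominate the ranking, one option is to filter them out before taking the top rows. The sketch below uses toy values, and the `AGGREGATE_CODES` set is a hypothetical, incomplete list that would need to be checked against the country metadata.

```python
import pandas as pd

# Hypothetical, incomplete set of aggregate codes (an assumption, not taken from the dataset)
AGGREGATE_CODES = {"WLD", "HIC", "OED", "ARB"}

# Toy values for illustration only
toy = pd.DataFrame({
    "CountryCode": ["WLD", "HIC", "OED", "USA", "IND", "BRA"],
    "2005": [6.5e9, 1.2e9, 1.1e9, 2.9e8, 1.1e9, 1.8e8],
})

# Keep only rows whose code is not an aggregate, then rank
countries_only = toy[~toy["CountryCode"].isin(AGGREGATE_CODES)]
top = countries_only.nlargest(2, "2005")

print(top)
```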

#### Chart - 3 Distribution of Values for a Specific Country (ARB)

In [None]:
# Filter the data for ARB (the Arab World aggregate)
country_data = main_dataset[main_dataset['CountryCode'] == 'ARB']

# Take the first matching row (a single indicator) and keep only the year columns.
# `years` is the list of year column names defined for Chart 1.
# Non-numeric entries, if any, are coerced to NaN.
numeric_data = pd.to_numeric(country_data.iloc[0][years], errors='coerce')

# Plot the data
plt.figure(figsize=(14, 6))
plt.plot(years, numeric_data, marker='o', linestyle='-', color='orange')
plt.title('Value Distribution for ARB (1970-2014)')
plt.xlabel('Year')
plt.ylabel('Value')
plt.xticks(rotation=45)
plt.grid(True)
plt.show()

##### 1. Why did you pick the specific chart?

This line chart with data points is an effective choice to visualize trends over time. It allows us to easily identify patterns, fluctuations, and the overall direction of the value.

##### 2. What is/are the insight(s) found from the chart?

1. Overall Trend: There's a clear upward trend in the value from 1970 to 2014, suggesting a general increase over the period.
2. Fluctuations: While the overall trend is upward, there are periods of slower growth and even a slight decline around the 1990s.
3. Sharp Increase: A significant increase in the value is observed in the late 2000s and early 2010s.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from this chart can be valuable for businesses in several ways:

- Strategic Planning: Understanding the historical trend can help businesses forecast future trends and make informed decisions about resource allocation, investment, and product development.
- Identifying Opportunities: The sharp increase in the late 2000s and early 2010s could be a signal of a new market opportunity or a change in consumer behavior. Businesses can capitalize on these opportunities by adjusting their strategies accordingly.
- Risk Management: Identifying periods of slower growth or decline can help businesses prepare for potential challenges and develop contingency plans.
- Performance Evaluation: The chart can be used to benchmark performance against historical trends and identify areas for improvement.

#### Chart - 4 Average Value by Income Group

In [None]:
avg_by_income_group = main_dataset.groupby('Income Group')[years].mean()

# DataFrame.plot creates its own figure, so the size is set here rather than with plt.figure
avg_by_income_group.T.plot(kind='line', figsize=(12, 8))
plt.title('Average Value by Income Group (1970-2014)')
plt.xlabel('Year')
plt.ylabel('Average Value')
plt.xticks(rotation=45)
plt.grid(True)
plt.legend(title='Income Group')
plt.show()

##### 1. Why did you pick the specific chart?

A multi-line chart is an effective way to compare trends across different categories (in this case, income groups) over time. It allows us to easily identify patterns, fluctuations, and the relative performance of each group.

##### 2. What is/are the insight(s) found from the chart?

1. Diverging Trends: The different income groups show distinct trends over time.
2. Upper Middle Income Dominance: The "Upper middle income" group experiences the most significant growth, surpassing the other groups in the later years.
3. Slow Growth for Low-Income Groups: The "Low income" and "Lower middle income" groups show relatively slower growth compared to the higher-income groups.
4. Convergence: While the gap between the groups widens in the initial years, there seems to be a slight convergence in the later years, with the lower-income groups showing faster growth.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from this chart can be valuable for businesses in several ways:

- Market Targeting: Businesses can identify high-growth markets based on the trends in different income groups.
- Product Development: Understanding the evolving needs of different income groups can help businesses develop products and services that cater to specific segments.
- Investment Strategies: Identifying regions with high-growth potential can guide investment decisions.
- Risk Assessment: Monitoring the economic performance of different regions can help businesses assess potential risks and uncertainties.

#### Chart - 5 Heatmap of Correlation Between Years

In [None]:
# Calculate the correlation matrix for the selected 'years' columns
corr_matrix = main_dataset[years].corr()

# Create the heatmap without annotations
plt.figure(figsize=(12, 6))  # Figure size for the 45-year (1970-2014) correlation matrix
sns.heatmap(corr_matrix, annot=False, cmap='coolwarm', linewidths=0.5)

# Improve readability with a tight layout
plt.title('Correlation Heatmap Between Years')
plt.tight_layout()

plt.show()

##### 1. Why did you pick the specific chart?

A heatmap is an effective way to visualize the correlation between different variables. In this case, it helps us understand the relationships between different years.

##### 2. What is/are the insight(s) found from the chart?

1. Strong Positive Correlation: The darker red squares along the diagonal indicate a strong positive correlation between consecutive years. This suggests that the values tend to move in the same direction over time.
2. Weaker Correlation in Earlier Years: The lighter shades of red in the top-left corner suggest a weaker correlation between the earlier years. This might indicate greater variability or less consistent trends in the past.
3. Clustered Patterns: There are distinct clusters of high correlation, suggesting periods of similar trends or economic conditions.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding the correlation between years can help businesses in several ways:

- Forecasting: By identifying patterns and trends, businesses can make more accurate forecasts for future periods.
- Risk Management: Understanding periods of high or low correlation can help businesses assess potential risks and uncertainties.
- Investment Decisions: Identifying periods of strong growth or decline can guide investment decisions.
- Strategic Planning: Understanding the long-term trends can help businesses develop long-term strategies.

#### Chart - 6 Value Distribution for a Specific Indicator (Population ages 15-64, total)

In [None]:
indicator_data = main_dataset[main_dataset['Indicator Name'] == 'Population ages 15-64, total']  # Replace with any indicator name
indicator_data = indicator_data[years].mean(axis=0)

plt.figure(figsize=(14, 6))
plt.plot(indicator_data.index, indicator_data.values, marker='o', color='purple')
plt.title('Value Distribution for Population ages 15-64, total Indicator (1970-2014)')
plt.xlabel('Year')
plt.ylabel('Average Value')
plt.xticks(rotation=45)
plt.grid(True)
plt.show()

##### 1. Why did you pick the specific chart?

This line chart with data points is an effective choice to visualize trends over time. It allows us to easily identify patterns, fluctuations, and the overall direction of the value.

##### 2. What is/are the insight(s) found from the chart?

1. Overall Trend: There's a clear upward trend in the value from 1970 to 2014, suggesting a general increase in the population ages 15-64 over the period.
2. Steady Growth: The growth appears to be relatively steady, with a consistent upward slope.
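
One way to check the "steady growth" reading is to look at year-over-year percentage change rather than the level. A minimal sketch with `Series.pct_change`, on a made-up series that grows 2% per year:

```python
import pandas as pd

# Made-up series growing by 2% each year (illustrative values only)
series = pd.Series(
    [100.0, 102.0, 104.04, 106.1208],
    index=["2011", "2012", "2013", "2014"],
)

# Year-over-year growth in percent; the first year has no predecessor and is dropped
yoy = series.pct_change().dropna() * 100

# A small spread between the largest and smallest rate suggests steady growth
spread = yoy.max() - yoy.min()

print(yoy.round(2))
print(f"Spread of growth rates: {spread:.4f} percentage points")
```

Applied to the averaged indicator, the same calculation would show whether the slope is as constant as the chart suggests.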

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from this chart can be valuable for businesses in several ways:

- Market Analysis: Understanding the growth in the working-age population can help businesses identify potential markets and opportunities.
- Labor Force Planning: Businesses can anticipate changes in the labor market and plan accordingly for recruitment and workforce development.
- Consumer Behavior: The changing demographics can influence consumer preferences and behavior, which businesses can factor into their marketing and product development strategies.
- Social Responsibility: Businesses can consider the social implications of population trends, such as the need for education, healthcare, and infrastructure development.

#### Chart - 7 Region-wise Average Value Over Time

In [None]:
region_avg_values = main_dataset.groupby('Region')[years].mean()

# DataFrame.plot creates its own figure, so the size is set here rather than with plt.figure
region_avg_values.T.plot(kind='line', figsize=(12, 6))
plt.title('Region-wise Average Value Over Time (1970-2014)')
plt.xlabel('Year')
plt.ylabel('Average Value')
plt.xticks(rotation=45)
plt.legend(title='Region')
plt.grid(True)
plt.show()

##### 1. Why did you pick the specific chart?

A multi-line chart is an effective way to compare trends across different categories (in this case, regions) over time. It allows us to easily identify patterns, fluctuations, and the relative performance of each region.

##### 2. What is/are the insight(s) found from the chart?

1. Diverging Trends: The different regions show distinct trends over time.
2. Dominance of East Asia & Pacific: This region experiences the most significant growth, surpassing the other regions in the later years.
3. Slow Growth for Some Regions: Regions like North America and Europe & Central Asia show relatively slower growth compared to the other regions.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from this chart can be valuable for businesses in several ways:

- Market Targeting: Businesses can identify high-growth markets based on the trends in different regions.
- Product Development: Understanding the evolving needs of different regions can help businesses develop products and services that cater to specific segments.
- Investment Strategies: Identifying regions with high-growth potential can guide investment decisions.
- Risk Assessment: Monitoring the economic performance of different regions can help businesses assess potential risks and uncertainties.

#### Chart - 8 Income Group Distribution for a Specific Year (2010)

In [None]:
plt.figure(figsize=(10, 6))
sns.boxplot(x='Income Group', y='2010', data=main_dataset)
plt.title('Income Group Distribution for 2010')
plt.xlabel('Income Group')
plt.ylabel('Value')
plt.show()

##### 1. Why did you pick the specific chart?

A box plot is a suitable choice to visualize the distribution of values within different income groups. It shows the median, the spread of the data, and any outliers for each group.

##### 2. What is/are the insight(s) found from the chart?

1. Uneven Distribution: The data points are not evenly distributed across the income groups. Some groups, like "Upper middle income," have a wider range of values, while others, like "Low income," have a narrower range.
2. Outliers: There are some outliers in the data, especially in the "Upper middle income" group. These outliers might represent countries with significantly higher or lower values compared to the rest of the group.
3. Clustering: The data points tend to cluster around certain values, suggesting that many countries within each income group have similar economic characteristics.
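
The points a box plot draws beyond its whiskers follow, by default, the 1.5 × IQR rule. The sketch below applies that rule per group to toy data. The values and the helper name `iqr_outliers` are illustrative assumptions.

```python
import pandas as pd

# Toy values for two income groups (illustrative only)
toy = pd.DataFrame({
    "Income Group": ["Low income"] * 5 + ["Upper middle income"] * 5,
    "2010": [1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 11.0, 12.0, 13.0, 100.0],
})

def iqr_outliers(values: pd.Series) -> pd.Series:
    """Return the values lying more than 1.5 * IQR outside the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    mask = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
    return values[mask]

# Collect the outliers of each income group in a dictionary
outliers = {
    group: iqr_outliers(values).tolist()
    for group, values in toy.groupby("Income Group")["2010"]
}

print(outliers)
```

Listing the flagged rows alongside their `CountryCode` and `Indicator Name` would show which series drive the outliers in each group.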

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from this chart can be valuable for businesses in several ways:

- Market Segmentation: Businesses can identify target markets based on income group and economic characteristics.
- Risk Assessment: Understanding the economic diversity within each income group can help businesses assess potential risks and uncertainties.
- Product Development: Considering the specific needs and preferences of different income groups can help businesses develop products and services that cater to various segments.
- Investment Decisions: Identifying countries with high growth potential within each income group can guide investment decisions.

#### Chart - 9 Top Countries by Value in a Specific Indicator [Government expenditure per primary student (US$)]

In [None]:
# Select one indicator and index by country code so the bars are labeled by country
indicator_name = 'Government expenditure per primary student (US$)'
top_countries_indicator = (
    main_dataset[main_dataset['Indicator Name'] == indicator_name]
    .set_index('CountryCode')['2014']
    .nlargest(10)
)

plt.figure(figsize=(10, 6))
top_countries_indicator.plot(kind='barh', color='green')
plt.title('Top 10 Government expenditure per primary student (US$) in 2014')
plt.xlabel('Expenditure per primary student (US$)')
plt.ylabel('Country Code')
plt.show()

##### 1. Why did you pick the specific chart?

A horizontal bar chart is an effective way to compare discrete categories (in this case, countries) and their corresponding values. It's particularly useful for visualizing rankings and highlighting differences between categories.

##### 2. What is/are the insight(s) found from the chart?

1. Variability in Expenditure: The chart shows a wide range of per-student expenditure across the top 10 entries.
2. High Spenders: The top-ranking entries spend significantly more per primary student than the lower-ranking ones.
3. Ranking: The entries are ranked by government expenditure per primary student, with the highest-ranking entry spending the most.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

While this chart primarily provides information about public education spending, it can indirectly impact businesses in the following ways:

- Workforce Skills: Countries that spend more per primary student may develop a better-educated and more productive workforce, which can benefit businesses.
- Consumer Market: A better-educated population can lead to a larger consumer market with greater purchasing power.
- Social Responsibility: Businesses can align their social responsibility initiatives with gaps in education funding.

#### Chart - 10 Trend of a Specific Indicator Across Time for a Country

In [None]:
country_trend = main_dataset[main_dataset['CountryCode'] == 'USA'][['CountryCode', 'Indicator Name'] + years].set_index('Indicator Name')

# Select a specific indicator for the country
indicator = 'School life expectancy, secondary, both sexes (years)'
indicator_trend = country_trend.loc[indicator, years].values

plt.figure(figsize=(14, 6))
plt.plot(years, indicator_trend, marker='o', color='red')
plt.title(f'{indicator}, USA ({years[0]}-{years[-1]})')
plt.xlabel('Year')
plt.ylabel('School life expectancy (years)')
plt.xticks(rotation=45)
plt.grid(True)
plt.show()

##### 1. Why did you pick the specific chart?

A line chart with data points is an effective choice to visualize trends over time. It allows us to easily identify patterns, fluctuations, and the overall direction of the indicator.

##### 2. What is/are the insight(s) found from the chart?

1. Overall Trend: Secondary school life expectancy in the USA shows a general upward trend from 1970 to 2014, indicating that students are expected to spend more years in secondary education.
2. Fluctuations: The series fluctuates, with periods of increase and decline.
3. Sharp Declines: The dips in the early 1980s and late 2000s coincide with recessionary periods in the US economy, although the chart alone does not establish a causal link.
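
The three observations above (overall direction, fluctuations, and sharp declines) can be checked numerically instead of by eye. The following is a minimal sketch under the assumption that the yearly values are available as a pandas Series indexed by year; the `series` values here are invented for illustration.

```python
import pandas as pd

# Hypothetical yearly values; invented for illustration only.
series = pd.Series(
    [5.0, 5.2, 5.1, 5.4, 5.0, 5.5],
    index=["2009", "2010", "2011", "2012", "2013", "2014"],
)

yoy = series.diff()                                # change from the previous year
declines = yoy[yoy < 0]                            # years in which the value fell
overall_change = series.iloc[-1] - series.iloc[0]  # net change over the whole period
largest_drop_year = declines.idxmin()              # year with the sharpest decline

print(f"Net change over the period: {overall_change:+.2f}")
print(f"Years with a decline: {list(declines.index)}")
print(f"Sharpest decline in: {largest_drop_year}")
```

In the notebook, the same steps could be applied to the row selected from `country_trend`, after dropping years with missing values.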

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from this chart can be valuable for education stakeholders and businesses in several ways:

- Strategic Planning: Understanding the historical trend in secondary school life expectancy can help anticipate the future education level of the workforce and inform decisions about investment, expansion, and resource allocation.
- Market Analysis: Periods of rising school life expectancy can signal growing demand for education-related products and services.
- Risk Management: Periods of decline can flag emerging gaps in educational attainment, so that stakeholders can prepare contingency plans.
- Financial Planning: The trend can serve as one input when projecting future enrollment and the associated costs.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

Based on the analysis of the 10 charts, here are some recommendations to help the client achieve their business objective:

1. Identify High-Growth Markets:

- Regions: Prioritize regions like East Asia & Pacific and South Asia, which exhibit strong economic growth.
- Income Groups: Focus on upper-middle-income countries, as they have significant growth potential.

2. Optimize Resource Allocation:

- Efficient Resource Allocation: Allocate resources to regions with high growth potential and strong economic indicators.
- Diversification: Consider diversifying investments across regions to mitigate risks associated with economic fluctuations.

3. Effective Risk Management:

- Monitor Economic Indicators: Keep track of key economic indicators like GDP, inflation, and unemployment rates to assess potential risks.
- Develop Contingency Plans: Have contingency plans in place to address economic downturns or other unforeseen challenges.

4. Leverage Demographic Trends:

- Target Specific Demographics: Identify target demographics based on factors like age, income, and lifestyle preferences.
- Adapt Product Offerings: Tailor products and services to meet the evolving needs of different demographic groups.

5. Enhance Supply Chain Efficiency:

- Optimize Supply Chains: Analyze regional trends to optimize supply chain operations and reduce costs.
- Build Resilient Supply Chains: Develop resilient supply chains to mitigate disruptions caused by economic or geopolitical factors.

6. Foster Innovation and Adaptability:

- Encourage Innovation: Foster a culture of innovation to develop new products and services that meet changing consumer needs.
- Embrace Digital Transformation: Leverage digital technologies to improve efficiency, reduce costs, and enhance customer experience.

# **Conclusion**

#### Key takeaways from the analysis include:

- Global Economic Trends: The global economy has experienced significant growth, particularly in emerging markets. However, there are also periods of economic slowdown and recession.
- Regional Disparities: Different regions exhibit varying levels of economic development and growth potential.
- Demographic Shifts: Changes in population demographics, such as aging populations and urbanization, can impact consumer behavior and labor markets.

#### To maximize the benefits of these insights, businesses should:

- Monitor Economic Indicators: Keep track of key economic indicators to identify emerging trends and potential risks.
- Diversify Operations: Consider diversifying operations across different regions to mitigate risks associated with economic fluctuations.
- Adapt to Changing Market Conditions: Be agile and responsive to changing market dynamics, including shifts in consumer preferences and technological advancements.
- Invest in Innovation: Foster a culture of innovation and invest in research and development to stay ahead of the competition.
- Embrace Sustainability: Adopt sustainable practices to address environmental and social challenges while creating long-term value.