
In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In this notebook, we explore various visualization techniques using the **World Bank dataset**. This dataset contains a wide range of information about countries and their development statistics, including economic indicators, health metrics, and demographic data.


In [14]:
!pip install wbgapi

Collecting wbgapi
  Downloading wbgapi-1.0.12-py3-none-any.whl.metadata (13 kB)
Downloading wbgapi-1.0.12-py3-none-any.whl (36 kB)
Installing collected packages: wbgapi
Successfully installed wbgapi-1.0.12


In [None]:
# Imported without the usual `wb` alias, because `wb` is used below as the name of the DataFrame
import wbgapi

In [None]:
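# Local path to the CSV file (a file path, not a URL)
csv_path = "/content/world_bank.csv"
# Use the first CSV column as the row index
wb = pd.read_csv(csv_path, index_col=0)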
wb.head()

# Bar Plots
- Bar plots are a fundamental visualization tool used to display the **distribution of a categorical variable**.

- They are useful for comparing the frequency or count of different categories within a dataset.





In this example, we will visualize the distribution of countries across different continents.

We will use the value_counts() method to count the occurrences of each unique value in the Continent column of our World Bank dataset. Then, we will plot these counts using plt.bar.

In [None]:
wb["Continent"].value_counts()
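
To make the return value of value_counts() concrete, here is a minimal sketch on a made-up Series (the `toy` data is hypothetical and only stands in for the Continent column). It shows that the categories end up in the index, the counts in the values, and that the result is sorted from most to least frequent.

```python
import pandas as pd

# Hypothetical toy data standing in for the "Continent" column
toy = pd.Series(
    ["Asia", "Africa", "Asia", "Europe", "Asia", "Africa"], name="Continent"
)

# Categories go to counts.index, their frequencies to counts.values,
# sorted from the most frequent category to the least frequent one
counts = toy.value_counts()
print(counts)

# normalize=True returns proportions instead of raw counts
shares = toy.value_counts(normalize=True)
print(shares)
```

This is why plt.bar can be given the index as the x positions and the values as the bar heights.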

# Bar plot using matplotlib



In [None]:
# Count the number of countries in each continent
continents = wb["Continent"].value_counts()

# Create a bar plot of these counts: one bar per continent
plt.bar(continents.index, continents.values)

# Set the x-axis label
plt.xlabel("Continents")

# Set the y-axis label
plt.ylabel("Counts")

# Set the title of the plot
plt.title("Distribution of countries across the continents");

# Bar plot using Pandas library
The pandas library is a powerful tool for data manipulation and analysis.
This example will show you how to use pandas' native plotting tools to create informative bar charts.

In [None]:
# Count the number of countries in each continent and plot it
wb["Continent"].value_counts().plot(kind='bar')

# Set the y-axis label
plt.ylabel("Counts")

# Set the title of the plot
plt.title("Distribution of countries across the continents");

# Bar plot using seaborn

- Equivalently, we could use the countplot function of the seaborn library to create our bar plot.

- The countplot function is particularly useful for visualizing the distribution of categorical data.

- It automatically counts the occurrences of each unique value in the specified column and creates a bar for each category.

- In the example below, we will use sns.countplot to visualize the distribution of countries across different continents. This will help us understand how many countries belong to each continent in our World Bank dataset.

In [None]:
# Create a count plot for the 'Continent' column in the World Bank dataset
sns.countplot(data=wb, x='Continent', hue='Continent')
# hue='Continent' tells seaborn to color the bars by continent, so each continent gets its own color.
# Whether a legend is also drawn depends on the seaborn version, since the hue here repeats the x variable.

# Set the title of the plot
plt.title("Distribution of countries across the continents");
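
As a small variation, the bars can be sorted by frequency with the `order` parameter of sns.countplot. The sketch below uses a hypothetical toy DataFrame rather than the World Bank data; the sorting idea (reusing the index of value_counts()) is an illustration, not something the original notebook does.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical toy data standing in for the World Bank DataFrame
toy = pd.DataFrame(
    {"Continent": ["Asia", "Africa", "Asia", "Europe", "Asia", "Africa"]}
)

# value_counts() sorts categories from most to least frequent,
# so its index can be reused as the bar order
order = toy["Continent"].value_counts().index

ax = sns.countplot(data=toy, x="Continent", order=order)
ax.set_title("Countries per continent (toy data)")

# Heights of the drawn bars, from left to right
bar_heights = [patch.get_height() for patch in ax.patches]
```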

- Bar plots work best for categories, not numbers.

- If we use sns.countplot on a quantitative variable, it creates a bar for every unique value, making the plot messy and unhelpful.

To visualize the distribution of a continuous variable, we use different types of plots:

- Histogram
- Box plot
- Violin plot

In [None]:
# Anti-example: countplot draws one bar per unique GNI value, so the plot is unreadable
sns.countplot(data=wb, x='Gross national income per capita, Atlas method: $: 2016')
plt.title("GNI distribution for different countries");

# Box Plots and Violin Plots

Box plots and violin plots both show the distribution of a variable.

- Box plots display the median, the middle 50% of the data (IQR), and identify outliers.

- Violin plots show the shape and density of the data, combining a box plot with a smooth distribution curve.
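
The numbers a box plot summarizes can be computed by hand. The sketch below uses made-up values and assumes the common 1.5 × IQR rule for flagging outliers, which is the default whisker setting in matplotlib and seaborn.

```python
import numpy as np

# Hypothetical values with one clearly extreme observation
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])

# Quartiles: the box spans q1 to q3, with a line at the median
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are drawn individually as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3, iqr)
print(outliers)
```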

Both are useful for comparing distributions across groups (e.g., comparing income across continents).

For example, we can use these plots to compare the distribution of Gross National Income per capita across different continents in our World Bank dataset.
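
A grouped comparison only needs a categorical variable on one axis. The sketch below uses a hypothetical toy DataFrame with shortened column names (`Continent`, `GNI`); with the World Bank data, the full GNI column name would go in `y` instead.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical toy data: GNI per capita for a few countries in two continents
toy = pd.DataFrame({
    "Continent": ["Asia"] * 4 + ["Europe"] * 4,
    "GNI": [1000, 3000, 8000, 40000, 12000, 30000, 45000, 60000],
})

# One box per continent, so the distributions can be compared side by side
ax = sns.boxplot(data=toy, x="Continent", y="GNI")
ax.set_title("GNI per capita by continent (toy data)")
```

Replacing sns.boxplot with sns.violinplot gives the violin version of the same comparison.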

Atlas method: the World Bank’s way of converting a country’s GNI (Gross National Income) into U.S. dollars using a 3-year average exchange rate adjusted for inflation differences. This smooths out large currency swings.

$: Values are in U.S. dollars

2016: The year of the data (GNI per capita for 2016).

In [None]:
sns.boxplot(data=wb, y="Gross national income per capita, Atlas method: $: 2016")
plt.title("The distribution of GNI per capita in different countries");

In [None]:
sns.violinplot(data=wb, y="Gross national income per capita, Atlas method: $: 2016")
plt.title("The distribution of GNI per capita in different countries");

# Histograms


# Matplotlib histogram
A histogram groups continuous data into bins and shows how many values fall in each range.

It’s great for seeing a variable’s distribution: center, spread, skew, peaks, and outliers.

Example: a GNI-per-capita histogram shows global income spread and extreme values.


The density=True argument changes what the height of the bars represents in a histogram.

Without density=True:
The bar heights show counts, i.e. how many data points fall in each bin.

With density=True:
The bar heights show density, i.e. the proportion of the data in each bin divided by the bin width, so that the total area under the histogram equals 1.
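
The relationship between counts and density can be checked numerically. This is a minimal sketch on made-up values using np.histogram, which performs the same binning that a histogram plot displays.

```python
import numpy as np

# Hypothetical values, binned into 4 equal-width bins
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 9])

counts, edges = np.histogram(values, bins=4)
widths = np.diff(edges)

# Density = share of the data in the bin, divided by the bin width
density = counts / (counts.sum() * widths)

# The bar areas (height * width) add up to 1
total_area = (density * widths).sum()

# NumPy computes the same heights when density=True is passed
density_np, _ = np.histogram(values, bins=4, density=True)

print(counts)
print(density)
print(total_area)
```

Because of the division by the bin width, a density value is not itself a proportion and can exceed 1 when the bins are narrow.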

In [None]:

# Select the GNI column and drop missing values explicitly before plotting
gni = wb["Gross national income per capita, Atlas method: $: 2016"].dropna()

# Draw the density-scaled histogram;
# the `edgecolor` argument controls the color of the bin edges (white here)
plt.hist(gni, density=True, edgecolor="white")

# Add labels
plt.xlabel("Gross national income per capita")
plt.ylabel("Density")
plt.title("Distribution of gross national income per capita");

# Seaborn histogram

sns.histplot(..., stat="density") produces the same density interpretation as density=True but with Seaborn’s nicer defaults (smart binning, style, and automatic NaN handling).

Use this when you want a clean look and easy access to seaborn features (like hue=).


stat="density" → area under bars = 1.

Seaborn chooses bins automatically (bins='auto') unless you set bins=....


In [None]:
sns.histplot(data=wb, x="Gross national income per capita, Atlas method: $: 2016", stat="density")
plt.title("Distribution of gross national income per capita");

**Comparing Distributions Across Categories**

- Now we want to compare the distribution of GNI per capita between groups.

- We use the hue parameter in sns.histplot to assign different colors to each category.

- This allows us to overlay histograms (or density curves) in one plot.

- The legend shows which color corresponds to each hemisphere.

- The legend is important because color is being used to encode meaningful information.

In [None]:
# Create a new variable to store the hemisphere in which each country is located
north = ["Asia", "Europe", "N. America"]
south = ["Africa", "Oceania", "S. America"]

# isin(north) is True for rows whose Continent is in the north list;
# assign the string "Northern" to the Hemisphere column for those rows.
wb.loc[wb["Continent"].isin(north), "Hemisphere"] = "Northern"

# Likewise, assign "Southern" where Continent is in the south list.
wb.loc[wb["Continent"].isin(south), "Hemisphere"] = "Southern"
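
An alternative way to build the same column is a lookup dictionary with Series.map. This is a sketch on a hypothetical toy DataFrame; the dictionary name `continent_to_hemisphere` is introduced here for illustration only.

```python
import pandas as pd

# Hypothetical toy data with the same continent labels as above
toy = pd.DataFrame({
    "Continent": ["Asia", "Africa", "Europe", "Oceania", "N. America", "S. America"]
})

# Lookup table from continent to hemisphere (same grouping as the lists above)
continent_to_hemisphere = {
    "Asia": "Northern", "Europe": "Northern", "N. America": "Northern",
    "Africa": "Southern", "Oceania": "Southern", "S. America": "Southern",
}

# map() replaces each continent by its hemisphere;
# continents missing from the dictionary would become NaN
toy["Hemisphere"] = toy["Continent"].map(continent_to_hemisphere)
print(toy)
```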

In [None]:
sns.histplot(data=wb, x="Gross national income per capita, Atlas method: $: 2016", hue="Hemisphere", stat="density")
plt.title("Distribution of gross national income per capita highlighted for different hemispheres");

# Scatter plots
Scatter plots are used to visualize the relationship between two quantitative continuous variables.


# matplotlib scatter

What it shows: Relationship between % growth per capita (2016) and female adult literacy (2005–2014).

Each dot = one country.

Single color; no automatic legend.

We add axis labels manually.

In [None]:
plt.scatter(wb['per capita: % growth: 2016'], wb['Adult literacy rate: Female: % ages 15 and older: 2005-14'])
plt.xlabel("% growth per capita")
plt.ylabel("Female adult literacy rate");

# Seaborn scatter (grouped by continent)

Same relationship, but hue="Continent" colors points by continent.

Seaborn adds a legend automatically and uses nicer defaults.

data=wb lets us reference columns by name (x=..., y=...).

In [None]:
sns.scatterplot(data=wb, x='per capita: % growth: 2016',
                y='Adult literacy rate: Female: % ages 15 and older: 2005-14', hue="Continent")
plt.xlabel("% growth per capita")
plt.ylabel("Female adult literacy rate");

The plots above suffer from **overplotting**: many scatter points are stacked on top of one another (particularly in the upper right region of the plot).

- **Jittering** is a process used to address overplotting. A small amount of random noise is added to the x and y values of all data points.
- **Jitter** only improves visibility. The underlying data are unchanged, so jittered values should be used for plotting only, not for analysis.

Decreasing the size of each scatter point using the s parameter of plt.scatter also helps.

In [None]:
# np.random.uniform(a, b, n) draws n random numbers uniformly between a and b.
# len(wb) gives one noise value per row (country), so every point gets its own jitter.
random_x_noise = np.random.uniform(-1, 1, len(wb))
random_y_noise = np.random.uniform(-5, 5, len(wb))

plt.scatter(wb['per capita: % growth: 2016'] + random_x_noise,
            wb['Adult literacy rate: Female: % ages 15 and older: 2005-14'] + random_y_noise, s=15)

plt.xlabel("% growth per capita (jittered)")
plt.ylabel("Female adult literacy rate (jittered)");
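
The jitter step can be wrapped in a small helper with a seeded random generator, so that the same plot is produced on every run. The function name `jitter` and the toy values below are hypothetical; this is a sketch of the idea, not part of the original notebook.

```python
import numpy as np


def jitter(values, amount, rng):
    """Return a copy of `values` with uniform noise in [-amount, amount] added."""
    values = np.asarray(values, dtype=float)
    return values + rng.uniform(-amount, amount, size=values.shape)


# A seeded generator makes the jitter reproducible
rng = np.random.default_rng(42)

# Hypothetical literacy rates with many identical values (overplotting)
literacy = np.array([99.0, 99.0, 99.0, 98.0, 98.0])
jittered = jitter(literacy, amount=0.5, rng=rng)

print(jittered)
```

The original array is left untouched, which keeps the jittered values confined to the plot.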

# Linear Fit (sns.lmplot)

**sns.lmplot(...)** makes a scatter plot and fits a linear regression line (with a 95% confidence band).

- x: per-capita growth (2016)
- y: female adult literacy (2005–2014).

Use it to quickly see the trend (positive/negative) and spot outliers.
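
The slope and intercept of such a straight-line fit can also be computed directly. The sketch below uses np.polyfit on hypothetical toy data with shortened column names; it performs an ordinary least-squares fit of degree 1, which is the kind of line a default linear regression plot draws. Rows with missing values are dropped first.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data lying exactly on the line y = 2x + 1,
# plus one row with a missing x value
toy = pd.DataFrame({
    "growth": [0.0, 1.0, 2.0, 3.0, np.nan],
    "literacy": [1.0, 3.0, 5.0, 7.0, 9.0],
})

# Drop incomplete rows before fitting
clean = toy.dropna()

# Degree-1 least-squares fit: returns the slope first, then the intercept
slope, intercept = np.polyfit(clean["growth"], clean["literacy"], deg=1)
print(slope, intercept)
```

A positive slope corresponds to an upward trend in the plot, a negative slope to a downward one.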


In [None]:
sns.lmplot(data=wb, x='per capita: % growth: 2016',
           y='Adult literacy rate: Female: % ages 15 and older: 2005-14');

# Joint Distribution (sns.jointplot)
**sns.jointplot(...)** shows a scatter plot of the two variables in the center, and also shows histograms of each variable along the top and right.

- x: per-capita growth (2016)
- y: female adult literacy (2005–2014).

This lets us see both the relationship between the variables and the distribution of each variable at the same time.


In [None]:
sns.jointplot(data=wb, x='per capita: % growth: 2016',
              y='Adult literacy rate: Female: % ages 15 and older: 2005-14');

# Hex Plots
Instead of plotting every point, we show how dense the data is in 2D.
The plane is divided into hexagonal bins, and each hexagon’s darkness shows how many points fall there: darker means more points.

In [None]:
sns.jointplot(data=wb, x='per capita: % growth: 2016',
              y='Adult literacy rate: Female: % ages 15 and older: 2005-14',
              kind='hex');
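
The counting behind a hex plot can be inspected with matplotlib's own hexbin function. This is a sketch on randomly generated points (hypothetical data, seeded for reproducibility); the `gridsize` and `cmap` values are arbitrary choices for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data: 300 points from a 2D standard normal distribution
rng = np.random.default_rng(1)
x = rng.normal(size=300)
y = rng.normal(size=300)

fig, ax = plt.subplots()

# gridsize controls how many hexagons span the x direction
hb = ax.hexbin(x, y, gridsize=10, cmap="Blues")
fig.colorbar(hb, ax=ax, label="Points per hexagon")

# get_array() returns the number of points counted in each hexagon
counts = hb.get_array()
print(counts.sum(), counts.max())
```

A smaller gridsize gives larger hexagons and a coarser picture, in the same way that fewer bins coarsen a histogram.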

# Contour Plots
They work like topographic maps: each contour line connects points with the same estimated density.
Darker filled regions mean more data points are concentrated there, and closely spaced lines mean the density changes quickly.

In [None]:
sns.kdeplot(data=wb, x='per capita: % growth: 2016',
            y='Adult literacy rate: Female: % ages 15 and older: 2005-14', fill=True);