# Data Visualization with Python

- Data visualization is the graphical representation of information and data. By using visual elements like charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data.

#### Why learn data visualization?
- Learning data visualization is essential for interpreting data effectively. 
- It helps in making data-driven decisions, discovering insights, and communicating findings in a clear and compelling manner.

#### Introduction to Plotly

- Plotly is an interactive, open-source plotting library that supports over 40 unique chart types, covering a wide range of statistical, financial, geographic, scientific, and 3-dimensional use cases. 
- Plotly can be used to create beautiful and interactive visualizations in Python.

In [None]:
# Import necessary libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go

In [None]:
# Load and display the Gapminder dataset
df = px.data.gapminder()

df.head()

In [None]:
df.info()

In [None]:
df['year'].value_counts()

### 1. Line Chart
- **Explanation**: A line chart displays information as a series of data points called 'markers' connected by straight line segments. It is similar to an area chart but without the filled area.
- **Best Use**: Use a line chart to visualize data points over a continuous interval or time period. It's ideal for showing trends over time, such as stock prices, temperature changes, or sales figures.


In [None]:
help(px.line)

In [None]:
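# Define the countries to compare first, then preview the matching rows.
selected_countries = ['China', 'India', 'United States', 'Indonesia', 'Brazil', 'Nigeria']
df[df['country'].isin(selected_countries)]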

In [None]:
# Line chart showing population growth over time for selected countries.

selected_countries = ['China', 'India', 'United States', 'Indonesia', 'Brazil', 'Nigeria']

fig = px.line(df[df['country'].isin(selected_countries)], x='year', y='pop', color='country',
              width= 1000, height= 600, title='Population Growth Over Time for Selected Countries')

fig.show()

**Insights**:
- **Population Growth**:
  - **China** and **India** show significant population growth, reflecting their status as the two most populous countries.
  - **United States** shows steady growth.
  - **Indonesia** and **Brazil** also show substantial growth, but at a slower pace compared to China and India.

### 2. Area Chart

- **Explanation**: An area chart is similar to a line chart but with the area below the line filled in. It is useful for showing trends over time among related attributes.
- **Best Use**: Use an area chart when you want to display the cumulative effect or the overall volume. It works well for data that represents part-to-whole relationships or when the total value is significant.

In [None]:
help(px.area)

In [None]:
# Area chart showing GDP per capita over time for different continents.

fig = px.area(df, x='year', y='gdpPercap', color='continent', line_group='country', title='GDP per Capita Over Time by Continent')
fig.show()

In [None]:
df2 = df[["country", "continent","year", "gdpPercap"]]

df3 = df2.groupby(['continent', 'year'])['gdpPercap'].mean().reset_index()

In [None]:
df3

**Insights**:
- **Trends in Economic Growth**:
  - **Europe** shows consistent and high economic growth over the years.
  - **Asia** shows a significant upward trend, particularly in recent decades, indicating rapid economic growth in countries like China and India.
  - **Africa** and the **Americas** have lower GDP per capita values than Europe; Africa shows gradual improvement, while the Americas exhibit a mix of growth and stagnation in different periods. (This dataset groups North and South America into a single `Americas` continent.)
  - **Oceania** has a small representation but shows high GDP per capita, driven mainly by Australia and New Zealand.


In [None]:
# Practice Question: Create an area chart showing population over time by continent.

### 3. Bar Chart
- **Explanation**: A bar chart represents categorical data with rectangular bars with heights or lengths proportional to the values they represent.
- **Best Use**: Use a bar chart to compare different categories of data. It works well for showing relative quantities, such as the population of different countries or sales figures for different products.


In [None]:
df.head()

In [None]:
# Compare two aggregations side by side; only the mean is meaningful for life expectancy.
avg_lifeExp = df.groupby('continent')['lifeExp'].agg(['mean', 'sum']).reset_index()
avg_lifeExp

In [None]:
avg_lifeExp = df.groupby('continent')['lifeExp'].mean().reset_index()
avg_lifeExp

In [None]:
help(px.bar)

In [None]:
# Bar chart showing the average life expectancy by continent.

avg_lifeExp = df.groupby('continent')['lifeExp'].mean().reset_index()
fig = px.bar(avg_lifeExp, x='continent', y='lifeExp', title='Average Life Expectancy by Continent')
fig.show()

**Insights**:
- **Life Expectancy**:
  - **Europe** and **Oceania** have the highest average life expectancy.
  - **Africa** has the lowest, indicating significant health challenges and disparities.

### 4. Histogram Chart
- **Explanation**: A histogram is a graphical representation of the distribution of numerical data. It groups data into bins or intervals and counts the number of data points in each bin.
- **Best Use**: Use a histogram when you need to understand the distribution of a dataset, especially for continuous data. It's great for identifying the frequency of different ranges of values within your data.
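
To make the binning step concrete, here is a small sketch using NumPy's `np.histogram` on a handful of made-up values (they are illustrative only and do not come from the Gapminder data). It shows the counting that a histogram performs before anything is drawn.

```python
import numpy as np

# Made-up sample values, purely for illustration.
values = np.array([42.0, 47.5, 51.0, 58.2, 63.1, 66.4, 69.9, 71.3, 74.8, 79.0])

# Four equal-width bins over 40 to 80: [40, 50), [50, 60), [60, 70), [70, 80].
counts, edges = np.histogram(values, bins=4, range=(40, 80))

# Print each bin's range together with the number of values that fall in it.
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.0f}-{right:.0f}: {count}")
```

In `px.histogram`, the `nbins` argument sets an upper limit on the number of bins, which changes how coarse or fine the distribution looks.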


In [None]:
help(px.histogram)

In [None]:
# Histogram of life expectancy.

fig = px.histogram(df, x='lifeExp', title='Distribution of Life Expectancy')
fig.show()

**Insights**:
- **Distribution**:
  - Most observations (one per country per survey year) have a life expectancy between 50 and 80 years.
  - There are peaks around 60 and 70 years, indicating common ranges of life expectancy.
  - Few observations fall below 50, which could indicate poor health conditions and economic challenges in those countries and years.

In [None]:
df[df['lifeExp'] < 30]

In [None]:
# Practice Question: Create a histogram showing the distribution of GDP per capita.

### 5. Scatter Plot
- **Explanation**: A scatter plot uses dots to represent the values of two different variables. The position of each dot on the horizontal and vertical axis indicates values for an individual data point.
- **Best Use**: Use a scatter plot to determine the relationship or correlation between two variables. It's useful for spotting trends, clusters, and outliers.


In [None]:
help(px.scatter)

In [None]:
# Scatter plot of GDP per capita vs life expectancy.

fig = px.scatter(df, x='gdpPercap', y='lifeExp', color='continent', hover_name='country', log_x=True, title='GDP per Capita vs Life Expectancy')
fig.show()


In [None]:
df[['gdpPercap', 'lifeExp']].corr()

**Insights**:
- **Correlation**:
  - There is a positive correlation between GDP per capita and life expectancy.
  - Wealthier countries tend to have higher life expectancy.
  - Outliers include countries with high GDP but relatively lower life expectancy and vice versa, indicating factors other than economic wealth impacting health outcomes.

In [None]:
# Practice Question: Create a scatter plot showing life expectancy vs population, colored by continent.

### 6. Box Plot
- **Explanation**: A box plot (or box-and-whisker plot) displays the distribution of a dataset based on a five-number summary: minimum, first quartile, median, third quartile, and maximum.
- **Best Use**: Use a box plot to understand the spread and skewness of your data. It’s particularly useful for identifying outliers and comparing distributions across multiple groups.

In [None]:
# Box plot of GDP per capita by continent.

fig = px.box(df, x='continent', y='gdpPercap', title='GDP per Capita by Continent')
fig.show()

**Insights**:
- **Economic Disparities**:
  - **Europe** and the **Americas** (North and South America form a single continent in this dataset) show higher median GDP per capita than Africa and Asia.
  - **Africa** and **Asia** have lower median GDP per capita, but Asia has a wide range, indicating significant economic disparities within the continent.
  - **Oceania** shows high GDP per capita with less variation.

**More on the boxplot:**
- A `box plot` summarizes the *distribution* of the data through five summary statistics:

*   **Minimum:** The smallest number in the dataset excluding the outliers.
*   **First quartile:** Middle number between the `minimum` and the `median`.
*   **Second quartile (Median):** Middle number of the (sorted) dataset.
*   **Third quartile:** Middle number between `median` and `maximum`.
*   **Maximum:** The largest number in the dataset excluding the outliers.

<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork/labs/Module%203/images/boxplot_complete.png" width="440" align="center">
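
The five values can be computed by hand. The sketch below uses made-up numbers (illustrative only) and assumes the common 1.5 × IQR rule for separating outliers from the whiskers, with NumPy's default linear interpolation for the quartiles. Plotly may compute quartiles with a slightly different method, so its boxes can differ a little from these numbers.

```python
import numpy as np

# Made-up sample values with one obvious outlier, purely for illustration.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

# Quartiles using NumPy's default linear interpolation.
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Fences under the assumed 1.5 * IQR rule.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points that lie inside the fences.
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(whisker_low, q1, median, q3, whisker_high, outliers)
```

Here the value 30 lies beyond the upper fence, so it would be drawn as an individual point and the upper whisker would stop at 9.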

### 7. Pie Chart
- **Explanation**: A pie chart is a circular statistical graphic divided into slices to illustrate numerical proportions. Each slice represents a category’s contribution to the whole.
- **Best Use**: Use a pie chart to show the proportion of different categories within a whole. It's best for displaying data with a small number of categories that sum up to a meaningful whole, like market share or budget distribution.

In [None]:
df[df['year'] == 2007]

In [None]:
latest_year = df['year'].max()
pop_by_continent = df[df['year'] == latest_year].groupby('continent')['pop'].sum().reset_index()
pop_by_continent

In [None]:
# Pie chart showing the proportion of the world population by continent,
# i.e., the population distribution by continent for the latest year in the dataset.

latest_year = df['year'].max()
pop_by_continent = df[df['year'] == latest_year].groupby('continent')['pop'].sum().reset_index()
fig = px.pie(pop_by_continent, values='pop', names='continent', title='World Population by Continent')
fig.show()

**Insights**:
- **Demographic Distribution**:
  - **Asia** holds the largest proportion of the world population.
  - **Africa** has a significant share, reflecting its growing population.
  - **Europe**, the **Americas**, and **Oceania** have smaller shares than Asia. (This dataset groups North and South America into a single `Americas` continent.)

### 8. Heatmap
- **Explanation**: A heatmap is a data visualization technique that shows the magnitude of a phenomenon as color in two dimensions.
- **Best Use**: Use a heatmap to visualize the correlation between variables in a dataset. It's excellent for showing relationships between two factors and for spotting patterns, correlations, or anomalies.


In [None]:
corr_matrix = df[['gdpPercap', 'lifeExp', 'pop']].corr()

corr_matrix

In [None]:
# Heatmap of the correlation between different variables.

corr_matrix = df[['gdpPercap', 'lifeExp', 'pop']].corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()

**Insights**:
- **Correlations**:
  - Moderate positive correlation between GDP per capita and life expectancy (Pearson r of roughly 0.58 on the untransformed values).
  - Population shows little correlation with either GDP per capita or life expectancy, indicating other factors at play in determining economic wealth and health outcomes.


---
_**Your Dataness**_,  
`Obinna Oliseneku` (_**Hybraid**_)  
**[LinkedIn](https://www.linkedin.com/in/obinnao/)** | **[GitHub](https://github.com/hybraid6)**  