## Introduction

In [None]:
# initial exploration
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

books = pd.read_csv('books.csv')

books.head()
books.info()

# a closer look at categorical columns
books.value_counts('genre')

# .describe() numerical columns
books.describe()

# visualizing numerical data
sns.histplot(data=books, x='rating')
plt.show()

# adjusting bin width
sns.histplot(data=books, x='rating', binwidth=0.1)
plt.show()

### Counting categorical values
Recall from the previous exercise that the unemployment DataFrame contains 182 rows of country data including country_code, country_name, continent, and unemployment percentages from 2010 through 2021.

You'd now like to explore the categorical data contained in unemployment to understand the data that it contains related to each continent.

The unemployment DataFrame has been loaded for you along with pandas as pd.

> Instructions
- Use a pandas function to count the values associated with each continent in the unemployment DataFrame.

In [None]:
# Count the values associated with each continent in unemployment
print(unemployment.continent.value_counts())
# or print(unemployment.value_counts('continent'))
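
Counts are easier to compare as proportions when groups differ in size. The sketch below uses a small, invented DataFrame named toy (not the course data) to show .value_counts() with and without normalize=True.

```python
import pandas as pd

# Hypothetical stand-in for the unemployment DataFrame (values invented for illustration)
toy = pd.DataFrame({
    "country_name": ["A", "B", "C", "D", "E"],
    "continent": ["Asia", "Europe", "Asia", "Africa", "Asia"],
})

# Absolute counts per continent, most frequent first
counts = toy["continent"].value_counts()

# Relative frequencies: normalize=True divides each count by the total
shares = toy["continent"].value_counts(normalize=True)

print(counts)
print(shares)
```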

### Global unemployment in 2021
It's time to explore some of the numerical data in unemployment! What was typical unemployment in a given year? What was the minimum and maximum unemployment rate, and what did the distribution of the unemployment rates look like across the world? A histogram is a great way to get a sense of the answers to these questions.

Your task in this exercise is to create a histogram showing the distribution of global unemployment rates in 2021.

The unemployment DataFrame has been loaded for you along with pandas as pd.

> Instructions
- Import the required visualization libraries.
- Create a histogram of the distribution of 2021 unemployment percentages across all countries in unemployment; show a full percentage point in each bin.

In [None]:
# Import the required visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt

# Create a histogram of 2021 unemployment; show a full percent in each bin
sns.histplot(data=unemployment, x="2021", binwidth=1)
plt.show()
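
To see what "a full percentage point in each bin" means numerically, the following sketch counts values in one-point-wide bins with NumPy. The rates are invented, and Seaborn chooses its own bin edges, so the bars it draws may not line up exactly with these counts.

```python
import numpy as np
import pandas as pd

# Invented unemployment rates, for illustration only
rates = pd.Series([0.4, 2.1, 2.9, 3.5, 3.6, 7.2, 7.9, 12.3])

# Bin edges one percentage point apart, covering the full range of the data
edges = np.arange(np.floor(rates.min()), np.ceil(rates.max()) + 1, 1.0)

# counts[i] is the number of rates falling between edges[i] and edges[i + 1]
counts, edges = np.histogram(rates, bins=edges)

for left, n in zip(edges[:-1], counts):
    print(f"{left:>4.0f} to {left + 1:<4.0f} {n}")
```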

## Data validation

In [None]:
# updating data types
books['year'] = books['year'].astype('int')
books.dtypes

# validating categorical data
books['genre'].isin(['Fiction', 'Non Fiction']) # returns a boolean series

# use a tilde to negate the boolean series
~books['genre'].isin(['Fiction', 'Non Fiction']) # returns a boolean series of opposite values

# filter the dataframe using the boolean series
books[books['genre'].isin(['Fiction', 'Non Fiction'])]

# validating numerical data
books.select_dtypes('number').head()

books['year'].min()
books['year'].max()

# visualizing numerical data
sns.boxplot(data=books, x='year')
plt.show()

# group by genre
sns.boxplot(data=books, x='year', y='genre')
plt.show()
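
As a complement to the box plot, a range check can be expressed directly in pandas. This is a minimal sketch on an invented table called books_toy; the accepted range of 2009 to 2019 is an assumption made for the example, not a property of the real books data.

```python
import pandas as pd

# Invented rows; note that 'year' starts out as text
books_toy = pd.DataFrame({
    "name": ["Book A", "Book B", "Book C"],
    "year": ["2016", "2011", "1018"],
    "genre": ["Fiction", "Non Fiction", "Fiction"],
})

# Convert the text column to integers so numeric comparisons work
books_toy["year"] = books_toy["year"].astype("int")

# between() is inclusive on both ends by default (range chosen for illustration)
in_range = books_toy["year"].between(2009, 2019)

# Rows that fail the check and deserve a closer look
suspicious = books_toy[~in_range]
print(suspicious)
```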


### Validating continents
Your colleague has informed you that the data on unemployment from countries in Oceania is not reliable, and you'd like to identify and exclude these countries from your unemployment data. The .isin() method can help with that!

Your task is to use .isin() to identify countries that are not in Oceania. These countries should return True while countries in Oceania should return False. This will set you up to use the results of .isin() to quickly filter out Oceania countries using Boolean indexing.

The unemployment DataFrame is available, and pandas has been imported as pd.

> Instructions
- Define a Series of Booleans describing whether or not each continent is outside of Oceania; call this Series not_oceania.
- Use Boolean indexing to print the unemployment DataFrame without any of the data related to countries in Oceania.

In [None]:
# Define a Series describing whether each continent is outside of Oceania
not_oceania = ~unemployment["continent"].isin(["Oceania"])

# Print unemployment without records related to countries in Oceania
print(unemployment[not_oceania])
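
The same pattern can be traced step by step on a small, invented DataFrame (toy and its values are hypothetical), which makes it easier to see how the tilde flips the mask before it is used for filtering.

```python
import pandas as pd

# Hypothetical stand-in for the unemployment DataFrame
toy = pd.DataFrame({
    "country_name": ["A", "B", "C", "D"],
    "continent": ["Oceania", "Europe", "Asia", "Oceania"],
})

# True where the continent IS Oceania
in_oceania = toy["continent"].isin(["Oceania"])

# The tilde negates the mask: True where the continent is NOT Oceania
not_oceania = ~in_oceania

# Boolean indexing keeps only the rows where the mask is True
filtered = toy[not_oceania]
print(filtered)
```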

### Validating range
Now it's time to validate our numerical data. We saw in the previous lesson using .describe() that the largest unemployment rate during 2021 was nearly 34 percent, while the lowest was just above zero.

Your task in this exercise is to get much more detailed information about the range of unemployment data using Seaborn's boxplot, and you'll also visualize the range of unemployment rates in each continent to understand geographical range differences.

unemployment is available, and the following have been imported for you: Seaborn as sns, matplotlib.pyplot as plt, and pandas as pd.

> Instructions
- Print the minimum and maximum unemployment rates, in that order, during 2021.
- Create a boxplot of 2021 unemployment rates, broken down by continent.


In [None]:
# Print the minimum and maximum unemployment rates during 2021
print(unemployment["2021"].min(), unemployment["2021"].max())

# Create a boxplot of 2021 unemployment rates, broken down by continent
sns.boxplot(data=unemployment, x="2021", y="continent")
plt.show()
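
The numbers behind a box plot can also be computed by hand. The sketch below assumes the common 1.5 × IQR rule for flagging outliers, which matches Seaborn's default whisker setting as far as I know. The data and the helper box_stats are invented for illustration.

```python
import pandas as pd

# Invented rates for two continents
toy = pd.DataFrame({
    "continent": ["Asia"] * 5 + ["Europe"] * 5,
    "2021": [1.0, 2.0, 3.0, 4.0, 20.0, 5.0, 6.0, 7.0, 8.0, 9.0],
})

def box_stats(values: pd.Series) -> dict:
    """Hypothetical helper: quartiles plus a count of points beyond the 1.5 * IQR fences."""
    q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = values[(values < lower_fence) | (values > upper_fence)]
    return {
        "min": values.min(),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": values.max(),
        "n_outliers": len(outliers),
    }

# One row of statistics per continent
summary = pd.DataFrame(
    {continent: box_stats(group) for continent, group in toy.groupby("continent")["2021"]}
).T
print(summary)
```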

## Data summarization

In [None]:
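# mean of the numeric columns by genre
# (numeric_only=True skips text columns, which pandas 2.x would otherwise reject)
books.groupby('genre').mean(numeric_only=True)

# mean and std of the numeric columns
books.groupby('genre')[['rating', 'year']].agg(['mean', 'std'])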

# specifying aggregation functions for columns
books.agg({'rating': ['mean', 'std'], 'year': ['median']})

In [None]:
# named summary columns
books.groupby('genre').agg(
    mean_rating=('rating', 'mean'),
    std_rating=('rating', 'std'),
    median_year=('year', 'median')
)

In [None]:
# visualizing categorical summaries
sns.barplot(data=books, x='genre', y='rating')
plt.show()


### Named aggregations
You've seen how .groupby() and .agg() can be combined to show summaries across categories. Sometimes, it's helpful to name new columns when aggregating so that it's clear in the code output what aggregations are being applied and where.

Your task is to create a DataFrame called continent_summary which shows a row for each continent. The DataFrame columns will contain the mean unemployment rate for each continent in 2021 as well as the standard deviation of the 2021 unemployment rate. And of course, you'll rename the columns so that their contents are clear!

The unemployment DataFrame is available, and pandas has been imported as pd.

> Instructions
- Create a column called mean_rate_2021 which shows the mean 2021 unemployment rate for each continent.
- Create a column called std_rate_2021 which shows the standard deviation of the 2021 unemployment rate for each continent.

In [None]:
continent_summary = unemployment.groupby("continent").agg(
    # Create the mean_rate_2021 column
    mean_rate_2021 = ('2021', 'mean'),
    # Create the std_rate_2021 column
    std_rate_2021 = ('2021', 'std'),
)
print(continent_summary)
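
Because the real data isn't reproduced here, the following sketch applies the same named-aggregation pattern to a tiny, invented DataFrame so the results can be checked by hand. The extra n_countries column is an addition for illustration, not part of the exercise.

```python
import pandas as pd

# Invented rates: two countries per continent
toy = pd.DataFrame({
    "continent": ["Asia", "Asia", "Europe", "Europe"],
    "2021": [2.0, 4.0, 5.0, 9.0],
})

# Each keyword becomes an output column: new_name=(source_column, function)
summary = toy.groupby("continent").agg(
    mean_rate_2021=("2021", "mean"),
    std_rate_2021=("2021", "std"),   # sample standard deviation (ddof=1)
    n_countries=("2021", "count"),   # hypothetical extra column
)
print(summary)
```

Reporting the group size next to the mean and standard deviation helps judge how much weight each row deserves.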

### Visualizing categorical summaries
As you've learned in this chapter, Seaborn has many great visualizations for exploration, including a bar plot for displaying an aggregated average value by category of data.

In Seaborn, bar plots include a vertical error bar indicating the 95% confidence interval for each category's mean. Since confidence intervals are calculated using both the number of values and the variability of those values, they give a helpful indication of how far each mean can be relied upon.

Your task is to create a bar plot to visualize the means and confidence intervals of unemployment rates across the different continents.

unemployment is available, and the following have been imported for you: Seaborn as sns, matplotlib.pyplot as plt, and pandas as pd.

> Instructions
- Create a bar plot showing continents on the x-axis and their respective average 2021 unemployment rates on the y-axis.

In [None]:
# Create a bar plot of continents and their 2021 average unemployment
sns.barplot(data=unemployment, x="continent", y="2021")
plt.show()
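
To make the confidence interval less of a black box, here is a rough sketch of a percentile bootstrap for one group's mean. To my understanding Seaborn's default error bars use a similar resampling approach, but this is not its implementation. The rates, the seed, and the 1,000 resamples are choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Invented 2021 rates for the countries of one continent
rates = np.array([3.1, 4.7, 5.2, 6.8, 7.4, 9.9, 12.5])

# Resample the countries with replacement and record the mean each time
boot_means = np.array([
    rng.choice(rates, size=rates.size, replace=True).mean()
    for _ in range(1000)
])

# The middle 95% of the resampled means gives the interval
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"mean = {rates.mean():.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

With fewer countries or more spread-out rates, the resampled means vary more and the interval widens, which is why short error bars suggest a more dependable mean.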