# Walkthroughs and Exercises for GenAI-Powered Data Analysis in Python

Dr. Chester Ismay

# Data Analytics Kickoff + Course Goals

## Walkthrough #1: Setting Up the Python Environment

If you haven’t already installed Python, Jupyter, and the necessary
packages, follow the setup instructions in the course repo’s
[README](https://github.com/ismayc/oreilly-genai-powered-data-analysis-with-python/blob/main/README.md).

If you aren’t able to do this on your machine, you may want to check out
[Google Colab](https://colab.research.google.com/). It’s a free service
that allows you to run Jupyter notebooks in the cloud.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

In [None]:
# For plotly to load directly in Jupyter notebook
import plotly.offline as pyo
pyo.init_notebook_mode(connected=True)

## Exercise #1: Setting Up the Python Environment

Follow the instructions above in Walkthrough #1 to check for correct
installation of necessary packages. We’ll wait a few minutes to make
sure as many of you are set up as possible. Please give a thumbs up in
the pulse check if you are ready to move on.

We’ll work with ChatGPT as our GenAI tool. If you are getting errors at
this point and would like to ask it for assistance, go for it! We’ll
make more use of it throughout the course, and I’ll give tips along the
way too.

------------------------------------------------------------------------

# *Day 1: Prompt to Wrangle*

------------------------------------------------------------------------

# Module 1: Data Wrangling with Pandas

## Walkthrough #2: Cleaning and Preparing Data with Pandas

### Import data from a CSV or from an Excel file

In [None]:
# Load the data from a CSV file
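
As a minimal sketch of the loading step, the block below reads CSV text with `pd.read_csv`. The course file's path is not shown here, so the CSV content, the `country` column, and every value are invented stand-ins fed through `io.StringIO`; with a real file you would pass its path instead.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for the course's economies file
csv_text = """country,year,income_group,gdp_percapita,inflation_rate
Country A,2010,Low income,500.0,2.0
Country A,2020,Low income,550.0,5.5
Country B,2010,High income,30000.0,1.5
Country B,2020,High income,28000.0,0.0
"""

# pd.read_csv accepts a file path or any file-like object
economies = pd.read_csv(io.StringIO(csv_text))
print(economies.shape)
```

`pd.read_excel` plays the same role for Excel files, provided an engine such as `openpyxl` is installed.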


### Perform an initial exploration of the data

In [None]:
# Display the first few rows of the DataFrame


In [None]:
# Display the information about the DataFrame


In [None]:
# Display summary statistics of the DataFrame
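
The three exploration steps above might look like the sketch below. The small `economies` DataFrame is a made-up stand-in (the `country` column and all values are assumptions), not the course data.

```python
import pandas as pd

# Made-up stand-in for the economies data; one gdp_percapita value is missing
economies = pd.DataFrame({
    "country": ["Country A", "Country A", "Country B", "Country B"],
    "year": [2010, 2020, 2010, 2020],
    "income_group": ["Low income", "Low income", "High income", "High income"],
    "gdp_percapita": [500.0, 550.0, 30000.0, None],
})

first_rows = economies.head(3)   # first 3 rows (the default is 5)
economies.info()                 # prints dtypes and non-null counts per column
summary = economies.describe()   # summary statistics for the numeric columns

print(first_rows)
print(summary)
```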


### Handle missing data

#### Remove rows

In [None]:
# Remove rows with any missing values


In [None]:
# Remove rows with missing values in specific columns
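
Here is one way the two row-removal steps could be written with `dropna()`. The DataFrame is an invented example, and `no_missing` and `gdp_present` are hypothetical variable names.

```python
import numpy as np
import pandas as pd

# Invented example with one missing value in each numeric column
economies = pd.DataFrame({
    "country": ["Country A", "Country B", "Country C"],
    "gdp_percapita": [500.0, np.nan, 12000.0],
    "inflation_rate": [2.0, 3.5, np.nan],
})

# Drop every row that has at least one missing value
no_missing = economies.dropna()

# Drop rows only when gdp_percapita is missing
gdp_present = economies.dropna(subset=["gdp_percapita"])

print(no_missing)
print(gdp_present)
```

Both calls return new DataFrames, so `economies` itself still has all three rows.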


#### Replace missing values with specific value

In [None]:
# Replace missing values with a specific value (e.g., 0 for numerical columns, 'Unknown' for categorical columns)
economies_fill_value = economies.fillna({
    'gdp_percapita': 0,
    'gross_savings': 0,
    'inflation_rate': 0,
    'total_investment': 0,
    'unemployment_rate': 0,
    'exports': 0,
    'imports': 0,
    'income_group': 'Unknown'
})

# Display the DataFrame after replacing missing values with specific values


### Convert a column to a different data type

In [None]:
# Change year to be a string instead of an integer


# Display the information on the DataFrame with year as a string


In [None]:
# Change the year of string type back to integer


# Display the information on the DataFrame with year as an integer
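
A small sketch of the round trip between integer and string years using `astype()`. The data and the names `economies_str` and `economies_int` are assumptions for illustration.

```python
import pandas as pd

# Invented example data
economies = pd.DataFrame({
    "country": ["Country A", "Country B"],
    "year": [2010, 2020],
})

# Convert year to strings, leaving the original DataFrame untouched
economies_str = economies.astype({"year": str})
economies_str.info()

# Convert the string years back to integers
economies_int = economies_str.astype({"year": int})
economies_int.info()
```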


### Rename a column

In [None]:
# Rename the 'income_group' column to 'income_category'
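
Renaming could be done with `rename(columns=...)`, as in this sketch on invented data (`economies_renamed` is a hypothetical name).

```python
import pandas as pd

# Invented example data
economies = pd.DataFrame({
    "country": ["Country A", "Country B"],
    "income_group": ["Low income", "High income"],
})

# Map old column names to new ones; columns not listed stay as they are
economies_renamed = economies.rename(columns={"income_group": "income_category"})
print(economies_renamed.columns.tolist())
```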


### Filtering rows based on conditions

#### Conditions on a single column

In [None]:
# Filter rows where 'gdp_percapita' is greater than 20,000


In [None]:
# Filter rows where 'income_group' is 'High income'


#### Conditions on multiple columns

In [None]:
# Filter rows where inflation_rate is less than 0 and income_group is 'Low income'
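
The filters in this section can be sketched with boolean masks. The data below is made up, and the result names are hypothetical.

```python
import pandas as pd

# Made-up example data
economies = pd.DataFrame({
    "country": ["Country A", "Country B", "Country C", "Country D"],
    "income_group": ["Low income", "High income", "Low income", "High income"],
    "gdp_percapita": [500.0, 30000.0, 800.0, 18000.0],
    "inflation_rate": [-1.0, 1.5, 4.0, -0.5],
})

# Single-column conditions
high_gdp = economies[economies["gdp_percapita"] > 20_000]
high_income = economies[economies["income_group"] == "High income"]

# Multiple conditions: wrap each one in parentheses and combine with &
low_income_deflation = economies[
    (economies["inflation_rate"] < 0)
    & (economies["income_group"] == "Low income")
]

print(low_income_deflation)
```

The parentheses matter because `&` binds more tightly than comparison operators such as `<` and `==`.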


## Exercise #2: Cleaning and Preparing Data with Pandas

In [None]:
# Load the populations data from a CSV file


In [None]:
# Examine the first few rows


In [None]:
# Display the information about the DataFrame


In [None]:
# Display summary statistics of the DataFrame


### Handle Missing Data

#### Remove rows

In [None]:
# Remove rows with any missing values


In [None]:
# Remove rows with missing values in fertility_rate and life_expectancy


#### Replace missing values with specific value

In [None]:
# Replace missing values with a specific value (e.g., 0 for numerical columns, 
# 'Unknown' for categorical columns)


### Convert a Column to a Different Data Type and Rename a Column

#### Convert a Column to a Different Data Type

In [None]:
# Convert the 'year' column to string type


In [None]:
# Convert it back to integer


#### Rename a Column

In [None]:
# Rename the 'fertility_rate' column to 'fertility'


### Filter a DataFrame

In [None]:
# Filter the DataFrame to include only rows where the 'continent' is 'Asia'


In [None]:
# Filter the DataFrame to include only rows where the 'year' is 2020


In [None]:
# Filter the DataFrame to include only rows where the 
# 'fertility_rate' is greater than 2


# Module 2: Transforming and Aggregating Data with Pandas

## Walkthrough #3: Summarizing Data with Pandas

### Grouping data

In [None]:
# Get the mean gdp per capita for each income_group
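
A minimal sketch of the grouped mean, using invented values (`mean_gdp` is a hypothetical name).

```python
import pandas as pd

# Invented example data
economies = pd.DataFrame({
    "income_group": ["Low income", "Low income", "High income", "High income"],
    "gdp_percapita": [500.0, 700.0, 30000.0, 40000.0],
})

# Split rows by income_group, then average gdp_percapita within each group
mean_gdp = economies.groupby("income_group")["gdp_percapita"].mean()
print(mean_gdp)
```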


### Applying Functions

#### Applying a function element-wise with `map()`

In [None]:
# Convert income_group to uppercase using map()
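
One possible version of the element-wise step, on invented data. `map()` calls the given function once per value; `income_group_upper` is a hypothetical column name.

```python
import pandas as pd

# Invented example data
economies = pd.DataFrame({
    "country": ["Country A", "Country B"],
    "income_group": ["Low income", "High income"],
})

# Apply str.upper to every value in the column
economies["income_group_upper"] = economies["income_group"].map(str.upper)
print(economies)
```

If the column contained missing values, passing `na_action="ignore"` to `map()` would skip them instead of raising an error.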


#### Applying a Function to Groups with `groupby()` and `agg()`

In [None]:
# Calculate the median gdp_percapita and inflation_rate for each income_group
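
The grouped medians might be computed as below. The values are made up and `medians` is a hypothetical name.

```python
import pandas as pd

# Made-up example data
economies = pd.DataFrame({
    "income_group": ["Low income"] * 3 + ["High income"] * 2,
    "gdp_percapita": [500.0, 700.0, 900.0, 30000.0, 40000.0],
    "inflation_rate": [2.0, 4.0, 9.0, 1.0, 2.0],
})

# agg() takes a mapping from column name to the aggregation to apply
medians = economies.groupby("income_group").agg({
    "gdp_percapita": "median",
    "inflation_rate": "median",
})
print(medians)
```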


### Summary tables

In [None]:
# Create a pivot table of gdp_percapita and inflation_rate 
# by income_group and year
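
A sketch of the pivot table on invented data. Using the mean as the aggregation is an assumption here (it is also the default for `pd.pivot_table`).

```python
import pandas as pd

# Invented example data
economies = pd.DataFrame({
    "income_group": ["Low income", "Low income", "Low income",
                     "High income", "High income"],
    "year": [2010, 2010, 2020, 2010, 2020],
    "gdp_percapita": [500.0, 700.0, 900.0, 30000.0, 40000.0],
    "inflation_rate": [2.0, 4.0, 6.0, 1.0, 3.0],
})

# Rows are income groups; columns are (measure, year) pairs
pivot = pd.pivot_table(
    economies,
    values=["gdp_percapita", "inflation_rate"],
    index="income_group",
    columns="year",
    aggfunc="mean",
)
print(pivot)
```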


### Analyzing categorical data

#### Using cross-tabulation

In [None]:
# Show counts of income_group by year


#### By getting group counts

In [None]:
# Count the occurrences of each income_group
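
Both ways of counting categories in this section can be sketched as follows, again on invented data (`counts_by_year` and `group_counts` are hypothetical names).

```python
import pandas as pd

# Invented example data
economies = pd.DataFrame({
    "income_group": ["Low income", "Low income", "Low income",
                     "High income", "High income"],
    "year": [2010, 2010, 2020, 2010, 2020],
})

# Cross-tabulation: number of rows for each income_group and year combination
counts_by_year = pd.crosstab(economies["income_group"], economies["year"])

# Group counts: number of rows for each income_group
group_counts = economies["income_group"].value_counts()

print(counts_by_year)
print(group_counts)
```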


## Exercise #3: Summarizing Data with Pandas

### Grouping Data

In [None]:
# Group data by continent and calculate the mean life expectancy

### Applying Functions

#### Applying a function element-wise with `map()`

In [None]:
# Convert continent to uppercase using map()


#### Applying a function to groups with `groupby()` and `agg()`

In [None]:
# Calculate the median fertility rate and life expectancy for each continent


### Summary Tables

In [None]:
# Create a pivot table of fertility rate and life expectancy by continent and year


### Analyzing Categorical Data

#### Using Cross-Tabulation

In [None]:
# Create a cross-tabulation of continent and year


#### By Getting Group Counts

In [None]:
# Count the occurrences of each region


------------------------------------------------------------------------

# Module 3: Exploring and Learning from Mistakes

Use the provided prompt as your initial guide. Here are 20 Python errors
that you’ll attempt to use LLMs to help you debug. Make sure to run the
code in Jupyter first and then try to debug!

## Walkthrough and Exercise #4: Debug with GenAI

### 1

In [None]:
populations.head

### 2

In [None]:
populations.size.mean()

### 3

In [None]:
populations['Life_Expectancy'].mean()

### 4

In [None]:
asia = populations[populations['continent'] = 'Asia']

### 5

In [None]:
populations[populations['continent'] == 'Asia' & populations['year'] == 2020]

### 6

In [None]:
populations[populations['population_size'] > 1_000_000]

### 7

In [None]:
populations['double_size'] = populations['size'].apply(lambda x: x * 2, axis=1)

### 8

In [None]:
populations['fertility_rate'].fillna(0)

------------------------------------------------------------------------

# *Day 2: Visualize to Tell*

------------------------------------------------------------------------

# Module 4: Data Visualization Basics with Matplotlib and Seaborn

## Walkthrough #5: Data Visualization Techniques

### Line plot with Matplotlib

In [None]:
# Filter data for a specific country


# Line plot of gdp_percapita over the years


### Bar chart with Matplotlib

In [None]:
# Filter data for Caribbean countries and the year 2020



# Bar chart of gdp_percapita for different Caribbean countries in 2020







# Horizontal version


### Adding labels and titles

In [None]:
# Filter data for a specific country


# Line plot of gdp_percapita over the years with labels and titles
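
A labeled line plot could look like the sketch below. It uses Matplotlib's `fig, ax` interface, which is one of several equivalent styles; the data, the `country` column, and the label text are assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Invented example data
economies = pd.DataFrame({
    "country": ["Country A"] * 3 + ["Country B"] * 3,
    "year": [2000, 2010, 2020] * 2,
    "gdp_percapita": [400.0, 500.0, 550.0, 25000.0, 30000.0, 28000.0],
})

# Filter data for a specific country
country_a = economies[economies["country"] == "Country A"]

# Line plot of gdp_percapita over the years with labels and a title
fig, ax = plt.subplots()
ax.plot(country_a["year"], country_a["gdp_percapita"], marker="o")
ax.set_xlabel("Year")
ax.set_ylabel("GDP per capita")
ax.set_title("GDP per capita over time: Country A")
```

In Jupyter the figure renders at the end of the cell; in a script, `plt.show()` would display it.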


### Adjusting axes and tick marks

In [None]:
# Bar chart of gdp_percapita for different Caribbean countries in 2020 with 
# adjusted axes and tick marks






# Adjust axes


# Adjust tick marks
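
As a sketch of adjusting axes and ticks on a bar chart: the country names, values, axis limits, and tick spacing below are all invented for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Invented 2020 data for three hypothetical countries
economies_2020 = pd.DataFrame({
    "country": ["Country A", "Country B", "Country C"],
    "gdp_percapita": [8000.0, 31000.0, 15000.0],
})

fig, ax = plt.subplots()
ax.bar(economies_2020["country"], economies_2020["gdp_percapita"])
ax.set_title("GDP per capita in 2020")

# Adjust axes: fix the y-axis range
ax.set_ylim(0, 40_000)

# Adjust tick marks: one tick every 10,000 and rotated x labels
ax.set_yticks(list(range(0, 40_001, 10_000)))
ax.tick_params(axis="x", labelrotation=45)
```

For the horizontal version, `ax.barh()` takes the same two columns and the axis settings swap to the x-axis.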


### Histogram with Seaborn

In [None]:
# Histogram of gdp_percapita


### Boxplot with Seaborn

In [None]:
# Boxplot of gdp_percapita by income_group


### Violin plot with Seaborn

In [None]:
# Violin plot of gdp_percapita by income_group


## Exercise #5: Data Visualization Techniques

### Line Plot with Matplotlib

In [None]:
# Filter data for India


# Line plot of fertility rate over the years


### Bar Chart with Matplotlib

In [None]:
# Filter data for selected Asian countries and the year 2020
asian_countries = ['CHN', 'IND', 'IDN', 'PAK', 'BGD']


# Bar chart of population size for selected Asian countries in 2020


### Adding Labels and Titles

In [None]:
# Filter data for Nigeria


# Line plot of life expectancy over the years with labels and titles


### Adjusting Axes and Tick Marks

In [None]:
# Filter data for selected African countries ('NGA', 'ETH', 'EGY', 'ZAF', 'DZA')
# and the year 2020
african_countries = ['NGA', 'ETH', 'EGY', 'ZAF', 'DZA']


# Bar chart of fertility rate for selected African countries in 2020 with 
# adjusted axes and tick marks






# Adjust axes


# Adjust tick marks


### Histogram with Seaborn

In [None]:
# Histogram of life expectancy


### Boxplot with Seaborn

In [None]:
# Boxplot of fertility rate by continent


### Violin Plot with Seaborn

In [None]:
# Violin plot of fertility rate by continent


------------------------------------------------------------------------

# Module 5: Interactive Data Visualization with Plotly

## Walkthrough #6: Interactive Charts and Dashboards with Plotly

### Basic interactive chart

In [None]:
# Filter data for a specific country


# Create an interactive line chart


### Adding interactive elements

In [None]:
# Create an interactive scatter plot





# Add hover, zoom, and selection tools


### Designing a simple dashboard

In [None]:
# Filter data for the year 2020


# Create a subplot figure with 1 row and 2 columns





# Line chart of GDP Per Capita for Afghanistan





# Bar chart of GDP Per Capita for different countries in 2020




# Update layout


## Exercise #6: Interactive Charts and Dashboards with Plotly

### Basic Interactive Chart

In [None]:
# Filter data for a specific country (Brazil)


# Create an interactive line chart (Fertility Rate Over Years)


### Adding Interactive Elements

In [None]:
# Create an interactive scatter plot





# Add hover, zoom, and selection tools


### Designing a Simple Dashboard

In [None]:
# Filter data for the year 2020


# Create a subplot figure with 1 row and 2 columns





# Line chart of Life Expectancy for Brazil





# Bar chart of Life Expectancy for South American countries in 2020





# Update layout to add a title and hide the legend
