# Section 3. Data Visualization

#### Instructor: Pierre Biscaye

The content of this notebook draws on material from UC Berkeley's D-Lab Python Fundamentals [course](https://github.com/dlab-berkeley/Python-Fundamentals).
    
### Learning Objectives 
    
* Apply several Pandas methods to summarize data.
* Create simple visualizations using Pandas.
* Consider simple approaches to detecting and treating outliers. 
* Visualize data using the Matplotlib package.

### Sections
1. Methods to visualize `DataFrame` and `Series` objects
2. Graphics using Matplotlib

### Libraries loaded
* pandas
* matplotlib.pyplot
* numpy

### Files loaded
* gapminder.csv

# 1. Exploring Data Frames

Pandas has many methods: some allow you to work with entire DataFrames, while others operate on individual columns. This section focuses on learning to distinguish between these methods.

We start with methods that work on entire DataFrames. The [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) lists all the methods and attributes that are available for `DataFrame` objects. Learning how to read documentation is an important skill! 

## Summary Statistics
The [`describe()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) method will give some summary statistics for a `DataFrame`. Run the cell below to see how it works.

In [None]:
import pandas as pd
import numpy as np

df = pd.read_csv('Data/gapminder.csv')

df.describe()

In [None]:
df.columns

*Question*: Why are only some of the columns in the `DataFrame` visible in the output?

The result of `describe()` is itself a data frame. You can therefore index and subset it.

In [None]:
df.describe().loc['mean','lifeExp']

The `describe()` method is customizable. For example, I can ask for particular percentiles by passing the `percentiles` argument.

In [None]:
df.describe(percentiles=[.25, .5, .75, .99])

Percentiles can also be calculated using the `quantile` method directly on a column.

In [None]:
df['lifeExp'].quantile(0.9)

## Sorting Values

Let's say we want to find the countries with the highest `gdpPercap`.

If we want to sort the values in a DataFrame we can use the [`sort_values()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html) method on a DataFrame. It takes as an argument the column we want to sort the DataFrame on. 

By default, `sort_values()` sorts in **ascending order**. You can add the argument `ascending=False` when running `sort_values()` to show results in descending order.

In [None]:
df.sort_values('gdpPercap')
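
Because the default order is ascending, the highest values end up at the bottom of the sorted output. The sketch below shows the descending version. It uses a small made-up DataFrame (the name `toy` and its values are illustrative only, not gapminder data) so that it runs on its own.

```python
import pandas as pd

# Hypothetical toy data, for illustration only (not taken from gapminder.csv)
toy = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D'],
    'gdpPercap': [1200.0, 35000.0, 8700.0, 560.0],
})

# ascending=False puts the largest values first; head(2) keeps the top two rows
top = toy.sort_values('gdpPercap', ascending=False).head(2)
print(top)
```

With the notebook's data, the same pattern would be `df.sort_values('gdpPercap', ascending=False).head(10)`.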

## Plots

Pandas has a convenient [`plot()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html) method that allows you to create different kinds of visualizations. Some of these visualizations can be called on a DataFrame object.

For instance, **scatter plots** visualize the relationship between different variables (columns) in a DataFrame. This is why we run the method on an entire DataFrame.

We can create a scatter `plot()` by specifying the columns to use for the `x` and `y` axes. 

In [None]:
df.plot(kind='scatter', x='lifeExp', y='gdpPercap')

There are many more plot options. You can use `df.plot?` to explore the documentation.

One example is a **histogram**. A histogram shows the distribution of a variable using binned values. We can call this using the syntax: `df[column].plot(kind='hist')`. Use a histogram if you want to show distributions of continuous variables.

*Note* You can try changing the value for the `bins` parameter. What does the `bins` parameter seem to be determining?

In [None]:
df['lifeExp'].plot(kind='hist', bins=30, title='Histogram of life expectancy')

**Bar plots** show the relationship between a numeric and a categorical variable. Use a bar plot when you want to compare a summary value, such as a count or a mean, across categories. Here, we use the `continent` (categorical) and `lifeExp` (numeric) columns. 

Because there are many observations per continent, we first collapse the data to the continent level.

In [None]:
df.groupby('continent')['lifeExp'].mean().plot(kind='bar')

In the below cell, we retrieve the 10 data points with the lowest life expectancy in our data using the `sort_values()` method, and then plot those data points in a bar plot.

Note that `plot.bar()` is a method of its own, and is an alternative to using `plot()` with the `kind='bar'` argument.

In [None]:
# Sort by life expectancy in ascending order and keep the 10 lowest rows
low_lifeExp = df.sort_values('lifeExp', ascending=True)[:10]

# Visualize with bar plot 
low_lifeExp.plot.bar(x='country', y='lifeExp', figsize=(6,4))
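
As a quick check that the two spellings are interchangeable, the sketch below draws the same bar plot both ways. The `toy` DataFrame and its values are made up for illustration and are not gapminder data.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    'country': ['A', 'B', 'C'],
    'lifeExp': [40.2, 55.1, 71.8],
})

# Two equivalent ways to request a bar plot; each call draws on a new figure
ax1 = toy.plot(kind='bar', x='country', y='lifeExp', title="plot(kind='bar')")
ax2 = toy.plot.bar(x='country', y='lifeExp', title='plot.bar()')

plt.show()
```

Both calls return a Matplotlib axes object, so either result can be customized further in the same way.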

Suppose we want to calculate mean GDP per capita by continent in 2007. Complete the below code to create this dataframe and graph a bar plot, after sorting by increasing GDP.

In [None]:
# Group by 'continent' and calculate the mean of 'gdpPercap'
gdp_mean_df = df.groupby()...

# Plot sorted dataframe

## Looking for outliers

When working with economic data, outliers are common. These could represent data entry errors or true outlier cases (in the gapminder data, think for example of small oil-rich nations or sudden resource booms) that can skew the distribution and resulting analyses. 

Looking for outliers visually first is generally good practice. Let's start by looking at GDP per capita. A standard way to spot outliers in a single variable is a box plot. In a box plot, outliers are explicitly plotted as individual points (usually called "fliers") beyond the "whiskers."

In [None]:
df['gdpPercap'].plot(kind='box', title='Boxplot of GDP Per Capita');

**Interpreting the output**: The green line is the median and the box is the interquartile range (IQR), the distance from the 1st quartile (Q1) to the 3rd quartile (Q3). The 'whiskers' extend to the most extreme observations that lie within 1.5 times the IQR below Q1 and above Q3. If no observation lies beyond that range, the whiskers simply reach the smallest and largest values.

What does the above graph suggest about outliers in this variable?

It is always a good idea to plot your main analysis variables to see whether there appear to be outliers.

**Identifying outliers**: If you do see potential outliers, you will want to address them in a systematic way. To identify outliers statistically, we first need to define a threshold. A common approach for "extreme" values is using the 99th percentile (if the outliers are on the larger end of the distribution) or 1st percentile (if they are on the smaller end). Another approach, as suggested by the above box and whisker plot, would be to identify values more than 1.5*IQR below Q1 or above Q3. There are tradeoffs to different approaches.

For now, let's use the 99th percentile to identify extreme GDP per capita values.


In [None]:
# Calculate the 99th percentile threshold
upper_limit = df['gdpPercap'].quantile(0.99)

# Identify which rows are outliers
outliers = df[df['gdpPercap'] > upper_limit]
print(f"Count of observations above 99th percentile: {len(outliers)}")
outliers.head()
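
For comparison, the 1.5*IQR rule mentioned above can be sketched as follows. This is a minimal illustration on made-up numbers (the `values` Series and the variable names are assumptions, not part of the gapminder data). The same steps would apply to `df['gdpPercap']`.

```python
import pandas as pd

# Hypothetical toy values, for illustration only (not gapminder data)
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0])

# Quartiles and interquartile range
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Values outside these "fences" are flagged as potential outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

iqr_outliers = values[(values < lower_fence) | (values > upper_fence)]
print(f"Fences: [{lower_fence}, {upper_fence}]")
print(iqr_outliers)
```

Unlike a fixed percentile cutoff, this rule flags no observations at all when the data have no values far from the quartiles.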

**Addressing outliers**: There are many ways to address outliers. 

For example:
1. Setting Outliers to Missing (NaN): Use this if you believe the outliers are data entry errors or so extreme they will bias your model results. The downsides are that this will reduce sample size and introduce bias if outliers aren't random.
2. Winsorizing (Capping at the 99th Percentile): Use this if you want to keep the data points--if you believe they truly represent large values--but "mute" their impact. This replaces any value higher than the 99th percentile with the value of the 99th percentile itself.
3. Log transformation: As with winsorizing, use this if the data are "right skewed" (like income) to pull extremes closer to the mean. This is less useful if the outliers may be data entry errors.
4. Imputation: Set outliers to some other value, such as the median for a group or a predicted value based on some other characteristics. This is useful if you think the outliers represent errors and you think you can guess at what the true value could be, but can artificially reduce the variance of your data.

Let's look at how the first two methods affect the values of GDP per capita.

In [None]:
# Set values above the 99th percentile to NaN
df_nan = df['gdpPercap'].copy()
df_nan[df_nan > upper_limit] = np.nan

# Replace values above the limit with the limit itself
df_cap = df['gdpPercap'].copy()
df_cap[df_cap > upper_limit] = upper_limit

# Create a comparison table
stats_comparison = pd.DataFrame({
    'Original': [df['gdpPercap'].count(), df['gdpPercap'].mean(), df['gdpPercap'].std()],
    'Replaced with NaN': [df_nan.count(), df_nan.mean(), df_nan.std()],
    'Capped (99th Pct)': [df_cap.count(), df_cap.mean(), df_cap.std()]
}, index=['Count (N)', 'Mean', 'Std Dev'])

print("Statistical Impact of Outlier Handling:")
print(stats_comparison.round(2))
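
The other two approaches from the list above, log transformation and imputation, could be sketched as follows. This is only an illustration under stated assumptions: the `toy` DataFrame, its `group` and `income` columns, and the cutoff of 100 are all made up, and group-median imputation is just one of several possible imputation rules.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, for illustration only (not gapminder data)
toy = pd.DataFrame({
    'group': ['a', 'a', 'a', 'b', 'b', 'b'],
    'income': [10.0, 12.0, 500.0, 20.0, 22.0, 24.0],
})

# Log transformation: compresses large values; requires strictly positive data
toy['log_income'] = np.log(toy['income'])

# Imputation: replace values above an (arbitrary) cutoff with the median of
# the remaining values in the same group
cutoff = 100
is_outlier = toy['income'] > cutoff
group_median = (
    toy['income']
    .mask(is_outlier)              # hide outliers so they do not affect the median
    .groupby(toy['group'])
    .transform('median')           # one median per row, aligned with toy's index
)
toy['income_imputed'] = toy['income'].where(~is_outlier, group_median)

print(toy)
```

Note that the log transformation keeps every observation and changes the scale of the whole variable, while imputation changes only the flagged values.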

# 2. Graphics using Matplotlib

It's common to use `matplotlib` for graphics, similar to `ggplot` in R. In `matplotlib`, a plot consists of a figure and one or more axes. The axes contain important information about each plot, such as its axis labels, or title.

It is standard practice to import `matplotlib.pyplot` as `plt`. 

In [None]:
import matplotlib.pyplot as plt

Let's explore the `plot()` method. It takes several arguments (see `plt.plot?` for more information), and plots a line between points.

The most basic usage is to pass a single x coordinate and y coordinate. This plots a **point**.

In [None]:
plt.plot(3, 4, marker='o')

In [None]:
# Adding a semicolon at the end of the line suppresses the text output from the plot backend
plt.plot(3, 4, marker='o');

In [None]:
# You can also use plt.show() to explicitly tell matplotlib to render the plot and close the figure creation
plt.plot(3, 4, marker='o');
plt.show()

We can also specify several **formatting** arguments. See `plt.plot?` for more information.

In [None]:
plt.plot(3, 4, marker='^',color='green', markersize=12);

We can also plot a **line** by specifying sequences (lists or tuples) of coordinates.

In [None]:
plt.plot((3, 6), (4, 9));

In [None]:
# Longer line
plt.plot((3, 6, 4, 12), (4, 9, 7, 0)); 

In [None]:
# With formatting
plt.plot((3, 6, 5), (4, 9, 9), color='green', 
         marker='o', markersize=12,
         linestyle='dashed',
         linewidth=2); 

We can also **plot multiple things** together. Note that matplotlib will automatically assume you are adding things to the same plot if the code is in the same cell, until you tell it to render the plot and close the figure creation.

In [None]:
plt.plot(3, 4, '.')
plt.plot(6, 9, '^')
plt.plot(4, 7, 'X', color='red', markersize=12)
plt.plot((3, 6), (4, 9))
plt.show()

We can add **labels and titles** to the plot using  `plt.xlabel()`, `plt.ylabel()`, and `plt.title()`. See [this resource](https://www.w3schools.com/python/matplotlib_labels.asp) for more information!

In [None]:
plt.plot(3, 4, '.')
plt.plot(6, 9, '^')
plt.plot(4, 7, 'X', color='red', markersize=12)
plt.plot((3, 6), (4, 9))
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title('Path from A to B, showing point C')
plt.show()

We can also **save and output** the results of our figures using `plt.savefig()`. 

By default it goes to your working folder unless you specify something else.

In [None]:
plt.plot(3, 4, '.')
plt.plot(6, 9, '^')
plt.plot(4, 7, 'X', color='red', markersize=12)
plt.plot((3, 6), (4, 9))
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title('Path from A to B, showing point C')
plt.savefig('practice_fig.jpg')
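
To save somewhere other than the working folder, you can pass a path to `plt.savefig()`. The sketch below is one way to do this. The `figures` folder name, the PNG format, and the `dpi` and `bbox_inches` settings are illustrative choices, not requirements.

```python
from pathlib import Path

import matplotlib.pyplot as plt

# Hypothetical output folder, created if it does not exist yet
out_dir = Path('figures')
out_dir.mkdir(exist_ok=True)
out_path = out_dir / 'practice_fig.png'

plt.plot((3, 6), (4, 9))
plt.title('Path from A to B')

# dpi sets the resolution; bbox_inches='tight' trims extra whitespace
plt.savefig(out_path, dpi=150, bbox_inches='tight')
plt.show()
```

Calling `plt.savefig()` before `plt.show()` is the safer order, since the figure may no longer be available for saving once it has been rendered.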

**Practice**: Write some code to plot a square with blue lines and black square markers on the corners. Add a marker for a point in the center of the square. Label the axes and increase their fontsize.

In [None]:
# code here

## Plotting existing objects

Much of the time we will have existing x and y objects we want to plot.

In [None]:
x=[2,8,33,7,-1]
y=[-3,8,3,4,3]
plt.plot(x, y, '-o') # the dash says to link the points
plt.show() 

In [None]:
#Create two random variables as example
import numpy as np

x = np.random.normal(0, 10, size=500)
y = np.random.normal(10, 1, size=500)
plt.plot(x, y, 'o');

## Using subplots

Plots can also be organized using figure and axes objects created with `plt.subplots()`. Its arguments allow you to specify things like the figure size and to place graphs side by side.

In [None]:
# Simply changing the figure size
# figsize sets the width and height in inches
fig, ax = plt.subplots(ncols=1, figsize=(8,8))
ax.plot(x, y, 'o')
plt.show()
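
The `plt.xlabel()`, `plt.ylabel()`, and `plt.title()` functions used earlier act on the current axes. When you hold an axes object such as `ax`, the corresponding methods are `set_xlabel()`, `set_ylabel()`, and `set_title()`. A minimal sketch with made-up points (the names `demo_fig` and `demo_ax` are illustrative):

```python
import matplotlib.pyplot as plt

# The figure is the whole canvas; the axes is the plotting area inside it
demo_fig, demo_ax = plt.subplots(figsize=(5, 4))

# Made-up points, for illustration only
demo_ax.plot([1, 2, 3], [2, 4, 8], '-o')

# Axes-level counterparts of plt.xlabel(), plt.ylabel(), and plt.title()
demo_ax.set_xlabel('x values')
demo_ax.set_ylabel('y values')
demo_ax.set_title('One figure, one axes')

plt.show()
```

Working with the axes object directly becomes necessary once a figure holds more than one plot, because each subplot has its own labels and title.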

Suppose I want to plot the distribution of the two variables side by side.

In [None]:
# 1. Create a figure with 1 row and 2 columns
fig, ax = plt.subplots(1, 2, figsize=(12, 5))

# 2. Plot the first histogram on the first subplot (index 0)
ax[0].hist(x, bins=30, color='skyblue', edgecolor='black')
ax[0].set_title('Histogram of X')
ax[0].set_xlabel('Value')
ax[0].set_ylabel('Frequency')

# 3. Plot the second histogram on the second subplot (index 1)
ax[1].hist(y, bins=30, color='salmon', edgecolor='black')
ax[1].set_title('Histogram of Y')
ax[1].set_xlabel('Value')
ax[1].set_ylabel('Frequency')

# 4. Use tight_layout to prevent labels from overlapping
plt.tight_layout()
plt.show()

## Plotting data from data frames

There are two ways to do this.
1. You can call the columns directly, e.g., `df['column']`.
2. You can pass the column names as strings and add `data=df`, so that matplotlib looks the names up in the DataFrame.

Let's practice with the gapminder dataset and the `plt.scatter` function, which draws a scatterplot rather than a connected line.

In [None]:
df.columns

In [None]:
plt.plot(df['year'], df['lifeExp'],'o')

In [None]:
# Equivalent plot using plt.scatter with column names and the data argument
plt.scatter('year', 'lifeExp', data=df)

Let's plot life expectancy and GDP per capita by year from the gapminder dataset on side by side graphs.

In [None]:
fig, ax = plt.subplots(ncols=2, figsize=(12,6))
ax[0].scatter('year', 'lifeExp', data=df)
ax[1].scatter('year', 'gdpPercap', data=df)
ax[0].set_title('Life expectancy')
ax[1].set_title('GDP per capita')
# Use i as the loop variable so the array x defined earlier is not overwritten
for i in range(2):
    ax[i].set_xlabel('Year')
plt.show()

**Practice:** Create a figure with two subplots on separate rows. In the first, plot lifeExp against gdpPercap before 1980. In the second, plot lifeExp against gdpPercap after 1980. Add titles to each subfigure and to the axes. Save the plot to your working directory.

In [None]:
# Code here