# Examples for packages/statistics/index.md

(plotting-simple-quantities-of-a-pandas-dataframe)=

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# For loading data as data frames.
import pandas as pd

## Plotting simple quantities of a pandas dataframe

<!--- plot_pandas -->

This example loads data with mixed numerical and categorical entries from
a CSV file and plots a few quantities separately for females and males,
using the plotting tools built into pandas (which use matplotlib behind
the scenes).

See http://pandas.pydata.org/pandas-docs/stable/visualization.html

In [None]:
data = pd.read_csv("examples/brain_size.csv", sep=";", na_values=".")

# Box plots of different columns for each sex
groupby_sex = data.groupby("Gender")
groupby_sex.boxplot(column=["FSIQ", "VIQ", "PIQ"])

# Scatter matrices for different columns
pd.plotting.scatter_matrix(data[["Weight", "Height", "MRI_Count"]])
pd.plotting.scatter_matrix(data[["PIQ", "VIQ", "FSIQ"]]);
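
The grouped box plots above summarize each column separately for each sex. As a numeric counterpart, here is a minimal sketch of what `groupby` computes per group. The table is a made-up toy (the column names mirror the real file, the values do not come from `brain_size.csv`).

```python
import pandas as pd

# Hypothetical toy data: values are invented for illustration only
toy = pd.DataFrame(
    {
        "Gender": ["Female", "Male", "Female", "Male"],
        "VIQ": [100, 120, 110, 130],
        "PIQ": [90, 100, 110, 120],
    }
)

# One row per group, one column per measure
group_means = toy.groupby("Gender")[["VIQ", "PIQ"]].mean()
print(group_means)
```

The same grouped object can be passed to other reductions (for instance `.median()` or `.describe()`) when a table is easier to read than a plot.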

## Boxplots and paired differences

<!--- plot_paired_boxplots -->

Plot boxplots for FSIQ, PIQ, and the paired difference between the two:
while the spread (error bars) of FSIQ and PIQ is very large, there is a
systematic (common) effect due to the subjects. This effect cancels out in
the difference, so the spread of the difference ("paired" by subject) is
much smaller than the spread of the individual measures.

In [None]:
data = pd.read_csv("examples/brain_size.csv", sep=";", na_values=".")
# Box plot of FSIQ and PIQ (different measures of IQ)
plt.figure(figsize=(4, 3))
data.boxplot(column=["FSIQ", "PIQ"])
# Boxplot of the difference
plt.figure(figsize=(4, 3))
plt.boxplot(data["FSIQ"] - data["PIQ"])
plt.xticks((1,), ("FSIQ - PIQ",))
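
To see why pairing shrinks the spread, the sketch below simulates two measures that share a large per-subject effect and differ only by small independent noise. The generative model and all numbers are assumptions chosen for illustration; they are not estimated from the brain size data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_subjects = 40

# Assumed model: a large subject effect shared by both measures,
# plus small independent measurement noise on each measure
subject_effect = rng.normal(loc=100, scale=20, size=n_subjects)
measure_a = subject_effect + rng.normal(scale=3, size=n_subjects)
measure_b = subject_effect + rng.normal(scale=3, size=n_subjects)

# The shared subject effect cancels in the paired difference
paired_difference = measure_a - measure_b

print("std of A:         ", measure_a.std(ddof=1))
print("std of B:         ", measure_b.std(ddof=1))
print("std of difference:", paired_difference.std(ddof=1))
```

Under this model the difference only carries the measurement noise, which is why a paired test can detect effects that the individual box plots hide.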

## Simple Regression

<!--- plot_regression -->

Fit a simple linear regression using 'statsmodels' and compute the
corresponding p-values.

**Original author: Thomas Haslwanter**

In [None]:
# For statistics.
# Import the formula interface to Statsmodels.
import statsmodels.formula.api as smf

# Analysis of Variance (ANOVA) on linear models
from statsmodels.stats.anova import anova_lm

# Generate and show the data
x = np.linspace(-5, 5, 20)

# To get reproducible values, provide a seed value
rng = np.random.default_rng(27446968)

y = -5 + 3 * x + 4 * rng.normal(size=x.shape)

# Plot the data
plt.figure(figsize=(5, 4))
plt.plot(x, y, "o");

Linear regression model: calculate the fit, p-values, confidence
intervals, etc.

In [None]:
# Convert the data into a Pandas DataFrame to use the formulas framework
# in statsmodels
data = pd.DataFrame({"x": x, "y": y})

In [None]:
# Fit the model
model = smf.ols("y ~ x", data).fit()

In [None]:
# Show the summary
model.summary()

In [None]:
# Perform analysis of variance on fitted linear model
anova_results = anova_lm(model)
anova_results

Plot the fitted model

In [None]:
# Retrieve the parameter estimates
offset, coef = model.params
plt.plot(x, x * coef + offset)
plt.xlabel("x")
plt.ylabel("y");
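
As a cross-check on the fitted parameters, the intercept and slope of a one-regressor ordinary least squares fit have a closed form. The sketch below regenerates the same synthetic data and computes them with NumPy only; the estimates should agree with the `Intercept` and `x` rows of the summary up to rounding.

```python
import numpy as np

# Same synthetic data as in the example above
x = np.linspace(-5, 5, 20)
rng = np.random.default_rng(27446968)
y = -5 + 3 * x + 4 * rng.normal(size=x.shape)

# Closed-form ordinary least squares for a single regressor
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

# np.polyfit solves the same least-squares problem
# (coefficients are returned highest degree first)
slope_np, intercept_np = np.polyfit(x, y, deg=1)

print("intercept:", intercept)
print("slope:    ", slope)
```

This only reproduces the point estimates; the p-values and confidence intervals reported by statsmodels additionally rely on the residual variance.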

## Multiple Regression

<!--- plot_regression_3d -->

Calculate using 'statsmodels' just the best fit, or all the corresponding
statistical parameters.

Also shows how to make 3d plots.

**Original author: Thomas Haslwanter**

In [None]:
# For 3d plots. This import is necessary to have 3D plotting below
from mpl_toolkits.mplot3d import Axes3D

In [None]:
# Generate and show the data
x = np.linspace(-5, 5, 21)
# We generate a 2D grid
X, Y = np.meshgrid(x, x)

# To get reproducible values, provide a seed value
rng = np.random.default_rng(27446968)

# Z is the elevation of this 2D grid
Z = -5 + 3 * X - 0.5 * Y + 8 * rng.normal(size=X.shape)

# Plot the data
ax: Axes3D = plt.figure().add_subplot(projection="3d")
surf = ax.plot_surface(X, Y, Z, cmap="coolwarm", rstride=1, cstride=1)
ax.view_init(20, -120)
ax.set_xlabel("X")
ax.set_ylabel("Y")
ax.set_zlabel("Z");

Multilinear regression model, calculating fit, P-values, confidence
intervals etc.

Convert the data into a Pandas DataFrame to use the formulas framework
in statsmodels

In [None]:
# First we need to flatten the data: its 2D layout is not relevant.
X = X.flatten()
Y = Y.flatten()
Z = Z.flatten()

In [None]:
data = pd.DataFrame({"x": X, "y": Y, "z": Z})

In [None]:
# Fit the model
model = smf.ols("z ~ x + y", data).fit()
# Show the summary
model.summary()

In [None]:
print("\nRetrieving the parameter estimates manually:")
print(model.params)

In [None]:
# Perform analysis of variance on fitted linear model
anova_results = anova_lm(model)
anova_results
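
The formula `"z ~ x + y"` corresponds to a design matrix with three columns: a constant, `x`, and `y`. The sketch below builds that matrix by hand on the same synthetic grid and solves the least-squares problem with NumPy, as an illustration of what the fit computes rather than of how statsmodels implements it.

```python
import numpy as np

# Same synthetic data as in the example above
x = np.linspace(-5, 5, 21)
X, Y = np.meshgrid(x, x)
rng = np.random.default_rng(27446968)
Z = -5 + 3 * X - 0.5 * Y + 8 * rng.normal(size=X.shape)

# One column per term of "z ~ x + y": intercept, x, y
design = np.column_stack([np.ones(X.size), X.ravel(), Y.ravel()])

# Least-squares solution of design @ beta ~ z
beta, *_ = np.linalg.lstsq(design, Z.ravel(), rcond=None)

# The same solution from the normal equations (X'X) beta = X'z
beta_normal = np.linalg.solve(design.T @ design, design.T @ Z.ravel())

print("intercept, coefficient of x, coefficient of y:", beta)
```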

## Analysis of Iris petal and sepal sizes

<!--- plot_iris_analysis -->

Illustrate an analysis on a real dataset:

- Visualizing the data to formulate intuitions
- Fitting of a linear model
- Hypothesis test of the effect of a categorical variable in the presence
  of a continuous confound

In [None]:
# Load the data
data = pd.read_csv("examples/iris.csv")

Plot a scatter matrix

In [None]:
# Express the names as categories
categories = pd.Categorical(data["name"])

# The parameter 'c' is passed to plt.scatter and will control the color
pd.plotting.scatter_matrix(data, c=categories.codes, marker="o")

fig = plt.gcf()
fig.suptitle("blue: setosa, green: versicolor, red: virginica", size=13)

Statistical analysis

Let us try to explain the sepal width as a function of the petal
length and the category of iris

In [None]:
model = smf.ols("sepal_width ~ name + petal_length", data).fit()
model.summary()

Now formulate a "contrast" to test whether the offsets for versicolor and
virginica are identical

In [None]:
print("Testing the difference between effect of versicolor and virginica")
print(model.f_test([0, 1, -1, 0]))
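
A contrast vector holds one weight per model coefficient, so `[0, 1, -1, 0]` tests whether the second and third coefficients are equal. The sketch below works this out by hand on made-up data with three hypothetical groups and one continuous covariate (it does not use the iris file), and checks that the contrast F statistic equals the one obtained by comparing against a restricted model in which the two groups share an offset.

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_group = 30
group = np.repeat([0, 1, 2], n_per_group)   # three hypothetical categories
covariate = rng.normal(size=group.size)     # continuous confound
y = 1.0 + 0.5 * (group == 1) + 0.8 * (group == 2) + 2.0 * covariate
y = y + rng.normal(scale=0.5, size=group.size)

# Treatment coding: intercept, indicator of group 1, indicator of group 2,
# covariate (group 0 is the reference level)
design = np.column_stack(
    [np.ones(group.size), group == 1, group == 2, covariate]
).astype(float)

beta, *_ = np.linalg.lstsq(design, y, rcond=None)
residuals = y - design @ beta
dof = design.shape[0] - design.shape[1]
rss_full = residuals @ residuals
cov_beta = (rss_full / dof) * np.linalg.inv(design.T @ design)

# Contrast: (offset of group 1) - (offset of group 2)
contrast = np.array([0.0, 1.0, -1.0, 0.0])
f_contrast = (contrast @ beta) ** 2 / (contrast @ cov_beta @ contrast)

# Restricted model: groups 1 and 2 share a single offset
restricted = np.column_stack(
    [np.ones(group.size), group != 0, covariate]
).astype(float)
beta_r, *_ = np.linalg.lstsq(restricted, y, rcond=None)
rss_restricted = np.sum((y - restricted @ beta_r) ** 2)
f_nested = (rss_restricted - rss_full) / (rss_full / dof)

print("F from contrast:     ", f_contrast)
print("F from nested models:", f_nested)
```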

## Visualizing factors influencing wages

<!--- plot_wage_data -->

This example uses Seaborn to quickly plot the relations between wages and
various factors, such as experience and education.

Seaborn (https://seaborn.pydata.org) is a library that combines
visualization and statistical fits to show trends in data.

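Note that in old versions of Seaborn (before 0.8), importing the library
changed the matplotlib style to an "excel-like" look, and this change
affected other matplotlib figures. More recent versions only change the
style when asked to, for instance with `seaborn.set_theme()`. In both
cases, calling `plt.rcdefaults()` restores the matplotlib defaults.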

In [None]:
data = pd.read_csv("examples/wages.txt",
    skiprows=27,
    skipfooter=6,
    sep=None,
    header=None,
    engine="python"  # To allow use of skipfooter.
)
# Give names to the columns
names = [
    "education: Number of years of education",
    "south: 1=person lives in South, 0=Person lives elsewhere",
    "sex: 1=female, 0=Male",
    "experience: Number of years of work experience",
    "union: 1=union member, 0=Not union member",
    "wage: wage (dollars per hour)",
    "age: years",
    "race: 1=other, 2=Hispanic, 3=White",
    "occupation: 1=Management, 2=Sales, 3=Clerical, 4=Service, 5=Professional, 6=Other",
    "sector: 0=Other, 1=Manufacturing, 2=Construction",
    "marr: 0=unmarried,  1=Married",
]
short_names = [n.split(":")[0] for n in names]
data.columns = pd.Index(short_names)
# Log-transform the wages, because they typically change by
# multiplicative factors
data["wage"] = np.log10(data["wage"])
# Convert genders to strings (this is particularly useful so that the
# statsmodels formulas detect that `sex` is a categorical variable)
data["sex"] = np.choose(data['sex'], ["male", "female"])
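
The two transformations used above can be tried on a tiny table. The values below are made up; only the column names follow the wage data. `np.choose` picks, for each integer code, the entry at that position in the list, and `Series.map` with a dictionary is an equivalent, more explicit alternative.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the wage table (values are invented)
toy = pd.DataFrame({"sex": [0, 1, 1, 0], "wage": [10.0, 100.0, 1.0, 1000.0]})

# np.choose: code 0 -> "male", code 1 -> "female"
labels = np.choose(toy["sex"].to_numpy(), ["male", "female"])

# Equivalent mapping written with pandas
labels_map = toy["sex"].map({0: "male", 1: "female"})

# log10 turns multiplicative steps (a factor of 10) into additive ones (+1)
log_wage = np.log10(toy["wage"])

print(labels)
print(log_wage.to_list())
```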

Plot scatter matrices highlighting different aspects

In [None]:
import seaborn

In [None]:
seaborn.pairplot(data, vars=["wage", "age", "education"], kind="reg")

In [None]:
seaborn.pairplot(data, vars=["wage", "age", "education"], kind="reg", hue="sex")
plt.suptitle("Effect of sex")

In [None]:
seaborn.pairplot(data, vars=["wage", "age", "education"], kind="reg", hue="race")
plt.suptitle("Effect of race: 1=Other, 2=Hispanic, 3=White")

In [None]:
seaborn.pairplot(data, vars=["wage", "age", "education"], kind="reg", hue="union")
plt.suptitle("Effect of union: 1=Union member, 0=Not union member")

Plot a simple regression

In [None]:
seaborn.lmplot(y="wage", x="education", data=data)

## Test for an education/sex interaction in wages

<!--- plot_wage_education_gender -->

Wages depend mostly on education. Here we investigate how this dependence
is related to gender: not only does gender create an offset in wages, it
also seems that wages increase more with education for males than
females.

Does our data support this last hypothesis? We will test this using
statsmodels' formulas
(http://statsmodels.sourceforge.net/stable/example_formulas.html).

In [None]:
# simple plotting

# Plot 2 linear fits for male and female.
seaborn.lmplot(y="wage", x="education", hue="sex", data=data)

# statistical analysis
import statsmodels.formula.api as sm

# Note that this model is not the one displayed in the plot above: it is
# one joint model for male and female, not separate models for male and
# female. The reason is that a single model enables statistical testing.
result = sm.ols(formula="wage ~ education + sex", data=data).fit()
result.summary()

In [None]:
# The plots above highlight that there is not only a different offset in
# wage but also a different slope
#
# We need to model this using an interaction
result = sm.ols(
    formula="wage ~ education + sex + education * sex", data=data
).fit()
result.summary()

Looking at the p-value of the interaction of sex and education, the
data does not support the hypothesis that education benefits males
more than females (p-value > 0.05).
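
In the formula language, `education * sex` is shorthand for `education + sex + education:sex`, where the `:` term is the interaction. For a two-level factor, the interaction column is simply the product of the regressor with the 0/1 indicator, and its coefficient is the difference in slope between the two groups. The sketch below shows this with NumPy on noise-free made-up data (the coefficients in `true_beta` are arbitrary assumptions).

```python
import numpy as np

# Deterministic, noise-free toy data (invented for illustration)
education = np.tile(np.arange(8, 18), 2).astype(float)  # 8..17 years, twice
female = np.repeat([0.0, 1.0], 10)                       # 10 males, 10 females

# Assumed true model: intercept, slope, offset for females,
# and extra slope for females (the interaction)
true_beta = np.array([1.0, 0.05, 0.2, 0.01])

design = np.column_stack(
    [np.ones_like(education), education, female, education * female]
)
wage = design @ true_beta

# Least squares recovers the four coefficients, including the interaction
beta, *_ = np.linalg.lstsq(design, wage, rcond=None)
print(beta)
```

With real, noisy data the interaction coefficient comes with a standard error, and its p-value is what the summary table reports.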

## Other examples

### Air fares before and after 9/11

<!--- plot_airfare -->

This is a business-intelligence (BI) style application.

What is interesting here is that we may want to study fares as a function
of the year, paired according to the trips, or, forgetting the year,
only as a function of the trip endpoints.

Using statsmodels' linear models, we find that with both an OLS (ordinary
least squares) fit and a robust fit, the intercept and the slope are
significantly non-zero: the air fares have decreased between 2000 and
2001, and their dependence on distance travelled has also decreased.

In [None]:
# As a separator, '\s+' is a regular expression that means 'one or more
# whitespace characters'
data = pd.read_csv(
    "examples/airfares.txt",
    sep=r'\s+',
    header=0,
    names=[
        "city1",
        "city2",
        "pop1",
        "pop2",
        "dist",
        "fare_2000",
        "nb_passengers_2000",
        "fare_2001",
        "nb_passengers_2001",
    ],
)

In [None]:
# we log-transform the number of passengers
data["nb_passengers_2000"] = np.log10(data["nb_passengers_2000"])
data["nb_passengers_2001"] = np.log10(data["nb_passengers_2001"])

Make a dataframe with the year as an attribute, instead of separate columns.

This involves a small dance in which we split the dataframe in two,
one for year 2000 and one for 2001, before concatenating them again.

In [None]:
# Make an index of each flight
data_flat = data.reset_index()

In [None]:
data_2000 = data_flat[
    ["city1", "city2", "pop1", "pop2", "dist", "fare_2000", "nb_passengers_2000"]
]
# Rename the columns
data_2000.columns = pd.Index(
    ["city1", "city2", "pop1", "pop2", "dist", "fare", "nb_passengers"]
)
# Add a column with the year
data_2000.insert(0, "year", 2000)

In [None]:
data_2001 = data_flat[
    ["city1", "city2", "pop1", "pop2", "dist", "fare_2001", "nb_passengers_2001"]
]
# Rename the columns
data_2001.columns = pd.Index(
    ["city1", "city2", "pop1", "pop2", "dist", "fare", "nb_passengers"]
)
# Add a column with the year
data_2001.insert(0, "year", 2001)

In [None]:
data_flat = pd.concat([data_2000, data_2001])
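
The same wide-to-long reshaping can be written as a loop over the years, which avoids repeating the column lists. This is a sketch on a made-up two-route table (the city names and fares are invented), not a replacement for the cells above.

```python
import pandas as pd

# Hypothetical wide table: one row per route, one fare column per year
wide = pd.DataFrame(
    {
        "city1": ["A", "B"],
        "city2": ["C", "D"],
        "dist": [100, 200],
        "fare_2000": [50.0, 80.0],
        "fare_2001": [45.0, 70.0],
    }
)

parts = []
for year in (2000, 2001):
    part = (
        wide[["city1", "city2", "dist", f"fare_{year}"]]
        .rename(columns={f"fare_{year}": "fare"})  # drop the year suffix
        .assign(year=year)                         # keep the year as a column
    )
    parts.append(part)

# ignore_index avoids duplicated row labels in the stacked result
long_table = pd.concat(parts, ignore_index=True)
print(long_table)
```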

Plot scatter matrices highlighting different aspects

In [None]:
seaborn.pairplot(
    data_flat, vars=["fare", "dist", "nb_passengers"], kind="reg", markers="."
)

In [None]:
# A second plot, to show the effect of the year (i.e. the 9/11 effect)
seaborn.pairplot(
    data_flat,
    vars=["fare", "dist", "nb_passengers"],
    kind="reg",
    hue="year",
    markers=".",
)

Plot the difference in fare

In [None]:
plt.figure(figsize=(5, 2))
seaborn.boxplot(data.fare_2001 - data.fare_2000)
plt.title("Fare: 2001 - 2000")
plt.subplots_adjust()

In [None]:
plt.figure(figsize=(5, 2))
seaborn.boxplot(data.nb_passengers_2001 - data.nb_passengers_2000)
plt.title("NB passengers: 2001 - 2000")
plt.subplots_adjust()

In [None]:
# Statistical testing: dependence of fare on distance and number of
# passengers
result = sm.ols(formula="fare ~ 1 + dist + nb_passengers", data=data_flat).fit()
result.summary()

In [None]:
# Using a robust fit
result = sm.rlm(formula="fare ~ 1 + dist + nb_passengers", data=data_flat).fit()
result.summary()

Statistical testing: regression of fare on distance: 2001/2000 difference

In [None]:
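# I(...) makes the formula treat '-' as arithmetic subtraction rather
# than as the formula operator that removes a term
result = sm.ols(formula="I(fare_2001 - fare_2000) ~ 1 + dist", data=data).fit()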
result.summary()

In [None]:
# Plot the corresponding regression
data["fare_difference"] = data["fare_2001"] - data["fare_2000"]
seaborn.lmplot(x="dist", y="fare_difference", data=data)
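
Because ordinary least squares is linear in the target, regressing the fare difference on distance gives the same coefficients as subtracting the coefficients of the two per-year regressions. The sketch below checks this with NumPy on made-up routes (the fare model and its numbers are assumptions for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
n_routes = 50
dist = rng.uniform(100, 3000, size=n_routes)

# Invented fares: both years depend linearly on distance, plus noise
fare_2000 = 60 + 0.10 * dist + rng.normal(scale=10, size=n_routes)
fare_2001 = 55 + 0.08 * dist + rng.normal(scale=10, size=n_routes)

design = np.column_stack([np.ones(n_routes), dist])


def ols(target):
    """Least-squares coefficients (intercept, slope) of target on dist."""
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    return beta


beta_difference = ols(fare_2001 - fare_2000)
beta_subtracted = ols(fare_2001) - ols(fare_2000)
print(beta_difference, beta_subtracted)
```

What the paired regression adds is not a different estimate but standard errors computed on the differences, in which noise common to both years of a route cancels.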

### Relating Gender and IQ

<!--- plot_brain_size -->

Going back to the brain size + IQ data, test if the VIQ of males and
females are different after removing the effect of brain size and
height.

Notice that here 'Gender' is a categorical variable. As it has a
non-numerical data type, statsmodels is able to infer this automatically.

In [None]:
data = pd.read_csv("examples/brain_size.csv", sep=";", na_values=".")

model = smf.ols("VIQ ~ Gender + MRI_Count + Height", data).fit()
model.summary()

In [None]:
# Here, we don't need to define a contrast, as we are testing a single
# coefficient of our model, and not a combination of coefficients.
# However, defining a contrast, which would then be a 'unit contrast',
# will give us the same results
print(model.f_test([0, 1, 0, 0]))

Here we plot a scatter matrix to get an intuition for our results and for
the relationships between our different variables. This goes beyond what
was asked in the exercise.

In [None]:
# Fill in the missing values for Height for plotting
data["Height"] = data["Height"].ffill()
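
Forward filling replaces each missing value with the last valid value before it. A minimal sketch on a made-up series:

```python
import numpy as np
import pandas as pd

# Hypothetical heights with gaps (values are invented)
heights = pd.Series([68.0, np.nan, 70.0, np.nan, np.nan])

# Each NaN takes the last valid value that precedes it
filled = heights.ffill()
print(filled.to_list())
```

This is convenient for plotting, but the filled values are copies rather than measurements, which is presumably why the fill is applied here only after the model has been fitted.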

In [None]:
# The parameter 'c' is passed to plt.scatter and will control the color
# The same holds for parameters 'marker', 'alpha' and 'cmap', that
# control respectively the type of marker used, their transparency and
# the colormap
pd.plotting.scatter_matrix(
    data[["VIQ", "MRI_Count", "Height"]],
    c=(data["Gender"] == "Female"),
    marker="o",
    alpha=1,
    cmap="winter",
)

fig = plt.gcf()
fig.suptitle("blue: male, green: female", size=13);