# Learning Python as an R user

# Introduction

I'm teaching myself Python with examples I used to teach myself R for my research. I plan to update this post with more detailed notes as I learn how to interpret Python code.

# Install Python Anaconda

You can find documentation and a list of installation tutorials [here](https://docs.anaconda.com/anaconda/install/).

# Install Jupyter

You can find an installation tutorial [here](https://jupyter.org/install).

# Shell

On my Mac, I open [Terminal](https://support.apple.com/guide/terminal/welcome/mac) and run commands to launch Jupyter Notebook and to install new Python packages.

# Import libraries

In [365]:
# did you install these libraries somewhere? what does import do? what do the dots "." mean?

In [366]:
import pandas
import math
import numpy
import plotly.express
import plotly.offline
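import scipy.stats  # import the stats submodule explicitly; a bare `import scipy` does not load it in every SciPy version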
import statsmodels.api
import statsmodels.formula.api
import patsy.contrasts
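
To the questions in the cell above: `import` does not install anything. It loads a module that is already installed and binds it to a name, and each dot steps from a package to a submodule, or from a module to something it defines. A minimal sketch using only standard-library modules (the variable names are illustrative, not part of the analysis):

```python
import math                 # binds the name `math` to the standard-library module
import os.path              # dotted import: `path` is a submodule of the `os` package
import statistics as st     # `as` gives the module a shorter alias
from math import sqrt       # copies a single name into the current namespace

root_full = math.sqrt(16)   # module.function, comparable to base::sqrt(16) in R
root_short = sqrt(16)       # same function, reached without the prefix
average = st.mean([1, 2, 3, 4])
joined = os.path.join("data", "scores.csv")  # package.submodule.function
```

This is why calls in this post start with `numpy.`, `pandas.`, or `plotly.`: the prefix names the module that owns the function, much like `pkg::` in R.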

## Offline Plotting Settings

In [367]:
# what does this code mean? why do you have to start with "plotly"?

In [368]:
plotly.offline.init_notebook_mode(connected = True)

# Generate data
## `group` variable with 4 levels

In [369]:
# similar to the `base::rep` function in R. [] make lists. what are lists?

In [370]:
group = numpy.repeat(["A", "B", "C", "D"], repeats = 50)
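
On the "what are lists?" question: a list is Python's general-purpose ordered container. It is mutable and can hold items of different types, so it behaves more like an R `list` than an atomic vector. A small sketch in plain Python (names are illustrative):

```python
levels = ["A", "B", "C", "D"]     # square brackets build a list

levels.append("E")                # lists are mutable: they can grow...
levels[4] = "Z"                   # ...and items can be replaced in place
levels.pop()                      # remove the last item again

mixed = [1, "two", 3.0]           # items may have different types, like an R list

# Repeat each level 50 times, a plain-Python counterpart of rep(..., each = 50)
group_list = [level for level in levels for _ in range(50)]
```

`numpy.repeat` does the same repetition but returns a NumPy array rather than a list.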

## Save parameters

In [371]:
# () make tuples. what are tuples?

In [372]:
mean = (3, 4, 4, 3)
sigma = (1, 1, 1, 1)
n = (50, 50, 50, 50)
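
On the "what are tuples?" question: a tuple is an ordered sequence like a list, except that it cannot be changed after it is created. That makes it a reasonable fit for fixed parameters such as these. A short sketch (names are illustrative):

```python
mean = (3, 4, 4, 3)               # parentheses (really the commas) build a tuple

first, *rest = mean               # tuples unpack into names; `rest` becomes a list
total_n = sum((50, 50, 50, 50))   # they work anywhere a sequence is expected

try:
    mean[0] = 10                  # tuples are immutable, so this raises TypeError
except TypeError as err:
    message = str(err)
```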

## Random normal variable `x` whose means depend on levels of `group` variable

In [373]:
# what are list comprehensions? why do you have to concatenate it? what is zip?

In [374]:
# Loop through equal-length mean, sigma, and n (group size)
x = numpy.concatenate(
    [numpy.random.normal(loc = i, scale = j, size = k) for (i, j, k) in zip(mean, sigma, n)]
)
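
To unpack the questions in the cell above, here is a sketch that runs the same steps one at a time. `zip` pairs up the parameters, the list comprehension draws one array per group, and `numpy.concatenate` is needed because the result is a list of four arrays rather than one array. The seeded generator and the vectorized alternative at the end are my additions, not part of the original cell:

```python
import numpy

mean = (3, 4, 4, 3)
sigma = (1, 1, 1, 1)
n = (50, 50, 50, 50)

# zip pairs up the i-th items of each sequence, one tuple per group
triples = list(zip(mean, sigma, n))

# The comprehension is a compact for-loop that collects one array per group
rng = numpy.random.default_rng(1)   # seeded generator, an assumption for reproducibility
pieces = [rng.normal(loc = i, scale = j, size = k) for (i, j, k) in triples]

# `pieces` is a list of 4 arrays, so concatenate flattens it to one array of 200 values
x_loop = numpy.concatenate(pieces)

# Alternative without the concatenation: repeat each parameter n times and draw once
x_vectorized = rng.normal(loc = numpy.repeat(mean, n), scale = numpy.repeat(sigma, n))
```

The vectorized form is closer to `rnorm(200, mean = rep(mean, n), sd = rep(sigma, n))` in R.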

## Variable `y` built from `x` with slope 0.75 plus standard normal noise

In [375]:
# pretty similar to stats::rnorm in R. annoying part is that I had to concatenate the list above

In [376]:
y = x * 0.75 + numpy.random.normal(loc = 0, scale = 1, size = sum(n))
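
Multiplying `x` by 0.75 fixes the regression slope at 0.75; the resulting correlation also depends on the spread of `x` and of the noise. If a specific population correlation is wanted, one common construction standardizes `x` and scales the noise by the square root of 1 - r². A sketch under that assumption (the seed, the parameters of `x`, and the large sample size are only there to make the check stable):

```python
import numpy

rng = numpy.random.default_rng(2020)
target_r = 0.75

x = rng.normal(loc = 3.5, scale = 1.2, size = 100_000)
z = (x - x.mean()) / x.std()                       # standardize x

# y = r * z + sqrt(1 - r^2) * noise has population correlation r with x
noise = rng.normal(loc = 0, scale = 1, size = x.size)
y = target_r * z + numpy.sqrt(1 - target_r ** 2) * noise

observed_r = numpy.corrcoef(x, y)[0, 1]
```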

## Store variables in a data frame

In [377]:
# why the colons? why the squiggly brackets?

In [378]:
data1 = pandas.DataFrame({"y": y, "x": x, "group": group})
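
On the colons and brackets: curly brackets with `key: value` pairs build a dictionary (`dict`), roughly a named list in R. `pandas.DataFrame` reads each key as a column name and each value as that column's data. A plain-Python sketch with toy values:

```python
columns = {"y": [2.1, 3.4], "x": [1.0, 2.0]}   # {key: value, ...} builds a dict

columns["group"] = ["A", "B"]                   # add a key, like lst$group <- ... in R
names = list(columns)                           # keys keep their insertion order
first_x = columns["x"][0]                       # look up by key, then index the list
```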

## Create contrast variables for use in linear regression on `group` variable

In [379]:
# this is similar to base::factor in R, except that you pass these arrays in the regression formula. they're not an attribute of the factor.

In [380]:
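# Helmert contrasts
# Note: set() has no guaranteed order, so the level order (and the H.* labels in the summary) can change between sessions;
# sorted(set(group)) would give a stable order
group_helmert = patsy.contrasts.Helmert().code_without_intercept(list(set(group)))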

# 2 main effects and 1 interaction contrast
group_factorial = patsy.contrasts.ContrastMatrix([[-1, -1, 1], [-1, 1, -1], [1, -1, -1], [1, 1, 1]], 
                                                 ["Main Effect 1", "Main Effect 2", "Interaction"])
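
A quick way to sanity-check a hand-written contrast matrix like the factorial one above is to confirm that every column sums to zero and that the columns are mutually orthogonal. A sketch with NumPy (the variable names are mine):

```python
import numpy

# Rows are groups A-D; columns are Main Effect 1, Main Effect 2, Interaction
codes = numpy.array([[-1, -1, 1], [-1, 1, -1], [1, -1, -1], [1, 1, 1]])

column_sums = codes.sum(axis = 0)               # each contrast sums to zero across groups
cross_products = codes.T @ codes                # off-diagonal zeros mean orthogonal contrasts
interaction_check = codes[:, 0] * codes[:, 1]   # product of the two main-effect codes
```

The last line shows that the interaction column is simply the product of the two main-effect columns.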

## Add contrast variables to data frame

In [381]:
# not too efficient. why do I need .loc? the [row, column] indexing and the assignment are both similar to R.

In [382]:
data1.loc[data1["group"] == "A", "main1"] = -1
data1.loc[data1["group"] == "B", "main1"] = -1
data1.loc[data1["group"] == "C", "main1"] = 1
data1.loc[data1["group"] == "D", "main1"] = 1

data1.loc[data1["group"] == "A", "main2"] = -1
data1.loc[data1["group"] == "B", "main2"] = 1
data1.loc[data1["group"] == "C", "main2"] = -1
data1.loc[data1["group"] == "D", "main2"] = 1

data1.loc[data1["group"] == "A", "interaction"] = 1
data1.loc[data1["group"] == "B", "interaction"] = -1
data1.loc[data1["group"] == "C", "interaction"] = -1
data1.loc[data1["group"] == "D", "interaction"] = 1
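
On the `.loc` question: `.loc[rows, column]` selects rows by label or boolean mask and a column by name, and assigning through it writes into the data frame in one step. A more compact way to build the same codes, assuming a dictionary lookup per column is acceptable, is `Series.map`. The toy data frame below stands in for `data1`:

```python
import pandas

data = pandas.DataFrame({"group": ["A", "B", "C", "D", "A"]})   # toy data, not data1

# .loc[rows, column] selects by boolean mask and assigns in one step;
# rows that do not match are left as missing (NaN)
data.loc[data["group"] == "A", "flag"] = 1

# A dict lookup per column replaces four .loc assignments
main1_codes = {"A": -1, "B": -1, "C": 1, "D": 1}
main2_codes = {"A": -1, "B": 1, "C": -1, "D": 1}
data["main1"] = data["group"].map(main1_codes)
data["main2"] = data["group"].map(main2_codes)
data["interaction"] = data["main1"] * data["main2"]
```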

# Plots
## Boxplots

In [383]:
# similar to ggplot2::geom_boxplot

In [384]:
# boxplots of y by group
plotly.offline.iplot(
    plotly.express.box(data1, x = "group", y = "y")
)

In [385]:
# boxplots of x by group
plotly.offline.iplot(
    plotly.express.box(data1, x = "group", y = "x")
)

## Histograms

In [386]:
# similar to ggplot2::geom_histogram

In [387]:
# y histograms by group
plotly.offline.iplot(
    plotly.express.histogram(data1, x = "y", facet_col = "group")
)

In [388]:
# x histograms by group
plotly.offline.iplot(
    plotly.express.histogram(data1, x = "x", facet_col = "group")
)

## Scatterplot

In [389]:
# similar to ggplot2::geom_point and ggplot2::geom_smooth

In [390]:
plotly.offline.iplot(
    plotly.express.scatter(data1, x = "x", y = "y", color = "group", trendline = "ols", facet_col = "group")
)

## Bars of `group` means with 95% confidence intervals

In [391]:
# had to manually compute summary data in a data frame (haven't found ggplot2::stat_summary)

In [392]:
group_desc = data1.groupby("group")["y"].agg(["mean", "sem", "count"]).reset_index()
group_desc["df"] = group_desc["count"] - 1
# Half-width of the 95% confidence interval
group_desc["margin"] = scipy.stats.t.ppf(1 - 0.05 / 2, df = group_desc["df"]) * group_desc["sem"]
group_desc["lower"] = group_desc["mean"] - group_desc["margin"]
group_desc["upper"] = group_desc["mean"] + group_desc["margin"]

# Plot (error_y takes the distance from the bar height, not the interval bounds)
plotly.offline.iplot(
    plotly.express.bar(group_desc, x = "group", y = "mean", error_y = "margin", color = "group")
)
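
For reference, a worked version of the interval arithmetic on a toy sample. As far as I can tell from the Plotly Express documentation, `error_y` expects the length of the error bar measured from the bar height, which is the half-width (`margin` below) rather than the lower or upper bound. The sample values are made up for illustration:

```python
import math
import statistics
import scipy.stats

sample = [2, 4, 4, 6]                                     # toy data, not from data1
mean = statistics.mean(sample)
sem = statistics.stdev(sample) / math.sqrt(len(sample))   # standard error of the mean
t_crit = scipy.stats.t.ppf(1 - 0.05 / 2, df = len(sample) - 1)

margin = t_crit * sem            # half-width of the 95% confidence interval
lower, upper = mean - margin, mean + margin
```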

# Descriptive Statistics

In [393]:
# similar to psych::describe and psych::describeBy in R

In [394]:
data1.groupby("group")[["x", "y"]].describe().round(2)

| group | x count | x mean | x std | x min | x 25% | x 50% | x 75% | x max | y count | y mean | y std | y min | y 25% | y 50% | y 75% | y max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 50.0 | 3.14 | 0.85 | 1.35 | 2.53 | 3.15 | 3.72 | 4.82 | 50.0 | 2.28 | 1.35 | -0.5 | 1.19 | 2.54 | 3.23 | 4.62 |
| B | 50.0 | 3.86 | 1.07 | 1.53 | 2.94 | 3.73 | 4.58 | 6.8 | 50.0 | 3.14 | 1.45 | 0.17 | 2.13 | 3.05 | 4.19 | 6.29 |
| C | 50.0 | 3.96 | 1.2 | 0.94 | 3.1 | 3.88 | 4.72 | 6.4 | 50.0 | 3.19 | 1.38 | -0.75 | 2.4 | 3.25 | 3.87 | 6.17 |
| D | 50.0 | 3.23 | 0.98 | 1.2 | 2.5 | 3.3 | 3.93 | 5.23 | 50.0 | 2.39 | 1.14 | 0.34 | 1.47 | 2.48 | 2.99 | 5.1 |


# Correlation

In [395]:
# not a lot of help here from Python libraries. they give r and p, like SPSS. had to manually compute the rest. print syntax is pretty neat with the {}

In [396]:
# Save r and p-value
r1, pvalue1 = scipy.stats.pearsonr(x, y)

# Save degrees of freedom
ddf1 = len(x) - 2

# Recover the t-statistic (as an absolute value) from the two-tailed p-value; isf is the inverse survival function, i.e. the upper tail of the t distribution
t1 = scipy.stats.t.isf(pvalue1 / 2, df = ddf1)

# Compute standard error
se1 = r1 / t1

# Compute lower and upper confidence intervals
lower1, upper1 = (numpy.tanh(numpy.arctanh(r1) - 1 / math.sqrt(len(x) - 3) * scipy.stats.norm.ppf(1 - 0.05 / 2)),
                  numpy.tanh(numpy.arctanh(r1) + 1 / math.sqrt(len(x) - 3) * scipy.stats.norm.ppf(1 - 0.05 / 2)))

# Print most basic results
# "r = {0}, 95%CI [{1}, {2}], t({3}) = {4}, p = {5}".format(r1.round(2), lower1.round(2), upper1.round(2), ddf1, t1.round(2), pvalue1.round(3))

# Also as data frame
pandas.DataFrame({"r": r1, "lower": lower1, "upper": upper1, "t": t1, "df": ddf1, "p": pvalue1}, index = [0])

| | r | lower | upper | t | df | p |
|---|---|---|---|---|---|---|
| 0 | 0.735492 | 0.664565 | 0.793284 | 15.274958 | 198 | 2.563995e-35 |
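
The t-statistic can also be computed directly from r and the degrees of freedom, which avoids inverting a very small p-value. The sketch below plugs in the r and sample size reported above and repeats the Fisher z interval with the standard library only:

```python
import math
from statistics import NormalDist

r, n = 0.735492, 200            # values reported in the output above
df = n - 2

# t-statistic straight from r, no p-value needed
t = r * math.sqrt(df / (1 - r ** 2))

# 95% CI via the Fisher z transformation, as in the cell above
z_crit = NormalDist().inv_cdf(1 - 0.05 / 2)
half_width = z_crit / math.sqrt(n - 3)
lower = math.tanh(math.atanh(r) - half_width)
upper = math.tanh(math.atanh(r) + half_width)
```

Unlike the `isf` route, this keeps the sign of t when r is negative.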


# Analyses
## Regression
### Fit linear regression

In [397]:
# regress y on numeric/continuous x
ols_fit1 = statsmodels.formula.api.ols("y ~ x", data = data1).fit()

### Results Summary

In [398]:
ols_fit1.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.541 |
| Model: | OLS | Adj. R-squared: | 0.539 |
| Method: | Least Squares | F-statistic: | 233.3 |
| Date: | Tue, 14 Jan 2020 | Prob (F-statistic): | 2.56e-35 |
| Time: | 09:47:27 | Log-Likelihood: | -271.04 |
| No. Observations: | 200 | AIC: | 546.1 |
| Df Residuals: | 198 | BIC: | 552.7 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -0.5789 | 0.228 | -2.541 | 0.012 | -1.028 | -0.130 |
| x | 0.9375 | 0.061 | 15.275 | 0.000 | 0.816 | 1.059 |

| | | | |
|---|---|---|---|
| Omnibus: | 1.528 | Durbin-Watson: | 2.152 |
| Prob(Omnibus): | 0.466 | Jarque-Bera (JB): | 1.567 |
| Skew: | -0.154 | Prob(JB): | 0.457 |
| Kurtosis: | 2.695 | Cond. No. | 13.5 |


### Fit linear regression

In [399]:
# helmert contrasts on group
ols_fit2 = statsmodels.formula.api.ols("y ~ C(group, group_helmert)", data = data1).fit()

### Results Summary

In [400]:
ols_fit2.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.091 |
| Model: | OLS | Adj. R-squared: | 0.077 |
| Method: | Least Squares | F-statistic: | 6.57 |
| Date: | Tue, 14 Jan 2020 | Prob (F-statistic): | 0.000297 |
| Time: | 09:47:27 | Log-Likelihood: | -339.31 |
| No. Observations: | 200 | AIC: | 686.6 |
| Df Residuals: | 196 | BIC: | 699.8 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 2.7488 | 0.094 | 29.155 | 0.000 | 2.563 | 2.935 |
| C(group, group_helmert)[H.B] | 0.4315 | 0.133 | 3.236 | 0.001 | 0.169 | 0.695 |
| C(group, group_helmert)[H.A] | 0.1598 | 0.077 | 2.075 | 0.039 | 0.008 | 0.312 |
| C(group, group_helmert)[H.C] | -0.1208 | 0.054 | -2.220 | 0.028 | -0.228 | -0.013 |

| | | | |
|---|---|---|---|
| Omnibus: | 0.591 | Durbin-Watson: | 1.977 |
| Prob(Omnibus): | 0.744 | Jarque-Bera (JB): | 0.727 |
| Skew: | -0.093 | Prob(JB): | 0.695 |
| Kurtosis: | 2.77 | Cond. No. | 2.45 |


In [401]:
# factorial contrasts (2 main effects and 1 interaction)
ols_fit3 = statsmodels.formula.api.ols("y ~ C(group, group_factorial)", data = data1).fit()

### Results Summary

In [402]:
ols_fit3.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.091 |
| Model: | OLS | Adj. R-squared: | 0.077 |
| Method: | Least Squares | F-statistic: | 6.57 |
| Date: | Tue, 14 Jan 2020 | Prob (F-statistic): | 0.000297 |
| Time: | 09:47:27 | Log-Likelihood: | -339.31 |
| No. Observations: | 200 | AIC: | 686.6 |
| Df Residuals: | 196 | BIC: | 699.8 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| C(group, group_factorial)Main Effect 1 | 0.0389 | 0.094 | 0.413 | 0.680 | -0.147 | 0.225 |
| C(group, group_factorial)Main Effect 2 | 0.0151 | 0.094 | 0.160 | 0.873 | -0.171 | 0.201 |
| C(group, group_factorial)Interaction | -0.4165 | 0.094 | -4.417 | 0.000 | -0.602 | -0.231 |
| Intercept | 2.7488 | 0.094 | 29.155 | 0.000 | 2.563 | 2.935 |

| | | | |
|---|---|---|---|
| Omnibus: | 0.591 | Durbin-Watson: | 1.977 |
| Prob(Omnibus): | 0.744 | Jarque-Bera (JB): | 0.727 |
| Skew: | -0.093 | Prob(JB): | 0.695 |
| Kurtosis: | 2.77 | Cond. No. | 1.0 |


### Fit linear regression

In [403]:
ols_fit4 = statsmodels.formula.api.ols("y ~ main1 + main2 + interaction", data = data1).fit()

### Results Summary

In [404]:
ols_fit4.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.091 |
| Model: | OLS | Adj. R-squared: | 0.077 |
| Method: | Least Squares | F-statistic: | 6.57 |
| Date: | Tue, 14 Jan 2020 | Prob (F-statistic): | 0.000297 |
| Time: | 09:47:27 | Log-Likelihood: | -339.31 |
| No. Observations: | 200 | AIC: | 686.6 |
| Df Residuals: | 196 | BIC: | 699.8 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 2.7488 | 0.094 | 29.155 | 0.000 | 2.563 | 2.935 |
| main1 | 0.0389 | 0.094 | 0.413 | 0.680 | -0.147 | 0.225 |
| main2 | 0.0151 | 0.094 | 0.160 | 0.873 | -0.171 | 0.201 |
| interaction | -0.4165 | 0.094 | -4.417 | 0.000 | -0.602 | -0.231 |

| | | | |
|---|---|---|---|
| Omnibus: | 0.591 | Durbin-Watson: | 1.977 |
| Prob(Omnibus): | 0.744 | Jarque-Bera (JB): | 0.727 |
| Skew: | -0.093 | Prob(JB): | 0.695 |
| Kurtosis: | 2.77 | Cond. No. | 1.0 |
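
With equal group sizes and orthogonal ±1 codes, each coefficient in this model is the code-weighted sum of the group means divided by the number of groups. The sketch below redoes that arithmetic from the rounded `y` means in the descriptive statistics, so the results match the table only to about two decimals:

```python
# Group means of y, rounded, from the descriptive statistics table
means = {"A": 2.28, "B": 3.14, "C": 3.19, "D": 2.39}

codes = {
    "main1":       {"A": -1, "B": -1, "C":  1, "D": 1},
    "main2":       {"A": -1, "B":  1, "C": -1, "D": 1},
    "interaction": {"A":  1, "B": -1, "C": -1, "D": 1},
}

intercept = sum(means.values()) / len(means)       # unweighted mean of the group means
coefs = {
    name: sum(code[g] * means[g] for g in means) / len(means)
    for name, code in codes.items()
}
```

This also shows why the coded-column model and the `ContrastMatrix` model give identical estimates: they use the same codes.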
