## Linear regression in Python with `pandas` / `patsy` / `statsmodels` / `sklearn`

Statsmodels uses Patsy to provide formula syntax similar to `R`'s.

Formulas in `R` look like this:

```
Y ~ X1 + X2 + X3
```

In Python, start with data in `pandas` data frames:

In [None]:
import pandas as pd

In [None]:
url = "http://data.princeton.edu/wws509/datasets/salary.dat"

In [None]:
data = pd.read_csv(url, sep=r'\s+')  # raw string: columns are separated by runs of whitespace

In [None]:
data.head()
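
If the URL is unreachable, the parsing step can still be tried offline. The sketch below feeds `read_csv` a small in-memory sample; the column names follow the formula used later in this notebook, but the two rows are made up for illustration and are not taken from the real file.

```python
import io
import pandas as pd

# Hypothetical sample: whitespace-aligned columns, values invented for illustration.
sample = io.StringIO(
    "sx      rk         yr  sl\n"
    "male    full       25  36350\n"
    "female  assistant   2  16500\n"
)

# sep=r"\s+" treats any run of spaces or tabs as a single delimiter.
df = pd.read_csv(sample, sep=r"\s+")
print(df.dtypes)
```

Text columns such as `sx` and `rk` are read as strings, which `patsy` later treats as categorical factors.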

`patsy` can produce design matrices from formula specifications:

In [None]:
from patsy import dmatrices

In [None]:
y, X = dmatrices('sl ~ sx + yr + rk', data=data, return_type='dataframe')

In [None]:
y.head()

In [None]:
X.head()
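
To see what `dmatrices` does with a categorical column, here is a minimal sketch on a toy data frame. The frame `toy` and its values are hypothetical; only the column names mirror the salary data. By default `patsy` adds an intercept column and treatment-codes string columns, dropping the first level as the reference.

```python
import pandas as pd
from patsy import dmatrices

# Hypothetical toy data: names mirror the salary dataset, values are made up.
toy = pd.DataFrame({
    "sl": [30000, 25000, 28000, 22000, 35000, 27000],
    "sx": ["male", "female", "male", "female", "male", "female"],
    "yr": [10, 3, 7, 1, 15, 5],
})

# The left-hand side becomes y_toy, the right-hand side becomes the design matrix X_toy.
y_toy, X_toy = dmatrices("sl ~ sx + yr", data=toy, return_type="dataframe")

# 'female' sorts first, so it is the reference level and only a 'male' indicator appears.
print(list(X_toy.columns))
print(X_toy.head())
```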

`statsmodels` uses `patsy` for model specification, so a formula string can be passed directly to the model. It provides an array of modeling techniques, with summary output that resembles Stata's.

In [None]:
import statsmodels.formula.api as smf

In [None]:
model = smf.ols(formula="sl ~ yr", data=data).fit()
model.summary()

In [None]:
model = smf.ols(formula="sl ~ sx + yr + rk", data=data).fit()
model.summary()
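
Beyond the printed summary, the fitted results object exposes individual quantities as attributes. The sketch below uses synthetic data (the frame `toy` and its true slope of 500 are assumptions chosen for illustration) to show a few of them.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data with a known linear relationship plus a little noise.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"yr": np.arange(20, dtype=float)})
toy["sl"] = 20000 + 500 * toy["yr"] + rng.normal(0, 100, size=20)

res = smf.ols("sl ~ yr", data=toy).fit()

print(res.params)      # estimated coefficients, indexed by term name
print(res.rsquared)    # coefficient of determination
print(res.pvalues)     # p-values for each coefficient
print(res.conf_int())  # 95% confidence intervals by default
```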

`sklearn` does not integrate `patsy`, but it offers far more modeling options. The documentation is quite good. Check out the section on [Linear Models](http://scikit-learn.org/stable/modules/linear_model.html).

In [None]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X, y)
model.score(X, y)
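
One detail worth noting: a design matrix built by `patsy` already contains an `Intercept` column, while `LinearRegression` fits its own intercept by default. A sketch of one way to reconcile the two, assuming synthetic data in a hypothetical frame `toy`, is to pass `fit_intercept=False` so that the coefficients line up with those from `statsmodels`.

```python
import numpy as np
import pandas as pd
from patsy import dmatrices
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression

# Synthetic data: names mirror the salary dataset, values are generated.
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "yr": rng.integers(0, 25, size=40).astype(float),
    "sx": np.tile(["female", "male"], 20),
})
toy["sl"] = (18000 + 400 * toy["yr"] + 1000 * (toy["sx"] == "male")
             + rng.normal(0, 500, size=40))

y_toy, X_toy = dmatrices("sl ~ sx + yr", data=toy, return_type="dataframe")

# X_toy already has an Intercept column, so sklearn should not add another one.
sk = LinearRegression(fit_intercept=False).fit(X_toy, y_toy.values.ravel())
sk_coefs = pd.Series(sk.coef_, index=X_toy.columns)

sm_res = smf.ols("sl ~ sx + yr", data=toy).fit()

print(sk_coefs)
print(sm_res.params)
```

With the default `fit_intercept=True`, the fit is the same but the intercept is reported in `intercept_` and the coefficient on the constant column is zero.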

In [None]:
from sklearn.linear_model import Ridge

model = Ridge(alpha=0.5)  # alpha sets the strength of the L2 penalty
model.fit(X, y)

print(model.coef_)
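
To get a feel for what `alpha` does, the following sketch fits `Ridge` at several penalty strengths and prints the size of the coefficient vector each time. The data are synthetic and the particular `alpha` values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic regression problem with known coefficients (3, -2, 1).
rng = np.random.default_rng(2)
X_toy = rng.normal(size=(50, 3))
y_toy = X_toy @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=50)

norms = []
for alpha in [0.01, 1.0, 100.0]:
    coef = Ridge(alpha=alpha).fit(X_toy, y_toy).coef_
    norms.append(np.linalg.norm(coef))
    # A larger alpha penalizes large coefficients more heavily and shrinks them toward zero.
    print(alpha, np.round(coef, 3), round(norms[-1], 3))
```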

In [None]:
from sklearn.linear_model import RidgeCV

model = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0])
model.fit(X, y)

print(model.coef_)
print(model.alpha_)  # the penalty strength selected by cross-validation