# On Multiple Linear Regression - Codealong

In [1]:
import pandas as pd
import numpy as np

The main idea here is pretty simple. In simple linear regression, we took our dependent variable to be a function of a single independent variable; here, we'll take the dependent variable to be a function of multiple independent variables.

Our regression equation, then, instead of looking like $\hat{y} = mx + b$, will now look like:

$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1x_1 + \dots + \hat{\beta}_nx_n$.

Remember that the hats ( $\hat{}$ ) indicate parameters that are estimated.
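
As a minimal sketch of how this equation is evaluated, the snippet below prepends a column of ones to a small feature matrix so that the intercept $\hat{\beta}_0$ is handled by the same matrix product as the other coefficients. The data and the coefficient values are made up purely for illustration.

```python
import numpy as np

# Three observations of two made-up predictors, x_1 and x_2
X = np.array([[1.0, 2.0],
              [3.0, 0.0],
              [0.0, 4.0]])

# Hypothetical estimates: beta_0 (intercept), beta_1, beta_2
beta_hat = np.array([0.5, 2.0, -1.0])

# Prepend a column of ones so the intercept is part of the dot product
X_design = np.column_stack([np.ones(len(X)), X])

# Each prediction is beta_0 + beta_1 * x_1 + beta_2 * x_2
y_hat = X_design @ beta_hat
print(y_hat)
```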

## Dealing with Categorical Variables

One issue we'd like to resolve is what to do with categorical variables, i.e. variables that represent categories rather than continua. In a Pandas DataFrame, these columns may well have strings or objects for values, but they need not. Recall e.g. the heart-disease dataset from Kaggle in which the target variable took values 0-4, each representing a different stage of heart disease.

### Ordinal Mapping
If our data are in a pandas Series or DataFrame, we can convert these categories to numbers by calling the Series astype method with 'category' and then reading the integer codes from the .cat.codes attribute.


In [2]:
df = pd.DataFrame({'vertebrates': ['Bird', 'Bird', 'Mammal', 'Fish', 'Amphibian', 'Reptile', 'Mammal']})

df.vertebrates.astype("category").cat.codes

0    1
1    1
2    3
3    2
4    0
5    4
6    3
dtype: int8

You can also pass the list of categories in explicitly, so that you have a record of which label matches which code.

Any value that is missing from that list will then be coded as -1 (and shown as NaN in the categorical itself).
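
A minimal sketch of passing the labels in separately, here using pd.Categorical with an explicit categories list. The `known_classes` list is hypothetical, and 'Mammal' is left out on purpose to show how a value outside the list is coded.

```python
import pandas as pd

# Hypothetical label list; 'Mammal' is deliberately left out
known_classes = ['Amphibian', 'Bird', 'Fish', 'Reptile']

animals = pd.Series(['Bird', 'Mammal', 'Fish'])
as_category = pd.Categorical(animals, categories=known_classes)

# Integer codes follow the order of known_classes; unknown values get -1
codes = as_category.codes.tolist()
print(codes)

# A code -> label lookup to keep alongside the encoded data
code_to_label = dict(enumerate(as_category.categories))
print(code_to_label)
```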


In [4]:
ordered_satisfaction = ['Very Unhappy', 'Unhappy', 'Neutral', 'Happy', 'Very Happy']

df = pd.DataFrame({'satisfaction':['Mad', 'Happy', 'Unhappy', 'Neutral']})
df

| | satisfaction |
|---|---|
| 0 | Mad |
| 1 | Happy |
| 2 | Unhappy |
| 3 | Neutral |


In [5]:
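# Current pandas no longer accepts `categories`/`ordered` in astype directly;
# build a CategoricalDtype and pass that instead.
satisfaction_dtype = pd.CategoricalDtype(categories=ordered_satisfaction, ordered=True)
df.satisfaction.astype(satisfaction_dtype)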


0        NaN
1      Happy
2    Unhappy
3    Neutral
Name: satisfaction, dtype: category
Categories (5, object): [Very Unhappy < Unhappy < Neutral < Happy < Very Happy]

### Dummying

One very effective way of dealing with categorical variables is to dummy them out (also known as one-hot encoding). This involves making a new column for _each categorical value in the column we're dummying out_. We'll do this below in our air safety dataset where we have a column of airline names.

These new columns will be filled only with 0s and 1s, with a 1 representing the presence of the relevant categorical value.

Let's look at a simple example:

In [None]:
chars = pd.read_csv('ds_chars.csv', index_col=0)

In [None]:
# Let's try using pd.get_dummies() to create our dummy columns:
state_dums = pd.get_dummies(chars['home_state'])
state_dums
# We could also have used LabelBinarizer from sklearn.preprocessing


# Now we need to add these dummy columns to our original dataset:

chars_states = pd.concat([chars, state_dums], axis=1)
chars_states
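
Since the contents of ds_chars.csv are not shown here, the following self-contained sketch repeats the same steps on a made-up DataFrame (the names and states are invented). It also shows the optional `drop_first=True` argument: when a regression includes an intercept, the full set of dummy columns always sums to 1 and is therefore perfectly collinear with it, so one category is often dropped and treated as the baseline.

```python
import pandas as pd

# Made-up stand-in for the `chars` DataFrame
toy = pd.DataFrame({'name': ['Ann', 'Bo', 'Cy', 'Di'],
                    'home_state': ['WA', 'OR', 'WA', 'CA']})

# One 0/1 column per state (columns come out in sorted order)
all_dums = pd.get_dummies(toy['home_state'], dtype=int)

# drop_first=True drops the first category ('CA'), which becomes the baseline
baseline_dums = pd.get_dummies(toy['home_state'], drop_first=True, dtype=int)

# Attach the dummy columns to the original data
toy_states = pd.concat([toy, baseline_dums], axis=1)
print(toy_states)
```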

## Drug Use Dataset

In [6]:
drugs = pd.read_csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/drug-use-by-age/drug-use-by-age.csv')

In [7]:
drugs.head()

| | age | n | alcohol-use | alcohol-frequency | marijuana-use | marijuana-frequency | cocaine-use | cocaine-frequency | crack-use | crack-frequency | ... | oxycontin-use | oxycontin-frequency | tranquilizer-use | tranquilizer-frequency | stimulant-use | stimulant-frequency | meth-use | meth-frequency | sedative-use | sedative-frequency |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 12 | 2798 | 3.9 | 3.0 | 1.1 | 4.0 | 0.1 | 5.0 | 0.0 | - | ... | 0.1 | 24.5 | 0.2 | 52.0 | 0.2 | 2.0 | 0.0 | - | 0.2 | 13.0 |
| 1 | 13 | 2757 | 8.5 | 6.0 | 3.4 | 15.0 | 0.1 | 1.0 | 0.0 | 3.0 | ... | 0.1 | 41.0 | 0.3 | 25.5 | 0.3 | 4.0 | 0.1 | 5.0 | 0.1 | 19.0 |
| 2 | 14 | 2792 | 18.1 | 5.0 | 8.7 | 24.0 | 0.1 | 5.5 | 0.0 | - | ... | 0.4 | 4.5 | 0.9 | 5.0 | 0.8 | 12.0 | 0.1 | 24.0 | 0.2 | 16.5 |
| 3 | 15 | 2956 | 29.2 | 6.0 | 14.5 | 25.0 | 0.5 | 4.0 | 0.1 | 9.5 | ... | 0.8 | 3.0 | 2.0 | 4.5 | 1.5 | 6.0 | 0.3 | 10.5 | 0.4 | 30.0 |
| 4 | 16 | 3058 | 40.1 | 10.0 | 22.5 | 30.0 | 1.0 | 7.0 | 0.0 | 1.0 | ... | 1.1 | 4.0 | 2.4 | 11.0 | 1.8 | 9.5 | 0.3 | 36.0 | 0.2 | 3.0 |


In [None]:
drugs.info()

In [None]:
drugs['age'] = drugs['age'].map(int)

What happened?

In [None]:
# Let's take a closer look at this 'age' column:

drugs['age'][:15]

In [None]:
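# Keep only the first ten rows, where the conversion to int works.
# .copy() avoids a SettingWithCopyWarning when we assign to the slice.
drugs = drugs.head(10).copy()
drugs['age'] = drugs['age'].map(int)
drugs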
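
An alternative to slicing off the first ten rows is to let pandas flag the values it cannot parse. The sketch below uses a made-up Series; the range-style strings in it ('22-23', '65+') are an assumption about what the raw 'age' column looks like, not something taken from the output above.

```python
import pandas as pd

# Made-up ages; the range strings are an assumption about the raw column
ages = pd.Series(['20', '21', '22-23', '65+'])

# errors='coerce' turns anything that cannot be parsed as a number into NaN
numeric_ages = pd.to_numeric(ages, errors='coerce')

# Keep only the entries that parsed cleanly, then convert to int
clean_ages = numeric_ages.dropna().astype(int)
print(clean_ages.tolist())
```

Dropping the unparsed rows mirrors what `head(10)` does above; whether to drop age ranges or replace them with something like a midpoint is a modeling choice.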

## Model Selection

Let's imagine that I'm going to try to predict age based on factors related to drug use.

Now: which columns (predictors) should I choose? Even ignoring the non-numeric columns in my dataset, there are still 20 predictors I could choose! For each of these predictors, I can either include it in my model or leave it out, which means that there are $2^{20}$ = 1,048,576 different models I could construct! Well, okay, one of these is the "empty model" with no predictors in it. But that still leaves 1,048,575 models to choose from!
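
The count can be checked directly. This small sketch computes it two ways: as two choices per predictor, and by adding up the number of models of each size.

```python
from math import comb

n_predictors = 20

# Each predictor is either in or out of the model: 2 choices, 20 times over
n_subsets = 2 ** n_predictors

# Same count, built up by model size: choose k predictors out of 20
by_size = sum(comb(n_predictors, k) for k in range(n_predictors + 1))

# Subtract 1 to exclude the empty model
n_nonempty = n_subsets - 1
print(n_subsets, by_size, n_nonempty)
```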

How can I decide which predictors to use in my model?

### Correlation

In [None]:
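# Use the .corr() DataFrame method to find the correlation
# between every pair of numeric variables. numeric_only=True skips
# columns that still hold strings (recent pandas versions raise otherwise).

drugs.corr(numeric_only=True)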

In [None]:
import seaborn as sns
%matplotlib inline
sns.set(rc={'figure.figsize':(9, 9)})

# Use the sns.heatmap function to depict the relationships visually!
sns.heatmap(drugs.corr(numeric_only=True));

In [None]:
# Let's look at the correlations with 'age'
# (our dependent variable) in particular.

drugs.corr(numeric_only=True)['age'].sort_values(ascending=False)

In [None]:
X = drugs[['alcohol-use', 'tranquilizer-frequency', 'stimulant-use']]
y = drugs['age']

### Multicollinearity

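'alcohol-use' and 'alcohol-frequency' are probably highly correlated _with each other_ as well as with 'age'. When predictors are strongly correlated with one another, the model cannot cleanly separate their individual effects, so the coefficient estimates become unstable, and carrying redundant predictors can contribute to an _overfit_ model. We'll stick a pin in this and return to the issue of overfit models soon.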
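
One common way to quantify this is the variance inflation factor (VIF): regress each predictor on all the others and compute $1 / (1 - R^2)$. The sketch below does this by hand with NumPy on synthetic data, where `x2` is built to be nearly a copy of `x1`; the data and the helper `vif` are illustrative and not part of the drug-use analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic predictors: x2 is nearly a copy of x1, x3 is unrelated
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)
x3 = rng.normal(size=n)
X_toy = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing it on the other columns."""
    target = X[:, j]
    others = np.delete(X, j, axis=1)
    design = np.column_stack([np.ones(len(X)), others])  # include an intercept
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ coef
    r_squared = 1 - resid.var() / target.var()
    return 1 / (1 - r_squared)

vifs = [vif(X_toy, j) for j in range(X_toy.shape[1])]
print(np.round(vifs, 1))
```

A VIF close to 1 means a predictor is nearly uncorrelated with the rest; values far above 1, as for `x1` and `x2` here, signal multicollinearity.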

## Multiple Regression in StatsModels

In [None]:
import statsmodels.api as sm

In [None]:
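# OLS does not fit an intercept on its own, so add a column of ones.
# Passing a NumPy array means the summary labels the predictors x1, x2, x3.
predictors = np.asarray(X)
predictors_int = sm.add_constant(predictors)
model = sm.OLS(y, predictors_int).fit()
model.summary()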

In [None]:
# Same model, but passing the DataFrame keeps the column names in the summary.
predictors_int = sm.add_constant(X)
model = sm.OLS(np.asarray(y), predictors_int).fit()
model.summary()