## Case Study: Advertising Data

In [1]:
# imports
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.model_selection import train_test_split

# allow plots to appear directly in the notebook
%matplotlib inline
sns.set()

### Read data into a DataFrame

Use `pandas.read_csv()` to load data into a DataFrame.

The data can be a local file on your computer or a file hosted online (pass the URL instead of a file path).

In [None]:
# index_col=0 uses the first column of the CSV file as the row index
data = pd.read_csv('Advertising.csv', index_col=0)
data.head()
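
To see what `index_col=0` changes, here is a minimal sketch that reads a small in-memory CSV instead of `Advertising.csv`. The three rows and their values are made up for illustration; only the column layout mirrors the cell above.

```python
import io
import pandas as pd

# Hypothetical stand-in for Advertising.csv: the unnamed first column
# is a row counter that was written out when the file was saved.
csv_text = """,TV,Radio,Newspaper,Sales
1,100.0,20.0,30.0,15.0
2,50.0,10.0,40.0,9.0
3,20.0,5.0,10.0,6.0
"""

with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)
without_index = pd.read_csv(io.StringIO(csv_text))

print(with_index.columns.tolist())     # the counter became the row index
print(without_index.columns.tolist())  # the counter stays as an 'Unnamed: 0' column
```

Without `index_col=0`, the counter would show up as an extra column that could be mistaken for a feature.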

What are the input variables?
- TV: advertising dollars spent on TV for a single product in a given market 
- Radio: advertising dollars spent on Radio
- Newspaper: advertising dollars spent on Newspaper

All of the above are in thousands of dollars.

What is the target variable?
- Sales: sales of a single product in a given market (in thousands of widgets)

### Exploratory Data Analysis

Show the shape of the DataFrame

In [None]:
data.shape

Get a high-level overview of the data: column names, column types, and any missing values.

In [None]:
data.info()

Get a high-level summary of each numerical column.

In [None]:
data.describe()

Are there any missing values or illogical values (e.g. negative quantities)? If not, this is a clean data set.
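
Both checks can also be done programmatically rather than by reading the `info()` and `describe()` output. A minimal sketch on a made-up toy DataFrame (the values are hypothetical and deliberately contain one missing and one negative entry):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value and one negative value.
toy = pd.DataFrame({
    "TV": [100.0, 50.0, np.nan],
    "Radio": [20.0, -5.0, 10.0],
    "Sales": [15.0, 9.0, 7.0],
})

missing_per_column = toy.isna().sum()    # number of NaN entries per column
negative_per_column = (toy < 0).sum()    # number of negative entries per column

print(missing_per_column)
print(negative_per_column)
```

On the real data, the same two lines applied to `data` should print all zeros if the data set is clean.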

### Simple Linear Regression

#### Relationship between sales and Newspaper spend

$$y = \beta_0 + \beta_1x$$
- $y$ is the target variable, i.e. the sales
- $x$ corresponds to the input variable, i.e. Newspaper spend
- $\beta_0$ is the intercept with the y-axis
- $\beta_1$ is the coefficient for the input variable
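
For a single input variable, the least-squares estimates have a closed form: $\hat\beta_1 = \sum_i (x_i - \bar{x})(y_i - \bar{y}) / \sum_i (x_i - \bar{x})^2$ and $\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}$. The sketch below applies these formulas to made-up points that lie exactly on the line $y = 2 + 3x$; it is an illustration, not the code path scikit-learn uses.

```python
import numpy as np

# Made-up data that lies exactly on y = 2 + 3x.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([5.0, 8.0, 11.0, 14.0])

# Closed-form least-squares estimates for one input variable.
beta_1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta_0 = y.mean() - beta_1 * x.mean()

print(beta_0, beta_1)
```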

Prepare the data

In [None]:
# create x and y
feature_cols = ['Newspaper']
x_orig = data.loc[:,feature_cols].values 
# feature_cols must be a list of column names, even if just 1 col
# x for linear regression must be 2D array (x cannot be 1D flat array)
display(x_orig[:5])

y = data.loc[:,"Sales"]

y = data['Sales']  # equivalent alternative: plain column selection

display(y[:5])
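
The 2D requirement is easiest to see from array shapes. A small sketch with a made-up two-column frame, showing a 1D selection and two common ways to get the `(n_samples, 1)` shape:

```python
import pandas as pd

# Hypothetical toy data, only used to inspect shapes.
toy = pd.DataFrame({"Newspaper": [10.0, 20.0, 30.0], "Sales": [5.0, 6.0, 7.0]})

x_1d = toy["Newspaper"].values          # single label -> 1D array, shape (3,)
x_2d_list = toy[["Newspaper"]].values   # list of labels -> 2D array, shape (3, 1)
x_2d_reshape = x_1d.reshape(-1, 1)      # reshape a 1D array into one column

print(x_1d.shape, x_2d_list.shape, x_2d_reshape.shape)
```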

#### Use LinearRegression model 

Objective: minimize the **sum of squared residuals (SSR)**, where $u_i = y_i - \hat{y}_i$ is the residual of observation $i$:
$$
\min~\text{SSR} = \min\sum\limits_{i=1}^n u_i^2 = \min\sum\limits_{i=1}^n(y_i - \hat{y}_i)^2.
$$
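
To make the objective concrete, the sketch below computes the SSR of a least-squares line and of two slightly perturbed lines on made-up data. It uses `np.polyfit` as a stand-in for `LinearRegression`; the helper `ssr` is a hypothetical name introduced for this example.

```python
import numpy as np

def ssr(intercept, slope, x, y):
    """Sum of squared residuals of the line intercept + slope * x."""
    residuals = y - (intercept + slope * x)
    return float(np.sum(residuals ** 2))

# Made-up data scattered around a straight line.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# np.polyfit with deg=1 returns the least-squares [slope, intercept].
slope_hat, intercept_hat = np.polyfit(x, y, deg=1)

ssr_best = ssr(intercept_hat, slope_hat, x, y)
ssr_shifted = ssr(intercept_hat + 0.5, slope_hat, x, y)   # move the line up
ssr_tilted = ssr(intercept_hat, slope_hat + 0.1, x, y)    # make the line steeper

print(ssr_best, ssr_shifted, ssr_tilted)
```

Any change to the fitted intercept or slope increases the SSR, which is what "least squares" means.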

In [None]:
# create LinearRegression object and fit
lm = LinearRegression()
lm.fit(x_orig, y)

# print the results
print(lm.intercept_)
print(lm.coef_)

#### Using the Model for Prediction

Suppose there are 2 new markets where the Newspaper advertising spend is **\$100,000** and **\$200,000**. What Sales would we predict for those markets? Because spend is recorded in thousands of dollars, the inputs are 100 and 200:

$$\hat{y}_1 = 12.3514 + 0.05469 \times 100$$
$$\hat{y}_2 = 12.3514 + 0.05469 \times 200$$
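
The same arithmetic in plain Python, using the rounded intercept and coefficient quoted above (so the result may differ from `lm.predict` in the last decimals):

```python
intercept = 12.3514   # rounded value from the equations above
slope = 0.05469       # rounded value from the equations above

spend = [100, 200]    # Newspaper spend in thousands of dollars
predictions = [intercept + slope * s for s in spend]

print([round(p, 4) for p in predictions])  # [17.8204, 23.2894]
```

The predictions are in the units of Sales, i.e. thousands of widgets.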

In [None]:
x_new = np.array([[100], [200]])
# x must be a 2D array with one row per market and 1 column; x cannot be a 1D array
y_pred = lm.predict(x_new)
y_pred

#### How Well Does the Model Fit the Data?

Let's feed the original Newspaper column back into the fitted linear model to get the model's predicted y values.

In [None]:
y_pred = lm.predict(x_orig)

Use a plot to check if the prediction (linear model) fits the original data (Newspaper spend vs. Sales) well

In [None]:
# Plot the original data set on a scatterplot
sns.scatterplot(x=data['Newspaper'], y=y)

# plot the linear fit as a lineplot
sns.lineplot(x=data['Newspaper'], y=y_pred, color="orange")

#### Mean Squared Error (MSE) and Root Mean Squared Error (RMSE) 

In [None]:
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y, y_pred)
print("MSE:", mse, "RMSE:", np.sqrt(mse))
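
MSE is the mean of the squared prediction errors, $\text{MSE} = \frac{1}{n}\sum_i (y_i - \hat{y}_i)^2$, and RMSE is its square root, which puts the error back in the units of the target. A hand computation on made-up numbers:

```python
import numpy as np

# Made-up true values and predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 11.0])

squared_errors = (y_true - y_hat) ** 2   # [1, 0, 1, 4]
mse_manual = squared_errors.mean()       # 6 / 4 = 1.5
rmse_manual = np.sqrt(mse_manual)

print(mse_manual, rmse_manual)
```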

#### R-Squared

$$
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
$$



In [None]:
lm.score(x_orig, y)
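
For a `LinearRegression` model, `score` returns the $R^2$ defined above. The formula can be checked by hand on made-up numbers:

```python
import numpy as np

# Made-up true values and predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 11.0])

ss_res = np.sum((y_true - y_hat) ** 2)           # residual sum of squares: 6
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares: 20
r2_manual = 1.0 - ss_res / ss_tot                # 1 - 6/20 = 0.7

print(r2_manual)
```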

### Multiple Linear Regression

Let's now consider all three input variables: TV, Radio, and Newspaper.

In [None]:
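# label-based slices include both endpoints: every column from TV through Newspaper
x_all = data.loc[:, 'TV':'Newspaper']
y = data.loc[:,"Sales"]

# create a LinearRegression model
lm_all_sklearn = LinearRegression()

lm_all_sklearn.fit(x_all, y)
print(lm_all_sklearn.intercept_)
print(lm_all_sklearn.coef_)  # one coefficient per column of x_all, in column order

# R-squared score
lm_all_sklearn.score(x_all, y)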

In [None]:
data.loc[:, ['Newspaper','Sales']].corr()
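
One reason to look at this correlation: in a simple linear regression with a single input, $R^2$ equals the squared Pearson correlation between the input and the target. A sketch on made-up data, again using `np.polyfit` in place of `LinearRegression`:

```python
import numpy as np

# Made-up data with a roughly linear trend.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.5, 5.5, 8.5, 9.0])

r = np.corrcoef(x, y)[0, 1]   # Pearson correlation

slope, intercept = np.polyfit(x, y, deg=1)
y_hat = intercept + slope * x
r_squared = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

print(r ** 2, r_squared)      # the two values agree
```

Squaring the Newspaper/Sales correlation from the cell above should therefore reproduce the `lm.score(x_orig, y)` value from the simple regression.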

### Dealing with categorical variables

In [None]:
data = pd.read_csv('condo.csv')
data.head()

**_For student’s own exploration_**

Option 1: Use pandas.get_dummies()

In [None]:
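# one indicator column per distinct value of the 'type' column
type_dummies = pd.get_dummies(data['type'])
# append the indicator columns to the original DataFrame
data = pd.concat([data, type_dummies], axis=1)
data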
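
A regression with an intercept usually needs one dummy fewer than the number of categories, because the full set of dummies sums to 1 in every row. `pd.get_dummies` supports this through `drop_first=True`. A minimal sketch on a made-up frame; the category labels `"A"` and `"B"` are hypothetical and not taken from `condo.csv`:

```python
import pandas as pd

# Hypothetical toy frame with a two-level categorical column.
toy = pd.DataFrame({"type": ["A", "B", "A"], "price": [1.0, 2.0, 1.5]})

dummies_all = pd.get_dummies(toy["type"], prefix="type")                     # type_A, type_B
dummies_drop = pd.get_dummies(toy["type"], prefix="type", drop_first=True)  # type_B only

print(dummies_all.columns.tolist())
print(dummies_drop.columns.tolist())
```

With `drop_first=True`, category `"A"` becomes the baseline and `type_B` measures the difference from it.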

Option 2: Use scikit-learn's `OneHotEncoder`

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
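
A minimal sketch of Option 2 on a made-up column (the labels `"A"` and `"B"` are hypothetical). By default the encoder returns a sparse matrix, so `.toarray()` is used here to display it:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame with a two-level categorical column.
toy = pd.DataFrame({"type": ["A", "B", "A"]})

encoder = OneHotEncoder()
# The encoder expects 2D input, hence the double brackets.
encoded = encoder.fit_transform(toy[["type"]]).toarray()

print(encoder.categories_)   # categories learned for each input column
print(encoded)               # one indicator column per category
```

Unlike `get_dummies`, a fitted encoder can be reused with `encoder.transform` on new data, which keeps the column layout consistent between training and prediction.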

## Case Study: Condo Data

### Ordinary Least Squares (OLS) model

In [2]:
import statsmodels.formula.api as smf

In [None]:
# keep rows with district_code 5, area below 1500, and remaining_years below 100
d5_condo = data.loc[(data['district_code'] == 5) &
                    (data['area'] < 1500) &
                    (data['remaining_years'] < 100)]

#### Categorical variable 'type' with 2 values

In [None]:
d5_model = smf.ols('price ~ area + type', data=d5_condo)
result = d5_model.fit()
print(result.summary())
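
When a formula contains a two-level categorical variable, the model is fitted with one 0/1 indicator for the non-baseline level. The sketch below imitates that by hand with NumPy on made-up, noise-free data; the labels `"A"`/`"B"` and the coefficients 100, 2, and 50 are assumptions chosen for illustration, and this is not how statsmodels is implemented internally.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: price = 100 + 2 * area + 50 * (type == "B"), no noise.
toy = pd.DataFrame({
    "area": [50.0, 70.0, 90.0, 60.0, 80.0, 100.0],
    "type": ["A", "A", "A", "B", "B", "B"],
})
is_b = (toy["type"] == "B").astype(float)
toy["price"] = 100.0 + 2.0 * toy["area"] + 50.0 * is_b

# Design matrix: intercept, area, indicator for type B (type A is the baseline).
X = np.column_stack([np.ones(len(toy)), toy["area"], is_b])
coef, *_ = np.linalg.lstsq(X, toy["price"].to_numpy(), rcond=None)
intercept, area_slope, type_b_shift = coef

print(intercept, area_slope, type_b_shift)
```

The indicator's coefficient is a vertical shift: both types share the same slope on area, but type B sits 50 higher.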

#### Categorical variable 'segment' with more than 2 values

In [None]:
small_condo = data.loc[data['area'] < 1500]
small_condo_model = smf.ols('price ~ segment + area', data=small_condo)
result = small_condo_model.fit()
print(result.summary())

#### Interaction terms

In [None]:
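# 'type * area' is shorthand for type + area + type:area (both main effects plus their interaction)
d5_model = smf.ols('price ~ type * area', data=d5_condo)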
result = d5_model.fit()
print(result.summary())
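
An interaction term lets the slope on area differ between types. The sketch below builds the interaction column by hand on made-up, noise-free data; the labels and the coefficients 100, 2, 50, and 0.5 are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data:
# price = 100 + 2 * area + 50 * is_b + 0.5 * area * is_b, no noise.
toy = pd.DataFrame({
    "area": [50.0, 70.0, 90.0, 60.0, 80.0, 100.0],
    "type": ["A", "A", "A", "B", "B", "B"],
})
is_b = (toy["type"] == "B").astype(float)
toy["price"] = 100.0 + 2.0 * toy["area"] + 50.0 * is_b + 0.5 * toy["area"] * is_b

# Design matrix: intercept, area, type indicator, and their product.
X = np.column_stack([np.ones(len(toy)), toy["area"], is_b, toy["area"] * is_b])
coef, *_ = np.linalg.lstsq(X, toy["price"].to_numpy(), rcond=None)

slope_type_a = coef[1]             # slope on area for the baseline type
slope_type_b = coef[1] + coef[3]   # baseline slope plus the interaction coefficient

print(slope_type_a, slope_type_b)
```

The interaction coefficient is the difference between the two slopes, so here type B gains 2.5 per unit of area compared with 2.0 for type A.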

#### How to treat numbers as categorical variables

In [None]:
# district_code is stored as a number, so it enters as a single numeric predictor with one slope
small_condo_model = smf.ols('price ~ district_code + area', data=small_condo)
result = small_condo_model.fit()
print(result.summary())

In [None]:
# C(...) treats district_code as categorical: one dummy variable per district except the baseline
small_condo_model = smf.ols('price ~ C(district_code) + area', data=small_condo)
result = small_condo_model.fit()
print(result.summary())
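
The difference between the two models is in how many columns the district code contributes. A pandas-only sketch with made-up district codes, using `get_dummies` to imitate what `C(...)` does (an approximation of the idea, not the statsmodels internals):

```python
import pandas as pd

# Hypothetical district codes; the numbers are labels, not quantities.
toy = pd.DataFrame({"district_code": [5, 9, 10, 5, 9]})

# Numeric treatment: a single column, so the model assumes price changes
# by the same amount for every +1 in the code.
as_number = toy[["district_code"]]

# Categorical treatment: one indicator per district except the baseline (5).
as_category = pd.get_dummies(
    toy["district_code"].astype("category"), prefix="district", drop_first=True
)

print(as_number.shape[1])             # number of columns under numeric treatment
print(as_category.columns.tolist())   # indicator columns under categorical treatment
```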

## Case Study: Wage Data

### OLS with nonlinear terms

In [None]:
wage = pd.read_csv('wage.csv')
wage_female = wage.loc[wage['female']==1]
wage_female.head()

Can the relationship be fitted to a straight line?

In [None]:
plt.figure(figsize=(4.5, 3.5))
plt.scatter(wage_female['exper'], wage_female['wage'], 
            c='b', alpha=0.3)

plt.xlabel('Working experience (years)', fontsize=13)
plt.ylabel('Hourly wage (Dollars)', fontsize=13)
plt.show()

Let's try Model 1
$$
y_{wage} = \beta_0 + \beta_1 x_{exper}
$$
   

In [None]:
model1 = smf.ols('wage ~ exper', data=wage_female)
result1 = model1.fit()
print(result1.summary())

Let's try Model 2 

$$
y_{wage} = \beta_0 + \beta_1 x_{exper} + \beta_2 \sqrt{x_{exper}}
$$
   

In [None]:
model2 = smf.ols('wage ~ exper + np.sqrt(exper)', data=wage_female)
result2 = model2.fit()
print(result2.summary())
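
Model 2 is still a *linear* regression: it is nonlinear in `exper` but linear in the coefficients, because $\sqrt{x_{exper}}$ is simply one more column. A sketch on made-up, noise-free data (the coefficients 5, 0.1, and 2 are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: wage = 5 + 0.1 * exper + 2 * sqrt(exper), no noise.
toy = pd.DataFrame({"exper": [1.0, 4.0, 9.0, 16.0, 25.0]})
toy["sqrt_exper"] = np.sqrt(toy["exper"])
toy["wage"] = 5.0 + 0.1 * toy["exper"] + 2.0 * toy["sqrt_exper"]

# Ordinary least squares on the columns [1, exper, sqrt(exper)].
X = np.column_stack([np.ones(len(toy)), toy["exper"], toy["sqrt_exper"]])
coef, *_ = np.linalg.lstsq(X, toy["wage"].to_numpy(), rcond=None)

print(coef)
```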

Let's try Model 3

$$
\log(y_{wage}) = \beta_0 + \beta_1 x_{exper} + \beta_2 \sqrt{x_{exper}}
$$
   

In [None]:
model3 = smf.ols('np.log(wage) ~ exper + np.sqrt(exper)', data=wage_female)
result3 = model3.fit()
print(result3.summary())
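
Because Model 3 predicts $\log(y_{wage})$, its predictions must be passed through `np.exp` to get back to the wage scale. A sketch on made-up, noise-free data (the coefficients 1, 0.02, and 0.3 are assumptions for illustration):

```python
import numpy as np

# Hypothetical toy data: log(wage) = 1 + 0.02 * exper + 0.3 * sqrt(exper), no noise.
exper = np.array([1.0, 4.0, 9.0, 16.0, 25.0, 36.0])
log_wage = 1.0 + 0.02 * exper + 0.3 * np.sqrt(exper)

X = np.column_stack([np.ones_like(exper), exper, np.sqrt(exper)])
coef, *_ = np.linalg.lstsq(X, log_wage, rcond=None)

# Predict for 16 years of experience, then undo the log.
new_row = np.array([1.0, 16.0, np.sqrt(16.0)])
pred_log_wage = new_row @ coef       # on the log scale: 1 + 0.32 + 1.2 = 2.52
pred_wage = np.exp(pred_log_wage)    # back on the wage scale

print(pred_log_wage, pred_wage)
```

Note that the $R^2$ of Model 3 is measured on the log scale, so it is not directly comparable with the $R^2$ of Models 1 and 2.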

## Coding exercise

In [3]:
data = pd.read_csv('insurance.csv')  
data.head()

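| Unnamed: 0 | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | 0 | 27.9 | 0 | 1 | 3 | 16884.924 |
| 1 | 18 | 1 | 33.77 | 1 | 0 | 2 | 1725.5523 |
| 2 | 28 | 1 | 33.0 | 3 | 0 | 2 | 4449.462 |
| 3 | 33 | 1 | 22.705 | 0 | 0 | 1 | 21984.47061 |
| 4 | 32 | 1 | 28.88 | 0 | 0 | 1 | 3866.8552 |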


Questions for exploration:

- Which variable is the target variable y?
- Which input variables are categorical? Which ones are numerical?
- Any interaction terms?
- Any non-linear terms?

In [None]:
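# baseline: every predictor enters as a plain number, including the integer-coded sex, smoker, and region
model = smf.ols('charges ~ age + sex + bmi + children + smoker + region', data=data)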
result = model.fit()
print(result.summary())

In [None]:
model = smf.ols('charges ~ age + C(sex) * bmi + children + C(smoker) + C(region)', data=data)
result = model.fit()
print(result.summary())

In [None]:
model = smf.ols('charges ~ age + C(sex) * bmi + children + C(smoker) * bmi + C(region)', data=data)
result = model.fit()
print(result.summary())

In [None]:
model = smf.ols('charges ~ C(sex) * age + C(sex) * bmi + children + C(smoker) * bmi + C(region)', data=data)
result = model.fit()
print(result.summary())

In [None]:
# Treating children as categorical is equivalent to adding one binary (dummy) variable per level beyond the baseline.
# R-squared increases as we introduce more variables, so a higher value alone does not settle the question.
# Should we make children a categorical variable?

model = smf.ols('charges ~ C(sex) * age + C(sex) * bmi + C(children) + C(smoker) * bmi + C(region)', data=data)
result = model.fit()
print(result.summary())
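
Since $R^2$ cannot decrease when a variable is added, a common companion metric is adjusted $R^2 = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}$, where $n$ is the number of observations and $p$ the number of predictors; it is also reported in the statsmodels summary. The sketch below uses simulated data and two hypothetical helper functions to show $R^2$ rising after adding a column that is unrelated to the target.

```python
import numpy as np

def r_squared(X, y):
    """R-squared of an OLS fit of y on X plus an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ coef
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

def adjusted_r_squared(r2, n, p):
    """Adjusted R-squared for n observations and p predictors."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
noise_col = rng.normal(size=n)            # unrelated to y by construction
y = 1.0 + 2.0 * x + rng.normal(size=n)

r2_base = r_squared(x.reshape(-1, 1), y)
r2_more = r_squared(np.column_stack([x, noise_col]), y)

print(r2_base, adjusted_r_squared(r2_base, n, 1))
print(r2_more, adjusted_r_squared(r2_more, n, 2))
```

Comparing the adjusted values of the two insurance models is one way to judge whether the extra dummy variables for `children` earn their place.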

In [None]:
data.describe()
# charges may contain outliers: compare its max with its quartiles
# should we model log(charges) instead of charges?
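
One way to support that decision is to compare the skewness of the variable before and after taking logs. The sketch below uses simulated right-skewed values as a stand-in for `charges` (an assumption; the real column may behave differently):

```python
import numpy as np
import pandas as pd

# Simulated right-skewed values standing in for the charges column.
rng = np.random.default_rng(0)
charges_like = pd.Series(rng.lognormal(mean=9.0, sigma=1.0, size=1000))

raw_skew = charges_like.skew()           # strongly positive for a long right tail
log_skew = np.log(charges_like).skew()   # close to 0 after the log transform

print(raw_skew, log_skew)
```

On the real data, `data['charges'].skew()` and `np.log(data['charges']).skew()` give the same comparison.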

In [None]:
model = smf.ols('np.log(charges) ~ C(sex) * age + C(sex) * bmi + children + C(smoker) * bmi + C(region)', data=data)
result = model.fit()
print(result.summary())