# Module 4: Model Development

## Linear Regression and Multiple Linear Regression

+ Linear Regression: uses one predictor variable to make a prediction
+ Multiple Linear Regression: uses multiple predictor variables to make a prediction

### Fitting a simple Linear model estimator

+ x: predictor variable => df[['columnName']]<br>
+ y: target variable => df['valueName']

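1. Import LinearRegression from the linear_model module of scikit-learn:<br>
<code>from sklearn.linear_model import LinearRegression</code>
2. Create a linear regression object using the constructor:<br>
<code>lm=LinearRegression()</code>
3. Fit the model with the predictor x and the target y defined above:<br>
<code>lm.fit(df[['columnName']], df['valueName'])</code>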
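
The steps above can be put together end to end. This is a minimal sketch, assuming a small made-up DataFrame; `columnName` and `valueName` are the placeholder names used in these notes, and the data is invented so that the target is exactly `2 * columnName + 1`.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: valueName = 2 * columnName + 1 exactly
df = pd.DataFrame({"columnName": [1.0, 2.0, 3.0, 4.0],
                   "valueName": [3.0, 5.0, 7.0, 9.0]})

X = df[["columnName"]]   # predictor: 2-D (DataFrame with one column)
y = df["valueName"]      # target: 1-D (Series)

lm = LinearRegression()
lm.fit(X, y)             # estimate the intercept and the slope
yhat = lm.predict(X)     # predictions for the training rows

print(lm.intercept_)     # estimated intercept
print(lm.coef_)          # estimated slope, as a one-element array
```

Note the double brackets for the predictor: scikit-learn expects a 2-D input for the features, while the target can stay 1-D.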

### Multiple Linear Regression:

This method is used to explain the relationship between:
+ One continuous target (Y) variable
+ Two or more predictor (X) variables

#### Fitting an estimator
1. We can extract 4 predictor variables and store them in a variable Z:<br>
<code>Z = df[['columnName1','columnName2','columnName3','columnName4']]</code>
2. Then train the model as before:<br>
<code>lm.fit(Z, df['valueName'])</code>
3. We can also obtain a prediction:<br>
<code>Yhat = lm.predict(Z)</code>
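
As a sketch of the same workflow with more than one predictor, the example below uses two invented columns instead of four to keep it short. The data is hypothetical and built so that `valueName = 1 + 2*columnName1 + 3*columnName2`, which lets the fitted coefficients be checked by eye.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: valueName = 1 + 2*columnName1 + 3*columnName2 exactly
df = pd.DataFrame({"columnName1": [0.0, 1.0, 0.0, 1.0, 2.0],
                   "columnName2": [0.0, 0.0, 1.0, 1.0, 1.0]})
df["valueName"] = 1 + 2 * df["columnName1"] + 3 * df["columnName2"]

Z = df[["columnName1", "columnName2"]]   # predictor variables
lm = LinearRegression()
lm.fit(Z, df["valueName"])               # train on all predictors at once
Yhat = lm.predict(Z)                     # one prediction per row of Z

print(lm.intercept_)   # estimated intercept
print(lm.coef_)        # one coefficient per column of Z, in column order
```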

## Model Evaluation using Visualization

### Regression Plot<br>
A regression plot gives a good estimate of:
1. The relationship between 2 variables
2. The strength of correlation
3. The direction of the relationship (positive or negative)

Using the Python library <b>seaborn</b> (with matplotlib for the axis limits):<br>
<code>import seaborn as sns<br>
import matplotlib.pyplot as plt</code><br>
<code>sns.regplot(x="columnName", y="valueName", data=df)<br>
plt.ylim(0,)</code>
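
The strength and direction that the plot shows visually can also be checked with a number. The sketch below uses the Pearson correlation coefficient from NumPy, which is an assumption of this example rather than something the plot itself computes; the arrays are made up.

```python
import numpy as np

# Hypothetical data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_up = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # rises with x
y_down = -y_up                                # falls with x

# np.corrcoef returns a 2x2 matrix; entry [0, 1] is the x-y correlation
r_up = np.corrcoef(x, y_up)[0, 1]
r_down = np.corrcoef(x, y_down)[0, 1]

print(r_up)     # close to +1: strong positive relationship
print(r_down)   # close to -1: strong negative relationship
```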

### Residual Plot
+ Represents the error between the predicted values and the actual values.
+ Obtain each residual by subtracting the predicted value from the actual value.
    + Plot the independent variable on the x axis and the residuals on the y axis.
    + For a good fit, expect the residuals to have zero mean and to be spread evenly around the x axis with similar variance.

<code>import seaborn as sns</code><br>
<code>sns.residplot(x=df['columnName'], y=df['valueName'])</code>
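
The residuals behind such a plot can also be computed directly. This is a minimal sketch on invented noisy data (the slope, intercept and noise level are assumptions for illustration), showing that a least-squares line with an intercept leaves residuals that average to zero.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical noisy linear data
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50).reshape(-1, 1)
y = 3.0 * x.ravel() + 2.0 + rng.normal(scale=1.0, size=50)

lm = LinearRegression().fit(x, y)
residuals = y - lm.predict(x)   # actual minus predicted

print(residuals.mean())   # very close to zero for a least-squares fit with intercept
print(residuals.std())    # spread of the errors around the fitted line
```

If the residuals showed a curve or a widening spread along x instead, that would suggest a straight line is not the right model for the data.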

### Distribution Plots<br>
<code>import seaborn as sns<br>
ax1 = sns.kdeplot(x=df['valueName'], color="r", label="Actual Value")<br>
sns.kdeplot(x=Yhat, color="b", label="Fitted Value", ax=ax1)</code>

## Polynomial Regression and Pipelines

### Polynomial Regressions
+ A special case of the general linear regression model
+ Useful for describing curvilinear relationships
<br>

<b>Curvilinear relationships:</b>
Modeled by squaring the predictor variables or adding higher-order terms of them

Calculate a polynomial of 3rd order:<br>
<code>import numpy as np<br>
f=np.polyfit(x,y,3)<br>
p=np.poly1d(f)</code>
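
A small worked sketch of the two calls above, assuming made-up data generated from `y = x^3 - 2x + 1` so that the fitted coefficients are easy to recognize:

```python
import numpy as np

# Hypothetical data generated from y = x^3 - 2x + 1
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = x**3 - 2 * x + 1

f = np.polyfit(x, y, 3)   # coefficients, highest power first
p = np.poly1d(f)          # callable polynomial built from those coefficients

print(f)        # approximately [1, 0, -2, 1]
print(p(4.0))   # evaluate the fitted polynomial at x = 4
```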

#### Polynomial Regression with More than 1 dimension<br>
Use PolynomialFeatures from the preprocessing module of scikit-learn:<br>
<code>from sklearn.preprocessing import PolynomialFeatures<br>
pr=PolynomialFeatures(degree=2, include_bias=False)<br>
x_poly=pr.fit_transform(x[['columnName1', 'columnName2']])</code><br>
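
To see what the transform produces, the sketch below applies it to a single invented sample with two features. The input values are assumptions chosen so each output column is easy to identify.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One hypothetical sample with two features: x1 = 2, x2 = 3
X = np.array([[2.0, 3.0]])

pr = PolynomialFeatures(degree=2, include_bias=False)
X_poly = pr.fit_transform(X)

# Output columns are x1, x2, x1^2, x1*x2, x2^2
print(X_poly)   # [[2. 3. 4. 6. 9.]]
```

Two input features become five after a degree-2 expansion, so the number of columns grows quickly with the degree and the number of features.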

Normalize each feature simultaneously:<br>
<code>from sklearn.preprocessing import StandardScaler<br>
SCALE=StandardScaler()<br>
SCALE.fit(x_data[['columnName1','columnName2']])<br>
x_scale=SCALE.transform(x_data[['columnName1','columnName2']])</code>
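
A minimal sketch of what the scaler does, assuming a tiny made-up array whose two columns are on very different scales:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # fit and transform in one call

print(X_scaled.mean(axis=0))   # approximately 0 for each column
print(X_scaled.std(axis=0))    # approximately 1 for each column
```

After scaling, both columns hold the same values even though the raw numbers differed by a factor of 100.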


### Pipelines<br>
Normalization => Polynomial transform => Linear Regression<br>
<code>
from sklearn.preprocessing import PolynomialFeatures<br>
from sklearn.linear_model import LinearRegression<br>
from sklearn.preprocessing import StandardScaler<br>
from sklearn.pipeline import Pipeline<br>
</code>

#### Pipeline constructor<br>
<code>
Input=[('scale',StandardScaler()),('polynomial',PolynomialFeatures(degree=2)),...,('model',LinearRegression())]<br>
</code>

Pass the list of (name, estimator) tuples to the Pipeline constructor:<br>
<code>pipe=Pipeline(Input)</code>

#### Train the pipeline object<br>
<code>
pipe.fit(df[['columnName1','columnName2','columnName3','columnName4']],y)<br>
yhat=pipe.predict(X[['columnName1','columnName2','columnName3','columnName4']])<br>
</code>
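
The three stages can be sketched as one runnable example. The data here is invented (an exact quadratic function of two random features), and the step names `scale`, `polynomial` and `model` are arbitrary labels chosen for this sketch.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Hypothetical data: y is an exact quadratic function of two features
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(40, 2))
y = 1 + 2 * X[:, 0] - X[:, 1] + 0.5 * X[:, 0] * X[:, 1] + X[:, 1] ** 2

# Each step is a (name, estimator) tuple; the last step is the model
steps = [("scale", StandardScaler()),
         ("polynomial", PolynomialFeatures(degree=2)),
         ("model", LinearRegression())]
pipe = Pipeline(steps)

pipe.fit(X, y)            # runs scaling, polynomial expansion, then the fit
yhat = pipe.predict(X)    # applies the same transforms before predicting

print(pipe.score(X, y))   # R^2 on the training data
```

Calling `fit` and `predict` on the pipeline applies every transform in order, so the scaling and polynomial steps do not have to be repeated by hand for new data.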

## Measures for In-Sample Evaluation<br>

Two important measures to determine the fit of a model:
+ Mean Squared Error (MSE): the average of the squared differences between actual and predicted values; it is non-negative, and smaller is better
    + <code>from sklearn.metrics import mean_squared_error<br>
    mean_squared_error(df['valueName'], Y_predict_simple_fit)</code>
+ R-squared (R^2): The Coefficient of Determination, generally between 0 and 1
    + <code>R^2 = (1-(MSE of regression line/MSE of the average of the data))</code>
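
Both measures can be worked through on a few numbers. This sketch uses invented actual and predicted values, and computes R^2 from the formula above alongside scikit-learn's `r2_score` for comparison.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.0])

mse = mean_squared_error(y_true, y_pred)

# R^2 from the definition: 1 - MSE(model) / MSE(mean-only baseline)
baseline = np.full_like(y_true, y_true.mean())
r2_manual = 1 - mse / mean_squared_error(y_true, baseline)

print(mse)                        # 0.125
print(r2_manual)                  # R^2 from the formula
print(r2_score(y_true, y_pred))   # same quantity from scikit-learn
```

Here the squared errors are 0.25, 0.25, 0 and 0, so the MSE is 0.5 / 4 = 0.125. The mean-only baseline has an MSE of 5, giving R^2 = 1 - 0.125 / 5 = 0.975.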

## Prediction and Decision Making

Determining a Good Model Fit:
+ Do the predicted values make sense?
+ Visualization
+ Numerical measures for evaluation
+ Comparing Models
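
As one way to compare models numerically, the sketch below fits a straight line and a cubic to the same invented curved data and compares their in-sample MSE. The helper `mse_of_fit` is a hypothetical name introduced only for this example.

```python
import numpy as np

# Hypothetical curved data
x = np.linspace(-3, 3, 30)
y = x**3 - 2 * x

def mse_of_fit(order):
    """In-sample MSE of a polynomial of the given order fitted to (x, y)."""
    p = np.poly1d(np.polyfit(x, y, order))
    return np.mean((y - p(x)) ** 2)

mse_linear = mse_of_fit(1)
mse_cubic = mse_of_fit(3)

print(mse_linear)   # larger: a straight line cannot follow the curve
print(mse_cubic)    # near zero: the cubic matches how the data was generated
```

A lower in-sample error is only one piece of evidence, so it is worth checking it together with the plots and the sanity of the predicted values listed above.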