#### Learning Objectives
- Define data modeling and simple linear regression.
- Use the scikit-learn library to build a linear regression model on a data set that meets the linearity assumption.
- Understand and identify multicollinearity in a multiple regression.

## What is Data Modeling?

For a data scientist, a data model is an artifact produced by the machine learning process; one might even consider it a program in its own right. The model accepts data and returns the appropriate output. As discussed before, in supervised learning a model is the combination of an algorithm and the training data used to train it.

The overall process is fairly linear, and the model is the end result of our data science process. Our goal is to:

1. Create a Hypothesis or a question we want to test/explore

2. Load, clean and transform relevant data

3. Identify relevant features/variables for both the question and the model

4. Build a process with an appropriate machine learning algorithm or statistical methodology that suits your data, use case, and available computational resources

5. Test, evaluate and refine your model. Data scientists often build numerous models before settling on the one that produces the best output

6. Deploy the model. This can be done through batch processing or by moving the model into a real-time production environment

7. Monitor and refine the model over time


The end result is a clear, well-defined process that accepts raw information and produces predictive or prescriptive insights for your organization.
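Steps 2 through 5 above can be sketched in a few lines of scikit-learn. This is a minimal illustration on synthetic data, not the lesson's data set: the feature `x`, the true slope and intercept, and the noise level are all assumptions made up for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Step 2: "load" data. Synthetic here: y = 3x + 5 plus noise (made-up values).
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=200)
y = 3 * x + 5 + rng.normal(0, 1, size=200)

# Step 3: choose features. scikit-learn expects a 2-D feature matrix.
X = x.reshape(-1, 1)

# Step 4: fit the algorithm on the training data only.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)

# Step 5: evaluate on held-out data the model has not seen.
y_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"slope={model.coef_[0]:.2f}, intercept={model.intercept_:.2f}, RMSE={rmse:.2f}")
```

The fitted slope and intercept should land close to the values used to generate the data, and the test RMSE should be close to the noise level. On real data there is no known "true" line, which is why the held-out evaluation in step 5 matters.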

## Exploring ATP Matches Data
---

In [None]:
# Standard imports
import pandas as pd
import numpy as np

# Visualization imports
import seaborn as sns
import matplotlib.pyplot as plt

# Specific imports
# These are new! Notice we're using the 'from' approach to import only what we need.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Statistics imports
from scipy import stats
import statsmodels.api as sm

# Magic and plot parameters
%matplotlib inline
# Apply the style first so the explicit rcParams below are not overridden by it.
plt.style.use("fivethirtyeight")
plt.rcParams['figure.figsize'] = (8, 6)
plt.rcParams['font.size'] = 14

<a id="read-in-the--capital-bikeshare-data"></a>
### Read In the Capital Bikeshare Data

In [None]:
# Read the data and set the datetime as the index.
url = './data/bikeshare.csv'
bikes_df = pd.read_csv(url, index_col='datetime', parse_dates=True)
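To see what `index_col='datetime'` and `parse_dates=True` do, here is a small sketch that reads an inline CSV instead of the real file. The three rows are made up for illustration; only the column names follow the lesson's data.

```python
import io
import pandas as pd

# Three made-up rows shaped like the bikeshare file (values are illustrative).
csv_text = """datetime,temp,count
2011-01-01 00:00:00,10.0,12
2011-01-01 01:00:00,9.5,30
2011-01-01 02:00:00,9.0,25
"""

sample_df = pd.read_csv(io.StringIO(csv_text), index_col='datetime', parse_dates=True)

# The index is parsed into datetimes, so date parts are available directly.
print(type(sample_df.index).__name__)
print(list(sample_df.index.hour))
```

Having a datetime index makes it straightforward to derive time-based features such as the hour of day.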

In [None]:
# Load the data into a DataFrame named lv_reviews.
path = './data/LasVegasTripAdvisorReviews-Dataset.csv'

lv_reviews = pd.read_csv(path)

<a id="what-is-multicollinearity"></a>
## What Is Multicollinearity?
---

Multicollinearity happens when two or more features are highly correlated with each other. The problem is that due to the high correlation, it's hard to disambiguate which feature has what kind of effect on the outcome. In other words, the features mask each other. 
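Two common ways to spot multicollinearity are pairwise correlations and the variance inflation factor (VIF), defined for feature j as 1 / (1 - R²), where R² comes from regressing feature j on the other features. The sketch below uses simulated data: the close relationship between `temp` and `atemp` is built in by hand, not taken from the bikeshare file, and the `vif` helper is a hypothetical function written for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Simulated features: atemp is built to track temp closely; humidity is independent.
temp = rng.normal(20, 8, size=n)
atemp = temp + rng.normal(0, 1, size=n)
humidity = rng.normal(60, 15, size=n)
features = pd.DataFrame({'temp': temp, 'atemp': atemp, 'humidity': humidity})

# Pairwise correlations: values near +1 or -1 flag candidate collinear pairs.
print(features.corr().round(2))

def vif(df, col):
    """Variance inflation factor: 1 / (1 - R^2) from regressing col on the other columns."""
    y = df[col].to_numpy()
    others = df.drop(columns=col).to_numpy()
    X = np.column_stack([np.ones(len(df)), others])  # add an intercept column
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coef
    r_squared = 1 - residuals.var() / y.var()
    return 1 / (1 - r_squared)

vifs = {col: vif(features, col) for col in features.columns}
print({col: round(v, 1) for col, v in vifs.items()})
```

In this simulation `temp` and `atemp` should show large VIFs while `humidity` stays near 1. A VIF above roughly 5 or 10 is often treated as a warning sign, though that threshold is a rule of thumb rather than a hard cutoff.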

<a id="feature-engineering"></a>
### More Feature Engineering
- **hour:** as a single numeric feature (0 through 23)

**We will try both a square root and log transformation.**

Then, try using each of the three features (on its own) with `train_test_rmse` to see which one performs the best!
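The comparison described above might look like the following sketch. `train_test_rmse` is not defined in this section, so the version here is a hypothetical stand-in with an assumed signature. The data is synthetic, the target column name `total_rentals` is an assumption, and `np.log1p` is used instead of a plain log because hour 0 has no logarithm.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def train_test_rmse(df, feature_cols, target='total_rentals'):
    """Hypothetical helper: fit on a training split, return RMSE on the test split."""
    X = df[feature_cols]
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)
    model = LinearRegression().fit(X_train, y_train)
    return np.sqrt(mean_squared_error(y_test, model.predict(X_test)))

# Synthetic stand-in data: rentals grow with the square root of the hour, plus noise.
rng = np.random.default_rng(1)
hours = rng.integers(0, 24, size=1000)
demo_df = pd.DataFrame({'hour': hours})
demo_df['total_rentals'] = 50 * np.sqrt(hours) + rng.normal(0, 10, size=1000)

# Candidate transformations of hour.
demo_df['hour_sqrt'] = np.sqrt(demo_df['hour'])
demo_df['hour_log'] = np.log1p(demo_df['hour'])  # log(1 + hour), since log(0) is undefined

# Fit one single-feature model per candidate and compare test RMSE.
for col in ['hour', 'hour_sqrt', 'hour_log']:
    print(col, round(train_test_rmse(demo_df, [col]), 2))
```

Because the synthetic target was generated from the square root of the hour, `hour_sqrt` should score the lowest RMSE here. On the real data, which transformation wins is an empirical question.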

#### 4. Bike Weather Dummy Variables
Build dummy variables for weather, append them to the `bike_dummies` dataframe, and check the model performance with both `total rentals` and 
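As a rough sketch of the dummy-variable step, `pd.get_dummies` can turn the integer-coded `weather` column into indicator columns. The rows and the `bikes_demo` and `bike_dummies_demo` names below are made up for illustration; they stand in for the lesson's own dataframes.

```python
import pandas as pd

# Made-up rows; the column name `weather` follows the lesson, the values are illustrative.
bikes_demo = pd.DataFrame({
    'weather': [1, 2, 1, 3, 2],
    'temp': [9.8, 9.0, 14.2, 11.5, 20.1],
})

# One indicator column per category. drop_first=True drops one level as the baseline
# so the dummies are not perfectly collinear with the model's intercept.
weather_dummies = pd.get_dummies(bikes_demo['weather'], prefix='weather', drop_first=True)

# Append the dummy columns alongside the existing columns.
bike_dummies_demo = pd.concat([bikes_demo, weather_dummies], axis=1)
print(bike_dummies_demo)
```

Dropping one level ties back to multicollinearity: with an intercept in the model, a full set of dummies always sums to 1 and is therefore perfectly collinear.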

<a id="feature-selection"></a>
### Feature Selection
How do we choose which features to include in the model? We're going to use **train/test split** (and eventually **cross-validation**).
More generally:
- This course focuses on general purpose approaches that can be applied to any model, rather than model-specific approaches.
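A minimal sketch of feature selection with a train/test split: fit the same kind of model on several candidate feature sets and compare their error on the same held-out rows. The data here is synthetic, and the column named `noise` is deliberately unrelated to the target; none of these numbers come from the bikeshare file.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: the target depends on temp and humidity but not on `noise`.
rng = np.random.default_rng(7)
n = 1000
data = pd.DataFrame({
    'temp': rng.normal(20, 8, size=n),
    'humidity': rng.normal(60, 15, size=n),
    'noise': rng.normal(0, 1, size=n),
})
data['target'] = 5 * data['temp'] - 2 * data['humidity'] + rng.normal(0, 20, size=n)

# One fixed split, reused for every candidate so the comparison is fair.
train, test = train_test_split(data, random_state=7)

candidates = [['temp'], ['humidity'], ['temp', 'humidity'], ['noise']]
results = {}
for cols in candidates:
    model = LinearRegression().fit(train[cols], train['target'])
    preds = model.predict(test[cols])
    results[tuple(cols)] = np.sqrt(mean_squared_error(test['target'], preds))

# Lowest test RMSE first.
for cols, rmse in sorted(results.items(), key=lambda item: item[1]):
    print(cols, round(rmse, 1))
```

Nothing in this loop is specific to linear regression; swapping in another estimator leaves the procedure unchanged, which is what makes it a general-purpose approach.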

| Variable| Description |
|---------|----------------|
|datetime| hourly date + timestamp  |
|season|  1=winter, 2=spring, 3=summer, 4=fall |
|holiday| whether the day is considered a holiday|
|workingday| whether the day is neither a weekend nor holiday|
|weather| See Below|
|temp| temperature in Celsius|
|atemp| "feels like" temperature in Celsius|
|humidity| relative humidity|
|windspeed| wind speed|
|casual| number of non-registered user rentals initiated|
|registered| number of registered user rentals initiated|
|count| number of total rentals|

> _Details on Weather Variable_

> **1**: Clear, Few clouds, Partly cloudy

> **2**: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist

### Lesson Guide
- [Introduce the Bikeshare Data Set](#introduce-the-bikeshare-dataset)
	- [Read in the Capital Bikeshare Data](#read-in-the--capital-bikeshare-data)