# Unit 4 Project - Model Building 

# Part 1 - Learn Linear Regression Theory

### Step 1: Describe Linear Regression Models 

Answer the following questions after reading the resource material and watching the video lectures linked in the Linear Regression Theory section:

- In your own words, briefly describe a simple linear regression model.
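> A simple linear regression model describes the relationship between a dependent variable, Y, and a single independent variable, X, as a straight line: Y = b0 + b1*X plus an error term. (With two or more independent variables it becomes multiple linear regression.)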
- What type of machine learning is linear regression?
> Linear regression is a supervised machine-learning algorithm used for regression, that is, for predicting a continuous value.
- What is a “line of best fit” using the OLS method?
> The straight line that minimizes the sum of the squared vertical distances (residuals) between the line and the observed data points.
- What is the difference between correlation and regression?
> Correlation quantifies the strength and direction of the linear relationship between a pair of variables, whereas regression expresses the relationship in the form of an equation that can be used for prediction.
- Describe a scenario when you would use linear regression and why you would use it. 
> I would use linear regression for forecasting. For example, it can be used to forecast cash inflows and outflows. Businesses rely heavily on linear regression models to forecast sales and estimate how successful their upcoming months will be, because the model is simple to fit and its coefficients are easy to interpret.
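
The answers above on OLS and on correlation versus regression can be illustrated with a small sketch. It uses synthetic data (an assumption for illustration, not the housing dataset) and the closed-form OLS solution for one predictor; the variable names are hypothetical.

```python
import numpy as np

# Synthetic data: an assumed noisy linear relationship, not the housing data
rng = np.random.default_rng(0)
x = rng.uniform(500, 4000, size=200)
y = 50_000 + 200 * x + rng.normal(0, 40_000, size=200)

# Closed-form OLS for a single predictor
b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)  # slope
b0 = y.mean() - b1 * x.mean()                        # intercept

# Correlation is unit-free; the slope is r rescaled by the standard deviations
r = np.corrcoef(x, y)[0, 1]
slope_from_r = r * y.std(ddof=1) / x.std(ddof=1)

print(f"slope={b1:.2f}, intercept={b0:.2f}, r={r:.3f}")
```

The slope and the correlation always share a sign, but only the slope carries units (here, y-units per x-unit), which is why regression can be used for prediction and correlation cannot.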

### Step 2: Describe Linear Regression Assumptions 

In your own words, describe the following linear regression assumptions after reading the resource material and watching the video lectures linked in the Linear Regression Theory section:

- Linearity
> The relationship between X and the mean of Y is linear
- Normality
> For any fixed value of X, Y is normally distributed
- Homoscedasticity
> The variance of the residuals is the same for any value of X.
- No multicollinearity
> The independent variables in the model are not highly correlated with each other, which allows for more reliable and interpretable coefficient estimates.
- No endogeneity
> The independent variables are exogenous, meaning they are not correlated with the error term (for example, they are not themselves affected by the dependent variable, and no relevant variable has been left out), so the coefficient estimates are not biased.
- No autocorrelation
> The residuals are not correlated with one another. For time series data, this means the error at one point in time carries no information about the errors at its lagged (earlier) points.
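
One common way to check the multicollinearity assumption is the variance inflation factor (VIF). The sketch below is a minimal NumPy version on synthetic columns; the helper name `vif` is hypothetical, and the often-quoted thresholds (VIF above roughly 5 or 10) are rules of thumb rather than hard limits.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (n_samples, n_features)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = []
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])   # intercept + other predictors
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        r2 = 1 - resid.var() / target.var()          # R^2 of column j on the rest
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Synthetic predictors (assumed for illustration)
rng = np.random.default_rng(1)
a = rng.normal(size=500)
b = rng.normal(size=500)               # independent of a
c = a + 0.1 * rng.normal(size=500)     # nearly a copy of a

vif_independent = vif(np.column_stack([a, b]))
vif_collinear = vif(np.column_stack([a, b, c]))
print(vif_independent, vif_collinear)
```

In the second call, the columns `a` and `c` get large VIFs because each is almost perfectly predicted by the other, while `b` stays close to 1.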

### Step 3: Describe How to Interpret Results from Correlation Table in Model Summary 

In your own words, describe the following terms after reading the resource material and watching the video lectures linked in the Linear Regression Theory section:

- Coefficient of Constant / Intercept (b0)
> Represents the expected mean value of the dependent variable (y) when all independent variables (x) are equal to 0. The value of b0 can be positive, negative, or zero and its interpretation depends on the context of the problem being analyzed. 
- Coefficient of Independent Variable (b1)
> Represents the expected change in the dependent variable (y) per one-unit change in the independent variable (x). Its sign gives the direction of the relationship and its magnitude gives the size of the effect, in units of y per unit of x.
- Standard Error
> Measure of the variability or uncertainty associated with a statistical estimate such as the coefficient in a regression model or a mean of a sample. 
- T-Statistic
> The coefficient divided by its standard error, that is, a measure of how many standard errors the estimated coefficient lies from zero.
- P-Value of T-Statistic (from the independent variable)
> The probability of obtaining a result this extreme or more extreme, given that the null hypothesis (here, that the coefficient is zero) is true.
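
The terms above can be computed by hand, which may help when reading the coefficient table of a model summary. This is a sketch on synthetic data (an assumption, not the housing dataset) using the standard OLS formulas; the p-value itself would then come from a t-distribution with `dof` degrees of freedom, which is left out here to keep the example NumPy-only.

```python
import numpy as np

# Synthetic data with an assumed true intercept of 3 and slope of 2
rng = np.random.default_rng(2)
n = 100
x = rng.uniform(0, 10, size=n)
y = 3.0 + 2.0 * x + rng.normal(0, 1.5, size=n)

X = np.column_stack([np.ones(n), x])           # column of ones = constant term
beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # [b0, b1]

resid = y - X @ beta
dof = n - X.shape[1]                           # observations minus parameters
sigma2 = resid @ resid / dof                   # estimated error variance
cov_beta = sigma2 * np.linalg.inv(X.T @ X)     # covariance of the estimates
std_err = np.sqrt(np.diag(cov_beta))           # standard error of b0 and b1
t_stats = beta / std_err                       # coefficient / standard error

print("coef:", beta)
print("std err:", std_err)
print("t:", t_stats)
```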


### Step 4: Explain R-Squared and Adjusted R-Squared

In your own words, describe the following terms after reading the resource material and watching the video lectures linked in the Linear Regression Theory section:

- R-squared
> The proportion of the variance in the dependent variable that is explained by the independent variable(s) in the model.
- Adjusted R-squared
> A modified version of R-squared that penalizes additional predictors, so it rises only when a new predictor improves the fit by more than would be expected by chance.
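
Both quantities follow from short formulas. The sketch below writes them as two hypothetical helper functions and applies them to a tiny made-up example; `n` is the number of observations and `p` the number of predictors, not counting the intercept.

```python
import numpy as np

def r_squared(y, y_hat):
    """1 - (residual sum of squares / total sum of squares)."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n, p):
    """Penalize R-squared for the number of predictors p (intercept excluded)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Made-up observed and fitted values, for illustration only
y = [1, 2, 3, 4, 5]
y_hat = [1.1, 1.9, 3.2, 3.9, 4.9]

r2 = r_squared(y, y_hat)
adj = adjusted_r_squared(r2, n=len(y), p=1)
print(round(r2, 4), round(adj, 4))
```

With a large sample and few predictors the penalty factor `(n - 1) / (n - p - 1)` is close to 1, so the two values are nearly identical.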


# Part 2 - Linear Regression in Practice

### Step 1: Import libraries and load dataset

In [None]:
# make necessary imports
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import statsmodels.api as sm

%matplotlib inline
sns.set()

In [None]:
# load data
df = pd.read_csv('house_data.csv')
df

### Step 2: Explore the data 

#### Spend time exploring the data and looking for relationships between variables. 

In [None]:
# explore data below (you will need to make new cells)
sns.pairplot(df);

In [None]:
df.info()

In [None]:
df = df[['price', 'sqft_living', 'sqft_lot', 'year_built', 'house_condition', 'grade', 'view']].copy()

In [None]:
df.describe()

In [None]:
sns.pairplot(df);

In [None]:
# imports pearsonr function to find correlations and pvalues
from scipy.stats import pearsonr

# creates function which annotates each correlation with 1 to 3 stars based on its p-value, with *** being the most statistically significant
def corrfunc(x, y, **kwargs):
    def pvalue_stars(p):
        if 0.05 >= p > 0.01:
            return '*'
        elif 0.01 >= p > 0.001:
            return '**'
        elif p <= 0.001:
            return '***'
        else:
            return ''
    cmap = kwargs['cmap']
    norm = kwargs['norm']
    ax = plt.gca()
    ax.grid(False)
    r, p = pearsonr(x, y)
    facecolor = cmap(norm(r))
    ax.set_facecolor(facecolor)
    lightness = (max(facecolor[:3]) + min(facecolor[:3])) / 2
    ax.annotate(f"{r:.2f}{pvalue_stars(p)}", xy=(.5, .5), xycoords=ax,
                color='white' if lightness < 0.7 else 'black',
                size=18, ha='center', va='center')

# pairgrid w/ half pair plot, histograms down diagonal, correlation heatmap, & pvalue's represented by *'s

g = sns.PairGrid(df, height=1.5, diag_sharey=False)
g.map_lower(sns.scatterplot)
g.map_diag(sns.kdeplot, fill=True)  # `shade` is deprecated in recent seaborn releases; `fill` replaces it
g.map_upper(corrfunc,
            cmap=plt.get_cmap('RdBu_r'), 
            norm=plt.Normalize(vmin=-1, vmax=1))
g.add_legend()
plt.show()

### Step 3: Determine independent and dependent variable

In [None]:
# set X 
x1 = df['sqft_living']
# set Y 
y=df['price']

In [None]:
# visualize relationship between X and Y
plt.figure(figsize=(10,6))
plt.scatter(x1, y)
plt.xlabel('Sq. Ft. of Living Space', fontsize = 18)
plt.ylabel('Price', fontsize = 18);

### Step 4: Fit Regression 

Solution below is for example only. Results may vary depending on variables chosen by student to use in linear regression model and which statistical package is used for linear regression model.

In [None]:
# fit model to X and Y variables (you will need to make more cells)
x = sm.add_constant(x1)
results = sm.OLS(y,x).fit()

### Step 5: Interpret Model Summary

In [None]:
# print out and interpret model summary // terms identified below
results.summary()

Interpret the following from your model:

- R-squared 
> The R-squared value is 0.376. This means that 37.6% of the variation in house prices is explained by the square footage of living space.
- Coefficient of independent variable
> The coefficient of the independent variable is 202.9775. This means that for each additional square foot of living space, we can expect the price to increase by about 203 dollars.
- P-value of T-statistic
> The p-value of the t-statistic is displayed as 0.000, which means it is smaller than 0.0005 rather than exactly zero. We can therefore reject the null hypothesis that the coefficient of square footage is zero (that is, that there is no linear relationship between square footage and price). So we have strong evidence of a linear relationship.
- P-value of F-statistic
> The F-statistic tests the overall significance of the regression model. Its null hypothesis is that all of the slope coefficients are zero, that is, that the independent variables jointly explain none of the variation in the dependent variable. Since the p-value is effectively 0, there is enough evidence to reject that hypothesis and conclude that there is a relationship between the independent and dependent variables.

### Step 6: Predict Values

Solution below is for example only. Results may vary depending on variables chosen by student to use in linear regression model. 

In [None]:
# predict new y values (dependent variables) by feeding model new independent variable values
new_df = pd.DataFrame({'constant':1, 'sqft_living':[1100, 1500]})
new_df

In [None]:
predictions = results.predict(new_df)
predictions

In [None]:
new_df['price_predictions']=results.predict(new_df)
new_df
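
As a sanity check on the predictions, note that the model has the form price = b0 + b1 * sqft_living, so the intercept cancels when two predictions are subtracted. The sketch below assumes the slope of 202.9775 quoted in the Step 5 interpretation; a different fit would give a different number.

```python
slope = 202.9775              # sqft_living coefficient quoted in Step 5 (assumed here)
sqft_a, sqft_b = 1100, 1500   # the two inputs passed to results.predict

# The intercept cancels: (b0 + b1*sqft_b) - (b0 + b1*sqft_a) = b1 * (sqft_b - sqft_a)
price_gap = slope * (sqft_b - sqft_a)
print(f"{price_gap:,.2f}")    # 81,191.00
```

The two predicted prices should therefore differ by about 81,191 dollars, whatever the intercept is.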

# Part 3 - Multiple Linear Regression

### Step 1: From Data Exploration in Part 2, Pick Another Independent Variable 

In [None]:
# Pick another independent variable. Be sure to avoid multicollinearity. 
y = df['price']
x1 = df[['sqft_living', 'year_built']]

### Step 2: Fit A New Multiple Linear Regression Model to the New Independent Variables

Solution below is for example only. Results may vary depending on variables chosen by student to use in linear regression model. 

In [None]:
# fit new regression model (you will need to make more cells)
x = sm.add_constant(x1)
results = sm.OLS(y, x).fit()
results.summary()

### Step 3: Interpret Model Summary 

Print the output of the results summary from your model. Interpret the following parts of your results summary. Include an explanation of what the given numbers mean in regards to your model.
- R-squared 
> 41.5% of the variation in price is explained by square feet of living space and year built.
- Adjusted R-squared
> Essentially the same value as R-squared (about 0.415), because the model has only two predictors. Adjusted R-squared corrects for the fact that adding more independent variables never decreases R-squared, even if the new variables do not help explain the dependent variable.
- Coefficient of independent variables
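> The coefficient of sqft_living is 226.3349. This means that, holding year_built constant, each additional square foot of living space is associated with a price about 226 dollars higher. The coefficient of year_built is about -1715, which means that, holding sqft_living constant, a house built one year later is associated with a price about 1,715 dollars lower. These coefficients are slopes, not prices: the predicted price of a house with 0 square feet built in the year 0 is the intercept, which has no practical meaning here because no such house exists in the data.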
- P-value of T-statistic
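> 0 (displayed as 0.000, that is, smaller than 0.0005). We can reject the null hypothesis that a coefficient is zero, so there is strong evidence that the independent variable is related to price.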
- P-value of F-statistic
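> 0 (displayed as 0.000). We can reject the null hypothesis that all of the slope coefficients are zero, so the model as a whole is statistically significant.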
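
In a multiple regression, each coefficient is read as a partial effect: the expected change in price when that variable changes and the others are held fixed. The sketch below works through the arithmetic using the two coefficients quoted above (226.3349 and roughly -1715, both taken from the example answers and therefore assumptions about any particular fit).

```python
coef_sqft = 226.3349   # sqft_living coefficient quoted above
coef_year = -1715.0    # year_built coefficient quoted above (rounded)

# Change in predicted price for +100 sq. ft., year_built held fixed
effect_100_sqft = coef_sqft * 100

# Change in predicted price for a house built 10 years later, sqft_living held fixed
effect_10_years = coef_year * 10

print(f"{effect_100_sqft:,.2f}")   # 22,633.49
print(f"{effect_10_years:,.2f}")   # -17,150.00
```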


If you would like, continue to create new linear models as you add more than two independent variables. Notice the differences in the R-Squared values you get from each model. 

### Step 4: Predict Values

In [None]:
# predict new y values (dependent variables) by feeding model new independent variable values
new_df= pd.DataFrame({'constant':1, 'sqft_living':[1100, 1500], 'year_built':[1960, 1980]})
new_df

In [None]:
predictions = results.predict(new_df)
predictions

In [None]:
new_df['price_predictions'] = results.predict(new_df)
new_df
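
Conceptually, predicting from a fitted linear model is a dot product between each new row (including the constant) and the fitted coefficients. The sketch below shows this with NumPy on synthetic data; the data-generating numbers are assumptions for illustration, not results from the housing dataset. Keeping the new columns in the same order as the columns used for fitting (constant, sqft_living, year_built) is a safe habit.

```python
import numpy as np

# Synthetic housing-like data (assumed coefficients, for illustration only)
rng = np.random.default_rng(3)
n = 300
sqft = rng.uniform(500, 4000, size=n)
year = rng.integers(1900, 2015, size=n).astype(float)
price = 100_000 + 220 * sqft - 1_500 * (year - 1900) + rng.normal(0, 30_000, size=n)

# Design matrix: constant first, then the two predictors
X = np.column_stack([np.ones(n), sqft, year])
params, *_ = np.linalg.lstsq(X, price, rcond=None)

# New observations, columns in the same order as X
X_new = np.array([[1.0, 1100.0, 1960.0],
                  [1.0, 1500.0, 1980.0]])
predictions = X_new @ params   # b0 + b1*sqft + b2*year for each row

print(params)
print(predictions)
```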

### Step 5: Report Observations in Difference Between Simple and Multiple Linear Regression Models You Made 

In [None]:
# create new markdown cell and write down your observations

# Part 4 - Multivariate Time Series Analysis

In [None]:
# complete the time series analysis exercise separately from data-time-series folder

# Part 5 - Submit Project 