# Regression Analysis using Python

This notebook covers some of the basic ways to perform regression analysis on a dataframe in order to predict a continuous variable.

Tools and libraries used:

- Python
- Jupyter notebook
- Pandas
- NumPy
- Scikit-learn (used for sample data and regression algorithms)

## What is Regression analysis?

"Regression analysis is a statistical process for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables (or 'predictors')." (Source: Wikipedia)

## Linear Regression  

Linear regression fits a straight line, or 'trendline', through two variables X and Y, modeling the dependent variable Y as a linear function of the independent variable X.
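As a minimal sketch of what fitting a straight line involves, the cell below computes the least-squares slope and intercept by hand and compares them with `np.polyfit`. The `x` and `y` values are made up for illustration and are not taken from the housing data.

```python
import numpy as np

# Made-up illustrative data (not from the Boston dataset).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Closed-form ordinary least squares for a single predictor:
# slope = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x) ** 2)
x_centered = x - x.mean()
slope = np.sum(x_centered * (y - y.mean())) / np.sum(x_centered ** 2)
intercept = y.mean() - slope * x.mean()

# np.polyfit with deg=1 fits the same straight line.
poly_slope, poly_intercept = np.polyfit(x, y, deg=1)

print(f"closed form: y = {slope:.3f} * x + {intercept:.3f}")
print(f"np.polyfit:  y = {poly_slope:.3f} * x + {poly_intercept:.3f}")
```

Both approaches minimise the sum of squared vertical distances between the points and the line, so they should agree up to floating-point error.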

In this notebook I will be using a sample dataset that contains Boston house price data.

To get started, the following libraries need to be imported. To avoid confusion, I will not import every library used in this notebook here. It is, however, standard practice to import all libraries at the top of your notebook.

In [1]:
import pandas as pd
import numpy as np
from sklearn import datasets

In [4]:
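#import house price data from sklearn datasets
#note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
#so this cell needs an older scikit-learn version
boston = datasets.load_boston()
#the data is returned as a dictionary-like Bunch object, not a dataframe
#to create a pandas dataframe from it we need to do the following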

names = boston.feature_names #used to assign names to columns

bos = pd.DataFrame(boston.data) #import data into dataframe
bos.columns = names #assign column names 
bos.head() #display first five rows

,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33
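The column names can also be passed when the dataframe is constructed, which avoids the separate assignment step. The sketch below assumes stand-in `data` and `feature_names` variables, built from the first two rows and the column names shown above, in place of `boston.data` and `boston.feature_names`.

```python
import numpy as np
import pandas as pd

# Stand-ins for boston.data and boston.feature_names: the first two rows
# and the column names shown in the output above.
data = np.array([
    [0.00632, 18.0, 2.31, 0.0, 0.538, 6.575, 65.2, 4.09, 1.0, 296.0, 15.3, 396.9, 4.98],
    [0.02731, 0.0, 7.07, 0.0, 0.469, 6.421, 78.9, 4.9671, 2.0, 242.0, 17.8, 396.9, 9.14],
])
feature_names = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                 "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]

# Passing columns= names the columns at construction time.
sample = pd.DataFrame(data, columns=feature_names)
print(sample.head())
```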


In [7]:
target = boston.target #this is the price variable that we will be trying to predict
bos['PRICE'] = target #create new column with target
bos.head()

,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,PRICE
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33,36.2


In [9]:
#To get a description of the data we can run the following
print(boston.DESCR)

Boston House Prices dataset

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
      

To start with, let's perform a very simple regression on two variables. For this I am going to choose 'RM' vs 'PRICE', as these should be positively correlated, i.e. the more rooms a house has, the higher its value.
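The idea can be sketched on just the five rows printed by `bos.head()` earlier. This is only an illustration: the `sample` dataframe is a hand-copied stand-in for `bos`, and a fit on five rows will differ from a fit on all 506. It uses scikit-learn's `LinearRegression` and the pandas `corr` method.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Only the five rows displayed by bos.head(), so the numbers are
# illustrative and will differ from a fit on the full dataset.
sample = pd.DataFrame({
    "RM": [6.575, 6.421, 7.185, 6.998, 7.147],
    "PRICE": [24.0, 21.6, 34.7, 33.4, 36.2],
})

# Pearson correlation between rooms and price (+1 = perfect positive).
r = sample["RM"].corr(sample["PRICE"])

# scikit-learn expects a 2-D feature matrix, hence the double brackets.
model = LinearRegression()
model.fit(sample[["RM"]], sample["PRICE"])
slope = model.coef_[0]
intercept = model.intercept_

print(f"correlation: {r:.3f}")
print(f"PRICE = {slope:.2f} * RM + {intercept:.2f}")
```

A positive slope here means that, within these five rows, each additional room is associated with a higher price.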

In [15]:
#Plotting the data to look for a linear relationship

#Here I am using the plotly library to visualise the data. Its hosted (cloud)
#mode requires an account, and the package itself must be installed first.
#import plotly.plotly as plt
import plotly as py
py.tools.set_credentials_file(username='j.carpenter_12', 
                              api_key='nFGzoPt30albxesyjGOJ')

ModuleNotFoundError: No module named 'plotly'