# Linear Regression With SFrames

This is a summary, by example, of how to fit a simple linear regression with GraphLab Create and its SFrame data structure.

## Imports

In [11]:
# python standard library
import os

# third party
import graphlab
import matplotlib.pyplot as plt
import pandas
import seaborn as sns

In [12]:
%matplotlib inline

## Load the data

In [13]:
sales = graphlab.SFrame.read_csv(os.path.join('../../../large_data/csvs/Philadelphia_Crime_Rate_noNA.csv'))
sales.head()

Columns:
	HousePrice	int
	HsPrc ($10,000)	float
	CrimeRate	float
	MilesPhila	float
	PopChg	float
	Name	str
	County	str

Rows: 10

Data:
+------------+-----------------+-----------+------------+--------+------------+
| HousePrice | HsPrc ($10,000) | CrimeRate | MilesPhila | PopChg |    Name    |
+------------+-----------------+-----------+------------+--------+------------+
|   140463   |     14.0463     |    29.7   |    10.0    |  -1.0  |  Abington  |
|   113033   |     11.3033     |    24.1   |    18.0    |  4.0   |   Ambler   |
|   124186   |     12.4186     |    19.5   |    25.0    |  8.0   |   Aston    |
|   110490   |      11.049     |    49.4   |    25.0    |  2.7   |  Bensalem  |
|   79124    |      7.9124     |    54.1   |    19.0    |  3.9   | Bristol B. |
|   92634    |      9.2634     |    48.6   |    20.0    |  0.6   | Bristol T. |
|   89246    |      8.9246     |    30.8   |    15.0    |  -2.6  | Brookhaven |
|   195145   |     19.5145     |    10.8   |    20.0    |  -3.5 

Parsing completed. Parsed 99 lines in 0.039094 secs.

Finished parsing file /home/cronos/projects/machine_learning/machine_learning/large_data/csvs/Philadelphia_Crime_Rate_noNA.csv

------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[int,float,float,float,float,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


Parsing completed. Parsed 99 lines in 0.031703 secs.

Finished parsing file /home/cronos/projects/machine_learning/machine_learning/large_data/csvs/Philadelphia_Crime_Rate_noNA.csv

## Fit the regression model

The target is the sale price of a house ('HousePrice') and the single feature (the predictor variable) is the crime rate in the house's area ('CrimeRate').
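
As a rough illustration of what fitting a line to one feature and one target means, the sketch below computes the ordinary least-squares intercept and slope by hand for the first seven rows shown in the table above. It is not GraphLab's implementation: the helper `fit_line` is a hypothetical name, only seven rows are used, and the small l2 penalty that GraphLab applies by default is ignored, so the numbers will differ from those of the model fitted in this notebook.

```python
# Illustrative only: closed-form ordinary least squares for one feature.
# The values are the first seven rows of the table printed above.
crime_rate = [29.7, 24.1, 19.5, 49.4, 54.1, 48.6, 30.8]
house_price = [140463, 113033, 124186, 110490, 79124, 92634, 89246]


def fit_line(x, y):
    """Return (intercept, slope) minimizing the sum of squared errors."""
    n = float(len(x))
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # sums of cross-deviations and squared deviations from the means
    sum_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sum_xx = sum((a - mean_x) ** 2 for a in x)
    slope = sum_xy / sum_xx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(crime_rate, house_price)
print("y = {m:.2f} x + {b:.2f}".format(m=slope, b=intercept))
```

Even on this small subset the slope comes out negative: higher crime rates go with lower house prices.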

In [14]:
print(graphlab.linear_regression.create.__doc__)


    Create a :class:`~graphlab.linear_regression.LinearRegression` to
    predict a scalar target variable as a linear function of one or more
    features. In addition to standard numeric and categorical types, features
    can also be extracted automatically from list- or dictionary-type SFrame
    columns.

    The linear regression module can be used for ridge regression, Lasso, and
    elastic net regression (see References for more detail on these methods). By
    default, this model has an l2 regularization weight of 0.01.

    Parameters
    ----------
    dataset : SFrame
        The dataset to use for training the model.

    target : string
        Name of the column containing the target variable.

    features : list[string], optional
        Names of the columns containing features. 'None' (the default) indicates
        that all columns except the target variable should be used as features.

        The features are columns in the input SFrame that can be of the
       

In [15]:
crime_model = graphlab.linear_regression.create(sales, target='HousePrice',
                                                features=['CrimeRate'],
                                                validation_set=None,
                                                verbose=False)

## Plot the line

In [16]:
def plot_data(data, model, title):
    """Plot the data points and the model's fitted line."""
    figure = plt.figure()
    axe = figure.gca()
    # raw observations as points
    axe.plot(data['CrimeRate'], data['HousePrice'], '.', label='Data')
    # model predictions for the same crime rates as a line
    axe.plot(data['CrimeRate'], model.predict(data), '-', label='Fit')
    axe.set_xlabel("Crime Rate")
    axe.set_ylabel("House Price")
    axe.set_title(title)
    axe.legend()


In [17]:
plot_data(sales, crime_model, 'Philadelphia Crime Rate vs House Price')

[Figure: house price plotted against crime rate, with the data as points and the fitted regression as a line.]

## Identify the outlier

In [18]:
maximum_crime = sales['CrimeRate'].argmax()
outlier = sales[maximum_crime]
print(outlier)

{'Name': 'Phila,CC', 'PopChg': 4.8, 'County': 'Phila', 'HousePrice': 96200, 'MilesPhila': 0.0, 'HsPrc ($10,000)': 9.62, 'CrimeRate': 366.1}


## Get the model coefficients

In [19]:
coefficients = crime_model.get('coefficients')

print(coefficients)

+-------------+-------+----------------+---------------+
|     name    | index |     value      |     stderr    |
+-------------+-------+----------------+---------------+
| (intercept) |  None | 176626.046881  | 11245.5882194 |
|  CrimeRate  |  None | -576.804949058 |  226.90225951 |
+-------------+-------+----------------+---------------+
[2 rows x 4 columns]



In [20]:
intercept, slope = coefficients['value']
print("y = {m:.2f} x + {b:.2f}".format(m=slope, b=intercept))

y = -576.80 x + 176626.05
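
To make the equation concrete, the sketch below plugs a crime rate into it using the coefficient values printed in the table above. The function `predict_price` is a hypothetical helper, not part of GraphLab, and the crime rate of 29.7 is taken from the first row (Abington) of the data.

```python
# Coefficient values copied from the coefficients table above.
INTERCEPT = 176626.046881
SLOPE = -576.804949058


def predict_price(crime_rate):
    """Evaluate y = slope * x + intercept for one crime rate."""
    return SLOPE * crime_rate + INTERCEPT


# 29.7 is the crime rate in the first row of the data (Abington)
print("{0:.2f}".format(predict_price(29.7)))
```

Read this way, the slope says that each additional unit of crime rate lowers the predicted house price by roughly 577 dollars.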


## Predict House Price based on new crime rate

In [21]:
print(crime_model.predict.__doc__)


        Return target value predictions for ``dataset``, using the trained
        linear regression model. This method can be used to get fitted values
        for the model by inputting the training dataset.

        Parameters
        ----------
        dataset : SFrame | pandas.Dataframe
            Dataset of new observations. Must include columns with the same
            names as the features used for model training, but does not require
            a target column. Additional columns are ignored.

        missing_value_action : str, optional
            Action to perform when missing values are encountered. This can be
            one of:

            - 'auto': Default to 'impute'
            - 'impute': Proceed with evaluation by filling in the missing
              values with the mean of the training data. Missing
              values are also imputed if an entire column of data is
              missing during evaluation.
            - 'error': Do not proceed with predictio

Although the model is meant for predicting unseen values, I'll use points that are already in the data so that each prediction can be checked against the actual price.

In [22]:
new_data = graphlab.SFrame({'CrimeRate': [sales[0]['CrimeRate']]})
prediction = crime_model.predict(new_data)
actual = sales[0]['HousePrice']
print("Prediction: {0:.2f}".format(prediction[0]))
print("Actual: {0:.2f}".format(actual))
print('Difference: {0:.2f}'.format(prediction[0] - actual))

Prediction: 159494.94
Actual: 140463.00
Difference: 19031.94


In [23]:
outlier_check = crime_model.predict(outlier)
print("Prediction: {0:.2f}".format(outlier_check[0]))
print("Actual Data: {0:.2f}".format(outlier['HousePrice']))
print("Error predicting the outlier: {0:.2f}".format(outlier['HousePrice'] - outlier_check[0]))

Prediction: -34542.24
Actual Data: 96200.00
Error predicting the outlier: 130742.24
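
The negative prediction follows directly from the fitted line: with a negative slope, the predicted price drops below zero once the crime rate passes the point where the line crosses the x-axis. The sketch below works this out from the coefficient values printed earlier; the variable names are illustrative and not part of the notebook's model.

```python
# Coefficient values copied from the coefficients table printed earlier.
intercept = 176626.046881
slope = -576.804949058

# Solve 0 = slope * x + intercept for x
zero_crossing = -intercept / slope

outlier_crime_rate = 366.1  # 'CrimeRate' of the outlier
outlier_prediction = slope * outlier_crime_rate + intercept

print("Line crosses zero at a crime rate of {0:.1f}".format(zero_crossing))
print("Prediction at {0}: {1:.2f}".format(outlier_crime_rate,
                                          outlier_prediction))
```

The outlier's crime rate of 366.1 lies beyond that crossing point (about 306), which is why a straight-line model gives it a negative house price.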
