# Linear Regression in Python

By: [Paul Jeffries](https://twitter.com/ByPaulJ) 

A work-in-progress regression vignette in Python that uses house price data from the well-known Ames Housing dataset; [see De Cock (2011)](http://jse.amstat.org/v19n3/decock.pdf).


In [1]:
import datetime
# prints the present date and time as a form of log
print("This notebook was last run: ", datetime.datetime.now())

This notebook was last run:  2019-04-01 23:29:00.187946


In [2]:
# key packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from janitor import clean_names, remove_empty

In [3]:
# basic setup steps

# prints the present working directory (helpful for sourcing the CSVs later)
%pwd

'/Users/pauljeffries/Desktop/personal/personal_code/data-science-toolkit/regression'

In [4]:
# setting seed for any functions that require randomization (for repeatability)
np.random.seed(123)
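
As a quick illustration of why the seed matters, the sketch below re-seeds NumPy's global generator and checks that the same draws come back. The variable names are illustrative, and the `default_rng` lines show an alternative that this notebook does not use.

```python
import numpy as np

# Seeding the global generator twice with the same value yields the same draws
np.random.seed(123)
first_draw = np.random.rand(3)

np.random.seed(123)
second_draw = np.random.rand(3)

print(np.array_equal(first_draw, second_draw))

# Alternative (not used in this notebook): a local Generator object,
# which avoids relying on global state
rng_a = np.random.default_rng(123)
rng_b = np.random.default_rng(123)
print(np.array_equal(rng_a.random(3), rng_b.random(3)))
```

Both comparisons should agree, since each pair of draws starts from the same seed.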

# Exploratory Data Analysis (EDA)

The `pyjanitor` library used below is [documented in its GitHub repository](https://github.com/ericmjl/pyjanitor).

The specific version of the Ames Housing dataset used here is the slightly processed version prepared by Max Kuhn as part of his [AmesHousing package in R](https://github.com/topepo/AmesHousing).

In [5]:
# import the dataset from csv into a pandas df
# note: despite the `life_exp` prefix, this frame holds the Ames Housing data
life_exp_base_df = pd.read_csv('data/ames_housing_processed.csv')

# print the head of the df to inspect what we imported
life_exp_base_df.head()

Unnamed: 0,MS_SubClass,MS_Zoning,Lot_Frontage,Lot_Area,Street,Alley,Lot_Shape,Land_Contour,Utilities,Lot_Config,...,Fence,Misc_Feature,Misc_Val,Mo_Sold,Year_Sold,Sale_Type,Sale_Condition,Sale_Price,Longitude,Latitude
0,One_Story_1946_and_Newer_All_Styles,Residential_Low_Density,141,31770,Pave,No_Alley_Access,Slightly_Irregular,Lvl,AllPub,Corner,...,No_Fence,,0,5,2010,WD,Normal,215000,-93.619754,42.054035
1,One_Story_1946_and_Newer_All_Styles,Residential_High_Density,80,11622,Pave,No_Alley_Access,Regular,Lvl,AllPub,Inside,...,Minimum_Privacy,,0,6,2010,WD,Normal,105000,-93.619756,42.053014
2,One_Story_1946_and_Newer_All_Styles,Residential_Low_Density,81,14267,Pave,No_Alley_Access,Slightly_Irregular,Lvl,AllPub,Corner,...,No_Fence,Gar2,12500,6,2010,WD,Normal,172000,-93.619387,42.052659
3,One_Story_1946_and_Newer_All_Styles,Residential_Low_Density,93,11160,Pave,No_Alley_Access,Regular,Lvl,AllPub,Corner,...,No_Fence,,0,4,2010,WD,Normal,244000,-93.61732,42.051245
4,Two_Story_1946_and_Newer,Residential_Low_Density,74,13830,Pave,No_Alley_Access,Slightly_Irregular,Lvl,AllPub,Inside,...,Minimum_Privacy,,0,3,2010,WD,Normal,189900,-93.638933,42.060899
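
The `...` in the printed head marks columns that pandas left out of the display because the frame is wide. A minimal sketch of lifting that limit temporarily, using a hypothetical 30-column toy frame (`wide_df`) in place of the Ames data:

```python
import pandas as pd

# Toy frame with 30 columns (a hypothetical stand-in for the wide Ames data)
wide_df = pd.DataFrame({f"col_{i}": [i, i + 1] for i in range(30)})

# Force a small column limit so the truncation is visible in any environment
with pd.option_context("display.max_columns", 5):
    truncated_view = repr(wide_df.head())

# Lift the column limit for this block only; the global setting is untouched
with pd.option_context("display.max_columns", None):
    full_view = repr(wide_df.head())

print("col_15" in truncated_view)  # middle column is elided
print("col_15" in full_view)       # middle column is shown
```

Using `pd.option_context` rather than `pd.set_option` keeps the change local, which is a design choice for this sketch rather than a requirement.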


In [6]:
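# standardize column names (lowercase, underscores) and
# drop any rows or columns that are entirely empty
clean_life_exp_df = (
    life_exp_base_df
    .clean_names(strip_underscores=True)
    .remove_empty()
)  # further method chaining possible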
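
For readers without `pyjanitor` installed, the sketch below approximates the two cleaning steps in plain pandas. It is an assumption-laden stand-in, not pyjanitor's implementation: `approx_clean_names` and `approx_remove_empty` are hypothetical helpers, and the toy frame is made up for illustration.

```python
import pandas as pd


def approx_clean_names(df):
    """Hypothetical helper: lowercase names, spaces to underscores,
    and strip leading/trailing underscores."""
    cleaned = (
        df.columns.str.strip()
        .str.lower()
        .str.replace(r"\s+", "_", regex=True)
        .str.strip("_")
    )
    out = df.copy()
    out.columns = cleaned
    return out


def approx_remove_empty(df):
    """Hypothetical helper: drop rows and columns that are entirely missing."""
    return df.dropna(how="all", axis=0).dropna(how="all", axis=1)


# Made-up toy data with messy names, one empty row, and one empty column
toy_df = pd.DataFrame(
    {
        "Lot Area": [31770.0, None, 11622.0],
        "Sale_Price_": [215000.0, None, 105000.0],
        "Empty Col": [None, None, None],
    }
)

toy_clean = toy_df.pipe(approx_clean_names).pipe(approx_remove_empty)
print(list(toy_clean.columns))
print(len(toy_clean))
```

`DataFrame.pipe` keeps the same chained style as the cell above while using ordinary functions instead of registered methods.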

In [7]:
clean_life_exp_df.head()

Unnamed: 0,ms_subclass,ms_zoning,lot_frontage,lot_area,street,alley,lot_shape,land_contour,utilities,lot_config,...,fence,misc_feature,misc_val,mo_sold,year_sold,sale_type,sale_condition,sale_price,longitude,latitude
0,One_Story_1946_and_Newer_All_Styles,Residential_Low_Density,141,31770,Pave,No_Alley_Access,Slightly_Irregular,Lvl,AllPub,Corner,...,No_Fence,,0,5,2010,WD,Normal,215000,-93.619754,42.054035
1,One_Story_1946_and_Newer_All_Styles,Residential_High_Density,80,11622,Pave,No_Alley_Access,Regular,Lvl,AllPub,Inside,...,Minimum_Privacy,,0,6,2010,WD,Normal,105000,-93.619756,42.053014
2,One_Story_1946_and_Newer_All_Styles,Residential_Low_Density,81,14267,Pave,No_Alley_Access,Slightly_Irregular,Lvl,AllPub,Corner,...,No_Fence,Gar2,12500,6,2010,WD,Normal,172000,-93.619387,42.052659
3,One_Story_1946_and_Newer_All_Styles,Residential_Low_Density,93,11160,Pave,No_Alley_Access,Regular,Lvl,AllPub,Corner,...,No_Fence,,0,4,2010,WD,Normal,244000,-93.61732,42.051245
4,Two_Story_1946_and_Newer,Residential_Low_Density,74,13830,Pave,No_Alley_Access,Slightly_Irregular,Lvl,AllPub,Inside,...,Minimum_Privacy,,0,3,2010,WD,Normal,189900,-93.638933,42.060899
