### Week 4

### 4.1. Questions

1. Scaling: Suppose the features in your training set have very different scales. Which algorithms might suffer from this, and how? What can you do about it?
2. Can Gradient Descent get stuck in a local minimum when training a Logistic Regression model?
3. Suppose you want to classify pictures as outdoor/indoor and daytime/nighttime. Should you implement two Logistic Regression classifiers or one Softmax Regression classifier?

### 4.2. Practice
- Apply Linear Regression to the California housing data

In [1]:
from sklearn.datasets import fetch_california_housing

# Get data
data = fetch_california_housing()
X = data['data']
y = data['target']

print(data['DESCR'])

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block
        - HouseAge      median house age in block
        - AveRooms      average number of rooms
        - AveBedrms     average number of bedrooms
        - Population    block population
        - AveOccup      average house occupancy
        - Latitude      house block latitude
        - Longitude     house block longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
http://lib.stat.cmu.edu/datasets/

The target variable is the median house value for California districts.

This dataset was derived from the 1990 U.S. census, using one row per census
block group. A block group is the smallest geographical unit for which the U.S.
Census Bur

* Apply Linear Regression

In [46]:
from sklearn.linear_model import LinearRegression, Ridge

model = LinearRegression()
model.fit(X, y)

LinearRegression()

In [12]:
model.score(X, y)

0.6062326851998051

* Evaluate with the *score(X, y)* method, which works by
    - internally running predict(X) to produce predicted values;
    - comparing those predictions with the true target values y that were passed to the method.

The metric depends on whether the model is a regressor or a classifier: for regressors it is the coefficient of determination $R^2$, and for classifiers it is the mean accuracy. Scoring on the same data that was used for fitting, as done here, measures training fit rather than generalization.
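The $R^2$ computation described above can be reproduced by hand. The following is a minimal sketch on synthetic data (the names `X_demo`, `y_demo`, and `demo_model` are hypothetical and not part of the notebook); it assumes the usual definition $R^2 = 1 - SS_{res}/SS_{tot}$.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic regression data, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=200)

demo_model = LinearRegression().fit(X_demo, y_demo)

# Step 1: predict, as score() does internally
y_pred = demo_model.predict(X_demo)

# Step 2: compare predictions with the true targets
ss_res = np.sum((y_demo - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)   # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

r2_score_method = demo_model.score(X_demo, y_demo)
print(r2_manual, r2_score_method)
```

Both printed values should agree up to floating-point rounding.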

In [3]:
print("R^2: {:g}".format(model.score(X, y)))

R^2: 0.606233


In [4]:
# Apply StandardScaler
from sklearn.preprocessing import StandardScaler
import numpy as np
import pandas as pd

# create and fit scaler
scaler = StandardScaler()
scaler.fit(X)

# scale data set
Xt = scaler.transform(X)

# create data frame with results
stats = np.vstack((X.mean(axis=0), X.var(axis=0), Xt.mean(axis=0), Xt.var(axis=0))).T
feature_names = data['feature_names']
columns = ['unscaled mean', 'unscaled variance', 'scaled mean', 'scaled variance']

df = pd.DataFrame(stats, index=feature_names, columns=columns)
df

feature,unscaled mean,unscaled variance,scaled mean,scaled variance
MedInc,3.870671,3.609148,6.6097e-17,1.0
HouseAge,28.639486,158.3886,5.508083e-18,1.0
AveRooms,5.429,6.121236,6.6097e-17,1.0
AveBedrms,1.096675,0.2245806,-1.060306e-16,1.0
Population,1425.476744,1282408.0,-1.101617e-17,1.0
AveOccup,3.070655,107.8648,3.442552e-18,1.0
Latitude,35.631861,4.562072,-1.079584e-15,1.0
Longitude,-119.569704,4.013945,-8.526513e-15,1.0
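The scaled means of about 0 and variances of 1 follow directly from what StandardScaler computes: each column has its mean subtracted and is divided by its standard deviation. A small sketch on synthetic data (the array `X_demo` is hypothetical) that checks this against a manual calculation:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two synthetic features on very different scales, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[3.0, 1400.0], scale=[2.0, 1100.0], size=(500, 2))

Xt_sklearn = StandardScaler().fit_transform(X_demo)

# Manual standardization; np.std defaults to the population standard
# deviation (ddof=0), which matches StandardScaler
Xt_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(np.allclose(Xt_sklearn, Xt_manual))
print(Xt_manual.mean(axis=0), Xt_manual.var(axis=0))
```

The tiny non-zero scaled means in the table (on the order of 1e-15 and smaller) are floating-point rounding error, not a real offset.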


In [5]:
data['feature_names']

['MedInc',
 'HouseAge',
 'AveRooms',
 'AveBedrms',
 'Population',
 'AveOccup',
 'Latitude',
 'Longitude']

In [6]:
# Build a DataFrame of the unscaled features, reusing the dataset's own column names
X_df = pd.DataFrame(X, columns=data['feature_names'])

In [7]:
X_df['label'] = y
X_df

index,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,label
0,8.3252,41.0,6.984127,1.023810,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.971880,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.802260,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422
...,...,...,...,...,...,...,...,...,...
20635,1.5603,25.0,5.045455,1.133333,845.0,2.560606,39.48,-121.09,0.781
20636,2.5568,18.0,6.114035,1.315789,356.0,3.122807,39.49,-121.21,0.771
20637,1.7000,17.0,5.205543,1.120092,1007.0,2.325635,39.43,-121.22,0.923
20638,1.8672,18.0,5.329513,1.171920,741.0,2.123209,39.43,-121.32,0.847


In [8]:
pd.DataFrame(Xt)

index,0,1,2,3,4,5,6,7
0,2.344766,0.982143,0.628559,-0.153758,-0.974429,-0.049597,1.052548,-1.327835
1,2.332238,-0.607019,0.327041,-0.263336,0.861439,-0.092512,1.043185,-1.322844
2,1.782699,1.856182,1.155620,-0.049016,-0.820777,-0.025843,1.038503,-1.332827
3,0.932968,1.856182,0.156966,-0.049833,-0.766028,-0.050329,1.038503,-1.337818
4,-0.012881,1.856182,0.344711,-0.032906,-0.759847,-0.085616,1.038503,-1.337818
...,...,...,...,...,...,...,...,...
20635,-1.216128,-0.289187,-0.155023,0.077354,-0.512592,-0.049110,1.801647,-0.758826
20636,-0.691593,-0.845393,0.276881,0.462365,-0.944405,0.005021,1.806329,-0.818722
20637,-1.142593,-0.924851,-0.090318,0.049414,-0.369537,-0.071735,1.778237,-0.823713
20638,-1.054583,-0.845393,-0.040211,0.158778,-0.604429,-0.091225,1.778237,-0.873626


In [13]:
model2 = LinearRegression()
model2.fit(Xt, y)
model2.score(Xt, y)

0.606232685199805
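The scaled and unscaled fits give the same $R^2$ up to rounding. This is expected for unregularized least squares: standardizing is an invertible affine change of each feature, so the model can absorb it into its coefficients and intercept and make identical predictions. A minimal sketch on synthetic data (all `*_demo` names are hypothetical):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Two synthetic features whose scales differ by a factor of 1000
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 2)) * np.array([1.0, 1000.0])
y_demo = 2.0 * X_demo[:, 0] + 0.003 * X_demo[:, 1] + rng.normal(scale=0.5, size=300)

scaler_demo = StandardScaler().fit(X_demo)
Xt_demo = scaler_demo.transform(X_demo)

raw = LinearRegression().fit(X_demo, y_demo)
scaled = LinearRegression().fit(Xt_demo, y_demo)

# Same fit quality on both representations
print(raw.score(X_demo, y_demo), scaled.score(Xt_demo, y_demo))

# The coefficients differ only by the per-feature scale factor
print(raw.coef_ * scaler_demo.scale_, scaled.coef_)
```

Scaling does matter for regularized models such as Ridge or Lasso and for gradient-based solvers, where the penalty or the step size depends on the magnitude of each feature.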

### 4.3. Homework

In [23]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression

# Degree-3 polynomial expansion of the scaled features, followed by least squares
degree = 3
polyreg = make_pipeline(PolynomialFeatures(degree), LinearRegression())
polyreg.fit(Xt, y)
polyreg.score(Xt, y)  # R^2 on the training data

0.7385168108924949
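Part of the jump in training $R^2$ comes from the much larger feature set the pipeline fits. With 8 input features and degree 3, PolynomialFeatures produces every monomial of total degree at most 3, including the constant term. The sketch below counts them (the variable names are hypothetical); it only inspects the transformer and does not refit the housing model.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features, degree = 8, 3

# Fitting on a dummy row is enough to learn the output dimensionality
poly = PolynomialFeatures(degree).fit(np.zeros((1, n_features)))
n_expanded = poly.n_output_features_

# Closed form: number of monomials of degree <= d in n variables is C(n + d, d)
print(n_expanded, comb(n_features + degree, degree))
```

Since the score is computed on the training data, a model with this many more parameters is expected to fit at least as well; a held-out set or cross-validation would be needed to tell whether the gain generalizes.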

In [67]:
# https://stackoverflow.com/questions/40452759/pandas-latitude-longitude-to-distance-between-successive-rows
# New distance feature from the Longitude and Latitude values
import numpy as np
df = X_df.copy().drop(['label'], axis=1)

# vectorized haversine function
def haversine(lat1, lon1, lat2, lon2, to_radians=True, earth_radius=6371):
    """
    Slightly modified version of http://stackoverflow.com/a/29546836/2901002

    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees or in radians).
    The result is in km for the default earth_radius.

    All (lat, lon) coordinates must have numeric dtypes and be of equal length.
    """
    # NOTE: the full conversion below is commented out, so to_radians is unused
    # and only the latitudes are converted. The longitudes stay in degrees,
    # which means the result is not a true great-circle distance.
    # if to_radians:
    #     lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])
    lat1 = np.radians(lat1)
    lat2 = np.radians(lat2)

    a = np.sin((lat2 - lat1) / 2.0)**2 + \
        np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0)**2

    return earth_radius * 2 * np.arcsin(np.sqrt(a))


# 'dist' is measured from every block to the reference point (lat=0, lon=0)
df['dist'] = haversine(df.Latitude.astype(float), df.Longitude.astype(float), float(0), float(0))
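The haversine formula expects all four angles in radians, whereas the function above converts only the latitudes. A minimal sketch of a variant that converts every input is shown here; `haversine_km` is a hypothetical name, and this is not the version that produced the `dist` values in the output that follows.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2, earth_radius=6371.0):
    """Great-circle distance in km between points given in decimal degrees."""
    # Convert all four inputs (scalars or arrays) from degrees to radians
    lat1, lon1, lat2, lon2 = (
        np.radians(np.asarray(v, dtype=float)) for v in (lat1, lon1, lat2, lon2)
    )
    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    # Clip guards against rounding pushing a slightly outside [0, 1]
    return earth_radius * 2.0 * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))

# Sanity check: a quarter of the way around the equator
quarter = haversine_km(0.0, 0.0, 0.0, 90.0)
print(quarter, 6371.0 * np.pi / 2)

# Works element-wise on arrays, broadcasting against a scalar reference point
print(haversine_km(np.array([0.0, 90.0]), np.array([0.0, 0.0]), 0.0, 0.0))
```

A quarter of the equator should equal `earth_radius * pi / 2`, which gives a quick check that the unit handling is right.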



In [68]:
X_new = df.values
X_new

array([[ 8.32520000e+00,  4.10000000e+01,  6.98412698e+00, ...,
         3.78800000e+01, -1.22230000e+02,  1.54672945e+04],
       [ 8.30140000e+00,  2.10000000e+01,  6.23813708e+00, ...,
         3.78600000e+01, -1.22220000e+02,  1.54468378e+04],
       [ 7.25740000e+00,  5.20000000e+01,  8.28813559e+00, ...,
         3.78500000e+01, -1.22240000e+02,  1.54921038e+04],
       ...,
       [ 1.70000000e+00,  1.70000000e+01,  5.20554273e+00, ...,
         3.94300000e+01, -1.21220000e+02,  1.13232033e+04],
       [ 1.86720000e+00,  1.80000000e+01,  5.32951289e+00, ...,
         3.94300000e+01, -1.21320000e+02,  1.18047371e+04],
       [ 2.38860000e+00,  1.60000000e+01,  5.25471698e+00, ...,
         3.93700000e+01, -1.21240000e+02,  1.14212718e+04]])

In [48]:
# Change hyperparameters/model: C, alpha, Ridge, Lasso, Elastic Net
model = Ridge(alpha=0.01)
model.fit(X, y)
model.score(X, y)

0.6062326851971467
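With `alpha=0.01` the Ridge penalty is so weak that the score is practically the same as for plain LinearRegression. To compare the regularized models named in the comment more meaningfully, cross-validated scores are a common choice. The following is a sketch on synthetic data, with arbitrary hyperparameter values chosen only for illustration:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the housing data (8 features), for illustration only
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# Hypothetical candidate models; the alpha values are not tuned
candidates = {
    "Ridge(alpha=1.0)": Ridge(alpha=1.0),
    "Lasso(alpha=0.1)": Lasso(alpha=0.1),
    "ElasticNet(alpha=0.1, l1_ratio=0.5)": ElasticNet(alpha=0.1, l1_ratio=0.5),
}

# Mean R^2 over 5 folds, so each score comes from data not used for fitting
cv_scores = {
    name: cross_val_score(est, X_demo, y_demo, cv=5).mean()
    for name, est in candidates.items()
}

for name, mean_r2 in cv_scores.items():
    print(f"{name}: mean CV R^2 = {mean_r2:.4f}")
```

On the real data, the features would normally be standardized first (for example inside a pipeline), because these penalties depend on feature scale.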

In [64]:
model.fit(X_new, y)
model.score(X_new, y)

0.9999999999996579
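An $R^2$ this close to 1 is out of line with the other fits on this data, which score about 0.61. One possible explanation, stated here as an assumption that the notebook does not confirm: cell In [64] was executed before In [67], so `X_new` may still have contained the `label` column, which would leak the target into the features. The sketch below reproduces that effect on synthetic data (all names are hypothetical):

```python
import numpy as np
from sklearn.linear_model import Ridge

# Noisy synthetic problem where an honest linear fit is mediocre
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 4))
y_demo = X_demo @ np.array([1.0, -1.0, 0.5, 0.0]) + rng.normal(scale=2.0, size=500)

honest = Ridge(alpha=0.01).fit(X_demo, y_demo)
honest_r2 = honest.score(X_demo, y_demo)

# Leakage: the target itself is appended as an extra feature column
X_leaky = np.column_stack([X_demo, y_demo])
leaky = Ridge(alpha=0.01).fit(X_leaky, y_demo)
leaky_r2 = leaky.score(X_leaky, y_demo)

print(honest_r2, leaky_r2)
```

Checking `X_new.shape` and rerunning the notebook from top to bottom would show whether the added `dist` feature alone explains the score.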