
In [1]:
import pandas as pd 
import numpy as np 
from pandas import Series, DataFrame
from sklearn.datasets import fetch_california_housing

In [2]:
dataset = fetch_california_housing()

Downloading Cal. housing from https://ndownloader.figshare.com/files/5976036 to /root/scikit_learn_data


In [49]:
X = dataset['data']
y = dataset['target']

In [4]:
df = DataFrame(dataset.data, columns = dataset['feature_names'])
df.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25


In [5]:
print(dataset['DESCR'])

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block
        - HouseAge      median house age in block
        - AveRooms      average number of rooms
        - AveBedrms     average number of bedrooms
        - Population    block population
        - AveOccup      average house occupancy
        - Latitude      house block latitude
        - Longitude     house block longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
http://lib.stat.cmu.edu/datasets/

The target variable is the median house value for California districts.

This dataset was derived from the 1990 U.S. census, using one row per census
block group. A block group is the smallest geographical unit for which the U.S.
Census Bur

**Column Transformers**

For illustration, we apply standard scaling to all of the California housing features except latitude and longitude. We select the columns to be scaled and let the remaining columns pass through unchanged using the remainder argument.

In [6]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

In [7]:
col_transformer = ColumnTransformer(remainder = 'passthrough',transformers =
                                    [('scaler', StandardScaler(), slice(0,6))])
col_transformer.fit(dataset.data)
Xt = col_transformer.transform(dataset.data)



Dataset after Transformation

In [8]:
df1 = DataFrame(Xt, columns = dataset['feature_names'])
df1.head()  # show the transformed frame; df holds the untransformed data


We could also rewrite the column transformer to drop MedInc, scale the next five features, and let latitude and longitude pass through.

In [9]:
col_transformer = ColumnTransformer( remainder = 'passthrough', transformers =
                                    [('remove','drop', 0),('scaler',StandardScaler(), slice(1,6))])
col_transformer.fit(X) # X is the same as dataset.data
Xt1 = col_transformer.transform(X)

In [10]:
df2 = DataFrame(Xt1, columns = dataset.feature_names[1:])
df2.head()

Unnamed: 0,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
0,0.982143,0.628559,-0.153758,-0.974429,-0.049597,37.88,-122.23
1,-0.607019,0.327041,-0.263336,0.861439,-0.092512,37.86,-122.22
2,1.856182,1.15562,-0.049016,-0.820777,-0.025843,37.85,-122.24
3,1.856182,0.156966,-0.049833,-0.766028,-0.050329,37.85,-122.25
4,1.856182,0.344711,-0.032906,-0.759847,-0.085616,37.85,-122.25
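
The column names above line up only because of how ColumnTransformer orders its output: the results of the listed transformers come first, in list order, and passthrough columns are appended on the right. A minimal sketch on a made-up three-column array (the array `toy` and its values are assumptions for illustration) shows the reordering when the scaled column is not the first one:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Toy matrix with three columns; the values are made up for illustration.
toy = np.array([[1.0, 10.0, 100.0],
                [2.0, 20.0, 200.0],
                [3.0, 30.0, 300.0]])

# Scale only the last column and let the first two pass through.
ct = ColumnTransformer(
    transformers=[('scaler', StandardScaler(), [2])],
    remainder='passthrough',
)
out = ct.fit_transform(toy)

# Transformer outputs come first and passthrough columns follow,
# so the scaled third column now sits in position 0.
print(out)
```

When the scaled columns are not a leading block, the column labels passed to DataFrame need to be reordered to match.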


**Feature Unions and Pipelines**

To illustrate a feature union, we apply the PCA and SelectKBest transformers. The PCA (principal component analysis) transformer returns a new set of uncorrelated features derived from the original features, while SelectKBest returns the k best features according to a scoring function that is passed in. For this example, the selector returns the two features with the highest f_regression score, that is, the strongest linear relationship with the labels. PCA works on centered data and is sensitive to feature scale; scikit-learn's PCA centers the data itself but does not rescale it, so we standardize the features first and create a pipeline object that represents this two-step process. The PCA object returns 4 uncorrelated features, so the union of PCA and SelectKBest yields a dataset of six features.

In [16]:
from sklearn.feature_selection import f_regression, SelectKBest
from sklearn.decomposition import PCA 
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.linear_model import LinearRegression

In [21]:
scaler = StandardScaler()
pca = PCA(n_components = 4)
selector = SelectKBest(f_regression, k = 2)
lin_reg = LinearRegression()

In [23]:
pca_pipe = Pipeline([('scaler', scaler),('dim_red', pca)])
union = FeatureUnion([('pca_pipe',pca_pipe),('selector', selector)])
pipe = Pipeline([('union', union),('lin_reg',lin_reg)])
pipe.fit(X,y)

Pipeline(memory=None,
         steps=[('union',
                 FeatureUnion(n_jobs=None,
                              transformer_list=[('pca_pipe',
                                                 Pipeline(memory=None,
                                                          steps=[('scaler',
                                                                  StandardScaler(copy=True,
                                                                                 with_mean=True,
                                                                                 with_std=True)),
                                                                 ('dim_red',
                                                                  PCA(copy=True,
                                                                      iterated_power='auto',
                                                                      n_components=4,
                                                                      random_

In [32]:
print('R^2 value of the model is {}'.format(pipe.score(X,y)))

R^2 value of the model is 0.5288130088767813
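
As a quick check of the feature count described above, the union can be fitted on its own and its output shape inspected. This is a sketch on synthetic data from make_regression (the names `X_demo`, `y_demo`, and `union_demo` are assumptions, chosen to avoid clashing with the notebook's variables), not on the housing data:

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with eight features, like the housing data.
X_demo, y_demo = make_regression(n_samples=200, n_features=8,
                                 n_informative=3, random_state=0)

# Branch 1: standardize, then keep four principal components.
pca_branch = Pipeline([('scaler', StandardScaler()),
                       ('dim_red', PCA(n_components=4))])

# Branch 2: keep the two features with the highest f_regression score.
union_demo = FeatureUnion([('pca_pipe', pca_branch),
                           ('selector', SelectKBest(f_regression, k=2))])

# The union stacks the branch outputs side by side: 4 + 2 columns.
Xu = union_demo.fit_transform(X_demo, y_demo)
print(Xu.shape)  # (200, 6)
```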


**Custom Estimators**

We will create a custom transformer that replaces outliers, that is, values outside a given interval. The algorithm of the transformer is as follows.

1. For each feature, determine the lower and upper bounds of acceptable values. These bounds are percentiles of the feature. For example, if the 5th and 95th percentiles of a feature are 1.3 and 7.5 respectively, then every value outside [1.3, 7.5] is considered an outlier.

2. Replace each outlier with the nearest bound. Using the same values as before, a value of 0.7 lies below the acceptable range [1.3, 7.5] and is replaced by 1.3. A value to the right of the interval, larger than the upper bound, is replaced by 7.5.

In [43]:
from sklearn.base import BaseEstimator, TransformerMixin

class OutlierReplacer(BaseEstimator, TransformerMixin):
  def __init__(self, q_lower, q_upper):
    # q_lower and q_upper are percentile ranks in [0, 100], e.g. 5 and 95
    self.q_upper = q_upper
    self.q_lower = q_lower

  def fit(self, X, y = None):
    # learn the per-feature bounds from the training data
    self.upper = np.percentile(X, self.q_upper, axis = 0)
    self.lower = np.percentile(X, self.q_lower, axis = 0)
    return self

  def transform(self, X):
    Xt = X.copy()
    # compare against the fitted bounds, not the percentile ranks
    ind_lower = X < self.lower
    ind_upper = X > self.upper
    for i in range(X.shape[-1]):
      Xt[ind_lower[:,i], i] = self.lower[i]
      Xt[ind_upper[:,i], i] = self.upper[i]
    return Xt

In [44]:
replacer = OutlierReplacer(5, 95)
replacer.fit(X)
Xt = replacer.transform(X)
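
The two-step algorithm described earlier (compute per-feature percentile bounds, then replace anything outside them with the nearest bound) can also be written without a column loop using np.clip. The sketch below runs on random data; `X_demo`, `lower`, `upper`, and `X_clipped` are names assumed for illustration:

```python
import numpy as np

# Random stand-in data: 1000 rows, 3 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))

# Step 1: per-feature bounds at the 5th and 95th percentiles.
lower = np.percentile(X_demo, 5, axis=0)
upper = np.percentile(X_demo, 95, axis=0)

# Step 2: values below `lower` become `lower`, values above `upper`
# become `upper`; the bounds broadcast across the rows.
X_clipped = np.clip(X_demo, lower, upper)

print(X_clipped.min(axis=0), X_clipped.max(axis=0))
```

Values already inside the interval are left unchanged, so only about 10% of the entries in each column are altered.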

1. The California housing data has features for latitude and longitude. Create a custom transformer that returns the distance from a given set of coordinates. Use this custom transformer to create features for the distance from Los Angeles and San Francisco, and include them with the original features.

In [54]:
class Distance(BaseEstimator, TransformerMixin):
  def __init__(self, coord):
    # coord is a list or tuple: index 0 is the latitude, index 1 is the longitude
    self.coord = coord

  def fit(self, X, y = None):
    return self  # nothing to learn

  def transform(self, X):
    # X must hold latitude in column 0 and longitude in column 1
    lat = X[:,0]
    lon = X[:,1]

    # Euclidean distance in degrees, returned as a single column
    dist = np.sqrt((lat - self.coord[0])**2 + (lon - self.coord[1])**2)
    dist = dist.reshape(-1,1)

    return dist
  


In [55]:
coord_LA = (34, -118)
dist_LA = Distance(coord_LA)
dis = dist_LA.fit_transform(X[:,-2:])
dis.shape

(20640, 1)
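
The Distance transformer measures straight-line distance in degrees of latitude and longitude. If distances in kilometres are preferred, one option is the haversine (great-circle) formula, which is a different technique from the one used in this notebook. The sketch below is an assumption-laden variant: the class name `HaversineDistance`, the Earth radius constant, and the approximate city coordinates are all introduced here for illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


class HaversineDistance(BaseEstimator, TransformerMixin):
    """Hypothetical variant of Distance: great-circle distance in km."""

    def __init__(self, coord):
        self.coord = coord  # (latitude, longitude) in degrees

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        # Expects latitude in column 0 and longitude in column 1, in degrees.
        X = np.asarray(X, dtype=float)
        lat = np.radians(X[:, 0])
        lon = np.radians(X[:, 1])
        lat0, lon0 = np.radians(self.coord[0]), np.radians(self.coord[1])
        a = (np.sin((lat - lat0) / 2) ** 2
             + np.cos(lat) * np.cos(lat0) * np.sin((lon - lon0) / 2) ** 2)
        return (2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))).reshape(-1, 1)


# Approximate city coordinates, used only for illustration.
coord_LA = (34.05, -118.24)
points = np.array([[34.05, -118.24],   # Los Angeles itself
                   [37.77, -122.42]])  # San Francisco
km_from_LA = HaversineDistance(coord_LA).fit_transform(points)
print(km_from_LA.ravel())
```

The first entry is zero and the second is the Los Angeles to San Francisco distance, roughly 560 km.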

Let's create a class that keeps only the columns at the given indices, dropping the rest, instead of slicing the input passed to fit_transform.

In [56]:
class Drop_col(BaseEstimator, TransformerMixin):
  def __init__(self, ind_col):
    self.ind_col = ind_col

  def fit(self, X, y = None):
    return self

  def transform(self, X):
    return X[:, self.ind_col]

In [58]:
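coord_LA = (34, -118)
coord_SF = (37, -112)  # as run; San Francisco's longitude is closer to -122.4
dist_LA = Distance(coord_LA)
dist_SF = Distance(coord_SF)
drop = Drop_col([0,1,2,3,4,5])  # keep the six non-geographic features
# Caution: every transformer in a FeatureUnion receives the full X, so Distance
# reads columns 0 and 1 (MedInc, HouseAge) here, not Latitude and Longitude.
union = FeatureUnion([('drop', drop),('LA', dist_LA),('SF',dist_SF)])
pipe = Pipeline([('union', union), ('lin_reg', lin_reg)])
pipe.fit(X,y)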

Pipeline(memory=None,
         steps=[('union',
                 FeatureUnion(n_jobs=None,
                              transformer_list=[('drop',
                                                 Drop_col(ind_col=[0, 1, 2, 3,
                                                                   4, 5])),
                                                ('LA',
                                                 Distance(coord=(34, -118))),
                                                ('SF',
                                                 Distance(coord=(37, -112)))],
                              transformer_weights=None, verbose=False)),
                ('lin_reg',
                 LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
                                  normalize=False))],
         verbose=False)

In [59]:
pipe.score(X,y)

0.5491806157969854

In [60]:
pipe.predict(X)

array([4.18426961, 4.00041895, 3.70072415, ..., 0.62692676, 0.7715129 ,
       1.09751205])

In [61]:
pipe.named_steps

{'lin_reg': LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False),
 'union': FeatureUnion(n_jobs=None,
              transformer_list=[('drop', Drop_col(ind_col=[0, 1, 2, 3, 4, 5])),
                                ('LA', Distance(coord=(34, -118))),
                                ('SF', Distance(coord=(37, -112)))],
              transformer_weights=None, verbose=False)}
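
In the union above, each branch receives the full feature matrix, so Distance reads columns 0 and 1 rather than the latitude and longitude columns. One possible way to route only the geographic columns to Distance is a ColumnTransformer, sketched below on synthetic data. The array `X_demo`, the approximate city coordinates, and the choice to keep all eight original columns (as the exercise statement asks) are assumptions for illustration; the resulting scores on the housing data are not shown here.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer


class Distance(BaseEstimator, TransformerMixin):
    """Euclidean distance (in degrees) from a fixed (lat, lon) pair."""

    def __init__(self, coord):
        self.coord = coord

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = np.asarray(X)
        dist = np.sqrt((X[:, 0] - self.coord[0]) ** 2
                       + (X[:, 1] - self.coord[1]) ** 2)
        return dist.reshape(-1, 1)


# Synthetic stand-in: six generic features, then latitude and longitude.
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.normal(size=(50, 6)),
    rng.uniform(32, 42, size=50),      # latitude-like values
    rng.uniform(-124, -114, size=50),  # longitude-like values
])

coord_LA = (34.05, -118.24)  # approximate, for illustration
coord_SF = (37.77, -122.42)  # approximate, for illustration
geo_cols = [6, 7]            # positions of latitude and longitude

features = ColumnTransformer([
    ('original', 'passthrough', slice(0, 8)),  # keep all eight columns
    ('LA', Distance(coord_LA), geo_cols),      # sees only lat/lon
    ('SF', Distance(coord_SF), geo_cols),
])
X_aug = features.fit_transform(X_demo)
print(X_aug.shape)  # (50, 10)
```

The resulting transformer can be used as the first step of a Pipeline in place of the union, followed by the regressor.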