![](MSFTLogo.png)
# SQL Server Machine Learning Services
## 03 - Create Data Features using T-SQL

After exploring the data, you have gathered some insights and are ready to move on to feature engineering. Creating features from the raw data is often a critical step in advanced analytics modeling.

In this Notebook, you'll learn how to create features from raw data by using a Transact-SQL function. You'll then call that function from a stored procedure to create a table that contains the feature values.

> Note: You can also do your Feature Engineering in Python, R or another language using the Machine Learning functions in SQL Server. Using T-SQL allows the data professional to do some of that work for the Data Scientist, so you'll explore doing that in this Notebook.

## Define the Function

The distance values reported in the original data are based on the reported meter distance, and don't necessarily represent geographical distance or distance traveled. Therefore, you'll need to calculate the direct distance between the pick-up and drop-off points, by using the coordinates available in the source NYC Taxi dataset. You can do this by using the Haversine formula in a custom Transact-SQL function.

You'll use one custom T-SQL function, *fnCalculateDistance*, to compute the distance using the Haversine formula, and use a second custom T-SQL function, *fnEngineerFeatures*, to create a table containing all the features.
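
For reference, the calculation that *fnCalculateDistance* performs in the next section can be written out as follows. This is a reading of the function's code, not a formula quoted from elsewhere; the symbols $\varphi$ (latitude in radians), $\lambda$ (longitude in radians), $c$, and $R$ are notation introduced here for illustration.

$$
c = \sin\varphi_1 \sin\varphi_2 + \cos\varphi_1 \cos\varphi_2 \cos(\lambda_2 - \lambda_1),
\qquad
d = R \,\arctan\!\left(\frac{\sqrt{1 - c^2}}{c}\right),
\qquad
R = 3958.75 \text{ miles}
$$

For $0 < c \le 1$ the arctangent term equals $\arccos c$, so $d$ is the great-circle distance between the two points. Written this way, the expression is the spherical law of cosines form of that distance; the haversine form gives the same result on a sphere and differs mainly in its numerical behavior for very short trips.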

## Calculate trip distance using fnCalculateDistance

You create the function *fnCalculateDistance* using the following code:


```sql
USE NYCTaxi;
GO

-- CREATE OR ALTER (SQL Server 2016 SP1 and later) lets you rerun this cell
-- without an "object already exists" error (Msg 2714).
CREATE OR ALTER FUNCTION [dbo].[fnCalculateDistance] (@Lat1 float, @Long1 float, @Lat2 float, @Long2 float)
-- User-defined function that calculates the direct distance between two geographical coordinates
RETURNS float
AS
BEGIN
  DECLARE @distance decimal(28, 10);
  -- Convert degrees to radians (57.2958 is approximately 180 / pi)
  SET @Lat1 = @Lat1 / 57.2958;
  SET @Long1 = @Long1 / 57.2958;
  SET @Lat2 = @Lat2 / 57.2958;
  SET @Long2 = @Long2 / 57.2958;
  -- Calculate the cosine of the central angle between the two points
  SET @distance = (SIN(@Lat1) * SIN(@Lat2)) + (COS(@Lat1) * COS(@Lat2) * COS(@Long2 - @Long1));
  -- Convert the angle to miles (3958.75 is the approximate radius of the Earth in miles)
  IF @distance <> 0
  BEGIN
    SET @distance = 3958.75 * ATAN(SQRT(1 - POWER(@distance, 2)) / @distance);
  END
  RETURN @distance;
END
GO
```

The function is a scalar-valued function: it returns a single value of a predefined type.
It takes the latitude and longitude of the trip's pick-up and drop-off locations as inputs, converts them from degrees to radians, and uses those values to compute the direct distance in miles between the two locations.
To add the computed value to a table that can be used for training the model, you'll use another function, *fnEngineerFeatures*.
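
As a quick spot check, you can call the scalar function directly with literal coordinates. This is a minimal sketch: the two points below are illustrative values roughly in Manhattan, chosen for this example and not taken from the `nyctaxi_sample` table, and the column alias is arbitrary.

```sql
USE NYCTaxi;
GO

-- Hypothetical coordinates (decimal degrees), not rows from the dataset.
-- Argument order matches the function signature: Lat1, Long1, Lat2, Long2.
SELECT dbo.fnCalculateDistance(40.7580, -73.9855, 40.7484, -73.9857) AS direct_distance_miles;
GO
```

A result of well under a mile for two points this close together is a reasonable sign that the arguments are being passed in the right order.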

## Save the features using fnEngineerFeatures

The custom T-SQL function *fnEngineerFeatures* is a table-valued function that takes multiple columns as inputs and outputs a table with multiple feature columns. Its purpose is to create a feature set for use in building a model. It calls the previously created T-SQL function, *fnCalculateDistance*, to get the direct distance between the pick-up and drop-off locations.

The following code creates the function:

```sql
-- CREATE OR ALTER (SQL Server 2016 SP1 and later) lets you rerun this cell safely.
CREATE OR ALTER FUNCTION [dbo].[fnEngineerFeatures] (
@passenger_count int = 0,
@trip_distance float = 0,
@trip_time_in_secs int = 0,
@pickup_latitude float = 0,
@pickup_longitude float = 0,
@dropoff_latitude float = 0,
@dropoff_longitude float = 0)
RETURNS TABLE
AS
  RETURN
  (
  -- Pass three inputs through unchanged and add the computed direct distance
  SELECT
    @passenger_count AS passenger_count,
    @trip_distance AS trip_distance,
    @trip_time_in_secs AS trip_time_in_secs,
    [dbo].[fnCalculateDistance](@pickup_latitude, @pickup_longitude, @dropoff_latitude, @dropoff_longitude) AS direct_distance
  )
GO
```
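
Because *fnEngineerFeatures* is table-valued, one way to apply it to every row of a table is `CROSS APPLY`. The query below is a minimal sketch, assuming the `nyctaxi_sample` table exposes the column names used elsewhere in this Notebook; the aliases `t` and `f` and the `TOP (10)` limit are choices made for this example.

```sql
-- Sketch: build the feature columns for a handful of trips.
-- CROSS APPLY invokes the function once per row of nyctaxi_sample.
SELECT TOP (10)
    t.tipped,
    f.passenger_count,
    f.trip_distance,
    f.trip_time_in_secs,
    f.direct_distance
FROM nyctaxi_sample AS t
CROSS APPLY dbo.fnEngineerFeatures(
    t.passenger_count,
    t.trip_distance,
    t.trip_time_in_secs,
    t.pickup_latitude,
    t.pickup_longitude,
    t.dropoff_latitude,
    t.dropoff_longitude) AS f;
GO
```

The label column `tipped` comes from the source table, while the feature columns come from the function's output, so the result already has the shape of a training set.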


To verify that *fnCalculateDistance* works, you can use it to calculate the geographical distance for trips where the metered distance was 0 but the pick-up and drop-off locations were different:

```sql
SELECT tipped, fare_amount, passenger_count, (trip_time_in_secs/60) AS TripMinutes,
    trip_distance, pickup_datetime, dropoff_datetime,
    dbo.fnCalculateDistance(pickup_latitude, pickup_longitude, dropoff_latitude, dropoff_longitude) AS direct_distance
FROM nyctaxi_sample
WHERE pickup_longitude != dropoff_longitude AND pickup_latitude != dropoff_latitude AND trip_distance = 0
ORDER BY trip_time_in_secs DESC;
GO
```

| tipped | fare_amount | passenger_count | TripMinutes | trip_distance | pickup_datetime | dropoff_datetime | direct_distance |
|---|---|---|---|---|---|---|---|
| 0 | 2.5 | 1 | 71582 | 0 | 2013-08-04 18:34:51.000 | 2013-08-04 18:46:48.000 | 0.3014894181 |
| 1 | 52.0 | 1 | 88 | 0 | 2013-05-09 11:27:23.000 | 2013-05-09 12:56:12.000 | 12.7244322117 |
| 0 | 52.0 | 2 | 88 | 0 | 2013-05-09 07:38:46.000 | 2013-05-09 09:07:20.000 | 12.9178069408 |
| 1 | 52.0 | 1 | 87 | 0 | 2013-05-10 11:11:03.000 | 2013-05-10 12:38:54.000 | 14.8324096485 |
| 1 | 51.0 | 1 | 87 | 0 | 2013-05-08 08:17:46.000 | 2013-05-08 09:44:50.000 | 6.897777475 |
| 0 | 2.5 | 1 | 83 | 0 | 2013-10-23 10:33:58.000 | 2013-10-23 11:57:01.000 | 1.3541008656 |
| 0 | 52.0 | 1 | 81 | 0 | 2013-04-29 06:40:00.000 | 2013-04-29 08:02:00.000 | 12.443978189 |
| 1 | 52.0 | 1 | 79 | 0 | 2013-06-12 11:19:29.000 | 2013-06-12 12:39:05.000 | 12.8927909112 |
| 0 | 67.5 | 1 | 78 | 0 | 2013-04-25 14:55:00.000 | 2013-04-25 16:14:00.000 | 11.3907642258 |
| 1 | 52.0 | 1 | 78 | 0 | 2013-05-08 07:10:42.000 | 2013-05-08 08:28:46.000 | 14.8623866515 |


Now proceed to the **04 - Train and save a Python model using T-SQL** Jupyter Notebook.