## Importing Required Libraries and Functions

The following libraries and functions are imported to support various data manipulation and aggregation tasks:
- `SparkSession`: The entry point to programming with Spark.
- `functions` as `F`: Importing PySpark SQL functions with a shorthand to make function calls more concise.
- Specific functions such as `col`, `sum`, `approx_count_distinct`, `hour`, `dayofweek`, `datediff`, `lit`, `unix_timestamp`, and `when`, which are used for column operations, aggregations, and feature engineering. Note that importing `sum` this way shadows Python's built-in `sum` for the rest of the notebook.

In [6]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import col, sum, approx_count_distinct, hour, dayofweek, datediff, lit, unix_timestamp, when

In this step, we initialize the Spark session used for all subsequent processing in this notebook. The configuration enables eager evaluation so that DataFrames render directly in notebook output, caches Parquet metadata to speed up repeated reads, and sets the session time zone to UTC so that timestamps are interpreted consistently.

In [7]:
spark = (
    SparkSession.builder.appName("MAST30034 Project 1")  # Setting the application name for the Spark session
    .config("spark.sql.repl.eagerEval.enabled", True)  # Enabling eager evaluation for interactive querying
    .config("spark.sql.parquet.cacheMetadata", "true")  # Caching Parquet metadata to speed up file reading
    .config("spark.sql.session.timeZone", "Etc/UTC")  # Setting the session's time zone to UTC for consistency
    .getOrCreate()  # Creating the Spark session or retrieving an existing one
)

## Initial Examination of the Dataset

First, we load the Parquet data stored under `data/raw` into a Spark DataFrame and count its records (rows). This gives us an initial sense of the dataset's size.

In [8]:
# Load the preprocessed Parquet file into a Spark DataFrame
sdf = spark.read.parquet('/Users/jennymai/Desktop/data_sci/mast_project1/data/raw')

# Count the number of records (rows) in the DataFrame
sdf.count()

19325003

In [9]:
sdf.limit(10)

                                                                                

VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee
1,2022-12-01 00:37:35,2022-12-01 00:47:35,1.0,2.0,1.0,N,170,237,1,8.5,3.0,0.5,3.1,0.0,0.3,15.4,2.5,0.0
1,2022-12-01 00:34:35,2022-12-01 00:55:21,0.0,8.4,1.0,N,138,141,2,26.0,4.25,0.5,0.0,0.0,0.3,31.05,2.5,1.25
1,2022-12-01 00:33:26,2022-12-01 00:37:34,1.0,0.8,1.0,N,140,140,1,5.0,3.0,0.5,1.76,0.0,0.3,10.56,2.5,0.0
1,2022-12-01 00:45:51,2022-12-01 00:53:16,1.0,3.0,1.0,N,141,79,3,10.0,3.0,0.5,0.0,0.0,0.3,13.8,2.5,0.0
2,2022-12-01 00:49:49,2022-12-01 00:54:13,1.0,0.76,1.0,N,261,231,1,5.0,0.5,0.5,1.76,0.0,0.3,10.56,2.5,0.0
1,2022-12-01 00:25:25,2022-12-01 00:35:38,2.0,2.6,1.0,N,237,164,1,10.5,3.0,0.5,4.25,0.0,0.3,18.55,2.5,0.0
2,2022-12-01 00:05:37,2022-12-01 00:10:48,1.0,0.94,1.0,N,79,144,1,5.5,0.5,0.5,1.86,0.0,0.3,11.16,2.5,0.0
2,2022-12-01 00:20:12,2022-12-01 00:28:49,1.0,2.09,1.0,N,79,186,1,9.0,0.5,0.5,2.0,0.0,0.3,14.8,2.5,0.0
1,2022-12-01 00:00:54,2022-12-01 00:05:41,1.0,0.8,1.0,N,142,143,1,5.5,3.0,0.5,1.85,0.0,0.3,11.15,2.5,0.0
2,2022-12-01 00:11:23,2022-12-01 00:30:00,1.0,7.62,1.0,N,138,255,1,24.0,0.5,0.5,5.31,0.0,0.3,31.86,0.0,1.25


### Descriptive Statistics of the DataFrame

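We generate descriptive statistics for the columns in the DataFrame: count, mean, standard deviation, min, and max. `describe()` returns one row per statistic (five rows in total), so the `.limit(25)` call does not truncate anything here. The two timestamp columns do not appear in the summary. The extreme minimum and maximum values (for example a `fare_amount` minimum of -133391414.0 and a `trip_distance` maximum of 335004.33) already point to invalid records that need filtering.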

In [10]:
# Generate descriptive statistics for the DataFrame, limiting the output to the first 25 rows
sdf.describe().limit(25)


summary,VendorID,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee
count,19325003.0,18749713.0,19325003.0,18749713.0,18749713,19325003.0,19325003.0,19325003.0,19325003.0,19325003.0,19325003.0,19325003.0,19325003.0,19325003.0,19325003.0,18749713.0,18749713.0
mean,1.7366218261389144,1.3775969797511034,4.4731091617420455,1.511516416277945,,165.81878698802788,163.8886903665681,1.1913664903441412,8.619574211193223,1.37918114734575,0.4874071817738096,12.481317347272658,0.5587781290405625,0.786410356575844,26.085003642540244,2.2763282696647145,0.110280394691908
stddev,0.4503651288017578,0.9115429260790552,363.4348135144023,6.521382764229247,,64.37805416509637,69.91840539145225,0.5403365296476417,31985.042259392096,1.6916961295284452,0.1026072723928135,31985.02206297459,2.104686487247564,0.3492941909534933,21.746516925703688,0.7712032427241626,0.37471728113382
min,1.0,0.0,0.0,1.0,N,1.0,1.0,0.0,-133391414.0,-7.5,-0.5,-110.0,-73.3,-1.0,-1635.8,-2.5,-1.75
max,6.0,9.0,335004.33,99.0,Y,265.0,265.0,5.0,5901.74,96.38,53.16,133391363.53,655.55,1.0,5902.54,2.75,1.75


### Calculating Approximate Median Values for Selected Columns

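We calculate approximate medians for a selected list of numeric columns. The median is a more robust measure of central tendency than the mean for skewed distributions such as the fare columns summarized above. We use Spark's `approxQuantile` with a relative error of 0.01, which avoids an exact sort: the value returned is an actual observation whose rank lies within 1% of the row count from the true median rank.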

In [11]:
# List of columns for which we want to calculate the approximate median
column_list = ['trip_distance', 'fare_amount', 'extra', 'mta_tax', 'tolls_amount', 'improvement_surcharge', 'total_amount', 'congestion_surcharge', 'airport_fee', 'passenger_count']

# Calculate and print the approximate median for each column in the list
for column in column_list:
    median_approx = sdf.approxQuantile(column, [0.5], 0.01)[0]  # Approximate median calculation
    print(f"The approximate median value of column '{column}' is: {median_approx}")

The approximate median value of column 'trip_distance' is: 1.8
The approximate median value of column 'fare_amount' is: 12.8
The approximate median value of column 'extra' is: 1.0
The approximate median value of column 'mta_tax' is: 0.5
The approximate median value of column 'tolls_amount' is: 0.0
The approximate median value of column 'improvement_surcharge' is: 1.0
The approximate median value of column 'total_amount' is: 19.32
The approximate median value of column 'congestion_surcharge' is: 2.5
The approximate median value of column 'airport_fee' is: 0.0
The approximate median value of column 'passenger_count' is: 1.0
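
To illustrate why the median is the safer summary for these columns, here is a minimal sketch using Python's standard `statistics` module rather than Spark. Only the extreme value (the `fare_amount` minimum reported by `describe()`) comes from the dataset; the other fares are made up for illustration.

```python
import statistics

# Illustrative fares; only the extreme value is taken from the describe() output.
fares = [5.0, 8.5, 10.0, 12.8, 26.0, -133391414.0]

mean_fare = statistics.mean(fares)      # dragged far below zero by one bad record
median_fare = statistics.median(fares)  # middle of the sorted values, barely affected

print(f"mean:   {mean_fare:.2f}")
print(f"median: {median_fare:.2f}")
```

A single invalid record is enough to make the mean meaningless, while the median stays within the range of ordinary fares.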


                                                                                

### Checking for Missing Values in the DataFrame

We check for missing (null) values in each column of the DataFrame. This step is crucial for data cleaning, as missing values can impact the accuracy of models and analyses. The output shows the count of null values for each column.

In [12]:
# Select columns and count the number of null values in each
sdf.select([sum(col(c).isNull().cast("int")).alias(c) for c in sdf.columns]).show()



+--------+--------------------+---------------------+---------------+-------------+----------+------------------+------------+------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+--------------------+-----------+
|VendorID|tpep_pickup_datetime|tpep_dropoff_datetime|passenger_count|trip_distance|RatecodeID|store_and_fwd_flag|PULocationID|DOLocationID|payment_type|fare_amount|extra|mta_tax|tip_amount|tolls_amount|improvement_surcharge|total_amount|congestion_surcharge|airport_fee|
+--------+--------------------+---------------------+---------------+-------------+----------+------------------+------------+------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+--------------------+-----------+
|       0|                   0|                    0|         575290|            0|    575290|            575290|           0|           0|           0|          0|    0|      0|         

                                                                                

### Dropping Rows with Missing Values

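We remove any rows that contain missing values. This drops 575,290 of the 19,325,003 rows (about 3%), consistent with the non-null counts reported by `describe()` above. It is a straightforward way to handle missing data and ensures that subsequent analyses and models work with complete cases.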

In [13]:
sdf = sdf.dropna()
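
The two steps above (counting nulls per column, then dropping incomplete rows) can be sketched in plain Python to make the logic explicit. This is an illustration under stated assumptions, not the Spark implementation: the rows are hypothetical, `None` stands in for a Spark null, and `sum` here is Python's built-in rather than the PySpark function imported in the notebook.

```python
# Hypothetical rows; None plays the role of a Spark null.
rows = [
    {"passenger_count": 1.0, "RatecodeID": 1.0, "fare_amount": 8.5},
    {"passenger_count": None, "RatecodeID": None, "fare_amount": 26.0},
    {"passenger_count": 2.0, "RatecodeID": 1.0, "fare_amount": 10.5},
]
columns = list(rows[0])

# Per-column null count: add 1 for every row where the value is missing.
null_counts = {c: sum(int(r[c] is None) for r in rows) for c in columns}

# dropna() with default arguments keeps only rows with no missing value at all.
complete_rows = [r for r in rows if all(r[c] is not None for c in columns)]

print(null_counts)
print(len(rows) - len(complete_rows), "row(s) dropped")
```

Because a row is dropped as soon as any column is null, the number of rows removed can be smaller than the sum of the per-column null counts when the nulls occur in the same rows.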

### Counting Approximate Unique Values in Each Column

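We compute the approximate number of unique values in each column. This shows the cardinality of each column, which can inform feature selection or further processing (for example, which columns to treat as categorical). `approx_count_distinct` trades exactness for speed: with its default setting the estimate allows a maximum relative standard deviation of 5%, so the large counts for the timestamp columns should be read as estimates.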

In [14]:
print('Approximate unique values in each column')
for column in sdf.columns:
    distinct_count = sdf.select(approx_count_distinct(column)).collect()[0][0]
    print(f"{column}: {distinct_count}")

Approximate unique values in each column
VendorID: 2
tpep_pickup_datetime: 9679197
tpep_dropoff_datetime: 9788423
passenger_count: 10
trip_distance: 6446
RatecodeID: 7
store_and_fwd_flag: 2
PULocationID: 264
DOLocationID: 262
payment_type: 5
fare_amount: 6385
extra: 121
mta_tax: 19
tip_amount: 6016
tolls_amount: 1805
improvement_surcharge: 5
total_amount: 24995
congestion_surcharge: 7
airport_fee: 5


                                                                                

### Grouping by Variables and Counting Occurrences

We group the DataFrame separately by `passenger_count`, `mta_tax`, and `RatecodeID` and count the rows in each group. Each result is ordered by count in descending order and limited to the top 25 groups.

In [15]:
sdf.groupBy('passenger_count').count().orderBy('count', ascending=False).limit(25)

                                                                                

passenger_count,count
1.0,14031997
2.0,2848258
3.0,718882
4.0,377282
0.0,321354
5.0,276668
6.0,175111
8.0,72
7.0,61
9.0,28


In [16]:
sdf.groupBy('mta_tax').count().orderBy('count', ascending=False).limit(25)

                                                                                

mta_tax,count
0.5,18405655
0.0,177194
-0.5,157022
0.8,9203
1.3,404
1.5,125
0.3,61
4.0,18
3.3,7
3.0,5


In [17]:
sdf.groupBy('RatecodeID').count().orderBy('count', ascending=False).limit(25)

                                                                                

RatecodeID,count
1.0,17710228
2.0,747019
5.0,121197
99.0,83207
3.0,59635
4.0,28356
6.0,71


### Calculating Quantiles for Fare Amount

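We calculate the 50th, 75th, 90th, 95th, and 99th percentiles of the `fare_amount` column using `approxQuantile` with a relative error of 0.01. These quantiles summarize the distribution of fare amounts. With this tolerance, the value returned for the 99th percentile may have any rank from the 98th percentile up to the maximum, which likely explains why the 99th percentile below equals the column maximum of 5901.74 reported by `describe()`. A smaller relative error would be needed for a meaningful tail estimate.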

In [18]:
fare_amount_quantiles = sdf.approxQuantile("fare_amount", [0.5, 0.75, 0.9, 0.95, 0.99], 0.01)
print(f"50th percentile (median): {fare_amount_quantiles[0]}")
print(f"75th percentile: {fare_amount_quantiles[1]}")
print(f"90th percentile: {fare_amount_quantiles[2]}")
print(f"95th percentile: {fare_amount_quantiles[3]}")
print(f"99th percentile: {fare_amount_quantiles[4]}")



50th percentile (median): 12.5
75th percentile: 19.8
90th percentile: 38.0
95th percentile: 54.1
99th percentile: 5901.74
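
The effect of the relative error on the upper percentiles can be checked with a short calculation. The sketch below assumes the rank guarantee documented for `approxQuantile`: for `N` rows, probability `p`, and error `err`, the returned value has a rank between `floor((p - err) * N)` and `ceil((p + err) * N)`. The function name `rank_window` is hypothetical, and the row count is the sum of the group counts shown earlier (18,749,713).

```python
import math


def rank_window(n_rows: int, p: float, rel_err: float) -> tuple[int, int]:
    """Range of ranks an approximate quantile may come from (clamped to [0, n_rows])."""
    lowest = max(0, math.floor((p - rel_err) * n_rows))
    highest = min(n_rows, math.ceil((p + rel_err) * n_rows))
    return lowest, highest


n_rows = 18_749_713  # rows remaining after dropna(), per the group counts above

for p in (0.5, 0.95, 0.99):
    lowest, highest = rank_window(n_rows, p, rel_err=0.01)
    note = " (window includes the maximum)" if highest == n_rows else ""
    print(f"p={p}: rank between {lowest} and {highest}{note}")
```

For `p = 0.99` and `rel_err = 0.01` the window reaches the last rank, so returning the column maximum is within the guarantee.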


                                                                                

### Calculating Quantiles for Trip Distance

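We calculate the 50th, 75th, 90th, 95th, and 99th percentiles of the `trip_distance` column, again with a relative error of 0.01. These quantiles summarize the distribution of trip distances. The reported 99th percentile of 103319.46 miles, far above the 95th percentile of 13.93, is likely an extreme outlier picked up within the error tolerance rather than a reliable tail estimate.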

In [19]:
trip_distance_quantiles = sdf.approxQuantile("trip_distance", [0.5, 0.75, 0.9, 0.95, 0.99], 0.01)
print(f"50th percentile (median): {trip_distance_quantiles[0]}")
print(f"75th percentile: {trip_distance_quantiles[1]}")
print(f"90th percentile: {trip_distance_quantiles[2]}")
print(f"95th percentile: {trip_distance_quantiles[3]}")
print(f"99th percentile: {trip_distance_quantiles[4]}")



50th percentile (median): 1.8
75th percentile: 3.34
90th percentile: 8.7
95th percentile: 13.93
99th percentile: 103319.46


                                                                                

## Data Filtering and Feature Engineering

This part performs a series of filtering operations and feature engineering steps on the DataFrame. The goal is to clean the dataset by removing outliers and irrelevant data points, as well as to create new features that could be useful for subsequent analysis or modeling.

#### Filtering Operations:

1. **RatecodeID Filtering**:
   - Retains only records where `RatecodeID` is between 1 and 6 (inclusive). This helps to focus on valid trip records with known rate codes.

2. **MTA Tax Filtering**:
   - Retains records where the `mta_tax` is either 0 or 0.5, ensuring only valid and common MTA tax values are included.

3. **Improvement Surcharge Filtering**:
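   - Retains records where `improvement_surcharge` is either 0.0 or 0.3. Note that the approximate median computed earlier for this column is 1.0, so roughly half of the records or more carry a value of 1.0 and are discarded by this filter.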

4. **Airport Fee Filtering**:
   - Retains records where `airport_fee` is either 0.0 or 1.25, which are the standard charges for trips to/from airports.

5. **Store and Forward Flag Conversion**:
   - Converts the `store_and_fwd_flag` column from categorical values ('Y', 'N') to binary (1, 0) for easier analysis. Here, 'Y' (Yes) is converted to 1, and 'N' (No) is converted to 0.

6. **Passenger Count Filtering**:
   - Filters records to retain only those with `passenger_count` between 1 and 6, which are typical and valid passenger counts for taxi rides.

7. **Trip Distance Filtering**:
   - Filters trips where `trip_distance` is greater than 0.2 miles but less than or equal to 75 miles, removing trips that are either too short or suspiciously long.

8. **Fare Amount Filtering**:
   - Retains records where `fare_amount` is greater than $3.00 and at most $200.00, focusing on reasonable fare amounts that correspond to typical taxi rides.

9. **Trip Duration Calculation and Filtering**:
   - Creates a new column `trip_duration_mins` by calculating the trip duration in minutes from `tpep_pickup_datetime` and `tpep_dropoff_datetime`.
   - Filters records where `trip_duration_mins` is greater than 1 minute and less than or equal to 180 minutes, removing extremely short or excessively long trips.
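
The filtering rules above can be summarized as a single row-level predicate. The sketch below is plain Python for illustration only (the notebook applies the same conditions with Spark `filter` calls); the function name `keep_trip` is hypothetical, and the two test rows are taken from the sample output shown earlier, with `trip_duration_mins` computed by hand from the timestamps.

```python
def keep_trip(row: dict) -> bool:
    """Return True if a trip passes every filtering rule listed above."""
    return (
        1 <= row["RatecodeID"] <= 6
        and row["mta_tax"] in (0.0, 0.5)
        and row["improvement_surcharge"] in (0.0, 0.3)
        and row["airport_fee"] in (0.0, 1.25)
        and 1 <= row["passenger_count"] <= 6
        and 0.2 < row["trip_distance"] <= 75.0
        and 3.0 < row["fare_amount"] <= 200.0
        and 1.0 < row["trip_duration_mins"] <= 180.0
    )


# First row of the sample output: a 10-minute, 2-mile trip with one passenger.
first_trip = {
    "RatecodeID": 1.0, "mta_tax": 0.5, "improvement_surcharge": 0.3,
    "airport_fee": 0.0, "passenger_count": 1.0, "trip_distance": 2.0,
    "fare_amount": 8.5, "trip_duration_mins": 10.0,
}

# Second row of the sample output: valid except for a passenger count of 0.
zero_passenger_trip = {
    "RatecodeID": 1.0, "mta_tax": 0.5, "improvement_surcharge": 0.3,
    "airport_fee": 1.25, "passenger_count": 0.0, "trip_distance": 8.4,
    "fare_amount": 26.0, "trip_duration_mins": (20 * 60 + 46) / 60,
}

print(keep_trip(first_trip), keep_trip(zero_passenger_trip))
```

Writing the rules as one predicate makes the strict and inclusive bounds easy to audit in one place.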

#### Feature Engineering:

1. **Datetime Casting**:
   - Converts `tpep_pickup_datetime` and `tpep_dropoff_datetime` columns to timestamp data types for accurate date and time calculations.

2. **Hour and Day of Week Extraction**:
   - Creates new columns `pickup_hour` and `dropoff_hour` to extract the hour of the day from the pickup and dropoff timestamps.
   - Creates new columns `pickup_dayofweek` and `dropoff_dayofweek` to extract the day of the week from the pickup and dropoff timestamps. Spark's `dayofweek` returns 1 for Sunday through 7 for Saturday, so a Thursday is encoded as 5.

3. **Days Since Reference Date**:
   - Creates a new column `days_since_2022_11_01` to calculate the number of days since November 1, 2022, for each trip. This feature can help capture trends or seasonality effects over time.

4. **Distance-Time Interaction Term**:
   - Creates a new interaction feature `distance_time_interaction` by multiplying `trip_distance` by `pickup_hour`. This feature might capture interactions between the time of day and the distance traveled. Because `pickup_hour` ranges from 0 to 23, every pickup between midnight and 1 a.m. gets an interaction value of 0 regardless of distance.

5. **Final Filtering**:
   - Further filters records to retain only those where `days_since_2022_11_01` is between 0 and 180 days, focusing the analysis on a specific time window.
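
The derived features above can be reproduced for a single trip with Python's standard `datetime` module. This is a sketch for checking the definitions, not the Spark code; the function name `derive_features` is hypothetical, and the conversion `isoweekday() % 7 + 1` is used to mimic Spark's Sunday = 1 convention.

```python
from datetime import date, datetime

REFERENCE_DATE = date(2022, 11, 1)


def spark_dayofweek(ts: datetime) -> int:
    """Map Python's Monday=1..Sunday=7 to Spark's Sunday=1..Saturday=7."""
    return ts.isoweekday() % 7 + 1


def derive_features(pickup: datetime, dropoff: datetime, trip_distance: float) -> dict:
    """Compute the engineered columns for one trip."""
    return {
        "trip_duration_mins": (dropoff - pickup).total_seconds() / 60,
        "pickup_hour": pickup.hour,
        "pickup_dayofweek": spark_dayofweek(pickup),
        "dropoff_hour": dropoff.hour,
        "dropoff_dayofweek": spark_dayofweek(dropoff),
        "days_since_2022_11_01": (pickup.date() - REFERENCE_DATE).days,
        "distance_time_interaction": trip_distance * pickup.hour,
    }


# First trip from the sample output: 2022-12-01 00:37:35 to 00:47:35, 2.0 miles.
features = derive_features(
    datetime(2022, 12, 1, 0, 37, 35),
    datetime(2022, 12, 1, 0, 47, 35),
    trip_distance=2.0,
)
print(features)
```

For this trip the values agree with the first row of the processed output: a duration of 10.0 minutes, hour 0, day of week 5, 30 days since the reference date, and an interaction term of 0.0.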

#### Preview:
- The final step displays the first 25 rows of the filtered and enhanced DataFrame to give an overview of the resulting data structure and the new features created. `limit(25)` takes the first rows it finds rather than a random sample.

This detailed data preparation ensures that the dataset is clean, relevant, and enriched with new features, which are crucial for any downstream analysis or predictive modeling tasks.

In [20]:
# Filter the DataFrame to retain records where RatecodeID is between 1 and 6
df_filtered = sdf.filter((col("RatecodeID") >= 1) & (col("RatecodeID") <= 6))

# Further filter to retain records where MTA tax is either 0 or 0.5
df_filtered = df_filtered.filter((col("mta_tax") == 0) | (col("mta_tax") == 0.5))

# Filter to retain records where the improvement surcharge is either 0.0 or 0.3
df_filtered = df_filtered.filter((col("improvement_surcharge") == 0.0) | (col("improvement_surcharge") == 0.3))

# Filter to retain records where airport fee is either 0.0 or 1.25
df_filtered = df_filtered.filter((col("airport_fee") == 0.0) | (col("airport_fee") == 1.25))

# Convert the store_and_fwd_flag column to binary: 'Y' becomes 1, 'N' becomes 0
df_filtered = df_filtered.withColumn("store_and_fwd_flag", when(col("store_and_fwd_flag") == 'Y', 1).otherwise(0))

# Filter to retain records with a valid passenger count between 1 and 6
df_filtered = df_filtered.filter((col("passenger_count") >= 1.0) & (col("passenger_count") <= 6.0))

# Filter to retain records where trip distance is greater than 0.2 miles but less than or equal to 75 miles
df_filtered = df_filtered.filter((col("trip_distance") > 0.2) & (col("trip_distance") <= 75.0))

# Filter to retain records where fare amount is greater than $3.00 but less than or equal to $200.00
df_filtered = df_filtered.filter((col("fare_amount") > 3.0) & (col("fare_amount") <= 200.0))

# Calculate the trip duration in minutes and add it as a new column
df_filtered = df_filtered.withColumn("trip_duration_mins",
                                     (unix_timestamp(col("tpep_dropoff_datetime")) - unix_timestamp(col("tpep_pickup_datetime"))) / 60)

# Filter to retain records where trip duration is greater than 1 minute but less than or equal to 180 minutes
df_filtered = df_filtered.filter((col("trip_duration_mins") > 1.0) & (col("trip_duration_mins") <= 180))

# Convert the pickup and dropoff datetime columns to timestamp data types
df_filtered = df_filtered.withColumn("tpep_pickup_datetime", col("tpep_pickup_datetime").cast("timestamp")) \
                         .withColumn("tpep_dropoff_datetime", col("tpep_dropoff_datetime").cast("timestamp"))

# Extract hour of day and day of the week from the pickup and dropoff timestamps, add as new columns
df_filtered = df_filtered.withColumn("pickup_hour", hour(col("tpep_pickup_datetime"))) \
                         .withColumn("pickup_dayofweek", dayofweek(col("tpep_pickup_datetime"))) \
                         .withColumn("dropoff_hour", hour(col("tpep_dropoff_datetime"))) \
                         .withColumn("dropoff_dayofweek", dayofweek(col("tpep_dropoff_datetime"))) \
                         .withColumn("days_since_2022_11_01", datediff(col("tpep_pickup_datetime"), lit("2022-11-01")))

# Filter to retain records where the trip occurred between 0 and 180 days since November 1, 2022
df_filtered = df_filtered.filter((col("days_since_2022_11_01") >= 0.0) & (col("days_since_2022_11_01") <= 180))

# Create an interaction term between trip distance and pickup hour, and add it as a new column
df_filtered = df_filtered.withColumn("distance_time_interaction", col("trip_distance") * col("pickup_hour"))

# Display the first 25 rows of the filtered and processed DataFrame
df_filtered.limit(25)

                                                                                

VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee,trip_duration_mins,pickup_hour,pickup_dayofweek,dropoff_hour,dropoff_dayofweek,days_since_2022_11_01,distance_time_interaction
1,2022-12-01 00:37:35,2022-12-01 00:47:35,1.0,2.0,1.0,0,170,237,1,8.5,3.0,0.5,3.1,0.0,0.3,15.4,2.5,0.0,10.0,0,5,0,5,30,0.0
1,2022-12-01 00:33:26,2022-12-01 00:37:34,1.0,0.8,1.0,0,140,140,1,5.0,3.0,0.5,1.76,0.0,0.3,10.56,2.5,0.0,4.133333333333334,0,5,0,5,30,0.0
1,2022-12-01 00:45:51,2022-12-01 00:53:16,1.0,3.0,1.0,0,141,79,3,10.0,3.0,0.5,0.0,0.0,0.3,13.8,2.5,0.0,7.416666666666667,0,5,0,5,30,0.0
2,2022-12-01 00:49:49,2022-12-01 00:54:13,1.0,0.76,1.0,0,261,231,1,5.0,0.5,0.5,1.76,0.0,0.3,10.56,2.5,0.0,4.4,0,5,0,5,30,0.0
1,2022-12-01 00:25:25,2022-12-01 00:35:38,2.0,2.6,1.0,0,237,164,1,10.5,3.0,0.5,4.25,0.0,0.3,18.55,2.5,0.0,10.216666666666669,0,5,0,5,30,0.0
2,2022-12-01 00:05:37,2022-12-01 00:10:48,1.0,0.94,1.0,0,79,144,1,5.5,0.5,0.5,1.86,0.0,0.3,11.16,2.5,0.0,5.183333333333334,0,5,0,5,30,0.0
2,2022-12-01 00:20:12,2022-12-01 00:28:49,1.0,2.09,1.0,0,79,186,1,9.0,0.5,0.5,2.0,0.0,0.3,14.8,2.5,0.0,8.616666666666667,0,5,0,5,30,0.0
1,2022-12-01 00:00:54,2022-12-01 00:05:41,1.0,0.8,1.0,0,142,143,1,5.5,3.0,0.5,1.85,0.0,0.3,11.15,2.5,0.0,4.783333333333333,0,5,0,5,30,0.0
2,2022-12-01 00:11:23,2022-12-01 00:30:00,1.0,7.62,1.0,0,138,255,1,24.0,0.5,0.5,5.31,0.0,0.3,31.86,0.0,1.25,18.616666666666667,0,5,0,5,30,0.0
1,2022-12-01 00:14:29,2022-12-01 00:30:10,2.0,3.1,1.0,0,234,143,1,13.0,3.0,0.5,1.68,0.0,0.3,18.48,2.5,0.0,15.683333333333334,0,5,0,5,30,0.0


We calculate and display the percentage of data retained after filtering, to understand how much of the dataset is still available for analysis. Because `sdf` was reassigned by `dropna()` earlier, the denominator is the row count after dropping incomplete rows, not the raw row count of 19,325,003. Each `count()` call also triggers a full Spark job.

In [21]:
# Calculate and display the percentage of the dataset that remains after filtering
print('Cleaned dataset contains ', str(df_filtered.count() / sdf.count() * 100)[:6] + '%', ' of the initial dataset.')



Cleaned dataset contains  26.753%  of the initial dataset.
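
The percentage above is produced by slicing the string form of a float, which works for this value but is fragile in general. A minimal alternative, assuming the two row counts are already available as integers, is to format the ratio explicitly. The function name `percent_retained` and the counts used below are hypothetical.

```python
def percent_retained(kept_rows: int, total_rows: int) -> float:
    """Share of rows kept, as a percentage."""
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    return kept_rows / total_rows * 100


# Hypothetical counts, not the notebook's actual values.
pct = percent_retained(kept_rows=1, total_rows=4)
print(f"Cleaned dataset contains {pct:.3f}% of the rows that remained after dropna().")

# Slicing str(float) is fragile: very small values switch to scientific notation.
print(str(0.00005)[:6])  # slice of the scientific-notation string
print(f"{0.00005:.3f}")  # always three decimal places
```

Explicit formatting rounds to a fixed number of decimals, whereas slicing truncates and yields a varying number of decimals depending on the magnitude of the value.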


                                                                                

The cleaned and filtered DataFrame is saved to disk in Parquet format. This step ensures that the processed data is stored in a highly efficient, columnar storage format, making it ready for further analysis or modeling.

In [22]:
# Save the cleaned DataFrame to a Parquet file, overwriting any existing file in the target directory
df_filtered.write.parquet('/Users/jennymai/Desktop/data_sci/mast_project1/data/curated', mode='overwrite')

                                                                                
