Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[reliability] Consistent handling of direction_id when NaNs present #90

Merged
merged 5 commits into from
Jul 12, 2018

Conversation

yiyange
Copy link
Collaborator

@yiyange yiyange commented Jul 12, 2018

Fix issue #89

@kuanb kuanb changed the title Drop direction_id column if it contains nan [reliability] Consistent handling of direction_id when NaNs present Jul 12, 2018
Copy link
Owner

@kuanb kuanb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. Leaving one initial comment. Want to make sure this is "upstream" enough to handle all issues with the direction id column.

@@ -165,7 +175,7 @@ def generate_all_observed_edge_costs(trips_and_stop_times: pd.DataFrame
dir_mask = (tst_sub.direction_id == direction)
tst_sub_dir = tst_sub[dir_mask]
else:
tst_sub_dir = tst_sub
tst_sub_dir = tst_sub.copy()
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the removal of the .copy()? Without examining it deeply, shouldn't we defensively employ .copy() to avoid upstream impacts? It's removal does not seem to be relevant to handling the direction_id column being dropped.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ignore this. I am so sorry.

@@ -74,9 +74,19 @@ def generate_route_costs(self, route_id: str):
'departure_time']
trips_and_stop_times = trips_and_stop_times.sort_values(sort_list)

# Check direction_id column value before
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per our discussion offline - the advantage to adding the handling here is that it is within a single route. This means that we do not end up tossing direction id if a specific route happens to have all direction id rows filled in. It is possible that a route operator has direction id for one route but not another.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comments added. thanks!

@codecov
Copy link

codecov bot commented Jul 12, 2018

Codecov Report

Merging #90 into master will increase coverage by 0.26%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #90      +/-   ##
==========================================
+ Coverage   91.64%   91.91%   +0.26%     
==========================================
  Files          12       12              
  Lines         862      866       +4     
==========================================
+ Hits          790      796       +6     
+ Misses         72       70       -2
Impacted Files Coverage Δ
peartree/parallel.py 98.56% <100%> (+1.52%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 488f060...e0b90c3. Read the comment docs.

@kuanb
Copy link
Owner

kuanb commented Jul 12, 2018

@yiyange made some slight changes to the comments (see above commit).

@kuanb
Copy link
Owner

kuanb commented Jul 12, 2018

Final request, @yiyange, please add a test in test_paths.py. Something along the lines of:

def test_loading_in_too_small_timeframes():
    path = fixture('samtrans-2017-11-28.zip')
    feed = get_representative_feed(path)

    # Overwrite the stop times so that direction id is null
    st = feed.stop_times
    st.direction_id = np.nan
    feed.stop_times = st

    # Make sure the behavior is as expected
    start = 7 * 60 * 60
    end = 10 * 60 * 60
    G = load_feed_as_graph(feed, start, end)

    # Add an assertion that stops times are not null

Copy link
Owner

@kuanb kuanb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great, thank you.

# trips_and_stop_times to generate wait and edge costs
# Note: Advantage to adding handling at route level is that peartree
# avoids tossing direction id if a specific route has all direction
# id rows filled in (while another does not, which is possible).
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

evidence for the aforementioned case. this is happening logs where there are non values in direction_id column. it is clear that some routes have full coverage but some do not. therefore we can't move this check further upstream.

	Reduced selected trips on route 295-155 from 12 to 9.
this is happening
	Reduced selected trips on route 296-155 from 45 to 29.
this is happening
	Reduced selected trips on route 297-157 from 8 to 0.
	Reduced selected trips on route 397-157 from 7 to 0.
	Reduced selected trips on route 398-157 from 11 to 6.
this is happening
	Reduced selected trips on route 399-157 from 9 to 0.
	Reduced selected trips on route ECR-155 from 59 to 41.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants