Convert nan to 0 in avg_num_tokens() #2046

Merged on May 21, 2022 (8 commits)

Conversation

hungcs
Contributor

@hungcs hungcs commented May 20, 2022

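avg_num_tokens() should always return an int, but when a column contains only missing values the mean token count is NaN, and round() then raises a ValueError instead of returning a number.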

import pandas as pd
from pandas import Series

titanic_simple_df = pd.DataFrame(
    {
        "PassengerId": [1], "Survived": [0], "Pclass": [3], "Name": ["Braund, Mr. Owen Harris"],
        "Sex": ["male"], "Age": [22.0], "SibSp": [1], "Parch": [0], "Ticket": ["A/5 21171"], "Fare": ["7.25"],
        "Cabin": [None], "Embarked": ["S"], "split": [0]
    }
)

>>> cab = titanic_simple_df["Cabin"]
>>> Series(cab).str.split().str.len()
0    None
dtype: object
>>> Series(cab).str.split().str.len().mean()
nan

get_dataset_info_from_source(table_source).to_dict()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ludwig/automl/base_config.py", line 189, in get_dataset_info_from_source
    avg_words = source.get_avg_num_tokens(field)
  File "/home/ray/src/predibase_engine/sources/dask_util.py", line 46, in get_avg_num_tokens
    return avg_num_tokens(self.sample)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ludwig/automl/utils.py", line 61, in avg_num_tokens
    avg_words = round(Series(unique_entries).str.split().str.len().mean())
ValueError: cannot convert float NaN to integer (type: ErrorResponse, retryable: true)')
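
The failure above can be avoided by checking for NaN before rounding. The sketch below is one hypothetical way to do that, not the code merged in this PR: the name avg_num_tokens_safe and the pd.to_numeric coercion are assumptions made for illustration.

```python
import pandas as pd


def avg_num_tokens_safe(field: pd.Series) -> int:
    """Hypothetical NaN-safe variant; not the merged Ludwig implementation."""
    unique_entries = pd.Series(field.unique())
    # Split each entry on whitespace and count the tokens.
    # Missing entries (None/NaN) stay missing rather than counting as 0.
    token_counts = pd.to_numeric(unique_entries.str.split().str.len())
    mean_tokens = token_counts.mean()
    # The mean is NaN when no entry has a token count,
    # e.g. a column such as "Cabin" that holds only None.
    if pd.isna(mean_tokens):
        return 0
    return round(float(mean_tokens))


cabin = pd.Series([None], dtype=object)
name = pd.Series(["Braund, Mr. Owen Harris"])
print(avg_num_tokens_safe(cabin))
print(avg_num_tokens_safe(name))
```

Returning 0 for an all-missing column keeps the int return type that callers such as get_dataset_info_from_source expect.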

@hungcs hungcs requested a review from tgaddair May 20, 2022 20:32

github-actions bot commented May 20, 2022

Unit Test Results

6 files (±0), 6 suites (±0), duration 29s (-2h 7m 25s)
1 tests (-2 784): 0 passed (-2 750), 0 skipped (-35), 0 failed (±0), 1 errored (+1)
6 runs (-8 349): 0 passed (-8 246), 0 skipped (-109), 0 failed (±0), 6 errored (+6)

For more details on these errors, see this check.

Results for commit bb3a870. ± Comparison against base commit 55b7672.

♻️ This comment has been updated with latest results.

Collaborator

@tgaddair tgaddair left a comment

Can you add a unit test?
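
A unit test for this case might look like the sketch below. It is an illustration only, not the test added in this PR: avg_num_tokens_stub is a hypothetical stand-in for the function under test, and guarding with numpy's nan_to_num is an assumption about one way to do the conversion.

```python
import numpy as np
import pandas as pd


def avg_num_tokens_stub(field: pd.Series) -> int:
    # Hypothetical stand-in for the function under test (not the Ludwig source).
    counts = pd.to_numeric(pd.Series(field.unique()).str.split().str.len())
    # nan_to_num maps a NaN mean to 0.0 so that round() cannot raise.
    return round(float(np.nan_to_num(counts.mean())))


def test_all_missing_column_returns_zero():
    # Mirrors the "Cabin" column from the report: every value is None.
    cabin = pd.Series([None], dtype=object)
    result = avg_num_tokens_stub(cabin)
    assert isinstance(result, int)
    assert result == 0


def test_text_column_counts_whitespace_tokens():
    name = pd.Series(["Braund, Mr. Owen Harris"])
    assert avg_num_tokens_stub(name) == 4
```

Both tests are written in pytest style; in a real suite the stub would be replaced by an import of the function being fixed.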

@hungcs hungcs merged commit f54bf05 into master May 21, 2022
@hungcs hungcs deleted the nan branch May 21, 2022 02:15
2 participants