Can't get it working like in Quick-Start #54

Closed
clausnizer-ondics opened this issue Dec 14, 2020 · 6 comments

@clausnizer-ondics
Contributor

clausnizer-ondics commented Dec 14, 2020

  • igel version: 0.3.1 (latest pip)
  • Python version: 3.8.5
  • Operating System: docker on top of Ubuntu 16.04.6 LTS (4.4.0)

Description

I'm very new to ML and don't know what to do with igel, or how.
I followed the Quick-Start demo to get an idea.

This resulted in the following igel.yaml:

dataset:
  preprocess:
    missing_values: mean
    scale:
      method: standard
      target: inputs
  split:
    shuffle: true
    test_size: 0.1
  type: csv
# model definition
model:
    # in the type field, you can write the type of problem you want to solve. Whether regression, classification or clustering
    # Then, provide the algorithm you want to use on the data. Here I'm using the random forest algorithm
    type: classification
    algorithm: RandomForest     # make sure you write the name of the algorithm in pascal case
    arguments:
        n_estimators: 100   # here, I set the number of estimators (or trees) to 100
        max_depth: 30       # set the max_depth of the tree

# target you want to predict
# Here, as an example, I'm using the famous indians-diabetes dataset, where I want to predict whether someone has diabetes or not.
# Depending on your data, you need to provide the target(s) you want to predict here
target:
    - sick

What I Did

... with a big question mark above my head:

$ igel fit -dp 'diabetes.csv' -yml 'igel.yaml' 

         _____          _       _
        |_   _| __ __ _(_)_ __ (_)_ __   __ _
          | || '__/ _` | | '_ \| | '_ \ / _` |
          | || | | (_| | | | | | | | | | (_| |
          |_||_|  \__,_|_|_| |_|_|_| |_|\__, |
                                        |___/
        
INFO - Entered CLI args: {'data_path': 'diabetes.csv', 'yaml_path': 'igel.yaml', 'cmd': 'fit'}
INFO - Executing command: fit ...
INFO - reading data from diabetes.csv
INFO - You passed the configurations as a yaml file.
INFO - your chosen configuration: {'dataset': {'preprocess': {'missing_values': 'mean', 'scale': {'method': 'standard', 'target': 'inputs'}}, 'split': {'shuffle': True, 'test_size': 0.1}, 'type': 'csv'}, 'model': {'type': 'classification', 'algorithm': 'RandomForest', 'arguments': {'n_estimators': 100, 'max_depth': 30}}, 'target': ['sick']}
INFO - dataset_props: {'preprocess': {'missing_values': 'mean', 'scale': {'method': 'standard', 'target': 'inputs'}}, 'split': {'shuffle': True, 'test_size': 0.1}, 'type': 'csv'} 
model_props: {'type': 'classification', 'algorithm': 'RandomForest', 'arguments': {'n_estimators': 100, 'max_depth': 30}} 
 target: ['sick'] 

INFO - dataset shape: (768, 9)
INFO - dataset attributes: ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
INFO - Check for missing values in the dataset ...  
Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64  
 ----------------------------------------------------------------------------------------------------
INFO - shape of the dataset after handling missing values => (768, 9)
ERROR - error occured while preparing the data: ('chosen target(s) to predict must exist in the dataset',)
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/igel/igel.py", line 245, in _process_data
    raise Exception("chosen target(s) to predict must exist in the dataset")
Exception: chosen target(s) to predict must exist in the dataset
Traceback (most recent call last):
  File "/opt/conda/bin/igel", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.8/site-packages/igel/cli.py", line 508, in main
    CLI()
  File "/opt/conda/lib/python3.8/site-packages/igel/cli.py", line 166, in __init__
    getattr(self, self.cmd.command)()
  File "/opt/conda/lib/python3.8/site-packages/igel/cli.py", line 297, in fit
    Igel(**self.dict_args)
  File "/opt/conda/lib/python3.8/site-packages/igel/igel.py", line 102, in __init__
    getattr(self, self.command)()
  File "/opt/conda/lib/python3.8/site-packages/igel/igel.py", line 336, in fit
    x_train, y_train, x_test, y_test = self._prepare_fit_data()
TypeError: cannot unpack non-iterable NoneType object

If I understand correctly, igel wants a column named sick in diabetes.csv. So there is a missing link, and I have no idea how to close it.

Can you provide test data, maybe as part of this repo, so I can get something working?
Or help me find the missing part?

Please help

@red-eyed-tree-frog
Contributor

If I infer correctly, you must first use some heuristic to categorise your data into sick / not sick. Maybe add a column called sick, where 0 = not sick and 1 = sick? As a simplified example, your heuristic could be: if age > x and blood pressure > y, then sick.

@clausnizer-ondics
Contributor Author

Thanks for your help. If this gives me a working example, that would be fine! :-)
I know how to add a column sick to diabetes.csv, but I have no idea where to put the heuristic stuff. ;-)

Can you provide a step-by-step guide on how to do this?

@red-eyed-tree-frog
Contributor

red-eyed-tree-frog commented Dec 14, 2020

Actually, I think the column you are looking for is 'Outcome', not 'sick'. Use that as your target instead.
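A quick way to confirm which target names are usable is to compare them against the CSV header before running igel. The following is only a minimal sketch using Python's standard library. The helper names `read_header` and `missing_targets` are hypothetical (they are not part of igel), and the header string is copied from the "dataset attributes" line in the log above.

```python
import csv
import io


def read_header(csv_file):
    """Read only the first row (the header) of an open CSV file."""
    return next(csv.reader(csv_file))


def missing_targets(columns, targets):
    """Return the targets that are not present among the dataset columns."""
    available = set(columns)
    return [t for t in targets if t not in available]


# Header copied from the "dataset attributes" line in the log above.
sample = io.StringIO(
    "Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,"
    "DiabetesPedigreeFunction,Age,Outcome\n"
)
columns = read_header(sample)

print(missing_targets(columns, ["sick"]))     # ['sick']
print(missing_targets(columns, ["Outcome"]))  # []
```

To run the same check on a real file, `read_header(open("diabetes.csv", newline=""))` can replace the in-memory sample. The comparison in this sketch is case-sensitive, so `outcome` would be reported as missing while `Outcome` would not.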

@nidhaloff
Owner

nidhaloff commented Dec 14, 2020

@Anenizer Hi, the data that I used in the docs can be found in this repo. Please go to the examples folder and check the datasets under the data folder. Or simply click here.

Now, coming to your issue. Notice that there are multiple versions of the famous Indian-diabetes dataset. If you visit Kaggle, you will find many versions of it, each with different attribute/feature names. The one I'm using here has an attribute called sick, which indicates whether a patient is sick or not (0 means not sick and 1 means sick). The trick is that if you are using a dataset with other attribute names, you have to provide the name of the attribute you want to predict in the target field inside the .yaml file. Simply put, if the attribute in your dataset is named, let's say, "patient-status" instead of sick, then you have to provide:

target:
     - patient-status

in your .yaml file. This way igel will recognize that you want to predict the patient-status from your dataset. Hope this was helpful ;)
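An alternative to editing the target field, not suggested in this thread, is to rename the label column in the CSV so that the quick-start igel.yaml works unchanged. Below is a rough sketch using only Python's standard library. The function `rename_column` is a hypothetical helper (not part of igel), and the three-column sample rows are illustrative only, not the full dataset.

```python
import csv
import io


def rename_column(src, dst, old, new):
    """Copy CSV rows from src to dst, renaming one header column."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    header = next(reader)
    if old not in header:
        raise ValueError(f"column {old!r} not found in {header}")
    # Rewrite the header, leaving every other column name untouched.
    writer.writerow([new if name == old else name for name in header])
    # Data rows are copied through unchanged.
    writer.writerows(reader)


# Illustrative in-memory sample; a real run would use file objects instead.
src = io.StringIO("Glucose,Age,Outcome\n148,50,1\n85,31,0\n")
dst = io.StringIO()
rename_column(src, dst, old="Outcome", new="sick")
print(dst.getvalue())
```

With real files, `src` and `dst` would be opened with `open(path, newline="")` for reading and `open(path, "w", newline="")` for writing, using a different output path so the original data is preserved. Changing the target in the .yaml file, as described above, remains the simpler fix.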

@nidhaloff
Owner

@Anenizer Does this answer your question? If not, feel free to re-open the issue, or create a new one if you have other questions.

@clausnizer-ondics
Contributor Author

Yes, I now have a working example, this was my goal. Thank you very much!
