IndexError: indices are out-of-bounds #10

Open
mbeyeler opened this issue May 12, 2020 · 8 comments


@mbeyeler

Hi Nick,

Great package!

I just ran into an IndexError when the DataFrame index values are not from a RangeIndex. I would imagine this to happen quite often if the user passes in training data from a shuffled train-test split.

Code to reproduce the error:

import pandas as pd
import smogn
housing = pd.read_csv('https://raw.githubusercontent.com/nickkunz/smogn/master/data/housing.csv')
smogn.smoter(housing[housing.index > 10], 'SalePrice')

smogn.smoter(housing[housing.index > 10].reset_index(), 'SalePrice') fixes it, but is not necessarily desirable because I would like (need) to preserve the original index.

Best,
Michael
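
A minimal pandas-only sketch of the mismatch described above. It does not call smogn; it only shows that once a DataFrame is filtered, its index labels no longer match row positions, so any code that treats labels as positions can go out of bounds. That smogn does exactly this internally is an assumption, and the tiny `SalePrice` frame is made up for illustration.

```python
import pandas as pd

# Toy stand-in for the housing data (values are made up).
df = pd.DataFrame({"SalePrice": [100, 200, 300, 400, 500]})

# Filtering keeps the original labels: rows labeled 3 and 4 remain,
# but they now sit at positions 0 and 1.
subset = df[df.index > 2]
labels = list(subset.index)

# Label-based lookup still works.
by_label = subset.loc[labels]

# Using the labels as positions fails, because position 3 does not exist.
raised = False
try:
    subset.iloc[labels]
except IndexError:
    raised = True

# Resetting the index realigns labels with positions.
clean = subset.reset_index(drop=True)
print(labels, raised, list(clean.index))
```

The reset realigns labels and positions at the cost of discarding the original labels, which is the trade-off raised in the report.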

@nickkunz
Owner

@mbeyeler Hello, and thank you for raising this issue. It is an important use case, especially when the data has been split into training and test sets. I have made a note to address it in a future build!

@sherryxiaa

sherryxiaa commented May 19, 2020

I get the "IndexError: list assignment index out of range" error, but resetting the index did not solve the issue.

@BrutishGuy

I am having the same issue as @sherryxiaa. It seems to work with the housing dataset provided in the examples, but not with my own DataFrames. I have tried calling reset_index(), but this does not fix it. I can attempt to reproduce the error using the housing dataset so that you can investigate as well.

@jesperbruunhansen

I was having the same error, but after I called df.reset_index(drop=True) the error went away. Does this help you?

Ex:

df_smogn = smogn.smoter(data=df.reset_index(drop=True), y="my_y_col")
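
For comparison, a small sketch of how `reset_index()` and `reset_index(drop=True)` differ in plain pandas. The frame below is made up. Whether the extra column affects smogn's output is an assumption and is not shown here.

```python
import pandas as pd

# Toy frame with a shuffled, non-contiguous index (values are made up).
df = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [10, 20, 30]}, index=[7, 3, 9])

# Without drop=True the old index is kept as a new column named "index".
kept = df.reset_index()

# With drop=True the old index is discarded entirely.
dropped = df.reset_index(drop=True)

print(list(kept.columns))
print(list(dropped.columns))
```

With `drop=True` the old labels do not reach the sampler as an extra numeric column, though they are also lost.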

@kevalshah90

kevalshah90 commented Jan 25, 2022

I am running into this issue as well. My index is RangeIndex(start=0, stop=1857, step=1)

I tried the approaches in this thread but none of them worked for me.

df_ml_smogn = smogn.smoter(

    ## main arguments
    data = df_ml.reset_index(drop=True),  ## pandas dataframe
    y = 'Revenue_per_sqft_month',         ## string ('header name')
    k = 7,                                ## positive integer (k < n)
    samp_method = 'extreme',              ## string ('balance' or 'extreme')
    drop_na_col = True,
    drop_na_row = True,

    ## phi relevance arguments
    replace = True,                       ## sampling with replacement
    rel_thres = 0.75,                     ## positive real number (0 < R < 1)
    rel_method = 'auto',                  ## string ('auto' or 'manual')
    rel_xtrm_type = 'high',               ## string ('low' or 'both' or 'high')
    rel_coef = 2.25                       ## positive real number (0 < R)

)

@mbeyeler @jesperbruunhansen @BrutishGuy @sherryxiaa were you able to fix this issue?

@kevalshah90

@nickkunz any thoughts on this?

@jellis-ventiv

I have encountered two types of indexing errors: an out-of-bounds error, which I was able to fix using reset_index(), and an "index out of range" error, as mentioned in a couple of the posts above. I have not been able to work around the second one. It occurs here:

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\smogn\smoter.py:240, in smoter(data, y, k, pert, samp_method, under_samp, drop_na_col, drop_na_row, replace, rel_thres, rel_method, rel_xtrm_type, rel_coef, rel_ctrl_pts_rg)
234 ## over-sampling
235 if s_perc[i] > 1:
236
237 ## generate synthetic observations in training set
238 ## considered 'minority'
239 ## (see 'over_sampling()' function for details)
--> 240 synth_obs = over_sampling(
241 data = data,
242 index = list(b_index[i].index),
243 perc = s_perc[i],
244 pert = pert,
245 k = k
246 )
248 ## concatenate over-sampling
...
75 ## distance equals 1 for values that are not equal
76 else:
77 dist[i] = 1

IndexError: list assignment index out of range

Any thoughts on how to fix/work around it are appreciated.

@knlpscience

I also encountered the same issue, and upon checking, it seems that the problem lies in the dist_metrics.py file. In the heom_dist function, the dist list is initialized with [None] * d_num, which causes an IndexError: list assignment index out of range if the dataset has more categorical variables than numerical ones. Therefore, when initializing dist, both d_num and d_nom should be considered. For those in a hurry, there is a fix in pull request #48 that someone has made, which could be useful to check out.
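
The sizing problem described above can be sketched in plain Python. This is a simplified, hypothetical stand-in, not the code in dist_metrics.py: the function names, the absolute-difference numeric distance, and the way nominal distances are written into the list are all assumptions made for illustration.

```python
def mixed_distance_buggy(a_num, b_num, a_nom, b_nom):
    """Hypothetical illustration: list sized for numeric features only."""
    d_num, d_nom = len(a_num), len(a_nom)
    dist = [None] * d_num
    for i in range(d_num):
        dist[i] = abs(a_num[i] - b_num[i])
    for i in range(d_nom):
        # Fails with IndexError as soon as i >= d_num.
        dist[i] = 0 if a_nom[i] == b_nom[i] else 1
    return dist


def mixed_distance_fixed(a_num, b_num, a_nom, b_nom):
    """Hypothetical fix: size the list for both feature types."""
    d_num, d_nom = len(a_num), len(a_nom)
    dist = [None] * (d_num + d_nom)
    for i in range(d_num):
        dist[i] = abs(a_num[i] - b_num[i])
    for i in range(d_nom):
        # Nominal distances go after the numeric ones.
        dist[d_num + i] = 0 if a_nom[i] == b_nom[i] else 1
    return dist


# One numeric feature, two nominal features: more nominal than numeric.
try:
    mixed_distance_buggy([1.0], [2.5], ["a", "b"], ["a", "c"])
    buggy_raised = False
except IndexError:
    buggy_raised = True

fixed = mixed_distance_fixed([1.0], [2.5], ["a", "b"], ["a", "c"])
print(buggy_raised, fixed)
```

In this sketch the failure depends only on the ratio of nominal to numeric columns, which would be consistent with the reports above that resetting the index does not help.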
