Using Creme on non scikit learn dataset. #170
Comments
Hey there @andrewczgithub, You don't have to apologize for asking a question :). The idea is that In your case, you have to reshape the data a little bit. What you really want is:
X = [
    {
        'timestamp': Timestamp('2012-01-10 00:00:00'),
        'volume_change_ratio': -0.344719768623466,
        'name': 'AAPL',
    },
    {
        'timestamp': Timestamp('2012-01-10 00:00:00'),
        'volume_change_ratio': 0.20302817925763325,
        'name': 'CSCO',
    },
    # etcetera
]
Y = [
    -0.0013231888852133222,
    0.005841741901221553
]
for x, y in zip(X, Y):
    # do your thing
Now naturally the Tell me if something is not clear.
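For reference, the reshaping described above can be sketched in plain Python. This assumes the tuple-keyed layout from the original question, where features are keyed by (timestamp, feature name, symbol) and targets by (timestamp, symbol). The helper name `reshape` is made up for illustration, and `datetime` stands in for pandas' `Timestamp` to keep the example dependency-free.

```python
from collections import defaultdict
from datetime import datetime


def reshape(flat_features, flat_targets):
    """Turn tuple-keyed dicts into a list of feature dicts and a list of targets.

    `reshape` is a hypothetical helper, not part of creme.
    """
    rows = defaultdict(dict)
    # Group every feature value under its (timestamp, symbol) pair.
    for (timestamp, feature_name, symbol), value in flat_features.items():
        rows[(timestamp, symbol)][feature_name] = value

    X, Y = [], []
    # Keep only the observations that have both features and a target.
    for (timestamp, symbol), target in flat_targets.items():
        if (timestamp, symbol) not in rows:
            continue
        x = {'timestamp': timestamp, 'name': symbol, **rows[(timestamp, symbol)]}
        X.append(x)
        Y.append(target)
    return X, Y


day = datetime(2012, 1, 10)
flat_X = {
    (day, 'volume_change_ratio', 'AAPL'): -0.344719768623466,
    (day, 'volume_change_ratio', 'CSCO'): 0.20302817925763325,
}
flat_Y = {
    (day, 'AAPL'): -0.0013231888852133222,
    (day, 'CSCO'): 0.005841741901221553,
    (day, 'MSFT'): -0.014727011494252928,  # no features above, so it is skipped
}

X, Y = reshape(flat_X, flat_Y)
for x, y in zip(X, Y):
    print(x['name'], x['volume_change_ratio'], y)
```

Each element of `X` is then one observation as a dict, paired by position with its target in `Y`, which is the shape the loop above expects.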
@andrewczgithub did you manage to solve your issue?
Hi @MaxHalford !! Yes!!
You can do You can also use a
from creme import *
model = (
    compose.Blacklister('timestamp', 'name') |
    preprocessing.StandardScaler() |
    linear_model.LinearRegression()
)
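To make the pipeline stages less opaque, here is a rough plain-Python sketch of what the first two conceptually do: drop the non-numeric keys, then standardize each remaining feature using running statistics. The names `drop_keys` and `RunningScaler` are invented for illustration and this is not creme's actual implementation. It assumes every observation carries the same numeric features, uses Welford's algorithm for the running variance, and the two input values are made up.

```python
import math


def drop_keys(x, *keys):
    """Return a copy of `x` without the given keys (hypothetical helper)."""
    return {k: v for k, v in x.items() if k not in keys}


class RunningScaler:
    """Standardize features one observation at a time (illustrative only)."""

    def __init__(self):
        self.n = 0
        self.mean = {}
        self.m2 = {}  # running sum of squared deviations, per feature

    def fit_one(self, x):
        # Welford update: assumes every feature is present in every observation.
        self.n += 1
        for name, value in x.items():
            mean = self.mean.get(name, 0.0)
            delta = value - mean
            mean += delta / self.n
            self.mean[name] = mean
            self.m2[name] = self.m2.get(name, 0.0) + delta * (value - mean)
        return self

    def transform_one(self, x):
        scaled = {}
        for name, value in x.items():
            variance = self.m2.get(name, 0.0) / self.n if self.n else 0.0
            std = math.sqrt(variance)
            # Fall back to 0 while the variance is still zero.
            scaled[name] = (value - self.mean.get(name, 0.0)) / std if std > 0 else 0.0
        return scaled


scaler = RunningScaler()
observations = [
    {'timestamp': '2012-01-10', 'name': 'AAPL', 'volume_change_ratio': 1.0},
    {'timestamp': '2012-01-10', 'name': 'CSCO', 'volume_change_ratio': 3.0},
]
for raw in observations:
    x = drop_keys(raw, 'timestamp', 'name')
    x = scaler.fit_one(x).transform_one(x)
    print(x)
```

The scaled dict would then be handed to the regression step, so the model never sees the timestamp or the ticker name.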
I tried the code below
except I got 0 as the error metric
If you're doing regression you have to use a
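A quick way to see why a classification metric can report 0 on this kind of data: accuracy only counts exact matches, and a continuous prediction will almost never equal a continuous target exactly. An error metric such as the mean absolute error (the `metrics.MAE()` that appears later in this thread) measures how far off each prediction is instead. The sketch below is plain Python with made-up targets and predictions; the two running-metric classes are illustrative stand-ins, not creme's.

```python
class RunningAccuracy:
    """Fraction of predictions that exactly equal the target."""

    def __init__(self):
        self.n = 0
        self.correct = 0

    def update(self, y_true, y_pred):
        self.n += 1
        self.correct += int(y_true == y_pred)
        return self

    def get(self):
        return self.correct / self.n if self.n else 0.0


class RunningMAE:
    """Mean absolute difference between predictions and targets."""

    def __init__(self):
        self.n = 0
        self.total = 0.0

    def update(self, y_true, y_pred):
        self.n += 1
        self.total += abs(y_true - y_pred)
        return self

    def get(self):
        return self.total / self.n if self.n else 0.0


targets = [-0.0013, 0.0058, -0.0063]  # made-up continuous targets
predictions = [0.0, 0.004, -0.005]    # made-up predictions, close but never equal

accuracy, mae = RunningAccuracy(), RunningMAE()
for y_true, y_pred in zip(targets, predictions):
    accuracy.update(y_true, y_pred)
    mae.update(y_true, y_pred)

print(accuracy.get())  # no exact matches, so this is 0.0
print(mae.get())
```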
lol of course! Sincere apologies :(
Hi @MaxHalford, with the above code I get an error:
TypeError Traceback (most recent call last)
~/creme/creme/stream.py in iter_pandas(X, y, **kwargs)
~/creme/creme/stream.py in iter_array(X, y, feature_names, target_names, shuffle, random_state)
TypeError: 'float' object is not subscriptable
It seems like your
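The exact cause is not visible in what survives of this thread, but this class of error generally means that something expected to be an indexable row turned out to be a single float. One way it can arise, offered purely as an assumption, is iterating over a flat sequence of numbers as if each element were a row. `iter_rows` below is a hypothetical stand-in for an array iterator, not creme's code.

```python
def iter_rows(rows, feature_names):
    """Yield one feature dict per row (hypothetical stand-in for an array iterator)."""
    for row in rows:
        yield {name: row[i] for i, name in enumerate(feature_names)}


two_d = [[0.1, 0.2], [0.3, 0.4]]  # each element is an indexable row
one_d = [0.1, 0.2]                # each element is a bare float

print(list(iter_rows(two_d, ['a', 'b'])))

message = None
try:
    list(iter_rows(one_d, ['a', 'b']))
except TypeError as error:
    # Indexing a float fails in the same way as in the traceback above.
    message = str(error)
    print(message)
```

Checking that the feature container is two-dimensional (one row per observation) before iterating is a cheap way to rule this out.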
I honestly think I just need to rerun some code.
Hi @MaxHalford !!
from creme import datasets
model = preprocessing.StandardScaler()
metric = metrics.Accuracy()
for x, y in stream.iter_pandas(X, y):
print(metric)
The above code did print the metric as zero
I need to experiment a bit more
No worries, I like helping out. To go further and understand what's going wrong, could you provide the dataset you're using?
Hi @MaxHalford !! The y values date symbol
The code to generate the datasets is 👍
from pandas_datareader import data as pdr
import yfinance as yf
yf.pdr_override()  # <== that's all it takes :-)
import pandas as pd
pd.core.common.is_list_like = pd.api.types.is_list_like  # may be necessary in some versions of pandas
import pandas_datareader.data as web
def get_symbols(symbols, begin_date=None, end_date=None):
    out = pd.DataFrame()
    for symbol in symbols:
        df = web.DataReader(symbol, begin_date, end_date)[['Open','High','Low','Close','Volume']].reset_index()
        df.columns = ['date','open','high','low','close','volume']  # my convention: always lowercase
        df['symbol'] = symbol  # add a new column which contains the symbol so we can keep multiple symbols in the same dataframe
        df = df.set_index(['date','symbol'])
        out = pd.concat([out,df],axis=0)  # stacks on top of previously collected data
    return out.sort_index()
prices = get_symbols(['AAPL','CSCO','MSFT','INTC', 'SPY'], begin_date='2012-01-01', end_date=None)
features = pd.DataFrame(index=prices.index)
features['intraday_chg'] = (prices.groupby(level='symbol').close
features['day_of_week'] = features.index.get_level_values('date').weekday
features['day_of_month'] = features.index.get_level_values('date').day
features.dropna(inplace=True)
outcomes = pd.DataFrame(index=prices.index)
# next day's opening change
outcomes['open_1'] = prices.groupby(level='symbol').open.shift(-1)
# next day's closing change
func_one_day_ahead = lambda x: x.pct_change(-1)
print((outcomes.tail(40)))
# first, create y (a series) and X (a dataframe), with only rows where
# a valid value exists for both y and X
y = outcomes.close_1
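Because parts of the snippet above are truncated, a small self-contained illustration of its central idea may help: shifting within each symbol so that every row is paired with that same symbol's next-day value. The sketch assumes pandas is installed and uses a tiny made-up price table in place of the downloaded data.

```python
import pandas as pd

# Made-up prices indexed by (date, symbol), mirroring the layout used above.
index = pd.MultiIndex.from_product(
    [pd.to_datetime(['2012-01-10', '2012-01-11', '2012-01-12']), ['AAPL', 'CSCO']],
    names=['date', 'symbol'],
)
prices = pd.DataFrame({'open': [10.0, 20.0, 11.0, 21.0, 12.0, 22.0]}, index=index)

outcomes = pd.DataFrame(index=prices.index)
# shift(-1) inside each symbol group pulls the next day's open onto today's row.
outcomes['open_1'] = prices.groupby(level='symbol')['open'].shift(-1)

# The last day of each symbol has no "next day", so it is dropped.
valid = outcomes.dropna()
print(valid)
```

Grouping by symbol before shifting matters: a plain `shift(-1)` on the stacked frame would pair an AAPL row with the CSCO row that follows it.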
Sincere apologies, I don't know how to format the Python code :(
Check out this tutorial :) If it's not too much work, could you please upload a CSV file with the data you're using?
Hi @MaxHalford !!
volume_change_ratio,momentum_5_day,intraday_chg,day_of_week,day_of_month,close_1
The last column is the y value
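A file with that header can be streamed one observation at a time using only the standard library. The sketch below is one possible approach rather than the method used in this thread: the header is copied from above, the two data rows are made up, and `iter_csv_rows` is a hypothetical helper name.

```python
import csv
import io

# Header copied from the thread; the two data rows are made up for illustration.
raw_csv = """volume_change_ratio,momentum_5_day,intraday_chg,day_of_week,day_of_month,close_1
-0.34,0.01,0.002,1,10,-0.0013
0.20,-0.02,0.004,1,10,0.0058
"""


def iter_csv_rows(file, target_name):
    """Yield (features, target) pairs, converting every field to float."""
    for row in csv.DictReader(file):
        y = float(row.pop(target_name))  # split the target off from the features
        x = {name: float(value) for name, value in row.items()}
        yield x, y


pairs = list(iter_csv_rows(io.StringIO(raw_csv), target_name='close_1'))
for x, y in pairs:
    print(x, y)
```

With a real file, `io.StringIO(raw_csv)` would be replaced by an open file handle, and the pairs could be consumed lazily instead of being collected into a list.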
When you iterate over your dataset, you overwrite the
Furthermore, you use
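The overwriting problem is easy to reproduce in isolation. In the sketch below (made-up values), reusing `y` as the loop variable rebinds the name to a single number on each iteration, so the full list of targets is no longer reachable under that name once the loop ends.

```python
X = [{'a': 1.0}, {'a': 2.0}, {'a': 3.0}]
y = [10.0, 20.0, 30.0]

for x, y in zip(X, y):  # `y` is rebound to one target on every iteration
    pass

y_after_loop = y
print(y_after_loop)  # a single float now, not the original list

# Using distinct loop names leaves the original containers intact.
targets = [10.0, 20.0, 30.0]
for x, target in zip(X, targets):
    pass
print(targets)
```

The same applies to reassigning `X` inside the loop body: after the first iteration the name no longer refers to the dataset.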
Hi @MaxHalford, hope you're well.
from creme import feature_extraction
from creme import linear_model
from creme import metrics
from creme import preprocessing
from creme import stats
#means = (
#    feature_extraction.TargetAgg(by='volume_change_ratio', how=stats.RollingMean(7)),
#    feature_extraction.TargetAgg(by='volume_change_ratio', how=stats.RollingMean(14)),
#    feature_extraction.TargetAgg(by='volume_change_ratio', how=stats.RollingMean(21))
#)
scaler = preprocessing.StandardScaler()
lin_reg = linear_model.LinearRegression()
metric = metrics.MAE()
for x, y in zip(X,y):
    # Process the rolling means of the target
    # for mean in means:
    #     x = {**X, **mean.transform_one(X)}
    #     mean.fit_one(X, y)
    # Remove the key/value pairs that aren't features
    # for key in ['date', 'symbol']:
    #     X.pop(key)
    X=X.values()
    # Rescale the data
    X = scaler.fit_one(X, y).transform_one(X)
    # Fit the linear regression
    y_pred = lin_reg.predict_one(X)
    lin_reg.fit_one(X, y)
    # Update the metric using the out-of-fold prediction
    metric.update(y, y_pred)
print(metric)
Although I have not had much luck
Hi! As I said earlier, you are overwriting the variable here:
for x, y in zip(X,y):
So, you have to choose another variable name. Here is something which should work:
from creme.metrics import MAE
from creme.preprocessing import StandardScaler
from creme import linear_model
from creme import optim
from creme.stream import iter_pandas
metric = MAE()
scaler = StandardScaler()
lin_reg = linear_model.LinearRegression(optim.SGD(0.001))
# supposing you are using two pandas dataframes
for x, truth in iter_pandas(X, y):
    pred = lin_reg.predict_one(x)
    metric.update(y_true=truth, y_pred=pred)
    # Here, you don't need to provide the ground truth for StandardScaler
    x = scaler.fit_one(x).transform_one(x)
    lin_reg.fit_one(x, truth)
print(metric)
Hope I helped 👌
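The ordering inside the loop above matters: each prediction is made before the model has seen that observation's target, so the metric reflects performance on unseen data. A dependency-free sketch of the same predict-then-fit pattern follows. `TinyLinearRegression` is a made-up stand-in that runs plain SGD on squared error and is not creme's implementation; the synthetic stream (target equal to twice the feature) and the learning rate of 0.3 are assumptions chosen only for illustration.

```python
class TinyLinearRegression:
    """Linear regression trained by SGD, one observation at a time."""

    def __init__(self, learning_rate=0.3):
        self.learning_rate = learning_rate
        self.weights = {}
        self.intercept = 0.0

    def predict_one(self, x):
        return self.intercept + sum(self.weights.get(k, 0.0) * v for k, v in x.items())

    def fit_one(self, x, y):
        # Gradient of 0.5 * (prediction - y) ** 2 with respect to each parameter.
        error = self.predict_one(x) - y
        for k, v in x.items():
            self.weights[k] = self.weights.get(k, 0.0) - self.learning_rate * error * v
        self.intercept -= self.learning_rate * error
        return self


# Synthetic stream: the target is exactly twice the single feature.
stream = [({'f': (i % 10) / 10}, 2 * (i % 10) / 10) for i in range(2000)]

model = TinyLinearRegression(learning_rate=0.3)
absolute_errors = []
for x, truth in stream:
    pred = model.predict_one(x)                # 1. predict before seeing the target
    absolute_errors.append(abs(truth - pred))  # 2. score that prediction
    model.fit_one(x, truth)                    # 3. only then learn from it

early_mae = sum(absolute_errors[:100]) / 100
late_mae = sum(absolute_errors[-100:]) / 100
print(early_mae, late_mae)
```

Comparing the error over the first and last hundred observations shows the model improving as the stream goes on, which is the behaviour a running metric is meant to track.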
Awesome!! thank you so much for your help!!!! happy days!!!!!!!!!!!!!!!!!!!!!!!!!!!
Why not use something like this..?
feature_inputs_as_dicts = feature_dataframe().values()
label_list = labels[label_name]
for x, y in zip(feature_inputs_as_dicts, label_list):
    print(x)
    print(y)
    this and that and this and that
@landmann there are many ways to skin a cat :)
Hi, how do I use a pandas DataFrame? In particular, how do I pass the selected features in X and the target variable in y?
@dileepkumarg-sa there's an
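Independently of any library helper (an earlier snippet in this thread imports `iter_pandas` from `creme.stream`), the same idea can be written with pandas alone: select the feature columns, turn each row into a dict, and pair it with the target column. The sketch assumes pandas is installed; the column values are made up and `iter_rows` is a hypothetical helper name.

```python
import pandas as pd

# Made-up data: two feature columns, one column to ignore, and the target.
df = pd.DataFrame({
    'volume_change_ratio': [-0.34, 0.20],
    'day_of_week': [1, 1],
    'symbol': ['AAPL', 'CSCO'],
    'close_1': [-0.0013, 0.0058],
})

feature_columns = ['volume_change_ratio', 'day_of_week']
X = df[feature_columns]  # selected features only
y = df['close_1']        # target variable


def iter_rows(X, y):
    """Yield (feature dict, target) pairs, one per DataFrame row."""
    for x, target in zip(X.to_dict(orient='records'), y):
        yield x, target


pairs = list(iter_rows(X, y))
for x, target in pairs:
    print(x, target)
```

Note that `to_dict(orient='records')` materializes every row up front, so for very large frames a chunked or lazy iteration would be preferable.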
Hi @MaxHalford !
Apologies for the beginner question.
How do we use creme on non-scikit-learn datasets?
Basically, I have two dictionaries.
X=
{(Timestamp('2012-01-10 00:00:00'),
'volume_change_ratio',
'AAPL'): -0.344719768623466,
(Timestamp('2012-01-10 00:00:00'),
'volume_change_ratio',
'CSCO'): 0.20302817925763325,
(Timestamp('2012-01-10 00:00:00'),
'volume_change_ratio',
'INTC'): -0.13517037149368347,
And Y =
{(Timestamp('2012-01-10 00:00:00'), 'AAPL'): -0.0013231888852133222,
(Timestamp('2012-01-10 00:00:00'), 'CSCO'): 0.005841741901221553,
(Timestamp('2012-01-10 00:00:00'), 'INTC'): -0.006252442360296984,
(Timestamp('2012-01-10 00:00:00'), 'MSFT'): -0.014727011494252928,
(Timestamp('2012-01-10 00:00:00'), 'SPY'): -0.003097653527452948,
(Timestamp('2012-01-11 00:00:00'), 'AAPL'): -0.0004970178926441138,
(Timestamp('2012-01-11 00:00:00'), 'CSCO'): 0.002621919244887305,
(Timestamp('2012-01-11 00:00:00'), 'INTC'): 0.0019379844961240345,
And I want to do a basic iterative linear regression on the data sets.
Best,
Andrew