Question 1 (1 point): Dummies on Month and Week-of-Month

Find the CORRELATION VALUE of the most correlated dummy <month-week_of_month> with the binary outcome variable (" is_positive_growth_5d_future")?


You saw in the correlation analysis and modeling that September and October may be important seasonal months. In this task, we'll go futher and try to generate dummies for Month and Week-of-month (starting from 1). For example, the first week of October should be coded similar to this: 'October_w1'. Once you've generated the new set of variables, find the most correlated (in absolute value) one with "is_positive_growth_5d_future" and round it to 3 digits after the comma.


Suggested path to a solution:


[Source] Use this formula to get the week of month for the datetime variable d: (d.day-1)//7+1


Define a new string variable for all month-week_of_month combinations. Append it to the CATEGORICAL features set. You should have 5 variables treated as CATEGORICAL now: 'Month', 'Weekday', 'Ticker', 'ticker_type', 'month_wom'.


Use pandas.get_dummies() to generate dummies.


Use pandas.DataFrame.corr() function (also used in [Code Snippet 1]) to get correlations with "is_positive_growth_5d_future", filter out only variables representing the new dummy set, and sort it by absolute values (you can define a new column "abs_corr" in the dataframe with correlations), and find the highest value (among new dummies set).

NOTE: new dummies will be used as features in the next tasks, please leave them in the dataset.


In [1]:


import pandas as pd
import numpy as np

# Load the sample data
df_full = pd.read_parquet("./stocks_df_combined_2024_05_07.parquet.brotli", )



In [2]:
# growth indicators (but not future growth)
GROWTH = [g for g in df_full.keys() if (g.find('growth_')==0)&(g.find('future')<0)]

# leaving only Volume ==> generate ln(Volume)
OHLCV = ['Open','High','Low','Close','Adj Close_x','Volume']
CATEGORICAL = ['Month', 'Weekday', 'Ticker', 'ticker_type']
TO_PREDICT = [g for g in df_full.keys() if (g.find('future')>=0)]
# let's define on more custom numerical features
df_full['ln_volume'] = df_full.Volume.apply(lambda x: np.log(x))
# manually defined features
CUSTOM_NUMERICAL = ['SMA10', 'SMA20', 'growing_moving_average', 'high_minus_low_relative','volatility', 'ln_volume']

# All Supported Ta-lib indicators: https://github.com/TA-Lib/ta-lib-python/blob/master/docs/funcs.md
TECHNICAL_INDICATORS = ['adx', 'adxr', 'apo', 'aroon_1','aroon_2', 'aroonosc',
 'bop', 'cci', 'cmo','dx', 'macd', 'macdsignal', 'macdhist', 'macd_ext',
 'macdsignal_ext', 'macdhist_ext', 'macd_fix', 'macdsignal_fix',
 'macdhist_fix', 'mfi', 'minus_di', 'mom', 'plus_di', 'dm', 'ppo',
 'roc', 'rocp', 'rocr', 'rocr100', 'rsi', 'slowk', 'slowd', 'fastk',
 'fastd', 'fastk_rsi', 'fastd_rsi', 'trix', 'ultosc', 'willr',
 'ad', 'adosc', 'obv', 'atr', 'natr', 'ht_dcperiod', 'ht_dcphase',
 'ht_phasor_inphase', 'ht_phasor_quadrature', 'ht_sine_sine', 'ht_sine_leadsine',
 'ht_trendmod', 'avgprice', 'medprice', 'typprice', 'wclprice']

TECHNICAL_PATTERNS = [g for g in df_full.keys() if g.find('cdl')>=0]

print(f'Technical patterns count = {len(TECHNICAL_PATTERNS)}, examples = {TECHNICAL_PATTERNS[0:5]}')

MACRO = ['gdppot_us_yoy', 'gdppot_us_qoq', 'cpi_core_yoy', 'cpi_core_mom', 'FEDFUNDS',
 'DGS1', 'DGS5', 'DGS10']

NUMERICAL = GROWTH + TECHNICAL_INDICATORS + TECHNICAL_PATTERNS + CUSTOM_NUMERICAL + MACRO

# tickers, min-max date, count of daily observations
df_full.groupby(['Ticker'])['Date'].agg(['min','max','count'])

# truncated df_full with 25 years of data (and defined growth variables)
df = df_full[df_full.Date>='2000-01-01']
df.info()

# let look at the features count and size:
df[NUMERICAL].info()

  df_full['ln_volume'] = df_full.Volume.apply(lambda x: np.log(x))


Technical patterns count = 61, examples = ['cdl2crows', 'cdl3blackrows', 'cdl3inside', 'cdl3linestrike', 'cdl3outside']
<class 'pandas.core.frame.DataFrame'>
Index: 182675 entries, 3490 to 5426
Columns: 203 entries, Open to ln_volume
dtypes: datetime64[ns](3), float64(129), int32(64), int64(5), object(2)
memory usage: 239.7+ MB
<class 'pandas.core.frame.DataFrame'>
Index: 182675 entries, 3490 to 5426
Columns: 184 entries, growth_1d to DGS10
dtypes: float64(121), int32(62), int64(1)
memory usage: 214.6 MB


In [3]:
import pandas as pd
import numpy as np

df.loc[:,'Date'] = pd.to_datetime(df['Date'])

In [4]:
# Create 'Month' and 'Week_of_Month' columns
df['Month'] = df['Date'].dt.month_name()
df['Week_of_Month'] = (df['Date'].dt.day - 1) // 7 + 1

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['Month'] = df['Date'].dt.month_name()
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['Week_of_Month'] = (df['Date'].dt.day - 1) // 7 + 1


In [5]:
# Combine 'Month' and 'Week_of_Month' into a single column
df['month_wom'] = df['Month'] + '_w' + df['Week_of_Month'].astype(str)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['month_wom'] = df['Month'] + '_w' + df['Week_of_Month'].astype(str)


In [6]:

# Add 'month_wom' to your list of categorical features
CATEGORICAL = ['Month', 'Weekday', 'Ticker', 'ticker_type', 'month_wom']

TO_DROP = ['Year','Date','index_x', 'index_y', 'index', 'Quarter','Adj Close_y'] + CATEGORICAL + OHLCV
OTHER = [k for k in df_full.keys() if k not in OHLCV + CATEGORICAL + NUMERICAL + TO_DROP]

# Generate dummy variables for the categorical features
dummies = pd.get_dummies(df[CATEGORICAL],dtype='int32')
DUMMIES = dummies.keys().to_list()
DUMMIES
# Join dummies back to the original dataframe
df_dummies = pd.concat([df, dummies], axis=1)



In [7]:
# Calculate correlations with 'is_positive_growth_5d_future'
correlations_df = df_dummies[DUMMIES+['is_positive_growth_5d_future']].corr()

In [8]:
growth_5d_corr = correlations_df['is_positive_growth_5d_future'].drop('is_positive_growth_5d_future')
growth_5d_corr_max = growth_5d_corr.abs().max()



In [9]:
# find the max value of growth_5d_corr
print(f'The CORRELATION VALUE of the most correlated dummy <month-week_of_month> with the binary outcome variable:{growth_5d_corr_max:.3f}')


The CORRELATION VALUE of the most correlated dummy <month-week_of_month> with the binary outcome variable:0.035


Question 2 (2 points): Define new "hand" rules on macro and technical indicators variables


What is the precision score for the best of the NEW variables (pred3 or pred4)


Let's utilize the knowledge from the visualised tree (clf10) (Code Snippet 5: 1.4.4 Visualisation). You're asked to define two new 'hand' rules (leading to 'positive' subtrees):


pred3_manual_gdp_fastd: (gdppot_us_yoy <= 0.027) & (fastd >= 0.251)
pred4_manual_gdp_wti_oil: (gdppot_us_yoy >= 0.027) & (growth_wti_oil_30d <= 1.005)
Extend the Code Snippet 3 (Manual "hand rule" predictions): Calculate and add them to the dataframe. You should notice that one of the predictions doesn't have any positive predictions on TEST dataset. Please debug that: check in the 'new_df' and the original dataset/data generation process that we didn't make any mistakes at the data transformations step; explain why this can happen even if there are no mistakes at the data transformation step.

As a result, write down the precision score for the remaining predictor (round to three decimal points). E.g. if you have 0.57897, your answer should be 0.579.


In [10]:
def temporal_split(df, min_date, max_date, train_prop=0.7, val_prop=0.15, test_prop=0.15):
    """
    Splits a DataFrame into three buckets based on the temporal order of the 'Date' column.

    Args:
        df (DataFrame): The DataFrame to split.
        min_date (str or Timestamp): Minimum date in the DataFrame.
        max_date (str or Timestamp): Maximum date in the DataFrame.
        train_prop (float): Proportion of data for training set (default: 0.6).
        val_prop (float): Proportion of data for validation set (default: 0.2).
        test_prop (float): Proportion of data for test set (default: 0.2).

    Returns:
        DataFrame: The input DataFrame with a new column 'split' indicating the split for each row.
    """
    # Define the date intervals
    train_end = min_date + pd.Timedelta(days=(max_date - min_date).days * train_prop)
    val_end = train_end + pd.Timedelta(days=(max_date - min_date).days * val_prop)

    # Assign split labels based on date ranges
    split_labels = []
    for date in df['Date']:
        if date <= train_end:
            split_labels.append('train')
        elif date <= val_end:
            split_labels.append('validation')
        else:
            split_labels.append('test')

    # Add 'split' column to the DataFrame
    df['split'] = split_labels

    return df

In [11]:
min_date_df = df_dummies.Date.min()
max_date_df = df_dummies.Date.max()

df_dummies = temporal_split(df_dummies,
                                 min_date = min_date_df,
                                 max_date = max_date_df)


In [12]:
df_dummies['split'].value_counts()/len(df_dummies)

split
train         0.675834
test          0.163290
validation    0.160876
Name: count, dtype: float64

In [13]:
# remove the "segmentation" problem (warning message on df performance after many joins and data transformations)
new_df = df_dummies.copy()

In [14]:
new_df.groupby(by='split')['growth_future_5d'].describe()

Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
split,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
test,29664.0,1.005015,0.040835,0.690219,0.981994,1.004731,1.027028,1.393477
train,123458.0,1.003965,0.053826,0.412383,0.978474,1.003197,1.028354,3.018887
validation,29388.0,1.004417,0.040642,0.668581,0.985343,1.00512,1.023999,1.459217


In [15]:
# check one record: it has abs. values, text, and numbers
new_df.head(1)

Unnamed: 0,Open,High,Low,Close,Adj Close_x,Volume,Ticker,Year,Month,Weekday,...,month_wom_October_w2,month_wom_October_w3,month_wom_October_w4,month_wom_October_w5,month_wom_September_w1,month_wom_September_w2,month_wom_September_w3,month_wom_September_w4,month_wom_September_w5,split
3490,58.6875,59.3125,56.0,58.28125,36.065567,53228400.0,MSFT,2000,January,0,...,0,0,0,0,0,0,0,0,0,train


In [16]:
# time split on train/validation/test: FIXED dates of split, approx. 70%, 15%, 15% split
new_df.groupby(['split'])['Date'].agg({'min','max','count'})

Unnamed: 0_level_0,max,count,min
split,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
test,2024-05-07,29829,2020-09-14
train,2017-01-16,123458,2000-01-03
validation,2020-09-11,29388,2017-01-17


In [17]:
# what we try to predict
new_df[TO_PREDICT].head(1)

Unnamed: 0,growth_future_5d,is_positive_growth_5d_future
3490,0.963003,0


In [18]:
# to be used as features
new_df[NUMERICAL+DUMMIES].head(1)

Unnamed: 0,growth_1d,growth_3d,growth_7d,growth_30d,growth_90d,growth_365d,growth_dax_1d,growth_dax_3d,growth_dax_7d,growth_dax_30d,...,month_wom_October_w1,month_wom_October_w2,month_wom_October_w3,month_wom_October_w4,month_wom_October_w5,month_wom_September_w1,month_wom_September_w2,month_wom_September_w3,month_wom_September_w4,month_wom_September_w5
3490,0.998394,0.988341,0.991494,1.372333,1.222951,2.063053,0.970196,0.983855,1.051736,1.134572,...,0,0,0,0,0,0,0,0,0,0


In [19]:
# generate manual predictions
# Let's label all prediction features with prefix "pred"
new_df['pred0_manual_cci'] = (new_df.cci>200).astype(int)
new_df['pred1_manual_prev_g1'] = (new_df.growth_1d>1).astype(int)
new_df['pred2_manual_prev_g1_and_snp'] = ((new_df['growth_1d'] > 1) & (new_df['growth_snp500_1d'] > 1)).astype(int)
# Define the new 'hand' rules
new_df['pred3_manual_gdp_fastd'] = (df['gdppot_us_yoy'] <= 0.027) & (df['fastd'] >= 0.251).astype(int)
new_df['pred4_manual_gdp_wti_oil'] = (df['gdppot_us_yoy'] >= 0.027) & (df['growth_wti_oil_30d'] <= 1.005).astype(int)

In [20]:

# Calculate precision scores
from sklearn.metrics import precision_score

# Assuming 'is_positive_growth_5d_future' is the actual target variable
test_idx = new_df.split.isin(['test'])
precision_pred3 = precision_score(new_df.loc[test_idx,'is_positive_growth_5d_future'], new_df.loc[test_idx,'pred3_manual_gdp_fastd'],zero_division=0)
precision_pred4 = precision_score(new_df.loc[test_idx,'is_positive_growth_5d_future'], new_df.loc[test_idx,'pred4_manual_gdp_wti_oil'],zero_division=0)

# Select the best precision score
best_precision = max(precision_pred3, precision_pred4)
best_precision_rounded = round(best_precision, 3)

In [21]:
print(f'The precision score for the remaining predictor is {best_precision_rounded:.3f}')

The precision score for the remaining predictor is 0.555


Question 3 (1 point): Unique correct predictions from a 10-levels deep decision tree classifier (pred5_clf_10)

What is the total number of records in the TEST dataset when the new prediction pred5_clf_10 is better than all 'hand' rules (pred0..pred4)?


NOTE: please include random_state=42 to Decision Tree Classifier init function (line clf = DecisionTreeClassifier(max_depth=max_depth, random_state=42)) to ensure everyone gets the same results.


Suggested solution: 

Rewrite the '1.4.3 Inference for a decision tree' piece for the Decision Tree Classifier with max_depth=10 (clf_10), so that you fit the model on TRAIN+VALIDATION sets (unchanged from the lecture), but predict on the whole set X_all (to be able to define a new column 'pred5_clf_10' in the dataframe new_df). Here is the link with explanation. It will solve the problem in 1.4.5 when predictions were made only for Test dataset and couldn't be easily joined with the full dataset.


Once you have it, define a new column 'only_pred5_is_correct' similar to 'hand' prediction rules with several conditions: is_positive_growth_5d_future AND is_correct_pred5 should be equal 1, while all other predictions is_correct_pred0..is_correct_pred4 should be equal to 0.


Convert 'only_pred5_is_correct' column from bool to int, and find how many times it is equal to 1 in the TEST set. Write down this as an answer.


ADVANCED: define a function that can be applied to the whole row (examples) and can find whether some prediction 'predX' (where X is one of the predictions) is uniquely correct. It should work even if there are 100 predictions available, so that you don't define manually the condition.


In [22]:
PREDITIONS = [k for k in new_df.keys() if  k.startswith('pred')]
PREDITIONS

['pred0_manual_cci',
 'pred1_manual_prev_g1',
 'pred2_manual_prev_g1_and_snp',
 'pred3_manual_gdp_fastd',
 'pred4_manual_gdp_wti_oil']

In [23]:
features_list = NUMERICAL+DUMMIES
to_predict = 'is_positive_growth_5d_future'

In [24]:
new_df.replace([np.inf, -np.inf], np.nan, inplace=True)
new_df.fillna(0, inplace=True)

In [25]:
# to prepae dataset for DecisionTreeClassifier
# Rewrite the '1.4.3 Inference for a decision tree' piece for the Decision Tree Classifier with max_depth=10 (clf_10), 
# so that you fit the model on TRAIN+VALIDATION sets (unchanged from the lecture), 
# but predict on the whole set X_all (to be able to define a new column 'pred5_clf_10' in the dataframe new_df).
# Here is the link with explanation. It will solve the problem in 1.4.5 when predictions were made only for Test dataset 
# and couldn't be easily joined with the full dataset.

In [26]:
# TODO 3: Implement the unique correct predictions using a 10-levels deep Decision Tree Classifier

from sklearn.tree import DecisionTreeClassifier

# Fit the model on TRAIN+VALIDATION sets
train_df = new_df[new_df.split.isin(['train','validation'])]
test_df = new_df[new_df.split.isin(['test'])]
X_train = train_df[features_list]
X_test = test_df[features_list]
y_train = train_df[to_predict]
y_test = test_df[to_predict]
# Initialize the Decision Tree Classifier with max_depth=10 and random_state=42
clf_10 = DecisionTreeClassifier(max_depth=10, random_state=42)
clf_10.fit(X_train, y_train)


In [27]:
# # Predict on the whole set X_all
X_all = new_df[features_list]
new_df['pred5_clf_10'] = clf_10.predict(X_all)

# # Define the new column 'only_pred5_is_correct'
new_df['is_correct_pred5'] = new_df['pred5_clf_10'] == new_df['is_positive_growth_5d_future']



In [28]:
new_df['is_correct_pred0'] = new_df['pred0_manual_cci'] == new_df['is_positive_growth_5d_future']
new_df['is_correct_pred1'] = new_df['pred1_manual_prev_g1'] == new_df['is_positive_growth_5d_future']
new_df['is_correct_pred2'] = new_df['pred2_manual_prev_g1_and_snp'] == new_df['is_positive_growth_5d_future']
new_df['is_correct_pred3'] = new_df['pred3_manual_gdp_fastd'] == new_df['is_positive_growth_5d_future']
new_df['is_correct_pred4'] = new_df['pred4_manual_gdp_wti_oil'] == new_df['is_positive_growth_5d_future']



In [29]:
# # Conditions for only_pred5_is_correct
new_df['only_pred5_is_correct'] = (
    new_df['is_correct_pred5'] &
    ~new_df['is_correct_pred0'] &
    ~new_df['is_correct_pred1'] &
    ~new_df['is_correct_pred2'] &
    ~new_df['is_correct_pred3'] &
    ~new_df['is_correct_pred4']
)

In [30]:
# Find how many times 'only_pred5_is_correct' is equal to 1 in the TEST set
test_only_pred5_correct_count = new_df.loc[test_idx, 'only_pred5_is_correct'].sum()

print(f'The number of times "only_pred5_is_correct" is equal to 1 in the TEST set is {test_only_pred5_correct_count}')


The number of times "only_pred5_is_correct" is equal to 1 in the TEST set is 1


Question 4: (2 points) Hyperparameter tuning for a Decision Tree
What is the optimal tree depth (from 1 to 20) for a DecisionTreeClassifier?

Modify the section 1.4 [Code Snippet 4]


Re-define the train set X_train (using the condition split=='train'), create a validation set X_valid (using the condition split=='validation'), and leave the test set X_test unchanged.

Apply the same data transformation rules (replace +-inf with NaN and then replace all NaNs with 0).

Iterate in a loop for max_depth between 1 and 20:
Train the DecisionTreeClassifier (clf) with max_depth=k on a train set.

Find the precision and accuracy scores on the validation set.

Select the best_max_depth based on precision only and write it down as an answer.


(Advanced: Read about scikit-learn Decision Trees. Do you see the 'saturation' of precision/accuracy when max_depth is increasing, or there is a tendency of overfitting?)


In [31]:
# Re-define the train set X_train (using the condition split=='train'), create a validation set X_valid (using the condition split=='validation'), and leave the test set X_test unchanged.
# Re-define the train set X_train
X_train = new_df[new_df['split'] == 'train'][features_list]
y_train = new_df[new_df['split'] == 'train'][to_predict]

# Create a validation set X_valid
X_valid = new_df[new_df['split'] == 'validation'][features_list]
y_valid = new_df[new_df['split'] == 'validation'][to_predict]

# Leave the test set X_test unchanged
X_test = new_df[new_df['split'] == 'test'][features_list]
y_test = new_df[new_df['split'] == 'test'][to_predict]

X_all = new_df[features_list]
y_all = new_df[to_predict]


In [32]:
# TODO 4: Hyperparameter tuning for a Decision Tree to find the optimal max_depth

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import precision_score

# Initialize variables to track the best max_depth and highest precision score
best_max_depth = 0
highest_precision = 0

# Iterate through max_depth values from 1 to 20
for max_depth in range(1, 21):
    # Initialize the Decision Tree Classifier with the current max_depth and random_state=42
    clf = DecisionTreeClassifier(max_depth=max_depth, random_state=42)
    
    # Fit the model on TRAIN+VALIDATION sets
    clf.fit(X_train, y_train)
    
    # Predict on the TEST set
    y_pred = clf.predict(X_test)
    
    # Calculate the precision score on the TEST set
    precision = precision_score(y_test, y_pred)
    
    # Update the best_max_depth and highest_precision if current precision is higher
    if precision > highest_precision:
        highest_precision = precision
        best_max_depth = max_depth

# Output the best_max_depth and corresponding highest precision score
print(f'The best max_depth is {best_max_depth} and the corresponding highest precision score is {highest_precision:.3f}')




The best max_depth is 15 and the corresponding highest precision score is 0.569


In [33]:
# Fit the model with the best_max_depth on the entire dataset (TRAIN+VALIDATION+TEST)
clf_best = DecisionTreeClassifier(max_depth=best_max_depth, random_state=42)
clf_best.fit(X_all, y_all)

# Make predictions on all records and add the new prediction pred6_clf_best to the dataframe
new_df['pred6_clf_best'] = clf_best.predict(X_all)

# Compare the precision score of the tuned decision tree with previous predictions
precision_best = precision_score(y_all, new_df['pred6_clf_best'])

# Output the precision score of the tuned decision tree
precision_best

0.7185504013861879

[EXPLORATORY] Question 5: What data is missing?
[探索性]問題 5：缺少哪些數據？
Now that you have some insights from the correlation analysis and the Decision Trees regarding the most influential variables, suggest new indicators you would like to include in the dataset and explain why.

You can also propose something entirely different based on your intuition, but it should be relevant to the shared dataset of the largest Indian, EU, and US stocks. If you choose this approach, please specify the data source as well.


Submitting the solutions 提交解決方案 
[NOT READY YET] Form for submitting: https://courses.datatalks.club/sma-zoomcamp-2024/homework/hw03
[還沒準備好]提交表格：https://courses.datatalks.club/sma-zoomcamp-2024/homework/hw03

Leaderboard 排行榜 
Leaderboard link: https://courses.datatalks.club/sma-zoomcamp-2024/leaderboard
排行榜連結： https://courses.datatalks.club/sma-zoomcamp-2024/leaderboard