# Problem Statement

In the realm of Neverland, where enchantment stretched far and wide, there lived a kind-hearted cattleman named Grassland Gus. His farm, Moo Meadows, was the sole source of the most delicious meat and milk in the kingdom. The people of Neverland would come from far and wide to procure these treasures.

Some procured these items directly from the farm at wholesale rates, while others obtained them from nearby groceries at retail prices. Fresh items were available at a premium, while frozen ones were sold at standard prices.

Gus packaged his dairy and meat in Enchanted Boxes. Each box held a different combination of meat and milk, and depending on their quality, some boxes were more valuable than others.

To purchase these magical boxes, the denizens of Neverland used Wishing Coins, which are tokens earned through acts of kindness. Every buyer had their own unique Magic Key, which kept track of all their purchases.

All exchanges in the kingdom are logged in the Enchanted Scroll, the details of which are given in the file purchase.csv. The file contains records of purchases made over the last five months, including the date of purchase, the customer's Magic Key, the ID of the box purchased and the number of boxes purchased. Denizens select boxes to purchase from a list written on parchment. The dataset boxes.csv enumerates all available boxes, including the box ID, product quality, delivery option, quantity of milk (cauldrons), quantity of meat (stones) and box unit price.

There is no separate train.csv for this contest. Only **"purchase.csv"** and **"boxes.csv"** are given, and all training data must be derived from these two files.

**"problem 1.csv"** lists the Magic Keys for which you must predict, and **"sample submission 1.csv"** shows the submission template.

**You need to predict which of the Magic Keys given in “problem 1.csv” will buy milk and/or meat in the first 15 days of March 2019. Put Y in the purchase column if the Magic Key will make a purchase and N if it will not. Prepare and submit your predictions as submission.csv, following the template (sample submission 1.csv).**

-------
# Evaluation

The evaluation metric for this problem is accuracy: the proportion of correctly predicted instances out of all instances in the dataset.
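
As a minimal sketch of the metric described above, accuracy on Y/N labels can be computed as follows. The function name `accuracy` and the toy labels are illustrative assumptions, not part of the contest materials.

```python
def accuracy(y_true, y_pred):
    """Share of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Toy example (made-up labels): 3 of the 4 predictions match the truth.
toy_true = ["Y", "N", "Y", "Y"]
toy_pred = ["Y", "N", "N", "Y"]
toy_score = accuracy(toy_true, toy_pred)
```

Because accuracy weights both classes equally per instance, a skewed Y/N ratio makes the decision threshold of a probabilistic model matter a great deal.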

# Problem One Standing
------

**6th Place out of 240**

# Generic testing
This notebook tests general ideas and performs exploratory data analysis (EDA) on the problem datasets.

In [1]:
import pandas as pd
import numpy as np
purchase = pd.read_csv("Problem 1/purchase.csv")
box = pd.read_csv("Problem 1/boxes.csv")


## Dropping duplicates, NaN values and impossible values

In [2]:
purchase = purchase.dropna().drop_duplicates()  # Drop rows with NaN values and exact duplicate rows
non_negative_box_count_mask = purchase['BOX_COUNT'] >= 0  # A negative box count is impossible
purchase = purchase[non_negative_box_count_mask].copy()  # Copy so the assignment below does not act on a view
purchase['PURCHASE_DATE'] = pd.to_datetime(purchase['PURCHASE_DATE'], format='%d/%m/%Y')
purchase = purchase.sort_values(by='PURCHASE_DATE')  # Sort purchase data by purchase date in ascending order
purchase.count()


PURCHASE_DATE    2455723
MAGIC_KEY        2455723
BOX_ID           2455723
BOX_COUNT        2455723
dtype: int64
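
To make the effect of each cleaning step concrete, here is a small sketch on a made-up frame. The column names match purchase.csv as used above, but every value in `toy` is invented for illustration.

```python
import pandas as pd

# Made-up rows: one missing date, one exact duplicate, one negative box count.
toy = pd.DataFrame({
    "PURCHASE_DATE": ["02/01/2019", "01/01/2019", "01/01/2019", None, "03/01/2019"],
    "MAGIC_KEY": ["A", "B", "B", "C", "D"],
    "BOX_ID": [1, 2, 2, 3, 4],
    "BOX_COUNT": [1, 2, 2, 1, -1],
})

cleaned = toy.dropna().drop_duplicates()             # removes key C and the repeated B row
cleaned = cleaned[cleaned["BOX_COUNT"] >= 0].copy()  # removes key D (negative count)
cleaned["PURCHASE_DATE"] = pd.to_datetime(cleaned["PURCHASE_DATE"], format="%d/%m/%Y")
cleaned = cleaned.sort_values(by="PURCHASE_DATE")    # B (1 Jan) now precedes A (2 Jan)
```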

# 0. New dynamic method:

## calculate_avg_time_between_purchases and feature_extraction

In [3]:
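def calculate_avg_time_between_purchases(group):
    # Mean gap in days between consecutive purchases (rows are already sorted by date)
    if len(group) > 1:
        return group['PURCHASE_DATE'].diff().dt.days.mean()  # The leading NaN from diff() is skipped
    else:
        # Only one purchase: no gap to measure, so use a 150-day sentinel (about five months)
        return 150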
    
def feature_extraction(purchase,grouped_df,box):
    # Task 1: Calculate the frequency of purchases for each Magic Key within specific time intervals (bi-weekly and monthly)
    print('1/6 Extracting Bi-Weekly and Monthly Purchase Count...')
    biweekly_purchase_count = purchase.groupby(['MAGIC_KEY', pd.Grouper(key='PURCHASE_DATE', freq='2W')]).size().unstack(fill_value=0)
    monthly_purchase_count = purchase.groupby(['MAGIC_KEY', pd.Grouper(key='PURCHASE_DATE', freq='ME')]).size().unstack(fill_value=0)

    # Task 2: Calculate the average time between purchases for each Magic Key
    print('2/6 Extracting Average Time between purchase...')
    avg_time_between_purchases = grouped_df.apply(calculate_avg_time_between_purchases)

    # Task 3 days_since_last_purchase
    print('3/6 Extracting Days since last purchase...')
    last_purchase_date = grouped_df['PURCHASE_DATE'].max()
    days_since_last_purchase = (purchase['PURCHASE_DATE'].max() - last_purchase_date).dt.days.copy()
    
    # Task 4 purchase_count and total_spent
    print('4/6 Extracting purchase_count and total_spent...')
    merged_df = pd.merge(purchase, box, on='BOX_ID') 
    merged_df['SPENT'] = merged_df['BOX_COUNT'] * merged_df['UNIT_PRICE']
    grouped_df = merged_df.groupby('MAGIC_KEY') 
    purchase_count = grouped_df.size().rename('Purchase_Count') 
    total_spent = grouped_df['SPENT'].sum().rename('Total_Spent')
    # total_spent = grouped_df['UNIT_PRICE'].sum().rename('Total_Spent')

    # Task 5  total_milk_quantity & total_meat_quantity
    print('5/6 Extracting total_milk_quantity & total_meat_quantity...')
    total_milk_quantity = grouped_df['MILK'].sum().rename('Total_Milk_Quantity')
    total_meat_quantity = grouped_df['MEAT'].sum().rename('Total_Meat_Quantity')
    
    # Task 6 num_purchases_first_15_days and num_purchases_last_15_days
    print('6/6 Extracting num_purchases_first_15_days and num_purchases_last_15_days...')
    first_15_days_purchase = merged_df[merged_df['PURCHASE_DATE'].dt.day <= 15]
    num_purchases_first_15_days = first_15_days_purchase.groupby(['MAGIC_KEY', first_15_days_purchase['PURCHASE_DATE'].dt.month]).size().groupby('MAGIC_KEY').sum()
    last_15_days_purchase = merged_df[merged_df['PURCHASE_DATE'].dt.day > 15]
    num_purchases_last_15_days = last_15_days_purchase.groupby(['MAGIC_KEY', last_15_days_purchase['PURCHASE_DATE'].dt.month]).size().groupby('MAGIC_KEY').sum()



    # Combine all features into a DataFrame
    features = pd.DataFrame({
        'Biweekly_Purchase_Count': biweekly_purchase_count.mean(axis=1),
        'Monthly_Purchase_Count': monthly_purchase_count.mean(axis=1),
        'Avg_Time_Between_Purchases': avg_time_between_purchases,
        'Days_Since_Last_Purchase': days_since_last_purchase
    })
    purchase_history_features = pd.concat([purchase_count, total_spent], axis=1) # Create a new DataFrame with purchase history features
    features = features.join(purchase_history_features, how='left')
    box_features_df = pd.concat([total_milk_quantity, total_meat_quantity], axis=1)
    features = features.join(box_features_df, how='left')
    features['Num_Purchases_First_15_Days'] = num_purchases_first_15_days
    features['Num_Purchases_Last_15_Days'] = num_purchases_last_15_days
    features = features.fillna(0)
    return features
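
The per-group `apply` in Task 2 calls a Python function once per Magic Key, which can be slow on more than a million keys. The following is a sketch of a vectorized alternative, not the method used in this notebook. It relies on the fact that the mean of consecutive gaps equals (last date - first date) / (n - 1). The helper name `avg_gap_days` and the toy data are assumptions made for illustration.

```python
import pandas as pd

def avg_gap_days(purchase, single_purchase_default=150):
    """Mean number of days between consecutive purchases per MAGIC_KEY."""
    stats = purchase.groupby("MAGIC_KEY")["PURCHASE_DATE"].agg(["min", "max", "count"])
    span_days = (stats["max"] - stats["min"]).dt.days
    gaps = span_days / (stats["count"] - 1)      # a single purchase gives 0/0 -> NaN
    return gaps.fillna(single_purchase_default)  # same sentinel as the function above

# Made-up purchases: key A buys three times, key B only once.
toy = pd.DataFrame({
    "MAGIC_KEY": ["A", "A", "A", "B"],
    "PURCHASE_DATE": pd.to_datetime(
        ["2018-10-01", "2018-10-11", "2018-10-31", "2018-11-05"]),
})
toy_gaps = avg_gap_days(toy)
```

For key A the gaps are 10 and 20 days, so both approaches give a mean of 15 days.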
    


## Set time boundary

In [5]:
purchase_oct_nov = purchase[(purchase['PURCHASE_DATE'].dt.year == 2018) &
                                    ((purchase['PURCHASE_DATE'].dt.month == 10) |
                                     (purchase['PURCHASE_DATE'].dt.month == 11))]
grouped_df_oct_nov = purchase_oct_nov.groupby('MAGIC_KEY') # Group by MAGIC_KEY
features = feature_extraction(purchase_oct_nov,grouped_df_oct_nov,box)


1/6 Extracting Bi-Weekly and Monthly Purchase Count...
2/6 Extracting Average Time between purchase...
3/6 Extracting Days since last purchase...
4/6 Extracting purchase_count and total_spent...
5/6 Extracting total_milk_quantity & total_meat_quantity...
6/6 Extracting num_purchases_first_15_days and num_purchases_last_15_days...


## Labelling features

In [7]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
def Labelling_features(features, purchase, year, month):
    """Add a binary 'labels' column: 1 if the key bought in days 1-15 of the given month, else 0."""
    # Purchases made in the first 15 days of the target month
    purchase_half = purchase[(purchase['PURCHASE_DATE'].dt.year == year) &
                             (purchase['PURCHASE_DATE'].dt.month == month) &
                             (purchase['PURCHASE_DATE'].dt.day <= 15)]

    half_keys = purchase_half['MAGIC_KEY'].unique()
    features['labels'] = 0  # Default: no purchase in the target window
    features.loc[features.index.isin(half_keys), 'labels'] = 1
    return features
    

In [8]:
features = Labelling_features(features,purchase, 2018,12) # for december of 2018
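
The training setup pairs features from an earlier window with labels from the first 15 days of the following month. A minimal sketch of that split on made-up data is shown below. `split_window` is a hypothetical helper, and unlike the month filters in this notebook it treats every row before the label month as history.

```python
import pandas as pd

def split_window(purchase, label_year, label_month):
    """Return (history, buyer_keys) for one training window.

    history    : rows dated before the first day of the label month
    buyer_keys : keys with at least one purchase on days 1-15 of the label month
    """
    start = pd.Timestamp(year=label_year, month=label_month, day=1)
    end = start + pd.Timedelta(days=15)  # exclusive bound: day 16 at midnight
    dates = purchase["PURCHASE_DATE"]
    history = purchase[dates < start]
    buyer_keys = set(purchase.loc[(dates >= start) & (dates < end), "MAGIC_KEY"])
    return history, buyer_keys

# Made-up purchases around the November/December boundary.
toy = pd.DataFrame({
    "MAGIC_KEY": ["A", "B", "A", "C", "B"],
    "PURCHASE_DATE": pd.to_datetime(
        ["2018-11-20", "2018-11-25", "2018-12-03", "2018-12-10", "2018-12-20"]),
})
history, buyer_keys = split_window(toy, 2018, 12)
# Only keys with history get a feature row, and therefore a label.
toy_labels = {key: int(key in buyer_keys) for key in history["MAGIC_KEY"].unique()}
```

Key C buys in the label window but has no history, so it gets no training row. Key B buys only on 20 December, outside days 1-15, so its label is 0.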

## Model Define

In [21]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras import regularizers

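def define_model(n_input, n_hidden_layer, n_unit_per_layer, dropout_rate=0.2, l2_penalty=0.01):
    model = Sequential()
    model.add(tf.keras.Input(shape=(n_input,)))  # Explicit input layer instead of input_shape on Dense
    # Stack n_hidden_layer blocks of Dense + Dropout (this notebook always passes n_hidden_layer=1)
    for _ in range(n_hidden_layer):
        model.add(Dense(n_unit_per_layer, activation='relu', kernel_initializer='he_uniform', kernel_regularizer=regularizers.l2(l2_penalty)))
        model.add(Dropout(dropout_rate))
    model.add(Dense(1, activation='sigmoid'))  # Output: probability of a purchase
    return model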
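
To show what this architecture computes at prediction time, here is a NumPy sketch of the forward pass for one hidden layer. It assumes the weight list is ordered kernel, bias, kernel, bias (one pair per Dense layer). The weights are random stand-ins, not trained values, and `forward` is an illustrative name.

```python
import numpy as np

def forward(x, weights):
    """Inference-time forward pass: Dense(ReLU) -> Dense(sigmoid).

    `weights` is [w_hidden, b_hidden, w_out, b_out]. Dropout is skipped
    because it is only active during training.
    """
    w_hidden, b_hidden, w_out, b_out = weights
    hidden = np.maximum(0.0, x @ w_hidden + b_hidden)  # ReLU activation
    logits = hidden @ w_out + b_out
    return 1.0 / (1.0 + np.exp(-logits))               # sigmoid -> probability

rng = np.random.default_rng(0)   # random stand-in weights for illustration only
n_input, n_units = 10, 50        # 10 features and 50 hidden units, as in this notebook
toy_weights = [
    rng.normal(scale=0.1, size=(n_input, n_units)), np.zeros(n_units),
    rng.normal(scale=0.1, size=(n_units, 1)), np.zeros(1),
]
toy_probs = forward(rng.normal(size=(4, n_input)), toy_weights)
n_params = sum(w.size for w in toy_weights)  # 10*50 + 50 + 50*1 + 1
```

With these sizes the network has 601 trainable parameters, which is small relative to the number of training rows.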

## Model trained on Oct-Nov data, labelled with Dec

In [22]:
labels = features['labels'].to_numpy()
features_pure = (features.drop(columns=['labels'])).to_numpy()
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features_pure)
num_node = 50

model_oct_nov = define_model(features_scaled.shape[1], 1, num_node)
model_oct_nov.compile(loss='binary_crossentropy', optimizer=tf.keras.optimizers.Adam(learning_rate=1e-2), metrics=['accuracy'])

X_train, X_test, y_train, y_test = train_test_split(features_scaled, labels, test_size=0.2, random_state=42)
hist_oct_nov = model_oct_nov.fit(X_train, y_train, epochs=50, batch_size=2048)
test_loss, test_accuracy = model_oct_nov.evaluate(X_test, y_test)

print("Test Loss:", test_loss)
print("Test Accuracy:", test_accuracy)

model_oct_nov_weights = model_oct_nov.get_weights()


Epoch 1/50
252/252 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.7940 - loss: 0.7934
Epoch 2/50
252/252 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.8174 - loss: 0.4272
Epoch 3/50
252/252 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.8173 - loss: 0.4246
Epoch 4/50
252/252 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.8169 - loss: 0.4213
Epoch 5/50
252/252 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.8190 - loss: 0.4190
Epoch 6/50
252/252 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.8181 - loss: 0.4189
Epoch 7/50
252/252 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.8178 - loss: 0.4185
Epoch 8/50
252/252 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.8182 - loss: 0.4175
Epoch 9/50
252/252 ━━━━━━━━━━━━━━━━━━━

## Model trained on Oct-Nov-Dec data, labelled with Jan

In [23]:
# Time boundary, OCT, Nov, Dec
purchase_oct_nov_dec = purchase[(purchase['PURCHASE_DATE'].dt.year == 2018) &
                                    ((purchase['PURCHASE_DATE'].dt.month == 10) |
                                     (purchase['PURCHASE_DATE'].dt.month == 11) |
                                     (purchase['PURCHASE_DATE'].dt.month == 12))]
grouped_df_oct_nov_dec = purchase_oct_nov_dec.groupby('MAGIC_KEY') # Group by MAGIC_KEY
features = feature_extraction(purchase_oct_nov_dec,grouped_df_oct_nov_dec,box)

# Labelling 
features = Labelling_features(features,purchase, 2019, 1) # for January of 2019

# Prepare labels and scaled features
labels = features['labels'].to_numpy()
features_pure = (features.drop(columns=['labels'])).to_numpy()
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features_pure)
num_node = 50

# Model defining and setting previous weights
model_oct_nov_dec = define_model(features_scaled.shape[1], 1, num_node)
model_oct_nov_dec.set_weights(model_oct_nov_weights)
model_oct_nov_dec.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Model Training
X_train, X_test, y_train, y_test = train_test_split(features_scaled, labels, test_size=0.2, random_state=42)
hist_oct_nov_dec = model_oct_nov_dec.fit(X_train, y_train, epochs=50, batch_size=2048)
test_loss, test_accuracy = model_oct_nov_dec.evaluate(X_test, y_test)
print("Test Loss:", test_loss)
print("Test Accuracy:", test_accuracy)

model_oct_nov_dec_weights = model_oct_nov_dec.get_weights()

1/6 Extracting Bi-Weekly and Monthly Purchase Count...
2/6 Extracting Average Time between purchase...
3/6 Extracting Days since last purchase...
4/6 Extracting purchase_count and total_spent...
5/6 Extracting total_milk_quantity & total_meat_quantity...
6/6 Extracting num_purchases_first_15_days and num_purchases_last_15_days...

Epoch 1/50
330/330 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - accuracy: 0.8315 - loss: 0.4021
Epoch 2/50
330/330 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.8360 - loss: 0.3948
Epoch 3/50
330/330 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.8363 - loss: 0.3949
Epoch 4/50
330/330 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.8362 - loss: 0.3948
Epoch 5/50
330/330 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.8354 - loss: 0.3952
Epoch 6/50
330/330 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.8360 - loss: 0.3939
Epoch 7/50
330/330 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.8357 - loss: 0.3947
Epoch 8/50
330/330 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.8369 - loss: 0.3930
Epoch 9/50
330/330 ━━━━━━━━

## Model trained on Oct-Nov-Dec-Jan data, labelled with Feb

In [24]:
# Time boundary: Oct, Nov, Dec 2018 and Jan 2019 (each month is tied to its own year)
purchase_oct_nov_dec_jan = purchase[((purchase['PURCHASE_DATE'].dt.year == 2018) &
                                     (purchase['PURCHASE_DATE'].dt.month.isin([10, 11, 12]))) |
                                    ((purchase['PURCHASE_DATE'].dt.year == 2019) &
                                     (purchase['PURCHASE_DATE'].dt.month == 1))]
grouped_df_oct_nov_dec_jan = purchase_oct_nov_dec_jan.groupby('MAGIC_KEY') # Group by MAGIC_KEY
features = feature_extraction(purchase_oct_nov_dec_jan,grouped_df_oct_nov_dec_jan,box)

# Labelling 
features = Labelling_features(features,purchase, 2019, 2) # for February of 2019

# Prepare labels and scaled features
labels = features['labels'].to_numpy()
features_pure = (features.drop(columns=['labels'])).to_numpy()
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features_pure)
num_node = 50

# Model defining and setting previous weights
model_oct_nov_dec_jan = define_model(features_scaled.shape[1], 1, num_node)
model_oct_nov_dec_jan.set_weights(model_oct_nov_dec_weights)
model_oct_nov_dec_jan.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Model Training
X_train, X_test, y_train, y_test = train_test_split(features_scaled, labels, test_size=0.2, random_state=42)
hist_oct_nov_dec_jan = model_oct_nov_dec_jan.fit(X_train, y_train, epochs=50, batch_size=2048)
test_loss, test_accuracy = model_oct_nov_dec_jan.evaluate(X_test, y_test)
print("Test Loss:", test_loss)
print("Test Accuracy:", test_accuracy)

model_oct_nov_dec_jan_weights = model_oct_nov_dec_jan.get_weights()

1/6 Extracting Bi-Weekly and Monthly Purchase Count...
2/6 Extracting Average Time between purchase...
3/6 Extracting Days since last purchase...
4/6 Extracting purchase_count and total_spent...
5/6 Extracting total_milk_quantity & total_meat_quantity...
6/6 Extracting num_purchases_first_15_days and num_purchases_last_15_days...

Epoch 1/50
410/410 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - accuracy: 0.8417 - loss: 0.3809
Epoch 2/50
410/410 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.8431 - loss: 0.3787
Epoch 3/50
410/410 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.8433 - loss: 0.3789
Epoch 4/50
410/410 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.8431 - loss: 0.3790
Epoch 5/50
410/410 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.8430 - loss: 0.3797
Epoch 6/50
410/410 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.8438 - loss: 0.3787
Epoch 7/50
410/410 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.8434 - loss: 0.3789
Epoch 8/50
410/410 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.8429 - loss: 0.3793
Epoch 9/50
410/410 ━━━━━━━━

## Saving and loading weights

In [27]:
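# Save the weights of the final model (trained on Oct-Jan data, labelled with Feb)

model_oct_nov_dec_jan.save_weights('model.weights.h5')

# To restore later, rebuild the same architecture and load the saved file:
# model = define_model(features_scaled.shape[1], 1, num_node)
# model.load_weights('model.weights.h5')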

## Check Feb test

In [28]:
grouped_df = purchase.groupby('MAGIC_KEY') # Group by MAGIC_KEY
features = feature_extraction(purchase,grouped_df,box)
features.shape

1/6 Extracting Bi-Weekly and Monthly Purchase Count...
2/6 Extracting Average Time between purchase...
3/6 Extracting Days since last purchase...
4/6 Extracting purchase_count and total_spent...
5/6 Extracting total_milk_quantity & total_meat_quantity...
6/6 Extracting num_purchases_first_15_days and num_purchases_last_15_days...


(1274087, 10)

In [35]:
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)

In [36]:
# Build the February check set: keys that bought (label 1) and keys that did not (label 0)
bought = pd.read_csv('feb_bought.csv', header=None, names=['magic key'])
not_bought = pd.read_csv('feb_not_bought.csv', header=None, names=['magic key'])
merged_key = pd.concat([bought, not_bought]).drop_duplicates()
merged_key = merged_key.drop(merged_key.index[0])
merged_key['labels'] = merged_key['magic key'].isin(bought['magic key']).astype(int)
merged_key = merged_key.sort_values(by='magic key')
merged_key.shape

(59466, 2)

In [32]:
features_df_reset = features.reset_index() # Reset index of features DataFrame
magic_keys = merged_key['magic key']
filtered_features_df = features_df_reset[features_df_reset['MAGIC_KEY'].isin(magic_keys)]
filtered_features_df = filtered_features_df.sort_values(by='MAGIC_KEY')
filtered_features_df.set_index('MAGIC_KEY', inplace=True)
# filtered_features_df = filtered_features_df.drop(columns=['Label'])
filtered_features_df.shape

(59466, 10)

In [87]:
# suitable for neural networks
X_test = scaler.transform(filtered_features_df)
predictions = model_oct_nov_dec_jan.predict(X_test)
y_pred = (predictions > 0.2).astype(int)
y_true = merged_key['labels'].values
y_pred

1859/1859 ━━━━━━━━━━━━━━━━━━━━ 1s 636us/step


array([[0],
       [1],
       [1],
       ...,
       [1],
       [1],
       [0]])

In [88]:
from sklearn.metrics import accuracy_score
accuracy_score(y_true, y_pred)

0.8314835368109508
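
The notebook tries several decision thresholds by hand (0.2 here, 0.375 for the submission, 0.5 in the scratch cells below). A sketch of a small sweep that picks the threshold with the highest accuracy on a labelled check set follows. `best_threshold` is a hypothetical helper and the toy probabilities are made up.

```python
import numpy as np

def best_threshold(y_true, probs, candidates):
    """Return (threshold, accuracy) for the candidate with the highest accuracy."""
    # Flatten both arrays: comparing shape (n, 1) with shape (n,) would broadcast to (n, n).
    y_true = np.asarray(y_true).ravel()
    probs = np.asarray(probs).ravel()
    scores = [np.mean((probs > t).astype(int) == y_true) for t in candidates]
    best = int(np.argmax(scores))  # on ties, the first candidate wins
    return candidates[best], scores[best]

# Made-up labels and predicted probabilities, shaped like model.predict output.
toy_true = [0, 0, 1, 1, 1]
toy_probs = np.array([[0.1], [0.3], [0.4], [0.6], [0.9]])
toy_t, toy_acc = best_threshold(toy_true, toy_probs, [0.2, 0.375, 0.5])
```

Choosing the threshold on the same set that is used to report accuracy gives an optimistic estimate, so a separate held-out period is preferable when one is available.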

# Check

In [None]:
import pandas as pd
bought = pd.read_csv('feb_bought.csv', header=None, names=['magic key'])
not_bought = pd.read_csv('feb_not_bought.csv', header=None, names=['magic key'])
merged_key = pd.concat([bought, not_bought]).drop_duplicates()
merged_key = merged_key.drop(merged_key.index[0])
# Create the 'labels' column
merged_key['labels'] = merged_key['magic key'].isin(bought['magic key']).astype(int)
# Create the 'labels' column based on whether 'magic key' is in either 'bought' or 'not_bought'
# merged_key['labels'] = merged_key['magic key'].isin(bought['magic key']) | merged_key['magic key'].isin(not_bought['magic key'])
merged_key = merged_key.sort_values(by='magic key')
merged_key.shape

In [None]:
# use this if dropping labels are necessary 
features_df_reset = features.reset_index() # Reset index of features DataFrame
magic_keys = merged_key['magic key']
filtered_features_df = features_df_reset[features_df_reset['MAGIC_KEY'].isin(magic_keys)]
filtered_features_df = filtered_features_df.sort_values(by='MAGIC_KEY')
filtered_features_df.set_index('MAGIC_KEY', inplace=True)
filtered_features_df = filtered_features_df.drop(columns=['labels'], errors='ignore')  # Drop the label column if it is present
filtered_features_df.shape


In [None]:
# suitable for a random forest without PCA; assumes a fitted `classifier`, which is not defined in this notebook
X_test = scaler.transform(filtered_features_df)
y_pred = classifier.predict(X_test)
y_true = merged_key['labels'].values

In [None]:
# suitable for neural networks
X_test = scaler.transform(filtered_features_df)
predictions = model_oct_nov_dec_jan.predict(X_test)
y_pred = (predictions > 0.5).astype(int)
y_true = merged_key['labels'].values


In [None]:
# suitable for PCA-reduced features; assumes a fitted `pca` and `classifier`, neither of which is defined in this notebook
X_test = scaler.transform(filtered_features_df)
X_test = pca.transform(X_test)
y_pred = classifier.predict(X_test)
y_true = merged_key['labels'].values

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_true, y_pred)

# Submission

In [114]:
sample = pd.read_csv('Problem 1/sample submission 1.csv')
problem = pd.read_csv('Problem 1/problem 1.csv')
# features_main =  pd.read_csv('features.csv')
# features_main.set_index('MAGIC_KEY', inplace=True)


In [None]:
features.drop(columns=['labels'], inplace=True, errors='ignore')  # The label column is named 'labels'; skip silently if absent
features.columns

In [None]:
extracted_features_df = features.loc[problem['MAGIC_KEY']]
extracted_features_df.shape
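
`features.loc[...]` raises a `KeyError` if any key in problem 1.csv has no row in `features`, for example a key with no purchases in the data. One possible guard, sketched here on made-up frames, is `reindex` followed by a fill value. Zero-filling is an assumption: it mirrors the `fillna(0)` in `feature_extraction`, but a zero in `Days_Since_Last_Purchase` would read as "bought today", so a different default may suit that column better.

```python
import pandas as pd

# Made-up feature table indexed by MAGIC_KEY, plus a problem file with an unseen key.
features_toy = pd.DataFrame(
    {"Purchase_Count": [3, 1], "Total_Spent": [45.88, 12.0]},
    index=pd.Index(["K1", "K2"], name="MAGIC_KEY"),
)
problem_toy = pd.DataFrame({"MAGIC_KEY": ["K2", "K3", "K1"]})

# reindex keeps the row order of the problem file and inserts NaN rows for unseen keys.
aligned = features_toy.reindex(index=problem_toy["MAGIC_KEY"]).fillna(0)
```

The result has exactly one row per key in the problem file, in the same order, which is what the submission frame relies on.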

In [None]:
# reduced features 
extracted_features_df = features.loc[problem['MAGIC_KEY']]
extracted_features_df

| MAGIC_KEY | Biweekly_Purchase_Count | Monthly_Purchase_Count | Avg_Time_Between_Purchases | Days_Since_Last_Purchase | Purchase_Count | Total_Spent | Total_Milk_Quantity | Total_Meat_Quantity | Num_Purchases_First_15_Days | Num_Purchases_Last_15_Days |
|---|---|---|---|---|---|---|---|---|---|---|
| 28D5BB06356 | 0.250000 | 0.6 | 32.500000 | 2 | 3 | 45.88 | 0.0 | 10.1 | 0.0 | 3.0 |
| 293BEAB4E98 | 0.333333 | 0.8 | 41.333333 | 20 | 4 | 72.12 | 58.0 | 6.2 | 4.0 | 0.0 |
| 2962EE8065C | 0.166667 | 0.4 | 26.000000 | 11 | 2 | 31.92 | 0.0 | 7.2 | 0.0 | 2.0 |
| 2957BE29EA9 | 0.166667 | 0.4 | 29.000000 | 56 | 2 | 35.96 | 16.0 | 4.4 | 2.0 | 0.0 |
| 28E351A0745 | 0.333333 | 0.8 | 29.000000 | 34 | 4 | 63.84 | 0.0 | 14.4 | 0.0 | 4.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 28FB7C09776 | 0.333333 | 0.8 | 37.000000 | 3 | 4 | 75.92 | 48.0 | 8.8 | 1.0 | 3.0 |
| 28E0E3B69BF | 0.166667 | 0.4 | 46.000000 | 0 | 2 | 29.92 | 0.0 | 6.4 | 1.0 | 1.0 |
| 28D343103A7 | 0.166667 | 0.4 | 24.000000 | 23 | 2 | 35.96 | 20.0 | 5.0 | 2.0 | 0.0 |
| 290B1D6D5CB | 0.166667 | 0.4 | 33.000000 | 27 | 2 | 29.92 | 0.0 | 6.5 | 1.0 | 1.0 |


In [116]:
# suitable for neural networks
X_sub = scaler.transform(extracted_features_df)
predictions = model_oct_nov_dec_jan.predict(X_sub)
y_sub = (predictions > 0.375).astype(int)  # Decision threshold used for the submission

y_sub_labels = np.where(y_sub == 0, 'N', 'Y')  # Map 0/1 to the N/Y labels required by the template
np.count_nonzero(y_sub_labels == 'Y')  # Number of keys predicted to purchase

1835/1835 ━━━━━━━━━━━━━━━━━━━━ 1s 629us/step


46729

In [117]:
y_sub_labels.shape

(58689, 1)

In [118]:
(46729/58689)*100

79.62139412837158

In [119]:
nn_submit = np.squeeze(y_sub_labels)
nn_submit

array(['N', 'Y', 'N', ..., 'Y', 'Y', 'Y'], dtype='<U1')

In [120]:
# Create a DataFrame with MAGIC_KEY and PURCHASE columns
submit = pd.DataFrame({'MAGIC_KEY': problem['MAGIC_KEY'], 'PURCHASE': nn_submit})
submit.to_csv('submit_p1_v4_0.2.csv', index=False)
submit.shape

(58689, 2)

In [121]:
# Count the predicted 'Y' and 'N' labels in the submission array
count_Y = np.count_nonzero(nn_submit == 'Y')
count_N = np.count_nonzero(nn_submit == 'N')

# Print the counts
print(f"Number of 'Y': {count_Y}")
print(f"Number of 'N': {count_N}")


Number of 'Y': 46729
Number of 'N': 11960
