In [1]:
import pandas as pd

df = pd.read_csv('/content/budget_data.csv')
print("DataFrame 'df' loaded successfully. Displaying the first 5 rows:")
df.head()

DataFrame 'df' loaded successfully. Displaying the first 5 rows:


                        date    category  amount
0  2022-07-06 05:57:10 +0000  Restuarant    5.50
1  2022-07-06 05:57:27 +0000      Market    2.00
2  2022-07-06 05:58:12 +0000       Coffe   30.10
3  2022-07-06 05:58:25 +0000      Market   17.33
4  2022-07-06 05:59:00 +0000  Restuarant    5.50


In [2]:
print("First 5 rows of the DataFrame:")
print(df.head())

print("\nColumn names and their data types:")
df.info()

print("\nMissing values in each column:")
print(df.isnull().sum())

print("\nDescriptive statistics of numerical columns:")
print(df.describe())

First 5 rows of the DataFrame:
                        date    category  amount
0  2022-07-06 05:57:10 +0000  Restuarant    5.50
1  2022-07-06 05:57:27 +0000      Market    2.00
2  2022-07-06 05:58:12 +0000       Coffe   30.10
3  2022-07-06 05:58:25 +0000      Market   17.33
4  2022-07-06 05:59:00 +0000  Restuarant    5.50

Column names and their data types:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3608 entries, 0 to 3607
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   date      3608 non-null   object 
 1   category  3608 non-null   object 
 2   amount    3608 non-null   float64
dtypes: float64(1), object(2)
memory usage: 84.7+ KB

Missing values in each column:
date        0
category    0
amount      0
dtype: int64

Descriptive statistics of numerical columns:
            amount
count  3608.000000
mean     17.130241
std      84.946260
min       0.050000
25%       4.800000
50%       7.100000
75%      11.505000


In [3]:
print("Unique values in the 'category' column:")
print(df['category'].unique())

Unique values in the 'category' column:
['Restuarant' 'Market' 'Coffe' 'Transport' 'Other' 'Phone' 'Communal'
 'Clothing' 'Motel' 'Travel' 'Rent Car' 'Sport' 'Events' 'Learning'
 'Health' 'Taxi' 'Business lunch' 'Film/enjoyment' 'Tech' 'joy' 'Fuel'
 'business_expenses']


In [4]:
tag_mapping = {
    'Restuarant': 'Food & Dining',
    'Coffe': 'Food & Dining',
    'Market': 'Groceries',
    'Business lunch': 'Food & Dining',
    'Fuel': 'Transportation',
    'Transport': 'Transportation',
    'Taxi': 'Transportation',
    'Rent Car': 'Transportation',
    'Motel': 'Travel & Accommodation',
    'Travel': 'Travel & Accommodation',
    'Communal': 'Utilities',
    'Phone': 'Utilities',
    'Clothing': 'Shopping',
    'Sport': 'Entertainment & Hobbies',
    'Events': 'Entertainment & Hobbies',
    'Film/enjoyment': 'Entertainment & Hobbies',
    'Learning': 'Education',
    'Health': 'Healthcare',
    'Tech': 'Shopping',
    'joy': 'Miscellaneous',
    'business_expenses': 'Business Expenses',
    'Other': 'Miscellaneous'
}

df['Tag'] = df['category'].map(tag_mapping)

print("DataFrame with new 'Tag' column:")
print(df)

DataFrame with new 'Tag' column:
                           date    category  amount             Tag
0     2022-07-06 05:57:10 +0000  Restuarant    5.50   Food & Dining
1     2022-07-06 05:57:27 +0000      Market    2.00       Groceries
2     2022-07-06 05:58:12 +0000       Coffe   30.10   Food & Dining
3     2022-07-06 05:58:25 +0000      Market   17.33       Groceries
4     2022-07-06 05:59:00 +0000  Restuarant    5.50   Food & Dining
...                         ...         ...     ...             ...
3603  2024-09-28 13:31:37 +0000      Market    8.00       Groceries
3604  2024-09-29 02:57:07 +0000   Transport    0.50  Transportation
3605  2024-09-29 04:29:03 +0000      Market    7.40       Groceries
3606  2024-09-29 04:53:24 +0000       Coffe   15.00   Food & Dining
3607  2024-09-29 10:40:38 +0000  Restuarant    8.00   Food & Dining

[3608 rows x 4 columns]
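
`Series.map` with a dict returns `NaN` for any key that is missing from the dict, so a category that is misspelled or left out of `tag_mapping` would silently end up without a tag. The sketch below shows one way to check for that on a small, hypothetical frame; the `'Gift'` label and the names `sample`, `sample_mapping`, and `unmapped` are invented for illustration and do not come from the dataset.

```python
import pandas as pd

# Hypothetical miniature frame; 'Gift' is an invented label absent from the mapping.
sample = pd.DataFrame({"category": ["Coffe", "Market", "Gift"]})
sample_mapping = {"Coffe": "Food & Dining", "Market": "Groceries"}

sample["Tag"] = sample["category"].map(sample_mapping)

# Series.map yields NaN for keys missing from the dict, so isna() flags them.
unmapped = sample.loc[sample["Tag"].isna(), "category"].unique().tolist()
print("Unmapped categories:", unmapped)
```

Running the same check on `df` after the mapping step should print an empty list if all 22 categories are covered.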


In [5]:
transactions_per_tag = df.groupby('Tag').size().sort_values(ascending=False)
total_amount_per_tag = df.groupby('Tag')['amount'].sum().sort_values(ascending=False)

print("Number of transactions per tag:")
print(transactions_per_tag)

print("\nTotal amount spent per tag:")
print(total_amount_per_tag)

Number of transactions per tag:
Tag
Food & Dining              1747
Groceries                   946
Transportation              411
Utilities                   152
Miscellaneous                91
Entertainment & Hobbies      73
Education                    72
Shopping                     53
Healthcare                   46
Travel & Accommodation        9
Business Expenses             8
dtype: int64

Total amount spent per tag:
Tag
Food & Dining              22279.31
Shopping                    7350.50
Groceries                   6451.03
Healthcare                  5896.50
Utilities                   4858.58
Travel & Accommodation      4431.45
Entertainment & Hobbies     4222.40
Education                   2525.31
Transportation              1820.78
Miscellaneous               1570.05
Business Expenses            400.00
Name: amount, dtype: float64
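
The counts above are heavily skewed toward 'Food & Dining', which matters when accuracy is later used as the metric. As a rough sketch, the share of the largest tag can be computed directly from the printed counts; the variable names here are illustrative only.

```python
import pandas as pd

# Transaction counts per tag, copied from the output above.
tag_counts = pd.Series({
    "Food & Dining": 1747,
    "Groceries": 946,
    "Transportation": 411,
    "Utilities": 152,
    "Miscellaneous": 91,
    "Entertainment & Hobbies": 73,
    "Education": 72,
    "Shopping": 53,
    "Healthcare": 46,
    "Travel & Accommodation": 9,
    "Business Expenses": 8,
})

total = int(tag_counts.sum())
# Share of the most frequent tag: the accuracy of always predicting that tag
# over the full dataset.
majority_share = tag_counts.max() / total
print(f"Total transactions: {total}")
print(f"Largest tag: {tag_counts.idxmax()} ({majority_share:.1%})")
```

Any classifier trained on this target should be judged against that majority-class share rather than against zero.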


# Task
Train and evaluate a RandomForestClassifier, GradientBoostingClassifier, and LogisticRegression model to predict the 'Tag' based on date features and 'amount' from the `df` DataFrame, and then summarize which models achieve an accuracy between 80% and 90%.

## Process Date Column

### Subtask:
Convert the 'date' column to datetime objects and extract numerical features such as year, month, day, day of week, and hour, which can be used as predictors for the model.


**Reasoning**:
To prepare the 'date' column for analysis, I will convert it to datetime objects and then extract relevant time-based features such as year, month, day, day of week, and hour, which can serve as numerical predictors for the model.



In [6]:
df['date'] = pd.to_datetime(df['date'])

df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['day_of_week'] = df['date'].dt.dayofweek
df['hour'] = df['date'].dt.hour

print("DataFrame with extracted date features:")
print(df.head())

DataFrame with extracted date features:
                       date    category  amount            Tag  year  month  \
0 2022-07-06 05:57:10+00:00  Restuarant    5.50  Food & Dining  2022      7   
1 2022-07-06 05:57:27+00:00      Market    2.00      Groceries  2022      7   
2 2022-07-06 05:58:12+00:00       Coffe   30.10  Food & Dining  2022      7   
3 2022-07-06 05:58:25+00:00      Market   17.33      Groceries  2022      7   
4 2022-07-06 05:59:00+00:00  Restuarant    5.50  Food & Dining  2022      7   

   day  day_of_week  hour  
0    6            2     5  
1    6            2     5  
2    6            2     5  
3    6            2     5  
4    6            2     5  
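
A variation on the step above is to pass an explicit format string so that parsing does not rely on format inference. This is a minimal sketch on two timestamps taken from the printed rows; it assumes every value in the column follows the `YYYY-MM-DD HH:MM:SS +ZZZZ` layout, which has not been verified for the full file.

```python
import pandas as pd

# Two timestamps copied from the DataFrame output; the full column is assumed
# to follow the same layout.
raw = pd.Series(["2022-07-06 05:57:10 +0000", "2024-09-29 10:40:38 +0000"])

# An explicit format avoids per-value format inference; utc=True gives a
# single timezone-aware dtype.
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S %z", utc=True)

date_features = pd.DataFrame({
    "year": parsed.dt.year,
    "month": parsed.dt.month,
    "day": parsed.dt.day,
    "day_of_week": parsed.dt.dayofweek,  # Monday=0 ... Sunday=6
    "hour": parsed.dt.hour,
})
print(date_features)
```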


## Encode Target Variable

### Subtask:
Encode the 'Tag' column into numerical labels for use as the target variable. (scikit-learn classifiers also accept string class labels directly, so this step is a convenience rather than a strict requirement.)


**Reasoning**:
To encode the 'Tag' column into numerical labels, I will import `LabelEncoder` from `sklearn.preprocessing`, instantiate it, and then apply it to the 'Tag' column to create a new 'Tag_encoded' column.



In [7]:
from sklearn.preprocessing import LabelEncoder

# Instantiate LabelEncoder
label_encoder = LabelEncoder()

# Fit and transform the 'Tag' column
df['Tag_encoded'] = label_encoder.fit_transform(df['Tag'])

print("DataFrame with 'Tag_encoded' column:")
print(df.head())

DataFrame with 'Tag_encoded' column:
                       date    category  amount            Tag  year  month  \
0 2022-07-06 05:57:10+00:00  Restuarant    5.50  Food & Dining  2022      7   
1 2022-07-06 05:57:27+00:00      Market    2.00      Groceries  2022      7   
2 2022-07-06 05:58:12+00:00       Coffe   30.10  Food & Dining  2022      7   
3 2022-07-06 05:58:25+00:00      Market   17.33      Groceries  2022      7   
4 2022-07-06 05:59:00+00:00  Restuarant    5.50  Food & Dining  2022      7   

   day  day_of_week  hour  Tag_encoded  
0    6            2     5            3  
1    6            2     5            4  
2    6            2     5            3  
3    6            2     5            4  
4    6            2     5            3  
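
`LabelEncoder` assigns codes in sorted order of the class names, which is why 'Food & Dining' maps to 3 and 'Groceries' to 4 in the output above. The sketch below rebuilds that lookup from the eleven tag names and shows how to turn codes back into names; `tags`, `encoder`, and `label_to_code` are illustrative names.

```python
from sklearn.preprocessing import LabelEncoder

# The eleven tag names produced by tag_mapping.
tags = [
    "Food & Dining", "Groceries", "Transportation", "Travel & Accommodation",
    "Utilities", "Shopping", "Entertainment & Hobbies", "Education",
    "Healthcare", "Miscellaneous", "Business Expenses",
]

encoder = LabelEncoder().fit(tags)

# classes_ is sorted, and each label's position is its integer code.
label_to_code = {label: code for code, label in enumerate(encoder.classes_)}
for label, code in label_to_code.items():
    print(code, label)

# inverse_transform recovers tag names from codes, e.g. for readable predictions.
decoded = encoder.inverse_transform([3, 4])
print(list(decoded))
```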


## Prepare Features and Target

### Subtask:
Select the engineered date features and the 'amount' column as the independent variables (X) and the encoded 'Tag' column as the dependent variable (y) for model training.


**Reasoning**:
To prepare the data for model training, I will define the independent variables (X) by selecting the engineered date features and the 'amount' column, and the dependent variable (y) by selecting the 'Tag_encoded' column.



In [8]:
features = ['amount', 'year', 'month', 'day', 'day_of_week', 'hour']
X = df[features]
y = df['Tag_encoded']

print("First 5 rows of X (independent variables):")
print(X.head())

print("\nFirst 5 rows of y (dependent variable):")
print(y.head())

First 5 rows of X (independent variables):
   amount  year  month  day  day_of_week  hour
0    5.50  2022      7    6            2     5
1    2.00  2022      7    6            2     5
2   30.10  2022      7    6            2     5
3   17.33  2022      7    6            2     5
4    5.50  2022      7    6            2     5

First 5 rows of y (dependent variable):
0    3
1    4
2    3
3    4
4    3
Name: Tag_encoded, dtype: int64


## Split Data for Training and Testing

### Subtask:
Divide the dataset into training and testing sets to properly evaluate the models' performance on unseen data.


**Reasoning**:
To divide the dataset into training and testing sets, I will import `train_test_split` from `sklearn.model_selection`, then apply it to `X` and `y` with a test size of 20% and a random state of 42 for reproducibility, and finally print the shapes of the resulting sets to verify the split.



In [9]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

Shape of X_train: (2886, 6)
Shape of X_test: (722, 6)
Shape of y_train: (2886,)
Shape of y_test: (722,)
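
Because some tags are rare (the smallest has 8 transactions), a plain random split can leave very few of them in the test set. One option, not used in the cell above, is the `stratify` argument of `train_test_split`, which keeps class proportions roughly equal in both parts. The following is a sketch on synthetic data; `X_demo` and `y_demo` are stand-ins, not the notebook's `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 80 samples of class 0 and 20 of class 1.
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 80 + [1] * 20)

# stratify=y_demo preserves the 80/20 class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

minority_in_test = int((y_te == 1).sum())
print("Test size:", len(y_te), "| minority samples in test:", minority_in_test)
```

Applying this to the real data would change the split, so the accuracies reported below would not be directly comparable.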


## Train and Evaluate Classification Models

### Subtask:
Train a RandomForestClassifier, GradientBoostingClassifier, and LogisticRegression model using the training data. Then, calculate their accuracy on the test set.


**Reasoning**:
To train and evaluate the classification models, I will import the necessary classifiers and accuracy metric from scikit-learn. Then, I will initialize and train each model (RandomForestClassifier, GradientBoostingClassifier, and LogisticRegression) on the training data, make predictions on the test data, and calculate their respective accuracy scores. Finally, I will print the accuracy of each model.



In [10]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Initialize and train RandomForestClassifier
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)
rf_predictions = rf_model.predict(X_test)
rf_accuracy = accuracy_score(y_test, rf_predictions)

# Initialize and train GradientBoostingClassifier
gb_model = GradientBoostingClassifier(random_state=42)
gb_model.fit(X_train, y_train)
gb_predictions = gb_model.predict(X_test)
gb_accuracy = accuracy_score(y_test, gb_predictions)

# Initialize and train LogisticRegression
lr_model = LogisticRegression(max_iter=1000, random_state=42)
lr_model.fit(X_train, y_train)
lr_predictions = lr_model.predict(X_test)
lr_accuracy = accuracy_score(y_test, lr_predictions)

print(f"RandomForestClassifier Accuracy: {rf_accuracy:.4f}")
print(f"GradientBoostingClassifier Accuracy: {gb_accuracy:.4f}")
print(f"LogisticRegression Accuracy: {lr_accuracy:.4f}")

RandomForestClassifier Accuracy: 0.6482
GradientBoostingClassifier Accuracy: 0.6565
LogisticRegression Accuracy: 0.5235


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


**Reasoning**:
The previous Logistic Regression model did not converge due to reaching the iteration limit. I will increase `max_iter` to `5000` to allow the model more time to converge and achieve a better fit.



In [11]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Initialize and train RandomForestClassifier
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)
rf_predictions = rf_model.predict(X_test)
rf_accuracy = accuracy_score(y_test, rf_predictions)

# Initialize and train GradientBoostingClassifier
gb_model = GradientBoostingClassifier(random_state=42)
gb_model.fit(X_train, y_train)
gb_predictions = gb_model.predict(X_test)
gb_accuracy = accuracy_score(y_test, gb_predictions)

# Initialize and train LogisticRegression
lr_model = LogisticRegression(max_iter=5000, random_state=42) # Increased max_iter
lr_model.fit(X_train, y_train)
lr_predictions = lr_model.predict(X_test)
lr_accuracy = accuracy_score(y_test, lr_predictions)

print(f"RandomForestClassifier Accuracy: {rf_accuracy:.4f}")
print(f"GradientBoostingClassifier Accuracy: {gb_accuracy:.4f}")
print(f"LogisticRegression Accuracy: {lr_accuracy:.4f}")

RandomForestClassifier Accuracy: 0.6482
GradientBoostingClassifier Accuracy: 0.6565
LogisticRegression Accuracy: 0.5332


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


**Reasoning**:
The `LogisticRegression` model continues to show a `ConvergenceWarning` and its accuracy is low, suggesting that feature scaling might help with convergence and performance. I will scale the features (X_train and X_test) using `StandardScaler` and then re-train and re-evaluate only the `LogisticRegression` model with the scaled data.



In [12]:
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Instantiate StandardScaler
scaler = StandardScaler()

# Fit and transform X_train, transform X_test
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Initialize and train RandomForestClassifier (re-running for context/comparison)
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)
rf_predictions = rf_model.predict(X_test)
rf_accuracy = accuracy_score(y_test, rf_predictions)

# Initialize and train GradientBoostingClassifier (re-running for context/comparison)
gb_model = GradientBoostingClassifier(random_state=42)
gb_model.fit(X_train, y_train)
gb_predictions = gb_model.predict(X_test)
gb_accuracy = accuracy_score(y_test, gb_predictions)

# Initialize and train LogisticRegression with scaled data
lr_model_scaled = LogisticRegression(max_iter=5000, random_state=42)
lr_model_scaled.fit(X_train_scaled, y_train)
lr_predictions_scaled = lr_model_scaled.predict(X_test_scaled)
lr_accuracy_scaled = accuracy_score(y_test, lr_predictions_scaled)

print(f"RandomForestClassifier Accuracy: {rf_accuracy:.4f}")
print(f"GradientBoostingClassifier Accuracy: {gb_accuracy:.4f}")
print(f"LogisticRegression (Scaled Features) Accuracy: {lr_accuracy_scaled:.4f}")

RandomForestClassifier Accuracy: 0.6482
GradientBoostingClassifier Accuracy: 0.6565
LogisticRegression (Scaled Features) Accuracy: 0.4945
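
The scaling and fitting steps in the cell above can also be bundled into a single scikit-learn pipeline, so the scaler is always fitted on training data only and applied automatically at prediction time. This is a minimal sketch on synthetic data generated with `make_classification`; the data and the names `pipe` and `demo_accuracy` are assumptions for illustration, and the resulting score says nothing about the budget dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class problem with 6 features, standing in for X and y.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on the training split, then the classifier.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

# score() scales X_te with the training statistics before predicting.
demo_accuracy = pipe.score(X_te, y_te)
print(f"Pipeline accuracy on synthetic data: {demo_accuracy:.4f}")
```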


### Model Performance Summary

Based on the evaluation, none of the trained models (RandomForestClassifier, GradientBoostingClassifier, and LogisticRegression) achieved an accuracy between 80% and 90%.

## Filter Models by Accuracy

### Subtask:
Identify which of the trained models achieved an accuracy between 80% and 90% and display their performance.


**Reasoning**:
I will display the accuracies of all trained models and explicitly state that none of them achieved an accuracy between 80% and 90% as per the instruction.



In [14]:
model_accuracies = {
    "RandomForestClassifier": rf_accuracy,
    "GradientBoostingClassifier": gb_accuracy,
    "LogisticRegression_Scaled": lr_accuracy_scaled
}

print("Model Accuracies:")
for model_name, accuracy in model_accuracies.items():
    print(f"{model_name}: {accuracy:.4f}")


models_in_range = {}
for model_name, accuracy in model_accuracies.items():
    if 0.80 <= accuracy <= 0.90:
        models_in_range[model_name] = accuracy

if models_in_range:
    print("\nModels with accuracy between 80% and 90%:")
    for model_name, accuracy in models_in_range.items():
        print(f"{model_name}: {accuracy:.4f}")


Model Accuracies:
RandomForestClassifier: 0.6482
GradientBoostingClassifier: 0.6565
LogisticRegression_Scaled: 0.4945
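
The cell above prints nothing further when no model falls inside the range, so the "none qualified" conclusion is only implicit in the output. A small variation, sketched here with the accuracies copied from the printed results, adds an explicit fallback message; `reported`, `in_range`, and `best_name` are illustrative names.

```python
# Accuracies copied from the output above.
reported = {
    "RandomForestClassifier": 0.6482,
    "GradientBoostingClassifier": 0.6565,
    "LogisticRegression_Scaled": 0.4945,
}

# Keep only models whose accuracy lies in the closed interval [0.80, 0.90].
in_range = {name: acc for name, acc in reported.items() if 0.80 <= acc <= 0.90}
best_name = max(reported, key=reported.get)

if in_range:
    print("Models with accuracy between 80% and 90%:")
    for name, acc in in_range.items():
        print(f"{name}: {acc:.4f}")
else:
    # Explicit message for the empty case, plus the best model for context.
    print("No model achieved an accuracy between 80% and 90%.")
    print(f"Best model: {best_name} ({reported[best_name]:.4f})")
```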


## Final Task

### Subtask:
Summarize the models that meet the specified accuracy criteria and discuss potential next steps or further analysis.


## Summary:

### Q&A
None of the trained models (RandomForestClassifier, GradientBoostingClassifier, and LogisticRegression) achieved an accuracy between 80% and 90%. The highest accuracy obtained was 65.65% from the GradientBoostingClassifier.

### Data Analysis Key Findings
*   The 'date' column was successfully converted to datetime objects, and five new numerical features (year, month, day, day\_of\_week, hour) were extracted and added to the DataFrame.
*   The 'Tag' categorical target variable was successfully encoded into numerical labels and stored in a new 'Tag\_encoded' column.
*   The dataset was split into training and testing sets with an 80/20 ratio (2886 samples for training, 722 for testing).
*   The `RandomForestClassifier` achieved an accuracy of 64.82% on the test set.
*   The `GradientBoostingClassifier` achieved an accuracy of 65.65% on the test set.
*   The `LogisticRegression` model, after feature scaling, achieved an accuracy of 49.45% on the test set. The earlier unscaled runs scored slightly higher (52.35% with `max_iter=1000` and 53.32% with `max_iter=5000`), but both stopped at the iteration limit with a convergence warning.
*   None of the evaluated models reached the desired accuracy range of 80% to 90%.

### Insights or Next Steps
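*   The features derived from the date and amount alone are insufficient for achieving high predictive accuracy for the 'Tag' classification. For context, 'Food & Dining' accounts for 1,747 of the 3,608 transactions (about 48%), so the scaled `LogisticRegression` result of 49.45% is close to what always predicting that tag would give, and the tree-based models improve on it by roughly 16 to 17 percentage points. Further feature engineering, potentially involving more domain-specific features or text-based features if available (e.g., transaction descriptions), could improve model performance.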
*   Explore more advanced classification models, such as XGBoost or LightGBM, or consider hyperparameter tuning for the existing models to optimize their performance, as the current accuracies are relatively low.
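
Before trying heavier models, it may help to compare candidates against a majority-class baseline under cross-validation rather than a single split. The sketch below does this on synthetic, imbalanced data; the data, the class weights, and the reduced `n_estimators=50` are assumptions chosen to keep the example quick, not settings taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced 3-class problem standing in for the real features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3,
    weights=[0.6, 0.3, 0.1], random_state=42,
)

# Baseline: always predict the most frequent class.
baseline_acc = cross_val_score(
    DummyClassifier(strategy="most_frequent"), X_demo, y_demo, cv=5
).mean()

# Candidate model evaluated with the same 5-fold scheme.
forest_acc = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=42), X_demo, y_demo, cv=5
).mean()

print(f"Majority-class baseline: {baseline_acc:.4f}")
print(f"RandomForestClassifier:  {forest_acc:.4f}")
```

The gap between the two numbers, rather than the raw accuracy, indicates how much signal the features carry.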
