# Homework 1: Crime Data Analyst (Extra) (Individual)
### (For Big Data Analytics and Applications)

## Scenario
As a data science expert, you have taken on a job for a police department in another city. After reviewing the crime data, you are asked to develop a prediction model.

### Plan of attack:
    
1. **We want to use our skills to detect/predict crime in this city. Therefore, we want to build a classifier, which predicts the type of crime when it occurs. (100%)**
    1. Load the training data for a machine learning classifier in _'data/crime_training_data.csv'_ and corresponding training labels in _'data/crime_training_labels.csv'_.
    2. Train at least two kinds of classifiers with the training data set. 
    3. Predict the crime type (labels) of reported crimes in _'data/crime_test_data.csv'_.
    4. Save the predictions in a _.csv_ file.
    5. Compare the predictions of your chosen classifiers and explain the results.

### Hand in:
Hand in the following files in a _.zip_ file:
   - your code in a Jupyter notebook (_.ipynb_) or standard Python source code (_.py_).
   - a _.pdf_ with your findings and plots (in Jupyter Notebook you can create a PDF via File -> Download as -> PDF via LaTeX; this may require installing additional software)
   - your predictions as a _.csv_ file.
   
The file names should include your name(s) or your student ID, e.g. _homework1_extra_M112010001.ipynb_, _homework1_extra_M112010001.py_, or _homework1_extra_M112010001.pdf_.

Submit the _.zip_ file to the Moodle system (moodle2.ntust.edu.tw) before **Tuesday, March 4, 2025, 09:10 AM**.

### Note on implementation:
- You are free to use any classification algorithm that you want. If you find better approaches on the web (there certainly are better, but also more involved, ones), you are free to use those.
- Some proven algorithms, other than the ones covered in the lectures, are:
    - https://towardsdatascience.com/collaborative-filtering-simplified-the-basic-science-behind-recommendation-systems-1d7e7c58cd8/

# Load data

In [1]:
import pandas as pd

# Load the training data
training_data = pd.read_csv('data/crime_training_data.csv')
training_labels = pd.read_csv('data/crime_training_labels.csv')

# Load the test data
test_data = pd.read_csv('data/crime_test_data.csv')
test_labels = pd.read_csv('data/crime_test_labels.csv')
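
Before training, it can help to confirm that the feature and label files line up row by row. The following is a minimal sketch under assumptions: the helper `check_alignment`, the toy column names, and the toy values are hypothetical stand-ins and are not taken from the actual CSV files.

```python
import pandas as pd


def check_alignment(features: pd.DataFrame, labels: pd.DataFrame) -> pd.Series:
    """Check that features and labels describe the same samples; return class counts."""
    if len(features) != len(labels):
        raise ValueError(
            f"Row mismatch: {len(features)} feature rows vs {len(labels)} label rows"
        )
    if labels.shape[1] != 1:
        raise ValueError(f"Expected a single label column, got {labels.shape[1]}")
    # Count how often each class occurs to spot class imbalance early
    return labels.iloc[:, 0].value_counts()


# Toy stand-ins for the real CSV contents (hypothetical columns and values)
toy_features = pd.DataFrame(
    {"hour": [1, 13, 22, 8], "is_weekend": [True, False, True, False]}
)
toy_labels = pd.DataFrame({"crime_type": ["THEFT", "ASSAULT", "THEFT", "FRAUD"]})

class_counts = check_alignment(toy_features, toy_labels)
print(class_counts)
```

With the real data, the same call would presumably be `check_alignment(training_data, training_labels)`.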

# To PyTorch Tensor

In [2]:
import torch
from sklearn.preprocessing import LabelEncoder

X_train = training_data.values
y_train = training_labels.values.ravel()  # Convert labels to a 1D array
X_test = test_data.values
y_test = test_labels.values.ravel()

label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)
y_test_encoded = label_encoder.transform(y_test)

# Cast all features to float so that boolean columns become 0.0/1.0
X_train = X_train.astype(float)
X_test = X_test.astype(float)

X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train_encoded, dtype=torch.long)
X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test_encoded, dtype=torch.long)

train_dataset = torch.utils.data.TensorDataset(X_train_tensor, y_train_tensor)
test_dataset = torch.utils.data.TensorDataset(X_test_tensor, y_test_tensor)

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=32, shuffle=False)
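
The label encoding step above can be illustrated in isolation. This is a small sketch with made-up label strings (the real class names in the data set are not shown here); it demonstrates that `LabelEncoder` maps sorted class names to integer ids and that `inverse_transform` recovers the original strings.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical label strings, used only for illustration
toy_labels = np.array(["THEFT", "ASSAULT", "THEFT", "FRAUD"])

encoder = LabelEncoder()
encoded = encoder.fit_transform(toy_labels)   # integer ids, assigned in sorted class order
decoded = encoder.inverse_transform(encoded)  # back to the original strings
num_classes = len(encoder.classes_)           # number of distinct classes seen in fit

print(encoder.classes_, encoded, decoded, num_classes)
```

Note that `transform` raises a `ValueError` for a label that was not seen during `fit`, so the test labels must only contain classes that also occur in the training labels.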

# Model

## NN

In [3]:
import torch.nn as nn
import torch.optim as optim

class CrimeClassifier(nn.Module):
    def __init__(self, input_size, hidden_size=128, num_classes=4):
        super(CrimeClassifier, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.fc3 = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x)) 
        x = self.fc3(x)
        return x


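# Initialize the model; size the output layer from the encoded labels
# instead of relying on the default of 4 classes
input_size = X_train.shape[1]
model = CrimeClassifier(input_size=input_size, num_classes=len(label_encoder.classes_))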

# Define the loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Train the model
num_epochs = 120

for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0
    
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    
    print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {running_loss/len(train_loader):.4f}")

# Test the model
model.eval()
correct = 0
total = 0

with torch.no_grad():
    for inputs, labels in test_loader:
        outputs = model(inputs)
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

accuracy = 100 * correct / total
print(f"Test Accuracy: {accuracy:.2f}%")

Epoch [1/120], Loss: 1.9295
Epoch [2/120], Loss: 1.3930
Epoch [3/120], Loss: 1.3331
Epoch [4/120], Loss: 1.3434
Epoch [5/120], Loss: 1.2638
Epoch [6/120], Loss: 1.2281
Epoch [7/120], Loss: 1.1607
Epoch [8/120], Loss: 1.1305
Epoch [9/120], Loss: 1.1163
Epoch [10/120], Loss: 1.1033
Epoch [11/120], Loss: 1.0768
Epoch [12/120], Loss: 1.0731
Epoch [13/120], Loss: 1.0219
Epoch [14/120], Loss: 1.0555
Epoch [15/120], Loss: 1.0231
Epoch [16/120], Loss: 0.9882
Epoch [17/120], Loss: 0.9571
Epoch [18/120], Loss: 1.0023
Epoch [19/120], Loss: 0.9424
Epoch [20/120], Loss: 0.9593
Epoch [21/120], Loss: 0.9354
Epoch [22/120], Loss: 0.9171
Epoch [23/120], Loss: 0.8971
Epoch [24/120], Loss: 0.9000
Epoch [25/120], Loss: 0.9089
Epoch [26/120], Loss: 0.9451
Epoch [27/120], Loss: 0.9328
Epoch [28/120], Loss: 0.9235
Epoch [29/120], Loss: 0.9026
Epoch [30/120], Loss: 0.8995
Epoch [31/120], Loss: 0.9010
Epoch [32/120], Loss: 0.9075
Epoch [33/120], Loss: 0.8635
Epoch [34/120], Loss: 0.9073
Epoch [35/120], Loss: 0
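
The assignment asks for the predictions to be saved as a _.csv_ file. One possible way to do this for the network is sketched below, under assumptions: the helper `predict_labels`, the output file name `nn_predictions.csv`, and the column name `predicted_crime_type` are hypothetical, and an untrained linear layer on random inputs stands in for the trained `CrimeClassifier` so that the block runs on its own.

```python
import pandas as pd
import torch
import torch.nn as nn
from sklearn.preprocessing import LabelEncoder
from torch.utils.data import DataLoader, TensorDataset


def predict_labels(model, loader, label_encoder):
    """Run the model over a loader of feature batches and return decoded class names."""
    model.eval()
    batches = []
    with torch.no_grad():
        for (inputs,) in loader:
            logits = model(inputs)
            batches.append(logits.argmax(dim=1))  # most likely class id per row
    class_ids = torch.cat(batches).numpy()
    return label_encoder.inverse_transform(class_ids)


# Toy stand-ins (hypothetical): three classes, an untrained model, random features
torch.manual_seed(0)
toy_encoder = LabelEncoder().fit(["ASSAULT", "FRAUD", "THEFT"])
toy_model = nn.Linear(3, len(toy_encoder.classes_))
toy_loader = DataLoader(TensorDataset(torch.randn(5, 3)), batch_size=2, shuffle=False)

predicted = predict_labels(toy_model, toy_loader, toy_encoder)

# One row per test sample, in the same order as the test data
pd.DataFrame({"predicted_crime_type": predicted}).to_csv("nn_predictions.csv", index=False)
```

Keeping `shuffle=False` on the loader matters here, since the rows of the saved file should stay in the same order as the rows of the test data.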

## Random Forest

In [4]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score

# Assume training_data, training_labels, test_data, test_labels are already defined

# Encode labels into numeric values
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(training_labels.values.ravel())
y_test_encoded = label_encoder.transform(test_labels.values.ravel())

# Create a Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model
rf_model.fit(training_data.values, y_train_encoded)

# Predict on the test set
y_pred = rf_model.predict(test_data.values)

# Calculate accuracy
accuracy = accuracy_score(y_test_encoded, y_pred)
print(f"Test Accuracy: {accuracy * 100:.2f}%")

Test Accuracy: 65.00%
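
To compare the two classifiers beyond a single accuracy number, their predictions can be put side by side. The sketch below is one possible approach, not a prescribed method: the helper `compare_predictions` and the toy arrays are hypothetical, and in the notebook the inputs would presumably be the encoded test labels, the network's predicted class ids, and `y_pred` from the random forest.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix


def compare_predictions(y_true, pred_a, pred_b):
    """Summarize how two classifiers perform and how often they agree."""
    y_true, pred_a, pred_b = (np.asarray(v) for v in (y_true, pred_a, pred_b))
    return {
        "accuracy_a": accuracy_score(y_true, pred_a),
        "accuracy_b": accuracy_score(y_true, pred_b),
        # Fraction of samples on which both models predict the same class
        "agreement": float(np.mean(pred_a == pred_b)),
        # Fraction of samples that both models get wrong
        "both_wrong": float(np.mean((pred_a != y_true) & (pred_b != y_true))),
    }


# Toy encoded labels and predictions (hypothetical values)
y_true = [0, 1, 2, 2, 1, 0]
nn_pred = [0, 1, 2, 1, 1, 2]
rf_pred = [0, 2, 2, 1, 1, 0]

summary = compare_predictions(y_true, nn_pred, rf_pred)
rf_confusion = confusion_matrix(y_true, rf_pred)  # rows: true class, columns: predicted class

print(summary)
print(rf_confusion)
```

A high `both_wrong` rate suggests samples that are hard for either model, while low agreement combined with similar accuracy suggests the models make different kinds of mistakes, which is useful material when explaining the results.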
