# Baseline Learning

This Python notebook trains a baseline model using linear classification. Because our labels are numeric codes with no meaningful relationship between the numbers, we chose a classification approach with one-hot encoding. The notebook first processes the important features, then replaces null or 0 values, and finally fits the classifier and visualizes the results.

**Authors:** Kevin Lu, Shrusti Jain, Smeet Patel, Taobo Liao

# Imports and Graph Configurations

In [15]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model

In [16]:
#for some reason, this needs to be in a separate cell
params={
    "font.size":15,
    "lines.linewidth":5,
}
plt.rcParams.update(params)

In [17]:
#download train and debug
!python -m pip install gdown
!gdown 1enR3DLH7iDuI0mG8rV3Z21tPhdZZRXOv
!gdown 1zeyltSH_KaN0qQCRCiZR8kXOG6VUXU9T

Defaulting to user installation because normal site-packages is not writeable


Downloading...
From (original): https://drive.google.com/uc?id=1enR3DLH7iDuI0mG8rV3Z21tPhdZZRXOv
From (redirected): https://drive.google.com/uc?id=1enR3DLH7iDuI0mG8rV3Z21tPhdZZRXOv&confirm=t&uuid=c99003b5-1435-4671-8667-37991bc04ae3
To: c:\Users\jain9\Predicting_LA_Crimes\src\train.pkl


In [18]:
crime_df_train = pd.read_pickle('train.pkl')
crime_df_debug = pd.read_pickle('debug.pkl')

# Data Conversion


First, we convert some of the categorical variables to one-hot encodings, and also remove features that we view as uninformative or carrying duplicate information.


An additional column, 'Vict Age Was 0', flags whether the victim age is 0. An age of 0 indicates that the crime did not involve a victim or that the victim is unidentified.

In [19]:
# Add a binary column indicating if Vict Age is 0
crime_df_train['Vict Age Was 0'] = (crime_df_train['Vict Age'] == 0).astype(int)

# Select relevant columns for analysis
selected_columns = [
    'Status',
    'Weapon Used Cd',
    'Vict Descent',
    'Vict Sex',
    'Vict Age',
    'Mocodes',
    'Crm Cd',
    'Part 1-2',
    'Rpt Dist No',
    'AREA',
    'TIME OCC',
    'DATE OCC',
    'Premis Cd',
    'Vict Age Was 0'
]

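# Create a DataFrame with only the selected columns
# (.copy() makes it independent of crime_df_train, so later column
# assignments do not trigger a SettingWithCopyWarning)
crime_selected_df = crime_df_train[selected_columns].copy()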
crime_selected_df.head()

Unnamed: 0,Status,Weapon Used Cd,Vict Descent,Vict Sex,Vict Age,Mocodes,Crm Cd,Part 1-2,Rpt Dist No,AREA,TIME OCC,DATE OCC,Premis Cd,Vict Age Was 0
0,AA,,O,M,0,,510,1,784,7,2130,2020-03-01,101.0,1
1,IC,,O,M,47,1822 1402 0344,330,1,182,1,1800,2020-02-08,128.0,0
2,IC,,X,X,19,0344 1251,480,1,356,3,1700,2020-11-04,502.0,0
3,IC,,O,M,19,0325 1501,343,1,964,9,2037,2020-03-10,405.0,0
4,IC,,H,M,28,1822 1501 0930 2004,354,2,666,6,1200,2020-08-17,102.0,0


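Next we convert TIME OCC to minutes from midnight and one-hot encode Status, Weapon Used Cd, Victim Descent, and Victim Sex. A multi-hot encoder for Mocodes is also defined, but its call is commented out below, so Mocodes is left unencoded in this baseline.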
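
To illustrate what a multi-hot encoding of space-separated codes looks like, here is a minimal sketch on toy data. It uses pandas' `Series.str.get_dummies` rather than the hand-written encoder in this notebook, and the series `mocodes` is a made-up stand-in for the real column.

```python
import pandas as pd

# Toy stand-in for the 'Mocodes' column: space-separated codes, one missing row.
mocodes = pd.Series(["1822 1402 0344", None, "0344 1251"])

# Give missing rows a placeholder token so they get their own indicator column,
# then create one 0/1 column per distinct code.
multi_hot = mocodes.fillna("NaN").str.get_dummies(sep=" ")

print(multi_hot.columns.tolist())
print(multi_hot.to_numpy().tolist())
```

Each row can have several ones, one per code present, which is the difference from one-hot encoding.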

In [20]:
def convert_to_minutes(military_time):
    """
    Convert military time to minutes from midnight.

    Parameters:
    military_time (int): Time in military format, e.g., 2305 for 11:05 PM.

    Returns:
    int: Total minutes from midnight.
    """
    # Ensure the time is a four-digit string (e.g., '2305')
    military_time = str(military_time).zfill(4)

    # Extract hours and minutes from the string
    hours = int(military_time[:2])
    minutes = int(military_time[2:])

    # Calculate and return the total minutes from midnight
    total_minutes = hours * 60 + minutes
    return total_minutes

# Apply the convert_to_minutes function to 'TIME OCC' column
crime_selected_df['TIME OCC'] = crime_selected_df['TIME OCC'].apply(convert_to_minutes)

# Function to one-hot encode specified categorical columns
def one_hot_encode(df, columns):
    """
    Apply one-hot encoding to specified columns in the DataFrame.

    Parameters:
    df (pd.DataFrame): The input DataFrame.
    columns (list): List of columns to one-hot encode.

    Returns:
    pd.DataFrame: DataFrame with one-hot encoded columns.
    """
    for column in columns:

        # Create one-hot encoded columns for each category in the column
        one_hot = pd.get_dummies(df[column], prefix=column)

        # Convert one-hot encoded DataFrame to integer type for compactness
        one_hot = one_hot.astype(int)

        # Replace the original column with the one-hot encoded columns
        df[column] = one_hot.values.tolist()
    return df

# Function to multi-hot encode 'Mocodes' column where each row may contain multiple codes
def multi_hot_encode_mocodes(df):
    """
    Multi-hot encode the 'Mocodes' column.

    Parameters:
    df (pd.DataFrame): The input DataFrame.

    Returns:
    pd.DataFrame: DataFrame with 'Mocodes' column as multi-hot encoded vectors.
    """
    # Initialize a set of all unique Mocodes for multi-hot encoding
    all_mocodes = set()
    all_mocodes.add('NaN')

    # Populate the set with unique Mocodes from each row (skipping NaN values)
    for mocode_str in df['Mocodes'].dropna():
        mocode_str = str(mocode_str)
        # split() with no argument matches encode_mocodes below and ignores repeated spaces
        mocodes = mocode_str.split()
        all_mocodes.update(mocodes)

    # Map each Mocode to a unique index in a binary vector
    mocode_index = {mocode: idx for idx, mocode in enumerate(sorted(all_mocodes))}

    # Define a helper function to encode Mocodes into a binary vector
    def encode_mocodes(mocode_str):
        # Split the Mocode string into individual codes, or set to 'NaN' if empty
        if isinstance(mocode_str, str):
            mocodes = mocode_str.split()
        else:
            mocodes = ['NaN']

        # Initialize a zero vector and set indices for each Mocode found
        encoded_vector = [0] * len(mocode_index)
        for mocode in mocodes:
            if mocode in mocode_index:
                encoded_vector[mocode_index[mocode]] = 1
        return encoded_vector

    # Apply the encoding function to the 'Mocodes' column
    df['Mocodes'] = df['Mocodes'].apply(encode_mocodes)
    return df

# Specify columns to one-hot encode
columns_to_encode = ['Status', 'Vict Descent', 'Vict Sex', 'Weapon Used Cd']

# Apply one-hot encoding to specified columns and store the result in a new DataFrame
crime_selected_one_hot_df = one_hot_encode(crime_selected_df.copy(), columns_to_encode)

# crime_selected_one_hot_df = multi_hot_encode_mocodes(crime_selected_one_hot_df.copy())
crime_selected_one_hot_df.head()


Unnamed: 0,Status,Weapon Used Cd,Vict Descent,Vict Sex,Vict Age,Mocodes,Crm Cd,Part 1-2,Rpt Dist No,AREA,TIME OCC,DATE OCC,Premis Cd,Vict Age Was 0
0,"[1, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ...","[0, 0, 0, 1, 0]",0,,510,1,784,7,1290,2020-03-01,101.0,1
1,"[0, 0, 0, 1, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ...","[0, 0, 0, 1, 0]",47,1822 1402 0344,330,1,182,1,1080,2020-02-08,128.0,0
2,"[0, 0, 0, 1, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 1]",19,0344 1251,480,1,356,3,1020,2020-11-04,502.0,0
3,"[0, 0, 0, 1, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ...","[0, 0, 0, 1, 0]",19,0325 1501,343,1,964,9,1237,2020-03-10,405.0,0
4,"[0, 0, 0, 1, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 1, 0]",28,1822 1501 0930 2004,354,2,666,6,720,2020-08-17,102.0,0
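
The row-by-row `convert_to_minutes` call could also be written as a vectorized expression. The following is a minimal sketch, assuming TIME OCC is stored as integers in HHMM form; `time_occ` is a toy series using the TIME OCC values from the first five rows of the dataset, not the real column.

```python
import pandas as pd

# Toy 'TIME OCC' values in HHMM form (e.g. 2130 means 21:30).
time_occ = pd.Series([2130, 1800, 1700, 2037, 1200])

# Hours are the leading digits (integer division by 100);
# minutes are the last two digits (remainder modulo 100).
minutes_from_midnight = (time_occ // 100) * 60 + (time_occ % 100)

print(minutes_from_midnight.tolist())  # [1290, 1080, 1020, 1237, 720]
```

These values agree with the converted TIME OCC column in the output above.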


In [21]:
crime_selected_one_hot_df['DATE OCC INT'] = crime_selected_one_hot_df['DATE OCC'].astype('int64') // (10**9 * 60 * 60 * 24)
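
The divisor in the cell above is the number of nanoseconds in one day, so the result is a count of whole days since 1970-01-01, not seconds. A small sketch on toy dates makes this concrete; it assumes the column has nanosecond resolution, which is made explicit here with a cast.

```python
import pandas as pd

# Two toy dates; the second one appears in the sample rows of this dataset.
dates = pd.Series(pd.to_datetime(["1970-01-02", "2020-03-01"]))

# Nanoseconds per day: 10**9 ns/s * 60 s/min * 60 min/h * 24 h/day.
NS_PER_DAY = 10**9 * 60 * 60 * 24

# Cast to nanosecond resolution first so the integer view is in nanoseconds.
days_since_epoch = dates.astype("datetime64[ns]").astype("int64") // NS_PER_DAY

print(days_since_epoch.tolist())  # [1, 18322]
```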

# Identifying null values (NaN) in the selected features and replacing them with 0

In [22]:
crime_selected_one_hot_df [['Status','Weapon Used Cd','Vict Descent','Vict Sex','Vict Age','Crm Cd','Part 1-2','Rpt Dist No','AREA','TIME OCC','DATE OCC','Premis Cd','Vict Age Was 0']].isna().any()

Status            False
Weapon Used Cd    False
Vict Descent      False
Vict Sex          False
Vict Age          False
Crm Cd            False
Part 1-2          False
Rpt Dist No       False
AREA              False
TIME OCC          False
DATE OCC          False
Premis Cd          True
Vict Age Was 0    False
dtype: bool

In [23]:
crime_selected_one_hot_df['Premis Cd'] = crime_selected_one_hot_df['Premis Cd'].fillna(0)

In [24]:
non_list = crime_selected_one_hot_df[['Vict Age Was 0', 'Vict Age', 'Rpt Dist No', 'AREA', 'TIME OCC','DATE OCC INT','Premis Cd']].to_numpy(dtype=np.float32)

In [25]:
X = np.concatenate([non_list, np.array(crime_selected_one_hot_df['Status'].to_list()), np.array(crime_selected_one_hot_df['Weapon Used Cd'].to_list()), np.array(crime_selected_one_hot_df['Vict Descent'].to_list()), np.array(crime_selected_one_hot_df['Vict Sex'].to_list())], axis=1)

In [26]:
Y = crime_selected_one_hot_df['Crm Cd'].to_numpy()
# unique_classes = np.unique(Y)
# class_to_index = {cls: idx for idx, cls in enumerate(unique_classes)}
# Y_indices = np.array([class_to_index[cls] for cls in Y])
# Y_one_hot = np.zeros((len(Y), len(unique_classes)), dtype=np.float32)
# Y_one_hot[np.arange(len(Y)), Y_indices] = 1
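
The commented-out lines above build one-hot targets by hand. As a hedged sketch of the same idea, scikit-learn's `LabelBinarizer` produces an equivalent encoding; the array `y` below is toy data, not the real Crm Cd column. `RidgeClassifier` performs a similar label binarization internally, which is why the raw codes in `Y` can be passed to it directly.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Toy crime codes; the numeric gaps between them carry no meaning.
y = np.array([510, 330, 480, 330, 354])

binarizer = LabelBinarizer()
y_one_hot = binarizer.fit_transform(y)

print(binarizer.classes_)  # sorted unique codes, one column per class
print(y_one_hot)           # one row per sample, a single 1 per row

# The original codes can be recovered from the one-hot rows.
recovered = binarizer.inverse_transform(y_one_hot)
```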


We first prepare the crime dataset for a classification model that predicts the target variable (Crm Cd). We begin by converting the date field 'DATE OCC' to an integer representing the number of days since 1970-01-01. Next, we check the selected columns for missing values; only 'Premis Cd' contains any, and we set them to 0. We then stack the numeric columns together with the one-hot vectors for Status, Weapon Used Cd, Vict Descent, and Vict Sex into a single feature matrix X.

The data is then split into training and testing sets, with 25% of the data reserved for testing. A ridge classifier (`RidgeClassifier` with `alpha=0`, i.e. no regularization) is trained on this data. The code concludes by printing the model's score on the test set, which for a classifier is the mean accuracy rather than an R² value, followed by the root mean squared error between the predicted and true crime codes. Because the codes are unordered labels, the accuracy is the more meaningful of the two numbers.
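
For a scikit-learn classifier, `score` reports mean accuracy on the given data. The sketch below checks this on synthetic data; the arrays, the three class codes, and `alpha=1.0` (the library default) are illustrative assumptions, not the notebook's actual setup.

```python
import numpy as np
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic features and three arbitrary, unordered class codes.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 4))
codes = np.array([110, 330, 510])
# The label is the code whose feature (among the first three) is largest.
y_toy = codes[np.argmax(X_toy[:, :3], axis=1)]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0
)

clf = RidgeClassifier(alpha=1.0).fit(X_tr, y_tr)
y_hat = clf.predict(X_te)

# Both lines print the same number: the fraction of correct predictions.
print(clf.score(X_te, y_te))
print(accuracy_score(y_te, y_hat))
```

Predictions are always one of the class codes seen during training, never a value in between, which is what distinguishes this from regressing on the code numbers.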

In [27]:
from sklearn import model_selection
from sklearn.metrics import root_mean_squared_error
# Split the data into training (75%) and testing (25%) sets
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, Y, test_size = 0.25)

# Fit a ridge classifier; alpha=0 means no regularization
regr = linear_model.RidgeClassifier(alpha=0)
regr.fit(X_train, y_train)
# Mean accuracy on the test set
print(regr.score(X_test, y_test))
# RMSE between predicted and true crime codes
print(root_mean_squared_error(y_test, regr.predict(X_test)))

0.3881682716675114
224.4139489255895


# Visualization of Classification analysis


In [29]:
from sklearn.model_selection import learning_curve

# Plot 1: Learning Curve
def plot_learning_curve(model, X_train, y_train):
    train_sizes, train_scores, test_scores = learning_curve(
        model, X_train, y_train, cv=5, scoring='neg_root_mean_squared_error', n_jobs=-1, train_sizes=np.linspace(0.1, 1.0, 10)
    )

    # Calculate mean and standard deviation for training and test scores
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)

    plt.figure(figsize=(10, 6))
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std, train_scores_mean + train_scores_std, alpha=0.1, color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std, test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r", label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g", label="Cross-validation score")

    plt.xlabel("Training Set Size")
    plt.ylabel("Negative RMSE")  # matches scoring='neg_root_mean_squared_error' above
    plt.title("Learning Curve")
    plt.legend(loc="best")
    plt.show()

# Plot 2: Residual Plot
def plot_residuals(model, X_test, y_test):
    y_pred = model.predict(X_test)
    residuals = y_test - y_pred

    plt.figure(figsize=(10, 6))
    plt.scatter(y_pred, residuals, alpha=0.6)
    plt.axhline(y=0, color='r', linestyle='--')
    plt.xlabel("Predicted Values")
    plt.ylabel("Residuals")
    plt.title("Residual Plot")
    plt.show()

# Plot 3: Predicted vs. Actual Values
def plot_predictions(model, X_test, y_test):
    y_pred = model.predict(X_test)

    plt.figure(figsize=(10, 6))
    plt.scatter(y_test, y_pred, alpha=0.6)
    plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')  # Line y=x for perfect predictions
    plt.xlabel("Actual Values")
    plt.ylabel("Predicted Values")
    plt.title("Predicted vs Actual Values")
    plt.show()

# Visualize the training process
plot_learning_curve(regr, X_train, y_train)
plot_residuals(regr, X_test, y_test)
plot_predictions(regr, X_test, y_test)



MemoryError: Unable to allocate 476. MiB for an array with shape (532710, 117) and data type float64
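
The size in the error message can be reproduced with simple arithmetic, and the same arithmetic shows what a smaller dtype would save. The sketch below is illustrative only: it also demonstrates, on tiny toy arrays, that concatenating float32 features with integer one-hot arrays yields float64, which is one plausible reason the array in the error is float64.

```python
import numpy as np

n_rows, n_cols = 532710, 117  # shape reported in the MemoryError

def size_mib(dtype):
    """Memory needed for an (n_rows, n_cols) array of the given dtype, in MiB."""
    return n_rows * n_cols * np.dtype(dtype).itemsize / 2**20

print(round(size_mib(np.float64), 1))  # about 476 MiB, as in the error
print(round(size_mib(np.float32), 1))  # half of that

# Mixing float32 and integer arrays in np.concatenate promotes to float64.
promoted = np.concatenate(
    [np.zeros((2, 1), dtype=np.float32), np.zeros((2, 1), dtype=np.int64)],
    axis=1,
)
print(promoted.dtype)

# Casting back to float32 halves the footprint of the combined array.
compact = promoted.astype(np.float32)
```

Whether a float32 feature matrix alone avoids the error depends on the available memory, since `learning_curve` with `cv=5` and `n_jobs=-1` may hold several copies of the data at once.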

## Interpretation
### Graph 1 (Learning Curve):
 Training Score (Red Line): The training score stabilizes around an R² score of approximately 0.604 as the training set size increases, indicating that the model is relatively stable on the training data.

 Cross-Validation Score (Green Line): The cross-validation score stabilizes around an R² score of approximately 0.603, which is lower than the training score but close to it. This small gap suggests that the model is not heavily overfitting.

### Graph 2 (Residual Plot):
 The residuals display a clear pattern, showing an increasing spread as the predicted values increase. This pattern may indicate heteroscedasticity, where the variance of the errors changes with the predicted value.
 The spread is also asymmetric, with many positive residuals for higher predicted values, suggesting that the model might systematically underpredict certain values.

### Graph 3 (Predicted vs. Actual Values Plot):
 The points are widely scattered around the perfect prediction line, indicating a lack of strong correlation between predicted and actual values.
 Many predicted values deviate significantly from the actual values, suggesting that the model struggles to accurately predict the target variable.