In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import os
import sys

🎯 Goal
======
You're trying to model a count outcome (e.g., number of customer visits, events, accidents), so Poisson regression makes sense. Linear regression isn't ideal here because:

    - It assumes a continuous response with normally distributed, constant-variance errors, whereas counts are non-negative integers whose variance typically grows with the mean

    - It can predict negative values, which doesn't make sense for counts
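To make the second point concrete, here is a minimal sketch on made-up numbers (the arrays `x_toy` and `y_toy` are illustrative only, not taken from the crime data). An ordinary least-squares line fitted to skewed counts dips below zero, while passing a linear predictor through `exp` always gives a positive value.

```python
import numpy as np

# Made-up count data that grows quickly with x (illustration only).
x_toy = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_toy = np.array([0.0, 0.0, 1.0, 3.0, 8.0])

# Ordinary least-squares straight line: y ~ intercept + slope * x.
slope, intercept = np.polyfit(x_toy, y_toy, deg=1)
linear_pred = intercept + slope * x_toy

# The fitted line predicts a negative "count" at x = 0.
print(linear_pred[0] < 0)

# Any linear predictor passed through exp() is strictly positive,
# which is why the log link suits count data.
print(np.all(np.exp(linear_pred) > 0))
```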

INPUT:
    - Dataset X (features), shape: [n_samples, n_features]
    - Count target y, shape: [n_samples]

MODEL:
    - Choose a Poisson distribution for the output
    - Link function: log, so λ = exp(X · β)

OBJECTIVE:
    - Use Maximum Likelihood Estimation (MLE) to fit β
    - The log-likelihood for Poisson is:

      LL(β) = ∑ [ y_i * log(λ_i) - λ_i - log(y_i!) ]
           = ∑ [ y_i * (X_i · β) - exp(X_i · β) - log(y_i!) ]

    - Optimize β to maximize LL(β)
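The objective above can be sketched in NumPy as follows. The function names `poisson_log_likelihood` and `poisson_gradient` and the toy arrays are assumptions made for illustration. The gradient used here, X^T (y - λ), follows from differentiating LL(β) term by term.

```python
import math
import numpy as np

def poisson_log_likelihood(beta, X, y):
    """LL(beta) = sum(y_i * (X_i . beta) - exp(X_i . beta) - log(y_i!))."""
    eta = X @ beta  # linear predictor X_i . beta
    # log(y_i!) computed as lgamma(y_i + 1) to avoid huge factorials.
    log_factorial = np.array([math.lgamma(v + 1.0) for v in y])
    return float(np.sum(y * eta - np.exp(eta) - log_factorial))

def poisson_gradient(beta, X, y):
    """Gradient of LL with respect to beta: X^T (y - lambda)."""
    lam = np.exp(X @ beta)
    return X.T @ (y - lam)

# Toy data (made up): an intercept column plus one feature.
X_toy = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y_toy = np.array([1.0, 2.0, 4.0])

beta0 = np.zeros(X_toy.shape[1])
ll0 = poisson_log_likelihood(beta0, X_toy, y_toy)
grad0 = poisson_gradient(beta0, X_toy, y_toy)
print(ll0, grad0)
```

The `log(y_i!)` term does not depend on β, so it can be dropped during optimization; it only matters when reporting the actual likelihood value.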

TRAINING:
    - Implement gradient descent on the negative log-likelihood (equivalently, gradient ascent on LL(β)) by hand, then fit the same model with `scikit-learn`
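A minimal sketch of the hand-written training step, assuming plain batch gradient descent on the mean negative log-likelihood. The function name `fit_poisson_gd`, the learning rate, the iteration count, and the simulated data are all illustrative choices, not values tuned for the Chicago data.

```python
import numpy as np

def fit_poisson_gd(X, y, lr=0.05, n_iter=5000):
    """Batch gradient descent on the mean negative Poisson log-likelihood."""
    n_samples, n_features = X.shape
    beta = np.zeros(n_features)
    for _ in range(n_iter):
        lam = np.exp(X @ beta)               # predicted rates
        grad = X.T @ (lam - y) / n_samples   # gradient of the mean negative LL
        beta -= lr * grad
    return beta

# Simulated data with known coefficients, so the fit can be checked.
rng = np.random.default_rng(0)
n = 2000
X_sim = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + 1 feature
true_beta = np.array([0.5, 0.3])
y_sim = rng.poisson(np.exp(X_sim @ true_beta))

beta_hat = fit_poisson_gd(X_sim, y_sim)
print(beta_hat)
```

On real features with large magnitudes (such as a raw year column), `exp(X @ beta)` can overflow, so standardizing the columns before running a loop like this is usually a good idea.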


Chicago Dataset
=======

Source (Kaggle): https://www.kaggle.com/datasets/utkarshx27/crimes-2001-to-present

"This dataset reflects reported incidents of crime (with the exception of murders where data exists for each victim) that occurred in the City of Chicago from 2001 to present, minus the most recent seven days. Data is extracted from the Chicago Police Department's CLEAR (Citizen Law Enforcement Analysis and Reporting) system. In order to protect the privacy of crime victims, addresses are shown at the block level only and specific locations are not identified. "

In [7]:
ds = pd.read_csv('../data/Crimes_-_2001_to_Present.csv') #this is pretty big

ds.head()

Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
0,10224738,HY411648,09/05/2015 01:30:00 PM,043XX S WOOD ST,486,BATTERY,DOMESTIC BATTERY SIMPLE,RESIDENCE,False,True,...,12.0,61.0,08B,1165074.0,1875917.0,2015,02/10/2018 03:50:01 PM,41.815117,-87.67,"(41.815117282, -87.669999562)"
1,10224739,HY411615,09/04/2015 11:30:00 AM,008XX N CENTRAL AVE,870,THEFT,POCKET-PICKING,CTA BUS,False,False,...,29.0,25.0,06,1138875.0,1904869.0,2015,02/10/2018 03:50:01 PM,41.89508,-87.7654,"(41.895080471, -87.765400451)"
2,11646166,JC213529,09/01/2018 12:01:00 AM,082XX S INGLESIDE AVE,810,THEFT,OVER $500,RESIDENCE,False,True,...,8.0,44.0,06,,,2018,04/06/2019 04:04:43 PM,,,
3,10224740,HY411595,09/05/2015 12:45:00 PM,035XX W BARRY AVE,2023,NARCOTICS,POSS: HEROIN(BRN/TAN),SIDEWALK,True,False,...,35.0,21.0,18,1152037.0,1920384.0,2015,02/10/2018 03:50:01 PM,41.937406,-87.71665,"(41.937405765, -87.716649687)"
4,10224741,HY411610,09/05/2015 01:00:00 PM,0000X N LARAMIE AVE,560,ASSAULT,SIMPLE,APARTMENT,False,True,...,28.0,25.0,08A,1141706.0,1900086.0,2015,02/10/2018 03:50:01 PM,41.881903,-87.755121,"(41.881903443, -87.755121152)"


For this notebook I want to do something simple: use Poisson regression to predict the number of crimes reported in Chicago on each date of a given year. To get there, we are going to need to write some code.

Here is what we need to do:

    - Sort the dataset, 'ds', by date

    - Create working dataset 'X'

    - Write a function which creates our daily target vector, 'y'. Keep a 'current' day variable and a running count. For each row, check whether the day has changed. If the day has changed, update the current day and set the count to one. Else, add the new crime to the count. Append the count for the row in the new column

    - Write a function which creates our yearly target vector, 'y2'. Keep a 'current' year variable and a running count. For each row, check whether the year has changed. If the year has changed, update the current year and set the count to one. Else, add the new crime to the count. Append the count for the row in the new column

    - Add a 'day_of_week' column to X

    - Add a 'month' column to X

    - Add 'day_of_year' column to X

    - Add 'weekend' column to X

    - Add 'holiday' column to X

    - Add a 'lag_time' column to X holding the previous day's crime count

    - Add 'trend_over_time' column to X

    - Add 'week_of_year' column to X

    - Add 'year' column to X

    - Add intercept, 1, to X

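    - Create beta variable with one coefficient per column of X: np.zeros(X.shape[1])

    - Compute the Poisson log-likelihood LL(β) directly with NumPy (sklearn.metrics has no log-likelihood function; `mean_poisson_deviance` is the closest built-in)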
    
    - write gradient descent loop
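The data-preparation steps above can be sketched as follows. This is a sketch under stated assumptions, not the notebook's final code: it swaps the row-by-row counting loop for a pandas `groupby`, uses a tiny made-up frame `ds_demo` in place of `ds` (same 'Date' string format), and assumes US federal holidays for the 'holiday' column, which the plan above does not specify.

```python
import numpy as np
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

# Tiny stand-in for `ds`; the rows are made up for illustration.
ds_demo = pd.DataFrame({"Date": [
    "09/05/2015 01:30:00 PM",
    "09/04/2015 11:30:00 AM",
    "09/07/2015 08:00:00 AM",
    "09/05/2015 12:45:00 PM",
]})

# Parse and sort by date, then count crimes per calendar day.
dates = pd.to_datetime(ds_demo["Date"], format="%m/%d/%Y %I:%M:%S %p").sort_values()
daily = ds_demo.assign(day=dates.dt.floor("D")).groupby("day").size()
daily = daily.asfreq("D", fill_value=0)  # days with no rows get a count of 0

# Calendar features, one row per day.
X_demo = pd.DataFrame(index=daily.index)
X_demo["day_of_week"] = X_demo.index.dayofweek          # Monday = 0
X_demo["month"] = X_demo.index.month
X_demo["day_of_year"] = X_demo.index.dayofyear
X_demo["weekend"] = (X_demo.index.dayofweek >= 5).astype(int)
holidays = USFederalHolidayCalendar().holidays(
    start=daily.index.min(), end=daily.index.max()
)  # assumption: US federal holidays
X_demo["holiday"] = X_demo.index.isin(holidays).astype(int)
X_demo["lag_time"] = daily.shift(1)                     # yesterday's count
X_demo["trend_over_time"] = np.arange(len(X_demo))
X_demo["week_of_year"] = X_demo.index.isocalendar().week.astype(int)
X_demo["year"] = X_demo.index.year
X_demo["intercept"] = 1.0

# The first day has no "yesterday", so drop it and align the target.
X_demo = X_demo.dropna()
y_demo = daily.loc[X_demo.index]

beta = np.zeros(X_demo.shape[1])  # one coefficient per column
print(X_demo)
print(y_demo)
```

Grouping by day replaces the manual "current day" bookkeeping, and grouping by `dates.dt.year` instead would give the yearly counts for 'y2' in the same way.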

After we get that working, we will implement a simple version using the sklearn library.
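For the scikit-learn version, one option is `PoissonRegressor` from `sklearn.linear_model`, which fits a Poisson GLM with a log link. The sketch below runs on simulated data, and the settings (`alpha=0.0` to switch off the default L2 penalty, `max_iter=300`) are illustrative assumptions rather than choices made in this notebook.

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor
from sklearn.metrics import mean_poisson_deviance

# Simulated data with known coefficients (illustration only).
rng = np.random.default_rng(0)
n = 2000
features = rng.normal(size=(n, 2))
true_coef = np.array([0.3, -0.2])
true_intercept = 0.5
counts = rng.poisson(np.exp(true_intercept + features @ true_coef))

# alpha=0.0 disables regularization so the fit matches plain MLE.
model = PoissonRegressor(alpha=0.0, max_iter=300)
model.fit(features, counts)

pred = model.predict(features)                  # predicted rates, always > 0
deviance = mean_poisson_deviance(counts, pred)  # lower is better
print(model.intercept_, model.coef_, deviance)
```

`PoissonRegressor` fits its own intercept by default, so the explicit column of ones from the hand-written version should be left out of the feature matrix here.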