## Machine Learning

Perform the data analysis in a “batch” manner, using machine learning to predict events such as days with a high number of tickets. In addition, devise and implement at least one other interesting learning problem.
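The assignment does not say what counts as a "high" number of tickets, so the target has to be defined before any model can be trained. The sketch below shows one possible definition, assuming a day is "high" when its count exceeds the 75th percentile for its own borough. The threshold, the toy counts, and the `high_ticket_day` column name are all assumptions made for illustration.

```python
import pandas as pd

# Toy daily counts; the column names mirror the notebook's daily_tickets frame.
daily = pd.DataFrame({
    "violation_county": ["BX", "BX", "BX", "BX", "K", "K", "K", "K"],
    "ticket_count": [10, 12, 11, 40, 100, 90, 300, 95],
})

# Assumption: "high" means above the 75th percentile of the same borough,
# so busy and quiet boroughs are each judged against their own baseline.
threshold = daily.groupby("violation_county")["ticket_count"].transform(
    lambda counts: counts.quantile(0.75)
)
daily["high_ticket_day"] = daily["ticket_count"] > threshold

print(daily)
```

A per-borough threshold keeps the classes from collapsing into "large borough versus small borough"; a single global threshold would be simpler but would mostly encode borough size.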

You will need to appropriately transform the augmented data. 
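What "appropriately transform" means depends on the chosen model, but most estimators need numeric inputs. The following is a minimal sketch of one plausible transformation: calendar features derived from the date, plus a one-hot encoding of the borough. The toy rows and the derived column names (`day_of_week`, `is_weekend`, `county_*`) are assumptions; note that the notebook stores `date` as Python `date` objects, so it would need `pd.to_datetime` first.

```python
import pandas as pd

# Toy rows shaped like the merged daily_tickets frame (values are made up).
frame = pd.DataFrame({
    "violation_county": ["BX", "K", "BX"],
    "date": pd.to_datetime(["2023-01-02", "2023-01-07", "2023-07-04"]),
    "event_count": [0.0, 3.0, 12.0],
    "is_rainy": [False, True, False],
})

features = pd.DataFrame({
    "day_of_week": frame["date"].dt.dayofweek,       # Monday=0 ... Sunday=6
    "month": frame["date"].dt.month,
    "is_weekend": frame["date"].dt.dayofweek >= 5,   # Saturday or Sunday
    "event_count": frame["event_count"],
    "is_rainy": frame["is_rainy"].astype(int),
})

# One-hot encode the borough so linear models do not read it as an ordinal value.
county_dummies = pd.get_dummies(frame["violation_county"], prefix="county", dtype=int)
features = features.join(county_dummies)

print(features)
```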

Ensure that no single worker has enough memory to store and process the entire dataset (e.g., 8 GB per worker). Use at least three kinds of supervised machine learning algorithms:

1. One of the existing distributed algorithms from Dask-ML
2. A sophisticated third-party algorithm which “natively” supports distributed computing (such as XGBoost or LightGBM)
3. One of the common scikit-learn algorithms that supports incremental learning via `partial_fit`.
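Of the three options, `partial_fit` is the easiest to illustrate without a cluster. The sketch below trains scikit-learn's `SGDClassifier` one chunk at a time, so only a single chunk is in memory at any moment. The synthetic `make_chunk` helper is hypothetical and stands in for reading one partition of the real feature matrix; the choice of `SGDClassifier` is also an assumption, since any estimator exposing `partial_fit` would do.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)


def make_chunk(n_rows=500):
    """Hypothetical stand-in for loading one partition (synthetic data)."""
    X = rng.normal(size=(n_rows, 4))
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
    return X, y


model = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # every class label must be declared on the first call

# Stream the chunks through the model; each call updates the existing weights.
for _ in range(20):
    X_chunk, y_chunk = make_chunk()
    model.partial_fit(X_chunk, y_chunk, classes=classes)

# Evaluate on a held-out chunk the model has not seen.
X_test, y_test = make_chunk(2000)
accuracy = model.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.3f}")
```

On a Dask cluster the same loop can be driven by Dask-ML's `dask_ml.wrappers.Incremental`, which passes the blocks of a Dask array to `partial_fit` in sequence.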

For all three scenarios compare performance in terms of loss (error), scalability, time, and total memory consumption.
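To compare the three approaches on equal terms, it helps to wrap each training run in the same measurement harness. The sketch below uses only the standard library; the `measure` helper is a hypothetical name. One important caveat: `tracemalloc` only sees Python allocations in the local process, so for distributed runs the total memory figure would have to come from the Dask dashboard or the cluster's own accounting instead.

```python
import time
import tracemalloc


def measure(fn, *args, **kwargs):
    """Run fn and return (result, elapsed seconds, peak traced memory in MiB)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = fn(*args, **kwargs)
        elapsed = time.perf_counter() - start
        _, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak_bytes / 2**20


# Usage with a trivial workload; a real call would pass the training function.
result, seconds, peak_mib = measure(lambda: sum(x * x for x in range(100_000)))
print(f"result={result}, time={seconds:.4f}s, peak={peak_mib:.3f} MiB")
```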

Note: scalability must be tested on the Arnes cluster by increasing the number of workers and observing the total processing time.
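Once the processing time has been recorded for several worker counts, speedup and parallel efficiency summarize how well a method scales. The sketch below assumes the timings are collected by hand into a dictionary; the `scaling_table` helper is hypothetical, and the numbers in `example` are placeholders chosen for easy arithmetic, not measurements from the Arnes cluster.

```python
def scaling_table(timings):
    """timings maps worker count -> wall-clock seconds for the same job.

    Returns rows of (workers, seconds, speedup, efficiency), relative to the
    run with the fewest workers.
    """
    base_workers = min(timings)
    base_time = timings[base_workers]
    rows = []
    for workers in sorted(timings):
        speedup = base_time / timings[workers]
        efficiency = speedup / (workers / base_workers)
        rows.append((workers, timings[workers], speedup, efficiency))
    return rows


# Placeholder values for illustration only; replace with measured times.
example = {1: 120.0, 2: 60.0, 4: 40.0}
rows = scaling_table(example)
for workers, seconds, speedup, efficiency in rows:
    print(f"{workers} workers: {seconds:.0f}s, "
          f"speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

An efficiency well below 100% at higher worker counts usually points to communication or scheduling overhead outweighing the extra compute.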

In [4]:
import pandas as pd

In [5]:
weather_dataset_dir = '../datasets/weather.parquet'

weather_data = pd.read_parquet(weather_dataset_dir)
weather_data['date'] = pd.to_datetime(weather_data['date']).dt.date
weather_data['is_rainy'] = weather_data['prcp'] > 0
# A (borough, date) pair counts as rainy if any of its records has precipitation
daily_weather = weather_data.groupby(['borough', 'date'])['is_rainy'].max().reset_index()


In [6]:
daily_weather

Unnamed: 0,borough,date,is_rainy
0,BX,2013-01-01,False
1,BX,2013-01-02,False
2,BX,2013-01-03,False
3,BX,2013-01-04,False
4,BX,2013-01-05,False
...,...,...,...
21250,R,2024-08-17,True
21251,R,2024-08-18,True
21252,R,2024-08-19,True
21253,R,2024-08-20,False


In [7]:
events_dataset_dir = '../datasets/events.parquet'

event_data = pd.read_parquet(events_dataset_dir)
event_data['date'] = pd.to_datetime(event_data['date']).dt.date
# Count events per borough and day, naming the count column directly
daily_events = event_data.groupby(['borough', 'date']).size().reset_index(name='event_count')

daily_events

Unnamed: 0,borough,date,event_count
0,BX,2021-03-18,13
1,BX,2021-03-19,22
2,BX,2021-03-22,13
3,BX,2021-03-23,13
4,BX,2021-04-07,27
...,...,...,...
3860,R,2024-07-14,32
3861,R,2024-07-20,34
3862,R,2024-07-27,32
3863,R,2024-07-28,11


In [8]:
sample_data_cleaned_dir = '../datasets/sample_data_cleaned.parquet'

sample_data_cleaned = pd.read_parquet(sample_data_cleaned_dir)
sample_data_cleaned['date'] = pd.to_datetime(sample_data_cleaned['issue_date']).dt.date

# Count tickets per county and day, naming the count column directly
daily_tickets = sample_data_cleaned.groupby(['violation_county', 'date']).size().reset_index(name='ticket_count')

In [9]:
daily_tickets = daily_tickets.merge(daily_events, left_on=['violation_county', 'date'], right_on=['borough', 'date'], how='left')
daily_tickets = daily_tickets.drop(['borough'], axis=1) 

In [10]:
daily_tickets['event_count'] = daily_tickets['event_count'].fillna(0)

In [11]:
daily_tickets = daily_tickets.merge(daily_weather, left_on=['violation_county', 'date'], right_on=['borough', 'date'], how='left')
daily_tickets = daily_tickets.drop(['borough'], axis=1)
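A left merge silently produces missing values for ticket dates that have no weather record, and filling those with `False` later makes "no data" indistinguishable from "dry day". The sketch below shows one way to keep that distinction, using the `indicator` and `validate` parameters of `DataFrame.merge`. The toy frames and the `weather_missing` column name are assumptions for illustration.

```python
import pandas as pd

# Toy frames with the same key columns as the notebook (values are made up).
tickets = pd.DataFrame({
    "violation_county": ["BX", "BX", "K"],
    "date": ["2000-07-10", "2013-01-17", "2013-01-17"],
    "ticket_count": [1, 1, 5],
})
weather = pd.DataFrame({
    "borough": ["BX", "K"],
    "date": ["2013-01-17", "2013-01-17"],
    "is_rainy": [False, True],
})

merged = tickets.merge(
    weather,
    left_on=["violation_county", "date"],
    right_on=["borough", "date"],
    how="left",
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
    validate="many_to_one",  # raises if weather has duplicate (borough, date) keys
)

# Record which rows had no weather match before any fillna hides that fact.
merged["weather_missing"] = merged["_merge"] == "left_only"
merged = merged.drop(columns=["borough", "_merge"])

print(merged)
```

With such a flag in place, rows outside the weather coverage can be filtered out or modeled separately instead of being treated as dry days.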
            

In [15]:
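# Days without a weather record are treated as not rainy. Casting to the nullable
# boolean dtype first avoids the object-dtype downcasting FutureWarning from fillna.
daily_tickets['is_rainy'] = daily_tickets['is_rainy'].astype('boolean').fillna(False).astype(bool)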



In [16]:
daily_tickets.reset_index(drop=True, inplace=True)

In [17]:
daily_tickets.head()

Unnamed: 0,violation_county,date,ticket_count,event_count,is_rainy
0,BX,2000-07-10,1,0.0,False
1,BX,2000-10-01,1,0.0,False
2,BX,2012-11-12,1,0.0,False
3,BX,2012-12-28,1,0.0,False
4,BX,2013-01-17,1,0.0,False
