<div style="background-color: #ffffff; padding: 20px; font-size: 28px; font-weight: bold; text-align: center; border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
  <span style="color: #4CAF50;">Machine Learning Solutions for the Production of Structure Statistics</span>
</div>

---------------------------------------------------------------------------------------------------------------------------------

# Table of Contents

<div style="background-color: #f9f9f9; border: 2px solid #4CAF50; border-radius: 10px; padding: 20px;">

### | [Imports](#imports) |
<ul>
  <li><strong>Import necessary libraries</strong></li>
</ul>

### | [Choose Year](#velg_aargang) |
<ul>
  <li><strong>Declare the year you wish to work with</strong></li>
</ul>

### | [Visualisation](#visualisering) |
<ul>
  <li><strong>Collect the edited 'delreg' data and choose a plot to view</strong></li>
  <ul>
    <li><a href="#1">Plot for single industry</a></li>
    <li><a href="#2">Plot for every industry</a></li>
    <li><a href="#3">Plot based on 2 digit industry level</a></li>
    <li><a href="#4">Heatmap</a></li>
    <li><a href="#5">Interactive thematic kommune map</a></li>
    <li><a href="#6">Animated thematic kommune map</a></li>
    <li><a href="#7">Cumulative Histogram</a></li>
    <li><a href="#8">Linked Plots</a></li>
    <li><a href="#9">Bubble Plot</a></li>
    <li><a href="#10">Parallel Coordinates Plot</a></li>
    <li><a href="#11">Geographical Plot</a></li>
    <li><a href="#12">Animated barchart</a></li>
    <li><a href="#13">3D Scatterplot</a></li>
  </ul>   
</ul>

### | [Dashboard](#dashboard) |
<ul>
  <li><strong>Run and Open in another browser tab</strong></li>
</ul>

### | [ML Evaluation](#ml) |
<ul>
  <li><strong>Regression Problems:</strong></li>
  <ul>
    <li><a href="#14">XGBOOST</a></li>
    <li><a href="#15">Nearest Neighbors</a></li>
    <li><a href="#16">Neural Networks</a></li>
  </ul>
  <li><strong>Classification Problems:</strong></li>
  <ul>
    <li><a href="#17">XGBOOST</a></li>
    <li><a href="#18">Nearest Neighbors</a></li>
  </ul>
</ul>

### | [Update File](#oppdateringsfil) |
<ul>
  <li><strong>Run the update file for the chosen year. Performance can be compared when looking at a previous year</strong></li>
   <ul>
    <li><a href="#19">Test Results</a></li>
    <li><a href="#20">Historic Test Results</a></li>
  </ul>
</ul>

### | [Moving Forward](#future) |
<ul>
  <li><strong>The current challenges and opportunities facing us going forward</strong></li>
</ul>
</div>


<a id='imports'></a>
# | [Imports](#imports) |

In [None]:
from imports import *

<a id='velg_aargang'></a>
# | [Choose Year](#velg_aargang) |

In [1]:
# Enter the year
aar = 2023

<a id='visualisering'></a>
# | [Visualisation](#visualisering) |

In [None]:
# Fetch the data for visualisation

timeseries_knn_kommune, histogram_data, knn_data, timeseries_knn_agg, koordinates = visualisations.gather_visualisation_data(aar)

<a id='1'></a>
#### Plot for single industry

In [None]:
visualisations.plots_time(timeseries_knn_agg)

<a id='2'></a>
#### Plot for every industry

In [None]:
visualisations.plot_all_time(timeseries_knn_agg)

<a id='3'></a>
#### Plot based on 2 digit industry level

In [None]:
visualisations.plot_n2(timeseries_knn_agg)

<a id='4'></a>
#### Heatmap

In [None]:
visualisations.heatmap(timeseries_knn_agg)

<a id='5'></a>
#### Interactive thematic kommune map

In [None]:
visualisations.thematic_kommune(timeseries_knn_kommune)

<a id='6'></a>
#### Animated thematic kommune map

In [None]:
visualisations.animated_thematic_kommune(timeseries_knn_kommune)

<a id='7'></a>
#### Cumulative Histogram

In [None]:
visualisations.cumulative_histogram(histogram_data)

<a id='8'></a>
#### Linked Plots

In [None]:
visualisations.linked_plots(timeseries_knn_agg)

<a id='9'></a>
#### Bubble Plot

In [None]:
visualisations.bubble_plot(timeseries_knn_kommune)

<a id='10'></a>
#### Parallel Coordinates Plot

In [None]:
visualisations.parallel_coordinates(timeseries_knn_agg)

<a id='11'></a>
#### Geographical Plot

In [None]:
visualisations.geomapping(koordinates)

<a id='12'></a>
#### Animated barchart

In [None]:
visualisations.animated_barchart(timeseries_knn_agg)

<a id='13'></a>
#### 3D Scatterplot

In [None]:
visualisations.scatter_3d(timeseries_knn_agg)

<a id='dashboard'></a>
# | [Dashboard](#dashboard) |

In [None]:
app, port, service_prefix, domain = dash_application.run_dash_app(aar, timeseries_knn_kommune, histogram_data, knn_data, timeseries_knn_agg, koordinates)

if __name__ == "__main__":
    app.run(debug=True, port=port, jupyter_server_url=domain, jupyter_mode="tab", use_reloader=False)

<a id='ml'></a>
# | [ML Evaluation](#ml) |

In [None]:
# Fetch data
del timeseries_knn_kommune, histogram_data, knn_data, timeseries_knn_agg, koordinates

training_data, imputatable_df, foretak_pub = ml_modeller.hente_training_data()

<a id='regression-problemer'></a>
### [Regression Problems](#regression-problemer)

<a id='14'></a>
#### [XGBOOST](#14)

In [None]:
# Choose Scaler (StandardScaler, MinMaxScaler, RobustScaler)
scaler = RobustScaler()

# Turn off GridSearch for faster run time
GridSearch=False

results = ml_modeller.xgboost_model(training_data, scaler, imputatable_df, GridSearch=GridSearch)

# Best result for GridSearch so far:
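
The scaler option in the cell above can matter when a feature contains extreme values. As a minimal sketch of the idea behind `RobustScaler` (centre on the median, divide by the interquartile range) compared with min-max scaling, the following uses only the standard library. The function names and the turnover figures are made up for illustration, and this is not the code inside `ml_modeller`.

```python
from statistics import median, quantiles


def robust_scale(values):
    """Centre on the median and divide by the interquartile range (IQR)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = (q3 - q1) or 1.0  # avoid division by zero for constant features
    med = median(values)
    return [(v - med) / iqr for v in values]


def min_max_scale(values):
    """Rescale linearly so the smallest value is 0 and the largest is 1."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0
    return [(v - lo) / span for v in values]


# Hypothetical turnover figures with one extreme outlier
turnover = [10.0, 12.0, 14.0, 16.0, 1000.0]
robust = robust_scale(turnover)
minmax = min_max_scale(turnover)

print("robust :", robust)
print("min-max:", minmax)
```

With min-max scaling the single outlier squeezes the four ordinary values into a very narrow band near zero, while the median/IQR version keeps them spread out. That is one reason a robust scaler may be a reasonable default for skewed data.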

<a id='15'></a>
#### [Nearest Neighbors](#15)

In [None]:
# Choose Scaler (StandardScaler, MinMaxScaler, RobustScaler)
scaler = RobustScaler()

# Turn off GridSearch for faster run time
GridSearch=False

results = ml_modeller.knn_model(training_data, scaler, imputatable_df, GridSearch=GridSearch)

# Best result for GridSearch so far:
# Best parameters found by GridSearch: {'n_neighbors': 2}
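
To illustrate what a grid search over `n_neighbors` does, here is a small self-contained sketch. It uses a hand-written nearest-neighbour regressor and leave-one-out validation on made-up data. The helper names are hypothetical, and the real `ml_modeller.knn_model` may use a different search and validation scheme.

```python
from math import dist


def knn_predict(train_X, train_y, x, k):
    """Average the targets of the k training points closest to x."""
    order = sorted(range(len(train_X)), key=lambda i: dist(train_X[i], x))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / k


def loo_mae(X, y, k):
    """Leave-one-out mean absolute error for a given k."""
    errors = []
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        errors.append(abs(knn_predict(rest_X, rest_y, X[i], k) - y[i]))
    return sum(errors) / len(errors)


def grid_search_k(X, y, candidates):
    """Score every candidate k and return the one with the lowest error."""
    scores = {k: loo_mae(X, y, k) for k in candidates}
    best_k = min(scores, key=scores.get)
    return best_k, scores


# Toy data (made up): one feature, target equal to 2 * x
X = [(1.0,), (2.0,), (3.0,), (4.0,), (5.0,), (6.0,)]
y = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
best_k, scores = grid_search_k(X, y, candidates=[1, 2, 3])
print(best_k, scores)
```

The cost grows with the number of candidate values times the number of validation fits, which is why the notebook cells default to `GridSearch=False`.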


<a id='16'></a>
#### [Neural Networks](#16)

In [None]:
# choose Scaler (StandardScaler, MinMaxScaler, RobustScaler)
scaler = RobustScaler()

# Choose the number of epochs and the batch size. A larger batch size runs faster;
# a smaller one may give better learning/convergence. If GridSearch is used, these numbers won't matter.
# GridSearch will take a long time to run.
epochs_number = 200
batch_size = 500
GridSearch=False

# results = ml_modeller.nn_model_1(training_data, scaler, epochs_number, batch_size, imputatable_df, GridSearch=GridSearch)
results = ml_modeller.nn_model_2(training_data, scaler, epochs_number, batch_size, imputatable_df)
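
The trade-off between batch size and run time mentioned in the cell above comes down to how many gradient updates a run performs. A rough sketch of that arithmetic follows. The row count is a made-up figure, since the size of `training_data` is not stated here.

```python
from math import ceil


def updates_per_run(n_rows, batch_size, epochs):
    """Number of gradient updates: batches per epoch times number of epochs."""
    batches_per_epoch = ceil(n_rows / batch_size)
    return batches_per_epoch * epochs


# n_rows is hypothetical; substitute len(training_data) in practice
n_rows = 100_000
small_batches = updates_per_run(n_rows, batch_size=100, epochs=200)
large_batches = updates_per_run(n_rows, batch_size=500, epochs=200)

print(small_batches, large_batches)
```

Fewer, larger batches mean fewer updates and a shorter run. More, smaller batches give the optimiser more (noisier) steps, which may or may not help convergence.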



<a id='klassifikasjon-problemer'></a>
### [Classification Problems](#klassifikasjon-problemer)

<a id='17'></a>
#### [XGBOOST](#17)

In [None]:
results = ml_modeller.xgboost_n3_klass(foretak_pub)

<a id='18'></a>
#### [Nearest Neighbors](#18)

In [None]:
results = ml_modeller.knn_n3_klass(foretak_pub)

<a id='oppdateringsfil'></a>
# | [Update File](#oppdateringsfil) |

In [None]:
year = 2021
# Choose between knn_model, xgboost_model or nn_model
model = 'knn_model'
rate = 0.65
scaler = RobustScaler()
GridSearch=False

update_file, timeseries_knn_agg, timeseries_knn__kommune_agg, check_totals, check_manually = oppdateringsfil.create_bedrift_fil(year, model, rate, scaler, GridSearch=GridSearch)



<a id='19'></a>
### Test Results

In [None]:
# change pd option to show all rows
pd.set_option('display.max_rows', None)
check_totals.head(25)

In [None]:
visualisations.guage(check_totals)

In [None]:
visualisations.thermometer(check_totals)

In [None]:
results = ml_modeller.test_results(update_file, aar)

---------------------------------------------------------------------------------------------------------------------------------

<a id='20'></a>
### Historic Results:

#### 2021 XGBOOST:

- Mean Absolute Error for entire delreg: 8432.291199426325
- R² Score for entire delreg: 0.9606052910127973
-----------------------------------
- Mean Absolute Error for reg_type 02: 7087.746839112336
- R² Score for reg_type 02: 0.9232697374725767


#### 2021 K-Nearest Neighbors:

- Mean Absolute Error for entire delreg: 6559.171895864407
- R² Score for entire delreg: 0.982181406248118
-----------------------------------
- Mean Absolute Error for reg_type 02: 5615.106485704415
- R² Score for reg_type 02: 0.966718048747369

#### 2021 Neural Network:

- Mean Absolute Error for entire delreg: 6728.031814063407
- R² Score for entire delreg: 0.9820746908179295
-----------------------------------
- Mean Absolute Error for reg_type 02: 5719.670829491291
- R² Score for reg_type 02: 0.9665182851658081
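
For reference, the two metrics reported above can be computed as in the sketch below. It uses the standard definitions of mean absolute error and the coefficient of determination on made-up numbers. The actual figures come from `ml_modeller.test_results`, whose implementation is not shown here.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """1 minus the ratio of residual to total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up observed and predicted values
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 310.0, 390.0]

mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
print(mae, r2)
```

MAE is expressed in the same unit as the target variable, so it is only comparable across models evaluated on the same data, as in the 2021 results above. R² is unitless.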

--------------------------------------------------------------------------------------------------------------------------------
<a id='future'></a>
<div style="background-color: #ffffff; padding: 20px; font-size: 28px; font-weight: bold; text-align: center; border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
  <span style="color: #4CAF50;">Moving Forward</span>
</div>

---------------------------------------------------------------------------------------------------------------------------------

# Co-operation with other sections

#### A lot of potential here:

- S422 has transaction data at the establishment (virksomhet) level, although only for international transactions and credit card data. This needs to be explored. There will be challenges, but it could be very useful.

- The machine learning algorithms for industry (næring) classification need to be improved and explored further. I have only used financial data so far, with lackluster results, but there is a lot of potential in using occupation ('yrke') data and in doing more feature engineering. We can get this data from VoF.

- It could be beneficial to create specific machine learning algorithms for each industry. For example, 47.3 could benefit from road/highway network analysis (data from SSB's sgis package). We are able to calculate things like the distance to the nearest public transport, how often a road is used, and the distance by road to the nearest gas station. This could be very useful, but it would take a lot of resources. Perhaps everyone should develop their own machine learning algorithms for the industries they are responsible for.
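
As an illustration of the kind of feature mentioned in the last point (distance by road to the nearest gas station), the sketch below runs Dijkstra's algorithm on a tiny made-up road graph using only the standard library. The node names, edge lengths and helper functions are all hypothetical. A real analysis would work on the actual road network data rather than a hand-built dictionary.

```python
import heapq


def road_distances(graph, source):
    """Dijkstra's algorithm: shortest road distance from source to each reachable node."""
    dist = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        d, node = heapq.heappop(queue)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry, a shorter path was already found
        for neighbour, length in graph.get(node, []):
            candidate = d + length
            if candidate < dist.get(neighbour, float("inf")):
                dist[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return dist


def distance_to_nearest(graph, source, targets):
    """Shortest road distance from source to any of the targets, or None if unreachable."""
    dist = road_distances(graph, source)
    reachable = [dist[t] for t in targets if t in dist]
    return min(reachable) if reachable else None


# Toy undirected road network, edge lengths in km (made up)
edges = [("A", "B", 2.0), ("B", "C", 3.0), ("A", "D", 10.0), ("C", "D", 1.0)]
graph = {}
for u, v, km in edges:
    graph.setdefault(u, []).append((v, km))
    graph.setdefault(v, []).append((u, km))

stations = ["D"]  # hypothetical gas station locations
nearest_km = distance_to_nearest(graph, "A", stations)
print(nearest_km)
```

The resulting distance could then be added as one more column in the training data for the relevant industry.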

#### Things left to do:

- The controls for industrial variables are nearly finished; only a few bugs need fixing.
- The GridSearch for the Neural Network model is computationally heavy and takes a long time to run, but it may be worth running it once and saving the results for future runs.
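
One way to run the search once and reuse the outcome is to store the best parameters in a small JSON file. The sketch below shows the pattern with a stand-in search function. The file name, the parameter names and the `load_or_search` helper are assumptions made for illustration, not part of the existing modules.

```python
import json
import tempfile
from pathlib import Path


def load_or_search(cache_path, search_fn):
    """Return cached best parameters if present; otherwise run the search and store them."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        return json.loads(cache_path.read_text(encoding="utf-8"))
    best_params = search_fn()
    cache_path.write_text(json.dumps(best_params, indent=2), encoding="utf-8")
    return best_params


calls = []


def expensive_search():
    # Stand-in for the real grid search; parameter names and values are made up
    calls.append(1)
    return {"hidden_units": 64, "learning_rate": 0.001}


cache_file = Path(tempfile.mkdtemp()) / "nn_best_params.json"
first = load_or_search(cache_file, expensive_search)   # runs the search
second = load_or_search(cache_file, expensive_search)  # reads the cached file
print(first == second, len(calls))
```

Deleting the cache file forces a fresh search, which would be appropriate whenever the training data or the parameter grid changes.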

# Network Analysis

In [None]:
import knn
import networks

# Add enhet number here
enhet = ''

# Collect locations for a single enhet
enhet_df, _, _, _, _ = knn.knn(aar, enhet)

# Collect road frequencies data
frequencies = networks.networks(aar)

# Filter the dataset, since the networks dataset is very large and computationally heavy to process
filtered_df = enhet_df[enhet_df["kommune"].str.startswith("03")]

In [None]:
# Available colormaps in sgis
colormaps = ["viridis", "inferno", "magma", "cividis", "coolwarm", "Spectral", "RdYlBu", "plasma"]

m = sg.ThematicMap(sg.buff(frequencies, 18), filtered_df, column="frequency", size=15)
m.black = True
m.cmap = "plasma"  # You can change this to any of the colormaps listed above
m.title = "How often each road has been used"
m.legend.title = "Count"
m.plot()