# Kaggle Competiton - Mar 2023
## GoDaddy MicroBusiness Density Forecasting - Time Series Feature Selection
**Link to competition**: [CLICK ME](https://www.kaggle.com/competitions/godaddy-microbusiness-density-forecasting/overview)

In [1]:
import os
import sys

module_path = os.path.abspath(os.path.join('..', 'code'))
if module_path not in sys.path:
    sys.path.append(module_path)

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import kmapper as km

from constants import *
from statsmodels.graphics.tsaplots import plot_acf
from sklearn import ensemble
from sklearn.cluster import KMeans

In [6]:
df = pd.read_csv(MD_PATH)
df = df.dropna()
df['cfips'] = df.loc[:, 'cfips'].astype('int').astype('str')
u = df['cfips'].unique()
cfips = {name: i for name, i in zip(u, range(len(u)))}
df['n_cfips'] = df['cfips'].apply(lambda x: cfips[x])

0       AL
1       AL
2       AL
3       AL
4       AL
        ..
3088    WY
3089    WY
3090    WY
3091    WY
3092    WY
Name: state, Length: 3078, dtype: object

In [3]:
feature_names = [n for n in df.columns if n not in ['county', 'state', 'cfips'] and 'md' not in n]
X = df[feature_names].values
y = df[[n for n in df.columns if "md" in n]].values

In [15]:
model = ensemble.IsolationForest(random_state=1028)
model.fit(X)
lens1 = model.decision_function(X).reshape((X.shape[0], 1))

mapper = km.KeplerMapper(verbose=3)
lens2 = mapper.fit_transform(X, projection='l2norm')

lens = np.c_[lens1, lens2]
lens

KeplerMapper(verbose=3)
..Composing projection pipeline of length 1:
	Projections: l2norm
	Distance matrices: False
	Scalers: MinMaxScaler()
..Projecting on data shaped (3078, 37)

..Projecting data using: l2norm

..Scaling with: MinMaxScaler()



array([[ 0.12577791,  0.00456183],
       [-0.02735091,  0.0190556 ],
       [ 0.15381665,  0.0020211 ],
       ...,
       [ 0.12730447,  0.00167653],
       [ 0.14741045,  0.00065431],
       [ 0.14570339,  0.00058261]])

In [18]:
N_CUBES = 10
N_CLUSTERS = 5
PER_OVERLAPP = 0.5

graph = mapper.map(
    lens,
    X,
    cover=km.Cover(n_cubes=N_CUBES, perc_overlap=PER_OVERLAPP),
    clusterer=KMeans(n_clusters=N_CLUSTERS, random_state=1028)
)

mapper.visualize(
    graph,
    path_html=os.path.join(OUTPUT_DIR, 'map.html'),
    title="MD Clustering 2019-2021",
    custom_tooltips=df.loc[:, 'county'],
    color_values=lens[0, :],
    color_function_name=["Isolation Forest", "L2-norm"],
    node_color_function=["mean", "std", "median", "max"],
)


Mapping on data shaped (3078, 37) using lens shaped (3078, 2)

Minimal points in hypercube before clustering: 5
Creating 100 hypercubes.
Cube_0 is empty.

   > Found 5 clusters in hypercube 1.
   > Found 5 clusters in hypercube 2.
   > Found 5 clusters in hypercube 3.
Cube_4 is empty.

Cube_5 is empty.

Cube_6 is empty.

   > Found 5 clusters in hypercube 7.
   > Found 5 clusters in hypercube 8.
   > Found 5 clusters in hypercube 9.
Cube_10 is empty.

Cube_11 is empty.

   > Found 5 clusters in hypercube 12.
   > Found 5 clusters in hypercube 13.
   > Found 5 clusters in hypercube 14.
   > Found 5 clusters in hypercube 15.
   > Found 5 clusters in hypercube 16.
Cube_17 is empty.

   > Found 5 clusters in hypercube 18.
   > Found 5 clusters in hypercube 19.
   > Found 5 clusters in hypercube 20.
   > Found 5 clusters in hypercube 21.
   > Found 5 clusters in hypercube 22.
Cube_23 is empty.

   > Found 5 clusters in hypercube 24.
   > Found 5 clusters in hypercube 25.
   > Found 5 cluste

Exception: 2 `color_function_names` values found, but 1 columns found in color_values. Must be equal.

In [None]:
df1 =