As found in my previous notebooks, there are some interesting patterns in the time series:

In the notebook titled [Clustering might help😎](https://www.kaggle.com/code/patrick0302/clustering-might-help), clustering results for time series across nearly 500 locations are provided. This clustering approach can significantly aid in the analysis and comprehension of the time series' characteristics.

Another notebook, [Find and fix the error bug🐛](https://www.kaggle.com/code/patrick0302/find-and-fix-the-error-bug), identifies a specific pattern change that led to a huge error. Fixing it improved most participants' scores on the public LB.

So, what is this notebook for?

**This notebook combines the insights from those earlier notebooks to detect more potential bugs that hurt your score.**

**It's time to find and fix them! 🔍**

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objs as go
import plotly
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import cufflinks as cf
cf.set_config_file(offline=True)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# Read files
path = '/kaggle/input/playground-series-s3e20/'
path_cluster = '/kaggle/input/clustering-might-help/'
train = pd.read_csv(path_cluster+'train_with_ClusterNo.csv')
test = pd.read_csv(path_cluster+'test_with_ClusterNo.csv')
ss = pd.read_csv(path+'sample_submission.csv')

In [None]:
train

Thanks to @yeoyunsianggeremie for the [nice notebook](https://www.kaggle.com/code/yeoyunsianggeremie/s3e20-kmeans-smoothing-ensemble-lazypred).

Let's take his submission as a starting point:

In [None]:
submission = pd.read_csv('/kaggle/input/s3e20-kmeans-smoothing-ensemble-lazypred/submission.csv')

Here's a line plot of the average emissions for clusters 0 and 4, with the clusters taken from [Clustering might help😎](https://www.kaggle.com/code/patrick0302/clustering-might-help).

Do you find anything interesting when comparing the average lines across different years?

In [None]:
test['emission'] = submission['emission']

df_plot = pd.concat([train, test], axis=0)
# Keep clusters 0 and 4, matching the text and the plot title
df_plot = df_plot[df_plot['ClusterNo'].isin([0, 4])]
df_plot = df_plot.pivot_table(index='week_no',columns='year',values='emission')
df_plot.columns = [2019, 2020, 2021, '2022 (pred)']

# Create the basic lineplot
ax = df_plot.plot(figsize=(15, 3),title='Average emissions across years from cluster 0 and 4')

plt.show()

Let's look closer. Now you may notice that **the 2020 peak occurs in a different week.**

Although it's just a one-week shift, [Find and fix the error bug🐛](https://www.kaggle.com/code/patrick0302/find-and-fix-the-error-bug) demonstrated that such a one-week shift in a single location can have a significant impact on your score.

So, why not experiment by shifting the peak one week earlier in 2022, mirroring the pattern observed in 2020?

In [None]:
df_plot = df_plot.iloc[10:21]

# Create the basic lineplot
ax = df_plot.plot(figsize=(6, 3),title="Emission peaks across years \n(2022's prediction is on 14th week)")

# Highlight
for year in [2019, 2021, '2022 (pred)']:
    ax.scatter(x=[14], y=[df_plot.loc[14, year]], color='red')

ax.scatter(x=[13], y=[df_plot.loc[13, 2020]], color='green')

plt.show()
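
The peak week can also be read off programmatically instead of by eye. Here is a minimal sketch on made-up numbers, assuming a table shaped like `df_plot` (index `week_no`, one column per year); `toy_plot` and its values are hypothetical and not taken from the competition data.

```python
import pandas as pd

# Hypothetical weekly averages, shaped like df_plot
# (index: week_no, columns: one per year).
toy_plot = pd.DataFrame(
    {
        2019: [1.0, 2.0, 5.0, 2.0],
        2020: [1.0, 5.0, 2.0, 1.0],
        2021: [1.0, 2.0, 5.0, 2.0],
        "2022 (pred)": [1.0, 2.0, 5.0, 2.0],
    },
    index=pd.Index([12, 13, 14, 15], name="week_no"),
)

# idxmax returns, for each column, the index label (here: the week)
# at which that column reaches its maximum.
peak_week = toy_plot.idxmax()
print(peak_week)
```

A column whose peak week differs from the others, like 2020 in this toy table, is a candidate for the kind of shift discussed above.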

To achieve this, we can discard the values for week 13 and replace them with the peak values from week 14.

In [None]:
submission.loc[(test['ClusterNo'].isin([0,3,4, 7]))&(test['week_no']==13), 'emission'] = np.nan
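# Backfill: each blanked week-13 row takes the value of the next row (week 14).
# DataFrame.bfill() replaces the deprecated fillna(method='bfill').
submission = submission.bfill()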
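
To see what blanking a week and backfilling does, here is a minimal sketch on a toy frame. The frame `toy_sub` and its values are hypothetical; it assumes a single location whose rows are ordered by week.

```python
import numpy as np
import pandas as pd

# Hypothetical predictions for one location, ordered by week.
toy_sub = pd.DataFrame(
    {"week_no": [12, 13, 14, 15], "emission": [2.0, 3.0, 9.0, 4.0]}
)

# Blank out week 13, then backfill: each NaN takes the next row's value,
# so week 13 inherits the week-14 peak.
toy_sub.loc[toy_sub["week_no"] == 13, "emission"] = np.nan
toy_sub = toy_sub.bfill()
print(toy_sub)
```

Two things are worth keeping in mind. The trick relies on week 14 of the same location sitting directly below week 13 in the frame, and week 14 keeps its own value, so the peak now starts one week earlier rather than moving entirely.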

In [None]:
#submission.loc[(test['ClusterNo'].isin([0,4]))&(test['week_no']==39), 'emission'] 

In [None]:
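# Same idea in the other direction: blank out week 39 for clusters 0 and 4,
# then forward-fill so it takes the value of the previous row (week 38).
submission.loc[(test['ClusterNo'].isin([0,4]))&(test['week_no']==39), 'emission'] = np.nan
submission = submission.ffill()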

In [None]:
submission.loc[(test['ClusterNo'].isin([0,4]))&(test['week_no']==39), 'emission'] 

The adjusted result for the 2022 prediction now aligns with the peak of 2020!

In [None]:
test['emission'] = submission['emission']

df_plot = pd.concat([train, test], axis=0)
df_plot = df_plot[df_plot['ClusterNo'].isin([0,4])]
df_plot = df_plot.pivot_table(index='week_no',columns='year',values='emission')
df_plot.columns = [2019, 2020, 2021, '2022 (pred)']
df_plot = df_plot.iloc[10:21]

# Create the basic lineplot
ax = df_plot.plot(figsize=(6, 3),title="Emission peaks across years \n(2022's prediction is on 13th week)")

# Highlight
for year in [2019, 2021]:
    ax.scatter(x=[14], y=[df_plot.loc[14, year]], color='red')

ax.scatter(x=[13], y=[df_plot.loc[13, '2022 (pred)']], color='green')
ax.scatter(x=[13], y=[df_plot.loc[13, 2020]], color='green')

plt.show()

In [None]:
sample_sub = pd.read_csv('/kaggle/input/playground-series-s3e20/sample_submission.csv')
asd = pd.DataFrame(sample_sub['ID_LAT_LON_YEAR_WEEK'].str.split('_',expand=True))
asd.columns = ['ID','latitude','longitude','year','week_no']
asd = asd.drop('ID',axis=1)
asd = asd.astype('float')
asd['emission'] = submission['emission'].values
asd

In [None]:
PATH = "/kaggle/input/playground-series-s3e20/"
trainee= pd.read_csv(PATH + "train.csv",index_col="ID_LAT_LON_YEAR_WEEK")
trainee

In [None]:
upper = trainee[asd.columns].reset_index(drop=True).copy()
new_form = pd.concat([upper,asd],axis=0).reset_index(drop=True).copy()
focus_1_mean = new_form[new_form['week_no']<=9].groupby(['year','week_no']).mean().reset_index().copy()
focus_1_std = new_form[new_form['week_no']<=9].groupby(['year','week_no']).std().reset_index().copy()

In [None]:
subset_1 =  new_form[new_form['week_no']<=9].copy()
subset_1.insert(1,"lat_lon_week_no", list(zip(subset_1["latitude"],subset_1["longitude"], subset_1["week_no"])))
#tempa = subset_1[subset_1['week_no']==0].reset_index().copy()
#latlon = [list(i) for i in np.unique([set(i) for i in tempa[['latitude','longitude']].values])]
#tempa.insert(1,"lat_lon", list(zip(tempa["latitude"],tempa["longitude"])))
tempa = subset_1.copy()
subtemp = tempa.pivot_table(index='lat_lon_week_no',columns='year',values='emission').copy()
subtemp['diff_2019_2021'] =  (subtemp[2021.0] - subtemp[2019.0])
subtemp['diff_2020_2021'] =  (subtemp[2021.0] - subtemp[2020.0])
subtemp['diff_2019_2020'] =  (subtemp[2020.0] - subtemp[2019.0])
subtemp['diff_2019_2021'] = subtemp['diff_2019_2021']/subtemp['diff_2019_2020']
subtemp['diff_2020_2021'] = subtemp['diff_2020_2021']/subtemp['diff_2019_2020']
subtemp['diff_scale'] = subtemp[['diff_2019_2021','diff_2020_2021']].abs().min(axis=1)
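# Parenthesize the sign check: `&` binds tighter than `>`, so without the
# parentheses the comparison would apply to the whole combined mask.
same_sign = (subtemp['diff_2019_2021'] * subtemp['diff_2020_2021']) > 0
in_range = (subtemp['diff_scale'] >= 3.5) & (subtemp['diff_scale'] <= 7)
final_suba = subtemp[in_range & same_sign].reset_index().copy()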
final_suba[2022.0] = final_suba[2021.0].values
final_suba
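
To make the filter above easier to follow, here is a minimal sketch of the same ratio test on made-up numbers. The frame `toy` and the labels `loc_a`, `loc_b`, `loc_c` are hypothetical stand-ins for the `(latitude, longitude, week_no)` keys; the thresholds 3.5 and 7 are the ones used in the cell above.

```python
import pandas as pd

# Toy pivot: one row per hypothetical location-week key, one column per year.
toy = pd.DataFrame(
    {
        2019.0: [10.0, 10.0, 10.0],
        2020.0: [11.0, 11.0, 12.0],
        2021.0: [15.0, 30.0, 11.0],
    },
    index=["loc_a", "loc_b", "loc_c"],
)

# Express the jumps into 2021 as multiples of the 2019 -> 2020 change.
base = toy[2020.0] - toy[2019.0]
ratio_19_21 = (toy[2021.0] - toy[2019.0]) / base
ratio_20_21 = (toy[2021.0] - toy[2020.0]) / base
scale = pd.concat([ratio_19_21, ratio_20_21], axis=1).abs().min(axis=1)

# Keep rows whose 2021 jump is 3.5x to 7x the earlier change and where
# both ratios point the same way. Each comparison gets its own parentheses
# because `&` binds tighter than `>` and `<=`.
in_range = (scale >= 3.5) & (scale <= 7)
same_sign = (ratio_19_21 * ratio_20_21) > 0
selected = toy[in_range & same_sign]
print(selected)
```

Only `loc_a` survives here: its 2021 value jumped by 4 to 5 times the 2019 to 2020 change, whereas `loc_b` jumped far more and `loc_c` barely moved. For the selected rows, the notebook then reuses the 2021 value as the 2022 prediction.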

In [None]:
asd.insert(1,"lat_lon_week_no", list(zip(asd["latitude"],asd["longitude"], asd["week_no"])))
aassdd = asd.merge(final_suba[['lat_lon_week_no',2022.0]],how='left',on='lat_lon_week_no').copy().rename(columns={2022.0: "2022"})
aassdd['emission'] = np.where(pd.notna(aassdd['2022']),aassdd['2022'],aassdd['emission'])
aassdd = aassdd.drop('2022',axis=1)
assert pd.isna(aassdd['emission']).sum() == 0

In [None]:
submission['emission'] = aassdd['emission'].values

In [None]:
submission.to_csv('submission.csv',index=False)

In [None]:
submission

Whew! The bugs were removed!

Now you might want to look for more bugs, or find a more elegant way to remove them.

Your choice😎