# Gemma Summarizes Kaggle Competition Writeups

How can we use Gemma to summarize Kaggle competition writeups, or can we? Well, lets see.

This notebook is to trial Gemmas summarization skills on Kaggle competition writeups.

- v1: first trials with summarization
- v2: attempt at parsing the writeup placement for prioritization basis
- v3: trying to summarize each writeup first, and from those all writeups per competition
- v4: tuning prompt for better summary of key points, exploring Gemma's ability for overall summarization of multiple writeups across the competition, adding my own summary on Gemma learnings
- v5: optimized by clipping context to 4k on Kaggle to avoid GPU OOM. runs on P100 now much faster than 2xT4
- v6: after trialing these summaries as inputs with [a RAG notebook](https://www.kaggle.com/code/donkeys/qa-solutions-from-past-competitions), now added section on summarizing large writeup sets per competition in parts and learnings from that
- v7: analysis of larger set of writeups and summaries for different subset sizes.

# Executive Summary of (my) Gemma Learning

- Gemma is a bit talkative, at least the instruction tuned version. It seems to wish to tell me more than I asked for.
- Gemma does a decent job figuring out what position the solution described in a writeup finished in, if that information is in the writeup or its title. But can also miss it even if it is straight in the title.
- Due to talkative nature and unreliable position identification it is a bit difficult to use for sorting the writeups by their competition scores
- Gemma does not seem very good at distinguishing different parts of a longer prompt, such as competition description vs the solution writeup. It seems to mix potential approaches from the descriptions in the writeup.
- With little information (e.g., almost empty writeups), summaries seem to hallucinate the competition description into the writeup summary. With large number of writeups, Gemma sticks to the "facts" much better.
- When asked for a summary of key points, Gemma seems to want to always write its own extra opinion piece in the end about the summary.
- Asking simply for a list of key points without "summary" works better to avoid this opinion piece. Perhaps it is something to do with its training in relation to summarization.
- Gemma seems to produce summaries in the range of 200-300 tokens, regardless of the input context size. Again, perhaps something about how it has been trained?
- When a good prompt format and approach is found, Gemma seems to give quite good summaries. Probably quite common for models this size.
- When summarizing multiple parts, such as partial summaries of large writeup sets, subsections are needed or Gemma seems to lose it.
- When re-summarizing larger numbers of subsets, Gemma seems to more easily lose it. Some ways to mitigate that should be considered if using such hierarchical summarization approaches.

In [1]:
#have to update these first, or otherwise it seems the libraries might not update and load older versions
!pip install -q -U transformers accelerate bitsandbytes lxml
#flash attention does not work on Kaggle GPU's, too old
#!pip install flash-attn --no-build-isolation


[0m

In [2]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from IPython.display import HTML, Markdown, display
from tqdm.notebook import tqdm

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig


# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os

on_kaggle = False
if os.path.exists("/kaggle/input"):
    on_kaggle = True

if on_kaggle:
    for dirname, _, filenames in os.walk('/kaggle/input'):
        for filename in filenames:
            print(os.path.join(dirname, filename))



# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [3]:
!nvidia-smi

Fri Apr 26 05:11:05 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA GeForce RTX 3090        Off | 00000000:2D:00.0 Off |                  N/A |
|  0%   45C    P8              21W / 370W |     22MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

# Load Gemma Model and Try it with Poetry Prompt

The poetry prompt is actually taken from the [Huggingface Gemma page](https://huggingface.co/google/gemma-7b-it).

I use it here just as a smoke test to see the code and model works.

In [4]:
#https://www.kaggle.com/code/minhsienweng/create-ai-generated-essays-gemma/notebook
import torch
DEVICE = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
DEVICE = "auto"
print(f"Device: {DEVICE}")
print(f"CUDA Version: {torch.version.cuda}")
print(f"Pytorch {torch.__version__}")

Device: auto
CUDA Version: 12.1
Pytorch 2.2.0


In [5]:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

if on_kaggle:
    model_path = "/kaggle/input/gemma/transformers/7b-it/1"
else:
    model_path = "/mystuff/llm/Meta-Llama-3-8B-Instruct"
#    model_path = "/mystuff/llm/gemma-7b"
    
print(f"Model path: {model_path}")
tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True)

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=False,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

g_model = AutoModelForCausalLM.from_pretrained(model_path, 
                                               device_map=DEVICE, 
                                               quantization_config=quantization_config,
                                               #attn_implementation="flash_attention_2",
                                              )


Model path: /mystuff/llm/Meta-Llama-3-8B-Instruct


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

So next to try the prompt from HF model page as the base smoke test on model working:


There you go. An interesting poem :) So the model now works, also on Kaggle.

In [6]:
%%time
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

prompt = "Write me a poem about Machine Learning."

chat = [{'content': prompt, 'role': 'user'}]
chat_tokens = tokenizer.apply_chat_template(chat, tokenize=True, add_generation_prompt=True, return_tensors='pt').to(g_model.device)

#input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = g_model.generate(chat_tokens, max_new_tokens=1000, eos_token_id=terminators, pad_token_id=tokenizer.pad_token_id)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>user<|end_header_id|>

Write me a poem about Machine Learning.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

In silicon halls, where data reigns
A new kind of wisdom, algorithms sustain
Machine Learning's art, a wondrous game
Where patterns hide, and secrets are claimed

With each new input, a tale unfolds
A story woven, of ones and zeroes told
The machine learns, as data flows free
A symphony of bits, a harmony to see

In neural networks, nodes entwine
A web of connections, a mind divine
The algorithm whispers, "I shall know"
As hidden truths, begin to glow

Through backpropagation's gentle hand
The machine refines, its understanding grand
The errors dwindle, the accuracy grows
As the model learns, the data it knows

In reinforcement learning's realm of might
The machine explores, through trial and night
The rewards and penalties, a delicate dance
As the agent learns, to take a chance

In deep learning's labyrinthine depths
The machine unc

# Load Datasets

I use data from a set of Kaggle competition writeups, and competition descriptions.

In this notebook, I try out Gemma as a summarizer for those writeups. And the descriptions are there to help build a bit of context for the writeups and their summaries.

In [7]:
if on_kaggle:
    df_writeups = pd.read_csv(f"/kaggle/input/2023-kaggle-ai-report/kaggle_writeups_20230510.csv")
    df_comp_meta = pd.read_csv(f"/kaggle/input/kaggles-all-completed-competition-dataset/kaggle comp_submission.csv")
else:
    df_writeups = pd.read_csv(f"/mystuff/data/kaggle_writeups_20230510.csv")
    df_comp_meta = pd.read_csv(f"/mystuff/data/kaggle_comp_submission.csv")



Some of the writeups have a lot of HTML in them, causing the Gemma model sometimes to overflow its context window, and lose the text in the HTML markup. This shows in the Writeup column here:

In [8]:
df_writeups.head(2)

Unnamed: 0,Competition Launch Date,Title of Competition,Competition URL,Date of Writeup,Title of Writeup,Writeup,Writeup URL
0,08/03/2010 00:00:00,Chess ratings - Elo versus the Rest of the World,https://www.kaggle.com/c/2447,11/18/2010 00:06:46,Released: my Source Code and Analysis,<p>I had a lot of fun with this competition an...,https://www.kaggle.com/c/2447/discussion/185
1,08/03/2010 00:00:00,Chess ratings - Elo versus the Rest of the World,https://www.kaggle.com/c/2447,11/20/2010 04:38:53,6th place(UriB) by Uri Blass,<P>I calculated rating for every player in mon...,https://www.kaggle.com/c/2447/discussion/192


This `strip_html()` will remove HTML formatting from a writeup and only leave the actual text:

In [9]:
from lxml import html

def strip_html(html_text):
    tree = html.fromstring(html_text)
    clean_text = tree.text_content()
    return clean_text


There are 2 writeups with NAN values, dropping them allows processing the column at once, and removes invalid rows.

In [10]:
print(f"                 Initial set of writeups: {df_writeups.shape}")
df_writeups = df_writeups.dropna(subset=['Writeup'])
print(f"After removing rows with nan for writeup: {df_writeups.shape}")
# this shows there were 2 rows with nan writeup, 3127->3125

                 Initial set of writeups: (3127, 7)
After removing rows with nan for writeup: (3125, 7)


Initially, I tried simply apply `strip_html()` directly to the `Writeup` column for all rows. However, this made printing any example writeups in the notebook harder to read. So I added the stripped writeup as its own column `writeup_clean`.

In [11]:
#df_writeups.loc[:, "Writeup"] = df_writeups["Writeup"].apply(strip_html)
df_writeups["writeup_clean"] = df_writeups["Writeup"].apply(strip_html)
df_writeups = df_writeups.rename(columns={'Writeup': 'writeup'})


The `df_comp_meta` dataset contains competition metadata such as competition description.

In [12]:
df_comp_meta.head(2)

Unnamed: 0,comp_name,comp_Reward,comp_link,teams,competitors,Entries,Tag,desc,code_link,start_date,start_month,start_year,final_date,final_month,final_year
0,Tabular Playground Series - Sep 2022,Swag,https://www.kaggle.com/competitions/tabular-pl...,1381,1447,13085,tabular data,The competing Kaggle merchandise stores we saw...,https://www.kaggle.com/code/elem3ntary/tps-sep...,1,Sep,2022,1,Oct,2022
1,AI Village Capture the Flag @ DEFCON,25000,https://www.kaggle.com/competitions/ai-village...,668,668,4235,games,Help Henry Hacker get to Homecoming during DEF...,https://www.kaggle.com/code/tatamikenn/defcon3...,12,Aug,2022,12,Sep,2022


## Data stats

There appear to be 310 different competitions in the writeup data. There are more writeups (3125) than competitions because multiple participants did a writeup on single competitions from their own perspective.

In [13]:
df_writeups["Title of Competition"].nunique()

310

The first writeup in the dataset appears to be for a chess rating competition:

In [14]:
competitions = df_writeups["Title of Competition"].unique()
comp0 = competitions[0]
comp0

'Chess ratings - Elo versus the Rest of the World'

This competition appears to have 5 writeups:

In [15]:
df_comp0 = df_writeups[df_writeups["Title of Competition"]==comp0]
print(f"Chess competition dataframe shape: {df_comp0.shape}")
df_comp0

Chess competition dataframe shape: (5, 8)


Unnamed: 0,Competition Launch Date,Title of Competition,Competition URL,Date of Writeup,Title of Writeup,writeup,Writeup URL,writeup_clean
0,08/03/2010 00:00:00,Chess ratings - Elo versus the Rest of the World,https://www.kaggle.com/c/2447,11/18/2010 00:06:46,Released: my Source Code and Analysis,<p>I had a lot of fun with this competition an...,https://www.kaggle.com/c/2447/discussion/185,I had a lot of fun with this competition and l...
1,08/03/2010 00:00:00,Chess ratings - Elo versus the Rest of the World,https://www.kaggle.com/c/2447,11/20/2010 04:38:53,6th place(UriB) by Uri Blass,<P>I calculated rating for every player in mon...,https://www.kaggle.com/c/2447/discussion/192,I calculated rating for every player in months...
2,08/03/2010 00:00:00,Chess ratings - Elo versus the Rest of the World,https://www.kaggle.com/c/2447,11/23/2010 10:38:23,7th place - littlefish,I'm a little surprised I ended up in the top-1...,https://www.kaggle.com/c/2447/discussion/194,I'm a little surprised I ended up in the top-1...
3,08/03/2010 00:00:00,Chess ratings - Elo versus the Rest of the World,https://www.kaggle.com/c/2447,11/20/2010 11:27:17,3rd place: Chessmetrics - Variant,"<p><span id=""post_text_content_1230""><div dir=...",https://www.kaggle.com/c/2447/discussion/193,"Dear all,it was a great competition, thanks a ..."
4,08/03/2010 00:00:00,Chess ratings - Elo versus the Rest of the World,https://www.kaggle.com/c/2447,11/18/2010 02:44:10,2nd place: TrueSkill Through Time,"Wow, this is a surprise! I looked at this comp...",https://www.kaggle.com/c/2447/discussion/186,"Wow, this is a surprise! I looked at this comp..."


## Example Writeup

What does a writeup look like?

In [16]:
writeup0=df_comp0.iloc[1]["writeup"]
Markdown(writeup0)

<P>I calculated rating for every player in months 101-105 and after having the rating I have a simple formula to calculate the expected result only based on the rating and the color.<BR>The tricks that I used were mainly in calculating the rating but I will start explaining the simple part.<BR><BR>The first part was calculating the bonus for white<BR><BR>I had the following formula for this part:<BR>bonus=maximum((white_rating+black_rating-3100)/40.0,50)<BR><BR>Diff=white_rating+bonus-black_rating&nbsp; <BR><BR>Expected_result=0.5+Diff/850 <BR>When I changed it to be not more than <FONT size=2>0.970588 and not less than 0.1(practically it had a very small effect&nbsp;<BR>because&nbsp;the result was always bigger than 0.1 and there was only one case when I needed to reduce it to 0.970588)</P>
<P></FONT><BR>Now we go to the hard part that is how to calculate the rating for every player.<BR>For this purpose I admit that I used the future to predict the past(but I have also prediction based on a different model in the top 10 when I did not use the future to predict the past).<BR><BR>I used a function that I called repeat_strength_estimate<BR>The function get the following parameters:<BR>1)k that is the last month that is not missing.<BR>For the prediction of months 101-105 k=100 but for testing my parameters I used k=90,91,92,...99<BR>2)max_months(practically get the value 81 and I admit that it is not a good name)<BR>The meaning of max_months=81 is practically that I&nbsp;do not use the first 20 months to predict month 101 and that I do not use the first 21 months to predict month 102 and generally <BR>I do not use the first&nbsp;m-81 months to predict month number m.<BR><BR>3)<FONT size=2>big_dif=310<BR>big_dif was used to calculate performance rating and for some reason I found that small values give better results<BR>in my tests so I used this small value<BR><BR><FONT size=2>My formula for performance rating was</P>
<P>performance_rating=avg_rating+((result-opponents)/opponents)*big_dif;<BR><BR>the value of the division can be at most 1 and at least -1 because result is practically weighted half points and is something between 0 and twice the weight of the opponent.<BR><BR>opponents in this formula mean the number of weight opponents(when the weight is based on the distance in month from&nbsp;the month to predict)&nbsp;&nbsp;<BR>This formula means that even if a player lost all the games against the opponents then he still got performance rating that is only 310 elo weaker than the average of the opponents because the result of the division is always between -1 for losing all games and 1 for winning all games.<BR></FONT></FONT><BR>I guess that it was good because not all games are included so person who played against strong opponents probably performed practically better than his real score and it is not good for the real world when games are not missing.<BR><BR>4)num_avg=5.9 similiar to chess metrics(I added 5.9 faked opponents with average rating)<BR><BR>5)num_weak=2.2(added 2.2 faked weak opponents)<BR><BR>6)<FONT size=2>value_weak=2210(rating of the weak opponents like chess metrics<BR><BR>7)<FONT size=2>unrated=2285(I think that practically had no effect because players always&nbsp;have games in the last 80 months)</P></FONT>
<P></FONT><BR>8)<FONT size=2>minimal_game_finished=15(I reduce rating to players with less than 15 weighted games similiar to chess metrics)<BR><BR>9)reduction_per_game=12(the number that I reduce for less of experience for player without many weight games)<BR><BR>10)adding=39(the number that I add to rating of players after every iteration)<BR><BR>repeat_strength_estimate basically did 10 iterations for evaluating the strength of every player in every month.<BR></FONT>The evaluation of&nbsp;the strengh was based on 2 steps when step 1 was the function that calculate strength that is similiar to chess metrics but there are important differences and step 2 was deciding that place 50 has rating 2625 in the rating list that is exactly the same as chess metrics.<BR><BR><FONT size=2></P>
<P>calc_strength_chess_metric is the missing function to understand the algorithm and it basically got 11 parameters(all the 10 parameters that repeat_strength_estimate got and another parameter that is the month that we calculate&nbsp;estimate for it).<BR><BR>Note that&nbsp;the&nbsp;estimate for month 50 of player 1 when months 101-105 are missing is important because if player 2 played with player 1 at month 50 <BR>then it is going to influence the rating of player 2&nbsp; at month 101-105 that is used to calculate the expected result.<BR><BR>I use the word estimate and not rating because rating by definition assume that we do not have future results.<BR><BR>I had basically 2 steps in<BR>calc_strength_chess_metric<BR><BR>The first step was a loop that calculated the estimate for strength for every player in the relevant month.<BR>The second step is a&nbsp;step that I used only when I needed to predict the strength in the missing months and it is practically unfair trick but not something that is forbidden in the competition because I used the information about games and not about the results in the supposed missing months to&nbsp;calculate changes in the rating estimate in these months.<BR></FONT><BR>I did not finish to explain my algorithm&nbsp;and I plan also to send code later but now I need only to explain the 2 steps of calc_strength_chess_metric to explain my algorithm and I will do it later in another post(this part of the program is only slightly more than 100 lines of code in C).<BR></P>

# Attempt 1 - Ask Gemma for a useful summary for a Kaggle learner:

Try with the first writeup collected above:

In [17]:
prompt = f"Summarize the following Kaggle competition writeup." \
 f"Make it useful for someone trying to learn and apply a similar " \
 f"approach in a different competition: {writeup0}"
chat = [{'content': prompt, 'role': 'user'}]
chat_tokens = tokenizer.apply_chat_template(chat, tokenize=True, add_generation_prompt=True, return_tensors='pt').to(g_model.device)

input_token_count = chat_tokens[0].shape[-1]
input_token_count

1394

In [18]:

outputs = g_model.generate(chat_tokens, max_new_tokens=1000, eos_token_id=terminators, do_sample=False, pad_token_id=tokenizer.eos_token_id)

#outputs = g_model.generate(, max_new_tokens=1000, do_sample=False, eos_token_id=terminators, pad_token_id=tokenizer.pad_token_id)
new_tokens = outputs[0][input_token_count:]


### Writeup Summary:

In [19]:
print(len(new_tokens))

428


In [20]:
print(input_token_count)

1394


In [21]:
print(len(new_tokens))
summary1 = tokenizer.decode(new_tokens, skip_special_tokens=False)
Markdown(summary1)

428


The writeup describes a Kaggle competition entry that uses a rating system to predict the outcome of chess games. The author calculates a rating for each player in months 101-105 and uses a formula to calculate the expected result based on the rating and the color (white or black). The rating calculation involves a complex formula that takes into account various parameters, including the player's past performance, the number of games played, and the strength of the opponents.

The key components of the algorithm are:

1. The `repeat_strength_estimate` function, which calculates the rating for each player in each month. This function takes into account the player's past performance, the number of games played, and the strength of the opponents.
2. The `calc_strength_chess_metric` function, which is used to calculate the estimate of the player's strength in the missing months. This function uses a loop to calculate the estimate for each player in the relevant month, and then uses an unfair trick to adjust the estimate based on the information about games in the supposed missing months.

The author's approach involves using a combination of machine learning and chess metrics to predict the outcome of the games. The algorithm is complex and involves many parameters, but the author claims that it was effective in predicting the outcome of the games.

For someone trying to learn and apply a similar approach in a different competition, here are some key takeaways:

1. Use a combination of machine learning and domain-specific metrics to predict the outcome of the games.
2. Calculate a rating for each player based on their past performance and the strength of their opponents.
3. Use a formula to calculate the expected result based on the rating and the color.
4. Consider using an unfair trick to adjust the estimate based on the information about games in the supposed missing months.
5. Be prepared to iterate on your algorithm and adjust the parameters to improve the performance.

Note that the author's algorithm is complex and may not be easily applicable to other competitions. However, the general approach of using a combination of machine learning and domain-specific metrics to predict the outcome of games may be useful in other competitions.<|eot_id|>

# Attempt 2: Add more context to Gemma input

The summary above seems a bit hard to figure out, since it does not say what players or months it refers to. Maybe Gemma can improve the summary with a bit of knowledge about what kind of competition the writeup is for.

So trying here with adding writeup title, competition name, and competition description to the prompt. Lets see.

In [22]:
writeup0_title = df_comp0["Title of Writeup"].iloc[0]
writeup0_title

'Released: my Source Code and Analysis'

In [23]:
comp0_name = df_comp0["Title of Competition"].iloc[0]
comp0_name

'Chess ratings - Elo versus the Rest of the World'

In [24]:
comp0_desc = df_comp_meta[df_comp_meta["comp_name"] == comp0_name]["desc"].values[0]
Markdown(comp0_desc)

When predicting the outcome of chess games, you typically need two things; a rating system wherein the current ability of each player is estimated based on past results, and a model for estimating the expected score for each player, once you know their ratings.Most rating systems use some methodology to determine initial "seed" ratings for the pool of players, and then update those ratings based on ongoing results.  The most famous approach is the Elo approach, where the applied change to a player's rating is proportional to the amount by which they exceed their aggregate expected score across all their recent games.  The scaling factor is known as the "K-factor", and for the official ratings used throughout the world, the K-factor is highest for new players and lowest for topmost players.  But there are many other approaches: the Ken Thompson approach takes each player's most recent 100 games and calculates the rating that would be most likely to lead to that performance.  The Mark Glickman approach is similar to Elo but introduces additional parameters for each player, tracking the level of confidence and level of volatility for each player's rating, and then using these parameters to determine which K-factor to apply.The initial seed ratings are typically determined through a simultaneous calculation: a start rating is assumed for each player, then a "performance rating" is calculated for each player based on their results and the ratings of their opponents, and then those performance ratings are fed back into another iteration as the start ratings.  This is allowed to run until it converges upon a stable set of ratings.  This was the methodology used to calculate initial ratings for most major rating systems.  In fact this is the overall approach taken by the Jeff Sonas Chessmetrics rating calculation, and is used not just to calculate initial ratings but in fact to calculate all ratings.There is a general convention in chess rating systems whereby the difference in two ratings is used for calculating expected score when two players face each other.  Of course it could just as well be the ratio of the two ratings, or some other more complex relationship that depends on the magnitude of the ratings and not just their relative difference.Here are some links to articles existing on rating systems:Elo and Ken ThompsonGlickoChessmetricsMicrosoft TrueSkillJeff Moser has a C# implementation of Elo and TrueSkill on Github. has posted a Java implementation of Glicko on Github.

But first a helper method to avoid copy-pasting the generation code all the time:

In [25]:
def llama3_generate(prompt, debug=False):
    # trying to remove <|eot_id|> just in case Llama3 thinks the sequence ends in the middle
    prompt = prompt.replace("<|eot_id|>", "\n")
    #print(prompt)
    
    chat = [{'content': prompt, 'role': 'user'}]
    chat_tokens = tokenizer.apply_chat_template(chat, tokenize=True, add_generation_prompt=True, return_tensors='pt').to(g_model.device)
        
    input_token_count = chat_tokens[0].shape[-1]

#    input_ids = tokenizer(prompt, return_tensors="pt").to("cuda")
#    input_token_count = len(input_ids["input_ids"][0])

    outputs = g_model.generate(chat_tokens, max_new_tokens=1000, eos_token_id=terminators, do_sample=False, pad_token_id=tokenizer.eos_token_id)

    new_tokens = outputs[0][input_token_count:]
    output_token_count = len(new_tokens)
    if debug:
        print(f"input tokens: {input_token_count}, output tokens: {output_token_count}")
    
    output = tokenizer.decode(new_tokens, skip_special_tokens=False)
    output_md = Markdown(output)
    return output, output_md

And now, try to generate a writeup summary from the competition description + a writeup:

In [26]:
prompt = f"given the following kaggle competition description and solution writeup, " \
         f"summarize the solution for someone interested in applying a similar solution in a different context:\n\n" \
         f"description: {comp0_desc}\n\n" \
         f"writeup: {writeup0_title}: {writeup0}"

In [27]:
%%time
summary2, summary2_md = llama3_generate(prompt)
Markdown(summary2)

CPU times: user 15.9 s, sys: 144 ms, total: 16.1 s
Wall time: 16.1 s


The solution involves calculating a rating system for chess players based on their past results. The rating system uses a combination of Elo and Glicko approaches to estimate the current ability of each player. The solution also involves calculating the expected score for each player based on their ratings.

The key components of the solution are:

1. Calculating the initial seed ratings for each player using a simultaneous calculation approach.
2. Updating the ratings based on ongoing results using the Elo and Glicko approaches.
3. Calculating the expected score for each player based on their ratings using the difference in their ratings.

The solution also involves using a function called `repeat_strength_estimate` to calculate the rating for each player. This function takes several parameters, including the last month that is not missing, the maximum number of months to consider, and the big difference value.

The solution also involves using a function called `calc_strength_chess_metric` to calculate the strength of each player in a given month. This function takes 11 parameters, including the month to consider, the last month that is not missing, and the big difference value.

The solution is implemented in C and is based on the Elo and Glicko approaches. The solution also involves using a combination of linear and non-linear calculations to estimate the current ability of each player.

To apply a similar solution in a different context, you would need to:

1. Identify the relevant parameters and variables in the new context.
2. Determine the initial seed ratings for each player or entity in the new context.
3. Update the ratings based on ongoing results using a similar approach to Elo and Glicko.
4. Calculate the expected score for each player or entity based on their ratings.
5. Use a function similar to `repeat_strength_estimate` to calculate the rating for each player or entity.
6. Use a function similar to `calc_strength_chess_metric` to calculate the strength of each player or entity in a given month.

Note that the specific implementation details may vary depending on the new context, but the general approach and principles should remain the same.<|eot_id|>

### Better but not great..

The above summary seems better, but quite often when running this (assuming some variation if determinism is not 100% configured), Gemma confuses some of the competition description with the writeup, claiming the writeup solution uses some of the methods from the description (Elo rating etc). So next I tried to simplify the contest description a bit to avoid it polluting the writeup summary too much.

# Attempt 3: Summarize the Description First

As the details in the description seem to confuse Gemma vs the writeup itself, I decided to try to summarize the competition description first, to give basic context but with less details for Gemma to get confused.

In [28]:
prompt = f"Summarize the following kaggle competition description in a very concise way. "\
         f"Omit references to specific algorithms or known solutions, focus on overview of the contest topic and goals.\n\n" \
         f"description: {comp0_desc}\n\n" \
         f"summary: "

First a summary of the competition description itself:

In [29]:
Markdown(prompt)

Summarize the following kaggle competition description in a very concise way. Omit references to specific algorithms or known solutions, focus on overview of the contest topic and goals.

description: When predicting the outcome of chess games, you typically need two things; a rating system wherein the current ability of each player is estimated based on past results, and a model for estimating the expected score for each player, once you know their ratings.Most rating systems use some methodology to determine initial "seed" ratings for the pool of players, and then update those ratings based on ongoing results.  The most famous approach is the Elo approach, where the applied change to a player's rating is proportional to the amount by which they exceed their aggregate expected score across all their recent games.  The scaling factor is known as the "K-factor", and for the official ratings used throughout the world, the K-factor is highest for new players and lowest for topmost players.  But there are many other approaches: the Ken Thompson approach takes each player's most recent 100 games and calculates the rating that would be most likely to lead to that performance.  The Mark Glickman approach is similar to Elo but introduces additional parameters for each player, tracking the level of confidence and level of volatility for each player's rating, and then using these parameters to determine which K-factor to apply.The initial seed ratings are typically determined through a simultaneous calculation: a start rating is assumed for each player, then a "performance rating" is calculated for each player based on their results and the ratings of their opponents, and then those performance ratings are fed back into another iteration as the start ratings.  This is allowed to run until it converges upon a stable set of ratings.  This was the methodology used to calculate initial ratings for most major rating systems.  In fact this is the overall approach taken by the Jeff Sonas Chessmetrics rating calculation, and is used not just to calculate initial ratings but in fact to calculate all ratings.There is a general convention in chess rating systems whereby the difference in two ratings is used for calculating expected score when two players face each other.  Of course it could just as well be the ratio of the two ratings, or some other more complex relationship that depends on the magnitude of the ratings and not just their relative difference.Here are some links to articles existing on rating systems:Elo and Ken ThompsonGlickoChessmetricsMicrosoft TrueSkillJeff Moser has a C# implementation of Elo and TrueSkill on Github. has posted a Java implementation of Glicko on Github.

summary: 

In [30]:
%%time
comp_summary, comp_summary_md = llama3_generate(prompt)
Markdown(comp_summary)

CPU times: user 3.07 s, sys: 42 ms, total: 3.12 s
Wall time: 3.11 s


Here is a concise summary of the competition description:

The goal is to develop a rating system that estimates the current ability of each player based on past results, and a model to predict the expected score for each player. The system should determine initial ratings and update them based on ongoing results, using a methodology that takes into account factors such as the player's performance and the ratings of their opponents.<|eot_id|>

## Writeup Summary from Description Summary and the Writeup itself

Now that we have the competition description summary, we can give it and the writeup to Gemma and see how well it manages to summarize them together (vs full competition description + writeup above):

In [31]:
prompt = f"you are to answer as a helpful assistant helping understand questions about " \
         f"a kaggle competition writeup, and the solutions the writeup describes. " \
         f"a short description of the competition and its goals, and a solution writeup follows. " \
         f"description: {comp_summary}\n\n" \
         f"writeup: {writeup0_title}:\n {writeup0}\n\n" \
         f"question: summarize the solution in the above writeup, focusing on used data analysis methods. " \
         f"Focus on the key points of the solutions and how they might have helped achieving better score in the competition.\n\n" \
         f"answer: "

In [32]:
Markdown(prompt)

you are to answer as a helpful assistant helping understand questions about a kaggle competition writeup, and the solutions the writeup describes. a short description of the competition and its goals, and a solution writeup follows. description: Here is a concise summary of the competition description:

The goal is to develop a rating system that estimates the current ability of each player based on past results, and a model to predict the expected score for each player. The system should determine initial ratings and update them based on ongoing results, using a methodology that takes into account factors such as the player's performance and the ratings of their opponents.<|eot_id|>

writeup: Released: my Source Code and Analysis:
 <P>I calculated rating for every player in months 101-105 and after having the rating I have a simple formula to calculate the expected result only based on the rating and the color.<BR>The tricks that I used were mainly in calculating the rating but I will start explaining the simple part.<BR><BR>The first part was calculating the bonus for white<BR><BR>I had the following formula for this part:<BR>bonus=maximum((white_rating+black_rating-3100)/40.0,50)<BR><BR>Diff=white_rating+bonus-black_rating&nbsp; <BR><BR>Expected_result=0.5+Diff/850 <BR>When I changed it to be not more than <FONT size=2>0.970588 and not less than 0.1(practically it had a very small effect&nbsp;<BR>because&nbsp;the result was always bigger than 0.1 and there was only one case when I needed to reduce it to 0.970588)</P>
<P></FONT><BR>Now we go to the hard part that is how to calculate the rating for every player.<BR>For this purpose I admit that I used the future to predict the past(but I have also prediction based on a different model in the top 10 when I did not use the future to predict the past).<BR><BR>I used a function that I called repeat_strength_estimate<BR>The function get the following parameters:<BR>1)k that is the last month that is not missing.<BR>For the prediction of months 101-105 k=100 but for testing my parameters I used k=90,91,92,...99<BR>2)max_months(practically get the value 81 and I admit that it is not a good name)<BR>The meaning of max_months=81 is practically that I&nbsp;do not use the first 20 months to predict month 101 and that I do not use the first 21 months to predict month 102 and generally <BR>I do not use the first&nbsp;m-81 months to predict month number m.<BR><BR>3)<FONT size=2>big_dif=310<BR>big_dif was used to calculate performance rating and for some reason I found that small values give better results<BR>in my tests so I used this small value<BR><BR><FONT size=2>My formula for performance rating was</P>
<P>performance_rating=avg_rating+((result-opponents)/opponents)*big_dif;<BR><BR>the value of the division can be at most 1 and at least -1 because result is practically weighted half points and is something between 0 and twice the weight of the opponent.<BR><BR>opponents in this formula mean the number of weight opponents(when the weight is based on the distance in month from&nbsp;the month to predict)&nbsp;&nbsp;<BR>This formula means that even if a player lost all the games against the opponents then he still got performance rating that is only 310 elo weaker than the average of the opponents because the result of the division is always between -1 for losing all games and 1 for winning all games.<BR></FONT></FONT><BR>I guess that it was good because not all games are included so person who played against strong opponents probably performed practically better than his real score and it is not good for the real world when games are not missing.<BR><BR>4)num_avg=5.9 similiar to chess metrics(I added 5.9 faked opponents with average rating)<BR><BR>5)num_weak=2.2(added 2.2 faked weak opponents)<BR><BR>6)<FONT size=2>value_weak=2210(rating of the weak opponents like chess metrics<BR><BR>7)<FONT size=2>unrated=2285(I think that practically had no effect because players always&nbsp;have games in the last 80 months)</P></FONT>
<P></FONT><BR>8)<FONT size=2>minimal_game_finished=15(I reduce rating to players with less than 15 weighted games similiar to chess metrics)<BR><BR>9)reduction_per_game=12(the number that I reduce for less of experience for player without many weight games)<BR><BR>10)adding=39(the number that I add to rating of players after every iteration)<BR><BR>repeat_strength_estimate basically did 10 iterations for evaluating the strength of every player in every month.<BR></FONT>The evaluation of&nbsp;the strengh was based on 2 steps when step 1 was the function that calculate strength that is similiar to chess metrics but there are important differences and step 2 was deciding that place 50 has rating 2625 in the rating list that is exactly the same as chess metrics.<BR><BR><FONT size=2></P>
<P>calc_strength_chess_metric is the missing function to understand the algorithm and it basically got 11 parameters(all the 10 parameters that repeat_strength_estimate got and another parameter that is the month that we calculate&nbsp;estimate for it).<BR><BR>Note that&nbsp;the&nbsp;estimate for month 50 of player 1 when months 101-105 are missing is important because if player 2 played with player 1 at month 50 <BR>then it is going to influence the rating of player 2&nbsp; at month 101-105 that is used to calculate the expected result.<BR><BR>I use the word estimate and not rating because rating by definition assume that we do not have future results.<BR><BR>I had basically 2 steps in<BR>calc_strength_chess_metric<BR><BR>The first step was a loop that calculated the estimate for strength for every player in the relevant month.<BR>The second step is a&nbsp;step that I used only when I needed to predict the strength in the missing months and it is practically unfair trick but not something that is forbidden in the competition because I used the information about games and not about the results in the supposed missing months to&nbsp;calculate changes in the rating estimate in these months.<BR></FONT><BR>I did not finish to explain my algorithm&nbsp;and I plan also to send code later but now I need only to explain the 2 steps of calc_strength_chess_metric to explain my algorithm and I will do it later in another post(this part of the program is only slightly more than 100 lines of code in C).<BR></P>

question: summarize the solution in the above writeup, focusing on used data analysis methods. Focus on the key points of the solutions and how they might have helped achieving better score in the competition.

answer: 

In [33]:
%%time
summary3, summary3_md = llama3_generate(prompt)
Markdown(summary3)

CPU times: user 22.5 s, sys: 132 ms, total: 22.6 s
Wall time: 22.6 s


The solution writeup describes a rating system that estimates the current ability of each player based on past results. The system uses a combination of data analysis methods to calculate the rating and expected score for each player.

Key points of the solution:

1. **Rating calculation**: The solution uses a formula to calculate the rating for each player, which takes into account the player's performance and the ratings of their opponents. The formula is based on the concept of "bonus" for white, which is calculated as the maximum of the difference between the white and black ratings divided by 40, and 50.
2. **Performance rating**: The solution uses a performance rating formula that takes into account the result of the game, the opponents, and the rating of the player. The formula is designed to give a performance rating that is not too sensitive to the result of a single game.
3. **Repeat strength estimate**: The solution uses a function called "repeat_strength_estimate" to calculate the strength of each player in each month. The function takes into account the player's past performance, the ratings of their opponents, and the number of games played.
4. **Calc_strength_chess_metric**: The solution uses a function called "calc_strength_chess_metric" to calculate the strength of each player in each month. The function takes into account the player's past performance, the ratings of their opponents, and the number of games played.

Data analysis methods used:

1. **Regression analysis**: The solution uses a regression-like formula to calculate the rating and expected score for each player.
2. **Weighted average**: The solution uses a weighted average to calculate the performance rating, where the weights are based on the opponents and the result of the game.
3. **Iterative calculation**: The solution uses an iterative calculation to calculate the strength of each player in each month, where the calculation is based on the player's past performance and the ratings of their opponents.

How the solution might have helped achieving better score in the competition:

1. **Accurate rating calculation**: The solution's rating calculation formula takes into account the player's performance and the ratings of their opponents, which might have resulted in more accurate ratings.
2. **Robust performance rating**: The solution's performance rating formula is designed to give a performance rating that is not too sensitive to the result of a single game, which might have resulted in more robust ratings.
3. **Effective use of data**: The solution uses a combination of data analysis methods to calculate the rating and expected score for each player, which might have resulted in more effective use of the available data.
4. **Iterative calculation**: The solution's iterative calculation of the strength of each player in each month might have resulted in more accurate estimates of the player's strength.

Overall, the solution's use of data analysis methods, such as regression analysis, weighted average, and iterative calculation, might have helped achieve better scores in the competition by providing more accurate ratings and expected scores.<|eot_id|>

Seems better to me. But there are multiple writeups typically for a competition, so finding the best signal from the noise would be nice. That is, first to select which writeups to show first, or how to rank them for the user.

# Try to Rank the Solutions in a Competition

Because it could be helpful to show someone who wants a competition solution writeup summary the best scoring solutions first.

So, first a method to perform competition summarization all in once to apply across different competitions. See how I tried to trick people tend to give for prompting as an option, asking it to think step by step :).

In [34]:
def find_writeup_ranking(title, print_prompt=False, n_process=None):
    df_comp = df_writeups[df_writeups["Title of Competition"]==title]
    #display(df_comp)
    for i in range(0, df_comp.shape[0]):
        writeup_title = df_comp.iloc[i]["Title of Writeup"]
        writeup = df_comp.iloc[i]["writeup_clean"]
        prompt = f"Given the following competition solution writeup, " \
                 f"what is the placement of the described solution in the competition? "\
                 f"Think step by step. First check if the title contains the placement, then check if the writeup does. " \
                 f"If you cannot find the placement, answer only with 'insufficient information'.\n\n" \
                 f"title: {writeup_title}:\n" \
                 f"writeup: {writeup}"
        #prompt = f"{writeup_title}:\n" \
        #         f"{writeup}.\n\nBased on my analysis, the above write solution in the competition ranked at position: "
        if print_prompt:
            print(prompt)
        ranking, ranking_md = llama3_generate(prompt)
        print(f"--------------- {writeup_title} -------------------")
        display(ranking_md)
        if n_process is not None and i >= n_process-1:
            break

In [35]:
comp0 = competitions[0]
comp0

'Chess ratings - Elo versus the Rest of the World'

From reading Kaggle writeups in the past, I recall seeing those mostly posted for high-ranking notebooks. The following shows how the title often states the ranking of the writeup solution in the competition, and how well Gemma does with identifying that information:

In [36]:
find_writeup_ranking(comp0)

--------------- Released: my Source Code and Analysis -------------------


Insufficient information.<|eot_id|>

--------------- 6th place(UriB) by Uri Blass -------------------


The title mentions "6th place" and the writeup does not provide any additional information about the placement. Therefore, the answer is:

6th place<|eot_id|>

--------------- 7th place - littlefish -------------------


The placement is explicitly mentioned in the title: "7th place - littlefish". Therefore, the described solution is in 7th place.<|eot_id|>

--------------- 3rd place: Chessmetrics - Variant -------------------


The title explicitly states that the solution is the 3rd place winner, so the answer is: **3rd place**.<|eot_id|>

--------------- 2nd place: TrueSkill Through Time -------------------


According to the writeup, the author placed 2nd in the competition.<|eot_id|>

However, as the above shows, not all solutions are perfectly ranked with this. The model sometimes seems to lose its touch. For example, in my current run the last of the solutions above has a title of "2nd place:...". For some reason Gemma does not recognize this. The way Gemma expresses the ranking in its output varies a lot, making it a bit harder to parse it reliably.

As a sidenote, some of the collected writeups in this dataset do now even seem to be actual solutions, just random looking forum posts that may have been caught "by accident" in the dataset, or that just refer to some notebook with a few words in the forum post.

Well, I guess all datasets end up with random noise. Thats why data cleaning is such a big job.. 

Here is one example, it just seems to congratulate the winners:

In [37]:
comp3 = competitions[3]
find_writeup_ranking(comp3, True, 1)

Given the following competition solution writeup, what is the placement of the described solution in the competition? Think step by step. First check if the title contains the placement, then check if the writeup does. If you cannot find the placement, answer only with 'insufficient information'.

title: Sharing Techniques Used:
writeup: Congratulations to all the leaders in this contest!    Unfortunately, these forums have been pretty quiet during the contest, but now that it's over, I'm wondering if people are willing to disclose the techniques they used so others can learn something new.In a few emails with a couple contestants, I know there are a mix of techniques being used out there --- KNNs, neural-nets, SVDs, node metrics (like Adar/Adamic, Jaccard, number of common neighbors), and some graph-based techniques (shortest paths, edge-betweenness centrality, etc.).   So, what techniques did you use? What worked, and what didn't?  Thanks!
--------------- Sharing Techniques Used ----

Insufficient information.<|eot_id|>

The above also illustrates how Gemma has a tendency to get overly talkative. My prompt just asks it to find the placement, not to tell me about what it thinks are the solutions used. Yet it still lists a set of solutions, that have not even been used in the writeup. They are just given as examples of what the writer thought were used generally.

# Attempt 4: Summarize Multiple Solutions

OK, so we have multiple writeups per competition, and automated ranking identification was so far not very successful. How about summarizing multiple writeups from the competition to give an idea of what was kind of solutions people generally used?

### Try with a single writeup first to experiment prompts

This prompt still asks for summary with that exact word.

In [38]:
prompt_writeup_and_desc = \
         f"you are to answer as a helpful assistant helping understand questions about " \
         f"a kaggle competition writeup, and the solutions the writeup describes. " \
         f"a short description of the competition and its goals, and a solution writeup follows. " \
         f"description: {comp_summary}\n\n" \
         f"writeup: {writeup0_title}:\n {writeup0}\n\n" \
         f"question: summarize the solution in the above writeup, focusing on used data analysis methods. " \
         f"Focus on the key points of the solutions and how they might have helped achieving better score in the competition.\n\n" \
         f"answer: "

In [39]:
%%time
summary3, summary3_md = llama3_generate(prompt_writeup_and_desc)
Markdown(summary3)

CPU times: user 22.7 s, sys: 156 ms, total: 22.8 s
Wall time: 22.8 s


The solution writeup describes a rating system that estimates the current ability of each player based on past results. The system uses a combination of data analysis methods to calculate the rating and expected score for each player.

Key points of the solution:

1. **Rating calculation**: The solution uses a formula to calculate the rating for each player, which takes into account the player's performance and the ratings of their opponents. The formula is based on the concept of "bonus" for white, which is calculated as the maximum of the difference between the white and black ratings divided by 40, and 50.
2. **Performance rating**: The solution uses a performance rating formula that takes into account the result of the game, the opponents, and the rating of the player. The formula is designed to give a performance rating that is not too sensitive to the result of a single game.
3. **Repeat strength estimate**: The solution uses a function called "repeat_strength_estimate" to calculate the strength of each player in each month. The function takes into account the player's past performance, the ratings of their opponents, and the number of games played.
4. **Calc_strength_chess_metric**: The solution uses a function called "calc_strength_chess_metric" to calculate the strength of each player in each month. The function takes into account the player's past performance, the ratings of their opponents, and the number of games played.

Data analysis methods used:

1. **Regression analysis**: The solution uses a regression-like formula to calculate the rating and expected score for each player.
2. **Weighted average**: The solution uses a weighted average to calculate the performance rating, where the weights are based on the opponents and the result of the game.
3. **Iterative calculation**: The solution uses an iterative calculation to calculate the strength of each player in each month, where the calculation is based on the player's past performance and the ratings of their opponents.

How the solution might have helped achieving better score in the competition:

1. **Accurate rating calculation**: The solution's rating calculation formula takes into account the player's performance and the ratings of their opponents, which might have resulted in more accurate ratings.
2. **Robust performance rating**: The solution's performance rating formula is designed to give a performance rating that is not too sensitive to the result of a single game, which might have resulted in more robust ratings.
3. **Effective use of data**: The solution uses a combination of data analysis methods to calculate the rating and expected score for each player, which might have resulted in more effective use of the available data.
4. **Iterative calculation**: The solution's iterative calculation of the strength of each player in each month might have resulted in more accurate estimates of the player's strength.

Overall, the solution's use of data analysis methods, such as regression analysis, weighted average, and iterative calculation, might have helped achieve better scores in the competition by providing more accurate ratings and expected scores.<|eot_id|>

## Change Prompt to Avoid "Summarize" term

The above summary shows, similar to the previous ones, how Gemma actually likes to give its opinions at the end of a summary. In this case, I am not so interested in its opinion, especialle since it seems to get confused even with contest description vs solution writeup.

So I tried to avoid this, by asking for the key points and avoiding the term "summarize". Lets see:

In [40]:
def summarize_description(description):
    #print("summarizing description")
    prompt = f"Summarize the following kaggle competition description in a very concise way. "\
         f"Omit references to specific algorithms or known solutions, focus on overview of the contest topic and goals.\n\n" \
         f"description: {description}\n\n" \
         f"summary: "
    summary, summary_md = llama3_generate(prompt)
    return summary

Now for the modified summarization prompt. Here I do not use the term "summarize", but rather as Gemma to just list the key points (from datascience application viewpoint):

In [41]:
%%time

def summarize_writeup_and_desc(df_comp_meta, comp_name, writeup_title, writeup):
    comp_desc = df_comp_meta[df_comp_meta["comp_name"] == comp_name]["desc"].values[0]
    desc_summary = summarize_description(comp_desc)
    prompt_writeup_and_desc = \
         f"You are Gemma, a helpful assistant for Kaggle competitors to understand their datascience problems " \
         f"and propose approaches to solve them. " \
         f"Your points should have enough details to be useful for similar problem solving, " \
         f"but not excessively focused on the specific approach in writeup. " \
         f"More in the sense of overall application lessons for similar problems. " \
         f"here is a short summary of the competition now being analyzed: " \
         f"description: {desc_summary}\n\n" \
         f"following will be a writeup of someone who participated in this competition." \
         f"list the key points of this from the datascience application viewpoint for similar problem solving.\n" \
         f"writeup:\n {writeup_title}:\n {writeup}\n\n" \
         f"answer: \n"
    final_summary, final_summary_md = llama3_generate(prompt_writeup_and_desc)
    return final_summary, final_summary_md

competitions = df_writeups["Title of Competition"].unique()
comp0 = competitions[0]
df_comp0 = df_writeups[df_writeups["Title of Competition"]==comp0]
writeup0_title = df_comp0["Title of Writeup"].iloc[0]
comp0_name = df_comp0["Title of Competition"].iloc[0]
writeup0=df_comp0.iloc[1]["writeup_clean"]
final_summary, final_summary_md = summarize_writeup_and_desc(df_comp_meta, comp0_name, writeup0_title, writeup0)
display(final_summary_md)

What a fascinating competition! As a helpful assistant, I'll extract the key points from a datascience application viewpoint for similar problem solving. Here are the key takeaways:

1. **Rating system design**: The competition involves developing a rating system that estimates the current ability of each player based on past results. This requires designing a methodology that takes into account factors such as player performance and opponent ratings.
2. **Formula-based approach**: The author used a formula-based approach to calculate the expected result, which is a simple and intuitive way to model the relationship between ratings and expected outcomes.
3. **Rating calculation**: The author used a more complex formula to calculate the rating for each player, which involves predicting the past based on future data (a technique known as "future to predict the past"). This approach is not uncommon in datascience, where predicting the past can be a useful technique for modeling complex systems.
4. **Parameter tuning**: The author used parameter tuning to optimize the performance of their rating system, which is a common practice in datascience. This involves adjusting parameters to find the optimal values that minimize errors or maximize performance.
5. **Data augmentation**: The author used data augmentation techniques, such as adding fake opponents with average ratings, to improve the robustness of their rating system. This is a common technique in datascience, where adding noise or perturbations to the data can help improve model performance.
6. **Iterative estimation**: The author used an iterative estimation approach to calculate the strength of each player, which involves repeating the estimation process multiple times to refine the estimates. This is a common technique in datascience, where iterative estimation can be used to improve the accuracy of model predictions.
7. **Unfair trick**: The author used an "unfair trick" to predict the strength of players in missing months, which involves using information about games and not results in those months. While this may not be a conventional approach, it highlights the importance of creative thinking and outside-the-box solutions in datascience.

Overall, this competition showcases the importance of designing a robust rating system that takes into account various factors, including player performance and opponent ratings. The use of formula-based approaches, parameter tuning, data augmentation, and iterative estimation are all important techniques in datascience that can be applied to similar problems.<|eot_id|>

CPU times: user 20.9 s, sys: 193 ms, total: 21.1 s
Wall time: 21.1 s


Now, in the above, the Gemma opinion piece from the end of the "summary" is gone. And in general, I find the key points to represent this writeup quite well. 

This especially serves my purposes here better, since if I want to try using Gemma to summarize multiple writeups, I am not interested in Gemma trying to summarize multiples of its own opinion pieces. Rather I would prefer it to summarize the used solutions across those multiple writeups.

## Try to summarize competition writeups together

To make it a bit quicker and concise, I summarize only a subset of the competitions here. First a set of helpers to summarize writeups for a competition together:

In [42]:
competitions = df_writeups["Title of Competition"].unique()
writeup_summaries = []
writeup_summaries_md = []
overall_summaries = []
processed_titles = set()
processed_titles_list = []
skipped_titles = set()
skipped_titles_list = []
overall_summary_prompts = []
prompts = []
writeups = []
prompt_lengths = []
writeup_lengths = []
summary_lengths = []


In [43]:
def summarize_competition_writeups(comps_to_summarize, df_writeups_here, model_tokenizer, max_comps=None):
    skip_count = 0
    for idx, competition in tqdm(enumerate(comps_to_summarize), total=len(comps_to_summarize)):
        df_comp = df_writeups_here[df_writeups_here["Title of Competition"] == competition]

        comp_name = df_comp["Title of Competition"].iloc[0]
        # have to skip overly long writeups. luckily not too many of those
        max_len = 8192 - 1001 #1000 here is requested count for max new tokens
        if on_kaggle:
            max_len = 4096 # Kaggle GPU seems unable to handle long sequence due to memory limit
        memory_error_msg = "skipped due to token limit (on Kaggle need shorter due to GPU memory limit)"
        #memory_skip = False

        # list_idx is to print position in saved lists, to be able to pick one later
        list_idx = len(overall_summaries)
        print(f"{list_idx}: competition {comp_name}, writeups {df_comp.shape[0]}")
        #here filter only the writeups for this competition
        intermediate = df_comp_meta[df_comp_meta["comp_name"] == comp_name]
        if intermediate.shape[0] == 0:
            # the two datasets I use here are not fully in sync, have to skip full mismatches
            print("competition metadata not found, skipping")
            skipped_titles.add(competition)
            skipped_titles_list.append(competition)
            skip_count += 1
            continue
        # on Kaggle it runs a bit slow at time, so had to implement some extra capping support
        if max_comps is not None and idx - skip_count >= max_comps:
            print("stopping due to maximum summaries reached")
            break

        comp_desc = df_comp_meta[df_comp_meta["comp_name"] == comp_name]["desc"].values[0]
        desc_summary = summarize_description(comp_desc)

        comp_writeup_summaries = []
        comp_writeup_summaries_md = []
        comp_prompts = []
        comp_writeups = []
        comp_prompt_lengths = []
        comp_writeup_lengths = []
        comp_summary_lengths = []
        prompts.append(comp_prompts)
        writeups.append(comp_writeups)
        writeup_summaries.append(comp_writeup_summaries)
        writeup_summaries_md.append(comp_writeup_summaries_md)
        prompt_lengths.append(comp_prompt_lengths)
        writeup_lengths.append(comp_writeup_lengths)
        summary_lengths.append(comp_summary_lengths)

        prompt_overall_summary = \
            f"The following gives a summary of a Kaggle competition description, " \
            f"and a set of one or more writeups on solutions used in that competition, separated by ======.\n\n" \
            f"Use these to summarize a set of guidelines for ideas on how to approach " \
            f"a given Kaggle data science competition. \n\n" \
            f"Competition description summary: {desc_summary}\n\n"

        # loop all writeups for each competition
        for i, row in tqdm(df_comp.iterrows(), total=df_comp.shape[0]):
            writeup_title = row["Title of Writeup"] 
            writeup = row["writeup_clean"]
            #print(i)

            prompt_writeup_and_desc = \
                 f"You are Gemma, a helpful assistant for Kaggle competitors to understand their datascience problems " \
                 f"and propose approaches to solve them. " \
                 f"Your points should have enough details to be useful for similar problem solving, " \
                 f"but not excessively focused on the specific approach in writeup. " \
                 f"More in the sense of overall application lessons for similar problems. " \
                 f"Here is a short summary of the competition now being analyzed:\n\n " \
                 f"description:\n {desc_summary}\n\n" \
                 f"Following will be a writeup of someone who participated in this competition. " \
                 f"List the key points of this from the datascience application viewpoint for similar problem solving.\n" \
                 f"Writeup:\n {writeup_title}:\n {writeup}\n\n" \
                 f"Answer: \n"

            input_ids = model_tokenizer(prompt_writeup_and_desc, return_tensors="pt")
            prompt_len = len(input_ids["input_ids"][0])

            comp_prompts.append(prompt_writeup_and_desc)
            comp_writeups.append(writeup)
            comp_prompt_lengths.append(prompt_len)
            input_ids = model_tokenizer(writeup_title+":\n"+writeup, return_tensors="pt")
            writeup_len = len(input_ids["input_ids"][0])
            comp_writeup_lengths.append(writeup_len)
            #print(f"prompt length: {prompt_len}, writeup length: {writeup_len}")

            if prompt_len > max_len:
                print(f"skipping writeup {writeup_title} for competition {competition} due to too high length {prompt_len}.")
                #continue
                summary, summary_md = memory_error_msg, memory_error_msg
                memory_skip = True
            else:
                summary, summary_md = llama3_generate(prompt_writeup_and_desc)

            # Store or process the 'summary' as needed
            #print(f"Competition: {comp_name}")
            #print(f"Writeup Title: {writeup_title}")
            #print(f"Summary: {summary}")
            #print("----------------------")
            comp_writeup_summaries.append(summary)
            comp_writeup_summaries_md.append(summary_md)
            input_ids = model_tokenizer(summary, return_tensors="pt")
            summary_len = len(input_ids["input_ids"][0])
            comp_summary_lengths.append(summary_len)

            input_ids = model_tokenizer(prompt_overall_summary, return_tensors="pt")
            prompt_overall_summary_len = len(input_ids["input_ids"][0])

            if prompt_overall_summary_len < max_len:
                prompt_overall_summary += f"\n\n======\n\n writeup summary:\n {summary}\n\n"
            else:
                print(f"skipping adding to overall prompt due to reaching max limit set: {prompt_overall_summary_len} > {max_len}")

        if competition not in processed_titles:
            processed_titles.add(competition)
            processed_titles_list.append(competition)

        prompt_overall_summary += \
            f"\n\n======\n\n Focus on the key points of the writeups and how they might have helped achieving better score in the competition." \
            f"Extract specifically used data analysis methods, and summarize how they are related across writeups.\n\n" \
            f"answer: "
        input_ids = model_tokenizer(prompt_overall_summary, return_tensors="pt")
        overall_prompt_len = len(input_ids["input_ids"][0])

        if overall_prompt_len > max_len:
            print(f"skipping overall summary for competition {competition} due to too high length {overall_prompt_len}.")
            overall_summary, overall_summary_md = memory_error_msg, memory_error_msg
        else:
            overall_summary, overall_summary_md = llama3_generate(prompt_overall_summary)

        input_ids = model_tokenizer(overall_summary, return_tensors="pt")
        overall_summary_len = len(input_ids["input_ids"][0])
        
        #print(overall_summary)
        print(f"     prompt lengths={comp_prompt_lengths}")
        print(f"    writeup lengths={comp_writeup_lengths}")
        print(f"    summary lengths={comp_summary_lengths}")
        print(f"    total writeups length:{sum(comp_writeup_lengths)}, overall prompt length: {overall_prompt_len}, overall summary length: {overall_summary_len}")
        overall_summaries.append(overall_summary_md)
        overall_summary_prompts.append(prompt_overall_summary)
    print(f"total of {len(prompts)} competitions processed")


With the above method, the following notebook cells will create overall summaries of competition writeups. This approach first uses Gemma to summarize each writeup for a competition separately, takes these summaries together, and finally uses Gemma to summarize the key points from all these writeup summaries as one. It prints out the following information:

- writeup count: how many writeups are there for the competition in the dataset.
- prompt lengths: how long is the separate writeup summary prompt used as counted in Gemma tokens. this prompt includes the writeup.
- writeup lengths: length of a separate writeup alone, without the rest of the prompt.
- summary lengths: length of the Gemma built summaries (list of key points) for each writeup
- total writeup length: length of all writeups for the competition, in Gemma tokens. if we concatenate the raw writeups in full
- overall prompt length: the summary prompt length using summaries of each writeup as input and not full writeups.
- overall summary length: length of the final overall summary generated by Gemma for overall prompt.

In [44]:
%%time
summarize_competition_writeups(competitions[:6], df_writeups, tokenizer)


  0%|          | 0/6 [00:00<?, ?it/s]

0: competition Chess ratings - Elo versus the Rest of the World, writeups 5


  0%|          | 0/5 [00:00<?, ?it/s]

     prompt lengths=[312, 1331, 604, 666, 2443]
    writeup lengths=[115, 1134, 407, 468, 2246]
    summary lengths=[417, 526, 402, 445, 430]
    total writeups length:4370, overall prompt length: 2440, overall summary length: 503
1: competition RTA Freeway Travel Time Prediction, writeups 1


  0%|          | 0/1 [00:00<?, ?it/s]

     prompt lengths=[190]
    writeup lengths=[12]
    summary lengths=[513]
    total writeups length:12, overall prompt length: 687, overall summary length: 463
2: competition Predict Grant Applications, writeups 2


  0%|          | 0/2 [00:00<?, ?it/s]

     prompt lengths=[193, 735]
    writeup lengths=[18, 560]
    summary lengths=[287, 426]
    total writeups length:578, overall prompt length: 891, overall summary length: 290
3: competition IJCNN Social Network Challenge, writeups 2
competition metadata not found, skipping
3: competition Stay Alert! The Ford Challenge, writeups 1


  0%|          | 0/1 [00:00<?, ?it/s]

     prompt lengths=[760]
    writeup lengths=[606]
    summary lengths=[371]
    total writeups length:606, overall prompt length: 520, overall summary length: 437
4: competition Don't Overfit!, writeups 1


  0%|          | 0/1 [00:00<?, ?it/s]

     prompt lengths=[443]
    writeup lengths=[265]
    summary lengths=[281]
    total writeups length:265, overall prompt length: 455, overall summary length: 319
total of 5 competitions processed
CPU times: user 4min 7s, sys: 1.42 s, total: 4min 8s
Wall time: 4min 8s


As shown above, most of these competitions actually just have one writeup in this dataset. Many of them are also very short, with the shortest being only 15 tokens. A few competitions had more than one writeup, and some had longer writeups. So lets take a look at some of these different types in a bit more detail.

### Overall summary, competition 1

The first competition has the most writeups in this set, 5 writeups. Lets see how all of them get summarized together.


In [45]:
overall_summaries[0]

Based on the writeup summaries, here are the key points that can be applied to similar datascience problems:

1. **Clearly define the problem and its requirements**: Understand the problem and its requirements, and define them clearly.
2. **Explore different approaches**: Don't be afraid to try new things and explore different approaches.
3. **Code sharing and collaboration**: Share code and collaborate with others to learn from their experiences and improve your approach.
4. **Develop an experimentation framework**: Develop a framework for experimentation to rapidly prototype and refine your solutions.
5. **View failure as an opportunity to learn**: View failure as an opportunity to learn and improve.
6. **Engage with the community**: Engage with the community to share knowledge and learn from others.
7. **Design a robust rating system**: Design a robust rating system that takes into account various factors, such as player performance and opponent ratings.
8. **Use historical data**: Use historical data to inform predictions and make informed decisions.
9. **Handle missing data**: Handle missing data using techniques such as using faked opponents and reducing ratings for players with less than 15 weighted games.
10. **Iterative approach**: Use an iterative approach to refine predictions and account for changing circumstances.
11. **Use well-defined metrics and formulas**: Use well-defined metrics and formulas to make predictions and inform decisions.
12. **Handle edge cases**: Consider edge cases and develop strategies to handle them.
13. **Use creative solutions**: Use creative solutions to handle missing data and improve the accuracy of predictions.
14. **Iterative refinement**: Refine predictions through multiple iterations.
15. **Use domain-specific knowledge**: Use domain-specific knowledge to inform predictions and make informed decisions.

The data analysis methods used across the writeups include:

* Weighting and decay
* Padding and normalization
* Iterative calculation
* Incorporating future results
* Simple modifications
* Exploration of established methods
* Customization of the rating formula
* Use of iterative rating formula
* Weighting and normalization
* Use of additional parameters
* Prediction formula
* Code implementation

These methods are related across the writeups in that they all involve using different techniques to develop a robust and accurate rating system. The methods are used to handle missing data, refine predictions, and account for changing circumstances. The use of iterative calculation, weighting and normalization, and incorporating future results are common across the writeups, indicating that these methods are effective in developing a robust rating system.<|eot_id|>

The above overall summary seems to make sense, although much of it focuses on very generic advice. 

Perhaps with some more prompt tuning it could be tuned to summarize the more specific solutions from the writeup. Although even in this case the "Data Analysis Methods" part does include some reference to those specific solutions. And the general advice is not bad in itself. So I leave further prompt tuning for the future. 

This summary is also back to providing Gemmas opinions in the end, likely because in my prompt I used the term "summarize" again. However, for me it is fine this time since this would be the final output of the overall summarization.

I will still have to look at the individual writeup details vs overall summary in a future update.

### Overall summary, competition 2

As seen in the above processing, competition 2 had a single writeup, that had the shortest set of 15 tokens. Lets see what Gemma has to say about that. 

But first, lets check what is such as super-short writeup anyway:


In [46]:
writeups[1]

["For anybody interest, here's the actual solution."]

This looks even a bit suspicious, as if something is left out of the writeup. Let's see if the problem is in what I picked from the dataframe, in the collection method of the writeups, or maybe the writeup is just this :).

In [47]:
df_writeups[df_writeups["Title of Competition"] == "RTA Freeway Travel Time Prediction"]


Unnamed: 0,Competition Launch Date,Title of Competition,Competition URL,Date of Writeup,Title of Writeup,writeup,Writeup URL,writeup_clean
5,11/23/2010 00:00:00,RTA Freeway Travel Time Prediction,https://www.kaggle.com/c/2467,02/16/2011 06:22:07,Solution,"For anybody interest, here's the actual solution.",https://www.kaggle.com/c/2467/discussion/294,"For anybody interest, here's the actual solution."


In [48]:
df_writeups[df_writeups["Title of Competition"] == "RTA Freeway Travel Time Prediction"]["writeup"]

5    For anybody interest, here's the actual solution.
Name: writeup, dtype: object

At least it seems that this is the actual writeup in the data I am using. 

So what does Gemma say when it tries to summarize this writeup that does not actually say much of anything?

In [49]:
Markdown(writeup_summaries[1][0])

Based on the writeup, I've identified the key points from a datascience application viewpoint that can be applied to similar problem-solving:

1. **Understanding the problem**: The competition's goal is to predict travel times on the M4 freeway, which requires a deep understanding of the problem and its context. This includes identifying the key factors that affect travel times, such as traffic volume, road conditions, and time of day.
2. **Data preparation**: The solution likely involved preparing the historical data for analysis, which may have included handling missing values, data cleaning, and feature engineering. This step is crucial in ensuring that the data is accurate and reliable.
3. **Feature selection**: The solution likely involved selecting the most relevant features that can help predict travel times. This may have included features such as traffic volume, road conditions, time of day, and weather.
4. **Model selection**: The solution likely involved selecting a suitable machine learning model that can accurately predict travel times. This may have included models such as linear regression, decision trees, random forests, or neural networks.
5. **Hyperparameter tuning**: The solution likely involved tuning the hyperparameters of the selected model to optimize its performance. This may have included techniques such as grid search, random search, or Bayesian optimization.
6. **Model evaluation**: The solution likely involved evaluating the performance of the selected model using metrics such as mean absolute error (MAE), mean squared error (MSE), or mean absolute percentage error (MAPE). This step is crucial in ensuring that the model is accurate and reliable.
7. **Ensemble methods**: The solution may have involved using ensemble methods such as bagging, boosting, or stacking to combine the predictions of multiple models and improve the overall performance.
8. **Handling temporal dependencies**: The solution likely involved handling temporal dependencies in the data, such as using techniques like ARIMA or LSTM to capture the temporal patterns in the data.
9. **Handling spatial dependencies**: The solution may have involved handling spatial dependencies in the data, such as using techniques like spatial autoregressive models or spatially-aware neural networks to capture the spatial patterns in the data.
10. **Interpretability**: The solution likely involved providing interpretability for the model's predictions, such as using techniques like partial dependence plots or SHAP values to understand how the model's predictions are influenced by the input features.

By applying these key points, similar problem-solving can be approached with a structured and systematic approach, increasing the chances of success in predicting travel times or solving similar problems.<|eot_id|>

The actual input I used for Gemma is the competition description summary + competition title + writeup title + the writeup, so the actual prompt was this: 

In [50]:
Markdown(prompts[1][0])

You are Gemma, a helpful assistant for Kaggle competitors to understand their datascience problems and propose approaches to solve them. Your points should have enough details to be useful for similar problem solving, but not excessively focused on the specific approach in writeup. More in the sense of overall application lessons for similar problems. Here is a short summary of the competition now being analyzed:

 description:
 The competition aims to predict travel times on Sydney's M4 freeway using historical data to improve road safety, efficiency, and inform commuters' decisions. Participants must forecast travel times for various time intervals ahead, with the goal of optimizing the road transport system and increasing functionality on the government's live traffic website.<|eot_id|>

Following will be a writeup of someone who participated in this competition. List the key points of this from the datascience application viewpoint for similar problem solving.
Writeup:
 Solution:
 For anybody interest, here's the actual solution.

Answer: 


In the writeup summary, Gemma seems to have again picked up parts of the competition description, and invented some possible solutions. So hallucinating, perhaps due to limited (or non-existent) writeup content. 

The general timeseries problem description is correct as Gemma seems to have deduced that from the competition description. The invented solutions, such as using ARIMA, seem to be halluciations. Probably from general approaches in its training data to these types of problems. Such insights might actually be useful, especially when the writeup is non-existent. But it is not what I asked for. So not entirely wrong perhaps, but misleading if we take is as is for a solution summary.

So how does Gemma summarize this summary itself? Remember, the approach I trial here is to first summarize all the writeups for a competition, and from these a single overall summary. Since there is just one writeup in this competition, Gemma just tries to re-summarize its own summary now. 

Let's see what it says:


In [51]:
overall_summaries[1]

Based on the writeup summary, here are the key points that can be applied to similar problem-solving:

**Understanding the problem**: Identify the key factors that affect the outcome, such as traffic volume, road conditions, and time of day.

**Data preparation**: Handle missing values, clean the data, and perform feature engineering to ensure accurate and reliable data.

**Feature selection**: Select the most relevant features that can help predict the outcome, such as traffic volume, road conditions, time of day, and weather.

**Model selection**: Choose a suitable machine learning model, such as linear regression, decision trees, random forests, or neural networks.

**Hyperparameter tuning**: Optimize the model's performance using techniques like grid search, random search, or Bayesian optimization.

**Model evaluation**: Use metrics like MAE, MSE, or MAPE to evaluate the model's performance and ensure accuracy and reliability.

**Ensemble methods**: Combine the predictions of multiple models to improve overall performance.

**Handling temporal dependencies**: Use techniques like ARIMA or LSTM to capture temporal patterns in the data.

**Handling spatial dependencies**: Use techniques like spatial autoregressive models or spatially-aware neural networks to capture spatial patterns in the data.

**Interpretability**: Provide interpretability for the model's predictions using techniques like partial dependence plots or SHAP values.

The data analysis methods used across the writeups are:

1. **Data preparation**: Handling missing values, data cleaning, and feature engineering.
2. **Feature selection**: Selecting the most relevant features.
3. **Model selection**: Choosing a suitable machine learning model.
4. **Hyperparameter tuning**: Optimizing the model's performance.
5. **Model evaluation**: Evaluating the model's performance using metrics.
6. **Ensemble methods**: Combining the predictions of multiple models.
7. **Handling temporal dependencies**: Capturing temporal patterns in the data.
8. **Handling spatial dependencies**: Capturing spatial patterns in the data.
9. **Interpretability**: Providing interpretability for the model's predictions.

These methods are related across writeups in that they are all part of the data science process, from understanding the problem to evaluating the model's performance. By applying these methods, data scientists can increase the chances of success in predicting travel times or solving similar problems.<|eot_id|>

In my trials, I made small typo corrections and wording changes into the overall summary prompt, and got a few different responses here due to these small changes affecting the text generation. 

Previously, Gemma identified the summary as general descriptions and said it was unable to give good overall summary of specific solutions, since there was no such information given. Now it seems to get more into the halluciations in the input. 

However, at least it has not invented things that are not in the input summary. In the previous attempt (when Gemma identified it as useless input), the input summary was much more generic. So I guess in this case I cannot blame Gemma since all this information it now gives is in the input after all. 

The part where it previously seemed to identify its own generic hallucinations was quite interesting though. But this input is different, so cannot say how common that result would be in such a case.

### Overall summary, competition 3

This 3rd competition in the dataset has 2 writeups. One very short (20 tokens), and one longer (561 tokens). 

Lets see what these two writeups are, and how Gemma handles them:

In [52]:
writeups[2][0]

'The solution file is attached to this post.Thanks all for participating,Anthony\t'

In [53]:
Markdown(writeups[2][1])

Because I have recently started employment with Kaggle, I am not eligible to win any prizes. Which means the prize-winner for this comp is Quan Sun (team 'student1')! Congratulations!My approach to this competition was to first analyze the data in Excel pivottables. I looked for groups which had high or low application success rates. In this way, I found a large number of strong predictors - including by date (new years day is a strong predictor, as are applications processed on a Sunday), and for many fields a null value was highly predictive.I then used C# to normalize the data into Grants and Persons objects, and constructed a dataset for modeling including these features: CatCode, NumPerPerson, PersonId, NumOnDate, AnyHasPhd, Country, Dept, DayOfWeek, HasPhd, IsNY, Month, NoClass, NoSpons, RFCD, Role, SEO, Sponsor, ValueBand, HasID, AnyHasID, AnyHasSucc, HasSucc, People.Count, AStarPapers, APapers, BPapers, CPapers, Papers, MaxAStarPapers, MaxCPapers, MaxPapers, NumSucc, NumUnsucc, MinNumSucc, MinNumUnsucc, PctRFCD, PctSEO, MaxYearBirth, MinYearUni, YearBirth, YearUni .Most of these are fairly obvious as to what they mean. Field names starting with 'Any' are true if any person attached to the grant has that feature (e.g. 'AnyHasPhd'). For most fields I had one predictor that just looks at person 1 (e.g. 'APapers' is number of A papers from person 1), and one for the maximum of all people in the application (e.g. 'MaxAPapers').Once I had created these features, I used a generalization of the random forest algorithm to build a model. I'll try to write some detail about how this algorithm works when I have more time, but really, the difference between it and a regular random forest is not that great.I pre-processed the data before running it through the model by grouping up small groups in categorical variables, and replacing continuous columns with null values with 2 columns (one containing a binary predictor that is true only where the continuous column is null, the other containing the original column, with nulls replaced by the median). Other than the Excel pivottables at the start, all the pre-processing and modelling was done in C#, using libraries I developed during this competition. I hope to document and release these libraries at some point - perhaps after tuning them in future comps.

So, the first writeup seems to just refer to some attachment. I did now know people could do that on Kaggle. 

The second one has some content. So lets see how Gemma handles this combination.

First the separate summary for the first writeup:

In [54]:
Markdown(writeup_summaries[2][0])

Based on the writeup, here are the key points from a datascience application viewpoint for similar problem solving:

1. **Clear goal definition**: The goal of the competition is well-defined, which is to predict the success of grant applications. This clarity helps in focusing the approach and evaluation of the solution.
2. **Dataset availability**: The availability of a dataset (11,883 applications) provides a solid foundation for building and testing a predictive model.
3. **Simple solution submission**: The solution file is attached to the post, indicating that the participant's approach is straightforward and easy to understand, which is important for reproducibility and collaboration.
4. **Lack of additional information**: The writeup does not provide any additional insights or details about the solution, such as the specific algorithms used, feature engineering techniques, or hyperparameter tuning. This might be due to the simplicity of the solution or the brevity of the writeup.

For similar problem solving, these key points can be applied:

* Clearly define the goal and scope of the problem to focus the approach.
* Ensure the availability of a suitable dataset for building and testing the model.
* Keep the solution submission simple and easy to understand for reproducibility and collaboration.
* Be prepared to provide additional details and insights about the solution, such as the algorithms used, feature engineering techniques, and hyperparameter tuning, to facilitate a deeper understanding of the approach.<|eot_id|>

This time, Gemma obviously fared better in identifying that there was no solution provided. Perhaps how the competition description was summarized leads to less likely hallucinations? This is what the prompt with the competition description summary and the writeup looked like in this case:

In [55]:
Markdown(prompts[2][0])

You are Gemma, a helpful assistant for Kaggle competitors to understand their datascience problems and propose approaches to solve them. Your points should have enough details to be useful for similar problem solving, but not excessively focused on the specific approach in writeup. More in the sense of overall application lessons for similar problems. Here is a short summary of the competition now being analyzed:

 description:
 The University of Melbourne is hosting a competition to predict the success of grant applications, using a dataset of 11,883 applications. The goal is to identify factors that determine success and develop a model that can accurately predict which applications are likely to succeed, reducing time wasted on unsuccessful applications.<|eot_id|>

Following will be a writeup of someone who participated in this competition. List the key points of this from the datascience application viewpoint for similar problem solving.
Writeup:
 Solution:
 The solution file is attached to this post.Thanks all for participating,Anthony	

Answer: 


Well, the above prompt does not give any proposed solutions in the description summary, so perhaps that indeed helped Gemma avoid hallucinating the solutions.

Here is the summary for the second, longer writeup, in this competition:

In [56]:
Markdown(writeup_summaries[2][1])

As a helpful assistant, I've extracted the key points from the writeup for similar problem-solving:

1. **Initial Data Exploration**: The author started by analyzing the data in Excel pivot tables to identify groups with high or low application success rates. This step is crucial in understanding the distribution of the data and identifying potential patterns.
2. **Feature Engineering**: The author created a large number of features by combining existing variables, such as:
	* Date-based features (e.g., New Year's Day, Sunday)
	* Field-specific features (e.g., null values, maximum values)
	* Person-specific features (e.g., number of papers, PhD status)
	* Grouping categorical variables into smaller groups
	* Replacing continuous columns with null values and creating binary predictors
3. **Data Normalization**: The author used C# to normalize the data into objects (Grants and Persons) and constructed a dataset for modeling.
4. **Modeling**: The author used a generalized random forest algorithm to build a model. The key takeaway is that the difference between this algorithm and a regular random forest is not significant.
5. **Pre-processing**: The author pre-processed the data by grouping small groups in categorical variables and replacing continuous columns with null values with two columns (binary predictor and original column with nulls replaced by the median).
6. **Custom Libraries**: The author developed custom libraries in C# during the competition and plans to document and release them in the future.

Overall, the key takeaways for similar problem-solving are:

* Perform initial data exploration to understand the distribution of the data and identify potential patterns.
* Engage in feature engineering to create meaningful features that capture the relationships between variables.
* Normalize the data into a suitable format for modeling.
* Use pre-processing techniques to handle missing values, categorical variables, and other data quality issues.
* Consider using custom libraries or developing your own algorithms to tackle specific challenges in the problem.

These lessons can be applied to various data science problems, including predicting the success of grant applications, and can help you develop a robust approach to tackling similar challenges.<|eot_id|>

And the overall summary from the combination of these two writeup summaries:

In [57]:
overall_summaries[2]

Based on the writeups, here are the key points that can be applied to similar problem-solving:

**Initial Data Exploration**: Perform data exploration to understand the distribution of the data and identify potential patterns.

**Feature Engineering**: Engage in feature engineering to create meaningful features that capture the relationships between variables.

**Data Normalization**: Normalize the data into a suitable format for modeling.

**Pre-processing**: Use pre-processing techniques to handle missing values, categorical variables, and other data quality issues.

**Modeling**: Use a suitable algorithm for modeling, such as generalized random forest.

**Custom Libraries**: Consider using custom libraries or developing your own algorithms to tackle specific challenges in the problem.

These key points can be applied to various data science problems, including predicting the success of grant applications.

The data analysis methods used across the writeups include:

* Initial data exploration using Excel pivot tables
* Feature engineering using combinations of existing variables
* Data normalization using C# to construct a dataset for modeling
* Pre-processing using techniques such as grouping categorical variables, replacing continuous columns with null values, and creating binary predictors
* Modeling using a generalized random forest algorithm

These methods are related across the writeups in that they all contribute to the development of a robust approach to tackling the problem. By performing initial data exploration, feature engineering, data normalization, pre-processing, and modeling, the authors were able to develop a solution that accurately predicted the success of grant applications.<|eot_id|>

This overall summary seems to be quite accurate. At least it is not picking up any hallucinations from the very short one. Probably the accuracy and lack of hallucination in the short summary helps too.

# Summaries for Competitions with Longer and Larger Set of Writeups

Most of the writeups in the above set are quite similar to the three I checked above. Most have only one writeup, and at most there are three writeups, some of which are usually quite short as well. 

Besides these few and short writeups, an interesting combination to check is longer writeups, and competitions with a larger set of writeups.

The writeups in this dataset appear to be in chronological ordering, so the older ones (starting from year 2010) are first in the list/dataframe. The newest ones (seems to be around 2022) are last. In general it seems that over time there are more writeups for the later competitions, and often with more text. Perhaps more people on Kaggle, more writing on them, or people gaming the discussion badges more. 

Whatever the reason, here are the summaries for the last 20 competitions in the dataset, similar to the above first ones from the dataset (with few and shorter writeups):

In [58]:
for competition in tqdm(competitions[-20:]):
    df_comp = df_writeups[df_writeups["Title of Competition"] == competition]
    comp_name = df_comp["Title of Competition"].iloc[0]
    comp_meta = df_comp_meta[df_comp_meta["comp_name"] == comp_name]
    found = comp_meta.shape[0] > 0
    if found:
        print(f"competition {comp_name}, writeups {df_comp.shape[0]}")
    else:
        print(f"competition {comp_name}, writeups {df_comp.shape[0]}, no metadata")


  0%|          | 0/20 [00:00<?, ?it/s]

competition Feedback Prize - Predicting Effective Arguments, writeups 19
competition American Express - Default Prediction, writeups 31
competition HuBMAP + HPA - Hacking the Human Body, writeups 11
competition Mayo Clinic - STRIP AI, writeups 10
competition Google Universal Image Embedding, writeups 16
competition RSNA 2022 Cervical Spine Fracture Detection, writeups 18
competition DFL - Bundesliga Data Shootout, writeups 8, no metadata
competition AI Village Capture the Flag @ DEFCON, writeups 14
competition Open Problems - Multimodal Single-Cell Integration, writeups 20, no metadata
competition Feedback Prize - English Language Learning, writeups 45, no metadata
competition Novozymes Enzyme Stability Prediction, writeups 9, no metadata
competition G2Net Detecting Continuous Gravitational Waves, writeups 14, no metadata
competition OTTO – Multi-Objective Recommender System, writeups 27, no metadata
competition RSNA Screening Mammography Breast Cancer Detection, writeups 19, no metada

With the amount of writeups these have, it takes a bit of a long time to run them all. Especially on Kaggle with the slower GPU's and limited GPU memory. So I selected a few to run here after initially trying all the 20 listed above:
- Feedback prize: 19 writeups, the second highest in the list that has metadata
- America Express: 31 writeups, the most writeups in list with metadata
- Mayo Clinit: 10 writeups, least in the set with metadata
- AI Village: 18 writeups, but with many very short ones

It seems to be a decent set to compliment the earlier one with few writeups and many short ones.

In [59]:
selected_competitions = [
                         "Feedback Prize - Predicting Effective Arguments",
                         "American Express - Default Prediction",
                         "Mayo Clinic - STRIP AI",
                         "AI Village Capture the Flag @ DEFCON",
                        ]

NOTE: following have capped context on Kaggle due to GPU memory limit

In [60]:
#summarize_competition_writeups(competitions[-20:])
summarize_competition_writeups(selected_competitions, df_writeups, tokenizer)

  0%|          | 0/4 [00:00<?, ?it/s]

5: competition Feedback Prize - Predicting Effective Arguments, writeups 19


  0%|          | 0/19 [00:00<?, ?it/s]

skipping adding to overall prompt due to reaching max limit set: 7405 > 7191
skipping adding to overall prompt due to reaching max limit set: 7405 > 7191
skipping adding to overall prompt due to reaching max limit set: 7405 > 7191
skipping overall summary for competition Feedback Prize - Predicting Effective Arguments due to too high length 7449.
     prompt lengths=[754, 1330, 759, 789, 735, 598, 896, 2117, 1091, 2349, 705, 840, 659, 2223, 516, 737, 599, 806, 614]
    writeup lengths=[543, 1118, 548, 576, 523, 386, 684, 1906, 879, 2137, 493, 628, 448, 2010, 305, 525, 387, 593, 402]
    summary lengths=[445, 564, 411, 458, 442, 422, 356, 525, 425, 682, 456, 386, 396, 537, 336, 296, 485, 351, 472]
    total writeups length:15091, overall prompt length: 7449, overall summary length: 18
6: competition American Express - Default Prediction, writeups 31


  0%|          | 0/31 [00:00<?, ?it/s]

skipping adding to overall prompt due to reaching max limit set: 7630 > 7191
skipping adding to overall prompt due to reaching max limit set: 7630 > 7191
skipping adding to overall prompt due to reaching max limit set: 7630 > 7191
skipping adding to overall prompt due to reaching max limit set: 7630 > 7191
skipping adding to overall prompt due to reaching max limit set: 7630 > 7191
skipping adding to overall prompt due to reaching max limit set: 7630 > 7191
skipping adding to overall prompt due to reaching max limit set: 7630 > 7191
skipping adding to overall prompt due to reaching max limit set: 7630 > 7191
skipping adding to overall prompt due to reaching max limit set: 7630 > 7191
skipping adding to overall prompt due to reaching max limit set: 7630 > 7191
skipping adding to overall prompt due to reaching max limit set: 7630 > 7191
skipping adding to overall prompt due to reaching max limit set: 7630 > 7191
skipping adding to overall prompt due to reaching max limit set: 7630 > 7191

  0%|          | 0/10 [00:00<?, ?it/s]

     prompt lengths=[2749, 453, 1254, 370, 724, 313, 603, 1284, 672, 465]
    writeup lengths=[2584, 288, 1089, 205, 559, 148, 438, 1118, 507, 300]
    summary lengths=[516, 460, 513, 472, 354, 347, 484, 408, 452, 484]
    total writeups length:7236, overall prompt length: 4713, overall summary length: 512
8: competition AI Village Capture the Flag @ DEFCON, writeups 14


  0%|          | 0/14 [00:00<?, ?it/s]

     prompt lengths=[268, 304, 414, 584, 292, 1588, 388, 2192, 299, 366, 813, 426, 251, 289]
    writeup lengths=[35, 70, 182, 351, 60, 1354, 155, 1959, 67, 134, 581, 194, 18, 57]
    summary lengths=[397, 525, 400, 372, 369, 418, 390, 466, 426, 422, 461, 360, 551, 410]
    total writeups length:5217, overall prompt length: 6286, overall summary length: 387
total of 9 competitions processed


It seems that the competition metadata I use does not have metadata for all the latest competition writeups. Its not a big deal since I can still explore the writeups with Gemma just fine.

## Overall about Overall Summarization

As I noted, it seems the number of writeups and their length increases quite a lot over time. Here there are 10-31 writeups for the ones that I had also the metadata for. 

The longest combination of the writeups I checked above (American Express Default Prediction, 31 writeups) is 30k Gemma tokens in length, meaning it would not fit into Gemma's 8k window.

While the 30k tokens would be too much for Gemma context window of 8k, as shown above, the combined summaries of those writeups fit in 8k (just had to cap it on Kaggle to 4k). The longest (American Express) going from 30k too 7.1k tokens with using the combined write summaries as input vs the raw writeups. 



First, a look at the first of the resulting overall summaries. This one had 19 writeups (or their summaries) as input as seen above:

In [61]:
overall_summaries[5]

'skipped due to token limit (on Kaggle need shorter due to GPU memory limit)'

This is followed by the longest one (American Express), which had 31 writeups, with the writeup summaries together amounting to over 7k tokens (it not capped as I had to do on Kaggle):

In [62]:
overall_summaries[6]

'skipped due to token limit (on Kaggle need shorter due to GPU memory limit)'

And one more for the road, this time the one with fewest writeups, meaning 10 in this case:

In [63]:
overall_summaries[7]

Based on the writeup summaries, here are the key points that can be applied to similar problem-solving scenarios:

**Common themes:**

1. **Preprocessing and feature engineering**: Many writeups emphasize the importance of proper preprocessing and feature engineering techniques, such as resizing, normalization, and data augmentation.
2. **Ensemble methods**: Combining multiple models or using ensemble methods can improve performance and reduce overfitting.
3. **Data augmentation**: Data augmentation techniques, such as random flips, color jitter, and pixel dropout, can increase the diversity of the training data and improve model performance.
4. **Model selection and tuning**: Selecting the best model and tuning hyperparameters are crucial steps in achieving good performance.
5. **Experimentation and iteration**: Experimenting with different approaches and iterating on the solution are essential for finding the best approach.

**Specific data analysis methods:**

1. **Tile selection**: Selecting the most informative tiles from the whole slide digital pathology images can improve model performance.
2. **Pseudo-labeling**: Creating pseudo-labels for the tiles can help in feature extraction and aggregation.
3. **Attention pooling**: Replacing average pooling with attention pooling in the classification head can improve model performance.
4. **Stratified splitting**: Splitting the data into training and validation sets while stratifying by class can ensure that the class distribution is preserved.
5. **Undersampling**: Undersampling the majority class can help balance the class distribution in imbalanced datasets.

**Relationship across writeups:**

1. **Preprocessing and feature engineering**: Many writeups emphasize the importance of proper preprocessing and feature engineering techniques, which are essential for ensuring the model receives high-quality input data.
2. **Ensemble methods**: Combining multiple models or using ensemble methods can improve performance and reduce overfitting, as seen in several writeups.
3. **Data augmentation**: Data augmentation techniques, such as random flips, color jitter, and pixel dropout, can increase the diversity of the training data and improve model performance, as seen in several writeups.
4. **Model selection and tuning**: Selecting the best model and tuning hyperparameters are crucial steps in achieving good performance, as seen in several writeups.
5. **Experimentation and iteration**: Experimenting with different approaches and iterating on the solution are essential for finding the best approach, as seen in several writeups.

By applying these key points and data analysis methods, datascientists can develop effective approaches to solve similar problems and improve their model's performance.<|eot_id|>

Finally, I can see one more interesting one in the above set. The "AI Village Capture the Flag @ DEFCON" competition has multiple writeups that are very short (under 100 tokens), along with a similar count of longer ones. Let's see how that looks in an overall summary:

In [64]:
overall_summaries[8]

Based on the writeup summaries, I've identified the key points that can be applied to similar problem-solving challenges in datascience. Here are the key takeaways:

**Common themes:**

1. **Challenge-based approach**: Breaking down complex problems into smaller, manageable challenges.
2. **Domain knowledge**: Understanding the underlying concepts and terminology of the problem domain.
3. **Exploration and experimentation**: Trying different approaches and iterating on solutions.
4. **Iterative approach**: Refining models and learning from mistakes.
5. **Collaboration and sharing**: Sharing knowledge and insights with others in the datascience community.

**Data analysis methods:**

1. **Data exploration**: Understanding data distribution, relationships, and patterns.
2. **Data preprocessing**: Cleaning, normalizing, and feature engineering data.
3. **Model selection and evaluation**: Selecting the most appropriate machine learning model and evaluating its performance.
4. **Image processing**: Manipulating and analyzing images using techniques like Fourier Transform.
5. **Adversarial AI**: Detecting and mitigating potential attacks on machine learning models.

**Relationships across writeups:**

1. **Data exploration** is a common theme across writeups, highlighting the importance of understanding data characteristics.
2. **Data preprocessing** is mentioned in several writeups, emphasizing the need to ensure data quality and consistency.
3. **Model selection and evaluation** is a crucial step in many writeups, demonstrating the importance of evaluating model performance.
4. **Image processing** is a key technique in several writeups, showcasing its applications in image manipulation and analysis.
5. **Adversarial AI** is a common theme in writeups involving machine learning security challenges, highlighting the need to detect and mitigate potential attacks.

By applying these key points and data analysis methods, datascience practitioners can develop effective strategies for tackling a wide range of challenges in machine learning and data analysis.<|eot_id|>

I found this a bit confusing as it seem to discuss topics related to image processing as well as many other topics. But checking the competition description, it seems to have been a security related competition with multiple challenges related to math, natural language, and image challenges. So the overall summary seems to make sense.

The set of smaller writeups mostly seem to be about referencing solutions in notebooks or otherwise. Without spamming this notebook too much, here is one example:

In [65]:
Markdown(writeups[8][0])

I've updated my solution notebook to run on Kaggle, included some descriptions of the solutions, as well as output. Hope everyone enjoys!
Solution Code

With the large variation in the number of writeups per competition, and their length/quality, some filtering would also be in order. The ranking would be useful in cases where the number of writeups is very high at least. 

One interesting point about all the summaries in the first part above (the shorter and fewer writeup competitions), and the last N writeups above is the summary size. There is a broad range of input there, from 1-31 writeups per competition, to 15-30k tokens in all writeups together. Yet the summary from Gemma is almost always around the range of 100-300 tokens. Potentially it has been trained to summarize this way?

The accuracy of the initial summaries across writeups of different length, and across multiple writeup summaries is something another open question. Something to investigate more in a future update to this notebook.

# Summarizing in Sub-Sets vs All at Once

Using the initial results from this notebook, I built a [second one](https://www.kaggle.com/code/donkeys/qa-solutions-from-past-competitions) to experiment a bit with a RAG approach to help a user find interesting competitions related to the one they wish to participate in, get a summary of any writeups for that competition, and ask questions about solutions described in those writeups, using the summaries as a basis for questions and writeups as basis for more detailed answers.

In trying that, I realized I wanted to have all writeups summarized regardless of how many and long writeups the competition had. In my above trials, I capped the max context length at 4k for Kaggle, and excluded some writeups if the total became too long. For example, in the above parts of this notebook it shows how the _American Express - Default Prediction_ competition had 31 writeups, and some of the were skipped beceause they did not fit in the 4k token limit I had to use on Kaggle. And even with the full 8k Gemma context, some of the ones with a really large writeup sets still get cut off on my desktop setup with more VRAM.

To get even the ones with a large number of writeups summarized, I tried summarizing all writeups in several smaller sets per competition, and from the to a final summary. If this works fine, it should also work for all the writeup counts.

## Filter to Only Competitions with only N Writeups

First to find all competitions with at least some writeups, to form a basis for what we are going to summarize all together. Well, it also allows trialing with a minimum set of writeups if wanting the see the effects better.

In [66]:
writeups = []
used_names = []
skipped_names_empty = []
skipped_names_few = []
# writeups need metadata to have a description that I use for the summary
skipped_names_nometa = []

pbar_selected = tqdm(total=0, desc="Selected Writeups")
pbar_skipped_empty = tqdm(total=0, desc="Skipped (No Writeups)")
pbar_skipped_few = tqdm(total=0, desc="Skipped (Too Few Writeups)")
pbar_skipped_nometa = tqdm(total=0, desc="Skipped (No Metadata)")

#if on_kaggle:
#    comps_to_find = 20
#else:
#    # should be all
#    comps_to_find = 10000

for idx, name in enumerate(competitions):
    comp_writeups = df_writeups[df_writeups["Title of Competition"] == name]
    print(f"processing {name}, writeups: {comp_writeups.shape[0]}")
    if len(comp_writeups) == 0:
        print(f"skipping {name} due to no writeups")
        skipped_names_empty.append(name)
        pbar_skipped_empty.update(1)
        continue
    comp_meta = df_comp_meta[df_comp_meta["comp_name"] == name]
    found = comp_meta.shape[0] > 0
    if not found:
        print(f"skipping {name} due to no metadata")
        skipped_names_nometa.append(name)
        pbar_skipped_nometa.update(1)
        continue
        
    #if len(comp_writeups) < 5:
    #    #print(f"skipping {name} due to too few writeups")
    #    skipped_names_few.append(name)
    #    pbar_skipped_few.update(1)
    #    continue    
    used_names.append(name)
    writeups.append(comp_writeups)
    pbar_selected.update(1)
#    if len(used_names) >= comps_to_find:
#        break

Selected Writeups: 0it [00:00, ?it/s]

Skipped (No Writeups): 0it [00:00, ?it/s]

Skipped (Too Few Writeups): 0it [00:00, ?it/s]

Skipped (No Metadata): 0it [00:00, ?it/s]

processing Chess ratings - Elo versus the Rest of the World, writeups: 5
processing RTA Freeway Travel Time Prediction, writeups: 1
processing Predict Grant Applications, writeups: 2
processing IJCNN Social Network Challenge, writeups: 2
skipping IJCNN Social Network Challenge due to no metadata
processing Stay Alert! The Ford Challenge, writeups: 1
processing Don't Overfit!, writeups: 1
processing Heritage Health Prize, writeups: 1
processing Mapping Dark Matter, writeups: 1
processing dunnhumby's Shopper Challenge, writeups: 2
processing Semi-Supervised Feature Learning, writeups: 1
processing Eye Movements Verification and Identification Competition, writeups: 1
processing Predicting a Biological Response, writeups: 1
processing Psychopathy Prediction Based on Twitter Usage, writeups: 1
processing Million Song Dataset Challenge, writeups: 1
processing Online Product Sales, writeups: 1
processing Personality Prediction Based on Twitter Stream, writeups: 1
processing EMC Israel Data S

The above illustrates also, how the latter competitions have much more writeups.

## Split Large Writeup Sets to Smaller Ones

Now a helper function to split a large set of writeups into somewhat even size subsets. So we dont end up with subsets with sizes like [10, 10, 10, 1]. Rather in such case [10, 10, 6, 5] so there is enough summarizing for everyone.

In [67]:
def split_list(data, max_sublist_size):
  sublists = []
  current_sublist = []
  for idx, item in data.iterrows():
    if len(current_sublist) < max_sublist_size:
      current_sublist.append(item)
    else:
      sublists.append(current_sublist)
      current_sublist = [item]
  # Add the last sublist, even if it's shorter than max_sublist_size
  sublists.append(current_sublist)

  # Ensure the last sublists are more evenly distributed
  last_sublists = sublists[-2:]  # Get the last two sublists
  total_length = sum(len(sublist) for sublist in last_sublists)
  ideal_length = total_length // len(last_sublists)

  # Distribute elements until a balanced state is achieved
  while len(last_sublists[0]) > ideal_length and len(last_sublists) > 1:
      #print(len(last_sublists[0]))
      last_sublists[1].append(last_sublists[0].pop(0))

  # If there are still more elements in the second-to-last sublist, distribute to previous ones
  if len(last_sublists) > 1 and len(last_sublists[1]) > ideal_length:
    for i in range(len(sublists) - 2, 0, -1):
      if len(sublists[i]) < max_sublist_size:
        sublists[i].append(last_sublists[1].pop(0))
        break

  return sublists



Now use the above helper to actually split the writeups into more even size subsets:

In [68]:
writeup_dataframes = []

#N = 5
N = 10
#N = "ALL"

split_size = N

# ALL just refers to summarizing all together, so no need for subsets then
# it is what I used for the desktop env where I ran full summaries for comparison
if N == "ALL":
    N2 = 10000
else:
    N2 = N
if True:
    for comp_writeups in writeups:
        writeups_sublists = split_list(comp_writeups, N2)
        #print([len(sub) for sub in writeups_sublists])
        writeups_subframes = [pd.DataFrame(sub) for sub in writeups_sublists]
        #print([subframe.shape[0] for subframe in writeups_subframes])
        writeup_dataframes.append(writeups_subframes)


## Collect Writeups for Subsets

In [69]:
# create new list of writeup dataframes filtering out all writeups for competitions in skipped_titles
writeup_dataframes_filtered = []
names_filtered = []
for idx, comp_dataframes in enumerate(writeup_dataframes):
    comp_name = comp_dataframes[0]["Title of Competition"].iloc[0]
    if comp_name in skipped_titles:
        continue
    names_filtered.append(comp_name)
    writeup_dataframes_filtered.append(comp_dataframes)


In [70]:
# concatenate all dataframes in the writeup_dataframes list into a single dataframe
# just useful when summarizing all together for a competition
df_writeups_all = pd.concat([df for sublist in writeup_dataframes_filtered for df in sublist])

In [71]:
len(skipped_names_nometa)

27

In [72]:
# these are just to check up the filtering is OK. 
# Previous filtering has given 310-27=283 competitions left, so thats the number needed here
print(len(writeup_dataframes_filtered))
print(df_writeups_all.shape)
print(df_writeups_all["Title of Competition"].nunique())

283
(2725, 8)
283


## Summarize the Selected Writeups as Input for Subsets

In [73]:
# first have to reset the lists to not pollute with previous experiments in this notebook
writeup_summaries = []
writeup_summaries_md = []
overall_summaries = []
processed_titles = set()
processed_titles_list = []
skipped_titles = set()
skipped_titles_list = []
overall_summary_prompts = []
prompts = []
writeups = []
prompt_lengths = []
writeup_lengths = []
summary_lengths = []
final_names = []

In [74]:
len(used_names)

283

In [77]:
# subset is how many competitions are used here, limiting it on Kaggle for GPU time
if on_kaggle:
    subset = 5
else:
    subset = len(used_names)

if N != "ALL":
    print(f"N = {N}")
    # taking the latest competitions subset to have competitions with more writeups
    subset = 33
    start_idx = 250
#    start_idx = len(used_names) - subset
#    start_idx = len(used_names) - subset
    for x in tqdm(range(start_idx, start_idx+subset)):
        print(x)
        name = used_names[x]
        final_names.append(name)
        # writeup_dataframes holds the writeups for the competition name at same index
        for df_subframe in writeup_dataframes_filtered[x]:
            summarize_competition_writeups([name], df_subframe, tokenizer)
else:
    print(f"ALL = {N}")
    # this was on local desktop with more resources and GPU time
    # in this case there are splits or subsets, all comp writeup summaries are summarized as one
    summarize_competition_writeups(used_names, df_writeups_all, tokenizer)


N = 10


  0%|          | 0/33 [00:00<?, ?it/s]

250


  0%|          | 0/1 [00:00<?, ?it/s]

378: competition Google Landmark Retrieval 2021, writeups 7


  0%|          | 0/7 [00:00<?, ?it/s]

     prompt lengths=[1269, 561, 2289, 1052, 2105, 975, 1838]
    writeup lengths=[1095, 386, 2114, 878, 1931, 801, 1664]
    summary lengths=[454, 436, 482, 431, 519, 433, 760]
    total writeups length:8869, overall prompt length: 3726, overall summary length: 563
total of 379 competitions processed
251


  0%|          | 0/1 [00:00<?, ?it/s]

379: competition Google Landmark Recognition 2021, writeups 5


  0%|          | 0/5 [00:00<?, ?it/s]

     prompt lengths=[498, 2171, 1423, 1040, 1037]
    writeup lengths=[307, 1979, 1231, 849, 846]
    summary lengths=[383, 479, 452, 490, 437]
    total writeups length:5212, overall prompt length: 2455, overall summary length: 588
total of 380 competitions processed
252


  0%|          | 0/1 [00:00<?, ?it/s]

380: competition chaii - Hindi and Tamil Question Answering, writeups 10


  0%|          | 0/10 [00:00<?, ?it/s]

     prompt lengths=[1524, 2403, 1675, 1312, 495, 1271, 954, 1760, 801, 870]
    writeup lengths=[1322, 2201, 1474, 1111, 294, 1070, 753, 1561, 599, 669]
    summary lengths=[489, 406, 462, 423, 488, 329, 439, 491, 373, 490]
    total writeups length:11054, overall prompt length: 4649, overall summary length: 402
total of 381 competitions processed


  0%|          | 0/1 [00:00<?, ?it/s]

381: competition chaii - Hindi and Tamil Question Answering, writeups 9


  0%|          | 0/9 [00:00<?, ?it/s]

     prompt lengths=[1115, 1456, 372, 1170, 1030, 658, 1006, 722, 807]
    writeup lengths=[913, 1255, 171, 968, 829, 457, 805, 521, 605]
    summary lengths=[430, 438, 317, 446, 375, 485, 399, 422, 462]
    total writeups length:6524, overall prompt length: 4026, overall summary length: 490
total of 382 competitions processed


  0%|          | 0/1 [00:00<?, ?it/s]

382: competition chaii - Hindi and Tamil Question Answering, writeups 9


  0%|          | 0/9 [00:00<?, ?it/s]

     prompt lengths=[636, 621, 978, 1958, 962, 494, 764, 575, 348]
    writeup lengths=[435, 421, 776, 1757, 761, 293, 563, 374, 147]
    summary lengths=[445, 420, 347, 475, 430, 455, 389, 401, 405]
    total writeups length:5527, overall prompt length: 4019, overall summary length: 518
total of 383 competitions processed
253


  0%|          | 0/1 [00:00<?, ?it/s]

383: competition Lux AI, writeups 4


  0%|          | 0/4 [00:00<?, ?it/s]

     prompt lengths=[4509, 600, 900, 645]
    writeup lengths=[4321, 413, 713, 458]
    summary lengths=[483, 505, 468, 499]
    total writeups length:5905, overall prompt length: 2159, overall summary length: 550
total of 384 competitions processed
254


  0%|          | 0/1 [00:00<?, ?it/s]

384: competition Google Brain - Ventilator Pressure Prediction, writeups 7


  0%|          | 0/7 [00:00<?, ?it/s]

     prompt lengths=[1865, 994, 459, 3182, 1154, 967, 1360]
    writeup lengths=[1664, 791, 257, 2981, 951, 765, 1159]
    summary lengths=[529, 445, 385, 607, 453, 460, 391]
    total writeups length:8568, overall prompt length: 3509, overall summary length: 781
total of 385 competitions processed


  0%|          | 0/1 [00:00<?, ?it/s]

385: competition Google Brain - Ventilator Pressure Prediction, writeups 8


  0%|          | 0/8 [00:00<?, ?it/s]

     prompt lengths=[1293, 275, 2935, 1862, 802, 382, 877, 434]
    writeup lengths=[1091, 73, 2734, 1660, 601, 180, 676, 233]
    summary lengths=[455, 509, 429, 447, 528, 365, 451, 388]
    total writeups length:7248, overall prompt length: 3818, overall summary length: 1000
total of 386 competitions processed
255


  0%|          | 0/1 [00:00<?, ?it/s]

386: competition PetFinder.my - Pawpularity Contest, writeups 10


  0%|          | 0/10 [00:00<?, ?it/s]

     prompt lengths=[411, 558, 459, 533, 838, 805, 1513, 1272, 593, 631]
    writeup lengths=[228, 374, 276, 351, 654, 622, 1329, 1090, 409, 448]
    summary lengths=[456, 444, 446, 298, 486, 379, 479, 487, 450, 473]
    total writeups length:5781, overall prompt length: 4639, overall summary length: 501
total of 387 competitions processed


  0%|          | 0/1 [00:00<?, ?it/s]

387: competition PetFinder.my - Pawpularity Contest, writeups 9


  0%|          | 0/9 [00:00<?, ?it/s]

     prompt lengths=[558, 248, 1551, 570, 604, 423, 282, 400, 433]
    writeup lengths=[374, 66, 1369, 387, 419, 242, 99, 217, 251]
    summary lengths=[445, 374, 523, 343, 528, 405, 389, 480, 391]
    total writeups length:3424, overall prompt length: 4112, overall summary length: 477
total of 388 competitions processed


  0%|          | 0/1 [00:00<?, ?it/s]

388: competition PetFinder.my - Pawpularity Contest, writeups 8


  0%|          | 0/8 [00:00<?, ?it/s]

     prompt lengths=[940, 1690, 1294, 553, 853, 980, 793, 979]
    writeup lengths=[757, 1508, 1113, 369, 669, 797, 610, 796]
    summary lengths=[616, 511, 368, 515, 436, 475, 524, 423]
    total writeups length:6619, overall prompt length: 4095, overall summary length: 553
total of 389 competitions processed
256


  0%|          | 0/1 [00:00<?, ?it/s]

389: competition Sartorius - Cell Instance Segmentation, writeups 5


  0%|          | 0/5 [00:00<?, ?it/s]

     prompt lengths=[508, 415, 2120, 475, 661]
    writeup lengths=[347, 254, 1959, 314, 500]
    summary lengths=[511, 468, 377, 458, 533]
    total writeups length:3374, overall prompt length: 2531, overall summary length: 1000
total of 390 competitions processed


  0%|          | 0/1 [00:00<?, ?it/s]

390: competition Sartorius - Cell Instance Segmentation, writeups 6


  0%|          | 0/6 [00:00<?, ?it/s]

     prompt lengths=[505, 1256, 1350, 901, 536, 979]
    writeup lengths=[344, 1095, 1190, 740, 374, 817]
    summary lengths=[451, 425, 434, 450, 413, 677]
    total writeups length:4560, overall prompt length: 3041, overall summary length: 388
total of 391 competitions processed
257


  0%|          | 0/1 [00:00<?, ?it/s]

391: competition Santa 2021 - The Merry Movie Montage, writeups 8


  0%|          | 0/8 [00:00<?, ?it/s]

     prompt lengths=[2987, 1510, 535, 736, 711, 887, 1193, 1501]
    writeup lengths=[2770, 1293, 319, 520, 494, 670, 977, 1284]
    summary lengths=[546, 390, 384, 436, 444, 401, 465, 500]
    total writeups length:8327, overall prompt length: 3827, overall summary length: 434
total of 392 competitions processed


  0%|          | 0/1 [00:00<?, ?it/s]

392: competition Santa 2021 - The Merry Movie Montage, writeups 8


  0%|          | 0/8 [00:00<?, ?it/s]

     prompt lengths=[1712, 324, 411, 1112, 1410, 744, 1687, 728]
    writeup lengths=[1496, 108, 195, 896, 1194, 528, 1471, 511]
    summary lengths=[434, 379, 390, 367, 351, 519, 405, 406]
    total writeups length:6399, overall prompt length: 3512, overall summary length: 493
total of 393 competitions processed
258


  0%|          | 0/1 [00:00<?, ?it/s]

393: competition Feedback Prize - Evaluating Student Writing, writeups 8


  0%|          | 0/8 [00:00<?, ?it/s]

     prompt lengths=[2207, 1425, 823, 2008, 1342, 1196, 1071, 1966]
    writeup lengths=[2015, 1232, 630, 1816, 1149, 1003, 879, 1773]
    summary lengths=[474, 463, 446, 443, 304, 784, 399, 596]
    total writeups length:10497, overall prompt length: 4145, overall summary length: 627
total of 394 competitions processed


  0%|          | 0/1 [00:00<?, ?it/s]

394: competition Feedback Prize - Evaluating Student Writing, writeups 9


  0%|          | 0/9 [00:00<?, ?it/s]

     prompt lengths=[1243, 454, 1046, 1102, 880, 833, 816, 1145, 841]
    writeup lengths=[1050, 262, 854, 910, 689, 641, 624, 953, 649]
    summary lengths=[372, 427, 439, 528, 558, 466, 589, 591, 417]
    total writeups length:6632, overall prompt length: 4630, overall summary length: 609
total of 395 competitions processed
259


  0%|          | 0/1 [00:00<?, ?it/s]

395: competition Ubiquant Market Prediction, writeups 7


  0%|          | 0/7 [00:00<?, ?it/s]

     prompt lengths=[404, 1732, 749, 448, 392, 523, 1306]
    writeup lengths=[235, 1564, 580, 279, 223, 354, 1137]
    summary lengths=[442, 527, 527, 395, 458, 437, 547]
    total writeups length:4372, overall prompt length: 3539, overall summary length: 526
total of 396 competitions processed
260


  0%|          | 0/1 [00:00<?, ?it/s]

396: competition Happywhale - Whale and Dolphin Identification, writeups 10


  0%|          | 0/10 [00:00<?, ?it/s]

     prompt lengths=[718, 1127, 1056, 578, 639, 1121, 785, 869, 925, 2040]
    writeup lengths=[531, 940, 870, 391, 452, 934, 598, 682, 739, 1852]
    summary lengths=[544, 360, 357, 567, 433, 566, 483, 516, 477, 511]
    total writeups length:7989, overall prompt length: 5059, overall summary length: 1000
total of 397 competitions processed


  0%|          | 0/1 [00:00<?, ?it/s]

397: competition Happywhale - Whale and Dolphin Identification, writeups 6


  0%|          | 0/6 [00:00<?, ?it/s]

     prompt lengths=[713, 791, 687, 897, 1117, 828]
    writeup lengths=[526, 604, 501, 709, 930, 641]
    summary lengths=[485, 436, 455, 439, 359, 544]
    total writeups length:3911, overall prompt length: 2935, overall summary length: 330
total of 398 competitions processed


  0%|          | 0/1 [00:00<?, ?it/s]

398: competition Happywhale - Whale and Dolphin Identification, writeups 5


  0%|          | 0/5 [00:00<?, ?it/s]

     prompt lengths=[569, 723, 771, 712, 495]
    writeup lengths=[382, 536, 584, 525, 308]
    summary lengths=[499, 469, 394, 455, 367]
    total writeups length:2335, overall prompt length: 2394, overall summary length: 503
total of 399 competitions processed
261


  0%|          | 0/1 [00:00<?, ?it/s]

399: competition NBME - Score Clinical Patient Notes, writeups 10


  0%|          | 0/10 [00:00<?, ?it/s]

     prompt lengths=[1744, 950, 899, 1123, 845, 1823, 709, 2719, 1770, 1285]
    writeup lengths=[1549, 757, 704, 928, 650, 1628, 513, 2525, 1575, 1090]
    summary lengths=[540, 438, 458, 488, 449, 431, 431, 459, 389, 360]
    total writeups length:11919, overall prompt length: 4696, overall summary length: 686
total of 400 competitions processed


  0%|          | 0/1 [00:00<?, ?it/s]

400: competition NBME - Score Clinical Patient Notes, writeups 6


  0%|          | 0/6 [00:00<?, ?it/s]

     prompt lengths=[1813, 1246, 1028, 1093, 886, 766]
    writeup lengths=[1617, 1051, 833, 897, 690, 569]
    summary lengths=[505, 407, 598, 441, 427, 481]
    total writeups length:5657, overall prompt length: 3084, overall summary length: 609
total of 401 competitions processed


  0%|          | 0/1 [00:00<?, ?it/s]

401: competition NBME - Score Clinical Patient Notes, writeups 6


  0%|          | 0/6 [00:00<?, ?it/s]

     prompt lengths=[854, 753, 859, 1814, 957, 1071]
    writeup lengths=[659, 559, 664, 1618, 763, 876]
    summary lengths=[427, 404, 432, 498, 346, 324]
    total writeups length:5139, overall prompt length: 2656, overall summary length: 705
total of 402 competitions processed
262


  0%|          | 0/1 [00:00<?, ?it/s]

402: competition H&M Personalized Fashion Recommendations, writeups 10


  0%|          | 0/10 [00:00<?, ?it/s]

     prompt lengths=[598, 681, 1495, 1280, 749, 1064, 852, 1020, 696, 2043]
    writeup lengths=[416, 499, 1311, 1098, 566, 882, 670, 837, 513, 1860]
    summary lengths=[484, 475, 380, 351, 431, 537, 394, 451, 531, 517]
    total writeups length:8652, overall prompt length: 4791, overall summary length: 672
total of 403 competitions processed


  0%|          | 0/1 [00:00<?, ?it/s]

403: competition H&M Personalized Fashion Recommendations, writeups 7


  0%|          | 0/7 [00:00<?, ?it/s]

     prompt lengths=[639, 1151, 654, 763, 755, 690, 1183]
    writeup lengths=[457, 969, 472, 581, 573, 509, 1001]
    summary lengths=[481, 526, 380, 474, 411, 504, 515]
    total writeups length:4562, overall prompt length: 3510, overall summary length: 669
total of 404 competitions processed


  0%|          | 0/1 [00:00<?, ?it/s]

404: competition H&M Personalized Fashion Recommendations, writeups 6


  0%|          | 0/6 [00:00<?, ?it/s]

     prompt lengths=[1171, 531, 554, 1601, 1087, 1319]
    writeup lengths=[989, 349, 371, 1418, 905, 1137]
    summary lengths=[499, 476, 405, 601, 459, 559]
    total writeups length:5169, overall prompt length: 3211, overall summary length: 827
total of 405 competitions processed
263


  0%|          | 0/1 [00:00<?, ?it/s]

405: competition BirdCLEF 2022, writeups 10


  0%|          | 0/10 [00:00<?, ?it/s]

     prompt lengths=[996, 764, 886, 1846, 320, 748, 1151, 1154, 826, 1071]
    writeup lengths=[816, 585, 708, 1667, 140, 568, 973, 975, 647, 892]
    summary lengths=[397, 445, 471, 607, 422, 541, 513, 356, 435, 610]
    total writeups length:7971, overall prompt length: 5034, overall summary length: 504
total of 406 competitions processed


  0%|          | 0/1 [00:00<?, ?it/s]

406: competition BirdCLEF 2022, writeups 6


  0%|          | 0/6 [00:00<?, ?it/s]

     prompt lengths=[2520, 1622, 1620, 1023, 1100, 483]
    writeup lengths=[2340, 1443, 1441, 844, 921, 304]
    summary lengths=[427, 596, 452, 371, 394, 371]
    total writeups length:7293, overall prompt length: 2820, overall summary length: 424
total of 407 competitions processed


  0%|          | 0/1 [00:00<?, ?it/s]

407: competition BirdCLEF 2022, writeups 5


  0%|          | 0/5 [00:00<?, ?it/s]

     prompt lengths=[1791, 1011, 675, 1395, 1162]
    writeup lengths=[1612, 832, 496, 1216, 983]
    summary lengths=[438, 440, 462, 406, 523]
    total writeups length:5139, overall prompt length: 2471, overall summary length: 496
total of 408 competitions processed
264


  0%|          | 0/1 [00:00<?, ?it/s]

408: competition March Machine Learning Mania 2022 - Women's, writeups 8


  0%|          | 0/8 [00:00<?, ?it/s]

     prompt lengths=[914, 717, 581, 250, 689, 959, 1076, 562]
    writeup lengths=[739, 543, 407, 77, 515, 784, 902, 388]
    summary lengths=[362, 407, 394, 300, 366, 412, 397, 504]
    total writeups length:4355, overall prompt length: 3360, overall summary length: 604
total of 409 competitions processed
265


  0%|          | 0/1 [00:00<?, ?it/s]

409: competition March Machine Learning Mania 2022 - Men’s, writeups 8


  0%|          | 0/8 [00:00<?, ?it/s]

     prompt lengths=[604, 679, 409, 724, 407, 561, 228, 442]
    writeup lengths=[432, 505, 234, 551, 235, 387, 55, 268]
    summary lengths=[480, 436, 465, 439, 380, 453, 313, 375]
    total writeups length:2667, overall prompt length: 3559, overall summary length: 716
total of 410 competitions processed
266


  0%|          | 0/1 [00:00<?, ?it/s]

410: competition GeoLifeCLEF 2022 - LifeCLEF 2022 x FGVC9, writeups 2


  0%|          | 0/2 [00:00<?, ?it/s]

     prompt lengths=[1370, 2392]
    writeup lengths=[1186, 2208]
    summary lengths=[453, 486]
    total writeups length:3394, overall prompt length: 1125, overall summary length: 672
total of 411 competitions processed
267


  0%|          | 0/1 [00:00<?, ?it/s]

411: competition Hotel-ID to Combat Human Trafficking 2022 - FGVC9, writeups 3


  0%|          | 0/3 [00:00<?, ?it/s]

     prompt lengths=[819, 622, 492]
    writeup lengths=[652, 456, 324]
    summary lengths=[432, 448, 378]
    total writeups length:1432, overall prompt length: 1434, overall summary length: 560
total of 412 competitions processed
268


  0%|          | 0/1 [00:00<?, ?it/s]

412: competition Sorghum -100 Cultivar Identification - FGVC 9, writeups 3


  0%|          | 0/3 [00:00<?, ?it/s]

     prompt lengths=[445, 387, 1348]
    writeup lengths=[251, 193, 1153]
    summary lengths=[428, 406, 535]
    total writeups length:1597, overall prompt length: 1572, overall summary length: 370
total of 413 competitions processed
269


  0%|          | 0/1 [00:00<?, ?it/s]

413: competition iWildCam 2022 - FGVC9, writeups 2


  0%|          | 0/2 [00:00<?, ?it/s]

     prompt lengths=[909, 634]
    writeup lengths=[716, 441]
    summary lengths=[584, 426]
    total writeups length:1157, overall prompt length: 1205, overall summary length: 427
total of 414 competitions processed
270


  0%|          | 0/1 [00:00<?, ?it/s]

414: competition Image Matching Challenge 2022, writeups 10


  0%|          | 0/10 [00:00<?, ?it/s]

     prompt lengths=[1656, 996, 1512, 564, 1347, 965, 1009, 1553, 1062, 1819]
    writeup lengths=[1472, 810, 1327, 378, 1162, 780, 824, 1368, 877, 1633]
    summary lengths=[381, 376, 482, 479, 406, 471, 477, 338, 581, 450]
    total writeups length:10631, overall prompt length: 4684, overall summary length: 435
total of 415 competitions processed


  0%|          | 0/1 [00:00<?, ?it/s]

415: competition Image Matching Challenge 2022, writeups 6


  0%|          | 0/6 [00:00<?, ?it/s]

     prompt lengths=[1251, 776, 2517, 892, 1680, 519]
    writeup lengths=[1065, 591, 2332, 707, 1495, 334]
    summary lengths=[365, 429, 416, 499, 423, 357]
    total writeups length:6524, overall prompt length: 2704, overall summary length: 574
total of 416 competitions processed


  0%|          | 0/1 [00:00<?, ?it/s]

416: competition Image Matching Challenge 2022, writeups 5


  0%|          | 0/5 [00:00<?, ?it/s]

     prompt lengths=[1529, 984, 596, 840, 3165]
    writeup lengths=[1344, 799, 411, 656, 2980]
    summary lengths=[539, 415, 437, 349, 526]
    total writeups length:6190, overall prompt length: 2474, overall summary length: 421
total of 417 competitions processed
271


  0%|          | 0/1 [00:00<?, ?it/s]

417: competition JPX Tokyo Stock Exchange Prediction, writeups 6


  0%|          | 0/6 [00:00<?, ?it/s]

     prompt lengths=[330, 854, 770, 400, 470, 581]
    writeup lengths=[140, 664, 580, 211, 280, 391]
    summary lengths=[326, 437, 414, 339, 424, 512]
    total writeups length:2266, overall prompt length: 2672, overall summary length: 449
total of 418 competitions processed
272


  0%|          | 0/1 [00:00<?, ?it/s]

418: competition Kore 2022, writeups 9


  0%|          | 0/9 [00:00<?, ?it/s]

     prompt lengths=[2978, 1893, 1245, 1715, 239, 3704, 1678, 1899, 1426]
    writeup lengths=[2817, 1731, 1084, 1553, 77, 3542, 1516, 1737, 1264]
    summary lengths=[474, 413, 446, 483, 369, 505, 496, 445, 464]
    total writeups length:15321, overall prompt length: 4308, overall summary length: 550
total of 419 competitions processed
273


  0%|          | 0/1 [00:00<?, ?it/s]

419: competition Foursquare - Location Matching, writeups 8


  0%|          | 0/8 [00:00<?, ?it/s]

     prompt lengths=[1216, 1255, 1380, 713, 1609, 824, 1377, 628]
    writeup lengths=[1049, 1088, 1213, 547, 1442, 657, 1210, 461]
    summary lengths=[562, 377, 637, 477, 475, 520, 537, 670]
    total writeups length:7667, overall prompt length: 4466, overall summary length: 452
total of 420 competitions processed


  0%|          | 0/1 [00:00<?, ?it/s]

420: competition Foursquare - Location Matching, writeups 9


  0%|          | 0/9 [00:00<?, ?it/s]

     prompt lengths=[292, 1597, 1598, 1971, 1694, 841, 521, 653, 1725]
    writeup lengths=[126, 1429, 1431, 1804, 1528, 674, 354, 486, 1558]
    summary lengths=[408, 439, 381, 420, 460, 405, 528, 487, 471]
    total writeups length:9390, overall prompt length: 4217, overall summary length: 627
total of 421 competitions processed
274


  0%|          | 0/1 [00:00<?, ?it/s]

421: competition Herbarium 2022 - FGVC9, writeups 1


  0%|          | 0/1 [00:00<?, ?it/s]

     prompt lengths=[930]
    writeup lengths=[734]
    summary lengths=[494]
    total writeups length:734, overall prompt length: 684, overall summary length: 481
total of 422 competitions processed
275


  0%|          | 0/1 [00:00<?, ?it/s]

422: competition Google Smartphone Decimeter Challenge 2022, writeups 5


  0%|          | 0/5 [00:00<?, ?it/s]

     prompt lengths=[808, 1120, 2395, 699, 1218]
    writeup lengths=[613, 925, 2201, 504, 1023]
    summary lengths=[443, 399, 401, 425, 530]
    total writeups length:5266, overall prompt length: 2416, overall summary length: 466
total of 423 competitions processed


  0%|          | 0/1 [00:00<?, ?it/s]

423: competition Google Smartphone Decimeter Challenge 2022, writeups 6


  0%|          | 0/6 [00:00<?, ?it/s]

     prompt lengths=[491, 1824, 954, 936, 615, 1704]
    writeup lengths=[296, 1629, 759, 741, 419, 1509]
    summary lengths=[356, 518, 407, 382, 393, 438]
    total writeups length:5353, overall prompt length: 2719, overall summary length: 661
total of 424 competitions processed
276


  0%|          | 0/1 [00:00<?, ?it/s]

424: competition Feedback Prize - Predicting Effective Arguments, writeups 9


  0%|          | 0/9 [00:00<?, ?it/s]

     prompt lengths=[1330, 759, 789, 735, 598, 896, 2117, 1091, 2349]
    writeup lengths=[1118, 548, 576, 523, 386, 684, 1906, 879, 2137]
    summary lengths=[564, 411, 458, 442, 422, 356, 525, 425, 682]
    total writeups length:8757, overall prompt length: 4548, overall summary length: 599
total of 425 competitions processed


  0%|          | 0/1 [00:00<?, ?it/s]

425: competition Feedback Prize - Predicting Effective Arguments, writeups 10


  0%|          | 0/10 [00:00<?, ?it/s]

     prompt lengths=[705, 840, 659, 2223, 516, 737, 599, 806, 614, 754]
    writeup lengths=[493, 628, 448, 2010, 305, 525, 387, 593, 402, 543]
    summary lengths=[456, 386, 396, 537, 336, 296, 485, 351, 472, 445]
    total writeups length:6334, overall prompt length: 4430, overall summary length: 797
total of 426 competitions processed
277


  0%|          | 0/1 [00:00<?, ?it/s]

426: competition American Express - Default Prediction, writeups 10


  0%|          | 0/10 [00:00<?, ?it/s]

     prompt lengths=[1036, 1693, 5210, 1842, 1362, 311, 1112, 1116, 282, 298]
    writeup lengths=[841, 1498, 5015, 1646, 1167, 116, 917, 920, 88, 104]
    summary lengths=[508, 389, 475, 528, 303, 307, 384, 365, 331, 432]
    total writeups length:12312, overall prompt length: 4275, overall summary length: 559
total of 427 competitions processed


  0%|          | 0/1 [00:00<?, ?it/s]

427: competition American Express - Default Prediction, writeups 10


  0%|          | 0/10 [00:00<?, ?it/s]

     prompt lengths=[841, 266, 559, 1082, 995, 599, 515, 467, 683, 328]
    writeup lengths=[646, 72, 364, 887, 800, 405, 321, 271, 487, 133]
    summary lengths=[502, 429, 529, 520, 499, 402, 469, 357, 459, 383]
    total writeups length:4386, overall prompt length: 4802, overall summary length: 570
total of 428 competitions processed


  0%|          | 0/1 [00:00<?, ?it/s]

428: competition American Express - Default Prediction, writeups 6


  0%|          | 0/6 [00:00<?, ?it/s]

     prompt lengths=[990, 640, 266, 982, 608, 842]
    writeup lengths=[796, 444, 72, 787, 414, 647]
    summary lengths=[508, 438, 306, 575, 441, 431]
    total writeups length:3160, overall prompt length: 2924, overall summary length: 648
total of 429 competitions processed


  0%|          | 0/1 [00:00<?, ?it/s]

429: competition American Express - Default Prediction, writeups 5


  0%|          | 0/5 [00:00<?, ?it/s]

     prompt lengths=[2054, 2095, 578, 3133, 838]
    writeup lengths=[1859, 1899, 383, 2938, 643]
    summary lengths=[524, 494, 392, 488, 414]
    total writeups length:7722, overall prompt length: 2530, overall summary length: 564
total of 430 competitions processed
278


  0%|          | 0/1 [00:00<?, ?it/s]

430: competition HuBMAP + HPA - Hacking the Human Body, writeups 5


  0%|          | 0/5 [00:00<?, ?it/s]

     prompt lengths=[1755, 936, 736, 845, 667]
    writeup lengths=[1580, 760, 560, 668, 491]
    summary lengths=[519, 373, 467, 409, 410]
    total writeups length:4059, overall prompt length: 2377, overall summary length: 537
total of 431 competitions processed


  0%|          | 0/1 [00:00<?, ?it/s]

431: competition HuBMAP + HPA - Hacking the Human Body, writeups 6


  0%|          | 0/6 [00:00<?, ?it/s]

     prompt lengths=[713, 604, 1341, 725, 638, 2067]
    writeup lengths=[537, 427, 1165, 550, 462, 1890]
    summary lengths=[443, 462, 333, 503, 426, 421]
    total writeups length:5031, overall prompt length: 2794, overall summary length: 560
total of 432 competitions processed
279


  0%|          | 0/1 [00:00<?, ?it/s]

432: competition Mayo Clinic - STRIP AI, writeups 10


  0%|          | 0/10 [00:00<?, ?it/s]

     prompt lengths=[2749, 453, 1254, 370, 724, 313, 603, 1284, 672, 465]
    writeup lengths=[2584, 288, 1089, 205, 559, 148, 438, 1118, 507, 300]
    summary lengths=[516, 460, 513, 472, 354, 347, 484, 408, 452, 484]
    total writeups length:7236, overall prompt length: 4713, overall summary length: 512
total of 433 competitions processed
280


  0%|          | 0/1 [00:00<?, ?it/s]

433: competition Google Universal Image Embedding, writeups 8


  0%|          | 0/8 [00:00<?, ?it/s]

     prompt lengths=[1760, 372, 3010, 631, 1017, 739, 439, 565]
    writeup lengths=[1573, 186, 2822, 444, 831, 553, 252, 378]
    summary lengths=[467, 358, 491, 501, 504, 377, 456, 575]
    total writeups length:7039, overall prompt length: 3960, overall summary length: 522
total of 434 competitions processed


  0%|          | 0/1 [00:00<?, ?it/s]

434: competition Google Universal Image Embedding, writeups 8


  0%|          | 0/8 [00:00<?, ?it/s]

     prompt lengths=[856, 1951, 482, 1394, 497, 638, 710, 788]
    writeup lengths=[669, 1765, 295, 1207, 310, 450, 523, 601]
    summary lengths=[470, 417, 316, 414, 391, 496, 409, 408]
    total writeups length:5820, overall prompt length: 3552, overall summary length: 622
total of 435 competitions processed
281


  0%|          | 0/1 [00:00<?, ?it/s]

435: competition RSNA 2022 Cervical Spine Fracture Detection, writeups 9


  0%|          | 0/9 [00:00<?, ?it/s]

     prompt lengths=[1384, 1380, 1314, 1264, 523, 1513, 1232, 716, 523]
    writeup lengths=[1200, 1195, 1130, 1080, 338, 1329, 1047, 533, 338]
    summary lengths=[365, 570, 560, 460, 415, 546, 412, 595, 444]
    total writeups length:8190, overall prompt length: 4602, overall summary length: 698
total of 436 competitions processed


  0%|          | 0/1 [00:00<?, ?it/s]

436: competition RSNA 2022 Cervical Spine Fracture Detection, writeups 9


  0%|          | 0/9 [00:00<?, ?it/s]

     prompt lengths=[1583, 1419, 1028, 1254, 906, 1322, 553, 917, 1426]
    writeup lengths=[1399, 1235, 844, 1069, 722, 1137, 369, 733, 1242]
    summary lengths=[470, 515, 585, 454, 528, 455, 497, 531, 517]
    total writeups length:8750, overall prompt length: 4787, overall summary length: 424
total of 437 competitions processed
282


  0%|          | 0/1 [00:00<?, ?it/s]

437: competition AI Village Capture the Flag @ DEFCON, writeups 7


  0%|          | 0/7 [00:00<?, ?it/s]

     prompt lengths=[584, 292, 1588, 388, 2192, 299, 366]
    writeup lengths=[351, 60, 1354, 155, 1959, 67, 134]
    summary lengths=[372, 369, 418, 390, 466, 426, 422]
    total writeups length:4080, overall prompt length: 3133, overall summary length: 677
total of 438 competitions processed


  0%|          | 0/1 [00:00<?, ?it/s]

438: competition AI Village Capture the Flag @ DEFCON, writeups 7


  0%|          | 0/7 [00:00<?, ?it/s]

     prompt lengths=[813, 426, 251, 289, 268, 304, 414]
    writeup lengths=[581, 194, 18, 57, 35, 70, 182]
    summary lengths=[461, 360, 551, 410, 397, 525, 400]
    total writeups length:1137, overall prompt length: 3374, overall summary length: 460
total of 439 competitions processed


In [78]:
subset_count = 0
#for idx in range(0, start_idx+subset):
for idx in range(0, len(writeup_dataframes_filtered)):
    wdfs = writeup_dataframes_filtered[idx]
    subset_count += len(wdfs)
#    for wdf in wdfs:
#        subset_count += wdf.shape[0]
        
print(subset_count)
print(len(writeup_summaries))
print(len(overall_summaries))

439
439
439


In [79]:
writeup_summaries[-1]

["What a fascinating competition! As a helpful assistant, I'll extract the key points from a datascience application viewpoint for similar problem solving. Here are the key takeaways:\n\n1. **Domain knowledge**: The competition involves various machine learning security challenges, which requires a good understanding of the underlying concepts and techniques. This highlights the importance of domain knowledge in tackling complex problems.\n2. **Adversarial attacks**: The writeup showcases the use of adversarial attacks, such as gradient attacks, to manipulate machine learning models. This demonstrates the need to consider potential attacks when designing and deploying machine learning systems.\n3. **Transfer learning**: The use of pre-trained models, such as InceptionV3 and MobileNet, as proxies for gradient attacks shows the value of transfer learning in certain situations.\n4. **Exploration and experimentation**: The writeup highlights the importance of exploring different approaches

## Build the Subsets

When we come here, we have writeup summaries in a single flat list. Now need to re-group then so that each subset is its own list of writeup summaries for that subset.

In [80]:
def recombine_summaries(summaries, sublist_lengths):
    result = []
    start_index = 0
    for idx, length in enumerate(sublist_lengths):
        end_index = start_index + length
        sublist = summaries[start_index:end_index]
        if len(sublist) == 0:
            print(f"stopping recombo due to end of list at idx {idx}")
            break
        result.append(sublist)
        start_index = end_index

    return result 

# Example usage
summaries = ["summary1_1", "summary1_2", "summary2_1", "summary2_2", "summary3_1"]
sublist_lengths = [2, 2, 1]

recombine_summaries(summaries, sublist_lengths)


[['summary1_1', 'summary1_2'], ['summary2_1', 'summary2_2'], ['summary3_1']]

In [81]:
# this should produce 5 lists of subsets, since I capped it at 5 for Kaggle above
#sublist_lengths = [len(sub) for sub in writeup_dataframes_filtered[-subset:]]
sublist_lengths = [len(sub) for sub in writeup_dataframes_filtered]
len(sublist_lengths)

283

In [82]:
if N != "ALL":
    combined = recombine_summaries(overall_summaries, sublist_lengths)
else:
    combined = None

## Summarize the Subsets Separately

For the summarizing large numbers of writeups in parts experiment, I first need summaries of the subsets. This is where that happens.


In [83]:
def summarize_subsummaries(sublist, print_prompt=False, prompt_style="simple"):
    #print(f"summarizing: {sublist}")
    writeup_summaries = [summary.data for summary in sublist]
    if prompt_style == "simple":
        points = "\n\n".join(writeup_summaries)
    elif prompt_style == "subtitle writeup":
        points = ""
        for idx, writeup_summary in enumerate(writeup_summaries):
            points += "\n\n"
            points += f"writeup {idx}\n"
            points += writeup_summary
    else:
        points = ""
        for idx, writeup_summary in enumerate(writeup_summaries):
            points += "\n\n"
            points += f"writeups summary {idx}\n"
            points += writeup_summary
    prompt = f"""collect the key points listed in these writeup summaries into one list of key points.
    include the descriptions given for these key points.
    writeup summaries:
    {points}
    """
    # trying to remove <|eot_id|> just in case Llama3 thinks the sequence ends in the middle
    prompt = prompt.replace("<|eot_id|>", "\n")
    summary, summary_md = llama3_generate(prompt, print_prompt)
    if print_prompt:
        # this printing is just for debugging, because Gemma is giving some 
        # rather strange output at times it seems
        input_ids = tokenizer(prompt, return_tensors="pt")
        prompt_len = len(input_ids["input_ids"][0])
        print(f"------- PROMPT ------- {prompt_len} tokens ------")
        print(prompt)
        input_ids = tokenizer(summary, return_tensors="pt")
        summary_len = len(input_ids["input_ids"][0])
        print(f"------ SUMMARY: ------ {summary_len} tokens ------")
        print(summary)
    return summary, summary_md

#print(type(tokenizer))


## Summarize Competition Subset Summaries to One Per Competition

Competition writeup summaries -> summaries for subsets of those -> one summary per competition.

In [84]:
if N != "ALL":
    len(combined)

In [85]:
resummarized = None
resummarized_md = None
resummarized_simple = []
resummarized_md_simple = []
resummarized_writeup = []
resummarized_md_writeup = []
resummarized_writeup_summary = []
resummarized_md_writeup_summary = []

if N != "ALL":
    for idx, combo in tqdm(enumerate(combined), total=len(combined)):
        combo_filtered = [x for x in combo if not isinstance(x, str)]
        if len(combo_filtered) == 0:
            print(f"no data, skipping idx: {idx}")
            continue
        title = final_names[idx]#comp_dataframes[0]["Title of Competition"].iloc[0]
        #print(title)
        if on_kaggle:
            # just to keep the story on kaggle consistent :)
            summary, summary_md = summarize_subsummaries(combo_filtered)
            resummarized_simple.append(summary)
            resummarized_md_simple.append(summary_md)
        else: #TODO: here the other summary types
            summary, summary_md = summarize_subsummaries(combo_filtered)
            resummarized_simple.append(summary)
            resummarized_md_simple.append(summary_md)
            
            summary, summary_md = summarize_subsummaries(combo_filtered, False, "subtitle writeup")
            resummarized_writeup.append(summary)
            resummarized_md_writeup.append(summary_md)
            
            summary, summary_md = summarize_subsummaries(combo_filtered, False, "writeup summaries title")
            resummarized_writeup_summary.append(summary)
            resummarized_md_writeup_summary.append(summary_md)

else:
    plain_summaries = []
    for idx, summary in enumerate(overall_summaries):
        if isinstance(summary, str):
            print(f"string: {idx}:{summary}")
            plain_summaries.append(summary)
        else:
            plain_summaries.append(summary.data)
            
    #plain_summaries = [summary.data for summary in overall_summaries]
    resummarized_simple.extend(plain_summaries)
    resummarized_md_simple.extend(overall_summaries)
    resummarized_comps = final_names

  0%|          | 0/283 [00:00<?, ?it/s]

OK, so that is for the 5 competitions I filtered above for Kaggle. Now lets look at them and what do they say, because these latter ones are ones where Gemma seems to lose the context for some reason:

In [86]:
len(resummarized_simple)

283

In [87]:
Markdown(resummarized_simple[0])

Here is the list of key points with descriptions:

1. **Clearly define the problem and its requirements**: Understand the problem and its requirements, and define them clearly. This helps to ensure that everyone involved in the project is on the same page and that the solution is tailored to the specific needs of the problem.

2. **Explore different approaches**: Don't be afraid to try new things and explore different approaches. This helps to ensure that the best solution is found and that the problem is solved in the most effective way possible.

3. **Code sharing and collaboration**: Share code and collaborate with others to learn from their experiences and improve your approach. This helps to speed up the development process and to ensure that the solution is robust and accurate.

4. **Develop an experimentation framework**: Develop a framework for experimentation to rapidly prototype and refine your solutions. This helps to ensure that the solution is tested thoroughly and that any issues are identified and addressed early on.

5. **View failure as an opportunity to learn**: View failure as an opportunity to learn and improve. This helps to ensure that the solution is refined and improved over time, and that any mistakes are learned from.

6. **Engage with the community**: Engage with the community to share knowledge and learn from others. This helps to ensure that the solution is the best it can be and that any issues are identified and addressed quickly.

7. **Design a robust rating system**: Design a robust rating system that takes into account various factors, such as player performance and opponent ratings. This helps to ensure that the solution is accurate and reliable.

8. **Use historical data**: Use historical data to inform predictions and make informed decisions. This helps to ensure that the solution is based on real-world data and that any predictions are accurate.

9. **Handle missing data**: Handle missing data using techniques such as using faked opponents and reducing ratings for players with less than 15 weighted games. This helps to ensure that the solution is robust and accurate, even in the presence of missing data.

10. **Iterative approach**: Use an iterative approach to refine predictions and account for changing circumstances. This helps to ensure that the solution is refined and improved over time, and that any issues are identified and addressed early on.

11. **Use well-defined metrics and formulas**: Use well-defined metrics and formulas to make predictions and inform decisions. This helps to ensure that the solution is accurate and reliable, and that any predictions are based on a solid foundation.

12. **Handle edge cases**: Consider edge cases and develop strategies to handle them. This helps to ensure that the solution is robust and accurate, even in unusual or unexpected situations.

13. **Use creative solutions**: Use creative solutions to handle missing data and improve the accuracy of predictions. This helps to ensure that the solution is innovative and effective.

14. **Iterative refinement**: Refine predictions through multiple iterations. This helps to ensure that the solution is refined and improved over time, and that any issues are identified and addressed early on.

15. **Use domain-specific knowledge**: Use domain-specific knowledge to inform predictions and make informed decisions. This helps to ensure that the solution is tailored to the specific needs of the problem and that any predictions are based on a deep understanding of the domain.

16. **Use iterative calculation**: Use iterative calculation to refine predictions and account for changing circumstances. This helps to ensure that the solution is refined and improved over time, and that any issues are identified and addressed early on.

17. **Weighting and normalization**: Use weighting and normalization to handle missing data and improve the accuracy of predictions. This helps to ensure that the solution is robust and accurate, even in the presence of missing data.

18. **Incorporating future results**: Incorporate future results into the solution to refine predictions and account for changing circumstances. This helps to ensure that the solution is refined and improved over time, and that any issues are identified and addressed early on.

19. **Simple modifications**: Make simple modifications to the solution to refine predictions and account for changing circumstances. This helps to ensure that the solution is refined and improved over time, and that any issues are identified and addressed early on.

20. **Customization of the rating formula**: Customize the rating formula to fit the specific needs of the problem. This helps to ensure that the solution is tailored to the specific needs of the problem and that any predictions are based on a deep understanding of the domain.

21. **Use of additional parameters**: Use additional parameters to refine predictions and account for changing circumstances. This helps to ensure that the solution is refined and improved over time, and that any issues are identified and addressed early on.

22. **Prediction formula**: Use a prediction formula to make predictions and inform decisions. This helps to ensure that the solution is accurate and reliable, and that any predictions are based on a solid foundation.

23. **Code implementation**: Implement the solution using code to ensure that it is robust and accurate. This

### It all Looks Rather Lost

The above results show how Gemma seems to have lost all the key points its collected before from the writeups, before doing the final summary of all subsets. Lets take a closer look.

## Closer look at Selected Competition 1 and its Final Summary vs Parts

Lets try to verify that I did not mess it up in giving wrong inputs or misinterpret the output. The first competition from that list of 5 selected above, and its subset summaries:

In [88]:
#lets first check how many subsets and thus subset summaries the selected competition 1 has
if N != "ALL":
    len(combined[0])


In [89]:
# the first subset summary for the first competition in the list
if N != "ALL":
    combined[0][0]


In [90]:
# the second subset summary for the first competition in the list
#combined[0][1]

In [91]:
# the third and final subset summary for the first competition in the list
#combined[0][2]

## Rerun the Final Summary for Competition 1 with Debug Info

The three subset summaries shown above seem fine. So lets see what the input is actually for the final summarization step, and what the output is. Just to see there is no obvious issue with the input and how it is run.

I put the debug/print_prompt flag in my helper functions above to print the prompt and token counts just for this purpose.

In [92]:
if N!="ALL":
    summary, summary_md = summarize_subsummaries(combined[0], True, "simple")

input tokens: 545, output tokens: 1000
------- PROMPT ------- 536 tokens ------
collect the key points listed in these writeup summaries into one list of key points.
    include the descriptions given for these key points.
    writeup summaries:
    Based on the writeup summaries, here are the key points that can be applied to similar datascience problems:

1. **Clearly define the problem and its requirements**: Understand the problem and its requirements, and define them clearly.
2. **Explore different approaches**: Don't be afraid to try new things and explore different approaches.
3. **Code sharing and collaboration**: Share code and collaborate with others to learn from their experiences and improve your approach.
4. **Develop an experimentation framework**: Develop a framework for experimentation to rapidly prototype and refine your solutions.
5. **View failure as an opportunity to learn**: View failure as an opportunity to learn and improve.
6. **Engage with the community**: Enga

### Look at the Comp 1 Prompt and Output

Here, Gemma seems to say it has summarized the prompt key points and their descriptions. Yet it provides no key points as final output, and no descriptions. It also claims there are no data analysis methods mentioned, even if just above it there is one.

## Rerun the Final Summary for Competition 2 with Debug Info

Now lets check similarly, the second competition in the selected set of 5. It has 2 subsets, which the debug info will also print with the prompt below.


In [93]:
# number of subset summaries for this competition
if N != "ALL":
    print(len(combined[1]))


1


In [94]:
if N != "ALL":
    summary, summary_md = summarize_subsummaries(combined[1], True)

input tokens: 505, output tokens: 396
------- PROMPT ------- 496 tokens ------
collect the key points listed in these writeup summaries into one list of key points.
    include the descriptions given for these key points.
    writeup summaries:
    Based on the writeup summary, here are the key points that can be applied to similar problem-solving:

**Understanding the problem**: Identify the key factors that affect the outcome, such as traffic volume, road conditions, and time of day.

**Data preparation**: Handle missing values, clean the data, and perform feature engineering to ensure accurate and reliable data.

**Feature selection**: Select the most relevant features that can help predict the outcome, such as traffic volume, road conditions, time of day, and weather.

**Model selection**: Choose a suitable machine learning model, such as linear regression, decision trees, random forests, or neural networks.

**Hyperparameter tuning**: Optimize the model's performance using techniq

### Look at the Comp 2 Prompt and Output

This time Gemma seems to have lost it even better here, the answer mentions not finding any information about datasets, which is true but not related to the prompt in any way. The word dataset is not mentioned in the given input context at all, just a request to summarize the key points.

## Changing the Prompt

In the above debug prints, we see that my prompting might confuse Gemma a bit in asking to summarize multiple writeup summaries, but just concatenating them all together. So lets try giving it a clue where the next writeup summary starts.

In [95]:
if N != "ALL":
    summary, summary_md = summarize_subsummaries(combined[0], True, "subtitle writeup")

input tokens: 550, output tokens: 1000
------- PROMPT ------- 541 tokens ------
collect the key points listed in these writeup summaries into one list of key points.
    include the descriptions given for these key points.
    writeup summaries:
    

writeup 0
Based on the writeup summaries, here are the key points that can be applied to similar datascience problems:

1. **Clearly define the problem and its requirements**: Understand the problem and its requirements, and define them clearly.
2. **Explore different approaches**: Don't be afraid to try new things and explore different approaches.
3. **Code sharing and collaboration**: Share code and collaborate with others to learn from their experiences and improve your approach.
4. **Develop an experimentation framework**: Develop a framework for experimentation to rapidly prototype and refine your solutions.
5. **View failure as an opportunity to learn**: View failure as an opportunity to learn and improve.
6. **Engage with the commu

### Did it Help?

Well, giving Gemma subtitles in the input actually seems to have helped it do a good job. So lets see with a bit better formatting.

In [96]:
if N != "ALL":
    summary_md

### And Again a Slightly Different prompt

Instead of subtitling the subset summaries as "writeup X", lets try a more correct "writeups summary X" and see if it has any effect:

In [97]:
if N != "ALL":
    summary, summary_md = summarize_subsummaries(combined[0], True, "writeup summaries title")

input tokens: 551, output tokens: 1000
------- PROMPT ------- 542 tokens ------
collect the key points listed in these writeup summaries into one list of key points.
    include the descriptions given for these key points.
    writeup summaries:
    

writeups summary 0
Based on the writeup summaries, here are the key points that can be applied to similar datascience problems:

1. **Clearly define the problem and its requirements**: Understand the problem and its requirements, and define them clearly.
2. **Explore different approaches**: Don't be afraid to try new things and explore different approaches.
3. **Code sharing and collaboration**: Share code and collaborate with others to learn from their experiences and improve your approach.
4. **Develop an experimentation framework**: Develop a framework for experimentation to rapidly prototype and refine your solutions.
5. **View failure as an opportunity to learn**: View failure as an opportunity to learn and improve.
6. **Engage with 

### Any Difference?

Again, the results seem to be much better than just concatenating the summaries as input for final summarization. And it actually seems almost identical. So just subtitling the input sections seem to be needed, otherwise Gemma gets all confused.

In [98]:
if N != "ALL":
    summary_md

# Subset Summarization Conclusions So Far

You need to subtitle your subsets in the input context. What more can I say.. :)

In [99]:
#-------------------------------

In [100]:
import pickle

# List of all your lists and their corresponding file names
data = {
    "writeup_summaries": writeup_summaries,
    "writeup_summaries_md": writeup_summaries_md,
    "overall_summaries": overall_summaries,
    "processed_titles": processed_titles,
    "processed_titles_list": processed_titles_list,
    "skipped_titles": skipped_titles,
#    "skipped_titles_list": skipped_titles_list,
    "overall_summary_prompts": overall_summary_prompts,
    "prompts": prompts,
    "writeups": writeups,
    "prompt_lengths": prompt_lengths,
    "writeup_lengths": writeup_lengths,
    "summary_lengths": summary_lengths,
    "resummarized_simple": resummarized_simple,
    "resummarized_md_simple": resummarized_md_simple,
    "resummarized_writeup": resummarized_writeup,
    "resummarized_md_writeup": resummarized_md_writeup,
    "resummarized_writeup_summary": resummarized_writeup_summary,
    "resummarized_md_writeup_summary": resummarized_md_writeup_summary,
}

In [101]:
split_size

10

In [102]:
import pickle

# Assuming 'data' is your dictionary
with open(f'data_{split_size}.pickle', 'wb') as f:
    pickle.dump(data, f)

In [103]:
# Load the data back in
with open(f'data_{split_size}.pickle', 'rb') as f:
    data = pickle.load(f)

In [104]:
# Import pandas
import pandas as pd

# Calculate the number of writeups per competition
writeup_counts = df_writeups.groupby("Title of Competition").size()

# Create a DataFrame for writeup counts
df_writeup_counts = pd.DataFrame({
    'Competition': writeup_counts.index,
    'Number of Writeups': writeup_counts.values
})

# Rename the 'comp_name' column in df_comp_meta to 'Competition' for the merge
df_comp_meta_renamed = df_comp_meta.rename(columns={'comp_name': 'Competition'})

# Merge the two DataFrames on the 'Competition' column
df_summary = pd.merge(df_writeup_counts, df_comp_meta_renamed, on='Competition', how='outer')

# Replace NaN values in 'Number of Writeups' with 0
df_summary['Number of Writeups'] = df_summary['Number of Writeups'].fillna(0)

# Create a new column 'Metadata Available' that is True if 'desc' is not NaN
df_summary['Metadata Available'] = ~df_summary['desc'].isna()

# Reset the index
df_summary.reset_index(drop=True, inplace=True)

In [105]:
# find all column names in df_summary that are type float
float_cols = df_summary.select_dtypes(include=['float']).columns
float_cols

Index(['Number of Writeups', 'teams', 'competitors', 'Entries', 'start_date',
       'start_year', 'final_date', 'final_year'],
      dtype='object')

In [106]:
# Fill NaN values with -1 for columns in float_cols
df_summary[float_cols] = df_summary[float_cols].fillna(-1)

In [107]:
# convert all columns in float_cols to integer type
df_summary[float_cols] = df_summary[float_cols].astype(int)
df_summary.head()

Unnamed: 0,Competition,Number of Writeups,comp_Reward,comp_link,teams,competitors,Entries,Tag,desc,code_link,start_date,start_month,start_year,final_date,final_month,final_year,Metadata Available
0,iNaturalist Challenge at FGVC5,0,Kudos,https://www.kaggle.com/competitions/inaturalis...,59,71,759,meanbesterroratk,\nAs part of the FGVC5 workshop at CVPR 2018 w...,https://www.kaggle.com/competitions/inaturalis...,23,Feb,2018,29,May,2018,True
1,15.071x - The Analytics Edge (Spring 2015),0,Knowledge,https://www.kaggle.com/competitions/15-071x-th...,2920,2920,41624,auc,IMPORTANT NOTE: This competition is only open ...,https://www.kaggle.com/competitions/15-071x-th...,13,Apr,2015,5,May,2015,True
2,15.071x - The Analytics Edge (Spring 2015),0,Knowledge,https://www.kaggle.com/competitions/15-071x-th...,2920,2920,41624,auc,IMPORTANT NOTE: This competition is only open ...,https://www.kaggle.com/competitions/15-071x-th...,4,Apr,2015,4,Apr,2015,True
3,1st and Future - Player Contact Detection,12,,,-1,-1,-1,,,,-1,,-1,-1,,-1,False
4,20 Newsgroups Ciphertext Challenge,0,Swag,https://www.kaggle.com/competitions/20-newsgro...,142,145,1139,text data,\nThis isn't your classic decoder ring puzzle ...,https://www.kaggle.com/code/n0rm41/let-s-have-fun,14,Dec,2018,17,Jan,2019,True


In [108]:
too_long_indices = []
overall_summaries_raw = []

for idx, sum in enumerate(overall_summaries):
    # check if sum is string
    if isinstance(sum, str):
        overall_summaries_raw.append(sum)
        too_long_indices.append(idx)
    else:
        overall_summaries_raw.append(sum.data)

too_long_indices

[]

In [109]:
# combine compatition names and resummarized summaries into a dataframe
if N != "ALL":
    df_resummarized = pd.DataFrame({
        'Competition': processed_titles_list,
    #    'Resummarized Summary': overall_summaries_raw
        'Resummarized Summary Simple': data["resummarized_simple"],
        'Resummarized Summary Writeup': data["resummarized_writeup"],
        'Resummarized Summary Writeup Summary': data["resummarized_writeup_summary"]
    })
else:
    df_resummarized = pd.DataFrame({
        'Competition': processed_titles_list,
        'Overall Summary': data["resummarized_simple"],
    })

df_resummarized.head()

Unnamed: 0,Competition,Resummarized Summary Simple,Resummarized Summary Writeup,Resummarized Summary Writeup Summary
0,Chess ratings - Elo versus the Rest of the World,Here is the list of key points with descriptio...,Here is the list of key points with descriptio...,Here is the list of key points with descriptio...
1,RTA Freeway Travel Time Prediction,Here is the list of key points with descriptio...,Here is the list of key points with descriptio...,Here is the list of key points with descriptio...
2,Predict Grant Applications,Here is the list of key points with descriptio...,Here is the list of key points with descriptio...,Here is the list of key points with descriptio...
3,Stay Alert! The Ford Challenge,Here is the list of key points with descriptio...,Here is the list of key points with descriptio...,Here is the list of key points with descriptio...
4,Don't Overfit!,Here is the list of key points with descriptio...,Here is the list of key points with descriptio...,Here is the list of key points with descriptio...


In [110]:
#merge df_resummarized with df_summary
df_final = pd.merge(df_summary, df_resummarized, on='Competition', how='outer')
df_final.head(5)

Unnamed: 0,Competition,Number of Writeups,comp_Reward,comp_link,teams,competitors,Entries,Tag,desc,code_link,start_date,start_month,start_year,final_date,final_month,final_year,Metadata Available,Resummarized Summary Simple,Resummarized Summary Writeup,Resummarized Summary Writeup Summary
0,iNaturalist Challenge at FGVC5,0,Kudos,https://www.kaggle.com/competitions/inaturalis...,59,71,759,meanbesterroratk,\nAs part of the FGVC5 workshop at CVPR 2018 w...,https://www.kaggle.com/competitions/inaturalis...,23,Feb,2018,29,May,2018,True,,,
1,15.071x - The Analytics Edge (Spring 2015),0,Knowledge,https://www.kaggle.com/competitions/15-071x-th...,2920,2920,41624,auc,IMPORTANT NOTE: This competition is only open ...,https://www.kaggle.com/competitions/15-071x-th...,13,Apr,2015,5,May,2015,True,,,
2,15.071x - The Analytics Edge (Spring 2015),0,Knowledge,https://www.kaggle.com/competitions/15-071x-th...,2920,2920,41624,auc,IMPORTANT NOTE: This competition is only open ...,https://www.kaggle.com/competitions/15-071x-th...,4,Apr,2015,4,Apr,2015,True,,,
3,1st and Future - Player Contact Detection,12,,,-1,-1,-1,,,,-1,,-1,-1,,-1,False,,,
4,20 Newsgroups Ciphertext Challenge,0,Swag,https://www.kaggle.com/competitions/20-newsgro...,142,145,1139,text data,\nThis isn't your classic decoder ring puzzle ...,https://www.kaggle.com/code/n0rm41/let-s-have-fun,14,Dec,2018,17,Jan,2019,True,,,


In [111]:
df_final.to_csv(f"combined_df_{split_size}.csv", index=False)

In [112]:
# Initialize an empty dictionary
combined_dict = {}

idx = 0
# Iterate over processed_titles, writeup_dataframes, and combined simultaneously
#for title, dataframe, combine in zip(processed_titles_list, writeup_dataframes_filtered, combined):
for title, dataframe, combine in zip(processed_titles_list, writeup_dataframes_filtered, overall_summaries):
    # For each title, create a new dictionary with keys "writeup_dataframes" and "combined"
    combined_dict[title] = {"writeup_dataframes": dataframe, "combined": combine}
    if title == "2019 Data Science Bowl":
        print(idx)
    idx += 1

192


In [113]:
# write the combined_dict to a pickle file
with open(f'combined_dict_{split_size}.pickle', 'wb') as f:
    pickle.dump(combined_dict, f)

In [114]:
with open(f'combined_dict_{split_size}.pickle', 'rb') as f:
    combined_dict = pickle.load(f)

# Compare Subsets vs All-in-One

I ran bigger sets of summaries on my local desktop and collected them into a dataset for look here at Kaggle without spending too much GPU time. So lets load that up and see what it looks like:

In [115]:
def load_and_filter_df(split_size, more_than=10):
    # Load the combined dataframe
    if on_kaggle:
        df = pd.read_csv(f"/kaggle/input/gemma-summaries-for-kaggle-writeups/gemma_summaries_{split_size}.csv")
    else:
        df = pd.read_csv(f"combined_df_{split_size}.csv")
    print(f"Loaded combined dataframe with shape {df.shape}")
    # Filter the dataframe for competitions with more than 10 writeups (or more_than param)
    df_filtered = df[df["Number of Writeups"] > more_than]
    print(f"Filtered dataframe has shape {df_filtered.shape}")
    # drop compos that had no metadata and were thus not processed
    df_filtered = df_filtered[df_filtered["comp_Reward"].notna()]
    print(f"Filtered dataframe has shape {df_filtered.shape}")
    return df_filtered


In [116]:
df_5 = load_and_filter_df(5, 0)
df_10 = load_and_filter_df(10, 0)
df_all = load_and_filter_df("ALL", 0)

Loaded combined dataframe with shape (554, 20)
Filtered dataframe has shape (311, 20)
Filtered dataframe has shape (284, 20)
Loaded combined dataframe with shape (554, 20)
Filtered dataframe has shape (311, 20)
Filtered dataframe has shape (284, 20)
Loaded combined dataframe with shape (554, 18)
Filtered dataframe has shape (311, 18)
Filtered dataframe has shape (284, 18)


In [117]:
with open(f'data_5.pickle', 'rb') as f:
    data_5 = pickle.load(f)
with open(f'data_10.pickle', 'rb') as f:
    data_10 = pickle.load(f)
with open(f'data_ALL.pickle', 'rb') as f:
    data_all = pickle.load(f)

In [118]:
df_5["Number of Writeups"].unique()

array([24, 14, 13,  1, 33, 18, 11,  9,  3,  5, 31, 22,  2, 28, 19, 21,  6,
       10,  7, 41, 30, 16,  4, 17, 15, 12,  8, 35, 23, 20, 29, 36, 26, 25,
       27, 37, 32])

# Competition with 1 Writeup

This one has only a single writeup, so the summarization should be simple.

In [119]:
df_5[df_5["Number of Writeups"]==1].iloc[0]

Competition                                 AMS 2013-2014 Solar Energy Prediction Contest
Number of Writeups                                                                      1
comp_Reward                                                                          1000
comp_link                               https://www.kaggle.com/competitions/ams-2014-s...
teams                                                                                 160
competitors                                                                           199
Entries                                                                              2506
Tag                                                                                   mae
desc                                    Welcome to the American Meteorological Society...
code_link                               https://www.kaggle.com/competitions/ams-2014-s...
start_date                                                                              9
start_mont

## Subsets of 5

Splitting 1 writeup into subsets of 5 results in just a single subset with the single writeup. So not much to look at, but lets look for having a complete process.

In [120]:
comp1 = df_5[df_5["Number of Writeups"]==1]["Competition"].iloc[0]
summy = df_5[df_5["Competition"]==comp1]["Resummarized Summary Simple"].iloc[0]
Markdown(summy)

Here is the list of key points with descriptions:

1. **Raw Feature Usage**: Consider using raw features when there are many features to consider, as they can be useful in capturing patterns in the data. However, feature engineering can help identify the most important features and reduce dimensionality.

2. **Contiguous Validation**: When dealing with temporal data, use contiguous validation folds to capture patterns within each fold, which can help improve model performance.

3. **Feature Selection**: When dealing with a large number of features, use feature selection techniques to identify the most important features, which can help reduce dimensionality and improve model performance.

4. **Model Averaging**: Use model averaging to combine the predictions of multiple models, which can help reduce the variance of the predictions and improve overall performance.

5. **Gradient Boosting**: Consider using Gradient Boosting or other ensemble methods when dealing with complex regression problems, as it is a powerful algorithm that can be effective for regression problems.

6. **Data Analysis Methods**: The writeups highlight the importance of using various data analysis methods, including raw feature usage, contiguous validation, feature selection, model averaging, and gradient boosting, to develop a comprehensive approach to tackling a Kaggle data science competition.

These key points are interconnected, with raw feature usage and feature selection being used to handle large numbers of features, contiguous validation being used to capture patterns in temporal data, and model averaging and gradient boosting being used to improve model performance.<|eot_id|>

## Subsets of 10

For 1 writeup this is the same as subset 5. But lets see again..

In [121]:
summy = df_10[df_10["Competition"]==comp1]["Resummarized Summary Simple"].iloc[0]
Markdown(summy)

Here is the list of key points with descriptions:

1. **Raw Feature Usage**: Consider using raw features when there are many features to consider, as they can be useful in capturing patterns in the data. However, feature engineering can help identify the most important features and reduce dimensionality.

2. **Contiguous Validation**: When dealing with temporal data, use contiguous validation folds to capture patterns within each fold, which can help improve model performance.

3. **Feature Selection**: When dealing with a large number of features, use feature selection techniques to identify the most important features, which can help reduce dimensionality and improve model performance.

4. **Model Averaging**: Use model averaging to combine the predictions of multiple models, which can help reduce the variance of the predictions and improve overall performance.

5. **Gradient Boosting**: Consider using Gradient Boosting or other ensemble methods when dealing with complex regression problems, as it is a powerful algorithm that can be effective for regression problems.

6. **Data Analysis Methods**: The writeups highlight the importance of using various data analysis methods, including raw feature usage, contiguous validation, feature selection, model averaging, and gradient boosting, to develop a comprehensive approach to tackling a Kaggle data science competition.

These key points are interconnected, with raw feature usage and feature selection being used to handle large numbers of features, contiguous validation being used to capture patterns in temporal data, and model averaging and gradient boosting being used to improve model performance.<|eot_id|>

## All in one, no subsets

Again, about the same but missing the extra step of re-summarizing the subset summaries. Because it already summarizes all writeup summaries in one step. This is why it has the same points practically but the structuring and wording differs.

In [122]:
summy = df_all[df_all["Competition"]==comp1]["Overall Summary"].iloc[0]
Markdown(summy)

Based on the writeups, here are some guidelines for approaching a Kaggle data science competition:

**Raw Feature Usage**: Consider the trade-offs between feature engineering and raw feature usage. Raw features can be useful when there are many features to consider, but feature engineering can help identify the most important features and reduce dimensionality.

**Contiguous Validation**: When dealing with temporal data, consider using contiguous validation folds to capture patterns within each fold.

**Feature Selection**: When dealing with a large number of features, consider using feature selection techniques to identify the most important features. This can help reduce dimensionality and improve model performance.

**Model Averaging**: Model averaging can be a useful strategy when dealing with multiple models, as it can help reduce the variance of the predictions and improve overall performance.

**Gradient Boosting**: Gradient Boosting is a powerful algorithm that can be effective for regression problems. Consider using Gradient Boosting or other ensemble methods when dealing with complex regression problems.

**Data Analysis Methods**: The writeups highlight the following data analysis methods:

* Raw feature usage
* Contiguous validation
* Feature selection
* Model averaging
* Gradient Boosting

These methods are related across the writeups in the following ways:

* Raw feature usage and feature selection are both used to handle the large number of features in the dataset.
* Contiguous validation is used to capture patterns in the temporal data, which is related to the use of gradient boosting and model averaging to improve model performance.
* Model averaging is used to combine the predictions of multiple models, which is related to the use of gradient boosting as a powerful algorithm for regression problems.

By considering these guidelines and data analysis methods, you can develop a comprehensive approach to tackling a Kaggle data science competition.<|eot_id|>

## Odd...

The above "all" summary talks about "both" summaries, when there is only a single writeup. So a quick look at where does this come from:

In [123]:
comp1_idx = data_all["processed_titles_list"].index(comp1)
Markdown(data_all["overall_summary_prompts"][comp1_idx])

The following gives a summary of a Kaggle competition description, and a set of one or more writeups on solutions used in that competition, separated by ======.

Use these to summarize a set of guidelines for ideas on how to approach a given Kaggle data science competition. 

Competition description summary: Here is a concise summary of the contest topic and goals:

The contest aims to develop accurate short-term predictions of solar energy production at 98 Oklahoma Mesonet sites. The goal is to identify the best statistical and machine learning techniques for predicting daily solar energy totals, using numerical weather prediction data as input. The contest will evaluate predictions using training data from 1994-2007, public testing data from 2008-2009, and private testing data for a more recent period.<|eot_id|>



======

 writeup summary:
 Here are the key points from a datascience application viewpoint for similar problem solving:

1. **Raw feature usage**: The approach didn't involve much feature engineering, instead relying on raw features. This can be a good strategy when dealing with a large number of features, but may not always be the most effective approach.

Lesson: Consider the trade-offs between feature engineering and raw feature usage. Raw features can be useful when there are many features to consider, but feature engineering can help identify the most important features and reduce dimensionality.

2. **Contiguous validation**: The team used 3-fold contiguous validation, where each fold consisted of a contiguous period of years (e.g. 1994-1998). This can be a good strategy when dealing with temporal data, as it allows the model to learn patterns within each fold.

Lesson: When dealing with temporal data, consider using contiguous validation folds to capture patterns within each fold.

3. **Feature selection**: The team used all features from the forecast files without preprocessing, resulting in a large number of features (approximately 320). They also used features such as month of the year, distance to each used meso, and latitude difference to each meso.

Lesson: When dealing with a large number of features, consider using feature selection techniques to identify the most important features. This can help reduce dimensionality and improve model performance.

4. **Model averaging**: The team trained 11 models, one for each forecast member, and averaged the predictions to optimize MAE.

Lesson: Model averaging can be a useful strategy when dealing with multiple models, as it can help reduce the variance of the predictions and improve overall performance.

5. **Gradient Boosting**: The team used Python's GradientBoostedRegressor to optimize MAE.

Lesson: Gradient Boosting is a powerful algorithm that can be effective for regression problems. Consider using Gradient Boosting or other ensemble methods when dealing with complex regression problems.

Overall, this approach demonstrates a straightforward and effective strategy for dealing with a complex regression problem. By using raw features, contiguous validation, and model averaging, the team was able to achieve good results.<|eot_id|>



======

 Focus on the key points of the writeups and how they might have helped achieving better score in the competition.Extract specifically used data analysis methods, and summarize how they are related across writeups.

answer: 

Well, it seems to get confused by the separator ====== I used. Guess this model at this size has some training to take. Or need to be aware of these limitations.

# 5 Writeups

In [124]:
df_5[df_5["Number of Writeups"]==5].iloc[0]

Competition                             American Epilepsy Society Seizure Prediction C...
Number of Writeups                                                                      5
comp_Reward                                                                         25000
comp_link                               https://www.kaggle.com/competitions/seizure-pr...
teams                                                                                 504
competitors                                                                           653
Entries                                                                             17777
Tag                                                                                   auc
desc                                    Seizure forecasting systems hold promise for i...
code_link                               https://www.kaggle.com/code/mpwolke/epilepsy-m...
start_date                                                                             25
start_mont

## Subsets of 5

It is a single subset, and seems to summarize quite fine.

In [125]:
comp5 = df_5[df_5["Number of Writeups"]==5]["Competition"].iloc[0]
summy = df_5[df_5["Competition"]==comp5]["Resummarized Summary Simple"].iloc[0]
Markdown(summy)

Here is the list of key points with descriptions:

1. **Problem understanding**: Take the time to thoroughly understand the goal and requirements of the competition. This involves gaining a deep understanding of the problem and what is expected of the solution.

2. **Data exploration**: Perform exploratory data analysis (EDA) to gain insights into the data and identify potential features to extract. This helps to understand the data distribution, identify patterns, and detect anomalies.

3. **Feature engineering**: Engineer features that are relevant to the problem and suitable for modeling. This involves developing diverse feature sets and using regularization techniques to prevent overfitting.

4. **Model selection**: Select a suitable machine learning algorithm based on the data and problem characteristics. This involves choosing an algorithm that is well-suited to the problem and data type.

5. **Model evaluation**: Evaluate your model using techniques such as cross-validation to ensure generalizability. This helps to assess the model's performance and identify areas for improvement.

6. **Code organization and documentation**: Organize your code in a way that makes it easy to understand and maintain, and provide clear documentation and reports. This helps to ensure that the code is readable and modifiable.

7. **Openness to feedback**: Be open to feedback and willing to improve your solution based on input from others. This involves being receptive to suggestions and willing to make changes to improve the solution.

8. **Ensemble approach**: Combine multiple models to improve performance. This involves combining the predictions of multiple models to achieve better results.

9. **Domain knowledge integration**: Integrate domain knowledge into the problem-solving process. This involves incorporating expert knowledge and insights into the solution.

10. **Iterative improvement**: Continuously learn and improve your solution through experimentation and refinement. This involves refining the solution through repeated testing and iteration.

11. **Problem-specific approaches**: Develop problem-specific approaches that take into account the unique characteristics of the problem. This involves developing solutions that are tailored to the specific problem and data.

12. **Code sharing**: Share your code and report publicly to collaborate with others and contribute to the datascience community. This involves sharing knowledge and expertise with others to advance the field.

13. **Regularization techniques**: Use regularization techniques such as Lasso, elastic net, dropout, and early stopping to prevent overfitting. This helps to prevent the model from becoming too complex and overfitting the training data.

14. **Signal processing techniques**: Use signal processing techniques such as time-frequency features and spectral features to extract relevant information from the data. This helps to extract meaningful features from the data.

15. **Neural networks**: Use neural networks as a machine learning algorithm. This involves using a neural network to model the relationship between the input and output variables.

16. **GLM**: Use Generalized Linear Models (GLM) as a machine learning algorithm. This involves using a GLM to model the relationship between the input and output variables.

17. **Random Forest**: Use Random Forest as a machine learning algorithm. This involves using a Random Forest to model the relationship between the input and output variables.

18. **Support Vector Machines**: Use Support Vector Machines (SVM) as a machine learning algorithm. This involves using an SVM to model the relationship between the input and output variables.

19. **Cross-validation**: Use cross-validation to evaluate the model's performance. This involves splitting the data into training and testing sets and using the testing set to evaluate the model's performance.

20. **Early stopping**: Use early stopping to prevent overfitting. This involves stopping the training process early to prevent the model from becoming too complex and overfitting the training data.<|eot_id|>

## Subsets of 10

Practically the same as above subsets of 5. No wonder since the 5 writeups fit into both as a single subset.

In [126]:
summy = df_10[df_10["Competition"]==comp5]["Resummarized Summary Simple"].iloc[0]
Markdown(summy)

Here is the list of key points with descriptions:

1. **Problem understanding**: Take the time to thoroughly understand the goal and requirements of the competition. This involves gaining a deep understanding of the problem and what is expected of the solution.

2. **Data exploration**: Perform exploratory data analysis (EDA) to gain insights into the data and identify potential features to extract. This helps to understand the data distribution, identify patterns, and detect anomalies.

3. **Feature engineering**: Engineer features that are relevant to the problem and suitable for modeling. This involves developing diverse feature sets and using regularization techniques to prevent overfitting.

4. **Model selection**: Select a suitable machine learning algorithm based on the data and problem characteristics. This involves choosing an algorithm that is well-suited to the problem and data type.

5. **Model evaluation**: Evaluate your model using techniques such as cross-validation to ensure generalizability. This helps to assess the model's performance and identify areas for improvement.

6. **Code organization and documentation**: Organize your code in a way that makes it easy to understand and maintain, and provide clear documentation and reports. This helps to ensure that the code is readable and modifiable.

7. **Openness to feedback**: Be open to feedback and willing to improve your solution based on input from others. This involves being receptive to suggestions and willing to make changes to improve the solution.

8. **Ensemble approach**: Combine multiple models to improve performance. This involves combining the predictions of multiple models to achieve better results.

9. **Domain knowledge integration**: Integrate domain knowledge into the problem-solving process. This involves incorporating expert knowledge and insights into the solution.

10. **Iterative improvement**: Continuously learn and improve your solution through experimentation and refinement. This involves refining the solution through repeated testing and iteration.

11. **Problem-specific approaches**: Develop problem-specific approaches that take into account the unique characteristics of the problem. This involves developing solutions that are tailored to the specific problem and data.

12. **Code sharing**: Share your code and report publicly to collaborate with others and contribute to the datascience community. This involves sharing knowledge and expertise with others to advance the field.

13. **Regularization techniques**: Use regularization techniques such as Lasso, elastic net, dropout, and early stopping to prevent overfitting. This helps to prevent the model from becoming too complex and overfitting the training data.

14. **Signal processing techniques**: Use signal processing techniques such as time-frequency features and spectral features to extract relevant information from the data. This helps to extract meaningful features from the data.

15. **Neural networks**: Use neural networks as a machine learning algorithm. This involves using a neural network to model the relationship between the input and output variables.

16. **GLM**: Use Generalized Linear Models (GLM) as a machine learning algorithm. This involves using a GLM to model the relationship between the input and output variables.

17. **Random Forest**: Use Random Forest as a machine learning algorithm. This involves using a Random Forest to model the relationship between the input and output variables.

18. **Support Vector Machines**: Use Support Vector Machines (SVM) as a machine learning algorithm. This involves using an SVM to model the relationship between the input and output variables.

19. **Cross-validation**: Use cross-validation to evaluate the model's performance. This involves splitting the data into training and testing sets and using the testing set to evaluate the model's performance.

20. **Early stopping**: Use early stopping to prevent overfitting. This involves stopping the training process early to prevent the model from becoming too complex and overfitting the training data.<|eot_id|>

## No Subsets, All in One

The results are about the same but it lacks on step in re-summarization, so the result is a bit more "raw" and focused (the "additional points" start is not there):

In [127]:
summy = df_all[df_all["Competition"]==comp5]["Overall Summary"].iloc[0]
Markdown(summy)

Based on the writeup summaries, here are the key points that can be applied to similar problem-solving approaches in datascience:

1. **Problem understanding**: Take the time to thoroughly understand the goal and requirements of the competition.
2. **Data exploration**: Perform exploratory data analysis (EDA) to gain insights into the data and identify potential features to extract.
3. **Feature engineering**: Engineer features that are relevant to the problem and suitable for modeling.
4. **Model selection**: Select a suitable machine learning algorithm based on the data and problem characteristics.
5. **Model evaluation**: Evaluate your model using techniques such as cross-validation to ensure generalizability.
6. **Code organization and documentation**: Organize your code in a way that makes it easy to understand and maintain, and provide clear documentation and reports.
7. **Openness to feedback**: Be open to feedback and willing to improve your solution based on input from others.
8. **Ensemble approach**: Combine multiple models to improve performance.
9. **Domain knowledge integration**: Integrate domain knowledge into the problem-solving process.
10. **Feature engineering**: Develop diverse feature sets and use regularization techniques to prevent overfitting.
11. **Code sharing**: Share your code and report publicly to collaborate with others and contribute to the datascience community.
12. **Iterative improvement**: Continuously learn and improve your solution through experimentation and refinement.
13. **Problem-specific approaches**: Develop problem-specific approaches that take into account the unique characteristics of the problem.

The data analysis methods used across the writeups include:

* Exploratory data analysis (EDA)
* Feature engineering (e.g., time-frequency features, spectral features, signal processing techniques)
* Model selection (e.g., neural networks, GLM, Random Forest, Support Vector Machines)
* Model evaluation (e.g., cross-validation)
* Regularization techniques (e.g., Lasso, elastic net, dropout, early stopping)
* Ensemble approach
* Domain knowledge integration
* Code sharing and collaboration

These methods are related across the writeups in that they all contribute to developing a robust and effective solution to the competition problem. By combining these methods, teams can improve their chances of achieving a better score in the competition.<|eot_id|>

# 10 Writeups

This should be 2 subsets for size of 5, and all in one for size 10 and "ALL".

In [128]:
df_5[df_5["Number of Writeups"]==10].iloc[0]

Competition                                           COVID19 Global Forecasting (Week 4)
Number of Writeups                                                                     10
comp_Reward                                                                     Knowledge
comp_link                               https://www.kaggle.com/competitions/covid19-gl...
teams                                                                                 472
competitors                                                                          1290
Entries                                                                              1925
Tag                                                                          tabular data
desc                                    This is week 4 of Kaggle's COVID-19 forecastin...
code_link                               https://www.kaggle.com/code/mozattt/automated-...
start_date                                                                              9
start_mont

## Subsets of 5

This loses the key point details already:

In [129]:
comp10 = df_5[df_5["Number of Writeups"]==10]["Competition"].iloc[0]
summy = df_5[df_5["Competition"]==comp10]["Resummarized Summary Simple"].iloc[0]
Markdown(summy)

Here is the list of key points with descriptions:

1. **Time constraint**: Prioritize tasks, manage time effectively, and adapt to changing circumstances. (Description: Time constraints can impact the development of predictive models, and prioritizing tasks and adapting to changing circumstances can help overcome these limitations.)

2. **Data integration**: Combine multiple data sources to gain a more comprehensive understanding of the problem. (Description: Integrating multiple data sources can provide a more complete picture of the problem, allowing for more effective modeling and prediction.)

3. **Model selection**: Select the right tool for the task, considering factors such as data complexity, model interpretability, and computational resources. (Description: Choosing the right model for the task is crucial, as different models are better suited for different types of data and problems.)

4. **Benchmarking**: Document and share knowledge, allowing others to build upon and learn from previous work. (Description: Benchmarking allows for the sharing of knowledge and the ability to build upon previous work, which can lead to better outcomes.)

5. **Collaboration**: Share ideas and expertise with others to lead to better outcomes. (Description: Collaboration can lead to better outcomes by allowing for the sharing of ideas and expertise.)

6. **Adaptability**: Be willing to adapt to changing data patterns and re-evaluate model performance. (Description: Adapting to changing data patterns and re-evaluating model performance is essential for developing effective predictive models.)

7. **Lack of data**: Understand the limitations of data availability and explore alternative approaches when data is scarce. (Description: When data is scarce, understanding the limitations of data availability and exploring alternative approaches can help overcome these limitations.)

8. **Simple models can be effective**: Simple models can be effective in certain situations, especially when there is limited data. (Description: Simple models can be effective in certain situations, especially when there is limited data, and can provide a good starting point for more complex modeling.)

9. **Hyperparameter tuning**: Explore different hyperparameters and evaluate their impact on model performance. (Description: Hyperparameter tuning is essential for finding the optimal combination of hyperparameters that result in the best model performance.)

10. **Combining multiple models**: Combine multiple models to improve overall performance. (Description: Combining multiple models can improve overall performance by reducing the impact of individual model errors.)

11. **Adapting to changing data**: Adapt to changing data patterns and re-evaluate model performance. (Description: Adapting to changing data patterns and re-evaluating model performance is essential for developing effective predictive models.)

12. **Subjectivity in modeling**: Consider multiple perspectives and combine multiple models to improve overall performance. (Description: Subjectivity in modeling can be overcome by considering multiple perspectives and combining multiple models.)

13. **Importance of domain knowledge**: Consider domain knowledge in understanding the problem and developing effective models. (Description: Domain knowledge is essential for understanding the problem and developing effective models.)

14. **Uncertainty and limitations**: Acknowledge the uncertainty and limitations of predicting pandemics and explore alternative approaches when necessary. (Description: Acknowledging the uncertainty and limitations of predicting pandemics is essential for developing effective predictive models.)

15. **Complexity of long-term predictions**: Consider multiple scenarios and uncertainties when making long-term predictions. (Description: Long-term predictions require considering multiple scenarios and uncertainties, which can be complex and challenging.)

16. **Combining ML and domain expertise**: Combine machine learning with domain expertise to lead to better results. (Description: Combining machine learning with domain expertise can lead to better results by leveraging the strengths of both approaches.)

17. **Importance of pre-processing and feature engineering**: Pre-process and engineer features to improve model performance. (Description: Pre-processing and feature engineering are essential for improving model performance by selecting the most relevant features.)

18. **Use of ensemble methods**: Combine multiple models to reduce the impact of individual model errors. (Description: Ensemble methods can reduce the impact of individual model errors by combining multiple models.)

19. **Post-processing and rule-based adjustments**: Carefully evaluate and refine predictions based on domain knowledge and expert judgment. (Description: Post-processing and rule-based adjustments are essential for refining predictions based on domain knowledge and expert judgment.)

20. **Use of domain knowledge and expert judgment**: Consider domain-specific knowledge and expertise when developing predictive models. (Description: Domain knowledge and expert judgment are essential for developing effective predictive models.)

21. **Limitations of ML models**: Evaluate the limitations of ML models and consider alternative approaches when necessary. (Description: Evaluating the limitations of ML models is essential for developing effective predictive models.)

22. **Importance of data quality and availability**: Evaluate the quality and availability of data when developing predictive models. (Description: Evaluating the quality and availability of data is essential for developing effective predictive models.)

23. **Use of visualization and exploration**: Use visualization and exploration to aid decision-making. (Description:

## Subsets of 10

This only still has a single subset to summarize, and the result does seem better.

In [130]:
summy = df_10[df_10["Competition"]==comp10]["Resummarized Summary Simple"].iloc[0]
Markdown(summy)

Here is the list of key points with descriptions:

1. **Time constraint**: Prioritize tasks, manage time effectively, and adapt to changing circumstances. (Description: Recognize the importance of time constraints and adapt to changing circumstances to achieve better results.)

2. **Data integration**: Combine multiple data sources to gain a more comprehensive understanding of the problem. (Description: Combine multiple data sources to gain a deeper understanding of the problem and make more accurate predictions.)

3. **Model selection**: Select the right tool for the task, considering factors such as data complexity, model interpretability, and computational resources. (Description: Choose the most suitable model for the task, taking into account the complexity of the data, the need for interpretability, and the availability of computational resources.)

4. **Benchmarking**: Document and share knowledge, allowing others to build upon and learn from previous work. (Description: Document and share knowledge to allow others to build upon and learn from previous work, promoting collaboration and improvement.)

5. **Collaboration**: Engage in community discussions, share ideas, and learn from others to improve outcomes. (Description: Engage in community discussions, share ideas, and learn from others to improve outcomes and achieve better results.)

6. **Adaptability**: Be willing to change approaches and try new things in response to changing data and competition settings. (Description: Be willing to adapt to changing data and competition settings by trying new approaches and refining models.)

7. **Feature engineering**: Extract relevant features that capture the underlying patterns in the data. (Description: Extract relevant features that capture the underlying patterns in the data to improve model performance.)

8. **Model blending**: Combine the predictions of multiple models to reduce the impact of individual model errors. (Description: Combine the predictions of multiple models to reduce the impact of individual model errors and improve overall performance.)

9. **Post-processing**: Adjust predictions to ensure non-negative values, quantile adjustments, and smoothing can improve model output. (Description: Adjust predictions to ensure non-negative values, quantile adjustments, and smoothing can improve model output.)

10. **Iterative refinement**: Refine models by trying different approaches and evaluating their performance using cross-validation. (Description: Refine models by trying different approaches and evaluating their performance using cross-validation to achieve better results.)

11. **Data quality and availability**: Carefully evaluate the quality and availability of data when developing predictive models. (Description: Carefully evaluate the quality and availability of data when developing predictive models to ensure accurate predictions.)

12. **Visualization and exploration**: Use visualization and exploration to aid decision-making and identify patterns in the data. (Description: Use visualization and exploration to aid decision-making and identify patterns in the data to improve model performance.)

13. **Continuous learning and improvement**: Continuously refine and improve predictive models based on new data and insights. (Description: Continuously refine and improve predictive models based on new data and insights to achieve better results.)

14. **Reusable machine learning solutions**: Develop reusable models that can be applied to future pandemics or similar problems. (Description: Develop reusable models that can be applied to future pandemics or similar problems to promote collaboration and improvement.)

15. **Non-recursive predictions**: Avoid recursive predictions and train separate models for each day ahead. (Description: Avoid recursive predictions and train separate models for each day ahead to improve model performance.)

16. **Customized loss function**: Use a customized loss function, such as the pinball loss, to train models. (Description: Use a customized loss function to train models and improve performance.)

17. **Ensemble methods**: Combine the predictions of multiple models using ensemble methods. (Description: Combine the predictions of multiple models using ensemble methods to improve overall performance.)

18. **Smoothing and rounding**: Apply smoothing to predictions to reduce the impact of noise and rounding to integers to ensure discrete and meaningful predictions. (Description: Apply smoothing to predictions to reduce the impact of noise and rounding to integers to ensure discrete and meaningful predictions.)

19. **Model architecture**: Use a suitable model architecture, such as a CNN, to capture non-linear relationships in the data. (Description: Use a suitable model architecture to capture non-linear relationships in the data and improve performance.)

20. **Hyperparameter tuning**: Experiment with different hyperparameters to find the optimal combination that works well for the model. (Description: Experiment with different hyperparameters to find the optimal combination that works well for the model and improve performance.)

21. **Quantile estimation**: Estimate quantiles using techniques such as the Poisson distribution to understand the distribution of outcomes. (Description: Estimate quantiles using techniques such as the Poisson distribution to understand the distribution of outcomes and improve model performance.)

22. **Distribution selection**: Choose a distribution that fits the data and understand its underlying assumptions. (Description: Choose a distribution that fits the data and understand its underlying assumptions to improve model performance.)

23. **Simple yet effective approach**: Recognize that

## No Subsets, All in One

About the same as above subsets of 10, just minor variation in wording of the key points: 

In [131]:
summy = df_all[df_all["Competition"]==comp10]["Overall Summary"].iloc[0]
Markdown(summy)

Based on the writeup summaries, here are the key points that can be applied to similar problem-solving scenarios:

**Time constraint**: Prioritize tasks, manage time effectively, and adapt to changing circumstances.

**Data integration**: Combine multiple data sources to gain a more comprehensive understanding of the problem.

**Model selection**: Select the right tool for the task, considering factors such as data complexity, model interpretability, and computational resources.

**Benchmarking**: Document and share knowledge, allowing others to build upon and learn from previous work.

**Collaboration**: Engage in community discussions, share ideas, and learn from others to improve outcomes.

**Adaptability**: Be willing to change approaches and try new things in response to changing data and competition settings.

**Feature engineering**: Extract relevant features that capture the underlying patterns in the data.

**Model blending**: Combine the predictions of multiple models to reduce the impact of individual model errors.

**Post-processing**: Adjust predictions to ensure non-negative values, quantile adjustments, and smoothing can improve model output.

**Iterative refinement**: Refine models by trying different approaches and evaluating their performance using cross-validation.

**Data quality and availability**: Carefully evaluate the quality and availability of data when developing predictive models.

**Visualization and exploration**: Use visualization and exploration to aid decision-making and identify patterns in the data.

**Continuous learning and improvement**: Continuously refine and improve predictive models based on new data and insights.

**Reusable machine learning solutions**: Develop reusable models that can be applied to future pandemics or similar problems.

**Non-recursive predictions**: Avoid recursive predictions and train separate models for each day ahead.

**Customized loss function**: Use a customized loss function, such as the pinball loss, to train models.

**Ensemble methods**: Combine the predictions of multiple models using ensemble methods.

**Smoothing and rounding**: Apply smoothing to predictions to reduce the impact of noise and rounding to integers to ensure discrete and meaningful predictions.

**Model architecture**: Use a suitable model architecture, such as a CNN, to capture non-linear relationships in the data.

**Hyperparameter tuning**: Experiment with different hyperparameters to find the optimal combination that works well for the model.

**Quantile estimation**: Estimate quantiles using techniques such as the Poisson distribution to understand the distribution of outcomes.

**Distribution selection**: Choose a distribution that fits the data and understand its underlying assumptions.

**Simple yet effective approach**: Recognize that sometimes, simplicity can be effective, and build upon existing solutions or baselines.

**Focus on understanding the problem**: Focus on understanding the problem and data to identify suitable approaches and distributions.

**Ensemble approach**: Train multiple simple models and blend their predictions to improve overall performance.

**Feature engineering**: Design input features to capture trends and normalize them by the maximum value for the region.

**Time-series forecasting**: Predict the number of cases or fatalities on a future date based on input features.

**Weighted training**: Weight training data to give more significance to later days.

**Post-processing**: Adjust predictions using linear transformation and deviation factor.

**Exploration of different models**: Train multiple models with different combinations of hyperparameters to explore their impact on performance.

**Use of validation set**: Use a validation set to evaluate performance and optimize final predictions.

**Potential improvements**: Identify potential improvements that were not implemented due to time constraints.

The data analysis methods used across writeups include:

1. Feature engineering
2. Model selection
3. Model blending
4. Post-processing
5. Quantile estimation
6. Distribution selection
7. Time-series forecasting
8. Weighted training
9. Hyperparameter tuning
10. Ensemble methods

These methods are related across writeups in that they are all used to develop and refine predictive models for forecasting COVID-19 cases and fatalities. The writeups demonstrate the importance of combining multiple methods to achieve better performance and the need to adapt to changing data and competition settings.<|eot_id|>

# 15 Writeups

This would be 3 subsets for subset size 5, 2 subsets for size 10, and a single set for the "all" version:

In [132]:
df_5[df_5["Number of Writeups"]==15].iloc[0]

Competition                                                   Gendered Pronoun Resolution
Number of Writeups                                                                     15
comp_Reward                                                                         25000
comp_link                               https://www.kaggle.com/competitions/gendered-p...
teams                                                                                 838
competitors                                                                          1615
Entries                                                                               617
Tag                                                                                   nlp
desc                                    \nCan you help end gender bias in pronoun reso...
code_link                               https://www.kaggle.com/code/surajshiwal/visual...
start_date                                                                              6
start_mont

## Subsets of 5

This one starts overly pushing it all together, and seems to have lost a lot of key points and structure already:

In [133]:
comp15 = df_5[df_5["Number of Writeups"]==15]["Competition"].iloc[0]
summy = df_5[df_5["Competition"]==comp15]["Resummarized Summary Simple"].iloc[0]
Markdown(summy)

Here is the list of key points with descriptions:

1. **Preprocessing**: Careful preprocessing is crucial in natural language processing tasks, including steps such as tokenization, stopword removal, and stemming or lemmatization.
2. **Feature extraction**: Extracting relevant features from text data is essential, which can be done using pre-trained models like BERT, or by creating hand-crafted features.
3. **Model selection**: Selecting the right model architecture and experimenting with different models is important, including using pre-trained models, fine-tuning, and combining multiple models.
4. **Hyperparameter tuning**: Hyperparameter tuning is a crucial step in model development, including searching for optimal hyperparameters, using regularization techniques, and experimenting with different hyperparameter settings.
5. **Cross-validation**: Using cross-validation to evaluate model performance and avoid overfitting is essential.
6. **Ensemble methods**: Combining the predictions of multiple models can improve overall performance.
7. **Data augmentation**: Using data augmentation techniques, such as test augmentation, can improve model robustness.
8. **Experimentation and iteration**: Iterating on the design of a solution through experimentation and modification is important.
9. **Combining multiple models and embeddings**: Combining multiple models and embeddings can capture relevant information.
10. **Incorporating domain-specific knowledge and features**: Incorporating domain-specific knowledge and features can improve model performance.
11. **Using KFold Cross-Validation**: Using KFold Cross-Validation can help evaluate model performance and avoid overfitting.
12. **Leveraging community knowledge and insights**: Leveraging community knowledge and insights can help improve model performance.
13. **Adapting pre-trained models to specific problem requirements**: Adapting pre-trained models to specific problem requirements can improve model performance.
14. **Handling noisy labels and reducing the risk of overfitting/underfitting**: Handling noisy labels and reducing the risk of overfitting/underfitting is important.
15. **Experimenting with different models and combining them**: Experimenting with different models and combining them can improve model performance.
16. **Hyperparameter tuning and careful model selection**: Hyperparameter tuning and careful model selection are important steps in model development.
17. **Using sufficient folds in cross-validation**: Using sufficient folds in cross-validation is important to evaluate model performance and avoid overfitting.
18. **Ensemble methods and combining multiple models**: Ensemble methods and combining multiple models can improve overall performance.
19. **Iterative experimentation and refinement**: Iterative experimentation and refinement are important steps in model development.
20. **Feature engineering and exploring different architectures**: Feature engineering and exploring different architectures can improve model performance.
21. **Considering limitations and future directions in datascience projects**: Considering limitations and future directions in datascience projects is important.
22. **Combining multiple approaches**: Combining multiple approaches, including using pre-trained models, data blending, and data augmentation, can improve model performance.
23. **Model stacking**: Building a second layer model, such as LGBM, can help improve performance by leveraging the strengths of each individual model.
24. **BERT fine-tuning**: Fine-tuning BERT can be a useful technique, but it may not always improve performance.
25. **Hyperparameter tuning**: Using hyperparameter tuning frameworks, such as Hyperopt, can help find the optimal combination of hyperparameters.
26. **Model stacking**: Stacking models can help improve performance by leveraging the strengths of each individual model.
27. **Feature engineering**: Using a combination of distance, URL, and statistic features can be effective for pronoun resolution.
28. **Encoding pronouns**: Encoding pronouns as numbers can be a simple yet effective way to incorporate pronoun information into the model.
29. **Lessons for similar problem-solving**: Experimenting with different model structures and hyperparameters, using a combination of different techniques, paying attention to data quality, and implementing robust training processes can help improve model performance.

These key points highlight the importance of experimentation, iteration, and adaptation in datascience, as well as the importance of incorporating domain-specific knowledge and features, and leveraging community knowledge and insights.<|eot_id|>

## Subsets of 10

Here we are still good:

In [134]:
summy = df_10[df_10["Competition"]==comp15]["Resummarized Summary Simple"].iloc[0]
Markdown(summy)

Here is the list of key points with descriptions:

1. **Ensemble methods**: Combining multiple models and embeddings can improve performance.
2. **Pre-trained models**: Fine-tuning pre-trained models like BERT can be effective, but careful tuning and adaptation to the specific problem are crucial.
3. **Hyperparameter tuning**: Careful tuning of model parameters can significantly impact performance.
4. **Experimentation and iteration**: Iterative experimentation and refinement are essential in the datascience process.
5. **Feature engineering**: Extracting relevant features and incorporating domain-specific knowledge can improve performance.
6. **Model selection and combination**: Exploring different models and combining them can lead to better performance.
7. **Data analysis methods**: Techniques like cross-validation, ensemble methods, and hyperparameter tuning are essential in datascience.
8. **Cross-validation**: Using sufficient folds in cross-validation to avoid overfitting.
9. **Ensemble methods**: Combining multiple models using techniques like concatenation, averaging, and LightGBM.
10. **Hyperparameter tuning**: Tuning model parameters using techniques like grid search, random search, and Bayesian optimization.
11. **Feature engineering**: Extracting relevant features using techniques like wordpiece tokenization, truncation, and attention-pooling.
12. **Model architecture**: Exploring different model architectures, including concatenating uncased and cased BERT models.
13. **Data augmentation**: Creating new instances by substituting entities or using test-time augmentation (TTA) to evaluate the model's performance.
14. **Pre-trained language models**: Using pre-trained language models like BERT can save time and computational resources compared to training a model from scratch.
15. **Feature engineering**: Carefully selecting features that are relevant to the task at hand, including extracting features from BERT embeddings.
16. **Ensemble approaches**: Combining multiple models or architectures to improve performance.
17. **Cross-validation**: Using stratified 10-fold cross-validation to evaluate the performance of a model and avoid overfitting.
18. **Hyperparameter search**: Using hyperparameter search, such as Hyperopt, to find the optimal combination of hyperparameters.
19. **Blending models**: Blending the predictions from multiple models across folds to improve overall performance.
20. **Test augmentation**: Using test augmentation to improve the robustness of a model's predictions.
21. **Pipeline structure**: Combining different feature extraction methods and models to create a pipeline structure.
22. **BERT embeddings**: Concatenating embeddings from multiple BERT layers and entities to extract relevant features.
23. **Hand-crafted features**: Using hand-crafted features, such as neural coref, Stanford NLP, and e2e-coref model predictions, to improve performance.
24. **Model stacking**: Combining multiple models to produce a final prediction.
25. **Augmentation**: Using augmentation to create new instances and improve performance.
26. **BERT finetuning**: Fine-tuning BERT, especially with larger pronoun-related datasets, to improve performance.
27. **Prototyping and iteration**: Prototyping and iterating on different components, including stacking, to improve performance.
28. **Focus on "killer features"**: Focusing on "killer features" that can significantly improve performance.
29. **Code availability**: Providing code availability to make it possible for others to reproduce and build upon the work.
30. **Domain knowledge and intuition**: Using domain knowledge and intuition to guide the datascience process.

These key points can be applied to similar problem-solving in datascience, such as developing a preprocessing pipeline for natural language text data, extracting relevant features from text data using pre-trained models like BERT, selecting the best model for a particular problem, and more.<|eot_id|>

## No Subsets, All in One

About the same as above subsets of 10, but perhas a bit of differentiation between what seems to be slightly more generic topics, and a list of key points: 

In [135]:
summy = df_all[df_all["Competition"]==comp15]["Overall Summary"].iloc[0]
Markdown(summy)

Based on the writeup summaries, here are the key points that can be applied to similar problem-solving:

**Common themes:**

1. **Preprocessing and feature engineering**: Many writeups emphasize the importance of careful preprocessing and feature engineering to improve model performance.
2. **Model selection and combination**: Several writeups highlight the value of experimenting with different models and combining them to achieve better results.
3. **Hyperparameter tuning**: Many writeups stress the importance of thorough hyperparameter tuning to find the optimal combination.
4. **Data augmentation and regularization**: Several writeups demonstrate the effectiveness of data augmentation and regularization techniques to prevent overfitting and improve model robustness.
5. **Ensemble methods**: Many writeups show the power of combining multiple models using ensemble methods to improve overall performance.

**Data analysis methods:**

1. **BERT embeddings**: Many writeups use BERT embeddings as a crucial component in their models.
2. **Layer selection**: Several writeups experiment with different BERT layers to find the best combination.
3. **Data augmentation**: Many writeups apply data augmentation techniques, such as swapping A and B columns, to increase the size and diversity of the training dataset.
4. **Feature extraction**: Several writeups extract features from BERT embeddings, linguistic features, and other sources to improve model performance.
5. **Model stacking**: Many writeups combine multiple models using model stacking to improve overall performance.

**Relationship across writeups:**

1. **Preprocessing and feature engineering**: Many writeups emphasize the importance of careful preprocessing and feature engineering to improve model performance.
2. **Model selection and combination**: Several writeups highlight the value of experimenting with different models and combining them to achieve better results.
3. **Hyperparameter tuning**: Many writeups stress the importance of thorough hyperparameter tuning to find the optimal combination.
4. **Data augmentation and regularization**: Several writeups demonstrate the effectiveness of data augmentation and regularization techniques to prevent overfitting and improve model robustness.
5. **Ensemble methods**: Many writeups show the power of combining multiple models using ensemble methods to improve overall performance.

By applying these key points and data analysis methods, datascience practitioners can develop effective solutions for similar problems and improve their chances of success in competitions like this one.<|eot_id|>

# 36 Writeups

Now lets go for one of the biggest sets of writeups we have. First a look at the unique writeup counts:

In [136]:
counts = df_5["Number of Writeups"].unique()
np.sort(counts)

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 35,
       36, 37, 41])

Highest counts here are 36, 37 and 41. Unfortunately 41 and 37 were too much fit the writeup summaries into Gemma 8k context, so I will go with the 36. This is both non-divisible by 5, and also as close to the top height as possible.

The too long lists:

In [137]:
comp41 = df_all[df_all["Number of Writeups"]==41]["Competition"].iloc[0]
summy = df_all[df_all["Competition"]==comp41]["Overall Summary"].iloc[0]
Markdown(summy)


skipped due to token limit (on Kaggle need shorter due to GPU memory limit)

In [138]:
comp37 = df_all[df_all["Number of Writeups"]==37]["Competition"].iloc[0]
summy = df_all[df_all["Competition"]==comp37]["Overall Summary"].iloc[0]
Markdown(summy)

skipped due to token limit (on Kaggle need shorter due to GPU memory limit)

The 36 writeup version did fit into the 8k context window, and it also has a number that is not divisible by 5, so a bit more interesting than exact subsets of 5. Lets see:

In [139]:
df_5[df_5["Number of Writeups"]==36].iloc[0]

Competition                                                     M5 Forecasting - Accuracy
Number of Writeups                                                                     36
comp_Reward                                                                         50000
comp_link                               https://www.kaggle.com/competitions/m5-forecas...
teams                                                                                5558
competitors                                                                          7022
Entries                                                                             88741
Tag                                                                  time series analysis
desc                                    Note: This is one of the two complementary com...
code_link                               https://www.kaggle.com/code/zwhite/eda-and-bas...
start_date                                                                              3
start_mont

## Subsets of 5

This splitting produces the generic pointless summary that Gemma often seems to produce, losing all of the real information:

In [140]:
comp36 = df_5[df_5["Number of Writeups"]==36]["Competition"].iloc[0]
summy = df_5[df_5["Competition"]==comp36]["Resummarized Summary Simple"].iloc[0]
Markdown(summy)

Here is the list of key points with descriptions:

**Common Themes**

1. **Hierarchical data**: Many writeups emphasize the importance of considering hierarchical relationships in the data.
2. **Feature engineering**: Feature engineering is a crucial step in many writeups, with authors experimenting with different features and combinations to find the most informative ones.
3. **Model selection and combination**: Authors often select the best-performing model or combine multiple models to improve forecasting accuracy.
4. **Cross-validation**: Cross-validation is a common technique used to evaluate model performance and prevent overfitting.
5. **Experimentation and iteration**: Authors emphasize the importance of experimentation and iteration in the datascience process, whether it's trying different models, features, or hyperparameters.
6. **Metric choice**: The choice of metric is crucial, and authors often express disappointment with the default metric used in the competition.
7. **Generalization**: Authors highlight the importance of evaluating model performance on unseen data to ensure generalization.

**Specific Data Analysis Methods**

1. **Lagged demand**: Calculating lagged demand is a common feature engineering technique used in many writeups.
2. **Permutation importance**: Authors use permutation importance to identify the most important features and remove those that don't contribute significantly to the model's performance.
3. **Tweedie objective function**: The Tweedie objective function is used in some writeups to model the data.
4. **StratifiedFold**: StratifiedFold is used in some writeups for hyperparameter tuning.
5. **Weighted average**: Authors use weighted averages to combine predictions from different levels or models.
6. **Calendar-based features**: Authors create calendar-based features using target encoding for ids (store, item,...) x [weekday, events, or month] or ids with last 3M, 1Y, 2Y (rolling, no leak).
7. **Kurtosis**: Kurtosis is used to detect the effect of outliers in the data.
8. **Event detection**: Event detection is used to create binary features that capture the effect of important events on sales.
9. **Visualization techniques**: Visualization techniques are used to understand the behavior of the model and the effect of different features on the predictions.

**Relationships Across Writeups**

1. **Hierarchical data structure**: Many writeups emphasize the importance of considering hierarchical relationships in data when modeling.
2. **Feature engineering**: Feature engineering is a crucial step in many writeups, involving the creation of relevant features that capture the underlying patterns in the data.
3. **Ensemble methods**: Ensemble methods, such as stacking binary prediction features or combining multiple models, are used to improve forecasting accuracy.
4. **Hyperparameter tuning**: Hyperparameter tuning is essential to optimize model performance, and techniques like grid search and walk-forward cross-validation are used.
5. **Data splitting and validation**: Data splitting and validation are critical steps to evaluate model performance and prevent overfitting.

**General Key Points**

1. **Simplification is key**: Balancing complexity with simplicity in model development can be effective.
2. **Feature engineering**: Extracting relevant information from the data through feature engineering is crucial.
3. **Domain knowledge**: Understanding the problem domain can inform feature selection and model development.
4. **Experimentation and iteration**: Trying different approaches and iterating on the model can lead to better results.
5. **Self-evaluation and calculation of metrics**: Evaluating model performance using relevant metrics is essential.
6. **Lessons from failure**: Failure can be a valuable learning experience, leading to more effective solutions.
7. **Choose the right algorithm**: Selecting an algorithm suitable for the problem can be crucial.
8. **Joint modeling approach**: Leverage hierarchical structure by using joint modeling approaches.
9. **Data preparation and training**: Use a framework that provides built-in notation and data structure to reduce preparation time.
10. **Hyperparameter tuning**: Optimize the model by experimenting with different hyperparameters.
11. **Runtime**: Consider the runtime of the approach and optimize it to reduce training and evaluation time.
12. **Understand the competition's nuances**: Understand the specific problem and its requirements.
13. **Don't overcomplicate**: Sometimes, simplicity can be more effective than complexity.
14. **Experimentation is key**: Try different approaches and experiment with different parameters.
15. **Visualize and analyze**: Analyze and visualize the data to improve the model.
16. **Avoid overfitting**: Use the right model and avoid overfitting at the lowest level.
17. **Ensemble methods can be effective**: Use averaging over several runs to improve forecasting accuracy.
18. **Feature engineering is important**: Extract relevant information from the data through feature engineering.
19. **Don't be afraid to try new approaches**: Be open to new approaches and not be afraid to try something different.
20. **Simple yet effective approach**: Sometimes, a straightforward approach can

## Subsets of 10

This one is not quite the shortest one-liner above, but a few points. Unfortunately considering it consists of 36 writeups and summarizing their key points into one final summary, it is very bad. Loses all details, and lists a few generic data science points:

In [141]:
comp36 = df_10[df_10["Number of Writeups"]==36]["Competition"].iloc[0]
summy = df_10[df_10["Competition"]==comp36]["Resummarized Summary Simple"].iloc[0]
Markdown(summy)

Here is the list of key points with descriptions:

**Common Themes:**

1. **Hierarchical Modeling**: Consider hierarchical relationships in data when modeling.
2. **Feature Engineering**: Create relevant features to capture underlying patterns in data.
3. **Ensemble Methods**: Combine multiple models or use a "mixture of experts" approach to improve forecasting accuracy.
4. **Cross-Validation**: Evaluate model performance and prevent overfitting using cross-validation.
5. **Experimentation and Iteration**: Try different approaches and refine them to achieve better results.

**Specific Data Analysis Methods:**

1. **Time-Series Decomposition**: Identify patterns in data using time-series decomposition techniques.
2. **Rolling Mean and Standard Deviation**: Create features that capture underlying patterns in data using rolling mean and standard deviation.
3. **Lagged Features**: Capture relationships between different levels of the hierarchy using lagged features.
4. **Target Encoding**: Create features that capture relationships between different levels of the hierarchy using target encoding.
5. **Kalman Filters**: Model relationships between different levels of the hierarchy using Kalman Filters.

**Relationships Across Writeups:**

1. **Hierarchical Modeling** and **Feature Engineering**: Often used together to create a robust model.
2. **Cross-Validation**: Used to evaluate model performance and prevent overfitting.
3. **Loss Function Manipulation** and **Ensemble Methods**: Used to improve model performance and adapt to new information.
4. **Iterative Refinement**: A common approach in datascience, where authors refine their approaches until they find a solution that works well.
5. **Community Input** and **Collaboration**: Essential in datascience, allowing authors to learn from others and share knowledge.

**Common Key Points:**

1. **Understand the Competition's Nuances**: Understand the specific problem and its requirements.
2. **Don't Overcomplicate**: Simple yet effective approaches can be more successful than complex ones.
3. **Experimentation is Key**: Trying different approaches and parameters is essential for finding the right solution.
4. **Visualize and Analyze**: Analyzing and visualizing the data can help improve the model.
5. **Avoid Overfitting**: Using the right model and avoiding overfitting can be crucial in achieving good results.
6. **Ensemble Methods can be Effective**: Combining multiple models can improve forecasting accuracy.
7. **Feature Engineering is Important**: Extracting relevant information from the data through feature engineering is crucial.
8. **Don't be Afraid to Try New Approaches**: Being open to new approaches can be beneficial in finding the right solution.

**Data Analysis Methods:**

1. **Feature Engineering**: Creating features that capture relevant patterns and trends in the data.
2. **Hierarchical Modeling**: Modeling data at multiple levels of aggregation or grouping.
3. **Ensemble Methods**: Combining multiple models to improve forecasting accuracy.
4. **Target Encoding**: Converting categorical variables into numerical features.
5. **Hyperparameter Tuning**: Optimizing model performance by adjusting hyperparameters.
6. **Cross-Validation**: Evaluating model performance using a holdout set.
7. **Sensitivity Analysis**: Testing the model's sensitivity to different inputs or parameters.
8. **Model Interpretability**: Understanding the model's predictions and identifying biases.

**Relationships Across Writeups:**

1. **Feature Engineering**: Many writeups emphasize the importance of feature engineering in capturing relevant patterns and trends in the data.
2. **Hierarchical Modeling**: Hierarchical modeling is used in several writeups to capture the hierarchical structure of the data.
3. **Ensemble Methods**: Ensemble methods are used in several writeups to improve forecasting accuracy by combining multiple models.
4. **Target Encoding**: Target encoding is used in several writeups to convert categorical variables into numerical features.
5. **Hyperparameter Tuning**: Hyperparameter tuning is used in several writeups to optimize model performance by adjusting hyperparameters.
6. **Cross-Validation**: Cross-validation is used in several writeups to evaluate model performance using a holdout set.
7. **Sensitivity Analysis**: Sensitivity analysis is used in several writeups to test the model's sensitivity to different inputs or parameters.
8. **Model Interpretability**: Model interpretability is emphasized in several writeups to understand the model's predictions and identify biases.<|eot_id|>

## No Subsets, All in One

This one has the key points collected, as there is no subsplitting before:

In [142]:
summy = df_all[df_all["Competition"]==comp36]["Overall Summary"].iloc[0]
Markdown(summy)

skipped due to token limit (on Kaggle need shorter due to GPU memory limit)

# Conclusions About Subsets

Well, Gemma seems to do quite fine up to a point. But when I start to re-summarize larger number of subsets, it seems to drop it and give a very bland and generic, overly short answer. Without much of real content. So I would say it would be best to trial some more and perhaps think of ways to show the initial summaries for the user as input for further insights. Or other ways to support the user in their task, in this case analyzing Kaggle writeups. Such as in my second notebook, where I tried to use the summaries as input to the user to drive their RAG-based question formulation.