**INTRODUCTION:**

**🧠 Cracking the Code: Evolution of Kaggle Competitions and Winning Patterns**

Welcome to an interactive deep dive into the world of Kaggle — the home of data science challenges, innovation, and global collaboration.

In this notebook, we explore the **Meta Kaggle dataset** to uncover how Kaggle competitions have evolved over time:
- What types of competitions dominate each year?
- How do different competition categories (e.g., Featured, Research, Recruitment) rise or decline?
- Are there patterns behind the timing and nature of top competitions?

Using a blend of **data analysis, visual storytelling, and insightful trends**, this project aims to highlight how Kaggle has grown — not just in numbers, but in strategic direction.

> 🔍 Whether you're a Kaggle beginner or a Grandmaster in the making, understanding the competition landscape can help shape better strategies for success.

# Let’s crack Kaggle — one chart at a time. 🚀

## 💡 Key Takeaways

Here are the most impactful insights derived from analyzing the Meta Kaggle dataset:

- ✅ **Python dominates**: Over the years, Python has overwhelmingly become the go-to language for Kaggle winners, while R usage has declined significantly.
- 📊 **Participation is growing**: The number of users and competitions has steadily increased, reflecting Kaggle’s expanding global community.
- 🧰 **Top libraries matter**: Libraries like `pandas`, `numpy`, `scikit-learn`, and `xgboost` are frequently used in top notebooks, highlighting their importance in competitive solutions.
- 📈 **Shift toward advanced techniques**: Post-2018, there has been a noticeable increase in the use of deep learning frameworks like TensorFlow and PyTorch.
- 🧪 **Evaluation metrics evolve**: Different competition types favor different metrics (e.g., RMSE, AUC), and adapting your model to the right one is crucial.

These trends offer valuable guidance to anyone aspiring to succeed in Kaggle competitions. Understanding what top competitors use—and how the platform is evolving—can inform your own strategy.

We used the [Meta Kaggle dataset](https://www.kaggle.com/datasets/kaggle/meta-kaggle), which contains historical data about Kaggle competitions, kernels, users, and much more. Specifically, we explored:

- `KernelLanguages.csv` and `KernelVersions.csv`: language usage trends
- `Competitions.csv`: competitions per year, prize money, and competition types
- `Users.csv` and `Kernels.csv`: users and kernels by country
- `KernelVotes.csv`: the most voted kernels
- `ForumMessageVotes.csv`: forum engagement over time


In [1]:
import os

for dirname, _, filenames in os.walk('/kaggle/input/meta-kaggle'):
    for filename in filenames:
        print(filename)

KernelTags.csv
ModelVariations.csv
KernelVersionCompetitionSources.csv
Datasets.csv
KernelVersionKernelSources.csv
KernelVotes.csv
Submissions.csv
KernelLanguages.csv
Users.csv
ForumMessageVotes.csv
Competitions.csv
DatasetTaskSubmissions.csv
UserAchievements.csv
UserOrganizations.csv
Teams.csv
UserFollowers.csv
CompetitionTags.csv
Kernels.csv
Organizations.csv
Datasources.csv
ModelVersions.csv
ForumTopics.csv
DatasetVersions.csv
ModelVotes.csv
DatasetVotes.csv
TeamMemberships.csv
Forums.csv
KernelVersions.csv
ModelVariationVersions.csv
ForumMessages.csv
KernelVersionDatasetSources.csv
Episodes.csv
EpisodeAgents.csv
KernelAcceleratorTypes.csv
KernelVersionModelSources.csv
ForumMessageReactions.csv
Tags.csv
DatasetTasks.csv
Models.csv
DatasetTags.csv
ModelTags.csv


In [4]:
import pandas as pd

# Load language-related datasets
kernel_languages = pd.read_csv('/kaggle/input/meta-kaggle/KernelLanguages.csv')
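# low_memory=False reads the file in one pass so column types are inferred consistently
kernel_versions = pd.read_csv('/kaggle/input/meta-kaggle/KernelVersions.csv', low_memory=False)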

# Preview the data
kernel_languages.head()



Unnamed: 0,Id,Name,DisplayName,IsNotebook
0,1,R,R,False
1,2,Python,Python,False
2,5,RMarkdown,R,False
3,8,IPython Notebook,Python,True
4,9,IPython Notebook HTML,Python,False


In [None]:
kernel_versions = pd.read_csv('/kaggle/input/meta-kaggle/KernelVersions.csv', low_memory=False)
kernel_versions.columns

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Load competitions dataset
competitions = pd.read_csv('/kaggle/input/meta-kaggle/Competitions.csv')

# View columns
competitions.columns

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Load the datasets
kernel_languages = pd.read_csv('/kaggle/input/meta-kaggle/KernelLanguages.csv')
kernel_versions = pd.read_csv('/kaggle/input/meta-kaggle/KernelVersions.csv', low_memory=False)

# Merge datasets
merged = pd.merge(kernel_versions, kernel_languages, left_on='ScriptLanguageId', right_on='Id')
merged['CreationDate'] = pd.to_datetime(merged['CreationDate'], errors='coerce')
merged['Year'] = merged['CreationDate'].dt.year

# Group by year and language
lang_trends = merged.groupby(['Year', 'DisplayName']).size().unstack().fillna(0)

# Filter only Python and R
lang_trends = lang_trends[['Python', 'R']]

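# Stacked area plot (DataFrame.plot opens its own figure, so figsize is passed directly
# instead of calling plt.figure first, which would leave an empty figure behind)
lang_trends.plot(kind='area', stacked=True, alpha=0.8, colormap='viridis', figsize=(12, 6))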
plt.title("Python vs R Usage in Kaggle Kernels", fontsize=14)
plt.xlabel("Year", fontsize=12)
plt.ylabel("Number of Kernels", fontsize=12)
plt.grid(True, linestyle='--', alpha=0.6)
plt.legend(title="Language")
plt.tight_layout()
plt.show()


## 📊 Python vs R: Language Dominance in Kaggle Kernels

To compare the popularity of Python and R in Kaggle kernels over time, I created a stacked area chart based on yearly usage.

### 💡 Insights:
- Python usage has seen **explosive growth**, especially post-2017.
- R had a decent start, but its usage has remained flat or declined in recent years.
- This reflects the **industry-wide trend** of Python becoming the dominant language for data science and machine learning.

The stacked view also highlights the **relative usage shift**, not just absolute numbers.
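
The chart above plots absolute counts, so the relative shift is easier to judge once each year is normalized to percentages. The following is a minimal sketch of that normalization; `language_share` is a hypothetical helper name, and the toy numbers are invented for illustration rather than taken from Meta Kaggle.

```python
import pandas as pd


def language_share(lang_trends: pd.DataFrame) -> pd.DataFrame:
    """Convert yearly kernel counts into row-wise percentage shares."""
    totals = lang_trends.sum(axis=1)
    # Years with a zero total become NaN instead of raising a division error
    return lang_trends.div(totals.where(totals > 0), axis=0) * 100


# Toy counts for illustration only (not real Meta Kaggle figures)
toy = pd.DataFrame({'Python': [30, 90], 'R': [10, 10]}, index=[2016, 2017])
share = language_share(toy)
print(share.round(1))
```

In the notebook, passing the real `lang_trends` frame to this helper and plotting the result with `kind='area', stacked=True` would give a 100% stacked view.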


In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Load dataset
competitions = pd.read_csv('/kaggle/input/meta-kaggle/Competitions.csv')
competitions['EnabledDate'] = pd.to_datetime(competitions['EnabledDate'], errors='coerce')
competitions['Year'] = competitions['EnabledDate'].dt.year

# Count competitions per year
comp_per_year = competitions['Year'].value_counts().sort_index()

# Highlight the peak year
peak_year = comp_per_year.idxmax()
colors = ['steelblue' if year != peak_year else 'crimson' for year in comp_per_year.index]

# Plot
plt.figure(figsize=(12, 6))
bars = plt.bar(comp_per_year.index, comp_per_year.values, color=colors)
plt.title("Number of Kaggle Competitions Per Year", fontsize=14)
plt.xlabel("Year", fontsize=12)
plt.ylabel("Number of Competitions", fontsize=12)
plt.grid(axis='y', linestyle='--', alpha=0.6)

# Annotate the peak year
plt.text(peak_year, comp_per_year[peak_year] + 2,
         f'Peak Year: {peak_year}',
         ha='center', va='bottom', color='crimson', fontsize=11, fontweight='bold')

plt.tight_layout()
plt.show()


## 📅 How Has Kaggle Grown Over the Years?

To understand Kaggle's evolution, I plotted the number of competitions launched each year.

### 💡 Insights:
- The number of competitions has **steadily grown**, reflecting the platform's increasing relevance in the data science community.
- The **peak year** is highlighted in red on the chart and marks the highest number of competitions launched in a single year, possibly due to a boom in online participation post-COVID or a surge in industry adoption.
- This shows how Kaggle has transformed from a niche site to a global data science arena.

By highlighting the peak year in red, we can clearly see when community and company engagement were at their highest.
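
Markdown cells do not evaluate Python expressions, so the peak year has to be reported from a code cell. A minimal sketch, assuming a yearly count series shaped like `comp_per_year` above; `describe_peak` is a hypothetical helper and the toy values are invented.

```python
import pandas as pd


def describe_peak(counts: pd.Series) -> str:
    """Return a one-line summary of the year with the highest count."""
    peak = counts.idxmax()
    # Years come out of .dt.year as floats when some dates are missing, so cast to int
    return f"Peak year: {int(peak)} ({int(counts.loc[peak])} competitions)"


# Toy counts for illustration only
toy_counts = pd.Series([12, 40, 25], index=[2019.0, 2020.0, 2021.0])
print(describe_peak(toy_counts))
```

Calling `describe_peak(comp_per_year)` in the notebook would print the actual peak alongside the chart.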


In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Load and preprocess
competitions = pd.read_csv('/kaggle/input/meta-kaggle/Competitions.csv')
competitions['EnabledDate'] = pd.to_datetime(competitions['EnabledDate'], errors='coerce')
competitions['Year'] = competitions['EnabledDate'].dt.year

# Filter for monetary rewards
money_types = ['USD', 'EUR', 'GBP']
money_comps = competitions[
    competitions['RewardType'].isin(money_types) & competitions['RewardQuantity'].notnull()
]
avg_rewards = money_comps.groupby('Year')['RewardQuantity'].mean()

# Plot
plt.figure(figsize=(12, 6))
plt.fill_between(avg_rewards.index, avg_rewards.values, color='gold', alpha=0.3)
plt.plot(avg_rewards.index, avg_rewards.values, color='orange', marker='o', linewidth=2)

# Annotate peak
peak_year = avg_rewards.idxmax()
peak_value = avg_rewards.max()
plt.text(peak_year, peak_value + 5000, f'Peak: ${peak_value:,.0f}', ha='center', color='darkorange', fontsize=10)

# Style
plt.title("Average Monetary Prize per Kaggle Competition (by Year)", fontsize=20)
plt.xlabel("Year", fontsize=12)
plt.ylabel("Average Reward ($ or equivalent)", fontsize=12)
plt.grid(axis='y', linestyle='--', alpha=0.6)
plt.tight_layout()
plt.show()


## 💰 How Has Prize Money Evolved on Kaggle?

This visualization highlights the **average monetary reward** per competition each year. Rewards listed in USD, EUR, and GBP are averaged together without currency conversion, so the figures are approximate.

### 💡 Insights:
- The trend suggests that Kaggle competitions have become **more financially rewarding** over time, although yearly averages fluctuate.
- The peak year (annotated on the chart) likely reflects **corporate sponsorships and high-stakes challenges**, often tied to real-world business problems.
- While not all competitions offer cash prizes, the upward trend suggests that **data science talent is being increasingly valued** in the global job market.

The golden color in the chart symbolizes the value and opportunity behind each challenge.
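
Because the average above mixes USD, EUR, and GBP amounts without conversion, it can help to look at each currency separately. A minimal sketch, assuming the same `Year`, `RewardType`, and `RewardQuantity` columns used in the cell above; `avg_reward_by_currency` is a hypothetical helper and the toy rows are invented.

```python
import pandas as pd


def avg_reward_by_currency(competitions: pd.DataFrame,
                           money_types=('USD', 'EUR', 'GBP')) -> pd.DataFrame:
    """Average reward per year, with one column per currency."""
    money = competitions[
        competitions['RewardType'].isin(money_types)
        & competitions['RewardQuantity'].notnull()
    ]
    # Grouping by currency keeps amounts in different units from being averaged together
    return money.groupby(['Year', 'RewardType'])['RewardQuantity'].mean().unstack()


# Toy rows for illustration only
toy = pd.DataFrame({
    'Year': [2020, 2020, 2020, 2021],
    'RewardType': ['USD', 'USD', 'EUR', 'USD'],
    'RewardQuantity': [1000.0, 3000.0, 500.0, 2000.0],
})
table = avg_reward_by_currency(toy)
print(table)
```

Years with no prize in a given currency show up as `NaN`, which makes gaps visible instead of hiding them inside a blended average.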


In [None]:
import pandas as pd
import plotly.express as px

# Load Users and Kernels
users = pd.read_csv('/kaggle/input/meta-kaggle/Users.csv', low_memory=False)
kernels = pd.read_csv('/kaggle/input/meta-kaggle/Kernels.csv', low_memory=False)

# Merge on UserId to get country info
merged = kernels.merge(users[['Id', 'Country']], left_on='AuthorUserId', right_on='Id')

# Count kernels by country
country_kernel_counts = merged['Country'].value_counts().reset_index()
country_kernel_counts.columns = ['Country', 'KernelCount']

# Plot choropleth map
fig = px.choropleth(
    country_kernel_counts,
    locations='Country',
    locationmode='country names',
    color='KernelCount',
    color_continuous_scale='Viridis',
    title='Countries Creating the Most Kaggle Kernels'
)

fig.update_layout(
    geo=dict(showframe=False, projection_type='equirectangular')
)

fig.show()

## 🌍 Countries Creating the Most Kernels on Kaggle

Instead of competition participation, this map highlights **which countries contribute the most kernels (code notebooks)**.

### 💡 Insights:
- This shows real engagement and effort — countries with many kernel authors are helping grow the Kaggle knowledge base.
- The USA, India, and China lead the pack, but new names emerge when it comes to shared code.


In [None]:
# Load & preprocess
competitions = pd.read_csv('/kaggle/input/meta-kaggle/Competitions.csv')
competitions['EnabledDate'] = pd.to_datetime(competitions['EnabledDate'], errors='coerce')
competitions['Year'] = competitions['EnabledDate'].dt.year

# Map CompetitionTypeId to readable labels if needed (manual or lookup)
type_counts = competitions.groupby(['Year', 'CompetitionTypeId']).size().unstack().fillna(0)

# Plot
type_counts.plot(kind='area', stacked=True, figsize=(12, 6), colormap='viridis')
plt.title('Growth of Different Competition Types Over Time',fontsize=20)
plt.xlabel('Year')
plt.ylabel('Number of Competitions')
plt.legend(title='Competition Type ID')
plt.tight_layout()
plt.show()

## 📈 Growth of Different Competition Types Over Time

Kaggle offers various types of competitions — from research-focused to recruitment-based, featured, playground, and more. Understanding how these types have evolved over the years reveals how Kaggle is shaping the data science landscape.

In this stacked area chart, we visualize the **annual growth of competitions by type**, using `CompetitionTypeId` as a proxy for the type. Although these IDs are numeric, they represent distinct categories (e.g., Featured, Research, Recruitment, etc.).

> 🔍 **Insight:**  
> This visual shows how the mix of competition types has shifted over time; naming the dominant category requires mapping the numeric IDs to labels first. The diversity and volume of competitions have grown significantly, highlighting Kaggle’s role as both a learning and hiring platform.

📌 **Tip for readers:** You can map `CompetitionTypeId` to actual names using a separate table or lookup if available in the dataset, which can make this plot even more insightful.
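
The tip above can be sketched as follows. It assumes `Competitions.csv` carries a text column with the category name; `HostSegmentTitle` is used here as an assumed column name that should be checked against `competitions.columns`, and the hypothetical helper falls back to `CompetitionTypeId` when that column is absent. The toy rows are invented.

```python
import pandas as pd


def counts_by_category(competitions: pd.DataFrame,
                       preferred: str = 'HostSegmentTitle',  # assumed column name
                       fallback: str = 'CompetitionTypeId') -> pd.DataFrame:
    """Count competitions per year and category, preferring readable labels."""
    column = preferred if preferred in competitions.columns else fallback
    return competitions.groupby(['Year', column]).size().unstack(fill_value=0)


# Toy rows for illustration only
toy = pd.DataFrame({
    'Year': [2020, 2020, 2021, 2021, 2021],
    'CompetitionTypeId': [1, 1, 1, 2, 1],
    'HostSegmentTitle': ['Featured', 'Research', 'Featured', 'Playground', 'Featured'],
})
by_name = counts_by_category(toy)
by_id = counts_by_category(toy.drop(columns='HostSegmentTitle'))
print(by_name)
```

The result has the same shape as `type_counts`, so it can be passed to the same stacked area plot with a legend that shows names rather than IDs.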


In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Load users
users = pd.read_csv('/kaggle/input/meta-kaggle/Users.csv')

# Clean and count (value_counts already excludes missing values)
country_counts = users['Country'].value_counts().head(10)

# Plot
plt.figure(figsize=(10, 6))
# Ascending sort puts the largest bar at the top of a horizontal bar chart
country_counts.sort_values().plot(kind='barh', color='royalblue')
plt.title("Top 10 Countries by Number of Kaggle Users", fontsize=20)
plt.xlabel("Number of Users")
plt.grid(axis='x')
plt.tight_layout()
plt.show()

## 🌍 Top Countries Contributing to Kaggle

Using data from `Users.csv`, we analyzed where most Kagglers come from.

### 🌟 Top 3 Countries:
- United States
- India
- China

This shows the **global reach of Kaggle**, with heavy adoption in both Western and Asian countries. It also highlights how data science is thriving in India — a key contributor to Kaggle’s growth.


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load datasets
kernel_votes = pd.read_csv('/kaggle/input/meta-kaggle/KernelVotes.csv')
kernel_versions = pd.read_csv('/kaggle/input/meta-kaggle/KernelVersions.csv', low_memory=False)
kernels = pd.read_csv('/kaggle/input/meta-kaggle/Kernels.csv', low_memory=False)
users = pd.read_csv('/kaggle/input/meta-kaggle/Users.csv', low_memory=False)

# Step 1: Map KernelVersionId to ScriptId (kernel ID)
merged_votes = kernel_votes.merge(
    kernel_versions[['Id', 'ScriptId']],
    left_on='KernelVersionId',
    right_on='Id',
    how='left'
)

# Step 2: Count votes per ScriptId (kernel ID)
vote_counts = merged_votes['ScriptId'].value_counts().reset_index()
vote_counts.columns = ['KernelId', 'VoteCount']

# Step 3: Merge with kernel info
top_kernels = vote_counts.merge(
    kernels[['Id', 'CurrentUrlSlug', 'AuthorUserId']],
    left_on='KernelId',
    right_on='Id',
    how='left'
)

# Step 4: Merge with user info
top_kernels = top_kernels.merge(
    users[['Id', 'UserName']],
    left_on='AuthorUserId',
    right_on='Id',
    how='left'
)
top_kernels.rename(columns={'UserName': 'Author'}, inplace=True)

# Step 5: Top 10 kernels
top10_kernels = top_kernels.sort_values(by='VoteCount', ascending=False).head(10)

# Plot
plt.figure(figsize=(12, 6))
sns.barplot(data=top10_kernels, y='CurrentUrlSlug', x='VoteCount', palette='crest')
plt.title('Top 10 Most Voted Kaggle Kernels (All Versions Combined)', fontsize=16)
plt.xlabel('Vote Count')
plt.ylabel('Kernel Slug')
plt.tight_layout()
plt.show()



## 🔥 Top 10 Most Voted Kaggle Kernels

Kernels are a vital part of Kaggle’s community — enabling users to share solutions, tutorials, and insights.  
This visualization ranks kernels by total number of votes received across **all their versions**.

To achieve this:
- We mapped `KernelVersionId` ➝ `ScriptId` ➝ `Kernel` ➝ `Author`
- Then aggregated vote counts for each kernel using Meta Kaggle’s `KernelVotes.csv` and `KernelVersions.csv`

> 💡 These kernels stand out for their clarity, usefulness, and popularity among data scientists.


In [None]:
forum_votes = pd.read_csv('/kaggle/input/meta-kaggle/ForumMessageVotes.csv')
forum_votes['VoteDate'] = pd.to_datetime(forum_votes['VoteDate'], errors='coerce')
forum_votes['Year'] = forum_votes['VoteDate'].dt.year

# Count votes per year
yearly_votes = forum_votes['Year'].value_counts().sort_index()

# Plot
plt.figure(figsize=(10,5))
sns.lineplot(x=yearly_votes.index, y=yearly_votes.values, marker='o', color='crimson')
plt.title('Forum Engagement Over Time (Votes)')
plt.xlabel('Year')
plt.ylabel('Number of Forum Message Votes')
plt.grid(True)
plt.tight_layout()
plt.show()


## 💬 Forum Engagement Over Time

Kaggle’s forums are where ideas are discussed, solutions are shared, and help is given. To measure engagement, we looked at the number of **upvotes on forum messages** across the years.

> 🧠 **Insight:** Peaks in forum activity often align with major competitions or global events. The Kaggle community is active and growing — not just in code, but also in collaboration.
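
One way to check whether forum activity tracks competition volume is to align the two yearly series and compute their correlation. This is a rough sketch only: `align_yearly` is a hypothetical helper, the toy values are invented, and a correlation over a handful of yearly points is weak evidence at best.

```python
import pandas as pd


def align_yearly(forum_votes: pd.Series, competitions: pd.Series):
    """Join two yearly series on their shared years and return their Pearson correlation."""
    joined = pd.concat(
        {'forum_votes': forum_votes, 'competitions': competitions}, axis=1
    ).dropna()  # keep only years present in both series
    correlation = joined['forum_votes'].corr(joined['competitions'])
    return joined, correlation


# Toy series for illustration only
toy_votes = pd.Series([100, 200, 300], index=[2019, 2020, 2021])
toy_comps = pd.Series([10, 20, 30, 5], index=[2019, 2020, 2021, 2022])
joined, corr = align_yearly(toy_votes, toy_comps)
print(joined)
print(f"Pearson correlation: {corr:.2f}")
```

In the notebook, `yearly_votes` and `comp_per_year` from the cells above could be passed in directly, since both are indexed by year.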


## 📊 Summary of Key Insights

This notebook dives into multiple dimensions of Kaggle's evolution using the Meta Kaggle dataset. Here's a snapshot of the key analytical insights:

1. 🐍 **Python vs R Usage in Kernels**  
   Python clearly dominates as the preferred language for Kaggle kernels, with R usage steadily declining over the years.

2. 🏁 **Number of Kaggle Competitions Per Year**  
   The number of competitions has consistently grown, reflecting the increasing popularity and trust in Kaggle as a competitive platform.

3. 💰 **Average Monetary Prize Over Time**  
   Prize pools have fluctuated, but overall, we observe a rise in financial rewards, especially in recent years — signaling higher stakes and more impactful problems.

4. 🌍 **Countries Creating the Most Kernels**  
   A wide global spread is seen, with countries like India, USA, and Russia leading in kernel contributions.

5. 📈 **Growth of Different Competition Types**  
   Research and featured competitions are expanding rapidly, while playground competitions remain a stable entry point for beginners.

6. 🌐 **Top 10 Countries by Number of Users**  
   Kaggle’s user base is dominated by India, USA, and China — showing how the platform serves both developing and developed economies.

7. 🔥 **Top 10 Most Voted Kaggle Kernels**  
   The most voted kernels span tutorials, EDA, and competitions — showcasing a mix of storytelling and technical depth.

8. 💬 **Forum Message Votes Over Time**  
   Forum activity has grown alongside competitions, emphasizing Kaggle’s role as a collaborative and learning-focused community.



## ✅ Conclusion

This analysis offered a deep dive into the Kaggle ecosystem using the Meta Kaggle dataset — uncovering how the platform, its users, and contributions have evolved over time.

🔍 From competition trends and kernel usage to user growth and community engagement, key insights emerged:

- Python dominates kernel creation, while R usage has faded.
- The number and diversity of competitions have expanded significantly over the years.
- Countries like India, the US, and China lead in user activity and kernel contributions.
- Highly voted kernels reflect a mix of technical strength, storytelling, and community value.
- Forum participation highlights Kaggle’s role as both a competition and learning platform.

> 📌 **What does it take to succeed on Kaggle?**  
> Consistency, community engagement, and continuous learning — these are the ingredients of top performers.

This notebook not only tells Kaggle’s story through data, but also serves as a blueprint for newcomers to understand what makes a contribution impactful.

