【Kaggle Challenge】 Beginner-Friendly 30 Days of ML -- Level up in data science and machine learning

jumpingchu/30-Days-of-ML

Kaggle - 30 Days of ML

--

Week 1: Python Basics

Day 1: Level up to Contributor

Today, you’ll set up your Kaggle account, move up from Novice to Contributor, and even make your very first submission to a Kaggle competition! The assignment should only take 45 minutes to complete.

Also, you’ll be able to jump into our Discord community & connect with other learners. It will be a great resource to ask questions and get help from others.

--

Day 2: Hello Python

In Lesson 1 (Hello Python), you’ll get a feel for Python syntax, and learn how to work with variables and do arithmetic in Python.

  • Read this tutorial (from Lesson 1 of the Python course)

  • Complete this exercise (from Lesson 1 of the Python course)
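
As a rough illustration of the variables and arithmetic this lesson covers, here is a minimal sketch. The variable names and values are hypothetical and chosen only for the example.

```python
# Hypothetical values chosen purely for illustration.
spam_amount = 0
spam_amount = spam_amount + 4    # reassign a variable using its old value

total_minutes = 135
hours = total_minutes // 60      # floor division drops the remainder
minutes = total_minutes % 60     # modulo keeps only the remainder
average = (3 + 4 + 5) / 3        # true division always produces a float

print(spam_amount, hours, minutes, average)
```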

--

Day 3: Functions and Getting Help

In Lesson 2 (Functions and Getting Help), you’ll learn how to work with functions, which are reusable blocks of code designed to perform a task. You’ll also learn how to write your own!

  • Read this tutorial (from Lesson 2 of the Python course)

  • Complete this exercise (from Lesson 2 of the Python course)
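
The sketch below shows one way to define a function with a docstring and another with a default argument. The function names are illustrative assumptions, not part of the course material quoted here.

```python
def least_difference(a, b, c):
    """Return the smallest absolute difference between any two of a, b and c."""
    diff1 = abs(a - b)
    diff2 = abs(b - c)
    diff3 = abs(a - c)
    return min(diff1, diff2, diff3)


def greet(who="Colin"):
    """Build a greeting; `who` falls back to a default when omitted."""
    return "Hello, " + who


print(least_difference(1, 10, 100))
print(greet())
print(greet(who="Kaggle"))
```

Calling `help(least_difference)` prints the docstring, which is the usual way to look up what a function does.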

--

Day 4: Booleans and Conditionals

In Lesson 3 (Booleans and Conditionals), you’ll learn all about the Boolean data type, which allows you to represent “True” and “False” in Python code.

This will provide a strong foundation for understanding how to write conditional statements, which are used to modify how code runs based on whether certain conditions hold.

  • Read this tutorial (from Lesson 3 of the Python course)

  • Complete this exercise (from Lesson 3 of the Python course)
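
A small sketch of Booleans and conditional statements follows. The helper name and the variables are hypothetical.

```python
def sign_label(x):
    """Describe the sign of a number (hypothetical helper)."""
    if x > 0:
        return "positive"
    elif x < 0:
        return "negative"
    else:
        return "zero"


have_umbrella = False
rain_level = 3
# `and` binds tighter than `or`, so parentheses make the intent explicit.
prepared = have_umbrella or (rain_level < 5)

print(sign_label(-2), prepared)
```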

--

Day 5: Lists, Loops and List Comprehensions

In Lesson 4 (Lists), you’ll learn how to use Python lists to store ordered collections of values. Lists are incredibly useful when writing code to manage several related variables.

  • Read this tutorial (from Lesson 4 of the Python course)

  • Complete this exercise (from Lesson 4 of the Python course)
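
To make list indexing, slicing and mutation concrete, here is a short sketch with a made-up list.

```python
# A hypothetical list used only for illustration.
planets = ["Mercury", "Venus", "Earth", "Mars"]

first = planets[0]        # indexing starts at 0
last = planets[-1]        # negative indices count from the end
first_two = planets[:2]   # slicing returns a new list

planets.append("Jupiter") # lists are mutable and can grow in place
print(first, last, first_two, len(planets))
```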

In Lesson 5 (Loops and List Comprehensions), you’ll learn an efficient way to repeatedly execute code. With list comprehensions, you’ll often be able to condense code that would have taken several lines to just a single line!

  • Read this tutorial (from Lesson 5 of the Python course)

  • Complete this exercise (from Lesson 5 of the Python course)
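
The following sketch contrasts a plain `for` loop with the equivalent list comprehension. The variable names are illustrative.

```python
# Build a list of squares with an explicit loop.
squares_loop = []
for n in range(5):
    squares_loop.append(n ** 2)

# The same result as a one-line list comprehension.
squares = [n ** 2 for n in range(5)]

# A comprehension can also filter with a trailing `if`.
even_squares = [n ** 2 for n in range(5) if n % 2 == 0]

print(squares_loop, squares, even_squares)
```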

--

Day 6: Strings and Dictionaries

In Lesson 6 (Strings and Dictionaries), you’ll learn about strings, a data type used to represent human-readable data such as text.

A dictionary is another new data type. It is similar to a list, but with important differences that make it incredibly useful in its own right.

  • Read this tutorial (from Lesson 6 of the Python course)

  • Complete this exercise (from Lesson 6 of the Python course)
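
Here is a minimal sketch of a few string methods and dictionary operations. The sample text and keys are made up for the example.

```python
claim = "Pluto is a planet"      # hypothetical sample text

words = claim.split()            # split on whitespace into a list of words
shout = claim.upper()            # strings are immutable; this returns a new one
has_planet = "planet" in claim   # substring test

numbers = {"one": 1, "two": 2}   # a dictionary maps keys to values
numbers["three"] = 3             # add a new key/value pair

# Dictionary comprehension: map each word to its first letter.
initials = {word: word[0] for word in words}

print(words, shout, has_planet, numbers, initials)
```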

--

Day 7: Working with External Libraries

One of the best things about Python is the vast number of high-quality custom libraries that have been written for it. In Lesson 7 (Working with External Libraries), you’ll learn how to access this pre-written code and use it in your own work.

  • Read this tutorial (from Lesson 7 of the Python course)

  • Complete this exercise (from Lesson 7 of the Python course)
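
The import styles below are a sketch of how pre-written code is usually brought into a program. The standard-library `math` module stands in for any external library here.

```python
import math            # import the whole module
import math as mt      # import under a shorter alias
from math import pi    # import a single name directly

rounded_pi = round(math.pi, 4)
log2_of_32 = mt.log(32, 2)

# dir() lists what a module offers; help(math.log) shows its documentation.
available = [name for name in dir(math) if not name.startswith("_")]

print(rounded_pi, log2_of_32, pi == math.pi, len(available))
```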


Week 2: Machine Learning & Basic Data Exploration

Day 8: How Models Work

In Lesson 1 (How Models Work), you will start at the very beginning: what exactly is “machine learning”, and how is it used in the real world?

You’ll learn the answers to these questions and explore the basics of decision trees, as you start to build a strong foundation for some of the most cutting-edge techniques in data science.

  • Read this tutorial (from Lesson 1 of the Intro to ML course)

In Lesson 2 (Basic Data Exploration), you’ll learn all about pandas, the primary tool used by data scientists for exploring and manipulating data. Then, you’ll use your new knowledge to examine a dataset of home prices.

  • Read this tutorial (from Lesson 2 of the Intro to ML course)

  • Complete this exercise (from Lesson 2 of the Intro to ML course)
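
As a rough sketch of basic data exploration with pandas, the example below builds a tiny table in memory instead of reading the course dataset. The column names and numbers are hypothetical.

```python
import pandas as pd

# Hypothetical home-price rows, not the dataset used in the course.
home_data = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9500],
    "YearBuilt": [2003, 1976, 2001, 1915],
    "SalePrice": [208500, 181500, 223500, 140000],
})

summary = home_data.describe()             # count, mean, std, min, quartiles, max
avg_lot_size = home_data["LotArea"].mean()
newest_home_year = home_data["YearBuilt"].max()

print(summary)
print(avg_lot_size, newest_home_year)
```

With a real file, `pd.read_csv(path)` would replace the hand-built DataFrame.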

--

Day 9: Your First ML Model & Model Validation

In Lesson 3 (Your First Machine Learning Model), you’ll create a machine learning model using the scikit-learn library, one of the most popular and efficient tools for data analysis.

Along the way, you’ll learn some basic techniques for working with very large datasets. These skills are especially important for modern data scientists, who often work with “big data” containing millions of variables ― many more than a human can conceivably understand! Thankfully, machines excel at discovering useful patterns in datasets that are too large for humans to wrap their heads around. :)

  • Read this tutorial (from Lesson 3 of the Intro to ML course)

  • Complete this exercise (from Lesson 3 of the Intro to ML course)
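
A first scikit-learn model typically follows four steps: define, fit, predict, evaluate. The sketch below covers the first three with a decision tree on a hypothetical four-row table; the feature names and values are assumptions.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data standing in for the course's home-price dataset.
home_data = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9500],
    "YearBuilt": [2003, 1976, 2001, 1915],
    "SalePrice": [208500, 181500, 223500, 140000],
})

features = ["LotArea", "YearBuilt"]
X = home_data[features]      # feature matrix
y = home_data["SalePrice"]   # prediction target

model = DecisionTreeRegressor(random_state=1)  # fixed seed for reproducibility
model.fit(X, y)
predictions = model.predict(X)

print(predictions)
```

Because the tree is fully grown and every row is distinct, it reproduces the training targets exactly, which is why predictions on training data say little about real-world accuracy.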

Once you have built a model, how good is it? How exactly should you judge how close the model’s predictions are to what actually happened? In Lesson 4 (Model Validation), you’ll use model validation to measure the quality of your model.

  • Read this tutorial (from Lesson 4 of the Intro to ML course)

  • Complete this exercise (from Lesson 4 of the Intro to ML course)
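
One common way to validate a model is to hold out part of the data and compute the mean absolute error (MAE) on it. The sketch below uses synthetic data, so the numbers are illustrative only.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a noisy linear relationship (assumption for illustration).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 1, size=200)

# Hold out 25% of the rows (the default) for validation.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

model = DecisionTreeRegressor(random_state=0)
model.fit(train_X, train_y)

in_sample_mae = mean_absolute_error(train_y, model.predict(train_X))
val_mae = mean_absolute_error(val_y, model.predict(val_X))

print(in_sample_mae, val_mae)
```

The in-sample error is close to zero while the validation error is not, which is the gap that validation is meant to expose.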

--

Day 10: Underfitting/Overfitting & Random Forests

In Lesson 5 (Underfitting and Overfitting), you’ll learn about the fundamental concepts of underfitting and overfitting. Then you'll apply these ideas to gain a deep understanding of why some models succeed and others fail. This knowledge will make you much more efficient at discovering highly accurate machine learning models.

  • Read this tutorial (from Lesson 5 of the Intro to ML course)

  • Complete this exercise (from Lesson 5 of the Intro to ML course)
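
One way to see underfitting and overfitting is to vary a decision tree's size and compare validation error. The sketch below does this with `max_leaf_nodes` on synthetic data; the candidate sizes are arbitrary assumptions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption for illustration).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 2))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 1, size=300)
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)


def get_mae(max_leaf_nodes):
    """Fit a tree capped at `max_leaf_nodes` leaves and return validation MAE."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    return mean_absolute_error(val_y, model.predict(val_X))


# Very few leaves tends to underfit; very many tends to overfit.
scores = {size: get_mae(size) for size in [2, 5, 25, 50, 200]}
best_size = min(scores, key=scores.get)

print(scores, best_size)
```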

In Lesson 6 (Random Forests), you’ll learn all about random forests, another machine learning model you can add to your growing toolkit. Then, put your new knowledge to use immediately by building your own random forest model that exceeds the performance of the models that you’ve built so far!

  • Read this tutorial (from Lesson 6 of the Intro to ML course)

  • Complete this exercise (from Lesson 6 of the Intro to ML course)

  • Conclusion

    • Overfitting: capturing spurious patterns that won't recur in the future, leading to less accurate predictions. (The tree has too many branches.)

    • Underfitting: failing to capture relevant patterns, again leading to less accurate predictions. (The tree has too few branches.)

    • One of the best features of Random Forest models is that they generally work reasonably even without tuning. (Random forest models perform well even with default parameters.)
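
As a sketch of the last point, the example below fits a random forest with default settings on synthetic data and reports its validation MAE. The data and seed are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data (assumption for illustration).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 2))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 1, size=300)
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

# No tuning: only the random seed is set, for reproducibility.
forest = RandomForestRegressor(random_state=1)
forest.fit(train_X, train_y)
forest_mae = mean_absolute_error(val_y, forest.predict(val_X))

print(forest_mae)
```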

--

Day 11: Machine Learning Competitions

One way to further improve your skills is to participate in machine learning competitions. In Lesson 7 (Machine Learning Competitions), you’ll create and submit your predictions to a Kaggle competition.

  • Read this tutorial (from Lesson 7 of the Intro to ML course)

  • Complete this exercise (from Lesson 7 of the Intro to ML course)
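
A competition submission is usually a CSV file with one row per test example. The sketch below assumes an `Id` column and a `SalePrice` target; both names and all values are hypothetical, so check the competition's sample submission for the real format.

```python
import pandas as pd

# Hypothetical test-set ids and model predictions.
test_ids = [1461, 1462, 1463]
test_preds = [120000.0, 155000.5, 180250.0]

output = pd.DataFrame({"Id": test_ids, "SalePrice": test_preds})

# Without a path, to_csv returns the CSV text, which is handy for inspection.
csv_text = output.to_csv(index=False)
print(csv_text)

# To write the file for upload instead:
# output.to_csv("submission.csv", index=False)
```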

--

Day 12: Missing Values & Categorical Variables

In Lesson 1 (Introduction), you’ll learn more about what the course covers.

  • Read this tutorial (from Lesson 1 of the Intermediate ML course)

  • Complete this exercise (from Lesson 1 of the Intermediate ML course)

Most machine learning libraries (including scikit-learn) give an error if you try to build a model using data with missing values. In Lesson 2 (Missing Values), you’ll learn about three different approaches for dealing with missing values in your data.

  • Read this tutorial (from Lesson 2 of the Intermediate ML course)

  • Complete this exercise (from Lesson 2 of the Intermediate ML course)
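
The three approaches can be sketched as dropping columns, imputing values, and imputing while also recording which values were missing. The table below is hypothetical, and `add_indicator=True` is used here as one convenient way to get the third variant.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data with gaps in two of three columns.
X = pd.DataFrame({
    "Rooms": [2, 3, np.nan, 4],
    "Area": [50.0, 80.0, 65.0, np.nan],
    "Floors": [1.0, 2.0, 1.0, 2.0],
})

# Approach 1: drop every column that contains a missing value.
cols_with_missing = [col for col in X.columns if X[col].isnull().any()]
X_dropped = X.drop(columns=cols_with_missing)

# Approach 2: fill each gap with the column mean (SimpleImputer's default).
imputer = SimpleImputer()
X_imputed = pd.DataFrame(imputer.fit_transform(X), columns=X.columns)

# Approach 3: impute and append one indicator column per column that had gaps.
indicator_imputer = SimpleImputer(add_indicator=True)
X_plus = indicator_imputer.fit_transform(X)

print(X_dropped.columns.tolist(), X_imputed, X_plus.shape)
```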

A categorical variable is a variable that takes only a limited number of values, and it’s common to encounter them in data. Learn how to work with them in Lesson 3 (Categorical Variables).

  • Read this tutorial (from Lesson 3 of the Intermediate ML course)

  • Complete this exercise (from Lesson 3 of the Intermediate ML course)
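
Two common encodings for categorical variables are ordinal encoding and one-hot encoding. The sketch below applies both to a hypothetical `Color` column.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical categorical column.
X = pd.DataFrame({"Color": ["Red", "Green", "Blue", "Green"]})

# Ordinal encoding: each category becomes one number (sorted alphabetically).
ordinal = OrdinalEncoder().fit_transform(X)

# One-hot encoding: one 0/1 column per category.
# handle_unknown="ignore" avoids errors on categories unseen during fitting.
one_hot_encoder = OneHotEncoder(handle_unknown="ignore")
one_hot = one_hot_encoder.fit_transform(X).toarray()

print(ordinal.ravel(), one_hot_encoder.categories_, one_hot)
```

Ordinal encoding implies an ordering between categories, so it suits ranked values better than unordered ones like colors.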

--

Day 13: Pipelines & Cross-Validation

In Lesson 4 (Pipelines), you’ll learn a simple way to keep your data preprocessing and modeling code organized.

  • Read this tutorial (from Lesson 4 of the Intermediate ML course)

  • Complete this exercise (from Lesson 4 of the Intermediate ML course)

Step 1: Define Preprocessing Steps

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Preprocessing for numerical data: fill missing values with a constant (0 by default)
numerical_transformer = SimpleImputer(strategy='constant')

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
# (numerical_cols and categorical_cols are assumed to be lists of column names)
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

Step 2: Define the Model

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100, random_state=0)

Step 3: Create and Evaluate the Pipeline

from sklearn.metrics import mean_absolute_error

# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                             ])

# Preprocessing of training data, fit model 
my_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)

# Evaluate the model
score = mean_absolute_error(y_valid, preds)

You’re already a bit familiar with model validation from the Intro to Machine Learning course. In Lesson 5 (Cross-Validation), you’ll explore a more advanced validation technique that gives a better measure of model performance.

  • Read this tutorial (from Lesson 5 of the Intermediate ML course)

  • Complete this exercise (from Lesson 5 of the Intermediate ML course)

from sklearn.model_selection import cross_val_score

# Multiply by -1 since sklearn calculates *negative* MAE
scores = -1 * cross_val_score(my_pipeline, X, y,
                              cv=5,
                              scoring='neg_mean_absolute_error')
                

Step 4: Visualize scores (Optional)

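import matplotlib.pyplot as plt

# `results` is assumed to be a dict mapping each n_estimators value tried
# to its average MAE from cross-validation
plt.plot(list(results.keys()), list(results.values()))
plt.show()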

--

Day 14: XGBoost & Data Leakage

In Lesson 6 (XGBoost), you will learn how to build and optimize models with gradient boosting. This method dominates many Kaggle competitions and achieves state-of-the-art results on a variety of datasets.

  • Read this tutorial (from Lesson 6 of the Intermediate ML course)

  • Complete this exercise (from Lesson 6 of the Intermediate ML course)

  • XGBoost is a gradient boosting method, which is a type of ensemble method

XGBoost Parameter Tuning

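n_estimators
  • The number of times to go through the modeling cycle. Setting it too high causes overfitting; setting it too low causes underfitting.
  • Typical values range from 100 to 1000, though the right value depends heavily on the learning_rate parameter.
early_stopping_rounds
  • Stops iterating once the validation score stops improving, which makes it easier to find the best n_estimators.
  • A good approach is to set a high n_estimators together with early_stopping_rounds=5, which stops training after five consecutive rounds without improvement in the score.
  • eval_set must also be set to supply the validation data.
learning_rate
  • Multiplies the predictions from each model by a number before adding them in.
  • This shrinks the influence of each new tree added to the ensemble, which helps avoid overfitting when n_estimators is high.
n_jobs
  • Uses the machine's CPU cores to run computations in parallel, reducing the time fit() takes.
  • Only helps on large datasets; it makes no difference on small ones.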
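from xgboost import XGBRegressor

# early_stopping_rounds is passed to the constructor here, as newer XGBoost
# releases expect; older releases took it as a fit() argument instead.
my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05, n_jobs=4,
                        early_stopping_rounds=5)
my_model.fit(X_train, y_train,
             eval_set=[(X_valid, y_valid)],
             verbose=False)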

In Lesson 7 (Data Leakage), you will learn what data leakage is and how to prevent it. If you don't know how to prevent it, leakage will come up frequently, and it will ruin your models in subtle and dangerous ways. So, this is one of the most important concepts for practicing data scientists.

  • Read this tutorial (from Lesson 7 of the Intermediate ML course)

  • Complete this exercise (from Lesson 7 of the Intermediate ML course)
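
One frequent source of leakage is fitting a preprocessing step on all rows before splitting off validation data. The sketch below contrasts that with fitting on the training rows only. The data is synthetic and the variable names are assumptions.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Synthetic features with roughly 10% of entries missing (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[rng.random(X.shape) < 0.1] = np.nan
y = rng.normal(size=100)

X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

# Leaky: the column means are computed from validation rows as well.
leaky_imputer = SimpleImputer().fit(X)

# Safer: learn the column means from training rows only, then reuse them.
safe_imputer = SimpleImputer().fit(X_train)
X_train_imp = safe_imputer.transform(X_train)
X_valid_imp = safe_imputer.transform(X_valid)

print(leaky_imputer.statistics_, safe_imputer.statistics_)
```

Wrapping preprocessing and model in a single `Pipeline` gives the same protection automatically, since each step is then fitted only on the data passed to `fit`.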


Week 3: Beginner-friendly competition

In the link above, you’ll find a detailed introduction to Kaggle competitions (that covers how to work in a team and much more), along with a getting started tutorial that walks you through how to make your very first submission.


Week 4: Beginner-friendly competition

Important Notes

Competition

  • It’s not too late to get started, if you have not already. This guide has all of the orientation you need.

  • You can make really strong progress by doing just a little bit each day: aim to submit to the competition at least once each day. Remember you can chat with other participants in the Discussion tab, and you can view code examples from the Kaggle community in the Code tab.

Google Developer Expert Workshops

  • The workshops are optional for the 30 Days of ML program. There are 3 available: Intro to Supervised Classification, How to Build a Data Science Portfolio and Scikit-optimize for LightGBM.

  • If you have any questions, there is a Question and Answer channel for each workshop in the 30 Days of ML Discord server. Note: the speakers may not be able to get to every question. If you are able to answer a question, feel free to jump in to help others.
