# YouTube Views with XGBoost
> Predicting views using XGBoost and data from YouTube and Google APIs 
- toc: true
- badges: true
- comments: true
- categories: [XGBoost,YouTube,API,Google Vision]
- image: images/padelpy.png
---
![kaggle_img000](https://github.com/jmmerrell/data_science/blob/master/img/kaggle_jonathan-chng-HgoKvtKpyHA-unsplash.jpg?raw=true)

If you are starting your journey in data science and machine learning, you may have heard of [Kaggle](https://www.kaggle.com/), the world's largest data science community. With the myriad of courses, books, and tutorials addressing the subject online, it's perfectly normal to feel overwhelmed with no clue where to start. 

Although there's no unanimous agreement on the best way to start learning a skill, getting started on Kaggle at the beginning of your data science path is solid advice.

It is an amazing place to learn and share your experience, and data scientists of all levels can benefit from collaborating and interacting with other users. More experienced users can keep up to date with new trends and technologies, while beginners will find a great environment to get started in the field.

Kaggle has several [crash courses](https://www.kaggle.com/learn/overview) to help beginners train their skills. There are courses on Python, pandas, machine learning, and deep learning, to name only a few. As you gain more confidence, you can enter competitions to test your skills. In fact, after a few courses, you will be encouraged to join your first competition.

In this article, I'll show you, in a straightforward approach, some tips on how to structure your first project. I'll be working on the [Housing Prices Competition](https://www.kaggle.com/c/home-data-for-ml-course), one of the best hands-on projects to start on Kaggle.

## 1. Understand the Data

The first step when you face a new data set is to take some time to get to know the data. In Kaggle competitions, you'll come across something like the sample below.

![kaggle_img001](https://github.com/rmpbastos/data_science/blob/master/img/kaggle_img1.jpg?raw=true)

On the competition's page, you can check the project description on **Overview** and you'll find useful information about the data set on the tab **Data**. In Kaggle competitions, it's common to have the training and test sets provided in separate files. On the same tab, there's usually a summary of the features you'll be working with and some basic statistics. It's crucial to understand which problem needs to be addressed and the data set we have at hand.

You can run your projects in Kaggle notebooks, which work much like Jupyter Notebooks.


## 2. Import the necessary libraries and data set

### 2.1. Libraries

The libraries used in this project are the following.

In [1]:
import pandas as pd                                     # Data analysis tool
import numpy as np                                      # Package for scientific computing
from sklearn.model_selection import train_test_split    # Splits arrays or matrices into random train and test subsets
from sklearn.model_selection import KFold               # Cross-validator
from sklearn.model_selection import cross_validate      # Evaluate metrics by cross-validation
from sklearn.model_selection import GridSearchCV        # Search over specified parameter values for an estimator
from sklearn.compose import ColumnTransformer           # Applies transformers to columns of DataFrames
from sklearn.pipeline import Pipeline                   # Helps building a chain of transforms and estimators
from sklearn.impute import SimpleImputer                # Imputation transformer for completing missing values
from sklearn.preprocessing import OneHotEncoder         # Encode categorical features
from sklearn.metrics import mean_absolute_error         # One of many statistical measures of error
from xgboost import XGBRegressor                        # Our model estimator

### 2.2. Data set 

The next step is to read the data set into a pandas DataFrame and obtain target vector **y**, which will be the column `SalePrice`, and predictors **X**, which, for now, will be the remaining columns.

In [4]:
# Read training and test sets
X_full = pd.read_csv('https://raw.githubusercontent.com/rmpbastos/data_sets/main/housing_price_train.csv', index_col='Id')
X_test_full = pd.read_csv('https://raw.githubusercontent.com/rmpbastos/data_sets/main/housing_price_test.csv', index_col='Id')

# Obtain target vectors and predictors
X = X_full.copy()
y = X.SalePrice
X.drop(['SalePrice'], axis=1, inplace=True)

To get an overview of the data, let's check the first rows and the size of the data set.

In [5]:
X.head()

Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal
2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal
3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal
4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml
5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal


In [6]:
X.shape

(1460, 79)

In [7]:
y.shape

(1460,)

We have 1,460 rows and 79 columns. Later on, we'll check these columns to verify which of them will be meaningful to the model.

In the next step, we'll split the data into training and validation sets. 

## 3. Training and validation data

It is crucial to break our data into a set for training the model and another one to validate the results. It's worth mentioning that we should never use the test data here. Our test set stays untouched until we are satisfied with our model's performance.

What we're going to do is take the predictors **X** and target vector **y** and break them into training and validation sets. For that, we'll use scikit-learn's `train_test_split`.

In [8]:
# Split training and validation sets
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)

Checking the shape of our training and validation sets, we get the following.

In [9]:
print(f'Shape of X_train_full: {X_train_full.shape}')
print(f'Shape of X_valid_full: {X_valid_full.shape}')
print(f'Shape of y_train: {y_train.shape}')
print(f'Shape of y_valid: {y_valid.shape}')

Shape of X_train_full: (1168, 79)
Shape of X_valid_full: (292, 79)
Shape of y_train: (1168,)
Shape of y_valid: (292,)
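
As a quick sanity check on the split sizes, the sketch below passes the same `train_test_split` arguments to a stand-in list of 1,460 row IDs. The list and the variable names are placeholders for illustration, not the real housing data. It also shows that a fixed `random_state` reproduces the same split.

```python
from sklearn.model_selection import train_test_split

# Stand-in for the 1,460 row IDs of the training file (not the real data)
row_ids = list(range(1, 1461))

# Same arguments as in the article: 80% training, 20% validation
train_ids, valid_ids = train_test_split(
    row_ids, train_size=0.8, test_size=0.2, random_state=0
)

# Repeating the call with the same random_state reproduces the same shuffle
train_again, valid_again = train_test_split(
    row_ids, train_size=0.8, test_size=0.2, random_state=0
)

print(len(train_ids), len(valid_ids))
print(train_ids == train_again and valid_ids == valid_again)
```

The two subsets never share a row, which is what lets the validation set act as unseen data.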


## 4. Analyze and prepare the data

Now, we start analyzing the data by checking some information about the features.

In [10]:
X.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1460 entries, 1 to 1460
Data columns (total 79 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   MSSubClass     1460 non-null   int64  
 1   MSZoning       1460 non-null   object 
 2   LotFrontage    1201 non-null   float64
 3   LotArea        1460 non-null   int64  
 4   Street         1460 non-null   object 
 5   Alley          91 non-null     object 
 6   LotShape       1460 non-null   object 
 7   LandContour    1460 non-null   object 
 8   Utilities      1460 non-null   object 
 9   LotConfig      1460 non-null   object 
 10  LandSlope      1460 non-null   object 
 11  Neighborhood   1460 non-null   object 
 12  Condition1     1460 non-null   object 
 13  Condition2     1460 non-null   object 
 14  BldgType       1460 non-null   object 
 15  HouseStyle     1460 non-null   object 
 16  OverallQual    1460 non-null   int64  
 17  OverallCond    1460 non-null   int64  
 18  YearBuil

From the summary above, we can observe that some columns have missing values. Let's take a closer look.

### 4.1. Missing Values

In [11]:
# Check for missing values
missing_values = X.isnull().sum()
missing_values = missing_values[missing_values > 0].sort_values(ascending=False)
print(missing_values)

PoolQC          1453
MiscFeature     1406
Alley           1369
Fence           1179
FireplaceQu      690
LotFrontage      259
GarageYrBlt       81
GarageType        81
GarageFinish      81
GarageQual        81
GarageCond        81
BsmtFinType2      38
BsmtExposure      38
BsmtFinType1      37
BsmtCond          37
BsmtQual          37
MasVnrArea         8
MasVnrType         8
Electrical         1
dtype: int64


Some features have missing values accounting for the majority of their entries. Checking the [competition page](https://www.kaggle.com/c/home-data-for-ml-course/data), we find more details about the values for each feature, which will help us handle missing data.
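
To put those counts in perspective, here is a small sketch that turns the six largest counts from the output above into shares of the 1,460 rows. The dictionary is typed in by hand from that output, and the 50% cut-off is only an illustrative threshold.

```python
# Missing-value counts copied from the output above (six largest)
missing_counts = {
    "PoolQC": 1453,
    "MiscFeature": 1406,
    "Alley": 1369,
    "Fence": 1179,
    "FireplaceQu": 690,
    "LotFrontage": 259,
}
n_rows = 1460  # number of rows reported by X.shape

# Share of rows with a missing value, per feature
missing_share = {name: count / n_rows for name, count in missing_counts.items()}

# Features where more than half of the entries are missing (illustrative cut-off)
mostly_missing = [name for name, share in missing_share.items() if share > 0.5]

for name, share in missing_share.items():
    print(f"{name:<12} {share:.1%}")
print("Mostly missing:", mostly_missing)
```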

For instance, in the columns `PoolQC`, `MiscFeature`, `Alley`, `Fence`, and `FireplaceQu`, a missing value means that the house doesn't have that specific feature, so we'll fill the missing values with "NA". All the null values in columns starting with `Garage` and `Bsmt` relate to houses that don't have a garage or basement, respectively. We'll fill those and the remaining null values with "NA" or the mean value, depending on whether the feature is categorical or numerical.

### 4.2. Preprocessing the categorical variables

Most machine learning models only work with numerical variables. Therefore, if we feed the model with categorical variables without preprocessing them first, we'll get an error.

There are several ways to deal with categorical values. Here, we'll use *One-Hot Encoding*, which will create new columns indicating the presence or absence of each value in the original data.

One issue with One-Hot Encoding is dealing with variables that have numerous unique categories, since it creates a new column for each unique category. Thus, this project will only include categorical variables with no more than 15 unique values.
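
To make the idea concrete, here is a minimal pure-Python sketch of what one-hot encoding produces. The helper `one_hot` and the toy column are hypothetical and only meant for illustration. The project itself relies on scikit-learn's `OneHotEncoder`.

```python
def one_hot(values):
    """Return (categories, rows) with one 0/1 indicator column per unique value."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows

# Toy column (made-up sample of garage types, with "NA" for "no garage")
garage_type = ["Attchd", "Detchd", "Attchd", "NA"]

categories, encoded = one_hot(garage_type)

print(categories)
for row in encoded:
    print(row)
```

Each unique value becomes its own column, so a feature with hundreds of distinct values would add hundreds of columns. That growth is the reason for capping the number of unique values.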

In [12]:
# Select categorical columns with no more than 15 unique values
categorical_cols = [col for col in X_train_full.columns if 
                   X_train_full[col].nunique() <= 15 and
                   X_train_full[col].dtype == 'object']

# Select numeric values
numeric_cols = [col for col in X_train_full.columns if
                X_train_full[col].dtype in ['int64', 'float64']]

# Keep selected columns
my_columns = categorical_cols + numeric_cols
X_train = X_train_full[my_columns].copy()
X_valid = X_valid_full[my_columns].copy()
X_test = X_test_full[my_columns].copy()

### 4.3. Create a pipeline

*Pipelines* are a great way to keep the data modeling and preprocessing more organized and easier to understand. By creating a pipeline, we'll handle the missing values and the preprocessing covered in the previous two steps.

As defined above, missing numerical entries will be filled with the mean value, while missing categorical entries will be filled with "NA". Furthermore, categorical columns will also be preprocessed with One-Hot Encoding.

We are using *SimpleImputer* to fill in missing values, and *ColumnTransformer* will help us apply the numerical and categorical preprocessors in a single transformer.

In [13]:
# Preprocessing numerical values
numerical_transformer = SimpleImputer(strategy='mean')

# Preprocessing categorical values
categorical_transformer = Pipeline(steps=[
                                   ('imputer', SimpleImputer(strategy='constant', fill_value='NA')),
                                   ('onehot', OneHotEncoder(handle_unknown='ignore'))
                                   ])

# Pack the preprocessors together
preprocessor = ColumnTransformer(transformers=[
                                 ('num', numerical_transformer, numeric_cols),
                                 ('cat', categorical_transformer, categorical_cols)
                                 ])
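
To see what this preprocessor actually outputs, the sketch below builds the same kind of transformer for a tiny made-up frame with one numerical and one categorical column. The values and the names `toy` and `toy_preprocessor` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy frame (made-up values) with one missing entry in each column
toy = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0],
    "Fence": pd.Series(["MnPrv", np.nan, "GdWo"], dtype=object),
})

# Same structure as the article's preprocessor, restricted to two columns
toy_preprocessor = ColumnTransformer(transformers=[
    ("num", SimpleImputer(strategy="mean"), ["LotFrontage"]),
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="constant", fill_value="NA")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["Fence"]),
])

transformed = toy_preprocessor.fit_transform(toy)

# The result can be a sparse matrix; convert to a dense array for display
if hasattr(transformed, "toarray"):
    transformed = transformed.toarray()

print(transformed)
```

The first output column is the numerical feature with its gap filled by the column mean. The remaining columns are the indicators for each category, with "NA" treated as a category of its own.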

## 5. Define a model

Now that we have bundled our preprocessors together, we can define a model. In this article, we are working with **XGBoost**, one of the most effective machine learning algorithms, which has produced great results in many Kaggle competitions. As an evaluation metric, we are using the **Mean Absolute Error**.

In [14]:
# Define the model with default parameters
model = XGBRegressor(verbosity=0, random_state=0)

# Pack preprocessing and modeling together in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                              ])

# Preprocessing of training data, fit model 
my_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)

print('MAE:', mean_absolute_error(y_valid, preds))

MAE: 16706.181988441782
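
For reference, the Mean Absolute Error is simply the average of the absolute differences between actual and predicted values. The sketch below computes it by hand on made-up prices. The function name `mean_absolute_error_manual` is hypothetical and exists only to show the arithmetic behind scikit-learn's `mean_absolute_error`.

```python
def mean_absolute_error_manual(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up sale prices and predictions (not taken from the data set)
actual = [200000, 150000, 320000]
predicted = [210000, 140000, 300000]

# Absolute errors are 10000, 10000 and 20000, so the MAE is 40000 / 3
mae = mean_absolute_error_manual(actual, predicted)
print(round(mae, 2))
```

Because the metric is expressed in the target's units, the MAE above reads as an average miss of roughly 16,700 in sale price on the validation set.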


## 6. Cross-validation

[Cross-validation](https://scikit-learn.org/stable/modules/cross_validation.html#) gives a more reliable estimate of model quality than a single split. Instead of relying on one training set and one validation set, cross-validation runs our model on different subsets of the data to get multiple measures of model quality.

We'll use the cross-validator [KFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) with its default of 5 folds, plus shuffling, to split the training data. Each fold is used once for validation while the remaining folds form the training set. After that, [`cross_validate`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html) evaluates the metric, which in this case is the Mean Absolute Error.
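
The mechanics of the folds can be sketched in a few lines of plain Python. This is a simplified illustration with a hypothetical helper `kfold_indices`. It uses contiguous folds without shuffling, whereas the article's `KFold` shuffles the rows first.

```python
def kfold_indices(n_samples, n_splits=5):
    """Yield (train_idx, valid_idx) pairs for contiguous, unshuffled folds."""
    # The first n_samples % n_splits folds get one extra sample
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        valid_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, valid_idx
        start += size

# 1,168 is the number of rows in the training split
folds = list(kfold_indices(1168))
valid_sizes = [len(valid_idx) for _, valid_idx in folds]

print(valid_sizes)
```

Every row lands in exactly one validation fold, so each of the five scores is measured on rows the model did not see during that round of training.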

In [15]:
# Using KFold cross-validator
kfold = KFold(shuffle=True, random_state=0)

# Evaluating the Mean Absolute Error
scores = cross_validate(my_pipeline, X_train, y_train, 
                              scoring='neg_mean_absolute_error', cv=kfold)

# Multiply by -1 since sklearn calculates negative MAE
print('Average MAE score:', (scores['test_score'] * -1).mean())

Average MAE score: 16168.894833206665


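The cross-validated MAE is lower than the single-split estimate above. Keep in mind that the two numbers are computed on different data (the folds are drawn from the training set only), so this reflects a more robust estimate of model quality rather than a better model. In the next step, we'll try to improve the model itself by optimizing some hyperparameters.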

## 7. Hyperparameter tuning

**XGBoost** in its default setup usually yields great results, but it also has plenty of hyperparameters that can be optimized to improve the model. Here, we'll use a method called [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) which will search over specified parameter values and return the best ones. Once again, we'll utilize the pipeline and the cross-validator *KFold* defined above.

*GridSearchCV* performs an exhaustive search over the parameter grid, which can demand a lot of computational power and take a long time to finish. We can speed up the process a little by setting the parameter `n_jobs` to `-1`, which tells scikit-learn to use all available processors.

In [16]:
"""
To set a parameter of a pipeline step, prefix the parameter name with the step name and a double underscore ('__').
Ex: Instead of 'n_estimators', we should set 'model__n_estimators'.
https://github.com/scikit-learn/scikit-learn/issues/18472
"""
# parameters to be searched over
param_grid = {'model__n_estimators': [10, 50, 100, 200, 400, 600],
              'model__max_depth': [2, 3, 5, 7, 10],
              'model__min_child_weight': [0.0001, 0.001, 0.01],
              'model__learning_rate': [0.01, 0.1, 0.5, 1]}

# find the best parameter
kfold = KFold(shuffle=True, random_state=0)
grid_search = GridSearchCV(my_pipeline, param_grid, scoring='neg_mean_absolute_error', cv=kfold, n_jobs=-1)
grid_result = grid_search.fit(X_train, y_train)

In [17]:
print('Best result:', round((grid_result.best_score_ * -1), 2), 'for', grid_result.best_params_)

Best result: 15750.17 for {'model__learning_rate': 0.1, 'model__max_depth': 3, 'model__min_child_weight': 0.0001, 'model__n_estimators': 400}
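
To get a feel for why the search is expensive, the sketch below counts the combinations in the grid above and multiplies by the number of folds. It is plain arithmetic and does not train anything. The variable names are illustrative.

```python
from itertools import product

# Same grid as in the search above
param_grid = {'model__n_estimators': [10, 50, 100, 200, 400, 600],
              'model__max_depth': [2, 3, 5, 7, 10],
              'model__min_child_weight': [0.0001, 0.001, 0.01],
              'model__learning_rate': [0.01, 0.1, 0.5, 1]}
n_folds = 5  # KFold's default number of splits

# Every combination of one value per parameter is a candidate
candidates = list(product(*param_grid.values()))
n_candidates = len(candidates)   # 6 * 5 * 3 * 4
n_fits = n_candidates * n_folds  # one fit per candidate and fold

print(n_candidates, n_fits)
```

With its default `refit=True`, `GridSearchCV` also fits the best candidate once more on the whole training set after the search.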


## 8. Generate test predictions

After tuning some hyperparameters, it's time to go over the modeling process again to make predictions on the test set. We'll define our final model based on the optimized values provided by *GridSearchCV*.

In [32]:
# Define final model
final_model = XGBRegressor(n_estimators=400, 
                           max_depth=3, 
                           min_child_weight=0.0001, 
                           learning_rate=0.1, 
                           verbosity=0, 
                           random_state=0
                           )

# Create a pipeline
final_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                                 ('final_model', final_model)
                                 ])

# Fit the model
final_pipeline.fit(X_train, y_train)

# Get predictions on the test set
final_prediction = final_pipeline.predict(X_test)
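
Instead of retyping the tuned values by hand, one option is to derive the model arguments from the search result. The sketch below is only an illustration: `best_params` is copied from the output printed earlier, and `strip_step_prefix` is a hypothetical helper that removes the `model__` step prefix.

```python
# Best parameters as printed by the grid search above
best_params = {'model__learning_rate': 0.1,
               'model__max_depth': 3,
               'model__min_child_weight': 0.0001,
               'model__n_estimators': 400}

def strip_step_prefix(params, step_name):
    """Drop the '<step_name>__' prefix so the keys match the estimator's arguments."""
    prefix = step_name + "__"
    return {key[len(prefix):]: value
            for key, value in params.items()
            if key.startswith(prefix)}

model_kwargs = strip_step_prefix(best_params, "model")
print(model_kwargs)
```

The resulting dictionary could then be unpacked into the estimator, for example `XGBRegressor(**model_kwargs, verbosity=0, random_state=0)`, which avoids copy-and-paste mistakes when the grid changes.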

## 9. Submit your results

We're almost there! The machine learning modeling is done, but we still need to submit our results to have our score recorded.

This step is quite simple. We need to create a `.csv` file containing the predictions. This file consists of a DataFrame with two columns. In this case, one column for "Id" and the other one for the test predictions on the target feature. 


In [45]:
# Save test predictions to .csv file
output = pd.DataFrame({'Id': X_test.index,
                       'SalePrice': final_prediction})
output.to_csv('submission.csv', index=False)
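
Before uploading, it can be worth reading the file back and checking its shape. The sketch below does this on made-up IDs and predictions written to an in-memory buffer. The values and the three checks are illustrative assumptions, not requirements stated by Kaggle.

```python
import io

import pandas as pd

# Made-up IDs and predictions standing in for X_test.index and final_prediction
toy_output = pd.DataFrame({"Id": [1461, 1462, 1463],
                           "SalePrice": [120000.0, 155000.5, 180250.0]})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy_output.to_csv(buffer, index=False)

# Read the CSV back, as a reader of the submission file would
buffer.seek(0)
check = pd.read_csv(buffer)

has_expected_columns = list(check.columns) == ["Id", "SalePrice"]
has_no_missing = not check.isnull().values.any()
has_unique_ids = bool(check["Id"].is_unique)

print(has_expected_columns, has_no_missing, has_unique_ids)
```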

## 10. Join the competition

Finally, we just need to join the competition. Please follow the steps below, according to Kaggle's instructions. 

*   Start by accessing the [competition page](https://www.kaggle.com/c/home-data-for-ml-course) and clicking on **Join Competition**.
*   In your Kaggle notebook, click on the blue **Save Version** button in the top right corner of the window.
*   A pop-up window will show up. Select the option **Save and Run All** and then click on the blue **Save** button.
*   A new pop-up shows up in the bottom left corner while your notebook is running. When it stops running, click on the number to the right of the **Save Version** button. You should click on the **ellipsis (...)** to the right of the most recent notebook version, and select **Open in Viewer**. This brings you into view mode of the same page.
*   Now, click on the **Output** tab on the right of the screen. Then, click on the blue **Submit** button to submit your results to the leaderboard.

After submitting, you can check your score and position on the [leaderboard](https://www.kaggle.com/c/home-data-for-ml-course/leaderboard).

![kaggle_img004](https://github.com/rmpbastos/data_science/blob/master/img/kaggle_img4.jpg?raw=true)






## Conclusion

This article was intended to be instructive, helping data science beginners structure their first projects on Kaggle in simple steps. With this straightforward approach, I got a score of **14,778.87**, which ranked this project in the Top 7%.

After further study, you can go back to past projects and try to enhance their performance using the new skills you've learned. To improve this project, we could investigate and treat the outliers more closely, apply a different approach to missing values, or do some feature engineering, for instance.

My advice to beginners is to keep it simple when starting out. Instead of aiming at the "perfect" model, focus on completing the project, applying your skills correctly, and learning from your mistakes, understanding where and why you messed things up. The data science community is constantly expanding, and there are plenty of more experienced folks willing to help on websites like Kaggle or Stack Overflow. Try to learn from their past mistakes as well! With practice and discipline, it's just a matter of time before you start building more elaborate projects and climbing up the rankings of Kaggle's competitions.