In [30]:
import pandas as pd

# YOUTUBE DATASETS
## A complete Machine Learning project
***Classification, Regression, Clustering, Recommendation***

Mitko Stoychev Dimitrov

01/08/2024

## 1. Objective of the work

The objective of this practice is to extract the greatest possible value from the data in order to solve a series of business problems using the techniques I have learnt.

A package called models must be created, containing:

A Python script for each necessary model (e.g. svm_classifier.py, mlp_regressor.py, kmeans_clustering.py...), each containing a class with the model (e.g. SVMClassifier, MLPRegressor...). These classes will have 5 public methods plus any private methods they need (for example, a private method called `__preprocess_data()` can serve as an auxiliary method that performs the transformations required beforehand by the models that need them):

-  The **init** to which the appropriate parameters must be passed (it may not be necessary to pass any).
-  A method called **fit** that will receive input data in ‘Data Frame’ or ‘numpy array’ format and will train the model, storing the weights and other attributes that are deemed necessary to store within the self of the class.
-  A method called **predict** that will use this information stored in the class to make predictions based on the new data. For example, if the model is a classification model, this method should return the class or classes to which the queried data corresponds. It must be able to make inference from a single observation as well as from a set of observations.
-  A method called **save** that allows the data saved within the class (weights, states of preprocessor objects...) to be saved on disk serialised in pickle, allowing the model to continue to be used in the future without re-training it.
-  A method called **load** that allows reloading the state of the class itself from disk in order to be able to call predict again in a future session without having to ‘retrain’ the model.
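
The five-method interface described above can be sketched with a deliberately trivial model. This is only an illustration under stated assumptions: `MajorityClassifier` and `__check_fitted` are hypothetical names, the "model" just memorises the most frequent label, and plain lists stand in for DataFrames or numpy arrays to keep the sketch dependency-free.

```python
import pickle
from collections import Counter


class MajorityClassifier:
    """Toy model (hypothetical): always predicts the most frequent training label."""

    def __init__(self):
        self.majority_label_ = None

    def __check_fitted(self):
        # Private helper, playing the same role as an auxiliary __preprocess_data()
        if self.majority_label_ is None:
            raise RuntimeError("Call fit() or load() before predict().")

    def fit(self, X, y):
        # X is ignored by this baseline; a real model would learn from it
        self.majority_label_ = Counter(list(y)).most_common(1)[0][0]
        return self

    def predict(self, X):
        self.__check_fitted()
        rows = list(X)
        # A flat sequence is treated as one observation, a sequence of rows as a batch
        is_batch = bool(rows) and isinstance(rows[0], (list, tuple))
        n_rows = len(rows) if is_batch else 1
        return [self.majority_label_] * n_rows

    def save(self, path):
        # Only the learned state is serialised, not the class itself
        with open(path, "wb") as f:
            pickle.dump(self.__dict__, f)

    def load(self, path):
        with open(path, "rb") as f:
            self.__dict__.update(pickle.load(f))
        return self


model = MajorityClassifier().fit([[1, 2], [3, 4], [5, 6]], ["Music", "Music", "Comedy"])
single = model.predict([7, 8])             # one observation
batch = model.predict([[7, 8], [9, 10]])   # several observations
```

Serialising only `self.__dict__` keeps the pickle independent of where the class is defined, at the cost of having to instantiate the class before calling `load`.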

Outside this package there should be 4 more files:

- A file called **preprocess_data.py** which will perform the required transformations, data cleaning and changes common to all models. Model-specific transformations must be performed inside the corresponding class, not in this file. It must receive two input arguments, one with the path to the ‘raw’ data file and the other with the path where to leave the data already cleaned and processed for the models.
- A file called **train_models.py** that will import the package, instantiate the models, train them by invoking the ‘fit’ method of each one with the dataset and then serialise to disk what has been learned by means of the ‘save’ method. Both the path from which to read the source data and the paths to save the pickle files of each model must be obtained from a configuration file in configparser format which we will call **train.conf**. The path to this configuration file must be passed as an argument to the script and the argparse library must be used to process this argument. It is recommended to create sections within this configuration file.
- A file called **inference_model.py** that will instantiate a model, read through the ‘load’ method what it has already learned and perform the inference of new data. It will receive three input arguments, and the argparse library must be used to process these arguments. These arguments will be the type of model to be used for inference, the concrete weights to be used from that model and the input data file on which we want to perform the inference.
- A Jupyter notebook, which we will call **exploratory_analysis.ipynb**. This notebook should load the data, perform an exhaustive exploratory analysis trying to extract as much information as possible from it, and then try on the data all the techniques and models seen during module 3 (machine learning) that apply to the problem. The performance of the techniques oriented to the same purpose should be compared, and through this procedure the ones that will finally be industrialised within the models package should be chosen. Likewise, it is not enough to compare technique against technique: once one of them has been chosen, you have to try to find the optimal hyperparameters and the most suitable input data transformations to obtain the best results.
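
As a rough sketch of how **train_models.py** could obtain its paths, the snippet below combines argparse and configparser. The section names `[data]` and `[models]`, the keys inside them and the helper names `parse_args` and `read_paths` are assumptions made for illustration; the assignment only requires that the file be in configparser format.

```python
import argparse
import configparser

# Assumed layout of train.conf (section and key names are hypothetical):
#
#   [data]
#   input_path = preprocessed_data/preprocessed_data.csv
#
#   [models]
#   classifier = weights/classifier.pkl


def parse_args(argv=None):
    """Parse the single positional argument: the path to train.conf."""
    parser = argparse.ArgumentParser(description="Train and serialise the models.")
    parser.add_argument("config", help="Path to the train.conf file")
    return parser.parse_args(argv)


def read_paths(config_path):
    """Return the input data path and a {model name: pickle path} mapping."""
    config = configparser.ConfigParser()
    # ConfigParser.read() silently skips missing files, so check what was loaded
    if not config.read(config_path):
        raise FileNotFoundError(config_path)
    data_path = config["data"]["input_path"]
    model_paths = dict(config["models"])
    return data_path, model_paths
```

Inside the script, `parse_args()` without arguments reads `sys.argv`, and the returned `model_paths` can be iterated to call `fit` and `save` on each model.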


Now let's talk about the dataset used and the business problems we want to solve. The dataset includes several months of data on daily trending videos from YouTube. The data also includes a category_id field, which varies between regions. To retrieve the categories for a specific video, we need to look it up in the associated JSON. One such file is included for each of the regions in the dataset. The first step of the pre-processing should be to cross-reference both files for each region and replace the IDs in the CSVs with the corresponding tags. Then, all the data enriched in the previous step for all regions should be merged into a single CSV that includes a new column with the name of the region. All this logic should be implemented in **preprocess_data.py**.
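
The cross-referencing step can be sketched as follows. The JSON layout (`items` → `id` and `snippet.title`) and the helper names `category_map` and `enrich_region` are assumptions, and the toy frames below are made-up stand-ins for the real per-region CSVs. The region column is named `state`, as in the preprocessed file used later.

```python
import pandas as pd


def category_map(category_json):
    """Build {category id: category name} from one region's JSON (assumed layout)."""
    return {int(item["id"]): item["snippet"]["title"] for item in category_json["items"]}


def enrich_region(videos, category_json, region):
    """Replace category_id with its name and tag every row with the region."""
    enriched = videos.copy()
    # Ids that are absent from the region's JSON become NaN instead of raising
    enriched["category"] = enriched["category_id"].map(category_map(category_json))
    enriched["state"] = region
    return enriched.drop(columns=["category_id"])


# Made-up data for two regions
ca = pd.DataFrame({"video_id": ["a", "b"], "category_id": [10, 23]})
us = pd.DataFrame({"video_id": ["c"], "category_id": [99]})
ca_json = {"items": [{"id": "10", "snippet": {"title": "Music"}},
                     {"id": "23", "snippet": {"title": "Comedy"}}]}
us_json = {"items": [{"id": "10", "snippet": {"title": "Music"}}]}

# Merge all enriched regions into a single frame
merged = pd.concat(
    [enrich_region(ca, ca_json, "Canada"), enrich_region(us, us_json, "United States")],
    ignore_index=True,
)
```

In this sketch an id with no entry in its region's JSON ends up as a missing `category`, so it is worth counting nulls in that column after the merge.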

All possible models for each of the following business problems should be tested, industrialising in the package only the best of them for each of the cases, with the best hyperparameters and prior transformations found.

- **Classification**:
    - Normal challenge: Create a classifier that predicts the category of the video.
    - Optional challenge: Create a classification algorithm that tries to find out as best as possible whether comments or ratings will be disabled by the video creator based on the rest of the information. As it is a highly unbalanced classification, it will be valued not only to obtain a high accuracy, but also to study in a confusion matrix if it is excessively biased towards the majority class and in case it happens to try to mitigate this fact.
- **Regression**:
    - Normal challenge: Try to predict the number of likes.
    - Optional up-scoring challenge: Try to predict the ratio of likes/dislikes for each video.
- **Clustering**:
    - Try to find groups in the videos using various clustering techniques. Evaluate how each overlaps with respect to the categories in the videos. Use dimensionality reduction techniques.
- **Recommendation**:
    - Create a recommender that, given a video, recommends similar videos.
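
For the unbalanced optional classification challenge, a small sketch shows why accuracy alone can mislead. The numbers are made up and `confusion_counts` is a hypothetical helper; the point is only that a classifier which always predicts the majority class scores a high accuracy while never detecting a disabled video.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count the four cells of a binary confusion matrix."""
    counts = Counter(zip(y_true, y_pred))
    return {
        "tn": counts[(False, False)],
        "fp": counts[(False, True)],
        "fn": counts[(True, False)],
        "tp": counts[(True, True)],
    }


# 95 videos with comments enabled and 5 with comments disabled (made-up numbers)
y_true = [False] * 95 + [True] * 5
# Degenerate classifier: always predicts the majority class
y_pred = [False] * 100

cm = confusion_counts(y_true, y_pred)
accuracy = (cm["tp"] + cm["tn"]) / len(y_true)   # 0.95
recall = cm["tp"] / (cm["tp"] + cm["fn"])        # 0.0: no disabled video is found
```

Looking at recall for the minority class next to accuracy is one way to detect this bias before trying to mitigate it.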

## 2. Import the data
Original data source:  https://drive.google.com/file/d/16rdg0eP4e5Db99sAXwAQfSOvgUatOgLj/view?usp=sharing

The imported data are already **preprocessed**. The step of preprocessing the data has been performed in the script *preprocess_data.py*.

In [31]:
data = pd.read_csv("preprocessed_data/preprocessed_data.csv", index_col="video_id")
data

video_id,trending_date,title,channel_title,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description,state,category
n1WpP7iowLc,17.14.11,Eminem - Walk On Water (Audio) ft. Beyoncé,EminemVEVO,2017-11-10T17:00:03.000Z,"Eminem|""Walk""|""On""|""Water""|""Aftermath/Shady/In...",17158579,787425,43420,125882,https://i.ytimg.com/vi/n1WpP7iowLc/default.jpg,False,False,False,Eminem's new track Walk on Water ft. Beyoncé ...,Canada,Music
0dBIkQ4Mz1M,17.14.11,PLUSH - Bad Unboxing Fan Mail,iDubbbzTV,2017-11-13T17:00:00.000Z,"plush|""bad unboxing""|""unboxing""|""fan mail""|""id...",1014651,127794,1688,13030,https://i.ytimg.com/vi/0dBIkQ4Mz1M/default.jpg,False,False,False,STill got a lot of packages. Probably will las...,Canada,Comedy
5qpjK5DgCt4,17.14.11,"Racist Superman | Rudy Mancuso, King Bach & Le...",Rudy Mancuso,2017-11-12T19:05:24.000Z,"racist superman|""rudy""|""mancuso""|""king""|""bach""...",3191434,146035,5339,8181,https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg,False,False,False,WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► ...,Canada,Comedy
d380meD0W0M,17.14.11,I Dare You: GOING BALD!?,nigahiga,2017-11-12T18:01:41.000Z,"ryan|""higa""|""higatv""|""nigahiga""|""i dare you""|""...",2095828,132239,1989,17518,https://i.ytimg.com/vi/d380meD0W0M/default.jpg,False,False,False,I know it's been a while since we did this sho...,Canada,Entertainment
2Vv-BfVoq4g,17.14.11,Ed Sheeran - Perfect (Official Music Video),Ed Sheeran,2017-11-09T11:04:14.000Z,"edsheeran|""ed sheeran""|""acoustic""|""live""|""cove...",33523622,1634130,21082,85067,https://i.ytimg.com/vi/2Vv-BfVoq4g/default.jpg,False,False,False,🎧: https://ad.gt/yt-perfect\n💰: https://...,Canada,Music
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
BZt0qjTWNhw,18.14.06,The Cat Who Caught the Laser,AaronsAnimals,2018-05-18T13:00:04.000Z,"aarons animals|""aarons""|""animals""|""cat""|""cats""...",1685609,38160,1385,2657,https://i.ytimg.com/vi/BZt0qjTWNhw/default.jpg,False,False,False,The Cat Who Caught the Laser - Aaron's Animals,United States,Pets & Animals
1h7KV2sjUWY,18.14.06,True Facts : Ant Mutualism,zefrank1,2018-05-18T01:00:06.000Z,[none],1064798,60008,382,3936,https://i.ytimg.com/vi/1h7KV2sjUWY/default.jpg,False,False,False,,United States,People & Blogs
D6Oy4LfoqsU,18.14.06,I GAVE SAFIYA NYGAARD A PERFECT HAIR MAKEOVER ...,Brad Mondo,2018-05-18T17:34:22.000Z,I gave safiya nygaard a perfect hair makeover ...,1066451,48068,1032,3992,https://i.ytimg.com/vi/D6Oy4LfoqsU/default.jpg,False,False,False,I had so much fun transforming Safiyas hair in...,United States,Entertainment
oV0zkMe1K8s,18.14.06,How Black Panther Should Have Ended,How It Should Have Ended,2018-05-17T17:00:04.000Z,"Black Panther|""HISHE""|""Marvel""|""Infinity War""|...",5660813,192957,2846,13088,https://i.ytimg.com/vi/oV0zkMe1K8s/default.jpg,False,False,False,How Black Panther Should Have EndedWatch More ...,United States,Film & Animation


## 3. Data Quality

In the Data Quality Analysis phase, ensuring the integrity and usability of the data is crucial. These are the steps I will follow to check the data quality:
1. Global vision
   - General information of the dataset
   - Dataset shape

2. Check for Missing Values (Null Data)
    - Identify Missing Values: Use functions like isnull() or isna() in pandas to identify missing values.
    - Quantify Missing Values: Determine the percentage of missing values in each column to understand the extent of the issue.
    - Handle Missing Values: Decide on a strategy to handle missing values, such as: remove or impute missing data.
    
3. Check for Duplicate Data
    - Identify Duplicates: Use functions like duplicated() to find duplicate rows.
    - Remove Duplicates: Remove duplicates using drop_duplicates().
      
4. Check for Inconsistent Data
    - Consistency Checks: Ensure that categorical data values are consistent. For example, check for variations in the category names ("USA" vs "U.S.A" vs "United States").
    - Standardization: Standardize categorical variables to a consistent format.
      
5. Check for Outliers
    - Identify Outliers: Use visualizations (box plots, scatter plots) and statistical methods (Z-score, IQR) to detect outliers.
    - Handle Outliers: Decide whether to remove, transform, or keep outliers based on their impact on the analysis.
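
Two of the checks listed above, quantifying missing values and flagging outliers with the IQR rule, can be sketched on a toy frame. The helper names `missing_percentage` and `iqr_outliers`, the multiplier `k=1.5` and the toy values are assumptions chosen for illustration, not the thresholds used later in this notebook.

```python
import pandas as pd


def missing_percentage(df):
    """Percentage of missing values per column, highest first."""
    return (df.isna().mean() * 100).sort_values(ascending=False)


def iqr_outliers(series, k=1.5):
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Made-up data: one missing category and one extreme view count
toy = pd.DataFrame({
    "views": [100, 120, 110, 130, 10_000],
    "category": ["Music", None, "Comedy", "Music", "Music"],
})

missing = missing_percentage(toy)        # category: 20%, views: 0%
outlier_mask = iqr_outliers(toy["views"])
outliers = toy[outlier_mask]             # only the 10_000-view row
```

On heavily skewed counts such as views or likes, the IQR rule tends to flag many legitimate viral videos, so the mask is better read as a prompt for inspection than as a deletion list.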

### 3.1. Global vision

#### 3.1.1. General data information

In [33]:
data.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
Index: 375942 entries, n1WpP7iowLc to ooyjaVdt-jA
Data columns (total 16 columns):
 #   Column                  Non-Null Count   Dtype 
---  ------                  --------------   ----- 
 0   trending_date           375942 non-null  object
 1   title                   375942 non-null  object
 2   channel_title           375942 non-null  object
 3   publish_time            375942 non-null  object
 4   tags                    375942 non-null  object
 5   views                   375942 non-null  int64 
 6   likes                   375942 non-null  int64 
 7   dislikes                375942 non-null  int64 
 8   comment_count           375942 non-null  int64 
 9   thumbnail_link          375942 non-null  object
 10  comments_disabled       375942 non-null  bool  
 11  ratings_disabled        375942 non-null  bool  
 12  video_error_or_removed  375942 non-null  bool  
 13  description             356464 non-null  object
 14  state                   37

#### 3.1.2. Dataset dimension

In [34]:
data.shape

(375942, 16)

#### 3.1.3. Index information

In [35]:
data.index

Index(['n1WpP7iowLc', '0dBIkQ4Mz1M', '5qpjK5DgCt4', 'd380meD0W0M',
       '2Vv-BfVoq4g', '0yIWz1XEeyc', '_uM5kFfkhB8', '2kyS6SvSYSE',
       'JzCsM1vtn78', '43sm-QwLcx4',
       ...
       'pcJo0tIWybY', '_QWZvU7VCn8', '7UoP9ABJXGE', 'ju_inUnrLc4',
       '1PhPYr_9zRY', 'BZt0qjTWNhw', '1h7KV2sjUWY', 'D6Oy4LfoqsU',
       'oV0zkMe1K8s', 'ooyjaVdt-jA'],
      dtype='object', name='video_id', length=375942)

#### 3.1.4. Column information

In [37]:
data.columns

Index(['trending_date', 'title', 'channel_title', 'publish_time', 'tags',
       'views', 'likes', 'dislikes', 'comment_count', 'thumbnail_link',
       'comments_disabled', 'ratings_disabled', 'video_error_or_removed',
       'description', 'state', 'category'],
      dtype='object')

#### 3.1.5. Check for Null Data

In [5]:
data.isna().sum().sort_values(ascending=False)

description               19478
category                   2738
trending_date                 0
title                         0
channel_title                 0
publish_time                  0
tags                          0
views                         0
likes                         0
dislikes                      0
comment_count                 0
thumbnail_link                0
comments_disabled             0
ratings_disabled              0
video_error_or_removed        0
state                         0
dtype: int64

Conclusions:

The vast majority of the variables have no nulls, but 2 of the features do:
- "description": a priori it is not a decisive feature (for example, for predicting the number of likes of a video) → drop the feature
- "category" → fill the missing values after the EDA

In [15]:
data.drop(columns=["description"], inplace=True)

In [20]:
data.category.value_counts(normalize=True)*100

category
Entertainment            29.208154
People & Blogs           14.483232
Music                    11.391625
News & Politics           9.991318
Comedy                    7.226611
Sports                    6.346127
Film & Animation          5.608729
Howto & Style             5.052465
Gaming                    3.080889
Science & Technology      2.189419
Education                 2.086794
Pets & Animals            1.303041
Autos & Vehicles          1.268475
Travel & Events           0.475879
Shows                     0.260983
Nonprofits & Activism     0.015273
Movies                    0.009646
Trailers                  0.001340
Name: proportion, dtype: float64
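
One possible way to fill the missing `category` values, shown here only as a sketch on made-up data, is to reuse the most frequent category of the same channel. Whether this is the strategy finally adopted depends on the EDA; `fill_category_by_channel` is a hypothetical helper.

```python
import pandas as pd


def fill_category_by_channel(df):
    """Fill missing categories with the most frequent category of each channel."""
    known = df.dropna(subset=["category"])
    # Most frequent known category per channel
    channel_mode = known.groupby("channel_title")["category"].agg(lambda s: s.mode().iloc[0])
    filled = df.copy()
    missing = filled["category"].isna()
    # to_numpy() assigns by position, so a non-unique index (video_id) is not a problem
    filled.loc[missing, "category"] = (
        filled.loc[missing, "channel_title"].map(channel_mode).to_numpy()
    )
    return filled


# Made-up data: channel A has a known category, channel C does not
toy = pd.DataFrame({
    "channel_title": ["A", "A", "A", "B", "C"],
    "category": ["Music", "Music", None, "Comedy", None],
})
filled = fill_category_by_channel(toy)
```

Channels that never appear with a known category, like `C` here, stay missing and need a fallback such as a dedicated "Unknown" label.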

In [22]:
# crosstab needs both `index` and `columns`, passed as Series rather than column names
pd.crosstab(index=data["channel_title"], columns=data["category"])

In [29]:
data[data.duplicated(keep=False)]


video_id,trending_date,title,channel_title,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,state,category
RUCXD3_wW2w,18.14.05,New Hulu Show - SNL,Saturday Night Live,2018-05-13T04:59:28.000Z,"SNL|""Saturday Night Live""|""SNL Season 43""|""SNL...",723495,6362,1589,420,https://i.ytimg.com/vi/RUCXD3_wW2w/default.jpg,False,False,False,United Kingdom,Entertainment
p8npDG2ulKQ,18.14.05,BTS (방탄소년단) LOVE YOURSELF 轉 Tear '...,ibighit,2018-05-06T15:00:02.000Z,"BIGHIT|""빅히트""|""방탄소년단""|""BTS""|""BA...",26912663,2636004,27675,366899,https://i.ytimg.com/vi/p8npDG2ulKQ/default.jpg,False,False,False,United Kingdom,Music
aixso4N2vhI,18.14.05,President Trump Gives Remarks on the Joint Com...,The White House,2018-05-08T18:45:37.000Z,[none],89715,2806,488,1657,https://i.ytimg.com/vi/aixso4N2vhI/default.jpg,False,False,False,United Kingdom,News & Politics
bu0m_UdtoaU,18.14.05,Serious Questions: Avengers Infinity War,Screen Junkies,2018-05-06T17:00:03.000Z,"screenjunkies|""screen junkies""|""serious questi...",443621,12892,3829,2275,https://i.ytimg.com/vi/bu0m_UdtoaU/default.jpg,False,False,False,United Kingdom,Film & Animation
i-G1hy73Mb8,18.14.05,Eurovision Song Contest 2018 - Opening Ceremon...,Eurovision Song Contest,2018-05-06T19:46:34.000Z,"Eurovision Song Contest|""2018""|""Lisbon""|""Openi...",698827,11721,461,1175,https://i.ytimg.com/vi/i-G1hy73Mb8/default.jpg,False,False,False,United Kingdom,Entertainment
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
iILJvqrAQ_w,18.15.05,Charlie Puth - BOY [Official Audio],Charlie Puth,2018-05-11T04:00:34.000Z,"charlie puth|""boy""|""charlie""|""puth""|""atlantic""...",2124177,81085,1321,4019,https://i.ytimg.com/vi/iILJvqrAQ_w/default.jpg,False,False,False,United States,Music
zcEE8J2Bqa8,18.15.05,The Goblin - JACK AND DEAN,Jack and Dean,2018-05-11T18:27:01.000Z,"Jack and Dean|""OMFGItsJackAndDean""|""Jack Howar...",165617,20572,140,1407,https://i.ytimg.com/vi/zcEE8J2Bqa8/default.jpg,False,False,False,United States,Comedy
q1jzwV_s8_Y,18.15.05,Christina Aguilera - Twice (Audio),CAguileraVEVO,2018-05-11T07:00:01.000Z,"Christina Aguilera|""Pop""|""RCA Records Label""|""...",1869585,64523,1891,5903,https://i.ytimg.com/vi/q1jzwV_s8_Y/default.jpg,False,False,False,United States,Music
mkz1zoo15zI,18.15.05,Richard Jefferson and Tracy McGrady have stron...,ESPN,2018-05-11T19:21:53.000Z,"espn|""espn live""|""dwane casey""|""raptors""|""toro...",472999,3505,163,1511,https://i.ytimg.com/vi/mkz1zoo15zI/default.jpg,False,False,False,United States,Sports


### 3.2. Check for duplicates

In [28]:
# Check for duplicates across all columns (the video_id index is not part of the comparison)
total_duplicates = data.duplicated().sum()
print(f'Total duplicates considering all columns: {total_duplicates}')

# Check for duplicates in a subset of columns ('trending_date' and 'title')
subset_duplicates = data.duplicated(subset=['trending_date', 'title']).sum()
print(f'Total duplicates in subset (trending_date and title): {subset_duplicates}')

# Visualize the duplicates by video and trending date. 'video_id' is the index,
# so it is restored as a column before being used in `subset`
data_with_id = data.reset_index()
duplicates_subset = data_with_id[data_with_id.duplicated(subset=['video_id', 'trending_date'], keep=False)]
print(duplicates_subset)


Total duplicates considering all columns: 12570
Total duplicates in subset (trending_date and title): 50508
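
Because `video_id` is the DataFrame index rather than a column, passing it in `subset` raises a `KeyError`. The toy example below, with made-up values, sketches the behaviour and the difference between the default `keep='first'` and `keep=False`.

```python
import pandas as pd

# Made-up frame indexed by video_id, mirroring the structure of `data`
toy = pd.DataFrame(
    {
        "trending_date": ["18.14.05", "18.14.05", "18.15.05"],
        "views": [100, 100, 150],
    },
    index=pd.Index(["vid_1", "vid_1", "vid_1"], name="video_id"),
)

# The index is not a column, so it cannot be used in `subset`
try:
    toy.duplicated(subset=["video_id", "trending_date"])
    raised_key_error = False
except KeyError:
    raised_key_error = True

# Restoring the index as a column makes the check possible
flat = toy.reset_index()
later_copies = flat.duplicated(subset=["video_id", "trending_date"])             # marks repeats only
all_copies = flat.duplicated(subset=["video_id", "trending_date"], keep=False)   # marks every member
```

`keep=False` is convenient for inspecting whole groups of duplicates, while the default mask is the one that matches what `drop_duplicates()` would remove.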

In [26]:
data[data.duplicated()]

video_id,trending_date,title,channel_title,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,state,category
RUCXD3_wW2w,18.14.05,New Hulu Show - SNL,Saturday Night Live,2018-05-13T04:59:28.000Z,"SNL|""Saturday Night Live""|""SNL Season 43""|""SNL...",723495,6362,1589,420,https://i.ytimg.com/vi/RUCXD3_wW2w/default.jpg,False,False,False,United Kingdom,Entertainment
p8npDG2ulKQ,18.14.05,BTS (방탄소년단) LOVE YOURSELF 轉 Tear '...,ibighit,2018-05-06T15:00:02.000Z,"BIGHIT|""빅히트""|""방탄소년단""|""BTS""|""BA...",26912663,2636004,27675,366899,https://i.ytimg.com/vi/p8npDG2ulKQ/default.jpg,False,False,False,United Kingdom,Music
aixso4N2vhI,18.14.05,President Trump Gives Remarks on the Joint Com...,The White House,2018-05-08T18:45:37.000Z,[none],89715,2806,488,1657,https://i.ytimg.com/vi/aixso4N2vhI/default.jpg,False,False,False,United Kingdom,News & Politics
bu0m_UdtoaU,18.14.05,Serious Questions: Avengers Infinity War,Screen Junkies,2018-05-06T17:00:03.000Z,"screenjunkies|""screen junkies""|""serious questi...",443621,12892,3829,2275,https://i.ytimg.com/vi/bu0m_UdtoaU/default.jpg,False,False,False,United Kingdom,Film & Animation
i-G1hy73Mb8,18.14.05,Eurovision Song Contest 2018 - Opening Ceremon...,Eurovision Song Contest,2018-05-06T19:46:34.000Z,"Eurovision Song Contest|""2018""|""Lisbon""|""Openi...",698827,11721,461,1175,https://i.ytimg.com/vi/i-G1hy73Mb8/default.jpg,False,False,False,United Kingdom,Entertainment
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
iILJvqrAQ_w,18.15.05,Charlie Puth - BOY [Official Audio],Charlie Puth,2018-05-11T04:00:34.000Z,"charlie puth|""boy""|""charlie""|""puth""|""atlantic""...",2124177,81085,1321,4019,https://i.ytimg.com/vi/iILJvqrAQ_w/default.jpg,False,False,False,United States,Music
zcEE8J2Bqa8,18.15.05,The Goblin - JACK AND DEAN,Jack and Dean,2018-05-11T18:27:01.000Z,"Jack and Dean|""OMFGItsJackAndDean""|""Jack Howar...",165617,20572,140,1407,https://i.ytimg.com/vi/zcEE8J2Bqa8/default.jpg,False,False,False,United States,Comedy
q1jzwV_s8_Y,18.15.05,Christina Aguilera - Twice (Audio),CAguileraVEVO,2018-05-11T07:00:01.000Z,"Christina Aguilera|""Pop""|""RCA Records Label""|""...",1869585,64523,1891,5903,https://i.ytimg.com/vi/q1jzwV_s8_Y/default.jpg,False,False,False,United States,Music
mkz1zoo15zI,18.15.05,Richard Jefferson and Tracy McGrady have stron...,ESPN,2018-05-11T19:21:53.000Z,"espn|""espn live""|""dwane casey""|""raptors""|""toro...",472999,3505,163,1511,https://i.ytimg.com/vi/mkz1zoo15zI/default.jpg,False,False,False,United States,Sports
