Preprocessing

Exploratory

In [39]:
from pathlib import Path
import pandas as pd

In [40]:
train = pd.read_csv("data/train.csv")
test = pd.read_csv("data/test.csv")
train.head()

Unnamed: 0,id,Podcast_Name,Episode_Title,Episode_Length_minutes,Genre,Host_Popularity_percentage,Publication_Day,Publication_Time,Guest_Popularity_percentage,Number_of_Ads,Episode_Sentiment,Listening_Time_minutes
0,0,Mystery Matters,Episode 98,,True Crime,74.81,Thursday,Night,,0.0,Positive,31.41998
1,1,Joke Junction,Episode 26,119.8,Comedy,66.95,Saturday,Afternoon,75.95,2.0,Negative,88.01241
2,2,Study Sessions,Episode 16,73.9,Education,69.97,Tuesday,Evening,8.97,0.0,Negative,44.92531
3,3,Digital Digest,Episode 45,67.17,Technology,57.22,Monday,Morning,78.7,2.0,Positive,46.27824
4,4,Mind & Body,Episode 86,110.51,Health,80.07,Monday,Afternoon,58.68,3.0,Neutral,75.61031


In [41]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 750000 entries, 0 to 749999
Data columns (total 12 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   id                           750000 non-null  int64  
 1   Podcast_Name                 750000 non-null  object 
 2   Episode_Title                750000 non-null  object 
 3   Episode_Length_minutes       662907 non-null  float64
 4   Genre                        750000 non-null  object 
 5   Host_Popularity_percentage   750000 non-null  float64
 6   Publication_Day              750000 non-null  object 
 7   Publication_Time             750000 non-null  object 
 8   Guest_Popularity_percentage  603970 non-null  float64
 9   Number_of_Ads                749999 non-null  float64
 10  Episode_Sentiment            750000 non-null  object 
 11  Listening_Time_minutes       750000 non-null  float64
dtypes: float64(5), int64(1), object(6)
memory usage: 68.7+ MB


Missing values occur in:
- Episode_Length_minutes
- Guest_Popularity_percentage
- Number_of_Ads

In [42]:
# Boolean mask of rows that contain at least one missing value.
null_rows_idx = train.isnull().any(axis=1)
rows_with_missing_values_count = int(null_rows_idx.sum())
# Total number of missing cells; a row missing several values is counted once per cell.
empty_values = int(train.isnull().sum().sum())
print(f"check {empty_values} > {rows_with_missing_values_count} => check {'passed' if empty_values >= rows_with_missing_values_count else 'not passed'}")


check 233124 > 210952 => check passed
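
The summed per-column count (233124) is larger than the number of affected rows (210952) because a row that lacks more than one value is counted once per missing cell in the first figure but only once in the second. A minimal sketch on a made-up frame shows the difference; the `toy` data is hypothetical and not taken from train.csv:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not taken from train.csv.
toy = pd.DataFrame({
    "Episode_Length_minutes": [np.nan, 119.8, np.nan, 67.17],
    "Guest_Popularity_percentage": [np.nan, 75.95, 8.97, 78.7],
    "Number_of_Ads": [0.0, 2.0, 0.0, 2.0],
})

# Missing cells in each column.
missing_per_column = toy.isnull().sum()
# Every missing cell counted, so a row with two gaps contributes two.
missing_cells = int(missing_per_column.sum())
# Each row counted at most once, however many gaps it has.
rows_with_missing = int(toy.isnull().any(axis=1).sum())

print(missing_per_column)
print(f"missing cells: {missing_cells}, rows with missing values: {rows_with_missing}")
```

Here the first row lacks two values, so the cell count is 3 while the row count is 2.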


The missing values are noted; we will return to them after checking the unique values in each column.

In [43]:
train.nunique()

id                             750000
Podcast_Name                       48
Episode_Title                     100
Episode_Length_minutes          12268
Genre                              10
Host_Popularity_percentage       8038
Publication_Day                     7
Publication_Time                    4
Guest_Popularity_percentage     10019
Number_of_Ads                      12
Episode_Sentiment                   3
Listening_Time_minutes          42807
dtype: int64

In [44]:
float64_columns = ["Episode_Length_minutes", "Host_Popularity_percentage",
                    "Guest_Popularity_percentage", "Number_of_Ads", "Listening_Time_minutes"]

categorical_columns = ["Podcast_Name", "Episode_Title", "Genre", "Publication_Day", "Publication_Time", "Episode_Sentiment"]

for col in float64_columns:
    unique_values = train[col].dropna().unique().tolist()
    print(f"{col} min: {min(unique_values)} , max: {max(unique_values)}")

print("\n")

for col in categorical_columns:
    unique_values = train[col].dropna().unique().tolist()
    print(unique_values)


Episode_Length_minutes min: 0.0 , max: 325.24
Host_Popularity_percentage min: 1.3 , max: 119.46
Guest_Popularity_percentage min: 0.0 , max: 119.91
Number_of_Ads min: 0.0 , max: 103.91
Listening_Time_minutes min: 0.0 , max: 119.97


['Mystery Matters', 'Joke Junction', 'Study Sessions', 'Digital Digest', 'Mind & Body', 'Fitness First', 'Criminal Minds', 'News Roundup', 'Daily Digest', 'Music Matters', 'Sports Central', 'Melody Mix', 'Game Day', 'Gadget Geek', 'Global News', 'Tech Talks', 'Sport Spot', 'Funny Folks', 'Sports Weekly', 'Business Briefs', 'Tech Trends', 'Innovators', 'Health Hour', 'Comedy Corner', 'Sound Waves', 'Brain Boost', "Athlete's Arena", 'Wellness Wave', 'Style Guide', 'World Watch', 'Humor Hub', 'Money Matters', 'Healthy Living', 'Home & Living', 'Educational Nuggets', 'Market Masters', 'Learning Lab', 'Lifestyle Lounge', 'Crime Chronicles', 'Detective Diaries', 'Life Lessons', 'Current Affairs', 'Finance Focus', 'Laugh Line', 'True Crime Stories', 'Business Insig

Nothing disturbing shows up in the categorical columns. Number_of_Ads, however, should contain only integer values, yet its maximum is 103.91. The remaining problems are:
- Episode_Length_minutes has a concerning maximum of 325.24 minutes and a minimum of 0.0.
- Guest_Popularity_percentage has a concerning minimum of 0.0 (as opposed to the Host_Popularity_percentage minimum of 1.3), which might indicate a missing value. Its maximum is 119.91, over 100%, which might be alright; we will see how it affects predictions.
- Host_Popularity_percentage has a maximum of 119.46, again over 100%, which might be OK.
- Listening_Time_minutes has a minimum of 0.0, which might be OK, but its maximum of 119.97 should match (or at least be close to) 325.24. This suggests that 325.24 minutes might in fact be unrealistic, since not even one listener finished such an episode (unlikely given a database of 750k records).
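
The range observations above can be turned into explicit per-row flags. The following is only a sketch: the `toy` rows and the `suspicious` frame are hypothetical names and data, and the last flag (listening time longer than the episode) is an added assumption about what counts as inconsistent.

```python
import pandas as pd

# Hypothetical rows that mimic the oddities described above.
toy = pd.DataFrame({
    "Episode_Length_minutes": [0.0, 119.8, 325.24, 67.17],
    "Host_Popularity_percentage": [74.81, 119.46, 69.97, 57.22],
    "Guest_Popularity_percentage": [0.0, 75.95, 119.91, 78.7],
    "Number_of_Ads": [0.0, 2.0, 103.91, 2.0],
    "Listening_Time_minutes": [31.42, 88.01, 119.97, 46.28],
})

ads = toy["Number_of_Ads"]

suspicious = pd.DataFrame({
    # Ad counts should be whole numbers; missing values are not flagged here.
    "fractional_ads": ads.notna() & (ads % 1 != 0),
    # Percentages above 100.
    "host_over_100": toy["Host_Popularity_percentage"] > 100,
    "guest_over_100": toy["Guest_Popularity_percentage"] > 100,
    # Zero-length episodes.
    "zero_length": toy["Episode_Length_minutes"] == 0,
    # Assumption: listening longer than the episode lasts is inconsistent.
    "listened_longer_than_episode":
        toy["Listening_Time_minutes"] > toy["Episode_Length_minutes"],
})

# Number of flagged rows per check.
print(suspicious.sum())
```

Running the same flags on `train` instead of `toy` would show how many records each observation actually affects.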


Summary of the initial analysis:
- There are 210952 records with missing values; these must be handled.
- Episode_Length_minutes and Guest_Popularity_percentage have minimum values of 0.0; these must be handled.
- Number_of_Ads contains non-integer values; these must be handled (the column will be converted from float64 to categorical).
- Host_Popularity_percentage and Guest_Popularity_percentage might need standardization from the [0.0, 119-ish]% interval to the [0.0, 100.0]% interval; handling is optional.
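
One possible way to act on these points is sketched below on made-up rows. Treating 0.0 as missing, capping percentages at 100 instead of rescaling them, and turning non-integer ad counts into missing values are all assumptions made for illustration, not decisions taken in this notebook; `toy` and `cleaned` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Hypothetical rows, not taken from train.csv.
toy = pd.DataFrame({
    "Episode_Length_minutes": [0.0, 119.8, np.nan, 67.17],
    "Host_Popularity_percentage": [74.81, 119.46, 69.97, 57.22],
    "Guest_Popularity_percentage": [np.nan, 75.95, 119.91, 0.0],
    "Number_of_Ads": [0.0, 2.0, 103.91, np.nan],
})

cleaned = toy.copy()

# Assumption: a value of 0.0 in these columns really means "missing".
zero_cols = ["Episode_Length_minutes", "Guest_Popularity_percentage"]
cleaned[zero_cols] = cleaned[zero_cols].replace(0.0, np.nan)

# Assumption: percentages above 100 are capped rather than rescaled.
pct_cols = ["Host_Popularity_percentage", "Guest_Popularity_percentage"]
cleaned[pct_cols] = cleaned[pct_cols].clip(upper=100.0)

# Assumption: non-integer ad counts are invalid and become missing;
# the column is then stored as a categorical.
ads = cleaned["Number_of_Ads"]
cleaned["Number_of_Ads"] = ads.where(ads % 1 == 0).astype("category")

print(cleaned)
print(cleaned.dtypes)
```

The missing values created here would still need to be imputed in a later step.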

All of the observations above could serve as milestone checkpoints. Additionally:
- The pandas DataFrame has a .hist() method, which plots a histogram of each numeric variable's distribution; this could be useful in outlier handling (especially for the most remote decile groups).
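
Since .hist() draws a plot and needs matplotlib, the same distribution information can also be inspected numerically. A small sketch, assuming a made-up sample of episode lengths (the `lengths` values are hypothetical): `quantile` gives the boundaries of the outermost decile groups, and `value_counts(bins=...)` gives histogram-like bin counts.

```python
import pandas as pd

# Hypothetical sample of episode lengths, including one extreme value.
lengths = pd.Series(
    [5.0, 30.0, 45.0, 60.0, 67.17, 73.9, 90.0, 110.51, 119.8, 325.24],
    name="Episode_Length_minutes",
)

# Boundaries of the lowest and highest decile groups.
low, high = lengths.quantile([0.1, 0.9])

# Values that fall into the two most remote decile groups.
tails = lengths[(lengths < low) | (lengths > high)]

# Histogram-like counts over five equal-width bins, without plotting.
bin_counts = lengths.value_counts(bins=5, sort=False)

print(f"10th percentile: {low:.2f}, 90th percentile: {high:.2f}")
print(tails.tolist())
print(bin_counts)
```

On the real data, `train["Episode_Length_minutes"]` would take the place of `lengths`.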

Preprocessing - Imputation and Solutions to the Exploratory Part Problems

In [45]:
print("Handling")

Handling
