# LioJotstar Merger: Data Analysis with Python for Strategic Optimization

## 2. Data Wrangling
This notebook cleans the Jotstar and LioCinema DataFrames: it corrects data types, handles missing values, checks for duplicates and inconsistent values, and saves the results for feature engineering and analysis.

### Importing Required Libraries

In [4]:
import pandas as pd
import numpy as np

### Loading DataFrames from Saved Parquet Files

In [6]:
try:
    jotstar_contents_df = pd.read_parquet('Parquet Data Files/01. Data Loading/Jotstar_db/contents.parquet')
    print("Jotstar - Contents table loaded successfully.")
    jotstar_subscribers_df = pd.read_parquet('Parquet Data Files/01. Data Loading/Jotstar_db/subscribers.parquet')
    print("Jotstar - Subscribers table loaded successfully.")
    jotstar_content_consumption_df = pd.read_parquet('Parquet Data Files/01. Data Loading/Jotstar_db/content_consumption.parquet')
    print("Jotstar - Content Consumption table loaded successfully.")
    liocinema_contents_df = pd.read_parquet('Parquet Data Files/01. Data Loading/LioCinema_db/contents.parquet')
    print("LioCinema - Contents table loaded successfully.")
    liocinema_subscribers_df = pd.read_parquet('Parquet Data Files/01. Data Loading/LioCinema_db/subscribers.parquet')
    print("LioCinema - Subscribers table loaded successfully.")
    liocinema_content_consumption_df = pd.read_parquet('Parquet Data Files/01. Data Loading/LioCinema_db/content_consumption.parquet')
    print("LioCinema - Content Consumption table loaded successfully.")
    print("\nData Loading Complete.")
    
except FileNotFoundError as e:
    print("Error: One or more Parquet files not found. Please check the file paths.")
    print(f"Details: {e}")
except Exception as e:
    print("An error occurred during data import.")
    print(f"Details: {e}")

Jotstar - Contents table loaded successfully.
Jotstar - Subscribers table loaded successfully.
Jotstar - Content Consumption table loaded successfully.
LioCinema - Contents table loaded successfully.
LioCinema - Subscribers table loaded successfully.
LioCinema - Content Consumption table loaded successfully.

Data Loading Complete.
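
The six `read_parquet` calls above follow one pattern (platform folder, then table name), so they could also be driven by a loop. The following is a minimal sketch under that assumption; `table_paths` and `load_tables` are hypothetical helper names, not part of the original notebook, and the frames are returned in a dictionary instead of six separate variables.

```python
from pathlib import Path

import pandas as pd

# Folder and table names are taken from the paths used in the loading cell.
BASE_DIR = Path("Parquet Data Files/01. Data Loading")
PLATFORMS = ("Jotstar_db", "LioCinema_db")
TABLES = ("contents", "subscribers", "content_consumption")


def table_paths(base_dir=BASE_DIR):
    """Map each (platform, table) pair to its Parquet file path (hypothetical helper)."""
    return {
        (platform, table): Path(base_dir) / platform / f"{table}.parquet"
        for platform in PLATFORMS
        for table in TABLES
    }


def load_tables(base_dir=BASE_DIR):
    """Read every table into a dict keyed by (platform, table) (hypothetical helper)."""
    frames = {}
    for (platform, table), path in table_paths(base_dir).items():
        frames[(platform, table)] = pd.read_parquet(path)
        print(f"{platform} - {table} table loaded successfully.")
    return frames
```

With this layout, `load_tables()[("Jotstar_db", "subscribers")]` would play the role of `jotstar_subscribers_df`. A missing file still raises `FileNotFoundError`, so the same `try`/`except` wrapper can be kept around the call.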


### Checking and Correcting Data Format

In [8]:
jotstar_contents_df.dtypes

Content ID         object
Content Type       object
Language           object
Genre              object
Run Time (mins)     int64
dtype: object

In [9]:
jotstar_subscribers_df.dtypes

User ID                  object
Age Group                object
City Tier                object
Subscription Date        object
Subscription Plan        object
Last Active Date         object
Plan Change Date         object
New Subscription Plan    object
dtype: object

In [10]:
print("Correcting data type of Date columns in Jotstar Subscribers table")
jotstar_subscribers_df['Subscription Date'] = pd.to_datetime(jotstar_subscribers_df['Subscription Date'])
jotstar_subscribers_df['Last Active Date'] = pd.to_datetime(jotstar_subscribers_df['Last Active Date'])
jotstar_subscribers_df['Plan Change Date'] = pd.to_datetime(jotstar_subscribers_df['Plan Change Date'])
jotstar_subscribers_df.dtypes

Correcting data type of Date columns in Jotstar Subscribers table


User ID                          object
Age Group                        object
City Tier                        object
Subscription Date        datetime64[ns]
Subscription Plan                object
Last Active Date         datetime64[ns]
Plan Change Date         datetime64[ns]
New Subscription Plan            object
dtype: object
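
The three date conversions above can be wrapped in a small helper that also guards against values being lost during parsing. This is only a sketch on toy data: `convert_date_columns` is a hypothetical name, and the sample dates are invented for illustration. It uses `errors="coerce"` and then compares null counts, so an unparseable string raises an error instead of silently becoming `NaT`.

```python
import pandas as pd


def convert_date_columns(df, columns):
    """Return a copy of df with the given columns parsed as datetimes (hypothetical helper)."""
    converted = df.copy()
    for col in columns:
        nulls_before = converted[col].isna().sum()
        # Unparseable values become NaT here instead of raising immediately.
        converted[col] = pd.to_datetime(converted[col], errors="coerce")
        nulls_after = converted[col].isna().sum()
        # More nulls than before means some non-null value failed to parse.
        if nulls_after > nulls_before:
            raise ValueError(f"{nulls_after - nulls_before} value(s) in '{col}' could not be parsed")
    return converted


# Toy stand-in for a subscribers table; these values are illustrative only.
DATE_COLUMNS = ["Subscription Date", "Last Active Date", "Plan Change Date"]
demo = pd.DataFrame({
    "Subscription Date": ["2024-01-15", "2024-02-01"],
    "Last Active Date": ["2024-03-01", None],
    "Plan Change Date": [None, "2024-04-10"],
})
demo = convert_date_columns(demo, DATE_COLUMNS)
print(demo.dtypes)
```

The same call would apply unchanged to both subscribers tables, since they share the three date columns.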

In [11]:
jotstar_content_consumption_df.dtypes

User ID                    object
Device Type                object
Total Watch Time (mins)     int64
dtype: object

In [12]:
liocinema_contents_df.dtypes

Content ID         object
Content Type       object
Language           object
Genre              object
Run Time (mins)     int64
dtype: object

In [13]:
liocinema_subscribers_df.dtypes

User ID                  object
Age Group                object
City Tier                object
Subscription Date        object
Subscription Plan        object
Last Active Date         object
Plan Change Date         object
New Subscription Plan    object
dtype: object

In [14]:
print("Correcting data type of Date columns in LioCinema Subscribers table")
liocinema_subscribers_df['Subscription Date'] = pd.to_datetime(liocinema_subscribers_df['Subscription Date'])
liocinema_subscribers_df['Last Active Date'] = pd.to_datetime(liocinema_subscribers_df['Last Active Date'])
liocinema_subscribers_df['Plan Change Date'] = pd.to_datetime(liocinema_subscribers_df['Plan Change Date'])
liocinema_subscribers_df.dtypes

Correcting data type of Date columns in LioCinema Subscribers table


User ID                          object
Age Group                        object
City Tier                        object
Subscription Date        datetime64[ns]
Subscription Plan                object
Last Active Date         datetime64[ns]
Plan Change Date         datetime64[ns]
New Subscription Plan            object
dtype: object

In [15]:
liocinema_content_consumption_df.dtypes

User ID                    object
Device Type                object
Total Watch Time (mins)     int64
dtype: object

### Checking and Handling Null Values

In [17]:
jotstar_contents_df.isnull().sum()
# No null values found in Jotstar Contents table

Content ID         0
Content Type       0
Language           0
Genre              0
Run Time (mins)    0
dtype: int64

In [18]:
print(jotstar_subscribers_df.isnull().sum())

User ID                      0
Age Group                    0
City Tier                    0
Subscription Date            0
Subscription Plan            0
Last Active Date         37968
Plan Change Date         37530
New Subscription Plan    37530
dtype: int64


In [19]:
print("Multiple null values found in 3 columns of Jotstar Subscribers table")
print("\nIssue 1: Active users have NULL in Last Active Date.")
print("Solution 1: Will keep it as it is to reflect them as Active Users.")
print("\nIssue 2: Users who never upgraded/downgraded have NULL in Plan Change Date.")
print("Solution 2: Will keep it as it is to reflect No Change in Subscription Plans.")
print("\nIssue 3: Users who never changed their plans have NULL in New Subscription Plan.")
print("Solution 3: Replace NULL with their Subscription Plan value.")
# Represent missing plans uniformly as NaN before filling.
jotstar_subscribers_df['New Subscription Plan'] = jotstar_subscribers_df['New Subscription Plan'].replace({None: np.nan})
# Users who never changed plans keep their original Subscription Plan.
jotstar_subscribers_df['New Subscription Plan'] = jotstar_subscribers_df['New Subscription Plan'].fillna(
    jotstar_subscribers_df['Subscription Plan'])
print("\nAfter Changes...")
print(jotstar_subscribers_df.isnull().sum())

Multiple null values found in 3 columns of Jotstar Subscribers table

Issue 1: Active users have NULL in Last Active Date.
Solution 1: Will keep it as it is to reflect them as Active Users.

Issue 2: Users who never upgraded/downgraded have NULL in Plan Change Date.
Solution 2: Will keep it as it is to reflect No Change in Subscription Plans.

Issue 3: Users who never changed their plans have NULL in New Subscription Plan.
Solution 3: Replace NULL with their Subscription Plan value.

After Changes...
User ID                      0
Age Group                    0
City Tier                    0
Subscription Date            0
Subscription Plan            0
Last Active Date         37968
Plan Change Date         37530
New Subscription Plan        0
dtype: int64
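
After a fill like Solution 3, one way to confirm it behaved as intended is to check that every user without a Plan Change Date now has the same value in both plan columns. The sketch below runs on a three-row toy table; the rows are invented for illustration and only the column names come from the notebook.

```python
import pandas as pd

# Toy stand-in for a subscribers table; values are illustrative only.
subs = pd.DataFrame({
    "User ID": ["U1", "U2", "U3"],
    "Subscription Plan": ["Free", "Premium", "VIP"],
    "Plan Change Date": pd.to_datetime([None, "2024-05-01", None]),
    "New Subscription Plan": [None, "VIP", None],
})

# Same fill as in the notebook: missing new plans take the original plan.
subs["New Subscription Plan"] = subs["New Subscription Plan"].fillna(subs["Subscription Plan"])

# Users with no recorded plan change should now show identical plans.
unchanged = subs["Plan Change Date"].isna()
consistent = bool(
    (subs.loc[unchanged, "New Subscription Plan"] == subs.loc[unchanged, "Subscription Plan"]).all()
)
print("Unchanged users keep their plan:", consistent)
```

The null Plan Change Date stays in place, so it can still distinguish users who changed plans from those who did not.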


In [20]:
jotstar_content_consumption_df.isnull().sum()
# No null values found in Jotstar Content Consumption table

User ID                    0
Device Type                0
Total Watch Time (mins)    0
dtype: int64

In [21]:
liocinema_contents_df.isnull().sum()
# No null values found in LioCinema Contents table

Content ID         0
Content Type       0
Language           0
Genre              0
Run Time (mins)    0
dtype: int64

In [22]:
print(liocinema_subscribers_df.isnull().sum())

User ID                       0
Age Group                     0
City Tier                     0
Subscription Date             0
Subscription Plan             0
Last Active Date         101141
Plan Change Date         158432
New Subscription Plan    158432
dtype: int64
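
Plan Change Date and New Subscription Plan show the same null count, which suggests they are missing on the same rows, but equal counts alone do not prove it. A row-level comparison, run before any fill, would confirm it. The sketch below uses invented toy rows, and `nulls_aligned` is a hypothetical helper name.

```python
import pandas as pd


def nulls_aligned(df, col_a, col_b):
    """True when col_a and col_b are null on exactly the same rows (hypothetical helper)."""
    return bool((df[col_a].isna() == df[col_b].isna()).all())


# Toy stand-in: both columns are missing on the same two rows.
subs = pd.DataFrame({
    "Plan Change Date": pd.to_datetime(["2024-05-01", None, None]),
    "New Subscription Plan": ["Premium", None, None],
})
aligned = nulls_aligned(subs, "Plan Change Date", "New Subscription Plan")

# Counter-example: the null counts still match (2 and 2), but the rows do not.
shifted = subs.assign(**{"New Subscription Plan": [None, "Basic", None]})
shifted_aligned = nulls_aligned(shifted, "Plan Change Date", "New Subscription Plan")

print(aligned, shifted_aligned)
```

If the check returned `False` on the real table, some users would have a change date without a new plan (or the reverse), and filling New Subscription Plan from Subscription Plan would hide that.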


In [23]:
print("Multiple null values found in 3 columns of LioCinema Subscribers table")
print("\nIssue 1: Active users have NULL in Last Active Date.")
print("Solution 1: Will keep it as it is to reflect them as Active Users.")
print("\nIssue 2: Users who never upgraded/downgraded have NULL in Plan Change Date.")
print("Solution 2: Will keep it as it is to reflect No Change in Subscription Plans.")
print("\nIssue 3: Users who never changed their plans have NULL in New Subscription Plan.")
print("Solution 3: Replace NULL with their Subscription Plan value.")
# Represent missing plans uniformly as NaN before filling.
liocinema_subscribers_df['New Subscription Plan'] = liocinema_subscribers_df['New Subscription Plan'].replace({None: np.nan})
# Users who never changed plans keep their original Subscription Plan.
liocinema_subscribers_df['New Subscription Plan'] = liocinema_subscribers_df['New Subscription Plan'].fillna(
    liocinema_subscribers_df['Subscription Plan'])
print("\nAfter Changes...")
print(liocinema_subscribers_df.isnull().sum())

Multiple null values found in 3 columns of LioCinema Subscribers table

Issue 1: Active users have NULL in Last Active Date.
Solution 1: Will keep it as it is to reflect them as Active Users.

Issue 2: Users who never upgraded/downgraded have NULL in Plan Change Date.
Solution 2: Will keep it as it is to reflect No Change in Subscription Plans.

Issue 3: Users who never changed their plans have NULL in New Subscription Plan.
Solution 3: Replace NULL with their Subscription Plan value.

After Changes...
User ID                       0
Age Group                     0
City Tier                     0
Subscription Date             0
Subscription Plan             0
Last Active Date         101141
Plan Change Date         158432
New Subscription Plan         0
dtype: int64


In [24]:
liocinema_content_consumption_df.isnull().sum()
# No null values found in LioCinema Content Consumption table

User ID                    0
Device Type                0
Total Watch Time (mins)    0
dtype: int64

### Checking for Duplicates

In [26]:
jotstar_contents_df.duplicated().sum()
# No duplicates found in Jotstar Contents table

0

In [27]:
jotstar_subscribers_df.duplicated().sum()
# No duplicates found in Jotstar Subscribers table

0

In [28]:
jotstar_content_consumption_df.duplicated().sum()
# No duplicates found in Jotstar Content Consumption table

0

In [29]:
liocinema_contents_df.duplicated().sum()
# No duplicates found in LioCinema Contents table

0

In [30]:
liocinema_subscribers_df.duplicated().sum()
# No duplicates found in LioCinema Subscribers table

0

In [31]:
liocinema_content_consumption_df.duplicated().sum()
# No duplicates found in LioCinema Content Consumption table

0
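
`duplicated()` with no arguments flags only rows that match in every column. A repeated identifier with different attribute values would pass that check, so a key-level check can complement it. The sketch below assumes that `User ID` is meant to be unique in a subscribers table, which the notebook does not state; the toy rows are invented for illustration.

```python
import pandas as pd

# Toy stand-in for a subscribers table; values are illustrative only.
subscribers = pd.DataFrame({
    "User ID": ["U1", "U2", "U2"],
    "Subscription Plan": ["Free", "Premium", "VIP"],
})

# Full-row check, as used in the notebook: no two rows are completely identical.
full_row_duplicates = int(subscribers.duplicated().sum())

# Key-level check (assumption: one row per User ID): the second "U2" is flagged.
key_duplicates = int(subscribers.duplicated(subset=["User ID"]).sum())

print("Full-row duplicates:", full_row_duplicates)
print("Duplicate User IDs:", key_duplicates)
```

For the contents tables the analogous key would presumably be `Content ID`. The content consumption tables may legitimately repeat a `User ID` across device types, so the right `subset` there depends on the table's grain.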

### Checking for Value Inconsistencies and Correcting (if any)

In [33]:
# List of Unique Values in all (applicable) columns of Jotstar Contents Table
print("Unique Values in Content Type Column")
print(jotstar_contents_df['Content Type'].unique())
print("Unique Values in Language Column")
print(jotstar_contents_df['Language'].unique())
print("Unique Values in Genre Column")
print(jotstar_contents_df['Genre'].unique())
print("Unique Values in Run Time (mins) Column")
print(jotstar_contents_df['Run Time (mins)'].unique())

Unique Values in Content Type Column
['Movie' 'Series' 'Sports']
Unique Values in Language Column
['Bengali' 'English' 'Gujarati' 'Hindi' 'Kannada' 'Malayalam' 'Marathi'
 'Punjabi' 'Tamil' 'Telugu']
Unique Values in Genre Column
['Action' 'Comedy' 'Drama' 'Fantasy' 'Romance' 'Sci-Fi' 'Thriller'
 'Adventure' 'Family' 'Highlights' 'Documentaries' 'Live Matches']
Unique Values in Run Time (mins) Column
[ 90 135 120 150 180  45  30  20  10  60  15   5 300 240]


In [34]:
# List of Unique Values in all (applicable) columns of Jotstar Subscribers Table
print("Unique Values in Age Group Column")
print(jotstar_subscribers_df['Age Group'].unique())
print("Unique Values in City Tier Column")
print(jotstar_subscribers_df['City Tier'].unique())
print("Unique Values in Subscription Plan Column")
print(jotstar_subscribers_df['Subscription Plan'].unique())
print("Unique Values in New Subscription Plan Column")
print(jotstar_subscribers_df['New Subscription Plan'].unique())

Unique Values in Age Group Column
['18-24' '25-34' '35-44' '45+']
Unique Values in City Tier Column
['Tier 1' 'Tier 2' 'Tier 3']
Unique Values in Subscription Plan Column
['Premium' 'Free' 'VIP']
Unique Values in New Subscription Plan Column
['Premium' 'Free' 'VIP']


In [35]:
# List of Unique Values in all (applicable) columns of Jotstar Content Consumption Table
print("Unique Values in Device Type Column")
print(jotstar_content_consumption_df['Device Type'].unique())

Unique Values in Device Type Column
['Mobile' 'TV' 'Laptop']
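
Reading the `unique()` output by eye works for short lists like these. For longer lists, a small helper could flag labels that differ only in case or whitespace. The sketch below is an illustration, not part of the original notebook: `find_label_collisions` is a hypothetical name, and the messy example values are invented.

```python
import pandas as pd


def find_label_collisions(values):
    """Group labels that become identical after trimming, collapsing spaces and case-folding."""
    groups = {}
    for value in pd.Series(values).dropna().unique():
        # Normalize: collapse whitespace runs, strip the ends, ignore case.
        key = " ".join(str(value).split()).casefold()
        groups.setdefault(key, []).append(value)
    # Keep only normalized keys that map to more than one raw spelling.
    return {key: raw for key, raw in groups.items() if len(raw) > 1}


# Values as printed for the Device Type column: no collisions expected.
clean = find_label_collisions(["Mobile", "TV", "Laptop"])

# Invented messy input showing what a collision looks like.
messy = find_label_collisions(["Mobile", "mobile ", "TV", "Mobile"])

print(clean)
print(messy)
```

An empty dictionary means no two labels collapse to the same normalized form, which matches the visual inspection above.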


In [36]:
# List of Unique Values in all (applicable) columns of LioCinema Contents Table
print("Unique Values in Content Type Column")
print(liocinema_contents_df['Content Type'].unique())
print("Unique Values in Language Column")
print(liocinema_contents_df['Language'].unique())
print("Unique Values in Genre Column")
print(liocinema_contents_df['Genre'].unique())
print("Unique Values in Run Time (mins) Column")
print(liocinema_contents_df['Run Time (mins)'].unique())

Unique Values in Content Type Column
['Movie' 'Series' 'Sports']
Unique Values in Language Column
['English' 'Hindi' 'Kannada' 'Malayalam' 'Marathi' 'Tamil' 'Telugu']
Unique Values in Genre Column
['Action' 'Comedy' 'Drama' 'Family' 'Horror' 'Romance' 'Thriller' 'Crime'
 'Documentaries' 'Highlights' 'Live Matches']
Unique Values in Run Time (mins) Column
[120 135 180  90 150  20  30  45  60  10  15 240 300   5]


In [37]:
# List of Unique Values in all (applicable) columns of LioCinema Subscribers Table
print("Unique Values in Age Group Column")
print(liocinema_subscribers_df['Age Group'].unique())
print("Unique Values in City Tier Column")
print(liocinema_subscribers_df['City Tier'].unique())
print("Unique Values in Subscription Plan Column")
print(liocinema_subscribers_df['Subscription Plan'].unique())
print("Unique Values in New Subscription Plan Column")
print(liocinema_subscribers_df['New Subscription Plan'].unique())

Unique Values in Age Group Column
['25-34' '18-24' '35-44' '45+']
Unique Values in City Tier Column
['Tier 3' 'Tier 1' 'Tier 2']
Unique Values in Subscription Plan Column
['Free' 'Basic' 'Premium']
Unique Values in New Subscription Plan Column
['Free' 'Basic' 'Premium']


In [38]:
# List of Unique Values in all (applicable) columns of LioCinema Content Consumption Table
print("Unique Values in Device Type Column")
print(liocinema_content_consumption_df['Device Type'].unique())

Unique Values in Device Type Column
['Mobile' 'TV' 'Laptop']
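
Each column is internally consistent, but the two platforms do not share the same category sets, which matters once their tables are compared or combined. A set difference makes the gaps explicit. The sketch below hard-codes the plan and language values printed above; `compare_categories` is a hypothetical helper name, and in the notebook the inputs would come from `df[column].unique()`.

```python
def compare_categories(left, right):
    """Return (values only in left, values only in right), each sorted."""
    left_set, right_set = set(left), set(right)
    return sorted(left_set - right_set), sorted(right_set - left_set)


# Values copied from the unique-value output of the two subscribers tables.
jotstar_plans = ["Premium", "Free", "VIP"]
liocinema_plans = ["Free", "Basic", "Premium"]
plans_only_jotstar, plans_only_liocinema = compare_categories(jotstar_plans, liocinema_plans)

# Values copied from the unique-value output of the two contents tables.
jotstar_languages = ["Bengali", "English", "Gujarati", "Hindi", "Kannada",
                     "Malayalam", "Marathi", "Punjabi", "Tamil", "Telugu"]
liocinema_languages = ["English", "Hindi", "Kannada", "Malayalam", "Marathi", "Tamil", "Telugu"]
langs_only_jotstar, langs_only_liocinema = compare_categories(jotstar_languages, liocinema_languages)

print("Plans only in Jotstar:", plans_only_jotstar)
print("Plans only in LioCinema:", plans_only_liocinema)
print("Languages only in Jotstar:", langs_only_jotstar)
print("Languages only in LioCinema:", langs_only_liocinema)
```

On these values, `VIP` exists only on Jotstar and `Basic` only on LioCinema, while Bengali, Gujarati and Punjabi content appears only on Jotstar. These are genuine platform differences, not data errors, so no correction is applied here.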


### Exporting Processed DataFrames to Parquet Files

In [40]:
# Saving all Jotstar DataFrames in Parquet Format
jotstar_contents_df.to_parquet('Parquet Data Files/02. Data Wrangling/Jotstar_db/contents.parquet', index=False)
jotstar_subscribers_df.to_parquet('Parquet Data Files/02. Data Wrangling/Jotstar_db/subscribers.parquet', index=False)
jotstar_content_consumption_df.to_parquet('Parquet Data Files/02. Data Wrangling/Jotstar_db/content_consumption.parquet', index=False)

# Saving all LioCinema DataFrames in Parquet Format
liocinema_contents_df.to_parquet('Parquet Data Files/02. Data Wrangling/LioCinema_db/contents.parquet', index=False)
liocinema_subscribers_df.to_parquet('Parquet Data Files/02. Data Wrangling/LioCinema_db/subscribers.parquet', index=False)
liocinema_content_consumption_df.to_parquet('Parquet Data Files/02. Data Wrangling/LioCinema_db/content_consumption.parquet', index=False)

print("All DataFrames are saved as Parquet files successfully.")

All DataFrames are saved as Parquet files successfully.
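
The export cell assumes that the `02. Data Wrangling` folders already exist; `to_parquet` does not create missing directories. A minimal sketch of an export helper that creates the target folder first is shown below. `export_tables` is a hypothetical name, and writing the files requires a Parquet engine such as `pyarrow`, as in the rest of the notebook.

```python
from pathlib import Path

import pandas as pd


def export_tables(tables, out_dir):
    """Write each DataFrame in `tables` to <out_dir>/<name>.parquet (hypothetical helper)."""
    out_dir = Path(out_dir)
    # Create the folder (and any missing parents) before writing.
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, df in tables.items():
        target = out_dir / f"{name}.parquet"
        df.to_parquet(target, index=False)
        written.append(target)
    return written
```

A call such as `export_tables({"contents": jotstar_contents_df, "subscribers": jotstar_subscribers_df, "content_consumption": jotstar_content_consumption_df}, "Parquet Data Files/02. Data Wrangling/Jotstar_db")` would then stand in for the three Jotstar lines above.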


## Next Notebook: "3. Feature Engineering"