# About Dataset

This dataset covers the top 250 television shows listed on IMDb. For each show it records the title, the years it aired, the total number of episodes, the age rating, the average IMDb user rating, the number of votes received, and the category (TV Series or TV Mini Series).

The dataset is useful for exploring audience preferences and trends in television. Ratings and vote counts indicate which shows are most popular with viewers, the distribution of categories shows the relative popularity of series and mini-series, and the release years can be used to analyze production trends over time.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

In [None]:
imdb_data= pd.read_csv(r'IMDB_Top250_Tvshows.csv', encoding="ISO-8859-1")

In [None]:
imdb_data.head()

In [None]:
imdb_data.shape

In [None]:
imdb_data.info()

In [None]:
# Check if there are any missing values
has_missing_values = imdb_data.isna().any().any()

if has_missing_values:
    print("The DataFrame has missing values.")
else:
    print("The DataFrame has no missing values.")

In [None]:
# Count the missing values in each column
total_missing= imdb_data.isnull().sum()
print(total_missing)


In [None]:
# Percentage of missing values across the whole dataset
# (np.product was removed in NumPy 2.0; np.prod is the supported name)
total_cells = np.prod(imdb_data.shape)
percent_missing = (total_missing.sum() / total_cells) * 100
print(percent_missing)
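
A per-column view of the same figure can be easier to read than a single overall number. The following is a minimal sketch on a small made-up DataFrame (`sample` and its values are illustrative assumptions, not rows from the dataset); the same two expressions should apply to `imdb_data` unchanged.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for imdb_data with one missing 'Age' value
sample = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Age": ["15", np.nan, "PG", "18"],
    "Rating": [9.5, 9.1, 9.4, 9.0],
})

# Share of missing values within each column, in percent
missing_by_column = sample.isna().mean() * 100

# Share of missing cells across the whole DataFrame, in percent
missing_overall = sample.isna().sum().sum() / sample.size * 100

print(missing_by_column)
print(f"Overall: {missing_overall:.2f}%")
```

`isna().mean()` works because `True` counts as 1, so the mean of the boolean mask is the fraction of missing entries in each column.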

In [60]:
# Locate the missing values in 'Age' column

missing_ages= imdb_data[imdb_data['Age'].isna()]
print(missing_ages)

                 Titile       Year Total_episodes  Age  Rating Vote_count  \
46            47. As If      2021         55 eps  NaN     9.0      (23K)   
75           76. Gullak      2019         20 eps  NaN     9.1      (23K)   
83       84. Reply 1988  20152016         20 eps  NaN     9.1      (12K)   
126      127. Aspirants      2021         10 eps  NaN     9.2     (312K)   
194    195. Rocket Boys      2022         17 eps  NaN     8.9      (18K)   
240  241. Avrupa Yakasi  20042009        190 eps  NaN     8.6      (21K)   

      Category  
46   TV Series  
75   TV Series  
83   TV Series  
126  TV Series  
194  TV Series  
240  TV Series  


In [61]:
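# Drop the rows with missing values.
# Note: dropna() returns a new DataFrame; imdb_data itself stays unchanged
# unless the result is assigned back.
imdb_data.dropna()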

Unnamed: 0,Titile,Year,Total_episodes,Age,Rating,Vote_count,Category
0,1. Breaking Bad,20082013,62 eps,18,9.5,(2.2M),TV Series
1,2. Planet Earth II,2016,6 eps,PG,9.5,(159K),TV Mini Series
2,3. Planet Earth,2006,11 eps,PG,9.4,(221K),TV Mini Series
3,4. Band of Brothers,2001,10 eps,15,9.4,(533K),TV Mini Series
4,5. Chernobyl,2019,5 eps,15,9.3,(876K),TV Mini Series
...,...,...,...,...,...,...,...
245,246. Your Lie in April,20142015,24 eps,12,8.5,(39K),TV Series
246,247. Community,20092015,110 eps,12,8.5,(295K),TV Series
247,248. Tear Along the Dotted Line,2021,6 eps,15,8.6,(15K),TV Mini Series
248,249. Chef's Table,20152019,30 eps,15,8.5,(17K),TV Series


In [63]:
# Check whether the rows with missing values have been removed
# (they are still present, because the dropna() result was not assigned)

missing_values_rows= imdb_data.loc[[46,75,83]]
print(missing_values_rows)

            Titile       Year Total_episodes  Age  Rating Vote_count  \
46       47. As If      2021         55 eps  NaN     9.0      (23K)   
75      76. Gullak      2019         20 eps  NaN     9.1      (23K)   
83  84. Reply 1988  20152016         20 eps  NaN     9.1      (12K)   

     Category  
46  TV Series  
75  TV Series  
83  TV Series  
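
The rows are still there because `dropna()` does not modify the DataFrame it is called on. A minimal sketch of the difference, using a small made-up frame (`sample` and `cleaned` are illustrative names, not part of the notebook):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for imdb_data; row 1 has a missing 'Age'
sample = pd.DataFrame({
    "Title": ["Breaking Bad", "Gullak", "Chernobyl"],
    "Age": ["18", np.nan, "15"],
})

# Calling dropna() without keeping the result leaves `sample` untouched
sample.dropna()
rows_before = len(sample)

# Assigning the result keeps the filtered rows
cleaned = sample.dropna(subset=["Age"])
rows_after = len(cleaned)

print(rows_before, rows_after)
print(cleaned.index.tolist())  # original index labels are preserved
```

If the cleaned data should replace the original, assigning back (`imdb_data = imdb_data.dropna()`) is one option; whether dropping these six shows is desirable depends on the analysis that follows.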


In [64]:
# Check for duplicate rows
print(imdb_data.duplicated().values.any())

False


In [65]:
# Rename 'Titile' to 'Title'

imdb_data= imdb_data.rename(columns={'Titile': 'Title'})

# Remove the leading rank number (e.g. "1. ") from each title
imdb_data['Title'] = imdb_data['Title'].str.replace(r'^\d+\.\s*', '', regex=True)
imdb_data.head()

Unnamed: 0,Title,Year,Total_episodes,Age,Rating,Vote_count,Category
0,Breaking Bad,20082013,62 eps,18,9.5,(2.2M),TV Series
1,Planet Earth II,2016,6 eps,PG,9.5,(159K),TV Mini Series
2,Planet Earth,2006,11 eps,PG,9.4,(221K),TV Mini Series
3,Band of Brothers,2001,10 eps,15,9.4,(533K),TV Mini Series
4,Chernobyl,2019,5 eps,15,9.3,(876K),TV Mini Series


In [66]:
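# Separate the year range into start and end years by splitting on
# the run of non-digit characters between them
year_parts = imdb_data['Year'].str.split(r'\D+', expand=True, regex=True)
imdb_data['StartYear'] = year_parts[0]
imdb_data['EndYear'] = year_parts[1]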
imdb_data.tail()


Unnamed: 0,Title,Year,Total_episodes,Age,Rating,Vote_count,Category,StartYear,EndYear
245,Your Lie in April,20142015,24 eps,12,8.5,(39K),TV Series,2014,2015.0
246,Community,20092015,110 eps,12,8.5,(295K),TV Series,2009,2015.0
247,Tear Along the Dotted Line,2021,6 eps,15,8.6,(15K),TV Mini Series,2021,
248,Chef's Table,20152019,30 eps,15,8.5,(17K),TV Series,2015,2019.0
249,Sapne Vs Everyone,2023,5 eps,Mature,9.4,(67K),TV Series,2023,


In [67]:
imdb_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250 entries, 0 to 249
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Title           250 non-null    object 
 1   Year            250 non-null    object 
 2   Total_episodes  250 non-null    object 
 3   Age             244 non-null    object 
 4   Rating          250 non-null    float64
 5   Vote_count      250 non-null    object 
 6   Category        250 non-null    object 
 7   StartYear       250 non-null    object 
 8   EndYear         207 non-null    object 
dtypes: float64(1), object(8)
memory usage: 17.7+ KB
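
The `info()` output shows that `Total_episodes` and `Vote_count` are also stored as strings (for example `62 eps` and `(2.2M)`). The sketch below shows one possible way to turn such strings into numbers; the `sample` frame and the new column names `Episodes` and `Votes` are assumptions made for illustration, and it assumes only the `K` and `M` suffixes occur.

```python
import pandas as pd

# Hypothetical stand-in using value formats seen in the dataset
sample = pd.DataFrame({
    "Total_episodes": ["62 eps", "6 eps", "190 eps"],
    "Vote_count": ["(2.2M)", "(159K)", "(21K)"],
})

# "62 eps" -> 62: keep only the leading run of digits
sample["Episodes"] = (
    sample["Total_episodes"].str.extract(r"(\d+)", expand=False).astype("int64")
)

# "(2.2M)" -> 2200000: split into a number and an optional K/M suffix
parts = sample["Vote_count"].str.extract(r"([\d.]+)\s*([KM]?)")
multiplier = parts[1].map({"K": 1_000, "M": 1_000_000, "": 1})
sample["Votes"] = (parts[0].astype(float) * multiplier).round().astype("int64")

print(sample[["Episodes", "Votes"]])
```

The `round()` call guards against floating-point noise (such as `2.2 * 1_000_000`) before the cast to integers.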


In [68]:
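# Convert the year columns from strings to integers.
# EndYear contains missing values (None), so the plain int64 cast on it
# fails with the TypeError shown in the output.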
imdb_data['StartYear']= imdb_data['StartYear'].astype('int64')
imdb_data['EndYear']= imdb_data['EndYear'].astype('int64')
imdb_data.info()

TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NoneType'
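
One way around the TypeError above is pandas' nullable integer dtype `Int64` (capital "I"), which can hold whole numbers alongside missing values. The following is a minimal sketch on a made-up frame that mimics the `StartYear` and `EndYear` columns; `sample` and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for the StartYear/EndYear columns built above
sample = pd.DataFrame({
    "StartYear": ["2008", "2021", "2015"],
    "EndYear": ["2013", None, "2019"],
})

# to_numeric converts the strings to numbers and None to NaN
sample["StartYear"] = pd.to_numeric(sample["StartYear"]).astype("int64")

# The nullable "Int64" dtype keeps the missing end year as <NA>
sample["EndYear"] = pd.to_numeric(sample["EndYear"]).astype("Int64")

print(sample.dtypes)
print(sample)
```

An alternative would be to fill the missing end year with the start year for single-year shows before casting, but that changes the meaning of the column, so the nullable dtype is the less intrusive choice here.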