[Link from the first notebook](./01_scraping_notebook_v2.ipynb)

# Exploratory Data Analysis

## A. EDA Set-up

Since this is a new notebook, the required libraries need to be imported again.

In [1]:
# Import basic libraries
import pandas as pd
import numpy as np

In [2]:
# Uncomment the next line if matplotlib-venn is not installed
#!pip install matplotlib-venn

# Import visualisation libraries
import matplotlib.pyplot as plt
from matplotlib_venn import venn2
import seaborn as sns

Importing the extracted data into this notebook.

In [3]:
tropes_df = pd.read_csv('tropes.csv')
wiki_df = pd.read_csv('simpsons_episode_data2.csv')

Adjusting the output view.

In [4]:
# some display adjustments to account for the fact that we have many columns
# and some columns contain many characters

np.set_printoptions(threshold=np.inf)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 800)
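
The `pd.set_option` calls above change the display for the rest of the session. If the wide view is only wanted for a single output, `pd.option_context` can scope the same settings to one block. The following is a minimal sketch; `demo_df` is a made-up frame used only for illustration.

```python
import pandas as pd

# Hypothetical frame with one long text cell, only for illustration.
demo_df = pd.DataFrame({"text": ["x" * 200], "n": [1]})

# Remember the current setting so the restore can be checked afterwards.
default_width = pd.get_option("display.max_colwidth")

# The options apply only inside the with-block and are restored on exit.
with pd.option_context("display.max_colwidth", None, "display.max_columns", None):
    width_inside = pd.get_option("display.max_colwidth")
    print(demo_df)

width_after = pd.get_option("display.max_colwidth")
```

This keeps the global defaults untouched, which can be convenient when only one or two cells need the full text.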

In [5]:
import random

In [6]:
# Check the column names of both dataframes

print(tropes_df.columns)
print(wiki_df.columns)

Index(['Trope Name', 'Trope Description', 'Text Length'], dtype='object')
Index(['Episode Title', 'Full Story', 'Tropes'], dtype='object')


In [7]:
# Summary statistics for the numeric columns of tropes_df
tropes_summary_stats = tropes_df.describe()
print(tropes_summary_stats)

         Text Length
count     552.000000
mean     3322.565217
std      8132.603616
min       286.000000
25%      1339.250000
50%      2141.500000
75%      3345.500000
max    133288.000000
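
The mean (about 3323) sits well above the median (2141.5) and the maximum is 133288, which suggests a few very long descriptions. One common way to flag such rows is the 1.5 × IQR rule; this is a convention assumed here for illustration, not something the notebook itself applies. With the quartiles printed above, the upper fence would be 3345.5 + 1.5 × 2006.25 = 6354.875. The sketch below uses made-up lengths rather than the real data.

```python
import pandas as pd

# Hypothetical text lengths, not the real tropes data.
lengths = pd.Series([300, 1300, 2100, 3300, 130000], name="Text Length")

# Quartiles and interquartile range.
q1 = lengths.quantile(0.25)
q3 = lengths.quantile(0.75)
iqr = q3 - q1

# Values above the upper fence are treated as unusually long.
upper_fence = q3 + 1.5 * iqr
outliers = lengths[lengths > upper_fence]
print(outliers.tolist())
```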


In [8]:
# Summary statistics for wiki_df (all columns are text, so counts and uniques are shown)
wiki_summary_stats = wiki_df.describe()
print(wiki_summary_stats)

                            Episode Title  \
count                                  13   
unique                                 13   
top     Simpsons Roasting on an Open Fire   
freq                                    1   

In [9]:
# Check data types of each column for wiki_df
wiki_data_types = wiki_df.dtypes
print(wiki_data_types)

# Check for missing values in each column
wiki_missing_values = wiki_df.isnull().sum()
print(wiki_missing_values)

Episode Title    object
Full Story       object
Tropes           object
dtype: object
Episode Title    0
Full Story       0
Tropes           0
dtype: int64


In [10]:
# Doing the same for tropes_df
tropes_data_types = tropes_df.dtypes
print(tropes_data_types)

# Check for missing values in each column
tropes_missing_values = tropes_df.isnull().sum()
print(tropes_missing_values)

Trope Name           object
Trope Description    object
Text Length           int64
dtype: object
Trope Name           0
Trope Description    0
Text Length          0
dtype: int64
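
A zero count from `isnull()` does not rule out cells that contain only whitespace or an empty string, which can happen with scraped text. A minimal sketch of that extra check follows; `demo_df` and its values are made up, and the list of text columns is an assumption based on the column names printed earlier.

```python
import pandas as pd

# Hypothetical scraped rows; column names mirror tropes_df, values are made up.
demo_df = pd.DataFrame({
    "Trope Name": ["Trope A", "Trope B", "Trope C"],
    "Trope Description": ["Some text", "   ", ""],
})

# Only text columns support the .str accessor, so list them explicitly.
text_columns = ["Trope Name", "Trope Description"]

# isnull() reports nothing missing here.
null_counts = demo_df.isnull().sum()

# Stripping whitespace reveals the blank descriptions.
blank_counts = demo_df[text_columns].apply(lambda col: col.str.strip().eq("").sum())
print(null_counts.to_dict(), blank_counts.to_dict())
```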


In [11]:
# For each list-type column in wiki_df, count the unique values across all of its lists
for column in wiki_df.columns:
    if isinstance(wiki_df[column].iloc[0], list):
        unique_values = []
        for sublist in wiki_df[column]:
            for val in sublist:
                if val not in unique_values:
                    unique_values.append(val)
        unique_values_count = len(unique_values)
        print(f"Number of unique values in '{column}': {unique_values_count}")
    else:
        print(f"Skipped column '{column}' as it is not a list-type column.")

Skipped column 'Episode Title' as it is not a list-type column.
Skipped column 'Full Story' as it is not a list-type column.
Skipped column 'Tropes' as it is not a list-type column.
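
Every column is skipped because `pd.read_csv` returns text cells as strings, so a list written to CSV comes back as a string, not a list. If the 'Tropes' cells hold stringified Python lists (an assumption; the actual cell format is not shown in this notebook), they could be parsed with `ast.literal_eval` before counting. The sketch below demonstrates the round trip on a made-up frame.

```python
import ast
import io

import pandas as pd

# Hypothetical stand-in for wiki_df; the real 'Tropes' cell format is an assumption.
demo_df = pd.DataFrame({
    "Episode Title": ["Episode A", "Episode B"],
    "Tropes": [["Trope X", "Trope Y"], ["Trope Y", "Trope Z"]],
})

# Round-trip through CSV text: list cells come back as plain strings.
buffer = io.StringIO()
demo_df.to_csv(buffer, index=False)
buffer.seek(0)
reloaded_df = pd.read_csv(buffer)
came_back_as_str = isinstance(reloaded_df["Tropes"].iloc[0], str)

# Parse the stringified lists back into real Python lists.
reloaded_df["Tropes"] = reloaded_df["Tropes"].apply(ast.literal_eval)

# Flatten with explode() and count the distinct values.
unique_trope_count = reloaded_df["Tropes"].explode().nunique()
print(came_back_as_str, unique_trope_count)
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for text that came from a scrape.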


In [12]:
# Do the same for tropes_df
for column in tropes_df.columns:
    if isinstance(tropes_df[column].iloc[0], list):
        unique_values = []
        for sublist in tropes_df[column]:
            for val in sublist:
                if val not in unique_values:
                    unique_values.append(val)
        unique_values_count = len(unique_values)
        print(f"Number of unique values in '{column}': {unique_values_count}")
    else:
        print(f"Skipped column '{column}' as it is not a list-type column.")

Skipped column 'Trope Name' as it is not a list-type column.
Skipped column 'Trope Description' as it is not a list-type column.
Skipped column 'Text Length' as it is not a list-type column.


In [16]:
import sweetviz as sv

# Build the report and save it; the output path must be passed as a string
analyze_report = sv.analyze(tropes_df)
analyze_report.show_html('report.html', open_browser=False)

As both datasets are text-based, they will need to be tokenized.
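
As a rough illustration of what tokenization means here, a minimal regex-based sketch follows. The `tokenize` helper is hypothetical and the sentence is made up; the analysis may well rely on a dedicated NLP library instead.

```python
import re


def tokenize(text):
    """Lowercase the text and return runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


# Made-up sentence, only for illustration.
sample_tokens = tokenize("Homer's plan fails, again!")
print(sample_tokens)
```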

But first we need to choose which 10 tropes to use.

In [None]:
tropes_df.head()

In [None]:
# Set the random seed for reproducibility
random.seed(42)

# Select 10 random trope names from the 'Trope Name' column
selected_tropes = random.sample(tropes_df['Trope Name'].tolist(), 10)

# Print the selected tropes
print(selected_tropes)

In [None]:
# Extract the 10 randomly selected tropes from tropes_df

# Filter the DataFrame to select rows corresponding to the chosen tropes
selected_tropes_df = tropes_df[tropes_df['Trope Name'].isin(selected_tropes)]

# Now, selected_tropes_df contains the rows associated with the selected tropes
selected_tropes_df
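
The two steps above (sampling names, then filtering with `isin`) could also be done in one call with `DataFrame.sample`. This is an alternative sketch, not the notebook's method, and it will not pick the same tropes as `random.sample` with seed 42. One difference worth noting: sampling rows always returns exactly 10 rows, whereas filtering by name could return more if a trope name appears more than once. `demo_df` is a made-up stand-in for `tropes_df`.

```python
import pandas as pd

# Hypothetical stand-in for tropes_df with 20 made-up trope names.
demo_df = pd.DataFrame({
    "Trope Name": [f"Trope {i}" for i in range(20)],
    "Text Length": range(100, 2100, 100),
})

# Draw 10 distinct rows; random_state makes the draw repeatable.
sampled_df = demo_df.sample(n=10, random_state=42)

# A second draw with the same random_state returns the same rows.
resampled_df = demo_df.sample(n=10, random_state=42)

print(sampled_df["Trope Name"].tolist())
```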