# PA Project - Team 5 : Code File 2/3
## Caglar Dogan - Ekantika Singh - Gurmehr Sohi

The code in this file reads "IMDB_movie_data.csv", which contains information about the first 2000 feature movies on IMDB (ranked by IMDB popularity score) that were produced in the USA, are available in English, and were released between 2016-01-01 and 2019-12-31. It then pre-processes the data by converting the string features into forms directly usable in our models, cleans the data, and gathers the needed sentiment metrics for each movie.

During this process, an intermediate csv file with the name "clean_data.csv" is generated, which includes cleaned and pre-processed data without sentiment information.

The end result is saved into a CSV file with the name "model_data.csv", from where it can be read and used for further data exploration and modeling.

## Importing the Data

In [1]:
#importing required Libraries
import pandas as pd 
import numpy as np 
import requests #to send HTTP requests
from bs4 import BeautifulSoup #to parse the files acquired through requests
import regex as re #to be able to clean the imported data

Here, we read the data from the 'IMDB_movie_data.csv' file.


In [2]:
df = pd.read_csv('IMDB_movie_data.csv', index_col=0)
df

Unnamed: 0,Name of movie,Watchtime,Budget,Opening Weekend Us And Canada,Openning Weekend Date
0,The Predator,"<span class=""runtime"">107 min</span>","$88,000,000 estimated","$24,632,284","Sep 16, 2018"
1,The Nice Guys,"<span class=""runtime"">116 min</span>","$50,000,000 estimated","$11,203,270","May 22, 2016"
2,Once Upon a Time in... Hollywood,"<span class=""runtime"">161 min</span>","$90,000,000 estimated","$41,082,018","Jul 28, 2019"
3,Get Out,"<span class=""runtime"">104 min</span>","$4,500,000 estimated","$33,377,060","Feb 26, 2017"
4,The Informer,"<span class=""runtime"">113 min</span>",,"$133,475","Nov 8, 2020"
...,...,...,...,...,...
1995,Diane,"<span class=""runtime"">95 min</span>",,"$24,467","Mar 31, 2019"
1996,All Creatures Here Below,"<span class=""runtime"">91 min</span>",,,
1997,Warning Shot,"<span class=""runtime"">90 min</span>","$2,000,000 estimated",,
1998,Assholes,"<span class=""runtime"">74 min</span>",,,


In [3]:
df.columns

Index(['Name of movie', 'Watchtime', 'Budget', 'Opening Weekend Us And Canada',
       'Openning Weekend Date'],
      dtype='object')

## Data Pre-Processing

First, we define a function to extract numerical values from a string column (to extract run time in minutes) and another function to extract numerical values from a string column denoting a USD value (to extract budget/box office gross information).

In [4]:
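#Function to extract numerical data
def getNum(s):
    try:
      #Drop every character except digits and dots, then parse what is left
      rgx = r"[^\d\.]"
      num = float(re.sub(rgx, "", str(s)))
      return num
    except ValueError:
      #Nothing numeric was left (e.g. missing values)
      return np.nan

#Function to extract numerical currency data
def getNumUsd(s):
    try:
      #Only accept values given in USD (starting with "$")
      if s[0] != "$":
        return np.nan
      else:
        rgx = r"[^\d\.]"
        num = float(re.sub(rgx, "", str(s)))
        return num
    except (ValueError, TypeError, IndexError):
      #Missing values (NaN floats) cannot be indexed; empty strings have no s[0]
      return np.nan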
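
To illustrate what the two helpers above do, the following self-contained sketch applies the same regular-expression idea to a few values taken from the table shown earlier. The names `strip_to_float` and `usd_to_float` are illustrative stand-ins rather than part of the notebook, and the standard-library `re` module is assumed to behave like `regex` for this simple pattern.

```python
import math
import re

NON_NUMERIC = re.compile(r"[^\d.]")  # matches everything except digits and dots

def strip_to_float(value):
    """Hypothetical stand-in for getNum: keep digits/dots and parse as float."""
    try:
        return float(NON_NUMERIC.sub("", str(value)))
    except ValueError:
        # Nothing numeric was left after stripping
        return math.nan

def usd_to_float(value):
    """Hypothetical stand-in for getNumUsd: only accept strings starting with '$'."""
    if not isinstance(value, str) or not value.startswith("$"):
        return math.nan
    return strip_to_float(value)

runtime = strip_to_float('<span class="runtime">107 min</span>')
budget = usd_to_float("$88,000,000 estimated")
missing = usd_to_float(float("nan"))  # missing budgets are read in as NaN floats
print(runtime, budget, missing)
```

Because every non-digit character is stripped, this approach only works when a cell contains a single number; a string holding two separate numbers would be merged into one.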

Then, we process the relevant columns to extract numerical data:


In [5]:
#Process the df columns to get pure numeric values

for col in ['Watchtime']:
    df[col] = df[col].apply(getNum)

for col in ['Budget',
       'Opening Weekend Us And Canada']:
    df[col] = df[col].apply(getNumUsd)

In [6]:
df

Unnamed: 0,Name of movie,Watchtime,Budget,Opening Weekend Us And Canada,Openning Weekend Date
0,The Predator,107.0,88000000.0,24632284.0,"Sep 16, 2018"
1,The Nice Guys,116.0,50000000.0,11203270.0,"May 22, 2016"
2,Once Upon a Time in... Hollywood,161.0,90000000.0,41082018.0,"Jul 28, 2019"
3,Get Out,104.0,4500000.0,33377060.0,"Feb 26, 2017"
4,The Informer,113.0,,133475.0,"Nov 8, 2020"
...,...,...,...,...,...
1995,Diane,95.0,,24467.0,"Mar 31, 2019"
1996,All Creatures Here Below,91.0,,,
1997,Warning Shot,90.0,2000000.0,,
1998,Assholes,74.0,,,


Now, we define functions to read date information in the format provided by IMDB ("Mon DD, YYYY", e.g. "Sep 16, 2018") and convert it to the year-month-day format used in our social-media search functions (YYYY-MM-DD without zero-padding, e.g. "2018-9-16").

In [7]:
#taken from https://stackoverflow.com/questions/3418050/month-name-to-month-number-and-vice-versa-in-python
def month_string_to_number(string):
    m = {
        'jan': 1,
        'feb': 2,
        'mar': 3,
        'apr': 4,
        'may': 5,
        'jun': 6,
        'jul': 7,
        'aug': 8,
        'sep': 9,
        'oct': 10,
        'nov': 11,
        'dec': 12
        }
    s = string.strip()[:3].lower()

    try:
        out = m[s]
        return out
    except KeyError:
        raise ValueError('Not a month')

#Function to extract YYYY-MM-DD from "Mon DD, YYYY" (month and day are not zero-padded)
def getDate(s):
    try:
      parts = s.replace(",", "").split(" ")
      return(parts[2] + "-" + str(month_string_to_number(parts[0])) + "-" + parts[1])
    except (AttributeError, IndexError, ValueError):
      #Missing or malformed dates
      return np.nan
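
As an alternative sketch of the same conversion, the standard library can parse the "Mon DD, YYYY" format directly. The helper name `imdb_date_to_iso` is hypothetical, and the `%b` directive assumes an English (or C) locale for the month abbreviations. Unlike the function above, this version zero-pads the month and day.

```python
from datetime import datetime

def imdb_date_to_iso(text):
    """Hypothetical helper: 'Sep 16, 2018' -> '2018-09-16'; None if unparseable."""
    try:
        # %b = abbreviated month name, %d = day of month, %Y = four-digit year
        return datetime.strptime(text, "%b %d, %Y").strftime("%Y-%m-%d")
    except (TypeError, ValueError):
        # TypeError: non-string input such as NaN; ValueError: unexpected format
        return None

print(imdb_date_to_iso("Sep 16, 2018"))
print(imdb_date_to_iso("Nov 8, 2020"))
print(imdb_date_to_iso(float("nan")))
```

Either form can be split on "-" and converted to integers, so the choice mainly matters if the date strings are later sorted or compared as text.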

Then, we process the opening weekend date column as needed and store the result as a new feature.

In [8]:
#Process the opening weekend date data and store the result as a new feature
df["Openning Weekend Date 2"] = df["Openning Weekend Date"].apply(lambda x: getDate(x))

In [9]:
df

Unnamed: 0,Name of movie,Watchtime,Budget,Opening Weekend Us And Canada,Openning Weekend Date,Openning Weekend Date 2
0,The Predator,107.0,88000000.0,24632284.0,"Sep 16, 2018",2018-9-16
1,The Nice Guys,116.0,50000000.0,11203270.0,"May 22, 2016",2016-5-22
2,Once Upon a Time in... Hollywood,161.0,90000000.0,41082018.0,"Jul 28, 2019",2019-7-28
3,Get Out,104.0,4500000.0,33377060.0,"Feb 26, 2017",2017-2-26
4,The Informer,113.0,,133475.0,"Nov 8, 2020",2020-11-8
...,...,...,...,...,...,...
1995,Diane,95.0,,24467.0,"Mar 31, 2019",2019-3-31
1996,All Creatures Here Below,91.0,,,,
1997,Warning Shot,90.0,2000000.0,,,
1998,Assholes,74.0,,,,


Now, we select the columns relevant for future analysis and rename them for convenience.

In [10]:
df_subset = df[['Name of movie', 'Watchtime', 'Budget',
       'Opening Weekend Us And Canada', 'Openning Weekend Date 2']]

df_subset = df_subset.rename(columns={'Name of movie': 'name', 'Watchtime': 'watchtime', 
                                    'Budget': 'budget',
                                    'Opening Weekend Us And Canada': 'weekend_gross_us_canada', 
                                    'Openning Weekend Date 2': 'weekend_date'})

In [11]:
df_subset

Unnamed: 0,name,watchtime,budget,weekend_gross_us_canada,weekend_date
0,The Predator,107.0,88000000.0,24632284.0,2018-9-16
1,The Nice Guys,116.0,50000000.0,11203270.0,2016-5-22
2,Once Upon a Time in... Hollywood,161.0,90000000.0,41082018.0,2019-7-28
3,Get Out,104.0,4500000.0,33377060.0,2017-2-26
4,The Informer,113.0,,133475.0,2020-11-8
...,...,...,...,...,...
1995,Diane,95.0,,24467.0,2019-3-31
1996,All Creatures Here Below,91.0,,,
1997,Warning Shot,90.0,2000000.0,,
1998,Assholes,74.0,,,


Now, we can clean this DataFrame and save the result into a .csv file ('clean_data.csv') as follows:

In [12]:
clean_df = df_subset.dropna().reset_index(drop=True).copy()

In [13]:
clean_df.to_csv('clean_data.csv')
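
The effect of the cleaning step can be seen on a small example. The sketch below builds a toy frame from four rows of the table above (only two numeric columns are kept for brevity) and applies the same `dropna().reset_index(drop=True)` chain; the variable names `toy` and `clean` are illustrative.

```python
import numpy as np
import pandas as pd

# Four rows taken from the table above; NaN marks values missing on IMDB
toy = pd.DataFrame({
    "name": ["The Predator", "The Informer", "Warning Shot", "Get Out"],
    "budget": [88000000.0, np.nan, 2000000.0, 4500000.0],
    "weekend_gross_us_canada": [24632284.0, 133475.0, np.nan, 33377060.0],
})

# dropna() removes every row with at least one missing value;
# reset_index(drop=True) renumbers the surviving rows from 0
clean = toy.dropna().reset_index(drop=True)
print(clean)
```

Only rows with a complete set of features survive, which is why the cleaned data contains far fewer movies than the raw file.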

## Defining Functions to Retrieve Sentiment and Popularity Data

Here, we define functions to get sentiment scores for each movie using VADER and TextBlob to be able to compare them. We also include functionality to calculate a popularity score using the Reddit scores of the top posts.

To get these scores, we use data from the Reddit Pushshift API.


First, we install the necessary libraries:

In [14]:
#Install the Pushshift API
#Documentation:
#https://github.com/pushshift/api
#https://github.com/dmarx/psaw
#https://psaw.readthedocs.io/en/latest/

!pip install psaw

#Install swifter and the necessary dependencies
#To be able to apply functions on Pandas DataFrames concurrently:
#(This helps reduce the time needed to collect sentiment data)
!pip install -U pandas # upgrade pandas
!pip install swifter # first time installation
!pip install swifter[groupby] # first time installation including dependency for groupby.apply functionality

!pip install -U swifter # upgrade to latest version if already installed
!pip install -U swifter[groupby] # upgrade to latest version to include dependency for gr

!pip install flair

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting psaw
  Downloading psaw-0.1.0-py3-none-any.whl (15 kB)
Installing collected packages: psaw
Successfully installed psaw-0.1.0
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting swifter
  Downloading swifter-1.3.4.tar.gz (830 kB)
Collecting psutil>=5.6.6
  Downloading psutil-5.9.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (281 kB)
Collecting jedi>=0.10
  Downloading jedi-0.18.1-py2.py3-none-any.whl (1.6 MB)
Building wheels for collected packages: swifter
  Building wheel for swifter (setu

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting ray>=1.0.0
  Downloading ray-2.0.0-cp37-cp37m-manylinux2014_x86_64.whl (59.4 MB)
Collecting grpcio<=1.43.0,>=1.28.1
  Downloading grpcio-1.43.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.1 MB)
Collecting virtualenv
  Downloading virtualenv-20.16.3-py2.py3-none-any.whl (8.8 MB)
Collecting platformdirs<3,>=2.4
  Downloading platformdirs-2.5.2-py3-none-any.whl (14 kB)
Collecting distlib<1,>=0.3.5
  Downloading distlib-0.3.5-py2.py3-none-any.whl (466 kB)
Installing collected packages: platformdirs, distlib, virtualenv, grpcio, ray
  Attempting uninstall: grpcio
    Found existing installation: grpcio 1.47.0
    Uninsta

Then, we define a function to return the text, Reddit score, and post date information for the top 100 posts for any movie name in a given time period. (We tried other limits for the returned number of posts and decided that 100 worked the best as a limit.)

In [15]:
#re-importing some required libraries
import pandas as pd   #to create dataframe
import requests       #to send the request to the URL
from bs4 import BeautifulSoup #to get the content in the form of HTML
import numpy as np  # to count the values (in our case)
import regex as re

#Importing new libraries for this part:
from psaw import PushshiftAPI
from datetime import datetime, timedelta

#General Parameters
default_num_days = 5
default_score_limit= 0
default_query_limit = 100 #Number of posts to return (max value: 500)

default_search_subreddit = ''

api = PushshiftAPI()

#-----     -----     -----     -----     -----

#Here, we create a function to search up any term and return relevant information.
#We can then utilize this for each movie to get sentiment data.

#datetime(yyyy,mm,dd)

def searchReddit(query_term, query_end_time, search_subreddit = default_search_subreddit, num_days = default_num_days, 
           score_limit = default_score_limit, query_limit = default_query_limit):
  
  timeslot = timedelta(days=num_days)
  start_epoch= int((query_end_time - timeslot).timestamp())
  end_epoch = int(query_end_time.timestamp())

  submissionlist = list(api.search_submissions(q=query_term,
                        after=start_epoch,
                        before = end_epoch,
                        score = (">"+str(score_limit)),
                        #subreddit=search_subreddit, #Not used
                        sort_type='score',
                        sort='desc',
                        filter=['title','selftext','created_utc','score'], 
                        #possible additions: 'num_comments','subreddit'
                        limit=query_limit))
  
  redditDf = pd.DataFrame([((s.title + " " + s.selftext),
                            s.score,
                            datetime.utcfromtimestamp(int(s.created_utc))
                            .strftime('%Y-%m-%d'))
                            for s in submissionlist],
                            columns=['text','reddit_score','date'])
  
  return redditDf
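
The time-window arithmetic inside the search function can be checked without calling the API. The sketch below reproduces only that step; `query_window` is a hypothetical helper name. It uses timezone-aware UTC datetimes, which is an assumption that differs from the notebook: the naive `datetime` objects used above are interpreted in the machine's local time zone when `.timestamp()` is called.

```python
from datetime import datetime, timedelta, timezone

def query_window(end_time, num_days=5):
    """Hypothetical helper: (start_epoch, end_epoch) for the num_days before end_time."""
    start_time = end_time - timedelta(days=num_days)
    return int(start_time.timestamp()), int(end_time.timestamp())

# Example end time, pinned to UTC so the result does not depend on the local time zone
end = datetime(2018, 9, 14, tzinfo=timezone.utc)
start_epoch, end_epoch = query_window(end)

print(start_epoch, end_epoch)
print(datetime.fromtimestamp(start_epoch, tz=timezone.utc).date())
```

The two integers correspond to the `after` and `before` arguments passed to the search call.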

Now, we import the necessary libraries for sentiment analysis.

In [16]:
import swifter
import pandas as pd 

import nltk
nltk.download('vader_lexicon')
nltk.download('punkt')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk import tokenize

from textblob import TextBlob

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Then, we can define functions to be able to assign sentiment scores to a given movie/movie DataFrame using our Reddit search function as follows:

In [17]:
sid = SentimentIntensityAnalyzer()

#Turns "YYYY-MM-DD" string to datetime object
def datetimeFromString(dateStr: str):
    date_arr = dateStr.split("-")
    return datetime(int(date_arr[0]),int(date_arr[1]),int(date_arr[2]))

#Gets a list of strings representing the last days before end_date
def last_days(end_date: str, num_days: int):
    end_datetime = datetimeFromString(end_date)
    days = [(end_datetime - timedelta(days=i)).strftime('%Y-%m-%d') for i in range(num_days,0,-1)]
    return days

#Given a movie name and release date, returns the mean VADER sentiment, TextBlob polarity,
#and TextBlob subjectivity of the top Reddit posts from the 5-day window ending 2 days
#before the given date, along with the sum of their Reddit scores (popularity)
def getSentimentForMovie(movie_name: str, release_date: str):
    try:
        release_datetime = datetimeFromString(release_date) - timedelta(days = 2)
        reddit_df = searchReddit(movie_name , release_datetime)

        if len(reddit_df) == 0:
          return (np.nan,np.nan,np.nan,0)

        #Calculate sentiment score for each item in df:
        
        reddit_df["sentences"] = reddit_df["text"].apply(tokenize.sent_tokenize)

        #Each score is averaged over the sentences of a post, then over all posts
        #"vader_sentiment","text_blob_polarity","text_blob_subjectivity","flair_sentiment"
        vader_sentiment = np.nanmean(reddit_df["sentences"].apply(lambda l:
                          np.nanmean(list(map(lambda s: sid.polarity_scores(s)["compound"],l)))))

        blob_sentiment_polarity = np.nanmean(reddit_df["sentences"].apply(lambda l:
                                  np.nanmean(list(map(lambda s: TextBlob(s).sentiment.polarity,l)))))
        
        blob_sentiment_subjectivity = np.nanmean(reddit_df["sentences"].apply(lambda l:
                                      np.nanmean(list(map(lambda s: TextBlob(s).sentiment.subjectivity,l)))))
        
        #takes too long
        #flair_sentiment = np.nanmean(reddit_df["sentences"].apply(lambda l:
                                      #np.nanmean(list(map(lambda s: getFlairSentiment(s),l))).mean()))

        popularity = reddit_df["reddit_score"].sum()

        return (vader_sentiment, blob_sentiment_polarity,blob_sentiment_subjectivity,popularity)
    except Exception:
        #Any failure (e.g. a network error or a malformed date) yields missing values
        return (np.nan,np.nan,np.nan,np.nan)

#Given a DataFrame consisting of cleaned IMDB information for movies,
#Adds the needed sentiment information
def addMultipleSentimentToDf(movie_df): #.allow_dask_on_strings()
  movie_df[["vader_sentiment","text_blob_polarity","text_blob_subjectivity","popularity"]] = movie_df.swifter.allow_dask_on_strings().apply(
      lambda x: getSentimentForMovie(x['name'], x['weekend_date'])
      ,axis=1, result_type ='expand')
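
The aggregation used above is a mean of per-post means: each post's sentences are scored and averaged, and the post averages are then averaged again. The sketch below isolates that logic. To stay runnable without downloading any lexicon, it swaps in a deliberately tiny, made-up word list (`TOY_LEXICON`) and scorer (`toy_sentence_score`) in place of VADER or TextBlob; only the averaging structure mirrors the notebook.

```python
import numpy as np

# Made-up scores for illustration only; NOT the VADER or TextBlob lexicon
TOY_LEXICON = {"great": 1.0, "good": 0.5, "bad": -0.5, "awful": -1.0}

def toy_sentence_score(sentence):
    """Hypothetical scorer: mean lexicon value of known words, 0.0 if none match."""
    words = [w.strip(".,!?").lower() for w in sentence.split()]
    hits = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    return float(np.mean(hits)) if hits else 0.0

def mean_of_post_means(posts, score=toy_sentence_score):
    """Average sentence scores within each post, then average across posts."""
    post_means = [np.mean([score(s) for s in sentences])
                  for sentences in posts if sentences]
    return float(np.mean(post_means)) if post_means else float("nan")

posts = [
    ["Great trailer.", "The cast looks good."],  # post mean: (1.0 + 0.5) / 2 = 0.75
    ["Awful poster."],                           # post mean: -1.0
]
result = mean_of_post_means(posts)  # (0.75 + -1.0) / 2
print(result)
```

One consequence of this design is that every post carries equal weight regardless of its length or Reddit score, so a one-sentence post influences the final value as much as a long one.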
  

## Collecting the Sentiment Information for Movies and Saving The Relevant Results

First, we import the cleaned data:

In [18]:
df = pd.read_csv("clean_data.csv",index_col=0).drop_duplicates(subset=['name']).reset_index(drop=True)

In [19]:
movie_data = df.copy()

In [20]:
movie_data

Unnamed: 0,name,watchtime,budget,weekend_gross_us_canada,weekend_date
0,The Predator,107.0,88000000.0,24632284.0,2018-9-16
1,The Nice Guys,116.0,50000000.0,11203270.0,2016-5-22
2,Once Upon a Time in... Hollywood,161.0,90000000.0,41082018.0,2019-7-28
3,Get Out,104.0,4500000.0,33377060.0,2017-2-26
4,Midsommar,148.0,9000000.0,6560030.0,2019-7-7
...,...,...,...,...,...
620,Inside the Rain,90.0,1000000.0,8140.0,2020-3-15
621,Hacker,95.0,2000000.0,6716.0,2016-12-4
622,All Saints,108.0,2000000.0,1514278.0,2017-8-27
623,Indivisible,119.0,2700000.0,1503101.0,2018-10-28


Then, we gather social media information and add sentiment data to our DataFrame using the functions defined in the previous parts:

(Please note that this part is likely to continue running for half an hour or longer)

In [21]:
addMultipleSentimentToDf(movie_data)

Dask Apply:   0%|          | 0/4 [00:00<?, ?it/s]

Now, we can see the DataFrame containing the merged IMDB and social media information:

In [22]:
movie_data

Unnamed: 0,name,watchtime,budget,weekend_gross_us_canada,weekend_date,vader_sentiment,text_blob_polarity,text_blob_subjectivity,popularity
0,The Predator,107.0,88000000.0,24632284.0,2018-9-16,0.048495,0.062085,0.316292,35577.0
1,The Nice Guys,116.0,50000000.0,11203270.0,2016-5-22,0.128721,0.094116,0.376873,37543.0
2,Once Upon a Time in... Hollywood,161.0,90000000.0,41082018.0,2019-7-28,0.115208,0.070464,0.223233,7859.0
3,Get Out,104.0,4500000.0,33377060.0,2017-2-26,0.070698,0.058603,0.356233,489735.0
4,Midsommar,148.0,9000000.0,6560030.0,2019-7-7,0.085213,0.073727,0.330961,4627.0
...,...,...,...,...,...,...,...,...,...
620,Inside the Rain,90.0,1000000.0,8140.0,2020-3-15,0.024542,0.032651,0.320437,103.0
621,Hacker,95.0,2000000.0,6716.0,2016-12-4,-0.113417,0.023510,0.297022,16007.0
622,All Saints,108.0,2000000.0,1514278.0,2017-8-27,0.081345,0.069605,0.383614,3112.0
623,Indivisible,119.0,2700000.0,1503101.0,2018-10-28,0.076392,0.090659,0.306311,21.0


Here, we also add information about the return on investment on the opening weekend.

In [23]:
movie_data["weekend_roi"] = movie_data["weekend_gross_us_canada"]/movie_data["budget"]
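
As a quick arithmetic check, the ratio can be reproduced for two movies from the tables above. The frame `toy` is illustrative and contains only the columns involved.

```python
import pandas as pd

# Budget and opening-weekend gross for two movies from the tables above
toy = pd.DataFrame({
    "name": ["The Predator", "Get Out"],
    "budget": [88000000.0, 4500000.0],
    "weekend_gross_us_canada": [24632284.0, 33377060.0],
})

# Opening-weekend ROI = opening-weekend gross divided by budget
toy["weekend_roi"] = toy["weekend_gross_us_canada"] / toy["budget"]
print(toy[["name", "weekend_roi"]])
```

A value above 1 means the movie earned back more than its budget in the opening weekend alone.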

Now, we inspect the movies with the highest popularity scores. This check is necessary because our social media data collection searches by movie name and can thus give misleading results for movies whose names are common English words or phrases:

In [24]:
movie_data[movie_data['popularity'] > 300000].sort_values(by='popularity', axis=0, ascending=False)

Unnamed: 0,name,watchtime,budget,weekend_gross_us_canada,weekend_date,vader_sentiment,text_blob_polarity,text_blob_subjectivity,popularity,weekend_roi
9,It,135.0,35000000.0,123403419.0,2017-9-10,0.035815,0.05927,0.236053,2031721.0,3.525812
16,After,105.0,14000000.0,6002349.0,2019-4-14,-0.034392,0.056197,0.26936,1649532.0,0.428739
7,Us,116.0,20000000.0,71117625.0,2019-3-24,-0.007091,0.044347,0.280204,1373401.0,3.555881
501,Little,109.0,20000000.0,15405455.0,2019-4-14,0.074802,-0.012798,0.424217,952168.0,0.770273
130,Life,104.0,58000000.0,12501936.0,2017-3-26,0.015076,0.036583,0.318954,843739.0,0.215551
601,The Star,86.0,20000000.0,9812674.0,2017-11-19,-0.056561,0.024023,0.376176,636913.0,0.490634
3,Get Out,104.0,4500000.0,33377060.0,2017-2-26,0.070698,0.058603,0.356233,489735.0,7.417124
425,The House,88.0,40000000.0,8724795.0,2017-7-2,0.004066,0.019476,0.293525,424164.0,0.21812
201,Yesterday,116.0,26000000.0,17010050.0,2019-6-30,0.087757,0.079485,0.305711,419167.0,0.654233
503,The Kid,100.0,8000000.0,514286.0,2019-3-10,-0.011428,0.017583,0.324825,414679.0,0.064286


As we can see, the movies with popularity scores above 350000 all have names that are common words or phrases in everyday English. Their popularity scores are therefore generally disproportionate to their other attributes.

Thus, we remove these movies from our dataset before proceeding to the next steps. This action should be noted when interpreting our results, as it filters the observations. Nevertheless, the group removed for its popularity score consists of only the ten movies listed above; thus, our results would still be expected to be generalizable. Note that the comparison in the next cell also drops any movie whose popularity is NaN, since a comparison with NaN evaluates to False.

In [25]:
rslt_df = movie_data[movie_data['popularity'] < 350000].reset_index(drop=True)
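
A side effect of filtering with `<` is worth seeing on a small example: a comparison against NaN is False, so rows with a missing popularity are dropped together with the very popular ones. The sketch below uses two popularity values from the tables above plus one made-up row with a missing value; the frame and variable names are illustrative.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "name": ["The Predator", "It", "hypothetical movie without data"],
    "popularity": [35577.0, 2031721.0, np.nan],
})

# Same pattern as the cell above: NaN < 350000 is False, so the NaN row is dropped too
kept = toy[toy["popularity"] < 350000].reset_index(drop=True)

# Alternative that removes only the rows at or above the threshold and keeps NaN rows
kept_with_nan = toy[~(toy["popularity"] >= 350000)].reset_index(drop=True)

print(kept)
print(kept_with_nan)
```

Which variant is preferable depends on whether rows without popularity data should remain available for later steps.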

In [26]:
rslt_df #Inspect the last form of the movie DataFrame

Unnamed: 0,name,watchtime,budget,weekend_gross_us_canada,weekend_date,vader_sentiment,text_blob_polarity,text_blob_subjectivity,popularity,weekend_roi
0,The Predator,107.0,88000000.0,24632284.0,2018-9-16,0.048495,0.062085,0.316292,35577.0,0.279912
1,The Nice Guys,116.0,50000000.0,11203270.0,2016-5-22,0.128721,0.094116,0.376873,37543.0,0.224065
2,Once Upon a Time in... Hollywood,161.0,90000000.0,41082018.0,2019-7-28,0.115208,0.070464,0.223233,7859.0,0.456467
3,Midsommar,148.0,9000000.0,6560030.0,2019-7-7,0.085213,0.073727,0.330961,4627.0,0.728892
4,Avengers: Endgame,181.0,356000000.0,357115007.0,2019-4-28,0.070877,0.050050,0.234112,218147.0,1.003132
...,...,...,...,...,...,...,...,...,...,...
582,Inside the Rain,90.0,1000000.0,8140.0,2020-3-15,0.024542,0.032651,0.320437,103.0,0.008140
583,Hacker,95.0,2000000.0,6716.0,2016-12-4,-0.113417,0.023510,0.297022,16007.0,0.003358
584,All Saints,108.0,2000000.0,1514278.0,2017-8-27,0.081345,0.069605,0.383614,3112.0,0.757139
585,Indivisible,119.0,2700000.0,1503101.0,2018-10-28,0.076392,0.090659,0.306311,21.0,0.556704


Now, we save our results into the 'model_data.csv' file.

In [27]:
rslt_df.dropna().reset_index(drop=True).to_csv('model_data.csv')

## Preliminary Inspection of the Results

We can see the correlations of all features with our target variable "weekend_gross_us_canada" and our newly defined attribute "weekend_roi" in the final DataFrame as follows:

In [28]:
rslt_df.corr()["weekend_gross_us_canada"]

watchtime                  0.324843
budget                     0.771990
weekend_gross_us_canada    1.000000
vader_sentiment           -0.072593
text_blob_polarity        -0.028202
text_blob_subjectivity    -0.106094
popularity                 0.207618
weekend_roi                0.191306
Name: weekend_gross_us_canada, dtype: float64

In [29]:
rslt_df.corr()["weekend_roi"]

watchtime                 -0.180228
budget                    -0.124756
weekend_gross_us_canada    0.191306
vader_sentiment            0.013745
text_blob_polarity        -0.054121
text_blob_subjectivity    -0.000234
popularity                 0.095318
weekend_roi                1.000000
Name: weekend_roi, dtype: float64

In [30]:
rslt_df.corr(method='spearman')["weekend_gross_us_canada"]

watchtime                  0.163085
budget                     0.688629
weekend_gross_us_canada    1.000000
vader_sentiment            0.013349
text_blob_polarity        -0.030403
text_blob_subjectivity    -0.182841
popularity                 0.174893
weekend_roi                0.633660
Name: weekend_gross_us_canada, dtype: float64

In [31]:
rslt_df.corr(method='spearman')["weekend_roi"]

watchtime                 -0.179873
budget                    -0.046533
weekend_gross_us_canada    0.633660
vader_sentiment            0.056156
text_blob_polarity        -0.067329
text_blob_subjectivity    -0.014452
popularity                 0.200901
weekend_roi                1.000000
Name: weekend_roi, dtype: float64
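
The Pearson and Spearman values above can differ considerably for the same pair of columns. A minimal sketch on made-up data shows why: Pearson measures linear association, whereas Spearman only compares ranks and therefore reaches 1 for any strictly increasing relationship.

```python
import pandas as pd

# Made-up data: y grows monotonically with x, but not linearly
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
toy["y"] = toy["x"] ** 3

pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]

print(round(pearson, 4), round(spearman, 4))
```

Heavily skewed variables such as box office gross tend to show this kind of gap, which is one reason to inspect both coefficients.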

As we can see, the newly defined attribute ("weekend_roi") is not strongly correlated with any independent feature. While models predicting this variable can be added in future studies, we will disregard it for our study here.