# COGS 108 - Final Project : An Insight on the Correlation of Social Media Impressions and Box Office Performance

## Permissions

Place an `X` in the appropriate bracket below to specify if you would like your group's project to be made available to the public. (Note that student names will be included, but PIDs will be scraped from any groups who include their PIDs.)

* [  ] YES - make available
* [ X ] NO - keep private

# Overview

In this project we explore whether several features of tweets indicate a higher box office for the top 100 movies of 2021. We explore our data by graphing the box office numbers of our selection of movies, finding the frequency distribution of words in tweets about select movies, and measuring the sentiment of each tweet in our dataset. We then fit linear regression models, with scatterplots, to measure the correlation between tweet count and box office as well as between tweet sentiment and box office. We find that our data supports our hypothesis: there is a positive correlation between tweet count and domestic box office, while average tweet sentiment shows no clear relationship.

# Names

  - Fajar Dirham
  - Robbie Kovar
  - Erik Cisneros
  - Julie Ngan
  - Mohamed Abdilahi


<a id='research_question'></a>
# Research Question

How do the numbers of positive and negative tweets affect how well American movies do at the box office?

<a id='background'></a>

## Background & Prior Work

Trying to quantify how social media affects the real world is not a new idea. The specific question we are trying to ask, relating social media impressions and box office success for movies, has been explored by other parties in the past. A thesis paper from the University of Miami school of communication found positive correlation coefficients between engagements in Instagram, Facebook, and Twitter and box office numbers[^dejesus]. In the paper, engagements were calculated by accounting for the followers of posts. The author took movies from Rotten Tomatoes, used Box Office Mojo for revenue data, and used Rival IQ (a social media platform analytics service). These positive correlation coefficients suggest that there "exists a moderate or strong positive correlation between engagement rates and box office for limited release films".

Another similar study by Neuralink and Facebook hoping to uncover the impact social media has on a movie’s success, particularly box office sales, found that Facebook ads contributed a significant portion to a movie’s marketing (20%) despite only accounting for a small percentage of its expenditure (4%).[^meta] In other words, the budget spent on Facebook ads ended up having a larger impact than its cost would entail. The study acknowledges how social media has become a competitive platform for media consumption, likening it to how television was revolutionary in marketing back in its heyday. Similarly, our group acknowledges how social media has become an influential and impactful communication platform, replacing traditional methods of self-expression such as word of mouth with the publishing of public posts. As such, we hope to uncover whether brief mentions of movie titles in people’s posts have the potential to drive up sales for movies.

References (include links):
- 1) [^dejesus]: De Jesus, Kimberly. “Social Media Engagement and Film Box Office.” University of Miami School of Communication, 2020.
- 2) [^meta]: “New: Study Unveils Secrets to Box Office Marketing.” Meta for Business, https://www.facebook.com/business/m/verticals/entertainment-media/social-media-impact-on-movie-attendance.

# Hypothesis


We believe that the more social media impressions a movie has before coming out, the better it does at the box office in its opening weekend, regardless of whether the impressions are positive or negative.

# Dataset(s)

Dataset 1

    Dataset Name: Movie Report
    Link to the dataset: https://www.the-numbers.com/movies/report/All/All/All/All/All/All/United-States/All/All/None/None/2021/2021/None/None/None/None/None/None?show-release-date=On&show-domestic-box-office=On&show-international-box-office=On&show-worldwide-box-office=On&view-order-by=domestic-box-office&view-order-direction=desc
    Number of observations: 100

This dataset lists the top 100 American movies of 2021, ranked by domestic box office, with their domestic, international, and worldwide box office numbers. We will use this dataset to help us answer our research question on whether the number of tweets about a movie is correlated with how well it does commercially.

Dataset 2

    Dataset Name: Tweets
    Link to the dataset: N/A, webscraped
    Number of observations: 801929

This dataset is scraped from Twitter using a third-party scraping tool, snscrape. Using the titles found in the Movie Report dataset, we fetch the tweets that reference these titles and store one tweet per row. This dataset has a column named 'movie_id' that identifies which movie each tweet is about by its index in the Movie Report dataset.


# Setup

### Read MovieReport.csv into a dataframe

In [12]:
import os
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import date
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [21]:
import pandas as pd
data = pd.read_csv('MovieReport.csv')
data.head()

Unnamed: 0,Released,Title,Domestic\r\nBox Office,International\r\nBox Office,Worldwide\r\nBox Office
1,"Dec 17, 2021",Spider-Man: No Way Home,"$804,617,772","$1,083,808,579","$1,888,426,351"
2,"Sep 3, 2021",Shang-Chi and the Legend of the Ten R…,"$224,543,292","$207,700,000","$432,243,292"
3,"Oct 1, 2021",Venom: Let There be Carnage,"$213,550,366","$288,050,013","$501,600,379"
4,"Jul 9, 2021",Black Widow,"$183,651,655","$196,100,000","$379,751,655"
5,"Jun 25, 2021",F9: The Fast Saga,"$173,005,945","$548,072,000","$721,077,945"


### Dataset 2 webscrape

In [14]:
# installation of webscraper
#!pip install -q snscrape==0.3.4

The code below takes in a movie title and returns the title variants that snscrape will use to query Twitter.

In [15]:
def format_title(title):
  title = title.lower()

  #handle special cases
  if("f9" in title):
    return ["F9", "fast 9", "fast and furious 9", "f9 the fast saga"]
  if("shang-chi" in title):
    return ["shang chi", "shang chi and the legend of the ten rings"]
  if("summer of soul" in title):
    return ["summer of soul", "summer of soul or when the revolution cannot be televised"]
  if("roadrunner" in title):
    return ['Roadrunner', 'roadrunner anthony bourdain', 'road runner a film about anthony bourdain']
  if("christmas with the chosen" in title):
    return ["Christmas with the chosen", 'christmas with the chosen the messengers']
  if("quiet place" in title):
    return ["a quiet place", "a quiet place part 2", "a quiet place part II"]

  to_return = []
  
  #handle dashes
  title = title.replace("-", " ")

  # handle colons
  colon_split = title.split(':')
  if(len(colon_split) > 1):
    to_return.append(colon_split[0])
    to_return.append(" ".join(colon_split).replace("  ", " "))
  else:
    to_return.append(title)
  

  return to_return
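
As a quick illustration of the generic branch above (dashes become spaces, and a colon yields both the short title and the full title), here is a minimal sketch. `title_variants` is a hypothetical stand-in that mirrors only the non-special-case logic of `format_title`; it is not part of the original notebook.

```python
def title_variants(title):
    """Generic branch only: lowercase, dashes to spaces, split on a colon."""
    title = title.lower().replace("-", " ")
    parts = [part.strip() for part in title.split(":")]
    if len(parts) > 1:
        # Short title before the colon, then the full title without the colon
        return [parts[0], " ".join(parts)]
    return [title]

print(title_variants("Spider-Man: No Way Home"))  # ['spider man', 'spider man no way home']
print(title_variants("Black Widow"))              # ['black widow']
```

Each returned variant becomes its own search, so a title with a colon is queried twice.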


Here we iterate through the list of movies to run the appropriate queries. A typical query is run by snscrape with the words: "{movie title} movie". We are looking for tweets made in the first two months after a movie's release.

READ BELOW BEFORE RUNNING:
1. Results will be output to a folder that has to exist and is titled "raw_tweets"
2. It takes hours for the queries to finish
3. The code below only works if the "data" dataframe, which currently only has the contents of MovieReport.csv, has been cleaned. See the next section for how to do this.

In [16]:
import datetime

def scrape_tweets():
    for index,row in data.iterrows():
        title = row['Title']
        words = title.split()
        
        # Skip over movie titles with only 1 word since that will return too many tweets
        if(len(words) == 1):
            continue

        since = row['Released'].strftime('%Y-%m-%d')
        until = (row['Released'] + datetime.timedelta(days=61)).strftime('%Y-%m-%d')
        filename = str(index) + '-' + title.lower().replace(' ', '')
        filename = filename.replace("(","")
        filename += '.txt'


        titles = format_title(title)
        for formatted_title in titles:
            command = 'snscrape --jsonl --since ' + since + ' twitter-search "' + formatted_title + ' movie until:' + until + '" >> raw_tweets/' + filename
            os.system('echo ' + filename)
            os.system(command)

#scrape_tweets()
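
To make the two-month window concrete, the sketch below rebuilds only the date arithmetic and the search string from `scrape_tweets`, without calling snscrape. `build_search_window` is a hypothetical helper introduced for illustration, and the 61-day default simply mirrors the `timedelta(days=61)` used above.

```python
import datetime

def build_search_window(formatted_title, released, window_days=61):
    """Return the --since date and the search string for one title variant."""
    since = released.strftime('%Y-%m-%d')
    until = (released + datetime.timedelta(days=window_days)).strftime('%Y-%m-%d')
    # Same shape as the query assembled in scrape_tweets
    query = f'{formatted_title} movie until:{until}'
    return since, query

since, query = build_search_window('black widow', datetime.date(2021, 7, 9))
print(since)  # 2021-07-09
print(query)  # black widow movie until:2021-09-08
```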

# Data Cleaning

### Movie Report Cleaning

In [18]:
# Helper Functions

# Function for getting the total number of tweets for a movie title
# from_date - datetime object to start search
# to_date - datetime object to stop search
# title - string title of movie
# returns int number of tweets for the movie 
def num_tweets(from_date, to_date, title):
    start = str(from_date).replace(' 00:00:00', '')
    end = str(to_date).replace(' 00:00:00', '')

    os.system(f"snscrape --since {start} twitter-search '{title} until:{end}' > result-tweets.txt")
    if os.stat("result-tweets.txt").st_size == 0:
       counter = 0
    else:
       df = pd.read_csv('result-tweets.txt', names=['link'])
       counter = df.size
    os.remove('result-tweets.txt')
    print('Number Of Tweets : '+ str(counter))
    return counter

# Function for getting the first N tweets for a movie title
# from_date - datetime object to start search
# to_date - datetime object to stop search
# title - string title of movie
# N - int number of posts to fetch
# returns list of first N tweets about the movie title 
def get_N_posts(from_date, to_date, title, N):
  posts_list = []
  start = str(from_date).replace(' 00:00:00', '')
  end = str(to_date).replace(' 00:00:00', '')
  os.system("snscrape --format '{content!r}'"+ f" --max-results {N} --since {start} twitter-search '{title} until:{end}' > tweets.txt")
  if os.stat("tweets.txt").st_size == 0:
    os.remove('tweets.txt')
    return posts_list
  else:
    df = pd.read_csv('tweets.txt', names=['content'])
    # Keep only the tweet text, not the row index
    posts_list = df['content'].tolist()
    os.remove('tweets.txt')
    return posts_list

# Function for converting a date string such as "Dec 17, 2021" into "MM/DD/YYYY"
# date_str - string release date from the Movie Report
# returns the reformatted date string, or "n/a" if the input is not a string
def standardize_date(date_str):
  months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
            'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
  try:
    date_str = date_str.strip()

    # Replace the month abbreviation with its two-digit number
    for number, month in enumerate(months, start=1):
      date_str = date_str.replace(month + ' ', f"{number:02d}/")

    date_str = date_str.replace(', ', "/")

    # Zero-pad single-digit days
    for day in range(1, 10):
      date_str = date_str.replace(f'/{day}/', f"/0{day}/")

    out = date_str
  except AttributeError:
    # Non-string input (for example a missing value) has no strip()
    out = "n/a"

  return out
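
An alternative to chained string replacement is to parse the date with an explicit format string. The sketch below uses only the standard library and assumes release dates always look like `Dec 17, 2021` with English month abbreviations; `parse_release_date` is a hypothetical helper, not the notebook's method.

```python
import datetime

def parse_release_date(date_str):
    """Parse 'Dec 17, 2021' style strings; return None when parsing fails."""
    try:
        # %b matches abbreviated month names in the current locale (English assumed)
        return datetime.datetime.strptime(date_str.strip(), '%b %d, %Y').date()
    except (AttributeError, ValueError):
        # AttributeError: input is not a string; ValueError: unexpected format
        return None

print(parse_release_date('Dec 17, 2021'))  # 2021-12-17
print(parse_release_date('Sep 3, 2021'))   # 2021-09-03
print(parse_release_date(float('nan')))    # None
```

Returning `None` for unparseable input keeps bad rows visible instead of passing a placeholder string on to later steps.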

In [23]:
# Simple renaming of columns so they are easier to read
data = data.rename(columns={"Domestic\r\nBox Office":"Domestic Box Office", "International\r\nBox Office":"International Box Office", 
            "Worldwide\r\nBox Office":"Worldwide Box Office"})
#data = data.drop(columns=['Unnamed: 0'])

# Transformation of release date string to 'datetime' objects
data['Released'] = data['Released'].apply(standardize_date)
data['Released'] = data['Released'].apply(pd.to_datetime)

# Sort Movies by release date
data = data.sort_values(by='Released')
data = data.reset_index(drop = True)

# Remove dollar signs and commas from the box office numbers and convert them to floats
# regex=False treats '$' as a literal character rather than a regex end-of-string anchor
data['Domestic Box Office'] = data['Domestic Box Office'].str.replace('$','', regex=False)
data['Domestic Box Office'] = data['Domestic Box Office'].str.replace(',','', regex=False).astype(float)
data['International Box Office'] = data['International Box Office'].str.replace('$','', regex=False)
data['International Box Office'] = data['International Box Office'].str.replace(',','', regex=False).astype(float)
data['Worldwide Box Office'] = data['Worldwide Box Office'].str.replace('$','', regex=False)
data['Worldwide Box Office'] = data['Worldwide Box Office'].str.replace(',','', regex=False).astype(float)

# Preview the cleaned data
data.head()

Unnamed: 0,Released,Title,Domestic Box Office,International Box Office,Worldwide Box Office
0,2021-01-15,The Marksman,15566093.0,5631181.0,21197274.0
1,2021-01-26,Wrong Turn,1251184.0,2392576.0,3643760.0
2,2021-01-29,The Little Things,15342746.0,14392476.0,29735222.0
3,2021-01-29,Nomadland,2180000.0,36818715.0,38998715.0
4,2021-02-12,Judas and the Black Messiah,5478009.0,1580271.0,7058280.0
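
One detail worth keeping in mind when stripping currency symbols: `$` is a regular-expression metacharacter. The standard-library sketch below shows the difference between treating it as a pattern and as a literal. The sample string is taken from the Movie Report preview above; depending on the pandas version, `Series.str.replace` may also interpret its pattern as a regex, so this is meant as an illustration of the pitfall rather than a statement about any specific pandas release.

```python
import re

raw = "$804,617,772"

# As a regex, '$' matches the empty position at the end of the string,
# so nothing visible is removed
as_regex = re.sub("$", "", raw)

# As a literal, the dollar sign and the commas are actually stripped
as_literal = raw.replace("$", "").replace(",", "")

print(as_regex)           # $804,617,772
print(float(as_literal))  # 804617772.0
```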


### Process and clean Tweets

Here we iterate through all the raw tweet data and pull out the content, placing it in a folder that has to exist and is titled "processed_tweets".

In [24]:
import json

def process_file(filename):
    # Read every raw JSONL record and write one cleaned tweet per line.
    # Mode 'w' overwrites earlier output so re-running does not duplicate rows.
    with open('raw_tweets/' + filename, 'r') as raw_file, \
         open('processed_tweets/' + filename, 'w') as processed_file:
        # Header row so pandas can read the file as a one-column CSV
        processed_file.write("tweet\n")
        for line in raw_file:
            obj = json.loads(line.strip())
            content = obj['content']
            # Drop commas and line breaks so each tweet stays on a single CSV row
            content = content.replace(",", "")
            content = content.replace("\n", "")
            content = content.replace("\r", "")
            processed_file.write(content + '\n')

def process_tweets():
    yourpath = './raw_tweets/'
    for root, dirs, files in os.walk(yourpath, topdown=True):
        for name in files:
            process_file(name)

#process_tweets()
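
To see what the per-line processing does to a single record, here is a minimal sketch on a made-up JSONL line. `flatten_content` is a hypothetical helper that repeats the same three replacements as `process_file`; the sample record is invented, and real snscrape output carries many more fields than `content`.

```python
import json

def flatten_content(jsonl_line):
    """Return the tweet text from one JSONL record as a single comma-free line."""
    obj = json.loads(jsonl_line.strip())
    content = obj['content']
    for char in (",", "\n", "\r"):
        content = content.replace(char, "")
    return content

# Made-up record for illustration only
sample = json.dumps({"content": "Loved it,\nwould watch again"})
print(flatten_content(sample))  # Loved itwould watch again
```

Note that deleting line breaks outright fuses the words on either side ("itwould"). Replacing them with a space instead would keep tokens separate for the later tokenizing step; whether that matters for the analysis is a judgment call.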

We take all those processed tweets and put them all into one dataframe

In [26]:
def generate_csv():
  tweet_dfs = []

  def create_dfs(filename):
    temp_df = pd.read_csv('processed_tweets/' + filename)
    id = int(filename[:filename.index('-')])
    temp_df['movie'] = data.iloc[id - 1]['Title']
    temp_df['movie_id'] = id
    tweet_dfs.append(temp_df)

  yourpath = './processed_tweets/'
  for root, dirs, files in os.walk(yourpath, topdown=False):
    for name in files:
          create_dfs(name)

  tweets_df = pd.concat(tweet_dfs)
  tweets_df.sort_values(by='movie_id', inplace=True)
  tweets_df.reset_index(inplace=True, drop=True)
  tweets_df.drop_duplicates(subset=["tweet"], inplace=True)
  tweets_df.to_csv('tweets.csv.zip', index=False)

# This is how we used the data frame going forward after running
# generate_csv(), which writes the compressed file tweets.csv.zip
tweets_df = pd.read_csv("tweets.csv.zip")
print(tweets_df.shape)
tweets_df.head()

(801929, 3)


Unnamed: 0,tweet,movie,movie_id
0,@SpiderManMovie @HarryHolland99 @IMAX @DolbyCi...,Spider-Man: No Way Home,1
1,@A_C_Mitchell @molly_kraus @MarvelStudios @Spi...,Spider-Man: No Way Home,1
2,“Spider Man trailer and stock prices”… Story |...,Spider-Man: No Way Home,1
3,@Gamer21690 @SpiderManMovie too obsessed,Spider-Man: No Way Home,1
4,Looking forward to Spider-Man tonight. If anyo...,Spider-Man: No Way Home,1


Now we clean the tweets, removing mentions, links, and the '#' symbol from hashtags while keeping punctuation, as that may convey information.

In [None]:
# Process the tweets first

def clean_text(text):
    # Drop words containing a mention (@) or an https link, then strip '#' symbols
    kept = [word for word in text.split()
            if '@' not in word and 'https:' not in word]
    return " ".join(kept).replace("#", "")

test_clean = clean_text(tweets_df.iloc[0]['tweet'])
print(test_clean)
tweets_df["tweet_clean"] = tweets_df['tweet'].apply(clean_text)
tweets_df.head()

SpiderMan has saved the movie going experience!!!


Unnamed: 0,tweet,movie,movie_id,tweet_clean
0,@SpiderManMovie @HarryHolland99 @IMAX @DolbyCi...,Spider-Man: No Way Home,1,SpiderMan has saved the movie going experience!!!
1,@A_C_Mitchell @molly_kraus @MarvelStudios @Spi...,Spider-Man: No Way Home,1,Yes- so good! Grab extra napkins with your pop...
2,“Spider Man trailer and stock prices”… Story |...,Spider-Man: No Way Home,1,“Spider Man trailer and stock prices”… Story |...
3,@Gamer21690 @SpiderManMovie too obsessed,Spider-Man: No Way Home,1,too obsessed
4,Looking forward to Spider-Man tonight. If anyo...,Spider-Man: No Way Home,1,Looking forward to Spider-Man tonight. If anyo...


# Data Analysis & Results

## Dataset 1

In [29]:
data

Unnamed: 0,Released,Title,Domestic Box Office,International Box Office,Worldwide Box Office
0,2021-01-15,The Marksman,15566093.0,5631181.0,21197274.0
1,2021-01-26,Wrong Turn,1251184.0,2392576.0,3643760.0
2,2021-01-29,The Little Things,15342746.0,14392476.0,29735222.0
3,2021-01-29,Nomadland,2180000.0,36818715.0,38998715.0
4,2021-02-12,Judas and the Black Messiah,5478009.0,1580271.0,7058280.0
...,...,...,...,...,...
95,2021-12-22,The King’s Man,37176373.0,83948927.0,121125300.0
96,2021-12-22,The Matrix Resurrections,37686805.0,118781012.0,156467817.0
97,2021-12-22,Sing 2,162790990.0,234806442.0,397597432.0
98,2021-12-25,American Underdog: The Kurt Warner Story,26514814.0,0.0,26514814.0


### Size
For this dataset we have 100 entries

In [30]:
data.shape

(100, 5)

### Missingness
Searching for anomalous entries, we found that 6 movies had an International Box Office of 0. We believe these movies have no international box office because they were only released domestically.

In [None]:
domestic_only =  data[data['International Box Office'] == 0]
domestic_only

Since we are using box office as a measure of success for a movie, we will remove the aforementioned entries. Their lack of an international box office makes it difficult to compare them with other movies at an international scope. Additionally, it lowers their worldwide box office, making them appear less successful than other movies.

In [None]:
data = data[data['International Box Office'] != 0]

Plotting a histogram of movie release dates, we see that they are fairly uniformly distributed, with slightly fewer movies during the first quarter of the year.

In [None]:
sns.histplot(data, x='Released', bins=24)
plt.title('Distribution of Movie Release Date Across 2021')
plt.xticks(rotation=45)
print()

Plotting box office numbers, we see that all three distributions are highly right-skewed. We also spot a few outliers that earned huge box office numbers.

In [None]:
sns.histplot(data, x='Domestic Box Office', bins=24)
plt.title('Distribution of Domestic Box Office')
print()

In [None]:
sns.histplot(data, x='International Box Office', bins=24)
plt.title('Distribution of International Box Office')
print()

In [None]:
sns.histplot(data, x='Worldwide Box Office', bins=24)
plt.title('Distribution of Worldwide Box Office')
plt.xticks(rotation=90)
print()

### Outliers
In both the domestic and worldwide cases, the outlier is "Spider-Man: No Way Home".

In [None]:
data[data['Domestic Box Office'] > (7*(10**8))]

In [None]:
data[data['Worldwide Box Office'] > (1*(10**9))]

## Dataset 2

In [28]:
import pandas as pd
import patsy
import statsmodels.api as sm

import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (17, 7)
plt.rcParams.update({'font.size': 14})
import seaborn as sns

#improve resolution
#comment this line if erroring on your machine/screen
%config InlineBackend.figure_format ='retina'

import warnings
warnings.filterwarnings('ignore')

#import natural language toolkit
import nltk

# download stopwords & punkt & VADER
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('vader_lexicon') 

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/fajardirham/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/fajardirham/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/fajardirham/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

### Tokenize

In [None]:
from nltk.tokenize import word_tokenize
tweets_df['tweet_token'] = tweets_df['tweet_clean'].apply(word_tokenize)
tweets_df.head()

### Remove stop words

In [None]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
tweets_df['tweet_stop'] = tweets_df['tweet_token'].apply(lambda x: [item for item in x if item not in stop_words])
tweets_df.head()
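
The filter above compares tokens to the stop list exactly as written, so capitalized tokens are not matched by a lowercase stop list. A minimal sketch of the difference, using a tiny stand-in stop list rather than NLTK's (whose English entries are likewise lowercase):

```python
# Tiny stand-in stop list for illustration; not the NLTK list
stop_words = {"the", "is", "a"}
tokens = ["The", "movie", "is", "a", "masterpiece"]

# Exact comparison keeps "The" because only "the" is in the stop list
case_sensitive = [t for t in tokens if t not in stop_words]

# Lowercasing each token before the lookup removes it as well
case_insensitive = [t for t in tokens if t.lower() not in stop_words]

print(case_sensitive)    # ['The', 'movie', 'masterpiece']
print(case_insensitive)  # ['movie', 'masterpiece']
```

Whether to lowercase before filtering is a design choice; it mainly affects word-frequency counts, since sentence-initial stop words would otherwise survive.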

### Stemming

In [None]:
from nltk.stem import PorterStemmer

ps = PorterStemmer()

tweets_df['tweet_stem'] = tweets_df['tweet_stop'].apply(lambda x: [ps.stem(y) for y in x])
tweets_df.head()

### Frequency Distribution of words for F9 and Black Widow

In [None]:
# get words after stemming for F9
f9_tweets = tweets_df[tweets_df['movie'] == 'F9: The Fast Saga']
tweet_stem = f9_tweets['tweet_stem'].apply(pd.Series).stack()

from nltk.probability import FreqDist
import string

#calculate word frequency
fdist_tweets = FreqDist(tweet_stem)
#delete punctuation counts
for punc in string.punctuation:
    del fdist_tweets[punc]

#get top 20 words
fdist_tweets.plot(20, cumulative=False);

In [None]:
bw_tweets = tweets_df[tweets_df['movie'] == 'Black Widow']
tweet_series = bw_tweets.squeeze()
tweet_stem = tweet_series['tweet_stem'].apply(pd.Series).stack()

from nltk.probability import FreqDist
import string

#calculate word frequency
fdist_tweets = FreqDist(tweet_stem)
#delete punctuation counts
for punc in string.punctuation:
    del fdist_tweets[punc]

#get top 20 words
fdist_tweets.plot(20, cumulative=False);

### VADER sentiment analysis

In [None]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer 
analyser = SentimentIntensityAnalyzer()
test_tweet = tweets_df.iloc[0]['tweet_clean']
print(test_tweet)
print(analyser.polarity_scores(test_tweet))

In [None]:
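sentiment_df = pd.DataFrame()
sentiment_df['tweet_clean'] = tweets_df['tweet_clean']
# Keep the movie id so the scores can be grouped per movie later
sentiment_df['movie_id'] = tweets_df['movie_id']
sentiment_df['sentiment'] = tweets_df['tweet_clean'].apply(analyser.polarity_scores)
sentiment_df.head()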

In [None]:
def spread_sentiment(sentiment_obj, category):
    return sentiment_obj[category]
sentiment_df['compound'] = sentiment_df['sentiment'].apply(lambda x: spread_sentiment(x, 'compound'))
sentiment_df['neg'] = sentiment_df['sentiment'].apply(lambda x: spread_sentiment(x, 'neg'))
sentiment_df['neu'] = sentiment_df['sentiment'].apply(lambda x: spread_sentiment(x, 'neu'))
sentiment_df['pos'] = sentiment_df['sentiment'].apply(lambda x: spread_sentiment(x, 'pos'))
sentiment_df.drop(columns=['sentiment'], inplace=True)
sentiment_df.head()

In [None]:
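# Group by movie and average the sentiment scores
grouped = sentiment_df.groupby(by=['movie_id'])
# numeric_only=True skips the text column, which cannot be averaged
mean_sentiment_df = grouped.mean(numeric_only=True)
mean_sentiment_df.head()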

In [None]:
# Get number of tweets per movie
count_sentiment_df = grouped.count()
movie_aggregate_df = mean_sentiment_df.merge(count_sentiment_df['tweet_clean'], left_index=True, right_on='movie_id')
movie_aggregate_df.columns = ['compound_mean', 'neg_mean', 'neu_mean', 'pos_mean', 'num_tweets']
movie_aggregate_df.head(10)
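
The mean-then-count-then-merge steps above can also be expressed in a single grouped aggregation. The sketch below uses pandas named aggregation on a made-up frame; the movie ids and scores are invented, and only one score column is shown to keep it short.

```python
import pandas as pd

# Made-up compound scores for two hypothetical movie ids
toy = pd.DataFrame({
    'movie_id': [1, 1, 1, 2, 2],
    'compound': [0.5, -0.1, 0.2, 0.0, 0.4],
})

# One pass: average score and number of tweets per movie
summary = toy.groupby('movie_id').agg(
    compound_mean=('compound', 'mean'),
    num_tweets=('compound', 'size'),
)
print(summary)
```

The result is indexed by `movie_id`, which is the same shape the join with the box office table expects.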

In [31]:
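# Join with box office
# Assumption: rename columns to the names the later cells use ('movie' and 'domestic')
movies_df = data.rename(columns={'Title': 'movie', 'Domestic Box Office': 'domestic'})
joined_df = movie_aggregate_df.merge(movies_df, left_on='movie_id', right_index=True)
joined_df.head(10)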

NameError: name 'movie_aggregate_df' is not defined

In [None]:
# Plot tweet counts
sorted_count_df = joined_df.sort_values(by='num_tweets', ascending=True)
sorted_count_plot = sns.barplot(data=sorted_count_df, y='num_tweets', x='movie')
for item in sorted_count_plot.get_xticklabels():
    item.set_rotation(90)

In [None]:
# Describe num tweets
joined_df.describe()

We need to get rid of some outliers and low tweet counts to make our predictions more accurate.

In [None]:
# To remove low count tweets for correlation, we will use movies with 1000 tweets or above
full_joined_df = joined_df
joined_df = joined_df[joined_df['num_tweets'] > 1000]

# Remove spider man
joined_df = joined_df[joined_df['num_tweets'] < 100000]

print(joined_df.shape)
print(joined_df.head())

sorted_count_df = joined_df.sort_values(by='num_tweets', ascending=True)
sorted_count_plot = sns.barplot(data=sorted_count_df, y='num_tweets', x='movie')
for item in sorted_count_plot.get_xticklabels():
    item.set_rotation(90)

### Connections between Average Compound Scores and Domestic Box Office

Let's see if there is a direct connection between average compound scores and domestic box office numbers.

In [None]:
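def std_cash(value):
    # Accept either a formatted string such as "$1,000" or an already numeric value
    if isinstance(value, str):
        value = value.replace('$','').replace(',','')
    return float(value)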

correlation_df = pd.DataFrame()
correlation_df['compound'] = joined_df['compound_mean']
correlation_df['domestic'] = joined_df['domestic'].apply(std_cash)
correlation_df.head()

In [None]:
sns.lmplot(x='compound', y='domestic', data=correlation_df, ci=None, aspect=1.8)

In [None]:
outcome, predictors = patsy.dmatrices('domestic ~ compound', correlation_df)
model = sm.OLS(outcome, predictors)
results = model.fit()
print(results.summary())

#### Conclusion

There does not seem to be a direct correlation between the average compound score of tweets in the two months after release and domestic box office numbers. This is evident from the relatively high p-value for our predictor and the low R-squared, which indicates that the Pearson correlation coefficient is low.

### Correlation Between Movie Tweet Count & Domestic Box Office

In [None]:
correlation_df['num_tweets'] = joined_df['num_tweets']
correlation_df.head(10)

In [None]:
sns.lmplot(x='num_tweets', y='domestic', data=correlation_df,
          ci=None, aspect=1.8)

In [None]:
outcome, predictors = patsy.dmatrices('domestic ~ num_tweets', correlation_df)
model = sm.OLS(outcome, predictors)
results = model.fit()
print(results.summary())

#### Conclusion
There does seem to be a linear relationship between the number of tweets and domestic box office numbers. The p-value for the predictor is low, and the squared Pearson coefficient seems high enough to suggest a moderate positive correlation (R^2 = 0.25, so R = 0.5). The OLS model suggests that each additional tweet is associated with an increase of $1847 in domestic box office.
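
The step from R^2 to R relies on the fact that, for a regression with a single predictor, R-squared equals the square of the Pearson correlation coefficient. The sketch below checks this numerically with the standard library. The tweet counts and box office figures are made up for illustration and are not the project's data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def ols_r_squared(xs, ys):
    """R-squared of the least-squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Made-up tweet counts and box office figures (millions of dollars)
tweets = [1200, 3400, 5600, 8000, 15000, 22000]
box_office = [12.0, 8.5, 40.0, 25.0, 60.0, 55.0]

r = pearson_r(tweets, box_office)
r2 = ols_r_squared(tweets, box_office)
print(math.isclose(r ** 2, r2))  # True
print(math.sqrt(0.25))           # 0.5
```

The square root only gives the magnitude of R; its sign comes from the sign of the fitted slope, which is positive here.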

Realistically, this makes some sense. Assuming each movie ticket costs $10, $1847 would equal about 184 people buying tickets for a movie. It is within the realm of possibility that only one out of 184 people would tweet about a movie they have seen.
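
The back-of-the-envelope figure can be written out as a short calculation. The $10 ticket price is the assumption stated above, and the alternative prices in the loop are added only to show how sensitive the estimate is to that assumption.

```python
slope_dollars_per_tweet = 1847  # OLS coefficient quoted above
ticket_price = 10               # assumed flat ticket price in dollars

tickets_per_tweet = slope_dollars_per_tweet / ticket_price
print(tickets_per_tweet)  # 184.7

# Sensitivity to the assumed ticket price (hypothetical alternatives)
for price in (8, 10, 12):
    print(price, round(slope_dollars_per_tweet / price, 1))
```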

### Correlation Between Tweet Count of Differing Sentiments & Domestic Box Office 

First we will add labels to our tweets (pos, neg, and neu). Then we will count the number of tweets with each label per movie. Finally, we will see if there is a correlation between the number of tweets with each label and domestic box office.

In [None]:
def compound_to_label(compound_score):
    if(compound_score >= 0.05): return 'pos'
    if(compound_score <= -0.05): return 'neg'
    return 'neu'

sentiment_df['label'] = sentiment_df['compound'].apply(compound_to_label)
sentiment_df.head(10)
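
To make the threshold behavior explicit, the sketch below applies the same labeling rule to a handful of made-up compound scores, including values exactly at the cutoffs, and tallies the labels with the standard library.

```python
from collections import Counter

def compound_to_label(compound_score):
    # Same thresholds as above: both cutoffs are inclusive
    if compound_score >= 0.05:
        return 'pos'
    if compound_score <= -0.05:
        return 'neg'
    return 'neu'

# Made-up scores, including the boundary values 0.05 and -0.05
scores = [0.62, 0.05, 0.0, -0.04, -0.05, -0.8]
labels = [compound_to_label(score) for score in scores]
counts = Counter(labels)

print(labels)  # ['pos', 'pos', 'neu', 'neu', 'neg', 'neg']
print(counts['pos'], counts['neu'], counts['neg'])  # 2 2 2
```

Only scores strictly between -0.05 and 0.05 are labeled neutral.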

In [None]:
pos_df = sentiment_df[sentiment_df['label'] == 'pos']
neu_df = sentiment_df[sentiment_df['label'] == 'neu']
neg_df = sentiment_df[sentiment_df['label'] == 'neg']


pos_grouped = pos_df.groupby(by=['movie_id'])
pos_count = pos_grouped.count()['tweet_clean']
neu_grouped = neu_df.groupby(by=['movie_id'])
neu_count = neu_grouped.count()['tweet_clean']
neg_grouped = neg_df.groupby(by=['movie_id'])
neg_count = neg_grouped.count()['tweet_clean']
labeled_count_df = pd.DataFrame()
labeled_count_df['pos_count'] = pos_count
labeled_count_df['neu_count'] = neu_count
labeled_count_df['neg_count'] = neg_count
labeled_count_df['num_tweets'] = labeled_count_df['pos_count'] + labeled_count_df['neu_count'] + labeled_count_df['neg_count']

# Remove outliers and low tweet counts
labeled_count_df = labeled_count_df[labeled_count_df['num_tweets'] > 1000]
labeled_count_df = labeled_count_df[labeled_count_df['num_tweets'] < 100000]

# Add domestic
labeled_count_df['domestic'] = correlation_df['domestic']

labeled_count_df.head(10)

In [None]:
# plot pos
outcome, predictors = patsy.dmatrices('domestic ~ pos_count', labeled_count_df)
model = sm.OLS(outcome, predictors)
results = model.fit()
print(results.summary())

sns.lmplot(x='pos_count', y='domestic', 
           data=labeled_count_df, ci=None, aspect=1.8)

In [None]:
# plot neu
outcome, predictors = patsy.dmatrices('domestic ~ neu_count', labeled_count_df)
model = sm.OLS(outcome, predictors)
results = model.fit()
print(results.summary())

sns.lmplot(x='neu_count', y='domestic', 
           data=labeled_count_df, ci=None, aspect=1.8)

In [None]:
# plot neg
outcome, predictors = patsy.dmatrices('domestic ~ neg_count', labeled_count_df)
model = sm.OLS(outcome, predictors)
results = model.fit()
print(results.summary())

sns.lmplot(x='neg_count', y='domestic', 
           data=labeled_count_df, ci=None, aspect=1.8)

### Notes
Just by looking at the graphs, the sentiment of the tweets does not seem to matter much: more tweets of any label generally correspond to higher domestic box office numbers. We'll now use OLS to see how each of the labels correlates with the domestic numbers.

In [None]:
outcome, predictors = patsy.dmatrices('domestic ~ pos_count + neu_count + neg_count', labeled_count_df)
model = sm.OLS(outcome, predictors)
pos_results = model.fit()

print(pos_results.summary())

### Notes
According to this OLS predictor, the amount of neutral

In [None]:
outcome, predictors = patsy.dmatrices('domestic ~ compound + num_tweets + compound*num_tweets', correlation_df)
model = sm.OLS(outcome, predictors)
results = model.fit()

print(results.summary())

# Ethics & Privacy

Data Collection: In our view, scraping data from Twitter for our research project is ethically acceptable because Twitter is a public space. Because we only want the sentiment of individuals regarding the movies, their anonymity should be upheld. So, we use only their tweets and not any information about them as individuals.

Sentiment Analysis: Ideally, we would like to see not only whether any form of discussion on Twitter about a movie increases its performance, but also whether that performance depends on the type of discussion taking place. We divide the tweets into three groups: positive, negative, or neutral sentiment. While there are ethical concerns around sentiment analysis, we believe that the context in which we use it does not break any moral standards, and we are transparent about our goal.

# Conclusion & Discussion

*Fill in your discussion information here*

# Team Contributions

*Specify who in your group worked on which parts of the project.*