
# **Data Science 112 Final Project: Data Exploration**
# Amira and Sophia

This project explores the Cornell Movie Dialog Corpus (https://convokit.cornell.edu/documentation/movie.html).

Research questions:
1. Has movie dialogue sentiment changed over time?
2. Has the sentiment of movie dialogue spoken by men versus women changed over time?
3. Has the sentiment of movie dialogue spoken by men to men, by men to women, by women to men, and by women to women changed over time?

In this file we perform **exploratory analysis** of our data.

In [6]:
import nltk
nltk.download('vader_lexicon')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import pandas as pd

sia = SentimentIntensityAnalyzer()

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


In [7]:
import pandas as pd
movie_df = pd.read_csv('movie_dialogues.csv')
movie_df

Unnamed: 0.1,Unnamed: 0,text,utt id_x,reply_to id,speaker id,movie_name,gender_x,release year,rating,genre,utt id_y,gender_y,decade
0,0,They do not!,L1045,L1044,u0,10 things i hate about you,f,1999,6.9,"['comedy', 'romance']",L1044,m,1990-1999
1,1,They do to!,L1044,,u2,10 things i hate about you,m,1999,6.9,"['comedy', 'romance']",,,1990-1999
2,2,I hope so.,L985,L984,u0,10 things i hate about you,f,1999,6.9,"['comedy', 'romance']",L984,m,1990-1999
3,3,She okay?,L984,,u2,10 things i hate about you,m,1999,6.9,"['comedy', 'romance']",,,1990-1999
4,4,Let's go.,L925,L924,u0,10 things i hate about you,f,1999,6.9,"['comedy', 'romance']",L924,m,1990-1999
...,...,...,...,...,...,...,...,...,...,...,...,...,...
304708,304708,Lord Chelmsford seems to want me to stay back ...,L666371,L666370,u9030,zulu dawn,?,1979,6.4,"['action', 'adventure', 'drama', 'history', 'w...",L666370,?,1970-1979
304709,304709,I'm to take the Sikali with the main column to...,L666370,L666369,u9034,zulu dawn,?,1979,6.4,"['action', 'adventure', 'drama', 'history', 'w...",L666369,?,1970-1979
304710,304710,"Your orders, Mr Vereker?",L666369,,u9030,zulu dawn,?,1979,6.4,"['action', 'adventure', 'drama', 'history', 'w...",,,1970-1979
304711,304711,"Good ones, yes, Mr Vereker. Gentlemen who can ...",L666257,L666256,u9030,zulu dawn,?,1979,6.4,"['action', 'adventure', 'drama', 'history', 'w...",L666256,?,1970-1979


In [11]:
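def get_sentiment(text):
    """Return VADER's compound score (between -1 and 1) for one utterance."""
    # Missing utterances are scored as an empty string.
    if pd.isna(text):
        text = ""
    return sia.polarity_scores(text)["compound"]

# Score every utterance once and store the result next to the text.
movie_df['sentiment compound score'] = movie_df['text'].apply(get_sentiment)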
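
The scoring step above can be tried on a toy column without downloading the lexicon. This is only a minimal sketch: `score_texts` and `toy_scorer` are hypothetical names, and `toy_scorer` is a crude stand-in for `sia.polarity_scores(text)["compound"]`, not VADER itself.

```python
import pandas as pd

def score_texts(texts: pd.Series, scorer) -> pd.Series:
    """Apply `scorer` to every utterance, treating missing text as an empty string."""
    return texts.fillna("").astype(str).map(scorer)

def toy_scorer(text: str) -> float:
    """Hypothetical stand-in for VADER: +1 per 'good', -1 per 'bad', clipped to [-1, 1]."""
    words = text.lower().split()
    raw = words.count("good") - words.count("bad")
    return float(max(-1, min(1, raw)))

# Made-up utterances, including one missing value.
toy_texts = pd.Series(["good good day", "bad idea", None, "They do not!"])
toy_scores = score_texts(toy_texts, toy_scorer)
print(toy_scores.tolist())
```

In the notebook itself, the scorer passed in would be something like `lambda t: sia.polarity_scores(t)["compound"]`.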

### Taking a look at the distribution of sentiments across time

In [15]:
import plotly.express as px

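# Drop near-neutral utterances (|compound| < 0.01) and speakers of unknown gender.
# Note: this overwrites movie_df, so every later cell works on the filtered data.
movie_df = movie_df[(movie_df["sentiment compound score"].abs() >= 0.01) & (movie_df["gender_x"] != '?')]

# A box plot shows the spread of scores per decade, not just the average.
px.box(movie_df, x='decade', y='sentiment compound score',
              title='Distribution of Sentiment Scores by Decade')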
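
Because the filter above overwrites `movie_df`, it may be worth checking how many rows each condition removes before applying it. The sketch below does this on a made-up frame; `toy_df` and `filter_report` are hypothetical names, and the values are invented for illustration.

```python
import pandas as pd

# Toy stand-in for movie_df with the two columns the filter uses.
toy_df = pd.DataFrame({
    "sentiment compound score": [0.0, 0.44, -0.62, 0.005, 0.30],
    "gender_x": ["f", "m", "?", "m", "f"],
})

# The two exclusion conditions, written as the complement of the notebook's filter.
is_neutral = toy_df["sentiment compound score"].abs() < 0.01
is_unknown_gender = toy_df["gender_x"] == "?"

filter_report = {
    "total rows": len(toy_df),
    "neutral": int(is_neutral.sum()),
    "unknown gender": int(is_unknown_gender.sum()),
    "kept": int((~is_neutral & ~is_unknown_gender).sum()),
}
print(filter_report)
```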

### Exploring potential trends in sentiment across gender groups by year of movie release

In [16]:
# gender_x is the gender of the speaker; gender_y is the gender of the speaker being replied to.
male_to_male = movie_df[(movie_df["gender_x"] == "m") & (movie_df["gender_y"] == "m")]
avg_sentiment_by_year_mm = male_to_male.groupby('release year')['sentiment compound score'].mean().reset_index()
avg_sentiment_by_year_mm["group"] = "Male to male"

px.line(avg_sentiment_by_year_mm, x='release year', y='sentiment compound score',
              title='Average Sentiment Value of Males speaking to Males Over Time')

In [17]:
male_to_female = movie_df[(movie_df["gender_x"] == "m") & (movie_df["gender_y"] == "f")]
avg_sentiment_by_year_mf = male_to_female.groupby('release year')['sentiment compound score'].mean().reset_index()
avg_sentiment_by_year_mf["group"] = "Male to female"


px.line(avg_sentiment_by_year_mf, x='release year', y='sentiment compound score',
              title='Average Sentiment Value of Males speaking to Females Over Time')

In [18]:
female_to_male = movie_df[(movie_df["gender_x"] == "f") & (movie_df["gender_y"] == "m")]
avg_sentiment_by_year_fm = female_to_male.groupby('release year')['sentiment compound score'].mean().reset_index()
avg_sentiment_by_year_fm["group"] = "Female to male"


px.line(avg_sentiment_by_year_fm, x='release year', y='sentiment compound score',
              title='Average Sentiment Value of Females speaking to Males Over Time')

In [19]:
female_to_female = movie_df[(movie_df["gender_x"] == "f") & (movie_df["gender_y"] == "f")]
avg_sentiment_by_year_ff = female_to_female.groupby('release year')['sentiment compound score'].mean().reset_index()
avg_sentiment_by_year_ff["group"] = "Female to female"

px.line(avg_sentiment_by_year_ff, x='release year', y='sentiment compound score',
              title='Average Sentiment Value of Females speaking to Females Over Time')

In [20]:
female_to_male[["release year", "sentiment compound score"]].corr()

Unnamed: 0,release year,sentiment compound score
release year,1.0,-0.043023
sentiment compound score,-0.043023,1.0
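
The correlation above covers only the female-to-male group. A minimal sketch of computing the same Pearson correlation for every speaker/listener pair at once is shown below, using invented toy values; `toy_df` and `corr_by_group` are hypothetical names.

```python
import pandas as pd

# Made-up data: one group trends upward over time, the other downward.
toy_df = pd.DataFrame({
    "release year": [1980, 1990, 2000, 2010, 1980, 1990, 2000, 2010],
    "sentiment compound score": [0.1, 0.2, 0.3, 0.4, 0.4, 0.3, 0.2, 0.1],
    "gender_x": ["f", "f", "f", "f", "m", "m", "m", "m"],
    "gender_y": ["m", "m", "m", "m", "f", "f", "f", "f"],
})

# groupby skips rows whose gender_y is missing, so utterances that do not
# reply to anyone are left out automatically.
corr_by_group = {
    (speaker, listener): group["release year"].corr(group["sentiment compound score"])
    for (speaker, listener), group in toy_df.groupby(["gender_x", "gender_y"])
}
print(corr_by_group)
```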


In [21]:
df = pd.concat([avg_sentiment_by_year_ff, avg_sentiment_by_year_fm, avg_sentiment_by_year_mf, avg_sentiment_by_year_mm], ignore_index=True)

px.line(df, x='release year', y='sentiment compound score', color='group',
              title='Sentiment Scores of Utterances between Males and Females')

As we can see, there are few to no trends or correlations across any of these gender groups. For example, the correlation between release year and sentiment for female-to-male utterances is only -0.043. To make the visualisations more readable, we will try grouping by decade of release instead of year. But first, let's get more context on the distributions of different variables in our dataset.
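
The dataset already ships with a `decade` column, so nothing needs to be computed here. For reference, labels in the same `1990-1999` style could be derived from the release year roughly as follows, assuming the years are plain integers; `decade_label` is a hypothetical helper.

```python
import pandas as pd

def decade_label(year: int) -> str:
    """Map a year such as 1999 to a label such as '1990-1999'."""
    start = (year // 10) * 10
    return f"{start}-{start + 9}"

toy_years = pd.Series([1979, 1990, 1999, 2003])
toy_decades = toy_years.map(decade_label)
print(toy_decades.tolist())
```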

### Visualising distributions of different variables in the dataset

In [22]:
movie_count_by_decade = pd.DataFrame(movie_df.groupby("decade")["movie_name"].nunique())
fig = px.bar(movie_count_by_decade,
             labels={'decade': 'Decade', 'value': 'Number of Movies'},
             title = "Number of Unique Movies per Decade",
             color_discrete_map={'movie_name' : 'royalblue'})

fig.update_layout(
    paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
    height=400, width=600,
    showlegend = False
)
fig

In [23]:
# Count utterances per release year and speaker gender, then convert each year's counts to proportions.
gender_proportion_by_year = movie_df.groupby("release year")["gender_x"].value_counts().unstack()
# Speakers with gender '?' were already filtered out of movie_df, so no column needs dropping here.
gender_proportion_by_year = gender_proportion_by_year.divide(gender_proportion_by_year.sum(axis = "columns"), axis = "rows")

fig = px.bar(gender_proportion_by_year, barmode = "stack",
       labels={'release year': 'Release Year', 'value': 'Proportion of utterances'},
       title = "Distribution of genders of speakers over time",
       color_discrete_map={'f': 'salmon', 'm': 'royalblue'})

fig.update_layout(
    paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
)
fig
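
The per-year gender proportions can also be computed in one step with `pd.crosstab` and `normalize="index"`. This is a sketch of an alternative on made-up data, not the notebook's own method; `toy_df` and `gender_share` are hypothetical names.

```python
import pandas as pd

# Made-up utterances: one row per utterance, with the speaker's gender.
toy_df = pd.DataFrame({
    "release year": [1999, 1999, 1999, 1999, 2001, 2001],
    "gender_x": ["f", "m", "m", "m", "f", "m"],
})

# normalize="index" divides each row (year) by its own total,
# so every row of the result sums to 1.
gender_share = pd.crosstab(toy_df["release year"], toy_df["gender_x"], normalize="index")
print(gender_share)
```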

In [24]:
temp = movie_df["gender_x"].value_counts(normalize = True)
fig = px.bar(temp,
             labels={'value': 'Proportion', 'm': 'Male', 'f': 'Female', 'index' : 'Gender'},
             title = "Overall Distribution of Genders",
             color_discrete_map={'f': 'salmon', 'm': 'royalblue'})
fig.update_layout(
    paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
    height=400, width=300,
    showlegend = False
)
fig

### Exploring trends in dialogue sentiment across gender groups by decade of release

In [None]:
male_to_male = movie_df[(movie_df["gender_x"] == "m") & (movie_df["gender_y"] == "m")]
avg_sentiment_by_year_mm = male_to_male.groupby('decade')['sentiment compound score'].mean().reset_index()
avg_sentiment_by_year_mm["group"] = "Male to male"

male_to_female = movie_df[(movie_df["gender_x"] == "m") & (movie_df["gender_y"] == "f")]
avg_sentiment_by_year_mf = male_to_female.groupby('decade')['sentiment compound score'].mean().reset_index()
avg_sentiment_by_year_mf["group"] = "Male to female"

female_to_male = movie_df[(movie_df["gender_x"] == "f") & (movie_df["gender_y"] == "m")]
avg_sentiment_by_year_fm = female_to_male.groupby('decade')['sentiment compound score'].mean().reset_index()
avg_sentiment_by_year_fm["group"] = "Female to male"

female_to_female = movie_df[(movie_df["gender_x"] == "f") & (movie_df["gender_y"] == "f")]
avg_sentiment_by_year_ff = female_to_female.groupby('decade')['sentiment compound score'].mean().reset_index()
avg_sentiment_by_year_ff["group"] = "Female to female"

df = pd.concat([avg_sentiment_by_year_ff, avg_sentiment_by_year_fm, avg_sentiment_by_year_mf, avg_sentiment_by_year_mm], ignore_index=True)

px.line(df, x='decade', y='sentiment compound score', color='group',
              title='Sentiment Scores of Utterances between Males and Females by Decade')
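
The four near-identical blocks above could be collapsed into a single `groupby` over decade, speaker gender and listener gender. The sketch below shows one way this might look on invented toy data; `toy_df`, `group_names`, `pairs` and `avg_by_group` are hypothetical names.

```python
import pandas as pd

# Made-up utterances; one row has no listener (gender_y missing).
toy_df = pd.DataFrame({
    "decade": ["1990-1999"] * 4 + ["2000-2009"] * 4,
    "gender_x": ["m", "m", "f", "f", "m", "f", "f", "m"],
    "gender_y": ["m", "f", "m", "f", "m", "m", None, "f"],
    "sentiment compound score": [0.2, 0.4, -0.1, 0.5, 0.0, 0.3, 0.9, -0.2],
})

group_names = {
    ("m", "m"): "Male to male",
    ("m", "f"): "Male to female",
    ("f", "m"): "Female to male",
    ("f", "f"): "Female to female",
}

# Keep only rows where both genders are known ('m' or 'f').
known = ["m", "f"]
pairs = toy_df[toy_df["gender_x"].isin(known) & toy_df["gender_y"].isin(known)]

# One mean per (decade, speaker gender, listener gender) combination.
avg_by_group = (
    pairs.groupby(["decade", "gender_x", "gender_y"])["sentiment compound score"]
    .mean()
    .reset_index()
)
avg_by_group["group"] = [
    group_names[pair] for pair in zip(avg_by_group["gender_x"], avg_by_group["gender_y"])
]
print(avg_by_group)
```

The resulting frame has the same `decade`, `sentiment compound score` and `group` columns that the `px.line` call above expects.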

In [None]:
male_utterances = movie_df[(movie_df["gender_x"] == "m")]
female_utterances = movie_df[(movie_df["gender_x"] == "f")]

avg_sentiment_by_year_male = male_utterances.groupby('decade')['sentiment compound score'].mean().reset_index()
avg_sentiment_by_year_male["gender"] = "male"
avg_sentiment_by_year_female = female_utterances.groupby('decade')['sentiment compound score'].mean().reset_index()
avg_sentiment_by_year_female["gender"] = "female"

df = pd.concat([avg_sentiment_by_year_male, avg_sentiment_by_year_female], ignore_index=True)

px.line(df, x='decade', y='sentiment compound score', color='gender',
              title='Sentiment Scores of Utterances of Males and Females')

### Conclusion

From this initial exploratory analysis, we saw barely any noticeable trends in sentiment scores in the dataset. We hypothesise that this is because common words such as stopwords pull the sentiment scores toward neutral. We will explore this hypothesis in the 'Analysis' file.
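
As a rough illustration of the preprocessing this hypothesis implies, the sketch below strips stopwords from an utterance before it would be scored. It assumes a deliberately tiny, hypothetical stopword set (`TOY_STOPWORDS`) rather than NLTK's full list, and `remove_stopwords` is a hypothetical helper, not the method used in the 'Analysis' file.

```python
# Hypothetical, deliberately tiny stopword set; NLTK's English list is much longer.
TOY_STOPWORDS = {"the", "a", "to", "is", "i", "it", "so"}

def remove_stopwords(text: str, stopwords: set) -> str:
    """Drop stopwords from whitespace-separated text, matching case-insensitively."""
    kept = [
        word for word in text.split()
        if word.lower().strip(".,!?") not in stopwords
    ]
    return " ".join(kept)

cleaned = remove_stopwords("I hope so.", TOY_STOPWORDS)
negated = remove_stopwords("It is not a good idea!", TOY_STOPWORDS)
print(cleaned)
print(negated)
```

One design point to keep in mind: NLTK's English stopword list includes negations such as "not", so removing the full list can change what an utterance means. The toy set here leaves negations in place for that reason.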