# Data

The dataset used in this analysis was scraped from https://www.churchofjesuschrist.org/?lang=eng. The specific .csv was created by the data_compiler.R script (see repository).

In [72]:
import pandas as pd
import numpy as np 
import warnings
warnings.filterwarnings("ignore")

In [14]:
# read in data and look at head
conf_raw = pd.read_csv('all_conference.csv')
conf_raw.head()

Unnamed: 0.1,Unnamed: 0,year,month,session,talk_title,speaker,talk_url,talk_text,word_count
0,1,1999,4,Saturday Morning Session,The Work Moves Forward,Gordon B. Hinckley,https://www.lds.org/general-conference/1999/04...,"Welcome to conference! We again welcome you, m...",890
1,2,1999,4,Saturday Morning Session,Teach Them the Word of God with All Diligence,L. Tom Perry,https://www.lds.org/general-conference/1999/04...,"On Sunday morning, December 9, 1849, at eight ...",2303
2,3,1999,4,Saturday Morning Session,"Greed, Selfishness, and Overindulgence",Joe J. Christensen,https://www.lds.org/general-conference/1999/04...,They say the gospel is to comfort the afflicte...,2041
3,4,1999,4,Saturday Morning Session,Preparing Our Families for the Temple,Carol B. Thomas,https://www.lds.org/general-conference/1999/04...,"Brothers and sisters, I think I am happy to be...",2022
4,5,1999,4,Saturday Morning Session,The Hands of the Fathers,Jeffrey R. Holland,https://www.lds.org/general-conference/1999/04...,On this Easter weekend I wish to thank not onl...,2057


### Data info

This dataset contains information about LDS General Conference talks from 1999 to 2019. Our target variable for this analysis is word_count, and our features are year, month, session, and speaker (we will add more features as we extend our web scraper).

### Cleaning

Let's check for strange values of our target variable to determine what cleaning we need to do. I suspect that observations with very small word counts will be problematic.

In [15]:
conf_raw.nsmallest(5, 'word_count')

Unnamed: 0.1,Unnamed: 0,year,month,session,talk_title,speaker,talk_url,talk_text,word_count
613,614,2006,10,General Relief Society Meeting,Feeling the Love of the Lord,The Church of Jesus Christ of Latter-day Saints,https://www.lds.org/general-conference/2006/10...,,1
742,743,2008,4,General Young Women Meeting,Video Presentation,The Church of Jesus Christ of Latter-day Saints,https://www.lds.org/general-conference/2008/04...,,1
819,820,2009,4,General Young Women Meeting,Virtue: For Such a Time as This,The Church of Jesus Christ of Latter-day Saints,https://www.lds.org/general-conference/2009/04...,,1
899,900,2010,4,General Young Women Meeting,Be Strong: I Know Who I Am,The Church of Jesus Christ of Latter-day Saints,https://www.lds.org/general-conference/2010/04...,,1
939,940,2010,10,General Relief Society Meeting,"And of Some Have Compassion, Making a Difference",Barbara Thompson,https://www.lds.org/general-conference/2010/10...,,1


We have some datapoints with missing text. We should drop these values, since their word count values do not represent the actual talk length.

In [16]:
# count the NaNs in every column (only 'talk_text' has any)
conf_raw.isna().sum()

Unnamed: 0     0
year           0
month          0
session        0
talk_title     0
speaker        0
talk_url       0
talk_text     17
word_count     0
dtype: int64

In [17]:
# drop these 17 talks
conf = conf_raw.dropna()
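
A bare `dropna()` drops a row if any column is missing. Here that matches the 17 rows without `talk_text`, because no other column has NaNs. If that ever changes, the check can be restricted to one column with `subset`. A minimal sketch on a toy frame (the rows and the `note` column are invented for illustration, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'talk_title': ['A', 'B', 'C'],
    'talk_text': ['some text', np.nan, 'more text'],
    'note': [np.nan, 'x', 'y'],  # hypothetical extra column with its own NaN
})

# Drop rows only when 'talk_text' is missing; NaNs elsewhere are kept
cleaned = toy.dropna(subset=['talk_text']).copy()

# A bare dropna() also drops row 'A', because 'note' is missing there
strict = toy.dropna()

print(cleaned['talk_title'].tolist())
print(strict['talk_title'].tolist())
```

The trailing `.copy()` makes the result an independent frame, which can help avoid chained-assignment warnings when columns are added later.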

In [19]:
# compare the original dataframe to the new one
print(conf_raw.info())
print(conf.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1603 entries, 0 to 1602
Data columns (total 9 columns):
Unnamed: 0    1603 non-null int64
year          1603 non-null int64
month         1603 non-null int64
session       1603 non-null object
talk_title    1603 non-null object
speaker       1603 non-null object
talk_url      1603 non-null object
talk_text     1586 non-null object
word_count    1603 non-null int64
dtypes: int64(4), object(5)
memory usage: 112.8+ KB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1586 entries, 0 to 1602
Data columns (total 9 columns):
Unnamed: 0    1586 non-null int64
year          1586 non-null int64
month         1586 non-null int64
session       1586 non-null object
talk_title    1586 non-null object
speaker       1586 non-null object
talk_url      1586 non-null object
talk_text     1586 non-null object
word_count    1586 non-null int64
dtypes: int64(4), object(5)
memory usage: 123.9+ KB
None


# EDA

Before we get into our main analysis, let's explore our data a little.

First, let's see how talk length changes over time. 

In [21]:
import plotly.graph_objects as go
import datetime

In [30]:
# summarize word count by year
df = pd.DataFrame(conf['word_count'].groupby(conf['year']).mean())

In [37]:
# add column of year, taken from the groupby index rather than a hard-coded range
df['year'] = df.index
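
The year labels can also be carried through the aggregation itself instead of being attached afterwards. A minimal sketch on invented toy data (the frame `toy` and its values are assumptions for illustration), using `as_index=False` so that `year` stays an ordinary column:

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'year': [1999, 1999, 2000, 2001],
    'word_count': [1000, 2000, 1500, 1800],
})

# as_index=False keeps 'year' as a regular column next to the mean
yearly = toy.groupby('year', as_index=False)['word_count'].mean()

print(yearly)
```

This keeps each mean paired with the year it was computed from, even if some year has no talks.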

In [103]:
fig1 = go.Figure()
fig1.add_trace(go.Scatter(x = df['year'],
                         y = df['word_count'], 
                         line_color = '#FCD16B'))
fig1.update_layout(plot_bgcolor = 'white',
                  title = 'LDS General Conference Talk Length Over Time')
fig1.update_xaxes(showline = True, linewidth = 1, linecolor = 'black')
fig1.update_yaxes(showline = True, linewidth = 1, linecolor = 'black')

Talks appear to have gotten shorter over the past 10 years.

Next, let's observe how talk length varies across sessions.

In [106]:
fig2 = go.Figure()
fig2.add_trace(go.Box(y = conf.loc[conf['session'] == 'Saturday Morning Session', 'word_count'],
                     name = 'Saturday Morning Session',
                     marker_color = '#2E604A'))

fig2.add_trace(go.Box(y = conf.loc[conf['session'] == 'Saturday Afternoon Session', 'word_count'],
                     name = 'Saturday Afternoon Session',
                     marker_color = '#76A08A'))

fig2.add_trace(go.Box(y = conf.loc[conf['session'] == 'Priesthood Session', 'word_count'],
                     name = 'Priesthood Session',
                     marker_color = '#CECD7B'))

fig2.add_trace(go.Box(y = conf.loc[conf['session'] == 'Sunday Morning Session', 'word_count'],
                     name = 'Sunday Morning Session',
                     marker_color = '#FCD16B'))

fig2.add_trace(go.Box(y = conf.loc[conf['session'] == 'Sunday Afternoon Session', 'word_count'],
                     name = 'Sunday Afternoon Session',
                     marker_color = '#F8DF4F'))

fig2.add_trace(go.Box(y = conf.loc[conf['session'] == 'February 1999 Conversion and Retention Broadcast', 'word_count'],
                     name = 'February 1999 Conversion and Retention Broadcast',
                     marker_color = '#A35E60',
                     boxpoints = 'all'))

fig2.add_trace(go.Box(y = conf.loc[conf['session'] == 'General Young Women Meeting', 'word_count'],
                     name = 'General Young Women Meeting',
                     marker_color = '#CC8B3C'))

fig2.add_trace(go.Box(y = conf.loc[conf['session'] == 'General Relief Society Meeting', 'word_count'],
                     name = 'General Relief Society Meeting',
                     marker_color = '#D1362F'))

fig2.add_trace(go.Box(y = conf.loc[conf['session'] == 'Special Satellite Broadcast for Children', 'word_count'],
                     name = 'Special Satellite Broadcast for Children',
                     marker_color = '#E6A2C5'))

fig2.add_trace(go.Box(y = conf.loc[conf['session'] == "General Women's Meeting", 'word_count'],
                     name = "General Women's Meeting",
                     marker_color = '#F7B0AA'))

fig2.add_trace(go.Box(y = conf.loc[conf['session'] == "General Women's Session", 'word_count'],
                     name = "General Women's Session",
                     marker_color = '#7496D2'))

fig2.add_trace(go.Box(y = conf.loc[conf['session'] == 'General Priesthood Session', 'word_count'],
                     name = 'General Priesthood Session',
                     marker_color = '#C7CEF6'))

fig2.update_layout(title = 'Talk Length by Session',
                  plot_bgcolor = 'white',
                  showlegend = False)

fig2.update_xaxes(showline = True, linewidth = 1, linecolor = 'black')
fig2.update_yaxes(showline = True, linewidth = 1, linecolor = 'black')

Talk length doesn't seem to vary much by session type. There is an outlier, however. The February 1999 Conversion and Retention Broadcast is the obvious maximum here.

Let's now see if there's a difference between men's and women's sessions. We'll have to do a little feature engineering to look at this comparison. If the session title contains 'women', then we'll assign it to be a women's session. If it has 'priesthood' in the title, we'll assign it a male value. All others will be assigned to both genders, since both men and women speak at these sessions.

In [73]:
conf['session_gendered'] = np.where((conf['session'] == 'Priesthood Session') | 
                                    (conf['session'] == 'General Priesthood Session'), 'male',
                                   np.where((conf['session'] == 'General Young Women Meeting') |
                                           (conf['session'] == "General Women's Meeting") |
                                           (conf['session'] == "General Women's Session"), 'female', 'male and female'))

In [74]:
conf['session_gendered'].unique()

array(['male and female', 'male', 'female'], dtype=object)
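
The rule described above ("title contains 'women'" or "title contains 'priesthood'") can also be written with substring matching and `np.select` instead of nested `np.where` calls. A minimal sketch on invented toy data (the frame `toy` is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({'session': ['Priesthood Session',
                                "General Women's Session",
                                'Sunday Morning Session',
                                'General Young Women Meeting']})

# Lowercase once so the substring checks are case-insensitive
title = toy['session'].str.lower()

# Conditions are checked in order; the first match wins
conditions = [title.str.contains('priesthood'),
              title.str.contains('women')]
labels = ['male', 'female']

toy['session_gendered'] = np.select(conditions, labels,
                                    default='male and female')

print(toy)
```

Substring matching is looser than the exact comparisons used in the cell above: it labels any session whose title includes the keyword, so it is worth checking the resulting labels against the list of unique session names.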

Now we can make our visualization to compare talk length across session genders.

In [105]:
fig4 = go.Figure()
fig4.add_trace(go.Box(y = conf.loc[conf['session_gendered'] == 'male', 'word_count'],
                     name = 'male',
                     marker_color = '#7496D2',
                     boxpoints = 'all'))

fig4.add_trace(go.Box(y = conf.loc[conf['session_gendered'] == 'female', 'word_count'],
                     name = 'female',
                     marker_color = '#E6A2C5',
                     boxpoints = 'all'))

fig4.add_trace(go.Box(y = conf.loc[conf['session_gendered'] == 'male and female', 'word_count'],
                     name = 'male and female',
                     marker_color = '#C7CEF6',
                     boxpoints = 'all'))

fig4.update_layout(title = 'Talk Length by Session Gender',
                  plot_bgcolor = 'white')

fig4.update_xaxes(showline = True, linewidth = 1, linecolor = 'black')
fig4.update_yaxes(showline = True, linewidth = 1, linecolor = 'black')

There isn't much variation between session genders. Here are the medians for each:
- males: 1948 words
- females: 1750 words
- males and females: 1686 words
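
Medians like these can be computed with a single `groupby`. A minimal sketch on invented toy data (the values below are made up and do not reproduce the figures above):

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'session_gendered': ['male', 'male', 'male',
                         'female', 'female',
                         'male and female'],
    'word_count': [1900, 2000, 2100, 1700, 1800, 1650],
})

# Median word count for each session gender
medians = toy.groupby('session_gendered')['word_count'].median()

print(medians)
```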

Finally, let's see how these gendered word counts vary over time.

In [113]:
# summarize word count by session_gendered and year
df2 = pd.DataFrame(conf.groupby(['session_gendered', 'year'])['word_count'].mean())

In [126]:
# take the index and set it as new columns so we can use these for our plot
df2['gender'] = [i[0] for i in df2.index] 
df2['year'] = [j[1] for j in df2.index] 
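
Turning the index levels into columns can also be done with `reset_index`, which avoids looping over the index tuples. A minimal sketch on invented toy data (the frame `toy` and the name `summary` are assumptions for illustration):

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'session_gendered': ['male', 'male', 'male', 'female', 'female', 'female'],
    'year': [1999, 1999, 2000, 1999, 2000, 2000],
    'word_count': [2000, 2200, 1900, 1800, 1700, 1900],
})

# Aggregate, then move both index levels back into ordinary columns
summary = (toy.groupby(['session_gendered', 'year'])['word_count']
              .mean()
              .reset_index()
              .rename(columns={'session_gendered': 'gender'}))

print(summary)
```

The result has one row per gender and year, with `gender`, `year`, and `word_count` as plain columns ready for plotting.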

Now we can make our plot.

In [137]:
fig5 = go.Figure()

# filter x and y with the same mask so each year lines up with its own mean
fig5.add_trace(go.Scatter(x = df2.loc[df2['gender'] == 'male', 'year'],
                         y = df2.loc[df2['gender'] == 'male', 'word_count'],
                         line_color = '#7496D2',
                         name = 'male'))

fig5.add_trace(go.Scatter(x = df2.loc[df2['gender'] == 'female', 'year'],
                         y = df2.loc[df2['gender'] == 'female', 'word_count'],
                         line_color = '#E6A2C5',
                         name = 'female'))


fig5.update_layout(plot_bgcolor = 'white',
                  title = 'LDS General Conference Talk Length Over Time, By Session Gender')

fig5.update_xaxes(showline = True, linewidth = 1, linecolor = 'black')
fig5.update_yaxes(showline = True, linewidth = 1, linecolor = 'black',
                 range = [1600,2200])

Talk length has followed a very similar pattern over time in the male and female sessions.