CAA Data Analysis Project

Our Fall 2020 Data Science Discovery Program research project was conducted under the guidance of the Cal Alumni Association. Our goal is to understand and analyze subscriber engagement in the past year with the CalCon newsletter, a monthly email newsletter for the UC Berkeley alumni community, in order to present actionable insights.

This project was created not only to understand the individuals who are subscribed, but also to identify highly engaging content, grow the number of active subscribers, and sustain overall engagement with the alumni community.

We took several approaches to understanding the most active subscribers and the most "attractive" content. We outline them below and make suggestions for improving the CAA newsletter.


Engagement Among Clickers

We defined repeat clickers, our measure of active engagement, as subscribers who clicked on CalCon newsletters at least 7 times in the past 12 months.

We found two issues with the data: repeated click records, and missing clicker information in around 10 percent of records. To clean the data, we dropped repeated clicks on the same website within a day and counted only the first click as the click for that day. After cleaning, there were roughly 52,605 clicks from 25,402 unique clickers who clicked on a CalCon newsletter at least once in the past year.
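The de-duplication step described above could look roughly like the sketch below. The column names (`subscriber_id`, `url`, `clicked_at`) and the sample rows are hypothetical, and it assumes that "repeated clicks" means clicks by the same subscriber on the same URL on the same calendar day.

```python
import pandas as pd

# Hypothetical click log; column names and values are illustrative only.
clicks = pd.DataFrame({
    "subscriber_id": [1, 1, 1, 2, 2],
    "url": ["a.com", "a.com", "a.com", "b.com", "b.com"],
    "clicked_at": pd.to_datetime([
        "2020-01-05 09:00", "2020-01-05 17:30", "2020-01-06 08:00",
        "2020-01-05 10:00", "2020-01-05 10:05",
    ]),
})

# Sort by time so that keep="first" retains the earliest click of each day.
clicks = clicks.sort_values("clicked_at")
clicks = clicks.assign(click_date=clicks["clicked_at"].dt.date)

# Keep one click per subscriber, URL, and calendar day.
deduped = clicks.drop_duplicates(
    subset=["subscriber_id", "url", "click_date"], keep="first"
)

total_clicks = len(deduped)
unique_clickers = deduped["subscriber_id"].nunique()
print(total_clicks, unique_clickers)
```

Sorting before `drop_duplicates` matters here: without it, "first" would mean first in file order rather than earliest in time.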

The average age of repeat clickers is around 63, which indicates that those who actively clicked on the CalCon newsletters in the past year are largely seniors; most repeat clickers are older than 75.

Repeat clickers participated in alumni activities 9 times on average in the past 12 months, which is 1.5 times the average for all clickers.

The regression plot shows a positive relationship between clickers' engagement with the CalCon newsletters and their alumni activity participation, which suggests that increasing newsletter engagement may go together with higher participation in alumni activities.

This plot also indicates a positive correlation between repeat clickers' average gift amount and their number of clicks.


Opens-to-clicks Conversion Rate

The opens-to-clicks conversion rate is defined as the number of people who clicked a link divided by the number of people who opened a CalCon newsletter.

The overall opens-to-clicks conversion rate is 0.23697, indicating that about 24% of people who opened a newsletter in the past year clicked on a link at least once. Analysis of individual opens-to-clicks conversion rates for subscribers who opened at least once shows that, on average, a subscriber clicks about one link per two opens.
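As a minimal sketch of the two rates discussed here, the snippet below computes an overall rate (clickers over openers) and an average per-subscriber rate (clicks per open). The table layout, column names, and numbers are assumptions for illustration, not the project's data.

```python
import pandas as pd

# Hypothetical per-subscriber counts; names and numbers are illustrative only.
engagement = pd.DataFrame({
    "subscriber_id": [1, 2, 3, 4],
    "opens": [4, 2, 10, 0],
    "clicks": [2, 0, 5, 0],
})

# Only subscribers who opened at least once enter either rate.
openers = engagement[engagement["opens"] > 0]

# Overall rate: people who clicked at least once / people who opened at least once.
overall_rate = (openers["clicks"] > 0).sum() / len(openers)

# Per-subscriber rate: clicks per open, averaged over openers.
per_subscriber_rate = (openers["clicks"] / openers["opens"]).mean()

print(round(overall_rate, 3), round(per_subscriber_rate, 3))
```

The two numbers answer different questions: the first counts people, the second counts clicks relative to opens, so they need not agree.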

After conducting a monthly breakdown of the opens-to-clicks conversion rate, it seems that alumni are generally less likely to click on links during the school year months. We hypothesize that the conversion rate was the highest in December 2019 (20%) because Cal won the Big Game that year; conversion rates were low in April 2020 (2%) because of the coronavirus stay-at-home order.

Overall, CalCon's engaged openers seem to be older: 30 to 40 year olds and 40 to 50 year olds have the highest open rates, indicating that this age range (Generation X) makes up most of CalCon's engaged openers.

To further understand clicker engagement, we analyzed the relationship between the number of clicks per edition and the number of URLs included in each monthly newsletter. We used the code below to extract the total number of links in the PDF edition of every monthly CalCon newsletter (source).

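import pikepdf

total_urls = []
for path in newsletters:
    urls = []
    with pikepdf.Pdf.open(path) as pdf_file:
        # iterate over PDF pages
        for page in pdf_file.pages:
            annots = page.get("/Annots")
            if annots is None:
                # pages without link annotations have no /Annots entry
                continue
            for annot in annots:
                # link annotations store their target under /A -> /URI
                action = annot.get("/A")
                uri = action.get("/URI") if action is not None else None
                if uri is not None:
                    urls.append(str(uri))
    total_urls.append(len(urls))
    print("[*] Total URLs extracted:", len(urls))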

There does seem to be a positive relationship between the total number of links clicked and the number of URLs included: for every additional URL included, we can expect about 3.5 more clicks for a newsletter.
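A slope like the one quoted above can be estimated with a simple linear regression. The sketch below uses `scipy.stats.linregress` on made-up numbers; the array names and values are assumptions, and in the analysis the inputs would be the URL counts per edition and the click totals per edition.

```python
import numpy as np
from scipy import stats

# Made-up numbers for illustration only; not the project's data.
urls_per_edition = np.array([10, 12, 15, 18, 20, 25])
clicks_per_edition = np.array([40, 45, 58, 66, 75, 92])

# Ordinary least-squares fit of clicks on number of URLs.
fit = stats.linregress(urls_per_edition, clicks_per_edition)

# The slope is the expected change in clicks for one additional URL.
print(f"slope: {fit.slope:.2f} clicks per extra URL, r = {fit.rvalue:.2f}")
```

The slope describes an association across editions; it does not by itself show that adding a URL causes more clicks.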

Then, we divided subscribers into three groups (young, mid, and old) based on the CAA membership type and on the distribution of ages:

| Group | Birth year range |
| --- | --- |
| Young | 1980 - 2020 |
| Mid | 1955 - 1979 |
| Old | 0 - 1954 |

For those whose birth years were not available, we used their graduation year to infer their age and put them in the appropriate age group.
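The grouping could be sketched as follows. The column names are hypothetical, and the rule for inferring a birth year (graduation year minus 22) is an assumption made for this example; the text does not state how the inference was done.

```python
import pandas as pd

# Assumption for this sketch only: alumni graduate at about age 22.
ASSUMED_AGE_AT_GRADUATION = 22

# Hypothetical subscriber table; two rows are missing a birth year.
subscribers = pd.DataFrame({
    "birth_year": [1948, None, 1990, None],
    "grad_year": [1970, 1985, 2012, 2019],
})

# Fill missing birth years with a value inferred from the graduation year.
inferred = subscribers["grad_year"] - ASSUMED_AGE_AT_GRADUATION
subscribers["birth_year_filled"] = subscribers["birth_year"].fillna(inferred)

# Right-closed bins matching the table: up to 1954, 1955-1979, 1980-2020.
subscribers["age_group"] = pd.cut(
    subscribers["birth_year_filled"],
    bins=[float("-inf"), 1954, 1979, 2020],
    labels=["Old", "Mid", "Young"],
)

print(subscribers[["birth_year_filled", "age_group"]])
```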

After conducting a series of Kruskal-Wallis tests on each age group's alumni event participation rate versus the population's event participation rate, we've concluded that the old age group had significantly higher participation rates compared to the young and mid age groups, participating in alumni events three times as much as the young age group on average.


Activity vs Student Activity

Another factor we took into account was the subscriber's activity/affiliation while a student. After grouping alumni event participation counts by Student Activity, we compared the average participation rate within each Student Activity to the population's participation rate to determine whether subscribers within certain Student Activities had significantly higher participation rates.

from scipy import stats

stud_acts_list = act_studActCount['student_activity_desc'].tolist()

# Kruskal-Wallis test: each Student Activity's counts vs. the whole population
sig_results = []
for i in stud_acts_list:
    curr = stud_act_actCount.loc[stud_act_actCount['student_activity_desc'] == i]
    _, p_val = stats.kruskal(curr['counts'], stud_act_actCount['counts'])
    sig_results.append(p_val)

# use 5% significance: 'yes' if the p-value is at most 0.05, else 'no'
act_studActCount['significance'] = ['yes' if p <= 0.05 else 'no' for p in sig_results]

# mean participation count within each Student Activity
mean_diff_acts = []
for i in stud_acts_list:
    curr1 = stud_act_actCount.loc[stud_act_actCount['student_activity_desc'] == i]
    mean_diff_acts.append(curr1['counts'].mean())

# compare each mean to the overall mean
# (the original comparison was inverted and labeled lower means 'greater')
overall_acts_counts_mean = stud_act_actCount['counts'].mean()
act_studActCount['greater or less mean?'] = [
    'greater' if m > overall_acts_counts_mean else 'less' for m in mean_diff_acts
]

# keep activities that are significant and above the overall mean
sig_stud_act = act_studActCount.loc[
    (act_studActCount['significance'] == 'yes')
    & (act_studActCount['greater or less mean?'] == 'greater')
]

Effect of link description on clicks in CalCon Newsletters

CalCon newsletters contain multiple links. Many links have a description under their titles, but many do not. Links with and without descriptions appeared to be randomly distributed across all CalCon newsletters, so we decided to investigate how having a description relates to the number of clicks.

We made a scraper using Beautiful Soup that fetches a URL and extracts its title and description.

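import requests
from bs4 import BeautifulSoup

title_names = []
for url in left_over_links:
    try:
        # making requests instance
        reqs = requests.get(url, timeout=10)

        # using the BeautifulSoup module
        soup = BeautifulSoup(reqs.text, 'html.parser')

        # Assumption: the description is read from <meta name="description">;
        # the original called title.get_des(), which does not exist.
        meta = soup.find('meta', attrs={'name': 'description'})
        description = meta.get('content', '') if meta is not None else ''
        print("Description of title is : " + description)

        # displaying and storing the title
        for title in soup.find_all('title'):
            title_names.append(title.get_text())
            print("Title of the website is : " + title.get_text())
    except Exception:
        # record a placeholder when the page cannot be fetched or parsed
        title_names.append(0)
        print("ERROR")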

Comparing the average number of clicks per link in the two categories, we see that links with a description received 73.6% more clicks on average.

However, averages are sensitive to extreme values, so we created a boxplot of the distribution of the number of clicks per category. We found that the distribution for links with a description was heavily skewed by a few outliers.

Even when comparing medians, links with descriptions had 47.5% more clicks.
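To illustrate why the mean and the median can tell different stories on skewed data, here is a small sketch with made-up click counts, where one outlier inflates the mean of the first group. The arrays and the helper `pct_more` are hypothetical.

```python
import numpy as np

# Made-up click counts; the 400 stands in for a single outlier link.
clicks_with = np.array([12, 15, 18, 20, 400])
clicks_without = np.array([10, 11, 12, 14, 16])

def pct_more(a, b):
    """Percentage by which a exceeds b."""
    return (a - b) / b * 100

# The mean is pulled up by the outlier; the median is not.
mean_lift = pct_more(clicks_with.mean(), clicks_without.mean())
median_lift = pct_more(np.median(clicks_with), np.median(clicks_without))

print(f"mean lift: {mean_lift:.1f}%, median lift: {median_lift:.1f}%")
```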

The bar plot shows that 60% of the most-clicked links per newsletter issue had a description.

We hypothesized that links with descriptions get more clicks and conducted a two-sample t-test using scipy. Note that a t-test compares group means, so the hypotheses are stated in terms of the mean number of clicks.

Null hypothesis (H0): having a meta description does not affect the mean number of clicks. Alternative hypothesis (H1): having a meta description affects the mean number of clicks.

from scipy import stats
stats.ttest_ind(vals_1, vals_0)

The test gave a p-value of 0.085 and a t-statistic of 1.729.

As 0.085 > 0.05, we fail to reject the null hypothesis at the 5% significance level (95% confidence).

At the 10% significance level (90% confidence), however, the threshold is 0.1; since 0.085 < 0.1, we reject the null hypothesis in favor of the alternative hypothesis.

We think the relatively high p-value is due to the extreme variance of click counts among links with descriptions. We can therefore conclude only at the 90% confidence level that there is statistical evidence against the null hypothesis. This suggests that adding descriptions to links in newsletters is beneficial; in our data, links with descriptions had a 47.5% higher median number of clicks.
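Because the variance differs so much between the two groups, one possible robustness check is Welch's t-test (which does not assume equal variances) or the rank-based Mann-Whitney U test. Neither was part of the original analysis; the sketch below uses synthetic data and hypothetical variable names to show how the decision at the two significance levels would be read off.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic, right-skewed click counts; not the project's data.
with_desc = rng.lognormal(mean=3.2, sigma=1.0, size=60)
without_desc = rng.lognormal(mean=3.0, sigma=0.6, size=60)

# Welch's t-test: compares means without assuming equal variances.
welch = stats.ttest_ind(with_desc, without_desc, equal_var=False)

# Mann-Whitney U: rank-based, so less sensitive to a few extreme values.
mwu = stats.mannwhitneyu(with_desc, without_desc, alternative="two-sided")

# The p-value is fixed by the data; only the threshold alpha changes.
for alpha in (0.05, 0.10):
    decision = "reject H0" if welch.pvalue < alpha else "fail to reject H0"
    print(f"alpha={alpha:.2f}: {decision} (Welch p={welch.pvalue:.3f})")

print(f"Mann-Whitney U p={mwu.pvalue:.3f}")
```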


Next Steps

From the insights we have extracted through our analysis, we have made the following suggestions to CAA in order to optimize their content to increase their subscribers' engagement:

  • Appeal to older alumni by curating content for them and by trying to attract more senior subscribers, as they are much more likely to participate in events and donate.
  • Write a description under every link, as this may increase the chance that the link is clicked.

However, there is still analysis left to do to solidify our findings and identify more areas for improvement:

  • Conduct NLP analysis on the "Feature Benefits" sections of each newsletter
  • Access data regarding when people subscribed/unsubscribed to the newsletter
  • Classify the content for links with description vs those without a description to see if people tend to generally click less on a particular type of content

We believe the insights and suggestions from this analysis can help increase alumni engagement with the newsletters and events, as well as donations, supporting the long-term success of CAA's operations.