In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from scipy.stats import *
from scipy.stats import bootstrap, ttest_ind, ttest_rel
from statsmodels.stats import diagnostic

%matplotlib inline

# Homework 1 (HW1)

By the end of this homework, we expect you to be able to:

- Load data and handle data using pandas;
- Navigate the documentation of Python packages by yourself;
- Filter and tidy up noisy real-world datasets;
- Aggregate your data in different (and hopefully helpful) ways; 
- Create meaningful visualizations to analyze the data;

---

## Important Dates

- Homework release: Fri 14 Oct 2022
- **Homework due**: Sat 29 Oct 2022, 23:59
- Grade release: Mon 07 Nov 2022

---

##  Some rules

1. You are allowed to use any built-in Python library that comes with Anaconda. If you want to use an external library, 
you may do so, but must justify your choice.
2. Make sure you use the `data` folder provided in the repository in read-only mode. (Or alternatively, be sure you 
don’t change any of the files.)
3. Be sure to provide a textual description of your thought process, the assumptions you made, the solution you 
implemented, and explanations for your answers. A notebook that only has code cells will not suffice.
4. For questions containing the **/Discuss:/** prefix, answer not with code, but with a textual explanation
 (**in markdown**).
5. Back up any hypotheses and claims with data, since this is an important aspect of the course.
6. Please write all your comments in English, and use meaningful variable names in your code. Your repo should have a 
single notebook (plus the required data files) in the *master/main* branch. If there are multiple notebooks present, 
we will **not grade** anything.
7. We will **not run your notebook for you**! Rather, we will grade it as is, which means that only the results 
contained in your evaluated code cells will be considered, and we will not see the results in unevaluated code cells. 
Thus, be sure to hand in a **fully-run and evaluated notebook**. In order to check whether everything looks as intended,
 you can check the rendered notebook on the GitHub website once you have pushed your solution there.
8. In continuation to the previous point, interactive plots, such as those generated using `plotly`, should be 
**strictly avoided**!
9. Make sure to print results or dataframes that confirm you have properly addressed the task.

---

In this homework, we will analyze data from A/B tests of headlines conducted by Upworthy from January 2013 to April 2015 to study whether the language used in the headline determines the number of people that will read the associated news piece. The homework contains four tasks: in task 1, we will process the data; in task 2, we will extract meaningful signals from the data; in task 3, we will test whether the language of headlines impacts their success; and in task 4, we will explore the heterogeneity of these effects (e.g., do they vary through time?).


### **What is an A/B test?** 
A/B tests are experiments that compare two scenarios (e.g., scenario A vs. scenario B). 
They test subjects' responses to each of the variants to determine which is more effective ([read more about A/B tests on Wikipedia](https://en.wikipedia.org/wiki/A/B_testing)). 
A/B tests allow us to draw conclusions about the different scenarios by randomizing exposure to them, e.g., one could flip a coin and assign a user to scenario A if it lands heads and to B if it lands tails. 
Since exposure is randomized, we can be confident that the scenarios are the sole explanation for statistically significant differences in subjects' responses (if they exist). 
In theory, A/B testing refers to an experiment that compares two scenarios; however, in practice, the term is also used when we compare multiple scenarios (e.g., A vs. B vs. C), although the more precise terminology would be to call such an experiment a "multinomial test."
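
The coin-flip assignment described above can be sketched in a few lines. This is a minimal illustration, not Upworthy's actual procedure: the visitor IDs, the seed, and the helper `assign_variant` are hypothetical.

```python
import random


def assign_variant(rng: random.Random) -> str:
    """Flip a fair coin: heads -> scenario A, tails -> scenario B."""
    return "A" if rng.random() < 0.5 else "B"


rng = random.Random(42)  # fixed seed so the sketch is reproducible
user_ids = [f"user_{i}" for i in range(1000)]  # hypothetical visitor IDs

# Each visitor is assigned independently of who they are, which is what
# lets us attribute differences in responses to the scenario itself.
assignment = {user_id: assign_variant(rng) for user_id in user_ids}

counts = {v: list(assignment.values()).count(v) for v in ("A", "B")}
print(counts)  # the two groups should be of roughly equal size
```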

### **How were A/B tests used by Upworthy?** 
Upworthy used A/B testing to increase news readership, conducting experiments for each published news piece. 
In each experiment, they created multiple "packages" of stimuli, varying headlines, images, excerpts, and ledes for the same news piece. 
Different "packages" were shown on their (now defunct) website to engage users with the news pieces they produced. Upworthy found "the best" package by conducting A/B tests, showing different packages to different users, and measuring how often users clicked on each version. 
Below, we show three "packages" used by Upworthy in an experiment, each with a different headline for the same news piece. 
Users who visited the Upworthy website were randomly shown one of the three versions of the headline below. Then, Upworthy measured the percentage of times users in each scenario clicked to read the news. 
The headline with the highest percentage of clicks per view (click through rate) was then declared the "winner" and became the default for all visitors.

![Example A/B test](example.png)
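
As a rough sketch of how a "winner" is picked, the snippet below computes the click through rate of three packages and selects the highest one. The headline labels and the impression and click counts are made up for illustration; they are not taken from the dataset.

```python
# Hypothetical impression and click counts for three packages of one experiment.
packages_example = {
    "headline A": {"impressions": 3000, "clicks": 45},
    "headline B": {"impressions": 3100, "clicks": 62},
    "headline C": {"impressions": 2900, "clicks": 29},
}

# Click through rate = clicks / impressions, computed per package.
ctr_example = {
    name: stats["clicks"] / stats["impressions"]
    for name, stats in packages_example.items()
}

# The package with the highest click through rate is declared the winner.
winner_example = max(ctr_example, key=ctr_example.get)

for name, ctr in ctr_example.items():
    print(f"{name}: {ctr:.2%}")
print("winner:", winner_example)
```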

 ### **Where does this data come from?** 
 
 From a paper [1].

[1] Matias, J.N., Munger, K., Le Quere, M.A. et al. The Upworthy Research Archive, a time series of 32,487 experiments in U.S. media. Sci Data 8, 195 (2021). https://doi.org/10.1038/s41597-021-00934-7

### **Where can I find this data?**  

You can find it in the `/data/` folder.

### **Terminology**

- **News piece:** A news article. In the dataset considered, these were all produced by Upworthy.
- **Package:** The set of visual stimuli inviting the user to read an article. The figure above shows a package with a headline and an image. At times, there was an excerpt of the article also shown in the package and/or the lede, i.e., ["the introductory section of a news story that is intended to entice the reader to read the full story."](https://www.merriam-webster.com/words-at-play/bury-the-lede-versus-lead#:~:text=In%20journalism%2C%20the%20lede%20refers,machines%20began%20disappearing%20from%20newsrooms.)
- **Experiment:** Each experiment is an A/B test (or multinomial test, to be more precise) comparing how users reacted to different "packages." Experiments measured two things: 1) how many users were shown each package; and 2) how many individuals clicked each package.

### **Data description**

| Column name          | Description |
|----------------------|-------------|
| created_at           | Time the package was created (timezone unknown) |
| test_week            | Week the package was created, a variable constructed by the archive creators for stratified random sampling |
| clickability_test_id | The test ID. Viewers were randomly assigned to packages with the same test ID |
| impressions          | The number of viewers who were assigned to this package. The total number of participants for a given test is the sum of impressions for all packages that share the same clickability_test_id |
| headline             | The headline being tested |
| eyecatcher_id        | Image ID. Image files are not available. Packages that shared the same image have the same eyecatcher_id |
| clicks               | The number of viewers (impressions) that clicked on the package. The clickrate for a given package is the number of clicks divided by the number of impressions |
| excerpt              | Article excerpt |
| lede                 | The opening sentence or paragraph of the story |
| slug                 | Internal name for the web address |
| share_text           | Summary for display on social media when the article is shared. This was not shown in tests, since tests were conducted on the Upworthy website |
| square               | When used, part of the same social media sharing suggestion as the share text |
| significance         | NOT an estimate of statistical significance; a complex, inconsistent calculation that compared the clicks on a package to the clicks on all previous packages that were fielded on the same pages |
| first_place          | Along with significance, shown to editors to guide decisions about what test to choose |
| winner               | Whether a package was selected by editors to be used on the Upworthy site after the test |
| updated_at           | The last time the package was updated in the Upworthy system |


## Task 1: Getting familiar with the data

Your first task is to conduct initial analyses to understand the data and process it in a way that will allow us to more easily answer our key question: *how does the language of a headline determine its success?*

1.1 Load the data into memory using pandas and print the first lines to get a sense of it.

1.2 Each experiment comparing different versions of the same news piece ("packages") has a unique identifier (`clickability_test_id` column). 
Calculate how many different experiments were conducted in this dataset and, on average, how many packages were considered per experiment. 
Last, plot the distribution of packages per experiment with a visualization of your choice.

1.3 A common way to measure success in online A/B tests is what is called "the clickthrough rate."
Given that often A/B tests are created to find what engages users (here, "packages" of headlines, images, etc), we would expect that a "good" package makes people click often. 
Create a column named `ctr` by dividing the number of clicks a package received (`clicks` column) by the number of impressions it received (`impressions` column).

1.4 Packages varied any combination of the headline (`headline` column), the excerpt (`excerpt`), the first sentence of the article (`lede`), and the image that illustrates the news piece (`eyecatcher_id`, a hash per image). 
But we want to isolate the effect of the headline on the clickthrough rate. To do that, create a new dataframe where you filter out all experiments in which only one headline is present. 
Print the length of this new dataframe and how many experiments were discarded in the filtering process.

1.5 For comparison, repeat the procedure described in **T1.4** with the `eyecatcher_id` column, i.e., create a dataframe considering only experiments that vary the image. 
Again, print the length of this new dataframe and how many experiments were discarded in the filtering process.

1.6 **Discuss:** Considering the answers to questions **T1.4** and **T1.5**, what can we say about the different versions of the news tested by Upworthy?

1.7 For our subsequent analysis, we want to compare the causal effect of headlines on the success of a news piece. 
For that, we can compare pairs of packages with the same `eyecatcher_id`, `lede`, and `excerpt`, but different `headlines`.
Note that this means that if an experiment considered 5 different headlines and did not vary any other stimulus, we would have 5C2 (i.e., 5 choose 2, 10) pairs to consider.
Create a dataset where:
- each row corresponds to a pair of packages with different `headline` but the same `eyecatcher_id`, `lede`, and `excerpt`. 
- there are columns containing the headlines of each of the news versions (`headline1`, `headline2`) and the clickthrough rate of each of the news versions (`ctr1`, `ctr2`). 
- the columns `headline1` and `ctr1` contain the data associated with the news version with the highest clickthrough rate. Print the first rows of your newly created dataframe, as well as its length.
-  the rows where the two news pieces had exactly the same clickthrough rate should be filtered out (this is for simplicity's sake).
-  the column `date_created` contains the date when the news version with the highest clickthrough rate was created.
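
To make the pair count in **T1.7** concrete, the sketch below enumerates all unordered pairs among five placeholder headlines. The headline labels are hypothetical; only the counting logic matters.

```python
from itertools import combinations
from math import comb

# Five placeholder headlines from a single hypothetical experiment.
headlines = [f"headline {i}" for i in range(1, 6)]

# Every unordered pair of distinct headlines appears exactly once.
pairs = list(combinations(headlines, 2))

print(len(pairs), comb(5, 2))
for first, second in pairs[:3]:
    print(first, "vs.", second)
```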

1.8 To get a sense of the impact of headline change, measure the average difference per pair between the most clicked-through (`ctr1`) and the least clicked-through headline (`ctr2`), as well as the average clickthrough rate for the least clicked through headline (`ctr2`). 

1.9 **Discuss:** Considering your answer to **T1.8**, and assuming the average differences in clickthrough rates between pairs are statistically significant, do you think that headlines are impactful in the news business? Justify with the data.

1.1 Load the data into memory using pandas and print the first lines to get a sense of it.

In [None]:
pd.set_option("display.max_columns", None)
pd.options.display.max_rows = 20

data_folder = "./Data/"

packages = pd.read_csv(
    data_folder + "upworthy.csv.gz",
    parse_dates=["created_at", "updated_at"],
    na_values=["Nan"],
)
packages.head()

1.2 Each experiment comparing different versions of the same news piece ("packages") has a unique identifier (`clickability_test_id` column). Calculate how many different experiments were conducted in this dataset and, on average, how many packages were considered per experiment. Last, plot the distribution of packages per experiment with a visualization of your choice.



In [None]:
# Number of packages in each experiment (one experiment per clickability_test_id)
count_by_experiment = packages.groupby("clickability_test_id").size()

nb_experiment = len(count_by_experiment)
mean_nb_experiment = count_by_experiment.mean()
std_nb_experiment = count_by_experiment.std()
print(
    "The total number of experiments is %d, with on average %0.2f packages considered per experiment and a standard deviation of %0.1f packages."
    % (nb_experiment, mean_nb_experiment, std_nb_experiment)
)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
fig.suptitle("Distribution of the number of packages per experiment")

# Histogram with a log-scaled y axis, since most experiments have few packages
ax1.hist(count_by_experiment.values, bins=40)
ax1.set(xlabel="Number of packages", ylabel="Number of experiments", yscale="log")

# Boxplot of the same distribution
ax2.boxplot(count_by_experiment)
ax2.set(xticks=[], ylabel="Number of packages")
plt.show()

1.3 A common way to measure success in online A/B tests is what is called "the clickthrough rate." Given that often A/B tests are created to find what engages users (here, "packages" of headlines, images, etc), we would expect that a "good" package makes people click often. Create a column named `ctr` by dividing the number of clicks a package received (`clicks` column) by the number of impressions it received (`impressions` column).

In [None]:
# Clickthrough rate: clicks divided by impressions (vectorized over all packages)
packages["ctr"] = packages["clicks"] / packages["impressions"]
packages.sort_values(by="ctr", ascending=False).head()

1.4 Packages varied any combination of the headline (`headline` column), the excerpt (`excerpt`), the first sentence of the article (`lede`), and the image that illustrates the news piece (`eyecatcher_id`, a hash per image). But we want to isolate the effect of the headline on the clickthrough rate. To do that, create a new dataframe where you filter out all experiments in which only one headline is present. Print the length of this new dataframe and how many experiments were discarded in the filtering process.

In [None]:
# Keep only the experiments that tested more than one distinct headline
not_unique_headline_df = packages.groupby(by=["clickability_test_id"]).filter(
    lambda x: x["headline"].nunique() != 1
)

# Number of experiments with more than one headline
not_unique_headline_exp_count = not_unique_headline_df[
    "clickability_test_id"
].nunique()
print(
    "The new dataframe length is %d.\n%d experiments were retained and %d experiments were dropped."
    % (
        len(not_unique_headline_df),
        not_unique_headline_exp_count,
        nb_experiment - not_unique_headline_exp_count,
    )
)

1.5 For comparison, repeat the procedure described in **T1.4** with the `eyecatcher_id` column, i.e., create a dataframe considering only experiments that vary the image. Again, print the length of this new dataframe and how many experiments were discarded in the filtering process.

In [None]:
# Keep only the experiments that tested more than one distinct image
not_unique_image_df = packages.groupby(by=["clickability_test_id"]).filter(
    lambda x: x["eyecatcher_id"].nunique() != 1
)

# Number of experiments with more than one image
not_unique_image_exp_count = not_unique_image_df["clickability_test_id"].nunique()
print(
    "The new dataframe length is %d.\n%d experiments were retained and %d experiments were dropped."
    % (
        len(not_unique_image_df),
        not_unique_image_exp_count,
        nb_experiment - not_unique_image_exp_count,
    )
)

1.6 **Discuss:** Considering the answers to questions **T1.4** and **T1.5**, what can we say about the different versions of the news tested by Upworthy?

Upworthy varied the headline more often than the eyecatcher (image): more than half of the experiments vary the headline (2586), whereas fewer than half vary the eyecatcher (1719).

1.7 For our subsequent analysis, we want to compare the causal effect of headlines on the success of a news piece. 
For that, we can compare pairs of packages with the same `eyecatcher_id`, `lede`, and `excerpt`, but different `headlines`.
Note that this means that if an experiment considered 5 different headlines and did not vary any other stimulus, we would have 5C2 (i.e., 5 choose 2, 10) pairs to consider.
Create a dataset where:
- each row corresponds to a pair of packages with different `headline` but the same `eyecatcher_id`, `lede`, and `excerpt`. 
- there are columns containing the headlines of each of the news versions (`headline1`, `headline2`) and the clickthrough rate of each of the news versions (`ctr1`, `ctr2`). 
- the columns `headline1` and `ctr1` contain the data associated with the news version with the highest clickthrough rate. Print the first rows of your newly created dataframe, as well as its length.
-  the rows where the two news pieces had exactly the same clickthrough rate should be filtered out (this is for simplicity's sake).
-  the column `date_created` contains the date when the news version with the highest clickthrough rate was created.


In [None]:
# Self-join: pair packages of the same experiment that share image, lede, and excerpt
comparaison = not_unique_headline_df.merge(
    not_unique_headline_df,
    on=["clickability_test_id", "eyecatcher_id", "lede", "excerpt"],
    how="inner",
)

# Keep pairs with different headlines and different clickthrough rates.
# Requiring ctr_x > ctr_y puts the winner on the left (suffix _x) and keeps
# each unordered pair only once; pairs with identical rates are dropped.
comparaison = comparaison[
    (comparaison["headline_x"] != comparaison["headline_y"])
    & (comparaison["ctr_x"] > comparaison["ctr_y"])
]

# Winner columns get suffix 1, loser columns get suffix 2;
# date_created is the creation date of the winning package
comparaison_final = comparaison.rename(
    columns={
        "headline_x": "headline1",
        "headline_y": "headline2",
        "ctr_x": "ctr1",
        "ctr_y": "ctr2",
        "created_at_x": "date_created",
    }
)[
    [
        "headline1",
        "headline2",
        "ctr1",
        "ctr2",
        "date_created",
        "eyecatcher_id",
        "lede",
        "excerpt",
    ]
]

# Dropping duplicates
comparaison_final = comparaison_final.drop_duplicates()
print("The dataset contains %d pairs." % len(comparaison_final))
comparaison_final.head()

1.8 To get a sense of the impact of headline change, measure the average difference per pair between the most clicked-through (`ctr1`) and the least clicked-through headline (`ctr2`), as well as the average clickthrough rate for the least clicked through headline (`ctr2`). 

In [None]:
# Difference in clickthrough rate between winner and loser, per pair
comparaison_final["delta_ctr"] = comparaison_final["ctr1"] - comparaison_final["ctr2"]

print(
    "The average difference per pair between the most clicked-through and the least "
    "clicked-through headline is {:.2%} and has a standard deviation of {:.2%}".format(
        comparaison_final["delta_ctr"].mean(), comparaison_final["delta_ctr"].std()
    )
)
print(
    "The average clickthrough rate for the least clicked-through headline is "
    "{:.2%} and has a standard deviation of {:.2%}".format(
        comparaison_final["ctr2"].mean(), comparaison_final["ctr2"].std()
    )
)

1.9 **Discuss:** Considering your answer to **T1.8**, and assuming the average differences in clickthrough rates between pairs are statistically significant, do you think that headlines are impactful in the news business? Justify with the data.

Since the differences are assumed to be statistically significant, we can take the average difference per pair between the most and the least clicked-through headline, 0.4 percentage points, at face value.

At first sight, this difference does not seem substantial.

However, the average clickthrough rate for the least clicked-through headline is only 1%. Relative to that baseline, choosing the better headline alone increases the clickthrough rate by about 40% (0.4 / 1).

Thus, headlines are a key element in the news business.
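
The relative increase follows from dividing the average difference by the loser's average clickthrough rate. A minimal check of that arithmetic, using the rounded values quoted in this answer (the variable names are illustrative):

```python
# Rounded values quoted in the answer above.
avg_delta_ctr = 0.004  # average ctr1 - ctr2, i.e. 0.4 percentage points
avg_loser_ctr = 0.01   # average ctr2, i.e. 1%

# Relative lift of the winner over the loser baseline.
relative_lift = avg_delta_ctr / avg_loser_ctr
print(f"{relative_lift:.0%}")  # → 40%
```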

## Task 2: Extracting signals from the data

Your second task is to extract meaningful signals from the data. 
We start this task from the dataset obtained in **T1.7**. 
Recall that we have one A/B test per row with the clickthrough rate of two news pieces that differ only in their headline. 
We refer to the version with the higher clickthrough rate as the "winner" and the version with the lower as the "loser." 
(Note that this is not the same as the column `winner` in the original data, which captures a similar concept but considering the original experiments, where multiple comparisons were made!)
 
2.1 Using the function provided below, count the number of words in each headline, creating columns `numwords1` and `numwords2` corresponding to the number of words in the winner and loser headlines.

2.2 Using the dictionary of pronouns provided below, create indicator variables corresponding to each set of pronouns (e.g., first-person singular may yield columns `first_person_singular1` and `first_person_singular2` for the headlines in each A/B test). 
Each indicator variable in the dataframe should equal 1 if the corresponding headline uses the corresponding type of pronoun and 0 otherwise. 
Your code should be agnostic to lower/upper case.

2.3 One easy way to classify sentiment is simply to match negative or positive words. 
Use the linked lists of words ([positive][1], [negative][2]) to obtain "positive sentiment" and "negative sentiment" scores for each headline. Create columns `positive1`/`positive2` and `negative1`/`negative2` containing indicator variables for positive and negative sentiment, i.e., a headline has a "positive sentiment" (or negative) score equal to 1 if it contains at least one positive (or negative) sentiment word on the list. Otherwise, its "positive sentiment" (or negative) score equals 0.
    
[1]: https://ptrckprry.com/course/ssd/data/positive-words.txt
[2]: https://ptrckprry.com/course/ssd/data/negative-words.txt

--- 

**Comments**

- For **T2.3**, beware of encodings!

2.1 Using the function provided below, count the number of words in each headline, creating columns `numwords1` and `numwords2` corresponding to the number of words in the winner and loser headlines.

In [None]:
# 2.1 (provided code)
def count_words_simple(x):
    return len(x.split(" "))


str_test = "How many words are here?"
print(str_test, count_words_simple(str_test))

In [None]:
# Word counts of the winner (1) and loser (2) headlines
comparaison_final["numwords1"] = comparaison_final["headline1"].apply(
    count_words_simple
)
comparaison_final["numwords2"] = comparaison_final["headline2"].apply(
    count_words_simple
)
comparaison_final.sort_values(
    by=["numwords1", "numwords2"], ascending=[False, False]
).head()

2.2 Using the dictionary of pronouns provided below, create indicator variables corresponding to each set of pronouns (e.g., first-person singular may yield columns `first_person_singular1` and `first_person_singular2` for the headlines in each A/B test). 
Each indicator variable in the dataframe should equal 1 if the corresponding headline uses the corresponding type of pronoun and 0 otherwise. 
Your code should be agnostic to lower/upper case.


In [None]:
# 2.2 (provided code)
feature_wordsets = dict(
    [
        # https://en.wikipedia.org/wiki/English_personal_pronouns
        (
            "first_person_singular",
            [
                "i",
                "me",
                "my",
                "mine",
                "myself",
                "i'd",
                "i'll",
                "i'm",
                "i've",
                "id",
                "im",
                "ive",
            ],
        ),
        (
            "first_person_plural",
            [
                "we",
                "us",
                "our",
                "ours",
                "ourselves",
                "we'd",
                "we'll",
                "we're",
                "we've",
            ],
        ),
        (
            "second_person",
            [
                "you",
                "your",
                "yours",
                "yourself",
                "ya",
                "you'd",
                "you'll",
                "you're",
                "you've",
                "youll",
                "youre",
                "youve",
                "yourselves",
            ],
        ),
        (
            "third_person_singular",
            [
                "he",
                "him",
                "his",
                "himself",
                "he'd",
                "he's",
                "hes",
                "she",
                "her",
                "hers",
                "herself",
                "she'll",
                "she's",
                "shes",
                "it",
                "its",
                "itself",
                "themself",
            ],
        ),
        (
            "third_person_plural",
            [
                "they",
                "them",
                "their",
                "theirs",
                "themselves",
                "they'd",
                "they'll",
                "they've",
                "theyll",
                "theyve",
            ],
        ),
    ]
)

In [None]:
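# For each headline (winner = 1, loser = 2) and each pronoun category, flag
# whether any lower-cased, space-separated token of the headline is in the word set.
for i in range(1, 3):
    tokens = comparaison_final["headline" + str(i)].str.lower().str.split(" ")
    for key, words in feature_wordsets.items():
        wordset = set(words)
        comparaison_final[key + str(i)] = tokens.apply(
            lambda toks: int(not wordset.isdisjoint(toks))
        )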

In [None]:
comparaison_final.sort_values(by=["ctr1", "ctr2"], ascending=[False, False]).head()

2.3 One easy way to classify sentiment is simply to match negative or positive words. 
Use the linked lists of words ([positive][1], [negative][2]) to obtain "positive sentiment" and "negative sentiment" scores for each headline. Create columns `positive1`/`positive2` and `negative1`/`negative2` containing indicator variables for positive and negative sentiment, i.e., a headline has a "positive sentiment" (or negative) score equal to 1 if it contains at least one positive (or negative) sentiment word on the list. Otherwise, its "positive sentiment" (or negative) score equals 0.
    
[1]: https://ptrckprry.com/course/ssd/data/positive-words.txt
[2]: https://ptrckprry.com/course/ssd/data/negative-words.txt


In [None]:
# read positive words file
with open(data_folder + "positive-words.txt", encoding="ISO-8859-1") as p:
    pos = p.readlines()[35:]

pos = [s.strip() for s in pos]

# read negative words file
with open(data_folder + "negative-words.txt", encoding="ISO-8859-1") as n:
    neg = n.readlines()[35:]

neg = [s.strip() for s in neg]

In [None]:
# Build the word sets once instead of once per row
pos_set, neg_set = set(pos), set(neg)

# Indicator = 1 if the headline contains at least one word from the list, else 0
for i in range(1, 3):
    tokens = comparaison_final["headline" + str(i)].str.lower().str.split(" ")
    comparaison_final["positive" + str(i)] = tokens.apply(
        lambda toks: int(not pos_set.isdisjoint(toks))
    )
    comparaison_final["negative" + str(i)] = tokens.apply(
        lambda toks: int(not neg_set.isdisjoint(toks))
    )

In [None]:
comparaison_final.sort_values(by=["ctr1", "ctr2"], ascending=[False, False]).head()

## Task 3: Estimating the effect of language on headline success

Your third task revolves around the question *how does language impact headlines' success?*

3.1 First, we examine whether the winner headlines have more or fewer words than the loser headline. Conduct an independent sample t-test and paired t-test (see [scipy.stats](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html#scipy.stats.ttest_ind); for the independent sample t-test, assume equal variance). Also, calculate and print the mean difference between the number of words in the winner and the loser headlines.

3.2 **Discuss:** Are longer headlines more successful? Justify.

3.3 The [t-statistic](https://en.wikipedia.org/wiki/T-statistic) is the ratio of the departure of the estimated value of a parameter from its hypothesized value to its standard error. In a t-test, the higher the t-statistic, the more confidently we can reject the null hypothesis. Use `numpy.random` to create four samples, each of size 30:
- $X \sim Uniform(0,1)$
- $Y \sim Uniform(0,1)$
- $Z = X/2 + Y/2 + 0.1$
- $K = Y + 0.1$
    
3.4 **Discuss:** What are the expected values and the variance of $X$, $Y$, $Z$, and $K$? (You don't need to justify them!)

3.5 Run the following simulation 10000 times, storing the $p$-values for the tests at each run:
- Sample new values  for $X$, $Y$, $Z$ and $K$ ($n=30$ each). 
- Run independent sample t-test (assuming equal variance) and paired t-test comparing $X$ and $Z$.
-  Run independent sample t-test (assuming equal variance) and paired t-test comparing $X$ and $K$.

3.6 Recall that the power of a binary hypothesis test is the probability that the test correctly rejects the null hypothesis when the alternative hypothesis is true. Using the p-values and assuming that we reject the null hypothesis if $p < 0.05$, calculate the statistical power of:
- The independent sample t-test comparing $X$ and $Z$.
- The paired t-test comparing $X$ and $Z$.
- The independent sample t-test comparing $X$ and $K$.
- The paired t-test comparing $X$ and $K$.
    
3.7 **Discuss:** When are paired t-tests helpful? Justify.

3.8 With a bootstrapping approach (implemented by yourself, you should not use existing bootstrapping functions), estimate the average difference and 95% confidence intervals for:
- the mean ratio between the number of words in the winner headline and the loser headline (i.e., the number of words in the winner headline divided by the number of words in the loser headlines).
- the difference in usage of positive words between winner and loser headlines.
- the difference in usage of negative words between winner and loser headlines.
- The difference in usage of each type of pronoun between winner and loser headlines.

3.9 **Discuss:** According to the results obtained in **T3.8**, what headlines grab people's attention the most? Justify your answer.
    
---
**Comments:**

- Paired t-test formula: $t = \frac{\overline{x}_{\mathrm{diff}}}{s_{\mathrm{diff}} / \sqrt n }$ where:
    - $\overline{x}_{\mathrm{diff}}$ is the sample mean of the differences within the matched pairs; and
    - $s_{\mathrm{diff}}$ is the sample standard deviation of those differences; and
    - $n$ is the number of matched pairs.
    
- Independent samples t-test formula: $t = \frac{\overline{x}_{1} - \overline{x}_{2}}{\sqrt{\frac{s_{1}^{2}}{n_{1}} + \frac{s_{2}^{2}}{n_{2}}}}$ where:
    - $\overline{x}_{\mathrm{1}}$ is the sample mean of the first group; and
    - $s_{\mathrm{1}}$ is the sample standard deviation of the first group (so $s_{1}^{2}$ is its sample variance); and
    - $n_1$ is the number of samples in the first group (and likewise for the second group). When both groups have the same size, this expression coincides with the pooled-variance statistic used under the equal-variance assumption.
    
     
- t-tests are valid for samples of non-normal distribution for large enough samples (a rule of thumb used is: n$\geq$30)!
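
The two formulas above can be checked numerically against `scipy.stats`. The sketch below uses synthetic data (the samples `a` and `b` are made up for illustration) with equal group sizes, in which case the independent-samples expression above matches the equal-variance statistic.

```python
import numpy as np
from scipy.stats import ttest_ind, ttest_rel

rng = np.random.default_rng(0)
# Synthetic matched samples, for illustration only
a = rng.normal(loc=0.0, scale=1.0, size=30)
b = a + rng.normal(loc=0.2, scale=0.5, size=30)

# Paired t-test: mean of the differences divided by its standard error
d = a - b
t_paired = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))

# Independent samples t-test as written in the comment above
t_ind = (a.mean() - b.mean()) / np.sqrt(
    a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b)
)

print(np.isclose(t_paired, ttest_rel(a, b).statistic))
print(np.isclose(t_ind, ttest_ind(a, b, equal_var=True).statistic))
```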

**3.1** First, we examine whether the winner headlines have more or fewer words than the loser headline. Conduct an independent sample t-test and paired t-test (see scipy.stats; for the independent sample t-test, assume equal variance). Also, calculate and print the mean difference between the number of words in the winner and the loser headlines.

In [None]:
### Independent samples t-test
ind_t_test = ttest_ind(
    comparaison_final["numwords1"], comparaison_final["numwords2"], equal_var=True
)

### Paired t-test
paired_t_test = ttest_rel(
    comparaison_final["numwords1"], comparaison_final["numwords2"]
)

### Difference in length between winner and loser headline
diff = comparaison_final["numwords1"] - comparaison_final["numwords2"]
mean_diff = diff.mean()
std_diff = diff.std()


fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
fig.suptitle(
    "Distribution of the number of headline pairs as a function of the difference in length between winner and loser headline"
)
ax1.hist(diff.values, bins=120)


ax1.set(
    xlabel="Difference in length between winner and loser headline",
    ylabel="Number of headline pairs",
)
ax2.boxplot(diff.values)
ax2.set(
    xticks=[], ylabel="Difference in length between winner and loser headline"
)


print(
    "Independent t-test results =>  Statistics : %0.2f , pvalue: %1.2E"
    % (ind_t_test[0], ind_t_test[1])
)
print(
    "Paired t-test results =>  Statistics : %0.2f , pvalue: %1.2E"
    % (paired_t_test[0], paired_t_test[1])
)
print(
    "The mean difference is equal to %0.2f words, and the standard deviation is equal to %0.2f"
    % (mean_diff, std_diff)
)

**3.2 Discuss:** Are longer headlines more successful? Justify.

Paired t-test : $ (p_{value} < 0.05)  \Rightarrow$ we can reject the null hypothesis of identical average headline length. 

Independent t-test : $ (p_{value} < 0.05)  \Rightarrow$ we can reject the null hypothesis of identical average headline length. 

Both tests indicate that the winning and losing headlines do not have the same average length, so headline length matters. Since each winner is matched with a loser from the same A/B test, the paired t-test is the more appropriate of the two.

Since the mean difference in length between winner and loser headlines is 0.28 words, and thus positive, we can conclude that longer headlines tend to be more successful, although the difference is small.

**3.3** The [t-statistic](https://en.wikipedia.org/wiki/T-statistic) is the ratio of the departure of the estimated value of a parameter from its hypothesized value to its standard error. In a t-test, the higher the t-statistic, the more confidently we can reject the null hypothesis. Use `numpy.random` to create four samples, each of size 30:
- $X \sim Uniform(0,1)$
- $Y \sim Uniform(0,1)$
- $Z = X/2 + Y/2 + 0.1$
- $K = Y + 0.1$
    

In [None]:
x = np.random.uniform(low=0, high=1, size=30)
y = np.random.uniform(low=0, high=1, size=30)
z = x / 2 + y / 2 + 0.1
k = y + 0.1

**3.4 Discuss:** What are the expected values and the variance of $X$, $Y$, $Z$, and $K$? (You don't need to justify them!)

$E[X] = \frac{1}{2} $ 

$E[Y]= \frac{1}{2} $

$E[Z]= E[\frac{X}{2}+ \frac{Y}{2} + 0.1] = \frac{1}{2}E[X] + \frac{1}{2}E[Y] + 0.1 = 0.6$

$E[K]= E[Y + 0.1]= E[Y] + 0.1 = 0.6$

$Var[X] = E[(X)^2] - E[X]^2 = \frac{1}{3} - \frac{1}{4} = \frac{1}{12} $ 

$Var[Y] = \frac{1}{12} $

$Var[Z] = Var[\frac{X}{2}+ \frac{Y}{2} + 0.1] = Var[\frac{X}{2}] + Var[\frac{Y}{2}]= \frac{1}{4} (Var[X]+ Var[Y])= \frac{1}{24}   $

$Var[K] = Var[Y + 0.1]=  Var[Y ]=  \frac{1}{12} $ 
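
These values can be sanity-checked with a quick Monte Carlo simulation. This is only a sketch: the sample size and seed below are arbitrary choices, and the variance of $Z$ relies on $X$ and $Y$ being drawn independently.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1_000_000  # large sample so the estimates are close to the theoretical values
x = rng.uniform(0, 1, n)
y = rng.uniform(0, 1, n)
z = x / 2 + y / 2 + 0.1
k = y + 0.1

# Compare with E = 0.5, 0.5, 0.6, 0.6 and Var = 1/12, 1/12, 1/24, 1/12
for name, sample in [("X", x), ("Y", y), ("Z", z), ("K", k)]:
    print(f"{name}: mean={sample.mean():.3f}, var={sample.var():.4f}")
```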


**3.5** Run the following simulation 10000 times, storing the $p$-values for the tests at each run:
- Sample new values  for $X$, $Y$, $Z$ and $K$ ($n=30$ each). 
- Run independent sample t-test (assuming equal variance) and paired t-test comparing $X$ and $Z$.
-  Run independent sample t-test (assuming equal variance) and paired t-test comparing $X$ and $K$.

In [None]:
n_iters = 10000
p_values_ind_x_z = []
p_values_paired_x_z = []
p_values_ind_x_k = []
p_values_paired_x_k = []
for i in range(n_iters):
    x = np.random.uniform(low=0, high=1, size=30)
    y = np.random.uniform(low=0, high=1, size=30)
    z = x / 2 + y / 2 + 0.1
    k = y + 0.1
    p_values_ind_x_z.append(ttest_ind(x, z, equal_var=True).pvalue)
    p_values_paired_x_z.append(ttest_rel(x, z).pvalue)
    p_values_ind_x_k.append(ttest_ind(x, k, equal_var=True).pvalue)
    p_values_paired_x_k.append(ttest_rel(x, k).pvalue)

**3.6** Recall that the power of a binary hypothesis test is the probability that the test correctly rejects the null hypothesis when the alternative hypothesis is true. Using the p-values and assuming that we reject the null hypothesis if $p < 0.05$, calculate the statistical power of:
- The independent sample t-test comparing $X$ and $Z$.
- The paired t-test comparing $X$ and $Z$.
- The independent sample t-test comparing $X$ and $K$.
- The paired t-test comparing $X$ and $K$.

In [None]:
stat_power_ind_x_z = len([x for x in p_values_ind_x_z if x < 0.05]) / n_iters
stat_power_paired_x_z = len([x for x in p_values_paired_x_z if x < 0.05]) / n_iters

stat_power_ind_x_k = len([x for x in p_values_ind_x_k if x < 0.05]) / n_iters
stat_power_paired_x_k = len([x for x in p_values_paired_x_k if x < 0.05]) / n_iters

print("The independent sample t-test comparing X and Z : %0.2f" % stat_power_ind_x_z)
print("The paired t-test comparing X and Z : %0.2f" % stat_power_paired_x_z)
print("The independent sample t-test comparing X and K : %0.2f" % stat_power_ind_x_k)
print("The paired t-test comparing X and K : %0.2f" % stat_power_paired_x_k)

**3.7 Discuss:** When are paired t-tests helpful? Justify.

Paired t-tests are useful when comparing variables from matched pairs, i.e., when there exists a (temporal, logical, etc.) relationship between the two observations of each pair.

We can see that the paired t-test is statistically more powerful when applied to X and Z, which are related random variables, with $ P( $ reject $H_0 | H_1 $is true $) = 0.74 $, than when applied to X and K, which are not related at all, with $ P( $ reject $H_0 | H_1 $is true $) = 0.25 $. 

For X and K, the paired t-test is statistically as powerful as the independent sample t-test, since pairing brings no benefit when the two samples are unrelated.
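
One way to see where the gain comes from is the identity $Var[A - B] = Var[A] + Var[B] - 2\,Cov[A, B]$: the paired test works with the variance of the differences, while the independent test effectively adds the two variances and ignores the covariance. The sketch below estimates both quantities by simulation (the seed and sample size are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
x = rng.uniform(0, 1, n)
y = rng.uniform(0, 1, n)
z = x / 2 + y / 2 + 0.1
k = y + 0.1

# Variance seen by the paired test vs. the sum of variances used by the independent test
var_paired_xz = np.var(x - z)
var_indep_xz = np.var(x) + np.var(z)
var_paired_xk = np.var(x - k)
var_indep_xk = np.var(x) + np.var(k)

print(f"X vs Z: paired={var_paired_xz:.4f}, independent={var_indep_xz:.4f}")
print(f"X vs K: paired={var_paired_xk:.4f}, independent={var_indep_xk:.4f}")
```

Because Z shares half of X, the variance of the differences is much smaller for X and Z, whereas for X and K the two quantities are essentially equal.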

**3.8** With a bootstrapping approach (implemented by yourself, you should not use existing bootstrapping functions), estimate the average difference and 95% confidence intervals for:
- the mean ratio between the number of words in the winner headline and the loser headline (i.e., the number of words in the winner headline divided by the number of words in the loser headlines).
- the difference in usage of positive words between winner and loser headlines.
- the difference in usage of negative words between winner and loser headlines.
- The difference in usage of each type of pronoun between winner and loser headlines.


In [None]:
def find_confidence_interval_in_list(means_list):
    # Percentile 95% confidence interval of the bootstrap means
    return (np.quantile(means_list, 0.025), np.quantile(means_list, 0.975))


def batch_iter(series, batch_size, num_batches):
    """Yield `num_batches` bootstrap resamples drawn with replacement."""
    values = np.asarray(series)
    for _ in range(num_batches):
        indices = np.random.randint(0, len(values), size=batch_size)
        yield values[indices]


def calculate_means(batches):
    return [np.mean(batch) for batch in batches]


def confidence_interval(serie, batch_size=None, num_batches=10000):
    # By default, each bootstrap resample has the same size as the original sample
    if batch_size is None:
        batch_size = len(serie)
    means = calculate_means(batch_iter(serie, batch_size, num_batches))
    return find_confidence_interval_in_list(means)

In [None]:
# 1
ratio_nb_words = comparaison_final["numwords1"] / comparaison_final["numwords2"]
ci_mean_ratio = confidence_interval(ratio_nb_words)

avg_mean_ratio = ratio_nb_words.mean()
std_mean_ratio = ratio_nb_words.std()

print("ci_mean_ratio [%1.4f, %1.4f]" % ci_mean_ratio)
print("avg_mean_ratio %1.4f, std_mean_ratio %1.4f " % (avg_mean_ratio, std_mean_ratio))
print()

# 2
diff_pos_words = comparaison_final["positive1"] - comparaison_final["positive2"]
ci_pos_words = confidence_interval(diff_pos_words)
print("ci_pos_words [%1.4f, %1.4f]" % ci_pos_words)
avg_diff_pos_words = (diff_pos_words).mean()
std_diff_pos_words = (diff_pos_words).std()
print(
    "avg_diff_pos_words %1.4f, std_diff_pos_words %1.4f "
    % (avg_diff_pos_words, std_diff_pos_words)
)
print()

# 3
diff_neg_words = comparaison_final["negative1"] - comparaison_final["negative2"]
ci_neg_words = confidence_interval(diff_neg_words)
avg_diff_neg_words = (diff_neg_words).mean()
std_diff_neg_words = (diff_neg_words).std()
print("ci_neg_words [%1.4f, %1.4f]" % ci_neg_words)
print(
    "avg_diff_neg_words %1.4f, std_diff_neg_words %1.4f"
    % (avg_diff_neg_words, std_diff_neg_words)
)
print()


ci_prononouns = {}
avg_diff_pronouns = {}
var_diff_pronouns = {}
for key in feature_wordsets:
    diff_pronouns = comparaison_final[key + "1"] - comparaison_final[key + "2"]
    ci_prononouns[key] = confidence_interval(diff_pronouns)
    print("ci_" + key + "[%1.4f, %1.4f]" % ci_prononouns[key])
    avg_diff_pronouns[key] = (diff_pronouns).mean()
    var_diff_pronouns[key] = (diff_pronouns).var()
    print(
        "avg_diff_%s %1.4f, var_diff_%s %1.4f "
        % (key, avg_diff_pronouns[key], key, var_diff_pronouns[key])
    )
    print()

3.9 **Discuss:** According to the results obtained in **T3.8**, what headlines grab people's attention the most? Justify your answer.

According to the previous results, we can pinpoint some key elements that make headlines grab people's attention:

- Winner headlines are on average 6% longer than the loser ones, thus there is a correlation between the length and the success of the headlines. <br> <br>
- Negativity of the headlines seems to be a determining factor.<br> 
  Winner headlines, on average, contain positive words slightly less often, although the 95% confidence interval of the mean difference in usage of positive words includes 0, so this effect is not conclusive <br>[-0.0102, 0.0007].<br> 
  Winner headlines, on average, contain negative words more often, since the 95% confidence interval of the mean difference in usage of negative words is entirely positive <br>[0.0087, 0.0240].<br> <br>
- Headlines mentioning singular people are the most attractive ones: people tend to be more engaged by headlines dealing with individuals than with groups.<br>
    Winner headlines, on average, contain "first person singular" pronouns more often, since the 95% confidence interval of the mean difference in usage of these pronouns is entirely positive <br>[0.0128, 0.0209]<br> 
    Winner headlines, on average, contain "third person singular" pronouns more often, since the 95% confidence interval of the mean difference in usage of these pronouns is entirely positive <br>[0.0280, 0.0453]<br> 
    

## Task 4: Temporal validity and heterogeneity of the effect.

Last, we investigate how the effects studied in **T3** change with time and how they might be heterogeneous across different types of news.

4.1 Create a plot where you depict the monthly average number of words in winner and loser headlines. Consider only headlines created after April 2013 (the month of April inclusive). Include also bootstrapped 95% confidence intervals; here, you can use a third-party implementation if you want. Finally, recall that we created a column `date_created` which captures the creation of the winner headline; you can consider this date to correspond to the date of the creation of the A/B test.

4.2 Produce similar plots to each pronoun category, as well as for positive and negative sentiment. Here, unlike in **T4.1**, depict the month averages pooled across winner and loser headlines (i.e., for each month, you calculate the average across both winners and loser headlines).
Create all these plots in a single figure with no more than 11 inches of width and 11 inches of height. Again, consider only headlines created after April 2013 (the month of April inclusive).

4.3 **Discuss:** Has the type of headline Upworthy used in their A/B tests changed with time? Are these changes likely to be producing more or less engaging headlines? Justify.

4.4 Divide your data into two periods, $t_1$, which goes from  April 2013 (inclusive) to March 2014 (inclusive), and $t_2$, which goes from April 2014 (inclusive) to the latest A/B test in the data. Create a dataframe for A/B tests in each period.

4.5 Let's examine if the effects observed remained the same throughout the study period. Use an appropriate methodology  of your choice to determine if the effects observed in **T3.8** (length, each category of pronouns, positive words, and negative words) were different in $t_1$ and $t_2$. Here, note that we are considering "at least one positive outcome" to be the manifestation of an underlying effect, thus significance level must be adjusted down when performing multiple hypothesis tests!

4.6 **Discuss:** Hypothesize two reasons that could have led to a change in the observed effects. According to the analysis done in **T4.5**, have the effects observed remained the same across the study period? 

4.7 The features we are studying may interact with each other. For instance, people may like first person singular pronouns in headlines containing positive words (you are amazing!), but dislike headlines with negative words and first person pronouns (you are awful!). To help answer this question, create:
- a dataframe containing all A/B tests where both winner and loser headlines include a positive word; and
- a dataframe containing all A/B tests where both winner and loser headlines include a negative word;

4.8 Using an appropriate methodology of your choice, determine if the effect of the use of first person singular pronouns in the headline is heterogeneous across headlines with positive words and negative words, i.e., is the effect significantly stronger for one of the dataframes created in **T4.7**? 

4.9 **Discuss:** Considering the analyses you did throughout Tasks 3 and 4, write a short text (no more than 250 words) giving advice to Upworthy employees on how they should try to write engaging headlines. 
You can reference images present in the notebook by indicating a task (e.g., image plotted in **T3.3**) or a cell number. Note that you do not need to conduct any additional analysis to write this text. 


4.1 Create a plot where you depict the monthly average number of words in winner and loser headlines. Consider only headlines created after April 2013 (the month of April inclusive). Include also bootstrapped 95% confidence intervals; here, you can use a third-party implementation if you want. Finally, recall that we created a column `date_created` which captures the creation of the winner headline; you can consider this date to correspond to the date of the creation of the A/B test.

In [None]:
comparaison_final["date_created"] = comparaison_final["date_created"].astype(
    "datetime64[ns]"
)
after_april2013 = comparaison_final[comparaison_final["date_created"] >= "2013-04-01"]
comparaison_final_month = after_april2013.groupby(
    [after_april2013["date_created"].dt.month, after_april2013["date_created"].dt.year]
)


def bootstrap_CI(data, nbr_draws):
    means = np.zeros(nbr_draws)
    data = np.array(data)

    for n in range(nbr_draws):
        indices = np.random.randint(0, len(data), len(data))
        data_tmp = data[indices]
        means[n] = np.nanmean(data_tmp)

    return [np.nanpercentile(means, 2.5), np.nanpercentile(means, 97.5)]


def monthly_stats(group, column):
    # Run the bootstrap once so that both bounds come from the same resamples
    lower, upper = bootstrap_CI(group[column], 1000)
    return pd.Series(
        {
            "average_" + column: group[column].mean(),
            "lower_err_" + column: lower,
            "upper_err_" + column: upper,
        }
    )


stats_by_month1 = comparaison_final_month.apply(monthly_stats, column="numwords1")
stats_by_month2 = comparaison_final_month.apply(monthly_stats, column="numwords2")

# Sort chronologically: by year first, then by month
stats_by_month1 = stats_by_month1.sort_index(level=[1, 0])
stats_by_month2 = stats_by_month2.sort_index(level=[1, 0])

stats_by_month1.index = stats_by_month1.index.map(lambda x: str(x[0]) + "-" + str(x[1]))
stats_by_month2.index = stats_by_month2.index.map(lambda x: str(x[0]) + "-" + str(x[1]))

plt.style.use("seaborn-whitegrid")
fig, ax = plt.subplots(figsize=(13, 5))
# Error bars are distances from the mean to the lower and upper CI bounds
ax.errorbar(
    stats_by_month1.index,
    stats_by_month1.average_numwords1,
    yerr=[
        stats_by_month1.average_numwords1 - stats_by_month1.lower_err_numwords1,
        stats_by_month1.upper_err_numwords1 - stats_by_month1.average_numwords1,
    ],
    capsize=3,
)
ax.errorbar(
    stats_by_month2.index,
    stats_by_month2.average_numwords2,
    yerr=[
        stats_by_month2.average_numwords2 - stats_by_month2.lower_err_numwords2,
        stats_by_month2.upper_err_numwords2 - stats_by_month2.average_numwords2,
    ],
    capsize=3,
)

ax.set_xticks(stats_by_month1.index)
ax.set_xticklabels(stats_by_month1.index, rotation=45)
ax.set_xlabel("Month-Year")
ax.set_ylabel("Average number of words")
ax.set_title("Average number of words per month")
ax.legend(["Winners", "Losers"])
plt.show()

4.2 Produce similar plots to each pronoun category, as well as for positive and negative sentiment. Here, unlike in **T4.1**, depict the month averages pooled across winner and loser headlines (i.e., for each month, you calculate the average across both winners and loser headlines).
Create all these plots in a single figure with no more than 11 inches of width and 11 inches of height. Again, consider only headlines created after April 2013 (the month of April inclusive).
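
As a sketch of what "pooled across winner and loser headlines" can mean, the example below stacks the winner and loser indicator columns so that every headline counts once, then averages per calendar month. The toy dataframe is made up; only the `<feature>1`/`<feature>2` column naming and `date_created` follow the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame(
    {
        "date_created": pd.to_datetime(
            ["2013-04-03", "2013-04-20", "2013-05-02", "2013-05-15"]
        ),
        "positive1": [1, 0, 1, 1],
        "positive2": [0, 0, 1, 0],
    }
)

# Stack winner and loser indicators into a single column
pooled = toy.melt(
    id_vars="date_created",
    value_vars=["positive1", "positive2"],
    value_name="positive",
)

# Average per calendar month; to_period("M") keeps the months in chronological order
monthly_pooled = pooled.groupby(pooled["date_created"].dt.to_period("M"))[
    "positive"
].mean()
print(monthly_pooled)
```

Grouping by a monthly period, rather than by separate month and year keys, keeps the index sorted chronologically without an extra sorting step.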

In [None]:
plt.style.use("seaborn-whitegrid")
fig, ax = plt.subplots(figsize=(11, 6))  # check if less than 11x11 inches

for key in feature_wordsets:
    comparaison_final[key] = comparaison_final[key + "1"] + comparaison_final[key + "2"]
    after_april2013 = comparaison_final[
        comparaison_final["date_created"] >= "2013-04-01"
    ]
    comparaison_final_month = after_april2013.groupby(
        [
            after_april2013["date_created"].dt.month,
            after_april2013["date_created"].dt.year,
        ]
    )

    stats_by_month = comparaison_final_month.apply(
        lambda x: pd.Series(
            {
                "average_" + key: x[key].mean(),
                "lower_err_" + key: bootstrap_CI(x[key], 1000)[0],
                "upper_err_" + key: bootstrap_CI(x[key], 1000)[1],
            }
        )
    )

    stats_by_month = stats_by_month.sort_index(level=[1, 0])
    stats_by_month.index = stats_by_month.index.map(
        lambda x: str(x[0]) + "-" + str(x[1])
    )

    ax.errorbar(
        stats_by_month.index,
        stats_by_month["average_" + key],
        yerr=[
            -stats_by_month["lower_err_" + key] + stats_by_month["average_" + key],
            -stats_by_month["average_" + key] + stats_by_month["upper_err_" + key],
        ],
        capsize=3,
    )

    # comparaison_final.drop(key, axis=1, inplace=True)

comparaison_final["positive"] = (
    comparaison_final["positive1"] + comparaison_final["positive2"]
)
comparaison_final["negative"] = (
    comparaison_final["negative1"] + comparaison_final["negative2"]
)
after_april2013 = comparaison_final[comparaison_final["date_created"] >= "2013-04-01"]
comparaison_final_month = after_april2013.groupby(
    [after_april2013["date_created"].dt.month, after_april2013["date_created"].dt.year]
)

stats_pos = comparaison_final_month.apply(
    lambda x: pd.Series(
        {
            "average_positive": x["positive"].mean(),
            "lower_err_positive": bootstrap_CI(x["positive"], 1000)[0],
            "upper_err_positive": bootstrap_CI(x["positive"], 1000)[1],
        }
    )
)
stats_neg = comparaison_final_month.apply(
    lambda x: pd.Series(
        {
            "average_negative": x["negative"].mean(),
            "lower_err_negative": bootstrap_CI(x["negative"], 1000)[0],
            "upper_err_negative": bootstrap_CI(x["negative"], 1000)[1],
        }
    )
)

stats_pos = stats_pos.sort_index(level=[1, 0])
stats_neg = stats_neg.sort_index(level=[1, 0])

stats_pos.index = stats_pos.index.map(lambda x: str(x[0]) + "-" + str(x[1]))
stats_neg.index = stats_neg.index.map(lambda x: str(x[0]) + "-" + str(x[1]))

ax.errorbar(
    stats_pos.index,
    stats_pos.average_positive,
    yerr=[
        -stats_pos.lower_err_positive + stats_pos.average_positive,
        -stats_pos.average_positive + stats_pos.upper_err_positive,
    ],
    capsize=3,
)
ax.errorbar(
    stats_neg.index,
    stats_neg.average_negative,
    yerr=[
        -stats_neg.lower_err_negative + stats_neg.average_negative,
        -stats_neg.average_negative + stats_neg.upper_err_negative,
    ],
    capsize=3,
)

# comparaison_final.drop(['positive', 'negative'], axis=1, inplace=True)

ax.set_xticks(stats_by_month1.index)
ax.set_xticklabels(stats_by_month1.index, rotation=45)
ax.set_xlabel("Month-Year")
ax.set_ylabel("Average number of words")
ax.legend(
    list(feature_wordsets) + ["positive", "negative"],
    loc="upper right",
    bbox_to_anchor=(1.2, 1),
)
ax.set_title("Average number of words per category per month")
plt.show()

4.3 **Discuss:** Has the type of headline Upworthy used in their A/B tests changed with time? Are these changes likely to be producing more or less engaging headlines? Justify.

We note that the average number of words has increased over time in both winner (from 13.5 to 16 words) and loser (from 13.25 to 15.75 words) headlines. <br>

This is likely to produce more engaging headlines, as the average number of words in winner headlines is higher than the average number of words in loser headlines.

We also note that the usage of negative and positive words, pooled across winner and loser headlines, has decreased at the end of the study period. <br>
The average number of headlines per A/B test containing a positive word went from 0.83 to 0.7.<br>
Likewise, the average number of headlines per A/B test containing a negative word went from 0.61 to 0.48.<br>

The use of second person pronouns has also decreased, whereas the use of third person pronouns has significantly increased.

Headlines thus tend to be more neutral and less personal over time. 

4.4 Divide your data into two periods, $t_1$, which goes from  April 2013 (inclusive) to March 2014 (inclusive), and $t_2$, which goes from April 2014 (inclusive) to the latest A/B test in the data. Create a dataframe for A/B tests in each period.

In [None]:
t1 = comparaison_final[
    (comparaison_final["date_created"] >= "2013-04-01")
    & (comparaison_final["date_created"] < "2014-04-01")
]
t2 = comparaison_final[comparaison_final["date_created"] >= "2014-04-01"]

4.5 Let's examine if the effects observed remained the same throughout the study period. Use an appropriate methodology  of your choice to determine if the effects observed in **T3.8** (length, each category of pronouns, positive words, and negative words) were different in $t_1$ and $t_2$. Here, note that we are considering "at least one positive outcome" to be the manifestation of an underlying effect, thus significance level must be adjusted down when performing multiple hypothesis tests!

We want to see whether an effect, i.e., a measurable impact of the usage of a category of words or of the length of the headline, remains the same in t1 and t2.<br>
To tackle this problem we considered 2 methods:<br>

First, we sampled pairs of winner/loser headlines from the periods t1 and t2 into batches of constant size.<br>

Method 1 : we computed the mean difference between winner and loser in each batch of t1 and t2.<br>
Then we applied a t-test to see whether these batch means follow the same distribution in both periods.<br>

Method 2 : we ran a t-test on the winner and loser headlines of each batch of t1 and t2 to see whether there is an effect or not.<br>
When the p-value was below 0.0005 (the adjusted significance level used in the code) we recorded a 1, otherwise a 0.<br>
Then we ran a t-test on these values to see whether we have the same distribution of effects in the two periods. <br>
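
For comparison, a more direct alternative (not the approach used in this notebook) is to compare the per-test winner-minus-loser differences of the two periods with a single test per feature, at a Bonferroni-corrected significance level. The sketch below uses synthetic data; the helper name `effect_changed`, the number of tests, and the choice of an unequal-variance t-test are all assumptions made for illustration.

```python
import numpy as np
from scipy.stats import ttest_ind


def effect_changed(diff_t1, diff_t2, n_tests, alpha=0.05):
    """Compare winner-minus-loser differences of two periods.

    The significance level is divided by the number of tests (Bonferroni).
    """
    result = ttest_ind(diff_t1, diff_t2, equal_var=False)
    corrected_alpha = alpha / n_tests
    return result.pvalue, corrected_alpha, result.pvalue < corrected_alpha


rng = np.random.default_rng(0)
# Synthetic differences: a clear effect in the first period, none in the second
toy_t1 = rng.normal(loc=0.5, scale=1.0, size=2000)
toy_t2 = rng.normal(loc=0.0, scale=1.0, size=2000)

# n_tests=10 is a hypothetical number of features tested
p, corrected_alpha, changed = effect_changed(toy_t1, toy_t2, n_tests=10)
print(f"p-value={p:.2e}, corrected alpha={corrected_alpha}, changed={changed}")
```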


In [None]:
def t_test_effect(t_1, key):
    return ttest_ind(t_1[key + str(1)], t_1[key + str(2)])

In [None]:
def batch_iter_2(series1, series2, batch_size, num_batches):
    data_size = len(series1)
    shuffle_indices = np.random.permutation(np.arange(data_size))

    shuffled_series1 = series1[series1.index[shuffle_indices]]
    shuffled_series2 = series2[series1.index[shuffle_indices]]

    for batch_num in range(num_batches):
        strt = np.random.randint(data_size - batch_size)
        yield (
            shuffled_series1[strt : strt + batch_size].values,
            shuffled_series2[strt : strt + batch_size].values,
        )


def methode_2(t_1, t_2, key):
    batches_of_key_t1 = batch_iter_2(t_1[key + str(1)], t_1[key + str(2)], 1000, 100)
    batches_of_key_t2 = batch_iter_2(t_2[key + str(1)], t_2[key + str(2)], 1000, 100)

    p_1 = []
    p_2 = []

    for batch_w, batch_l in batches_of_key_t1:
        if batch_w.shape[0] != 0 and batch_l.shape[0] != 0:
            p_1.append(1 if ttest_ind(batch_w, batch_l)[1] < 0.0005 else 0)

    for batch_w, batch_l in batches_of_key_t2:
        if batch_w.shape[0] != 0 and batch_l.shape[0] != 0:
            p_2.append(1 if ttest_ind(batch_w, batch_l)[1] < 0.0005 else 0)

    return ttest_ind(p_1, p_2)


def methode_1(t_1, t_2, key):
    batches_of_key_t1 = batch_iter_2(t_1[key + str(1)], t_1[key + str(2)], 100, 1000)
    batches_of_key_t2 = batch_iter_2(t_2[key + str(1)], t_2[key + str(2)], 100, 1000)

    p_1 = []
    p_2 = []

    for batch_w, batch_l in batches_of_key_t1:
        if batch_w.shape[0] != 0 and batch_l.shape[0] != 0:
            p_1.append((batch_w - batch_l).mean())

    for batch_w, batch_l in batches_of_key_t2:
        if batch_w.shape[0] != 0 and batch_l.shape[0] != 0:
            p_2.append((batch_w - batch_l).mean())

    return ttest_ind(p_1, p_2)

In [None]:
print("Number of words")
print("t1", t_test_effect(t1, "numwords"))
print("t2", t_test_effect(t2, "numwords"))
print("Methode 1  ", methode_1(t1, t2, "numwords"))
print("Methode 2  ", methode_2(t1, t2, "numwords"))
print()
print("Positive")
print("t1", t_test_effect(t1, "positive"))
print("t2", t_test_effect(t2, "positive"))
print("Methode 1 ", methode_1(t1, t2, "positive"))
print("Methode 2 ", methode_2(t1, t2, "positive"))
print()

print("Negative")
print("t1", t_test_effect(t1, "negative"))
print("t2", t_test_effect(t2, "negative"))
print("Methode 1 ", methode_1(t1, t2, "negative"))
print("Methode 2 ", methode_2(t1, t2, "negative"))

print()


for key in feature_wordsets:
    print(key)
    print("t1", t_test_effect(t1, key))
    print("t2", t_test_effect(t2, key))
    print("Methode 1  ", methode_1(t1, t2, key))
    print("Methode 2  ", methode_2(t1, t2, key))
    print()


# comparaison_final.drop(['length', 'positive', 'negative'], axis=1, inplace=True)

4.6 **Discuss:** Hypothesize two reasons that could have led to a change in the observed effects. According to the analysis done in **T4.5**, have the effects observed remained the same across the study period? 

The observed effect did vary for some of the features.<br>

In fact, people seem to have become less responsive to negative words. In t1 we could reject the hypothesis of identical distribution of winner and loser headlines in terms of negative words, but we no longer could in t2.<br>

The same applies to the effect of the usage of first_person_plural.<br>

However, the effect remained the same for the other features.<br>

One reason that could have led to a change in the observed effects is people's appetite for novelty, i.e., people want to see something different.

Another reason could be that readers are searching for individuality and personal space.

4.7 The features we are studying may interact with each other. For instance, people may like first person singular pronouns in headlines containing positive words (you are amazing!), but dislike headlines with negative words and first person pronouns (you are awful!). To help answer this question, create:
- a dataframe containing all A/B tests where both winner and loser headlines include a positive word; and
- a dataframe containing all A/B tests where both winner and loser headlines include a negative word;

In [None]:
winner_looser_pos = comparaison_final[
    (comparaison_final["positive1"] > 0) & (comparaison_final["positive2"] > 0)
]
winner_looser_neg = comparaison_final[
    (comparaison_final["negative1"] > 0) & (comparaison_final["negative2"] > 0)
]
print(
    "There are {} A/B tests where both headlines include at least a positive word".format(
        len(winner_looser_pos)
    )
)
print(
    "There are {} A/B tests where both headlines include at least a negative word".format(
        len(winner_looser_neg)
    )
)

4.8 Using an appropriate methodology of your choice, determine if the effect of the use of first person singular pronouns in the headline is heterogeneous across headlines with positive words and negative words, i.e., is the effect significantly stronger for one of the dataframes created in **T4.7**? 

In [None]:
ttest = ttest_ind(
    winner_looser_pos["first_person_singular"],
    winner_looser_neg["first_person_singular"],
)
print(
    "first_person_singular: t-test= {0:.3f}, p-value= {1:.3f}".format(
        ttest[0], ttest[1]
    )
)
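
A possible way to compare the effect itself across the two dataframes, rather than the pooled usage, is to take the per-test winner-minus-loser indicator in each dataframe and bootstrap the gap between the two mean effects. This is only a sketch under stated assumptions: the helper `bootstrap_effect_gap` and the synthetic effects below are hypothetical and do not come from the Upworthy data.

```python
import numpy as np


def bootstrap_effect_gap(effect_a, effect_b, n_draws=2000, seed=0):
    """Percentile 95% CI for mean(effect_a) - mean(effect_b)."""
    rng = np.random.default_rng(seed)
    effect_a = np.asarray(effect_a)
    effect_b = np.asarray(effect_b)
    gaps = np.empty(n_draws)
    for i in range(n_draws):
        # Resample each group with replacement, keeping its original size
        resample_a = rng.choice(effect_a, size=len(effect_a), replace=True)
        resample_b = rng.choice(effect_b, size=len(effect_b), replace=True)
        gaps[i] = resample_a.mean() - resample_b.mean()
    return np.percentile(gaps, [2.5, 97.5])


# Synthetic per-test effects in {-1, 0, 1}: winner indicator minus loser indicator
rng = np.random.default_rng(1)
toy_pos = rng.choice([-1, 0, 1], size=3000, p=[0.10, 0.70, 0.20])
toy_neg = rng.choice([-1, 0, 1], size=3000, p=[0.15, 0.70, 0.15])

low, high = bootstrap_effect_gap(toy_pos, toy_neg)
print(f"95% CI for the gap between effects: [{low:.4f}, {high:.4f}]")
```

If the interval excludes 0, the effect differs between the two groups of A/B tests at the 95% level.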

4.9 **Discuss:** Considering the analyses you did throughout Tasks 3 and 4, write a short text (no more than 250 words) giving advice to Upworthy employees on how they should try to write engaging headlines. 
You can reference images present in the notebook by indicating a task (e.g., image plotted in **T3.3**) or a cell number. Note that you do not need to conduct any additional analysis to write this text. 