# Regression on Instagram

### The "Why" of the exercise

To understand **trends** on social networks, it is important to be able to **assess the impact of a message** on the people who receive it. A message can be, for example, an Instagram post, a tweet, or a TikTok video. There are two main ways to quantify the impact of a message:

1. by estimating the number of people who viewed the message
2. by counting the number of people who actively endorse the message by signaling their support, typically by "liking" it

This second way is very powerful for understanding trends on social networks. It is useful both for observers (opinion leaders, experts, researchers, etc.) and influencers (digital marketing, prevention campaigns, etc.).

### The "What"

In this exercise, we will focus on posts on Instagram. Our goal is to model the number of `likes` that can be expected on a message, depending on the characteristics of the account sending the message.

To achieve this objective, we will decipher the influence of various variables (number of followers, account age, account activity, etc.).

In particular, quantifying the relationship between followers and likes makes it possible to estimate how many likes each additional follower generates on average. This gives the statistical "conversion rate" of followers to likes.

The dataset comes from [Heuritech](https://heuritech.com), a startup specializing in predicting clothing fashion trends from social networks.

### The "How"

We seek to quantify the relationship between several continuous variables: it is therefore a **regression problem**.

We will adopt the classic approach:

1. loading the dataset
2. exploring the data
3. splitting into train and test
4. modeling on the train set
5. applying the model to the test set

## 1. Loading the dataset

<b>1.A)</b> Load the `posts.csv.bz` file, in the form of a `DataFrame` called `posts`. You can download the data at this [address](https://drive.google.com/file/d/1O8ey3uytjqzRUQXTnmXkiqGBoa6lne1C/view?usp=sharing).

We will use the `compression='bz2'` argument of the pandas `read_csv` function.
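As a minimal sketch of how this argument behaves, the snippet below writes a tiny bz2-compressed CSV to a temporary folder and reads it back. The toy values and the file name `toy_posts.csv.bz` are made up for illustration; only the column names `id`, `ts`, `followers` and `likes` are taken from later cells of this exercise.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data standing in for the real file.
toy = pd.DataFrame({
    "id": [1, 1, 2],
    "ts": [10, 20, 15],
    "followers": [100, 100, 250],
    "likes": [5, 8, 12],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_posts.csv.bz")
    # Write a bz2-compressed CSV, then read it back with the same argument.
    toy.to_csv(path, index=False, compression="bz2")
    loaded = pd.read_csv(path, compression="bz2")

print(loaded.shape)
```

Passing `compression` explicitly avoids relying on pandas guessing the format from the file extension.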

<b>1.B)</b> How many posts and authors are there in this dataset?

<b>1.C)</b> This dataset contains multiple posts per author.

**Delete part of the data so as to keep only the last message posted chronologically, for each author.**

This avoids biases related to the over-representation of certain authors in our dataset, while using the most recent data.

<details>
<summary><i>Click for a hint</i></summary>
    ⟿ you can use the pandas functions `sort_values` and `drop_duplicates` with the keep argument to specify which duplicate element to keep.
</details>
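The pattern mentioned in the hint can be sketched on a toy table as follows. The data are invented; the column names `id` (author) and `ts` (timestamp) are assumed to match those used later in this exercise.

```python
import pandas as pd

# Hypothetical toy data: three authors, five posts.
toy = pd.DataFrame({
    "id": ["a", "b", "a", "b", "c"],
    "ts": [3, 1, 7, 5, 2],
    "likes": [10, 4, 12, 6, 9],
})

# Sort chronologically, then keep the last row seen for each author.
latest = (
    toy.sort_values(by="ts", ascending=True)
       .drop_duplicates(subset="id", keep="last")
)

# One row per author: the two counts should match.
print(len(latest), latest["id"].nunique())
```

Sorting first is what gives `keep="last"` its chronological meaning: without it, "last" would only mean the last row in file order.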

<b>1.D)</b> Re-calculate the number of authors and posts in the dataset to verify that you now have a single post per author.

## 2. Explore

<b>2.A)</b> Let's start with the bird's-eye view: the "pair-plot", which shows the distribution of each variable (on the diagonal) and its relationship with the other variables (the other plots in the same row).

Display this pair-plot on a subsample corresponding to 10% of the dataset.

<details>
<summary><i>Click for a hint</i></summary>
    ⟿ You can use the `pairplot` function of the `seaborn` module: https://seaborn.pydata.org/generated/seaborn.pairplot.html
</details>

<b>2.B)</b> Use this graph to answer the following questions:

- How are the likes distributed?

- What other variables are they related to? Do these relationships seem logical / explainable to you?

<b>2.C)</b> We now focus on the "followers" variable to predict "likes".

To confirm the intuition given by the pair-plot, we want to display the predicted variable ("likes") as a function of the explanatory variable ("followers"), accompanied by the marginal distribution of these two variables.

Complete the following code by replacing the `~~~~` with the correct variables.

```
import plotly.express as px

fig = px.density_heatmap(data_frame= ~~~~ , # to complete
                         x= ~~~~ , # to complete
                         y= ~~~~ , # to complete
                         marginal_x="histogram",
                         marginal_y="histogram"
                        )

fig.update_layout(height=700,
                  width=700)

fig
```

<b>2.D)</b> Do you see a linear statistical relationship between the two variables? What do you think of the variance of this relationship?

## 3. Separation into "train" and "test"

<b>3.A)</b> Randomly split the dataset into 80% for training (`posts_train` variable) and 20% for testing (`posts_test` variable).

To do this, you can use the `train_test_split` function after importing it:

```
from sklearn.model_selection import train_test_split
```
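A minimal sketch of how this function splits a `DataFrame`, shown on invented data. The `random_state` value is an arbitrary choice made here so that the split is reproducible; it is not required by the exercise.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with 100 rows.
toy = pd.DataFrame({"followers": range(100), "likes": range(100)})

# test_size=0.2 sends 20% of the rows to the test set; rows are shuffled by default.
toy_train, toy_test = train_test_split(toy, test_size=0.2, random_state=0)

print(len(toy_train), len(toy_test))
```

Each row ends up in exactly one of the two sets, which is what allows the test set to serve as unseen data later on.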

## 4. Modeling on the "train"

For our first model, we will use the variable with the greatest potential: the number of `followers`.

This first model will therefore predict the number of `likes` according to the number of `followers`.

The simplest model to predict this relationship is a linear model, which is expressed as:

```
number_of_likes = ( constant ) + ( coefficient x number_of_followers )
```

**Training** such a model is equivalent to **finding the values of `constant` and `coefficient` that minimize the prediction error**.
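To make this concrete, here is a small sketch on synthetic, noise-free data generated from a known constant (12) and coefficient (0.03). These numbers are invented for illustration and say nothing about the real dataset. It uses scikit-learn's `LinearRegression`, which is one possible way to fit such a model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic accounts: scikit-learn expects a 2D array of shape (n_samples, n_features).
followers = rng.integers(0, 5000, size=200).reshape(-1, 1)

# Likes generated exactly by the formula above, with constant=12 and coefficient=0.03.
likes = 12.0 + 0.03 * followers.ravel()

model = LinearRegression().fit(followers, likes)

# With noise-free data, training recovers the two values used to generate it.
print(model.intercept_, model.coef_[0])
```

On real data the points do not lie on a line, so the fitted values are the ones that minimize the squared prediction error rather than an exact match.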

<b>4.A)</b> Train such a linear model on the "train" sample.

<b>4.B)</b> What is the value of the constant, also sometimes called the `intercept`?

How can this value be interpreted concretely? Explain its meaning in one sentence.

<b>4.C)</b> What is the value of the second parameter of this model, which is called `coefficient` in the above formula?

Explain in one sentence the interpretation of this value.

<b>4.D)</b> Calculate the R2 score on the "train" sample. What is the mean squared error (MSE)? The mean absolute error (MAE)?

Does the model look good to you?

<details>
<summary><i>Click for a hint</i></summary>
    ⟿ you can either calculate the MSE and the MAE "by hand", or use the `mean_squared_error` and `mean_absolute_error` functions of the `sklearn.metrics` module.
</details>
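The two routes mentioned in the hint can be compared on a handful of invented values. The arrays below are purely illustrative.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical true values and predictions.
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 37.0])

errors = y_true - y_pred

# "By hand": mean of squared errors, mean of absolute errors,
# and R2 as 1 - (residual sum of squares / total sum of squares).
mse_by_hand = np.mean(errors ** 2)
mae_by_hand = np.mean(np.abs(errors))
r2_by_hand = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mse_by_hand, mean_squared_error(y_true, y_pred))
print(mae_by_hand, mean_absolute_error(y_true, y_pred))
print(r2_by_hand, r2_score(y_true, y_pred))
```

The MAE is expressed in the same unit as the target (here, likes), which usually makes it the easiest of the three to interpret.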

## 5. Application on the "test"

To validate the relevance of our model, we will now apply it to our "test" sample.

<b>5.A)</b> Use the model trained in the previous section to predict the values on your "test" sample. Plot these predictions as a function of the number of `followers`.

<b>5.B)</b> Calculate the error metrics on the "test" and compare them with those obtained previously, in "train". Does your model show signs of overfitting?

## 6. Creating a Better Model

The previous model is very simple. It already provides some value, because it highlights the link between the number of followers of an account and the number of likes its posts receive.

It does not show overfitting.

But its error is very large: 35 likes on average, whereas likes in this dataset range from about 0 to about 200.

<b>6.A)</b> To make a better model, we will test a new explanatory variable.

In particular, the number of `likes` that an account's posts have received in the past could be a good predictor of the `likes` on a new post from the same account.

The code below:

1. reloads the dataset from the `posts.csv.bz` archive
2. isolates the last post of each author, as we did at the beginning of the exercise
3. associates, *for each account*, the median number of `likes` across all previous posts.

**Study the code below carefully to understand it, then run it.**

```
import pandas as pd
from sklearn.model_selection import train_test_split

# load the archive
archive = pd.read_csv('posts.csv.bz', compression='bz2')

# sort in chronological order
archive = archive.sort_values(by='ts', ascending=True)

# last_posts isolates the last post of each author
last_posts = archive.drop_duplicates('id', keep='last')

# take the archive again, removing the last posts (thus keeping all the others)
posts_without_last = archive[~archive.index.isin(last_posts.index)]

# compute the median, by author, of the likes of previous posts
average_last_posts = posts_without_last.groupby('id', as_index=False)[['likes']].median()

# rename this new variable (the historical median of likes) to avoid confusion with `likes`
average_last_posts = average_last_posts.rename({'likes': 'likes_history'}, axis=1)

# add this variable to the dataframe that contains the last post of each author
posts = last_posts.merge(average_last_posts, on="id")

# finally, to prepare what follows, create the "posts_train" and "posts_test" splits as before
posts_train, posts_test = train_test_split(posts, test_size=0.2)
```
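One detail of the code above is easy to miss: `merge` performs an inner join by default, so an author with a single post (and therefore no history) disappears from `posts`. The sketch below reproduces the same steps on a tiny invented table to make this visible.

```python
import pandas as pd

# Hypothetical toy archive: author "a" has three posts, author "b" only one.
toy = pd.DataFrame({
    "id": ["a", "a", "a", "b"],
    "ts": [1, 2, 3, 1],
    "likes": [10, 30, 50, 7],
}).sort_values(by="ts", ascending=True)

# Same steps as above: last post per author, then the median of the earlier posts.
last_posts = toy.drop_duplicates("id", keep="last")
previous = toy[~toy.index.isin(last_posts.index)]
history = (
    previous.groupby("id", as_index=False)[["likes"]].median()
            .rename({"likes": "likes_history"}, axis=1)
)

# Default inner join: only authors present in both tables are kept.
merged = last_posts.merge(history, on="id")
print(merged)
```

Here only author "a" remains, with a `likes_history` equal to the median of 10 and 30. Comparing `len(posts)` before and after the merge is a quick way to see how many single-post authors were dropped on the real data.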

<b>6.B)</b> Build a model that predicts the number of `likes` based on `likes_history`, the median number of likes on previous posts.

Is this model better than a model based on `followers`?

Conclude on the relevance of the `likes_history` variable to predict `likes`.

<b>6.C)</b> To conclude on linear models, make a model that predicts the number of likes based on the two explanatory variables tested so far:

- `likes_history`
- `followers`

What do you think of this model?
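As a hedged illustration of a linear model with two explanatory variables, the sketch below fits scikit-learn's `LinearRegression` on synthetic, noise-free data. The constant (5) and the two coefficients (0.002 and 0.9) are invented and do not reflect the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Synthetic accounts with the two explanatory variables.
toy = pd.DataFrame({
    "followers": rng.integers(100, 5000, size=300),
    "likes_history": rng.uniform(0, 200, size=300),
})

# Target built from known (made-up) parameters.
toy["likes"] = 5.0 + 0.002 * toy["followers"] + 0.9 * toy["likes_history"]

features = ["likes_history", "followers"]
model = LinearRegression().fit(toy[features], toy["likes"])

# One coefficient per explanatory variable, in the order of `features`.
coefficients = dict(zip(features, model.coef_))
print(model.intercept_, coefficients)
```

Each coefficient reads as the expected change in likes for a one-unit increase of that variable, the other variable being held fixed.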

<b>6.D)</b> Use this last model to predict the values on your "test" sample, and plot these predictions as a function of the number of `followers`.

## To go further: regression with other methods

We have used linear regression so far, because it is the most commonly used regression model in practice.

However, there are many other mathematical formulations.

Among the best known are:

- so-called "local" methods, which consist in the interpolation of the closest data points to predict the value of a new point;

- methods based on decision trees: predictions are made by traversing the successive nodes of a tree (each node is a branching point that determines how the traversal continues, based on the values of the explanatory variables).

Thanks to `sklearn` it is very easy to test these other methods on this problem.

<b>A)</b> Read about `KNeighborsRegressor` [in the official documentation](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html#sklearn.neighbors.KNeighborsRegressor) and use this algorithm to predict the number of likes. Vary the `n_neighbors` setting. What do you observe?
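A possible way to organize this comparison is sketched below on synthetic data (a noisy linear relationship, invented for illustration). The values tried for `n_neighbors` are arbitrary.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(2)

# Synthetic data: one explanatory variable and a noisy target.
X = rng.uniform(0, 5000, size=(500, 1))
y = 10 + 0.03 * X.ravel() + rng.normal(0, 15, size=500)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# For each k, store (MAE on train, MAE on test).
results = {}
for k in (1, 5, 25, 100):
    knn = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    results[k] = (
        mean_absolute_error(y_train, knn.predict(X_train)),
        mean_absolute_error(y_test, knn.predict(X_test)),
    )

for k, (mae_train, mae_test) in results.items():
    print(k, round(mae_train, 2), round(mae_test, 2))
```

With `n_neighbors=1`, each training point is its own nearest neighbor, so the training error is zero. Comparing it with the test error is a direct way to look for overfitting.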

<b>B)</b> Read the documentation for the `DecisionTreeRegressor` [in the official documentation](https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html#sphx-glr-auto-examples-tree-plot-tree-regression-py) and use this algorithm to predict the number of likes based on `followers`. Vary the `max_depth` setting. What do you observe?
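The same kind of comparison can be sketched for a decision tree, again on invented synthetic data. The depths tried are arbitrary, and `max_depth=None` lets the tree grow without a depth limit.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)

# Synthetic data: one explanatory variable and a noisy target.
X = rng.uniform(0, 5000, size=(500, 1))
y = 10 + 0.03 * X.ravel() + rng.normal(0, 15, size=500)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# For each max_depth, store (actual depth reached, MAE on train, MAE on test).
results = {}
for depth in (1, 3, 10, None):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_train, y_train)
    results[depth] = (
        tree.get_depth(),
        mean_absolute_error(y_train, tree.predict(X_train)),
        mean_absolute_error(y_test, tree.predict(X_test)),
    )

for depth, (actual_depth, mae_train, mae_test) in results.items():
    print(depth, actual_depth, round(mae_train, 2), round(mae_test, 2))
```

Without a depth limit, the tree keeps splitting until each leaf holds a single training point, so it fits the training sample perfectly. The test error shows whether that extra flexibility actually helps.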