# Classification of trolls on Shapr

### The "Why" of the exercise

On social networks, it is important to identify individuals whose behavior harms the community. **Moderation is a classification problem**: for a given account, we predict whether or not it complies with the community rules.

In particular, **Shapr**, a professional networking platform, noticed that a small number of the users registered on its platform exhibited behaviors that were inappropriate and at odds with the platform's purpose.

### The "What"

In this exercise, we will use real data from Shapr. In particular we provide:

- the result of a manual validation describing whether the account is problematic or not (this is the "variable to explain")

- as well as a corpus of data describing the new accounts (how long is the text typed in the bio, what other social networks are entered, what is the name of the email provider, etc.): these are the explanatory variables

### The "How"

The practical work reproduces the classification process that was developed internally, and takes place in 2 stages:

I. Exploratory Analysis

  The exploratory analysis aims to understand "who" the problematic profiles are. This is done by studying the ability of explanatory variables to discriminate.

II. Modeling and Interpretation of the model

  Modeling is done with a classification model, which predicts whether a given profile is problematic based on the explanatory variables. By examining the fitted model, we can understand how it makes its decisions, and in particular how it uses the explanatory variables.


Over to you! And remember: don't get stuck, search, ask your questions, have fun! We are all here to learn and practice.

## 0. [REQUIRED] Creating a new environment

We will use the [PyCaret](https://pycaret.gitbook.io/docs/) library during this lab.

This library needs its own environment because it depends on specific versions of several other libraries.

By [following this link](https://pycaret.gitbook.io/docs/), create a new environment and install PyCaret.

Activate it for the rest of this lab.

If the installation fails for any reason, don't get stuck: use [Google Colaboratory](https://colab.research.google.com/?utm_source=scs-index) to complete this lab.

## I. Exploratory Analysis

### I.1. Loading the dataset

Download the `moderation.csv` file at this [address](https://drive.google.com/file/d/1w8geUl5qQ1GonuH7G9YB1h_RrRl4HMgl/view?usp=sharing). It contains different columns describing Shapr profiles:

`node_id` ⇨ internal and unique id of each user

`email` ⇨ the email used to register (other refers to emails other than the main ones)

`has_picture_cover` ⇨ has the user added a photo to their profile?

`has_linkedin` ⇨ has the user added their LinkedIn handle?

`has_personal_url` ⇨ has the user added a link to their personal url?

`has_instagram` ⇨ has the user added their Instagram handle?

`tags` ⇨ the set of profile tags, separated by `;`

`goals` ⇨ the set of goals chosen by the user, separated by `;`

`nb_chars_in_bio` ⇨ the number of characters in their profile bio

`is_unwanted` ⇨ has the user been classified as unwanted?

The purpose of the lab is to study the variable `is_unwanted` in order to differentiate a normal account (`is_unwanted == 0`) from a troll (`is_unwanted == 1`).

<b>I.1.a)</b> Load the csv file and store it as a DataFrame which you will call `df`. Do you find all the columns described above?

<b>I.1.b)</b> Some profiles that have no bio have `nb_chars_in_bio` set to `NaN` ("Not a Number"). How many are there? Recode these values to 0 in your dataframe.

Do the same by recoding the `tags` and `goals` fields which equal `NaN`, with an empty character string (`""`).
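
As a hedged illustration of the mechanics only, here is a minimal sketch on a small made-up DataFrame. The `toy` frame and its values are invented for this example and are not the lab data.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration (not the Shapr dataset)
toy = pd.DataFrame({
    "nb_chars_in_bio": [120.0, np.nan, 35.0, np.nan],
    "tags": ["python;data", np.nan, "design", np.nan],
})

# Count the missing values in one column
n_missing = toy["nb_chars_in_bio"].isna().sum()
print(n_missing)

# Recode missing numeric values to 0 and missing text values to ""
toy["nb_chars_in_bio"] = toy["nb_chars_in_bio"].fillna(0)
toy["tags"] = toy["tags"].fillna("")
```

The same pattern should carry over to `df`, one column at a time.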

<b>I.1.c)</b> How many profiles are there in this dataset? How many unwanted profiles?

<b>I.1.d)</b> Check that the `node_id` column contains only distinct values. Does it seem relevant to keep this column in your classification model?

### I.2 Study of explanatory variables

We will now look at the explanatory variables, starting with the email providers.

<b>I.2.a)</b> How many email providers are there? Represent their distribution in the form of a histogram.

<details>
<summary><i>Click for a hint</i></summary>
    ⟿ Plotly Express has the "histogram" function which allows you to make this distribution graph: https://plotly.com/python/histograms/
</details>

<b>I.2.b)</b> We now want to study the link between the <b>email provider</b> and the <b>probability</b> of being labeled as `unwanted`. Calculate this probability for each email provider and display it on a graph.


<details>
<summary><i>Click for a hint</i></summary>
    ⟿ Think of `groupby` followed by a `mean` to calculate the probability of `is_unwanted` for each email provider

    <br/>

    ⟿ The display can use the Plotly Express `bar` function: https://plotly.com/python/bar-charts/
</details>
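
The `groupby` then `mean` idea from the hint can be sketched on invented data. The `toy` frame below is hypothetical; only the column names `email` and `is_unwanted` come from the lab.

```python
import pandas as pd

# Toy data, invented for illustration (not the Shapr dataset)
toy = pd.DataFrame({
    "email": ["gmail", "gmail", "yahoo", "yahoo", "other", "other"],
    "is_unwanted": [0, 1, 0, 0, 1, 1],
})

# The mean of a 0/1 column is the fraction of 1s,
# i.e. the empirical probability of being unwanted within each group
proba_by_email = (
    toy.groupby("email")["is_unwanted"]
    .mean()
    .sort_values(ascending=False)
)
print(proba_by_email)
```

Calling `reset_index()` on the result gives a two-column DataFrame that is convenient to pass to a plotting function.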

<b>I.2.c)</b> Which email provider is linked to the highest probability of being a spam account? Which one has the lowest? Do you think that the email provider is a good explanatory variable?

<b>I.2.d)</b> We now want to relate having provided a profile picture and/or links to other social networks to the probability of being an unwanted account.

Examine the following code and run it.

In [None]:
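import pandas as pd
import plotly.express as px

# Binary profile attributes to compare against the target
col_names = ["has_picture_cover", "has_linkedin", "has_twitter", "has_personal_url", "has_instagram"]
list_probas = []

for v in col_names:
    # Share of unwanted accounts (in %) when the attribute is 0 and when it is 1
    proba = df.groupby(v)['is_unwanted'].mean() * 100
    print(f"probability of unwanted account for {v} being 0 or 1: {proba.values}")
    list_probas.append(proba)

# One row per attribute, one column per attribute value (0 or 1)
table_probas = pd.concat(list_probas, axis=1).transpose()
table_probas.index = col_names

px.bar(table_probas, barmode='group')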

<b>I.2.e)</b> Looking at the graph produced, what do you think of the link between the explanatory variables "has_picture_cover", "has_linkedin", "has_twitter", "has_personal_url", "has_instagram" and the variable to predict?

Does this match your intuition?

<b>I.2.f)</b> We now want to look at the link between the size of the text entered in the "bio", and the fact of being an unwanted account.

Draw box plots showing the distribution of character counts in bios (`nb_chars_in_bio`) for normal and unwanted profiles.


<details>
<summary><i>Click for a hint</i></summary>
    ⟿ The display can use the "box" function of Plotly Express: https://plotly.com/python/box-plots/
</details>

<b>I.2.g)</b> Looking at the graph produced, what do you think of the link between the explanatory variable "nb_chars_in_bio" and the variable to be predicted?

Does this match your intuition?

<b>I.2.h)</b> We now want to see if `tags` and `goals` have been entered, assuming that a "troll" account does not bother to enter them.

Create two new columns, `has_goal` and `has_tag`, equal to `1` when the `goals` and `tags` columns respectively contain information, and to `0` otherwise.

Using the code from question `I.2.d)`, conclude on the relevance of these two variables.
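
One possible way to build such indicator columns is sketched here on invented data. The `toy` frame is hypothetical, and the sketch assumes missing values have already been recoded to empty strings as in question `I.1.b)`.

```python
import pandas as pd

# Toy data, invented for illustration (not the Shapr dataset)
toy = pd.DataFrame({
    "tags": ["python;data", "", "design"],
    "goals": ["", "", "find a mentor;hire"],
})

# A non-empty string means the field was filled in;
# the boolean result is cast to 0/1
toy["has_tag"] = (toy["tags"].str.len() > 0).astype(int)
toy["has_goal"] = (toy["goals"].str.len() > 0).astype(int)
print(toy)
```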

## II. Modeling

At the end of the first part, we identified some variables with a clear link to being a spam account:

- the size of the bio
- the presence or absence of a profile picture
- the presence or absence of links to other social networks (LinkedIn, Instagram, Twitter or personal site)
- the fact of having entered `tags` and `goals` in the profile

### II.1. Data preparation

<b>II.1.a)</b> Construct a DataFrame containing the explanatory variables stated above as well as the variable to be predicted `is_unwanted`.

Save this dataframe in a variable named `dataset`. It should contain 10,000 rows and 9 columns.

<b>II.1.b)</b> As we saw previously, the dataset contains 9 times more normal users (`is_unwanted` being 0) than trolls (`is_unwanted` being 1).

Training a model on such imbalanced data is not desirable: whatever the user to classify, always answering "not a troll" would be right in 90% of cases!
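
To make this concrete, here is a small sketch with made-up labels (900 normal accounts and 100 trolls, numbers chosen only to reproduce the 90% / 10% ratio). A "model" that always predicts the majority class reaches 90% accuracy while detecting no troll at all.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels with a 90% / 10% imbalance: 900 normal, 100 trolls
y_true = np.array([0] * 900 + [1] * 100)

# A "model" that always answers "not a troll"
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)  # share of correct answers
rec = recall_score(y_true, y_pred)    # share of trolls actually detected
print(acc, rec)
```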

<b>To guard against this, you are asked to implement a simple strategy for re-balancing `dataset` by "under-sampling". This consists of keeping only a random subsample of the normal users, of the same size as the set of trolls.</b>

Once the rebalancing is complete, check that the `dataset` variable contains 1,000 rows with `is_unwanted` at 0 and 1,000 rows with `is_unwanted` at 1.
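
A minimal sketch of random under-sampling on invented data follows. The `toy` frame, its sizes and the `random_state` values are assumptions made for the example; this is one simple way to do it, not necessarily the one used internally at Shapr.

```python
import pandas as pd

# Toy data, invented for illustration: 10 trolls and 90 normal users
toy = pd.DataFrame({
    "nb_chars_in_bio": range(100),
    "is_unwanted": [1] * 10 + [0] * 90,
})

trolls = toy[toy["is_unwanted"] == 1]
normal = toy[toy["is_unwanted"] == 0]

# Keep as many randomly chosen normal rows as there are trolls
normal_sample = normal.sample(n=len(trolls), random_state=42)

# Concatenate, then shuffle so the classes are not stored in blocks
balanced = (
    pd.concat([trolls, normal_sample])
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)
counts = balanced["is_unwanted"].value_counts()
print(counts)
```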

<b>II.1.c)</b> Perform the 80% / 20% split of your dataset with the code below.

What is the `stratify` parameter? Why can it be useful here?

In [None]:
from sklearn.model_selection import train_test_split

dataset_train, dataset_unseen = train_test_split(dataset,
                                                 test_size=0.2,
                                                 random_state=42,
                                                 stratify=dataset['is_unwanted'])

The `stratify=dataset['is_unwanted']` parameter is important here because it ensures that the proportion of unwanted accounts is the same in the train and test sets after the split. Without it, an unlucky random split could leave one of the two classes under-represented, or in the extreme case absent, in the training set, which would make training impossible.
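
The effect of `stratify` can be checked on a small invented dataset. The `toy` frame below, with 20% of positives, is an assumption made for the example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data, invented for illustration: 20 trolls and 80 normal users
toy = pd.DataFrame({
    "x": range(100),
    "is_unwanted": [1] * 20 + [0] * 80,
})

train, unseen = train_test_split(
    toy,
    test_size=0.2,
    random_state=42,
    stratify=toy["is_unwanted"],
)

# With stratification, both parts keep the original share of trolls
train_rate = train["is_unwanted"].mean()
unseen_rate = unseen["is_unwanted"].mean()
print(train_rate, unseen_rate)
```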

### II.2. Screening models with PyCaret

<b>II.2.a)</b> Now it's time to prepare the classification with the call to the `setup` function of <b>PyCaret</b>.

Complete the following code, justifying your choice for the value of `normalize`.

In [None]:
from pycaret.classification import setup

xp = setup(data = ~~~~ , # to be completed
           test_data = ~~~~ , # to be completed
           target = ~~~~ , # to be completed
           normalize = ~~~~ , # to be completed
           session_id = 42,
           silent = True,
          )

<b>II.2.b)</b> Sift through PyCaret's models and compare their performance.

Which model has the best accuracy?

NB: You can specify `exclude=['xgboost', 'catboost']` when sifting to make the calculations much faster. These two algorithms are the most resource-hungry, and (spoiler alert) they do not deliver excellent performance on this problem.

<b>II.2.c)</b> Display the confusion matrix on the test data.

What can you say?

<b>II.2.d)</b> Based on the numbers in this matrix, calculate the precision and recall on the test sample.

Do you observe the same performance as in k-fold validation?
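
As a reminder of the definitions, here is a worked sketch with hypothetical counts. The four numbers below are invented and are not results from this lab. It assumes the usual layout in which rows are true classes and columns are predicted classes, with "troll" as the positive class.

```python
# Hypothetical confusion-matrix counts (invented, not lab results)
tn, fp = 170, 30   # true normal users: predicted normal / predicted troll
fn, tp = 20, 180   # true trolls:       predicted normal / predicted troll

# Precision: among profiles predicted as trolls, the share that really are trolls
precision = tp / (tp + fp)

# Recall: among real trolls, the share that the model detected
recall = tp / (tp + fn)

print(f"precision = {precision:.3f}, recall = {recall:.3f}")
```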

## III. Bonus

<b>III.A)</b> To improve performance, it is possible to vary the model's hyper-parameters.

This can be done with a simple call to the `tune_model` function, [described here](https://pycaret.org/tune-model/).

Take the best model and optimize its hyper-parameters. Do you see an improvement in performance?

<b>III.B)</b> Another avenue for improvement lies in the combination of several models, and the <i>blending</i> of their predictions via a majority vote.

This can be done with a simple call to the `blend_models` function, [described here](https://pycaret.org/blend_models/).

Take the 5 best models and perform a <i>blending</i>. Do you see an improvement in performance on the test set?
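
To illustrate the principle of a majority vote, here is a minimal sketch with hypothetical predictions from three models. It is written with plain NumPy and is not PyCaret's implementation of `blend_models`.

```python
import numpy as np

# Hypothetical 0/1 predictions: one row per model, one column per profile
preds = np.array([
    [1, 0, 1, 0, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 1, 0, 1],
])

# Number of models voting "troll" for each profile
votes = preds.sum(axis=0)

# A profile is labeled as a troll when more than half of the models say so
blended = (votes > preds.shape[0] / 2).astype(int)
print(blended)
```

Using an odd number of models avoids ties in the vote.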

<b>III.C)</b> To conclude this lab, we propose to provide an explanation <i>a posteriori</i> of the functioning of the classifier.

To plot the importance of the explanatory variables, call the `plot_model` function with the model of your choice and the argument `plot="feature"`.

How do you interpret these values?