# Exercise 1 - Analysis

## Students: Leo Pellandini, Steven Jaquet and André Quintas Gervasio

The bank UBS offers the possibility to invest money in investment funds. A fund is composed of financial assets such as stocks or bonds. For example, a fund composed mostly of stocks has more return potential but is riskier in case of a stock market recession. There are thousands of funds available, see https://fundgate.ubs.com/. The probability of investing or not in a fund is conditioned by the profile of the fund and of the client. For example, a younger client with no children is potentially more interested in funds composed of stocks, showing higher risks but also higher potential returns. A father of a family will be more inclined to invest in low-risk funds. UBS wants to build a system as illustrated in Figure~\ref{fig:ubs_system}, taking as input a set of values characterizing the fund and the client profile.

An investment fund can be characterized by the following elements: 

- The name of the fund.
- The current value of 1 share in the fund, expressed in CHF.
- The proportion of stock and bonds composing the fund (2 values in percentage).
- A vector of float values with the 5 last yearly returns over years from 2015 to 2019 (5 values expressed in percentage).
- A level of risk expressed with A, B, C, D, E with A representing the highest risk and E representing the lowest risk level.
- A sectorial information such as technology, pharmaceutical, financial. There are 24 different sectors available in UBS funds.
- As the funds are worldwide, the emitting location is also available as the address of the managing entity of the fund, e.g. Market Street 1234, New York, USA.

A client profile contains the following information: 

- First name and last name of the client.
- The mother tongue of the client (mostly de, fr, it and en but other languages are present).
- The age of the client.
- The number of children of the client.
- The current wealth of the client that could be used to buy funds, expressed in CHF (total of cash available in the different accounts, not yet invested in funds).
- The postal code of the address of the client.
- A level of acceptance to risk expressed with A, B, C, D, E with A representing the highest level of acceptance of risk and E representing the lowest acceptance of risk.

Answer the following questions:

1. For each available information in the fund and client profile, explain how you would prepare the data: encoding, normalization, outlier treatment, etc.
2. How could you collect targets (output of the system) to train the system? How would you prepare the different sets?

**Be as comprehensive as possible.**  Don't limit your explanation to the "how" but also the "why".

---

**For each available information in the fund and client profile, explain why and how you would prepare the data: encoding, normalization, outlier treatment, etc.**

### About the investment fund:

The name of the fund can be ignored for model training: it is an identifier, not a quantitative piece of information useful for prediction. It can still be kept as a reference to interpret the results.

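The value of a share in CHF is a numerical feature that requires normalization to prevent large values from dominating other features, for example min-max scaling to [0, 1] or Z-Norm (zero mean, unit standard deviation). Outliers should be checked first and excluded if they are not realistic, because a single extreme value compresses all other values into a narrow band under min-max scaling.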
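The two normalizations mentioned above can be sketched as follows. This is a minimal illustration using only the standard library; the share values are made up, and in practice the minimum, maximum, mean and standard deviation would be computed on the training set only and then reused for the other sets.

```python
import statistics

def min_max_scale(values, lo=None, hi=None):
    """Rescale values linearly to [0, 1] using the given (or observed) bounds."""
    lo = min(values) if lo is None else lo
    hi = max(values) if hi is None else hi
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def z_norm(values):
    """Center to zero mean and scale to unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

# Hypothetical share values in CHF (not real UBS data)
share_values = [50.0, 100.0, 150.0, 200.0]
scaled = min_max_scale(share_values)
standardized = z_norm(share_values)
```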

The percentages of stocks and bonds are already on a common, bounded scale, so dividing them by 100 is enough to keep them in [0, 1]. A check can be added to ensure that their sum equals 100%. If that check always holds, the two values carry the same information and a single input may be sufficient.
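A possible form of this verification is sketched below. The function name and the tolerance of 0.5 percentage points (to absorb rounding in the source data) are assumptions made for the example.

```python
def prepare_allocation(stock_pct, bond_pct, tolerance=0.5):
    """Validate the stock/bond percentages and map them to [0, 1]."""
    if not (0 <= stock_pct <= 100 and 0 <= bond_pct <= 100):
        raise ValueError("percentages must lie in [0, 100]")
    if abs(stock_pct + bond_pct - 100) > tolerance:
        raise ValueError("stock and bond percentages must sum to 100")
    return stock_pct / 100, bond_pct / 100

# Hypothetical fund with 70% stocks and 30% bonds
stock_share, bond_share = prepare_allocation(70, 30)
```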

The yearly returns from 2015 to 2019 form a short time series of five values, to which the sliding-window method can be applied. Because returns can be negative and have no fixed bounds, a method such as Z-Norm is a suitable way to normalize them.
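To make the sliding-window idea concrete, here is a small sketch that cuts the five yearly returns into overlapping windows. The window size of 3 and the return values are assumptions chosen only for illustration; how the windows are then fed to the model is a separate design choice.

```python
def sliding_windows(series, size):
    """Return all contiguous windows of length `size` over `series`."""
    return [series[i:i + size] for i in range(len(series) - size + 1)]

# Hypothetical yearly returns in percent, for 2015 to 2019
yearly_returns = [2.0, 4.0, -1.0, 6.0, 3.0]
windows = sliding_windows(yearly_returns, size=3)
```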

The risk level (A to E) is an ordinal variable, so an ordinal encoding that preserves the order of the levels can be used, as follows:

<center>

    
|  Risk        | Encoding     |
|---           |--:           |
| A            | 5            |
| B            | 4            |
| C            | 3            |
| D            | 2            |
| E            | 1            |
</center>

("A" representing the highest risk => 5)
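The mapping in the table can be written as a simple lookup. The optional rescaling to [0, 1] is an addition for the example, so that the encoded risk has the same range as the other normalized features.

```python
# Ordinal encoding from the table: A is the highest risk, E the lowest
RISK_ENCODING = {"A": 5, "B": 4, "C": 3, "D": 2, "E": 1}

def encode_risk(level, rescale=False):
    """Map a risk letter to its ordinal code, optionally rescaled to [0, 1]."""
    code = RISK_ENCODING[level.strip().upper()]
    return (code - 1) / 4 if rescale else code

encoded = [encode_risk(level) for level in ["A", "c", "E"]]
```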

The sector is a nominal variable (there is no order or relationship between sectors), so a one-hot encoding is appropriate. With 24 sectors, this gives 24 binary inputs, one per sector.
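A one-hot encoding can be sketched as below. Only the three sectors named in the exercise are listed; the real list would contain all 24 sectors, which are not given here.

```python
def one_hot(value, categories):
    """Return a binary vector with a single 1 at the position of `value`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]

# Placeholder list: the actual UBS list has 24 sectors
SECTORS = ["technology", "pharmaceutical", "financial"]
sector_vector = one_hot("pharmaceutical", SECTORS)
```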

Regarding the emitting location, latitude and longitude coordinates are not particularly relevant in this context. Only the country of emission is kept, as more detailed information such as the street or city does not directly influence the fund’s performance or risk. Bucketing countries by region or economic zone is an interesting option to represent the location of the funds with fewer categories.
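The sketch below extracts the country from an address and maps it to a region. It assumes the country is the last comma-separated component, as in the example address from the exercise; the region table is hypothetical and deliberately incomplete.

```python
# Hypothetical country-to-region table (illustrative only)
REGION_BY_COUNTRY = {
    "USA": "north_america",
    "Switzerland": "europe",
    "Japan": "asia",
}

def country_of(address):
    """Assume the country is the last comma-separated part of the address."""
    return address.rsplit(",", 1)[-1].strip()

def region_of(address):
    """Bucket the address into a region, with a fallback for unknown countries."""
    return REGION_BY_COUNTRY.get(country_of(address), "other")

country = country_of("Market Street 1234, New York, USA")
region = region_of("Market Street 1234, New York, USA")
```

The resulting region is a nominal category and would then be one-hot encoded like any other categorical feature.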


### About the client profile:

Once again, it is not necessary to take into account the client's first and last name, as this information is not useful for model training.

For the native language (mother tongue), there are four main languages in this context (DE, FR, IT, EN), so we can use a one-hot encoding and add an additional column that groups the other languages.
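A minimal sketch of this encoding, with the extra column for all remaining languages, could look like this. The function name and the column order are assumptions.

```python
MAIN_LANGUAGES = ["de", "fr", "it", "en"]
LANGUAGE_COLUMNS = MAIN_LANGUAGES + ["other"]

def encode_language(code):
    """One-hot encode a language code; anything outside the main four maps to 'other'."""
    code = code.strip().lower()
    key = code if code in MAIN_LANGUAGES else "other"
    return [1 if key == column else 0 for column in LANGUAGE_COLUMNS]

french = encode_language("FR")
portuguese = encode_language("pt")
```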

The age has a naturally bounded range and no real outliers, so we can simply apply min-max scaling to [0, 1].

We can process the number of children in the same way as the age. It is important to check for excessively high values, which should be rare and may indicate data entry errors.

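The current wealth can be treated like the value of one share, with one difference: it contains extreme values, either very low or very high. We therefore apply feature clipping first, then min-max rescaling, so that a few very wealthy clients do not compress the rest of the distribution.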
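Clipping followed by min-max rescaling can be sketched as follows. The clipping bounds (0 and 1,000,000 CHF) and the wealth values are hypothetical; in practice the bounds would be chosen from the training data, for example from high and low percentiles.

```python
def clip_and_scale(values, lower, upper):
    """Clip each value to [lower, upper], then rescale linearly to [0, 1]."""
    clipped = [min(max(v, lower), upper) for v in values]
    return [(v - lower) / (upper - lower) for v in clipped]

# Hypothetical wealth values in CHF, including one extreme value
wealth = [500.0, 20_000.0, 150_000.0, 80_000_000.0]
wealth_scaled = clip_and_scale(wealth, lower=0.0, upper=1_000_000.0)
```

Without clipping, the largest value alone would push the three other clients below 0.002 on the [0, 1] scale.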

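Postal codes are numerical in form, but they are identifiers rather than quantities: the difference between two codes has no meaning, so feeding them directly as numbers would suggest an order that does not exist. They can instead be grouped into regions (for example by their leading digits) and one-hot encoded. It is still necessary to ensure that they fall within the expected range.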
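The range check can be sketched as follows, assuming Swiss four-digit postal codes (1000 to 9999), which is an assumption not stated in the exercise. The sketch also derives the leading digit as a coarse region indicator, since the numeric distance between two codes carries no meaning by itself; the function names are hypothetical.

```python
def validate_postal_code(code, lo=1000, hi=9999):
    """Convert the code to int and check it lies in the expected range."""
    value = int(code)
    if not lo <= value <= hi:
        raise ValueError(f"postal code out of range: {code!r}")
    return value

def postal_region(code):
    """Use the leading digit of a four-digit code as a coarse region bucket."""
    return validate_postal_code(code) // 1000

region_a = postal_region("1700")
region_b = postal_region(8001)
```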

As with the risk level of a fund, we can use an ordinal encoding for the risk acceptance levels, with this specific encoding:

<center>


|  Risk        | Encoding     |
|---           |--:           |
| A            | 5            |
| B            | 4            |
| C            | 3            |
| D            | 2            |
| E            | 1            |
</center>

("A" representing the highest level of acceptance of risk => 5)

**How could you collect targets (output of the system) to train the system? How would you prepare the different sets?**

### Collect the targets:

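- Use UBS’s historical data, identifying each client’s profile and the funds in which they have invested. Each (client, fund) pair then provides a target: fund chosen or not.
- Analyze client behavior and trends on the UBS website [fundgate.ubs.com](https://fundgate.ubs.com/), identifying which types of funds are most frequently viewed by which types of clients. Views are a weaker signal than actual investments, but they are available in larger quantity.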

### Three independent data sets:

Use 60% of the collected data to train the model. Ensure a good proportion of each class (funds chosen or not) to avoid an imbalance that could bias the predictions. If certain types of funds are chosen much more frequently than others (for example, low-risk funds that are often popular), techniques such as oversampling underrepresented funds / undersampling overrepresented ones should be applied to reduce bias.
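Random oversampling, one of the techniques mentioned above, can be sketched as follows. It duplicates randomly chosen samples of the smaller classes until every class has as many samples as the largest one. The data are synthetic, and resampling would normally be applied to the training set only so that the validation and test sets keep the real distribution.

```python
import random

def random_oversample(samples, labels, seed=0):
    """Duplicate random minority-class samples until all classes have equal size."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels

# Synthetic example: 8 "not chosen" (0) samples and 2 "chosen" (1) samples
samples = list(range(10))
labels = [0] * 8 + [1] * 2
balanced_samples, balanced_labels = random_oversample(samples, labels)
```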

Use a validation set with 20% of the data to adjust the hyperparameters and evaluate performance outside the training set, which helps detect overfitting. We can also apply k-fold cross-validation, rotating through the folds so that every sample is used exactly once for validation and k-1 times for training.
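The fold rotation of k-fold cross-validation can be sketched with plain index lists. This is a minimal version assuming the data have already been shuffled; the function name is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each index is in a validation fold exactly once."""
    indices = list(range(n_samples))
    # Distribute the remainder so that fold sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(10, k=5))
```

The model is trained k times, once per pair, and the validation scores are averaged.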

Reserve the remaining 20% of the data for the final test. These data should not be used during training or hyperparameter tuning, so that they give an unbiased evaluation of the model’s generalization capability.
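The 60/20/20 split can be sketched as a stratified split, which keeps the proportion of each class the same in the three sets. This is a standard-library sketch on synthetic labels; the function name and the fixed seed are assumptions.

```python
import random
from collections import defaultdict

def stratified_split(samples, labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split into train/validation/test sets, keeping class proportions in each."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append((sample, label))
    train, val, test = [], [], []
    for items in by_class.values():
        rng.shuffle(items)
        n_train = round(fractions[0] * len(items))
        n_val = round(fractions[1] * len(items))
        train += items[:n_train]
        val += items[n_train:n_train + n_val]
        test += items[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test

# Synthetic example: 20 "chosen" (1) and 80 "not chosen" (0) samples
samples = list(range(100))
labels = [1] * 20 + [0] * 80
train, val, test = stratified_split(samples, labels)
```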

Finally, the samples should be assigned to the three sets randomly, and the class distribution of each set should then be verified to ensure it is representative of the whole data set.