# Exercise 1 - Analysis

The bank UBS offers its clients the possibility to invest money in funds (see https://fundgate.ubs.com/). There are thousands of investment funds available. Depending on their profile, clients will be more or less inclined to invest in a given fund, according to the fund's characteristics. For example, a younger client with no children is potentially more interested in funds composed of stocks, which carry higher risk but also higher potential returns. A father with a family will be more inclined to invest in low-risk funds. UBS wants to build a system that takes as input a set of values characterizing the fund and a set of values defining the client profile.

An investment fund can be characterized by the following elements: 

- The name of the fund.
- The current value of 1 share in the fund, expressed in CHF.
- The proportion of stocks and bonds composing the fund (2 values in percent).
- A vector of float values holding the last 5 yearly returns, for the years 2015 to 2019 (5 values expressed in percent).
- A level of risk expressed with A, B, C, D, E with A representing the highest risk and E representing the lowest risk level.
- Sector information, such as technology, pharmaceutical or financial. There are 24 different sectors available in UBS funds.

A client profile contains the following information: 

- First name and last name of the client.
- The mother tongue of the client (mostly de, fr, it and en but other languages are present).
- The age of the client.
- The number of children of the client.
- The current wealth of the client that could be used to buy funds, expressed in CHF (total of cash available in the different accounts, not yet invested in funds).
- The postal code of the address of the client.
- A level of risk acceptance expressed with A, B, C, D, E, with A representing the highest acceptance of risk and E representing the lowest acceptance of risk.

Answer the following questions:

1. For each piece of information available in the fund and client profile, explain how you would prepare the data: encoding, normalization, outlier treatment, etc.
2. How could you collect targets (output of the system) to train the system? How would you prepare the different sets?

**Be as comprehensive as possible.** Imagine that you give your analysis to your trainee: he must be able to implement the system from it.

---

**For each piece of information available in the fund and client profile, explain how you would prepare the data: encoding, normalization, outlier treatment, etc.**

|Feature|Processing|
|----------|------------|
|Fund name|Irrelevant, can be ignored|
|Fund value|Encode as float, normalize using Z-norm|
|Stocks/bonds ratio|Only keep the first value, as the second can be inferred from it (the two percentages sum to 100) and does not add any information. Encode as float, normalize with min-max scaling|
|Last yearly returns|Separate the five values into five distinct inputs. For each, encode as float, normalize with Z-norm|
|Risk|Encode classes (A, B, C, D, E) using one-hot encoding|
|Sector|Encode all 24 classes using one-hot encoding|
|First/last name|Irrelevant, can be ignored|
|Mother tongue|Encode classes (de, fr, it, en, other) using one-hot encoding|
|Client's age|Assuming not many outliers, encode as float, normalize with min-max scaling|
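|Number of children|Encode as a float. Because of Switzerland's low fertility rate, the distribution is skewed towards small values, so normalize with log scaling, using log(1 + x) so that a client with 0 children stays defined|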
|Current wealth|Encode as a float, normalize with Z-norm|
|Postal code|Probably irrelevant and can be ignored; if it is used, encode the postal codes (NPA) using one-hot encoding|
|Risk acceptance|Encode classes (A, B, C, D, E) using one-hot encoding|
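
The encodings in the table can be sketched with a few small helpers. This is only an illustration using the Python standard library: the function names (`z_norm`, `min_max`, `one_hot`, `log_scale`) are hypothetical, and the choice of the population standard deviation for the Z-norm is an assumption, not something the exercise prescribes.

```python
import math
from statistics import mean, pstdev

# Class list shared by the fund risk level and the client risk acceptance.
RISK_LEVELS = ["A", "B", "C", "D", "E"]


def z_norm(values):
    """Z-norm: subtract the mean and divide by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        return [0.0 for _ in values]  # constant feature carries no signal
    return [(v - mu) / sigma for v in values]


def min_max(values, lo=None, hi=None):
    """Min-max scaling to [0, 1]; lo/hi default to the observed range."""
    lo = min(values) if lo is None else lo
    hi = max(values) if hi is None else hi
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(value, classes):
    """One-hot vector with a 1.0 at the position of `value` in `classes`."""
    return [1.0 if value == c else 0.0 for c in classes]


def log_scale(count):
    """log(1 + x), defined for a count of 0 (e.g. no children)."""
    return math.log1p(count)


# Stock percentage: the natural bounds 0 and 100 are known in advance.
print(min_max([80.0, 20.0, 50.0], lo=0.0, hi=100.0))  # [0.8, 0.2, 0.5]
print(one_hot("B", RISK_LEVELS))  # [0.0, 1.0, 0.0, 0.0, 0.0]
print(log_scale(0))  # 0.0
```

In practice the mean, standard deviation, minimum and maximum would be computed once on the training data and reused unchanged for every other set, so that all inputs are scaled the same way.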


**How could you collect targets (output of the system) to train the system? How would you prepare the different sets?**

For a given fund and client, the system returns the likelihood that this client would buy this fund. Training the system requires that we come up with values that express this likelihood. Let's assume that UBS provides a training set that contains a set of clients and the funds they are currently invested in. One thing we could do is:

- Define typical profiles (such as a young adult without a family, a father with children, etc.)
- Group people in the training set by the profile that matches them best
- For each group, compute empirically the probability that this group invests in fund X (say we have 10 people in our family father group, and 8 of them bought fund X, then fund X has a probability of 0.8 for this group)
- Train the model over the set of all funds and people. The ground truth to use is the probability that a given person's group buys the given fund.
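
The target computation described in the steps above could look like the following sketch. The function name `group_purchase_probabilities` and the input layout (a mapping from client id to profile group, plus a collection of `(client_id, fund_name)` pairs) are assumptions made for illustration; the exercise does not specify how UBS stores this data.

```python
from collections import defaultdict


def group_purchase_probabilities(client_groups, holdings):
    """Empirical probability that a profile group holds a given fund.

    client_groups: dict mapping client id -> profile group name
    holdings: iterable of (client id, fund name) pairs
    Returns a dict mapping (group, fund) -> fraction of the group holding it.
    """
    # Number of clients in each profile group.
    group_sizes = defaultdict(int)
    for group in client_groups.values():
        group_sizes[group] += 1

    # Number of distinct clients per group holding each fund.
    buyers = defaultdict(int)
    for client_id, fund in set(holdings):  # set() avoids double counting
        buyers[(client_groups[client_id], fund)] += 1

    return {
        (group, fund): count / group_sizes[group]
        for (group, fund), count in buyers.items()
    }


# Example from the text: 10 family fathers, 8 of them hold fund X.
client_groups = {i: "family father" for i in range(10)}
holdings = [(i, "X") for i in range(8)]
targets = group_purchase_probabilities(client_groups, holdings)
print(targets[("family father", "X")])  # 0.8
```

Any (group, fund) pair missing from the result was never observed in the data and corresponds to a target of 0. Each training example then pairs one client's encoded profile and one fund's encoded characteristics with the probability of that client's group.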