# Machine learning techniques for efficient portfolio diversification

Supervised Learning and Unsupervised Learning are based on different assumptions.

While both supervised and unsupervised learning assume “adequate data” for analysis, supervised learning problems require labels (correct answers), whereas unsupervised learning does not assume labels. Thus, evaluating whether an unsupervised algorithm is successful is more complicated and judgmental.

# Benefits of portfolio diversification

![image.png](attachment:image.png)

![image.png](attachment:image.png)

# Portfolio diversification measures

![image.png](attachment:image.png)

Practice question: What is the effective number of constituents (ENC) of a portfolio invested in 3 stocks with weights 25%, 50% and 25%? 

Answer: 1/(0.25^2+0.5^2+0.25^2)=2.67
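The arithmetic in the answer above can be checked with a short sketch. It assumes ENC is defined as the reciprocal of the sum of squared weights, as in the answer; the function name `effective_number_of_constituents` is only illustrative.

```python
def effective_number_of_constituents(weights):
    """ENC = 1 / (sum of squared weights); the weights are expected to sum to 1."""
    total = sum(weights)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return 1.0 / sum(w ** 2 for w in weights)


# The practice question: three stocks weighted 25%, 50% and 25%.
enc = effective_number_of_constituents([0.25, 0.50, 0.25])
print(round(enc, 2))  # 2.67

# An equally weighted portfolio of N stocks has ENC equal to N.
enc_equal = effective_number_of_constituents([0.25, 0.25, 0.25, 0.25])
print(round(enc_equal, 2))  # 4.0
```

The equally weighted case shows why ENC reads as a "number of stocks": concentrating weight in one name pulls the value below the actual count.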

![image.png](attachment:image.png)

This measure is used to test how well the portfolio has been diversified.

# Principal component analysis (PCA)


![image.png](attachment:image.png)

Practice question: What is the goal of Principal Component Analysis (PCA)?

Answer: Reduce number of dimensions in data and project onto a set of orthogonal factors

PCA is a form of unsupervised learning.

PCA reduces the dimensions for problems with a large set of variables to a small set of variables.

The reduced set of variables (the principal components) are uncorrelated with each other.

The size of each eigenvalue indicates the variance explained along the accompanying eigenvector direction.
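A minimal sketch of these points, assuming NumPy is available and using synthetic returns for four hypothetical stocks driven by one common factor (the data and variable names are invented for illustration, not taken from the course notebook):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic daily returns: one common "market" factor plus stock-specific noise.
n_days, n_stocks = 500, 4
market = rng.normal(0.0, 0.01, size=(n_days, 1))
noise = rng.normal(0.0, 0.003, size=(n_days, n_stocks))
returns = market @ np.ones((1, n_stocks)) + noise

# Eigen-decomposition of the sample covariance matrix.
cov = np.cov(returns, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigh returns ascending order
order = np.argsort(eigenvalues)[::-1]            # reorder to descending
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Share of total variance explained by each eigenvector direction.
explained = eigenvalues / eigenvalues.sum()

# Project the centered returns onto the first k components.
k = 2
components = (returns - returns.mean(axis=0)) @ eigenvectors[:, :k]
component_cov = np.cov(components, rowvar=False)

print(explained)
print(component_cov)  # off-diagonal entries are numerically zero
```

The off-diagonal entries of `component_cov` vanish because the eigenvectors are orthogonal, which is the sense in which the reduced variables are uncorrelated.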

# Role of clustering

![image.png](attachment:image.png)

The goal of a clustering exercise: 

Clustering fits within the unsupervised learning framework. Thus, labeled data is unavailable for classification. 

In fact, the data should be addressed at face value, and generic patterns detected.  As always, we assume that the data is a “sample” taken from an unknown population. 

And we are interested in discovering interesting patterns in the data. 

Ideally, a cluster should represent a homogeneous group of objects with common characteristics, with heterogeneous characteristics across clusters.

Clustering can be carried out in a heuristic fashion or via a formal optimization algorithm.

![image.png](attachment:image.png)

This approach is suitable for problems with fewer than 5,000 variables.

The goals of a cluster analysis of stocks: 

Separate entities (stocks in our case) into relatively homogeneous groups. As a consequence, and ideally, the entities in different clusters should display heterogeneous factor behavior.

Clustering is a form of <mark>unsupervised learning</mark>.

Its goal is to identify groups of homogeneous entities (stocks).

There is a wide variety of clustering algorithms, most of which are <mark>heuristic in nature</mark>.
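As one illustration of a heuristic clustering method, here is a minimal k-means (Lloyd's algorithm) sketch in NumPy. The notes do not say which algorithm the course uses, so k-means is an assumption, and the six "stocks" with their (annualized return, annualized volatility) features are made-up numbers.

```python
import numpy as np


def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's heuristic: alternate between assigning points and updating centers."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every center, then nearest-center labels.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its members (keep it if it has none).
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


# Hypothetical features: (annualized return, annualized volatility).
features = np.array([
    [0.04, 0.10], [0.05, 0.12], [0.06, 0.11],   # low-volatility group
    [0.20, 0.40], [0.22, 0.38], [0.18, 0.42],   # high-volatility group
])
labels, centers = kmeans(features, k=2)
print(labels)
```

The procedure is heuristic in the sense that it only reaches a local solution that can depend on the starting centers; it does not solve a formal optimization model to global optimality.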



# Graphical analysis

![image.png](attachment:image.png)

![image.png](attachment:image.png)

The graphical network is a visualization of the relationships among a set of entities. We apply the network to several examples, such as stocks or countries. Each node depicts an entity, whereas the arcs provide a measure of conditional dependence. The location of a node on the graph is also helpful insofar as it gives a rough sense of the overall pattern across the full set of entities.

The graphical analysis involves <mark>unsupervised learning</mark>.

Its goal is to specify, through a network graph, the conditional independence structure among a group of entities.

The graph is highly interpretable, so we can discover interesting patterns, especially as conditions change.
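The idea of conditional independence can be sketched with partial correlations, which come from the inverse covariance (precision) matrix. This is a plain matrix inversion on synthetic data, not the estimator used in the course notebook; the three series, the chain structure, and the 0.1 cutoff are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Hypothetical chain: A drives B, and B drives C.
# A and C are correlated, but only through B.
a = rng.normal(size=n)
b = 0.8 * a + rng.normal(size=n)
c = 0.8 * b + rng.normal(size=n)
returns = np.column_stack([a, b, c])

# Partial correlations from the precision matrix:
# rho_ij = -K_ij / sqrt(K_ii * K_jj)
precision = np.linalg.inv(np.cov(returns, rowvar=False))
d = np.sqrt(np.diag(precision))
partial_corr = -precision / np.outer(d, d)
np.fill_diagonal(partial_corr, 1.0)

# Ordinary correlations, for comparison.
corr = np.corrcoef(returns, rowvar=False)

# Draw an edge only where the partial correlation is clearly non-zero.
threshold = 0.1  # illustrative cutoff, not a recommended value
edges = [
    (i, j)
    for i in range(3)
    for j in range(i + 1, 3)
    if abs(partial_corr[i, j]) > threshold
]
print(edges)  # [(0, 1), (1, 2)]
```

A and C have a sizeable ordinary correlation yet no edge between them: once B is accounted for, they are conditionally independent, which is what a missing arc in the network graph expresses.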


# Selecting a portfolio of assets

![image.png](attachment:image.png)

There are several applications of forming a widely diverse set of stocks. These stocks could be preselected in a manner to choose higher expected value components, and then we could form a representative set of the promising stocks via diversification.  This strategy could protect against unforeseen risks.  Alternatively, the low volatility factor has proven effective over relatively long time periods and we could invest in this strategy by means of a widely diverse set of stocks. This would eliminate the need to depend upon correlation estimates. 

Both PCA/Clustering and a Graphical approach show promise for selecting a widely diversified portfolio of stocks and other securities.

The graphical structure perhaps has a slight edge insofar as it is highly interpretable.



Question 1
What are examples of successful applications of unsupervised learning in finance? Multiple responses possible (3 correct answers)
1 point

<mark>Graphical network analysis to identify clients who might be amenable to new and novel services</mark>

Indicate the conditions of a crash period

<mark>Unusual behavior by credit card holders within a fraud detection system</mark>

<mark>Identifying firms with unusual levels of income for their respective industry and investments</mark>


Question 2
Can PCA be applied to the time series of a portfolio of stock prices?
1 point

<mark>True</mark>

False


Question 3
What is the optimal number of clusters for a particular set of stocks such as the S&P 500? Multiple responses possible
1 point

<mark>The analysts should plot the total similarity measure as a function of the number of clusters and select a point on the elbow of this curve.</mark>

<mark>There is a tradeoff between the number of clusters and the degree of parsimony in a cluster.  </mark>

<mark>The cluster analysis should provide summary statistics on the similarity measure for stocks within a cluster and choose the number of clusters that minimizes similarity</mark>

Clustering output requires at least 10 clusters. 


Question 4
How might you identify a firm that has highly unusual behavior from the output of a cluster analysis? Multiple responses possible
1 point

<mark>Stocks with similar behavior should be identified in a single cluster</mark>

<mark>The stock would appear in its own cluster </mark>

Stocks in a single industry will always show up in a common cluster


Question 5
Much of the success in machine learning is due to access to massive data on customer behavior coming out of user tracking. True or false?
1 point

<mark>True</mark>

False


Question 6
Can unsupervised learning make decisions about the best stock to invest in? True or false?
1 point

True

<mark>False</mark>


<mark>?Question 7</mark>
What did the empirical tests show regarding the application of PCA/clustering and graphical analysis to a stock selection application? Multiple responses possible
1 point

?<mark>The graphical analysis was slightly improved over the PCA/clustering method</mark>

Both methods outperformed the Markowitz portfolio 

The empirical tests did not improve performance


Question 8
Does the graphical analysis fit in the area of supervised learning?  True or false? 
1 point

True


<mark>False</mark>


Question 9
In the discussed empirical tests between the PCA/clustering algorithm and the graphical network analysis, did one of the approaches dramatically outperform the other? True or false?   
1 point

True

<mark>False</mark>


?Question 10
Most clustering methods employ heuristic algorithms, as compared with a formal optimization model. What leads to this situation? Multiple responses possible
1 point

?<mark>The optimization model is much harder to solve, especially for an application with a large number of points/customers/variables to cluster</mark>

?<mark>The heuristic methods are easy to interpret</mark>

?<mark>The distance function and objective function required by the optimization present an informational barrier</mark>


Question 11
A cluster always has at least two stocks
1 point

<mark>False</mark>


True


Question 12
Which of the following sentences correctly summarizes the relationship between the graphical network and the number of sectors?
1 point

The number of clusters must not be greater than the number of sectors

The number of clusters must be greater than the number of sectors

The number of clusters and the number of sectors must be equal

<mark>None of above</mark>


Question 13
When graphical analysis is performed, the results are different for different lengths of time (6M, 1Y, 5Y). What are the differences and what could cause these differences?
1 point

<mark>The longer the length of time, the more separated the clusters: using more information the trend of how stocks behave is clearer and then companies that are not in the same cluster show more discrepancies in their trends.</mark>
 

The longer the length of time, the more separated the stocks: using more information the trend of how stocks behave is less clear and then companies show more independent trends.
 

The shorter the length of time, the more separated the clusters, using less information the trend of how stocks behave is clearer and then companies that are not in the same sector show more discrepancies in their trends.
 

The shorter the length of time, the more separated the clusters, using less information the trend of how stocks behave is clearer and then companies show more independent trends.


Question 14
In this notebook, when we calculate the summary statistics, we notice that the stock “BABA” was dropped because it has missing data for our specified period of time. Please revise the notebook to display the summary statistics information for “BABA” over the full period of time for which its data are available. What are the “Annu. Ave Return” and the “Annualized Sharpe” for “BABA”? (Hint: the command “first_valid_index()” can be useful here.)
1 point

<mark>21.77%; 0.63</mark>


19.47%; 0.72


11.26%; 0.66


8.84%; 0.45
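The hint in Question 14 can be illustrated with a small pandas sketch. It uses synthetic prices for two hypothetical tickers, `OLD` and `NEW`, where `NEW` has no data for its first 200 days. The annualization conventions (mean daily return times 252, zero risk-free rate) are assumptions and may differ from the notebook's definitions, so the numbers it prints are not the quiz answers.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
dates = pd.bdate_range("2014-01-01", periods=500)

# Synthetic price paths; NEW is "listed" 200 business days after OLD.
prices = pd.DataFrame(
    {
        "OLD": 100 * np.exp(np.cumsum(rng.normal(0.0003, 0.01, 500))),
        "NEW": 50 * np.exp(np.cumsum(rng.normal(0.0005, 0.02, 500))),
    },
    index=dates,
)
prices.loc[prices.index[:200], "NEW"] = np.nan


def summary_stats(price_series, periods_per_year=252):
    """Statistics over the span where the series actually has data."""
    start = price_series.first_valid_index()  # first date with a non-missing price
    returns = price_series.loc[start:].pct_change().dropna()
    ann_return = returns.mean() * periods_per_year
    ann_vol = returns.std() * np.sqrt(periods_per_year)
    return {
        "start": start,
        "n_obs": len(returns),
        "ann_return": ann_return,
        "sharpe": ann_return / ann_vol,  # assumes a zero risk-free rate
    }


stats_new = summary_stats(prices["NEW"])
stats_old = summary_stats(prices["OLD"])
print(stats_new["start"], stats_new["n_obs"])
```

Computing the statistics per column from each series' own first valid date avoids dropping a late-listed stock, at the cost of comparing stocks over different sample periods.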


<mark>?Question 15</mark>
Using the notebook, perform graphical analysis with S&P500 (i.e. 23 stock returns in total) for the time period 2015-07-01 to 2020-06-28. Which of the following sentences correctly summarizes the resulting network plot? Multiple responses possible.
1 point

*A: 4 clusters are identified

?<mark>All banks are identified to be in the same cluster</mark>

?<mark>APPL, GOOG, MSFT, and SP500 are identified to be in the same cluster</mark>

?<mark>HSBC and JNJ are not connected directly by an edge</mark>


Question 16
In this notebook, provided that two stocks in the network graph are disconnected (i.e. are not connected directly by an edge), what can we conclude about the relationship between these two stocks?
 
1 point

The two stocks are independent of each other

<mark>The two stocks are independent conditionally on the others</mark>

The two stocks are in the same cluster

The two stocks are from the same sector


Question 17
Using the notebook, perform graphical analysis without S&P500 (i.e. 22 stock returns in total) for the time period 2015-07-01 to 2020-06-28. Which of the following sentences correctly summarizes the resulting network plot? Multiple responses possible.
1 point

4 clusters are identified

<mark>All banks are identified to be in the same cluster</mark>

APPL, GOOG, MSFT, and BABA are identified to be in the same cluster

<mark>HSBC and JNJ are not connected directly by an edge</mark>


Question 18
Comparing the two network plots for the graphical analysis with and without S&P500 (i.e. 23 and 22 stock returns in total, respectively) for the time period 2015-07-01 to 2020-06-28, which of the following observations are correct? Multiple responses possible.
1 point

Both analyses group the stock returns into 4 clusters

<mark>BABA is identified to be in a cluster of size one, regardless of whether or not S&P 500 is included</mark>

<mark>The international banks (HSBC, RY) tend to be positioned at the periphery of the bank sector, regardless of whether or not S&P 500 is included</mark>

None of the above