In [2]:
from IPython.display import HTML

HTML('''<script>
  function code_toggle() {
    if (code_shown){
      $('div.input').hide('500');
      $('#toggleButton').val('Show Code')
    } else {
      $('div.input').show('500');
      $('#toggleButton').val('Hide Code')
    }
    code_shown = !code_shown
  }

  $( document ).ready(function(){
    code_shown=false;
    $('div.input').hide()
  });
</script>
<form action="javascript:code_toggle()"><input type="submit" id="toggleButton" value="Show Code"></form>''')

# Industry footprint of Flat6Labs 
author: Oliver Tsappis, otsappis@googlemail.com
[HTML friendly version which doesn't include interactive bokeh and 3D plotly charts.]

### Brief overview
* Flat6Labs is a startup accelerator operating across the MENA region.
* Every startup invested in by the Flat6Labs fund impacts industries, people and markets around the world. 
* How can they measure these things to improve their business and help the startups in their portfolio?
* This project creates a visual map of startups, clustered by their industries to provide insights to support their business.
* This is an exploratory project that raises more questions than it answers, so it needs to be optimised and developed further before useful conclusions can be drawn.
##### Insights
* We find that Flat6Labs startups can be generalised into e-commerce, tech and creative industries.
* Several startups operate between industries, which means their business models join industries in a relatively unique way and could indicate a particularly innovative or disruptive idea. For example, Reform Studio operates somewhere between technology and creative industries, using a socially responsible business model to disrupt the way we think about design and sourcing our home furniture.
* There seems to be more of a divide between creative industries and the rest. Flat6Labs' investment strategy could exploit this further, although the divide could also reflect trends in entrepreneurship in the region, or the share of creative startups may already be high in comparison with other investors. This needs further investigation or intuition.


### Background
##### What is Flat6Labs?
Flat6Labs is a leading startup accelerator operating in the MENA region, founded by Sawari Ventures in Cairo. Their aim is to push innovation in the Middle East & North Africa by supporting early stage startups and entrepreneurs with 3 tools:
	1. Angel investment.
	2. Acceleration program: If selected, local entrepreneurs are trained and connected to a network of experts and investors to improve performance of their startups.
	3. Follow-on investment program. 
Their main objectives: 
	1. To give local entrepreneurs a platform to bring innovation to local and global markets for the greater good of MENA economies and people.
	2. Improve value of their angel & follow-on investment portfolios.

 
In order to create the largest impact and return possible, Flat6Labs must optimise their startup selection procedure and acceleration program. To do this effectively, they must understand the complex world in which they operate: industries, people, markets etc. This is especially important information for the selection committee and program managers. These people need to know the future potential of applicants and in what way to help startups already in the acceleration program. Being able to model, visualise and communicate these things would allow Flat6Labs to understand the entrepreneurial environment better and find opportunities to optimise their business:
	- What types of problems are regional entrepreneurs trying to solve?
	- What types of businesses are investors most excited about?
	- What strengths, patterns or gaps are there in the investment strategy of Flat6Labs?
	- Where can technology and recent innovation be applied elsewhere, especially tech developed inside the Flat6Labs portfolio?
	- What business models are having the largest impact in terms of investment, revenue, employment etc.?

One company that is already modeling this kind of data and is answering these types of questions for their clients is Quid, who've partly inspired this project. You can read more on how they do it here (non-technical): https://goo.gl/YeD1EH

The scope of this kind of research can be huge, but we will start by looking at industries the startups are currently operating in and dive a bit deeper from there.

### Objective
* To visualise the map of Flat6Labs' investment portfolio of startups in terms of their industries.
* Imply some answers to the questions: 
	1. What types of problems are regional entrepreneurs trying to solve?
	2. What patterns or gaps are there in the investment strategy of Flat6Labs?
	3. Are there any potential market gaps the selection team should look into?

### Data
* We have available a list of Flat6Labs startups, each tagged with their industries.
* The tags were authored by myself based on independent research.
* Not all the startups are still operating, so the list shows the ideas that Flat6Labs have directly invested in.
* Columns: name | industry.
* Each startup can have multiple entries.
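
To make the data layout concrete, here is a minimal sketch of how a long `name | industry` table turns into one row per startup. The startup names and tags below are hypothetical and are not taken from the real dataset, and `pd.crosstab` is used here as a compact alternative to the pivot used later in this notebook.

```python
import pandas as pd

# Hypothetical rows for illustration only; not from the real dataset.
toy = pd.DataFrame({
    "name": ["StartupA", "StartupA", "StartupB", "StartupC", "StartupC"],
    "industry": ["e-commerce", "logistics", "technology", "design", "e-commerce"],
})

# One row per startup, one column per tag, 1 if the startup carries the tag.
# clip(upper=1) guards against a tag being listed twice for the same startup.
tag_matrix = pd.crosstab(toy["name"], toy["industry"]).clip(upper=1)
print(tag_matrix)
```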

### Approach
* Principal component analysis (PCA) will be used to generate coordinates to visualise the startups on a map, positioned according to their individual combination of industries. 
* K-means will be used to allocate startups to clusters.
* We have the following industry maps:
    - Level 1: This first map will show all Flat6Labs startups in 3 main clusters as an intro.
    - Level 2: The second map will break-down these 3 clusters into 8 clusters to give more detail.
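
The approach above can be sketched on a toy example. The tag matrix, startup names and the choice of 2 clusters below are assumptions made purely for illustration. The point it shows is that the clusters are fitted on the full tag space, while PCA is only used to place the points on the map.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Hypothetical binary tag matrix: 6 startups x 4 tags.
tags = pd.DataFrame(
    [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 1],
     [1, 0, 0, 1],
     [0, 1, 1, 0]],
    index=["A", "B", "C", "D", "E", "F"],
    columns=["e-commerce", "logistics", "technology", "software"],
)

# Clusters are fitted on all tag columns, not on the 2D coordinates.
labels = KMeans(n_clusters=2, n_init=10, random_state=4).fit_predict(tags)

# PCA is fitted once and only provides the x/y position of each startup.
coords = PCA(n_components=2).fit_transform(tags)

toy_map = pd.DataFrame(coords, index=tags.index, columns=["x", "y"])
toy_map["labels"] = labels
print(toy_map)
```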
    
### Model limitations
* It doesn't pick out some key industries very well, for example there are a few startups in film, but they get swallowed up into 'creative' and 'e-commerce' clusters.
* There are a few startups that are allocated to some clusters that don't follow the same logic as the majority.
* It would be nice to be able to interpret the white spaces between clusters, but PCA can't reliably tell you much by the actual coordinates it provides. It should only be used as an approximation really (credit to Mohammady Mahdy in Cairo for pointing it out).
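
One rough way to gauge how much of an approximation the 2D map is: look at the share of variance the first two components keep. The sketch below uses a randomly generated binary matrix as a stand-in for the real tag matrix, so the printed figure says nothing about the Flat6Labs data.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in for the tag matrix: 40 startups x 12 tags.
rng = np.random.default_rng(0)
tags = (rng.random((40, 12)) < 0.25).astype(float)

pca = PCA(n_components=2).fit(tags)

# Fraction of the total variance that survives the projection to 2D.
retained = pca.explained_variance_ratio_.sum()
print(f"Variance kept by the 2D map: {retained:.0%}")
```

A low value would suggest that distances on the map should be read with extra caution.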

### Future steps
* Modelling the clusters in 3d to show relationships between startups, most likely using plotly.
* Optimise the level 2 footprint using trial and error with multiple k-means arrangements.
* It would be interesting to use a network graph approach on top of the clustering algorithm, showing direct links between startups, perhaps defining the edges with a different set of data.
* Improve presentational elements, e.g. adjusting marker sizes to show number of employees or level of investment.
* Adjust the dimension reduction approach to make the white spaces between startups interpretable. This would be useful for better spotting the startups with disruptive business models.
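
As a possible starting point for the k-means trial and error mentioned above, the sketch below scores several cluster counts with the silhouette coefficient. The silhouette score is my suggestion rather than something used in this project, and the tag matrix is randomly generated for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical stand-in for the tag matrix: 60 startups x 15 tags.
rng = np.random.default_rng(0)
tags = (rng.random((60, 15)) < 0.2).astype(float)

# Fit k-means for a range of cluster counts and score each arrangement.
scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=4).fit_predict(tags)
    scores[k] = silhouette_score(tags, labels)

# Higher silhouette means tighter, better separated clusters.
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

The score is only a guide. A cluster count that scores slightly lower may still be preferable if its clusters are easier to label with industries.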



## 1. Level 1 industry footprint:
Clustering the startups into 3 main groups, giving a general overview of where the Flat6Labs portfolio is operating.

In [1]:
''' Data prep and preprocessing '''
import pandas as pd

# importing startup names and their industry tags
df = pd.read_csv('data/industry_complete.csv', usecols=["name", "industry"])

In [4]:
# creating a matrix with tags as features and binary values to show if a startup has a tag or not
df['values'] = 1
matrix = df.pivot_table(index=['name'], columns=['industry'], values=['values'])
matrix = matrix.fillna(0).reset_index()

# save a list of the industry columns to fit the model
x_cols = matrix.columns[1:]

In [5]:
# creating prediction labels using kmeans
from sklearn.cluster import KMeans
model = KMeans(3, random_state=4)
matrix['labels'] = model.fit_predict(matrix[x_cols])

In [6]:
from sklearn.decomposition import PCA
pca = PCA(2)
# fit once and reuse the projection for both coordinates
coords = pca.fit_transform(matrix[x_cols])
matrix['x'] = coords[:,0]
matrix['y'] = coords[:,1]

matrix = matrix.reset_index()

In [7]:
from bokeh.plotting import figure, output_notebook, show, ColumnDataSource
from bokeh.models import HoverTool
from bokeh.palettes import brewer
output_notebook()

In [8]:
# create a table to feed into the visualisation
df_c = matrix.copy() # make a copy to avoid warnings
df_c.columns = df_c.columns.droplevel(1)

df_c.drop([x for x in df_c.columns if x not in ['name','x','y','labels']], inplace=True, axis=1)

In [9]:
# Label the clusters with their common industries
# The investigation was done outside this file to speed things up 
mapping = {0:'e-commerce', 1:'technology', 2:'creative'}
df_c['cluster'] = df_c['labels'].map(mapping)

In [10]:
# collecting distinct colours for each cluster
import numpy as np
colormap = np.array(brewer['Set1'][3])
df_c['color'] = colormap[df_c['labels']]

# bokeh needs to be fed the table using 'source'
source = ColumnDataSource(df_c)

# adding the startup labels you'll see when you hover over points on the visualisation 
hover = HoverTool(tooltips=[('/s/','@name')])

# setting up plot object
p = figure(plot_width=800,
         plot_height=600,
         title="Fig 1.) LEVEL 1 INDUSTRY MAP",
         tools=[hover],
         min_border=1)

p.circle('x', 'y', size=10, fill_alpha=0.5, source=source, color='color', legend='cluster')

p.axis.visible = False
p.grid.visible = False

p.legend.location = 'top_left'
#p.legend.click_policy = 'hide'

show(p)

##### Can we infer any answers to our questions from this?

##### On the Flat6Labs investment strategy
In this chart, technology, e-commerce and creative industries were the most common types across the startups. This might not tell us a whole lot, but it shows there's quite a wide range of industries that Flat6Labs likes to invest in.

##### On the motivation of local entrepreneurs
There might be fewer startups in the creative cluster compared to the other two, but I'd imagine that's actually a pretty high ratio for an angel investor (with no evidence to back up that point). It could mean entrepreneurs in the MENA region are particularly creative and design focused compared to their counterparts elsewhere. 

##### On the potential market opportunities
You can also see that there isn't as much cross-over between creative companies and tech or e-commerce, whereas tech and e-commerce regularly go hand-in-hand. Perhaps there are market opportunities not being exploited using tech or e-commerce in creative industries, but we can dig a bit deeper to get a better picture.
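
A visual impression of cross-over like this could be checked with a tag co-occurrence count. The sketch below is not part of the original analysis, and the matrix is hypothetical. Multiplying the transposed binary tag matrix by itself counts how many startups share each pair of tags.

```python
import pandas as pd

# Hypothetical binary tag matrix (rows: startups, columns: tags).
tags = pd.DataFrame(
    [[1, 1, 0],
     [1, 1, 0],
     [0, 1, 0],
     [0, 0, 1],
     [1, 0, 1]],
    index=["A", "B", "C", "D", "E"],
    columns=["technology", "e-commerce", "creative"],
)

# Entry (i, j) counts startups tagged with both i and j.
# The diagonal counts the startups carrying each tag.
co_occurrence = tags.T.dot(tags)
print(co_occurrence)
```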

## 2. Level 2 industry footprint:
Clustering the startups into more niche groups of industries (8 in total), giving more detail to the Flat6Labs portfolio.

In [11]:
# creating prediction labels using kmeans
model = KMeans(8, random_state=4)
matrix['labels'] = model.fit_predict(matrix[x_cols])

In [12]:
pca = PCA(2)
# fit once and reuse the projection for both coordinates
coords = pca.fit_transform(matrix[x_cols])
matrix['x'] = coords[:,0]
matrix['y'] = coords[:,1]

matrix = matrix.reset_index()

In [13]:
# create a table to feed into the visualisation
df_c = matrix.copy() # make a copy to avoid warnings
df_c.columns = df_c.columns.droplevel(1)

df_c.drop([x for x in df_c.columns if x not in ['name','x','y','labels']], inplace=True, axis=1)

In [14]:
mapping = {0:'e-commerce: marketing & professional services', 1:'tech: software, mobile & AI', 2:'tech: hardware & IoT',
          3:'e-commerce: logistics & crowdsourcing', 4:'social network: content & marketing', 5:'creative: design & education', 
           6:'finance: professional services & banking', 7:'social: sustainability & waste management'}

df_c['cluster'] = df_c['labels'].map(mapping)

#a table showing the 10 top tags in each cluster
#df['cluster'] = df['name'].map(df_c.set_index('name')['cluster'])

#ind_counts = df.groupby(['cluster', 'industry']).size()
#ind_counts = ind_counts.groupby(level=0).nlargest(10)
#ind_counts = ind_counts.loc[:,1:]

#ind_counts

In [26]:
# collecting distinct colours for each cluster
colormap = np.array(brewer['Set1'][8])
df_c['color'] = colormap[df_c['labels']]

# bokeh needs to be fed the table using 'source'
source = ColumnDataSource(df_c)

# adding the startup labels you'll see when you hover over points on the visualisation 
hover = HoverTool(tooltips=[('/s/','@name')])

# setting up plot object
p = figure(plot_width=800,
         plot_height=600,
         title="Fig 2.) LEVEL 2 INDUSTRY MAP",
         tools=[hover],
         min_border=1,
         x_range=(-1.5,1.5),
         y_range=(-1,3))

p.circle('x', 'y', size=10, fill_alpha=0.5, source=source, color='color', legend='cluster')

p.axis.visible = False
p.grid.visible = False

p.legend.location = 'top_left'
#p.legend.click_policy = 'hide'

show(p)

##### On the Flat6Labs investment strategy and motivations of local entrepreneurs
There's a reasonably sized group of startups whose main focus is social impact, i.e. sustainability, waste management, renewable energy and agritech. Considering the difficulty of bringing heavily vocational startups to market (they are largely dependent on experimental tech), it's admirable that Flat6Labs invests in this area and that so many MENA entrepreneurs are driven to solve social problems. 

The same could be said for educational startups. They sit largely in the creative cluster and use gaming, like SpicaTech Academy, which teaches kids about game development. Other startups are tackling the education space with tech, like GitHelp, or with e-commerce, like Marj3 and iRehab. 

##### On the market opportunities and unique startups
* Looking at the space between the creative cluster and the rest, Reform Studio (social cluster) stands out. They specialise in designing and manufacturing furniture using plastic waste like plastic bags. They are tagged with industries common for tech companies, such as 'manufacturing', 'waste management' and 'sustainability', as well as 'creative' and 'design' like others located nearby. 
* Gorfah is another startup trying to bridge the gap between creative industries and the rest, in this case, with e-commerce. They specialise in providing interior design services.
* The tech & software startups that also involve creative industries tend to be mobile gaming companies. Perhaps there are other ways to utilise tech in the creative domain.

## 3. Level 2 industry map in 3D
Improving the visuals of the industry map using another dimension, hopefully providing better readability and making patterns more obvious.

In [16]:
# pca to 3 components
pca = PCA(3)
# fit once and reuse the projection for all three coordinates
coords = pca.fit_transform(matrix[x_cols])
matrix['x'] = coords[:,0]
matrix['y'] = coords[:,1]
matrix['z'] = coords[:,2]

In [17]:
# slicing a cleaner table to vis with
df_d = matrix.copy()
df_d.columns = df_d.columns.droplevel(1)

df_d.drop([x for x in df_d.columns if x not in ['name','x','y','z','labels']], inplace=True, axis=1)

# retrieving same colours as before
df_d['color'] = colormap[df_d['labels']]

In [18]:
# vis using plotly as bokeh is not possible for 3d
import plotly as py
import plotly.graph_objs as go

py.offline.init_notebook_mode(connected=True)

trace1 = go.Scatter3d(
    x=df_d['x'],
    y=df_d['y'],
    z=df_d['z'],
    mode='markers',
    name=df_d['labels'],
    text=df_d['name'],
    marker=dict(
        size=12,
        color=df_d['color'],                  
        opacity=0.5
    )
)
    
    
data = [trace1]
layout = go.Layout(
    margin=dict(
        l=0,
        r=0,
        b=0,
        t=0
    ),
    scene=dict(
    xaxis=dict(
        showticklabels=False,
        title="",
        showline=False,
        zeroline=False,
        showgrid=False),
        
    yaxis=dict(
        showticklabels=False,
        title="",
        showline=False,
        zeroline=False,
        showgrid=False),
        
    zaxis=dict(
        showticklabels=False,
        title="",
        showline=False,
        zeroline=False,
        showgrid=False)
    
    )
)
fig = go.Figure(data=data, layout=layout)
py.offline.iplot(fig)

### Can we gain any useful insights from the extra dimension?
* It's the same chart as before, but you might notice the 3D model gives an interesting angle on some of the clusters.
* For example, turning the chart on its head shows that the pink-purple clusters are closer together and the orange-brown clusters sit at the back of the chart (using the view from the previous 2D version as a focal point).
* The financial (brown cluster) startups look even more closely related here. Perhaps, sitting between e-commerce and technology, financial solutions tend to involve similar industries.


##### Can we gauge startups operating in whitespace better?
* Remember, the experimental startups should be expected to be a lot more isolated from the others because of their unique combination of industries.
* Some startups start to look a lot more distanced from the pack than before, for example Snapze (orange) and iRehab (red).
* The new angle has really pulled out LightSense.
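
Since PCA coordinates are only an approximation, one alternative way to gauge isolation is to measure distances directly on the tags. The sketch below is an assumption of mine, not the method used in this notebook. It computes the Jaccard distance from each startup to its nearest neighbour on a hypothetical tag matrix, so the most isolated startups come out on top.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import pairwise_distances

# Hypothetical binary tag matrix (rows: startups, columns: tags).
tags = pd.DataFrame(
    [[1, 1, 0, 0, 0],
     [1, 1, 0, 0, 0],
     [1, 0, 1, 0, 0],
     [0, 0, 0, 1, 1]],
    index=["A", "B", "C", "D"],
    columns=["technology", "software", "data", "design", "waste management"],
)

# Jaccard distance: 1 - (shared tags / tags carried by either startup).
dist = pairwise_distances(tags.to_numpy(dtype=bool), metric="jaccard")

# Ignore each startup's zero distance to itself.
np.fill_diagonal(dist, np.inf)

# Distance to the closest other startup; larger means more isolated.
nearest = pd.Series(dist.min(axis=1), index=tags.index).sort_values(ascending=False)
print(nearest)
```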

### Evaluating the model

##### Intuitive limitations:
1. Visually, the level 2 charts look scattered and are less pleasing on the eye. There are 2 key reasons for that:
    - The way the industry tags were allocated to startups when building the raw data has brought a lot of startups closer together in the map.
    - K-means builds the clusters on the full set of tags, whereas the map shows only the first PCA components, so a startup can be located very close to another but be allocated to a different cluster.

2. Some niche industries don't get picked out very well. For example, there are actually a few startups working in the film industry, but they got swallowed up into the 'creative' cluster. This might not necessarily be wrong, but it means there's some detail we're not capturing here.

3. Some important industries don't get picked out well because they are used evenly across a few clusters. For example, startups who work heavily in 'AI' and 'data' aren't so obvious in this map because they are actually spread across some much larger industries, like 'e-commerce', 'technology' and 'creative'. This means it's difficult to visually look for those types of startups without adding a manual layer of analysis.

4. Some startups are placed in very intuitive clusters, but others land in a completely unexpected one, showing the difficulty of trying to match business logic with that of an algorithm. A perfect example of this is Dashroad, who make car sensors that capture data for fleet management, such as route and fuel consumption. They have been tagged with the industries 'hardware', 'IoT', 'technology', 'data' and 'auto', so logically you'd expect them to be in the 'tech: hardware & IoT' cluster with other hardware startups. Instead, the algorithm put them in a cluster with mainly software companies, possibly seeing a connection between them through the 'data' industry tag. 
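
Borderline allocations like this could be flagged automatically. The sketch below is an assumption rather than part of the original analysis, and it runs on a randomly generated tag matrix. It uses `KMeans.transform` to get each startup's distance to every cluster centre, then lists the startups whose nearest and second-nearest centres are almost equally far away.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in for the tag matrix: 50 startups x 10 tags.
rng = np.random.default_rng(0)
tags = (rng.random((50, 10)) < 0.3).astype(float)

model = KMeans(n_clusters=4, n_init=10, random_state=4).fit(tags)

# Distance from every startup to every cluster centre.
dist_to_centres = model.transform(tags)

# Gap between the nearest and the second-nearest centre for each startup.
ordered = np.sort(dist_to_centres, axis=1)
margin = ordered[:, 1] - ordered[:, 0]

# Row positions of the 5 startups with the smallest gap.
borderline = np.argsort(margin)[:5]
print(borderline, margin[borderline].round(3))
```

A small margin means a startup could plausibly belong to either cluster, so those are the allocations worth reviewing by hand.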