In [2]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt

In [3]:
df = pd.read_csv("startups.csv")

In [4]:
df.head(10)

Unnamed: 0,name,city,tagline,description
0,Campus Bubble,Atlanta,Your Academic Identity,Campus Bubble (“CB”) is the Academic Community...
1,DueProps,Atlanta,Gamifying the $46 Billion Employee Incentives ...,t unprecedented ...
2,SalesLoft,Atlanta,Quickly build high-quality prospect lists,build high-quality prospect lists\n
3,The Coca-Cola Company,Atlanta,,Coca-Cola Journey is a digital magazine that f...
4,EarthLink,Atlanta,,
5,REscour,Atlanta,Market intelligence and analytics for commerci...,REscour is a data platform and decision engine...
6,viaCycle,Atlanta,"Zipcar for bicycles. Call or text, unlock, and...",viaCycle creates bicycle sharing technology th...
7,Seraph Group,Atlanta,,
8,Kabbage,Atlanta,,Kabbage delivers small businesses financing. B...
9,Usable Health,Atlanta,Menu-personalization software to attract regulars,SmartMenus are web-based ordering kiosks and t...


In [5]:
df.isnull().sum()

name              2
city              0
tagline        5307
description    4863
dtype: int64

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42038 entries, 0 to 42037
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   name         42036 non-null  object
 1   city         42038 non-null  object
 2   tagline      36731 non-null  object
 3   description  37175 non-null  object
dtypes: object(4)
memory usage: 1.3+ MB


We use this dataset for its start-up descriptions.
As `df.info()` shows, the description column has 37,175 non-null values out of 42,038 rows, so 4,863 rows (about 11.6%) have no description. Because the description is the text we analyze, we drop the rows where the name or the description is null.
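The share of missing values per column can be worked out directly from the counts reported by `df.info()` above. The snippet below is a minimal sketch that hard-codes those counts instead of loading the CSV; the variable names are illustrative only.

```python
# Row count and non-null counts copied from the df.info() output above
total_rows = 42038
non_null = {"name": 42036, "city": 42038, "tagline": 36731, "description": 37175}

# Fraction of rows in which each column is missing
missing_share = {col: (total_rows - n) / total_rows for col, n in non_null.items()}

for col, share in missing_share.items():
    print(f"{col:<12} {share:.1%}")
```

With the DataFrame loaded, `df.isnull().mean()` would give the same fractions.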

In [7]:
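# Take an explicit copy so later column assignments modify df1 itself rather than a slice of df
df1 = df.dropna(subset=['name', 'description']).copy()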

In [8]:
df1.isnull().sum()

name              0
city              0
tagline        1137
description       0
dtype: int64

Now that the two most important columns (name and description) contain no null values, we can apply basic Natural Language Processing (NLP) steps to clean the text data.
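As a small illustration of the first cleaning step, the sketch below drops null rows and strips punctuation on a hypothetical three-row frame (the data is made up for the example). It takes an explicit `.copy()` after `dropna` and passes `regex=True`, since recent pandas versions treat the pattern as a literal string by default.

```python
import pandas as pd

# Hypothetical toy data standing in for startups.csv
raw = pd.DataFrame({
    "name": ["Alpha", None, "Gamma"],
    "description": ["high-quality lists, built fast!", "no name here", None],
})

# Keep only rows that have both a name and a description, as an independent copy
clean = raw.dropna(subset=["name", "description"]).copy()

# Remove every character that is neither a word character nor whitespace
clean["description"] = clean["description"].str.replace(r"[^\w\s]", "", regex=True)

print(clean)
```

Note that hyphenated words are fused ("high-quality" becomes "highquality"), which matches the behaviour visible in the notebook's output.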

In [9]:
# Punctuation characters (not used below; the regex in the next cell strips punctuation instead)
punc = r'''!()-[]{};:'"\,<>./?@#$%^&*_~'''

In [10]:
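# Remove every character that is not a word character or whitespace.
# regex=True is passed explicitly because pandas 2.0+ treats the pattern as a literal string by default.
df1['description'] = df1['description'].str.replace(r'[^\w\s]', '', regex=True)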



In [11]:
df1

Unnamed: 0,name,city,tagline,description
0,Campus Bubble,Atlanta,Your Academic Identity,Campus Bubble CB is the Academic Community Net...
1,DueProps,Atlanta,Gamifying the $46 Billion Employee Incentives ...,t unprecedented
2,SalesLoft,Atlanta,Quickly build high-quality prospect lists,build highquality prospect lists\n
3,The Coca-Cola Company,Atlanta,,CocaCola Journey is a digital magazine that fo...
5,REscour,Atlanta,Market intelligence and analytics for commerci...,REscour is a data platform and decision engine...
...,...,...,...,...
42031,Weekend Package,Washington DC,Worldwide Hotel Deals Online,is modest
42032,LogiAnalytics.com,Washington DC,,Logi Info embeds interactive visualizations an...
42035,UP Technologies,Washington DC,Ultra Portable Commercial Hardware Solutions,Ultra Portable Technologies UP Technologies is...
42036,Galaxie restaurant and bar,Washington DC,"concept of a unique, modern restaurant",Concept of a modern 10 15 000 sf restaurant wi...


In [12]:
import nltk
from nltk.corpus import stopwords

In [13]:
nltk.download('stopwords')
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\nhoan\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [14]:
stop = stopwords.words('english')

In [15]:
df1['new_des'] = df1['description'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))



In [16]:
df1.head(10)

Unnamed: 0,name,city,tagline,description,new_des
0,Campus Bubble,Atlanta,Your Academic Identity,Campus Bubble CB is the Academic Community Net...,Campus Bubble CB Academic Community Network Li...
1,DueProps,Atlanta,Gamifying the $46 Billion Employee Incentives ...,t unprecedented,unprecedented
2,SalesLoft,Atlanta,Quickly build high-quality prospect lists,build highquality prospect lists\n,build highquality prospect lists
3,The Coca-Cola Company,Atlanta,,CocaCola Journey is a digital magazine that fo...,CocaCola Journey digital magazine focuses impo...
5,REscour,Atlanta,Market intelligence and analytics for commerci...,REscour is a data platform and decision engine...,REscour data platform decision engine utilizes...
6,viaCycle,Atlanta,"Zipcar for bicycles. Call or text, unlock, and...",viaCycle creates bicycle sharing technology th...,viaCycle creates bicycle sharing technology fl...
8,Kabbage,Atlanta,,Kabbage delivers small businesses financing Bo...,Kabbage delivers small businesses financing Bo...
9,Usable Health,Atlanta,Menu-personalization software to attract regulars,SmartMenus are webbased ordering kiosks and ta...,SmartMenus webbased ordering kiosks tablets se...
11,OpenStudy,Atlanta,Social Learning for Open Courses,rovides realtime study communities for over 40...,rovides realtime study communities 40 OpenCour...
12,We&Co_,Atlanta,People Analytics for Hospitality,WeCo provides People Analytics to the Hospital...,WeCo provides People Analytics Hospitality ind...


In [17]:
# Make all the words in the description lowercase
df1['new_des'] = df1['new_des'].str.lower()



In [18]:
df1.head(10)

Unnamed: 0,name,city,tagline,description,new_des
0,Campus Bubble,Atlanta,Your Academic Identity,Campus Bubble CB is the Academic Community Net...,campus bubble cb academic community network li...
1,DueProps,Atlanta,Gamifying the $46 Billion Employee Incentives ...,t unprecedented,unprecedented
2,SalesLoft,Atlanta,Quickly build high-quality prospect lists,build highquality prospect lists\n,build highquality prospect lists
3,The Coca-Cola Company,Atlanta,,CocaCola Journey is a digital magazine that fo...,cocacola journey digital magazine focuses impo...
5,REscour,Atlanta,Market intelligence and analytics for commerci...,REscour is a data platform and decision engine...,rescour data platform decision engine utilizes...
6,viaCycle,Atlanta,"Zipcar for bicycles. Call or text, unlock, and...",viaCycle creates bicycle sharing technology th...,viacycle creates bicycle sharing technology fl...
8,Kabbage,Atlanta,,Kabbage delivers small businesses financing Bo...,kabbage delivers small businesses financing bo...
9,Usable Health,Atlanta,Menu-personalization software to attract regulars,SmartMenus are webbased ordering kiosks and ta...,smartmenus webbased ordering kiosks tablets se...
11,OpenStudy,Atlanta,Social Learning for Open Courses,rovides realtime study communities for over 40...,rovides realtime study communities 40 opencour...
12,We&Co_,Atlanta,People Analytics for Hospitality,WeCo provides People Analytics to the Hospital...,weco provides people analytics hospitality ind...


# REPEAT THE PROCESS WITH THE INVESTOR DATASET

In [23]:
investor = pd.read_csv('investor_data.csv')
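# Take an explicit copy so later column assignments modify investors itself rather than a slice of investor
investors = investor.head(20).copy()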

In [24]:
investors

Unnamed: 0,Organization Name,Description,Number of Investments
0,Alumni Ventures Group,Alumni Ventures Group provides diversified ven...,835
1,Tech Coast Angels,Invests in high-growth start-up's headquartere...,567
2,Ontario Centres of Excellence,The Ontario Centres of Excellence invests in p...,433
3,StartEngine,StartEngine is an equity crowdfunding platform...,349
4,SFC Capital,Early-stage investor combining its own angel s...,331
5,Keiretsu Forum,Keiretsu Forum is a California-based angel gro...,291
6,Sand Hill Angels,"Sand Hill Angels, a venture capital firm, inve...",202
7,Alliance of Angels,We are a group of 140+ active angel investors....,185
8,VA Angels,"VA Angels, also known as VA Angels, is an Ange...",184
9,Houston Angel Network,Houston Angel Network provides capital to earl...,179


In [27]:
investors.info() #no null value

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 3 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   Organization Name      20 non-null     object
 1   Description            20 non-null     object
 2   Number of Investments  20 non-null     int64 
dtypes: int64(1), object(2)
memory usage: 608.0+ bytes


In [28]:
# Punctuation characters (not used below; the regex in the next cell strips punctuation instead)
punc = r'''!()-[]{};:'"\,<>./?@#$%^&*_~'''

In [29]:
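# Remove every character that is not a word character or whitespace (explicit regex=True for pandas 2.0+)
investors['Description'] = investors['Description'].str.replace(r'[^\w\s]', '', regex=True)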



In [30]:
investors['new_des'] = investors['Description'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))



In [33]:
investors['new_des'] = investors['new_des'].str.lower()



In [34]:
investors

Unnamed: 0,Organization Name,Description,Number of Investments,new_des
0,Alumni Ventures Group,Alumni Ventures Group provides diversified ven...,835,alumni ventures group provides diversified ven...
1,Tech Coast Angels,Invests in highgrowth startups headquartered i...,567,invests highgrowth startups headquartered sout...
2,Ontario Centres of Excellence,The Ontario Centres of Excellence invests in p...,433,the ontario centres excellence invests project...
3,StartEngine,StartEngine is an equity crowdfunding platform...,349,startengine equity crowdfunding platform allow...
4,SFC Capital,Earlystage investor combining its own angel sy...,331,earlystage investor combining angel syndicate ...
5,Keiretsu Forum,Keiretsu Forum is a Californiabased angel grou...,291,keiretsu forum californiabased angel group off...
6,Sand Hill Angels,Sand Hill Angels a venture capital firm invest...,202,sand hill angels venture capital firm invests ...
7,Alliance of Angels,We are a group of 140 active angel investors E...,185,we group 140 active angel investors each year ...
8,VA Angels,VA Angels also known as VA Angels is an Angel ...,184,va angels also known va angels angel investor ...
9,Houston Angel Network,Houston Angel Network provides capital to earl...,179,houston angel network provides capital earlyst...
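In the table above, words such as "the" and "we" survive in `new_des` (rows 2 and 7). NLTK's English stop-word list is all lowercase, so capitalized forms like "The" and "We" are not matched when the filter runs before lowercasing. A possible fix is to lowercase first and then filter. The sketch below shows that ordering in one reusable function; `clean_text` and the tiny `STOP_WORDS` set are hypothetical stand-ins (the notebook uses `stopwords.words('english')`).

```python
import re

# Tiny illustrative stop-word set; an assumption standing in for NLTK's English list
STOP_WORDS = frozenset({"the", "is", "a", "an", "of", "and", "to", "in", "we", "are", "for"})

def clean_text(text, stop_words=STOP_WORDS):
    """Strip punctuation, lowercase, then drop stop words (in that order)."""
    text = re.sub(r"[^\w\s]", "", text)   # remove punctuation
    words = text.lower().split()          # lowercase BEFORE filtering
    return " ".join(w for w in words if w not in stop_words)

print(clean_text("The Ontario Centres of Excellence invests in projects."))
# prints: ontario centres excellence invests projects
```

The same function could be applied to both datasets, for example `df1['description'].apply(clean_text)`, so that the start-up and investor text are cleaned identically.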


In [35]:
investors.to_csv('investors.csv')

# HEAT MAP FOR THE CORRELATION AMONG START-UPS

This algorithm supports one of the app's features: the "start-ups for you" or "investors that you might be interested in" section.
To make the data easier to visualize, we use the first 20 start-ups as an example to show how their descriptions correlate and how the algorithm works on our dataset.
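The notebook does not show how the correlation between start-ups is computed. One common choice, assumed here purely for illustration, is cosine similarity between word-count vectors of the cleaned descriptions. The sketch below builds such a similarity matrix for three hypothetical descriptions using only the standard library; the resulting matrix is what a heat map would display.

```python
import math
from collections import Counter

# Hypothetical cleaned descriptions (not taken from the dataset)
descriptions = {
    "A": "bicycle sharing technology smart bikes",
    "B": "smart bikes unlock cell phone",
    "C": "small businesses financing",
}

def cosine(a, b):
    """Cosine similarity between two word-count Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# One word-count vector per start-up
counts = {name: Counter(text.split()) for name, text in descriptions.items()}
names = list(counts)

# Square matrix: similarity[i][j] compares names[i] with names[j]
similarity = [[cosine(counts[r], counts[c]) for c in names] for r in names]

for name, row in zip(names, similarity):
    print(name, [round(v, 2) for v in row])
```

Start-ups that share vocabulary (A and B here) get a score above zero, while unrelated ones (A and C) score zero.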

In [34]:
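up = df1.head(20)  # assumption: `up` is not defined in the cells shown; taken to be the first 20 cleaned start-ups
dictionary = dict(zip(up['name'], up['new_des']))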
dictionary

{'Campus Bubble': 'campus bubble cb academic community network linkedin professional community network facebook social community network cb provides academic institutions student powered cross platform private online community focused',
 'DueProps': 'unprecedented',
 'SalesLoft': 'build highquality prospect lists',
 'The Coca-Cola Company': 'cocacola journey digital magazine focuses important topics social causes news the cocacola company',
 'REscour': 'rescour data platform decision engine utilizes proprietary market analysis based massive data aggregation identify commercial real estate investment opportunities rescour onesizefitsall communicates custom tailored recommendations',
 'viaCycle': 'viacycle creates bicycle sharing technology flexible inexpensive easy use users unlock smart bikes instantly using cell phone operators place bikes anywhere kiosks special infrastructure required we enable bike',
 'Kabbage': 'kabbage delivers small businesses financing both ecommerce brickandmo

In [21]:
ids = list(dictionary.keys())

In other words, 269 distinct words are used to describe the 20 start-ups.
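The cell that produced the count of 269 is not shown. As a hedged sketch of how such a vocabulary size could be obtained, the snippet below collects the distinct words across three of the cleaned descriptions printed above (the `sample` dictionary is a small excerpt, so its count differs from 269).

```python
# Three entries copied from the `dictionary` output above
sample = {
    "DueProps": "unprecedented",
    "SalesLoft": "build highquality prospect lists",
    "The Coca-Cola Company": (
        "cocacola journey digital magazine focuses important topics "
        "social causes news the cocacola company"
    ),
}

# Distinct words across all descriptions; repeated words count once
vocabulary = sorted({word for text in sample.values() for word in text.split()})

print(len(vocabulary))
```

Each distinct word would become one column of a word-count matrix, with one row per start-up.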