# Personal Information
Name: **Nilansha Dargan**

StudentID: **13130366**

Email: [**nilansha.dargan@student.uva.nl**](mailto:nilansha.dargan@student.uva.nl)

Submitted on: **22.03.2024**

# Data Context
**Our research explores the feasibility of classifying startup companies along multiple dimensions for investment purposes, specifically in a zero-shot setting. The investigation is motivated by our use case: Azimut Zero, a due diligence startup specializing in evaluating deep tech startups for investment viability. Most of this notebook performs exploratory data analysis on data samples provided by the company, each containing information about an AI start-up. The samples comprise two main types of textual data: PDF files of slide decks and Excel spreadsheets containing questionnaire responses. Samples often also include a Word document. Our analysis covers two samples of this textual data.**

**In addition, we are developing an evaluation dataset based on the Y Combinator Directory, available on Kaggle (link: [Y Combinator Directory](https://www.kaggle.com/datasets/miguelcorraljr/y-combinator-directory?select=2023-07-13-yc-companies.csv)). It will be built by scraping the homepages of the company websites and classifying the companies along three dimensions. As this dataset is still under construction, only preliminary EDA has been conducted on it.**
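
The scraping step is not implemented in this notebook. As a rough illustration only, the sketch below shows one way the visible text of a homepage could be extracted once its HTML has been downloaded. It uses Python's standard `html.parser`; the names `VisibleTextExtractor`, `homepage_to_text`, and `sample_html` are hypothetical and not part of the project code.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped tag
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def homepage_to_text(html):
    """Returns the visible text of an HTML page as a single string."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Made-up HTML standing in for a downloaded company homepage
sample_html = (
    "<html><head><title>Acme AI</title><style>p {color: red}</style></head>"
    "<body><h1>Acme AI</h1><p>Embeddings as a service.</p>"
    "<script>track()</script></body></html>"
)
print(homepage_to_text(sample_html))
```

Downloading the pages, handling redirects, and respecting each site's terms are left out here and would need to be decided separately.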

# Data Description

**As mentioned above, this research deals with textual data. Three analyses were performed on the company samples, followed by a preliminary look at the Y Combinator data.**

***Analysis 1: Word Count Analysis***

**The initial experiment quantified the total number of words in each data sample. The first data sample contains 4,386 words and the second 6,638 words. With only two samples, these findings tentatively suggest a range of roughly 4,000 to 7,000 words for similar data samples. The self-generated dataset from YC is expected to have a lower word count.**


***Analysis 2: Word Frequency Analysis***

**The second experiment aimed to identify the 10 most frequently occurring words within each data sample, under two conditions: with and without the removal of stopwords. The findings are as follows:**

**Data Sample 1:**

**With stopwords: [('the', 216), ('of', 147), ('and', 106), ('our', 88), ('to', 88), ('a', 76), ('is', 60), ('in', 60), ('for', 58), ('we', 57)]**

**Without stopwords: [('data', 54), ('machine', 43), ('model', 35), ('company', 26), ('ai', 26), ('maintenance', 23), ('training', 21), ('system', 20), ('models', 17), ('process', 16)]**


**Data Sample 2:**

**With stopwords: [('the', 340), ('to', 179), ('and', 158), ('of', 143), ('is', 136), ('our', 117), ('a', 110), ('in', 100), ('for', 100), ('we', 92)]**

**Without stopwords: [('data', 48), ('models', 47), ('product', 41), ('new', 38), ('detection', 33), ('company', 32), ('per', 31), ('security', 27), ('streams', 25), ('model', 22)]**

**This experiment shows that stopwords dominate the word frequencies unless they are removed.**

***Analysis 3: Cosine Similarity Analysis***

**The third analysis computed the cosine similarity between the combined text from all data sources of a sample and a set of predefined dimensions relevant to the research. Because the texts are represented as TF-IDF vectors, the score reflects lexical overlap rather than deeper semantic similarity. Building on the previous experiment, the scores were also computed on text with stopwords removed. Similarity scores rose for every dimension in both samples (by a factor of about 2.7 for sample 1 and about 3.2 for sample 2), which shows how much this data cleaning step matters for this research.**
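
One plausible reading of this increase: removing stopwords leaves the overlap with the dimension labels unchanged but shrinks the norm of the document vector, so the cosine goes up. The sketch below illustrates the effect on a toy sentence. It is a simplification that uses raw word counts instead of TF-IDF weights, and `TOY_STOPWORDS`, `toy_text`, and `toy_label` are made up for the illustration (the notebook itself uses NLTK's English stopword list).

```python
import math
from collections import Counter

# Tiny illustrative stopword list (an assumption, not NLTK's list)
TOY_STOPWORDS = {"the", "of", "and", "our", "to", "a", "is", "in", "for", "we"}


def tokenize(text):
    """Lowercases the text and splits it on whitespace."""
    return text.lower().split()


def cosine(counts_a, counts_b):
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(counts_a[word] * counts_b[word] for word in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity(text, label, drop_stopwords=False):
    """Cosine similarity of text and label, optionally without stopwords."""
    tokens = tokenize(text)
    if drop_stopwords:
        tokens = [t for t in tokens if t not in TOY_STOPWORDS]
    return cosine(Counter(tokens), Counter(tokenize(label)))


toy_text = ("we store the data in the cloud and the data is used "
            "for the training of our model")
toy_label = "data strategy"

with_sw = similarity(toy_text, toy_label)
without_sw = similarity(toy_text, toy_label, drop_stopwords=True)
print(round(with_sw, 3), round(without_sw, 3))
```

In this toy case the dot product with the label is the same in both runs; only the length of the document vector changes.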

***Preliminary Dataset Formation Analysis***

**The last analysis covers the first steps towards developing a dataset from the Y Combinator (YC) directory. Of the 8,411 startup companies in the directory, 599 (about 7%) are tagged as Artificial Intelligence (AI) companies.**

In [1]:
!pip install python-docx PyPDF2

Collecting python-docx
  Obtaining dependency information for python-docx from https://files.pythonhosted.org/packages/5f/d8/6948f7ac00edf74bfa52b3c5e3073df20284bec1db466d13e668fe991707/python_docx-1.1.0-py3-none-any.whl.metadata
  Downloading python_docx-1.1.0-py3-none-any.whl.metadata (2.0 kB)
Collecting PyPDF2
  Obtaining dependency information for PyPDF2 from https://files.pythonhosted.org/packages/8e/5e/c86a5643653825d3c913719e788e41386bee415c2b87b4f955432f2de6b2/pypdf2-3.0.1-py3-none-any.whl.metadata
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Downloading python_docx-1.1.0-py3-none-any.whl (239 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: python-docx, PyPDF2
Successfully ins

In [2]:
# Imports
import os
import numpy as np
import pandas as pd
import docx
import PyPDF2
import re
from collections import Counter
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import ast
from nltk.tokenize import word_tokenize
nltk.download('punkt')


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/nilanshadargan/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/nilanshadargan/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

### Data Loading

In [8]:
data_1_pdf = 'Data/Data sample 1/ai deck.pdf'
data_1_excel = 'Data/Data sample 1/data sample 1.xlsx'
data_2_pdf = '/content/20240102_AzimutZer0_Slides.pdf'
data_2_pdf_2 = '/content/Q10. Models Finetuning Process.pdf'
data_2_pdf_3 = '/content/Q6. Quality Assurance overview.pdf'
data_2_excel = '/content/data sample 2.xlsx'
data_2_doc_1 = '/content/Q17. 20240102 Update to team_ch6.docx'
data_2_doc_2 = '/content/Q27. Procedure Security Incidents.docx'
ycd_data = '/content/2023-07-13-yc-companies.csv'


### Data Reading

In [9]:
def read_excel_file(file_path, columns=None):
    """Reads specified columns from an Excel file into a DataFrame."""
    try:
        return pd.read_excel(file_path, usecols=columns)
    except Exception as e:
        # Raise instead of returning the message, so that a failed read
        # is never mistaken for document text by the later analyses.
        raise RuntimeError(f"Error reading {file_path}: {e}") from e

def read_word_file(file_path):
    """Reads the paragraph text of a Word document."""
    try:
        doc = docx.Document(file_path)
        return '\n'.join(para.text for para in doc.paragraphs)
    except Exception as e:
        raise RuntimeError(f"Error reading {file_path}: {e}") from e

def read_pdf_file(file_path):
    """Reads the text of a PDF file using PyPDF2 version 3.0.0 or later."""
    try:
        with open(file_path, 'rb') as file:
            pdf_reader = PyPDF2.PdfReader(file)
            # Guard against pages without extractable text
            text = [page.extract_text() or '' for page in pdf_reader.pages]
            return '\n'.join(text)
    except Exception as e:
        raise RuntimeError(f"Error reading {file_path}: {e}") from e

In [10]:
#Data Sample 1
data_1_pdf_data = read_pdf_file(data_1_pdf)
data_1_excel_data = read_excel_file(data_1_excel)

#Data Sample 2
data_2_pdf_data = read_pdf_file(data_2_pdf)
data_2_excel_data = read_excel_file(data_2_excel)
data_2_pdf_2_data = read_pdf_file(data_2_pdf_2)
data_2_pdf_3_data = read_pdf_file(data_2_pdf_3)
data_2_doc_1_data = read_word_file(data_2_doc_1)
data_2_doc_2_data = read_word_file(data_2_doc_2)


### Data Cleaning

Excel files: drop non-useful columns

In [11]:
data_1_excel_data

| Unnamed: 0 | Question | Comment | Your answer |
|---|---|---|---|
| 0 | What is the estimated TRL level of your product? | See https://ec.europa.eu/research/participants... | The current Technology Readiness Level (TRL) o... |
| 1 | To what extent do you build your algorithms in... | I.e. fully built in-house vs. based on 3rd par... | Our models have been developed and assessed en... |
| 2 | Can you give an overview of the relevant data ... | Describe your datasets and any issues/gaps it ... | We maintain an extensive library of recordings... |
| 3 | Do you own the relevant data? | Why (not)? If you don't own it, who does? | We hold exclusive ownership of all data we hav... |
| 4 | Can you briefly describe your data processing ... |  | Our machine learning (ML) approach for the pro... |
| 5 | How do you measure performance of your AI models? | Which metrics do you use and why are they adeq... | We have developed an advanced machine learning... |
| 6 | Can you briefly describe your (AI/ML) product ... | Concerning design choices and considerations, ... | We are having many brainstorming discussions, ... |
| 7 | Can you briefly describe your IP generation & ... | I.e. patents, scientific publications, FTO... | our company holds full ownership of its intell... |
| 8 | Can you give an overview of the (AI/ML) techni... |  | Our team consists of four highly skilled indiv... |
| 9 | Can you provide your technology and/or product... | Alternatively, point us to its place in the do... | Provided as part of the "02 Technology Overvie... |


In [12]:
data_1_excel_data.drop(['Question', 'Comment'], axis=1, inplace=True)

In [13]:
data_1_excel_data

| Unnamed: 0 | Your answer |
|---|---|
| 0 | The current Technology Readiness Level (TRL) o... |
| 1 | Our models have been developed and assessed en... |
| 2 | We maintain an extensive library of recordings... |
| 3 | We hold exclusive ownership of all data we hav... |
| 4 | Our machine learning (ML) approach for the pro... |
| 5 | We have developed an advanced machine learning... |
| 6 | We are having many brainstorming discussions, ... |
| 7 | our company holds full ownership of its intell... |
| 8 | Our team consists of four highly skilled indiv... |
| 9 | Provided as part of the "02 Technology Overvie... |


In [None]:
data_2_excel_data.drop(['Question', 'Comment'], axis=1, inplace=True)

PDFs: Punctuation and '\n' Removal

In [None]:
def clean_text(input_text):
    # Replace newline (\n) characters with spaces
    cleaned_text = input_text.replace('\n', ' ')

    # Remove punctuations
    cleaned_text = re.sub(r'[^\w\s]', '', cleaned_text)

    return cleaned_text
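
One side effect of deleting punctuation outright is that tokens joined by a slash or hyphen are merged (for example "and/or" becomes "andor"). The sketch below shows a possible variant that substitutes a space instead. `clean_text_keep_boundaries` is a hypothetical helper, not part of the notebook, and switching to it would change the word counts reported in this notebook.

```python
import re


def clean_text_keep_boundaries(input_text):
    """Variant of clean_text that replaces punctuation with a space."""
    # Replace newlines and punctuation with spaces, so that joined tokens
    # such as "and/or" stay two words instead of becoming "andor".
    text = input_text.replace("\n", " ")
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse the runs of whitespace created by the substitutions
    return re.sub(r"\s+", " ", text).strip()


sample = "viruses and/or malware,\ninformation-processing systems"

# Behaviour of the original clean_text: punctuation is deleted
print(re.sub(r"[^\w\s]", "", sample.replace("\n", " ")))

# Behaviour of the variant: punctuation becomes a word boundary
print(clean_text_keep_boundaries(sample))
```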


In [None]:
data_1_pdf_data = clean_text(data_1_pdf_data)
data_1_pdf_data



In [None]:
data_2_pdf_data = clean_text(data_2_pdf_data)
data_2_pdf_2_data = clean_text(data_2_pdf_2_data)
data_2_pdf_3_data = clean_text(data_2_pdf_3_data)

Docs: Punctuation Removal

In [None]:
data_2_doc_1_data = clean_text(data_2_doc_1_data)
data_2_doc_2_data = clean_text(data_2_doc_2_data)


In [None]:
data_2_doc_2_data

'         Procedure Security Incidents   \t Introduction This document describes how The Company defines security incidents and how these incidents are being handled What is a security incident A security incident occurs when an event occurs where there is the possibility that the confidentiality integrity or availability of information or informationprocessing systems is or could be endangered Some examples of security incidents are viruses andor malware infections or attempts to gain unauthorized access to information or systems What is a data breach A security incident can concern a data breach There is a data breach if there is an infringement on the protection of personal data Not only the release or leakage of personal data results in a data breach but also when unlawful data is processed Reporting a security incident Every employee of The Company is authorized to report a security incident This is done by the companys Teams channel security which is monitored by the Security Off

### Analysis 1: Word Count

Data sample 1

In [None]:
def count_words(text):
    """Counts the number of words in the provided text."""
    words = text.split()
    return len(words)

In [None]:
# Join all rows of the first DataFrame column into a single string per sample
data_1_excel_combined_text = ' '.join(data_1_excel_data.iloc[:, 0].astype(str))
data_2_excel_combined_text = ' '.join(data_2_excel_data.iloc[:, 0].astype(str))

In [None]:
#Word count in first data sample
cw_1_1 = count_words(data_1_pdf_data)
cw_1_2 = count_words(data_1_excel_combined_text)
print(cw_1_1 + cw_1_2 )

4386


In [None]:
#Word count in second data sample
cw_2_1 = count_words(data_2_pdf_data)
cw_2_2 = count_words(data_2_pdf_2_data)
cw_2_3 = count_words(data_2_pdf_3_data)
cw_2_4 = count_words(data_2_excel_combined_text)
cw_2_5 = count_words(data_2_doc_1_data)
cw_2_6 = count_words(data_2_doc_2_data)
print(cw_2_1 + cw_2_2 + cw_2_3 + cw_2_4 + cw_2_5 + cw_2_6)

6638


### Analysis 2: Word Frequency (10 most frequent words)

In [None]:
def top_10_frequent_words(text, remove_stopwords=False):
    # Lowercase the text and extract word tokens (punctuation is dropped)
    words = re.findall(r'\b\w+\b', text.lower())

    if remove_stopwords:
        # Filter out English stop words before counting
        sw = set(stopwords.words('english'))
        words = [word for word in words if word not in sw]

    # Return the 10 most common words with their counts
    return Counter(words).most_common(10)


10 most frequent words in Data sample 1 before and after stop words are dropped:

In [None]:
joint_text_1 = ' '.join([data_1_pdf_data,data_1_excel_combined_text ])
print(top_10_frequent_words(joint_text_1, remove_stopwords=False))
print(top_10_frequent_words(joint_text_1, remove_stopwords=True))

[('the', 216), ('of', 147), ('and', 106), ('our', 88), ('to', 88), ('a', 76), ('is', 60), ('in', 60), ('for', 58), ('we', 57)]
[('data', 54), ('machine', 43), ('model', 35), ('company', 26), ('ai', 26), ('maintenance', 23), ('training', 21), ('system', 20), ('models', 17), ('process', 16)]


10 most frequent words in Data sample 2 before and after stop words are dropped:

In [None]:
joint_text_2 = ' '.join([data_2_pdf_data, data_2_pdf_2_data, data_2_pdf_3_data, data_2_excel_combined_text, data_2_doc_1_data,  data_2_doc_2_data ])
print(top_10_frequent_words(joint_text_2, remove_stopwords=False))
print(top_10_frequent_words(joint_text_2, remove_stopwords=True))

[('the', 340), ('to', 179), ('and', 158), ('of', 143), ('is', 136), ('our', 117), ('a', 110), ('in', 100), ('for', 100), ('we', 92)]
[('data', 48), ('models', 47), ('product', 41), ('new', 38), ('detection', 33), ('company', 32), ('per', 31), ('security', 27), ('streams', 25), ('model', 22)]


### Analysis 3: Cosine Similarity of text and dimensions

In [None]:
def calculate_cosine_similarity(text, list_of_dimensions):
    # Combine the list items and the text into a single list
    documents = list_of_dimensions + [text]

    # Vectorize the documents
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(documents)

    # Calculate the cosine similarity
    cosine_similarities = cosine_similarity(tfidf_matrix[-1], tfidf_matrix[:-1])

    # Flatten to a 1D array and return
    return cosine_similarities.flatten()


In [None]:
def remove_stop_words(text):
    # Tokenize
    words = word_tokenize(text)

    # stop words
    stop_words = set(stopwords.words('english'))

    # Remove the stop words
    filtered_words = [word for word in words if word.lower() not in stop_words]

    # Join the words back into a string
    filtered_text = ' '.join(filtered_words)

    return filtered_text

In [None]:
list_of_dimensions = ['Dataset size & quality', 'AI product roadmap', 'AI & data strategy']

Data sample 1

In [None]:
print(calculate_cosine_similarity(joint_text_1, list_of_dimensions))

[0.01349922 0.02475472 0.09397015]


In [None]:
print(calculate_cosine_similarity(remove_stop_words(joint_text_1), list_of_dimensions))

[0.03700716 0.06786334 0.25761256]


Data sample 2

In [None]:
print(calculate_cosine_similarity(joint_text_2, list_of_dimensions))

[0.01238437 0.05125239 0.05558784]


In [None]:
print(calculate_cosine_similarity(remove_stop_words(joint_text_2), list_of_dimensions))

[0.04013975 0.16611739 0.18016927]
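
One possible way to read these scores as a zero-shot prediction is to pick the dimension with the highest similarity. The sketch below applies that rule to the stopword-free scores printed above. The argmax rule and the helper `best_dimension` are assumptions made for illustration, not a method established in this notebook.

```python
list_of_dimensions = ['Dataset size & quality', 'AI product roadmap', 'AI & data strategy']

# Stopword-free cosine scores as printed above for each data sample
reported_scores = {
    'Data sample 1': [0.03700716, 0.06786334, 0.25761256],
    'Data sample 2': [0.04013975, 0.16611739, 0.18016927],
}


def best_dimension(scores, dimensions):
    """Returns the (dimension, score) pair with the highest score."""
    return max(zip(dimensions, scores), key=lambda pair: pair[1])


for sample_name, scores in reported_scores.items():
    dimension, score = best_dimension(scores, list_of_dimensions)
    print(f"{sample_name}: {dimension} ({score:.3f})")
```

For data sample 2 the two highest scores are close together (0.166 versus 0.180), so a prediction made this way would be far less clear-cut than for data sample 1.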


### Analysis 4: Number of AI start-ups in Y Combinator Directory

In [None]:
#reading csv file
ycd = pd.read_csv(ycd_data)

In [None]:
#Change column from string to list of strings
ycd['tags'] = ycd['tags'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) and x.startswith('[') else x)
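
The parsing step above and the membership test that follows can be illustrated on a few made-up rows without pandas. `raw_tags` and `parse_tags` are hypothetical; unlike the lambda above, this variant turns anything that is not a stringified list into an empty list, which keeps the later membership test simple.

```python
import ast

# Made-up values mimicking the `tags` column of the YC CSV
raw_tags = [
    "['Artificial Intelligence', 'Developer Tools']",
    "['Fintech']",
    "[]",
    None,  # missing value
]


def parse_tags(value):
    """Parses a stringified list; anything else becomes an empty list."""
    if isinstance(value, str) and value.startswith('['):
        return ast.literal_eval(value)
    return []


parsed = [parse_tags(value) for value in raw_tags]

# Flag rows whose tag list contains the exact AI tag
is_ai = ['Artificial Intelligence' in tags for tags in parsed]
print(is_ai)
print(sum(is_ai))
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for text read from a file.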

In [None]:
#Collecting AI companies
ai_companies = ycd[ycd['tags'].apply(lambda tags: 'Artificial Intelligence' in tags if isinstance(tags, list) else False)]
print(ai_companies)

      company_id   company_name  \
2          28409        BerriAI   
9          28367      Atri Labs   
22         28183          Metal   
44         28114      BabylonAI   
49         28089         Thread   
...          ...            ...   
7930         511     Semantics3   
8033         391    Canopy Labs   
8112         105     LeadGenius   
8273         245  Directed Edge   
8311         289    JustSpotted   

                                      short_description  \
2                Stop OpenAI Errors w/ 1 line of code 👈   
9       Open-source web framework for Python developers   
22             Machine learning embeddings as a service   
44         Datadog for machine learning on edge devices   
49     Incident Management platform for large enterp...   
...                                                 ...   
7930  Data and AI platform for Ecommerce & Cross-Bor...   
8033  Canopy Labs automates customer analytics for s...   
8112  LeadGenius is an account-based marketing a