# Pipeline, Simplified: What Does It Do?

**Goal**: Analyze how the vagueness of AI/ML startups' promises affects funding success

In [9]:
import pandas as pd
import numpy as np

DATA_DIR = "raw"
SAMPLE = 1000  # row cap for quick testing

## Step 1: Company Data - Finding AI/ML Companies

In [10]:
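# Load company data (first SAMPLE rows only)
company = pd.read_csv(f"{DATA_DIR}/Company20230501.dat", sep='|', nrows=SAMPLE, low_memory=False)
print(f"Total companies: {len(company)}")

# Keep only AI/ML companies.
# Caution: this is a plain substring match on the lowercased description, so
# 'ai' also matches inside words such as "chain" or "training". The rows
# displayed below (restaurant chains, training services) are likely false positives.
ai_keywords = ['AI', 'ML', 'machine learning', 'artificial intelligence']
company['is_ai_ml'] = company.apply(
    lambda row: any(kw.lower() in str(row.get('Description', '')).lower() 
                    for kw in ai_keywords), axis=1
)
ai_companies = company[company['is_ai_ml']].copy()

print(f"AI/ML companies: {len(ai_companies)}")
ai_companies[['CompanyID', 'CompanyName', 'Description']].head()

Total companies: 1000
AI/ML companies: 349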


|  | CompanyID | CompanyName | Description |
|---|---|---|---|
| 4 | 100001-71 | Pollo Regio | Operator of a chain of restaurants intended to... |
| 6 | 100002-07 | Pequeno Mexico Operating Company | Operator of an entertainment and commercial ve... |
| 7 | 100002-25 | Yogurtland Franchising | Operator of a chain of frozen yogurt restauran... |
| 8 | 100002-52 | Inova US | Operator of a product marketing company with a... |
| 11 | 100002-97 | Career Educational Services | Provider of vocational career training service... |
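
The sample rows above (restaurant chains, a vocational training provider) suggest that substring matching over-matches: lowercased `ai` occurs inside words such as "chain" and "training". Below is a minimal sketch of a whole-word alternative. The name `mentions_ai_ml`, the pattern, and the example strings are illustrative assumptions, not part of the original pipeline.

```python
import re

# Hypothetical pattern: match the keywords only as whole words or phrases,
# case-insensitively, so "chain" no longer counts as "ai".
AI_ML_PATTERN = re.compile(
    r"\b(?:ai|ml|machine learning|artificial intelligence)\b",
    flags=re.IGNORECASE,
)

def mentions_ai_ml(description):
    """Return True if the text contains an AI/ML keyword as a whole word."""
    if not isinstance(description, str):
        return False  # missing descriptions (NaN/None) never match
    return AI_ML_PATTERN.search(description) is not None

# Made-up example descriptions for illustration.
examples = [
    "Operator of a chain of restaurants",
    "Provider of vocational career training services",
    "Developer of an AI platform for drug discovery",
    "Provider of machine learning tools",
]
flags = [mentions_ai_ml(text) for text in examples]
print(flags)  # [False, False, True, True]
```

In the notebook this could be applied as `company['Description'].apply(mentions_ai_ml)`; the resulting count would probably be well below the 349 reported above, since that figure includes substring hits.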


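## Step 2: Computing Vagueness

The score starts at 50 and moves up by 10 for each vague word and down by 10 for each precise word found in the Description, clamped to the range 0 to 100.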

In [11]:
def calc_vagueness(desc):
    """Return a 0-100 vagueness score; 50 is neutral."""
    if pd.isna(desc):
        return 50  # no description: treat as neutral
    text = str(desc).lower()
    vague_words = ['approximately', 'around', 'flexible', 'scalable']
    precise_words = ['precisely', 'exactly', 'guaranteed', 'specific']
    
    # Substring counts: e.g. 'specifically' also counts as 'specific'
    vague_count = sum(text.count(w) for w in vague_words)
    precise_count = sum(text.count(w) for w in precise_words)
    
    return max(0, min(100, 50 + 10 * (vague_count - precise_count)))

ai_companies['vagueness'] = ai_companies['Description'].apply(calc_vagueness)

print(f"Mean vagueness: {ai_companies['vagueness'].mean():.1f}")
ai_companies[['CompanyName', 'vagueness']].head(10)

Mean vagueness: 50.1


|  | CompanyName | vagueness |
|---|---|---|
| 4 | Pollo Regio | 50 |
| 6 | Pequeno Mexico Operating Company | 50 |
| 7 | Yogurtland Franchising | 50 |
| 8 | Inova US | 50 |
| 11 | Career Educational Services | 50 |
| 16 | G2See | 50 |
| 18 | ChainSync | 50 |
| 25 | Craft Equipment Company | 50 |
| 26 | Woodham Mortimer | 50 |
| 28 | ImPress Systems | 50 |
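
All ten rows above score exactly 50, meaning neither word list matched those descriptions. To make the formula concrete, here is a small worked sketch of how the score moves when words do match. It uses whole-word matching, which is an assumption that differs from the substring counts in the cell above; the helper names and example sentences are made up.

```python
import re

VAGUE_WORDS = ["approximately", "around", "flexible", "scalable"]
PRECISE_WORDS = ["precisely", "exactly", "guaranteed", "specific"]

def count_whole_words(text, words):
    """Count whole-word, case-insensitive occurrences of any word in `words`."""
    pattern = r"\b(?:" + "|".join(map(re.escape, words)) + r")\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

def vagueness_score(text):
    """Score = 50 + 10 * (vague - precise), clamped to the range 0..100."""
    vague = count_whole_words(text, VAGUE_WORDS)
    precise = count_whole_words(text, PRECISE_WORDS)
    return max(0, min(100, 50 + 10 * (vague - precise)))

print(vagueness_score("Operator of a chain of restaurants"))          # 50
print(vagueness_score("Developer of a flexible, scalable platform"))  # 70
print(vagueness_score("Provider of guaranteed, specific delivery"))   # 30
```

With only four words per list, most short company descriptions would be expected to stay at 50, which is consistent with the mean of 50.1 reported above.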


## Step 3: Deal Data - Finding VC Investments

In [12]:
# Load deal data (first SAMPLE rows only)
deal = pd.read_csv(f"{DATA_DIR}/Deal20230501.dat", sep='|', nrows=SAMPLE, low_memory=False)
print(f"Total deals: {len(deal)}")

# Keep only VC deals
vc_deals = deal[deal['DealType'].str.contains('VC', case=False, na=False)].copy()
print(f"VC deals: {len(vc_deals)}")

# Classify rounds: 'Series A' stands for early rounds, 'Series B' for later rounds
early_rounds = ['1st Round', 'Seed Round', 'Angel']
later_rounds = ['2nd Round', '3rd Round', '4th Round']

vc_deals['round'] = np.where(
    vc_deals['VCRound'].isin(early_rounds), 'Series A',
    np.where(vc_deals['VCRound'].isin(later_rounds), 'Series B', None)
)

# Drop deals whose round is in neither list; .copy() avoids SettingWithCopyWarning later
vc_deals = vc_deals[vc_deals['round'].notna()].copy()
print(f"Series A: {(vc_deals['round'] == 'Series A').sum()}")
print(f"Series B: {(vc_deals['round'] == 'Series B').sum()}")

vc_deals[['CompanyID', 'CompanyName', 'round', 'DealSize']].head()

Total deals: 1000
VC deals: 258
Series A: 127
Series B: 100


|  | CompanyID | CompanyName | round | DealSize |
|---|---|---|---|---|
| 3 | 100001-08 | Zana | Series A |  |
| 26 | 100003-15 | Premama | Series A | 1.399999 |
| 27 | 100003-15 | Premama | Series B | 3.25 |
| 29 | 100003-15 | Premama | Series B | 3.500001 |
| 30 | 100003-15 | Premama | Series B | 5.9 |
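
The round classification in this step can also be expressed as a single lookup table, which keeps the label for each `VCRound` value in one place. The sketch below is one possible alternative, not the notebook's method: the round labels are copied from the cell above, while `ROUND_TO_STAGE` and the toy data (including `'5th Round'`) are made up for illustration.

```python
import pandas as pd

# Lookup table built from the early/later lists used in this step.
ROUND_TO_STAGE = {
    "1st Round": "Series A",
    "Seed Round": "Series A",
    "Angel": "Series A",
    "2nd Round": "Series B",
    "3rd Round": "Series B",
    "4th Round": "Series B",
}

# Toy data: one early round, one later round, one unlisted round, one missing.
toy = pd.DataFrame({"VCRound": ["Seed Round", "2nd Round", "5th Round", None]})

# Series.map returns NaN for values that are not keys of the dict,
# so unlisted or missing rounds can be dropped with notna().
toy["round"] = toy["VCRound"].map(ROUND_TO_STAGE)
classified = toy[toy["round"].notna()].copy()
print(classified)
```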


## Step 4: Defining Funding Success

In [13]:
# Success = deal completed and deal size > 0
# Note: a missing or non-numeric DealSize becomes 0 here, so it counts as not successful
vc_deals['DealSize'] = pd.to_numeric(vc_deals['DealSize'], errors='coerce').fillna(0)
vc_deals['funding_success'] = (
    (vc_deals['DealSize'] > 0) & 
    (vc_deals['DealStatus'].str.contains('Completed', case=False, na=False))
).astype(int)

print(f"Successful deals: {vc_deals['funding_success'].sum()}")
print(f"Success rate: {vc_deals['funding_success'].mean():.1%}")

Successful deals: 123
Success rate: 54.2%
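
Because missing deal sizes are filled with 0, a completed deal with an undisclosed amount is indistinguishable from a failed one in `funding_success`. A minimal sketch of how missingness could be kept visible alongside the same success rule follows; the toy rows and the status string `'Failed/Cancelled'` are made up, and `size_missing` is a hypothetical column name.

```python
import pandas as pd

# Toy deals: a normal deal, an undisclosed amount, a zero amount, a non-completed deal.
toy = pd.DataFrame({
    "DealSize": ["1.4", None, "0", "3.25"],
    "DealStatus": ["Completed", "Completed", "Completed", "Failed/Cancelled"],
})

size = pd.to_numeric(toy["DealSize"], errors="coerce")  # None -> NaN
completed = toy["DealStatus"].str.contains("Completed", case=False, na=False)

# NaN > 0 evaluates to False, so the success flag matches the fillna(0) version,
# while size_missing records which rows had no amount at all.
toy["size_missing"] = size.isna()
toy["funding_success"] = ((size > 0) & completed).astype(int)
print(toy)
```

Reporting the success rate with and without the `size_missing` rows would show how much the 54.2% figure depends on undisclosed amounts.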


## Step 5: Joining Company + Deal - Building the Analysis Panel

In [14]:
# Join on CompanyID (the inner join keeps only deals of companies flagged as AI/ML)
ai_companies_indexed = ai_companies.set_index('CompanyID')

panel = vc_deals.merge(
    ai_companies_indexed[['vagueness']],
    left_on='CompanyID',
    right_index=True,
    how='inner'
)

print(f"\nAnalysis panel: {len(panel)} observations")
print(f"Series A: {(panel['round'] == 'Series A').sum()}")
print(f"Series B: {(panel['round'] == 'Series B').sum()}")

panel[['CompanyName', 'round', 'vagueness', 'funding_success']].head(10)


Analysis panel: 65 observations
Series A: 28
Series B: 37


|  | CompanyName | round | vagueness | funding_success |
|---|---|---|---|---|
| 39 | G2See | Series A | 50 | 0 |
| 67 | ImPress Systems | Series A | 50 | 1 |
| 127 | UOKO | Series A | 50 | 1 |
| 128 | UOKO | Series B | 50 | 1 |
| 129 | UOKO | Series B | 50 | 1 |
| 130 | UOKO | Series B | 50 | 1 |
| 194 | Hyakusenrenma | Series B | 50 | 0 |
| 195 | Hyakusenrenma | Series B | 50 | 1 |
| 196 | Hyakusenrenma | Series B | 50 | 1 |
| 207 | Maven | Series B | 50 | 1 |
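
The panel has several deals per company (UOKO appears four times above), so the join is many-to-one from deals to companies. A hedged sketch of how that assumption could be checked with the `validate` and `indicator` options of `DataFrame.merge` is shown below; the toy frames and IDs are made up.

```python
import pandas as pd

# Toy deals (many rows per company) and toy companies (one row per company).
deals = pd.DataFrame({
    "CompanyID": ["A-1", "A-1", "B-2", "C-3"],
    "round": ["Series A", "Series B", "Series A", "Series B"],
})
companies = pd.DataFrame({
    "CompanyID": ["A-1", "B-2"],
    "vagueness": [50, 60],
})

merged = deals.merge(
    companies,
    on="CompanyID",
    how="left",
    validate="many_to_one",  # raises MergeError if a CompanyID repeats in `companies`
    indicator=True,          # adds a `_merge` column: 'both' or 'left_only'
)

# Count matched vs. unmatched deals, then keep only the matched ones.
match_counts = merged["_merge"].value_counts()
toy_panel = merged[merged["_merge"] == "both"].drop(columns="_merge")
print(match_counts)
print(toy_panel)
```

Filtering on `_merge == 'both'` gives the same rows as an inner join, with the added benefit that the number of dropped deals is visible before they disappear.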


## Step 6: Core Question - How Does Vagueness Affect Funding Success?

**Hypothesis**: In early rounds (Series A), vague promises help; in later rounds (Series B), precise promises help

In [15]:
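# Compare success rates by round and vagueness bin
# Bins are right-closed: Low = [0, 50], High = (50, 100]; include_lowest keeps a score of 0
vagueness_bin = pd.cut(panel['vagueness'], bins=[0, 50, 100], labels=['Low', 'High'], include_lowest=True)
summary = panel.groupby(['round', vagueness_bin], observed=False).agg({
    'funding_success': ['count', 'sum', 'mean']
}).round(3)

summary.columns = ['Count', 'Successes', 'Success_Rate']
print("\nSuccess rate by Round & Vagueness:")
print(summary)


Success rate by Round & Vagueness: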
                    Count  Successes  Success_Rate
round    vagueness                                
Series A Low           28         12         0.429
         High           0          0           NaN
Series B Low           37         27         0.730
         High           0          0           NaN
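
With edges `[0, 50, 100]`, `pd.cut` builds right-closed intervals, so a score of exactly 50 belongs to Low. That is consistent with every row in the table above landing in Low and the High bin staying empty. The sketch below illustrates the edge behavior on made-up scores, including the effect of `include_lowest` on a score of 0.

```python
import pandas as pd

scores = pd.Series([0, 50, 60, 100])  # made-up scores covering the bin edges

# Default: intervals are (0, 50] and (50, 100], so 0 falls outside every bin.
default_bins = pd.cut(scores, bins=[0, 50, 100], labels=["Low", "High"])

# include_lowest=True closes the first interval on the left, so 0 becomes Low.
inclusive_bins = pd.cut(
    scores, bins=[0, 50, 100], labels=["Low", "High"], include_lowest=True
)

print(default_bins.tolist())    # [nan, 'Low', 'High', 'High']
print(inclusive_bins.tolist())  # ['Low', 'Low', 'High', 'High']
```

Since the comparison needs observations in both bins, a quick `panel['vagueness'].value_counts()` before grouping would reveal whether the score varies enough to test the hypothesis.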


## Step 7: Regression Analysis - Statistical Test

In [16]:
import statsmodels.formula.api as smf

# Prepare variables
panel['vagueness_scaled'] = panel['vagueness'] / 100  # rescale 0..100 to 0..1
panel['series_b_dummy'] = (panel['round'] == 'Series B').astype(int)

# Model: funding_success ~ vagueness + round + interaction
formula = 'funding_success ~ vagueness_scaled + series_b_dummy + vagueness_scaled:series_b_dummy'

# Try a logit first; fall back to OLS (a linear probability model) if the fit fails,
# for example on a singular design matrix
try:
    model = smf.logit(formula, data=panel).fit(disp=False)
except Exception:
    model = smf.ols(formula, data=panel).fit()

print("\nRegression results:")
print(model.summary())


Regression results:
                            OLS Regression Results                            
Dep. Variable:        funding_success   R-squared:                       0.093
Model:                            OLS   Adj. R-squared:                  0.078
Method:                 Least Squares   F-statistic:                     6.434
Date:                Wed, 22 Oct 2025   Prob (F-statistic):             0.0137
Time:                        14:17:56   Log-Likelihood:                -42.689
No. Observations:                  65   AIC:                             89.38
Df Residuals:                      63   BIC:                             93.73
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                                      coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------------------

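## Key Interpretation

- **vagueness_scaled**: the change in success probability in Series A when `vagueness_scaled` rises by one unit, which is the full 0 to 100 range of the raw score (under the OLS fit shown above; in a logit the coefficient would be on the log-odds scale)
- **vagueness_scaled:series_b_dummy**: how the vagueness effect in Series B differs from Series A; the Series B effect is the sum of the two coefficients, and a sum with the opposite sign to the Series A effect would indicate a reversal
- A coefficient with p < 0.05 is conventionally treated as statistically significant
- In this 1000-row sample the vagueness effect does not appear to be estimable: every panel row falls in the Low bin in Step 6, and the output reports `Df Model: 1` for three regressors, which suggests the vagueness terms are collinear with the intercept and the round dummy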
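
To make the reading of the interaction term concrete, the sketch below fits the same three-regressor specification by least squares on synthetic, noise-free data and derives the per-round slopes. It also shows what happens to the design matrix when vagueness is constant. Everything here is an illustration under stated assumptions: the coefficients `b0` to `b3` are invented, the data is simulated, and NumPy's `lstsq` stands in for the statsmodels fit.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
vagueness = rng.uniform(0, 1, n)   # synthetic scaled vagueness that varies by row
series_b = rng.integers(0, 2, n)   # 0 = Series A, 1 = Series B

# Invented coefficients, for illustration only.
b0, b1, b2, b3 = 0.3, 0.4, 0.5, -0.7
y = b0 + b1 * vagueness + b2 * series_b + b3 * vagueness * series_b

# Design matrix: intercept, vagueness, round dummy, interaction.
X = np.column_stack([np.ones(n), vagueness, series_b, vagueness * series_b])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

slope_series_a = coef[1]            # vagueness effect when series_b = 0
slope_series_b = coef[1] + coef[3]  # main effect plus interaction
print(round(slope_series_a, 3), round(slope_series_b, 3))  # 0.4 -0.3

# If vagueness is the same for every row, two columns become redundant:
# vagueness duplicates the intercept and the interaction duplicates the dummy.
X_const = np.column_stack([np.ones(n), np.full(n, 0.5), series_b, 0.5 * series_b])
rank_full = np.linalg.matrix_rank(X)
rank_const = np.linalg.matrix_rank(X_const)
print(rank_full, rank_const)  # 4 2
```

A rank of 2 leaves one effective regressor besides the intercept, which would match a report of `Df Model: 1` and explain why a logit fit could fail on such data.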