## This notebook vectorizes the blurb data by calling the blurb_tfidf module and prints the effect of different settings
## on the size and sparsity of the feature space

In [1]:
import pandas as pd

from preprocessing.blurb_tfidf import make_blurb_tfidf_pipeline


In [2]:
df = pd.read_parquet("../data/processed_data.parquet")

df.head()


Unnamed: 0,blurb,category_parent_id,category_parent_name,country,creator_id,currency,deadline,goal,id,name,state,launched_date,fx_date,fx_daily_mean,usd_goal_fx
0,A Year of Sanderson: Enjoy books and swag boxe...,18,Publishing,US,74501917,USD,2022-03-31,1000000.0,1497949659,Surprise! Four Secret Novels by Brandon Sanderson,successful,2022-03-01,2022-03-01,1.0,1000000.0
1,Color e-paper smartwatch with up to 7 days of ...,7,Design,US,597507018,USD,2015-03-28,500000.0,1799979574,"Pebble Time - Awesome Smartwatch, No Compromises",successful,2015-02-24,2015-02-24,1.0,500000.0
2,Beginning with The Stormlight Archive and expa...,12,Games,US,237961243,USD,2024-08-30,250000.0,7816448,Brandon Sanderson's Cosmere® RPG,successful,2024-08-06,2024-08-06,1.0,250000.0
3,The COOLEST is a portable party disguised as a...,7,Design,US,203090294,USD,2014-08-30,50000.0,342886736,COOLEST COOLER: 21st Century Cooler that's Act...,successful,2014-07-08,2014-07-08,1.0,50000.0
4,Euro-inspired dungeon crawling sequel to the 2...,12,Games,US,1350948450,USD,2020-05-01,500000.0,374744378,Frosthaven,successful,2020-03-31,2020-03-31,1.0,500000.0


In [3]:
blurbs = df["blurb"].fillna("").astype(str)
len(blurbs)


596353

In [4]:
configs = {
    # unigram = single words
    # unibigram = single words + word pairs
    # max_features = how many words/word pairs are kept in the vocabulary
    "unigram_2000": dict(ngram_range=(1, 1), max_features=2000, stop_words="english"),
    "unigram_5000": dict(ngram_range=(1, 1), max_features=5000, stop_words="english"),
    "unigram_10000": dict(ngram_range=(1, 1), max_features=10000, stop_words="english"),
    "unibigram_5000": dict(ngram_range=(1, 2), max_features=5000, stop_words="english"),

    # stop_words: same as unigram_5000, but without English stop-word removal
    "unigram_5000_no_stop": dict(
        ngram_range=(1, 1),
        max_features=5000,
        stop_words=None,
    ),

    # min_df / max_df: filtering of rare / very common words
    # min_df=2: a word is kept only if it appears in the text of at least 2 blurbs
    "unigram_5000_min_df2": dict(
        ngram_range=(1, 1),
        max_features=5000,
        stop_words="english",
        min_df=2,
    ),
    # min_df=5: drop words that appear in fewer than 5 blurbs
    "unigram_5000_min_df5": dict(
        ngram_range=(1, 1),
        max_features=5000,
        stop_words="english",
        min_df=5,
    ),
    # max_df=0.8: drop words that appear in more than 80% of blurbs
    "unigram_5000_max_df08": dict(
        ngram_range=(1, 1),
        max_features=5000,
        stop_words="english",
        max_df=0.8,
    ),
}
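
The settings above can be illustrated on a toy corpus. The following is a minimal pure-Python sketch, not the implementation behind `make_blurb_tfidf_pipeline` (that module is not shown here). The helper names `ngrams` and `select_vocabulary` and the three toy sentences are hypothetical. It assumes whitespace tokenization, `min_df` given as an absolute document count, `max_df` given as a proportion of documents, and a `max_features` cap that keeps the most frequent surviving terms.

```python
from collections import Counter


def ngrams(tokens, ngram_range=(1, 1)):
    """Return all n-grams of the token list for every n in the inclusive range."""
    lo, hi = ngram_range
    out = []
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out


def select_vocabulary(docs, ngram_range=(1, 1), min_df=1, max_df=1.0, max_features=None):
    """Toy vocabulary selection (hypothetical helper, for illustration only).

    1. Filter terms by document frequency (min_df as a count, max_df as a share).
    2. Keep at most max_features terms, ordered by total count across the corpus.
    """
    doc_terms = [ngrams(d.lower().split(), ngram_range) for d in docs]
    df = Counter(t for terms in doc_terms for t in set(terms))  # documents per term
    tf = Counter(t for terms in doc_terms for t in terms)       # total occurrences
    n_docs = len(docs)
    kept = [t for t in df if df[t] >= min_df and df[t] / n_docs <= max_df]
    kept.sort(key=lambda t: (-tf[t], t))  # most frequent first, ties alphabetical
    return kept if max_features is None else kept[:max_features]


# Made-up example texts, not taken from the dataset.
toy_docs = [
    "a board game about dragons",
    "a card game about cats",
    "a smartwatch with a color display",
]

print(ngrams("a board game".split(), (1, 2)))
print(select_vocabulary(toy_docs, min_df=2))
print(select_vocabulary(toy_docs, min_df=2, max_df=0.8))
```

In this sketch the document-frequency filter runs before the size cap, so a cap that is small relative to the corpus vocabulary can dominate the result regardless of `min_df`.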




In [5]:
results = {}

for name, cfg in configs.items():
    pipe = make_blurb_tfidf_pipeline(**cfg)
    X = pipe.fit_transform(blurbs)

    rows, cols = X.shape
    osumat = X.nnz  # "osumat" (hits): number of non-zero entries in the sparse matrix
    density = osumat / (rows * cols)

    results[name] = {
        "rows": rows,
        "cols": cols,
        "osumat": osumat,
        "density": density,
    }

results



{'unigram_2000': {'rows': 596353,
  'cols': 2000,
  'osumat': 3990680,
  'density': 0.0033459041876204194},
 'unigram_5000': {'rows': 596353,
  'cols': 5000,
  'osumat': 4928980,
  'density': 0.0016530410679580717},
 'unigram_10000': {'rows': 596353,
  'cols': 10000,
  'osumat': 5455452,
  'density': 0.0009148024743733997},
 'unibigram_5000': {'rows': 596353,
  'cols': 5000,
  'osumat': 5170768,
  'density': 0.0017341299532323976},
 'unigram_5000_no_stop': {'rows': 596353,
  'cols': 5000,
  'osumat': 8213989,
  'density': 0.0027547405647326334},
 'unigram_5000_min_df2': {'rows': 596353,
  'cols': 5000,
  'osumat': 4928973,
  'density': 0.001653038720355226},
 'unigram_5000_min_df5': {'rows': 596353,
  'cols': 5000,
  'osumat': 4928972,
  'density': 0.0016530383849833908},
 'unigram_5000_max_df08': {'rows': 596353,
  'cols': 5000,
  'osumat': 4928980,
  'density': 0.0016530410679580717}}
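
The density figures can be checked by hand from the reported shapes and non-zero counts. The sketch below recomputes density for the three plain unigram configurations and adds the average number of non-zero features per blurb, a derived quantity that the notebook does not print. The helper name `sparsity_stats` is hypothetical; the numbers are copied from the output above.

```python
def sparsity_stats(rows, cols, nnz):
    """Density and average non-zeros per row for a sparse matrix of the given shape."""
    return {
        "density": nnz / (rows * cols),
        "avg_nnz_per_row": nnz / rows,
    }


# Values copied from the results dictionary above.
reported = {
    "unigram_2000": {"rows": 596353, "cols": 2000, "osumat": 3990680},
    "unigram_5000": {"rows": 596353, "cols": 5000, "osumat": 4928980},
    "unigram_10000": {"rows": 596353, "cols": 10000, "osumat": 5455452},
}

for name, r in reported.items():
    stats = sparsity_stats(r["rows"], r["cols"], r["osumat"])
    print(f"{name:<16} density={stats['density']:.4%}  "
          f"avg non-zeros per blurb={stats['avg_nnz_per_row']:.2f}")
```

Read this way, the falling density is mostly a matter of the denominator: going from 2,000 to 10,000 columns multiplies the matrix size by five, while the average number of non-zero features per blurb only rises from roughly 7 to roughly 9.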

In [6]:
for name, r in results.items():
    print(name)
    print(f"  Rows:          {r['rows']:,}")
    print(f"  Features:      {r['cols']:,}")
    print(f"  Non-zeros:     {r['osumat']:,}")
    print(f"  Density:       {r['density']:.4%}")
    print()



unigram_2000
  Rows:          596,353
  Features:      2,000
  Non-zeros:     3,990,680
  Density:       0.3346%

unigram_5000
  Rows:          596,353
  Features:      5,000
  Non-zeros:     4,928,980
  Density:       0.1653%

unigram_10000
  Rows:          596,353
  Features:      10,000
  Non-zeros:     5,455,452
  Density:       0.0915%

unibigram_5000
  Rows:          596,353
  Features:      5,000
  Non-zeros:     5,170,768
  Density:       0.1734%

unigram_5000_no_stop
  Rows:          596,353
  Features:      5,000
  Non-zeros:     8,213,989
  Density:       0.2755%

unigram_5000_min_df2
  Rows:          596,353
  Features:      5,000
  Non-zeros:     4,928,973
  Density:       0.1653%

unigram_5000_min_df5
  Rows:          596,353
  Features:      5,000
  Non-zeros:     4,928,972
  Density:       0.1653%

unigram_5000_max_df08
  Rows:          596,353
  Features:      5,000
  Non-zeros:     4,928,980
  Density:       0.1653%
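
Because all of the 5,000-feature configurations share the same matrix shape, their non-zero counts can be compared directly against the `unigram_5000` baseline. The sketch below does that with the counts copied from the output above; the variable names are illustrative only.

```python
# Non-zero counts copied from the output above (5,000-feature configurations only).
nnz_by_config = {
    "unigram_5000": 4928980,
    "unibigram_5000": 5170768,
    "unigram_5000_no_stop": 8213989,
    "unigram_5000_min_df2": 4928973,
    "unigram_5000_min_df5": 4928972,
    "unigram_5000_max_df08": 4928980,
}

baseline = nnz_by_config["unigram_5000"]

# Difference in non-zero entries relative to the baseline configuration.
deltas = {name: nnz - baseline for name, nnz in nnz_by_config.items()}

for name, delta in deltas.items():
    print(f"{name:<24}{delta:+,}")
```

The `min_df` and `max_df` variants differ from the baseline by at most 8 non-zero entries, while disabling stop-word removal adds about 3.3 million. One plausible reading, assuming `make_blurb_tfidf_pipeline` forwards these arguments to scikit-learn's `TfidfVectorizer` (the module is not shown here): the document-frequency filters are applied before the `max_features` cap, and the 5,000 most frequent terms almost all pass those thresholds anyway, so the cap largely determines the vocabulary.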

