# Latent Dirichlet Allocation (LDA)

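- LDA (Latent Dirichlet Allocation) is a topic modeling algorithm.
- It estimates a topic distribution for each document and a word distribution for each topic.
- The user decides how many topics LDA should find by setting the hyperparameter K (`num_topics`).
- For example, K=2 asks the algorithm to find two topics.
- With two topics, a document is not assigned to just one of them. It is described as a mixture of both, for instance mostly topic 1 with a smaller share of topic 2.
- Each topic, in turn, is a probability distribution over words.
- Reading the highest-probability words of the topics that dominate a document tells the user what the document is about.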

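- LDA discovers topics in a collection of documents.
- Its input is a frequency-based representation of each document. The model is defined over word counts, so a Bag of Words (BoW) matrix is the usual choice; a TF-IDF matrix is sometimes fed in as well.
- LDA ignores word order. It works backwards from the observed words to infer which topics, and which words within each topic, were used when each document was written.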
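
The fact that a bag-of-words representation discards word order can be seen in a small sketch. It uses only the standard library; the helper name `bag_of_words` and the two toy token lists are assumptions made up for illustration. gensim's `Dictionary.doc2bow`, used later in this notebook, produces the same kind of counts as `(token_id, count)` pairs.

```python
from collections import Counter


def bag_of_words(tokens):
    """Count how often each token occurs; the order of tokens is discarded."""
    return Counter(tokens)


# Two hypothetical documents with the same tokens in a different order
doc_a = ["space", "alien", "space", "robot"]
doc_b = ["robot", "space", "alien", "space"]

bow_a = bag_of_words(doc_a)
bow_b = bag_of_words(doc_b)

print(bow_a)
print(bow_a == bow_b)  # both documents map to the same bag of words
```

Because LDA only ever sees these counts, two documents that differ only in word order are indistinguishable to the model.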


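- LDA assumes that each document was written by first choosing a mixture of topics and then drawing every word from one of those topics.
- Under this assumption, LDA reverse-engineers the process: given only the documents, it infers the topics that most plausibly generated them.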

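[Process]
1. Decide the number of words N the document will contain. (The number of topics K, `num_topics`, is fixed in advance.)
2. Draw the document's topic mixture from a probability distribution.
3. Choose each word of the document:
   1) Pick a topic T at random according to the topic mixture.
   2) Pick a word at random according to the word distribution of the chosen topic T.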
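
The generative story above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how gensim trains a model: the vocabulary, the two topic-word distributions, the `alpha` value, and the helper names `sample_dirichlet` and `generate_document` are all made up for the example. The topic mixture is drawn from a Dirichlet distribution, which is the distribution LDA is named after.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word_probs, vocab, num_words, alpha, rng):
    """Generate one document: a topic mixture first, then one topic and one word per position."""
    num_topics = len(topic_word_probs)
    # Step 2: draw the document's topic mixture
    theta = sample_dirichlet([alpha] * num_topics, rng)
    words = []
    for _ in range(num_words):
        # Step 3-1: pick a topic according to the mixture
        topic = rng.choices(range(num_topics), weights=theta)[0]
        # Step 3-2: pick a word according to that topic's word distribution
        word = rng.choices(vocab, weights=topic_word_probs[topic])[0]
        words.append(word)
    return theta, words


# Hypothetical toy setup: K=2 topics over a 4-word vocabulary
vocab = ["alien", "spaceship", "wedding", "romance"]
topic_word_probs = [
    [0.5, 0.5, 0.0, 0.0],  # topic 0: science-fiction words
    [0.0, 0.0, 0.5, 0.5],  # topic 1: romance words
]

rng = random.Random(0)
theta, words = generate_document(topic_word_probs, vocab, num_words=8, alpha=0.5, rng=rng)
print(theta)
print(words)
```

Training runs this story in reverse: only `words` is observed, and the model estimates `theta` for every document and the topic-word distributions that best explain the whole corpus.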

In [1]:
# Add the parent directory to the import path
import sys; sys.path.insert(0, '..')

from util.data_loader import DataLoader
from util.metric_calculator import MetricCalculator

In [2]:
# Load the MovieLens data
data_loader = DataLoader(num_users=1000, num_test_items=5, data_path='../data/ml-10M100K/')
movielens = data_loader.load()

In [3]:
import gensim
import logging
from gensim.corpora.dictionary import Dictionary

movie_content = movielens.item_content.copy()
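# Some movies have no tags, but every movie has at least one genre
# Combine tags and genres into one content field, used to find and recommend similar movies
# Movies without tags hold NaN, so replace NaN with an empty value that list() turns into []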
movie_content['tag_genre'] = movie_content['tag'].fillna("").apply(list) + movie_content['genre'].apply(list)
movie_content['tag_genre'] = movie_content['tag_genre'].apply(lambda x:list(map(str, x)))

# Train LDA on the combined tag and genre data
tag_genre_data = movie_content.tag_genre.tolist()

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
common_dictionary = Dictionary(tag_genre_data)
common_corpus = [common_dictionary.doc2bow(text) for text in tag_genre_data]

# Train the LDA model
lda_model = gensim.models.LdaModel(common_corpus, id2word=common_dictionary, num_topics=50, passes=30)
lda_topics = lda_model[common_corpus]
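
With gensim's default settings, indexing the model with a corpus, as in `lda_model[common_corpus]`, yields one list of `(topic_id, probability)` pairs per document. A minimal sketch of picking the dominant topic from such a list follows. The helper `dominant_topic` and the values in `example_lda_topics` are hypothetical; they only mimic the format and are not output from this notebook.

```python
def dominant_topic(doc_topics):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(doc_topics, key=lambda pair: pair[1])


# Hypothetical topic distributions for two movies, in (topic_id, probability) format
example_lda_topics = [
    [(9, 0.62), (27, 0.30), (2, 0.08)],
    [(30, 0.91), (8, 0.09)],
]

for movie_index, doc_topics in enumerate(example_lda_topics):
    topic_id, prob = dominant_topic(doc_topics)
    print(movie_index, topic_id, prob)
```

Applied to the real `lda_topics`, the same function would give one representative topic per movie, which is one simple basis for grouping similar movies.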
2024-03-31 21:54:37,557 : INFO : adding document #0 to Dictionary<0 unique tokens: []>
2024-03-31 21:54:37,809 : INFO : adding document #10000 to Dictionary<14749 unique tokens: ['3d', 'Adventure', 'Animation', 'Children', 'Comedy']...>
2024-03-31 21:54:37,823 : INFO : built Dictionary<15261 unique tokens: ['3d', 'Adventure', 'Animation', 'Children', 'Comedy']...> from 10681 documents (total 117144 corpus positions)
2024-03-31 21:54:37,825 : INFO : Dictionary lifecycle event {'msg': "built Dictionary<15261 unique tokens: ['3d', 'Adventure', 'Animation', 'Children', 'Comedy']...> from 10681 documents (total 117144 corpus positions)", 'datetime': '2024-03-31T21:54:37.824976', 'gensim': '4.3.2', 'python': '3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.22621-SP0', 'event': 'created'}
2024-03-31 21:54:37,974 : INFO : using symmetric alpha at 0.02
2024-03-31 21:54:37,975 : INFO : using symmetric eta at 0.02

2024-03-31 21:54:40,573 : INFO : topic #43 (0.020): 0.044*"marx brothers" + 0.039*"bill murray" + 0.033*"memory" + 0.027*"Comedy" + 0.027*"jack nicholson" + 0.025*"jane austen" + 0.023*"Drama" + 0.021*"wedding" + 0.018*"liam neeson" + 0.018*"hugh grant"
2024-03-31 21:54:40,573 : INFO : topic #2 (0.020): 0.119*"Drama" + 0.080*"War" + 0.050*"criterion" + 0.043*"Action" + 0.033*"martial arts" + 0.026*"Comedy" + 0.026*"Adventure" + 0.018*"poignant" + 0.018*"deliberate" + 0.017*"reflective"
2024-03-31 21:54:40,579 : INFO : topic diff=0.231356, rho=0.447214
2024-03-31 21:54:40,845 : INFO : -21.396 per-word bound, 2759379.7 perplexity estimate based on a held-out corpus of 681 documents with 5199 words
2024-03-31 21:54:40,845 : INFO : PROGRESS: pass 0, at document #10681/10681
2024-03-31 21:54:40,999 : INFO : merging changes from 681 documents into a model of 10681 documents
2024-03-31 21:54:41,029 : INFO : topic #9 (0.020): 0.198*"Thriller" + 0.137*"Mystery" + 0.105*"Drama" + 0.056*"r" + 0.0

2024-03-31 21:54:42,371 : INFO : PROGRESS: pass 1, at document #10000/10681
2024-03-31 21:54:42,623 : INFO : merging changes from 2000 documents into a model of 10681 documents
2024-03-31 21:54:42,668 : INFO : topic #32 (0.020): 0.036*"underrated" + 0.025*"michael crichton" + 0.022*"janus 50" + 0.022*"1980s" + 0.017*"business" + 0.017*"brian de palma" + 0.015*"canada" + 0.015*"mother-son relationship" + 0.013*"roald dahl" + 0.012*"Drama"
2024-03-31 21:54:42,668 : INFO : topic #18 (0.020): 0.190*"anime" + 0.091*"japan" + 0.037*"pg" + 0.023*"Animation" + 0.022*"movie to see" + 0.016*"affectionate" + 0.015*"Drama" + 0.015*"fashion" + 0.013*"president" + 0.012*"romance"
2024-03-31 21:54:42,668 : INFO : topic #31 (0.020): 0.296*"Comedy" + 0.194*"Documentary" + 0.075*"zombies" + 0.057*"documentary" + 0.016*"zombie" + 0.013*"to see" + 0.012*"humorous" + 0.011*"campy" + 0.011*"light" + 0.010*"sundance award winner"
2024-03-31 21:54:42,668 : INFO : topic #10 (0.020): 0.032*"Drama" + 0.030*"keir

2024-03-31 21:54:44,298 : INFO : topic #39 (0.020): 0.059*"mel gibson" + 0.058*"samuel l. jackson" + 0.051*"uma thurman" + 0.041*"kidnapping" + 0.029*"steve martin" + 0.027*"revenge" + 0.019*"predictable" + 0.019*"witch" + 0.014*"hollywood" + 0.013*"bfi classic"
2024-03-31 21:54:44,298 : INFO : topic #43 (0.020): 0.051*"jack nicholson" + 0.051*"memory" + 0.042*"bill murray" + 0.033*"marx brothers" + 0.030*"Comedy" + 0.029*"british" + 0.028*"wedding" + 0.026*"boring" + 0.026*"jane austen" + 0.019*"hugh grant"
2024-03-31 21:54:44,298 : INFO : topic #4 (0.020): 0.068*"surreal" + 0.064*"quirky" + 0.046*"satirical" + 0.034*"irreverent" + 0.031*"dark comedy" + 0.030*"humorous" + 0.030*"biting" + 0.030*"cynical" + 0.030*"nicolas cage" + 0.028*"black comedy"
2024-03-31 21:54:44,298 : INFO : topic #24 (0.020): 0.110*"romance" + 0.049*"chick flick" + 0.038*"pg-13" + 0.037*"nonlinear" + 0.033*"girlie movie" + 0.032*"video game adaptation" + 0.028*"Comedy" + 0.027*"bechdel test:fail" + 0.023*"juli

2024-03-31 21:54:46,068 : INFO : topic #11 (0.020): 0.133*"comedy" + 0.069*"Comedy" + 0.060*"funny" + 0.028*"hilarious" + 0.028*"seen more than once" + 0.022*"jim carrey" + 0.020*"parody" + 0.017*"stupid" + 0.017*"quirky" + 0.016*"funny as hell"
2024-03-31 21:54:46,068 : INFO : topic diff=0.110822, rho=0.327201
2024-03-31 21:54:46,078 : INFO : PROGRESS: pass 3, at document #8000/10681
2024-03-31 21:54:46,543 : INFO : merging changes from 2000 documents into a model of 10681 documents
2024-03-31 21:54:46,588 : INFO : topic #32 (0.020): 0.040*"michael crichton" + 0.038*"underrated" + 0.032*"janus 50" + 0.027*"1980s" + 0.022*"brian de palma" + 0.019*"mother-son relationship" + 0.017*"canada" + 0.015*"switching places" + 0.015*"gangster" + 0.014*"russia"
2024-03-31 21:54:46,588 : INFO : topic #9 (0.020): 0.242*"Thriller" + 0.149*"Mystery" + 0.140*"Drama" + 0.053*"nudity (topless - notable)" + 0.046*"Crime" + 0.038*"directorial debut" + 0.028*"r" + 0.023*"Action" + 0.020*"baseball" + 0.019*

2024-03-31 21:54:48,497 : INFO : merging changes from 2000 documents into a model of 10681 documents
2024-03-31 21:54:48,536 : INFO : topic #38 (0.020): 0.174*"fantasy" + 0.061*"racism" + 0.039*"russell crowe" + 0.030*"courtroom" + 0.027*"courtroom drama" + 0.021*"space travel" + 0.016*"judaism" + 0.016*"court" + 0.015*"adventure" + 0.014*"stage"
2024-03-31 21:54:48,538 : INFO : topic #29 (0.020): 0.069*"library vhs" + 0.048*"peter jackson" + 0.044*"philip seymour hoffman" + 0.035*"military" + 0.031*"eddie murphy" + 0.027*"christopher walken" + 0.026*"julianne moore" + 0.020*"new zealand" + 0.019*"forest whitaker" + 0.018*"disability"
2024-03-31 21:54:48,539 : INFO : topic #42 (0.020): 0.141*"drugs" + 0.059*"religion" + 0.046*"Drama" + 0.026*"london" + 0.025*"r" + 0.021*"coming of age" + 0.021*"christianity" + 0.021*"wired 50 greatest soundtracks" + 0.020*"depressing" + 0.020*"ewan mcgregor"
2024-03-31 21:54:48,540 : INFO : topic #3 (0.020): 0.068*"serial killer" + 0.062*"politics" + 0

2024-03-31 21:54:50,246 : INFO : topic #40 (0.020): 0.265*"less than 300 ratings" + 0.135*"Drama" + 0.093*"prison" + 0.025*"friendship" + 0.024*"sven's to see list" + 0.020*"War" + 0.016*"adaptation" + 0.012*"united states" + 0.011*"prison escape" + 0.010*"stephen king"
2024-03-31 21:54:50,248 : INFO : topic #22 (0.020): 0.230*"Action" + 0.171*"Adventure" + 0.151*"Sci-Fi" + 0.073*"Comedy" + 0.057*"70mm" + 0.056*"Western" + 0.035*"Fantasy" + 0.032*"can't remember" + 0.029*"Thriller" + 0.014*"mel brooks"
2024-03-31 21:54:50,249 : INFO : topic #36 (0.020): 0.063*"magic" + 0.062*"harrison ford" + 0.061*"sequel" + 0.036*"george lucas" + 0.033*"fantasy" + 0.026*"space opera" + 0.026*"adventure" + 0.021*"funniest movies" + 0.019*"seen more than once" + 0.016*"star wars"
2024-03-31 21:54:50,251 : INFO : topic diff=0.056679, rho=0.296950
2024-03-31 21:54:50,251 : INFO : PROGRESS: pass 5, at document #6000/10681
2024-03-31 21:54:50,438 : INFO : merging changes from 2000 documents into a model of

2024-03-31 21:54:51,888 : INFO : topic #3 (0.020): 0.085*"politics" + 0.071*"serial killer" + 0.065*"crime" + 0.053*"brad pitt" + 0.040*"psychology" + 0.040*"morgan freeman" + 0.032*"owned" + 0.023*"Thriller" + 0.023*"violent" + 0.022*"murder"
2024-03-31 21:54:51,888 : INFO : topic diff=0.224548, rho=0.284665
2024-03-31 21:54:51,888 : INFO : PROGRESS: pass 6, at document #4000/10681
2024-03-31 21:54:52,189 : INFO : merging changes from 2000 documents into a model of 10681 documents
2024-03-31 21:54:52,258 : INFO : topic #1 (0.020): 0.107*"oscar (best picture)" + 0.043*"oscar (best directing)" + 0.043*"oscar (best cinematography)" + 0.041*"tom hanks" + 0.040*"oscar (best actor)" + 0.027*"overrated" + 0.026*"drama" + 0.024*"imdb top 250" + 0.024*"afi 100" + 0.020*"tumey's dvds"
2024-03-31 21:54:52,263 : INFO : topic #36 (0.020): 0.066*"magic" + 0.063*"harrison ford" + 0.063*"sequel" + 0.037*"george lucas" + 0.034*"fantasy" + 0.027*"space opera" + 0.024*"adventure" + 0.022*"funniest movie

2024-03-31 21:54:53,831 : INFO : topic diff=0.109476, rho=0.284665
2024-03-31 21:54:53,831 : INFO : PROGRESS: pass 7, at document #2000/10681
2024-03-31 21:54:54,174 : INFO : merging changes from 2000 documents into a model of 10681 documents
2024-03-31 21:54:54,256 : INFO : topic #2 (0.020): 0.121*"Drama" + 0.107*"War" + 0.055*"criterion" + 0.031*"atmospheric" + 0.027*"martial arts" + 0.026*"bleak" + 0.024*"poignant" + 0.018*"melancholy" + 0.018*"tumey's dvds" + 0.018*"reflective"
2024-03-31 21:54:54,258 : INFO : topic #1 (0.020): 0.127*"oscar (best picture)" + 0.048*"oscar (best directing)" + 0.039*"oscar (best actor)" + 0.037*"oscar (best cinematography)" + 0.030*"tom hanks" + 0.028*"overrated" + 0.025*"drama" + 0.025*"afi 100" + 0.024*"imdb top 250" + 0.019*"national film registry"
2024-03-31 21:54:54,258 : INFO : topic #14 (0.020): 0.236*"classic" + 0.043*"national film registry" + 0.033*"erlend's dvds" + 0.031*"oscar (best actress)" + 0.031*"afi 100" + 0.028*"afi 100 (laughs)" + 

2024-03-31 21:54:56,168 : INFO : topic #11 (0.020): 0.105*"comedy" + 0.094*"Comedy" + 0.064*"funny" + 0.038*"parody" + 0.030*"seen more than once" + 0.027*"hilarious" + 0.025*"england" + 0.022*"kung fu" + 0.022*"jim carrey" + 0.016*"stupid"
2024-03-31 21:54:56,168 : INFO : topic #44 (0.020): 0.235*"remake" + 0.085*"death" + 0.069*"national film registry" + 0.041*"slow" + 0.033*"college" + 0.026*"Drama" + 0.018*"greg kinnear" + 0.018*"alcoholism" + 0.013*"18th century" + 0.011*"opera"
2024-03-31 21:54:56,168 : INFO : topic #41 (0.020): 0.093*"sports" + 0.059*"africa" + 0.039*"french" + 0.033*"basketball" + 0.032*"tragedy" + 0.032*"peter sellers" + 0.031*"hulu" + 0.030*"milla jovovich" + 0.027*"science" + 0.022*"gangs"
2024-03-31 21:54:56,175 : INFO : topic #35 (0.020): 0.051*"motorcycle" + 0.046*"india" + 0.043*"assassination" + 0.043*"post-apocalyptic" + 0.039*"cold war" + 0.034*"dance" + 0.033*"marriage" + 0.031*"infidelity" + 0.026*"betrayal" + 0.021*"germany"

2024-03-31 21:54:57,730 : INFO : topic #19 (0.020): 0.101*"gay" + 0.084*"watched 2006" + 0.077*"oscar (best supporting actor)" + 0.043*"Drama" + 0.029*"poverty" + 0.026*"oscar (best foreign language film)" + 0.026*"small town" + 0.024*"anthony hopkins" + 0.021*"glbt" + 0.021*"queer"
2024-03-31 21:54:57,732 : INFO : topic #7 (0.020): 0.079*"cars" + 0.031*"nazis" + 0.030*"adventure" + 0.024*"john cusack" + 0.024*"fighting" + 0.022*"mst3k" + 0.022*"race" + 0.018*"sword fight" + 0.018*"want to see again" + 0.018*"archaeology"
2024-03-31 21:54:57,734 : INFO : topic diff=0.050972, rho=0.264069
2024-03-31 21:54:57,863 : INFO : -20.908 per-word bound, 1967774.3 perplexity estimate based on a held-out corpus of 681 documents with 5199 words
2024-03-31 21:54:57,863 : INFO : PROGRESS: pass 8, at document #10681/10681
2024-03-31 21:54:57,999 : INFO : merging changes from 681 documents into a model of 10681 documents
2024-03-31 21:54:58,038 : INFO : topic #37 (0.020): 0.130*"murder" + 0.071*"james 

2024-03-31 21:54:59,380 : INFO : topic diff=0.058074, rho=0.255317
2024-03-31 21:54:59,388 : INFO : PROGRESS: pass 9, at document #10000/10681
2024-03-31 21:54:59,650 : INFO : merging changes from 2000 documents into a model of 10681 documents
2024-03-31 21:54:59,718 : INFO : topic #5 (0.020): 0.111*"based on a tv show" + 0.097*"christmas" + 0.052*"george clooney" + 0.044*"nicole kidman" + 0.030*"star trek" + 0.021*"prostitution" + 0.021*"xmas theme" + 0.020*"tv" + 0.016*"environmental" + 0.015*"recorded"
2024-03-31 21:54:59,718 : INFO : topic #7 (0.020): 0.078*"cars" + 0.031*"nazis" + 0.028*"adventure" + 0.024*"john cusack" + 0.024*"fighting" + 0.022*"mst3k" + 0.022*"race" + 0.019*"divx" + 0.019*"archaeology" + 0.019*"want to see again"
2024-03-31 21:54:59,718 : INFO : topic #16 (0.020): 0.168*"betamax" + 0.148*"johnny depp" + 0.060*"tim burton" + 0.059*"dvd-video" + 0.050*"dvd-r" + 0.034*"clv" + 0.032*"scary movies to see on halloween" + 0.031*"dvd-ram" + 0.020*"dogs" + 0.020*"sad"

2024-03-31 21:55:01,406 : INFO : topic #27 (0.020): 0.410*"Horror" + 0.098*"Thriller" + 0.038*"horror" + 0.035*"vampire" + 0.027*"pirates" + 0.023*"gothic" + 0.023*"Fantasy" + 0.019*"ummarti2006" + 0.016*"tumey's dvds" + 0.015*"australia"
2024-03-31 21:55:01,407 : INFO : topic #13 (0.020): 0.204*"nudity (topless)" + 0.139*"nudity (topless - brief)" + 0.053*"teen" + 0.051*"nudity (rear)" + 0.051*"Drama" + 0.027*"will smith" + 0.027*"Comedy" + 0.026*"virtual reality" + 0.021*"food" + 0.020*"john wayne"
2024-03-31 21:55:01,408 : INFO : topic #28 (0.020): 0.070*"Drama" + 0.050*"true story" + 0.048*"drama" + 0.037*"history" + 0.035*"biography" + 0.030*"based on a true story" + 0.029*"holocaust" + 0.029*"music" + 0.028*"edward norton" + 0.018*"boxing"
2024-03-31 21:55:01,409 : INFO : topic #25 (0.020): 0.134*"aliens" + 0.043*"angelina jolie" + 0.029*"computers" + 0.029*"blaxploitation" + 0.027*"to-rent" + 0.027*"swashbuckler" + 0.026*"medieval" + 0.023*"sandra bullock" + 0.023*"owen wilson" 

2024-03-31 21:55:03,166 : INFO : topic #37 (0.020): 0.101*"murder" + 0.089*"james bond" + 0.060*"007" + 0.060*"lesbian" + 0.060*"bond" + 0.037*"assassin" + 0.028*"paris" + 0.026*"france" + 0.021*"franchise" + 0.019*"blindfold"
2024-03-31 21:55:03,168 : INFO : topic #20 (0.020): 0.070*"vhs" + 0.063*"quentin tarantino" + 0.055*"nonlinear" + 0.053*"heist" + 0.049*"vampires" + 0.042*"sean connery" + 0.041*"tarantino" + 0.034*"corvallis library" + 0.022*"gambling" + 0.017*"gangsters"
2024-03-31 21:55:03,173 : INFO : topic diff=0.064732, rho=0.240143
2024-03-31 21:55:03,175 : INFO : PROGRESS: pass 11, at document #8000/10681
2024-03-31 21:55:03,516 : INFO : merging changes from 2000 documents into a model of 10681 documents
2024-03-31 21:55:03,552 : INFO : topic #45 (0.020): 0.088*"robin williams" + 0.054*"matt damon" + 0.053*"hayao miyazaki" + 0.039*"surrealism" + 0.028*"1970s" + 0.024*"vincent price" + 0.021*"mathematics" + 0.020*"special" + 0.018*"my dvds" + 0.015*"seen more than once"

2024-03-31 21:55:05,018 : INFO : topic diff=0.039210, rho=0.233504
2024-03-31 21:55:05,018 : INFO : PROGRESS: pass 12, at document #6000/10681
2024-03-31 21:55:05,277 : INFO : merging changes from 2000 documents into a model of 10681 documents
2024-03-31 21:55:05,362 : INFO : topic #31 (0.020): 0.499*"Comedy" + 0.153*"Documentary" + 0.043*"to see" + 0.042*"zombies" + 0.028*"documentary" + 0.012*"zombie" + 0.009*"sam raimi" + 0.008*"light" + 0.007*"sundance award winner" + 0.006*"cult classic"
2024-03-31 21:55:05,367 : INFO : topic #29 (0.020): 0.077*"library vhs" + 0.046*"peter jackson" + 0.044*"philip seymour hoffman" + 0.037*"military" + 0.030*"eddie murphy" + 0.027*"christopher walken" + 0.024*"julianne moore" + 0.023*"disability" + 0.021*"forest whitaker" + 0.021*"nature"
2024-03-31 21:55:05,367 : INFO : topic #10 (0.020): 0.033*"vietnam war" + 0.029*"on dvr" + 0.028*"police" + 0.027*"suicide" + 0.027*"vietnam" + 0.026*"jude law" + 0.022*"great acting" + 0.017*"denzel washington" +

2024-03-31 21:55:07,219 : INFO : topic #37 (0.020): 0.107*"murder" + 0.095*"james bond" + 0.067*"007" + 0.064*"bond" + 0.061*"lesbian" + 0.044*"assassin" + 0.024*"paris" + 0.021*"franchise" + 0.020*"france" + 0.019*"Action"
2024-03-31 21:55:07,219 : INFO : topic #43 (0.020): 0.072*"jack nicholson" + 0.053*"british" + 0.044*"bill murray" + 0.042*"jane austen" + 0.039*"boring" + 0.033*"memory" + 0.029*"wedding" + 0.024*"relationships" + 0.021*"pregnancy" + 0.021*"marx brothers"
2024-03-31 21:55:07,219 : INFO : topic #40 (0.020): 0.280*"less than 300 ratings" + 0.150*"Drama" + 0.087*"prison" + 0.031*"sven's to see list" + 0.030*"friendship" + 0.022*"War" + 0.015*"adaptation" + 0.012*"united states" + 0.010*"prison escape" + 0.009*"women's lib"
2024-03-31 21:55:07,219 : INFO : topic #7 (0.020): 0.057*"cars" + 0.041*"nazis" + 0.038*"archaeology" + 0.029*"adventure" + 0.028*"john cusack" + 0.021*"fighting" + 0.021*"indiana jones" + 0.020*"race" + 0.020*"afternoon section" + 0.018*"want to se

2024-03-31 21:55:09,027 : INFO : topic #38 (0.020): 0.164*"fantasy" + 0.070*"racism" + 0.042*"russell crowe" + 0.025*"courtroom drama" + 0.025*"courtroom" + 0.019*"seen 2007" + 0.016*"stage" + 0.016*"movie to see" + 0.014*"foreign" + 0.014*"court"
2024-03-31 21:55:09,027 : INFO : topic #32 (0.020): 0.051*"underrated" + 0.036*"michael crichton" + 0.034*"1980s" + 0.023*"mother-son relationship" + 0.019*"do zassania" + 0.016*"gangster" + 0.016*"movie to see" + 0.016*"canada" + 0.016*"janus 50" + 0.014*"christopher lloyd"
2024-03-31 21:55:09,027 : INFO : topic diff=0.166244, rho=0.221727
2024-03-31 21:55:09,037 : INFO : PROGRESS: pass 14, at document #4000/10681
2024-03-31 21:55:09,407 : INFO : merging changes from 2000 documents into a model of 10681 documents
2024-03-31 21:55:09,495 : INFO : topic #27 (0.020): 0.394*"Horror" + 0.104*"Thriller" + 0.041*"horror" + 0.036*"vampire" + 0.024*"gothic" + 0.022*"Fantasy" + 0.022*"pirates" + 0.020*"muppets" + 0.019*"franchise" + 0.017*"ummarti2006

2024-03-31 21:55:11,927 : INFO : topic diff=0.083659, rho=0.221727
2024-03-31 21:55:11,928 : INFO : PROGRESS: pass 15, at document #2000/10681
2024-03-31 21:55:12,388 : INFO : merging changes from 2000 documents into a model of 10681 documents
2024-03-31 21:55:12,462 : INFO : topic #9 (0.020): 0.233*"Thriller" + 0.184*"Mystery" + 0.157*"Drama" + 0.062*"nudity (topless - notable)" + 0.047*"directorial debut" + 0.024*"tommy lee jones" + 0.021*"baseball" + 0.016*"watched 2007" + 0.015*"seen 2006" + 0.013*"robert rodriguez"
2024-03-31 21:55:12,464 : INFO : topic #8 (0.020): 0.377*"Drama" + 0.248*"Romance" + 0.177*"Comedy" + 0.026*"pixar" + 0.018*"bibliothek" + 0.017*"movie to see" + 0.011*"Fantasy" + 0.008*"library" + 0.008*"journalism" + 0.005*"football"
2024-03-31 21:55:12,466 : INFO : topic #38 (0.020): 0.165*"fantasy" + 0.069*"racism" + 0.042*"russell crowe" + 0.025*"courtroom drama" + 0.025*"courtroom" + 0.019*"seen 2007" + 0.016*"stage" + 0.016*"movie to see" + 0.014*"foreign" + 0.01

2024-03-31 21:55:15,345 : INFO : topic #1 (0.020): 0.072*"oscar (best picture)" + 0.042*"overrated" + 0.039*"tom hanks" + 0.036*"oscar (best cinematography)" + 0.028*"oscar (best directing)" + 0.028*"oscar (best actor)" + 0.022*"imdb top 250" + 0.022*"drama" + 0.020*"tumey's dvds" + 0.020*"forceful"
2024-03-31 21:55:15,345 : INFO : topic #28 (0.020): 0.072*"Drama" + 0.052*"true story" + 0.050*"based on a true story" + 0.040*"biography" + 0.040*"history" + 0.037*"drama" + 0.027*"music" + 0.026*"holocaust" + 0.022*"edward norton" + 0.019*"historical"
2024-03-31 21:55:15,345 : INFO : topic #10 (0.020): 0.035*"suicide" + 0.035*"great acting" + 0.033*"keira knightley" + 0.028*"on dvr" + 0.027*"jude law" + 0.027*"police" + 0.024*"male nudity" + 0.020*"elijah wood" + 0.019*"denzel washington" + 0.019*"torture"
2024-03-31 21:55:15,348 : INFO : topic #32 (0.020): 0.050*"underrated" + 0.041*"1980s" + 0.029*"do zassania" + 0.028*"mother-son relationship" + 0.026*"michael crichton" + 0.023*"movie 

2024-03-31 21:55:17,819 : INFO : topic #14 (0.020): 0.182*"classic" + 0.046*"national film registry" + 0.039*"erlend's dvds" + 0.039*"oscar (best actress)" + 0.035*"afi 100 (laughs)" + 0.030*"woody allen" + 0.027*"Drama" + 0.023*"family" + 0.023*"black and white" + 0.022*"afi 100"
2024-03-31 21:55:17,820 : INFO : topic #6 (0.020): 0.240*"r" + 0.106*"clearplay" + 0.060*"Drama" + 0.030*"adultery" + 0.028*"shakespeare" + 0.025*"movie to see" + 0.024*"based on a play" + 0.021*"to see" + 0.019*"kevin spacey" + 0.018*"adapted from:play"
2024-03-31 21:55:17,820 : INFO : topic diff=0.038219, rho=0.211570
2024-03-31 21:55:18,298 : INFO : -20.897 per-word bound, 1952912.8 perplexity estimate based on a held-out corpus of 681 documents with 5199 words
2024-03-31 21:55:18,301 : INFO : PROGRESS: pass 16, at document #10681/10681
2024-03-31 21:55:18,512 : INFO : merging changes from 681 documents into a model of 10681 documents
2024-03-31 21:55:18,602 : INFO : topic #11 (0.020): 0.110*"Comedy" + 0.1

2024-03-31 21:55:20,267 : INFO : topic diff=0.046518, rho=0.206988
2024-03-31 21:55:20,269 : INFO : PROGRESS: pass 17, at document #10000/10681
2024-03-31 21:55:20,562 : INFO : merging changes from 2000 documents into a model of 10681 documents
2024-03-31 21:55:20,649 : INFO : topic #13 (0.020): 0.237*"nudity (topless)" + 0.147*"nudity (topless - brief)" + 0.056*"nudity (rear)" + 0.054*"Drama" + 0.043*"teen" + 0.023*"jonossa" + 0.023*"will smith" + 0.019*"food" + 0.019*"john wayne" + 0.019*"virtual reality"
2024-03-31 21:55:20,649 : INFO : topic #31 (0.020): 0.488*"Comedy" + 0.159*"Documentary" + 0.055*"zombies" + 0.045*"to see" + 0.041*"documentary" + 0.012*"zombie" + 0.008*"sam raimi" + 0.007*"sundance award winner" + 0.005*"bruce campbell" + 0.005*"sundance grand jury prize"
2024-03-31 21:55:20,656 : INFO : topic #17 (0.020): 0.059*"mafia" + 0.043*"tom cruise" + 0.042*"espionage" + 0.040*"al pacino" + 0.027*"organized crime" + 0.023*"guns" + 0.021*"marlon brando" + 0.019*"car chase"

2024-03-31 21:55:22,302 : INFO : topic #1 (0.020): 0.083*"oscar (best picture)" + 0.045*"tom hanks" + 0.041*"oscar (best cinematography)" + 0.035*"oscar (best actor)" + 0.033*"oscar (best directing)" + 0.032*"overrated" + 0.027*"drama" + 0.022*"tumey's dvds" + 0.021*"forceful" + 0.020*"imdb top 250"
2024-03-31 21:55:22,304 : INFO : topic #43 (0.020): 0.061*"jack nicholson" + 0.050*"memory" + 0.048*"british" + 0.044*"bill murray" + 0.041*"boring" + 0.032*"jane austen" + 0.031*"marx brothers" + 0.031*"wedding" + 0.022*"hugh grant" + 0.022*"relationships"
2024-03-31 21:55:22,305 : INFO : topic #47 (0.020): 0.135*"world war ii" + 0.109*"comic book" + 0.080*"superhero" + 0.050*"super-hero" + 0.050*"war" + 0.049*"War" + 0.025*"adapted from:comic" + 0.024*"sean penn" + 0.023*"Action" + 0.018*"propaganda"
2024-03-31 21:55:22,305 : INFO : topic #7 (0.020): 0.073*"cars" + 0.033*"nazis" + 0.032*"adventure" + 0.027*"mst3k" + 0.027*"archaeology" + 0.026*"john cusack" + 0.023*"want to see again" + 0

2024-03-31 21:55:24,515 : INFO : topic #40 (0.020): 0.281*"less than 300 ratings" + 0.152*"Drama" + 0.080*"prison" + 0.032*"friendship" + 0.032*"sven's to see list" + 0.022*"War" + 0.016*"adaptation" + 0.011*"daniel day-lewis" + 0.010*"united states" + 0.010*"women's lib"
2024-03-31 21:55:24,517 : INFO : topic #37 (0.020): 0.102*"murder" + 0.088*"james bond" + 0.061*"lesbian" + 0.059*"007" + 0.059*"bond" + 0.038*"assassin" + 0.028*"paris" + 0.026*"france" + 0.022*"franchise" + 0.019*"blindfold"
2024-03-31 21:55:24,519 : INFO : topic #46 (0.020): 0.071*"clint eastwood" + 0.050*"pg13" + 0.040*"los angeles" + 0.039*"ridley scott" + 0.039*"western" + 0.034*"seen 2008" + 0.032*"road trip" + 0.031*"spaghetti western" + 0.026*"submarine" + 0.025*"sergio leone"
2024-03-31 21:55:24,519 : INFO : topic #14 (0.020): 0.214*"classic" + 0.044*"national film registry" + 0.038*"erlend's dvds" + 0.037*"oscar (best actress)" + 0.035*"afi 100 (laughs)" + 0.029*"afi 100" + 0.028*"woody allen" + 0.026*"Dram

2024-03-31 21:55:27,177 : INFO : topic #25 (0.020): 0.156*"aliens" + 0.046*"angelina jolie" + 0.030*"sandra bullock" + 0.028*"computers" + 0.028*"to-rent" + 0.027*"medieval" + 0.023*"owen wilson" + 0.022*"alien invasion" + 0.021*"blaxploitation" + 0.020*"scary"
2024-03-31 21:55:27,181 : INFO : topic #27 (0.020): 0.398*"Horror" + 0.103*"Thriller" + 0.041*"horror" + 0.036*"vampire" + 0.025*"gothic" + 0.022*"pirates" + 0.022*"Fantasy" + 0.019*"franchise" + 0.019*"muppets" + 0.018*"ummarti2006"
2024-03-31 21:55:27,187 : INFO : topic diff=0.031210, rho=0.194844
2024-03-31 21:55:27,190 : INFO : PROGRESS: pass 20, at document #6000/10681
2024-03-31 21:55:27,548 : INFO : merging changes from 2000 documents into a model of 10681 documents
2024-03-31 21:55:27,606 : INFO : topic #20 (0.020): 0.069*"vhs" + 0.067*"quentin tarantino" + 0.053*"heist" + 0.052*"nonlinear" + 0.049*"vampires" + 0.043*"tarantino" + 0.042*"sean connery" + 0.036*"corvallis library" + 0.023*"gambling" + 0.017*"gangsters"

2024-03-31 21:55:29,326 : INFO : topic diff=0.140193, rho=0.191248
2024-03-31 21:55:29,328 : INFO : PROGRESS: pass 21, at document #4000/10681
2024-03-31 21:55:29,677 : INFO : merging changes from 2000 documents into a model of 10681 documents
2024-03-31 21:55:29,745 : INFO : topic #21 (0.020): 0.076*"bruce willis" + 0.062*"twist ending" + 0.056*"cult film" + 0.052*"ghosts" + 0.052*"new york city" + 0.039*"easily confused with other movie(s) (title)" + 0.029*"new york" + 0.029*"interesting" + 0.023*"ensemble cast" + 0.018*"dvd"
2024-03-31 21:55:29,746 : INFO : topic #23 (0.020): 0.061*"disturbing" + 0.058*"tense" + 0.052*"atmospheric" + 0.050*"hitchcock" + 0.043*"Thriller" + 0.038*"alfred hitchcock" + 0.037*"tumey's dvds" + 0.033*"menacing" + 0.030*"ominous" + 0.027*"stylized"
2024-03-31 21:55:29,747 : INFO : topic #46 (0.020): 0.065*"clint eastwood" + 0.057*"pg13" + 0.045*"los angeles" + 0.042*"western" + 0.036*"road trip" + 0.035*"spaghetti western" + 0.035*"ridley scott" + 0.032*"se

2024-03-31 21:55:31,562 : INFO : topic #3 (0.020): 0.083*"politics" + 0.067*"serial killer" + 0.065*"crime" + 0.051*"brad pitt" + 0.043*"psychology" + 0.040*"violence" + 0.037*"morgan freeman" + 0.036*"owned" + 0.024*"murder" + 0.023*"violent"
2024-03-31 21:55:31,565 : INFO : topic #39 (0.020): 0.077*"mel gibson" + 0.077*"samuel l. jackson" + 0.068*"revenge" + 0.041*"uma thurman" + 0.038*"kidnapping" + 0.024*"hollywood" + 0.023*"predictable" + 0.019*"witch" + 0.018*"steve martin" + 0.013*"environment"
2024-03-31 21:55:31,566 : INFO : topic #9 (0.020): 0.233*"Thriller" + 0.183*"Mystery" + 0.159*"Drama" + 0.062*"nudity (topless - notable)" + 0.048*"directorial debut" + 0.023*"tommy lee jones" + 0.021*"baseball" + 0.017*"watched 2007" + 0.015*"seen 2006" + 0.013*"robert rodriguez"
2024-03-31 21:55:31,568 : INFO : topic #14 (0.020): 0.243*"classic" + 0.044*"national film registry" + 0.035*"erlend's dvds" + 0.033*"afi 100" + 0.033*"oscar (best actress)" + 0.029*"afi 100 (laughs)" + 0.028*"w

2024-03-31 21:55:33,282 : INFO : topic #24 (0.020): 0.117*"romance" + 0.061*"pg-13" + 0.052*"bechdel test:fail" + 0.047*"chick flick" + 0.038*"video game adaptation" + 0.037*"Romance" + 0.027*"girlie movie" + 0.024*"claymation" + 0.021*"adam sandler" + 0.021*"julia roberts"
2024-03-31 21:55:33,282 : INFO : topic #35 (0.020): 0.050*"motorcycle" + 0.046*"india" + 0.041*"assassination" + 0.040*"dance" + 0.038*"cold war" + 0.035*"marriage" + 0.032*"infidelity" + 0.023*"betrayal" + 0.021*"germany" + 0.019*"pointless"
2024-03-31 21:55:33,291 : INFO : topic diff=0.070602, rho=0.187844
2024-03-31 21:55:33,291 : INFO : PROGRESS: pass 23, at document #2000/10681
2024-03-31 21:55:33,567 : INFO : merging changes from 2000 documents into a model of 10681 documents
2024-03-31 21:55:33,636 : INFO : topic #27 (0.020): 0.384*"Horror" + 0.108*"Thriller" + 0.044*"horror" + 0.035*"vampire" + 0.027*"pirates" + 0.023*"Fantasy" + 0.022*"ummarti2006" + 0.021*"gothic" + 0.020*"franchise" + 0.018*"slasher"

2024-03-31 21:55:35,541 : INFO : topic diff=0.032676, rho=0.184615
2024-03-31 21:55:35,707 : INFO : -20.894 per-word bound, 1948087.1 perplexity estimate based on a held-out corpus of 681 documents with 5199 words
2024-03-31 21:55:35,708 : INFO : PROGRESS: pass 23, at document #10681/10681
2024-03-31 21:55:35,890 : INFO : merging changes from 681 documents into a model of 10681 documents
2024-03-31 21:55:35,970 : INFO : topic #41 (0.020): 0.099*"sports" + 0.055*"africa" + 0.040*"french" + 0.034*"peter sellers" + 0.034*"basketball" + 0.033*"hulu" + 0.031*"tragedy" + 0.028*"milla jovovich" + 0.026*"science" + 0.024*"gangs"
2024-03-31 21:55:35,976 : INFO : topic #3 (0.020): 0.091*"politics" + 0.054*"serial killer" + 0.051*"crime" + 0.047*"psychology" + 0.044*"brad pitt" + 0.041*"violence" + 0.038*"owned" + 0.034*"morgan freeman" + 0.026*"violent" + 0.026*"murder"
2024-03-31 21:55:35,976 : INFO : topic #47 (0.020): 0.122*"comic book" + 0.114*"world war ii" + 0.101*"superhero" + 0.045*"War"

2024-03-31 21:55:38,129 : INFO : topic #40 (0.020): 0.354*"less than 300 ratings" + 0.172*"Drama" + 0.054*"prison" + 0.032*"sven's to see list" + 0.029*"War" + 0.026*"friendship" + 0.014*"adaptation" + 0.011*"united states" + 0.008*"police corruption" + 0.008*"trilogy"
2024-03-31 21:55:38,130 : INFO : topic #17 (0.020): 0.061*"mafia" + 0.043*"tom cruise" + 0.041*"espionage" + 0.040*"al pacino" + 0.028*"organized crime" + 0.023*"guns" + 0.021*"marlon brando" + 0.019*"car chase" + 0.017*"my movies" + 0.016*"francis ford coppola"
2024-03-31 21:55:38,131 : INFO : topic #25 (0.020): 0.121*"aliens" + 0.049*"angelina jolie" + 0.043*"to-rent" + 0.029*"owen wilson" + 0.027*"medieval" + 0.027*"swashbuckler" + 0.024*"computers" + 0.024*"sandra bullock" + 0.023*"blaxploitation" + 0.019*"alien invasion"
2024-03-31 21:55:38,133 : INFO : topic #35 (0.020): 0.056*"motorcycle" + 0.048*"dance" + 0.034*"india" + 0.031*"infidelity" + 0.031*"marriage" + 0.029*"assassination" + 0.028*"cold war" + 0.023*"poi

2024-03-31 21:55:39,876 : INFO : topic #41 (0.020): 0.111*"sports" + 0.048*"peter sellers" + 0.044*"french" + 0.040*"hulu" + 0.037*"africa" + 0.033*"basketball" + 0.029*"gangs" + 0.026*"tragedy" + 0.023*"milla jovovich" + 0.018*"science"
2024-03-31 21:55:39,876 : INFO : topic #7 (0.020): 0.072*"cars" + 0.034*"nazis" + 0.029*"adventure" + 0.028*"archaeology" + 0.026*"mst3k" + 0.026*"john cusack" + 0.023*"want to see again" + 0.021*"fighting" + 0.020*"race" + 0.018*"afternoon section"
2024-03-31 21:55:39,885 : INFO : topic diff=0.039675, rho=0.178627
2024-03-31 21:55:39,885 : INFO : PROGRESS: pass 25, at document #10000/10681
2024-03-31 21:55:40,106 : INFO : merging changes from 2000 documents into a model of 10681 documents
2024-03-31 21:55:40,149 : INFO : topic #30 (0.020): 0.128*"Children" + 0.115*"Musical" + 0.099*"Comedy" + 0.066*"Animation" + 0.064*"disney" + 0.056*"Fantasy" + 0.055*"Adventure" + 0.050*"animation" + 0.036*"musical" + 0.021*"children"

2024-03-31 21:55:41,438 : INFO : topic diff=0.044364, rho=0.175844
2024-03-31 21:55:41,438 : INFO : PROGRESS: pass 26, at document #8000/10681
2024-03-31 21:55:41,816 : INFO : merging changes from 2000 documents into a model of 10681 documents
2024-03-31 21:55:41,857 : INFO : topic #28 (0.020): 0.070*"Drama" + 0.050*"true story" + 0.048*"drama" + 0.036*"history" + 0.036*"biography" + 0.032*"based on a true story" + 0.029*"music" + 0.029*"holocaust" + 0.027*"edward norton" + 0.019*"boxing"
2024-03-31 21:55:41,859 : INFO : topic #40 (0.020): 0.326*"less than 300 ratings" + 0.168*"Drama" + 0.065*"prison" + 0.032*"sven's to see list" + 0.028*"friendship" + 0.024*"War" + 0.014*"adaptation" + 0.009*"united states" + 0.008*"women's lib" + 0.008*"trilogy"
2024-03-31 21:55:41,860 : INFO : topic #36 (0.020): 0.097*"magic" + 0.060*"sequel" + 0.051*"harrison ford" + 0.044*"fantasy" + 0.029*"george lucas" + 0.028*"funniest movies" + 0.022*"brazil" + 0.021*"space opera" + 0.020*"adventure" + 0.015*"

2024-03-31 21:55:43,735 : INFO : topic #4 (0.020): 0.075*"quirky" + 0.062*"surreal" + 0.054*"humorous" + 0.048*"dark comedy" + 0.039*"satirical" + 0.032*"black comedy" + 0.031*"Comedy" + 0.030*"nicolas cage" + 0.030*"irreverent" + 0.026*"biting"
2024-03-31 21:55:43,737 : INFO : topic #47 (0.020): 0.137*"world war ii" + 0.108*"comic book" + 0.085*"superhero" + 0.051*"war" + 0.046*"War" + 0.045*"super-hero" + 0.027*"adapted from:comic" + 0.022*"Action" + 0.021*"sean penn" + 0.020*"wwii"
2024-03-31 21:55:43,738 : INFO : topic #2 (0.020): 0.125*"Drama" + 0.103*"War" + 0.062*"criterion" + 0.036*"martial arts" + 0.032*"atmospheric" + 0.025*"bleak" + 0.023*"poignant" + 0.021*"tumey's dvds" + 0.019*"jackie chan" + 0.019*"talky"
2024-03-31 21:55:43,738 : INFO : topic #42 (0.020): 0.137*"drugs" + 0.070*"religion" + 0.053*"Drama" + 0.033*"coming of age" + 0.032*"london" + 0.025*"christianity" + 0.022*"depressing" + 0.021*"television" + 0.020*"ewan mcgregor" + 0.019*"notable soundtrack"

2024-03-31 21:55:45,317 : INFO : topic #31 (0.020): 0.486*"Comedy" + 0.166*"Documentary" + 0.055*"to see" + 0.046*"zombies" + 0.030*"documentary" + 0.009*"zombie" + 0.008*"sam raimi" + 0.007*"sundance award winner" + 0.006*"bechdel test:pass" + 0.005*"upbeat"
2024-03-31 21:55:45,317 : INFO : topic #13 (0.020): 0.239*"nudity (topless)" + 0.159*"nudity (topless - brief)" + 0.056*"Drama" + 0.053*"teen" + 0.051*"nudity (rear)" + 0.030*"will smith" + 0.022*"virtual reality" + 0.021*"food" + 0.017*"meg ryan" + 0.013*"john wayne"
2024-03-31 21:55:45,317 : INFO : topic diff=0.026753, rho=0.170646
2024-03-31 21:55:45,317 : INFO : PROGRESS: pass 28, at document #6000/10681
2024-03-31 21:55:45,515 : INFO : merging changes from 2000 documents into a model of 10681 documents
2024-03-31 21:55:45,549 : INFO : topic #38 (0.020): 0.178*"fantasy" + 0.073*"racism" + 0.041*"russell crowe" + 0.030*"courtroom" + 0.027*"courtroom drama" + 0.018*"court" + 0.017*"adventure" + 0.015*"stage" + 0.014*"judaism" + 

2024-03-31 21:55:46,733 : INFO : topic diff=0.121515, rho=0.168215
2024-03-31 21:55:46,735 : INFO : PROGRESS: pass 29, at document #4000/10681
2024-03-31 21:55:47,065 : INFO : merging changes from 2000 documents into a model of 10681 documents
2024-03-31 21:55:47,112 : INFO : topic #0 (0.020): 0.052*"Drama" + 0.041*"lyrical" + 0.028*"dark" + 0.026*"enigmatic" + 0.026*"dreamlike" + 0.024*"bittersweet" + 0.022*"beautiful" + 0.022*"whimsical" + 0.022*"hw drama" + 0.021*"meditative"
2024-03-31 21:55:47,112 : INFO : topic #8 (0.020): 0.377*"Drama" + 0.243*"Romance" + 0.179*"Comedy" + 0.027*"pixar" + 0.019*"bibliothek" + 0.014*"movie to see" + 0.010*"Fantasy" + 0.008*"journalism" + 0.008*"library" + 0.007*"football"
2024-03-31 21:55:47,112 : INFO : topic #47 (0.020): 0.130*"world war ii" + 0.112*"comic book" + 0.095*"superhero" + 0.046*"super-hero" + 0.046*"war" + 0.044*"War" + 0.029*"adapted from:comic" + 0.022*"batman" + 0.022*"Action" + 0.021*"sean penn"
2024-03-31 21:55:47,116 : INFO : t
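
Two numbers recur in the training log above and below: `rho` on the `topic diff=..., rho=...` lines, and the perplexity estimate on the `per-word bound` lines. The sketch below reproduces both from values that appear in the log. It assumes `chunksize=2000`, `offset=1.0` and `decay=0.5`; the notebook never states these settings, but they are consistent with the "merging changes from 2000 documents" messages and with the logged `rho` values. The function names are hypothetical helpers, not part of gensim.

```python
# Sketch: reproduce two quantities printed in the gensim LDA training log.
# Assumptions (not stated in the notebook): chunksize=2000, offset=1.0, decay=0.5.

NUM_DOCS = 10681   # from "into a model of 10681 documents"
CHUNKSIZE = 2000   # assumed; matches "merging changes from 2000 documents"
OFFSET = 1.0       # assumed
DECAY = 0.5        # assumed


def rho_for_pass(pass_index: int) -> float:
    """Step size for passes after the first, when all documents have been seen once.

    The weight given to each new mini-batch shrinks as the pass index grows.
    """
    return (OFFSET + pass_index + NUM_DOCS / CHUNKSIZE) ** (-DECAY)


def perplexity_from_bound(per_word_bound: float) -> float:
    """Convert the logged per-word bound (base 2) into a perplexity estimate."""
    return 2.0 ** (-per_word_bound)


# Compare with "rho=0.178627" logged during pass 25.
print(f"rho at pass 25: {rho_for_pass(25):.6f}")

# Compare with "-21.496 per-word bound, 2957512.5 perplexity estimate".
# The bound is rounded to three decimals in the log, so the match is approximate.
print(f"perplexity for bound -21.496: {perplexity_from_bound(-21.496):.1f}")
```

Read this way, a steadily falling `rho` means later passes change the topics less. The perplexity estimate in this log drops from about 2.96 million in pass 0 to about 2.05 million by pass 15 and barely moves afterwards.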

In [4]:
# Recommendation with the LDA content-based recommender
from src.lda_content import LDAContentRecommender
recommender = LDAContentRecommender()
recommend_result = recommender.recommend(movielens)

2024-03-31 21:55:48,475 : INFO : adding document #0 to Dictionary<0 unique tokens: []>
2024-03-31 21:55:48,575 : INFO : adding document #10000 to Dictionary<14749 unique tokens: ['3d', 'Adventure', 'Animation', 'Children', 'Comedy']...>
2024-03-31 21:55:48,588 : INFO : built Dictionary<15261 unique tokens: ['3d', 'Adventure', 'Animation', 'Children', 'Comedy']...> from 10681 documents (total 117144 corpus positions)
2024-03-31 21:55:48,589 : INFO : Dictionary lifecycle event {'msg': "built Dictionary<15261 unique tokens: ['3d', 'Adventure', 'Animation', 'Children', 'Comedy']...> from 10681 documents (total 117144 corpus positions)", 'datetime': '2024-03-31T21:55:48.589759', 'gensim': '4.3.2', 'python': '3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.22621-SP0', 'event': 'created'}
2024-03-31 21:55:48,655 : INFO : using symmetric alpha at 0.02
2024-03-31 21:55:48,660 : INFO : using symmetric eta at 0.02

2024-03-31 21:55:50,674 : INFO : topic #16 (0.020): 0.080*"nudity (rear)" + 0.053*"pg" + 0.041*"military" + 0.035*"claymation" + 0.031*"weird" + 0.030*"nudity (topless)" + 0.024*"aardman" + 0.019*"nudity (topless - brief)" + 0.019*"childhood" + 0.018*"penguins"
2024-03-31 21:55:50,677 : INFO : topic diff=0.241935, rho=0.447214
2024-03-31 21:55:50,825 : INFO : -21.496 per-word bound, 2957512.5 perplexity estimate based on a held-out corpus of 681 documents with 5199 words
2024-03-31 21:55:50,825 : INFO : PROGRESS: pass 0, at document #10681/10681
2024-03-31 21:55:50,940 : INFO : merging changes from 681 documents into a model of 10681 documents
2024-03-31 21:55:51,003 : INFO : topic #36 (0.020): 0.035*"russell crowe" + 0.031*"virus" + 0.029*"not available from netflix" + 0.028*"Comedy" + 0.026*"based on a book" + 0.025*"cult film" + 0.025*"peter jackson" + 0.024*"movielens quickpick" + 0.023*"1980s" + 0.021*"adam sandler"
2024-03-31 21:55:51,005 : INFO : topic #6 (0.020): 0.125*"Musical

2024-03-31 21:55:52,625 : INFO : topic #42 (0.020): 0.086*"martial arts" + 0.059*"robin williams" + 0.042*"jonossa" + 0.041*"seen 2006" + 0.031*"japan" + 0.029*"samurai" + 0.027*"Drama" + 0.025*"orson welles" + 0.024*"hugh grant" + 0.023*"divorce"
2024-03-31 21:55:52,625 : INFO : topic #41 (0.020): 0.141*"remake" + 0.058*"new york city" + 0.049*"christmas" + 0.044*"family" + 0.037*"new york" + 0.031*"Drama" + 0.027*"seen" + 0.024*"kevin spacey" + 0.023*"predictable" + 0.021*"friday night movie"
2024-03-31 21:55:52,625 : INFO : topic #5 (0.020): 0.162*"classic" + 0.082*"musical" + 0.027*"afi 100 (laughs)" + 0.024*"los angeles" + 0.018*"witch" + 0.018*"national film registry" + 0.016*"john travolta" + 0.015*"Musical" + 0.014*"adapted from b'way" + 0.013*"movie to see"
2024-03-31 21:55:52,625 : INFO : topic #49 (0.020): 0.081*"national film registry" + 0.026*"natalie portman" + 0.025*"tumey's dvds" + 0.022*"Film-Noir" + 0.020*"film noir" + 0.020*"black and white" + 0.019*"imdb top 250" + 

2024-03-31 21:55:54,666 : INFO : topic #45 (0.020): 0.235*"less than 300 ratings" + 0.109*"Drama" + 0.061*"nudity (topless - notable)" + 0.036*"Comedy" + 0.035*"tim burton" + 0.023*"library" + 0.018*"michael moore" + 0.015*"blindfold" + 0.014*"depressing" + 0.011*"golden raspberry (worst actor)"
2024-03-31 21:55:54,670 : INFO : topic #37 (0.020): 0.185*"War" + 0.109*"world war ii" + 0.103*"Drama" + 0.047*"history" + 0.047*"war" + 0.035*"jim carrey" + 0.025*"Action" + 0.019*"70mm" + 0.014*"nazis" + 0.013*"mental illness"
2024-03-31 21:55:54,670 : INFO : topic #38 (0.020): 0.142*"drugs" + 0.037*"samuel l. jackson" + 0.035*"peter sellers" + 0.030*"ewan mcgregor" + 0.024*"divx" + 0.022*"poverty" + 0.020*"addiction" + 0.019*"hulu" + 0.018*"michael caine" + 0.017*"Drama"
2024-03-31 21:55:54,675 : INFO : topic diff=0.119826, rho=0.346261
2024-03-31 21:55:54,675 : INFO : PROGRESS: pass 2, at document #10000/10681
2024-03-31 21:55:54,982 : INFO : merging changes from 2000 documents into a model

2024-03-31 21:55:56,650 : INFO : topic #11 (0.020): 0.127*"religion" + 0.115*"keanu reeves" + 0.020*"dark" + 0.018*"reality tv" + 0.018*"courtesan" + 0.018*"slash" + 0.017*"slavery" + 0.017*"facebook rec" + 0.014*"class issues" + 0.012*"product placement"
2024-03-31 21:55:56,655 : INFO : topic diff=0.113857, rho=0.327201
2024-03-31 21:55:56,663 : INFO : PROGRESS: pass 3, at document #8000/10681
2024-03-31 21:55:56,885 : INFO : merging changes from 2000 documents into a model of 10681 documents
2024-03-31 21:55:56,955 : INFO : topic #20 (0.020): 0.086*"biography" + 0.060*"Drama" + 0.043*"biopic" + 0.042*"ingmar bergman" + 0.035*"ensemble cast" + 0.031*"julia roberts" + 0.030*"suicide" + 0.028*"corvallis library" + 0.027*"multiple storylines" + 0.025*"medieval"
2024-03-31 21:55:56,964 : INFO : topic #47 (0.020): 0.059*"serial killer" + 0.046*"psychology" + 0.045*"brad pitt" + 0.041*"edward norton" + 0.033*"heist" + 0.031*"hayao miyazaki" + 0.031*"steven spielberg" + 0.029*"crime" + 0.029

2024-03-31 21:55:58,460 : INFO : topic diff=0.057320, rho=0.310978
2024-03-31 21:55:58,460 : INFO : PROGRESS: pass 4, at document #6000/10681
2024-03-31 21:55:58,677 : INFO : merging changes from 2000 documents into a model of 10681 documents
2024-03-31 21:55:58,713 : INFO : topic #40 (0.020): 0.236*"r" + 0.117*"clearplay" + 0.053*"prison" + 0.051*"movie to see" + 0.039*"Drama" + 0.033*"morgan freeman" + 0.025*"friendship" + 0.020*"john wayne" + 0.019*"mel brooks" + 0.017*"conspiracy"
2024-03-31 21:55:58,715 : INFO : topic #5 (0.020): 0.268*"classic" + 0.066*"musical" + 0.046*"afi 100 (laughs)" + 0.028*"Musical" + 0.027*"national film registry" + 0.020*"afi 100" + 0.019*"john travolta" + 0.012*"delights" + 0.012*"adapted from b'way" + 0.012*"breakthroughs"
2024-03-31 21:55:58,715 : INFO : topic #49 (0.020): 0.076*"national film registry" + 0.049*"imdb top 250" + 0.035*"black and white" + 0.032*"afi 100" + 0.030*"tumey's dvds" + 0.027*"afi 100 (thrills)" + 0.024*"afi 100 (movie quotes)"

2024-03-31 21:56:00,485 : INFO : topic #6 (0.020): 0.098*"Musical" + 0.090*"politics" + 0.067*"satire" + 0.052*"sean connery" + 0.048*"nicolas cage" + 0.030*"terrorism" + 0.030*"dvd-r" + 0.029*"police" + 0.022*"political" + 0.020*"good dialogue"
2024-03-31 21:56:00,485 : INFO : topic #28 (0.020): 0.083*"oscar (best cinematography)" + 0.066*"Drama" + 0.052*"oscar (best supporting actress)" + 0.047*"oscar (best actress)" + 0.041*"oscar (best actor)" + 0.039*"in netflix queue" + 0.027*"remade" + 0.019*"marlon brando" + 0.017*"70mm" + 0.016*"Romance"
2024-03-31 21:56:00,485 : INFO : topic #32 (0.020): 0.137*"johnny depp" + 0.064*"clint eastwood" + 0.045*"vhs" + 0.043*"western" + 0.041*"jackie chan" + 0.038*"spaghetti western" + 0.032*"sergio leone" + 0.031*"kung fu" + 0.027*"india" + 0.026*"david lynch"
2024-03-31 21:56:00,485 : INFO : topic #36 (0.020): 0.091*"cult film" + 0.049*"russell crowe" + 0.034*"1980s" + 0.033*"downbeat" + 0.032*"adam sandler" + 0.025*"peter jackson" + 0.021*"80s"

2024-03-31 21:56:01,948 : INFO : topic #37 (0.020): 0.179*"War" + 0.106*"world war ii" + 0.101*"Drama" + 0.061*"war" + 0.045*"jim carrey" + 0.044*"history" + 0.028*"Action" + 0.023*"nazis" + 0.018*"wwii" + 0.015*"mental illness"
2024-03-31 21:56:01,948 : INFO : topic #43 (0.020): 0.067*"surreal" + 0.056*"stanley kubrick" + 0.037*"satirical" + 0.035*"cynical" + 0.032*"narrated" + 0.029*"dreamlike" + 0.029*"irreverent" + 0.026*"quirky" + 0.025*"biting" + 0.023*"hallucinatory"
2024-03-31 21:56:01,954 : INFO : topic diff=0.233043, rho=0.284665
2024-03-31 21:56:01,954 : INFO : PROGRESS: pass 6, at document #4000/10681
2024-03-31 21:56:02,164 : INFO : merging changes from 2000 documents into a model of 10681 documents
2024-03-31 21:56:02,196 : INFO : topic #46 (0.020): 0.089*"action" + 0.064*"sci-fi" + 0.047*"fantasy" + 0.033*"adventure" + 0.028*"dvd" + 0.024*"seen at the cinema" + 0.023*"seen more than once" + 0.021*"Adventure" + 0.021*"space" + 0.020*"harrison ford"

2024-03-31 21:56:03,401 : INFO : topic #0 (0.020): 0.115*"can't remember" + 0.110*"based on a tv show" + 0.087*"directorial debut" + 0.071*"Comedy" + 0.056*"ummarti2006" + 0.055*"keira knightley" + 0.028*"australia" + 0.026*"immigrants" + 0.026*"dani2006" + 0.025*"australian"
2024-03-31 21:56:03,405 : INFO : topic diff=0.116521, rho=0.284665
2024-03-31 21:56:03,406 : INFO : PROGRESS: pass 7, at document #2000/10681
2024-03-31 21:56:03,704 : INFO : merging changes from 2000 documents into a model of 10681 documents
2024-03-31 21:56:03,735 : INFO : topic #34 (0.020): 0.069*"ghosts" + 0.049*"bill murray" + 0.044*"television" + 0.028*"courtroom drama" + 0.027*"secret service" + 0.026*"courtroom" + 0.019*"roman polanski" + 0.018*"18th century" + 0.018*"opera" + 0.017*"smoking"
2024-03-31 21:56:03,735 : INFO : topic #13 (0.020): 0.056*"fairy tale" + 0.054*"jane austen" + 0.037*"assassination" + 0.035*"historical" + 0.033*"road trip" + 0.033*"cold war" + 0.029*"sequel" + 0.023*"gerard depardi

2024-03-31 21:56:04,967 : INFO : PROGRESS: pass 7, at document #10681/10681
2024-03-31 21:56:05,114 : INFO : merging changes from 681 documents into a model of 10681 documents
2024-03-31 21:56:05,149 : INFO : topic #21 (0.020): 0.074*"Sci-Fi" + 0.064*"aliens" + 0.044*"will smith" + 0.034*"space" + 0.033*"tommy lee jones" + 0.030*"Action" + 0.026*"ridley scott" + 0.025*"sci-fi" + 0.025*"futuristmovies.com" + 0.021*"monster"
2024-03-31 21:56:05,149 : INFO : topic #37 (0.020): 0.229*"War" + 0.125*"Drama" + 0.095*"world war ii" + 0.051*"history" + 0.048*"war" + 0.039*"Action" + 0.023*"jim carrey" + 0.022*"nazis" + 0.011*"mental illness" + 0.010*"wwii"
2024-03-31 21:56:05,149 : INFO : topic #26 (0.020): 0.043*"Drama" + 0.030*"james bond" + 0.021*"assassin" + 0.020*"poignant" + 0.020*"reflective" + 0.019*"atmospheric" + 0.019*"bittersweet" + 0.018*"bond" + 0.018*"007" + 0.017*"lyrical"
2024-03-31 21:56:05,154 : INFO : topic #46 (0.020): 0.077*"action" + 0.065*"fantasy" + 0.049*"sci-fi" + 0.0

2024-03-31 21:56:06,805 : INFO : topic #3 (0.020): 0.131*"oscar (best picture)" + 0.056*"oscar (best directing)" + 0.053*"oscar (best actor)" + 0.037*"al pacino" + 0.027*"oscar (best supporting actor)" + 0.024*"afi 100 (cheers)" + 0.022*"Drama" + 0.020*"great acting" + 0.019*"tumey's dvds" + 0.019*"civil war"
2024-03-31 21:56:06,805 : INFO : topic #15 (0.020): 0.109*"disney" + 0.100*"animation" + 0.087*"pixar" + 0.050*"pirates" + 0.040*"Animation" + 0.039*"children" + 0.030*"Children" + 0.028*"angelina jolie" + 0.024*"disney animated feature" + 0.022*"cartoon"
2024-03-31 21:56:06,815 : INFO : topic #14 (0.020): 0.060*"criterion" + 0.055*"disturbing" + 0.053*"Drama" + 0.051*"tense" + 0.046*"atmospheric" + 0.031*"tumey's dvds" + 0.030*"stylized" + 0.029*"bleak" + 0.024*"erlend's dvds" + 0.023*"menacing"
2024-03-31 21:56:06,818 : INFO : topic diff=0.050891, rho=0.264069
2024-03-31 21:56:06,983 : INFO : -20.979 per-word bound, 2066694.4 perplexity estimate based on a held-out corpus of 681

2024-03-31 21:56:08,314 : INFO : topic #27 (0.020): 0.319*"Horror" + 0.150*"Thriller" + 0.127*"Mystery" + 0.040*"Drama" + 0.030*"Fantasy" + 0.030*"easily confused with other movie(s) (title)" + 0.025*"Film-Noir" + 0.019*"eerie" + 0.015*"tumey's dvds" + 0.013*"franchise"
2024-03-31 21:56:08,314 : INFO : topic diff=0.056291, rho=0.255317
2024-03-31 21:56:08,320 : INFO : PROGRESS: pass 9, at document #10000/10681
2024-03-31 21:56:08,514 : INFO : merging changes from 2000 documents into a model of 10681 documents
2024-03-31 21:56:08,590 : INFO : topic #7 (0.020): 0.106*"romance" + 0.059*"Romance" + 0.043*"chick flick" + 0.040*"boring" + 0.027*"Comedy" + 0.025*"girlie movie" + 0.025*"love story" + 0.024*"baseball" + 0.024*"whimsical" + 0.023*"comedy"
2024-03-31 21:56:08,591 : INFO : topic #11 (0.020): 0.130*"religion" + 0.106*"keanu reeves" + 0.026*"product placement" + 0.022*"slavery" + 0.021*"facebook rec" + 0.021*"courtesan" + 0.018*"reality tv" + 0.017*"social message" + 0.016*"cross-dr

2024-03-31 21:56:09,831 : INFO : topic diff=0.069619, rho=0.247382
2024-03-31 21:56:09,833 : INFO : PROGRESS: pass 10, at document #8000/10681
2024-03-31 21:56:10,179 : INFO : merging changes from 2000 documents into a model of 10681 documents
2024-03-31 21:56:10,236 : INFO : topic #16 (0.020): 0.269*"nudity (topless)" + 0.184*"nudity (topless - brief)" + 0.067*"nudity (rear)" + 0.028*"pg" + 0.026*"claymation" + 0.022*"military" + 0.018*"childhood" + 0.018*"weird" + 0.016*"aardman" + 0.010*"shark"
2024-03-31 21:56:10,236 : INFO : topic #23 (0.020): 0.081*"high school" + 0.081*"teen" + 0.065*"cars" + 0.034*"dance" + 0.031*"marx brothers" + 0.029*"eddie murphy" + 0.024*"(s)vcd" + 0.023*"car chase" + 0.023*"Comedy" + 0.019*"sandra bullock"
2024-03-31 21:56:10,236 : INFO : topic #30 (0.020): 0.237*"Adventure" + 0.115*"Children" + 0.104*"Comedy" + 0.100*"Fantasy" + 0.077*"Action" + 0.070*"70mm" + 0.046*"Animation" + 0.045*"Drama" + 0.024*"Musical" + 0.009*"submarine"

2024-03-31 21:56:11,688 : INFO : topic #28 (0.020): 0.087*"oscar (best cinematography)" + 0.073*"Drama" + 0.057*"oscar (best supporting actress)" + 0.049*"in netflix queue" + 0.044*"oscar (best actress)" + 0.029*"remade" + 0.019*"marlon brando" + 0.014*"exceptional acting" + 0.013*"china" + 0.013*"alcoholism"
2024-03-31 21:56:11,694 : INFO : topic #12 (0.020): 0.126*"comic book" + 0.101*"superhero" + 0.074*"dystopia" + 0.061*"holocaust" + 0.055*"super-hero" + 0.032*"adapted from:comic" + 0.024*"kidnapping" + 0.022*"batman" + 0.020*"alter ego" + 0.016*"Action"
2024-03-31 21:56:11,694 : INFO : topic #27 (0.020): 0.321*"Horror" + 0.154*"Thriller" + 0.127*"Mystery" + 0.039*"Drama" + 0.028*"Fantasy" + 0.027*"easily confused with other movie(s) (title)" + 0.026*"Film-Noir" + 0.017*"eerie" + 0.015*"slasher" + 0.015*"franchise"
2024-03-31 21:56:11,694 : INFO : topic #41 (0.020): 0.116*"remake" + 0.068*"christmas" + 0.063*"new york city" + 0.044*"family" + 0.036*"new york" + 0.036*"bibliothek" 

2024-03-31 21:56:13,392 : INFO : topic #15 (0.020): 0.166*"disney" + 0.091*"animation" + 0.076*"pixar" + 0.047*"Animation" + 0.043*"children" + 0.039*"Children" + 0.034*"disney animated feature" + 0.025*"angelina jolie" + 0.024*"pirates" + 0.020*"cartoon"
2024-03-31 21:56:13,392 : INFO : topic #33 (0.020): 0.112*"comedy" + 0.108*"Comedy" + 0.056*"funny" + 0.036*"parody" + 0.036*"dark comedy" + 0.033*"quirky" + 0.032*"seen more than once" + 0.029*"coen brothers" + 0.028*"hilarious" + 0.022*"black comedy"
2024-03-31 21:56:13,394 : INFO : topic #27 (0.020): 0.309*"Horror" + 0.157*"Thriller" + 0.127*"Mystery" + 0.040*"Drama" + 0.029*"Film-Noir" + 0.029*"easily confused with other movie(s) (title)" + 0.029*"Fantasy" + 0.016*"eerie" + 0.015*"franchise" + 0.015*"tumey's dvds"
2024-03-31 21:56:13,394 : INFO : topic #44 (0.020): 0.109*"mafia" + 0.087*"music" + 0.053*"martin scorsese" + 0.050*"organized crime" + 0.042*"rock and roll" + 0.031*"Musical" + 0.030*"wired 50 greatest soundtracks" + 0.

2024-03-31 21:56:14,897 : INFO : topic #42 (0.020): 0.084*"robin williams" + 0.077*"japan" + 0.067*"martial arts" + 0.037*"samurai" + 0.035*"akira kurosawa" + 0.033*"orson welles" + 0.032*"divorce" + 0.030*"Drama" + 0.025*"kurosawa" + 0.022*"jonossa"
2024-03-31 21:56:14,897 : INFO : topic #30 (0.020): 0.236*"Adventure" + 0.129*"Children" + 0.112*"Comedy" + 0.106*"Fantasy" + 0.078*"Action" + 0.049*"Animation" + 0.046*"Drama" + 0.044*"70mm" + 0.020*"Musical" + 0.012*"submarine"
2024-03-31 21:56:14,904 : INFO : topic diff=0.175800, rho=0.227387
2024-03-31 21:56:14,904 : INFO : PROGRESS: pass 13, at document #4000/10681
2024-03-31 21:56:15,189 : INFO : merging changes from 2000 documents into a model of 10681 documents
2024-03-31 21:56:15,229 : INFO : topic #4 (0.020): 0.072*"sports" + 0.062*"Drama" + 0.055*"boxing" + 0.045*"underrated" + 0.045*"british" + 0.039*"london" + 0.034*"sven's to see list" + 0.033*"hw drama" + 0.030*"sylvester stallone" + 0.028*"inspirational"

2024-03-31 21:56:16,431 : INFO : topic diff=0.091201, rho=0.227387
2024-03-31 21:56:16,433 : INFO : PROGRESS: pass 14, at document #2000/10681
2024-03-31 21:56:16,638 : INFO : merging changes from 2000 documents into a model of 10681 documents
2024-03-31 21:56:16,664 : INFO : topic #22 (0.020): 0.172*"betamax" + 0.091*"dvd-video" + 0.082*"clv" + 0.038*"library on hold" + 0.034*"aviation" + 0.026*"terry gilliam" + 0.019*"dvd collection" + 0.017*"steven seagal" + 0.016*"gilliam" + 0.016*"family drama"
2024-03-31 21:56:16,674 : INFO : topic #40 (0.020): 0.270*"r" + 0.138*"clearplay" + 0.064*"movie to see" + 0.056*"prison" + 0.044*"Drama" + 0.035*"morgan freeman" + 0.028*"friendship" + 0.021*"conspiracy" + 0.015*"1970s" + 0.012*"mel brooks"
2024-03-31 21:56:16,674 : INFO : topic #9 (0.020): 0.264*"based on a book" + 0.112*"adapted from:book" + 0.053*"Drama" + 0.042*"stephen king" + 0.042*"jack nicholson" + 0.035*"based on book" + 0.027*"imdb bottom 100" + 0.022*"literary adaptation" + 0.02

2024-03-31 21:56:18,214 : INFO : topic #43 (0.020): 0.077*"surreal" + 0.040*"narrated" + 0.038*"satirical" + 0.032*"cynical" + 0.032*"irreverent" + 0.031*"quirky" + 0.030*"dreamlike" + 0.027*"biting" + 0.023*"hallucinatory" + 0.021*"stanley kubrick"
2024-03-31 21:56:18,214 : INFO : topic #35 (0.020): 0.355*"Comedy" + 0.314*"Drama" + 0.201*"Romance" + 0.013*"lesbian" + 0.008*"sean penn" + 0.008*"bibliothek" + 0.007*"philip seymour hoffman" + 0.005*"m. night shyamalan" + 0.005*"oscar (best foreign language film)" + 0.003*"library vhs"
2024-03-31 21:56:18,214 : INFO : topic #31 (0.020): 0.160*"zombies" + 0.097*"pg-13" + 0.056*"horror" + 0.028*"zombie" + 0.026*"infidelity" + 0.024*"movie to see" + 0.023*"betrayal" + 0.022*"books" + 0.022*"joaquin phoenix" + 0.021*"cult classic"
2024-03-31 21:56:18,226 : INFO : topic diff=0.089153, rho=0.221727
2024-03-31 21:56:18,227 : INFO : PROGRESS: pass 15, at document #2000/10681
2024-03-31 21:56:18,486 : INFO : merging changes from 2000 documents int

2024-03-31 21:56:19,974 : INFO : topic #37 (0.020): 0.194*"War" + 0.112*"Drama" + 0.102*"world war ii" + 0.057*"war" + 0.049*"history" + 0.036*"Action" + 0.031*"jim carrey" + 0.018*"africa" + 0.017*"nazis" + 0.014*"wwii"
2024-03-31 21:56:19,983 : INFO : topic diff=0.039867, rho=0.216470
2024-03-31 21:56:20,167 : INFO : -20.968 per-word bound, 2051264.9 perplexity estimate based on a held-out corpus of 681 documents with 5199 words
2024-03-31 21:56:20,168 : INFO : PROGRESS: pass 15, at document #10681/10681
2024-03-31 21:56:20,254 : INFO : merging changes from 681 documents into a model of 10681 documents
2024-03-31 21:56:20,294 : INFO : topic #40 (0.020): 0.310*"r" + 0.159*"clearplay" + 0.074*"movie to see" + 0.049*"Drama" + 0.032*"prison" + 0.029*"friendship" + 0.027*"morgan freeman" + 0.016*"conspiracy" + 0.013*"1970s" + 0.013*"don cheadle"
2024-03-31 21:56:20,295 : INFO : topic #39 (0.020): 0.150*"time travel" + 0.054*"adultery" + 0.046*"motorcycle" + 0.046*"post apocalyptic" + 0.04

2024-03-31 21:56:21,689 : INFO : PROGRESS: pass 16, at document #10000/10681
2024-03-31 21:56:21,934 : INFO : merging changes from 2000 documents into a model of 10681 documents
2024-03-31 21:56:21,992 : INFO : topic #40 (0.020): 0.282*"r" + 0.124*"clearplay" + 0.057*"movie to see" + 0.046*"Drama" + 0.039*"prison" + 0.029*"morgan freeman" + 0.025*"friendship" + 0.018*"john wayne" + 0.018*"conspiracy" + 0.015*"don cheadle"
2024-03-31 21:56:21,994 : INFO : topic #7 (0.020): 0.108*"romance" + 0.062*"Romance" + 0.045*"chick flick" + 0.041*"boring" + 0.026*"girlie movie" + 0.025*"baseball" + 0.024*"love story" + 0.024*"whimsical" + 0.024*"Comedy" + 0.019*"wedding"
2024-03-31 21:56:21,994 : INFO : topic #45 (0.020): 0.258*"less than 300 ratings" + 0.117*"Drama" + 0.059*"nudity (topless - notable)" + 0.038*"tim burton" + 0.031*"not corv lib" + 0.026*"library" + 0.017*"depressing" + 0.016*"seen 2007" + 0.016*"blindfold" + 0.013*"michael moore"
2024-03-31 21:56:21,998 : INFO : topic #41 (0.020)

2024-03-31 21:56:23,764 : INFO : topic #39 (0.020): 0.182*"time travel" + 0.054*"adultery" + 0.041*"motorcycle" + 0.038*"post apocalyptic" + 0.035*"post-apocalyptic" + 0.030*"jude law" + 0.024*"dystopia" + 0.019*"want to own" + 0.019*"ian mckellen" + 0.018*"hollywood"
2024-03-31 21:56:23,764 : INFO : topic #48 (0.020): 0.089*"bruce willis" + 0.081*"twist ending" + 0.063*"netflix" + 0.047*"coming of age" + 0.033*"journalism" + 0.031*"sexy" + 0.026*"Drama" + 0.025*"avi" + 0.022*"to see" + 0.022*"talking animals"
2024-03-31 21:56:23,768 : INFO : topic #8 (0.020): 0.142*"anime" + 0.090*"tom hanks" + 0.077*"true story" + 0.056*"based on a true story" + 0.027*"drama" + 0.026*"good" + 0.025*"japan" + 0.023*"interesting" + 0.021*"not funny" + 0.018*"archaeology"
2024-03-31 21:56:23,768 : INFO : topic #15 (0.020): 0.140*"disney" + 0.096*"animation" + 0.079*"pixar" + 0.046*"Animation" + 0.040*"children" + 0.037*"Children" + 0.033*"pirates" + 0.029*"disney animated feature" + 0.025*"angelina joli

2024-03-31 21:56:25,514 : INFO : topic #47 (0.020): 0.065*"serial killer" + 0.056*"psychology" + 0.053*"brad pitt" + 0.042*"edward norton" + 0.032*"steven spielberg" + 0.032*"heist" + 0.028*"matt damon" + 0.024*"crime" + 0.021*"hayao miyazaki" + 0.017*"scary"
2024-03-31 21:56:25,514 : INFO : topic #34 (0.020): 0.093*"ghosts" + 0.044*"bill murray" + 0.041*"television" + 0.037*"courtroom" + 0.033*"courtroom drama" + 0.023*"roman polanski" + 0.022*"court" + 0.018*"secret service" + 0.017*"2.5" + 0.016*"opera"
2024-03-31 21:56:25,514 : INFO : topic diff=0.054279, rho=0.202691
2024-03-31 21:56:25,514 : INFO : PROGRESS: pass 18, at document #8000/10681
2024-03-31 21:56:25,814 : INFO : merging changes from 2000 documents into a model of 10681 documents
2024-03-31 21:56:25,884 : INFO : topic #16 (0.020): 0.275*"nudity (topless)" + 0.186*"nudity (topless - brief)" + 0.067*"nudity (rear)" + 0.028*"pg" + 0.027*"claymation" + 0.022*"military" + 0.018*"childhood" + 0.017*"weird" + 0.017*"aardman" +

2024-03-31 21:56:27,379 : INFO : topic diff=0.027116, rho=0.198652
2024-03-31 21:56:27,383 : INFO : PROGRESS: pass 19, at document #6000/10681
2024-03-31 21:56:27,627 : INFO : merging changes from 2000 documents into a model of 10681 documents
2024-03-31 21:56:27,693 : INFO : topic #5 (0.020): 0.277*"classic" + 0.065*"musical" + 0.045*"afi 100 (laughs)" + 0.033*"Musical" + 0.029*"national film registry" + 0.021*"afi 100" + 0.018*"john travolta" + 0.017*"70mm" + 0.016*"adapted from b'way" + 0.015*"breakthroughs"
2024-03-31 21:56:27,693 : INFO : topic #36 (0.020): 0.084*"cult film" + 0.046*"peter jackson" + 0.044*"russell crowe" + 0.038*"downbeat" + 0.031*"adam sandler" + 0.029*"1980s" + 0.022*"80s" + 0.021*"new zealand" + 0.019*"not available from netflix" + 0.014*"must see!"
2024-03-31 21:56:27,693 : INFO : topic #11 (0.020): 0.129*"religion" + 0.115*"keanu reeves" + 0.035*"island" + 0.021*"slavery" + 0.018*"courtesan" + 0.017*"reality tv" + 0.016*"class issues" + 0.016*"slash" + 0.016

2024-03-31 21:56:29,248 : INFO : topic #9 (0.020): 0.245*"based on a book" + 0.112*"adapted from:book" + 0.054*"Drama" + 0.049*"stephen king" + 0.041*"jack nicholson" + 0.037*"based on book" + 0.026*"imdb bottom 100" + 0.024*"stupid" + 0.021*"literary adaptation" + 0.021*"los angeles"
2024-03-31 21:56:29,248 : INFO : topic #36 (0.020): 0.089*"cult film" + 0.047*"russell crowe" + 0.040*"downbeat" + 0.032*"adam sandler" + 0.032*"1980s" + 0.030*"peter jackson" + 0.020*"80s" + 0.020*"not available from netflix" + 0.020*"new zealand" + 0.016*"virus"
2024-03-31 21:56:29,248 : INFO : topic #15 (0.020): 0.162*"disney" + 0.091*"animation" + 0.076*"pixar" + 0.049*"Animation" + 0.043*"children" + 0.040*"Children" + 0.033*"disney animated feature" + 0.026*"angelina jolie" + 0.025*"pirates" + 0.020*"cartoon"
2024-03-31 21:56:29,248 : INFO : topic #42 (0.020): 0.092*"robin williams" + 0.089*"martial arts" + 0.070*"japan" + 0.039*"samurai" + 0.036*"akira kurosawa" + 0.029*"Drama" + 0.027*"orson welle

2024-03-31 21:56:30,937 : INFO : topic #15 (0.020): 0.167*"disney" + 0.094*"animation" + 0.074*"pixar" + 0.047*"Animation" + 0.043*"children" + 0.038*"Children" + 0.030*"disney animated feature" + 0.029*"pirates" + 0.024*"angelina jolie" + 0.021*"cartoon"
2024-03-31 21:56:30,937 : INFO : topic #23 (0.020): 0.084*"high school" + 0.061*"teen" + 0.051*"cars" + 0.035*"dance" + 0.028*"marx brothers" + 0.027*"sandra bullock" + 0.023*"Comedy" + 0.021*"kate winslet" + 0.020*"(s)vcd" + 0.020*"eddie murphy"
2024-03-31 21:56:30,943 : INFO : topic diff=0.143643, rho=0.191248
2024-03-31 21:56:30,944 : INFO : PROGRESS: pass 21, at document #4000/10681
2024-03-31 21:56:31,238 : INFO : merging changes from 2000 documents into a model of 10681 documents
2024-03-31 21:56:31,311 : INFO : topic #13 (0.020): 0.054*"fairy tale" + 0.044*"jane austen" + 0.041*"road trip" + 0.036*"sequel" + 0.030*"assassination" + 0.029*"cold war" + 0.027*"historical" + 0.022*"kids" + 0.021*"heartwarming" + 0.020*"gerard depar

2024-03-31 21:56:32,853 : INFO : PROGRESS: pass 22, at document #2000/10681
2024-03-31 21:56:33,153 : INFO : merging changes from 2000 documents into a model of 10681 documents
2024-03-31 21:56:33,213 : INFO : topic #2 (0.020): 0.278*"Documentary" + 0.121*"to see" + 0.055*"documentary" + 0.030*"mockumentary" + 0.029*"star trek" + 0.025*"kevin smith" + 0.022*"movie to see" + 0.020*"dogs" + 0.019*"sexuality" + 0.016*"owen wilson"
2024-03-31 21:56:33,221 : INFO : topic #34 (0.020): 0.073*"ghosts" + 0.050*"bill murray" + 0.043*"television" + 0.030*"courtroom" + 0.029*"courtroom drama" + 0.026*"secret service" + 0.020*"roman polanski" + 0.017*"18th century" + 0.017*"opera" + 0.016*"court"
2024-03-31 21:56:33,221 : INFO : topic #31 (0.020): 0.143*"zombies" + 0.076*"pg-13" + 0.074*"horror" + 0.031*"cult classic" + 0.027*"zombie" + 0.024*"sam raimi" + 0.023*"campy" + 0.021*"infidelity" + 0.021*"books" + 0.020*"joaquin phoenix"
2024-03-31 21:56:33,221 : INFO : topic #33 (0.020): 0.109*"Comedy" 

2024-03-31 21:56:34,610 : INFO : topic #22 (0.020): 0.209*"betamax" + 0.086*"dvd-video" + 0.053*"clv" + 0.051*"library on hold" + 0.036*"aviation" + 0.019*"family drama" + 0.019*"seen 2008" + 0.018*"terry gilliam" + 0.013*"gilliam" + 0.013*"dvd collection"
2024-03-31 21:56:34,610 : INFO : topic #20 (0.020): 0.100*"biography" + 0.071*"Drama" + 0.055*"death" + 0.045*"suicide" + 0.045*"biopic" + 0.038*"corvallis library" + 0.031*"ensemble cast" + 0.028*"julia roberts" + 0.023*"forest whitaker" + 0.023*"history"
2024-03-31 21:56:34,610 : INFO : topic #39 (0.020): 0.156*"time travel" + 0.055*"adultery" + 0.045*"motorcycle" + 0.045*"post apocalyptic" + 0.042*"post-apocalyptic" + 0.035*"jude law" + 0.024*"bechdel test:fail" + 0.023*"dystopia" + 0.019*"want to own" + 0.017*"old"
2024-03-31 21:56:34,610 : INFO : topic #42 (0.020): 0.092*"japan" + 0.088*"martial arts" + 0.063*"robin williams" + 0.038*"akira kurosawa" + 0.035*"samurai" + 0.035*"Drama" + 0.028*"seen 2006" + 0.027*"divorce" + 0.027

2024-03-31 21:56:36,130 : INFO : topic #34 (0.020): 0.075*"ghosts" + 0.058*"bill murray" + 0.040*"television" + 0.038*"courtroom" + 0.032*"courtroom drama" + 0.024*"roman polanski" + 0.021*"ben stiller" + 0.019*"las vegas" + 0.018*"court" + 0.016*"secret service"
2024-03-31 21:56:36,131 : INFO : topic #40 (0.020): 0.278*"r" + 0.124*"clearplay" + 0.058*"movie to see" + 0.047*"Drama" + 0.040*"prison" + 0.029*"morgan freeman" + 0.025*"friendship" + 0.018*"conspiracy" + 0.018*"john wayne" + 0.015*"1970s"
2024-03-31 21:56:36,135 : INFO : topic diff=0.033403, rho=0.184615
2024-03-31 21:56:36,263 : INFO : -20.964 per-word bound, 2045057.6 perplexity estimate based on a held-out corpus of 681 documents with 5199 words
2024-03-31 21:56:36,263 : INFO : PROGRESS: pass 23, at document #10681/10681
2024-03-31 21:56:36,343 : INFO : merging changes from 681 documents into a model of 10681 documents
2024-03-31 21:56:36,383 : INFO : topic #14 (0.020): 0.059*"criterion" + 0.053*"disturbing" + 0.051*"Dra

2024-03-31 21:56:37,503 : INFO : topic diff=0.039945, rho=0.181547
2024-03-31 21:56:37,513 : INFO : PROGRESS: pass 24, at document #10000/10681
2024-03-31 21:56:37,709 : INFO : merging changes from 2000 documents into a model of 10681 documents
2024-03-31 21:56:37,747 : INFO : topic #29 (0.020): 0.095*"nudity (full frontal - notable)" + 0.054*"Drama" + 0.052*"shakespeare" + 0.045*"based on a play" + 0.036*"christianity" + 0.034*"adapted from:play" + 0.029*"religion" + 0.021*"biblical" + 0.021*"disability" + 0.019*"fascism"
2024-03-31 21:56:37,748 : INFO : topic #26 (0.020): 0.045*"Drama" + 0.033*"james bond" + 0.021*"reflective" + 0.021*"poignant" + 0.020*"007" + 0.020*"atmospheric" + 0.019*"bittersweet" + 0.019*"lyrical" + 0.019*"bond" + 0.018*"deliberate"
2024-03-31 21:56:37,749 : INFO : topic #13 (0.020): 0.062*"fairy tale" + 0.048*"road trip" + 0.041*"sequel" + 0.040*"jane austen" + 0.026*"historical" + 0.025*"gerard depardieu" + 0.024*"swashbuckler" + 0.024*"assassination" + 0.024

2024-03-31 21:56:38,987 : INFO : topic #47 (0.020): 0.063*"serial killer" + 0.057*"psychology" + 0.050*"brad pitt" + 0.041*"edward norton" + 0.034*"heist" + 0.028*"matt damon" + 0.028*"steven spielberg" + 0.027*"hayao miyazaki" + 0.023*"crime" + 0.016*"scary"
2024-03-31 21:56:38,993 : INFO : topic #22 (0.020): 0.259*"betamax" + 0.101*"dvd-video" + 0.068*"clv" + 0.030*"aviation" + 0.020*"library on hold" + 0.019*"terry gilliam" + 0.017*"dvd collection" + 0.015*"seen 2008" + 0.014*"gilliam" + 0.012*"steven seagal"
2024-03-31 21:56:38,993 : INFO : topic #5 (0.020): 0.270*"classic" + 0.062*"musical" + 0.045*"afi 100 (laughs)" + 0.036*"Musical" + 0.030*"national film registry" + 0.019*"70mm" + 0.018*"afi 100" + 0.016*"john travolta" + 0.016*"adapted from b'way" + 0.015*"breakthroughs"
2024-03-31 21:56:38,993 : INFO : topic #8 (0.020): 0.138*"anime" + 0.090*"tom hanks" + 0.078*"true story" + 0.058*"based on a true story" + 0.027*"drama" + 0.025*"good" + 0.025*"japan" + 0.023*"interesting" + 

2024-03-31 21:56:40,343 : INFO : topic #18 (0.020): 0.082*"magic" + 0.079*"gay" + 0.062*"racism" + 0.044*"pg13" + 0.040*"Drama" + 0.027*"food" + 0.027*"social commentary" + 0.023*"homosexuality" + 0.020*"19th century" + 0.019*"denzel washington"
2024-03-31 21:56:40,343 : INFO : topic #19 (0.020): 0.062*"nudity (full frontal)" + 0.061*"drama" + 0.035*"Drama" + 0.034*"philip k. dick" + 0.028*"vietnam war" + 0.027*"rape" + 0.025*"vietnam" + 0.022*"very good" + 0.020*"to see" + 0.016*"slow"
2024-03-31 21:56:40,353 : INFO : topic diff=0.046241, rho=0.175844
2024-03-31 21:56:40,353 : INFO : PROGRESS: pass 26, at document #8000/10681
2024-03-31 21:56:40,613 : INFO : merging changes from 2000 documents into a model of 10681 documents
2024-03-31 21:56:40,678 : INFO : topic #2 (0.020): 0.287*"Documentary" + 0.104*"to see" + 0.064*"documentary" + 0.026*"mockumentary" + 0.025*"sexuality" + 0.023*"star trek" + 0.021*"dogs" + 0.021*"kevin smith" + 0.018*"movie to see" + 0.017*"in netflix queue"

2024-03-31 21:56:41,963 : INFO : topic diff=0.023122, rho=0.173186
2024-03-31 21:56:41,966 : INFO : PROGRESS: pass 27, at document #6000/10681
2024-03-31 21:56:42,143 : INFO : merging changes from 2000 documents into a model of 10681 documents
2024-03-31 21:56:42,182 : INFO : topic #18 (0.020): 0.081*"magic" + 0.079*"gay" + 0.062*"racism" + 0.044*"pg13" + 0.040*"Drama" + 0.027*"food" + 0.027*"social commentary" + 0.023*"homosexuality" + 0.020*"19th century" + 0.019*"denzel washington"
2024-03-31 21:56:42,182 : INFO : topic #32 (0.020): 0.135*"johnny depp" + 0.067*"clint eastwood" + 0.059*"vhs" + 0.045*"jackie chan" + 0.037*"western" + 0.031*"kung fu" + 0.031*"leonardo dicaprio" + 0.030*"spaghetti western" + 0.029*"propaganda" + 0.029*"david lynch"
2024-03-31 21:56:42,182 : INFO : topic #26 (0.020): 0.042*"Drama" + 0.035*"james bond" + 0.024*"007" + 0.023*"bond" + 0.021*"mel gibson" + 0.021*"reflective" + 0.020*"atmospheric" + 0.019*"poignant" + 0.018*"lyrical" + 0.017*"bittersweet"

2024-03-31 21:56:43,533 : INFO : topic #7 (0.020): 0.113*"romance" + 0.065*"Romance" + 0.054*"chick flick" + 0.038*"boring" + 0.030*"girlie movie" + 0.027*"baseball" + 0.023*"Comedy" + 0.020*"love story" + 0.019*"whimsical" + 0.017*"wedding"
2024-03-31 21:56:43,534 : INFO : topic #5 (0.020): 0.292*"classic" + 0.062*"musical" + 0.045*"afi 100 (laughs)" + 0.034*"Musical" + 0.030*"national film registry" + 0.021*"afi 100" + 0.016*"70mm" + 0.015*"adapted from b'way" + 0.015*"john travolta" + 0.015*"breakthroughs"
2024-03-31 21:56:43,535 : INFO : topic #31 (0.020): 0.127*"zombies" + 0.074*"horror" + 0.072*"pg-13" + 0.039*"cult classic" + 0.025*"infidelity" + 0.025*"campy" + 0.024*"zombie" + 0.024*"joaquin phoenix" + 0.024*"sam raimi" + 0.020*"books"
2024-03-31 21:56:43,539 : INFO : topic diff=0.022572, rho=0.170646
2024-03-31 21:56:43,541 : INFO : PROGRESS: pass 28, at document #6000/10681
2024-03-31 21:56:43,745 : INFO : merging changes from 2000 documents into a model of 10681 documents

2024-03-31 21:56:44,823 : INFO : topic #17 (0.020): 0.330*"Sci-Fi" + 0.071*"Horror" + 0.068*"Action" + 0.044*"video game adaptation" + 0.027*"animals" + 0.025*"movie to see" + 0.022*"football" + 0.018*"futuristic" + 0.015*"milla jovovich" + 0.013*"g"
2024-03-31 21:56:44,823 : INFO : topic #34 (0.020): 0.074*"ghosts" + 0.051*"bill murray" + 0.044*"television" + 0.030*"courtroom" + 0.030*"courtroom drama" + 0.026*"secret service" + 0.021*"roman polanski" + 0.017*"18th century" + 0.017*"opera" + 0.017*"court"
2024-03-31 21:56:44,831 : INFO : topic diff=0.124178, rho=0.168215
2024-03-31 21:56:44,832 : INFO : PROGRESS: pass 29, at document #4000/10681
2024-03-31 21:56:45,023 : INFO : merging changes from 2000 documents into a model of 10681 documents
2024-03-31 21:56:45,053 : INFO : topic #31 (0.020): 0.127*"zombies" + 0.074*"horror" + 0.072*"pg-13" + 0.039*"cult classic" + 0.025*"infidelity" + 0.025*"campy" + 0.024*"zombie" + 0.024*"joaquin phoenix" + 0.024*"sam raimi" + 0.020*"books"

2024-03-31 21:56:46,215 : INFO : LdaModel lifecycle event {'msg': 'trained LdaModel<num_terms=15261, num_topics=50, decay=0.5, chunksize=2000> in 57.50s', 'datetime': '2024-03-31T21:56:46.215775', 'gensim': '4.3.2', 'python': '3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.22621-SP0', 'event': 'created'}
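
The `per-word bound` and `perplexity estimate` figures in the training log above are two views of the same quantity: the perplexity is 2 raised to the negative per-word bound. The sketch below recomputes the perplexity from the logged bound of -20.964. Because the log rounds the bound to three decimals, the result should only match the logged perplexity approximately.

```python
# Values copied from the log line:
# "-20.964 per-word bound, 2045057.6 perplexity estimate"
per_word_bound = -20.964
logged_perplexity = 2045057.6

# Perplexity is 2 ** (-per-word bound); lower means a better fit on the held-out chunk
perplexity = 2 ** (-per_word_bound)

# The bound is rounded in the log, so compare with a relative tolerance
relative_gap = abs(perplexity - logged_perplexity) / logged_perplexity
print(f"recomputed perplexity: {perplexity:.1f}")
print(f"relative gap to the logged value: {relative_gap:.2e}")
```

A perplexity in the millions is not unusual for this kind of input: each "document" is a short list of tags and genres drawn from a vocabulary of 15,261 terms.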


In [5]:
# Evaluation
metric_calculator = MetricCalculator()
metrics = metric_calculator.calc(
    movielens.test.rating.tolist(), recommend_result.rating.tolist(),
    movielens.test_user2items, recommend_result.user2items, k=10)
print(metrics)

rmse=0.000, Precision@K=0.004, Recall@K=0.012
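
The implementation of `MetricCalculator` is not shown in this notebook. As a rough guide to reading `Precision@K` and `Recall@K`, the sketch below uses a common definition of the two metrics for a single user. The function names and the sample item IDs are hypothetical, and the real class may average or handle edge cases differently.

```python
def precision_at_k(true_items, pred_items, k):
    """Share of the top-k recommended items that are in the user's true item list."""
    if k == 0:
        return 0.0
    hits = len(set(true_items) & set(pred_items[:k]))
    return hits / k


def recall_at_k(true_items, pred_items, k):
    """Share of the user's true items that appear among the top-k recommendations."""
    if k == 0 or len(true_items) == 0:
        return 0.0
    hits = len(set(true_items) & set(pred_items[:k]))
    return hits / len(true_items)


# Hypothetical data for one user: items they liked vs. the ranked recommendations
true_items = [10, 20, 30]
pred_items = [20, 99, 10, 77, 55]

p = precision_at_k(true_items, pred_items, k=5)  # 2 hits out of 5 recommendations
r = recall_at_k(true_items, pred_items, k=5)     # 2 hits out of 3 true items
print(f"Precision@5={p:.3f}, Recall@5={r:.3f}")
```

Under this definition, the reported values would be the per-user scores averaged over all test users.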


# Introduction to Natural Language Processing Using Deep Learning
## 19. Topic Modeling
### 19-03 Latent Dirichlet Allocation with scikit-learn

In [8]:
import pandas as pd
import urllib.request
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Download the dataset once if it is not already available locally:
# urllib.request.urlretrieve("https://raw.githubusercontent.com/ukairia777/tensorflow-nlp-tutorial/main/19.%20Topic%20Modeling%20(LDA%2C%20BERT-Based)/dataset/abcnews-date-text.csv", filename="./data/abcnews-date-text.csv")
# error_bad_lines was deprecated in pandas 1.3 and removed in 2.0; on_bad_lines='skip' replaces it
data = pd.read_csv('./data/abcnews-date-text.csv', on_bad_lines='skip')
print('Number of news headlines :', len(data))

Number of news headlines : 1244184


In [9]:
print(data.head(5))

   publish_date                                      headline_text
0      20030219  aba decides against community broadcasting lic...
1      20030219     act fire witnesses must be aware of defamation
2      20030219     a g calls for infrastructure protection summit
3      20030219           air nz staff in aust strike for pay rise
4      20030219      air nz strike to affect australian travellers


In [10]:
text = data[['headline_text']]
text.head(5)

                                       headline_text
0  aba decides against community broadcasting lic...
1     act fire witnesses must be aware of defamation
2     a g calls for infrastructure protection summit
3           air nz staff in aust strike for pay rise
4      air nz strike to affect australian travellers


2) Text preprocessing  
This exercise applies three preprocessing techniques: stopword removal, lemmatization, and removal of short words.
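
The three steps can be sketched on a single headline without any NLTK downloads. This is only an illustration: the stopword set and the lemma lookup table below are tiny hand-written stand-ins (assumptions), whereas the notebook cells use NLTK's English stopword list and `WordNetLemmatizer`.

```python
# Toy stand-ins (assumptions): the notebook itself uses NLTK resources for both
STOP_WORDS = {"a", "against", "be", "in", "of", "to"}
LEMMAS = {"witnesses": "witness", "decides": "decide", "calls": "call"}


def preprocess(headline):
    tokens = headline.split()                             # simple whitespace tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]   # 1) remove stopwords
    tokens = [LEMMAS.get(t, t) for t in tokens]           # 2) lemmatize via the lookup table
    return [t for t in tokens if len(t) > 3]              # 3) drop words of 3 characters or fewer


result = preprocess("act fire witnesses must be aware of defamation")
print(result)  # ['fire', 'witness', 'must', 'aware', 'defamation']
```

Here `be` and `of` are removed as stopwords, `witnesses` is reduced to `witness`, and `act` is dropped because it is too short.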

In [12]:
import nltk
nltk.download('punkt')
# Take an explicit copy first: `text` was sliced from `data`, so assigning to its
# column would otherwise trigger pandas' SettingWithCopyWarning
text = text.copy()
text['headline_text'] = text['headline_text'].apply(nltk.word_tokenize)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\jkm20\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


In [13]:
nltk.download('stopwords')  # the stopword corpus must be downloaded once, or stopwords.words() raises LookupError
stop_words = set(stopwords.words('english'))  # a set keeps the membership test fast
text['headline_text'] = text['headline_text'].apply(lambda x: [word for word in x if word not in stop_words])
print(text.head(5))


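Comparing just the first five samples before and after stopword removal, several words have clearly disappeared: against, be, of, a, in, to, and so on have been removed. Next comes lemmatization, which converts third-person singular verb forms to their base form and past-tense verbs to the present tense.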

In [None]:
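nltk.download('wordnet')  # WordNetLemmatizer needs the WordNet corpus
lemmatizer = WordNetLemmatizer()  # build the lemmatizer once instead of once per word
text['headline_text'] = text['headline_text'].apply(lambda x: [lemmatizer.lemmatize(word, pos='v') for word in x])
print(text.head(5))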

In [None]:
tokenized_doc = text['headline_text'].apply(lambda x: [word for word in x if len(word) > 3])
print(tokenized_doc[:5])


In [None]:
# Detokenization (undo the tokenization by joining the tokens back into one string)
detokenized_doc = [' '.join(tokens) for tokens in tokenized_doc]

# Store the result back in text['headline_text']
text['headline_text'] = detokenized_doc


In [None]:
text['headline_text'][:5]


In [None]:
# Keep only the top 1,000 words
vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)
X = vectorizer.fit_transform(text['headline_text'])

# Check the shape of the TF-IDF matrix
print('Shape of the TF-IDF matrix :', X.shape)


In [None]:
lda_model = LatentDirichletAllocation(n_components=10,learning_method='online',random_state=777,max_iter=1)


In [None]:
lda_top = lda_model.fit_transform(X)


In [None]:
print(lda_model.components_)
print(lda_model.components_.shape) 


In [None]:
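# Vocabulary: holds the 1,000 retained words.
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
terms = vectorizer.get_feature_names_out()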

def get_topics(components, feature_names, n=5):
    for idx, topic in enumerate(components):
        print("Topic %d:" % (idx+1), [(feature_names[i], topic[i].round(2)) for i in topic.argsort()[:-n - 1:-1]])

get_topics(lda_model.components_,terms)
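
The slice `topic.argsort()[:-n - 1:-1]` in `get_topics` returns the indices of the `n` largest weights, largest first. The sketch below performs the same selection in plain Python on a made-up weight vector. It also divides each weight by the row total, which is one way to read a row of `components_` as a word distribution for that topic. The vocabulary and weights are hypothetical.

```python
feature_names = ["police", "court", "rain", "market", "win"]  # hypothetical vocabulary
topic = [3.0, 9.0, 0.5, 6.0, 1.5]                             # hypothetical row of components_


def top_terms(weights, names, n=3):
    # Indices ordered by weight, largest first (the same selection as
    # argsort()[:-n - 1:-1] when no two weights are tied)
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:n]
    total = sum(weights)
    # Normalize by the row total so the values can be read as probabilities
    return [(names[i], round(weights[i] / total, 2)) for i in order]


top = top_terms(topic, feature_names)
print(top)  # [('court', 0.45), ('market', 0.3), ('police', 0.15)]
```

Unlike this sketch, `get_topics` prints the raw, unnormalized weights, so its numbers are comparable within a topic but do not sum to 1.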
