# **Unsupervised BERTopic Jeopardy Classification Analysis**

We chose BERTopic to generate a topic for each question in our dataset. This notebook documents the choices we made while building the model and what we learned along the way.

Before running this notebook, please be sure to visit our GitHub repo to download the data files needed.

In [None]:
# installing all packages and dependencies
!pip install bertopic[all]

In [2]:
from bertopic import BERTopic
import pandas as pd

In [3]:
# importing local file dataset and creating different document filters to pass into BERTopic model
# filtered_smaller_sample.csv can be found on our github repo

df = pd.read_csv('filtered_smaller_sample.csv', engine="python")
docs = df['data_a']
docs_filter = df['filtered_sentence']
docs_combined = df['combined']

### **Control Model**
Before testing parameters, we established a control by running an entirely unsupervised model with default settings on our combined test data set of 536 samples. This data included both questions and answers.

We found that the model produced a handful of usable topics that fit only a minority of our samples. The remainder of our data fell into the outlier topic (`-1`), meaning the model could not fit it into any of the categories.


In [4]:
# extracting topics from the combined questions and answers (stop words not removed), using default parameters
topic_model_combined = BERTopic()
topics, _ = topic_model_combined.fit_transform(docs_combined)

In [5]:
topic_model_combined.get_topic_info()

Unnamed: 0,Topic,Count,Name
0,-1,495,-1_of_in_on_was
1,0,16,0_zoo_exiled_exile_everlast
2,1,13,1_language_policy_curriculum_syllabus
3,2,11,2_airline_travel_bus_transportation
4,3,1,3_hrefhttpwwwjarchivecommedia20100706dj26jpg_o...
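
To put a number on how much of the sample the control model left unassigned, the counts in the table above can be turned into an outlier share. This is a minimal sketch that hard-codes the `Count` column from the output above; the helper name `outlier_share` is ours and is not part of BERTopic.

```python
# Topic counts copied from the get_topic_info() output above (topic -1 is the outlier topic).
control_counts = {-1: 495, 0: 16, 1: 13, 2: 11, 3: 1}


def outlier_share(counts):
    """Return the fraction of documents assigned to the outlier topic (-1)."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("counts must contain at least one document")
    return counts.get(-1, 0) / total


share = outlier_share(control_counts)
print(f"{share:.1%} of the {sum(control_counts.values())} samples are outliers")
```

For the control run this comes to a little over 92% of the 536 samples, which is the baseline the later tests try to improve on.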


In [6]:
topic_model_combined.visualize_topics()

### **First Test**

Our first model was fit to the smaller sample of 536 Jeopardy **answers**.

We chose to use the `"paraphrase-MiniLM-L6-v2"` embedding model as it was recommended by the creator of BERTopic as a lightweight embedding option created to be used for small and medium sized documents.

We also set two additional parameters to the smallest values we found workable for our clusters: the minimum topic size (`min_topic_size`) was set to 4 and the maximum number of topics (`nr_topics`) was set to 22.

Findings:


*   A small minimum topic size (`min_topic_size`) generally leads to overlapping topic clusters
*   With these settings, the topics fell into two major clusters, each made up of smaller, overlapping sub-clusters. This pattern resembles the one produced by the basic unsupervised model run without any parameters set.





In [7]:
# extracting topic from our docs - using only the answers
topic_model_answers_only = BERTopic(min_topic_size=4, nr_topics=22, embedding_model="paraphrase-MiniLM-L6-v2")
topics, _ = topic_model_answers_only.fit_transform(docs)

In [8]:
# getting all topic information 
topic_model_answers_only.get_topic_info()

Unnamed: 0,Topic,Count,Name
0,-1,119,-1_jpg_target_http_www
1,0,100,0_oregon_city_tour_england
2,1,87,1_john_michael_william_jefferson
3,2,25,2_death_angels_angel_mir
4,3,20,3_elizabeth_victoria_queen_emily
5,4,17,4_cerebellum_mephistopheles_copernicus_octopus
6,5,17,5_train_streetcar_parachutes_urban
7,6,15,6_kiss_teeth_wax_soft
8,7,15,7_toyota_gmc_ibm_close
9,8,12,8_soup_sugar_cans_leaves


In [9]:
topic_model_answers_only.visualize_topics()

### **Second Test**
For the second model, we kept `min_topic_size`, `nr_topics`, and `embedding_model` the same as in the first test. However, instead of using the answer data, we fit the model to our pre-filtered combined documents.

These documents had stop-words removed prior to being fit into the model and combined both questions and answers.

We anticipated that this would produce more usable topics, as each document gives the model more words to work with.

Findings:
*   More text per document led to more usable clusters
*   Sub-clusters still experienced significant overlap



In [10]:
# extracting topic from our second set of docs - using only the filtered sentences (stop words pre-removed, includes both questions and answers)
topic_model_filtered = BERTopic(min_topic_size=4, nr_topics=22, embedding_model="paraphrase-MiniLM-L6-v2")
topics, _ = topic_model_filtered.fit_transform(docs_filter)

In [11]:
# accessing information from filtered topic 
topic_model_filtered.get_topic_info()

Unnamed: 0,Topic,Count,Name
0,-1,161,-1_name_warhol_words_oscar
1,0,32,0_mythology_animal_alaska_dog
2,1,30,1_sports_skiing_sportsmen_athletes
3,2,27,2_entertainment_cinema_movie_film
4,3,24,3_tourism_etropolis_architects_saigon
5,4,22,4_civil_penn_john_history
6,5,20,5_science_gases_energy_ph
7,6,17,6_airline_shuttle_bus_transportation
8,7,17,7_state_gov_governor_illinois
9,8,16,8_people_everlast_events_event


In [12]:
topic_model_filtered.visualize_topics()

### **Third Test**
For the third model, we increased `min_topic_size` (from 4 to 8), removed the `nr_topics` parameter, and kept the same `embedding_model`. Instead of using the pre-filtered data, we passed in our combined documents.

These samples included the combined questions and answers without removing the stop-words.

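We assumed that BERTopic would go through each document and remove stop-words automatically. By default it does not remove them from the topic terms unless a custom `vectorizer_model` is supplied, which is why words such as `from` and `this` still appear in the output below. We also anticipated that this would result in more accurate topics, as the minimum topic size is greater.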

Findings:
*   Not removing stop words results in more relevant topics 
*   Removing the maximum number of topics has little impact when the minimum topic size is increased slightly
*   The intertopic distance starts to spread out and we begin to see less overlap between topics
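
If stop words should be kept out of the topic terms without pre-filtering the documents, one option is to pass a `CountVectorizer` with a stop-word list through BERTopic's `vectorizer_model` argument. The sketch below is not what we ran in this notebook; it is a hedged illustration, and the helper names `topic_model_kwargs` and `build_topic_model` are hypothetical. The stop-word filter applies only to the topic representation, so the embedding model still sees the full text.

```python
def topic_model_kwargs(min_topic_size=8, nr_topics=None,
                       embedding_model="paraphrase-MiniLM-L6-v2"):
    """Collect BERTopic keyword arguments, leaving nr_topics out when it is None."""
    kwargs = {"min_topic_size": min_topic_size, "embedding_model": embedding_model}
    if nr_topics is not None:
        kwargs["nr_topics"] = nr_topics
    return kwargs


def build_topic_model(**overrides):
    """Build a BERTopic model whose topic terms exclude English stop words."""
    # Imported lazily so the helper above can be used without loading the heavy libraries.
    from bertopic import BERTopic
    from sklearn.feature_extraction.text import CountVectorizer

    vectorizer = CountVectorizer(stop_words="english")
    return BERTopic(vectorizer_model=vectorizer, **topic_model_kwargs(**overrides))
```

A model built with `build_topic_model(min_topic_size=8)` could then be fit with `fit_transform(docs_combined)` in the same way as the cells in this notebook.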





In [16]:
# testing combined docs with min_topic_size = 8 and no nr_topics limit
topic_model_combined_20 = BERTopic(min_topic_size=8, embedding_model="paraphrase-MiniLM-L6-v2")
topics, _ = topic_model_combined_20.fit_transform(docs_combined)

topic_model_combined_20.get_topic_info()

Unnamed: 0,Topic,Count,Name
0,-1,135,-1_peach_com_from_archive
1,0,64,0_authors_poets_art_wrote
2,1,45,1_city_geography_island_etropolis
3,2,28,2_warhol_company_business_industry
4,3,28,3_movie_tv_drama_cinema
5,4,26,4_skiing_sports_sportsmen_athletes
6,5,22,5_science_measuring_gases_measured
7,6,22,6_let_bounce_emoticons_rhymes
8,7,20,7_state_annual_festival_pendleton
9,8,20,8_animal_angels_mythology_dog


In [17]:
topic_model_combined_20.topics

{-1: [('peach', 0.01643589791893067),
  ('com', 0.016419014582589847),
  ('from', 0.015669990440089546),
  ('archive', 0.015617945667724054),
  ('mexico', 0.013148718335144536),
  ('this', 0.01297581910908265),
  ('it', 0.01286144078405064),
  ('greek', 0.012205893397087032),
  ('dictionary', 0.012205893397087032),
  ('he', 0.012187541926527292)],
 0: [('authors', 0.028340036385174874),
  ('poets', 0.021937103860471256),
  ('art', 0.02101906983834661),
  ('wrote', 0.02024288313226777),
  ('history', 0.018930111148443965),
  ('literature', 0.01844854935751783),
  ('robert', 0.017549683088377007),
  ('john', 0.014757549430532311),
  ('artists', 0.014705542146391488),
  ('frank', 0.014705542146391488)],
 1: [('city', 0.05350030542120269),
  ('geography', 0.04835336104893239),
  ('island', 0.03776867676422316),
  ('etropolis', 0.0374287699887422),
  ('islands', 0.03147664864607424),
  ('tourism', 0.029943015990993756),
  ('country', 0.02858324234972881),
  ('kingdom', 0.023607486484555677)

In [18]:
topic_model_combined_20.visualize_topics()

### **Fourth Test**
For the fourth model, we returned to a `min_topic_size` of 4 and an `nr_topics` of 22, and kept the same `embedding_model`. Instead of using the pre-filtered data, we passed in our combined documents.

Findings:
*   Clusters begin to move away from each other and show less overlap
*   Topics are more usable and understandable

So far, this model has provided the best results on our test data. We chose to pass this through to our original data set as a dataframe and assign names to each topic in an effort to compare the results with our manually encoded topics.



In [19]:
# testing combined docs with an nr_topics = 22
topic_model_combined_22_4 = BERTopic(min_topic_size=4, nr_topics=22, embedding_model="paraphrase-MiniLM-L6-v2")
topics, _ = topic_model_combined_22_4.fit_transform(docs_combined)

topic_model_combined_22_4.get_topic_info()


Unnamed: 0,Topic,Count,Name
0,-1,123,-1_flag_world_was_mexico
1,0,28,0_warhol_company_business_industry
2,1,26,1_angels_theatre_angel_drama
3,2,25,2_authors_poets_art_penn
4,3,24,3_state_gov_annual_college
5,4,24,4_anatomy_animal_dog_teeth
6,5,23,5_disney_radio_train_band
7,6,23,6_to_take_word_from
8,7,22,7_sports_athletes_sportsmen_espn
9,8,21,8_geography_island_kingdom_islands
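
Because the comparison between tests is mostly visual, it can help to line up one number per run as well. The sketch below collects the outlier counts reported by `get_topic_info()` in each test so far and converts them to shares. It assumes every run was fit on the same 536 samples, and the `Run` class is a hypothetical container, not part of BERTopic.

```python
from dataclasses import dataclass

SAMPLE_SIZE = 536  # assumption: every run so far used the full small sample


@dataclass(frozen=True)
class Run:
    name: str
    outliers: int  # Count of topic -1 in that run's get_topic_info() output

    @property
    def outlier_share(self) -> float:
        return self.outliers / SAMPLE_SIZE


runs = [
    Run("control (combined, defaults)", 495),
    Run("first test (answers only)", 119),
    Run("second test (pre-filtered)", 161),
    Run("third test (combined, min_topic_size=8)", 135),
    Run("fourth test (combined, 4 / 22)", 123),
]

# List the runs from fewest to most outliers.
for run in sorted(runs, key=lambda r: r.outlier_share):
    print(f"{run.name:<42} {run.outliers:>4} {run.outlier_share:6.1%}")
```

The outlier share is only one signal. It says nothing about whether the assigned topics are coherent, which is why we also relied on the intertopic distance plots.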


In [31]:
topic_model_combined_22_4.visualize_topics()

### **Testing at Scale**

In [21]:
# importing new testing dataset
new_df = pd.read_csv('jeopardy_combined.csv', engine="python")
new_df.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,Q_A_Combined
0,4680,12/31/2004,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,"For the last 8 years of his life, Galileo was ..."
1,4680,12/31/2004,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,No. 2: 1912 Olympian; football star at Carlisl...
2,4680,12/31/2004,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,The city of Yuma in this state has a record av...
3,4680,12/31/2004,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,"In 1963, live on ""The Art Linkletter Show"", th..."
4,4680,12/31/2004,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,"Signer of the Dec. of Indep., framer of the Co..."


In [22]:
# defining new variable with roughly 25,000 docs
# note: .loc slicing is label-based and inclusive, so this keeps rows 0 through 25000
testing = new_df["Q_A_Combined"]
testing_docs = testing.loc[:25000]

In [23]:
# loading a new embedding model
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('msmarco-distilbert-base-v3')



### **Fifth Test**
Once we identified that our best-fit model used a minimum topic size of 4 and a maximum of 22 topics, we wanted to test at scale.

We pulled in our combined data set and passed through roughly 25,000 samples. After further research, we found `msmarco-distilbert-base-v3`, a SentenceTransformer model trained on short pieces of text such as sentences. As each Jeopardy sample is fairly short, we hypothesized that this would be a better embedding model for our data.

Findings:
*   Visually, our topic clusters reverted to the two extremes with significant overlap.

In [24]:
# fitting the 22_4 configuration to the new data to see how it performs at scale (roughly 50 times larger)

topic_model_combined_scale_min = BERTopic(min_topic_size=4, nr_topics=22, embedding_model=model)
topics, _ = topic_model_combined_scale_min.fit_transform(testing_docs)

topic_model_combined_scale_min.get_topic_info()

Unnamed: 0,Topic,Count,Name
0,-1,10396,-1_for_is_was_with
1,0,1428,0_for_with_on_as
2,1,1191,1_is_was_are_his
3,2,1174,2_archive_jpg__blank_clue
4,3,862,3_to_was_president_ship
5,4,807,4_song_band_you_album
6,5,773,5_city_museum_capital_square
7,6,767,6_was_first_planet_atomic
8,7,749,7_he_as_his_for
9,8,705,8_letter_word_flag_latin


In [25]:
topic_model_combined_scale_min.topics

{-1: [('for', 0.01635956466766821),
  ('is', 0.01597417168689026),
  ('was', 0.01488675171364743),
  ('with', 0.011437741861372246),
  ('in', 0.011285518110318596),
  ('one', 0.009191239506113951),
  ('an', 0.009176475392612808),
  ('be', 0.007155897723185443),
  ('are', 0.006936569138675962),
  ('city', 0.0066762783716564035)],
 0: [('for', 0.01819268588485113),
  ('with', 0.01495866585948198),
  ('on', 0.014609539021665955),
  ('as', 0.012193532603992889),
  ('an', 0.011160500182880916),
  ('he', 0.01086887947940326),
  ('his', 0.010306840058905637),
  ('used', 0.009393236417142289),
  ('are', 0.009109418666241967),
  ('instrument', 0.008054839796769995)],
 1: [('is', 0.017763422704059097),
  ('was', 0.015171073990739144),
  ('are', 0.013222076628863388),
  ('his', 0.011135714749029642),
  ('from', 0.011062908431796463),
  ('with', 0.010426537496662416),
  ('bird', 0.009750982132771627),
  ('king', 0.009120311795571205),
  ('her', 0.007665798353331849),
  ('named', 0.0075085417593990

In [26]:
topic_model_combined_scale_min.visualize_topics()
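
The topic names in this run are dominated by function words, which makes the topics hard to interpret. One rough way to flag that, sketched below, is to measure what fraction of each topic's top terms appear in a stop-word list. The terms are copied from the output above; the small `STOP_WORDS` set is a hand-picked assumption for illustration, not the list used by any library.

```python
# Hand-picked stop words for illustration only (an assumption, not a library's list).
STOP_WORDS = {
    "a", "an", "are", "as", "be", "for", "from", "he", "her", "his",
    "in", "is", "it", "of", "on", "the", "this", "to", "was", "with",
}

# Top terms per topic, copied from the output above.
top_terms = {
    -1: ["for", "is", "was", "with", "in", "one", "an", "be", "are", "city"],
    0: ["for", "with", "on", "as", "an", "he", "his", "used", "are", "instrument"],
    1: ["is", "was", "are", "his", "from", "with", "bird", "king", "her", "named"],
}


def stop_word_fraction(terms):
    """Return the fraction of terms that are in STOP_WORDS."""
    if not terms:
        return 0.0
    return sum(term in STOP_WORDS for term in terms) / len(terms)


for topic, terms in top_terms.items():
    print(topic, f"{stop_word_fraction(terms):.0%}")
```

A high fraction for most topics suggests the topic representation would benefit from stop-word handling before conclusions are drawn from the topic names.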

### **Final Test**
As a final test, we fit the larger sample again with updated parameters.

In a previous test, a larger `min_topic_size` provided us with a larger intertopic distance. Additionally, because we have a larger dataset, we increased the maximum number of topics (`nr_topics`) to 40 and passed through the embedding model set in the previous test.

Findings:
*   The intertopic distance spread out significantly more than in previous tests (this is *good*)
*   Overlap was still prevalent, but after further review, each topic appeared to provide reasonable, human-understandable clusters

In [27]:
# fitting a model with min_topic_size=8 and nr_topics=40 to the same large sample

topic_model_combined_scale = BERTopic(min_topic_size=8, nr_topics=40, embedding_model=model)
topics, _ = topic_model_combined_scale.fit_transform(testing_docs)

topic_model_combined_scale.get_topic_info()


Unnamed: 0,Topic,Count,Name
0,-1,11740,-1_was_is_on_as
1,0,1157,0_archive_jpg__blank_clue
2,1,856,1_song_no_band_you
3,2,722,2_word_meaning_latin_means
4,3,455,3_game_baseball_win_player
5,4,454,4_played_show_starred_actor
6,5,450,5_ballet_composer_beethoven_symphony
7,6,424,6_president_governor_elected_secretary
8,7,409,7_novel_book_published_wrote
9,8,390,8_soup_meat_eat_pie


In [28]:
topic_model_combined_scale.visualize_topics()

In [29]:
topic_model_combined_scale.topics

{-1: [('was', 0.0095893509582378),
  ('is', 0.009397942679157818),
  ('on', 0.009309515898306347),
  ('as', 0.008506722488549676),
  ('his', 0.008385246436111682),
  ('by', 0.008064809902234155),
  ('with', 0.008023316458448122),
  ('an', 0.00745925318665524),
  ('its', 0.006304647305467054),
  ('are', 0.005766064648884699)],
 0: [('archive', 0.05786998044822204),
  ('jpg', 0.05651815803934633),
  ('_blank', 0.056495753687389724),
  ('clue', 0.031074941867273616),
  ('crew', 0.027246411730515653),
  ('07', 0.014793651118773161),
  ('01', 0.0137895325282916),
  ('wmv', 0.012998814061417105),
  ('04', 0.011769048029508514),
  ('03', 0.01156436893334315)],
 1: [('song', 0.03686849529737),
  ('no', 0.02157555770835065),
  ('band', 0.021288487198111054),
  ('you', 0.020095467145894),
  ('heard', 0.017407837674084513),
  ('album', 0.01674755170264017),
  ('my', 0.015959472195670236),
  ('sang', 0.01550374308057221),
  ('singer', 0.01357253780382622),
  ('rock', 0.011638273682175885)],
 2: [(

### **How did we do?**
After determining that our final model produced the most distinct clusters, with less overlap than the previous models, we used it to predict topics for a sample twice the size of the one it was fit on.

We found that the model predicts topics fairly well for the samples it assigns, but roughly 80% of our data falls into the outlier topic.

More research and a deeper understanding of the available parameters and embedding models are needed to build a topic-generation tool that classifies Jeopardy data more reliably.

In [30]:
prediction_docs = testing.loc[:50000]

In [35]:
pre_prediction_df = new_df.loc[:50000] 
pre_prediction_df.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,Q_A_Combined
0,4680,12/31/2004,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,"For the last 8 years of his life, Galileo was ..."
1,4680,12/31/2004,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,No. 2: 1912 Olympian; football star at Carlisl...
2,4680,12/31/2004,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,The city of Yuma in this state has a record av...
3,4680,12/31/2004,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,"In 1963, live on ""The Art Linkletter Show"", th..."
4,4680,12/31/2004,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,"Signer of the Dec. of Indep., framer of the Co..."


In [31]:
# NEXT STEPS: pass in best fit model into new set up to 50k samples, combine into df, bin using human readable topics, and analyze topics to see how it performed

prediction_model = topic_model_combined_scale.transform(prediction_docs)

# do a left join with the results to generate frame

In [32]:
from pandas import DataFrame
prediction_list = prediction_model[0]

# creating prediction dataframe to join with original testing df
prediction_df = DataFrame(prediction_list, columns=["prediction_code"])

Unnamed: 0,prediction_code
0,-1
1,-1
2,-1
3,-1
4,-1
...,...
49996,-1
49997,-1
49998,-1
49999,-1


In [74]:
# joining original df with prediction df 
prediction_joined_df = pre_prediction_df.join(prediction_df) 
prediction_joined_df.head()


Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,Q_A_Combined,prediction_code
0,4680,12/31/2004,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,"For the last 8 years of his life, Galileo was ...",-1
1,4680,12/31/2004,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,No. 2: 1912 Olympian; football star at Carlisl...,-1
2,4680,12/31/2004,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,The city of Yuma in this state has a record av...,-1
3,4680,12/31/2004,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,"In 1963, live on ""The Art Linkletter Show"", th...",-1
4,4680,12/31/2004,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,"Signer of the Dec. of Indep., framer of the Co...",-1


In [111]:
# creating bins and labels to pass to our encoder

x = prediction_joined_df["prediction_code"]

usable_bins = [-2, -1, 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15,
       16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32,
       33, 34, 35, 36, 37, 38, 39]

labels = ["outlier", "miscellaneous.general", "music.general", "language.general", "sports.general", "entertainment.tv_movies", "music.classical", "culture.politics", "literature.general", "consumables.food", "religion.general", "entertainment.awards", "science.space", "people.general", "religion.judeo-christian", "entertainment.disney", "history.places", "art.general", "culture.places" , "music.instruments", "organizations.miscellaneous", "history.roman", "sports.teams", "history.politics", "literature.general", "geography.us_states", "geography.water", "language.symbols", "nature.general", "nature.animals", "history.american", "miscellaneous.garments", "history.british", "science.anatomy", "geography.water", "miscellaneous.news", "science.inventions", "science.gems", "miscellaneous.ships", "history.american", "consumables.drinks"]

assert len(usable_bins) == len(labels) + 1

In [110]:
len(labels)

41

In [112]:
prediction_joined_df["predicted_label"] = pd.cut(
    x=x,
    bins=usable_bins,
    labels=labels,
    ordered=False
)

# new df including predicted label as final column
prediction_joined_df.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,Q_A_Combined,prediction_code,predicted_label
0,4680,12/31/2004,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,"For the last 8 years of his life, Galileo was ...",-1,outlier
1,4680,12/31/2004,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,No. 2: 1912 Olympian; football star at Carlisl...,-1,outlier
2,4680,12/31/2004,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,The city of Yuma in this state has a record av...,-1,outlier
3,4680,12/31/2004,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,"In 1963, live on ""The Art Linkletter Show"", th...",-1,outlier
4,4680,12/31/2004,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,"Signer of the Dec. of Indep., framer of the Co...",-1,outlier
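
Since the topic codes are already integers, `pd.cut` is doing the work of a lookup table here: with right-closed bins, code `c` falls in the bin `(c - 1, c]` and receives `labels[c + 1]`. A plain dictionary expresses the same mapping more directly. The sketch below uses only the first few labels from the list above to stay short; `labels_head` and `code_to_label` are hypothetical names.

```python
# First few labels from the list above; index 0 belongs to topic code -1.
labels_head = ["outlier", "miscellaneous.general", "music.general", "language.general"]

# Topic codes start at -1, so code c maps to labels_head[c + 1].
code_to_label = {code: label for code, label in enumerate(labels_head, start=-1)}

codes = [-1, 2, 0, -1, 1]
predicted = [code_to_label.get(code, "unknown") for code in codes]
print(predicted)
```

With the full `labels` list, the same dictionary could be applied to the dataframe with `prediction_joined_df["prediction_code"].map(code_to_label)`, which also sidesteps the need for `ordered=False` when labels repeat.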



In [114]:
# grouping by predicted label to visualize number of samples within each topic
predicted_group_by_labels_df = prediction_joined_df.groupby(by="predicted_label", dropna=False).count()

In [89]:
# count_df is defined in the next cell (the cells were run out of order)
count_df.columns

Index(['Q_A_Combined'], dtype='object')

In [115]:
count_df = DataFrame(predicted_group_by_labels_df["Q_A_Combined"])
count_df.sort_values(by=["Q_A_Combined"], ascending=False)

Unnamed: 0_level_0,Q_A_Combined
predicted_label,Unnamed: 1_level_1
outlier,41426
miscellaneous.general,879
music.general,513
language.general,439
literature.general,407
music.classical,333
geography.water,320
sports.general,317
religion.general,261
culture.politics,255


In [116]:
is_literature = prediction_joined_df["predicted_label"] == "literature.general"

literature_df = prediction_joined_df.loc[is_literature, :]
literature_df

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,Q_A_Combined,prediction_code,predicted_label
121,3751,12/18/2000,Jeopardy!,FOREWORDS,$100,"""Conrad begins (and ends) Marlow's journey... ...",Heart of Darkness,"""Conrad begins (and ends) Marlow's journey... ...",23,literature.general
144,3751,12/18/2000,Jeopardy!,"""I"" LADS",$500,This auto exec's autobiography is one of the b...,Lee Iacocca,This auto exec's autobiography is one of the b...,7,literature.general
327,5690,5/8/2009,Double Jeopardy!,AMERICAN AUTHORS,$400,Her home Orchard House was the model for whre ...,Louisa May Alcott,Her home Orchard House was the model for whre ...,23,literature.general
349,5690,5/8/2009,Double Jeopardy!,NAME THE DECADE,"$1,800","George Orwell, 34 years dead, hits the bestsel...",the 1980s,"George Orwell, 34 years dead, hits the bestsel...",7,literature.general
493,5243,5/30/2007,Jeopardy!,AMERICAN AUTHORS,$800,"Under the name Laura Bancroft, he wrote about ...",L. Frank Baum,"Under the name Laura Bancroft, he wrote about ...",23,literature.general
...,...,...,...,...,...,...,...,...,...,...
24903,3334,2/18/1999,Double Jeopardy!,ILLUSTRATORS,$200,"In the early 1820s, before publishing his bird...",John J. Audubon,"In the early 1820s, before publishing his bird...",7,literature.general
24921,3334,2/18/1999,Double Jeopardy!,ILLUSTRATORS,$800,The ALA's medal for the artist of the best chi...,Randolph Caldecott,The ALA's medal for the artist of the best chi...,23,literature.general
24926,3334,2/18/1999,Double Jeopardy!,SPELL THE LAST NAME,$800,"Florentine author of ""The Prince"" Niccolo....",M-A-C-H-I-A-V-E-L-L-I,"Florentine author of ""The Prince"" Niccolo....M...",23,literature.general
24964,6226,10/17/2011,Double Jeopardy!,"YOU'VE GOT THE WRITE STUFF, JAMES",$400,"You wrote the novels ""Hawaii"" & ""Alaska""; too ...",(James) Michener,"You wrote the novels ""Hawaii"" & ""Alaska""; too ...",7,literature.general


In [117]:
is_disney = prediction_joined_df["predicted_label"] == "entertainment.disney"

disney_df = prediction_joined_df.loc[is_disney, :]
disney_df

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,Q_A_Combined,prediction_code,predicted_label
205,3673,7/19/2000,Double Jeopardy!,ALASKA,$200,4 different species of bears live in Alaska: ...,Polar bears,4 different species of bears live in Alaska: ...,14,entertainment.disney
1814,3113,2/25/1998,Jeopardy!,AT THE BUILDING SITE,$400,"He'll get you stoned or brickworked, & maybe e...",a mason,"He'll get you stoned or brickworked, & maybe e...",14,entertainment.disney
1904,3447,9/7/1999,Double Jeopardy!,NURSERY RHYMES,$800,"""Hey Diddle, Diddle!"" After the little dog la...",Dish & spoon,"""Hey Diddle, Diddle!"" After the little dog la...",14,entertainment.disney
2050,1800,5/29/1992,Jeopardy!,NATURE,$400,The Dorcas type of this graceful antelope is o...,a gazelle,The Dorcas type of this graceful antelope is o...,14,entertainment.disney
2132,5981,9/20/2010,Double Jeopardy!,GOAT-POURRI,"$1,200","In ""The Hunchback of Notre Dame"", Pierre Gring...",Esmeralda,"In ""The Hunchback of Notre Dame"", Pierre Gring...",14,entertainment.disney
...,...,...,...,...,...,...,...,...,...,...
24425,4167,10/15/2002,Double Jeopardy!,ACRONYMS,$400,"A race-conscious term, buppie is short for this",black urban professional,"A race-conscious term, buppie is short for thi...",14,entertainment.disney
24461,3320,1/29/1999,Jeopardy!,CHOCOLATEY QUOTES,$200,Sonny the breakfast cereal spokes-bird is asso...,"""I'm Cuckoo for Cocoa Puffs""",Sonny the breakfast cereal spokes-bird is asso...,14,entertainment.disney
24511,3320,1/29/1999,Final Jeopardy!,ANIMALS,,"This animal's name is from Bantu for ""mock man""",Chimpanzee,"This animal's name is from Bantu for ""mock man...",14,entertainment.disney
24550,3643,6/7/2000,Double Jeopardy!,DIRECTORS' FIRST FEATURES,$400,"""What's Up, Tiger Lily?"" (1966)",Woody Allen,"""What's Up, Tiger Lily?"" (1966)Woody Allen",14,entertainment.disney


In [118]:
is_language = prediction_joined_df["predicted_label"] == "language.general"

language_df = prediction_joined_df.loc[is_language, :]
language_df

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,Q_A_Combined,prediction_code,predicted_label
86,5957,7/6/2010,Double Jeopardy!,SCIENCE CLASS,$400,99.95% of the mass of an atom is in this part,the nucleus,99.95% of the mass of an atom is in this partt...,2,language.general
97,5957,7/6/2010,Double Jeopardy!,IN THE DICTIONARY,$800,"As an adjective, it can mean proper; as a verb...",correct,"As an adjective, it can mean proper; as a verb...",2,language.general
104,5957,7/6/2010,Double Jeopardy!,SCIENCE CLASS,"$5,000","Of the 6 noble gases on the periodic table, it...",helium,"Of the 6 noble gases on the periodic table, it...",2,language.general
109,5957,7/6/2010,Double Jeopardy!,IN THE DICTIONARY,"$5,000",This word for someone who walks comes from the...,pedestrian,This word for someone who walks comes from the...,2,language.general
181,3673,7/19/2000,Jeopardy!,GENERAL SCIENCE,$200,The time it takes for 50% of the atoms to deca...,Half-life,The time it takes for 50% of the atoms to deca...,2,language.general
...,...,...,...,...,...,...,...,...,...,...
24782,1245,1/19/1990,Double Jeopardy!,"""F"" IN MATH",$200,A number written as a quotient; examples are 2...,Fraction,A number written as a quotient; examples are 2...,2,language.general
24871,5271,7/9/2007,Double Jeopardy!,WORD WORDS,"$2,000","From the Latin for ""a foot and a half long"" co...",sesquipedalian,"From the Latin for ""a foot and a half long"" co...",2,language.general
24906,3334,2/18/1999,Double Jeopardy!,LEGAL LINGO,$200,"It can precede ""with a deadly weapon"", ""with i...",Assault,"It can precede ""with a deadly weapon"", ""with i...",2,language.general
24973,6226,10/17/2011,Double Jeopardy!,WORD DERIVATIONS,$800,"It's from the Old English for simply ""woman""; ...",wife,"It's from the Old English for simply ""woman""; ...",2,language.general
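
One quick way to sanity-check an assigned name is to compare it with the show's own `Category` column for the rows that received it. The sketch below tallies the categories visible in the truncated outputs above for two labels. It is only a spot check on the displayed rows, not on the full result, and the variable names are ours.

```python
from collections import Counter

# Category values copied from the rows displayed above (truncated output only).
categories_by_label = {
    "entertainment.disney": [
        "ALASKA", "AT THE BUILDING SITE", "NURSERY RHYMES", "NATURE", "GOAT-POURRI",
        "ACRONYMS", "CHOCOLATEY QUOTES", "ANIMALS", "DIRECTORS' FIRST FEATURES",
    ],
    "language.general": [
        "SCIENCE CLASS", "IN THE DICTIONARY", "SCIENCE CLASS", "IN THE DICTIONARY",
        "GENERAL SCIENCE", '"F" IN MATH', "WORD WORDS", "LEGAL LINGO", "WORD DERIVATIONS",
    ],
}

for label, categories in categories_by_label.items():
    counts = Counter(categories)
    print(label, counts.most_common(3))
```

On the full dataframe, `prediction_joined_df.groupby("predicted_label")["Category"].value_counts()` would give the same kind of tally for every label.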


In [119]:
prediction_joined_df.to_csv("predicted_labels_final.csv")

