# Guided Topic Modeling
This technique guides the topic modeling approach by setting several seed topics towards which the model is nudged to converge.

<u>Warning</u>: BERTopic is merely nudged towards creating those topics. In practice, if a seeded topic does not exist in the data, or splits into several smaller topics, it will not be modeled. Seed topics therefore need to reflect the data accurately for the model to converge towards them.

Read more about it [here](https://maartengr.github.io/BERTopic/getting_started/guided/guided.html).

<b>Results</b>: The technique was implemented successfully and improved the results slightly (about 300 more articles categorized). However, we believe its potential is higher if appropriate research is done and better, more accurate seed topics are supplied. We encourage ZHL to investigate further.
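
One inexpensive way to vet candidate seed topics before fitting is to count how many summaries mention each seed term at all. The sketch below uses only the standard library; `seed_term_coverage` is a hypothetical helper name, and plain case-insensitive substring matching is an assumption made for brevity (it is not how BERTopic itself uses the seeds).

```python
def seed_term_coverage(docs, seed_topics):
    """Return, per seed topic, a dict mapping each seed term to the number of docs containing it."""
    lowered = [doc.lower() for doc in docs]
    coverage = []
    for topic in seed_topics:
        # Count the documents in which the term occurs as a substring
        counts = {term: sum(term.lower() in doc for doc in lowered) for term in topic}
        coverage.append(counts)
    return coverage


# Toy data for illustration only (not taken from the article corpus)
example_docs = [
    "The article discusses cattle raiding and livestock losses.",
    "The article discusses oil production and pipeline fees.",
    "The article discusses a football match.",
]
example_seeds = [["cattle", "livestock", "pastoralism"], ["oil production", "resource curse"]]

example_coverage = seed_term_coverage(example_docs, example_seeds)
for topic, counts in zip(example_seeds, example_coverage):
    print(topic[0], counts)
```

Terms with a count near zero are unlikely to anchor a topic and may be worth replacing before the model is fitted.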

In [None]:
import os

import pandas as pd
from bertopic import BERTopic
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

In [32]:
# Read the data and perform preprocessing
df = pd.read_csv("data/articles_summary_cleaned.csv", parse_dates=["date"]) # Read data into 'df' dataframe
print(df.shape) # Print dataframe shape

docs = df["summary"].tolist() # Create a list containing all article summaries

df.head() # Show first 5 dataframe entries

(18520, 5)


Unnamed: 0,summary,date,location_article,lat,lng
0,The article discusses the passing of the new C...,2011-07-07,Juba,4.859363,31.57125
1,The article discusses the military actions tak...,2011-07-03,Abyei,9.838551,28.486396
2,The article discusses the signing of a Framewo...,2011-06-30,Southern Kordofan,11.036544,30.895824
3,The article discusses the upcoming independenc...,2011-07-04,South Sudan,6.876992,31.306979
4,The article discusses the need for South Sudan...,2011-07-02,Juba,4.859363,31.57125


In [83]:
if os.path.exists('guided_model'):
    guided_bertopic = BERTopic.load('guided_model')
else:
    # Create a list of seed topics
    seed_topics = [["corruption", "governance", "political instability", "leadership crisis", "failed state"],  # Political
               ["oil production", "china", "india", "resource exploitation", "oil revenues", "resource curse"],  # Oil
               ["hunger", "food security", "poverty", "famine", "food aid", "livelihoods", "food crisis"],  # Food Insecurity
               ["migration", "refugees", "displacement", "asylum", "internal displacement", "IDPs", "returnees"],  # Refugee
               ["humanitarian", "health", "education", "aid and development", "NGOs", "UN agencies", "humanitarian crisis"],  # Human Aid
               ["peace and security", "conflict", "violence", "civil war", "ethnic conflict", "peace process", "insecurity"],  # Conflict
               ["terrorism", "extremism", "armed groups", "rebel forces", "insurgency", "extremist organizations"],  # Terrorism
               ["climate", "flood", "drought", "environment", "climate change", "natural disasters", "environmental degradation"],  # Climate causes
               ["livestock", "cattle", "animals", "herding", "pastoralism", "livestock diseases", "livestock health", "livestock markets"]]  # Livestock


    guided_bertopic = BERTopic(language="english", calculate_probabilities=True, verbose=True, seed_topic_list = seed_topics) # Initialize the BERTopic model
    topics, _ = guided_bertopic.fit_transform(docs)
    guided_bertopic.save("guided_model") # Save the trained model as "guided_model"

In [None]:
guided_bertopic.visualize_documents(docs)

In [None]:
guided_bertopic.visualize_topics()

In [None]:
guided_bertopic.generate_topic_labels()

['-1_and_the_in',
 '0_independence_author_tribalism',
 '1_oil_pipeline_production',
 '2_journalists_media_freedom',
 '3_abyei_referendum_area',
 '4_border_zone_agreements',
 '5_cabinet_governor_speaker',
 '6_albashir_president_visit',
 '7_party_splm_liberation',
 '8_refugees_unhcr_refugee',
 '9_humanitarian_aid_million',
 '10_kiir_author_leadership',
 '11_church_churches_bishop',
 '12_transitional_unity_machar',
 '13_lakes_rumbek_dhuol',
 '14_china_chinese_chinas',
 '15_murle_lou_jonglei',
 '16_heglig_panthou_withdraw',
 '17_human_rights_commission',
 '18_corruption_anticorruption_money',
 '19_food_famine_hunger',
 '20_detainees_release_trial',
 '21_deal_agreement_machar',
 '22_uhuru_kenyatta_kenyattas',
 '23_arms_embargo_weapons',
 '24_traders_ugandan_trade',
 '25_refugees_uganda_refugee',
 '26_updf_ugandas_troops',
 '27_cup_match_football',
 '28_children_child_soldiers',
 '29_igad_talks_intergovernmental',
 '30_signing_peace_revolutionary',
 '31_sanctions_us_individuals',
 '32_saf_ja

In [84]:
# We create a function to calculate a list of the top n topics related to (a) given keyword(s)

def get_relevant_topics(bertopic_model, keywords, top_n):
    '''
    Retrieve a list of the top n number of relevant topics to the provided (list of) keyword(s)


    Parameters:
        bertopic_model: a (fitted) BERTopic model object

        keywords:   a string containing one or multiple keywords to match against,

                    This can also be a list in the form of ['keyword(s)', 'keyword(s)', ...]

                    In this case a maximum of top_n topics will be found per list element
                    and subsetted to the top_n most relevant topics.

                    !!!
                    Take care that this method only considers the relevancy per inputted keyword(s)
                    and not the relevancy to the combined list of keywords.

                    In other words, topics that appear in the output might be significantly related to a
                    particular element in the list of keywords but not so to any other element,

                    while topics that do not appear in the output might be significantly related to the
                    combined list of keywords but not much to any of the keyword(s) in particular.
                    !!!

        top_n: an integer indicating the number of desired relevant topics to be retrieved


        Return: a list of (topic_id, relevancy) tuples for the topics most relevant to the (list of)
                provided keyword(s), sorted by descending relevancy and limited to at most 10 entries
    '''

    if isinstance(keywords, str):
        keywords = [keywords] # If a single string is provided, convert it to a list

    relevant_topics = list() # Initialize an empty list of relevant topics

    for keyword in keywords: # Iterate through list of keywords

        # Find the top n number of topics related to the current keyword(s)
        topics = bertopic_model.find_topics(keyword, top_n = top_n)

        # Add the topics to the list of relevant topics in the form of (topic_id, relevancy)
        relevant_topics.extend(
            zip(topics[0], topics[1]) # topics[0] = topic_id, topics[1] = relevancy
        )


    relevant_topics.sort(key=lambda x: x[1]) # Sort the list of topics on ASCENDING ORDER of relevancy

    # Get a list of unique topics (dict() keeps the last, i.e. greatest, relevancy for duplicate topics)
    relevant_topics = list(dict(relevant_topics).items())


    relevant_topics.sort(key=lambda x: x[1], reverse=True) # Now sort the list of topics on DESCENDING ORDER of relevancy

    return relevant_topics[:10] # Return at most 10 unique topics (hard-coded; top_n only limits the search per keyword)
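
The deduplication step in `get_relevant_topics` relies on the fact that building a `dict` from `(topic_id, relevancy)` pairs keeps the last value seen for each key, so sorting in ascending order first leaves each topic with its highest score. The sketch below isolates that step on made-up pairs; `dedupe_keep_best` is a hypothetical name, and unlike the function above it slices by `top_n` instead of a fixed 10.

```python
def dedupe_keep_best(pairs, top_n):
    """Keep the highest relevancy per topic id and return the top_n pairs, best first."""
    ascending = sorted(pairs, key=lambda pair: pair[1])
    best = dict(ascending)  # later (larger) scores overwrite earlier ones
    ranked = sorted(best.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]


# Made-up (topic_id, relevancy) pairs; topic 7 is matched by two keywords
example_pairs = [(7, 0.40), (3, 0.55), (7, 0.62), (5, 0.31)]
example_ranked = dedupe_keep_best(example_pairs, top_n=2)
print(example_ranked)  # [(7, 0.62), (3, 0.55)]
```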

# Livestock

In [85]:
# Get the top 10 topics related to the keywords 'cattle', 'livestock', 'animals'
relevant_topics = get_relevant_topics(bertopic_model = guided_bertopic, keywords=['cattle', 'livestock', 'animals'], top_n=15)

topic_ids = [el[0] for el in relevant_topics] # Create separate list of topic IDs

for topic_id, relevancy in relevant_topics: # Print neat list of (topic_id, relevancy) tuples
    print(topic_id, relevancy)

df["livestock"] = [t in topic_ids for t in guided_bertopic.topics_] # Add boolean column to df if topic in list of relevant topics

# View the Count, Name, Representation, and Representative Docs for the relevant topics
guided_bertopic.get_topic_info().set_index('Topic').loc[topic_ids]

159 0.67029715
73 0.57171
219 0.4856301
35 0.47053766
64 0.4285058
55 0.41244912
47 0.3668123
13 0.35744536
19 0.34786773
24 0.32890373


Topic,Count,Name,Representation,Representative_Docs
159,21,159_livestock_animal_cattle_animals,"[livestock, animal, cattle, animals, diseases,...",[The article discusses how 13 months of civil ...
73,44,73_cattle_raiders_cows_rustling,"[cattle, raiders, cows, rustling, warrap, raid...",[The article discusses a clash between police ...
219,12,219_wildlife_poaching_park_conservation,"[wildlife, poaching, park, conservation, eleph...",[The article discusses the increase in wildlif...
35,84,35_agriculture_agricultural_food_farmers,"[agriculture, agricultural, food, farmers, far...",[The article discusses the need for cooperatio...
64,52,64_fao_food_seeds_kits,"[fao, food, seeds, kits, million, livelihood, ...",[The article discusses FAO's efforts to provid...
55,63,55_food_hunger_insecurity_farmers,"[food, hunger, insecurity, farmers, million, w...",[The article discusses the findings of a new C...
47,69,47_wfp_food_assistance_world,"[wfp, food, assistance, world, programme, mill...",[The article discusses the European Commission...
13,146,13_lakes_rumbek_dhuol_governor,"[lakes, rumbek, dhuol, governor, chut, matur, ...",[The article discusses the death of Colonel Yo...
19,107,19_food_famine_hunger_million,"[food, famine, hunger, million, crisis, starva...",[The article discusses how extreme hunger is a...
24,100,24_traders_ugandan_trade_uganda,"[traders, ugandan, trade, uganda, market, comp...",[The article discusses how the government of U...
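
The boolean column is built by testing each document's assigned topic against the list of relevant topic IDs. A minimal sketch of the same idea on toy values follows; converting the IDs to a `set` is an optional tweak assumed here to make each membership test constant-time, and `flag_documents` is a hypothetical helper, not part of BERTopic.

```python
def flag_documents(doc_topics, relevant_ids):
    """Return one boolean per document: True if its topic is in relevant_ids."""
    relevant = set(relevant_ids)
    return [topic in relevant for topic in doc_topics]


# Toy topic assignments (-1 is BERTopic's outlier topic)
example_doc_topics = [159, -1, 73, 2, 159]
example_flags = flag_documents(example_doc_topics, relevant_ids=[159, 73, 219])
print(example_flags)  # [True, False, True, False, True]
```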


# Corruption

In [73]:
# Get the top 10 topics related to the keywords 'corruption' and 'coup'
relevant_topics = get_relevant_topics(bertopic_model = guided_bertopic, keywords=['corruption', 'coup'], top_n=15)

topic_ids = [el[0] for el in relevant_topics] # Create separate list of topic IDs

for topic_id, relevancy in relevant_topics: # Print neat list of (topic_id, relevancy) tuples
    print(topic_id, relevancy)

df["corruption"] = [t in topic_ids for t in guided_bertopic.topics_] # Add boolean column to df if topic in list of relevant topics

# View the Count, Name, Representation, and Representative Docs for the relevant topics
guided_bertopic.get_topic_info().set_index('Topic').loc[topic_ids]

123 0.795946
18 0.5801017
98 0.4768626
149 0.3980472
79 0.3964479
225 0.39140248
144 0.36744428
10 0.35612983
20 0.35333145
131 0.3499308


Topic,Count,Name,Representation,Representative_Docs
123,28,123_corruption_transparency_scored_plaintiff,"[corruption, transparency, scored, plaintiff, ...",[The article discusses the definition of corru...
18,120,18_corruption_anticorruption_money_officials,"[corruption, anticorruption, money, officials,...",[The article discusses South Sudan's President...
98,35,98_sentry_report_corruption_illicit,"[sentry, report, corruption, illicit, mel, dol...",[The article discusses the release of a new in...
149,22,149_tax_collection_revenue_taxes,"[tax, collection, revenue, taxes, finance, cen...",[The article discusses the Republic of South S...
79,43,79_kenyans_kenyan_kenyatta_were,"[kenyans, kenyan, kenyatta, were, four, famili...",[The article discusses how families of four Ke...
225,12,225_athuai_deng_alliance_kidnapping,"[athuai, deng, alliance, kidnapping, society, ...",[The article discusses the shooting of Deng At...
144,23,144_land_grabbing_lease_issue,"[land, grabbing, lease, issue, grabbers, equat...",[The article discusses the issue of land grabb...
10,193,10_kiir_author_leadership_nuer,"[kiir, author, leadership, nuer, his, kiirs, p...",[The article discusses the political crisis in...
20,106,20_detainees_release_trial_treason,"[detainees, release, trial, treason, prisoners...",[The article discusses the release of seven So...
131,26,131_city_mayor_plastic_garbage,"[city, mayor, plastic, garbage, cleaning, clea...",[The article discusses young South Sudanese wh...


# Oil Production

In [74]:
# Get the top 10 topics related to the keywords 'oil', 'china' and 'india'
relevant_topics = get_relevant_topics(bertopic_model = guided_bertopic, keywords=['oil', 'china', 'india'], top_n=15)

topic_ids = [el[0] for el in relevant_topics] # Create separate list of topic IDs

for topic_id, relevancy in relevant_topics: # Print neat list of (topic_id, relevancy) tuples
    print(topic_id, relevancy)

df["oil"] = [t in topic_ids for t in guided_bertopic.topics_] # Add boolean column to df if topic in list of relevant topics

# View the Count, Name, Representation, and Representative Docs for the relevant topics
guided_bertopic.get_topic_info().set_index('Topic').loc[topic_ids]

41 0.43788797
1 0.4376254
174 0.39840442
14 0.39138252
181 0.36217442
215 0.3026979
119 0.278493
113 0.2638155
66 0.24643165
175 0.24450883


Topic,Count,Name,Representation,Representative_Docs
41,78,41_oil_production_cooperation_countries,"[oil, production, cooperation, countries, resu...",[The article discusses a meeting held in Khart...
1,378,1_oil_pipeline_production_fees,"[oil, pipeline, production, fees, crude, trans...",[The article discusses the release of impounde...
174,18,174_oil_machars_riek_machar,"[oil, machars, riek, machar, revenues, reserve...",[The article discusses the major offensive lau...
14,146,14_china_chinese_chinas_beijing,"[china, chinese, chinas, beijing, oil, visit, ...",[The article discusses the inauguration of the...
181,17,181_fuel_petrol_trucks_petroleum,"[fuel, petrol, trucks, petroleum, shortage, su...",[The article discusses how the ministry of pet...
215,13,215_fragile_index_ranked_most,"[fragile, index, ranked, most, fsi, ffp, world...",[The article discusses the release of the Frag...
119,29,119_prices_price_inflation_beverages,"[prices, price, inflation, beverages, goods, c...",[The article discusses a decrease in inflation...
113,31,113_japanese_japan_japans_engineering,"[japanese, japan, japans, engineering, tokyo, ...",[The article discusses the establishment of a ...
66,50,66_itu_internet_network_mtn,"[itu, internet, network, mtn, code, cable, tel...",[The article discusses the signing of a memora...
175,18,175_omer_albashir_bashir_khartoum,"[omer, albashir, bashir, khartoum, kordofan, h...",[The article discusses the threat of war by Su...


# Hunger

In [75]:
# Get the top 10 topics related to the keywords 'hunger' and 'food insecurity'
relevant_topics = get_relevant_topics(bertopic_model = guided_bertopic, keywords=['hunger', 'food insecurity'], top_n=15)

topic_ids = [el[0] for el in relevant_topics] # Create separate list of topic IDs

for topic_id, relevancy in relevant_topics: # Print neat list of (topic_id, relevancy) tuples
    print(topic_id, relevancy)

df["hunger"] = [t in topic_ids for t in guided_bertopic.topics_] # Add boolean column to df if topic in list of relevant topics

# View the Count, Name, Representation, and Representative Docs for the relevant topics
guided_bertopic.get_topic_info().set_index('Topic').loc[topic_ids]

55 0.65119267
19 0.4918156
64 0.47771373
137 0.43900698
47 0.43301296
35 0.35561422
146 0.32271755
235 0.2837204
159 0.2780196
9 0.27740264


Topic,Count,Name,Representation,Representative_Docs
55,63,55_food_hunger_insecurity_farmers,"[food, hunger, insecurity, farmers, million, w...",[The article discusses the findings of a new C...
19,107,19_food_famine_hunger_million,"[food, famine, hunger, million, crisis, starva...",[The article discusses how extreme hunger is a...
64,52,64_fao_food_seeds_kits,"[fao, food, seeds, kits, million, livelihood, ...",[The article discusses FAO's efforts to provid...
137,24,137_malnutrition_children_unicef_breastfeeding,"[malnutrition, children, unicef, breastfeeding...",[The article discusses the severe acute malnut...
47,69,47_wfp_food_assistance_world,"[wfp, food, assistance, world, programme, mill...",[The article discusses the European Commission...
35,84,35_agriculture_agricultural_food_farmers,"[agriculture, agricultural, food, farmers, far...",[The article discusses the need for cooperatio...
146,23,146_children_unicef_families_million,"[children, unicef, families, million, malnutri...",[The article discusses how violence and insecu...
235,10,235_tons_metric_deliver_humanitarian,"[tons, metric, deliver, humanitarian, delivery...",[The article discusses the opening of a humani...
159,21,159_livestock_animal_cattle_animals,"[livestock, animal, cattle, animals, diseases,...",[The article discusses how 13 months of civil ...
9,209,9_humanitarian_aid_million_assistance,"[humanitarian, aid, million, assistance, fundi...",[The article discusses the United States' anno...


# Migration

In [76]:
# Get the top 10 topics related to the keywords 'refugees', 'displaced' and 'flee'
relevant_topics = get_relevant_topics(bertopic_model = guided_bertopic, keywords=['refugees', 'displaced', 'flee'], top_n=15)

topic_ids = [el[0] for el in relevant_topics] # Create separate list of topic IDs

for topic_id, relevancy in relevant_topics: # Print neat list of (topic_id, relevancy) tuples
    print(topic_id, relevancy)

df["refugees"] = [t in topic_ids for t in guided_bertopic.topics_] # Add boolean column to df if topic in list of relevant topics

# View the Count, Name, Representation, and Representative Docs for the relevant topics
guided_bertopic.get_topic_info().set_index('Topic').loc[topic_ids]

128 0.81677043
25 0.68290854
8 0.6551687
132 0.6215856
40 0.5845568
201 0.5435972
152 0.52415866
44 0.5093847
136 0.4976229
146 0.49740756


Topic,Count,Name,Representation,Representative_Docs
128,26,128_refugees_migrants_asylum_israeli,"[refugees, migrants, asylum, israeli, israel, ...",[The article discusses Israeli Prime Minister ...
25,100,25_refugees_uganda_refugee_district,"[refugees, uganda, refugee, district, adjumani...",[The article discusses the launch of a regiona...
8,210,8_refugees_unhcr_refugee_nile,"[refugees, unhcr, refugee, nile, yida, camp, a...",[The article discusses the concerns expressed ...
132,25,132_kakuma_refugee_camp_refugees,"[kakuma, refugee, camp, refugees, kenya, camps...",[The article discusses the influx of refugees ...
40,80,40_displaced_idps_internally_people,"[displaced, idps, internally, people, displace...",[The article discusses the high number of inte...
201,15,201_bentiu_conditions_flooding_camp,"[bentiu, conditions, flooding, camp, base, peo...",[The article discusses the dire humanitarian s...
152,22,152_civilians_unmiss_bases_peacekeeping,"[civilians, unmiss, bases, peacekeeping, un, r...",[The article discusses new fighting in South S...
44,72,44_returnees_kosti_iom_repatriation,"[returnees, kosti, iom, repatriation, migratio...",[The article discusses the arrival of the last...
136,24,136_kenyans_evacuation_kenyan_flight,"[kenyans, evacuation, kenyan, flight, national...",[The article discusses the evacuation of Kenya...
146,23,146_children_unicef_families_million,"[children, unicef, families, million, malnutri...",[The article discusses how violence and insecu...


# Humanitarian

In [77]:
# Get the top 10 topics related to the keyword 'humanitarian'
relevant_topics = get_relevant_topics(bertopic_model = guided_bertopic, keywords=['humanitarian'], top_n=15)

topic_ids = [el[0] for el in relevant_topics] # Create separate list of topic IDs

for topic_id, relevancy in relevant_topics: # Print neat list of (topic_id, relevancy) tuples
    print(topic_id, relevancy)

df["humanitarian"] = [t in topic_ids for t in guided_bertopic.topics_] # Add boolean column to df if topic in list of relevant topics

# View the Count, Name, Representation, and Representative Docs for the relevant topics
guided_bertopic.get_topic_info().set_index('Topic').loc[topic_ids]

187 0.67556524
229 0.6606716
190 0.64318967
76 0.64160997
9 0.64067316
56 0.6112286
40 0.60082227
152 0.6005495
201 0.5981987
97 0.5817331


Topic,Count,Name,Representation,Representative_Docs
187,17,187_education_ecw_children_global,"[education, ecw, children, global, school, pay...",[The article discusses the passing of Congress...
229,11,229_blood_health_canadas_healthcare,"[blood, health, canadas, healthcare, hospital,...",[The article discusses the inauguration of the...
190,17,190_lanzer_toby_humanitarian_coordinator,"[lanzer, toby, humanitarian, coordinator, mr, ...",[The article discusses a press briefing with t...
76,44,76_humanitarian_pibor_jonglei_affected,"[humanitarian, pibor, jonglei, affected, aid, ...",[The article discusses the aid distribution op...
9,209,9_humanitarian_aid_million_assistance,"[humanitarian, aid, million, assistance, fundi...",[The article discusses the United States' anno...
56,63,56_workers_aid_humanitarian_worker,"[workers, aid, humanitarian, worker, killing, ...",[The article discusses the disappearance of si...
40,80,40_displaced_idps_internally_people,"[displaced, idps, internally, people, displace...",[The article discusses the high number of inte...
152,22,152_civilians_unmiss_bases_peacekeeping,"[civilians, unmiss, bases, peacekeeping, un, r...",[The article discusses new fighting in South S...
201,15,201_bentiu_conditions_flooding_camp,"[bentiu, conditions, flooding, camp, base, peo...",[The article discusses the dire humanitarian s...
97,36,97_red_icrc_cross_ifrc,"[red, icrc, cross, ifrc, medical, care, cresce...",[The article discusses the International Feder...


# Conflict

In [78]:
# Get the top 10 topics related to the keywords 'conflict', 'fighting', 'murder' and 'troops'
relevant_topics = get_relevant_topics(bertopic_model = guided_bertopic, keywords=['conflict', 'fighting', 'murder', 'troops'], top_n=15)

topic_ids = [el[0] for el in relevant_topics] # Create separate list of topic IDs

for topic_id, relevancy in relevant_topics: # Print neat list of (topic_id, relevancy) tuples
    print(topic_id, relevancy)

df["conflict"] = [t in topic_ids for t in guided_bertopic.topics_] # Add boolean column to df if topic in list of relevant topics

# View the Count, Name, Representation, and Representative Docs for the relevant topics
guided_bertopic.get_topic_info().set_index('Topic').loc[topic_ids]

158 0.44024155
232 0.4298405
81 0.42673117
54 0.41361254
173 0.40674824
48 0.4008087
26 0.39658153
99 0.38852888
28 0.38664374
122 0.3857227


Topic,Count,Name,Representation,Representative_Docs
158,21,158_her_murder_sister_shot,"[her, murder, sister, shot, veronika, was, rac...",[The article discusses the call made by Bishop...
232,11,232_truce_positions_attacking_army,"[truce, positions, attacking, army, upper, opp...",[The article discusses the accusation made by ...
81,42,81_force_troops_deployment_protection,"[force, troops, deployment, protection, peacek...",[The article discusses the deployment of troop...
54,64,54_lra_kony_lords_resistance,"[lra, kony, lords, resistance, joseph, central...",[The article discusses the end of the six-year...
173,19,173_ddr_excombatants_reintegration_program,"[ddr, excombatants, reintegration, program, de...",[The article discusses the plans of the South ...
48,68,48_border_kordofan_blue_both,"[border, kordofan, blue, both, accusations, su...",[The article discusses the upcoming presidenti...
26,99,26_updf_ugandas_troops_ugandan,"[updf, ugandas, troops, ugandan, withdrawal, u...",[The article discusses the withdrawal of Ugand...
99,34,99_talks_ababa_addis_round,"[talks, ababa, addis, round, peace, ethiopia, ...",[The article discusses the final round of peac...
28,97,28_children_child_soldiers_recruitment,"[children, child, soldiers, recruitment, unice...",[The article discusses the United Nations Spec...
122,28,122_rwanda_rdf_peacekeeping_peacekeepers,"[rwanda, rdf, peacekeeping, peacekeepers, rwan...",[The article discusses the four-day official v...


# Terrorism

In [79]:
# Get the top 10 topics related to the keyword 'terrorism'
relevant_topics = get_relevant_topics(bertopic_model = guided_bertopic, keywords=['terrorism'], top_n=15)

topic_ids = [el[0] for el in relevant_topics] # Create separate list of topic IDs

for topic_id, relevancy in relevant_topics: # Print neat list of (topic_id, relevancy) tuples
    print(topic_id, relevancy)

df["terrorism"] = [t in topic_ids for t in guided_bertopic.topics_] # Add boolean column to df if topic in list of relevant topics

# View the Count, Name, Representation, and Representative Docs for the relevant topics
guided_bertopic.get_topic_info().set_index('Topic').loc[topic_ids]

161 0.4508683
191 0.4389815
139 0.43733042
32 0.43284658
206 0.4307009
228 0.42565465
233 0.4194704
157 0.40761083
154 0.39486194
165 0.39259413


Topic,Count,Name,Representation,Representative_Docs
161,21,161_bentiu_civilians_killed_killings,"[bentiu, civilians, killed, killings, mosque, ...",[The article discusses the United Nations' con...
191,16,191_attack_convoy_un_unmiss,"[attack, convoy, un, unmiss, peacekeepers, inv...",[The article discusses an attack on a U.N. con...
139,24,139_jonglei_violence_civilians_communities,"[jonglei, violence, civilians, communities, un...",[The article discusses the ongoing ethnic viol...
32,87,32_saf_jau_bombing_aerial,"[saf, jau, bombing, aerial, attacks, armed, fo...",[The article discusses the condemnation by the...
206,15,206_civilians_ethnic_crimes_killings,"[civilians, ethnic, crimes, killings, ethnicit...",[The article discusses how both sides in South...
228,11,228_bombs_cluster_use_munitions,"[bombs, cluster, use, munitions, remnants, fou...",[The article discusses how Human Rights Watch ...
233,10,233_weapons_firearms_wau_marking,"[weapons, firearms, wau, marking, unauthorized...",[The article discusses South Sudan's interior ...
157,21,157_gambella_ethiopian_gambela_attack,"[gambella, ethiopian, gambela, attack, childre...",[The article discusses armed men from the Murl...
154,21,154_nuba_mountains_wolf_wolfs,"[nuba, mountains, wolf, wolfs, frank, obama, r...",[The article discusses the bombing of civilian...
165,20,165_unamid_darfur_peacekeepers_abeche,"[unamid, darfur, peacekeepers, abeche, khor, p...",[The article discusses the review of the situa...


# Climate

In [80]:
# Get the top 10 topics related to the keywords 'flooding' and 'droughts'
relevant_topics = get_relevant_topics(bertopic_model = guided_bertopic, keywords=['flooding', 'droughts'], top_n=15)

topic_ids = [el[0] for el in relevant_topics] # Create separate list of topic IDs

for topic_id, relevancy in relevant_topics: # Print neat list of (topic_id, relevancy) tuples
    print(topic_id, relevancy)

df["nature"] = [t in topic_ids for t in guided_bertopic.topics_] # Add boolean column to df if topic in list of relevant topics

# View the Count, Name, Representation, and Representative Docs for the relevant topics
guided_bertopic.get_topic_info().set_index('Topic').loc[topic_ids]

196 0.63046366
57 0.55128306
160 0.36778143
201 0.33107322
55 0.2890054
19 0.28028253
68 0.27285278
80 0.27004346
13 0.26021403
36 0.24976525


Topic,Count,Name,Representation,Representative_Docs
196,16,196_water_drought_climate_flooding,"[water, drought, climate, flooding, horn, dipo...",[The article discusses how climate change may ...
57,61,57_flooding_floods_flood_affected,"[flooding, floods, flood, affected, rains, peo...",[The article discusses the distribution of aid...
160,21,160_water_supply_drinking_project,"[water, supply, drinking, project, clean, dise...",[The article discusses the water crisis in Jub...
201,15,201_bentiu_conditions_flooding_camp,"[bentiu, conditions, flooding, camp, base, peo...",[The article discusses the dire humanitarian s...
55,63,55_food_hunger_insecurity_farmers,"[food, hunger, insecurity, farmers, million, w...",[The article discusses the findings of a new C...
19,107,19_food_famine_hunger_million,"[food, famine, hunger, million, crisis, starva...",[The article discusses how extreme hunger is a...
68,49,68_dam_egypt_renaissance_grand,"[dam, egypt, renaissance, grand, gerd, ethiopi...",[The article discusses the meeting of irrigati...
80,42,80_basin_water_egypt_nile,"[basin, water, egypt, nile, irrigation, projec...",[The article discusses Minister of Water Resou...
13,146,13_lakes_rumbek_dhuol_governor,"[lakes, rumbek, dhuol, governor, chut, matur, ...",[The article discusses the death of Colonel Yo...
36,84,36_cholera_outbreak_cases_health,"[cholera, outbreak, cases, health, hygiene, sp...",[The article discusses the outbreak of cholera...


# Saving

In [81]:
original_df = pd.read_csv('data/articles_summary_cleaned.csv', parse_dates=["date"])

# Combine article summaries with the newly created features
df = original_df.merge(
    df[["summary", "hunger", "refugees", "humanitarian", "conflict", "corruption", "terrorism", "nature", 'oil','livestock']],
    how="left",
    on="summary",
)
df.to_csv("articles_topics.csv", index=False) # Save DataFrame to articles_topics.csv

In [86]:
print(len(df))
print(len(df[(df["hunger"]==False) & (df["refugees"] == False) & (df["humanitarian"] == False) & (df["conflict"] == False) & (df["corruption"] == False) & (df["terrorism"] == False) & (df["nature"] == False) & (df["oil"] == False)& (df["livestock"] == False)]))

18520
14406
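
The two counts printed above give the share of articles that received at least one topic flag. The arithmetic is spelled out below, together with a small helper that counts unflagged rows; the helper and its toy rows are illustrative assumptions, only the two totals come from the output above.

```python
total_articles = 18520      # len(df) from the output above
unflagged_articles = 14406  # rows where every topic flag is False

flagged_articles = total_articles - unflagged_articles
flagged_share = flagged_articles / total_articles
print(flagged_articles, f"{flagged_share:.1%}")  # 4114 22.2%


def count_unflagged(rows, flag_names):
    """Count rows (dicts) in which none of the named flags is truthy."""
    return sum(not any(row[name] for name in flag_names) for row in rows)


# Toy rows for illustration only
example_rows = [
    {"hunger": True, "oil": False},
    {"hunger": False, "oil": False},
    {"hunger": False, "oil": True},
]
example_unflagged = count_unflagged(example_rows, ["hunger", "oil"])
print(example_unflagged)  # 1
```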
