# Digital Methods - Dictionary Classifier
_____

## Table of Contents

1. [Libraries](#libraries)
2. [Load Data](#load-data)
3. [Data Preprocessing](#preprocessing-of-the-data)
4. [Building Dictionary](#building-dictionary)
5. [Classifier](#classifier)
_____

## Libraries

All libraries needed to run the code are listed here. Install the packages with the `requirements.txt` file.

The documentation can be found in the [README.md](README.md) file.

In [1]:
# import packages
import pandas as pd 
from tqdm import tqdm
import nltk
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize, pos_tag
from nltk.corpus import wordnet
from nltk.stem.snowball import SnowballStemmer
from nltk.util import ngrams
from preprocessing_functions import *

## Load Data

Load the YouTube comments gathered during data collection.

In [2]:
df = pd.read_csv('data/comments_final.csv')

In [3]:
# process the data using the functions imported from preprocessing_functions
processed_df = (
    df.pipe(remove_users, 'text')
      .pipe(lowercase_text, 'text')
      .pipe(remove_whitespace, 'text')
      .pipe(remove_punctuation, 'text')
)

In [4]:
# preview the processed df (users removed, lowercased, whitespace and punctuation removed; stemming and lemmatization follow in the next section)
processed_df.head()

Unnamed: 0,video_id,published_at,like_count,text,author
0,uW6fi2tCnAc,2023-02-19T21:22:45Z,1,the answer is if china and india dont help it ...,0.0
1,uW6fi2tCnAc,2023-02-19T00:43:40Z,2,and that guy is an expert were screwed,1.0
2,uW6fi2tCnAc,2023-02-18T22:57:38Z,4,kennedy is a gem,2.0
3,uW6fi2tCnAc,2023-02-18T22:22:49Z,0,and just how do we get a nation like china to ...,3.0
4,uW6fi2tCnAc,2023-02-18T21:44:49Z,3,that man was going for an oscar,4.0
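
The helpers in `preprocessing_functions` are not shown in this notebook. As a rough illustration only, the four cleaning steps could look like this on a single string. The name `clean_comment` and the regular expressions are assumptions, not the project's actual implementation, and this sketch strips punctuation before collapsing whitespace so that no double spaces remain.

```python
import re
import string

# Hypothetical patterns: the real helpers may define mentions differently.
MENTION_RE = re.compile(r"@\w+")
WHITESPACE_RE = re.compile(r"\s+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_comment(text: str) -> str:
    """Remove @mentions, lowercase, strip ASCII punctuation, collapse whitespace."""
    text = MENTION_RE.sub("", text)          # drop user mentions such as @SomeUser
    text = text.lower()                      # lowercase everything
    text = text.translate(PUNCT_TABLE)       # delete ASCII punctuation characters
    text = WHITESPACE_RE.sub(" ", text).strip()  # collapse runs of whitespace
    return text


print(clean_comment("@SomeUser  And that guy is an expert. We're screwed!"))
# and that guy is an expert were screwed
```

Note that `string.punctuation` only covers ASCII characters, so typographic quotes or emoji would survive this sketch.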


## Preprocessing of the Data

- Stemming and Lemmatizing
- Tokenization of the comments
- Building N-grams

In [5]:
# cast the text column to string and remove apostrophes
processed_df['text'] = processed_df['text'].astype('str')
processed_df['text'] = processed_df['text'].str.replace("'", '')

# use stemming to cut words down to their stems (e.g. "kennedy" becomes "kennedi")
processed_df = stem_words(processed_df, 'text')

# use lemmatization to reduce words to their dictionary form (e.g. "is" becomes "be")
processed_df = lemmatize_words(processed_df, 'text')

# convert date format
processed_df = convert_date_format(processed_df, 'published_at')

In [6]:
# replace NaN values in the lemmatized and stemmed text with empty strings to avoid errors when creating n-grams
processed_df.lemmatized_text = processed_df.lemmatized_text.apply(lambda x: '' if str(x) == 'nan' else x)
processed_df.stemmed_text = processed_df.stemmed_text.apply(lambda x: '' if str(x) == 'nan' else x)

In [7]:
# Tokenizing the lemmatized and stemmed text before creating n-grams

def tokenize_words(text):
    """Split a text into a list of word tokens."""
    return word_tokenize(text)

processed_df["stemmed_tokens"] = processed_df["stemmed_text"].apply(tokenize_words)
processed_df["lemmatized_tokens"] = processed_df["lemmatized_text"].apply(tokenize_words)
processed_df.head()

Unnamed: 0,video_id,published_at,like_count,text,author,stemmed_text,lemmatized_text,stemmed_tokens,lemmatized_tokens
0,uW6fi2tCnAc,2023-02-19,1,the answer is if china and india dont help it ...,0.0,the answer is if china and india dont help it ...,the answer be if china and india dont help it ...,"[the, answer, is, if, china, and, india, dont,...","[the, answer, be, if, china, and, india, dont,..."
1,uW6fi2tCnAc,2023-02-19,2,and that guy is an expert were screwed,1.0,and that guy is an expert were screw,and that guy be an expert be screw,"[and, that, guy, is, an, expert, were, screw]","[and, that, guy, be, an, expert, be, screw]"
2,uW6fi2tCnAc,2023-02-18,4,kennedy is a gem,2.0,kennedi is a gem,kennedy be a gem,"[kennedi, is, a, gem]","[kennedy, be, a, gem]"
3,uW6fi2tCnAc,2023-02-18,0,and just how do we get a nation like china to ...,3.0,and just how do we get a nation like china to ...,and just how do we get a nation like china to ...,"[and, just, how, do, we, get, a, nation, like,...","[and, just, how, do, we, get, a, nation, like,..."
4,uW6fi2tCnAc,2023-02-18,3,that man was going for an oscar,4.0,that man was go for an oscar,that man be go for an oscar,"[that, man, was, go, for, an, oscar]","[that, man, be, go, for, an, oscar]"


In [8]:
# Creating n-grams
tqdm.pandas()  # registers `progress_apply`, which works like `apply` but shows a progress bar (nice to have, not required)

# Function that creates bigrams
def bigrams(doc):  # a doc is a list of tokens/unigrams in the same order as in the comment

    result = []  # empty list to collect the bigrams

    for bigram in nltk.bigrams(doc):  # nltk.bigrams yields the bigrams as tuples
        result.append("_".join(bigram))  # join each tuple with an underscore and save it

    return result

# Function that creates trigrams
def trigrams(doc):  # a doc is a list of tokens/unigrams in the same order as in the comment

    result = []  # empty list to collect the trigrams

    for trigram in nltk.trigrams(doc):  # nltk.trigrams yields the trigrams as tuples
        result.append("_".join(trigram))  # join each tuple with an underscore and save it

    return result

# Function that creates fourgrams
def fourgrams(doc):  # a doc is a list of tokens/unigrams in the same order as in the comment

    result = []  # empty list to collect the fourgrams

    for fourgram in ngrams(doc, 4):  # nltk.util.ngrams with n=4 yields the fourgrams as tuples
        result.append("_".join(fourgram))  # join each tuple with an underscore and save it

    return result

# Creating bigram, trigram and fourgram columns by applying the functions to the token columns
processed_df['bigrams_lemma'] = processed_df['lemmatized_tokens'].progress_apply(bigrams)
processed_df['trigrams_lemma'] = processed_df['lemmatized_tokens'].progress_apply(trigrams)
processed_df['fourgrams_lemma'] = processed_df['lemmatized_tokens'].progress_apply(fourgrams)

processed_df['bigrams_stem'] = processed_df['stemmed_tokens'].progress_apply(bigrams)
processed_df['trigrams_stem'] = processed_df['stemmed_tokens'].progress_apply(trigrams)
processed_df['fourgrams_stem'] = processed_df['stemmed_tokens'].progress_apply(fourgrams)


100%|██████████| 96595/96595 [00:01<00:00, 69386.92it/s] 
100%|██████████| 96595/96595 [00:00<00:00, 106254.24it/s]
100%|██████████| 96595/96595 [00:01<00:00, 85811.63it/s] 
100%|██████████| 96595/96595 [00:00<00:00, 150662.73it/s]
100%|██████████| 96595/96595 [00:00<00:00, 115284.99it/s]
100%|██████████| 96595/96595 [00:01<00:00, 78540.57it/s] 


In [9]:
# creating one column with all n-grams (unigrams, bigrams, trigrams, fourgrams)
processed_df["all_n_grams_lemmatized"] = processed_df["lemmatized_tokens"] + processed_df["bigrams_lemma"] + processed_df["trigrams_lemma"] + processed_df["fourgrams_lemma"]
processed_df["all_n_grams_stemmed"] = processed_df["stemmed_tokens"] + processed_df["bigrams_stem"] + processed_df["trigrams_stem"] + processed_df["fourgrams_stem"]
processed_df.head()

Unnamed: 0,video_id,published_at,like_count,text,author,stemmed_text,lemmatized_text,stemmed_tokens,lemmatized_tokens,bigrams_lemma,trigrams_lemma,fourgrams_lemma,bigrams_stem,trigrams_stem,fourgrams_stem,all_n_grams_lemmatized,all_n_grams_stemmed
0,uW6fi2tCnAc,2023-02-19,1,the answer is if china and india dont help it ...,0.0,the answer is if china and india dont help it ...,the answer be if china and india dont help it ...,"[the, answer, is, if, china, and, india, dont,...","[the, answer, be, if, china, and, india, dont,...","[the_answer, answer_be, be_if, if_china, china...","[the_answer_be, answer_be_if, be_if_china, if_...","[the_answer_be_if, answer_be_if_china, be_if_c...","[the_answer, answer_is, is_if, if_china, china...","[the_answer_is, answer_is_if, is_if_china, if_...","[the_answer_is_if, answer_is_if_china, is_if_c...","[the, answer, be, if, china, and, india, dont,...","[the, answer, is, if, china, and, india, dont,..."
1,uW6fi2tCnAc,2023-02-19,2,and that guy is an expert were screwed,1.0,and that guy is an expert were screw,and that guy be an expert be screw,"[and, that, guy, is, an, expert, were, screw]","[and, that, guy, be, an, expert, be, screw]","[and_that, that_guy, guy_be, be_an, an_expert,...","[and_that_guy, that_guy_be, guy_be_an, be_an_e...","[and_that_guy_be, that_guy_be_an, guy_be_an_ex...","[and_that, that_guy, guy_is, is_an, an_expert,...","[and_that_guy, that_guy_is, guy_is_an, is_an_e...","[and_that_guy_is, that_guy_is_an, guy_is_an_ex...","[and, that, guy, be, an, expert, be, screw, an...","[and, that, guy, is, an, expert, were, screw, ..."
2,uW6fi2tCnAc,2023-02-18,4,kennedy is a gem,2.0,kennedi is a gem,kennedy be a gem,"[kennedi, is, a, gem]","[kennedy, be, a, gem]","[kennedy_be, be_a, a_gem]","[kennedy_be_a, be_a_gem]",[kennedy_be_a_gem],"[kennedi_is, is_a, a_gem]","[kennedi_is_a, is_a_gem]",[kennedi_is_a_gem],"[kennedy, be, a, gem, kennedy_be, be_a, a_gem,...","[kennedi, is, a, gem, kennedi_is, is_a, a_gem,..."
3,uW6fi2tCnAc,2023-02-18,0,and just how do we get a nation like china to ...,3.0,and just how do we get a nation like china to ...,and just how do we get a nation like china to ...,"[and, just, how, do, we, get, a, nation, like,...","[and, just, how, do, we, get, a, nation, like,...","[and_just, just_how, how_do, do_we, we_get, ge...","[and_just_how, just_how_do, how_do_we, do_we_g...","[and_just_how_do, just_how_do_we, how_do_we_ge...","[and_just, just_how, how_do, do_we, we_get, ge...","[and_just_how, just_how_do, how_do_we, do_we_g...","[and_just_how_do, just_how_do_we, how_do_we_ge...","[and, just, how, do, we, get, a, nation, like,...","[and, just, how, do, we, get, a, nation, like,..."
4,uW6fi2tCnAc,2023-02-18,3,that man was going for an oscar,4.0,that man was go for an oscar,that man be go for an oscar,"[that, man, was, go, for, an, oscar]","[that, man, be, go, for, an, oscar]","[that_man, man_be, be_go, go_for, for_an, an_o...","[that_man_be, man_be_go, be_go_for, go_for_an,...","[that_man_be_go, man_be_go_for, be_go_for_an, ...","[that_man, man_was, was_go, go_for, for_an, an...","[that_man_was, man_was_go, was_go_for, go_for_...","[that_man_was_go, man_was_go_for, was_go_for_a...","[that, man, be, go, for, an, oscar, that_man, ...","[that, man, was, go, for, an, oscar, that_man,..."
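
The three n-gram functions and the concatenation step differ only in the value of n, so they could be folded into one generic helper. The following is a minimal pure-Python sketch under that assumption; `join_ngrams` and `all_ngrams` are illustrative names that do not exist in the notebook.

```python
from typing import List


def join_ngrams(tokens: List[str], n: int) -> List[str]:
    """Return all n-grams of `tokens`, each joined with underscores."""
    # Zipping n shifted views of the token list yields consecutive n-tuples.
    return ["_".join(gram) for gram in zip(*(tokens[i:] for i in range(n)))]


def all_ngrams(tokens: List[str], max_n: int = 4) -> List[str]:
    """Concatenate unigrams up to `max_n`-grams, in that order."""
    result: List[str] = []
    for n in range(1, max_n + 1):
        result.extend(join_ngrams(tokens, n))
    return result


tokens = ["kennedy", "be", "a", "gem"]
print(join_ngrams(tokens, 2))  # ['kennedy_be', 'be_a', 'a_gem']
print(join_ngrams(tokens, 5))  # [] because the comment is shorter than n
print(len(all_ngrams(tokens)))  # 10
```

Comments shorter than n simply produce an empty list, which matches the behaviour of the separate functions above.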


In [10]:
# Sanity check: does a comment's n-gram list contain the exact token 'climate'?
def contains_climate(lst):
    return 'climate' in lst

# Applying the function to create a boolean mask
climate_mask = processed_df['all_n_grams_lemmatized'].apply(contains_climate)

# Summing the rows where the mask is True
sum_rows_with_climate = climate_mask.sum()

print(f"Number of rows containing 'climate': {sum_rows_with_climate}")

Number of rows containing 'climate': 14821


## Building Dictionary

This section builds the keyword dictionary for the classifier. The keywords are based on the qualitative research, a Word2Vec model, and topic modelling.

In [11]:
# Creating Dictionary 

sc1_kw = ['no_climate_emergency', 'melting', 'arctic_ice', 'arctic_sea_ice', 'sea_level_rise', 'extreme_weather', 'global_cooling', 'greenland_ice',
          'ice_cap', 'arctic_ice', 'extreme_heat', 'extreme_cold']
# unsure & included: 'melting'
# unsure & not included: 'glacier', 'wildfires', 'climate emergency', 'unproven', 'global warming'

sc2_kw = ['natural_cycle', 'CO2_is_not_the_cause', 'greenhouse_gas', 'no_CO2_Greenhouse_Effect', 'no_effect', 'miniscule_effect', 'Man_has_no_control']
# unsure & not included: 'natural process'

sc3_kw = ['plant_food', 'plant_growth', 'thrive', 'carbon_element_is_essential', 'average_temperature_increase', '1_degree', 
          'more_fossil_fuels', 'no_co2', 'plant_food', 'not_pollution', '0.1C', 'ppm', 'not_a_pollutant']
# unsure & not included: 'beneficial'

sc4_kw = ['green_energy', 'renewable_energy', 'energy_production', 'windmills', 'solar_panel']
# unsure & included: 'renewable energy'

sc5_kw = ['alarmism', 'catastrophist', 'doomsday_cult', 'climate_hysteric', 'unscientific', 'corrupt_politician', 'LIE_ABOUT_EVERYTHING',
          'idiocy', 'lunatics', 'CLIMATE_Worship', 'Climatists', 'alarmists', 'compliant_media', 'climate_hysteria', 'climate_narrative', 'climate_cult',
             'scientism', 'climate_science_myths', 'lying_in_science', 'climate_apocalypse', 'propaganda', 'doomsayers', 'clown_show', 'fake_climate',
               'climate_change_agenda', 'money_made', 'fake_news', 'climate_terrorists']
# unsure & not included: 'scientist', 'global warming scam' (could also be sc7), 'greta', 'john kerry'

sc7_kw = ['globalist', 'globalist_elites', 'elitist', 'global_government', 'one_world', 'one_world_government', 'globalism', 
          'one_world_utopia', 'new_world_order', 'enriching_themselves', 'saving_the_planet', 'control_over_your_lives', 
          'tyranny', 'global_elite', 'wef', 'population_control']
# unsure & included: 'tyranny', 'population control'
# unsure & not included: 'totalitarian'

print(len(sc1_kw))
print(len(sc2_kw))
print(len(sc3_kw))
print(len(sc4_kw))
print(len(sc5_kw))
print(len(sc7_kw))

12
7
13
5
28
16
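
The printed lengths include repeated entries: `'arctic_ice'` appears twice in `sc1_kw` and `'plant_food'` twice in `sc3_kw`. Because the classifier stops at the first matching keyword per category, repeats should not change the labels, but they do inflate the keyword counts. A small sketch of order-preserving deduplication, where `dedupe_keep_order` is a hypothetical helper and the list is a shortened excerpt:

```python
from typing import List


def dedupe_keep_order(keywords: List[str]) -> List[str]:
    """Drop repeated keywords while keeping the first occurrence of each."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(keywords))


excerpt = ["arctic_ice", "ice_cap", "arctic_ice", "extreme_heat"]
print(dedupe_keep_order(excerpt))  # ['arctic_ice', 'ice_cap', 'extreme_heat']
```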


In [12]:
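# Lowercase and stem/lemmatize the keyword lists so that they match the processed comments
# Note: stem_words and lemmatize_words below take a list of keywords and shadow the
# DataFrame versions imported from preprocessing_functions; re-run the import before re-running In [5]

# Function to replace underscores with spaces and convert to lowercase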
def preprocess_keywords(keywords):
    return [keyword.replace('_', ' ').lower() for keyword in keywords]

# Stem words
def stem_words(words):
    stemmer = SnowballStemmer(language='english')
    return [" ".join([stemmer.stem(word) for word in word_tokenize(keyword)]) for keyword in words]

# Lemmatize words
def lemmatize_words(words):
    lemmatizer = WordNetLemmatizer()
    
    # Mapping NLTK POS tags to WordNet POS tags
    tag_map = {
        'J': wordnet.ADJ,
        'V': wordnet.VERB,
        'R': wordnet.ADV,
        'N': wordnet.NOUN
    }
    
    lemmatized_keywords = []
    for keyword in words:
        tokens = word_tokenize(keyword)
        pos_tags = pos_tag(tokens)
        lemmatized_tokens = [lemmatizer.lemmatize(word, tag_map.get(tag[0], wordnet.NOUN)) for word, tag in pos_tags]
        lemmatized_keywords.append(" ".join(lemmatized_tokens))
    
    return lemmatized_keywords


# Replace the spaces with underscores again so the keywords match the n-gram format
def postprocess_keywords(keywords):
    return [keyword.replace(' ', '_') for keyword in keywords]

# Apply the preprocessing, stemming, and lemmatization
sc1_kw_lemmatized = postprocess_keywords(lemmatize_words(preprocess_keywords(sc1_kw)))
sc2_kw_lemmatized = postprocess_keywords(lemmatize_words(preprocess_keywords(sc2_kw)))
sc3_kw_lemmatized = postprocess_keywords(lemmatize_words(preprocess_keywords(sc3_kw)))
sc4_kw_lemmatized = postprocess_keywords(lemmatize_words(preprocess_keywords(sc4_kw)))
sc5_kw_lemmatized = postprocess_keywords(lemmatize_words(preprocess_keywords(sc5_kw)))
sc7_kw_lemmatized = postprocess_keywords(lemmatize_words(preprocess_keywords(sc7_kw)))

sc1_kw_stemmed = postprocess_keywords(stem_words(preprocess_keywords(sc1_kw)))
sc2_kw_stemmed = postprocess_keywords(stem_words(preprocess_keywords(sc2_kw)))
sc3_kw_stemmed = postprocess_keywords(stem_words(preprocess_keywords(sc3_kw)))
sc4_kw_stemmed = postprocess_keywords(stem_words(preprocess_keywords(sc4_kw)))
sc5_kw_stemmed = postprocess_keywords(stem_words(preprocess_keywords(sc5_kw)))
sc7_kw_stemmed = postprocess_keywords(stem_words(preprocess_keywords(sc7_kw)))

# print lemmatized dictionary
print("sc1_kw_lemmatized:", sc1_kw_lemmatized)
print("sc2_kw_lemmatized:", sc2_kw_lemmatized)
print("sc3_kw_lemmatized:", sc3_kw_lemmatized)
print("sc4_kw_lemmatized:", sc4_kw_lemmatized)
print("sc5_kw_lemmatized:", sc5_kw_lemmatized)
print("sc7_kw_lemmatized:", sc7_kw_lemmatized)

# print stemmed dictionary
print("sc1_kw_stemmed:", sc1_kw_stemmed)
print("sc2_kw_stemmed:", sc2_kw_stemmed)
print("sc3_kw_stemmed:", sc3_kw_stemmed)
print("sc4_kw_stemmed:", sc4_kw_stemmed)
print("sc5_kw_stemmed:", sc5_kw_stemmed)
print("sc7_kw_stemmed:", sc7_kw_stemmed)

sc1_kw_lemmatized: ['no_climate_emergency', 'melt', 'arctic_ice', 'arctic_sea_ice', 'sea_level_rise', 'extreme_weather', 'global_cooling', 'greenland_ice', 'ice_cap', 'arctic_ice', 'extreme_heat', 'extreme_cold']
sc2_kw_lemmatized: ['natural_cycle', 'co2_be_not_the_cause', 'greenhouse_gas', 'no_co2_greenhouse_effect', 'no_effect', 'miniscule_effect', 'man_have_no_control']
sc3_kw_lemmatized: ['plant_food', 'plant_growth', 'thrive', 'carbon_element_be_essential', 'average_temperature_increase', '1_degree', 'more_fossil_fuel', 'no_co2', 'plant_food', 'not_pollution', '0.1c', 'ppm', 'not_a_pollutant']
sc4_kw_lemmatized: ['green_energy', 'renewable_energy', 'energy_production', 'windmill', 'solar_panel']
sc5_kw_lemmatized: ['alarmism', 'catastrophist', 'doomsday_cult', 'climate_hysteric', 'unscientific', 'corrupt_politician', 'lie_about_everything', 'idiocy', 'lunatic', 'climate_worship', 'climatists', 'alarmist', 'compliant_medium', 'climate_hysteria', 'climate_narrative', 'climate_cult',
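
The twelve near-identical assignment lines follow one pattern: split a keyword on underscores, lowercase it, transform each word, and join the words with underscores again. A compact sketch of that round trip is shown below. `normalize_keywords` is an illustrative name, and `toy_stem` is a deliberately crude stand-in for the Snowball stemmer; unlike the notebook's lemmatizer it works word by word without POS tagging.

```python
from typing import Callable, Dict, List


def normalize_keywords(keywords: List[str], transform: Callable[[str], str]) -> List[str]:
    """Lowercase each keyword, transform every word, and re-join with underscores."""
    normalized = []
    for keyword in keywords:
        # Underscores and stray spaces are both treated as word separators.
        words = keyword.replace("_", " ").lower().split()
        normalized.append("_".join(transform(word) for word in words))
    return normalized


def toy_stem(word: str) -> str:
    """Hypothetical stand-in for a real stemmer: strips one trailing 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


raw: Dict[str, List[str]] = {
    "sc4": ["green_energy", "windmills", "solar_panel"],
    "sc5": ["LIE_ABOUT_EVERYTHING", "alarmists"],
}
# One dictionary comprehension replaces one assignment line per category.
normalized = {name: normalize_keywords(kws, toy_stem) for name, kws in raw.items()}
print(normalized["sc4"])  # ['green_energy', 'windmill', 'solar_panel']
print(normalized["sc5"])  # ['lie_about_everything', 'alarmist']
```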

In [13]:
# Creating the dictionaries for the classifier (note: category key 6 holds the sc7 keyword list)
keyword_dict_lemmatized = {
    1: sc1_kw_lemmatized,
    2: sc2_kw_lemmatized,
    3: sc3_kw_lemmatized,
    4: sc4_kw_lemmatized,
    5: sc5_kw_lemmatized,
    6: sc7_kw_lemmatized
}


keyword_dict_stemmed = {
    1: sc1_kw_stemmed,
    2: sc2_kw_stemmed,
    3: sc3_kw_stemmed,
    4: sc4_kw_stemmed,
    5: sc5_kw_stemmed,
    6: sc7_kw_stemmed
}

keyword_dict_lemmatized_valid = {
    1: ['climate'],
    2: ['climate_change']
}

#print keyword dictionaries
print(keyword_dict_lemmatized)
print(keyword_dict_stemmed)

{1: ['no_climate_emergency', 'melt', 'arctic_ice', 'arctic_sea_ice', 'sea_level_rise', 'extreme_weather', 'global_cooling', 'greenland_ice', 'ice_cap', 'arctic_ice', 'extreme_heat', 'extreme_cold'], 2: ['natural_cycle', 'co2_be_not_the_cause', 'greenhouse_gas', 'no_co2_greenhouse_effect', 'no_effect', 'miniscule_effect', 'man_have_no_control'], 3: ['plant_food', 'plant_growth', 'thrive', 'carbon_element_be_essential', 'average_temperature_increase', '1_degree', 'more_fossil_fuel', 'no_co2', 'plant_food', 'not_pollution', '0.1c', 'ppm', 'not_a_pollutant'], 4: ['green_energy', 'renewable_energy', 'energy_production', 'windmill', 'solar_panel'], 5: ['alarmism', 'catastrophist', 'doomsday_cult', 'climate_hysteric', 'unscientific', 'corrupt_politician', 'lie_about_everything', 'idiocy', 'lunatic', 'climate_worship', 'climatists', 'alarmist', 'compliant_medium', 'climate_hysteria', 'climate_narrative', 'climate_cult', 'scientism', 'climate_science_myth', 'lie_in_science', 'climate_apocalypse

## Classifier

Each YouTube comment is assigned to every claim category whose keywords appear in its text. Comments without any match are labelled `uncategorized`.

In [14]:
# Classifying the comments into categories
def classify_comments(comments, keyword_dict):
    classifications = []  # one list of categories per comment
    
    for comment in comments:  # loop through the n-gram list of each comment
        categories = []  # categories matched by this comment
        comment_str = ",".join(comment)  # join the n-grams of one comment into a single comma-separated string
        
        for category, keywords in keyword_dict.items():  # iterate through each category and its keyword list
            for keyword in keywords:
                # substring check: a keyword also matches when it occurs inside a longer token or n-gram
                if keyword in comment_str:
                    categories.append(category)  # a keyword was found, so record the category
                    break  # stop checking further keywords for this category
        
        if not categories:
            categories = ['uncategorized']
        
        classifications.append(categories)
    
    return classifications

# Apply classifier to lemmatized comments 
processed_df['category_lemmatized_comments'] = classify_comments(processed_df['all_n_grams_lemmatized'], keyword_dict_lemmatized)
processed_df.head()


Unnamed: 0,video_id,published_at,like_count,text,author,stemmed_text,lemmatized_text,stemmed_tokens,lemmatized_tokens,bigrams_lemma,trigrams_lemma,fourgrams_lemma,bigrams_stem,trigrams_stem,fourgrams_stem,all_n_grams_lemmatized,all_n_grams_stemmed,category_lemmatized_comments
0,uW6fi2tCnAc,2023-02-19,1,the answer is if china and india dont help it ...,0.0,the answer is if china and india dont help it ...,the answer be if china and india dont help it ...,"[the, answer, is, if, china, and, india, dont,...","[the, answer, be, if, china, and, india, dont,...","[the_answer, answer_be, be_if, if_china, china...","[the_answer_be, answer_be_if, be_if_china, if_...","[the_answer_be_if, answer_be_if_china, be_if_c...","[the_answer, answer_is, is_if, if_china, china...","[the_answer_is, answer_is_if, is_if_china, if_...","[the_answer_is_if, answer_is_if_china, is_if_c...","[the, answer, be, if, china, and, india, dont,...","[the, answer, is, if, china, and, india, dont,...",[uncategorized]
1,uW6fi2tCnAc,2023-02-19,2,and that guy is an expert were screwed,1.0,and that guy is an expert were screw,and that guy be an expert be screw,"[and, that, guy, is, an, expert, were, screw]","[and, that, guy, be, an, expert, be, screw]","[and_that, that_guy, guy_be, be_an, an_expert,...","[and_that_guy, that_guy_be, guy_be_an, be_an_e...","[and_that_guy_be, that_guy_be_an, guy_be_an_ex...","[and_that, that_guy, guy_is, is_an, an_expert,...","[and_that_guy, that_guy_is, guy_is_an, is_an_e...","[and_that_guy_is, that_guy_is_an, guy_is_an_ex...","[and, that, guy, be, an, expert, be, screw, an...","[and, that, guy, is, an, expert, were, screw, ...",[uncategorized]
2,uW6fi2tCnAc,2023-02-18,4,kennedy is a gem,2.0,kennedi is a gem,kennedy be a gem,"[kennedi, is, a, gem]","[kennedy, be, a, gem]","[kennedy_be, be_a, a_gem]","[kennedy_be_a, be_a_gem]",[kennedy_be_a_gem],"[kennedi_is, is_a, a_gem]","[kennedi_is_a, is_a_gem]",[kennedi_is_a_gem],"[kennedy, be, a, gem, kennedy_be, be_a, a_gem,...","[kennedi, is, a, gem, kennedi_is, is_a, a_gem,...",[uncategorized]
3,uW6fi2tCnAc,2023-02-18,0,and just how do we get a nation like china to ...,3.0,and just how do we get a nation like china to ...,and just how do we get a nation like china to ...,"[and, just, how, do, we, get, a, nation, like,...","[and, just, how, do, we, get, a, nation, like,...","[and_just, just_how, how_do, do_we, we_get, ge...","[and_just_how, just_how_do, how_do_we, do_we_g...","[and_just_how_do, just_how_do_we, how_do_we_ge...","[and_just, just_how, how_do, do_we, we_get, ge...","[and_just_how, just_how_do, how_do_we, do_we_g...","[and_just_how_do, just_how_do_we, how_do_we_ge...","[and, just, how, do, we, get, a, nation, like,...","[and, just, how, do, we, get, a, nation, like,...",[6]
4,uW6fi2tCnAc,2023-02-18,3,that man was going for an oscar,4.0,that man was go for an oscar,that man be go for an oscar,"[that, man, was, go, for, an, oscar]","[that, man, be, go, for, an, oscar]","[that_man, man_be, be_go, go_for, for_an, an_o...","[that_man_be, man_be_go, be_go_for, go_for_an,...","[that_man_be_go, man_be_go_for, be_go_for_an, ...","[that_man, man_was, was_go, go_for, for_an, an...","[that_man_was, man_was_go, was_go_for, go_for_...","[that_man_was_go, man_was_go_for, was_go_for_a...","[that, man, be, go, for, an, oscar, that_man, ...","[that, man, was, go, for, an, oscar, that_man,...",[uncategorized]
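
Because `classify_comments` tests `keyword in comment_str` on the joined string, a short keyword can match inside a longer token. The sketch below contrasts that behaviour with exact n-gram membership on a toy example. Both function names and the toy data are assumptions for illustration; whether exact matching is preferable depends on how much recall the substring behaviour is meant to provide.

```python
from typing import Dict, List, Union

Label = Union[int, str]


def classify_substring(ngrams: List[str], keyword_dict: Dict[int, List[str]]) -> List[Label]:
    """Mimic the notebook's logic: substring search in the comma-joined n-grams."""
    joined = ",".join(ngrams)
    categories = [cat for cat, kws in keyword_dict.items() if any(kw in joined for kw in kws)]
    return categories or ["uncategorized"]


def classify_exact(ngrams: List[str], keyword_dict: Dict[int, List[str]]) -> List[Label]:
    """Alternative: a keyword only counts if it equals one of the n-grams."""
    present = set(ngrams)
    categories = [cat for cat, kws in keyword_dict.items() if present.intersection(kws)]
    return categories or ["uncategorized"]


toy_dict = {1: ["melt"], 6: ["one_world"]}
comment = ["they", "smelt", "iron", "they_smelt", "smelt_iron"]

print(classify_substring(comment, toy_dict))  # [1] because 'melt' occurs inside 'smelt'
print(classify_exact(comment, toy_dict))      # ['uncategorized']
```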


In [15]:
# Category counts for the lemmatized n-grams
processed_df['category_lemmatized_comments'].value_counts()

category_lemmatized_comments
[uncategorized]    88626
[5]                 2957
[6]                 1891
[1]                 1083
[4]                  580
[3]                  571
[2]                  296
[5, 6]               141
[1, 5]                72
[3, 5]                51
[1, 3]                42
[2, 3]                38
[3, 6]                30
[4, 5]                28
[1, 6]                28
[1, 2]                25
[2, 5]                18
[1, 4]                18
[4, 6]                16
[3, 4]                11
[2, 4]                10
[1, 3, 5]              9
[1, 2, 3, 5]           7
[2, 6]                 7
[1, 5, 6]              5
[1, 2, 3]              5
[2, 3, 5]              5
[1, 2, 5]              4
[2, 3, 5, 6]           4
[1, 3, 6]              3
[1, 3, 4, 6]           3
[1, 2, 6]              2
[2, 5, 6]              1
[1, 2, 4]              1
[3, 5, 6]              1
[1, 2, 3, 4, 5]        1
[2, 3, 4]              1
[1, 3, 4, 5]           1
[1, 2, 3, 5, 6]      
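
`value_counts()` on a column of lists counts each combination of categories separately, so the total for a single category is spread over many rows. A minimal sketch of tallying per category instead, using only the standard library; `count_per_category` and the toy labels are hypothetical and not part of the notebook.

```python
from collections import Counter
from typing import Iterable, List


def count_per_category(labels: Iterable[List]) -> Counter:
    """Count how many comments carry each category, across all combinations."""
    counts: Counter = Counter()
    for categories in labels:
        counts.update(categories)  # every category in the list gets one count
    return counts


toy_labels = [["uncategorized"], [5], [5, 6], [1, 5], [6]]
counts = count_per_category(toy_labels)
print(counts[5])  # 3
print(counts[6])  # 2
```

With the real data, the same function could be applied to `processed_df['category_lemmatized_comments']`.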

In [16]:
# Category counts for the stemmed n-grams
processed_df['category_stemmed_comments'] = classify_comments(processed_df['all_n_grams_stemmed'], keyword_dict_stemmed)
print(processed_df['category_stemmed_comments'].value_counts())

category_stemmed_comments
[uncategorized]       86049
[6]                    4312
[5]                    2913
[1]                     807
[4]                     540
[3]                     522
[5, 6]                  339
[1, 6]                  282
[2]                     226
[3, 6]                   82
[2, 6]                   79
[1, 5]                   58
[4, 6]                   51
[3, 5]                   43
[1, 5, 6]                32
[4, 5]                   23
[1, 3]                   23
[1, 3, 6]                22
[2, 3, 6]                22
[2, 3]                   17
[1, 2]                   17
[2, 5]                   16
[3, 5, 6]                13
[1, 4]                   12
[3, 4]                   11
[1, 2, 6]                 8
[2, 4]                    8
[1, 3, 5, 6]              7
[2, 3, 5, 6]              6
[4, 5, 6]                 6
[1, 2, 3, 5]              5
[1, 2, 5, 6]              5
[2, 5, 6]                 5
[1, 2, 3]                 4
[1, 3, 4, 6]          

In [17]:
# Validation run with the keywords 'climate' and 'climate_change'
processed_df['category_lemmatized_comments_validated'] = classify_comments(processed_df['all_n_grams_lemmatized'], keyword_dict_lemmatized_valid)
print(processed_df['category_lemmatized_comments_validated'].value_counts())

category_lemmatized_comments_validated
[uncategorized]    81518
[1]                 7795
[1, 2]              7282
Name: count, dtype: int64
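
The validation counts can be cross-checked against the earlier exact-membership count of 14821 rows containing `'climate'`. The arithmetic below uses only numbers printed in this notebook. The interpretation in the comments is an assumption: the gap is consistent with the substring matching in `classify_comments`, which also fires when `climate` is part of a longer token.

```python
total = 96595               # rows reported by the progress bars
uncategorized = 81518       # label [uncategorized]
only_climate = 7795         # label [1]
climate_and_change = 7282   # label [1, 2]
exact_climate = 14821       # rows where 'climate' is an exact list element

# The three validation labels account for every row.
assert uncategorized + only_climate + climate_and_change == total

substring_climate = only_climate + climate_and_change
print(substring_climate)                   # 15077
print(substring_climate - exact_climate)   # 256 rows matched only as a substring
print(f"{substring_climate / total:.1%}")  # 15.6%
```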
