# Focused Coding
___

## Table of Contents

1. [Libraries](#libraries)
2. [Preprocessing Data and Grouping](#preprocessing-data-and-grouping)
3. [Data Preprocessing](#preprocessing-of-the-data)
4. [Set Up of Dictionary](#building-dictionary)
5. [Classifier](#classifier)
____

## Libraries

All libraries needed to run the code are imported below. Install them from the `requirements.txt` file.

The documentation can be found in the [README.md](README.md) file.

In [127]:
# import packages
import pandas as pd 
import os
import numpy as np
from tqdm import tqdm
from nltk.tokenize import TweetTokenizer
import nltk
from preprocessing_functions import *

## Preprocessing Data and Grouping

In [130]:
# load data
df = pd.read_csv('data/comments_final.csv')
df.head(3)

Unnamed: 0,video_id,published_at,like_count,text,author
0,uW6fi2tCnAc,2023-02-19T21:22:45Z,1,"The answer is if China and India don't help, it won't matter how much money the rest of the world throws at reducing carbon footprint = complete waste of 50 TRILLION DOLLARS 🤦‍♀️",0.0
1,uW6fi2tCnAc,2023-02-19T00:43:40Z,2,"and that guy is an expert, we're screwed",1.0
2,uW6fi2tCnAc,2023-02-18T22:57:38Z,4,Kennedy is a gem.,2.0


In [136]:
# group data by author and see distribution of comments 
df['author'] = pd.to_numeric(df['author'], errors='coerce').astype('Int64')
summary = df.groupby('author').agg(
    count=('author', 'size'),
    unique_video_id_count=('video_id', 'nunique')
).reset_index()
summary.sort_values(by='count', ascending=False, inplace=True)

# print top 20 authors
latex_table = summary.head(20).to_latex(index=False)
print(latex_table)

\begin{tabular}{rrr}
\toprule
author & count & unique_video_id_count \\
\midrule
17605 & 206 & 206 \\
3743 & 188 & 124 \\
18119 & 166 & 161 \\
17604 & 155 & 138 \\
18380 & 154 & 154 \\
17977 & 142 & 138 \\
17906 & 139 & 115 \\
17676 & 135 & 103 \\
17755 & 134 & 134 \\
14017 & 129 & 126 \\
14057 & 128 & 123 \\
2610 & 125 & 125 \\
25732 & 116 & 88 \\
1645 & 107 & 62 \\
17581 & 106 & 104 \\
6165 & 103 & 52 \\
6 & 99 & 98 \\
1106 & 97 & 75 \\
17608 & 92 & 80 \\
17590 & 87 & 79 \\
\bottomrule
\end{tabular}



In [123]:
# keep the original text so capitalization, punctuation etc. can be compared later
extracted_col = df["text"]

# clean the text with the helper functions from preprocessing_functions.py
processed_df = (
    df.pipe(remove_users, 'text')
      .pipe(lowercase_text, 'text')
      .pipe(remove_whitespace, 'text')
      .pipe(remove_punctuation, 'text')
)
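
The helpers in `preprocessing_functions.py` are not shown in this notebook. As a rough illustration of what such a cleaning chain does, here is a minimal sketch on plain strings. The function names (`strip_mentions`, `to_lowercase`, `strip_punctuation`, `collapse_whitespace`, `clean`) are hypothetical stand-ins and not the project's actual implementations; the sketch also collapses whitespace last, which is an assumption and differs from the order in the cell above.

```python
import re
import string

# Hypothetical stand-ins for the project's helpers; assumptions, not the real code.

def strip_mentions(text):
    """Drop @mentions such as '@someuser'."""
    return re.sub(r"@\w+", "", text)

def to_lowercase(text):
    """Lowercase the whole comment."""
    return text.lower()

def strip_punctuation(text):
    """Remove ASCII punctuation; emoji and other non-ASCII symbols are kept."""
    return text.translate(str.maketrans("", "", string.punctuation))

def collapse_whitespace(text):
    """Replace runs of whitespace with one space and trim both ends."""
    return " ".join(text.split())

def clean(text):
    """Apply the four steps in order."""
    for step in (strip_mentions, to_lowercase, strip_punctuation, collapse_whitespace):
        text = step(text)
    return text

print(clean("and that guy is an expert, we're screwed"))
```

Each step could be wrapped in a `(df, column)` function so that it chains with `DataFrame.pipe` in the same way as the cell above.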

In [126]:
# add the original (unprocessed) text to the processed DataFrame as `og_text`
processed_df = pd.concat([processed_df, extracted_col.rename("og_text")], axis=1)
processed_df.head(3)

Unnamed: 0,video_id,published_at,like_count,text,author,og_text,og_text.1
0,uW6fi2tCnAc,2023-02-19T21:22:45Z,1,the answer is if china and india dont help it wont matter how much money the rest of the world throws at reducing carbon footprint complete waste of 50 trillion dollars 🤦‍♀️,0.0,"The answer is if China and India don't help, it won't matter how much money the rest of the world throws at reducing carbon footprint = complete waste of 50 TRILLION DOLLARS 🤦‍♀️","The answer is if China and India don't help, it won't matter how much money the rest of the world throws at reducing carbon footprint = complete waste of 50 TRILLION DOLLARS 🤦‍♀️"
1,uW6fi2tCnAc,2023-02-19T00:43:40Z,2,and that guy is an expert were screwed,1.0,"and that guy is an expert, we're screwed","and that guy is an expert, we're screwed"
2,uW6fi2tCnAc,2023-02-18T22:57:38Z,4,kennedy is a gem,2.0,Kennedy is a gem.,Kennedy is a gem.


In [134]:
# use lemmatization to reduce words to their root form
processed_df['text'] = processed_df['text'].astype('str')
processed_df = lemmatize_words(processed_df, 'text')

In [16]:
# missing comments became the string 'nan' after the str cast; replace them with ''
processed_df['lemmatized_text'] = processed_df['lemmatized_text'].apply(lambda x: '' if str(x) == 'nan' else x)

## Focused Coding -- Sample helper

This helper finds comments that contain important keywords. The keywords themselves come from earlier steps: exploring the comments by hand, topic modeling, or Word2Vec.

How does it work?

- Input: one or more lemmatized keywords of interest, listed in `substrings` or `string`.
- Output: the comments that mention those keywords, giving a varied sample of comments on that topic or word.
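
A pattern built with a plain `'|'.join(...)` also matches a keyword inside longer words and treats any regex metacharacters in a keyword as regex syntax. A minimal sketch of a stricter variant, assuming whole-word matching is wanted; `build_keyword_pattern` and the sample comments are hypothetical and not part of the project:

```python
import re

def build_keyword_pattern(keywords):
    """Join keywords into one regex that matches whole words or phrases only."""
    escaped = [re.escape(k) for k in keywords]  # treat each keyword literally
    return r"\b(?:" + "|".join(escaped) + r")\b"

keywords = ["wef", "natural cycle"]
pattern = build_keyword_pattern(keywords)

# Made-up lemmatized comments, for illustration only.
comments = [
    "the WEF be a clue",
    "it be a natural cycle",
    "wefunder be a startup site",  # contains 'wef' only as part of a longer word
]

# Case-insensitive search, keeping the comments that mention a keyword.
hits = [c for c in comments if re.search(pattern, c, flags=re.IGNORECASE)]
print(hits)
```

The returned pattern is an ordinary regex string, so it could be passed to `str.contains` in place of the joined keyword list.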

In [63]:
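# candidate keyword lists; only the list joined into `pattern` is searched
substrings = ['god plan', 'greenhouse gas', 'natural cycle', 'hoax']
string = ['wef']
# combine the keywords in `string` into one regex alternation
pattern = '|'.join(string)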

In [64]:
# keep the comments whose lemmatized text matches the pattern (case-insensitive)
filtered_df = processed_df[processed_df['lemmatized_text'].str.contains(pattern, case=False, na=False)]
pd.set_option('display.max_colwidth', None)
print(len(filtered_df))
filtered_df.sample(n=10)

822


Unnamed: 0,video_id,published_at,like_count,text,author,og_text,og_text.1,lemmatized_text
90961,ThTkXT06UiM,2024-01-19T04:02:32Z,1,the narrative is indoctrination based on mainstream media and weird science the wef who are globalism the oceans are rich with volcanic activity as well water warming has little to do with cow farts and carbon units war and the machines of war seem to be acceptable though peel back the economic foreskin and expose the 12 inch of reality 🌈🐛🌪,@williamrome2257,the narrative is indoctrination based on mainstream media and weird science the wef who are globalism the oceans are rich with volcanic activity as well water warming has little to do with cow farts and carbon units war and the machines of war seem to be acceptable though peel back the economic foreskin and expose the 12 inch of reality 🌈🐛🌪,"The narrative is indoctrination. Based on mainstream media and weird science . The WEF WHO, are Globalism . The Oceans are rich with volcanic activity as well. Water warming. has little to do with cow farts and carbon units. War and the machines of war seem to be acceptable though. Peel back the Economic Foreskin and expose the 1/2 inch of reality. 🌈🐛🌪",the narrative be indoctrination base on mainstream medium and weird science the wef who be globalism the ocean be rich with volcanic activity as well water warm have little to do with cow fart and carbon unit war and the machine of war seem to be acceptable though peel back the economic foreskin and expose the 12 inch of reality 🌈🐛🌪
2609,ry-bRYhN1Xs,2023-02-02T18:00:56Z,0,cut the budget starting here hanging witj the wef is a clue these people are pure evil,@chuckbabbs2726,cut the budget starting here hanging witj the wef is a clue these people are pure evil,"Cut the budget starting here, hanging witj the WEF is a clue these people are pure evil!",cut the budget starting here hang witj the wef be a clue these people be pure evil
21241,reaABJ5HpLk,2023-01-16T20:54:20Z,0,hey doc in a weird turn it seems to me that organizations like the wef the un and others have taken the malthuthian theory and twisted it old enough to remember the 70s and the miniice age then al gore peace be upon him and how millions of people would perish by 2016 stupid,@mikedawson1376,hey doc in a weird turn it seems to me that organizations like the wef the un and others have taken the malthuthian theory and twisted it old enough to remember the 70s and the miniice age then al gore peace be upon him and how millions of people would perish by 2016 stupid,"Hey doc. In a weird turn, it seems to me that organizations like the WEF, the UN and others have taken the Malthuthian theory and twisted it. Old enough to remember the '70s and the mini-ice age, then Al Gore (peace be upon him) and how millions of people would perish by 2016. Stupid.",hey doc in a weird turn it seem to me that organization like the wef the un and others have take the malthuthian theory and twist it old enough to remember the 70 and the miniice age then al gore peace be upon him and how million of people would perish by 2016 stupid
